<a href="https://colab.research.google.com/github/JakeNauman/Twitter_Sentiment_Analysis/blob/main/Twitter_Webscrape_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Twitter live webscraping sentiment analysis project**

Introduction      
In this project, I ran sentiment analysis classification on tweet text, just like we did in class. However, I took it one step further and used the Twitter developer API to return the most recent tweets on a user-given topic. Working with real data makes the program much more useful and practical. By combining this with what we learned in class, I can estimate the public sentiment on any subject at a given moment by running sentiment analysis on the 100 or so most recent tweets that mention it. For example, if the user inputs 'Netflix', the program searches 100 recent tweets mentioning Netflix and calculates whether the subject is generally viewed positively or negatively. At the moment, it will most likely show that Netflix is viewed negatively.

To get access to the Twitter API you have to apply to be a developer and get an access token, but as I don't plan to use Twitter developer tools any further, I'll just leave my token in so the code will run. Also, it may return inappropriate results because it is using live tweets, and Twitter is a gross place, so be warned. At the moment, Twitter only lets you obtain up to 100 of the most recent tweets per request to avoid strain on its servers, but that is more than enough for my needs.

The code below calls the Twitter API to fetch tweet data. The function takes the search topic, the tweet fields to return, and the bearer token that authorizes access. It then forms the request URL and returns the response as parsed JSON.

In [None]:
import requests
import json

#This is my Twitter developer token. Replace it with your own; it's easy to get one yourself.
bearer_token = "AAAAAAAAAAAAAAAAAAAAAPj6cgEAAAAA0OnV14VJfxRQzvrUaERxWp8DYMg%3Dc4FnO8mYBEMPDrhiq4Zco4zhTFxdLJy9kfLGNmoGFUtbBDS8yI"

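#define search twitter function
def search_twitter(query, tweet_fields, bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}

    #one placeholder each for the query, the result limit, and the tweet fields;
    #the query is URL-encoded so topics with spaces (e.g. 'Joe Biden') are sent correctly
    url = "https://api.twitter.com/2/tweets/search/recent?query={}&{}&{}".format(
        requests.utils.quote(query), 'max_results=100', tweet_fields
    )
    response = requests.request("GET", url, headers=headers)

    #anything other than 200 OK means the request failed
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()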
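
As a side note on how the request URL is put together, here is a minimal sketch that builds the same kind of recent-search URL with the standard library's `urlencode`, which escapes spaces and other special characters in the search topic. The helper name `build_search_url` and its default arguments are hypothetical and not part of the notebook; it only builds the string and never contacts Twitter.

```python
from urllib.parse import urlencode

# Hypothetical helper (not in the notebook): builds the search URL only.
def build_search_url(query, tweet_fields="text,author_id,created_at", max_results=100):
    base = "https://api.twitter.com/2/tweets/search/recent"
    params = {
        "query": query,              # the user's search topic
        "max_results": max_results,  # how many tweets to ask for
        "tweet.fields": tweet_fields,
    }
    # urlencode escapes each value, e.g. a space becomes '+'
    return base + "?" + urlencode(params)

print(build_search_url("Joe Biden"))
```

Passing the parameters as a dictionary keeps each one separate, so none of them can be dropped by a mismatched format string.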






The code below is the sentiment analysis algorithm we did in class. It trains a classifier to associate key words with positive or negative sentiment so that it can decipher the sentiment of whole sentences. It uses the NLTK library to obtain sample tweets labeled as positive or negative, which it trains on.

In [None]:
print("Sentiment Analysis: ")

!pip install -q wordcloud
import wordcloud
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('twitter_samples')
#import pandas as pd
#import matplotlib.pyplot as plt
#import io
#import unicodedata
#import numpy as np
#import re
#import string

#___________________________________________________________________________________________________________________
from nltk.corpus import twitter_samples as tw

# Tweets with positive sentiment ---> 5000
pos_tw = tw.strings('positive_tweets.json')
# Tweets with negative sentiment ---> 5000
neg_tw = tw.strings('negative_tweets.json')
# Tweets with no polarity ---> 20000
txt = tw.strings('tweets.20150430-223406.json')

pos_token = tw.tokenized('positive_tweets.json')
neg_token = tw.tokenized('negative_tweets.json')
#print(pos_tw)

#___________________________________________________________________________________________________________________

#normalize data
from nltk.tag import pos_tag
print(pos_tag(pos_token[0]))

#___________________________________________________________________________________________________________________

#word lemmatizer to normalize tweet data
from nltk.stem.wordnet import WordNetLemmatizer
def lemmatize_sentence(tokens):
  lem = WordNetLemmatizer()
  lem_sent = []
  for word, tag in pos_tag(tokens):
    if tag.startswith('NN'):
      pos = 'n'
    elif tag.startswith('VB'):
      pos = 'v'
    else:
      pos = 'a'
    lem_sent.append(lem.lemmatize(word, pos))
  return lem_sent
print(lemmatize_sentence(pos_token[0]))

#___________________________________________________________________________________________________________________

#remove special characters with regular expression
import re, string
def noise_removal(twt_tokens, stop_words = ()):
  cleaned_data = []
  for token, tag in pos_tag(twt_tokens):
    token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
    r'(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
    token = re.sub("(@[A-Za-z0-9_]+)","", token)
    if tag.startswith("NN"):
      pos = 'n'
    elif tag.startswith('VB'):
      pos = 'v'
    else:
      pos = 'a'
    lem = WordNetLemmatizer()
    token = lem.lemmatize(token, pos)
    if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
      cleaned_data.append(token.lower())
  return cleaned_data
from nltk.corpus import stopwords
sw = stopwords.words('english')

#___________________________________________________________________________________________________________________

#remove noise from pos and neg tweets
pos_cleaned_data = []
neg_cleaned_data = []

for token in pos_token:
  pos_cleaned_data.append(noise_removal(token,sw))
for token in neg_token:
  neg_cleaned_data.append(noise_removal(token,sw))

#___________________________________________________________________________________________________________________

#determine word density in data
def make_word_list(cleaned_list):
  for words in cleaned_list:
    for word in words:
      yield word
pos_words = make_word_list(pos_cleaned_data)
neg_words = make_word_list(neg_cleaned_data)

#___________________________________________________________________________________________________________________

#find most common words using frequency distribution
from nltk import FreqDist
pos_word_freq=FreqDist(pos_words)
neg_word_freq=FreqDist(neg_words)
#___________________________________________________________________________________________________________________

#prep data for analysing module

def prepare_data_for_model(data_set):
  for tokens in data_set:
    yield dict([token,True] for token in tokens)
#___________________________________________________________________________________________________________________

#prep both neg and pos tagged words separately
pos_token_model = prepare_data_for_model(pos_cleaned_data)
neg_token_model = prepare_data_for_model(neg_cleaned_data)

#___________________________________________________________________________________________________________________

#randomize the dataset
import random
pos_dataset = [(twt_dict,'positive') for twt_dict in pos_token_model]
neg_dataset = [(twt_dict,'negative') for twt_dict in neg_token_model]

dataset = pos_dataset + neg_dataset
random.shuffle(dataset)
train_data = dataset[:7000]
test_data = dataset[7000:]

#___________________________________________________________________________________________________________________

#training modules using classifiers
from nltk import classify
from nltk import NaiveBayesClassifier as NB
classifier_NB = NB.train(train_data)

print("Accuracy: ", classify.accuracy(classifier_NB,test_data))

#show_most_informative_features prints its own table and returns None
classifier_NB.show_most_informative_features(10)

#___________________________________________________________________________________________________________________
print("Algorithm Trained")


Sentiment Analysis: 
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]
['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top',
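
To make the training step more concrete, here is a stripped-down Naive Bayes classifier in plain Python, trained on four made-up tweets. It is a simplified illustration of the idea, not NLTK's implementation: the toy data, the function names, and the add-one smoothing are all assumptions chosen for the example.

```python
import math
from collections import Counter

def train(labeled_tweets):
    """labeled_tweets: list of (tokens, label) pairs."""
    tweets_per_label = Counter()
    word_counts = {}   # label -> how many tweets with that label contain each word
    vocab = set()
    for tokens, label in labeled_tweets:
        tweets_per_label[label] += 1
        word_counts.setdefault(label, Counter()).update(set(tokens))
        vocab.update(tokens)
    return tweets_per_label, word_counts, vocab

def classify(tokens, model):
    tweets_per_label, word_counts, vocab = model
    total = sum(tweets_per_label.values())
    scores = {}
    for label, n in tweets_per_label.items():
        score = math.log(n / total)                       # prior for this label
        for word in set(tokens) & vocab:                  # words never seen in training are skipped
            p = (word_counts[label][word] + 1) / (n + 2)  # add-one smoothing
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data, far smaller than the 10,000 NLTK sample tweets.
toy_data = [
    (["happy", "birthday"], "positive"),
    (["love", "great"], "positive"),
    (["sad", "funeral"], "negative"),
    (["hate", "sad"], "negative"),
]
model = train(toy_data)
print(classify(["happy", "great"], model))
print(classify(["sad", "news"], model))
```

Each word a tweet contains pushes the score toward the label it appeared with most often during training, which is the same intuition behind the "most informative features" table.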

Now that we have a way to fetch the 100 or so most recent tweets on a particular subject and a trained classifier, we can combine them and analyze the results. The code below goes through every tweet and decides whether it is positive or negative. At the end there is an overview of the results. This is the main code to run once the top two code blocks have been run.

In [None]:
query = input("Enter a search topic: ")

#twitter fields to be returned by api call
tweet_fields = "tweet.fields=text,author_id,created_at"


#twitter api call
json_response = search_twitter(query=query, tweet_fields=tweet_fields, bearer_token=bearer_token)
#fall back to an empty list if the response has no 'data' key
tweets = json_response.get('data', [])
listoftweets=[]
for i in tweets:
  listoftweets.append(i['text'])
print(str(len(listoftweets)) + " of the most recent Tweets mentioning '" + str(query) + "' scraped for analyzing.")
#This will print all tweet info, I only need the text
#print(json.dumps(json_response['data'], indent=4, sort_keys=True))

#check classifier w tweets
from nltk.tokenize import word_tokenize as wt
sentimentlist=[]
for i in listoftweets:
  custom_tokens = noise_removal(wt(i),sw)
  #print(i) #show every tweet and if its pos or neg

  sentimentlist.append(classifier_NB.classify(dict([token,True] for token in custom_tokens)))
  #print(classifier_NB.classify(dict([token,True] for token in custom_tokens))) #print if each tweet is pos or neg


#calculate sentiment
poscount=0
negcount=0
for i in sentimentlist:
  if i=='positive':
    poscount+=1
  else:
    negcount+=1
  


print("positive tweets: " + str(poscount))
print("negative tweets: " + str(negcount))



if poscount>negcount:
  print("The topic of '" + str(query) + "' is mostly positive")
elif poscount<negcount:
  print("The topic of '" + str(query) + "' is mostly negative")
else:
  print("The topic of '" + str(query) + "' is neutral")


Enter a search topic: school

100 of the most recent Tweets mentioning 'school' scraped for analyzing.
positive tweets: 30
negative tweets: 70
The topic of 'school' is mostly negative


Some interesting results



Positive Topics
*   'Birthday' was 90:10  mostly positive
*   'The economy' was 72:28  mostly positive (surprisingly)
*   'Basketball' was 70:30  mostly positive
*   'Graduation' was 58:42  mostly positive
*   'Stranger Things' was 61:39  mostly positive
*   'Minnetonka' was 71:29  mostly positive

Neutral Topics (rare)
*   'United States' was 50:50 split


Negative Topics
*   'Funeral' was 17:83  mostly negative
*   'Ukraine' was 40:60 mostly negative
*   'Netflix' was 15:85  mostly negative
*   'School' was 42:58  mostly negative
*   'Nike' was 39:61  mostly negative
*   'Joe Biden' was 45:55  mostly negative
*   'Morbius' was 39:61  mostly negative




This algorithm is pretty accurate for the most part. Most of the results fit the sentiment that you would expect. For example, Netflix is losing popularity at the moment, and that fits the mostly negative public feeling. You would also expect 'birthday' to be positive because it is a happy topic. From these and other examples, it seems likely that this is a useful tool for gauging public sentiment on a given subject. However, when running the Twitter code I found that a lot of the returned recent tweets come from bots that skew the results quite a bit by spamming the same message. Some results are also simply unexpected, like how 'the economy' is mostly positive while stocks are dropping at the moment, though it could be positive for other, unknown reasons. Still, the program works as intended, and taking in a larger amount of data might help.
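
One possible way to soften the effect of bots repeating the same message is to drop duplicate tweets before classifying them. The sketch below is only an assumption about how that could be done; the notebook does not do this, and `drop_duplicates` is a hypothetical helper.

```python
# Hypothetical helper (not in the notebook): keep the first copy of each tweet.
def drop_duplicates(tweets):
    seen = set()
    unique = []
    for text in tweets:
        # compare tweets ignoring case and extra whitespace
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

sample = ["Buy now!", "buy  NOW!", "I like school"]
print(drop_duplicates(sample))
```

This only catches exact repeats; spam that changes a few words each time would still get through.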

Report

Introduction

For this project I combined data accessed through the Twitter developer API with machine-learning sentiment analysis to determine the community sentiment on any user-given topic. I learned that Twitter gives access to a limited amount of data for personal use through its own API, which I had to apply for. Using this you can request the data you need; in this case, I need a list of the text of the most recent tweets that mention a keyword given by the user. This project did use a lot of code similar to what we did in class, but I'm proud I was able to apply it to actual data in real time to calculate overall sentiment. It uses the NLTK library to obtain sample Twitter data, which it uses to train the classifier to label text as positive or negative. Through this, I can classify the sentiment of real-time tweets and see how many it predicts are positive and how many negative.

How it works

It starts off by asking the user for input: the 'query' that needs to be included in the Twitter API call. It then combines that with the desired fields (I only need the tweet text) and the bearer token, which I obtained just for this project and which authorizes the call. The token makes sure that whoever requests the data is a verified user whose usage Twitter can track, so that crypto bots and whatnot can't constantly take in data and use it for personal gain. Given these inputs, the API returns a number of recent tweets that contain the query, which I store in a list called tweets. Twitter lets you obtain anywhere from 10 to 100 tweets per request. You can also specify what data you want (time posted, username, likes, comments...), but the text is obviously the most important component when determining sentiment.
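
As a rough sketch of the last step described above, this is how the tweet text could be pulled out of the parsed response. The sample dictionary is a hand-written stand-in rather than real API output, `extract_texts` is a hypothetical helper, and the assumption is that matching tweets sit in a list under a `data` key that may be absent when nothing matches.

```python
# Hypothetical helper (not in the notebook): collect only the text of each tweet.
def extract_texts(json_response):
    # use an empty list when the response has no "data" key
    return [tweet["text"] for tweet in json_response.get("data", [])]

# Hand-written stand-in for a parsed response, not real API output.
sample_response = {
    "data": [
        {"id": "1", "text": "I love school"},
        {"id": "2", "text": "school is the worst"},
    ]
}

print(extract_texts(sample_response))
print(extract_texts({}))
```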

Then the code from the machine learning sentiment analysis program is used. This code takes in sample data from NLTK made up of thousands of sample tweets, each labeled with a sentiment. It tokenizes the tweets, then scans all the words and works out which ones are most associated with positive or negative tweets. It gets rid of stop words (such as 'or', 'the', 'and', 'a') so that only the notable words are associated with a sentiment. Once the classifier is trained, you can measure its accuracy on held-out tweets and test it on user-given tweets.
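
The cleaning step can be illustrated with a much smaller version that uses only the standard library. This is a simplified sketch, not the notebook's `noise_removal` function: it skips lemmatization, and the stop-word set here is a tiny hand-picked assumption instead of NLTK's English list.

```python
import re

# Tiny hand-picked stop-word set for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "at", "to"}

def simple_clean(tweet):
    tweet = re.sub(r"https?://\S+", "", tweet)      # drop links
    tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)    # drop @mentions
    tokens = re.findall(r"[a-z']+", tweet.lower())  # keep lowercase word-like chunks
    return [t for t in tokens if t not in STOP_WORDS]

tokens = simple_clean("@netflix The new season is great! https://t.co/abc123")
print(tokens)

# The classifier expects each tweet as a dictionary of {word: True} features.
features = {token: True for token in tokens}
print(features)
```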

Finally, these two programs can be combined. I run the list of tweets through a for loop that applies the classifier to each one and builds a new list of around 100 values, each either 'positive' or 'negative'. Then I used some simple Python to count the positives and negatives and decide whether the results were mostly positive, mostly negative, or evenly split.
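
The counting step can also be written compactly with `collections.Counter`. This is an alternative sketch of the same tally, with `summarize` as a hypothetical helper name and a made-up list of labels standing in for the classifier's output.

```python
from collections import Counter

# Hypothetical helper (not in the notebook): tally labels and pick a verdict.
def summarize(labels):
    counts = Counter(labels)
    pos, neg = counts["positive"], counts["negative"]
    if pos > neg:
        verdict = "mostly positive"
    elif pos < neg:
        verdict = "mostly negative"
    else:
        verdict = "neutral"
    return pos, neg, verdict

# Made-up labels matching the 30:70 split from the 'school' run.
labels = ["positive"] * 30 + ["negative"] * 70
print(summarize(labels))
```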

Results

For the most part, these results were pretty accurate. Even if the classifier labels a few tweets with the wrong sentiment, 100 tweets are usually enough to even out the data and give a reasonable picture. I consider the algorithm somewhat reliable because its results fit what I expected based on current events and general public knowledge:
1. Due to increasing unrest, the topic of Joe Biden would be expected to be mostly negative, which the algorithm mirrors.
2. Netflix is currently growing more unpopular as its stock plummets and many people choose to unsubscribe. My data reflects the mostly negative sentiment.
3. Keywords such as 'birthday', 'graduation', and 'basketball' all return mostly positive, which fits the positive associations of those words.
4. With the war happening in Ukraine, you would expect a mostly negative sentiment assessment, which is what the algorithm returns.

Conclusion

In conclusion, this project was successful: I was able to obtain real data and create something that is actually practical and could be used for a variety of things. Companies could use it to gauge social media sentiment about their brand. It could also be used personally to judge Twitter opinions on a particular product by inputting, say, 'MacBook Pro' and seeing what 100 people think of it. I could maybe improve this project by obtaining more data, or make it even more practical by obtaining Facebook, Instagram, and Reddit data and comparing them. If Twitter allowed more than 100 tweets for analysis, the results would likely be more accurate. I could also use more sample tweets to train the classifier to perform more accurately, but that would take a lot more time and I couldn't find any other data. This was a useful project for learning sentiment analysis and applying it to real-world data.


Resources

Twitter Webscrape Tutorial
Piyush. “Get Data from Twitter API in Python - Step by Step Guide.” Data Science Parichay, 15 Dec. 2021, https://datascienceparichay.com/article/get-data-from-twitter-api-in-python-step-by-step-guide/.


NLTK Sentiment Analysis Tutorial
“NLTK Sentiment Analysis Tutorial: Text Mining & Analysis in Python.” DataCamp, https://www.datacamp.com/tutorial/text-analytics-beginners-nltk?irclickid=zfvTeUVYFxyIUzuxFTRRGWYMUkDxiQ3bhXd-Rc0&irgwc=1&utm_medium=affiliate&utm_source=impact&utm_campaign=2003851.


Twitter data 
Twitter.com



