# Sentiment analysis model

To do:
- Preprocessing, including NLP
- Train a model on the annotated sentiment labels
- Validate the model (cross-validation, etc.; check notes from ApML)
- Visualize (separate notebook):
    - Average sentiment by vaccine (identify vaccines by hashtags)
    - Number of positive, negative, and neutral tweets
    - Sentiment over time
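
The "average sentiment by vaccine" item could be sketched roughly as follows. This is only an illustration on toy data: it assumes the `hashtags` column holds the string form of a Python list (as the table outputs in this notebook suggest) and that a numeric `sentiment_compound` column is available. `VACCINE_HASHTAGS` and `vaccines_in` are hypothetical names introduced here.

```python
import ast

import pandas as pd

# Toy rows shaped like the tweet table: hashtags are the string form of a list.
toy = pd.DataFrame({
    "hashtags": ["['Pfizer', 'COVID19']", "['Moderna', 'pfizer']", "['Covaxin']"],
    "sentiment_compound": [0.5, -0.3, 0.9],
})

# Hypothetical lookup from lowercase hashtag to vaccine name; extend as needed.
VACCINE_HASHTAGS = {"pfizer": "Pfizer", "moderna": "Moderna", "covaxin": "Covaxin"}


def vaccines_in(hashtags_str):
    """Return the sorted vaccine names mentioned in one tweet's hashtag list."""
    tags = ast.literal_eval(hashtags_str)
    return sorted({VACCINE_HASHTAGS[t.lower()] for t in tags if t.lower() in VACCINE_HASHTAGS})


toy["vaccine"] = toy["hashtags"].apply(vaccines_in)

# One row per (tweet, vaccine) pair, then the mean compound score per vaccine.
avg_by_vaccine = (
    toy.explode("vaccine")
       .dropna(subset=["vaccine"])  # tweets without a known vaccine hashtag
       .groupby("vaccine")["sentiment_compound"]
       .mean()
)
print(avg_by_vaccine)
```

A tweet that mentions several vaccines contributes its score to each of them, which may or may not be the desired weighting.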

In [1]:
import pandas as pd
import numpy

In [2]:
clean_vaccine_tweets = pd.read_csv("../data/interim/cleaned_vaccine_tweets.csv", index_col=0)
annotated_vaccine_tweets = pd.read_csv("../data/interim/covid-19_vaccine_tweets_with_sentiment.csv", encoding="latin", index_col=0)

In [3]:
clean_vaccine_tweets.head()

Unnamed: 0,id,created_at,user,geo,full_text,hashtags
0,1338158543359250432,2020-12-13 16:27:13+00:00,76052772,,While the world has been on the wrong side of ...,"['covid19', 'supplychain', 'logistics', 'vacci..."
1,1337840331522453504,2020-12-12 19:22:45+00:00,1300382181605494800,,@cnnbrk #COVID19 #CovidVaccine #vaccine #Coron...,"['COVID19', 'CovidVaccine', 'vaccine', 'Corona..."
2,1338544403795881984,2020-12-14 18:00:29+00:00,1164717209253552000,,The FDA Authorizes Emergency Use Of The Pfizer...,"['PFE', 'Pfizer', 'Pfizervaccine', 'PfizerBioN..."
3,1337735595704115200,2020-12-12 12:26:34+00:00,1316036067754205200,,The #FDA finally issues #EUA now comes the pro...,"['FDA', 'EUA', 'PfizerBioNTech', 'vaccinated']"
4,1337850832256176128,2020-12-12 20:04:29+00:00,1110032180237852700,,There have not been many bright days in 2020 b...,"['BidenHarris', 'Election2020', 'PfizerBioNTec..."


In [4]:
annotated_vaccine_tweets = annotated_vaccine_tweets.rename(columns={"tweet_text":"full_text"})
annotated_vaccine_tweets.head()

Unnamed: 0,tweet_id,label,full_text
0,1360342002961940480,1,"4,000 a day dying from the so called Covid-19 ..."
1,1382896334886248448,2,Pranam message for today manifested in Dhyan b...
2,1375673411846873088,2,Hyderabad-based ?@BharatBiotech? has sought fu...
3,1381310901119287296,1,"Confirmation that Chinese #vaccines ""donÂt ha..."
4,1362165556091191296,3,"Lab studies suggest #Pfizer, #Moderna vaccines..."


Sentiment Label:
- Negative: 1
- Neutral: 2
- Positive: 3
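
For readable tables and plots, the numeric labels can be mapped to names. A minimal sketch on a toy frame standing in for `annotated_vaccine_tweets`; `LABEL_NAMES` and the `label_name` column are hypothetical names that are not used elsewhere in this notebook.

```python
import pandas as pd

# Mapping taken from the label legend above.
LABEL_NAMES = {1: "negative", 2: "neutral", 3: "positive"}

# Toy stand-in for annotated_vaccine_tweets (labels only).
toy = pd.DataFrame({"label": [1, 2, 2, 1, 3]})

# Series.map looks each numeric label up in the dict.
toy["label_name"] = toy["label"].map(LABEL_NAMES)

# Tweets per class, which also shows how balanced the labels are.
label_counts = toy["label_name"].value_counts()
print(label_counts)
```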

---

# NLP
## Preprocessing

In [5]:
clean_vaccine_tweets["corpus"] = ""
annotated_vaccine_tweets["corpus"] = ""

In [6]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')

[nltk_data] Downloading package stopwords to /Users/ayman/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/ayman/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/ayman/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

Lowercase the text, strip mentions, URLs, punctuation, and digits, then remove stopwords and lemmatize:

In [7]:
# Lowercase, strip mentions/URLs/punctuation/digits, drop stopwords, lemmatize

# Build the stopword set and the lemmatizer once instead of once per tweet.
# Single letters, "not" and "no" are treated as stopwords as well.
ALL_STOPWORDS = set(stopwords.words("english")) | set("abcdefghijklmnopqrstuvwxyz") | {"not", "no"}
LEMMATIZER = nltk.stem.WordNetLemmatizer()

def clean_text(text):
    # Replace @mentions, non-alphanumeric characters, and URLs with spaces
    text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text)
    # Replace digits with spaces, lowercase, and tokenize on whitespace
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(LEMMATIZER.lemmatize(word) for word in tokens if word not in ALL_STOPWORDS)

def clean_dataset(dataset):
    # Assign the whole column at once; writing dataset["corpus"][i] cell by cell
    # is chained indexing and triggers pandas' SettingWithCopyWarning.
    dataset["corpus"] = dataset["full_text"].apply(clean_text)
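
The regular-expression stage of the cleaning can be checked in isolation, without the NLTK data. The sketch below reuses the two patterns from the cell above on made-up input strings; `strip_noise` is a hypothetical helper that skips stopword removal and lemmatization.

```python
import re


def strip_noise(text):
    """Regex stage only: no stopword removal or lemmatization."""
    # Replace @mentions, characters that are not letters/digits/space/tab, and URLs with spaces
    text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text)
    # Replace the remaining non-letters (digits) with spaces, then lowercase and split
    return re.sub(r"[^a-zA-Z]", " ", text).lower().split()


tokens = strip_noise("@cnnbrk #COVID19 is here: https://t.co/abc123 Great!")
print(tokens)  # ['covid', 'is', 'here', 'great']

# The mention pattern stops at an underscore, so part of the handle survives.
handle_tokens = strip_noise("@Doomsday_Clock 180 UK deaths")
print(handle_tokens)  # ['clock', 'uk', 'deaths']
```

The second call shows why fragments of user handles such as `clock` can end up in the corpus; widening the mention pattern to `@\w+` would remove the whole handle, at the cost of changing the cleaned text.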


In [8]:
clean_dataset(clean_vaccine_tweets)
clean_dataset(annotated_vaccine_tweets)



In [9]:
clean_vaccine_tweets.head()

Unnamed: 0,id,created_at,user,geo,full_text,hashtags,corpus
0,1338158543359250432,2020-12-13 16:27:13+00:00,76052772,,While the world has been on the wrong side of ...,"['covid19', 'supplychain', 'logistics', 'vacci...",world wrong side history year hopefully bigges...
1,1337840331522453504,2020-12-12 19:22:45+00:00,1300382181605494800,,@cnnbrk #COVID19 #CovidVaccine #vaccine #Coron...,"['COVID19', 'CovidVaccine', 'vaccine', 'Corona...",covid covidvaccine vaccine corona pfizerbionte...
2,1338544403795881984,2020-12-14 18:00:29+00:00,1164717209253552000,,The FDA Authorizes Emergency Use Of The Pfizer...,"['PFE', 'Pfizer', 'Pfizervaccine', 'PfizerBioN...",fda authorizes emergency use pfizer vaccine pf...
3,1337735595704115200,2020-12-12 12:26:34+00:00,1316036067754205200,,The #FDA finally issues #EUA now comes the pro...,"['FDA', 'EUA', 'PfizerBioNTech', 'vaccinated']",fda finally issue eua come problem transportin...
4,1337850832256176128,2020-12-12 20:04:29+00:00,1110032180237852700,,There have not been many bright days in 2020 b...,"['BidenHarris', 'Election2020', 'PfizerBioNTec...",many bright day best bidenharris winning elect...


In [10]:
annotated_vaccine_tweets.head()

Unnamed: 0,tweet_id,label,full_text,corpus
0,1360342002961940480,1,"4,000 a day dying from the so called Covid-19 ...",day dying called covid vaccine report vaccine ...
1,1382896334886248448,2,Pranam message for today manifested in Dhyan b...,pranam message today manifested dhyan truth lo...
2,1375673411846873088,2,Hyderabad-based ?@BharatBiotech? has sought fu...,hyderabad based sought fund government ramp pr...
3,1381310901119287296,1,"Confirmation that Chinese #vaccines ""donÂt ha...",confirmation chinese vaccine high protection r...
4,1362165556091191296,3,"Lab studies suggest #Pfizer, #Moderna vaccines...",lab study suggest pfizer moderna vaccine prote...


## Sentiment Analysis with nltk

In [11]:
from nltk.sentiment import SentimentIntensityAnalyzer

In [45]:
# Placeholder columns that sentiment_score fills in
clean_vaccine_tweets["sentiment"] = None
annotated_vaccine_tweets["sentiment"] = None
clean_vaccine_tweets["sentiment_compound"] = 0.0

In [46]:
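def sentiment_score(dataset):
    # Score each cleaned tweet with VADER; polarity_scores returns a dict
    # with the keys 'neg', 'neu', 'pos' and 'compound'.
    sia = SentimentIntensityAnalyzer()
    dataset["sentiment"] = dataset["corpus"].apply(sia.polarity_scores)
    # Whole-column assignment avoids chained indexing (dataset[col][i] = ...)
    # and the SettingWithCopyWarning it triggers.
    dataset["sentiment_compound"] = dataset["sentiment"].apply(lambda s: s["compound"])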
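
VADER's `compound` score lies between -1 and 1, while the annotated tweets use the labels 1 to 3, so the score has to be bucketed before the two can be compared. The sketch below uses the ±0.05 cut-offs suggested in the VADER documentation; treat them as an assumption to be tuned on the annotated data. `compound_to_label` is a hypothetical helper, not part of NLTK, and the gold labels are made up for illustration.

```python
def compound_to_label(compound, threshold=0.05):
    """Map a VADER compound score to the notebook's labels (1=negative, 2=neutral, 3=positive)."""
    if compound >= threshold:
        return 3
    if compound <= -threshold:
        return 1
    return 2


# Toy compound scores with made-up gold labels (not taken from the dataset).
compounds = [-0.9842, 0.0, 0.9876, 0.02]
gold_labels = [1, 2, 3, 1]

predicted = [compound_to_label(c) for c in compounds]
accuracy = sum(p == g for p, g in zip(predicted, gold_labels)) / len(gold_labels)
print(predicted, accuracy)
```

Once a `sentiment_compound` column exists for the annotated tweets, the same function can be applied with `.apply(compound_to_label)` and compared against the `label` column.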

In [47]:
sentiment_score(clean_vaccine_tweets)



In [48]:
clean_vaccine_tweets

Unnamed: 0,id,created_at,user,geo,full_text,hashtags,corpus,sentiment,sentiment_compound
0,1403680293727477760,2021-06-12 11:47:15+00:00,49671976,,#Moderna #Pfizer #JohnsonandJohnson not liable...,"['Moderna', 'Pfizer', 'JohnsonandJohnson']",moderna pfizer johnsonandjohnson liable advers...,"{'neg': 0.656, 'neu': 0.344, 'pos': 0.0, 'comp...",-0.9842
1,1379804303947358208,2021-04-07 14:32:36+00:00,978836543652712400,,"Always said ""they rushed the vaccine"" but no, ...",['oxfordastrazeneca'],always said rushed vaccine trusted scientist l...,"{'neg': 0.55, 'neu': 0.387, 'pos': 0.063, 'com...",-0.9774
2,1405518114746355712,2021-06-17 13:30:06+00:00,760564088854442000,,"1,332 deaths after 💉reported to #MHRA #YellowC...","['MHRA', 'YellowCard', 'ASTRAZENECA', 'PFIZER'...",death reported mhra yellowcard astrazeneca rea...,"{'neg': 0.609, 'neu': 0.391, 'pos': 0.0, 'comp...",-0.9761
3,1402675902220222464,2021-06-09 17:16:10+00:00,1118813148914569200,,@ThrowAw31644033 @Daveyji @socioEqualiser @sap...,"['MHRA', 'ASTRAZENECA', 'PFIZER', 'MODERNA', '...",death reported mhra uk astrazeneca reaction de...,"{'neg': 0.609, 'neu': 0.391, 'pos': 0.0, 'comp...",-0.9761
4,1395373649113268224,2021-05-20 13:39:37+00:00,1030608306,,@MrLichtenstein @Doomsday_Clock 180 UK deaths ...,"['MHRA', 'ASTRAZENECA', 'PFIZER', 'MODERNA', '...",clock uk death reported mhra following syringe...,"{'neg': 0.565, 'neu': 0.435, 'pos': 0.0, 'comp...",-0.9761
...,...,...,...,...,...,...,...,...,...
101942,1367114490890752000,2021-03-03 14:07:49+00:00,143025857,,Better efficacy than Oxford/covishield's 62%. ...,['Covaxin'],better efficacy oxford covishield best inactiv...,"{'neg': 0.0, 'neu': 0.339, 'pos': 0.661, 'comp...",0.9837
101943,1362778395302653952,2021-02-19 14:57:43+00:00,1263498191502282800,,Congratulations on BioAsia Genome Valley Excel...,"['Covaxin', 'Vaccine']",congratulation bioasia genome valley excellenc...,"{'neg': 0.0, 'neu': 0.272, 'pos': 0.728, 'comp...",0.9842
101944,1375483502616010752,2021-03-26 16:23:16+00:00,1356008062620991500,,We will be giving thousands of doses of the in...,"['OxfordAstraZeneca', 'Passover', 'covidjab']",giving thousand dos incredibly safe powerfully...,"{'neg': 0.0, 'neu': 0.298, 'pos': 0.702, 'comp...",0.9842
101945,1362299993504145408,2021-02-18 07:16:43+00:00,160763636,,Hey dear friends my #COVAXIN is done and feel ...,"['COVAXIN', 'COVID19', 'VaccineMaitri']",hey dear friend covaxin done feel good thanks ...,"{'neg': 0.0, 'neu': 0.311, 'pos': 0.689, 'comp...",0.9876


In [55]:
clean_vaccine_tweets = clean_vaccine_tweets.sort_values("sentiment_compound", ascending=False, ignore_index=True)

In [57]:
clean_vaccine_tweets["sentiment"][101945]

{'neg': 0.55, 'neu': 0.387, 'pos': 0.063, 'compound': -0.9774}

---

In [58]:
clean_vaccine_tweets.to_csv("../data/interim/clean_vaccine_tweets_with_sentiment.csv")
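
One thing to keep in mind when this file is loaded again (for example in the visualization notebook): CSV has no dict type, so the `sentiment` column comes back as strings. A minimal sketch of the round trip on a toy frame, using an in-memory buffer in place of the real file:

```python
import ast
import io

import pandas as pd

# Toy frame with a dict-valued column, like the "sentiment" column above.
toy = pd.DataFrame({
    "sentiment": [{"neg": 0.0, "neu": 0.339, "pos": 0.661, "compound": 0.9837}],
    "sentiment_compound": [0.9837],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

reloaded = pd.read_csv(buffer, index_col=0)
raw_value = reloaded.loc[0, "sentiment"]  # the dict's string representation

# ast.literal_eval safely parses each string back into a dict.
reloaded["sentiment"] = reloaded["sentiment"].apply(ast.literal_eval)
print(type(raw_value).__name__, reloaded.loc[0, "sentiment"]["compound"])
```

If only the compound score is needed downstream, the numeric `sentiment_compound` column survives the round trip unchanged and no parsing is required.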