# Final Project: Sentiment Analysis on Covid-19 Tweets  
**Math189Z – Covid-19: Data Analytics and Machine Learning**  
Nico Espinosa Dice  
*May, 2020*  

## Sources
This project uses theory presented in the following academic papers:  
- [Sentiment Analysis of Twitter Data](http://www.cs.columbia.edu/~julia/papers/Agarwaletal11.pdf) (1)

- [Sentiment Analysis of Twitter Data](https://arxiv.org/pdf/1711.10377.pdf) (2)

- [Covid-19 Tweets Dataset and Statistics](https://ieee-dataport.org/open-access/corona-virus-covid-19-tweets-dataset)


This project uses code that was inspired by and adapted from the following open-source resources:  
- [Twitter Sentiment Analysis with Explanation (Naive Bayes)](https://medium.com/@koshut.takatsuji/twitter-sentiment-analysis-with-full-code-and-explanation-naive-bayes-a380b38f036b)

- [Creating The Twitter Sentiment Analysis Program in Python with Naive Bayes Classification](https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed)

- [How to Do Sentiment Analysis on a Twitter Account](https://medium.com/better-programming/twitter-sentiment-analysis-15d8892c0082)

- [Comprehensive Hands on Guide to Twitter Sentiment Analysis with dataset and code](https://www.analyticsvidhya.com/blog/2018/07/hands-on-sentiment-analysis-dataset-python/)

### Data
The dataset of Tweets was provided by Professor Gu as part of HMC Math189Z. The data is available [here](https://math189covid19.github.io/resources.html). The original source of the data is unknown at this time.

The Twitter sentiment corpus was provided in this [public repository](https://github.com/zfz/twitter_corpus).

The Covid-19 related data was provided in this [public repository](https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases/resource/d037a9e3-69d8-4452-bc51-3e225fca75c3).

## Importing Data

In [None]:
# Imports the necessary libraries
import numpy as np
import pandas as pd
import re
from textblob import TextBlob
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import word_tokenize
from string import punctuation 
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from collections import Counter
nltk.download('stopwords')
nltk.download('punkt')

In [None]:
# Imports data into Pandas dataframe
feb_tweets = pd.read_csv('Data/feb_data.csv')
march_tweets = pd.read_csv('Data/march_data.csv')
april_tweets = pd.read_csv('Data/april_data.csv')

In [None]:
# Sets the column names of the dataframes and drops the unnecessary "Number" column
feb_tweets.columns, march_tweets.columns, april_tweets.columns = ['Number', "Date", "Text"], ['Number', "Date", "Text"], ['Number', "Date", "Text"]
feb_tweets = feb_tweets.drop(columns = ["Number"])
march_tweets = march_tweets.drop(columns = ["Number"])
april_tweets = april_tweets.drop(columns = ["Number"])

In [None]:
feb_tweets["Month"] = "February"
march_tweets["Month"] = "March"
april_tweets["Month"] = "April"

data = pd.concat([feb_tweets, march_tweets, april_tweets], ignore_index=True)

## Exploratory Data Analysis

In [None]:
feb_tweets.head()

In [None]:
data.head()

## Data Preprocessing

### Data Cleaning

In [None]:
new_stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])

In [None]:
# "cleans" the text by removing hyperlinks, hashtags, mentions, and retweets
# This function was suggested here: https://medium.com/better-programming/twitter-sentiment-analysis-15d8892c0082
def cleanText(text):
    text = re.sub(r'\bRT\s+', '', text) # Removes "RT" (before lowercasing, so the pattern can match)
    text = text.lower() # Makes text lowercase
    text = re.sub(r'https?://\S+', '', text) # Removes hyperlinks
    text = re.sub(r'#', '', text) # Removes the "#" symbol from hashtags
    text = re.sub(r'@[a-z0-9_]+', '', text) # Removes mentions (@)
    return text
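
To make the effect of each substitution concrete, the following standalone sketch applies the same kind of regular expressions to a made-up Tweet. The function name `clean_tweet` and the sample text are illustrative assumptions, not part of the project code. The retweet marker is stripped before lowercasing so that the `RT` pattern can still match.

```python
import re

def clean_tweet(text):
    """Illustrative cleaner: strips retweet markers, links, mentions and '#'."""
    text = re.sub(r'\bRT\s+', '', text)          # drop the "RT " retweet marker
    text = re.sub(r'https?://\S+', '', text)     # drop hyperlinks
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)   # drop @mentions
    text = text.replace('#', '')                 # keep hashtag words, drop '#'
    return ' '.join(text.lower().split())        # lowercase, collapse whitespace

# Hypothetical Tweet used only for illustration
sample = "RT @WHO: Wash your hands! #COVID19 https://example.com/info"
cleaned = clean_tweet(sample)
print(cleaned)
```

Punctuation left behind by a removed mention (the leading colon here) is later filtered out by the stopword step, which includes `string.punctuation`.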

In [None]:
def convertToList(text):
    text = word_tokenize(text)
    return [word for word in text if word not in new_stopwords] # Source for this line: https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed

In [None]:
# Applies cleanText() and convertToList() to every Tweet in the dataframe
data["Text"] = data["Text"].apply(cleanText)
data["List of Words"] = data["Text"].apply(convertToList)

In [None]:
def cleanDate(date):
    month = date[5:7]
    day = date[8:10]
    
    if month[0] == "0":
        month = month[1]
    if day[0] == "0":
        day = day[1]
    
    return month + "/" + day + "/20"
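
The slicing in `cleanDate` assumes that every timestamp begins with `YYYY-MM-DD` and that every Tweet is from 2020. A minimal sketch of the same conversion using `datetime` is shown below; the name `clean_date` and the sample timestamps are assumptions made for illustration.

```python
from datetime import datetime

def clean_date(timestamp):
    """Convert a 'YYYY-MM-DD...' timestamp to 'M/D/YY' without zero padding."""
    parsed = datetime.strptime(timestamp[:10], "%Y-%m-%d")
    return f"{parsed.month}/{parsed.day}/{parsed.year % 100:02d}"

print(clean_date("2020-03-05 14:22:10"))
print(clean_date("2020-04-17"))
```

Parsing the date raises a `ValueError` on malformed input instead of silently producing a wrong string, which can make bad rows easier to spot.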

In [None]:
data["Date"] = data["Date"].apply(cleanDate)

In [None]:
data.tail()

## Feature Engineering

### Polarity
-1 → extreme negative,  
0 → neutral,  
1 → extreme positive.

In [None]:
# Returns the polarity of the Tweet's text
def getPolarity(text):
   return  TextBlob(text).sentiment.polarity

In [None]:
# Creates a new column containing the polarity of every Tweet
data['Polarity'] = data['Text'].apply(getPolarity)

### Subjectivity
0 → fact,  
1 → opinion.

In [None]:
# Returns the subjectivity of the Tweet's text
def getSubjectivity(text):
   return TextBlob(text).sentiment.subjectivity

In [None]:
# Creates a new column containing the subjectivity of every Tweet
data["Subjectivity"] = data["Text"].apply(getSubjectivity)

## Sentiment Analysis
Polarity < 0 → negative,  
Polarity == 0 → neutral,  
Polarity > 0 → positive. 

In [None]:
# Returns the analysis of each Tweet's text
def getSentiment(polarity):
    if polarity < 0:
      return 'Negative'
    elif polarity == 0:
      return 'Neutral'
    else:
      return 'Positive'

In [None]:
data['Analysis'] = data['Polarity'].apply(getSentiment)

In [None]:
data.head()

## Analysis

### Analysis: Full Dataset

In [None]:
positive_tweets = data.loc[data["Analysis"] == "Positive"]
neutral_tweets = data.loc[data["Analysis"] == "Neutral"]
negative_tweets = data.loc[data["Analysis"] == "Negative"]

In [None]:
positive_tweets.reset_index(drop=True, inplace=True)
neutral_tweets.reset_index(drop=True, inplace=True)
negative_tweets.reset_index(drop=True, inplace=True)

In [None]:
positive_tweets.head()

In [None]:
neutral_tweets["Text"].head()

In [None]:
negative_tweets.head()

In [None]:
# Positive Tweets
plt.scatter(positive_tweets["Polarity"], positive_tweets["Subjectivity"])

plt.title('Sentiment Analysis of Positive Tweets') 
plt.xlabel('Polarity') 
plt.ylabel('Subjectivity') 
plt.show()

# Neutral Tweets
plt.scatter(neutral_tweets["Polarity"], neutral_tweets["Subjectivity"])

plt.title('Sentiment Analysis of Neutral Tweets') 
plt.xlabel('Polarity') 
plt.ylabel('Subjectivity') 
plt.show()

# Negative Tweets
plt.scatter(negative_tweets["Polarity"], negative_tweets["Subjectivity"])

plt.title('Sentiment Analysis of Negative Tweets') 
plt.xlabel('Polarity') 
plt.ylabel('Subjectivity')
plt.show()

In [None]:
print(data["Analysis"].value_counts())
print("Total:", data.shape[0])
print()

print("Percentage of positive Tweets:", (positive_tweets.shape[0] / data.shape[0]))
print()

print("Percentage of neutral Tweets:", (neutral_tweets.shape[0] / data.shape[0]))
print()

print("Percentage of negative Tweets:", (negative_tweets.shape[0] / data.shape[0]))

In [None]:
plt.title("Sentiment Analysis: Full Dataset")
plt.xlabel("Sentiment")
plt.ylabel("Number of Tweets")
data["Analysis"].value_counts().plot(kind = "bar")
plt.show()

In [None]:
plt.title("Sentiment Analysis (Percentage): Full Dataset")
plt.xlabel("Sentiment")
plt.ylabel("Percentage of Total Tweets")
((data["Analysis"].value_counts())/data.shape[0]).plot(kind = "bar")
plt.show()

### Analysis: Monthly

In [None]:
months = list(dict.fromkeys(data["Month"].values))
sentiments = list(dict.fromkeys(data["Analysis"].values))
sentiment_colors = {"Positive": "Blue", "Negative": "Red", "Neutral": "Gray"}

In [None]:
for month in months:
    month_tweets = data.loc[data["Month"] == month]
    plt.title("Sentiment Analysis: " + month)
    plt.xlabel("Sentiment")
    plt.ylabel("Percentage of Total Tweets")
    ((month_tweets["Analysis"].value_counts())/data.shape[0]).plot(kind = 'bar')
    plt.show()

In [None]:
for sentiment in sentiments:
    month_percentages = []
    sentiment_tweets = data.loc[data["Analysis"] == sentiment]
    plt.title("Sentiment Analysis: " + sentiment)
    plt.xlabel("Month")
    plt.ylabel("Percentage of Total Tweets")
    for month in months:
        month_tweets = sentiment_tweets.loc[sentiment_tweets["Month"] == month]
        # Share of all Tweets that have this sentiment and were posted in this month
        month_percentages.append(month_tweets.shape[0] / data.shape[0])
    plt.scatter(months, month_percentages) 
    plt.show()

In [None]:
plt.title("Sentiment Analysis")
plt.xlabel("Month")
plt.ylabel("Percentage of Total Tweets")
    
for sentiment in sentiments:
    month_percentages = []
    sentiment_tweets = data.loc[data["Analysis"] == sentiment]
    for month in months:
        month_tweets = sentiment_tweets.loc[sentiment_tweets["Month"] == month]
        # Share of all Tweets that have this sentiment and were posted in this month
        month_percentages.append(month_tweets.shape[0] / data.shape[0])
    plt.scatter(months, month_percentages, c = sentiment_colors[sentiment]) 
    plt.plot(months, month_percentages, c = sentiment_colors[sentiment])
plt.show()

## Naive Bayes Classification

### Training the model

In [None]:
corpus = pd.read_csv("Data/full_corpus.csv")

corpus["TweetText"] = corpus["TweetText"].apply(cleanText)
corpus["List of Words"] = corpus["TweetText"].apply(convertToList)

In [None]:
corpus.head()

In [None]:
training_data, testing_data = train_test_split(corpus, test_size = 0.05)

print(corpus.shape)
print(training_data.shape)
print(testing_data.shape)

In [None]:
training_data_list = []
for index, row in training_data.iterrows():
    # Uses the tokenized Tweet so that features are built from words rather than characters
    training_data_list.append((row["List of Words"], row["Sentiment"]))

testing_data_list = []
for index, row in testing_data.iterrows():
    testing_data_list.append((row["List of Words"], row["Sentiment"]))

In [None]:
# This function was adapted from https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed
# (The resource above is open source).
def build(input_data):
    word_list = []
    
    # Collects every word from every Tweet in the input data
    for (words, sentiment) in input_data:
        word_list.extend(words)

    words = nltk.FreqDist(word_list)
    word_features = words.keys()
    
    return word_features

In [None]:
# This function was adapted from https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed
# (The resource above is open source).
def get_features(text):
    words = set(text)
    features = {}
    
    for word in word_features:
        features['contains(%s)' % word] = (word in words)
        
    return features 
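
As an illustration of what `get_features` produces, the sketch below builds the same kind of `contains(word)` dictionary for a tiny, made-up vocabulary. The function name `contains_features`, the vocabulary, and the tokens are hypothetical; the vocabulary is passed in explicitly here rather than read from a global variable.

```python
def contains_features(tokens, vocabulary):
    """Map a token list to {'contains(word)': bool} over a fixed vocabulary."""
    present = set(tokens)
    return {f"contains({word})": (word in present) for word in vocabulary}

vocabulary = ["virus", "hope", "lockdown"]   # hypothetical vocabulary
features = contains_features(["stay", "home", "lockdown", "hope"], vocabulary)
print(features)
```

Each Tweet becomes a dictionary with one boolean per vocabulary word, so the input should be a list of tokens; passing a raw string would test for individual characters instead of words.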

In [None]:
word_features = build(training_data_list)
training_features = nltk.classify.apply_features(get_features, training_data_list)

In [None]:
naive_bayes_classifier = nltk.NaiveBayesClassifier.train(training_features)

In [None]:
# This piece of code was suggested in https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed
classifier_result_labels = [naive_bayes_classifier.classify(get_features(tweet[0])) for tweet in testing_data_list]


In [None]:
print("Percentage of Positive Sentiments:", classifier_result_labels.count('positive') / testing_data.shape[0])
print()
print("Percentage of Negative Sentiments:", classifier_result_labels.count('negative') / testing_data.shape[0])
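
The cell above reports how the predicted labels are distributed, not how often they are correct. One possible complementary check is to compare predictions with the held-out labels. The sketch below uses made-up label lists, and the helper name `accuracy` is an assumption rather than part of the project code.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    if not actual:
        raise ValueError("label lists must not be empty")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

predicted = ["positive", "negative", "neutral", "neutral"]   # toy predictions
actual = ["positive", "neutral", "neutral", "negative"]      # toy ground truth
score = accuracy(predicted, actual)
print(score)
```

In this notebook, the true labels would presumably be the second element of each tuple in `testing_data_list`.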


### Applying the Model to Covid-19 Twitter Data

In [None]:
# training_data, testing_data = train_test_split(data, test_size = 0.2)

In [None]:
print(data.shape)
print(training_data.shape)
print(testing_data.shape)

In [None]:
# word_features = build(training_data_list)
# training_features = nltk.classify.apply_features(get_features, training_data_list)

In [None]:
# naive_bayes_classifier = nltk.NaiveBayesClassifier.train(training_features)

In [None]:
# This piece of code was suggested in https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed
covid_classifier_labels = [naive_bayes_classifier.classify(get_features(row["List of Words"])) for index, row in data.iterrows()]

In [None]:
data["Classification"] = covid_classifier_labels
data["Ensemble"] = covid_classifier_labels # temporary

In [None]:
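for index, row in data.iterrows():
    # iterrows() yields copies of each row, so the result is written back through data.at
    if row["Polarity"] < -0.2:
        data.at[index, "Ensemble"] = "negative"
    elif row["Polarity"] > 0.2 and row["Classification"] == "neutral":
        data.at[index, "Ensemble"] = "positive"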

In [None]:
data.loc[data["Polarity"] < -0.2, "Ensemble"] = "negative"
data.loc[data["Polarity"] > 0.2, "Ensemble"] = "positive"
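
The ensemble rule applied by the two `data.loc` assignments can be summarized as a small function: a strongly negative or strongly positive TextBlob polarity overrides the Naive Bayes label, and otherwise the classifier's label is kept. The sketch below is illustrative only; the name `ensemble_label` and the `threshold` parameter are assumptions that generalize the hard-coded 0.2.

```python
def ensemble_label(polarity, classifier_label, threshold=0.2):
    """Let a strong TextBlob polarity override the classifier's label."""
    if polarity < -threshold:
        return "negative"
    if polarity > threshold:
        return "positive"
    return classifier_label   # weak polarity: keep the Naive Bayes label

# Made-up (polarity, classifier label) pairs
examples = [(-0.6, "neutral"), (0.5, "negative"), (0.1, "neutral"), (-0.2, "positive")]
labels = [ensemble_label(p, c) for p, c in examples]
print(labels)
```

Because the comparisons are strict, a polarity of exactly -0.2 or 0.2 leaves the classifier's label unchanged.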

In [None]:
print(data["Ensemble"].value_counts())

## Covid Cases

In [None]:
confirmed = pd.read_csv('Data/covid_confirmed.csv')
recovered = pd.read_csv('Data/covid_recovered.csv')
deaths = pd.read_csv('Data/covid_deaths.csv')

In [None]:
confirmed.head()

In [None]:
confirmed_us = confirmed.loc[confirmed["Country/Region"] == "US"]
deaths_us = deaths.loc[deaths["Country/Region"] == "US"]
recovered_us = recovered.loc[recovered["Country/Region"] == "US"]

In [None]:
confirmed_us.head()

In [None]:
df_list = [confirmed_us, deaths_us, recovered_us]

for i in range(3):
    df = df_list[i].drop(columns = ["Province/State", "Country/Region", "Lat", "Long"])
    dates, values = [], []

    for j in df:
        dates.append(j)
        values.append(df.iloc[0][j])
        
    new_data = {'Date': dates, 'Cases': values}
    df_list[i] = pd.DataFrame(new_data)

confirmed_us = df_list[0]
deaths_us = df_list[1]
recovered_us = df_list[2]

In [None]:
confirmed_us.head()

In [None]:
plt.title("Confirmed Cases - US")
plt.xlabel("Date")
plt.ylabel("Cases")
plt.plot(confirmed_us["Date"], confirmed_us["Cases"])
plt.show()

In [None]:
plt.title("Deaths - US")
plt.xlabel("Date")
plt.ylabel("Deaths")
plt.plot(deaths_us["Date"], deaths_us["Cases"])
plt.show()

In [None]:
plt.title("Recovered - US")
plt.xlabel("Date")
plt.ylabel("Recovered")
plt.plot(recovered_us["Date"], recovered_us["Cases"])
plt.show()

In [None]:
def convertToWeekly(df):
    new_dates, new_cases = [], []
    count = 0

    for index, row in df.iterrows():
        if (count == 6):
            new_dates.append(row["Date"])
            new_cases.append(row["Cases"])
            count = 0
        else:
            count += 1
    
    new_data = {'Date': new_dates, 'Cases': new_cases}
    return pd.DataFrame(new_data)
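
The counter in `convertToWeekly` keeps the rows at positions 6, 13, 20, and so on, that is, the last day of each complete seven-day block. The same selection can be sketched with list slicing on made-up rows; the name `every_seventh` is hypothetical.

```python
def every_seventh(rows):
    """Keep rows at positions 6, 13, 20, ... (the last day of each full week)."""
    return rows[6::7]

daily = [(f"day{n}", n * 10) for n in range(1, 16)]   # 15 made-up daily rows
weekly = every_seventh(daily)
print(weekly)
```

Note that this samples one value per week rather than summing the week, which is appropriate only if the daily counts are already cumulative totals.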

In [None]:
confirmed_us_weekly = convertToWeekly(confirmed_us)
deaths_us_weekly = convertToWeekly(deaths_us)
recovered_us_weekly = convertToWeekly(recovered_us)

In [None]:
confirmed_us_weekly.head()

In [None]:
dates = list(confirmed_us_weekly["Date"].values)

In [None]:
# Parses an "M/D/YY" date string into a (month, day) tuple of integers
def parseMonthDay(date):
    month, day = date.split("/")[0:2]
    return int(month), int(day)

# I realize that this is an incredibly inefficient function
def getWeeklyResults(df):
    results_list = []

    for date in dates:
        cutoff = parseMonthDay(date)
        date_results = []
    
        for index, row in df.iterrows():
            # Compares dates as integer tuples, keeping Tweets posted on or before the cutoff
            # (all dates are assumed to fall in the same year)
            if parseMonthDay(row["Date"]) <= cutoff:
                date_results.append(row["Ensemble"])
        
        results_list.append(date_results)
    
    return results_list

In [None]:
weekly_results = getWeeklyResults(data)

In [None]:
print(len(weekly_results[7]))

In [None]:
week_counts = {}
for i in range(len(weekly_results)):
    if (len(weekly_results[i]) != 0):
        date = dates[i]
        total_count = len(weekly_results[i])
        counter = Counter(weekly_results[i])
        positive_percent = counter["positive"] / total_count
        neutral_percent = counter["neutral"] / total_count
        negative_percent = counter["negative"] / total_count
        irrelevant_percent = counter["irrelevant"] / total_count
        week_counts[date] = [total_count, positive_percent, neutral_percent, negative_percent]
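
The per-week shares above follow a simple pattern: count each label with `Counter`, then divide by the number of labels. A minimal sketch on made-up labels follows; the helper name `label_shares` is an assumption.

```python
from collections import Counter

def label_shares(labels):
    """Return each label's share of the list as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

week = ["positive", "neutral", "neutral", "negative"]   # made-up labels
shares = label_shares(week)
print(shares)
```

Labels that never occur are simply absent from the result, whereas indexing a `Counter` directly (as the cell above does) returns 0 for them.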

In [None]:
week_counts

In [None]:
def normalize(df):
    df["Normalized"] = (df["Cases"] - df["Cases"].min()) / (df["Cases"].max() - df["Cases"].min())
    return df
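
`normalize` applies min-max scaling, which maps the smallest case count to 0 and the largest to 1 so that case counts can share an axis with sentiment fractions. A pure-Python sketch on made-up numbers is given below; the name `min_max` is hypothetical, and the guard against constant input is an addition that the pandas version does not have.

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("min-max scaling is undefined for constant input")
    return [(v - low) / (high - low) for v in values]

cases = [100, 150, 300, 500]   # made-up weekly case counts
scaled = min_max(cases)
print(scaled)
```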

In [None]:
for df in [confirmed_us_weekly, recovered_us_weekly, deaths_us_weekly]:
    df = normalize(df)

In [None]:
recovered_us.tail()

In [None]:
# Keeps only the rows whose date has Tweet results in week_counts
def clean(df):
    return df[df["Date"].isin(week_counts)].copy()

In [None]:
confirmed_us_weekly = clean(confirmed_us_weekly)
recovered_us_weekly = clean(recovered_us_weekly)
deaths_us_weekly = clean(deaths_us_weekly)

In [None]:
def addTweetSentiment(df, i):
    sentiment_list = []
    
    for index, row in df.iterrows():
        if row["Date"] in week_counts:
            sentiments = week_counts[row["Date"]]
            sentiment_list.append(sentiments[i])
        
    df[str(i)] = sentiment_list
    return df

In [None]:
for i in range(1, 4, 1):
    confirmed_us_weekly = addTweetSentiment(confirmed_us_weekly, i)
    recovered_us_weekly = addTweetSentiment(recovered_us_weekly, i)
    deaths_us_weekly = addTweetSentiment(deaths_us_weekly, i)

In [None]:
confirmed_us_weekly.head()

In [None]:
plt.rcParams["figure.figsize"] = (20,3)

plt.title("Tweets and Confirmed Cases - US")
plt.xlabel("Date")
plt.ylabel("Percent")

plt.plot(confirmed_us_weekly["Date"], confirmed_us_weekly["Normalized"])
for i in range(1, 4, 1):
    plt.plot(confirmed_us_weekly["Date"], confirmed_us_weekly[str(i)])
plt.legend(["Confirmed Cases", "Positive Tweets", "Neutral Tweets", "Negative Tweets"])

plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (20,3)

plt.title("Tweets and Deaths - US")
plt.xlabel("Date")
plt.ylabel("Percent")

plt.plot(deaths_us_weekly["Date"], deaths_us_weekly["Normalized"])
for i in range(1, 4, 1):
    plt.plot(deaths_us_weekly["Date"], deaths_us_weekly[str(i)])
plt.legend(["Deaths", "Positive Tweets", "Neutral Tweets", "Negative Tweets"])

plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (20,3)

plt.title("Tweets and Recovered - US")
plt.xlabel("Date")
plt.ylabel("Percent")

plt.plot(recovered_us_weekly["Date"], recovered_us_weekly["Normalized"])
for i in range(1, 4, 1):
    plt.plot(recovered_us_weekly["Date"], recovered_us_weekly[str(i)])
plt.legend(["Recovered", "Positive Tweets", "Neutral Tweets", "Negative Tweets"])

plt.show()

In [None]:
confirmed_us_weekly = confirmed_us_weekly.rename(columns = {"1": "Positive Sentiment", "2": "Neutral Sentiment", "3": "Negative Sentiment"})
recovered_us_weekly = recovered_us_weekly.rename(columns = {"1": "Positive Sentiment", "2": "Neutral Sentiment", "3": "Negative Sentiment"})
deaths_us_weekly = deaths_us_weekly.rename(columns = {"1": "Positive Sentiment", "2": "Neutral Sentiment", "3": "Negative Sentiment"})

In [None]:
# Excludes the non-numeric Date column before computing correlations
confirmed_us_weekly.drop(columns = ["Date"]).corr()

In [None]:
# Excludes the non-numeric Date column before computing correlations
recovered_us_weekly.drop(columns = ["Date"]).corr()

In [None]:
# Excludes the non-numeric Date column before computing correlations
deaths_us_weekly.drop(columns = ["Date"]).corr()
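
By default, `DataFrame.corr()` computes the Pearson correlation coefficient between each pair of numeric columns. For reference, a minimal pure-Python sketch of that coefficient on made-up series is shown below; the function name `pearson` and both series are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

cases = [1, 2, 3, 4, 5]                            # made-up case counts
negative_share = [0.10, 0.20, 0.30, 0.40, 0.50]    # made-up sentiment shares
r = pearson(cases, negative_share)
print(round(r, 6))
```

A coefficient near 1 or -1 indicates a strong linear relationship between the two series; it does not by itself show that one causes the other.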