# Data Analysis

At this stage of the project, we have a fully working proof of concept (POC) of an application that processes streaming data in real time and uses machine learning to make online predictions. The predictions are currently stored in a MySQL database.

In particular, the application reads a stream of tweets related to Covid-19, pre-processes them, and uses state-of-the-art NLP models to predict their sentiment and topic. Finally, the results are post-processed and written to a relational database on AWS.

In this dashboard, we analyze the predictions for our tweets:

In [1]:
########### Database Connection ##################

import numpy as np
import pandas as pd
import pymysql

ENDPOINT = "database-kaka.c8wdpocz3thc.us-east-1.rds.amazonaws.com"
PASSWORD = "Bf2TiD4M4aOpbglEd9lM"
DBNAME = "databasekafka"
USR = "admin"
PORT = 3306

# Load all classified tweets, indexed and sorted by creation time
connection = pymysql.connect(host=ENDPOINT, user=USR, password=PASSWORD, port=PORT, db=DBNAME)
query = 'SELECT * FROM covid_tweets'
result = pd.read_sql(query, connection).set_index('date_creation').sort_index()
connection.close()  # close the connection

# Keep only the tweets created on or after 2021-02-09
result = result.loc["2021-02-09":]
display(f"There are {len(result)} classified tweets in the database")
result[["tweet","sentiment_prediction", "topic_prediction"]].head()

'There are 31106 classified tweets in the database'

| date_creation | tweet | sentiment_prediction | topic_prediction |
|---|---|---|---|
| 2021-02-09 13:44:47 | RT @iamcindychu: I beg my non-Asian allies and... | NEGATIVE | conspiracy |
| 2021-02-09 13:44:47 | RT @MartinAButters: How a Powerful #ERP System... | POSITIVE | commerce |
| 2021-02-09 13:44:47 | @B52Malmet @JoannBreitling This is a horrible ... | NEGATIVE | public health |
| 2021-02-09 13:44:47 | @NIAIDNews Director Anthony Fauci, discusses w... | NEGATIVE | disease |
| 2021-02-09 13:44:47 | @gatrick_liz @TUIUK Liz sadly you got more cha... | NEGATIVE | commerce |
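The connection cell above keeps the database password in the notebook itself. A minimal sketch of one alternative is to read the settings from environment variables. The variable names (`DB_ENDPOINT`, `DB_USER`, `DB_PASSWORD`, `DB_PORT`, `DB_NAME`) and the helper `load_db_settings` are hypothetical and not part of the original project:

```python
import os


def load_db_settings(env=os.environ):
    """Collect connection settings from environment variables.

    The variable names used here are illustrative assumptions.
    """
    return {
        "host": env["DB_ENDPOINT"],
        "user": env.get("DB_USER", "admin"),
        "password": env["DB_PASSWORD"],
        "port": int(env.get("DB_PORT", "3306")),
        "database": env["DB_NAME"],
    }


# Demonstration with an explicit mapping instead of the real environment
example_env = {
    "DB_ENDPOINT": "example-host.invalid",
    "DB_PASSWORD": "example-password",
    "DB_NAME": "exampledb",
}
settings = load_db_settings(example_env)
print({key: value for key, value in settings.items() if key != "password"})

# In the notebook, the settings could then be passed on as:
# connection = pymysql.connect(**settings)
```

With this approach the notebook can be shared without exposing the credentials, at the cost of having to set the variables before starting Jupyter.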


## Sentiment Analysis

We expect many more negative predictions than positive ones:

In [2]:
result.value_counts('sentiment_prediction')

sentiment_prediction
NEGATIVE    23939
POSITIVE     7167
dtype: int64
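To put these counts in relative terms, the following sketch turns them into shares. The two counts are copied from the output above; everything else is only illustrative:

```python
import pandas as pd

# Counts copied from the value_counts output above
counts = pd.Series({"NEGATIVE": 23939, "POSITIVE": 7167}, name="sentiment_prediction")

# Share of each sentiment among all classified tweets
shares = (counts / counts.sum()).round(3)
print(shares)

# On the full DataFrame, the same shares could be obtained with:
# result["sentiment_prediction"].value_counts(normalize=True)
```

Roughly 77% of the classified tweets are negative and 23% are positive.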

In [24]:
%matplotlib widget
import matplotlib.pyplot as plt  
import ipympl
import seaborn as sns

import matplotlib as mpl

plt.rcParams['axes.edgecolor']='#333F4B'
plt.rcParams['axes.linewidth']=0.8
plt.rcParams['xtick.color']='#333F4B'
plt.rcParams['ytick.color']='#333F4B'

plt.rcParams['font.family'] = 'DejaVu Sans'
plt.rcParams['font.sans-serif'] = 'DejaVu Sans'

plt.rcParams.update({'axes.spines.top': False, 'axes.spines.right': False})
sns.set_palette("Set1")

# Tweet Sentiments across Time


In [26]:
mpl.rcParams['text.color'] = 'black'
# Count the tweets per timestamp and sentiment so the y-axis matches its "# of tweets" label
result.groupby(["date_creation", "sentiment_prediction"]) \
.agg({"sentiment_score": "count"})\
.unstack().plot(xlabel="Seconds", ylabel="# of tweets")
plt.legend(["Negative tweets", "Positive Tweets"])
plt.title("Tweet's sentiment across time")

[Figure: line plot "Tweet's sentiment across time" with one line for negative and one for positive tweets; x-axis "Seconds", y-axis "# of tweets"]
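Grouping by the exact timestamp yields one point per second, which can make the plot noisy. As a possible variation, the sketch below counts tweets per sentiment in fixed time buckets. The helper `tweets_per_interval` and the toy data are hypothetical; the only assumptions about the data are a `DatetimeIndex` and a `sentiment_prediction` column, as in `result`:

```python
import pandas as pd


def tweets_per_interval(df, freq="1min"):
    """Count tweets per sentiment in fixed time buckets.

    Assumes df has a DatetimeIndex and a 'sentiment_prediction' column.
    """
    return (
        df.groupby([pd.Grouper(freq=freq), "sentiment_prediction"])
        .size()
        .unstack(fill_value=0)
    )


# Toy data standing in for the real `result` DataFrame
toy = pd.DataFrame(
    {"sentiment_prediction": ["NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]},
    index=pd.to_datetime(
        [
            "2021-02-09 13:44:47",
            "2021-02-09 13:44:59",
            "2021-02-09 13:45:10",
            "2021-02-09 13:45:30",
        ]
    ),
)
toy.index.name = "date_creation"

per_minute = tweets_per_interval(toy)
print(per_minute)
```

On the real data, `tweets_per_interval(result, freq="1h").plot()` would give one point per hour instead of one per second.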

The expectation holds: of the 31,106 tweets processed, only 7,167 (about 23%) have a positive sentiment. Let's look at the word clouds for both groups of predictions.

In [18]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt

def word_cloud_sentiment(sentiment):

    text = " ".join(tweet for tweet in result.query(f'sentiment_prediction=="{sentiment}"')["tweet"])

    stopwords = set(STOPWORDS)
    stopwords.update(["RT", "https", "t", "COVID", "coronavirus", "Covid-19", "pandemic", "co",
                     "s"])
    print(f"{sentiment} tweets: Word cloud \n")
    wordcloud = WordCloud(stopwords=stopwords, background_color="white",
                         width=2000, height= 1200).generate(text)
    plt.figure()
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

In [6]:
result.value_counts('topic_prediction')

topic_prediction
disease                8071
conspiracy             5313
politics               3387
vaccines               3361
public health          1667
comedy                 1608
death                  1490
commerce               1367
education              1208
China                  1037
mental health           885
sports events           507
the economy             455
natural environment     386
globalization           364
dtype: int64

The following function builds the word cloud for a given topic:

In [15]:
def create_word_cloud(topic):
    text = " ".join(tweet for tweet in result.query(f'topic_prediction=="{topic}"')["tweet"])

    stopwords = set(STOPWORDS)
    stopwords.update(["RT", "https", "t", "COVID", "coronavirus", "Covid-19", "pandemic", "co",
                 "s", "sound", "starting", "lot", "u"])
    wordcloud = WordCloud(stopwords=stopwords, background_color="white",
                     width=2000, height= 1200).generate(text)
    plt.figure()
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
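A word cloud scales each word by how often it occurs once the stopwords are removed. The sketch below makes that idea explicit with a plain frequency count. It is a simplification for illustration only and does not reproduce the tokenization of the `wordcloud` package; the helper `top_words` and the toy tweets are hypothetical:

```python
import re
from collections import Counter


def top_words(tweets, stopwords, n=5):
    """Return the n most frequent non-stopword tokens (case-insensitive)."""
    stop = {word.lower() for word in stopwords}
    tokens = []
    for tweet in tweets:
        # Keep alphabetic tokens only; single letters are dropped
        for token in re.findall(r"[A-Za-z']+", tweet.lower()):
            if len(token) > 1 and token not in stop:
                tokens.append(token)
    return Counter(tokens).most_common(n)


toy_tweets = [
    "RT @user: Vaccines are rolling out https://t.co/abc",
    "Vaccines work, get the vaccine",
]
top = top_words(toy_tweets, {"RT", "https", "t", "co", "the", "are"}, n=3)
print(top)
```

Printing the top words for a topic in this way is a quick check that the stopword list removes link fragments such as `https` and `co` before the cloud is drawn.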

# Sentiment and Topic Prediction

Let's analyze how the sentiment and topic predictions are related. For example, which topics are mentioned more often in positive tweets than in negative ones? Are vaccines tweeted about in a mostly positive or a mostly negative fashion?

The following tables and plots will help us answer these questions.

## Sentiments within topics

With the following table, we can analyze the sentiment within each topic. For example, what percentage of the vaccines tweets were positive? What percentage were negative? 

And so on for the different topics predicted. 

In [8]:
pivot_values = pd.crosstab(result['topic_prediction'], result['sentiment_prediction'],normalize='index') \
                 .round(2)
pivot_values['difference'] = pivot_values.eval('NEGATIVE-POSITIVE')
pivot_values.sort_values('difference', ascending = False)

| topic_prediction | NEGATIVE | POSITIVE | difference |
|---|---|---|---|
| China | 0.89 | 0.11 | 0.78 |
| death | 0.86 | 0.14 | 0.72 |
| conspiracy | 0.85 | 0.15 | 0.7 |
| disease | 0.8 | 0.2 | 0.6 |
| vaccines | 0.78 | 0.22 | 0.56 |
| the economy | 0.77 | 0.23 | 0.54 |
| politics | 0.76 | 0.24 | 0.52 |
| public health | 0.75 | 0.25 | 0.5 |
| commerce | 0.7 | 0.3 | 0.4 |
| comedy | 0.68 | 0.32 | 0.36 |


In the table above, each cell shows the share of a `sentiment_prediction` among all the predictions for the given `topic_prediction`, so every row adds up to 1. For example, of all the tweets about vaccines, 78% had a negative sentiment and 22% had a positive sentiment.

Most tweets about death (86%) had a negative sentiment, whereas the balance is most even for globalization.
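To make the row-wise normalization concrete, the sketch below reproduces `normalize='index'` by hand on a toy DataFrame: each count is divided by the total of its row. The toy data is invented for illustration and is not taken from the database:

```python
import pandas as pd

# Toy predictions: 4 tweets about vaccines, 2 about death
toy = pd.DataFrame(
    {
        "topic_prediction": ["vaccines"] * 4 + ["death"] * 2,
        "sentiment_prediction": [
            "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE",
            "NEGATIVE", "POSITIVE",
        ],
    }
)

# Raw counts per (topic, sentiment) combination
counts = pd.crosstab(toy["topic_prediction"], toy["sentiment_prediction"])

# Divide every row by its own total: shares of each sentiment within a topic
by_hand = counts.div(counts.sum(axis=1), axis=0)

# The same result using the built-in option from the cell above
built_in = pd.crosstab(
    toy["topic_prediction"], toy["sentiment_prediction"], normalize="index"
)
print(by_hand)
```

In the toy data, 3 of the 4 vaccine tweets are negative, so the `vaccines` row reads 0.75 and 0.25.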

In [9]:
pivot_values = pd.crosstab(result['topic_prediction'], result['sentiment_prediction'],normalize='columns') \
                 .round(2)
pivot_values['difference'] = pivot_values.eval('NEGATIVE-POSITIVE')
pivot_values.sort_values('difference', ascending = False)

| topic_prediction | NEGATIVE | POSITIVE | difference |
|---|---|---|---|
| conspiracy | 0.19 | 0.11 | 0.08 |
| disease | 0.27 | 0.23 | 0.04 |
| death | 0.05 | 0.03 | 0.02 |
| China | 0.04 | 0.02 | 0.02 |
| vaccines | 0.11 | 0.1 | 0.01 |
| politics | 0.11 | 0.11 | 0.0 |
| the economy | 0.01 | 0.01 | 0.0 |
| public health | 0.05 | 0.06 | -0.01 |
| globalization | 0.01 | 0.02 | -0.01 |
| commerce | 0.04 | 0.06 | -0.02 |

## Topics across Sentiments

The table above is normalized by column, so it shows which topics were the most popular within each sentiment. For example, which was the most popular topic among the positive tweets? And among the negative tweets? In both cases it is disease, with 27% of the negative tweets and 23% of the positive ones.

The death topic accounts for 5% of the negative tweets but only 3% of the positive ones. Conversely, globalization accounts for 2% of the positive tweets but only 1% of the negative ones.
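The most popular topic per sentiment can also be read off programmatically. The following sketch normalizes a toy crosstab by column and picks the largest share in each column with `idxmax`; the toy data is invented for illustration:

```python
import pandas as pd

# Toy predictions: 4 negative tweets and 3 positive tweets
toy = pd.DataFrame(
    {
        "topic_prediction": [
            "disease", "disease", "disease", "death",
            "commerce", "commerce", "disease",
        ],
        "sentiment_prediction": ["NEGATIVE"] * 4 + ["POSITIVE"] * 3,
    }
)

# Share of each topic within a sentiment: every column adds up to 1
by_sentiment = pd.crosstab(
    toy["topic_prediction"], toy["sentiment_prediction"], normalize="columns"
)

# Topic with the largest share in each sentiment column
top_topic = by_sentiment.idxmax()
print(by_sentiment.round(2))
print(top_topic)
```

Applied to the crosstab computed from `result`, the same `idxmax` call would return the dominant topic for the negative and for the positive tweets.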

# Creating widgets out of the analysis

In [10]:
import ipywidgets as widgets

## Word Clouds: Different Topics

With the following widget, you can see the word cloud for the tweets assigned to each of the 15 topics by the Zero Shot Learning models from Hugging Face.

In [11]:
result.value_counts('topic_prediction').index

Index(['disease', 'conspiracy', 'politics', 'vaccines', 'public health',
       'comedy', 'death', 'commerce', 'education', 'China', 'mental health',
       'sports events', 'the economy', 'natural environment', 'globalization'],
      dtype='object', name='topic_prediction')

In [21]:
widgets.interact(create_word_cloud, topic = result.value_counts('topic_prediction').index)

[Interactive widget: dropdown `topic` that displays the word cloud for the selected topic]

## Word Clouds: Different Sentiments

With the following widget, you can see the word cloud for the tweets assigned to each sentiment (NEGATIVE or POSITIVE) by the Zero Shot Learning model from Hugging Face.

In [22]:
widgets.interact(word_cloud_sentiment, sentiment=["NEGATIVE", "POSITIVE"])

[Interactive widget: dropdown `sentiment` (NEGATIVE or POSITIVE) that displays the word cloud for the selected sentiment]

# Topics and Sentiments: The overall picture

In [31]:
labels = {}
for topic in result.value_counts('topic_prediction').index:
    labels[('NEGATIVE', topic)] = topic
    labels[('POSITIVE', topic)] = topic
labellizer = lambda k: labels[k]

In [33]:
import matplotlib as mpl
sns.set_palette("Set1")
mpl.rcParams['text.color'] = 'white'
from statsmodels.graphics.mosaicplot import mosaic
with mpl.rc_context():
    mpl.rc("figure", figsize=(10, 15))
    mosaic(result, ['sentiment_prediction', 'topic_prediction'], 
          title = "Sentiment vs Topics",
          labelizer=labellizer)

[Figure: mosaic plot "Sentiment vs Topics", split first by `sentiment_prediction` and then by `topic_prediction`]
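In a mosaic plot, the area of each tile is proportional to the number of observations in that combination of categories. As a rough numeric counterpart, the sketch below computes the joint shares that those areas correspond to, using `normalize="all"` on invented toy data:

```python
import pandas as pd

# Toy predictions standing in for the real `result` DataFrame
toy = pd.DataFrame(
    {
        "sentiment_prediction": ["NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"],
        "topic_prediction": ["disease", "disease", "death", "disease"],
    }
)

# Share of every (sentiment, topic) pair among all tweets: the table adds up to 1
joint = pd.crosstab(
    toy["sentiment_prediction"], toy["topic_prediction"], normalize="all"
)

# Overall share of each sentiment, i.e. the size of the first split in the mosaic
sentiment_share = joint.sum(axis=1)
print(joint)
print(sentiment_share)
```

Reading the joint table next to the mosaic helps when tiles for rare combinations are too small to compare by eye.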