# Sentiment Analysis using VADER on Twitter Data

Using the processed data saved by the Get Tweets notebook, this notebook applies the VADER sentiment analyzer to compute a
compound sentiment score for each tweet. The compound score is added to the tweet JSON files, which are then stored in the
Data/Analyzed Tweets folder.

The mean sentiment is also stored as a json file in the Data folder.

In [3]:
import pandas as pd
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
import nltk
from tqdm.notebook import tqdm
from collections import defaultdict
import json

nltk.download('vader_lexicon')

languages = {
                1: 'en',
                2: 'es',
                3: 'fr',
                4: 'de',
                5: 'nl',
                6: 'it',
            }

months = ['December', 'January', 'February', 'March', 'April', 'May']

[nltk_data] Downloading package vader_lexicon to C:\Users\Aiden
[nltk_data]     Williams\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


The cell below loops over each tweet text file and computes the compound sentiment score of every tweet. The score is added
to the dataframe as a new column before the dataframe is saved in the Data/Analyzed Tweets folder.

In [4]:
sia = SIA()  # build the analyzer once; re-creating it per tweet reloads the lexicon
Language = defaultdict(list)
for month in tqdm(months):
    for day in range(5):
        for language in languages:
            name = str(month) + str(day) + languages[language] + '.json'
            tweetsP = pd.read_json('Data/Text/' + name).T
            # Score each tweet in row order; non-string entries get NaN so that
            # every score stays aligned with its own row.
            tweetsP['Score'] = [
                sia.polarity_scores(text)['compound'] if isinstance(text, str) else np.nan
                for text in tweetsP['text']
            ]
            # Mean compound score for this language/month/day file.
            Language[language].append(np.average(tweetsP['Score']))
            tweetsP.to_json('Data/Analyzed Tweets/' + name)


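`np.average` returns NaN when a file's Score column contains a missing value (for example, from a non-string tweet) or is empty. Any NaN mean is checked for here and replaced with 0, which is equivalent to a true neutral score.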

In [None]:
for l in Language:
    # Replace NaN means with 0 (neutral); keep every other mean unchanged.
    Language[l] = [0 if np.isnan(mean) else mean for mean in Language[l]]
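
As a hedged alternative to the explicit check above, NumPy's `np.nan_to_num` can do the same replacement in one call. The sketch below uses invented sample means, not values from the dataset. Note that `np.nan_to_num` also replaces infinities with large finite numbers, which the loop above does not do.

```python
import numpy as np

# Invented daily means; the middle one is NaN because its inputs contain a NaN.
daily_means = [0.12, np.average([0.4, np.nan]), -0.05]

# Replace NaN with 0.0 (neutral) and convert back to a plain list for JSON.
cleaned = np.nan_to_num(daily_means, nan=0.0).tolist()
print(cleaned)  # [0.12, 0.0, -0.05]
```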

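The mean sentiment is stored in this format. Each `day` list holds the 30 per-file means for one language in loop order
(6 months x 5 sampled days). Note that the `month` field holds the language key from `languages`, not a month:

{

0 : {'month': 1, 'day': [Mean 0 ... Mean 29]}
.
.
.
5 : {'month': 6, 'day': [Mean 0 ... Mean 29]}

}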

In [None]:
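to_save = {}
# NOTE: Language is keyed by language number, so the 'month' field below holds
# the language key and 'day' holds that language's list of mean scores.
for i, language in enumerate(Language):
    to_save[i] = {'month': language, 'day': Language[language]}
with open('Data/MeanSentiment.json', 'w') as f:
    json.dump(to_save, f)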
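
Because the scoring loop appends one mean per month and day in order, each language's list is flat: the five sampled days of December come first, then January, and so on. The sketch below regroups such a flat list by month, assuming that ordering holds. The helper `by_month` and the sample values are hypothetical and only illustrate the layout.

```python
import json

MONTHS = ['December', 'January', 'February', 'March', 'April', 'May']
DAYS_PER_MONTH = 5  # the scoring loop visits five files per month


def by_month(flat_means):
    """Split a flat list of 30 means into {month name: [5 means]}."""
    return {
        month: flat_means[k * DAYS_PER_MONTH:(k + 1) * DAYS_PER_MONTH]
        for k, month in enumerate(MONTHS)
    }


# Hypothetical stand-in for one language entry of Data/MeanSentiment.json.
entry = json.loads(json.dumps({'month': 1, 'day': [i / 100 for i in range(30)]}))
grouped = by_month(entry['day'])
print(grouped['December'])
```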

# Sentiment Score using VADER on the Article Headings Dataset

Another dataset gathered is the article dataset. For each sampled day of the month (1, 7, 14, 21, 28), 5 articles were
collected for the European home country of each language, i.e. English = UK, French = France.

This dataset was collected manually with the Google search engine, using the following query:


```{Country} Covid* before:{Date in YYYY-MM-DD Format} After:{Day before Date in YYYY-MM-DD Format} ```
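
The query template above can be filled in programmatically. This is a minimal sketch, not the procedure actually used (the collection was manual): `build_query` is a hypothetical helper, and the example date is invented.

```python
from datetime import date, timedelta


def build_query(country: str, day: date) -> str:
    """Fill the search template for one country and one target date."""
    previous = day - timedelta(days=1)  # the day before the target date
    return f"{country} Covid* before:{day.isoformat()} After:{previous.isoformat()}"


# Invented example date, shown only to illustrate the YYYY-MM-DD formatting.
print(build_query("France", date(2021, 1, 7)))
# France Covid* before:2021-01-07 After:2021-01-06
```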

The Article Headings dataset consists of 5 CSV files, each with 30 dates corresponding to the Twitter dataset dates. Each
date has 5 article headings attached to it in separate columns, so the code loops over the 5 columns of every row to get
the sentiment score of each heading. The mean score per date is collected at this stage as well.
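
The wide layout described above (one row per date, five heading columns) can also be flattened to one row per heading with `DataFrame.melt`. The sketch below assumes the column names `Date` and `Article 1` to `Article 5` that this notebook's loading code uses. The sample values and the `Slot` column name are invented for illustration.

```python
import pandas as pd

# Invented two-date sample in the wide layout: one row per date, five heading columns.
wide = pd.DataFrame({
    'Date': ['D1', 'D2'],
    **{f'Article {j}': [f'heading {j} for D1', f'heading {j} for D2'] for j in range(1, 6)},
})

# One row per (date, heading); 'Slot' records which column the heading came from.
long = wide.melt(id_vars='Date', var_name='Slot', value_name='Article')
print(long.shape)  # (10, 3)
```

A long table like this lets a scoring function be applied to the `Article` column in a single pass instead of a nested loop.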

In [13]:
import os

root = 'Data/Articles'
if not os.access(root, os.R_OK):
    print("Check dataroot!!")

sia = SIA()  # build the analyzer once and reuse it
mean_score = defaultdict(list)
for country in tqdm(os.listdir(root)):
    articles = pd.read_csv(os.path.join(root, country))
    name = country[:-4]  # file name without the '.csv' extension
    rows = []  # one record per article heading

    for i in range(articles.shape[0]):
        _articles = []  # compound scores of the 5 headings for this date
        for j in range(1, 6):
            heading = articles['Article ' + str(j)][i]
            _articles.append(sia.polarity_scores(heading)['compound'])
            rows.append({'Date': articles['Date'][i], 'Article': heading, 'Score': _articles[-1]})
        mean_score[name].append(np.average(_articles))

    # DataFrame.append was removed in pandas 2.0, so build the frame in one step.
    new_dataframe = pd.DataFrame(rows, columns=['Date', 'Article', 'Score'])
    new_dataframe.to_json('Data/Analyzed Articles/Article Score ' + name + '.json')


Check for NaN values, then save the mean scores as JSON.

In [None]:
for c in mean_score:
    # Replace NaN means with 0 (neutral); keep every other mean unchanged.
    mean_score[c] = [0 if np.isnan(mean) else mean for mean in mean_score[c]]

# mean_score is already keyed by country name (file name without '.csv').
to_save = {}
for country in mean_score:
    to_save[country] = {'country': country, 'day': mean_score[country]}
with open('Data/ArticleMeanSentiment.json', 'w') as f:
    json.dump(to_save, f)
