# Sentiment Algorithm <p>
This is the second Jupyter notebook in the workflow. It covers the preparation and filtering of the harvested tweets, their translation into English, and the sentiment analysis itself.

<b> IMPORTING THE PACKAGES </b> - Several modules are used in this notebook. Alongside common libraries such as pandas and NumPy, we use NLTK (Natural Language Toolkit), a library for working with human language data.
Furthermore, we use the deep_translator library, which provides access to a range of translation APIs.

In [1]:
#import modules 
import pandas as pd
import re
import numpy as np
import nltk
#VADER needs its lexicon once per environment: nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
from deep_translator import GoogleTranslator


<b> OPENING CSV WITH RAW TWEETS </b> - The first step of the sentiment analysis is to load the dataset containing the raw tweets retrieved from the Twitter API into a pandas dataframe (DF). The dataframe is stored in a new variable called "tweets".

In [8]:
#url = 'https://github.com/zyankarli/Data-Science-Course/blob/master/tweets.csv'
tweets = pd.read_csv('https://raw.githubusercontent.com/zyankarli/Data-Science-Course/master/tweets_historic_full_text.csv?token=AP2IGIV7XDJVD2FDWPURZATAPFWVW')

<b> DEFINITION OF FUNCTIONS </b> - Within this cell, all the functions used in this notebook are defined. <p>
DATA_CLEANING - This function wraps the loaded data in a pandas dataframe and removes redundant columns. Privacy is taken into account here by deleting information related to the Twitter user, such as the user name, the account creation date and the follower count. Furthermore, duplicate tweets are removed based on 'id_str', the unique identification code of each tweet. Because keep=False is used, every row whose 'id_str' occurs more than once is dropped, not only the repeated copies. <p>
DATA_PER_CITY - This function filters the tweets per city. At this point, the dataframe contains tweets from the entire country. The 'loc' indexer is used to select the tweets whose 'user_location' equals the 'city' parameter exactly. The parameter is filled in manually and corresponds to one of the four cities we are investigating. The subset is appended to a new csv file and returned as 'tweets_city'. <p>
FIND_CITY - This function is a helper for ADD_CITY_COLUMN, described next. Using regular expressions from the re package, it searches each bio-location (the location manually set by the user) for 'amsterdam', 'den haag', 'rotterdam' or 'utrecht', in that order, and returns the capitalized name of the first city found. The patterns are lower-case and the search is case-sensitive. If none of the cities is found, NaN is returned. <p>
ADD_CITY_COLUMN - This function applies FIND_CITY to the 'user_location' of every tweet and stores the result in a new column called 'city', i.e. the city named in the user's bio-location. Rows containing missing values, including tweets for which none of the four cities was found, are then dropped. The 'city' column is used later on to filter tweets by city for the sentiment algorithm. <p>
ADD_DAY_COLUMN - This function converts the dataframe column 'created' from a string into a date-time object, which is needed to filter on it in the later analysis. It also adds a 'date' column that contains only the calendar date. The cleaned and filtered tweets are saved to a csv file in the EXPORT step further below. <p>
TRANSLATOR - Translates a given text from any language into English. If the text is not longer than one character, a NaN value is returned.
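
To illustrate the duplicate handling described for DATA_CLEANING, the following minimal sketch compares keep=False with keep='first' on a toy dataframe. The 'id_str' and 'text' values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with a hypothetical duplicated id_str (values are illustrative only)
toy = pd.DataFrame({
    'id_str': [1, 1, 2, 3],
    'text': ['a', 'a', 'b', 'c'],
})

# keep=False drops every row whose id_str occurs more than once
dropped_all = toy.drop_duplicates(subset=['id_str'], keep=False)

# keep='first' retains one copy of each duplicated tweet
kept_first = toy.drop_duplicates(subset=['id_str'], keep='first')

print(dropped_all['id_str'].tolist())  # [2, 3]
print(kept_first['id_str'].tolist())   # [1, 2, 3]
```

Which variant is preferable depends on whether a tweet that was harvested twice should still count once in the analysis.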

In [13]:
#defining functions
def data_cleaning(tweets):
    tweets_df = pd.DataFrame(tweets)
    #remove redundant and user-related columns
    tweets_df.drop(['coordinates','geo', 'user_created', 'user_name', 'user_followers', 'user_bg_color'], axis = 1, inplace = True)
    #keep = False drops every tweet whose id_str occurs more than once
    tweets_df.drop_duplicates(subset = ['id_str'], keep = False, inplace = True)
    return tweets_df

#not called in this notebook; filters on an exact match of 'user_location'
def data_per_city(tweets_df, city):
    tweets_city = tweets_df.loc[tweets_df['user_location'] == city]
    tweets_city.to_csv(r'~/Desktop/SmartEnvironments_Code/tweets_city1.csv', mode = 'a', index = False, header = True)
    return tweets_city

def find_city(string):
    #return the first of the four cities found in the location string (case-sensitive)
    location = str(string)
    if re.search(r'amsterdam', location):
        return 'Amsterdam'
    elif re.search(r'den haag', location):
        return 'Den Haag'
    elif re.search(r'rotterdam', location):
        return 'Rotterdam'
    elif re.search(r'utrecht', location):
        return 'Utrecht'
    else:
        return np.nan

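def add_city_column(df):
    #work on a copy so the input dataframe is not modified
    clean_df = df.copy()
    clean_df['city'] = clean_df['user_location'].apply(find_city)
    #drops rows with a missing value in any column, including tweets without a matching city
    clean_df = clean_df.dropna()
    return clean_df

def add_day_column(df):
    #work on a copy so the input dataframe is not modified
    clean_df = df.copy()
    clean_df['created'] = pd.to_datetime(clean_df['created'])
    #create new column that only incorporates date
    clean_df['date'] = clean_df['created'].dt.date
    return clean_df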

def translator(text):
    #only translate real strings with more than one character
    if isinstance(text, str) and len(text) > 1:
        return GoogleTranslator(source='auto', target='en').translate(text)
    else:
        return np.nan
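
The matching behaviour of find_city can be checked on a few location strings. The sketch below is a table-driven variant with the same patterns and the same order of precedence; the name find_city_sketch and the CITY_PATTERNS list are our own illustration, and the first four sample locations are taken from the output shown further below.

```python
import re
import numpy as np

# Same patterns and precedence as find_city, written as a lookup table
CITY_PATTERNS = [
    (r'amsterdam', 'Amsterdam'),
    (r'den haag', 'Den Haag'),
    (r'rotterdam', 'Rotterdam'),
    (r'utrecht', 'Utrecht'),
]

def find_city_sketch(location):
    """Return the first city whose lower-case pattern occurs in the location."""
    text = str(location)
    for pattern, city in CITY_PATTERNS:
        if re.search(pattern, text):
            return city
    return np.nan

examples = [
    'amsterdam, nederland',
    "'s-gravenhage & utrecht",
    'rotterdam blijdorp, nederland',
    'den haag, nederland',
    'Amsterdam',  # capitalized: not matched, because the search is case-sensitive
]
results = {loc: find_city_sketch(loc) for loc in examples}
for loc, city in results.items():
    print(repr(loc), '->', city)
```

Two consequences are worth noting: a location written as "'s-gravenhage" is not recognized as Den Haag, and the approach relies on the locations being lower-case, as they are in the rows displayed in this notebook.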

<b> CALLING ALL FUNCTIONS </b> - The following code block simply calls the functions created above. The result is a dataframe that is cleaned, filtered and contains a city and a date label for every tweet.

In [15]:
df = data_cleaning(tweets)
df = add_city_column(df)
df = add_day_column(df)
df

Unnamed: 0,id,user_description,user_location,text,id_str,created,retweet_count,city,date
393,394,"Open, eerlijk , transparant","amsterdam, nederland",vanaf 15 mei quarantaineplicht bij aankomst ui...,1382689775170289666,2021-04-15 13:38:26,0,Amsterdam,2021-04-15
792,793,Observing this #plutocracy. Gezondheid lijkt v...,"van galenbuurt, amsterdam",hier heb ik een half jaar moeten wachten op ee...,1382679111039123456,2021-04-15 12:56:03,0,Amsterdam,2021-04-15
796,797,• vrouw • liberaal •,'s-gravenhage & utrecht,voor een zelftest naar een particuliere testst...,1382678924954591236,2021-04-15 12:55:19,0,Utrecht,2021-04-15
802,803,Kan niet tegen onrecht Probeer de waarheid te ...,"utrecht, nederland","er snel nog even een miljard doorheen jassen, ...",1382678534292901888,2021-04-15 12:53:46,0,Utrecht,2021-04-15
803,804,https://t.co/PKtZwEPWDf\nCouchsurfer\nPolarste...,"rotterdam blijdorp, nederland",die afname in 2020 komt toch 'gewoon' door cor...,1382678530929070080,2021-04-15 12:53:45,0,Rotterdam,2021-04-15
...,...,...,...,...,...,...,...,...,...
57385,57386,Wij brengen het nieuws uit uw regio in beeld,"den haag, nederland",één op de tien 75-plussers is eenzaam. de acti...,1380196413875875848,2021-04-08 16:30:42,1,Den Haag,2021-04-08
57397,57398,🇳🇱 Dutch | Vader | Feyenoord | Stemt rechts | ...,"rotterdam, nederland",🤡🖕🏼 #staatspropaganda 🤡🖕🏼 er zijn nauwelijks b...,1380196054344273920,2021-04-08 16:29:16,0,Rotterdam,2021-04-08
57398,57399,🇳🇱 Dutch | Vader | Feyenoord | Stemt rechts | ...,"rotterdam, nederland",🤡🖕🏼 #staatspropaganda 🤡🖕🏼 er zijn nauwelijks b...,1380195939839848451,2021-04-08 16:28:49,0,Rotterdam,2021-04-08
57401,57402,🇳🇱 Dutch | Vader | Feyenoord | Stemt rechts | ...,"rotterdam, nederland",🤡🖕🏼 #staatspropaganda 🤡🖕🏼 er zijn nauwelijks b...,1380195795895529478,2021-04-08 16:28:15,0,Rotterdam,2021-04-08


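<b> TRANSLATION </b> - Translate all tweets in the dataframe. This takes a long time (about two hours for 4800 tweets). To prevent you from running this cell unintentionally, the code is kept as a comment. Note that the sentiment analysis below reads the 'text_eng' column, so the line has to be uncommented and run once before continuing.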

In [16]:
#might take some time!
#df['text_eng']=df['text'].apply(translator)
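
If the dataframe contains repeated texts, one way to shorten the translation step could be to translate every distinct text only once. The following is a minimal sketch of that idea, not the approach used in this notebook. The helper translate_unique is hypothetical, and fake_translate is a stand-in so that the example runs without calling a translation API.

```python
import pandas as pd

def translate_unique(texts, translate_fn):
    """Translate each distinct text once and map the results back onto the Series."""
    unique_texts = texts.dropna().unique()
    lookup = {t: translate_fn(t) for t in unique_texts}
    # Texts without an entry in the lookup (e.g. NaN) stay NaN
    return texts.map(lookup)

# Stand-in translator that records how often it is called
calls = []
def fake_translate(text):
    calls.append(text)
    return text.upper()

sample = pd.Series(['goedemorgen', 'goedemorgen', 'dank je'])
translated = translate_unique(sample, fake_translate)

print(translated.tolist())
print(len(calls))  # 2 calls for 3 rows
```

With the function defined above, the call would read df['text_eng'] = translate_unique(df['text'], translator). How much time this saves depends entirely on how many texts are repeated.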

<b>APPLYING SENTIMENT ANALYSIS </b> - Here an instance of the 'SentimentIntensityAnalyzer' of the 'NLTK' package is created and assigned to the variable 'vader', which is then used to execute the sentiment analysis. Within this cell, five new columns are added to the dataframe. The first three columns store the values that the polarity_scores function returns for each translated tweet text in the three categories negative ('neg'), neutral ('neu') and positive ('pos'); each value is the proportion of the text that falls into that category. The fourth column stores the compound score, a normalized sum of the valence scores of the individual words that ranges from -1 (very negative) to +1 (very positive). Finally, in the 'sentiment' column, the compound score is translated into one of the three sentiment categories. Commonly, texts with a compound score of 0.05 or higher are understood to express a positive sentiment, so such tweets are labelled as positive. Tweets with a compound score of -0.05 or lower are labelled as negative. For every compound score in between, the tweet is considered neutral. Each column is defined with the apply function, which executes a specified function on every cell of a column. In our case, we pass lambda expressions, i.e. single-line anonymous functions, to apply.

In [17]:
vader = SentimentIntensityAnalyzer()
#score every tweet once, then store each score in its own column
scores = df['text_eng'].apply(vader.polarity_scores)
df['neg'] = scores.apply(lambda x: x['neg'])
df['neu'] = scores.apply(lambda x: x['neu'])
df['pos'] = scores.apply(lambda x: x['pos'])
df['compound'] = scores.apply(lambda x: x['compound'])

#derive the sentiment label of each tweet from its compound score
#positive sentiment : (compound score >= 0.05)
#neutral sentiment : (compound score > -0.05) and (compound score < 0.05)
#negative sentiment : (compound score <= -0.05)
#source: https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/
df['sentiment'] = df['compound'].apply(lambda i: 'positive' if i >= 0.05 else ('negative' if i <= -0.05 else 'neutral'))
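
The thresholding in the last line can also be written as a small named function, which makes the boundary cases easier to check. This is a sketch only; the name label_sentiment and the threshold parameter are our own additions, and the first three compound scores are taken from the output shown below.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative' or 'neutral'."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Three scores from the output table plus the boundary cases
scores = [-0.3612, -0.1779, 0.0, 0.05, -0.05, 0.049]
labels = [label_sentiment(c) for c in scores]
print(labels)
# ['negative', 'negative', 'neutral', 'positive', 'negative', 'neutral']
```

It would be applied in the same way as the lambda expression, i.e. df['compound'].apply(label_sentiment).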

<b> CHECK THE OUTPUT  </b> - Here the output of the sentiment analysis is checked by showing the first rows of the dataframe, which now includes the translated text, the four scores and the sentiment label.

In [21]:
df.head()

Unnamed: 0,id,user_description,user_location,text,id_str,created,retweet_count,city,date,text_eng,neg,neu,pos,compound,sentiment
393,394,"Open, eerlijk , transparant","amsterdam, nederland",vanaf 15 mei quarantaineplicht bij aankomst ui...,1382689775170289666,2021-04-15 13:38:26,0,Amsterdam,2021-04-15,"from 15 May, quarantine obligation upon arriva...",0.115,0.816,0.069,-0.3612,negative
792,793,Observing this #plutocracy. Gezondheid lijkt v...,"van galenbuurt, amsterdam",hier heb ik een half jaar moeten wachten op ee...,1382679111039123456,2021-04-15 12:56:03,0,Amsterdam,2021-04-15,here I had to wait six months for a new indica...,0.141,0.772,0.087,-0.1779,negative
796,797,• vrouw • liberaal •,'s-gravenhage & utrecht,voor een zelftest naar een particuliere testst...,1382678924954591236,2021-04-15 12:55:19,0,Utrecht,2021-04-15,for a self-test to a private test lane? #coron...,0.0,1.0,0.0,0.0,neutral
802,803,Kan niet tegen onrecht Probeer de waarheid te ...,"utrecht, nederland","er snel nog even een miljard doorheen jassen, ...",1382678534292901888,2021-04-15 12:53:46,0,Utrecht,2021-04-15,"a billion more quickly, for which poor Netherl...",0.331,0.669,0.0,-0.8849,negative
803,804,https://t.co/PKtZwEPWDf\nCouchsurfer\nPolarste...,"rotterdam blijdorp, nederland",die afname in 2020 komt toch 'gewoon' door cor...,1382678530929070080,2021-04-15 12:53:45,0,Rotterdam,2021-04-15,"that decrease in 2020 is 'just' due to corona,...",0.0,1.0,0.0,0.0,neutral
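
Beyond looking at the first rows, a quick cross-tabulation of city and sentiment label gives a first impression of the result. The sketch below rebuilds only the 'city' and 'sentiment' columns of the five rows displayed above; on the full dataframe the same call would be pd.crosstab(df['city'], df['sentiment']).

```python
import pandas as pd

# The 'city' and 'sentiment' values of the five rows shown by df.head()
preview = pd.DataFrame({
    'city': ['Amsterdam', 'Amsterdam', 'Utrecht', 'Utrecht', 'Rotterdam'],
    'sentiment': ['negative', 'negative', 'neutral', 'negative', 'neutral'],
})

# Count how many tweets per city carry each sentiment label
counts = pd.crosstab(preview['city'], preview['sentiment'])
print(counts)
```

Five rows are of course far too few to say anything about the cities; the example only shows the shape of the summary.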


<b> EXPORT </b> - Save the filtered and translated DF into a csv file. To prevent you from running this cell unintentionally, the code is kept as a comment.

In [19]:
#df.to_csv('Filtered_Tweets.csv', mode = 'a', index = False, header = True)
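
Because the export uses mode = 'a' together with header = True, running the cell a second time appends another header row and another copy of the data to the same file. If appending is intended, one option is to write the header only when the file does not exist yet. The helper append_csv below is a hypothetical sketch of this, demonstrated on a temporary file with made-up rows.

```python
import os
import tempfile
import pandas as pd

def append_csv(df, path):
    """Append df to path, writing the header only if the file does not exist yet."""
    write_header = not os.path.exists(path)
    df.to_csv(path, mode='a', index=False, header=write_header)

# Illustrative one-row frame, written twice to a temporary file
demo = pd.DataFrame({'city': ['Utrecht'], 'sentiment': ['neutral']})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Filtered_Tweets.csv')
    append_csv(demo, path)
    append_csv(demo, path)
    with open(path) as fh:
        lines = fh.read().splitlines()

print(lines)
# ['city,sentiment', 'Utrecht,neutral', 'Utrecht,neutral']
```

If the file should instead be replaced on every run, the default mode = 'w' avoids the problem altogether.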

# Discussion / Conclusion
Two crucial decisions are made in this notebook. <p>
The first decision concerns the type of sentiment analysis conducted. Several alternatives for conducting sentiment analyses in Python are available, among others the Google Natural Language API or a self-trained Bayesian machine learning algorithm. However, we decided to use the VADER Sentiment Analyzer for three reasons: scientific recognition, code transparency and simplicity.
Firstly, Elbagir & Yang (2019) conclude in their paper that the VADER Sentiment Analyzer is ‘an effective choice for sentiment analysis classification using Twitter data’ (Elbagir & Yang, 2019, p. 5).
Furthermore, the algorithm is able to take into account hashtags, punctuation marks as well as capital letters. This reassured us that the algorithm is capable of delivering high-quality results.
Secondly, the VADER Sentiment Analyzer is less of a black box than the alternatives would be. Its source code is freely accessible under the following link: https://www.nltk.org/_modules/nltk/sentiment/vader.html. Finally, the VADER Sentiment Analyzer proved to be easy to implement and promised to yield results that are relatively easy to interpret. Moreover, the algorithm can handle large datasets without problems.

The second decision concerns whether to translate the Dutch tweets or not. Translating Dutch tweets into English, which is required for the use of the VADER Sentiment Analyzer, clearly risks distorting their sentiment. The alternative would have been to translate the lexicon underlying the VADER Sentiment Analyzer, i.e. to translate the algorithm into Dutch. However, Mohammad et al. (2016) show in their study that both approaches deliver similar results. Therefore, we decided to translate all relevant tweets using Google Translate via the Python deep_translator package.

References: <p>
Elbagir, S., & Yang, J. (2019). Proceedings of the International MultiConference of Engineers and Computer Scientists 2019 (IMECS 2019), March 13-15, Hong Kong. <p>
Mohammad, S., Salameh, M., & Kiritchenko, S. (2016). How Translation Alters Sentiment. Journal of Artificial Intelligence Research (JAIR), 55, 95-130. doi:10.1613/jair.4787.