# Sentiment Analysis of Climate Change Tweets

[Insert intro and explanation to the project]

In [13]:
# Import libraries
import pandas as pd
import re

## Data
For this project I am going to use two sources of data. 

The first one is a dataset from [kaggle.com](https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset) containing 43,943 tweets posted between Apr 27, 2015 and Feb 21, 2018. For each tweet, the dataset reports the full text and a sentiment label (more on this below). I plan to use this dataset to train a machine learning classifier that predicts tweet sentiment accurately out of sample. Unfortunately, I did not manage to get in touch with the creator of the dataset to learn which criteria were used to select the tweets. This information would help me extract additional, comparable tweets for further analysis.

The second source of data comes from Twitter. I use Twitter's API to extract new tweets following criteria similar to those used to build the training dataset ([if I manage to reach the dataset's creator; otherwise: arbitrary criteria; maybe I'll check whether the dataset relies on a specific collection of hashtags]). I plan to use my model to classify these tweets. (Further info on this once the tweets are actually extracted; it would be interesting to take tweets from after 2018.)

In this section I extract and pre-process both of these datasets.

### Training Dataset
Here I load the training dataset from Kaggle and take a peek at what it contains.

In [14]:
df_raw = pd.read_csv("data/twitter_sentiment_data.csv")
df_raw.head()

Unnamed: 0,sentiment,message,tweetid
0,-1,@tiniebeany climate change is an interesting h...,792927353886371840
1,1,RT @NatGeoChannel: Watch #BeforeTheFlood right...,793124211518832641
2,1,Fabulous! Leonardo #DiCaprio's film on #climat...,793124402388832256
3,1,RT @Mick_Fanning: Just watched this amazing do...,793124635873275904
4,2,"RT @cnalive: Pranita Biswasi, a Lutheran from ...",793125156185137153


Here I take a small detour to look at the most common hashtags in the dataframe. I do this to consider whether keeping hashtags in my analysis could be an advantage, and to get an idea of which hashtags to use for the API search later.
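The `pp.find_hashtags` helper used in the next cell lives in a separate module that is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch; the name `find_hashtags_sketch`, the regular expression, and the sample tweet are assumptions made for illustration, not the actual implementation.

```python
import re

# Hypothetical stand-in for pp.find_hashtags:
# a "#" followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r"#\w+")


def find_hashtags_sketch(text: str) -> list[str]:
    """Return every hashtag in a tweet, in order of appearance."""
    return HASHTAG_PATTERN.findall(text)


# Made-up sample tweet, used only to show the behaviour
sample = "Watching #BeforeTheFlood tonight, #climate matters"
print(find_hashtags_sketch(sample))
```

Because the function returns one list per tweet, applying it to the `message` column yields a column of lists, which is why the next cell flattens the result before counting.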

In [15]:
import regex as re
import pre_processing_functions as pp

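# Collect the hashtags found in each tweet (one list per tweet)
temp = df_raw['message'].apply(pp.find_hashtags)
hashtag_list = temp.to_list()
# Flatten the per-tweet lists into a single-column dataframe
hashtags_df = pd.DataFrame([item for sublist in hashtag_list for item in sublist])
hashtags_df.columns = ['hashtags']
hashtags_df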


Unnamed: 0,hashtags
0,#BeforeTheFlood
1,#DiCaprio
2,#climate
3,#DonaldTrump
4,#monaco
...,...
12056,#FactsAreTruth
12057,#ImVoting4JillBecause
12058,#Awareness
12059,#AikBaatSuniThi


The cells below show the 20 most used hashtags in this dataset, first as counts and then as a chart.

In [16]:
temp = hashtags_df["hashtags"].value_counts()[:20]
temp

#climate            890
#climatechange      461
#BeforeTheFlood     248
#ClimateChange      156
#ActOnClimate       146
#Trump              144
#ParisAgreement     134
#COP22              133
#environment        123
#auspol              97
#COP21               96
#ImVotingBecause     93
#globalwarming       84
#Climate             67
#tcot                65
#science             62
#news                61
#GlobalWarming       58
#p2                  54
#MAGA                53
Name: hashtags, dtype: int64

In [17]:
import plotly.express as px
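# Take the top-20 counts and reshape them for plotting
temp = hashtags_df["hashtags"].value_counts()[:20]
temp = pd.DataFrame({'hashtag': temp.index, 'count': temp.values})

# The counts are already aggregated, so a bar chart shows them directly
fig = px.bar(temp, x='hashtag', y='count')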

fig.show()

As the chart shows, the most widespread hashtags relate directly to climate. However, hashtags appear too rarely (12,060 occurrences across 43,943 tweets) to have been the criterion used to extract this data.
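The counts above treat `#climate` and `#Climate` as different hashtags. Since Twitter hashtags match regardless of case, it may be more useful to merge case variants before picking search terms for the API. The sketch below does this for a few of the counts reported in the table; the variable names are illustrative.

```python
from collections import Counter

# A few counts copied from the top-20 table above
top_counts = {
    "#climate": 890,
    "#climatechange": 461,
    "#ClimateChange": 156,
    "#globalwarming": 84,
    "#Climate": 67,
    "#GlobalWarming": 58,
}

# Merge case variants by lowercasing each hashtag before summing
merged = Counter()
for tag, count in top_counts.items():
    merged[tag.lower()] += count

for tag, count in merged.most_common():
    print(tag, count)
```

On the full dataframe the same effect could be obtained by lowercasing the `hashtags` column before calling `value_counts()`.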



#### Pre-processing 

I perform pre-processing on the dataset from Kaggle. 

First, I reduce all tweets to lowercase.

In [18]:
# Work on a copy so that df_raw keeps the original text
df_wip = df_raw.copy()
df_wip['message'] = df_wip['message'].str.lower()
df_wip['message']

0        @tiniebeany climate change is an interesting h...
1        rt @natgeochannel: watch #beforetheflood right...
2        fabulous! leonardo #dicaprio's film on #climat...
3        rt @mick_fanning: just watched this amazing do...
4        rt @cnalive: pranita biswasi, a lutheran from ...
                               ...                        
43938    dear @realdonaldtrump,\nyeah right. human medi...
43939    what will your respective parties do to preven...
43940    rt @mikkil: un poll shows climate change is th...
43941    rt @taehbeingextra: i still can$q$t believe th...
43942    @likeabat77 @zachhaller \n\nthe wealthy + foss...
Name: message, Length: 43943, dtype: object

Then I handle URLs. There are many strategies here; one is to remove all links outright. However, I believe that links may convey useful information in this setting, so I decided to strip everything but the domain from each link.
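The `pp.url_cleaner` helper called in the next cell is not shown in this notebook. The following is a minimal sketch of the "keep only the domain" strategy, assuming links start with `http://` or `https://`; the function name `keep_url_domains`, the pattern, and the sample text are hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumption: a URL is "http(s)://" followed by non-whitespace characters
URL_PATTERN = re.compile(r"https?://\S+")


def keep_url_domains(text: str) -> str:
    """Replace every URL in the text with its bare domain."""

    def to_domain(match: re.Match) -> str:
        host = urlparse(match.group(0)).netloc
        # Drop a leading "www." so both spellings map to one token
        if host.startswith("www."):
            host = host[4:]
        return host

    return URL_PATTERN.sub(to_domain, text)


# Made-up example text
print(keep_url_domains("read this https://www.nytimes.com/2017/climate.html now"))
```

One caveat: Twitter shortens links through `t.co`, so shortened links would all collapse to the same `t.co` token unless they are expanded first.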

In [19]:
df_wip['message'] = df_wip['message'].apply(pp.url_cleaner)

Looking at the dataset, we see a problem with special characters and encoding. It is unclear what exactly the encoding is. The optimal solution would be to decode the text properly, but the functions I have tried do not seem to match this particular encoding, so for now I simply remove all special characters of this sort and proceed. I found most of these characters by visual inspection.
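Some of these character sequences look like what UTF-8 text turns into when it is decoded as Windows-1252. If that is what happened here, which is an assumption I have not verified on this dataset, a round trip through the two encodings can recover the original characters. The helper name `fix_mojibake` is made up for this sketch.

```python
def fix_mojibake(text: str) -> str:
    """Try to undo UTF-8 text that was mistakenly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # The text does not fit this pattern; leave it unchanged
        return text


# "don" + U+00E2 U+20AC U+2122 + "t": a garbled right single quotation mark
garbled = "don\u00e2\u20ac\u2122t"
print(fix_mojibake(garbled))

# Plain text passes through untouched
print(fix_mojibake("climate change"))
```

This only works when every garbled byte survived. Text that was garbled twice, or that lost characters along the way, would fall into the `except` branch and come back unchanged, which may explain why the decoding attempts did not match.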

In [20]:
df_wip.message = df_wip.message.apply(lambda x: re.sub(r'[Ã¢â¬â€œ]', '', x))

I do not remove stopwords at this stage because the literature shows that stopwords convey important information for classification in sentiment analysis (http://www.lrec-conf.org/proceedings/lrec2014/pdf/292_Paper.pdf). One fundamental problem is that commonly used stopword lists contain words that carry meaning, such as negations (e.g. "not", "won't"). I will later try to implement a different method that uses word frequency as the criterion for removing specific lemmas.

I also decided to keep mentions because they are definitely relevant to this analysis. However, I remove punctuation, including "@", "#", and all punctuation associated with emojis, since I believe emojis are not particularly informative on Twitter (they are less common there than on other platforms) and are not used much when discussing political topics such as climate change. Notice that I substitute punctuation with a space rather than just removing it. I do this so that contractions and possessives stay split into separate parts: "How's it going" becomes "How s it going".

For now I also do not proceed with any normalization. I may decide to change this later.
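To make the effect of the substitution concrete, here is a small standalone example that uses the same pattern as the next cell. The helper name `replace_punctuation` and the sample strings are only for illustration.

```python
import re
import string

# One character class that matches any ASCII punctuation character
punct_pattern = re.compile("[" + re.escape(string.punctuation) + "]")


def replace_punctuation(text: str) -> str:
    """Replace each punctuation character with a single space."""
    return punct_pattern.sub(" ", text)


print(replace_punctuation("how's it going"))
print(replace_punctuation("rt @mick_fanning: just watched"))
```

Two side effects are worth keeping in mind: the underscore counts as punctuation, so a mention such as `@mick_fanning` is split into two words, and consecutive punctuation leaves runs of spaces that the tokenizer later collapses.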

In [21]:
import string
punct_pattern = re.compile("[" + re.escape(string.punctuation) + "]")
df_wip.message = df_wip.message.apply(lambda x: punct_pattern.sub(r" ", x))


Finally, I tokenize the tweets to prepare for the actual analysis.

In [22]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
df_wip['tokens'] = df_wip['message'].apply(tknzr.tokenize)
df_wip.head()

Unnamed: 0,sentiment,message,tweetid,tokens
0,-1,tiniebeany climate change is an interesting h...,792927353886371840,"[tiniebeany, climate, change, is, an, interest..."
1,1,rt natgeochannel watch beforetheflood right...,793124211518832641,"[rt, natgeochannel, watch, beforetheflood, rig..."
2,1,fabulous leonardo dicaprio s film on climat...,793124402388832256,"[fabulous, leonardo, dicaprio, s, film, on, cl..."
3,1,rt mick fanning just watched this amazing do...,793124635873275904,"[rt, mick, fanning, just, watched, this, amazi..."
4,2,rt cnalive pranita biswasi a lutheran from ...,793125156185137153,"[rt, cnalive, pranita, biswasi, a, lutheran, f..."



Now I proceed with the last steps of preparing the dataset. I flatten the tokens into a single corpus to look at word frequencies.

In [None]:
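# Flatten in linear time; summing lists re-copies the growing result at every step
corpus = [token for tokens in df_wip['tokens'] for token in tokens]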

I save the corpus to disk as an intermediate step so that it does not have to be rebuilt in later sessions.

In [None]:
import pickle

file_name = "corpus.pkl"
# The context manager closes the file even if pickling fails
with open(file_name, "wb") as file:
    pickle.dump(corpus, file)


In [None]:
with open(file_name, "rb") as open_file:
    corpus = pickle.load(open_file)

corpus[0:5]


Then I use nltk to build a frequency distribution and look at which words appear most often, and which appear only once.

In [None]:
from nltk import FreqDist
corpus_freq = FreqDist(corpus)
len(corpus_freq)


46942

I look at how many words appear only once and find that they make up more than half of the total. I want to remove these words because a word that occurs only once gives the algorithms nothing to generalize from.
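The idea can be illustrated on a toy corpus with the standard library alone. The three token lists below are made up; the actual cells use nltk's `FreqDist` on the real tokens.

```python
from collections import Counter

# Toy tokenized tweets (made-up data)
docs = [
    ["climate", "change", "is", "real"],
    ["climate", "hoax"],
    ["change", "is", "coming"],
]

# Count every token across the whole corpus
counts = Counter(token for doc in docs for token in doc)

# Words that occur exactly once, stored in a set for fast lookups
once = {word for word, count in counts.items() if count == 1}

# Drop those words from every tweet
filtered = [[token for token in doc if token not in once] for doc in docs]
print(filtered)
```

Keeping the words to remove in a set matters at full scale: with tens of thousands of rare words, checking each token against a list or tuple would scan the whole collection every time.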

In [None]:
once = [x for x in corpus_freq.most_common() if x[1] == 1]
len(once)


In [None]:
# Use a set so that each membership test is O(1) instead of a scan over a tuple
to_remove = {word for word, _ in once}
df_wip['tokens'] = df_wip['tokens'].apply(lambda x: [i for i in x if i not in to_remove])

In [None]:
# Store work-in-progress 
df_wip.to_csv('data/df_wip_1.csv', index = False)



In [None]:
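# Load wip
import ast

# CSV stores the token lists as strings, so parse them back into Python lists
df_wip = pd.read_csv("data/df_wip_1.csv", converters={'tokens': ast.literal_eval})
# The file was written with index=False, so this column may not exist
df_wip = df_wip.drop(columns=['Unnamed: 0'], errors='ignore')
df_wip.head()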

Unnamed: 0,sentiment,message,tweetid,tokens
0,-1,tiniebeany climate change is an interesting h...,792927353886371840,"['climate', 'change', 'is', 'an', 'interesting..."
1,1,rt natgeochannel watch beforetheflood right...,793124211518832641,"['rt', 'natgeochannel', 'watch', 'beforetheflo..."
2,1,fabulous leonardo dicaprio s film on climat...,793124402388832256,"['fabulous', 'leonardo', 'dicaprio', 's', 'fil..."
3,1,rt mick fanning just watched this amazing do...,793124635873275904,"['rt', 'mick', 'fanning', 'just', 'watched', '..."
4,2,rt cnalive pranita biswasi a lutheran from ...,793125156185137153,"['rt', 'cnalive', 'pranita', 'biswasi', 'a', '..."
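A column of Python lists does not survive a CSV round trip as lists: each list is written as its string representation and comes back as plain text. The sketch below shows the effect with the standard `csv` module and one made-up row; pandas behaves the same way when it writes object columns to CSV.

```python
import ast
import csv
import io

# One made-up row with a list-valued "tokens" field
rows = [{"tweetid": "1", "tokens": ["rt", "climate", "change"]}]

# Write the row to an in-memory CSV file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["tweetid", "tokens"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the list is now a string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["tokens"]
print(type(raw).__name__, raw)

# ast.literal_eval safely turns the string back into a list
parsed = ast.literal_eval(raw)
print(type(parsed).__name__, parsed)
```

Without this conversion, operations that expect lists (such as flattening the tokens into a corpus) would silently work on characters or concatenated strings instead of words. Storing the dataframe with pickle would be another way to avoid the problem.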


In [None]:
from nltk import FreqDist
corpus_check = df_wip['tokens'].sum()
corpus_freq_check = FreqDist(corpus_check)


In [None]:
# Print both values; a notebook cell only displays its last expression
print(len(corpus_freq_check), len(to_remove))


In [None]:
once_check = [x for x in corpus_freq_check.most_common() if x[1] == 1]
once_check


#### Final Data and Visualization

## Algorithms

## Compare Performances

## Potential Analysis and Future Uses