# Sentiment Analysis of Climate Change Tweets

[Insert intro and explanation to the project]

In [17]:
# Import libraries
import pandas as pd
import re

## Data
For this project I am going to use two sources of data. 

The first is a dataset from [kaggle.com](https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset) containing 43,943 tweets posted between Apr 27, 2015 and Feb 21, 2018. For each tweet, the dataset reports the full text and a sentiment label (more on this below). I plan to use this dataset to train a machine learning model that classifies tweet sentiment out of sample with high accuracy. Unfortunately, I did not manage to get in touch with the dataset's creator to learn which criteria were used to extract the tweets. This information would help me when extracting additional tweets for further analysis.

The second source of data is Twitter itself. I use Twitter's API to extract new tweets following criteria similar to those used to build the training dataset ([if I manage to reach the dataset's creator; otherwise the criteria will be arbitrary; maybe I'll check whether a specific collection of hashtags was used]). I plan to use my model to classify these tweets. (Further info on this once the tweets are actually extracted; it would be interesting to take tweets from after 2018.)

In this section I load and pre-process both of these datasets.

### Training Dataset
Here I load the training dataset from Kaggle and take a peek at what it contains.

In [18]:
df_raw = pd.read_csv("data/twitter_sentiment_data.csv")
df_raw.head()

Unnamed: 0,sentiment,message,tweetid
0,-1,@tiniebeany climate change is an interesting h...,792927353886371840
1,1,RT @NatGeoChannel: Watch #BeforeTheFlood right...,793124211518832641
2,1,Fabulous! Leonardo #DiCaprio's film on #climat...,793124402388832256
3,1,RT @Mick_Fanning: Just watched this amazing do...,793124635873275904
4,2,"RT @cnalive: Pranita Biswasi, a Lutheran from ...",793125156185137153


Here I take a small detour to look at the most common hashtags in the dataset. I do this to judge whether keeping hashtags in my analysis could be an advantage, and to get a sense of which hashtags to use for the API search later.

In [19]:
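import regex as re  # third-party `regex` module; rebinds the name `re` imported earlier
import pre_processing_functions as pp  # local helper module

# Extract the hashtags of each tweet (one list per tweet)
hashtag_list = df_raw['message'].apply(pp.find_hashtags).to_list()

# Flatten the lists into a single column with one hashtag per row
hashtags_df = pd.DataFrame(
    [item for sublist in hashtag_list for item in sublist],
    columns=['hashtags'],
)
hashtags_df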


Unnamed: 0,hashtags
0,#BeforeTheFlood
1,#DiCaprio
2,#climate
3,#DonaldTrump
4,#monaco
...,...
12056,#FactsAreTruth
12057,#ImVoting4JillBecause
12058,#Awareness
12059,#AikBaatSuniThi
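
The `pp.find_hashtags` helper lives in a separate module that is not shown here. Below is a minimal sketch of what such a function could look like, assuming a hashtag is a `#` followed by letters, digits, or underscores. The name `find_hashtags_sketch` and the pattern are my assumptions for illustration, not the actual implementation.

```python
import re

# Assumption: a hashtag is '#' followed by one or more word characters
HASHTAG_PATTERN = re.compile(r"#\w+")


def find_hashtags_sketch(text):
    """Return all hashtags found in a tweet, in order of appearance."""
    return HASHTAG_PATTERN.findall(text)


example = "Fabulous! Leonardo #DiCaprio's film on #climate change"
print(find_hashtags_sketch(example))
```

Because an apostrophe is not a word character, `#DiCaprio's` is captured as `#DiCaprio`, which matches the form seen in the table above.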


The 20 most used hashtags in this dataset are listed below, followed by a histogram of their counts.

In [20]:
temp = hashtags_df["hashtags"].value_counts()[:20]
temp

#climate            890
#climatechange      461
#BeforeTheFlood     248
#ClimateChange      156
#ActOnClimate       146
#Trump              144
#ParisAgreement     134
#COP22              133
#environment        123
#auspol              97
#COP21               96
#ImVotingBecause     93
#globalwarming       84
#Climate             67
#tcot                65
#science             62
#news                61
#GlobalWarming       58
#p2                  54
#MAGA                53
Name: hashtags, dtype: int64

In [21]:
import plotly.express as px
temp = hashtags_df["hashtags"].value_counts()[:20]
temp = pd.DataFrame({'hashtag':temp.index, 'count':temp.values})

fig = px.histogram(temp, x = 'hashtag', y = 'count')

fig.show()

As we can see, the most common hashtags relate directly to climate. However, with only about 12,000 hashtag occurrences across 43,943 tweets, hashtags seem too rare to be the criterion used to extract this data.
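
One caveat on the counts above: `value_counts` is case-sensitive, so `#climate` and `#Climate`, or `#climatechange` and `#ClimateChange`, are counted as separate hashtags. Below is a minimal sketch of a case-insensitive count. The function name and the sample data are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def count_hashtags_case_insensitive(hashtags, top_n=20):
    """Lowercase the hashtags, then return the top_n most frequent ones."""
    return hashtags.str.lower().value_counts().head(top_n)


# Small made-up sample that mimics the case variants seen in the real data
sample = pd.Series(
    ["#climate", "#Climate", "#climatechange", "#ClimateChange", "#climate"]
)
print(count_hashtags_case_insensitive(sample))
```

On the real data this would be called as `count_hashtags_case_insensitive(hashtags_df["hashtags"])`, merging the case variants into single entries.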



#### Pre-processing 

I perform pre-processing on the dataset from Kaggle. 

First, I reduce all tweets to lowercase.

In [22]:
df_wip = df_raw.copy()  # work on a copy so that df_raw keeps the original text
df_wip['message'] = df_wip['message'].str.lower()
df_wip['message']

0        @tiniebeany climate change is an interesting h...
1        rt @natgeochannel: watch #beforetheflood right...
2        fabulous! leonardo #dicaprio's film on #climat...
3        rt @mick_fanning: just watched this amazing do...
4        rt @cnalive: pranita biswasi, a lutheran from ...
                               ...                        
43938    dear @realdonaldtrump,\nyeah right. human medi...
43939    what will your respective parties do to preven...
43940    rt @mikkil: un poll shows climate change is th...
43941    rt @taehbeingextra: i still can$q$t believe th...
43942    @likeabat77 @zachhaller \n\nthe wealthy + foss...
Name: message, Length: 43943, dtype: object

Then I handle URLs. There are many strategies here; one is to remove all links outright. However, I believe that links may convey useful information in this setting, so I decided to strip the prefixes from them and keep only the domains.

In [23]:
df_wip['message'] = df_wip['message'].apply(pp.url_cleaner)
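
The `pp.url_cleaner` helper is not shown in this notebook. Below is a rough sketch of the idea of keeping only the domain of each link, assuming links start with `http://` or `https://` and run until the next whitespace. The name `url_to_domain_sketch`, the pattern, and the dropping of a leading `www.` are my assumptions, not the actual implementation.

```python
import re
from urllib.parse import urlparse

# Assumption: a link starts with http:// or https:// and ends at whitespace
URL_PATTERN = re.compile(r"https?://\S+")


def url_to_domain_sketch(text):
    """Replace every link in the text with its bare domain."""

    def _domain(match):
        netloc = urlparse(match.group(0)).netloc
        # Drop a leading 'www.' so that only the domain itself remains
        return netloc[4:] if netloc.startswith("www.") else netloc

    return URL_PATTERN.sub(_domain, text)


print(url_to_domain_sketch("watch this https://www.example.com/video/abc now"))
```

One limitation of this approach is that links shortened through a service such as `t.co` all collapse to the same domain, so they carry little information about the original source.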

Looking at the dataset, there is a problem with special expressions and encoding. It is unclear what exactly the encoding is. The optimal solution would be to decode the text properly, but the functions I have tried do not seem to match this particular encoding, so for now I simply remove all special characters of this sort and proceed. I found most of these characters by visual inspection.

In [24]:
df_wip.message = df_wip.message.apply(lambda x: re.sub(r'[Ã¢â¬â€œ]', '', x))
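
The characters in the pattern above resemble what UTF-8 text turns into when its bytes are decoded as Windows-1252, although this is only a guess about this dataset. Below is a minimal sketch of reversing that specific kind of damage, under that assumption. The function name `fix_mojibake` is hypothetical, and the function leaves the text unchanged whenever the round trip fails.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252.

    If the round trip fails, the text is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Build a garbled example on purpose, then repair it
garbled = "café".encode("utf-8").decode("cp1252")
print(garbled)
print(fix_mojibake(garbled))
```

Two caveats apply. The round trip depends on the original casing, because lowercasing changes the characters and therefore the bytes they map back to, so a repair like this would have to run before the lowercasing step. It is also all-or-nothing per string: a tweet that mixes garbled and valid characters is returned as-is, which a dedicated library such as `ftfy` may handle better.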

### Data from Twitter

## Pre-processing
In order to use my data 

## Final Data

## Algorithm

## Visualization

## Potential Analysis and Future Uses