# Applying a ULMFiT model to Twitter US Airlines Sentiment Analysis

***Introduction***

Twitter US Airlines Sentiment provides a CSV file of tweets scraped in February 2015 concerning six major US airlines. In this project, we apply a supervised ULMFiT model to classify tweets as positive, negative or neutral. ULMFiT is a transfer learning method: a model that has been trained to solve one problem is used as the basis for solving a similar problem. 
In this instance, we fine-tune a pre-trained language model (trained on Wikipedia data) and apply it to classify tweets. 

---



In [0]:
!pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu92/torch_nightly.html
!pip install fastai
import fastai
from fastai import *
from fastai.text import * 

***EDA***

After installing the required packages, we begin by reading the CSV file into a pandas DataFrame and inspecting its structure. The sentiment labels are positive (16%), negative (63%) and neutral (21%), and the body of each tweet is in the 'text' column. We therefore reduce the DataFrame to the 'airline_sentiment' and 'text' columns, stored as 'label' and 'text' respectively. These two columns are the labels and features for our model. 

In [3]:
df = pd.read_csv('Tweets.csv')
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [4]:
df.airline_sentiment.value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64
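
The class shares quoted earlier can be checked directly from these counts. The snippet below is a minimal sketch that uses plain Python rather than pandas; the variable names (`counts`, `shares`, `majority_baseline`) are ours and do not appear in the notebook. It also computes the accuracy of a trivial classifier that always predicts the most common label, which gives a rough floor for judging the model later on.

```python
# Label counts copied from the value_counts() output above
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}
total = sum(counts.values())

# Share of each class as a percentage, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Accuracy (in %) of a trivial classifier that always predicts the most common label
majority_baseline = max(shares.values())

print(total)
print(shares)
print(majority_baseline)
```

Because the classes are imbalanced, accuracy alone should be read against this baseline.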

In [0]:
df = pd.DataFrame({'label':df.airline_sentiment , 'text':df.text})

***Cleaning/preprocessing the data***

Before proceeding to the construction of the model, we must first clean the tweets and remove irrelevant content. Upon inspection, many of the tweets include hashtags, mentions and HTML-encoded content that has not been decoded. 

As a first step we apply regular expressions to remove the '#' from hashtags but keep the word after it, as that word may contain useful information about the tweet. Similarly, regular expressions are used to remove hyperlinks and mentions, as they do not add value to sentiment analysis.

It also appears that HTML encoding has not been converted to text. To account for this, we use BeautifulSoup to decode HTML. Other common NLP preprocessing steps, such as stopword removal and tokenisation of tweets, were omitted because they yielded lower accuracy. We encapsulate these ideas in the *clean_tweets* function. 

In [0]:
import re 
from bs4 import BeautifulSoup
#function to clean a tweet, removing #s, mentions , URLs, and decode HTML
def clean_tweets(tweet):
    #remove hashes
    tweet = re.sub(r'#', '', tweet)
    
    #remove mentions
    tweet= re.sub(r'@[A-Za-z0-9]+','',tweet)
    
    #remove URLS
    tweet = re.sub('https?://[A-Za-z0-9./]+','',tweet)
    
    #HTML decoding
    altered = BeautifulSoup(tweet, 'lxml')
    tweet = altered.get_text()
    
    return tweet 
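
For comparison, here is a hedged variant of the same idea that relies only on the standard library. It is a sketch and not the function used to produce the results in this notebook: `clean_tweet_sketch` is a hypothetical name, `html.unescape` stands in for BeautifulSoup (it decodes entities but does not strip tags), and the mention pattern is widened to `@\w+` on the assumption that handles may contain underscores. Entities are decoded first so that numeric entities such as `&#39;` are not damaged by the later removal of '#'.

```python
import html
import re


def clean_tweet_sketch(tweet: str) -> str:
    """Hypothetical standard-library variant of clean_tweets (not the notebook's function)."""
    # Decode HTML entities first, so '#' inside entities like '&#39;' is still intact
    tweet = html.unescape(tweet)
    # Remove URLs (everything up to the next whitespace)
    tweet = re.sub(r"https?://\S+", "", tweet)
    # Remove mentions; \w also covers underscores in handles (an assumption)
    tweet = re.sub(r"@\w+", "", tweet)
    # Drop the '#' symbol but keep the hashtag word
    tweet = tweet.replace("#", "")
    # Collapse the whitespace left behind by the removals
    return " ".join(tweet.split())


print(clean_tweet_sketch("@VirginAmerica What @dhepburn said."))
print(clean_tweet_sketch("@united I &amp; my bag are #delayed http://t.co/abc123"))
```

Switching to a variant like this would change the cleaned text slightly, so the accuracies reported below would not necessarily carry over.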

To see the impact of the cleaning, we can view the head of the DataFrame before and after preprocessing. 

In [7]:
#Data before preprocessing
df.head()

Unnamed: 0,label,text
0,neutral,@VirginAmerica What @dhepburn said.
1,positive,@VirginAmerica plus you've added commercials t...
2,neutral,@VirginAmerica I didn't today... Must mean I n...
3,negative,@VirginAmerica it's really aggressive to blast...
4,negative,@VirginAmerica and it's a really big bad thing...


In [0]:
df['text']=df['text'].apply(clean_tweets)

In [9]:
#data after the clean 
df.head()

Unnamed: 0,label,text
0,neutral,What said.
1,positive,plus you've added commercials to the experienc...
2,neutral,I didn't today... Must mean I need to take ano...
3,negative,"it's really aggressive to blast obnoxious ""ent..."
4,negative,and it's a really big bad thing about it


The mentions have been removed and the text is now cleaner. We can now proceed to the model construction. 

---






***Model Construction***

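As a first step, we use the scikit-learn library to partition the data into a training set and a test set. We use a stratified 70/30 split with a fixed random seed, so the 14,640 tweets are randomly separated into a training set (about 10,248 tweets, 70%) and a test set (about 4,392 tweets, 30%) that preserve the label proportions. The test set is passed to fastai as the validation set, so the accuracies reported below are measured on it. 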
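
To illustrate what a stratified split does, here is a minimal standard-library sketch. It is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper written only for this illustration: it shuffles the row indices of each label separately and moves the same fraction of each label into the test part, so both parts keep roughly the original class balance.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=1):
    """Illustrative helper: return (train_idx, test_idx) with similar label shares in both."""
    rng = random.Random(seed)
    # Group row indices by their label
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    train_idx, test_idx = [], []
    for idxs in by_label.values():
        # Shuffle within each label, then cut off the test fraction
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels mimicking the 63/21/16 class balance of the dataset
labels = ["negative"] * 63 + ["neutral"] * 21 + ["positive"] * 16
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In the notebook itself the split is done by `train_test_split` with `stratify = df['label']`; the sketch only shows the idea behind that argument.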

Implementing ULMFiT involves:

1.   Creating a language model with pretrained weights that you fine-tune to a new dataset
2.   Creating a classifier on top of the encoder of the language model

In this context, we fine-tune a pretrained language model (trained on the WikiText-103 dataset) on the training set, and build a classifier that is evaluated on the test set. To get the data ready for modelling, we use the *TextLMDataBunch* and *TextClasDataBunch* classes from the fastai library to prepare the data for the language model and the classification model respectively. 

In [0]:
#path to the Wiki103 dataset used by ULMFit
path = untar_data(URLs.WT103_1)
from sklearn.model_selection import train_test_split
df_trn, df_val = train_test_split(df, stratify = df['label'], test_size = 0.3, random_state = 1)
#Language model data
data_lm = TextLMDataBunch.from_df(path=path, train_df = df_trn, valid_df = df_val)
#classifier model data
data_clas = TextClasDataBunch.from_df(path=path, train_df = df_trn, valid_df = df_val, vocab=data_lm.train_ds.vocab, bs=42)

We now create a learner object, ‘learn’, that will directly create a model, download the pre-trained weights, and be ready for fine-tuning. We can use the *data_lm* object we previously created to fine-tune. We use the AWD_LSTM architecture provided by [fast.ai](https://www.fast.ai):

In [0]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,5.424283,4.465004,0.216194,05:55
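
The `fit_one_cycle` call follows a one-cycle policy: the learning rate is warmed up to the maximum value passed in (here `1e-2`) and then annealed back down. The snippet below is a rough sketch of such a schedule in plain Python, not fastai's code. The function names are ours, and the warm-up fraction (`pct_start=0.3`), the starting divisor (`div_factor=25`) and the final divisor are assumptions based on our understanding of the fastai v1 defaults; they may differ between versions.

```python
import math


def cosine_anneal(start, end, pct):
    """Move from start (pct=0) to end (pct=1) along a half cosine."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)


def one_cycle_lr(step, total_steps, lr_max=1e-2,
                 div_factor=25.0, pct_start=0.3, final_div=25.0e4):
    """Sketch of a one-cycle schedule; the default values are assumptions."""
    warm = int(total_steps * pct_start)
    if step < warm:
        # Warm-up phase: from lr_max / div_factor up to lr_max
        return cosine_anneal(lr_max / div_factor, lr_max, step / warm)
    # Annealing phase: from lr_max down towards lr_max / final_div
    return cosine_anneal(lr_max, lr_max / final_div,
                         (step - warm) / (total_steps - warm))


schedule = [one_cycle_lr(s, 100) for s in range(100)]
print(schedule[0], max(schedule), schedule[-1])
```

The schedule starts low, peaks at `lr_max` after the warm-up, and ends far below its starting value.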


We can then unfreeze all layers and fine-tune the whole language model at a lower learning rate.

In [0]:
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,4.118608,3.912419,0.275915,12:21
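
One way to read the validation losses above is as perplexity, the exponential of the mean cross-entropy loss. The sketch below assumes the reported `valid_loss` is a mean cross-entropy in nats (natural logarithm), which we believe is what the learner reports here; the dictionary keys are our own labels for the two training stages.

```python
import math

# Validation losses copied from the two language-model training runs above
valid_losses = {"frozen": 4.465004, "unfrozen": 3.912419}

# Perplexity = exp(mean cross-entropy), assuming the loss is measured in nats
perplexities = {stage: math.exp(loss) for stage, loss in valid_losses.items()}

for stage, ppl in perplexities.items():
    print(f"{stage}: perplexity ~ {ppl:.1f}")
```

A lower perplexity after unfreezing indicates that the language model has adapted further to the vocabulary and style of the tweets.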


We save the encoder to use it for classification on the test data.

In [0]:
learn.save_encoder('ft_enc')

***Classification***

We now use the *data_clas* object we created previously to build a classifier with our fine-tuned encoder.

In [0]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')

In [0]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.710382,0.572643,0.765301,04:35


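Next, we unfreeze the classifier gradually. We first train only the last two layer groups, then unfreeze the whole model and fine-tune it, in both cases passing a range of learning rates so that earlier layers are updated more gently than later ones.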

In [0]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(5e-3/2., 5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.654022,0.537686,0.776776,06:01


In [0]:
learn.unfreeze()
learn.fit_one_cycle(3, slice(2e-3/100, 2e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.553355,0.513982,0.795082,17:52
1,0.57212,0.505976,0.798361,17:15
2,0.538764,0.475355,0.811475,17:40
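
Passing `slice(2e-3/100, 2e-3)` asks for discriminative learning rates: the earliest layer group gets the smallest rate and the last group the largest. The sketch below shows one way such a range can be spread geometrically over the layer groups. It is an illustration and not fastai's code; `even_mults_sketch` is a hypothetical helper, and the number of layer groups (5) is an assumption.

```python
import math


def even_mults_sketch(start, stop, n):
    """Illustrative helper: n learning rates spaced geometrically from start to stop."""
    if n == 1:
        return [stop]
    # Constant multiplicative step between neighbouring layer groups
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]


# Assumed: 5 layer groups; the range matches slice(2e-3/100, 2e-3) above
lrs = even_mults_sketch(2e-3 / 100, 2e-3, 5)
for group, lr in enumerate(lrs):
    print(f"layer group {group}: lr = {lr:.2e}")
```

With these assumptions, each layer group trains with a rate roughly 3.2 times larger than the group before it.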


---
***Conclusion***

In this project, we have shown that ULMFiT achieves an accuracy of 81% on the held-out portion of a 70/30 train-test split. This performance is comparable to the Naive Bayes model proposed by [Duan et al](http://cs229.stanford.edu/proj2016spr/report/042.pdf) and is substantially better than the RNN model proposed by [Yuan/Zhou](https://cs224d.stanford.edu/reports/YuanYe.pdf). For further work, we can aim to deduce the reasons behind each sentiment and better visualise the view on each US airline. We can also try using a semi-supervised implementation of ULMFiT.