<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Sentiment-Analysis-with-Airline-Tweets" data-toc-modified-id="Sentiment-Analysis-with-Airline-Tweets-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Sentiment Analysis with Airline Tweets</a></span><ul class="toc-item"><li><span><a href="#Data-analysis" data-toc-modified-id="Data-analysis-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data analysis</a></span></li><li><span><a href="#Data-Engineering" data-toc-modified-id="Data-Engineering-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Data Engineering</a></span></li><li><span><a href="#Algorithm-Testing" data-toc-modified-id="Algorithm-Testing-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Algorithm Testing</a></span></li><li><span><a href="#Combining-Algorithms-(Ensemble)" data-toc-modified-id="Combining-Algorithms-(Ensemble)-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Combining Algorithms (Ensemble)</a></span></li></ul></li></ul></div>

# Sentiment Analysis with Airline Tweets

As there are many ways to do natural language processing, I decided to try a different approach from the one I used for the fake news filter.

In this kernel, I will be using NLTK instead.

Firstly, importing the relevant libraries!

In [1]:
from nltk import word_tokenize, NaiveBayesClassifier, classify, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import re

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

## Data analysis

Importing the dataset

In [2]:
data = pd.read_csv('tweets.csv', encoding = 'latin')

I will also make a copy of the data, so that the original stays untouched and mistakes are easier to undo

In [3]:
data_copy = data.copy()

# Then, extract the relevant columns, in this case, the tweet column and the column which determines the sentiment of the tweet 

main_data = data_copy[['text', 'airline_sentiment']]

Let's take a look at the first few rows of tweets

In [4]:
main_data.head()

|  | text | airline_sentiment |
|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral |
| 1 | @VirginAmerica plus you've added commercials t... | positive |
| 2 | @VirginAmerica I didn't today... Must mean I n... | neutral |
| 3 | @VirginAmerica it's really aggressive to blast... | negative |
| 4 | @VirginAmerica and it's a really big bad thing... | negative |


Let's also check how many tweets there are and whether there are any null values

In [5]:
main_data.describe()

|  | text | airline_sentiment |
|---|---|---|
| count | 14639 | 14639 |
| unique | 14418 | 3 |
| top | @united Thanks | negative |
| freq | 6 | 9178 |
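The summary above also shows how unbalanced the classes are: 9178 of the 14639 tweets are negative. A quick calculation, using only those two numbers from the output, gives the accuracy a model would reach by always predicting "negative". This is a useful floor to compare the classifiers against later.

```python
# Counts taken from the describe() output above.
negative, total = 9178, 14639

# Accuracy of a trivial model that always predicts the majority class.
baseline = negative / total
print(f"majority-class baseline accuracy: {baseline:.3f}")
```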


In [6]:
main_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14639 entries, 0 to 14638
Data columns (total 2 columns):
text                 14639 non-null object
airline_sentiment    14639 non-null object
dtypes: object(2)
memory usage: 228.8+ KB


As there are no null values, the next step will be to extract the tweets from the main dataframe

In [7]:
tweets = main_data['text']

In [8]:
tweets

0                      @VirginAmerica What @dhepburn said.
1        @VirginAmerica plus you've added commercials t...
2        @VirginAmerica I didn't today... Must mean I n...
3        @VirginAmerica it's really aggressive to blast...
4        @VirginAmerica and it's a really big bad thing...
5        @VirginAmerica seriously would pay $30 a fligh...
6        @VirginAmerica yes, nearly every time I fly VX...
7        @VirginAmerica Really missed a prime opportuni...
8         @virginamerica Well, I didn't…but NOW I DO! :-D
9        @VirginAmerica it was amazing, and arrived an ...
10       @VirginAmerica did you know that suicide is th...
11       @VirginAmerica I &lt;3 pretty graphics. so muc...
12       @VirginAmerica This is such a great deal! Alre...
13       @VirginAmerica @virginmedia I'm flying your #f...
14                                  @VirginAmerica Thanks!
15           @VirginAmerica SFO-PDX schedule is still MIA.
16       @VirginAmerica So excited for my first cross c.

## Data Engineering

As seen from the tweets, there are many extra characters (mentions, punctuation and HTML entities such as `&lt;`) which add noise to the model and are likely to lower its accuracy. As such, I will be removing the special characters and stray single letters in the next step. Each step is explained in the comments.

In [9]:
# Create an empty list to store the processed tweets
processed_tweets = []

# Remove unnecessary characters and extra spaces from each tweet, then lowercase it

for each_tweet in range(0, len(tweets)):
    # First step involves removing the special characters
    processed_tweet = re.sub(r'\W', ' ', str(tweets[each_tweet]))

    # Then, single characters will be removed
    processed_tweet = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_tweet)

    # Remove a single character at the start of the tweet
    # (an unescaped ^ anchors the pattern to the start of the string)
    processed_tweet = re.sub(r'^[a-zA-Z]\s+', ' ', processed_tweet)

    # Substituting multiple spaces with single space
    processed_tweet = re.sub(r'\s+', ' ', processed_tweet)

    # Removing prefixed 'b'
    processed_tweet = re.sub(r'^b\s+', '', processed_tweet)

    # Converting to Lowercase
    processed_tweet = processed_tweet.lower()

    # Store the processed tweet into the list created above by appending it
    processed_tweets.append(processed_tweet)
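To make the effect of each substitution easier to see, here is a small sketch that wraps the same steps in a function and runs it on a single made-up tweet. The function name `clean_tweet` and the sample text are mine, not part of the dataset, and the start-of-string pattern is written with an unescaped `^` so that it anchors to the beginning of the text.

```python
import re

def clean_tweet(text):
    """Apply the cleaning substitutions described above to one tweet."""
    text = re.sub(r'\W', ' ', str(text))          # non-word characters -> space
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)   # isolated single letters
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)     # single letter at the start
    text = re.sub(r'\s+', ' ', text)              # collapse repeated whitespace
    text = re.sub(r'^b\s+', '', text)             # stray leading 'b'
    return text.lower()

sample = "@VirginAmerica it's really aggressive!"  # made-up example input
print(repr(clean_tweet(sample)))
# -> ' virginamerica it really aggressive '
```

The leading and trailing spaces are left in place; they disappear later when the text is tokenised.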

The next step will be to remove common words and apply lemmatisation. Common words (stop words) in this scenario refer to words such as 'I', 'he' and many others that carry little information about sentiment. Removing them reduces the vocabulary size, which should make the algorithms more efficient later.

In brief, lemmatisation reduces the inflected forms of a word to a common base form (its lemma), so that, for example, 'commercials' and 'commercial' are treated as the same word.

In [11]:
# Building the cleaned corpus: lemmatise each word and drop the stop words
CommonWords = set(stopwords.words('english'))
wordLemmatizer = WordNetLemmatizer()
corpus = []
for i in range(0, len(processed_tweets)):
    # Read each processed tweet
    tweet = processed_tweets[i]
    # Tokenise the tweet into individual words and lemmatise each one
    wordtokens = [wordLemmatizer.lemmatize(word.lower()) for word in word_tokenize(tweet)]
    # Filter out the stop words and join the remaining words into a single string
    text = ' '.join([x for x in wordtokens if x not in CommonWords])
    corpus.append(text)

Now, let's look at the first few entries of the corpus to see how the tweets have been processed

In [12]:
print(corpus[0:5])

['virginamerica dhepburn said', 'virginamerica plus added commercial experience tacky', 'virginamerica today must mean need take another trip', 'virginamerica really aggressive blast obnoxious entertainment guest face amp little recourse', 'virginamerica really big bad thing']
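The two operations can be illustrated without NLTK by a toy sketch. `STOP_WORDS` and `LEMMAS` below are tiny hypothetical tables standing in for NLTK's stop-word list and WordNet, which are far larger. Note that `WordNetLemmatizer` treats every word as a noun unless a part of speech is passed, which is why 'commercials' becomes 'commercial' in the output above while 'added' is left unchanged.

```python
# Hypothetical miniature resources standing in for NLTK's stop-word list
# and WordNet; the real ones are far larger.
STOP_WORDS = {"i", "you", "have", "to", "the", "and", "it", "is", "a"}
LEMMAS = {"commercials": "commercial", "flights": "flight", "bags": "bag"}

def normalise(tweet):
    """Lemmatise each token with the toy table, then drop stop words."""
    tokens = tweet.lower().split()
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    return " ".join(tok for tok in lemmas if tok not in STOP_WORDS)

print(normalise("you have added commercials to the experience"))
# -> added commercial experience
```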


## Algorithm Testing

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
# Bag-of-words counts for the 5000 most frequent words, converted to a dense array
cv = CountVectorizer(max_features = 5000)
x = cv.fit_transform(corpus).toarray()
y = main_data.iloc[:,1].values
# No random_state is set, so the 80/20 split differs from run to run
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
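For readers unfamiliar with the bag-of-words representation, the sketch below builds the same kind of count matrix by hand with the standard library. It is a simplification, assuming plain whitespace tokenisation; `CountVectorizer` has its own tokenisation rules that differ in the details. The function name `bag_of_words` is mine.

```python
from collections import Counter

def bag_of_words(docs, max_features=None):
    """Return (vocabulary, count matrix) for whitespace-tokenised documents."""
    totals = Counter(tok for doc in docs for tok in doc.split())
    # Keep the most frequent terms, then order the columns alphabetically.
    kept = [term for term, _ in totals.most_common(max_features)]
    vocab = sorted(kept)
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in index:
                row[index[tok]] += 1
        matrix.append(row)
    return vocab, matrix

docs = ["virginamerica really big bad thing", "virginamerica really aggressive"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # one column per word
print(matrix)  # one row per tweet, one count per word
```

Each tweet becomes a row of word counts, so word order is discarded and only word frequency is kept.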

In [14]:
# tfidf_vectorizer = TfidfVectorizer(max_df = 0.7)
# tfidf_train = tfidf_vectorizer.fit_transform(x_train)
# tfidf_test = tfidf_vectorizer.transform(x_test)
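The commented-out cell above points to TF-IDF weighting as an alternative to raw counts. Note that `TfidfVectorizer` expects raw text rather than an already vectorised count matrix, so it would be fitted on the corpus strings, not on `x_train`. As a rough sketch of what the weighting does, the code below re-weights a small count matrix by hand. It assumes the smoothed IDF form, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row, which is how I understand the scikit-learn defaults; the function name `tfidf` is mine.

```python
import math

def tfidf(count_rows):
    """Re-weight a dense count matrix with smoothed IDF and L2-normalise rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = []
    for row in count_rows:
        vec = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by 0
        weighted.append([v / norm for v in vec])
    return weighted

counts = [[0, 1, 1, 1, 1, 1],
          [1, 0, 0, 1, 0, 1]]
for row in tfidf(counts):
    print([round(v, 3) for v in row])
```

Words that appear in every tweet (columns 3 and 5 here) end up with a lower weight than words that appear in only one.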

In [15]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(solver = 'liblinear')
LR.fit(x_train, y_train)
y_pred_LR = LR.predict(x_test)

print(confusion_matrix(y_test, y_pred_LR))
print(classification_report(y_test, y_pred_LR))
print(accuracy_score(y_test, y_pred_LR))

[[1643  131   45]
 [ 229  342   60]
 [  90   65  323]]
             precision    recall  f1-score   support

   negative       0.84      0.90      0.87      1819
    neutral       0.64      0.54      0.59       631
   positive       0.75      0.68      0.71       478

avg / total       0.78      0.79      0.78      2928

0.7882513661202186
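To show how the report relates to the confusion matrix, the sketch below recomputes the accuracy, precision and recall directly from the matrix printed above. In scikit-learn's convention the rows are the true classes and the columns are the predicted classes.

```python
# Confusion matrix printed for logistic regression above
# (rows = true class, columns = predicted class).
labels = ["negative", "neutral", "positive"]
cm = [[1643, 131, 45],
      [229, 342, 60],
      [90, 65, 323]]

total = sum(sum(row) for row in cm)
# Accuracy: correctly classified tweets (the diagonal) over all tweets.
accuracy = sum(cm[i][i] for i in range(3)) / total
print(f"accuracy = {accuracy:.4f}")

for i, name in enumerate(labels):
    predicted_as_i = sum(cm[r][i] for r in range(3))  # column sum
    truly_i = sum(cm[i])                              # row sum
    precision = cm[i][i] / predicted_as_i
    recall = cm[i][i] / truly_i
    print(f"{name:>8}: precision = {precision:.2f}, recall = {recall:.2f}")
```

The low recall for the neutral class comes from the 229 neutral tweets in the second row that were predicted as negative.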


In [16]:
from sklearn.linear_model import PassiveAggressiveClassifier
PAC = PassiveAggressiveClassifier()
PAC.fit(x_train, y_train)
y_pred_PAC = PAC.predict(x_test)

print(confusion_matrix(y_test, y_pred_PAC))
print(classification_report(y_test, y_pred_PAC))
print(accuracy_score(y_test, y_pred_PAC))

[[1604  189   26]
 [ 287  310   34]
 [ 140   95  243]]
             precision    recall  f1-score   support

   negative       0.79      0.88      0.83      1819
    neutral       0.52      0.49      0.51       631
   positive       0.80      0.51      0.62       478

avg / total       0.73      0.74      0.73      2928

0.7366803278688525


In [17]:
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
MNB = MultinomialNB()
MNB.fit(x_train, y_train)
y_pred_MNB = MNB.predict(x_test)

print(confusion_matrix(y_test, y_pred_MNB))
print(classification_report(y_test, y_pred_MNB))
print(accuracy_score(y_test, y_pred_MNB))

[[1613  136   70]
 [ 272  305   54]
 [ 101   64  313]]
             precision    recall  f1-score   support

   negative       0.81      0.89      0.85      1819
    neutral       0.60      0.48      0.54       631
   positive       0.72      0.65      0.68       478

avg / total       0.75      0.76      0.75      2928

0.7619535519125683


In [18]:
GNB = GaussianNB()
GNB.fit(x_train, y_train)
y_pred_GNB = GNB.predict(x_test)

print(confusion_matrix(y_test, y_pred_GNB))
print(classification_report(y_test, y_pred_GNB))
print(accuracy_score(y_test, y_pred_GNB))

[[871 386 562]
 [107 200 324]
 [ 90  68 320]]
             precision    recall  f1-score   support

   negative       0.82      0.48      0.60      1819
    neutral       0.31      0.32      0.31       631
   positive       0.27      0.67      0.38       478

avg / total       0.62      0.48      0.50      2928

0.475068306010929


In [19]:
BNB = BernoulliNB()
BNB.fit(x_train, y_train)
y_pred_BNB = BNB.predict(x_test)

print(confusion_matrix(y_test, y_pred_BNB))
print(classification_report(y_test, y_pred_BNB))
print(accuracy_score(y_test, y_pred_BNB))

[[1602  153   64]
 [ 231  345   55]
 [  90   76  312]]
             precision    recall  f1-score   support

   negative       0.83      0.88      0.86      1819
    neutral       0.60      0.55      0.57       631
   positive       0.72      0.65      0.69       478

avg / total       0.77      0.77      0.77      2928

0.7715163934426229


In [20]:
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier()
DT.fit(x_train,y_train)
y_pred_DT = DT.predict(x_test)

print(confusion_matrix(y_test, y_pred_DT))
print(classification_report(y_test, y_pred_DT))
print(accuracy_score(y_test, y_pred_DT))

[[1442  267  110]
 [ 278  279   74]
 [ 126   83  269]]
             precision    recall  f1-score   support

   negative       0.78      0.79      0.79      1819
    neutral       0.44      0.44      0.44       631
   positive       0.59      0.56      0.58       478

avg / total       0.68      0.68      0.68      2928

0.6796448087431693


In [21]:
from sklearn.ensemble import RandomForestClassifier
classifierRF = RandomForestClassifier()
classifierRF.fit(x_train, y_train)
y_pred_RF = classifierRF.predict(x_test)

print(confusion_matrix(y_test, y_pred_RF))
print(classification_report(y_test, y_pred_RF))
print(accuracy_score(y_test, y_pred_RF))

[[1594  163   62]
 [ 284  285   62]
 [ 144   86  248]]
             precision    recall  f1-score   support

   negative       0.79      0.88      0.83      1819
    neutral       0.53      0.45      0.49       631
   positive       0.67      0.52      0.58       478

avg / total       0.71      0.73      0.72      2928

0.7264344262295082


In [22]:
from xgboost import XGBClassifier
XGB = XGBClassifier()
XGB.fit(x_train,y_train)
y_pred_XGB = XGB.predict(x_test)

print(confusion_matrix(y_test, y_pred_XGB))
print(classification_report(y_test, y_pred_XGB))
print(accuracy_score(y_test, y_pred_XGB))

[[1681   80   58]
 [ 331  246   54]
 [ 126   66  286]]
             precision    recall  f1-score   support

   negative       0.79      0.92      0.85      1819
    neutral       0.63      0.39      0.48       631
   positive       0.72      0.60      0.65       478

avg / total       0.74      0.76      0.74      2928

0.7558060109289617


In [23]:
# Accuracy scores copied from the runs above, rounded to three decimals
table = pd.DataFrame({'Model': ['Logistic Regression', 'Passive Aggressive Classifier', 'Multinomial Naive Bayes'
                                , 'Gaussian Naive Bayes', 'Bernoulli Naive Bayes'
                                , 'Decision Tree', 'Random Forest', 'XGBoost']
                                , 'Accuracy Scores': ['0.788', '0.737', '0.762', '0.475', '0.772', '0.680'
                                                      , '0.726', '0.756']})


table['Model'] = table['Model'].astype('category')
table['Accuracy Scores'] = table['Accuracy Scores'].astype('float32')

pd.pivot_table(table, index = ['Model']).sort_values(by = 'Accuracy Scores', ascending=False)

| Model | Accuracy Scores |
|---|---|
| Logistic Regression | 0.788 |
| Bernoulli Naive Bayes | 0.772 |
| Multinomial Naive Bayes | 0.762 |
| XGBoost | 0.756 |
| Passive Aggressive Classifier | 0.737 |
| Random Forest | 0.726 |
| Decision Tree | 0.680 |
| Gaussian Naive Bayes | 0.475 |


As we can see from the table above, the algorithms are sorted according to their accuracy. It must be noted that these algorithms are yet to be tuned. Hence, algorithms such as XGBoost may well reach a higher accuracy once their parameters are tuned.

## Combining Algorithms (Ensemble)

In the next section, I will be experimenting with combining the top few algorithms to see if it is possible to achieve an even higher accuracy without tuning them.

In [25]:
from sklearn.ensemble import VotingClassifier

ensemble = VotingClassifier(estimators = [('MultinomialNB', MNB)
                            , ('Logistic regression', LR)]
                            , voting = 'soft', weights = [1.5,1]).fit(x_train, y_train)

print('The accuracy of the combined algorithms is:',ensemble.score(x_test, y_test))

The accuracy of the combined algorithms is: 0.7793715846994536
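With `voting = 'soft'`, the ensemble averages the class probabilities predicted by each model, weighted by `weights`, and picks the class with the highest average. The sketch below reproduces that rule by hand for a single tweet. The probabilities are made up for illustration and the helper `soft_vote` is mine, not part of scikit-learn.

```python
def soft_vote(probabilities, weights, labels):
    """Weighted average of per-model class probabilities, then argmax."""
    total_weight = sum(weights)
    averaged = [
        sum(w * p[k] for w, p in zip(weights, probabilities)) / total_weight
        for k in range(len(labels))
    ]
    best = max(range(len(labels)), key=lambda k: averaged[k])
    return labels[best], averaged

labels = ["negative", "neutral", "positive"]
# Made-up probabilities for one tweet from two hypothetical models.
p_nb = [0.55, 0.30, 0.15]
p_lr = [0.30, 0.60, 0.10]

label, averaged = soft_vote([p_nb, p_lr], weights=[1.5, 1], labels=labels)
print(label, [round(a, 2) for a in averaged])

# With equal weights the second model's preference wins instead.
print(soft_vote([p_nb, p_lr], weights=[1, 1], labels=labels)[0])
```

In this example the 1.5 weight on the first model is enough to flip the prediction from neutral to negative, which is why the choice of weights matters when the models disagree.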


After trying out different weights and mixing different algorithms, the best pair is the one above. As we can see, the accuracy of the ensemble (about 0.78) is not meaningfully different from that of the best single models, such as logistic regression (about 0.79). This suggests that feature selection and data engineering, rather than the choice of algorithm, are where the accuracy can be changed the most.