# Context
## Description
A sentiment analysis job about the problems of each major U.S. airline.<br/>
Twitter data was scraped from February 2015, and contributors were asked to:
- Classify tweets as positive, negative, or neutral.
- Categorize the reasons for negative tweets (such as "late flight" or "rude service").

## Dataset
The dataset must be downloaded from: https://www.kaggle.com/crowdflower/twitter-airline-sentiment<br/>
It contains the following columns:
* tweet_id
* airline_sentiment
* airline_sentiment_confidence
* negativereason
* negativereason_confidence
* airline
* airline_sentiment_gold
* name
* negativereason_gold
* retweet_count
* text
* tweet_coord
* tweet_created
* tweet_location
* user_timezone

# Objective
To implement the techniques learnt as a part of the course.

# Learning Outcomes
- Basic understanding of text pre-processing.
- What to do after text pre-processing:
    - Bag of words
    - Tf-idf
- Build the classification model.
- Evaluate the Model.
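
To make the two representations named above concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. The corpus and the helper names (`bag_of_words`, `tf_idf`) are illustrative assumptions. The idf formula is the smoothed variant, ln((1 + n) / (1 + df)) + 1, which scikit-learn's `TfidfVectorizer` uses by default; this sketch skips the final L2 normalization.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the airline dataset).
docs = ["late flight late", "great flight", "lost bag"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bag_of_words(tokens):
    """Raw term counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tf_idf(tokens):
    """Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    n = len(tokenized)
    counts = Counter(tokens)
    weights = []
    for w in vocab:
        df = sum(1 for doc in tokenized if w in doc)  # documents containing w
        idf = math.log((1 + n) / (1 + df)) + 1
        weights.append(counts[w] * idf)
    return weights

bow = bag_of_words(tokenized[0])
tfidf = tf_idf(tokenized[0])
print(vocab)
print(bow)
print([round(x, 3) for x in tfidf])
```

In the first document, "late" gets a higher weight than "flight" both because it occurs twice and because it appears in fewer documents.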

## 1.1 Import libraries and load dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_colwidth', 0) # Display full column contents (non-truncated text column).

In [2]:
tweetsData = pd.read_csv("Tweets.csv")

## 1.2 Shape of the dataset

In [3]:
tweetsData.shape

(14640, 15)

## 1.3 Data description

In [4]:
tweetsData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
tweet_id                        14640 non-null int64
airline_sentiment               14640 non-null object
airline_sentiment_confidence    14640 non-null float64
negativereason                  9178 non-null object
negativereason_confidence       10522 non-null float64
airline                         14640 non-null object
airline_sentiment_gold          40 non-null object
name                            14640 non-null object
negativereason_gold             32 non-null object
retweet_count                   14640 non-null int64
text                            14640 non-null object
tweet_coord                     1019 non-null object
tweet_created                   14640 non-null object
tweet_location                  9907 non-null object
user_timezone                   9820 non-null object
dtypes: float64(2), int64(2), object(11)
memory usage: 1.7+ MB


The following columns contain null values:
* negativereason
* negativereason_confidence
* airline_sentiment_gold
* negativereason_gold
* tweet_coord
* tweet_location
* user_timezone
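
The same list can be derived programmatically instead of being read off `info()`. A small sketch on a made-up frame (the toy data is an assumption; `isnull().sum()` is standard pandas):

```python
import pandas as pd

# Hypothetical miniature of the tweets table, with gaps in two columns.
toy = pd.DataFrame({
    "airline_sentiment": ["neutral", "negative", "positive"],
    "negativereason": [None, "Late Flight", None],
    "tweet_coord": [None, None, None],
})

null_counts = toy.isnull().sum()                          # nulls per column
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
print(columns_with_nulls)
```

On the real data, the equivalent call would be `tweetsData.isnull().sum()`.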

In [5]:
tweetsData.head(5)

| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials to the experience... tacky. | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today... Must mean I need to take another trip! | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces `&amp;` they have little recourse | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it's a really big bad thing about it | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |


In [6]:
tweetsData["airline"].value_counts(dropna = False)

United            3822
US Airways        2913
American          2759
Southwest         2420
Delta             2222
Virgin America    504 
Name: airline, dtype: int64

In [7]:
tweetsData["airline_sentiment"].value_counts(dropna = False)

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64

In [8]:
tweetsData["negativereason"].value_counts(dropna = False)

NaN                            5462
Customer Service Issue         2910
Late Flight                    1665
Can't Tell                     1190
Cancelled Flight               847 
Lost Luggage                   724 
Bad Flight                     580 
Flight Booking Problems        529 
Flight Attendant Complaints    481 
longlines                      178 
Damaged Luggage                74  
Name: negativereason, dtype: int64

In [9]:
tweetsData["airline_sentiment_gold"].value_counts(dropna = False)

NaN         14600
negative    32   
positive    5    
neutral     3    
Name: airline_sentiment_gold, dtype: int64

In [10]:
tweetsData["negativereason_gold"].value_counts(dropna = False)

NaN                                         14608
Customer Service Issue                      12   
Late Flight                                 4    
Can't Tell                                  3    
Cancelled Flight                            3    
Cancelled Flight\nCustomer Service Issue    2    
Lost Luggage\nDamaged Luggage               1    
Flight Attendant Complaints                 1    
Customer Service Issue\nCan't Tell          1    
Bad Flight                                  1    
Late Flight\nLost Luggage                   1    
Customer Service Issue\nLost Luggage        1    
Late Flight\nFlight Attendant Complaints    1    
Late Flight\nCancelled Flight               1    
Name: negativereason_gold, dtype: int64

## 2.1 Drop all other columns except “text” and “airline_sentiment”

In [11]:
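# .copy() makes an independent frame, so later column assignments do not trigger SettingWithCopyWarning.
tweetsData_reduced = tweetsData[["text", "airline_sentiment"]].copy()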

## 2.2 Shape of new dataset

In [12]:
tweetsData_reduced.shape

(14640, 2)

## 2.3 Print first 5 rows of this dataset

In [13]:
tweetsData_reduced.head(5)

| | text | airline_sentiment |
|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral |
| 1 | @VirginAmerica plus you've added commercials to the experience... tacky. | positive |
| 2 | @VirginAmerica I didn't today... Must mean I need to take another trip! | neutral |
| 3 | @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces `&amp;` they have little recourse | negative |
| 4 | @VirginAmerica and it's a really big bad thing about it | negative |


## 3.1 Text pre-processing: HTML tag removal

In [14]:
from bs4 import BeautifulSoup

def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

tweetsData_reduced['text'] = tweetsData_reduced['text'].apply(strip_html)
tweetsData_reduced.head(10)


| | text | airline_sentiment |
|---|---|---|
| 0 | @VirginAmerica What @dhepburn said. | neutral |
| 1 | @VirginAmerica plus you've added commercials to the experience... tacky. | positive |
| 2 | @VirginAmerica I didn't today... Must mean I need to take another trip! | neutral |
| 3 | @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse | negative |
| 4 | @VirginAmerica and it's a really big bad thing about it | negative |
| 5 | @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA | negative |
| 6 | @VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :) | positive |
| 7 | @VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP | neutral |
| 8 | @virginamerica Well, I didn't…but NOW I DO! :-D | positive |
| 9 | @VirginAmerica it was amazing, and arrived an hour early. You're too good to me. | positive |


* In row 3, the HTML entity `&amp;` was decoded back to `&`.
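
The entity decoding noted above can be reproduced in isolation. This sketch uses the standard library's `html.unescape` rather than BeautifulSoup, so it only decodes entities and does not strip tags; it is an alternative shown for illustration, not the method used in the cell above.

```python
import html

# Fragment of the row 3 tweet, with the entity still encoded.
raw = "your guests' faces &amp; they have little recourse"
decoded = html.unescape(raw)   # "&amp;" becomes "&"
print(decoded)
```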

## 3.2 Text pre-processing: Tokenization

In [15]:
from nltk.tokenize import word_tokenize  # Import tokenizer (needs NLTK's Punkt data, installed via nltk.download()).

tweetsData_reduced['text'] = tweetsData_reduced['text'].apply(word_tokenize)  # Tokenize each tweet

tweetsData_reduced.head(10)


| | text | airline_sentiment |
|---|---|---|
| 0 | [@, VirginAmerica, What, @, dhepburn, said, .] | neutral |
| 1 | [@, VirginAmerica, plus, you, 've, added, commercials, to, the, experience, ..., tacky, .] | positive |
| 2 | [@, VirginAmerica, I, did, n't, today, ..., Must, mean, I, need, to, take, another, trip, !] | neutral |
| 3 | [@, VirginAmerica, it, 's, really, aggressive, to, blast, obnoxious, ``, entertainment, '', in, your, guests, ', faces, &, they, have, little, recourse] | negative |
| 4 | [@, VirginAmerica, and, it, 's, a, really, big, bad, thing, about, it] | negative |
| 5 | [@, VirginAmerica, seriously, would, pay, $, 30, a, flight, for, seats, that, did, n't, have, this, playing, ., it, 's, really, the, only, bad, thing, about, flying, VA] | negative |
| 6 | [@, VirginAmerica, yes, ,, nearly, every, time, I, fly, VX, this, “, ear, worm, ”, won, ’, t, go, away, :, )] | positive |
| 7 | [@, VirginAmerica, Really, missed, a, prime, opportunity, for, Men, Without, Hats, parody, ,, there, ., https, :, //t.co/mWpG7grEZP] | neutral |
| 8 | [@, virginamerica, Well, ,, I, didn't…but, NOW, I, DO, !, :, -D] | positive |
| 9 | [@, VirginAmerica, it, was, amazing, ,, and, arrived, an, hour, early, ., You, 're, too, good, to, me, .] | positive |


## 3.3 Text pre-processing: Remove the numbers

In [16]:
import re

def remove_numbers(words):
    """Remove digits from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'\d+', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

## 3.4 Text pre-processing: Removal of Special Characters and Punctuations

In [17]:
def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

In [18]:
import unicodedata

def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        if new_word != '':
            new_words.append(new_word)
    return new_words
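
To see what the three filters above do to actual tokens, the sketch below restates them as comprehensions (a compact equivalent written for this illustration, not the notebook's own functions). Most tokens come from the tokenized tweets shown earlier; the accented word is made up to exercise the ASCII step.

```python
import re
import unicodedata

def strip_pattern(words, pattern):
    """Delete every match of `pattern` inside each token; drop empty tokens."""
    cleaned = (re.sub(pattern, '', w) for w in words)
    return [w for w in cleaned if w]

def to_ascii(words):
    """Decompose characters (NFKD) and drop whatever is not ASCII."""
    cleaned = (unicodedata.normalize('NFKD', w).encode('ascii', 'ignore').decode('ascii')
               for w in words)
    return [w for w in cleaned if w]

# '\u00e9' is e-acute (hypothetical token), '\u201c' is a left curly quote.
tokens = ['pay', '$', '30', 'did', "n't", '...', 'caf\u00e9', '\u201c']

no_digits = strip_pattern(tokens, r'\d+')        # '30' disappears
no_punct = strip_pattern(no_digits, r'[^\w\s]')  # '$', '...' and the curly quote disappear; "n't" becomes 'nt'
ascii_only = to_ascii(no_punct)                  # the accent is stripped from the last word
print(ascii_only)
```

The `nt` token explains the stray "nt" that shows up in the normalized tweets later on.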

In [19]:
from nltk.corpus import stopwords                       # Import stopwords.

# Negation words such as "not" and "couldn't" carry sentiment, so they are
# excluded from the stop-word list and therefore stay in the data.
customlist = ['not', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
        "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
        "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
        "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# Use a separate name so the imported `stopwords` module is not overwritten
# (re-running the cell would otherwise fail).
stop_words = set(stopwords.words('english')) - set(customlist)

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words
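
One subtlety of list-based filtering is that the membership test is case-sensitive, while NLTK's English stop-word list is all lowercase. The sketch below uses a tiny hand-written stop list (an assumption, standing in for the NLTK list) to show the effect of lowercasing after versus before the filter.

```python
# Tiny stand-in for the NLTK list; the real list is longer but also lowercase.
stop_list = {'i', 'to', 'what', 'you', 'the'}

def drop_stop_words(words):
    """Keep only tokens that are not in the stop list (case-sensitive)."""
    return [w for w in words if w not in stop_list]

tokens = ['I', 'need', 'to', 'take', 'another', 'trip']

filter_then_lower = [w.lower() for w in drop_stop_words(tokens)]
lower_then_filter = drop_stop_words([w.lower() for w in tokens])

print(filter_then_lower)   # 'i' survives because 'I' != 'i'
print(lower_then_filter)
```

Which order is preferable depends on whether capitalized stop words should count as content; the point here is only that the two orders give different results.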

## 3.5 Text pre-processing: Convert to lowercase

In [20]:
def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

## 3.6 Text pre-processing: Lemmatize

In [21]:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

from nltk.stem.wordnet import WordNetLemmatizer         # Import Lemmatizer.
import nltk

lemmatizer = WordNetLemmatizer()
    
def lemmatize_list(words):
    new_words = []
    
    for word, pos in nltk.pos_tag(words):
        if (word != ''):
            new_words.append(lemmatizer.lemmatize(word, get_wordnet_pos(pos)))        
    return new_words

## 3.7 Text pre-processing: Normalize

In [22]:
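def normalize(words):
    """Run the full cleaning pipeline on a token list and return one string."""
    words = remove_numbers(words)
    words = remove_punctuation(words)
    words = remove_non_ascii(words)
    # Stop words are removed before lowercasing, so capitalized stop words
    # such as "I" or "What" survive (visible in the output below).
    words = remove_stopwords(words)
    words = to_lowercase(words)
    words = lemmatize_list(words)
    return ' '.join(words)

tweetsData_reduced['text'] = tweetsData_reduced['text'].apply(normalize)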


In [23]:
tweetsData_reduced.head(10)

| | text | airline_sentiment |
|---|---|---|
| 0 | virginamerica what dhepburn say | neutral |
| 1 | virginamerica plus added commercial experience tacky | positive |
| 2 | virginamerica i nt today must mean i need take another trip | neutral |
| 3 | virginamerica really aggressive blast obnoxious entertainment guest face little recourse | negative |
| 4 | virginamerica really big bad thing | negative |
| 5 | virginamerica seriously would pay flight seat nt play really bad thing fly va | negative |
| 6 | virginamerica yes nearly every time i fly vx ear worm win go away | positive |
| 7 | virginamerica really miss prime opportunity men without hat parody https tcomwpggrezp | neutral |
| 8 | virginamerica well i didntbut now i do d | positive |
| 9 | virginamerica amaze arrive hour early you good | positive |


## 4.1 Vectorization - Use CountVectorizer + 1000 most-frequently used features

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000)
tweetsData_reduced_features_1 = vectorizer.fit_transform(tweetsData_reduced['text'])
tweetsData_reduced_features_1.shape

(14640, 1000)

## 4.2 Vectorization - Use TfidfVectorizer + 1000 most-frequently used features

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000)
tweetsData_reduced_features_2 = vectorizer.fit_transform(tweetsData_reduced['text'])
tweetsData_reduced_features_2.shape

(14640, 1000)
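
What `max_features` does is easier to see on a corpus small enough to count by hand. In this sketch (toy corpus assumed), capping the vocabulary at two features keeps the two terms with the highest total frequency, and the TF-IDF rows come out L2-normalized, which is the default for `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus: 'late' occurs 3 times, 'flight' 2, the rest once.
corpus = ["late flight late", "late bag", "flight cancelled"]

count_vec = CountVectorizer(max_features=2)
counts = count_vec.fit_transform(corpus)
kept_terms = sorted(count_vec.vocabulary_)       # terms that survived the cap
print(kept_terms, counts.shape)

tfidf_vec = TfidfVectorizer(max_features=2)
tfidf = tfidf_vec.fit_transform(corpus)
# Euclidean length of each row of the sparse TF-IDF matrix.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(row_norms)
```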

## 4.3 Split 4.1 & 4.2 data into training and testing sets

In [26]:
labels = tweetsData_reduced['airline_sentiment']
labels = labels.replace("neutral", "0").replace("negative", "-1").replace("positive", "1")
labels = labels.astype('int')

from sklearn.model_selection import train_test_split

X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(tweetsData_reduced_features_1, labels, test_size=0.3, random_state=1)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(tweetsData_reduced_features_2, labels, test_size=0.3, random_state=1)

## 5.1 Evaluate model - Use LinearSVC + 4.1 data

In [27]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

svc1 = LinearSVC(random_state=1)
svc1 = svc1.fit(X_train_1, y_train_1)

print(svc1)
print(np.mean(cross_val_score(svc1, tweetsData_reduced_features_1, labels, cv=10)))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=1, tol=0.0001,
          verbose=0)
0.755327868852459
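
Note that the cross-validation above scores features that were vectorized on the full dataset, so each fold's vocabulary was chosen with its own test rows in view. One way to keep the vectorizer inside each fold is a scikit-learn `Pipeline`. The sketch below uses a made-up eight-tweet corpus and labels purely to show the wiring; it says nothing about how the scores on the airline data would change.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini corpus: 1 = positive, -1 = negative.
texts = ["great flight", "love crew", "great service", "love seat",
         "late flight", "lost bag", "late again", "lost luggage"]
y = [1, 1, 1, 1, -1, -1, -1, -1]

# The vectorizer is refit on the training part of every fold.
model = make_pipeline(CountVectorizer(max_features=1000), LinearSVC(random_state=1))
scores = cross_val_score(model, texts, y, cv=2)
print(scores.mean())
```

With this arrangement the raw text column, not a pre-computed feature matrix, is what gets passed to `cross_val_score`.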


In [28]:
result1 = svc1.predict(X_test_1)
acc1 = svc1.score(X_test_1, y_test_1)

In [29]:
from sklearn.metrics import confusion_matrix

conf_mat_1 = confusion_matrix(y_test_1, result1)
print(conf_mat_1)

resultsDf = pd.DataFrame({'Method':['LinearSVC + CountVectorizer 1000'],
                              'Accuracy': [acc1],
                              'Correct Negative': conf_mat_1[0,0], 
                              'Correct Neutral': conf_mat_1[1,1],
                              'Correct Positive': conf_mat_1[2,2]})
resultsDf

[[2415  233   93]
 [ 298  543   95]
 [ 147  106  462]]


| | Method | Accuracy | Correct Negative | Correct Neutral | Correct Positive |
|---|---|---|---|---|---|
| 0 | LinearSVC + CountVectorizer 1000 | 0.778689 | 2415 | 543 | 462 |
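
In scikit-learn's `confusion_matrix`, rows are true classes and columns are predicted classes, ordered by sorted label, so the order here is negative (-1), neutral (0), positive (1). The sketch below recomputes the accuracy from the matrix printed above and picks out the most frequent kind of mistake, using plain lists so it runs without the fitted model.

```python
# Matrix copied from the LinearSVC + CountVectorizer (1000 features) output.
conf_mat = [[2415, 233, 93],
            [298, 543, 95],
            [147, 106, 462]]
class_names = ["negative", "neutral", "positive"]

total = sum(sum(row) for row in conf_mat)
correct = sum(conf_mat[i][i] for i in range(3))
accuracy = correct / total                      # diagonal over all test tweets

# Largest off-diagonal cell: (count, true class, predicted class).
errors = [(conf_mat[t][p], class_names[t], class_names[p])
          for t in range(3) for p in range(3) if t != p]
worst = max(errors)

print(round(accuracy, 6))
print(worst)
```

The most common error is a neutral tweet being labeled negative, which is plausible given how strongly negative tweets dominate the data.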


## 5.2 Evaluate model - Use LinearSVC + 4.2 data

In [30]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

svc2 = LinearSVC(random_state=1)
svc2 = svc2.fit(X_train_2, y_train_2)

print(svc2)
print(np.mean(cross_val_score(svc2, tweetsData_reduced_features_2, labels, cv=10)))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=1, tol=0.0001,
          verbose=0)
0.7626366120218578


In [31]:
result2 = svc2.predict(X_test_2)
acc2 = svc2.score(X_test_2, y_test_2)

In [32]:
from sklearn.metrics import confusion_matrix

conf_mat_2 = confusion_matrix(y_test_2, result2)
print(conf_mat_2)

resultsDf = pd.concat([resultsDf, pd.DataFrame({'Method':['LinearSVC + TfidfVectorizer 1000'], 
                          'Accuracy': [acc2],
                          'Correct Negative': conf_mat_2[0,0], 
                          'Correct Neutral': conf_mat_2[1,1],
                          'Correct Positive': conf_mat_2[2,2]})])
resultsDf

[[2461  206   74]
 [ 342  517   77]
 [ 164   98  453]]


| | Method | Accuracy | Correct Negative | Correct Neutral | Correct Positive |
|---|---|---|---|---|---|
| 0 | LinearSVC + CountVectorizer 1000 | 0.778689 | 2415 | 543 | 462 |
| 0 | LinearSVC + TfidfVectorizer 1000 | 0.781193 | 2461 | 517 | 453 |


## 6.1 Vectorization - Use CountVectorizer + all unigrams, bigrams and trigrams features

In [33]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1,3))
tweetsData_reduced_features_3 = vectorizer.fit_transform(tweetsData_reduced['text'])
tweetsData_reduced_features_3.shape

(14640, 210594)
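
The feature count of 210,594 (against the 1,000-feature cap used earlier) comes from counting every distinct run of one, two, or three consecutive words. A minimal sketch of that expansion on one normalized tweet from above; the helper `ngrams` is written for this illustration, while `CountVectorizer` does the equivalent internally.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

tokens = "virginamerica really big bad thing".split()
features = ngrams(tokens)
print(len(features))   # 5 unigrams + 4 bigrams + 3 trigrams
print(features[-3:])   # the three trigrams
```

A five-word tweet already yields twelve features, and most bigrams and trigrams are unique to a handful of tweets, which is why the vocabulary grows so quickly.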

## 6.2 Vectorization - Use TfidfVectorizer + all unigrams, bigrams and trigrams features

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,3))
tweetsData_reduced_features_4 = vectorizer.fit_transform(tweetsData_reduced['text'])
tweetsData_reduced_features_4.shape

(14640, 210594)

## 6.3 Split 6.1 & 6.2 data into training and testing sets

In [35]:
labels = tweetsData_reduced['airline_sentiment']
labels = labels.replace("neutral", "0").replace("negative", "-1").replace("positive", "1")
labels = labels.astype('int')

from sklearn.model_selection import train_test_split

X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(tweetsData_reduced_features_3, labels, test_size=0.3, random_state=1)
X_train_4, X_test_4, y_train_4, y_test_4 = train_test_split(tweetsData_reduced_features_4, labels, test_size=0.3, random_state=1)

## 7.1 Evaluate model - Use LinearSVC + 6.1 data

In [36]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

svc3 = LinearSVC(random_state=1)
svc3 = svc3.fit(X_train_3, y_train_3)  # Fit the new model on the n-gram count features

print(svc3)
print(np.mean(cross_val_score(svc3, tweetsData_reduced_features_3, labels, cv=10)))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=1, tol=0.0001,
          verbose=0)
0.7616120218579233


In [37]:
result3 = svc3.predict(X_test_3)
acc3 = svc3.score(X_test_3, y_test_3)

In [38]:
from sklearn.metrics import confusion_matrix

conf_mat_3 = confusion_matrix(y_test_3, result3)
print(conf_mat_3)

resultsDf = pd.concat([resultsDf, pd.DataFrame({'Method':['LinearSVC + CountVectorizer all'], 
                          'Accuracy': [acc3],
                          'Correct Negative': conf_mat_3[0,0], 
                          'Correct Neutral': conf_mat_3[1,1],
                          'Correct Positive': conf_mat_3[2,2]})])
resultsDf

[[2399  250   92]
 [ 284  562   90]
 [ 114  116  485]]


| | Method | Accuracy | Correct Negative | Correct Neutral | Correct Positive |
|---|---|---|---|---|---|
| 0 | LinearSVC + CountVectorizer 1000 | 0.778689 | 2415 | 543 | 462 |
| 0 | LinearSVC + TfidfVectorizer 1000 | 0.781193 | 2461 | 517 | 453 |
| 0 | LinearSVC + CountVectorizer all | 0.784608 | 2399 | 562 | 485 |


## 7.2 Evaluate model - Use LinearSVC + 6.2 data

In [39]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

svc4 = LinearSVC(random_state=1)
svc4 = svc4.fit(X_train_4, y_train_4)

print(svc4)
print(np.mean(cross_val_score(svc4, tweetsData_reduced_features_4, labels, cv=10)))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=1, tol=0.0001,
          verbose=0)
0.7550546448087432


In [40]:
result4 = svc4.predict(X_test_4)
acc4 = svc4.score(X_test_4, y_test_4)

In [41]:
from sklearn.metrics import confusion_matrix

conf_mat_4 = confusion_matrix(y_test_4, result4)
print(conf_mat_4)

resultsDf = pd.concat([resultsDf, pd.DataFrame({'Method':['LinearSVC + TfidfVectorizer all'], 
                          'Accuracy': [acc4],
                          'Correct Negative': conf_mat_4[0,0], 
                          'Correct Neutral': conf_mat_4[1,1],
                          'Correct Positive': conf_mat_4[2,2]})])
resultsDf

[[2619   90   32]
 [ 488  404   44]
 [ 230   72  413]]


| | Method | Accuracy | Correct Negative | Correct Neutral | Correct Positive |
|---|---|---|---|---|---|
| 0 | LinearSVC + CountVectorizer 1000 | 0.778689 | 2415 | 543 | 462 |
| 0 | LinearSVC + TfidfVectorizer 1000 | 0.781193 | 2461 | 517 | 453 |
| 0 | LinearSVC + CountVectorizer all | 0.784608 | 2399 | 562 | 485 |
| 0 | LinearSVC + TfidfVectorizer all | 0.782332 | 2619 | 404 | 413 |


## 8. Summary

In [42]:
resultsDf

| | Method | Accuracy | Correct Negative | Correct Neutral | Correct Positive |
|---|---|---|---|---|---|
| 0 | LinearSVC + CountVectorizer 1000 | 0.778689 | 2415 | 543 | 462 |
| 0 | LinearSVC + TfidfVectorizer 1000 | 0.781193 | 2461 | 517 | 453 |
| 0 | LinearSVC + CountVectorizer all | 0.784608 | 2399 | 562 | 485 |
| 0 | LinearSVC + TfidfVectorizer all | 0.782332 | 2619 | 404 | 413 |


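* Pre-processing and vectorization turn raw tweets into numeric features that can be fed to the model.
* Adding bigram and trigram features raised held-out accuracy only slightly: from 0.7787 to 0.7846 with CountVectorizer and from 0.7812 to 0.7823 with TfidfVectorizer.
* The gain is not uniform: with TfidfVectorizer the 10-fold cross-validation mean fell from 0.7626 to 0.7551, and the number of correctly classified neutral tweets dropped from 517 to 404.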
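
Because the classes are imbalanced (negative tweets dominate), accuracy alone can hide shifts between classes. As a cross-check, the sketch below computes macro-averaged recall, the unweighted mean of per-class recall, from the four confusion matrices printed above. The matrices are copied from the outputs; the choice of metric is an addition for illustration, not part of the original evaluation.

```python
# Confusion matrices copied from sections 5.1, 5.2, 7.1 and 7.2
# (rows = true class, columns = predicted; order: negative, neutral, positive).
matrices = {
    "CountVectorizer 1000": [[2415, 233, 93], [298, 543, 95], [147, 106, 462]],
    "TfidfVectorizer 1000": [[2461, 206, 74], [342, 517, 77], [164, 98, 453]],
    "CountVectorizer all":  [[2399, 250, 92], [284, 562, 90], [114, 116, 485]],
    "TfidfVectorizer all":  [[2619, 90, 32], [488, 404, 44], [230, 72, 413]],
}

def macro_recall(matrix):
    """Unweighted mean of per-class recall (diagonal entry / row sum)."""
    recalls = [row[i] / sum(row) for i, row in enumerate(matrix)]
    return sum(recalls) / len(recalls)

macro = {name: macro_recall(m) for name, m in matrices.items()}
for name, value in macro.items():
    print(f"{name}: {value:.3f}")
```

By this measure the n-gram CountVectorizer model ranks highest and the n-gram TfidfVectorizer model lowest, even though their accuracies are close.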