<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NLP Exercise

In today's exercise we have ~15,000 tweets aimed at various US airlines. Their sentiments are pre-labelled, and it's your task to build a classifier that can predict a tweet's sentiment based on its text.

#### Import the data

In [1]:
import pandas as pd

df = pd.read_csv("../assets/data/airline_tweets.csv", encoding="latin-1")
print(df.shape)
df.head()

(14640, 3)


Unnamed: 0,airline_sentiment,airline,text
0,neutral,Virgin America,@VirginAmerica What @dhepburn said.
1,positive,Virgin America,@VirginAmerica plus you've added commercials t...
2,neutral,Virgin America,@VirginAmerica I didn't today... Must mean I n...
3,negative,Virgin America,@VirginAmerica it's really aggressive to blast...
4,negative,Virgin America,@VirginAmerica and it's a really big bad thing...


#### 1: Exploration

- how many airlines are there?
- what is the proportion of sentiment across tweets?

In [2]:
df.airline.value_counts() # 6 airlines - not with particularly even sample size

United            3822
US Airways        2913
American          2759
Southwest         2420
Delta             2222
Virgin America     504
Name: airline, dtype: int64

#### 2: Data cleaning

- tidy up the tweets as you see fit
- you might want to remove Twitter handles for example

In [3]:
df.isnull().sum() # No null values 

airline_sentiment    0
airline              0
text                 0
dtype: int64

In [4]:
df.describe()

Unnamed: 0,airline_sentiment,airline,text
count,14640,14640,14640
unique,3,6,14427
top,negative,United,@united thanks
freq,9178,3822,6


In [5]:
df[df.text.duplicated()].head() #A bit suspicious how many of these there are

Unnamed: 0,airline_sentiment,airline,text
331,positive,Virgin America,@VirginAmerica Thanks!
515,positive,United,@united thanks
1477,positive,United,@united thank you!
1864,positive,United,@united thank you
1938,positive,United,@united thank you


In [6]:
# Check positive sentiments per airline
df[df.airline_sentiment == 'positive'].groupby('airline').agg(['count'])

Unnamed: 0_level_0,airline_sentiment,text
Unnamed: 0_level_1,count,count
airline,Unnamed: 1_level_2,Unnamed: 2_level_2
American,336,336
Delta,544,544
Southwest,570,570
US Airways,269,269
United,492,492
Virgin America,152,152


In [7]:
# Check negative sentiments per airline
df[df.airline_sentiment == 'negative'].groupby('airline').agg(['count'])

Unnamed: 0_level_0,airline_sentiment,text
Unnamed: 0_level_1,count,count
airline,Unnamed: 1_level_2,Unnamed: 2_level_2
American,1960,1960
Delta,955,955
Southwest,1186,1186
US Airways,2263,2263
United,2633,2633
Virgin America,181,181


In [8]:
# Check neutral sentiments per airline
df[df.airline_sentiment == 'neutral'].groupby('airline').agg(['count'])

Unnamed: 0_level_0,airline_sentiment,text
Unnamed: 0_level_1,count,count
airline,Unnamed: 1_level_2,Unnamed: 2_level_2
American,463,463
Delta,723,723
Southwest,664,664
US Airways,381,381
United,697,697
Virgin America,171,171


#### 3: Train-test split

Do a train-test split so we can test our best algorithm at the end

In [23]:
from sklearn.model_selection import train_test_split
# Note, CountVectorizer can only take one feature in X
X = df.loc[:, "text"]
y = df.loc[:, "airline_sentiment"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
y.value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64

#### 4: Try a simple binary count-based model

- Transform your raw text into binary features
- *Hint: it's a simple parameter change in `CountVectorizer`*
- Choose an appropriate machine learning model for the task
- Use cross-validation to evaluate your model's performance on the training set

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(binary=True,
                      stop_words='english',
                      lowercase=True # default
                     )

# Transform the training feature into a vectorized feature
X_train_text = vec.fit_transform(X_train)
X_test_text = vec.transform(X_test)

In [25]:
# Fit RadomForest model  on vectorized training feature
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier()

scores = cross_val_score(rf, X_train_text, y_train, scoring="accuracy", cv=7) # You can't use f1 scoring for non-binary targets
print(scores, np.mean(scores))


[0.72354949 0.72354949 0.7295082  0.71857923 0.71857923 0.72131148
 0.72024624] 0.7221890508879734


#### 5: Try playing around with some of the options in `CountVectorizer`

In [26]:
# Set a function to try different variations of CountVectorizer
def try_new_vectoriser(vec, X, y):
    X_train_text = vec.fit_transform(X)
    rf = RandomForestClassifier()
    scores = cross_val_score(rf, X_train_text, y_train, scoring="accuracy", cv=7)
    print(scores, np.mean(scores))

In [27]:
# Try a CountVectorizer model with minimum document presence of word = 2
try_new_vectoriser(CountVectorizer(binary=True,
                                   stop_words='english',
                                   min_df=2),X_train,y_train)

[0.74880546 0.71194539 0.73565574 0.7260929  0.71857923 0.71174863
 0.72161423] 0.724920226151535


In [40]:
# Try a CountVectorizer model with minimum document presence of word = 2 AND an actual count of features AND max features AND a mix of words/n-grams
try_new_vectoriser(CountVectorizer(binary=False,
                                   stop_words='english',
                                   min_df=2, max_features=1000, ngram_range=(1, 2)),X_train,y_train)

[0.71877133 0.7112628  0.72131148 0.71516393 0.71038251 0.71243169
 0.72093023] 0.7157505685339015


In [35]:
# Try a CountVectorizer model with minimum document presence of word = 2 AND an actual count of features
try_new_vectoriser(CountVectorizer(binary=False,
                                   stop_words='english',
                                   min_df=2),X_train,y_train)

[0.73651877 0.72423208 0.72677596 0.71448087 0.7260929  0.73155738
 0.72777018] 0.7267754478437192


In [37]:
best_vectorizer = CountVectorizer(binary=False,
                                   stop_words='english',
                                   min_df=2)

X_train_text = best_vectorizer.fit_transform(X_train)

rf = RandomForestClassifier()
rf.fit(X_train_text, y_train);

In [38]:
def get_feature_importances(vocabulary, rf_importances, top_n):
    vocab_features = sorted(vocabulary.items(), key=lambda x: x[1])
    importances = zip(vocab_features, rf_importances)
    
    for z in sorted(importances, key=lambda x: abs(x[1]), reverse=True)[:top_n]:
        print(z)

get_feature_importances(best_vectorizer.vocabulary_, rf.feature_importances_, 10)

(('thank', 4452), 0.03268685128557979)
(('thanks', 4456), 0.03103994198730629)
(('jetblue', 2602), 0.013376437698327494)
(('great', 2206), 0.011843815339328526)
(('http', 2400), 0.010025125963450475)
(('southwestair', 4194), 0.008479834424117271)
(('flight', 1989), 0.007950787896334114)
(('hours', 2391), 0.007360461031742852)
(('usairways', 4728), 0.007099562524584833)
(('love', 2833), 0.0067671784261059825)


#### 6: Try moving to a TF-IDF model. Can you improve on your best score?

In [46]:
# Model measures TF-IDF - a value that measure the importance of a word in a document
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer(stop_words="english",
                            min_df=2,
                            max_features=1000)

X_train_text = tfidf_vec.fit_transform(X_train)
rf_best = RandomForestClassifier()
scores = cross_val_score(rf_best, X_train_text, y_train, scoring="accuracy", cv=7)
print(scores, np.mean(scores))

[0.74266212 0.71262799 0.73702186 0.73702186 0.72882514 0.72472678
 0.74281806] 0.7322433983228457


In [45]:
# Get most important words
rf = RandomForestClassifier()
rf.fit(tfidf_vec.fit_transform(X_train), y_train)
get_feature_importances(tfidf_vec.vocabulary_, rf.feature_importances_, 10)

(('thanks', 860), 0.0428060935685377)
(('thank', 859), 0.03600870730320447)
(('jetblue', 481), 0.024799050988902043)
(('southwestair', 807), 0.02235506218820487)
(('united', 916), 0.019430878777553902)
(('americanair', 67), 0.017982927778568865)
(('usairways', 925), 0.0174655509630593)
(('http', 452), 0.014306430105298047)
(('great', 403), 0.013447209419994286)
(('virginamerica', 937), 0.012714783539499605)


#### 7: Evaluate your best model (the one with the highest cross-validated score) on your test set. What is its final performance?

When you evaluate your model on the test set, remember to **only fit your Vectorizer on the training set** and use the fit Vectorizer to transform the test set. This is just like the rule for using z-score standardisation. We don't want the test set to interfere in our data transformations.

Also make sure to look at more than just a single score - consider printing the confusion matrix for example!

In [53]:
# Transform training features into vectorized features
X_test_text = tfidf_vec.fit_transform(X_test)

# Fit random forest (or other) model with vectorized training features
rf_best.fit(tfidf_vec.fit_transform(X_train), y_train)

# Predict test set targets using test features, with model trained on training data (vectorized)
rf_best.predict(X_test_text)

array(['negative', 'negative', 'positive', ..., 'negative', 'neutral',
       'positive'], dtype=object)

In [None]:
# Check accuracy of test data
scores = cross_val_score(rf_best, X_test_text, y_test, scoring="accuracy", cv=7)
print(scores, np.mean(scores))

In [59]:
df['prediction'] = rf_best.predict(tfidf_vec.fit_transform(df['text']))
df.head()

Unnamed: 0,airline_sentiment,airline,text,prediction
0,neutral,Virgin America,@VirginAmerica What @dhepburn said.,negative
1,positive,Virgin America,@VirginAmerica plus you've added commercials t...,negative
2,neutral,Virgin America,@VirginAmerica I didn't today... Must mean I n...,negative
3,negative,Virgin America,@VirginAmerica it's really aggressive to blast...,negative
4,negative,Virgin America,@VirginAmerica and it's a really big bad thing...,positive
