# Twitter US airline sentiment

In this notebook we examine tweets about US airlines.
It is a popular dataset, description of which can be found at https://www.kaggle.com/crowdflower/twitter-airline-sentiment

Our aim is twofold: 
* to write a **classifier which uses tweet's body only to classify a given tweet as positive, negative or neutral,**
* to **use twitter API** to fetch some recent tweets about airlines and **test our classifier in the real world.**

We write our **own classes for feature extraction**, in particular we **extract emojis and exclamation marks** (both single and multiple) from tweets, as well as perform standard tokenization and stemming using nltk library.

**Our classes import from sklearn mixin classes, so that we can use them together with sklearn's pipelines and grid search.**

Withour further ado, let's begin necessary imports. Note a dependecy on a non standard emoji library (pip install emoji --upgrade)

In [160]:
# standard imports
import pandas as pd
import numpy as np

# regex
import re

# library for emojis regex
import emoji

# nltk import
import nltk
from nltk import PorterStemmer
from nltk.tokenize import RegexpTokenizer
#nltk.download()

# sklearn imports
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import classification_report, f1_score, make_scorer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.externals import joblib

# clasifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# twitter
from twitter import Twitter, OAuth

## EDA

Firstly, let's read in and peek at our raw data.

In [3]:
raw_data = pd.read_csv('tweets.csv')

In [4]:
raw_data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In this exercise we will use solely the tweet's body (text) to predict it's sentiment: positive, negative or neutral.
Because of that, we don't inspect any of the other columns of the table.

Let's look at the distribution of the sentiments for each airline and in total. 

First let's look at raw counts.

In [5]:
pivot  = raw_data.pivot_table(values='tweet_id', index='airline', columns = 'airline_sentiment' ,
                      aggfunc = 'count', margins=True, margins_name = 'Total')
pivot

airline_sentiment,negative,neutral,positive,Total
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
American,1960.0,463.0,336.0,2759.0
Delta,955.0,723.0,544.0,2222.0
Southwest,1186.0,664.0,570.0,2420.0
US Airways,2263.0,381.0,269.0,2913.0
United,2633.0,697.0,492.0,3822.0
Virgin America,181.0,171.0,152.0,504.0
Total,9178.0,3099.0,2363.0,14640.0


Secondly, let's look at percentages.

In [6]:
pivot['negative'] = pivot['negative']/pivot['Total']
pivot['neutral'] = pivot['neutral']/pivot['Total']
pivot['positive'] = pivot['positive']/pivot['Total']
pivot.drop('Total', axis = 1, inplace = True)
pivot

airline_sentiment,negative,neutral,positive
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
American,0.710402,0.167814,0.121783
Delta,0.429793,0.325383,0.244824
Southwest,0.490083,0.27438,0.235537
US Airways,0.776862,0.130793,0.092345
United,0.688906,0.182365,0.128728
Virgin America,0.359127,0.339286,0.301587
Total,0.626913,0.21168,0.161407


We observe that tweets aimed at the airlines are used mainly for complaints. US Airways had the highest percentage of negative tweets (although, in absolute terms, the glorious first place goes to United). At the other end of the spectrum is Virgin America, which has both the smallest absolute number of tweets (by a factor of 4!) and the smallest proportion of negative tweets. In fact, for Virgin, the split between the three classes is almost equal.

Another useful observation is that our dataset is heavily skewed for negative tweets. This will play a role in training of our classifiers - care will be taken to make sure classifiers can recognize less frequent classes (positive and neutral) as well as the most prominent type of tweets (negative).

Now, let's extract the series containing tweets' bodies and associated sentiments, which will serve as our X (data) and y (labels) respectively.

In [7]:
raw_tweets = raw_data['text']
raw_sentiment = raw_data['airline_sentiment']

Map the sentiments to integers.

In [8]:
y = raw_sentiment.map({'negative': 0, 'positive': 1, 'neutral' : 2}).values
X = raw_tweets.values

Finally, let's split our data into training (80%) and testing (20%) sets.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                            random_state = 0, stratify = y)

## Feature extraction: custom classes

We will write four custom classes for extracting features from text:
* ExclamationMarksExtractor to extract single or multiple exclamation marks (we differentiate between '!' and '!!!!!')
* KeyboardEmojisExtractor to extract emojis like :), :( or :D
* GraphicsEmojisExtractor to extract emojis like 😂 or 😀
* TextCleaner, which removes tweeter mentions (@Airline), replaces any hyperlinks with the word URL, removes punctuation and stop words, lowercases everything and finally performs tokenization and stemming

*A note on hashtags:* We decide against separately extracting hashtags; instead, # sign is stripped by the TextCleaner class and remaining words become part of the vocabulary for Count Vectorizer. 

*A note on regeular expressions:* In the classes below, given a repeated application of the same regular expressions, it would have been advantageous to compile regexes once at the beginning, for improved performance. However, this leads to problems when using sklearn's GridSearchCV, as regexes are not deep copyable. Because of that, we decided to store our regexes as strings. 

To obtain a string representation of a regex for graphical emojis, we modified slightly a function from the emoji library.

In [10]:
# this piece of code is a slight modification of a function of the same name from 
# https://github.com/carpedm20/emoji/blob/master/emoji/core.py
# essentially, we want the funciton to return a string, not a compiled regex

def get_emoji_regexp():
    '''Returns a string representing a regular expression caputing all grapical emojis.'''
    
    emojis = sorted(emoji.EMOJI_UNICODE.values(), key=len,
                        reverse=True)
    emoji_regex = u'(' + u'|'.join(re.escape(u) for u in emojis) + u')'
    
    return emoji_regex

In [11]:
class ExclamationMarksExtractor(BaseEstimator, TransformerMixin):
    '''Class for extracting single or multiple exclamation marks from strings.'''
    
    def __init__(self, r1 =  r'(?<!\!)!(?!\!)', r2 = r'!{2,}'):
        '''Args: 
            r1 (str): regex for a single exclamation mark
            r2 (str): regex for two or more exclamation mars
        '''
        
        self.r1 = r1 
        self.r2 = r2
    
    def fit(self,x, y = None):
        ''' Fit method required by sklearn'''
        return self
    
    def transform(self, text):
        ''' 
        Args:
            text: list of strings
        
        Returns:
            A list of transformed strings, ie a list of strings consisting of '!' for 
            each occurence of a single exclamation mark and '!!' for two or more 
            exclamation marks, joined by spaces. 
            
            eg: 'I loved it! Can't wait to fly United again!!!!!!' is transformed to '! !!'.
        '''
        
        # find all single or multiple exclamation marks
        exclamation = [re.findall(self.r1,tweet) + re.findall(self.r2,tweet) for tweet in text]
        
        #joins lists back into single strings
        exclamation = [' '.join(tweet) for tweet in exclamation]
        
        #represent 2 or more '!'s with '!!'
        exclamation = [re.sub(self.r2,'!!',tweet) for tweet in exclamation]
        
        return exclamation

In [12]:
class KeyboardEmojisExtractor(BaseEstimator, TransformerMixin):
    '''Class for extracting keyboard emojis, like :) or :P'''
    
    def __init__(self, emojis_regexp_list = [r':\)', r':\(', r':-D', r':P', r';\)', r';P', r';D',
                              r';\)', r'xD', r'XD', r':-\)',r':-\(', r':-P', r':\\',
                              r':\/', r':D', r':-\(']):
        '''
        Args:
            emojis_regex_list (list of str): list of regexes for extraction of keyboard emojis 
        '''
        
        self.emojis_regexp_list = emojis_regexp_list 
    
    
    def fit(self, x, y = None):
        ''' Fit method required by sklearn'''
        return self
    
    def transform(self, text):
        ''' 
        Args:
            text: list of strings
        
        Returns:
            A list of transformed strings, ie a list of strings consisting of keyboard emojis 
            found in tweets, joined by spaces. 
            
            eg: 'I loved it! :):) is transformed to ':) :)'.
        '''
        
        # create list of empty lists to keep emojis found in each tweet
        keybord_emojis = []
        for i in range(len(text)):
            keybord_emojis.append([])
        
        # iterate over all regexp and all tweets 
        # to find all emojis from the emojis_regex_list for each tweet
        for rgxp in self.emojis_regexp_list:
            for i in range(len(text)):
                keybord_emojis[i] += re.findall(rgxp, text[i])
        
        # join into a format required by Count/Tfidf Vectorizer
        keybord_emojis = [' '.join(tweet) for tweet in keybord_emojis]
        
        return keybord_emojis

In [13]:
class GraphicsEmojisExtractor(BaseEstimator, TransformerMixin):
    '''Class for extracting graphics emojis'''  
    
    def __init__(self, r = get_emoji_regexp()):
        ''' 
        Args:
            r (str): regex for emoji extraction
        '''
        
        self.r = r 
    
    
    def fit(self, x, y = None):
        ''' Fit method required by sklearn'''
        return self
    
    def transform(self, text):
        '''
        Args:
            text: list of strings
        
        Returns:
            A list of transformed strings, ie a list of strings consisting of 
            graphics emojis found in tweets, joined by spaces. 
        '''
        
        #extract graphics emojis
        graphics_emojis = [re.findall(self.r, tweet) for tweet in text]
        
        #join into a format required by CountVectorizer
        graphics_emojis = [' '.join(tweet) for tweet in graphics_emojis]
        
        return graphics_emojis

In [180]:
class TextCleaner(BaseEstimator, TransformerMixin):
    '''Class for cleaning, tokenizing and stemming text.'''
    
    def __init__(self, r1 = r'@\w+', 
            r2 = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
            tokenizer = RegexpTokenizer(r'\w+'), stemmer = PorterStemmer(),
            stop_words = nltk.corpus.stopwords.words("english")):
        '''
        Args:
            r1 (str): regex for twitter mentions
            r2 (str): regex for URL's
            tokenizer: RegexpTokenizer
            stemmer: nltk porter stemmer
            stop_words: list of stop words
        '''
        
        self.r1 = r1 
        self.r2 = r2 
        self.tokenizer = tokenizer
        self.stemmer = stemmer
        self.stop_words = stop_words
        
    def fit(self, x, y = None):
        ''' Fit method required by sklearn'''
        return self
    
    def transform(self, text):
        '''
        Args:
            text: list of strings
        
        Returns:
            A list of transformed strings, ie a list of strings 
            consisting of stemmed lowercase tokens. 
        '''
        
        #replace all @airline_name mentions with empty string
        tweets = [re.sub(self.r1,'',tweet) for tweet in text]
        
        #substitute any urf for a string 'url
        tweets = [re.sub(self.r2,'url',tweet) for tweet in tweets]
        
        #make everything lowercase
        tweets = [tweet.lower() for tweet in tweets]
        
        #tokenize
        tweets = [self.tokenizer.tokenize(tweet) for tweet in tweets]
        
        #remove stop words
        tweets = [[word for word in tweet if word not in self.stop_words] for tweet in tweets]
        
        #perform stemming
        tweets = [[self.stemmer.stem(word) for word in tweet] for tweet in tweets]
        
        #join to a form digestible for CountVectorizer
        tweets = [' '.join(tweet) for tweet in tweets]
        
        return tweets

## Model selection and evaluation functions

We will now define functions to automate hyperparameter tuning and model evaluation.

In [17]:
# function for tuning hyperparams of a classifiers

def tune_and_fit(clf, p_grid, cv = 5, X_train = X_train, y_train = y_train):
    '''Performs grid search over p_grid with cv-fold cross validation.
    
    Args:
        clf: sklearn classifier
        p_grid (dict): parameters grid
        cv (int): number of folds in cross validation
        X_train (array): training set
        y_train (array): training labels
        
    Returns:
        best model found, fitted on the entire data set
    '''
    
    grid = GridSearchCV(clf, param_grid = p_grid, cv = cv, n_jobs = -1, verbose = True, 
                        scoring = make_scorer(f1_score, average='weighted'))
    grid.fit(X_train,y_train)
    print('Best parameters found: ')
    print(grid.best_params_)
    print('\n Best score:')
    print(grid.best_score_)
    
    return grid.best_estimator_

In [18]:
# function to evalute tuned models on the test set

def evaluate_model(model, X_test = X_test, y_test = y_test):
    '''
    Args:
        model: fitted sklearn classifier
        X_test (array): testing set
        y_test (array): testing labels
    '''
    
    y_pred = model.predict(X_test)
    
    print(classification_report(y_test, y_pred))

## Pipelines

The below pipeline will constitute the feature selection part of every model we create. In short, it uses all four of our custom classes to extract various features, then performs count vectorization on each set of features and finally unions them up using FeatureUnion.

*A note on vectorization:* by running some simple pipelines with MultinomialBayes classifier we established CountVectorizer to be much more effective than TfidfVectorizer, hence our choice of the vectorizer below.

In [181]:
# this pipeline will be be a feature engineering part of every pipeline we produce
# CountVectorizer uses lowercase = False to save time on unecessary lowercasing step: 
# it is either irrelevant or already done

feature_selection = Pipeline([
('union',FeatureUnion(transformer_list=[

    #pipeline for extracing exclamation marks
    ('!', Pipeline([
        ('ext', ExclamationMarksExtractor()), #ext = extractor
        ('vct', CountVectorizer(token_pattern = r'\S+', lowercase = False)) #vct = vectorizer
    ])),
    
    #pipeline for extracting keybord emojis
    ('key_emoji', Pipeline([
        ('ext', KeyboardEmojisExtractor()),
        ('vct', CountVectorizer(token_pattern = r'\S+', lowercase = False))
    ])),
    
    #pipeline for exracting graphical emojis
    ('graph_emoji', Pipeline([
        ('ext', GraphicsEmojisExtractor()),
        ('vct', CountVectorizer(token_pattern = get_emoji_regexp(), lowercase = False))
    ])),
    
    #pipeline for standard bag-of-words model for body of the tweets
    ('body', Pipeline([
        ('ext', TextCleaner()),
        ('vct', CountVectorizer(lowercase = False))
    ]))
]))])

Our pipelines will consist of the feature selection pipeline as above, followed by a classifier. The function below automates the process of adjoining a classifier onto a feature selection pipeline.

In [21]:
def add_clf(clf, name = 'clf'):
    '''Appends a classifier to the feature selection pipeline.
    Args:
        clf: sklearn classifier
        name (str): name of the classifier in the pipeline
    '''
    
    pipe = Pipeline([
        ('fts', feature_selection), #fts - features
        (name, clf)
        ])
    
    return pipe

## Baseline models

Before we start looking for the best model, let's look at two baselines. Firstly, we will do a simple count vectorization of raw tweets (no preprocessing) and then run a multinomial bayes classifier.

In [22]:
baseline1 = Pipeline([
    ('vct', CountVectorizer()),
    ('clf',MultinomialNB())
    ])

baseline1.fit(X_train,y_train)
evaluate_model(baseline1)

             precision    recall  f1-score   support

          0       0.78      0.96      0.86      1836
          1       0.81      0.50      0.62       472
          2       0.70      0.41      0.52       620

avg / total       0.77      0.77      0.75      2928



Not a bad result for virtually no work! Let's see if extracting emojis and exclamation marks, coupled with more careful text cleaning improves on that!

In [23]:
baseline2 = add_clf(MultinomialNB())
baseline2.fit(X_train,y_train)
evaluate_model(baseline2)

             precision    recall  f1-score   support

          0       0.80      0.94      0.86      1836
          1       0.78      0.63      0.69       472
          2       0.67      0.41      0.51       620

avg / total       0.77      0.78      0.76      2928



After all the effort with the classes, a more complex pipeline and a longer training time, f1 score improvement is so small it might as well be a fluke. We will nonetheless stick to this more complicated pipeline in a hope that more sophisticated classifiers will be able to use the extra features extracted by our custom classes.

# MODELLING

We will now optimize a number of popular classifiers and test their performance on the test set. We first run grid searches with 'sparse' sets of parameters, and later narrow down our search.

### Logistic Regression

In [62]:
# create a pipeline
maxent = add_clf(LogisticRegression(random_state = 0),'maxent')

# create a paramter grid
maxent_grid = {'maxent__C' : 10**np.arange(-3,4, dtype = float), 
                   'maxent__penalty' : ['l1','l2'], 'maxent__class_weight' : ['balanced',None]}

In [63]:
# check the performance of out of the box logistic regression
maxent_baseline = add_clf(LogisticRegression(random_state = 0),'maxent')
maxent_baseline.fit(X_train,y_train)
evaluate_model(maxent_baseline)

             precision    recall  f1-score   support

          0       0.85      0.91      0.88      1836
          1       0.79      0.72      0.75       472
          2       0.65      0.56      0.60       620

avg / total       0.80      0.80      0.80      2928



In [None]:
# optimize hyperparameters
maxent = tune_and_fit(maxent,maxent_grid);

In [None]:
# Best parameters found: 
# {'maxent__C': 1.0, 'maxent__class_weight': None, 'maxent__penalty': 'l2'}

# Best score:
# 0.778322037165

In [34]:
# evaluate optimized model on test set
evaluate_model(maxent)

             precision    recall  f1-score   support

          0       0.85      0.91      0.88      1836
          1       0.79      0.72      0.75       472
          2       0.65      0.56      0.60       620

avg / total       0.80      0.80      0.80      2928



We see that class weight and penalty should remain as default. Let's try and narrow down our search for optimal value of C.

In [75]:
# define a new grid and search it
maxent = add_clf(LogisticRegression(random_state = 0),'maxent')
maxent_grid = {'maxent__C' : np.array([0.5,1,2,3,4,5])}
maxent = tune_and_fit(maxent,maxent_grid)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  3.4min finished


Best parameters found: 
{'maxent__C': 1.0}

 Best score:
0.778322037165


In [76]:
# evaluate optimized model on test set
evaluate_model(maxent)

             precision    recall  f1-score   support

          0       0.85      0.91      0.88      1836
          1       0.79      0.72      0.75       472
          2       0.65      0.56      0.60       620

avg / total       0.80      0.80      0.80      2928



Looks like default parameters for Logistic Regression work best for our task. Let's pickle the best logistic regression model obtained.

In [78]:
joblib.dump(maxent, 'maxent.pkl');

### SVM

In [51]:
# create a pipeline
svm = add_clf(SVC(random_state = 0, kernel='linear'),'svm')

# create a parameter grid
svm_grid = {'svm__C' : 10**np.arange(-3,4, dtype = float), 
                'svm__kernel' : ['linear','rbf','poly'],
                'svm__class_weight' : ['balanced',None]}

In [40]:
# check the performance of out of the box SVM
svm_baseline = add_clf(SVC(random_state = 0, kernel='linear'),'svm')
svm_baseline.fit(X_train,y_train)
evaluate_model(svm_baseline)

             precision    recall  f1-score   support

          0       0.84      0.86      0.85      1836
          1       0.74      0.70      0.72       472
          2       0.59      0.56      0.57       620

avg / total       0.77      0.78      0.77      2928



In [None]:
# optimize hyperparameters
svm = tune_and_fit(svm, svm_grid);

In [None]:
# Best parameters found: 
# {'svm__C': 1000.0, 'svm__class_weight': None, 'svm__kernel': 'rbf'}

#  Best score:
# 0.772769928666

In [53]:
# evaluate on test set
evaluate_model(svm)

             precision    recall  f1-score   support

          0       0.85      0.88      0.87      1836
          1       0.76      0.69      0.72       472
          2       0.61      0.58      0.60       620

avg / total       0.78      0.79      0.79      2928



Since the optimal value of C is at the boundary of our search space, let's extend it.

In [82]:
# create a pipeline
svm = add_clf(SVC(random_state = 0),'svm') #kernel = rbf

# create a parameter grid
svm_grid = {'svm__C' : 10**np.arange(3,6,1, dtype = float)}

In [83]:
# optimize hyperparameters
svm = tune_and_fit(svm, svm_grid)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  4.1min finished


Best parameters found: 
{'svm__C': 1000.0}

 Best score:
0.772769928666


In [84]:
# evaluate on test set
evaluate_model(svm)

             precision    recall  f1-score   support

          0       0.85      0.88      0.87      1836
          1       0.76      0.69      0.72       472
          2       0.61      0.58      0.60       620

avg / total       0.78      0.79      0.79      2928



Our best value of C remained 1000. Let's search if there is a better value in the vicinity of 1000.

In [85]:
# create a pipeline
svm = add_clf(SVC(random_state = 0),'svm') #kernel = rbf

# create a parameter grid
svm_grid = {'svm__C' : np.arange(500,1600,100)}

In [86]:
# optimize hyperparameters
svm = tune_and_fit(svm, svm_grid)

Fitting 5 folds for each of 11 candidates, totalling 55 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  8.3min
[Parallel(n_jobs=-1)]: Done  55 out of  55 | elapsed: 10.3min finished


Best parameters found: 
{'svm__C': 900}

 Best score:
0.773973569851


In [87]:
# evaluate on test set
evaluate_model(svm)

             precision    recall  f1-score   support

          0       0.85      0.89      0.87      1836
          1       0.76      0.68      0.72       472
          2       0.61      0.57      0.59       620

avg / total       0.78      0.79      0.78      2928



So our best value of C is between 900 and 1000. Let's find it to the neareast 10!

In [90]:
# create a pipeline
svm = add_clf(SVC(random_state = 0),'svm') #kernel = rbf

# create a parameter grid
svm_grid = {'svm__C' : np.arange(900,1000,10)}

In [91]:
# optimize hyperparameters
svm = tune_and_fit(svm, svm_grid)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  7.5min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  8.5min finished


Best parameters found: 
{'svm__C': 900}

 Best score:
0.773973569851


In [92]:
# evaluate on test set
evaluate_model(svm)

             precision    recall  f1-score   support

          0       0.85      0.89      0.87      1836
          1       0.76      0.68      0.72       472
          2       0.61      0.57      0.59       620

avg / total       0.78      0.79      0.78      2928



So 900 it is! Let's pickle our best SVM model for later.

In [93]:
joblib.dump(svm, 'svm.pkl');

### Random Forest

In [107]:
# create a pipeline
forest = add_clf(RandomForestClassifier(random_state = 0), 'forest')

# create parameter grids
forest_grid = {'forest__max_depth' : [10, 50, 100],
                'forest__n_estimators': [100, 500],
                'forest__min_samples_leaf': [10, 50, 100],
                'forest__class_weight' : ['balanced',None]}

In [102]:
# check the performance of out of the box RandomForest
forest_baseline = add_clf(RandomForestClassifier(random_state = 0), 'forest')
forest_baseline.fit(X_train,y_train)
evaluate_model(forest_baseline)

             precision    recall  f1-score   support

          0       0.79      0.89      0.84      1836
          1       0.70      0.62      0.66       472
          2       0.59      0.41      0.48       620

avg / total       0.73      0.75      0.73      2928



In [None]:
# optimize hyperparameters
forest = tune_and_fit(forest,forest_grid);

In [None]:
# Best parameters found: 
# {'forest__class_weight': 'balanced', 'forest__max_depth': 100, 'forest__min_samples_leaf': 10, 'forest__n_estimators': 500}

#  Best score:
# 0.725788550582

In [109]:
# evaluate on test set
evaluate_model(forest)

             precision    recall  f1-score   support

          0       0.89      0.72      0.80      1836
          1       0.59      0.77      0.67       472
          2       0.49      0.65      0.56       620

avg / total       0.76      0.72      0.73      2928



Some parameters were at their boundary values, let's extend our search.

In [115]:
# create a pipeline
forest = add_clf(RandomForestClassifier(random_state = 0, class_weight = 'balanced'), 'forest')

# create parameter grids
forest_grid = {'forest__max_depth' : [100,200,300],
                'forest__n_estimators': [100, 500,1000],
                'forest__min_samples_leaf': [2,5,10]}

In [116]:
# optimize hyperparameters
forest = tune_and_fit(forest,forest_grid)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  6.5min
[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed: 17.9min finished


Best parameters found: 
{'forest__max_depth': 300, 'forest__min_samples_leaf': 2, 'forest__n_estimators': 1000}

 Best score:
0.756624210259


In [117]:
# evaluate on test set
evaluate_model(forest)

             precision    recall  f1-score   support

          0       0.87      0.79      0.83      1836
          1       0.63      0.77      0.69       472
          2       0.55      0.61      0.58       620

avg / total       0.77      0.75      0.76      2928



Again, we're at the boundary, so let's look further.

In [119]:
# create a pipeline
forest = add_clf(RandomForestClassifier(random_state = 0, class_weight='balanced'), 'forest')

# create parameter grids
forest_grid = {'forest__max_depth' : [300,400,500],
                'forest__n_estimators': [1000,1500,2000],
               'forest__min_samples_leaf': [1,2]}

In [120]:
# optimize hyperparameters
forest = tune_and_fit(forest,forest_grid)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 31.2min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed: 61.9min finished


Best parameters found: 
{'forest__max_depth': 300, 'forest__min_samples_leaf': 1, 'forest__n_estimators': 1000}

 Best score:
0.758534316104


In [121]:
# evaluate on test set
evaluate_model(forest)

             precision    recall  f1-score   support

          0       0.82      0.89      0.85      1836
          1       0.74      0.66      0.70       472
          2       0.62      0.53      0.57       620

avg / total       0.77      0.77      0.77      2928



Ok, looks like we found the rough region where the best parameters lie. Let's try and narrow them down.

In [122]:
# create a pipeline
forest = add_clf(RandomForestClassifier(random_state = 0, 
                class_weight='balanced'), 'forest') #default min samples leaf is 1

# create parameter grids
forest_grid = {'forest__max_depth' : [250,300,350],
                'forest__n_estimators': [900,1000,1100]}

In [123]:
# optimize hyperparameters
forest = tune_and_fit(forest,forest_grid)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed: 34.5min finished


Best parameters found: 
{'forest__max_depth': 250, 'forest__n_estimators': 900}

 Best score:
0.758980866869


In [124]:
# evaluate on test set
evaluate_model(forest)

             precision    recall  f1-score   support

          0       0.83      0.88      0.85      1836
          1       0.74      0.66      0.70       472
          2       0.62      0.54      0.58       620

avg / total       0.77      0.78      0.77      2928



Let's narrow down forest's max depth to the nearest 10 and keep searching for optimal number of estimators.

In [125]:
# create a pipeline
forest = add_clf(RandomForestClassifier(random_state = 0, 
                class_weight='balanced'), 'forest') #default min samples leaf is 1

# create parameter grids
forest_grid = {'forest__max_depth' : np.arange(250,280,10),
                'forest__n_estimators': [700,800,900]}

In [126]:
# optimize hyperparameters
forest = tune_and_fit(forest,forest_grid)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed: 28.2min finished


Best parameters found: 
{'forest__max_depth': 250, 'forest__n_estimators': 700}

 Best score:
0.759013393401


In [127]:
# evaluate on test set
evaluate_model(forest)

             precision    recall  f1-score   support

          0       0.82      0.88      0.85      1836
          1       0.73      0.66      0.70       472
          2       0.61      0.53      0.57       620

avg / total       0.77      0.77      0.77      2928



In [128]:
# create a pipeline
forest = add_clf(RandomForestClassifier(random_state = 0, 
            class_weight='balanced', max_depth = 250), 'forest') #default min samples leaf is 1

# create parameter grids
forest_grid = {'forest__n_estimators': [600,650,700,750]}

In [129]:
# optimize hyperparameters
forest = tune_and_fit(forest,forest_grid)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  9.6min finished


Best parameters found: 
{'forest__n_estimators': 700}

 Best score:
0.759013393401


We settle our search at max depth of 250 and 700 estimators. Let's pickle the best forest found.

In [130]:
joblib.dump(forest, 'forest.pkl');

### Have emojis helped?

We conclude that our best model is logistic regression. Due to the complextiy of our pipeline, it is difficult to track down which words or emojis played significant roles in our decision function. Let's try and see if including emojis and exclamation marks yielded an improved predictive power by running a logistic regression of features obtained by a simple Count Vectorizer (no preprocessing).

In [159]:
baseline3 = Pipeline([
    ('vct', CountVectorizer()),
    ('clf',LogisticRegression())
    ])

baseline3.fit(X_train,y_train)
evaluate_model(baseline1)

             precision    recall  f1-score   support

          0       0.78      0.96      0.86      1836
          1       0.81      0.50      0.62       472
          2       0.70      0.41      0.52       620

avg / total       0.77      0.77      0.75      2928



We see that we obtained a 0.05 improvement of f1 score on the test set with our custom pipeline.

# CONNECTING TO TWITTER

We will now test our best model on real world data by fetching some recent tweets and classyfing them.

In [96]:
# to use fetch_and_classify, fill these with your own key and token 
ACCESS_TOKEN = 'YOUR ACCESS TOKEN'
ACCESS_SECRET = 'YOUR ACCESS SECRET'
CONSUMER_KEY = 'YOUR CONSUMER KEY'
CONSUMER_SECRET = 'YOUR CONSUMER SECRET'

In [99]:
def fetch_and_classify(airline, model, language = 'en', n = 10, sentiment = None):
    '''
    Fetches n recent tweets in language about airline. 
    Prints tweets together with their sentiment, as classified by the model
    
    Args:
        n (int): number of tweets to fetch
        airline (str): airline's twitter username (without @)
        language (str): tweets' language, default is English
        model (str): path to a trained and pickled sklearn classifier
        sentiment (str): one of 'positive', 'negative' or None, defualt None. 
                        Allows to fetch tweets of a given sentiment only (as per twitter api)
        
    Returns: 
        None
    '''
    #sentiment dict
    s_dict = {'positive': ':)', 'negative' : ':('}
    
    # connect to twitter
    oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
    twitter = Twitter(auth=oauth)
    
    # fetch tweets
    if sentiment:
        tweets = twitter.search.tweets(q = '@{} {} -filter:retweets'.format(airline,
                s_dict[sentiment]), result_type = 'recent', lang = 'en', 
                count = n, tweet_mode = 'extended')
    else:
        tweets = twitter.search.tweets(q = '@{} -filter:retweets'.format(airline), 
                result_type = 'recent', lang = 'en', count = n, tweet_mode = 'extended')
    
    # extract text
    tweets = [tweet['full_text'] for tweet in tweets['statuses']]
    
    # load classifier
    clf = joblib.load(model) 
    
    # make predictions
    predictions = clf.predict(tweets)
    predictions = pd.Series(predictions).map({0 :'negative', 1: 'positive', 2: 'neutral'}).values
    
    # print the results
    for i in range(n):
        print(tweets[i])
        print('Classifier prediction: {}'.format(predictions[i]))
        print('\n')

Let's see how our model handles some recent tweets about United.

In [100]:
fetch_and_classify('United','maxent.pkl')

@take2review @united That's pretty anticlimactic. I'm disappointed.
Classifier prediction: negative


Yo @united your basic economy rules scares me can I bring a backpack on? What's considered full size carry on?
Classifier prediction: neutral


@united It's only U.S., not worldwide.
Classifier prediction: neutral


@fortis_semper @united did @united get back to you? @seatguru shows United's 767-300 row 42 "limited or no recline," passengers claim no recline in that row https://t.co/bMcze6EBlI
Classifier prediction: negative


Second flight of the day. Boarded @united flight UA6313 IAD - YUL. Getting closer to @NEPHP #nephp17 https://t.co/pkpsegxxsF
Classifier prediction: neutral


@mandakayrocks @united Reporting in.  Nothing happened.  Lol. I never saw him again.
Classifier prediction: negative


@united Do you think Dao would love that paradise? https://t.co/PT17S2mwSp
Classifier prediction: neutral


@united  today you rock. Capt Dave Harvey got flt 2063 down safe and CLE crew is a

We leave it up to you to decide if the classification is good enough, but we're satisfied :)

Since most tweets aimed at airlines are negative and we had the highest proportion of such in our training set, these are the easiest to classify. Let's see how our classifier deals with tweets which Twitter API considers positive.

In [185]:
fetch_and_classify('VirginAmerica','maxent.pkl', sentiment = 'positive')

Thanks to @VirginAmerica's inaugural SFO/SEA fares. @betterinrealife and I were able to survive a LDR phase. And now, look at us. :) HBD VA! https://t.co/dERAenBjNv
Classifier prediction: positive


On a @VirginAmerica flight, of all the movies my 8yo could have selected to watch, she decided to re-watch @HiddenFigures :)
Classifier prediction: neutral


@VirginAmerica Ok thanks. No worries. Just making sure.  Heading to airport soon. :)
Classifier prediction: positive


Maybe the one time I've responded to a corporate newsletter. Could not resist @VirginAmerica. Or could I? :) https://t.co/0xLXwTwj4f
Classifier prediction: neutral


@VirginAmerica flight 221 looks like there are 2 open 1st class and my wife and I need a free upgrade.:) 11a11b #donthurttoask
Classifier prediction: neutral


@VetCopGOP @AnnCoulter @Delta @VirginAmerica @JetBlue glad to know we've got good people like you out there protecting us, and protecting us from ourselves :)
Classifier prediction: positive


@VetC

Interestingly, some of the tweets Twitter considered positive were actually the opposite, and our classifier has caught that.

**Finally, let's see how our classifier deals with an airline which was not present in our training set.** We will use Ryanair.

In [189]:
fetch_and_classify('Ryanair', 'maxent.pkl')

@Ryanair My flight from STN to SXF is listed as on time, but the plane isn't even here yet. Could you give an update? (FR8544)
Classifier prediction: negative


Fly for Stanstead direct to Genoa from as little as £15 each way! A good reason to #learntosail in Italy. @Ryanair https://t.co/VUb6xbkGxt
Classifier prediction: positive


@Ryanair Done thank you
Classifier prediction: positive


@Ryanair flight to corfu 1 hr late. No apology. People sitting on flight. Awful.
Classifier prediction: negative


@saratctravel @Flight_Refunds @Ryanair Where in the hell were you flying to? Mars? Jings oh!
Classifier prediction: negative


Congratulations @Ryanair for great idea of uniformly distributing families along plane. Outcome: delay until people manage to switch places.
Classifier prediction: negative


@Ryanair Flight 4195 is now expected to be at 10pm instead of 14.30. You planning on having anyone in the airport to explain what's going on?
Classifier prediction: negative


Reasons NOT to 

Again, we leave it up to you to gauge the quality of the predictions.
** Finally, fell free to play around with pickled classifiers and our tweets-fetching function!**

*Sidenote*: Interestingly, none of the fetched tweets contained any emojis.

# Dicussion

Our task was to create a robust sentiment classifier and test it in the field by fetching some recent, real world tweets. We accomplished our task by creating custom feature extraction classes: extracting exclamation marks, two types of emojis and finally cleaning, tokenizing and stemming text. Subsequently, multi stage pipelines were created and a number of classifiers were tuned and fitted. The best (according to f1 score) classifier was logistic regression, closely followed by SVM with gaussian kernel. Surprisingly, Random Forests underperformed. 

Even though a task of testing our classifier with real world tweets was completed, our logistic regression classifier still leaves room for improvement. Let's list some of potential strategies we could have used aimed at improving our f1 score:

* using a different type of model, eq: xgboost, neural networks
* incorporating ngrams: quick and dirty testing with multinomial NB showed that using ngrams actually decreased predictive power, but perhaps using 2-grams with a specific combination of parameters would have yielded better scores. Testing this would have produced much larger parameter grids, though
* incorporating different weights to different parts of our feature extraction, say, weighing emojis twice as important as words: as above, quick and dirty check revealed this to either not change anything or actually decrease performance. Again, a more systematic way to check this would be to use grid search, but computational costs were prohibitive for a small laptop
* using ensemble methods: boosting, bagging, stacking?

Finally, let's note that our aim was not to max out our f1 score at all cost. The main purpose of this exercises was to learn and practice writing custom sklearn compatible classes, use them together with sklearn pipelines and grid searches and have a perhaps imperfect, but decently working classifier with which we can connect to twitter and classify some actual tweets. We consider this mission accomplished and we learned loads during the process :)