# Natural Language Processing for sentiment analysis of Tweets

This notebook has been written as part of a Kaggle challenge. The original challenge can be found here:
https://www.kaggle.com/c/nlp-itmo-sentiment/

The task is to classify the sentiment of English-language tweets. The "Sentiment" label takes one of two values, ['Negative','Positive'], stored in the training dataset as 0 and 1 respectively.

Note: all training data comes pre-labelled. The validity of this model therefore depends on the validity of the labelling process, which is dubious. The model is unlikely to generalise well to a dataset labelled by a different person or algorithm (it is not known which of the two was used for this one).

This program achieves a public score of 0.77012 on Kaggle, according to the F1 metric used by the competition. The competition is over, though, so the leaderboard does not show this result.
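
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand for the positive class. The function name `f1_binary` and the toy labels are illustrative assumptions, and the exact averaging used by the competition is not stated in this notebook.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 for a single class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration only
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
score = f1_binary(y_true, y_pred)
print(round(score, 3))
```

Unlike plain accuracy, this score ignores true negatives, so a model that predicts mostly one class can score very differently under the two metrics.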

Further description of the program follows in the comments and markdown cells.

# Basic imports

In [1]:
# For linear algebra and data handling
import numpy as np
import pandas as pd

# Machine learning libraries:
#
# 1) Vectorization and normalization of texts. Equivalent to CountVectorizer
# followed by TfidfTransformer.
from sklearn.feature_extraction.text import TfidfVectorizer

# 2) Linear classifier. Uses stochastic gradient descent to converge to a minimum of the loss function.
from sklearn.linear_model import SGDClassifier

# Pipeline. Chains the steps so that they always run in the same order.
from sklearn.pipeline import Pipeline 

# GridSearchCV searches for the best parameters for the ML models used
from sklearn.model_selection import GridSearchCV

# Assorted imports for prettifying dictionaries and time handling
from pprint import pprint
import time

# Reading the training dataset

The dataset is provided as 'train.csv' and comprises about 80,000 tweets, each labelled with a sentiment value.

It is not known how the labels were assigned.

The data is first loaded, and then some exploratory analysis is performed.

In [2]:
train = pd.read_csv('train.csv',
                    encoding='ISO-8859-1',  # 'utf-8', the default when no encoding= is passed,
                                            # raises a UnicodeDecodeError on this file
                    header=0,
                    index_col=0)
print(train.info())
train.head(10)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 79988 entries, 1 to 79999
Data columns (total 2 columns):
Sentiment        79988 non-null int64
SentimentText    79988 non-null object
dtypes: int64(1), object(1)
memory usage: 1.8+ MB
None


ItemID,Sentiment,SentimentText
1,0,is so sad for my APL frie...
2,0,I missed the New Moon trail...
3,1,omg its already 7:30 :O
4,0,.. Omgaga. Im sooo im gunna CRy. I'...
5,0,i think mi bf is cheating on me!!! ...
6,0,or i just worry too much?
7,1,Juuuuuuuuuuuuuuuuussssst Chillin!!
8,0,Sunny Again Work Tomorrow :-| ...
9,1,handed in my uniform today . i miss you ...
10,1,hmmmm.... i wonder how she my number @-)
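
The encoding choice in the loading cell can be illustrated without the dataset. The bytes below are a made-up example, not taken from 'train.csv': they are valid ISO-8859-1 but not valid UTF-8, which is the kind of input that makes a UTF-8 read fail.

```python
# Hypothetical bytes: 0xE9 is 'é' in ISO-8859-1, but in UTF-8 it starts a
# multi-byte sequence that the following space does not complete.
raw = b"caf\xe9 time"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

latin1_text = raw.decode("ISO-8859-1")
print(utf8_ok, latin1_text)
```

Note that ISO-8859-1 maps every possible byte to a character, so decoding with it never fails. A successful read therefore does not prove that the file was really written in that encoding.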


There are no null values in the dataset, which is good. A few of the index values are missing, but this does not affect us.

Since, according to the description of the challenge, the dataset consists of tweets, the texts are likely to contain hashtags, mentions or emoticons.

If they are common, it may be worth using a tokenizer specialised for tweets during preprocessing.

In [3]:
mask = train.SentimentText.str.contains('[@#]') # This searches for @ and # in tweets

train[mask].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68253 entries, 10 to 79999
Data columns (total 2 columns):
Sentiment        68253 non-null int64
SentimentText    68253 non-null object
dtypes: int64(1), object(1)
memory usage: 1.6+ MB


As shown above, the vast majority of the texts (68,253 of 79,988, about 85%) contain Twitter-specific symbols. It is thus worth using a tokenizer with Twitter-specific regular expressions when preprocessing the texts.
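
The mask above lumps mentions and hashtags together. A minimal sketch of how the two could be counted separately follows, using only the standard library. The sample texts and variable names are invented for illustration.

```python
import re

# Made-up sample texts, standing in for the SentimentText column
texts = [
    "@bob see you soon",
    "loving the #sunshine",
    "just chilling",
    "@amy #fail again",
]

mention = re.compile(r"@")
hashtag = re.compile(r"#")

n_mentions = sum(1 for t in texts if mention.search(t))
n_hashtags = sum(1 for t in texts if hashtag.search(t))
n_either = sum(1 for t in texts if re.search(r"[@#]", t))

share_either = n_either / len(texts)
print(n_mentions, n_hashtags, n_either, share_either)
```

If most matches turn out to be mentions rather than hashtags, the tokenizer option that strips handles matters more than any hashtag handling.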

# Using a Tweet tokenizer

NLTK provides a tokenizer built on Twitter-specific regular expressions. It is important to pass reduce_len=True when instantiating the class, so that any run of more than three identical characters is shortened to three. The tokenizer below is also created with strip_handles=True, which removes @-mentions from the text.

Note that TfidfVectorizer normally treats the following strings as five different tokens:
'a','aa','aaa','aaaa','aaaaa'

With reduce_len=True in preprocessing, TfidfVectorizer would instead receive 'a', 'aa', and 'aaa' three times.

In [4]:
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
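
The effect of reduce_len can be sketched with a plain regular expression. This is a stand-in written for illustration, assuming the behaviour described above (runs of three or more identical characters become exactly three). It is not NLTK's own code and does not tokenize.

```python
import re


def reduce_lengthening(text):
    """Shorten any run of 3+ identical characters to exactly 3 (illustrative stand-in)."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


words = ["a", "aa", "aaa", "aaaa", "aaaaa"]
reduced = [reduce_lengthening(w) for w in words]
print(reduced)

# A stretched-out tweet in the style of the training data
sample = reduce_lengthening("Juuuuuuuussssst Chillin!!")
print(sample)
```

Five distinct strings collapse to three, which is what keeps the vocabulary built by TfidfVectorizer from filling up with near-duplicate stretched words.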

# Building the pipeline

It is now time to build the pipeline.

We will use TfidfVectorizer followed by SGDClassifier. TfidfVectorizer will use the tokenizer instantiated earlier: in a comparison between the default tokenizer of TfidfVectorizer and NLTK's TweetTokenizer, the latter scored better.

Using GridSearchCV, we will search for the best combination of model parameters. Note that, with no scoring= argument, GridSearchCV ranks candidates by the estimator's default score (accuracy for a classifier), not by F1.

The computation takes about 12 minutes on my computer, using CPUs only.

In [5]:
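np.random.seed(42)  # seeds NumPy's global RNG for consistency in the results
                    # (worker processes started by n_jobs=-1 may not inherit this seed;
                    # passing random_state= to SGDClassifier is the more reliable option)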

pipeline = Pipeline([
    ('tfidf',TfidfVectorizer(tokenizer=tokenizer.tokenize)),
    ('sgd',SGDClassifier())
])

parameters = {'tfidf__use_idf':(True,False),               # Whether to multiply the term frequency
                                                           # by the inverse document frequency, or not

              'tfidf__ngram_range':((1,1),(1,2),(1,3),     # Range of n-gram sizes to extract.
                                    (2,2),(2,3)),          # Extracting bigrams or trigrams in addition to
                                                           # unigrams should improve the score

              'sgd__loss':('hinge','log'),                 # Possible loss functions for the classifier.
                                                           # 'log' is differentiable everywhere, 'hinge' is not.
                                                           # Note that if 'hinge' is used, the predict_proba()
                                                           # method of the classifier cannot be called.
                                                           # ('log' was renamed 'log_loss' in scikit-learn 1.1.)
             }

grid_search = GridSearchCV(pipeline,
                           parameters,
                           n_jobs=-1,                      # Uses all CPUs available
                           verbose=1)

t = time.time()
grid_search.fit(train.SentimentText,train.Sentiment)
print('Computation done in {} seconds'.format(int(time.time()-t)))

print('\nParameters used for GridSearch: ')
pprint(parameters)
print('\nParameters selected as the best fit: ')
pprint(grid_search.best_params_)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  7.8min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed: 11.4min finished


Computation done in 712 seconds

Parameters used for GridSearch: 
{'sgd__loss': ('hinge', 'log'),
 'tfidf__ngram_range': ((1, 1), (1, 2), (1, 3), (2, 2), (2, 3)),
 'tfidf__use_idf': (True, False)}

Parameters selected as the best fit: 
{'sgd__loss': 'hinge', 'tfidf__ngram_range': (1, 2), 'tfidf__use_idf': True}
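
The "20 candidates, 60 fits" figure in the log follows directly from the size of the grid. The sketch below enumerates the combinations with the standard library only. It assumes the 3 folds reported in the log and is not how GridSearchCV is implemented internally.

```python
from itertools import product

# Same grid as in the search above
parameters = {
    "tfidf__use_idf": (True, False),
    "tfidf__ngram_range": ((1, 1), (1, 2), (1, 3), (2, 2), (2, 3)),
    "sgd__loss": ("hinge", "log"),
}
n_folds = 3  # taken from the log line "Fitting 3 folds ..."

names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

Every value added to one parameter multiplies the number of fits, which is why the run time grows quickly as the grid is extended.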


# Assigning the best parameters to our model

We can now assign the best-scoring parameters to our model and refit it.

GridSearchCV can itself make predictions with the best-performing estimator: after fitting, GridSearchCV.predict() automatically calls the predict method of the best estimator. I prefer, however, to have more control over the model parameters, and thus choose to reinstantiate and refit the pipeline manually.

Please also note that, before the grid search is performed, it is not known which loss function of the SGDClassifier will perform best. If the log loss were selected as the best fit, GridSearchCV.predict_proba() could be called after fitting; with the hinge loss it cannot. This is an additional reason for reinstantiating the model manually if you plan to call the predict_proba() method.
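
The difference between the two losses can be sketched numerically. In the snippet below, `margin` stands for the label (coded as -1 or +1) multiplied by the raw score of the linear model. The function names are illustrative and this is not scikit-learn's implementation.

```python
import math


def hinge_loss(margin):
    """Zero once the example is classified with margin >= 1; has a kink at margin == 1."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Smooth everywhere; never exactly zero."""
    return math.log(1.0 + math.exp(-margin))


def sigmoid(score):
    """Maps a raw score to a probability under the logistic model."""
    return 1.0 / (1.0 + math.exp(-score))


for m in (-1.0, 0.0, 1.0, 2.0):
    print(m, hinge_loss(m), round(logistic_loss(m), 4))

print(sigmoid(0.0))
```

The logistic loss corresponds to a probabilistic model, so a raw score can be turned into a probability through the sigmoid. The hinge loss only cares about which side of the margin an example falls on, so its scores carry no such interpretation.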

In [6]:
pipeline = Pipeline([
    ('tfidf',TfidfVectorizer(use_idf=True,
                             ngram_range=(1,2),
                             analyzer='word',
                             tokenizer=tokenizer.tokenize
    )),
    ('sgd',SGDClassifier(loss='hinge',
                         fit_intercept=True
    ))
])
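
As a reminder of what ngram_range=(1,2) asks for, here is a minimal sketch of word n-gram extraction. The helper `word_ngrams` and the sample tokens are made up for illustration. TfidfVectorizer does this internally, after its own preprocessing.

```python
def word_ngrams(tokens, ngram_range):
    """Return all n-grams of the sizes in ngram_range, joined with spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["not", "happy", "today"]  # hypothetical tokenized tweet
unigrams_and_bigrams = word_ngrams(tokens, (1, 2))
bigrams_and_trigrams = word_ngrams(tokens, (2, 3))
print(unigrams_and_bigrams)
print(bigrams_and_trigrams)
```

With bigrams included, "not happy" becomes a feature of its own, so the classifier no longer has to treat "happy" identically in both contexts.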

In [7]:
t = time.time()
pipeline.fit(train.SentimentText,train.Sentiment)

print('Time taken to fit the newly-instantiated model: {} seconds'.format(int(time.time()-t)))



Time taken to fit the newly-instantiated model: 24 seconds


# Loading the test dataset

The test dataset on which the program is evaluated is provided in a separate CSV file, 'test.csv'.
Be careful when loading this file, as there is an extra space in the name of the column that contains the texts.

In [8]:
test = pd.read_csv('test.csv',
                   encoding='ISO-8859-1',
                   header=0,
                   names=['SentimentText'],  # Somebody put a space in the column name :-|
                                             # It initially reads as ' SentimentText'
                                             # It took me a while to figure out what was wrong with it
                   index_col=0)

test.head()

Unnamed: 0,SentimentText
80000,"@ChaMberSWasHerE Oh, we've always planted rose..."
80001,@chamcircuit im going 2 try your comp but as i...
80002,@ChamCircuit is 13th Top Dance &amp; Electroni...
80003,@chamcircuit so how was Up?
80004,@chamelledesigns Quite the opposite! I'm sure ...
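
The stray space in the header is easy to reproduce on a small scale. The sketch below uses the standard csv module on invented file contents. The real header of 'test.csv' is assumed to look similar, and the first column name used here is a guess.

```python
import csv
import io

# Hypothetical file contents with a space after the delimiter
raw = "ItemID, SentimentText\n80000, first tweet\n80001, second tweet\n"

plain_header = next(csv.reader(io.StringIO(raw)))
clean_header = next(csv.reader(io.StringIO(raw), skipinitialspace=True))
print(plain_header)
print(clean_header)
```

pandas.read_csv accepts a skipinitialspace argument as well, and stripping the column names after loading (for example with `test.columns.str.strip()`) is another option. Passing names= explicitly, as done above, sidesteps the problem altogether.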


# Prediction time!

It is now time to predict the labels of the test set with our fitted model.

The results of the predictions are then saved in 'submission.csv', ready to be uploaded to Kaggle for scoring.

In [9]:
predictions = pipeline.predict(test.SentimentText.values)

print('Number of 1\'s predicted: {}'.format(predictions.sum()))
print('Out of a total of {} texts'.format(len(predictions)))

Number of 1's predicted: 12571
Out of a total of 19999 texts


In [10]:
submission = pd.DataFrame({'sentiment':predictions},index=test.index)
submission.index.names = ['id']
submission.head()

id,sentiment
80000,0
80001,1
80002,1
80003,1
80004,1


In [11]:
submission.to_csv('submission.csv')