# 1. Importing and exploring the data using a pandas dataframe

We start by importing our dataset. For this lab, we will use a subset of the data distributed for a Shared Task on sentiment analysis in tweets, conducted as part of <a href="http://alt.qcri.org/semeval2017/">SemEval 2017</a>.

We are interested in <a href="http://alt.qcri.org/semeval2017/task4/">SemEval Task 4A</a>. For convenience, a subset of the data for this task has been provided for this lab.

In [2]:
import pandas as pd
import gc
gc.enable()

In [3]:
input_file = 'NiaveBayesDataset.txt' #Input file, tab-delimited

In [4]:
#Our data has no header. Also, we only want the first 3 cols  -- see usecols argument
#Some lines have a fourth (date) column
data = pd.read_csv(input_file, sep='\t', encoding="utf-8", header=None, usecols=range(3))

In [5]:
#Now we can name our own columns
data.columns = ['ID', 'Polarity', 'Tweet']

## 1.1 Explore the data: head() and tail()

Try to explore the data using the following commands:
* head()
* tail()
* finding a column by name, e.g. data['Polarity']
* finding a specific row by position, e.g. data.iloc[0]; slicing the data frame as you would a normal Python list, e.g. data[0:5], also returns rows
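
The commands listed above can be tried on a small frame first. This is only a sketch: the three-row frame `toy` and its contents are made up for illustration and are not part of the lab data.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as our data
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'Polarity': ['neutral', 'negative', 'positive'],
    'Tweet': ['first tweet', 'second tweet', 'third tweet'],
})

first_two = toy.head(2)     # first two rows
last_one = toy.tail(1)      # last row
labels = toy['Polarity']    # a single column, returned as a Series
second_row = toy.iloc[1]    # a single row selected by position
row_slice = toy[0:2]        # list-style slicing selects rows by position

print(second_row['Polarity'])
```

Note that a bare integer such as `toy[0]` is looked up as a column name, so it raises a KeyError on this frame; `iloc` is the safer way to pick out a single row.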

In [6]:
data.head()

| | ID | Polarity | Tweet |
|---|---|---|---|
| 0 | 619950566786113536 | neutral | Picturehouse's, Pink Floyd's, 'Roger Waters: T... |
| 1 | 619969366986235905 | neutral | Order Go Set a Watchman in store or through ou... |
| 2 | 619971047195045888 | negative | If these runway renovations at the airport pre... |
| 3 | 619974445185302528 | neutral | If you could ask an onstage interview question... |
| 4 | 619987808317407232 | positive | A portion of book sales from our Harper Lee/Go... |


In [7]:
data.tail()

| | ID | Polarity | Tweet |
|---|---|---|---|
| 20627 | 681877834982232064 | neutral | @ShaquilleHoNeal from what I think you're aski... |
| 20628 | 681879579129200640 | positive | Iran ranks 1st in liver surgeries, Allah bless... |
| 20629 | 681883903259357184 | neutral | Hours before he arrived in Saudi Arabia on Tue... |
| 20630 | 681904976860327936 | negative | @VanityFair Alex Kim Kardashian worth how to ... |
| 20631 | 681910549211287552 | neutral | I guess even Pandora knows Justin Bieber is a ... |


## 1.2 Explore the data distribution by plotting Polarity

It is important to explore the distribution of the data, because we are often working with highly skewed distributions. In real-world applications, classes are rarely equally frequent.

We're going to use numpy to run a quick procedure over the 'Polarity' column in our data frame, to find the unique values (of which there should be three) and their corresponding frequencies. This returns a pair: the unique values (sorted) and their counts, in separate arrays.

In [8]:
import numpy as np
#Use numpy to find unique polarity vals and count them
unique, counts = np.unique(data['Polarity'], return_counts=True) 
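
As a cross-check, the same counts can be obtained with the pandas method `value_counts()`. The sketch below uses a made-up five-element label series, not the lab data, and compares the two approaches.

```python
import numpy as np
import pandas as pd

# Hypothetical labels, for illustration only
labels = pd.Series(['neutral', 'positive', 'neutral', 'negative', 'neutral'])

# numpy: sorted unique values plus a parallel array of counts
unique, counts = np.unique(labels.to_numpy(), return_counts=True)
by_numpy = dict(zip(unique, counts))

# pandas: counts per value, ordered by frequency rather than alphabetically
by_pandas = labels.value_counts().to_dict()

print(by_numpy)
print(by_pandas)
```

Both give the same label-to-count mapping; only the ordering of the result differs.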

Now we can create a new pandas dataframe, treating unique and counts as its columns.

In [9]:
#Now create a temporary pandas frame with these values and frequencies
polarities = pd.DataFrame({'polarity': unique, 'frequency': counts})

In [10]:
polarities.head()

| | polarity | frequency |
|---|---|---|
| 0 | negative | 3231 |
| 1 | neutral | 10342 |
| 2 | positive | 7059 |
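
To make the skew concrete, the frequencies above can be turned into proportions. This is a small worked calculation using only the three counts shown in the output; the variable names are ours.

```python
# Frequencies copied from the output above
frequencies = {'negative': 3231, 'neutral': 10342, 'positive': 7059}

total = sum(frequencies.values())
proportions = {label: count / total for label, count in frequencies.items()}
majority_label = max(frequencies, key=frequencies.get)

for label, share in proportions.items():
    print('%s: %.1f%%' % (label, 100 * share))
```

Roughly half of the tweets are neutral, so a classifier that always predicted the majority label would already be right about half of the time. This is a useful baseline to keep in mind when reading the evaluation results later.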


Finally, we plot the values in the new dataframe. This uses the dataframe's built-in plot interface, which relies on matplotlib for drawing, so we import matplotlib.pyplot first.

In [11]:
import matplotlib.pyplot as plt
#and plot
polarities.plot.bar(x='polarity', y='frequency', rot=0)

<matplotlib.axes._subplots.AxesSubplot at 0x1acd71447f0>

## 1.3 Splitting the data into training and testing

Split the data into two disjoint sets, for training (90%) and for testing (10%). The function below needs to be completed, such that it takes a dataframe and a ratio for the training proportion (e.g. 0.9) and randomly splits the data accordingly. One possible strategy:
1. collect the indices corresponding to all the rows in the DF
2. shuffle them to create a random permutation using numpy
3. take the first x% as the training indices, the rest for testing
4. use the convenient pandas iloc function to retrieve the rows for each set by their index

In [12]:
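def split_train_test(data, ratio):
    #Shuffle the row positions (numpy was imported as np above), then cut at the training proportion
    indices = np.random.permutation(len(data))
    cut = round(len(data) * ratio)
    train = data.iloc[indices[:cut]]
    test = data.iloc[indices[cut:]]
    return train, test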
    

In [13]:
train, test = split_train_test(data, 0.9)

An alternative way is to split using a built-in function in scikit-learn, as shown below.

In [14]:
from sklearn.model_selection import train_test_split
#random_state param is just a random number seed
train, test = train_test_split(data, test_size = 0.1, random_state=42)
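
The effect of test_size and random_state can be checked on a small frame. This is a sketch on a made-up 20-row frame (not the lab data): it confirms that the two sets are disjoint, that about 10% of the rows go to the test set, and that reusing the same seed reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 20-row frame, for illustration only
toy = pd.DataFrame({'ID': range(20), 'Polarity': ['neutral', 'positive'] * 10})

train_a, test_a = train_test_split(toy, test_size=0.1, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.1, random_state=42)

print(len(train_a), len(test_a))
print(set(train_a.index).isdisjoint(test_a.index))   # no row appears in both sets
print(list(test_a.index) == list(test_b.index))      # same seed, same split
```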

In [15]:
train.head()

| | ID | Polarity | Tweet |
|---|---|---|---|
| 4336 | 630134008718958592 | neutral | Miguel Montero with a go-ahead RBI single in h... |
| 13485 | 640191552065658881 | neutral | Even maybe more odd is the Marion County judge... |
| 4505 | 630465779511771136 | neutral | Frank Gifford, who died today at 84, worked th... |
| 20018 | 680634687463788544 | neutral | @Julienaticadiks: You can now Listen new song ... |
| 6497 | 633505111105409025 | neutral | @_alexisaguirre_ Not sure if you'd enjoy them,... |


# 2. Implementing a Naive Bayes classifier

We'll now use sklearn to implement a classifier. Our strategy will be to:
1. Extract the vocabulary from our training instances
2. Vectorise our training instances using the Bag of Words assumption
3. Initially we'll use word frequencies. So each document (each Tweet in our dataframe) becomes a list of numbers of length |V|, where, for each element of our vocabulary V, there is a corresponding number indicating the frequency of that word in the document.
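
Before turning to sklearn, the strategy above can be sketched by hand in plain Python. The two documents and the helper `vectorise` are made up for illustration; the point is only that every document becomes a vector of length |V|.

```python
from collections import Counter

# Two hypothetical documents
docs = ['good good movie', 'bad movie']

# Step 1: the vocabulary is the sorted set of all words seen in the documents
vocabulary = sorted({word for doc in docs for word in doc.split()})

def vectorise(doc):
    """Return one count per vocabulary entry, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# Steps 2 and 3: each document becomes a list of word frequencies
vectors = [vectorise(doc) for doc in docs]

print(vocabulary)   # ['bad', 'good', 'movie']
print(vectors)      # [[0, 2, 1], [1, 0, 1]]
```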

## 2.1 The CountVectorizer class

A count vectorizer in sklearn is a class that transforms text into a vector of word features with their corresponding counts. The CountVectorizer can apply a stop list: one is built in for English, so we can just tell it to use that, but a custom stop list can also be supplied as a list. See <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction">the documentation for more details on this class</a> and the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer">API reference</a>.
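
The effect of the stop list can be seen on a couple of made-up sentences. In this sketch the documents and the custom stop list are our own; `vocabulary_` is the fitted attribute that maps each term to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents
docs = ['the cat sat on the mat', 'the dog sat']

no_stops = CountVectorizer().fit(docs)
english_stops = CountVectorizer(stop_words='english').fit(docs)
custom_stops = CountVectorizer(stop_words=['the', 'on']).fit(docs)

print(sorted(no_stops.vocabulary_))        # includes 'the' and 'on'
print(sorted(english_stops.vocabulary_))   # built-in English list applied
print(sorted(custom_stops.vocabulary_))    # only 'the' and 'on' removed
```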

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
counter = CountVectorizer(stop_words='english',ngram_range=(1,2))

#pass all tweets in the training set to the count vectoriser and apply 'fit_transform'
train_features = counter.fit_transform(train['Tweet'])

In [34]:
train_features

<18568x175925 sparse matrix of type '<class 'numpy.int64'>'
	with 399129 stored elements in Compressed Sparse Row format>

The result is a large, sparse matrix where each row corresponds to a tweet. Each column corresponds to one item in the vocabulary: a single word or, because we set ngram_range=(1,2), a two-word sequence (bigram).
You can convince yourself that this is the case by comparing the <b>shape</b> of the train dataframe and the matrix:

In [18]:
train.shape

(18568, 3)

In [19]:
train_features.shape

(18568, 175925)

In a CountVectorizer matrix, columns correspond to vocabulary items (here, words and bigrams). We can see which ones we have. You'll notice that there is a lot of noise, partly due to tweets containing urls etc.

In [20]:
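#In scikit-learn 1.0 and later, use counter.get_feature_names_out() instead (get_feature_names() was removed in 1.2)
counter.get_feature_names()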

['00',
 '00 00p',
 '00 00pm',
 '00 05',
 '00 08',
 '00 10',
 '00 12',
 '00 13',
 '00 1st',
 '00 2nd',
 '00 30',
 '00 30pm',
 '00 amp',
 '00 bids',
 '00 brown',
 '00 buddha',
 '00 case',
 '00 centraleuropeantime',
 '00 cet',
 '00 ch4',
 '00 donation',
 '00 era',
 '00 hey',
 '00 hot',
 '00 http',
 '00 jan',
 '00 ken',
 '00 lydia',
 '00 meets',
 '00 mn',
 '00 mon',
 '00 newport',
 '00 old',
 '00 pm',
 '00 sam',
 '00 time',
 '00 uur',
 '00 voodoo',
 '00 yin',
 '00 yoga',
 '000',
 '000 000',
 '000 barrels',
 '000 candidates',
 '000 coins',
 '000 cops',
 '000 employees',
 '000 enjoyed',
 '000 fans',
 '000 fighters',
 '000 help',
 '000 https',
 '000 ilk',
 '000 innocents',
 '000 ira',
 '000 islamists',
 '000 jerry',
 '000 lamborghini',
 '000 legal',
 '000 muslims',
 '000 people',
 '000 spectators',
 '000 students',
 '000 things',
 '000 views',
 '000 women',
 '001',
 '001 940',
 '007',
 '007 film',
 '007 spectre',
 '00am',
 '00am 11',
 '00am christians',
 '00am live',
 '00am matthews',
 ...

If you want to take a look at the features themselves, you can turn the sparse matrix object returned by the CountVectorizer into an array (this will be memory intensive, so you might get a memory error). Observe that most words just have zero values.

In [21]:
train_features.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
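
If converting the whole matrix is too memory intensive, the non-zero entries of a single row can be read directly from the sparse representation. This is a sketch on two made-up documents; it assumes the matrix is in Compressed Sparse Row format, as reported above, where `indptr` marks where each row's entries start and end.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents
docs = ['good good movie', 'bad movie']
toy_counter = CountVectorizer()
features = toy_counter.fit_transform(docs)

# Reverse the term -> column mapping so columns can be named
index_to_term = {index: term for term, index in toy_counter.vocabulary_.items()}

# Entries of row 0 live between indptr[0] and indptr[1]
start, end = features.indptr[0], features.indptr[1]
nonzero = {index_to_term[col]: int(count)
           for col, count in zip(features.indices[start:end], features.data[start:end])}

# Fraction of cells that are actually stored
density = features.nnz / (features.shape[0] * features.shape[1])

print(nonzero)
print(density)
```

On the real training matrix the density is far lower: 399129 stored elements out of 18568 x 175925 cells.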

## 2.3 Training the NB classifier

Training a classifier in sklearn involves these steps:
1. Initialising an instance of the MultinomialNB class
2. Fitting the classifier (forcing it to learn parameters) to the training data with corresponding labels.

Our training data is now <b>train_features</b>; the corresponding labels for each row are in the column <b>train['Polarity']</b>

The code below uses the sklearn built-in NB classifier.


In [22]:
from sklearn.naive_bayes import MultinomialNB

#Returns a fitted (trained) classifier
nb_classifier = MultinomialNB().fit(train_features, train['Polarity'])

## 2.4 Testing the classifier

Before we can apply our classifier, we need to also perform the same vectorization operations on the test set. However, we do not call fit_transform(), but only transform(). This is because the CountVectorizer has already been fit to our training data. We only want to extract the known features from the test set.

In [23]:
test_features = counter.transform(test['Tweet'])
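
The difference between fit_transform() and transform() can be seen on a tiny example. The documents here are made up; the sketch shows that a word never seen during fitting is simply dropped, so the test matrix keeps the same columns as the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_counter = CountVectorizer()

# Fitting learns the vocabulary: bad, good, movie
toy_train = toy_counter.fit_transform(['good movie', 'bad movie'])

# transform() reuses that vocabulary; 'unseen' and 'film' are ignored
toy_test = toy_counter.transform(['good unseen film'])

print(sorted(toy_counter.vocabulary_))
print(toy_test.toarray())
```

Calling fit_transform() on the test set instead would build a new vocabulary, and the resulting columns would no longer line up with those the classifier was trained on.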

Now we can try to apply it to the test data. This returns an array of predicted labels.

In [24]:
predictions = nb_classifier.predict(test_features)
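
Besides hard labels, MultinomialNB can also report a probability for each class via predict_proba(), with columns ordered as in the fitted `classes_` attribute. The sketch below uses four made-up training documents rather than the lab data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training documents and labels
train_docs = ['good great film', 'great fun', 'bad awful film', 'awful boring']
train_labels = ['positive', 'positive', 'negative', 'negative']

toy_counter = CountVectorizer()
toy_clf = MultinomialNB().fit(toy_counter.fit_transform(train_docs), train_labels)

new_features = toy_counter.transform(['great film'])
label = toy_clf.predict(new_features)[0]

# One probability per class, in the order given by classes_
probabilities = dict(zip(toy_clf.classes_, toy_clf.predict_proba(new_features)[0]))

print(label)
print(probabilities)
```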

In [42]:
#What is the prediction for the first tweet?
#Output format: tweet => predicted label - true label
print('%r => %s - %r' % (test['Tweet'].iloc[0], predictions[0], test['Polarity'].iloc[0]))



"I may be mean with niall's legs sometimes but it doesnt mean i dont care" => neutral - 'neutral'


### Using Tfidf Transformer

In [56]:
from sklearn.feature_extraction.text import TfidfTransformer

In [57]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(train_features)
X_train_tfidf.shape

(18568, 175925)
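
The transformer keeps the shape of the count matrix and only reweights its entries. As a sketch of what happens under the default settings (smoothed idf followed by L2 normalisation of each row), the weights can be recomputed by hand on a made-up 2x3 count matrix and compared with the library's output.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 2 documents, 3 terms
counts = np.array([[2, 1, 0],
                   [0, 1, 1]])

tfidf = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()

# Manual version of the default weighting
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed inverse document frequency
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)   # L2-normalise rows

print(np.allclose(tfidf, manual))
```

Terms that occur in every document (the middle column here) receive the lowest idf, so they contribute less than rarer terms with the same count.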

In [58]:
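from sklearn.naive_bayes import MultinomialNB
#Note: this fits on the raw counts (train_features); to train on the TF-IDF weights, pass X_train_tfidf instead
clf = MultinomialNB().fit(train_features, train['Polarity'])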

In [75]:
docs_new = ['God is love, God is life', 'I want to die, life is shit, somebody kill me', 'I am a stupid boy who likes to play in the mud','Existence is shallow, there is no purpose, life has no meaning']
X_new_counts = counter.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

print(predicted)

['positive' 'negative' 'neutral' 'positive']


In [60]:
predicted2 = clf.predict(test_features)

In [66]:
print('%r => %s - %r' % (test['Tweet'].iloc[348], predicted2[348], test['Polarity'].iloc[348]))


'1st SCOTUS pisses off Christians with gay marriage LAW then rightly jails #kimdavies for breaking it ! #bravo https://t.co/nFE1qXe1s3' => negative - 'neutral'


## 2.5 Evaluation
Finally, we can look at some evaluation metrics (precision, recall and F1 per class) for the predictions made in 2.4:

In [69]:
from sklearn import metrics
print(metrics.classification_report(test['Polarity'], predictions))

              precision    recall  f1-score   support

    negative       0.70      0.17      0.28       308
     neutral       0.61      0.80      0.70      1036
    positive       0.65      0.57      0.61       720

   micro avg       0.63      0.63      0.63      2064
   macro avg       0.65      0.51      0.53      2064
weighted avg       0.64      0.63      0.60      2064



We can generate a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html">confusion matrix</a> to see which categories tend to be confused with each other. Rows correspond to the true labels and columns to the predicted labels. Both are ordered alphabetically by label (negative, neutral, positive) so that, e.g., row 0 column 1 is the number of times the first class (negative) is mislabelled as the second (neutral).

In [27]:
print(metrics.confusion_matrix(test['Polarity'], predictions))

[[ 53 215  40]
 [ 20 831 185]
 [  3 307 410]]
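
The per-class figures in the classification report can be recovered from this matrix. The sketch below copies the nine numbers printed above and recomputes recall (correct predictions divided by the row total), precision (correct predictions divided by the column total) and overall accuracy.

```python
import numpy as np

labels = ['negative', 'neutral', 'positive']

# Confusion matrix copied from the output above (rows: true, columns: predicted)
confusion = np.array([[53, 215, 40],
                      [20, 831, 185],
                      [3, 307, 410]])

correct = np.diag(confusion)
recall = correct / confusion.sum(axis=1)      # per true class
precision = correct / confusion.sum(axis=0)   # per predicted class
accuracy = correct.sum() / confusion.sum()

for name, p, r in zip(labels, precision, recall):
    print('%-8s precision=%.2f recall=%.2f' % (name, p, r))
print('accuracy=%.2f' % accuracy)
```

The low recall for the negative class comes from the first row: most truly negative tweets (215 of 308) were predicted as neutral.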
