# 1. Importing and exploring the data using a pandas dataframe

We start by importing our dataset. For this lab, we will use a subset of the data distributed for a Shared Task on sentiment analysis in tweets, conducted as part of <a href="http://alt.qcri.org/semeval2017/">SemEval 2017</a>.

We are interested in <a href="http://alt.qcri.org/semeval2017/task4/">SemEval Task 4A</a>. For convenience, one part of the data has been provided for you for this lab.

In [None]:
import pandas as pd

In [None]:
input_file = 'SemEval2017-task4-dev.subtask-A.english.INPUT.txt' #Input file, tab-delimited

In [None]:
#Our data has no header. Also, we only want the first 3 cols  -- see usecols argument
#Some lines have a fourth (date) column
data = pd.read_csv(input_file, sep='\t', encoding="utf-8", header=None, usecols=range(3))

In [None]:
#Now we can name our own columns
data.columns = ['ID', 'Polarity', 'Tweet']

## 1.1 Explore the data: head() and tail()

Try to explore the data using the following commands:
* head()
* tail()
* finding a column by name, e.g. data['Polarity']
* finding a specific row by position, e.g. data.iloc[0], or a range of rows by slicing the data frame as you would a normal Python list, e.g. data[0:5]
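
As a rough illustration of these commands, here is a minimal sketch on a small, made-up data frame (the frame `toy` and its contents are hypothetical; only the column names match the lab data):

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the lab data
toy = pd.DataFrame({
    'ID': [101, 102, 103, 104],
    'Polarity': ['positive', 'neutral', 'negative', 'neutral'],
    'Tweet': ['love it', 'it is monday', 'worst day', 'see you at 5'],
})

first_two = toy.head(2)    # first two rows
last_one = toy.tail(1)     # last row
labels = toy['Polarity']   # a single column, returned as a Series
third_row = toy.iloc[2]    # one row by position, returned as a Series
middle = toy[1:3]          # list-style slicing selects rows 1 and 2
```

Note that a single integer in square brackets, such as `toy[2]`, is looked up as a column label and raises a KeyError here, which is why single rows are retrieved with `iloc`.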

In [1]:
#Your code here using head and tail on your pandas DF

## 1.2 Explore the data distribution by plotting Polarity

It is important to explore the distribution of data. We are often working with highly skewed distributions. In real-world applications, classes are rarely equally frequent.

We're going to use numpy to run a quick procedure over the 'Polarity' column in our data frame, to find the unique values (which should be three) and their corresponding frequencies. This returns a pair: the unique values and their counts, in separate arrays.

In [None]:
import numpy as np
#Use numpy to find unique polarity vals and count them
unique, counts = np.unique(data['Polarity'], return_counts=True) 

Now we can create a new pandas dataframe, treating unique and counts as its columns.

In [None]:
#Now create a temporary pandas frame with these values and frequencies
polarities = pd.DataFrame({'polarity': unique, 'frequency': counts})
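
The same frequency table can probably be obtained with pandas alone. The sketch below uses `value_counts()` on a made-up Series standing in for `data['Polarity']` (the names `labels`, `freq` and `polarities_alt` are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for data['Polarity']
labels = pd.Series(['neutral', 'positive', 'neutral', 'negative', 'neutral', 'positive'])

# value_counts() returns the counts, sorted from most to least frequent
freq = labels.value_counts()

# Rebuild a two-column frame like the one above, sorted alphabetically by label
polarities_alt = (
    freq.sort_index()
        .rename_axis('polarity')
        .reset_index(name='frequency')
)
```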

Finally, we plot the values in the new dataframe. This uses the data frame's built-in plotting interface, which relies on matplotlib, so we import matplotlib.pyplot first.

In [3]:
import matplotlib.pyplot as plt
#and plot
polarities.plot.bar(x='polarity', y='frequency', rot=0)

## 1.3 Splitting the data into training and testing

Split the data into two disjoint sets, for training (90%) and for testing (10%). The function below needs to be completed, such that it takes a dataframe and a ratio for the training proportion (e.g. 0.9) and randomly splits the data accordingly. One possible strategy:
1. collect the indices corresponding to all the rows in the DF
2. shuffle them to create a random permutation using numpy
3. take the first x% as the training indices, the rest for testing
4. use the convenient pandas iloc indexer to retrieve the rows for each set by their position
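
The four steps above could be sketched roughly as follows. This is only one possible approach, shown on a made-up data frame; the function name `split_by_shuffled_indices` and its `seed` parameter are hypothetical and not part of the lab code:

```python
import numpy as np
import pandas as pd

def split_by_shuffled_indices(df, ratio, seed=0):
    """Randomly split df into (train, test) with `ratio` of the rows in train."""
    rng = np.random.default_rng(seed)      # seeded so the split is reproducible
    shuffled = rng.permutation(len(df))    # steps 1-2: shuffled row positions
    cut = int(len(df) * ratio)             # step 3: size of the training part
    return df.iloc[shuffled[:cut]], df.iloc[shuffled[cut:]]  # step 4

# Hypothetical toy data with 10 rows
toy = pd.DataFrame({'ID': range(10), 'Tweet': ['tweet %d' % i for i in range(10)]})
toy_train, toy_test = split_by_shuffled_indices(toy, 0.9)
```

Because both subsets are taken from a single permutation, they are disjoint and together cover every row.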

In [4]:
def split_train_test(data, ratio):
    #Your code here
    #Modify below to return subsets of the data 
    return None, None
    

In [None]:
train, test = split_train_test(data, 0.9)

An alternative way is to split using a built-in function in scikit-learn, as shown below.

In [None]:
from sklearn.model_selection import train_test_split
#random_state param is just a random number seed
train, test = train_test_split(data, test_size = 0.1, random_state=42)

## 1.4 Exercise for you

We've seen above that the distribution of data isn't uniform (many more neutral than positive/negative tweets). What would you do to make the training and test datasets approximately reflect the distribution?
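
One option worth considering is stratified sampling, which `train_test_split` supports through its `stratify` argument. The sketch below uses made-up, deliberately skewed data (the frame `toy` and its class proportions are assumptions for illustration only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical skewed toy data: 60% neutral, 20% positive, 20% negative
toy = pd.DataFrame({
    'Polarity': ['neutral'] * 60 + ['positive'] * 20 + ['negative'] * 20,
    'Tweet': ['tweet %d' % i for i in range(100)],
})

# stratify= asks for the class proportions to be preserved in both subsets
strat_train, strat_test = train_test_split(
    toy, test_size=0.1, random_state=42, stratify=toy['Polarity']
)
test_counts = strat_test['Polarity'].value_counts().to_dict()
```

With these proportions, the 10 test rows should contain 6 neutral, 2 positive and 2 negative tweets, mirroring the full data.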

# 2. Implementing a Naive Bayes classifier

We'll now use sklearn to implement a classifier. Our strategy will be to:
1. Extract the vocabulary from our training instances
2. Vectorise our training instances using the Bag of Words assumption
3. Initially we'll use word frequencies. So each document (each Tweet in our dataframe) becomes a list of numbers of length |V|, where, for each element of our vocabulary V, there is a corresponding number indicating the frequency of that word in the document.

## 2.1 The CountVectorizer class

A count vectorizer in sklearn is a class that transforms text into a vector of word features with their corresponding counts. The CountVectorizer can apply a stop list: one is built in for English, so we can just tell it to use that, but a custom stop list can also be supplied as a list of words. See <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction">the documentation for more details on this class</a> and the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer">API reference</a>.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
counter = CountVectorizer(stop_words='english')

#pass all tweets in the training set to the count vectoriser and apply 'fit_transform'
train_features = counter.fit_transform(train['Tweet'])

The result is a large, sparse matrix where each row corresponds to a tweet. Each column corresponds to one of the words in the vocab.
You can convince yourself that this is the case by comparing the <b>shape</b> of the train data frame and the matrix; both should have the same number of rows:

In [None]:
train.shape

In [None]:
train_features.shape

In a CountVectorizer matrix, columns correspond to words. We can see what words we have. You'll notice that there is a lot of noise, partly due to tweets containing urls etc.

In [None]:
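#get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
counter.get_feature_names_out()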

If you want to take a look at the features themselves, you can turn the sparse matrix object returned by the CountVectorizer into an array. Observe that most words just have zero values.

In [None]:
train_features.toarray()
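
On the full training set the matrix is too large to read comfortably. As a small illustration, here is the same idea on a made-up two-document corpus (the names `docs`, `toy_counter`, `vocab` and `dense` are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus
docs = ['good good movie', 'bad movie']

toy_counter = CountVectorizer()
toy_matrix = toy_counter.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort the words by that index
vocab = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)
dense = toy_matrix.toarray()

print(vocab)   # ['bad', 'good', 'movie']
print(dense)   # row 0 is [0 2 1], row 1 is [1 0 1]
```

Each row is one document and each column one vocabulary word, so 'good' appearing twice in the first document gives a 2 in the second column of the first row.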

## 2.2 Exercise for you

Find a way to only include alphabetic strings in your features (i.e. excluding punctuation, numbers etc).
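
One possible direction, sketched under the assumption that "alphabetic" means runs of ASCII letters, is the `token_pattern` argument of CountVectorizer, which takes a regular expression describing what counts as a token. The example tweet below is made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical noisy tweet
docs = ['Win 2 tickets!!! http://t.co/abc123 #luck']

# Tokens are runs of two or more ASCII letters between word boundaries
alpha_counter = CountVectorizer(token_pattern=r'\b[a-zA-Z]{2,}\b')
alpha_counter.fit(docs)
alpha_vocab = sorted(alpha_counter.vocabulary_)

print(alpha_vocab)   # ['co', 'http', 'luck', 'tickets', 'win']
```

The digit and the mixed string 'abc123' are dropped, but alphabetic URL fragments such as 'http' and 'co' still get through, so removing URLs would need a separate preprocessing step.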

## 2.3 Training the NB classifier

Training a classifier in sklearn involves these steps:
1. Initialising an instance of the MultinomialNB class
2. Fitting the classifier to the training data with its corresponding labels, i.e. having it learn its parameters from them.

Our training data is now <b>train_features</b>; the corresponding labels for each row are in the column <b>train['Polarity']</b>.

The code below uses the sklearn built-in NB classifier.


In [None]:
from sklearn.naive_bayes import MultinomialNB

#Returns a fitted (trained) classifier
nb_classifier = MultinomialNB().fit(train_features, train['Polarity'])

## 2.4 Testing the classifier

Before we can apply our classifier, we need to also perform the same vectorization operations on the test set. However, we do not call fit_transform(), but only transform(). This is because the CountVectorizer has already been fit to our training data. We only want to extract the known features from the test set.

In [None]:
test_features = counter.transform(test['Tweet'])
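
To see what "only the known features" means in practice, here is a small sketch on made-up sentences (the name `toy_counter` and the example texts are hypothetical). Words that were not seen during fitting simply get no column:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_counter = CountVectorizer()
toy_counter.fit(['great game today', 'terrible game'])   # learn the vocabulary

# 'match' and 'tonight' were never seen during fit, so they are ignored
unseen = toy_counter.transform(['great match tonight'])
n_columns = unseen.shape[1]
row = unseen.toarray()[0]

print(n_columns)   # 4 (game, great, terrible, today)
print(row)         # [0 1 0 0]
```

The number of columns stays fixed at the size of the training vocabulary, which is what the classifier expects at prediction time.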

Now we can apply the classifier to the test data. This returns an array of predicted labels.

In [None]:
predictions = nb_classifier.predict(test_features)

In [None]:
#What is the prediction for the first tweet?
print('%r => %s' % (test['Tweet'].iloc[0], predictions[0]))

## 2.5 Evaluation
Finally, we can look at some evaluation metrics:

In [None]:
from sklearn import metrics
print(metrics.classification_report(test['Polarity'], predictions))
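
The averaged rows of such a report can be reproduced individually. The sketch below uses made-up labels and a degenerate classifier that always predicts the majority class, to show how differently micro- and macro-averaging treat a skewed distribution (all names and values here are hypothetical):

```python
from sklearn import metrics

# Hypothetical gold labels: 8 neutral, 1 positive, 1 negative
gold = ['neutral'] * 8 + ['positive', 'negative']
pred = ['neutral'] * 10   # always predict the majority class

# micro: pool all decisions; macro: average the per-class scores equally
micro_recall = metrics.recall_score(gold, pred, average='micro')
macro_recall = metrics.recall_score(gold, pred, average='macro')

print(micro_recall)   # 0.8
print(macro_recall)   # (1 + 0 + 0) / 3, approximately 0.333
```

For single-label problems like this one, the micro-average equals accuracy, and depending on the scikit-learn version the report may show an accuracy row in its place.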

We can generate a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html">confusion matrix</a> to see which categories tend to be confused with each other. Rows correspond to the true labels and columns to the predicted labels. Note that rows and columns are ordered alphabetically by label (negative, neutral, positive) so that, e.g. row 0 column 1 is the number of times the first class (negative) is mislabelled as the second (neutral).

In [None]:
print(metrics.confusion_matrix(test['Polarity'], predictions))
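
To make the layout of the matrix concrete, here is a small sketch with made-up gold labels and predictions (the lists `gold` and `pred` are hypothetical). Passing `labels` explicitly fixes the row and column order:

```python
from sklearn import metrics

# Hypothetical gold labels and predictions
gold = ['negative', 'negative', 'neutral', 'neutral', 'neutral', 'positive']
pred = ['neutral', 'negative', 'neutral', 'neutral', 'positive', 'positive']

labels = ['negative', 'neutral', 'positive']
cm = metrics.confusion_matrix(gold, pred, labels=labels)

print(cm)
# [[1 1 0]
#  [0 2 1]
#  [0 0 1]]
```

Here `cm[0, 1]` is 1 because exactly one negative tweet was predicted as neutral, and the diagonal holds the correct predictions for each class.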

## 2.6 Exercise (continued)
1. Can you interpret the above metric report? What do the micro- and the macro-average tell you?
2. Compare the precision and recall for each class. What do you notice? (Recall the distribution we saw at the outset)
3. Look at the confusion matrix above. Which class tends to be the one the classifier confuses the most? What is the most frequent prediction when the classifier is wrong?
4. Now, go back to the NB training and feature selection and try out a few ways to make the classifier better. In particular, look at the documentation for CountVectorizer and see if, by (a) incorporating only alphabetic words and (b) incorporating n-grams of, say, length 2, you achieve better results.
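
As a starting point for the n-gram part of the last item, the sketch below shows what `ngram_range=(1, 2)` extracts from a single made-up sentence (the names `docs` and `bigram_counter` are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-sentence corpus
docs = ['not good at all']

# ngram_range=(1, 2) keeps single words and adds pairs of adjacent words
bigram_counter = CountVectorizer(ngram_range=(1, 2))
bigram_counter.fit(docs)
ngrams = sorted(bigram_counter.vocabulary_)

print(ngrams)
# ['all', 'at', 'at all', 'good', 'good at', 'not', 'not good']
```

Bigrams such as 'not good' can capture negation that single words miss, at the cost of a much larger and sparser feature space.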

## 2.7 Implement your own
Now, implement your own NB Classifier. Evaluate it and compare the resulting precision and recall to those of the model you imported from sklearn. Do they match?

To train your model, you can optionally use the training features you extracted using the sklearn CountVectorizer.

Remember, training in the NB algorithm means estimating the quantities needed to compute P(c|F) under the independence assumption, namely:
1. The likelihood, P(F|c); and
2. The prior P(c)

You can use maximum likelihood estimates for these (optionally, you can apply a simple smoothing method).
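
As a rough starting point, here is a minimal sketch of a multinomial NB with add-alpha smoothing, working in log space to avoid numerical underflow. It is one possible design, not a reference solution: the function names `train_nb` and `predict_nb`, the `alpha` parameter and the toy counts are all hypothetical, and it assumes a dense numpy count matrix (a sparse matrix would first need `.toarray()`) and numpy label arrays.

```python
import numpy as np

def train_nb(X, y, alpha=1.0):
    """Estimate log priors and log likelihoods from a dense count matrix X."""
    classes = np.unique(y)
    log_prior = np.zeros(len(classes))
    log_likelihood = np.zeros((len(classes), X.shape[1]))
    for i, c in enumerate(classes):
        rows = X[y == c]                                   # documents of class c
        log_prior[i] = np.log(rows.shape[0] / X.shape[0])  # log P(c)
        word_counts = rows.sum(axis=0) + alpha             # add-alpha smoothing
        log_likelihood[i] = np.log(word_counts / word_counts.sum())  # log P(w|c)
    return classes, log_prior, log_likelihood

def predict_nb(X, classes, log_prior, log_likelihood):
    """Pick, for each row of X, the class with the highest log score."""
    scores = X @ log_likelihood.T + log_prior   # log P(c) + sum_w count * log P(w|c)
    return classes[np.argmax(scores, axis=1)]

# Hypothetical toy counts over the vocabulary ['bad', 'good', 'movie']
X_toy = np.array([[0, 2, 1], [0, 1, 1], [2, 0, 1], [1, 0, 0]])
y_toy = np.array(['positive', 'positive', 'negative', 'negative'])

model = train_nb(X_toy, y_toy)
toy_pred = predict_nb(np.array([[0, 1, 0], [1, 0, 1]]), *model)
print(toy_pred)   # ['positive' 'negative']
```

With `alpha=1.0` this corresponds to add-one (Laplace) smoothing; setting `alpha` close to zero approaches the plain maximum likelihood estimate, at the risk of zero probabilities for unseen words.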