## Import packages


In [None]:
# system tools
import os
import sys
sys.path.append(os.path.join(".."))

# data munging tools
import pandas as pd
import utils.classifier_utils as clf

# Machine learning stuff
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit
from sklearn import metrics

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

## Reading in the data

In [None]:
filename = os.path.join("..", "data", "labelled_data", "fake_or_real_news.csv")

DATA = pd.read_csv(filename, index_col=0)

__Inspect data__

In [None]:
DATA.sample(10)

In [None]:
DATA.shape

<br>
Q: How many examples of each label do we have?

In [None]:
DATA["label"].value_counts()

## Create balanced data

We can use the function ```balance``` to create a more even dataset.

In [None]:
DATA_balanced = clf.balance(DATA, 1000)
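
The `balance` helper comes from `utils.classifier_utils`, whose source isn't shown in this notebook. As a rough sketch of the idea, assuming it simply draws the same number of rows from each label, something like the following would do the job. The name `balance_sketch`, the toy dataframe and the `FAKE`/`REAL` label values are all made up for illustration.

```python
import pandas as pd

def balance_sketch(df, n, label_col="label", seed=42):
    """Hypothetical stand-in for clf.balance: keep n random rows per label."""
    return (df.groupby(label_col)
              .sample(n=n, random_state=seed)   # draw n rows from each label group
              .reset_index(drop=True))

# Toy, deliberately unbalanced data (invented for this example)
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "label": ["FAKE"] * 7 + ["REAL"] * 3,
})

toy_balanced = balance_sketch(toy, n=3)
print(toy_balanced["label"].value_counts())
```

Note that `n` can't be larger than the smallest label group unless you sample with replacement.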

In [None]:
DATA_balanced.shape

<br>

What do the label counts look like now?

In [None]:
DATA_balanced["label"].value_counts()

<br>

Let's now create new variables called ```texts``` and ```labels```, taking the data out of the dataframe so that we can mess around with them.

In [None]:
texts = DATA_balanced["text"]
labels = DATA_balanced["label"]

## Train-test split

I've included most of the 'hard work' for you here already, because these are long cells which might be easy to mess up while live-coding.

Instead, we'll discuss what's happening. If you have questions, don't be shy!

In [None]:
X_train, X_test, y_train, y_test = train_test_split(texts,           # texts for the model
                                                    labels,          # classification labels
                                                    test_size=0.2,   # create an 80/20 split
                                                    random_state=42) # random state for reproducibility
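
To make the idea concrete, here is a minimal sketch of what a shuffled 80/20 split does, written in plain Python. It is not how `sklearn` implements `train_test_split`; the function name `split_sketch` and the toy data are assumptions made for this example.

```python
import random

def split_sketch(items, labels, test_size=0.2, seed=42):
    """Shuffle the row indices, then slice off the first test_size share as test data."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)      # seeded, so the split is reproducible
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

# Toy data, invented for illustration
toy_texts = [f"doc {i}" for i in range(10)]
toy_labels = ["FAKE", "REAL"] * 5

train_X, test_X, train_y, test_y = split_sketch(toy_texts, toy_labels)
print(len(train_X), len(test_X))
```

Running it twice with the same `seed` gives the same split, which is what `random_state=42` does for us above.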

## Vectorizing and Feature Extraction

Vectorization. What is it and why are all the cool kids talking about it?

Essentially, vectorization is the process whereby textual or visual data is 'transformed' into some kind of numerical representation. One of the easiest ways to do this is to simply count how often individual features appear in a document.

Take the following text: 
<br><br>
<i>My father’s family name being Pirrip, and my Christian name Philip, my infant tongue could make of both names nothing longer or more explicit than Pip. So, I called myself Pip, and came to be called Pip.</i>
<br>

We can convert this into the following vector:

| and | be | being | both | called | came | christian | could | explicit | family | father | i | infant | longer | make | more | my | myself | name | names | nothing | of | or | philip | pip | pirrip | s | so | than | to | tongue|
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |  --- |
| 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 |

<br>
Our textual data is hence reduced to a jumbled-up 'vector' of numbers, known somewhat quaintly as a <i>bag-of-words</i>.
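
The counts in the table can be reproduced with a few lines of plain Python. This is only a sketch of the idea, assuming we lowercase the text and treat every run of word characters as a token; the variable names are made up for this example, and `sklearn`'s vectorizers tokenize slightly differently by default (they drop single-character tokens such as "i" and "s").

```python
import re
from collections import Counter

pip_text = ("My father’s family name being Pirrip, and my Christian name Philip, "
            "my infant tongue could make of both names nothing longer or more "
            "explicit than Pip. So, I called myself Pip, and came to be called Pip.")

# Lowercase, then split into runs of word characters (the apostrophe splits "father’s")
tokens = re.findall(r"\w+", pip_text.lower())

counts = Counter(tokens)                       # word -> number of occurrences
vocabulary = sorted(counts)                    # alphabetical column order, as in the table
vector = [counts[word] for word in vocabulary]

print(vocabulary)
print(vector)
```

Notice that the word order of the original sentence is gone. All that is left is the vocabulary and a count for each entry.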
<br>
<br>
To do this in practice, we first need to create a vectorizer. 

Tfidf vectors tend to be better for training classifiers. Why might that be?
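
One way to get a feel for this is to compute tf-idf weights by hand. The sketch below uses the smoothed idf formula that `sklearn` documents as its default, `ln((1 + n) / (1 + df)) + 1`, but leaves out the length normalisation that `TfidfVectorizer` also applies. The three toy documents and all the names are invented for this example.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
docs = [
    "the election results are fake",
    "the election was held today",
    "the weather today is sunny",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Raw term count multiplied by idf, with no normalisation."""
    term_counts = Counter(doc)
    return {term: count * idf(term) for term, count in term_counts.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

A word like "the", which appears in every document, ends up with the lowest weight, while a word that is specific to one document is weighted up. Raw counts would treat the two the same.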

__Create vectorizer object__

In [None]:
vectorizer = TfidfVectorizer(ngram_range = (1,2),     # unigrams and bigrams (1 word and 2 word units)
                             lowercase =  True,       # why use lowercase?
                             max_df = 0.95,           # ignore terms found in more than 95% of documents
                             min_df = 0.05,           # ignore terms found in fewer than 5% of documents
                             max_features = 100)      # keep only the top 100 features

This vectorizer is then used to turn all of our documents into a vector of numbers, instead of text.

In [None]:
# First we do it for our training data...
X_train_feats = vectorizer.fit_transform(X_train)
#... then we do it for our test data
X_test_feats = vectorizer.transform(X_test)
# We can also create an array of the feature names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = vectorizer.get_feature_names_out()

<br>
Q: What are the first 20 features that are picked out by the TfidfVectorizer?

## Classifying and predicting

We now have to 'fit' the classifier to our data. This means that the classifier takes our data and finds correlations between features and labels.

These correlations are then the *model* that the classifier learns about our data. This model can then be used to predict the label for new, unseen data.
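
For logistic regression, the learned model boils down to one weight per feature plus an intercept. A prediction is the weighted sum of the feature values, squashed into a probability by the logistic (sigmoid) function. The sketch below shows that calculation in plain Python; the feature names, weights and document values are hypothetical and not taken from our data.

```python
import math

# Hypothetical learned parameters: one weight per feature, plus an intercept
weights = {"breaking": 1.2, "said": -0.8, "share": 0.5}
intercept = -0.1

def predict_proba(features):
    """Probability of the positive class for one document's feature values."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))          # logistic (sigmoid) function

# Hypothetical tf-idf values for a single document
doc = {"breaking": 0.6, "said": 0.1, "share": 0.0}
probability = predict_proba(doc)
print(round(probability, 3))
```

A probability above 0.5 means the positive label is predicted. Features with large positive or negative weights are the ones that move the prediction the most.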

In [None]:
classifier = LogisticRegression(random_state=42).fit(X_train_feats, y_train)

Q: How do we use the classifier to make predictions?

In [None]:
y_pred = classifier.predict(X_test_feats)

Q: What are the predictions for the first 20 examples of the test data?

In [None]:
print(y_pred[0:20])

We can also inspect the model, in order to see which features are most informative when trying to predict a label. 

To do this, we can use the ```show_features``` function that I defined earlier - how convenient!

Q: What are the most informative features? Use ```show_features``` to find out!

In [None]:
clf.show_features(vectorizer, y_train, classifier, n=20)

## Evaluate

The computer has now learned a model of how our data behaves. Well done, computer! But is it accurate?

Q: How do we measure accuracy?

<img src="../img/confusionMatrix.jpg">

Thankfully, libraries like ```sklearn``` come with a range of tools that are useful for evaluating models.

One way to do this is to use a confusion matrix, similar to the one you see above.

Q: What should go in the argument called ```labels```?

In [None]:
clf.plot_cm(y_test, y_pred, normalized=False)

We can also do some quick calculations, in order to assess just how well our model performs.

In [None]:
classifier_metrics = metrics.classification_report(y_test, y_pred)
print(classifier_metrics)
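
The figures in a classification report all come from the four cells of the confusion matrix. Here is a small sketch that works them out by hand for one label. The toy labels and predictions are invented for illustration, and treating `FAKE` as the "positive" class is an arbitrary choice.

```python
# Toy true labels and predictions, invented for illustration
y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL", "FAKE"]
y_hat  = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE", "REAL", "FAKE"]
positive = "FAKE"

pairs = list(zip(y_true, y_hat))
tp = sum(t == positive and p == positive for t, p in pairs)   # true positives
fp = sum(t != positive and p == positive for t, p in pairs)   # false positives
fn = sum(t == positive and p != positive for t, p in pairs)   # false negatives
tn = sum(t != positive and p != positive for t, p in pairs)   # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)      # of everything predicted FAKE, how much really was?
recall = tp / (tp + fn)         # of everything really FAKE, how much did we find?
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1)
```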

## Cross validation and further evaluation

One thing we can't rule out is that our model's performance is simply an artefact of how the train-test split happened to be made.

To mitigate this, we can perform cross-validation: we test a number of different train-test splits and average the scores.
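
As a minimal sketch of that idea, the snippet below scores a logistic regression on ten random 80/20 splits and averages the results. It runs on synthetic data from `make_classification` so that it is self-contained; the `_toy` variable names and the dataset are assumptions for this example, not part of our fake news data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic two-class data, standing in for our vectorized texts
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# Ten independent random 80/20 train-test splits
cv_toy = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# One accuracy score per split
scores = cross_val_score(LogisticRegression(max_iter=1000, random_state=42),
                         X_toy, y_toy, cv=cv_toy)

print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

If the scores vary a lot from split to split, a single train-test split would have given us a misleading picture.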

Let's do this on the full dataset.

In [None]:
X_vect = vectorizer.fit_transform(texts)

The first plot is probably the most interesting. Some terminology:

- If the two curves are close to each other but both have a low score, the model is underfitting (high bias).

- If there is a large gap between the two curves, the model is overfitting (high variance).
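
The two curves can also be inspected as numbers. The sketch below uses `sklearn`'s `learning_curve` function on synthetic data to get the mean training and validation score at each training size, along with the gap between them. The dataset and the `_toy` names are assumptions for this example, and this is not necessarily what `clf.plot_learning_curve` does internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic two-class data, standing in for our vectorized texts
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)
cv_toy = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000, random_state=42),
    X_toy, y_toy, cv=cv_toy,
    train_sizes=np.linspace(0.2, 1.0, 5),   # 20% to 100% of the training portion
)

train_mean = train_scores.mean(axis=1)      # average over the 5 splits
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean                 # large gap suggests overfitting

for size, tr, va, g in zip(train_sizes, train_mean, val_mean, gap):
    print(f"n={size:4d}  train={tr:.3f}  validation={va:.3f}  gap={g:.3f}")
```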


In [None]:
title = "Learning Curves (Logistic Regression)"
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = LogisticRegression(random_state=42)
clf.plot_learning_curve(estimator, title, X_vect, labels, cv=cv, n_jobs=4)

- The second plot shows the time required to train the model with various sizes of training dataset.
- The third plot shows how much time was required to train the models for each training size.