# Simple text classification using ```scikit-learn```

## Import packages


In [1]:
# system tools
import os
import sys
sys.path.append("..")

# data munging tools
import pandas as pd
import utils.classifier_utils as clf

# Machine learning stuff
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn import metrics

# Visualisation
import matplotlib.pyplot as plt

## Reading in the data

Our data is already in a tabular format, so we're going to load it using ```pandas```.

test data = fake or real news\
text data =

In [9]:

filename = os.path.join("..", "..", "431868", "classification_data", "fake_or_real_news.csv")

data = pd.read_csv(filename, index_col=0)

__Inspect data__

In [10]:
data.sample(10)

Unnamed: 0,title,text,label
9076,Sean Hannity SHREDS FBI Director James Comey f...,Sean Hannity SHREDS FBI Director James Comey f...,FAKE
1467,Super PACs Escalate Air War Ahead of Iowa Cauc...,A new set of super PAC advertisements released...,REAL
2424,Future Obamacare Costs Keep Falling,Nearly five years after President Barack Obama...,REAL
6917,7 Ways To Prepare For An Economic Crisis,"Bill White November 7, 2016 7 Ways To Prepare ...",FAKE
3085,Is Facebook to blame for making us more polari...,Critics have worried that the algorithm Facebo...,REAL
1041,Why the death of GOP 'loyalty pledge' matters,"Donald Trump, Ted Cruz, and John Kasich have a...",REAL
7068,"If You Live HERE, Forget Christmas Lights – Th...",0 comments \nPerhaps no country has been more ...,FAKE
9104,Trump And His Supporters Are Fighting A Rigged...,Trump And His Supporters Are Fighting A Rigged...,FAKE
8996,"US, Japan Push to Fortify Alliances Amid Threa...",Get short URL 0 0 0 0 US Deputy Secretary of D...,FAKE
2245,"After Kim Davis is jailed, marriage license is...",(CNN) With the clerk who had refused them in j...,REAL


In [11]:
data.shape

(6335, 3)

<br>
Q: How many examples of each label do we have?

In [12]:
data["label"].value_counts()

REAL    3171
FAKE    3164
Name: label, dtype: int64
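The counts above show two almost equally sized classes (3171 of 6335 rows are REAL, roughly 50%). Proportions are often easier to read than raw counts; the sketch below uses a small made-up Series as a stand-in for `data["label"]`, so the numbers it prints are not from the real dataset.

```python
import pandas as pd

# toy stand-in for data["label"]; the values are made up for illustration
labels = pd.Series(["REAL", "FAKE", "REAL", "FAKE", "REAL"])

# normalize=True returns the share of each label instead of the raw count
proportions = labels.value_counts(normalize=True)
print(proportions)
```

On the real dataframe the same call would be `data["label"].value_counts(normalize=True)`.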

<br>

Let's now create new variables called ```X``` and ```y```, taking the data out of the dataframe so that we can mess around with them.

In [13]:
X = data["text"]
y = data["label"]

## Train-test split

I've included most of the 'hard work' for you here already, because these are long cells which might be easy to mess up while live-coding.

Instead, we'll discuss what's happening. If you have questions, don't be shy!
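*Without a fixed `random_state`, every time we run this function we would get a different split.*\
The first value of `X_train` corresponds to the first value of `y_train`, so texts and labels stay aligned.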

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X,           # texts for the model
                                                    y,          # classification labels
                                                    test_size=0.2,   # create an 80/20 split
                                                    random_state=42) # random state for reproducibility
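To see what the split does, here is a minimal sketch on made-up data. The names `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`, and the `stratify` argument is an optional extra that the cell above does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-ins for X and y; the texts and labels are made up
X_toy = pd.Series([f"document {i}" for i in range(10)])
y_toy = pd.Series(["FAKE", "REAL"] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,      # 2 of the 10 rows go to the test set
    random_state=42,    # same seed, same split on every run
    stratify=y_toy,     # optional: keep the label ratio equal in both parts
)

# texts and labels are shuffled together, so their indices still line up
print(list(X_tr.index) == list(y_tr.index))
print(len(X_tr), len(X_te))
```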

## Vectorizing and Feature Extraction

Vectorization. What is it and why are all the cool kids talking about it?

Essentially, vectorization is the process whereby textual or visual data is 'transformed' into some kind of numerical representation. One of the easiest ways to do this is to simply count how often individual features appear in a document.

Take the following text: 
<br><br>
<i>My father’s family name being Pirrip, and my Christian name Philip, my infant tongue could make of both names nothing longer or more explicit than Pip. So, I called myself Pip, and came to be called Pip.</i>
<br>

We can convert this into the following vector

| and | be | being | both | called | came | christian | could | explicit | family | father | i | infant | longer | make | more | my | myself | name | names | nothing | of | or | philip | pip | pirrip | s | so | than | to | tongue|
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |  --- |
| 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 |

<br>
Our textual data is hence reduced to a jumbled-up 'vector' of numbers, known somewhat quaintly as a <i>bag-of-words</i>.
<br>
<br>
To do this in practice, we first need to create a vectorizer. 

TF-IDF vectors tend to be better for training classifiers. Why might that be?

### Create vectorizer object
**Defining a range of vectorizers that we want to work with**

- The larger the n-grams, the more computational power it takes. We could use trigrams or four-grams instead of only unigrams.
- Removing common/frequent words (a stopword list): *they can't tell us anything and give no strong prediction*
- Removing rare words: *words that occur only once might be spelling mistakes or very rare concepts, just extreme outliers*
- Removing the top and bottom 5%
- If a rare word only ever appears with one label, the computer thinks it can really trust that word, because in the training data it is true 100% of the time
- Task-specific: with cultural data objects, ask what we are doing, why we are doing it, and what is necessary for this specific task

In [15]:
vectorizer = CountVectorizer(ngram_range=(1, 2),   # use unigrams and bigrams (1-word and 2-word units)
                             lowercase=True,       # why use lowercase?
                             max_df=0.95,          # drop terms that appear in more than 95% of documents
                             min_df=0.05,          # drop terms that appear in fewer than 5% of documents
                             max_features=100)     # keep only the 100 most frequent features, giving 100-dimensional vectors

This vectorizer is then used to turn all of our documents into vectors of numbers, instead of text.

- First, fit the vectorizer to the training data and transform it into unigram and bigram counts.
- Then use the same fitted vectorizer to transform the test data, so that the vocabulary the model learned from is the same one we use for the test data.
- Do NOT fit the vectorizer to the test data.

In [16]:
# first we fit to the training data... 
X_train_feats = vectorizer.fit_transform(X_train)

#... then do it for our test data
X_test_feats = vectorizer.transform(X_test)
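The difference between `fit_transform` and `transform` is easiest to see on a tiny example. In this sketch with made-up texts, the test sentence contains a word the vectorizer never saw during fitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["good movie", "bad movie"]   # made-up training texts
test_texts = ["good film"]                  # "film" never occurs in training

toy_vectorizer = CountVectorizer()
train_feats = toy_vectorizer.fit_transform(train_texts)  # learns: bad, good, movie
test_feats = toy_vectorizer.transform(test_texts)        # reuses that vocabulary

# the unseen word "film" is silently dropped; the columns stay the same
print(test_feats.toarray())
```

Both matrices have the same three columns, which is what the classifier needs at prediction time.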


In [None]:
# get feature names
feature_names = vectorizer.get_feature_names_out()

In [17]:
feature_names

array(['about', 'after', 'all', 'also', 'an', 'and', 'and the', 'are',
       'as', 'at', 'at the', 'be', 'because', 'been', 'but', 'by',
       'campaign', 'can', 'clinton', 'could', 'do', 'even', 'first',
       'for', 'for the', 'from', 'had', 'has', 'have', 'he', 'her',
       'hillary', 'him', 'his', 'how', 'if', 'in', 'in the', 'into', 'is',
       'it', 'its', 'just', 'like', 'many', 'more', 'most', 'new', 'no',
       'not', 'now', 'obama', 'of', 'of the', 'on', 'on the', 'one',
       'only', 'or', 'other', 'our', 'out', 'over', 'party', 'people',
       'president', 'republican', 'said', 'she', 'so', 'some', 'state',
       'states', 'than', 'that', 'that the', 'their', 'them', 'there',
       'they', 'this', 'time', 'to be', 'to the', 'trump', 'two', 'up',
       'us', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will',
       'with', 'with the', 'would', 'you'], dtype=object)

## Classifying and predicting

We now have to 'fit' the classifier to our data.

This means that the classifier takes our data and finds correlations between features and labels.

These correlations are then the *model* that the classifier learns about our data. This model can then be used to predict the label for new, unseen data.

In [18]:
# here are my inputs and outputs: fit the model to the training dataset, and by that train it
classifier = LogisticRegression(random_state=42).fit(X_train_feats, y_train)

Q: How do we use the classifier to make predictions?\
 By training it on the training dataset, and then using the fitted model to make predictions about new, unseen text.

In [19]:
y_pred = classifier.predict(X_test_feats)

Q: What are the predictions for the first 20 examples of the test data?

In [20]:
print(y_pred[:20])

['FAKE' 'FAKE' 'FAKE' 'FAKE' 'FAKE' 'FAKE' 'REAL' 'FAKE' 'REAL' 'FAKE'
 'FAKE' 'REAL' 'REAL' 'FAKE' 'FAKE' 'FAKE' 'FAKE' 'REAL' 'REAL' 'REAL']
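Besides hard labels, a fitted `LogisticRegression` can also report how confident it is through `predict_proba`. The sketch below trains on four made-up sentences, so the probabilities say nothing about the real news data; `vec` and `model` are names used only in this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# made-up training data for illustration only
texts = ["shocking secret exposed", "shocking truth exposed",
         "officials said today", "senator said today"]
labels = ["FAKE", "FAKE", "REAL", "REAL"]

vec = CountVectorizer()
model = LogisticRegression(random_state=42).fit(vec.fit_transform(texts), labels)

new_feats = vec.transform(["shocking secret exposed"])
probs = model.predict_proba(new_feats)

print(model.classes_)   # column order of the probabilities
print(probs)            # one row per document, one column per class
```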


We can also inspect the model, in order to see which features are most informative when trying to predict a label. 

To do this, we can use the ```show_features``` function that I defined earlier - how convenient!

Q: What are the most informative features? Use ```show_features``` to find out!

In [21]:
# even though we took away the most common words, there are still quite a lot of grammatical words
# still don't know how well the model is actually working
clf.show_features(vectorizer, y_train, classifier, n=20)

FAKE				REAL

-0.2027	just           		0.3138	but            
-0.1674	by             		0.2158	said           
-0.1255	that the       		0.1835	state          
-0.1192	us             		0.1717	than           
-0.1078	be             		0.1492	who            
-0.0968	this           		0.1446	most           
-0.0906	with           		0.1258	obama          
-0.0878	had            		0.1145	other          
-0.0820	you            		0.1073	more           
-0.0690	so             		0.1019	up             
-0.0670	to the         		0.0988	on the         
-0.0668	all            		0.0953	also           
-0.0626	is             		0.0834	president      
-0.0616	of the         		0.0723	one            
-0.0614	into           		0.0693	she            
-0.0612	there          		0.0693	two            
-0.0560	was            		0.0652	that           
-0.0550	like           		0.0647	out            
-0.0548	now            		0.0544	to be          


## Evaluate

We can also do some quick calculations, in order to assess just how well our model performs.

In [None]:
metrics.ConfusionMatrixDisplay.from_estimator(classifier,           # the classifier name
                                            X_train_feats,          # the training features
                                            y_train,                # the training labels
                                            cmap=plt.cm.Blues,      # make the colours prettier
                                            labels=["FAKE", "REAL"])# the labels in your data arranged alphabetically

This confusion matrix can be broken down a little bit more and used to draw more meaningful statistical results:

<img src="../img/confusionMatrix.jpg" alt="Alternative text" />

__Calculating metrics__

```scikit-learn``` has a built-in set of tools which can be used to calculate these metrics, to get a better idea of how our model is performing.

In [None]:
classifier_metrics = metrics.classification_report(y_test, y_pred)
print(classifier_metrics)
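To make the report easier to read, here is a sketch on eight made-up labels and predictions (not results from the model above). It shows the raw confusion matrix next to the same report returned as a dictionary; `y_true` and `y_hat` are names used only in this example.

```python
from sklearn import metrics

# made-up true labels and predictions for illustration only
y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL", "FAKE"]
y_hat = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE", "REAL", "FAKE"]

# rows are true labels, columns are predicted labels, in the order given
cm = metrics.confusion_matrix(y_true, y_hat, labels=["FAKE", "REAL"])
print(cm)

# output_dict=True returns the report as nested dictionaries instead of text
report = metrics.classification_report(y_true, y_hat, output_dict=True)
print(report["FAKE"]["precision"], report["accuracy"])
```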

## Cross validation and further evaluation

One thing we can't be sure of is whether our model's performance simply reflects how the train-test split happened to be made.

To try to mitigate this, we can perform cross-validation: we test a number of different train-test splits and average the scores.

Let's do this on the full dataset:

In [None]:
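# note: this refits `vectorizer` on the full dataset, replacing the vocabulary learned from X_train
X_vect = vectorizer.fit_transform(X)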
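An alternative, sketched here as an assumption and not as the approach used in this notebook, is to wrap the vectorizer and classifier in a scikit-learn `Pipeline` and pass raw texts to `cross_val_score`. The vectorizer is then refitted inside every split. The corpus below is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline

# made-up corpus: 20 "FAKE"-style and 20 "REAL"-style documents
texts = ([f"shocking secret exposed story{i}" for i in range(20)]
         + [f"officials said today report{i}" for i in range(20)])
labels = ["FAKE"] * 20 + ["REAL"] * 20

# the pipeline refits the vectorizer on the training part of every split
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(random_state=42))

cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv)  # one accuracy per split
print(scores.mean(), scores.std())
```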

The first plot is probably the most interesting. Some terminology:

- If the two curves are "close to each other" and both have a low score, the model suffers from an underfitting problem (High Bias)

- If there is a large gap between the two curves, the model suffers from an overfitting problem (High Variance)


In [None]:
title = "Learning Curves (Logistic Regression)"
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = LogisticRegression(random_state=42)
clf.plot_learning_curve(estimator, title, X_vect, y, cv=cv, n_jobs=4)

- The second plot shows how model performance scales when more data is added;
- The third plot shows how much of a performance improvement we get from adding more data
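`clf.plot_learning_curve` is a course helper whose source is not shown here. The numbers behind this kind of plot can be obtained from scikit-learn's own `learning_curve` function; the sketch below does so on a made-up corpus and prints the mean scores instead of plotting them.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# made-up corpus: 20 "FAKE"-style and 20 "REAL"-style documents
texts = ([f"shocking secret exposed story{i}" for i in range(20)]
         + [f"officials said today report{i}" for i in range(20)])
labels = np.array(["FAKE"] * 20 + ["REAL"] * 20)
X_toy = CountVectorizer().fit_transform(texts)

cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(random_state=42), X_toy, labels,
    cv=cv, train_sizes=[0.5, 0.75, 1.0],
)

# one row per training-set size, one column per split
for size, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(size, round(tr, 3), round(va, 3))
```

A large gap between the training and validation columns would point towards overfitting; two low, similar values would point towards underfitting.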

## Save models

It is also somewhat trivial to save models and reload them for later use. For that, we can use the library ```joblib```.

In [None]:
from joblib import dump, load
dump(classifier, "LR_classifier.joblib")
dump(vectorizer, "tfidf_vectorizer.joblib")  # despite the file name, this is the CountVectorizer defined above

We can restart the kernel for our notebook to see how that works:

In [None]:
from joblib import dump, load
loaded_clf = load("LR_classifier.joblib")
loaded_vect = load("tfidf_vectorizer.joblib")

In [None]:
sentence = "Hillary Clinton is a crook who eats babies!"

In [None]:
test_sentence = loaded_vect.transform([sentence])
loaded_clf.predict(test_sentence)
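Saving the vectorizer and classifier as two files works, but they have to be kept in sync. A possible alternative, shown here only as a sketch on made-up data, is to save a single `Pipeline` that holds both. The file name `toy_pipeline.joblib` is hypothetical, and the file is written to a temporary folder.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# made-up training data for illustration only
texts = ["shocking secret exposed", "shocking truth exposed",
         "officials said today", "senator said today"]
labels = ["FAKE", "FAKE", "REAL", "REAL"]

pipeline = make_pipeline(CountVectorizer(), LogisticRegression(random_state=42))
pipeline.fit(texts, labels)

# write to a temporary folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")  # hypothetical file name
    dump(pipeline, path)
    restored = load(path)

# the restored pipeline takes raw text: vectorizing happens inside it
prediction = restored.predict(["shocking secret exposed"])
print(prediction)
```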

## Appendix - Interpreting a confusion matrix

Imagine that we are testing a classifier to see how well it can predict if someone has COVID:


```Accuracy => (TP+TN)/(TP+FP+FN+TN)```
- Ratio of correct classifications across all of the patients

```True Positive Rate => Recall => Sensitivity => TP / (TP + FN)```
- The proportion of the positive class who were correctly classified
    - I.e. sick people correctly identified as being sick

```Precision => TP / (TP + FP)```
- The ratio of true positives to everyone predicted as positive
    - I.e. the proportion we identify as having COVID who actually do have it

```True negative rate => Specificity => TN / (TN + FP)```
- The proportion of the negative class who were correctly classified
    - I.e. healthy people who were correctly identified as being healthy

The following can also be calculated but are not featured on the confusion matrix above:

```False negative rate => FN / (TP + FN)```
- Proportion of the positive class who were incorrectly classified by the classifier
  - I.e. people predicted as healthy who are actually sick

```False positive rate => FP / (TN + FP) = 1 - Specificity```
- Proportion of the negative class who were incorrectly classified by the classifier
  - I.e. people predicted as sick who are actually healthy

```F1 => 2 * (P * R) / (P + R)```
- Harmonic mean of precision and recall, useful where both precision and recall are important
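The formulas above can be checked by hand. The sketch below uses ten made-up test results (1 = has COVID, 0 = healthy), computes each metric from the four confusion-matrix counts, and prints them; `y_true` and `y_hat` are names used only in this example.

```python
from sklearn import metrics

# made-up test results: 1 = has COVID, 0 = healthy
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# for binary 0/1 labels, ravel() yields the counts in this order
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)          # sensitivity, true positive rate
precision = tp / (tp + fp)
specificity = tn / (tn + fp)     # true negative rate
f1 = 2 * (precision * recall) / (precision + recall)

print(accuracy, recall, precision, specificity, f1)
```

The same values come out of `metrics.accuracy_score`, `metrics.recall_score`, `metrics.precision_score` and `metrics.f1_score` when called on `y_true` and `y_hat`.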