In [None]:
# https://www.learndatasci.com/tutorials/predicting-reddit-news-sentiment-naive-bayes-text-classifiers/

In [None]:
import numpy as np
import pandas as pd
import math
import random
from collections import defaultdict

# Prevent future/deprecation warnings from showing in output
import warnings
warnings.filterwarnings(action='ignore')





These are basic imports used across the entire notebook, and are usually imported in every data science project. The more specific imports from sklearn and other libraries will be brought up when we use them.

## Loading the Dataset

First let's load the dataset that we created in the last article:

In [3]:
df = pd.read_csv('reddit_headlines_labels.csv', encoding='utf-8')
df.head()

Unnamed: 0,headline,label,Unnamed: 2
0,GOP voters more likely to choose candidates wh...,0,
1,U.S. Rep. John Yarmuth says Louisville Kroger ...,-1,
2,Trump Comes Face to Face With His Nightmare: A...,0,
3,Trump ‘Most Consequential’ President Since Lin...,0,
4,'We're heading north!' Migrants nix offer to s...,0,


In [4]:
df.dtypes

headline       object
label          object
Unnamed: 2    float64
dtype: object

In [5]:
df["label"].value_counts()

-1                                             12515
0                                              11152
1                                               4743
label                                             30
 Beto O'Rourke calls for Americans to unify        1
 draws link to stickers on suspect's van           1
 House a ‘complete dogfight’                       1
 Trump Blames Lack of Guns                         1
Name: label, dtype: int64

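Now that we have the dataset in a dataframe, let's remove the neutral (0) headlines so we can focus on classifying only positive or negative. The `label` column was read as strings (dtype `object`) because a few malformed rows leaked headline text into it, so we compare against the strings `'-1'` and `'1'`. This also drops the malformed rows: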

In [6]:
dfNeg = df.loc[lambda d: d['label'] == '-1']

In [7]:
dfPos = df.loc[lambda d: d['label'] == '1']

In [8]:
df = pd.concat([dfNeg, dfPos], axis=0)

In [9]:
df["label"].value_counts()

-1    12515
1      4743
Name: label, dtype: int64

Our dataframe now only contains positive and negative examples, and we can see that negatives outnumber positives by well over two to one (12,515 vs. 4,743).
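As a side note, the same filter can be written in a single step with `isin`. The sketch below uses a small made-up frame (`toy` is hypothetical, not the real dataset) with string labels and one malformed row, mirroring what `value_counts()` showed above. Unlike the `concat` approach, it keeps the rows in their original order.

```python
import pandas as pd

# Toy frame mimicking the label column: strings, with one malformed row.
toy = pd.DataFrame({
    "headline": ["a", "b", "c", "d", "e"],
    "label": ["-1", "0", "1", "label", "-1"],
})

# Keep only rows whose label is exactly '-1' or '1'.
# Neutral ('0') and malformed ('label') rows are dropped together.
toy_binary = toy[toy["label"].isin(["-1", "1"])]

print(toy_binary["label"].value_counts())
```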

Let's move into featurization of the headlines.

## Transform lines into Features (= columns)

In order to train our classifier, we need to transform our lines of words into numbers, since machine learning algorithms operate on numeric input.

To do this transformation, we're going to use `CountVectorizer` from sklearn. This is a very straightforward class for converting words into features.

Unlike in the last tutorial where we manually tokenized and lowercased the text, `CountVectorizer` will handle this step for us. All we need to do is pass it the headlines.

Let's work with a tiny example to show how vectorizing words into numbers works:

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

s1 = "Senate panel moving ahead, with caution, with Mueller bill despite McConnell opposition"
s2 = "Bill protecting Robert Mueller to get vote despite McConnell opposition"

vect = CountVectorizer(binary=False)
X = vect.fit_transform([s1, s2])
# fit_transform learns the vocabulary and then vectorizes ==> use on the TRAIN set
# transform only counts words already in the learned vocabulary ==> use on the TEST set

X.toarray()

array([[1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 2],
       [0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0]], dtype=int64)

What we've done here is take two headlines about a similar topic and vectorized them.

`vect` is set up with default params to tokenize and lowercase words. We have also set `binary=False` (the default), so each entry is the number of times a word occurs in that sentence: 0 if it doesn't appear, 1 if it appears once, 2 if it appears twice, and so on. With `binary=True`, every nonzero count would be clipped to 1.

`vect` builds a vocabulary from all the words it sees in all the text you give it, then records how many times each vocabulary word occurs in the current sentence. To see this more clearly, let's check out the feature names mapped to the first sentence:

In [11]:
# get_feature_names() was removed in scikit-learn 1.2; older versions can still call it
list(zip(X.toarray()[0], vect.get_feature_names_out()))

[(1, 'ahead'),
 (1, 'bill'),
 (1, 'caution'),
 (1, 'despite'),
 (0, 'get'),
 (1, 'mcconnell'),
 (1, 'moving'),
 (1, 'mueller'),
 (1, 'opposition'),
 (1, 'panel'),
 (0, 'protecting'),
 (0, 'robert'),
 (1, 'senate'),
 (0, 'to'),
 (0, 'vote'),
 (2, 'with')]

This is the vectorization mapping of the first sentence. You can see that there's a 1 mapped to 'ahead' because 'ahead' shows up once in `s1`, and a 2 mapped to 'with' because it appears twice. But if we look at `s2`:

In [12]:
# get_feature_names() was removed in scikit-learn 1.2; older versions can still call it
list(zip(X.toarray()[1], vect.get_feature_names_out()))

[(0, 'ahead'),
 (1, 'bill'),
 (0, 'caution'),
 (1, 'despite'),
 (1, 'get'),
 (1, 'mcconnell'),
 (0, 'moving'),
 (1, 'mueller'),
 (1, 'opposition'),
 (0, 'panel'),
 (1, 'protecting'),
 (1, 'robert'),
 (0, 'senate'),
 (1, 'to'),
 (1, 'vote'),
 (0, 'with')]

There's a 0 at 'ahead' since that word doesn't show up in `s2`. But notice that each row has a column for **every** word in the vocabulary, whether or not that word occurs in the sentence.
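To make the effect of the `binary` flag concrete, here is a minimal sketch that vectorizes `s1` twice, once with counts and once with presence/absence. The variable names `count_vect` and `binary_vect` are introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

s1 = "Senate panel moving ahead, with caution, with Mueller bill despite McConnell opposition"

count_vect = CountVectorizer(binary=False)   # raw term counts
binary_vect = CountVectorizer(binary=True)   # 1 if present, else 0

counts_row = count_vect.fit_transform([s1]).toarray()[0]
binary_row = binary_vect.fit_transform([s1]).toarray()[0]

# vocabulary_ maps each token to its column index in the matrix.
with_col = count_vect.vocabulary_["with"]
print(counts_row[with_col])                          # 2: 'with' occurs twice
print(binary_row[binary_vect.vocabulary_["with"]])   # 1: clipped to presence
```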

### Preparing for Training

Before training, and even vectorizing, let's split our data into training and testing sets. It's important to do this before doing anything with the data so we have a fresh test set.

In [15]:
from sklearn.model_selection import train_test_split

X = df["headline"]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

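Our test size is 0.2, or 20%. This means that `X_test` and `y_test` contain 20% of our data, which we reserve for testing. Because no `random_state` is passed, the split (and the scores that follow) will differ slightly from run to run.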
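If you want a repeatable split that also preserves the class ratio in both halves, `train_test_split` accepts `random_state` and `stratify`. The sketch below is one possible variation, not what the notebook itself ran; the toy data and the seed value 42 are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a similar kind of imbalance: 70 negatives, 30 positives.
toy_X = pd.Series([f"headline {i}" for i in range(100)])
toy_y = pd.Series(["-1"] * 70 + ["1"] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.2,
    random_state=42,   # arbitrary seed, only to make the split repeatable
    stratify=toy_y,    # keep the class ratio the same in train and test
)

print(y_te.value_counts())
```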

Let's now fit the vectorizer on the training set only and perform the vectorization. 

Just to reiterate, it's important to not fit the vectorizer on all of the data since we want a clean test set for evaluating performance. Fitting the vectorizer on everything would result in *data leakage*, causing unreliable results since the vectorizer shouldn't know about future data.
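A tiny sketch of what "fit on train only" means in practice: words that appear only in the test text are simply ignored by `transform`, and the test matrix keeps the column layout learned from the training text. The example sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["senate passes bill", "house rejects bill"]
test_texts = ["senate debates unprecedented bill"]

toy_vect = CountVectorizer()
train_matrix = toy_vect.fit_transform(train_texts)  # learns the vocabulary
test_matrix = toy_vect.transform(test_texts)        # reuses that vocabulary

print(sorted(toy_vect.vocabulary_))
print(train_matrix.shape, test_matrix.shape)

# 'debates' and 'unprecedented' were never seen during fit, so they get no column.
print("debates" in toy_vect.vocabulary_)
```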

We can fit the vectorizer and transform `X_train` in one step:

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

# Capping the vocabulary with max_features keeps memory use and training
# time manageable on modest hardware
vect = CountVectorizer(max_features=1000, binary=False)

# Note again: fit_transform for the TRAIN set, transform for the TEST set

X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)

`X_train_vect` is now transformed into the right format to give to the Naive Bayes model, but let's first look into balancing the data.


### Balancing the data

It seems that there may be a lot more negative headlines than positive headlines (hmm), and so we have a lot more negative labels than positive labels.

In [None]:
counts = df["label"].value_counts()
print(counts)

# Labels are strings, so index by the label '-1' rather than by position
print("\nPredicting only -1 = {:.2f}% accuracy".format(counts['-1'] / counts.sum() * 100))

We can see from above that we have considerably more negatives than positives, making our dataset imbalanced.

If our model only ever predicted -1, the larger class, it would already get ~72.5% accuracy (12,515 / 17,258). Any classifier we train needs to beat this baseline to be useful.
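The majority-class baseline can also be reproduced with scikit-learn's `DummyClassifier`. This is a minimal sketch that rebuilds the label vector from the class counts printed earlier; the placeholder feature matrix `X_dummy` is an assumption made only so the estimator has something to fit, since this strategy ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the class counts shown earlier: 12515 negatives, 4743 positives.
y_all = np.array(["-1"] * 12515 + ["1"] * 4743)
X_dummy = np.zeros((len(y_all), 1))  # placeholder; features are ignored

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_all)

baseline_acc = baseline.score(X_dummy, y_all)
print("Baseline accuracy: {:.2f}%".format(baseline_acc * 100))
```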


## Naive Bayes

For our first algorithm, we're going to use the extremely fast and versatile Naive Bayes model.

Let's instantiate one from sklearn and fit it to our training data:

In [17]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()

model.fit(X_train_vect, y_train)

model.score(X_train_vect, y_train)

0.9245255685933652

Naive Bayes has been fit to our training data and is ready to make predictions. You'll notice that we have a score of ~92%. This is the accuracy on the training data itself, which tends to be optimistic, and not an estimate of accuracy on unseen headlines. You'll see next that we need to use our test set in order to get a good estimate of accuracy.

We already vectorized the test set above with `transform` (not `fit_transform`), so nothing from the test set leaked into the vocabulary. Let's use it to predict whether each test headline is positive or negative. We also won't be using SMOTE to oversample, so the class imbalance is left as it is.

In [18]:
y_pred = model.predict(X_test_vect)
# Column 1 is the probability of the second class in model.classes_, here '1'
y_prob = model.predict_proba(X_test_vect)[:, 1]

y_pred

array(['-1', '-1', '-1', ..., '-1', '-1', '-1'], dtype='<U2')
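The `[:, 1]` slice relies on the column order of `predict_proba`, which follows `model.classes_`. With the string labels used here, `'-1'` sorts before `'1'`, so column 1 is the positive class. A small sketch with made-up counts (`X_toy` and `y_toy` are hypothetical) shows how to look the column up explicitly instead of hard-coding it:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny count matrix: 4 documents, 2 vocabulary words (made-up numbers).
X_toy = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
y_toy = np.array(["-1", "-1", "1", "1"])

toy_model = MultinomialNB().fit(X_toy, y_toy)

print(toy_model.classes_)  # column order used by predict_proba

proba = toy_model.predict_proba(X_toy)
pos_col = list(toy_model.classes_).index("1")  # locate the positive class
print(proba[:, pos_col])   # probability that each document is positive
```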

`y_pred` now contains a prediction for every row of the test set. We can pass it into sklearn metrics along with the true labels to get an accuracy score, a confusion matrix, and the ROC AUC (which uses the predicted probabilities in `y_prob`):

In [19]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score

print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("AUC: \n", roc_auc_score(y_test, y_prob))

Accuracy: 91.31%
Confusion Matrix:
 [[2351  166]
 [ 134  801]]
AUC: 
 0.9702540372525649
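The confusion matrix above is enough to recover accuracy and to derive precision, recall, and F1 for the positive class by hand. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predictions, both in sorted label order, here `'-1'` then `'1'`) and treats `'1'` as the positive class.

```python
import numpy as np

# Confusion matrix printed above; rows are true labels, columns are
# predictions, both ordered as ['-1', '1'].
cm = np.array([[2351, 166],
               [134, 801]])

tn, fp = cm[0]   # true '-1': predicted '-1', predicted '1'
fn, tp = cm[1]   # true '1':  predicted '-1', predicted '1'

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of headlines predicted positive, how many are
recall = tp / (tp + fn)      # of truly positive headlines, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

The accuracy computed this way matches the 91.31% reported by `accuracy_score`.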


### Other Classification Algorithms in scikit-learn

As you can see, Naive Bayes performed pretty well, so let's experiment with other classifiers.

We'll reuse the same train/test split as before, but now we'll fit and evaluate several types of models in a loop:

In [None]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
#from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

models = [
    BernoulliNB(),
    LogisticRegression(),
    RandomForestClassifier()
]



# Init a dictionary for storing results of each run for each model
results = {
    model.__class__.__name__: {
        'accuracy': [], 
        'confusion_matrix': [],
        'auc': []
    } for model in models
}


X_train_vect = vect.fit_transform(X_train)
X_test_vect = vect.transform(X_test)    

for model in models:
    model.fit(X_train_vect, y_train)
    y_pred = model.predict(X_test_vect)
    y_prob = model.predict_proba(X_test_vect)[:, 1]
        
    acc = accuracy_score(y_test, y_pred)   
    cm = confusion_matrix(y_test, y_pred)
    auc= roc_auc_score(y_test, y_prob)
        
    results[model.__class__.__name__]['accuracy'].append(acc)
    results[model.__class__.__name__]['confusion_matrix'].append(cm) 
    results[model.__class__.__name__]['auc'].append(auc)   

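We now have an accuracy score, a confusion matrix, and an AUC stored for each model. Each list holds a single entry because we used one train/test split, so the averages computed next equal the single-run values. The same code would average across folds if the loop were repeated over several splits: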

In [None]:
slashes = '-' * 30
for model, d in results.items():
    avg_acc = sum(d['accuracy']) / len(d['accuracy']) * 100
    avg_cm = sum(d['confusion_matrix']) / len(d['confusion_matrix'])
    avg_auc= sum(d['auc']) / len(d['auc']) 
    
    # Build the report line by line so the output is evenly aligned
    s = (
        f"{model}\n{slashes}\n"
        f"Accuracy: {avg_acc:.2f}%\n"
        f"Confusion Matrix:\n{avg_cm}\n"
        f"AUC: {avg_auc:.4f}\n"
    )
    print(s)
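To actually average over several folds, one option is to wrap the vectorizer and the model in a pipeline and hand it to `cross_val_score`. This is a sketch under assumptions rather than what the notebook ran: the miniature corpus, the 3-fold setup, and the seed are all made up so the example is runnable. Because the vectorizer sits inside the pipeline, it is refit on the training part of each fold, which avoids the data leakage discussed earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up miniature corpus, only to make the sketch runnable.
texts = [
    "markets crash amid fears", "scandal rocks the senate",
    "storm kills dozens", "protest turns violent",
    "factory fire injures workers", "budget talks collapse",
    "team wins championship", "economy adds new jobs",
    "volunteers rebuild school", "peace deal signed",
    "scientists celebrate breakthrough", "charity raises record funds",
]
labels = ["-1"] * 6 + ["1"] * 6

# Vectorizer + classifier as one estimator, refit inside every fold.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# Stratified folds keep the class ratio the same in each fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")

print(scores)
print("Mean accuracy: {:.2f}%".format(scores.mean() * 100))
```

On a corpus this small the scores are not meaningful; the point is only the mechanics of getting one score per fold.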