## Feature extraction

We start by cleaning the tweets we will be using to train our model:

In [1]:
import os
import re
import pandas as pd


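def clean_text(text):
    """Lowercase a tweet and strip apostrophes, links, and punctuation."""
    text = re.sub(r"'", '', text)                # drop apostrophes
    text = re.sub(r'http\S+', '', text)          # drop URLs
    text = re.sub(r'pic\.twitter\S+', '', text)  # drop picture links
    text = re.sub(r'\W+', ' ', text.lower())     # collapse non-word runs

    return text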


df = pd.read_csv(os.path.join('tweets', 'tweets.csv'),
                 low_memory=False)
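df.drop_duplicates(inplace=True)
df['tweet-clean'] = df['tweet'].apply(clean_text)

# Drop tweets that are empty after cleaning. A boolean mask is used because
# drop_duplicates() leaves gaps in the index, so positions from
# range(len(df)) would not match the labels that df.drop() expects.
df = df[df['tweet-clean'].str.strip() != '']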
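
To see what the cleaning step does, here is a small sketch that applies the same substitutions to a made-up tweet. The sample text is invented for illustration and is not taken from the data set.

```python
import re


def clean_text(text):
    # Mirrors the substitutions used in the notebook's clean_text.
    text = re.sub(r"'", '', text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'pic\.twitter\S+', '', text)
    text = re.sub(r'\W+', ' ', text.lower())
    return text


# Hypothetical tweet, for illustration only.
sample = "Don't miss this! https://t.co/abc pic.twitter.com/xyz #MAGA"
cleaned = clean_text(sample)
print(cleaned)  # dont miss this maga
```

Apostrophes are removed rather than replaced with a space, so a contraction stays a single token.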

Next, we will set two constants:

In [2]:
random_state = 0
n_jobs = -1

Note that setting ```random_state = 0``` does not seed NumPy's global random number generator. Instead, we pass this integer to each scikit-learn function or estimator that accepts a ```random_state``` argument, so that each one produces the same random sequence on every run of our code. While tuning hyperparameters, this lets us be sure that any improvement in performance is due to the changes we made and not to random variation between runs.

Scikit-learn can run some computations in parallel, and most computers today have a multi-core CPU. Setting ```n_jobs = -1``` tells scikit-learn to use all available CPU cores, which can speed up model training.
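
As a quick illustration of the reproducibility point, the sketch below splits a toy array twice with the same ```random_state``` and checks that both splits agree. The toy array and the variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)  # toy data, not the tweets

# Two independent calls with the same integer yield the same shuffle.
a_train, a_test = train_test_split(data, test_size=0.5, random_state=0)
b_train, b_test = train_test_split(data, test_size=0.5, random_state=0)

same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split)  # True
```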

Now we will compute the tfidf vectors for our tweets:

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = tfidf.fit_transform(df['tweet-clean'])

print(f'Number of documents: {X.shape[0]}')
print(f'Size of vocabulary:  {X.shape[1]}')

Number of documents: 34648
Size of vocabulary:  86092


To stay consistent throughout the notebook, we will use the following symbols:

Number of documents: n = 34 648 tweets

Size of the vocabulary: v = 86 092 terms

We set ```ngram_range=(1, 2)``` and ```min_df=2```, which means that our vocabulary consists of unigrams and bigrams that appear in at least 2 tweets. By eliminating terms with a document frequency of only 1, we reduce the number of features in our model, which helps reduce overfitting. This works well here because we have tens of thousands of tweets; it may not be appropriate for projects with fewer documents.
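
The effect of these two settings is easier to see on a tiny corpus. The following sketch uses three made-up documents (an assumption for illustration) and shows which unigrams and bigrams survive ```min_df=2```.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, for illustration only.
docs = ['medicare for all', 'medicare for everyone', 'build the wall']

toy_tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
toy_X = toy_tfidf.fit_transform(docs)

# Only terms that occur in at least two documents are kept.
vocab = sorted(toy_tfidf.vocabulary_)
print(vocab)        # ['for', 'medicare', 'medicare for']
print(toy_X.shape)  # (3, 3)
```

Every term in the third document appears only once in the corpus, so that document ends up as an all-zero row. This is one reason the setting is risky on small data sets.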

Next, we will encode our labels as integers:

In [4]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df['name'])

for i, name in enumerate(le.classes_):
    print(f'{name:<15} = {i}')

Bernie Sanders  = 0
Donald J. Trump = 1


Before training our model, we need to split our data set into a training set and a test set. The training set is used to fit the model, while the held-out test set gives an honest measure of how accurate the model is on data that it has never seen before.

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                    test_size=0.5,
                                    random_state=random_state,
                                    stratify=y)

Here we specified that we will keep 50% of the data set as our test set with ```test_size=0.5```. We also set ```stratify=y``` to ensure that our training and test sets have the same ratio of Bernie Sanders to Donald Trump tweets.
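
A minimal sketch of what ```stratify``` does, using a made-up label vector with a 30/70 class imbalance (the toy data and names are assumptions for this example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 30 examples of class 0 and 70 of class 1.
toy_y = np.array([0] * 30 + [1] * 70)
toy_X = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.5, random_state=0, stratify=toy_y)

# Each half keeps the original 30/70 class ratio.
train_counts = np.bincount(toy_y_train)
test_counts = np.bincount(toy_y_test)
print(train_counts, test_counts)  # [15 35] [15 35]
```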


## Logistic regression

The standard logistic function:

$$\phi\left(z\right)=\frac{1}{1+e^{-z}}$$

receives a value *z* and outputs a value between 0 and 1. We are working with a binary classification problem, meaning that the result must be Bernie Sanders or Donald Trump, with no other options. This is why we previously assigned each politician a value of 0 or 1. Now let’s take a look at how we can convert a tweet into a value *z*:

$$z=b_0+w_1x_1+w_2x_2+\ldots+w_vx_v$$

From this equation, we see that *z* is equal to the dot product of our tfidf vector (**x**) with the weight vector (**w**), to which a bias term (*$b_0$*) is added. Note that although Python lists begin indexing at 0, it is more common to begin indexing at 1 when describing the mathematical model and to reserve the index 0 for the bias term (which sometimes also appears as *$w_0$*). For each tweet, we can calculate a value *z*, pass it through the logistic function, and round to the nearest integer to get 0=Bernie Sanders or 1=Donald Trump. To do that, our machine learning algorithm needs to find the values of the weight vector (**w**) and the bias for which the predicted probabilities agree as closely as possible with the true politician for the tweets in the training set.
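
As a worked example of the two equations above, the sketch below computes *z* and the logistic output for a three-term vocabulary. The weights, bias, and tfidf values are made-up numbers chosen for illustration, not values from the trained model.

```python
import numpy as np


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical parameters and tfidf vector for a 3-term vocabulary.
b0 = -0.1
w = np.array([1.5, -2.0, 0.5])
x = np.array([0.2, 0.0, 0.4])

z = b0 + w @ x       # -0.1 + 0.3 + 0.0 + 0.2 = 0.4
p = logistic(z)      # probability of class 1
label = int(p >= 0.5)

print(f'z = {z:.2f}, p = {p:.4f}, label = {label}')
```

Here the probability is about 0.599, so the tweet would be assigned to class 1.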

We won’t look into the various methods of solving for **w** here. Scikit-learn implements several solvers for training a logistic regression classifier. In this app we set ```solver='saga'``` as our optimization method and ```C=20``` as the inverse of the regularization strength. Regularization reduces overfitting, and smaller values of ```C``` regularize more strongly, so ```C=20``` applies weaker regularization than scikit-learn's default of ```C=1.0```:

In [6]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression(C=20, solver='saga',
                             random_state=random_state,
                             n_jobs=n_jobs)

clf_log.fit(X_train, y_train)
log_score = clf_log.score(X_test, y_test)
print(f'Logistic Regression accuracy: {log_score:.1%}')

Logistic Regression accuracy: 95.8%


## Bernoulli naive Bayes

Naive Bayes classifiers are relatively popular for text classification tasks. We will implement a specific event model known as Bernoulli naive Bayes. The likelihood of a specific tweet having been written by each politician can be calculated as follows:

$$p\left(\mathbf{x} \mid C_k\right)=\prod_{i=1}^{v}{p_{ki}^{x_i}\left(1-p_{ki}\right)^{(1-x_i)}}$$

where *$p_{ki}$* is the probability that politician *$C_k$* will use the term *i* in a tweet.
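
To make the product concrete, here is a small sketch that evaluates the likelihood for one politician over a three-term vocabulary. The probabilities and the binary vector are made-up values for illustration.

```python
import numpy as np

# Hypothetical per-term probabilities p_ki for one politician C_k.
p_k = np.array([0.8, 0.1, 0.5])
# Binary feature vector: terms 1 and 3 are present, term 2 is absent.
x = np.array([1, 0, 1])

# Direct product from the equation: 0.8 * (1 - 0.1) * 0.5 = 0.36
likelihood = np.prod(p_k**x * (1 - p_k)**(1 - x))

# Equivalent sum of logarithms, which is more stable for large vocabularies.
log_likelihood = np.sum(x * np.log(p_k) + (1 - x) * np.log(1 - p_k))

print(f'{likelihood:.2f}')              # 0.36
print(f'{np.exp(log_likelihood):.2f}')  # 0.36
```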

In [7]:
from sklearn.naive_bayes import BernoulliNB

clf_bnb = BernoulliNB(alpha=0.01, binarize=0.09)
clf_bnb.fit(X_train, y_train)

BernoulliNB(alpha=0.01, binarize=0.09, class_prior=None, fit_prior=True)

Now let’s consider the following Bernie Sanders quote:

*“Medicare for all”* - Bernie Sanders

This quote will contain the term “medicare for”. Let’s see how this translates into the parameters in the above equation:

In [8]:
import numpy as np

C_k = 'Bernie Sanders'
k = le.transform([C_k])[0]
i = tfidf.vocabulary_['medicare for']
p_ki = np.exp(clf_bnb.feature_log_prob_[k, i])
print(f'k = {k}')
print(f'i = {i}')
print(f'C_k = {C_k}')
print(f'p_ki = {p_ki:.3}')

k = 0
i = 44798
C_k = Bernie Sanders
p_ki = 0.0289


Here, *$p_{ki}$* = 0.0289 can be interpreted as follows: 2.89% of the tweets from Bernie Sanders in our training set contain the term “medicare for” (more precisely, with a tfidf value above the ```binarize``` threshold). What’s interesting about Bernoulli naive Bayes is that the *$p_{ki}$* values can be computed directly. Apart from a small correction introduced by the smoothing parameter ```alpha```, each one is simply the document frequency of term *i* in politician *$C_k$*’s corpus divided by the total number of documents in that corpus:

In [9]:
df_ki = clf_bnb.feature_count_[k, i]
n_k = clf_bnb.class_count_[k]
p_ki_manual = df_ki / n_k
print(f'{p_ki:.5}')
print(f'{p_ki_manual:.5}')

0.028924
0.028922
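
The two printed values differ slightly because of the smoothing parameter. As a sketch on made-up binary data, the check below reproduces ```feature_log_prob_``` from the counts by adding ```alpha``` to the numerator and ```2 * alpha``` to the denominator. The toy matrix and variable names are assumptions for this example.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary features and labels, for illustration only.
toy_X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
toy_y = np.array([0, 0, 1, 1])
alpha = 0.01

# binarize=None because the toy features are already 0/1.
toy_bnb = BernoulliNB(alpha=alpha, binarize=None).fit(toy_X, toy_y)

# Smoothed estimate: (document frequency + alpha) / (class size + 2 * alpha)
p_manual = (toy_bnb.feature_count_ + alpha) / (
    toy_bnb.class_count_[:, np.newaxis] + 2 * alpha)
p_model = np.exp(toy_bnb.feature_log_prob_)

print(np.allclose(p_manual, p_model))  # True
```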


In logistic regression, **x** contained tfidf values. In Bernoulli naive Bayes, *$x_i$* must equal 0 or 1. In its simplest form, *$x_i$* would equal 1 if the term *i* was present in the tweet, and 0 if it was absent. However, we can extract a bit more information by setting a threshold on our tfidf values instead. An optimal threshold is typically found through trial and error or exhaustive search methods such as GridSearchCV in scikit-learn. For this app, the best threshold found was ```binarize=0.09```. Therefore, any tfidf value above 0.09 was set to 1, and any value at or below it was set to 0. We also set ```alpha=0.01```, an additive smoothing parameter that prevents any *$p_{ki}$* from being exactly 0 or 1.
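
The thresholding can be reproduced on its own with scikit-learn's ```binarize``` helper. The row of tfidf values below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Hypothetical tfidf values for three terms in one tweet.
toy_row = np.array([[0.05, 0.09, 0.30]])

# Values strictly greater than the threshold become 1, the rest become 0.
toy_binary = binarize(toy_row, threshold=0.09)
print(toy_binary.tolist())  # [[0.0, 0.0, 1.0]]
```

Note that a value exactly equal to the threshold maps to 0.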

Going back to our equation, we can see that if we receive a document from the user in which *$x_i$* = 1, then we multiply by the probability (*$p_{ki}$*) of the term *i* appearing in politician *$C_k$*’s tweet. Conversely, if *$x_i$* = 0, then we multiply by the probability that the term *i* would not appear in politician *$C_k$*’s tweet. This multiplication is done for each term in the vocabulary and for each politician. The politician with the highest joint log-likelihood is then output as the result. Remember that while the logistic function outputs a probability, Bernoulli naive Bayes outputs a likelihood. These two terms are related, but do not mean the same thing. However, ```BernoulliNB``` can convert the likelihoods into probabilities by normalizing them across the politicians, which is what its ```predict_proba``` method returns. Now let’s calculate our accuracy on the test data set:

In [10]:
bnb_score = clf_bnb.score(X_test, y_test)
print(f'Bernoulli Naive Bayes accuracy: {bnb_score:.1%}')

Bernoulli Naive Bayes accuracy: 96.2%


## Ensemble averaging

We trained two models, logistic regression and Bernoulli naive Bayes, both with relatively high accuracy on the test data set. Using scikit-learn’s ```VotingClassifier```, we can take a weighted average of the probabilities they each calculate to obtain a more accurate result: we select our classifiers with ```estimators=[('log', clf_log), ('bnb', clf_bnb)]``` and set ```voting='soft'```. Combining 60% of the logistic regression probability with 40% of the Bernoulli naive Bayes probability, by setting ```weights=(0.6, 0.4)```, gave the highest accuracy. This is an interesting result, considering that our Bernoulli naive Bayes classifier had the higher accuracy on its own, yet we assign it less weight. A likely explanation is that the probability estimates from our Bernoulli naive Bayes classifier are less reliable than those from our logistic regression classifier.
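
The weighted average behind soft voting can be sketched with plain NumPy. The two probability vectors below are made-up values, chosen so that the classifiers disagree.

```python
import numpy as np

# Hypothetical class probabilities [P(class 0), P(class 1)] for one tweet.
p_log = np.array([0.30, 0.70])  # logistic regression leans towards class 1
p_bnb = np.array([0.90, 0.10])  # naive Bayes is confident in class 0

weights = (0.6, 0.4)
p_avg = np.average([p_log, p_bnb], axis=0, weights=weights)
prediction = int(np.argmax(p_avg))

print(np.round(p_avg, 2))  # [0.54 0.46]
print(prediction)          # 0
```

In this example a confident naive Bayes estimate overrides the logistic regression prediction even with the smaller weight, which is why the reliability of each model's probabilities matters when choosing the weights.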

In [11]:
from sklearn.ensemble import VotingClassifier

clf_vot = VotingClassifier(
    estimators=[('log', clf_log), ('bnb', clf_bnb)],
    voting='soft', weights=(0.6, 0.4), n_jobs=n_jobs)

clf_vot.fit(X_train, y_train)
vot_score = clf_vot.score(X_test, y_test)
print(f'Ensemble Averaging accuracy: {vot_score:.1%}')

Ensemble Averaging accuracy: 96.4%


Although this is a marginal increase in accuracy, we should keep in mind that users will likely enter messages that neither candidate would necessarily say. In these cases, ensemble averaging can be beneficial since it is equivalent to getting a second opinion before deciding which politician was more likely to tweet what the user has written.

## Final training

Once we are sure that we are not going to change any more hyperparameters (*e.g.*, the threshold on naive Bayes, the weighted average for voting, *etc.*), we can retrain our model on the entire data set to get twice as many training examples.

In [12]:
clf_vot.fit(X, y)

VotingClassifier(estimators=[('log',
                              LogisticRegression(C=20, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto', n_jobs=-1,
                                                 penalty='l2', random_state=0,
                                                 solver='saga', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('bnb',
                              BernoulliNB(alpha=0.01, binarize=0.09,
                                          class_prior=None, fit_prior=True))],
                 flatten_transform=True, n_jobs=-1, voting='soft',
                 weights=(0.6, 0.4))
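
Once retrained, the ensemble can score new text. The sketch below shows the idea end to end on a tiny made-up corpus so that it runs on its own. The toy tweets, the omission of ```min_df```, and the default logistic regression solver are assumptions made for this example. In the notebook itself, new text would be passed through ```clean_text``` and the fitted ```tfidf``` before calling ```clf_vot.predict_proba```.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

# Toy corpus and labels, for illustration only.
toy_tweets = ['medicare for all', 'health care is a right',
              'tax the billionaires', 'build the wall',
              'make america great again', 'fake news media']
toy_names = ['Bernie Sanders'] * 3 + ['Donald J. Trump'] * 3

toy_tfidf = TfidfVectorizer(ngram_range=(1, 2))
toy_vot = VotingClassifier(
    estimators=[('log', LogisticRegression(C=20)),
                ('bnb', BernoulliNB(alpha=0.01, binarize=0.09))],
    voting='soft', weights=(0.6, 0.4))
toy_vot.fit(toy_tfidf.fit_transform(toy_tweets), toy_names)

# Reuse the fitted vectorizer: transform(), not fit_transform().
new_text = ['medicare for everyone']
proba = toy_vot.predict_proba(toy_tfidf.transform(new_text))[0]

for name, p in zip(toy_vot.classes_, proba):
    print(f'{name:<15} = {p:.1%}')
```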