Naive Bayes

We've looked at the Naive Bayes classifier from a probability point of view. Now let's apply it in code to a natural language processing problem.

### Before we begin... what is natural language processing?

- If I'm explaining this to my non-technical peers, natural language processing is just a way for us to get computers to understand written language the way you and I do.

- If I'm explaining this to someone with a more technical background, natural language processing is a set of tools that represent words as numbers. This is commonly done by feature engineering (i.e. turning words into columns in your dataframe), but more complicated methods exist.

You'll often see natural language processing abbreviated as **NLP**.

We'll cover NLP in detail later. For this lesson, we've already done the NLP step of turning social media posts into features. We'll use these features in a Naive Bayes classification model to predict whether a post comes from Twitter or Facebook.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../processed_tweets.csv')

In [3]:
df.dropna(inplace=True)

In [4]:
df.reset_index(drop=True, inplace=True)

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,unit_id,trusted_judgments,audience_feature,bias_feature,message_feature,label_feature,source_feature,text_feature,00,...,young,youtube,û_,ûª,ûªm,ûªre,ûªs,ûªt,ûªve,ûò
0,0,766192484.0,1.0,national,partisan,policy,From: Trey Radel (Representative from Florida),twitter,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,766192485.0,1.0,national,partisan,attack,From: Mitch McConnell (Senator from Kentucky),twitter,VIDEO - #Obamacare: Full of Higher Costs and ...,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,766192486.0,1.0,national,neutral,support,From: Kurt Schrader (Representative from Oregon),twitter,Please join me today in remembering our fallen...,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,766192487.0,1.0,national,neutral,policy,From: Michael Crapo (Senator from Idaho),twitter,RT @SenatorLeahy: 1st step toward Senate debat...,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,4,766192488.0,1.0,national,partisan,policy,From: Mark Udall (Senator from Colorado),twitter,.@amazon delivery #drones show need to update ...,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We have social media data! This includes almost 5,000 messages on either Twitter or Facebook from various politicians. We can use the features we generated to predict things like whether the source is Twitter or Facebook, whether the bias is neutral or partisan, and so on.

In [6]:
df.shape

(4776, 509)

#### We've done a lot of the preprocessing of text for you! We have:
- gotten rid of some of the excess columns,
- broken each message (`text_feature`) out into columns and rows:
    - each row corresponds to one tweet or Facebook post.
    - each column corresponds to a word used.
    - each value corresponds to how many times that word was used in that tweet or Facebook post.
        - For example, if the word "health" was used four times in someone's Facebook post, then the value in the cell for that post and the "health" column should be 4.
- and removed particularly common words.

Over the next few days, you'll be learning how to do this on your own! If you want to see the code that made this preprocessing happen, check out the `extras` folder in this repository. (This code will make more sense later this week.)

You may note that there are some extra symbols in the data. This is a common problem in natural language processing, especially when dealing with social media (think emoji, hashtags, links, etc.), but we're going to ignore that for now.

### Let's use Naive Bayes to predict whether a social media post was posted on Facebook or Twitter.

#### 1. Engineer a feature to turn `source_feature` into a 1/0 column, where 1 indicates `twitter`.

In [7]:
# 1 if the post came from Twitter, 0 otherwise.
df['twitter'] = (df['source_feature'] == 'twitter').astype(int)

#### NOTE: Since we are solving a classification problem, what potential issue should I check for here?

In [8]:
df['twitter'].value_counts()

1    2391
0    2385
Name: twitter, dtype: int64
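The issue to check for is class imbalance. One way to read the counts above is as proportions, which also gives the accuracy of always guessing the most common class. The sketch below rebuilds the label column from the printed counts so it runs on its own; in the notebook you could call `df['twitter'].value_counts(normalize=True)` directly.

```python
import pandas as pd

# Rebuild the label column from the counts printed above.
twitter = pd.Series([1] * 2391 + [0] * 2385, name='twitter')

# normalize=True returns proportions instead of raw counts.
class_shares = twitter.value_counts(normalize=True)
print(class_shares)

# Accuracy we would get by always guessing the most common class.
baseline_accuracy = class_shares.max()
print(round(baseline_accuracy, 4))  # 0.5006
```

With the classes split almost exactly in half, imbalance is not a concern here, and any useful model should beat roughly 50% accuracy.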

#### 2. Split our data into `X` and `y`.

In [9]:
X = df.drop(columns = ['Unnamed: 0', 'unit_id', 'trusted_judgments', 'audience_feature',
                       'bias_feature', 'message_feature', 'label_feature', 'source_feature',
                       'text_feature', 'twitter'])
y = df['twitter']

In [10]:
X.head()

Unnamed: 0,00,000,10,100,11,12,20,2013,2014,30,...,young,youtube,û_,ûª,ûªm,ûªre,ûªs,ûªt,ûªve,ûò
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### 3. Split our data into training and testing sets.

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

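> Note: This is where we would usually turn our words into features: after the train/test split, so that the vocabulary is learned from the training data only and nothing about the test set leaks into the model.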
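As a rough sketch of that ordering, assuming we started from raw text rather than precomputed columns: split first, learn the vocabulary from the training text, then apply that same vocabulary to the test text. The posts and labels below are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up raw posts and labels (1 = Twitter, 0 = Facebook); purely illustrative.
texts = ["rt vote today", "long post about policy", "rt watch live",
         "thank you to everyone who came", "rt breaking news",
         "photos from the fair"]
labels = [1, 0, 1, 0, 1, 0]

# Split the raw text before building any features.
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=2, random_state=42)

vectorizer = CountVectorizer()
# Learn the vocabulary from the training text only.
X_train_counts = vectorizer.fit_transform(text_train)
# Reuse that vocabulary on the test text; unseen words are simply ignored.
X_test_counts = vectorizer.transform(text_test)

print(X_train_counts.shape, X_test_counts.shape)
```

Both matrices end up with the same columns, which is what lets a model trained on one be scored on the other.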

#### 4. Fit a Naive Bayes model!

<details><summary> Which Naive Bayes model should we pick, and why? </summary>
```
- The columns of X are all integer counts, so MultinomialNB is the best choice here.
- BernoulliNB is best when we have 0/1 counts in all columns of X. (a.k.a. dummy variables)
- GaussianNB is best when the columns of X are Normally distributed. (Practically, though, it gets used whenever BernoulliNB and MultinomialNB are inappropriate.)
```
</details>
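The rule of thumb above can be written as a small check on the values in `X`. The helper `suggest_nb` below is hypothetical (it is not part of `sklearn`) and only encodes the heuristic from this lesson, so treat its answer as a starting point rather than a rule.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

def suggest_nb(X):
    """Hypothetical helper: pick a Naive Bayes class from the values in X."""
    values = np.asarray(X, dtype=float)
    if np.isin(values, [0, 1]).all():
        return BernoulliNB()      # every column is a 0/1 indicator
    if (values >= 0).all() and np.equal(np.mod(values, 1), 0).all():
        return MultinomialNB()    # non-negative whole-number counts
    return GaussianNB()           # fallback for other numeric columns

print(type(suggest_nb([[0, 4], [2, 0]])).__name__)  # MultinomialNB
```

Our word-count columns contain values larger than 1 (a word can appear several times in a post), which is why the check lands on `MultinomialNB`.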

In [13]:
from sklearn.naive_bayes import MultinomialNB

In [14]:
# Instantiate our model!
nb = MultinomialNB()

Recall from earlier that we have the opportunity to set priors. We could do so here if we wanted, but we'll stick with the default and allow `sklearn` to estimate priors from the training data directly.
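For reference, here is a minimal sketch of both options on a tiny made-up count matrix (not our data). `class_prior` is a real `MultinomialNB` parameter; the 50/50 values are just an example choice.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up count matrix: three posts labeled 1, one post labeled 0.
X_toy = np.array([[2, 0], [1, 0], [3, 1], [0, 2]])
y_toy = np.array([1, 1, 1, 0])

# Default: priors are estimated from class frequencies in the training data.
nb_default = MultinomialNB().fit(X_toy, y_toy)

# Alternative: supply priors yourself, ordered like classes_ (here [0, 1]).
nb_custom = MultinomialNB(class_prior=[0.5, 0.5]).fit(X_toy, y_toy)

# class_log_prior_ stores log priors, so exponentiate to read probabilities.
print(np.exp(nb_default.class_log_prior_))  # approximately [0.25 0.75]
print(np.exp(nb_custom.class_log_prior_))   # approximately [0.5 0.5]
```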

In [15]:
# Fit our model!

model = nb.fit(X_train, y_train)

In [16]:
# Generate our predictions!

predictions = model.predict(X_test)

<details><summary> How might we evaluate our model's performance? </summary>
```
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Sensitivity = TP / (TP + FN)
- Specificity = TN / (TN + FP)
- Precision = TP / (TP + FP)
- AUC ROC
```
</details>

<details><summary> If we have to select only one, which one should we choose? </summary>
```
- It depends on how exactly you define "positive" and "negative." In this case, it probably doesn't really matter: mistaking a tweet for a Facebook post doesn't seem much better or worse than mistaking a Facebook post for a tweet.
- Because I believe false positives and false negatives are equally bad, I'd probably use accuracy.
```
</details>
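Accuracy works from hard 0/1 predictions, but AUC ROC, also listed above, needs predicted probabilities. The sketch below shows the mechanics with `predict_proba` and `roc_auc_score` on a made-up toy matrix; it scores the same rows it was trained on, so the number itself means little. In the notebook you would pass `X_test` and `y_test` instead.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

# Made-up counts and labels, only to demonstrate the calls.
X_toy = np.array([[3, 0], [2, 1], [0, 2], [1, 3], [4, 1], [0, 4]])
y_toy = np.array([1, 1, 0, 0, 1, 0])

nb_toy = MultinomialNB().fit(X_toy, y_toy)

# Column 1 of predict_proba is the probability of class 1 (see nb_toy.classes_).
prob_twitter = nb_toy.predict_proba(X_toy)[:, 1]

auc = roc_auc_score(y_toy, prob_twitter)
print(auc)
```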

In [17]:
model.score(X_train, y_train)

0.8259048758600059

In [18]:
model.score(X_test, y_test)

0.8025122121423587

In [19]:
from sklearn.metrics import confusion_matrix

In [20]:
confusion_matrix(y_test, predictions)

array([[592, 108],
       [175, 558]])

In [21]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

In [22]:
print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")

True Negatives: 592
False Positives: 108
False Negatives: 175
True Positives: 558
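The formulas listed earlier can be applied directly to these four counts. The values are hard-coded below so the sketch runs on its own; in the notebook, `tn`, `fp`, `fn`, and `tp` are already defined.

```python
# Counts copied from the confusion matrix output above.
tn, fp, fn, tp = 592, 108, 175, 558

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of actual tweets we caught
specificity = tn / (tn + fp)   # share of actual Facebook posts we caught
precision = tp / (tp + fp)     # share of predicted tweets that were tweets

print(f"Accuracy:    {accuracy:.4f}")     # 0.8025
print(f"Sensitivity: {sensitivity:.4f}")  # 0.7613
print(f"Specificity: {specificity:.4f}")  # 0.8457
print(f"Precision:   {precision:.4f}")    # 0.8378
```

The accuracy matches `model.score(X_test, y_test)`, and the gap between sensitivity and specificity shows the model misses tweets more often than it misses Facebook posts.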


<details><summary> By default, what does a false positive mean here? </summary>
```
- False positives are things we falsely predict to be positive.
- In this case, since Twitter = 1, a false positive means I incorrectly think something is a tweet when it's really a Facebook post.
```
</details>

<details><summary> How might you try to improve our model's performance? </summary>
```
- Try a non-default prior, if I think it's warranted.
- Check out how a logistic regression model or k-NN model compares.
```
</details>
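One way to run that comparison is cross-validated accuracy for each candidate on the same data. The sketch below uses synthetic counts as a stand-in so it runs on its own; the generated data, the `n_neighbors=5` setting, and `max_iter=1000` are illustrative assumptions. In the notebook you would pass `X_train` and `y_train` instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic word counts standing in for X_train and y_train (illustrative only).
rng = np.random.default_rng(42)
y_demo = rng.integers(0, 2, size=200)
# Give class 1 a higher rate on the first 5 "words" so there is signal to learn.
rates = np.where(y_demo[:, None] == 1, 2.0, 1.0) * np.ones((200, 5))
X_demo = np.hstack([rng.poisson(rates), rng.poisson(1.0, size=(200, 15))])

candidates = {
    'MultinomialNB': MultinomialNB(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=5),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_scores[name] = scores.mean()
    print(name, round(cv_scores[name], 3))
```

Comparing on cross-validation folds of the training data keeps the test set untouched until a final model is chosen.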