<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Tackling an NLP Problem with Naive Bayes
_Author: Matt Brems_

----

We can sketch out the data science process as follows:
1. Define the problem.
2. Obtain the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we are going to apply a **new** modeling technique to natural language processing data.

> "But how can we apply a modeling technique we haven't learned?!"

The DSI program is great - but we can't teach you *everything* about data science in 12 weeks! This lab is designed to help you start learning something new without it being taught in a formal lesson. 
- Later in the cohort (like for your capstone!), you'll be exploring models, libraries, and resources that you haven't been explicitly taught.
- After the program, you'll want to continue developing your skills. Being comfortable with documentation and being confident in your ability to read something new and decide whether or not it is an appropriate method for the problem you're trying to solve is **incredibly** valuable.

### Step 1: Define the problem.

Many organizations have a substantial interest in classifying users of their product into groups. Some examples:
- A company that serves as a marketplace may want to predict who is likely to purchase a certain type of product on their platform, like books, cars, or food.
- An application developer may want to identify which individuals are willing to pay money for "bonus features" or to upgrade their app.
- A social media organization may want to identify who generates the highest rate of content that later goes "viral."

### Summary
In this lab, you're an engineer for Facebook. In recent years, the organization Cambridge Analytica gained worldwide notoriety for its use of Facebook data in an attempt to sway electoral outcomes.

Cambridge Analytica, an organization staffed with lots of Ph.D. researchers, used the Big5 personality groupings (also called OCEAN) to group people into one of 32 different groups.
- The five qualities measured by this personality assessment are:
    - **O**penness
    - **C**onscientiousness
    - **E**xtroversion
    - **A**greeableness
    - **N**euroticism
- Each person could be classified as "Yes" or "No" for each of the five qualities.
- This makes for 32 different potential combinations of qualities. ($2^5 = 32$)
- You don't have to check it out, but if you want to learn more about this personality assessment, head to [the Wikipedia page](https://en.wikipedia.org/wiki/Big_Five_personality_traits).
- There's also [a short (3-4 pages) academic paper describing part of this approach](./celli-al_wcpr13.pdf).
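
As a quick check on the $2^5 = 32$ count, the groupings can be enumerated directly. This is only an illustrative sketch; the `"y"`/`"n"` labels mirror the coding used in the dataset's `c***` columns, and the variable names are made up for this example.

```python
from itertools import product

traits = ["Openness", "Conscientiousness", "Extroversion",
          "Agreeableness", "Neuroticism"]

# Each trait is either "y" or "n", so a grouping is one y/n choice per trait.
groupings = list(product(["y", "n"], repeat=len(traits)))

print(len(groupings))                   # number of distinct groupings (2 ** 5)
print(dict(zip(traits, groupings[0])))  # one example grouping
```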

Cambridge Analytica's methodology was, roughly, the following:
- Gather a large amount of data from Facebook.
- Use this data to predict an individual's Big5 personality "grouping."
- Design political advertisements that would be particularly effective to that particular "grouping." (For example, are certain advertisements particularly effective toward people with specific personality traits?)

You want to know the **real-world problem**: "Is what Cambridge Analytica attempted to do actually possible, or is it junk science?"

However, we'll solve the related **data science problem**: "Are one's Facebook statuses predictive of whether or not one is agreeable?"
> Note: If Facebook statuses aren't predictive of one being agreeable (one of the OCEAN qualities), then Cambridge Analytica's approach won't work very well!

### Step 2: Obtain the data.

Obviously, there are plenty of opportunities to discuss the ethics surrounding this particular issue... so let's do that.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB # NLP classification
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay # plot_confusion_matrix was removed in scikit-learn 1.2

# Import CountVectorizer and TfidfVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # New libraries

In [2]:
data = pd.read_csv('./mypersonality_final.csv', encoding = 'ISO-8859-1')

In [3]:
data.head()

Unnamed: 0,#AUTHID,STATUS,sEXT,sNEU,sAGR,sCON,sOPN,cEXT,cNEU,cAGR,cCON,cOPN,DATE,NETWORKSIZE,BETWEENNESS,NBETWEENNESS,DENSITY,BROKERAGE,NBROKERAGE,TRANSITIVITY
0,b7b7764cfa1c523e4e93ab2a79a946c4,likes the sound of thunder.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/19/09 03:21 PM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
1,b7b7764cfa1c523e4e93ab2a79a946c4,is so sleepy it's not even funny that's she ca...,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,07/02/09 08:41 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
2,b7b7764cfa1c523e4e93ab2a79a946c4,is sore and wants the knot of muscles at the b...,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/15/09 01:15 PM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
3,b7b7764cfa1c523e4e93ab2a79a946c4,likes how the day sounds in this new song.,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,06/22/09 04:48 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1
4,b7b7764cfa1c523e4e93ab2a79a946c4,is home. <3,2.65,3.0,3.15,3.25,4.4,n,y,n,n,y,07/20/09 02:31 AM,180.0,14861.6,93.29,0.03,15661.0,0.49,0.1


In [4]:
data.shape

(9917, 20)

**1. What is the difference between anonymity and confidentiality? All else held equal, which tends to keep people safer?**

In [5]:
# Anonymity means no personally identifying information is collected or kept,
# so even the researcher cannot link a record to a participant.

# Confidentiality means identifying information is collected and known to the
# researcher, who promises (or is contractually bound) not to share it.

# All else held equal, anonymity tends to keep people safer: data that was
# never stored cannot be leaked or misused.

**2. Suppose that the "unique identifier" in the above data, the `#AUTHID`, is a randomly generated key so that it can never be connected back to the original poster. Have we guaranteed anonymity here? Why or why not?**

In [6]:
# No, we cannot guarantee anonymity. The statuses themselves contain contextual
# clues, so the data can still be traced back to the original poster. The more
# data we hold on a user, the easier it is to re-identify them.

# For example, a status about their experience at a school in a particular
# location could narrow the potential users from millions to hundreds.


**3. As an engineer for Facebook, you recognize that user data will be used by Facebook and by other organizations - that won't change. However, what are at least three recommendations you would bring to your manager to improve how data is used and shared? Be as specific as you can.**

In [7]:
# 1. Sensitive and personal data should be stored separately and securely,
# because a leak could cause serious harm to users.
# 2. Data should always be anonymised before it is shared, internally or externally.

### Step 3: Explore the data.

- Note: For our $X$ variable, we will only use the `STATUS` variable. For our $Y$ variable, we will only use the `cAGR` variable.

**4. Explore the data here.**
> We aren't explicitly asking you to do specific EDA here, but what EDA would you generally do with this data? Do the EDA you usually would, especially if you know what the goal of this analysis is.

In [8]:
data.isnull().sum()

#AUTHID         0
STATUS          0
sEXT            0
sNEU            0
sAGR            0
sCON            0
sOPN            0
cEXT            0
cNEU            0
cAGR            0
cCON            0
cOPN            0
DATE            0
NETWORKSIZE     0
BETWEENNESS     0
NBETWEENNESS    0
DENSITY         0
BROKERAGE       0
NBROKERAGE      0
TRANSITIVITY    1
dtype: int64

In [9]:
data.describe()

Unnamed: 0,sEXT,sNEU,sAGR,sCON,sOPN,NETWORKSIZE,BETWEENNESS,NBETWEENNESS,DENSITY,BROKERAGE,NBROKERAGE,TRANSITIVITY
count,9917.0,9917.0,9917.0,9917.0,9917.0,9917.0,9917.0,9917.0,9917.0,9917.0,9917.0,9916.0
mean,3.35476,2.609453,3.616643,3.474201,4.130386,429.37712,135425.3,94.66517,3.154012,137642.5,0.48992,0.128821
std,0.857578,0.760248,0.682485,0.737215,0.585672,428.760382,199433.8,5.506696,311.073343,201392.1,0.011908,0.106063
min,1.33,1.25,1.65,1.45,2.25,24.0,93.25,0.04,0.0,0.49,0.18,0.0
25%,2.71,2.0,3.14,3.0,3.75,196.0,16902.2,93.77,0.01,17982.0,0.49,0.06
50%,3.4,2.6,3.65,3.4,4.25,317.0,47166.9,96.44,0.02,48683.0,0.49,0.09
75%,4.0,3.05,4.15,4.0,4.55,633.0,196606.0,97.88,0.03,198186.0,0.5,0.17
max,5.0,4.75,5.0,5.0,5.0,29724.9,1251780.0,99.82,30978.0,1263790.0,0.5,0.63
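
Since only `STATUS` and `cAGR` feed the model, some text-focused EDA is a natural next step. The sketch below is one possible approach, not the lab's prescribed EDA: it runs on a tiny stand-in DataFrame (the first two statuses come from the `head()` output above, the last two are invented) and the new column names are hypothetical. The same lines could be pointed at `data` instead of `toy`.

```python
import pandas as pd

# Tiny stand-in for the real data; the last two rows are invented.
toy = pd.DataFrame({
    "STATUS": ["likes the sound of thunder.",
               "is home. <3",
               "is so sleepy",
               "loves a rainy day"],
    "cAGR": ["n", "n", "y", "y"],
})

# Length of each status in characters and in whitespace-separated words.
toy["status_length"] = toy["STATUS"].str.len()
toy["status_word_count"] = toy["STATUS"].str.split().str.len()

# Class balance of the target, and average status length per class.
class_balance = toy["cAGR"].value_counts(normalize=True)
length_by_class = toy.groupby("cAGR")["status_word_count"].mean()

print(class_balance)
print(length_by_class)
```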


**5. What is the difference between CountVectorizer and TFIDFVectorizer?**

In [10]:
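# CountVectorizer converts each document into features by counting how many
# times each token appears in that document.

# TfidfVectorizer starts from the same counts but reweights them: tokens that
# appear in many documents are down-weighted and rarer tokens are up-weighted.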
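
To make the contrast concrete, here is a minimal sketch on an invented three-document corpus (the documents and variable names are assumptions for illustration only). Both vectorizers learn the same vocabulary, but the matrices they produce differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the storm is loud", "the storm the storm", "the day is quiet"]

# CountVectorizer: raw token counts per document.
cvec = CountVectorizer()
counts = cvec.fit_transform(docs).toarray()

# TfidfVectorizer: counts reweighted so that tokens found in many
# documents (like "the") matter less than rarer tokens (like "loud").
tvec = TfidfVectorizer()
weights = tvec.fit_transform(docs).toarray()

the_idx, loud_idx = cvec.vocabulary_["the"], cvec.vocabulary_["loud"]
print(counts[0, the_idx], counts[0, loud_idx])    # same raw count in document 0
print(weights[0, the_idx], weights[0, loud_idx])  # different TF-IDF weights
```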

**6. What are stopwords?**

In [11]:
# Stopwords are very common words (e.g. "the", "is", "and") that are often removed
# before modeling because they carry little useful information on their own.
# They mainly provide structure and context that help people read the text.

**7. Give an example of when you might remove stopwords.**

In [12]:
# You might remove stopwords for a bag-of-words sentiment analysis, where word
# order and context are not the focus of the model.

# An example is sentiment analysis of restaurant reviews, where the main goal is
# to understand what customers need and to act on their feedback.

**8. Give an example of when you might keep stopwords in your model.**

In [13]:
# You might keep stopwords for more advanced models that need context.

# An example is sequence tagging, which assigns a part of speech to each word.
# Another example is sequence-to-sequence modelling, such as translation:
# a proper translation needs the context and sentiment of the full text.
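
The trade-off between removing and keeping stopwords can be seen on a small invented example. This sketch assumes scikit-learn's built-in English stop list (selected with `stop_words="english"`); the two toy sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie is not good", "the movie is good"]

# Vocabulary with stopwords kept vs. removed.
with_stop = CountVectorizer().fit(docs)
without_stop = CountVectorizer(stop_words="english").fit(docs)

print(sorted(with_stop.vocabulary_))
print(sorted(without_stop.vocabulary_))

# With the built-in list, "not" is treated as a stopword too, so the two
# sentences end up with identical features after removal.
removed = without_stop.transform(docs).toarray()
print((removed[0] == removed[1]).all())
```

Removal shrinks the vocabulary, but it can also erase exactly the word that distinguishes two documents, which is why the choice depends on the task.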

### Step 4: Model the data.

We are going to fit two types of models: a logistic regression and [a Naive Bayes classifier](https://scikit-learn.org/stable/modules/naive_bayes.html).

**Reminder:** We will only use the feature `STATUS` to model `cAGR`.

### We want to attempt to fit our models on sixteen sets of features:

1. CountVectorizer with 100 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
2. CountVectorizer with 100 features, with English stopwords removed and with the default `ngram_range`.
3. CountVectorizer with 100 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
4. CountVectorizer with 100 features, with English stopwords kept in and with the default `ngram_range`.
5. CountVectorizer with 500 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
6. CountVectorizer with 500 features, with English stopwords removed and with the default `ngram_range`.
7. CountVectorizer with 500 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
8. CountVectorizer with 500 features, with English stopwords kept in and with the default `ngram_range`.
9. TFIDFVectorizer with 100 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
10. TFIDFVectorizer with 100 features, with English stopwords removed and with the default `ngram_range`.
11. TFIDFVectorizer with 100 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
12. TFIDFVectorizer with 100 features, with English stopwords kept in and with the default `ngram_range`.
13. TFIDFVectorizer with 500 features, with English stopwords removed and with an `ngram_range` that includes 1 and 2.
14. TFIDFVectorizer with 500 features, with English stopwords removed and with the default `ngram_range`.
15. TFIDFVectorizer with 500 features, with English stopwords kept in and with an `ngram_range` that includes 1 and 2.
16. TFIDFVectorizer with 500 features, with English stopwords kept in and with the default `ngram_range`.

**9. Rather than manually instantiating 16 different vectorizers, what `sklearn` class have we learned about that might make this easier? Use it.**

In [14]:
# GridSearchCV, combined with a Pipeline, fits and cross-validates every
# combination of the hyperparameters we list and keeps the best one.
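
One way to cover all sixteen feature sets in a single search is to treat the vectorizer itself as a hyperparameter. The sketch below is an alternative to running one search per vectorizer, not the approach used in this notebook; the step name `"vec"` and the names `pipe_sketch` and `grid_sketch` are hypothetical. `ParameterGrid` is used only to count the combinations without fitting anything.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe_sketch = Pipeline([
    ("vec", CountVectorizer()),   # placeholder; swapped out by the grid
    ("nb", MultinomialNB()),
])

grid_sketch = {
    "vec": [CountVectorizer(), TfidfVectorizer()],  # which vectorizer to use
    "vec__max_features": [100, 500],
    "vec__stop_words": [None, "english"],
    "vec__ngram_range": [(1, 1), (1, 2)],
}

# 2 vectorizers x 2 feature limits x 2 stopword settings x 2 n-gram ranges
n_combinations = len(ParameterGrid(grid_sketch))
print(n_combinations)
```

Passing `pipe_sketch` and `grid_sketch` to `GridSearchCV` would then search all of these combinations in one call.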

In [15]:
X = data["STATUS"]
y = data["cAGR"]

In [16]:
# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20,
                                                    stratify=y,
                                                    random_state=42)

In [17]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(7933,) (1984,) (7933,) (1984,)


In [18]:
y_train.value_counts(normalize=True).mul(100).round(2)

y    53.12
n    46.88
Name: cAGR, dtype: float64

In [19]:
y_test.value_counts(normalize=True).mul(100).round(2)

y    53.12
n    46.88
Name: cAGR, dtype: float64
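
These class proportions also give us a baseline: a model that always predicts the majority class (`y`) would be right about 53% of the time. A minimal sketch of that idea, using scikit-learn's `DummyClassifier` on invented labels with a similar 53/47 split (the toy arrays are assumptions, not the lab data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels with a 53/47 split; the features are ignored by the dummy model.
y_toy = np.array(["y"] * 53 + ["n"] * 47)
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

print(baseline_accuracy)
```

Any model worth keeping should beat this number on the testing set.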

In [20]:
# Let's set a pipeline up with two stages:
# 1. CountVectorizer (transformer)
# 2. Multinomial Naive Bayes (estimator)

pipe = Pipeline([
    ("cvec", CountVectorizer()), # Transformer (fit, transform)
    ("nb", MultinomialNB()) # Estimator or model (fit, predict)    
])

# The .score() method of MultinomialNB (accuracy) gives GridSearchCV a score
# for judging each combination of hyperparameters.

In [21]:
# Search over the following values of hyperparameters:
# Maximum number of features fit: 100, 500
# Stop words: none removed, or English stop words removed
# Check (individual tokens) and also check (individual tokens and 2-grams).

pipe_params = {
    "cvec__max_features": [100, 500],
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "cvec__stop_words": [None, "english"]
}

# ngram_range of (1,1) returns individual tokens only
# ngram_range of (1,2) returns individual tokens AND bi-grams
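
To see what `ngram_range` changes, here is a small sketch on a single invented status. The variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

status = ["is not good"]

# Unigrams only vs. unigrams plus bigrams.
unigram_vocab = sorted(CountVectorizer(ngram_range=(1, 1)).fit(status).vocabulary_)
bigram_vocab = sorted(CountVectorizer(ngram_range=(1, 2)).fit(status).vocabulary_)

print(unigram_vocab)
print(bigram_vocab)
```

With bigrams included, a phrase like "not good" becomes its own feature, which lets a model capture a little of the context that single tokens lose.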

In [22]:
# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, # what object are we optimizing?
                  param_grid=pipe_params, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [23]:
# Fit GridSearch to training data.
gs.fit(X_train,y_train)

In [24]:
# What's the best score?
gs.best_score_

0.5485949442626129

In [25]:
# What's the best params?
gs.best_params_

{'cvec__max_features': 500,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': 'english'}

In [26]:
# Score model on training set.
# What is the score on a classifier? Accuracy
gs.score(X_train,y_train)

0.6049413840917686

In [27]:
# Score model on testing set.
gs.score(X_test,y_test)

0.5378024193548387

In [28]:
# Let's set a pipeline up with two stages:
# 1. TfidfVectorizer (transformer)
# 2. Multinomial Naive Bayes (estimator)

pipe_tvec = Pipeline([
    ("tvec", TfidfVectorizer()), # Transformer (fit, transform)
    ("nb", MultinomialNB()) # Estimator or model (fit, predict)    
])

# The .score() method of MultinomialNB (accuracy) gives GridSearchCV a score
# for judging each combination of hyperparameters.

In [29]:
# Search over the following values of hyperparameters:
# Maximum number of features fit: 100, 500
# No stop words and English stop words
# Check (individual tokens) and also check (individual tokens and 2-grams).

pipe_tvec_params = {
    "tvec__max_features": [100, 500],
    "tvec__stop_words": [None, "english"],
    "tvec__ngram_range": [(1, 1), (1, 2)]
}

In [30]:
# Instantiate GridSearchCV.
gs_tvec = GridSearchCV(estimator=pipe_tvec,
                      param_grid=pipe_tvec_params,
                      cv=5)

In [31]:
# Fit GridSearch to training data.
gs_tvec.fit(X_train,y_train)

In [32]:
gs_tvec.best_params_

{'tvec__max_features': 500,
 'tvec__ngram_range': (1, 2),
 'tvec__stop_words': 'english'}

In [33]:
# Score model on training set.
gs_tvec.score(X_train,y_train)

0.614773729988655

In [34]:
# Score model on testing set.
gs_tvec.score(X_test,y_test)

0.5342741935483871

**10. What are some of the advantages of fitting a logistic regression model?**

In [35]:
# The advantages of a logistic regression model are:
# 1. The model has coefficients, which are useful for explaining the
# effect of each feature.
# 2. The model is relatively quick and easy to implement.
# 3. The concept of the model is easy to explain to non-technical audiences.
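
The interpretability point can be illustrated by pairing each token with its fitted coefficient. This is a minimal sketch on four invented statuses, not on the lab data; the toy texts, labels, and variable names are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

toy_statuses = ["love this", "love it", "hate this", "hate it"]
toy_labels = ["y", "y", "n", "n"]

toy_pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("logreg", LogisticRegression()),
])
toy_pipe.fit(toy_statuses, toy_labels)

# Map each token to its coefficient; positive values push toward class "y".
vocabulary = toy_pipe.named_steps["cvec"].vocabulary_
coefficients = toy_pipe.named_steps["logreg"].coef_[0]
coef_by_token = {token: coefficients[idx] for token, idx in vocabulary.items()}

for token, coef in sorted(coef_by_token.items(), key=lambda kv: kv[1]):
    print(f"{token:>5}: {coef:+.3f}")
```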

**11. Fit a logistic regression model and compare it to the baseline.**

In [51]:
# Let's set a pipeline up with two stages:
# 1. CountVectorizer (transformer)
# 2. Logistic Regression (estimator)

cvec_log = Pipeline([
    ("cvec", CountVectorizer(max_features=500)), # Transformer (fit, transform)
    ("logreg", LogisticRegression()) # Estimator or model (fit, predict)    
])

# The .coef_ attribute of the fitted LogisticRegression gives us coefficients
# for judging the effect of each feature

In [52]:
# Fit model
cvec_log.fit(X_train, y_train)

In [53]:
# Score model on training set.
cvec_log.score(X_train,y_train)

0.6258666330518089

In [54]:
# Score model on testing set.
cvec_log.score(X_test,y_test)

0.5362903225806451

Next, we fit the same logistic regression model with a TfidfVectorizer.

In [55]:
# Let's set a pipeline up with two stages:
# 1. TfidfVectorizer (transformer)
# 2. Logistic Regression (estimator)

pipe_log = Pipeline([
    ("tvec", TfidfVectorizer()), # Transformer (fit, transform)
    ("logreg", LogisticRegression()) # Estimator or model (fit, predict)    
])

# The .coef_ attribute of the fitted LogisticRegression gives us coefficients
# for judging the effect of each feature

In [56]:
# Fit model
pipe_log.fit(X_train, y_train)

In [57]:
# Score model on training set.
pipe_log.score(X_train,y_train)

0.835245178368839

In [58]:
# Score model on testing set.
pipe_log.score(X_test,y_test)

0.579133064516129

### Summary of Naive Bayes 

Naive Bayes is a classification technique that relies on probability to classify observations.
- It's based on a probability rule called **Bayes' Theorem**... thus, "**Bayes**."
- It makes an assumption that isn't often met, so it's "**naive**."

Despite being a model that relies on a naive assumption, it often performs pretty well! (This is kind of like linear regression... we aren't always guaranteed homoscedastic errors in linear regression, but the model might still do a good job regardless.)
- [Interested in details? Read more here if you want.](https://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf)


The [sklearn documentation](https://scikit-learn.org/stable/modules/naive_bayes.html) is here, but it can be intimidating. So, to quickly summarize the Bayes and Naive parts of the model...

#### Bayes' Theorem
If you've seen Bayes' Theorem, you know that it relates $P(A|B)$ to $P(B|A)$. (Don't worry; we won't be doing any probability calculations by hand! However, you may want to refresh your memory on conditional probability from our earlier lessons if you forget what a conditional probability is.)

$$
\begin{eqnarray*}
\text{Bayes' Theorem: } P(A|B) &=& \frac{P(B|A)P(A)}{P(B)}
\end{eqnarray*}
$$

- Let $A$ be that someone is "agreeable," like the OCEAN category.
- Let $B$ represent the words used in their Facebook post.

$$
\begin{eqnarray*}
\text{Bayes' Theorem: } P(A|B) &=& \frac{P(B|A)P(A)}{P(B)} \\
\Rightarrow P(\text{person is agreeable}|\text{words in Facebook post}) &=& \frac{P(\text{words in Facebook post}|\text{person is agreeable})P(\text{person is agreeable})}{P(\text{words in Facebook post})}
\end{eqnarray*}
$$

We want to calculate the probability that someone is agreeable **given** the words that they used in their Facebook post, and this is exactly what our model is doing. (Rather than calculating this probability by hand, the calculation is done under the hood and we can see the results by checking `.predict_proba()`.) We can (a.k.a. the model can) calculate the pieces on the right-hand side of the equation to give us a probability estimate of how likely someone is to be agreeable given their Facebook post.
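
To make the formula less abstract, here is a worked example for a single word. Only the prior (53.12% agreeable, from the training split) comes from this notebook; the two word likelihoods are invented numbers chosen purely for illustration.

```python
# Prior: share of agreeable users in the training data.
p_agreeable = 0.5312

# Hypothetical likelihoods of seeing some word in a post, by class.
p_word_given_agreeable = 0.02
p_word_given_not_agreeable = 0.01

# Denominator P(B): total probability of seeing the word at all.
p_word = (p_word_given_agreeable * p_agreeable
          + p_word_given_not_agreeable * (1 - p_agreeable))

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_agreeable_given_word = p_word_given_agreeable * p_agreeable / p_word

print(round(p_agreeable_given_word, 3))
```

Because the word is assumed to be twice as likely among agreeable users, seeing it raises the probability of "agreeable" above the prior.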

#### Naive Assumption

If our goal is to estimate $P(\text{person is agreeable}|\text{words in Facebook post})$, that can be quite tricky.

---

<details><summary>Bonus: if you want to understand why that's complicated, click here.</summary>
    
- The event $\text{"words in Facebook post"}$ is a complicated event to calculate.

- If a Facebook post has 100 words in it, then the event $\text{"words in Facebook post"} = \text{"word 1 is in the Facebook post" and "word 2 is in the Facebook post" and }\ldots \text{ and "word 100 is in the Facebook post"}$.

- Calculating the joint probability of all 100 words being in the Facebook post gets complicated pretty quickly. (Refer back to the probability notes on how to calculate the joint probability of two events if you want to see more.)
</details>

---

To simplify matters, we make an assumption: **we assume that all of our features are independent of one another** (strictly speaking, independent once we know the class).

In some contexts, this assumption might be realistic!
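
Under the independence assumption, the hard joint probability collapses into a product of per-word probabilities. The sketch below shows that arithmetic with invented likelihoods (the words and numbers are hypothetical); it also shows the log-space version, since summing logs is a common way to avoid multiplying many tiny numbers.

```python
import math

# Hypothetical values of P(word | person is agreeable).
word_likelihoods = {"thanks": 0.020, "love": 0.030, "day": 0.015}
post_words = ["thanks", "love", "day"]

# Naive assumption: P(words | agreeable) = product of P(word_i | agreeable).
joint = math.prod(word_likelihoods[word] for word in post_words)

# Equivalent calculation in log space.
log_joint = sum(math.log(word_likelihoods[word]) for word in post_words)

print(joint)
print(math.exp(log_joint))
```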

**12. Why would this assumption not be realistic with NLP data?**

In [40]:
# The words are NOT independent of one another, because context matters.

# For example, the word "not" reverses the meaning of the word that follows it:
# "not good" expresses a negative sentiment.

Despite this assumption not being realistic with NLP data, we still use Naive Bayes pretty frequently.
- It's a very fast modeling algorithm, which is great especially when we have lots of features and/or lots of data!
- It is often a strong classifier, and it sometimes outperforms more complicated models.

There are three common types of Naive Bayes models: Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes.
- How do we pick which of the three models to use? It depends on our $X$ variable.
    - Bernoulli Naive Bayes is appropriate when our features are all 0/1 variables.
        - [Bernoulli NB Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB)
    - Multinomial Naive Bayes is appropriate when our features are variables that take on only non-negative integer counts.
        - [Multinomial NB Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)
    - Gaussian Naive Bayes is appropriate when our features are Normally distributed variables. (Realistically, though, we kind of use Gaussian whenever neither Bernoulli nor Multinomial works.)
        - [Gaussian NB Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)
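
The pairing of model type to feature type can be sketched on a tiny invented dataset. All arrays below are made up: the same toy counts are turned into 0/1 indicators for `BernoulliNB`, used as-is for `MultinomialNB`, and jittered into real values for `GaussianNB`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Non-negative integer counts (e.g. token counts).
X_counts = np.array([[3, 0, 1], [2, 0, 0], [4, 1, 0],
                     [0, 2, 3], [0, 3, 1], [1, 4, 2]])
X_binary = (X_counts > 0).astype(int)                            # 0/1 indicators
X_real = X_counts + rng.normal(scale=0.1, size=X_counts.shape)   # continuous values

models = {
    "bernoulli": (BernoulliNB(), X_binary),
    "multinomial": (MultinomialNB(), X_counts),
    "gaussian": (GaussianNB(), X_real),
}

predictions = {}
for name, (model, X) in models.items():
    predictions[name] = model.fit(X, y_toy).predict(X)
    print(name, predictions[name])
```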

**13. Suppose you CountVectorized your features. Which Naive Bayes model would be most appropriate to fit? Why? Fit it.**

In [41]:
# Multinomial Naive Bayes is the most appropriate model here, because
# CountVectorizer produces non-negative integer counts.

# The best Multinomial Naive Bayes model for CountVectorized features uses
# {'cvec__max_features': 500, # Top 500 features
#  'cvec__ngram_range': (1, 1), # Unigrams (single words)
#  'cvec__stop_words': 'english'} # English stopwords are excluded.

# These parameters gave the highest cross-validated score of the
# combinations searched.

**14. Suppose you TFIDFVectorized your features. Which Naive Bayes model would be most appropriate to fit? Why? Fit it.**

In [42]:
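# Multinomial Naive Bayes was fit here as well. TF-IDF values are non-negative
# but not integers; the scikit-learn documentation notes that fractional
# counts such as tf-idf may also work in practice.

# The best Multinomial Naive Bayes model for TFIDFVectorized features uses
# {'tvec__max_features': 500, # Top 500 features
#  'tvec__ngram_range': (1, 2), # Unigrams and bigrams (1 or 2 words)
#  'tvec__stop_words': 'english'} # English stopwords are excluded.

# These parameters gave the highest cross-validated score of the
# combinations searched.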

**15. Compare the performance of your models.**

In [45]:
# Score model CVEC Multinomial on testing set.
gs.score(X_test,y_test)

0.5378024193548387

In [46]:
# Score model TF-IDF Multinomial on testing set.
gs_tvec.score(X_test,y_test)

0.5342741935483871

In [59]:
# Score model CVEC Logistic on testing set.
cvec_log.score(X_test,y_test)

0.5362903225806451

In [60]:
# Score model TF-IDF Logistic on testing set.
pipe_log.score(X_test,y_test)

0.579133064516129

In [None]:
# The TF-IDF logistic regression model is the most appropriate model because it
# has the highest testing score (0.579) of the four models. Its training score
# (0.835) is much higher, however, so it is also the most overfit.
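
For a side-by-side view, the scores reported above can be gathered into one table. The numbers below are the training and testing accuracies from this notebook, rounded to four decimals, plus the 53.12% majority-class share; the column names and row labels are made up for this sketch.

```python
import pandas as pd

scores = pd.DataFrame(
    {
        "train_accuracy": [0.6049, 0.6148, 0.6259, 0.8352],
        "test_accuracy": [0.5378, 0.5343, 0.5363, 0.5791],
    },
    index=[
        "cvec + MultinomialNB",
        "tvec + MultinomialNB",
        "cvec + LogisticRegression",
        "tvec + LogisticRegression",
    ],
)

baseline_accuracy = 0.5312  # always predicting the majority class "y"

# How much each model overfits, and how far it beats the baseline.
scores["overfit_gap"] = scores["train_accuracy"] - scores["test_accuracy"]
scores["lift_over_baseline"] = scores["test_accuracy"] - baseline_accuracy

best_model = scores["test_accuracy"].idxmax()
print(scores.round(4))
print(best_model)
```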

**16. Even though we didn't explore the full extent of Cambridge Analytica's modeling, based on what we did here, how effective was their approach at using Facebook data to model agreeableness?**

In [None]:
# Based on what we did here, predicting Big5 personality "groupings" from
# statuses could be considered junk science: our best testing accuracy
# (about 58%) is only slightly above the 53% majority-class baseline.

# Cambridge Analytica may have used other methods to model agreeableness and
# target political advertisements, such as racial profiling and profiling
# based on past behaviour and statuses about ideology.

# Microtargeting of certain groups of voters is claimed to have been highly
# effective, given the success of Donald Trump's presidential campaign
# as well as Leave.EU (Brexit).

# These methods could be considered unethical but effective, and
# the scandal shows the extent to which private information was misused.