# Ungraded lab #2: Visualizing Naive Bayes

**Objective:** Visualize and interpret the Naive Bayes model for binary classification of tweets

**Steps:**
* Use the conditional probabilities of each tweet to create a scatter plot of tweets. Color represents the class.
* Calculate the confidence ellipses at 2 and 3 standard deviations for the positive and negative examples.
* Show the ellipses and examples in the same plot


In [None]:
import numpy as np # Library for linear algebra and math utils
import pandas as pd # Dataframe library

import matplotlib.pyplot as plt # Library for plots
from utils import confidence_ellipse # Function to add confidence ellipses to charts

## Calculate the likelihoods for each tweet

In this notebook you will use the log likelihood of each tweet to get a visual idea of the data set that we have.

For each tweet we have calculated the likelihood of the tweet being positive and the likelihood of it being negative, i.e., we have calculated in separate columns the numerator and the denominator of the likelihood ratio introduced previously.

$$\log \frac{P(tweet|pos)}{P(tweet|neg)} = \log(P(tweet|pos)) - \log(P(tweet|neg))$$
$$positive = \log(P(tweet|pos)) = \sum_{i=0}^{n}{\log P(W_i|pos)}$$
$$negative = \log(P(tweet|neg)) = \sum_{i=0}^{n}{\log P(W_i|neg)}$$
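
As a minimal sketch of the sums above, the snippet below computes the positive and negative terms for one toy tweet. The word probabilities and the helper name `log_likelihood` are made up for illustration; they are not the values or the code used to build the table in this lab.

```python
import math

# Hypothetical conditional probabilities P(W_i | class); illustrative values only
p_word_given_pos = {"happy": 0.05, "great": 0.04, "sad": 0.005}
p_word_given_neg = {"happy": 0.01, "great": 0.01, "sad": 0.06}

def log_likelihood(words, p_word_given_class):
    """Sum the log conditional probabilities of the words in a tweet."""
    return sum(math.log(p_word_given_class[w]) for w in words)

tweet = ["happy", "great"]  # a toy tweet, already tokenized

positive = log_likelihood(tweet, p_word_given_pos)  # log P(tweet | pos)
negative = log_likelihood(tweet, p_word_given_neg)  # log P(tweet | neg)
log_ratio = positive - negative                     # log of the likelihood ratio

print(positive, negative, log_ratio)
```

A log ratio above zero means the tweet is more likely under the positive class than under the negative one.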

So, for each tweet, we have calculated and stored in a table the log likelihood of the positive and the negative term separately, alongside its corresponding sentiment. This table has been stored in the file called __'bayes_features.csv'__.

The cell below loads the table into a dataframe. Dataframes are data structures that simplify the manipulation of data, allowing filtering, slicing, joining and summarization.

In [None]:
data = pd.read_csv('bayes_features.csv') # Load the data from the csv file

data.head(5) # Show the features of the first 5 tweets. Each row represents a tweet

In [None]:
# Plot the samples using the positive and negative columns of the table

fig, ax = plt.subplots(figsize = (6, 6)) # Create a new figure with a custom size

colors = ['red', 'green'] # Define a color palette: red for negative (0), green for positive (1)

# Color based on sentiment
ax.scatter(data.positive, data.negative, 
    c=[colors[int(k)] for k in data.sentiment], s = 0.1, marker='*')  # Plot a dot for each tweet
# Custom limits for this chart
plt.xlim(-250,0)
plt.ylim(-250,0)
plt.xlabel("Positive") # x-axis label
plt.ylabel("Negative") # y-axis label

## Using Confidence Ellipses to interpret Naïve Bayes

In this section we will use the [confidence ellipse](https://matplotlib.org/3.1.1/gallery/statistics/confidence_ellipse.html#sphx-glr-gallery-statistics-confidence-ellipse-py) to give us an idea of what the Naïve Bayes model "sees".

A confidence ellipse is a way to visualize a 2D normal random variable. The center of the ellipse is the mean of the data attributes. The angle of the ellipse is determined by the direction of the principal components of the covariance matrix, and the width and height of the ellipse are determined by the explained variance of each principal component. Don't worry if this paragraph does not fully make sense yet. In the next modules you will have the opportunity to implement the algorithm that does this, when you see Principal Component Analysis (PCA).
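
The description above can be sketched with an eigendecomposition of the covariance matrix. This is only an illustration under stated assumptions: the function name `ellipse_parameters` and the synthetic data are hypothetical, and the `confidence_ellipse` helper imported from `utils` may compute the same shape in a different way.

```python
import numpy as np

def ellipse_parameters(x, y, n_std=2.0):
    """Return center, full width, full height and angle (degrees) of an n_std ellipse."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    center = np.array([x.mean(), y.mean()])   # the mean of the two attributes
    cov = np.cov(x, y)                        # 2x2 covariance matrix

    # eigh returns eigenvalues in ascending order, with eigenvectors as columns
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, -1]                    # direction of largest variance

    angle = np.degrees(np.arctan2(major[1], major[0]))
    width = 2 * n_std * np.sqrt(eigvals[-1])  # extent along the major axis
    height = 2 * n_std * np.sqrt(eigvals[0])  # extent along the minor axis
    return center, width, height, angle

# Synthetic, correlated 2D data used only to exercise the function
rng = np.random.default_rng(0)
sample = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=2000)
print(ellipse_parameters(sample[:, 0], sample[:, 1], n_std=2))
```

The sign of an eigenvector is arbitrary, so the returned angle may differ by 180 degrees between runs or platforms; both values describe the same ellipse.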

For now, just see how these ellipses roughly summarize the pattern in the data. The less the ellipses overlap, the better the Naïve Bayes classifier will work.
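
To get a feel for how much data an ellipse of a given size summarizes, the sketch below counts the fraction of points whose Mahalanobis distance from the mean is at most `n_std`. The function name `fraction_inside` and the synthetic sample are assumptions made for illustration. For an exactly normal 2D variable the expected fraction is $1 - e^{-n\_std^2/2}$; real tweet features are not exactly normal, so the measured fractions will differ.

```python
import numpy as np

def fraction_inside(x, y, n_std):
    """Fraction of points that fall inside the n_std ellipse of their own distribution."""
    pts = np.column_stack([x, y]).astype(float)
    diff = pts - pts.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(pts, rowvar=False))
    # Squared Mahalanobis distance of every point to the mean
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return float(np.mean(d2 <= n_std ** 2))

# Synthetic 2D normal data used only to exercise the function
rng = np.random.default_rng(0)
sample = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=20000)

for n_std in (2, 3):
    measured = fraction_inside(sample[:, 0], sample[:, 1], n_std)
    expected = 1 - np.exp(-n_std ** 2 / 2)  # value for an exact 2D normal
    print(n_std, measured, expected)
```

The same function could be applied to the `positive` and `negative` columns of each class to see how well the ellipses describe the actual tweets.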

In [None]:
# Plot the samples using the positive and negative columns of the table
fig, ax = plt.subplots(figsize = (6, 6))

colors = ['red', 'green'] # Define a color palette: red for negative (0), green for positive (1)

# Color based on sentiment

ax.scatter(data.positive, data.negative, c=[colors[int(k)] for k in data.sentiment], s = 0.1, marker='*')  # Plot a dot for each tweet
# Custom limits for this chart
plt.xlim(-200,40)
plt.ylim(-200,40)
plt.xlabel("Positive") # x-axis label
plt.ylabel("Negative") # y-axis label

data_pos = data[data.sentiment == 1] # Filter only the positive samples
data_neg = data[data.sentiment == 0] # Filter only the negative samples

# Draw confidence ellipses of 2 std (solid lines)
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=2, edgecolor='black', label=r'$2\sigma$')
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=2, edgecolor='orange')

# Draw confidence ellipses of 3 std (dotted lines)
confidence_ellipse(data_pos.positive, data_pos.negative, ax, n_std=3, edgecolor='black', linestyle=':', label=r'$3\sigma$')
confidence_ellipse(data_neg.positive, data_neg.negative, ax, n_std=3, edgecolor='orange', linestyle=':')
ax.legend()

plt.show()

Takeaway: Understanding your data allows you to predict whether the method you are going to use will perform well. It also helps you understand why a method worked well or badly.