# Multinomial Naive Bayes

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

### Load the Dataset

In [3]:
dataset = pd.read_table("./demo_3_dataset/spam_or_ham.txt", header=None, names=["target", "text"])
dataset.head()

| | target | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [39]:
print('ham:')
print(dataset.iloc[4,1])
print('spam:')
print(dataset.iloc[2,1])

ham:
Nah I don't think he goes to usf, he lives around here though
spam:
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's


In [11]:
dataset.shape

(5572, 2)

### Vectorize

Transform the input from text into a bag of words matrix ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)).

In [16]:
vectorizer = CountVectorizer()
vectorized_data = vectorizer.fit_transform(dataset["text"])

In [50]:
vectorizer.get_feature_names_out()[4000:4200]

array(['huge', 'hugging', 'hugh', 'hugs', 'huh', 'hui', 'huiming', 'hum',
       'humanities', 'humans', 'hun', 'hundred', 'hundreds', 'hungover',
       'hungry', 'hunks', 'hunny', 'hunt', 'hunting', 'hurricanes',
       'hurried', 'hurry', 'hurt', 'hurting', 'hurts', 'husband',
       'hussey', 'hustle', 'hut', 'hv', 'hv9d', 'hvae', 'hw', 'hyde',
       'hype', 'hypertension', 'hypotheticalhuagauahahuagahyuhagga',
       'iam', 'ias', 'ibh', 'ibhltd', 'ibiza', 'ibm', 'ibn', 'ibored',
       'ibuprofens', 'ic', 'iccha', 'ice', 'icic', 'icicibank', 'icky',
       'icmb3cktz8r7', 'icon', 'id', 'idc', 'idea', 'ideal', 'ideas',
       'identification', 'identifier', 'idew', 'idiot', 'idk', 'idps',
       'idu', 'ie', 'if', 'iff', 'ifink', 'ig11', 'ignorant', 'ignore',
       'ignoring', 'ihave', 'ijust', 'ikea', 'ikno', 'iknow', 'il',
       'ileave', 'ill', 'illness', 'illspeak', 'ilol', 'im', 'image',
       'images', 'imagination', 'imagine', 'imat', 'imf', 'img', 'imin',
       'imma'

In [53]:
vectorized_data.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

### Train/Test Split

In [54]:
X_train, X_test, y_train, y_test = train_test_split(vectorized_data, dataset.target, test_size=0.2)
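The split above is drawn at random on every run, so the exact rows and scores change between executions. As a hedged sketch on made-up data (the `toy_*` names are illustrative), passing `random_state` makes a split reproducible, and `stratify` keeps the ham/spam proportions the same in both parts, which may help when one class is much rarer than the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 16 ham and 4 spam (80% / 20%).
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array(["ham"] * 16 + ["spam"] * 4)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X,
    toy_y,
    test_size=0.25,
    random_state=0,   # fixed seed so the split is repeatable
    stratify=toy_y,   # preserve class proportions in train and test
)

print((toy_y_train == "spam").sum(), (toy_y_test == "spam").sum())
```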

### Model Training

In [55]:
clf = MultinomialNB().fit(X_train, y_train)
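To illustrate what fitting estimates, the sketch below recomputes the per-class term likelihoods by hand on a tiny made-up count matrix (all `toy_*` names and values are illustrative). It assumes the default additive smoothing `alpha=1.0`, under which the likelihood of term `w` in class `c` is `(count(w, c) + alpha) / (total_count(c) + alpha * vocabulary_size)`, and checks the result against `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical term counts: 4 messages, 3 vocabulary terms.
toy_counts = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [0, 1, 3],
    [0, 0, 2],
])
toy_labels = np.array(["ham", "ham", "spam", "spam"])

alpha = 1.0
toy_clf = MultinomialNB(alpha=alpha).fit(toy_counts, toy_labels)

# Recompute the smoothed likelihoods P(term | class) by hand.
manual = []
for label in toy_clf.classes_:
    class_counts = toy_counts[toy_labels == label].sum(axis=0)
    smoothed = class_counts + alpha
    manual.append(smoothed / smoothed.sum())
manual = np.array(manual)

print(manual)
print(np.exp(toy_clf.feature_log_prob_))  # rows follow toy_clf.classes_
```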

### Model Evaluation

We can look at a particular case.

Let's look for a spam entry:

In [106]:
y_train[:20]

2433     ham
4950     ham
3772     ham
2304     ham
2947     ham
1813     ham
1175     ham
4032     ham
2534     ham
5210     ham
5189    spam
4003     ham
3098     ham
1682     ham
4120     ham
3643     ham
161      ham
2055     ham
5386     ham
2135     ham
Name: target, dtype: object

First, let's check whether our classifier predicts the training data correctly:

In [115]:
y_train[:20] == clf.predict(X_train)[:20]

2433    True
4950    True
3772    True
2304    True
2947    True
1813    True
1175    True
4032    True
2534    True
5210    True
5189    True
4003    True
3098    True
1682    True
4120    True
3643    True
161     True
2055    True
5386    True
2135    True
Name: target, dtype: bool
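Comparing the first 20 labels is only a spot check. As a small sketch on made-up labels (the `toy_*` names are illustrative), taking the mean of the boolean comparison over the whole set gives the fraction of correct predictions, which matches `accuracy_score`.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels.
toy_true = pd.Series(["ham", "ham", "spam", "ham", "spam"])
toy_pred = pd.Series(["ham", "spam", "spam", "ham", "spam"])

matches = toy_true == toy_pred          # element-wise boolean Series
accuracy_from_mean = matches.mean()     # True counts as 1, False as 0
accuracy_from_sklearn = accuracy_score(toy_true, toy_pred)

print(accuracy_from_mean, accuracy_from_sklearn)
```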

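We can also inspect the parameters the model has learned. Note that `clf.feature_log_prob_` has shape `(n_classes, n_features)`: it stores the log-likelihood of each vocabulary term given each class, with rows ordered as in `clf.classes_` (`ham` first, then `spam`). Column 5189 is therefore the vocabulary term at index 5189, not the message with row label 5189 found above:

In [113]:
np.exp(clf.feature_log_prob_).round(4)[:,5189]

array([0.0001, 0.    ])

For this term, the likelihood under ham (0.0001) is higher than under spam (which rounds to 0 at four decimal places). The same lookup for column 2135:

In [None]:
np.exp(clf.feature_log_prob_).round(8)[:,2135]

array([1.695e-05, 4.396e-05])

Here the term is more likely under spam (about 4.4e-05) than under ham (about 1.7e-05). Neither value is the probability that a particular message is spam or ham; that per-message posterior comes from `clf.predict_proba`.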
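For the probability that a given message is ham or spam, a hedged sketch on a made-up four-message corpus follows (all `toy_*` names and texts are illustrative). It uses `predict_proba`, whose columns follow the order of `classes_`, and pairs each probability with its class label to avoid reading the columns the wrong way round.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training messages and labels.
toy_texts = [
    "win free prize now",
    "free entry win cash",
    "see you at lunch",
    "are you home now",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_clf = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# Use transform (not fit_transform) so the new message reuses the fitted vocabulary.
new_message = toy_vectorizer.transform(["free cash prize"])
posterior = toy_clf.predict_proba(new_message)[0]

# Pair each probability with its class label.
posterior_by_class = dict(zip(toy_clf.classes_, posterior))
print(posterior_by_class)
```

In the notebook's setting, the analogous call would be `clf.predict_proba(X_train[i])` for the i-th row position of `X_train`, which is a position in the shuffled training matrix rather than the original row label.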

In [128]:
y_test[:20] == clf.predict(X_test)[:20]

3625     True
4199     True
4080     True
4136     True
4062     True
625      True
2606     True
264      True
4769     True
1090     True
2759     True
919      True
3798     True
1500     True
5369     True
3121     True
2704     True
4600    False
3043     True
3853     True
Name: target, dtype: bool

In [56]:
print(classification_report(y_test, clf.predict(X_test)))

              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       968
        spam       0.92      0.97      0.94       147

    accuracy                           0.98      1115
   macro avg       0.96      0.98      0.97      1115
weighted avg       0.99      0.98      0.98      1115



The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label a negative sample as positive.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.

The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.
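The definitions above can be worked through on a small made-up set of labels (the `toy_*` names and values are illustrative), treating `spam` as the positive class. The sketch computes precision, recall, and F-beta from the confusion-matrix counts and compares them with scikit-learn's own metric functions.

```python
from sklearn.metrics import (
    confusion_matrix,
    fbeta_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 3 spam and 5 ham messages.
toy_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
toy_pred = ["spam", "spam", "ham", "spam", "spam", "ham", "ham", "ham"]

# With labels=["ham", "spam"], ravel() yields tn, fp, fn, tp for spam as positive.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=["ham", "spam"]).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)

beta = 2.0  # beta > 1 weights recall more heavily than precision
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(precision, recall, f_beta)
print(precision_score(toy_true, toy_pred, pos_label="spam"))
print(recall_score(toy_true, toy_pred, pos_label="spam"))
print(fbeta_score(toy_true, toy_pred, beta=beta, pos_label="spam"))
```

In the classification report above, each row applies the same definitions with that row's class taken as the positive one, and the `f1-score` column corresponds to beta equal to 1.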