# Python Programming: Bayes’ Theorem

Bayes’ Theorem is applied in machine learning through the Bayes classifier, which we use in order to make predictions. In this session, we will learn how to apply this classifier to a few machine learning problems, even though we will spend more time working on such problems later during Core. While working, we should note that the Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, the classifier treats each of them as contributing independently to the probability that this fruit is an apple, which is why it is known as ‘Naive’.

Such classifiers, Naive Bayes classifiers, are a collection of classification algorithms based on Bayes’ Theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class.
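
To make the independence assumption concrete, here is a minimal sketch of the apple example worked by hand. Every probability below is an invented number chosen only for illustration, and the name `naive_posterior` is hypothetical. The point is that the posterior is proportional to the prior multiplied by one likelihood per feature.

```python
# Naive Bayes by hand: P(class | features) is proportional to
# P(class) * product of P(feature | class), treating features as independent.
# All probabilities below are invented for illustration only.

priors = {"apple": 0.3, "other": 0.7}

# P(feature is present | class)
likelihoods = {
    "apple": {"red": 0.8, "round": 0.9, "about_3_inches": 0.7},
    "other": {"red": 0.2, "round": 0.4, "about_3_inches": 0.1},
}


def naive_posterior(observed_features):
    """Return the normalized posterior probability of each class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed_features:
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())  # the evidence term P(features)
    return {label: score / total for label, score in scores.items()}


posterior = naive_posterior(["red", "round", "about_3_inches"])
print(posterior)
```

With these made-up numbers, the three features together push the posterior strongly towards "apple", even though the prior favoured "other".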


## Example 

In [None]:
# Example 1
# ---
# Let's see an overview on how this classifier works, which suitable applications it has, 
# and how to use it in just a few lines of Python and the Scikit-Learn library.
# ---
# Question: Build a very simple SPAM detector for SMS messages given the following dataset; 
# ---
# Dataset source = https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
#

In [None]:
# Importing our libraries
# ---
#
import pandas as pd

import numpy as np

In [None]:
# Loading our uploaded Data
# ---
# We define a separator (in this case, a tab) and rename the columns accordingly
# 
df = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=['label', 'message'], encoding='latin-1')
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
# Pre-processing
# ---
# 1. Converting the labels from strings to binary values for our classifier
# 
df['label'] = df.label.map({'ham': 0, 'spam': 1})
df.head()

Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
# Pre-processing
# ---
# 2. Converting all characters in the message to lower case:
# 
df['message'] = df.message.map(lambda x: x.lower())
df.head()

Unnamed: 0,label,message
0,0,"go until jurong point, crazy.. available only ..."
1,0,ok lar... joking wif u oni...
2,1,free entry in 2 a wkly comp to win fa cup fina...
3,0,u dun say so early hor... u c already then say...
4,0,"nah i don't think he goes to usf, he lives aro..."


In [None]:
# Pre-processing
# ---
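# 3. Remove any punctuation (regex=True is needed on recent pandas versions,
#    where str.replace treats the pattern as a literal string by default):
# 
df['message'] = df.message.str.replace(r'[^\w\s]', '', regex=True)
df.head()
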
Unnamed: 0,label,message
0,0,go until jurong point crazy available only in ...
1,0,ok lar joking wif u oni
2,1,free entry in 2 a wkly comp to win fa cup fina...
3,0,u dun say so early hor u c already then say
4,0,nah i dont think he goes to usf he lives aroun...
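
The two cleaning steps above (lower-casing and punctuation removal) can also be sketched on a single string with the standard `re` module. The helper name `clean_message` is hypothetical. Note that `\w` also matches digits and underscores, so those characters are kept.

```python
import re


def clean_message(text):
    """Lower-case a message and drop every character that is
    neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())


sample = "Ok lar... Joking wif u oni..."
print(clean_message(sample))
```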


In [None]:
# Pre-processing
# ---
# 4. Tokenize the messages into single words using nltk. 
# First, we have to import nltk and download the tokenizer data:
# 
import nltk
nltk.download("popular")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

In [None]:
# Pre-processing
# ---
# 5. Applying the tokenization. 
# What is tokenization (http://bit.ly/WhatisTokenization)
# 
df['message'] = df['message'].apply(nltk.word_tokenize)
df.head()

Unnamed: 0,label,message
0,0,"[go, until, jurong, point, crazy, available, o..."
1,0,"[ok, lar, joking, wif, u, oni]"
2,1,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,0,"[u, dun, say, so, early, hor, u, c, already, t..."
4,0,"[nah, i, dont, think, he, goes, to, usf, he, l..."


In [None]:
# Pre-processing
# ---
# 6. We then perform some word stemming. 
# The idea of stemming is to normalize our text so that all variations of a word carry the same meaning, 
# regardless of the tense. One of the most popular stemming algorithms is the Porter Stemmer:
# 
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
 
df['message'] = df['message'].apply(lambda x: [stemmer.stem(y) for y in x])
df.head()

Unnamed: 0,label,message
0,0,"[go, until, jurong, point, crazi, avail, onli,..."
1,0,"[ok, lar, joke, wif, u, oni]"
2,1,"[free, entri, in, 2, a, wkli, comp, to, win, f..."
3,0,"[u, dun, say, so, earli, hor, u, c, alreadi, t..."
4,0,"[nah, i, dont, think, he, goe, to, usf, he, li..."


In [None]:
# Pre-processing
# ---
# 7. We will transform the data into occurrences, 
# which will be the features that we will feed into our model:
#
from sklearn.feature_extraction.text import CountVectorizer

# This converts the list of words into space-separated strings
df['message'] = df['message'].apply(lambda x: ' '.join(x))

count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['message'])
df.head()

Unnamed: 0,label,message
0,0,go until jurong point crazi avail onli in bugi...
1,0,ok lar joke wif u oni
2,1,free entri in 2 a wkli comp to win fa cup fina...
3,0,u dun say so earli hor u c alreadi then say
4,0,nah i dont think he goe to usf he live around ...
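
To see what "transforming the data into occurrences" means, here is a rough pure-Python sketch of a bag-of-words count matrix built from two of the stemmed messages above. It is an illustration, not what `CountVectorizer` does internally. The regular expression keeps tokens of two or more word characters, which mirrors `CountVectorizer`'s default token pattern, so single letters such as "u" are dropped.

```python
import re
from collections import Counter

messages = [
    "ok lar joke wif u oni",
    "u dun say so earli hor u c alreadi then say",
]

# Tokens of two or more word characters, similar to CountVectorizer's default
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(message) for message in messages]

# Sorted vocabulary: one column per distinct token
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row of counts per message
count_matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in count_matrix:
    print(row)
```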


In [None]:
# Pre-processing
# ---
# 8. We could leave it as the simple word count per message, 
# but it is better to use Term Frequency Inverse Document Frequency, better known as tf-idf:
#
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer().fit(counts)

counts = transformer.transform(counts)
df.head()

Unnamed: 0,label,message
0,0,go until jurong point crazi avail onli in bugi...
1,0,ok lar joke wif u oni
2,1,free entri in 2 a wkli comp to win fa cup fina...
3,0,u dun say so earli hor u c alreadi then say
4,0,nah i dont think he goe to usf he live around ...
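
The tf-idf weighting can be worked through by hand on a toy count matrix. The sketch below assumes the default `TfidfTransformer` settings: a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The terms and counts are hypothetical.

```python
import math

# Toy count matrix: rows are documents, columns are terms (hypothetical values)
terms = ["free", "win", "say"]
toy_counts = [
    [2, 1, 0],
    [0, 0, 3],
    [1, 0, 1],
]

n_docs = len(toy_counts)

# Document frequency: number of documents in which each term appears
doc_freq = [sum(1 for row in toy_counts if row[j] > 0) for j in range(len(terms))]

# Smoothed inverse document frequency: rarer terms get a larger weight
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in toy_counts:
    weighted = [tf * weight for tf, weight in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted)) or 1.0  # guard against empty rows
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 3) for v in row])
```

Here "win" appears in only one document, so it receives a higher idf than "free" or "say", which each appear in two.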


In [None]:
# Training the Model
# ---
# Now that we have performed feature extraction from our data, 
# it is time to build our model. We will start by splitting our data into training and test sets:
#
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, df['label'], test_size=0.1, random_state=69)
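
One design point worth noting: the vectorizer and tf-idf transformer above were fitted on every message before the split, so the test messages influence the vocabulary and idf weights. A possible alternative, sketched here on a tiny made-up corpus, is to split the raw text first and let a scikit-learn pipeline fit everything on the training part only. The corpus, labels and variable names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for the SMS data (0 = ham, 1 = spam)
texts = [
    "free entry win cash prize", "win free ticket now", "claim your free prize",
    "cash prize call now", "ok see you later", "are you coming home",
    "call me when you are free", "see you at lunch",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the vectorizer never sees the test messages
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# TfidfVectorizer combines the counting and tf-idf steps in one estimator
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(text_train, label_train)

pipeline_predictions = spam_pipeline.predict(text_test)
print(pipeline_predictions)
```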

In [None]:
# Training the Model
# ---
# Then, all that we have to do is initialize the Naive Bayes Classifier and fit the data. 
# For text classification problems, the Multinomial Naive Bayes Classifier is well-suited:
# 
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, y_train)
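
For intuition about what the multinomial model computes, here is a small pure-Python sketch on raw word counts. It scores each class as the log prior plus the summed log likelihood of each word, with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` in `MultinomialNB`. The training messages and the helper names `log_score` and `predict` are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Made-up training messages, already tokenized (0 = ham, 1 = spam)
train = [
    (["free", "prize", "win"], 1),
    (["win", "cash", "free"], 1),
    (["see", "you", "later"], 0),
    (["call", "you", "later"], 0),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {token for tokens, _ in train for token in tokens}


def log_score(tokens, label, alpha=1.0):
    """Log prior plus summed log likelihoods with add-alpha smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for token in tokens:
        if token not in vocab:
            continue  # ignore words never seen in training
        score += math.log(
            (word_counts[label][token] + alpha) / (total + alpha * len(vocab))
        )
    return score


def predict(tokens):
    """Return the class with the highest log score."""
    return max(class_counts, key=lambda label: log_score(tokens, label))


print(predict(["free", "cash"]))
print(predict(["see", "you"]))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that was never seen in one class from forcing that class's probability to zero.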

In [None]:
# Evaluating the Model
# ---
# Once we have put together our classifier, we can evaluate its performance in the testing set:
#
predicted = model.predict(X_test)

print(np.mean(predicted == y_test))

# Our simple Naive Bayes Classifier has 94.8% accuracy with this specific test set!

0.9480286738351255
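
Accuracy on its own can hide a lot when one class dominates, as ham usually does in SMS data. A classifier that rarely predicts spam can still score well. The sketch below uses hypothetical labels and predictions to show how precision and recall for the spam class are derived from the confusion counts.

```python
# Hypothetical true labels and predictions (0 = ham, 1 = spam)
actual = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predicted_labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(actual, predicted_labels))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # spam caught
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # ham flagged as spam
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # spam missed
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # ham left alone

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case the accuracy is high even though half of the spam messages are missed, which is why recall is worth checking for a spam detector.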


## <font color="green">Challenges</font>

In [None]:
# Example 1
# ---
# In this challenge, we have been tasked with creating a classifier, the training set,
# then training the classifier using the training set and making a prediction.
# ---
# The training set (X) consists of length, weight and shoe size. 
# Y contains the associated labels (male or female).
# 

X = [[121, 80, 44], [180, 70, 43], [166, 60, 38], [153, 54, 37], [166, 65, 40], [190, 90, 47], [175, 64, 39],
     [174, 71, 40], [159, 52, 37], [171, 76, 42], [183, 85, 43]]

Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female', 'female', 'male', 'male']

# Splitting the data into training and test sets:
#
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Training the classifier and scoring its predictions on the test set,
# using the GaussianNB classifier (i.e. from sklearn.naive_bayes import GaussianNB)
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train, y_train)
print("Naive Bayes score: ", nb.score(x_test, y_test))


Naive Bayes score:  0.25
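
With only 11 rows, a 30% split leaves just four test samples, so a score such as 0.25 is a very noisy estimate. To see what the Gaussian variant computes, here is a rough pure-Python sketch fitted on all 11 rows. It estimates a prior plus a per-feature mean and variance for each class, then scores a sample with the Gaussian log density. The helper names, the small variance constant and the second query sample are assumptions made for illustration.

```python
import math
from statistics import mean, pvariance

X = [[121, 80, 44], [180, 70, 43], [166, 60, 38], [153, 54, 37], [166, 65, 40],
     [190, 90, 47], [175, 64, 39], [174, 71, 40], [159, 52, 37], [171, 76, 42],
     [183, 85, 43]]
Y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
     'female', 'male', 'male']


def fit_gaussian_nb(samples, labels):
    """Estimate a prior and per-feature mean/variance for each class."""
    params = {}
    for label in sorted(set(labels)):
        rows = [s for s, l in zip(samples, labels) if l == label]
        columns = list(zip(*rows))
        params[label] = {
            "prior": len(rows) / len(samples),
            "means": [mean(col) for col in columns],
            # a small constant keeps every variance strictly positive
            "variances": [pvariance(col) + 1e-9 for col in columns],
        }
    return params


def predict_gaussian_nb(params, sample):
    """Return the class with the highest log prior plus Gaussian log likelihood."""
    best_label, best_score = None, -math.inf
    for label, p in params.items():
        score = math.log(p["prior"])
        for value, mu, var in zip(sample, p["means"], p["variances"]):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


gnb_params = fit_gaussian_nb(X, Y)
print(predict_gaussian_nb(gnb_params, [190, 90, 47]))
print(predict_gaussian_nb(gnb_params, [160, 55, 37]))
```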


In [None]:
# Example 2
# ---
# Question: Use the titanic disaster dataset to create a Gaussian Naive Bayes classifier model 
# (i.e. from sklearn.naive_bayes import GaussianNB) that will make a prediction of survival 
# using passenger ticket fare information. 
# ---
# Dataset url: http://bit.ly/TitanicDataset 
# 
df1 = pd.read_csv("/content/tested.csv")
df1.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [None]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


In [None]:
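# Drop the row with a missing fare; .copy() gives us an independent DataFrame
# so that later column assignments do not raise SettingWithCopyWarning
df = df1[df1['Fare'].notna()].copy()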


In [None]:
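# Define a helper that fills missing values by sampling from the observed ones
import random

def replace_na(x):
    """Replace NaN values with values randomly selected from the Series."""
    x = x.copy()
    n_missing = int(x.isnull().sum())
    if n_missing:
        vc = x.value_counts()
        # Sample in proportion to how often each value occurs
        x[x.isnull()] = random.choices(list(vc.index), weights=list(vc.values), k=n_missing)
    return x

# Apply column by column and keep the result
df = df.apply(replace_na)
df.head()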

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,C53,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,B61,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,C106,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,D19,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,C23 C25 C27,S


In [None]:
# Keep only the target and the columns used as features
df = df.drop(columns=['Name', 'PassengerId', 'Cabin', 'Ticket', 'Fare', 'Pclass'])
# df = df.drop(columns=['Age'])
df.head()

Unnamed: 0,Survived,Sex,Age,SibSp,Parch,Embarked
0,0,male,34.5,0,0,Q
1,1,female,47.0,1,0,S
2,0,male,62.0,0,0,Q
3,0,male,27.0,0,0,S
4,1,female,22.0,1,1,S


In [None]:
#shape of the column
#df1['Ticket'].str.strip("SOTON/O.Q. ").str.strip("PC ").str.strip("A.5. ").str.strip("A/4 ")

In [None]:
#df1['Ticket'] = df1['Ticket'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))

In [None]:
#encoding male and female
df['Sex'] = df.Sex.map({'male': 0, 'female': 1})
df.head()

Unnamed: 0,Survived,Sex,Age,SibSp,Parch,Embarked
0,0,0,34.5,0,0,Q
1,1,1,47.0,1,0,S
2,0,0,62.0,0,0,Q
3,0,0,27.0,0,0,S
4,1,1,22.0,1,1,S


In [None]:
# One-hot encode the port of embarkation, keeping every row
df = pd.get_dummies(df, columns=["Embarked"])

In [None]:
df.columns

Index(['Survived', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked_C', 'Embarked_Q',
       'Embarked_S'],
      dtype='object')

In [None]:
# Training the Model
# ---
# Now that we have prepared our features, it is time to build our model. 
# We will start by splitting our data into training and test sets. The target column
# 'Survived' is left out of the features, otherwise the model would simply read the answer:
#
from sklearn.model_selection import train_test_split

features = df.drop(columns=['Survived'])
target = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.1, random_state=69)

In [None]:
# Then, all that we have to do is initialize the Naive Bayes Classifier and fit the data. 
# The question asks for a Gaussian Naive Bayes classifier, which also suits the continuous Age column:
# 
from sklearn.naive_bayes import GaussianNB

model = GaussianNB().fit(X_train, y_train)

In [None]:
# Once we have put together our classifier, we can evaluate its performance on the test set:
#
predicted = model.predict(X_test)

print(np.mean(predicted == y_test))


In [None]:
# Example 3
# ---
# Question: Create a GaussianNB classifier (i.e. from sklearn.naive_bayes import GaussianNB) 
# to identify the different species of iris flowers.
# ---
# Dataset url = http://bit.ly/MSIrisDatasetNB
# 
df2 = pd.read_csv('http://bit.ly/MSIrisDatasetNB')
df2.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [None]:
df2.shape

(150, 5)

In [None]:
df2['species'].value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64
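
One possible way to approach this challenge is sketched below. To stay self-contained it uses the copy of the iris data bundled with scikit-learn (`load_iris`) rather than the CSV above, on the assumption that both hold the same 150 samples. The split size, random seed and variable names are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Bundled iris data: 150 samples, 4 numeric features, 3 species
iris = load_iris()

# Stratify so that each species is represented in both parts
features_train, features_test, species_train, species_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

iris_model = GaussianNB().fit(features_train, species_train)
iris_accuracy = iris_model.score(features_test, species_test)

print(f"GaussianNB accuracy on the held-out iris samples: {iris_accuracy:.3f}")
```

GaussianNB is a natural fit here because all four features are continuous measurements.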