**Part-1: THEORY, AND PREDICTING TITANIC SURVIVAL USING NAIVE BAYES**

**Part-2: BUILD AN EMAIL SPAM DETECTOR**

**PART-1**
- If you toss a fair coin, the probability of getting heads is P(head) = 1/2.
- Pick a random card from a standard 52-card deck. What is the probability of getting a queen? There are 4 queens among 52 cards, so P(queen) = 4/52 = 1/13.
- Pick a random card and learn that it is a diamond. What is the probability that it is a queen? There are 13 diamonds and 1 of them is a queen, so P(queen | diamond) = 1/13.
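
The card probabilities in the list above can be checked by counting. The sketch below builds a 52-card deck and computes both values as exact fractions; the helper names (`probability`, `is_queen`, `is_diamond`) are made up for this illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def is_queen(card):
    return card[0] == "Q"


def is_diamond(card):
    return card[1] == "diamonds"


def probability(event, given=None):
    """P(event | given), computed by counting equally likely cards."""
    sample_space = [c for c in deck if given is None or given(c)]
    favourable = [c for c in sample_space if event(c)]
    return Fraction(len(favourable), len(sample_space))


p_queen = probability(is_queen)                            # 4/52
p_queen_given_diamond = probability(is_queen, is_diamond)  # 1/13
print(p_queen, p_queen_given_diamond)
```

Both values reduce to 1/13: knowing the suit does not change the chance of a queen, because rank and suit are independent in a full deck.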

![Conditional Probability](image-15.png)
![Bayes Theory](image-16.png)
![Our Problem](image-17.png)
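
For readers who cannot see the figures, the identities they presumably illustrate are written out below, with the card example worked through Bayes' theorem. The last line is the Naive Bayes form for this notebook's problem, where the x_i stand for the feature columns and the product reflects the "naive" assumption that features are independent given the class.

```latex
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
\qquad \text{(conditional probability)}
\]

\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\qquad \text{(Bayes' theorem)}
\]

\[
P(\text{queen} \mid \text{diamond})
  = \frac{P(\text{diamond} \mid \text{queen})\, P(\text{queen})}{P(\text{diamond})}
  = \frac{\tfrac{1}{4} \cdot \tfrac{1}{13}}{\tfrac{1}{4}}
  = \frac{1}{13}
\]

\[
P(\text{Survived} \mid x_1, \dots, x_n)
  \propto P(\text{Survived}) \prod_{i=1}^{n} P(x_i \mid \text{Survived})
\]
```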

In [16]:
import pandas as pd
df = pd.read_csv("../DataSets/titanic.csv")
df.head()

Unnamed: 0,PassengerId,Name,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,"Braund, Mr. Owen Harris",3,male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,"Heikkinen, Miss. Laina",3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,female,35.0,1,0,113803,53.1,C123,S,1
4,5,"Allen, Mr. William Henry",3,male,35.0,0,0,373450,8.05,,S,0


In [17]:
df.drop(columns=["PassengerId", "Name", "SibSp", "Parch", "Ticket", "Cabin", "Embarked"], inplace=True)
df.head()

Unnamed: 0,Pclass,Sex,Age,Fare,Survived
0,3,male,22.0,7.25,0
1,1,female,38.0,71.2833,1
2,3,female,26.0,7.925,1
3,1,female,35.0,53.1,1
4,3,male,35.0,8.05,0


In [18]:
target = df["Survived"]
inputs = df.drop("Survived", axis=1)

# one-hot encode the "Sex" column into numeric dummy columns
dummies = pd.get_dummies(inputs["Sex"])
dummies.head(3)

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0


In [19]:
inputs = pd.concat([inputs, dummies], axis=1)
inputs.head()

Unnamed: 0,Pclass,Sex,Age,Fare,female,male
0,3,male,22.0,7.25,0,1
1,1,female,38.0,71.2833,1,0
2,3,female,26.0,7.925,1,0
3,1,female,35.0,53.1,1,0
4,3,male,35.0,8.05,0,1


In [20]:
inputs.drop("Sex", axis=1, inplace=True)
inputs.head(3)

Unnamed: 0,Pclass,Age,Fare,female,male
0,3,22.0,7.25,0,1
1,1,38.0,71.2833,1,0
2,3,26.0,7.925,1,0


In [21]:
# find out which columns have NaN values
inputs.columns[inputs.isna().any()]

Index(['Age'], dtype='object')

In [22]:
# look at the first 10 values of the "Age" column
inputs["Age"][:10]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

In [23]:
# fill the NaN values with the mean of the column
inputs["Age"] = inputs["Age"].fillna(inputs["Age"].mean())
inputs.head(10)

Unnamed: 0,Pclass,Age,Fare,female,male
0,3,22.0,7.25,0,1
1,1,38.0,71.2833,1,0
2,3,26.0,7.925,1,0
3,1,35.0,53.1,1,0
4,3,35.0,8.05,0,1
5,3,29.699118,8.4583,0,1
6,1,54.0,51.8625,0,1
7,3,2.0,21.075,0,1
8,3,27.0,11.1333,1,0
9,2,14.0,30.0708,1,0
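
Mean imputation as done above can be sketched without pandas. The example uses only the ten `Age` values printed earlier, so its mean differs from the 29.699118 shown in row 5, which comes from all 891 rows.

```python
import math
from statistics import fmean

# the first ten Age values from the notebook output; index 5 is missing
ages = [22.0, 38.0, 26.0, 35.0, 35.0, math.nan, 54.0, 2.0, 27.0, 14.0]

# the mean is taken over the observed values only, as pandas' mean() does by default
observed = [a for a in ages if not math.isnan(a)]
mean_age = fmean(observed)

# replace every missing value with that mean
filled = [mean_age if math.isnan(a) else a for a in ages]
print(round(mean_age, 4), filled[5] == mean_age)
```

One design point to keep in mind: the notebook computes the mean before splitting the data, so the test rows contribute to it. Computing the mean on the training rows only would avoid that small leak.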


In [24]:
# split into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2)

len(inputs), len(X_train), len(X_test)

(891, 712, 179)
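
The call above does not pass `random_state`, so the split, and therefore the scores reported later, change from run to run; passing `random_state=<some integer>` to `train_test_split` fixes that. The sketch below shows the idea behind a seeded shuffle split in plain Python. `simple_train_test_split` is a hypothetical helper, not scikit-learn's implementation; rounding the test size up reproduces the sizes printed above.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then cut off a test portion."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # same seed -> same order
    n_test = math.ceil(len(rows) * test_size)  # test size rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(891))  # stand-in for the 891 passengers
train, test = simple_train_test_split(rows, test_size=0.2, seed=42)
print(len(rows), len(train), len(test))
```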

In [25]:
# Gaussian Naive Bayes (assumes each continuous feature is normally distributed within each class)
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

# train the model
model.fit(X_train, y_train)

GaussianNB()
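
To make the fitting step less of a black box, here is a minimal from-scratch sketch of Gaussian Naive Bayes: a prior, a mean and a variance per class and feature, combined through Bayes' theorem in log space. It is a simplification, not scikit-learn's code; the constant `smoothing` term and the six-row `[age, fare]` toy data are assumptions made for the example.

```python
import math
from statistics import fmean, pvariance


def fit_gaussian_nb(X, y, smoothing=1e-9):
    """Estimate (prior, means, variances) for every class label."""
    params = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        columns = list(zip(*rows))  # one tuple per feature
        means = [fmean(col) for col in columns]
        variances = [pvariance(col) + smoothing for col in columns]
        params[label] = (len(rows) / len(y), means, variances)
    return params


def log_gaussian(value, mean, variance):
    """Log of the normal density N(value; mean, variance)."""
    return -0.5 * math.log(2 * math.pi * variance) - (value - mean) ** 2 / (2 * variance)


def predict_proba(params, x):
    """Posterior P(class | x), assuming features are independent given the class."""
    log_joint = {}
    for label, (prior, means, variances) in params.items():
        log_joint[label] = math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, variances)
        )
    top = max(log_joint.values())  # subtract the max for numerical stability
    unnormalised = {label: math.exp(v - top) for label, v in log_joint.items()}
    total = sum(unnormalised.values())
    return {label: u / total for label, u in unnormalised.items()}


# toy data (made up): [age, fare] with label 1 = survived
X = [[22.0, 7.25], [35.0, 8.05], [40.0, 9.0], [38.0, 71.3], [35.0, 53.1], [26.0, 60.0]]
y = [0, 0, 0, 1, 1, 1]
params = fit_gaussian_nb(X, y)
proba = predict_proba(params, [30.0, 65.0])
print(proba)
```

In this toy data the high fare sits far from the class-0 fare distribution, so nearly all of the posterior mass goes to class 1.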

In [26]:
model.score(X_test, y_test)

0.8491620111731844

In [27]:
X_test[:10]

Unnamed: 0,Pclass,Age,Fare,female,male
631,3,51.0,7.0542,0,1
660,1,50.0,133.65,0,1
67,3,19.0,8.1583,0,1
606,3,30.0,7.8958,0,1
108,3,38.0,7.8958,0,1
371,3,18.0,6.4958,0,1
719,3,33.0,7.775,0,1
890,3,32.0,7.75,0,1
623,3,21.0,7.8542,0,1
97,1,23.0,63.3583,0,1


In [28]:
y_test[:10]

631    0
660    1
67     0
606    0
108    0
371    0
719    0
890    0
623    0
97     1
Name: Survived, dtype: int64

In [29]:
model.predict(X_test[:10])

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
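
Accuracy, which is what `model.score` reports for a classifier, is the fraction of predictions that match the true labels. A small sketch using the ten true labels and ten predictions printed above:

```python
# values copied from the y_test[:10] and model.predict(X_test[:10]) outputs
y_true = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# positions where the model was wrong
missed = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]
print(accuracy, missed)
```

The only miss in this sample is the last row (passenger index 97), a first-class male passenger who survived but was predicted not to.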

In [30]:
model.predict_proba(X_test[:10])

array([[0.98539958, 0.01460042],
       [0.09461066, 0.90538934],
       [0.98485289, 0.01514711],
       [0.9873922 , 0.0126078 ],
       [0.98766701, 0.01233299],
       [0.98436156, 0.01563844],
       [0.98762844, 0.01237156],
       [0.98756568, 0.01243432],
       [0.98552977, 0.01447023],
       [0.81108972, 0.18891028]])
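
Each row of `predict_proba` holds P(not survived) and P(survived) and sums to 1; `predict` returns the class with the larger probability. The sketch below checks both points against the numbers printed above.

```python
# rows copied from the predict_proba output above
probabilities = [
    [0.98539958, 0.01460042],
    [0.09461066, 0.90538934],
    [0.98485289, 0.01514711],
    [0.9873922, 0.0126078],
    [0.98766701, 0.01233299],
    [0.98436156, 0.01563844],
    [0.98762844, 0.01237156],
    [0.98756568, 0.01243432],
    [0.98552977, 0.01447023],
    [0.81108972, 0.18891028],
]

# index of the largest probability in each row = predicted class
predicted = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
row_sums = [sum(row) for row in probabilities]
print(predicted)
```

The last row shows why the model missed passenger 97: it assigns only about 0.19 to survival there, noticeably less certain than the other rows but still below 0.5.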

**PART-2 (Email spam classification)**

In [32]:
import pandas as pd
df = pd.read_csv("../DataSets/spam.csv")
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [33]:
df.groupby("Category").describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


In [34]:
# convert the "Category" column to numeric (spam = 1, ham = 0)
df["spam"] = df["Category"].apply(lambda x: 1 if x == "spam" else 0)
df.head()

Unnamed: 0,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [35]:
# split into training set and testing set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df["Message"], df["spam"], test_size=0.25)

len(df), len(X_train), len(X_test)

(5572, 4179, 1393)

In [36]:
# convert the "Message" column into word-count vectors with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
X_train_count.toarray()[:3]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
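
The printed rows are all zeros because each message uses only a handful of the thousands of words in the vocabulary. The idea behind the count matrix can be sketched in plain Python: learn a vocabulary from the training messages, then count how often each vocabulary word occurs in a message. The helpers below (`tokenize`, `fit_vocabulary`, `transform`) are illustrative names; the tokenizer mimics CountVectorizer's defaults of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_vocabulary(messages):
    """Map every distinct word to a column index (alphabetical order)."""
    words = sorted({w for m in messages for w in tokenize(m)})
    return {w: i for i, w in enumerate(words)}


def transform(messages, vocabulary):
    """One row of word counts per message; unseen words are ignored."""
    rows = []
    for message in messages:
        row = [0] * len(vocabulary)
        for word, n in Counter(tokenize(message)).items():
            if word in vocabulary:
                row[vocabulary[word]] = n
        rows.append(row)
    return rows


messages = ["Free entry free prize", "Ok lar see you"]  # toy training set
vocabulary = fit_vocabulary(messages)
matrix = transform(messages, vocabulary)
unseen = transform(["win free cash"], vocabulary)
print(vocabulary)
print(matrix, unseen)
```

This is also why the notebook calls `fit_transform` on the training messages but only `transform` on new ones: the vocabulary is learned once, and words outside it are dropped.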

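**What is the difference between the Gaussian, Bernoulli, Multinomial and the regular Naive Bayes algorithms?**

In short: Gaussian Naive Bayes suits continuous features, Multinomial Naive Bayes suits counts such as word frequencies, and Bernoulli Naive Bayes suits binary present/absent features. A longer discussion is here: https://www.quora.com/What-is-the-difference-between-the-the-Gaussian-Bernoulli-Multinomial-and-the-regular-Naive-Bayes-algorithms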

In [37]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count, y_train)

MultinomialNB()
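
As a rough picture of what the fitted model holds, the sketch below implements Multinomial Naive Bayes from scratch on a four-message toy corpus: per-class word frequencies with add-one (Laplace) smoothing, plus a class prior, scored in log space. The corpus, the whitespace tokenizer and the function names are assumptions for illustration; `alpha=1.0` matches the default smoothing of scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter


def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Return {label: (log_prior, {word: log P(word | label)})}."""
    vocabulary = sorted({w for d in docs for w in d.split()})
    model = {}
    for label in sorted(set(labels)):
        class_docs = [d for d, t in zip(docs, labels) if t == label]
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihood = {w: math.log((counts[w] + alpha) / total) for w in vocabulary}
        model[label] = (math.log(len(class_docs) / len(docs)), log_likelihood)
    return model


def predict(model, doc):
    """Pick the class with the highest log prior + summed word log-likelihoods."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood[w] for w in doc.split() if w in log_likelihood
        )
    return max(scores, key=scores.get)


# toy corpus (made up): 1 = spam, 0 = ham
docs = ["win free prize now", "free entry win cash", "see you at lunch", "call me after lunch"]
labels = [1, 1, 0, 0]
toy_model = fit_multinomial_nb(docs, labels)
print(predict(toy_model, "free cash prize"), predict(toy_model, "lunch after call"))
```

Smoothing matters here: without it, a word never seen in one class would give that class a probability of zero no matter what the other words say.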

In [38]:
emails = [
    "Hey mohan, can we together to watch footbal game tomorrow?",
    "Upto 20% discount on parking, exclusive offer just for you. Dont miss this reward!"
]

emails_count = v.transform(emails)
model.predict(emails_count)

array([0, 1], dtype=int64)

In [39]:
# measure the accuracy
X_test_count = v.transform(X_test)
model.score(X_test_count, y_test)

0.9820531227566404

In [40]:
# sklearn Pipeline: chain the vectorizer and the classifier into a single estimator
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB())
])

# now train
clf.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])

In [41]:
clf.score(X_test, y_test)

0.9820531227566404

In [42]:
clf.predict(emails)

array([0, 1], dtype=int64)