<a href="https://colab.research.google.com/github/JatinB22/DSlab/blob/main/DS_exp_6/Titanic_Phishing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Naive Bayes Classifier

Naive Bayes is a **probabilistic machine learning algorithm** used for **classification tasks**. It is based on **Bayes' Theorem** with a strong (naive) assumption of **feature independence**.

---

## What is Bayes' Theorem?

Bayes’ Theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

Where:
- $P(A|B)$ is the **posterior probability**
- $P(B|A)$ is the **likelihood**
- $P(A)$ is the **prior probability**
- $P(B)$ is the **evidence** (normalizing constant)
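
As a quick worked example, the formula can be evaluated by hand. The numbers below are made up for illustration (a hypothetical spam filter looking at a single word) and do not come from the datasets in this notebook.

```python
# Hypothetical numbers chosen only for illustration.
p_spam = 0.2              # P(A): prior probability that an email is spam
p_word_given_spam = 0.6   # P(B|A): likelihood of seeing the word in spam
p_word_given_ham = 0.05   # P(B|not A): likelihood of the word in legitimate mail

# P(B): total probability of seeing the word (the evidence)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(A|B): posterior probability of spam given that the word was seen
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | word) = {p_spam_given_word:.3f}")  # 0.12 / 0.16 = 0.750
```

Even though only 20% of emails are spam in this toy setup, observing the word raises the probability of spam to 75%.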

---

## Why "Naive"?

The "naive" part comes from the assumption that all features (predictors) are **independent** of each other given the class label.

In real-world data, this assumption is rarely true, but Naive Bayes still performs well in many practical situations.
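
Under this assumption the posterior factorizes as $P(C \mid x_1, \dots, x_n) \propto P(C) \prod_i P(x_i \mid C)$. The sketch below spells that out for two binary features. The priors, likelihoods, and feature names are hypothetical values picked for illustration, not estimates from any dataset.

```python
# Hypothetical class priors and per-feature likelihoods P(feature = 1 | class).
priors = {"phishing": 0.5, "legit": 0.5}
likelihoods = {
    "phishing": {"has_link": 0.9, "urgent_words": 0.8},
    "legit": {"has_link": 0.2, "urgent_words": 0.1},
}

def naive_posterior(observed):
    """Return P(class | observed), assuming features are independent given the class."""
    scores = {}
    for cls, prior in priors.items():
        # Naive assumption: the joint likelihood is a product of per-feature terms.
        joint = prior
        for feature, value in observed.items():
            p = likelihoods[cls][feature]
            joint *= p if value else (1 - p)
        scores[cls] = joint
    total = sum(scores.values())  # the evidence, used as normalizing constant
    return {cls: score / total for cls, score in scores.items()}

posterior = naive_posterior({"has_link": 1, "urgent_words": 1})
print(posterior)
```

Here the phishing score is 0.5 × 0.9 × 0.8 = 0.36 and the legit score is 0.5 × 0.2 × 0.1 = 0.01, so the normalized phishing posterior is 0.36 / 0.37, about 0.97.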

---

## Types of Naive Bayes Classifiers

1. **Gaussian Naive Bayes** – Assumes features follow a normal distribution (used for continuous data).
2. **Multinomial Naive Bayes** – Works with discrete counts (commonly used in text classification).
3. **Bernoulli Naive Bayes** – Works with binary/boolean features.
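
In scikit-learn these three variants correspond to `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. The sketch below fits each one on the kind of feature it expects. The data is synthetic and generated only for this illustration, so the accuracies say nothing about real problems.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: class 1 is shifted by +2 on every feature.
X_cont = rng.normal(size=(100, 3)) + 2.0 * y[:, None]

# Count features (think word counts): each class favours a different column.
rates = np.where(y[:, None] == 0, [4.0, 1.0, 1.0], [1.0, 1.0, 4.0])
X_counts = rng.poisson(rates)

# Binary features: whether each count exceeds 2.
X_bin = (X_counts > 2).astype(int)

scores = {}
for name, clf, X in [
    ("GaussianNB", GaussianNB(), X_cont),
    ("MultinomialNB", MultinomialNB(), X_counts),
    ("BernoulliNB", BernoulliNB(), X_bin),
]:
    scores[name] = clf.fit(X, y).score(X, y)
    print(f"{name}: training accuracy = {scores[name]:.2f}")
```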

---

## Advantages of Naive Bayes

- Simple and fast to train
- Works well with high-dimensional data
- Performs well with text classification (e.g., spam detection)
- Often needs less training data than more complex models, because it estimates only a few parameters per feature

---

## Limitations

- Assumes feature independence, which may not hold in reality
- Struggles with highly correlated features
- Zero-frequency problem: if a feature value never appears with a class in the training data, its estimated likelihood is zero, which forces the whole posterior for that class to zero
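
A common remedy for the zero-frequency problem is additive (Laplace) smoothing, which adds a small constant to every count. The sketch below shows the arithmetic on a hypothetical categorical feature. The domain names and the `likelihood` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical training values of one categorical feature within one class.
observed = ["gmail.com", "gmail.com", "yahoo.com"]
vocabulary = ["gmail.com", "yahoo.com", "outlook.com"]  # "outlook.com" was never seen

counts = Counter(observed)

def likelihood(value, alpha):
    """P(value | class) with additive smoothing; alpha=0 means no smoothing."""
    return (counts[value] + alpha) / (len(observed) + alpha * len(vocabulary))

print(likelihood("outlook.com", alpha=0))  # 0 / 3: unseen category gets probability 0
print(likelihood("outlook.com", alpha=1))  # 1 / 6: small but non-zero
```

scikit-learn's `MultinomialNB` and `BernoulliNB` expose the same idea through their `alpha` parameter, which defaults to 1.0.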

---

## Applications

- Email spam detection
- Sentiment analysis
- Medical diagnosis

---

## Summary

Naive Bayes is a **fast**, **simple**, yet **powerful** classifier that leverages probability theory. Despite its naive assumptions, it often performs surprisingly well on real-world data, especially in NLP and text classification tasks.



<h2 style='color:purple' align='center'>Naive Bayes Tutorial Part 1: Predicting survival on the Titanic</h2>

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("/content/titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns',inplace=True)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare
0,0,3,male,22.0,7.25
1,1,1,female,38.0,71.2833
2,1,3,female,26.0,7.925
3,1,1,female,35.0,53.1
4,0,3,male,35.0,8.05


In [None]:
inputs = df.drop('Survived',axis='columns')
target = df.Survived

In [None]:
#inputs.Sex = inputs.Sex.map({'male': 1, 'female': 2})

In [None]:
dummies = pd.get_dummies(inputs.Sex)
dummies.head(3)

Unnamed: 0,female,male
0,False,True
1,True,False
2,True,False


In [None]:
inputs = pd.concat([inputs,dummies],axis='columns')
inputs.head(3)

Unnamed: 0,Pclass,Sex,Age,Fare,female,male
0,3,male,22.0,7.25,False,True
1,1,female,38.0,71.2833,True,False
2,3,female,26.0,7.925,True,False


**I am also dropping the `male` column because of the dummy variable trap. One column is enough to represent male vs. female.**

In [None]:
inputs.drop(['Sex','male'],axis='columns',inplace=True)
inputs.head(3)

Unnamed: 0,Pclass,Age,Fare,female
0,3,22.0,7.25,False
1,1,38.0,71.2833,True
2,3,26.0,7.925,True
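
The encode-then-drop steps can also be done in one call with the `drop_first` option of `pd.get_dummies`. This is a minimal sketch on a three-row stand-in frame, not the Titanic data. Note that `drop_first=True` drops the first category in sorted order, so it keeps a `Sex_male` column rather than `female`; the information is the same, only the coding is flipped.

```python
import pandas as pd

# Small stand-in frame; the column name mirrors the notebook's `Sex` column.
demo = pd.DataFrame({"Sex": ["male", "female", "female"]})

# drop_first=True keeps k-1 indicator columns for a k-level category.
encoded = pd.get_dummies(demo, columns=["Sex"], drop_first=True)
print(encoded)
```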


In [None]:
inputs.columns[inputs.isna().any()]

Index(['Age'], dtype='object')

In [None]:
inputs.Age[:10]

Unnamed: 0,Age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0
5,
6,54.0
7,2.0
8,27.0
9,14.0


In [None]:
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()

Unnamed: 0,Pclass,Age,Fare,female
0,3,22.0,7.25,False
1,1,38.0,71.2833,True
2,3,26.0,7.925,True
3,1,35.0,53.1,True
4,3,35.0,8.05,False


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs,target,test_size=0.3)

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

In [None]:
model.fit(X_train,y_train)

In [None]:
model.score(X_test,y_test)

0.7761194029850746

In [None]:
X_test[0:10]

Unnamed: 0,Pclass,Age,Fare,female
470,3,29.699118,7.25,False
530,2,2.0,26.0,True
344,2,36.0,13.0,False
790,3,29.699118,7.75,False
218,1,32.0,76.2917,True
300,3,29.699118,7.75,True
371,3,18.0,6.4958,False
57,3,28.5,7.2292,False
621,1,42.0,52.5542,False
755,2,0.67,14.5,False


In [None]:
y_test[0:10]

Unnamed: 0,Survived
470,0
530,1
344,0
790,0
218,1
300,1
371,0
57,0
621,1
755,1


In [None]:
model.predict(X_test[0:10])

array([0, 1, 0, 0, 1, 1, 0, 0, 0, 0])

In [None]:
model.predict_proba(X_test[:10])

array([[0.96734359, 0.03265641],
       [0.14849188, 0.85150812],
       [0.93375588, 0.06624412],
       [0.96743288, 0.03256712],
       [0.02793848, 0.97206152],
       [0.439123  , 0.560877  ],
       [0.9609544 , 0.0390456 ],
       [0.96698967, 0.03301033],
       [0.70841336, 0.29158664],
       [0.86646043, 0.13353957]])
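
Each row of `predict_proba` holds one probability per class, in the order given by the model's `classes_` attribute, and `predict` returns the class with the largest probability. The sketch below checks that relationship on synthetic two-class data, which stands in for the Titanic features so that the block runs on its own.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data standing in for the Titanic features.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo[:5])

# Column order of predict_proba follows clf.classes_.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print(clf.classes_)
print(labels_from_proba)
print(clf.predict(X_demo[:5]))
```

Read this way, the ninth row of the output above, `[0.708, 0.292]`, explains why that passenger was predicted as 0 even though the true label in `y_test` is 1.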

**Calculate the score using cross validation**

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(GaussianNB(),X_train, y_train, cv=5)

array([0.856     , 0.752     , 0.768     , 0.73387097, 0.79032258])
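
The five fold scores are easier to compare with the single hold-out score when summarized by their mean and spread. A minimal sketch, using the fold accuracies copied from the output above:

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above.
scores = np.array([0.856, 0.752, 0.768, 0.73387097, 0.79032258])

print(f"mean accuracy: {scores.mean():.3f}")
print(f"std across folds: {scores.std():.3f}")
```

The mean comes out at about 0.780, close to the hold-out score of 0.776 reported earlier.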

## Problem Statement: Email Phishing Detection

Goal: Predict whether an email is phishing (fraudulent) or legitimate based on its content and metadata.

Dataset features might include:
<ol>
<li>email_length → number of characters in the email</li>
<li>num_links → number of hyperlinks in the email</li>
<li>num_special_chars → number of suspicious characters (e.g., $, %, @)</li>
<li>contains_login_request → whether the email asks for login details (Yes/No)</li>
<li>sender_domain → domain of the sender (categorical)</li>
</ol>

Target: is_phishing → 1 for phishing, 0 for legitimate

| Email\_ID | Subject                                      | Contains\_Link | Contains\_Attachment | Urgent\_Words | From\_Trusted\_Domain | Label    |
| --------- | -------------------------------------------- | -------------- | -------------------- | ------------- | --------------------- | -------- |
| 1         | "Urgent: Verify your bank account"           | Yes            | No                   | Yes           | No                    | Phishing |
| 2         | "Meeting schedule for next week"             | No             | Yes                  | No            | Yes                   | Legit    |
| 3         | "Claim your lottery prize now"               | Yes            | No                   | Yes           | No                    | Phishing |
| 4         | "Invoice attached for your recent purchase"  | No             | Yes                  | No            | Yes                   | Legit    |
| 5         | "Security alert: Unusual login detected"     | Yes            | No                   | Yes           | No                    | Phishing |
| 6         | "Team lunch invitation"                      | No             | No                   | No            | Yes                   | Legit    |
| 7         | "Update your payment information"            | Yes            | Yes                  | Yes           | No                    | Phishing |
| 8         | "Your order has been shipped"                | No             | No                   | No            | Yes                   | Legit    |
| 9         | "Reset your password immediately"            | Yes            | No                   | Yes           | No                    | Phishing |
| 10        | "Happy Birthday from all of us!"             | No             | No                   | No            | Yes                   | Legit    |
| 11        | "Verify your email to avoid suspension"      | Yes            | No                   | Yes           | No                    | Phishing |
| 12        | "Monthly performance report attached"        | No             | Yes                  | No            | Yes                   | Legit    |
| 13        | "Get rich quick investment opportunity"      | Yes            | No                   | Yes           | No                    | Phishing |
| 14        | "Reminder: Project deadline next Monday"     | No             | No                   | No            | Yes                   | Legit    |
| 15        | "Your account will be closed unless you act" | Yes            | No                   | Yes           | No                    | Phishing |


<h2>Sample Prediction Input</h2>

After fitting your Naive Bayes model, predict the label for the following email:

| Subject                            | Contains\_Link | Contains\_Attachment | Urgent\_Words | From\_Trusted\_Domain |
| ---------------------------------- | -------------- | -------------------- | ------------- | --------------------- |
| "Update your password to continue" | 1              | 0                    | 1             | 0                     |


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

In [None]:
data = {
    'Email_ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    'Subject': [
        "Urgent: Verify your bank account", "Meeting schedule for next week",
        "Claim your lottery prize now", "Invoice attached for your recent purchase",
        "Security alert: Unusual login detected", "Team lunch invitation",
        "Update your payment information", "Your order has been shipped",
        "Reset your password immediately", "Happy Birthday from all of us!",
        "Verify your email to avoid suspension", "Monthly performance report attached",
        "Get rich quick investment opportunity", "Reminder: Project deadline next Monday",
        "Your account will be closed unless you act"
    ],
    'Contains_Link': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    'Contains_Attachment': ['No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'No'],
    'Urgent_Words': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    'From_Trusted_Domain': ['No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
    'Label': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}

In [None]:
df = pd.DataFrame(data)
df.head()

Unnamed: 0,Email_ID,Subject,Contains_Link,Contains_Attachment,Urgent_Words,From_Trusted_Domain,Label
0,1,Urgent: Verify your bank account,Yes,No,Yes,No,1
1,2,Meeting schedule for next week,No,Yes,No,Yes,0
2,3,Claim your lottery prize now,Yes,No,Yes,No,1
3,4,Invoice attached for your recent purchase,No,Yes,No,Yes,0
4,5,Security alert: Unusual login detected,Yes,No,Yes,No,1


In [None]:
df = df.drop(columns=['Email_ID'])
df.head()

Unnamed: 0,Subject,Contains_Link,Contains_Attachment,Urgent_Words,From_Trusted_Domain,Label
0,Urgent: Verify your bank account,Yes,No,Yes,No,1
1,Meeting schedule for next week,No,Yes,No,Yes,0
2,Claim your lottery prize now,Yes,No,Yes,No,1
3,Invoice attached for your recent purchase,No,Yes,No,Yes,0
4,Security alert: Unusual login detected,Yes,No,Yes,No,1


In [None]:
binary_cols = ['Contains_Link', 'Contains_Attachment', 'Urgent_Words', 'From_Trusted_Domain']
# Encode each Yes/No column as 1/0
for col in binary_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0})
df.head()

Unnamed: 0,Subject,Contains_Link,Contains_Attachment,Urgent_Words,From_Trusted_Domain,Label
0,Urgent: Verify your bank account,1,0,1,0,1
1,Meeting schedule for next week,0,1,0,1,0
2,Claim your lottery prize now,1,0,1,0,1
3,Invoice attached for your recent purchase,0,1,0,1,0
4,Security alert: Unusual login detected,1,0,1,0,1


In [None]:
X = df.drop(columns=['Subject', 'Label'])
y = df['Label']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(f"Training features shape: {X_train.shape}")
print(f"Test features shape: {X_test.shape}")

Training features shape: (12, 4)
Test features shape: (3, 4)


In [None]:
model = GaussianNB()

model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 1])

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 100.00%


In [None]:
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         2
           1       1.00      1.00      1.00         1

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3



In [None]:
print("Confusion Matrix: \n")
print(confusion_matrix(y_test, y_pred))

Confusion Matrix: 

[[2 0]
 [0 1]]
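
In scikit-learn's confusion matrix, rows are true labels and columns are predicted labels. For a binary problem the four cells can be unpacked by name, as in the sketch below. The label lists are written out by hand to match the three-sample test output above, where every prediction was correct.

```python
from sklearn.metrics import confusion_matrix

# Labels consistent with the 3-sample test output above (all predictions correct).
y_true = [0, 0, 1]
y_hat = [0, 0, 1]

# For binary labels [0, 1], ravel() yields the cells in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

With only three test emails, a perfect score here is weak evidence of how the model would do on unseen mail.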


In [None]:
new_email_data = {
    'Contains_Link': [1],
    'Contains_Attachment': [0],
    'Urgent_Words': [1],
    'From_Trusted_Domain': [0]
}

new_email_df = pd.DataFrame(new_email_data)
new_email_df.head()

Unnamed: 0,Contains_Link,Contains_Attachment,Urgent_Words,From_Trusted_Domain
0,1,0,1,0


In [None]:
prediction = model.predict(new_email_df)

if prediction[0] == 1:
    print("The email is predicted to be Phishing.")
else:
    print("The email is predicted to be Legit.")

The email is predicted to be Phishing.
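
Since all four features here are binary, `BernoulliNB` is a natural alternative to `GaussianNB`. The sketch below is an assumption-laden variation, not the notebook's method: it rebuilds the encoded features from the table above, fits on all 15 rows without a train/test split, and scores the same sample email.

```python
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# The four binary features from the table above, already encoded as 1/0.
X = pd.DataFrame({
    "Contains_Link":       [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "Contains_Attachment": [0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    "Urgent_Words":        [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "From_Trusted_Domain": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
y = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1], name="Label")

# Default alpha=1.0 applies Laplace smoothing to the per-class feature counts.
bnb = BernoulliNB()
bnb.fit(X, y)

new_email = pd.DataFrame({
    "Contains_Link": [1],
    "Contains_Attachment": [0],
    "Urgent_Words": [1],
    "From_Trusted_Domain": [0],
})
pred = bnb.predict(new_email)
print(pred)
print(bnb.predict_proba(new_email))
```

On this toy data the Bernoulli model also labels the sample email as phishing (class 1).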
