# 6) Data Analytics III
# 1. Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv dataset.
# 2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on
# the given dataset.

import pandas as pd

# URL of the dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(url)

# Save the dataset locally
df.to_csv("iris.csv", index=False)

print("Dataset downloaded and saved as 'iris.csv'.")

# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Load the Iris dataset
df = pd.read_csv("iris.csv")  # Ensure 'iris.csv' is in the same directory

# Display the first few rows of the dataset
print("Dataset Overview:")
print(df.head())

# Prepare the features (X) and the target variable (y)
X = df.drop(columns=["species"])  # Features: Sepal & Petal measurements
y = df["species"]                # Target: Species

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Naïve Bayes classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Make predictions on the test dataset
y_pred = nb_classifier.predict(X_test)

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred, labels=y.unique())
print("\nConfusion Matrix:")
print(conf_matrix)

# Calculate True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN)
tp = conf_matrix.diagonal()  # True Positives: Diagonal elements
fp = conf_matrix.sum(axis=0) - tp  # False Positives: Column sum - TP
fn = conf_matrix.sum(axis=1) - tp  # False Negatives: Row sum - TP
tn = conf_matrix.sum() - (tp + fp + fn)  # True Negatives: Total sum - (TP + FP + FN)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
error_rate = 1 - accuracy
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')

# Display the computed metrics
print("\nMetrics:")
print(f"True Positives (TP): {tp}")
print(f"False Positives (FP): {fp}")
print(f"True Negatives (TN): {tn}")
print(f"False Negatives (FN): {fn}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Error Rate: {error_rate:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")


Dataset downloaded and saved as 'iris.csv'.
Dataset Overview:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Metrics:
True Positives (TP): [10  9 11]
False Positives (FP): [0 0 0]
True Negatives (TN): [20 21 19]
False Negatives (FN): [0 0 0]
Accuracy: 1.00
Error Rate: 0.00
Precision: 1.00
Recall: 1.00


📚 THEORY (In Short)
1. Naïve Bayes Classifier
A simple probabilistic classifier based on Bayes' Theorem.

It assumes that the features are conditionally independent of each other given the class (the "naïve" assumption).

Formula:

$$P(\text{Class} \mid \text{Features}) = \frac{P(\text{Features} \mid \text{Class}) \times P(\text{Class})}{P(\text{Features})}$$

In simple words, it predicts the class that has the highest probability given the input features.
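
As a small worked illustration of the formula above, the sketch below applies Bayes' theorem to three classes. The prior and likelihood values are made-up numbers chosen only to show the arithmetic; they are not estimated from iris.csv.

```python
# Hypothetical values, chosen only to illustrate the arithmetic.
priors = {"setosa": 1 / 3, "versicolor": 1 / 3, "virginica": 1 / 3}    # P(Class)
likelihoods = {"setosa": 0.02, "versicolor": 0.30, "virginica": 0.18}  # P(Features | Class)

# Numerator of Bayes' theorem for each class: P(Features | Class) * P(Class)
joint = {c: likelihoods[c] * priors[c] for c in priors}

# Denominator P(Features): the sum of the numerators over all classes
evidence = sum(joint.values())

# Posterior P(Class | Features) for each class
posteriors = {c: joint[c] / evidence for c in joint}

# The prediction is the class with the highest posterior
predicted = max(posteriors, key=posteriors.get)
print(posteriors)
print("Predicted class:", predicted)
```

Because P(Features) is the same for every class, dividing by it does not change which class wins; it only rescales the scores so that they sum to 1.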

2. Gaussian Naïve Bayes
It is called Gaussian because it assumes that, within each class, every feature follows a normal (Gaussian) distribution.

Suitable for continuous features like sepal length, petal length, etc.
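
To make the Gaussian assumption concrete, here is a from-scratch sketch with a single feature. It is not scikit-learn's implementation; the class names ("short", "long"), the training values, and the helper names gaussian_pdf and predict are all assumptions made up for illustration.

```python
import math
from statistics import mean, pvariance

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Made-up petal lengths (cm) for two hypothetical classes; not read from iris.csv.
train = {
    "short": [1.3, 1.4, 1.5, 1.6],
    "long": [4.4, 4.7, 5.0, 5.1],
}

# Estimate a mean, a variance, and a prior for each class from its training values.
n_total = sum(len(values) for values in train.values())
params = {
    c: (mean(values), pvariance(values), len(values) / n_total)
    for c, values in train.items()
}

def predict(x):
    # Score = Gaussian likelihood * prior. The shared denominator P(x) is skipped
    # because it does not affect which class has the highest score.
    scores = {
        c: gaussian_pdf(x, mu, var) * prior
        for c, (mu, var, prior) in params.items()
    }
    return max(scores, key=scores.get)

print(predict(1.7), predict(4.0))
```

With several features, the naïve independence assumption means the per-feature densities are simply multiplied together for each class.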

3. Confusion Matrix
It shows how well the model performed:


| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

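TP = Positive samples correctly predicted as positive.

TN = Negative samples correctly predicted as negative.

FP = Negative samples wrongly predicted as positive.

FN = Positive samples wrongly predicted as negative.

For a multi-class problem such as Iris, these counts are computed separately for each class by treating that class as positive and all other classes as negative.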
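
The four cells can be counted directly from a list of actual and predicted labels. The sketch below uses a made-up binary example (1 = positive, 0 = negative); the label values are assumptions for illustration only.

```python
# Made-up labels for a binary problem: 1 = positive, 0 = negative.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(actual, predicted))

tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # actual positive, predicted positive
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # actual positive, predicted negative
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # actual negative, predicted positive
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # actual negative, predicted negative

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```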

4. Metrics Formulas
Accuracy = $\frac{TP + TN}{TP + FP + FN + TN}$


Error Rate = $1 - \text{Accuracy}$

Precision = $\frac{TP}{TP + FP}$

(How many selected are relevant?)

Recall = $\frac{TP}{TP + FN}$

(How many relevant were selected?)
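
The four formulas above translate directly into code. The following is a minimal sketch; the function name metrics and the example counts are hypothetical, and returning 0.0 when a denominator is zero is just one possible convention.

```python
def metrics(tp, fp, fn, tn):
    """Compute accuracy, error rate, precision, and recall from the four counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (an assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
    }

# Made-up counts for illustration.
result = metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```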

🧠 CODE EXPLANATION (Line by Line)
Step 1: Import Libraries
```python
import pandas as pd
```
Import Pandas for data manipulation (handling CSV files, dataframes).

Step 2: Download and Save Dataset
```python
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)
df.to_csv("iris.csv", index=False)
print("Dataset downloaded and saved as 'iris.csv'.")
```
Download Iris dataset from URL.

Save it locally as iris.csv.

Print confirmation message.

Step 3: More Imports
```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
```
Import:

train_test_split: To split data into train/test.

GaussianNB: For Naïve Bayes model.

Metrics: For evaluating performance (confusion matrix, accuracy, precision, recall).

Step 4: Load Dataset
```python
df = pd.read_csv("iris.csv")
print("Dataset Overview:")
print(df.head())
```
Load the saved iris.csv.

Show the first 5 rows to understand the data.

Step 5: Prepare Features and Target
```python
X = df.drop(columns=["species"])
y = df["species"]
```
X = Input features (sepal and petal measurements).

y = Target label (species name).

Step 6: Split Data into Train and Test
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
80% of the data (120 rows) is used for training and 20% (30 rows) for testing.

random_state=42 ensures same split every time (for reproducibility).
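
The split above is purely random, which is why the test set in the output contains 10, 9, and 11 samples of the three species. As an optional variation (not used in the original code), the sketch below passes stratify=y so that each class keeps the same proportion in both sets. It loads scikit-learn's bundled copy of Iris through load_iris instead of iris.csv, so the labels are the integers 0 to 2 rather than species names.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled copy of the Iris data: 150 rows, 4 features, labels 0/1/2.
X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions equal in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
print("Test class counts:", Counter(y_test.tolist()))
```

Stratifying is mostly useful for small or imbalanced datasets, where a purely random split can leave a class under-represented in the test set.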

Step 7: Train the Model
```python
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
```
Create a Gaussian Naïve Bayes object.

Train (fit) it on training data.

Step 8: Make Predictions
```python
y_pred = nb_classifier.predict(X_test)
```
Predict the species on the test data.

Step 9: Confusion Matrix
```python
conf_matrix = confusion_matrix(y_test, y_pred, labels=y.unique())
print("\nConfusion Matrix:")
print(conf_matrix)
```
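Generate a confusion matrix comparing the actual labels (y_test) with the predicted labels (y_pred). Rows correspond to actual classes and columns to predicted classes, in the order given by labels=y.unique() (setosa, versicolor, virginica).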

Print the confusion matrix.

Step 10: Calculate TP, FP, TN, FN
```python
tp = conf_matrix.diagonal()
fp = conf_matrix.sum(axis=0) - tp
fn = conf_matrix.sum(axis=1) - tp
tn = conf_matrix.sum() - (tp + fp + fn)
```
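TP = Diagonal elements (one value per class).

FP = Column sums - TP (samples predicted as the class that actually belong to another class).

FN = Row sums - TP (samples of the class that were predicted as another class).

TN = Total samples - (TP + FP + FN).

Each result is an array with one entry per class, treating that class as positive and the other two as negative.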
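
Because the model classified every test sample correctly, the output above shows zeros for FP and FN. To see the Step 10 formulas do some work, the sketch below applies them to a hypothetical confusion matrix that contains a few mistakes; the numbers are made up and do not come from the model.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted).
conf_matrix = np.array([
    [10, 0, 0],
    [0, 8, 1],
    [0, 2, 9],
])

tp = conf_matrix.diagonal()              # correct predictions per class
fp = conf_matrix.sum(axis=0) - tp        # column sum minus TP
fn = conf_matrix.sum(axis=1) - tp        # row sum minus TP
tn = conf_matrix.sum() - (tp + fp + fn)  # everything else

print("TP:", tp)
print("FP:", fp)
print("FN:", fn)
print("TN:", tn)
```

For the second class, for example, FP is 2 because two samples of the third class were predicted as the second, and FN is 1 because one sample of the second class was predicted as the third.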

Step 11: Calculate Metrics
```python
accuracy = accuracy_score(y_test, y_pred)
error_rate = 1 - accuracy
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
```
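Accuracy: Fraction of all test predictions that were correct.

Error Rate: Fraction of test predictions that were wrong (1 - accuracy).

Precision: Of the samples predicted as a class, the fraction that actually belong to it. With average='weighted', precision is computed per class and then averaged, weighting each class by its number of actual samples.

Recall: Of the samples that actually belong to a class, the fraction that were predicted correctly, averaged across classes in the same weighted way.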
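
To show what average='weighted' does, the sketch below reproduces the calculation by hand with NumPy on a hypothetical confusion matrix (made-up numbers, not the model's output). It assumes every class is predicted at least once, so no column sum is zero.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted).
conf_matrix = np.array([
    [10, 0, 0],
    [0, 8, 1],
    [0, 2, 9],
])

tp = conf_matrix.diagonal()
support = conf_matrix.sum(axis=1)          # number of actual samples per class
predicted_count = conf_matrix.sum(axis=0)  # number of predictions per class

per_class_precision = tp / predicted_count  # TP / (TP + FP)
per_class_recall = tp / support             # TP / (TP + FN)

# Weighted average: each class contributes in proportion to its support.
weights = support / support.sum()
weighted_precision = float(np.sum(weights * per_class_precision))
weighted_recall = float(np.sum(weights * per_class_recall))
accuracy = float(tp.sum() / conf_matrix.sum())

print(f"Weighted precision: {weighted_precision:.4f}")
print(f"Weighted recall:    {weighted_recall:.4f}")
print(f"Accuracy:           {accuracy:.4f}")
```

Note that support-weighted recall always equals accuracy, since the class sizes cancel and only the total number of correct predictions remains.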

Step 12: Print Results
```python
print("\nMetrics:")
print(f"True Positives (TP): {tp}")
print(f"False Positives (FP): {fp}")
print(f"True Negatives (TN): {tn}")
print(f"False Negatives (FN): {fn}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Error Rate: {error_rate:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
```
Display all calculated values neatly.

🎯 FINAL RESULT
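By running this code:

You trained a Gaussian Naïve Bayes model on the Iris data.

You evaluated the model on the 30-sample test set using a confusion matrix and computed:

TP, FP, TN, FN, Accuracy, Error Rate, Precision, Recall.

All 30 test samples were classified correctly for this particular split, which is why accuracy, precision, and recall are all 1.00 and the error rate is 0.00.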