# ML Assignment 3: Naive Bayes, Decision Trees, and Ensemble Learning

## Task 1: Theory Questions
1. What is the core assumption of Naive Bayes?
    Naive Bayes assumes that all features (input variables) are conditionally independent of one another given the class label. Within a class, the value of one feature therefore carries no information about the value of another, so the class-conditional likelihood factorizes into a product of per-feature terms that can be plugged into Bayes’ Theorem.

2. Differentiate between GaussianNB, MultinomialNB, and BernoulliNB.

    GaussianNB assumes each feature is normally (Gaussian) distributed within each class, which makes it a natural choice for continuous data.

    MultinomialNB is suitable for discrete data like word counts or frequencies, commonly used in text classification.

    BernoulliNB works with binary/boolean features, assuming features are either 0 or 1 (e.g., word presence or absence).

3. Why is Naive Bayes considered suitable for high-dimensional data?
    Naive Bayes performs well with high-dimensional data because the independence assumption means only per-feature, per-class parameters have to be estimated, with no need to model interactions between features. The number of parameters and the training cost therefore grow linearly with the number of features, so training stays fast even when the feature count is very large, as in text classification.
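
To make the distinction in question 2 concrete, here is a minimal sketch that fits each variant on the kind of features it expects. The toy arrays and the names `y_toy`, `X_cont`, `X_counts`, and `X_binary` are invented for illustration and are not part of the assignment's datasets.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Toy labels, made up for illustration: 0 = ham, 1 = spam
y_toy = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.0, 2.1], [0.9, 1.9], [3.0, 4.2], [3.2, 3.9]])

# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])

# Word presence/absence (0 or 1) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

toy_models = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

# Fit each variant on its matching feature type and predict on the same rows
toy_preds = {}
for name, (clf, X_toy) in toy_models.items():
    clf.fit(X_toy, y_toy)
    toy_preds[name] = clf.predict(X_toy)
    print(name, "training predictions:", toy_preds[name])
```

The three estimators share the same `fit`/`predict` interface; what changes is the per-feature likelihood each one assumes, which is why the input representation should match the variant.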

## Task 2: Spam Detection using MultinomialNB

In [None]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Load SMS Spam Collection Dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

# Encode labels
df['label_num'] = df.label.map({'ham':0, 'spam':1})

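# Train/test split on the raw text, so the test messages stay unseen
msg_train, msg_test, y_train, y_test = train_test_split(
    df['message'], df['label_num'], test_size=0.2, random_state=42)

# Vectorization: learn the vocabulary from the training messages only
vec = CountVectorizer()
X_train = vec.fit_transform(msg_train)
X_test = vec.transform(msg_test)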

# Train MultinomialNB
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


## Task 3: GaussianNB on Iris Dataset

In [None]:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
gnb_pred = gnb.predict(X_test)

lr = LogisticRegression(max_iter=200)
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

print("GaussianNB Accuracy:", accuracy_score(y_test, gnb_pred))
print("Logistic Regression Accuracy:", accuracy_score(y_test, lr_pred))


## Task 4: Conceptual Questions
1. What is entropy and information gain?

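    Entropy measures the impurity or uncertainty of the class labels in a dataset: it is 0 when every sample belongs to the same class and highest when the classes are evenly mixed.

    Information Gain is the reduction in entropy after a dataset is split on an attribute. A decision tree uses it to choose the feature that gives the best split.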

2. Explain the difference between Gini Index and Entropy.
    Both measure the impurity of a node, but:

    Gini Index is the probability of misclassifying a randomly chosen sample if it were labeled according to the node's class distribution. It needs no logarithms, so it is slightly cheaper to compute.

    Entropy comes from information theory and uses logarithms. In practice the two criteria usually produce very similar trees.

3. How can a decision tree overfit? How can this be avoided?
    A decision tree can overfit by growing too deep and capturing noise in the training data.
    This can be avoided by pruning, limiting the maximum depth, requiring a minimum number of samples per leaf, or using ensemble methods such as Random Forest.
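
The impurity measures from questions 1 and 2 can be worked through by hand on a small example. The sketch below is illustrative only: the helper names `entropy`, `gini`, and `information_gain` and the label arrays are assumptions made for this example, not functions from scikit-learn.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy, in bits, of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up labels: a perfectly mixed parent node and one candidate split of it
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left = np.array([1, 1, 1, 0])
right = np.array([1, 0, 0, 0])

print("Parent entropy:", entropy(parent))
print("Parent Gini:", gini(parent))
print("Information gain of the split:", round(information_gain(parent, left, right), 4))
```

A tree learner would evaluate a quantity like this for every candidate split and keep the one with the largest gain (or, with the Gini criterion, the largest drop in Gini impurity).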

## Task 5: Decision Tree on Titanic Dataset

In [None]:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder

titanic = sns.load_dataset('titanic')
# .copy() makes df an independent frame, so the column assignments below are unambiguous
df = titanic[['survived', 'pclass', 'sex', 'age', 'sibsp', 'fare', 'embarked']].dropna().copy()

# Encode the categorical columns as integers
le = LabelEncoder()
df['sex'] = le.fit_transform(df['sex'])
df['embarked'] = le.fit_transform(df['embarked'])

X = df.drop('survived', axis=1)
y = df['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

plt.figure(figsize=(12, 6))
# A plain list of feature names is accepted across scikit-learn versions
plot_tree(tree, feature_names=list(X.columns), class_names=['Not Survived', 'Survived'], filled=True)
plt.show()


## Task 6: Decision Tree Model Tuning

In [None]:

train_acc, test_acc = [], []
depths = range(1, 20)
for d in depths:
    model = DecisionTreeClassifier(max_depth=d, random_state=42)
    model.fit(X_train, y_train)
    train_acc.append(model.score(X_train, y_train))
    test_acc.append(model.score(X_test, y_test))

plt.plot(depths, train_acc, label='Train Accuracy')
plt.plot(depths, test_acc, label='Test Accuracy')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.title('Overfitting Visualization')
plt.legend()
plt.grid(True)
plt.show()


## Task 7: Conceptual Questions
1. What is the difference between Bagging and Boosting?

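    Bagging trains multiple models independently, and potentially in parallel, on bootstrap samples (random subsets drawn with replacement) of the training data, then combines their predictions to reduce variance. Random Forest is an example.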

    Boosting builds models sequentially, where each new model focuses on correcting the errors of the previous one to reduce bias.

2. How does Random Forest reduce variance?
    Random Forest reduces variance by averaging predictions from many decision trees, each trained on a bootstrap sample of the data and restricted to a random subset of features at each split. Because the trees' errors are only partly correlated, aggregating them smooths out individual tree errors and reduces overfitting.

3. What is the weakness of boosting-based methods?
    Boosting methods can overfit noisy data and take longer to train due to their sequential nature. They are also sensitive to outliers since each step tries to fix previous mistakes aggressively.
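
The contrast between bagging and boosting can be sketched with scikit-learn's `BaggingClassifier` and `AdaBoostClassifier`. This is a minimal illustration on synthetic data generated with `make_classification`; the dataset, the `*_demo` names, and the choice of 50 estimators are assumptions made for the example, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, made up for illustration (not the Titanic data used elsewhere)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0)

# Bagging: 50 unpruned trees, each fit independently on its own bootstrap sample
bagging_demo = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0)

# Boosting: 50 weak learners fit one after another, each round reweighting
# the samples that the previous learners misclassified
boosting_demo = AdaBoostClassifier(n_estimators=50, random_state=0)

# Compare the two ensembles with 5-fold cross-validated accuracy
demo_scores = {}
for name, clf in [("Bagging", bagging_demo), ("AdaBoost", boosting_demo)]:
    demo_scores[name] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()
    print(f"{name} mean CV accuracy: {demo_scores[name]:.3f}")
```

Which ensemble scores higher depends on the data; the point of the sketch is the structural difference, with independent high-variance learners in bagging and sequential weak learners in boosting.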



## Task 8: Random Forest vs Decision Tree

In [None]:

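import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

rf = RandomForestClassifier(random_state=42)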
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

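# Baseline for the comparison: the single decision tree fitted in Task 5
print("Decision Tree Accuracy:", accuracy_score(y_test, tree.predict(X_test)))
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))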
print("Random Forest Precision:", precision_score(y_test, rf_pred))
print("Random Forest Recall:", recall_score(y_test, rf_pred))

importances = rf.feature_importances_
plt.barh(X.columns, importances)
plt.title("Feature Importances")
plt.show()


## Task 9: AdaBoost Comparison

In [None]:

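import time
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score

ada = AdaBoostClassifier(random_state=42)

# Time only the fit call so that "Training Time" excludes prediction
start = time.time()
ada.fit(X_train, y_train)
end = time.time()

ada_pred = ada.predict(X_test)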

print("AdaBoost Accuracy:", accuracy_score(y_test, ada_pred))
print("F1 Score:", f1_score(y_test, ada_pred))
print("Training Time:", round(end - start, 4), "seconds")
