<a href="https://colab.research.google.com/github/JairusTheAnalyst/JairusTheAnalyst/blob/main/Machine_Learning_Model_Development_and_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Assignment: Machine Learning Model Development and Evaluation**

In [None]:
# module6_classification.py

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris, fetch_openml

In [None]:

# ------------------------------
# 1) Logistic Regression - Customer Churn
# ------------------------------
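# Example: synthetic churn dataset. Features and labels are drawn independently at random
# (no seed is set), so values change between runs and the labels carry no real signal.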
data = pd.DataFrame({
    'age': np.random.randint(18,70,100),
    'usage': np.random.randint(1,100,100),
    'payment_history': np.random.randint(0,5,100),
    'churn': np.random.choice([0,1],100)
})

X = data[['age','usage','payment_history']]
y = data['churn']

In [None]:
X.head()

   age  usage  payment_history
0   27     13                4
1   61     77                3
2   22     69                4
3   54     99                4
4   44     95                0


In [None]:
y.head()

   churn
0      1
1      1
2      1
3      1
4      1


In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Logistic Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)
y_pred_logreg = logreg.predict(X_test_scaled)

In [None]:
print("Logistic Regression Metrics:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_logreg):.3f}")
print(f"Precision: {precision_score(y_test, y_pred_logreg):.3f}")
print(f"Recall: {recall_score(y_test, y_pred_logreg):.3f}\n")

Logistic Regression Metrics:
Accuracy: 0.467
Precision: 0.529
Recall: 0.529
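
The metrics above sit close to 0.5 because the churn labels come from `np.random.choice` and do not depend on the features. A minimal sketch of a sanity check, assuming a seeded generator and scikit-learn's `DummyClassifier` as a reference point (neither appears in the original notebook), compares logistic regression against a majority-class baseline under cross-validation:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a fixed seed, so the sketch is reproducible (the notebook sets none).
rng = np.random.default_rng(0)
n = 100
data = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "usage": rng.integers(1, 100, n),
    "payment_history": rng.integers(0, 5, n),
    "churn": rng.integers(0, 2, n),  # labels are independent of the features
})
X = data[["age", "usage", "payment_history"]]
y = data["churn"]

# Majority-class baseline: always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent")
# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
model = make_pipeline(StandardScaler(), LogisticRegression())

baseline_acc = cross_val_score(baseline, X, y, cv=5, scoring="accuracy").mean()
model_acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
print(f"Logistic regression accuracy:     {model_acc:.3f}")
```

If the two numbers are close, the model has found no signal beyond the class frequencies, which is the expected outcome for randomly generated labels.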




**2) Decision Tree - Titanic Survival Example**

In [None]:
from sklearn.datasets import fetch_openml

# Fetch Titanic dataset
titanic = fetch_openml('titanic', version=1, as_frame=True)

# Access the DataFrame
titanic_df = titanic.frame

# Keep only relevant columns and drop missing values
titanic_df = titanic_df[['pclass','sex','age','sibsp','parch','fare','survived']].dropna()

# Encode categorical features
titanic_df['sex'] = titanic_df['sex'].map({'male':0,'female':1})

# Features and target
X_titanic = titanic_df[['pclass','sex','age','sibsp','parch','fare']]
y_titanic = titanic_df['survived'].astype(int)

# Split train/test
from sklearn.model_selection import train_test_split
X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(X_titanic, y_titanic, test_size=0.3, random_state=42)

# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(max_depth=5, random_state=42)
dtree.fit(X_train_t, y_train_t)
y_pred_dt = dtree.predict(X_test_t)

from sklearn.metrics import classification_report
print(classification_report(y_test_t, y_pred_dt))


              precision    recall  f1-score   support

           0       0.76      0.87      0.81       175
           1       0.80      0.65      0.72       139

    accuracy                           0.77       314
   macro avg       0.78      0.76      0.77       314
weighted avg       0.78      0.77      0.77       314



**3) KNN - Customer Churn**

In [None]:

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)

print("KNN Metrics:")
print(classification_report(y_test, y_pred_knn))


KNN Metrics:
              precision    recall  f1-score   support

           0       0.38      0.46      0.41        13
           1       0.50      0.41      0.45        17

    accuracy                           0.43        30
   macro avg       0.44      0.44      0.43        30
weighted avg       0.45      0.43      0.44        30



**4) SVM with different kernels**

In [None]:

iris = load_iris()
X_svm = iris.data
y_svm = iris.target

X_train_svm, X_test_svm, y_train_svm, y_test_svm = train_test_split(X_svm, y_svm, test_size=0.3, random_state=42)

kernels = ['linear','poly','rbf']
for k in kernels:
    svc = SVC(kernel=k)
    svc.fit(X_train_svm, y_train_svm)
    y_pred_svc = svc.predict(X_test_svm)
    print(f"SVM ({k}) Metrics:")
    print(classification_report(y_test_svm, y_pred_svc))


SVM (linear) Metrics:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

SVM (poly) Metrics:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45

SVM (rbf) Metrics:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2  

In [None]:
# Grid search for RBF kernel
param_grid = {'C':[0.1,1,10], 'gamma':[0.01,0.1,1]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=3, scoring='accuracy')
grid.fit(X_train_svm, y_train_svm)
print("Best RBF SVM Params:", grid.best_params_)

Best RBF SVM Params: {'C': 10, 'gamma': 0.01}
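
The grid search above reports the best parameters but does not score the tuned model on the held-out data. A minimal sketch of that follow-up step, assuming the same Iris split and parameter grid as in the notebook (variable names here are illustrative), could look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
grid.fit(X_train, y_train)

# With refit=True (the default), the best parameter combination is retrained
# on the full training set, and grid.score evaluates that refit model.
test_acc = grid.score(X_test, y_test)

print("Best params:", grid.best_params_)
print(f"Mean CV accuracy of best params: {grid.best_score_:.3f}")
print(f"Held-out test accuracy:          {test_acc:.3f}")
```

Reporting the cross-validation score and the test score side by side makes it clear which number was used for model selection and which one estimates generalization.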


1. Overall Accuracy:

Linear SVM: 100% accuracy; all 45 test samples were correctly classified.

Polynomial SVM: 98% accuracy (44 of 45 correct); the single error is a class 1 sample predicted as class 2.

RBF SVM: 100% accuracy; all 45 test samples were correctly classified.

2. Precision, Recall, F1-score:

Linear SVM: Perfect scores (1.00) for all classes, indicating it correctly predicted all labels without errors.

Polynomial SVM: Class 1 has a recall of 0.92 (12 of 13), meaning one class 1 sample was missed. Class 2 has a precision of 0.93 (13 of 14), because that same sample was predicted as class 2. Overall performance is still very strong.

RBF SVM: Perfect precision, recall, and F1 for all classes. Since the linear kernel also scores perfectly on this split, the result does not by itself show that a non-linear boundary is needed.

3. Interpretation:

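The test set is small (45 of the 150 Iris samples) and the classes are well separated, which allows even a linear SVM to perform extremely well.

The polynomial kernel differs from the other two by a single sample. On a 45-sample test set each misclassification moves accuracy by about 2 percentage points, so this gap is too small to attribute to overfitting or underfitting.

The grid search selected C=10, gamma=0.01 by 3-fold cross-validation on the training set. The RBF report above was produced with scikit-learn's default SVC settings (C=1.0, gamma='scale'), not these tuned values, so the tuned model is not evaluated on the test set in this notebook.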

4. Insight:

The RBF kernel is a common default for non-linear relationships because it can capture complex patterns.

On small, clean datasets, a linear SVM may perform just as well.

Kernel choice becomes more important with larger, noisier, or highly non-linear datasets.
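
Because the three kernels differ by at most one sample on a 45-sample test set, a cross-validated comparison over all 150 Iris samples may give a steadier picture. The sketch below is an addition to the notebook, not part of it; the 5-fold stratified setup, the seed, and the `cv_results` name are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Assumption: 5 stratified folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for kernel in ["linear", "poly", "rbf"]:
    # Default SVC hyperparameters, as in the kernel loop of the notebook.
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=cv, scoring="accuracy")
    cv_results[kernel] = (scores.mean(), scores.std())
    print(f"SVM ({kernel}): mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

The standard deviation across folds indicates how much of the difference between kernels is attributable to the particular split.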

**5) Random Forest vs Decision Tree**

In [None]:

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train_scaled, y_train)
y_pred_rf = rf.predict(X_test_scaled)

print("Random Forest vs Decision Tree Metrics:")
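# NOTE: no decision tree is fitted on the churn data; the logistic regression
# predictions (y_pred_logreg) serve as the comparison baseline here.
print("Baseline on Churn (logistic regression predictions):")
print(classification_report(y_test, y_pred_logreg))
print("Random Forest on Churn:")
print(classification_report(y_test, y_pred_rf))


Random Forest vs Decision Tree Metrics:
Baseline on Churn (logistic regression predictions):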
              precision    recall  f1-score   support

           0       0.38      0.38      0.38        13
           1       0.53      0.53      0.53        17

    accuracy                           0.47        30
   macro avg       0.46      0.46      0.46        30
weighted avg       0.47      0.47      0.47        30

Random Forest on Churn:
              precision    recall  f1-score   support

           0       0.36      0.31      0.33        13
           1       0.53      0.59      0.56        17

    accuracy                           0.47        30
   macro avg       0.44      0.45      0.44        30
weighted avg       0.46      0.47      0.46        30



1. Accuracy:

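Both reports show the same overall accuracy (0.47, or 14 of 30 test samples). The baseline report is computed from `y_pred_logreg`, the logistic regression predictions; no decision tree is fitted on the churn data. The low accuracy is expected: the churn labels are generated with `np.random.choice` independently of the features, so any model should land near 50%. The test set (13 non-churners, 17 churners) is small but not strongly imbalanced.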

2. Precision:

Baseline (logistic regression predictions): Precision for class 1 (“churn”) is 0.53, meaning when it predicts a customer will churn, it is correct 53% of the time.

Random Forest: Precision for class 1 is the same at 0.53, and slightly lower for class 0 (0.36 versus 0.38).

3. Recall:

Baseline (logistic regression predictions): Recall for class 1 is 0.53, meaning it identifies 53% of actual churn cases (9 of 17).

Random Forest: Recall for class 1 is 0.59 (10 of 17), one more churner caught. Recall for class 0 is 0.31 (4 of 13) versus 0.38 (5 of 13) for the baseline, one fewer non-churner correctly identified.

4. F1-score:

F1-score is the harmonic mean of precision and recall. For churners (class 1), Random Forest (0.56) scores slightly higher than the baseline built from the logistic regression predictions (0.53). For non-churners, both perform poorly (0.33 and 0.38).

5. Interpretation:

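Random Forest shows slightly higher recall and F1 for churners and slightly lower scores for non-churners, but each difference amounts to a single test sample, which is within noise for a 30-sample test set.

Neither model can do much better than chance here, because the labels are random and unrelated to the features. Hyperparameter tuning (tree depth, number of estimators, class weighting) or more data from the same generator would not be expected to help; a dataset in which churn actually depends on the features is needed for a meaningful comparison.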

6. Key insight:

Random Forest is generally more robust than a single tree and better at capturing patterns in complex data, but it has no advantage when the labels contain no pattern, as is the case here. Decision Trees are simpler but may underfit or overfit depending on depth.
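
For a comparison that matches the section title, both tree-based models need to be fitted on the churn data. The sketch below is one way to do that, assuming synthetic data generated the same way as in section 1 but with a fixed seed (an assumption, so the numbers will not match the reports above). Tree-based models do not require feature scaling, so the raw features are used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: seeded synthetic data with random labels, mirroring section 1.
rng = np.random.default_rng(42)
n = 100
data = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "usage": rng.integers(1, 100, n),
    "payment_history": rng.integers(0, 5, n),
    "churn": rng.integers(0, 2, n),
})
X = data[["age", "usage", "payment_history"]]
y = data["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=42
    ),
}

f1_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # zero_division=0 avoids warnings if a model never predicts one class.
    f1_by_model[name] = f1_score(y_test, y_pred, zero_division=0)
    print(f"{name} on Churn:")
    print(classification_report(y_test, y_pred, zero_division=0))
```

Since the labels are random, both models should score near chance; the sketch only shows the mechanics of a like-for-like comparison.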