# 📌 Assignment: Model Optimization and Performance Tuning

# 🚀 Solve It Yourself!

This assignment is your chance to think like a data scientist. Don’t rely on AI to do the work for you — the real learning happens when you explore, experiment, and problem-solve.

Mistakes are okay — they’re part of the journey. Trust your skills, stay curious, and give it your best shot.

You’ve got this! 💪

## 🎯 Objective:

- Explore Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree (with CCP Post-Pruning), and Random Forest.
- Optimize and compare model performance.

## 📌 Hint:

- Create a results DataFrame and append each model's name and performance metrics to it for the final comparison (visualize the results as well).
---
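
One possible way to organize the results table from the hint is sketched below. The `evaluate` helper, the model names, and the toy labels are all made up for illustration; in the assignment the predictions would come from your fitted models.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def evaluate(model_name, y_true, y_pred):
    """Return one row of metrics for the comparison table (hypothetical helper)."""
    return {
        "Model": model_name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
        "F1": f1_score(y_true, y_pred, zero_division=0),
    }


# Toy labels standing in for real test labels and model predictions
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
rows = [
    evaluate("Model A", y_true, [0, 1, 0, 0, 1, 0, 1, 1]),
    evaluate("Model B", y_true, [1, 1, 1, 0, 1, 1, 1, 1]),
]

# One row per model, best F1 first
results_df = (
    pd.DataFrame(rows)
    .sort_values("F1", ascending=False)
    .reset_index(drop=True)
)
print(results_df)
```

For the visualization, something like `results_df.set_index("Model").plot(kind="bar")` should give a grouped bar chart of the metrics.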

## 📝 Part 1: Data Preparation
1. **Download a dataset with `kagglehub`**.
2. **Load the dataset** and inspect its structure (columns, types, missing values).
3. **Preprocess the data:**
   - Handle missing values
   - Encode categorical variables
   - Scale numeric features

👉 **Question:** What preprocessing steps did you apply, and why?
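
The preprocessing steps above could look roughly like this on a tiny hand-made frame. The column names, the values, and the `"?"` placeholder for missing entries are assumptions for illustration, so check your own dataset before copying anything.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; the "?" placeholder for missing values is an assumption
toy = pd.DataFrame({
    "age": [25, 38, np.nan, 44],
    "workclass": ["Private", "?", "Private", "Self-emp"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# 1. Inspect structure and missing values
toy = toy.replace("?", np.nan)
print(toy.dtypes)
print(toy.isna().sum())

# 2. Handle missing values: median for numeric, most frequent value for categorical
toy["age"] = toy["age"].fillna(toy["age"].median())
toy["workclass"] = toy["workclass"].fillna(toy["workclass"].mode()[0])

# 3. Encode the categorical feature (one-hot) and the binary target (0/1)
X = pd.get_dummies(toy.drop(columns="income"), columns=["workclass"])
y = (toy["income"] == ">50K").astype(int)

# 4. Scale the numeric column to zero mean and unit variance
X["age"] = (X["age"] - X["age"].mean()) / X["age"].std(ddof=0)
print(X)
```

On real data, fit the imputation and scaling statistics on the training split only (for example inside a scikit-learn `Pipeline`) so that nothing leaks from the test set.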

In [None]:
import os

import kagglehub
import pandas as pd

# Download the latest version of the dataset
path = kagglehub.dataset_download("wenruliu/adult-income-dataset")
print("Path to dataset files:", path)

# Load the first file found in the download directory
data_name = os.listdir(path)[0]
full_path = os.path.join(path, data_name)
df = pd.read_csv(full_path)


## 🔍 Part 2: Model Building

### 🔹 2.1 Logistic Regression
- Build a baseline Logistic Regression model.
- **Experiment:** Tune the `C` parameter (the inverse of regularization strength: a smaller `C` means stronger regularization).

👉 **Question:** How does changing `C` affect the model’s performance?
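
A minimal sketch of a `C` sweep, assuming synthetic data from `make_classification` in place of the real dataset. The grid of values and the F1 scoring are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=400, n_features=10, n_informative=4, random_state=0)

c_scores = {}
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    # Smaller C = stronger regularization (C is the inverse of the penalty strength)
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    c_scores[C] = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    print(f"C={C:<7} mean CV F1={c_scores[C]:.3f}")

best_C = max(c_scores, key=c_scores.get)
print("Best C in this sweep:", best_C)
```

Very small values of `C` tend to underfit, while very large values approach an unregularized model; where the sweet spot lies depends on the data.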

### 🔹 2.2 K-Nearest Neighbors (KNN)
- Train a KNN model with a default `k=5`.
- **Experiment:**
   - Test different values of `k`.
   - Compare performance using `euclidean` vs. `manhattan` distance.

👉 **Question:** What is the best `k` for your dataset? Why did it perform better?
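
The KNN experiment could be sketched as a small double loop over `k` and the distance metric. The data here is synthetic and the range of `k` values is an arbitrary choice; scaling is included because KNN relies on distances.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=400, n_features=10, n_informative=4, random_state=0)

knn_scores = {}
for metric in ["euclidean", "manhattan"]:
    for k in range(1, 22, 4):  # k = 1, 5, 9, 13, 17, 21
        model = make_pipeline(
            StandardScaler(),
            KNeighborsClassifier(n_neighbors=k, metric=metric),
        )
        # Mean 5-fold cross-validated accuracy for this (metric, k) pair
        knn_scores[(metric, k)] = cross_val_score(model, X, y, cv=5).mean()

best_metric, best_k = max(knn_scores, key=knn_scores.get)
print(f"best: k={best_k}, metric={best_metric}, "
      f"accuracy={knn_scores[(best_metric, best_k)]:.3f}")
```

A small `k` follows the training data closely (higher variance), while a large `k` smooths the decision boundary (higher bias), which is worth keeping in mind when explaining why one value wins.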

## 🌳 Part 3: Decision Tree with Pre-Pruning & CCP (Post-Pruning)
- Train a Decision Tree with default settings.
- Try pre-pruning hyperparameters.
- Check the `feature_importances_` attribute.
- Extract `ccp_alpha` values using `cost_complexity_pruning_path`.
- Build pruned trees for different `ccp_alpha` values.

👉 **Question:** Which pre-pruning hyperparameters did you tune? How did you change them to improve performance?

👉 **Question:** Which `ccp_alpha` value gave the best results, and why?

👉 **Question:** How did the tree size change after pruning?
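
A minimal sketch of the post-pruning workflow, assuming synthetic data and a simple train/validation split. It extracts candidate `ccp_alpha` values with `cost_complexity_pruning_path`, fits one tree per alpha, and records the tree size (node count) next to the validation accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate alphas come from the pruning path of a fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
# Drop the last alpha (it prunes the tree down to the root) and guard against
# tiny negative values caused by floating-point round-off
ccp_alphas = np.clip(path.ccp_alphas[:-1], 0, None)

pruned = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    pruned.append((alpha, tree.tree_.node_count, tree.score(X_val, y_val)))

full_nodes = pruned[0][1]  # first alpha is 0, i.e. the unpruned tree
best_alpha, best_nodes, best_score = max(pruned, key=lambda row: row[2])
print(f"unpruned tree: {full_nodes} nodes")
print(f"best alpha={best_alpha:.5f}: {best_nodes} nodes, validation accuracy={best_score:.3f}")
```

Plotting node count and validation accuracy against `ccp_alpha` is a handy way to answer the two questions about the best alpha and the change in tree size.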

## 🌲 Part 4: Random Forest
- Train a Random Forest model with 100 trees.
- **Experiment:** Vary `n_estimators`, `max_depth`, and other hyperparameters.

👉 **Question:** How did changing these hyperparameters affect performance?
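
One way to run this experiment by hand is sketched below on synthetic data. The grid values are arbitrary; comparing train and test accuracy side by side helps spot overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf_scores = {}
for n_estimators in [10, 50, 100]:
    for max_depth in [3, None]:  # None lets each tree grow until its leaves are pure
        rf = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=0
        ).fit(X_train, y_train)
        rf_scores[(n_estimators, max_depth)] = (
            rf.score(X_train, y_train),
            rf.score(X_test, y_test),
        )

for (n_estimators, max_depth), (train_acc, test_acc) in rf_scores.items():
    print(f"n_estimators={n_estimators:<4} max_depth={str(max_depth):<5} "
          f"train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between train and test accuracy suggests the forest is overfitting; limiting `max_depth` usually narrows that gap at some cost in training accuracy.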

## 🧠 Part 5: Model Comparison and Optimization
- Compare all models using Accuracy, Precision, Recall, and F1-score.
- **Reflect:**
   - Which model performed best?
   - How did tuning improve performance?
   - What trade-offs (e.g., overfitting vs. underfitting) did you observe?

👉 **Question:** Summarize which model you would choose for this dataset and why.

## ⭐ Stretch Goal (Optional):
- Use **GridSearchCV** or **RandomizedSearchCV** to fully optimize at least one model, then retrieve the best parameters (`best_params_`) and the best model (`best_estimator_`) from each search.
- Visualize **feature importance** (especially for Decision Tree/Random Forest).

👉 **Bonus Question:** Did advanced tuning or feature importance insights change your final model choice?
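
For the stretch goal, a randomized search might look like the sketch below. The data is synthetic, and the search space, `n_iter`, and scoring are assumptions chosen to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(20, 100),      # integers in [20, 99]
        "max_depth": [3, 5, 7, None],
        "min_samples_split": randint(2, 11),   # integers in [2, 10]
    },
    n_iter=8,          # number of sampled parameter combinations
    cv=3,
    scoring="f1",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_model = search.best_estimator_  # refit on all of X, y with the best parameters
print("Best parameters:", best_params)
print(f"Best mean CV F1: {search.best_score_:.3f}")

# Rank features by importance in the best model
ranking = sorted(enumerate(best_model.feature_importances_), key=lambda t: t[1], reverse=True)
for idx, importance in ranking:
    print(f"feature_{idx}: {importance:.3f}")
```

`GridSearchCV` exposes the same `best_params_`, `best_score_`, and `best_estimator_` attributes, so the retrieval step is identical for both search strategies.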

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, 
                           recall_score, f1_score, 
                           confusion_matrix, classification_report)

# Part 1: Data Preparation
# Download and load the dataset
import kagglehub
path = kagglehub.dataset_download("wenruliu/adult-income-dataset")
data_name = os.listdir(path)[0]
full_path = os.path.join(path, data_name)
df = pd.read_csv(full_path)

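# Preprocessing pipeline
# NOTE: this assumes the usual layout of this Kaggle dataset (target column
# 'income' with values '<=50K' / '>50K', missing values encoded as '?').
# Check df.columns and df['income'].unique() before relying on it.
df = df.replace('?', np.nan)

target_col = 'income'
X = df.drop(columns=[target_col])
# Encode the target as 0/1 so precision/recall/F1 treat '>50K' as the positive class
y = (df[target_col].astype(str).str.strip() == '>50K').astype(int)

# Select feature columns by dtype instead of hard-coding them
numeric_features = X.select_dtypes(include='number').columns.tolist()
categorical_features = X.select_dtypes(exclude='number').columns.tolist()

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Split data (stratify to keep the class ratio the same in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)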

# Part 2: Model Building and Optimization
results = []

# Logistic Regression
# max_iter is raised from the default to reduce the chance of convergence warnings
lr_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression(max_iter=1000))])

param_grid = {'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]}
lr_grid = GridSearchCV(lr_pipe, param_grid, cv=5)
lr_grid.fit(X_train, y_train)

best_lr = lr_grid.best_estimator_
y_pred = best_lr.predict(X_test)
results.append({
    'Model': 'Logistic Regression',
    'Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1': f1_score(y_test, y_pred),
    'Best Params': lr_grid.best_params_
})

# K-Nearest Neighbors
knn_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', KNeighborsClassifier())])

param_grid = {'classifier__n_neighbors': range(3, 15, 2),
              'classifier__weights': ['uniform', 'distance'],
              'classifier__metric': ['euclidean', 'manhattan']}
knn_grid = GridSearchCV(knn_pipe, param_grid, cv=5)
knn_grid.fit(X_train, y_train)

best_knn = knn_grid.best_estimator_
y_pred = best_knn.predict(X_test)
results.append({
    'Model': 'KNN',
    'Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1': f1_score(y_test, y_pred),
    'Best Params': knn_grid.best_params_
})

# Decision Tree
dt_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', DecisionTreeClassifier(random_state=42))])

param_grid = {'classifier__max_depth': [3, 5, 7, None],
              'classifier__min_samples_split': [2, 5, 10],
              'classifier__min_samples_leaf': [1, 2, 4]}
dt_grid = GridSearchCV(dt_pipe, param_grid, cv=5)
dt_grid.fit(X_train, y_train)

best_dt = dt_grid.best_estimator_
y_pred = best_dt.predict(X_test)
results.append({
    'Model': 'Decision Tree',
    'Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1': f1_score(y_test, y_pred),
    'Best Params': dt_grid.best_params_
})

# Random Forest
rf_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', RandomForestClassifier(random_state=42))])

param_grid = {'classifier__n_estimators': [50, 100, 200],
              'classifier__max_depth': [3, 5, 7, None],
              'classifier__min_samples_split': [2, 5, 10]}
rf_grid = GridSearchCV(rf_pipe, param_grid, cv=5)
rf_grid.fit(X_train, y_train)

best_rf = rf_grid.best_estimator_
y_pred = best_rf.predict(X_test)
results.append({
    'Model': 'Random Forest',
    'Accuracy': accuracy_score(y_test, y_pred),
    'Precision': precision_score(y_test, y_pred),
    'Recall': recall_score(y_test, y_pred),
    'F1': f1_score(y_test, y_pred),
    'Best Params': rf_grid.best_params_
})

# Part 5: Model Comparison
results_df = pd.DataFrame(results)
print("\nModel Comparison Results:")
print(results_df.to_string(index=False))

# Visualize feature importance for tree-based models (top 15 features each)
onehot = (best_dt.named_steps['preprocessor']
          .named_transformers_['cat'].named_steps['onehot'])
features = numeric_features + list(onehot.get_feature_names_out(categorical_features))

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
for ax, (title, model) in zip(axes, [('Decision Tree', best_dt),
                                     ('Random Forest', best_rf)]):
    importances = model.named_steps['classifier'].feature_importances_
    (pd.Series(importances, index=features)
       .sort_values().tail(15)
       .plot(kind='barh', ax=ax, title=f'{title} Feature Importance'))
plt.tight_layout()
plt.show()

