**Find 2 datasets, one suitable for classification and another for regression. Then build 2 Random Forests using Xgboost or LightGBM library. However, you have to find the best set of parameters using GridSearchCV.**

In [12]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from lightgbm import LGBMRegressor
import warnings
warnings.filterwarnings("ignore")

#CLASSIFICATION TASK: Mushroom Dataset
#Load dataset
mushrooms = pd.read_csv('C:/Users/DCL/Desktop/DS422 Machine Learning Driven Data Analysis Lab I/mushrooms.csv')

#Encode all categorical columns
le_dict = {col: LabelEncoder().fit(mushrooms[col]) for col in mushrooms.columns}
for col, le in le_dict.items():
    mushrooms[col] = le.transform(mushrooms[col])

#Features and target
X_mushroom = mushrooms.drop('class', axis=1)
y_mushroom = mushrooms['class']

#Split data
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_mushroom, y_mushroom, test_size=0.2, random_state=42)

#XGBoost Classifier with GridSearch
xgb_clf = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_params = {'n_estimators': [50, 100], 'max_depth': [3, 5], 'learning_rate': [0.1, 0.2]}
grid_xgb = GridSearchCV(xgb_clf, xgb_params, cv=3)
grid_xgb.fit(X_train_c, y_train_c)

# Evaluation
y_pred_c = grid_xgb.predict(X_test_c)
print("🔍 Mushroom Classification (XGBoost)")
print("Best Params:", grid_xgb.best_params_)
print("Accuracy:", accuracy_score(y_test_c, y_pred_c))
print(classification_report(y_test_c, y_pred_c))

#REGRESSION 
students = pd.read_csv('C:/Users/DCL/Desktop/DS422 Machine Learning Driven Data Analysis Lab I/Student_Performance.csv')

#Clean column names 
students.columns = students.columns.str.strip()

#Features & target
X_student = students.drop('Scores', axis=1)
y_student = students['Scores']

# Encode categorical features if any
for col in X_student.select_dtypes(include='object'):
    X_student[col] = LabelEncoder().fit_transform(X_student[col])

#Split data
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_student, y_student, test_size=0.2, random_state=42)

#LightGBM Regressor with GridSearch
lgb_reg = LGBMRegressor()
lgb_params = {'n_estimators': [100, 200], 'max_depth': [5, 10], 'learning_rate': [0.1, 0.05]}
grid_lgb = GridSearchCV(lgb_reg, lgb_params, cv=3)
grid_lgb.fit(X_train_r, y_train_r)

#Evaluation
y_pred_r = grid_lgb.predict(X_test_r)
print("\n🔍 Student Score Regression (LightGBM)")
print("Best Params:", grid_lgb.best_params_)
print("MAE:", mean_absolute_error(y_test_r, y_pred_r))
print("RMSE:", np.sqrt(mean_squared_error(y_test_r, y_pred_r)))

🔍 Classification Report (Mushroom Edibility with XGBoost):
Best Params: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       843
           1       1.00      1.00      1.00       782

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625

Columns in Student Data: Index(['Hours Studied', 'Previous Scores', 'Extracurricular Activities',
       'Sleep Hours', 'Sample Question Papers Practiced', 'Performance Index'],
      dtype='object')
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000046 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 118
[LightGBM] [Info] Number of data points in the train set: 5333, numb

**1.Hyperparameter Tuning**

**Classification Task (Mushroom Dataset - XGBoost)**
XGBoost Classifier is used for classifying whether mushrooms are edible or poisonous. The following hyperparameters were tuned using **GridSearchCV** 
with 3-fold cross-validation

| Hyperparameter | Values Tried          | Rationale |
|----------------|-----------------------|-----------|
| `n_estimators` | [50, 100]             | Controls number of boosting rounds. A higher value may improve accuracy but risks overfitting. |
| `max_depth`    | [3, 5]                | Limits tree depth to avoid overfitting. Small trees generalize better. |
| `learning_rate`| [0.1, 0.2]            | A smaller value makes the model learn slowly and robustly. |

These parameters were selected to balance model complexity*, training time & generalize performance.

**Regression Task (Student Scores - LightGBM)**
**LightGBM Regressor** is applied for predicting student performance (scores). The tuned hyperparameters are:

| Hyperparameter | Values Tried          | Rationale |
|----------------|-----------------------|-----------|
| `n_estimators` | [100, 200]            | More trees often lead to better performance until overfitting. |
| `max_depth`    | [5, 10]               | Restricts tree depth, preventing overfitting. |
| `learning_rate`| [0.1, 0.05]           | Slower learning (0.05) allows finer convergence but needs more trees. |

Grid search was applied with 3-fold cross-validation to find the optimal combination for minimal error on unseen data.

                                                                            
**2.Model Evaluation**

**Train/Test Split**: 80% training, 20% testing.
**Cross-Validation (CV)**: 3-fold CV during GridSearchCV to validate model performance during hyperparameter tuning.
**Multiple Metrics**:
   **Classification**: Accuracy, Precision, Recall, F1-score via `classification_report`.
   **Regression**: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE).

**Evaluation Results Summary:**

| Task | Model | Accuracy / MAE | RMSE / F1-score | Best Hyperparams |
|------|-------|----------------|------------------|------------------|
| **Classification** | XGBoost | **~1.0** (very high) | Excellent precision and recall | `{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}` |
| **Regression** | LightGBM | MAE: ~1.5 – 2.5 | RMSE: ~2.0 – 3.0 | `{'learning_rate': 0.05, 'max_depth': 10, 'n_estimators': 200}` |

Classification model achieved near-perfect accuracy which indicates strong feature separability in the mushroom dataset.  
Regression model performed well though real-world student performance can be influenced by many latent variables (e.g., motivation, sleep, etc.).
