# **MODELING**

This notebook trains and evaluates machine learning models using the EDA findings and the preprocessing recommendations. Each modeling choice below is tied to a pattern observed in the data.

**Modeling Strategy**

1. **Model Selection** - Random Forest, XGBoost, Logistic Regression (EDA recommendations)
2. **Hyperparameter Tuning** - StratifiedKFold cross-validation with macro-F1 optimization
3. **Class Imbalance Handling** - Balanced class weights and stratified sampling
4. **Feature Importance** - Validate engineered features from EDA
5. **Model Evaluation** - Confusion matrix, balanced accuracy, macro-F1
6. **Explainability** - SHAP values for feature interpretation
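
To make step 3 concrete, here is a minimal sketch of what "balanced" class weights amount to in scikit-learn: each class receives a weight of `n_samples / (n_classes * class_count)`, so rarer classes count more during training. The label vector `y_demo` is a made-up example, not the notebook's target.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 6 samples of class 0, 3 of class 1, 1 of class 2
y_demo = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y_demo)

# Weights as computed by scikit-learn for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same weights computed by hand: n_samples / (n_classes * count_per_class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

Passing `class_weight="balanced"` to an estimator such as `LogisticRegression` or `RandomForestClassifier` applies this weighting internally, so the explicit computation is only needed for inspection.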

**EDA Evidence for Modeling**
- **Non-linear interactions** - Tree-based models (Random Forest, XGBoost)
- **Class imbalance** - Stratified sampling, balanced metrics
- **Feature engineering** - Validate acidity ratios and interactions
- **Production focus** - Interpretable models with feature importance
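
The tuning approach described above (stratified folds, macro-F1 as the selection metric, balanced class weights) can be sketched as follows. This is an illustration under assumptions, not the notebook's final configuration: the data comes from `make_classification` as a stand-in for the real training set, and the parameter grid is a small hypothetical one chosen to run quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the real training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=234,
)

# Stratified folds keep the class proportions roughly equal in every split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=234)

# Hypothetical grid, kept small for illustration
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}

search = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=234),
    param_grid=param_grid,
    scoring="f1_macro",  # macro-F1 weighs every class equally
    cv=cv,
    n_jobs=1,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Macro-F1 is used as the scoring metric because plain accuracy can look good while minority classes are mostly misclassified.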


## **Import Libraries and Load Preprocessed Data**


In [3]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Machine Learning libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score, 
                           balanced_accuracy_score, f1_score, precision_score, recall_score)
from sklearn.preprocessing import StandardScaler
import joblib

# Advanced ML libraries
try:
    import xgboost as xgb
    XGBOOST_AVAILABLE = True
except ImportError:
    XGBOOST_AVAILABLE = False
    print("XGBoost not available. Install with: pip install xgboost")

try:
    import lightgbm as lgb
    LIGHTGBM_AVAILABLE = True
except ImportError:
    LIGHTGBM_AVAILABLE = False
    print("LightGBM not available. Install with: pip install lightgbm")

# Explainability
try:
    import shap
    SHAP_AVAILABLE = True
except ImportError:
    SHAP_AVAILABLE = False
    print("SHAP not available. Install with: pip install shap")

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"XGBoost available: {XGBOOST_AVAILABLE}")
print(f"LightGBM available: {LIGHTGBM_AVAILABLE}")
print(f"SHAP available: {SHAP_AVAILABLE}")

Libraries imported successfully!
XGBoost available: True
LightGBM available: True
SHAP available: True


In [4]:
print("Loading preprocessed data")
try:
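    # Load the feature matrices
    X_train = pd.read_csv("X_train.csv")
    X_val = pd.read_csv("X_val.csv")
    X_test = pd.read_csv("X_test.csv")

    # Load the targets as Series (squeeze turns a single-column DataFrame into a Series)
    y_train = pd.read_csv("y_train.csv").squeeze()
    y_val = pd.read_csv("y_val.csv").squeeze()
    y_test = pd.read_csv("y_test.csv").squeeze()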
    print("Preprocessed data loaded")

except FileNotFoundError as e:
    print(f"Error loading preprocessed data: {e}")
    print("""Run the preprocessing notebook first to generate the required files.
    Required files:
        X_train.csv,
        X_val.csv,
        X_test.csv,
        y_train.csv, y_val.csv, y_test.csv
        selected_features.pkl""")
    # Re-raise so later cells do not run with undefined variables
    raise


Loading preprocessed data
Preprocessed data loaded


## **Baseline Model: Logistic Regression**

In [None]:
print("="*50)
print("BASELINE LOGISTIC REGRESSION MODEL")
print("="*50)

# Create the model (balanced class weights compensate for class imbalance)
baseline_model = LogisticRegression(random_state=234, max_iter=1000, class_weight="balanced")

# Train the model
baseline_model.fit(X_train, y_train)

# Make predictions on each split
y_train_pred = baseline_model.predict(X_train)
y_val_pred = baseline_model.predict(X_val)
y_test_pred = baseline_model.predict(X_test)

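# Metric calculations: compare each split's predictions with that split's own labels
train_accuracy = accuracy_score(y_train, y_train_pred)
val_accuracy = accuracy_score(y_val, y_val_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Balanced accuracy averages per-class recall, so the majority class cannot inflate it
train_balanced_accuracy = balanced_accuracy_score(y_train, y_train_pred)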
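
The modeling strategy also calls for balanced accuracy, macro-F1 and a confusion matrix on every split. One way to collect these in a single place is a small helper like the one below. This is a sketch under assumptions: `summarize_split` is a hypothetical name, not part of the notebook, and the demo runs on synthetic data from `make_classification` rather than on the preprocessed files.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split


def summarize_split(y_true, y_pred):
    """Return imbalance-aware metrics for one data split (hypothetical helper)."""
    return {
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }


# Synthetic, imbalanced stand-in for the real data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=234,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=234
)

# Same estimator settings as the baseline model
demo_model = LogisticRegression(random_state=234, max_iter=1000, class_weight="balanced")
demo_model.fit(X_tr, y_tr)

report = summarize_split(y_va, demo_model.predict(X_va))
print(report["balanced_accuracy"], report["macro_f1"])
print(report["confusion_matrix"])
```

With the notebook's own variables, the equivalent call would be `summarize_split(y_val, y_val_pred)`, repeated for the train and test splits to check for overfitting.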

