### Assignment-05

- Build, train, and save `LightGBM` and `SVM` classifiers with integrated cross-validation and hyperparameter tuning. Evaluate both models using appropriate metrics, compare their performance, and identify which model performs best, with reasoning.

NOTE: Use the preprocessed dataset of earthquake predictions.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score

# Load dataset
url = "https://raw.githubusercontent.com/springboardmentor943x/ImpactSense-Intern-project/refs/heads/main/Milestone_3/Week_5/Day_23/preprocessed_earthquake_data.csv"
df = pd.read_csv(url)

print("Shape:", df.shape)
df.head()

Shape: (23409, 40)


Unnamed: 0,Latitude,Longitude,Type,Depth,Magnitude,Magnitude Type,Root Mean Square,Source,Status,Year,...,Source_ISCGEM,Source_ISCGEMSUP,Source_NC,Source_NN,Source_OFFICIAL,Source_PR,Source_SE,Source_US,Source_UW,Status_Reviewed
0,0.583377,0.844368,Earthquake,0.495984,0.277668,MW,-0.103839,ISCGEM,Automatic,-1.915523,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.006109,0.698849,Earthquake,0.075272,-0.195082,MW,-0.103839,ISCGEM,Automatic,-1.915523,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.739162,-1.701962,Earthquake,-0.413928,0.750418,MW,-0.103839,ISCGEM,Automatic,-1.915523,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-2.017599,-0.503524,Earthquake,-0.454694,-0.195082,MW,-0.103839,ISCGEM,Automatic,-1.915523,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.340688,0.691479,Earthquake,-0.454694,-0.195082,MW,-0.103839,ISCGEM,Automatic,-1.915523,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
print(df.info())
print(df.describe().T)

# Assume the last column is the target
label_col = df.columns[-1]
print("Label column is:", label_col)

X = df.drop(columns=[label_col])
y = df[label_col]

# If labels are text, convert them to numbers
if y.dtype == "object":
    y = pd.factorize(y)[0]

print("Classes distribution:\n", pd.Series(y).value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23409 entries, 0 to 23408
Data columns (total 40 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Latitude                23409 non-null  float64
 1   Longitude               23409 non-null  float64
 2   Type                    23409 non-null  object 
 3   Depth                   23409 non-null  float64
 4   Magnitude               23409 non-null  float64
 5   Magnitude Type          23409 non-null  object 
 6   Root Mean Square        23409 non-null  float64
 7   Source                  23409 non-null  object 
 8   Status                  23409 non-null  object 
 9   Year                    23409 non-null  float64
 10  Day                     23409 non-null  float64
 11  Month_sin               23409 non-null  float64
 12  Month_cos               23409 non-null  float64
 13  Hour_sin                23409 non-null  float64
 14  Hour_cos                23409 non-null
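
A side note on the `pd.factorize` branch in the cell above: it assigns integer codes in order of first appearance, and scikit-learn's binary `f1` scorer treats label `1` as the positive class by default. With text labels, the class that happens to appear second would therefore become the positive class. A minimal sketch with made-up label values (not taken from the dataset):

```python
import pandas as pd

# Hypothetical text labels, for illustration only.
labels = pd.Series(["Reviewed", "Automatic", "Reviewed", "Reviewed"])

# factorize returns integer codes plus the unique values in order of first appearance.
codes, uniques = pd.factorize(labels)

print(codes.tolist())  # [0, 1, 0, 0]
print(list(uniques))   # ['Reviewed', 'Automatic']
```

If a specific class should be the positive one, an explicit mapping such as `labels.map({...})` makes that choice visible instead of leaving it to row order.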

In [5]:
from sklearn.model_selection import train_test_split

# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocess numeric columns
num_cols = X.select_dtypes(include=['int64','float64']).columns

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),   # fill missing
    ('scaler', StandardScaler())                     # scale data
])

preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, num_cols)],
    remainder='drop'
)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)


Train shape: (18727, 39) Test shape: (4682, 39)
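
Because `num_cols` holds only the numeric columns and `remainder='drop'` is set, text columns such as `Type` and `Source` never reach the models. A minimal sketch on a hypothetical three-row frame (the values are made up for illustration) shows the effect:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: two numeric columns (one with a gap) and one text column.
toy = pd.DataFrame({
    "Depth": [10.0, np.nan, 30.0],
    "Magnitude": [5.5, 6.0, 6.5],
    "Type": ["Earthquake", "Earthquake", "Explosion"],
})

# Same selection rule as in the notebook cell above.
toy_num_cols = toy.select_dtypes(include=["int64", "float64"]).columns

toy_pre = ColumnTransformer(
    transformers=[("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),  # fill the gap with the column median
        ("scaler", StandardScaler()),                   # zero mean, unit variance
    ]), toy_num_cols)],
    remainder="drop",  # anything not listed (here: "Type") is discarded
)

toy_out = toy_pre.fit_transform(toy)
print(list(toy_num_cols))  # ['Depth', 'Magnitude']
print(toy_out.shape)       # (3, 2): the text column is gone
```

Wrapping the preprocessor in the same `Pipeline` as the classifier means the imputer and scaler are refit on each training fold during cross-validation, so no statistics from the validation fold leak into training.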


In [6]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

lgbm_pipeline = Pipeline([
    ('pre', preprocessor),
    ('clf', LGBMClassifier(random_state=42))
])

lgbm_params = {
    'clf__n_estimators': [100, 300],
    'clf__learning_rate': [0.05, 0.1],
    'clf__num_leaves': [31, 63]
}

lgbm_search = GridSearchCV(lgbm_pipeline, lgbm_params, cv=cv, scoring='f1', n_jobs=-1)
lgbm_search.fit(X_train, y_train)

print("Best LightGBM params:", lgbm_search.best_params_)
print("Best CV F1 score:", lgbm_search.best_score_)


[LightGBM] [Info] Number of positive: 16616, number of negative: 2111
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000849 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1129
[LightGBM] [Info] Number of data points in the train set: 18727, number of used features: 25
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.887275 -> initscore=2.063204
[LightGBM] [Info] Start training from score 2.063204
Best LightGBM params: {'clf__learning_rate': 0.05, 'clf__n_estimators': 100, 'clf__num_leaves': 31}
Best CV F1 score: 0.9999398043641836
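
The training log reports 16616 positive and 2111 negative samples, so the classes are imbalanced. As a rough reference point for reading the F1 scores, the sketch below works out what a trivial classifier that always predicts the positive class would score at that class ratio. This baseline is an illustration added here, not something the notebook ran.

```python
# Class counts taken from the LightGBM training log above.
n_pos, n_neg = 16616, 2111

# An always-positive classifier: every positive is a true positive,
# every negative becomes a false positive, and there are no false negatives.
tp, fp, fn = n_pos, n_neg, 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The baseline F1 comes out at roughly 0.94, so with this class balance an F1 score should be judged against that figure rather than against 0.5.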


In [7]:
svm_pipeline = Pipeline([
    ('pre', preprocessor),
    ('clf', SVC(probability=True, random_state=42))
])

svm_params = {
    'clf__C': [0.1, 1, 10],
    'clf__kernel': ['linear', 'rbf'],
    'clf__gamma': ['scale', 'auto']
}

svm_search = GridSearchCV(svm_pipeline, svm_params, cv=cv, scoring='f1', n_jobs=-1)
svm_search.fit(X_train, y_train)

print("Best SVM params:", svm_search.best_params_)
print("Best CV F1 score:", svm_search.best_score_)


Best SVM params: {'clf__C': 0.1, 'clf__gamma': 'scale', 'clf__kernel': 'linear'}
Best CV F1 score: 1.0
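
`SVC(probability=True)` runs an extra internal cross-validation to calibrate probabilities, which adds to the cost of every fit in the grid search. If the probabilities are only needed for ROC AUC, the signed distances from `decision_function` can be ranked instead. A minimal sketch on synthetic data (the `make_classification` settings are arbitrary assumptions, not the earthquake dataset):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced binary problem used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.2, 0.8], random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# No probability=True here: the SVM is fit once, without calibration.
svm_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC(kernel="linear", C=0.1)),
])
svm_demo.fit(Xtr, ytr)

# ROC AUC only needs a ranking, so the decision scores are sufficient.
scores = svm_demo.decision_function(Xte)
auc_demo = roc_auc_score(yte, scores)
print("ROC AUC from decision_function:", auc_demo)
```

The trade-off is that `predict_proba` is then unavailable, so any downstream code that needs calibrated probabilities would still require `probability=True` or a separate calibration step.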


In [14]:
def evaluate(model, X_test, y_test, name):
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    try:
        y_proba = model.predict_proba(X_test)[:,1]
        roc = roc_auc_score(y_test, y_proba)
    except (AttributeError, ValueError):
        # No predict_proba available, or only one class present in y_test
        roc = None
    
    print(f"\n{name} Results")
    print("Accuracy:", acc)
    print("Precision:", prec)
    print("Recall:", rec)
    print("F1-score:", f1)
    if roc is not None:
        print("ROC AUC:", roc)
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, zero_division=0))
    
    return f1

f1_lgbm = evaluate(lgbm_search.best_estimator_, X_test, y_test, "LightGBM")
f1_svm  = evaluate(svm_search.best_estimator_, X_test, y_test, "SVM")


LightGBM Results
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-score: 1.0
ROC AUC: 1.0
Confusion Matrix:
 [[ 528    0]
 [   0 4154]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       528
         1.0       1.00      1.00      1.00      4154

    accuracy                           1.00      4682
   macro avg       1.00      1.00      1.00      4682
weighted avg       1.00      1.00      1.00      4682


SVM Results
Accuracy: 0.9997864160615122
Precision: 0.9997593261131167
Recall: 1.0
F1-score: 0.9998796485738356
ROC AUC: 0.9999995440685137
Confusion Matrix:
 [[ 527    1]
 [   0 4154]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       528
         1.0       1.00      1.00      1.00      4154

    accuracy                           1.00      4682
   macro avg       1.00      1.00      1.00      4682
weighted avg       1.00      1.00      1.00      4682





In [15]:
from IPython.display import display, Markdown

summary_text = """
# Model Evaluation Summary

We compared the tuned **LightGBM** and **SVM** classifiers on the held-out test set (4,682 samples) using **accuracy, precision, recall, F1-score, and ROC AUC**.

- **LightGBM** classified every test sample correctly (accuracy, F1-score, and ROC AUC of 1.0); its best cross-validated F1 was 0.99994.
- **SVM** (linear kernel, C=0.1) made one error, a false positive (accuracy 0.9998, F1-score 0.9999, ROC AUC above 0.9999); its best cross-validated F1 was 1.0.

### Final Choice: **LightGBM**
We chose **LightGBM** as the final model because:
1. It has the **higher test accuracy**, although the gap is a single test sample and the SVM scored marginally higher in cross-validation, so the two models are effectively tied on predictive performance.
2. It is optimized for **speed and scalability**, making it suitable for large datasets.
3. It offers **feature importance insights**, which improves explainability.

This makes LightGBM the most suitable choice for our use case.
"""

display(Markdown(summary_text))



# Model Evaluation Summary

We compared the tuned **LightGBM** and **SVM** classifiers on the held-out test set (4,682 samples) using **accuracy, precision, recall, F1-score, and ROC AUC**.

- **LightGBM** classified every test sample correctly (accuracy, F1-score, and ROC AUC of 1.0); its best cross-validated F1 was 0.99994.
- **SVM** (linear kernel, C=0.1) made one error, a false positive (accuracy 0.9998, F1-score 0.9999, ROC AUC above 0.9999); its best cross-validated F1 was 1.0.

### Final Choice: **LightGBM**
We chose **LightGBM** as the final model because:
1. It has the **higher test accuracy**, although the gap is a single test sample and the SVM scored marginally higher in cross-validation, so the two models are effectively tied on predictive performance.
2. It is optimized for **speed and scalability**, making it suitable for large datasets.
3. It offers **feature importance insights**, which improves explainability.

This makes LightGBM the most suitable choice for our use case.
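
The assignment also asks for the trained models to be saved, which the cells above do not do. One common option is `joblib`, which is installed alongside scikit-learn. The sketch below uses a small stand-in pipeline and a hypothetical file name so that it runs on its own; in the notebook, the object to persist would be `lgbm_search.best_estimator_` (or `svm_search.best_estimator_`).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for a tuned pipeline such as lgbm_search.best_estimator_.
rng = np.random.default_rng(42)
X_small = rng.normal(size=(60, 3))
y_small = (X_small[:, 0] > 0).astype(int)
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC(kernel="linear", C=0.1)),
]).fit(X_small, y_small)

# Hypothetical output path; any writable location works.
model_path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")
joblib.dump(model, model_path)

# Reload and confirm the restored pipeline predicts identically.
restored = joblib.load(model_path)
same = np.array_equal(model.predict(X_small), restored.predict(X_small))
print("Saved to", model_path, "| predictions match:", same)
```

Saving the whole pipeline, rather than only the classifier step, keeps the fitted imputer and scaler together with the model, so new data can be passed to the reloaded object without repeating the preprocessing by hand.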
