# 🧠 02 - Feature Engineering and Modeling

In this notebook, we will:

- Engineer features from earthquake data.
- Split the data into training and testing sets.
- Train a classification model to identify high-risk zones.
- Evaluate model performance using a classification report and a confusion matrix.


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [2]:
# Adjust the path to match your folder structure.
# This step could later be replaced with a real-time pull from the USGS API
# (see https://github.com/Prasanna2989/EarthquakePredictionSystem.git).
df = pd.read_csv("../data/cleaned_earthquake_data.csv")

# Preview
df.head()

| Unnamed: 0 | time | latitude | longitude | depth | mag | magType | nst | gap | dmin | rms | ... | updated | place | type | horizontalError | depthError | magError | magNst | status | locationSource | magSource |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-07-28T12:01:05.160Z | 38.771332 | -122.7295 | 1.15 | 1.17 | md | 14.0 | 65.0 | 0.01038 | 0.03 | ... | 2025-07-28T12:02:42.122Z | 2 km ESE of The Geysers, CA | earthquake | 0.22 | 0.32 | 0.14 | 13.0 | automatic | nc | nc |
| 1 | 2025-07-28T11:51:12.495Z | 32.365 | -102.163 | 3.848 | 1.9 | ml | 58.0 | 53.0 | 0.0 | 0.5 | ... | 2025-07-28T11:55:25.185Z | 35 km ENE of McKinney Acres, Texas | earthquake | 0.0 | 0.720489 | 0.1 | 35.0 | automatic | tx | tx |
| 2 | 2025-07-28T11:42:05.182Z | 59.8795 | -152.6 | 82.3 | 1.1 | ml |  |  |  | 0.29 | ... | 2025-07-28T11:44:23.032Z | 44 km WNW of Anchor Point, Alaska | earthquake |  | 0.6 |  |  | automatic | ak | ak |
| 3 | 2025-07-28T11:38:55.810Z | 61.4453 | -146.6149 | 32.4 | 1.3 | ml |  |  |  | 0.39 | ... | 2025-07-28T11:40:24.315Z | 37 km NNW of Valdez, Alaska | earthquake |  | 0.2 |  |  | automatic | ak | ak |
| 4 | 2025-07-28T11:37:22.103Z | 31.977 | -101.989 | 4.3398 | 1.3 | ml | 44.0 | 32.0 | 0.1 | 0.5 | ... | 2025-07-28T11:41:15.071Z | 8 km ESE of Midland, Texas | earthquake | 0.0 | 0.824726 | 0.2 | 31.0 | automatic | tx | tx |
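
The loading cell mentions pulling live data from USGS as a later step. A minimal sketch of what that could look like is below. It assumes the public USGS "all earthquakes, past day" CSV summary feed, whose columns match the preview above; `load_quakes` and `USGS_ALL_DAY_CSV` are hypothetical names, not part of this notebook or the linked repository. The demonstration reads an in-memory sample so it runs offline.

```python
import io

import pandas as pd

# Assumed endpoint: the USGS "all earthquakes, past day" CSV summary feed.
USGS_ALL_DAY_CSV = (
    "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv"
)


def load_quakes(source=USGS_ALL_DAY_CSV):
    """Read a USGS-style CSV (URL, path, or file-like object) into a DataFrame."""
    df = pd.read_csv(source, parse_dates=["time"])
    # Drop rows that lack any of the columns used for modeling.
    required = ["latitude", "longitude", "depth", "mag"]
    return df.dropna(subset=required).reset_index(drop=True)


# Offline demonstration: two rows in the feed's column layout.
# The second row has its magnitude blanked to show the filter at work.
sample = io.StringIO(
    "time,latitude,longitude,depth,mag\n"
    "2025-07-28T12:01:05.160Z,38.771332,-122.7295,1.15,1.17\n"
    "2025-07-28T11:51:12.495Z,32.365,-102.163,3.848,\n"
)
demo = load_quakes(sample)
print(demo.shape)
```

Calling `load_quakes()` with no argument would fetch the live feed, which requires network access.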


## 🔧 Feature Selection

We'll extract relevant features to train our model.
For this example, we'll use:
- Magnitude
- Depth
- Latitude
- Longitude

We'll try to predict whether an event falls in a **high-risk zone**, which this notebook defines as a magnitude at or above a chosen threshold.


In [3]:
# Define the threshold for high-risk zones (we can customize as needed)
risk_threshold = 4.0

# Create target variable
df['high_risk'] = (df['mag'] >= risk_threshold).astype(int)

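# Select features and target.
# Note: 'mag' also defines 'high_risk', so keeping it as a feature leaks the label
# into the inputs; drop it to test whether location and depth alone carry signal.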
features = ['latitude', 'longitude', 'depth', 'mag']
X = df[features]
y = df['high_risk']
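
Because `high_risk` is computed directly from `mag`, a model that also receives `mag` as an input only needs to learn the threshold. The sketch below illustrates the effect on synthetic data (randomly generated, not the notebook's dataset), where magnitude is deliberately independent of location and depth. The helper `holdout_accuracy` is a hypothetical name introduced for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: magnitude is drawn independently of the other columns.
rng = np.random.default_rng(0)
n = 2000
lat = rng.uniform(-60, 60, n)
lon = rng.uniform(-180, 180, n)
depth = rng.uniform(0, 100, n)
mag = rng.uniform(0, 7, n)
y_demo = (mag >= 4.0).astype(int)  # same labeling rule as the notebook


def holdout_accuracy(X):
    """Fit a Random Forest on an 80/20 split and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y_demo, test_size=0.2, random_state=42, stratify=y_demo
    )
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)


acc_with_mag = holdout_accuracy(np.column_stack([lat, lon, depth, mag]))
acc_without_mag = holdout_accuracy(np.column_stack([lat, lon, depth]))

print(f"with mag:    {acc_with_mag:.3f}")
print(f"without mag: {acc_without_mag:.3f}")
```

On real data the gap may be smaller, since magnitude can correlate with location and depth. Comparing the two feature sets on the actual dataset is the more useful check.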

In [4]:
# A single hold-out split; in future work this could be replaced with
# cross-validation, depending on the data volume.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
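# Optional: scale features.
# Models such as logistic regression, SVM, and KNN are sensitive to feature scale.
# Tree-based models like Random Forest are not, so this step mainly keeps the
# pipeline ready for swapping in other models.
# StandardScaler gives each feature mean = 0 and standard deviation = 1.
# It is fit on the training set only, then applied to the test set.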
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
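
The comment in this cell mentions cross-validation as future work. One way to do that, sketched here on synthetic data, is to wrap the scaler and the classifier in a scikit-learn pipeline so the scaler is refit inside each fold. The names `X_demo`, `y_demo`, and `pipeline` are assumptions for illustration; on the real data, `X` and `y` from the cell above would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the feature matrix and target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=500) > 1.0).astype(int)

# Scaling lives inside the pipeline, so each fold fits the scaler
# on its own training portion only.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

# Stratified folds keep the share of positives similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="f1")

print(f"F1 per fold: {np.round(scores, 3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

F1 is used here rather than accuracy because the positive class is the minority; any other scikit-learn scoring string could be substituted.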

In [5]:
# Future work: test the viability of other models (Decision Tree, XGBoost).
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
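
The comment in this cell mentions trying other models. A rough sketch of such a comparison follows, using synthetic data from `make_classification`. Note the substitution: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the example has no extra dependency; it is a related boosting method, not the same library. The `candidates` dictionary is a hypothetical structure for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data with four features (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=600,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    weights=[0.9, 0.1],
    random_state=42,
)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    # Stand-in for XGBoost, chosen because it ships with scikit-learn.
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Mean 5-fold balanced accuracy for each candidate.
results = {
    name: cross_val_score(
        clf, X_demo, y_demo, cv=5, scoring="balanced_accuracy"
    ).mean()
    for name, clf in candidates.items()
}

for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {score:.3f}")
```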

In [6]:
y_pred = model.predict(X_test_scaled)

# Metrics
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      2048
           1       1.00      1.00      1.00       205

    accuracy                           1.00      2253
   macro avg       1.00      1.00      1.00      2253
weighted avg       1.00      1.00      1.00      2253

Confusion Matrix:
 [[2048    0]
 [   0  205]]
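
The accuracy and class balance can be read straight off the confusion matrix above. The short calculation below uses the four reported counts and scikit-learn's layout (rows are true classes, columns are predicted classes). It also computes the accuracy of always predicting the majority class, which is a useful reference point on imbalanced data.

```python
# Counts copied from the confusion matrix printed above:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 2048, 0
fn, tp = 0, 205

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(tn + fp, fn + tp) / total

# Share of test events labeled high-risk.
positive_share = (fn + tp) / total

print(f"accuracy          = {accuracy:.3f}")           # 1.000
print(f"majority baseline = {majority_baseline:.3f}")  # 0.909
print(f"positive share    = {positive_share:.3f}")     # 0.091
```

With about 91% of test events in the low-risk class, accuracy alone says little; per-class recall and precision from the classification report are more telling.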


## ✅ Summary

We trained a Random Forest classifier to identify high-risk earthquake zones.

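- The perfect scores above are expected rather than informative: `high_risk` is derived from `mag`, and `mag` is also one of the input features, so the model only has to recover the threshold.
- The class balance, and therefore the reported metrics, depend on the threshold we define (`risk_threshold = 4.0` here).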
- In future steps, we could integrate spatial data layers (e.g. population, fault lines) to enhance risk prediction.

Next step → Visualize these predictions on a map using `folium`.
