# Model Creation and Optimization

### **Note:**

Grid search over a Random Forest is computationally expensive on a dataset of this size, so in this section I create and tune an XGBoost model on a smaller sample instead. The sample is drawn from the downsampled dataset created during the earlier phases of the project, and the cell below sizes it so that rows × columns comes to roughly 500k values. Stratifying on the target keeps the class balance of the sample in line with the source data, so training and tuning stay fast while the sample remains representative of the overall dataset.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the downsampled dataset produced in the earlier phases
df = pd.read_csv('downsampled_data_updated.csv')

# Size the sample by total values: rows * columns is about 500,000
n_columns = df.shape[1]
n_rows = 500000 // n_columns

# Stratified sampling keeps the target's class proportions.
# No random_state is set, so each run draws a different sample.
sampled_df, _ = train_test_split(df, train_size=n_rows, stratify=df['Arr_Delay_At_Least_15_Minutes'])

# Save the sampled data
sampled_df.to_csv('sampled_dataset.csv', index=False)
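
The grid search later in this section is fitted on `X_train` and `y_train`, which are not defined in the cells shown here. The following is a minimal sketch of one way to produce them, assuming a stratified 80/20 split. The helper name `split_features_target`, the split ratio, the seed, and the synthetic demo frame are all assumptions made for illustration, not taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

TARGET = 'Arr_Delay_At_Least_15_Minutes'  # target column used in this notebook


def split_features_target(df, target=TARGET, test_size=0.2, seed=42):
    """Hypothetical helper: separate features from the target and split them."""
    X = df.drop(columns=[target])
    y = df[target]
    # Stratify so train and test keep the same class proportions
    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=seed)


# Synthetic stand-in for sampled_dataset.csv so the sketch runs on its own
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'feature_a': rng.normal(size=1000),
    'feature_b': rng.integers(0, 5, size=1000),
    TARGET: np.repeat([0, 1], 500),
})

X_train, X_test, y_train, y_test = split_features_target(demo_df)
print(X_train.shape, X_test.shape)
```

In the notebook itself, `demo_df` would be replaced by the frame read back from `sampled_dataset.csv`.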

### Define Parameter Grid for Grid Search

In [37]:
param_grid = {
    'n_estimators': [100, 150],  # Number of boosting rounds
    'max_depth': [3, 6],         # Depth of trees
    'learning_rate': [0.1, 0.2], # Step size
}
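
To see how much work this grid implies, the candidates can be enumerated directly. The sketch below uses only the standard library and assumes the 5-fold setting (`cv=5`) used by the grid search in this section; scikit-learn's `ParameterGrid` performs the same enumeration.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 150],
    'max_depth': [3, 6],
    'learning_rate': [0.1, 0.2],
}

# Every combination of one value per parameter is one candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 5  # assumed to match cv=5 in the grid search cell
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

Two values for each of three parameters gives 2 × 2 × 2 = 8 candidates, and 8 × 5 folds gives 40 fits. Adding a third value to any one parameter would raise this to 12 candidates and 60 fits, which is why the grid is kept small.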

### Perform Grid Search with Cross-validation

In [38]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

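# Initialize XGBoost classifier
# (use_label_encoder=False only matters on older XGBoost releases, where it
# suppresses a deprecation warning; newer releases no longer use the argument)
xgb_model = xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss', use_label_encoder=False)

# Initialize Grid Search Cross-validation:
# 5 folds per candidate, scored by accuracy, run on all available CPU cores
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)

# Fit grid search on the training data
# (X_train and y_train are assumed to come from a train/test split made before this cell)
grid_search.fit(X_train, y_train)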


Fitting 5 folds for each of 8 candidates, totalling 40 fits

In [40]:
# Print best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best CV Score:", grid_search.best_score_)

Best Parameters: {'learning_rate': 0.2, 'max_depth': 6, 'n_estimators': 150}
Best CV Score: 0.6275333333333333
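
The best parameters sit at the upper end of every range in the grid, so it can help to look at how close the other candidates came. The sketch below shows one way to rank all candidates through `cv_results_`. So that it runs on its own, it uses synthetic data and a decision tree as a stand-in for the flight data and XGBoost; the names `demo_search` and `ranked` and the small demo grid are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a decision tree stand in for the flight data and XGBoost
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

demo_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4], 'min_samples_leaf': [1, 5]},
    cv=5,
    scoring='accuracy',
)
demo_search.fit(X, y)

# cv_results_ holds one row per candidate; sort by rank to compare them
results = pd.DataFrame(demo_search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
ranked = results[columns].sort_values('rank_test_score')
print(ranked.to_string(index=False))
```

The same three lines at the end should work on the fitted `grid_search` object from this notebook, since `cv_results_` is populated by any fitted `GridSearchCV`.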


### Evaluate Optimized Model

In [39]:
# Get the best model from grid search
best_xgb = grid_search.best_estimator_

# Predict on the test set
y_pred = best_xgb.predict(X_test)

# Calculate accuracy, precision, recall, F1-score, ROC-AUC
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, best_xgb.predict_proba(X_test)[:, 1])

# Print the metrics
print("Optimized XGBoost Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}")

Optimized XGBoost Metrics:
Accuracy: 0.6289
Precision: 0.6332
Recall: 0.6135
F1-score: 0.6232
ROC-AUC: 0.6738
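
As a quick consistency check, the F1-score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. This sketch uses the rounded values printed above, so it can only be expected to agree to about four decimal places.

```python
# Rounded values copied from the printed metrics above
precision = 0.6332
recall = 0.6135

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 from precision and recall: {f1:.4f}")
```

The result, 0.6232, matches the reported F1-score.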
