# Model Selection

### Problem Statement

The goal is to predict whether a flight will arrive delayed by at least 15 minutes based on various features such as carrier, flight date, weekday, destination state, and flight distance.
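
The binary target implied by this problem statement can be sketched as follows. The column names (`carrier`, `distance`, `arr_delay_minutes`) and the sample rows are hypothetical stand-ins, since the dataset's actual schema is not shown in this section.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumptions, not the real schema.
flights = pd.DataFrame({
    "carrier": ["AA", "DL", "UA", "WN"],
    "distance": [733, 2475, 1846, 337],
    "arr_delay_minutes": [-5.0, 15.0, 42.0, 14.0],
})

DELAY_THRESHOLD_MINUTES = 15

# 1 = arrived at least 15 minutes late, 0 = otherwise
flights["delayed"] = (
    flights["arr_delay_minutes"] >= DELAY_THRESHOLD_MINUTES
).astype(int)

print(flights["delayed"].tolist())  # [0, 1, 1, 0]
```

A flight arriving exactly 15 minutes late counts as delayed because the threshold is inclusive ("at least 15 minutes").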

#### Candidate Models

1. **Logistic Regression**
   - **Advantages**: Simple, interpretable, and efficient for binary classification tasks.
   - **Considerations**: Assumes a linear relationship between the features and the log-odds of the outcome.

2. **Random Forest**
   - **Advantages**: Handles non-linearity and feature interactions well; averaging many trees makes it less prone to overfitting than a single decision tree.
   - **Considerations**: May require tuning for optimal performance.

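3. **Gradient Boosting (XGBoost)**
   - **Advantages**: Builds trees sequentially, each one correcting the errors of the previous ones; generally high predictive power.
   - **Considerations**: Because trees are built sequentially, training cannot be parallelized across trees as in a Random Forest; sensitive to hyperparameters such as learning rate and tree depth, so it requires careful tuning.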
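
The linearity assumption noted for Logistic Regression can be written out explicitly. With \( p \) the predicted probability of a delay and \( x_1, \dots, x_k \) the encoded features (symbols introduced here for illustration only), the model has the form:

\[
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\]

Each coefficient \( \beta_j \) is the change in log-odds per unit change in \( x_j \), which is what makes the model interpretable. It is also why the model cannot capture interactions unless they are added as explicit features.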


In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score

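# Define function to evaluate the models
def evaluate_model(model, X_test, y_test):
    """Return accuracy, precision, recall, F1, and ROC-AUC on the test set."""
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    # Note: ROC-AUC is computed from hard 0/1 predictions here, so it equals
    # balanced accuracy. Pass predict_proba scores for a threshold-free AUC.
    roc_auc = roc_auc_score(y_test, y_pred)
    return accuracy, precision, recall, f1, roc_auc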

# Initialize the models
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb_model = XGBClassifier(n_estimators=100, eval_metric='logloss', n_jobs=-1, random_state=42)

# Train the models
logistic_model.fit(X_train, y_train)
random_forest_model.fit(X_train, y_train)
xgb_model.fit(X_train, y_train)

# Evaluate the models
logistic_metrics = evaluate_model(logistic_model, X_test, y_test)
random_forest_metrics = evaluate_model(random_forest_model, X_test, y_test)
xgb_metrics = evaluate_model(xgb_model, X_test, y_test)

# Print the evaluation metrics
print(f"Logistic Regression: Accuracy: {logistic_metrics[0]:.4f}, Precision: {logistic_metrics[1]:.4f}, Recall: {logistic_metrics[2]:.4f}, F1-score: {logistic_metrics[3]:.4f}, ROC-AUC: {logistic_metrics[4]:.4f}")
print(f"Random Forest: Accuracy: {random_forest_metrics[0]:.4f}, Precision: {random_forest_metrics[1]:.4f}, Recall: {random_forest_metrics[2]:.4f}, F1-score: {random_forest_metrics[3]:.4f}, ROC-AUC: {random_forest_metrics[4]:.4f}")
print(f"XGBoost: Accuracy: {xgb_metrics[0]:.4f}, Precision: {xgb_metrics[1]:.4f}, Recall: {xgb_metrics[2]:.4f}, F1-score: {xgb_metrics[3]:.4f}, ROC-AUC: {xgb_metrics[4]:.4f}")

# Perform cross-validation
logistic_cv_scores = cross_val_score(logistic_model, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1)
random_forest_cv_scores = cross_val_score(random_forest_model, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1)
xgb_cv_scores = cross_val_score(xgb_model, X_train, y_train, cv=5, scoring='accuracy', n_jobs=-1)

# Print the cross-validation scores
print(f"Logistic Regression CV Accuracy: {np.mean(logistic_cv_scores):.4f} ± {np.std(logistic_cv_scores):.4f}")
print(f"Random Forest CV Accuracy: {np.mean(random_forest_cv_scores):.4f} ± {np.std(random_forest_cv_scores):.4f}")
print(f"XGBoost CV Accuracy: {np.mean(xgb_cv_scores):.4f} ± {np.std(xgb_cv_scores):.4f}")

Logistic Regression: Accuracy: 0.5937, Precision: 0.5856, Recall: 0.6452, F1-score: 0.6140, ROC-AUC: 0.5936
Random Forest: Accuracy: 0.6286, Precision: 0.6318, Recall: 0.6192, F1-score: 0.6254, ROC-AUC: 0.6286
XGBoost: Accuracy: 0.6278, Precision: 0.6317, Recall: 0.6156, F1-score: 0.6235, ROC-AUC: 0.6278




Logistic Regression CV Accuracy: 0.5961 ± 0.0011
Random Forest CV Accuracy: 0.6294 ± 0.0006
XGBoost CV Accuracy: 0.6290 ± 0.0006
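
In the test output above, ROC-AUC is nearly identical to accuracy for each model. That is consistent with `roc_auc_score` being given hard 0/1 predictions: in that case it reduces to balanced accuracy, which is close to plain accuracy when the classes are roughly balanced. The sketch below illustrates the difference on synthetic data from `make_classification`, not the flight data; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the flight dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

y_label = demo_model.predict(X_te)              # hard 0/1 predictions
y_score = demo_model.predict_proba(X_te)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_te, y_label)  # single-threshold "AUC"
auc_from_scores = roc_auc_score(y_te, y_score)  # threshold-free ROC-AUC
bal_acc = balanced_accuracy_score(y_te, y_label)

print(f"AUC from labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy: {bal_acc:.4f}")
print(f"AUC from scores:   {auc_from_scores:.4f}")
```

The first two numbers coincide. The third is what is usually meant by ROC-AUC, since it measures how well the model ranks positives above negatives across all thresholds.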


We compared three models for predicting flight delays using 5-fold cross-validated accuracy on the training set:

- **Logistic Regression**: Mean CV Accuracy of 0.5961 ± 0.0011
- **Random Forest**: Mean CV Accuracy of 0.6294 ± 0.0006
- **XGBoost**: Mean CV Accuracy of 0.6290 ± 0.0006

### Interpretation:

1. **Accuracy Comparison**:
   - Random Forest and XGBoost reach nearly identical mean cross-validated accuracy (about 0.629); the 0.0004 gap between them is smaller than the fold-to-fold standard deviation (0.0006).
   - Logistic Regression trails by roughly 3 percentage points, with an accuracy of about 0.596.

2. **Stability Across Folds**:
   - The ± values are standard deviations of accuracy across the five folds, not confidence intervals. Random Forest and XGBoost vary less (\( \pm 0.0006 \)) than Logistic Regression (\( \pm 0.0011 \)), although all three are stable from fold to fold.

3. **Model Selection**:
   - Between Random Forest and XGBoost, both perform similarly well in terms of accuracy.
   - Choose based on additional factors such as interpretability, computational efficiency, or specific requirements of your project.
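
When two models score this closely, one way to compare them is fold by fold on identical splits rather than by their separate means. The sketch below is a minimal illustration on synthetic data, assuming a shared `StratifiedKFold` splitter. The data and the two models are stand-ins, with Logistic Regression used in place of XGBoost to keep the example dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the flight dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

# A fixed, seeded splitter so both models are scored on the same folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores_a = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="accuracy"
)
scores_b = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=cv, scoring="accuracy",
)

# Per-fold difference: positive values favor model B
diff = scores_b - scores_a
print(f"Mean difference: {diff.mean():.4f}")
print(f"Std of difference across folds: {diff.std(ddof=1):.4f}")
```

If the mean difference is small relative to its spread across folds, the ranking of the two models is not well supported by accuracy alone, and secondary criteria such as training cost or interpretability become the deciding factors.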

### Conclusion:

- **Recommended Model**: Given their similar performance, either Random Forest or XGBoost would be a suitable choice for predicting flight delays.
- **Random Forest** scores slightly higher than XGBoost on every reported test metric and on cross-validated accuracy, although the margins are small. We will therefore tune and optimize the Random Forest model next.
