# Problem Introduction
# What is the Problem?
The problem is to develop a classification model that predicts whether flights will be delayed using the dataset. By accurately predicting delays, airlines can optimize their operations, minimize disruptions, and improve customer satisfaction. The goal is to tune and optimize the model for precise predictions, recognizing the various consequences of potential delays.

# Why is it Important?
Flight delays can have widespread impacts, such as:

- Customer Dissatisfaction: Flight delays are a major pain point for passengers and can harm the airline’s reputation and lead to a loss of customer loyalty.
- Operational Efficiency: Delays can disrupt schedules, lead to increased crew and aircraft costs, and affect airport operations and gate usage.
- Regulatory and Financial Costs: Airlines often face fines and other financial penalties for flight delays, as well as additional costs related to rebooking, accommodations, and compensation to passengers.
- Resource Allocation: Accurate prediction of delays can help airlines proactively manage resources like staff, gates, and ground services, reducing operational inefficiencies and costs.
- Competitiveness: An airline that can minimize delays has a competitive edge, leading to greater customer loyalty, lower costs, and improved operational reliability.

# Who are the Key Stakeholders?
- Airline Operations Teams: Responsible for managing flights, crew, gates, and aircraft, they need accurate predictions to optimize schedules and mitigate the impact of delays.
- Passengers: The most directly affected by flight delays, passengers’ experiences can influence brand loyalty and public perception of the airline.
- Airline Management: Focused on improving profitability, operational efficiency, and customer satisfaction, they are key users of data-driven insights for strategic decisions.
- Air Traffic Control and Airport Authorities: They play a role in ensuring smooth airport operations and could use predictive data to better allocate runway slots and gate usage.
- Customer Support Teams: Responsible for managing passenger communications and rebooking efforts, they need accurate predictions to minimize disruption and effectively assist passengers.
- Data Analysts and Modelers: They build, tune, and maintain the predictive model, ensuring it delivers useful insights and remains accurate as conditions change over time.
- Regulatory Bodies: Organizations overseeing airline operations may be interested in predictive measures that help improve compliance and safety standards, thereby reducing delays and incidents.

In [29]:
# Importing required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
import numpy as np

In [30]:
path = "C:\\Users\\Gahan\\Downloads\\FlightDelays_Clean.csv"

# Data Collection
We load the data and then display the first few rows.

In [31]:
# loading data
data = pd.read_csv(path)

In [32]:
data.head(5)

Unnamed: 0,status_delayed,sch_dep_time,carrier_delta,carrier_us,carrier_envoy,carrier_continental,carrier_discovery,carrier_other,dest_jfk,dest_ewr,...,origin_iad,origin_bwi,bad_weather,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
0,0,14.92,1,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
1,0,14.92,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
2,1,14.92,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
3,0,14.92,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
4,0,14.92,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1


# Data Exploration and Understanding
We display the column types and non-null counts, followed by a statistical summary of the data.

In [33]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2201 entries, 0 to 2200
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   status_delayed       2201 non-null   int64  
 1   sch_dep_time         2201 non-null   float64
 2   carrier_delta        2201 non-null   int64  
 3   carrier_us           2201 non-null   int64  
 4   carrier_envoy        2201 non-null   int64  
 5   carrier_continental  2201 non-null   int64  
 6   carrier_discovery    2201 non-null   int64  
 7   carrier_other        2201 non-null   int64  
 8   dest_jfk             2201 non-null   int64  
 9   dest_ewr             2201 non-null   int64  
 10  dest_lga             2201 non-null   int64  
 11  distance             2201 non-null   int64  
 12  origin_dca           2201 non-null   int64  
 13  origin_iad           2201 non-null   int64  
 14  origin_bwi           2201 non-null   int64  
 15  bad_weather          2201 non-null   i

In [37]:
data.describe()

Unnamed: 0,status_delayed,sch_dep_time,carrier_delta,carrier_us,carrier_envoy,carrier_continental,carrier_discovery,carrier_other,dest_jfk,dest_ewr,...,origin_iad,origin_bwi,bad_weather,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
count,2201.0,2201.0,2201.0,2201.0,2201.0,2201.0,2201.0,2201.0,2201.0,2201.0,...,2201.0,2201.0,2201.0,2201.0,2201.0,2201.0,2201.0,2201.0,2201.0,2201.0
mean,0.194457,13.855161,0.176284,0.183553,0.13403,0.042708,0.250341,0.213085,0.175375,0.302135,...,0.311677,0.065879,0.014539,0.139482,0.145388,0.169014,0.177647,0.113585,0.114948,0.139936
std,0.395872,4.316158,0.381148,0.387207,0.340762,0.202244,0.433308,0.40958,0.380374,0.459288,...,0.463284,0.248127,0.119725,0.346528,0.352572,0.37485,0.382302,0.317378,0.319031,0.347
min,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,14.92,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,17.17,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,21.5,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


# Data Cleaning or Preprocessing
This step is not required because the dataset is already preprocessed.

# Feature Selection
Since this is a small dataset with few features, we don't need to perform this step.

In [36]:
# checking for null values
data.isnull().sum()

status_delayed         0
sch_dep_time           0
carrier_delta          0
carrier_us             0
carrier_envoy          0
carrier_continental    0
carrier_discovery      0
carrier_other          0
dest_jfk               0
dest_ewr               0
dest_lga               0
distance               0
origin_dca             0
origin_iad             0
origin_bwi             0
bad_weather            0
Monday                 0
Tuesday                0
Wednesday              0
Thursday               0
Friday                 0
Saturday               0
Sunday                 0
dtype: int64

# Modelling
# Data splitting
Here we separate the target variable from the features and then perform the train/test split used to train the models.

In [39]:
# Splitting data into features and target variable
X = data.drop('status_delayed', axis=1)
y = data['status_delayed']

In [40]:
X.head(5)

Unnamed: 0,sch_dep_time,carrier_delta,carrier_us,carrier_envoy,carrier_continental,carrier_discovery,carrier_other,dest_jfk,dest_ewr,dest_lga,...,origin_iad,origin_bwi,bad_weather,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
0,14.92,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
1,14.92,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
2,14.92,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
3,14.92,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
4,14.92,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1


In [41]:
y.head(5)

0    0
1    0
2    1
3    0
4    0
Name: status_delayed, dtype: int64

In [42]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
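
Only about 19% of the flights are delayed (the mean of `status_delayed` is 0.194), so a plain random split can leave the training and test sets with slightly different delay rates. A minimal sketch of a stratified split follows. It runs on a synthetic stand-in for the data: `demo_X`, `demo_y`, and the generated values are assumptions for illustration, not the real dataset, and `stratify` is an optional argument of `train_test_split` that the cell above does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flight data (assumption: not the real CSV).
rng = np.random.default_rng(42)
n_rows = 1000
demo_X = pd.DataFrame({
    "sch_dep_time": rng.uniform(6.0, 21.5, size=n_rows),
    "bad_weather": rng.integers(0, 2, size=n_rows),
})
# Roughly 20% positives, mimicking the delay rate reported by describe().
demo_y = pd.Series((rng.random(n_rows) < 0.2).astype(int), name="status_delayed")

# stratify=demo_y keeps the class proportions (nearly) equal in both splits.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)

train_rate = ys_train.mean()
test_rate = ys_test.mean()
print(f"train delay rate: {train_rate:.3f}, test delay rate: {test_rate:.3f}")
```

Whether stratification changes the results noticeably on the real data is not something this sketch can show; it only illustrates the mechanism.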

# Classification Modelling
Here we train two different classification models, compare their performance measures across a range of probability thresholds, and then declare the winning model.

# SVM Model
Here we loop over C values that are powers of 10.

In [43]:
# Defining a range of C values for SVM
C_values = [10**i for i in range(-4, 5)]

Training the SVM model with each C value and finding the best C value.

In [44]:
# Training and evaluating SVM models for each C value
best_C = None
best_accuracy = 0
svm_results = []

for C in C_values:
    model = SVC(C=C, kernel='linear', probability=True, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    svm_results.append((C, accuracy))
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_C = C
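
The loop above picks C by scoring each candidate on the test set, which means the test set also influences model selection. One common alternative is to choose C by cross-validation on the training data only. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data from `make_classification`; the data, the variable names, and the reduced C grid are assumptions made to keep the example small and fast, not the notebook's actual procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (assumption: not the flight dataset).
demo_X, demo_y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=42
)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)

# Smaller grid than the notebook's 10**-4 .. 10**4, to keep the sketch fast.
param_grid = {"C": [10**i for i in range(-2, 2)]}

# 5-fold cross-validation on the training split only; the test split is untouched.
search = GridSearchCV(
    SVC(kernel="linear", random_state=42),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(Xc_train, yc_train)

cv_best_C = search.best_params_["C"]
# GridSearchCV refits the best model on the full training split by default,
# so score() evaluates that refitted model on the held-out test split.
test_accuracy = search.score(Xc_test, yc_test)
print("Best C by cross-validation:", cv_best_C)
print("Held-out test accuracy:", round(test_accuracy, 3))
```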

In [45]:
# Training the final SVM model using the best C value
final_svm_model = SVC(C=best_C, kernel='linear', probability=True, random_state=42)
final_svm_model.fit(X_train, y_train)
y_proba = final_svm_model.predict_proba(X_test)[:, 1]

Creating the performance measure table over a range of possible threshold values.

In [46]:
# Defining thresholds and evaluating SVM model performance
thresholds = np.arange(0.0, 1.1, 0.1)
performance_measures = []

for threshold in thresholds:
    y_pred_threshold = (y_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_threshold).ravel()
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_threshold, average='binary', zero_division=0)
    accuracy = accuracy_score(y_test, y_pred_threshold)
    performance_measures.append((threshold, tn, tp, fn, fp, precision, recall, f1, accuracy))

performance_df = pd.DataFrame(performance_measures, columns=['Threshold', 'TN', 'TP', 'FN', 'FP', 'Precision', 'Recall', 'F1', 'Accuracy'])
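
The same loop is repeated later for logistic regression, so it can be factored into a helper. The sketch below is one way to do that; `threshold_table` is a hypothetical name, and the toy labels and probabilities are invented for illustration. It builds the grid with `np.linspace`, which fixes the number of points and includes the endpoint explicitly, and it passes `labels=[0, 1]` so the confusion matrix is always 2x2 even when every prediction falls into one class.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)


def threshold_table(y_true, proba, thresholds):
    """Return one row of confusion counts and metrics per threshold."""
    rows = []
    for t in thresholds:
        y_hat = (proba >= t).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_hat, average="binary", zero_division=0
        )
        rows.append((t, tn, tp, fn, fp, precision, recall, f1,
                     accuracy_score(y_true, y_hat)))
    columns = ["Threshold", "TN", "TP", "FN", "FP",
               "Precision", "Recall", "F1", "Accuracy"]
    return pd.DataFrame(rows, columns=columns)


# Toy inputs (assumption: invented values, not model output).
toy_y = np.array([0, 0, 0, 1, 1, 0, 1, 0])
toy_proba = np.array([0.05, 0.15, 0.35, 0.45, 0.85, 0.25, 0.12, 0.65])

toy_table = threshold_table(toy_y, toy_proba, np.linspace(0.0, 1.0, 11))
print(toy_table)
```

With such a helper, the SVM and logistic regression tables would each come from a single call, for example `threshold_table(y_test, y_proba, thresholds)`.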

# Logistic Model
Training a logistic model.

In [50]:
# Training a Logistic Regression model
logistic_model = LogisticRegression(solver='saga', max_iter=10000, random_state=42)
logistic_model.fit(X_train, y_train)
y_proba_logistic = logistic_model.predict_proba(X_test)[:, 1]

Creating the performance measure table over a range of possible threshold values.

In [51]:
# Evaluating the Logistic Regression model performance over various thresholds
logistic_performance_measures = []

for threshold in thresholds:
    y_pred_threshold_logistic = (y_proba_logistic >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_threshold_logistic).ravel()
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_threshold_logistic, average='binary', zero_division=0)
    accuracy = accuracy_score(y_test, y_pred_threshold_logistic)
    logistic_performance_measures.append((threshold, tn, tp, fn, fp, precision, recall, f1, accuracy))

logistic_performance_df = pd.DataFrame(logistic_performance_measures, columns=['Threshold', 'TN', 'TP', 'FN', 'FP', 'Precision', 'Recall', 'F1', 'Accuracy'])

In [52]:
# Printing the results
print("Best C for SVM:", best_C)
print("\nSVM Performance Table:\n", performance_df)
print("\nLogistic Regression Performance Table:\n", logistic_performance_df)

Best C for SVM: 0.1

SVM Performance Table:
     Threshold   TN  TP  FN   FP  Precision    Recall        F1  Accuracy
0         0.0    0  87   0  354   0.197279  1.000000  0.329545  0.197279
1         0.1   28  84   3  326   0.204878  0.965517  0.338028  0.253968
2         0.2  291  24  63   63   0.275862  0.275862  0.275862  0.714286
3         0.3  354   2  85    0   1.000000  0.022989  0.044944  0.807256
4         0.4  354   2  85    0   1.000000  0.022989  0.044944  0.807256
5         0.5  354   2  85    0   1.000000  0.022989  0.044944  0.807256
6         0.6  354   2  85    0   1.000000  0.022989  0.044944  0.807256
7         0.7  354   2  85    0   1.000000  0.022989  0.044944  0.807256
8         0.8  354   2  85    0   1.000000  0.022989  0.044944  0.807256
9         0.9  354   2  85    0   1.000000  0.022989  0.044944  0.807256
10        1.0  354   0  87    0   0.000000  0.000000  0.000000  0.802721

Logistic Regression Performance Table:
     Threshold   TN  TP  FN   FP  Preci

# Picking a Winning Model
Logistic Regression appears to be the better choice overall, particularly due to its better performance at lower thresholds. Specifically, at Threshold = 0.2, it offers a good trade-off with higher recall, precision, and F1 score compared to SVM, while maintaining decent accuracy.

# Evaluation at a Low Probability Threshold
I am picking a low threshold of 0.2 for the Logistic Regression model.

# How many False Positives?
127 false positives.

# What do these numbers represent?
There are 127 cases where the model incorrectly predicts a positive outcome, that is, flights predicted as delayed that were not actually delayed. This could lead to resources being allocated to flights that do not need them, which may incur unnecessary costs or effort.

# What are the potential costs to the business if we were to make these mistakes in practice?
Resource wastage, potential inefficiencies, and a drain on operational budgets.

# How many False Negatives?
29 false negatives.

# What do these numbers represent?
There are 29 cases where the model fails to identify a true positive, that is, delayed flights that the model predicted as on time. This might lead to missed opportunities or tasks not being addressed, potentially having a significant negative impact.

# What are the potential costs to the business if we were to make these mistakes in practice?
Missed critical actions or opportunities, leading to potential revenue loss or unmet objectives.

# Which prediction mistakes do you consider to be more costly?
If the cost of missing a positive (FN) is higher (e.g., missing important customer actions or critical business events), false negatives are more concerning. On the other hand, if the cost of acting on false positives is too high, then reducing FP is key.

# Evaluation at Two Other Probability Thresholds
# At threshold 0.3
# How many False Positives?
31 false positives.

# What do these numbers represent?
Fewer false positives than at the lower threshold, leading to less unnecessary resource allocation.

# What are the potential costs to the business if we were to make these mistakes in practice?
Reduced compared to the previous threshold because there are fewer incorrect positives. The false positives that remain could still cause resource wastage, potential inefficiencies, and a drain on operational budgets.

# How many False Negatives?
63 false negatives.

# What do these numbers represent?
A higher number of missed positive cases compared to the threshold of 0.2.

# What are the potential costs to the business if we were to make these mistakes in practice?
Potentially more costly due to a higher number of missed cases.

# At threshold 0.4
# How many False Positives?
5 false positives.

# What do these numbers represent?
Very few false positives, implying a very precise model.

# What are the potential costs to the business if we were to make these mistakes in practice?
Almost negligible. The few false positives that occur could still cause resource wastage, potential inefficiencies, and a drain on operational budgets.

# How many False Negatives?
80 false negatives.

# What do these numbers represent?
A much higher number of missed positives, leading to significant missed opportunities.

# What are the potential costs to the business if we were to make these mistakes in practice?
A much higher risk of lost business opportunities.

# Based on your careful consideration of probability threshold options and the corresponding speculated risks/costs - which probability threshold do you recommend going forward with?
Based on the analysis, a threshold of 0.2 for Logistic Regression is recommended because it provides a better balance between false positives and false negatives, capturing more of the relevant positive cases (higher recall) while maintaining reasonable precision and accuracy. It minimizes the risk of missing key opportunities while managing the operational cost of false positives more effectively than lower thresholds would.
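
The trade-off between the two error types can be made explicit by attaching a cost to each. The sketch below takes the false-positive and false-negative counts as read from the sections above for logistic regression at thresholds 0.2, 0.3, and 0.4, and combines them with purely hypothetical unit costs. `COST_FP` and `COST_FN` are assumptions chosen for illustration, not figures from any airline.

```python
# False-positive and false-negative counts per threshold, as read from the text.
counts = {
    0.2: {"FP": 127, "FN": 29},
    0.3: {"FP": 31, "FN": 63},
    0.4: {"FP": 5, "FN": 80},
}

# Hypothetical unit costs (assumption): a missed delay is taken to be
# five times as costly as a false alarm.
COST_FP = 1.0
COST_FN = 5.0


def total_cost(fp, fn, cost_fp=COST_FP, cost_fn=COST_FN):
    """Total cost of the mistakes made at one threshold."""
    return fp * cost_fp + fn * cost_fn


costs = {t: total_cost(c["FP"], c["FN"]) for t, c in counts.items()}
best_threshold = min(costs, key=costs.get)

for t, cost in costs.items():
    print(f"threshold {t}: total cost {cost}")
print("lowest-cost threshold:", best_threshold)
```

Under these assumed costs the threshold of 0.2 comes out cheapest (272 versus 346 and 405). With equal unit costs the ranking reverses (156, 94, and 85 respectively), so the recommendation depends on how much more a missed delay costs than a false alarm.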