# **Airline On-Time Performance and Delay Prediction**

# Project Objective
---

**Problem Statement:**
An airline aims to improve its operational efficiency by understanding the factors contributing to flight delays and predicting the likelihood of delays in future flights. The airline seeks to analyze key variables, such as weather conditions, airport congestion, flight routes, and past delay history, to identify patterns and predict delays. The challenge is to develop a robust predictive model that can forecast the on-time performance of flights, enabling the airline to optimize scheduling and reduce delays. Additionally, the airline requires actionable insights on the factors that most significantly influence delays, helping to implement strategies to enhance on-time performance. 

**Research Questions:**
1. What are the primary factors affecting flight delays?
2. Can a predictive model be developed to estimate the likelihood of a flight delay?
3. What actions can the airline take to improve on-time performance?

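**Dataset:** The dataset contains 2,000 flight records with the airline, departure and arrival airports, scheduled departure hour, weather conditions, and the observed delay in minutes. It does not include airport congestion data or past delay history, so those factors from the problem statement cannot be analyzed directly here.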

# Import Libraries
---

In [30]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

#from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Model evaluation
from sklearn.model_selection import cross_val_score

# XGBoost and LightGBM
import xgboost as xgb
import lightgbm as lgb

# Warning suppression
import warnings
warnings.filterwarnings('ignore')

# About Dataset
---

In [2]:
data_url = 'https://raw.githubusercontent.com/GopinathAchuthan/Airline-On-Time-Performance-and-Delay-Prediction/refs/heads/main/Airline_Flight_Delay_Dataset(in).csv'

In [3]:
# Load the dataset into a pandas DataFrame
raw_data = pd.read_csv(data_url)

# Display the first few rows of the dataset
raw_data.head()

|   | Flight_ID | Airline | Departure_Airport | Arrival_Airport | Scheduled_Departure | Weather_Conditions | Delay_Minutes |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Airline A | ATL | CDG | 2 | Rain | 95 |
| 1 | 2 | Airline A | ORD | SIN | 15 | Fog | 182 |
| 2 | 3 | Airline D | LAX | SIN | 18 | Snow | 120 |
| 3 | 4 | Airline A | DFW | DXB | 12 | Storm | 194 |
| 4 | 5 | Airline B | ATL | SIN | 22 | Fog | 249 |


In [4]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Flight_ID            2000 non-null   int64 
 1   Airline              2000 non-null   object
 2   Departure_Airport    2000 non-null   object
 3   Arrival_Airport      2000 non-null   object
 4   Scheduled_Departure  2000 non-null   int64 
 5   Weather_Conditions   2000 non-null   object
 6   Delay_Minutes        2000 non-null   int64 
dtypes: int64(3), object(4)
memory usage: 109.5+ KB


In [5]:
raw_data.shape

(2000, 7)

In [6]:
# Convert 'Scheduled_Departure' to categorical
raw_data['Scheduled_Departure'] = raw_data['Scheduled_Departure'].astype('category')

In [7]:
# Replace whitespace with underscores in the 'Airline' column
raw_data['Airline'] = raw_data['Airline'].str.replace(' ', '_')

In [8]:
# Check for missing data
raw_data.isnull().sum()

Flight_ID              0
Airline                0
Departure_Airport      0
Arrival_Airport        0
Scheduled_Departure    0
Weather_Conditions     0
Delay_Minutes          0
dtype: int64

In [9]:
# Drop the 'Flight_ID' column
raw_data = raw_data.drop(columns=['Flight_ID'])

In [10]:
# Check for duplicates
duplicates = raw_data.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 0


# Data wrangling
---

In [11]:
raw_data.columns

Index(['Airline', 'Departure_Airport', 'Arrival_Airport',
       'Scheduled_Departure', 'Weather_Conditions', 'Delay_Minutes'],
      dtype='object')

In [12]:
# Count the flights with a delay of 60 minutes or less
short_delay_count = raw_data[raw_data['Delay_Minutes'] <= 60].shape[0]

print(f"Number of flights with Delay_Minutes <= 60: {short_delay_count}")

Number of flights with Delay_Minutes <= 60: 409


In [13]:
# Get the number of unique values for the input features
features = ['Airline', 'Departure_Airport', 'Arrival_Airport', 
            'Scheduled_Departure', 'Weather_Conditions'
           ]

for feature in features:
    unique_values = raw_data[feature].unique()  # Get unique values for each column
    num_unique_values = len(unique_values)  # Get the number of unique values
    print(f"Feature: {feature}")
    print(f"Number of unique values: {num_unique_values}")
    print(f"Unique values: {unique_values}")
    print("-" * 40)

Feature: Airline
Number of unique values: 4
Unique values: ['Airline_A' 'Airline_D' 'Airline_B' 'Airline_C']
----------------------------------------
Feature: Departure_Airport
Number of unique values: 5
Unique values: ['ATL' 'ORD' 'LAX' 'DFW' 'JFK']
----------------------------------------
Feature: Arrival_Airport
Number of unique values: 5
Unique values: ['CDG' 'SIN' 'DXB' 'LHR' 'HKG']
----------------------------------------
Feature: Scheduled_Departure
Number of unique values: 24
Unique values: [2, 15, 18, 12, 22, ..., 9, 6, 13, 5, 14]
Length: 24
Categories (24, int64): [0, 1, 2, 3, ..., 20, 21, 22, 23]
----------------------------------------
Feature: Weather_Conditions
Number of unique values: 5
Unique values: ['Rain' 'Fog' 'Snow' 'Storm' 'Clear']
----------------------------------------


In [None]:
# Group by 'Airline' and then calculate the number of unique values for each of the specified columns
unique_values_per_airline = raw_data.groupby('Airline')[['Departure_Airport', 'Arrival_Airport', 'Weather_Conditions', 'Scheduled_Departure']].nunique()

# Display the result
unique_values_per_airline

In [None]:
# Define the features
features = ['Airline', 'Departure_Airport', 'Arrival_Airport', 
            'Scheduled_Departure', 'Weather_Conditions'
           ]

# Find the number of unique rows based on these features
unique_rows = raw_data[features].drop_duplicates()

# Get the number of unique rows
num_unique_rows = unique_rows.shape[0]

# Print the result
print(f"Number of unique rows: {num_unique_rows}")

**Note:** A regression model can only learn from the distinct feature combinations it sees. Of the 2,000 rows, only 1,829 have a unique combination of the five input features; the other 171 repeat a combination that already appears. Since the duplicate check above found no fully identical rows, these repeats carry different `Delay_Minutes` values for the same inputs, which the model cannot distinguish. The five features allow 4 × 5 × 5 × 24 × 5 = 12,000 possible combinations, so the data covers only about 15% of them. Because every input is categorical, this sparse coverage makes overfitting likely and limits performance on unseen data.
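The coverage check described in the note can be sketched on a small frame. The snippet below is an illustration only: `toy` and its values are made up and stand in for `raw_data`, and only two of the five features are used to keep it short.

```python
import pandas as pd

# Hypothetical toy frame standing in for raw_data (values are made up)
toy = pd.DataFrame({
    "Airline": ["Airline_A", "Airline_A", "Airline_B", "Airline_B", "Airline_A"],
    "Weather_Conditions": ["Rain", "Rain", "Fog", "Clear", "Fog"],
    "Delay_Minutes": [95, 120, 30, 10, 60],
})
features = ["Airline", "Weather_Conditions"]

# Rows whose feature combination already appeared earlier in the frame
repeated = int(toy.duplicated(subset=features).sum())

# Distinct combinations observed vs. the size of the full cross product
observed = len(toy[features].drop_duplicates())
possible = 1
for col in features:
    possible *= toy[col].nunique()

print(f"Repeated feature rows: {repeated}")
print(f"Observed combinations: {observed} of {possible} possible")
```

On the toy frame this reports one repeated row and 4 of 6 possible combinations. The same three quantities computed on `raw_data` with all five features give the figures quoted in the note.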

In [None]:
# # Mapping the Scheduled_Departure_hour to a new categorical column
# def time_of_day(hour):
#     if 0 <= hour < 6:
#         return 'Late Night'
#     elif 6 <= hour < 12:
#         return 'Morning'
#     elif 12 <= hour < 18:
#         return 'Afternoon'
#     elif 18 <=hour < 24:
#         return 'Night'

# # Apply the function to the 'Scheduled_Departure_hour' column
# raw_data['Scheduled_Departure_time_of_day'] = raw_data['Scheduled_Departure'].apply(time_of_day)

In [None]:
# # Combine 'Departure_Airport' and 'Arrival_Airport' into a new column 'Airport_Route'
# raw_data['Airport_Route'] = raw_data['Departure_Airport'] + '_' + raw_data['Arrival_Airport']

In [None]:
raw_data.head()

# Exploratory Data Analysis
---

### Average Delay Minutes by Different Features

In [None]:
# Define the features
features = ['Airline', 'Departure_Airport', 'Arrival_Airport', 
            'Scheduled_Departure', 'Weather_Conditions'
           ]

for feature in features:
    print(f'{feature}:')
    print(raw_data.groupby(feature)['Delay_Minutes'].mean())
    print()

### Function to Plot the Distribution of Delay Minutes by Feature

In [None]:
def dist_delay_by_feature(df, feature):
    """Plot a histogram of Delay_Minutes for each unique value of `feature`."""
    unique_feature = np.sort(df[feature].unique())
    n_rows = (len(unique_feature) + 1) // 2
    fig, axes = plt.subplots(n_rows, 2, figsize=(10, 4 * n_rows))
    axes = axes.flatten()
    palette = sns.color_palette("Set2")
    
    # Plot one histogram per feature value, each in its own subplot
    for i, value in enumerate(unique_feature):
        ax = axes[i]
        sns.histplot(df[df[feature] == value], x='Delay_Minutes', kde=True, ax=ax, color=palette[i % len(palette)])
        ax.set_title(f'Distribution of Delay Minutes for {value}')
        ax.set_xlabel('Delay Minutes')
        ax.set_ylabel('Frequency')
    
    # Remove the unused subplot when the number of unique values is odd
    for j in range(len(unique_feature), len(axes)):
        fig.delaxes(axes[j])
    
    plt.tight_layout()
    plt.show()


### Distribution of Delay Minutes by Airline

In [None]:
dist_delay_by_feature(raw_data, 'Airline')

### Distribution of Delay Minutes by Departure Airport

In [None]:
dist_delay_by_feature(raw_data, 'Departure_Airport')

### Distribution of Delay Minutes by Arrival Airport

In [None]:
dist_delay_by_feature(raw_data, 'Arrival_Airport')

### Distribution of Delay Minutes by Scheduled Departure

In [None]:
dist_delay_by_feature(raw_data, 'Scheduled_Departure')

### Distribution of Delay Minutes by Weather Conditions

In [None]:
dist_delay_by_feature(raw_data, 'Weather_Conditions')

In [None]:
# Define the features
features = ['Airline', 'Departure_Airport', 'Arrival_Airport', 
            'Scheduled_Departure', 'Weather_Conditions'
           ]

# Grouping by each feature and calculating the mean, median, sum, count, std, min, and max of 'Delay_Minutes'
for feature in features:
    feature_delay_stats = raw_data.groupby(feature)['Delay_Minutes'].agg(['mean', 'median', 'sum', 'count', 'std', 'min', 'max']).reset_index()
    print(f'{feature} Delay Statistics:')
    print(feature_delay_stats)
    print()

### **Insight for Each Airline**:

- **Airline A**: Moderate mean delays, fairly consistent with a reasonable number of flights. Reliable but occasionally has long delays.

- **Airline B**: Largest operation with the highest number of flights. High variability in delay times, leading to the highest total delay.

- **Airline C**: Lowest average delay and fewer extreme delays. Fewer flights, so total delay is the lowest.

- **Airline D**: Highest average delay with significant variation. Fewer flights but still experiences long delays.

### **Insight for Each Departure_Airport**:

- **LAX** has the highest average and variation in delays, which might be due to its size, congestion, or operational challenges. It’s a major hub, so high variability in delays is expected.

- **JFK** has similarly high delays, particularly in median values, suggesting that most flights from JFK experience delays, and there may be operational factors affecting this.

- **ATL** stands out as the most efficient airport with the lowest average delay and high number of flights, making it a relatively reliable airport in terms of delay times.

- **DFW** also experiences relatively high delays, with many flights contributing to a large total delay. However, its delays are not as extreme as those at LAX or JFK.

- **ORD** has a lower total delay but still experiences a moderate number of delays. It has the lowest standard deviation, meaning delays are more predictable.

In [None]:
# Grouping by 'Airline' and 'Departure_Airport' and calculating the mean of 'Delay_Minutes'
bar_data = raw_data.groupby(['Airline', 'Departure_Airport'])['Delay_Minutes'].mean().reset_index()

# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(data=bar_data, x='Airline', y='Delay_Minutes', hue='Departure_Airport', palette='Set2')

# Adding labels and title
plt.title('Average Delay Minutes by Airline and Departure Airport')
plt.xlabel('Airline')
plt.ylabel('Average Delay (Minutes)')
plt.legend(title='Departure Airport', loc='upper left', bbox_to_anchor=(1, 1))

# Show the plot
plt.xticks(rotation=0)  # Keep x-axis labels horizontal
plt.tight_layout()  # Adjust layout to prevent overlap
plt.show()

In [None]:
# Grouping by 'Airline' and 'Departure_Airport' and calculating the standard deviation of 'Delay_Minutes'
bar_data = raw_data.groupby(['Airline', 'Departure_Airport'])['Delay_Minutes'].std().reset_index()

# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(data=bar_data, x='Departure_Airport', y='Delay_Minutes', hue='Airline', palette='Set2')

# Adding labels and title
plt.title('Standard Deviation of Delay Minutes by Departure Airport and Airline')
plt.xlabel('Departure Airport')
plt.ylabel('Standard Deviation of Delay (Minutes)')
plt.legend(title='Airline', loc='upper left', bbox_to_anchor=(1, 1))

# Show the plot
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# List of categorical features
categorical_features = ['Airline', 'Weather_Conditions',
                        'Departure_Airport', 'Scheduled_Departure'
                       ]


fig, axes = plt.subplots(2, 2, figsize=(15, 5 * 3))
axes = axes.flatten()

# Loop through the features and plot on the corresponding subplot
for i, feature in enumerate(categorical_features):
    sns.barplot(x=feature, y='Delay_Minutes', data=raw_data, estimator='mean', ax=axes[i])
    axes[i].set_title(f"Avg Delay vs {feature}")
    axes[i].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

In [None]:
# List of features you want to use as hue
hue_features = ['Airline', 'Weather_Conditions', 'Departure_Airport', 'Arrival_Airport']

fig, axes = plt.subplots(4, 1, figsize=(15, 12))
axes = axes.flatten()

for i, hue in enumerate(hue_features):
    sns.lineplot(data=raw_data, x='Scheduled_Departure', y='Delay_Minutes', hue=hue, ax=axes[i], marker='o', errorbar=None)
    axes[i].set_title(f"Scheduled Departure vs Delay Minutes by {hue}")
    axes[i].set_xlabel('Scheduled Departure')
    axes[i].set_ylabel('Delay Minutes')
    axes[i].tick_params(axis='x', rotation=0)
    # Move the legend outside the plot area (to the right)
    axes[i].legend(loc='upper left', bbox_to_anchor=(1, 1), title=hue)


plt.tight_layout()
plt.show()

# Feature Engineering
---

In [14]:
data = raw_data.copy()

In [15]:
data.head()

|   | Airline | Departure_Airport | Arrival_Airport | Scheduled_Departure | Weather_Conditions | Delay_Minutes |
|---|---|---|---|---|---|---|
| 0 | Airline_A | ATL | CDG | 2 | Rain | 95 |
| 1 | Airline_A | ORD | SIN | 15 | Fog | 182 |
| 2 | Airline_D | LAX | SIN | 18 | Snow | 120 |
| 3 | Airline_A | DFW | DXB | 12 | Storm | 194 |
| 4 | Airline_B | ATL | SIN | 22 | Fog | 249 |


In [16]:
data.columns

Index(['Airline', 'Departure_Airport', 'Arrival_Airport',
       'Scheduled_Departure', 'Weather_Conditions', 'Delay_Minutes'],
      dtype='object')

### One-Hot Encoding

In [17]:
# Perform one-hot encoding on categorical columns
data_encoded = pd.get_dummies(data, drop_first=True)  # drop_first=True drops one dummy per feature to avoid the dummy variable trap

data_encoded.head()

Unnamed: 0,Delay_Minutes,Airline_Airline_B,Airline_Airline_C,Airline_Airline_D,Departure_Airport_DFW,Departure_Airport_JFK,Departure_Airport_LAX,Departure_Airport_ORD,Arrival_Airport_DXB,Arrival_Airport_HKG,...,Scheduled_Departure_18,Scheduled_Departure_19,Scheduled_Departure_20,Scheduled_Departure_21,Scheduled_Departure_22,Scheduled_Departure_23,Weather_Conditions_Fog,Weather_Conditions_Rain,Weather_Conditions_Snow,Weather_Conditions_Storm
0,95,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,182,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,True,False,False,False
2,120,False,False,True,False,False,True,False,False,False,...,True,False,False,False,False,False,False,False,True,False
3,194,False,False,False,True,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,True
4,249,True,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,True,False,False,False


### Frequency Encoding

In [None]:
input_features = ['Airline', 'Departure_Airport', 'Arrival_Airport',
                  'Scheduled_Departure', 'Weather_Conditions'
                 ]

def frequency_encoding(df, columns):
    df_encoded = df.copy()
    
    for col in columns:
        freq_encoding = df[col].value_counts() / len(df)  # Calculate frequency
        df_encoded[col + '_encoded'] = df[col].map(freq_encoding)  # Map frequencies to the column
    
    df_encoded.drop(columns=columns, inplace=True)
    return df_encoded

data_freq_encoded = frequency_encoding(data, input_features)

data_freq_encoded.head()

### Target Encoding

In [None]:
# Target encoding function
def target_encoding(df, columns, target):
    df_encoded = df.copy()  # Create a copy to avoid modifying the original dataframe
    
    for col in columns:
        # Calculate mean of the target for each category in the column
        target_mean = df.groupby(col)[target].mean()  # Group by column and compute the mean of the target
        df_encoded[col + '_encoded'] = df[col].map(target_mean)  # Map the mean target to each row
    
    df_encoded.drop(columns=columns, inplace=True)  # Drop the original categorical columns
    return df_encoded

# Columns to be target encoded
input_features = ['Airline', 'Departure_Airport', 'Arrival_Airport', 
                  'Scheduled_Departure', 'Weather_Conditions']

# Apply target encoding to the dataset
data_target_encoded = target_encoding(data, input_features, 'Delay_Minutes')

data_target_encoded.head()
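One caveat with the function above: it computes the category means on the full dataset, so rows that later land in the test set contribute to their own encoded values. A minimal sketch of a train-only variant follows. The toy data and the helpers `fit_target_encoding` and `apply_target_encoding` are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the notebook would use `data` and its five features
toy = pd.DataFrame({
    "Weather_Conditions": ["Rain", "Fog", "Snow", "Rain", "Fog", "Snow", "Clear", "Rain"],
    "Delay_Minutes": [90, 180, 120, 110, 200, 100, 20, 100],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)


def fit_target_encoding(train_df, column, target):
    """Learn per-category target means from the training rows only."""
    return train_df.groupby(column)[target].mean(), train_df[target].mean()


def apply_target_encoding(df, column, means, fallback):
    """Map learned means onto any frame; unseen categories get the fallback."""
    return df[column].map(means).fillna(fallback)


means, global_mean = fit_target_encoding(train, "Weather_Conditions", "Delay_Minutes")
train_encoded = apply_target_encoding(train, "Weather_Conditions", means, global_mean)
test_encoded = apply_target_encoding(test, "Weather_Conditions", means, global_mean)

print(test_encoded)
```

Falling back to the training mean for unseen categories is one simple choice among several; smoothing toward the global mean is a common alternative when categories are small.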

# Split the data into training and test data
---

In [31]:
# Define your features (X) and target (y)
X = data_encoded.drop(columns=['Delay_Minutes'])
y = data_encoded['Delay_Minutes']

# Split the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of the resulting datasets
print("Training data (X_train):", X_train.shape)
print("Testing data (X_test):", X_test.shape)
print("Training labels (y_train):", y_train.shape)
print("Testing labels (y_test):", y_test.shape)

Training data (X_train): (1600, 38)
Testing data (X_test): (400, 38)
Training labels (y_train): (1600,)
Testing labels (y_test): (400,)


# Model Selection
---

# Train the Model
---

In [32]:
def eval_metrics(y_test, y_pred):
    # Calculate the metrics
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    
    # Print all metrics with 4 decimal places
    print(f'Mean Absolute Error (MAE): {mae:.4f}')
    print(f'Mean Squared Error (MSE): {mse:.4f}')
    print(f'Root Mean Squared Error (RMSE): {rmse:.4f}')
    print(f'R²: {r2:.4f}')

1. Decision Tree
2. Random Forest
3. XGBoost
4. LightGBM
5. CatBoost

## Decision Tree Regression

In [33]:
# Define the parameter grid for Decision Tree
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson']
}

In [34]:
# Initialize the DecisionTreeRegressor
dt_model = DecisionTreeRegressor(random_state=42)

# Initialize GridSearchCV
dt_grid_search = GridSearchCV(estimator=dt_model, param_grid=param_grid, 
                           cv=5, n_jobs=-1, verbose=1, scoring='neg_mean_squared_error')

# Perform the grid search on the training data
dt_grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
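The candidate count in the log follows directly from the grid: 4 depths × 3 split sizes × 3 leaf sizes × 4 criteria = 144 combinations, each fitted on 5 folds. As a quick check, scikit-learn's `ParameterGrid` can enumerate the same grid:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined for the Decision Tree above
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
}

n_candidates = len(ParameterGrid(param_grid))
cv_folds = 5
print(n_candidates, n_candidates * cv_folds)  # 144 720
```

Running this before a search gives a cheap estimate of how many fits a grid will trigger.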


In [35]:
# Get the best parameters and the best model
best_df_params = dt_grid_search.best_params_
best_df_model = dt_grid_search.best_estimator_

# Print the best parameters
print(f'Best parameters for Decision Tree: {best_df_params}')

Best parameters for Decision Tree: {'criterion': 'squared_error', 'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10}


In [36]:
# Make predictions using the best model
y_pred_df = best_df_model.predict(X_test)

# Evaluate the Decision Tree model using the eval_metrics function
print("\nEvaluating Decision Tree Regression Model:")
eval_metrics(y_test, y_pred_df)


Evaluating Decision Tree Regression Model:
Mean Absolute Error (MAE): 73.6587
Mean Squared Error (MSE): 7660.4588
Root Mean Squared Error (RMSE): 87.5240
R²: -0.0672
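A negative R² means the model does worse on the test set than a constant prediction of the test-set mean. A mean-only baseline helps put such scores in context. The sketch below uses scikit-learn's `DummyRegressor` on synthetic data where the target is pure noise; the arrays are made up for illustration and are not the notebook's `X_train`/`y_train`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score, mean_absolute_error

# Synthetic stand-in data: the target is random and unrelated to the features
rng = np.random.default_rng(42)
X_train = rng.integers(0, 2, size=(200, 5))
X_test = rng.integers(0, 2, size=(50, 5))
y_train = rng.uniform(0, 300, size=200)
y_test = rng.uniform(0, 300, size=50)

# Baseline that always predicts the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

baseline_mae = mean_absolute_error(y_test, y_pred)
baseline_r2 = r2_score(y_test, y_pred)
print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Baseline R^2: {baseline_r2:.4f}")
```

Because the baseline predicts the training mean rather than the test mean, its test R² is at or slightly below zero. A tuned model whose R² sits in that range has learned essentially nothing beyond the average delay.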


## Random Forest Regression

In [38]:
# Define the parameter grid for Random Forest
rf_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10], 
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1.0, 'sqrt', 'log2']  # 1.0 (all features) replaces 'auto', which was removed in scikit-learn 1.3
}

In [40]:
from sklearn.ensemble import RandomForestRegressor

# Define the Random Forest model
rf_model = RandomForestRegressor(random_state=42)

# Initialize GridSearchCV for Random Forest
rf_grid_search = GridSearchCV(estimator=rf_model, param_grid=rf_param_grid, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')

# Fit the model using GridSearchCV
rf_grid_search.fit(X_train, y_train)

In [41]:
# Get the best parameters and the best model
best_rf_params = rf_grid_search.best_params_
best_rf_model = rf_grid_search.best_estimator_

# Print the best parameters
print(f'Best parameters for Random Forest: {best_rf_params}')

Best parameters for Random Forest: {'max_depth': 5, 'max_features': 'log2', 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 200}


In [42]:
# Get feature importance from Random Forest
rf_importance = pd.DataFrame(best_rf_model.feature_importances_, X.columns, columns=['Importance'])
rf_importance = rf_importance.sort_values(by='Importance', ascending=False)
rf_importance

|   | Importance |
|---|---|
| Weather_Conditions_Snow | 0.062829 |
| Scheduled_Departure_1 | 0.043817 |
| Arrival_Airport_SIN | 0.043354 |
| Scheduled_Departure_7 | 0.042209 |
| Weather_Conditions_Rain | 0.038868 |
| Arrival_Airport_LHR | 0.038015 |
| Airline_Airline_C | 0.037231 |
| Departure_Airport_DFW | 0.037 |
| Arrival_Airport_DXB | 0.035688 |
| Scheduled_Departure_13 | 0.034553 |


In [None]:
# Plotting the feature importance from Random Forest (with switched axes)
plt.figure(figsize=(8, 8))
sns.barplot(y=rf_importance.index, x=rf_importance['Importance'])  # Swap x and y
plt.title("Feature Importance (Random Forest Regressor)")
plt.xlabel("Importance")
plt.ylabel("Features")
plt.show()
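Because one-hot encoding splits each categorical feature into several dummy columns, the per-column importances can understate a feature as a whole. A minimal sketch of summing them back to the original features is shown below. `source_feature` is a hypothetical helper, and the five values are copied from the top rows of the importance table above, not the full set of 38 columns.

```python
import pandas as pd

# Top five importances from the table above (illustration only, not all columns)
rf_importance = pd.DataFrame(
    {"Importance": [0.062829, 0.043817, 0.043354, 0.042209, 0.038868]},
    index=[
        "Weather_Conditions_Snow",
        "Scheduled_Departure_1",
        "Arrival_Airport_SIN",
        "Scheduled_Departure_7",
        "Weather_Conditions_Rain",
    ],
)

# Original column names, which prefix the dummy column names
original_features = ["Airline", "Departure_Airport", "Arrival_Airport",
                     "Scheduled_Departure", "Weather_Conditions"]


def source_feature(dummy_name):
    """Return the original feature a one-hot column was derived from."""
    for feature in original_features:
        if dummy_name.startswith(feature + "_"):
            return feature
    return dummy_name


# Group index labels by their source feature and sum the importances
grouped = (
    rf_importance.groupby(source_feature)["Importance"]
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```

Features with many levels, such as the 24 departure hours, accumulate importance across more dummy columns, so the grouped totals are better read as a rough ranking than a precise measure.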


In [43]:
# Make predictions using the best model
y_pred_rf = best_rf_model.predict(X_test)

# Evaluate the Random Forest model using the eval_metrics function
print("\nEvaluating Random Forest Model:")
eval_metrics(y_test, y_pred_rf)


Evaluating Random Forest Model:
Mean Absolute Error (MAE): 72.7682
Mean Squared Error (MSE): 7188.9972
Root Mean Squared Error (RMSE): 84.7880
R²: -0.0015


## XGBoost Regression

In [44]:
# Define the parameter grid for XGBoost
xgb_param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 6, 10],
    'min_child_weight': [1, 3, 5],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0]
}

In [45]:
# Define the XGBoost model
xgb_model = xgb.XGBRegressor(random_state=42)

# Initialize GridSearchCV for XGBoost
xgb_grid_search = GridSearchCV(estimator=xgb_model, param_grid=xgb_param_grid, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')

# Fit the model using GridSearchCV
xgb_grid_search.fit(X_train, y_train)

In [46]:
# Get the best parameters and the best model
best_xgb_params = xgb_grid_search.best_params_
best_xgb_model = xgb_grid_search.best_estimator_

# Print the best parameters
print(f'Best parameters for XGBoost: {best_xgb_params}')

Best parameters for XGBoost: {'colsample_bytree': 1.0, 'learning_rate': 0.01, 'max_depth': 3, 'min_child_weight': 5, 'n_estimators': 100, 'subsample': 0.8}


In [47]:
# Make predictions using the best model
y_pred_xgb = best_xgb_model.predict(X_test)

# Evaluate the XGBoost model using the eval_metrics function
print("\nEvaluating XGBoost Model:")
eval_metrics(y_test, y_pred_xgb)


Evaluating XGBoost Model:
Mean Absolute Error (MAE): 72.5246
Mean Squared Error (MSE): 7152.3154
Root Mean Squared Error (RMSE): 84.5714
R²: 0.0036


## LightGBM Regression

In [48]:
# Define the parameter grid for LightGBM
lgb_param_dist = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 6, 10, -1],
    'min_child_samples': [10, 20, 30],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0],
    'reg_alpha': [0, 0.1, 0.5, 1.0],
    'reg_lambda': [0, 0.1, 0.5, 1.0]
}

In [49]:
# Define the LightGBM model
lgb_model = lgb.LGBMRegressor(random_state=42)

# Initialize RandomizedSearchCV for LightGBM
lgb_random_search = RandomizedSearchCV(estimator=lgb_model, param_distributions=lgb_param_dist, 
                                       n_iter=100, cv=5, n_jobs=-1, scoring='neg_mean_squared_error', 
                                       random_state=42)

# Fit the model using RandomizedSearchCV
lgb_random_search.fit(X_train, y_train)


In [50]:
# Get the best parameters and the best model
best_lgb_params_random = lgb_random_search.best_params_
best_lgb_model_random = lgb_random_search.best_estimator_

# Print the best parameters
print(f'Best parameters for LightGBM using RandomizedSearchCV: {best_lgb_params_random}')

Best parameters for LightGBM using RandomizedSearchCV: {'subsample': 0.7, 'reg_lambda': 0.1, 'reg_alpha': 0, 'n_estimators': 100, 'min_child_samples': 20, 'max_depth': 6, 'learning_rate': 0.01, 'colsample_bytree': 0.8}


In [51]:
# Make predictions using the best model
y_pred_lgb = best_lgb_model_random.predict(X_test)

In [52]:
# Evaluate the LightGBM model using the eval_metrics function
print("\nEvaluating LightGBM Model (RandomizedSearchCV):")
eval_metrics(y_test, y_pred_lgb)


Evaluating LightGBM Model (RandomizedSearchCV):
Mean Absolute Error (MAE): 72.6452
Mean Squared Error (MSE): 7169.3085
Root Mean Squared Error (RMSE): 84.6718
R²: 0.0012

In [75]:
data = pd.read_csv('age_data(in).csv')

In [73]:
data['Age'].median()

39.5

In [72]:
data['Age'].describe()

count    50.000000
mean     40.140000
std      13.278815
min      19.000000
25%      29.500000
50%      39.500000
75%      52.250000
max      64.000000
Name: Age, dtype: float64

In [76]:
# Calculate Q1, Q3, and IQR for the 'Age' column
age = data['Age']
Q1 = age.quantile(0.25)
Q3 = age.quantile(0.75)
IQR = Q3 - Q1

# Calculate the lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers by filtering the Series, so only rows outside the bounds remain
# (masking the whole DataFrame keeps all 50 rows as NaN and overcounts)
outliers = age[(age < lower_bound) | (age > upper_bound)]

# Calculate the median age
median_age = age.median()

# Add the number of outliers to the median age
result = median_age + len(outliers)

# Output the results
print("Lower Bound:", lower_bound)
print("Upper Bound:", upper_bound)
print("Outliers:", outliers.tolist())
print("Median Age:", median_age)
print("Number of Outliers:", len(outliers))
print("Median Age + Number of Outliers:", result)

Lower Bound: -4.625
Upper Bound: 86.375
Outliers: []
Median Age: 39.5
Number of Outliers: 0
Median Age + Number of Outliers: 39.5


In [77]:
data = pd.read_csv('sales_data(in).csv')

In [78]:
data.head()

|   | Sales |
|---|---|
| 0 | 501 |
| 1 | 829 |
| 2 | 655 |
| 3 | 261 |
| 4 | 301 |


In [79]:
import pandas as pd

# Load the dataset
data = pd.read_csv('sales_data(in).csv')

# 1. Mean of 'Sales'
mean_sales = data['Sales'].mean()
print("Mean of Sales:", mean_sales)

# 2. Standard deviation of 'Sales'
std_sales = data['Sales'].std()
print("Standard Deviation of Sales:", std_sales)

# 3. Lower Bound: Mean - Standard Deviation
lower_bound = mean_sales - std_sales
print("Lower Bound (Mean - Std):", lower_bound)

# 4. Upper Bound: Mean + Standard Deviation
upper_bound = mean_sales + std_sales
print("Upper Bound (Mean + Std):", upper_bound)

# 5. Round both bounds to the nearest whole number
lower_bound_rounded = round(lower_bound)
upper_bound_rounded = round(upper_bound)
print("Rounded Lower Bound:", lower_bound_rounded)
print("Rounded Upper Bound:", upper_bound_rounded)

# 6. Range: Maximum - Minimum of 'Sales'
sales_range = data['Sales'].max() - data['Sales'].min()
print("Range of Sales:", sales_range)


Mean of Sales: 560.54
Standard Deviation of Sales: 263.37280243264394
Lower Bound (Mean - Std): 297.167197567356
Upper Bound (Mean + Std): 823.912802432644
Rounded Lower Bound: 297
Rounded Upper Bound: 824
Range of Sales: 864


In [80]:
data = pd.read_csv('predictive_data(in).csv')

In [81]:
data.head()

|   | Ad Spend | Sales |
|---|---|---|
| 0 | 4223 | 6699 |
| 1 | 3608 | 4987 |
| 2 | 4590 | 14446 |
| 3 | 2556 | 2218 |
| 4 | 3670 | 5735 |


In [83]:
from sklearn.linear_model import LinearRegression

# 1. Fit the linear regression model
X = data[['Ad Spend']]  # Feature: Ad Spend
y = data['Sales']  # Target: Sales

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X, y)

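# 2. Predict Sales for Ad Spend = 3000
# Pass a DataFrame with the training column name so scikit-learn does not warn about missing feature names
ad_spend_input = 3000
predicted_sales = model.predict(pd.DataFrame({'Ad Spend': [ad_spend_input]}))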

# Print the predicted sales value
print(f"Predicted Sales for Ad Spend = {ad_spend_input}: {predicted_sales[0]}")

# 3. Extract the first 3 digits of the predicted sales value
# Convert the predicted sales to a string and get the first 3 digits
predicted_sales_str = str(int(predicted_sales[0]))
first_3_digits = predicted_sales_str[:3]

# Print the first 3 digits of the predicted sales value
print(f"First 3 digits of predicted sales: {first_3_digits}")

Predicted Sales for Ad Spend = 3000: 10523.718159889702
First 3 digits of predicted sales: 105
