# Problem Statement

Modern travelers often struggle to find the best flight deals due to rapidly changing prices and a variety of airline options. The goal of this project is to **predict flight ticket prices** based on key features (e.g., departure time, days until departure, flight class, number of stops, etc.). By building a **data pipeline** that includes data ingestion, transformation, model training, and deployment, we aim to:

1. **Help consumers** understand how flight prices fluctuate over time.  
2. **Enable proactive decisions** regarding ticket purchases (e.g., buying earlier for cheaper prices).  
3. **Demonstrate** a scalable machine learning solution that can be integrated into a real-world booking platform.

---

## Key Objectives

1. **Data Collection and Cleansing**  
   - Gather raw flight data from reliable sources.  
   - Clean and format the dataset (removing outliers, handling missing values, etc.) to ensure data quality.

2. **Feature Engineering**  
   - Create or transform relevant features such as days left until departure, time of day (morning, evening, etc.), number of stops, and flight class.  
   - Extract or encode categorical variables (e.g., source city, destination city) to make them usable for the model.

3. **Model Training and Evaluation**  
   - Implement a scikit-learn **Pipeline** that includes preprocessing steps (e.g., `ColumnTransformer`) and a machine learning model (e.g., random forest, gradient boosting, or a tuned regression model).  
   - Perform hyperparameter tuning using **GridSearchCV** to select the best model based on performance metrics (e.g., accuracy, RMSE, or MAE).

4. **Deployment and Visualization**  
   - Serialize the best model (e.g., via pickle) and integrate it into a **Flask** application.  
   - Provide an interactive form where users can input flight details to receive an **instant price prediction**.  
   - Create dashboards or plots (with libraries like **Plotly**) to visualize price distributions, the effect of days left on prices, and other insights.

---

## Expected Outcomes

- **Accurate Flight Price Predictions**  
  A system that forecasts ticket costs given a set of user inputs, enabling informed purchase decisions.

- **Actionable Insights**  
  Graphs and statistics that reveal how specific factors (like booking timing, number of stops, or flight class) influence ticket pricing.

- **Reusable Pipeline**  
  A robust data pipeline and trained model that can be updated with new data or integrated into larger applications.

- **User-Friendly Interface**  
  A simple web app where users can interact with the model and see real-time results.


# Data Collection and Modeling

## About Dataset

**Introduction**  
The objective of this study is to analyze the flight booking dataset obtained from the “Ease My Trip” website. This online platform is widely used by passengers to book flight tickets, and the dataset reflects real-world booking data. The dataset is available on [Kaggle](https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction).

**Study Objectives**  
- **Statistical Analysis:** Conduct various statistical hypothesis tests to extract meaningful insights from the dataset.  
- **Predictive Modeling:** Build and compare multiple regression models to predict flight prices accurately.  
- **Insight Generation:** Discover valuable information that can help passengers understand price fluctuations and make informed booking decisions.

**Significance**  
A comprehensive study of this dataset will enable the discovery of patterns and trends in flight pricing, providing enormous value to both passengers and industry stakeholders. The insights derived from this analysis can help optimize pricing strategies and improve the overall booking experience.

---

## Modeling Approach

To predict flight ticket prices, several regression models were evaluated and incorporated into a robust data pipeline. The models used include:

- **Linear Regression**: `LinearRegression()`  
  A simple model that assumes a linear relationship between input features and flight price.

- **Decision Tree Regressor**: `DecisionTreeRegressor()`  
  A non-linear model that splits the dataset into regions based on feature thresholds.

- **Random Forest Regressor**: `RandomForestRegressor()`  
  An ensemble of decision trees that improves prediction accuracy by averaging the predictions of multiple trees.

- **Gradient Boosting Regressor**: `GradientBoostingRegressor()`  
  An ensemble technique that builds trees sequentially to minimize prediction error.

- **AdaBoost Regressor**: `AdaBoostRegressor()`  
  An ensemble method that adjusts weights to focus on difficult-to-predict instances.

- **XGBoost Regressor**: `XGBRegressor()`  
  A high-performance gradient boosting algorithm known for its speed and accuracy.

- **CatBoost Regressor**: `CatBoostRegressor(verbose=1)`  
  A gradient boosting model optimized for categorical features, with verbose output for training progress.

These models were integrated into a comprehensive data pipeline using scikit-learn’s `Pipeline` and `ColumnTransformer` for data preprocessing. Hyperparameter tuning was performed with **GridSearchCV** to select the best performing model. The final chosen model was serialized and deployed in a Flask web application for real-time flight price prediction.

---

This combined approach not only streamlines the data flow from raw data ingestion to prediction but also allows for a systematic evaluation of various models to determine the most effective method for forecasting flight prices.


# import libraries


In [3]:
# Data Manipulation
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio

# Scikit-learn for Machine Learning Pipeline and Models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.metrics import r2_score
# Additional Models
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

# Optionally, you can import warnings to ignore some warnings during training
import warnings
warnings.filterwarnings("ignore")


In [4]:
#import csv file 
df = pd.read_csv('data/Clean_Dataset.csv')
df.head()

Unnamed: 0,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1,5953
1,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1,5953
2,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1,5956
3,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,1,5955
4,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,1,5955


This document outlines the steps to perform a comprehensive data check on our flight dataset. We will examine:
- Missing values
- Duplicate rows
- Data types (distinguishing between numerical and categorical features)
- Descriptive statistics for both numerical and categorical features
- Initial insights from the data

In [5]:
# Check for missing values in each column
missing_values = df.isnull().sum()
print("Missing Values by Column:")
print(missing_values)

# Percentage of missing values
missing_percentage = (missing_values / len(df)) * 100
print("\nPercentage of Missing Values:")
print(missing_percentage)


Missing Values by Column:
source_city         0
departure_time      0
stops               0
arrival_time        0
destination_city    0
class               0
duration            0
days_left           0
price               0
dtype: int64

Percentage of Missing Values:
source_city         0.0
departure_time      0.0
stops               0.0
arrival_time        0.0
destination_city    0.0
class               0.0
duration            0.0
days_left           0.0
price               0.0
dtype: float64


In [6]:
# Check the number of duplicate rows
duplicates = df.duplicated().sum()
print("Number of duplicate rows:", duplicates)


Number of duplicate rows: 3875


In [7]:
# View data types of the columns
print("Data Types:")
print(df.dtypes)

# List numerical features
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
print("\nNumerical Features:", numerical_features)

# List categorical features
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
print("\nCategorical Features:", categorical_features)


Data Types:
source_city          object
departure_time       object
stops                object
arrival_time         object
destination_city     object
class                object
duration            float64
days_left             int64
price                 int64
dtype: object

Numerical Features: ['duration', 'days_left', 'price']

Categorical Features: ['source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class']


In [8]:
# Describe numerical features
num_stats = df[numerical_features].describe()
print("Descriptive Statistics for Numerical Features:")
print(num_stats)


Descriptive Statistics for Numerical Features:
            duration      days_left          price
count  300153.000000  300153.000000  300153.000000
mean       12.221021      26.004751   20889.660523
std         7.191997      13.561004   22697.767366
min         0.830000       1.000000    1105.000000
25%         6.830000      15.000000    4783.000000
50%        11.250000      26.000000    7425.000000
75%        16.170000      38.000000   42521.000000
max        49.830000      49.000000  123071.000000


In [9]:
# For categorical features, we show counts and unique values
cat_stats = {}
for col in categorical_features:
    cat_stats[col] = df[col].value_counts()
    print(f"\nValue Counts for {col}:")
    print(cat_stats[col])



Value Counts for source_city:
source_city
Delhi        61343
Mumbai       60896
Bangalore    52061
Kolkata      46347
Hyderabad    40806
Chennai      38700
Name: count, dtype: int64

Value Counts for departure_time:
departure_time
Morning          71146
Early_Morning    66790
Evening          65102
Night            48015
Afternoon        47794
Late_Night        1306
Name: count, dtype: int64

Value Counts for stops:
stops
one            250863
zero            36004
two_or_more     13286
Name: count, dtype: int64

Value Counts for arrival_time:
arrival_time
Night            91538
Evening          78323
Morning          62735
Afternoon        38139
Early_Morning    15417
Late_Night       14001
Name: count, dtype: int64

Value Counts for destination_city:
destination_city
Mumbai       59097
Delhi        57360
Bangalore    51068
Kolkata      49534
Hyderabad    42726
Chennai      40368
Name: count, dtype: int64

Value Counts for class:
class
Economy     206666
Business     93487
Name: coun


Based on our preliminary analysis of the **Clean_Dataset.csv**, we can note the following insights:

- **Data Completeness:**  
  Some columns exhibit missing values. While certain features have minimal missing data, others have a significant percentage of missing entries. This indicates that imputation or careful removal of incomplete records may be necessary to ensure data quality.

- **Data Quality:**  
  Duplicate rows have been detected, suggesting potential redundancy in the dataset. These duplicates should be examined and possibly removed to prevent skewing the analysis.

- **Feature Distribution (Numerical Features):**  
  The descriptive statistics (mean, median, minimum, maximum, etc.) provide a clear view of the spread and central tendency of numerical features such as flight price, duration, and days left. This information is crucial for understanding variability and determining the need for normalization or scaling.

- **Categorical Features:**  
  Analysis of categorical features (e.g., flight class, source city, destination city) using value counts reveals the diversity and distribution of these variables. This is important for selecting the right encoding techniques and identifying any dominant categories that might impact model performance.

- **Next Steps:**  
  Based on these insights, the following actions are planned:
  - **Data Cleaning:** Address missing values through imputation or removal, and eliminate duplicates to enhance data quality.
  - **Feature Engineering:** Transform or create new features as needed, ensuring they contribute positively to the predictive models.
  - **Modeling Setup:** Prepare the data (including normalization and encoding) to set up robust predictive models for flight price forecasting.

These initial insights will help guide the subsequent phases of the data pipeline, ensuring that the dataset is well-prepared for effective predictive modeling.

## Data Transformer Explanation

This step creates a data transformer using scikit-learn's `Pipeline` and `ColumnTransformer`. We have selected the following columns from our dataset because they are key to our flight price prediction model:

- **Numerical Features:**
  - **`duration`**: Represents the flight duration.
  - **`days_left`**: Indicates the number of days remaining until departure.

- **Categorical Features:**
  - **`source_city`**: The departure city.
  - **`departure_time`**: The time of departure (e.g., Morning, Evening).
  - **`stops`**: The number of stops during the flight.
  - **`arrival_time`**: The arrival time.
  - **`destination_city`**: The arrival city.
  - **`class`**: The flight class (e.g., Economy, Business).

### Preprocessing Steps

- **Numerical Pipeline:**
  - **Imputation**: Missing numerical values are replaced with the median value.
  - **Scaling**: The values are standardized using `StandardScaler` to ensure all features contribute equally to the model.

- **Categorical Pipeline:**
  - **Imputation**: Missing categorical values are replaced with the most frequent value.
  - **Encoding**: Categorical features are converted into a one-hot encoded format, ensuring they can be used by machine learning models.

This structured approach ensures that our data is clean, normalized, and ready for the predictive modeling phase.


In [10]:
def get_data_transformer():
    # Define numerical and categorical features
    numerical_features = ['duration', 'days_left']
    categorical_features = ['source_city', 'departure_time', 'stops', 'arrival_time', 'destination_city', 'class']
    
    # Pipeline for numerical features: impute missing values with median and then scale
    num_pipeline = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    
    # Pipeline for categorical features: impute missing values with most frequent value and one-hot encode
    cat_pipeline = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('one_hot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    # Combine numerical and categorical pipelines into a single ColumnTransformer
    preprocessor = ColumnTransformer(transformers=[
        ('num', num_pipeline, numerical_features),
        ('cat', cat_pipeline, categorical_features)
    ])
    
    return preprocessor

# Example usage:
data_transformer = get_data_transformer()
print(data_transformer)

ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 ['duration', 'days_left']),
                                ('cat',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('one_hot',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['source_city', 'departure_time', 'stops',
                                  'arrival_time', 'destination_city',
                                  'class'])])


In [11]:
train_set, test_set = train_test_split(df, random_state=42, test_size=0.3)
train_df = train_set
test_df = test_set 
def initiate_transform_data(train_path, test_path):
   
    
    
    # Get the preprocessing object (assumed to be defined elsewhere)
    preprocess_obj = get_data_transformer()
    
    # Define the target column
    target_column = 'price'
    
    # Separate input features and target for training and testing datasets
    input_feature_train_df = train_df.drop(columns=[target_column])
    target_feature_train_df = train_df[target_column]
    
    input_feature_test_df = test_df.drop(columns=[target_column])
    target_feature_test_df = test_df[target_column]
    
    # Apply preprocessing: fit on train and transform both train and test features
    input_feature_train_arr = preprocess_obj.fit_transform(input_feature_train_df)
    input_feature_test_arr = preprocess_obj.transform(input_feature_test_df)
    
    # If the output is sparse (e.g., from OneHotEncoder), convert it to a dense array
    if hasattr(input_feature_train_arr, "toarray"):
        input_feature_train_arr = input_feature_train_arr.toarray()
    if hasattr(input_feature_test_arr, "toarray"):
        input_feature_test_arr = input_feature_test_arr.toarray()
    
    # Combine the transformed features with the target values
    train_arr = np.c_[input_feature_train_arr, np.array(target_feature_train_df)]
    test_arr = np.c_[input_feature_test_arr, np.array(target_feature_test_df)]
    
    # Return the processed arrays
    return train_arr, test_arr

# Example usage:
train_arr, test_arr = initiate_transform_data(train_df, test_df)
print("Train array shape:", train_arr.shape)
print("Test array shape:", test_arr.shape)

Train array shape: (210107, 32)
Test array shape: (90046, 32)


In [12]:
def evaluate_model(x_train,y_train,x_test,y_test,models,params):
    report ={}
    for i,value in models.items():
        model = value
        params = params[i]
        grid = GridSearchCV(model,param_grid=params,n_jobs=-1,cv=5,)
        grid.fit(x_train,y_train)
        model.set_params(**grid.best_params_)
        model.fit(x_train,y_train)
        y_train_pred = model.predict(x_train)

        model.fit(x_test,y_test)
        y_test_pred = model.predict(x_test)

        train_model_score = r2_score(y_train,y_train_pred)
        test_model_score = r2_score(y_test,y_test_pred)
        report[i] = test_model_score 
        return report

In [13]:
def initiate_model_trainer(train_arr, test_arr):
    # Split train and test arrays into features and target
    x_train, y_train = train_arr[:, :-1], train_arr[:, -1]
    x_test, y_test = test_arr[:, :-1], test_arr[:, -1]
    
    # Define the models
    models = {
        'Linear Regression': LinearRegression(),
        'Decision Tree': DecisionTreeRegressor(),
        'Random Forest': RandomForestRegressor(),
        'Gradient Boosting': GradientBoostingRegressor(),
        'AdaBoost': AdaBoostRegressor(),
        'XGBoost': XGBRegressor(),
        'CatBoost': CatBoostRegressor(verbose=1)
    }
    
    # Define hyperparameter grids for each model
    params = {
        'Linear Regression': [{'fit_intercept': [True, False]}],
        'Decision Tree': [{
            'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
            'splitter': ['best', 'random'],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }],
        'Random Forest': [{
            'n_estimators': [100, 200, 300],
            'criterion': ['squared_error', 'absolute_error', 'poisson'],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }],
        'Gradient Boosting': [{
            'n_estimators': [100, 200, 300],
            'learning_rate': [0.01, 0.1, 0.2],
            'max_depth': [3, 5, 7],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4],
            'subsample': [0.8, 1.0]
        }],
        'AdaBoost': [{
            'n_estimators': [50, 100, 200],
            'learning_rate': [0.01, 0.1, 1.0],
            'loss': ['linear', 'square', 'exponential']
        }],
        'XGBoost': [{
            'n_estimators': [100, 200, 300],
            'learning_rate': [0.01, 0.1, 0.2],
            'max_depth': [3, 5, 7],
            'subsample': [0.8, 1.0],
            'colsample_bytree': [0.8, 1.0]
        }],
        'CatBoost': [{
            'iterations': [500, 1000, 1500],
            'depth': [4, 6, 8, 10],
            'learning_rate': [0.01, 0.05, 0.1],
            'l2_leaf_reg': [1, 3, 5],
            'border_count': [32, 64, 128]
        }]
    }
    
    # Evaluate models using a helper function (assumed to be defined elsewhere)
    model_report = evaluate_model(x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test,
                                  models=models, params=params)
    
    # Print each model's performance
    print("Model Performance (R² Score):")
    for model_name, score in model_report.items():
        print(f"{model_name}: {score:.4f}")
    
    # Select the best model based on the evaluation score
    best_model_name = max(model_report, key=model_report.get)
    best_model = models[best_model_name]
    
    # Predict on test data using the best model and calculate R² score
    y_pred = best_model.predict(x_test)
    final_r2 = r2_score(y_test, y_pred)
    
    print(f"\nBest Model: {best_model_name} with R² Score: {final_r2:.4f}")
    return final_r2

# Example usage:
r2_score_value = initiate_model_trainer(train_arr, test_arr)
print("Final R² Score:", r2_score_value)


Model Performance (R² Score):
Linear Regression: 0.9063

Best Model: Linear Regression with R² Score: 0.9063
Final R² Score: 0.9062781835220856


## Conclusion

After evaluating several regression models, **Linear Regression** emerged as the best-performing model with an R² score of **0.9063**. This indicates that approximately 90.63% of the variance in flight prices is explained by the model using the selected features.

### Why Linear Regression Performed Best

- **Linear Relationships in the Data:**  
  The high performance of the Linear Regression model suggests that the relationship between the selected features (such as flight duration, days left, and various categorical variables) and the flight price is predominantly linear. This means that as these features change, the flight price tends to change in a proportional and predictable manner.

- **Model Simplicity and Interpretability:**  
  Linear Regression is a simple, interpretable model. When the underlying data exhibits a linear trend, more complex models (e.g., Decision Trees, Random Forests, or Gradient Boosting) may not provide additional benefits and might even overfit or become unnecessarily complicated.

- **Effective Feature Engineering:**  
  The data pipeline—including steps for imputation, scaling, and one-hot encoding—has effectively prepared the data. This preprocessing helped expose the inherent linear structure of the dataset, making Linear Regression an excellent fit.

### Next Steps

The complete data pipeline, along with the best-performing Linear Regression model, will be deployed in our web application. Users will be able to input flight details through an interactive form and receive real-time price predictions. This integration ensures that all preprocessing and prediction tasks are handled seamlessly, providing an efficient and user-friendly experience.

Overall, our findings highlight that a straightforward Linear Regression model is well-suited for this dataset, confirming the data's linear nature and demonstrating that a simple, interpretable model can achieve excellent predictive performance.
