# Flight Price Exploratory Data Analysis

## Project Overview

This project is an Exploratory Data Analysis (EDA) of flight prices.
Based on two datasets, one for each flight class, we:

- create a clean dataset,
- explore the data to investigate which factors (class, airline, booking time, etc.) have the biggest impact on price,
- and finally try to predict the ticket price by building a model.

We simulate that the current date, used for predicting prices based on the time remaining until departure, is '10-02-2022', because our data covers the period between February 11 and March 31, 2022.
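
As a minimal sketch of how the simulated current date turns into a days-left value, the snippet below subtracts the booking date from a departure date. The names `SIMULATED_TODAY` and `days_until_departure` are hypothetical and not taken from the project code, and the date string is assumed to be in day-month-year order (10 February 2022).

```python
from datetime import date

# Simulated "current" date from the overview (assumption: DD-MM-YYYY, i.e. 10 February 2022).
SIMULATED_TODAY = date(2022, 2, 10)

def days_until_departure(departure: date, today: date = SIMULATED_TODAY) -> int:
    """Return the number of whole days between the booking date and the departure date."""
    return (departure - today).days

first_day = days_until_departure(date(2022, 2, 11))  # first departure date in the data
last_day = days_until_departure(date(2022, 3, 31))   # last departure date in the data
print(first_day, last_day)
```

Under this assumption the two calls give 1 and 49, which agrees with the Days Left range reported in the dataset summary.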

## How to run this

### Setup

To set up this project we use the `./run.sh` script, which should work on both Windows and Linux.

The setup has the following steps:

- creating a Python virtual environment using `venv` (which should be preinstalled),
- activating this environment,
- installing dependencies from `requirements.txt` in it,
- cleaning and preparing the main dataset that will be used in this notebook.

## Research Questions

- **Class Difference**: How much more do Business-class tickets cost compared to Economy?
- **Airline Comparison**: Do ticket prices differ a lot between airlines?
- **Last-minute Booking**: How are prices affected if you book just 1 or 2 days before departure?
- **Timing Effects**: Does the departure or arrival time of day change ticket prices?
- **Route Impact**: How do different source/destination cities influence price?
- **Key Drivers**: Which features seem to influence the price the most?

## Dataset Overview

The dataset has 11 columns:

- Airline: Name of one of 8 airlines.
- Flight: Flight code (like SG-8709).
- Source City: Name of one of 6 cities from which the flight departs.
- Departure Time: Part of the day of the departure.
- Stops: Number of stops on the route.
- Arrival Time: Part of the day of the arrival.
- Destination City: Name of one of 6 cities where the flight lands.
- Class: Ticket class, Economy or Business.
- Duration: Total flight time in hours (numeric).
- Days Left: Number of days between the booking date and the flight date (numeric).
- Price: Ticket price (numeric).

All columns except price will be used as inputs to predict price.


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.neighbors import KNeighborsRegressor
from catboost import CatBoostRegressor

sns.set_style("whitegrid")

In [None]:
df = pd.read_csv("data/clean_dataset.csv", index_col=0)
print(f"There are {df.shape[0]} observations and {df.shape[1]} columns in the data.")
df.head()

In [None]:
print(df.isnull().sum())

No missing values were found.


In [None]:
df.describe(include='all').T

# Summary

The summary shows that the data is well-structured, with a count of 300,257 non-null observations (rows) for each column.

## Categorical Features

- **Airline**: There are 8 unique airlines, with Vistara being the most frequent.

- **Flight**: The dataset contains 1,569 unique flight numbers.

- **Departure/Arrival City**: There are 6 unique source and destination cities, with Delhi being the most frequent departure city and Mumbai the most frequent destination city.

- **Departure/Arrival Time**: The data is categorized into 5 unique time slots, with Morning being the most frequent departure time and Night being the most frequent arrival time.

- **Class**: There are two unique classes, Economy and Business.

## Numerical Features

- **Duration**: The mean flight duration is approximately 12.2 hours. The duration ranges from as short as 0.83 hours (50 minutes) to as long as 49.83 hours.

- **Days Left**: The number of days left until departure ranges from 1 to 49 days, with an average of approximately 26 days.

- **Price**: The average ticket price is approximately $20,884, but the standard deviation is very high ($22,695), suggesting a wide spread of prices. This is further supported by the 25th percentile being around $4,783 and the 75th percentile being around $42,521.

In [None]:
plt.figure(figsize=(12,5))
plt.subplot(1, 2, 1)
sns.boxplot(x='price', data=df)
plt.title("Ticket Price Boxplot")

plt.subplot(1, 2, 2)
sns.histplot(x='price', data=df, kde=True)
plt.title("Ticket Price Distribution")
plt.show()

The boxplot and histogram for ticket price reveal key characteristics of our data.

The distribution is heavily right-skewed, as evidenced by a long tail of high-priced flights and a large cluster of outliers.

The histogram's bimodal shape, with two distinct peaks, strongly suggests the presence of two price clusters: one for more affordable economy class tickets and another for higher-priced business class tickets.

This observation explains the significant difference between the mean and median price.
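
To illustrate why a two-cluster price distribution pulls the mean away from the median, here is a small sketch on toy data. The prices are invented for illustration and are not taken from the dataset; only the `class` and `price` column names follow the notebook.

```python
import pandas as pd

# Toy data (made up): many cheap Economy fares and fewer, much more expensive Business fares.
toy = pd.DataFrame({
    "class": ["Economy"] * 6 + ["Business"] * 3,
    "price": [4000, 5000, 5000, 6000, 6000, 7000, 45000, 50000, 55000],
})

# Mean and median over all tickets, then separately for each class.
overall = toy["price"].agg(["mean", "median"])
per_class = toy.groupby("class")["price"].agg(["mean", "median"])

print(overall)    # the overall mean sits far above the overall median
print(per_class)  # within each class, mean and median coincide in this toy example
```

The same two `agg` calls could be run on the notebook's `df` to check whether the gap between mean and median largely disappears once the classes are separated.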


# Average Ticket Price by Airline and Class

The barplot visualizes the average ticket price, broken down by both airline and class.

This gives us a clear understanding of the price differentiation between different airlines and ticket classes.


In [None]:
plt.figure(figsize=(8,6))
sns.barplot(x='airline', y='price', hue='class', data=df)
plt.title("Average Ticket Price by Airline and Class", fontsize=14)
plt.ylabel("Avg Price")
plt.show()

# Price: Economy vs Business

To compare Economy and Business class prices directly, we use the bar plot of average price by airline and class shown above. Business tickets are expected to be more expensive; the plot shows how large the difference is.

## Insight

Business class is only offered by Air India and Vistara in this dataset. Business tickets are dramatically more expensive: roughly 4–5 times the Economy price for the same airline.
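
One way to put a number on this ratio is to divide the mean Business price by the mean Economy price per airline. The sketch below is an assumption about how this could be done, not the notebook's own code; `business_to_economy_ratio` is a hypothetical helper and the toy prices are invented.

```python
import pandas as pd

def business_to_economy_ratio(frame: pd.DataFrame) -> pd.Series:
    """Mean Business price divided by mean Economy price, per airline."""
    means = frame.pivot_table(index="airline", columns="class", values="price", aggfunc="mean")
    # Airlines without Business fares produce NaN here and are dropped.
    return (means["Business"] / means["Economy"]).dropna()

# Toy rows (invented values) using the same column names as the notebook's df.
toy = pd.DataFrame({
    "airline": ["Vistara", "Vistara", "Air India", "Air India", "AirAsia"],
    "class":   ["Economy", "Business", "Economy", "Business", "Economy"],
    "price":   [8000, 40000, 7000, 28000, 4000],
})
ratios = business_to_economy_ratio(toy)
print(ratios)
```

In the notebook this would be called as `business_to_economy_ratio(df)`.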


# Price vs Airline

Since class heavily affects the price, we will compare airlines within economy class.


In [None]:
economy_df = df[df['class']=='Economy']
plt.figure(figsize=(8,6))
sns.barplot(x='airline', y='price', data=economy_df)
plt.title("Average Economy Ticket Price by Airline", fontsize=14)
plt.ylabel("Avg Price (Economy)")
plt.show()

This shows that AirAsia tends to have the cheapest tickets, while Vistara and Air India are more expensive among economy flights.

In general, prices vary by airline, even in the same class, though not as dramatically as the class difference.


# Booking Time (Days Left vs Price)

Next, we check how the time of booking affects price. In general, last-minute tickets (1–2 days before departure) are expected to be more expensive.

In [None]:
avg_price_by_days = df.groupby('days_left')['price'].mean().reset_index()
plt.figure(figsize=(8,5))
sns.lineplot(data=avg_price_by_days, x='days_left', y='price')
plt.title("Average Ticket Price vs Days Before Departure", fontsize=14)
plt.ylabel("Avg Price")
plt.xlabel("Days Left (Booking vs Departure)")
plt.show()

This reveals a clear pattern: tickets bought very close to the departure date tend to cost more.

For example, booking 1-2 days ahead can be much more expensive than booking weeks in advance.
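
To quantify "much more expensive", the average price of late bookings can be compared with the average price of all other bookings. The helper `last_minute_premium` and its `cutoff` parameter are hypothetical, and the toy prices are invented for illustration.

```python
import pandas as pd

def last_minute_premium(frame: pd.DataFrame, cutoff: int = 2) -> float:
    """Mean price for days_left <= cutoff divided by the mean price of all other bookings."""
    late = frame.loc[frame["days_left"] <= cutoff, "price"].mean()
    early = frame.loc[frame["days_left"] > cutoff, "price"].mean()
    return late / early

# Toy rows (invented values) using the notebook's column names.
toy = pd.DataFrame({
    "days_left": [1, 2, 10, 20, 30, 40],
    "price":     [12000, 10000, 6000, 5000, 5000, 4000],
})
premium = last_minute_premium(toy)
print(premium)  # values above 1 mean late bookings cost more on average
```

Because class dominates price, it may be more informative to apply this per class, for example on `economy_df`, so that a different Economy/Business mix on late dates does not distort the ratio.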


# Time of Day & Route

This section investigates how ticket price is influenced by the time of day and the flight route.

In [None]:
plt.figure(figsize=(8, 5))
sns.boxplot(x="departure_time", y="price", hue="departure_time", data=df, palette="Set2", legend=False)
plt.title("Flight Price by Time of Day")
plt.xlabel("Departure Time")
plt.ylabel("Price")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

The boxplot shows that, while there is a wide range of prices at all times of the day, the median prices for flights departing in the Early Morning, Morning, Afternoon, and Night are all quite similar.

This suggests that the time of day is not the primary factor influencing the base ticket price.

In [None]:
df["route"] = df["source_city"] + " → " + df["destination_city"]

plt.figure(figsize=(10, 6))
sns.boxplot(x="route", y="price", hue="route", data=df, palette="Set2", legend=False)
plt.title("Flight Price by Route")
plt.xlabel("Route")
plt.ylabel("Price")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

df.drop(columns=["route"], inplace=True)

This boxplot reveals that the flight route has a more significant impact on price than the time of day.

The median prices vary greatly between different routes.

For example, some routes, like Hyderabad -> Bangalore, appear to be less expensive on average compared to others, such as Delhi -> Hyderabad.

This suggests that route-specific factors, such as supply and demand, play a key role in pricing.
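
The medians read off the boxplot can also be listed as a table. The following is a sketch under the assumption that `source_city` and `destination_city` hold plain strings; `median_price_by_route` is a hypothetical helper and the toy prices are invented.

```python
import pandas as pd

def median_price_by_route(frame: pd.DataFrame) -> pd.Series:
    """Median price per source/destination pair, cheapest route first."""
    # Cast to str so this also works if the columns were already converted to 'category'.
    route = frame["source_city"].astype(str) + " → " + frame["destination_city"].astype(str)
    return frame.groupby(route)["price"].median().sort_values()

# Toy rows (invented values) using the notebook's column names.
toy = pd.DataFrame({
    "source_city":      ["Hyderabad", "Hyderabad", "Delhi", "Delhi", "Delhi"],
    "destination_city": ["Bangalore", "Bangalore", "Hyderabad", "Hyderabad", "Hyderabad"],
    "price":            [5000, 7000, 9000, 12000, 50000],
})
medians = median_price_by_route(toy)
print(medians)
```

Using the median rather than the mean keeps a few very expensive tickets on a route from dominating the comparison.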

# Summary

- **Class (Business vs Economy)**: Massive price difference. A very strong effect - business class costs much more.
- **Days Left**: Booking very late tends to raise price significantly.
- **Airline**: Some airlines tend to be expensive (Vistara/Air India) while others seem to be cheaper (AirAsia).
- **Time of Day**: Despite the slightly lower price in the afternoon, this doesn't seem to be a significant factor for the price.
- **Route**: The flight route has a more notable impact than the time of day, as demand and other factors cause significant price variations between different routes.


# Machine Learning Model

Now we'll build a predictive model for ticket price.

First, we prepare the data by encoding the categorical features and splitting it into train/test sets. Then we try a few regressors to see which performs best (using cross-validated R²):
- Linear Regression,
- KNN,
- XGBoost,
- CatBoost.

Then we'll apply the best one on the test set.
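
As a small illustration of the categorical encoding step, the sketch below converts string columns to the pandas `category` dtype and replaces them with integer codes. The toy values, including the `stops` labels, are assumptions made for illustration and may not match the actual dataset.

```python
import pandas as pd

# Toy rows (invented values); only the column names follow the notebook.
toy = pd.DataFrame({
    "class": ["Economy", "Business", "Economy"],
    "stops": ["zero", "one", "two_or_more"],
})

encoded = toy.copy()
for col in ["class", "stops"]:
    # Categories inferred from strings are sorted alphabetically, and codes follow that order.
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Note that the codes follow alphabetical order, so `stops` ends up with one = 0, two_or_more = 1, zero = 2 rather than a natural ordering. Tree-based models are generally less sensitive to such arbitrary orderings than Linear Regression or KNN, which treat the codes as magnitudes.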

In [None]:
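def preprocessing(df):
    """Cast the categorical columns to the pandas 'category' dtype (modifies df in place)."""
    categorical_cols = ['airline','flight','source_city','departure_time','arrival_time','destination_city','class','stops']
    for col in categorical_cols:
        df[col] = df[col].astype('category')
    return df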

df = preprocessing(df)

In [None]:
X = df.copy()
y = X.pop("price")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)

In [None]:
models = {
    "KNN": KNeighborsRegressor(n_neighbors=50),
    "LinearRegression": LinearRegression(),
    "XGB": XGBRegressor(n_jobs=5, learning_rate=0.1, max_depth=10, random_state=1),
    "CatBoost": CatBoostRegressor(logging_level='Silent', iterations=500, random_state=1)
}

In [None]:
from sklearn.model_selection import cross_val_score

def score_model(X, y, model):
    X_enc = X.copy()
    for col in X_enc.select_dtypes(['category']):
        X_enc[col] = X_enc[col].cat.codes
    scores = cross_val_score(model, X_enc, y, cv=5, scoring="r2")
    return scores.mean()

# Model Evaluation

Here we evaluate each model to see which one works best for our task.

In [None]:
for name, model in models.items():
    r2 = score_model(X_train, y_train, model)
    print(f"{name} R2 (5-fold CV): {r2:.3f}")

XGBRegressor gives the highest R² (~0.98) in 5-fold cross-validation on the training set, followed closely by CatBoost.

This suggests XGBoost is capturing the data patterns best (likely because of non-linear effects like class).

In [None]:
best_model = XGBRegressor(n_jobs=5, learning_rate=0.1, max_depth=10, random_state=1)

X_train_enc = X_train.copy()
X_test_enc = X_test.copy()
for col in X_train_enc.select_dtypes(['category']):
    X_train_enc[col] = X_train_enc[col].cat.codes
    X_test_enc[col] = X_test_enc[col].cat.codes

best_model.fit(X_train_enc, y_train)
pred = best_model.predict(X_test_enc)

test_r2 = r2_score(y_test, pred)
test_mae = mean_absolute_error(y_test, pred)
print(f"Test set R²: {test_r2:.4f}")
print(f"Test set MAE: {test_mae:.2f}")

# Model Prediction Result

Our trained XGBoost model has yielded excellent results on the test dataset.

- **High R² Score**: With an R² of 0.9880, the model explains nearly 99% of the variance in the ticket prices.
This indicates that the features we used (class, days left, airline, etc.) are highly predictive and that the model has learned the underlying patterns in the data exceptionally well.

- **Low Mean Absolute Error (MAE)**: The model's MAE of 1273.89 shows that, on average, the predicted price is off by about $1,274. 
Given that the mean ticket price in our dataset is over $20,000, this is a very small and acceptable error margin.

- **Generalization**: The strong performance on the unseen test data suggests that the model generalizes well and is not badly overfitting to the training data.
This means it can be expected to make accurate predictions on new flight data that resembles this dataset (same airlines, routes, and booking window).
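
For reference, the two metrics above can be written out by hand. This is a plain-Python sketch of the standard definitions, not the scikit-learn implementation the notebook calls; the helper names and toy values are invented for illustration.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values, invented for illustration.
y_true = [4000, 6000, 50000, 60000]
y_pred = [5000, 5000, 52000, 58000]

mae = mean_absolute_error_manual(y_true, y_pred)
r2 = r2_manual(y_true, y_pred)
relative_mae = mae / (sum(y_true) / len(y_true))  # MAE as a fraction of the mean price
print(mae, r2, relative_mae)
```

Applied to the figures reported above, 1273.89 / 20,884 is roughly 0.061, so the average error is about 6% of the mean ticket price.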

## Conclusion

In conclusion, the combination of our well-structured and cleaned dataset and the powerful XGBoost algorithm has resulted in a highly accurate and robust predictive model for flight ticket prices.

The results are a testament to the effectiveness of the Exploratory Data Analysis and data preprocessing steps we performed.