<a href="https://colab.research.google.com/github/MohitMotivaras/Regression-Model-Development/blob/main/1_Regression_Model_Development.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Regression Model Development



##### **Project Type**    - Regression
##### **Contribution**    - Individual
##### **Team Member 1 -** Mohit Motivaras

# **Project Summary -**

In the realm of travel and tourism, the intersection of data analytics and machine learning presents an opportunity to revolutionize the way travel experiences are curated and delivered. This capstone project revolves around a trio of datasets - users, flights, and hotels - each providing a unique perspective on travel patterns and preferences. The goal is to leverage these datasets to build and deploy sophisticated machine learning models, serving a dual purpose: enhancing predictive capabilities in travel-related decision-making and mastering the art of MLOps through hands-on application.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The goal of this project is to develop a regression model that accurately predicts the price of a flight using data from the flights.csv dataset. This involves analyzing and preprocessing the dataset, selecting relevant features such as origin, destination, flight type, agency, duration, distance, and journey date, and training a machine learning model that generalizes well to unseen data. The model aims to support airlines and travel platforms in pricing strategy, customer guidance, and market competitiveness by providing reliable flight fare predictions based on historical flight attributes.

#### **Define Your Business Objective?**

Develop a regression model that accurately predicts the price of a flight ticket based on various flight-related features provided in the flights.csv dataset.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the questions in the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way, following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
!pip install missingno

In [None]:
# Import Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score
import joblib
from google.colab import files

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Voyage Analytics Integrating MLOps in Travel/flights.csv")

### Dataset First View

In [None]:
# List of specific columns you want to check
columns_to_check = ['from', 'to', 'flightType', 'agency']

# Show unique values in each of the selected columns
for col in columns_to_check:
    print(f"Unique values in '{col}':")
    print(df[col].unique())
    print()

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize=(12,6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

The dataset consists of 271,888 rows and 10 columns. It includes a mix of data types: 3 columns with floating-point values, 2 columns with integer values, and 5 columns containing object (categorical or string) data types.

The dataset is clean, with no duplicate records, and contains no missing or null values.
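
The checks behind this summary can be gathered into one small helper. The sketch below is illustrative only: `summarize_frame` is a hypothetical name that does not appear elsewhere in the notebook, and the toy frame with made-up values stands in for `flights.csv`.

```python
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> dict:
    """Return shape, dtype counts, duplicate rows and missing cells for a DataFrame."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        # Number of columns per dtype, keyed by the dtype's name
        "dtype_counts": {str(k): int(v) for k, v in frame.dtypes.value_counts().items()},
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_cells": int(frame.isnull().sum().sum()),
    }


# Toy data standing in for flights.csv (values are made up for illustration)
toy = pd.DataFrame(
    {
        "agency": ["Rainbow", "CloudFy", "Rainbow"],
        "price": [100.0, 250.5, 100.0],
        "time": [1.5, 2.0, 1.5],
    }
)
summary = summarize_frame(toy)
print(summary)
```

Calling `summarize_frame(df)` on the loaded flights data would report the same quantities as the separate `shape`, `info`, `duplicated` and `isnull` cells above.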

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

travelCode: Identifier for the travel.

userCode: User identifier (linked to the Users dataset).

from: Origin of the flight.

to: Destination of the flight.

flightType: Type of flight (e.g., first class).

price: Price of the flight.

time: Flight duration.

distance: Distance of the flight.

agency: Flight agency.

date: Date of the flight.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    unique_vals = df[col].nunique()
    print(f"Column '{col}' has {unique_vals} unique values.")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df['date'] = pd.to_datetime(df['date'])  # Convert to datetime format

# Create new columns
df['day_of_journey'] = df['date'].dt.day
df['month_of_journey'] = df['date'].dt.month
df['year_of_journey'] = df['date'].dt.year

# To check result
print(df[['date', 'day_of_journey', 'month_of_journey', 'year_of_journey']].head())

In [None]:
df.head()

In [None]:
sales_by_type = df.groupby('flightType')['price'].sum().reset_index()

# Sort descending to see highest sales first
sales_by_type = sales_by_type.sort_values(by='price', ascending=False)

print(sales_by_type)

In [None]:
yearly_sales = df.groupby('year_of_journey')['price'].sum().reset_index()

print(yearly_sales)

In [None]:
# Group by agency and sum the price column
agency_sales = df.groupby('agency')['price'].sum().reset_index().sort_values(by='price', ascending=False)

# Display the result
print(agency_sales)

In [None]:
# Count flights by departure location ('from'), most frequent first
# (value_counts sorts in descending order by default)
flights_from_counts = df['from'].value_counts()

print(flights_from_counts)

In [None]:
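# Label-encode categorical columns; keep the fitted encoders to map codes back to labels
encoders = {col: LabelEncoder().fit(df[col]) for col in ['from', 'to', 'flightType', 'agency']}
for col, encoder in encoders.items():
    df[col] = encoder.transform(df[col])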

In [None]:
df.head()
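
To make the effect of label encoding concrete, the sketch below runs `LabelEncoder` on a few made-up values standing in for the `flightType` column. The toy list and the `mapping` dictionary are illustrative and are not part of the notebook's pipeline.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the 'flightType' column
flight_types = ["premium", "firstClass", "economic", "premium"]

encoder = LabelEncoder()
codes = encoder.fit_transform(flight_types)

# classes_ holds the sorted unique labels; a label's code is its index here
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

Because the codes follow the sorted order of the labels, they carry no pricing or ranking meaning. Tree-based models such as Random Forest can usually work with such codes, whereas a linear model would treat them as ordered quantities.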

In [None]:
corr_matrix = df[['price', 'time', 'distance']].corr()
print(corr_matrix)

In [None]:
X = df[['from', 'to', 'flightType', 'time', 'distance', 'agency', 'day_of_journey', 'month_of_journey', 'year_of_journey']]
y = df['price']  # Target variable


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Initialize and train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
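
For reference, the two metrics printed above can be worked out by hand. The sketch below uses made-up prices rather than the model's output; it also derives the RMSE, which the notebook does not report, because the RMSE is expressed in the same units as the price.

```python
import numpy as np

# Made-up actual and predicted prices, for illustration only
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 310.0, 390.0])

errors = y_true_demo - y_pred_demo
mse_demo = np.mean(errors ** 2)      # mean squared error, in squared price units
rmse_demo = np.sqrt(mse_demo)        # root mean squared error, in price units
ss_res = np.sum(errors ** 2)         # residual sum of squares
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # total sum of squares
r2_demo = 1.0 - ss_res / ss_tot      # coefficient of determination

print(f"MSE: {mse_demo:.2f}, RMSE: {rmse_demo:.2f}, R²: {r2_demo:.3f}")
```

An R² close to 1 means the residual sum of squares is small relative to the total variation in price.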

In [None]:
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"Cross-Validated R² Score: {scores.mean():.2f}")

In [None]:
joblib.dump(model, 'flight_price_model.pkl')

In [None]:
files.download('flight_price_model.pkl')
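
A saved model is only useful if it can be loaded again and still gives the same predictions. The sketch below checks this round trip on a tiny synthetic dataset; the data, the `demo_model` name and the temporary file path are assumptions made for illustration, not the notebook's flight model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic training set (made-up numbers, not the flights data)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1]

demo_model = RandomForestRegressor(n_estimators=10, random_state=42)
demo_model.fit(X_demo, y_demo)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```

For the flight model, any new input must use the same integer codes for `from`, `to`, `flightType` and `agency` that the model saw during training, so the fitted encoders would need to be saved alongside the model.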

### What all manipulations have you done and insights you found?

1. FirstClass flights dominate sales, generating the highest total revenue, more than double the economic class. This indicates strong demand or higher pricing in this segment.

2. Premium class holds the middle ground, with respectable sales but less than FirstClass, showing moderate market preference.

3. Economic flights, despite being usually more affordable and common, contribute the least in total sales value, possibly due to lower ticket prices per flight.

4. 2020 had the highest sales. With over 107 million in total flight revenue, 2020 was the peak year, significantly outperforming all other years.

5. Every other year fell short of 2020. 2021 still performed well (72M), while 2022, 2019, and especially 2023 were weaker.

6. 2023 shows the lowest revenue. With only around 6 million in sales, 2023 had the weakest performance, possibly due to reduced travel demand, market changes, or limited data for that year.

7. Rainbow and CloudFy are the top-performing agencies, contributing nearly 41% each to total flight sales.

8. FlyingDrops lags behind significantly, contributing only about 17.7%, indicating weaker market performance.

9. Sales distribution is highly concentrated in just two agencies, which may suggest strong customer preference or better service offerings from those providers.

10. The departure counts show that Florianopolis (SC) has the highest number of flight departures, significantly more than other locations. Cities like Aracaju (SE), Campo Grande (MS), and Brasilia (DF) also have strong flight activity, while locations such as Salvador (BH) and Rio de Janeiro (RJ) have fewer departures. This highlights the busiest airports and possible regional demand variations for flights.

11. Price is moderately positively correlated with both time and distance (around 0.64), meaning that as flight duration or distance increases, the price tends to increase as well, which is expected in flight pricing.

12. Time and distance have a very strong positive correlation (close to 1.00), indicating that longer distances usually correspond to longer flight durations.

13. This suggests that flight price is influenced by both how far and how long the flight is, but time and distance themselves are almost directly proportional.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:

# Assuming sales_by_type is already calculated:
# sales_by_type = df.groupby('flightType')['price'].sum().reset_index()

sns.set(style="whitegrid")

plt.figure(figsize=(8,5))
barplot = sns.barplot(data=sales_by_type, x='flightType', y='price', palette='viridis')

# Add value labels on top of each bar
for p in barplot.patches:
    height = p.get_height()
    barplot.text(p.get_x() + p.get_width()/2., height + 0.05*height, f'{height:,.0f}',
                 ha="center", va="bottom", fontsize=10)

plt.title('Total Sales by Flight Type')
plt.xlabel('Flight Type')
plt.ylabel('Total Sales (Sum of Price)')

plt.show()

##### 1. Why did you pick the specific chart?

A bar chart makes the total sales of each flight type easy to read and compare side by side.

##### 2. What is/are the insight(s) found from the chart?

FirstClass flights dominate sales, generating the highest total revenue, more than double the economic class. This indicates strong demand or higher pricing in this segment.

Premium class holds the middle ground, with respectable sales but less than FirstClass, showing moderate market preference.

Economic flights, despite being usually more affordable and common, contribute the least in total sales value, possibly due to lower ticket prices per flight.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can drive positive business impact. Understanding that FirstClass flights generate the highest revenue allows the business to prioritize premium services, targeted marketing, and dynamic pricing strategies to maximize profit in this segment. It also highlights growth potential in Premium and Economic classes, where tailored offers or improved customer experience could increase sales.

However, some insights might signal risks of negative growth:
The relatively low sales from the Economic class, despite likely having higher volume, may indicate issues like poor pricing strategy, lack of customer appeal, or strong competition. Ignoring this segment could reduce market share and long-term revenue potential. Additionally, over-reliance on FirstClass sales might be risky if market conditions change (e.g., economic downturns reducing luxury travel demand).
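
Total sales mix two effects: how many tickets were sold and how much each one cost. One way to separate them is to aggregate count and average price alongside the sum, as in the sketch below. The toy bookings are made up and only illustrate the calculation; they are not taken from the flights data.

```python
import pandas as pd

# Made-up bookings standing in for the flights data
toy = pd.DataFrame(
    {
        "flightType": ["firstClass", "firstClass", "premium", "economic", "economic", "economic"],
        "price": [1200.0, 1000.0, 800.0, 400.0, 500.0, 300.0],
    }
)

# Total revenue, number of bookings and average price for each flight type
by_type = (
    toy.groupby("flightType")["price"]
    .agg(total_sales="sum", bookings="count", avg_price="mean")
    .sort_values("total_sales", ascending=False)
)
print(by_type)
```

Running the same aggregation on the real data, before the categorical columns are label-encoded, would show whether the Economic class actually has the higher booking volume assumed above.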

#### Chart - 2

In [None]:
plt.figure(figsize=(8, 5))
bars = plt.barh(yearly_sales['year_of_journey'].astype(str), yearly_sales['price'], color='steelblue')

# Add value labels to each bar
for bar in bars:
    width = bar.get_width()
    plt.text(width + 10000, bar.get_y() + bar.get_height() / 2,
             f'{width:,.0f}', va='center', fontsize=9)

# Chart labels and title
plt.title('Total Flight Sales by Year')
plt.xlabel('Total Sales (Price)')
plt.ylabel('Year')

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Horizontal bars allow easy reading of year labels, especially when there are multiple years or longer text. This avoids clutter and overlapping seen in vertical bar charts.

##### 2. What is/are the insight(s) found from the chart?

This trend suggests 2020 was a high-performing year, but the gradual decline afterward may signal market shifts, changing customer preferences, or external factors (e.g., economic conditions, competition, or post-pandemic behavior). This insight can guide future pricing, promotion, or investment strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights are valuable for strategic planning:

Knowing that 2020 was the peak sales year helps businesses analyze what worked (such as marketing, pricing, or customer demand) and replicate those successful tactics.

Negative growth -
Drastic drop in 2023 (only ~6M in sales) may point to reduced customer interest, economic slowdown, or ineffective business strategies.
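
Before reading a low yearly total as falling demand, it helps to check how much of the year the data actually covers. The sketch below does this on made-up rows; the `coverage` table and its column names are illustrative assumptions.

```python
import pandas as pd

# Made-up rows standing in for the flights data
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2022-01-15", "2022-06-20", "2022-12-05", "2023-01-10", "2023-02-03"]
        ),
        "price": [500.0, 700.0, 600.0, 400.0, 450.0],
    }
)

# Group by calendar year, using a separately named key to avoid clashing with 'date'
year = toy["date"].dt.year.rename("year")

# For each year: first and last flight date, months with at least one flight, total sales
coverage = toy.groupby(year).agg(
    first_date=("date", "min"),
    last_date=("date", "max"),
    months_covered=("date", lambda s: s.dt.month.nunique()),
    total_sales=("price", "sum"),
)
print(coverage)
```

If a year covers only a few months, its total is not directly comparable with a full year, and a per-month average would be a fairer basis for comparison.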

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8, 8))
plt.pie(
    agency_sales['price'],
    labels=agency_sales['agency'],
    autopct=lambda pct: f'{pct:.1f}%\n({int(pct/100.*agency_sales["price"].sum()):,})',
    startangle=140,
    colors=plt.cm.Set3.colors
)

plt.title('Agency-wise Total Flight Sales', fontsize=14)
plt.axis('equal')  # Keeps pie chart circular
plt.show()

##### 1. Why did you pick the specific chart?

Compare overall contributions of each agency at a glance.

Highlight which agency dominates the sales market (e.g., Rainbow and CloudFy here).

Show both percentage share and exact sales figures, making the insights more informative and visually intuitive.

##### 2. What is/are the insight(s) found from the chart?

Rainbow and CloudFy are the top-performing agencies, contributing nearly 41% each to total flight sales.

FlyingDrops lags behind significantly, contributing only about 17.7%, indicating weaker market performance.

Sales distribution is highly concentrated in just two agencies, which may suggest strong customer preference or better service offerings from those providers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Yes, these insights can drive positive impact by helping the company focus resources and marketing efforts on the top-performing agencies like Rainbow and CloudFy to maximize revenue. Understanding their success factors could help replicate their strategies across other agencies or regions. Additionally, identifying the lagging agency (FlyingDrops) allows targeted interventions, such as improving service quality or promotional campaigns, to boost their sales.

Insights Leading to Negative Growth:

The heavy sales concentration in just two agencies indicates a lack of diversification, which is risky. If either Rainbow or CloudFy faces operational issues or loses market share, it could significantly impact overall sales. Moreover, the poor performance of FlyingDrops may reflect internal inefficiencies or competitive weaknesses that, if unaddressed, could lead to further loss of market share and negative growth for that agency.
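
The percentage shares discussed for this chart can also be computed directly rather than read off the pie labels. The sketch below uses made-up totals, chosen only to make the arithmetic easy to follow; they are not the notebook's actual figures.

```python
import pandas as pd

# Made-up totals standing in for the agency_sales table
agency_totals = pd.DataFrame(
    {
        "agency": ["Rainbow", "CloudFy", "FlyingDrops"],
        "price": [410.0, 410.0, 180.0],
    }
)

# Each agency's share of total sales, in percent
agency_totals["share_pct"] = 100 * agency_totals["price"] / agency_totals["price"].sum()
print(agency_totals)
```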

#### Chart - 4

In [None]:
# Chart - 4 visualization code

flights_from_counts.plot(kind='bar', figsize=(10,6), color='skyblue')
plt.title('Number of Flights by Departure Location')
plt.xlabel('Departure Location')
plt.ylabel('Number of Flights')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

I picked the vertical bar chart because it clearly shows the number of flights departing from each location, making it easy to compare flight volumes across different cities. Bar charts are ideal for displaying categorical data with counts, and the vertical layout with labeled x-axis helps highlight which departure locations have the highest or lowest number of flights at a glance.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that Florianopolis (SC) has the highest number of flight departures, significantly more than other locations. Cities like Aracaju (SE), Campo Grande (MS), and Brasilia (DF) also have strong flight activity, while locations such as Salvador (BH) and Rio de Janeiro (RJ) have fewer departures. This highlights the busiest airports and possible regional demand variations for flights.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can drive positive business impact by helping airlines and agencies focus marketing, resources, and flight schedules on high-demand locations like Florianopolis, maximizing revenue. Conversely, airports with fewer departures, like Salvador (BH) and Rio de Janeiro (RJ), might indicate lower demand or operational challenges, which could signal negative growth or missed opportunities. Addressing these through promotions or improved services could help reverse such trends.


#### Chart - 13

In [None]:
# Get feature importances from the trained model
importances = model.feature_importances_

# Get feature names from your X dataframe
feature_names = X.columns

# Create a DataFrame for easy plotting
feat_imp_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})

# Sort by importance
feat_imp_df = feat_imp_df.sort_values(by='Importance', ascending=False)

# Plot
plt.figure(figsize=(10, 6))
plt.barh(feat_imp_df['Feature'], feat_imp_df['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.title('Feature Importance in Random Forest Regressor')
plt.gca().invert_yaxis()  # Most important on top
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Random Forest natively provides feature importance as numeric scores.

A bar chart is a natural and effective way to visualize these scores in descending order.

##### 2. What is/are the insight(s) found from the chart?

The feature importance chart reveals that distance, flight time, and flight type are the most influential factors in predicting flight prices, indicating that longer and more time-consuming flights generally cost more, and that the flight type (e.g., FirstClass or Economic) significantly impacts pricing. Additionally, route-related features such as departure (from) and arrival (to) locations also play a meaningful role, suggesting that specific city pairs may have consistently higher or lower fares. In contrast, date-related features like day, month, and year of journey contribute less to the model, implying that seasonality has a limited impact on pricing in this dataset. These insights can help prioritize which variables matter most when analyzing or optimizing flight pricing strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the feature importance chart can drive positive business impact. Knowing that factors like distance, flight time, flight type, and route (from/to) have the greatest influence on price allows airlines and travel platforms to optimize pricing strategies. This can lead to smarter dynamic pricing, targeted promotions on specific routes, and better inventory management based on what truly drives customer pricing.

However, there is a slight risk if the insights are misinterpreted. For example, the low importance of date-related features might suggest that seasonality or holidays don’t affect flight prices, which may not be entirely true. If these time-based effects are not properly captured or engineered (e.g., weekday vs weekend, holidays), it could lead to missed revenue during peak demand periods, impacting growth. So while the insights are valuable, they must be applied thoughtfully.
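
A weekday/weekend feature of the kind mentioned above could be derived from the date column as follows. This is a minimal sketch on made-up dates; the column names `day_of_week` and `is_weekend` are assumptions and are not used elsewhere in the notebook.

```python
import pandas as pd

# Made-up dates standing in for the 'date' column
toy = pd.DataFrame({"date": pd.to_datetime(["2021-03-05", "2021-03-06", "2021-03-08"])})

# Monday=0 ... Sunday=6
toy["day_of_week"] = toy["date"].dt.dayofweek
# 1 for Saturday/Sunday, 0 otherwise
toy["is_weekend"] = (toy["day_of_week"] >= 5).astype(int)
print(toy)
```

Adding such columns to `X` and retraining would show whether calendar effects carry more signal than the raw day, month and year values suggest.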




#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
corr_matrix = df[['price', 'time', 'distance']].corr()

# Plot heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix: Price, Time, Distance")
plt.show()

##### 1. Why did you pick the specific chart?

I picked the correlation heatmap because it provides a clear and intuitive visual summary of the relationships between multiple numerical variables at once. This chart makes it easy to see the strength and direction of correlations between the flight price, travel time, and distance, helping us quickly identify which features are closely related. The color gradient and annotated values allow for easy interpretation and comparison, which is much more efficient than looking at raw numbers or multiple pairwise scatter plots. This makes the heatmap an ideal choice for exploring feature dependencies in a regression or predictive modeling context.

##### 2. What is/are the insight(s) found from the chart?

Price is moderately positively correlated with both time and distance (around 0.64), meaning that as flight duration or distance increases, the price tends to increase as well, which is expected in flight pricing.

Time and distance have a very strong positive correlation (close to 1.00), indicating that longer distances usually correspond to longer flight durations.

This suggests that flight price is influenced by both how far and how long the flight is, but time and distance themselves are almost directly proportional.
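
Near-duplicate features like these can be flagged programmatically from the correlation matrix. The sketch below uses synthetic data in which time is roughly proportional to distance; the 0.95 threshold and all generated values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: time is roughly proportional to distance
rng = np.random.default_rng(0)
distance = rng.uniform(100, 1000, size=200)
toy = pd.DataFrame(
    {
        "distance": distance,
        "time": distance / 400 + rng.normal(0, 0.01, size=200),
        "price": 0.5 * distance + rng.normal(0, 150, size=200),
    }
)

corr = toy.corr()

# List feature pairs whose absolute correlation exceeds a chosen threshold
threshold = 0.95  # arbitrary cut-off for this illustration
features = ["distance", "time"]
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

When two features are this strongly correlated, a Random Forest's impurity-based importances tend to split credit between them, so their individual importance scores are best read together.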


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Develop a regression model that accurately predicts the price of a flight ticket based on various flight-related features.

# **Conclusion**

In this project, a Random Forest regression model was developed to predict flight ticket prices using various flight-related features from the dataset. After preprocessing and encoding categorical variables, the model demonstrated strong predictive accuracy with low error and a high R² score, validated through cross-validation. Feature importance analysis revealed that factors such as distance, flight time, and flight type are the most influential in determining price, offering valuable insights for pricing strategies. Overall, the model provides a reliable tool to support airlines and travel agencies in optimizing pricing decisions and improving revenue management.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***