Attempting to predict flight fares, using the [Flight fare](https://www.kaggle.com/datasets/yashdharme36/airfare-ml-predicting-flight-fares) dataset

Imports of relevant packages

In [None]:
#data processing
import pandas as pd
import numpy as np

#data visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#Machine learning library
import sklearn

# import warnings
# warnings.filterwarnings("ignore")

1. Introducing the flight fare dataset -- Exploratory Data Analysis (EDA)

In [None]:
dtf = pd.read_csv("./data/data_airfare.csv")

In [None]:
numeric_columns = dtf.dtypes[(dtf.dtypes=="float64") | (dtf.dtypes=="int64")].index.tolist()
categorical_columns = [c for c in dtf.columns if c not in numeric_columns]

In [None]:
cols = ["Date_of_journey","Journey_day","Airline","Flight_code","Class","Source","Departure","Total_stops",
        "Arrival","Destination", "Duration_in_hours", "Days_left", "Fare"]
dtf = dtf[cols]

In [None]:
dtf.head()

In [None]:
dtf.describe()

Examining the target feature - "Fare": Using a histogram, a KDE plot, and a box plot

In [None]:
dtf.Fare.hist()

x: the flight fare
y: the number of flights in each fare range

Exploring ticket fares:
From the table above we learn that the minimum ticket fare is 1,307 and the maximum is 143,019. The mean and median values also differ, which suggests a skewed distribution. Let's visualize the Fare column using a KDE plot.

In [None]:
sns.kdeplot(dtf.Fare)

We can see that the data may contain outliers.

Let's examine the outliers using a box plot

In [None]:
sns.boxplot(x=dtf.Fare, orient="h")

Fare outliers are acceptable here because there are different ticket classes: Economy, Premium Economy, Business and First class.
Although the mean is around 20,000, the box plot shows that the median is approximately 14,000.
In the first graph, the distribution appears to be a mixture of two roughly Gaussian curves: one peak from 1,000 to 30,000, corresponding to the cheap tickets, and a second peak from 40,000 to 80,000, corresponding to the expensive ticket classes.
We have therefore decided not to remove the fare outliers.
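
The check behind this decision can be sketched on toy data. The points a box plot draws beyond its whiskers are, by seaborn's default, the values lying more than 1.5 × IQR outside the quartiles. The helper `iqr_outlier_mask` and the toy fares below are made up for illustration and are not taken from the dataset; the idea is only to see which ticket classes the flagged rows belong to.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: True where a value lies outside the k * IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Toy fares (invented for this sketch, not real rows)
toy = pd.DataFrame({
    "Class": ["Economy"] * 8 + ["Business"] * 2,
    "Fare": [4000, 4500, 5000, 5200, 5500, 6000, 6500, 7000, 45000, 60000],
})

mask = iqr_outlier_mask(toy["Fare"])
flagged = toy[mask]

# Which classes do the flagged fares come from?
print(flagged["Class"].value_counts())
```

If the flagged rows in the real data are dominated by the expensive classes, as in this toy example, they are legitimate observations rather than errors, which supports keeping them.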

Exploring the data and understanding the relationships between the different features in the dataset. (todo)

Cleaning and Preprocessing

Making sure there are no null values in our data.

In [None]:
dtf.isnull().sum()

There are no missing values in our dataset.

In [None]:
#Checking duplicates
dtf.duplicated().sum()

There are 6722 duplicated rows, so let's remove them.

In [None]:
dtf = dtf.drop_duplicates()

In [None]:
#check that the duplicates are gone
dtf.duplicated().sum()

In [None]:
# Rename the column 'Days_left' to 'Advance_days' for clarity
dtf = dtf.rename(columns={'Days_left': 'Advance_days'})

In [None]:
#check distinct values of departure and arrival columns
dtf['Departure'].unique()

In [None]:
dtf['Arrival'].unique()

In [None]:
#map the departure and arrival time ranges to time-of-day categories

# Function to map a time range (departure or arrival) to a category
def map_departure_time_range(departure_time):
    if 'Before 6 AM' in departure_time:
        return 'Early morning'
    elif '6 AM - 12 PM' in departure_time:
        return 'morning'
    elif '12 PM - 6 PM' in departure_time:
        return 'noon'
    else:
        return 'night'

# Apply the function to the columns
dtf['Departure'] = dtf['Departure'].apply(map_departure_time_range)
dtf['Arrival'] = dtf['Arrival'].apply(map_departure_time_range)

dtf.head()

In [None]:
# Convert the "Date_of_journey" column to datetime format
dtf['Date_of_journey'] = pd.to_datetime(dtf['Date_of_journey'])

# Extract the month from the dates
dtf['Month'] = dtf['Date_of_journey'].dt.month

# Group the flights by month and count the number of flights in each month
dtf.groupby('Month').size()

In [None]:
# Derive the month labels from the data instead of hard-coding their order
Month_ = pd.to_datetime(dtf.Date_of_journey).dt.strftime("%b")
Month_.value_counts(normalize = True).\
    plot(kind="barh",title="Flights monthly variations", figsize = [2,2], xlabel = "Relative frequencies")

In [None]:
dtf.Journey_day.value_counts(normalize = True, ascending = True).plot(kind="barh",
 title = "Flights daily variations",xlabel = "Relative frequencies")

The chart shows the daily variations in the dataset, with flights distributed almost evenly across the seven weekdays. We want the frequencies of values in our training and test sets to reflect the daily variations in the original dataset. Therefore, we'll apply a stratified split based on this feature later, when we split the data into train and test sets.
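
As a minimal sketch of what a stratified split does, the toy example below passes a column to the `stratify` argument of scikit-learn's `train_test_split`. The toy frame (10 made-up rows per weekday) and `random_state=0` are assumptions for illustration only; the real split would use the notebook's own data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Toy frame: 70 rows, exactly 10 per weekday (invented values)
toy = pd.DataFrame({"Journey_day": days * 10, "Fare": range(70)})

# stratify keeps the weekday proportions the same in both parts
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["Journey_day"], random_state=0
)

print(train["Journey_day"].value_counts(normalize=True))
print(test["Journey_day"].value_counts(normalize=True))
```

Both printed tables should show the same relative frequency for every weekday, which is the property we want to preserve.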

In [None]:
#we will create a bar chart of value counts for each categorical attribute

# Define the names of categorical columns to remove
columns_to_remove = ["Date_of_journey", "Flight_code"]

# Define the desired order of categorical columns
desired_order = ["Airline", "Departure", "Arrival", "Total_stops", "Journey_day", "Source", "Destination", "Class"]

# Filter categorical columns based on the condition and desired order
categorical_columns_filtered = [c for c in desired_order if c not in columns_to_remove]
                                
n = len(categorical_columns_filtered)
cols = 2
max_bars = 8

rows = (n // cols) + (1 if n % cols != 0 else 0)

#generate a figures grid:
fig, axes = plt.subplots(rows,cols,figsize=(cols*5,rows*5))
fig.subplots_adjust(hspace=0.5)

for i,column in enumerate(categorical_columns_filtered):
    #calculate the current place on the grid
    r, c = divmod(i, cols)
    
    #create the "value counts" for the first <max_bars> categories:
    u=min(dtf[column].nunique(),max_bars)
    vc = dtf[column].value_counts()[:u]
    
    # plot a bar chart using Pandas
    vc.plot(kind='bar',ax=axes[r,c],title=column)
    axes[r, c].set_xlabel('')


In [None]:
#we will create a histogram for each numeric attribute
dtf.Duration_in_hours.hist()

In [None]:
dtf.Advance_days.hist()

In [None]:
data = dtf.groupby("Destination")["Arrival"].value_counts()
data.head(28)

In [None]:
# Create the violin plot
plt.figure(figsize=(12, 8))
sns.violinplot(x='Class', y='Fare', data=dtf)
plt.title('Fare distribution of the flights for each ticket class')
plt.xlabel('Class')
plt.ylabel('Fare')
plt.xticks(rotation=45)
plt.show()

Using the violin plot, we can learn how flight fares in India are distributed within each ticket class.
(todo)

Examining Correlations to the target feature:

In [None]:
numeric_columns = dtf.dtypes[(dtf.dtypes=="float64") | (dtf.dtypes=="int64")].index.tolist()
# Pearson correlation of every numeric column with the target, kept as a single "Fare" row
dtf_corr = dtf[numeric_columns].corr(method="pearson").loc[["Fare"]]
fig, ax = plt.subplots(figsize=(15,2))

sns.heatmap(dtf_corr, annot=True, fmt='.2f', cmap="YlGnBu", cbar=True, linewidths=0.5,ax=ax)

Baseline Model

In [None]:
# We will split the dataset into features and target variables. The target variable is the fare, and all the others are the features.
x = dtf.drop(["Fare"], axis=1)
y = dtf["Fare"]

In [None]:
# Split the dataset into training and testing sets, stratified on Journey_day as planned above
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=x["Journey_day"])
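
The notebook stops at the split. One common way to set a baseline, shown here only as a hedged sketch and not as the model this notebook ends up using, is a predictor that always returns the mean of the training target. scikit-learn provides this as `DummyRegressor(strategy="mean")`. The random toy data, the `_toy` names and the choice of mean absolute error are all assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Toy data (random, invented): 3 numeric features and a fare-like target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = rng.uniform(1_000, 80_000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# The dummy model ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
pred = baseline.predict(X_te)

baseline_mae = mean_absolute_error(y_te, pred)
print(f"Baseline MAE: {baseline_mae:.0f}")
```

Any real model should beat this error on the test set; if it does not, the features are not adding information.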