# Flight Price Predictions for _EaseMyTrip.com_

Springboard Data Science Career Track Capstone Project 2

The data, [imported from _Kaggle_](https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction), has already been cleaned and made available by the publisher.

**Importing the relevant libraries/modules for the data collection.**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

Even so, we should look at what we have and double-check that nothing is out of the ordinary.

In [2]:
# Converting the flight data into a Pandas dataframe
# Then checking the info
flight_df = pd.read_csv('../Capstone Project 2/Capstone 2 Data/Clean_Dataset.csv')
flight_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300153 entries, 0 to 300152
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        300153 non-null  int64  
 1   airline           300153 non-null  object 
 2   flight            300153 non-null  object 
 3   source_city       300153 non-null  object 
 4   departure_time    300153 non-null  object 
 5   stops             300153 non-null  object 
 6   arrival_time      300153 non-null  object 
 7   destination_city  300153 non-null  object 
 8   class             300153 non-null  object 
 9   duration          300153 non-null  float64
 10  days_left         300153 non-null  int64  
 11  price             300153 non-null  int64  
dtypes: float64(1), int64(3), object(8)
memory usage: 27.5+ MB


Since the publisher already cleaned most of this data, it makes sense that it looks consistent. All 12 columns have 300,153 non-null entries, so there are no missing values.

Still, let's take a closer look at the table itself.

In [3]:
flight_df.head(2)

| | Unnamed: 0 | airline | flight | source_city | departure_time | stops | arrival_time | destination_city | class | duration | days_left | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | SpiceJet | SG-8709 | Delhi | Evening | zero | Night | Mumbai | Economy | 2.17 | 1 | 5953 |
| 1 | 1 | SpiceJet | SG-8157 | Delhi | Early_Morning | zero | Morning | Mumbai | Economy | 2.33 | 1 | 5953 |


**Let's get rid of the "Unnamed" index column**

In [4]:
flight_df = flight_df.drop(labels='Unnamed: 0', axis=1)
flight_df.head(2)

| | airline | flight | source_city | departure_time | stops | arrival_time | destination_city | class | duration | days_left | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SpiceJet | SG-8709 | Delhi | Evening | zero | Night | Mumbai | Economy | 2.17 | 1 | 5953 |
| 1 | SpiceJet | SG-8157 | Delhi | Early_Morning | zero | Morning | Mumbai | Economy | 2.33 | 1 | 5953 |
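An alternative to dropping the column after loading is to have `read_csv` treat the first column as the index. The sketch below is illustrative only: it uses a tiny in-memory CSV with made-up rows (and a hypothetical `sample_df` name) in place of the real file.

```python
import io

import pandas as pd

# Tiny stand-in for Clean_Dataset.csv: the first, unnamed column holds the old row index.
csv_text = (
    ",airline,flight,price\n"
    "0,SpiceJet,SG-8709,5953\n"
    "1,SpiceJet,SG-8157,5953\n"
)

# index_col=0 turns the first column into the index,
# so no 'Unnamed: 0' data column is created.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample_df.columns.tolist())
```

Either approach ends with the same columns; `index_col=0` simply avoids the extra `drop` step.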


In [5]:
num_rows = flight_df.shape[0]
num_feats = flight_df.shape[1]
print(f'The data has {num_rows} observations and {num_feats} features to work with.')

The data has 300153 observations and 11 features to work with.


**Let's double check to make sure that there aren't any missing values.**

In [6]:
print(flight_df.isnull().sum())

airline             0
flight              0
source_city         0
departure_time      0
stops               0
arrival_time        0
destination_city    0
class               0
duration            0
days_left           0
price               0
dtype: int64


Great! Still no missing values to be concerned about.
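Missing values are only one kind of problem. As a rough sketch of two further sanity checks (not something the original notebook runs), the example below counts fully duplicated rows and rows with a non-positive numeric value. The `sample_df` rows are made up for illustration; in the notebook the same calls would be applied to `flight_df`.

```python
import pandas as pd

# Made-up rows for illustration only.
sample_df = pd.DataFrame({
    "airline": ["SpiceJet", "SpiceJet", "Vistara"],
    "duration": [2.17, 2.17, 2.25],
    "days_left": [1, 1, 1],
    "price": [5953, 5953, 5955],
})

# Count rows that are identical, in every column, to an earlier row.
n_duplicates = int(sample_df.duplicated().sum())

# Count rows where any numeric column is zero or negative,
# which would be implausible for a duration, a lead time, or a price.
numeric_cols = ["duration", "days_left", "price"]
n_non_positive = int((sample_df[numeric_cols] <= 0).any(axis=1).sum())

print(f"Duplicate rows: {n_duplicates}")
print(f"Rows with a non-positive numeric value: {n_non_positive}")
```

Whether duplicates are actually a problem depends on the dataset, so a non-zero count is a prompt to investigate rather than an automatic reason to drop rows.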

**Let's look at which features have numeric values and categorical values.**

In [7]:
flight_df.dtypes

airline              object
flight               object
source_city          object
departure_time       object
stops                object
arrival_time         object
destination_city     object
class                object
duration            float64
days_left             int64
price                 int64
dtype: object

**Checking how many unique values are in each feature**

In [8]:
flight_df.nunique().sort_values(ascending=False)

price               12157
flight               1561
duration              476
days_left              49
airline                 6
source_city             6
departure_time          6
arrival_time            6
destination_city        6
stops                   3
class                   2
dtype: int64

Looking at the first few rows of the dataframe and the unique-value counts for each column, it is safe to conclude that _price_, _duration_, and _days_left_ are our continuous-valued features. The rest (except _flight_) are our categorical features.
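The numeric versus non-numeric split can also be derived from the dtypes instead of being typed out by hand. This is a minimal sketch on a made-up `sample_df`; the variable names are illustrative. Dtypes alone would not exclude a high-cardinality column such as _flight_, so that decision still has to be made separately.

```python
import pandas as pd

# Made-up rows for illustration only.
sample_df = pd.DataFrame({
    "airline": ["SpiceJet", "Vistara"],
    "stops": ["zero", "one"],
    "duration": [2.17, 2.25],
    "days_left": [1, 1],
    "price": [5953, 5955],
})

# Columns with an integer or float dtype.
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()

# Everything else (here, the string columns).
non_numeric_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(f"Numeric: {numeric_cols}")
print(f"Non-numeric: {non_numeric_cols}")
```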

In [9]:
# cat_cols will be our list of categorical features
cat_cols = ['airline','source_city', 'destination_city', 
            'departure_time', 'arrival_time', 'stops', 'class']

for col in cat_cols:
    unique_values = flight_df[col].unique()
    print(f"Unique values in '{col}': {unique_values}")

Unique values in 'airline': ['SpiceJet' 'AirAsia' 'Vistara' 'GO_FIRST' 'Indigo' 'Air_India']
Unique values in 'source_city': ['Delhi' 'Mumbai' 'Bangalore' 'Kolkata' 'Hyderabad' 'Chennai']
Unique values in 'destination_city': ['Mumbai' 'Bangalore' 'Kolkata' 'Hyderabad' 'Chennai' 'Delhi']
Unique values in 'departure_time': ['Evening' 'Early_Morning' 'Morning' 'Afternoon' 'Night' 'Late_Night']
Unique values in 'arrival_time': ['Night' 'Morning' 'Early_Morning' 'Afternoon' 'Evening' 'Late_Night']
Unique values in 'stops': ['zero' 'one' 'two_or_more']
Unique values in 'class': ['Economy' 'Business']


Intuitively, something like _stops_ would be a numeric variable. However, because "two_or_more" covers every flight with at least 2 stops, the exact count is unknown, which makes the feature categorical.

**Still, let's relabel the values of _stops_ with their numeric symbols, kept as strings, with "two_or_more" represented as "2+".**

In [10]:
# Let's change the values in 'stops' to 0, 1, 2+ all as strings
flight_df['stops'] = flight_df['stops'].apply(lambda x: '0' if x == 'zero' 
                                              else ('1' if x == 'one' else '2+'))
flight_df['stops'].unique()

array(['0', '1', '2+'], dtype=object)
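The same relabelling can be written with a dictionary and `Series.map`, which has one practical difference: a label that is missing from the dictionary becomes `NaN` instead of silently falling into "2+". The sketch below uses a made-up Series, and the names `stops_labels` and `ordered_stops` are illustrative. It also shows, as an optional extra, how an ordered categorical dtype can record that "0" < "1" < "2+" while keeping the values as labels.

```python
import pandas as pd

# Made-up sample of the original labels.
stops = pd.Series(["zero", "one", "two_or_more", "zero"], name="stops")

# Explicit old-label to new-label mapping.
stops_labels = {"zero": "0", "one": "1", "two_or_more": "2+"}
relabelled = stops.map(stops_labels)

# Any label missing from the dictionary would show up as NaN here.
n_unmapped = int(relabelled.isna().sum())

# Optional: keep the labels but record their order.
stops_dtype = pd.CategoricalDtype(categories=["0", "1", "2+"], ordered=True)
ordered_stops = relabelled.astype(stops_dtype)

print(relabelled.tolist())
print(f"Unmapped labels: {n_unmapped}")
print(ordered_stops.min(), ordered_stops.max())
```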

Let's check that the change was applied to the dataframe itself.

In [11]:
flight_df[flight_df['stops'] == '2+'].head()

| | airline | flight | source_city | departure_time | stops | arrival_time | destination_city | class | duration | days_left | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 175 | Indigo | 6E-282 | Delhi | Morning | 2+ | Evening | Mumbai | Economy | 9.67 | 2 | 11678 |
| 312 | GO_FIRST | G8-286 | Delhi | Morning | 2+ | Night | Mumbai | Economy | 11.0 | 3 | 12045 |
| 496 | GO_FIRST | G8-357 | Delhi | Early_Morning | 2+ | Evening | Mumbai | Economy | 12.08 | 4 | 11295 |
| 611 | GO_FIRST | G8-199 | Delhi | Afternoon | 2+ | Night | Mumbai | Economy | 8.58 | 5 | 5954 |
| 628 | Indigo | 6E-282 | Delhi | Morning | 2+ | Evening | Mumbai | Economy | 9.67 | 5 | 6953 |


Looks like that worked!

**One more small change, for the sake of consistency, is tweaking the _airline_ value "GO_FIRST" into "Go_First".**

In [12]:
flight_df['airline'] = flight_df['airline'].replace('GO_FIRST', 'Go_First')
flight_df['airline'].unique()

array(['SpiceJet', 'AirAsia', 'Vistara', 'Go_First', 'Indigo',
       'Air_India'], dtype=object)

In [13]:
# Let's look at index 611 since we know that it is a Go First flight
# To also verify the change
flight_df.iloc[611]

airline              Go_First
flight                 G8-199
source_city             Delhi
departure_time      Afternoon
stops                      2+
arrival_time            Night
destination_city       Mumbai
class                 Economy
duration                 8.58
days_left                   5
price                    5954
Name: 611, dtype: object

# Feature Descriptions

## Categorical Features
- **Airline**: The name of the airline company operating the flight. **6** unique airline names.
- **Flight**: The flight's assigned flight code. **1,561** unique flight codes.
- **Source City**: The name of the city the flight is departing from. **6** unique cities.
- **Departure Time**: The time of day the flight departs. **6** unique departure times of the day.
- **Stops**: The number of layover stops between the _Source City_ and the _Destination City_. **3** unique layover stop values.
- **Arrival Time**: The time of day the flight arrives at the _Destination City_. Same number of categories (**6**) as _Departure Time_.
- **Destination City**: The name of the city the flight will land in. Same number of cities (**6**) as _Source City_.
- **Class**: The seat class of the ticket. **2** unique values (Economy and Business).

## Continuous Features
- **Duration**: The total travel time between the _Source City_ and the _Destination City_. Measured in **hours** as floats.
- **Days Left**: The number of days between the booking date and the departure date. Measured in **days** as integers.
- **Price**: Continuous **target variable** for the ticket price of each booking. Measured in **Indian Rupees (₹)** as integers.

As mentioned earlier, the _flight_ feature will not be included in the final dataset for our study. It is a categorical feature with too many unique values (1,561), and each airline has its own system for assigning flight numbers, so the feature is not relevant to our analysis.
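To make the cardinality concern concrete: if the categorical features were one-hot encoded (an assumption about a later modelling step, not something this section does), each unique value would become its own column, so _flight_ alone would contribute 1,561 columns. The sketch below shows the effect on a made-up `sample_df` with hypothetical flight codes.

```python
import pandas as pd

# Made-up rows for illustration only.
sample_df = pd.DataFrame({
    "airline": ["SpiceJet", "SpiceJet", "Vistara", "Vistara"],
    "flight": ["SG-8709", "SG-8157", "UK-995", "UK-963"],
    "class": ["Economy", "Economy", "Economy", "Business"],
})

# Number of unique values per column.
cardinality = sample_df.nunique()

# One-hot encode every column: one new column per unique value.
encoded = pd.get_dummies(sample_df)

print(cardinality)
print(f"One-hot columns: {encoded.shape[1]}")
```

Here the four flight codes already account for half of the encoded columns, and the imbalance only grows as the number of distinct codes increases.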

**Let's remove the _flight_ column in that case**

In [14]:
# Removing flight column
flight_df = flight_df.drop('flight', axis=1)
flight_df.head(2)

| | airline | source_city | departure_time | stops | arrival_time | destination_city | class | duration | days_left | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SpiceJet | Delhi | Evening | 0 | Night | Mumbai | Economy | 2.17 | 1 | 5953 |
| 1 | SpiceJet | Delhi | Early_Morning | 0 | Morning | Mumbai | Economy | 2.33 | 1 | 5953 |


In [15]:
num_obs = flight_df.shape[0]
num_feats = flight_df.shape[1]
print(f'The dataset still has {num_obs} rows, but now with {num_feats} features.')

The dataset still has 300153 rows, but now with 10 features.


In [16]:
# Let's save the newly edited dataframe
folder_name = 'Capstone 2 Data'
file_name = 'flight_data.csv'

file_path = os.path.join(folder_name, file_name)

flight_df.to_csv(file_path, index=False)
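A quick way to confirm the save worked is to read the file back and compare it with what was written. The sketch below is a hedged illustration on a made-up `sample_df` written to a temporary directory, not to the project folder. Passing `dtype={"stops": str}` on reload is an extra precaution (an assumption on my part, not part of the original notebook) so that labels like "0" and "1" stay strings.

```python
import os
import tempfile

import pandas as pd

# Made-up rows for illustration only.
sample_df = pd.DataFrame({
    "airline": ["SpiceJet", "Indigo", "Go_First"],
    "stops": ["0", "1", "2+"],
    "price": [5953, 5955, 5954],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "flight_data.csv")

    # index=False keeps the row index out of the file,
    # so no 'Unnamed: 0' column appears on reload.
    sample_df.to_csv(file_path, index=False)

    # Read the file back, forcing 'stops' to be parsed as strings.
    reloaded = pd.read_csv(file_path, dtype={"stops": str})

print(reloaded.columns.tolist())
print(reloaded["stops"].tolist())
```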