[Data Cleaning](#Data-Cleaning)<br>
&emsp;[1. Import Libraries](#1.-Import-Libraries)<br>
&emsp;[2. Reading Dataset](#2.-Reading-Dataset)<br>
&emsp;[3. Preliminary Analysis](#3.-Preliminary-Analysis)<br>
&emsp;&emsp;[3.1 Check Data Types](#3.1-Check-Data-Types)<br>
&emsp;&emsp;[3.2 Check for Duplicates](#3.2-Check-for-Duplicates)<br>
&emsp;&emsp;[3.3 Check for Missing Values](#3.3-Check-for-Missing-Values)<br>
&emsp;[4. Detailed Analysis](#4.-Detailed-Analysis)<br>
&emsp;&emsp;[4.1 Airline](#4.1-Airline)<br>
&emsp;&emsp;[4.2 Date of Journey](#4.2-Date-of-Journey)<br>
&emsp;&emsp;[4.3 Source](#4.3-Source)<br>
&emsp;&emsp;[4.4 Destination](#4.4-Destination)<br>
&emsp;&emsp;[4.5 Route](#4.5-Route)<br>
&emsp;&emsp;[4.6 Departure Time](#4.6-Departure-Time)<br>
&emsp;&emsp;[4.7 Arrival Time](#4.7-Arrival-Time)<br>
&emsp;&emsp;[4.8 Duration](#4.8-Duration)<br>
&emsp;&emsp;[4.9 Total Stops](#4.9-Total-Stops)<br>
&emsp;&emsp;[4.10 Additional Info](#4.10-Additional-Info)<br>
&emsp;&emsp;[4.11 Price](#4.11-Price)<br>
&emsp;[5. Cleaning Operation](#5.-Cleaning-Operation)<br>
&emsp;[6. Split The Dataset](#6.-Split-The-Datset)<br>
&emsp;[7. Export The Dataset Subset](#7.-Export-The-Datset-Subset)<br>

## 1. Import Libraries

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## 2. Reading Dataset

In [2]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
PROJECT_DIR = r'R:\Jaydeep\Flight-Price-Prediction'
DATA_DIR = 'data'
MAIN_DATASET_NAME = 'flight_price'

In [3]:
def get_dataset(dataset_name):
    file_name = f'{dataset_name}.csv'
    file_path = os.path.join(PROJECT_DIR, DATA_DIR, file_name)
    return pd.read_csv(file_path)

In [4]:
df = get_dataset(MAIN_DATASET_NAME)

In [5]:
pd.to_datetime(df['Dep_Time'], format='%H:%M').dt.time

0        22:20:00
1        05:50:00
2        09:25:00
3        18:05:00
4        16:50:00
           ...   
10678    19:55:00
10679    20:45:00
10680    08:20:00
10681    11:30:00
10682    10:55:00
Name: Dep_Time, Length: 10683, dtype: object

In [6]:
pd.__version__

'2.2.2'

In [7]:
df.sample(5)

| | Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 487 | IndiGo | 18/05/2019 | Kolkata | Banglore | CCU → HYD → BLR | 19:20 | 23:45 | 4h 25m | 1 stop | No info | 3717 |
| 7076 | Multiple carriers | 1/06/2019 | Delhi | Cochin | DEL → BOM → COK | 10:20 | 19:15 | 8h 55m | 1 stop | No info | 9526 |
| 5969 | Jet Airways | 12/05/2019 | Kolkata | Banglore | CCU → BOM → BLR | 06:30 | 19:50 | 13h 20m | 1 stop | No info | 13941 |
| 8633 | Jet Airways | 15/05/2019 | Mumbai | Hyderabad | BOM → HYD | 02:55 | 04:20 | 1h 25m | non-stop | In-flight meal not included | 3210 |
| 497 | Jet Airways | 1/04/2019 | Delhi | Cochin | DEL → BOM → COK | 10:00 | 19:00 | 9h | 1 stop | In-flight meal not included | 5406 |


## 3. Preliminary Analysis

### 3.1 Check Data Types

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB


### 3.2 Check for Duplicates

In [9]:
df.duplicated().sum()

220

### 3.3 Check for Missing Values

In [10]:
df.isnull().sum().sum()

2

In [11]:
df.isnull().sum()[lambda x: x>0]

Route          1
Total_Stops    1
dtype: int64

In [12]:
df.loc[df['Route'].isnull()]

| | Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9039 | Air India | 6/05/2019 | Delhi | Cochin | NaN | 09:45 | 09:25 07 May | 23h 40m | NaN | No info | 7480 |


In [13]:
df.loc[df['Total_Stops'].isnull()]

| | Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9039 | Air India | 6/05/2019 | Delhi | Cochin | NaN | 09:45 | 09:25 07 May | 23h 40m | NaN | No info | 7480 |


- The dataset has 10,683 rows and 11 columns.
- `Date_of_Journey`, `Dep_Time`, and `Arrival_Time` are stored as strings and need to be converted to datetime.
- `Duration` and `Total_Stops` are stored as strings and need to be converted to numeric.
- There are 220 duplicate rows, which should be removed.
- There are 2 null values, both in row 9039:
    - `Route`: 1
    - `Total_Stops`: 1
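
The checks behind this summary can be bundled into one helper so they are easy to rerun after each cleaning step. The following is only a minimal sketch: `quality_report` and the toy frame are hypothetical names and data used for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize shape, duplicate rows, and columns that contain nulls."""
    nulls = frame.isnull().sum()
    return {
        "shape": frame.shape,
        "n_duplicates": int(frame.duplicated().sum()),
        "nulls": nulls[nulls > 0].to_dict(),
    }

# Toy data (hypothetical), shaped like a small slice of the flight dataset
toy = pd.DataFrame({
    "Route": ["BLR → DEL", "BLR → DEL", np.nan],
    "Total_Stops": ["non-stop", "non-stop", np.nan],
    "Price": [3897, 3897, 7480],
})

report = quality_report(toy)
print(report)
```

Calling the same helper on the cleaned frame would be one way to confirm that duplicates and nulls are gone before splitting.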

## 4. Detailed Analysis

In [14]:
df.columns

Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
       'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
       'Additional_Info', 'Price'],
      dtype='object')

### 4.1 Airline

In [15]:
df.Airline.nunique()

12

In [16]:
df.Airline.unique()

array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
       'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
       'Vistara Premium economy', 'Jet Airways Business',
       'Multiple carriers Premium economy', 'Trujet'], dtype=object)

In [17]:
(
    df
    .Airline
    .str.replace(" Premium economy", "")
    .str.replace(" Business", "")
    .str.title()
).sample(5)

9606         Indigo
4768       Spicejet
2582    Jet Airways
3055      Air India
2426       Spicejet
Name: Airline, dtype: object
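
The two chained `str.replace` calls above could also be written as a single anchored regular expression, which only strips the cabin-class label when it appears at the end of the name. This is a sketch of that alternative, not the notebook's method; `SUFFIX_PATTERN` is a hypothetical name and the sample values are taken from the `unique()` output above.

```python
import pandas as pd

# Hypothetical pattern: a trailing cabin-class label preceded by whitespace
SUFFIX_PATTERN = r"\s+(?:Premium economy|Business)$"

airlines = pd.Series([
    "Vistara Premium economy",
    "Jet Airways Business",
    "IndiGo",
    "Multiple carriers Premium economy",
])

cleaned_airlines = (
    airlines
    .str.replace(SUFFIX_PATTERN, "", regex=True)
    .str.title()
)
print(cleaned_airlines.tolist())
```

Note that `str.title()` also changes brand capitalization: `IndiGo` becomes `Indigo`, as the sampled output above shows.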

### 4.2 Date of Journey

In [18]:
df.Date_of_Journey.nunique()

44

In [19]:
df.Date_of_Journey.unique()

array(['24/03/2019', '1/05/2019', '9/06/2019', '12/05/2019', '01/03/2019',
       '24/06/2019', '12/03/2019', '27/05/2019', '1/06/2019',
       '18/04/2019', '9/05/2019', '24/04/2019', '3/03/2019', '15/04/2019',
       '12/06/2019', '6/03/2019', '21/03/2019', '3/04/2019', '6/05/2019',
       '15/05/2019', '18/06/2019', '15/06/2019', '6/04/2019',
       '18/05/2019', '27/06/2019', '21/05/2019', '06/03/2019',
       '3/06/2019', '15/03/2019', '3/05/2019', '9/03/2019', '6/06/2019',
       '24/05/2019', '09/03/2019', '1/04/2019', '21/04/2019',
       '21/06/2019', '27/03/2019', '18/03/2019', '12/04/2019',
       '9/04/2019', '1/03/2019', '03/03/2019', '27/04/2019'], dtype=object)

In [20]:
pd.to_datetime(df.Date_of_Journey, dayfirst=True).sample(5)

1717   2019-03-01
7201   2019-06-03
444    2019-06-09
3784   2019-06-06
7316   2019-03-24
Name: Date_of_Journey, dtype: datetime64[ns]
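
`dayfirst=True` leaves the format to be inferred. Since every value listed above follows a day/month/year layout, passing an explicit format is a stricter alternative that fails loudly on anything unexpected. The sketch below assumes the format `%d/%m/%Y` and uses a few values copied from the `unique()` output, including days with and without a leading zero.

```python
import pandas as pd

dates = pd.Series(["24/03/2019", "1/05/2019", "01/03/2019", "9/06/2019"])

# An explicit format avoids any ambiguity between day-first and month-first
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```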

### 4.3 Source

In [21]:
df.Source.nunique()

5

In [22]:
df.Source.unique()

array(['Banglore', 'Kolkata', 'Delhi', 'Chennai', 'Mumbai'], dtype=object)

### 4.4 Destination

In [23]:
df.Destination.nunique()

6

In [24]:
df.Destination.unique()

array(['New Delhi', 'Banglore', 'Cochin', 'Kolkata', 'Delhi', 'Hyderabad'],
      dtype=object)

### 4.5 Route

In [25]:
df.Route.sample(5)

4934    CCU → BOM → BLR
9536          BLR → DEL
169           BLR → DEL
8779    DEL → BOM → COK
603     BLR → BOM → DEL
Name: Route, dtype: object

### 4.6 Departure Time

In [26]:
df.Dep_Time.nunique()

222

In [27]:
df.Dep_Time.sample(5)

4387     17:30
10139    18:55
4684     02:15
9106     06:45
3487     12:20
Name: Dep_Time, dtype: object

In [28]:
(
    df
    .Dep_Time
    .loc[lambda x: x.str.contains('[^0-9:]')]
)

Series([], Name: Dep_Time, dtype: object)

In [29]:
pd.to_datetime(df.Dep_Time, format='%H:%M').dt.time.sample(5)

4827    14:05:00
6630    20:45:00
8216    22:00:00
2755    19:45:00
6026    07:10:00
Name: Dep_Time, dtype: object

### 4.7 Arrival Time

In [30]:
df.Arrival_Time.sample(5)

1473    10:15
9306    06:50
4051    07:25
5444    08:45
1798    17:15
Name: Arrival_Time, dtype: object

In [31]:
(
    df
    .Arrival_Time
    .loc[lambda x: x.str.contains('[^0-9:]')]
    .str.split(n=1)
    .str.get(1)
    .unique()
)

array(['22 Mar', '10 Jun', '13 Mar', '02 Mar', '10 May', '04 Mar',
       '13 Jun', '28 May', '19 Mar', '07 May', '02 Jun', '16 Jun',
       '19 May', '16 May', '28 Jun', '02 May', '28 Mar', '19 Jun',
       '04 Apr', '25 Mar', '07 Mar', '25 Jun', '07 Jun', '25 May',
       '13 May', '16 Mar', '22 May', '10 Apr', '04 Jun', '20 May',
       '28 Apr', '25 Apr', '10 Mar', '19 Apr', '13 Apr', '02 Apr',
       '23 Mar', '22 Apr', '11 May', '07 Apr', '03 May', '08 Mar',
       '03 Mar', '05 Mar', '22 Jun', '04 May', '26 May', '16 Apr',
       '26 Jun', '29 May', '29 Jun', '29 Mar', '23 May', '17 Jun'],
      dtype=object)
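
Some arrival values carry a trailing day and month, which appears to mark flights that land on a later date than they depart. One way to keep that information, rather than discarding it when converting to a time, is to split the string into a clock part and a flag. This is a sketch under that assumption; the column names `arrival_clock` and `has_arrival_date` are hypothetical.

```python
import pandas as pd

arrival = pd.Series(["10:15", "09:25 07 May", "01:10 22 Mar"])

# Split once on whitespace: column 0 is the clock time, column 1 the optional date
parts = arrival.str.split(n=1, expand=True)

arrival_parts = pd.DataFrame({
    "arrival_clock": parts[0],
    # True when a trailing date such as '07 May' is present
    "has_arrival_date": parts[1].notna(),
})
print(arrival_parts)
```

The sketch relies on at least one value containing a date; otherwise `expand=True` yields a single column and `parts[1]` would not exist.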

### 4.8 Duration

In [32]:
df.Duration.sample(5)

4814     2h 15m
1162     7h 15m
9315         6h
8211    21h 25m
862     14h 20m
Name: Duration, dtype: object

In [33]:
(
    df
    .Duration
    .loc[lambda x: ~x.str.contains('m')]
    .unique()
)

array(['19h', '23h', '22h', '12h', '3h', '5h', '10h', '18h', '24h', '15h',
       '16h', '8h', '14h', '20h', '13h', '11h', '9h', '27h', '26h', '4h',
       '7h', '30h', '21h', '28h', '47h', '6h', '25h', '38h', '34h'],
      dtype=object)

In [34]:
(
    df
    .Duration
    .loc[lambda x: ~x.str.contains('h')]
    .unique()
)

array(['5m'], dtype=object)

In [35]:
(
    df
    .Duration
    .loc[lambda x: ~x.str.contains('h')]
)

6474    5m
Name: Duration, dtype: object

In [36]:
df.iloc[[6474]]

| | Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6474 | Air India | 6/03/2019 | Mumbai | Hyderabad | BOM → GOI → PNQ → HYD | 16:50 | 16:55 | 5m | 2 stops | No info | 17327 |


In [37]:
(
    df
    .Duration
    .drop(index=6474)
    .str.split(expand=True)
    .set_axis(['hour', 'minute'], axis=1)
    .assign(
        hour = lambda df_: (
            df_
            .hour
            .str.replace('h','')
            .astype(int)
            .mul(60)
        ),
        minute = lambda df_: (
            df_
            .minute
            .str.replace('m','')
            .fillna('0')
            .astype(int)
        )
    )
    .sum(axis=1)
    .rename('duration_minutes')
    .to_frame()
    .join(df.Duration)
).sample(5)

| | duration_minutes | Duration |
|---|---|---|
| 10566 | 1640 | 27h 20m |
| 5006 | 165 | 2h 45m |
| 10023 | 85 | 1h 25m |
| 5743 | 755 | 12h 35m |
| 2500 | 1035 | 17h 15m |
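
The split-based conversion above depends on the hour part always coming first, which is why the `5m` row has to be dropped beforehand. A regular expression with optional hour and minute groups handles all three shapes seen in this column (`2h 15m`, `19h`, `5m`). This is an alternative sketch, not the notebook's implementation; `duration_to_minutes` is a hypothetical helper name.

```python
import pandas as pd

def duration_to_minutes(duration: pd.Series) -> pd.Series:
    """Convert strings such as '2h 15m', '19h' or '5m' to total minutes."""
    parts = duration.str.extract(
        r"^(?:(?P<hour>\d+)h)?\s*(?:(?P<minute>\d+)m)?$"
    )
    # A missing hour or minute part counts as zero
    parts = parts.fillna("0").astype(int)
    return (parts["hour"] * 60 + parts["minute"]).rename("duration_minutes")

sample = pd.Series(["2h 15m", "19h", "5m", "27h 20m"])
minutes = duration_to_minutes(sample)
print(minutes.tolist())
```

Parsing `5m` correctly does not make that row plausible for a two-stop flight, so excluding it remains a separate decision. A string that matches neither part would silently become 0 in this sketch.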


### 4.9 Total Stops

In [38]:
df.Total_Stops.nunique()

5

In [39]:
df.Total_Stops.unique()

array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)

In [40]:
(
    df
    .Total_Stops
    .str.replace('non-stop', '0')
    .str.replace('stops?', '', regex=True)
    .pipe(lambda x: pd.to_numeric(x))
).unique()

array([ 0.,  2.,  1.,  3., nan,  4.])
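
Because the column has only five distinct labels, an explicit lookup table is another option. It also makes unexpected labels visible as missing values instead of parse errors. The sketch below assumes the five labels shown above are exhaustive; `STOP_COUNTS` is a hypothetical name, and the nullable `Int64` dtype is used so the missing entry does not force the column to float.

```python
import pandas as pd

# Hypothetical lookup built from the labels returned by unique()
STOP_COUNTS = {
    "non-stop": 0,
    "1 stop": 1,
    "2 stops": 2,
    "3 stops": 3,
    "4 stops": 4,
}

stops = pd.Series(["non-stop", "2 stops", "1 stop", None, "4 stops"])

# Labels missing from the lookup (and nulls) become <NA>
stop_counts = stops.map(STOP_COUNTS).astype("Int64")
print(stop_counts)
```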

### 4.10 Additional Info

In [41]:
df.Additional_Info.nunique()

10

In [42]:
df.Additional_Info.unique()

array(['No info', 'In-flight meal not included',
       'No check-in baggage included', '1 Short layover', 'No Info',
       '1 Long layover', 'Change airports', 'Business class',
       'Red-eye flight', '2 Long layover'], dtype=object)

In [43]:
(
    df
    .Additional_Info
    .str.lower()
).unique()

array(['no info', 'in-flight meal not included',
       'no check-in baggage included', '1 short layover',
       '1 long layover', 'change airports', 'business class',
       'red-eye flight', '2 long layover'], dtype=object)

### 4.11 Price

In [44]:
df.Price.sample(5)

1813    16079
7930     4441
1894     6810
7572     9103
6070     7563
Name: Price, dtype: int64

In [45]:
df.Price.nunique()

1870

## 5. Cleaning Operation

In [46]:
def clean_column_names(df):
    # Convert all column names to lowercase for consistency
    return df.rename(columns=str.lower)

def strip_string_columns(df):
    # Identify columns with string data type
    string_columns = df.select_dtypes(include='O').columns
    # Strip leading and trailing whitespace from all string columns
    for col in string_columns:
        df[col] = df[col].str.strip()
    return df

def clean_airline_names(df):
    # Clean and standardize 'airline' column by removing specific substrings and title-casing
    df['airline'] = (
        df['airline']
        .str.replace(" Premium economy", "")
        .str.replace(" Business", "")
        .str.title()
    )
    return df

def convert_dates(df):
    # Convert 'date_of_journey' column to datetime format, assuming day-first format
    df['date_of_journey'] = pd.to_datetime(df['date_of_journey'], dayfirst=True)
    return df

def convert_times(df):
    # Convert 'dep_time' and 'arrival_time' columns to time format
    df['dep_time'] = pd.to_datetime(df['dep_time'], format='mixed').dt.time
    df['arrival_time'] = pd.to_datetime(df['arrival_time'], format='mixed').dt.time
    return df

def convert_duration(df):
    # Split 'duration' column into hours and minutes
    duration_split = df['duration'].str.split(" ", expand=True).set_axis(["hour", "minute"], axis=1)
    # Convert hours to minutes and fill missing minutes with 0
    duration_split['hour'] = duration_split['hour'].str.replace("h", "").astype(int).mul(60)
    duration_split['minute'] = duration_split['minute'].str.replace("m", "").fillna("0").astype(int)
    # Sum hours and minutes to get total duration in minutes
    df['duration_minute'] = duration_split.sum(axis=1)
    # drop duration column
    df = df.drop(columns=['duration'])
    return df

def convert_total_stops(df):
    # Standardize and convert 'total_stops' column to numeric
    df['total_stops'] = (
        df['total_stops']
        .replace("non-stop", "0")  # Replace 'non-stop' with '0'
        .str.replace(" stops?", "", regex=True)  # Remove ' stop' or ' stops'
        .pipe(lambda ser: pd.to_numeric(ser))  # Convert to numeric
    )
    return df

def lower_additional_info(df):
    # Convert 'additional_info' column to lowercase
    df['additional_info'] = df['additional_info'].str.lower()
    return df

def preprocess_df(df):
    # Drop rows where 'Duration' is '5m' and drop the 'Route' column
    df = df.drop(index=df[df['Duration'].isin(['5m'])].index, columns=['Route'])
    # Drop duplicate rows to ensure data consistency
    df = df.drop_duplicates()
    # Drop null values
    df = df.dropna()
    # Apply all preprocessing steps in sequence to clean and standardize the DataFrame
    df = clean_column_names(df)         # Convert column names to lowercase
    df = strip_string_columns(df)       # Strip leading/trailing whitespace from string columns
    df = clean_airline_names(df)        # Clean and standardize 'airline' column
    df = convert_dates(df)              # Convert 'date_of_journey' column to datetime format
    df = convert_times(df)              # Convert 'dep_time' and 'arrival_time' columns to time format
    df = convert_duration(df)           # Convert 'duration' column to total minutes
    df = convert_total_stops(df)        # Standardize and convert 'total_stops' column to numeric
    df = lower_additional_info(df)      # Convert 'additional_info' column to lowercase
    # Drop duplicates again: normalization (e.g. lowercasing 'No Info') can make previously distinct rows identical
    df = df.drop_duplicates()
    return df

In [47]:
df = preprocess_df(df)
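
Most of the helper functions above assign into the frame they receive. A variant that returns new frames and is chained with `DataFrame.pipe` keeps the caller's data untouched and reads top to bottom. The following is only a sketch on a toy frame; the function names and the data are hypothetical, and it covers just three of the steps.

```python
import pandas as pd

def lower_columns(frame: pd.DataFrame) -> pd.DataFrame:
    return frame.rename(columns=str.lower)

def strip_strings(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    # assign() returns a new frame, so the input is not modified
    return frame.assign(**{col: frame[col].str.strip() for col in columns})

def normalize_info(frame: pd.DataFrame) -> pd.DataFrame:
    return frame.assign(additional_info=frame["additional_info"].str.lower())

# Toy rows (hypothetical) that differ only in whitespace and letter case
raw = pd.DataFrame({
    "Airline": [" IndiGo", "IndiGo "],
    "Additional_Info": ["No info", "No Info"],
    "Price": [3897, 3897],
})

cleaned = (
    raw
    .pipe(lower_columns)
    .pipe(strip_strings, columns=["airline", "additional_info"])
    .pipe(normalize_info)
    .drop_duplicates()
)
print(cleaned)
```

The two toy rows only become duplicates after stripping and lowercasing, which illustrates why a second `drop_duplicates()` after normalization can remove additional rows.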

In [48]:
columns = [
    'airline', 'date_of_journey', 'source', 'destination', 'dep_time',
    'arrival_time',  'duration_minute', 'total_stops', 'additional_info',
    'price',
]

In [49]:
df = df[columns]

In [50]:
df.sample(10)

| | airline | date_of_journey | source | destination | dep_time | arrival_time | duration_minute | total_stops | additional_info | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 4791 | Multiple Carriers | 2019-03-21 | Delhi | Cochin | 10:20:00 | 19:15:00 | 535 | 1 | no info | 7531 |
| 1507 | Jet Airways | 2019-05-09 | Delhi | Cochin | 19:10:00 | 19:00:00 | 1430 | 2 | no info | 15129 |
| 4210 | Indigo | 2019-03-06 | Mumbai | Hyderabad | 02:30:00 | 04:00:00 | 90 | 0 | no info | 3175 |
| 1860 | Jet Airways | 2019-05-09 | Kolkata | Banglore | 16:30:00 | 08:15:00 | 945 | 1 | no info | 13941 |
| 393 | Indigo | 2019-03-03 | Delhi | Cochin | 17:30:00 | 01:35:00 | 485 | 1 | no info | 14871 |
| 8164 | Indigo | 2019-06-18 | Chennai | Kolkata | 07:55:00 | 10:15:00 | 140 | 0 | no info | 3850 |
| 4365 | Indigo | 2019-06-18 | Banglore | Delhi | 08:30:00 | 11:20:00 | 170 | 0 | no info | 4823 |
| 5525 | Air India | 2019-06-09 | Kolkata | Banglore | 09:50:00 | 12:30:00 | 1600 | 2 | no info | 14960 |
| 7051 | Indigo | 2019-06-27 | Delhi | Cochin | 06:50:00 | 16:10:00 | 560 | 1 | no info | 6442 |
| 8263 | Jet Airways | 2019-05-27 | Banglore | Delhi | 07:10:00 | 10:10:00 | 180 | 0 | in-flight meal not included | 4030 |


## 6. Split The Dataset

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10459 entries, 0 to 10682
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   airline          10459 non-null  object        
 1   date_of_journey  10459 non-null  datetime64[ns]
 2   source           10459 non-null  object        
 3   destination      10459 non-null  object        
 4   dep_time         10459 non-null  object        
 5   arrival_time     10459 non-null  object        
 6   duration_minute  10459 non-null  int64         
 7   total_stops      10459 non-null  int64         
 8   additional_info  10459 non-null  object        
 9   price            10459 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(6)
memory usage: 898.8+ KB


In [52]:
X = df.drop(columns="price")
y = df.price.copy()

X_, X_test, y_, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_, y_, test_size=0.2, random_state=42)

print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)
print(X_test.shape, y_test.shape)

(6693, 9) (6693,)
(1674, 9) (1674,)
(2092, 9) (2092,)
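
Applying `test_size=0.2` twice gives roughly a 64/16/20 split: 20% is held out for testing, then 20% of the remaining 80% (16% of the total) becomes the validation set. The arithmetic can be checked without scikit-learn. The sketch below assumes the held-out part of each split is rounded up, which reproduces the shapes printed above; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows: int, test_size: float = 0.2, val_size: float = 0.2):
    """Return (train, validation, test) row counts for two chained splits.

    Assumption: the held-out portion of each split is rounded up.
    """
    n_test = math.ceil(n_rows * test_size)
    n_rest = n_rows - n_test
    n_val = math.ceil(n_rest * val_size)
    return n_rest - n_val, n_val, n_test

n_train, n_val, n_test = split_sizes(10459)
print(n_train, n_val, n_test)  # 6693 1674 2092

# Share of the full dataset in each subset
shares = [round(n / 10459, 2) for n in (n_train, n_val, n_test)]
print(shares)  # [0.64, 0.16, 0.2]
```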


## 7. Export The Dataset Subset

In [53]:
def export_dataset(X, y, name):
    file_name = f"{name}.csv"
    file_path = os.path.join(PROJECT_DIR, DATA_DIR, file_name)
    X.join(y).to_csv(file_path, index=False)

In [54]:
export_dataset(X_train, y_train, 'train')
export_dataset(X_val, y_val, 'validation')
export_dataset(X_test, y_test, 'test')

In [55]:
get_dataset('train').sample(3)

| | airline | date_of_journey | source | destination | dep_time | arrival_time | duration_minute | total_stops | additional_info | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 2913 | Multiple Carriers | 2019-06-24 | Delhi | Cochin | 09:15:00 | 19:00:00 | 585 | 1 | no info | 11622 |
| 5512 | Jet Airways | 2019-06-24 | Banglore | Delhi | 07:10:00 | 10:10:00 | 180 | 0 | no info | 8016 |
| 1648 | Indigo | 2019-06-12 | Delhi | Cochin | 09:15:00 | 01:30:00 | 975 | 1 | no info | 6628 |


In [56]:
get_dataset('validation').sample(3)

| | airline | date_of_journey | source | destination | dep_time | arrival_time | duration_minute | total_stops | additional_info | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 926 | Indigo | 2019-06-01 | Banglore | Delhi | 04:00:00 | 06:50:00 | 170 | 0 | no info | 3943 |
| 14 | Jet Airways | 2019-05-06 | Banglore | Delhi | 19:50:00 | 22:50:00 | 180 | 0 | no info | 7229 |
| 589 | Jet Airways | 2019-05-15 | Kolkata | Banglore | 21:10:00 | 04:40:00 | 450 | 1 | no info | 14388 |


In [57]:
get_dataset('test').sample(3)

| | airline | date_of_journey | source | destination | dep_time | arrival_time | duration_minute | total_stops | additional_info | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 556 | Jet Airways | 2019-05-21 | Banglore | Delhi | 19:50:00 | 22:50:00 | 180 | 0 | no info | 7229 |
| 263 | Indigo | 2019-06-24 | Delhi | Cochin | 11:55:00 | 22:30:00 | 635 | 1 | no info | 6442 |
| 1879 | Indigo | 2019-04-09 | Kolkata | Banglore | 21:25:00 | 00:05:00 | 160 | 0 | no info | 4174 |
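
CSV files do not store dtypes, so a plain `pd.read_csv` returns the date column as text even though it was exported as datetime. Passing `parse_dates` when reading restores it. The sketch below shows the round trip with an in-memory buffer standing in for the exported files; the two rows are toy data.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "date_of_journey": pd.to_datetime(["2019-06-24", "2019-05-06"]),
    "price": [11622, 7229],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # dates come back as text

buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date_of_journey"])  # dates parsed

print(plain.dtypes)
print(restored.dtypes)
```

The time columns would need similar handling if downstream code expects `datetime.time` objects rather than strings.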


In [58]:
# (
#     df
#     .drop(index=df[df.Duration.isin(['5m'])].index, columns=['Route'])
#     .drop_duplicates()
#     .rename(columns=str.lower)
#     .assign(
#         **{
#             col: df[col].str.strip()
#             for col in df.select_dtypes(include="O").columns
#         }
#     )
#     .assign(
#         airline = lambda df_: (
#             df_
#             .airline
#             .str.replace(" Premium economy", "")
#             .str.replace(" Business", "")
#             .str.title()
#         ),
#         date_of_journey = lambda df_: (
#             pd.to_datetime(df_.date_of_journey, dayfirst=True)
#         ),
#         dep_time = lambda df_: (
#             pd.to_datetime(df_.dep_time).dt.time
#         ),
#         arrival_time = lambda df_: (
#             pd.to_datetime(df_.arrival_time).dt.time
#         ),
#         duration = lambda df_: (
#             df_.duration.pipe(lambda x: (
#                 x
#                 .str.split(" ", expand=True)
#                 .set_axis(["hour", "minute"], axis=1)
#                 .assign(
#                     hour=lambda df_: (
#                         df_
#                         .hour
#                         .str.replace("h", "")
#                         .astype(int)
#                         .mul(60)
#                     ),
#                     minute=lambda df_: (
#                         df_
#                         .minute
#                         .str.replace("m", "")
#                         .fillna("0")
#                         .astype(int)
#                     )
#                 ).sum(axis=1)
#             )
#           )
#         ),
#         total_stops = lambda df_: (
#             df_
#             .total_stops
#             .replace("non-stop", "0")
#             .str.replace(" stops?", "", regex=True)
#             .pipe(lambda ser: pd.to_numeric(ser))
#         ),
#         additional_info = lambda df_ : (
#             df_
#             .additional_info
#             .str
#             .lower()
#         )
#     )
# ).sample(10)