<a href="https://colab.research.google.com/github/MArtistForLife/FlightPriceEDA/blob/main/FlightPriceEDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## EDA and Feature Engineering: Flight Price Prediction
Details about the dataset are available on Kaggle:
https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction

### FEATURES
The various features of the cleaned dataset are explained below:
1) Airline: The name of the airline company. It is a categorical feature with 6 distinct airlines.
2) Flight: The plane's flight code. It is a categorical feature.
3) Source City: The city from which the flight takes off. It is a categorical feature with 6 unique cities.
4) Departure Time: A derived categorical feature created by grouping time periods into bins. It stores the departure time and has 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: A derived categorical feature created by grouping time intervals into bins. It stores the arrival time and has 6 distinct time labels.
7) Destination City: The city where the flight lands. It is a categorical feature with 6 unique cities.
8) Class: A categorical feature that describes the seat class; it has two distinct values: Business and Economy.
9) Duration: A continuous feature giving the total travel time between cities, in hours.
10) Days Left: A derived feature calculated by subtracting the booking date from the trip date.
11) Price: The target variable; it stores the ticket price.
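
The Days Left definition above can be sketched in pandas as follows. This is a minimal illustration only: the column names `booking_date` and `trip_date` and the dates themselves are made up, since the cleaned dataset ships the derived value rather than the two dates.

```python
import pandas as pd

# Hypothetical booking and trip dates; these values are made up for illustration.
bookings = pd.DataFrame({
    "booking_date": ["2022-02-10", "2022-02-10"],
    "trip_date": ["2022-02-11", "2022-03-01"],
})

booking = pd.to_datetime(bookings["booking_date"])
trip = pd.to_datetime(bookings["trip_date"])

# Days left = trip date minus booking date, expressed in whole days.
bookings["days_left"] = (trip - booking).dt.days
print(bookings["days_left"].tolist())
```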

In [142]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
## %matplotlib inline displays matplotlib plots directly within the output cells

from google.colab import drive
drive.mount('/content/drive')

import os
# to change to specific folder in google drive that is needed
os.chdir('/content/drive/My Drive/Colab Notebooks/EDA')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [143]:
flyMoney = pd.read_excel("flight_price.xlsx")
## note: when we have a .csv file, write pd.read_csv("blah blah")
## but when we have a .xlsx file, write pd.read_excel("blah blah")
flyMoney[:5]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302
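
pandas reads CSV files with `pd.read_csv` and Excel files with `pd.read_excel`. As a rough sketch, the choice can be wrapped in a small helper keyed on the file extension. The name `load_table` is hypothetical and not part of pandas, and the demo writes a throwaway CSV so it does not depend on `flight_price.xlsx`.

```python
import os
import tempfile
import pandas as pd

def load_table(path):
    """Hypothetical helper: pick a pandas reader from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return pd.read_csv(path)
    if ext in (".xlsx", ".xls"):
        # read_excel needs an Excel engine such as openpyxl to be installed
        return pd.read_excel(path)
    raise ValueError(f"unsupported file type: {ext}")

# Demonstrate with a temporary CSV file containing one made-up row.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    pd.DataFrame({"Airline": ["IndiGo"], "Price": [3897]}).to_csv(csv_path, index=False)
    sample = load_table(csv_path)

print(sample)
```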


In [144]:
## each date string is split on the / character, giving
## the list ["dd", "mm", "yyyy"]
## .str[0] takes the first element of that list, which is the day
## so 25/12/2023, for example, splits into
## ["25", "12", "2023"]
## and .str[0] gives "25" (the DAY, not the MONTH)

flyMoney["dateByDay"] = flyMoney["Date_of_Journey"].str.split("/").str[0]
flyMoney["dateByDay"]

flyMoney["dateByMonth"] = flyMoney["Date_of_Journey"].str.split("/").str[1]
flyMoney["dateByMonth"]

flyMoney["dateByYear"] = flyMoney["Date_of_Journey"].str.split("/").str[2]
flyMoney["dateByYear"]

Unnamed: 0,dateByYear
0,2019
1,2019
2,2019
3,2019
4,2019
...,...
10678,2019
10679,2019
10680,2019
10681,2019
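
An alternative to splitting the string by hand is to parse it as a date and read the parts from the `.dt` accessor. The sketch below assumes every value follows the day/month/year pattern seen in the sample rows; the three dates are copied from the first rows of the output above.

```python
import pandas as pd

dates = pd.Series(["24/03/2019", "1/05/2019", "9/06/2019"])

# Parse as day/month/year; the format string makes the day-first order explicit.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

parts = pd.DataFrame({
    "dateByDay": parsed.dt.day,
    "dateByMonth": parsed.dt.month,
    "dateByYear": parsed.dt.year,
})
print(parts)
```

The parts come back as integers already, so no separate type conversion is needed with this approach.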


In [145]:
flyMoney[:5]
## with 3 new columns!! yayy :)

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,dateByDay,dateByMonth,dateByYear
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,24,3,2019
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,9,6,2019
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019


In [146]:
flyMoney.info()
## the new date columns are still objects (strings), so we need to make them integers

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
 11  dateByDay        10683 non-null  object
 12  dateByMonth      10683 non-null  object
 13  dateByYear       10683 non-null  object
dtypes: int64(1), object(13)
memory usage: 1.1+ MB
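
Rather than reading the `info()` listing by eye, the non-numeric columns can also be listed programmatically. A small sketch on a made-up three-column frame (not the real data):

```python
import pandas as pd

sample = pd.DataFrame({
    "Airline": ["IndiGo", "Air India"],
    "Price": [3897, 7662],
    "dateByDay": ["24", "1"],   # still strings at this point
})

# Columns whose dtype is not numeric are the candidates for conversion or encoding.
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```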


In [147]:
## when converting between types: use .astype(blah blah)
flyMoney["dateByDay"] = flyMoney["dateByDay"].astype(int)
flyMoney["dateByMonth"] = flyMoney["dateByMonth"].astype(int)
flyMoney["dateByYear"] = flyMoney["dateByYear"].astype(int)
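
`.astype(int)` works here because every value is a clean digit string; it raises an error if any value cannot be parsed. Where that is a risk, `pd.to_numeric` with `errors="coerce"` is one possible alternative. The `"n/a"` entry below is a made-up bad value for illustration; nothing like it appears in the outputs above.

```python
import pandas as pd

raw = pd.Series(["24", "1", "n/a"])  # "n/a" is a made-up bad value

# errors="coerce" turns anything unparseable into NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")
bad_count = int(converted.isna().sum())
print(converted.tolist(), bad_count)
```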

In [148]:
flyMoney["dateByDay"]

Unnamed: 0,dateByDay
0,24
1,1
2,9
3,12
4,1
...,...
10678,9
10679,27
10680,27
10681,1


In [150]:
flyMoney["dateByMonth"]

Unnamed: 0,dateByMonth
0,3
1,5
2,6
3,5
4,3
...,...
10678,4
10679,4
10680,4
10681,3


In [151]:
flyMoney["dateByYear"]

Unnamed: 0,dateByYear
0,2019
1,2019
2,2019
3,2019
4,2019
...,...
10678,2019
10679,2019
10680,2019
10681,2019


In [152]:
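## we don't need the original Date_of_Journey column
flyMoney.drop("Date_of_Journey", axis = 1)
## gotta specify whether column (axis = 1) or row (axis = 0)
## note: drop returns a new DataFrame; without inplace = True or assigning the
## result back, flyMoney itself still has Date_of_Journey (see the later outputs)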

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,dateByDay,dateByMonth,dateByYear
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,24,3,2019
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,9,6,2019
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,9,4,2019
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,27,4,2019
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,27,4,2019
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,1,3,2019
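
`DataFrame.drop` returns a new DataFrame and leaves the original alone unless the result is assigned back or `inplace=True` is passed. A minimal sketch on a one-row, made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"Date_of_Journey": ["24/03/2019"], "Price": [3897]})

# Without assignment the result is discarded and df is unchanged.
df.drop("Date_of_Journey", axis=1)
still_there = "Date_of_Journey" in df.columns

# Assigning the result back makes the drop persist.
df = df.drop(columns="Date_of_Journey")
gone = "Date_of_Journey" not in df.columns
print(still_there, gone)
```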


In [153]:
flyMoney["Arrival_Time"] = flyMoney["Arrival_Time"].apply(lambda x:x.split(" ")[0])
## this splits each value on spaces and keeps the first part, which is hh:mm;
## some values carry a trailing date (e.g. "01:10 22 Mar") a space away from the time
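
The same cleanup can be written with the vectorized `.str` accessor instead of `apply` with a lambda. A short sketch on three values copied from the sample rows shown earlier:

```python
import pandas as pd

arrivals = pd.Series(["01:10 22 Mar", "13:15", "04:25 10 Jun"])

# Keep only the text before the first space, i.e. the hh:mm clock time.
clock = arrivals.str.split(" ").str[0]
print(clock.tolist())
```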

In [154]:
flyMoney["arrivalByHour"] = flyMoney["Arrival_Time"].str.split(":").str[0]
flyMoney["arrivalByMin"] = flyMoney["Arrival_Time"].str.split(":").str[1]

In [155]:
flyMoney[:5]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,dateByDay,dateByMonth,dateByYear,arrivalByHour,arrivalByMin
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10,2h 50m,non-stop,No info,3897,24,3,2019,1,10
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019,13,15
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25,19h,2 stops,No info,13882,9,6,2019,4,25
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019,23,30
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019,21,35


In [156]:
flyMoney["arrivalByHour"] = flyMoney["arrivalByHour"].astype(int)
flyMoney["arrivalByMin"] = flyMoney["arrivalByMin"].astype(int)

In [157]:
## we fixed the time formatting, so we no longer need the old Arrival_Time column
flyMoney.drop("Arrival_Time", axis = 1, inplace = True)

In [158]:
flyMoney[:5]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,dateByDay,dateByMonth,dateByYear,arrivalByHour,arrivalByMin
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,2h 50m,non-stop,No info,3897,24,3,2019,1,10
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,7h 25m,2 stops,No info,7662,1,5,2019,13,15
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,19h,2 stops,No info,13882,9,6,2019,4,25
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,5h 25m,1 stop,No info,6218,12,5,2019,23,30
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,4h 45m,1 stop,No info,13302,1,3,2019,21,35


In [159]:
## do same thing for departure time
flyMoney["Dep_Time"] = flyMoney["Dep_Time"].apply(lambda x:x.split(" ")[0])
flyMoney["depByHour"] = flyMoney["Dep_Time"].str.split(":").str[0]
flyMoney["depByMin"] = flyMoney["Dep_Time"].str.split(":").str[1]
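
Since the arrival and departure columns go through the same steps, they could share one function. The helper below is a hypothetical sketch (`add_hour_minute` is not from the notebook or from pandas), and it assumes every value starts with an `hh:mm` time.

```python
import pandas as pd

def add_hour_minute(df, column, prefix):
    """Hypothetical helper: add <prefix>ByHour and <prefix>ByMin integer columns."""
    clock = df[column].str.split(" ").str[0]      # drop any trailing date text
    parts = clock.str.split(":", expand=True)     # column 0 = hours, column 1 = minutes
    out = df.copy()
    out[prefix + "ByHour"] = parts[0].astype(int)
    out[prefix + "ByMin"] = parts[1].astype(int)
    return out.drop(columns=column)               # the original text column is no longer needed

flights = pd.DataFrame({"Dep_Time": ["22:20", "05:50"], "Price": [3897, 7662]})
flights = add_hour_minute(flights, "Dep_Time", "dep")
print(flights)
```

Returning a new DataFrame instead of modifying the input keeps the helper safe to re-run in a notebook.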

In [160]:
flyMoney[:5]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,dateByDay,dateByMonth,dateByYear,arrivalByHour,arrivalByMin,depByHour,depByMin
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,2h 50m,non-stop,No info,3897,24,3,2019,1,10,22,20
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,7h 25m,2 stops,No info,7662,1,5,2019,13,15,5,50
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,19h,2 stops,No info,13882,9,6,2019,4,25,9,25
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,5h 25m,1 stop,No info,6218,12,5,2019,23,30,18,5
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,4h 45m,1 stop,No info,13302,1,3,2019,21,35,16,50


In [161]:
flyMoney = flyMoney.drop("Dep_Time", axis = 1)
flyMoney

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,dateByDay,dateByMonth,dateByYear,arrivalByHour,arrivalByMin,depByHour,depByMin
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,2h 50m,non-stop,No info,3897,24,3,2019,1,10,22,20
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2 stops,No info,7662,1,5,2019,13,15,05,50
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2 stops,No info,13882,9,6,2019,4,25,09,25
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1 stop,No info,6218,12,5,2019,23,30,18,05
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1 stop,No info,13302,1,3,2019,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU → BLR,2h 30m,non-stop,No info,4107,9,4,2019,22,25,19,55
10679,Air India,27/04/2019,Kolkata,Banglore,CCU → BLR,2h 35m,non-stop,No info,4145,27,4,2019,23,20,20,45
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR → DEL,3h,non-stop,No info,7229,27,4,2019,11,20,08,20
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR → DEL,2h 40m,non-stop,No info,12648,1,3,2019,14,10,11,30


In [162]:
flyMoney["depByHour"] = flyMoney["depByHour"].astype(int)
flyMoney["depByMin"] = flyMoney["depByMin"].astype(int)
flyMoney

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,dateByDay,dateByMonth,dateByYear,arrivalByHour,arrivalByMin,depByHour,depByMin
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,2h 50m,non-stop,No info,3897,24,3,2019,1,10,22,20
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2 stops,No info,7662,1,5,2019,13,15,5,50
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2 stops,No info,13882,9,6,2019,4,25,9,25
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1 stop,No info,6218,12,5,2019,23,30,18,5
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1 stop,No info,13302,1,3,2019,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU → BLR,2h 30m,non-stop,No info,4107,9,4,2019,22,25,19,55
10679,Air India,27/04/2019,Kolkata,Banglore,CCU → BLR,2h 35m,non-stop,No info,4145,27,4,2019,23,20,20,45
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR → DEL,3h,non-stop,No info,7229,27,4,2019,11,20,8,20
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR → DEL,2h 40m,non-stop,No info,12648,1,3,2019,14,10,11,30


In [163]:
flyMoney.info()
## great!! neither the new arrival time columns nor departure time columns are objects :D

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Duration         10683 non-null  object
 6   Total_Stops      10682 non-null  object
 7   Additional_Info  10683 non-null  object
 8   Price            10683 non-null  int64 
 9   dateByDay        10683 non-null  int64 
 10  dateByMonth      10683 non-null  int64 
 11  dateByYear       10683 non-null  int64 
 12  arrivalByHour    10683 non-null  int64 
 13  arrivalByMin     10683 non-null  int64 
 14  depByHour        10683 non-null  int64 
 15  depByMin         10683 non-null  int64 
dtypes: int64(8), object(8)
memory usage: 1.3+ MB


In [164]:
flyMoney[:2]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,dateByDay,dateByMonth,dateByYear,arrivalByHour,arrivalByMin,depByHour,depByMin
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,2h 50m,non-stop,No info,3897,24,3,2019,1,10,22,20
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2 stops,No info,7662,1,5,2019,13,15,5,50


In [165]:
flyMoney["Total_Stops"].unique()

array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)

In [166]:
flyMoney["Total_Stops"].mode()

Unnamed: 0,Total_Stops
0,1 stop


In [167]:
flyMoney["Total_Stops"].isna().value_counts()

Unnamed: 0_level_0,count
Total_Stops,Unnamed: 1_level_1
False,10682
True,1
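
The two checks above (most common value and number of missing values) can be reduced to plain Python values, which is handy when they feed a later step. A sketch on a short made-up Series:

```python
import numpy as np
import pandas as pd

stops = pd.Series(["non-stop", "1 stop", "1 stop", np.nan, "2 stops"])

missing = int(stops.isna().sum())   # number of NaN entries
most_common = stops.mode()[0]       # mode() returns a Series; take its first value
print(missing, most_common)
```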


In [168]:
flyMoney["Total_Stops"] = flyMoney["Total_Stops"].map({"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4, np.nan: 1})
## .map({}) assigns an integer value to each string value in the column
## np.nan: 1 replaces the one nan value with 1, matching the mode ("1 stop")
flyMoney["Total_Stops"]

Unnamed: 0,Total_Stops
0,0
1,2
2,2
3,1
4,1
...,...
10678,0
10679,0
10680,0
10681,0
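
One possible variation keeps the mapping dictionary limited to real categories and fills the gap in a separate step, so the imputation is easy to spot. The sketch below uses a short made-up Series and the same fill value of 1 as the cell above.

```python
import numpy as np
import pandas as pd

stops = pd.Series(["non-stop", "2 stops", np.nan, "1 stop"])
stop_counts = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}

# Values missing from the dict (including NaN) come out of map as NaN,
# so the fill value is applied in a separate, visible step.
encoded = stops.map(stop_counts).fillna(1).astype(int)
print(encoded.tolist())
```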


In [169]:
flyMoney[flyMoney["Total_Stops"].isna()]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,dateByDay,dateByMonth,dateByYear,arrivalByHour,arrivalByMin,depByHour,depByMin


In [170]:
flyMoney[:2]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,dateByDay,dateByMonth,dateByYear,arrivalByHour,arrivalByMin,depByHour,depByMin
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,2h 50m,0,No info,3897,24,3,2019,1,10,22,20
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2,No info,7662,1,5,2019,13,15,5,50


In [171]:
## we don't need the Route column bc we have Source and Destination
flyMoney.drop("Route", axis = 1, inplace = True)

In [172]:
flyMoney[:5]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Duration,Total_Stops,Additional_Info,Price,dateByDay,dateByMonth,dateByYear,arrivalByHour,arrivalByMin,depByHour,depByMin
0,IndiGo,24/03/2019,Banglore,New Delhi,2h 50m,0,No info,3897,24,3,2019,1,10,22,20
1,Air India,1/05/2019,Kolkata,Banglore,7h 25m,2,No info,7662,1,5,2019,13,15,5,50
2,Jet Airways,9/06/2019,Delhi,Cochin,19h,2,No info,13882,9,6,2019,4,25,9,25
3,IndiGo,12/05/2019,Kolkata,Banglore,5h 25m,1,No info,6218,12,5,2019,23,30,18,5
4,IndiGo,01/03/2019,Banglore,New Delhi,4h 45m,1,No info,13302,1,3,2019,21,35,16,50


In [173]:
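## we need to split the duration into hours and min like earlier with arrival and departure times
flyMoney["durByHour"] = flyMoney["Duration"].apply(lambda x:x.split(" ")[0])
## the next line strips the trailing "h", but its result is only displayed, not
## assigned back, so the durByHour column itself still holds strings like "2h"
flyMoney["durByHour"].str.split(" ").str[0].str.split("h").str[0]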

Unnamed: 0,durByHour
0,2
1,7
2,19
3,5
4,4
...,...
10678,2
10679,2
10680,3
10681,2


In [174]:
## this takes the second piece of the split (the minutes), but ONLY IF the value
## has more than 1 piece; otherwise "0m" is used, e.g. for "19h"
flyMoney["durByMin"] = flyMoney["Duration"].apply(lambda x: x.split(" ")[1] if len(x.split(" ")) > 1 else "0m")
flyMoney["durByMin"] = flyMoney["durByMin"].str.split("m").str[0]
flyMoney["durByMin"]

Unnamed: 0,durByMin
0,50
1,25
2,0
3,25
4,45
...,...
10678,30
10679,35
10680,0
10681,40
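
A regular expression is another way to pull the hour and minute parts out of the duration text and get integers directly. This is a sketch, not the notebook's method: it assumes durations look like `"2h 50m"`, `"19h"`, or minutes only, and the `"5m"` value is a made-up case to show the missing-hours branch.

```python
import pandas as pd

durations = pd.Series(["2h 50m", "19h", "5m"])  # "5m" is a made-up minutes-only case

# Each part is optional; a missing part is extracted as NaN and then filled with "0".
hours = durations.str.extract(r"(\d+)h")[0].fillna("0").astype(int)
minutes = durations.str.extract(r"(\d+)m")[0].fillna("0").astype(int)
total_minutes = hours * 60 + minutes
print(total_minutes.tolist())
```

A single total-minutes column is sometimes more convenient for modelling than separate hour and minute columns.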


In [175]:
## note: the result is not assigned back, so this only displays a copy without Duration
flyMoney.drop("Duration", axis = 1)

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Total_Stops,Additional_Info,Price,dateByDay,dateByMonth,dateByYear,arrivalByHour,arrivalByMin,depByHour,depByMin,durByHour,durByMin
0,IndiGo,24/03/2019,Banglore,New Delhi,0,No info,3897,24,3,2019,1,10,22,20,2h,50
1,Air India,1/05/2019,Kolkata,Banglore,2,No info,7662,1,5,2019,13,15,5,50,7h,25
2,Jet Airways,9/06/2019,Delhi,Cochin,2,No info,13882,9,6,2019,4,25,9,25,19h,0
3,IndiGo,12/05/2019,Kolkata,Banglore,1,No info,6218,12,5,2019,23,30,18,5,5h,25
4,IndiGo,01/03/2019,Banglore,New Delhi,1,No info,13302,1,3,2019,21,35,16,50,4h,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,0,No info,4107,9,4,2019,22,25,19,55,2h,30
10679,Air India,27/04/2019,Kolkata,Banglore,0,No info,4145,27,4,2019,23,20,20,45,2h,35
10680,Jet Airways,27/04/2019,Banglore,Delhi,0,No info,7229,27,4,2019,11,20,8,20,3h,0
10681,Vistara,01/03/2019,Banglore,New Delhi,0,No info,12648,1,3,2019,14,10,11,30,2h,40


In [176]:
flyMoney["Airline"].unique()

array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
       'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
       'Vistara Premium economy', 'Jet Airways Business',
       'Multiple carriers Premium economy', 'Trujet'], dtype=object)

In [177]:
flyMoney["Source"].unique()

array(['Banglore', 'Kolkata', 'Delhi', 'Chennai', 'Mumbai'], dtype=object)

In [179]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()

In [180]:
encoder.fit_transform(flyMoney[["Airline", "Source", "Destination"]]).toarray()
## double brackets select a DataFrame (2D), which is the input shape the encoder expects

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 0., 0., 0.]])

In [184]:
pd.DataFrame(encoder.fit_transform(flyMoney[["Airline", "Source", "Destination"]]).toarray(), columns = encoder.get_feature_names_out())
## one-hot encoding: each category gets its own column holding 1.0 when it applies to the row and 0.0 otherwise, so it's quicker to see which airline, source, and destination apply :)

Unnamed: 0,Airline_Air Asia,Airline_Air India,Airline_GoAir,Airline_IndiGo,Airline_Jet Airways,Airline_Jet Airways Business,Airline_Multiple carriers,Airline_Multiple carriers Premium economy,Airline_SpiceJet,Airline_Trujet,...,Source_Chennai,Source_Delhi,Source_Kolkata,Source_Mumbai,Destination_Banglore,Destination_Cochin,Destination_Delhi,Destination_Hyderabad,Destination_Kolkata,Destination_New Delhi
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
10679,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
10680,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
10681,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [185]:
## this is how we make categorical features numerical!!!
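
For quick exploration, pandas offers `pd.get_dummies` as an alternative route to the same kind of indicator columns, with the other columns carried along. The sketch below uses a made-up two-row frame; it is not a replacement for the fitted `OneHotEncoder`, which can be reused on new data.

```python
import pandas as pd

flights = pd.DataFrame({
    "Airline": ["IndiGo", "Air India"],
    "Source": ["Banglore", "Kolkata"],
    "Price": [3897, 7662],
})

# get_dummies replaces the listed columns with one indicator column per category
# and leaves the remaining columns (here Price) untouched.
model_ready = pd.get_dummies(flights, columns=["Airline", "Source"], dtype=int)
print(model_ready.columns.tolist())
```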