# SkyInsight: Predictive Analytics for Cost-Effective Air Travel

by: Mark Dunlea Tate, Landry Houston, Anthony Amadasun

---

### 1.1 Data Collection

In [2]:
import pandas as pd

In [3]:
# data url = 'https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction/data'

Read the data set into a DataFrame and drop the columns that are not needed.

In [4]:
df = pd.read_csv("../data/raw_dataset_FINAL.csv")
df.drop(columns=["Unnamed: 0", "days_left"], inplace=True)
df.head()

,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,price
0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,5953
1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,5953
2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,5956
3,Vistara,UK-995,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,5955
4,Vistara,UK-963,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,5955


---

### 1.2 Data Cleaning

Check for missing values.

In [5]:
df.isnull().sum()

airline             0
flight              0
source_city         0
departure_time      0
stops               0
arrival_time        0
destination_city    0
class               0
duration            0
price               0
dtype: int64

Convert the 'price' column from rupees to US dollars, using a fixed rate of 0.012 USD per rupee.

In [18]:
df["price"] = round(df["price"] * 0.12 / 10, 2)
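As a quick check of the arithmetic, the sketch below applies the same factor to the first few prices shown in the preview. The constant name `INR_TO_USD` and the `sample` Series are illustrative assumptions, not part of the original notebook. The rate is the fixed 0.012 implied by `0.12 / 10`, not a live exchange rate.

```python
import pandas as pd

# Assumed fixed rate, equivalent to the 0.12 / 10 factor used in the cell above.
INR_TO_USD = 0.012

# A few rupee prices taken from the df.head() preview.
sample = pd.Series([5953, 5956, 5955], name="price")

# Multiply by the rate and round to cents.
converted = (sample * INR_TO_USD).round(2)
print(converted.tolist())
```

Naming the rate makes it easier to update later if a different exchange rate is preferred.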

Convert the 'duration' column from hours to minutes.

In [19]:
df["duration"] = round(df["duration"] * 60, 0).astype(int)
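The rounding step before the integer cast matters, because `astype(int)` truncates toward zero. The sketch below compares the two approaches on durations from the preview. The variable names are illustrative only.

```python
import pandas as pd

# Durations in decimal hours, taken from the df.head() preview.
hours = pd.Series([2.17, 2.33, 2.25], name="duration")

# Casting directly truncates: 2.33 h is about 139.8 min, which becomes 139.
truncated = (hours * 60).astype(int)

# Rounding first gives the nearest whole minute: 139.8 becomes 140.
rounded = (hours * 60).round(0).astype(int)

print(pd.DataFrame({"hours": hours, "truncated": truncated, "rounded": rounded}))
```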

Change the 'stops' column to numeric for better modeling.

In [20]:
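# Assign the result back: inplace=True on a column selection may not update df in newer pandas versions.
df["stops"] = df["stops"].replace({"zero": 0, "one": 1, "two_or_more": 2})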

Binarize the 'class' column for better modeling.

In [21]:
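# Assign the result back: inplace=True on a column selection may not update df in newer pandas versions.
df["class"] = df["class"].replace({"Economy": 0, "Business": 1})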
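One caveat with `replace` is that any label missing from the dictionary is left unchanged, so an unexpected category would silently remain as a string. The sketch below shows one possible alternative, assuming the same two mappings: `Series.map` turns unmapped labels into `NaN`, which can then be detected. The helper `encode_column` is hypothetical and is not part of the original notebook.

```python
import pandas as pd

STOPS_MAP = {"zero": 0, "one": 1, "two_or_more": 2}
CLASS_MAP = {"Economy": 0, "Business": 1}


def encode_column(series, mapping):
    """Hypothetical helper: map labels to integers and fail on unknown labels."""
    encoded = series.map(mapping)  # labels missing from the mapping become NaN
    unmapped = series[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels in '{series.name}': {list(unmapped)}")
    return encoded.astype(int)


# Small illustrative frame using the category labels from the data set.
sample = pd.DataFrame(
    {
        "stops": ["zero", "one", "two_or_more"],
        "class": ["Economy", "Business", "Economy"],
    }
)
sample["stops"] = encode_column(sample["stops"], STOPS_MAP)
sample["class"] = encode_column(sample["class"], CLASS_MAP)
print(sample)
```

Raising an error keeps a misspelled or new category from reaching the modeling stage unnoticed.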

Rename columns for easier understanding.

In [22]:
df.rename(
    columns={"source_city": "origin", "destination_city": "destination"}, inplace=True
)

Export clean data set to a CSV file.

In [23]:
df.to_csv("../data/clean_dataset_FINAL.csv", index=False)