# ECEN 4322-5322 Data and Network Science

## Title: Fare prediction for flights


#### Group Members - Chirag Chandrashekar, Chris Alexander, Viveka Salinamekki.

## Introduction 

The dataset chosen for analysis contains itineraries for flights in the USA collected over 6 months. In the exploratory data analysis, we aim to identify the busiest and best-connected airports, how fares vary with the hour of the day, the average and minimum travel distance for which people prefer to fly, how fully booked a typical flight is (the percentage of seats booked), and which airlines are most popular. The end goal is to predict the fare of a flight. Because the dataset has a large number of rows and features, we expect the model to give a good estimate of the fare. Techniques such as data grouping and manipulation, visualization, regular expressions, data modeling, feature engineering, model validation, and prediction will help us achieve this goal.


<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Importing the Data

In [1]:
import numpy as np

import pandas as pd
from pandas.api.types import CategoricalDtype

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import zipfile
import os

from ecen5322_utils import run_linear_regression_test

# Plot settings
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 12

ModuleNotFoundError: No module named 'ecen5322_utils'

### Sampling the dataset

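The full dataset is about 30 GB, too large to load into memory at once, so it is read in chunks of 1,000,000 rows. The commented-out cells below record the procedure: each chunk is shuffled and its first 12,000 rows are kept, the kept rows are concatenated, and the combined sample is shuffled again and trimmed to 500,000 rows, which are saved as `sampled_file.csv`.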

In [None]:
'''
dfs = []
# Read in chunks: system resources are too low to load the 30 GB file at once
with pd.read_csv("car_price_prediction.csv", chunksize=1000000) as reader:
    for chunk in reader:
        # Shuffle the rows of the chunk, then keep the first 12,000
        shuffled_indices = np.random.permutation(chunk.index)
        chunk = chunk.loc[shuffled_indices].iloc[:12000]
        dfs.append(chunk)  # collect the sampled chunks in a list
final_df = pd.concat(dfs)  # concatenate the sampled chunks into one DataFrame
print("final dataframe")
final_df
'''

In [None]:
# Shuffle the combined sample and keep 500,000 rows for the new CSV file
'''
shuffled_indices2 = np.random.permutation(final_df.index)
final_df2 = final_df.loc[shuffled_indices2].iloc[:500000]
final_df2
'''

In [None]:
#from google.colab import  files
#final_df2.to_csv('sampled_file.csv')
#files.download('sampled_file.csv')
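
The two-stage sampling described in this section can also be written with `DataFrame.sample`. The block below is a minimal sketch under stated assumptions, not the code that produced `sampled_file.csv`: the helper `sample_in_chunks`, its parameters, and the toy two-column CSV are hypothetical and chosen only so the example runs quickly.

```python
import io

import pandas as pd


def sample_in_chunks(source, chunksize, rows_per_chunk, final_rows, seed=0):
    """Read a CSV in chunks, keep a random subset of each, then downsample."""
    sampled = []
    with pd.read_csv(source, chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            # A chunk can be shorter than rows_per_chunk (typically the last one).
            n = min(rows_per_chunk, len(chunk))
            sampled.append(chunk.sample(n=n, random_state=seed + i))
    combined = pd.concat(sampled)
    # Second pass: draw the final sample from the combined rows.
    n_final = min(final_rows, len(combined))
    return combined.sample(n=n_final, random_state=seed)


# Tiny in-memory stand-in for the large file (hypothetical columns).
toy_csv = io.StringIO(
    "legId,fare\n" + "\n".join(f"id{i},{100 + i}" for i in range(25))
)
toy_sample = sample_in_chunks(toy_csv, chunksize=10, rows_per_chunk=4, final_rows=8)
print(toy_sample.shape)
```

Passing a fixed `random_state` makes the sample reproducible, which the `np.random.permutation` approach only offers if a global seed is set first.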

In [9]:
#data=pd.read_csv('sampled_file.csv')

# Read the sampled file, treating -1 (in addition to the default markers) as NaN, with legId as the index
data=pd.read_csv("sampled_file.csv"
                 , index_col="legId"
                 #, dtype=str
                 , keep_default_na=True
                 , na_values=-1
                 , na_filter=True)
data

legId,Unnamed: 0,searchDate,flightDate,startingAirport,destinationAirport,fareBasisCode,travelDuration,elapsedDays,isBasicEconomy,isRefundable,...,segmentsArrivalTimeEpochSeconds,segmentsArrivalTimeRaw,segmentsArrivalAirportCode,segmentsDepartureAirportCode,segmentsAirlineName,segmentsAirlineCode,segmentsEquipmentDescription,segmentsDurationInSeconds,segmentsDistance,segmentsCabinCode
e95cef0009893d65558d17324e468aea,18860256,2022-06-01,2022-06-15,PHL,JFK,SUAJZNB3,PT5H17M,0,True,False,...,1655309340||1655323200,2022-06-15T12:09:00.000-04:00||2022-06-15T16:0...,BOS||JFK,PHL||BOS,American Airlines||American Airlines,AA||AA,Airbus A321||AIRBUS INDUSTRIE A321 SHARKLETS,5160||4920,280||185,coach||coach
778d47d0785023302cd075d735d27db8,55299894,2022-08-11,2022-08-20,CLT,ATL,TA7NA0MC,PT14H,1,False,False,...,1661037300||1661050560||1661081220,2022-08-20T19:15:00.000-04:00||2022-08-20T22:5...,DTW||IND||ATL,CLT||DTW||IND,Delta||Delta||Delta,DL||DL||DL,Boeing 717||Airbus A321||Boeing 737-900,6480||4020||5220,505||241||434,coach||coach||coach
55ca7cc1d822f310963a25d84656bb47,29820237,2022-06-26,2022-07-01,OAK,JFK,HA0NA0MC,PT8H13M,0,False,False,...,1656706200||1656730740,2022-07-01T13:10:00.000-07:00||2022-07-01T22:5...,LAX||JFK,OAK||LAX,Delta||Delta,DL||DL,Embraer 175 (Enhanced Winglets)||,5040||20040,338||2458,coach||coach
598db4391eb0bc5752b09c871111a5ce,12719339,2022-05-14,2022-05-23,BOS,EWR,QAA3OFEN,PT1H41M,0,False,False,...,1653347940,2022-05-23T19:19:00.000-04:00,EWR,BOS,United,UA,Airbus A319,6060,185,coach
e09441b23c76e8a35be3acb61d8e3e95,77054039,2022-09-25,2022-10-13,ATL,BOS,TAUNX0BC,PT10H32M,0,True,False,...,1665667500||1665697620,2022-10-13T09:25:00.000-04:00||2022-10-13T17:4...,JFK||BOS,ATL||JFK,Delta||Delta,DL||DL,Boeing 737-900||Airbus A220-100,7800||5220,762||185,coach||coach
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
cfd0ca3c7574a87f1ce1dfa83d1a053b,33226811,2022-07-02,2022-07-15,LAX,CLT,HA7OA0MQ,PT9H10M,1,False,False,...,1657969320||1657986300,2022-07-16T07:02:00.000-04:00||2022-07-16T11:4...,DTW||CLT,LAX||DTW,Delta||Delta,DL||DL,Airbus A321||Boeing 717,16020||6300,1985||505,coach||coach
3c671c288391b26e5888c84ef143424d,8650528,2022-05-04,2022-06-23,CLT,LAX,UAVNA0MC,PT14H11M,0,False,False,...,1655982720||1655995500||1656029760,2022-06-23T07:12:00.000-04:00||2022-06-23T09:4...,ATL||DFW||LAX,CLT||ATL||DFW,Delta||Delta||Delta,DL||DL||DL,Boeing 717||Airbus A321||Airbus A320,4020||8100||11760,228||725||1238,coach||coach||coach
2313cf9772214eaddf14651c8847eaeb,57058136,2022-08-14,2022-08-26,BOS,OAK,QAA0OHEN,PT11H11M,0,False,False,...,1661558040||1661573100||1661583120,2022-08-26T18:54:00.000-05:00||2022-08-26T22:0...,IAH||SLC||OAK,BOS||IAH||SLC,United||Delta||Delta,UA||DL||DL,Airbus A320||Airbus A319||Airbus A220-100,15180||11460||6780,1602||1204||588,coach||coach||coach
9141b51058bfbdf9adc91abb2be9078c,54686029,2022-08-09,2022-09-06,EWR,SFO,L7AHZNN1,PT8H13M,0,False,False,...,1662485760||1662506040,2022-09-06T12:36:00.000-05:00||2022-09-06T16:1...,ORD||SFO,EWR||ORD,American Airlines||American Airlines,AA||AA,Boeing 737-800||Boeing 737-800,9300||16740,720||1847,coach||coach


<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Cleaning the data

### Adding NaN to the file
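
One way this step could look is sketched below. It assumes that missing values appear either as a `-1` sentinel in numeric columns or as empty strings in text columns; both are assumptions about the data, and the `distance` column is a hypothetical stand-in rather than a confirmed column name.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "distance": [947.0, -1.0, 228.0],  # hypothetical numeric column
        "segmentsEquipmentDescription": ["Airbus A321", "", "Boeing 717"],
    }
)

cleaned = toy.copy()
# Series.mask replaces values with NaN wherever the condition is True.
cleaned["distance"] = toy["distance"].mask(toy["distance"] == -1)
cleaned["segmentsEquipmentDescription"] = toy["segmentsEquipmentDescription"].mask(
    toy["segmentsEquipmentDescription"] == ""
)

# Count the missing entries per column to check the result.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Counting NaN values per column afterwards is a quick check that the sentinels were actually caught.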

### Converting True and False to 1 and 0
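
A minimal sketch of this conversion, using the `isBasicEconomy` and `isRefundable` columns from the sample on a toy DataFrame. It assumes the columns were read with a boolean dtype; columns holding the strings `"True"` and `"False"` would need to be mapped first.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "isBasicEconomy": [True, False, False],
        "isRefundable": [False, False, True],
    }
)

# Find every boolean column, then cast each one to integers (True -> 1, False -> 0).
bool_columns = toy.select_dtypes(include="bool").columns
encoded = toy.astype({col: int for col in bool_columns})
print(encoded)
```

Selecting columns by dtype avoids hard-coding the names, so any other boolean flag in the data would be converted the same way.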

### Renaming columns
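
Renaming can be done with `DataFrame.rename` and a mapping from old to new names. The sketch below uses three columns from the sample; the new names (`sourceRow`, `origin`, `destination`) are hypothetical placeholders, not names chosen by the project.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Unnamed: 0": [18860256, 55299894],
        "startingAirport": ["PHL", "CLT"],
        "destinationAirport": ["JFK", "ATL"],
    }
)

# Hypothetical new names; the mapping the project settles on may differ.
new_names = {
    "Unnamed: 0": "sourceRow",
    "startingAirport": "origin",
    "destinationAirport": "destination",
}
renamed = toy.rename(columns=new_names)
print(renamed.columns.tolist())
```

`rename` returns a new DataFrame and leaves columns that are not in the mapping unchanged, so the mapping only needs to list the columns being renamed.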

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Exploratory Data Analysis

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Prediction of____

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Validation of model

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Inference and Conclusion