# Exploratory data analysis

**We will start by looking at our dataset. After that, we will examine the correlation between the predictors and choose the predictors that are useful for solving our problem.**

**Function to show a boxplot, histplot and violinplot for each column ( showPlots(df, len(df.columns)) )**

In [None]:
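def showPlots(tempData, columns):
    # One row of axes per column: boxplot, histogram and violin plot side by side.
    # squeeze=False keeps axes two-dimensional even when columns == 1.
    f, axes = plt.subplots(columns, 3, figsize=(30, 20), squeeze=False)
    count = 0
    for var in tempData:
        try:
            sb.boxplot(x=tempData[var], ax=axes[count, 0])
            sb.histplot(x=tempData[var], ax=axes[count, 1])
            sb.violinplot(x=tempData[var], ax=axes[count, 2])
            count += 1
        except Exception as err:
            # Usually a non-numeric column; its row of axes is reused by the next column
            print("Skipping column:", var, "-", err)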

**Function to show HeatMap ( heatMap(df,row size, col size) )**

In [None]:
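def heatMap(df, row, col):
    # figsize is (width, height) in inches; the correlation uses the numeric columns only
    f = plt.figure(figsize=(row, col))
    sb.heatmap(df.select_dtypes(include="number").corr(), vmin=-1, vmax=1, linewidths=1,
               annot=True, fmt=".2f", annot_kws={"size": 18}, cmap="RdBu")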

**Function to retrieve List/Key**

In [None]:
def getList(my_dict):
    # Return a view of the dictionary's keys
    return my_dict.keys()
name = getList(carrierindex)

In [None]:
def get_key(my_dict, val):
    # Return the first key whose value equals val
    for key, value in my_dict.items():
        if val == value:
            return key

    return "key doesn't exist"
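
**`get_key` scans the whole dictionary on every call. When many codes have to be translated, inverting the dictionary once is usually simpler. A minimal sketch, assuming a dictionary shaped like `carrierindex` (name to code); `sample_index`, `code_to_name` and `lookup_name` are made-up names for illustration:**

```python
# Hypothetical stand-in for carrierindex: airline name -> carrier code
sample_index = {"Airline A": 0, "Airline B": 1, "Airline C": 2}

# Invert once: carrier code -> airline name
code_to_name = {code: airline for airline, code in sample_index.items()}

def lookup_name(code, default="key doesn't exist"):
    # dict.get returns the default instead of raising KeyError for unknown codes
    return code_to_name.get(code, default)

print(lookup_name(1))
print(lookup_name(99))
```

**The inversion only works cleanly when every value in the dictionary is unique; with repeated values, the last name seen for a code wins.**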

In [None]:
heatMap(flightdata_df,30,30);

In [None]:
flightdata_df.plot.scatter(x="DEP_DELAY_NEW", y="ARR_DELAY_NEW");

**From the heatmap and the scatter plot, we observed that ARR_DELAY_NEW is highly correlated with DEP_DELAY_NEW. This makes sense: when a flight departs late, it will usually arrive late as well. We also observed several predictors that have a low correlation with the departure and arrival delays. We will use "OP_CARRIER" to find the carrier with the highest number of delays and "ORIGIN" to find the origin airport with the highest number of delays.**

In [None]:
flightdata_df[['OP_CARRIER','DEP_DELAY_NEW','ARR_DELAY_NEW','CANCELLATION_CODE','CARRIER_DELAY','WEATHER_DELAY','NAS_DELAY','SECURITY_DELAY','LATE_AIRCRAFT_DELAY']].describe()

**From the summary statistics, we can see that there are outliers that lie far away from the median. We will remove these outliers later.**

# Choose predictors and get rid of outliers



In [None]:
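# Remove outliers with the 1.5 * IQR rule, applied to the numeric columns only
numeric_df = flightdata_df.select_dtypes(include="number")
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
is_outlier = ((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))).any(axis=1)
# .copy() gives an independent frame, so columns can be added to newData later without warnings
newData = flightdata_df[~is_outlier].copy()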
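
**The effect of the IQR rule is easier to see on a small frame. The following is a minimal sketch on made-up numbers, not the flight data itself; the column names are borrowed only for readability:**

```python
import pandas as pd

# Toy data (made up): a single extreme departure delay of 300
toy = pd.DataFrame({
    "DEP_DELAY": [0, 2, 3, 4, 5, 6, 7, 300],
    "ARR_DELAY": [1, 2, 2, 3, 4, 5, 6, 7],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# A row is flagged when any of its columns falls outside the fences
outside = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))
kept = toy[~outside.any(axis=1)]

print(len(toy), "rows before,", len(kept), "rows after")
```

**Because `.any(axis=1)` drops the whole row as soon as one column is outside its fences, the number of removed rows grows with the number of columns that are checked.**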

In [None]:
flightData = newData[['ARR_TIME','ARR_DELAY_NEW','DEP_DELAY_NEW','DEP_DELAY','CARRIER_DELAY','NAS_DELAY','LATE_AIRCRAFT_DELAY','OP_CARRIER','ORIGIN']]

In [None]:
flightData[['ARR_DELAY_NEW','DEP_DELAY_NEW']].describe()

**We chose 'ARR_TIME', 'ARR_DELAY_NEW', 'DEP_DELAY_NEW', 'DEP_DELAY', 'CARRIER_DELAY', 'NAS_DELAY' and 'LATE_AIRCRAFT_DELAY' as predictors because the heatmap shows that they are correlated with each other. We chose 'OP_CARRIER' and 'ORIGIN' because we want to explore the delays by airline and by origin airport.**
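
**Reading the strongest correlations off a large heatmap can be tedious. One possible alternative is to rank the columns by their absolute correlation with a target column. A minimal sketch on made-up values, assuming DEP_DELAY_NEW is the target of interest:**

```python
import pandas as pd

# Toy data (made up) standing in for a few flightData columns
toy = pd.DataFrame({
    "DEP_DELAY_NEW": [0, 5, 10, 20, 40, 60],
    "ARR_DELAY_NEW": [0, 3, 12, 18, 45, 58],
    "ARR_TIME":      [900, 1500, 1100, 2100, 1300, 700],
})

# Absolute Pearson correlation of every other column with the target
target = "DEP_DELAY_NEW"
ranking = (
    toy.corr()[target]
    .drop(target)
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```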

In [None]:
sb.boxplot(x=newData['DEP_DELAY'])

**Based on the boxplot, we can conclude that more than 75% of the departure delays are zero or negative. A negative delay is an early departure, which is not useful for our problem. All negative delays are therefore converted to zero and stored in "DEP_DELAY_NEW", and we will use "DEP_DELAY_NEW" from now on. We do not show the boxplot for "DEP_DELAY_NEW" because it is too compressed to read.**
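
**The conversion of negative delays to zero can be written with `Series.clip`. A minimal sketch on made-up delays; if the dataset already ships a "DEP_DELAY_NEW" column, this only reproduces it:**

```python
import pandas as pd

# Toy departure delays in minutes (made up); negative means an early departure
dep_delay = pd.Series([-5, -2, 0, 3, 45], name="DEP_DELAY")

# Replace every negative value with 0 and leave the rest unchanged
dep_delay_new = dep_delay.clip(lower=0).rename("DEP_DELAY_NEW")

print(dep_delay_new.tolist())
# Share of flights that left on time or early
print((dep_delay <= 0).mean())
```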

**Comparing the heatmaps before and after removing the outliers, we can see that NAS_DELAY is highly correlated with the departure delay.**

In [None]:
sb.kdeplot(
   data=newData, x="DEP_DELAY",
   fill=True,bw_method=0.3)

**Based on the kernel density estimate plot, we observed that most of the departure delays lie between -5 and -2 minutes, i.e. most flights leave a few minutes early.**

In [None]:
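# Restrict the correlation to numeric columns so non-numeric ones cannot raise an error
print(flightData.select_dtypes(include="number").corr())
heatMap(flightData,30,30)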

In [None]:
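sb.set(rc={'figure.figsize':(15,13)})
averageFlight = newData.groupby('OP_CARRIER').DEP_DELAY_NEW.mean()
# Label by carrier code rather than by position; this assumes carrierindex maps
# airline name -> OP_CARRIER code, in the same way airportDict does for ORIGIN.
averageFlight = averageFlight.rename(index={code: airline for airline, code in carrierindex.items()})
averageFlight = averageFlight.sort_values(ascending=False).head(20)
# Draw the chart once, with keyword arguments (recent seaborn versions reject positional x/y)
ax = sb.barplot(x=averageFlight.values, y=averageFlight.index, alpha=0.8, orient='h')
ax.bar_label(ax.containers[0])
ax.set(title="Airline's average delay time", xlabel="Minutes", ylabel="Airline")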

**From the plot, the top 5 airlines with the highest average delay time are as follows:**

**1) Frontier Airlines**\
**2) JetBlue Airways**\
**3) Allegiant Air**\
**4) ExpressJet**\
**5) PSA Airlines**


In [None]:
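carrierCounts = flightData["OP_CARRIER"].value_counts()
# value_counts() sorts by count, so label by carrier code rather than by position; this
# assumes carrierindex maps airline name -> OP_CARRIER code, as airportDict does for ORIGIN.
carrierCounts = carrierCounts.rename(index={code: airline for airline, code in carrierindex.items()})
AirlineDelay = carrierCounts.sort_values(ascending=False).head(20)
sb.set(rc={'figure.figsize':(15,13)})
# Draw the chart once, with keyword arguments (recent seaborn versions reject positional x/y)
ax = sb.barplot(x=AirlineDelay.values, y=AirlineDelay.index, alpha=0.8, orient='h')
ax.bar_label(ax.containers[0])
ax.set(title="Airline delay count")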

**From the plot, the top 5 airlines with the highest delay counts are as follows:**

**1) United Airlines**\
**2) Alaska Airlines**\
**3) Endeavor Air**\
**4) JetBlue Airways**\
**5) ExpressJet**
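
**`value_counts` on "OP_CARRIER" counts every row of a carrier. If the frame still contains flights that left on time, a delay count needs a filter first. A minimal sketch on made-up data, assuming a flight counts as delayed when DEP_DELAY_NEW is greater than 0; the carrier codes "A", "B" and "C" are placeholders:**

```python
import pandas as pd

# Toy data (made up): carrier code and departure delay in minutes
toy = pd.DataFrame({
    "OP_CARRIER":    ["A", "A", "A", "B", "B", "C"],
    "DEP_DELAY_NEW": [0, 0, 12, 30, 5, 0],
})

# All flights per carrier versus only the delayed ones
all_flights = toy["OP_CARRIER"].value_counts()
delayed = toy.loc[toy["DEP_DELAY_NEW"] > 0, "OP_CARRIER"].value_counts()

# Align the two counts; carriers without any delayed flight get 0
summary = pd.DataFrame({"flights": all_flights, "delayed": delayed}).fillna(0).astype(int)
print(summary)
```

**In this toy frame, carrier "A" has the most flights but carrier "B" has the most delayed flights, so the two rankings can differ.**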


In [None]:
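originCounts = flightdata_df["ORIGIN"].value_counts()
airportList = getList(airportDict)
# value_counts() sorts by count, so label by ORIGIN code rather than by position
# (airportDict maps airport label -> ORIGIN code)
originCounts = originCounts.rename(index={code: airport for airport, code in airportDict.items()})
originDelay = originCounts.sort_values(ascending=False).head(30)
sb.set(rc={'figure.figsize':(15,13)})
# Draw the chart once, with keyword arguments (recent seaborn versions reject positional x/y)
ax = sb.barplot(x=originDelay.values, y=originDelay.index, alpha=0.8, orient='h')
ax.bar_label(ax.containers[0])
ax.set(title="ORIGIN delay count")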

**From the plot, the top 5 origin airports with the highest delay counts are as follows:**

**1) Denver**\
**2) San Francisco**\
**3) O’Hare Airport (Chicago)** \
**4) Albany International Airport (New York)** \
**5) Omaha (Eppley Airfield)**



In [None]:
# Invert airportDict (airport label -> ORIGIN code) so that ORIGIN codes map back to labels
new_dict = {value: key for key, value in airportDict.items()}
# assign() returns a new frame, which avoids a SettingWithCopyWarning on the filtered data
newData = newData.assign(ORIGINNAME=newData['ORIGIN'].map(new_dict))
sb.set(rc={'figure.figsize':(15,13)})
averageFlight = newData.groupby('ORIGINNAME').DEP_DELAY_NEW.mean()
averageFlight = averageFlight.sort_values(ascending=False).head(20)
# Draw the chart once, with keyword arguments (recent seaborn versions reject positional x/y)
ax = sb.barplot(x=averageFlight.values, y=averageFlight.index, alpha=0.8, orient='h')
ax.bar_label(ax.containers[0])
ax.set(title="ORIGIN average delay time", xlabel="Minutes", ylabel="Origin airport")

**From the plot, the top 5 origin airports with the highest average delay time are as follows:**

**1) STL (St. Louis)** \
**2) OAK (Oakland)** \
**3) BUR (Burbank)** \
**4) LAS (Las Vegas)** \
**5) BWI (Baltimore/D.C.)**
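
**A mean on its own can be dominated by airports with very few flights. One way to check this is to report the number of flights next to the mean. A minimal sketch on made-up data; the airport labels "P", "Q", "R" and the `min_flights` threshold are hypothetical:**

```python
import pandas as pd

# Toy data (made up): airport "Q" has a single, very late flight
toy = pd.DataFrame({
    "ORIGINNAME":    ["P", "P", "P", "P", "Q", "R", "R", "R"],
    "DEP_DELAY_NEW": [0, 10, 20, 10, 90, 5, 0, 10],
})

# Mean delay and number of flights per origin airport
stats = toy.groupby("ORIGINNAME")["DEP_DELAY_NEW"].agg(["mean", "count"])

min_flights = 3  # hypothetical threshold, chosen only for this example
ranked = stats[stats["count"] >= min_flights].sort_values("mean", ascending=False)
print(ranked)
```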



In [None]:
newData.to_csv(r'INPUT PATH HERE', index=False, header=True)

# Note: replace the text inside the quotation marks with the path where the CSV should be saved.
# Make sure the exported file name ends with .csv (.txt also works).
# Running this cell writes the CSV to that path.