# Data Analysis

Complete the following tasks related to data analysis for this project.

## Part 1

Read the `data/processed/2017-delays.csv` file into a dataframe.
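
A minimal sketch of this step, using a small in-memory stand-in for the CSV so it runs without the data file. The sample rows and the name `delays` are illustrative assumptions; the column names follow the dataset. Passing `index_col=0` assumes the first, unnamed column is a saved row index. Without it, pandas loads that column as `Unnamed: 0`.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/processed/2017-delays.csv (a few sample rows)
sample_csv = io.StringIO(
    ",FL_DATE,OP_UNIQUE_CARRIER,ORIGIN,DEST,ARR_DELAY\n"
    "0,2017-01-01,AA,JFK,LAX,27.0\n"
    "1,2017-01-01,AA,LAX,JFK,42.0\n"
    "7,2017-01-01,AA,JFK,SFO,-22.0\n"
)

# index_col=0 treats the first column as the row index instead of as data
delays = pd.read_csv(sample_csv, index_col=0)
print(delays.columns.tolist())
```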

## Part 2

Create a csv that contains the probability of a flight being delayed more than 30 minutes for each airport within the dataframe.

- There should be one row for each airport in the dataset.
- Name the csv `data/processed/delay-probabilities.csv`.
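
One possible way to compute these probabilities is to flag each flight as delayed or not and take the mean of that flag per airport. The sketch below uses toy data and assumes that "airport" means the destination (`DEST`) and that "delayed" refers to `ARR_DELAY`; the names `flights` and `delay_prob` are illustrative. Missing delays compare as not delayed here, which may or may not be the treatment you want.

```python
import pandas as pd

# Hypothetical toy data; the real dataframe has one row per flight
flights = pd.DataFrame({
    "DEST": ["ATL", "ATL", "ATL", "BOS", "BOS"],
    "ARR_DELAY": [45.0, 5.0, -3.0, 31.0, 30.0],
})

# Flag flights delayed by more than 30 minutes, then average the flag per airport.
# The mean of a boolean column is the fraction of True values.
delay_prob = (
    flights["ARR_DELAY"].gt(30)
    .groupby(flights["DEST"])
    .mean()
    .rename("delay_prob")
    .reset_index()
)
print(delay_prob)
```

Because every airport appears in the grouping, an airport with no delayed flights gets a probability of 0 instead of being dropped.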

## Part 3

Create a csv that contains the mean airport arrival delay for each airport on each day within the dataframe. The mean airport arrival delay for airport $i$ on day $t$ is the mean of arrival delays for all flights arriving at airport $i$ on day $t$.

- Each row of the csv should contain the mean arrival delay for a single airport on a single day.
- Name the csv `data/processed/daily-mean-delays.csv`.
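
The definition above maps onto a two-key `groupby`. This is a sketch on toy data, assuming arrivals are identified by `DEST` and days by `FL_DATE`; `flights` and `daily_mean` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data: three flights into ATL over two days, one into BOS
flights = pd.DataFrame({
    "FL_DATE": ["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-01"],
    "DEST": ["ATL", "ATL", "ATL", "BOS"],
    "ARR_DELAY": [10.0, 20.0, -4.0, 6.0],
})

# One row per (airport, day) holding the mean arrival delay for that pair
daily_mean = (
    flights.groupby(["DEST", "FL_DATE"])["ARR_DELAY"]
    .mean()
    .reset_index()
)
print(daily_mean)
```

Selecting `ARR_DELAY` before calling `mean()` keeps unrelated columns out of the result.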
  
## Part 4

Create a csv that contains the mean and standard deviation of the mean airport arrival delays for each airport within the dataframe, across all days. The mean of the mean airport arrival delays for airport $i$ is the mean of mean airport arrival delays for airport $i$ over all days in the dataframe. The standard deviation of mean airport arrival delays for airport $i$ is the standard deviation of mean airport arrival delays for airport $i$ over all days in the dataframe.

- There should be one row for each airport in the dataset.
- Name the csv `data/processed/mean-std-delays.csv`.
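
Note that these statistics are taken over the daily means, not over individual flights, so the calculation has two stages. The sketch below uses toy data with illustrative names (`flights`, `daily_mean`, `summary`) and assumes the sample standard deviation (pandas' default, `ddof=1`) is acceptable.

```python
import pandas as pd

# Hypothetical toy data: ATL has three flights on one day and one on the next
flights = pd.DataFrame({
    "FL_DATE": ["2017-01-01", "2017-01-01", "2017-01-01",
                "2017-01-02", "2017-01-01", "2017-01-02"],
    "DEST": ["ATL", "ATL", "ATL", "ATL", "BOS", "BOS"],
    "ARR_DELAY": [0.0, 0.0, 0.0, 8.0, 2.0, 6.0],
})

# Stage 1: mean arrival delay per airport per day
daily_mean = flights.groupby(["DEST", "FL_DATE"])["ARR_DELAY"].mean()

# Stage 2: mean and standard deviation of those daily means per airport
summary = daily_mean.groupby(level="DEST").agg(["mean", "std"]).reset_index()
print(summary)

# For comparison: averaging individual flights weights busy days more heavily
flight_level_mean = flights.groupby("DEST")["ARR_DELAY"].mean()
print(flight_level_mean)
```

In this toy data the mean of ATL's daily means is 4.0, while the mean over ATL's individual flights is 2.0, which shows that the two approaches are not interchangeable.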

## Hints

- Example 1 (Constructing DataFrame from a dictionary) from [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) might help you create a dataframe from 2 lists of values.
- Part 2 from lab 04 might help with Part 2.
- The pandas [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function might be helpful for Part 3.
- Part 4 from lab 04 might help with Part 4.
- The pandas `xs` function might help with Part 4. See [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.xs.html) for documentation and [here](https://stackoverflow.com/questions/14964493/multiindex-based-indexing-in-pandas) for a Stack Overflow example.
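
As a rough illustration of the `xs` hint, the sketch below builds a small dataframe with a (`DEST`, `FL_DATE`) MultiIndex, similar in shape to a two-key `groupby` result, and pulls out the rows for one airport. The values and the names `daily_mean` and `atl` are made up for the example.

```python
import pandas as pd

# Hypothetical daily means indexed by (airport, day)
daily_mean = pd.DataFrame(
    {"ARR_DELAY": [15.0, -4.0, 6.0]},
    index=pd.MultiIndex.from_tuples(
        [("ATL", "2017-01-01"), ("ATL", "2017-01-02"), ("BOS", "2017-01-01")],
        names=["DEST", "FL_DATE"],
    ),
)

# Cross-section: all days for one airport; the DEST level is dropped from the index
atl = daily_mean.xs("ATL", level="DEST")
print(atl)
print(atl["ARR_DELAY"].mean(), atl["ARR_DELAY"].std())
```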

In [8]:
import numpy as np
import pandas as pd

## Part 1 - Read data into a dataframe

In [9]:
bigframe_Delay = pd.read_csv('data/processed/2017-delays.csv')
bigframe_Delay

,Unnamed: 0,FL_DATE,OP_UNIQUE_CARRIER,ORIGIN,DEST,ARR_DELAY
0,0,2017-01-01,AA,JFK,LAX,27.0
1,1,2017-01-01,AA,LAX,JFK,42.0
2,2,2017-01-01,AA,LAX,JFK,42.0
3,7,2017-01-01,AA,JFK,SFO,-22.0
4,8,2017-01-01,AA,LAX,JFK,-30.0
...,...,...,...,...,...,...
1839457,5674583,2017-12-31,WN,TPA,PHL,-3.0
1839458,5674584,2017-12-31,WN,TPA,PHX,2.0
1839459,5674585,2017-12-31,WN,TPA,PHX,-7.0
1839460,5674597,2017-12-31,WN,TPA,STL,-4.0


## Part 2 - Calculate delay probabilities

In [22]:
# YOUR CODE HERE
# Keep only the flights that arrived more than 30 minutes late.
total_count1 = bigframe_Delay.loc[bigframe_Delay['ARR_DELAY'] > 30]
total_count2 = total_count1.filter(['DEST', 'ARR_DELAY'])
total_count3 = total_count2.sort_values(['DEST'], ascending=True)  # destinations are intentionally sorted in ascending order
# Column 0 of total_count_frame holds the number of delayed flights for the airport in 'DEST'.
total_count_frame = total_count3.groupby(['DEST']).size().reset_index()

# The variable names are arbitrary; total_count_frame is the final filtered dataframe.
bigframe_Delay = bigframe_Delay.sort_values(['DEST'], ascending=True)
airports_Dest = list(total_count_frame['DEST'])

probability_list = []
for i in range(len(airports_Dest)):  # one iteration per destination airport (30 in this dataset)
    # Delayed arrivals at this airport divided by all arrivals at this airport.
    probability = total_count_frame[0][i] / (bigframe_Delay.loc[bigframe_Delay['DEST'] == airports_Dest[i]].shape[0])
    probability_list.append(probability)

# Despite its name, the 'AIRLINES' column holds destination airport codes.
df = pd.DataFrame(list(zip(probability_list, airports_Dest)), columns=['delay_prob', 'AIRLINES'])
df.to_csv('data/processed/delay-probabilities.csv')

In [19]:
total_count_frame['DEST'].unique().size

30

In [23]:
df

,delay_prob,AIRLINES
0,0.090128,ATL
1,0.14521,BOS
2,0.096333,BWI
3,0.08107,CLT
4,0.112255,DAL
5,0.100607,DCA
6,0.097771,DEN
7,0.096559,DFW
8,0.079909,DTW
9,0.192545,EWR


## Part 3 - Calculate daily mean airport delays

In [12]:
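# numeric_only=True skips text columns such as ORIGIN, which recent pandas versions refuse to average.
daily_mean_delay = bigframe_Delay.groupby(['DEST','FL_DATE']).mean(numeric_only=True)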


daily_mean_delay.to_csv('data/processed/daily-mean-delays.csv')
daily_mean_delay

DEST,FL_DATE,Unnamed: 0,ARR_DELAY
ATL,2017-01-01,8.558469e+03,6.900312
ATL,2017-01-02,2.063851e+04,24.286070
ATL,2017-01-03,3.463805e+04,-1.661417
ATL,2017-01-04,5.332651e+04,-3.650273
ATL,2017-01-05,6.521672e+04,-1.082888
...,...,...,...
TPA,2017-12-27,5.607592e+06,7.619718
TPA,2017-12-28,5.622785e+06,13.071429
TPA,2017-12-29,5.639634e+06,11.866197
TPA,2017-12-30,5.653835e+06,17.547445


## Part 4 - Calculate the mean of daily mean airport delays (over the entire time period)

In [13]:
# YOUR CODE HERE
bigframe_Delay = pd.read_csv('data/processed/2017-delays.csv')
# I switched to Ethan's method here because it is a lot more efficient than my original one.
# Note: this aggregates individual flights per airport, not the daily means from Part 3.
daily_apt_means_std = bigframe_Delay.groupby(['DEST']).agg([np.mean,np.std]).reset_index()

daily_apt_means_std.to_csv('data/processed/mean-std-delays.csv')

In [14]:
daily_apt_means_std
# the two columns on the far right are the desired data

,DEST,Unnamed: 0 (mean),Unnamed: 0 (std),ARR_DELAY (mean),ARR_DELAY (std)
0,ATL,2817522.0,1644575.0,2.032871,46.689497
1,BOS,2852316.0,1620843.0,6.618463,45.223511
2,BWI,2859406.0,1634365.0,2.398086,32.635335
3,CLT,2830093.0,1645016.0,0.869615,36.599919
4,DAL,2827228.0,1656722.0,5.478702,31.872158
5,DCA,2823146.0,1642234.0,2.647105,38.880459
6,DEN,2843044.0,1631718.0,1.548991,36.976978
7,DFW,2839313.0,1648060.0,2.257873,48.705293
8,DTW,2826220.0,1631680.0,-0.187743,42.283943
9,EWR,2829218.0,1644005.0,11.43118,56.035474
