# USA Airlines analysis By [BTS](https://transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time) 

### Introduction

The United States Department of Transportation has Flight Stats available through the Bureau of Transportation Statistics.
This data comes from The Bureau of Transportation Statistics and tracks destinations, distance, and delay information of flights across U.S. For this project, I chose the years **1995** through **February 2020** 

### Notes

Throughout this notebook, you'll notice that most operations are single cell, this was because my machine kept running out of usable memory. The dataset was really large and required as much memory as possible for each operation. You will see multiple cells of code that could possibly be more code efficient, but were lengthened to reduce memory errors. You will also notice that after certain blocks of code, I outputted the dataframe to a csv. Some blocks of code took more than an hour to run. So to save my progress and make sure that I would not waste time re-running previous code, I would output my progress to a csv and then read it back in. I did not include these datasets in the repository, but have kept the code in the notebook.

The BTS website did not allow users to download full years of data. I downloaded data from the BTS website month by month from **1995 to February 2020*** and then used pandas to merge the data together.after cleaning each Csv  file then save it back to another csv to reduce memory usage and drop unnessasry columns in this **Gigantic** dataset.

# ETL Pipeline Preparation
Follow the instructions below to help you create your ETL pipeline.
### 1. Import libraries and load datasets.
- Import Python libraries
- Load `2019_1.csv` into a dataframe and inspect the first few lines. this CSV file contains data for all airports for the period 1-1-2019 till 31-1-2019


In [1]:
# import libraries
import pandas as pd
import sqlite3 as slq
import math
import datetime
import time as t

In [2]:
# load january 2019 dataset
df1=pd.read_csv(r"D:\on time data for airlines\2019\2019_1.csv",low_memory=False)

In [3]:
df1.head()

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,...,Div4TailNum,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum,Unnamed: 109
0,2019,1,1,4,5,2019-01-04,OO,20304,OO,N945SW,...,,,,,,,,,,
1,2019,1,1,4,5,2019-01-04,OO,20304,OO,N932SW,...,,,,,,,,,,
2,2019,1,1,4,5,2019-01-04,OO,20304,OO,N932SW,...,,,,,,,,,,
3,2019,1,1,4,5,2019-01-04,OO,20304,OO,N916SW,...,,,,,,,,,,
4,2019,1,1,4,5,2019-01-04,OO,20304,OO,N107SY,...,,,,,,,,,,


In [4]:
df1.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583985 entries, 0 to 583984
Columns: 110 entries, Year to Unnamed: 109
dtypes: float64(69), int64(21), object(20)
memory usage: 490.1+ MB


#### Memory usage_1
Due to high data load this file only which contain just one month data uses **490 MB** so let's munge our data and keep necessary columns only to decrease or eliminate this issue

## Memory size issue

it's clear that after we have run this single month data for 2019 we use about **500 MB** of memory so what about data soan from 2020 to 1995 so about 25 years of data each year has 12 csv file so let's overcome this problem.

### Soluntions

1- Fisrt thing i will drop all unrelevant columns or columns provide unuseful information for my analysis 

2- I will invastigate all columns to find ranges of numerical data then i will downcast it to proper data type

### Step 1 : remove Unnecessary columns

### Un-necessary Data for our analysis

We are not interested in further information about more than one **DIVERTED airport** so i will drop theses columns furthermore these columns contain almot 100% Nan values beacuse amot all flights didn't need diverted airport or at most only one diverted Airport

In [5]:
Diverted_cols = ['Div2Airport','Div2AirportID','Div2AirportSeqID','Div2WheelsOn','Div2TotalGTime',
 'Div2LongestGTime','Div2WheelsOff','Div2TailNum','Div3Airport','Div3AirportID','Div3AirportSeqID',
 'Div3WheelsOn','Div3TotalGTime','Div3LongestGTime','Div3WheelsOff','Div3TailNum','Div4Airport',
 'Div4AirportID','Div4AirportSeqID','Div4WheelsOn','Div4TotalGTime','Div4LongestGTime','Div4WheelsOff',
 'Div4TailNum','Div5Airport','Div5AirportID','Div5AirportSeqID','Div5WheelsOn','Div5TotalGTime',
 'Div5LongestGTime','Div5WheelsOff','Div5TailNum','Unnamed: 109']

In [6]:
df1=df1.drop(Diverted_cols,axis=1)

In [7]:
df1.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583985 entries, 0 to 583984
Columns: 77 entries, Year to Div1TailNum
dtypes: float64(39), int64(21), object(17)
memory usage: 343.1+ MB


#### Memory usage_2
After removing unnecessary data which describe the Second,third,fourth and fifth diverted airport we decreased memory usage by **26.85%** which becomes **343.1 MB**

#### Redundant columns
the follwing columns are repeted or contain information that not necessary or repetead in another form in another columns so i will drop it for memory issues 

`'DOT_ID_Reporting_Airline'`          didn't provide useful info 

`'IATA_CODE_Reporting_Airline'`       i will use **Reporting_Airline** column instead

`'OriginAirportSeqID'`                i will use **OriginAirportID** as this column is unique for each airport over the years

`'OriginStateFips'`                   this column contain fedral identfication number for each airport 

`'OriginState'`                       **OriginStateName** column in more handy

`'OriginWac'`                         origin area code is not helpful for my analysis

`'DestAirportSeqID'`                  i will use **DestAirportID** as this column is unique for each airport over the years

`'DestState'`                         **DestStateName** column in more handy

`'DestStateFips'`                     this column contain fedral identfication number for each airport 

`'DestWac'`                           Destination area code is not helpful for my analysis

In [8]:
# unrelevant columns

unrelevant_col = ['DOT_ID_Reporting_Airline',
                  'IATA_CODE_Reporting_Airline',
                  'OriginAirportSeqID',
                  'OriginStateFips',
                  'OriginState',
                  'OriginWac',
                  'DestAirportSeqID',
                  'DestState',
                  'DestStateFips',
                  'Div1WheelsOff',                     
                  'Div1TailNum',
                  'Div1AirportSeqID',
                  'DestWac']

In [9]:
df1= df1.drop(unrelevant_col,axis=1)

In [10]:
df1.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583985 entries, 0 to 583984
Columns: 64 entries, Year to Div1LongestGTime
dtypes: float64(37), int64(14), object(13)
memory usage: 285.1+ MB


#### Memory usage_3
After removing Redundant columns we decreased memory usage by **17%** which becomes **285.1 MB**

### Step 2: Datatypes Downcasting

In [11]:
cols=list(df1.select_dtypes('int64').columns)
cols

['Year',
 'Quarter',
 'Month',
 'DayofMonth',
 'DayOfWeek',
 'Flight_Number_Reporting_Airline',
 'OriginAirportID',
 'OriginCityMarketID',
 'DestAirportID',
 'DestCityMarketID',
 'CRSDepTime',
 'CRSArrTime',
 'DistanceGroup',
 'DivAirportLandings']

After examining each column and apply pandas.dataframe.describe() method to find statistics about each column its min and max specially to find ranges and which of them fit with **int16** and we find that all of thes columns fit wih **int16** except `'Flight_Number_Reporting_Airline'` 

as we know int16 uses 2 bytes of data while **int64** uses 8 bytes of data by this technique we will save alot of memory

In [12]:
int_downcast=['Year',
 'Quarter',
 'Month',
 'DayofMonth',
 'DayOfWeek',
 'OriginAirportID',
 'DestAirportID',
 'DistanceGroup',
 'DivAirportLandings']


In [13]:
df1[int_downcast]=df1[int_downcast].astype('int16')

In [14]:
df1.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583985 entries, 0 to 583984
Columns: 64 entries, Year to Div1LongestGTime
dtypes: float64(37), int16(9), int64(5), object(13)
memory usage: 255.1+ MB


#### Memory usage_4
After downcasting **int64** columns which can fit with **int16** we decreased memory usage from **285.1 MB** by **14.7%** which becomes **248.4 MB**

In [15]:
cols=list(df1.select_dtypes('float64').columns)
cols

['DepTime',
 'DepDelay',
 'DepDelayMinutes',
 'DepDel15',
 'DepartureDelayGroups',
 'TaxiOut',
 'WheelsOff',
 'WheelsOn',
 'TaxiIn',
 'ArrTime',
 'ArrDelay',
 'ArrDelayMinutes',
 'ArrDel15',
 'ArrivalDelayGroups',
 'Cancelled',
 'Diverted',
 'CRSElapsedTime',
 'ActualElapsedTime',
 'AirTime',
 'Flights',
 'Distance',
 'CarrierDelay',
 'WeatherDelay',
 'NASDelay',
 'SecurityDelay',
 'LateAircraftDelay',
 'FirstDepTime',
 'TotalAddGTime',
 'LongestAddGTime',
 'DivReachedDest',
 'DivActualElapsedTime',
 'DivArrDelay',
 'DivDistance',
 'Div1AirportID',
 'Div1WheelsOn',
 'Div1TotalGTime',
 'Div1LongestGTime']

In [16]:
float_downcast=['Distance',
 'CarrierDelay',
 'WeatherDelay',
 'NASDelay',
 'SecurityDelay',
 'LateAircraftDelay',
 'DivArrDelay',
 'DivDistance','Div1TotalGTime',
 'Div1LongestGTime',
 'Div1AirportID']

##### First 

`'Flights'` can be converted to int8

##### Second
 **`'Distance','CarrierDelay','WeatherDelay','NASDelay'`,**
 
**`'SecurityDelay','LateAircraftDelay','DivArrDelay','DivDistance'`,**
 
 **`'Div1TotalGTime','Div1LongestGTime'`**
 
**can be converted to float16 as the range for theire values between -32768 and 32767**

other columns like 
**`'DepDel15','ArrDel15','Cancelled','Diverted','DivReachedDest'`** are ust an **indicator** 0,1 so i will convert them into **Boolean** in the next step after converting the previous columns into **float16**

In [17]:
df1['Flights']=df1['Flights'].astype('int8')

df1[float_downcast]=df1[float_downcast].astype('float16')

In [18]:
df1.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583985 entries, 0 to 583984
Columns: 64 entries, Year to Div1LongestGTime
dtypes: float16(11), float64(25), int16(9), int64(5), int8(1), object(13)
memory usage: 214.4+ MB


#### Memory usage_4
After downcasting **float64** columns which can fit with **float16** we decreased memory usage from **248.4 MB** by **15%** which becomes **211.1 MB**

In [19]:
bool_downcast=['DepDel15','ArrDel15','Cancelled','Diverted','DivReachedDest']

df1[bool_downcast]=df1[bool_downcast].astype('bool')

In [20]:
df1.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583985 entries, 0 to 583984
Columns: 64 entries, Year to Div1LongestGTime
dtypes: bool(5), float16(11), float64(20), int16(9), int64(5), int8(1), object(13)
memory usage: 194.9+ MB


#### Memory usage_5
After downcasting **some of float64** columns which are just an indicator into **bool** we decreased memory usage from **211.1 MB** by **9.24%** which becomes **191.6 MB**

After that we still have 4 columns their dtypes are **int64** while we can convert them into **int32** to save more memory

In [21]:
int_downcast_32=list(df1.select_dtypes('int64').columns)
int_downcast_32

['Flight_Number_Reporting_Airline',
 'OriginCityMarketID',
 'DestCityMarketID',
 'CRSDepTime',
 'CRSArrTime']

In [22]:
df1[int_downcast_32]=df1[int_downcast_32].astype('int32')

In [23]:
df1.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583985 entries, 0 to 583984
Columns: 64 entries, Year to Div1LongestGTime
dtypes: bool(5), float16(11), float64(20), int16(9), int32(5), int8(1), object(13)
memory usage: 183.8+ MB


In [24]:
cols=list(df1.select_dtypes('float64').columns)
cols

['DepTime',
 'DepDelay',
 'DepDelayMinutes',
 'DepartureDelayGroups',
 'TaxiOut',
 'WheelsOff',
 'WheelsOn',
 'TaxiIn',
 'ArrTime',
 'ArrDelay',
 'ArrDelayMinutes',
 'ArrivalDelayGroups',
 'CRSElapsedTime',
 'ActualElapsedTime',
 'AirTime',
 'FirstDepTime',
 'TotalAddGTime',
 'LongestAddGTime',
 'DivActualElapsedTime',
 'Div1WheelsOn']

In [25]:
categorical_downcast=['DepartureDelayGroups','ArrivalDelayGroups','DistanceGroup']
df1[categorical_downcast]=df1[categorical_downcast].astype('category')


In [26]:
df1.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583985 entries, 0 to 583984
Columns: 64 entries, Year to Div1LongestGTime
dtypes: bool(5), category(3), float16(11), float64(18), int16(8), int32(5), int8(1), object(13)
memory usage: 175.4+ MB


#### Time and date columns in float64 Format

we can observe that each ne of the previous **18** columns describe date or time information and after investigating their range all of them can fit into **float16** format this is a transitional step before converting them into approprite **datetime** format

In [27]:
time_float_downcast=['DepTime',
 'DepDelay',
 'DepDelayMinutes',
 'TaxiOut',
 'WheelsOff',
 'WheelsOn',
 'TaxiIn',
 'ArrTime',
 'ArrDelay',
 'ArrDelayMinutes',
 'CRSElapsedTime',
 'ActualElapsedTime',
 'AirTime',
 'FirstDepTime',
 'TotalAddGTime',
 'LongestAddGTime',
 'DivActualElapsedTime',
 'Div1WheelsOn',
 'CRSArrTime', 'CRSDepTime']

In [28]:
df1[time_float_downcast]=df1[time_float_downcast].astype('float16')

In [29]:
df1.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583985 entries, 0 to 583984
Columns: 64 entries, Year to Div1LongestGTime
dtypes: bool(5), category(3), float16(31), int16(8), int32(3), int8(1), object(13)
memory usage: 113.1+ MB


In [30]:
# hhmm columns in our dataframe
hhmm=['DepTime',
      'CRSArrTime', 
      'CRSDepTime',
      'WheelsOff',
      'WheelsOn','ArrTime']

In [31]:
df1 = df1.dropna(subset=hhmm)

In [32]:
df1.shape

(566924, 64)

In [33]:
df1.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 566924 entries, 0 to 583984
Columns: 64 entries, Year to Div1LongestGTime
dtypes: bool(5), category(3), float16(31), int16(8), int32(3), int8(1), object(13)
memory usage: 114.1+ MB


In [34]:
def std_time_1(col):

    Hours = int(col / 100)
    Minute = round(float(col/ 100) - Hours,2)
    Hours = str(int(Hours)).zfill(2)
    Minute = str(Minute).replace('0.','').zfill(2)
    my_str = Hours+':'+Minute
    
    return my_str
    

In [35]:
for i in hhmm:
    df1[i]=df1[i].apply(lambda x:std_time_1(x))

In [36]:
for i in hhmm:
    df1[i]=df1[i].str.replace('24:00','23:59')

In [37]:
def std_time(string):
    return datetime.datetime.strptime(string, '%H:%M').time()

In [38]:
def std_time_2(string):
    
    return datetime.time.strptime(string)

In [39]:
for i in hhmm:
    df1[i]=df1[i].apply(lambda x:std_time(x))

In [40]:
df1[hhmm].head(15)

Unnamed: 0,DepTime,CRSArrTime,CRSDepTime,WheelsOff,WheelsOn,ArrTime
0,13:53:00,15:01:00,14:00:00,14:02:00,14:39:00,14:44:00
1,09:03:00,11:18:00,09:35:00,09:57:00,11:13:00,11:19:00
2,06:37:00,08:55:00,06:43:00,06:54:00,08:22:00,08:38:00
3,13:14:00,14:33:00,13:35:00,13:37:00,13:57:00,14:04:00
4,08:26:00,10:04:00,08:36:00,08:52:00,09:59:00,10:09:00
5,16:00:00,18:26:00,16:01:00,16:21:00,18:11:00,18:14:00
6,16:14:00,18:56:00,16:15:00,16:43:00,18:55:00,18:59:00
7,10:53:00,12:04:00,10:45:00,11:07:00,12:18:00,12:26:00
8,06:24:00,07:51:00,06:37:00,07:08:00,07:45:00,07:49:00
9,07:59:00,11:07:00,08:03:00,08:18:00,10:53:00,11:08:00


In [50]:
df1.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 566924 entries, 0 to 583984
Columns: 64 entries, Year to Div1LongestGTime
dtypes: bool(5), category(3), float16(25), int16(8), int32(3), int8(1), object(19)
memory usage: 153.5+ MB
