 # Notebook description

 In this notebook I download the data into the workspace and save them to CSV files, ready to be loaded into the database. Their processing and analysis will be carried out in subsequent notebooks.

 I use a dedicated API service created by the CodersLab school, available at: https://api-datalab.coderslab.com/api/v2. 
 Documentation is also provided and can be read here: [link](https://api-datalab.coderslab.com/v2/docs/).

 > The documentation is purely technical and is meant to present the available endpoints together with their return types. To test an endpoint, click the `Authorize` button, provide a token (available only to CodersLab course participants), then click `Try it out!` and fill in the required fields (the request parameters).

 According to the documentation, 4 endpoints are available:
 - `airport` - data about an airport,
 - `weather` - the weather recorded at an airport on a given day,
 - `aircraft` - data about aircraft,
 - `flight` - data about departures from a given airport per day.

 For requests that require the `airportId` parameter, I use the `airports.csv` file from the `data` folder.
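
 The request URLs used later in this notebook all follow the same pattern: base address, endpoint name, then either a path parameter or a query string. A minimal sketch of that pattern, using only the standard library; `BASE_URL` comes from the address above, while the helper name `build_url` is a hypothetical one introduced for illustration:

```python
from urllib.parse import urlencode

BASE_URL = "https://api-datalab.coderslab.com/api/v2"


def build_url(endpoint, path_param=None, **query):
    """Build a request URL for one endpoint (hypothetical helper)."""
    url = f"{BASE_URL}/{endpoint}"
    if path_param is not None:
        # e.g. airport/11638
        url += f"/{path_param}"
    if query:
        # e.g. ?airportId=10397&date=2019-01
        url += "?" + urlencode(query)
    return url


airport_url = build_url("airport", 11638)
weather_url = build_url("airportWeather", date="2019-01")
flight_url = build_url("flight", airportId=10397, date="2019-01")
```

 Keeping the URL construction in one place makes it harder for the individual download loops to drift apart.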

 # Notebook configuration

 Importing the required libraries

In [18]:
import requests
import pandas as pd
import time
import datetime
from dateutil.relativedelta import relativedelta

 API connection parameters

In [19]:
headers = {'accept': 'application/json', 'authorization': '***'} # token available only to course participants

 Loading the `airports.csv` file to obtain the list of airports (available in the `origin_airport_id` column).

In [20]:
airports = pd.read_csv(r"..\data\airports.csv")

airports_list = airports['origin_airport_id'].to_list()

len(airports_list)

364

 # Fetching `Airport`

 I fetch the data from the `airport` endpoint into a list.

In [21]:
all_API_airports_in_list = []
total_processing_time = 0

for airportId in airports_list:
    start_time = time.time()
    try:
        # the request sits inside try so that connection errors are caught as well
        response = requests.get(f'https://api-datalab.coderslab.com/api/v2/airport/{airportId}', headers=headers)
        if response.status_code == 200:
            data = response.json()
            all_API_airports_in_list.append(data)
            print(f"For airport ID : {airportId} API status code is {response.status_code}")
        else:
            print(f"Failed to fetch data for airport ID : {airportId}, API status code is {response.status_code}")
    except Exception as e:
        print(f"An error occurred while processing airport ID : {airportId}. Error message: {str(e)}")
        continue

    end_time = time.time()
    processing_time = end_time - start_time
    total_processing_time += processing_time

    time.sleep(1)

total_minutes = int(total_processing_time // 60)
total_seconds = int(total_processing_time % 60)

print("--------------------------------------------------------------------------")
print(f"Total processing time: {total_minutes} minutes and {total_seconds} seconds")
print("--------------------------------------------------------------------------")

Failed to fetch data for airport ID : 10874, API status code is 400
Failed to fetch data for airport ID : 11233, API status code is 400
Failed to fetch data for airport ID : 13360, API status code is 400
Failed to fetch data for airport ID : 15008, API status code is 400
For airport ID : 11638 API status code is 200
Failed to fetch data for airport ID : 14150, API status code is 400
Failed to fetch data for airport ID : 15323, API status code is 400
Failed to fetch data for airport ID : 14814, API status code is 400
Failed to fetch data for airport ID : 12007, API status code is 400
Failed to fetch data for airport ID : 11337, API status code is 400
For airport ID : 13342 API status code is 200
Failed to fetch data for airport ID : 15070, API status code is 400
For airport ID : 13244 API status code is 200
Failed to fetch data for airport ID : 12280, API status code is 400
For airport ID : 15096 API status code is 200
Failed to fetch data for airport ID : 11641, API status code is 400
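
 The loop above mixes three concerns: calling the API, sorting responses by status code, and timing. A minimal sketch of how they could be separated so the logic can be tried without network access. The function `fetch_all` and the stand-in `fake_fetch` are hypothetical names, and the sketch assumes a `fetch` callable that returns a `(status_code, payload)` pair:

```python
import time


def fetch_all(ids, fetch, pause=0.0):
    """Call fetch(item_id) for every id and sort the results by outcome."""
    records, failed = [], []
    started = time.perf_counter()
    for item_id in ids:
        try:
            status, payload = fetch(item_id)
        except Exception as exc:
            # Network or decoding problems are recorded instead of stopping the loop.
            failed.append((item_id, repr(exc)))
        else:
            if status == 200:
                records.append(payload)
            else:
                failed.append((item_id, status))
        time.sleep(pause)  # be polite to the API between requests
    return records, failed, time.perf_counter() - started


def fake_fetch(airport_id):
    """Stand-in for the real request: only airport 11638 is 'known'."""
    if airport_id == 11638:
        return 200, {"ORIGIN_AIRPORT_ID": 11638}
    return 400, None


records, failed, elapsed = fetch_all([10874, 11638], fake_fetch)
minutes, seconds = divmod(int(elapsed), 60)
```

 Returning the failed IDs together with their status codes keeps the information that is otherwise only visible in the printed log.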


In [22]:
len(all_API_airports_in_list)

97

In [23]:
all_API_airports_in_list[0]

{'ORIGIN_AIRPORT_ID': 11638,
 'DISPLAY_AIRPORT_NAME': 'Fresno Air Terminal',
 'ORIGIN_CITY_NAME': 'Fresno, CA',
 'NAME': 'FRESNO YOSEMITE INTERNATIONAL, CA US'}

I store the data in the `airport_df` DataFrame: first as a list of single-row DataFrames built with the Pandas `from_records` method, then joined into one DataFrame with the `concat` function. 

In [24]:
df_temp_list = []

for i in range(0, len(all_API_airports_in_list)):
    df_temp = pd.DataFrame.from_records(all_API_airports_in_list[i], index=[0])
    df_temp_list.append(df_temp)

airport_df = pd.concat(df_temp_list, ignore_index=True)
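
A possible shortcut, shown as a sketch rather than as the approach used in this notebook: `pd.DataFrame.from_records` also accepts a whole list of dictionaries, so the per-record frames and the `concat` step can be skipped. The two sample records reuse values that appear in this notebook's outputs; the names `sample_records` and `sample_df` are illustrative only.

```python
import pandas as pd

# Two records in the shape returned by the `airport` endpoint.
sample_records = [
    {
        "ORIGIN_AIRPORT_ID": 11638,
        "DISPLAY_AIRPORT_NAME": "Fresno Air Terminal",
        "ORIGIN_CITY_NAME": "Fresno, CA",
        "NAME": "FRESNO YOSEMITE INTERNATIONAL, CA US",
    },
    {
        "ORIGIN_AIRPORT_ID": 13303,
        "DISPLAY_AIRPORT_NAME": "Miami International",
        "ORIGIN_CITY_NAME": "Miami, FL",
        "NAME": "MIAMI INTERNATIONAL AIRPORT, FL US",
    },
]

# One call builds the whole frame: one row per dictionary, one column per key.
sample_df = pd.DataFrame.from_records(sample_records)
```

The row and column counts are the same as with the loop; the column order may differ, because it can follow the key order of the records instead of being sorted.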

In [25]:
airport_df.shape    

(97, 4)

In [26]:
airport_df.tail()

Unnamed: 0,DISPLAY_AIRPORT_NAME,NAME,ORIGIN_AIRPORT_ID,ORIGIN_CITY_NAME
92,Kansas City International,"KANSAS CITY INTERNATIONAL AIRPORT, MO US",13198,"Kansas City, MO"
93,Austin - Bergstrom International,"AUSTIN BERGSTROM INTERNATIONAL AIRPORT, TX US",10423,"Austin, TX"
94,Tulsa International,"OKLAHOMA CITY WILL ROGERS WORLD AIRPORT, OK US",15370,"Tulsa, OK"
95,Miami International,"MIAMI INTERNATIONAL AIRPORT, FL US",13303,"Miami, FL"
96,Myrtle Beach International,"NORTH MYRTLE BEACH, SC US",10693,"Myrtle Beach, SC"


In [27]:
airport_df.describe()

Unnamed: 0,ORIGIN_AIRPORT_ID
count,97.0
mean,12975.0
std,1584.854679
min,10140.0
25%,11433.0
50%,13204.0
75%,14321.0
max,15919.0


In [28]:
airport_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97 entries, 0 to 96
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   DISPLAY_AIRPORT_NAME  97 non-null     object
 1   NAME                  97 non-null     object
 2   ORIGIN_AIRPORT_ID     97 non-null     int64 
 3   ORIGIN_CITY_NAME      97 non-null     object
dtypes: int64(1), object(3)
memory usage: 3.2+ KB


 ## Check
The code below checks that this part was completed correctly.

In [29]:
airport_df_expected_shape = (97, 4)
assert airport_df_expected_shape == airport_df.shape

 ## Saving to a file
 I save the `airport_df` DataFrame to the `airport_list.csv` file in the `data\raw` directory.

In [30]:
airport_df.to_csv(r"..\data\raw\airport_list.csv")
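
 The raw-string path with backslashes resolves as intended only on Windows. A portable alternative, sketched with the standard `pathlib` module; `RAW_DIR` and `raw_path` are hypothetical names introduced here:

```python
from pathlib import Path

# Same relative location as r"..\data\raw", built from separate path components.
RAW_DIR = Path("..") / "data" / "raw"


def raw_path(file_name):
    """Return the path of a file inside the raw-data directory."""
    return RAW_DIR / file_name


airport_csv = raw_path("airport_list.csv")
```

 `DataFrame.to_csv` accepts such a path object directly, for example `airport_df.to_csv(airport_csv)`. Passing `index=False` as well would leave the row index out of the file, if that column is not wanted downstream.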

 # Fetching `Weather`

 - The data start on `2019-01-01` and end on `2020-03-31`, which gives 15 months.

 The `weather` endpoint has to be queried month by month. I build the list of months using the `datetime` module, `strftime` date formatting, and the `relativedelta` class for incrementing.

In [31]:
year_month_list = []
start_date = datetime.date(2019, 1, 1)
end_date = datetime.date(2020, 4, 1) # exclusive upper bound; alternatively use 2020-03-01 here together with `while i <= end_date:`
i = start_date
while i < end_date:
    year_month_list.append(i.strftime("%Y-%m"))
    i = i + relativedelta(months=1)
print(year_month_list)

['2019-01', '2019-02', '2019-03', '2019-04', '2019-05', '2019-06', '2019-07', '2019-08', '2019-09', '2019-10', '2019-11', '2019-12', '2020-01', '2020-02', '2020-03']
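
 For comparison, the same list can be produced with the standard library alone by stepping through `(year, month)` pairs. This is only a sketch; the helper name `month_range` is hypothetical, and unlike the loop above it treats the end date as inclusive:

```python
import datetime


def month_range(start, end):
    """Return 'YYYY-MM' labels for every month from start to end, inclusive."""
    labels = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        labels.append(f"{year:04d}-{month:02d}")
        # After December, roll over to January of the next year.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return labels


months = month_range(datetime.date(2019, 1, 1), datetime.date(2020, 3, 31))
```

 With the dates given above this yields 15 labels, from `2019-01` to `2020-03`.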


 I fetch the data from the `weather` endpoint into a list.

In [32]:
all_API_airport_weather_in_list = []
total_processing_time = 0

for year_month in year_month_list:
    start_time = time.time()
    try:
        # the request sits inside try so that connection errors are caught as well
        response = requests.get(f'https://api-datalab.coderslab.com/api/v2/airportWeather?date={year_month}', headers=headers)
        if response.status_code == 200:
            data = response.json()
            all_API_airport_weather_in_list.append(data)
            print(f"For year_month : {year_month} API status code is {response.status_code}")
        else:
            print(f"Failed to fetch data for year_month : {year_month}, API status code is {response.status_code}")
    except Exception as e:
        print(f"An error occurred while processing year_month : {year_month}. Error message: {str(e)}")
        continue
    
    end_time = time.time()
    processing_time = end_time - start_time
    total_processing_time += processing_time

    time.sleep(1)

total_minutes = int(total_processing_time // 60)
total_seconds = int(total_processing_time % 60)

print("--------------------------------------------------------------------------")
print(f"Total processing time: {total_minutes} minutes and {total_seconds} seconds")
print("--------------------------------------------------------------------------")

For year_month : 2019-01 API status code is 200
For year_month : 2019-02 API status code is 200
For year_month : 2019-03 API status code is 200
For year_month : 2019-04 API status code is 200
For year_month : 2019-05 API status code is 200
For year_month : 2019-06 API status code is 200
For year_month : 2019-07 API status code is 200
For year_month : 2019-08 API status code is 200
For year_month : 2019-09 API status code is 200
For year_month : 2019-10 API status code is 200
For year_month : 2019-11 API status code is 200
For year_month : 2019-12 API status code is 200
For year_month : 2020-01 API status code is 200
For year_month : 2020-02 API status code is 200
For year_month : 2020-03 API status code is 200
--------------------------------------------------------------------------
Total processing time: 0 minutes and 16 seconds
--------------------------------------------------------------------------


In [33]:
len(all_API_airport_weather_in_list)

15

In [34]:
len(all_API_airport_weather_in_list[0])

3286

In [35]:
all_API_airport_weather_in_list[0][0] # the list holds lists of dictionaries: one monthly response contains records for every day, so one more level of indexing is needed

{'WT18': None,
 'STATION': 'USW00013874',
 'NAME': 'ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPORT, GA US',
 'DATE': '2019-01-01',
 'AWND': 4.7,
 'PRCP': 0.14,
 'SNOW': 0,
 'SNWD': 0,
 'TAVG': 64,
 'TMAX': 66,
 'TMIN': 57,
 'WDF2': 310,
 'WDF5': 310,
 'WSF2': 15,
 'WSF5': 19,
 'WT01': 1}

I store the data in the `airport_weather_df` DataFrame: first as a list of single-row DataFrames built with the Pandas `from_records` method, then joined into one DataFrame with the `concat` function. 

In [36]:
df_temp_list = []

for i in range(0, len(all_API_airport_weather_in_list)):
    for j in range(0, len(all_API_airport_weather_in_list[i])):
        df_temp = pd.DataFrame.from_records(all_API_airport_weather_in_list[i][j], index=[0])
        df_temp_list.append(df_temp)

airport_weather_df = pd.concat(df_temp_list, ignore_index=True)
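
The first weather record shown above has 16 keys, while the combined frame has 33 columns. That is consistent with records carrying different sets of keys, with the frame's columns being their union. A small standard-library sketch of the flattening step and of that union; the nested list below is a hypothetical stand-in with mostly made-up values, not data from the API:

```python
from itertools import chain

# Stand-in for all_API_airport_weather_in_list:
# one inner list per month, one dictionary per record.
monthly_responses = [
    [{"STATION": "USW00013874", "DATE": "2019-01-01", "TMAX": 66}],
    [
        {"STATION": "USW00013874", "DATE": "2019-02-01", "TMAX": 70, "PGTM": 146},
        {"STATION": "USW00014762", "DATE": "2019-02-01", "TMIN": 49},
    ],
]

# Flatten month -> records into a single flat list of records.
flat_records = list(chain.from_iterable(monthly_responses))

# The union of keys over all records is what ends up as the column set.
all_columns = sorted(set().union(*(record.keys() for record in flat_records)))
```

A flat list like `flat_records` could also be passed to `pd.DataFrame.from_records` in a single call, which avoids creating one temporary frame per record.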

In [37]:
airport_weather_df.shape

(46226, 33)

In [38]:
airport_weather_df.tail()

Unnamed: 0,AWND,DATE,NAME,PRCP,SNOW,SNWD,STATION,TAVG,TMAX,TMIN,...,PGTM,WT10,WESD,SN32,SX32,PSUN,TSUN,TOBS,WT07,WT11
46221,3.58,2020-03-27,"PITTSBURGH ALLEGHENY CO AIRPORT, PA US",0.21,,,USW00014762,,59.0,49.0,...,146.0,,,,,,,,,
46222,6.93,2020-03-28,"PITTSBURGH ALLEGHENY CO AIRPORT, PA US",1.29,,,USW00014762,,77.0,51.0,...,1535.0,,,,,,,,,
46223,16.55,2020-03-29,"PITTSBURGH ALLEGHENY CO AIRPORT, PA US",0.02,,,USW00014762,,78.0,57.0,...,1408.0,,,,,,,,,
46224,13.42,2020-03-30,"PITTSBURGH ALLEGHENY CO AIRPORT, PA US",0.0,,,USW00014762,,57.0,42.0,...,817.0,,,,,,,,,
46225,3.8,2020-03-31,"PITTSBURGH ALLEGHENY CO AIRPORT, PA US",0.06,,,USW00014762,,47.0,39.0,...,110.0,,,,,,,,,


In [39]:
airport_weather_df.describe()

Unnamed: 0,AWND,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,WDF2,WDF5,WSF2,...,PGTM,WT10,WESD,SN32,SX32,PSUN,TSUN,TOBS,WT07,WT11
count,45845.0,46197.0,32338.0,31750.0,34625.0,46203.0,46200.0,45854.0,45704.0,45854.0,...,4484.0,5.0,7.0,453.0,454.0,430.0,429.0,355.0,28.0,1.0
mean,8.041885,0.110614,0.055353,0.364189,57.708823,67.765015,48.80658,200.300083,200.136662,18.69385,...,1308.886262,1.0,0.0,64.600442,71.339207,35.074419,261.335664,61.512676,1.0,1.0
std,3.751485,0.338897,0.462628,1.766932,18.773347,19.753157,18.852169,101.702095,101.493734,6.490189,...,580.823669,0.0,0.0,11.916067,12.245805,35.946983,275.225989,13.300811,0.0,
min,0.0,0.0,0.0,0.0,-22.0,-14.0,-39.0,10.0,3.0,1.1,...,0.0,1.0,0.0,44.0,50.0,0.0,0.0,30.0,1.0,1.0
25%,5.37,0.0,0.0,0.0,44.0,53.0,35.0,120.0,120.0,14.1,...,1047.0,1.0,0.0,54.0,60.0,0.0,1.0,51.0,1.0,1.0
50%,7.38,0.0,0.0,0.0,59.0,70.0,50.0,200.0,210.0,17.0,...,1400.0,1.0,0.0,62.0,69.5,21.5,154.0,65.0,1.0,1.0
75%,10.07,0.04,0.0,0.0,74.0,84.0,65.0,290.0,290.0,21.9,...,1637.25,1.0,0.0,78.0,84.0,69.0,508.0,73.0,1.0,1.0
max,33.78,11.63,17.0,25.0,103.0,120.0,93.0,360.0,360.0,62.0,...,2359.0,1.0,0.0,83.0,94.0,100.0,908.0,79.0,1.0,1.0


In [40]:
airport_weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46226 entries, 0 to 46225
Data columns (total 33 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   AWND     45845 non-null  float64
 1   DATE     46226 non-null  object 
 2   NAME     46226 non-null  object 
 3   PRCP     46197 non-null  float64
 4   SNOW     32338 non-null  float64
 5   SNWD     31750 non-null  float64
 6   STATION  46226 non-null  object 
 7   TAVG     34625 non-null  float64
 8   TMAX     46203 non-null  float64
 9   TMIN     46200 non-null  float64
 10  WDF2     45854 non-null  float64
 11  WDF5     45704 non-null  float64
 12  WSF2     45854 non-null  float64
 13  WSF5     45704 non-null  float64
 14  WT01     16798 non-null  float64
 15  WT18     0 non-null      object 
 16  WT08     5589 non-null   float64
 17  WT02     2268 non-null   float64
 18  WT03     5085 non-null   float64
 19  WT04     362 non-null    float64
 20  WT09     316 non-null    float64
 21  WT06     522

 ## Check
 The code below checks that this part was completed correctly.

In [41]:
airport_weather_df_expected_shape = (46226, 33)
assert airport_weather_df_expected_shape == airport_weather_df.shape

 ## Saving to a file
 I save the `airport_weather_df` DataFrame to the `airport_weather.csv` file in the `data\raw` directory.

In [42]:
airport_weather_df.to_csv(r"..\data\raw\airport_weather.csv")

 # Fetching `Aircraft`


 I fetch the data from the `aircraft` endpoint straight into a DataFrame using the `from_records` method.

In [43]:
start_time = time.time()

response = requests.get('https://api-datalab.coderslab.com/api/v2/aircraft', headers=headers)
data = response.json()
aircraft_df = pd.DataFrame.from_records(data)

end_time = time.time()
processing_time = end_time - start_time

total_minutes = int(processing_time // 60)
total_seconds = int(processing_time % 60)

print("--------------------------------------------------------------------------")
print(f"Total processing time: {total_minutes} minutes and {total_seconds} seconds")
print("--------------------------------------------------------------------------")

--------------------------------------------------------------------------
Total processing time: 0 minutes and 1 seconds
--------------------------------------------------------------------------


In [44]:
aircraft_df.shape

(7383, 3)

In [45]:
aircraft_df.tail()

Unnamed: 0,MANUFACTURE_YEAR,TAIL_NUM,NUMBER_OF_SEATS
7378,2019,N14011,337.0
7379,2019,N16008,337.0
7380,2019,N16009,337.0
7381,2019,N2250U,276.0
7382,2019,N2749U,276.0


In [46]:
aircraft_df.describe()

Unnamed: 0,MANUFACTURE_YEAR,NUMBER_OF_SEATS
count,7383.0,7376.0
mean,2005.135717,118.390591
std,9.617305,77.714492
min,1944.0,0.0
25%,1999.0,66.0
50%,2005.0,143.0
75%,2014.0,175.0
max,2019.0,524.0


In [47]:
aircraft_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7383 entries, 0 to 7382
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   MANUFACTURE_YEAR  7383 non-null   int64  
 1   TAIL_NUM          7383 non-null   object 
 2   NUMBER_OF_SEATS   7376 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 173.2+ KB


 ## Check
 The code below checks that this part was completed correctly.

In [48]:
aircraft_df_expected_shape = (7383, 3)
assert aircraft_df_expected_shape == aircraft_df.shape

 ## Saving to a file
 I save the `aircraft_df` DataFrame to the `aircraft.csv` file in the `data\raw` directory.

In [49]:
aircraft_df.to_csv(r"..\data\raw\aircraft.csv")

 # Fetching `Flight`

 * I fetch the data from the `flight` endpoint into a list.
 
 While working with this API I verified that if data exist for a given airport, they are available for every month in the range. 
 
 > To automate the code, the first loop runs a preliminary check for every airport against one sample month. 
 > If the API responds with 200, a second loop is started for each month separately.

 > If the situation were different or uncertain, the code would be changed to a nested loop over every airport and every month.
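
 The probe-then-fetch idea can be sketched without network access as follows. The names `fetch_flights` and `fake_fetch` are hypothetical, and the sketch assumes a `fetch` callable that returns a `(status_code, payload)` pair. One deliberate difference from a plain two-step loop: the payload of the probe request is reused, so the sample month is not requested twice.

```python
def fetch_flights(airport_ids, months, fetch):
    """Probe each airport with the first month; fetch the rest only on a 200."""
    collected, skipped = [], {}
    for airport_id in airport_ids:
        status, payload = fetch(airport_id, months[0])
        if status != 200:
            # e.g. 204 No Content: skip all remaining months for this airport
            skipped[airport_id] = status
            continue
        collected.append(payload)  # reuse the probe response
        for month in months[1:]:
            status, payload = fetch(airport_id, month)
            if status == 200:
                collected.append(payload)
    return collected, skipped


calls = []


def fake_fetch(airport_id, month):
    """Stand-in for the real request: only airport 10397 has data."""
    calls.append((airport_id, month))
    if airport_id == 10397:
        return 200, [{"ORIGIN_AIRPORT_ID": airport_id, "MONTH": month}]
    return 204, None


collected, skipped = fetch_flights(
    [10874, 10397], ["2019-01", "2019-02", "2019-03"], fake_fetch
)
```

 With two airports and three months, this makes four requests in total: one probe for the airport without data and three for the airport with data.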

In [50]:
all_API_flights_in_list = []
total_processing_time = 0

for airportId in airports_list:
    start_time = time.time() 
    try:
        # the probe request sits inside try so that connection errors are caught as well
        response_init = requests.get(f'https://api-datalab.coderslab.com/api/v2/flight?airportId={airportId}&date=2019-01', headers=headers)
        if response_init.status_code == 200:
            for year_month in year_month_list:
                response = requests.get(f'https://api-datalab.coderslab.com/api/v2/flight?airportId={airportId}&date={year_month}', headers=headers)
                data = response.json()
                all_API_flights_in_list.append(data)
                print(f"For airport ID {airportId} and year_month : {year_month} API status code is {response.status_code}")
                time.sleep(6)
        elif response_init.status_code == 204:
            print(f"No content for airport ID {airportId}, API status code is {response_init.status_code}")
        else:
            print(f"For airport ID {airportId}, API status code is {response_init.status_code}")
    except Exception as e:
        print(f"An error occurred while processing airport ID {airportId} and year_month : {year_month}. Error message: {str(e)}")
        continue

    end_time = time.time()
    processing_time = end_time - start_time
    total_processing_time += processing_time

    time.sleep(6)

total_minutes = int(total_processing_time // 60)
total_seconds = int(total_processing_time % 60)

print("--------------------------------------------------------------------------")
print(f"Total processing time: {total_minutes} minutes and {total_seconds} seconds")
print("--------------------------------------------------------------------------") # the whole cell took 126 min 11 s

No content for airport ID 10874, API status code is 204
No content for airport ID 11233, API status code is 204
No content for airport ID 13360, API status code is 204
No content for airport ID 15008, API status code is 204
No content for airport ID 11638, API status code is 204
No content for airport ID 14150, API status code is 204
No content for airport ID 15323, API status code is 204
No content for airport ID 14814, API status code is 204
No content for airport ID 12007, API status code is 204
No content for airport ID 11337, API status code is 204
No content for airport ID 13342, API status code is 204
No content for airport ID 15070, API status code is 204
No content for airport ID 13244, API status code is 204
No content for airport ID 12280, API status code is 204
No content for airport ID 15096, API status code is 204
No content for airport ID 11641, API status code is 204
No content for airport ID 13832, API status code is 204
No content for airport ID 10268, API status code

In [51]:
len(all_API_flights_in_list)

585
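
A quick sanity check on this number: every airport that passes the preliminary check contributes one response per month, so the list length should be a multiple of 15. The arithmetic below assumes that each monthly request for those airports succeeded and was appended:

```python
responses = 585  # len(all_API_flights_in_list) reported above
months = 15      # len(year_month_list)

# Whole airports covered, plus any leftover responses that would signal a gap.
airports_with_data, remainder = divmod(responses, months)
```

A remainder of zero with a quotient of 39 suggests that flight data were returned for 39 of the 364 airport IDs in `airports_list`.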

In [52]:
len(all_API_flights_in_list[0])

10159

In [53]:
all_API_flights_in_list[0][0] # the list holds lists of dictionaries: one monthly response contains records for every day, so one more level of indexing is needed

{'MONTH': 1,
 'DAY_OF_MONTH': 20,
 'DAY_OF_WEEK': 7,
 'OP_UNIQUE_CARRIER': 'WN',
 'TAIL_NUM': 'N204WN',
 'OP_CARRIER_FL_NUM': 682,
 'ORIGIN_AIRPORT_ID': 10397,
 'DEST_AIRPORT_ID': 11292,
 'CRS_DEP_TIME': 605,
 'DEP_TIME': 602,
 'DEP_DELAY_NEW': 0,
 'DEP_TIME_BLK': '0600-0659',
 'CRS_ARR_TIME': 730,
 'ARR_TIME': 726,
 'ARR_DELAY_NEW': 0,
 'ARR_TIME_BLK': '0700-0759',
 'CANCELLED': 0,
 'CRS_ELAPSED_TIME': 205,
 'ACTUAL_ELAPSED_TIME': 204,
 'DISTANCE': 1199,
 'DISTANCE_GROUP': 5,
 'YEAR': 2019}

I store the data in the `flight_df` DataFrame: first as a list of single-row DataFrames built with the Pandas `from_records` method, then joined into one DataFrame with the `concat` function. 

In [54]:
df_temp_list = []

for i in range(0, len(all_API_flights_in_list)):
    for j in range(0, len(all_API_flights_in_list[i])):
        df_temp = pd.DataFrame.from_records(all_API_flights_in_list[i][j], index=[0])
        df_temp_list.append(df_temp)

flight_df = pd.concat(df_temp_list, ignore_index=True)

In [55]:
flight_df.shape

(1386120, 27)

In [56]:
flight_df.tail()

Unnamed: 0,ACTUAL_ELAPSED_TIME,ARR_DELAY_NEW,ARR_TIME,ARR_TIME_BLK,CANCELLED,CRS_ARR_TIME,CRS_DEP_TIME,CRS_ELAPSED_TIME,DAY_OF_MONTH,DAY_OF_WEEK,...,OP_CARRIER_FL_NUM,OP_UNIQUE_CARRIER,ORIGIN_AIRPORT_ID,TAIL_NUM,YEAR,CARRIER_DELAY,LATE_AIRCRAFT_DELAY,NAS_DELAY,SECURITY_DELAY,WEATHER_DELAY
1386115,169.0,0.0,2233.0,2200-2259,0,2259,1956,183,26,4,...,1982,DL,13303,N350DN,2020,,,,,
1386116,109.0,0.0,1306.0,1300-1359,0,1321,1120,121,26,4,...,1987,DL,13303,N908DE,2020,,,,,
1386117,,,,2000-2059,1,2022,1817,125,26,4,...,1998,DL,13303,,2020,,,,,
1386118,107.0,0.0,2115.0,2100-2159,0,2140,1937,123,26,4,...,2025,DL,13303,N352NW,2020,,,,,
1386119,,,,1000-1059,1,1003,700,183,26,4,...,2151,DL,13303,N352DN,2020,,,,,


In [57]:
flight_df.describe()

Unnamed: 0,ACTUAL_ELAPSED_TIME,ARR_DELAY_NEW,ARR_TIME,CANCELLED,CRS_ARR_TIME,CRS_DEP_TIME,CRS_ELAPSED_TIME,DAY_OF_MONTH,DAY_OF_WEEK,DEP_DELAY_NEW,...,DISTANCE_GROUP,MONTH,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,YEAR,CARRIER_DELAY,LATE_AIRCRAFT_DELAY,NAS_DELAY,SECURITY_DELAY,WEATHER_DELAY
count,1344638.0,1344638.0,1346730.0,1386120.0,1386120.0,1386120.0,1386120.0,1386120.0,1386120.0,1347711.0,...,1386120.0,1386120.0,1386120.0,1386120.0,1386120.0,257318.0,257318.0,257318.0,257318.0,257318.0
mean,133.5066,13.89713,1454.546,0.02825369,1478.206,1334.447,139.0266,15.70936,3.903999,13.64892,...,3.4202,5.673192,1774.918,12627.2,2019.195,18.330004,25.693189,20.68811,0.074515,2.824019
std,74.71373,44.67795,551.0973,0.1656968,529.6749,494.5813,74.57335,8.747288,1.987461,44.69326,...,2.388941,3.563204,1455.752,1484.031,0.3962006,56.595923,51.710956,42.651874,3.527666,26.57426
min,25.0,0.0,1.0,0.0,1.0,1.0,32.0,1.0,1.0,0.0,...,1.0,1.0,1.0,10299.0,2019.0,0.0,0.0,0.0,0.0,0.0
25%,82.0,0.0,1041.0,0.0,1055.0,915.0,90.0,8.0,2.0,0.0,...,2.0,2.0,628.0,11292.0,2019.0,0.0,0.0,0.0,0.0,0.0
50%,114.0,0.0,1459.0,0.0,1510.0,1329.0,120.0,16.0,4.0,0.0,...,3.0,5.0,1494.0,12892.0,2019.0,0.0,0.0,6.0,0.0,0.0
75%,159.0,7.0,1917.0,0.0,1922.0,1739.0,163.0,23.0,6.0,6.0,...,4.0,9.0,2380.0,13930.0,2019.0,16.0,31.0,24.0,0.0,0.0
max,538.0,2560.0,2400.0,1.0,2400.0,2359.0,727.0,31.0,7.0,2579.0,...,11.0,12.0,7881.0,15304.0,2020.0,2560.0,1438.0,1567.0,1078.0,1239.0


In [58]:
flight_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1386120 entries, 0 to 1386119
Data columns (total 27 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   ACTUAL_ELAPSED_TIME  1344638 non-null  float64
 1   ARR_DELAY_NEW        1344638 non-null  float64
 2   ARR_TIME             1346730 non-null  float64
 3   ARR_TIME_BLK         1386120 non-null  object 
 4   CANCELLED            1386120 non-null  int64  
 5   CRS_ARR_TIME         1386120 non-null  int64  
 6   CRS_DEP_TIME         1386120 non-null  int64  
 7   CRS_ELAPSED_TIME     1386120 non-null  int64  
 8   DAY_OF_MONTH         1386120 non-null  int64  
 9   DAY_OF_WEEK          1386120 non-null  int64  
 10  DEP_DELAY_NEW        1347711 non-null  float64
 11  DEP_TIME             1347712 non-null  float64
 12  DEP_TIME_BLK         1386120 non-null  object 
 13  DEST_AIRPORT_ID      1386120 non-null  int64  
 14  DISTANCE             1386120 non-null  int64  
 15

 ## Check
 The code below checks that this part was completed correctly.

In [59]:
flight_df_expected_shape = (1386120, 27)
assert flight_df_expected_shape == flight_df.shape

 ## Saving to a file
 I save the `flight_df` DataFrame to the `flight.csv` file in the `data\raw` directory.

In [60]:
flight_df.to_csv(r"..\data\raw\flight.csv")

 # Summary
 In this notebook I obtained the data, which are now ready for further work. In the next notebook I will create the database together with its structure.