 # Notebook description
 This notebook continues the analysis of flight data and flight delays. From this point on, I combine datasets to carry out additional analyses.
 First, I check whether the aircraft's year of manufacture affects delays.


 Importing the required libraries

In [38]:
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.engine import URL
import plotly.express as px

 ## Connecting to the database
 Connection configuration

In [39]:
username = 'postgres'
password = '****'

host = 'localhost'
database = 'airlines'
port= '5432'
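 The connection settings above are hard-coded. As a minimal sketch, they could instead be read from environment variables. The `AIRLINES_DB_*` variable names and the `load_db_config` helper are hypothetical and are not part of the original notebook.

```python
import os


def load_db_config(env=os.environ):
    """Read connection settings from a mapping, falling back to the notebook's values.

    The AIRLINES_DB_* names are hypothetical; adjust them to your own setup.
    """
    return {
        "username": env.get("AIRLINES_DB_USER", "postgres"),
        "password": env.get("AIRLINES_DB_PASSWORD", ""),
        "host": env.get("AIRLINES_DB_HOST", "localhost"),
        "port": env.get("AIRLINES_DB_PORT", "5432"),
        "database": env.get("AIRLINES_DB_NAME", "airlines"),
    }


# Demonstration with an explicit mapping instead of the real environment
config = load_db_config({"AIRLINES_DB_PASSWORD": "secret"})
```

 This keeps the password out of the notebook file itself, while the remaining keys fall back to the values used in this notebook.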

 I create the `url` and `engine` variables

In [40]:
url = URL.create(
    "postgresql",
    username=username,
    password=password,
    host=host,
    port=port,
    database=database,
)
engine = create_engine(url)

 Implementation of the `read_sql_table` function, which takes one argument:
 * `table_name` - the name of the table in the database.

In [41]:
def read_sql_table(table_name):
    df = pd.read_sql(f"SELECT * FROM {table_name}", engine)
    return df
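 To try the same pattern without a running PostgreSQL server, the sketch below uses an in-memory SQLite database from the standard library. The `read_sql_table_demo` name, the explicit `connection` argument, and the two toy rows are assumptions made for illustration only.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the real `airlines` database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE aircraft "
    "(id INTEGER, manufacture_year INTEGER, tail_num TEXT, number_of_seats INTEGER)"
)
conn.executemany(
    "INSERT INTO aircraft VALUES (?, ?, ?, ?)",
    [(1, 1999, "N0001", 50), (2, 2005, "N0002", 70)],  # toy rows
)
conn.commit()


def read_sql_table_demo(table_name, connection):
    # The table name is interpolated into the query, so pass trusted names only
    return pd.read_sql(f"SELECT * FROM {table_name}", connection)


demo_df = read_sql_table_demo("aircraft", conn)
```

 Because the table name is inserted into the SQL text with an f-string, the function should only ever receive names that come from the notebook itself, never from user input.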

 I load the previously saved `flight_df` data frame into a variable of the same name

In [42]:
flight_df = pd.read_csv(r"..\data\processed\flight_df_01.csv")
flight_df.shape

(1057391, 31)

In [43]:
flight_df.columns

Index(['id', 'month', 'day_of_month', 'day_of_week', 'op_unique_carrier',
       'tail_num', 'op_carrier_fl_num', 'origin_airport_id', 'dest_airport_id',
       'crs_dep_time', 'dep_time', 'dep_delay', 'dep_time_blk', 'crs_arr_time',
       'arr_time', 'arr_delay_new', 'arr_time_blk', 'cancelled',
       'crs_elapsed_time', 'actual_elapsed_time', 'distance', 'distance_group',
       'carrier_delay', 'weather_delay', 'nas_delay', 'security_delay',
       'late_aircraft_delay', 'year', 'is_delayed', 'is_weekend',
       'distance_agg'],
      dtype='object')

In [44]:
flight_df['tail_num'].nunique()

5416

 ## Check
 The code below verifies that this part was completed correctly, i.e. that the `flight_df` frame contains the expected number of rows

In [45]:
flight_df_expected_rows_amount = 1057391
flight_df_rows_amount = flight_df.shape[0]

assert flight_df_rows_amount == flight_df_expected_rows_amount, f'Expected {flight_df_expected_rows_amount} rows, got {flight_df_rows_amount}'

 # Enriching with `aircraft`

 I load the `aircraft` table using the `read_sql_table` function

In [46]:
aircraft_df = read_sql_table('aircraft')

In [47]:
aircraft_df.shape

(7383, 4)

In [48]:
aircraft_df.columns

Index(['id', 'manufacture_year', 'tail_num', 'number_of_seats'], dtype='object')

 I remove the unneeded columns `number_of_seats` and `id`, and then drop duplicate rows from the `aircraft_df` frame

In [49]:
aircraft_df = aircraft_df.drop(['id', 'number_of_seats'], axis=1)
aircraft_df.columns

Index(['manufacture_year', 'tail_num'], dtype='object')

In [50]:
aircraft_df.drop_duplicates(inplace=True)
aircraft_df.shape

(7364, 2)
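 The row count drops only after the `id` column is removed, because a unique `id` makes every row distinct. A minimal sketch on toy data (the values are made up for illustration):

```python
import pandas as pd

toy_aircraft = pd.DataFrame({
    "id": [1, 2, 3],
    "manufacture_year": [1999, 1999, 2000],
    "tail_num": ["N0001", "N0001", "N0001"],
    "number_of_seats": [50, 50, 50],
})

# With the unique `id` column present, no row counts as a duplicate
duplicates_with_id = int(toy_aircraft.duplicated().sum())

# After dropping `id` and `number_of_seats`, identical rows collapse into one
slim = toy_aircraft.drop(columns=["id", "number_of_seats"]).drop_duplicates()
```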

 ### Check
 The code below verifies that this part was completed correctly

In [54]:
aircraft_df_expected_rows = 7364
aircraft_df_expected_columns = set(['tail_num', 'manufacture_year'])

aircraft_df_rows = aircraft_df.shape[0]

diff = aircraft_df_expected_columns.symmetric_difference(set(aircraft_df.columns))
assert aircraft_df_rows == aircraft_df_expected_rows, f'Expected {aircraft_df_expected_rows} rows, got {aircraft_df_rows} rows'

assert diff == set([]), f'Expected columns {aircraft_df_expected_columns}, got: {set(aircraft_df.columns)}. Difference: \n\t{diff}'

> I check whether any `tail_num` has more than one year of manufacture. If so:  
> * for a duplicated `tail_num`, the most recent year of manufacture is taken as the aircraft's manufacture date

In [51]:
year_count = aircraft_df.groupby('tail_num')['manufacture_year'].count().reset_index()
year_count[year_count['manufacture_year'] > 1]

Unnamed: 0,tail_num,manufacture_year
4860,N783CA,2
5713,N856GT,2
6028,N877AS,2
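 The `count()` above equals the number of distinct years only because exact duplicate rows were dropped beforehand. As a sketch that does not depend on that earlier step, `nunique()` counts distinct years directly (toy data, made up for illustration):

```python
import pandas as pd

toy_aircraft = pd.DataFrame({
    "tail_num": ["N0001", "N0001", "N0002", "N0003", "N0003"],
    "manufacture_year": [1999, 2000, 2005, 2011, 2011],
})

# Number of distinct manufacture years per tail number
years_per_tail = toy_aircraft.groupby("tail_num")["manufacture_year"].nunique()

# Only tail numbers with conflicting years remain
conflicting = years_per_tail[years_per_tail > 1]
```

 Here `N0003` appears twice with the same year, so it is not reported as a conflict, whereas `count()` would flag it.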


In [52]:
aircraft_df_duplicated = aircraft_df.duplicated(subset='tail_num', keep=False)
aircraft_df_duplicated = aircraft_df.loc[aircraft_df_duplicated]
aircraft_df_duplicated

Unnamed: 0,manufacture_year,tail_num
1734,1999,N783CA
2086,2000,N783CA
2460,2001,N877AS
4917,2011,N856GT
5725,2014,N856GT
6746,2017,N877AS


In [53]:
aircraft_df_duplicated = aircraft_df.duplicated(subset='tail_num', keep='first')
aircraft_df_duplicated = aircraft_df.loc[aircraft_df_duplicated]
aircraft_df_duplicated

Unnamed: 0,manufacture_year,tail_num
2086,2000,N783CA
5725,2014,N856GT
6746,2017,N877AS


In [55]:
aircraft_df_is_duplicated = aircraft_df.duplicated(subset='tail_num')
aircraft_df_duplicated = aircraft_df.loc[aircraft_df_is_duplicated]

 ### Check
 The code below verifies that this part was completed correctly

In [56]:
aircraft_df_expected_rows = 3
aircraft_df_duplicated_rows = aircraft_df_duplicated.shape[0]
assert aircraft_df_duplicated_rows == aircraft_df_expected_rows, f"Expected {aircraft_df_expected_rows} rows, got {aircraft_df_duplicated_rows}"

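 ## Modifying `aircraft_df`
 I update `aircraft_df` so that, for each duplicated `tail_num`, only the row with the highest `manufacture_year` is kept. Using `keep='last'` achieves this only because, in this table, the later row of each duplicated pair carries the newer year (see the listing of duplicated rows above).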

In [57]:
to_exclude = aircraft_df.duplicated(subset='tail_num', keep='last')
aircraft_df = aircraft_df[~to_exclude]
aircraft_df[aircraft_df['tail_num'].isin(['N783CA', 'N856GT', 'N877AS'])]

Unnamed: 0,manufacture_year,tail_num
2086,2000,N783CA
5725,2014,N856GT
6746,2017,N877AS
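 Keeping the last occurrence picks the newest year only when the rows happen to be ordered that way. A minimal sketch that does not rely on row order sorts by year first. The toy frame deliberately lists the newer `N877AS` row first, and `N0001` is a made-up tail number.

```python
import pandas as pd

toy_aircraft = pd.DataFrame({
    "manufacture_year": [2017, 1999, 2001, 2000, 2010],
    "tail_num": ["N877AS", "N783CA", "N877AS", "N783CA", "N0001"],
})

# Sort so that the newest year of each tail number comes last, then keep it
deduplicated = (
    toy_aircraft
    .sort_values(["tail_num", "manufacture_year"])
    .drop_duplicates(subset="tail_num", keep="last")
    .reset_index(drop=True)
)

year_by_tail = dict(zip(deduplicated["tail_num"], deduplicated["manufacture_year"]))
```

 An equivalent alternative is `groupby('tail_num')['manufacture_year'].max()`, which returns one row per tail number without any sorting step.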


In [58]:
aircraft_df.shape

(7361, 2)

In [59]:
aircraft_df['tail_num'].nunique()

7361

 ### Check
 The code below verifies that this part was completed correctly

In [60]:
test_tail = 'N783CA'
test_value = aircraft_df.loc[aircraft_df['tail_num']
                             == test_tail]['manufacture_year']
# Take the single matching element explicitly; int() on a one-element Series is deprecated
test_value = int(test_value.iloc[0])

expected_value = 2000
assert test_value == expected_value, f"For 'tail_num' == '{test_tail}' expected {expected_value}, got {test_value}"



 ## Joining `aircraft_df` and `flight_df`

 I join the `flight_df` frame with `aircraft_df` and store the result in `tmp_flight_df`

In [61]:
tmp_flight_df = flight_df.merge(aircraft_df, how='left', on='tail_num')
tmp_flight_df.shape

(1057391, 32)
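 The row count stays the same because each `tail_num` is now unique in `aircraft_df`. As a sketch, pandas can enforce this assumption at merge time with `validate`, and `indicator=True` shows how many flights found no matching aircraft. The toy frames below are made up for illustration.

```python
import pandas as pd

toy_flights = pd.DataFrame({"id": [1, 2, 3], "tail_num": ["N0001", "N0002", "N0009"]})
toy_aircraft = pd.DataFrame({"tail_num": ["N0001", "N0002"], "manufacture_year": [2005, 2016]})

merged = toy_flights.merge(
    toy_aircraft,
    how="left",
    on="tail_num",
    validate="many_to_one",  # raises MergeError if a tail_num repeats on the right side
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Flights whose tail number is missing from the aircraft table
unmatched = int((merged["_merge"] == "left_only").sum())
```

 Unmatched flights keep a missing `manufacture_year`, which is why that column becomes floating-point after a left join.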

 I check that no duplicates have appeared

In [62]:
duplicated = tmp_flight_df.duplicated()
duplicated = tmp_flight_df.loc[duplicated]
duplicated

Unnamed: 0,id,month,day_of_month,day_of_week,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,dest_airport_id,crs_dep_time,...,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,year,is_delayed,is_weekend,distance_agg,manufacture_year


 I overwrite `flight_df` with `tmp_flight_df`

In [63]:
flight_df = tmp_flight_df.copy()

 ## Delays vs. aircraft year of manufacture
 I compute how the share of delayed flights depends on the year of manufacture

In [64]:
delays_by_manufacture_year_df = flight_df.groupby('manufacture_year')['is_delayed'].agg(['mean', 'count']).reset_index()
delays_by_manufacture_year_df

Unnamed: 0,manufacture_year,mean,count
0,1987.0,0.126411,443
1,1988.0,0.181388,634
2,1989.0,0.153846,13
3,1990.0,0.155453,4786
4,1991.0,0.181795,7701
5,1992.0,0.176055,13882
6,1993.0,0.187882,5562
7,1994.0,0.232518,5191
8,1995.0,0.194024,7731
9,1996.0,0.190382,12186


 I create a scatter plot based on the `delays_by_manufacture_year_df` frame

In [65]:
fig = px.scatter(delays_by_manufacture_year_df, 
                 x='manufacture_year', 
                 y='mean', 
                 size='count',
                 size_max=40, 
                 title='Frequency of delays by manufacture year',
                 labels={'manufacture_year': 'Manufacture year', 'mean': 'Average delay frequency'},
                 height=600, 
                 width=1200,
                 )

fig.update_traces(marker=dict(color='#5995ED'), marker_line_width=0.7, marker_line_color='#304C89',
                  hovertemplate="Manufacture year: <b>%{x}</b> <br>Average delay frequency (%): <b>%{y:.2%}</b> <br>Total flights per year: <b>%{marker.size}</b>",
                  )
fig.update_layout(yaxis=dict(tickformat=".2%"),
                  hoverlabel=dict(bgcolor="white"),
                  hoverlabel_align='left',
                  )
fig.show()

 I modify the plot so that it shows only those manufacture years whose aircraft flew more than 10,000 flights

In [66]:
df_over_10k_flights = delays_by_manufacture_year_df[delays_by_manufacture_year_df['count'] > 10000]
df_over_10k_flights

Unnamed: 0,manufacture_year,mean,count
5,1992.0,0.176055,13882
9,1996.0,0.190382,12186
11,1998.0,0.178461,40317
12,1999.0,0.198373,44018
13,2000.0,0.187512,58215
14,2001.0,0.154113,100251
15,2002.0,0.21261,35845
16,2003.0,0.187752,21081
17,2004.0,0.18139,43266
18,2005.0,0.200692,41621


In [67]:
fig = px.scatter(df_over_10k_flights, 
                 x='manufacture_year', 
                 y='mean', 
                 size='count',
                 size_max=40, 
                 title='Frequency of delays by manufacture year',
                 labels={'manufacture_year': 'Manufacture year', 'mean': 'Average delay frequency'},
                 height=600, 
                 width=1200,
                 )

fig.update_traces(marker=dict(color='#5995ED'), marker_line_width=0.7, marker_line_color='#304C89',
                  hovertemplate="Manufacture year: <b>%{x}</b> <br>Average delay frequency (%): <b>%{y:.2%}</b> <br>Total flights per year: <b>%{marker.size}</b>",
                  )
fig.update_layout(yaxis=dict(tickformat=".2%"),
                  hoverlabel=dict(bgcolor="white"),
                  hoverlabel_align='left',
                  )
fig.show()

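 I add the `manufacture_year_agg` column to the `flight_df` frame, grouping the data into 3-year bins. `pd.cut` builds right-closed intervals, so the first bin `(1987, 1990]` does not contain 1987 itself, and any year above the last bin edge is left without a bin.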

In [68]:
min_year = int(flight_df['manufacture_year'].min())
max_year = int(flight_df['manufacture_year'].max())

flight_df['manufacture_year_agg'] = pd.cut(flight_df['manufacture_year'], bins=range(min_year, max_year, 3))
flight_df.head()

Unnamed: 0,id,month,day_of_month,day_of_week,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,dest_airport_id,crs_dep_time,...,weather_delay,nas_delay,security_delay,late_aircraft_delay,year,is_delayed,is_weekend,distance_agg,manufacture_year,manufacture_year_agg
0,1,1,20,7,WN,N204WN,682,10397,11292,605,...,,,,,2019,0,1,"(1100, 1200]",2005.0,"(2002, 2005]"
1,2,1,20,7,WN,N8682B,2622,10397,11292,2120,...,,,,,2019,0,1,"(1100, 1200]",2016.0,"(2014, 2017]"
2,3,1,20,7,WN,N717SA,2939,10397,11292,1800,...,0.0,10.0,0.0,3.0,2019,0,1,"(1100, 1200]",1998.0,"(1996, 1999]"
3,4,1,20,7,WN,N709SW,3848,10397,11292,1355,...,,,,,2019,0,1,"(1100, 1200]",1998.0,"(1996, 1999]"
4,5,1,20,7,WN,N7864B,1352,10397,11697,1125,...,,,,,2019,0,1,"(500, 600]",2001.0,"(1999, 2002]"
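 The sketch below illustrates how the bin edges decide which years get a bin. It reuses the 1987 minimum seen above and assumes a maximum of 2019 purely for illustration. The alternative edges are one possible way to cover every year, not the approach used in this notebook.

```python
import pandas as pd

years = pd.Series([1987, 1988, 1990, 1991, 2017, 2019])
min_year, max_year = 1987, 2019  # max_year is an assumed value for this demo

# Edges built as in the notebook: 1987, 1990, ..., 2017
notebook_edges = list(range(min_year, max_year, 3))
binned = pd.cut(years, bins=notebook_edges)

# Alternative edges: start one year earlier and extend past the maximum,
# so that both the first and the last year fall inside a right-closed bin
covering_edges = list(range(min_year - 1, max_year + 3, 3))
binned_all = pd.cut(years, bins=covering_edges)
```

 With the notebook-style edges, 1987 and 2019 receive no bin, while the alternative edges assign every year to an interval.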


 I aggregate the data into the `flight_delays_by_manufacture_year_agg_df` variable

In [69]:
flight_delays_by_manufacture_year_agg_df = flight_df.groupby('manufacture_year_agg')['is_delayed'].agg(['mean', 'count']).reset_index()
flight_delays_by_manufacture_year_agg_df





Unnamed: 0,manufacture_year_agg,mean,count
0,"(1987, 1990]",0.158476,5433
1,"(1990, 1993]",0.180107,27145
2,"(1993, 1996]",0.200215,25108
3,"(1996, 1999]",0.190678,93456
4,"(1999, 2002]",0.17491,194311
5,"(2002, 2005]",0.190237,105968
6,"(2005, 2008]",0.214158,131314
7,"(2008, 2011]",0.21412,70339
8,"(2011, 2014]",0.186348,134399
9,"(2014, 2017]",0.187688,186970


In [70]:
flight_delays_by_manufacture_year_agg_df['manufacture_year_agg'] = flight_delays_by_manufacture_year_agg_df['manufacture_year_agg'].astype(str)

 I create a plot based on the data in `flight_delays_by_manufacture_year_agg_df`

In [71]:
fig = px.scatter(flight_delays_by_manufacture_year_agg_df, 
                 x='manufacture_year_agg', 
                 y='mean', 
                 size='count',
                 size_max=50, 
                 title='Frequency of delays by manufacture year',
                 labels={'manufacture_year_agg': 'Manufacture year range', 'mean': 'Average delay frequency'},
                 height=600, 
                 width=1200,
                 )

fig.update_traces(marker=dict(color='#5995ED'), marker_line_width=0.7, marker_line_color = '#304C89',
                  hovertemplate="Manufacture year range: <b>%{x}</b> <br>Average delay frequency (%): <b>%{y:.2%}</b> <br>Total flights per year range: <b>%{marker.size}</b>", 
                  ) 

# The first bin is (1987, 1990], so it covers the years 1988-1990
ranges = ['1988-1990', '1991-1993', '1994-1996', '1997-1999', '2000-2002', '2003-2005', '2006-2008', '2009-2011', '2012-2014', '2015-2017']
fig.update_layout(yaxis=dict(tickformat=".2%"),
                  hoverlabel=dict(bgcolor="white"),
                  hoverlabel_align='auto',
                  xaxis=dict(tickvals=list(range(10)),
                             ticktext=ranges),
                  )

fig.show()
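 The tick labels for the year ranges can also be derived from the bin edges instead of being typed by hand, which keeps them consistent with the right-closed intervals. A minimal sketch, assuming the same edges as in the aggregated table above (1987, 1990, ..., 2017):

```python
import pandas as pd

edges = list(range(1987, 2018, 3))  # 1987, 1990, ..., 2017
intervals = pd.IntervalIndex.from_breaks(edges, closed="right")

# A right-closed interval (left, right] covers the whole years left+1 .. right
labels = [f"{interval.left + 1}-{interval.right}" for interval in intervals]
```

 The resulting list can be passed as `ticktext` in `fig.update_layout`.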

I determine the top 5 manufacture years, sorted by the number of flights flown.

In [72]:
top_manufactured_df = delays_by_manufacture_year_df.sort_values(by='count', ascending=False).head(5)
top_manufactured_df

Unnamed: 0,manufacture_year,mean,count
14,2001.0,0.154113,100251
29,2016.0,0.186717,66191
30,2017.0,0.187289,62353
27,2014.0,0.178658,61128
28,2015.0,0.189214,58426
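 As a side note, `DataFrame.nlargest` expresses the same "sort descending and take the first rows" step in a single call. The sketch below uses a few of the counts shown above.

```python
import pandas as pd

toy_counts = pd.DataFrame({
    "manufacture_year": [1989.0, 2001.0, 2014.0, 2015.0, 2016.0, 2017.0],
    "count": [13, 100251, 61128, 58426, 66191, 62353],
})

top5_sorted = toy_counts.sort_values(by="count", ascending=False).head(5)
top5_nlargest = toy_counts.nlargest(5, "count")
```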


 ## Saving the frame to a CSV file
 I save the data from the `flight_df` frame to a CSV file in the `data\processed` directory

In [73]:
flight_df.to_csv(r"..\data\processed\flight_df_02.csv", index=False)
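 The path above uses Windows-style backslashes. As a sketch, `pathlib` builds the same kind of path in a way that also works on other systems. The example writes to a temporary directory, and the `flight_df_demo.csv` file name is made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy_flights = pd.DataFrame({"id": [1, 2], "manufacture_year": [2005.0, None]})
toy_flights["manufacture_year_agg"] = pd.cut(toy_flights["manufacture_year"], bins=[2002, 2005])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data" / "processed" / "flight_df_demo.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    toy_flights.to_csv(out_path, index=False)
    restored = pd.read_csv(out_path)

# After the round trip, the interval column comes back as plain text
first_bin = restored["manufacture_year_agg"].iloc[0]
```

 Interval columns such as `manufacture_year_agg` are stored as text in a CSV file, so a notebook that reads the file back gets strings rather than intervals.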

In [74]:
flight_df.shape

(1057391, 33)

 # Summary
 In this notebook, I joined the `aircraft_df` table to the base `flight_df` data frame and used it to add another dimension to the analysis. In the next notebook, I will enrich the frame with weather data and airport data.