# Algorithmic Machine Learning 🧠
## AML-Challenge_1_Baseline 🧑🏻‍💻

![alt text](https://drive.google.com/uc?export=view&id=1Uxqe7gHt6GTLjZxIXd2WAjgIvPnMkKad)

Professor : Pietro MICHIARDI 👨🏻‍🏫

**Team 7️⃣ :**   
Sourish GHOSH   
Shree hari BOYALLA  
Tanmay CHAKRABORTY  
Utkarsh TREHAN  


## Connecting Google Drive 🔐

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

## Data Acquisition ⬇️

In [None]:
! pip install -q kaggle

In [None]:
from google.colab import files
files.upload()

In [None]:
! mkdir ~/.kaggle

In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Uncomment the following to test if everything's okay by running this command.
# !kaggle datasets list

In [None]:
!kaggle competitions download -c eurecom-aml-2021-challenge-1

In [None]:
! mkdir train

In [None]:
! unzip train_features.csv.zip -d train

In [None]:
! unzip train_targets.csv.zip -d train

In [None]:
! mkdir test

In [None]:
! unzip test_features.csv.zip -d test

## Data Wrangling 🔗

In [None]:
# Import the necessary libraries
import os
from datetime import datetime
from datetime import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc 
from matplotlib.ticker import FuncFormatter
import seaborn as sns
import matplotlib.patches as mpatches

In [None]:
X_dtype = {
    'ID'                   : int,
    'YEAR'                 : int,  
    'MONTH'                : int,  
    'DAY'                  : int,  
    'DAY_OF_WEEK'          : int,  
    'AIRLINE'              : str, 
    'FLIGHT_NUMBER'        : str,  
    'TAIL_NUMBER'          : str, 
    'ORIGIN_AIRPORT'       : str, 
    'DESTINATION_AIRPORT'  : str, 
    'SCHEDULED_DEPARTURE'  : str,  
    'DEPARTURE_TIME'       : str, 
    'DEPARTURE_DELAY'      : float,
    'TAXI_OUT'             : str, 
    'WHEELS_OFF'           : str,
    'SCHEDULED_TIME'       : float,
    'AIR_TIME'             : float,
    'DISTANCE'             : int,
    'SCHEDULED_ARRIVAL'    : str,
    'DIVERTED'             : int,  
    'CANCELLED'            : int,  
    'CANCELLATION_REASON'  : str
}

y_dtype = {
    'ID'                   : int,
    "ARRIVAL_DELAY"        : float
}


In [None]:
X_train_df = pd.read_csv("../input/eurecom-aml-2021-challenge-1/data/train_features.csv", dtype=X_dtype)
y_train_df = pd.read_csv("../input/eurecom-aml-2021-challenge-1/data/train_targets.csv", dtype=y_dtype)

airlines_df = pd.read_csv('../input/eurecom-aml-2021-challenge-1/data/airlines.csv').rename({'AIRLINE': 'AIRLINE_NAME'}, axis='columns')
airports_df = pd.read_csv('../input/eurecom-aml-2021-challenge-1/data/airports.csv').rename({'AIRPORT': 'AIRPORT_NAME'}, axis='columns')

In [None]:
def null_rows(data = None):
    # Index labels of the rows that contain at least one null value
    return data.index[data.isnull().any(axis=1)].tolist()

In [None]:
def data_probing(data = None, n=5):
    print("---------- Head ----------")
    display(data.head(n))
    print("\n---------- Shape ----------")
    print("Number of Rows: {}\nNumber of Columns: {}".format(data.shape[0], data.shape[1]))
    print("\n---------- Null Values ----------")
    print(data.isnull().sum())
    print("\n---------- Rows with null values ----------")
    rows_with_null = [data.iloc[[index]] for index in null_rows(data)]
    for row in rows_with_null:
        display(row)

In [None]:
data_probing(airports_df)

**Note:** airports_df contains NULL values in the "LATITUDE" and "LONGITUDE" columns.

In [None]:
data_probing(airlines_df)

In [None]:
def parse_hhmm(x):
    if x == '2400':
        x = '0000'
    x =  x[:-2] + ':' + x[-2:]
    return x

X_train_df.SCHEDULED_DEPARTURE = X_train_df.SCHEDULED_DEPARTURE.apply(parse_hhmm)
X_train_df.DEPARTURE_TIME = X_train_df.DEPARTURE_TIME.apply(parse_hhmm)
X_train_df.SCHEDULED_ARRIVAL = X_train_df.SCHEDULED_ARRIVAL.apply(parse_hhmm)

In [None]:
X_train_df.SCHEDULED_DEPARTURE = pd.to_datetime(X_train_df['SCHEDULED_DEPARTURE'],format= '%H:%M' ).dt.time
X_train_df.DEPARTURE_TIME = pd.to_datetime(X_train_df['DEPARTURE_TIME'],format= '%H:%M' ).dt.time
X_train_df.SCHEDULED_ARRIVAL = pd.to_datetime(X_train_df['SCHEDULED_ARRIVAL'],format= '%H:%M' ).dt.time
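
The two cells above can be checked on a handful of values before running them on the full dataset. The following is a minimal sketch with made-up `HHMM` strings (the sample values are assumptions, not taken from the data); it repeats `parse_hhmm` so that it runs on its own.

```python
import pandas as pd
from datetime import time

def parse_hhmm(x):
    # Same logic as the notebook cell: '2400' is mapped to midnight
    if x == '2400':
        x = '0000'
    return x[:-2] + ':' + x[-2:]

# Hypothetical sample values in HHMM format
raw = pd.Series(['0005', '1830', '2400'])
parsed = pd.to_datetime(raw.apply(parse_hhmm), format='%H:%M').dt.time
print(parsed.tolist())
```

Note that `parse_hhmm` expects strings, so a missing value in one of these columns would have to be handled before applying it.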

In [None]:
X_train_df.head(5)

In [None]:
# Merge feature dataframe and target dataframe for data exploration
df_train_merged = pd.merge(X_train_df, y_train_df, on='ID')

In [None]:
# Create new columns used for data exploration
df_train_merged['DELAYED'] = (df_train_merged.ARRIVAL_DELAY > 0)
df_train_merged['DATE'] = pd.to_datetime(df_train_merged[['YEAR', 'MONTH', 'DAY']])

**Note:** X_train_df contains NULL values in the "CANCELLATION_REASON" column. Apart from the airports_df columns noted earlier, the other datasets contain no NULL values.

## Data Analysis 📈📉📊

**Basic queries:**  
1. How many unique origin airports?  
2. How many unique destination airports?  
3. How many carriers?  
4. How many flights that have a scheduled departure time later than 18h00?  

**Statistics on flight volume: these statistics are helpful for reasoning about delays. Indeed, it is plausible to assume that "the more flights in an airport, the higher the probability of delay".**

1. How many flights in each month of the year?  
2. Is there any relationship between the number of flights and the days of week?  
3. How many flights in different days of months and in different hours of days?  
4. Which are the top 20 busiest airports (this depends on inbound and outbound traffic)?  
5. Which are the top 20 busiest carriers?  

**Statistics on the fraction of delayed flights**
1. What is the percentage of delayed flights (over total flights) for different hours of the day?  
2. Which hours of the day are characterized by the longest flight delay?
3. How does the percentage of delayed flights fluctuate over different time granularities?  
4. What is the percentage of delayed flights which depart from one of the top 20 busiest airports?  
5. What is the percentage of delayed flights which belong to one of the top 20 busiest carriers?  
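
The question about time granularities can be approached by grouping a boolean "delayed" flag by day, by month, and by weekday, and taking the mean of each group. This is only a sketch on a toy frame with made-up dates; it assumes columns named `DATE` and `DELAYED`, like the ones created during data wrangling.

```python
import pandas as pd

# Toy data (made up): one row per flight
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2015-01-05", "2015-01-05", "2015-01-06", "2015-02-10"]),
    "DELAYED": [True, False, True, False],
})

# The mean of a boolean column is the fraction of True values
by_day = toy.groupby("DATE")["DELAYED"].mean()
by_month = toy.groupby(toy["DATE"].dt.month)["DELAYED"].mean()
by_weekday = toy.groupby(toy["DATE"].dt.dayofweek)["DELAYED"].mean()  # 0 = Monday

print(by_day, by_month, by_weekday, sep="\n\n")
```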

In [None]:
def unique_values(data):
    unique_tuple = data.unique()
    return (unique_tuple, len(unique_tuple))

In [None]:
time_compare = time(18, 0, 0)
print("How many unique origin airports?\nThere are {} origin airports\n".format(unique_values(df_train_merged['ORIGIN_AIRPORT'])[1]))
print("How many unique destination airports?\nThere are {} destination airports\n".format(unique_values(df_train_merged['DESTINATION_AIRPORT'])[1]))
print("How many carriers?\nThere are {} carriers\n".format(unique_values(df_train_merged['AIRLINE'])[1]))
print("How many flights that have a scheduled departure time later than 18h00?\nThere are {} flights later than 18h00\n".format((X_train_df['SCHEDULED_DEPARTURE'] > time_compare).sum()))
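
As a possible shortcut for the basic queries, pandas offers `Series.nunique()`, which returns the number of distinct values directly. The sketch below uses a toy frame with made-up airport codes and times; the strict `>` comparison means a flight scheduled at exactly 18:00 is not counted.

```python
import pandas as pd
from datetime import time

# Toy data (made up) with the same column names as the training set
toy = pd.DataFrame({
    "ORIGIN_AIRPORT": ["AAA", "BBB", "AAA"],
    "SCHEDULED_DEPARTURE": [time(17, 59), time(18, 0), time(21, 15)],
})

n_origins = toy["ORIGIN_AIRPORT"].nunique()
late = (toy["SCHEDULED_DEPARTURE"] > time(18, 0)).sum()
print(n_origins, late)
```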

In [None]:
# How many flights in each month of the year?
plt.rcParams["figure.figsize"] = (15,7)
val = df_train_merged.MONTH.value_counts().sort_index()
x = val.index
y = val.values
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul',
          'Aug', 'Sept', 'Oct', 'Nov', 'Dec']

def millions(x, pos):
    'The two args are the value and tick position'
    return '%1.1fM' % (x*1e-6)

formatter = FuncFormatter(millions)

fig, ax = plt.subplots()

plt.xlabel('Month')
plt.ylabel('Number of flights')
plt.title('Number of flights in a month')
plt.bar(x, y, width=0.35, color=['#239CD3'])
plt.plot(x,y, '-ok')
plt.xticks(x, months)
ax.yaxis.set_major_formatter(formatter)
plt.show()
plt.close()

**Note:** The plot shows the number of flights in each month of the year. June, July and December have high traffic, which is plausible because of the holiday seasons in those months: airlines run more flights during the holidays because the demand for travel is high.

In [None]:
# Is there any relationship between the number of flights and the days of week?
plt.rcParams["figure.figsize"] = (15,7)
val = df_train_merged.DAY_OF_WEEK.value_counts().sort_index()
x = val.index
y = val.values
plt.plot(x,y, '-o')
plt.xlabel('Days of WEEK')
plt.ylabel('Number of Flights')
plt.xticks(x, ['SUNDAY', 'MONDAY', 'TUESDAY', 'WEDNESDAY', 'THURSDAY', 'FRIDAY', 'SATURDAY'])
plt.title('Number of Flights in a Day Of Week')
for a,b in zip(x, y): 
    plt.annotate(str(b), (a,b),ha='right', va='bottom', fontsize=14)
plt.show()

**Note:** Assuming that day 1 is Sunday (the tick labels in the plot rely on this assumption), the number of flights increases from Sunday to Thursday. It then drops on Friday, at the start of the weekend, and rises again on Saturday.
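
Because the reading of this plot depends on which weekday the code 1 stands for, it may be worth checking the encoding against the calendar. The sketch below derives the weekday name from the date and lists it next to `DAY_OF_WEEK`. The toy rows are made up and only illustrate one possible convention; the same steps could be applied to the real frame.

```python
import pandas as pd

# Toy rows (made up): the DAY_OF_WEEK codes here are hypothetical
toy = pd.DataFrame({
    "YEAR": [2015, 2015, 2015],
    "MONTH": [1, 1, 1],
    "DAY": [1, 2, 4],
    "DAY_OF_WEEK": [4, 5, 7],
})

toy["DATE"] = pd.to_datetime(toy[["YEAR", "MONTH", "DAY"]])
toy["DAY_NAME"] = toy["DATE"].dt.day_name()

# One row per (code, weekday name) pair reveals the convention in use
pairs = toy[["DAY_OF_WEEK", "DAY_NAME"]].drop_duplicates().sort_values("DAY_OF_WEEK")
mapping = dict(zip(pairs["DAY_OF_WEEK"].tolist(), pairs["DAY_NAME"].tolist()))
print(mapping)
```

If the mapping on the real data shows that 1 corresponds to Monday, the tick labels and the interpretation above would need to shift by one day.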

In [None]:
# Busiest airports by inbound and outbound traffic
plt.rcParams["figure.figsize"] = (18,10)
# Sum departures and arrivals per airport; fill_value=0 keeps airports that appear on one side only
traffic = df_train_merged.ORIGIN_AIRPORT.value_counts().add(
    df_train_merged.DESTINATION_AIRPORT.value_counts(), fill_value=0)
top20 = traffic.sort_values(ascending=False).head(20)
labels = top20.index
total = top20.values

# Delayed flights (departing or arriving) for the same 20 airports, in the same order
origin_del = df_train_merged.groupby('ORIGIN_AIRPORT').DELAYED.sum()
dest_del = df_train_merged.groupby('DESTINATION_AIRPORT').DELAYED.sum()
total_del = origin_del.add(dest_del, fill_value=0).reindex(labels).values
sns.set_theme(style="whitegrid")
bar1 = sns.barplot(x=labels,  y=total, color='red')

bar2 = sns.barplot(x=labels, y=total_del, color='black')

# add legend
top_bar = mpatches.Patch(color='red', label='Total_flights')
bottom_bar = mpatches.Patch(color='black', label='Delayed_flights')

plt.ylabel('NUMBER OF FLIGHTS', fontsize=18)
plt.xlabel('Airports', fontsize=18)
plt.title('TOTAL FLIGHTS VS DELAYED FLIGHTS PER AIRPORT', fontsize=18)
plt.legend(handles=[top_bar, bottom_bar], fontsize=18)
# show the graph
plt.show()

The chart shows the top 20 busiest airports by traffic (arrivals + departures). The airports with the highest traffic also have the most delayed flights.
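
One detail matters when combining arrival and departure counts: adding two pandas Series with `+` aligns them on their index and yields `NaN` for any airport that is missing from one side. The sketch below, on made-up counts, contrasts this with `Series.add(..., fill_value=0)`.

```python
import pandas as pd

# Made-up counts: BBB only appears as an origin, CCC only as a destination
departures = pd.Series({"AAA": 5, "BBB": 3})
arrivals = pd.Series({"AAA": 4, "CCC": 6})

naive = departures + arrivals  # BBB and CCC become NaN
combined = departures.add(arrivals, fill_value=0).sort_values(ascending=False)

print(naive)
print(combined)
```

Sorting after the addition, and only then taking the first 20 entries, ranks airports by their combined traffic.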

In [None]:
#pd.merge(airlines_df, train_df.AIRLINE.value_counts(), left_on='IATA', right_on=train_df.AIRLINE.index)
counts = pd.DataFrame(df_train_merged.AIRLINE.value_counts()).reset_index()
delayed_counts = pd.DataFrame(df_train_merged.groupby('AIRLINE').DELAYED.sum()).reset_index()
delayed_counts.columns =  ['IATA_CODE', 'DELAYED']

counts.columns = ['IATA_CODE', 'fleet']
counts = counts.merge(delayed_counts, on='IATA_CODE', how='inner')
total = counts.merge(airlines_df, on='IATA_CODE', how='inner')
labels = total.AIRLINE_NAME
t_f = total.fleet
d_f = total.DELAYED
sns.set_theme(style="whitegrid")
bar1 = sns.barplot(x=labels,  y=t_f, color='darkblue')

bar2 = sns.barplot(x=labels, y=d_f, color='lightblue')

# add legend
top_bar = mpatches.Patch(color='darkblue', label='Total_flights')
bottom_bar = mpatches.Patch(color='lightblue', label='Delayed_flights')

plt.xticks(rotation=-80)
plt.ylabel('NUMBER OF FLIGHTS', fontsize=18)
plt.xlabel('Airlines', fontsize=18)
plt.title('TOTAL FLIGHTS VS DELAYED FLIGHTS PER AIRLINE', fontsize=18)
plt.legend(handles=[top_bar, bottom_bar], fontsize=18)
# show the graph
plt.show()

The chart shows the number of flights run by each carrier together with the number of those flights that were delayed. Carriers that run more flights also accumulate more delayed flights, possibly because a larger schedule brings more maintenance delays, scheduling conflicts, and so on.
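
Absolute counts favour large carriers, so a delayed fraction per carrier can complement the chart. The following is a sketch on made-up data, assuming a boolean `DELAYED` column as created earlier; it uses named aggregation to compute the count, the number of delayed flights, and the delayed rate in one pass.

```python
import pandas as pd

# Toy data (made up): airline code and whether the flight was delayed
toy = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "AA", "AA", "BB", "BB"],
    "DELAYED": [True, False, False, False, True, True],
})

summary = toy.groupby("AIRLINE")["DELAYED"].agg(
    flights="count",      # total flights per airline
    delayed="sum",        # number of delayed flights
    delayed_rate="mean",  # fraction of delayed flights
)
print(summary.sort_values("delayed_rate", ascending=False))
```

In this toy example the smaller carrier has the higher delayed rate, which is the kind of effect that counts alone cannot show.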

In [None]:
def parse_hour(x):
    return x.hour

df_train_merged['SCHEDULED_DEPARTURE_HOUR'] = df_train_merged.SCHEDULED_DEPARTURE.apply(parse_hour)

In [None]:
# What is the percentage of delayed flights (over total flights) for different hours of the day?
val = df_train_merged.groupby(df_train_merged.SCHEDULED_DEPARTURE_HOUR)['DELAYED'].agg("mean")
x = val.index
y = val.values
hour_interval = ['00:00-1:00', '01:00-2:00', '02:00-3:00', '03:00-4:00', '04:00-5:00', '05:00-6:00', '06:00-7:00', '07:00-8:00', '08:00-9:00', '09:00-10:00', '10:00-11:00', '11:00-12:00', '12:00-13:00', '13:00-14:00', '14:00-15:00', '15:00-16:00', '16:00-17:00', '17:00-18:00', '18:00-19:00', '19:00-20:00', '20:00-21:00', '21:00-22:00', '22:00-23:00', '23:00-00:00' ]

def percent(x, pos):
    'The two args are the value and tick position'
    return '%1.1f' % (x*100)

formatter = FuncFormatter(percent)

fig, ax = plt.subplots()

plt.xlabel('Interval')
plt.ylabel('Percentage of delayed flights')
plt.title('Percentage of delayed flights by scheduled departure hour')
plt.bar(x, y, width=0.35, color=['#239CD3'])
plt.plot(x,y, '-ok')
# Label only the hours that actually occur in the data
plt.xticks(x, [hour_interval[h] for h in x], rotation=45)
ax.yaxis.set_major_formatter(formatter)
plt.show()
plt.close()

The percentage of delayed flights is highest in the evening hours, as the trend in the graph above shows.

In [None]:
def parse_dep_delay(x):
    return abs(x)

df_train_merged['DEPARTURE_DELAY_ABS'] = df_train_merged.DEPARTURE_DELAY.apply(parse_dep_delay)

In [None]:
# Which hours of the day are characterized by the longest flight delay?
df_train_merged.groupby(df_train_merged.SCHEDULED_DEPARTURE_HOUR)['DEPARTURE_DELAY_ABS'].agg("mean").sort_values()

The hour with the longest delay is 20:00, with an average absolute departure delay of 19.895. It is also one of the busiest hours of the day in terms of flight traffic.
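
Taking the absolute value counts early departures as if they were delays. An alternative is to clip negative values to zero before averaging. The sketch below compares both on made-up numbers, under the assumption that a negative `DEPARTURE_DELAY` means the flight left early.

```python
import pandas as pd

# Toy data (made up); negative values are assumed to be early departures
toy = pd.DataFrame({
    "SCHEDULED_DEPARTURE_HOUR": [6, 6, 20, 20],
    "DEPARTURE_DELAY": [-10.0, -6.0, 30.0, -2.0],
})

toy["DELAY_ABS"] = toy["DEPARTURE_DELAY"].abs()           # early counts as late
toy["DELAY_LATE"] = toy["DEPARTURE_DELAY"].clip(lower=0)  # early counts as zero

by_hour = toy.groupby("SCHEDULED_DEPARTURE_HOUR")[["DELAY_ABS", "DELAY_LATE"]].mean()
print(by_hour)
print("Hour with the longest average late departure:", by_hour["DELAY_LATE"].idxmax())
```

Here the 06:00 flights look delayed under the absolute-value metric although none of them left late.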

## Data Preprocessing


In [None]:
X_dtype = {
    'ID'                   : int,
    'YEAR'                 : int,  
    'MONTH'                : int,  
    'DAY'                  : int,  
    'DAY_OF_WEEK'          : int,  
    'AIRLINE'              : str, 
    'FLIGHT_NUMBER'        : str,  
    'TAIL_NUMBER'          : str, 
    'ORIGIN_AIRPORT'       : str, 
    'DESTINATION_AIRPORT'  : str, 
    'SCHEDULED_DEPARTURE'  : str,  
    'DEPARTURE_TIME'       : str, 
    'DEPARTURE_DELAY'      : float,
    'TAXI_OUT'             : str, 
    'WHEELS_OFF'           : str,
    'SCHEDULED_TIME'       : float,
    'AIR_TIME'             : float,
    'DISTANCE'             : int,
    'SCHEDULED_ARRIVAL'    : str,
    'DIVERTED'             : int,  
    'CANCELLED'            : int,  
    'CANCELLATION_REASON'  : str
}

y_dtype = {
    'ID'                   : int,
    "ARRIVAL_DELAY"        : float
}

X_train_df = pd.read_csv("../input/eurecom-aml-2021-challenge-1/data/train_features.csv", dtype=X_dtype)
y_train_df = pd.read_csv("../input/eurecom-aml-2021-challenge-1/data/train_targets.csv", dtype=y_dtype)

In [None]:
# Dropping certain columns
def drop_col(df, columns, inplace=True):
  df.drop(columns=columns, inplace=inplace)

In [None]:
columns = ['ID','DIVERTED','CANCELLED','CANCELLATION_REASON', 'FLIGHT_NUMBER', 'TAIL_NUMBER', 'WHEELS_OFF']

In [None]:
drop_col(X_train_df, columns)
drop_col(y_train_df, ['ID'])

In [None]:
#label encoding columns with categorical values
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
def encoding(df):
  for col in df.columns:
    if df[col].dtype == 'object':
      df[col] = encoder.fit_transform(df[col])

encoding(X_train_df)
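
The cell above refits a single encoder for every column, so the fitted mappings are not available later, for example to encode the test set consistently. One possible variant keeps one fitted `LabelEncoder` per column. This is a sketch only: `fit_encoders` and `apply_encoders` are hypothetical helper names, and the toy frames are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def fit_encoders(df):
    """Fit one LabelEncoder per non-numeric column and return them by column name."""
    encoders = {}
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            encoders[col] = LabelEncoder().fit(df[col])
    return encoders

def apply_encoders(df, encoders):
    """Return a copy of df with the stored encoders applied to their columns."""
    out = df.copy()
    for col, enc in encoders.items():
        out[col] = enc.transform(out[col])
    return out

# Toy train/test frames (made up)
train = pd.DataFrame({"AIRLINE": ["AA", "BB", "AA"], "DISTANCE": [100, 200, 300]})
test = pd.DataFrame({"AIRLINE": ["BB", "AA"], "DISTANCE": [150, 250]})

encoders = fit_encoders(train)
train_enc = apply_encoders(train, encoders)
test_enc = apply_encoders(test, encoders)
print(test_enc)
```

`LabelEncoder.transform` raises a `ValueError` for labels it did not see during fitting, so categories that occur only in the test set would need separate handling.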

In [None]:
X_train_df.head()

In [None]:
X_train_df.info()

## Model

## Why we chose this model?

We chose to work with a Random Forest for the following reasons:


*   Interpretability is an important aspect of a data science project. Many models behave as black boxes, and there is usually a trade-off between the accuracy and the interpretability of a model, so we looked for a balance between the two. A Random Forest offers one: it aggregates many decision trees and still exposes a feature importance score for each input.
*   A single fully grown decision tree has low bias but high variance. A Random Forest trains each tree on a bootstrap sample of the data, randomizes the features considered at the splits, and averages the trees' predictions (bootstrap aggregation, or bagging). Averaging reduces the overall variance of the model without a large increase in bias, which is how the forest handles the bias-variance trade-off.
*   Normalization of the data is not required, because decision trees split on thresholds of individual features (a rule-based approach).


In this project we predict airline arrival delay, and we want to show which factors the model relies on when it predicts that a flight is delayed. We therefore use the feature importances of the Random Forest to display which features matter most to its predictions.
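
The point about normalization can be illustrated on synthetic data: a decision tree fitted on rescaled features makes the same predictions as one fitted on the raw features, because only the ordering of the values within each feature matters for the splits. This is a sketch with made-up data and an arbitrary scaling factor, not part of the original pipeline.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Rescale all features by a positive factor (a power of two keeps the rescaling exact)
X_scaled = X * 1024.0

tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_scaled, y)

# Both trees partition the training points identically, so predictions match
same = np.allclose(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print("Identical predictions:", same)
```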

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from math import sqrt

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X_train_df, y_train_df, test_size = 0.2, random_state=0)

In [None]:
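# Squared error is the default criterion, so it is left implicit (the 'mse' alias was removed in recent scikit-learn releases)
model = RandomForestRegressor(n_estimators=20, bootstrap=True)
# ravel() passes the single-column target as a 1-D array
model.fit(x_train, y_train.values.ravel())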

In [None]:
Y_pred = model.predict(x_test)

In [None]:
model.score(x_test, y_test)

In [None]:
mean_squared_error(y_test, Y_pred)**0.5
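
An RMSE value is easier to judge next to a naive baseline, such as always predicting the mean delay of the training targets. The sketch below shows the comparison on made-up numbers; `rmse` is a hypothetical helper and none of the values come from the actual model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Made-up values standing in for training targets, test targets and model predictions
y_train_toy = np.array([5.0, 15.0, 4.0])
y_true = np.array([10.0, -5.0, 30.0, 0.0])
y_model = np.array([12.0, -3.0, 25.0, 1.0])

# Baseline: predict the training mean for every test flight
y_baseline = np.full_like(y_true, y_train_toy.mean())

print("model RMSE:   ", rmse(y_true, y_model))
print("baseline RMSE:", rmse(y_true, y_baseline))
```

A model is only useful if its RMSE is clearly below that of such a baseline.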

In [None]:
def plot_feature_importance(importance,names,model_type):

  #Create arrays from feature importance and feature names
  feature_importance = np.array(importance)
  feature_names = np.array(names)

  #Create a DataFrame using a Dictionary
  data={'feature_names':feature_names,'feature_importance':feature_importance}
  fi_df = pd.DataFrame(data)

  #Sort the DataFrame in order decreasing feature importance
  fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)

  #Define size of bar plot
  plt.figure(figsize=(10,8))
  #Plot Seaborn bar chart
  sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
  #Add chart labels
  plt.title(model_type + ' FEATURE IMPORTANCE')
  plt.xlabel('FEATURE IMPORTANCE')
  plt.ylabel('FEATURE NAMES')

## Conclusion

In [None]:
plot_feature_importance(model.feature_importances_,X_train_df.columns,'RANDOM FOREST')

As the plot shows, the most important features are "DEPARTURE_DELAY", "AIR_TIME", "TAXI_OUT", "SCHEDULED_TIME", and "DISTANCE".

The model therefore bases its predictions of the arrival delay, our target, mainly on these features. Feature importance describes what the model relies on rather than the cause of a delay, but it indicates which factors are most informative about why a flight might arrive late.
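
Impurity-based importances are computed on the training data, so a cross-check on held-out data can be useful. One option is scikit-learn's `permutation_importance`, which measures how much the score drops when a feature is shuffled. The sketch below uses synthetic data in which only the first feature carries signal; it is an illustration and not a result from the flight data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only column 0 matters

# Fit on the first 200 rows and evaluate on the remaining 100
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X[:200], y[:200])
result = permutation_importance(model, X[200:], y[200:], n_repeats=5, random_state=0)

# Feature indices ordered from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean)
```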