## 1. Ingest

- Import the data with Pandas using the [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method. Display the first 10 rows of the dataset.

In [None]:
## import dependencies
import pandas as pd

In [None]:
url = 'https://topcs.blob.core.windows.net/public/FlightData.csv'

## load data from csv to dataframe
df = pd.read_csv(url)
df.head(n=10)

Another solution:

Download the CSV file:
```bash
!curl https://topcs.blob.core.windows.net/public/FlightData.csv -o flightdata.csv
```

Load the data from the CSV file into a DataFrame:
```
df = pd.read_csv('flightdata.csv')
df.head(n=10)
```

## 2. Process

- Analyze the data with pandas tools to determine the data types available in the dataset.
- Determine whether there are null values and how they affect our use case.
- Drop or fill the null values according to that analysis.
- Apply "data [binning](https://en.wikipedia.org/wiki/Data_binning)" and convert categorical variables to indicator variables to prepare the data before training the model.
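As a minimal sketch of the null-handling steps listed above, the following uses a small made-up DataFrame (not the flight dataset). The column `EMPTY` and the toy values are assumptions chosen only to show the pattern: count nulls, drop a column that is entirely null, and fill the remaining gaps.

```python
import numpy as np
import pandas as pd

# Toy data (not from FlightData.csv): one missing label and one all-null column.
toy = pd.DataFrame({
    'CRS_DEP_TIME': [905, 1430, 2359],
    'ARR_DEL15': [0.0, np.nan, 1.0],
    'EMPTY': [np.nan, np.nan, np.nan],
})

null_counts = toy.isnull().sum()        # number of nulls in each column
toy = toy.dropna(axis=1, how='all')     # drop columns that are entirely null
toy = toy.fillna({'ARR_DEL15': 1})      # treat a missing label as "late"
remaining_nulls = int(toy.isnull().sum().sum())

print(null_counts.to_dict())
print(remaining_nulls)
```

Whether to drop or fill depends on the use case; here the all-null column carries no information, while the missing label is given a deliberate value.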

In [None]:
# summary statistics for the numeric columns
df.describe()

In [None]:
# dimensionality of the dataframe.
df.shape

In [None]:
# get data type of each column.
df.dtypes

In [None]:
# check whether there is any null value
if df.isnull().values.any():
    print('Null values exist')
else:
    print('No null values exist')

In [None]:
# check which values are null
df.isnull().sum()

In [None]:
# select data of last column 'Unnamed: 25' that is completely null
df.iloc[:,-1]
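One common cause of an all-null `Unnamed: N` column is a trailing comma at the end of every line in the CSV file. Whether that is the case for `FlightData.csv` is an assumption; the sketch below reproduces the effect with made-up CSV text and removes every column that is completely null.

```python
import io

import pandas as pd

# Made-up CSV text: every line ends with a trailing comma.
csv_text = "MONTH,ORIGIN,\n1,JFK,\n2,ATL,\n"
toy = pd.read_csv(io.StringIO(csv_text))

columns_before = list(toy.columns)
print(columns_before)

# Select the columns in which every value is null, then drop them.
all_null = toy.columns[toy.isnull().all()]
toy = toy.drop(columns=all_null)
print(list(toy.columns))
```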

In [None]:
# remove data of last column
df = df.iloc[:,:-1]

Another solution:
```
df = df.drop(columns=['Unnamed: 25'])
```

In [None]:
# inspect the output and confirm that column 26 ('Unnamed: 25') has disappeared from the DataFrame
df.isnull().sum()

In [None]:
# keep only the columns the model needs, then confirm that the number of nulls is greatly reduced
df = df[['MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'ORIGIN', 'DEST', 'CRS_DEP_TIME', 'ARR_DEL15']]
df.isnull().sum()

In [None]:
# review data with null values
df[df.isnull().values.any(axis=1)].head()

In [None]:
# flights that were canceled or diverted are going to be treated as late
df = df.fillna({'ARR_DEL15': 1})
df.iloc[177:185]

### Bin departure times and add indicator columns

In [None]:
df.head()

Apply data [binning](https://en.wikipedia.org/wiki/Data_binning)

In [None]:
# integer-divide the HHMM departure times by 100 to bin them into hours (e.g. 1545 -> 15)
df['CRS_DEP_TIME'] = df['CRS_DEP_TIME'] // 100
df.head()
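To make the binning step concrete: scheduled departure times are stored as HHMM integers, so flooring the value divided by 100 leaves only the hour. A small sketch on made-up values (the Series below is illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical scheduled departure times in HHMM form.
times = pd.Series([5, 905, 1430, 2359], name='CRS_DEP_TIME')

# Floor division by 100 keeps only the hour, giving 24 possible bins (0-23).
hours = times // 100
print(hours.tolist())  # [0, 9, 14, 23]
```

Binning reduces 1,440 possible minute values to 24 hour buckets, which gives the model more examples per distinct value.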

Apply pandas [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to origin and destination airports

In [None]:
df = pd.get_dummies(df, columns=['ORIGIN', 'DEST'])
df.head() 
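The effect of `get_dummies` is easier to see on a tiny frame. In this sketch (toy rows, not the real data) each airport code becomes its own indicator column, named with the original column name as a prefix:

```python
import pandas as pd

# Toy rows with two categorical columns.
toy = pd.DataFrame({
    'MONTH': [1, 2, 3],
    'ORIGIN': ['JFK', 'ATL', 'JFK'],
    'DEST': ['SEA', 'SEA', 'ATL'],
})

# Each distinct value becomes a column; non-encoded columns stay first.
encoded = pd.get_dummies(toy, columns=['ORIGIN', 'DEST'])
print(list(encoded.columns))
# ['MONTH', 'ORIGIN_ATL', 'ORIGIN_JFK', 'DEST_ATL', 'DEST_SEA']
```

Any input passed to the trained model later needs the same column names in the same order.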

## 3. Predict

Use the processed data to train a model that can 'predict' the probability that a flight arrives on time.
- Split the dataset into training data and test data.
- Use [Scikit-learn](https://scikit-learn.org/stable/index.html) to train the model.
- Validate the accuracy of the trained model.

In [None]:
# split the DataFrame: 80% for training and 20% for testing the model
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(df.drop('ARR_DEL15', axis=1), df['ARR_DEL15'], test_size=0.2, random_state=42)
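As a quick check of what `test_size=0.2` means, the sketch below splits ten synthetic rows (made-up arrays, not the flight data) and confirms the 80/20 partition. Fixing `random_state` makes the shuffle reproducible between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic samples with two features each, and alternating labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```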

In [None]:
# number of rows and columns in the DataFrame containing the feature columns used for training
train_x.shape

In [None]:
# number of rows and columns in the DataFrame containing the feature columns used for testing
test_x.shape

Use the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) algorithm to train the model.

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=13)
model.fit(train_x, train_y)

In [None]:
# testing model
predicted = model.predict(test_x)
model.score(test_x, test_y)

In [None]:
# generate a set of prediction probabilities from the test data
from sklearn.metrics import roc_auc_score
probabilities = model.predict_proba(test_x)
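`predict_proba` returns one column per class, ordered as in `classes_`. The following sketch trains a small forest on synthetic data (the features and the labeling rule are invented for illustration) to show that, with labels 0 and 1, column 0 is the probability of class 0 and column 1 the probability of class 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic features; the label depends only on the first column.
X = np.column_stack([np.linspace(0, 1, 40), rng.random(40), rng.random(40)])
y = (X[:, 0] > 0.5).astype(float)

clf = RandomForestClassifier(n_estimators=10, random_state=13).fit(X, y)
proba = clf.predict_proba(X[:5])

print(clf.classes_)   # column order of predict_proba
print(proba.shape)    # one row per sample, one column per class
```

This is why `probabilities[:, 1]` is the score for "delayed" when the label is `ARR_DEL15`.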

In [None]:
# generate an ROC AUC score from the probabilities using Scikit-learn's roc_auc_score method
roc_auc_score(test_y, probabilities[:, 1])
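One way to read the ROC AUC score: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below checks this on four made-up predictions; the lists are illustrative only.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = roc_auc_score(y_true, scores)

# Manual check: compare every positive score with every negative score.
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
pairs = list(product(pos, neg))
ranked_correctly = sum(p > n for p, n in pairs) / len(pairs)

print(auc_value, ranked_correctly)  # 0.75 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.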

In [None]:
# generate a confusion matrix ("error matrix")
from sklearn.metrics import confusion_matrix
confusion_matrix(test_y, predicted)

In [None]:
# compute precision; note that this uses the training set, so the value is optimistic compared with held-out data
from sklearn.metrics import precision_score

train_predictions = model.predict(train_x)
precision_score(train_y, train_predictions)

In [None]:
# compute recall on the same training-set predictions
from sklearn.metrics import recall_score

recall_score(train_y, train_predictions)
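Precision and recall both follow from the confusion matrix: precision is TP / (TP + FP) and recall is TP / (TP + FN). A small worked sketch on made-up labels:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

precision = tp / (tp + fp)   # of the flights flagged late, how many were late
recall = tp / (tp + fn)      # of the late flights, how many were flagged

print(tn, fp, fn, tp)        # 3 1 1 3
print(precision, recall)     # 0.75 0.75
```

The manual values agree with `precision_score(y_true, y_pred)` and `recall_score(y_true, y_pred)`.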

## 4. Visualize

- Use [Matplotlib](https://matplotlib.org/) to visualize the results.
- Create a function that returns the probability of a flight delay for specific days and times, as well as specific origins and destinations.
- Plot the false positive rate against the true positive rate (ROC curve).
- Plot the probability of delays for specific days.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, _ = roc_curve(test_y, probabilities[:, 1])
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], color='grey', lw=1, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
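The curve is built from (false positive rate, true positive rate) points, one per decision threshold. As a sketch without plotting, the same call on four made-up predictions shows the points that would be drawn and the area under them:

```python
from sklearn.metrics import auc, roc_curve

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

toy_fpr, toy_tpr, _ = roc_curve(y_true, scores)

# Each pair is one point on the ROC curve, starting at (0, 0) and ending at (1, 1).
for f, t in zip(toy_fpr, toy_tpr):
    print(f, t)

area = auc(toy_fpr, toy_tpr)  # trapezoidal area under the points
print(area)
```

The dashed diagonal in the plot is the curve a random classifier would produce; the further the curve bows above it, the better the ranking.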

In [None]:
def predict_delay(departure_date_time, origin, destination):
    """Return the predicted probability that the flight arrives on time.

    Despite the name, the value is the probability of class 0 (ARR_DEL15 == 0).
    The date/time string must use the format 'dd/mm/yyyy HH:MM:SS'.
    """
    from datetime import datetime

    try:
        departure_date_time_parsed = datetime.strptime(departure_date_time, '%d/%m/%Y %H:%M:%S')
    except ValueError as e:
        return 'Error parsing date/time - {}'.format(e)

    month = departure_date_time_parsed.month
    day = departure_date_time_parsed.day
    day_of_week = departure_date_time_parsed.isoweekday()
    hour = departure_date_time_parsed.hour

    origin = origin.upper()
    destination = destination.upper()

    # keys must match the training columns (names and order) produced in section 2
    features = [{'MONTH': month,
                 'DAY_OF_MONTH': day,
                 'DAY_OF_WEEK': day_of_week,
                 'CRS_DEP_TIME': hour,
                 'ORIGIN_ATL': 1 if origin == 'ATL' else 0,
                 'ORIGIN_DTW': 1 if origin == 'DTW' else 0,
                 'ORIGIN_JFK': 1 if origin == 'JFK' else 0,
                 'ORIGIN_MSP': 1 if origin == 'MSP' else 0,
                 'ORIGIN_SEA': 1 if origin == 'SEA' else 0,
                 'DEST_ATL': 1 if destination == 'ATL' else 0,
                 'DEST_DTW': 1 if destination == 'DTW' else 0,
                 'DEST_JFK': 1 if destination == 'JFK' else 0,
                 'DEST_MSP': 1 if destination == 'MSP' else 0,
                 'DEST_SEA': 1 if destination == 'SEA' else 0}]

    # column 0 of predict_proba is class 0 (on time)
    return model.predict_proba(pd.DataFrame(features))[0][0]
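The date handling in the function can be tried on its own. This sketch parses one of the date strings used in the calls that follow and extracts the same four fields. It assumes that the dataset's `DAY_OF_WEEK` column uses 1 for Monday through 7 for Sunday, which is what `isoweekday()` returns.

```python
from datetime import datetime

# Day comes first in this format: 1 October 2018, 21:45.
parsed = datetime.strptime('1/10/2018 21:45:00', '%d/%m/%Y %H:%M:%S')

month = parsed.month
day = parsed.day
day_of_week = parsed.isoweekday()   # 1 = Monday ... 7 = Sunday
hour = parsed.hour                  # matches the hourly bins of CRS_DEP_TIME

print(month, day, day_of_week, hour)  # 10 1 1 21
```

Note that a US-style string such as '10/1/2018' would be read as 10 January, so the day-first order matters.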

In [None]:
predict_delay('1/10/2018 21:45:00', 'JFK', 'ATL')

In [None]:
predict_delay('2/10/2018 21:45:00', 'JFK', 'ATL')

In [None]:
predict_delay('2/10/2018 10:00:00', 'ATL', 'SEA')

### Plot predictions

In [None]:
# plot the probability of on-time arrivals for an evening flight from JFK to ATL over a range of days
import numpy as np

labels = ('Oct 1', 'Oct 2', 'Oct 3', 'Oct 4', 'Oct 5', 'Oct 6', 'Oct 7')
values = (predict_delay('1/10/2018 21:45:00', 'JFK', 'ATL'),
          predict_delay('2/10/2018 21:45:00', 'JFK', 'ATL'),
          predict_delay('3/10/2018 21:45:00', 'JFK', 'ATL'),
          predict_delay('4/10/2018 21:45:00', 'JFK', 'ATL'),
          predict_delay('5/10/2018 21:45:00', 'JFK', 'ATL'),
          predict_delay('6/10/2018 21:45:00', 'JFK', 'ATL'),
          predict_delay('7/10/2018 21:45:00', 'JFK', 'ATL'))
alabels = np.arange(len(labels))

plt.bar(alabels, values, align='center', alpha=0.5)
plt.xticks(alabels, labels)
plt.ylabel('Probability of On-Time Arrival')
plt.ylim((0.0, 1.0))

In [None]:
labels = ('Apr 10', 'Apr 11', 'Apr 12', 'Apr 13', 'Apr 14', 'Apr 15', 'Apr 16')
values = (predict_delay('10/4/2018 13:00:00', 'JFK', 'MSP'),
          predict_delay('11/4/2018 13:00:00', 'JFK', 'MSP'),
          predict_delay('12/4/2018 13:00:00', 'JFK', 'MSP'),
          predict_delay('13/4/2018 13:00:00', 'JFK', 'MSP'),
          predict_delay('14/4/2018 13:00:00', 'JFK', 'MSP'),
          predict_delay('15/4/2018 13:00:00', 'JFK', 'MSP'),
          predict_delay('16/4/2018 13:00:00', 'JFK', 'MSP'))
alabels = np.arange(len(labels))

plt.bar(alabels, values, align='center', alpha=0.5)
plt.xticks(alabels, labels)
plt.ylabel('Probability of On-Time Arrival')
plt.ylim((0.0, 1.0))

In [None]:
# graph the probability that flights leaving SEA for ATL at 9:00 a.m., noon, 3:00 p.m., 6:00 p.m., and 9:00 p.m. on January 30 will arrive on time.
labels = ('9:00 a.m.', '12:00 p.m.', '3:00 p.m.', '6:00 p.m.', '9:00 p.m.')
values = (predict_delay('30/01/2019 09:00:00', 'SEA', 'ATL'),
          predict_delay('30/01/2019 12:00:00', 'SEA', 'ATL'),
          predict_delay('30/01/2019 15:00:00', 'SEA', 'ATL'),
          predict_delay('30/01/2019 18:00:00', 'SEA', 'ATL'),
          predict_delay('30/01/2019 21:00:00', 'SEA', 'ATL'))
alabels = np.arange(len(labels))

plt.bar(alabels, values, align='center', alpha=0.5)
plt.xticks(alabels, labels)
plt.ylabel('Probability of On-Time Arrival')
plt.ylim((0.0, 1.0))
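The plotting cells above repeat the same call once per day. As a sketch, the labels and date strings could be generated with `datetime` and `timedelta` instead; the names `start`, `days`, and `stamps` are illustrative and not part of the original notebook, and the month abbreviation assumes an English locale.

```python
from datetime import datetime, timedelta

# Seven consecutive evenings starting 1 October 2018 at 21:45.
start = datetime(2018, 10, 1, 21, 45)
days = [start + timedelta(days=i) for i in range(7)]

# Axis labels such as 'Oct 1' and date strings in the day-first format.
labels = [f"{d:%b} {d.day}" for d in days]
stamps = [f"{d.day}/{d.month}/{d.year} {d:%H:%M:%S}" for d in days]

print(labels)
print(stamps[0])  # 1/10/2018 21:45:00

# In the notebook, the bar heights would then be:
# values = [predict_delay(s, 'JFK', 'ATL') for s in stamps]
```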