# <font color='#d50283'>IT Academy - Data Science Itinerary</font>
## Sprint 11 Task 1 - Clustering - PreProcessing
### Assignment by: Kat Weissman

#### General objective:

- Become familiar with clustering algorithms.

#### Python Learning Objectives:
- K Means
- Hierarchical clustering

*Recommended learning resources:*
- https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
- https://realpython.com/k-means-clustering-python/
- https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
- https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
- https://towardsdatascience.com/machine-learning-algorithms-part-12-hierarchical-agglomerative-clustering-example-in-python-1e18e0075019
- https://www.analyticsvidhya.com/blog/2019/05/beginners-guide-hierarchical-clustering/


In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

### Level 1
### Exercise 1 
Group the different flights using the K-means algorithm.

In [2]:
pd.set_option('display.max_columns', None)  # show all columns when displaying a DataFrame

I will load the data that I pre-processed and sampled in a separate notebook:

- https://github.com/KatBCN/Supervisat_Classificacio/blob/main/Sprint%2010%20-%20Classification%20Model%20-%20Pre-Processing.ipynb

In [3]:
# load data
data_link = 'https://github.com/KatBCN/Supervisat_Classificacio/blob/main/flights-processed-sampled.pkl.bz2?raw=true'
df = pd.read_pickle(data_link, compression='bz2')

#### Data Exploration

In [4]:
# Show number of rows and columns in dataframe
df.shape

(96419, 33)

In [5]:
# Show column names and dtypes
df.dtypes

Unnamed: 0                  int64
Year                       object
Month                      object
DayofMonth                 object
DayOfWeek                  object
DepTime                   float64
CRSDepTime                  int64
ArrTime                   float64
CRSArrTime                  int64
UniqueCarrier              object
FlightNum                  object
TailNum                    object
ActualElapsedTime         float64
CRSElapsedTime            float64
AirTime                   float64
ArrDelay                  float64
DepDelay                  float64
Origin                     object
Dest                       object
Distance                    int64
TaxiIn                    float64
TaxiOut                   float64
Cancelled                  object
CancellationCode           object
Diverted                   object
CarrierDelay              float64
WeatherDelay              float64
NASDelay                  float64
SecurityDelay             float64
LateAircraftDe

In [6]:
# Display first 5 rows of dataframe
df.head(5)

Unnamed: 0.1,Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,SumDelay,Diff_SumDelay_DepDelay,Delayed
1098428,3654724,2008,7,18,5,1511.0,1505,1740.0,1740,WN,1418,N452WN,269.0,275.0,252.0,0.0,6.0,BNA,OAK,1959,5.0,12.0,0,N,0,,,,,,,,0
112686,361246,2008,1,11,5,1736.0,1730,1913.0,1858,FL,327,N955AT,97.0,88.0,69.0,15.0,6.0,BOS,BWI,370,9.0,19.0,0,N,0,6.0,0.0,9.0,0.0,0.0,15.0,9.0,1
599454,1878648,2008,4,29,2,1611.0,1600,1711.0,1700,WN,884,N241WN,60.0,60.0,41.0,11.0,11.0,ONT,LAS,197,3.0,16.0,0,N,0,,,,,,,,1
743117,2434787,2008,5,16,5,1551.0,1535,1803.0,1740,WN,1564,N718SW,132.0,125.0,121.0,23.0,16.0,STL,HOU,687,3.0,8.0,0,N,0,0.0,0.0,7.0,0.0,16.0,23.0,7.0,1
581400,1812607,2008,4,9,3,1754.0,1725,1834.0,1820,WN,2284,N686SW,100.0,115.0,86.0,14.0,29.0,SLC,LAX,590,6.0,8.0,0,N,0,,,,,,,,1


In [7]:
# Check for duplicates
df.duplicated().sum()

0

In [8]:
# Check for NA values
df.isna().sum()

Unnamed: 0                    0
Year                          0
Month                         0
DayofMonth                    0
DayOfWeek                     0
DepTime                       0
CRSDepTime                    0
ArrTime                       0
CRSArrTime                    0
UniqueCarrier                 0
FlightNum                     0
TailNum                       0
ActualElapsedTime             0
CRSElapsedTime                0
AirTime                       0
ArrDelay                      0
DepDelay                      0
Origin                        0
Dest                          0
Distance                      0
TaxiIn                        0
TaxiOut                       0
Cancelled                     0
CancellationCode              0
Diverted                      0
CarrierDelay              33942
WeatherDelay              33942
NASDelay                  33942
SecurityDelay             33942
LateAircraftDelay         33942
SumDelay                  33942
Diff_Sum

I start by selecting only the numerical features that have no missing values, because scikit-learn's K-means implementation does not accept NAs.
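
As a rough illustration of this selection step, the sketch below picks numeric, NA-free columns programmatically. The `toy` frame and its values are hypothetical stand-ins for the flights data. In this notebook the feature list is chosen by hand instead, which also leaves out numeric columns such as clock times that an automatic filter would keep.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the flights data
toy = pd.DataFrame({
    "AirTime": [252.0, 69.0, 41.0],
    "Distance": [1959, 370, 197],
    "CarrierDelay": [np.nan, 6.0, np.nan],  # numeric, but contains NAs
    "Origin": ["BNA", "BOS", "ONT"],         # not numeric
})

# Keep columns with a numeric dtype ...
numeric_cols = toy.select_dtypes(include="number").columns
# ... and, among those, only the ones without any missing value
complete_numeric = [c for c in numeric_cols if toy[c].notna().all()]

print(complete_numeric)
```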

In [9]:
numeric = ['CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Distance', 'TaxiIn', 'TaxiOut']

In [10]:
num_df = df[numeric]

In [11]:
num_df.head()

Unnamed: 0,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Distance,TaxiIn,TaxiOut
1098428,275.0,252.0,0.0,6.0,1959,5.0,12.0
112686,88.0,69.0,15.0,6.0,370,9.0,19.0
599454,60.0,41.0,11.0,11.0,197,3.0,16.0
743117,125.0,121.0,23.0,16.0,687,3.0,8.0
581400,115.0,86.0,14.0,29.0,590,6.0,8.0


"Machine learning algorithms need to consider all features on an even playing field. That means the values for all features must be transformed to the same scale.

The process of transforming numerical features to use the same scale is known as feature scaling. It’s an important data preprocessing step for most distance-based machine learning algorithms because it can have a significant impact on the performance of your algorithm." - https://realpython.com/k-means-clustering-python/
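
To make the quoted point concrete, here is a small sketch of how an unscaled feature with a large range can dominate Euclidean distance. The three flights are made-up values, and the divisors 660 and 11 are assumptions that appear consistent with the interquartile ranges of `Distance` and `TaxiOut` implied by the scaled output in this notebook.

```python
import numpy as np

# Each row: [Distance in miles, TaxiOut in minutes]; values are illustrative only
a = np.array([1959.0, 12.0])
b = np.array([1949.0, 60.0])  # almost the same distance, much longer taxi-out
c = np.array([1859.0, 12.0])  # same taxi-out, 100 miles shorter

# Unscaled: the 100-mile gap outweighs the 48-minute taxi gap
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Scaled by an assumed per-feature spread (IQR-like divisors)
spread = np.array([660.0, 11.0])
scaled_ab = np.linalg.norm((a - b) / spread)
scaled_ac = np.linalg.norm((a - c) / spread)

print(raw_ab < raw_ac)        # unscaled: b looks closer to a than c does
print(scaled_ab > scaled_ac)  # scaled: the ordering flips
```

Subtracting a per-feature center (as a scaler also does) would not change these distances, since it cancels in the difference between two points; only the division by the spread matters here.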

I will use scikit-learn's `RobustScaler` to scale this dataset before clustering.

- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html

"Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results."

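A minimal sketch of what this means in practice, assuming the default `RobustScaler` settings (centering on the median and scaling by the 25th to 75th percentile range). The array `x` is illustrative data with one extreme outlier, not taken from the flights dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Illustrative single feature with one extreme outlier
x = np.array([[3.0], [5.0], [6.0], [8.0], [12.0], [400.0]])

# Default RobustScaler: subtract the median, divide by the interquartile range
scaled = RobustScaler().fit_transform(x)

# The same transformation written out by hand
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

Because the median and the interquartile range barely move when the outlier changes, the bulk of the values keeps a comparable scale, while the outlier itself remains large after scaling.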

In [12]:
scaler = RobustScaler()
scaled_features = scaler.fit_transform(num_df)

In [13]:
# fit_transform returns a NumPy array, so the new DataFrame gets a default 0..n-1 index
df_scaled_features = pd.DataFrame(scaled_features, columns=numeric)

In [14]:
df_scaled_features.head()

Unnamed: 0,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Distance,TaxiIn,TaxiOut
0,1.892857,2.050633,-0.510638,-0.439024,2.048485,-0.25,-0.181818
1,-0.333333,-0.265823,-0.191489,-0.439024,-0.359091,0.75,0.454545
2,-0.666667,-0.620253,-0.276596,-0.317073,-0.621212,-0.75,0.181818
3,0.107143,0.392405,-0.021277,-0.195122,0.121212,-0.75,-0.545455
4,-0.011905,-0.050633,-0.212766,0.121951,-0.025758,0.0,-0.545455


In [15]:
df_scaled_features.shape

(96419, 7)

In [17]:
df_scaled_features.describe()

Unnamed: 0,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Distance,TaxiIn,TaxiOut
count,96419.0,96419.0,96419.0,96419.0,96419.0,96419.0,96419.0
mean,0.215144,0.229707,0.385457,0.464967,0.237805,0.206622,0.377418
std,0.845323,0.866049,1.194652,1.282684,0.866427,1.298131,1.295855
min,-1.202381,-1.139241,-1.914894,-0.439024,-0.872727,-1.5,-1.272727
25%,-0.416667,-0.405063,-0.319149,-0.292683,-0.407576,-0.5,-0.363636
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.583333,0.594937,0.680851,0.707317,0.592424,0.5,0.636364
max,6.47619,6.797468,26.148936,29.609756,6.598485,35.0,28.454545


In [18]:
df_scaled_features.to_csv('flights-sampled-robustscale.csv', index=False)

I will apply PCA in a separate notebook before clustering so that I can plot the results in two dimensions.

- https://github.com/KatBCN/NoSupervisat_Classificacio/blob/main/Sprint11-Clustering-PCA.ipynb
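
For orientation, the following is only a sketch of reducing seven scaled features to two principal components for plotting. It uses synthetic random data as a stand-in and is not the code from the linked notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a scaled feature matrix with 7 columns
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))

# Project onto the two directions of largest variance
pca = PCA(n_components=2)
components = pca.fit_transform(X)

print(components.shape)
# Share of the total variance retained by each of the two components
print(pca.explained_variance_ratio_)
```

The two columns of `components` can then be used as the x and y coordinates of a scatter plot, with cluster labels as the color.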