# Clustering with K-Means
### Clustering airlines based on average air time and average arrival delay using K-Means

Many clustering algorithms are available, and choosing the right one can be difficult. As in our lecture, this repository focuses on two of them: **K-Means** and **DBSCAN**.

In this notebook we will use our beloved flights data to apply the K-Means clustering algorithm. 

At the end of this notebook you should: 
* know how to use the sklearn implementation of `K-Means`  
* know which steps are necessary to perform clustering with `K-Means`    
* know what results you will get by clustering 

For a deeper understanding of the `K-Means` algorithm, check out notebook 6 in this repository.

In [None]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [None]:
# Import get_dataframe function from your own sql module
from sql_functions import get_dataframe

# Import get_engine
from sql_functions import get_engine

# create a variable called engine using the get_engine function
engine = get_engine()

In [None]:
# define/assign the schema you want to query from
schema = 'hh_analytics_22_2'

In [None]:
# Get the aggregated data from the database
sql_select = f'''select 
    airline, 
    count(*) as flights, 
    avg(air_time) as avg_air_time, 
    avg(arr_delay) as avg_arr_delay
from {schema}.flights
group by 1
'''

In [None]:
# Query the database
k_means_data = pd.read_sql_query(sql_select, engine)

In [None]:
# check results
k_means_data.head()

### Scale your data
The input features of a model often have different units, which means the variables also have different scales. Some model types (e.g. tree-based models such as decision trees or random forests) are unaffected by the scale of numerical input variables. Many other machine learning algorithms, in particular those that rely on distance measures (e.g. K-Means), perform better when the input features are brought to a comparable scale. Otherwise the feature with the largest numeric range dominates the distance calculation. 
**You can learn more about scaling in Notebook 7**
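
To make the effect of scaling concrete, here is a minimal sketch on a small made-up array (the values are hypothetical and not taken from the flights data). It shows that after `StandardScaler`, every column has mean 0 and standard deviation 1, so no single feature dominates the distance calculation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: column 0 resembles air time in minutes,
# column 1 resembles arrival delay in minutes
toy = np.array([
    [60.0, 2.0],
    [120.0, 8.0],
    [180.0, 5.0],
    [240.0, 11.0],
])

toy_scaled = StandardScaler().fit_transform(toy)

# After standardization each column has mean 0 and
# (population) standard deviation 1
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```

Before scaling, the first column spans 180 units and the second only 9, so Euclidean distances would be driven almost entirely by the first column.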

In [None]:
# Scaling with StandardScaler
# First, create a StandardScaler instance with default hyperparameters.
# Then call fit_transform() and pass it the columns we want to transform.

sc = StandardScaler()
scaled_data = sc.fit_transform(k_means_data[['avg_air_time', 'avg_arr_delay']])

In [None]:
# The result is a NumPy array containing the standardized values
scaled_data

In [None]:
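# Set up the KMeans object and cluster using the scaled data
# n_init and random_state are set explicitly so repeated runs give the same labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X=scaled_data)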

# Write the clusters to the dataframe as a new column
k_means_data['k_clusters'] = kmeans.labels_

In [None]:
# Check dataframe with assigned labels
k_means_data.head()
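
Besides the labels, a fitted `KMeans` object also exposes the cluster centers through `cluster_centers_`. Because the model was fitted on scaled data, those centers are in scaled units. The sketch below uses synthetic stand-in data (made-up numbers, not the flights table) and hypothetical variable names to show one way of mapping the centers back to the original units with the scaler's `inverse_transform`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three groups of
# (avg_air_time, avg_arr_delay) pairs. All numbers are made up.
rng = np.random.default_rng(0)
group_centers = np.array([[60.0, 2.0], [150.0, 10.0], [240.0, 4.0]])
toy = np.vstack([
    center + rng.normal(scale=[5.0, 0.5], size=(10, 2))
    for center in group_centers
])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy_scaled)

# cluster_centers_ is in scaled space; map it back to the original units
centers_original = toy_scaler.inverse_transform(toy_kmeans.cluster_centers_)

# Number of points assigned to each cluster label
cluster_sizes = np.bincount(toy_kmeans.labels_)
print(centers_original.round(1))
print(cluster_sizes)
```

In this notebook the same idea would apply to `sc` and `kmeans`, which gives one center per cluster expressed as average air time and average arrival delay.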

In [None]:
# Chart the data using matplotlib
fig, ax1 = plt.subplots(figsize=(10,8))

# Labels
ax1.set_xlabel('avg_air_time')
ax1.set_ylabel('avg_arr_delay')
ax1.set_title('K-Means clustering example')
ax1.set_xlim(0, 250)
# Plot the unscaled values on the axes created above
ax1.scatter(k_means_data['avg_air_time'],
            k_means_data['avg_arr_delay'],
            s=300,
            c=k_means_data['k_clusters']  # color based on cluster labels
           ); 
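
The choice of `n_clusters=3` above is an input to the algorithm and not something K-Means finds on its own. One common heuristic, sketched here under the assumption that it fits your data, is to fit the model for several values of `k` and compare the `inertia_` attribute (the sum of squared distances of the points to their nearest cluster center). The data below is synthetic and the variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with three made-up groups
rng = np.random.default_rng(1)
group_centers = np.array([[60.0, 2.0], [150.0, 10.0], [240.0, 4.0]])
toy = np.vstack([
    center + rng.normal(scale=[5.0, 0.5], size=(10, 2))
    for center in group_centers
])
toy_scaled = StandardScaler().fit_transform(toy)

# Fit K-Means for k = 1..6 and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_scaled)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 3))
```

Inertia tends to drop as `k` grows, so the usual reading is to look for the value of `k` after which the improvement flattens out instead of picking the smallest inertia.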