# K-Means Clustering

### Dataset:  
This project explores the relationship between Social Media, Salary, Influence, Performance and Team Valuation in the NBA.


Variables:
- TEAM: Name of the NBA Team
- GMS: Games Played
- PCT_ATTENDANCE: Average attendance as a % of capacity (note that some teams averaged over capacity)
- WINNING_SEASON: If the team won over 50% of their games, it was 1, otherwise 0.
- TOTAL_ATTENDANCE_MILLIONS: Total season attendance in the millions.
- VALUE_MILLIONS: Valuation of the team in millions
- ELO: Elo rating of the team (see https://en.wikipedia.org/wiki/Elo_rating_system)
- CONF: Eastern or Western Conference
- COUNTY: The county where the team is located
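- MEDIAN_HOME_PRICE_COUNTY_MILLIONS: Median home price in the team's county (despite the column name, the summary statistics below show values in the hundreds of thousands, so the figures appear to be in dollars rather than millions)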
- COUNTY_POPULATION_MILLIONS: The Population of the county in Millions
- cluster: A cluster created by KMeans clustering (shown in notebook)

In [2]:
import pandas as pd
team_data = "https://raw.githubusercontent.com/noahgift/socialpowernba/master/data/nba_2017_att_val_elo_win_housing.csv"
val_housing_win_df = pd.read_csv(team_data)
numerical_df = val_housing_win_df.loc[:,["TOTAL_ATTENDANCE_MILLIONS", "ELO", "VALUE_MILLIONS", "MEDIAN_HOME_PRICE_COUNTY_MILLIONS"]]

In [3]:
val_housing_win_df.head()
val_housing_win_df.describe()

| | GMS | PCT_ATTENDANCE | WINNING_SEASON | TOTAL_ATTENDANCE_MILLIONS | VALUE_MILLIONS | ELO | MEDIAN_HOME_PRICE_COUNTY_MILLIONS | COUNTY_POPULATION_MILLIONS |
|---|---|---|---|---|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
| mean | 41.0 | 93.466667 | 0.566667 | 0.733263 | 1355.333333 | 1504.833333 | 407471.1 | 2.236 |
| std | 0.0 | 8.544945 | 0.504007 | 0.073376 | 709.613704 | 106.843451 | 301904.1 | 2.434055 |
| min | 41.0 | 72.0 | 0.0 | 0.605585 | 750.0 | 1338.0 | 129900.0 | 0.39 |
| 25% | 41.0 | 86.5 | 0.0 | 0.67913 | 886.25 | 1425.25 | 271038.8 | 0.9725 |
| 50% | 41.0 | 96.5 | 1.0 | 0.724902 | 1062.5 | 1510.5 | 324199.0 | 1.275 |
| 75% | 41.0 | 100.0 | 1.0 | 0.800584 | 1600.0 | 1582.5 | 465425.0 | 2.4 |
| max | 41.0 | 104.0 | 1.0 | 0.888882 | 3300.0 | 1770.0 | 1725000.0 | 10.1 |


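### Distance metrics used:
- Euclidean
    - The straight-line distance between two points: the square root of the sum of squared differences across features.
- Hamming
    - The Hamming distance between two equal-length strings a and b, denoted d(a,b), is the number of positions at which they differ. It is used for error detection and error correction when data is transmitted over computer networks, and in coding theory for comparing equal-length data words.
- Cosine
    - One minus the cosine of the angle between two vectors, so it depends on their direction rather than their magnitude.
- Mahalanobis
    - A measure of the distance between a point P and a distribution D, expressed relative to the spread (covariance) of D.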
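
The four metrics listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notebook's own code: the helper names (`euclidean`, `hamming`, `cosine_distance`, `mahalanobis`) and all input values are made up for the example, and the Mahalanobis helper assumes the sample covariance matrix is invertible.

```python
import numpy as np

def euclidean(u, v):
    """Straight-line distance: square root of the summed squared differences."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))

def cosine_distance(u, v):
    """One minus the cosine similarity of u and v."""
    return float(1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mahalanobis(p, sample):
    """Distance from point p to the distribution estimated from the rows of sample."""
    mean = sample.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(sample, rowvar=False))
    diff = p - mean
    return float(np.sqrt(diff @ inv_cov @ diff))

# Toy values, made up for illustration (not taken from the NBA dataset).
u = np.array([3.0, 4.0])
v = np.array([0.0, 0.0])
sample = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]])

print(euclidean(u, v))                            # 5.0
print(hamming("karolin", "kathrin"))              # 3
print(cosine_distance(u, 2 * u))                  # 0.0: same direction, different length
print(mahalanobis(sample.mean(axis=0), sample))   # 0.0: the mean is at distance zero
```

Note that this `hamming` returns a count of differing positions, whereas some libraries report the proportion of differing positions instead.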

#### Find Euclidean Distance

In [14]:
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np

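def get2DArray(row):
    # euclidean_distances expects 2D inputs of shape (n_samples, n_features),
    # so wrap the row's values in an outer list to get shape (1, n_features).
    return np.array([row.values.tolist()])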

first_row = get2DArray(numerical_df.iloc[0])
second_row = get2DArray(numerical_df.iloc[1])

euclidean_distances(first_row, second_row)

array([[45102.33254507]])

### Standardisation
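- Rescales each feature to zero mean and unit variance so that every feature/dimension influences the clusters equally. Without it, the feature with the largest numeric range dominates the distance: the raw Euclidean distance above (about 45,102) comes almost entirely from MEDIAN_HOME_PRICE_COUNTY_MILLIONS.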
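
To make the effect of scale concrete, the sketch below measures how much each feature contributes to the squared Euclidean distance before and after standardisation. The feature matrix is a toy example with made-up values, and the helpers `zscore` and `squared_diff_share` are hypothetical names introduced here. The z-score uses the population standard deviation (NumPy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

# Toy feature matrix, made up for illustration: column 0 is on a scale of
# hundreds of thousands, column 1 is on a scale of about one.
X = np.array([
    [300_000.0, 0.70],
    [345_000.0, 0.90],
    [320_000.0, 0.60],
    [500_000.0, 0.80],
])

def zscore(X):
    """Column-wise standardisation: subtract the mean, divide by the std. dev."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def squared_diff_share(a, b):
    """Share of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

Z = zscore(X)
raw_share = squared_diff_share(X[0], X[1])
std_share = squared_diff_share(Z[0], Z[1])

print(raw_share)  # column 0 accounts for essentially all of the raw distance
print(std_share)  # after standardisation, column 1 contributes as well
```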

In [None]:
team_data = "https://raw.githubusercontent.com/noahgift/socialpowernba/master/data/nba_2017_att_val_elo_win_housing.csv"
val_housing_win_df = pd.read_csv(team_data)
numerical_df = val_housing_win_df.loc[:,["TOTAL_ATTENDANCE_MILLIONS", "ELO", "VALUE_MILLIONS", "MEDIAN_HOME_PRICE_COUNTY_MILLIONS"]]
numerical_df.head()

In [16]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.cluster import KMeans

In [20]:
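%%timeit
# Standardise the features, then fit K-Means with 3 clusters.
# Note: names assigned inside a %%timeit cell are not kept in the notebook namespace.
std_cluster = make_pipeline(StandardScaler(), KMeans(n_clusters=3))
kmeans = std_cluster.fit(numerical_df)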

2.92 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


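Difference between `Pipeline()` and `make_pipeline()` from sklearn:
`Pipeline` takes a list of explicit `(name, estimator)` tuples, while `make_pipeline` takes only the estimators and generates the step names automatically from the lowercased class names.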
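
A short side-by-side sketch of the two constructors follows. The step names `"scale"` and `"cluster"` are arbitrary choices made for this example. The variables `named` and `auto` are likewise introduced only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit names: each step is a (name, estimator) tuple chosen by the caller.
named = Pipeline([
    ("scale", StandardScaler()),
    ("cluster", KMeans(n_clusters=3)),
])

# Automatic names: derived from the lowercased class names.
auto = make_pipeline(StandardScaler(), KMeans(n_clusters=3))

print([name for name, _ in named.steps])  # ['scale', 'cluster']
print([name for name, _ in auto.steps])   # ['standardscaler', 'kmeans']

# Step names are how a step and its parameters are addressed later on.
auto.set_params(kmeans__n_clusters=4)
print(auto.named_steps["kmeans"].n_clusters)  # 4
```

Explicit names are convenient when parameters are referenced repeatedly (for example in a grid search), while `make_pipeline` keeps short scripts compact.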

## K-Means

#### Diagnostics
- Elbow plot
- Silhouette analysis
- Intercluster Distance Map

1. Elbow Plot
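Helps to determine the optimal number of clusters by plotting the number of clusters, k, against the distortion score (the sum of squared distances from each point to its assigned cluster centre). The "elbow", the point where the curve bends and adding more clusters stops reducing the distortion substantially, indicates a good value of k.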
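
The elbow heuristic can also be sketched with scikit-learn alone, using the fitted `KMeans.inertia_` attribute (the sum of squared distances of samples to their closest cluster centre) as the distortion score. This is an illustrative sketch, not the notebook's own analysis: it runs on synthetic data from `make_blobs` with three hand-picked centres, and the range of k values is an arbitrary choice. In the notebook, `numerical_df` would take the place of `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with three well-separated groups (an assumption
# made for this sketch; it is not the NBA dataset).
X, _ = make_blobs(
    n_samples=150,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.6,
    random_state=0,
)

k_values = list(range(1, 8))
inertias = []
for k in k_values:
    model = make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=k, n_init=10, random_state=0),
    )
    model.fit(X)
    # inertia_ is the sum of squared distances of samples to their closest centre.
    inertias.append(model.named_steps["kmeans"].inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.2f}")
```

Plotting `inertias` against `k_values` gives the elbow curve. With three well-separated groups, the drop from k=2 to k=3 is much larger than the drop from k=3 to k=4, which is the bend the plot is meant to reveal.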

In [21]:
from yellowbrick.cluster import KElbowVisualizer


ModuleNotFoundError: No module named 'yellowbrick'