## Clustering Functions for Homework 4

This notebook contains the code needed to complete the clustering for homework 4.

Please read __ALL__ the comments in the code and the headings. This notebook is NOT intended to be run as a script from top to bottom, although there are some code cells that need to be run first.
- The general utility libraries need to be loaded first
- Then you need to execute the load data and engineer features code cells

In [None]:
# Load general utilities
# ----------------------
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.axes as ax
import datetime
import math
import numpy as np
import pickle

import warnings
warnings.filterwarnings("ignore")

### PREP AND PREPROCESSING SECTION

### Load the data and engineer features

In [None]:
# This is the code you can use to open your pickle file
# Read the data and features from the pickle (the with block closes the file afterwards)
with open("Data/clean_data.pickle", "rb") as f:
    data, discrete_features, continuous_features, ret_cols = pickle.load(f)

In [None]:
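# Create a feature for the length of a person's credit history (in months)
# at the time the loan is issued.
# Recent pandas versions may reject np.timedelta64(1, 'M') as an ambiguous unit,
# so convert days to months using the average month length (365.2425 / 12 days).
data['cr_hist'] = (data.issue_d - data.earliest_cr_line).dt.days / (365.2425 / 12)
continuous_features.append('cr_hist')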
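
A credit-history length in months can also be counted in whole calendar months rather than as a fraction of an average month. The cell below is a minimal sketch on a hypothetical two-row frame (`toy` is not part of the homework data); it only assumes that `issue_d` and `earliest_cr_line` are datetime columns, as they appear to be in the cell above.

```python
import pandas as pd

# Hypothetical toy frame with the same two date columns used in the notebook
toy = pd.DataFrame({
    "issue_d": pd.to_datetime(["2015-01-01", "2016-06-01"]),
    "earliest_cr_line": pd.to_datetime(["2005-01-01", "2010-06-01"]),
})

# Whole calendar months between the two dates: 12 * year difference + month difference
toy["cr_hist"] = (
    (toy.issue_d.dt.year - toy.earliest_cr_line.dt.year) * 12
    + (toy.issue_d.dt.month - toy.earliest_cr_line.dt.month)
)
print(toy)
```

Either definition should work for clustering, since the feature is rescaled later; the calendar-month version simply yields integers.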

#### If you want to use a smaller sample of the data due to time constraints, use the following code

In [None]:
# this code randomly samples 55% of the rows
# change the frac parameter if you want to sample a different percentage
# replace=False ensures we won't select the same row twice
data = data.sample(frac=.55, replace=False).copy()
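
If you want the same subsample every time the notebook is rerun, `DataFrame.sample` accepts a `random_state` seed. The sketch below uses a hypothetical 100-row frame and an arbitrary seed of 42; neither comes from the homework.

```python
import pandas as pd

# Hypothetical frame standing in for the loan data
toy = pd.DataFrame({"loan_amnt": range(100)})

# The same seed gives the same rows on every call
sample_a = toy.sample(frac=0.55, replace=False, random_state=42)
sample_b = toy.sample(frac=0.55, replace=False, random_state=42)

print(len(sample_a))                              # number of sampled rows
print(sample_a.index.equals(sample_b.index))      # True when the samples match
print(sample_a.index.is_unique)                   # no row is drawn twice
```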

#### VERY IMPORTANT STEP
You need to define which features to use in the clustering.

In [None]:
# define the discrete features you want to use in modeling.
# if you want to use all the discrete features, just set discrete_features_touse = discrete_features
discrete_features_touse =['purpose', 'term', 'verification_status', 'emp_length', 'home_ownership']

# define the continuous features to use in modeling
# if you want to use all the continuous features, just set the continuous_features_touse = continuous_features
continuous_features_touse = ['loan_amnt', 'funded_amnt','installment','annual_inc','dti','revol_bal','delinq_2yrs','open_acc',
 'pub_rec','fico_range_high','fico_range_low','revol_util','cr_hist']

#### Run the code below if you want to use all the features

In [None]:
discrete_features_touse=discrete_features
continuous_features_touse = continuous_features

#### Functions to scale data

In [None]:
from sklearn.preprocessing import MinMaxScaler

def minMaxScaleContinuous(continuousList):
    # Scale each listed continuous column of the global `data` frame to the [0, 1] range
    continuous = data[continuousList]
    return pd.DataFrame(MinMaxScaler().fit_transform(continuous),
                        columns=list(continuous.columns),
                        index=continuous.index)

def createDiscreteDummies(discreteList):
    return pd.get_dummies(data[discreteList], dummy_na = True, prefix_sep = "::", drop_first = False)

def createTransformedData(continuousList,discreteList):
    # use this line if you want to scale the continuous features using the MinMaxScaler in the function defined above
    continuous = minMaxScaleContinuous(continuousList)

    # create numeric dummy features for the discrete features to be used in modeling
    discrete = createDiscreteDummies(discreteList)

    #concatenate the continuous and discrete features into one dataframe
    return pd.concat([continuous, discrete], axis = 1)
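
To see what these helpers produce, here is a minimal sketch of the same two steps on a hypothetical three-row frame (the column values are made up for illustration). It uses the same `MinMaxScaler` and `pd.get_dummies` arguments as the functions above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one continuous column and one discrete column with a missing value
toy = pd.DataFrame({
    "loan_amnt": [1000.0, 5000.0, 9000.0],
    "term": ["36", "60", None],
})

# Min-max scaling maps the smallest value to 0 and the largest to 1
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy[["loan_amnt"]]),
    columns=["loan_amnt"],
    index=toy.index,
)

# dummy_na=True adds an extra indicator column for missing values
dummies = pd.get_dummies(toy[["term"]], dummy_na=True, prefix_sep="::", drop_first=False)

combined = pd.concat([scaled, dummies], axis=1)
print(combined)
```

After this step every column lies in [0, 1], so no single feature dominates the Euclidean distances that KMeans relies on.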

#### Function to Determine Number of Clusters

In [None]:
from sklearn.cluster import KMeans

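def determineClusters(transformed_data, range_min, range_max):
    # Fit KMeans for each k in range(range_min, range_max) (range_max is excluded)
    # and record the inertia, i.e. the within-cluster sum of squared distances
    Sum_of_squared_distances = []
    K = range(range_min, range_max)
    for k in K:
        # n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
        km = KMeans(n_clusters=k)
        km = km.fit(transformed_data)
        Sum_of_squared_distances.append(km.inertia_)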
        
    plt.plot(K, Sum_of_squared_distances, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Sum_of_squared_distances')
    plt.title('Elbow Method For Optimal k')
    plt.show()
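
The elbow idea is easier to see on data where the true number of clusters is known. The following sketch is an illustration only: it builds three well-separated synthetic blobs with `make_blobs` (the centers, sample count, and seed are arbitrary assumptions) and prints the inertia for each k instead of plotting it.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three known, well-separated cluster centers (illustrative values)
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.5,
    random_state=0,
)

# Record the inertia (within-cluster sum of squared distances) for each k
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The inertia should fall sharply up to k=3 and only slightly afterwards; that bend is the "elbow". On real loan data the bend is usually much less distinct, so expect to make a judgment call.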

### Scale and Transform Data

- The continuous features are scaled and dummies are created for the discrete features
- The dataset is split into Training and Testing

In [None]:
transformed_data = pd.concat([createTransformedData(continuous_features_touse, discrete_features_touse),data[ret_cols]], axis = 1)
transformed_data.head()

In [None]:
from sklearn.model_selection import train_test_split

X = pd.concat([createTransformedData(continuous_features_touse, discrete_features_touse),data[ret_cols]], axis = 1)
#y = data[ret_cols]

# create a test and train split of the transformed data
X_train, X_test = train_test_split(X, random_state=0, test_size=.3)

clusterInput = X_train.iloc[:,:-len(ret_cols)]
predictInput = X_test.iloc[:,:-len(ret_cols)]
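
Note that the `iloc[:, :-len(ret_cols)]` slice relies on the return columns being the last columns of `X`, which holds here because they were concatenated last. A minimal sketch of the same pattern on a hypothetical frame (`ret_cols_toy` and the column names are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: two feature columns followed by two return columns
ret_cols_toy = ["ret_a", "ret_b"]
toy = pd.DataFrame({
    "feat_1": range(10),
    "feat_2": range(10, 20),
    "ret_a": range(20, 30),
    "ret_b": range(30, 40),
})

# 70/30 split, then drop the trailing return columns to get the clustering input
train, test = train_test_split(toy, random_state=0, test_size=0.3)
train_features = train.iloc[:, :-len(ret_cols_toy)]
test_features = test.iloc[:, :-len(ret_cols_toy)]

print(train_features.columns.tolist())
print(len(train), len(test))
```

Keeping the return columns out of the clustering input means the clusters are formed from loan characteristics only, and the returns can be compared across clusters afterwards.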


### Determine Number of Clusters

- Use the Elbow Method to determine proper number of clusters
- You can change the range_min and range_max parameters to control the range of clusters that are considered in the graph. Note that range_max is exclusive, so range_min=2 and range_max=25 evaluates k = 2 through 24

In [None]:
determineClusters(clusterInput, range_min=2, range_max=25)

### Fit KMeans Clustering

- Fit the clustering on the training data
- The number of clusters parameter comes from the ideal number as determined by the Elbow method above
- The other parameters can be adjusted as well. You can find documentation at the link below

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [None]:
from sklearn.cluster import KMeans

# n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
kmeans = KMeans(n_clusters=16, random_state=0).fit(clusterInput)

print("Sum Squared Distances: ", kmeans.inertia_)
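
The printed value is the model's `inertia_`: the sum of squared distances from each training point to the center of its assigned cluster. The sketch below checks that definition by hand on hypothetical two-cluster data (the data, seed, and `n_init=10` are illustrative assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two tight groups of 20 points around (0, 0) and (1, 1)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=1.0, scale=0.1, size=(20, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Recompute the inertia manually from the fitted labels and centers
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X_toy - assigned_centers) ** 2).sum()

print(km.cluster_centers_.shape)   # one row per cluster, one column per feature
print(np.bincount(km.labels_))     # number of points in each cluster
print(km.inertia_, manual_inertia)
```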

### Add Cluster to the Training Data
- Create a dataframe of the standard deviations by cluster

In [None]:
X_train['cluster'] = kmeans.labels_

In [None]:
clusterStdDev=X_train.groupby('cluster')[ret_cols].std()
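
As an illustration of what the grouped standard deviation returns, here is a small sketch on a hypothetical frame with one return column. Pandas computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical cluster labels and returns
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "ret_a": [1.0, 2.0, 3.0, 10.0, 10.0, 10.0],
})

# One row per cluster, one column per return measure
std_by_cluster = toy.groupby("cluster")[["ret_a"]].std()
print(std_by_cluster)
```

A cluster whose returns are all identical gets a standard deviation of 0, while a cluster with spread-out returns gets a larger value.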

### Get Clusters for Test Data

In [None]:
X_test['cluster'] = kmeans.predict(predictInput)
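
`predict` does not refit the model; it assigns each new row to the nearest cluster center learned from the training data. A minimal sketch on hypothetical points, comparing `predict` with a manual nearest-center lookup:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training and test points in two obvious groups
train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
test = np.array([[0.05, 0.1], [0.95, 0.9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train)
predicted = km.predict(test)

# Manual assignment: Euclidean distance from every test point to every center
distances = np.linalg.norm(test[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest = distances.argmin(axis=1)

print(predicted, nearest)
```

This only gives sensible assignments if the test rows were transformed the same way as the training rows, which is why the scaling is applied before the split in this notebook.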