# Using clusters

what is it?
- now that we have made our clusters, how do we use them?

why do we care?
- to get some use out of our unsupervised learning

## Clusters are about providing options and insight

- Step 1: Explore the clusters
- Step 2: Name the clusters
- Step 3: 
    - Option 1: Dimensionality reduction
    - Option 2: Treat cluster names as a new target variable
    - Option 3: Perform deeper EDA
    - Option 4: Make many models

# Show us!

Scenario: Analyzing our mall data and seeing how unsupervised learning can drive our data insights

In [None]:
#do the data things
import pandas as pd
import numpy as np

#visualize & statisticize
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

#prepare and model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

#my creds
from env import get_db_url

## Acquire

In [None]:
#get my data
df = pd.read_sql('SELECT * FROM customers;', get_db_url('mall_customers'))
df = df.set_index('customer_id')

#see it
df.head()

In [None]:
df.info()

## Prepare

Since my dataset was so small, I made my validate and test dataframes smaller than normal
- this is probably too small to be utilized in the real world
- the tiny validate and test are just to show the steps of working through a split df

In [None]:
def train_validate_test_split(df, seed=123):
    '''
    accepts dataframe and splits the data into train, validate and test 
    '''
    train_validate, test = train_test_split(df, test_size=0.05, random_state=seed)
    train, validate = train_test_split(train_validate, test_size=0.05, random_state=seed)
    return train, validate, test

In [None]:
def scale_my_data(train, validate, test, features):
    '''
    scale my data using minmaxscaler, input the features to scale
    '''
    scaler = MinMaxScaler()
    scaler.fit(train[features])
    
    train_scaled = scaler.transform(train[features])
    validate_scaled = scaler.transform(validate[features])
    test_scaled = scaler.transform(test[features])

    train_scaled = pd.DataFrame(train_scaled, index=train.index, columns=features)
    validate_scaled = pd.DataFrame(validate_scaled, index=validate.index, columns=features)
    test_scaled = pd.DataFrame(test_scaled, index=test.index, columns=features)
    
    return train_scaled, validate_scaled, test_scaled

In [None]:
def prep_mall(df):
    '''
    split the mall data into train, validate, and test, then print each shape
    (the gender dummy variable is commented out below;
    scaling is handled separately by scale_my_data)
    '''
#     df['is_male'] = pd.get_dummies(df['gender'], drop_first=True)['Male']
    train, validate, test = train_validate_test_split(df)
    
    print(f'df: {df.shape}')
    print()
    print(f'train: {train.shape}')
    print(f'validate: {validate.shape}')
    print(f'test: {test.shape}')
    return train, validate, test

In [None]:
#prep my data!
train, validate, test = prep_mall(df)

In [None]:
train.head()

In [None]:
features_to_scale = ['age','annual_income','spending_score']

In [None]:
train_scaled, validate_scaled, test_scaled = scale_my_data(train, validate, test, features_to_scale)
train_scaled

## Explore

We explored all the things!
- hypothesize
- visualize
- statisticize
- summarize

We found that age, annual_income, and spending_score looked like good candidates for clustering.

We utilized the elbow method to determine the best number of clusters

In [None]:
# let's plot inertia vs k
pd.Series(
    {k: KMeans(k, random_state=42, n_init=10).fit(train_scaled).inertia_ for k in range(2, 12)}).plot(marker='x')
plt.xticks(range(2, 12))
plt.xlabel('k')
plt.ylabel('inertia')
plt.title('Change in inertia as k increases')
plt.grid()

#### Move forward with the chosen number of clusters (I chose 4)

In [None]:
#make it
kmeans = KMeans(n_clusters = 4, random_state=42, n_init=10)

#fit it
kmeans.fit(train_scaled)

#use it
kmeans.predict(train_scaled)

#### Now save all of the newly created clusters

In [None]:
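# And assign the cluster number to a column on the dataframe
# (predict on the scaled feature columns only, so re-running this cell still works)
train_scaled["cluster"] = kmeans.predict(train_scaled[features_to_scale])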
train_scaled.head()

## Yay clusters! What next?

- Step 1: Explore the clusters
- Step 2: Name the clusters
- Step 3: 
    - Option 1: Dimensionality reduction
    - Option 2: Treat cluster names as a new target variable
    - Option 3: Perform deeper EDA
    - Option 4: Make many models

### Step 1: Explore the clusters

see how they are similar or different

In [None]:
# plot out income vs. spending with regard to the cluster and age


### Step 2: Name the clusters 

use natural, descriptive language

In [None]:
#rename using map


### Step 3: Options

#### Let's say our mall dataset had more features in it; this will let us make better use of our clusters

In [None]:
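# made-up columns for illustration only; seed numpy so the values are reproducible
np.random.seed(123)
train = train.copy()  # work on a copy so pandas doesn't warn about setting values on a slice
train['transportation'] = np.random.choice(['personal_vehicle','walking','dropoff'],len(train),p=[.75,0.05,.2])
train['group_size'] = np.random.randint(1,10,len(train))
train['hair_color'] = np.random.choice(['black','brown','blonde','grey','other'],len(train))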

In [None]:
train.head()

### Option 1: Use the clusters to simplify multiple other variables
- Since the cluster names make sense, we can use them instead of age, spending, and income
    - this potentially makes our features simpler and easier to explain
        - helpful for storytelling
    - this allows us to reduce features
        - may help model performance (regression, classification)

#### let's say we wanted to predict transportation

we will now build a classification model since we have a target

In [None]:
#define y_train


without our clusters

In [None]:
#define X_train


with our clusters

In [None]:
#define X_train


### Option 2: Use cluster names as a target variable to classify new data

#### our new clusters can be our target variable

we can once again build a classification model using our new target variable

In [None]:
#define y_train


#### our X_train can NOT include the features that were used to calculate our target variable (that would leak the target)

In [None]:
#define X_train


### Option 3: Perform deeper EDA
Sometimes the identification of clusters gives us additional questions we need to ask.
- hypothesize
- visualize
- statisticize
- summarize

#### How do the customer groups relate to hair color?
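
One way to approach this question is a crosstab plus a chi-square test of independence, since both variables are categorical. A minimal sketch, assuming a random stand-in dataframe (`demo`) in place of the real train data:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in: a cluster label and a hair color per customer
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "cluster": rng.integers(0, 4, size=n),
    "hair_color": rng.choice(["black", "brown", "blonde", "grey", "other"], size=n),
})

# Counts of each hair color within each cluster
observed = pd.crosstab(demo["cluster"], demo["hair_color"])
print(observed)

# H0: cluster and hair color are independent
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

The stand-in values are random, so the result itself says nothing about mall customers; on the real data, a small p-value would suggest hair color is distributed differently across the customer groups.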

#### How do the customer groups compare to the group size?
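
Group size is numeric, so one option is to compare its distribution across clusters with a Kruskal-Wallis test (a swapped-in choice here; a one-way ANOVA is another option if its assumptions hold). A minimal sketch on a random stand-in dataframe:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in: a cluster label and a group size per customer
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "cluster": rng.integers(0, 4, size=n),
    "group_size": rng.integers(1, 10, size=n),
})

# Summary of group size within each cluster
print(demo.groupby("cluster")["group_size"].describe())

# H0: group size has the same distribution in every cluster
groups = [g["group_size"].to_numpy() for _, g in demo.groupby("cluster")]
h_stat, p = stats.kruskal(*groups)
print(f"H = {h_stat:.2f}, p = {p:.3f}")
```

As with the hair color check, the numbers from random data are meaningless; the pattern (summarize per cluster, then test) is what carries over.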

### Option 4: Create a Many Models Approach

For each unique value in our cluster column, build a separate model, so each cluster can have its own model.

#### Model 1 -  Young people who are low income and spend a lot

#### Model 2 - Old people who are low income and spend low amounts

#### Model 3 - Young people who make a lot and spend a lot

#### Model 4 - Old people who make a lot and spend a little
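
The many-models idea can be sketched as a loop over clusters that stores one fitted model per group. Everything specific here is an assumption for illustration: the random `demo` dataframe, `transportation` as the target, and a shallow decision tree as the model.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for train with a made-up target and extra feature
rng = np.random.default_rng(42)
n = 200
cluster_features = ["age", "annual_income", "spending_score"]
demo = pd.DataFrame(rng.random((n, 3)), columns=cluster_features)
demo["group_size"] = rng.integers(1, 10, size=n)
demo["transportation"] = rng.choice(
    ["personal_vehicle", "walking", "dropoff"], size=n, p=[0.75, 0.05, 0.2]
)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
demo["cluster"] = kmeans.fit_predict(demo[cluster_features])

model_features = cluster_features + ["group_size"]

# Fit one model per cluster, keyed by cluster label
models = {}
for cluster_id, subset in demo.groupby("cluster"):
    tree = DecisionTreeClassifier(max_depth=2, random_state=42)
    models[cluster_id] = tree.fit(subset[model_features], subset["transportation"])
    print(f"cluster {cluster_id}: trained on {len(subset)} rows")

# To predict, route each customer to the model for their cluster
new_customers = demo.head(5)
preds = [
    models[cluster_id].predict(new_customers.loc[[idx], model_features])[0]
    for idx, cluster_id in zip(new_customers.index, new_customers["cluster"])
]
print(preds)
```

Each model only ever sees its own cluster's rows, so small clusters mean small training sets; that is the trade-off to weigh against a single model that uses the cluster as a feature.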