# I have clusters...now what?

You have created amazing clusters! Yay! 
At a minimum, you have a new piece of art. 
But what do you do now? How does this help you? Well (this answer will surprise you), it depends. 


![now_what.jpg](now_what.jpg)

__________________________

## Overview: what you can do with clusters

### Explore your Clusters

I cluster multiple variables in order to be able to explore and understand my data better. As an example, in the Zillow dataset:

1. Cluster by bedrooms, bathrooms and square feet to understand the different groups of combinations.

2. Plot 2 dimensions, such as logerror and lot size, and color by cluster id to see multiple dimensions in a single plot and better understand how all of these variables interact with your target (logerror, in this case). 

3. Use the clusters to run an ANOVA to see if there is a significant difference in the log error among these groups. 

4. If there is a difference, what can we learn? What are the specs of the clusters with significantly higher or lower errors? 
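
As a rough sketch of steps 1, 3 and 4, the cell below clusters on bedrooms, bathrooms and square feet, then runs a one-way ANOVA on logerror across the clusters. The data is randomly generated as a stand-in for the Zillow columns, and both the choice of 4 clusters and the use of `scipy.stats.f_oneway` are my assumptions, not something fixed by this lesson.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the Zillow columns; every value here is made up.
rng = np.random.default_rng(123)
n = 300
homes = pd.DataFrame({
    "bedroomcnt": rng.integers(1, 7, n),
    "bathroomcnt": rng.integers(1, 5, n),
    "calculatedfinishedsquarefeet": rng.normal(1800, 500, n).clip(400),
    "logerror": rng.normal(0, 0.1, n),
})

# Scale first so square feet does not dominate the distance calculation.
cluster_cols = ["bedroomcnt", "bathroomcnt", "calculatedfinishedsquarefeet"]
scaled = MinMaxScaler().fit_transform(homes[cluster_cols])

# k=4 is an arbitrary choice for illustration.
homes["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=123).fit_predict(scaled)

# One array of logerror per cluster, then a one-way ANOVA across them.
groups = [g["logerror"].to_numpy() for _, g in homes.groupby("cluster")]
f_stat, p_value = stats.f_oneway(*groups)

# Per-cluster averages, to read alongside the test result (step 4).
profile = homes.groupby("cluster")[cluster_cols + ["logerror"]].mean()
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
print(profile)
```

If the p-value is small, the `profile` table is where I would start looking for what sets the high-error or low-error clusters apart.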

________________________

### Turn your Clusters into Labels

I create clusters that I then label so that I can build a model to predict which group an observation belongs in. For example, I want to classify H-E-B customers by types of items they shop for. But I don't know what the distinct groups are yet.

1. Cluster based on something like average number of items per store department per visit. 

2. Review clusters through exploration to create useful labels. 

3. Once the clustering model is fit, you can run predict on new data to identify which cluster each new observation belongs to. Or, for a longer-term solution, use supervised methods (classification) on a training sample of your dataset, with the features as inputs and the labeled classes as your target, to create a model that predicts those classes.

4. After selecting the best model on train, test your model on the out-of-sample data that was already clustered (therefore it has labels) and evaluate. 

5. Run the model on all data to add labels to all existing data. 

6. Take some random samples to manually verify and do a little exploration to verify the new labels are doing what you expect. 
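
Here is a minimal sketch of that workflow under some loud assumptions: the shopper data is synthetic, the department names are hypothetical, and a random forest is just one classifier you might pick. The cluster ids from k-means play the role of the labels.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical data: average items per department per visit for three
# made-up shopper types (column names are illustrative only).
centers = np.array([[8, 1, 1], [1, 8, 1], [1, 1, 8]])
X = np.vstack([rng.normal(c, 0.7, size=(100, 3)) for c in centers])
customers = pd.DataFrame(X, columns=["produce", "bakery", "pharmacy"])

train, test = train_test_split(customers, test_size=0.25, random_state=42)

# Steps 1-3: cluster on train, then use the fitted model to label both samples.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(train)
train_labels = kmeans.predict(train)
test_labels = kmeans.predict(test)

# Step 3 (longer-term option): a supervised model that predicts the labels.
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(train, train_labels)

# Step 4: evaluate on the out-of-sample data that already has cluster labels.
accuracy = clf.score(test, test_labels)

# Step 5: label everything.
all_labels = clf.predict(customers)
print(f"out-of-sample accuracy: {accuracy:.2f}")
```

The payoff of the supervised model is that it can use features that were not part of the clustering, and it keeps working even if you later retire the original clustering model.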

_______________________

### Model Each Cluster Separately

I have created clusters of customers based on their consumption of hosting products and services. I have seen that the different clusters have different drivers of churn. I would like to build a unique model predicting churn for each cluster created. 

1. Cluster by services, products and monthly revenue.  

2. When you have the clustering model fit, you can run predict on new data to identify which clusters new observations belong in. 

3. Use supervised methods (regression/classification) and a training sample of your dataset (with features + target variable) for each unique cluster to create a model that predicts your original target variable. 
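
A sketch of what "one model per cluster" can look like. Everything here is an assumption for illustration: the customer data is random, the column names are hypothetical, and a shallow decision tree stands in for whatever model you would actually select.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
n = 400

# Hypothetical hosting customers; all values are randomly generated.
customers = pd.DataFrame({
    "n_services": rng.integers(1, 6, n),
    "n_products": rng.integers(1, 4, n),
    "monthly_revenue": rng.gamma(2.0, 50.0, n),
    "tenure_months": rng.integers(1, 60, n),
    "churned": rng.integers(0, 2, n),
})

cluster_cols = ["n_services", "n_products", "monthly_revenue"]
feature_cols = cluster_cols + ["tenure_months"]

# Step 1: cluster on services, products and monthly revenue.
scaler = MinMaxScaler().fit(customers[cluster_cols])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7)
kmeans.fit(scaler.transform(customers[cluster_cols]))
customers["cluster"] = kmeans.labels_

# Step 3: fit a separate churn model for each cluster.
models = {}
for cluster_id, group in customers.groupby("cluster"):
    model = DecisionTreeClassifier(max_depth=3, random_state=7)
    models[int(cluster_id)] = model.fit(group[feature_cols], group["churned"])

# Step 2: route a new observation to its cluster, then to that cluster's model.
new_customer = pd.DataFrame({
    "n_services": [2], "n_products": [1],
    "monthly_revenue": [80.0], "tenure_months": [12],
})
new_cluster = int(kmeans.predict(scaler.transform(new_customer[cluster_cols]))[0])
predicted_churn = int(models[new_cluster].predict(new_customer[feature_cols])[0])
print(f"cluster {new_cluster}, predicted churn = {predicted_churn}")
```

In real work you would split into train, validate and test before any of this, and fit both the clustering and the per-cluster models on train only.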

_____________________________

### Turn your Clusters into Features

Use clusters to create new, more descriptive features. This can also reduce the number of dimensions. 

1. Cluster by latitude, longitude and age to get "area clusters"

2. Cluster by home and lot size to get "size clusters"

3. Get descriptive stats of the value per sqft

4. Use the standard deviation & median of the dollar per square foot for the cluster of the observation as new features (or other stats), or create dummy variables of the clusters. 
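
The steps above can be sketched as follows, assuming synthetic data, a made-up `dollar_per_sqft` column, and 5 area clusters chosen arbitrarily. The new feature names (`area_dpsf_median`, `area_dpsf_std`) are mine.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the Zillow data; every value is randomly generated.
homes = pd.DataFrame({
    "latitude": rng.uniform(33.7, 34.8, n),
    "longitude": rng.uniform(-118.9, -117.6, n),
    "age": rng.integers(1, 100, n),
    "dollar_per_sqft": rng.gamma(4.0, 60.0, n),
})

# Step 1: "area clusters" from latitude, longitude and age.
area_cols = ["latitude", "longitude", "age"]
scaled = MinMaxScaler().fit_transform(homes[area_cols])
homes["area_cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scaled)

# Step 3: descriptive stats of value per square foot within each cluster.
area_stats = (
    homes.groupby("area_cluster")["dollar_per_sqft"]
    .agg(area_dpsf_median="median", area_dpsf_std="std")
    .reset_index()
)

# Step 4, option A: attach each observation's cluster stats as new features.
homes = homes.merge(area_stats, on="area_cluster", how="left")

# Step 4, option B: one dummy column per cluster.
cluster_dummies = pd.get_dummies(homes["area_cluster"], prefix="area")
print(homes.head())
```

To avoid leaking information, I would compute `area_stats` on the train sample only and merge that same table onto validate and test.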

## Acquire

We will acquire data from the Zillow database: property information from 2017 sales, along with prediction outcomes (logerror), for single-unit properties, i.e. those with a land use type id of 261. 

```python
import pandas as pd
import env  # local env.py holding the database user, host and password
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

def get_connection(db, user=env.user, host=env.host, password=env.password):
    """Build the connection string for the given database."""
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'


def get_zillow_data():
    """Return 2017 single-unit properties (land use type 261) with their logerror."""
    query = '''
    select prop.parcelid
        , pred.logerror
        , bathroomcnt
        , bedroomcnt
        , calculatedfinishedsquarefeet
        , fips
        , latitude
        , longitude
        , lotsizesquarefeet
        , regionidcity
        , regionidcounty
        , regionidzip
        , yearbuilt
        , structuretaxvaluedollarcnt
        , taxvaluedollarcnt
        , landtaxvaluedollarcnt
        , taxamount
    from properties_2017 prop
    inner join predictions_2017 pred on prop.parcelid = pred.parcelid
    where propertylandusetypeid = 261;
    '''
    return pd.read_sql(query, get_connection('zillow'))
```

```python
df = get_zillow_data()
```

## Clean & Prep

1. Drop missing values: Fewer than 1,500 of the 54,000 observations have missing values, so there are plenty of observations to work with even after we drop every observation with a missing value. 

2. Get county names and create dummy variables with those. 

3. Compute new features out of existing features in order to reduce noise, capture signals, and reduce collinearity, or dependence between independent variables. 

4. Remove outliers - we can study those separately at a later time. 

5. Reduce features. 

**Drop Observations with Missing Values**
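
A minimal sketch of this step on a tiny made-up frame (named `sample` so it does not clobber `df`). The idea is to count the missing values first, then drop, then confirm how many rows were lost.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the Zillow dataframe.
sample = pd.DataFrame({
    "bedroomcnt": [3, np.nan, 2, 4],
    "lotsizesquarefeet": [6000, 7200, np.nan, 5000],
    "logerror": [0.01, -0.02, 0.03, 0.0],
})

# Look before dropping: how many values are missing in each column?
missing_per_column = sample.isna().sum()

rows_before = len(sample)
sample = sample.dropna()  # drop every row with at least one missing value
rows_dropped = rows_before - len(sample)

print(missing_per_column)
print(f"dropped {rows_dropped} of {rows_before} rows")
```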

**Get Counties**

Replace fips/county with the county name and create dummy vars for county, or split into 3 different dataframes. 

- 6037: Los Angeles County

- 6059: Orange County

- 6111: Ventura County

https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013697
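
One way this could look, using the three fips codes listed above on a made-up frame. The short column names `LA`, `Orange` and `Ventura` are an assumption about how the dummy columns are named.

```python
import pandas as pd

# Made-up rows; fips arrives as a float in this illustration.
sample = pd.DataFrame({"fips": [6037.0, 6059.0, 6111.0, 6037.0]})

# Map the fips codes listed above to short county names (names are my choice).
county_names = {6037: "LA", 6059: "Orange", 6111: "Ventura"}
sample["county"] = sample["fips"].astype(int).map(county_names)

# One dummy column per county, appended to the frame.
county_dummies = pd.get_dummies(sample["county"])
sample = pd.concat([sample, county_dummies], axis=1)
print(sample)
```

If you would rather split into 3 dataframes, filtering on the `county` column (for example `sample[sample.county == "LA"]`) gets you there.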

**Create New Features**

- **age**: 2017 - year built.

- **tax_rate**: taxamount/taxvaluedollarcnt fields (total, land & structure). We can then remove taxamount and taxvaluedollarcnt, and will keep tax_rate, structuretaxvaluedollarcnt, and landtaxvaluedollarcnt. 

- **acres**: lotsizesquarefeet/43560

- **structure_dollar_per_sqft**: structure tax value/finished square feet

- **land_dollar_per_sqft**: land tax value/lot size square feet

- **bed_bath_ratio**: bedroomcnt/bathroomcnt

- **cola**: city of LA. LA has the largest number of records (across single cities) with a very wide range in values, so I am creating a boolean feature for city of LA. That will help the model for LA county. 
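
The features above can be sketched on two made-up rows as follows. One loud assumption: I am using 12447 as the `regionidcity` value for the city of LA, which you should verify against the data before relying on it.

```python
import pandas as pd

# Two made-up properties with the raw columns the new features depend on.
sample = pd.DataFrame({
    "yearbuilt": [1950, 2005],
    "taxamount": [6000.0, 9000.0],
    "taxvaluedollarcnt": [500000.0, 750000.0],
    "structuretaxvaluedollarcnt": [200000.0, 300000.0],
    "landtaxvaluedollarcnt": [300000.0, 450000.0],
    "calculatedfinishedsquarefeet": [2000.0, 1500.0],
    "lotsizesquarefeet": [43560.0, 8712.0],
    "bedroomcnt": [3.0, 4.0],
    "bathroomcnt": [2.0, 2.0],
    "regionidcity": [12447.0, 5534.0],
})

CITY_OF_LA_ID = 12447  # ASSUMPTION: regionidcity code for the city of LA

sample["age"] = 2017 - sample["yearbuilt"]
sample["tax_rate"] = sample["taxamount"] / sample["taxvaluedollarcnt"]
sample["acres"] = sample["lotsizesquarefeet"] / 43560
sample["structure_dollar_per_sqft"] = (
    sample["structuretaxvaluedollarcnt"] / sample["calculatedfinishedsquarefeet"]
)
sample["land_dollar_per_sqft"] = (
    sample["landtaxvaluedollarcnt"] / sample["lotsizesquarefeet"]
)
# Safe here only because homes with 0 baths are removed as outliers.
sample["bed_bath_ratio"] = sample["bedroomcnt"] / sample["bathroomcnt"]
sample["cola"] = sample["regionidcity"] == CITY_OF_LA_ID
print(sample.iloc[:, -7:])
```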

**Remove Outliers**

1. Remove extremes in bedrooms and baths: we will keep homes with between 1 and 7 baths and between 0 and 7 bedrooms.

2. There is an error in zip, so we will remove those whose zips are invalid numbers (> 99999).

3. Remove square feet > 7000 for now.

4. Remove lot size (acres) > 10 for now.

5. What is this tax rate of almost 50%?? Remove tax rate > 5% for now. 
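
Those five rules can be written as a single boolean mask. This is a sketch on six made-up rows, where every row except the first breaks exactly one rule; it assumes the engineered columns are named `acres` and `tax_rate`.

```python
import pandas as pd

# Six made-up rows: row 0 is fine, rows 1-5 each violate exactly one rule.
sample = pd.DataFrame({
    "bathroomcnt": [2, 0, 3, 2, 2, 2],
    "bedroomcnt": [3, 2, 9, 3, 3, 3],
    "regionidzip": [96000, 96001, 96002, 399675, 96003, 96004],
    "calculatedfinishedsquarefeet": [1500, 1200, 3000, 1800, 9000, 1600],
    "acres": [0.2, 0.1, 0.3, 0.2, 0.5, 0.15],
    "tax_rate": [0.012, 0.011, 0.013, 0.012, 0.012, 0.48],
})

keep = (
    sample["bathroomcnt"].between(1, 7)      # 1 to 7 baths, inclusive
    & sample["bedroomcnt"].between(0, 7)     # 0 to 7 bedrooms, inclusive
    & (sample["regionidzip"] <= 99999)       # valid 5-digit zip
    & (sample["calculatedfinishedsquarefeet"] <= 7000)
    & (sample["acres"] <= 10)
    & (sample["tax_rate"] <= 0.05)           # 5% expressed as a proportion
)

outliers = sample[~keep]   # set aside to study separately later
filtered = sample[keep]
print(f"kept {len(filtered)} rows, set aside {len(outliers)}")
```

Keeping `outliers` around instead of throwing it away makes it easy to come back and study those observations later.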

**Drop Columns**


For now, I will focus on the most difficult and diverse county, LA county. I'll add the others in after I see what I can find. 

I'm not sure where I will use bins and where I will use actual values, so for now I think I'll go with bins and see what happens. 

I will remove the following variables: 

- parcelid: can tie back to parcels later

- bedroomcnt: info captured in bed_bath_ratio + bathroomcnt

- taxrate, taxamount, taxvaluedollarcnt, structuretaxvaluedollarcnt, landtaxvaluedollarcnt: info captured in tax_bin + structure_dollar_per_sqft + land_dollar_per_sqft + acres

- yearbuilt: info captured in age

- lotsizesquarefeet: info captured in acres

- regionidcity: using boolean of whether in city of LA or not

- regionidzip: not using at this time

- LA, Orange, Ventura: will look at LA county only right now. 

**Split into train, validate & test**
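
A sketch of a three-way split using `train_test_split` twice. The 60/20/20 proportions and the random seed are assumptions, and the data is a toy frame with logerror as the target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: one feature plus the target.
sample = pd.DataFrame({
    "age": range(100),
    "logerror": np.linspace(-0.5, 0.5, 100),
})

# First carve off test (20%), then split the remainder into train and validate.
train_validate, test = train_test_split(sample, test_size=0.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=0.25, random_state=123)

# Separate the features (X) from the target (y) for each sample.
target = "logerror"
X_train, y_train = train.drop(columns=target), train[target]
X_validate, y_validate = validate.drop(columns=target), validate[target]
X_test, y_test = test.drop(columns=target), test[target]

print(len(train), len(validate), len(test))
```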

**Scale**

I will scale all of our features using min-max scaler. 
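
A minimal sketch of min-max scaling on toy data. The important part is that the scaler is fit on train only and then applied to the other samples, which means validate and test values can land outside the 0 to 1 range.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy feature frames standing in for the real train and validate samples.
X_train = pd.DataFrame({"age": [10, 50, 90], "acres": [0.1, 0.5, 1.1]})
X_validate = pd.DataFrame({"age": [30, 100], "acres": [0.6, 0.05]})

# Fit on train only, so nothing about validate/test leaks into the scaling.
scaler = MinMaxScaler().fit(X_train)

def scale(X):
    """Apply the train-fitted scaler, keeping column names and index."""
    return pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)

X_train_scaled = scale(X_train)
X_validate_scaled = scale(X_validate)

print(X_train_scaled)
print(X_validate_scaled)  # age of 100 scales above 1 because train max is 90
```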

**Brainstorming**

As I think about where it's difficult to predict housing prices, I think about areas where the price and condition of homes vary drastically. This is generally in areas with older homes. So, how can we increase the information we have about those areas so that estimating the condition, and therefore the price, can be more accurate? There are so many ways to go about this, and who knows what will work best until we start trying them out. One idea I had was to figure out a way to identify neighborhoods that are similar. Neighborhoods, in terms of the data available through the field `regionidneighborhood`, have many problems. The primary challenge is that so much of the data is missing. Secondly, there are many areas without defined neighborhoods. And the final point I'll mention (though there are more) is that there are so many neighborhoods that I wouldn't have the computing power to model each neighborhood separately! There is a way to do this, I'm sure, but that is going to wait until an iteration much further in the future, when it's possible to find the missing neighborhoods. 

So, all of that said, I want to find a way to cluster properties at a level higher than neighborhood and, in some cases, zip code, so that a single cluster can span several of those geographic areas. 

What if we could predict the error using the variance or standard deviation of the property assessed values of similar neighborhoods? My hypothesis is that a larger standard deviation leads to larger errors. 
If we cluster by latitude, longitude and age maybe we can get city segments that were developed closer in time to one another, at a level higher than zip code and neighborhood. And maybe it will help separate terrain a bit, like coast vs mountains. We can then get basic statistics of dollar/sqft and lot/sqft of those areas and use those statistics as features. So we will use the clusters to extract statistics that describe them, and use those stats as features. 

I could also try clustering on size, e.g. acres, square feet, and location.

Cluster with a focus on areas: using latitude, longitude and age. 

So let's create our clusters. 
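
A sketch of fitting the area clusters on the scaled train sample. The frame here is random data standing in for `X_train_scaled`, and k=6 is an arbitrary choice for illustration, not a tuned value.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(123)
area_cols = ["latitude", "longitude", "age"]

# Random stand-in for the min-max-scaled train features (values in [0, 1]).
X_train_scaled = pd.DataFrame(rng.uniform(0, 1, size=(300, 3)), columns=area_cols)

# Fit on train only; k=6 is an assumption for this sketch.
kmeans_area = KMeans(n_clusters=6, n_init=10, random_state=123)
kmeans_area.fit(X_train_scaled[area_cols])

# Label each train observation with its area cluster.
X_train_scaled["area_cluster"] = kmeans_area.predict(X_train_scaled[area_cols])

# Quick sanity check on how the observations are spread across clusters.
cluster_sizes = X_train_scaled["area_cluster"].value_counts().sort_index()
print(cluster_sizes)
```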

Label clusters on validate and test, as we did above for train. 

Get the centroids

Append cluster id onto X_train & X_train_scaled, then join with the centroids dataframe. 

Repeat on validate and test
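
The labeling, centroid and join steps above could be wrapped in one helper so that train, validate and test are all treated the same way. This is a sketch on random stand-in data; the `centroid_` column prefix and the helper name are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(123)
area_cols = ["latitude", "longitude", "age"]

def fake_scaled(n):
    """Random stand-in for a min-max-scaled feature frame."""
    return pd.DataFrame(rng.uniform(0, 1, size=(n, 3)), columns=area_cols)

X_train_scaled, X_validate_scaled, X_test_scaled = (
    fake_scaled(300), fake_scaled(100), fake_scaled(100)
)

# Fit on train only (k=6 is an arbitrary choice for this sketch).
kmeans_area = KMeans(n_clusters=6, n_init=10, random_state=123)
kmeans_area.fit(X_train_scaled[area_cols])

# Centroids as a dataframe indexed by cluster id: row i is the center of cluster i.
centroids = pd.DataFrame(
    kmeans_area.cluster_centers_,
    columns=[f"centroid_{col}" for col in area_cols],
)
centroids.index.name = "area_cluster"

def add_area_cluster(X_scaled):
    """Append the cluster id, then join that cluster's centroid columns."""
    out = X_scaled.copy()
    out["area_cluster"] = kmeans_area.predict(out[area_cols])
    return out.join(centroids, on="area_cluster")  # keeps the original index

X_train_scaled = add_area_cluster(X_train_scaled)
X_validate_scaled = add_area_cluster(X_validate_scaled)
X_test_scaled = add_area_cluster(X_test_scaled)
print(X_validate_scaled.head())
```

Because the cluster ids line up by index, the same `area_cluster` column can be copied onto the unscaled `X_train` as well.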

Now, I'd like to cluster by size. Then I will group by the two different cluster types (area/age & size) and compute summary statistics. 
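
A sketch of that plan on synthetic data: the area cluster ids are randomly assigned here (in the notebook they would come from the area clustering), the size clusters use acres and finished square feet, and 3 size clusters is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(5)
n = 400

# Synthetic train sample; area_cluster is random here purely for illustration.
homes = pd.DataFrame({
    "area_cluster": rng.integers(0, 3, n),
    "acres": rng.gamma(2.0, 0.1, n),
    "calculatedfinishedsquarefeet": rng.normal(1800, 500, n).clip(400),
    "structure_dollar_per_sqft": rng.gamma(4.0, 30.0, n),
})

# Cluster by size (k=3 is an assumption for this sketch).
size_cols = ["acres", "calculatedfinishedsquarefeet"]
scaled = MinMaxScaler().fit_transform(homes[size_cols])
homes["size_cluster"] = KMeans(n_clusters=3, n_init=10, random_state=5).fit_predict(scaled)

# Summary statistics for every (area, size) combination that occurs.
summary = (
    homes.groupby(["area_cluster", "size_cluster"])["structure_dollar_per_sqft"]
    .agg(["count", "median", "std"])
    .reset_index()
)
print(summary)
```

Combinations with a very small `count` will have unstable statistics, so it is worth checking that column before using the medians or standard deviations as features.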

## What do we do now? 

1. Explore our clusters

2. Turn clusters into features

3. Model each cluster separately

4. Turn clusters into labels