In [9]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split

import acquire
import prepare

from wrangle_zillow import wrangle_zillow_data

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Acquire

In [10]:
df = pd.read_csv('zillow_data.csv')

In [12]:
df = df.drop(columns=['Unnamed: 0'])

### Goal: Improve our original estimate of the log error by using clustering methodologies.

## Acquisition, Prep, and Initial Exploration
Using the notebook and files you created during the exercises, make any changes or additions you want at this point. NOTE: You will NOT be splitting into train and test at this point.

Ideas:

   1. Data types:

        - Write a function that takes in a dataframe and a list of column names and returns the dataframe with the datatypes of those columns changed to a non-numeric type.
        - Use this function to appropriately transform any numeric columns that should not be treated as numbers.

   2. Missing Values: Impute the values in land square feet.

        - For land square feet, the goal is to impute the missing values by creating a linear model where landtaxvaluedollarcnt is the x-variable and the output/y-variable is the estimated land square feet.
        - We'll then use this model to make predictions and fill in the missing values.
        - Write a function that accepts the zillow data frame and returns the data frame with the missing values filled in.

   3. Missing Values: Of the remaining missing values, can they be imputed or otherwise estimated?

        - Impute those that can be imputed with the method you feel best fits the attribute.
        - Decide whether to remove the rows or columns of any that cannot be reasonably imputed.
        - Document your reasons for the decisions on how to handle each of those.

   4. Outliers: Original from exercises. Adapt as you see fit.

        - Write a function that accepts a series (i.e. one column from a data frame) and summarizes how many outliers are in the series. This function should accept a second parameter that determines how outliers are detected, with the ability to detect outliers in 3 ways: IQR, standard deviations (z-score), and percentiles.

   5. Use your function defined above to identify columns where you should handle the outliers.

   6. Write a function that accepts the zillow data frame and removes the outliers. You should make a decision and document how you will remove outliers.

   7. Is there erroneous data you have found that you need to remove or repair? If so, take action.

   8. Are there outliers you want to "squeeze in" to a max value? (e.g. all bathrooms > 6 => bathrooms = 6). If so, make those changes.
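
The prep ideas above could be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the finished functions: the helper names (`to_non_numeric`, `impute_land_sqft`, `count_outliers`, `squeeze_in`) are hypothetical, the column names `lotsizesquarefeet`, `fips`, and `bathroomcnt` are assumptions about the Zillow schema, and the thresholds (1.5 x IQR, 3 standard deviations, 1st/99th percentiles) are common defaults rather than required values.

```python
import numpy as np
import pandas as pd


def to_non_numeric(df, columns):
    """Return a copy of df with the listed columns cast to the object dtype."""
    return df.astype({col: "object" for col in columns})


def impute_land_sqft(df, x_col="landtaxvaluedollarcnt", y_col="lotsizesquarefeet"):
    """Fill missing y_col values using a straight-line fit of y_col on x_col.

    y_col="lotsizesquarefeet" is an assumed column name for land square feet.
    """
    df = df.copy()
    known = df[[x_col, y_col]].dropna()
    # Fit y = slope * x + intercept on rows where both values are present.
    slope, intercept = np.polyfit(known[x_col], known[y_col], deg=1)
    to_fill = df[y_col].isna() & df[x_col].notna()
    df.loc[to_fill, y_col] = slope * df.loc[to_fill, x_col] + intercept
    return df


def count_outliers(series, method="iqr", k=1.5, z=3.0, lower=0.01, upper=0.99):
    """Count outliers in a series using the IQR, z-score, or percentile rule."""
    s = series.dropna()
    if method == "iqr":
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        mask = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    elif method == "zscore":
        mask = ((s - s.mean()) / s.std()).abs() > z
    elif method == "percentile":
        mask = (s < s.quantile(lower)) | (s > s.quantile(upper))
    else:
        raise ValueError("method must be 'iqr', 'zscore', or 'percentile'")
    return int(mask.sum())


def squeeze_in(series, max_value):
    """Cap values above max_value at max_value (e.g. bathrooms > 6 become 6)."""
    return series.clip(upper=max_value)


# Tiny made-up frame, only to exercise the helpers.
demo = pd.DataFrame({
    "landtaxvaluedollarcnt": [100.0, 200.0, 300.0, 400.0],
    "lotsizesquarefeet": [1000.0, 2000.0, np.nan, 4000.0],
    "fips": [6037, 6059, 6111, 6037],
    "bathroomcnt": [1.0, 2.0, 8.0, 3.0],
})
sample = pd.Series([1, 2, 3, 4, 5, 100])

print(to_non_numeric(demo, ["fips"]).dtypes)
print(impute_land_sqft(demo))
print({m: count_outliers(sample, method=m) for m in ("iqr", "zscore", "percentile")})
print(squeeze_in(demo["bathroomcnt"], 6))
```

Each helper returns a new object instead of modifying its input, so the steps can be chained or reordered while you decide how to handle each column.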

# Exploration with Clustering
## Cluster the Target Variable
Why? Grouping the continuous target into clusters reduces noise, which can make trends easier to see when you compare those clusters against other variables through visualizations or tests.

Perform clustering with logerror as the only feature used in the clustering algorithm. Decide on a number of clusters to use, and store the cluster predictions back onto your data frame as cluster_target. Look at the centroids that were produced in this process. What do they tell you?

Use the produced clusters to help you explore, through visualization, how logerror relates to other variables. A common way to do this is to use color to indicate the cluster id, with the other variables on your x-axis and y-axis (hint: look at your swarmplot function).
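
As a rough sketch of clustering on the target alone, the example below fits `KMeans` to a single `logerror` column and stores the labels as `cluster_target`. The data is synthetic (three made-up groups of log errors), and the choice of 3 clusters is an assumption for illustration; on the real data you would pick the number of clusters yourself, for example by comparing inertia across several values of k.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic log errors: under-estimates, near-zero errors, and over-estimates.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "logerror": np.concatenate([
        rng.normal(-0.5, 0.05, 50),
        rng.normal(0.0, 0.05, 200),
        rng.normal(0.5, 0.05, 50),
    ])
})

# KMeans expects a 2-D input, so pass a one-column frame rather than a Series.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
demo["cluster_target"] = kmeans.fit_predict(demo[["logerror"]])

# One centroid per cluster: the mean logerror of the rows assigned to it.
centroids = np.sort(kmeans.cluster_centers_.ravel())
print(centroids)
print(demo.groupby("cluster_target")["logerror"].agg(["count", "mean"]))
```

Sorted centroids that sit well apart suggest distinct groups of under-estimated, roughly correct, and over-estimated homes; the `cluster_target` column can then be passed as the `hue` argument of a seaborn plot.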

## Cluster Independent Variables
   You should also perform some clustering based on a number of independent variables. Create and evaluate several clustering models based on subsets of the independent variables. Here are some ideas:

   - Location, that is, latitude and longitude
   - Size (finished square feet)
   - Location and size
   - Be sure to use these new clusters in exploring your data, and interpret what these clusters tell you.
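
One way to try several feature subsets is a small helper that scales the chosen columns and appends the cluster labels, sketched below. The helper name `add_clusters`, the column name `calculatedfinishedsquarefeet`, the synthetic data, and the cluster counts are all assumptions for illustration. Scaling is included because k-means uses Euclidean distance, so an unscaled square-footage column would dominate latitude and longitude.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler


def add_clusters(df, features, n_clusters, name, random_state=0):
    """Return a copy of df with a new column of k-means labels for `features`."""
    df = df.copy()
    # Put every feature on a 0-1 range so no single column dominates distances.
    scaled = MinMaxScaler().fit_transform(df[features])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    df[name] = model.fit_predict(scaled)
    return df, model


# Synthetic stand-in for the Zillow frame.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "latitude": rng.uniform(33.5, 34.5, 200),
    "longitude": rng.uniform(-119.0, -117.5, 200),
    "calculatedfinishedsquarefeet": rng.uniform(600, 4000, 200),
})

clustered, _ = add_clusters(demo, ["latitude", "longitude"], 4, "cluster_location")
clustered, _ = add_clusters(clustered, ["calculatedfinishedsquarefeet"], 3, "cluster_size")
clustered, _ = add_clusters(
    clustered,
    ["latitude", "longitude", "calculatedfinishedsquarefeet"],
    5,
    "cluster_location_size",
)
print(clustered.filter(like="cluster").nunique())
```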

## Test the Significance of Clusters
Use statistical tests to determine whether the clusters you have created are significantly related to logerror.
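
A possible starting point is to compare the distribution of logerror across clusters with a one-way ANOVA and, as a check that does not assume normality, a Kruskal-Wallis test. The sketch below uses `scipy.stats.f_oneway` and `scipy.stats.kruskal` on synthetic data; the helper name `cluster_significance`, the `cluster` column, and the 0.05 threshold are assumptions, and other tests may suit your clusters better.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cluster_significance(df, cluster_col, target="logerror"):
    """Test whether the mean/median of `target` differs between clusters."""
    groups = [grp[target].dropna().to_numpy() for _, grp in df.groupby(cluster_col)]
    _, p_anova = stats.f_oneway(*groups)   # compares group means
    _, p_kruskal = stats.kruskal(*groups)  # rank-based, no normality assumption
    return {"anova_p": float(p_anova), "kruskal_p": float(p_kruskal)}


# Synthetic data: three clusters whose mean logerror really does differ.
rng = np.random.default_rng(0)
cluster = np.repeat([0, 1, 2], 100)
means = np.array([0.0, 0.05, 0.1])[cluster]
demo = pd.DataFrame({"cluster": cluster, "logerror": rng.normal(means, 0.05)})

result = cluster_significance(demo, "cluster")
alpha = 0.05  # assumed significance level
print(result, "significant" if result["anova_p"] < alpha else "not significant")
```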

# Modeling
## Feature Engineering
   1. Remove variables that are not needed, wanted, useful, or are redundant.
   2. Add any features you think may be useful.
   3. Split your data into training and test sets.
   4. Create subsets of data if you would like to create multiple models and then merge (such as, a different model for each cluster or for each county).
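
The steps above might look something like the sketch below, which drops a redundant column, splits into training and test sets, and builds one training subset per cluster. The frame is synthetic, and the column names (`sqft`, `sqft_copy`, `cluster`) and the 80/20 split are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one useful feature, one redundant copy, one cluster id.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sqft": rng.uniform(600, 4000, 100),
    "cluster": rng.integers(0, 3, 100),
    "logerror": rng.normal(0, 0.1, 100),
})
demo["sqft_copy"] = demo["sqft"]  # redundant on purpose

# 1. Remove variables that are not needed.
features = demo.drop(columns=["sqft_copy", "logerror"])
target = demo["logerror"]

# 3. Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=123
)

# 4. Optional: one training subset per cluster, for per-cluster models.
train_by_cluster = {
    cluster_id: (group, y_train.loc[group.index])
    for cluster_id, group in X_train.groupby("cluster")
}
print({cluster_id: len(X_sub) for cluster_id, (X_sub, _) in train_by_cluster.items()})
```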

## Model Selection
   1. Train at least 3 different models (a model is different if there are changes in one or more of the following: features, hyper-parameters, algorithm). Create object, fit, predict & evaluate. Use mean absolute error or mean squared error to evaluate. Also, try regression algorithms you have not used before.
   2. Evaluate your best model on your test data set to get an idea of your model's out-of-sample error.
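
A minimal sketch of this workflow follows: three candidate regressors are compared by cross-validated mean absolute error on the training set, and only the best one is scored on the test set. The data is synthetic, and the particular models, hyper-parameters, and 5-fold cross-validation are assumptions chosen for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic features and target with a mostly linear relationship.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["feature_a", "feature_b", "feature_c"])
y = 0.5 * X["feature_a"] - 0.2 * X["feature_b"] + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=4, random_state=123),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=123),
}

# Compare models on the training data only, using cross-validated MAE.
cv_mae = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = -scores.mean()  # scorer returns negated MAE

best_name = min(cv_mae, key=cv_mae.get)

# Refit the best model on all training data and score it once on the test set.
best_model = models[best_name].fit(X_train, y_train)
test_mae = mean_absolute_error(y_test, best_model.predict(X_test))
print(cv_mae)
print(best_name, round(test_mae, 4))
```

Keeping the test set out of the comparison step means the final MAE is a fairer estimate of out-of-sample error.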