## Zillow Regression with Clustering Project

### Project Goals

- The goal of this project is to find features or clusters of features to improve Zillow's log error for single family residences in three Southern California counties and to use these features to develop an improved machine learning model.

- Our initial hypothesis is that the size of the home in square feet, the age of the home, and the location are the main features affecting log error.

- Initial questions:
    - What is the relationship between square feet and log error? Do area clusters have a large impact on the overall log error?
    - Does the size of the home affect log error? Can that error be better determined by clustering by size?
    - Does the location have an effect on log error? Where does the most log error occur?

### Project Planning

- Acquire the dataset from the Codeup database using SQL
- Prepare the data with the intent to improve the log error from Zestimates; clean the data and encode categorical features if necessary; ensure that the data is tidy
- Split the data into train, validate, and test datasets using a 60/20/20 split
- Explore the data:
    - Perform univariate, bivariate, and multivariate analyses; run statistical tests for significance; identify the three primary features or clusters affecting log error
    - Create graphical representations of the analyses
    - Answer questions about the data
    - Document findings
- Train and test at least three models:
    - Establish a baseline
    - Select key features and train multiple linear regression models
    - Test the model on the validate set, adjust for overfitting if necessary
- Select the best model for the project goals:
    - Determine which model performs best on the validate set
- Test and evaluate the model:
    - Use the model on the test set and evaluate its performance (RMSE, R2, etc.)
    - Visualize the model's predictions against the actual values on the test set
- Document key findings and takeaways, answer the questions
    
### Executive Summary

- After running four models on our train and validate sets, we decided to use the polynomial linear regression model because it had the lowest RMSE and the largest improvement over baseline.

- We selected the features for modeling based on statistical analysis (square feet of the home, ratio of bathrooms to bedrooms, lot size, age, number of bathrooms, area cluster, and size cluster). We used a polynomial degree of 2. The RMSE of the selected model was 0.162 on train, 0.143 on validate, and 0.174 on test.

- Takeaways: the selected features reduced the error only slightly relative to baseline. The clusters did not significantly reduce the RMSE, but there was a very small improvement when modeling the absolute value of the log error. Overall, none of the models significantly outperformed Zillow's current model.

### Acquire and Prepare Data

In [38]:
# standard imports for full data pipeline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
import os
from math import sqrt
# imports for project-specific functions
import env
import wrangle
import model
import explore
# sklearn imports for modeling, splitting, scaling
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression 
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
import warnings
warnings.filterwarnings('ignore')

In [None]:
# full zillow database
wrangle.full_zillow_db()

In [None]:
# acquire database using function from wrangle.py and save to a variable
df = wrangle.wrangle_zillow()

In [None]:
# verify successful wrangling of data
df.info()

In [None]:
# verify proper encoding of fips into counties
wrangle.verify_counties(df)

### Acquisition and Preparation Takeaways

- The dataset was acquired from the Codeup database using a SQL query.

- Data was limited to homes with a transaction in 2017, more than 0 and fewer than 8 bedrooms, more than 0 and fewer than 8 bathrooms, a home size under 10,000 square feet, a lot smaller than 20 acres, and a tax rate less than 30. All observations with null values were removed.

- FIPS was encoded and new features (age, age_bin, taxrate, acres, acres_bin, sqft_bin, structure_dollar_per_sqft, structure_dollar_per_sqft_bin, land_dollar_per_sqft, lot_dollar_sqft_bin) were created.

- The cleaned dataset has 50,699 observations and 29 columns. All columns are integers or floats.

- The dataset has been split into train, validate, and test sets using a 60/20/20 split.
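
A 60/20/20 split can be produced with two calls to `train_test_split`. The sketch below is an assumption about what a helper like `wrangle.split` might do, not its actual source; the function name `split_60_20_20`, the seed, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_60_20_20(df, target_var, seed=123):
    """Hypothetical helper: split df into 60% train, 20% validate, 20% test."""
    # carve off 20% of the rows for test
    train_validate, test = train_test_split(df, test_size=0.20, random_state=seed)
    # 25% of the remaining 80% equals 20% of the original rows
    train, validate = train_test_split(train_validate, test_size=0.25, random_state=seed)
    # separate the features (X) from the target (y) in each subset
    X_train, y_train = train.drop(columns=[target_var]), train[[target_var]]
    X_validate, y_validate = validate.drop(columns=[target_var]), validate[[target_var]]
    X_test, y_test = test.drop(columns=[target_var]), test[[target_var]]
    return train, X_train, X_validate, X_test, y_train, y_validate, y_test

# toy data standing in for the cleaned Zillow dataframe
rng = np.random.default_rng(0)
toy = pd.DataFrame({"age": rng.integers(0, 100, 100),
                    "logerror": rng.normal(0, 0.15, 100)})
train, X_train, X_validate, X_test, y_train, y_validate, y_test = split_60_20_20(toy, "logerror")
print(len(X_train), len(X_validate), len(X_test))
```

Fixing `random_state` keeps the three subsets reproducible between runs.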

## Exploration

In [None]:
# split the data into train, validate, and test sets
train, X_train, X_validate, X_test, y_train, y_validate, y_test = wrangle.split(df, target_var='logerror')

In [None]:
# bin logerror
train['logerror_bins'] = pd.cut(train.logerror, [-5, -.2, -.05, .05, .2, 4])

In [None]:
# scale features and concatenate to train, validate, and test datasets; MinMax scaler was fit to train only
X_train = explore.fit_scale_and_concat(X_train, X_train)
X_validate = explore.fit_scale_and_concat(X_validate, X_train)
X_test = explore.fit_scale_and_concat(X_test, X_train)
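
The idea of fitting the scaler on train only and then applying it to every subset can be sketched as follows. This is a minimal illustration, not the source of `explore.fit_scale_and_concat`; the helper name `scale_and_concat`, its `columns` argument, and the toy frames are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_and_concat(df, fit_df, columns):
    """Hypothetical helper: fit MinMaxScaler on fit_df, append scaled_* columns to df."""
    scaler = MinMaxScaler().fit(fit_df[columns])
    scaled = pd.DataFrame(
        scaler.transform(df[columns]),
        columns=[f"scaled_{c}" for c in columns],
        index=df.index,  # keep the original row labels so concat aligns
    )
    return pd.concat([df, scaled], axis=1)

# toy frames standing in for X_train and X_validate
train_toy = pd.DataFrame({"age": [10.0, 30.0, 50.0], "bathroomcnt": [1.0, 2.0, 3.0]})
validate_toy = pd.DataFrame({"age": [20.0, 70.0], "bathroomcnt": [2.0, 4.0]})
cols = ["age", "bathroomcnt"]

# scale validate against the unscaled train frame, then scale train itself
validate_scaled = scale_and_concat(validate_toy, train_toy, cols)
train_scaled = scale_and_concat(train_toy, train_toy, cols)
print(validate_scaled)
```

Because the scaler only sees train, validate or test values outside the train range map outside [0, 1]; that is expected and avoids leaking information from the held-out sets.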

In [None]:
# verify successful concatenation
X_train.describe()

In [None]:
# list of variables I will cluster on. 
cluster_vars = ['scaled_latitude', 'scaled_longitude', 'age_bin']
# cluster column name
cluster_name = 'area_cluster'
# range for find_k
k_range = range(2,10)

In [None]:
# graph find_k using the SSE
explore.find_k(X_train, cluster_vars, k_range)
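
The elbow method compares the sum of squared errors (SSE, exposed by scikit-learn as `inertia_`) across candidate values of k. The sketch below shows one way the SSE values behind such a graph could be computed; `sse_by_k`, the seed, and the toy data are assumptions, not the contents of `explore.find_k`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def sse_by_k(df, cluster_vars, k_range, seed=123):
    """Hypothetical helper: return {k: SSE}, where SSE is the fitted KMeans inertia_."""
    sse = {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(df[cluster_vars])
        sse[k] = km.inertia_
    return sse

# toy data standing in for the scaled location features
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((200, 2)), columns=["scaled_latitude", "scaled_longitude"])
sse = sse_by_k(toy, ["scaled_latitude", "scaled_longitude"], range(2, 10))
for k, value in sse.items():
    print(k, round(value, 3))
```

Plotting `sse` against k (for example with `plt.plot(list(sse), list(sse.values()), marker='x')`) gives the curve whose bend suggests a reasonable k.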

In [None]:
# select best number for centroids according to elbow method
k = 5
# create the clusters using function from explore.py
kmeans = explore.create_clusters(X_train, k, cluster_vars)
# create dataframe for the clusters
centroid_df = explore.get_centroids(kmeans, cluster_vars, cluster_name)

In [None]:
# confirm that the dataframe was created successfully
centroid_df

In [None]:
# assign the clusters to train, validate, and test for modeling
X_train = explore.assign_clusters(X_train, kmeans, cluster_vars, cluster_name, centroid_df)
X_validate = explore.assign_clusters(X_validate, kmeans, cluster_vars, cluster_name, centroid_df)
X_test = explore.assign_clusters(X_test, kmeans, cluster_vars, cluster_name, centroid_df)
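
Fitting the clusters on train and then reusing the same fitted `KMeans` object for validate and test might look like the sketch below. It is an illustration under assumptions: the toy data, the `assign` helper, and the join-based centroid lookup are hypothetical and may differ from the functions in `explore.py`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

cluster_vars = ["scaled_latitude", "scaled_longitude"]
cluster_name = "area_cluster"

# toy data standing in for X_train and X_validate
rng = np.random.default_rng(0)
train_toy = pd.DataFrame(rng.random((100, 2)), columns=cluster_vars)
validate_toy = pd.DataFrame(rng.random((30, 2)), columns=cluster_vars)

# fit on train only
kmeans = KMeans(n_clusters=3, n_init=10, random_state=123).fit(train_toy[cluster_vars])

# one row per cluster, indexed by cluster label, holding the centroid coordinates
centroids = pd.DataFrame(
    kmeans.cluster_centers_,
    columns=[f"centroid_{c}" for c in cluster_vars],
)
centroids.index.name = cluster_name

def assign(df):
    """Hypothetical helper: add the cluster label and its centroid values to df."""
    out = df.copy()
    out[cluster_name] = kmeans.predict(out[cluster_vars])
    # look up each row's centroid by its cluster label; the row index is preserved
    return out.join(centroids, on=cluster_name)

train_clustered = assign(train_toy)
validate_clustered = assign(validate_toy)
print(validate_clustered.head())
```

Using `predict` on validate and test, rather than refitting, keeps the cluster definitions identical across all three subsets.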

In [None]:
# show the centroids and number of homes for each area cluster
pd.DataFrame(X_train.groupby(['area_cluster', 'centroid_scaled_latitude', 'centroid_scaled_longitude', 
                           'centroid_age_bin'])['area_cluster'].count())

In [None]:
# select features for size cluster
cluster_vars = ['scaled_bathroomcnt', 'sqft_bin', 'acres_bin', 'bath_bed_ratio']
cluster_name = 'size_cluster'
k_range = range(2,10)
# graph find_k using SSE to select the best k
explore.find_k(X_train, cluster_vars, k_range)

In [None]:
# select number of centroids using elbow method above
k=5
# fit kmeans 
kmeans = explore.create_clusters(X_train, k, cluster_vars)
# get centroid values per variable per cluster
centroid_df = explore.get_centroids(kmeans, cluster_vars, cluster_name)
# get cluster assignments and append those with centroids for each X dataset (train, validate, test)
X_train = explore.assign_clusters(X_train, kmeans, cluster_vars, cluster_name, centroid_df)
X_validate = explore.assign_clusters(X_validate, kmeans, cluster_vars, cluster_name, centroid_df)
X_test = explore.assign_clusters(X_test, kmeans, cluster_vars, cluster_name, centroid_df)

In [None]:
# show the centroids and number of homes for each size cluster
pd.DataFrame(X_train.groupby(['size_cluster', 'centroid_scaled_bathroomcnt', 'centroid_sqft_bin',
                              'centroid_acres_bin', 'centroid_bath_bed_ratio'])['size_cluster'].count())

### Are the sizes of homes different based on area cluster? Are newer homes larger than older homes? Where are the newer homes located?

In [None]:
# plot visualizations of age of home and square footage for each area cluster
explore.plot_age_sqft(X_train)

#### Statistical Test

- H0: There is no relationship between the age of the home and the square feet of the home.
- Ha: There is a relationship between the age of the home and the square feet of the home.

In [None]:
# conduct spearman test for age and square footage
explore.spearman_test(X_train.age, X_train.scaled_calculatedfinishedsquarefeet)

- Spearman's correlation test shows that there is a negative relationship between the age of the home and square footage.
- Los Angeles County has the largest number of homes over 60 years old.
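
A Spearman rank correlation test like the one above can be run with `scipy.stats.spearmanr`. The following is a minimal sketch on synthetic data; `spearman_report`, the 0.05 significance level, and the generated values are assumptions and do not reproduce `explore.spearman_test` or the project's results.

```python
import numpy as np
from scipy import stats

def spearman_report(x, y, alpha=0.05):
    """Hypothetical helper: Spearman's r, p-value, and whether H0 is rejected."""
    r, p = stats.spearmanr(x, y)
    return {"r": r, "p": p, "reject_null": p < alpha}

# synthetic example: older homes tend to be smaller
rng = np.random.default_rng(0)
age = rng.uniform(0, 100, 500)
sqft = 3000 - 15 * age + rng.normal(0, 300, 500)

result = spearman_report(age, sqft)
print(result)
```

Spearman's test works on ranks, so it does not assume normally distributed variables or a strictly linear relationship.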

### Does the age of the home affect log error? Do area clusters show any distinctions in log error and property age?

In [None]:
# graph the relationship between age and log error for each area cluster
explore.plot_age_error(X_train, y_train)

#### Statistical Testing

- H0: There is no relationship between the age of the home and the log error.
- Ha: There is a relationship between the age of the home and the log error.

In [None]:
# conduct statistical test for relationship between age and logerror overall
explore.spearman_test(X_train.age, y_train.logerror)

- Spearman's correlation test shows that there is a relationship between age and log error, but the correlation is very small (-0.05).

### What is the relationship between square feet and log error? Do size clusters have a large impact on the overall log error?

In [None]:
# graph the relationship between square footage and log error for each size cluster
explore.plot_size_error(X_train, y_train)

#### Statistical Testing

- H0: There is no relationship between the number of square feet and log error.
- Ha: There is a relationship between the number of square feet and log error.

In [None]:
# test the relationship between square footage and log error
explore.spearman_test(X_train.calculatedfinishedsquarefeet, y_train.logerror)

- Spearman's correlation test confirms a relationship between square footage and log error, but once again the correlation coefficient is very small.

### Does the location have an effect on log error? Where does the most log error occur?

In [None]:
# graph the relationship between square footage and log error for each county
explore.plot_size_county_error(X_train, y_train)

In [None]:
# conduct ANOVA statistical test for a significant difference in square footage between counties
explore.anova_sqft_fips(X_train)

- Analysis of variance shows a significant difference in home square footage between counties.
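
A one-way ANOVA across county groups can be sketched with `scipy.stats.f_oneway`. The helper `anova_by_group`, the column names, and the synthetic values below are assumptions for illustration only; they are not the implementation of `explore.anova_sqft_fips`.

```python
import numpy as np
import pandas as pd
from scipy import stats

def anova_by_group(df, group_col, value_col):
    """Hypothetical helper: one-way ANOVA of value_col across the levels of group_col."""
    groups = [g[value_col].to_numpy() for _, g in df.groupby(group_col)]
    return stats.f_oneway(*groups)

# synthetic data: three counties with different mean square footage
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "county": np.repeat(["Los Angeles", "Orange", "Ventura"], 100),
    "calculatedfinishedsquarefeet": np.concatenate([
        rng.normal(1600, 300, 100),
        rng.normal(2000, 300, 100),
        rng.normal(1900, 300, 100),
    ]),
})

f_stat, p_value = anova_by_group(toy, "county", "calculatedfinishedsquarefeet")
print(f_stat, p_value)
```

The same helper can be pointed at any numeric column, so the comparison between counties can be repeated for another variable by changing `value_col`.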

### Exploration Takeaways

- Analysis of variance shows a significant difference in home square footage between counties.
- Initial exploration showed a moderate relationship between the size of the home and the home’s age, with Los Angeles County having the largest number of older homes. 
- The age and size of the home each showed a statistically significant but weak relationship with log error.

## Modeling

- Select features for modeling

In [None]:
# select features for modeling and assign to a variable
train_df = X_train[['scaled_calculatedfinishedsquarefeet', 'scaled_bathroomcnt', 'scaled_age',
                        'size_cluster', 'area_cluster']]
validate_df = X_validate[['scaled_calculatedfinishedsquarefeet', 'scaled_bathroomcnt', 'scaled_age', 
                        'size_cluster', 'area_cluster']]
test_df = X_test[['scaled_calculatedfinishedsquarefeet', 'scaled_bathroomcnt', 'scaled_age',
                        'size_cluster', 'area_cluster']]

In [None]:
# establish a baseline using the mean logerror of the train set
baseline = y_train.logerror.mean()

In [None]:
# create a new column for baseline in train set
train_df['baseline'] = baseline

In [None]:
# create a new column for baseline in validate set
validate_df['baseline'] = baseline
# compute the baseline RMSE on the train set
RMSE_baseline = sqrt(mean_squared_error(y_train.logerror, train_df.baseline))
RMSE_baseline
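
The baseline here is a constant prediction equal to the training mean. The sketch below, on made-up log error values, shows how its RMSE can be computed on train and validate; the variable names and numbers are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# made-up log error values standing in for y_train and y_validate
y_train_toy = np.array([0.02, -0.01, 0.10, -0.05, 0.04])
y_validate_toy = np.array([0.03, -0.02, 0.06])

# the baseline always predicts the mean of the training target
baseline = y_train_toy.mean()

rmse_train = np.sqrt(mean_squared_error(y_train_toy, np.full_like(y_train_toy, baseline)))
rmse_validate = np.sqrt(mean_squared_error(y_validate_toy, np.full_like(y_validate_toy, baseline)))
print(rmse_train, rmse_validate)
```

On the training set, the RMSE of a mean baseline equals the population standard deviation of the target (`y_train_toy.std()`), which is a quick sanity check for the calculation.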

In [None]:
# fit and predict four models using selected features
m1 = model.lasso_lars_model(train_df, validate_df, y_train, y_validate, 1.0)
m2 = model.glm_model(train_df, validate_df, y_train, y_validate, 0, 1)
m3 = model.poly_lm(train_df, validate_df, y_train, y_validate, 2)
m4 = model.lrm(train_df, validate_df, y_train, y_validate)

In [None]:
# show the performance for each model
model.model_performance(m1,m2,m3,m4)
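
Comparing several regressors by RMSE on train and validate can be sketched as below. The data is synthetic, the feature names are hypothetical, and reading the numeric arguments passed to the `model.py` functions as `alpha` and `power` is an assumption; the project's own functions may be structured differently.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLars, LinearRegression, TweedieRegressor
from sklearn.metrics import mean_squared_error

# synthetic data with a small-magnitude target, loosely resembling log error
rng = np.random.default_rng(0)
cols = ["scaled_sqft", "scaled_age"]  # hypothetical feature names
X_tr = pd.DataFrame(rng.random((300, 2)), columns=cols)
X_val = pd.DataFrame(rng.random((100, 2)), columns=cols)
y_tr = 0.05 * X_tr["scaled_sqft"] + rng.normal(0, 0.15, 300)
y_val = 0.05 * X_val["scaled_sqft"] + rng.normal(0, 0.15, 100)

models = {
    "lasso_lars": LassoLars(alpha=1.0),
    "glm_normal": TweedieRegressor(power=0, alpha=1),
    "ols": LinearRegression(),
}

def rmse(y_true, y_pred):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

results = {}
for name, est in models.items():
    est.fit(X_tr, y_tr)
    results[name] = {
        "rmse_train": rmse(y_tr, est.predict(X_tr)),
        "rmse_validate": rmse(y_val, est.predict(X_val)),
    }

# one row per model, one column per metric
performance = pd.DataFrame(results).T
print(performance)
```

With a target as small as log error, an L1 penalty of `alpha=1.0` can shrink every coefficient to zero, leaving a model that simply predicts the training mean; inspecting `models["lasso_lars"].coef_` is a quick way to check whether that has happened.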

In [None]:
# refit and predict four models using the absolute value of log error
m1 = model.lasso_lars_model(train_df, validate_df, np.absolute(y_train), np.absolute(y_validate), 1.0)
m2 = model.glm_model(train_df, validate_df, np.absolute(y_train), np.absolute(y_validate), 2, 1)
m3 = model.poly_lm(train_df, validate_df, np.absolute(y_train), np.absolute(y_validate), 2)
m4 = model.lrm(train_df, validate_df, np.absolute(y_train), np.absolute(y_validate))

In [None]:
# show each model's performance
model.model_performance(m1,m2,m3,m4)

## Test the Best Model

In [None]:
# fit polynomial features (degree 2) on train and preview the transformed validate features
pf = PolynomialFeatures(degree=2)
pf.fit(train_df)
pf.transform(validate_df)
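
A degree-2 polynomial regression combines `PolynomialFeatures` with an ordinary `LinearRegression`. The sketch below uses synthetic data and hypothetical feature names; it illustrates the technique and is not the code inside `model.poly_lm` or `model.test_poly_lm`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# synthetic data standing in for the selected features and log error
rng = np.random.default_rng(0)
cols = ["scaled_sqft", "scaled_age"]  # hypothetical feature names
X_train_toy = pd.DataFrame(rng.random((200, 2)), columns=cols)
X_validate_toy = pd.DataFrame(rng.random((50, 2)), columns=cols)
y_train_toy = (0.1 * X_train_toy["scaled_sqft"] ** 2
               - 0.05 * X_train_toy["scaled_age"] + rng.normal(0, 0.05, 200))
y_validate_toy = (0.1 * X_validate_toy["scaled_sqft"] ** 2
                  - 0.05 * X_validate_toy["scaled_age"] + rng.normal(0, 0.05, 50))

# fit the feature expansion on train only, then apply it to validate
pf = PolynomialFeatures(degree=2)
X_train_poly = pf.fit_transform(X_train_toy)
X_validate_poly = pf.transform(X_validate_toy)

# ordinary least squares on the expanded features
lm = LinearRegression().fit(X_train_poly, y_train_toy)
rmse_train = np.sqrt(mean_squared_error(y_train_toy, lm.predict(X_train_poly)))
rmse_validate = np.sqrt(mean_squared_error(y_validate_toy, lm.predict(X_validate_poly)))
print(X_train_poly.shape, rmse_train, rmse_validate)
```

`transform` returns a new array rather than modifying its input, so the result has to be assigned before it can be passed to the regression. With two input features, degree 2 yields six columns: the bias term, the two originals, their squares, and their product.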

In [None]:
# create a placeholder column for test; this will be replaced with predicted values later
test_df['yhat'] = baseline

In [None]:
# test the polynomial regression and graph the results; output RMSE and r2
model.test_poly_lm(train_df, validate_df, test_df, y_train, y_validate, y_test, 2)

#### Test Findings:

- The linear regression with degree-2 polynomial features outperformed the baseline on the test set with an RMSE of 0.174, which is 0.004 lower than the baseline prediction.
- The model performed better on validate than on train, suggesting that the model was not overfit to the training dataset.

## Conclusion and Recommendations

- All of the models performed very close to baseline, indicating that the new models with the selected features do not outperform Zillow’s current model. 

- Our recommendation would be to test Zillow’s current model using the absolute value of the log error if it does not do so already. 

- If we had more time, we would train the models after removing the homes that had the largest error to see how much the log error outliers affected the model. We would also see if clustering with log error as a feature can tell us more about why the model cannot accurately predict certain home values.