In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from mvp_wrangle import *
from explore import *
from modeling import *

from scipy import stats
from sklearn.linear_model import LinearRegression, TweedieRegressor, LassoLars
from sklearn.metrics import r2_score, mean_squared_error, explained_variance_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, QuantileTransformer, PolynomialFeatures

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

### Planning
##### Goal:
- Project Objective: Use Clusters to help advance our Regression Modeling attempts on the 2017 Zillow dataset.
##### Deliverables:
- Wrangle.py : Functions to reproduce Wrangling and Cleaning of Data.
- Explore.py : Functions to reproduce Exploring the Data.
- Model.py : Functions to reproduce Modeling the Data.
- Final Notebook (ipynb) : Notebook with steps to reproduce.
##### Questions to Explore:
- Which features have the most deviation from the logerror mean?
- Do any clusters help predict logerror?

------

### Data Acquisition and Data Preparations
Using our wrangle.py, we'll acquire the data from the Cloud SQL database and save a local copy as a .csv. Any later request for the data loads that local copy instead.

Outliers will be stripped, and the few remaining nulls dropped.
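The notebook does not show how outliers are stripped inside wrangle.py, so the following is only a sketch of one common approach: an IQR fence per column, followed by dropping the remaining nulls. The helper name `strip_outliers`, the multiplier `k=1.5`, and the toy values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def strip_outliers(df, cols, k=1.5):
    """Keep rows inside [Q1 - k*IQR, Q3 + k*IQR] for every column in cols.

    Hypothetical helper; nulls are left in place so they can be handled separately.
    """
    keep = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        in_fence = df[col].between(q1 - k * iqr, q3 + k * iqr)
        keep &= in_fence | df[col].isna()
    return df[keep]


# Toy data (illustrative values only, not taken from the Zillow dataset)
homes = pd.DataFrame({
    "area": [1200, 1500, 1800, 2100, 1650, 50000, np.nan],
    "tax_value": [250000, 310000, 400000, 450000, 330000, 9000000, 300000],
})

# Strip outliers first, then drop the few remaining nulls
cleaned = strip_outliers(homes, ["area", "tax_value"]).dropna()
print(cleaned.shape)
```

Filtering before dropping nulls keeps the two steps independent, so either one can be tuned without touching the other.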

Histograms and box plots are generated to visualize our data.

Then the data will be split into train, X_train, X_validate, X_test, y_train, y_validate, and y_test DataFrames.
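For reference, a three-way split like the one described can be sketched with scikit-learn's `train_test_split`. The split proportions, the random seed, and the synthetic data below are assumptions; the real `wrangle_zillow()` may split differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned Zillow data (assumed column names)
rng = np.random.default_rng(42)
homes = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, size=1000),
    "bathrooms": rng.integers(1, 4, size=1000),
    "area": rng.integers(500, 4000, size=1000),
})
homes["tax_value"] = 150 * homes["area"] + rng.normal(0, 50_000, size=1000)

# First carve off the test set, then split the remainder into train and validate
train_validate, test = train_test_split(homes, test_size=0.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=0.3, random_state=123)

# Separate the features (X) from the target (y) for each subset
target = "tax_value"
X_train, y_train = train.drop(columns=target), train[target]
X_validate, y_validate = validate.drop(columns=target), validate[target]
X_test, y_test = test.drop(columns=target), test[target]

print(train.shape, X_train.shape, X_validate.shape, X_test.shape)
```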

In [4]:
train, X_train, X_validate, X_test, y_train, y_validate, y_test = wrangle_zillow()
train.shape, X_train.shape, X_validate.shape, X_test.shape, y_train.shape, y_validate.shape, y_test.shape

ValueError: too many values to unpack (expected 7)

In [None]:
train.dtypes

The shape of the split looks good, as do our charts, which now show only the outliers we'd expect. All data types are what we'd expect and are easy to work with.

-----

### Data Exploration
We'll begin visualising with the train DataFrame, comparing different features.

- First, let's check the range of the columns.

The ranges of bedrooms and bathrooms are about what you'd expect. The range of area is larger than expected, but that seems to be due to a very small house at 152 sqft (not small enough to count as an outlier, though). The range of tax_value (our target) is a little daunting, with over 1,000,000 between the lowest and highest values, but we can do it.
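A range check like this can be sketched in a few lines of pandas. The DataFrame below is a toy stand-in with made-up values (only the 152 sqft minimum echoes the text), so the printed numbers are not the notebook's actual results.

```python
import pandas as pd

# Toy stand-in for the train DataFrame (values are illustrative only)
train = pd.DataFrame({
    "bedrooms": [2, 3, 4, 5],
    "bathrooms": [1.0, 2.0, 2.5, 3.0],
    "area": [152, 1400, 2100, 3600],
    "tax_value": [90_000, 310_000, 450_000, 1_100_000],
})

# Min, max, and range (max - min) for every numeric column
numeric = train.select_dtypes(include="number")
ranges = pd.DataFrame({"min": numeric.min(), "max": numeric.max()})
ranges["range"] = ranges["max"] - ranges["min"]
print(ranges)
```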

- Let's look at some Categorical data vs. our goal now. [Using Train]

- And Variable Pairs (including correlations)

------

### Stats Testing

Looking at some of the charts above, we get an idea of how some of the data might correlate, so let's move on to testing the main hypothesis.

H0:
- There is no Correlation between [Bedrooms, Bathrooms, Area] and Tax Value.

Ha:
- There is a Correlation between [Bedrooms, Bathrooms, Area] and Tax Value.
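One way to test these hypotheses is a Pearson correlation test per feature using `scipy.stats.pearsonr`. This is a sketch under assumptions: the data is synthetic, the significance level of 0.05 is assumed, and the notebook may have used a different test.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for train (assumed relationships, not real Zillow data)
rng = np.random.default_rng(0)
n = 500
area = rng.uniform(500, 4000, n)
bedrooms = np.clip(np.round(area / 800 + rng.normal(0, 1, n)), 1, 6)
bathrooms = np.clip(np.round(area / 1200 + rng.normal(0, 0.5, n)), 1, 4)
tax_value = 150 * area + 20_000 * bathrooms + rng.normal(0, 100_000, n)
train = pd.DataFrame({
    "bedrooms": bedrooms,
    "bathrooms": bathrooms,
    "area": area,
    "tax_value": tax_value,
})

alpha = 0.05  # assumed significance level
rows = []
for feature in ["bedrooms", "bathrooms", "area"]:
    # pearsonr returns the correlation coefficient and the two-sided p-value
    r, p = stats.pearsonr(train[feature], train["tax_value"])
    rows.append({"feature": feature, "r": r, "p_value": p, "reject_H0": p < alpha})

results = pd.DataFrame(rows).set_index("feature")
print(results)
```

A p-value below alpha means we reject H0 for that feature; the size of `r` then tells us how strong the linear relationship is.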

We saw above that Area, then Bathrooms, then Bedrooms generally had the strongest correlation with Tax Value. Let's look into that again.

Area:
###### With the low starting point in area, let's try the lower of mean vs median.

- This is to be expected, a larger house is usually worth more.

Bathrooms:

- Again, more of something is usually worth more.

Bedrooms:

- Yet again, more is more valuable.

Area vs. Bathroom:

- This is where it gets interesting: more bathrooms are valued higher than more square footage.
- On both counts, bathrooms appear to be more prized than area.

Area vs. Bedroom:

- Now here, more area is worth more than more bedrooms.
- But the lower half of bedrooms actually has more value than the lower half of area.
- Does that mean there is such a thing as too many rooms? And too few?

Bedroom vs. Bathroom:

- Given the other two tests, we'd expect bathrooms to have more value here, since they beat area and area beat bedrooms.
- But it seems bedrooms reign supreme below the median, which puts area at the bottom, below bathrooms.
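The code behind these comparisons is not shown, so the following is only one possible reading: split the homes at each feature's median and compare the mean tax_value of the two halves. The helper `split_means` and the toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in for train (illustrative values only)
train = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 5, 2, 4, 3],
    "bathrooms": [1, 2, 2, 3, 3, 1, 2, 2],
    "area": [900, 1400, 1600, 2200, 3000, 1000, 2000, 1500],
    "tax_value": [200_000, 300_000, 340_000, 480_000,
                  650_000, 210_000, 430_000, 320_000],
})


def split_means(df, feature, target="tax_value"):
    """Mean target for homes above vs. at-or-below the feature's median."""
    above = df[feature] > df[feature].median()
    means = df.groupby(above)[target].mean()
    return means.rename({False: "at_or_below_median", True: "above_median"})


features = ["bedrooms", "bathrooms", "area"]
comparison = pd.DataFrame({f: split_means(train, f) for f in features})
print(comparison)
```

Reading across a row compares features within the same half (for example, which feature's upper half carries the highest mean value), which is the kind of comparison the bullets above describe.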

##### Testing some other stats:

Creating some new columns using ones we have, like Area per Bedroom/Bathroom.

###### Factoring in the tax_value might be useful as well. If time is given, we'll come back to that.

Let's check out correlations.
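A sketch of the engineered columns and their correlation with the target follows. The column names `area_by_beds` and `area_by_bathroom` come from this notebook; the toy values and the zero-handling are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train (illustrative values only)
train = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 0],
    "bathrooms": [1.0, 2.0, 2.5, 3.0, 1.0],
    "area": [900, 1500, 1800, 2400, 600],
    "tax_value": [200_000, 320_000, 390_000, 520_000, 150_000],
})

# Area per room; a zero count becomes NaN rather than dividing by zero
train["area_by_beds"] = train["area"] / train["bedrooms"].replace(0, np.nan)
train["area_by_bathroom"] = train["area"] / train["bathrooms"].replace(0, np.nan)

# Pearson correlation of every column with the target, strongest first
correlations = (
    train.corr()["tax_value"].drop("tax_value").sort_values(ascending=False)
)
print(correlations)
```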

They're about on par with, if not weaker than, the default features: Area, Bedrooms, Bathrooms.

Honestly, my thoughts are that Area, Bathrooms, and Bedrooms are the best for building models. I include Bedrooms because they appear to have the biggest effect on the lower end of the spectrum, so it might help balance the models.

------

### Feature Engineering

Let's create X,y dataframes, and verify my conclusions with some SKLearn calculations:

Using LinearRegression:

Using LassoLars:

Using TweedieRegressor:
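The outputs of the three passes above are not preserved. As an assumption, the sketch below ranks features with recursive feature elimination (`RFE`) wrapped around each of the three estimators; the notebook may have used a different selector. The data is synthetic, the target is expressed in units of $100k to keep the solvers well behaved, and the estimator settings are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LassoLars, LinearRegression, TweedieRegressor
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for X_train / y_train (assumed relationships)
rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, n),
    "bathrooms": rng.integers(1, 4, n),
    "area": rng.uniform(500, 4000, n),
})
y = (150 * X["area"] + 20_000 * X["bathrooms"]
     + rng.normal(0, 50_000, n)) / 100_000

# Scale features so coefficient sizes are comparable across columns
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

estimators = {
    "LinearRegression": LinearRegression(),
    "LassoLars": LassoLars(alpha=0.01),
    "TweedieRegressor": TweedieRegressor(power=0, alpha=0),
}

# Rank 1 is the last feature left standing for each estimator
ranking = pd.DataFrame(index=X.columns)
for name, estimator in estimators.items():
    rfe = RFE(estimator, n_features_to_select=1).fit(X_scaled, y)
    ranking[name] = rfe.ranking_

print(ranking)
```

Features whose rank differs from one estimator to the next are the "wild cards" worth keeping an eye on.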

So there are a couple of wild cards: Bed and Bath, as well as Rooms and Area. Let's keep those in the models.

We'll drop area_by_bathroom, and area_by_beds, then resplit into X,y.

There, now we've got the features we want to keep. Now to fix our X,y.

Shape looks good; now we're all set for modeling.

---------

### Modeling

Now to try out our selected features on some actual models.

Let's check the RMSE and r^2 for several different models. A metric df will be created with all the stats for the different models. The baseline and predictions will also be added to y_train and y_validate.

So none of them are great, but Linear Regression and LassoLars are my top two picks. Looking at the above charts, I like LassoLars just a smidge more because its predictions are spread a tad more widely within the blue actual tax_value area.
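A minimal sketch of that comparison follows, assuming a mean baseline and two of the imported estimators. The synthetic data, the 70/30 split, and the DataFrame names `train_preds` and `validate_preds` are illustrative; the helpers in modeling.py are not shown and may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLars, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the modeling data (assumed relationships)
rng = np.random.default_rng(7)
n = 600
X = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, n),
    "bathrooms": rng.integers(1, 4, n),
    "area": rng.uniform(500, 4000, n),
})
y = pd.Series(
    150 * X["area"] + 20_000 * X["bathrooms"] + rng.normal(0, 50_000, n),
    name="tax_value",
)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=123
)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(actual, predicted)))


# Keep actuals, the baseline, and every model's predictions side by side
train_preds = pd.DataFrame({"actual": y_train})
validate_preds = pd.DataFrame({"actual": y_validate})
train_preds["baseline"] = y_train.mean()      # baseline: always predict the train mean
validate_preds["baseline"] = y_train.mean()

models = {"LinearRegression": LinearRegression(), "LassoLars": LassoLars(alpha=3.8)}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_preds[name] = model.predict(X_train)
    validate_preds[name] = model.predict(X_validate)

# One row of metrics per model, baseline first
metric_df = pd.DataFrame([
    {
        "model": name,
        "rmse_train": rmse(train_preds["actual"], train_preds[name]),
        "rmse_validate": rmse(validate_preds["actual"], validate_preds[name]),
        "r2_validate": r2_score(validate_preds["actual"], validate_preds[name]),
    }
    for name in ["baseline", *models]
]).set_index("model")
print(metric_df)
```

Comparing `rmse_train` with `rmse_validate` per row is a quick check for overfitting before choosing a final model.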

##### Final Model:

Fitting the LassoLars to X,y train dataframe, and evaluating on X,y test.
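The final step can be sketched as below. The alpha of 3.8 comes from the conclusions of this notebook; the synthetic data and the 80/20 split are assumptions, so the printed scores are not the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLars
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the train and test sets (assumed relationships)
rng = np.random.default_rng(11)
n = 500
X = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, n),
    "bathrooms": rng.integers(1, 4, n),
    "area": rng.uniform(500, 4000, n),
})
y = 150 * X["area"] + 20_000 * X["bathrooms"] + rng.normal(0, 50_000, n)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Fit on train only, then score once on the held-out test set
final_model = LassoLars(alpha=3.8).fit(X_train, y_train)
test_pred = final_model.predict(X_test)

rmse_test = float(np.sqrt(mean_squared_error(y_test, test_pred)))
r2_test = r2_score(y_test, test_pred)
print(f"Test RMSE: {rmse_test:,.0f}  R^2: {r2_test:.3f}")
```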

### Conclusions

There is a lot more that contributes to house value than just square footage, number of bathrooms, and number of bedrooms. Location is very important, as is how much property the house sits on. With more time I'd like to start with Location and Property Size, then branch out further from there.

Otherwise, LassoLars with an alpha of 3.8 is slightly better than normalised Linear Regression with the basic values we're working with.