# Final Report: Zillow Team Project

# **Goals:**

* Discover key attributes that drive error in the Zestimate.

* Use those attributes to develop a machine learning model to predict log error.

  


## Imports

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import wrangle as w
#import model as m
#import explore as e


#import warnings
#warnings.filterwarnings("ignore")

# Acquire:

In [2]:
# acquire zillow data
df = w.get_zillow_data()

In [3]:
df.shape

(52319, 68)

* Data acquired from the Codeup database on 11/17/22

* It contained 52,319 rows and 68 columns before cleaning

* Each row represents a single-family property:
    * properties from 2017 with current transactions
    * located in the California counties of Los Angeles, Orange, or Ventura

* Each column represents a feature of the single-family residence.

# Prepare:

Prepare actions:
* After the following steps we retained about 96.1% of the original data (50,293 of 52,319 rows):
    * Outliers were removed (to better fit the definition of a single-family property):
        * Beds above 8
        * Baths above 8
        * Home values above 2_000_000
        * Rows with 0 beds and/or 0 baths

    * For the following features, null values were assumed to mean the structure does not exist on the property:
        * has_fireplace (45096)
        * has_deck (51930)
        * has_pool (41242)
        * has_garage (34335)
        * has_taxdelinquency (50251)

* Columns with fewer than 50_000 non-null values and columns with no pertinent value were dropped:
    * totaling 34 columns

* Feature engineering
    * 'age' is a feature created by subtracting yearbuilt from 2017
    * 'optional_features' is set to 1 if a property has a fireplace, garage, pool, or deck and 0 if none is present.

* Encoded categorical variables

* Split data into train, validate and test
    * Approximately: train 56%, validate 24%, test 20%
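The prepare steps listed above can be sketched roughly as follows. This is a minimal illustration, not the actual `w.zillow_prep`, which is not shown in this report. The function name `prep_sketch` and the tiny demo frame are hypothetical, and the column names (`bedrooms`, `bathrooms`, `tax_value`, `yearbuilt`, `has_*`) are assumed from the `train.head()` output, with `tax_value` assumed to hold the home value.

```python
import numpy as np
import pandas as pd

FLAGS = ["has_fireplace", "has_deck", "has_pool", "has_garage"]


def prep_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative prepare steps; hypothetical, not the real w.zillow_prep."""
    # Remove outliers and rows with 0 beds and/or 0 baths
    keep = (
        (df["bedrooms"] > 0) & (df["bedrooms"] <= 8)
        & (df["bathrooms"] > 0) & (df["bathrooms"] <= 8)
        & (df["tax_value"] <= 2_000_000)
    )
    out = df[keep].copy()

    # A null flag is assumed to mean the structure is absent
    out[FLAGS] = out[FLAGS].fillna(0).astype(int)

    # Engineered features
    out["age"] = 2017 - out["yearbuilt"]
    out["optional_features"] = (out[FLAGS].sum(axis=1) > 0).astype(int)
    return out


# Made-up rows for demonstration only (not project data)
raw = pd.DataFrame({
    "bedrooms": [3, 0, 9, 4, 2],
    "bathrooms": [2.0, 1.0, 3.0, 2.5, 1.0],
    "tax_value": [350_000, 200_000, 900_000, 2_500_000, 150_000],
    "yearbuilt": [1990, 1962, 2001, 1975, 1950],
    "has_fireplace": [1, np.nan, np.nan, np.nan, np.nan],
    "has_deck": [np.nan] * 5,
    "has_pool": [np.nan, 1, np.nan, np.nan, np.nan],
    "has_garage": [np.nan] * 5,
})
clean = prep_sketch(raw)
print(clean[["bedrooms", "age", "optional_features"]])
```

Only the first and last demo rows survive the outlier filter; the others have 0 beds, more than 8 beds, or a value above 2_000_000.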
  


### <h1><center>Data Dictionary</center></h1>

| Feature | Description |
| :---------------: | :---------------------------------- |
| home_value (target) | The total tax assessed value of the parcel |
| squarefeet | Calculated total finished living area of the home |
| bathrooms | Number of bathrooms in home including fractional bathrooms |
| bedrooms | Number of bedrooms in home |
| yearbuilt | The year the principal residence was built |
| fireplace | fireplace on property (if any = 1) |
| deck | deck on property (if any = 1) |
| pool | pool on property (if any = 1) |
| garage | garage on property (if any = 1) |
| county | FIPS code for California counties: 6111 Ventura County, 6059 Orange County, 6037 Los Angeles County |
| home_age | The age of the home in 2017 |
| optional_features | If a home has any of the following: fireplace, deck, pool, garage, it is noted as 1 |
| additional features | Encoded values for categorical data |


In [5]:
# prepare data 
df = w.zillow_prep(df)

In [6]:
df.shape

(50293, 34)

In [7]:
# split data: train, validate and test
train, validate, test = w.split_data(df)
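One way to arrive at the approximate 56% / 24% / 20% proportions is a two-stage split: hold out 20% as test, then take 30% of the remainder as validate. The sketch below assumes that approach; `split_sketch`, the seed, and the demo frame are hypothetical, and the real `w.split_data` may differ (for example, it may stratify).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_sketch(df: pd.DataFrame, seed: int = 123):
    """Two-stage split: 20% test, then 30% of the remainder as validate."""
    train_validate, test_part = train_test_split(
        df, test_size=0.2, random_state=seed
    )
    train_part, validate_part = train_test_split(
        train_validate, test_size=0.3, random_state=seed
    )
    return train_part, validate_part, test_part


# Demo frame with 1000 rows (not project data)
demo = pd.DataFrame({"x": np.arange(1000)})
demo_train, demo_validate, demo_test = split_sketch(demo)
print(len(demo_train), len(demo_validate), len(demo_test))
```

Since 0.8 × 0.7 = 0.56 and 0.8 × 0.3 = 0.24, the three parts come out at roughly 56%, 24%, and 20% of the rows.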

# Peek at the data

In [8]:
train.head(3)

| Unnamed: 0 | parcelid | bedrooms | bathrooms | calculatedbathnbr | fullbathcnt | age | yearbuilt | has_basement | has_deck | has_fireplace | ... | regionidcounty | rawcensustractandblock | censustractandblock | sqft | lot_sqft | tax_value_bldg | tax_value | tax_value_land | taxamount | log_error |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 11303 | 14498262 | 4.0 | 2.5 | 2.5 | 2.0 | 27.0 | 1990.0 | 0 | 0 | 0 | ... | 1286.0 | 60590320.0 | 60590320000000.0 | 2361.0 | 10800.0 | 240309.0 | 562351.0 | 322042.0 | 5740.94 | 0.014676 |
| 26458 | 12851739 | 3.0 | 2.0 | 2.0 | 2.0 | 55.0 | 1962.0 | 0 | 0 | 0 | ... | 3101.0 | 60374090.0 | 60374090000000.0 | 1245.0 | 5763.0 | 140233.0 | 236264.0 | 96031.0 | 3099.27 | -0.025476 |
| 28486 | 11276661 | 4.0 | 2.0 | 2.0 | 2.0 | 24.0 | 1993.0 | 0 | 0 | 0 | ... | 3101.0 | 60379010.0 | 60379010000000.0 | 1556.0 | 14711.0 | 152292.0 | 202328.0 | 50036.0 | 3234.5 | -0.084878 |


# Explore:

## Question 1?

In [9]:
# viz 1


#### !!!!! takeaway q1

!!!! HO/HA 
I will now conduct a one-tailed, two-sample T-test to test whether **the mean home value of homes with optional features (such as garage, fireplace, pool, deck) is greater than the mean home value of homes with no optional features**

* The confidence level is 95%
* Alpha is set to 0.05
* Because the test is one-tailed, p/2 will be compared to alpha


$H_0$: Mean home value of homes with optional features <= mean home value of homes with no optional features 

$H_a$: Mean home value of homes with optional features > mean home value of homes with no optional features 
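A minimal sketch of how such a one-tailed comparison could be run with `scipy.stats.ttest_ind`. The two arrays are synthetic stand-ins, not project data, and the use of Welch's variant (`equal_var=False`) is an assumption, since the report does not say whether equal variances were assumed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-ins for the two groups of home values (not project data)
with_features = rng.normal(loc=450_000, scale=100_000, size=500)
without_features = rng.normal(loc=400_000, scale=100_000, size=500)

alpha = 0.05
# Welch's t-test does not assume equal variances (an assumption of this sketch)
t_stat, p_two_sided = stats.ttest_ind(
    with_features, without_features, equal_var=False
)

# One-tailed decision: halve the two-sided p-value AND check the direction of t
reject_null = bool((t_stat > 0) and (p_two_sided / 2 < alpha))
print(f"t = {t_stat:.2f}, p/2 = {p_two_sided / 2:.4g}, reject H0: {reject_null}")
```

Checking that `t_stat` is positive matters: a small p/2 with a negative t would point in the opposite direction from $H_a$.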

In [10]:
# test 1

!!!!! stats summary
The p-value/2 is less than alpha. **There is evidence to support that homes with at least one of the following features (garage, fireplace, pool, deck) have a higher home value on average.** Based on this statistical finding, I believe that optional features are a driver of home value. Adding an encoded version of this feature to the model will likely increase the model's ability to predict home value.

## Question 2?


In [11]:
# viz 2

#### !!! takeaway q2

## Question 3?
  

In [12]:
# viz 3

* **take away q3

!!!!! HO/HA 3
**I will now conduct an ANOVA test to test for a significant difference between the mean home values of the three counties**

* The confidence level is 95%
* Alpha is set to 0.05
* p value will be compared to alpha


$H_0$: The mean home value is the same in all three counties.

$H_a$: The mean home value is not the same in all three counties (at least one county differs).
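A one-way ANOVA across the three counties could be sketched with `scipy.stats.f_oneway` as below. The county samples are synthetic and purely illustrative; the means and spreads are made up, not taken from the project data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic home values per county (illustrative only, not project data)
los_angeles = rng.normal(400_000, 120_000, 300)
orange = rng.normal(480_000, 120_000, 300)
ventura = rng.normal(440_000, 120_000, 300)

alpha = 0.05
f_stat, p_value = stats.f_oneway(los_angeles, orange, ventura)

# Rejecting H0 only says the means are not all equal, not which ones differ
reject_null = bool(p_value < alpha)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}, reject H0: {reject_null}")
```

A significant result does not identify which county differs; a post-hoc pairwise comparison would be needed for that.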

In [13]:
# test 3

!!!! stats summary
The p-value is less than alpha. **There is evidence to support that the three counties do not all have the same mean home value.** Based on this statistical finding, I believe that county location is a driver of home value. Adding an encoded version of this feature to the model will likely improve the model's ability to predict home value.

## Question 4?

In [14]:
# viz 4

* **takeaway Q4.

!!! HO/HA
**I will now conduct a Pearson's r test to test for a linear relationship between the median home value and home age.**

* The confidence level is 95%
* Alpha is set to 0.05
* p value will be compared to alpha

$H_0$: There is **no linear correlation** between the median home value and home age.

$H_a$: There **is a linear correlation** between the median home value and home age.
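The correlation test can be sketched with `scipy.stats.pearsonr`, which returns the correlation coefficient and a two-sided p-value. The data below is synthetic, built to have a weak negative relationship; the variable names and coefficients are illustrative assumptions, not project values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic data with a weak negative relationship (not project data)
home_age = rng.uniform(0, 100, 1000)
home_value = 450_000 - 1_000 * home_age + rng.normal(0, 150_000, 1000)

alpha = 0.05
r, p_value = stats.pearsonr(home_age, home_value)

# The sign of r gives the direction, |r| the strength of the linear relationship
reject_null = bool(p_value < alpha)
print(f"r = {r:.3f}, p = {p_value:.4g}, reject H0: {reject_null}")
```

A small p-value with a small |r| means the relationship is statistically detectable but weak, which is why both numbers are worth reporting.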



In [15]:
# stats 4

!!! Summary 4
The p-value is less than alpha. There is **evidence to support that there is a relationship between home value and home age.** While it is a weak negative relationship, I believe that adding it to my model will help improve the model's ability to predict.

# Exploration Summary

* 
* 
* 
* 

# Features that will be included in my model

* All features ...
* **
* **
* *
* **


# Features that will be not included in my model

* **y...

# !!! needs to be updated Modeling:

* Since home value is not normally distributed, I will use the **median as a baseline**, set at \$340,572.

* $R^2$ is the primary metric I will use to evaluate models; $RMSE$ is the secondary metric.
* $R^2$ helps show how well the model fits the data.

* I will evaluate the following top models on train and validate:
    * Polynomial Regression degree 2
    * Polynomial Regression degree 2 with interactions only
    * Polynomial Regression degree 3
    * Polynomial Regression degree 4
* The model that performs best on validate data will be run on test data.
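The evaluation plan above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the project's `model` module: the features, target, split, and the `rmse` helper are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
# Synthetic features and target (not project data)
X = rng.uniform(0, 1, size=(400, 2))
y = 3 * X[:, 0] ** 2 + 2 * X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 400)
X_tr, X_val, y_tr, y_val = X[:300], X[300:], y[:300], y[300:]


def rmse(y_true, y_pred):
    """Root mean squared error (hypothetical helper for this sketch)."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Baseline: predict the training median for every validate row
baseline_pred = np.full_like(y_val, np.median(y_tr))
results = {
    "baseline_median": (rmse(y_val, baseline_pred), r2_score(y_val, baseline_pred))
}

# (label, degree, interaction_only) for each candidate model
candidates = [
    ("poly_d2", 2, False),
    ("poly_d2_interactions", 2, True),
    ("poly_d3", 3, False),
    ("poly_d4", 4, False),
]
for label, degree, interactions in candidates:
    model = make_pipeline(
        PolynomialFeatures(
            degree=degree, interaction_only=interactions, include_bias=False
        ),
        LinearRegression(),
    )
    model.fit(X_tr, y_tr)
    pred = model.predict(X_val)
    results[label] = (rmse(y_val, pred), r2_score(y_val, pred))

for label, (rmse_val, r2_val) in results.items():
    print(f"{label:>22}: RMSE = {rmse_val:.3f}, R^2 = {r2_val:.3f}")
```

A constant prediction such as the median can never score above 0 in $R^2$ on the data it is evaluated on, so any useful model should show a positive $R^2$ alongside a lower RMSE.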

In [16]:
# prepare data for modeling
import model as m  # needed here because the import is commented out in the Imports cell
X_train, y_train, X_validate, y_validate, X_test, y_test = m.model_data_prep(train, validate, test)

In [None]:
# run predictions on train, validate and test data for each model
train_pred, validate_pred, test_pred = m.predictions(X_train,y_train,X_validate,y_validate,X_test, y_test)



# Comparing Top Models on Train and Validate

### Baseline Median Home Value: \$340,572

In [None]:
m.metrics(train_pred,y_train, validate_pred, y_validate)


All models outperformed the median baseline in terms of RMSE on both train and validate data.

Polynomial Regressor degree 4 did best on train data in both RMSE and $R^2$.

The model with the best $R^2$ and RMSE on validate data is Polynomial Regressor degree 3.

**I will select Polynomial Regressor degree 3 since it has the highest $R^2$ and the lowest RMSE of all models on validate.**

# Model on Test data

In [None]:
# get metrics on Final Model
m.metric_test(test_pred[['poly_d3','baseline_median']],y_test)

## Modeling Summary

* All models performed better than the baseline.
* The final model, Polynomial Regressor degree 3, had an $R^2$ of .37 on test data and beat the baseline RMSE by \$73,639.

In [None]:
import explore as e  # needed here because the import is commented out in the Imports cell
e.distribution_top_model(y_test, test_pred)


# Conclusion

## Exploration



* Homes with an optional feature such as a deck, pool, garage, or fireplace have more value.
* Homes with more bedrooms and bathrooms tend to have more value on average.
* County location makes a difference in home value.
* Home age has a relationship with home value.

## Modeling

**The final model has an $R^2$ of 0.37 and beat the median baseline by \$73,639 in RMSE**


## Recommendations

* Standardize methods of data collection to increase data accuracy; this will likely improve the model's ability to predict future home values.

## Next Steps

* Look into locations of homes using information about neighborhoods and/or longitude and latitude to explore relationships between location and home value.
* Explore whether lot size has an influence on home value.
