# The Drivers of Errors in Single Unit Zestimates at Zillow
***

## Introduction

### Author
- Samuel Davila
 - Data Scientist
 - Zillow DS Department

### Purpose of project
- Identify what causes the accuracy of our estimates of property value ("Zestimates") to increase or decrease. 

### Why is this project important?
- By understanding what improves and what weakens the accuracy of our estimates, we can shore up our weaknesses while maintaining our strengths. 

### Data source
- Single unit property data from Zillow table in Data Science Database
***

## Executive Summary

### Goals
- Improve original estimate of the log error by using clustering methodologies.
    - Identify drivers of log error
    - Create model that...
    

- Deliver the following:

    - zillow_clustering_project.ipynb

    - README.md

    - acquire.py

    - prep.py

    - preprocessing.py

    - model.py
    
    - A presentation that walks through each step of our project and this notebook as a whole.

### Analysis

Analysis goes here

### Recommendation

Recommendation goes here

### Expectation

Expectation goes here
***

## Project Planning

### Acquire

### Preparation

### Exploration

### Model

### Conclusion
***

## Acquire
Acquire the data we need for our project from the zillow table in the data science database.

Create __acquire.py__ file that contains the functions needed to replicate this process.
***

#### Importing modules needed for code in notebook to run.

In [69]:
# set up environment
import warnings

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, LassoLars
from sklearn.metrics import mean_squared_error

# project modules
from acquire import get_zillow_data
from prep import (drop_missing_columns, missing_rows, drop_selected_columns,
                  drop_more_selected_columns, zillow_dummy)
from preprocessing import (split_data, add_upper_outlier_columns,
                           upper_outlier_data_print, data_scaler)

warnings.filterwarnings("ignore")

#### Using function in acquire.py file to import data then previewing data.

In [2]:
# create variable that will hold DF for easy access to data
df = get_zillow_data()

# previewing data
df

Unnamed: 0,id,parcelid,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,...,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock,id.1,parcelid.1,logerror,transactiondate
0,2288172,12177905,,,,3.0,4.0,,8.0,3.0,...,2016.0,36225.0,1777.51,,,6.037300e+13,3,12177905,-0.103410,2017-01-01
1,1970746,10887214,1.0,,,3.0,3.0,,8.0,3.0,...,2016.0,45726.0,1533.89,,,6.037124e+13,4,10887214,0.006940,2017-01-01
2,781532,12095076,1.0,,,3.0,4.0,,9.0,3.0,...,2016.0,496619.0,9516.26,,,6.037461e+13,6,12095076,-0.001011,2017-01-01
3,870991,12069064,,,,1.0,2.0,,5.0,1.0,...,2016.0,199662.0,2366.08,,,6.037302e+13,7,12069064,0.101723,2017-01-01
4,1246926,12790562,,,,3.0,4.0,,9.0,3.0,...,2016.0,43056.0,3104.19,,,6.037500e+13,8,12790562,-0.040966,2017-01-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47409,2864704,10833991,1.0,,,3.0,3.0,,8.0,3.0,...,2016.0,114000.0,4685.34,,,6.037132e+13,77608,10833991,-0.002245,2017-09-20
47410,673515,11000655,,,,2.0,2.0,,6.0,2.0,...,2016.0,283704.0,4478.43,,,6.037101e+13,77609,11000655,0.020615,2017-09-20
47411,1843709,12773139,1.0,,,1.0,3.0,,4.0,1.0,...,2016.0,16749.0,876.43,,,6.037434e+13,77611,12773139,0.037129,2017-09-21
47412,1187175,12826780,,,,2.0,3.0,,6.0,2.0,...,2016.0,382000.0,6317.15,,,6.037503e+13,77612,12826780,0.007204,2017-09-25


### Acquire Takeaways
- Data was acquired from the zillow database on the data science database server using the __get_zillow_data__ function
- The function needed to replicate this phase is located in the __acquire.py__ file
***

## Prepare
Prepare, tidy, and clean the data for exploration and analysis.

Create __prep.py__ file that contains the functions needed to replicate this process.
***

#### We'll use .info to see null value counts, data types, and row / columns count.

In [3]:
# using info function to examine data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47414 entries, 0 to 47413
Data columns (total 63 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            47414 non-null  int64  
 1   parcelid                      47414 non-null  int64  
 2   airconditioningtypeid         22070 non-null  float64
 3   architecturalstyletypeid      0 non-null      object 
 4   basementsqft                  0 non-null      object 
 5   bathroomcnt                   47414 non-null  float64
 6   bedroomcnt                    47414 non-null  float64
 7   buildingclasstypeid           8 non-null      float64
 8   buildingqualitytypeid         46923 non-null  float64
 9   calculatedbathnbr             47368 non-null  float64
 10  decktypeid                    0 non-null      object 
 11  finishedfloor1squarefeet      0 non-null      object 
 12  calculatedfinishedsquarefeet  47407 non-null  float64
 13  f

- Many columns have a large number of null values
    - We'll use a function (missing_rows) to examine the null counts in more depth later
    - After that closer look, we'll decide how to proceed


- The data types of several columns need to be converted
    - We'll handle this at a later stage, since the columns in question may be removed in another step
        - For example, a column with a bad data type may be removed anyway because it has too many nulls


- Several columns, such as rawcensustractandblock, are categorical variables that may have a very large number of unique values.
    - Encoding every value for these columns may be computationally expensive and would add a large number of columns to our dataset.
        - We will need to set a cutoff for the number of unique values in a categorical column and remove any columns that exceed it.


- Once we've identified which columns to carry into exploration, we'll rename any that are hard to read, such as landtaxvaluedollarcnt.


#### Using missing_rows function from prep.py to create DF that shows the total number and percent of missing rows in each column of our data.

In [4]:
# passing dataframe to function from prep.py file
missing_rows(df)

Unnamed: 0,num_rows_missing,pct_rows_missing
id,0,0.000000
parcelid,0,0.000000
airconditioningtypeid,25344,53.452567
architecturalstyletypeid,47414,100.000000
basementsqft,47414,100.000000
...,...,...
censustractandblock,116,0.244653
id,0,0.000000
parcelid,0,0.000000
logerror,0,0.000000


- There are many columns with a substantial number of missing values
    - We will remove any columns that are missing 40% or more of their values
    - We don't have an exact formula for why we should go with 40% but it seems reasonable to remove any columns that are missing that proportion of values.
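
The source of missing_rows and drop_missing_columns lives in prep.py and is not shown in this notebook. As a rough sketch of what helpers like these might look like, assuming the 40% cutoff described above, the hypothetical functions below build the same two-column report and keep only the columns under the cutoff. The `_sketch` names, the `max_pct_missing` parameter, and the toy data are ours for illustration only.

```python
import pandas as pd


def missing_rows_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for prep.missing_rows: null count and percent per column."""
    num_missing = frame.isna().sum()
    pct_missing = num_missing / len(frame) * 100
    return pd.DataFrame({"num_rows_missing": num_missing,
                         "pct_rows_missing": pct_missing})


def drop_missing_columns_sketch(frame: pd.DataFrame,
                                max_pct_missing: float = 40.0) -> pd.DataFrame:
    """Hypothetical stand-in for prep.drop_missing_columns.

    Keeps columns whose share of nulls is below max_pct_missing. A positional
    boolean mask is used so duplicated column names (such as id and parcelid
    in the raw data) are handled column by column.
    """
    keep_mask = (frame.isna().mean() * 100 < max_pct_missing).to_numpy()
    return frame.loc[:, keep_mask]


# toy frame: 'b' is 60% null, 'c' is 20% null
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [None, None, None, 1.0, 2.0],
    "c": [1.0, None, 3.0, 4.0, 5.0],
})

report = missing_rows_sketch(toy)
trimmed = drop_missing_columns_sketch(toy)
print(report)
print(trimmed.columns.tolist())
```

Returning a new dataframe rather than modifying the input keeps the cutoff easy to revisit if 40% turns out to be too strict or too lenient.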

#### Using drop_missing_columns function from prep.py to remove columns that are missing 40% or more of their values.

In [5]:
# function removes any columns that are missing 40% or more of their values
drop_missing_columns(df)

Unnamed: 0,id,parcelid,bathroomcnt,bedroomcnt,buildingqualitytypeid,calculatedbathnbr,calculatedfinishedsquarefeet,finishedsquarefeet12,fips,fullbathcnt,...,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,censustractandblock,id.1,parcelid.1,logerror,transactiondate
0,2288172,12177905,3.0,4.0,8.0,3.0,2376.0,2376.0,6037.0,3.0,...,108918.0,145143.0,2016.0,36225.0,1777.51,6.037300e+13,3,12177905,-0.103410,2017-01-01
1,1970746,10887214,3.0,3.0,8.0,3.0,1312.0,1312.0,6037.0,3.0,...,73681.0,119407.0,2016.0,45726.0,1533.89,6.037124e+13,4,10887214,0.006940,2017-01-01
2,781532,12095076,3.0,4.0,9.0,3.0,2962.0,2962.0,6037.0,3.0,...,276684.0,773303.0,2016.0,496619.0,9516.26,6.037461e+13,6,12095076,-0.001011,2017-01-01
3,870991,12069064,1.0,2.0,5.0,1.0,738.0,738.0,6037.0,1.0,...,18890.0,218552.0,2016.0,199662.0,2366.08,6.037302e+13,7,12069064,0.101723,2017-01-01
4,1246926,12790562,3.0,4.0,9.0,3.0,3039.0,3039.0,6037.0,3.0,...,177527.0,220583.0,2016.0,43056.0,3104.19,6.037500e+13,8,12790562,-0.040966,2017-01-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47409,2864704,10833991,3.0,3.0,8.0,3.0,1741.0,1741.0,6037.0,3.0,...,265000.0,379000.0,2016.0,114000.0,4685.34,6.037132e+13,77608,10833991,-0.002245,2017-09-20
47410,673515,11000655,2.0,2.0,6.0,2.0,1286.0,1286.0,6037.0,2.0,...,70917.0,354621.0,2016.0,283704.0,4478.43,6.037101e+13,77609,11000655,0.020615,2017-09-20
47411,1843709,12773139,1.0,3.0,4.0,1.0,1032.0,1032.0,6037.0,1.0,...,32797.0,49546.0,2016.0,16749.0,876.43,6.037434e+13,77611,12773139,0.037129,2017-09-21
47412,1187175,12826780,2.0,3.0,6.0,2.0,1762.0,1762.0,6037.0,2.0,...,140000.0,522000.0,2016.0,382000.0,6317.15,6.037503e+13,77612,12826780,0.007204,2017-09-25


#### While examining the data in SQL, we noticed that several columns appeared to have identical values. We're going to see how many rows differ between them.

In [6]:
# comparing each pair of columns we suspected are duplicates
# the resulting variables tell us how many rows differ between each
sqft_columns_diff = (df.finishedsquarefeet12 != df.calculatedfinishedsquarefeet).sum()
bathroom_count_diff = (df.calculatedbathnbr != df.bathroomcnt).sum()
bathroom_count_diff_alt = (df.fullbathcnt != df.bathroomcnt).sum()

print(f'Number of different values between finishedsquarefeet12 and calculatedfinishedsquarefeet: {sqft_columns_diff}')
print(f'Number of different values between calculatedbathnbr and bathroomcnt: {bathroom_count_diff}')
print(f'Number of different values between fullbathcnt and bathroomcnt: {bathroom_count_diff_alt}')

Number of different values between finishedsquarefeet12 and calculatedfinishedsquarefeet: 48
Number of different values between calculatedbathnbr and bathroomcnt: 46
Number of different values between fullbathcnt and bathroomcnt: 46


- The total number of differing values across these column pairs is 140 (48 + 46 + 46).
    - This is roughly .003 (about 0.3%) of all rows in the DF
        - We can safely drop the following near-duplicate columns and lose only an extremely small amount of distinct information.
            - finishedsquarefeet12
            - calculatedbathnbr
            - fullbathcnt
        - We could drop their counterpart columns instead, but it would cost us time to find a non-arbitrary reason to do so, and given how few unique values we're losing, the loss is relatively inconsequential.
        - These columns will be dropped in an upcoming function, __drop_selected_columns__, along with any other columns that are found to be in need of removal.
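
One caveat about the comparison above: in pandas, `!=` treats a missing value as different from everything, so some of the counted differences may be gaps rather than conflicting values. The .info() output earlier shows calculatedbathnbr has 46 nulls (47,414 - 47,368), which matches the 46 differences reported for calculatedbathnbr and bathroomcnt. The sketch below, using a hypothetical helper name and toy data, shows how the two kinds of difference could be separated.

```python
import numpy as np
import pandas as pd


def count_real_differences(left: pd.Series, right: pd.Series) -> int:
    """Count rows where both values are present and actually differ."""
    both_present = left.notna() & right.notna()
    return int(((left != right) & both_present).sum())


# toy data: one genuine disagreement (2.0 vs 2.5) and one missing value
toy = pd.DataFrame({
    "calculatedbathnbr": [3.0, 2.0, np.nan, 1.0],
    "bathroomcnt":       [3.0, 2.5, 0.0,    1.0],
})

# the naive count includes the row where calculatedbathnbr is missing
naive = int((toy.calculatedbathnbr != toy.bathroomcnt).sum())
strict = count_real_differences(toy.calculatedbathnbr, toy.bathroomcnt)
print(naive, strict)
```

Either way the conclusion is the same for our purposes: the columns carry almost no information beyond their counterparts.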

#### While examining the data in SQL, I noticed that taxvaluedollarcnt appeared to be the sum of landtaxvaluedollarcnt and structuretaxvaluedollarcnt. 

#### To test this I'm going to combine landtaxvaluedollarcnt and structuretaxvaluedollarcnt manually, then compare the results to taxvaluedollarcnt and see how many of the values match.

In [7]:
# creating new df that holds all three columns we're interested in
# (.copy() so that adding a column below does not write to a view of df)
tax_eval_df = df[['structuretaxvaluedollarcnt', 'landtaxvaluedollarcnt', 'taxvaluedollarcnt']].copy()

# creating new column in df that is the sum of landtaxvaluedollarcnt and structuretaxvaluedollarcnt
tax_eval_df['taxvaluedollarcnt_test'] = tax_eval_df.structuretaxvaluedollarcnt + tax_eval_df.landtaxvaluedollarcnt

# comparing taxvaluedollarcnt to our manually calculated column and finding the proportion of rows where the values matched
(tax_eval_df.taxvaluedollarcnt_test == tax_eval_df.taxvaluedollarcnt).mean()

0.9986501877082719

- Roughly 99.9% of our manually summed values matched the original
    - Safe to say that in the vast majority of rows, taxvaluedollarcnt is the sum of landtaxvaluedollarcnt and structuretaxvaluedollarcnt
        - This being the case, we're only going to keep taxvaluedollarcnt and remove the other two columns since their values are already accounted for in this column.
        - If need be, we can add them back later and see if we get better results by having them separated.
            - The columns will be dropped at a later step using the __drop_selected_columns__ function

#### We're using nunique() to see how many unique values each column has. This is useful for identifying categorical columns with large amounts of unique values and columns with only a single value.

In [8]:
# nunique() displays each column and the amount of unique values that it holds
df.nunique()

id                              47293
parcelid                        47293
bathroomcnt                        13
bedroomcnt                         12
buildingqualitytypeid              12
calculatedbathnbr                  12
calculatedfinishedsquarefeet     4302
finishedsquarefeet12             4300
fips                                3
fullbathcnt                        12
heatingorsystemtypeid               3
latitude                        38668
longitude                       37041
lotsizesquarefeet               16506
propertycountylandusecode          40
propertylandusetypeid              10
propertyzoningdesc               1854
rawcensustractandblock          25213
regionidcity                      135
regionidcounty                      3
regionidzip                       290
roomcnt                             5
unitcnt                             1
yearbuilt                         131
structuretaxvaluedollarcnt      28782
taxvaluedollarcnt               32671
assessmentye

- The following categorical columns will be removed because they contain 10 or more unique values. Encoding them would be computationally expensive and add a large number of columns to our dataframe.

    - id (both)
    - parcelid (both)
    - buildingqualitytypeid
    - latitude
    - longitude
    - propertylandusetypeid
    - propertycountylandusecode
    - propertyzoningdesc
    - rawcensustractandblock
    - regionidcity
    - regionidzip
    - yearbuilt
    - censustractandblock
    - transactiondate

    
- There are ways to avoid the consequences of encoding categorical columns with many unique values, but in the interest of time we will avoid these routes for now.


- The following columns will be removed because they only contain 1 unique value and would thus not allow us to make any meaningful distinctions with them.

    - assessmentyear
    - unitcnt


- All of these columns will be removed using the __drop_selected_columns__ function from prep.py
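
The selection above was done by reading the nunique() output by hand. As a minimal sketch of how the same two rules could be applied programmatically, the hypothetical helper below takes the columns we consider categorical (pandas cannot infer that from numeric ID codes) and returns those at or above the cutoff, plus every column with a single unique value. The function name, the `max_unique` parameter, and the toy data are assumptions for illustration.

```python
import pandas as pd


def split_by_cardinality(frame, categorical_cols, max_unique=10):
    """Return (high_cardinality, single_valued) lists of column names.

    high_cardinality: listed categorical columns with max_unique or more distinct values
    single_valued:    any column with exactly one distinct value
    """
    counts = frame.nunique()
    high = [col for col in categorical_cols if counts[col] >= max_unique]
    single = [col for col in frame.columns if counts[col] == 1]
    return high, single


# toy frame: 12 zip codes, 3 heating types, and a constant unit count
toy = pd.DataFrame({
    "regionidzip": list(range(96000, 96012)),
    "heatingorsystemtypeid": [2, 7, 20] * 4,
    "unitcnt": [1.0] * 12,
})

high, single = split_by_cardinality(toy, ["regionidzip", "heatingorsystemtypeid"])
print(high)    # categorical columns too wide to encode
print(single)  # columns with no variation
```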

#### Dropping columns that meet any of the following criteria (identified in prior steps)

- Categorical with 10 or more unique values
- Only contain 1 unique value
- landtaxvaluedollarcnt and structuretaxvaluedollarcnt
    - summed under taxvaluedollarcnt column
- Near-duplicates of alternate column(s)

In [9]:
# using function from prep.py to drop columns meeting any criterion above
drop_selected_columns(df)

Unnamed: 0,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,fips,heatingorsystemtypeid,lotsizesquarefeet,regionidcounty,roomcnt,taxvaluedollarcnt,taxamount,logerror
0,3.0,4.0,2376.0,6037.0,2.0,13038.0,3101.0,0.0,145143.0,1777.51,-0.103410
1,3.0,3.0,1312.0,6037.0,2.0,278581.0,3101.0,0.0,119407.0,1533.89,0.006940
2,3.0,4.0,2962.0,6037.0,2.0,63000.0,3101.0,0.0,773303.0,9516.26,-0.001011
3,1.0,2.0,738.0,6037.0,,4214.0,3101.0,0.0,218552.0,2366.08,0.101723
4,3.0,4.0,3039.0,6037.0,2.0,20028.0,3101.0,0.0,220583.0,3104.19,-0.040966
...,...,...,...,...,...,...,...,...,...,...,...
47409,3.0,3.0,1741.0,6037.0,2.0,59487.0,3101.0,0.0,379000.0,4685.34,-0.002245
47410,2.0,2.0,1286.0,6037.0,2.0,47405.0,3101.0,0.0,354621.0,4478.43,0.020615
47411,1.0,3.0,1032.0,6037.0,2.0,5074.0,3101.0,0.0,49546.0,876.43,0.037129
47412,2.0,3.0,1762.0,6037.0,2.0,6347.0,3101.0,0.0,522000.0,6317.15,0.007204


#### Using .info to examine our remaining columns

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47414 entries, 0 to 47413
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   bathroomcnt                   47414 non-null  float64
 1   bedroomcnt                    47414 non-null  float64
 2   calculatedfinishedsquarefeet  47407 non-null  float64
 3   fips                          47414 non-null  float64
 4   heatingorsystemtypeid         46685 non-null  float64
 5   lotsizesquarefeet             46179 non-null  float64
 6   regionidcounty                47414 non-null  float64
 7   roomcnt                       47414 non-null  float64
 8   taxvaluedollarcnt             47414 non-null  float64
 9   taxamount                     47410 non-null  float64
 10  logerror                      47414 non-null  float64
dtypes: float64(11)
memory usage: 4.0 MB


- Considering that we'll be encoding our categorical columns, our data types look fine for now
- A few of our columns still have missing values
    - Let's calculate what percent are missing. If the amount is very low, we can drop those rows instead of imputing them to save time

#### Using our missing_rows function again to find what overall percent of rows are missing values.

In [11]:
# passing dataframe to missing_rows function
# saving dataframe to variable so we can work with it more easily
missing_rows_df = missing_rows(df)

# summing values in the percent of rows missing column
# this tells us at most, what percent of rows still have missing values
total_percent_rows_missing = round((missing_rows_df.pct_rows_missing).sum(),0)

print(f'At most, {total_percent_rows_missing}% of rows contain missing values.')

At most, 4.0% of rows contain missing values.


- In the interest of time, we will drop the remaining rows with missing values
    - Given how small this amount is, it should be relatively inconsequential
    - If need be, we can add these rows back in at a later time
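
The 4% figure is an upper bound because it sums the per-column percentages, and a single row that is missing values in several columns gets counted once per column. If the exact share is wanted, it can be computed directly from the rows. The toy data below is made up purely to show the difference between the two calculations.

```python
import numpy as np
import pandas as pd

# toy frame: row 0 is missing in both columns, row 2 in one column
toy = pd.DataFrame({
    "lotsizesquarefeet": [np.nan, 5000.0, np.nan, 7000.0, 6000.0],
    "taxamount":         [np.nan, 1200.0, 900.0, 1500.0, 1100.0],
})

# sum of per-column percentages: counts row 0 twice
upper_bound_pct = toy.isna().mean().sum() * 100

# share of rows with at least one missing value: counts each row once
exact_pct = toy.isna().any(axis=1).mean() * 100

print(round(upper_bound_pct, 1), round(exact_pct, 1))
```

On the real data the two numbers should be close, since so few values are missing, but the exact version is what dropna() will actually remove.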

#### Dropping rows with missing values

In [12]:
# drop all rows with missing values
df.dropna(inplace=True)

#### RFE PREPARATION CHECKPOINT

At this point, we've removed a significant number of columns and taken care of our missing values.

Our remaining columns will be ranked using Recursive Feature Elimination (RFE) to determine which we will focus on during our initial venture through exploration. 

With that in mind, we'll need to prepare our columns for RFE by
- Encoding categorical columns
- Identifying and handling outliers
- Splitting our data into train, validate and test (although we only need train for RFE)
- Scaling the data

#### Before we encode our data, we need to identify our categorical columns so we know which columns to encode

In [13]:
df.nunique()

bathroomcnt                        13
bedroomcnt                         12
calculatedfinishedsquarefeet     4245
fips                                1
heatingorsystemtypeid               3
lotsizesquarefeet               16337
regionidcounty                      1
roomcnt                             1
taxvaluedollarcnt               31582
taxamount                       44224
logerror                        45105
dtype: int64

- After dropping all rows with null values, we see that fips, regionidcounty and roomcnt have been reduced to one unique value each.
    - We'll remove these columns since a column with only one unique value can't help us distinguish between properties
    - If need be, we can return to this stage, impute the missing values instead of dropping those rows, and retain these columns for exploration

#### Dropping columns with only 1 unique value remaining since all nulls were removed

In [14]:
# using function from prep.py to drop the columns left with a single unique value
drop_more_selected_columns(df).head()

Unnamed: 0,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,heatingorsystemtypeid,lotsizesquarefeet,taxvaluedollarcnt,taxamount,logerror
0,3.0,4.0,2376.0,2.0,13038.0,145143.0,1777.51,-0.10341
1,3.0,3.0,1312.0,2.0,278581.0,119407.0,1533.89,0.00694
2,3.0,4.0,2962.0,2.0,63000.0,773303.0,9516.26,-0.001011
4,3.0,4.0,3039.0,2.0,20028.0,220583.0,3104.19,-0.040966
5,3.0,2.0,1290.0,2.0,54048.0,371361.0,4557.32,-0.036763


- With those columns removed we can now encode our sole remaining categorical variable, heatingorsystemtypeid

#### Using zillow_dummy function from prep.py to create dummy variables for heatingorsystemtypeid

In [15]:
# saving resulting df to variable
df = zillow_dummy(df)

# previewing dummy data
df.head()

Unnamed: 0,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,lotsizesquarefeet,taxvaluedollarcnt,taxamount,logerror,heating_system_type_2,heating_system_type_7,heating_system_type_20
0,3.0,4.0,2376.0,13038.0,145143.0,1777.51,-0.10341,1,0,0
1,3.0,3.0,1312.0,278581.0,119407.0,1533.89,0.00694,1,0,0
2,3.0,4.0,2962.0,63000.0,773303.0,9516.26,-0.001011,1,0,0
4,3.0,4.0,3039.0,20028.0,220583.0,3104.19,-0.040966,1,0,0
5,3.0,2.0,1290.0,54048.0,371361.0,4557.32,-0.036763,1,0,0


- Dummy variables added successfully, now we can split the data
- All functions for the remainder of our prep phase will come from the __preprocessing.py__ file (not the __prep.py__ file)
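
The zillow_dummy function itself is defined in prep.py and not shown here. A minimal sketch of an equivalent encoding step, assuming it relies on pandas' `get_dummies` and the `heating_system_type` prefix seen in the output above, might look like the following. The function name and toy data are hypothetical.

```python
import pandas as pd


def zillow_dummy_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for prep.zillow_dummy.

    Replaces heatingorsystemtypeid with one 0/1 column per heating system code.
    Assumes the column has no missing values (rows with nulls were dropped earlier).
    """
    # cast 2.0 -> 2 so the new column names read heating_system_type_2, not _2.0
    codes = frame["heatingorsystemtypeid"].astype(int)
    dummies = pd.get_dummies(codes, prefix="heating_system_type", dtype=int)
    return pd.concat([frame.drop(columns="heatingorsystemtypeid"), dummies], axis=1)


toy = pd.DataFrame({
    "bathroomcnt": [3.0, 1.0, 2.0, 3.0],
    "heatingorsystemtypeid": [2.0, 7.0, 20.0, 2.0],
})

encoded = zillow_dummy_sketch(toy)
print(encoded)
```

All three dummy columns are kept here to mirror the notebook's output; for a linear model with an intercept, dropping one of them is a common alternative.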

#### Using split_data function from preprocessing.py to split data into x | y train, validate and test samples.

In [17]:
train, validate, test = split_data(df, 'logerror')

train.shape

(25463, 10)

- We use .shape to confirm that the number of rows in train reflects a split, and it does
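
The split_data function is defined in preprocessing.py, and its split proportions and random seed are not shown in this notebook. Purely as an illustration of a three-way split, the sketch below shuffles the rows once and slices them into train, validate and test. The 60/20/20 proportions, the seed, and the function name are assumptions, not the values used in our project.

```python
import numpy as np
import pandas as pd


def split_data_sketch(frame, validate_frac=0.2, test_frac=0.2, seed=123):
    """Shuffle rows once, then slice into train, validate and test (hypothetical proportions)."""
    shuffled = frame.sample(frac=1.0, random_state=seed)
    n_test = int(len(frame) * test_frac)
    n_validate = int(len(frame) * validate_frac)
    test_part = shuffled.iloc[:n_test]
    validate_part = shuffled.iloc[n_test:n_test + n_validate]
    train_part = shuffled.iloc[n_test + n_validate:]
    return train_part, validate_part, test_part


toy = pd.DataFrame({"logerror": np.linspace(-0.1, 0.1, 100)})
train_part, validate_part, test_part = split_data_sketch(toy)
print(train_part.shape, validate_part.shape, test_part.shape)
```

Because the original index is preserved after shuffling, the row labels in each sample are non-sequential, which is the same pattern visible in the train output later in this notebook.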

### OUTLIERS

#### We will now use our add_upper_outlier_columns function from preprocessing.py to add a column to our dataframe for each numerical column. 

#### The new columns will reflect how far above the upper boundary a corresponding outlier is, and if the corresponding value is not an outlier, it will contain a 0. 

In [18]:
add_upper_outlier_columns(train, k=6)

Unnamed: 0,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,lotsizesquarefeet,taxvaluedollarcnt,taxamount,logerror,heating_system_type_2,heating_system_type_7,heating_system_type_20,bathroomcnt_upper_outliers,bedroomcnt_upper_outliers,calculatedfinishedsquarefeet_upper_outliers,lotsizesquarefeet_upper_outliers,taxvaluedollarcnt_upper_outliers,taxamount_upper_outliers,logerror_upper_outliers,heating_system_type_2_upper_outliers,heating_system_type_7_upper_outliers,heating_system_type_20_upper_outliers
44015,3.0,2.0,1150.0,10868.0,168859.0,2122.76,-0.007175,1,0,0,0.0,0,0.0,0.0,0.0,0.0,0.000000,0,0,0.0
40095,3.0,5.0,1895.0,6746.0,428202.0,5519.32,0.003639,1,0,0,0.0,0,0.0,0.0,0.0,0.0,0.000000,0,0,0.0
41118,2.0,2.0,859.0,209776.0,143238.0,2046.38,-0.002396,0,1,0,0.0,0,0.0,117874.0,0.0,0.0,0.000000,0,0,0.0
15628,1.0,2.0,875.0,5399.0,211401.0,2849.73,-0.040422,0,1,0,0.0,0,0.0,0.0,0.0,0.0,0.000000,0,0,0.0
115,1.0,3.0,1720.0,39946.0,96224.0,1216.03,1.078275,0,1,0,0.0,0,0.0,0.0,0.0,0.0,0.635787,0,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19718,2.0,3.0,1379.0,6160.0,103712.0,1920.46,0.016483,0,1,0,0.0,0,0.0,0.0,0.0,0.0,0.000000,0,0,0.0
7197,2.0,3.0,1372.0,7032.0,92679.0,1967.51,-0.004263,1,0,0,0.0,0,0.0,0.0,0.0,0.0,0.000000,0,0,0.0
18228,2.0,2.0,1853.0,2476.0,596407.0,7084.85,0.022837,1,0,0,0.0,0,0.0,0.0,0.0,0.0,0.000000,0,0,0.0
15538,1.0,3.0,1461.0,6559.0,54418.0,862.94,0.028341,0,1,0,0.0,0,0.0,0.0,0.0,0.0,0.000000,0,0,0.0
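
The add_upper_outlier_columns function is defined in preprocessing.py and not shown here. A common way to define an upper bound with a multiplier k is the interquartile-range fence, Q3 + k * IQR. The sketch below assumes that definition (it may differ from our actual implementation) and returns how far each value sits above the fence, or 0 when it is not an outlier. The function name and toy values are hypothetical.

```python
import pandas as pd


def upper_outlier_distance(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Distance above the fence Q3 + k * IQR, or 0 for values at or below it."""
    q1, q3 = values.quantile([0.25, 0.75])
    upper_bound = q3 + k * (q3 - q1)
    # negative distances (values under the fence) are clipped to 0
    return (values - upper_bound).clip(lower=0)


# toy column with one extreme value
lot_sizes = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
distances = upper_outlier_distance(lot_sizes, k=6)
print(distances.tolist())
```

A larger k pushes the fence further out, so only the most extreme values are flagged.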


#### Now that we have our outlier data, we will use our upper_outlier_data_print function from preprocessing.py to see, on average, how far past the upper boundary each outlier is and how many there are per column.

In [19]:
upper_outlier_data_print(train)

~~~
bathroomcnt_upper_outliers
count    7.000000
mean     1.857143
std      1.069045
min      1.000000
25%      1.000000
50%      2.000000
75%      2.000000
max      4.000000
Name: bathroomcnt_upper_outliers, dtype: float64
~~~
bedroomcnt_upper_outliers
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: bedroomcnt_upper_outliers, dtype: float64
~~~
calculatedfinishedsquarefeet_upper_outliers
count       77.000000
mean      2178.129870
std       2793.506024
min         14.000000
25%        468.000000
50%       1111.000000
75%       2877.000000
max      14958.000000
Name: calculatedfinishedsquarefeet_upper_outliers, dtype: float64
~~~
lotsizesquarefeet_upper_outliers
count    2.440000e+03
mean     1.997289e+05
std      2.853832e+05
min      2.400000e+01
25%      4.321000e+04
50%      1.085470e+05
75%      2.504595e+05
max      3.497243e+06
Name: lotsizesquarefeet_upper_outliers, dtype: float64
~~~
taxvaluedollarcnt_upper_outliers

- The "count" and "mean" values above are what we are focused on
     - Count tells us how many upper outliers we have
     - Mean tells us how far above the upper bound (UB) they are on average
     - We're disregarding all heating systems columns because they are categorical
     - We're also disregarding logerror because we are not removing outliers from our target variable
         - We can't remove outliers from out-of-sample target variable data in many real-world scenarios and we want our models to reflect this


- In total, we have 3,192 upper outliers, which is roughly 12% of our dataset
    - This is a fairly significant amount, but this should be the last data removal step we do before RFE
    - Given that we set k so high earlier (6), these values are extreme outliers, so we cannot overlook them

- With all this noted, here is our plan
    - Drop all outlier rows
    - After RFE ranks our variables, we'll only be keeping the top 4 features (the 3 heating system types count as one feature collectively)
    - Rows that were dropped only because of an outlier in a feature we end up discarding may be restored afterwards, since that feature's outliers no longer affect the data we keep; this would let us retain more data (still to be confirmed)

#### dropping outliers

In [20]:
# drop every training row flagged as an upper outlier in one of the numeric features
# (logerror and the heating system dummies are intentionally left out)
outlier_flag_cols = ['bathroomcnt_upper_outliers', 'bedroomcnt_upper_outliers',
                     'calculatedfinishedsquarefeet_upper_outliers', 'lotsizesquarefeet_upper_outliers',
                     'taxvaluedollarcnt_upper_outliers', 'taxamount_upper_outliers']

for col in outlier_flag_cols:
    train.drop(train[train[col] > 0].index, inplace=True)

#### dropping upper outlier columns

In [21]:
outlier_cols = [col for col in train if col.endswith('_outliers')]

train = train.drop(columns = outlier_cols)

train.head()

Unnamed: 0,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,lotsizesquarefeet,taxvaluedollarcnt,taxamount,logerror,heating_system_type_2,heating_system_type_7,heating_system_type_20
44015,3.0,2.0,1150.0,10868.0,168859.0,2122.76,-0.007175,1,0,0
40095,3.0,5.0,1895.0,6746.0,428202.0,5519.32,0.003639,1,0,0
15628,1.0,2.0,875.0,5399.0,211401.0,2849.73,-0.040422,0,1,0
115,1.0,3.0,1720.0,39946.0,96224.0,1216.03,1.078275,0,1,0
16132,3.0,4.0,2569.0,11168.0,603335.0,7513.64,-0.006399,1,0,0


#### scale data

In [24]:
train, validate, test = data_scaler(train, validate, test)

In [25]:
train.head()

Unnamed: 0,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,lotsizesquarefeet,taxvaluedollarcnt,taxamount,logerror,heating_system_type_2,heating_system_type_7,heating_system_type_20
44015,0.333333,0.181818,0.148556,0.111172,0.064023,0.064408,-0.007175,1,0,0
40095,0.333333,0.454545,0.259452,0.065931,0.164285,0.174718,0.003639,1,0,0
15628,0.111111,0.181818,0.107621,0.051146,0.08047,0.088017,-0.040422,0,1,0
115,0.111111,0.272727,0.233403,0.430321,0.035942,0.03496,1.078275,0,1,0
16132,0.333333,0.363636,0.35978,0.114465,0.231992,0.239488,-0.006399,1,0,0
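
The scaled feature values above lie between 0 and 1, which is consistent with min-max scaling, although data_scaler is defined in preprocessing.py and its exact method is not shown here. As a sketch under that assumption, the hypothetical function below learns each column's minimum and range from train only and applies them to all three samples, leaving the target untouched.

```python
import pandas as pd


def min_max_scale_sketch(train, validate, test, columns):
    """Scale `columns` to [0, 1] using statistics learned from train only."""
    col_min = train[columns].min()
    col_range = train[columns].max() - col_min  # assumes no column is constant in train
    scaled = []
    for part in (train, validate, test):
        part = part.copy()  # leave the caller's dataframes untouched
        part[columns] = (part[columns] - col_min) / col_range
        scaled.append(part)
    return tuple(scaled)


toy_train = pd.DataFrame({"sqft": [1000.0, 2000.0, 3000.0], "logerror": [0.01, -0.02, 0.03]})
toy_validate = pd.DataFrame({"sqft": [2500.0], "logerror": [0.00]})
toy_test = pd.DataFrame({"sqft": [3500.0], "logerror": [0.05]})

s_train, s_validate, s_test = min_max_scale_sketch(toy_train, toy_validate, toy_test, ["sqft"])
print(s_train.sqft.tolist(), s_validate.sqft.tolist(), s_test.sqft.tolist())
```

Values in validate or test can fall outside [0, 1] when they exceed the range seen in train. That is expected, and it is the price of not letting out-of-sample data influence the scaler.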


#### RFE

In [43]:
# columns RFE is allowed to choose from (everything except the target)
feature_cols = ['bathroomcnt', 'bedroomcnt', 'calculatedfinishedsquarefeet', 'lotsizesquarefeet', 'taxvaluedollarcnt', 'taxamount', 'heating_system_type_2', 'heating_system_type_7', 'heating_system_type_20']
rfe_train = train[feature_cols]

# creating linear regression object
lm = LinearRegression()

# creating RFE object that uses our linear regression object and keeps only the single best feature
# (n_features_to_select must be passed by keyword in current scikit-learn releases)
rfe = RFE(lm, n_features_to_select=1)

# fitting RFE and reducing the data to the selected feature
x_rfe = rfe.fit_transform(rfe_train, train['logerror'])

# fitting our linear regression model to data
lm.fit(rfe_train, train['logerror'])

# boolean array that is True for the single selected feature and False otherwise
mask = rfe.support_

# creating list of the top feature using boolean mask
rfe_features = rfe_train.loc[:, mask].columns.tolist()

# creating array of ranking list
var_ranks = rfe.ranking_

# creating list of feature names
var_names = rfe_train.columns.tolist()

# combine ranks and names into a df
rfe_ranks_df = pd.DataFrame({'Var': var_names, 'Rank': var_ranks})

# sort the df by rank
rfe_ranks_df.sort_values('Rank')

Unnamed: 0,Var,Rank
2,calculatedfinishedsquarefeet,1
4,taxvaluedollarcnt,2
8,heating_system_type_20,3
7,heating_system_type_7,4
6,heating_system_type_2,5
0,bathroomcnt,6
1,bedroomcnt,7
5,taxamount,8
3,lotsizesquarefeet,9
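
To make the ranking above easier to interpret, here is a small hand-rolled sketch of the idea behind recursive feature elimination with a linear model: fit on the remaining features, drop the one with the smallest absolute coefficient, and repeat until one feature is left. This is a simplified illustration written with NumPy on synthetic data, not scikit-learn's implementation, and the function name is ours.

```python
import numpy as np


def rfe_ranking_sketch(X, y):
    """Rank features by recursive elimination on OLS coefficient magnitude (1 = best)."""
    n_features = X.shape[1]
    remaining = list(range(n_features))
    ranking = np.ones(n_features, dtype=int)
    while len(remaining) > 1:
        # design matrix: intercept column plus the features still in play
        design = np.column_stack([np.ones(len(X)), X[:, remaining]])
        coefs = np.linalg.lstsq(design, y, rcond=None)[0][1:]
        weakest = remaining[int(np.argmin(np.abs(coefs)))]
        ranking[weakest] = len(remaining)  # eliminated earlier means a larger rank number
        remaining.remove(weakest)
    return ranking


# synthetic data: feature 0 matters most, feature 1 a little, feature 2 not at all
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.01, size=200)

ranking = rfe_ranking_sketch(X, y)
print(ranking)
```

Because elimination is driven by coefficient size, the ranking depends on the features being on comparable scales, which is why we scaled the data before running RFE.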


## Explore

In [53]:
y_train

Unnamed: 0,logerror
44015,-0.007175
40095,0.003639
15628,-0.040422
115,1.078275
16132,-0.006399
...,...
19718,0.016483
7197,-0.004263
18228,0.022837
15538,0.028341


## Modeling

In [93]:
y_train = pd.DataFrame()

X_train = train[['bathroomcnt', 'bedroomcnt', 'calculatedfinishedsquarefeet', 'taxvaluedollarcnt', 'heating_system_type_2', 'heating_system_type_7', 'heating_system_type_20']]
y_train['logerror'] = train['logerror']

#### baseline

In [94]:
# make baseline prediction
y_train['baseline_pred'] = y_train['logerror'].mean()

# evaluate: rmse
rmse_train_bl = mean_squared_error(y_train.logerror, y_train.baseline_pred)**(1/2)

print("RMSE for baseline (mean logerror)\nTraining/In-Sample: ", rmse_train_bl)


RMSE for baseline (mean logerror)
Training/In-Sample:  0.1545438954603233
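
For reference, RMSE is the square root of the average squared difference between actual and predicted values. When the prediction is the mean of the target, as in our baseline, RMSE equals the population standard deviation of the target. The sketch below shows this with a hand-written `rmse` helper and a few made-up log error values.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# made-up log errors, for illustration only
logerror = np.array([-0.10, 0.01, 0.00, 0.10, -0.04])

# baseline: predict the mean log error for every row
baseline_pred = np.full_like(logerror, logerror.mean())
baseline_rmse = rmse(logerror, baseline_pred)

print(baseline_rmse, logerror.std())
```

Any model we build needs to beat this number on out-of-sample data to be worth using.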


In [96]:
# create the model object
# (the normalize argument was removed in scikit-learn 1.2; our features are already scaled)
lm = LinearRegression()

# fit the model to our training data. We must specify the column in y_train, 
# since we have converted it to a dataframe from a series! 
lm.fit(X_train, y_train.logerror)

# predict train
y_train['model_1_pred'] = lm.predict(X_train)

# evaluate: rmse
rmse_train = mean_squared_error(y_train.logerror, y_train.model_1_pred)**(1/2)

print("RMSE for OLS using LinearRegression\nTraining/In-Sample: ", rmse_train)

RMSE for OLS using LinearRegression
Training/In-Sample:  0.1543583059001428


#### model 1 validate

In [97]:
y_validate = pd.DataFrame()

X_validate = validate[['bathroomcnt', 'bedroomcnt', 'calculatedfinishedsquarefeet', 'taxvaluedollarcnt', 'heating_system_type_2', 'heating_system_type_7', 'heating_system_type_20']]
y_validate ['logerror'] = validate['logerror']

In [98]:
# predict validate
y_validate['model_1_pred'] = lm.predict(X_validate)

# evaluate: rmse
rmse_validate = mean_squared_error(y_validate.logerror, y_validate.model_1_pred)**(1/2)

print("RMSE for OLS using LinearRegression\nValidate/Out-of-Sample: ", rmse_validate)

RMSE for OLS using LinearRegression
Validate/Out-of-Sample:  0.15064000306877062


#### model 1 test

In [99]:
y_test = pd.DataFrame()

X_test = test[['bathroomcnt', 'bedroomcnt', 'calculatedfinishedsquarefeet', 'taxvaluedollarcnt', 'heating_system_type_2', 'heating_system_type_7', 'heating_system_type_20']]
y_test ['logerror'] = test['logerror']

# predict test
y_test['model_1_pred'] = lm.predict(X_test)

# evaluate: rmse
rmse_test = mean_squared_error(y_test.logerror, y_test.model_1_pred)**(1/2)

print("RMSE for OLS using LinearRegression\nTest/Out-of-Sample: ", rmse_test)

RMSE for OLS using LinearRegression
Test/Out-of-Sample:  0.15892246648602842


#### model 2 and 3

In [64]:
X_train2 = train[['calculatedfinishedsquarefeet', 'taxvaluedollarcnt', 'heating_system_type_2', 'heating_system_type_7', 'heating_system_type_20']]

# create the model object
# (the normalize argument was removed in scikit-learn 1.2; our features are already scaled)
lm2 = LinearRegression()

# fit the model to our training data. We must specify the column in y_train, 
# since we have converted it to a dataframe from a series! 
lm2.fit(X_train2, y_train.logerror)

# predict train
y_train['model_2_pred'] = lm2.predict(X_train2)

# evaluate: rmse
rmse_train2 = mean_squared_error(y_train.logerror, y_train.model_2_pred)**(1/2)

print("RMSE for OLS using LinearRegression\nTraining/In-Sample: ", rmse_train2)


RMSE for OLS using LinearRegression
Training/In-Sample:  0.15436388019163488


In [65]:
X_train3 = train[['bathroomcnt', 'bedroomcnt','calculatedfinishedsquarefeet', 'taxvaluedollarcnt']]

# create the model object
# (the normalize argument was removed in scikit-learn 1.2; our features are already scaled)
lm3 = LinearRegression()

# fit the model to our training data. We must specify the column in y_train, 
# since we have converted it to a dataframe from a series! 
lm3.fit(X_train3, y_train.logerror)

# predict train
y_train['model_3_pred'] = lm3.predict(X_train3)

# evaluate: rmse
rmse_train3 = mean_squared_error(y_train.logerror, y_train.model_3_pred)**(1/2)

print("RMSE for OLS using LinearRegression\nTraining/In-Sample: ", rmse_train3)

RMSE for OLS using LinearRegression
Training/In-Sample:  0.15437857245665929


#### lasso 

In [77]:
# create the model object
lars = LassoLars(alpha=1.0)

# fit the model to our training data. We must specify the column in y_train, 
# since we have converted it to a dataframe from a series! 
lars.fit(X_train, y_train.logerror)

# predict train
y_train['lassopred'] = lars.predict(X_train)

# evaluate: rmse
rmse_train = mean_squared_error(y_train.logerror, y_train.lassopred)**(1/2)


print("RMSE for Lasso + Lars\nTraining/In-Sample: ", rmse_train)

RMSE for Lasso + Lars
Training/In-Sample:  0.1545438954603233
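
The Lasso training RMSE above is identical to the baseline RMSE (0.1545438954603233). A plausible explanation is that alpha=1.0 is a strong enough penalty to shrink every coefficient to zero, leaving only the intercept, which is the mean log error. scikit-learn documents the Lasso objective as (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1, and under that objective all coefficients are zero once alpha reaches max |x_j · (y - mean(y))| / n_samples over the centered feature columns x_j. The sketch below computes that threshold with NumPy on synthetic data shaped loosely like ours (features in [0, 1], a target with a spread near 0.15). The helper name and the data are assumptions, and no extra feature normalization is applied.

```python
import numpy as np


def lasso_alpha_max(X, y):
    """Smallest alpha at which every Lasso coefficient is zero (intercept fitted)."""
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    return float(np.max(np.abs(X_centered.T @ y_centered)) / len(y))


# synthetic stand-in: scaled features and a noisy target with a weak signal
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 3))
y = 0.05 * X[:, 0] + rng.normal(scale=0.15, size=1000)

alpha_max = lasso_alpha_max(X, y)
print(alpha_max)
```

On data like this the threshold is far below 1.0, so a much smaller alpha would be needed before the Lasso model can differ from the baseline.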


#### validate

In [102]:
y_validate = pd.DataFrame()

X_validate = validate[['bathroomcnt', 'bedroomcnt', 'calculatedfinishedsquarefeet', 'taxvaluedollarcnt', 'heating_system_type_2', 'heating_system_type_7', 'heating_system_type_20']]
y_validate['logerror'] = validate['logerror']

In [103]:
# predict validate
y_validate['lassopred'] = lars.predict(X_validate)

# evaluate: rmse
rmse_validate = mean_squared_error(y_validate.logerror, y_validate.lassopred)**(1/2)


print("RMSE for Lasso + Lars\nValidate/Out-of-Sample: ", rmse_validate)

RMSE for Lasso + Lars
Validate/Out-of-Sample:  0.1506561262103518


#### test

In [83]:
y_test = pd.DataFrame()

X_test = test[['bathroomcnt', 'bedroomcnt', 'calculatedfinishedsquarefeet', 'taxvaluedollarcnt', 'heating_system_type_2', 'heating_system_type_7', 'heating_system_type_20']]
y_test['logerror'] = test['logerror']

In [101]:
# predict test
y_test['lassopred'] = lars.predict(X_test)

# evaluate: rmse
rmse_test = mean_squared_error(y_test.logerror, y_test.lassopred)**(1/2)


print("RMSE for Lasso + Lars\nTest/Out-of-Sample: ", rmse_test)

RMSE for Lasso + Lars
Test/Out-of-Sample:  0.15915430114482182


## Overall Conclusion and Takeaways