**Goal:** Predict "log-error"

**Hypothesis:**

In [127]:
#ignore warnings
import warnings
warnings.filterwarnings("ignore")

#wrangling
import pandas as pd
import numpy as np

#explore
import scipy.stats as stats

#visuals
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns


#default pandas decimal display formatting
pd.options.display.float_format='{:20,.2f}'.format

import env
import acquire
import prepare
# import summarize

### Acquire & Summarize


Acquired Zillow data using acquire.py (the SQL query is in that file).

In [10]:
df = acquire.get_zillow_data()
df.head()

Unnamed: 0,tax_rate,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,fips,garagecarcnt,garagetotalsqft,latitude,longitude,lotsizesquarefeet,poolcnt,poolsizesum,taxvaluedollarcnt,yearbuilt,landtaxvaluedollarcnt,taxdelinquencyflag,taxdelinquencyyear,logerror,transactions
0,0.01,3.5,4.0,3100.0,6059.0,2.0,633.0,33634931.0,-117869207.0,4506.0,,,1023282.0,1998.0,537569.0,,,0.03,1
1,0.01,1.0,2.0,1465.0,6111.0,1.0,0.0,34449266.0,-119281531.0,12647.0,,,464000.0,1967.0,376000.0,,,0.06,1
2,0.01,2.0,3.0,1243.0,6059.0,2.0,440.0,33886168.0,-117823170.0,8432.0,1.0,,564778.0,1962.0,479489.0,,,0.01,1
3,0.01,3.0,4.0,2376.0,6037.0,,,34245180.0,-118240722.0,13038.0,1.0,,145143.0,1970.0,36225.0,,,-0.1,1
4,0.01,3.0,4.0,2962.0,6037.0,,,34145202.0,-118179824.0,63000.0,1.0,,773303.0,1950.0,496619.0,,,-0.0,1


In [None]:
# df.transactiondate = pd.to_datetime(df.transactiondate, format='%Y-%m-%d')
# df = df.sort_values("transactiondate").drop_duplicates('parcelid',keep='last') 

In [128]:
df = pd.read_csv("zillow_dataframe.csv")

In [129]:
df = df.drop(columns="Unnamed: 0")

In [130]:
df = df.drop(columns=["garagetotalsqft", "poolsizesum", "taxdelinquencyflag", "taxdelinquencyyear", "transactions"])

In [131]:
# df["county_name"] = df["fips"].map({6037: "Los Angeles", 6059: "Orange", 6111: "Ventura"})

In [132]:
df.drop(columns="county_name", inplace=True)

In [134]:
df.shape

(52169, 14)

In [135]:
cols = ["poolcnt", "garagecarcnt"]
# df = prepare.fill_zero(df, cols=cols)

In [109]:
# df.info()

In [110]:
# df.describe()

In [111]:
# df.dtypes

In [112]:
# pd.DataFrame(df.columns)

In [138]:
df = prepare.handle_missing_values(df)

In [139]:
df.isnull().sum()

tax_rate                          5
bathroomcnt                       0
bedroomcnt                        0
calculatedfinishedsquarefeet      8
fips                              0
latitude                          0
longitude                         0
lotsizesquarefeet               354
taxvaluedollarcnt                 1
yearbuilt                        40
landtaxvaluedollarcnt             1
logerror                          0
dtype: int64

In [115]:
# df['price_per_sq_ft'] = df.taxvaluedollarcnt/df.calculatedfinishedsquarefeet
# df['yard_sq_ft'] = df.lotsizesquarefeet - df.calculatedfinishedsquarefeet

In [116]:
# df.dropna(inplace=True)

In [140]:
df.head()

Unnamed: 0,tax_rate,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,fips,latitude,longitude,lotsizesquarefeet,taxvaluedollarcnt,yearbuilt,landtaxvaluedollarcnt,logerror
0,0.01,3.5,4.0,3100.0,6059.0,33634931.0,-117869207.0,4506.0,1023282.0,1998.0,537569.0,0.03
1,0.01,1.0,2.0,1465.0,6111.0,34449266.0,-119281531.0,12647.0,464000.0,1967.0,376000.0,0.06
2,0.01,2.0,3.0,1243.0,6059.0,33886168.0,-117823170.0,8432.0,564778.0,1962.0,479489.0,0.01
3,0.01,3.0,4.0,2376.0,6037.0,34245180.0,-118240722.0,13038.0,145143.0,1970.0,36225.0,-0.1
4,0.01,3.0,4.0,2962.0,6037.0,34145202.0,-118179824.0,63000.0,773303.0,1950.0,496619.0,-0.0


In [118]:
# df = prepare.numeric_to_category(df, cols)

In [119]:
df.dtypes

tax_rate                        float64
bathroomcnt                     float64
bedroomcnt                      float64
calculatedfinishedsquarefeet    float64
fips                            float64
latitude                        float64
longitude                       float64
lotsizesquarefeet               float64
taxvaluedollarcnt               float64
yearbuilt                       float64
landtaxvaluedollarcnt           float64
logerror                        float64
dtype: object

In [141]:
df.latitude = df.latitude / 1_000_000 
df.longitude = df.longitude / 1_000_000 

### Clustering fips - aka binning the 3 different counties

In [142]:
from sklearn.cluster import KMeans

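X = df[['fips']].copy()  # copy so that adding a column does not write to a view of df

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

# cluster labels are arbitrary: which county gets 0, 1, or 2 can differ between runs
X['cluster'] = pd.Series(kmeans.predict(X), index=X.index).astype(str)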

In [143]:
X.shape

(52169, 2)

In [144]:
X = X.dropna()

In [145]:
X = X.drop(columns='fips')

In [146]:
df['cluster'] = X.cluster

In [147]:
df.head()

Unnamed: 0,tax_rate,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,fips,latitude,longitude,lotsizesquarefeet,taxvaluedollarcnt,yearbuilt,landtaxvaluedollarcnt,logerror,cluster
0,0.01,3.5,4.0,3100.0,6059.0,33.63,-117.87,4506.0,1023282.0,1998.0,537569.0,0.03,2
1,0.01,1.0,2.0,1465.0,6111.0,34.45,-119.28,12647.0,464000.0,1967.0,376000.0,0.06,1
2,0.01,2.0,3.0,1243.0,6059.0,33.89,-117.82,8432.0,564778.0,1962.0,479489.0,0.01,2
3,0.01,3.0,4.0,2376.0,6037.0,34.25,-118.24,13038.0,145143.0,1970.0,36225.0,-0.1,0
4,0.01,3.0,4.0,2962.0,6037.0,34.15,-118.18,63000.0,773303.0,1950.0,496619.0,-0.0,0


## Encoding - encoding the 3 clusters made from fips data

In [148]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [149]:
def encode(df, col_name):
    
    encoded_values = sorted(list(df[col_name].unique()))

    # Integer Encoding
    int_encoder = LabelEncoder()
    encoded = int_encoder.fit_transform(df[col_name])

    # reshape the integer codes into the 2D array OneHotEncoder expects
    df_array = encoded.reshape(-1, 1)

    # One Hot Encoding
    # (scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`)
    ohe = OneHotEncoder(sparse=False, categories='auto')
    df_ohe = ohe.fit_transform(df_array)

    # Turn the array of new values into a data frame whose column names are the
    # original values and whose index matches df, then join it onto df
    df_encoded = pd.DataFrame(data=df_ohe, columns=encoded_values, index=df.index)
    df = df.join(df_encoded)

    return df, ohe

In [150]:
df, ohe = encode(df, 'cluster')

In [151]:
ohe.inverse_transform(df[['0', '1', '2']])

array([[2],
       [1],
       [2],
       ...,
       [0],
       [0],
       [0]])

In [153]:
df = df.drop(columns='cluster')

df['los_angeles'] = df['0']
df['ventura'] = df['1']
df['orange'] = df['2']

df = df.drop(columns=['0', '1', '2'])
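
The county names assigned above rely on KMeans having labelled Los Angeles as 0, Ventura as 1, and Orange as 2 in this particular run; cluster labels are arbitrary and can differ between runs. Because `fips` takes only three values, a direct mapping avoids that dependency. The sketch below is one possible alternative, not what this notebook ran; `FIPS_TO_COUNTY` and `add_county_dummies` are hypothetical names introduced for illustration.

```python
import pandas as pd

# FIPS codes for the three counties (float keys, because fips is stored as float64).
FIPS_TO_COUNTY = {6037.0: "los_angeles", 6059.0: "orange", 6111.0: "ventura"}


def add_county_dummies(df):
    """Return a new frame with one 0/1 float column per county, derived from fips."""
    county = df["fips"].map(FIPS_TO_COUNTY)
    dummies = pd.get_dummies(county).astype(float)
    return df.join(dummies)


# Tiny illustrative frame, not the real data.
demo = pd.DataFrame({"fips": [6059.0, 6111.0, 6037.0]})
demo_encoded = add_county_dummies(demo)
print(demo_encoded)
```

With this approach the column names come from the FIPS code itself, so no manual renaming of `'0'`, `'1'`, `'2'` is needed.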

In [162]:
df.head()

Unnamed: 0,tax_rate,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,fips,latitude,longitude,lotsizesquarefeet,taxvaluedollarcnt,yearbuilt,landtaxvaluedollarcnt,logerror,los_angeles,ventura,orange
0,0.01,3.5,4.0,3100.0,6059.0,33.63,-117.87,4506.0,1023282.0,1998.0,537569.0,0.03,0.0,0.0,1.0
1,0.01,1.0,2.0,1465.0,6111.0,34.45,-119.28,12647.0,464000.0,1967.0,376000.0,0.06,0.0,1.0,0.0
2,0.01,2.0,3.0,1243.0,6059.0,33.89,-117.82,8432.0,564778.0,1962.0,479489.0,0.01,0.0,0.0,1.0
3,0.01,3.0,4.0,2376.0,6037.0,34.25,-118.24,13038.0,145143.0,1970.0,36225.0,-0.1,1.0,0.0,0.0
4,0.01,3.0,4.0,2962.0,6037.0,34.15,-118.18,63000.0,773303.0,1950.0,496619.0,-0.0,1.0,0.0,0.0


## Split Data

In [171]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(X, test_size=.30, random_state=123)

In [172]:
train.shape

(36518, 1)

In [173]:
train.describe().T

Unnamed: 0,count,unique,top,freq
cluster,36518,3,0,23631


In [174]:
X_train = train.drop(columns="logerror")

y_train = train[["logerror"]]

X_test = test.drop(columns="logerror")

y_test = test[["logerror"]]

KeyError: "['logerror'] not found in axis"
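
The `KeyError` above follows from splitting `X`, which at this point holds only the `cluster` column, so neither `train` nor `test` contains `logerror`. Splitting the full frame and then separating the target avoids this. A minimal sketch on made-up data (the `demo` frame and its values are illustrative assumptions, not the Zillow data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared frame: two features plus the target.
rng = np.random.default_rng(123)
demo = pd.DataFrame({
    "bedroomcnt": rng.integers(1, 6, size=100).astype(float),
    "calculatedfinishedsquarefeet": rng.uniform(800, 4000, size=100),
    "logerror": rng.normal(0, 0.1, size=100),
})

# Split the whole frame so the target travels with its features.
train, test = train_test_split(demo, test_size=.30, random_state=123)

X_train = train.drop(columns="logerror")
y_train = train["logerror"]
X_test = test.drop(columns="logerror")
y_test = test["logerror"]

print(X_train.shape, X_test.shape)
```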

In [175]:
# df_nums_train = train.select_dtypes(exclude="category")

# df_nums_test = test.select_dtypes(exclude="category")

# df_nums_train.shape

In [176]:
# x_df_nums_train = df_nums_train.drop(columns="logerror")

# y_df_nums_train = df_nums_train[["logerror"]]

# x_df_nums_test = df_nums_test.drop(columns="logerror")

# y_df_nums_test = df_nums_test[["logerror"]]

# x_df_nums_train.head().T

In [177]:
# y_df_nums_train.head().T

### Model df - Random Forest

In [178]:
from sklearn.ensemble import RandomForestRegressor

In [179]:
train_cluster = train_cluster.drop(columns='cluster')

NameError: name 'train_cluster' is not defined

In [180]:
test_cluster = test_cluster.drop(columns='cluster')

NameError: name 'test_cluster' is not defined

In [181]:
rf = RandomForestRegressor(n_estimators=100)

In [182]:
rf.fit(X_train, y_train)

NameError: name 'X_train' is not defined

In [None]:
from sklearn.metrics import mean_squared_error

y_pred = rf.predict(X_train)  # predict on the same features the model was fit on
print(f'root mean squared error = {mean_squared_error(y_train, y_pred) ** 0.5}')

In [None]:
y_pred = rf.predict(X_test)  # test features must match the columns used in fit
print(f'root mean squared error = {mean_squared_error(y_test, y_pred) ** 0.5}')
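
Since the modeling cells above did not run, here is a hedged end-to-end sketch of the intended step on synthetic data: fit a `RandomForestRegressor`, then report RMSE on train and test. The data, the `rmse` helper, and the choice of 20 trees are assumptions for illustration. One detail worth noting: `**` binds tighter than `/`, so an expression such as `mse**1/2` evaluates to `mse / 2`; the helper takes the square root of the whole mean squared error instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the whole MSE."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Synthetic features and a noisy target that depends weakly on the first feature.
rng = np.random.default_rng(123)
features = rng.uniform(0, 1, size=(200, 3))
target = 0.1 * features[:, 0] + rng.normal(0, 0.05, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=.30, random_state=123
)

rf = RandomForestRegressor(n_estimators=20, random_state=123)
rf.fit(X_train, y_train)

print(f"train RMSE = {rmse(y_train, rf.predict(X_train)):.4f}")
print(f"test RMSE  = {rmse(y_test, rf.predict(X_test)):.4f}")
```

Comparing the two numbers gives a rough sense of overfitting: a train RMSE far below the test RMSE suggests the forest is memorizing noise.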

# Below is previous cleaning work. Ignore.

- Number of missing values in each column

In [None]:
number_missing = df.isnull().sum()

- Fraction of rows with a missing value, per column

In [None]:
pct_missing = (df.isnull().sum())/df.shape[0]

In [None]:
nulls_by_column_df = pd.DataFrame({'num_rows_missing': number_missing, 'pct_rows_missing': pct_missing})

In [None]:
def nulls_by_col(df):
    number_missing = df.isnull().sum()
    pct_missing = (df.isnull().sum())/df.shape[0]
    # build and return the local summary instead of relying on a global of the same name
    nulls_by_column_df = pd.DataFrame({'num_rows_missing': number_missing, 'pct_rows_missing': pct_missing})
    return nulls_by_column_df

In [None]:
nulls_by_column_df

In [None]:
df.fips.unique()

In [None]:
def nulls_by_row(df):
    num_cols_missing = df.isnull().sum(axis=1)
    pct_cols_missing = df.isnull().sum(axis=1)/df.shape[1]*100
    rows_missing = pd.DataFrame({'num_cols_missing': num_cols_missing, 'pct_cols_missing': pct_cols_missing}).reset_index().groupby(['num_cols_missing','pct_cols_missing']).count().rename(index=str, columns={'index': 'num_rows'}).reset_index()
    return rows_missing

In [None]:
num_cols_missing = df.isnull().sum(axis=1)
pct_cols_missing = df.isnull().sum(axis=1)/df.shape[1]*100
rows_missing = pd.DataFrame({'num_cols_missing': num_cols_missing, 'pct_cols_missing': pct_cols_missing}).reset_index().groupby(['num_cols_missing','pct_cols_missing']).count().rename(index=str, columns={'index': 'num_rows'}).reset_index()

In [None]:
rows_missing

### Prepare

1. Remove any properties that are likely to be something other than single unit properties. (e.g. no duplexes, no land/lot, ...). There are multiple ways to estimate that a property is a single unit, and there is not a single "right" answer. But for this exercise, do not purely filter by unitcnt as we did previously. Add some new logic that will reduce the number of properties that are falsely removed. You might want to use # bedrooms, square feet, unit type or the like to then identify those with unitcnt not defined.

#### Single Unit Properties (as defined by Jeff Hutchins)

Single Family Residential = 52320

Residential General = 37

Rural Residence = 0

Mobile Home = 74

Manufactured, Modular, Prefabricated Homes = 58

Inferred Single Family Residential = 0

Bungalow = 0

## Going to create a new column called price_per_sq_ft, use a clustering method called K-means clustering to find clusters of prices, and compare it to latitude and longitude points.

In [None]:
df.head()

In [None]:
df_subset = df[['calculatedfinishedsquarefeet', 'taxvaluedollarcnt', 'latitude', 'longitude']].copy()
df_subset.head()

In [None]:
df_subset['price_per_sq_ft'] = df_subset.taxvaluedollarcnt/df_subset.calculatedfinishedsquarefeet

In [None]:
df_subset.head()

In [None]:
df_subset = df_subset.dropna()

In [None]:
df_subset.isnull().sum()

This graph shows the shape of the combination of all three counties.  The different colors represent the clusters.  The clusters are based on price per square foot, latitude, and longitude.

Used KMeans clustering on price per square foot, latitude, and longitude.  Then used the clusters as a hue to map it onto a 2D graph with longitude on the x-axis and latitude on the y-axis.  

Notes:

**Try to insert a slider for n_clusters**

**Try adding more variables**
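
One way to explore `n_clusters`, whether through an interactive control or by hand, is to compare KMeans inertia across several values of k and look for the point where the improvement flattens. The sketch below uses synthetic blobs rather than the Zillow data; `inertia_by_k` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_by_k(data, k_values, seed=123):
    """Fit KMeans once per k and return {k: inertia}."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data).inertia_
        for k in k_values
    }


# Three well-separated synthetic blobs in 2D (illustrative only).
rng = np.random.default_rng(123)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
data = np.vstack([center + rng.normal(0, 1, size=(50, 2)) for center in centers])

inertias = inertia_by_k(data, range(1, 9))
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On data like this, inertia drops sharply up to the true number of blobs and only slowly afterwards, which is the "elbow" to look for.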

In [None]:
from sklearn.cluster import KMeans

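X = df_subset[['price_per_sq_ft', 'latitude', 'longitude']].copy()

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

# align labels on X's index: dropna() left gaps, so a default RangeIndex would misalign rows
X['cluster'] = 'clusters ' + pd.Series(kmeans.predict(X), index=X.index).astype(str)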

sns.relplot(data=X, hue='cluster', x='longitude', y='latitude')
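
KMeans uses Euclidean distance, so a feature measured in hundreds (price per square foot) can dominate features that span only a degree or two (latitude and longitude). Standardizing the columns first puts them on comparable footing. This is a hedged sketch on synthetic values, not the notebook's method; `cluster_scaled` is a hypothetical helper, and the value ranges are made up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_scaled(frame, cols, k=3, seed=123):
    """Standardize cols, fit KMeans, and return labels aligned to frame's index."""
    scaled = StandardScaler().fit_transform(frame[cols])
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled)
    return pd.Series(kmeans.labels_, index=frame.index, name="cluster").astype(str)


rng = np.random.default_rng(123)
demo = pd.DataFrame(
    {
        "price_per_sq_ft": rng.uniform(50, 900, size=150),
        "latitude": rng.uniform(33.4, 34.8, size=150),
        "longitude": rng.uniform(-119.4, -117.6, size=150),
    },
    index=range(1000, 1150),  # non-default index, as a frame has after dropna()
)

demo["cluster"] = cluster_scaled(demo, ["price_per_sq_ft", "latitude", "longitude"])
print(demo["cluster"].value_counts())
```

Returning the labels as a Series on the frame's own index keeps every row matched to its cluster even when the index has gaps.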

In [None]:
print(kmeans.labels_)
print(kmeans.inertia_)
print(kmeans.n_iter_)
print(kmeans.cluster_centers_)

In [None]:
df_subset.head()

In [None]:
from sklearn.cluster import KMeans

X = df_subset[['price_per_sq_ft']].copy()

kmeans = KMeans(n_clusters=8)
kmeans.fit(X)

In [None]:
print(kmeans.labels_)

print(kmeans.cluster_centers_)

In [None]:
# use the fitted labels and X's own index so rows stay aligned after dropna()
X['cluster'] = 'clusters ' + pd.Series(kmeans.labels_, index=X.index).astype(str)

In [None]:
X['latitude'] = df_subset['latitude']

In [None]:
X['longitude'] = df_subset['longitude']

This graph shows the shape of the combination of all three counties.  The different colors represent the clusters.  The clusters are based on price per square foot.

Used KMeans clustering on price per square foot.  Then used the clusters as a hue to map it onto a 2D graph with longitude on the x-axis and latitude on the y-axis.  

In [None]:
sns.relplot(data=X, hue='cluster', x='longitude', y='latitude')

In [None]:
df[['latitude', 'longitude']].head()

In [None]:
# df['latitude'].value_counts()

In [None]:
# df['longitude'].value_counts()
# pd.DataFrame(pd.cut(df['longitude'], bins=[-120_000_000, -119_000_000, -118_000_000, -117_000_000, -116_000_000]))

Also try lot size minus sq ft of house

In [None]:
df.head()

**Creating a new feature using the lot size square footage minus the square footage of the house**

In [None]:
df['yard_square_footage'] = (df['lotsizesquarefeet'] - df['calculatedfinishedsquarefeet'])

In [None]:
df.head()

**Project Planning** graph ideas, hypotheses, doodles, data dictionary

**Acquire**

**Prep** - Nulls, outliers, visualize distribution, drop variables

**Split Data**

**Impute** - Don't want to use test data as evidence of what to impute.  Use train to find imputer then transform train and test

**Scale** - Variables can be scaled differently; a simple default is min-max scaling to the range 0 to 1.

**explore, visualize, clustering, stats, testing, etc.** (in no particular order)
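
The impute and scale steps above share one rule: learn the parameters from train only, then apply them to both train and test. A minimal sketch with scikit-learn's `SimpleImputer` and `MinMaxScaler`; the tiny frames and the median strategy are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Tiny illustrative frames, not the real data.
train = pd.DataFrame({"lotsizesquarefeet": [4000.0, 6000.0, np.nan, 8000.0]})
test = pd.DataFrame({"lotsizesquarefeet": [np.nan, 1000.0]})

# Learn the fill value (median) from train only, then apply it to both sets.
imputer = SimpleImputer(strategy="median").fit(train)
train_imputed = imputer.transform(train)
test_imputed = imputer.transform(test)

# Learn min and max from train only, then apply the same scaling to both sets.
scaler = MinMaxScaler().fit(train_imputed)
train_scaled = pd.DataFrame(
    scaler.transform(train_imputed), columns=train.columns, index=train.index
)
test_scaled = pd.DataFrame(
    scaler.transform(test_imputed), columns=test.columns, index=test.index
)

print(train_scaled)
print(test_scaled)
```

Because the scaler only knows train's range, test values can land outside 0 to 1 (here the 1,000 sq ft lot scales below 0), which is expected rather than a bug.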

Audience is the class.  Work with partner.  Choose one of three ways to apply clustering.  Zillow Data.  Go from end to end.  Share the highlights of the discoveries, what we uncovered, exploration, modeling.  What we've learned and how it relates to data science.

What we learned as it relates to domain, what we learned as it relates to data science, and what we learned as it relates to clustering.

We are trying to predict log error because we want to help Zillow improve its Zestimate. Which features drive the error? The target is a continuous variable.

If we encode, do so after clustering.  Have to encode your clusters.  Also makes it easier to visualize the data.

1. Audience: class, fellow learners
2. Deliverable: Notebook with supporting files
    - clean, easy to read
    - separate modules (acq, prep, etc.)
3. Team of 2
4. Clustering (are these clusters drivers of the target?)
    - Clusters for features
    - Clusters for explorations
    - Clustering target variable (binning)
5. Analysis/takeaways with a model.  What is the best model you made?  

Has to have:
 - statistical testing
 - visualizations of clusters
 - clusters
 - models
 - summary


**Visualizations**

continuous vs continuous, relplot

discrete vs continuous, t-test (group pool or not pool and compare prices)

two discrete, chi-squared test, pandas crosstabs, clustering
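
The two tests named above can be sketched with `scipy.stats`: a t-test comparing prices for homes with and without a pool, and a chi-squared test on a pandas crosstab of two discrete columns. The data here is synthetic and built so the pool group is more expensive; the `has_pool` and `county` columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(123)
n = 200
demo = pd.DataFrame({
    "has_pool": np.repeat([1, 0], n),
    "county": np.tile(["los_angeles", "orange", "ventura", "orange"], n // 2),
    "taxvaluedollarcnt": np.concatenate([
        rng.normal(600_000, 50_000, n),  # homes with a pool
        rng.normal(400_000, 50_000, n),  # homes without a pool
    ]),
})

# Discrete vs continuous: two-sample t-test (Welch's, no equal-variance assumption).
pool_prices = demo.loc[demo["has_pool"] == 1, "taxvaluedollarcnt"]
no_pool_prices = demo.loc[demo["has_pool"] == 0, "taxvaluedollarcnt"]
t_stat, t_p = stats.ttest_ind(pool_prices, no_pool_prices, equal_var=False)
print(f"t = {t_stat:.2f}, p = {t_p:.3g}")

# Two discrete variables: chi-squared test of independence on a crosstab.
observed = pd.crosstab(demo["has_pool"], demo["county"])
chi2, chi_p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {chi_p:.3g}, dof = {dof}")
```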

Do stuff, Learn Clustering, Model

In [184]:
df.columns.tolist()

['tax_rate',
 'bathroomcnt',
 'bedroomcnt',
 'calculatedfinishedsquarefeet',
 'fips',
 'latitude',
 'longitude',
 'lotsizesquarefeet',
 'taxvaluedollarcnt',
 'yearbuilt',
 'landtaxvaluedollarcnt',
 'logerror',
 'los_angeles',
 'ventura',
 'orange']

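Test each feature against log error to see whether they are related: a t-test for a two-group (discrete) feature, or a correlation test for a continuous feature.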
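
For a continuous feature, a Pearson correlation test is one way to check for a linear relationship with log error. The sketch below uses synthetic arrays in which the target depends on the feature by construction, so it only shows the mechanics, not a finding about the Zillow data.

```python
import numpy as np
from scipy import stats

# Synthetic feature and target with a built-in linear relationship plus noise.
rng = np.random.default_rng(123)
feature = rng.normal(0, 1, size=500)
logerror = 0.5 * feature + rng.normal(0, 1, size=500)

r, p_value = stats.pearsonr(feature, logerror)
print(f"r = {r:.3f}, p = {p_value:.3g}")
```

A small p-value suggests the linear association is unlikely to be chance, while `r` indicates its direction and strength.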