<img src="https://1000logos.net/wp-content/uploads/2017/11/Zillow-Logo.png" title="Zillow Logo"/>

***

- Imports used throughout this notebook.

In [1]:
# fetches the data
import acquire
# credentials file to access the data
import env
# helper functions for summarizing, cleaning, and visualizing the data
import wrangle

# general-purpose libraries
import math
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from pydataset import data

# needed for modeling
import sklearn.preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, QuantileTransformer
from sklearn.metrics import explained_variance_score
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cluster import KMeans

***

Fips dictionary:
- 6037.0 = Los Angeles, CA
- 6059.0 = Orange, CA
- 6111.0 = Ventura, CA
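
The dictionary above can be kept in code so that county labels are available for plots and tables. This is a minimal sketch; the names `FIPS_TO_COUNTY` and `county_name` are illustrative and do not come from wrangle.py.

```python
# FIPS county codes as they appear in the `fips` column (stored as floats)
FIPS_TO_COUNTY = {
    6037.0: "Los Angeles, CA",
    6059.0: "Orange, CA",
    6111.0: "Ventura, CA",
}

def county_name(fips_code):
    """Return the county label for a FIPS code, or 'Unknown' if unmapped."""
    return FIPS_TO_COUNTY.get(float(fips_code), "Unknown")

print(county_name(6037.0))
print(county_name(6111))
```

With pandas, the same dictionary could be applied to a whole column with `df['fips'].map(FIPS_TO_COUNTY)`.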

## PLAN
### 1. [Acquire Data](#Acquire:)
### Takeaways:
- Data is collected from the Codeup cloud database with an appropriate SQL query.
- Data is imported using an acquire.py file.
- The original DataFrame consists of 71858 rows × 69 columns.
- Null values are very common, affecting about 50 percent of the data.
 
### 2. [Prepare](#Prepare:)
### - Using [wrangle.py]()
### Takeaways:
- Before cleaning the data and dropping unnecessary columns: 71858 rows × 69 columns.
- After dropping nulls and columns: 44679 rows × 12 columns.
- This results in 62% row retention and 17% column retention.
- We then split the data into train, validate, and test sets for exploration and modeling.
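
The retention figures above follow directly from the two shapes. A small sketch of the arithmetic, using a hypothetical helper named `retention`:

```python
def retention(before, after):
    """Return the share of items kept, as a percentage."""
    return 100 * after / before

rows_before, cols_before = 71858, 69   # shape after acquisition
rows_after, cols_after = 44679, 12     # shape reported after preparation

print(f"row retention:    {retention(rows_before, rows_after):.0f}%")
print(f"column retention: {retention(cols_before, cols_after):.0f}%")
```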

### 3. [Explore](#Explore:)
### Summary:
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between bathroomcnt and logerror.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between calculatedfinishedsquarefeet and logerror.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between bedroomcnt and logerror.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between yearbuilt and logerror.
- Based on the correlation test, we fail to reject the null hypothesis: fips and logerror show no significant correlation.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between taxvaluedollarcnt and logerror.

### 4. [Modeling](#Modeling:)

# <span style="color:blue">Acquiring Data</span>

In [2]:
# importing and acquiring the data set
df = acquire.get_zillow_data()

### - Summary

In [3]:
# summary function for DataFrame
wrangle.summarize(df)

--- Shape: (71858, 69)
--- Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71858 entries, 0 to 71857
Data columns (total 69 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   propertylandusetypeid         71858 non-null  float64
 1   parcelid                      71858 non-null  int64  
 2   storytypeid                   47 non-null     float64
 3   typeconstructiontypeid        223 non-null    float64
 4   heatingorsystemtypeid         46680 non-null  float64
 5   buildingclasstypeid           0 non-null      float64
 6   architecturalstyletypeid      207 non-null    float64
 7   airconditioningtypeid         23069 non-null  float64
 8   id                            71858 non-null  int64  
 9   basementsqft                  47 non-null     float64
 10  bathroomcnt                   71858 non-null  float64
 11  bedroomcnt                    71858 non-null  float64
 12  buildingqualitytypeid      

--- Nulls by Column: propertylandusetypeid               0
parcelid                            0
storytypeid                     71811
typeconstructiontypeid          71635
heatingorsystemtypeid           25178
buildingclasstypeid             71858
architecturalstyletypeid        71651
airconditioningtypeid           48789
id                                  0
basementsqft                    71811
bathroomcnt                         0
bedroomcnt                          0
buildingqualitytypeid           26795
calculatedbathnbr                 223
decktypeid                      71269
finishedfloor1squarefeet        66161
calculatedfinishedsquarefeet      156
finishedsquarefeet12              335
finishedsquarefeet13            71856
finishedsquarefeet15            71847
finishedsquarefeet50            66161
finishedsquarefeet6             71692
fips                                0
fireplacecnt                    63754
fullbathcnt                       223
garagecarcnt                 

***

# <span style="color:blue">Prepare:</span>
- all these functions will be found in the explore.py file

###  What percentage of data is missing per column?

In [4]:
#looking at percentage of null values by column
wrangle.nulls_by_columns(df).sort_values(by= 'percent', ascending=False)

Unnamed: 0,count,percent
buildingclasstypeid,71858,1.0
buildingclassdesc,71858,1.0
finishedsquarefeet13,71856,0.999972
finishedsquarefeet15,71847,0.999847
storydesc,71811,0.999346
basementsqft,71811,0.999346
storytypeid,71811,0.999346
yardbuildingsqft26,71788,0.999026
finishedsquarefeet6,71692,0.99769
fireplaceflag,71686,0.997606


### Takeaways:

- A large percentage of the data is missing in a lot of the columns, ranging from regionidneighborhood with 60% of its data missing to buildingclasstypeid with 100% missing.
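
`wrangle.nulls_by_columns` is not shown in this notebook. A minimal sketch of how such a helper might be written, assuming it returns one row per column with `count` and `percent` fields (as the output above suggests), could look like this. The toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

def nulls_by_columns(df):
    """Count missing values per column and express them as a share of rows."""
    count = df.isna().sum()
    percent = count / len(df)
    return pd.DataFrame({"count": count, "percent": percent})

# toy data: 'b' is half missing, 'c' is entirely missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, np.nan],
    "c": [np.nan] * 4,
})
print(nulls_by_columns(toy).sort_values(by="percent", ascending=False))
```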

### Functions to clean data:

- Before addressing the missing data, let's build a function to handle outliers in the features we plan to explore.

In [5]:
# IQR multiplier that defines the outlier bounds; 1.5 is the conventional choice
k = 1.5
# columns whose outliers we want to handle
cols = ['bathroomcnt', 'bedroomcnt', 'calculatedfinishedsquarefeet', 'yearbuilt', 'lotsizesquarefeet']

def handle_outliers(df, cols, k):
    """Remove rows that fall outside (q1 - k*iqr, q3 + k*iqr) for any column in cols.

    Use k = 1.5 if unsure, since it is the most common choice. Define cols
    (the features to check) before calling the function. Rows with a missing
    value in any of cols are also removed, because comparisons against NaN
    evaluate to False.
    """
    # compute the bounds for every column before any rows are removed
    bounds_dict = {}

    for col in cols:
        # get necessary iqr values
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1

        # store the bounds in a dictionary keyed by column name
        bounds_dict[col] = {
            'upper_bound': q3 + k * iqr,
            'lower_bound': q1 - k * iqr,
        }

    for col in cols:
        # retrieve bounds
        col_upper_bound = bounds_dict[col]['upper_bound']
        col_lower_bound = bounds_dict[col]['lower_bound']

        # keep only rows strictly inside the bounds for this column
        df = df[(df[col] < col_upper_bound) & (df[col] > col_lower_bound)]

    return df

df = handle_outliers(df, cols, k)
df

Unnamed: 0,propertylandusetypeid,parcelid,storytypeid,typeconstructiontypeid,heatingorsystemtypeid,buildingclasstypeid,architecturalstyletypeid,airconditioningtypeid,id,basementsqft,...,id.1,logerror,transactiondate,airconditioningdesc,architecturalstyledesc,buildingclassdesc,heatingorsystemdesc,typeconstructiondesc,storydesc,propertylandusedesc
0,261.0,14297519,,,,,,,1727539,,...,0,0.025595,2017-01-01,,,,,,,Single Family Residential
1,261.0,17052889,,,,,,,1387261,,...,1,0.055619,2017-01-01,,,,,,,Single Family Residential
2,261.0,14186244,,,,,,,11677,,...,2,0.005383,2017-01-01,,,,,,,Single Family Residential
3,261.0,12177905,,,2.0,,,,2288172,,...,3,-0.103410,2017-01-01,,,,Central,,,Single Family Residential
5,266.0,17143294,,,,,,,1447245,,...,5,-0.020526,2017-01-01,,,,,,,Condominium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71851,261.0,12412492,,,2.0,,,,2274245,,...,77607,0.001082,2017-09-19,,,,Central,,,Single Family Residential
71854,261.0,17239384,,,,,,,2968375,,...,77610,0.013209,2017-09-21,,,,,,,Single Family Residential
71855,261.0,12773139,,,2.0,,,1.0,1843709,,...,77611,0.037129,2017-09-21,Central,,,Central,,,Single Family Residential
71856,261.0,12826780,,,2.0,,,,1187175,,...,77612,0.007204,2017-09-25,,,,Central,,,Single Family Residential


## Takeaways
- Before removing outliers, we had 71858 rows × 69 columns.
- After removing outliers, we ended with 49964 rows × 69 columns.
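
One detail of the IQR filter is easy to miss: a row whose value is missing in a checked column is dropped as well, because a comparison against NaN is False. Part of the row loss at this step may therefore come from missing values rather than true outliers. The sketch below shows both effects on made-up numbers; `iqr_filter` is an illustrative single-column version, not the notebook's function.

```python
import numpy as np
import pandas as pd

def iqr_filter(df, col, k=1.5):
    """Keep rows whose value in `col` lies strictly inside the IQR bounds."""
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    return df[(df[col] > q1 - k * iqr) & (df[col] < q3 + k * iqr)]

# toy data: one extreme value and one missing value
toy = pd.DataFrame({"sqft": [1000.0, 1100.0, 1200.0, 1300.0, 1400.0, 9000.0, np.nan]})
kept = iqr_filter(toy, "sqft")

# both the 9000 outlier and the NaN row are gone
print(kept["sqft"].tolist())
```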

### Let's see what we're left with

In [6]:
# view how many null values exist in each column
wrangle.nulls_by_columns(df).sort_values(by= 'percent', ascending=False)

Unnamed: 0,count,percent
buildingclassdesc,49964,1.0
buildingclasstypeid,49964,1.0
finishedsquarefeet15,49964,1.0
finishedsquarefeet13,49963,0.99998
basementsqft,49938,0.99948
storydesc,49938,0.99948
storytypeid,49938,0.99948
yardbuildingsqft26,49921,0.999139
architecturalstyletypeid,49896,0.998639
architecturalstyledesc,49896,0.998639


### Let's address the rest of the null values

### Columns to remove: (columns, reason for removal)
- (id, id.1, parcelid, propertylandusetypeid, buildingqualitytypeid): identifier columns are not necessary for our algorithms and would confuse any models from here on.
- (fullbathcnt, calculatedbathnbr, roomcnt): any room count other than bedroomcnt or bathroomcnt is not necessary, since these columns return similar or combined information.
- (propertyzoningdesc, rawcensustractandblock, regionidcounty, censustractandblock): fips is being kept for region identification, so these columns are not necessary.
- (assessmentyear, landtaxvaluedollarcnt, taxamount, transactiondate): the data is already filtered to the year 2017 and we are keeping taxvaluedollarcnt, so this information can be obtained from the columns we keep.
- (heatingorsystemdesc, finishedsquarefeet12, propertylandusedesc, propertycountylandusecode, unitcnt): calculatedfinishedsquarefeet already covers the square footage, and heatingorsystemtypeid already identifies the heating system numerically.

In [7]:
def drop_columns(df):
    """Drop the columns that won't be used in the exploration stage of this project."""
    # drop function to remove columns
    df = df.drop(columns=['heatingorsystemtypeid','buildingqualitytypeid','id','parcelid','calculatedbathnbr','propertylandusetypeid','fullbathcnt','propertyzoningdesc','rawcensustractandblock','regionidcounty',
    'roomcnt','structuretaxvaluedollarcnt','assessmentyear','landtaxvaluedollarcnt','taxamount','censustractandblock',
    'id.1','transactiondate','heatingorsystemdesc','finishedsquarefeet12','propertylandusedesc','propertycountylandusecode','unitcnt'])
    return df


def split(df):
    """Split the data into train, validate, and test sets in order to explore, model, and test."""
    train_and_validate, test = train_test_split(df, random_state=13, test_size=.15)
    train, validate = train_test_split(train_and_validate, random_state=13, test_size=.2)

    print('Train: %d rows, %d cols' % train.shape)
    print('Validate: %d rows, %d cols' % validate.shape)
    print('Test: %d rows, %d cols' % test.shape)
    
    return train, validate, test    

def handle_missing_values(df, prop_required_column, prop_required_row):
    """Drop rows and columns that are missing too much data, then drop unused columns and any remaining nulls."""
    print('Before dropping nulls, %d rows, %d cols' % df.shape)
    # minimum number of non-null values a column / row needs in order to be kept
    n_required_column = round(df.shape[0] * prop_required_column)
    n_required_row = round(df.shape[1] * prop_required_row)
    # drop rows that have fewer than n_required_row non-null values
    df = df.dropna(axis=0, thresh=n_required_row)
    # drop columns that have fewer than n_required_column non-null values
    df = df.dropna(axis=1, thresh=n_required_column)
    # drop the columns we decided not to keep
    df = drop_columns(df)
    # drop any rows that still contain nulls
    df = df.dropna()
    print('After dropping nulls. %d rows. %d cols' % df.shape)
    return df

def get_exploration_data(df):
    """Drop rows and columns with more than 50% of their data missing."""
    # a value of .5 keeps only rows and columns that are at least 50% populated
    df = handle_missing_values(df, prop_required_column=.5, prop_required_row=.5)
    return df

# call the function and keep the cleaned DataFrame
df = get_exploration_data(df)
df

Before dropping nulls, 49964 rows, 69 cols
After dropping nulls. 44092 rows. 12 cols


Unnamed: 0,bathroomcnt,bedroomcnt,calculatedfinishedsquarefeet,fips,latitude,longitude,lotsizesquarefeet,regionidcity,regionidzip,yearbuilt,taxvaluedollarcnt,logerror
1,1.0,2.0,1465.0,6111.0,34449266.0,-119281531.0,12647.0,13091.0,97099.0,1967.0,464000.0,0.055619
2,2.0,3.0,1243.0,6059.0,33886168.0,-117823170.0,8432.0,21412.0,97078.0,1962.0,564778.0,0.005383
3,3.0,4.0,2376.0,6037.0,34245180.0,-118240722.0,13038.0,396551.0,96330.0,1970.0,145143.0,-0.103410
5,2.0,3.0,1492.0,6111.0,34230044.0,-118993991.0,903.0,51239.0,97091.0,1982.0,331064.0,-0.020526
7,1.0,2.0,738.0,6037.0,34149214.0,-118239357.0,4214.0,45457.0,96325.0,1922.0,218552.0,0.101723
...,...,...,...,...,...,...,...,...,...,...,...,...
71851,2.0,4.0,1633.0,6037.0,33870815.0,-118070858.0,4630.0,30267.0,96204.0,1962.0,346534.0,0.001082
71854,2.0,4.0,1612.0,6111.0,34300140.0,-118706327.0,12105.0,27110.0,97116.0,1964.0,67205.0,0.013209
71855,1.0,3.0,1032.0,6037.0,34040895.0,-118038169.0,5074.0,36502.0,96480.0,1954.0,49546.0,0.037129
71856,2.0,3.0,1762.0,6037.0,33937685.0,-117996709.0,6347.0,14634.0,96171.0,1955.0,522000.0,0.007204


## Takeaways
- Before dropping additional nulls and columns, we had 49964 rows × 69 columns.
- After dropping additional nulls and columns, we ended with 44092 rows × 12 columns.
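
The `thresh` argument of `dropna` is the minimum number of non-null values a row or column must have to be kept, which is why the proportions are converted to counts first. The sketch below applies the same 50% rule to a made-up three-column frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "full":   [1.0, 2.0, 3.0, 4.0],
    "half":   [1.0, np.nan, 3.0, np.nan],
    "sparse": [np.nan, np.nan, np.nan, 4.0],
})

prop_required = 0.5
# a column needs at least this many non-null values to survive: round(4 * 0.5) == 2
n_required_column = round(toy.shape[0] * prop_required)
# a row needs at least this many non-null values to survive: round(3 * 0.5) == 2
n_required_row = round(toy.shape[1] * prop_required)

# rows first, then columns, in the same order as the notebook
rows_kept = toy.dropna(axis=0, thresh=n_required_row)
cols_kept = rows_kept.dropna(axis=1, thresh=n_required_column)

print(rows_kept.index.tolist())
print(cols_kept.columns.tolist())
```

Row 1 has only one non-null value and is dropped; `sparse` then has a single non-null value left and is dropped as a column.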

# Clustering

### How many clusters should we make?

In [10]:
# build X, the features to cluster on (KMeans cannot handle missing values, so drop them)
X = df[['bathroomcnt', 'calculatedfinishedsquarefeet','bedroomcnt','yearbuilt','taxvaluedollarcnt']].dropna()

In [11]:
# visualize the drop in inertia to estimate how many clusters work best
with plt.style.context('seaborn-whitegrid'):
    #graph size
    plt.figure(figsize=(9, 6))
    pd.Series({k: KMeans(k).fit(X).inertia_ for k in range(2, 12)}).plot(marker='x')
    plt.xticks(range(2, 12))
    plt.xlabel('k')
    plt.ylabel('inertia')
    plt.title('Change in inertia as k increases')

### Takeaways:
- 2 point inertia drop from 2-3
- 1 point inertia drop from 3-4

### Conclusion:
- we will create 3 clusters

In [None]:
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

kmeans.predict(X)

In [None]:
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=X.columns)
centroids

In [None]:
# predict on train's own rows so the labels line up with train's index
train['cluster'] = kmeans.predict(train[X.columns])


In [None]:
# select several columns with a list (double brackets)
train.groupby('cluster')[['bathroomcnt', 'calculatedfinishedsquarefeet','bedroomcnt','yearbuilt','taxvaluedollarcnt']].mean()


In [None]:
train

In [None]:
#graph size 
plt.figure(figsize=(14, 9))
# scatter plot of the data after clustering, one color per cluster
for cluster, subset in train.groupby('cluster'):
    plt.scatter(subset.taxvaluedollarcnt, subset.calculatedfinishedsquarefeet, label='cluster ' + str(cluster), alpha=.6)

centroids.plot.scatter(y='calculatedfinishedsquarefeet', x='taxvaluedollarcnt', c='black', marker='x', s=1000, ax=plt.gca(), label='centroid')

plt.legend()
plt.xlabel('price')
plt.ylabel('sq ft')
plt.title('Visualizing Cluster Centers')


In [None]:
# look at the first 5 rows of our new dataframe, transposed
centroids.head().T

In [None]:
cluster1 =train[train.cluster == 0]
cluster1

In [None]:
# map the homes by location, colored by their cluster label
plt.scatter(train.longitude, train.latitude, c=train.cluster)


In [None]:
cluster2 = train[train.cluster == 1]
cluster2

In [None]:
cluster3 = train[train.cluster == 2]
cluster3

# <span style="color:blue">Split Data:</span>

In [None]:
    
# split the prepared data into train, validate, and test
train, validate, test = split(df)
# inspect the train split
train.info()

In [None]:
#percent of null values
wrangle.nulls_by_columns(train).sort_values(by= 'percent', ascending=False)

***

In [None]:
#train sample
train

In [None]:
#validate sample
validate

In [None]:
# test sample
test

---


# <span style="color:blue">Explore:</span>

### Does logerror differ across bathroomcnt, bedroomcnt, calculatedfinishedsquarefeet, yearbuilt, & taxvaluedollarcnt?

In [None]:
#train info
train.info()

In [None]:
# columns in train
train.columns

In [None]:
cols_features = ['bathroomcnt', 'bedroomcnt', 'calculatedfinishedsquarefeet', 'yearbuilt', 'taxvaluedollarcnt', 'fips']
target_variable = ['logerror']

In [None]:
# graph each column separately
for col in train.columns:
    #graph size
    plt.figure(figsize=(4,2))
    #histogram graph
    plt.hist(train[col])
    #title of column
    plt.title(col)
    # show graph
    plt.show()

### Takeaways:
- logerror, yearbuilt, calculatedfinishedsquarefeet, bedroomcnt, and bathroomcnt tend to be skewed to the right.
- Using this information, we will run a simple correlation test for each feature.
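
Skew read off a histogram can be checked numerically. A minimal sketch using pandas' `skew` on made-up values (not the Zillow data); on the real data the equivalent call would be `train.skew()`:

```python
import pandas as pd

# toy values: one column with a long right tail, one symmetric column
toy = pd.DataFrame({
    "right_tail": [1, 1, 2, 2, 2, 3, 3, 4, 5, 20],
    "symmetric":  [1, 2, 3, 4, 5, 5, 6, 7, 8, 9],
})

# positive skewness = longer right tail, negative = longer left tail
print(toy.skew())
```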

---

***
### Is there a correlation between bathroomcnt & logerror ?
- Null Hypothesis  = there is no correlation between the bathroomcnt of a home and logerror

- Alternative Hypothesis  = there is a correlation between the bathroomcnt of a home and logerror

In [None]:
# lineplot to Visualize correlation
sns.lineplot(data=train, x = 'bathroomcnt' ,y="logerror")

In [None]:
print("Is there a relationship\nbetween bathroomcnt and logerror?")
# graph correlation using jointplot
sns.jointplot(x="bathroomcnt", y="logerror", data=train)
# x label
plt.xlabel("bathroomcnt")
# y label
plt.ylabel("log error")
# show the graph
plt.show()

In [None]:
#set the alpha to .05
alpha = .05
# correlation test between bathroomcnt and logerror
corr, p = stats.pearsonr(train.bathroomcnt, train.logerror)
#correlation test summary
print("correlation:", corr,",","p value:",p)
if p < alpha:
    print(f'Pvalue is: {p} is less than alpha: {alpha}')
    print("Reject the null hypothesis because there is a correlation present")
else:
    print(f'Pvalue is: {p} is greater than alpha: {alpha}')
    print("We fail to reject the null hypothesis because there is no significant correlation present")

### Takeaways:
- Based on the line graph above, the range of logerror widens when bathroomcnt is greater than 3.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between bathroomcnt and logerror.

---

***
### Is there a correlation between calculatedfinishedsquarefeet & logerror ?
- Null Hypothesis  = there is no correlation between the calculatedfinishedsquarefeet of a home and logerror

- Alternative Hypothesis  = there is a correlation between the calculatedfinishedsquarefeet of a home and logerror

In [None]:
# lineplot to Visualize correlation
sns.lineplot(data=train, x = 'calculatedfinishedsquarefeet' ,y="logerror")

In [None]:
print("Is there a relationship\nbetween calculatedfinishedsquarefeet and logerror?")
# graph correlation using jointplot
sns.jointplot(x="calculatedfinishedsquarefeet", y="logerror", data=train)
# x label
plt.xlabel("calculatedfinishedsquarefeet")
# y label
plt.ylabel("log error")
# show the graph
plt.show()

In [None]:
#set the alpha to .05
alpha = .05
# correlation test between calculatedfinishedsquarefeet and logerror
corr, p = stats.pearsonr(train.calculatedfinishedsquarefeet, train.logerror)
#correlation test summary
print("correlation:", corr,",","p value:",p)
if p < alpha:
    print(f'Pvalue is: {p} is less than alpha: {alpha}')
    print("Reject the null hypothesis because there is a correlation present")
else:
    print(f'Pvalue is: {p} is greater than alpha: {alpha}')
    print("We fail to reject the null hypothesis because there is no significant correlation present")

### Takeaways:
- From the line graph above, the range of logerror widens when calculatedfinishedsquarefeet is less than roughly 1000 or greater than roughly 2800.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between calculatedfinishedsquarefeet and logerror.

---

***
### Is there a correlation between bedroomcnt & logerror ?
- Null Hypothesis  = there is no correlation between the bedroomcnt  of a home and logerror

- Alternative Hypothesis  = there is a correlation between the bedroomcnt  of a home and logerror

In [None]:
# lineplot to Visualize correlation
sns.lineplot(data=train, x = 'bedroomcnt' ,y="logerror")

In [None]:
print("Is there a relationship\nbetween bedroomcnt and logerror?")
# graph correlation using jointplot
sns.jointplot(x="bedroomcnt", y="logerror", data=train)
# x label
plt.xlabel("bedroom cnt")
# y label
plt.ylabel("log error")
# show the graph
plt.show()

In [None]:
#set the alpha to .05
alpha = .05
# correlation test between bedroomcnt and logerror
corr, p = stats.pearsonr(train.bedroomcnt, train.logerror)
#correlation test summary
print("correlation:", corr,",","p value:",p)
if p < alpha:
    print(f'Pvalue is: {p} is less than alpha: {alpha}')
    print("Reject the null hypothesis because there is a correlation present")
else:
    print(f'Pvalue is: {p} is greater than alpha: {alpha}')
    print("We fail to reject the null hypothesis because there is no significant correlation present")

### Takeaways:
- Based on the line graph above, the range of logerror widens when bedroomcnt is less than 2 or greater than 5.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between bedroomcnt and logerror.

---

---
### Is there a correlation between yearbuilt & logerror ?
- Null Hypothesis  = there is no correlation between the yearbuilt  of a home and logerror

- Alternative Hypothesis  = there is a correlation between the yearbuilt  of a home and logerror

In [None]:
# lineplot to Visualize correlation
sns.lineplot(data=train, x = 'yearbuilt' ,y="logerror")

In [None]:
print("Is there a relationship\nbetween yearbuilt and logerror?")
# graph correlation using jointplot
sns.jointplot(x="yearbuilt", y="logerror", data=train)
# x label
plt.xlabel("yearbuilt")
# y label
plt.ylabel("log error")
# show the graph
plt.show()

In [None]:
#set the alpha to .05
alpha = .05
# correlation test between yearbuilt and logerror
corr, p = stats.pearsonr(train.yearbuilt, train.logerror)
#correlation test summary
print("correlation:", corr,",","p value:",p)
if p < alpha:
    print(f'Pvalue is: {p} is less than alpha: {alpha}')
    print("Reject the null hypothesis because there is a correlation present")
else:
    print(f'Pvalue is: {p} is greater than alpha: {alpha}')
    print("We fail to reject the null hypothesis because there is no significant correlation present")

### Takeaways:
- Based on the line graph above, the range of logerror widens when yearbuilt is before the 1940s.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between yearbuilt and logerror.

---

***
### Is there a correlation between fips & logerror ?
- Null Hypothesis  = there is no correlation between the fips  of a home and logerror

- Alternative Hypothesis  = there is a correlation between the fips  of a home and logerror

In [None]:
# lineplot to Visualize correlation
sns.lineplot(data=train, x = 'fips' ,y="logerror")

In [None]:
print("Is there a relationship\nbetween fips and logerror?")
# graph correlation using jointplot
sns.jointplot(x="fips", y="logerror", data=train)
# x label
plt.xlabel("fips")
# y label
plt.ylabel("log error")
# show the graph
plt.show()

In [None]:
#set the alpha to .05
alpha = .05
# correlation test between fips and logerror
corr, p = stats.pearsonr(train.fips, train.logerror)
#correlation test summary
print("correlation:", corr,",","p value:",p)
if p < alpha:
    print(f'Pvalue is: {p} is less than alpha: {alpha}')
    print("Reject the null hypothesis because there is a correlation present")
else:
    print(f'Pvalue is: {p} is greater than alpha: {alpha}')
    print("We fail to reject the null hypothesis because there is no significant correlation present")

### Takeaways:
- Based on the correlation test above, we fail to reject the null hypothesis: fips and logerror show no significant correlation.

***

***
### Is there a correlation between taxvaluedollarcnt & logerror ?
- Null Hypothesis  = there is no correlation between the taxvaluedollarcnt  of a home and logerror

- Alternative Hypothesis  = there is a correlation between the taxvaluedollarcnt  of a home and logerror

In [None]:
# lineplot to Visualize correlation
sns.lineplot(data=train, x = 'taxvaluedollarcnt' ,y="logerror")

In [None]:
print("Is there a relationship\nbetween taxvaluedollarcnt and logerror?")
# graph correlation using jointplot
sns.jointplot(x="taxvaluedollarcnt", y="logerror", data=train)
# x label
plt.xlabel("taxvaluedollarcnt")
# y label
plt.ylabel("log error")
# show the graph
plt.show()

In [None]:
#set the alpha to .05
alpha = .05
# correlation test between taxvaluedollarcnt and logerror
corr, p = stats.pearsonr(train.taxvaluedollarcnt, train.logerror)
#correlation test summary
print("correlation:", corr,",","p value:",p)
if p < alpha:
    print(f'Pvalue is: {p} is less than alpha: {alpha}')
    print("Reject the null hypothesis because there is a correlation present")
else:
    print(f'Pvalue is: {p} is greater than alpha: {alpha}')
    print("We fail to reject the null hypothesis because there is no significant correlation present")

### Takeaways:
- Based on the line graph above, the range of logerror widens when taxvaluedollarcnt is below the midrange.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between taxvaluedollarcnt and logerror.
***

## Exploration Summary
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between bathroomcnt and logerror.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between calculatedfinishedsquarefeet and logerror.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between bedroomcnt and logerror.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between yearbuilt and logerror.
- Based on the correlation test, we fail to reject the null hypothesis: fips and logerror show no significant correlation.
- After running a correlation test, we reject the null hypothesis and conclude that there is a correlation between taxvaluedollarcnt and logerror.
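
The six tests above repeat the same steps, so they could be collected in one loop. The sketch below assumes a hypothetical helper named `correlation_tests` and runs it on synthetic data rather than the Zillow train split. It also reports the correlation coefficient next to the p-value: with tens of thousands of rows, even a very weak correlation can be statistically significant.

```python
import numpy as np
import pandas as pd
from scipy import stats

def correlation_tests(df, features, target, alpha=0.05):
    """Run a Pearson correlation test of each feature against the target."""
    rows = []
    for feature in features:
        corr, p = stats.pearsonr(df[feature], df[target])
        rows.append({"feature": feature, "corr": corr, "p": p, "reject_null": p < alpha})
    return pd.DataFrame(rows).set_index("feature")

# synthetic stand-in for train (not Zillow data)
rng = np.random.default_rng(13)
n = 500
toy = pd.DataFrame({"related": rng.normal(size=n), "unrelated": rng.normal(size=n)})
toy["target"] = 0.8 * toy["related"] + rng.normal(scale=0.5, size=n)

print(correlation_tests(toy, ["related", "unrelated"], "target"))
```

On the notebook's data the call would be `correlation_tests(train, cols_features, 'logerror')`.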

---

# Modeling:

In [None]:
# looking at the head info for train data set
train.head()

In [None]:
# features we'll be working with
features = ['bathroomcnt',
                 'bedroomcnt',
                 'calculatedfinishedsquarefeet',
                 'yearbuilt',
                 'taxvaluedollarcnt',
                 'latitude', 
                 'longitude',
                 'lotsizesquarefeet',
                 'regionidcity',
                 'regionidzip',
                 'fips']
#columns we will be scaling
scale_columns = ['yearbuilt',  
                   'taxvaluedollarcnt', 
                   'bathroomcnt', 
                   'calculatedfinishedsquarefeet','logerror']

# X_train must come from the same rows as y_train
X_train = train[['yearbuilt',  
                   'taxvaluedollarcnt', 
                   'bathroomcnt', 
                   'calculatedfinishedsquarefeet']]
y_train = train['logerror']


X_validate = validate[['yearbuilt',  
                   'taxvaluedollarcnt', 
                   'bathroomcnt', 
                   'calculatedfinishedsquarefeet']]
y_validate = validate['logerror']


X_test = test[['yearbuilt',  
                   'taxvaluedollarcnt', 
                   'bathroomcnt', 
                   'calculatedfinishedsquarefeet']]
y_test = test['logerror']

def scale_data(train, 
               validate, 
               test, 
               columns_to_scale=['yearbuilt',  
                   'taxvaluedollarcnt', 
                   'bathroomcnt', 
                   'calculatedfinishedsquarefeet'],
               return_scaler=False):
    '''
    Scales the 3 data splits. 
    Takes in train, validate, and test data splits and returns their scaled counterparts.
    The scaler is fit on train only, then applied to all three splits.
    If return_scaler is True, the scaler object will be returned as well.
    '''
    train_scaled = train.copy()
    validate_scaled = validate.copy()
    test_scaled = test.copy()
    
    scaler = MinMaxScaler()
    scaler.fit(train[columns_to_scale])
    
    train_scaled[columns_to_scale] = pd.DataFrame(scaler.transform(train[columns_to_scale]),
                                                  columns=train[columns_to_scale].columns.values).set_index([train.index.values])
                                                  
    validate_scaled[columns_to_scale] = pd.DataFrame(scaler.transform(validate[columns_to_scale]),
                                                  columns=validate[columns_to_scale].columns.values).set_index([validate.index.values])
    
    test_scaled[columns_to_scale] = pd.DataFrame(scaler.transform(test[columns_to_scale]),
                                                 columns=test[columns_to_scale].columns.values).set_index([test.index.values])
    
    if return_scaler:
        return scaler, train_scaled, validate_scaled, test_scaled
    else:
        return train_scaled, validate_scaled, test_scaled

scaler, X_train_scaled, X_validate_scaled, X_test_scaled = scale_data(X_train, X_validate, X_test, return_scaler=True)

# We need y_train and y_validate to be dataframes to append the new columns with predicted values.
y_train = pd.DataFrame(y_train)
y_validate = pd.DataFrame(y_validate)
# 1. Predict logerror_pred_mean (baseline computed on train only)
logerror_pred_mean = y_train.logerror.mean()
# creating a logerror_pred_mean column for y_train
y_train['logerror_pred_mean'] = logerror_pred_mean
# the validate baseline uses the train mean, so nothing is learned from validate
y_validate['logerror_pred_mean'] = logerror_pred_mean
# 2. compute logerror_pred_median (again from train only)
logerror_pred_median = y_train.logerror.median()
# creating a predictive median column for y_train
y_train['logerror_pred_median'] = logerror_pred_median
# creating a predictive median column for y_validate
y_validate['logerror_pred_median'] = logerror_pred_median
# 3. RMSE of logerror_pred_mean
rmse_train = mean_squared_error(y_train.logerror, y_train.logerror_pred_mean)**(1/2)
rmse_validate = mean_squared_error(y_validate.logerror, y_validate.logerror_pred_mean)**(1/2)
print("RMSE using Mean\nTrain/In-Sample: ", round(rmse_train, 2),
      "\nValidate/Out-of-Sample: ", round(rmse_validate, 2))
# 4. RMSE of logerror_pred_median
rmse_train = mean_squared_error(y_train.logerror, y_train.logerror_pred_median)**(1/2)
rmse_validate = mean_squared_error(y_validate.logerror, y_validate.logerror_pred_median)**(1/2)
print("RMSE using Median\nTrain/In-Sample: ", round(rmse_train, 2),
      "\nValidate/Out-of-Sample: ", round(rmse_validate, 2))

# fit a linear regression on the scaled training features;
# y_train is now a DataFrame, so select the logerror column as the target
model = LinearRegression().fit(X_train_scaled, y_train.logerror)
predictions = model.predict(X_train_scaled)

y_test = pd.DataFrame(y_test)

y_test
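
The baseline above and each model in this section are scored with RMSE on both train and validate. A minimal sketch of that metric with NumPy only, using a hypothetical `rmse` helper and made-up logerror values:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# made-up logerror values for illustration only
y_train_toy = np.array([0.02, -0.01, 0.05, -0.03, 0.01])
y_validate_toy = np.array([0.03, -0.02, 0.00, 0.04])

# the baseline predicts the train mean for every home, including validate homes
baseline = y_train_toy.mean()
print("baseline train RMSE:   ", round(rmse(y_train_toy, np.full_like(y_train_toy, baseline)), 4))
print("baseline validate RMSE:", round(rmse(y_validate_toy, np.full_like(y_validate_toy, baseline)), 4))
```

A model is only worth keeping if its validate RMSE is lower than the baseline's validate RMSE.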

### LinearRegression (ols)

# create the model object (the features are already scaled, and the
# normalize argument was removed in scikit-learn 1.2)
lm = LinearRegression()

# fit the model to our training data. We must specify the column in y_train, 
# since we have converted it to a dataframe from a series! 
lm.fit(X_train_scaled, y_train.logerror)

# predict train
y_train['logerror_pred_lm'] = lm.predict(X_train_scaled)

# evaluate: rmse
rmse_train = mean_squared_error(y_train.logerror, y_train.logerror_pred_lm)**(1/2)

# predict validate
y_validate['logerror_pred_lm'] = lm.predict(X_validate_scaled)

# evaluate: rmse
rmse_validate_lm = mean_squared_error(y_validate.logerror, y_validate.logerror_pred_lm)**(1/2)

print("RMSE for OLS using LinearRegression\nTraining/In-Sample: ", rmse_train, 
      "\nValidation/Out-of-Sample: ", rmse_validate_lm)

### LassoLars (lars)

# create the model object
lars = LassoLars(alpha=1.0)

# fit the model to our training data. We must specify the column in y_train, 
# since we have converted it to a dataframe from a series! 
lars.fit(X_train_scaled, y_train.logerror)

# predict train
y_train['logerror_pred_lars'] = lars.predict(X_train_scaled)

# evaluate: rmse
rmse_train_lars = mean_squared_error(y_train.logerror, y_train.logerror_pred_lars)**(1/2)

# predict validate
y_validate['logerror_pred_lars'] = lars.predict(X_validate_scaled)

# evaluate: rmse
rmse_validate_lars = mean_squared_error(y_validate.logerror, y_validate.logerror_pred_lars)**(1/2)

print("RMSE for Lasso + Lars\nTraining/In-Sample: ", rmse_train_lars, 
      "\nValidation/Out-of-Sample: ", rmse_validate_lars)

#residuals
y_train['lars_residuals'] = y_train['logerror_pred_lars'] - y_train['logerror']
y_validate['lars_residuals'] = y_validate['logerror_pred_lars'] - y_validate['logerror']


### TweedieRegressor (GLM)

# create the model object; power=0 is the Normal distribution, which allows
# negative targets such as logerror (power=1, Poisson, requires y >= 0)
glm = TweedieRegressor(power=0, alpha=0)

# fit the model to our training data. We must specify the column in y_train, 
# since we have converted it to a dataframe from a series! 
glm.fit(X_train_scaled, y_train.logerror)

# predict train
y_train['logerror_pred_glm'] = glm.predict(X_train_scaled)

# evaluate: rmse
rmse_train_glm = mean_squared_error(y_train.logerror, y_train.logerror_pred_glm)**(1/2)

# predict validate
y_validate['logerror_pred_glm'] = glm.predict(X_validate_scaled)

# evaluate: rmse
rmse_validate_glm = mean_squared_error(y_validate.logerror, y_validate.logerror_pred_glm)**(1/2)

print("RMSE for GLM using Tweedie, power=0 & alpha=0\nTraining/In-Sample: ", rmse_train_glm, 
      "\nValidation/Out-of-Sample: ", rmse_validate_glm)