## Final Project Submission

Please fill out:
* __Student name:__ Cassarra Groesbeck
* __Student pace:__ Part Time/ Flex
* __Scheduled project review date/time:__ 
* __Instructor name:__ Claude Fried
* __Blog post URL:__


# Introduction 
This project uses regression modeling to produce findings that support recommendations for a real estate agency that helps homeowners buy and/or sell homes.

# Objectives
Those findings will include:
- a metric describing overall model performance
- at least two regression model coefficients; that is to say, at least two feature-specific effects on Sale Price.

# Business Understanding
The real estate agency needs to provide advice to homeowners about how home renovations might increase the estimated value of their homes, and by what amount.

# Data Understanding
This project uses the King County House Sales dataset. For more information beyond what is provided below, see the [King County Assessor Website](https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r).

### Column Names and Descriptions for King County Data Set


| Column     | Description   |
|------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  `id`         | **Unique identifier for a house**  |
| `date`        | **Date house was sold**  |
| `price`       | **Sale price (prediction target)** |
| `bedrooms`    | **Number of bedrooms**  |
|`bathrooms`    | **Number of bathrooms**   |
|`sqft_living`  | **Square footage of living space in the home**  |
| `sqft_lot`    | **Square footage of the lot**   |
|  `floors`     | **Number of floors (levels) in house**  |
| `waterfront`  | **Whether the house is on a waterfront**  |
| `view`        | **Quality of view from house** |
| `condition`   | **How good the overall condition of the house is. Related to maintenance of house.**  |
| `grade`       | **Overall grade of the house. Related to the construction and design of the house.**  |
| `sqft_above`  | **Square footage of house apart from basement**  |
|`sqft_basement`| **Square footage of the basement**   |
|  `yr_built`   | **Year when house was built**  |
| `yr_renovated`| **Year when house was renovated**  |
| `zipcode`     | **ZIP Code used by the United States Postal Service** |
| `lat`         | **Latitude coordinate**  |
| `long`        | **Longitude coordinate**   |
|`sqft_living15`| **The square footage of interior housing living space for the nearest 15 neighbors**   |
| `sqft_lot15`  | **The square footage of the land lots of the nearest 15 neighbors**   |



## Imports

In [None]:
# The basics
import pandas as pd
import numpy as np

# sklearn
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# statsmodels
from statsmodels.formula.api import ols
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# scipy
import scipy.stats as stats

# rando
from itertools import combinations

#visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
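# the 'seaborn' style was renamed 'seaborn-v0_8' in matplotlib 3.6
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')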

## Bring in the data

In [None]:
# data/kc_house_data.csv
data = pd.read_csv("data/kc_house_data.csv")
data

# Functions needed

In [None]:
# function to return full statsmodel summary
def model_it_full(df, target):
    y = df[target]
    X = df.drop(target, axis=1)

    model = sm.OLS(y, sm.add_constant(X)).fit()
    
    return model.summary()

In [None]:
# function to return r_squared values, p table from .summary, and a 
# couple of residual normality checks (hist and qq plot)

def model_it_small(df, target):
    y = df[target]
    X = df.drop(target, axis=1)
    #statsmodel fit
    model = sm.OLS(y, sm.add_constant(X)).fit()  
    
    #kfold
    regression = LinearRegression()
    crossvalidation = KFold(n_splits=3, shuffle=True, random_state=1)
    kfold_r = np.mean(cross_val_score(regression, X, y, scoring='r2', cv=crossvalidation))
    
    #PLOTS
    fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 3))
    fig.suptitle('Normality of Residuals')
    #hist
    sns.histplot(model.resid, ax=ax0)
    ax0.set(xlabel='Residual', ylabel='Frequency', title='Distribution of Residuals')
    #qq
    sm.qqplot(model.resid, fit = True, line = '45', ax=ax1)
    ax1.set(title='QQ Plot')
    plt.show()
    
    #print r_squared values
    print(f'r_sq: {model.rsquared}. r_sq_adjusted: {model.rsquared_adj}. k_fold_r: {kfold_r}')
    
    #return 
    return model.summary().tables[1]



In [None]:
# make collinearity check function
# code from Multicollinearity of Features - Lab, turned it into a function

def colinearity(df):
    #get absolute value of correlations, sort them, and turn into new DF called df
    df=df.corr().abs().stack().reset_index().sort_values(0, ascending=False)

    # zip the columns (Which were only named level_0 and level_1 by default) 
    # into a new column named "pairs"
    df['pairs'] = list(zip(df.level_0, df.level_1))

    # set index to pairs
    df.set_index(['pairs'], inplace = True)

    # drop level_ columns
    df.drop(columns=['level_1', 'level_0'], inplace = True)

    # rename correlation column as cc rather than 0
    df.columns = ['cc']

    # just correlations over .75, but less than 1.
    df = df[(df.cc>.75) & (df.cc <1)]

    # reassign instead of inplace: df is a filtered slice at this point
    df = df.drop_duplicates()

    return df

In [None]:
# collinearity with VIF
# code from Linear Regression - Cumulative Lab, altered to make a df w/sorted values

def get_VIFs_above5(df, target):

    vif_data = sm.add_constant(df.drop(target, axis=1))

    # drop rows with nulls once, then compute one VIF per column
    vif_values = vif_data.dropna().values
    vif = [variance_inflation_factor(vif_values, i)
           for i in range(vif_values.shape[1])]

    vif_df = pd.DataFrame(vif, index=vif_data.columns).sort_values(0, ascending=False)
    return vif_df[vif_df[0]>5]
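
For intuition about what `variance_inflation_factor` reports, the sketch below computes a VIF by hand on synthetic data. The arrays `x1`, `x2`, `x3` and the helper `vif_by_hand` are made up for illustration and are not part of this project. Each feature is regressed on the others (with an intercept), and the VIF is 1 / (1 - R²) of that regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1, so collinear with it
x3 = rng.normal(size=n)                   # independent of the other two
X = np.column_stack([x1, x2, x3])

def vif_by_hand(X, i):
    """VIF of column i: regress it on the remaining columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

vifs = [vif_by_hand(X, i) for i in range(X.shape[1])]
print(vifs)
```

The two collinear columns should come out far above the cutoff of 5 used in `get_VIFs_above5`, while the independent column stays close to 1.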

In [None]:
def remove_outliers_from_pdDataFrame(df):
    return df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

def remove_outliers_from_Column(df, column):
    return df[(np.abs(stats.zscore(df[column])) < 3)]
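
As a rough illustration of the z-score filter used in the two functions above, here is a small sketch on a made-up frame (`toy` and its values are hypothetical). It computes the z-scores with pandas directly, using `ddof=0` to match the default of `scipy.stats.zscore`, and keeps only rows where every column is within 3 standard deviations of its mean.

```python
import pandas as pd

# 19 ordinary rows plus one extreme 'sqft' value
toy = pd.DataFrame({
    "sqft": [1000 + 50 * i for i in range(19)] + [50000],
    "beds": [2, 3, 4, 3] * 5,
})

# population z-score (ddof=0), the same default scipy.stats.zscore uses
z = (toy - toy.mean()) / toy.std(ddof=0)

# keep a row only if all of its columns are within 3 standard deviations
kept = toy[(z.abs() < 3).all(axis=1)]
print(len(toy), len(kept))
```

One thing to watch for: a column with a single constant value has a standard deviation of 0, so its z-scores are NaN and the `< 3` test fails for every row.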


# Exploring the data:

In [None]:
data.info()

## Identify target variable 

In [None]:
target = 'price'

## Extract Categorical Variables

In [None]:
obj_df = data.select_dtypes(include=object)
obj_df.head()

#### Explore categorical variables

- [x] 'date' (will become ordinal)
- [x] 'waterfront' (will become boolean)
- [x] 'view' (stays categorical)
- [x] 'condition' (stays categorical)
- [x] 'grade' (ordinal)
- [x] 'sqft_basement' (continuous)

### Findings from obj_df exploration:
- 'waterfront'
  - has two values: NO & YES
  - has 2376 nulls that will need to be addressed before ohe'ing
  - 11% of data is null/missing
  - 0.7% of properties are waterfront
  - 88.3% of properties are not on waterfront
  - I will change nulls to NO because fewer than 1% of properties are on waterfront
- 'condition'
  - 5 unique values
  - has zero nulls
- 'view'
  - has 6 values
  - 89.93% is 'NONE'
  - 63 nulls (0.29%), change to 'NONE'
- 'grade'
  - 11 unique values
  - has a numeric value (3-13) and a word description (e.g. "poor" or "good") associated with each grade assignment
  - need to keep just the number grade and delete the description
- 'date'
  - string: 'mm/dd/yyyy'
- 'sqft_basement'
  - float values cast as string
  - 454 missing, shown as '?', 2% missing
  - 12826 '0.0' basement, i.e. 59% have no basement; add new column "has_basement"


__TODO__ to make the obj_df features ohe-ready. I will add to this list as I explore the data and address the needed conversions at the end, before ohe'ing.

1. [x] replace 'waterfront' {np.nan:'NO'} - this will be a boolean feature
2. [x] change 'view' nulls to 'NONE' - will stay categorical
3. [x] keep 'grade' number and ditch description - this will make the feature ordinal
4. [x] convert 'date' to just numerical month - ordinal
5. [x] for 'sqft_basement' make new column "has_basement"
6. [x] if value '0.0' or '?' append new column 0, else 1 - this will be a boolean feature
7. [x] make new get_dummies_df of ['view', 'condition', 'has_basement', 'waterfront']
8. [x] pd.get_dummies(dummies_df, drop_first=True)
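
The steps above can be sketched end to end on a tiny frame. The `toy` rows below are made up to mimic the raw column formats described in the findings; they are not taken from the dataset, and the vectorized calls are one possible way to do each step, not necessarily how the cells below do it.

```python
import numpy as np
import pandas as pd

# made-up rows in the same raw formats as the object columns
toy = pd.DataFrame({
    "waterfront": ["YES", np.nan, "NO"],
    "view": ["NONE", np.nan, "GOOD"],
    "condition": ["Average", "Good", "Average"],
    "grade": ["7 Average", "10 Very Good", "11 Excellent"],
    "date": ["10/13/2014", "12/09/2014", "02/25/2015"],
    "sqft_basement": ["0.0", "?", "400.0"],
})

toy["waterfront"] = toy["waterfront"].fillna("NO")                        # step 1
toy["view"] = toy["view"].fillna("NONE")                                  # step 2
toy["grade"] = toy["grade"].str.split().str[0].astype(int)                # step 3
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%Y").dt.month     # step 4
no_basement = toy["sqft_basement"].isin(["0.0", "?"])                     # steps 5-6
toy["has_basement"] = np.where(no_basement, "NO", "YES")

cat_cols = ["view", "condition", "has_basement", "waterfront"]            # step 7
dummies = pd.get_dummies(toy[cat_cols], drop_first=True, dtype=int)       # step 8
print(dummies)
```

Passing `dtype=int` keeps the dummy columns as 0/1 integers; recent pandas versions return booleans by default.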

In [None]:
# print .value_counts() for each column in obj_df
for column in obj_df.columns:
    print(f"COLUMN: '{column}'")
    print(f"Number of unique values: {len(obj_df[column].unique())}")
    print(f"Number of nulls: {obj_df[column].isnull().sum()}")
    print(obj_df[column].value_counts())
    print()

In [None]:
#  null counts
obj_df.isnull().sum()

In [None]:
obj_df.info()

In [None]:
obj_df[['sqft_basement']].head(20)

In [None]:
type(obj_df['date'][0])

#### Tackling TODO ohe prep list

In [None]:
# 1. replace 'waterfront' {np.nan:'NO'}
# assign back instead of inplace on a column selection, which may not update `data`
data['waterfront'] = data['waterfront'].fillna('NO')

#check
data['waterfront'].value_counts()

In [None]:
# 2. change 'view' nulls to 'NONE'
# assign back instead of inplace on a column selection, which may not update `data`
data['view'] = data['view'].fillna('NONE')

#check
data['view'].value_counts()

In [None]:
# 3. keep 'grade' number (as an int) and ditch description
data['grade'] = [int(grade[:2]) for grade in data['grade']]

#check
data['grade'].value_counts()

In [None]:
# 4. convert 'date' to just numerical month
data['date'] = pd.DatetimeIndex(data['date']).month

#check
data['date'].value_counts()

In [None]:
# 5. & 6. for 'sqft_basement' make new column "has_basement"
# if value '0.0' or '?' append new column 0, else 1

basement = []
for square_feet in data['sqft_basement']:
    if square_feet == '0.0':
        basement.append('NO')
    elif square_feet == '?':
        basement.append('NO')
    else:
        basement.append('YES')
        
data['has_basement'] = basement

# drop 'sqft_basement'
data = data.drop('sqft_basement', axis=1)

# check 
data.head()

In [None]:
# 7. make new dummies_df of ['view', 'condition', 'has_basement', 'waterfront']
dummies_df = data[['view','condition', 'has_basement', 'waterfront']]
dummies_df

In [None]:
# print the values to note which feature has been dropped


In [None]:
# 8. pd.get_dummies(dummies_df, drop_first=True)
# dtype=int keeps the dummies as 0/1; pandas >= 2.0 returns booleans by default
dummies_df = pd.get_dummies(dummies_df, drop_first=True, dtype=int)
dummies_df.head()

### __NOTES:__
Take note of the features that were dropped:
- 
- 
- 
- 
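
One way to fill in the list above is to compare the dummy columns produced with and without `drop_first`. This is a minimal sketch, assuming a hypothetical helper `dropped_levels` and a made-up `toy` frame; neither is part of this notebook.

```python
import pandas as pd

def dropped_levels(df):
    """Return {column: [levels removed by drop_first=True]} for each column of df."""
    dropped = {}
    for col in df.columns:
        # for a single Series, get_dummies names the columns after the levels
        full = pd.get_dummies(df[col], drop_first=False).columns
        kept = set(pd.get_dummies(df[col], drop_first=True).columns)
        dropped[col] = [level for level in full if level not in kept]
    return dropped

toy = pd.DataFrame({
    "view": ["NONE", "GOOD", "AVERAGE"],
    "waterfront": ["NO", "YES", "NO"],
})
print(dropped_levels(toy))
```

The helper has to run on the categorical columns as they were before step 8, because `dummies_df` is overwritten by the `get_dummies` call.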

## Extract Continuous Variables

In [None]:
# extract out columns with Dtype == int or float for further exploration
cont_df = data.select_dtypes(exclude=object).drop(['id', 'price'], axis=1)
cont_df.head()

In [None]:
# check for nulls 
cont_df.isnull().sum()

### __NOTES:__ 
3,842 missing values from 'yr_renovated'. That's too many to impute or replace. Nulls may mean N/A. One option is to turn it into a boolean: 'renovated_YES' == 1

In [None]:
cont_df['yr_renovated'].value_counts()

### __NOTES:__ 
An additional 17,011 values are 0, so 0 likely means N/A and the missing values are just that, missing. That is in fact too many to impute or replace. I will need to drop this column.

## Converting zip codes to cities

In [None]:
data['zipcode'].value_counts() # ohe these, find way to reduce. 

### Web Scraping for City Zip Codes

In [None]:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
html_page = requests.get('https://www.ciclt.net/sn/clt/capitolimpact/gw_ziplist.aspx?FIPS=53033') # Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') # Pass the page contents to beautiful soup for parsing


In [None]:
#soup.prettify

### Extract out just zipcode and city from https://www.ciclt.net/sn/clt/capitolimpact/gw_ziplist.aspx?FIPS=53033

In [None]:
#grab an easy to identify thing
span = soup.find('span')

# move up to find sibling of container I want
parent = span.parent

# get to the correct container
box = parent.next_sibling.next_sibling

# get text from container and format how needed
box_text = box.get_text().replace('\n', ',')
box_text = box_text.replace('Zip CodeCityCounty', '')
box_text = box_text.replace('King County,', '')
box_text = box_text.replace(' ... ', '')

# split on the commas and remove last (empty) element
lst = box_text.split(",")
lst.pop()

#check
lst

In [None]:
# separate into zipcodes and cities (the list alternates zipcode, city)
codes = lst[0::2]
cities = lst[1::2]

### Make a DF with _code_ and _cities_ lists

In [None]:
# use the two lists to make a DF
# empty df
web_df = pd.DataFrame()
web_df['zipcode_web']  = codes
web_df['city_web']  = cities

#check
web_df.head()

In [None]:
# use DF to make a dict of {zipcode: city name(s)}
dictionary = {}
for key in web_df['zipcode_web'].unique():
    dictionary[key] = str(web_df[web_df['zipcode_web'] == key]['city_web'].unique())
        
# check
dictionary

### Make new column on existing _data_ df

In [None]:
type(data['zipcode'][0])

In [None]:
#copy zipcode to new column
data['Location/Area'] = data['zipcode'].astype(str)

In [None]:
# Use dictionary to replace zipcodes with cities
data['Location/Area'] = data['Location/Area'].replace(dictionary)

__Distill Cities down to Areas__

In [None]:
# use new column (now containing city names) to begin a list of areas
new_values = []
for cities in data['Location/Area']:
    if 'Seattle' in cities:
        new_values.append('Seattle Area')
    elif 'Bellevue' in cities:
        new_values.append('Bellevue Area')
    elif 'Auburn' in cities:
        new_values.append('Auburn Area')
    elif 'Kent' in cities:
        new_values.append('Kent Area')
    else:
        new_values.append(cities)

In [None]:
# change column values from cities to Areas (when possible) otherwise remains city name
data['Location/Area'] = new_values

In [None]:
# check
data['Location/Area'].value_counts()

__NOTES:__
- [x] '98077' needs a City name: Woodinville
- [x] if two cities were distilled, need to assign an area
  - [x] Bothell Area: Kenmore, _Bothell_
  - [x] Bellevue Area: _Kirkland_
  - [x] Sammamish Area: _Sammamish_, _Issaquah_, _Redmond_
  - [x] Newcastle Area: Newcastle, _Renton_
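
The grouping in the notes above can also be written as a lookup table. This is only a sketch: `AREA_BY_KEYWORD` and `to_area` are hypothetical names, and the keys are the italicized city names from the notes plus the one zip code that lacks a city.

```python
# keyword found in the label -> area it should be assigned to
AREA_BY_KEYWORD = {
    "Bothell": "Bothell Area",
    "Kirkland": "Bellevue Area",
    "Renton": "Newcastle Area",
    "Sammamish": "Sammamish Area",
    "Issaquah": "Sammamish Area",
    "Redmond": "Sammamish Area",
    "98077": "Woodinville",
}

def to_area(label):
    """Map a label to its area; unmatched labels keep their name, minus brackets and quotes."""
    for keyword, area in AREA_BY_KEYWORD.items():
        if keyword in label:
            return area
    return label.strip("[']")

print(to_area("['Kenmore' 'Bothell']"))   # label containing two cities
print(to_area("98077"))                   # zip code with no scraped city
print(to_area("Seattle Area"))            # already distilled, passes through
```

A table like this keeps the city-to-area decisions in one place, so adding another area means adding a key rather than another `elif`.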

In [None]:
# repeat above, further distilling values from 'Location/Area' column
new_values2 = []
for cities in data['Location/Area']:
    if 'Bothell' in cities:
        new_values2.append('Bothell Area')
    elif 'Kirkland' in cities:
        new_values2.append('Bellevue Area')
    elif 'Renton' in cities:
        new_values2.append('Newcastle Area')
    elif 'Sammamish' in cities:
        new_values2.append('Sammamish Area')
    elif 'Issaquah' in cities:
        new_values2.append('Sammamish Area')
    elif 'Redmond' in cities:
        new_values2.append('Sammamish Area')
    elif cities == '98077':
        new_values2.append('Woodinville')
    else:
        new_values2.append(cities.strip("['']"))

# change column values to new list
data['Location/Area'] = new_values2

# check
data['Location/Area'].value_counts()

In [None]:
# looks good enough for get_dummies then add to dummies_df

### Get Dummies for new column `data['Location/Area']`

In [None]:
# dtype=int keeps the dummies as 0/1; pandas >= 2.0 returns booleans by default
zipcode_dummies = pd.get_dummies(data[['Location/Area']], drop_first=True, dtype=int)

In [None]:
# concatenate with existing dummies_df
dummies_df = pd.concat([dummies_df, zipcode_dummies], axis=1)

# check
dummies_df

## Paring Down Data:
__Get two df's in order__ 

In [None]:
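# clean up dummies_df (categorical features)

# move boolean columns to dummies df as 0/1 (not the raw 'YES'/'NO' strings,
# which would make these columns non-numeric for the regression)
dummies_df['waterfront_YES'] = (data['waterfront'] == 'YES').astype(int)
dummies_df['has_basement_YES'] = (data['has_basement'] == 'YES').astype(int)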

# check
dummies_df.head()

In [None]:
# clean up cont_df (continuous features)

# re define cont_df with relevant columns
# leave 'price' for now
cont_df = data.drop(['id', 'yr_renovated', 'view', 'condition', 'Location/Area', 'lat', 'long', 'zipcode', 'waterfront', 'has_basement'], axis=1)

#check
cont_df.head()

In [None]:
# check continuous features; some may be categorical features left in (they are ordinal)

#looping over all columns  
plots = cont_df.drop('price', axis=1)

fig, axes = plt.subplots(ncols=3, nrows=4, figsize=(12, 15))
fig.set_tight_layout(True)

for index, col in enumerate(plots.columns):
    ax = axes[index//3][index%3]
    sns.regplot(x = col, y = 'price', data = cont_df, ax=ax, line_kws={"color": "tab:red"})
    ax.set_xlabel(col)
    ax.set_ylabel("price")

In [None]:
# a few are almost flat lines, i.e. near-zero relationship
# drop them
cont_df = cont_df.drop(['floors', 'yr_built', 'date', 'sqft_lot15', 'sqft_lot'], axis=1)

#check
cont_df.head()

In [None]:
# is there really a 30+ bedroom house?
cont_df['bedrooms'].value_counts()

In [None]:
# drop that 1 it's obviously an anomaly 
cont_df = cont_df[cont_df['bedrooms']<30]

In [None]:
# again now that features have been dropped
plots = cont_df.drop('price', axis=1)

fig, axes = plt.subplots(ncols=3, nrows=2, figsize=(12, 10))
fig.set_tight_layout(True)

for index, col in enumerate(plots.columns):
    ax = axes[index//3][index%3]
    sns.regplot(x = col, y = 'price', data = cont_df, ax=ax, line_kws={"color": "tab:red"})
    ax.set_xlabel(col)
    ax.set_ylabel("price")

# Build Baseline Model

### Build baseline model with highest correlated feature

In [None]:
cont_df.corr()['price'].abs().sort_values(ascending=False)

In [None]:
# use 'sqft_living' as baseline model feature
# baseline model
baseline_model_df = cont_df[['sqft_living', 'price']]
y = baseline_model_df[target]
X = baseline_model_df.drop(target, axis=1)

model_1 = sm.OLS(y, sm.add_constant(X)).fit()

model_1.summary()

In [None]:
# a look at the residuals
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 5))

sns.histplot(model_1.resid, ax=ax0)
ax0.set(xlabel='Residual', ylabel='Frequency', title='Distribution of Residuals')

sm.qqplot(model_1.resid, fit = True, line = '45', ax=ax1)
ax1.set(title='QQ Plot')

fig.suptitle('Normality of Residuals')

plt.show()

In [None]:
# oofs, not great. 

## Make functions for further modeling

## Getting Multiple R_squared Values (to get an idea of where to start)

In [None]:
# unedited raw data
full_dfs = pd.concat([cont_df, dummies_df], axis=1).dropna(axis=0)

In [None]:
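# Log and Scale
# NOTE: `standardize` is not defined elsewhere in this notebook;
# plain z-score scaling is assumed here
def standardize(column):
    return (column - column.mean()) / column.std()

# log transform
log_df = np.log(cont_df)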

# standardized AND logged
log_stand_df = log_df.apply(standardize)

# standardize ONLY
stand_df = cont_df.apply(standardize)

# concat with dummies
model_log = pd.concat([log_df, dummies_df], axis=1).dropna(axis=0)             #logged only
model_stand = pd.concat([stand_df, dummies_df], axis=1).dropna(axis=0)         #scaled only 
model_log_stand = pd.concat([log_stand_df, dummies_df], axis=1).dropna(axis=0) #logged and scaled

In [None]:
# remove outliers from everything
filtered_df = remove_outliers_from_pdDataFrame(cont_df)

# concat with dummies
model_filtered = pd.concat([filtered_df, dummies_df], axis=1).dropna(axis=0)

In [None]:
# features relevant to remodels
#list(cont_df.columns)

reno_features = ['price',
 'bedrooms',
 'bathrooms',
 'sqft_living',
 'grade',
 'condition_Fair',
 'condition_Good',
 'condition_Poor',
 'condition_Very Good']


In [None]:
# outlier removed and logged
outliers_log = np.log(filtered_df)

# concat with dummies
model_outl_fltd = pd.concat([outliers_log, dummies_df], axis=1).dropna()

In [None]:
# test multiple df's quickly before moving on to removing collinear features

dfs = [baseline_model_df,               #1. baseline, 'sqft_living' only
         cont_df,                       #2. continuous features only
         full_dfs,                      #3. cont and dummies
         model_log,                     #4. all cont features logged
         model_stand,                   #5. all cont features scaled
         model_log_stand,               #6. all cont features logged and scaled
         model_filtered,                #7. all cont features outliers removed
         model_outl_fltd,               #8. cont outliers removed and logged
         full_dfs[reno_features]]       #9. reno specific features 

# fit one model per df and print its r_squared, numbered to match the list above
for n, df in enumerate(dfs, start=1):
    y = df[target]
    X = df.drop(target, axis=1)

    model = sm.OLS(y, sm.add_constant(X)).fit()

    print(f'{n}. {model.rsquared}')


## Visual comparisons of continuous data transformations

In [None]:
columnz = list(cont_df.columns)
colorz = ['red', 'purple', 'blue', 'green', 'yellow', 'orange', 'cyan']

# one panel per transformation of the continuous data
versions = [(cont_df, "RAW"),
            (filtered_df, "OUTLIERS REMOVED"),
            (log_df, "LOG TRANSFORMED"),
            (stand_df, "STANDARDIZED"),
            (log_stand_df, "LOG & STAND")]

# one row of five histograms per column
for column in columnz:
    fig, axes = plt.subplots(1, 5, figsize=(15, 3))

    for ax, (frame, title), color in zip(axes, versions, colorz):
        sns.histplot(ax=ax, data=frame[column], bins='auto', color=color, alpha=.7)\
        .set(title=title)

    plt.show()
   

__NOTES:__

DF 'model_log' has highest r_squared score

## Model 'model_log'

In [None]:
model_it_small(model_log, target)

## Remove 'view_FAIR' (p_value: 0.909) and remodel

In [None]:
log_drop_1p = model_log.drop('view_FAIR', axis=1)
model_it_small(log_drop_1p, target)

## Multicollinearity of Features 

In [None]:
colinearity(log_drop_1p)

In [None]:
get_VIFs_above5(log_drop_1p, target)

### Drop 'sqft_above' and remodel

In [None]:
drop_colin_feature = log_drop_1p.drop('sqft_above', axis=1).dropna(axis=0)

In [None]:
model_it_small(drop_colin_feature, target)

### __NOTES:__
'has_basement_YES' has a p_value above .05, and its confidence interval spans 0. Remove this feature and remodel.

In [None]:
y = drop_colin_feature[target]
X = drop_colin_feature.drop(target, axis=1)

#statsmodel fit
model = sm.OLS(y, sm.add_constant(X)).fit()  

# expo it, since the data has been logged
np.exp(model.params)
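
Because the continuous columns in `model_log` (including `price`) were log-transformed while the dummies were not, the two kinds of coefficients read differently. The sketch below uses made-up coefficient values (`b_sqft` and `b_waterfront` are illustrative, not results from this notebook) to show both readings.

```python
import numpy as np

b_sqft = 0.5         # hypothetical coefficient on a logged feature, e.g. log(sqft_living)
b_waterfront = 0.6   # hypothetical coefficient on an unlogged 0/1 dummy

# log-log: multiplying the feature by k multiplies price by k ** b,
# so a 1% increase in the feature changes price by roughly b percent
pct_price_per_1pct_sqft = (1.01 ** b_sqft - 1) * 100

# logged target, unlogged dummy: switching 0 -> 1 multiplies price by exp(b)
pct_price_waterfront = (np.exp(b_waterfront) - 1) * 100

print(round(pct_price_per_1pct_sqft, 3), round(pct_price_waterfront, 1))
```

So `np.exp` of a dummy's coefficient gives a price multiplier directly, while `np.exp` of a logged feature's coefficient gives the multiplier for scaling that feature by a factor of e, which is less intuitive than the per-percent reading.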

# Interactions

In [None]:
X = drop_colin_feature.drop(target, axis=1)
regression = LinearRegression()
crossvalidation = KFold(n_splits=3, shuffle=True, random_state=1)

features = list(X.columns)
combos = combinations(features, 2)
r_2s_dict = {pair:None for pair in combos}

# use pairs list and find r_2's
for k,v in r_2s_dict.items():
    
    # make copy of df so you don't mess anything up
    X_interact = X.copy()
    
    # use pairs
    # new column in X_interact with product of predictors
    X_interact[f'{k}'] = X[f'{k[0]}'] * X[f'{k[1]}']
    # r2 with combo feature added
    r_2_with_interaction = np.mean(cross_val_score(regression, X_interact, y, scoring='r2', cv=crossvalidation))
    # store r_2 and pair in a dictionary
    r_2s_dict[k] = r_2_with_interaction
    

# sort by r_2 value and extract top 5 pairs (the last 5)
top_5 = dict(sorted(r_2s_dict.items(), key=lambda item: item[1])[-5:])

top_5
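
To see why adding a product column can raise r_squared, here is a small synthetic sketch. The data, the helper `r_squared`, and the in-sample (not cross-validated) scoring are all simplifications made for illustration; the search above uses `cross_val_score` instead.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# the target depends on the product x1 * x2, not just on x1 and x2 separately
y = x1 + x2 + 2 * x1 * x2 + rng.normal(scale=0.5, size=n)

def r_squared(features, y):
    """In-sample R^2 of a least-squares fit of y on the given features plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + features)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

r2_main = r_squared([x1, x2], y)                 # main effects only
r2_inter = r_squared([x1, x2, x1 * x2], y)       # main effects plus interaction
print(r2_main, r2_inter)
```

When the true relationship has no interaction, the extra column adds almost nothing, which is why the search compares each pair's score rather than adding every product.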

### Get r_squared with interactions added

In [None]:
# redefine X so this cell can run multiple times
X = drop_colin_feature.drop(target, axis=1)

# add interactions columns to df 
# TOP 3
X["'grade'*'Location/Area_Seattle Area'"] = X['grade'] * X['Location/Area_Seattle Area']
X["'bathrooms'*'sqft_living'"] = X['bathrooms'] * X['sqft_living']
X["'sqft_living'*'grade'"] = X['grade'] * X['sqft_living']

# TOP 5
X["'bathrooms'*'grade'"] = X['bathrooms']*X['grade']
X["'bedrooms'*'grade'"]= X['bedrooms']*X['grade']

# Then use 10-fold cross-validation ...
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)

np.mean(cross_val_score(regression, X, y, scoring='r2', cv=crossvalidation))

#TOP 3: 0.7748541615156376
#TOP 5: 0.7748750703784515


In [None]:
X[target] = drop_colin_feature[target]
model_it_small(X, target)

### Investigating Linearity

In [None]:
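# Linearity?
# NOTE: seattle_reno_change is not defined anywhere above; it must be created before this cell runs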
plots = seattle_reno_change.drop(target, axis=1)

fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(10, 3))
fig.set_tight_layout(True)

for index, col in enumerate(plots.columns):
    ax = axes[index]
    sns.regplot(x = col, y = 'price', data = cont_df, ax=ax, line_kws={"color": "tab:red"})
    ax.set_xlabel(col)
    ax.set_ylabel("price")

### Investigating Homoscedasticity

In [None]:
# plot the residuals against the predicted values
y = seattle_reno_change[target]
X2 = seattle_reno_change.drop(target, axis=1)

#statsmodel fit
model = sm.OLS(y, sm.add_constant(X2)).fit()
y_pred = model.fittedvalues

# check for homoscedasticity
p = sns.scatterplot(x=y_pred,y=model.resid)
plt.xlabel('Predicted y values')
plt.ylabel('Residuals')
#plt.xlim(70,100)
p = sns.lineplot(x=[y_pred.min(),y_pred.max()],y=[0,0],color='blue')
p = plt.title('Residuals vs Predicted y value')

### Investigating Multicollinearity (Independence Assumption)

In [None]:
colinearity(seattle_reno_change)

In [None]:
get_VIFs_above5(seattle_reno_change, target)

## Remove 'sqft_living'

In [None]:
model_it_small(seattle_reno_change.drop('sqft_living', axis=1), target)

In [None]:
get_VIFs_above5(seattle_reno_change.drop('sqft_living', axis=1), target)

### Interpret

In [None]:
expoed = np.exp(seattle_reno_change)
y = expoed[target]
X2 = expoed.drop(target, axis=1)

#statsmodel fit
model = sm.OLS(y, sm.add_constant(X2)).fit() 

# data was exponentiated above, so the params are shown as they are
model.params

In [None]:
# TODO only log target, that will make coeffs in percentages, 
# find the dropped dummies values
# maybe remove outliers? prob not
# work on notebook flow
# need big header FINAL MODEL
# final model checks
# header for interpretations
# keep floors
#use sqft_living15 and subtract 