![for sale image, from https://time.com/5835778/selling-home-coronavirus/](https://api.time.com/wp-content/uploads/2020/05/selling-home-coronavirus.jpg?w=800&quality=85)

# Project Title

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

Questions to consider:

- Who are your stakeholders?
- What are your stakeholders' pain points related to this project?
- Why are your predictions important from a business perspective?

### Stakeholders: a small real estate company that advises families on selling their homes
### Pain points: 

## Data Understanding

Describe the data being used for this project.

Questions to consider:

- Where did the data come from, and how do they relate to the data analysis questions?
- What do the data represent? Who is in the sample and what variables are included?
- What is the target variable?
- What are the properties of the variables you intend to use?

In [None]:
# import relevant libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm
from statsmodels.formula.api import ols

In [None]:
data = pd.read_csv('../../data/kc_house_data.csv')
data.info()

In [None]:
data.isna().sum()

In [None]:
# I'm focusing on specific house features
rel_cols = ['id','price','sqft_living','sqft_lot','sqft_above','sqft_basement',
            'floors','bedrooms','bathrooms']
rel_cols_log = ['id','price','sqft_living','sqft_lot','sqft_above']

In [None]:
# take an explicit copy so later assignments don't trigger SettingWithCopyWarning
datan = data[rel_cols].copy()
datan

In [None]:
datan.info()

In [None]:
datan['sqft_basement'].value_counts()

In [None]:
datan.loc[datan['sqft_basement'] == '?','sqft_basement'] = np.nan

In [None]:
datan['sqft_basement'].value_counts()

In [None]:
def tryfloat(x):
    # convert to float where possible; leave anything unparseable unchanged
    try:
        return float(x)
    except (TypeError, ValueError):
        return x

In [None]:
datan['sqft_basement'] = datan['sqft_basement'].map(tryfloat)
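
As an aside, the placeholder replacement and the float conversion above can probably be done in one step with `pd.to_numeric`. The snippet below is a minimal sketch on a made-up toy column (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy column mimicking 'sqft_basement': numeric strings plus a '?' placeholder.
raw = pd.Series(['0.0', '910.0', '?', '1530.0'], name='sqft_basement')

# errors='coerce' turns anything that cannot be parsed into NaN
# and returns a float column.
clean = pd.to_numeric(raw, errors='coerce')

print(clean.dtype)
print(clean.isna().sum())
```

The trade-off is that `errors='coerce'` silently converts every unparseable value to NaN, so it is worth checking `value_counts()` first, as done above.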

In [None]:
datan.info()

In [None]:
datan.describe()

In [None]:
for col in datan.columns:
    print(f'\n{col}:\n')
    print(datan.sort_values(by=col,ascending=False).head(10))

A house with 33 bedrooms is almost certainly a data-entry error, and it does not come with a correspondingly high price. I'll remove that row.

In [None]:
datan = datan[datan['bedrooms'] != 33]
datan.describe()

In [None]:
sns.pairplot(datan);

In [None]:
datan.hist(figsize=(20,20));

In [None]:
datan.corr()

Many of these features are right-skewed and look like they could benefit from a log transform. Let's apply it to the columns listed in `rel_cols_log` and see what happens.

In [None]:
datanlog = pd.DataFrame()
for col in rel_cols_log:
    if col == 'id':
        # keep the identifier untransformed
        datanlog[col] = datan[col]
        continue
    # natural log, vectorized over the whole column
    datanlog[f'{col}_log'] = np.log(datan[col])
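
To illustrate why a log transform can help with right-skewed columns, here is a small sketch on synthetic data. The log-normal "prices" are an assumption made for illustration only and are not drawn from the housing dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for a price-like column.
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))

skew_raw = prices.skew()          # clearly positive for right-skewed data
skew_log = np.log(prices).skew()  # should be close to 0 after the transform

print(f'skew before: {skew_raw:.2f}, after log: {skew_log:.2f}')
```

Note that `np.log` needs strictly positive inputs: a column containing zeros would produce `-inf` values.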

In [None]:
datanlog.info()

In [None]:
datanlog.describe()

In [None]:
datanlog.corr()

In [None]:
sns.pairplot(datanlog);

In [None]:
fig, axes = plt.subplots(2,3, figsize=(20,20))
for i, col in enumerate(datanlog.columns):
    sns.histplot(data=datanlog, x=col, kde=True, ax=axes[i//3,i%3]);

In [None]:
fig, axes = plt.subplots(3,3, figsize=(20,20))
for i, col in enumerate(datan.columns):
    sns.histplot(data=datan, x=col, kde=True, ax=axes[i//3,i%3]);

In [None]:
datanfeat = datan.drop(columns='price')
datanfeat.corr()

In [None]:
datanlogfeat = datanlog.drop(columns='price_log')
datanlogfeat.corr()

In [None]:
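# datan and datanlog share the same row index, so join on it; merging on 'id'
# would duplicate rows if any id appears more than once (e.g. a home sold twice)
datantot = datan.join(datanlog.drop(columns='id')).reset_index(drop=True)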
datantot

In [None]:
datantot.corr().sort_values('price_log',ascending=False)['price_log']

In [None]:
datantotfeat = datantot.drop(columns=['price_log','price'])
dtfc = datantotfeat.corr().abs().stack().reset_index().sort_values(0, ascending=False)

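dtfc['col_pairs'] = list(zip(dtfc.level_0,dtfc.level_1))
# flag self-pairs and pairs of a column with its own log version (one name contains the other)
dtfc['same'] = dtfc['col_pairs'].map(lambda x: (x[0] in x[1]) or (x[1] in x[0]))
# sort each pair so (a, b) and (b, a) get the same label
dtfc['col_pairs'] = dtfc['col_pairs'].map(lambda x:sorted(list(x)))
dtfc.set_index(['col_pairs'],inplace=True)
dtfc = dtfc[~dtfc['same']]
dtfc.drop(columns=['level_0','level_1','same'],inplace=True)
dtfc.columns = ['C']
# mirrored pairs have identical correlation values, so this keeps one row per pair
dtfc.drop_duplicates(inplace=True)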
dtfc
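
A more compact way to list each feature pair once is to mask the correlation matrix to its upper triangle. The sketch below uses a made-up three-column frame (`a`, `b`, `c` are hypothetical names) and does not filter out a column paired with its own log version, which the cell above does:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),  # nearly a copy of 'a'
    'c': rng.normal(size=200),                 # unrelated noise
})

corr = toy.corr().abs()

# True strictly above the diagonal, so each pair shows up exactly once
# and self-correlations are excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN; drop them explicitly after stacking.
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)
print(pairs)
```

The most correlated pair appears first, which makes it easy to spot candidates for multicollinearity.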

## Let's build models.
### Preprocessing:

In [None]:
X = datantot.drop(columns=['price_log','price'])

Xpr_train, Xpr_test, ypr_train, ypr_test = \
train_test_split(X, datantot['price'], test_size=0.33, random_state=42)

X_train, X_test, y_train, y_test = \
train_test_split(X, datantot['price_log'], test_size=0.33, random_state=42)

In [None]:
X_train.describe()

In [None]:
X_test.describe()

In [None]:
y_train

In [None]:
y_test

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# wrap the scaled arrays back into DataFrames, keeping column names and row index
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

In [None]:
X_train_scaled
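
The scaler is fit on the training set only and then applied to both sets, so no information from the test set leaks into preprocessing. A minimal NumPy sketch of the same idea, using made-up arrays (`X_tr` and `X_te` here are toy values, not the notebook's data):

```python
import numpy as np

X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[4.0, 40.0]])

# Statistics come from the training data only.
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), as StandardScaler uses

X_tr_z = (X_tr - mu) / sigma
X_te_z = (X_te - mu) / sigma  # test rows reuse the training mean and std

print(X_tr_z.mean(axis=0), X_tr_z.std(axis=0))
print(X_te_z)
```

The scaled training columns have mean 0 and standard deviation 1, while the test rows generally do not, which is expected.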

## Baseline Understanding

In [None]:
train_target_mean = y_train.mean()
baseline_train_pred = [train_target_mean] * len(y_train)
baseline_test_pred = [train_target_mean] * len(y_test)
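
Predicting the training mean everywhere gives an R2 of exactly 0 on the training set and at most 0 on the test set, which is the reference point every later model should beat. The sketch below checks this with a hand-written R2 on toy numbers (the values are made up for illustration):

```python
import numpy as np

y_tr = np.array([12.0, 12.5, 13.0, 13.5])
y_te = np.array([12.2, 13.8])

def r2(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Baseline: always predict the training mean.
pred_tr = np.full_like(y_tr, y_tr.mean())
pred_te = np.full_like(y_te, y_tr.mean())

print(r2(y_tr, pred_tr))
print(r2(y_te, pred_te))
```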

In [None]:
def evaluate(y_tr, y_te, y_tr_pr, y_te_pr):
    '''
    Evaluates the error between the model predictions and the real values for both
    training and test sets.
    
    Arguments:
    y_tr - array-like
        Actual values for output variable, for the training set
    y_te - array-like
        Actual values for output variable, for the test set
    y_tr_pr - array-like
        Predicted values for output variable, for the training set
    y_te_pr - array-like
        Predicted values for output variable, for the test set
    
    Prints:
    R2 scores for Train and Test sets
    RMSE for Train and Test sets
    MAE for Train and Test sets
    Also plots a residual scatter and a QQ plot of the residuals.
    '''
    print(f'Train R2 score: {r2_score(y_tr, y_tr_pr)} ')
    print(f'Test R2 score: {r2_score(y_te, y_te_pr)} ')
    print('<><><><><>')
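    # take the square root of the MSE directly; this avoids the `squared` argument,
    # which recent scikit-learn versions have deprecated
    print(f'Train RMSE (ln): {np.sqrt(mean_squared_error(y_tr, y_tr_pr))} ')
    print(f'Test RMSE (ln): {np.sqrt(mean_squared_error(y_te, y_te_pr))} ')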
    print('<><><><><>')
    print(f'Train MAE (ln): {mean_absolute_error(y_tr, y_tr_pr)} ')
    print(f'Test MAE (ln): {mean_absolute_error(y_te, y_te_pr)} ')
    
    # residuals
    train_res = y_tr - y_tr_pr
    test_res = y_te - y_te_pr
    
    # scatter plot of residuals
    print("\nScatter of residuals:")
    plt.scatter(y_tr_pr, train_res, label='Train')
    plt.scatter(y_te_pr, test_res, label='Test')
    plt.axhline(y=0, color='purple', label='0')
    plt.xlabel("Predicted ln(price)")
    plt.ylabel("Residual ln(price)")
    plt.legend()
    plt.show()
    
    print("QQ Plot of residuals:")
    fig, ax = plt.subplots()
    sm.qqplot(train_res, ax=ax, marker='.', color='r', label='Train', alpha=0.3, line='s')
    sm.qqplot(test_res, ax=ax,  marker='.', color='g', label='Test', alpha=0.3)
    plt.legend()

In [None]:
evaluate(y_train, y_test, baseline_train_pred, baseline_test_pred)
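
Because the target is ln(price), the RMSE and MAE above are in log units rather than dollars. One rough way to read them, sketched below on made-up prices, is that `exp(MAE)` approximates a typical multiplicative error, while exponentiating the predictions first gives an error in dollars:

```python
import numpy as np

# Toy actual and predicted prices, expressed in log space like the targets above.
y_true_log = np.log(np.array([300_000.0, 450_000.0, 800_000.0]))
y_pred_log = np.log(np.array([330_000.0, 400_000.0, 1_000_000.0]))

# MAE in ln units; exp() of it is a typical ratio between prediction and truth.
mae_ln = np.mean(np.abs(y_true_log - y_pred_log))
typical_factor = np.exp(mae_ln)

# Back-transform both sides to dollars before measuring the error.
mae_dollars = np.mean(np.abs(np.exp(y_true_log) - np.exp(y_pred_log)))

print(f'MAE (ln): {mae_ln:.3f}, typical factor: {typical_factor:.3f}')
print(f'MAE (dollars): {mae_dollars:,.0f}')
```

This is an interpretation aid only; the dollar-scale error depends heavily on the most expensive homes.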

### First simple model

In [None]:
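def smols(X, y, cols=None):
    # fit a statsmodels OLS on the chosen columns (all columns if cols is None)
    Xcol = X if cols is None else X[cols]
    shmod = sm.OLS(endog=y, exog=sm.add_constant(Xcol)).fit()
    return shmod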

In [None]:
cols = ['sqft_living_log']
smols(X_train,y_train,cols).summary()

In [None]:
def linpreds(X_tr_scaled, y_tr, X_te_scaled):
    '''
    Uses Linear Regression to generate output predictions given training and test inputs.
    Arguments:
    X_tr_scaled - dataframe
        Input variables and values for the training set
    y_tr - array-like
        Actual values for output variable, for the training set
    X_te_scaled - dataframe
        Input variables and values for the test set
    Returns:
    Output (y) prediction arrays:
        train, test
    '''
    lr = LinearRegression()
    lr.fit(X_tr_scaled, y_tr)
    return lr.predict(X_tr_scaled), lr.predict(X_te_scaled)

In [None]:
X_train_scaled.columns

In [None]:
smols(X_train_scaled, y_train,
      cols=['sqft_living_log','bathrooms','bedrooms','floors']).summary()

In [None]:
X_tr1, X_te1 = X_train_scaled[['sqft_living_log']], X_test_scaled[['sqft_living_log']]
X_tr2, X_te2 = X_train_scaled[['sqft_living_log','bathrooms']],\
               X_test_scaled[['sqft_living_log','bathrooms']]
X_tr3, X_te3 = X_train_scaled[['sqft_living_log','bathrooms','bedrooms']],\
               X_test_scaled[['sqft_living_log','bathrooms','bedrooms']]
X_tr4, X_te4 = X_train_scaled[['sqft_living_log','bathrooms','bedrooms','floors']],\
               X_test_scaled[['sqft_living_log','bathrooms','bedrooms','floors']]

trp1, tep1 = linpreds(X_tr1, y_train, X_te1)
trp2, tep2 = linpreds(X_tr2, y_train, X_te2)
trp3, tep3 = linpreds(X_tr3, y_train, X_te3)
trp4, tep4 = linpreds(X_tr4, y_train, X_te4)

In [None]:
evaluate(y_train, y_test, trp4, tep4)

In [None]:
evaluate(y_train, y_test, trp3, tep3)

In [None]:
evaluate(y_train, y_test, trp2, tep2)

In [None]:
evaluate(y_train, y_test, trp1, tep1)

### Polynomial Features
As seen above, adding features yields only modest improvements in R2 and the error metrics. Let's see whether squared and interaction terms improve on this.

In [None]:
datantot.columns

In [None]:
Xpf = datantot.drop(columns=['price_log','price','id','sqft_basement','sqft_living', 'sqft_lot', 'sqft_above'])

pf = PolynomialFeatures(degree=2)
pf.fit(Xpf)
# get_feature_names_out replaces get_feature_names, which newer scikit-learn removed
Xpdf = pd.DataFrame(pf.transform(Xpf),
                    columns=pf.get_feature_names_out(Xpf.columns),
                    index=Xpf.index)

Xpf_train, Xpf_test, ypf_train, ypf_test = \
train_test_split(Xpdf, datantot['price_log'], test_size=0.33, random_state=42)
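
To make the degree-2 expansion concrete, the sketch below builds the same kind of columns by hand for two features: a constant, the original features, their squares, and their pairwise product. The two-row array is made up, and this is a simplified illustration rather than how `PolynomialFeatures` is implemented:

```python
from itertools import combinations_with_replacement

import numpy as np

names = ['bedrooms', 'bathrooms']
X = np.array([[3.0, 2.0],
              [4.0, 1.5]])

# Degree 0: the constant column, labeled '1'.
cols, labels = [np.ones(len(X))], ['1']

# Degree 1: the original features.
for j, name in enumerate(names):
    cols.append(X[:, j])
    labels.append(name)

# Degree 2: squares (i == j) and interaction terms (i != j).
for i, j in combinations_with_replacement(range(len(names)), 2):
    cols.append(X[:, i] * X[:, j])
    labels.append(f'{names[i]}^2' if i == j else f'{names[i]} {names[j]}')

X_poly = np.column_stack(cols)
print(labels)
print(X_poly)
```

With six input features, as in `Xpf`, the same scheme yields 28 columns (1 constant, 6 linear, 21 degree-2 terms), many of which are strongly correlated with each other.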

In [None]:
pfscaler = StandardScaler()
pfscaler.fit(Xpf_train)
Xpf_train_scaled = pfscaler.transform(Xpf_train)
Xpf_test_scaled = pfscaler.transform(Xpf_test)
Xpf_train_scaled = pd.DataFrame(Xpf_train_scaled, columns=Xpf_train.columns, index=Xpf_train.index)
Xpf_test_scaled = pd.DataFrame(Xpf_test_scaled, columns=Xpf_test.columns, index=Xpf_test.index)

In [None]:
pftrp1, pftep1 = linpreds(Xpf_train_scaled, ypf_train, Xpf_test_scaled)

In [None]:
evaluate(ypf_train, ypf_test, pftrp1, pftep1)

In [None]:
Xpf_train_scaled.columns

In [None]:
smXpf = Xpf_train_scaled.drop(columns='1')
pfsm = smols(smXpf, ypf_train, cols=smXpf.columns)
pfsm_df = pfsm.params.reset_index()
pfsm_df = pfsm_df.merge(pfsm.pvalues.reset_index(), on='index')
pfsm_df = pfsm_df.set_index('index')
pfsm_df.columns = ['coef','p_value']

In [None]:
pfsm_df.sort_values('coef', ascending=False)

In [None]:
pfsm_df.sort_values('p_value', ascending=False)

## Data Preparation

Describe and justify the process for preparing the data for analysis.

Questions to consider:

- Were there variables you dropped or created?
- How did you address missing values or outliers?
- Why are these choices appropriate given the data and the business problem?

In [None]:
# code here to prepare your data

## Modeling

Describe and justify the process for analyzing or modeling the data.

Questions to consider:

- How did you analyze the data to arrive at an initial approach?
- How did you iterate on your initial approach to make it better?
- Why are these choices appropriate given the data and the business problem?

## Evaluation

The evaluation of each model should accompany the creation of each model, and you should be sure to evaluate your models consistently.

Evaluate how well your work solves the stated business problem. 

Questions to consider:

- How do you interpret the results?
- How well does your model fit your data? How much better is this than your baseline model? Is it over or under fit?
- How well does your model/data fit any modeling assumptions?

For the final model, you might also consider:

- How confident are you that your results would generalize beyond the data you have?
- How confident are you that this model would benefit the business if put into use?

### Baseline Understanding

- What does a baseline, model-less prediction look like?

In [None]:
# code here to arrive at a baseline prediction

### First $&(@# Model

Before going too far down the data preparation rabbit hole, be sure to check your work against a first 'substandard' model! What is the easiest way for you to find out how hard your problem is?

In [None]:
# code here for your first 'substandard' model

In [None]:
# code here to evaluate your first 'substandard' model

### Modeling Iterations

Now you can start to use the results of your first model to iterate - there are many options!

In [None]:
# code here to iteratively improve your models

In [None]:
# code here to evaluate your iterations

### 'Final' Model

In the end, you'll arrive at a 'final' model - aka the one you'll use to make your recommendations/conclusions. This likely blends any group work. It might not be the one with the highest scores, but instead might be considered 'final' or 'best' for other reasons.

In [None]:
# code here to show your final model

In [None]:
# code here to evaluate your final model

## Conclusions

Provide your conclusions about the work you've done, including any limitations or next steps.

Questions to consider:

- What would you recommend the business do as a result of this work?
- What are some reasons why your analysis might not fully solve the business problem?
- What else could you do in the future to improve this project (future work)?
