# Pseudocode

## Cleaning:
    - NaN/?/Blank
        - waterfront
        - view
        - yr_renovated
        - sqft_basement
    - whitespace (no issues found)
    - format to int/float
        - view
        - waterfront
        - condition
        - grade
        - sqft_basement
    - deal with dupes
        - several properties are listed more than once, suggesting they were bought/sold during this timeframe. 
    - categorical encoding
        - bedrooms
        - bathrooms
        - floors
        - zipcode
        - waterfront
        - view
        - condition
        - grade
    - duplicate properties
        - Located, need to decide how to treat. 
    - drop unused columns
        - lat/long once done with all other steps. Technically this combination is categorical.
        - date
        - sqft_basement? 454 records have no value here; at a minimum we would need to drop those records.
    - save cleaned data

## Ideas for stakeholders:
 - looking for properties to flip
    - Maybe look at adding sqft or bedrooms/bathrooms to see what improvements add most value?
 - looking for investment properties
     - identify combinations of bedrooms/bathrooms/year built/etc that are underpriced?
 - real estate agents providing guidance to sellers about what price they can expect for their home.
     - define ranges of price expected based on bedrooms/bathrooms/sqft/etc.
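
The price-range idea for real estate agents could start from grouped quantiles. This is only a minimal sketch on made-up numbers: the `toy` frame and the choice of the 25th/50th/75th percentiles are assumptions for illustration, not results from the King County data.

```python
import pandas as pd

# Made-up prices, for illustration only (not taken from kc_house_data.csv).
toy = pd.DataFrame({
    'bedrooms': [2, 2, 2, 3, 3, 3],
    'price': [180000.0, 200000.0, 260000.0, 300000.0, 340000.0, 500000.0],
})

# One row per bedroom count, one column per quantile.
price_ranges = (
    toy.groupby('bedrooms')['price']
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(price_ranges)
```

The same pattern extends to several grouping columns (for example bedrooms and bathrooms together), at the cost of thinner groups.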
    

# EDA

In [27]:
# Importing packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import statsmodels.api as sm

from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('../data/kc_house_data.csv')


In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.columns

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  object 
 9   view           21534 non-null  object 
 10  condition      21597 non-null  object 
 11  grade          21597 non-null  object 
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

In [5]:
# Looking for NANs
df.isna().sum()

id                  0
date                0
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       2376
view               63
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64

#### Addressing Waterfront

There are ~2400 NA cells and no easy way to determine whether those properties are waterfront. Recommend we encode 0/1/2 as no/yes/unknown.

In [None]:
df['waterfront'].value_counts()

In [None]:
sum(df['waterfront'].isna())

In [6]:
df['waterfront'] = df['waterfront'].fillna('N/A')
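
The 0/1/2 encoding recommended for waterfront could look roughly like the sketch below. It runs on a small stand-in series, and the `'NO'`/`'YES'` labels are an assumption; check `df['waterfront'].value_counts()` for the actual spellings before applying it.

```python
import numpy as np
import pandas as pd

# Stand-in for df['waterfront']; the 'NO'/'YES' labels are assumed.
waterfront = pd.Series(['NO', 'YES', np.nan, 'NO'])

# 0 = no, 1 = yes, 2 = unknown. Anything not in the mapping (NaN, 'N/A',
# or an unexpected label) becomes NaN in map() and is then filled with 2.
waterfront_code = waterfront.map({'NO': 0, 'YES': 1}).fillna(2).astype(int)
print(waterfront_code.tolist())  # [0, 1, 2, 0]
```

Because unmapped values fall through to 2, the same line works whether the missing cells are still NaN or have already been filled with `'N/A'`.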

#### Addressing View

There don't seem to be many rows with NA (63). Maybe encode them as N/A?

In [7]:
sum(df['view'].isna())

63

In [None]:
df['view'].value_counts()

In [11]:
# Replace missing values with 'N/A' (assigned back to the column, so inplace is not needed).
df['view'] = df['view'].fillna('N/A')

#### Addressing yr_renovated

~78% have 0 values, so assume they were not renovated. I think renovations are a meaningful data point, so we should encode this somehow. We could look at how recent the renovation was to see if there is a meaningful relationship. For example, a home renovated in 2020 is likely to sell better than a home last renovated in 1980. Maybe we do some research on this and decide on a breakpoint for renovated before/after? We could also run a simple linear regression between yr_renovated and price...

In [9]:
sum(df['yr_renovated'].isna())

3842

In [52]:
df['yr_renovated'].value_counts(sort=False).head(75)

0.0       17011
N/A        3842
1934.0        1
1940.0        2
1944.0        1
          ...  
2011.0        9
2012.0        8
2013.0       31
2014.0       73
2015.0       14
Name: yr_renovated, Length: 71, dtype: int64

In [12]:
df['yr_renovated'] = df['yr_renovated'].fillna('N/A')
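
The before/after breakpoint idea for yr_renovated might be sketched as follows. The cutoff year (`BREAKPOINT = 2000`) and the three category names are hypothetical placeholders, and the series is a stand-in for `df['yr_renovated']`.

```python
import numpy as np
import pandas as pd

# Stand-in for df['yr_renovated'] after missing values were filled with 'N/A'.
yr_renovated = pd.Series([0.0, 'N/A', 1980.0, 2014.0])

BREAKPOINT = 2000  # hypothetical cutoff year, to be chosen after some research

# Turn 'N/A' back into NaN, then treat unknown the same as "not renovated" (0).
years = pd.to_numeric(yr_renovated, errors='coerce').fillna(0)

# Bins are right-inclusive: (-1, 0], (0, 1999], (1999, inf).
renovation = pd.cut(
    years,
    bins=[-1, 0, BREAKPOINT - 1, np.inf],
    labels=['none_or_unknown', 'before_cutoff', 'since_cutoff'],
)
print(renovation.tolist())
```

Lumping unknown together with not renovated is itself a modeling choice; a separate `unknown` level is just as easy if the two should stay distinct.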

#### Duplicate Properties

In [None]:
# Looking for duplicate properties

df['id'].duplicated().sum()

In [None]:
# Identify instances of the same property appearing more than once in the data set, based on id.
# How do we treat this? Might not matter since there is only ~4-5 months of data here. As long as the listing date
# is different, it is probably OK to treat these as unique listings.


df[df.duplicated(subset=['id'], keep=False)].sort_values('id')
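
If we decide not to treat repeat sales as separate listings, one option is to keep only the most recent sale per property. The sketch below uses a made-up frame, and the `%m/%d/%Y` date format is an assumption about how `date` is stored in the CSV.

```python
import pandas as pd

# Made-up stand-in for df, with only the columns that matter here.
sales = pd.DataFrame({
    'id': [1, 1, 2],
    'date': ['7/25/2014', '12/23/2014', '5/2/2015'],
    'price': [430000.0, 700000.0, 250000.0],
})

# Parse the strings so that sorting is chronological rather than alphabetical.
sales['date'] = pd.to_datetime(sales['date'], format='%m/%d/%Y')

# Keep the last (most recent) row for each id.
latest = sales.sort_values('date').drop_duplicates(subset='id', keep='last')
print(latest)
```

Switching to `keep='first'` would retain the earliest sale instead, which may suit the flipping question better because the later price reflects any improvements.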

#### Addressing grade
We can derive a numerical column from grade to use in our model. It might need to be encoded later.

In [None]:
df['grade'].value_counts()

In [None]:
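# Create new column for numerical grade.
# Assumes labels such as '7 Average' or '10 Very Good': take the whole leading
# number, since x[:1] would turn grades 10 and above into 1.
# Should either drop df['grade'] or omit it from clean df.
df['grade_num'] = df['grade'].str.split().str[0].astype(int)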

In [None]:
df.info()

#### Addressing sqft_basement

We have 454 unknown values here (recorded as '?'), so we need to either drop or encode them.
Recommend we drop those rows, since we don't have an easy way to encode them. An alternative would be a has-basement / no-basement flag.


In [None]:
df['sqft_basement'].value_counts()

In [None]:
df['sqft_basement'] = df['sqft_basement'].replace(['?'],'N/A')

In [None]:
df['sqft_basement'].value_counts()
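
As an alternative to dropping the unknown rows, the sketch below converts the column to numbers and adds the has-basement flag mentioned above. Filling the gaps with `sqft_living - sqft_above` rests on the assumption that living area is the sum of above-ground and basement area; that should be verified on the rows where all three values are known. The `toy` frame is made up.

```python
import pandas as pd

# Made-up stand-in for the three relevant columns of df.
toy = pd.DataFrame({
    'sqft_living': [1180, 2570, 1960],
    'sqft_above': [1180, 2170, 1050],
    'sqft_basement': ['0.0', '400.0', '?'],
})

# '?' (or 'N/A') cannot be parsed and becomes NaN; the rest become floats.
basement = pd.to_numeric(toy['sqft_basement'], errors='coerce')

# Assumption: sqft_living = sqft_above + sqft_basement.
toy['sqft_basement'] = basement.fillna(toy['sqft_living'] - toy['sqft_above'])

# Simple 0/1 flag for "has a basement".
toy['has_basement'] = (toy['sqft_basement'] > 0).astype(int)
print(toy)
```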

#### Addressing Floors

Categorical variable with a minimum of 1 and a maximum of 3.5.

In [None]:
df['floors'].value_counts()

In [25]:
df_dummied = pd.get_dummies(df.drop(['floors'], axis=1))

In [26]:
df_dummied.head()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,sqft_above,yr_built,zipcode,lat,...,yr_renovated_2007.0,yr_renovated_2008.0,yr_renovated_2009.0,yr_renovated_2010.0,yr_renovated_2011.0,yr_renovated_2012.0,yr_renovated_2013.0,yr_renovated_2014.0,yr_renovated_2015.0,yr_renovated_N/A
0,7129300520,221900.0,3,1.0,1180,5650,1180,1955,98178,47.5112,...,0,0,0,0,0,0,0,0,0,0
1,6414100192,538000.0,3,2.25,2570,7242,2170,1951,98125,47.721,...,0,0,0,0,0,0,0,0,0,0
2,5631500400,180000.0,2,1.0,770,10000,770,1933,98028,47.7379,...,0,0,0,0,0,0,0,0,0,1
3,2487200875,604000.0,4,3.0,1960,5000,1050,1965,98136,47.5208,...,0,0,0,0,0,0,0,0,0,0
4,1954400510,510000.0,3,2.0,1680,8080,1680,1987,98074,47.6168,...,0,0,0,0,0,0,0,0,0,0


In [23]:
len(df_dummied.columns)

780
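
The 780 columns come from `pd.get_dummies` expanding every object-typed column, which at this point includes `date`, `sqft_basement` and the `'N/A'`-filled `yr_renovated`. A minimal sketch of restricting the expansion with the `columns` argument, on a made-up frame (the values are illustrative only):

```python
import pandas as pd

# Made-up stand-in for a few columns of df.
toy = pd.DataFrame({
    'date': ['10/13/2014', '12/9/2014', '2/25/2015'],
    'price': [221900.0, 538000.0, 180000.0],
    'floors': [1.0, 2.0, 1.0],
    'condition': ['Average', 'Average', 'Good'],
})

# Without `columns`, every object column is expanded, including `date`,
# while the numeric `floors` column is left as a plain number.
everything = pd.get_dummies(toy)

# Naming the columns limits the expansion and also lets a numeric code
# such as `floors` be treated as a category.
selected = pd.get_dummies(toy.drop(columns=['date']), columns=['floors', 'condition'])

print(everything.shape[1], selected.shape[1])  # 7 5
```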

# Nonsense Model with all columns

In [35]:
dummy_data = df.drop(columns=['date', 'lat', 'long', 'id'])

In [36]:
df_dummied = pd.get_dummies(dummy_data)
df_dummied.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,sqft_above,yr_built,zipcode,sqft_living15,...,yr_renovated_2007.0,yr_renovated_2008.0,yr_renovated_2009.0,yr_renovated_2010.0,yr_renovated_2011.0,yr_renovated_2012.0,yr_renovated_2013.0,yr_renovated_2014.0,yr_renovated_2015.0,yr_renovated_N/A
0,221900.0,3,1.0,1180,5650,1.0,1180,1955,98178,1340,...,0,0,0,0,0,0,0,0,0,0
1,538000.0,3,2.25,2570,7242,2.0,2170,1951,98125,1690,...,0,0,0,0,0,0,0,0,0,0
2,180000.0,2,1.0,770,10000,1.0,770,1933,98028,2720,...,0,0,0,0,0,0,0,0,0,1
3,604000.0,4,3.0,1960,5000,1.0,1050,1965,98136,1360,...,0,0,0,0,0,0,0,0,0,0
4,510000.0,3,2.0,1680,8080,1.0,1680,1987,98074,1800,...,0,0,0,0,0,0,0,0,0,0


In [None]:
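# Draft formula (unfinished); terms are joined with '+', and 'sqft_below' is dropped because no such column exists.
formula = 'price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + sqft_above'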

In [37]:
X = df_dummied.drop('price', axis=1)
y = df['price']
model = sm.OLS(endog=y, exog=X).fit()
model.summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.708
Model:,OLS,Adj. R-squared:,0.703
Method:,Least Squares,F-statistic:,128.9
Date:,"Sun, 19 Jun 2022",Prob (F-statistic):,0.0
Time:,13:03:49,Log-Likelihood:,-294090.0
No. Observations:,21597,AIC:,589000.0
Df Residuals:,21197,BIC:,592200.0
Df Model:,399,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
bedrooms,-2.461e+04,1959.655,-12.557,0.000,-2.84e+04,-2.08e+04
bathrooms,4.996e+04,3363.259,14.856,0.000,4.34e+04,5.66e+04
sqft_living,97.9369,21.210,4.617,0.000,56.363,139.511
sqft_lot,0.0346,0.048,0.723,0.469,-0.059,0.128
floors,4.404e+04,3687.409,11.942,0.000,3.68e+04,5.13e+04
sqft_above,11.9351,21.221,0.562,0.574,-29.659,53.530
yr_built,-3261.2574,71.132,-45.848,0.000,-3400.681,-3121.834
zipcode,31.2565,29.644,1.054,0.292,-26.847,89.360
sqft_living15,44.9364,3.447,13.036,0.000,38.180,51.693

0,1,2,3
Omnibus:,10761.517,Durbin-Watson:,1.97
Prob(Omnibus):,0.0,Jarque-Bera (JB):,234617.706
Skew:,1.907,Prob(JB):,0.0
Kurtosis:,18.69,Cond. No.,2.01e+21
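
The condition number of 2.01e+21 suggests the design matrix is close to singular. One likely contributor is that every fully expanded dummy group sums to a column of ones, so the groups are linear combinations of each other. The sketch below illustrates that on a made-up two-column frame using only pandas and NumPy; it is not a re-run of the model above.

```python
from itertools import product

import numpy as np
import pandas as pd

# Made-up frame containing every combination of two 3-level categories.
toy = pd.DataFrame(
    list(product(['NO', 'YES', 'N/A'], ['NONE', 'FAIR', 'GOOD'])),
    columns=['waterfront', 'view'],
)

# Full expansion: 6 dummy columns, but the two groups each sum to 1.
full = pd.get_dummies(toy, dtype=float)
rank_full = np.linalg.matrix_rank(full.to_numpy())

# Drop one level per group and add an explicit intercept column instead.
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)
reduced.insert(0, 'const', 1.0)
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())

print(full.shape[1], rank_full)        # 6 5
print(reduced.shape[1], rank_reduced)  # 5 5
```

With `drop_first=True` each coefficient is read relative to the dropped level, and the intercept absorbs the baseline.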
