# House Price Predictions Using Linear Regression

Data from Kaggle: https://www.kaggle.com/harlfoxem/housesalesprediction

Using this Kaggle data, I will try to predict house prices with linear regression and an SGD regressor.

In [39]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler




In [40]:
df = pd.read_csv("kc_house_data.csv")
pd.set_option("display.max_columns", None)

In [41]:
df.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

In [43]:
df.isnull().sum()

id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

### Preprocessing

The preprocessing stage involves:
- Dropping the id column
- Keeping just the year from the date column
- One-hot encoding the zipcode column, since it is a nominal feature
- Splitting the data into training (80%) and test (20%) sets
- Scaling the features with a standard scaler
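
To make the date and zipcode steps concrete, here is a minimal sketch on a hypothetical three-row frame laid out like the Kaggle file. The sample values, the explicit `format` string, and the `prefix` argument are illustrative choices, not part of the dataset documentation.

```python
import pandas as pd

# Hypothetical three-row sample in the same layout as kc_house_data.csv
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["20141013T000000", "20141209T000000", "20150225T000000"],
    "price": [221900.0, 538000.0, 180000.0],
    "zipcode": [98178, 98125, 98178],
})

# Drop the identifier, which is only a row label
sample = sample.drop("id", axis=1)

# Keep only the sale year from the timestamp string
parsed = pd.to_datetime(sample["date"], format="%Y%m%dT%H%M%S")
sample["date_year"] = parsed.dt.year
sample = sample.drop("date", axis=1)

# One-hot encode zipcode; the prefix keeps every column name a string
sample = pd.get_dummies(sample, columns=["zipcode"], prefix="zipcode")

print(sample.columns.tolist())
```

Each distinct zipcode becomes its own indicator column, so the model does not treat the zipcode number as an ordered quantity.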

In [44]:
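def preprocessing(df):
    df = df.copy()

    # The id is only a row identifier, so it is dropped
    df = df.drop("id", axis=1)

    # Keep only the sale year from the date column
    df["date"] = pd.to_datetime(df["date"])
    df["date_year"] = df["date"].dt.year
    df = df.drop("date", axis=1)

    # One-hot encode zipcode; the prefix keeps all column names strings,
    # since recent scikit-learn versions reject mixed int/str column names
    zipcode_dummies = pd.get_dummies(df["zipcode"], prefix="zipcode")
    df = pd.concat([df, zipcode_dummies], axis=1)
    df = df.drop("zipcode", axis=1)

    # Separate the features from the target
    X = df.drop("price", axis=1)
    y = df["price"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, shuffle=True, random_state=1)

    # Fit the scaler on the training set only, then apply it to both sets
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    return X_train, X_test, y_train, y_test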

In [45]:
X_train, X_test, y_train, y_test = preprocessing(df)
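
The scaler in `preprocessing` is fit on the training rows only and then reused on the test rows, so no test-set statistics leak into training. The sketch below illustrates the effect on synthetic data; `X_demo` and the other names are placeholders, not the housing features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (not the Kaggle data)
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=2000.0, scale=900.0, size=(500, 3))

X_tr, X_te = train_test_split(X_demo, train_size=0.8, shuffle=True, random_state=1)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean and std learned from training rows only
X_te_scaled = scaler.transform(X_te)      # the same mean and std reused on test rows

print(X_tr_scaled.mean(axis=0).round(3))  # zero up to floating-point error
print(X_te_scaled.mean(axis=0).round(3))  # generally small, but not exactly zero
```

The test columns end up only approximately standardized, which is expected: they are scaled with statistics the model was allowed to see.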

### Models

In [46]:
def linear_reg(X_train, X_test, y_train, y_test):
    """Fit ordinary least squares and return the model with its test-set R^2."""
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r_score = r2_score(y_test, y_pred)

    return model, r_score
    




In [47]:
lin_model, linear_r2 = linear_reg(X_train, X_test, y_train, y_test)
print("R^2 score for linear regression: {}".format(round(linear_r2, 4)))

R^2 score for linear regression: 0.7813
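
The R^2 score is one minus the ratio of the residual sum of squares to the total sum of squares, so a value of 0.7813 means the model accounts for roughly 78% of the variance in the test-set prices. As a small check of that definition, the sketch below computes it by hand on made-up numbers and compares it with `r2_score`; the prices and predictions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up prices and predictions, for illustration only
y_true = np.array([200000.0, 350000.0, 500000.0, 650000.0])
y_hat = np.array([220000.0, 330000.0, 520000.0, 610000.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

# Both values should agree
print(round(r2_manual, 4), round(r2_score(y_true, y_hat), 4))
```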


In [48]:
def sgd_reg(X_train, X_test, y_train, y_test):
    """Fit a linear model by stochastic gradient descent and return it with its test-set R^2."""
    model = SGDRegressor()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r_score = r2_score(y_test, y_pred)

    return model, r_score

In [49]:
sgd_model, sgd_r2 = sgd_reg(X_train, X_test, y_train, y_test)
print("R^2 score for SGDRegressor {}".format(round(sgd_r2, 4)))

R^2 score for SGDRegressor 0.7789
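
Both models fit a linear function with a squared-error loss by default, which is consistent with the two scores being close. `SGDRegressor` shuffles the training rows on each epoch, so without a fixed `random_state` its score can shift slightly between runs. The sketch below shows one way to make it repeatable, using synthetic data as a stand-in for the scaled housing features; the seed values and coefficients are arbitrary.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the scaled housing features
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(300, 4)))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=300)

# Same seed, same shuffling order, hence matching coefficients
first = SGDRegressor(random_state=1).fit(X_demo, y_demo)
second = SGDRegressor(random_state=1).fit(X_demo, y_demo)

print(np.allclose(first.coef_, second.coef_))
```

Passing the same `random_state` inside `sgd_reg` would be one way to make the reported score reproducible, at the cost of tying it to a single shuffling order.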
