# House Sales Prediction (Linear Regression)

#### Feature Columns
    
* id - Unique ID for each home sold
* date - Date of the home sale
* price - Price of each home sold
* bedrooms - Number of bedrooms
* bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
* sqft_living - Square footage of the apartment's interior living space
* sqft_lot - Square footage of the land space
* floors - Number of floors
* waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not
* view - An index from 0 to 4 of how good the view of the property was
* condition - An index from 1 to 5 on the condition of the apartment
* grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
* sqft_above - The square footage of the interior housing space that is above ground level
* sqft_basement - The square footage of the interior housing space that is below ground level
* yr_built - The year the house was initially built
* yr_renovated - The year of the house’s last renovation
* zipcode - What zipcode area the house is in
* lat - Latitude
* long - Longitude
* sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
* sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors


Data used: https://www.kaggle.com/harlfoxem/housesalesprediction

In [1]:
# import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# read the csv file 
Data = pd.read_csv('kc_house_data.csv')


In [3]:
Data.shape

(21613, 21)

In [4]:
Data = Data.drop('date',axis=1)

Train Test Split


In [5]:
X = Data.drop('price',axis =1)
y = Data['price']

# split the data into training and test sets (70% / 30%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
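
As a sanity check on the split above, the expected partition sizes can be worked out by hand. This is a minimal sketch under the assumption that, when only a fractional `test_size` is given, the test share is rounded up to a whole row and the training set receives the remaining rows. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    receives every remaining row.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 21613 is the row count reported by Data.shape
n_train, n_test = split_sizes(21613, 0.30)
print(n_train, n_test)
```

If the assumption holds, these two numbers should match `X_train.shape[0]` and `X_test.shape[0]`.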

# Method 2: Multiple Linear Regression

In [6]:
# Linear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()  
regressor.fit(X_train, y_train)

# inspect the fitted model (intercept and one coefficient per feature)
print(regressor.intercept_)
print(regressor.coef_)

6945806.66294097
[-1.67013259e-06 -3.72774020e+04  3.59635009e+04  1.10709164e+02
  1.36873862e-01  1.16505241e+04  5.50485950e+05  5.22119883e+04
  2.78318912e+04  9.68491860e+04  7.11836029e+01  3.95255612e+01
 -2.62906424e+03  1.89397988e+01 -5.83290523e+02  6.00503800e+05
 -2.14376145e+05  2.18382426e+01 -3.73198722e-01]


In [7]:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df

| | Coefficient |
|---|---|
| id | -2e-06 |
| bedrooms | -37277.402023 |
| bathrooms | 35963.500899 |
| sqft_living | 110.709164 |
| sqft_lot | 0.136874 |
| floors | 11650.524124 |
| waterfront | 550485.950006 |
| view | 52211.988264 |
| condition | 27831.891184 |
| grade | 96849.185982 |
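
In a linear model, each coefficient is the change in predicted price for a one-unit increase in that feature while all other features are held fixed. A minimal sketch of that arithmetic, using two coefficients copied from the output above. `price_change` is a hypothetical helper and the feature changes are made-up illustration values.

```python
# Coefficients copied from the coefficient output above.
coefficients = {
    "sqft_living": 110.709164,
    "bedrooms": -37277.402023,
}


def price_change(deltas, coefs):
    """Change in predicted price when the given features change by
    `deltas` and every other feature stays fixed."""
    return sum(coefs[name] * delta for name, delta in deltas.items())


# Hypothetical scenario: 100 extra square feet of living space.
print(round(price_change({"sqft_living": 100}, coefficients), 2))

# Hypothetical scenario: one more bedroom at the same square footage.
print(round(price_change({"bedrooms": 1}, coefficients), 2))
```

The negative `bedrooms` coefficient applies only with the other features fixed, so it describes an extra bedroom within the same `sqft_living`, not a larger house.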


In [8]:
y_pred = regressor.predict(X_test)

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(10)
df1

| | Actual | Predicted |
|---|---|---|
| 3834 | 349950.0 | 546359.768029 |
| 1348 | 450000.0 | 638102.003309 |
| 20366 | 635000.0 | 499533.817967 |
| 16617 | 355500.0 | 327999.116097 |
| 20925 | 246950.0 | 59419.931098 |
| 7891 | 406550.0 | 471700.08401 |
| 939 | 350000.0 | 296550.16725 |
| 10502 | 226500.0 | 323591.390441 |
| 2948 | 265000.0 | 252686.683249 |
| 5079 | 656000.0 | 526590.285581 |


In [11]:
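from sklearn import metrics
# report both error metrics shown in the output below
print('Mean Absolute Error: {:.2f}'.format(metrics.mean_absolute_error(y_test, y_pred)))
print('Mean Squared Error:{:.2f}'.format(metrics.mean_squared_error(y_test, y_pred)))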


Mean Absolute Error: 126051.39
Mean Squared Error:41054725150.16


41054725150.158165
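
The MSE is in squared dollars, which is hard to compare with the prices themselves. Below is a minimal plain-Python sketch of how the two metrics are defined, followed by the square root of the reported MSE (the RMSE, which the notebook does not compute) to bring it back to dollar units. The helper names are hypothetical, and the sample uses only the first three rows of the Actual/Predicted output, so its values will differ from the full test-set metrics.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (actual - predicted) squared."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# First three rows of the Actual/Predicted output (not the full test set).
actual = [349950.0, 450000.0, 635000.0]
predicted = [546359.768029, 638102.003309, 499533.817967]
print(mean_absolute_error(actual, predicted))
print(mean_squared_error(actual, predicted))

# RMSE for the full test set, derived from the MSE reported above.
reported_mse = 41054725150.16
rmse = math.sqrt(reported_mse)
print(round(rmse, 2))
```

The RMSE comes out at roughly 202,600 dollars, noticeably above the MAE of 126,051.39. Since squaring weights large errors more heavily, a gap like this usually suggests that a few homes are predicted very poorly.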

In [10]:
print('Linear Regression Model:')
print("Train Score {:.2f}".format(regressor.score(X_train, y_train)))
print("Test Score {:.2f}".format(regressor.score(X_test, y_test)))

Linear Regression Model:
Train Score 0.70
Test Score 0.71
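
For a scikit-learn regressor, `score` returns the coefficient of determination R². A minimal plain-Python sketch of that formula follows; `r_squared` is a hypothetical helper and the inputs are toy values, not rows from the dataset.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
print(r_squared(actual, actual))      # perfect predictions -> 1.0
print(r_squared(actual, [2.5] * 4))   # always predicting the mean -> 0.0
```

On this scale, a test score of 0.71 means the model explains about 71% of the variance in test-set prices, and the similar train score (0.70) gives no sign of overfitting.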
