
#**MULTIPLE LINEAR REGRESSION MODEL FOR THE PRICE PREDICTION OF HOUSES IN KING COUNTY, WASHINGTON STATE, USA**

Import the required libraries for preprocessing and setup 

In [None]:
import numpy as np 
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

print("Setup complete")

Setup complete


Loading the dataset

In [None]:
kc_properties = pd.read_csv("/content/kc_house_data.csv")
print("Successful")

Successful


In [None]:
kc_properties.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [None]:
kc_properties.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

###FEATURE SELECTION

**Here we select the features to be used for the model**

I will use the features that significantly affect house prices, based on my analysis in a separate file in my GitHub repository.

The following features will be used as the independent variables 

* Number of bedrooms (bedrooms)
* Number of bathrooms (bathrooms)
* Square footage of the living space (sqft_living)
* Number of floors (floors)
* Waterfront presence - 1 for yes and 0 for no (waterfront)
* Rating of the view from the property - 0 to 4 (view)
* Condition of the house - 5 is excellent (condition)
* Grade of the house based on the King County grading system - 13 is the best (grade)

The target variable is the price of the house, because that is what we are trying to predict.
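The analysis behind this selection is not shown here. One common way to screen candidate features is to check how strongly each correlates with the target. The sketch below assumes that approach and uses only the five rows printed by `kc_properties.head()` above, so the numbers are illustrative and far too few to draw conclusions from; the name `sample` is hypothetical.

```python
import pandas as pd

# The five rows shown by kc_properties.head(), restricted to a few columns
sample = pd.DataFrame({
    "price":       [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
    "bedrooms":    [3, 3, 2, 4, 3],
    "bathrooms":   [1.0, 2.25, 1.0, 3.0, 2.0],
    "sqft_living": [1180, 2570, 770, 1960, 1680],
    "grade":       [7, 7, 6, 7, 8],
})

# Pearson correlation of every feature with price, strongest first
corr_with_price = (
    sample.corr()["price"]
    .drop("price")
    .sort_values(ascending=False)
)
print(corr_with_price)
```

On the full dataset the same call would be `kc_properties.corr(numeric_only=True)["price"]`, which skips the non-numeric `date` column.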

Let's get to work.

###**CREATING TRAINING AND TESTING DATASETS**

In [None]:
features_df = kc_properties[['bedrooms', 'bathrooms', 'sqft_living', 'floors', 'waterfront', 'view', 'condition'
, 'grade', 'price']]

In [None]:
mask = np.random.rand(len(kc_properties)) < 0.8
train = features_df[mask]
test = features_df[~mask]
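Because the mask is drawn without a fixed seed, the exact row counts in each subset change from run to run. A minimal sketch of a reproducible variant follows, assuming a seed of 42 (an arbitrary choice, not from the original notebook) and a small stand-in DataFrame named `demo_df` in place of `features_df`:

```python
import numpy as np
import pandas as pd

# Stand-in for features_df (hypothetical values, for illustration only)
demo_df = pd.DataFrame({
    "sqft_living": np.arange(1000, 2000, 10),
    "price": np.arange(100_000, 200_000, 1_000),
})

# A seeded generator produces the same random mask on every run
rng = np.random.default_rng(42)  # 42 is an arbitrary, assumed seed
mask = rng.random(len(demo_df)) < 0.8

train_demo = demo_df[mask]
test_demo = demo_df[~mask]

# Every row lands in exactly one of the two subsets
print(len(train_demo), len(test_demo), len(train_demo) + len(test_demo))
```

The split is still only approximately 80/20, since each row is assigned independently with probability 0.8.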

In [None]:
train.shape

(17342, 9)

* The training dataset has 17342 entries

In [None]:
test.shape

(4271, 9)

* The testing dataset has 4271 entries

###**MODELING**

First we split the training set into independent and target variables (remembering to convert them to NumPy arrays)

In [None]:
x = np.asanyarray(train[['bedrooms', 'bathrooms', 'sqft_living', 'floors', 'waterfront', 'view', 'condition', 'grade']])
y = np.asanyarray(train[['price']])

In [None]:
x[0:5]

array([[3.000e+00, 2.250e+00, 2.570e+03, 2.000e+00, 0.000e+00, 0.000e+00,
        3.000e+00, 7.000e+00],
       [3.000e+00, 2.000e+00, 1.680e+03, 1.000e+00, 0.000e+00, 0.000e+00,
        3.000e+00, 8.000e+00],
       [4.000e+00, 4.500e+00, 5.420e+03, 1.000e+00, 0.000e+00, 0.000e+00,
        3.000e+00, 1.100e+01],
       [3.000e+00, 2.250e+00, 1.715e+03, 2.000e+00, 0.000e+00, 0.000e+00,
        3.000e+00, 7.000e+00],
       [3.000e+00, 1.500e+00, 1.060e+03, 1.000e+00, 0.000e+00, 0.000e+00,
        3.000e+00, 7.000e+00]])

In [None]:
y[0:5]

array([[ 538000.],
       [ 510000.],
       [1225000.],
       [ 257500.],
       [ 291850.]])

Now we import the libraries for the modeling 

In [None]:
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(x, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Let's see the coefficients of the equation

In [None]:
print("The coefficients of the features:", model.coef_)

The coefficients of the features: [[-3.06274995e+04 -1.04924608e+04  1.91819199e+02 -9.59860119e+03
   5.63088780e+05  6.32965573e+04  5.44145001e+04  9.78208574e+04]]


Let's see the intercept

In [None]:
print("The intercept:", model.intercept_)

The intercept: [-674087.72549068]
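The coefficients and intercept define the fitted equation: price = intercept + the sum of each coefficient times its feature value. As a worked sketch, the snippet below plugs in the printed values (rounded as displayed) and the feature values of the first row shown by `kc_properties.head()`. The variable names are illustrative.

```python
# Coefficients and intercept as printed above (rounded as displayed)
feature_names = ["bedrooms", "bathrooms", "sqft_living", "floors",
                 "waterfront", "view", "condition", "grade"]
coefficients = [-3.06274995e+04, -1.04924608e+04, 1.91819199e+02, -9.59860119e+03,
                5.63088780e+05, 6.32965573e+04, 5.44145001e+04, 9.78208574e+04]
intercept = -674087.72549068

# Feature values of the first row shown by kc_properties.head()
house = [3, 1.0, 1180, 1.0, 0, 0, 3, 7]

# Contribution of each feature: coefficient * value
for name, coef, value in zip(feature_names, coefficients, house):
    print(f"{name:>12}: {coef * value:>14,.2f}")

# price = intercept + sum(coefficient_i * feature_i)
predicted_price = intercept + sum(c * v for c, v in zip(coefficients, house))
print(f"Predicted price: {predicted_price:,.2f}")
```

For this house the equation gives roughly 288,000, against a listed price of 221,900. Each coefficient is the estimated change in price for a one-unit increase in that feature with the others held fixed, so under this model one extra square foot of living space adds about 192 to the prediction.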


###**PREDICTION OF PRICES**

In [None]:
predicted_prices = model.predict(test[['bedrooms', 'bathrooms', 'sqft_living', 'floors', 'waterfront'
, 'view', 'condition', 'grade']])

###**EVALUATION OF THE MODEL**

This shows how accurate the model is.

First, convert the test dataset into a numpy array 

In [39]:
X = np.asanyarray(test[['bedrooms', 'bathrooms', 'sqft_living', 'floors', 'waterfront'
, 'view', 'condition', 'grade']])
# Keep Y two-dimensional, shape (n, 1), so it matches predicted_prices in the subtraction below
Y = np.asanyarray(test[['price']])

**MEAN SQUARED ERROR**

In [47]:
print("Mean squared error of the model: %.2f" %np.mean((predicted_prices - Y) ** 2))

Mean squared error of the model: 54281448719.28
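The mean squared error is in squared dollars, so its square root (the RMSE) is easier to read. When computing either by hand, it is also worth confirming that predictions and targets have the same shape, since NumPy broadcasts an `(n, 1)` array against an `(n,)` array into an `(n, n)` matrix of pairwise differences. A small sketch follows; the `pred` and `actual` arrays are made-up values, not taken from the dataset.

```python
import numpy as np

# Reported test-set MSE from the output above
mse = 54281448719.28
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:,.0f}")  # root-mean-square prediction error, in dollars

# Illustrative arrays (hypothetical values)
pred = np.array([[100.0], [200.0], [300.0]])  # shape (3, 1)
actual = np.array([110.0, 190.0, 330.0])      # shape (3,)

# Mismatched shapes broadcast to a 3x3 matrix of all pairwise differences
print((pred - actual).shape)

# Flattening both sides gives the intended element-wise differences
errors = pred.ravel() - actual.ravel()
print(np.mean(errors ** 2))
```

The RMSE works out to about 233,000, meaning predictions miss the sale price by that many dollars in a root-mean-square sense.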


**COEFFICIENT OF DETERMINATION (R² SCORE)**

In [48]:
print("Variance score of the model: %.2f" %model.score(X, Y))

Variance score of the model: 0.62


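* `model.score` returns the coefficient of determination (R²). A value of 0.62 means the model explains about 62% of the variance in test-set prices, which is a moderate fit (a perfect prediction scores 1)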
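For a regressor, `score` computes R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. A minimal sketch of that formula with made-up numbers (the arrays below are illustrative, not from the dataset):

```python
import numpy as np

actual = np.array([200.0, 300.0, 400.0, 500.0])     # hypothetical true prices
predicted = np.array([220.0, 280.0, 430.0, 470.0])  # hypothetical predictions

ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares

r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

A model that always predicts the mean price scores 0, and a model that does worse than that scores below 0.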

#**THANK YOU FOR VIEWING MY PROJECT**