# Predict Boston Housing Prices

This Python program predicts house prices in Boston using linear regression.

# Linear Regression
Linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
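
To make the definition concrete, here is a minimal sketch of ordinary least squares with a single explanatory variable, written in plain Python. The function name `fit_simple_linear` and the toy data are assumptions made for illustration; they are not part of this notebook or of scikit-learn.

```python
def fit_simple_linear(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_linear(xs, ys)
print(intercept, slope)  # 1.0 2.0
```

With several explanatory variables the same idea applies, but the coefficients are found by solving a system of equations rather than with this one-variable formula.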

## Pros:
1. Simple to implement.
2. Used to predict numeric values.

## Cons:
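1. Prone to overfitting when there are many features relative to the number of samples.
2. Cannot capture a non-linear relationship between the independent and dependent variables unless the features are transformed.
3. Not well suited to high-dimensional data without regularization or feature selection.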
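
The second limitation can be illustrated with a small sketch: fitting a straight line to data generated from y = x² over a symmetric range. The helper names `fit_line` and `r_squared` and the data are hypothetical and chosen only to show the effect.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(ys, preds):
    """Fraction of the variance in ys explained by preds."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical, purely quadratic relationship
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
preds = [intercept + slope * x for x in xs]
r2 = r_squared(ys, preds)
print(slope, r2)
```

Here the best straight line is flat, so it explains none of the variation in `ys` even though `ys` is fully determined by `xs`. Adding a transformed feature such as x² would let a linear model fit this data.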

In [13]:
#import the libraries
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

In [19]:
#Load the Boston Housing data (load_boston was removed in scikit-learn 1.2, so this cell needs an older version)
from sklearn.datasets import load_boston
boston = load_boston()

In [20]:
#get the feature names
boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [23]:
#get the target values of the first five rows
boston.target[:5]

array([24. , 21.6, 34.7, 33.4, 36.2])

In [25]:
#get the feature values of the row at index 5 (the sixth row)
boston.data[5]

array([2.9850e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
       6.4300e+00, 5.8700e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
       1.8700e+01, 3.9412e+02, 5.2100e+00])

In [26]:
# We will use boston.data, boston.feature_names and boston.target to create a new DataFrame

In [27]:
#build the feature DataFrame and the target labels from the Boston data
df = pd.DataFrame(boston.data, columns = boston.feature_names)
y = pd.DataFrame(boston.target)


In [28]:
#Get some statistics
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CRIM | 506.0 | 3.613524 | 8.601545 | 0.00632 | 0.082045 | 0.25651 | 3.677083 | 88.9762 |
| ZN | 506.0 | 11.363636 | 23.322453 | 0.0 | 0.0 | 0.0 | 12.5 | 100.0 |
| INDUS | 506.0 | 11.136779 | 6.860353 | 0.46 | 5.19 | 9.69 | 18.1 | 27.74 |
| CHAS | 506.0 | 0.06917 | 0.253994 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| NOX | 506.0 | 0.554695 | 0.115878 | 0.385 | 0.449 | 0.538 | 0.624 | 0.871 |
| RM | 506.0 | 6.284634 | 0.702617 | 3.561 | 5.8855 | 6.2085 | 6.6235 | 8.78 |
| AGE | 506.0 | 68.574901 | 28.148861 | 2.9 | 45.025 | 77.5 | 94.075 | 100.0 |
| DIS | 506.0 | 3.795043 | 2.10571 | 1.1296 | 2.100175 | 3.20745 | 5.188425 | 12.1265 |
| RAD | 506.0 | 9.549407 | 8.707259 | 1.0 | 4.0 | 5.0 | 24.0 | 24.0 |
| TAX | 506.0 | 408.237154 | 168.537116 | 187.0 | 279.0 | 330.0 | 666.0 | 711.0 |


In [29]:
#init the model
reg = linear_model.LinearRegression()

In [31]:
#split the data 80/20 into training and test sets
x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=42)
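
As a rough illustration of what a shuffled 80/20 split does, the following sketch uses only the standard library. It is not scikit-learn's implementation: the function name `split_train_test`, the use of `math.ceil` to size the test set, and the toy rows are all assumptions of this example, and the same seed will not reproduce the rows that `train_test_split` selects.

```python
import math
import random


def split_train_test(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(len(rows)))
    # a fixed seed makes the shuffle, and so the split, repeatable
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Hypothetical dataset of ten rows
rows = list(range(10))
train, test = split_train_test(rows)
print(len(train), len(test))  # 8 2
```

Every row ends up in exactly one of the two sets, so the model is evaluated only on rows it never saw during training.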

In [32]:
#train the model on the training set
reg.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [33]:
#print the coefficients
print(reg.coef_)

[[-1.13055924e-01  3.01104641e-02  4.03807204e-02  2.78443820e+00
  -1.72026334e+01  4.43883520e+00 -6.29636221e-03 -1.44786537e+00
   2.62429736e-01 -1.06467863e-02 -9.15456240e-01  1.23513347e-02
  -5.08571424e-01]]
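
Each coefficient is the weight applied to one feature. A fitted scikit-learn linear model exposes these weights as `coef_` and the constant term as `intercept_`, and a prediction is the intercept plus the weighted sum of the feature values. The sketch below shows that arithmetic with made-up numbers; `predict_one` and the values are hypothetical and are not taken from the model above.

```python
def predict_one(features, coefficients, intercept):
    """Return intercept + sum of coefficient * feature value."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Hypothetical two-feature model, not the Boston coefficients
coefficients = [2.0, -1.0]
intercept = 0.5
features = [3.0, 4.0]

prediction = predict_one(features, coefficients, intercept)
print(prediction)  # 2.5
```

A positive coefficient raises the predicted value as its feature grows and a negative one lowers it, but the magnitudes are only comparable across features that share a scale.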


In [34]:
#predict on the test set
y_pred = reg.predict(x_test)

In [37]:
#evaluate with the root mean squared error (RMSE) on the test set
from sklearn.metrics import mean_squared_error as mse
print(f"Rmse is {np.sqrt(mse(y_test , y_pred))}")

Rmse is 4.928602182665303
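
The RMSE is the square root of the average squared difference between actual and predicted values, so it is expressed in the same units as the target. A minimal sketch of the calculation in plain Python follows; the function name `rmse` and the sample values are assumptions for illustration only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted values
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]

# squared errors are 4, 4, 9 and 1, so the mean squared error is 4.5
error = rmse(y_true, y_pred)
print(error)  # about 2.12
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.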
