# INTRODUCTION
We will work with a data set from the real estate industry in Boston (US). Using machine learning techniques, we explore the data and predict the median value of owner-occupied homes, expressed in thousands of USD. The target variable is MEDV: we fit a model on the training set and use it to predict MEDV for the test set.

# DATA DESCRIPTION

CRIM: per capita crime rate by town

ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS: proportion of non-retail business acres per town

CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

NOX: nitric oxides concentration (parts per 10 million)

RM: average number of rooms per dwelling

AGE: proportion of owner-occupied units built prior to 1940

DIS: weighted distances to five Boston employment centres

RAD: index of accessibility to radial highways

TAX: full-value property-tax rate per 10,000 USD

PTRATIO: pupil-teacher ratio by town

B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

LSTAT: lower status of the population (%)

MEDV: median value of owner-occupied homes, in thousands of USD

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
boston_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/Boston_Housing/Training_set_boston.csv")


In [3]:
eval_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/Boston_Housing/Testing_set_boston.csv')

In [4]:
boston_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404 entries, 0 to 403
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     404 non-null    float64
 1   ZN       404 non-null    float64
 2   INDUS    404 non-null    float64
 3   CHAS     404 non-null    float64
 4   NOX      404 non-null    float64
 5   RM       404 non-null    float64
 6   AGE      404 non-null    float64
 7   DIS      404 non-null    float64
 8   RAD      404 non-null    float64
 9   TAX      404 non-null    float64
 10  PTRATIO  404 non-null    float64
 11  B        404 non-null    float64
 12  LSTAT    404 non-null    float64
 13  MEDV     404 non-null    float64
dtypes: float64(14)
memory usage: 44.3 KB


In [5]:
eval_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     102 non-null    float64
 1   ZN       102 non-null    float64
 2   INDUS    102 non-null    float64
 3   CHAS     102 non-null    float64
 4   NOX      102 non-null    float64
 5   RM       102 non-null    float64
 6   AGE      102 non-null    float64
 7   DIS      102 non-null    float64
 8   RAD      102 non-null    float64
 9   TAX      102 non-null    float64
 10  PTRATIO  102 non-null    float64
 11  B        102 non-null    float64
 12  LSTAT    102 non-null    float64
dtypes: float64(13)
memory usage: 10.5 KB


In [6]:
boston_data.corr()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRIM | 1.0 | -0.198855 | 0.400198 | -0.044589 | 0.396406 | -0.200303 | 0.33409 | -0.366487 | 0.615947 | 0.576894 | 0.28897 | -0.356858 | 0.414142 | -0.380091 |
| ZN | -0.198855 | 1.0 | -0.533489 | -0.043754 | -0.526414 | 0.274661 | -0.575078 | 0.681817 | -0.31379 | -0.294267 | -0.389163 | 0.178652 | -0.396572 | 0.309504 |
| INDUS | 0.400198 | -0.533489 | 1.0 | 0.095158 | 0.770957 | -0.39869 | 0.636569 | -0.707566 | 0.588952 | 0.702353 | 0.348303 | -0.363151 | 0.603644 | -0.470546 |
| CHAS | -0.044589 | -0.043754 | 0.095158 | 1.0 | 0.135476 | 0.111272 | 0.096016 | -0.121671 | 0.028685 | 0.007746 | -0.113003 | 0.041666 | -0.070652 | 0.190642 |
| NOX | 0.396406 | -0.526414 | 0.770957 | 0.135476 | 1.0 | -0.299615 | 0.720417 | -0.77233 | 0.589061 | 0.650247 | 0.161253 | -0.368034 | 0.593862 | -0.415768 |
| RM | -0.200303 | 0.274661 | -0.39869 | 0.111272 | -0.299615 | 1.0 | -0.210863 | 0.198299 | -0.199738 | -0.281127 | -0.342643 | 0.113347 | -0.612577 | 0.71068 |
| AGE | 0.33409 | -0.575078 | 0.636569 | 0.096016 | 0.720417 | -0.210863 | 1.0 | -0.756589 | 0.430321 | 0.47167 | 0.240841 | -0.265186 | 0.571051 | -0.340216 |
| DIS | -0.366487 | 0.681817 | -0.707566 | -0.121671 | -0.77233 | 0.198299 | -0.756589 | 1.0 | -0.483329 | -0.523577 | -0.217588 | 0.291122 | -0.494921 | 0.235114 |
| RAD | 0.615947 | -0.31379 | 0.588952 | 0.028685 | 0.589061 | -0.199738 | 0.430321 | -0.483329 | 1.0 | 0.912527 | 0.472257 | -0.439387 | 0.480301 | -0.387467 |
| TAX | 0.576894 | -0.294267 | 0.702353 | 0.007746 | 0.650247 | -0.281127 | 0.47167 | -0.523577 | 0.912527 | 1.0 | 0.444836 | -0.442027 | 0.530632 | -0.459795 |
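
The MEDV column of the correlation matrix is the part most relevant to the prediction task: among the rows shown, RM has the strongest correlation with MEDV (0.71068). A minimal sketch of how that column could be extracted and ranked is given here. The helper name `rank_by_target_correlation` and the synthetic `demo` frame are illustrative assumptions, not part of the original notebook; the synthetic data only exists so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Return each feature's Pearson correlation with `target`, strongest first."""
    corr = df.corr()[target].drop(target)
    # Sort by absolute value so strong negative correlations rank high too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Synthetic stand-in data (NOT the Boston set), used only to make the sketch runnable.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
demo["target"] = 3.0 * demo["a"] - 2.0 * demo["b"] + rng.normal(scale=0.1, size=200)

ranked = rank_by_target_correlation(demo, "target")
print(ranked)
```

With the notebook's data, the equivalent call would be `rank_by_target_correlation(boston_data, "MEDV")`.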


In [7]:
# Separate the target (MEDV) from the 13 feature columns.
y = boston_data['MEDV']
x = boston_data.drop(columns=['MEDV'])

In [9]:
# Fit an ordinary least squares model on the full training set.
model = LinearRegression()
model.fit(x, y)
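
The notebook imports `train_test_split` and `mean_squared_error`, but the cells shown fit the model on all 404 training rows, so there is no estimate of how well it generalizes. One possible way to get such an estimate is a hold-out split, sketched below. The helper name `holdout_rmse`, the 80/20 split, and the seed are assumptions chosen for illustration, and the synthetic frame stands in for the real data so the block runs offline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def holdout_rmse(features, target, test_size=0.2, seed=42):
    """Fit a linear regression on a training split and return RMSE on the held-out split."""
    X_train, X_val, y_train, y_val = train_test_split(
        features, target, test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    # Take the square root so the error is in the same units as the target.
    return float(np.sqrt(mse))


# Synthetic stand-in data (NOT the Boston set): a linear signal plus noise of scale 0.5.
rng = np.random.default_rng(1)
demo_x = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f1", "f2", "f3"])
demo_y = 2.0 * demo_x["f1"] - 1.0 * demo_x["f2"] + rng.normal(scale=0.5, size=300)

demo_rmse = holdout_rmse(demo_x, demo_y)
print(f"hold-out RMSE: {demo_rmse:.3f}")
```

In the notebook this would be called as `holdout_rmse(x, y)`; since MEDV is in thousands of USD, the resulting RMSE would be in thousands of USD as well.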

In [10]:
# Predict MEDV for the 102 rows of the test set.
y_pred = model.predict(eval_data)
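
Calling `predict` on `eval_data` assumes its columns match the training features in both name and order. The `info()` outputs above suggest that holds here (the same 13 columns in the same order), but a defensive version could make the alignment explicit. The following is a sketch under that assumption; `predict_aligned` is a hypothetical helper, and the tiny synthetic frames are only there to make it runnable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def predict_aligned(model, train_features, new_data):
    """Predict after reordering `new_data` columns to match `train_features`."""
    missing = [c for c in train_features.columns if c not in new_data.columns]
    if missing:
        raise ValueError(f"new data is missing columns: {missing}")
    # Select columns in training order; any extra columns in new_data are ignored.
    return model.predict(new_data[list(train_features.columns)])


# Synthetic stand-in data (NOT the Boston set) with an exact linear relationship.
rng = np.random.default_rng(2)
train_x = pd.DataFrame(rng.normal(size=(50, 2)), columns=["a", "b"])
train_y = 2.0 * train_x["a"] - train_x["b"]
demo_model = LinearRegression().fit(train_x, train_y)

# New data arrives with its columns in a different order.
new_x = pd.DataFrame(rng.normal(size=(5, 2)), columns=["b", "a"])
preds = predict_aligned(demo_model, train_x, new_x)
print(preds)
```

Applied to the notebook, the call would be `predict_aligned(model, x, eval_data)`.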

In [11]:
print(y_pred)

[28.99672362 36.02556534 14.81694405 25.03197915 18.76987992 23.25442929
 17.66253818 14.34119    23.01320703 20.63245597 24.90850512 18.63883645
 -6.08842184 21.75834668 19.23922576 26.19319733 20.64773313  5.79472718
 40.50033966 17.61289074 27.24909479 30.06625441 11.34179277 24.16077616
 17.86058499 15.83609765 22.78148106 14.57704449 22.43626052 19.19631835
 22.43383455 25.21979081 25.93909562 17.70162434 16.76911711 16.95125411
 31.23340153 20.13246729 23.76579011 24.6322925  13.94204955 32.25576301
 42.67251161 17.32745046 27.27618614 16.99310991 14.07009109 25.90341861
 20.29485982 29.95339638 21.28860173 34.34451856 16.04739105 26.22562412
 39.53939798 22.57950697 18.84531367 32.72531661 25.0673037  12.88628956
 22.68221908 30.48287757 31.52626806 15.90148607 20.22094826 16.71089812
 20.52384893 25.96356264 30.61607978 11.59783023 20.51232627 27.48111878
 11.01962332 15.68096344 23.79316251  6.19929359 21.6039073  41.41377225
 18.76548695  8.87931901 20.83076916 13.25620627 20