# Boston Housing Data Set

Problem: Predict the median value of owner-occupied homes.

* Tutorial : https://www.analyticsvidhya.com/blog/2015/11/started-machine-learning-ms-excel-xl-miner/
* Get data : https://www.kaggle.com/c/boston-housing


#### Housing Values in Suburbs of Boston
#### The medv variable is the target variable.

Data description


crim - 
per capita crime rate by town.

zn - 
proportion of residential land zoned for lots over 25,000 sq.ft.

indus - 
proportion of non-retail business acres per town.

chas - 
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox - 
nitrogen oxides concentration (parts per 10 million).

rm - 
average number of rooms per dwelling.

age - 
proportion of owner-occupied units built prior to 1940.

dis - 
weighted mean of distances to five Boston employment centres.

rad - 
index of accessibility to radial highways.

tax - 
full-value property-tax rate per \$10,000.

ptratio - 
pupil-teacher ratio by town.

black - 
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat - 
lower status of the population (percent).

medv - 
median value of owner-occupied homes in \$1000s.

In [1]:
import pandas
import numpy
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error,r2_score

In [2]:
# Read Data
dataset=pandas.read_csv('Data/train.csv')
test=pandas.read_csv('Data/test.csv')
dataset.shape

(333, 15)

In [3]:
dataset.head()

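| | ID | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 3 | 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |
| 4 | 7 | 0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.6 | 12.43 | 22.9 |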


In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 15 columns):
ID         333 non-null int64
crim       333 non-null float64
zn         333 non-null float64
indus      333 non-null float64
chas       333 non-null int64
nox        333 non-null float64
rm         333 non-null float64
age        333 non-null float64
dis        333 non-null float64
rad        333 non-null int64
tax        333 non-null int64
ptratio    333 non-null float64
black      333 non-null float64
lstat      333 non-null float64
medv       333 non-null float64
dtypes: float64(11), int64(4)
memory usage: 39.1 KB


In [5]:
display(dataset.describe())

| | ID | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 |
| mean | 250.951952 | 3.360341 | 10.689189 | 11.293483 | 0.06006 | 0.557144 | 6.265619 | 68.226426 | 3.709934 | 9.633634 | 409.279279 | 18.448048 | 359.466096 | 12.515435 | 22.768769 |
| std | 147.859438 | 7.352272 | 22.674762 | 6.998123 | 0.237956 | 0.114955 | 0.703952 | 28.133344 | 1.981123 | 8.742174 | 170.841988 | 2.151821 | 86.584567 | 7.067781 | 9.173468 |
| min | 1.0 | 0.00632 | 0.0 | 0.74 | 0.0 | 0.385 | 3.561 | 6.0 | 1.1296 | 1.0 | 188.0 | 12.6 | 3.5 | 1.73 | 5.0 |
| 25% | 123.0 | 0.07896 | 0.0 | 5.13 | 0.0 | 0.453 | 5.884 | 45.4 | 2.1224 | 4.0 | 279.0 | 17.4 | 376.73 | 7.18 | 17.4 |
| 50% | 244.0 | 0.26169 | 0.0 | 9.9 | 0.0 | 0.538 | 6.202 | 76.7 | 3.0923 | 5.0 | 330.0 | 19.0 | 392.05 | 10.97 | 21.6 |
| 75% | 377.0 | 3.67822 | 12.5 | 18.1 | 0.0 | 0.631 | 6.595 | 93.8 | 5.1167 | 24.0 | 666.0 | 20.2 | 396.24 | 16.42 | 25.0 |
| max | 506.0 | 73.5341 | 100.0 | 27.74 | 1.0 | 0.871 | 8.725 | 100.0 | 10.7103 | 24.0 | 711.0 | 21.2 | 396.9 | 37.97 | 50.0 |


In [6]:
dataset.boxplot()


<matplotlib.axes._subplots.AxesSubplot at 0x1a11e4eda0>

In [7]:
Q1=dataset.quantile(0.25)
Q3=dataset.quantile(0.75)
IQR = Q3 - Q1

In [8]:
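colNames=dataset.columns

# Replace values outside the 1.5*IQR fences with the nearer quartile (Q1 or Q3).
# Note: the loop covers every column, including ID and the target medv.
for col in colNames:
    dataset[col] = dataset[col].mask(dataset[col] < Q1[col] - 1.5 * IQR[col], Q1[col])
    dataset[col] = dataset[col].mask(dataset[col] > Q3[col] + 1.5 * IQR[col], Q3[col])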
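
To make the effect of this rule concrete, here is a minimal sketch on a made-up column (the values and the names `s`, `masked`, and `clipped` are illustrative, not part of the notebook). It shows that the loop above replaces an out-of-fence value with the quartile itself, and contrasts that with clipping to the fence, which is a different choice this notebook does not make.

```python
import pandas as pd

# Hypothetical toy column: eight typical values plus one extreme value.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Same rule as the notebook loop: out-of-fence values become Q1 or Q3.
masked = s.mask(s < lower, q1).mask(s > upper, q3)

# Alternative (not used in this notebook): clip to the fences instead.
clipped = s.clip(lower=lower, upper=upper)

print(masked.iloc[-1], clipped.iloc[-1])
```

With the quartile rule the extreme value lands in the middle of the distribution; with clipping it stays the largest value in the column. Which behaviour is preferable depends on the model and is left open here.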

In [18]:

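# chas is a 0/1 dummy whose quartiles are both 0, so the IQR step above turns it
# into a constant column; it is dropped here. errors='ignore' keeps the cell safe
# to re-run (a plain drop raises KeyError: "['chas'] not found in axis" the second time).
dataset=dataset.drop(columns=['chas'], errors='ignore')

# Other column drops that were tried and left disabled:
#dataset=dataset.drop(columns=['ID'])
#dataset=dataset.drop(columns=['crim','zn','indus','rad','black'])
#dataset=dataset.drop(columns='dis')
dataset.corr()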

In [19]:
dataset.dtypes

ID           int64
crim       float64
zn         float64
indus      float64
nox        float64
rm         float64
age        float64
dis        float64
rad          int64
tax          int64
ptratio    float64
black      float64
lstat      float64
medv       float64
dtype: object
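
In the summary statistics above, chas has 25% and 75% quantiles of 0.0, so its IQR is 0 and both fences sit at 0. Under the masking rule used in this notebook, every 1 is then treated as an outlier and replaced by Q3 = 0. That may be why chas is absent from the listing above, although the notebook does not say so explicitly. The sketch below reproduces the effect on a made-up dummy column (`dummy` and `flattened` are illustrative names).

```python
import pandas as pd

# Hypothetical dummy column: 94 zeros and 6 ones, roughly the share of ones in chas.
dummy = pd.Series([0] * 94 + [1] * 6)

q1, q3 = dummy.quantile(0.25), dummy.quantile(0.75)
iqr = q3 - q1  # both quartiles are 0, so the IQR is 0

# Upper-fence rule from the notebook: anything above Q3 + 1.5*IQR becomes Q3.
flattened = dummy.mask(dummy > q3 + 1.5 * iqr, q3)
n_unique = flattened.nunique()

print(iqr, n_unique)
```

A column with a single distinct value carries no information for a regression, so excluding binary indicators from IQR-based treatment is worth considering.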

In [20]:
# Features only: drop the target column (ID is still included at this point).
subset_onlyNum = dataset.drop(columns=['medv'])

# Element-wise powers of every feature; the suffix records the exponent.
subset_onlyNum2 = (subset_onlyNum ** 2).add_suffix('^2')
subset_onlyNum3 = (subset_onlyNum ** 3).add_suffix('^3')
subset_onlyNum4 = (subset_onlyNum ** 4).add_suffix('^4')
subset_onlyNum5 = (subset_onlyNum ** 5).add_suffix('^5')

# Element-wise roots of every feature (2nd through 8th root).
subset_onlyNum_2RT = numpy.sqrt(subset_onlyNum).add_suffix('^2RT')
subset_onlyNum_3RT = numpy.cbrt(subset_onlyNum).add_suffix('^3RT')
subset_onlyNum_4RT = numpy.power(subset_onlyNum, 1/4).add_suffix('^4RT')
subset_onlyNum_5RT = numpy.power(subset_onlyNum, 1/5).add_suffix('^5RT')
subset_onlyNum_6RT = numpy.power(subset_onlyNum, 1/6).add_suffix('^6RT')
subset_onlyNum_7RT = numpy.power(subset_onlyNum, 1/7).add_suffix('^7RT')
subset_onlyNum_8RT = numpy.power(subset_onlyNum, 1/8).add_suffix('^8RT')
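
The repeated power blocks above could be generated by a small helper. The following is a sketch only: `power_features` and the toy frame are hypothetical names introduced for illustration, and the helper covers integer powers, not the root features.

```python
import pandas as pd

def power_features(df, exponents):
    """Raise df element-wise to each exponent and suffix columns with '^<exponent>'."""
    blocks = [(df ** e).add_suffix(f"^{e}") for e in exponents]
    return pd.concat(blocks, axis=1)

# Hypothetical two-column frame standing in for the real features.
toy = pd.DataFrame({"rm": [6.0, 7.0], "lstat": [4.0, 9.0]})
poly = power_features(toy, [2, 3])

print(list(poly.columns))
```

scikit-learn's `PolynomialFeatures` is a related tool, but by default it also generates cross terms between columns, so it is not a drop-in replacement for the per-column powers used here.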


In [21]:
import sklearn
from sklearn import model_selection # for splitting into train and test

In [22]:
modelDF=pandas.concat([subset_onlyNum,subset_onlyNum2,subset_onlyNum3,subset_onlyNum4],axis=1)
modelDF.head()

| | ID | crim | zn | indus | nox | rm | age | dis | rad | tax | ... | indus^4 | nox^4 | rm^4 | age^4 | dis^4 | rad^4 | tax^4 | ptratio^4 | black^4 | lstat^4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.00632 | 18.0 | 2.31 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | ... | 28.473963 | 0.083778 | 1868.886938 | 18071340.0 | 279.82933 | 1 | 7676563456 | 54798.1281 | 24815580000.0 | 615.05984 |
| 1 | 2 | 0.02731 | 0.0 | 7.07 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | ... | 2498.490228 | 0.048383 | 1699.850313 | 38753240.0 | 608.71165 | 16 | 3429742096 | 100387.5856 | 24815580000.0 | 6978.864768 |
| 2 | 4 | 0.03237 | 0.0 | 2.18 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | ... | 22.585306 | 0.044001 | 2398.257176 | 4400094.0 | 1350.58226 | 81 | 2428912656 | 122283.0961 | 24252720000.0 | 74.711821 |
| 3 | 5 | 0.06905 | 0.0 | 2.18 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | ... | 22.585306 | 0.044001 | 2609.126456 | 8629729.0 | 1350.58226 | 81 | 2428912656 | 122283.0961 | 24815580000.0 | 807.065599 |
| 4 | 7 | 0.08829 | 12.5 | 7.87 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | ... | 3836.179582 | 0.075392 | 1306.399145 | 19674190.0 | 955.994471 | 625 | 9354951841 | 53379.4816 | 24492050000.0 | 23871.764124 |


In [23]:
# Split-out validation dataset
X = modelDF.values
Y = dataset['medv'].values

validation_size = 0.30
seed = 100
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
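
As a quick check of what a 30% hold-out means for a 333-row file, the sketch below runs the same `train_test_split` call on stand-in arrays (`X_demo` and `y_demo` are made up; only the row count, `test_size`, and `random_state` mirror the notebook).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the training file (333).
X_demo = np.arange(333 * 2).reshape(333, 2)
y_demo = np.arange(333)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=100
)

# The test share is rounded up, so 333 rows split into 233 train and 100 test.
print(X_tr.shape, X_te.shape)
```

These row counts matter later, because the adjusted R² penalty depends on the number of samples relative to the number of features.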

In [24]:
from sklearn import linear_model

model_LR=linear_model.LinearRegression()
model_LR.fit(X_train,Y_train)
trainResult_LR=model_LR.predict(X_train)
testResult_LR=model_LR.predict(X_test)
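
For readers new to the scikit-learn API, here is a minimal sketch of what a fitted `LinearRegression` exposes, using noise-free synthetic data (all names and values below are illustrative). The point of interest is `coef_`, which holds one coefficient per input column; its length is what the adjusted R² helper in the next cell uses as the number of predictors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free target: y = 2*x0 + 3*x1 + 1 (values chosen for illustration).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1] + 1.0

toy_model = LinearRegression().fit(X_toy, y_toy)

# One coefficient per feature, plus a separate intercept.
print(len(toy_model.coef_), toy_model.coef_, toy_model.intercept_)
```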

In [25]:
def adj_r2_score(model,actual,predicted):
    """Adjusted R^2: R^2 penalised for the number of fitted coefficients."""
    n=len(actual)        # number of samples
    p=len(model.coef_)   # number of predictors
    return 1 - float(n-1)/(n-p-1)*(1 - r2_score(actual,predicted))

########## TRAIN DATA RESULT ##########

print('---------- TRAIN DATA RESULT ----------')
# The mean squared error
print("Mean squared error: %.0f"% mean_squared_error(Y_train, trainResult_LR))
# R^2 (coefficient of determination, printed as 'Variance score'): 1 is a perfect prediction
print('Variance score: %.4f' % r2_score(Y_train, trainResult_LR))
print('Adj R2 score: %.4f' % adj_r2_score(model_LR,Y_train, trainResult_LR))

########## TEST DATA RESULT ##########

print('---------- TEST DATA RESULT ----------')
# The mean squared error
print("Mean squared error: %.0f"% mean_squared_error(Y_test, testResult_LR))
# R^2 (coefficient of determination, printed as 'Variance score'): 1 is a perfect prediction
print('Variance score: %.4f' % r2_score(Y_test, testResult_LR))
print('Adj R2 score: %.4f' % adj_r2_score(model_LR,Y_test, testResult_LR))

---------- TRAIN DATA RESULT ----------
Mean squared error: 6
Variance score: 0.8371
Adj R2 score: 0.7901
---------- TEST DATA RESULT ----------
Mean squared error: 14
Variance score: 0.6395
Adj R2 score: 0.2406
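
The gap between R² and adjusted R² on the test split follows from the formula 1 - (n - 1)/(n - p - 1) * (1 - R²). The sketch below plugs in the printed R² values, assuming 233 training rows and 100 test rows (a 70/30 split of 333) and p = 52 features (13 columns times 4 powers). Because the inputs are the rounded printed scores, the results only approximate the printed adjusted values.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 for a model with n_features predictors fitted on n_samples rows."""
    return 1 - (n_samples - 1) / (n_samples - n_features - 1) * (1 - r2)

# Assumed sizes: 233 train / 100 test rows, 52 features (13 columns x 4 powers).
train_adj = adjusted_r2(0.8371, 233, 52)
test_adj = adjusted_r2(0.6395, 100, 52)

print(train_adj, test_adj)
```

With only 100 test rows and 52 predictors, the factor (n - 1)/(n - p - 1) is about 2.1, which is why a test R² of 0.6395 shrinks to roughly 0.24 after adjustment.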
