# **Background**


The dataset describes houses in Boston using features such as crime rate, average number of rooms per dwelling, building age, and the median value of owner-occupied homes. The data is used to predict the price of a new house with linear regression. The columns are:

1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town.
4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT - % lower status of the population
14. MEDV - Median value of owner-occupied homes in $1000's

# **1. Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')

# **2. Load Data**

In [None]:
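# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
from sklearn.datasets import load_boston
df = load_boston()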

In [None]:
print(df.keys())

In [None]:
df.data.shape

In [None]:
df.feature_names

In [None]:
print(df.DESCR)

**Pandas Dataframe Conversion**

In [None]:
boston = pd.DataFrame(df.data)

In [None]:
boston.head()

In [None]:
boston.columns = df.feature_names

In [None]:
boston['PRICE'] = df.target

Check the head, info() and shape of the data set.

In [None]:
boston.head()

In [None]:
boston.shape

In [None]:
boston.info()

**Summary Statistics**

Generate descriptive statistics that summarize the central tendency, dispersion and shape of the dataset’s distribution, excluding NaN values.

In [None]:
boston.describe().transpose()

# **3. Exploratory Data Analysis**

Check the number of null values in each column.

In [None]:
boston.isnull().sum()

There appear to be no null values, so we can proceed with exploring the data.

Check the correlations in the data to see which features are most strongly correlated with the target variable.

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(boston.corr(),cmap='viridis',annot=True,fmt='.2g')
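
A heat map with this many cells can be hard to scan, so it may help to rank the features by the strength of their correlation with the target. The sketch below does this on synthetic data; the column names `ROOMS` and `NOISE` and the generated values are assumptions made for illustration, not the Boston data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features (hypothetical names) and a target built from one of them
rooms = rng.normal(6.0, 0.7, n)
noise_feature = rng.normal(0.0, 1.0, n)
price = 5.0 * rooms + rng.normal(0.0, 1.0, n)
toy = pd.DataFrame({'ROOMS': rooms, 'NOISE': noise_feature, 'PRICE': price})

# Correlation of every feature with the target, strongest first
target_corr = toy.corr()['PRICE'].drop('PRICE')
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

For the notebook's data the equivalent starting point would be `boston.corr()['PRICE']`. Note that correlation only measures linear association; it does not show that a feature causes a change in price.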

In [None]:
sns.pairplot(data=boston)

**Use seaborn to plot the distribution of the target feature.**

In [None]:
# sns.distplot has been deprecated since seaborn 0.11; histplot is its replacement
sns.histplot(boston['PRICE'], kde=True)

There seem to be some outliers in the PRICE column. Investigate this further with a box plot, which depicts numerical data graphically through its quartiles. Lines extending from the box (whiskers) indicate variability outside the upper and lower quartiles, hence the names box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.

In [None]:
# Pass the column by keyword so the orientation is explicit
sns.boxplot(x=boston['PRICE'])

**Detecting and Filtering Outliers**

Z-Score

The Z-score is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured.
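
As a small illustration of this definition, the sketch below computes z-scores by hand for a made-up sample (the values are hypothetical, not taken from the Boston data). It uses the population standard deviation (`ddof=0`), which is also the default in `scipy.stats.zscore`.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()        # 5.0
std = values.std(ddof=0)    # population standard deviation: 2.0

# Signed number of standard deviations each value lies from the mean
z_scores = (values - mean) / std
print(z_scores)
```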


In [None]:
from scipy import stats

# Absolute z-score of every value, computed column by column
z = np.abs(stats.zscore(boston))
print(z)

In [None]:
threshold = 3
# Row and column indices of every value whose |z| exceeds the threshold
print(np.where(z > threshold))

In [None]:
# Keep only the rows in which every column is within the threshold
boston_df = boston[(z < threshold).all(axis=1)]

In [None]:
boston_df.shape
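
One side effect of applying a z-score threshold to every column is worth checking: a rare binary flag can be filtered out entirely. The sketch below uses a synthetic 0/1 column with about 7% ones (an assumption chosen to resemble a sparse dummy variable such as CHAS; it is not the actual data) and shows that every row with the flag set exceeds a threshold of 3.

```python
import numpy as np
import pandas as pd

# Synthetic data: 35 ones out of 506 rows, plus an evenly spread numeric column
flag = np.array([1] * 35 + [0] * 471)
toy = pd.DataFrame({'FLAG': flag, 'VALUE': np.linspace(0.0, 1.0, 506)})

# Column-wise absolute z-scores (population standard deviation)
z_toy = np.abs((toy - toy.mean()) / toy.std(ddof=0))

threshold = 3
filtered = toy[(z_toy < threshold).all(axis=1)]

# z-score shared by every row where the flag is set
flag_z = z_toy.loc[toy['FLAG'] == 1, 'FLAG'].iloc[0]
print(flag_z)
print(filtered['FLAG'].unique())  # only zeros remain
```

If this happens with the real data, the flag column becomes constant in the filtered frame and a linear model cannot estimate its effect. Excluding dummy columns from the z-score computation is one possible remedy.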

# **4. Training and Testing Data**

Now that we have explored the data, we can go on and split the data into training and testing sets.

In [None]:
X = boston_df.drop('PRICE',axis=1)
y = boston_df['PRICE']

**Use model_selection.train_test_split from sklearn to split the data into training and testing sets. Set test_size=0.3 and random_state=101**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
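
To make the 70/30 split concrete, the sketch below reproduces the idea with NumPy on ten toy rows: shuffle the row indices with a fixed seed, then reserve 30% of them for testing. It mirrors the idea behind `train_test_split`, not scikit-learn's actual implementation, so the specific rows selected will differ. The names `X_toy`, `y_toy`, `X_tr` and `X_te` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(101)  # fixed seed for reproducibility
n_samples = 10
X_toy = np.arange(n_samples * 2).reshape(n_samples, 2)
y_toy = np.arange(n_samples)

# Shuffle the row indices, then reserve 30% of them for testing
indices = rng.permutation(n_samples)
n_test = round(0.3 * n_samples)
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_tr, X_te = X_toy[train_idx], X_toy[test_idx]
y_tr, y_te = y_toy[train_idx], y_toy[test_idx]
print(X_tr.shape, X_te.shape)
```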

# **5. Training the Model**

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()

In [None]:
lm.fit(X_train,y_train)

**Print out the coefficients and intercept of the model**

In [None]:
print('Coefficients:',lm.coef_)
print('\n')
print('Intercept:',lm.intercept_)
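
To see what the coefficients and intercept represent, the sketch below fits ordinary least squares with NumPy on noise-free synthetic data generated from known values. The data and the names `X_toy`, `y_toy` and `design` are assumptions for illustration. A fitted linear model predicts with the features multiplied by the coefficients, plus the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))

# Noise-free synthetic target: y = 2*x1 - 3*x2 + 5
y_toy = 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1] + 5.0

# Append a column of ones so the intercept is estimated with the coefficients
design = np.column_stack([X_toy, np.ones(len(X_toy))])
solution, *_ = np.linalg.lstsq(design, y_toy, rcond=None)
coef, intercept = solution[:2], solution[2]

# A prediction is the dot product of features and coefficients, plus the intercept
manual_prediction = X_toy @ coef + intercept
print(coef, intercept)
```

With real, noisy data the recovered values are estimates rather than exact, but `lm.coef_` and `lm.intercept_` play the same roles as `coef` and `intercept` here.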

# **6. Predicting Test Data**

Now that we have fit our model, let's evaluate its performance by predicting on the test set.


In [None]:
predictions = lm.predict(X_test)

Create a scatterplot of the real test values versus the predicted values.

In [None]:
plt.scatter(y_test,predictions)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')

# **7. Evaluating the Model**

Let's evaluate the model's performance on the test set by calculating the Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).

In [None]:
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
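
These metrics can also be computed by hand, which makes their definitions explicit. The sketch below uses four made-up target values and predictions (hypothetical numbers, not model output) and adds the coefficient of determination (R^2) for comparison.

```python
import numpy as np

# Hypothetical true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # same units as the target

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

RMSE is often the easiest of the three error metrics to interpret because it is expressed in the same units as the target.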

# **8. Residuals**

Plot a histogram of the residuals and check that it looks approximately normally distributed.

In [None]:
# sns.distplot has been deprecated since seaborn 0.11; histplot is its replacement
sns.histplot(y_test - predictions, bins=50, kde=True)

# **9. Conclusion**

We still want to interpret which features are the strongest predictors of house price, that is, to ascertain what impact a single unit change in a feature has on the price. Let's see if we can interpret the coefficients to get an idea.

In [None]:
coefficients = pd.DataFrame(lm.coef_, X.columns)
coefficients.columns = ['Coefficient']
coefficients

In [None]:
# Show each coefficient formatted to five decimal places
coefficients['Coefficient'].map('{:.5f}'.format)

Interpreting the coefficients (PRICE is measured in $1000s, so a coefficient of 1 corresponds to $1,000):

- Holding all other features fixed, a 1 unit increase in **per capita crime rate by town** is associated with a **decrease of 0.16078 in house price (about $161).**

- Holding all other features fixed, a 1 unit increase in **proportion of residential land zoned for lots over 25,000 sq.ft.** is associated with an **increase of 0.00490 in house price (about $5).**

- Holding all other features fixed, a 1 unit increase in **proportion of non-retail business acres per town** is associated with an **increase of 0.05385 in house price (about $54).**

- The coefficient for the **Charles River dummy variable (1 if tract bounds river; 0 otherwise)** is **effectively zero (-0.00000)**, so the model attributes no change in house price to it. A likely cause is that the z-score filter removed the tracts with CHAS = 1, leaving the column constant in the training data.

- Holding all other features fixed, a 1 unit increase in **nitric oxides concentration (parts per 10 million)** is associated with a **decrease of 9.19738 in house price (about $9,197).**

- Holding all other features fixed, a 1 unit increase in **average number of rooms per dwelling** is associated with an **increase of 5.50573 in house price (about $5,506).**

- Holding all other features fixed, a 1 unit increase in **proportion of owner-occupied units built prior to 1940** is associated with a **decrease of 0.03581 in house price (about $36).**

- Holding all other features fixed, a 1 unit increase in **weighted distances to five Boston employment centres** is associated with a **decrease of 1.11873 in house price (about $1,119).**

- Holding all other features fixed, a 1 unit increase in **index of accessibility to radial highways** is associated with an **increase of 0.24707 in house price (about $247).**

- Holding all other features fixed, a 1 unit increase in **full-value property-tax rate per $10,000** is associated with a **decrease of 0.01295 in house price (about $13).**

- Holding all other features fixed, a 1 unit increase in **pupil-teacher ratio by town** is associated with a **decrease of 0.86378 in house price (about $864).**

- Holding all other features fixed, a 1 unit increase in **1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town** is associated with an **increase of 0.00740 in house price (about $7).**

- Holding all other features fixed, a 1 unit increase in **% lower status of the population** is associated with a **decrease of 0.34586 in house price (about $346).**
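
Because the target (MEDV) is recorded in $1000s, each coefficient is a change in thousands of dollars per one unit of its feature. The sketch below converts three of the coefficients listed above into dollar amounts and states the direction of the change. The dictionary and variable names are illustrative, and the values are copied from the list rather than read from the fitted model.

```python
# Coefficients copied from the list above (units: $1000s per unit of the feature)
coefficients_in_thousands = {'CRIM': -0.16078, 'RM': 5.50573, 'PTRATIO': -0.86378}

# Convert to dollars: multiply by 1000
in_dollars = {name: coef * 1000 for name, coef in coefficients_in_thousands.items()}

for name, dollars in in_dollars.items():
    direction = 'increase' if dollars > 0 else 'decrease'
    print(f'{name}: {direction} of ${abs(dollars):,.2f} per 1 unit increase')
```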