**Data Set: Boston House Prices**

The problem we solve here is this: given a set of features that describe a house in Boston, our machine learning model must predict the house price. To train the model on the Boston housing data, we use scikit-learn’s Boston dataset.

In this dataset, each row describes a Boston town or suburb. There are 506 rows and 13 attributes (features), plus a target column (price). Source: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names

# Importing libraries & loading the Dataset

In [4]:
# Import Essential Libraries
import pandas as pd
import numpy as np

# to visualize data using 2D plots.
import matplotlib.pyplot as plt
# to make 2D plots look pretty and readable.
import seaborn as sns
import random
import os

# Setting Seaborn Style
sns.set(style = 'whitegrid')

# For Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# For Performance metrics
from sklearn.metrics import mean_squared_error, r2_score

# ignore all warnings
import warnings
warnings.filterwarnings('ignore')

In [5]:
# Load the Boston housing dataset bundled with scikit-learn.
# Note: load_boston is only available in scikit-learn versions before 1.2.
from sklearn.datasets import load_boston
dataset = load_boston()

ImportError: 
`load_boston` has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

    import pandas as pd
    import numpy as np

    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()

for the California housing dataset and::

    from sklearn.datasets import fetch_openml
    housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle.
"Racist data destruction?"
<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>

[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>
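
Following the suggestion in the error message above, the data can still be fetched from the original source. The sketch below is one possible workaround, not part of the original notebook. The helper names `reshape_boston` and `load_boston_from_source` are hypothetical, the feature names are the 13 column names used later in this notebook, and the returned object only mimics the `data`, `target` and `feature_names` fields (it has no `DESCR`).

```python
import numpy as np
import pandas as pd
from sklearn.utils import Bunch

# The 13 feature names, in the order used later in this notebook
FEATURE_NAMES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def reshape_boston(raw_values):
    """Rebuild records from the raw file layout.

    In the raw file each record spans two rows: the first row holds
    11 values, the second row holds 2 more features and the target.
    """
    data = np.hstack([raw_values[::2, :], raw_values[1::2, :2]])
    target = raw_values[1::2, 2]
    return data, target


def load_boston_from_source(url="http://lib.stat.cmu.edu/datasets/boston"):
    """Download the raw file and wrap it in a Bunch (hypothetical helper)."""
    raw_df = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)
    data, target = reshape_boston(raw_df.values)
    return Bunch(data=data, target=target,
                 feature_names=np.array(FEATURE_NAMES))


# Requires network access, so it is left commented out here:
# dataset = load_boston_from_source()
```

With network access, `dataset = load_boston_from_source()` should let the cells that read `dataset.data`, `dataset.target` and `dataset.feature_names` run. The `dataset.DESCR` cell would still fail, because the description text is not reproduced here.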


In [7]:
# The dataset has 6 keys through which we can access more information about it.
print("[INFO] keys : {}".format(dataset.keys()))

NameError: name 'dataset' is not defined

In [None]:
dataset.feature_names

In [None]:
print("[INFO] dataset summary", dataset.DESCR)

# Exploratory Data Analysis

In [None]:
# We can easily convert the dataset into a pandas dataframe to perform exploratory data analysis. 
df=pd.DataFrame(dataset.data)
df

In [None]:
df.columns = dataset.feature_names
df["prices"]=dataset.target

In [None]:
df

Exploratory Data Analysis is a very important step before training the model. Here, we will use visualizations to understand the relationship of the target variable with the other features.

Let’s first plot the distribution of the target variable. We will use the histogram plot function from the matplotlib library.

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})
plt.hist(df['prices'],color ="brown", bins=30)
plt.xlabel("House prices in $1000")
plt.show()

We can see from the plot that the values of prices are approximately normally distributed, with a few outliers. Most of the houses fall in the 20–24 range (in $1000s).

In [None]:
#descriptive statistics
#statistical summary of the dataset using the describe() function. Using this function,
#we can understand the count, min, max, mean and standard deviation for each attribute (column) in the dataset. 
df.describe().T

## Understanding the Data and statistical analysis

In [None]:
df.info()

In [None]:
# Counting the number of unique values in each column
df.nunique()

## Check for missing values

In [None]:
df.isnull().sum()

In [None]:
print(df.isna().sum())

# Data Visualization

## Checking the distribution of the data

In [None]:
def draw_plots(df, var, rows, cols):
    fig=plt.figure(figsize=(20,20))
    for i, f in enumerate(var):
        ax=fig.add_subplot(rows,cols,i+1)
        df[f].hist(bins=20,ax=ax, facecolor='midnightblue')
        ax.set_title(f + ' Distribution', color='DarkRed')
 
    fig.tight_layout() 

In [None]:
draw_plots(df,df.columns,5,3)
plt.show()

## Exploring the data to understand relationships before processing

**The correlation coefficient ranges from -1 to 1. If the value is close to 1, there is a strong positive correlation between the two variables. When it is close to -1, the variables have a strong negative correlation.**
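
To make the range of the coefficient concrete, here is a small sketch that computes the Pearson correlation by hand on made-up numbers. The function name `pearson_r` and the toy arrays are illustrative assumptions, not values from the Boston data.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Toy data: one variable rising with `rooms`, one falling with it
rooms = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
price_up = 5.0 * rooms - 3.0
price_down = -2.0 * rooms + 40.0

print(pearson_r(rooms, price_up))    # perfectly increasing line, so close to 1
print(pearson_r(rooms, price_down))  # perfectly decreasing line, so close to -1
```

`df.corr()` in the next cell computes the same quantity for every pair of columns at once.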

In [None]:
# Finding out the correlation between the features
corr = df.corr()
corr

In [3]:
# Plotting the heatmap of correlation between features

plt.figure(figsize=(12,12))
sns.heatmap(data=df.corr().round(2),annot=True,cmap='coolwarm',linewidths=0.2,square=True)

NameError: name 'df' is not defined

<Figure size 1200x1200 with 0 Axes>

The large, colorful picture above is called a heatmap. It helps us understand how the features are correlated with each other.

A positive sign implies a positive correlation between two features, whereas a negative sign implies a negative correlation.  
Here I am interested in which features correlate well with our dependent variable, prices, and can therefore help produce good predictions.  
I observed that INDUS, RM, TAX, PTRATIO and LSTAT show a good correlation with prices, and I want to know more about them.  
However, INDUS also shows a good correlation with TAX and LSTAT, which is a problem because it leads to multicollinearity. So I decided NOT to consider this feature and to continue the analysis with the other 6 remaining features.  

From the correlation matrix we can see that RM has a strong positive correlation with prices (0.7), whereas LSTAT has a strong negative correlation with prices (-0.7).
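
The multicollinearity concern above can also be checked programmatically. The sketch below lists feature pairs whose absolute correlation exceeds a threshold. The helper name `correlated_pairs`, the 0.7 cut-off, and the toy frame are assumptions for illustration. On the real data one could call it on `df.drop(columns='prices')`.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.7):
    """Return (col_a, col_b, |r|) for every pair with |r| >= threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            r = float(corr.iloc[i, j])
            if r >= threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return pairs


# Toy frame: B is almost a copy of A, C is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "A": a,
    "B": 2.0 * a + rng.normal(scale=0.1, size=200),
    "C": rng.normal(size=200),
})

print(correlated_pairs(toy))
```

Only the A/B pair is reported here, which is the kind of redundancy that motivates dropping one of the two features.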

In [None]:
# Visualizing the correlation of each feature with the target column `prices`

corr_with_prices = df.corrwith(df['prices'])

plt.figure(figsize = (16, 4))
sns.heatmap([np.abs(corr_with_prices)], cmap = 'RdBu_r', annot = True, fmt = '.2%')

In [None]:
# Let's confirm this by using ExtraTreesRegressor
# TODO : To know the feature Importances
y = df['prices'].values
from sklearn.ensemble import ExtraTreesRegressor
etc = ExtraTreesRegressor()
etc.fit(df.iloc[:, :-1].values, y)

print("Percentage importance of each feature with respect to house price : ")
important_features = pd.Series(etc.feature_importances_*100, index = df.columns[:-1])
important_features

In [None]:
# Feature importances from ExtraTreesRegressor
important_features.sort_values(ascending = False)

In [None]:
# Feature importances from the correlation matrix
corr_with_prices[:-1].abs().sort_values(ascending = False)

In [None]:
# Both rankings tell roughly the same story
plt.figure(figsize=(16, 10))
plt.plot(etc.feature_importances_, df.columns[:-1], 'go-', linewidth=5, markersize=12)

**From the feature observations above, we found that some columns, such as LSTAT and RM, are the most important.**

In [None]:
plt.figure(figsize=(20, 5))

features = ['LSTAT', 'RM']
target = df['prices']

for i, col in enumerate(features):
    plt.subplot(1, len(features) , i+1)
    x = df[col]
    y = target
    plt.scatter(x, y,color='green', marker='o')
    plt.title("Variation in House prices")
    plt.xlabel(col)
    plt.ylabel('House prices in $1000')

Prices increase roughly linearly as the value of RM increases. There are a few outliers, and the data seem to be capped at 50.

Prices tend to decrease as LSTAT increases, although the relationship does not follow an exactly straight line.

# Univariate and Multivariate Analysis

In [None]:
desc = df.describe().round(2)
desc

In [None]:
df.plot(kind='box',figsize=(20,10),color='Green',vert=False)
plt.show()

## prices

In [None]:
# Box plot and distribution plot for the dependent variable prices
plt.figure(figsize=(20,3))

plt.subplot(1,2,1)
sns.boxplot(x=df.prices,color='#005030')
plt.title('Box Plot of prices')

plt.subplot(1,2,2)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(x=df.prices,kde=True,color='#500050')
plt.title('Distribution Plot of prices')
plt.show()

**Outlier removal is left for future work.**
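
As a pointer for that future work, one common option is the interquartile-range (IQR) rule. The notebook does not say which method it intends to use, so the sketch below is only an assumption: the helper name `iqr_mask`, the factor `k=1.5`, and the toy prices are all illustrative.

```python
import pandas as pd


def iqr_mask(series, k=1.5):
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR], False for outliers."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)


# Toy prices in $1000s with one capped-looking value at 50
toy_prices = pd.Series([18.0, 20.0, 21.0, 22.0, 23.0, 24.0, 50.0])
kept = toy_prices[iqr_mask(toy_prices)]
print(kept.tolist())  # the value 50.0 is dropped
```

On the real frame this could be applied as `df[iqr_mask(df['prices'])]`, keeping in mind that removing rows changes every score reported later.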

# Building Machine Learning Model

In [None]:
# Arranging features based on features importance
features_arranged_on_importance = important_features.sort_values(ascending = False).index
features_arranged_on_importance

In [None]:
y = df.loc[:, 'prices'].values

In [None]:
# Existing dataframe
df.head()

In [None]:
# Arranging columns based on features importance
new_df = df[features_arranged_on_importance]
new_df.head()

In [None]:
# Getting boston values
X = new_df.values
X = X[:, :13]

# TODO : Splitting data as train and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

In [None]:
X_df=pd.DataFrame(X)
X_df

# Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

print('Training Score : ', linear_model.score(X_train, y_train))
print('Testing Score  : ', linear_model.score(X_test, y_test))

print('R2 Score : ', r2_score(y_test, linear_model.predict(X_test)))
print('MSE : ', mean_squared_error(y_test, linear_model.predict(X_test)))

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

linear_model = make_pipeline(MinMaxScaler(), LinearRegression())
linear_model.fit(X_train, y_train)

print('Training Score : ', linear_model.score(X_train, y_train))
print('Testing Score  : ', linear_model.score(X_test, y_test))

print('R2 Score : ', r2_score(y_test, linear_model.predict(X_test)))
print('MSE : ', mean_squared_error(y_test, linear_model.predict(X_test)))

# Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor
scores = []
for i in range(100):

    dtr_model = DecisionTreeRegressor(max_depth=None, random_state=i)
    dtr_model.fit(X_train, y_train)
    scores.append(r2_score(y_test, dtr_model.predict(X_test)))

plt.figure(figsize = (16, 8))
plt.plot(list(range(100)), scores, 'ro-')
plt.xlabel('Random Decision Tree Regressor')
plt.ylabel('Scores')
plt.show()

**See how the decision tree score changes for different random states**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

dtr_model = DecisionTreeRegressor(max_depth=23, random_state=3)
dtr_model.fit(X_train[:, :], y_train)
    

print('Training Score : ', dtr_model.score(X_train, y_train))
print('Testing Score  : ', dtr_model.score(X_test, y_test))

print('R2 Score : ', r2_score(y_test, dtr_model.predict(X_test)))
print('MSE : ', mean_squared_error(y_test, dtr_model.predict(X_test)))

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import make_pipeline
adtr_model = make_pipeline(MinMaxScaler(), DecisionTreeRegressor(max_depth = 12, random_state = 92))
adtr_model.fit(X_train, y_train)

print('Training Score : ', adtr_model.score(X_train, y_train))
print('Testing Score  : ', adtr_model.score(X_test, y_test))

print('R2 Score : ', r2_score(y_test, adtr_model.predict(X_test)))
print('MSE : ', mean_squared_error(y_test, adtr_model.predict(X_test)))

# Random Forest Regression

In [None]:
from sklearn.ensemble import RandomForestRegressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
rfr = RandomForestRegressor(max_depth = 7, random_state = 63)
rfr.fit(X_train, y_train)


print('Training Score : ', rfr.score(X_train, y_train))
print('Testing Score  : ', rfr.score(X_test, y_test))

print('R2 Score : ', r2_score(y_test, rfr.predict(X_test)))
print('MSE : ', mean_squared_error(y_test, rfr.predict(X_test)))

# Different Models Accuracy

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X[:, :], y, test_size = 0.20, random_state = 42)

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

print('Linear Regression : ')
model1 = LinearRegression()
model1.fit(X_train, y_train)
print('Score : ', model1.score(X_test, y_test))

print('Decision Tree Regression : ')
model2 = DecisionTreeRegressor(max_depth=23, random_state=3)
model2.fit(X_train, y_train)
print('Score : ', model2.score(X_test, y_test))

print('Random Forest Regression : ')
model3 = RandomForestRegressor(max_depth = 7, random_state = 63)
model3.fit(X_train, y_train)
print('Score : ', model3.score(X_test, y_test))

print('k Neighbors Regression : ')
model4 = KNeighborsRegressor(n_neighbors = 10)
model4.fit(X_train, y_train)
print('Score : ', model4.score(X_test, y_test))

# Building optimal Random Forest Regression Model

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X[:, :], y, test_size = 0.20, random_state = 46)

print('Random Forest Regression : ')
random_forest_regressor = RandomForestRegressor(max_depth = 7, random_state = 63)
random_forest_regressor.fit(X_train, y_train)
# Note: score(X, y) evaluates on the full dataset (training and test rows together)
print('Score : ', random_forest_regressor.score(X, y))

In [None]:
# Scores for different training samples
scores = []
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = i)
    random_forest_regressor = RandomForestRegressor(max_depth = 7, random_state = 63)
    random_forest_regressor.fit(X_train, y_train)
    scores.append(random_forest_regressor.score(X, y))
    
plt.figure(figsize = (16, 8))
plt.plot(list(range(100)), scores, 'go-')
plt.xlabel('Different Training Samples')
plt.ylabel('Scores')
plt.show()

In [None]:
# Scores for different random forest model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 3)

scores = []
for i in range(100):
    random_forest_regressor = RandomForestRegressor(max_depth = 13, random_state = i)
    random_forest_regressor.fit(X_train, y_train)
    scores.append(random_forest_regressor.score(X, y))
    
plt.figure(figsize = (16, 8))
plt.plot(list(range(100)), scores, 'ro-')
plt.xlabel('Different Random Forest Models')
plt.ylabel('Scores')
plt.show()

In [None]:
# Scores for different max_depth values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 3)

scores = []
for i in range(1, 30):
    random_forest_regressor = RandomForestRegressor(max_depth = i, random_state = 68)
    random_forest_regressor.fit(X_train, y_train)
    scores.append(random_forest_regressor.score(X, y))
    
plt.figure(figsize = (16, 8))
plt.plot(list(range(1, 30)), scores, 'bo-')
plt.xlabel('Different Max_depths')
plt.ylabel('Scores')
plt.show()

In [None]:
plt.figure(figsize = (16, 8))
plt.plot(list(range(1, 30)), scores, 'bo-')
plt.ylim(0.95, 0.97)
plt.show()

From these plots, we choose:

random_state = 3, for the train/test split  
random_state = 68, for the random forest regressor  
max_depth = 13, for the maximum depth of the random forest regressor  
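
The values above were picked by comparing scores across seeds and depths. A common alternative, sketched here on synthetic data rather than the Boston data, is to tune `max_depth` with cross-validation on the training split only, so that the test rows stay unseen until the end. The parameter grid, `n_estimators=50`, and the toy data from `make_regression` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in with 13 features, like the Boston feature matrix
X_toy, y_toy = make_regression(n_samples=200, n_features=13,
                               noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.20, random_state=0)

# 5-fold cross-validation over a small max_depth grid, fitted on training rows only
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, 7, 13]},
    cv=5,
    scoring="r2",
)
search.fit(X_tr, y_tr)

print(search.best_params_)        # depth chosen by cross-validation
print(search.score(X_te, y_te))   # R^2 on the held-out test rows
```

Because the held-out rows are never used for selection, the final test score is a fairer estimate than a score computed on the full dataset.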

# Building Optimal Model

In [None]:
# Choosing Optimal Training Samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 3)

# Building Optimal Random Forest regressor Model
random_forest_regressor = RandomForestRegressor(max_depth = 13, random_state = 68)
random_forest_regressor.fit(X_train, y_train)

In [None]:
random_forest_regressor.score(X, y)

In [None]:
print('Training Accuracy : ', random_forest_regressor.score(X_train, y_train))
print('Testing Accuracy  : ', random_forest_regressor.score(X_test, y_test))

In [None]:
print('Mean Squared Error : ', mean_squared_error(y_test, random_forest_regressor.predict(X_test)))
print('Root Mean Squared Error : ', mean_squared_error(y_test, random_forest_regressor.predict(X_test))**0.5)
print('Score : ', r2_score(y, random_forest_regressor.predict(X)))

This completes the project. We have built a Random Forest Regressor model that performs well with the top 6 features, with a training score (R²) of 97.89% and a testing score (R²) of 96.73%.
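
The scores quoted above come from `score`, which for scikit-learn regressors returns the coefficient of determination R², not a classification accuracy. As a small worked example on made-up numbers (the helper name `r2_and_rmse` and the toy values are assumptions), R² and RMSE can be computed by hand like this:

```python
import numpy as np


def r2_and_rmse(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, RMSE = sqrt(mean squared residual)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))
    return float(r2), float(rmse)


toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.5, 5.5, 7.0, 9.5]
r2, rmse = r2_and_rmse(toy_true, toy_pred)
print(r2, rmse)  # R^2 is 0.9625 here; RMSE is about 0.433
```

These should agree with `r2_score` and the square root of `mean_squared_error` used in the cells above.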

In [None]:
# Plot y_test against y_pred
y_pred = random_forest_regressor.predict(X_test)
plt.rcParams['figure.figsize'] = (15, 8)
sample_index = np.arange(1, len(y_test) + 1)  # 102 test rows for a 20% split of 506
plt.plot(sample_index, y_test, 'b--', label='y_test')
plt.plot(sample_index, y_pred, 'g-', label='y_pred')
plt.title('Line plots of the values of y_test and y_pred', fontsize = 30)
plt.xlabel('Count')
plt.ylabel('Range of these Values')
plt.legend()
plt.show()

In [None]:
import pickle
# Save the trained model to a file, then load it back
with open("final_model.pkl", "wb") as f:
    pickle.dump(random_forest_regressor, f)
with open("final_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.predict(X_test)
loaded_model.score(X_test,y_test)

In [None]:
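# input() returns strings, so each value is converted to float
CRIM = float(input())
ZN = float(input())
INDUS = float(input())
CHAS = float(input())
NOX = float(input())
RM = float(input())
AGE = float(input())
DIS = float(input())
RAD = float(input())
TAX = float(input())
PTRATIO = float(input())
B = float(input())
LSTAT = float(input())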

In [None]:
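# Arrange the inputs in the column order the model was trained on (new_df) and cast to float
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT']
values = [CRIM,ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX,PTRATIO, B, LSTAT]
data = pd.Series(values, index=names).astype(float)[features_arranged_on_importance].values.reshape(1, -1)
mean = np.load('mean.npy')
std = np.load('std.npy')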

In [None]:
my_prediction = loaded_model.predict(data)[0]
# The target is expressed in units of $1000
print(f"Price of the house is {my_prediction} thousand dollars")

In [None]:
# importing the necessary libraries
import pandas as pd
import numpy as np
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeRegressor

# importing the dataset
data = pd.read_csv('boston_housing_prices.csv')
data.columns = data.columns.str.strip()  # removing the extra spaces

# using a stratified shuffle so that the column 'CHAS' is distributed equally between train and test;
# n_splits=1 because only a single split is used
split = StratifiedShuffleSplit(n_splits=1)
for train_index,test_index in split.split(data,data['CHAS']):
    train_set = data.loc[train_index]
    test_set = data.loc[test_index]  # saving the data into train and test

train = train_set.copy()
test = test_set.copy()

train_target = train['prices']
train.drop('prices',axis = 1,inplace = True)
test_target = test['prices']
test.drop('prices',axis = 1, inplace = True)

# scaling the data
scaler = StandardScaler()
scaler.fit(train) # using fit so that we can save the mean and variance of the scaled data later
train_transformed = scaler.transform(train)

# saving the required mean and variance so that we can use it after the deployment to scale the input data
std = np.sqrt(scaler.var_)
np.save('std.npy',std)
np.save('mean.npy',scaler.mean_)

regressor = DecisionTreeRegressor()
regressor.fit(train_transformed,train_target)
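
The script above saves `mean.npy` and `std.npy` so that input data can be scaled after deployment. For that to help, the same standardization has to be applied to every new row before calling `predict` on a model trained with scaled features. A minimal sketch with made-up statistics for three hypothetical features (the helper name `standardize` and all numbers are assumptions):

```python
import numpy as np


def standardize(row, mean, std):
    """Apply z = (x - mean) / std with statistics saved at training time."""
    row = np.asarray(row, dtype=float)
    return (row - mean) / std


# Toy statistics, standing in for np.load('mean.npy') and np.load('std.npy')
mean = np.array([3.0, 10.0, 0.5])
std = np.array([2.0, 5.0, 0.25])

new_row = np.array([[5.0, 0.0, 0.5]])
scaled = standardize(new_row, mean, std)
print(scaled)  # values: 1.0, -2.0, 0.0
```

In this notebook that would apply to the `regressor` trained on `train_transformed`. The random forest saved earlier was trained on unscaled features, so it expects raw inputs.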


In [None]:
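# Same input row, arranged in the column order used for training (new_df)
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT']
data = pd.Series([CRIM,ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX,PTRATIO, B, LSTAT], index=names).astype(float)[features_arranged_on_importance].values.reshape(1, -1)
my_prediction = random_forest_regressor.predict(data)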

In [None]:
my_prediction

In [None]:
int(my_prediction[0])
