## Dataset Information


****The objective of this problem is to predict the monetary value of a house located in the Boston suburbs.****

## Import modules

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')

## Loading the dataset

In [2]:
df = pd.read_csv("../input/boston-dataset/Boston.csv")
df.drop(columns=['Unnamed: 0'], inplace=True)  # drop the unnamed index column
df.head()

In [3]:
# statistical info
df.describe()

In [4]:
# datatype info
df.info()

****All the columns have numerical data types.****
****We will create new categorical columns using the existing columns later.****

## Preprocessing the dataset

In [5]:
# check for null values
df.isnull().sum()

In [6]:
# no null values, so the dataset is clean and we can move on to exploratory data analysis

## Exploratory Data Analysis

In [7]:
# create box plots
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()

for col, value in df.items():
    sns.boxplot(y=col, data=df, ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

****In the graph, the dots represent the outliers.****

****Columns that contain many outliers do not follow a normal distribution.****

****We can reduce the effect of outliers with a log transformation.****

****We can also drop the columns that contain outliers, or delete the rows that contain them.****
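
As a rough illustration of the log-transformation idea, the sketch below applies `np.log1p` to a small made-up array with one extreme value. The array and variable names are hypothetical and are not taken from the Boston data. `log1p` is an assumption here, chosen because it also handles zeros.

```python
import numpy as np

# Hypothetical values with one extreme point (not from the Boston data).
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# log1p computes log(1 + x), so zeros stay finite.
logged = np.log1p(values)

# Compare how far the largest value sits from the median, before and after.
ratio_before = values.max() / np.median(values)
ratio_after = logged.max() / np.median(logged)

print(f"max/median before: {ratio_before:.2f}")
print(f"max/median after:  {ratio_after:.2f}")
```

The transformation compresses large values far more than small ones, which pulls extreme points toward the rest of the data. It does not guarantee that a box plot will show no outliers afterwards.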

In [8]:
# create dist plot
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()

for col, value in df.items():
    sns.histplot(value, kde=True, stat='density', ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

****We can observe right-skewed and left-skewed distributions for 'crim', 'zn', 'tax', and 'black'.****

****Therefore, we will normalize these columns.****



## Min-Max Normalization

****We will list these 4 columns and apply min-max normalization to each of them.****

In [9]:
cols = ['crim', 'zn', 'tax', 'black']
for col in cols:
    # find the minimum and maximum of the column
    minimum = df[col].min()
    maximum = df[col].max()
    # rescale the column to the [0, 1] range
    df[col] = (df[col] - minimum) / (maximum - minimum)

****The last line of the loop is the formula for min-max normalization.****

****The loop applies it to each of the 4 selected columns.****
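
To make the formula concrete, here is a minimal sketch on a made-up pandas Series (the values are hypothetical). It also compares skewness before and after, assuming `Series.skew()` is an acceptable measure here: min-max normalization changes the range of a column but not the shape of its distribution.

```python
import pandas as pd

# Hypothetical right-skewed values (not from the Boston data).
x = pd.Series([2.0, 4.0, 6.0, 10.0, 50.0])

# Min-max normalization: (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled.tolist())          # values now lie in [0, 1]
print(x.skew(), x_scaled.skew())  # skewness is the same up to floating-point error
```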

In [10]:
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()

for col, value in df.items():
    sns.histplot(value, kde=True, stat='density', ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

In [11]:
# standardization
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()

# fit the scaler and transform the selected columns
scaled_cols = scaler.fit_transform(df[cols])
scaled_cols = pd.DataFrame(scaled_cols, columns=cols, index=df.index)
scaled_cols.head()

In [12]:
for col in cols:
    df[col] = scaled_cols[col]  # assign the scaled values back to the original dataframe for further processing

In [13]:
fig, ax = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
ax = ax.flatten()

for col, value in df.items():
    sns.histplot(value, kde=True, stat='density', ax=ax[index])
    index += 1
plt.tight_layout(pad=0.5, w_pad=0.7, h_pad=5.0)

In [14]:
#Standardization rescales each column using its mean and standard deviation.
#Here, preprocessing.StandardScaler() is the standardization function.
#Even now, the columns 'crim', 'zn', 'tax', and 'black' do not show a perfect normal distribution,
#because rescaling changes the location and scale of a distribution, not its shape.
#However, the standardized values of these columns may slightly improve the model performance.
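
The computation behind standardization can be sketched with NumPy alone. The array below is made up for illustration. The sketch assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical column values (not from the Boston data).
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# z-score: subtract the mean, divide by the population standard deviation.
z = (x - x.mean()) / x.std(ddof=0)

print(z.mean())  # approximately 0
print(z.std())   # approximately 1
```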

## Correlation Matrix



In [15]:
corr = df.corr()
plt.figure(figsize=(20,10))
sns.heatmap(corr, annot=True, cmap='coolwarm')

****We mostly focus on the target variable, as this is a regression problem.****

****We can also observe that the columns 'tax' and 'rad' are highly correlated with each other.****

****We will later eliminate this redundancy by dropping one of the two variables.****

****Additionally, we will plot 'lstat' and 'rm' to show their correlation with the target variable 'medv'.****
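
One way to spot pairs such as 'tax' and 'rad' programmatically is to scan the correlation matrix for entries above a threshold. The helper name `highly_correlated_pairs`, the 0.9 threshold, and the synthetic columns `a`, `b`, and `c` are all hypothetical choices for this sketch.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| > threshold."""
    corr = frame.corr()
    pairs = []
    for a, b in combinations(frame.columns, 2):
        r = corr.loc[a, b]
        if abs(r) > threshold:
            pairs.append((a, b, float(r)))
    return pairs


# Synthetic data: 'b' is almost a linear function of 'a'; 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = highly_correlated_pairs(demo)
print(pairs)
```

Applied to the notebook's `df`, a call like `highly_correlated_pairs(df)` would list candidate pairs from which one column could be dropped.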



In [16]:
sns.regplot(y=df['medv'], x=df['lstat'])

In [None]:
#Here, the price of houses decreases as 'lstat' increases.
#Hence they are negatively correlated. 'lstat' is the percentage of lower-status population, so this is expected.



In [17]:
sns.regplot(y=df['medv'], x=df['rm'])

In [None]:
#Here, the price of houses increases as 'rm' increases.
#Hence they are positively correlated. 'rm' is the average number of rooms per dwelling.

## Input Split

In [25]:
X = df.drop(columns=['medv', 'rad'])  # drop the target and 'rad', which is highly correlated with 'tax'
y = df['medv']

## Model Training

Instead of training the model on the whole dataset, we will split the dataset to estimate the model's performance.

If you train and test on the same data, the results will be overly optimistic. Hence, we will use 'train_test_split'.

We set random_state to 42 to get the same split upon re-running.

If you don't specify a random state, the data will be split differently on each run, giving inconsistent results.
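
The effect of `random_state` can be checked on a small synthetic array (the data below is made up). This sketch assumes scikit-learn's default `test_size` of 0.25, which applies when neither `test_size` nor `train_size` is given.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic inputs and targets (hypothetical, not the Boston data).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Two splits with the same seed select exactly the same rows.
split_a = train_test_split(X_demo, y_demo, random_state=42)
split_b = train_test_split(X_demo, y_demo, random_state=42)
same_split = all(np.array_equal(p, q) for p, q in zip(split_a, split_b))

x_train, x_test, y_train, y_test = split_a
print(same_split, len(x_train), len(x_test))
```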


In [27]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
def train(model, X, y):
    # split the data, then train the model on the training portion
    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model.fit(x_train, y_train)

    # predict on the held-out test set
    pred = model.predict(x_test)

    # perform 5-fold cross-validation on the full dataset
    cv_score = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)
    cv_score = np.abs(np.mean(cv_score))

    print("Model Report")
    print("MSE:", mean_squared_error(y_test, pred))
    print('CV Score:', cv_score)

X contains the input attributes and y contains the output attribute.

We use 'cross_val_score' for better validation of the model.

Here, cv=5 means that cross-validation splits the data into 5 folds; each fold is used once for validation while the model is trained on the other four.

np.mean gives the average of the 5 scores, and np.abs converts that negative average to a positive value.
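
The sign convention is easier to see on a toy problem. The sketch below runs `cross_val_score` on synthetic data with a `LinearRegression` model. The data and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem: y depends linearly on X, plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# One score per fold; with 'neg_mean_squared_error' every score is <= 0.
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         scoring='neg_mean_squared_error', cv=5)

# Average the fold scores, then flip the sign to report a positive MSE.
cv_mse = np.abs(np.mean(scores))

print(scores)
print(cv_mse)
```

scikit-learn negates error metrics so that a higher score is always better, which is why the raw fold scores are negative.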



In [29]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
train(model, X, y)
coef = pd.Series(model.coef_, X.columns).sort_values()
coef.plot(kind='bar', title='Model Coefficients')


In [21]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
train(model, X, y)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')

In [22]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
train(model, X, y)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')

In [23]:
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
train(model, X, y)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')

In [24]:
import xgboost as xgb
model = xgb.XGBRegressor()
train(model, X, y)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title='Feature Importance')

## Final Thoughts

****To summarize, RandomForestRegressor works best for this project.****

