## 1. Importing

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from collections import Counter

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston
boston_dataset = load_boston()
dataset = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)

## 2. Overview of the Columns

In [None]:
dataset.head()

Columns:
- *CRIM:* Per capita crime rate by town
- *ZN:* Proportion of residential land zoned for lots over 25,000 sq. ft
- *INDUS:* Proportion of non-retail business acres per town
- *CHAS:* Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- *NOX:* Nitric oxide concentration (parts per 10 million)
- *RM:* Average number of rooms per dwelling
- *AGE:* Proportion of owner-occupied units built prior to 1940
- *DIS:* Weighted distances to five Boston employment centers
- *RAD:* Index of accessibility to radial highways
- *TAX:* Full-value property-tax rate per $10,000
- *PTRATIO:* Pupil-teacher ratio by town
- *B:* 1000(Bk - 0.63)², where Bk is the proportion of [people of African American descent] by town
- *LSTAT:* Percentage of the population of lower status
- *MEDV:* Median value of owner-occupied homes in $1000s

As you can see, the data frame does not yet contain the "MEDV" column, which is the target we will try to predict. Let's add it to our dataset.

In [None]:
dataset['MEDV'] = boston_dataset.target

In [None]:
dataset.head()

## 3. Data Analysis 

### Data Preprocessing

> Are there missing values? There are no missing values, as shown below.

In [None]:
dataset.isnull().sum()

In [None]:
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values.reshape(-1,1)

> Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 25)

In [None]:
print("Shape of X_train: ",X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_train: ",y_train.shape)
print("Shape of y_test: ", y_test.shape)
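
To make the 70/30 split concrete, here is a minimal sketch of a shuffled split written with NumPy only. The helper `simple_train_test_split` and the placeholder arrays are hypothetical and are not part of this kernel; the sketch assumes the test share is rounded up, and it will not reproduce the exact rows that `train_test_split` selects.

```python
import math
import numpy as np

def simple_train_test_split(X, y, test_size=0.3, seed=25):
    """Hypothetical helper: shuffle the row indices, then cut off a test share."""
    n_samples = len(X)
    n_test = math.ceil(test_size * n_samples)  # assumption: round the test share up
    order = np.random.default_rng(seed).permutation(n_samples)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Placeholder data with the same shape as the Boston features (506 rows, 13 columns)
X_demo = np.zeros((506, 13))
y_demo = np.arange(506).reshape(-1, 1)

Xd_train, Xd_test, yd_train, yd_test = simple_train_test_split(X_demo, y_demo)
print(Xd_train.shape, Xd_test.shape)  # (354, 13) (152, 13)
```

Fixing the seed plays the same role as `random_state`: the same rows land in the test set on every run, so scores stay comparable between runs.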

### Visualizing Data

In [None]:
corr = dataset.corr()
# Set the figure size
fig, ax = plt.subplots(figsize=(10, 10))
# Draw the heat map on that axes and annotate each cell with the correlation to two decimals
sns.heatmap(corr, cmap='RdBu', annot=True, fmt=".2f", ax=ax)
# seaborn already labels the ticks with the column names, centered on each cell,
# so no manual xticks/yticks are needed (integer tick positions would shift the labels)
plt.show()
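
A heat map of this size can be hard to read at a glance, so one option is to rank the features by the strength of their correlation with the target. The sketch below uses a small synthetic data frame; the helper `rank_by_target_correlation` and the column names `strong_pos`, `strong_neg`, `unrelated` and `target` are hypothetical. With the real data, the same call would presumably be `rank_by_target_correlation(dataset, "MEDV")`.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(df, target):
    """Hypothetical helper: Pearson correlation of each column with `target`,
    ordered from strongest to weakest by absolute value."""
    corr_with_target = df.corr()[target].drop(target)
    order = corr_with_target.abs().sort_values(ascending=False).index
    return corr_with_target.loc[order]

# Synthetic data (an assumption for illustration, not the Boston data)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "strong_pos": rng.normal(size=n),
    "strong_neg": rng.normal(size=n),
    "unrelated": rng.normal(size=n),
})
toy["target"] = 3 * toy["strong_pos"] - 2 * toy["strong_neg"] + rng.normal(scale=0.1, size=n)

ranked = rank_by_target_correlation(toy, "target")
print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.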

## 4. Regression Models 

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
regressor_linear = LinearRegression()
regressor_linear.fit(X_train, y_train)

In [None]:
from sklearn.metrics import r2_score

# 10-fold cross-validation score (R2) on the training set
cv_linear = cross_val_score(estimator = regressor_linear, X = X_train, y = y_train, cv = 10)

# R2 score on the training set
y_pred_linear_train = regressor_linear.predict(X_train)
r2_score_linear_train = r2_score(y_train, y_pred_linear_train)

# R2 score on the test set
y_pred_linear_test = regressor_linear.predict(X_test)
r2_score_linear_test = r2_score(y_test, y_pred_linear_test)

# RMSE on the test set
rmse_linear = (np.sqrt(mean_squared_error(y_test, y_pred_linear_test)))
print("CV: ", cv_linear.mean())
print('R2_score (train): ', r2_score_linear_train)
print('R2_score (test): ', r2_score_linear_test)
print("RMSE: ", rmse_linear)
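
To show what the numbers above measure, here is a minimal NumPy-only sketch of R2, RMSE and k-fold cross-validation. For a regressor, `cross_val_score` with an integer `cv` splits the rows into unshuffled folds and reports the estimator's default score, which is R2 for `LinearRegression`. The helpers `r2`, `rmse`, `fit_least_squares`, `predict_least_squares` and `k_fold_r2` are hypothetical, and the data is synthetic; this is an illustration under those assumptions, not scikit-learn's implementation.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def fit_least_squares(X, y):
    """Ordinary least squares with an intercept column prepended to X."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict_least_squares(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def k_fold_r2(X, y, k=10):
    """Fit on k-1 folds, score R2 on the held-out fold, repeat for every fold."""
    all_idx = np.arange(len(X))
    scores = []
    for held_out in np.array_split(all_idx, k):  # contiguous, unshuffled folds
        train = np.setdiff1d(all_idx, held_out)
        coef = fit_least_squares(X[train], y[train])
        scores.append(r2(y[held_out], predict_least_squares(coef, X[held_out])))
    return np.array(scores)

# Synthetic linear data with a little noise (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 4.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

scores = k_fold_r2(X_demo, y_demo, k=10)
print("mean CV R2:", scores.mean())
```

Because every fold is scored on rows the model did not see during fitting, the mean CV score is usually a better guide to test performance than the R2 on the training set.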

## 6. Conclusion 

In [None]:
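# `predict` must already exist: a DataFrame with one row per fitted model and 'Model' and 'RMSE' columns
predict.sort_values(by=['RMSE'], ascending=False, inplace=True)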

f, axe = plt.subplots(1,1, figsize=(18,6))
sns.barplot(x='Model', y='RMSE', data=predict, ax = axe)
axe.set_xlabel('Model', size=16)
axe.set_ylabel('RMSE', size=16)

plt.show()

In this kernel, I built linear regression models on the Boston Housing dataset. Please leave a comment and let me know how to improve the model performance, the visualizations, or anything else in this kernel. That will also help me in my future analyses.

Don't forget to UPVOTE if you liked this kernel, thank you. 🙂👍