[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/13W9jpgQCO_tpL9ok5mPRCII9e_wjy6SM#scrollTo=HzPHRjwltwtM.ipynb)

# Regression problem

In this project, you will evaluate the performance and predictive power of a model that has been trained and tested on data collected from homes in suburbs of Boston, Massachusetts. A model trained on this data that is seen as a good fit could then be used to make certain predictions about a home — in particular, its monetary value. This model would prove to be invaluable for someone like a real estate agent who could make use of such information on a daily basis.

The dataset for this project originates from the UCI Machine Learning Repository. The Boston housing data was collected in 1978, and each of the 506 entries represents aggregated data about 14 attributes (13 features plus the target, the median home value) for homes from various suburbs in Boston, Massachusetts.

We will use the following Python libraries, which you will encounter frequently in data analysis and machine learning tasks:

- numpy, which provides vectorised arrays plus maths and linear algebra functionality;
- pandas, which provides data structures and data analysis tools;
- matplotlib, which provides highly customisable plotting functionality;
- seaborn, built on top of matplotlib, which is less customisable but can generate charts with less code; and,
- scikit-learn, which provides models and tools for most machine learning algorithms.

We start by importing the necessary packages: sklearn.datasets, which provides small standard datasets; SVR, the Support Vector Regression model, which applies the ideas behind SVMs to regression; GridSearchCV, to loop through predefined hyperparameters and fit our SVR model on the Boston training set; seaborn, for statistical plots of our dataset; and metrics, to evaluate the model.

In [1]:
import pandas as pd
import sklearn
from sklearn.datasets import load_boston  # loads the Boston house-prices dataset (removed in scikit-learn 1.2, so an older version is assumed)
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
import seaborn as sns
from sklearn import metrics
plt.rcParams["figure.figsize"] = (10, 10)

Then let's define the functions.
- We start by loading the dataset and reorganizing it into its main columns (features).
- Once loaded, we drop rows with missing values and split the data into training and testing sets, using train_test_split from sklearn.model_selection to shuffle and split the feature and price data. This lets us use the independent testing set to estimate the SVR's performance.
- Once the dataset is ready, we define the model. Keep in mind that SVR is a regression model; to use it, we need to choose the regularization parameter C and the kernel.

In the second step of this project, you will make a cursory investigation of the Boston housing data and record your observations. Familiarizing yourself with the data through an exploratory process is a fundamental practice that helps you understand and justify your results. Since the main goal of this project is to construct a working model capable of predicting the value of houses, we need to separate the dataset into features and the target variable. The target variable, 'MEDV', is the variable we seek to predict. These are stored in boston_X and boston_y, respectively.

# Load the dataset

In [None]:
# we load the dataset and save it as the variable boston
boston = load_boston()

In [None]:
# if we want to know what sort of detail is provided with this dataset, we can call .keys()
boston.keys()

In [None]:
# the info at the .DESCR key will tell us more 
print(boston.DESCR)

In [None]:
features = boston.feature_names

In [None]:
print(f'The features in dataset are: {features}')

In [None]:
# we can use pandas to create a dataframe, which is basically a way of storing and operating on tabular data 
# here we pass in both the data and the column names as variables
boston_X = pd.DataFrame(boston.data, columns = boston.feature_names)
boston_X

In [None]:
# we can then look at the top of the dataframe to see the sort of values it contains
boston_X.head()

In [None]:
# store target values
boston_y = boston.target
boston_y

In [None]:
# summary statistics (count, mean, std, min, quartiles, max) for each feature
print(f'Data description\n {boston_X.describe()}')

# Plot graphs to understand data

In this step we plot a histogram of the prices, a scatter plot of each feature against the price, and a heatmap of the pairwise correlations between the different features. Together these give a quick overview of how the data is distributed, how the features relate to one another and to the price, and whether the data contains outliers.
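
To make the idea of a pairwise correlation concrete, here is a minimal sketch on synthetic data (not the Boston dataset). The helper name `pearson` and the columns `a`, `b` and `y` are made up for illustration; the sketch only checks that a hand-written Pearson coefficient agrees with what pandas' `DataFrame.corr()` reports.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption, not the Boston dataset):
# y rises with a and falls with b
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
df = pd.DataFrame({"a": a, "b": b, "y": 2 * a - 3 * b + rng.normal(scale=0.1, size=200)})

def pearson(u, v):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    du, dv = u - u.mean(), v - v.mean()
    return float((du * dv).sum() / np.sqrt((du ** 2).sum() * (dv ** 2).sum()))

corr = df.corr()                       # pandas computes the coefficient for every column pair
manual_ay = pearson(df["a"], df["y"])  # positive: y increases with a
manual_by = pearson(df["b"], df["y"])  # negative: y decreases with b
print(corr.round(2))
```

The sign of each coefficient gives the direction of the relationship, and its magnitude gives the strength of the linear association.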

In [None]:

def plot_data(x_df, y_df, features, cor=False):
    X = x_df.values
    # histogram of the target values (prices)
    plt.figure(figsize=(10,10))
    plt.title("Price Distribution")
    plt.hist(y_df, bins=30)
    plt.show()
    # one scatter plot per feature against the price, sharing the y-axis
    fig, ax = plt.subplots(1, len(features), sharey=True, figsize=(20,5))
    # suptitle titles the whole figure; plt.title would only title the last subplot
    fig.suptitle("Relationship between different input features and price")
    ax = ax.flatten()
    for i, col in enumerate(features):
        x = X[:,i]  # assumes the columns of x_df are in the same order as features
        y = y_df
        ax[i].scatter(x, y, marker='o')
        ax[i].set_title(col)
        ax[i].set_xlabel(col)
        ax[i].set_ylabel('MEDV')
    plt.show()
    # Heatmap of the correlation between features. The correlation coefficient
    # ranges from -1 to 1. A value close to 1 means a strong positive correlation
    # between the two variables; a value close to -1 means a strong negative one.
    if cor:
        corr = x_df.corr()
        plt.figure(figsize=(10,10))
        sns.heatmap(corr, cbar=True, square=True, fmt='.1f', annot=True, annot_kws={'size':15}, cmap='Greens')
        plt.show()



In [None]:
# call function to display the charts
plot_data(boston_X, boston_y, features, cor=True)

A histogram splits the data into bins (or buckets), whose number we defined, and shows the number of times, or frequency, that values fall within each bin. This tells us about the distribution of the data.
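
As a small illustration of what a histogram counts, the sketch below bins ten made-up values (not taken from the dataset) into five equal-width bins with `np.histogram`, which returns the per-bin frequencies and the bin edges.

```python
import numpy as np

# Made-up values for illustration only
values = np.array([5.0, 7.0, 12.0, 18.0, 21.0, 22.0, 24.0, 35.0, 50.0, 50.0])

# Split the range [min, max] into 5 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=5)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f}): {n}")
```

Every value lands in exactly one bin, so the counts always sum to the number of values; the last bin also includes its right edge.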

# Preprocess data

The Boston dataset needs no cleaning: it arrives already cleaned when we load it from sklearn.datasets. The plots produced by the code above show that the features span very different ranges, so standardizing the data is good practice.
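
As a rough sketch of what this scaling does, the example below applies scikit-learn's `StandardScaler` to a tiny made-up array and compares the result with the formula written by hand: subtract each column's mean and divide by its (population) standard deviation. The array values are assumptions chosen only to show two columns on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

scaled = StandardScaler().fit_transform(X)

# The same transformation by hand: z = (x - mean) / std, column by column
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # np.std defaults to the population std (ddof=0)

print(scaled)
```

After scaling, each column has mean 0 and standard deviation 1, so no feature dominates a distance-based model such as SVR merely because of its units.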

In [None]:
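# Remove rows with missing values
# (despite its name, this function only drops nulls; it does not filter outliers)
def remove_outliers(x,y, features):
    # attach the target so that a row is dropped from features and target together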
    x_df = x.copy(deep=True)
    x_df['MEDV'] = y
    x_df.dropna(inplace=True)
    return x_df[features], x_df['MEDV']

In [None]:
boston_X, boston_y = remove_outliers(boston_X,boston_y, features)

In [None]:
# Standardize data

# we can now define the scaler we want to use and apply it to our dataset 

# a good exercise would be to research what StandardScaler does - it is from the scikit-learn library 

def scale_numeric(df):
    x = df.values 
    scaler = preprocessing.StandardScaler()
    x_scaled = scaler.fit_transform(x)
    # keep the original column names and index so the features stay labelled
    df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
    return df

In [None]:
# Preprocess data
def preprocess(x, y, features):
    x_df = x[features].copy(deep=True)
    x_df = scale_numeric(x_df)
    #print(len(x_df),len(y))
    # Split data into train, test
    X_train, X_test, y_train, y_test = train_test_split(x_df,y, test_size=0.3, random_state=1)
    return X_train, y_train, X_test, y_test

In [None]:
X_train, y_train, X_test, y_test = preprocess(boston_X, boston_y, features)
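
One caveat with the preprocessing above: scale_numeric fits the scaler on the full dataset before the split, so statistics from the test rows leak into the training features. A minimal sketch of one way to avoid this, on synthetic data rather than the Boston set, wraps the scaler and the model in a scikit-learn pipeline so that the scaler is fitted on the training rows only. The data, the variable names, and the choice of a linear kernel are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (an assumption, not the Boston dataset)
rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(120, 3))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.5, size=120)

# Split first, so the test rows play no part in fitting the scaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# The pipeline fits StandardScaler on X_train only, then reuses those
# training statistics to transform X_test at prediction time
pipe = make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0))
pipe.fit(X_train, y_train)

scaler = pipe.named_steps["standardscaler"]
test_r2 = pipe.score(X_test, y_test)  # R^2 on the held-out rows
print(f"Test R^2: {test_r2:.3f}")
```

A pipeline like this can also be passed directly to GridSearchCV, in which case the scaler is refitted inside every cross-validation fold.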

# Train the model

Once the data is processed and ready to use, it is fed to the model. Now let's have fun!

In [None]:
# Train model 
def train(model,X_train, y_train):
    model.fit(X_train, y_train)
    return model

In [None]:
model = SVR()  # baseline: SVR with default hyperparameters

model = train(model, X_train, y_train)

# Evaluate the model

Once training has run, we evaluate our model using different metrics predefined in sklearn.metrics.

In [None]:
# Step - 6 - Evaluate Model
def evaluate(model, X_test, y_test, plot = True, print_results=True, bl=False):
    y_pred = model.predict(X_test)
    if print_results:
      if bl:
        print('\n\nBaseline Model Performance on Test Dataset:\n')
      else:
        print('\n\nBest Model Performance on Test Dataset:\n')
      print('R^2:',metrics.r2_score(y_test, y_pred))
      print('MAE:',metrics.mean_absolute_error(y_test, y_pred))
      print('MSE:',metrics.mean_squared_error(y_test, y_pred))
      print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
     
    if plot:
      plt.scatter(y_test, y_pred)
      plt.xlabel("Prices")
      plt.ylabel("Predicted prices")
      plt.title("Prices vs Predicted prices")
      plt.show()
    return 

Common metrics are listed below:
- Mean Absolute Error (MAE): the mean of the absolute differences between predicted and actual values
- Mean Squared Error (MSE): the mean of the squared differences between predicted and actual values
- R2: which we recommend you research as an exercise to grow your knowledge. Wikipedia and the sklearn documentation are great places to start!
- Root Mean Squared Error (RMSE): the square root of the MSE, which can be read as the standard deviation of the residuals (prediction errors) when their mean is close to zero
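
To see how these metrics relate to one another, the following sketch computes each of them by hand with numpy on five made-up prices and predictions, then checks the results against sklearn.metrics. The numbers are illustrative only and are not outputs of the model above.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values, for illustration only
y_true = np.array([20.0, 22.0, 30.0, 35.0, 40.0])
y_pred = np.array([21.0, 20.0, 33.0, 35.0, 38.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # root mean squared error, in the units of the target
# R^2 compares the model's squared error with that of always predicting the mean
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```

An R^2 of 1 means perfect predictions, 0 means no better than predicting the mean price, and negative values mean worse than that.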

In [None]:
evaluate(model, X_test, y_test, bl= True)

# Improve the model

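In the code block below, the optimize_models function does the following:

- Define a grid of candidate hyperparameters: the kernel and the regularization parameter C.
- Build a GridSearchCV object from the regressor and the parameter grid, then fit it on the training set. See the sklearn documentation on GridSearchCV.
- Return the best combination found. Because no scoring argument is passed, GridSearchCV ranks candidates with the regressor's default score, which for SVR is R^2, one of the metrics from Step 6. To rank by a different metric, create a scoring function with make_scorer. See the sklearn make_scorer documentation.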

In [None]:
# Step - 7 - Improve Model
def optimize_models(X_train, y_train):
  params = {'kernel':['linear', 'rbf'], 'C':[1, 10]}
  model = SVR()
  clf = GridSearchCV(model, params)
  clf.fit(X_train, y_train)
  return (clf.best_params_)

In [None]:
best_params = optimize_models(X_train, y_train)
print(best_params)

In [None]:
## Build Best Model
best_C= best_params['C']
best_kernel = best_params['kernel']

best_model = SVR(kernel = best_kernel, C= best_C)
best_model = train(best_model, X_train, y_train)
evaluate(best_model, X_test, y_test)
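
The instructions earlier in this section mention make_scorer. As a hedged sketch of that route, the example below runs the same parameter grid on synthetic data (not the Boston set) but ranks candidates by mean squared error instead of the default R^2. The data and the names `mse_scorer` and `search` are assumptions for illustration. It also shows that, with the default refit=True, GridSearchCV already holds a model refitted with the best parameters, so rebuilding it by hand is optional.

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic regression data (an assumption, not the Boston dataset)
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=100)

# MSE is an error, so lower is better: greater_is_better=False makes the
# scorer return the negated MSE, which GridSearchCV then maximises
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

params = {"kernel": ["linear", "rbf"], "C": [1, 10]}
search = GridSearchCV(SVR(), params, scoring=mse_scorer, cv=5)
search.fit(X, y)

# With refit=True (the default), best_estimator_ is already trained on all of X, y
best_model = search.best_estimator_
print(search.best_params_, -search.best_score_)
```

Here `best_score_` is the negated mean cross-validated MSE, hence the sign flip when printing.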