# <center> Project: **Customer Intelligence** department in a Bank company: real world examples of a **Data Scientist** in a Bank company. Part II: Extra bonus - Regression model for car price estimation

# Project goals:
In this part of the project, we continue working as a Data Scientist in our Bank company, facing new challenges in the area of Machine Learning.

In particular, the Bank's credit department has noticed that many customers are applying for loans to purchase second-hand vehicles. Until now the department has relied on a price reference book, but it is not very accurate, and as a result the Bank is not correctly measuring the risk of granting these loans.

Therefore, as a Customer Intelligence team member, you will be responsible for designing, developing and analyzing a model to estimate the price of a second-hand vehicle based on its main characteristics. To do so, we will use a **Multilayer Perceptron (MLP)** architecture.

### Due date: up to June 18th at 23:59h.
### Submission procedure: via Moodle.

# Step 1: Data gathering

In this part of the Project we are using a new dataset named `CarPrice.csv`. This file contains information on **205 cars** and 26 features that describe the main characteristics of each car. Some examples are:

- *car_ID*: An integer that identifies each car
- *symboling*: An integer that identifies a car category
- *CarName*: Manufacturer and model of the car
- *fueltype*: Type of fuel that the car uses
- ...
- *price*: The current price of the car in the market


Let's import some libraries and functions we will need to develop our model.

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
from matplotlib import pyplot  # same module as plt; some cells below use this name
#import matplotlib.animation as animation
#%matplotlib notebook

from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import r2_score, mean_absolute_error

**[EX0]** Load the car price dataset into a DataFrame named `car_price_dt`. You should obtain a DataFrame similar to this one.

In [None]:
car_price_dt = pd.read_csv("car_prediction.csv", sep=";")
df = car_price_dt  # short alias used by the cells below
display(df.head())

# Step 2: Data understanding and preparation

Once we know the problem to solve, the next stage is to gain a clear understanding of the data we have extracted and to prepare it before modelling. In particular, we will:
- List and verify the type of each variable (object, float, int...), identify variables with nulls, and measure the memory usage
- Eliminate rows with nulls so that the dataset is fully populated
- Run an Exploratory Data Analysis to understand the main statistics (mean, standard deviation, min and max values, and the 25%-50%-75% quartiles) and the distribution of the most relevant variables or features
- Plot several graphs to identify how the variables are related to each other. In particular:
  - correlation matrix
  - 2D and 3D scatter plots

Once this part of the Project, also known as **data wrangling**, is done, we should have a deep knowledge of the data.
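
The checks listed above can be sketched on a tiny made-up table. The `toy` DataFrame and its values are invented for illustration only; in the notebook the same calls would presumably be applied to the loaded car dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas", "gas"],
    "horsepower": [111.0, 154.0, np.nan, 102.0],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# Count nulls per column and measure total memory usage in bytes
null_counts = toy.isnull().sum()
memory_bytes = toy.memory_usage(deep=True).sum()

# Drop every row that contains at least one null
toy_clean = toy.dropna()

# Main statistics of the numeric columns (count, mean, std, min, quartiles, max)
summary = toy_clean.describe()

print(null_counts)
print(summary)
```

`describe()` only summarizes numeric columns by default, so `fueltype` does not appear in `summary`.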

**[EX1]** Are there any variables with null values to fix?

In [None]:
df.info()

<font color="red"> Answer: TODO</font>

**[EX2]** Calculate the quartiles, maximum and minimum values for numeric features

In [None]:
numeric_features = ["symboling", "wheelbase", "carlength", "carwidth", "carheight", "curbweight", "enginesize", "boreratio", "stroke", "compressionratio", "horsepower", "peakrpm", "citympg", "highwaympg", "price"]
df_numeric = df[numeric_features]

df_numeric.quantile([0.25, 0.5, 0.75])

In [None]:
print("%37s" % "Min values:", "%25s" % "Max values:")
for i in range(0, len(numeric_features)):
    print("%-25s" % numeric_features[i], "%-25.2f" % df_numeric[numeric_features[i]].min(), "%-15.2f" % df_numeric[numeric_features[i]].max())

**[EX3]** Plot the distribution of the column `price`

In [None]:
# TODO

**[EX4]** Plot the correlation matrix between numerical features


In [None]:
corr = df_numeric.corr()
display(corr)

pyplot.matshow(corr)

plt.show()
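
To read a correlation matrix with the target in mind, it can help to rank the features by the absolute value of their correlation with `price`. The following is a minimal sketch on synthetic data; the `toy` DataFrame and the relationships built into it are assumptions made for the example, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on enginesize, stroke is unrelated by construction
rng = np.random.default_rng(0)
n = 200
enginesize = rng.normal(130, 40, n)
toy = pd.DataFrame({
    "enginesize": enginesize,
    "citympg": 60 - 0.2 * enginesize + rng.normal(0, 5, n),
    "stroke": rng.normal(3.2, 0.3, n),
    "price": 100 * enginesize + rng.normal(0, 1500, n),
})

corr = toy.corr()

# Rank the features by the strength (absolute value) of their correlation with price
ranking = corr["price"].drop("price").abs().sort_values(ascending=False)
print(ranking)
```

A feature with a strong negative correlation is as informative for a regression model as one with a strong positive correlation, which is why the ranking uses absolute values.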

**[EX5]** Look at the correlation matrix results. Do you think a model to predict car price is feasible? Justify your answer.

<font color="red"> Answer: TODO</font>

**[EX6]** Filter from the previous dataset (i.e. `car_price_dt`) the following columns: `symboling`, `wheelbase`, `carlength`, `carwidth`, `carheight`, `curbweight`, `enginesize`, `boreratio`, `stroke`, `compressionratio`, `horsepower`, `peakrpm`, `citympg`, `highwaympg`, `price`. Normalize all these features with the **StandardScaler()** function. Finally, create two new arrays: `y` with the `price` column and `x` with the rest of the features.

In [None]:
filtered_columns = ['symboling', 'wheelbase', 'carlength', 'carwidth', 'carheight', 'curbweight', 'enginesize', 'boreratio', 'stroke', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg', 'price']
filtered_dt = df[filtered_columns]

scaler = StandardScaler()
scaler.fit(filtered_dt)
sc_filtered_dt = scaler.transform(filtered_dt)

X = sc_filtered_dt[:, :-1]
y = sc_filtered_dt[:, -1]

# Step 3: Training the model and performance evaluation: Regression model to estimate the price of a car based on the vehicle's features.

**[EX7]** Build some **utils** functions that we will need for our MLP architecture. Create:
- *sigmoid* function that calculates the sigmoid of a value, array, etc.
- *sigmoid_derivative* function that calculates the derivative of the sigmoid for a value p.
- *relu* function that calculates the ReLU of a value, array, etc.
- *relu_derivative* function that calculates the derivative of the ReLU for a value x.

In [None]:
def sigmoid(value):
    return 1 / (1 + np.exp(-value))

def sigmoid_derivative(p):
    return p * (1-p)

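def relu(value):
    # np.maximum works element-wise, so scalars and arrays are both supported
    return np.maximum(0, value)

def relu_derivative(x):
    # 1 where x > 0 and 0 elsewhere, for scalars and arrays alike
    return np.where(np.asarray(x) > 0, 1.0, 0.0)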
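
One detail worth checking is the convention of `sigmoid_derivative(p)`: it expects `p` to be an already-computed sigmoid output, not the raw input. The sketch below repeats the two sigmoid helpers so that it runs on its own and compares the formula with a central finite difference; the test points in `x` and the step `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid(value):
    return 1 / (1 + np.exp(-value))

def sigmoid_derivative(p):
    # p is assumed to be a sigmoid output, i.e. p = sigmoid(x)
    return p * (1 - p)

x = np.array([-2.0, 0.0, 0.5, 3.0])
eps = 1e-6

# Numerical derivative of sigmoid at x (central difference)
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# Analytic derivative: pass the activation, not x itself
analytic = sigmoid_derivative(sigmoid(x))

max_gap = np.max(np.abs(numeric - analytic))
print(max_gap)
```

Calling `sigmoid_derivative(x)` directly on the raw input would give a different (and wrong) result, which is an easy mistake to make inside backpropagation.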

**[EX8]** Split `x` and `y` into `xtrain` , `xtest` , `ytrain` , `ytest`, reserving 20% of the total samples for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

**[EX9]** Complete the following code to build an MLP solution with 1 hidden layer. In particular, you should:
- 1) complete the **feedforward** method. Select the activation function of the utils section that you consider suitable for this use case (i.e. car price estimation).
- 2) complete the **backpropagation** method
- 3) build the **predict** method that calculates the output of the MLP based on the last calculated weights during the training process.

In [None]:
# Class definition
class NeuralNetwork:
    def __init__(self, x, y):
        self.input = x
        self.number_neurons_hidden = 6
        # weights1 has one row per input feature plus an extra row for the bias weight w0
        self.weights1 = np.random.rand(self.input.shape[1] + 1, self.number_neurons_hidden)
        print("Initialized weights layer1\n", self.weights1)
        # weights2 has one row per hidden neuron plus an extra row for the bias weight w0 of the output neuron
        self.weights2 = np.random.rand(self.number_neurons_hidden + 1, 1)
        print("Initialized weights layer2\n", self.weights2)
        self.y = y
        self.output = np.zeros(y.shape)
        self.lr = 0.01

    def feedforward(self):
        # Add a column of ones to the input so that it multiplies w0 in the hidden layer
        self.input_aux = np.c_[np.ones(self.input.shape[0]), self.input]

        # Output of the hidden layer (layer 1)
        self.layer1_output = sigmoid(self.input_aux @ self.weights1)

        # Add a column of ones to the hidden output so that it multiplies w0 in the output layer
        self.layer1_output_aux = np.c_[np.ones(self.layer1_output.shape[0]), self.layer1_output]

        # Output of layer 2. A linear (identity) activation is used here because the
        # standardized price takes negative values, which a sigmoid (range 0 to 1) cannot produce
        self.layer2_output = self.layer1_output_aux @ self.weights2

        self.output = self.layer2_output
        return self.output

    def backpropagation(self):
        m = self.input.shape[0]
        # Gradient of the error with respect to the output; with a linear output
        # activation this is also the error term (delta) of layer 2
        gradient_output = self.output - self.y

        # Gradient of the error with respect to the weights of layer 2
        gradient_weights2 = self.layer1_output_aux.T @ gradient_output

        # Error term of the hidden layer; the bias row of weights2 is skipped because
        # the constant 1 fed to the output layer does not depend on the hidden neurons
        delta1 = (gradient_output @ self.weights2[1:].T) * sigmoid_derivative(self.layer1_output)

        # Gradient of the error with respect to the weights of layer 1
        gradient_weights1 = self.input_aux.T @ delta1

        # Update weights1 and weights2 with the learning rate defined in the constructor
        self.weights1 = self.weights1 - self.lr * gradient_weights1 / m
        self.weights2 = self.weights2 - self.lr * gradient_weights2 / m

    def train(self, X, y):
        self.output = self.feedforward()
        self.backpropagation()

    def predict(self, X_pred):
        self.input_predict = X_pred
        # Forward pass on the new data with the fitted weights of layers 1 and 2
        input_aux = np.c_[np.ones(X_pred.shape[0]), X_pred]
        hidden_aux = np.c_[np.ones(X_pred.shape[0]), sigmoid(input_aux @ self.weights1)]
        return hidden_aux @ self.weights2
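
Backpropagation code is easy to get wrong in its matrix shapes, so a numerical gradient check is a useful sanity test. The sketch below is a standalone illustration, not part of the assignment skeleton: it assumes a sigmoid hidden layer, a linear output and a half mean squared error loss, and uses made-up data and hypothetical names (`W1`, `W2`, `forward`, `loss_fn`).

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))        # 8 made-up samples, 3 features
y = rng.normal(size=(8, 1))        # made-up standardized targets
W1 = rng.normal(size=(3 + 1, 4))   # +1 row for the bias weight w0
W2 = rng.normal(size=(4 + 1, 1))   # +1 row for the bias weight w0

def forward(W1, W2):
    Xa = np.c_[np.ones(len(X)), X]     # input with bias column
    H = sigmoid(Xa @ W1)               # hidden activations
    Ha = np.c_[np.ones(len(H)), H]     # hidden activations with bias column
    return Xa, H, Ha, Ha @ W2          # linear output (assumption)

def loss_fn(W1, W2):
    return 0.5 * np.mean((forward(W1, W2)[3] - y) ** 2)

# Analytic gradients via backpropagation
Xa, H, Ha, out = forward(W1, W2)
delta2 = (out - y) / len(X)                  # error term at the output
grad_W2 = Ha.T @ delta2                      # same shape as W2
delta1 = (delta2 @ W2[1:].T) * H * (1 - H)   # skip the bias row of W2
grad_W1 = Xa.T @ delta1                      # same shape as W1

# Numerical gradient of a single entry of W1 (central difference)
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[2, 1] += eps
W1_minus[2, 1] -= eps
numeric = (loss_fn(W1_plus, W2) - loss_fn(W1_minus, W2)) / (2 * eps)
gap = abs(numeric - grad_W1[2, 1])
print(gap)
```

Each gradient must have exactly the shape of the weight matrix it updates; a mismatch there is usually the first sign of a backpropagation bug.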


We are ready to train our model. Execute the following code to train our MLP based on xtrain and ytrain. Tip: ytrain may need to be reshaped into a column vector (a 2D array with a single column).

In [None]:
#Vector to store the loss in every iteration
loss=[]
#Create the neural network
NN = NeuralNetwork(X_train, y_train.reshape(-1,1))
#Number of iterations
epochs=1500
for i in range(epochs):
    NN.train(X_train, y_train)
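    # Compare against the training targets as a column vector, matching the shape of the network output
    loss.append(np.mean(np.square(y_train.reshape(-1,1) - NN.feedforward())))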

Use the following **summarize_diagnostics** function to plot the loss.

In [None]:
def summarize_diagnostics(history):
    # plot loss
    pyplot.subplot(211)
    pyplot.title('Loss')
    pyplot.plot(history, color='blue', label='train')
    pyplot.legend(loc="lower center", fontsize=14)
    return

**[EX10]** Plot the **loss** obtained during the training stage of the MLP. Describe the visualization.

**[EX11]** Execute the **predict** method of the NeuralNetwork class for xtest. Answer the following questions:
- (1) What is the performance of the model on ytest? Is it a good model? Justify your answer.
- (2) What is the mean absolute error at the original scale of the dataset (i.e. before normalization)?
- (3) If the performance is not good enough, how would you tune the MLP solution (i.e. which parameters of the NN would you modify)? Run the training and prediction process again with this new setup and compare with the previous results. Are the results (i.e. R2 score and mean absolute error) as you expected? Justify your answer.
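
For questions (1) and (2), one possible way to report the metrics is sketched below. It assumes the price was standardized as the last column of a fitted `StandardScaler`, as in EX6; the data and the "predictions" are made up, so the numbers themselves mean nothing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error

# Made-up table: the last column plays the role of price
rng = np.random.default_rng(0)
data = np.c_[rng.normal(100, 20, 50), rng.normal(13000, 8000, 50)]
scaler = StandardScaler().fit(data)
scaled = scaler.transform(data)

y_true_sc = scaled[:, -1]
y_pred_sc = y_true_sc + rng.normal(0, 0.1, 50)   # stand-in for model predictions

# Undo the standardization of the price column: value * std + mean
price_mean, price_std = scaler.mean_[-1], scaler.scale_[-1]
y_true = y_true_sc * price_std + price_mean
y_pred = y_pred_sc * price_std + price_mean

r2 = r2_score(y_true_sc, y_pred_sc)
mae_scaled = mean_absolute_error(y_true_sc, y_pred_sc)
mae_original = mean_absolute_error(y_true, y_pred)
print(r2, mae_scaled, mae_original)
```

The R2 score is the same on both scales, while the mean absolute error in original units equals the standardized error multiplied by the standard deviation of the price column (`scaler.scale_[-1]`).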
