# How do I predict the fair market prices of used cars?

## Goals

At this point, you have been introduced to various types of machine learning techniques. Today we will introduce a machine learning model that most of you have heard about: **neural networks (NN)**. The fundamental architecture of an NN is very different from anything we have encountered so far, and it is this architecture that makes it a much more powerful model for certain tasks, such as image recognition. However, we do not want you to be intimidated by NNs, as they are simply another tool in your data science toolkit. Additionally, it is important to remind yourself that NNs are not a panacea that solves any and all data science problems. Data science is, and always will be, about proper analysis.

There may be areas of confusion throughout this case as there is a vast amount of new technical jargon to cover. This is normal and okay. It is fine for you to simply accept certain statements as true for the time being without understanding why. We recommend slowly getting used to the implementation of an NN and then diving into the various components to understand how they really work.

## Introduction

**Business Context.** There are many companies selling used (refurbished) cars across the United States. Because automobiles depreciate in value as they age, this is an extremely competitive industry, and sellers have to price each car right in order to win business. You are a data scientist tasked with building a predictive model for the prices of used car sales around the country. We have already seen some methods that will be useful for this purpose, such as linear regression. In this case, we will look at another approach to this problem, called a **neural network**. Neural networks are the basic building block of **deep learning algorithms**.

**Business Problem.** Your task is to **predict the fair market price of a used car given its attributes**.

**Analytical Context.** The provided dataset on used cars was scraped from Craigslist. We have already pre-cleaned the dataset by removing some outliers that are irrelevant, whittling down the features to relevant columns, and standardizing missing data that may cause trouble for our analysis.

The case will proceed as follows: we will (1) build a predictive model using linear regression; (2) discuss the challenges of feature engineering in order to implement regression in real-world contexts; (3) look at neural networks as an alternative to explicit feature engineering; and finally (4) build a neural network to solve this problem.

## Reading in the data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(11.7,8.27)})
from sklearn.model_selection import train_test_split
# removed this line and replaced it below
# from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error as mse
from sklearn.linear_model import LinearRegression

In [None]:
df = pd.read_csv('used_cars_clean.csv')

In [None]:
print(df.columns)
df.head()

In [None]:
df.shape

In [None]:
df.transmission.value_counts()

## Linear regression and feature engineering

A natural first step to building a model is to consider different variables that impact the price of a used car. For example, let's take a look at the following graph, which gives boxplots of price for each condition category:

In [None]:
sns.boxplot(x='condition', y='price', data=df, order=['salvage', 'fair', 'good', 'excellent', 'like new', 'new'])

In [None]:
sns.stripplot(x="condition", y="price", data=df, order=['salvage', 'fair', 'good', 'excellent', 'like new', 'new'])

Unsurprisingly, we see that price increases as the condition of the car improves from `salvage` to `new`. Let's build a linear model using the different category labels within this feature:

In [None]:
df = pd.concat([df, pd.get_dummies(df['condition'])], axis=1)
y = df.price
X = df[['salvage', 'fair', 'good', 'excellent', 'like new', 'new']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
lr = LinearRegression().fit(X_train, y_train)

In [None]:
print(pd.DataFrame(zip(X_train.columns, lr.coef_)))
print('\nintercept:', lr.intercept_)
print('r2:',  lr.score(X_test,y_test))

In [None]:
pred = lr.predict(X_test)
print('mse', mse(pred, y_test))

In [None]:
corr = df[['price','salvage', 'fair', 'good', 'excellent', 'like new', 'new']].corr()
sns.heatmap(corr, center=0,  annot=True)

### Exercise 1:

Perform an exploratory analysis of the dataset to find other features that may be correlated with price. Use these features to create a linear model regressing price against those features. How well does the model fit? Why do you think that is?

**Answer.**

------------

### Exercise 2:

Write code to create dummy variables for each categorical value in this DataFrame, and create a new DataFrame with each categorical column turned into either numerical data or a set of dummy columns.

**Answer.**

------------

There are already a lot of new columns with a lot of potential for errors! In order to use this categorical data in a linear regression, we need some understanding of how this data can be converted to numbers. This may require significant domain knowledge. With cars, it's straightforward to understand that "good" will be more expensive than "salvage". However, our intuitions may not be so accurate in other situations.

We can try to make some more quick plots to understand how these features impact the price of a used car. Then we could restrict ourselves to only looking at the features that seem to have a major impact:

In [None]:
sns.lineplot(x='year', y='price', data=df)

In [None]:
sns.boxplot(x='manufacturer', y='price', data=df)
plt.xticks(rotation=90)

Without a doubt, these graphs are interesting, and this kind of exploratory data analysis is incredibly important. For instance, it's surprising that price does not increase linearly with the condition of the car, and it's also notable how noisy the prices are for years before 1980. We could create and add better features, but we may not have enough domain knowledge to create good features from our data to get a good result with these linear regression models. Why could this be?

This could be happening for a couple of reasons:

1. We may have wrongly chosen some factors. For example, perhaps the difference between `salvage` and `fair` matters a lot, but maybe the difference between `fair` and `good` is less significant.
2. The interaction between these variables may be highly non-linear. By definition, a linear model cannot accommodate this without additional data wrangling & engineering.
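
The second point can be illustrated with a toy example. The sketch below uses made-up data (not the used-car dataset) and a hypothetical helper, `fit_line`, that performs ordinary least squares with a single predictor. A straight line explains none of a purely quadratic relationship, but the same linear model fits perfectly once we hand-engineer the feature $x^2$:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Toy data with a purely quadratic relationship (illustrative only)
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

# Regressing on the raw feature: the best line is flat and explains nothing
_, _, r2_raw = fit_line(xs, ys)

# Regressing on a hand-engineered feature x^2: the linear model now fits exactly
xs_squared = [x ** 2 for x in xs]
_, _, r2_engineered = fit_line(xs_squared, ys)

print(round(r2_raw, 3), round(r2_engineered, 3))
```

The catch is that we had to know in advance that squaring was the right transformation, which is exactly the domain knowledge we may lack.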

### Exercise 3:

Which of the following are difficulties with a traditional feature engineering approach to this problem?

I. It is difficult to think of enough relevant features

II. Many desired features may not be quantitative

III. It is difficult to extract those features from the photos

IV. Some features may be too abstract to immediately recognize

**Answer.**

------------

## A more complex model: neural networks


One way to avoid the difficulties of explicit feature engineering is to use **neural networks**. In a neural network, the computer automatically learns relevant features from the data and, at the same time, tunes the model's parameters so that its predictions fit the given labels. This is essentially a giant calculus problem, in which the computer minimizes the amount by which its predictions are wrong. If you've studied multivariable calculus, you may recognize that solving it requires the **gradient vector**.

<img src="neural_net.png" width="400" height="400" />

A more precise characterization of the functions that a neural network uses is given above. The algorithm takes inputs on the left-hand side and produces outputs on the right-hand side by multiplying each layer by the weights on the edges between neurons to get the next layer. Then, the algorithm uses multivariable calculus (in a process known as **gradient descent**) to optimize the weight on each edge. While there are many ways for a neural network to learn, we'll focus on the method that is easiest to understand. <a href="https://en.wikipedia.org/wiki/Backpropagation">Backpropagation</a> adjusts the weights in each layer according to how far the network's outputs were from the actual outputs in each iteration step.

How do we tell good choices of weights from bad ones? We compare the network's predictions with the actual values using the **loss function**, and then make small changes to the weights in the direction that reduces the loss. Most frequently, a gradient descent method decides the direction of each change, and a **learning rate** controls its size.
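
To make this loop concrete, here is a minimal sketch of gradient descent on the simplest possible "network": a single weight `w` in the model $\hat{y} = w x$. The data, the starting weight, and the learning rate are all made-up values chosen for illustration:

```python
# Toy data generated by y = 2 * x, so the ideal weight is 2 (illustrative only)
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def mse_loss(w):
    """Mean squared error of the one-weight model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w):
    """Derivative of the loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0               # arbitrary starting weight
learning_rate = 0.05  # step size (assumed value)
losses = []
for step in range(200):
    losses.append(mse_loss(w))
    w -= learning_rate * mse_gradient(w)  # small step against the gradient

print(round(w, 4), losses[0], losses[-1])
```

Each pass measures how wrong the current weight is and nudges it in the direction that lowers the loss. A real network does the same thing for thousands of weights at once.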

This idea of a neural network is actually borrowed from, and gets its name from, the way that neurons fire in the human brain. You can read more about this connection and the history of neural network development <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">here</a>:

<img src="neuron.png" width="400" height="400" />

We can add more hidden layers in the middle to allow the neural network to create more complex functions, like in the following diagram:

<img src='deep_net.png' width='400' height='400' />

This is then called a **deep neural network**. Modern deep neural networks can have hundreds of layers. How many layers you add is part of the art and science of constructing a proper neural network. It should be noted that more layers is not necessarily better. 

This image shows an example of how the nodes in a neural network are calculated:

<img src="neural_net2.png" width="400" height="400" />

(*Source:* https://medium.com/wwblog/transformation-in-neural-networks-cdf74cbd8da8).

Each node is a linear combination of the nodes in the layer before it. Between layers there is an **activation function**, which acts as a "threshold" that determines whether a neuron triggers. A stack of purely linear combinations would still be a linear model, so it is the activation functions between the layers that let the network develop non-linear sensitivities. With more layers, the network can represent increasingly complex *non-linear* models. This is the main strength of neural networks over more elementary regressions.
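
A tiny sketch shows why the activation function matters. The weights below are arbitrary, and `threshold_activation` is a hypothetical stand-in for an activation that simply zeroes out negative values. Two stacked linear nodes still trace a straight line, while the same two nodes with an activation in between do not:

```python
def threshold_activation(z):
    """A simple activation: pass positive values through, zero out the rest."""
    return max(0.0, z)

def two_linear_layers(x):
    """Two stacked linear nodes with no activation in between."""
    hidden = 2.0 * x + 1.0
    return -3.0 * hidden + 4.0   # algebraically equal to -6x + 1

def two_layers_with_activation(x):
    """Same weights, but with an activation applied to the hidden node."""
    hidden = threshold_activation(2.0 * x + 1.0)
    return -3.0 * hidden + 4.0

def is_collinear(f):
    """For a straight line, f(0) is the average of f(-1) and f(1)."""
    return abs(f(0.0) - (f(-1.0) + f(1.0)) / 2) < 1e-12

print(is_collinear(two_linear_layers))
print(is_collinear(two_layers_with_activation))
```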

Every neuron processes the input it receives in the same standard way. It first combines its inputs $X_i$ into a weighted sum with weights $w_i$ and an intercept $b$:

$$
z = b+\sum_i w_i X_i
$$

The weights and intercept are specific to each neuron and have to be determined through an iterative procedure.
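
As a quick illustration of the formula, the following sketch computes $z$ for one neuron. The inputs, weights, and intercept are hypothetical numbers, and `neuron_pre_activation` is a name we made up for this example:

```python
def neuron_pre_activation(inputs, weights, bias):
    """Compute z = b + sum_i w_i * X_i for a single neuron."""
    return bias + sum(w * x for w, x in zip(weights, inputs))

X = [1.0, 2.0, 3.0]    # hypothetical input values
w = [0.5, -1.0, 2.0]   # hypothetical weights
b = 0.1                # hypothetical intercept

z = neuron_pre_activation(X, w, b)
print(z)
```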

Once the neuron has formed $z$ it applies a user-defined activation function to it. One simple non-linear activation function that we can use is the **relu** function, which looks like the following:

<img src="relu.png" width="300" height="300" />


Also known as a [Rectified Linear Unit](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)), this function returns 0 if its input is less than 0 and otherwise returns the input unchanged:

$$
f(z) = \max(0, z)
$$

This means that the neuron is activated when $z$ is positive and not activated otherwise.
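
In code, this activation is a one-liner. The sketch below applies it to a few arbitrary values of $z$:

```python
def relu(z):
    """Rectified Linear Unit: f(z) = max(0, z)."""
    return max(0.0, z)

# Negative inputs are clipped to zero; positive inputs pass through unchanged
for z in [-2.0, -0.5, 0.0, 0.5, 3.0]:
    print(z, relu(z))
```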

Another popular activation function is the softmax function you know from logistic regression - for two classes it reduces to the sigmoid. It returns values between 0 and 1 as desired for assigning probabilities of falling into any of the given classes ([more information here](https://en.wikipedia.org/wiki/Softmax_function)). There's a wealth of information on different types of activation functions within [this article](https://en.wikipedia.org/wiki/Activation_function) - different activation functions, hidden layers, and neurons per layer can change how effective your neural network will be!
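
The sketch below, written in plain Python with arbitrary example scores, shows both properties mentioned above: the softmax outputs are probabilities that sum to 1, and with two classes the first probability equals the sigmoid of the difference between the two scores:

```python
import math

def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function, as used in logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))

three_class = softmax([2.0, 1.0, 0.1])   # arbitrary scores for three classes
two_class = softmax([1.5, 0.5])          # arbitrary scores for two classes

print(three_class, sum(three_class))
print(two_class[0], sigmoid(1.5 - 0.5))
```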

Neural networks are not magic, and a few more settings (the hyperparameters) have to be chosen:

- **Epochs:** The number of full passes through the training data during fitting. There's no upper limit, but generally there will be a point where additional epochs no longer improve the model
- **Batch size:** Neural networks tend to work best when you feed in a portion of your data at a time (rather than the full set) and adjust the weights after each portion. Smaller batches allow for more frequent updates, but each update is noisier because it is based on fewer data points
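
To see how these two settings interact, the following sketch counts how many weight updates take place during training. The dataset size, batch size, and number of epochs are hypothetical, and `iterate_batches` is a helper written for this illustration only:

```python
import math

def iterate_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs covering the data one batch at a time."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples = 10_000   # hypothetical training-set size
batch_size = 256     # hypothetical batch size
epochs = 5           # hypothetical number of epochs

batches = list(iterate_batches(n_samples, batch_size))
updates_per_epoch = math.ceil(n_samples / batch_size)   # one weight update per batch
total_updates = updates_per_epoch * epochs

print(len(batches), batches[-1], total_updates)
```

Note that the last batch is smaller than the others whenever the batch size does not divide the number of samples evenly.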


You can also experiment with a small neural network interactively in the <a href="https://playground.tensorflow.org">TensorFlow Playground</a>.

### Exercise 4:

Which of the following are suitable applications for a neural network?

A. Classifying images of handwritten numerals

B. Analyzing a Don Quixote poem for themes

C. Deciding how to caption previously unlabeled photographs

D. Sorting an array

**Answer.**

------------

### Prepare the data

In [None]:
cols = ['year', 'price']
df_clean = pd.concat([df[cols], pd.get_dummies(df['condition']), pd.get_dummies(df['drive'])], axis=1)
df_clean.head()

In [None]:
df_clean.isna().sum()

In [None]:
df_clean.dropna(inplace=True)

Let's split our data into training and testing data so we can properly evaluate how our model does later on:

In [None]:
y = df_clean.price
X = df_clean.drop('price', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Let's first fit the linear model again, this time on standardized data.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_std = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)

In [None]:
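# Fit on the first 1000 standardized training rows
lr = LinearRegression().fit(X_train_std.head(1000), y_train.head(1000))
print(pd.DataFrame(zip(X_train_std.columns, lr.coef_)))
print('\nintercept:', lr.intercept_)

# The test set must be scaled with the scaler that was fitted on the training data
X_test_std = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
print('r2:', lr.score(X_test_std, y_test))
pred = lr.predict(X_test_std)

import math
print('rmse', math.sqrt(mse(y_test, pred)))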
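
Standardization rescales each column to zero mean and unit standard deviation, and the test data should be rescaled with the statistics learned from the training data. Here is a minimal sketch of that idea with made-up numbers and two hypothetical helpers, `fit_scaler` and `transform` (plain Python, not scikit-learn's implementation):

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in values]

train_years = [2005, 2010, 2015, 2020]   # made-up training values
test_years = [2000, 2018]                # made-up test values

mean, std = fit_scaler(train_years)            # learned from the training data only
train_std = transform(train_years, mean, std)
test_std = transform(test_years, mean, std)    # reuse the training statistics

print(mean, round(std, 3), [round(v, 3) for v in test_std])
```

Reusing the training statistics keeps the two sets on the same scale, so a model fitted on one can be evaluated on the other.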

### A simple first model

Let's now build our first NN model. Before we begin, here's a recap on the different components of an NN we will be working on:

1. **Input layer:** This is the layer that governs how the data inputs will be structured (i.e. the number of nodes equals the number of input features you are using).
2. **Hidden layers:** These are the internal layers of the neural network that make calculations by weighting the input data in various ways, and which ultimately determine the output. We have to determine the number of nodes for these layers as well as how many we will be using.
3. **Output layer:** The layer governing the structure of your output, which will differ from situation to situation. For example, in binary classification, you can have 1 node as your output and if it returns > 0.5, then it will be class A; otherwise, class B. 

We will keep all three components above as simple as possible for now (e.g. 1 output node and as few layers as possible). The following three components are also essential to an NN model:

4. **Activation function:** The formula between layers that determines the activation of neurons. For our exercises we will just use ReLU, which is extremely common.
5. **Loss function:** This determines how we are measuring the accuracy or performance of our model. We will use mean squared error (MSE) here, which we are familiar with from ordinary least squares linear regression.
6. **Optimizer:** This is the algorithm that will determine how we minimize our loss function. We will use the `adam` optimizer for this case - don't worry about what this is, you only need to know that there is a component called "optimizer" for now.
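
As a reminder of what the loss in item 5 measures, here is a small sketch that computes the MSE and its square root (the RMSE) by hand. The prices and predictions are made-up numbers, not results from our dataset:

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

actual_prices = [10_000, 15_000, 20_000]      # made-up prices in dollars
predicted_prices = [11_000, 14_000, 23_000]   # made-up model outputs

mse_value = mean_squared_error(actual_prices, predicted_prices)
rmse_value = math.sqrt(mse_value)   # back in dollars, so easier to interpret

print(round(mse_value), round(rmse_value, 2))
```

Because the errors are squared, the single prediction that is off by 3,000 dollars contributes far more to the loss than the two that are off by 1,000 dollars.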

Everything above is determined as you construct your model. Now, let's discuss parameters you will be dealing with as you are training your model:

7. **Epochs:** This is the number of times we run our model through the fitting process on the data. There is no exact science on choosing the epochs at first glance. We will simply use 100 for this case and you can adjust this in your free time and see the differences in training time and performance results. We are NOT saying that 100 is the default, nor are we saying that 100 is optimal.
8. **Verbosity:** This is the amount of information you want to see printed on the screen as the model is training. We will use 1 as the default value for this case.
9. **Validation split:** This is similar to a train-test split. You enter a number between 0 and 1, which is the fraction of the training data that is held out and NOT used for training. We will use 0.2, which means 20% of the data is set aside so that we can check after each epoch whether the model is overfitting.
10. **Batch size:** This is the number of data points that are being fed into the model for training at any point in time. We will use 1056 for this case. Just like epochs, there is no "proper" default number. It's a balancing act of choosing a number that is not too high (because it will be too slow to converge to a solution) and not too low (because although it will have a higher chance of converging to an optimal solution, it will take longer to train overall). 

We understand that there are many components to a neural network and it's difficult to fully grasp them all right now; for now, just accept the default values we give and don't worry too much about why these values were chosen. Once you are comfortable with the overall implementation of an NN, you should dive into each component thoroughly and learn them in depth.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

### Nonlinearities

Sometimes, we may have a dataset with **nonlinearities**. In this case, by definition a linear model will not be able to capture the underlying relationships. A neural network, on the other hand, is composed of a variety of linear functions and activation functions, which allows it to fit more complex nonlinear relationships.

In the real world, connections between disparate variables can be extremely unclear. Even so, linear models are often reasonably good at approximating relationships in most real-world datasets. However, there are times when neural networks can really demonstrate their power.

Let's go ahead and build our model. We'll use a simple **Sequential** constructor from Keras. Take a look at the description of how to use Keras Sequential models here: https://keras.io/getting-started/sequential-model-guide/. The following code constructs a simple Keras Sequential instance with one hidden layer and one output layer. We need to specify the **activation function** and the **input shape** in the first layer:

In [None]:
neurons = 128
model = keras.Sequential([layers.Dense(neurons, activation='relu', input_shape=[len(X_train_std.columns),]), # Hidden layer; input_shape tells Keras how many features to expect
                          layers.Dense(1)]) # Output layer
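
It can help to see how many weights this small network has to learn. In a fully connected layer, each node has one weight per input plus one intercept. The sketch below counts these by hand. The number of input features is an assumed value for illustration (the real value is `len(X_train_std.columns)`), and `dense_param_count` is a hypothetical helper:

```python
def dense_param_count(n_inputs, n_units):
    """One weight per input per unit, plus one intercept per unit."""
    return n_inputs * n_units + n_units

n_features = 10   # assumed number of input columns, for illustration only
neurons = 128

hidden_params = dense_param_count(n_features, neurons)   # hidden layer
output_params = dense_param_count(neurons, 1)            # single output node
print(hidden_params, output_params, hidden_params + output_params)
```

Even this minimal network has far more parameters than the linear regression fitted earlier, which has one coefficient per feature plus an intercept.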

In addition to the layers of our model, we can specify up to three more settings:

1. The **loss function** - this function defines how "wrong" our final answer is.
2. The **optimizer** - this is the algorithm that minimizes our loss function by fine-tuning the weights in our neural network.
3. An optional list of metrics that measure the performance of our model - you can read about the metrics that Keras uses here: https://keras.io/metrics/. We do not pass any metrics in this case, so Keras reports only the loss.

These specifications go in the model's compile function, with the following parameters:

In [None]:
model.compile(loss='mse', # This uses Mean-Squared Error (https://en.wikipedia.org/wiki/Mean_squared_error)
              optimizer = 'adam', # The algorithm that adjusts the weights to minimize the loss
             )

For a neural network, training the model consists of tuning each of the weights between every node by minimizing the difference between the model's predicted value and the actual value in the training dataset. After this, the model is run on the test set to see how well it generalizes.

We're now finally ready to train the model! This uses the Keras `fit` function, which takes the following parameters:

1. Input data
2. Input labels
3. Number of epochs
4. Verbosity - 0, 1, or 2, depending on how much progress information you want the model to print while training
5. Validation split - A fraction of the training data to hold out from training, so that the model can be checked on data it has not seen and overfitting can be detected
6. Batch size - How many data points to train on at once. A higher batch size speeds up training up to a point, after which the algorithm hits diminishing returns from overhead. A higher batch size may also reduce accuracy.
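
The validation split in item 5 can be pictured as simple slicing. To our understanding, Keras holds out the last fraction of the rows you pass to `fit` (before any shuffling). The helper `holdout_split` below is a hypothetical illustration of that behavior, not Keras's own code:

```python
def holdout_split(samples, validation_split):
    """Hold out the last fraction of the samples for validation."""
    split_at = int(len(samples) * (1 - validation_split))
    return samples[:split_at], samples[split_at:]

samples = list(range(10))   # stand-in for 10 training rows
train_part, val_part = holdout_split(samples, validation_split=0.2)

print(train_part)
print(val_part)
```

Only the first part is used to update the weights; the loss on the second part is reported after each epoch as a check on overfitting.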

Here's an example of fitting the data with some sample parameters:

In [None]:
history = model.fit(X_train_std, y_train, epochs=100, validation_split = 0.2, verbose=1, batch_size=1056)

In [None]:
X_test_std = scaler.transform(X_test)
math.sqrt(model.evaluate(X_test_std, y_test, batch_size=1056))

Now that we've seen a basic model, let's try making this model deeper by adding more hidden layers. To add more layers, we just use the `add()` method, which takes a layer instance as input and appends it to the model.

### Exercise 5:

Use Keras to create a deep model with three layers to fit on our training data, then run it against our test data to see how well it fits. Compare this model to a linear regression which uses all features on our new normalized dataset.

**Answer.**

------------

## The dark side of neural networks

Around this time, you may have a natural thought: "Does this mean that if we have a single supercomputer and a neural network algorithm, we can solve *every problem in the world???*" It's a pretty natural thought, because these neural networks truly seem infinitely flexible without having to put in careful feature engineering work. It would seem that they run contrary to all the scientific principles that we've been touting throughout this entire course.

As you might suspect, this isn't true. But why? Let's illustrate this with the following exercise:

### Exercise 6:

Code and begin to fit a neural network with 10 layers.

**Answer.**

------------

Let's check how the model does on data that it hasn't seen before:

In [None]:
math.sqrt(stupid_model.evaluate(X_test_std, y_test, batch_size=1056))

What happened? The model trained well on the training data (albeit after a long time and with inefficient memory usage), but utterly failed on the testing data. Essentially, the neural network is suffering from overfitting.

Of course, we can get overfitting problems with simpler models like linear and logistic regression. So why is it so particularly bad when it occurs with neural networks?

### The Bias Variance Tradeoff

"it depends"

<img src="biasvariance.png" width="800" height="500" />


All models just make a choice of how to draw a decision boundary to fit the latent data structure in an $n$-dimensional space. Each model has inherent advantages and disadvantages in terms of this choice, which affects the shape of these decision boundaries, and techniques such as regularization are just a means of balancing them. All we need to do is find the right model for the business use case, one that balances bias vs. variance and complexity vs. interpretability.

<img src="multimodels.png" width="800" height="500" />


### Exercise 7:

Which of the following is a potential drawback of solving this problem with a neural network instead of linear regression? Select all that apply.

I. Lack of transparency in the algorithm

II. Difficulty of use

III. Speed of runtime

**Answer.**

------------

## Conclusion

In this case, we saw some of the shortcomings of a traditional linear model in performing predictions on a dataset with many complex non-linearities. Using a simple neural network, we're able to predict within 50% of the fair price of a used car. This isn't stellar accuracy, but is better than the 71% interval that we can predict with a similarly naive linear model. Nonetheless, we learned that neural networks have a dark side – when they are not as accurate as we would like them to be, they are very difficult to diagnose and improve. On the other hand, linear models remain quite interpretable even as we make them more complex.

## Takeaways

We've learned a few important facts about neural networks:

1. Neural networks can be used to optimize quantitative transformations of data into meaningful information
2. In some situations, a neural network may be more suitable and more efficient than manual feature engineering
3. The biggest drawback of neural networks is their lack of interpretability and difficulty of improvement, while the biggest strength is their ease of use and flexibility