<img src="./img/hsfrsl.jpg"/>

# House price prediction

### Intro
In this workshop we will see the basics of machine learning and deep learning by trying to predict real estate prices. To do so, we will use two machine learning libraries: [sklearn](https://scikit-learn.org) and [pytorch](https://pytorch.org).

#### What is machine learning?
Machine learning is the study of computer algorithms that improve automatically through experience and by the use of data.\
(TL;DR: A machine learning model is a program that learns more or less by itself from data)

### Import
- `pandas`: data manipulation and analysis
- `numpy`: support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays

In [None]:
import pandas as pd
import numpy as np

np.random.seed(0)

We said that we want to predict real estate prices, so before building any model let's take a look at our data.

All our data is stored in a csv file. We can read it with `pandas.read_csv`, which takes the path to our csv as its argument.

Quick note: `.sample(frac=1)` will shuffle our data in case it's sorted. We don't like sorted data.
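
To see what `.sample(frac=1)` does, here is a minimal sketch on a made-up five-row frame. The `toy` dataframe and its `price` values are invented for illustration and are not taken from `data.csv`. Every row should come back exactly once, in a random order.

```python
import pandas as pd

# Hypothetical toy data, sorted by price on purpose
toy = pd.DataFrame({"price": [100, 200, 300, 400, 500]})

# frac=1 samples 100% of the rows without replacement, i.e. a shuffle
shuffled = toy.sample(frac=1, random_state=0)

print(shuffled)
# Same rows as before, only their order changes
print(sorted(shuffled["price"]) == sorted(toy["price"]))
```

Passing `random_state` is optional here; it only makes the shuffle reproducible.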

In [None]:
table = pd.read_csv("data/data.csv").sample(frac=1)

Ok, well, we read our csv, but how do we show what it contains? That is exactly what you have to figure out.

**Instruction:**
- Show the first five rows of our csv

**Help:**
- [Dataframe.head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)

In [None]:
# Start of your code (1 line)

# End of your code

Take time to see what columns we have.

## Data processing 1.1

We see that we have a lot of columns. Some may be unnecessary, like `date`, and others may be more useful, like `bedrooms` (the number of bedrooms in the house).

To simplify the workshop, we decide to drop some specific columns like `yr_renovated`, `street`, and `statezip`.

**Instruction:**
- Drop `date`, `yr_renovated`, `street`, and `statezip` columns.

**Help:**
- "Tell me [how to drop a column](https://letmegooglethat.com/?q=drop+column+pandas) please :("

In [None]:
# Start of your code (1 line)

# End of your code

I told you that the `date` column is not useful to train our model, but how do we know if a column is important?

For example, let's take a look at the `country` column.
We all agree that the country can influence the price of a house, but if all the houses are in the same country, is it still useful to specify the country? Of course, the answer is no.

**Instruction:**
- Try to count the number of different countries in our data.

**Helps:**
- Cast a column into a `list`: https://stackoverflow.com/questions/23748995/pandas-dataframe-column-to-list
- `set` in python: https://www.programiz.com/python-programming/set
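
As a rough illustration of the `list` + `set` idea on invented data (the `colors` list below is hypothetical and unrelated to our csv): a `set` keeps each value only once, so its length is the number of distinct values.

```python
# Hypothetical column values, already converted to a plain Python list
colors = ["red", "blue", "red", "red", "blue"]

# A set removes duplicates, so its size is the number of distinct values
distinct = set(colors)
print(len(distinct))
```

If that number is 1, every row carries the same value and the column cannot help a model tell two rows apart.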

In [None]:
# Start of your code (1 line)

# End of your code

What a surprise! There is only one country (wink wink), so this column is useless and you can drop it.

**Instruction:**
- Drop the `country` column

In [None]:
# Start of your code (1 line)

# End of your code

## Data processing 1.2

Another problem in data science is extreme values in our data.\
For example, some houses may have extreme prices. To help our model train and generalize, it's preferable to drop them.

But how do we define a min and max range for our data? Well, a good start would be to print the minimum, maximum and median values in our data.

**Instructions:**
- Print the minimum price of a house in our data
- Print the maximum price of a house in our data
- Print the median price of a house in our data
- Drop all houses with a price less than $10$k or higher than $2 000$k

**Helps:**
- [pandas.DataFrame.min](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html)
- [pandas.DataFrame.max](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html)
- [pandas.DataFrame.median](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html)
- [How to drop rows on a conditional expression](https://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression)
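
The last link relies on boolean masks. Here is a minimal sketch of that pattern, assuming a made-up `toy` frame with a hypothetical `weight` column and arbitrary bounds (none of these names come from our dataset):

```python
import pandas as pd

# Hypothetical data with two obviously extreme values
toy = pd.DataFrame({"weight": [1, 55, 60, 72, 900]})

# Each comparison yields one True/False per row; & combines both conditions
mask = (toy["weight"] >= 10) & (toy["weight"] <= 200)

# Indexing with the mask keeps only the rows where it is True
toy = toy[mask]
print(toy["weight"].tolist())
```

The parentheses around each comparison are needed because `&` binds more tightly than `>=` and `<=` in Python.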

In [None]:
# Show the minimum, maximum and median value of the prices in all our data.
# Start of your code (3 lines)



# End of your code

# Multiply by 100 so the value is a real percentage, not a fraction
percentage = 100 * sum([1 if p >= 2_000_000 or p == 0 else 0 for p in table["price"]]) / table.shape[0]
print(f"Percentage of prices equal to 0 or at least 2 000k: {percentage:.2f}%")

# Drop all houses with a price less than 10k or higher than 2 000k
# Start of your code (2 lines)


# End of your code

print("Number of remaining rows:", table.shape[0])

## Data processing 1.3

So, we dropped useless columns and we dropped extreme values. What else?

Well, another issue is columns with little information. Let's take `city` for example: if a city has only a few houses for sale, we do not have enough information to predict its prices well, so we want to drop it.

**Instruction:**
- Drop every city that appears fewer than 10 times
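
One possible way to filter out rare categories, sketched on invented data. The `toy` frame, its `fruit` column and the threshold of 2 are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical data: "kiwi" appears only once
toy = pd.DataFrame({"fruit": ["apple", "apple", "pear", "pear", "pear", "kiwi"]})

# value_counts() gives the number of rows per distinct value
counts = toy["fruit"].value_counts()

# Keep the values seen at least twice (threshold chosen for this toy example)
frequent = counts[counts >= 2].index

# isin() builds a boolean mask: True where the row's value is in `frequent`
toy = toy[toy["fruit"].isin(frequent)]
print(toy["fruit"].tolist())
```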

In [None]:
# Start of your code (~3 lines)




# End of your code

print(table.shape[0])

Let's take a look at our data after dropping all this useless information:

In [None]:
table.head()

## Data processing 1.4

This is the penultimate step before building our model. In machine learning, and especially in deep learning, we prefer to normalize our data between $0$ and $1$ to make training easier.

For example, in an image, all pixels are between $0$ and $255$, so we can divide each pixel by $255$ to bring all the pixels between $0$ and $1$. It's the same here.

**Instructions:**
- Store the maximum price in our data into a variable named `MAX_PRICE`
- Normalize the `price`, `sqft_living`, `sqft_lot`, `sqft_above`, `sqft_basement` and `yr_built` columns between $0$ and $1$.

**Help:**
- We normalize a column by dividing it by its max value
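
A small worked example of this rule in plain Python, on invented numbers. It assumes the values are non-negative (otherwise dividing by the maximum does not land in $[0, 1]$), and it also shows why keeping the maximum around is handy: multiplying by it converts a normalized value back to the original scale.

```python
# Hypothetical non-negative values
values = [50.0, 100.0, 200.0]

max_value = max(values)
normalized = [v / max_value for v in values]
print(normalized)  # every value now lies between 0 and 1

# Multiplying by the stored maximum recovers the original scale
restored = [n * max_value for n in normalized]
print(restored)
```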

In [None]:
# Start of your code (~7 lines)








# End of your code

Let's take a look at our data after normalization:

In [None]:
table.head()

Another issue in data science is non-numerical values. Our model only handles numerical values, so how do we handle values that are strings, like `city`?

We encode them as one-hot vectors!

(Don't worry, we do this step for you, but I strongly recommend that you watch [this video](https://www.youtube.com/watch?v=v_4KWmkwmsU) to understand one-hot encoding)
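
To give an idea of what a one-hot vector looks like, here is a hand-written sketch in plain Python. The city names are invented, and this is only an illustration of the idea, not the code the notebook actually uses.

```python
# Hypothetical city names
cities = ["Paris", "Lyon", "Paris", "Nice"]

# One position per distinct category, in a fixed (sorted) order
categories = sorted(set(cities))

# Each value becomes a vector of 0s with a single 1 at its category's position
one_hot = [[1 if city == c else 0 for c in categories] for city in cities]

for city, vector in zip(cities, one_hot):
    print(city, vector)
```

Each row ends up with as many new numeric columns as there are distinct cities, and exactly one of them is set to 1.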

In [None]:
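def encode_and_bind(original_dataframe, feature_to_encode):
    # One-hot encode the column; dtype=float keeps every column numeric
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]], dtype=float)
    # Append the new indicator columns to the original dataframe
    res = pd.concat([original_dataframe, dummies], axis=1)
    return res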

In [None]:
table = encode_and_bind(table, "city").drop(["city"], axis=1)

Let's take a final look at our data:

In [None]:
table.head()

## Linear Regression 1.1

One fundamental notion you need to understand in machine learning is labels. A label is the target that our model tries to predict. We always remove the label from our dataset and store it separately.
Giving the label to our model would be like giving it the answer (it's cheating).

Another notion you need to understand is the difference between the training set and the testing/validation set.

The training set is used to train our model and follow how its performance evolves, but we would also like to see how our model performs on data it has never seen. This is the role of the test set.

**Instructions:**
- Split our data into two specific sets: `X_train` & `X_test` (`X_train` must have 3k rows)
- Split the labels into other arrays, `y_train` and `y_test`, and remove them from `X_train` and `X_test`

**Help:**
- `my_array[0:1000]` gives you the first 1k rows of your array
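
The same slicing idea on a tiny invented dataset, in plain Python. The `rows` list and the `_toy` names are hypothetical and only serve to show how the four pieces relate to each other.

```python
# Hypothetical dataset: each row is (feature, label)
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]

# First 3 rows for training, the rest for testing
train_rows, test_rows = rows[0:3], rows[3:]

# Separate the features (X) from the labels (y) in each set
X_train_toy = [feature for feature, _ in train_rows]
y_train_toy = [label for _, label in train_rows]
X_test_toy = [feature for feature, _ in test_rows]
y_test_toy = [label for _, label in test_rows]

print(X_train_toy, y_train_toy)
print(X_test_toy, y_test_toy)
```

No row appears in both sets, and the labels never stay inside the features.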

In [None]:
# Start of your code (~4 lines)





# End of your code

### Import

- `sklearn`: It features various classification, regression and clustering algorithms

In [None]:
from sklearn.linear_model import LinearRegression

## Linear Regression 1.2

We now want to create our model. We will use a linear regression, which is already provided in the `sklearn` library.

Of course in our case, it will not be a linear regression in a 2D plane but in 42 dimensions (because we have 42 variables for each prediction).\
Hard to imagine, right?
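
Written out, a linear regression with $n$ input variables predicts a weighted sum of the inputs plus a constant. The symbols $w_i$, $b$ and $\hat{y}$ are notation introduced here for illustration; they are not names used in the notebook.

$$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$

With $n = 1$ this is the familiar line $\hat{y} = w_1 x_1 + b$; in our case $n = 42$. Training means finding the $w_i$ and $b$ that bring $\hat{y}$ as close as possible to the real prices.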

**Instruction:**
- Create a Linear Regression using `sklearn` and train it using `X_train` and `y_train`

**Helps:**
- What a linear regression is: https://www.youtube.com/watch?v=zPG4NjIkCjc
- [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) on `LinearRegression`

In [None]:
# Start of your code (~1 line)

# End of your code

## Linear Regression 1.3

As you saw, `sklearn` does all the work for us, from creating the model to training it. We just have to provide the data.

So, you created and trained your model, but now it's time to find out how it performs!

Display the `score` of our model.\
Quick reminder: the closer it is to $1$, the better.

**Help:**
- [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score) on `score`
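
For a regression model, `score` returns the coefficient of determination $R^2$. As a sketch of what that number measures, here it is computed by hand in plain Python on invented values (`y_true` and `y_pred` are hypothetical):

```python
# Hypothetical true values and model predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

mean_true = sum(y_true) / len(y_true)

# Sum of squared errors of the model
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Sum of squared errors of a baseline that always predicts the mean
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot
print(r2)
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean.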

In [None]:
# Start of your code (~1 line)

# End of your code

Ok, so you printed the coefficient $R^2$. It's close to $1$, so you know it should be good, but let's look at the precision of our model in a more readable way.\
A great way would be to display the average difference between our predictions and our labels.

Let's do that!

**Instruction:**
- Print the average difference between our prediction and our labels using `X_test` and `y_test`

**Helps:**
- You don't care if the difference is negative or positive, you want the `abs`olute value (wink, wink)
- Don't forget to use `MAX_PRICE` to see the real difference in dollars.
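
The arithmetic behind these two hints, sketched in plain Python on invented numbers. The lists and `toy_max_price` are hypothetical stand-ins, not values from our dataset.

```python
# Hypothetical normalized labels and predictions (all between 0 and 1)
labels = [0.20, 0.50, 0.80]
predictions = [0.25, 0.40, 0.80]

# Hypothetical maximum used during normalization
toy_max_price = 1_000_000

# Mean of the absolute differences, still on the 0-1 scale
mean_abs_diff = sum(abs(l - p) for l, p in zip(labels, predictions)) / len(labels)

# Multiply by the maximum to express the error in dollars
print(mean_abs_diff * toy_max_price)
```

Without the absolute value, errors of opposite sign would cancel out and the model would look better than it is.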

In [None]:
# Start of your code (~1 line)

# End of your code

## Deep Learning 1.1

We see that we have an average difference around $105 000$ dollars. It's good, but we could do better by changing our model and using deep learning.

Deep learning is not so hard, but it takes time to fully understand its concepts and how it works, so we will not ask you to find the answers like before, just to read, pay attention and understand the basics.

### Import

- `torch`: open source machine learning library based on the Torch library
- `torch.nn`: neural network layers
- `torch.optim`: optimization algorithms such as `Adam`
- `torch.nn.functional`: useful functions to train our model

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

torch.manual_seed(0)

We will not go too much into the details of how a deep learning model works, but you need to remember a few things:

1. **The forward propagation**: The model takes data and makes predictions with it
2. **The backward propagation**: The model compares its predictions with the labels and modifies itself to improve its predictions
3. **The learning rate**: A factor that limits how much the model changes at each step, which slows down training (But why should I slow down my model's training? Well, that's a good question, you should take a look at the links below)
4. **The optimizer**: The algorithm that updates the model to improve its predictions
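
These four ideas can be sketched without any library on a one-parameter "model". Everything below (the `loss` function, the target value 3, the learning rate of 0.1) is invented for illustration. A real network has many parameters, and PyTorch computes the gradients for us.

```python
# Toy "model": a single parameter w; the best possible value is 3
w = 0.0
learning_rate = 0.1

def loss(w):
    # How wrong the model is: 0 when w == 3
    return (w - 3) ** 2

for step in range(50):
    # Forward pass: evaluate the current error
    current_loss = loss(w)
    # Backward pass: derivative of the loss with respect to w
    gradient = 2 * (w - 3)
    # Optimizer step (plain gradient descent): move against the gradient,
    # scaled by the learning rate
    w = w - learning_rate * gradient

print(w, loss(w))
```

With a learning rate larger than 1 in this toy problem, each step overshoots and `w` drifts further away from 3, which is why we keep it small.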


**Helps:**
- [But what is a Neural Network | Deep learning, chapter 1](https://www.youtube.com/watch?v=aircAruvnKk)
- [Neural Networks Demystified [Part 2: Forward Propagation]](https://www.youtube.com/watch?v=UJwK6jAStmg)
- [What is backpropagation really doing? | Deep learning, chapter 3](https://www.youtube.com/watch?v=Ilg3gGewQ5U&t=5s)
- [Learning Rate in a Neural Network explained](https://www.youtube.com/watch?v=jWT-AX9677k)
- [Optimizers - EXPLAINED!](https://www.youtube.com/watch?v=mdKjMPmcWjY)
- [Layers in a Neural Network explained](https://www.youtube.com/watch?v=FK77zZxaBoI)

<br/><br/>
With that said, let's create our deep learning model. It takes the form of a class inheriting from `nn.Module`. If you're not familiar with classes in Python, you should take a look at [this link](https://docs.python.org/3/tutorial/classes.html).

In the `__init__` method we define our layers, for this step you should use `nn.Sequential`, `nn.Linear` and `nn.Sigmoid`.\
In the `forward` method, we define the forward pass.

**Instructions:**
- In `__init__` create a `self.main` attribute composed of two layers separated by the sigmoid function.
- In `forward` define the forward propagation


**Helps:**
- We have 42 columns
- We want to predict only one value
- [Sequential documentation](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html)
- [Linear documentation](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)
- [Sigmoid documentation](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html)
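
As a reminder of what the sigmoid does to each value passing through it, here is the formula $\sigma(x) = 1 / (1 + e^{-x})$ written in plain Python. The helper name `sigmoid` is ours; in the model you would use `nn.Sigmoid` instead.

```python
import math

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1)
    return 1 / (1 + math.exp(-x))

for x in [-5, 0, 5]:
    print(x, sigmoid(x))
```

Placing a non-linear function like this between two linear layers matters: without it, the two layers would collapse into a single linear transformation.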

In [None]:
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        
        # Start of your code (~5 lines)





        # End of your code
    
    def forward(self, t):
        # Start of your code (~1 line)

        # End of your code

Now that you have created our model class, we can instantiate our model by calling `Model()`.
We can also create our optimizer. We choose `Adam`, a popular optimizer, which takes as arguments all our model's parameters and the learning rate, which we set to $0.05$.

In [None]:
network = Model()
optimizer = optim.Adam(network.parameters(), lr=0.05)

## Deep Learning 1.2

Now that you have a model, it's time to create our training function.

Quick reminder: to measure how wrong its predictions are, our model uses a *cost function*.

You can see that we iterate over each element of our `train_set`. For each one, we ask our model to make a prediction (`network(data.float())`), we compute how wrong the prediction is by comparing it with the labels (`F.mse_loss(predictions.squeeze(1), labels.float())`), and we modify our model to improve its predictions.

Basically, that's it!

**Note:** Don't pay too much attention to why there is a `.float()`, why we do `torch.tensor(labels)`, or why we call `.squeeze(1)`. It's just to make our model able to learn from our data. Of course, if you have any questions, feel free to ask.

**Help:**
- [Part 1: An Introduction To Understanding Cost Functions](https://www.youtube.com/watch?v=euhATa4wgzo)
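
The cost function used in this notebook is the mean squared error. With its default settings, `F.mse_loss` averages the squared differences between predictions and labels. A hand-computed sketch on invented values:

```python
# Hypothetical normalized labels and predictions for one batch
labels = [0.2, 0.5, 0.9]
predictions = [0.3, 0.5, 0.7]

# Square each error so that negative and positive errors both count,
# then average over the batch
mse = sum((p - l) ** 2 for p, l in zip(predictions, labels)) / len(labels)
print(mse)
```

Because the errors are squared, one large mistake weighs more than several small ones.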

In [None]:
def train(network, optimizer, train_set, train_labels):
    episode_loss = 0

    # Put the model in training mode
    network.train()
    for index, data in enumerate(train_set):
        labels = train_labels[index]
        labels = torch.tensor(labels)

        # Forward pass: predict, then measure the error with the cost function
        predictions = network(data.float())
        loss = F.mse_loss(predictions.squeeze(1), labels.float())

        # Backward pass: reset old gradients, compute new ones, update the model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        episode_loss += loss.item()

    # mse_loss already averages over each batch, so we average over the batches
    return episode_loss / len(train_set)


The test function looks almost the same as the train function. The only differences are that we don't do the backpropagation and that we tell PyTorch to put our model in evaluation mode (`network.eval()`).

In [None]:
def test(network, optimizer, test_set, test_labels):
    # `optimizer` is unused here; it is kept so the call matches `train`
    episode_loss = 0

    # Put the model in evaluation mode
    network.eval()
    # No gradients are needed because we do not update the model
    with torch.no_grad():
        for index, data in enumerate(test_set):
            labels = test_labels[index]
            labels = torch.tensor(labels)

            predictions = network(data.float())
            loss = F.mse_loss(predictions.squeeze(1), labels.float())

            episode_loss += loss.item()

    # mse_loss already averages over each batch, so we average over the batches
    return episode_loss / len(test_set)


## Deep Learning 1.3

Another useful notion in deep learning is batches. Instead of updating the model after every single sample, we group samples into batches, which helps our model generalize its predictions. For more details, I invite you to watch the video below:

**Help:**
- [Batch Size in a Neural Network explained](https://www.youtube.com/watch?v=U4WB9p6ODjM)
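
A minimal sketch of the idea in plain Python, on an invented list of 10 samples with a hypothetical batch size of 4. It keeps full batches only, on the assumption that every batch must have the same size to be stacked into a single tensor later.

```python
# Hypothetical dataset of 10 samples
samples = list(range(10))
batch_size = 4

# Cut the data into consecutive groups of `batch_size` samples;
# the last incomplete group (here samples 8 and 9) is left out
batches = [
    samples[start:start + batch_size]
    for start in range(0, len(samples) - batch_size + 1, batch_size)
]
print(batches)
```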

In [None]:
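def create_batch(data, batch_size=8):
    result = []

    # `+ 1` so the last full batch is kept when the size divides evenly;
    # leftover rows that do not fill a whole batch are dropped
    for i in range(batch_size, data.shape[0] + 1, batch_size):
        result.append(data[i - batch_size: i])

    return result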

We then call our function `create_batch` and use a batch size equal to $32$.

In [None]:
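# Stack the batches into one array first: building a tensor
# directly from a list of arrays is slow
X_train = torch.tensor(np.array(create_batch(X_train, batch_size=32)))
y_train = torch.tensor(np.array(create_batch(y_train, batch_size=32)))

X_test = torch.tensor(np.array(create_batch(X_test, batch_size=32)))
y_test = torch.tensor(np.array(create_batch(y_test, batch_size=32)))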

It's time to train and test our model!

As you can see, we just call `train` then `test` successively for a number of epochs.\
But what is an epoch? An epoch is one pass over all our training data.

In [None]:
for e in range(0, 17):
    train_loss = train(network, optimizer, X_train, y_train)
    test_loss = test(network, optimizer, X_test, y_test)

    # Average absolute error on the test set, converted back to dollars
    with torch.no_grad():
        result = network(X_test.float())
    diff = int((result.squeeze(2) - y_test).abs().mean().item() * MAX_PRICE)

    print(f"Epoch {e}\ttrain loss:{train_loss:.5f}\ttest loss:{test_loss:.5f}\tavg diff:{diff}")

Congratulations, you made your first machine learning AND deep learning models!

For those who ask themselves: "*That's it? Am I a data scientist?*"

Well, not quite yet. There is a long road ahead and a lot to learn in data science/machine learning/deep learning, and that's exactly why it's so fascinating to work in AI.

I hope you enjoyed this workshop, and one more time: **Congratulations!**

*More workshops made by PoC: [github.com/PoCInnovation/Workshops](https://github.com/PoCInnovation/Workshops)*