<img src="./image/hsfrsl.jpg"/>

# Introduction #

In this workshop, we will try to predict the price of houses in the city of Boston and its surroundings.\
To do this, the workshop will be divided into several parts:
   - Data processing
   - Creation of the model
   - Interpretation of the results.
        
To do this, we will use the sklearn library: [sklearn](https://scikit-learn.org)


## Import ##

- `pandas`: data manipulation and analysis
- `numpy`: support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
- `matplotlib`: visualization of data in the form of graphs
- `seaborn`: statistical data visualization built on top of matplotlib, with close integration with pandas data structures

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Data processing 1.1 #

## dataset ##

Sklearn provides functions to load built-in datasets.\
You can see the list of available datasets here: [datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html)

The dataset we are interested in is called: Boston house prices dataset.\
To load it, import the function that loads this dataset from `sklearn.datasets`,\
call it with the parameter `return_X_y=False`,\
and name the resulting variable `boston_dataset`.

load_boston : [load_boston](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html)\
Don't forget to import the function with : `from sklearn.datasets import load_boston`
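
As a rough illustration of what such a loader returns, the sketch below uses `load_diabetes`, another bundled toy dataset, as a stand-in. This is an assumption chosen for portability: `load_boston` was removed from scikit-learn in version 1.2, so it may not be importable in a recent environment. With `return_X_y=False`, the loader returns a dictionary-like object exposing `data`, `target`, `feature_names` and `DESCR`. The name `toy_dataset` is hypothetical.

```python
from sklearn.datasets import load_diabetes

# Stand-in dataset (assumption): the toy loaders all return the same kind of object
toy_dataset = load_diabetes(return_X_y=False)

print(toy_dataset.data.shape)      # feature matrix: (n_samples, n_features)
print(toy_dataset.target.shape)    # one target value per sample
print(toy_dataset.feature_names)   # names of the feature columns
print(toy_dataset.DESCR[:200])     # beginning of the text description
```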

In [None]:
# load the Boston dataset (~ 2 lines)

# start of code


#end of code

### description of the dataset ###

You must now print the description of the dataset using the `DESCR` attribute.\
You can access an attribute of an object as follows: `object_name.attribute`


In [None]:
# print the description of your dataset (~ 1 line)

#start of code

#end of code

## dataframe ##

Now create a dataframe using pandas.\
Name the variable `boston`.
- data = data from the dataset
- columns = features names of the dataset

What is a dataframe : [dataframe](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe)\
Usage : [pd.DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
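
A minimal sketch of the `data` / `columns` pattern on a tiny hand-made array. The values and the names `toy_data`, `toy_feature_names` and `toy_df` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 samples, 2 features
toy_data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_feature_names = ["rooms", "age"]

# data gives the cell values, columns gives one name per column
toy_df = pd.DataFrame(data=toy_data, columns=toy_feature_names)
print(toy_df)
```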

In [None]:
#Vous devez créer un dataframe (~ 1 ligne)

#début du code

#fin du code

### Add house prices to your dataframe ###

You need to add the house price to your dataframe using the target attribute of your dataset.\
Display the head of your dataframe : [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)
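
Adding a column and displaying the first rows can be sketched as follows on a small hypothetical dataframe (`toy_df` and `toy_target` are illustrative names, not part of the workshop data).

```python
import pandas as pd

toy_df = pd.DataFrame({"rooms": [4, 5, 6, 7, 8, 9], "age": [30, 25, 20, 15, 10, 5]})
toy_target = [10.0, 12.5, 15.0, 17.5, 20.0, 22.5]  # hypothetical target values

# Assigning to a new column name adds the column; the lengths must match
toy_df["PRICE"] = toy_target

# head() returns the first 5 rows by default
print(toy_df.head())
```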

In [None]:
#You must add a column to your dataframe boston['PRICE'] = ... (~ 1 line)
from IPython.display import display

#start of code
boston['PRICE'] = None
#end of code

If all is well, you should see 0 for each feature, which means that the dataframe does not contain any null values.
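
To see what a non-zero count would look like, here is a small sketch on a hypothetical dataframe that deliberately contains one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one missing value in "rooms"
toy_df = pd.DataFrame({"rooms": [4.0, np.nan, 6.0], "age": [30.0, 25.0, 20.0]})

# isnull() marks missing cells with True; sum() counts them per column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)
```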

In [None]:
print(boston.isnull().sum())

# Data processing 1.2 #

## adjust data and visualize ##

The dataset is from the 1970s; one dollar from that time is worth about six today.\
To make the prices consistent with our time, multiply the house prices by six.

How do you multiply all the prices?\
Find out how to use numpy. [np.multiply](https://numpy.org/doc/stable/reference/generated/numpy.multiply.html)

Once the prices are multiplied, you should see a histogram\
showing the number of houses in each price range.
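
A minimal sketch of element-wise multiplication with `np.multiply`, using a hypothetical series of prices. For a scalar factor, `toy_prices * 6` gives the same result.

```python
import numpy as np
import pandas as pd

toy_prices = pd.Series([24.0, 21.6, 34.7])  # hypothetical prices

# np.multiply applies the factor to every element
scaled = np.multiply(toy_prices, 6)
print(scaled)
```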

In [None]:
#You must multiply all the prices by 6 (~ 1 line)

#start of code

#end of code

sns.set(rc={'figure.figsize':(11.7,8.27)}, palette="flare")
sns.histplot(boston["PRICE"], bins=30)
plt.show()


## correlation ##

seaborn allows us to get a heatmap to see the correlation between the different features of the houses.

The correlation coefficient ranges from -1 to 1.\
If the value is close to 1, it means that there is a strong positive correlation between the two variables.\
When it is close to -1, the variables have a strong negative correlation.

Choose the two features that are most strongly correlated with the price of the houses (largest absolute value, whether positive or negative).\
Keep them in mind.
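
Besides reading the heatmap, the ranking can be checked numerically. The sketch below builds a synthetic dataframe (hypothetical columns `a`, `b`, `c` and `target`) and ranks the features by the absolute value of their correlation with the target; it is an illustration, not the workshop's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
toy_df = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "c": rng.normal(size=n),  # unrelated to the target
})
# The target depends positively on "a" and negatively on "b"
toy_df["target"] = 3 * toy_df["a"] - 2 * toy_df["b"] + 0.1 * rng.normal(size=n)

# Correlation of every feature with the target, excluding the target itself
target_corr = toy_df.corr()["target"].drop("target")

# Rank by absolute value so strong negative correlations count too
top_two = target_corr.abs().sort_values(ascending=False).head(2).index.tolist()
print(target_corr.round(2))
print(top_two)
```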

In [None]:
correlation_matrix = boston.corr().round(2)
sns.set(rc={'figure.figsize':(11.7,8.27)}, palette="flare")
sns.heatmap(data=correlation_matrix, annot=True)
plt.show()

# Data processing 1.3 #

## Input & Label ##


Our dataset is now ready and you have in mind the two features you want to use for your model.

In practice it is possible to use more than two features, but for this introduction we will only use two.

We will create two variables `X` and `Y`.\
`X` will represent our input and `Y` our labels.

`Y` is the price of the houses.\
`X` is a dataframe in which we concatenate our two features.\
The two columns of our dataframe `X` keep the names used in the dataset.

To concatenate : [concatenate](https://numpy.org/doc/stable/reference/generated/numpy.c_.html)
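
A minimal sketch of column-wise concatenation with `np.c_`, on a hypothetical dataframe with columns `a`, `b` and `target`.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 5.0, 6.0],
    "target": [7.0, 8.0, 9.0],
})

# np.c_ stacks the two 1-D columns side by side into a (3, 2) array
toy_X = pd.DataFrame(np.c_[toy_df["a"], toy_df["b"]], columns=["a", "b"])
toy_Y = toy_df["target"]
print(toy_X)
```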

In [None]:
#You must create your input X, a dataframe with two columns, and Y containing the house price column
#(~ 2 lines)

#start of code
X = None
Y = None
#end of code

## Split data ##

Now that our data are ready, it is necessary to split our data into a training set and an evaluation set.

Indeed, the training set allows us to improve the model.\
But if we test the model on our training data,\
the result is biased because the model already knows the data.

Thus, we use evaluation data on which the model has not been trained to have valid results.\
We want to have 80% of the data for training and 20% for evaluation.

<img src="./image/split.png"/>

Split data : [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
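
An 80/20 split can be sketched as follows on synthetic arrays. The names are hypothetical, and `random_state=5` is an arbitrary choice that only makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_X = np.arange(200).reshape(100, 2)  # 100 samples, 2 features
toy_Y = np.arange(100)                  # 100 labels

# test_size=0.2 keeps 20% of the samples for evaluation
X_tr, X_te, Y_tr, Y_te = train_test_split(toy_X, toy_Y, test_size=0.2, random_state=5)
print(X_tr.shape, X_te.shape, Y_tr.shape, Y_te.shape)
```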

In [None]:
#Now that your data is ready, split it into a training set and an evaluation set.
#Use the train_test_split function (~ 1 line)

from sklearn.model_selection import train_test_split

#start of code
X_train, X_test, Y_train, Y_test = None
#end of code

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

# Model 1.1 #

## Create the model ##

For our model, we use a linear regression model.

<img src="./image/regression.png"/>

A linear regression finds the affine function that best models the relationship between the features and the prices.

To do this, import the linear regression model from scikit-learn\
and name the variable: `house_price_model`

Create model : [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)\
Don't forget to import the class with: `from sklearn.linear_model import LinearRegression`
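
To make "affine function" concrete: with two features, the model has the form `price ≈ w1 * x1 + w2 * x2 + b`. The sketch below recovers `w1`, `w2` and `b` with NumPy's least-squares solver on synthetic, noiseless data. It illustrates the idea only and is not meant to describe how scikit-learn implements `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 2))
# Known coefficients: w1 = 2, w2 = 3, b = 1 (no noise)
toy_y = 2.0 * toy_X[:, 0] + 3.0 * toy_X[:, 1] + 1.0

# Append a column of ones so that the last weight plays the role of b
A = np.c_[toy_X, np.ones(len(toy_X))]
weights, *_ = np.linalg.lstsq(A, toy_y, rcond=None)
print(weights)
```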


In [None]:
#Import and create a linear regression model from Scikit-Learn (~ 2 lines)

#start of code


#end of code

## Train the model ##

Our model needs training to find the best relationship between our features and house prices.

To do this, we use the `fit` method of the Scikit-learn linear model.\
Use the training data `X_train` and `Y_train`.

Use `fit` method like this : `house_price_model.fit(parameters)`
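
A minimal sketch of `fit` on synthetic data with known coefficients. `toy_model`, `toy_X` and `toy_y` are hypothetical names. After fitting, the learned weights are available in `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 2))
toy_y = 2.0 * toy_X[:, 0] + 3.0 * toy_X[:, 1] + 1.0  # noiseless target

toy_model = LinearRegression()
toy_model.fit(toy_X, toy_y)  # inputs first, labels second

print(toy_model.coef_, toy_model.intercept_)
```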

In [None]:
#Train your model with your training data. (~ 1 line)

#start of code

#end of code

## Evaluate your model ##

Once your model is trained, it is important to evaluate it to observe its performance.

To do this, use the predict method for your model.\
Give it your test data as input.

Store the result in a variable named :  `y_test_predict`

Use `predict` method like this : `house_price_model.predict(parameters)`
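
The sketch below chains split, training and prediction on synthetic data to show what `predict` returns: one value per row of the evaluation input. All names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
toy_X = rng.normal(size=(100, 2))
toy_y = 4.0 * toy_X[:, 0] - 1.0 * toy_X[:, 1] + 0.5  # noiseless target

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)
toy_model = LinearRegression().fit(X_tr, y_tr)

# predict only needs the inputs; the labels y_te are kept aside for evaluation
toy_predictions = toy_model.predict(X_te)
print(toy_predictions.shape)
```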

In [None]:
#Get the predictions for your evaluation data (~ 1 line)

#start of code
y_test_predict = None
#end of code

# Model 1.2 #

## Error ##

To evaluate our model, it is useful to calculate its error.\
For that we use the RMSE, the root mean squared error.\
The lower the error, the better the model.

This may seem very mathematical and complex.\
But it simply takes the difference between each real value and its prediction, squares it, averages the squares over all samples,\
and then applies a square root to get back to the unit of `y`.

Formula : 

<img src="./image/rmse.png"/>

Compute the root mean squared error for our model,\
learn about the mse function of sklearn and the application of the square root to an array.

mse : [mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)\
apply square root on arrays : [numpy sqrt](https://numpy.org/doc/stable/reference/generated/numpy.sqrt.html)
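
A worked example on three hypothetical values, computed once by hand with NumPy and once with `mean_squared_error` followed by `np.sqrt`. The errors are 1, 0 and -2, so the mean squared error is 5/3.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])  # hypothetical real values
y_pred = np.array([2.0, 5.0, 9.0])  # hypothetical predictions

# By hand: square the differences, average them, take the square root
rmse_numpy = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same computation using sklearn for the mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_numpy, rmse_sklearn)
```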

In [None]:
#Compute the Root Mean Squared Error (RMSE) (~ 1 line)

#start of code
rmse = None
#end of code

print(rmse)

## Coefficient of determination ##

Even if the values to be predicted are all of the same order of magnitude, the RMSE can be difficult to interpret.\
Let's imagine that our model is used to predict sales prices.\
If we work with DVDs, an error of 10€ will be important. If we work with cars, an error of 10€ will be very small.

For this reason, one can choose to normalize the sum of squares of the residuals not by the number of points n in the dataset,\
but by a measure of what would be a reasonable error: the sum of the squared distances between each of the values to be predicted and their mean.\
The result is called the relative squared error, or RSE.

<img src="./image/RSE.png"/>

You will often encounter, instead of the RSE, its complement to 1,\
written R² = 1 - RSE. This is the coefficient of determination.

<img src="./image/r2.png"/>

Compute R² for your model using the function provided by scikit-learn\
r2_score : [r2](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)
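
A small worked example on three hypothetical values, computing R² by hand as 1 - RSE and comparing it with `r2_score`. The mean of the real values is 5, so the sum of squared distances to the mean is 8, while the sum of squared residuals is 5.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0])  # hypothetical real values
y_pred = np.array([2.0, 5.0, 9.0])  # hypothetical predictions

ss_res = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared distances to the mean
r2_manual = 1 - ss_res / ss_tot

# r2_score takes the real values first, then the predictions
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```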

In [None]:
#Compute r2 for your model

#start of code
r2 = None
#end of code

print(r2)

# Model 1.3 #

## Verify shape ##

Your model takes an array as input; its number of columns was fixed when you trained the model.\
Normally, the shape should be (m, 2), where `m` is the number of samples in your dataset.

In [None]:
print(X_test.shape)

## create an input ##

To get a prediction, you will create a numpy array of dimension (1, 2).\
The two values must correspond to possible values of the two features of your model.

create array : [np.array](https://numpy.org/doc/stable/reference/generated/numpy.array.html)
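
A short sketch of the difference between a flat array of two values and an array of shape (1, 2), which represents one sample with two features. The numbers are hypothetical.

```python
import numpy as np

# Double brackets: one row (one sample) with two columns (two features)
sample = np.array([[6.5, 12.0]])
print(sample.shape)

# A flat array has shape (2,); reshape(1, -1) turns it into a single row
flat = np.array([6.5, 12.0])
reshaped = flat.reshape(1, -1)
print(flat.shape, reshaped.shape)
```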

In [None]:
#Create an np array containing one value for each feature of your model (~ 1 line)

#start of code
data = None
#end of code

## get prediction ##

Use the `data` variable defined above to get the price prediction based on the features.\
You should get a plausible price if you succeeded in the different steps.

Have fun changing the values of `data` to see different predictions.
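
As an illustration of predicting for a single sample, the sketch below trains a hypothetical `toy_model` on synthetic data following `y = 2 * x1 + 3 * x2 + 1` and asks for one prediction. `predict` returns an array even for a single row, which is why the first element is read with `[0]`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 2))
toy_y = 2.0 * toy_X[:, 0] + 3.0 * toy_X[:, 1] + 1.0  # noiseless target

toy_model = LinearRegression().fit(toy_X, toy_y)

single_sample = np.array([[1.0, 2.0]])  # shape (1, 2)
single_prediction = toy_model.predict(single_sample)

# One input row gives an array with one prediction
print(single_prediction.shape, single_prediction[0])
```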


In [None]:
#Get the price prediction for your house based on the data you chose in the previous step

#start of code
predict = None
#end of code

print("$" + str(int(predict[0] * 1000)))

## Conclusion ##

I hope that this workshop has allowed you to learn more about the implementation of predictive models.\
Artificial intelligence is developing more and more within companies.

We can be sure that in the future we will hear more and more about it.

In [None]:
print("Thank you for participating in this workshop")