# Price Prediction

In the previous notebook ([01_CollectData](01_CollectData.ipynb)), we collected data to build a price prediction model for real estate in Paris.

The work in this notebook is divided into two steps:
* Data analysis
* Building the prediction model

## Filter data

In this section we filter the raw data to keep only the records that are within the scope of this project.<br />
The filtered data is then saved to a dedicated file that we will use to train the model.<br />
Our aim is to make price predictions for:
* Property type: Houses and apartments
* Nature of the transaction: Sales

In [62]:
import pandas as pd

raw_df = pd.read_csv('../data/raw/real_estate_sales.csv', low_memory=False)
print(raw_df.columns)

Index(['Unnamed: 0', 'Code service CH', 'Reference document', '1 Articles CGI',
       '2 Articles CGI', '3 Articles CGI', '4 Articles CGI', '5 Articles CGI',
       'No disposition', 'Date mutation', 'Nature mutation', 'Valeur fonciere',
       'No voie', 'B/T/Q', 'Type de voie', 'Code voie', 'Voie', 'Code postal',
       'Commune', 'Code departement', 'Code commune', 'Prefixe de section',
       'Section', 'No plan', 'No Volume', '1er lot',
       'Surface Carrez du 1er lot', '2eme lot', 'Surface Carrez du 2eme lot',
       '3eme lot', 'Surface Carrez du 3eme lot', '4eme lot',
       'Surface Carrez du 4eme lot', '5eme lot', 'Surface Carrez du 5eme lot',
       'Nombre de lots', 'Code type local', 'Type local', 'Identifiant local',
       'Surface reelle bati', 'Nombre pieces principales', 'Nature culture',
       'Nature culture speciale', 'Surface terrain'],
      dtype='object')


### Property type
According to the [Dataset Description](../references/notice-descriptive-du-fichier-dvf-20210809.pdf), the property type is given by the `Type local` column of the dataset:

In [63]:
raw_df['Type local'].value_counts()

Appartement                                 211558
Dépendance                                  103384
Local industriel. commercial ou assimilé     32905
Maison                                       19284
Name: Type local, dtype: int64

We are only interested in apartments (`Appartement`) and houses (`Maison`), so we filter the data on this criterion:

In [64]:
property_types = ['Appartement', 'Maison']
raw_df = raw_df[raw_df['Type local'].isin(property_types)]
raw_df['Type local'].value_counts()

Appartement    211558
Maison          19284
Name: Type local, dtype: int64

### Transaction type

According to the [Dataset Description](../references/notice-descriptive-du-fichier-dvf-20210809.pdf), the transaction type is given by the `Nature mutation` column of the dataset:

In [65]:
raw_df['Nature mutation'].value_counts()

Vente                                 225235
Vente en l'état futur d'achèvement      3578
Echange                                 1223
Adjudication                             730
Vente terrain à bâtir                     58
Expropriation                             18
Name: Nature mutation, dtype: int64

We are only interested in sales, which correspond to the values `Vente` (sale) and `Vente en l'état futur d'achèvement` (off-plan sale):

In [66]:
transactions_types = ['Vente', "Vente en l'état futur d'achèvement"]
raw_df = raw_df[raw_df['Nature mutation'].isin(transactions_types)]
raw_df['Nature mutation'].value_counts()

Vente                                 225235
Vente en l'état futur d'achèvement      3578
Name: Nature mutation, dtype: int64

### Save data

In [67]:
# index=False avoids writing the row index as an extra unnamed column
raw_df.to_csv('../data/processed/train.csv', mode='w', index=False)
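
By default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column on the next `read_csv` (the raw file loaded above already carries one). The sketch below uses a made-up two-row frame, not the real dataset, to show the difference that `index=False` makes. The frame, its values, and the variable names are illustrative assumptions.

```python
import io
import pandas as pd

# Toy frame standing in for the filtered data (values are made up)
toy_df = pd.DataFrame(
    {'Type local': ['Appartement', 'Maison'],
     'Valeur fonciere': ['250000,00', '480000,00']},
    index=[1, 3],
)

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Dropping the index on save keeps the processed file limited to the dataset's own columns, so repeated save and load cycles do not accumulate extra unnamed columns.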

## Data Wrangling

In [68]:
# Convert the transaction date column from strings to datetime
raw_df['Date mutation'] = pd.to_datetime(raw_df['Date mutation'])
raw_df['Date mutation'].head()

1   2021-02-16
3   2021-04-02
4   2021-02-19
6   2021-01-27
8   2021-01-03
Name: Date mutation, dtype: datetime64[ns]
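
The format of the raw date strings is not shown in this notebook. If they follow day/month/year order (an assumption here, not something confirmed above), values such as `04/02/2021` are ambiguous when the format is left to inference. A minimal sketch of passing an explicit format, using made-up strings:

```python
import pandas as pd

# Hypothetical raw strings, assuming day/month/year order
dates = pd.Series(['16/02/2021', '04/02/2021'])

# An explicit format removes any guessing about which field is the month
parsed = pd.to_datetime(dates, format='%d/%m/%Y')

# Both dates fall in February under the day-first assumption
print(parsed.dt.month.tolist())
```

If the stored strings turn out to be in ISO order (`YYYY-MM-DD`), the default call is already unambiguous and no format argument is needed.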

### Manage null values

In [69]:
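# Count null values per column
null_columns = raw_df.isnull().sum()
# Caution: 225235 is the number of `Vente` rows alone. The filtered frame holds
# 225235 + 3578 = 228813 rows, so a fully null column would show 228813 nulls.
print(null_columns[null_columns == 225235])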

Series([], dtype: int64)
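
A check for fully null columns is more robust when it compares against the current row count instead of a hard-coded number. The sketch below shows this on a made-up three-row frame; the frame and its column names are illustrative assumptions, and the result on the real dataset is not shown here.

```python
import numpy as np
import pandas as pd

# Toy frame: column 'b' is entirely null, 'a' and 'c' are only partly null
toy_df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, np.nan],
    'c': ['x', 'y', None],
})

# Columns whose null count equals the number of rows
null_counts = toy_df.isnull().sum()
fully_null = null_counts[null_counts == len(toy_df)]
print(fully_null)

# Equivalent shortcut that returns only the column names
fully_null_names = toy_df.columns[toy_df.isnull().all()].tolist()
print(fully_null_names)
```

Either form keeps working after further filtering, because the threshold follows the size of the frame.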


### Manage outliers

## Build prediction model

In [70]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from joblib import dump

X = pd.read_csv('../data/processed/train.csv', low_memory=False)

# Temporary test before data wrangling: keep only rows where the features and the target are present
required_columns = ['Code postal', 'Surface reelle bati', 'Nombre pieces principales', 'Valeur fonciere']
X = X.dropna(subset=required_columns)
# 'Valeur fonciere' is stored as text with a decimal comma; convert it to float
y = X['Valeur fonciere'].str.replace(',', '.', regex=False).astype(float)

# Features used by this first model
X = X[['Code postal', 'Surface reelle bati', 'Nombre pieces principales']]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

dump(model, '../models/model_rfr.joblib', compress=True)
print('Model saved')

print('Evaluate predictions...')
y_preds = model.predict(X_valid)
mae = mean_absolute_error(y_valid, y_preds)
print(f'Mean Absolute Error: {mae}')

Model saved
Evaluate predictions...
Mean Absolute Error: 3732272.323827975
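
The target column is read as text with a decimal comma, so it has to become numeric before the error metric can be read in currency units. A minimal sketch of that conversion on made-up values (the series and its contents are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical prices written with a decimal comma
prices = pd.Series(['250000,00', '1234567,50'])

# Swap the decimal comma for a point, then parse as numbers
numeric = pd.to_numeric(prices.str.replace(',', '.', regex=False))

print(numeric.dtype)
print(numeric.tolist())
```

`pd.to_numeric` raises on values it cannot parse, which makes malformed prices visible early instead of letting them reach the model.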
