
## Importing and preprocessing the tables

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

In [2]:
data_train = pd.read_csv("../good_data/donnees_train.csv")
data_test = pd.read_csv("../good_data/donnees_test.csv")
data_valid = pd.read_csv("../good_data/donnees_validation.csv")

In [3]:
data_train.drop(["idmutation", "Ind", "sbati_squa", "datemut", "Unnamed: 0"], axis='columns', inplace=True)
data_test.drop(["idmutation", "Ind", "sbati_squa", "datemut", "Unnamed: 0"], axis='columns', inplace=True)
data_valid.drop(["idmutation", "Ind", "sbati_squa", "datemut", "Unnamed: 0"], axis='columns', inplace=True)

In [4]:
def rename_p_t_n(df, dico):
    # Rename columns in place using the mapping `dico`, then return the frame.
    df.rename(columns=dico, inplace=True)
    return df


rename_dico = {"valfoncact2": "valfoncact"}
data_train = rename_p_t_n(data_train, rename_dico)
data_test = rename_p_t_n(data_test, rename_dico)
data_valid = rename_p_t_n(data_valid, rename_dico)

We drop the observations whose property value (`valfoncact`) is 10,000 or less, as well as those whose property value is 3 million or more. This improves the results.

In [5]:
data_train = data_train[data_train["valfoncact"] > 10000]
data_train = data_train[data_train["valfoncact"] < 3000000]
data_test = data_test[data_test["valfoncact"] > 10000]
data_test = data_test[data_test["valfoncact"] < 3000000]
data_valid = data_valid[data_valid["valfoncact"] > 10000]
data_valid = data_valid[data_valid["valfoncact"] < 3000000]
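The same filter is applied to all three tables, so it can be factored into a small helper. The sketch below is one possible way to do that; `filter_price_range` and the toy frame are hypothetical names introduced here for illustration, and the bounds are the ones stated above.

```python
import pandas as pd


def filter_price_range(df, column="valfoncact", low=10_000, high=3_000_000):
    """Keep the rows with low < df[column] < high (both bounds excluded)."""
    mask = (df[column] > low) & (df[column] < high)
    return df[mask]


# Toy data, for illustration only (not taken from the real tables).
toy = pd.DataFrame({
    "valfoncact": [5_000, 10_000, 250_000, 3_000_000, 4_500_000],
    "sbati": [20, 25, 60, 300, 500],
})
filtered = filter_price_range(toy)
print(filtered)
```

Because both comparisons are strict, rows sitting exactly on a bound (10,000 or 3,000,000) are dropped, which matches the behavior of the cell above.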

In [6]:
Y = data_train.pop("valfoncact")

## Variable selection

We use the lasso to select the variables.

In [7]:
sc_X = StandardScaler()
col = list(data_train.columns)
data_train[col] = sc_X.fit_transform(data_train[col])

In [8]:
lasso = LassoCV()
lasso.fit(data_train, Y)
sf_lasso = SelectFromModel(lasso, prefit=True)

In [9]:
selected_variables = data_train.columns[sf_lasso.get_support()]
print(len(selected_variables))
print(selected_variables)

33
Index(['nblot', 'nbpar', 'nblocmut', 'sbati', 'pp', 'Men', 'Men_pauv',
       'Men_1ind', 'Men_prop', 'Men_fmp', 'Ind_snv', 'Log_av45', 'Log_45_70',
       'Log_70_90', 'Log_soc', 'Ind_0_3', 'Ind_4_5', 'Ind_18_24', 'Ind_25_39',
       'Ind_40_54', 'Ind_65_79', 'Ind_inc', 'ind_par_zo', 'nv_par_hab',
       'ind_par__1', 'THEATRE', 'arrondissement', 'AUTRES_ALIM', 'AUTRES_SERV',
       'CULTURE', 'ENS_PRI', 'MEDECIN', 'POLICE'],
      dtype='object')
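To illustrate how this selection works, here is a minimal sketch on synthetic data (the arrays and the names `X_toy`, `y_toy`, `lasso_toy` and `selector` are made up for the example). For a lasso estimator, `SelectFromModel` uses a default threshold of 1e-5 on the absolute value of the coefficients, so the selected variables are essentially those whose lasso coefficient was not shrunk to zero.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 6))
# Only the first two columns drive the target; the other four are pure noise.
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 0.1 * rng.normal(size=200)

# Standardize the features first, as in the notebook, so the penalty
# treats every column on the same scale.
X_scaled = StandardScaler().fit_transform(X_toy)
lasso_toy = LassoCV(cv=5).fit(X_scaled, y_toy)

# prefit=True reuses the already fitted lasso instead of refitting it.
selector = SelectFromModel(lasso_toy, prefit=True)
mask = selector.get_support()

print("coefficients:", np.round(lasso_toy.coef_, 3))
print("selected columns:", np.flatnonzero(mask))
```

The boolean mask returned by `get_support()` has one entry per input column, which is why it can be used directly to index `data_train.columns`.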


## Modeling: SVR with a Gaussian kernel

### Searching for the best parameters for a Gaussian kernel

In [10]:
Y_valid = data_valid.pop("valfoncact")

In [11]:
sc_Y = StandardScaler()
Y_valid = sc_Y.fit_transform(np.asarray(Y_valid).reshape(-1,1))

In [12]:
from sklearn.model_selection import RandomizedSearchCV
# Log-spaced candidates: 20 values of C in [0.001, 200], 30 values of gamma in [1e-5, 2].
parameters = {'kernel': ['rbf'],
              'C': np.logspace(np.log10(0.001), np.log10(200), num=20),
              'gamma': np.logspace(np.log10(0.00001), np.log10(2), num=30)}
svr = SVR()
rand_searcher = RandomizedSearchCV(svr, parameters, n_jobs=-1, n_iter=80, verbose=2, cv=3)
# ravel() passes the target as the 1-D array that SVR expects.
rs = rand_searcher.fit(data_valid[selected_variables], Y_valid.ravel())

Fitting 3 folds for each of 80 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed: 18.4min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed: 120.9min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 181.0min finished


In [13]:
rs.best_estimator_

SVR(C=0.6166174163308118, gamma=1e-05)

The best parameters found are C = 0.6166174163308118 and gamma = 1e-05. Note that gamma = 1e-05 is the smallest value in the search grid.
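To put these values in context, the sketch below rebuilds the search space and draws 80 candidates with `ParameterSampler`, the utility that `RandomizedSearchCV` uses internally. The `random_state` is an assumption added here for reproducibility (the search above did not set one), so the sampled candidates are not the ones that were actually evaluated.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

param_grid = {
    "kernel": ["rbf"],
    "C": np.logspace(np.log10(0.001), np.log10(200), num=20),
    "gamma": np.logspace(np.log10(0.00001), np.log10(2), num=30),
}

# Total number of distinct (kernel, C, gamma) combinations in the grid.
n_combinations = (
    len(param_grid["kernel"]) * len(param_grid["C"]) * len(param_grid["gamma"])
)

# random_state=0 is an assumption for reproducibility, not part of the original search.
candidates = list(ParameterSampler(param_grid, n_iter=80, random_state=0))

print(n_combinations)           # 600
print(len(candidates))          # 80
print(param_grid["C"][10])      # the retained C is the 11th value of the C grid
print(param_grid["gamma"][0])   # the retained gamma is the first value of the gamma grid
```

The search therefore evaluated 80 of the 600 possible combinations, each with 3-fold cross-validation, which gives the 240 fits reported in the log.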

### Training the model

In [14]:
Y = sc_Y.fit_transform(np.asarray(Y).reshape(-1,1))

In [15]:
regressor = SVR(kernel='rbf', C=0.6166174163308118, gamma=1e-05)
# fit() returns the estimator itself, so `svr` and `regressor` are the same fitted model.
svr = regressor.fit(data_train[selected_variables], Y.ravel())



In [16]:
len(regressor.support_)

67605

In [17]:
Yfit = svr.predict(data_train[selected_variables])

In [18]:
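# Map the standardized values back to the original scale; the predictions are
# reshaped to a column because the scaler expects a 2-D array.
Y = sc_Y.inverse_transform(Y)
Yfit = sc_Y.inverse_transform(Yfit.reshape(-1, 1))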

In [19]:
mse = mean_squared_error(Y, Yfit)
rmse = mse**(1/2)

In [20]:
print(rmse)

202317.2387515113
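The RMSE above is expressed in the original units of `valfoncact` because both the target and the predictions were inverse-transformed first. As a minimal sketch of why this matters, the example below uses hypothetical prices and predictions (not values from the data set) to show that the RMSE in original units equals the RMSE on the standardized scale multiplied by the scaler's standard deviation.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Hypothetical prices and predictions, for illustration only.
y_true = np.array([120_000.0, 250_000.0, 410_000.0, 980_000.0]).reshape(-1, 1)
y_pred = np.array([150_000.0, 230_000.0, 450_000.0, 900_000.0]).reshape(-1, 1)

scaler = StandardScaler().fit(y_true)

# RMSE computed on the standardized scale (unitless).
rmse_scaled = mean_squared_error(
    scaler.transform(y_true), scaler.transform(y_pred)
) ** 0.5

# RMSE computed directly in the original units.
rmse_original = mean_squared_error(y_true, y_pred) ** 0.5

print(rmse_original)
print(rmse_scaled * scaler.scale_[0])  # same value, up to floating-point error
```

Standardization subtracts a mean and divides by a standard deviation, so the mean cancels in the residuals and only the scale factor remains.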


### Model performance

In [21]:
Ytest = data_test.pop("valfoncact")

In [22]:
# Note: both scalers are refit on the test set here rather than reusing the training fit.
data_test[col] = sc_X.fit_transform(data_test[col])
Ytest = sc_Y.fit_transform(np.asarray(Ytest).reshape(-1,1))

In [23]:
Yfit_test = svr.predict(data_test[selected_variables])

In [24]:
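# Back to the original scale; predictions are reshaped to the 2-D shape the scaler expects.
Ytest = sc_Y.inverse_transform(Ytest)
Yfit_test = sc_Y.inverse_transform(Yfit_test.reshape(-1, 1))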

In [25]:
mse_test = mean_squared_error(Ytest, Yfit_test)
rmse_test = mse_test**(1/2)

In [26]:
print(rmse_test)

205309.53352240095
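A common alternative to refitting the scalers on the test set is to fit them once on the training data and only call `transform` on the test data, so that the model sees test features on the same scale it was trained on. The sketch below shows that variant on synthetic data; the arrays, the hyperparameters `C=10.0` and `gamma=0.1`, and all variable names are assumptions chosen for the example, and the resulting number says nothing about the RMSE reported above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(300, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
true_w = np.array([40_000.0, -15_000.0, 5_000.0])
y_train = 300_000 + X_train @ true_w + rng.normal(scale=10_000, size=300)
y_test = 300_000 + X_test @ true_w + rng.normal(scale=10_000, size=100)

# Fit both scalers on the training data only.
scaler_X = StandardScaler().fit(X_train)
scaler_y = StandardScaler().fit(y_train.reshape(-1, 1))

# Arbitrary hyperparameters for the toy problem (not the tuned values).
model = SVR(kernel="rbf", C=10.0, gamma=0.1)
model.fit(
    scaler_X.transform(X_train),
    scaler_y.transform(y_train.reshape(-1, 1)).ravel(),
)

# Reuse the training scalers on the test set: transform, never fit.
pred_scaled = model.predict(scaler_X.transform(X_test))
pred = scaler_y.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()

rmse_holdout = np.sqrt(np.mean((y_test - pred) ** 2))
print(rmse_holdout)
```

With this setup the inverse transform applied to the predictions is the same mapping the model was trained against, which keeps training and test errors directly comparable.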
