# Getting started with scikit-learn

See [the scikit-learn user guide](https://scikit-learn.org/stable/user_guide.html) for a complete description.

## Regression



In [1]:
import sklearn.datasets as data

In [2]:
habitat = data.fetch_california_housing()

In [3]:
print(habitat.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [4]:
X = habitat.data
type(X)

numpy.ndarray

In [5]:
X.shape

(20640, 8)

In [6]:
y = habitat.target
type(y)

numpy.ndarray

In [7]:
y.shape

(20640,)

**Train** each of the following models:
1. linear regression
2. lasso
3. ridge
4. elastic net
5. nearest neighbors regression
6. Gaussian process regression
7. support vector regression
8. random forest regressor
9. multi-layer perceptron

### LinearRegression

In [8]:
from sklearn.linear_model import LinearRegression

In [9]:
lr = LinearRegression()

In [10]:
lr.fit(X, y)

In [11]:
lr.coef_

array([ 4.36693293e-01,  9.43577803e-03, -1.07322041e-01,  6.45065694e-01,
       -3.97638942e-06, -3.78654265e-03, -4.21314378e-01, -4.34513755e-01])

In [12]:
lr.predict(X) - y

array([-0.39435017,  0.39160644,  0.15557094, ..., -0.75174859,
       -0.52789476, -0.37819637])

In [13]:
lr.score(X, y)

0.606232685199805
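For scikit-learn regressors, `score` returns the coefficient of determination R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes it by hand in plain Python. The helper name `r2_manual` and the toy values are illustrative assumptions, not part of the notebook.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot


y_toy = [1.0, 2.0, 3.0, 4.0]

perfect = r2_manual(y_toy, y_toy)                  # exact predictions
constant = r2_manual(y_toy, [2.5] * 4)             # always predicts the mean of y_toy
worse = r2_manual(y_toy, [4.0, 3.0, 2.0, 1.0])     # predictions in reverse order

print(perfect, constant, worse)
```

A score of 1 means exact predictions, 0 means the model does no better than always predicting the mean, and a negative score means it does worse than that baseline.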

### Lasso

In [14]:
from sklearn.linear_model import Lasso

In [15]:
ls = Lasso()

In [16]:
ls.fit(X, y)

In [17]:
ls.coef_

array([ 1.45469232e-01,  5.81496884e-03,  0.00000000e+00, -0.00000000e+00,
       -6.37292607e-06, -0.00000000e+00, -0.00000000e+00, -0.00000000e+00])

In [18]:
ls.score(X, y)

0.28526231449198314
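Several coefficients above are exactly zero because the L1 penalty, weighted by `alpha` (default `1.0`), pushes weak coefficients to zero. The sketch below counts the non-zero coefficients for a few values of `alpha`. It assumes a synthetic dataset from `make_regression` rather than the housing data, and the values of `alpha` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 8 features, only 3 of which carry signal.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

nonzero_counts = {}
for alpha in (0.01, 1.0, 1e6):
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # Count the coefficients that the L1 penalty did not set to zero.
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))

print(nonzero_counts)
```

With a very large `alpha` every coefficient is zero and the model only predicts a constant, which is one way a Lasso score can fall well below that of plain linear regression.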

### Ridge

In [19]:
from sklearn.linear_model import Ridge

In [20]:
ri = Ridge()

In [21]:
ri.fit(X, y)

In [22]:
ri.coef_

array([ 4.36594382e-01,  9.43739513e-03, -1.07132761e-01,  6.44062485e-01,
       -3.97034295e-06, -3.78635869e-03, -4.21299306e-01, -4.34484717e-01])

In [23]:
ri.score(X, y)

0.6062326586911465

### ElasticNet

In [24]:
from sklearn.linear_model import ElasticNet

In [25]:
en = ElasticNet()

In [26]:
en.fit(X, y)

In [27]:
en.coef_

array([ 2.53202643e-01,  1.12982857e-02,  0.00000000e+00, -0.00000000e+00,
        9.63636030e-06, -0.00000000e+00, -0.00000000e+00, -0.00000000e+00])

In [28]:
en.score(X, y)

0.4230627291195209

### KNeighborsRegressor

In [29]:
from sklearn.neighbors import KNeighborsRegressor

In [30]:
knr = KNeighborsRegressor()

In [31]:
knr.fit(X, y)

In [32]:
knr.score(X,y)

0.4711185944964351

### GaussianProcessRegressor

In [33]:
from sklearn.gaussian_process import GaussianProcessRegressor

In [34]:
gpr = GaussianProcessRegressor()

In [36]:
%%time
gpr.fit(X,y)

CPU times: total: 8min 44s
Wall time: 2min 36s


In [37]:
%%time
gpr.score(X,y)

1.0

### SVR

In [38]:
from sklearn.svm import SVR

In [39]:
svr = SVR()

In [40]:
svr.fit(X, y)

In [41]:
svr.score(X, y)

-0.01658668690926901
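A score near zero means this SVR does no better than predicting the mean. One plausible cause, stated here as an assumption rather than a diagnosis of this dataset, is that the default RBF kernel is sensitive to feature scale. The sketch below uses synthetic data with very different column scales and compares a raw `SVR` with one placed after a `StandardScaler` in a pipeline.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 3))

# Synthetic data (assumption): columns with wildly different scales,
# and a target that depends only on the two small-scale columns.
X_demo = Z * np.array([1000.0, 1.0, 0.001])
y_demo = Z[:, 1] + Z[:, 2]

# Same estimator, without and with standardization of the features.
raw_score = SVR().fit(X_demo, y_demo).score(X_demo, y_demo)
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_score = scaled_svr.fit(X_demo, y_demo).score(X_demo, y_demo)

print(f"raw: {raw_score:.3f}  scaled: {scaled_score:.3f}")
```

Because the pipeline exposes the same `fit`/`score` interface as a single estimator, it can replace `SVR()` wherever a model is expected.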

### RandomForestRegressor

In [42]:
from sklearn.ensemble import RandomForestRegressor

In [43]:
rfr = RandomForestRegressor()

In [44]:
rfr.fit(X, y)

In [45]:
rfr.score(X, y)

0.9741388559558823

### MLPRegressor

In [46]:
from sklearn.neural_network import MLPRegressor

In [47]:
mlpr = MLPRegressor()

In [48]:
mlpr.fit(X, y)

In [49]:
mlpr.score(X, y)

0.47427543082805745

### Conclusion

- scikit-learn's API for training, evaluating, and predicting is the same for every model.
- We have not tested the models' ability to generalize at all.
- Without hyperparameter tuning, the more complex models do not necessarily reproduce the training data better.
- Training time varies widely from one model to another.
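The first and last points can be checked together: since every estimator exposes the same `fit` method, one loop can time them all. This is a minimal sketch on a small synthetic dataset (an assumption, chosen to keep the run short). The three models and `n_estimators=20` are arbitrary choices, and the timings depend on the machine.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Small synthetic dataset (assumption) so that the loop runs in seconds.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

candidates = (
    LinearRegression(),
    KNeighborsRegressor(),
    RandomForestRegressor(n_estimators=20, random_state=0),
)

fit_times = {}
for model in candidates:
    start = time.perf_counter()
    model.fit(X_demo, y_demo)  # identical call for every estimator
    fit_times[type(model).__name__] = time.perf_counter() - start

for name, seconds in fit_times.items():
    print(f"{name}: {seconds:.4f} s")
```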

## Methodology

1. `train_test_split` to estimate the ability to generalize
2. cross-validation (`cross_val_score`) to choose the hyperparameters


### Train test

Split the dataset in two, train each model on the first part, and compare the scores on the two parts of the data.
Build a table summarizing the results.

In [50]:
from rich import print
from rich.table import Table

In [55]:
tableau = Table(
    "model",
    "training score",
    "test score",
    title="Model summary"
)

In [56]:
modeles = [
    LinearRegression,
    Lasso,
    Ridge,
    ElasticNet,
    KNeighborsRegressor,
    GaussianProcessRegressor,
    SVR,
    RandomForestRegressor,
    MLPRegressor,
]
    

In [58]:
from sklearn.model_selection import train_test_split

In [59]:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

In [60]:
%%time
for modele in modeles:
    instance = modele()
    instance.fit(X_tr, y_tr)
    tableau.add_row(
        modele.__name__,  # class name only, e.g. "LinearRegression"
        str(instance.score(X_tr, y_tr)),
        str(instance.score(X_te, y_te))
    )

CPU times: total: 9min 39s
Wall time: 2min 49s


In [61]:
print(tableau)

#### Conclusion

- Comparing the training and test scores makes it easy to tell overfitting from underfitting.
- The next step is to tune the hyperparameters so that the complex models stop underfitting.
- WARNING: in practice, the test data is not used to select the best model. It only serves as a safeguard to validate the selected model before deploying it to production.
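The first point can be illustrated with a model that overfits by construction. A 1-nearest-neighbor regressor returns each training point's own target, so its training score is perfect while its test score is lower. The sketch assumes synthetic noisy data; the names `X_a`, `X_b`, `y_a`, `y_b` are illustrative and chosen so as not to clash with the notebook's variables.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic noisy data (assumption), split into two parts.
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, random_state=0)

scores = {}
for k in (1, 10):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_a, y_a)
    # (training score, test score) for this number of neighbors
    scores[k] = (knn.score(X_a, y_a), knn.score(X_b, y_b))

for k, (train_score, test_score) in scores.items():
    print(f"k={k}: train={train_score:.3f}  test={test_score:.3f}")
```

A large gap between the two scores suggests overfitting, while low scores on both parts suggest underfitting.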

### Cross-Validation

Apply the methodology described [here](https://scikit-learn.org/stable/model_selection.html#model-selection) to choose the best model.

In [62]:
from sklearn.model_selection import cross_val_score

In [63]:
lr = LinearRegression()
cross_val_score(lr, X_tr, y_tr, cv=5)

array([0.60066691, 0.61958343, 0.60436533, 0.61034236, 0.60084181])

In [65]:
tableau = Table(
    "model",
    "cross-validation mean",
    "cross-validation standard deviation",
    title="Model summary"
)

In [66]:
%%time
for modele in modeles:
    instance = modele()
    scores = cross_val_score(instance, X_tr, y_tr, cv=5)
    tableau.add_row(str(instance), str(scores.mean()), str(scores.std()))

CPU times: total: 12min 19s
Wall time: 4min 6s


In [67]:
print(tableau)
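The loop above compares models with their default hyperparameters only. To choose hyperparameters by cross-validation, one option is `GridSearchCV`, which runs the same cross-validated scoring for every candidate value. The sketch below is one possible setup, not the notebook's method: it assumes synthetic data, a `Ridge` model, and an arbitrary grid for `alpha`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (assumption); the held-out part is kept aside for a final check only.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, random_state=0)

# Candidate values for the regularization strength (arbitrary grid).
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation on the training part for each candidate.
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X_a, y_a)

best_alpha = search.best_params_["alpha"]
cv_score = search.best_score_        # mean cross-validated R² of the best candidate
test_score = search.score(X_b, y_b)  # final check on data not used for selection

print(best_alpha, cv_score, test_score)
```

The selection uses the training part only, and the held-out part is scored once at the end as the safeguard described earlier.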

## Classification