## Study: Introduction to Artificial Neural Networks

Regression with a multilayer perceptron will be illustrated in three Jupyter Notebooks on the Boston house prices dataset, using exclusively TensorFlow / Keras (no scikit-learn).


### Data preparation

### 1 - Importing the Python libraries needed to solve the problem

In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# from sklearn.datasets import load_boston  # not needed here; load_boston was removed in scikit-learn 1.2
import tensorflow as tf
from keras.datasets import boston_housing
import joblib

### 2 - Display settings

In [24]:
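# Use the full option names: short patterns can match several options and raise an error
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 500)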

### 3 - Loading the Boston Housing dataset

I load the data with TensorFlow / Keras.

In [25]:
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

In [26]:
print(f'Training data : {train_data.shape}')
print(f'Test data : {test_data.shape}')
print(f'Training sample : {train_data[0]}')
print(f'Training target sample : {train_targets[0]}')

Training data : (404, 13)
Test data : (102, 13)
Training sample : [  1.23247   0.        8.14      0.        0.538     6.142    91.7
   3.9769    4.      307.       21.      396.9      18.72   ]
Training target sample : 15.2


In [27]:
# Load the data with sklearn
#boston = load_boston()

### 4 - Displaying the Boston DataFrame

In [28]:
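# `features` has to be defined before it is used; the 13 feature names of the dataset are listed here.
# Note: `feature_columns` is not used in the rest of this notebook.
features = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
            'TAX', 'PTRATIO', 'B', 'LSTAT']
feature_columns = [tf.feature_column.numeric_column(key) for key in features]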

In [6]:
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
                'TAX', 'PTRATIO', 'B', 'LSTAT']
train_df = pd.DataFrame(data = train_data, columns = column_names)

In [7]:
train_df.head()

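|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.23247 | 0.0 | 8.14 | 0.0 | 0.538 | 6.142 | 91.7 | 3.9769 | 4.0 | 307.0 | 21.0 | 396.9 | 18.72 |
| 1 | 0.02177 | 82.5 | 2.03 | 0.0 | 0.415 | 7.61 | 15.7 | 6.27 | 2.0 | 348.0 | 14.7 | 395.38 | 3.11 |
| 2 | 4.89822 | 0.0 | 18.1 | 0.0 | 0.631 | 4.97 | 100.0 | 1.3325 | 24.0 | 666.0 | 20.2 | 375.52 | 3.26 |
| 3 | 0.03961 | 0.0 | 5.19 | 0.0 | 0.515 | 6.037 | 34.5 | 5.9853 | 5.0 | 224.0 | 20.2 | 396.9 | 8.01 |
| 4 | 3.69311 | 0.0 | 18.1 | 0.0 | 0.713 | 6.376 | 88.4 | 2.5671 | 24.0 | 666.0 | 20.2 | 391.43 | 14.65 |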


In [8]:
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
                'TAX', 'PTRATIO', 'B', 'LSTAT']
test_df = pd.DataFrame(data = test_data, columns = column_names)

### Data cleansing

In [9]:
train_df_cp = train_df.copy()
test_df_cp = test_df.copy()

In [10]:
train_df_cp.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64

In [11]:
train_stats = train_df_cp.describe()
train_stats_t = train_stats.T
train_stats_t

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CRIM | 404.0 | 3.745111 | 9.240734 | 0.00632 | 0.081437 | 0.26888 | 3.674808 | 88.9762 |
| ZN | 404.0 | 11.480198 | 23.767711 | 0.0 | 0.0 | 0.0 | 12.5 | 100.0 |
| INDUS | 404.0 | 11.104431 | 6.811308 | 0.46 | 5.13 | 9.69 | 18.1 | 27.74 |
| CHAS | 404.0 | 0.061881 | 0.241238 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| NOX | 404.0 | 0.557356 | 0.117293 | 0.385 | 0.453 | 0.538 | 0.631 | 0.871 |
| RM | 404.0 | 6.267082 | 0.709788 | 3.561 | 5.87475 | 6.1985 | 6.609 | 8.725 |
| AGE | 404.0 | 69.010644 | 27.940665 | 2.9 | 45.475 | 78.5 | 94.1 | 100.0 |
| DIS | 404.0 | 3.740271 | 2.030215 | 1.1296 | 2.0771 | 3.1423 | 5.118 | 10.7103 |
| RAD | 404.0 | 9.440594 | 8.69836 | 1.0 | 4.0 | 5.0 | 24.0 | 24.0 |
| TAX | 404.0 | 405.898515 | 166.374543 | 188.0 | 279.0 | 330.0 | 666.0 | 711.0 |


In [12]:
test_stats = test_df_cp.describe()
test_stats_t = test_stats.T

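### Feature scaling

Standardization of the training and test sets

Usually, when people simply say "normalization", they mean Min-Max normalization. Normalization in this sense divides the deviation from the minimum (that is, the value obtained when the minimum is taken as the origin 0) by the range of the data (maximum minus minimum). The minimum of the data is thereby mapped to 0 and the maximum to 1.

Z-score normalization is generally called standardization. Standardization divides the deviation from the mean (that is, the centered value, obtained when the mean is taken as the origin 0) by the standard deviation. The data are thereby transformed to mean 0 and variance 1 (since the square root of a variance of 1 is still 1, the standard deviation is also 1).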
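
The two rescalings described above can be sketched with NumPy on a small made-up array. The helper names `min_max_normalize` and `z_score_standardize` and the `toy` data are assumptions introduced for illustration only; they are not part of this notebook's pipeline.

```python
import numpy as np

def min_max_normalize(x):
    """Rescale each column to [0, 1]: (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    # A constant column would give max - min = 0; not handled in this sketch
    return (x - x_min) / (x_max - x_min)

def z_score_standardize(x, ddof=0):
    """Center each column on 0 and scale it to unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=ddof)

# Toy data invented for this example: 4 samples, 2 features
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0]])

toy_minmax = min_max_normalize(toy)
toy_zscore = z_score_standardize(toy)

print(toy_minmax.min(axis=0), toy_minmax.max(axis=0))  # column-wise min and max
print(toy_zscore.mean(axis=0), toy_zscore.std(axis=0))  # column-wise mean and std
```

After Min-Max normalization each column spans 0 to 1; after standardization each column has mean 0 and standard deviation 1, up to floating-point error.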

In [21]:
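def stand(x):
    # Standardize with the training-set mean and std;
    # the Series index (feature names) aligns with the DataFrame columns
    return (x - train_stats_t['mean']) / train_stats_t['std']

# The test set is scaled with the training statistics, not its own
stand_train_data = stand(train_df_cp)
stand_test_data = stand(test_df_cp)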
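
The cell above scales both sets with the statistics of the training set. The sketch below reproduces that pattern on synthetic data, so that it runs without downloading the dataset. The names `toy_train`, `toy_test`, `stats` and `standardize` are hypothetical, and the random data are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption: 3 features named a, b, c)
rng = np.random.default_rng(0)
cols = ["a", "b", "c"]
toy_train = pd.DataFrame(rng.normal(5.0, 2.0, size=(200, 3)), columns=cols)
toy_test = pd.DataFrame(rng.normal(5.0, 2.0, size=(50, 3)), columns=cols)

# One row per feature, with columns count, mean, std, min, ...
stats = toy_train.describe().T

def standardize(df, stats):
    # DataFrame - Series aligns the Series index with the DataFrame columns
    return (df - stats["mean"]) / stats["std"]

toy_train_std = standardize(toy_train, stats)
toy_test_std = standardize(toy_test, stats)

print(toy_train_std.mean().round(5).tolist())  # training means
print(toy_train_std.std().round(5).tolist())   # training stds
print(toy_test_std.mean().round(5).tolist())   # test means
```

Because `describe()` and `DataFrame.std()` both use the sample standard deviation (`ddof=1`), the standardized training columns have a standard deviation of 1, up to floating-point error. The test columns are generally only close to mean 0 and standard deviation 1, since they are scaled with the training statistics.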

In [15]:
pd.options.display.float_format = '{:.5f}'.format

In [16]:
stand_train_data.describe()

|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 | 404.0 |
| mean | -0.0 | 0.0 | 0.0 | -0.0 | -0.0 | 0.0 | 0.0 | 0.0 | -0.0 | -0.0 | 0.0 | 0.0 | 0.0 |
| std | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| min | -0.4046 | -0.48302 | -1.56276 | -0.25651 | -1.46945 | -3.81252 | -2.36611 | -1.28591 | -0.97037 | -1.30969 | -2.67044 | -3.76643 | -1.51778 |
| 25% | -0.39647 | -0.48302 | -0.87713 | -0.25651 | -0.8897 | -0.55275 | -0.84234 | -0.81921 | -0.62547 | -0.76273 | -0.56853 | 0.21134 | -0.8065 |
| 50% | -0.37619 | -0.48302 | -0.20766 | -0.25651 | -0.16502 | -0.09662 | 0.33963 | -0.29454 | -0.51051 | -0.45619 | 0.28359 | 0.38749 | -0.18551 |
| 75% | -0.00761 | 0.04291 | 1.02705 | -0.25651 | 0.62786 | 0.48172 | 0.89795 | 0.67861 | 1.67381 | 1.56335 | 0.7835 | 0.43963 | 0.59986 |
| max | 9.22341 | 3.72437 | 2.44235 | 3.88876 | 2.67402 | 3.46289 | 1.10911 | 3.43315 | 1.67381 | 1.83382 | 1.60154 | 0.44752 | 3.47771 |


In [17]:
X_train = stand_train_data.to_numpy()
X_test = stand_test_data.to_numpy()
y_train = train_targets
y_test = test_targets

In [18]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((404, 13), (102, 13), (404,), (102,))

### Exporting the data

In [19]:
joblib.dump(X_train, "X_train.joblib")
joblib.dump(X_test, "X_test.joblib")
joblib.dump(y_train, "y_train.joblib")
joblib.dump(y_test, "y_test.joblib")

['y_test.joblib']
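
A later notebook can read these files back with `joblib.load`. The sketch below shows the round trip on a small made-up array written to a temporary directory, so that it does not touch the files exported above. The names `toy_X` and `toy_X.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np

# Small array invented for the example
toy_X = np.arange(12, dtype=float).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_X.joblib")
    written = joblib.dump(toy_X, path)  # returns the list of file names written
    reloaded = joblib.load(path)        # reads the array back into memory

print(np.array_equal(toy_X, reloaded))
print(reloaded.shape)
```

Applied to this notebook, the same call would be, for example, `X_train = joblib.load("X_train.joblib")`, assuming the file sits in the working directory.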