# Regression

## The Boston Housing Price dataset

Predict the median price of homes in a given Boston suburb in the mid-1970s, given data points about the suburb at the time, such as the crime rate, the local property tax rate, and so on. The dataset you'll use has an interesting difference from the two previous examples. It has relatively few data points: only 506, split between 404 training samples and 102 test samples. And each feature in the input data (for example, the crime rate) has a different scale. For instance, some values are proportions, which take values between 0 and 1; others take values between 1 and 12, others between 0 and 100, and so on.
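
A common response to features that live on different scales is feature-wise standardization: subtract each column's mean and divide by its standard deviation. The source does not show this step, so the sketch below is an assumption about how it might be done. It uses synthetic columns (the `*_demo` names are hypothetical) with the same three ranges mentioned above rather than the real data.

```python
import numpy as np

# Synthetic stand-in for the housing features (NOT the real dataset):
# three columns with deliberately different scales.
rng = np.random.default_rng(0)

def make_demo(n_rows):
    return np.column_stack([
        rng.uniform(0, 1, size=n_rows),    # a proportion in [0, 1]
        rng.uniform(1, 12, size=n_rows),   # a value in [1, 12]
        rng.uniform(0, 100, size=n_rows),  # a value in [0, 100]
    ])

X_train_demo = make_demo(404)
X_test_demo = make_demo(102)

# Compute the statistics on the training set only,
# then reuse them unchanged for the test set.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std

print(X_train_scaled.std(axis=0))
```

The test set is scaled with the training mean and standard deviation so that no information from the test data leaks into preprocessing.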

- To find the best hyperparameters (number of epochs, number of hidden layers, number of neurons, …), we evaluate candidate values with cross-validation.
- The goal is to find the hyperparameters that minimize the cross-validation error.
- Once the hyperparameters have been chosen, a network with these hyperparameters is initialized and trained on the whole training set.
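
The procedure in the list above can be sketched with plain NumPy. To keep the example small and fast, it swaps the neural network for ridge regression, with the penalty `alpha` playing the role of the hyperparameter; the data is synthetic and every name here (`k_fold_mse`, `fit_ridge`, `predict_linear`) is hypothetical. The three steps are the same: score each candidate by k-fold cross-validation, keep the one with the lowest error, and retrain on the whole training set.

```python
import numpy as np

def k_fold_mse(X, y, k, fit, predict):
    """Return the validation MSE averaged over k folds."""
    indices = np.arange(len(X))
    scores = []
    for val_idx in np.array_split(indices, k):
        train_idx = np.setdiff1d(indices, val_idx)
        model = fit(X[train_idx], y[train_idx])
        errors = predict(model, X[val_idx]) - y[val_idx]
        scores.append(np.mean(errors ** 2))
    return float(np.mean(scores))

def fit_ridge(alpha):
    """Build a fit function for ridge regression with penalty alpha."""
    def fit(X, y):
        Xb = np.column_stack([np.ones(len(X)), X])
        penalty = alpha * np.eye(Xb.shape[1])
        penalty[0, 0] = 0.0  # do not penalize the intercept
        return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)
    return fit

def predict_linear(weights, X):
    return np.column_stack([np.ones(len(X)), X]) @ weights

# Synthetic data with the same shape as the training set (404 x 13)
rng = np.random.default_rng(0)
X = rng.normal(size=(404, 13))
y = X @ rng.normal(size=13) + rng.normal(scale=0.5, size=404)

# 1. Evaluate each candidate hyperparameter with 4-fold cross-validation
candidates = [0.01, 1.0, 100.0]
cv_errors = {a: k_fold_mse(X, y, 4, fit_ridge(a), predict_linear)
             for a in candidates}

# 2. Keep the candidate with the lowest cross-validation error
best_alpha = min(cv_errors, key=cv_errors.get)

# 3. Retrain with the chosen hyperparameter on the whole training set
final_weights = fit_ridge(best_alpha)(X, y)
print(cv_errors, best_alpha)
```

The folds here are taken in order without shuffling, which is acceptable for independently generated synthetic rows; real data may need shuffling first.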

In [1]:
from tensorflow.keras.datasets import boston_housing

(X_train, Y_train), (X_test, Y_test) = boston_housing.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/boston_housing.npz


Exception: URL fetch failure on https://storage.googleapis.com/tensorflow/tf-keras-datasets/boston_housing.npz: None -- [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)

In [None]:
X_train.shape

- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per \$10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    \% lower status of the population
- MEDV     median value of owner-occupied homes in \$1000s

MEDV is the response (target) variable; all the others are explanatory variables.

In [None]:
Columns_names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT']

In [None]:
import pandas as pd
# Convert the training data to a DataFrame with pandas
X_train_df = pd.DataFrame(X_train, columns=Columns_names)
# Display the first 5 rows
print(X_train_df.head())
# Put the target in its own DataFrame with a single MEDV (median price) column
Y_train_df = pd.DataFrame(Y_train, columns=['MEDV'])
# Check that there are no missing values
print(X_train_df.isnull().sum())
print(Y_train_df.isnull().sum())

In [None]:
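# Correlation study
import seaborn as sns
# Assumption: donnees_boston_df is the features and the MEDV target side by side
donnees_boston_df = pd.concat([X_train_df, Y_train_df], axis=1)
matrice_corr = donnees_boston_df.corr().round(1)
sns.heatmap(data=matrice_corr, annot=True)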
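
To show what the heatmap encodes, here is a minimal sketch that computes a correlation matrix with NumPy and reads off each feature's correlation with the target. The data is synthetic and the variable names (`strong`, `weak`, `target`) are hypothetical; it is not the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two synthetic features: one drives the target strongly, one barely at all
strong = rng.normal(size=n)
weak = rng.normal(size=n)
target = 3.0 * strong + 0.1 * weak + rng.normal(scale=0.5, size=n)

# Columns are variables, rows are observations
data = np.column_stack([strong, weak, target])
corr = np.corrcoef(data, rowvar=False)

# The last column holds each variable's correlation with the target
target_corr = corr[:-1, -1]
print(target_corr.round(2))
```

In the heatmap, the MEDV row or column plays the role of `target_corr`: features with coefficients close to 1 or -1 are the most linearly related to the price.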

In [None]:
# Train the model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
 
lmodellineaire = LinearRegression()
lmodellineaire.fit(X_train_df, Y_train_df)

In [None]:
# statsmodels must be installed first (for example via Anaconda)
import statsmodels.api as sm

#add constant to predictor variables
x = sm.add_constant(X_train_df)

#fit linear regression model
model = sm.OLS(Y_train_df, x).fit()

#view model summary
print(model.summary())
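
As a hedged illustration of why the constant is added: `sm.add_constant` prepends a column of ones to the predictors, so that the least-squares fit also estimates an intercept. The sketch below reproduces that idea with NumPy on noise-free synthetic data (the `*_demo` names are hypothetical), where the true intercept and slopes are known.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))

# Noise-free target with intercept 3 and slopes 2 and -1
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

# Same idea as add_constant: prepend a column of ones
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Ordinary least squares; the first coefficient is the intercept
coef, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
print(coef)  # intercept 3 and slopes 2, -1, up to floating-point error
```

Without the column of ones, the fitted plane would be forced through the origin and the slopes would be biased whenever the true intercept is non-zero.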

In [None]:
# Evaluation on the training set
import numpy as np
from sklearn.metrics import r2_score
y_train_predict = lmodellineaire.predict(X_train_df)
rmse = np.sqrt(mean_squared_error(Y_train, y_train_predict))
r2 = r2_score(Y_train, y_train_predict)

print('Model performance on the training set')
print('--------------------------------------')
print('The root mean squared error is {}'.format(rmse))
print('The R2 score is {}'.format(r2))
print('\n')

# Evaluation on the test set
# Wrap X_test in a DataFrame so its column names match those used in fit()
X_test_df = pd.DataFrame(X_test, columns=Columns_names)
y_test_predict = lmodellineaire.predict(X_test_df)
rmse = np.sqrt(mean_squared_error(Y_test, y_test_predict))
r2 = r2_score(Y_test, y_test_predict)

print('Model performance on the test set')
print('--------------------------------------')
print('The root mean squared error is {}'.format(rmse))
print('The R2 score is {}'.format(r2))
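
For reference, the two metrics reported above can be computed by hand. The root mean squared error is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a small made-up example; `rmse_and_r2` and the `*_demo` values are hypothetical and not taken from the dataset.

```python
import numpy as np

def rmse_and_r2(y_true, y_pred):
    """Compute RMSE and R2 directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return rmse, 1.0 - ss_res / ss_tot

# Made-up prices in $1000s, for illustration only
y_true_demo = [20.0, 25.0, 30.0, 35.0]
y_pred_demo = [22.0, 24.0, 29.0, 37.0]

rmse_demo, r2_demo = rmse_and_r2(y_true_demo, y_pred_demo)
print(rmse_demo, r2_demo)
```

Because the target is expressed in thousands of dollars, an RMSE of 1.58 on this toy example corresponds to a typical error of about \$1,580.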