# I : notebook_model introduction

### I / A : Goal of this notebook.

In this notebook, we try different models using scikit-learn pipelines. Pipelines let us avoid exporting a separate dataframe for each iteration. We apply different transformations to the dataframe (encoding, scaling, ...) to find the best model for our goal: predicting median_house_value.
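
The idea of swapping one step at a time can be sketched with a small factory function. This is only an illustration, not the code used in the cells of this notebook: `build_pipeline`, the toy dataframe and its values are hypothetical and introduced here for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler


def build_pipeline(scaler, regressor, num_columns, cat_columns):
    """Hypothetical helper: same preprocessing layout, interchangeable steps."""
    preprocessing = ColumnTransformer(
        [
            ("num", scaler, num_columns),
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_columns),
        ]
    )
    return Pipeline([("preprocessing", preprocessing), ("regressor", regressor)])


# Toy data standing in for the real dataframe (illustrative values only)
toy_X = pd.DataFrame(
    {
        "median_income": [1.5, 6.9, 1.6, 4.1, 3.0, 2.6],
        "housing_median_age": [6.0, 8.0, 25.0, 21.0, 20.0, 49.0],
        "ocean_proximity": ["INLAND", "<1H OCEAN", "INLAND", "INLAND", "NEAR BAY", "NEAR BAY"],
    }
)
toy_y = pd.Series([72000.0, 274100.0, 58300.0, 117900.0, 93800.0, 103100.0])

num_cols = ["median_income", "housing_median_age"]
cat_cols = ["ocean_proximity"]

# Only the scaler changes between the two candidates; no intermediate dataframe is exported
for scaler in (StandardScaler(), MinMaxScaler()):
    pipe = build_pipeline(scaler, LinearRegression(), num_cols, cat_cols)
    pipe.fit(toy_X, toy_y)
    print(type(scaler).__name__, pipe.predict(toy_X.head(2)))
```

Because the raw dataframe is passed to `fit` and `predict` every time, each iteration differs only in the objects handed to the helper.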

### II / A : Preliminary steps

### II / B : Importing libraries

In [16]:
import pandas as pd
import numpy as np
import seaborn as sns

# Scikit learn libraries
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler, RobustScaler
from sklearn.linear_model import LinearRegression, PoissonRegressor
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsRegressor
from sklearn.decomposition import PCA

# Pickle
import pickle

### II / C : Importing dataset after EDA

In [5]:
# Reading dataframe
model_df = pd.read_csv('data/eda_clean_df.csv')
model_df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-119.84,36.77,6.0,1853.0,473.0,1397.0,417.0,1.4817,72000.0,INLAND
1,-117.80,33.68,8.0,2032.0,349.0,862.0,340.0,6.9133,274100.0,<1H OCEAN
2,-120.19,36.60,25.0,875.0,214.0,931.0,214.0,1.5536,58300.0,INLAND
3,-118.32,34.10,31.0,622.0,229.0,597.0,227.0,1.5284,200000.0,<1H OCEAN
4,-121.23,37.79,21.0,1922.0,373.0,1130.0,372.0,4.0815,117900.0,INLAND
...,...,...,...,...,...,...,...,...,...,...
16331,-121.90,39.59,20.0,1465.0,278.0,745.0,250.0,3.0625,93800.0,INLAND
16332,-122.25,38.11,49.0,2365.0,504.0,1131.0,458.0,2.6133,103100.0,NEAR BAY
16333,-121.22,38.92,19.0,2531.0,461.0,1206.0,429.0,4.4958,192600.0,INLAND
16334,-118.14,34.16,39.0,2776.0,840.0,2546.0,773.0,2.5750,153500.0,<1H OCEAN


### II / D : Create clusters using KMeans

In [6]:
# First, we need to drop the ocean_proximity feature (KMeans only accepts numerical data)
model_df_drop = model_df.drop("ocean_proximity", axis=1)

# Clustering with KMeans
# Note: model_df_drop still contains median_house_value, so the labels depend on the target
kmeans = KMeans(n_clusters=3, random_state=42).fit(model_df_drop)
labels = kmeans.labels_

# Creating a new dataframe (explicit copy, so model_df_drop is left untouched)
kmean_df = model_df_drop.copy()

# Adding label_kmeans and ocean_proximity columns
kmean_df['label_kmeans'] = labels
kmean_df["ocean_proximity"] = model_df["ocean_proximity"]

# kmean_df

### II / E : Split into train/test

In [7]:
# Here we split our dataframe with the hold-out method (70 % train / 30 % test)
X = kmean_df.drop("median_house_value", axis=1).copy()
y = kmean_df["median_house_value"].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=1)
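
In the cells of this section, KMeans is fitted before the split and on a dataframe that still includes the target column. A possible alternative ordering, sketched here on synthetic data, splits first and fits the clustering on the training features only. All names and data in this block are hypothetical; it is not the method used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
features = rng.normal(size=(200, 4))  # synthetic numerical features
target = features[:, 0] * 3.0 + rng.normal(size=200)

F_train, F_test, t_train, t_test = train_test_split(
    features, target, train_size=0.7, random_state=1
)

# Fit the clustering on training features only (the target is never seen)
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit(F_train)

# Assign cluster labels to both sets with the same fitted model
train_labels = kmeans_demo.labels_
test_labels = kmeans_demo.predict(F_test)

print(np.bincount(train_labels, minlength=3), np.bincount(test_labels, minlength=3))
```

With this ordering, the test rows influence neither the cluster centers nor the labels used as a feature.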

### III / A : Modeling

In [8]:
# We have to select specific columns for our pipeline's steps (numerical and categorical columns)
cat_columns = X_train.select_dtypes(include=["object"]).columns.tolist()
num_columns = X_train.select_dtypes(exclude=["object"]).columns.tolist()

### III / B : First iteration with OneHotEncoder and StandardScaler

In [9]:
# Creating categorical data pipeline
# (in scikit-learn >= 1.2 the `sparse` parameter is named `sparse_output`)
cat_preprocessing = Pipeline(
    [
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse=False))
    ]
)

# Creating numerical data pipeline
num_preprocessing = Pipeline(
    [
        ("scaler", StandardScaler())
    ]
)

# Combine both numerical and categorical pipelines
preprocessing = ColumnTransformer(
    [
        ("num", num_preprocessing, num_columns),
        ("cat", cat_preprocessing, cat_columns)
    ]
)

# Full pipe with preprocessing pipes and regression pipe
full_pipe = Pipeline(
    [
        ('preprocess', preprocessing),
        ('regressor', LinearRegression())
    ]
)

# Measuring model performance with 5-fold cross-validation (default scoring for a regressor is r²)
scores = cross_val_score(
    full_pipe, X_train, y_train, cv=5
)

# Fitting our model
full_pipe.fit(X_train, y_train)

# Storing the predictions
y_pred = full_pipe.predict(X_test)

# Printing different metrics for model evaluation
print(f'cross_val_score : {scores.mean()}')
print(f'y_pred min : {y_pred.min()}')
print(f'r² score : {full_pipe.score(X_test, y_test)}')
print(f'mae score : {mean_absolute_error(y_test, y_pred)}')
print(f'mae % score : {mean_absolute_percentage_error(y_test, y_pred)}')
print(f'mse score : {mean_squared_error(y_test, y_pred)}')
print(f'rmse score : {np.sqrt(mean_squared_error(y_test, y_pred))}')

cross_val_score : 0.6789760761009063
y_pred min : -142174.77201245917
r² score : 0.6931144305582347
mae score : 48537.02593175569
mae % score : 0.30527711247453365
mse score : 4115898190.785157
rmse score : 64155.26627475843


#### First iteration conclusion:
After one-hot encoding the categorical values and standard scaling the numerical columns, our first r² score is 0.69 with a linear regression model. There is an issue: the model can predict negative values (the minimum prediction is about -142K $), whereas the price of a house cannot go below 0. Because mse and rmse square the errors, they are dominated by outliers, so we do not rely on them to evaluate the model. Mae is less sensitive to outliers, so we will focus on the r² and mae metrics. Our mae is really high: on average, our predictions are off by about 48.5K $ from the real value, a relative error of about 30 %. It's way too high.
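
To illustrate why mae is less sensitive to outliers than rmse, here is a minimal sketch with made-up errors (they are not taken from our predictions). A single large error pulls rmse up much more than mae.

```python
import math

# Hypothetical errors (prediction - real value) in dollars; the last one is an outlier
errors = [10_000, -12_000, 8_000, -9_000, 200_000]

# mae averages the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

# rmse squares each error first, so the outlier weighs far more
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"mae  : {mae:.0f}")
print(f"rmse : {rmse:.0f}")
```

Here the four ordinary errors are all around 10K $, yet rmse ends up close to twice the mae because of the single 200K $ error.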

### Second iteration, with MinMaxScaler instead of StandardScaler

In [10]:
# Creating categorical data pipeline
cat_preprocessing_2 = Pipeline(
    [
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse=False))
    ]
)

# Creating numerical data pipeline
num_preprocessing_2 = Pipeline(
    [
        ("scaler", MinMaxScaler())
    ]
)

# Combine both pipelines
preprocessing_2 = ColumnTransformer(
    [
        ("num", num_preprocessing_2, num_columns),
        ("cat", cat_preprocessing_2, cat_columns)
    ]
)

# Full pipe with preprocessing pipes and regression pipe
full_pipe_2 = Pipeline(
    [
        ('preprocessing', preprocessing_2),
        ('regressor', LinearRegression())
    ]
)

# Fitting our model
full_pipe_2.fit(X_train, y_train)

# Storing our predictions
y_pred_2 = full_pipe_2.predict(X_test)

# Printing different metrics for model evaluation
print(f'y_pred min : {y_pred_2.min()}')
print(f'r² score : {full_pipe_2.score(X_test, y_test)}')
print(f'mae score : {mean_absolute_error(y_test, y_pred_2)}')
print(f'mae % score : {mean_absolute_percentage_error(y_test, y_pred_2)}')
print(f'mse score : {mean_squared_error(y_test, y_pred_2)}')
print(f'rmse score : {np.sqrt(mean_squared_error(y_test, y_pred_2))}')

y_pred min : -142592.0
r² score : 0.693113879810968
mae score : 48536.530299938786
mae % score : 0.30510633938560483
mse score : 4115905577.315854
rmse score : 64155.32384234261


#### Second iteration conclusion:
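After using min-max scaling instead of standard scaling on the numerical columns, our evaluation metrics are practically unchanged (r² 0.69, mae about 48.5K $). This is expected: an ordinary least-squares linear regression with an intercept gives the same predictions under any linear rescaling of its features, up to numerical precision. We need to find other ways to improve our model. We still have negative predictions.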
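
The fact that the choice of scaler does not matter for a plain linear regression can be checked on synthetic data. The sketch below uses NumPy only; `ols_predict` is a hypothetical helper written for this example, and the data is random.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 7.0 + rng.normal(size=200)


def ols_predict(X_fit, y_fit, X_new):
    """Ordinary least squares with an intercept, solved with np.linalg.lstsq."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef


# Standard scaling: (x - mean) / std ; min-max scaling: (x - min) / (max - min)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

pred_std = ols_predict(X_std, y, X_std)
pred_minmax = ols_predict(X_minmax, y, X_minmax)

# Both scalings lead to the same predictions, up to floating-point precision
print(np.allclose(pred_std, pred_minmax))
```

Scaling does matter for distance-based or regularized models, which is why it stays in the pipeline for the other iterations.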

### Third iteration, with PoissonRegressor instead of LinearRegression

In [11]:
# Creating categorical data pipeline
cat_preprocessing_3 = Pipeline(
    [
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse=False))
    ]
)

# Creating numerical data pipeline
num_preprocessing_3 = Pipeline(
    [
        ("scaler", MinMaxScaler())
    ]
)

# Combine both pipelines
preprocessing_3 = ColumnTransformer(
    [
        ("num", num_preprocessing_3, num_columns),
        ("cat", cat_preprocessing_3, cat_columns)
    ]
)

# Full pipe with preprocessing pipes and regression pipe
full_pipe_3 = Pipeline(
    [
        ('preprocessing', preprocessing_3),
        ('regressor', PoissonRegressor(max_iter=1000))
    ]
)

# Fitting our model
full_pipe_3.fit(X_train, y_train)

# Storing our predictions
y_pred_3 = full_pipe_3.predict(X_test)

# Printing different metrics for model evaluation
print(f'y_pred min : {y_pred_3.min()}')
print(f'r² score : {full_pipe_3.score(X_test, y_test)}')
print(f'mae score : {mean_absolute_error(y_test, y_pred_3)}')
print(f'mae % score : {mean_absolute_percentage_error(y_test, y_pred_3)}')
print(f'mse score : {mean_squared_error(y_test, y_pred_3)}')
print(f'rmse score : {np.sqrt(mean_squared_error(y_test, y_pred_3))}')

y_pred min : 17965.669537479593
r² score : 0.6688298227022536
mae score : 49369.20908776794
mae % score : 0.3084131251445375
mse score : 4777741353.814425
rmse score : 69121.20769933368


#### Third iteration conclusion:
After using PoissonRegressor instead of LinearRegression, our r² score has decreased (0.67). The mae score has increased (about 49.4K $ versus 48.5K $), so it's worse than the previous iteration. However, we no longer have negative predictions (the minimum is about 18K $), which is a real improvement brought by this specific model.
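
The absence of negative predictions comes from the log link used by PoissonRegressor: the predicted mean is the exponential of a linear combination of the features, so it is always positive. A minimal sketch on synthetic data (the data and variable names are assumptions made for this example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 1))

# Synthetic positive target that grows with X
y = 20.0 + 200.0 * X[:, 0] + rng.normal(scale=10.0, size=300)
y = np.clip(y, 1.0, None)

linear = LinearRegression().fit(X, y)
poisson = PoissonRegressor(max_iter=1000).fit(X, y)

# Extrapolate below the training range, where a straight line can cross zero
X_new = np.array([[-0.5], [0.0], [0.5]])
print("linear :", linear.predict(X_new))
print("poisson:", poisson.predict(X_new))
```

The linear model goes below zero when it extrapolates, while the Poisson predictions stay positive by construction.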

### Fourth iteration, with KNeighborsRegressor

In [12]:
# Creating categorical data pipeline
cat_preprocessing_4 = Pipeline(
    [
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse=False))
    ]
)

# Creating numerical data pipeline
num_preprocessing_4 = Pipeline(
    [
        ("scaler", MinMaxScaler())
    ]
)

# Combine both pipelines
preprocessing_4 = ColumnTransformer(
    [
        ("num", num_preprocessing_4, num_columns),
        ("cat", cat_preprocessing_4, cat_columns)
    ]
)

# Full pipe with preprocessing pipes and regression pipe
full_pipe_4 = Pipeline(
    [
        ('preprocessing', preprocessing_4),
        ('regressor', KNeighborsRegressor())
    ]
)

# Fitting our model
full_pipe_4.fit(X_train, y_train)

# Storing our predictions
y_pred_4 = full_pipe_4.predict(X_test)

# Printing different metrics for model evaluation
print(f'y_pred min : {y_pred_4.min()}')
print(f'r² score : {full_pipe_4.score(X_test, y_test)}')
print(f'mae score : {mean_absolute_error(y_test, y_pred_4)}')
print(f'mae % score : {mean_absolute_percentage_error(y_test, y_pred_4)}')
print(f'mse score : {mean_squared_error(y_test, y_pred_4)}')
print(f'rmse score : {np.sqrt(mean_squared_error(y_test, y_pred_4))}')

y_pred min : 47640.0
r² score : 0.9031538750176659
mae score : 26746.389716384412
mae % score : 0.15314172322335073
mse score : 1298884112.8125527
rmse score : 36040.03486142255


#### Fourth iteration conclusion:
After using KNeighborsRegressor, our r² score has greatly improved (0.90). We also have a much lower mae score (about 26.7K $), which means that our predictions are, on average, closer to the real values. It's our best model so far.

### Fifth iteration, adding PCA in our pipeline

In [13]:
# Creating categorical data pipeline
cat_preprocessing_5 = Pipeline(
    [
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse=False))
    ]
)

# Creating numerical data pipeline
num_preprocessing_5 = Pipeline(
    [
        ("scaler", MinMaxScaler()),
        ('pca', PCA(n_components=3))
    ]
)

# Combine both pipelines
preprocessing_5 = ColumnTransformer(
    [
        ("num", num_preprocessing_5, num_columns),
        ("cat", cat_preprocessing_5, cat_columns)
    ]
)

# Full pipe with preprocessing pipes and regression pipe
full_pipe_5 = Pipeline(
    [
        ('preprocessing', preprocessing_5),
        ('regressor', KNeighborsRegressor())
    ]
)

# Fitting our model
full_pipe_5.fit(X_train, y_train)

# Storing our predictions
y_pred_5 = full_pipe_5.predict(X_test)

# Printing different metrics for model evaluation
print(f'y_pred min : {y_pred_5.min()}')
print(f'r² score : {full_pipe_5.score(X_test, y_test)}')
print(f'mae score : {mean_absolute_error(y_test, y_pred_5)}')
print(f'mae % score : {mean_absolute_percentage_error(y_test, y_pred_5)}')
print(f'mse score : {mean_squared_error(y_test, y_pred_5)}')
print(f'rmse score : {np.sqrt(mean_squared_error(y_test, y_pred_5))}')

y_pred min : 47800.0
r² score : 0.8892739433289027
mae score : 29450.92768822689
mae % score : 0.17405186546686796
mse score : 1485039446.9650195
rmse score : 38536.209556273425


#### Fifth iteration conclusion:
After adding PCA, our r² score has deteriorated (0.89). The mae score is also worse (about 29.5K $), so we will keep the fourth iteration.
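
One way to investigate this drop would be to check how much variance the 3 components keep. The sketch below uses random data as a stand-in for our scaled numerical columns, so the printed figure says nothing about our dataset; it only shows which attribute to look at.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the numerical columns: 500 rows, 8 features
X_num = rng.normal(size=(500, 8))

X_scaled = MinMaxScaler().fit_transform(X_num)
pca = PCA(n_components=3).fit(X_scaled)

# Share of the total variance kept by the 3 components
kept = pca.explained_variance_ratio_.sum()
print(f"variance kept by 3 components: {kept:.2%}")
```

If the kept share is low on the real data, the information discarded by PCA could explain the lower score.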

### Exporting with pickle

In [None]:
# Exporting preprocessing steps with pickle (only the 'preprocessing' step of the fitted pipeline)
with open('cleaning_pickle', 'wb') as file:
    pickle.dump(full_pipe_4.named_steps['preprocessing'], file)

In [15]:
# Exporting our model with pickle
with open('model_pickle', 'wb') as file:
    pickle.dump(full_pipe_4, file)
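
For completeness, here is a sketch of how an exported pipeline could be loaded back and reused. It relies on a small stand-in pipeline and a temporary directory, both assumptions made for this example; in this notebook the exported object is `full_pipe_4`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Small stand-in pipeline fitted on made-up data
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
demo_pipe = Pipeline([("scaler", MinMaxScaler()), ("regressor", LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

# The temporary directory is removed automatically at the end of the block
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_pickle")

    # Export the fitted pipeline
    with open(path, "wb") as file:
        pickle.dump(demo_pipe, file)

    # Load it back
    with open(path, "rb") as file:
        loaded_pipe = pickle.load(file)

# The restored pipeline predicts like the original one
same = np.allclose(loaded_pipe.predict(X_demo), demo_pipe.predict(X_demo))
print(same)
```

A pickle file should only be loaded from a trusted source, and preferably with the same scikit-learn version that was used to export it.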