A.S. Lundervold, v111022

> **Note:** This is a short notebook giving a quick taste of a concept that's also covered elsewhere in the course. It should be regarded as extra material. 

# Setup

In [None]:
%matplotlib inline

import numpy as np, pandas as pd
import matplotlib.pyplot as plt 
from pathlib import Path
import seaborn as sns 
import sklearn
from sklearn import datasets

pd.options.mode.chained_assignment = None

In [None]:
# This is a quick check of whether the notebook is currently running on Google Colaboratory
# or on Kaggle, as that makes some difference for the code below.
# We'll do this in every notebook of the course.
try:
    import google.colab
    colab = True
except ImportError:
    colab = False

import os
kaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

# Data

We use the housing prices dataset:

In [None]:
df = pd.read_csv('https://www.dropbox.com/s/ml97sjhte1s4dnx/housing_data.csv?dl=1')

In [None]:
df.info()

The features in the dataset are on quite different scales:

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

From `df.info()` above we see that all features except `ocean_proximity` are numeric.

In [None]:
df.ocean_proximity.value_counts()

We drop `ocean_proximity` here to keep the story simple (scaling non-numeric features is not relevant anyway).

In [None]:
df.drop('ocean_proximity', axis=1, inplace=True)

For the rest of the story we need a machine learning setup with X, y, and training and test data:

In [None]:
X, y = df.drop('median_house_value', axis=1), df.median_house_value

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Impute

There are some missing values in `total_bedrooms`. We impute them (by default, `SimpleImputer` replaces each missing value with the column mean):

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
features = ['total_bedrooms']
imp = SimpleImputer()

In [None]:
X_train.loc[:, features] = imp.fit_transform(X_train[features])
X_test.loc[:, features] = imp.transform(X_test[features])

In [None]:
X_train.info()

In [None]:
X_test.info()
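
To make the imputation step concrete, here is a minimal sketch on made-up numbers (the `toy_*` names and values are illustrative assumptions, not part of the housing data). It shows that the mean is learned from the training data only and then reused for the test data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and test columns with missing values
toy_train = pd.DataFrame({"total_bedrooms": [100.0, 200.0, np.nan, 300.0]})
toy_test = pd.DataFrame({"total_bedrooms": [np.nan, 50.0]})

toy_imp = SimpleImputer()  # default strategy is "mean"
toy_train_filled = toy_imp.fit_transform(toy_train)  # learns the mean and fills
toy_test_filled = toy_imp.transform(toy_test)        # reuses the training mean

print(toy_imp.statistics_)       # mean learned from the training column
print(toy_train_filled.ravel())
print(toy_test_filled.ravel())
```

The missing test value is filled with the training mean, not with a mean computed from the test data. This is the same fit-on-train, transform-on-test pattern used in the cell above.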

# Scale features

We can plot features against each other to see the effect of their different scales:

In [None]:
sns.jointplot(data=X_train, x="population", y="median_income", marginal_kws=dict(bins=15))
plt.show()

Using the same limits on both axes makes this even clearer:

In [None]:
axes_lim = (0,16000)
sns.jointplot(data=X_train, x="population", y="median_income", xlim=axes_lim, ylim=axes_lim, marginal_kws=dict(bins=15))
plt.show()

We can scale features using scikit-learn:

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [None]:
?StandardScaler

In [None]:
?MinMaxScaler

In [None]:
std = StandardScaler()
mms = MinMaxScaler()

In [None]:
X_train_std = std.fit_transform(X_train)
X_train_std = pd.DataFrame(data=X_train_std, columns=X_train.columns)
X_test_std = std.transform(X_test)

X_train_mms = mms.fit_transform(X_train)
X_train_mms = pd.DataFrame(data=X_train_mms, columns=X_train.columns)
X_test_mms = mms.transform(X_test)

In [None]:
X_train_std.describe()

In [None]:
X_train_mms.describe()
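
As a rough check of what the two scalers compute, the sketch below reproduces them by hand with NumPy on a small made-up array (`toy` and the `manual_*` names are illustrative assumptions). It assumes the default settings: `StandardScaler` subtracts the column mean and divides by the population standard deviation, and `MinMaxScaler` maps each column to the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two made-up features on very different scales
toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

# Standardization: (x - mean) / std, with the population std (ddof=0)
manual_std = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# Min-max scaling: (x - min) / (max - min), which gives values in [0, 1]
manual_mms = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

sk_std = StandardScaler().fit_transform(toy)
sk_mms = MinMaxScaler().fit_transform(toy)

print(np.allclose(manual_std, sk_std))
print(np.allclose(manual_mms, sk_mms))
```

Note that `describe()` in pandas reports the sample standard deviation (`ddof=1`), so the `std` row for the standardized training data will be very close to, but not exactly, 1.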

In [None]:
sns.jointplot(data=X_train, x="population", y="median_income", marginal_kws=dict(bins=15))
plt.show()

In [None]:
sns.jointplot(data=X_train_std, x="population", y="median_income", marginal_kws=dict(bins=15))
plt.show()

In [None]:
sns.jointplot(data=X_train_mms, x="population", y="median_income", marginal_kws=dict(bins=15))
plt.show()

**More examples:**

In [None]:
f, ax = plt.subplots(figsize=(10,10))
nb_points = 300

sns.scatterplot(data=X_train[:nb_points], x="housing_median_age", y="median_income", ax=ax)
sns.scatterplot(data=X_train_std[:nb_points], x="housing_median_age", y="median_income", ax=ax)
sns.scatterplot(data=X_train_mms[:nb_points], x="housing_median_age", y="median_income", ax=ax)
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(10,10))
nb_points = 300

sns.scatterplot(data=X_train[:nb_points], x="housing_median_age", y="latitude", ax=ax)
sns.scatterplot(data=X_train_std[:nb_points], x="housing_median_age", y="latitude", ax=ax)
sns.scatterplot(data=X_train_mms[:nb_points], x="housing_median_age", y="latitude", ax=ax)
plt.show()

> Which scaling (if any) is appropriate depends on the model and on properties of the dataset (tree-based models such as random forests, for example, need no scaling).
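
A toy illustration of why tree-based models do not depend on feature scaling: a tree only splits on thresholds, so what matters is the ordering of the values, not their magnitude. The data below is made up, and a single decision tree stands in for a random forest to keep the sketch small.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Made-up one-feature data with two clearly separated groups
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X_new = np.array([[4.0], [9.0]])

toy_scaler = StandardScaler().fit(X_toy)

# Fit one tree on the raw feature and one on the standardized feature
tree_raw = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_toy, y_toy)
tree_scaled = DecisionTreeRegressor(max_depth=1, random_state=0).fit(
    toy_scaler.transform(X_toy), y_toy
)

pred_raw = tree_raw.predict(X_new)
pred_scaled = tree_scaled.predict(toy_scaler.transform(X_new))
print(pred_raw, pred_scaled)
```

Both trees split between the two groups and give the same predictions for the new points. Distance-based models such as k-nearest neighbors would generally not behave this way.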

> In practice, one can try several kinds of scaling and use the performance on the validation set to choose a scaling strategy.
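
A minimal sketch of that selection procedure, under some assumptions: synthetic data from `make_regression` instead of the housing data, a k-nearest neighbors regressor as a scale-sensitive model, and R² on a held-out validation set as the metric. All names (`X_demo`, `candidates`, `val_scores`) are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Synthetic regression data; stretch the columns so they live on different scales
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# One pipeline per scaling strategy; the scaler is fitted on the training data only
candidates = {
    "none": make_pipeline(KNeighborsRegressor()),
    "standard": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "minmax": make_pipeline(MinMaxScaler(), KNeighborsRegressor()),
}

val_scores = {}
for name, pipe in candidates.items():
    pipe.fit(X_tr, y_tr)
    val_scores[name] = pipe.score(X_val, y_val)  # R^2 on the validation set

best = max(val_scores, key=val_scores.get)
print(val_scores)
print("Best strategy on the validation set:", best)
```

Wrapping the scaler and the model in a pipeline keeps the scaler from seeing validation data during fitting. The final chosen pipeline should still be evaluated on a separate test set.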