'''''
{
"title": "Data-Preprocessing-Checklist",
"keywords": "DataPreprocessing, ",
"categories": "",
"description": "Hier geht es um die individuelle Abarbeitung der <em>EDA</em>-Checkliste im Kontext der California-Housing Problematik",
"level": "30",
"pageID": "16112020-10-California-Housing-Data-Preprocessing-Checklist"
}
'''''

<center><h1>California Housing <br> Data Preprocessing</h1></center>

![](imgs/2020-11-14-21-31-19.png)

This notebook works through the [EDA checklist](16112020-EDA-Checklite) in the context of the California Housing problem. The checklist is about gaining a basic understanding of the data structure. The data is only described here, which is important before the modeling starts.

# Loading the Data
The data was already split and persisted to disk in the previous [Notebook - California Housing Prices Data](14112020-10-California-Housing-Data). It therefore has to be loaded first in this notebook. For clarity, the data-loading function has also been moved out of this notebook, into the `FunctionFileCalifornia` module.

In [1]:
# Standard library imports
import os
import tarfile
import urllib.request  # Python 3 standard library, replaces six.moves.urllib
import pandas as pd
import numpy as np

import FunctionFileCalifornia as ffc

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# Basic Variables
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
print(IMAGES_PATH)

.\images\end_to_end_project


In [2]:
housing = ffc.load_housing_data()
strat_test_set = ffc.load_housing_data(filename="strat_test_set.csv")
strat_train_set = ffc.load_housing_data(filename="strat_train_set.csv")
#print(housing.shape, strat_test_set.shape, strat_train_set.shape)
#print(strat_test_set.head(5))
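
The helper module `FunctionFileCalifornia` is not shown in this notebook. As a rough sketch only, a loader with the same call pattern might look like the following. The default file name `housing.csv`, the `housing_path` parameter, and the function body are assumptions for illustration, not the actual contents of the module.

```python
import os
import pandas as pd

# Same directory layout as HOUSING_PATH in the first cell
HOUSING_PATH = os.path.join("datasets", "housing")

def load_housing_data(filename="housing.csv", housing_path=HOUSING_PATH):
    """Hypothetical loader: read one CSV file from the housing directory."""
    csv_path = os.path.join(housing_path, filename)
    return pd.read_csv(csv_path)
```

With such a helper, `load_housing_data(filename="strat_train_set.csv")` would read the persisted training split from the same directory.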

# Vertical Cut
In the previous [horizontal cut](14112020-10-California-Housing-Data), the training and test data were created and saved to disk. In the vertical cut, the training and test data are each split into the explanatory variables (features) and the variable to be explained (label) = Supervised Learning.

In [3]:
print(strat_train_set.shape)
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()
print(housing.shape)
print(housing_labels.shape)

(16512, 12)
(16512, 11)
(16512,)
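
The cell above splits only the training set. The same vertical cut can be applied to the test set later on. A minimal sketch, assuming a small helper (`split_features_labels` is a hypothetical name) and a made-up toy frame standing in for `strat_test_set`:

```python
import pandas as pd

def split_features_labels(df, label_col="median_house_value"):
    """Return (features, labels) for a DataFrame that contains the label column."""
    features = df.drop(label_col, axis=1)
    labels = df[label_col].copy()
    return features, labels

# Toy stand-in for strat_test_set (values are made up for illustration)
toy_test_set = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
    "median_house_value": [452600.0, 358500.0, 352100.0],
})
X_test, y_test = split_features_labels(toy_test_set)
print(X_test.shape, y_test.shape)
```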


# [1. Data-Cleaning - NAN-Values](07112020200718-DataCleaning)
From the [Data-Management notebook](14112020-10-California-Housing-Data) it is clear that missing values (NaN/0/"") exist in the feature `total_bedrooms`. These values are handled in this section.

In [None]:
import missingno as msno
msno.matrix(housing)

In [None]:
print(housing.isnull().any())
print(housing.dtypes)

## Imputation
In this solution I choose to impute the missing values, using the median of each column. Other approaches can be found in this [notebook](16112020-NAWerte).

In [None]:
try:
    from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer
imputer = SimpleImputer(strategy="median")
# create a sub-DataFrame that contains only the numerical columns
housing_num = housing.drop('ocean_proximity', axis=1)
imputer.fit(housing_num)
# imputer.statistics_
# housing_num.median().values
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)

In [None]:
print(housing_tr.shape)
#print(housing_tr.isnull().any())

print(housing.shape)
#print(housing.isnull().any())
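
To make the effect of the median imputation visible, here is a small self-contained sketch on made-up numbers. It only illustrates the technique used above; the toy column values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing value in total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0, 400.0],
    "total_bedrooms": [10.0, np.nan, 30.0, 50.0],
})

imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy),
                       columns=toy.columns, index=toy.index)

# The learned statistics equal the column medians (pandas skips NaN by default)
print(imputer.statistics_)
print(toy.median().values)
print(imputed["total_bedrooms"].tolist())
```

The missing entry is replaced by 30.0, the median of the three observed values 10, 30 and 50.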


# [2. Feature-Selection](07112020200718-FeatureSelection)
In this concrete example I use all given features of the dataset.

# [3. Feature-Engineering](07112020200718-FeatureEngineering)

## One Hot Encoding
For the categorical variables, see the [OHE notebook](16112020-OneHotEncodingOrdinalEncoding).

In [None]:
try:
    from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
    from sklearn.preprocessing import OneHotEncoder
except ImportError:
    from future_encoders import OneHotEncoder # Scikit-Learn < 0.20

In [None]:
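housing_cat = housing[['ocean_proximity']]
print(housing_cat.shape)
# Default sparse output converted with toarray(); this avoids the `sparse`
# keyword, which was renamed to `sparse_output` in scikit-learn 1.2
cat_encoder = OneHotEncoder()
housing_ocean_proximity_cat_1hot = cat_encoder.fit_transform(housing_cat).toarray()
# Build the column names from categories_; get_feature_names() was removed in scikit-learn 1.2
titles = ["ocean_proximity_" + str(c) for c in cat_encoder.categories_[0]]
# Reuse the original index so that a later concat with housing_tr aligns row by row
partOHEdf = pd.DataFrame(housing_ocean_proximity_cat_1hot, columns=titles,
                         index=housing_cat.index)
print(partOHEdf.shape)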

# Merging the OHE and Imputer DataFrames

In [None]:
housingDF = pd.concat([housing_tr,partOHEdf],axis=1)
print(housingDF.shape)
print(housingDF.head(5))
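
`pd.concat(..., axis=1)` aligns rows on the index, not on the row position. Both frames therefore need the same index, otherwise the result gets extra rows filled with NaN. A minimal sketch with made-up toy frames:

```python
import pandas as pd

left = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[10, 11, 12])
right_default = pd.DataFrame({"b": [7.0, 8.0, 9.0]})                    # RangeIndex 0..2
right_aligned = pd.DataFrame({"b": [7.0, 8.0, 9.0]}, index=left.index)  # same index as left

misaligned = pd.concat([left, right_default], axis=1)
aligned = pd.concat([left, right_aligned], axis=1)

print(misaligned.shape)  # (6, 2): no index overlap, so every row has one NaN
print(aligned.shape)     # (3, 2)
```

A quick check such as `housingDF.isnull().any()` after the merge shows whether the two frames were aligned as intended.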

In [None]:
# Feature Creation

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
# get the right column indices: safer than hard-coding indices 3, 4, 5, 6
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)


In [None]:
print(housing_extra_attribs.shape)
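
The transformer above works on a NumPy array and therefore needs positional column indices. As an alternative sketch (not the approach used in this notebook), the same ratio features can be computed on a DataFrame by column name. The function name `add_ratio_features` and the toy values are assumptions for illustration.

```python
import pandas as pd

def add_ratio_features(df, add_bedrooms_per_room=True):
    """Return a copy of df with the ratio columns appended (name-based variant)."""
    out = df.copy()
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["population_per_household"] = out["population"] / out["households"]
    if add_bedrooms_per_room:
        out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    return out

# Made-up toy rows with the four columns the ratios depend on
toy = pd.DataFrame({
    "total_rooms": [800.0, 1200.0],
    "total_bedrooms": [200.0, 300.0],
    "population": [400.0, 900.0],
    "households": [100.0, 300.0],
})
toy_extra = add_ratio_features(toy)
print(toy_extra.round(2))
```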

# [4. Feature Scaling](07112020200718-FeatureScaling)
It is important that scaled values are transformed back to their original scale later (after the predictions).

In [None]:
housingDF.dtypes

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
#print(housing_tr)
scaler.fit(housingDF)
housing_tr_scaled = scaler.transform(housingDF)

In [None]:
print(housing_tr_scaled.shape)
#print(housing_tr_scaled)
titles = housingDF.columns
finalPreporcessedDF = pd.DataFrame(housing_tr_scaled, columns=titles)
print(finalPreporcessedDF.shape)
print(finalPreporcessedDF.head(1))
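
The cells above scale only the features. Scaling back after the predictions matters when the label itself was scaled. A minimal sketch of that round trip, assuming a separate scaler fitted on the labels (`label_scaler` is a hypothetical name, and the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up house values, shaped (n_samples, 1) as scikit-learn expects
labels = np.array([[100000.0], [250000.0], [500000.0]])

label_scaler = MinMaxScaler()
labels_scaled = label_scaler.fit_transform(labels)

# Pretend these are model predictions in the scaled [0, 1] space
predictions_scaled = np.array([[0.0], [0.5], [1.0]])

# inverse_transform maps scaled values back to the original unit
predictions = label_scaler.inverse_transform(predictions_scaled)
print(predictions.ravel())
```

Here a scaled prediction of 0.5 maps back to 300000, the midpoint between the smallest and largest label seen during fitting.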

# [5. SK-Learn Pipeline](07112020200718-FeatureScaling)
The steps above were primarily for development. For practical use, [SK-Learn Pipelines]() are used from here on.

In [None]:
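# Rebuild the training features from disk; the label is split off so that it
# does not pass through the feature pipeline
strat_test_set = ffc.load_housing_data(filename="strat_test_set.csv")
strat_train_set = ffc.load_housing_data(filename="strat_train_set.csv")
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
housing_num = housing.drop('ocean_proximity', axis=1)
housing_cat = ["ocean_proximity"]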

In [None]:
from sklearn.preprocessing import FunctionTransformer

def add_extra_features(X, add_bedrooms_per_room=True):
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]

attr_adder = FunctionTransformer(add_extra_features, validate=False,
                                 kw_args={"add_bedrooms_per_room": False})
housing_extra_attribs = attr_adder.fit_transform(housing.values)

In [None]:
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.head()

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('OHE', OneHotEncoder())
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)
housing_num_tr
from pipe_tools.pipe_visualizer import plot_pipeline
# Separate file names so that the second plot does not overwrite the first
plot_pipeline(num_pipeline, "num_pipeline_plot.png")
plot_pipeline(cat_pipeline, "cat_pipeline_plot.png")

In [None]:
try:
    from sklearn.compose import ColumnTransformer
    print("SK-Learn Version passt")
except ImportError:
    from future_encoders import ColumnTransformer # Scikit-Learn < 0.20
    print("Alte Version")

In [None]:
num_attribs = list(housing_num)
cat_attribs = list(housing_cat)


full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

print(housing.shape)
housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)
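
Once a pipeline is fitted on the training data, new data such as the test set should only be passed through `transform`, so that the medians, scaling parameters and categories learned from the training data are reused. The following sketch shows this on made-up toy frames with a reduced pipeline (`toy_pipeline` is a hypothetical name, the extra-feature step is left out, and `handle_unknown="ignore"` is an added assumption to tolerate unseen categories):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy frames standing in for the training and test features
train = pd.DataFrame({
    "total_rooms": [100.0, 200.0, np.nan, 400.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})
test = pd.DataFrame({
    "total_rooms": [np.nan, 300.0],
    "ocean_proximity": ["NEAR BAY", "INLAND"],
})

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
toy_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, ["total_rooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

train_prepared = toy_pipeline.fit_transform(train)  # learns median, mean/std, categories
test_prepared = toy_pipeline.transform(test)        # reuses them, no refitting
print(train_prepared.shape, test_prepared.shape)
```

Both outputs have three columns: one scaled numerical column and two one-hot columns.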