<img align="right" src="images/dhbw.png" style="width:200px"/>

# Data Mining

- Degree program: Industrial Engineering and Management (Wirtschaftsingenieurwesen), 6th semester
- Lecturer: Tin Votan
- Date: 21 April 2020

## 1. Importing Python modules

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

## 2. Downloading the datasets
A Python script that downloads the dataset, creates the folder structure, and extracts the CSV file.

When is it worth writing a data-scraping tool in Python?

- When the data changes, an automated script fetches it again with little effort and into the same folder structure.
- When the dataset is needed on several machines (multiple users).

In [None]:
import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

In [None]:
"""
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
"""

#### Dataset already downloaded and folder created

In [None]:
#fetch_housing_data()

## 3. Loading the data

In [None]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [None]:
housing = load_housing_data()
housing.head()

## 4. A first look at the data structure

In [None]:
housing.info()

### 4.1 Displaying the `ocean_proximity` category

In [None]:
housing["ocean_proximity"].value_counts()

### 4.2 Summary of the numerical attributes

In [None]:
housing.describe()

#### Example 1
In 25% of California's districts, the median age of the houses is 18 years or less.

#### Example 2
In 75% of California's districts, the population is 1,725 or less. The 75% row is the third quartile, so only 25% of the districts have more inhabitants.
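
The percentile rows of `describe()` can also be queried directly with `quantile()`. The following is a minimal sketch on a made-up toy column; the values are illustrative and not taken from the housing data.

```python
import pandas as pd

# Toy values, purely illustrative (not the housing dataset)
ages = pd.Series([5, 10, 18, 25, 30, 37, 44, 52], name="housing_median_age")

# The 25% row of describe() is the first quartile:
# a quarter of the values are less than or equal to it.
q25 = ages.quantile(0.25)
share_at_or_below = (ages <= q25).mean()

print("first quartile:", q25)
print("share of values at or below it:", share_at_or_below)
```

Reading a percentile row therefore always means "this share of the instances lies at or below the value", never "at or above".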

### 4.3 Visualizing the numerical attributes with histograms

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()

- Vertical axis = number of instances
- Horizontal axis = value range

#### Example 1
Around 450 districts in California fall into the histogram bin at a median house age of about 18 years.

#### Example 2
Around 210 districts in California fall into the bin at a median house value of about 300,000 USD.

### Preprocessed attributes

- `housing_median_age`
- `median_house_value`
- `median_income`

These attributes were capped or rescaled before the data was published. An algorithm could therefore wrongly learn from the data that prices never exceed the limits.

### Relevance

Do the capped values in `median_house_value` matter for the decision the model is meant to support?

If so, there are two options:
1. Collect the correct labels for the districts whose values were capped.
2. Remove the affected districts from the data mining process.
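
The second option can be sketched as follows. This is only an illustration on a toy frame: it assumes the cap equals the largest value observed in the column, and the numbers are made up.

```python
import pandas as pd

# Toy data, purely illustrative; 500001.0 plays the role of the cap (assumption)
toy = pd.DataFrame({
    "median_income": [2.1, 8.3, 4.0, 9.5, 3.2],
    "median_house_value": [150000.0, 500001.0, 230000.0, 500001.0, 180000.0],
})

cap = toy["median_house_value"].max()          # assumed cap: largest observed value
is_capped = toy["median_house_value"] >= cap   # districts sitting at the cap

print("capped districts:", int(is_capped.sum()))

# Option 2: remove the affected districts
toy_uncapped = toy.loc[~is_capped].reset_index(drop=True)
print(toy_uncapped)
```

If districts are removed from the training data this way, the same filter would have to be applied to the held-out data so that both are evaluated on comparable districts.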

## 5. Splitting into a training set and a validation set

#### Data Snooping Bias
People tend to evaluate a dataset as soon as they look at it and to spot interesting patterns in it. The risk is that a particular machine learning model is favoured before any model has been tested.

To counter this, the predictive power of the model is tested on data that was set aside beforehand.

To do so, the dataset is split into a *training set* (80%) and a *validation set* (20%).

### 5.1 Splitting with `NumPy`

In [None]:
# to make this notebook's output identical at every run
np.random.seed(42)

In [None]:
# For illustration only. Sklearn has train_test_split()
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

In [None]:
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")

In [None]:
import hashlib

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

In [None]:
housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

In [None]:
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

In [None]:
test_set.head()

### 5.2 Splitting with `Scikit-Learn`

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [None]:
test_set.head()

### 5.3 Stratified sampling

The sample should faithfully represent the homogeneous subgroups of the population (*strata*, singular *stratum*) and contain the right number of instances from each stratum, e.g. the same share of men and women as in the overall population.
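
The idea can be illustrated with the `stratify` argument of `train_test_split`. This is a minimal sketch on a made-up population with two groups; the column names `value` and `group` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy population: 70% group "A", 30% group "B" (illustrative only)
population = pd.DataFrame({
    "value": np.arange(1000),
    "group": ["A"] * 700 + ["B"] * 300,
})

# stratify=... asks scikit-learn to keep the group proportions in both parts
train_part, test_part = train_test_split(
    population, test_size=0.2, stratify=population["group"], random_state=42)

print(test_part["group"].value_counts(normalize=True))
```

A purely random split of the same data would only approximate the 70/30 ratio, and the deviation grows as the sample gets smaller.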

#### 5.3.1 `median_income`

In [None]:
housing["median_income"].hist()

In [None]:
# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
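# Label those above 5 as 5; assign the result back instead of calling
# where(..., inplace=True) on a column selection
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)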

In [None]:
housing["income_cat"].value_counts()

In [None]:
housing["income_cat"].hist()

| Category | Value range | Income range |
| --- | --- | --- |
| 1.0 | 0.0 to 1.5 | up to 15,000 USD |
| 2.0 | 1.5 to 3.0 | 15,000 USD to 30,000 USD |
| 3.0 | 3.0 to 4.5 | 30,000 USD to 45,000 USD |
| 4.0 | 4.5 to 6.0 | 45,000 USD to 60,000 USD |
| 5.0 | > 6.0 | > 60,000 USD |
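
The category boundaries in the table can be checked on a few hand-picked values. The sketch below compares the divide-and-round-up approach with an explicit binning via `pd.cut`; the incomes are made up, and `pd.cut` is shown only as an alternative formulation, not as the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy incomes in units of 10,000 USD (illustrative values)
income = pd.Series([0.9, 1.5, 2.4, 3.0, 4.4, 5.9, 6.0, 7.2, 15.0])

# Divide by 1.5, round up, then cap the category at 5
cat_ceil = np.ceil(income / 1.5)
cat_ceil = cat_ceil.where(cat_ceil < 5, 5.0)

# Alternative: explicit bin edges (the right edge belongs to the bin)
cat_cut = pd.cut(income, bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                 labels=[1, 2, 3, 4, 5])

print(pd.DataFrame({"income": income, "ceil": cat_ceil, "cut": cat_cut}))
```

Both variants assign a value that sits exactly on a boundary, such as 3.0, to the lower category.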

#### 5.3.2 `StratifiedShuffleSplit`

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

In [None]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

#### 5.3.3 Comparing randomly drawn samples with stratified samples

In [None]:
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()

compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

In [None]:
compare_props

Dropping `income_cat` to restore the original columns

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

## 6. Exploring and visualizing the dataset

In [None]:
housing = strat_train_set.copy()

### 6.1 Visualizing the geographical data

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude")

#### 6.1.1 Visualization that takes the density of the data points into account

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

#### 6.1.2 Visualization that takes `population` and `median_house_value` into account

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)
plt.legend()

In [None]:
import matplotlib.image as mpimg
california_img=mpimg.imread(PROJECT_ROOT_DIR + '/images/california.png')
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                       s=housing['population']/100, label="Population",
                       c="median_house_value", cmap=plt.get_cmap("jet"),
                       colorbar=False, alpha=0.4,
                      )
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
           cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cbar = plt.colorbar()
cbar.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label('Median House Value', fontsize=16)

plt.legend(fontsize=16)
plt.show()

#### 6.1.3 Correlations with `median_house_value` using a correlation matrix

Pearson correlation coefficient:

In [None]:
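# Restrict to numeric columns: `ocean_proximity` is text and cannot be correlated
corr_matrix = housing.select_dtypes(include=[np.number]).corr()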

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)
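
For two attributes $x$ and $y$, the Pearson coefficient is $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. The sketch below computes it by hand on made-up values and compares the result with `np.corrcoef`; the helper name `pearson_r` is hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-deviation divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Toy data (illustrative): y grows with x, plus some scatter
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.4, 3.8, 5.1, 5.7])

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

The coefficient only measures linear relationships, so a value near 0 does not rule out a non-linear dependency between two attributes.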

#### 6.1.4 Correlations with `median_house_value` using a scatter matrix

In [None]:
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

#### Scatter plot of `median_income` against `median_house_value`

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
plt.axis([0, 16, 0, 550000])

- strong (positive) correlation
- price cap at 500,000 USD
- further horizontal lines at 450,000 USD, 350,000 USD and 280,000 USD

### 6.2 Experimenting with attribute combinations

- `total_rooms` says little unless `households` is known as well
- more informative is the number of rooms per household
- equally informative is the population per household

In [None]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

In [None]:
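# Restrict to numeric columns: `ocean_proximity` is text and cannot be correlated
corr_matrix = housing.select_dtypes(include=[np.number]).corr()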
corr_matrix["median_house_value"].sort_values(ascending=False)

## 7. Preparing the data for the machine learning algorithms

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

In [None]:
strat_train_set.head()

In [None]:
housing.head()

In [None]:
housing_labels.head()

### 7.1 Data Cleaning

In [None]:
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows

#### Option 1

In [None]:
sample_incomplete_rows.dropna(subset=["total_bedrooms"])

#### Option 2

In [None]:
sample_incomplete_rows.drop("total_bedrooms", axis=1)

#### Option 3

In [None]:
median = housing["total_bedrooms"].median()
# option 3: fill the gaps with the median and assign the result back
sample_incomplete_rows = sample_incomplete_rows.fillna({"total_bedrooms": median})
sample_incomplete_rows

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

In [None]:
housing_num = housing.drop('ocean_proximity', axis=1)

In [None]:
imputer.fit(housing_num)

In [None]:
imputer.statistics_

In [None]:
housing_num.median().values

#### Replacing missing values with the computed medians:

In [None]:
X = imputer.transform(housing_num)

#### Converting to a pandas DataFrame:

In [None]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

### 7.2 Handling text and categorical attributes

In [None]:
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

In [None]:
ordinal_encoder.categories_

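#### One-Hot Encoding

Example:
- Rating = ["bad", "average", "good", "excellent"]
- `ocean_proximity` = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

The rating categories have a natural order, so consecutive integer codes are meaningful. The `ocean_proximity` categories have no such order, which is why each category gets its own binary column.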
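
The difference between the two encodings can be sketched on a few made-up rows. The English rating labels and the variable names below are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordered categories: pass the order explicitly so the codes respect it
ratings = np.array([["good"], ["bad"], ["excellent"], ["average"]], dtype=object)
rating_order = [["bad", "average", "good", "excellent"]]
ordinal = OrdinalEncoder(categories=rating_order)
rating_codes = ordinal.fit_transform(ratings).ravel()
print(rating_codes)

# Unordered categories: one binary column per category
proximity = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]], dtype=object)
one_hot = OneHotEncoder()
proximity_1hot = one_hot.fit_transform(proximity).toarray()
print(one_hot.categories_[0])
print(proximity_1hot)
```

With integer codes a model may treat "bad" and "average" as closer to each other than "bad" and "excellent", which is intended for ratings but would be misleading for `ocean_proximity`.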

In [None]:
from sklearn.preprocessing import OneHotEncoder

# .toarray() converts the default sparse result into a dense NumPy array
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat).toarray()
housing_cat_1hot

In [None]:
cat_encoder.categories_

### 7.3 Transformer

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

In [None]:
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)  # keep the original row labels
housing_extra_attribs.head()

### 7.4 Feature scaling and transformation pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

In [None]:
housing_num_tr
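
The last pipeline step, `StandardScaler`, subtracts the column mean and divides by the column standard deviation. A minimal sketch on a made-up matrix, comparing the manual computation with the scaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix with two columns on very different scales (illustrative only)
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0],
                  [4.0, 700.0]])

# Standardization by hand; np.std uses the population formula (ddof=0) by default
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

X_scaled = StandardScaler().fit_transform(X_toy)
print(X_scaled)
```

After scaling, every column has mean 0 and standard deviation 1, so attributes with large raw values no longer dominate the others.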

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

In [None]:
housing_prepared

In [None]:
housing_prepared.shape

## 8. Selecting and training a model

### Summary:
1. Extracting the dataset
2. Exploring the attributes and their data types
3. Splitting into a training set and a validation set
4. Preparing the data for the machine learning algorithms

### 8.1 Machine learning algorithm #1: linear regression

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

In [None]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))

In [None]:
print("Labels:", list(some_labels))

In [None]:
some_data_prepared

#### Evaluation using the root-mean-square error (RMSE)

In [None]:
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

In [None]:
from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(housing_labels, housing_predictions)
lin_mae

This points to underfitting: the model is too simple, or put differently the dataset is too complex for the model, to capture the underlying structure of the data.
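
Both error measures used above can be reproduced by hand. The sketch below uses made-up labels and predictions and compares the manual results with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy labels and predictions (illustrative values in USD)
y_true = np.array([200000.0, 150000.0, 320000.0, 90000.0])
y_pred = np.array([210000.0, 120000.0, 300000.0, 130000.0])

errors = y_pred - y_true
rmse_manual = np.sqrt(np.mean(errors ** 2))   # squares weight large errors more heavily
mae_manual = np.mean(np.abs(errors))          # plain average of the absolute errors

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(rmse_manual, mae_manual)
```

The RMSE is never smaller than the MAE; a large gap between the two indicates that a few predictions are far off.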

### 8.2 Machine learning algorithm #2: decision tree regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

#### Evaluation using the root-mean-square error (RMSE)

In [None]:
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

An error of 0 on the training data points to overfitting: the model is too complex relative to the dataset and has memorized the training data, noise included, instead of learning structure that generalizes to new districts.
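
The effect can be reproduced on synthetic data. In the sketch below, an unrestricted decision tree is fitted to a noisy sine curve; the data and all variable names are assumptions for illustration and unrelated to the housing dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine signal plus random noise
rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_held_out, y_train, y_held_out = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Without depth limits the tree grows until every training point is reproduced
tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
held_out_rmse = np.sqrt(mean_squared_error(y_held_out, tree.predict(X_held_out)))
print("training RMSE:", train_rmse)
print("held-out RMSE:", held_out_rmse)
```

The training error alone says nothing about how well the tree predicts unseen data, which is why a separate evaluation is needed.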

#### Evaluation using cross-validation

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
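
What `cross_val_score` does internally can be sketched with an explicit `KFold` loop. The example uses synthetic data and a linear model so that it runs quickly; it assumes the default behaviour for regressors, where an integer `cv` means unshuffled K-fold splitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Manual loop: train on four folds, evaluate on the remaining one
manual_rmse = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
    manual_rmse.append(np.sqrt(fold_mse))

# Same procedure via cross_val_score; it returns negative MSE, hence the minus sign
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
auto_rmse = np.sqrt(-scores)
print(manual_rmse)
print(auto_rmse)
```

Scikit-learn scoring functions follow a "greater is better" convention, which is why the MSE is reported with a negative sign.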

##### Evaluating the decision tree model

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

##### Evaluating the linear regression model

In [None]:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

| Model | Training RMSE | Mean of the cross-validation RMSEs | Standard deviation |
| --- | --- | --- | --- |
| Linear Regression | 68,628.20 | 69,054.75 | 2,744.22 |
| Decision Tree | 0.0 | 71,407.69 | 2,439.43 |

### 8.3 Machine learning algorithm #3: random forest regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)

##### Evaluating the random forest model

In [None]:
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

In [None]:
scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
pd.Series(np.sqrt(-scores)).describe()

| Model | Training RMSE | Mean of the cross-validation RMSEs | Standard deviation |
| --- | --- | --- | --- |
| Linear Regression | 68,628.20 | 69,054.75 | 2,744.22 |
| Decision Tree | 0.0 | 71,407.69 | 2,439.43 |
| Random Forest | 21,933.31 | 52,583.72 | 2,298.35 |

## 9. Fine-tuning the model

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

##### Evaluating the model

In [None]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

`GridSearchCV` found the best hyperparameter values for `max_features` and `n_estimators`, namely `(8, 30)`, with a cross-validated mean RMSE of 49,682.27 USD. For comparison, the random forest from section 8.3, trained without this search, reached a cross-validated mean RMSE of 52,583.72 USD.
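
The number of candidates the search evaluates can be checked with `ParameterGrid`, which expands a grid the same way `GridSearchCV` does. A small sketch using the grid from above; `cv_folds` is a name introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]
cv_folds = 5

# Each dictionary is expanded separately, then the candidates are concatenated
candidates = list(ParameterGrid(param_grid))
n_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_fits, "fits during the search")
```

Because the cost grows multiplicatively with every additional hyperparameter value, larger search spaces are usually explored with a randomized search instead.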

In [None]:
pd.DataFrame(grid_search.cv_results_)

## 10. Analyzing the best model and its errors

In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

In [None]:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

##### => The closer a district's houses are to the ocean, the higher their price.

## 11. Evaluation on the validation set

In [None]:
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

In [None]:
final_rmse