# ML: Exercise 2, Principal Component Analysis
-----------------------------------


# 1. Implementing Principal Component Analysis

>### Importing the packages
Import of two standard data-analysis packages: NumPy for multidimensional arrays and pandas for tabular data analysis.

In [1]:
!pip install pandas numpy wget matplotlib scikit-learn

import pandas as pd
import numpy as np
import matplotlib
import os
import wget
from sklearn import preprocessing

#np.__version__, pd.__version__



>## Reading the data
In the first step, the data is read in.
>### Direct download
Direct download from archive.ics.uci.edu,
automatic import into a pandas DataFrame,
retrieval of the download date.

In [2]:
url    = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
cols   = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','TGT']
dateDownloaded = !date #Calling Linux
dateDownloaded

['Mo 26. Okt 08:45:42 CET 2020']
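
The `!date` call above depends on a shell `date` command and returns a locale-dependent string. As a portable alternative, a minimal sketch using only the standard library could record the timestamp in ISO 8601 format. The variable name `date_downloaded` is an assumption chosen for illustration and is not part of the original notebook.

```python
from datetime import datetime

# Timezone-aware local timestamp in ISO 8601 format, e.g. for logging
# when the dataset was fetched.
date_downloaded = datetime.now().astimezone().isoformat(timespec="seconds")
print(date_downloaded)
```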

>## Caching the data
Because the dataset is somewhat larger, the data is first cached locally.

In [3]:
if not os.path.isfile('housing.data'):
    print("Downloading file...\n")
    wget.download('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data', 'housing.data')
else:
    print("File exists\n")
# !ls -11

File exists
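
The same download-once pattern can be wrapped in a small helper. The sketch below uses `urllib.request.urlretrieve` from the standard library instead of the third-party `wget` package; the function name `fetch_cached` is hypothetical and introduced here only for illustration.

```python
import os
import urllib.request


def fetch_cached(url, path):
    """Download `url` to `path` unless `path` already exists; return `path`."""
    if not os.path.isfile(path):
        print("Downloading file...")
        urllib.request.urlretrieve(url, path)
    else:
        print("File exists")
    return path
```

With the `url` variable defined earlier, a call such as `fetch_cached(url, 'housing.data')` would only touch the network the first time it runs.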



## Preprocessing

In [4]:
# Read from the locally cached copy instead of downloading the file again
boston = pd.read_csv('housing.data', sep=' ', skipinitialspace=True, header=None, names=cols, index_col=False)

if boston.isna().values.any():
    boston = boston.dropna()

if boston.duplicated().any():
    boston = boston.drop_duplicates()

boston

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,TGT
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273.0,21.0,393.45,6.48,22.0


### Centering

In [5]:
boston_centred = boston-boston.mean()
boston_centred

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,TGT
0,-3.607204,6.636364,-8.826779,-0.06917,-0.016695,0.290366,-3.374901,0.294957,-8.549407,-112.237154,-3.155534,40.225968,-7.673063,1.467194
1,-3.586214,-11.363636,-4.066779,-0.06917,-0.085695,0.136366,10.325099,1.172057,-7.549407,-166.237154,-0.655534,40.225968,-3.513063,-0.932806
2,-3.586234,-11.363636,-4.066779,-0.06917,-0.085695,0.900366,-7.474901,1.172057,-7.549407,-166.237154,-0.655534,36.155968,-8.623063,12.167194
3,-3.581154,-11.363636,-8.956779,-0.06917,-0.096695,0.713366,-22.774901,2.267157,-6.549407,-186.237154,0.244466,37.955968,-9.713063,10.867194
4,-3.544474,-11.363636,-8.956779,-0.06917,-0.096695,0.862366,-14.374901,2.267157,-6.549407,-186.237154,0.244466,40.225968,-7.323063,13.667194
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,-3.550894,-11.363636,0.793221,-0.06917,0.018305,0.308366,0.525099,-1.316443,-8.549407,-135.237154,2.544466,35.315968,-2.983063,-0.132806
502,-3.568254,-11.363636,0.793221,-0.06917,0.018305,-0.164634,8.125099,-1.507543,-8.549407,-135.237154,2.544466,40.225968,-3.573063,-1.932806
503,-3.552764,-11.363636,0.793221,-0.06917,0.018305,0.691366,22.425099,-1.627543,-8.549407,-135.237154,2.544466,40.225968,-7.013063,1.367194
504,-3.503934,-11.363636,0.793221,-0.06917,0.018305,0.509366,20.725099,-1.406143,-8.549407,-135.237154,2.544466,36.775968,-6.173063,-0.532806
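
A quick sanity check for the centering step is that every column mean is zero up to floating-point error. The sketch below shows this on made-up toy data; the frame `toy` and its values are assumptions and stand in for the real table.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real table (values are made up).
toy = pd.DataFrame({"a": [1.0, 2.0, 6.0], "b": [10.0, 20.0, 60.0]})

# Subtracting the Series of column means aligns on column names,
# so each column has its own mean removed.
toy_centred = toy - toy.mean()

# Each column mean should now be zero up to floating-point error.
print(np.allclose(toy_centred.mean(), 0.0))
```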


### Variance normalization

In [6]:
standard_scaler = preprocessing.StandardScaler()
boston_scaled = pd.DataFrame(standard_scaler.fit_transform(boston_centred), columns=cols)
boston_scaled.var()

CRIM       1.00198
ZN         1.00198
INDUS      1.00198
CHAS       1.00198
NOX        1.00198
RM         1.00198
AGE        1.00198
DIS        1.00198
RAD        1.00198
TAX        1.00198
PTRATIO    1.00198
B          1.00198
LSTAT      1.00198
TGT        1.00198
dtype: float64
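
The value 1.00198 rather than exactly 1 is consistent with a difference in estimators: `StandardScaler` divides by the population standard deviation (divisor n), whereas `DataFrame.var()` defaults to the sample variance (`ddof=1`, divisor n - 1). The ratio is n / (n - 1), which is about 1.002 for a table of roughly 500 rows. A minimal sketch on random toy data, mimicking the scaler with plain pandas arithmetic (the row count `n` and the column names are arbitrary assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500  # arbitrary row count for the toy data
toy = pd.DataFrame(rng.normal(size=(n, 3)), columns=["a", "b", "c"])

# Scale with the population standard deviation (ddof=0), as StandardScaler does.
toy_scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(toy_scaled.var())        # pandas default ddof=1: n / (n - 1)
print(toy_scaled.var(ddof=0))  # population variance: 1.0
```

Passing `ddof=0` to `var()` is therefore enough to see unit variance in the check above.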

### Solving the eigenvalue problem via singular value decomposition
 **M = U D V'**
 - **U** is an m x m real or complex unitary matrix
 - **D** is an m x n rectangular diagonal matrix with non-negative real numbers on the diagonal
 - **V** is an n x n real or complex unitary matrix
 - The number of non-zero singular values is equal to the rank of M.
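
The properties listed above can be checked numerically. The following is a small sketch on a random real toy matrix; the shape 6 x 4 and the tolerance `1e-12` are arbitrary assumptions. Note that `np.linalg.svd` returns the singular values as a 1-D array, so the rectangular matrix D has to be built explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
M = rng.normal(size=(m, n))  # toy m x n matrix

U, d, Vt = np.linalg.svd(M)  # d holds the singular values as a 1-D array

# Build the rectangular m x n matrix D with d on its diagonal.
D = np.zeros((m, n))
D[:len(d), :len(d)] = np.diag(d)

print(np.allclose(U @ D @ Vt, M))         # M = U D V'
print(np.allclose(U.T @ U, np.eye(m)))    # U is orthogonal (real case)
print(np.allclose(Vt @ Vt.T, np.eye(n)))  # V is orthogonal (real case)

# Number of non-zero singular values equals the rank of M.
print(np.count_nonzero(d > 1e-12) == np.linalg.matrix_rank(M))
```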

In [7]:
U, D, Vt =  np.linalg.svd(boston_scaled)
Vt

array([[ 2.42284451e-01, -2.45435005e-01,  3.31859746e-01,
        -5.02713285e-03,  3.25193880e-01, -2.02816554e-01,
         2.96976574e-01, -2.98169809e-01,  3.03412754e-01,
         3.24033052e-01,  2.07679535e-01, -1.96638358e-01,
         3.11397955e-01, -2.66636396e-01],
       [-6.58731079e-02, -1.48002653e-01,  1.27075668e-01,
         4.10668763e-01,  2.54276363e-01,  4.34005810e-01,
         2.60303205e-01, -3.59149977e-01,  3.11495955e-02,
         8.85140554e-03, -3.14623061e-01,  2.64810325e-02,
        -2.01245177e-01,  4.44924411e-01],
       [ 3.95077419e-01,  3.94545713e-01, -6.60819134e-02,
        -1.25305293e-01, -4.64755487e-02,  3.53406095e-01,
        -2.00823078e-01,  1.57068710e-01,  4.18510334e-01,
         3.43232194e-01,  3.99092044e-04, -3.61375914e-01,
        -1.61060336e-01,  1.63188735e-01],
       [-1.00366211e-01, -3.42958421e-01,  9.62693566e-03,
        -7.00406497e-01, -5.37075825e-02,  2.93357309e-01,
         7.84263261e-02, -1.84747787e-01,  5.

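The first r basis vectors q_i (i.e. the first r principal components) are the first r
columns of the orthogonal d × d matrix V (written n x n above). Because `np.linalg.svd` returns the transpose `Vt`, these are the first r rows of `Vt`.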
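
To make the role of these basis vectors concrete, here is a minimal sketch that projects centred data onto the first r components and derives the share of variance each one explains. It runs on random toy data; the names `Q`, `scores`, and `explained` and the choice `r = 2` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X = X - X.mean(axis=0)  # centre the columns, as in the preprocessing step

# Economy-size SVD: U is 50 x 5, d has 5 entries, Vt is 5 x 5.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

r = 2           # number of components to keep (arbitrary choice)
Q = Vt[:r].T    # first r columns of V, shape (5, r)
scores = X @ Q  # data projected onto the first r components

# Squared singular values, divided by n - 1, are the eigenvalues
# of the sample covariance matrix of X.
eigvals = d**2 / (X.shape[0] - 1)
explained = eigvals / eigvals.sum()
print(scores.shape, explained[:r])
```

The projected coordinates equal `U[:, :r] * d[:r]`, so the scores can also be read off the SVD factors without a further matrix product.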

# 3. Version overview

In [8]:
# !pip install version_information
# %reload_ext version_information
# %version_information numpy, pandas, os, wget, zipfile, json, random