# Preparation

## Loading packages

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler    # Standardizes features (mean = 0, variance = 1) so their scales are comparable

## Loading and exploring the data

### Loading the data

In [26]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
print(type(y))

X_names = cancer.feature_names
y_names = cancer.target_names

<class 'numpy.ndarray'>


In [27]:
# What shape does the data have?
print("Shape of X:", X.shape)
print("Shape of y:", y.shape,"\n")

print("Feature names: ", X_names,"\n")
print("Target names: ", y_names, "\n")

Shape of X: (569, 30)
Shape of y: (569,) 

Feature names:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension'] 

Target names:  ['malignant' 'benign'] 



### Displaying the data

In [28]:
# Build a DataFrame from the data for nicer output
X_part = pd.DataFrame(X, columns=X_names)
y_part = pd.Series(y, name="target")
cancer_df = pd.concat([X_part, y_part], axis=1)

print(cancer_df.head(n=12).to_markdown(floatfmt=".2f"))

|    |   mean radius |   mean texture |   mean perimeter |   mean area |   mean smoothness |   mean compactness |   mean concavity |   mean concave points |   mean symmetry |   mean fractal dimension |   radius error |   texture error |   perimeter error |   area error |   smoothness error |   compactness error |   concavity error |   concave points error |   symmetry error |   fractal dimension error |   worst radius |   worst texture |   worst perimeter |   worst area |   worst smoothness |   worst compactness |   worst concavity |   worst concave points |   worst symmetry |   worst fractal dimension |   target |
|---:|--------------:|---------------:|-----------------:|------------:|------------------:|-------------------:|-----------------:|----------------------:|----------------:|-------------------------:|---------------:|----------------:|------------------:|-------------:|-------------------:|--------------------:|------------------:|-----------------------:|-----------------:

In [29]:
# View summary statistics of the data
print("\nSummary statistics of the data:")

print(cancer_df.describe().to_markdown(floatfmt=".3f"))


Summary statistics of the data:
|       |   mean radius |   mean texture |   mean perimeter |   mean area |   mean smoothness |   mean compactness |   mean concavity |   mean concave points |   mean symmetry |   mean fractal dimension |   radius error |   texture error |   perimeter error |   area error |   smoothness error |   compactness error |   concavity error |   concave points error |   symmetry error |   fractal dimension error |   worst radius |   worst texture |   worst perimeter |   worst area |   worst smoothness |   worst compactness |   worst concavity |   worst concave points |   worst symmetry |   worst fractal dimension |   target |
|:------|--------------:|---------------:|-----------------:|------------:|------------------:|-------------------:|-----------------:|----------------------:|----------------:|-------------------------:|---------------:|----------------:|------------------:|-------------:|-------------------:|--------------------:|------------------:|--------------

# **Performing PCA**

Note that PCA is performed only on the features, not on the target!

### **Step 1**: Standardizing the data

In [30]:
## Standardize the data or only center it? True = standardize / False = center only
standardisieren = False

if standardisieren:
    # Standardize the data (mean 0, variance 1 per feature)
    scaler = StandardScaler()
    X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X_names)
else:
    # Only center the data (subtract the column means)
    mean_values = X.mean(axis=0)
    X_scaled = pd.DataFrame(X - mean_values, columns=X_names)

print(X_scaled.head(n=20).to_markdown(floatfmt=".3f"))



|    |   mean radius |   mean texture |   mean perimeter |   mean area |   mean smoothness |   mean compactness |   mean concavity |   mean concave points |   mean symmetry |   mean fractal dimension |   radius error |   texture error |   perimeter error |   area error |   smoothness error |   compactness error |   concavity error |   concave points error |   symmetry error |   fractal dimension error |   worst radius |   worst texture |   worst perimeter |   worst area |   worst smoothness |   worst compactness |   worst concavity |   worst concave points |   worst symmetry |   worst fractal dimension |
|---:|--------------:|---------------:|-----------------:|------------:|------------------:|-------------------:|-----------------:|----------------------:|----------------:|-------------------------:|---------------:|----------------:|------------------:|-------------:|-------------------:|--------------------:|------------------:|-----------------------:|-----------------:|----------
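
To see what the switch in Step 1 changes, the following sketch compares both options on a small made-up array (`demo` is a stand-in, not the breast cancer data). It assumes that dividing by the population standard deviation (`ddof=0`) mirrors what `StandardScaler` does.

```python
import numpy as np

# Made-up stand-in data with very different column scales (not the real dataset)
demo = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 200.0],
                 [6.0, 800.0]])

centered = demo - demo.mean(axis=0)                 # centering only
standardized = centered / demo.std(axis=0, ddof=0)  # centering plus scaling to unit variance

print("Column variances after centering:    ", centered.var(axis=0))
print("Column variances after standardizing:", standardized.var(axis=0))
```

With centering only, the large-scale column keeps its much larger variance and will tend to dominate the first principal component. After standardizing, every column contributes with variance 1.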

### **Step 2**: Covariance matrix, eigenvalues, and eigenvectors

#### Covariance matrix

In [31]:
# Compute the covariance matrix
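
One possible way to fill this cell is sketched below. The helper name `covariance_matrix` and the random `demo` array are assumptions for illustration; in the notebook the input would be `X_scaled.to_numpy()`.

```python
import numpy as np

def covariance_matrix(X_centered):
    """Sample covariance of column-centered data (rows = samples, columns = features)."""
    n_samples = X_centered.shape[0]
    return X_centered.T @ X_centered / (n_samples - 1)

# Made-up stand-in for the notebook's X_scaled
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 4))
demo = demo - demo.mean(axis=0)   # the formula above requires centered columns

cov_matrix = covariance_matrix(demo)
print(cov_matrix.shape)
```

The result matches `np.cov(demo, rowvar=False)`, which could be used directly instead. If the data was standardized with `StandardScaler`, the covariance matrix equals the correlation matrix up to the factor n/(n-1).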


In [32]:
# Plot the covariance matrix
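
A minimal plotting sketch, using plain Matplotlib on made-up data. The names `demo` and `demo_names` are placeholders; in the notebook, `cov_matrix` and `X_names` would take their place.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up stand-in data and hypothetical feature names
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 4))
demo_names = ["f1", "f2", "f3", "f4"]
cov_matrix = np.cov(demo, rowvar=False)

fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(cov_matrix, cmap="coolwarm")   # one colored cell per matrix entry
ax.set_xticks(range(len(demo_names)))
ax.set_xticklabels(demo_names, rotation=90)
ax.set_yticks(range(len(demo_names)))
ax.set_yticklabels(demo_names)
ax.set_title("Covariance matrix")
fig.colorbar(image, ax=ax)
fig.tight_layout()   # in a notebook the figure renders at the end of the cell
```

Since the notebook already imports seaborn, `sns.heatmap(cov_matrix, xticklabels=..., yticklabels=...)` would be an equally reasonable choice.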


#### Eigenvalues and eigenvectors

In [33]:
# Compute eigenvalues and eigenvectors
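
Because a covariance matrix is symmetric, `np.linalg.eigh` is a natural fit here. The sketch below runs on a made-up matrix; the variable names are assumptions chosen to match the cell comments.

```python
import numpy as np

# Made-up stand-in for the covariance matrix from the previous cells
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 4))
cov_matrix = np.cov(demo, rowvar=False)

# eigh is intended for symmetric matrices: it returns real eigenvalues
# in ascending order and the matching eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
print(eigenvalues)
```

Column `eigenvectors[:, i]` belongs to `eigenvalues[i]`. The general-purpose `np.linalg.eig` would also work, but it does not guarantee any ordering and may return complex values caused by rounding.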


In [34]:
# Sort eigenvalues in descending order
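
The key point when sorting is that the eigenvector columns must be reordered together with the eigenvalues. A small sketch with made-up values:

```python
import numpy as np

# Made-up eigenpairs; column i of eigenvectors belongs to eigenvalues[i]
eigenvalues = np.array([0.5, 3.0, 1.5])
eigenvectors = np.eye(3)

order = np.argsort(eigenvalues)[::-1]          # indices from largest to smallest eigenvalue
eigenvalues_sorted = eigenvalues[order]
eigenvectors_sorted = eigenvectors[:, order]   # reorder columns, not rows

print(eigenvalues_sorted)
```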


### **Step 3**: Loading matrix

In [35]:
# Build and print the loading matrix
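
The definition of "loading matrix" varies between textbooks. The sketch below assumes the common convention that each eigenvector is scaled by the square root of its eigenvalue; if the course defines the loadings as the unscaled eigenvectors, the scaling step can simply be dropped. All data is made up.

```python
import numpy as np

# Made-up stand-in data with different column scales
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.5])
cov_matrix = np.cov(demo, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Assumed convention: loading = eigenvector * sqrt(eigenvalue).
# The clip guards against tiny negative eigenvalues caused by rounding.
loadings = eigenvectors * np.sqrt(np.clip(eigenvalues, 0.0, None))

# Rows = features, columns = principal components
print(np.round(loadings, 3))
```

Under this convention the loadings reproduce the covariance matrix: `loadings @ loadings.T` equals `cov_matrix`.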

In [36]:
# Create a heatmap of the loading matrix
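
For loadings, a diverging color scale centered at zero makes the sign of each entry easy to read. The matrix and names below are invented for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up loading matrix: rows = features, columns = principal components
loadings = np.array([[0.9, -0.1],
                     [0.7, 0.5],
                     [-0.2, 0.8]])
feature_names = ["feature A", "feature B", "feature C"]   # hypothetical names
pc_names = [f"PC{i + 1}" for i in range(loadings.shape[1])]

limit = np.abs(loadings).max()   # symmetric color limits keep zero at the center
fig, ax = plt.subplots(figsize=(4, 3))
image = ax.imshow(loadings, cmap="coolwarm", vmin=-limit, vmax=limit, aspect="auto")
ax.set_xticks(range(len(pc_names)))
ax.set_xticklabels(pc_names)
ax.set_yticks(range(len(feature_names)))
ax.set_yticklabels(feature_names)
ax.set_title("Loading matrix")
fig.colorbar(image, ax=ax)
fig.tight_layout()
```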

### **Step 4**: Choosing the principal components

#### Scree plot

In [37]:
# Create the scree plot
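
A scree plot shows the share of the total variance carried by each component, which is the eigenvalue divided by the sum of all eigenvalues. The eigenvalues below are made up; in the notebook the sorted eigenvalues from Step 2 would be used.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up eigenvalues, already sorted in descending order
eigenvalues_sorted = np.array([6.0, 2.5, 0.8, 0.4, 0.3])
explained_ratio = eigenvalues_sorted / eigenvalues_sorted.sum()
components = np.arange(1, len(eigenvalues_sorted) + 1)

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(components, explained_ratio, marker="o")
ax.set_xticks(components)
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance ratio")
ax.set_title("Scree plot")
fig.tight_layout()
```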

#### Cumulative variance

In [38]:
# Plot the cumulative variance
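
The cumulative curve is the running sum of the explained variance ratios. The sketch below uses made-up eigenvalues and draws dashed reference lines at 90% and 95%.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up eigenvalues, already sorted in descending order
eigenvalues_sorted = np.array([6.0, 2.5, 0.8, 0.4, 0.3])
cumulative = np.cumsum(eigenvalues_sorted) / eigenvalues_sorted.sum()
components = np.arange(1, len(cumulative) + 1)

fig, ax = plt.subplots(figsize=(5, 3))
ax.step(components, cumulative, where="mid")
for threshold in (0.90, 0.95):
    ax.axhline(threshold, linestyle="--", color="gray")   # reference lines
ax.set_xticks(components)
ax.set_ylim(0, 1.05)
ax.set_xlabel("Number of principal components")
ax.set_ylabel("Cumulative explained variance")
fig.tight_layout()
```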

In [39]:
# Determine the number of components for 90% and 95% variance
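
The next cell expects a variable `n_95` (and mentions `n_90`). One way to obtain them is sketched below; the helper `components_for` is a hypothetical name and the eigenvalues are made up.

```python
import numpy as np

def components_for(eigenvalues_sorted, threshold):
    """Smallest number of leading components whose cumulative variance share reaches threshold."""
    cumulative = np.cumsum(eigenvalues_sorted) / np.sum(eigenvalues_sorted)
    # searchsorted returns the first index where cumulative >= threshold
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(cumulative))   # guard against rounding at threshold = 1.0

# Made-up eigenvalues, already sorted in descending order
eigenvalues_sorted = np.array([6.0, 2.5, 0.8, 0.4, 0.3])
n_90 = components_for(eigenvalues_sorted, 0.90)
n_95 = components_for(eigenvalues_sorted, 0.95)
print("Components for 90%:", n_90)
print("Components for 95%:", n_95)
```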

### **Step 5**: Projection

In [40]:
# Set the number of principal components
num_components = max(n_95, 2)  # use n_90 instead for 90% explained variance; at least 2 components are needed for the next plot

# Perform the projection
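
The projection multiplies the centered (or standardized) data by the first `num_components` eigenvector columns. The sketch below runs on made-up data; `W` and `X_pca` are assumed names, and in the notebook `X_scaled.to_numpy()` would replace `demo`.

```python
import numpy as np

# Made-up stand-in data, centered like the notebook's X_scaled
rng = np.random.default_rng(0)
demo = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
demo = demo - demo.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(demo, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

num_components = 2
W = eigenvectors[:, :num_components]   # projection matrix: features x components
X_pca = demo @ W                       # scores: samples x components

print(X_pca.shape)
```

As a sanity check, the sample variance of each projected column equals the corresponding eigenvalue.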

#### Visualizing the projected data

In [41]:
# Plot the data against PC1 and PC2
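
A possible scatter plot of the first two components, colored by class. The scores and labels below are random placeholders; in the notebook, the projected data and `y` would be used, with 0 = malignant and 1 = benign as given by the target names printed earlier.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up scores and class labels standing in for the projected data and y
rng = np.random.default_rng(0)
X_pca = rng.normal(size=(60, 2))
labels = rng.integers(0, 2, size=60)
class_names = ["malignant", "benign"]   # index 0 and 1, as in cancer.target_names

fig, ax = plt.subplots(figsize=(5, 4))
for value, name in enumerate(class_names):
    mask = labels == value
    ax.scatter(X_pca[mask, 0], X_pca[mask, 1], label=name, alpha=0.7)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Data projected onto PC1 and PC2")
ax.legend()
fig.tight_layout()
```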