# Chapter 6: Machine Learning, Part C

    6 | Overview
        6.1 What is Machine Learning?
        6.2 Scikit-Learn
        6.3 Supervised Learning: Classification
        6.4 Supervised Learning: Regression
        6.5 Unsupervised Learning: Dimension Reduction
        6.6 Unsupervised Learning: Clustering

### This part covers only the last two sections (6.5 and 6.6)

## 6.5: Unsupervised Learning - Dimension Reduction


### Dimension reduction algorithms
    ▪ dimension reduction techniques are used to decrease the number of variables
    ▪ they aim to drop or combine variables while retaining as much of the original information as possible

### Why is dimension reduction necessary?
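    ▪ visualization becomes easier (especially for datasets with many variables)
    ▪ “curse of dimensionality”: as the number of variables grows, the data become sparse and distances between points become less informative
    ▪ collinearity: strongly correlated variables carry redundant information and can destabilize regression estimates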
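
The last two points can be illustrated with a small simulation. This is only a sketch on made-up random data (not one of the chapter's datasets), and the helper name `relative_contrast` is introduced here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Collinearity: x2 is almost a linear function of x1,
# so it adds very little new information.
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=500)
corr = np.corrcoef(x1, x2)[0, 1]
print(f"correlation between x1 and x2: {corr:.3f}")


# Curse of dimensionality: distances from one point to all others
# become relatively more alike as the number of dimensions grows.
def relative_contrast(n_dims, n_points=500):
    points = rng.uniform(size=(n_points, n_dims))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()


contrast_low = relative_contrast(2)
contrast_high = relative_contrast(1000)
print(f"relative contrast in 2 dimensions:    {contrast_low:.2f}")
print(f"relative contrast in 1000 dimensions: {contrast_high:.2f}")
```

A smaller relative contrast means the nearest and the farthest point are almost equally far away, which is one reason distance-based methods struggle with many variables.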


### How can we reduce dimensions?
    ▪ ‘summarize’ variables (e.g., compress the information contained in 1000 variables into 10 variables)
        → Principal Component Analysis (PCA)
        → Uniform Manifold Approximation and Projection (UMAP)
    ▪ make use of regularization techniques → Ridge regression, LASSO, Elastic Net
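
As a rough illustration of the regularization route, the sketch below fits scikit-learn's `Lasso` to made-up data in which only the first three of twenty variables influence the target. The data, the coefficients and the penalty `alpha=0.1` are assumptions chosen for the example, not values from this chapter.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# 20 candidate variables, but only the first three drive y
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

# The L1 penalty shrinks coefficients of uninformative variables to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
kept = np.flatnonzero(lasso.coef_)
print("variables with non-zero coefficients:", kept)
```

Unlike PCA or UMAP, which build new summary variables, LASSO keeps a subset of the original variables, so the result stays directly interpretable.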
    
### What is PCA?
    ▪ PCA is fundamentally a dimensionality reduction algorithm
    ▪ constructs new variables that explain most of the variation/information in the data
        → finds axes that maximize the variance in the data
        → the first principal axis captures the most variance, followed by the second, the third, and so on (each orthogonal to the previous ones)
    ▪ there are no labels for PCA as it is unsupervised
        → PCA learns patterns in the data itself without a ground truth
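
The points above can be checked on a toy example. The sketch below uses made-up data with three variables, two of which are strongly related, and compares scikit-learn's `PCA` with an eigendecomposition of the covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Three variables; the second is mostly a rescaled copy of the first
z = rng.normal(size=(300, 1))
X = np.hstack([z, 0.8 * z + 0.2 * rng.normal(size=(300, 1)), rng.normal(size=(300, 1))])

pca = PCA().fit(X)  # only X is passed, no labels: PCA is unsupervised
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# The variance along each principal axis equals an eigenvalue
# of the covariance matrix, sorted from largest to smallest.
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print("covariance eigenvalues:  ", eigenvalues.round(3))
print("PCA explained variance:  ", pca.explained_variance_.round(3))
```

The explained variance ratios are sorted in decreasing order and sum to one when all components are kept.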

### Applications of PCA
    ▪ feature engineering: use the first several components (which capture most of the information in the data) as inputs to regression, classification or clustering algorithms
    ▪ visualization tool: better understand how the variables in the data are related to each other
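
A minimal sketch of the feature-engineering use, assuming a synthetic classification problem from `make_classification` and an arbitrary choice of five components. Neither comes from this chapter; they only show how PCA can sit in front of a classifier.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 30 variables, only a few of them informative
X, y = make_classification(n_samples=300, n_features=30, n_informative=5, random_state=0)

# Standardize, compress to 5 principal components, then classify
model = make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.2f}")
```

Putting the scaler and PCA inside the pipeline means they are refitted on each training fold, so no information from the held-out fold leaks into the components.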


## Applying PCA to the Palmer Penguins dataset

In [42]:
# Palmer penguin dataset

import pandas as pd

penguins = pd.read_csv('/Users/abdulhabirkarahanli/Desktop/Data/penguins.csv')
# show the 'island' column (a notebook cell only displays its last expression)
penguins['island']

0      Torgersen
1      Torgersen
2      Torgersen
3      Torgersen
4      Torgersen
         ...    
339        Dream
340        Dream
341        Dream
342        Dream
343        Dream
Name: island, Length: 344, dtype: object

In [43]:
# data preprocessing

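# fill missing values in the numeric columns with the column mean
# (numeric_only=True skips text columns such as 'island' and 'sex',
#  for which recent pandas versions cannot compute a mean)
penguins = penguins.fillna(penguins.mean(numeric_only=True))

# convert categorical data into dummy variables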
#penguins = pd.get_dummies(penguins, columns=['island', 'sex'], drop_first=True)

# split into features and target

#X_penguins = penguins.drop('species', axis=1)
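
The PCA step itself could look roughly like the sketch below. It is not the notebook's own code: the function name `pca_on_numeric_columns` is hypothetical, it simply uses every numeric column it finds (which may include columns you would rather exclude), and the small `demo` frame with placeholder column names only stands in for `penguins` so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_on_numeric_columns(df, n_components=2):
    """Standardize the numeric columns of df and project them onto principal components."""
    numeric = df.select_dtypes(include="number")
    numeric = numeric.fillna(numeric.mean())          # mean imputation, as in the cell above
    scaled = StandardScaler().fit_transform(numeric)  # PCA is sensitive to the scale of each variable
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(scaled)
    columns = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, columns=columns, index=df.index), pca


# Made-up stand-in for `penguins`; the numeric column names are placeholders
demo = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b", "a"],
    "measure_1": [39.1, 39.5, np.nan, 46.1, 50.0, 38.2],
    "measure_2": [18.7, 17.4, 14.3, 13.2, 15.2, 18.1],
    "measure_3": [181.0, 186.0, 217.0, 211.0, np.nan, 185.0],
})
components, fitted_pca = pca_on_numeric_columns(demo)
print(components.round(2))
print(fitted_pca.explained_variance_ratio_.round(2))
```

With the real data, `pca_on_numeric_columns(penguins)` would return the component scores, which can then be plotted and colored by `penguins['species']`.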



## Exercise: 
    → Use the Wisconsin breast cancer dataset wisc_bc_data.csv.
    → Perform a PCA on this dataset. How many components would you retain?
    → Visualize your results. Are the features informative for distinguishing between benign and malignant diagnoses?
 

In [45]:
# load the Wisconsin breast cancer dataset

wbcd = pd.read_csv('/Users/abdulhabirkarahanli/Desktop/Data/wisc_bc_data.csv')
wbcd.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,points_worst,symmetry_worst,dimension_worst
0,87139402,B,12.32,12.39,78.85,464.1,0.1028,0.06981,0.03987,0.037,...,13.5,15.64,86.97,549.1,0.1385,0.1266,0.1242,0.09391,0.2827,0.06771
1,8910251,B,10.6,18.95,69.28,346.4,0.09688,0.1147,0.06387,0.02642,...,11.88,22.94,78.28,424.8,0.1213,0.2515,0.1916,0.07926,0.294,0.07587
2,905520,B,11.04,16.83,70.92,373.2,0.1077,0.07804,0.03046,0.0248,...,12.41,26.44,79.93,471.4,0.1369,0.1482,0.1067,0.07431,0.2998,0.07881
3,868871,B,11.28,13.39,73.0,384.8,0.1164,0.1136,0.04635,0.04796,...,11.92,15.77,76.53,434.0,0.1367,0.1822,0.08669,0.08611,0.2102,0.06784
4,9012568,B,15.19,13.21,97.65,711.8,0.07963,0.06934,0.03393,0.02657,...,16.2,15.73,104.5,819.1,0.1126,0.1737,0.1362,0.08178,0.2487,0.06766
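
One possible way to approach the "how many components" question is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below assumes a 95% threshold (a common convention, not a rule from this chapter) and drops the `id` and `diagnosis` columns seen in the output above before fitting. The function name `components_for_variance` and the synthetic `demo` frame are made up so that the example runs without the CSV file.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_for_variance(df, threshold=0.95, drop=("id", "diagnosis")):
    """Return how many components reach `threshold` cumulative explained variance."""
    features = df.drop(columns=[c for c in drop if c in df.columns])
    scaled = StandardScaler().fit_transform(features)
    pca = PCA().fit(scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # index of the first component at which the cumulative share reaches the threshold
    n_keep = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n_keep, len(cumulative)), cumulative


# Synthetic stand-in for `wbcd`: five features driven by two hidden factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
values = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))
demo = pd.DataFrame(values, columns=[f"feature_{i}" for i in range(5)])
demo.insert(0, "diagnosis", rng.choice(["B", "M"], size=100))
demo.insert(0, "id", np.arange(100))

n_keep, cumulative = components_for_variance(demo)
print("components to retain:", n_keep)
print("cumulative explained variance:", cumulative.round(3))
```

For the exercise, `components_for_variance(wbcd)` would give a first answer under these assumptions; a scatter plot of the first two component scores, colored by `diagnosis`, then shows whether benign and malignant cases separate.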
