##### CSCI 303
# Introduction to Data Science
<p/>
### 18 - Unsupervised Learning

![PCA scatter plots](pca.png) 

## This Lecture
---
- Introduction to unsupervised learning
- Data preprocessing
  - Scaling and normalization
  - Dimensionality reduction

## Setup
---
The obligatory setup code.

In [None]:
import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt

from pandas import DataFrame

plt.style.use("ggplot")

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
# Generate n points normally distributed around (x, y) with standard deviation sigma
def sample_cluster(n, x, y, sigma):
    xs = np.random.randn(n) * sigma + x
    ys = np.random.randn(n) * sigma + y
    return np.array([xs, ys]).T


## Unsupervised vs Supervised
---
In supervised learning, we have *labeled* data:
- some input variables 
- some additional variable(s) which we are learning to predict

For example, we might have a classification problem like the one below (colors = class labels):

In [None]:
np.random.seed(1234)
c1 = sample_cluster(50, 0, 0.5, 0.15)
c2 = sample_cluster(50, 0, 1.2, 0.1)
c3 = sample_cluster(50, 0.5, 1, 0.2)
plt.plot(c1[:,0], c1[:,1], 'r.', c2[:,0], c2[:,1], 'b.', c3[:,0], c3[:,1], 'y.')
plt.show()

In *unsupervised* learning, we are given no labels, and we seek to find hidden patterns in the data: 

In [None]:
plt.plot(c1[:,0], c1[:,1], 'k.', c2[:,0], c2[:,1], 'k.', c3[:,0], c3[:,1], 'k.')
plt.show()

Questions we could ask about the data:

- Is there a transformation of the data which will reveal patterns (to humans or algorithms)?
- What are the relevant features of the data which are informative?
- Are there natural groupings into which we could separate the data?

## Challenges of Unsupervised Learning
---
Since we have no labeled data, there are no predictions that we can make *and meaningfully test*.

Evaluation of unsupervised learning algorithms is often largely subjective.

Unsupervised learning is often used in *exploratory data analysis*.

## Example Applications
---
- group (cluster) gene expression data in cancer patients to look for patterns; a gene (or group of genes) which strongly differentiates patients may be worth further study:
  - different disease causes
  - different responses to treatment
- look for anomalous patterns in credit card spending
- group people or organizations according to some new identifiers
  - reveal hidden similarities
  - provide alerts to activities with similar risks (e.g., fund analysis)
  - targeted marketing

## Data Preprocessing
---
- Generally useful to improve supervised learning algorithm performance
- Scaling/normalization:
  - Transform data so that features are on same scale or have same statistics
  - Helps some algorithms which are sensitive to scale
- Dimensionality reduction:
  - Transform data into a sub-space in which visualization or learning is easier
  - Reduce computational cost of learning

## Scaling
---
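Scaling puts features on comparable ranges, for example by shifting each feature to mean 0 and rescaling it to standard deviation 1.

It helps algorithms that are sensitive to feature magnitudes, such as distance-based methods and PCA.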
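
As a rough illustration of what scaling does, here is a minimal sketch of standardization done by hand in NumPy. The toy `income` and `age` features are made up for this example; the per-column computation (subtract the mean, divide by the standard deviation) is the same one scikit-learn's `StandardScaler` applies by default.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
rng = np.random.default_rng(0)
income = rng.normal(50_000, 15_000, size=200)  # large values
age = rng.normal(40, 10, size=200)             # small values
X_toy = np.column_stack([income, age])

# Standardize each column: subtract its mean, divide by its standard deviation
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_toy.std(axis=0))   # very different spreads before scaling
print(X_std.mean(axis=0))  # approximately 0 for both features
print(X_std.std(axis=0))   # approximately 1 for both features
```

After this step neither feature dominates a distance or variance calculation just because of its units.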

## Dimensionality Reduction
---
- Input data is often (very) high dimensional
- This can make learning expensive and promote overfitting
- Variables can often also have high correlation
- Solution: extract most relevant sub-space of input data

## Principal Components Analysis
---
The most popular form of dimensionality reduction.

Lots of linear algebra behind this.  We won't go there.

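Basically, PCA centers the data and rotates it onto new axes (the principal components), ordered by how much variance they capture:
- The first new feature captures the most variation
- The second new feature captures the next most, etc.
- Keeping only the first few new features gives the reduced sub-space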
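
Without going into the linear algebra, the rotation can be sketched in a few lines using NumPy's SVD. This is one common way to compute PCA, not necessarily what any particular library does internally; the correlated toy data is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 2-D data: the second feature mostly follows the first
a = rng.normal(size=500)
X_toy = np.column_stack([a, 0.5 * a + 0.1 * rng.normal(size=500)])

# 1. Center the data
Xc = X_toy - X_toy.mean(axis=0)

# 2. SVD of the centered data; the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Rotate the data onto the principal directions
Z = Xc @ Vt.T

# Variance along each new axis, largest first
var = Z.var(axis=0, ddof=1)
print(var)
print(var / var.sum())  # fraction of total variance per component
```

The new features in `Z` are uncorrelated, and the first one carries nearly all of the variation, so dropping the second loses very little.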

## PCA Example
---
Consider this dataset:

In [None]:
# Mixing matrix: turns independent normal samples into correlated 3-D data
M = [[1, -1, 7], [20, 3, -5], [1, 1, 1]]
x1 = np.random.randn(300)
y1 = np.random.randn(300)
z1 = np.random.randn(300)
data1 = np.array([x1, y1, z1]).T @ M
# Second blob: same shape, shifted away from the first
x2 = np.random.randn(300)
y2 = np.random.randn(300)
z2 = np.random.randn(300)
data2 = np.array([x2, y2, z2]).T @ M + np.array([20, -10, 15])
data = np.concatenate((data1, data2))

In [None]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(data1[:,0], data1[:,1], data1[:,2])
ax.scatter(data2[:,0], data2[:,1], data2[:,2])
plt.show()

Let's apply PCA and look at the first two principal components.


In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(data)
data1_pca = pca.transform(data1)
data2_pca = pca.transform(data2)

plt.scatter(data1_pca[:,0], data1_pca[:,1])
plt.scatter(data2_pca[:,0], data2_pca[:,1])
plt.show()
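
One way to check how much of the variation the two components keep is the `explained_variance_ratio_` attribute of a fitted `PCA` object. The sketch below regenerates data of the same form as above so that it stands alone; the names `blob1`, `blob2`, `demo`, and `pca_demo` and the seed are chosen just for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
M = np.array([[1, -1, 7], [20, 3, -5], [1, 1, 1]])

# Two 3-D blobs built the same way as in the example dataset
blob1 = rng.standard_normal((300, 3)) @ M
blob2 = rng.standard_normal((300, 3)) @ M + np.array([20, -10, 15])
demo = np.concatenate((blob1, blob2))

pca_demo = PCA(n_components=2).fit(demo)

# Fraction of the total variance captured by each kept component
ratio = pca_demo.explained_variance_ratio_
print(ratio)
print(ratio.sum())  # fraction kept by the two components together
```

A sum close to 1 suggests the 2-D picture loses little information; a much smaller sum suggests more components may be worth keeping.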

## Taiwan Credit Card Default Dataset
---

In [None]:
data = pd.read_csv('default.csv', header=1, encoding='utf8', index_col='ID')
all_dummies = ['SEX', 'EDUCATION','MARRIAGE','PAY_0','PAY_2','PAY_3','PAY_4','PAY_5','PAY_6']
df3 = pd.get_dummies(data, columns=all_dummies)
target = 'default payment next month'
inputs3 = df3.columns.drop(target)
df3.info()

In [None]:
X = df3[inputs3]
t = df3[target]

In [None]:
#print(X)

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(X)
X_scaled = ss.transform(X)
#print(X_scaled)

In [None]:
pca = PCA(n_components=10)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)

In [None]:
X.shape, X_pca.shape

In [None]:
plt.scatter(X_pca[:,0], X_pca[:,1], c=t)
plt.show()

In [None]:
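from sklearn.preprocessing import Normalizer
# Normalizer rescales each row (sample) to unit norm,
# unlike StandardScaler, which rescales each column (feature)
normalized = Normalizer()
normalized.fit(X)
X_norm = normalized.transform(X)
#print(X_norm)
X_norm.shape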
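
The difference between `Normalizer` and `StandardScaler` is easy to miss, so here is a small sketch on a made-up 3x2 array. With default settings, `Normalizer` rescales each row to unit L2 norm, while `StandardScaler` standardizes each column.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical toy data: rows 0 and 1 point in the same direction
X_toy = np.array([[3.0, 4.0],
                  [30.0, 40.0],
                  [1.0, 0.0]])

X_rows = Normalizer().fit_transform(X_toy)      # each row scaled to unit L2 norm
X_cols = StandardScaler().fit_transform(X_toy)  # each column to mean 0, std 1

print(X_rows)                          # rows 0 and 1 both become [0.6, 0.8]
print(np.linalg.norm(X_rows, axis=1))  # every row has norm 1
print(X_cols.mean(axis=0))             # every column has mean approximately 0
```

Row normalization throws away each sample's overall magnitude and keeps only its direction, which is why the two PCA plots in this section can look quite different.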

In [None]:
pca = PCA(n_components=10)
pca.fit(X_norm)
X_pca = pca.transform(X_norm)

In [None]:
X.shape, X_pca.shape

In [None]:
plt.scatter(X_pca[:,0], X_pca[:,1], c=t)
plt.show()

## Next Time
---
- Clustering