# Import Data

In [None]:
import pandas as pd
import numpy as np

cereal = pd.read_csv("http://fengmai.net/download/data/bia652/cereal.csv", sep=" ")
names = cereal["name"].values
cereal = cereal.drop(["name","mfr","type"],1)
cereal.describe()

In [None]:
cereal.shape

In [None]:
cereal.head(10) # we can see that there are missing values! (-1)

- We replace missing values with the mean

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=-1, strategy='mean')
imputer = imputer.fit(cereal)
cereal[:] = imputer.transform(cereal)

In [None]:
cereal.head(10) # we can see that there are missing values! (-1)

In [None]:
# alternative method
for vals in cereal.columns:
    c = cereal[vals]
    avg = np.mean(c[c != -1])
    cereal[vals] = c.replace(-1, avg)

In [None]:
from sklearn.preprocessing import StandardScaler
cereal2 = StandardScaler(with_std=True).fit_transform(cereal) #standardize

In [None]:
pd.DataFrame(cereal2)

# PCA

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, svd_solver = 'full') #we initialize y with PCA. 
pca.fit(cereal2)
cereal_pca = pca.transform(cereal2)

In [None]:
pca.components_

In [None]:
#generate 77 random colors, one for each cereal
import random
random.seed(123)
import matplotlib.pyplot as plt

color = ["#%06x" % random.randint(0, 0xAAAAAA) for i in range(0, cereal.shape[0])]
# image size
from pylab import rcParams #set figure size
rcParams['figure.figsize'] = 15, 15

#scatter plot
for x, y, c in zip(cereal_pca[:,0], cereal_pca[:,1], color):
    plt.scatter(x,y,color=c)
    
#labels
for label, x, y, c in zip(names, cereal_pca[:,0],cereal_pca[:,1], color):
    plt.annotate(label, xy = (x, y), xytext = (-0, 0),
        textcoords = 'offset points', ha = 'right', va = 'bottom', color=c)

# SVD

In [None]:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2, algorithm='arpack') #we initialize y with PCA. 
svd.fit(cereal2)
cereal_svd = svd.fit_transform(cereal2)

In [None]:
#scatter plot
for x, y, c in zip(cereal_svd[:,0], cereal_svd[:,1], color):
    plt.scatter(x,y,color=c)
    
#labels
for label, x, y, c in zip(names, cereal_svd[:,0],cereal_svd[:,1], color):
    plt.annotate(label, xy = (x, y), xytext = (-0, 0),
        textcoords = 'offset points', ha = 'right', va = 'bottom', color=c)

# Multidimensional Scaling

## Motivation
- In PCA, we project our $n\times k$ data matrix $X$ to $n\times l$. E.g., $l = 2,3$ can be used for visualization.
- This projection is linear, and is done to maximize the variation preserved.
- There is no guarantee that two data points (two rows of $X$) that are far away in the $k$ dimensional space get projected to be very close
- Can we preserve the distance in $k$-dimensional space after "projecting" to $l$-dimensions?

- E.g., we want the distance between any two points on the left to be similar to the corresponding distance on the right

| $k$-Dimensional  | $l$-Dimensional  |
|---|---|
| <img width="300px" src="http://fengmai.net/download/data/bia652/images/mds1.png"></img>  |  <img  width="300px" src="http://fengmai.net/download/data/bia652/images/mds2.png"></img>   |

- We can achieve this by minimizing the "stress"
- Let $d_{ij}=\|x_i-x_j\|$ be the distance between points $i$ and $j$ in $k$-D
- Let point $i,\ j$ be projected to $y_i,\ y_j$ in $l$-D
- We minimize

$$\text{stress} = \sum_{i\neq j} w_{\text{ij}} \left(\left\|y_i - y_j\right\| - d_{\text{ij}}\right){}^2$$

- Here $w_{ij}$ are the weights. E.g., if $w_{ij}=\frac{1}{d_{ij}^2}$, then

$$\text{stress} = \sum _{i\neq j} \left(\frac{\left\|y_i - y_j\right\|}{d_{\text{ij}}} - 1\right)^2,$$

- thus stress measures the relative difference between the actual edge length and ideal edge length.

## MDS

- set colors

In [None]:
from sklearn import manifold

mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9, random_state=123,
                   dissimilarity="euclidean", n_jobs=1)

cereal_mds = mds.fit(cereal2).embedding_

In [None]:
#scatter plot
for x, y, c in zip(cereal_mds[:,0], cereal_mds[:,1], color):
    plt.scatter(x,y,color=c)
    
#labels
for label, x, y, c in zip(names, cereal_mds[:,0],cereal_mds[:,1], color):
    plt.annotate(label, xy = (x, y), xytext = (-0, 0),
        textcoords = 'offset points', ha = 'right', va = 'bottom', color=c)

# SNE and T-SNE

- We can achieve a similar goal by first define the likelyhood of being "neighbors"
- We know the the closer two points are, the more likely they will be neighbors
- Let $P(j|i)$ defines the probablity of $j$ being a neighbor of $i$.
- In the original $k$-Dimensional space, this should be inverse proportional to $\|x_i - x_j\|$
- So we could set

$$P(j|i) = e^{-\frac{\|x_i-x_j\|}{2\sigma_i^2}}$$

- The above function is also called Gaussian or RBF kernel. $\sigma$ is a hyperparameter determined later.

- But we want that probability for all $j\ne i$ to sum to 1.
- So we normalize and redefine
$$p_{ij} = P(j|i) = \frac{e^{-\frac{\|x_i-x_j\|}{2\sigma_i^2}}}{\sum_{k\ne i} e^{-\frac{\|x_i-x_k\|}{2\sigma_i^2}}}$$
- The sequence $\{p_{ij}\ |\ j \ne i\}$ form a  probablity distribution.

- Now after "projecting" to the lower dimensional space, we can also define similarly the probability of $i$ and $j$ being neighbors

$$q_{ij} = \frac{e^{-\|y_i-y_j\|^2}}{\sum_{k\ne i} e^{-\|y_i-y_k\|^2}}$$

- Here we omitted the "$\sigma$", since they can be part of $y_i$.

- We want the two distribution to be similar. i.e., say $i=10$, and data points $1,\ 5,\ 12$ are close to $10$ so $p_{10,1},\ p_{10,5},\ p_{10,12}$ are relatively large compared with other points

- So we want find $y$ such that $q_{10,1},\ q_{10,5},\ q_{10,12}$ are also relatively large

- The standard way to measure the difference of two distributions is KL divergence

$$KL(p_i, q_i) = \sum_{k \ne i} p_{ij} log\frac{p_{ij}}{q_{ij}}$$

- This measures the distance between two distributions

- So we want to minimize $y_i$ such that  

$$C(y_1, y_2, \ldots, y_n) = \sum_i KL(p_i, q_i) = \sum_i \sum_{k \ne i} p_{ij} log\frac{p_{ij}}{q_{ij}}$$

- The objective is a nonlinear function of $y$.
- The way to minimize the objective function is via stochastic gradient descent
- The gradient of $C$ with regard to $y_i$ is 
$$\frac{\partial C}{\partial y_i} = 2\sum_j (y_i-y_j)(p_{ij} - q_{ij} + p_{ji} - q_{ji})$$

- Then $y_i = y_i - t \frac{\partial C}{y_i}$
- We will use sklearn TSNE package
- TSNE is similar to SNE, except that 
$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k\ne l} (1 + \|y_l - y_k\|^2)^{-1} }$
- $(1 + \|y_i - y_j\|^2)^{-1}$ decays slower than $e^{-\|y_i-y_j\|}$ as distance increases, avoid some of the overcrowding problems with SNE.

In [None]:
from sklearn.manifold import TSNE
# perplexity determines sigma
model = TSNE(n_components=2, random_state = 1, perplexity=50, method="exact", n_iter = 1000, learning_rate = 100)
cereal_tsne = model.fit_transform(cereal2) 

#scatter plot
for x, y, c in zip(cereal_tsne[:,0], cereal_tsne[:,1], color):
    plt.scatter(x,y,color=c)
#labels
for label, x, y, c in zip(names, cereal_tsne[:,0],cereal_tsne[:,1], color):
    plt.annotate(label, xy = (x, y), xytext = (-0, 0),
        textcoords = 'offset points', ha = 'right', va = 'bottom', color=c)

# Word2vec

<img width="200px" src="http://fengmai.net/download/data/bia652/images/text.gif" style="float: left; width: 45%; margin-right: 1%; margin-bottom: 0.5em;"></img><img width="200px" src="http://fengmai.net/download/data/bia652/images/text2vectors.png" style="float: right; width: 45%; margin-right: 1%; margin-bottom: 0.5em;">

- In text, each word is a categorical feature
- If your text uses a dictionary of size 100K, you have 100K features
- word2vec represent each word by a vector of dimension $k$
- reduce the dimensional from 100K to $k$
- It is a dimensional reduction applied to a sequence of co-occuring tokens (words)
- SVD/PCA can be slow with a large matrix ($O(p^2 n+p^3)$)
- How does it work? (Take BIA-667 or another DL/NLP course!)

## Python package
- Python: Gensim [models.word2vec](https://radimrehurek.com/gensim/intro.html)

In [None]:
import urllib.request
from pathlib import Path
response = urllib.request.urlopen('http://fengmai.net/download/data/bia652/documents.txt')
text_doc = response.read()
Path("documents.txt").write_text(str(text_doc))

In [None]:
Path('documents.txt').read_text()[:1000]

In [19]:
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# train word2vec on your file
model0 = Word2Vec(LineSentence("documents.txt"), size=10, window=5, min_count=1, workers=2)

- Look at a vector

In [20]:
model0['profit']

  """Entry point for launching an IPython kernel.


array([ 0.7787977 ,  2.9085257 , -0.1965623 , -1.6382704 ,  1.3658626 ,
       -0.34135985, -2.5573366 ,  3.443796  ,  1.7005769 ,  2.5193603 ],
      dtype=float32)

- Find closest to a word

In [21]:
model0.most_similar('profit')

  """Entry point for launching an IPython kernel.


[('revenue,', 0.9619590640068054),
 ('contribution', 0.9570902585983276),
 ('organic', 0.9567114114761353),
 ('income', 0.9562941789627075),
 ('loss', 0.9560518860816956),
 ('revenue', 0.9508347511291504),
 ('expenses', 0.9459767937660217),
 ('weighted', 0.9396394491195679),
 ('NOI', 0.9381172060966492),
 ('EBITDA', 0.936472475528717)]

Using word vectors:

1. As continuous features in linear/logistic regression
2. finding similarity of words or sentences. vector for a sentence = sum/mean of vectors for the words
