# Distributed representations


1. Sparse features  
    1.1 Hashing trick  
    1.2 Categorical features  
    
    Note about semi-supervised learning.
    
2. Word2vec  
    2.1 Skip-gram model  
    2.2 Continuous bag-of-words (CBOW) model  
    2.3 Co-occurrence matrix  
    2.4 GloVe    
    

## 1 Sparse features
### 1.1 Hashing trick

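The hashing trick is a substitution (string_token) -> hash(string_token), where the hash has a fixed size, so every token maps to one of a fixed number of feature indices.  
    
Typical hash functions: a polynomial hash for strings, or MurmurHash3 (used in sklearn).  

Pros:
    1. handles unseen words and scales well (no vocabulary has to be stored)
    2. reduces the feature dimension
Cons:
    1. no inverse transform possible
    2. collisions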
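
A minimal sketch of the idea, not the sklearn implementation: the helper name `hashed_bow` and the choice of `n_features=16` are made up for illustration, and MD5 is used only because it gives the same value on every run (Python's built-in `hash()` for strings is randomized per process), whereas `HashingVectorizer` relies on MurmurHash3.

```python
import hashlib

def hashed_bow(tokens, n_features=16):
    """Binary bag-of-words vector built with the hashing trick (illustrative helper)."""
    vec = [0] * n_features
    for tok in tokens:
        # Stable hash of the token, reduced to a column index in [0, n_features).
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        idx = int(digest, 16) % n_features
        vec[idx] = 1  # binary feature: different tokens may collide on the same index
    return vec

train_vec = hashed_bow("the flight was late".split())
unseen_vec = hashed_bow(["zxqv"])  # a token never seen before still gets an index
print(train_vec)
print(unseen_vec)
```

No vocabulary is stored anywhere, which is why the mapping cannot be inverted and why collisions cannot be detected afterwards.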

In [1]:
# demonstrate on US airlines twitter dataset for sentiment analysis
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics


SEED = 1337

In [3]:
df = pd.read_csv('Tweets.csv')
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline,retweet_count,text
0,570306133677760513,neutral,Virgin America,0,@VirginAmerica What @dhepburn said.
1,570301130888122368,positive,Virgin America,0,@VirginAmerica plus you've added commercials t...
2,570301083672813571,neutral,Virgin America,0,@VirginAmerica I didn't today... Must mean I n...
3,570301031407624196,negative,Virgin America,0,@VirginAmerica it's really aggressive to blast...
4,570300817074462722,negative,Virgin America,0,@VirginAmerica and it's a really big bad thing...


In [4]:
y = LabelEncoder().fit_transform(df.airline_sentiment)

df_train, df_test, y_train, y_test = model_selection.train_test_split(df, y, test_size=0.25, 
                                                                      stratify=y, # keep the class proportions the same in train and test
                                                                      random_state=SEED, 
                                                                      shuffle=True) # stratified splitting requires shuffling

# model v1
# Simple BOW model, binary matrix  
# Let's try to reduce number of features with hashing
model1 = Pipeline([
    ('text_vect', HashingVectorizer(analyzer='word', n_features=500, ngram_range=(1,1), norm=None, binary=True)),
    ('logreg', LogisticRegressionCV(Cs=10, cv=3, scoring='neg_log_loss', n_jobs=-1, 
                                    multi_class='multinomial', random_state=SEED))
])

model1.fit(df_train.text, y_train)
print('train logloss', metrics.log_loss(y_train, model1.predict_proba(df_train.text)))
print('test logloss', metrics.log_loss(y_test, model1.predict_proba(df_test.text)))

train logloss 0.585451549361806
test logloss 0.6163646385006908


### 1.2 Categorical features in linear models

In [5]:
# add a categorical feature to our linear model with one-hot encoding

# categorical features
df.airline.value_counts()

United            3822
US Airways        2913
American          2759
Southwest         2420
Delta             2222
Virgin America     504
Name: airline, dtype: int64

In [6]:
from sklearn.preprocessing import OneHotEncoder

text_vec = HashingVectorizer(analyzer='word', n_features=500, ngram_range=(1,1), norm=None, 
                             binary=True)
X1_train = text_vec.fit_transform(df_train.text).toarray()

tmp_le = LabelEncoder()
X2_train = tmp_le.fit_transform(df_train.airline.values).reshape(-1,1)

# `sparse` is the argument name in older scikit-learn; versions >= 1.2 call it `sparse_output`
enc = OneHotEncoder(sparse=False)
X2_train = enc.fit_transform(X2_train)
print('one-hot enc shape', X2_train.shape)

X_train = np.hstack([X1_train, X2_train])

model2 = LogisticRegressionCV(Cs=10, cv=3, scoring='neg_log_loss', n_jobs=-1, 
                                    multi_class='multinomial', random_state=SEED)
model2.fit(X_train, y_train)

X1_test = text_vec.transform(df_test.text).toarray()
X2_test = tmp_le.transform(df_test.airline.values).reshape(-1,1)
X2_test = enc.transform(X2_test)
X_test = np.hstack([X1_test, X2_test])

print('train logloss', metrics.log_loss(y_train, model2.predict_proba(X_train)))
print('test logloss', metrics.log_loss(y_test, model2.predict_proba(X_test)))

one-hot enc shape (10980, 6)
train logloss 0.5845149676056038
test logloss 0.6145427069493641

Note (from the scikit-learn warning printed by this cell): if a LabelEncoder is used before the OneHotEncoder only to convert the categories to integers, it is no longer needed, because OneHotEncoder can encode the string categories directly.
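
As a hedged sketch of encoding string categories without the intermediate LabelEncoder: the toy airline arrays below are invented for illustration and are not taken from the dataset split, and `handle_unknown="ignore"` is an assumption about how unseen categories should be treated. The default sparse output is converted with `.toarray()` so the snippet does not depend on the `sparse` / `sparse_output` argument name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): one categorical column, shape (n_samples, 1).
airlines_train = np.array(["United", "Delta", "American", "United"], dtype=object).reshape(-1, 1)
airlines_test = np.array(["Delta", "Virgin America"], dtype=object).reshape(-1, 1)

# Strings are encoded directly; categories unseen during fit become all-zero rows.
enc = OneHotEncoder(handle_unknown="ignore")
X_train_cat = enc.fit_transform(airlines_train).toarray()
X_test_cat = enc.transform(airlines_test).toarray()

print(enc.categories_[0])
print(X_train_cat)
print(X_test_cat)
```

The resulting dense block can be stacked next to the hashed text features with `np.hstack`, as in the cell above.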


## 2 Word2vec
![image](http://nlpx.net/wp/wp-content/uploads/2015/11/word2vec.png)
### 2.1 Skip-gram model
![image](https://i.stack.imgur.com/igSuE.png)

For each word $w_t$, predict the surrounding words in a window of size $m$.

The objective is to maximize the probability of the context words given the current center word:  
    
$$\mathcal{L}(\theta) = \prod^T_{t=1} \prod_{-m \le j \le m;\ j \ne 0} p(w_{t+j} | w_t; \theta) \rightarrow \max $$

or, equivalently, to minimize the average negative log-likelihood:

$$J(\theta) = -\frac{1}{T}\sum^T_{t=1} \sum_{-m \le j \le m;\ j \ne 0} \log p(w_{t+j} | w_t; \theta) \rightarrow \min $$

where each probability is a softmax over the $K$ words of the vocabulary:

$$p(w_{t+j} | w_t) = p(out | center) = \frac{\exp(u_{out}^T v_{center})}{\sum_{k=1}^{K} \exp(u_{k}^T v_{center})}$$
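
The softmax above can be sketched in NumPy as follows. The toy sizes, the random vectors and the names `U_out`, `V_in` and `p_out_given_center` are assumptions made for illustration; a trained model would learn both matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 6, 4                          # toy vocabulary size and embedding dimension
U_out = rng.normal(size=(K, d))      # row k is the output ("context") vector u_k
V_in = rng.normal(size=(K, d))       # row k is the input ("center") vector v_k

def p_out_given_center(center):
    """Softmax over all K words: p(out | center) for every candidate out word."""
    scores = U_out @ V_in[center]    # u_k^T v_center for each k
    scores = scores - scores.max()   # shift for numerical stability, result unchanged
    e = np.exp(scores)
    return e / e.sum()

probs = p_out_given_center(center=2)
# Negative log-likelihood of two context words (indices 1 and 3) for this center word.
window_loss = -np.log(probs[[1, 3]]).sum()
print(probs, window_loss)
```

Note that the denominator touches every word in the vocabulary, which is the cost that hierarchical softmax and negative sampling try to avoid.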

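### Hierarchical softmax (Huffman trees)

Hierarchical softmax replaces the flat softmax with a walk down a binary tree whose leaves are the words, which reduces the cost of one prediction from $O(V)$ to $O(\log_2 V)$ for a vocabulary of size $V$.

At every inner node the model computes $x = v_{n(w,j)}^T v_{center}$,   
where $n(w,j)$ is the j-th node on the path from the root to $w$ and $v_{center}$ is the vector of the center word.  

$p(n, left) = \sigma (v_n^T v_{center})$ - probability of going left at node $n$.  
$p(n, right) = 1 - p(n, left) = \sigma (- v_n^T v_{center})$ - probability of going right at node $n$.  

Then,  
$p(w | center) = \prod_{j=1}^{L(w) - 1} \sigma ( [ n(w, j+1) = child(n(w,j)) ] \cdot v_{n(w,j)}^T v_{center})$,  
where $L(w)$ is the number of nodes on the path from the root to $w$,  
$child(n)$ is the left child of node $n$,  
and $[x]$ equals 1 if $x$ is true and -1 otherwise.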


<img src="images/hier.png" style="height:300px">

How to build the binary prefix tree? Use a Huffman tree, which gives frequent words short paths.

<img src="images/huffman.png" style="height:300px">
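
A rough sketch of both steps, under stated assumptions: the word frequencies are invented, the inner-node and center vectors are random rather than trained, and the helper names `huffman_paths` and `hs_probability` are hypothetical. It builds a Huffman tree with `heapq`, then multiplies the left/right sigmoids along each root-to-leaf path.

```python
import heapq
import itertools
import math
import random

def huffman_paths(freqs):
    """Return ({word: [(inner_node_id, bit), ...]}, number_of_inner_nodes)."""
    tie = itertools.count()  # unique tie-breaker so tuples never compare the payload
    heap = [(f, next(tie), ("leaf", w)) for w, f in freqs.items()]
    heapq.heapify(heap)
    n_inner = 0
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), ("node", n_inner, left, right)))
        n_inner += 1
    paths = {}
    def walk(node, path):
        if node[0] == "leaf":
            paths[node[1]] = path
        else:
            _, nid, left, right = node
            walk(left, path + [(nid, 0)])    # bit 0: go left
            walk(right, path + [(nid, 1)])   # bit 1: go right
    walk(heap[0][2], [])
    return paths, n_inner

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_probability(path, node_vecs, v_center):
    """Product of sigma(+x) for left turns and sigma(-x) for right turns."""
    p = 1.0
    for nid, bit in path:
        x = sum(a * b for a, b in zip(node_vecs[nid], v_center))
        p *= sigmoid(x) if bit == 0 else sigmoid(-x)
    return p

freqs = {"the": 50, "flight": 20, "delayed": 10, "gate": 5, "crew": 3}  # toy counts
paths, n_inner = huffman_paths(freqs)

random.seed(0)
dim = 4
node_vecs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_inner)]
v_center = [random.gauss(0, 1) for _ in range(dim)]

probs = {w: hs_probability(p, node_vecs, v_center) for w, p in paths.items()}
print({w: len(p) for w, p in paths.items()})  # path length per word
print(sum(probs.values()))                    # sums to 1 up to rounding
```

Because $\sigma(x) + \sigma(-x) = 1$ at every inner node, the leaf probabilities form a valid distribution without ever computing a sum over the whole vocabulary.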

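### Negative sampling

With negative sampling, each $\log p(w_{t+j} | w_t; \theta)$ term of the objective is replaced by a binary classification objective that scores the true context word against $k$ words sampled from a noise distribution $P(w)$:   
    
$\log p(w_{t+j} | w_t; \theta) \approx \log \sigma(u_{out}^T v_{center})  + \sum_{i=1}^k E_{w_i \sim P(w)} [\log \sigma (-u_{w_i}^T v_{center})]$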
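
A minimal NumPy sketch of this objective for a single (center, out) pair. Everything concrete here is an assumption for illustration: the toy sizes, the random vectors, the made-up word counts, and the noise distribution (unigram counts raised to the power 3/4, the choice reported in the word2vec paper). The expectation is estimated with one draw of $k$ negatives.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 10, 8, 5
U_out = rng.normal(scale=0.1, size=(vocab_size, dim))   # output vectors u
V_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # center vectors v

# Assumed noise distribution P(w): toy counts ** 0.75, normalized.
counts = np.arange(1, vocab_size + 1, dtype=float)
noise = counts ** 0.75
noise /= noise.sum()

def log_sigmoid(x):
    # log(1 / (1 + exp(-x))) written in a numerically stable form
    return -np.logaddexp(0.0, -x)

def neg_sampling_objective(center, out):
    """log sigma(u_out . v_center) + sum over k sampled negatives of log sigma(-u_neg . v_center)."""
    negatives = rng.choice(vocab_size, size=k, p=noise)
    positive_term = log_sigmoid(U_out[out] @ V_in[center])
    negative_term = log_sigmoid(-(U_out[negatives] @ V_in[center])).sum()
    return positive_term + negative_term

value = neg_sampling_objective(center=2, out=7)
print(value)  # to be maximized during training
```

Each update now touches only $k + 1$ output vectors instead of the whole vocabulary.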

In [14]:
sentences = df.text.apply(lambda x: x.split()).values

In [15]:
%%time

from gensim.models.word2vec import Word2Vec


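# parameter names follow gensim < 4.0; gensim >= 4.0 renames size -> vector_size and iter -> epochs
w2v = Word2Vec(sentences, negative=5, size=100, iter=5, sg=1)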

CPU times: user 3.73 s, sys: 13.4 ms, total: 3.75 s
Wall time: 1.41 s


In [16]:
w2v.wv.most_similar('airline')

[('airline.', 0.9007248878479004),
 ('best', 0.8687483072280884),
 ('ever', 0.8615865111351013),
 ('awful', 0.8586821556091309),
 ('most', 0.8517616987228394),
 ('worst', 0.84912109375),
 ('disappointed', 0.8412606716156006),
 ('horrible', 0.838492751121521),
 ('company', 0.8372955322265625),
 ('absolute', 0.8371437788009644)]

### 2.2 CBOW model

<img src="images/cbow.png" style="height:500px">

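$$h = W^T x$$  
$$x = [w_{j-m}, w_{j-m+1}, \dots, w_{j-1}, w_{j+1}, \dots, w_{j+m}] $$  

Here $x$ is the bag-of-words encoding of the context window (the average of the one-hot vectors of the listed words), so $h$ is the mean of their embeddings.

$$p(w_j | x) = \frac{\exp(v_j^T h)}{\sum_{k=1}^{K} \exp(v_k^T h)}$$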
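
The CBOW forward pass can be sketched as below, assuming that $x$ is the average of the one-hot vectors of the context words. The toy sentence of word indices, the sizes and the name `V_out` are illustrative assumptions, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, m = 7, 3, 2                   # vocabulary size, embedding dimension, window size
W = rng.normal(size=(K, d))         # input embeddings, row k belongs to word k
V_out = rng.normal(size=(K, d))     # output vectors v_k

sentence = [0, 3, 5, 1, 6, 2]       # toy sentence given as word indices
j = 2                               # position of the word to predict
context = [sentence[i] for i in range(j - m, j + m + 1)
           if i != j and 0 <= i < len(sentence)]

# x: average of the one-hot vectors of the context words
x = np.zeros(K)
for w in context:
    x[w] += 1.0 / len(context)

h = W.T @ x                         # hidden layer = mean of the context embeddings
scores = V_out @ h                  # v_k^T h for every word k
scores = scores - scores.max()      # shift for numerical stability
p = np.exp(scores) / np.exp(scores).sum()
print(context, p[sentence[j]])      # probability assigned to the true center word
```

Compared with skip-gram, one softmax is evaluated per window rather than one per (center, context) pair.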

In [None]:
%%time

from gensim.models.word2vec import Word2Vec


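# parameter names follow gensim < 4.0; gensim >= 4.0 renames size -> vector_size and iter -> epochs
w2v = Word2Vec(sentences, negative=5, size=100, iter=100, sg=0)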

In [18]:
w2v.wv.most_similar('police')

[('assult', 0.6154369115829468),
 ('reported', 0.45402228832244873),
 ('most', 0.4082372188568115),
 ('communication,', 0.40658941864967346),
 ('Delays', 0.38628947734832764),
 ('Gate', 0.38524293899536133),
 ('Atlantic', 0.38465529680252075),
 ('engine', 0.3818504810333252),
 ('computer', 0.3697021007537842),
 ('SNA', 0.35934001207351685)]

### 2.3 Co-occurrence matrix

<img src="images/matrix.png" style="height:300px">

$P_{ij}$ - number of times the i-th word occurs together with the j-th word in a window of size m

Cons: 
1. Very high-dimensional, so rarely used directly in practice
2. Hard to add new words and docs
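
A small sketch of how such counts can be collected, using only the standard library. The helper name `cooccurrence` and the two toy sentences are made up for illustration; counts are stored sparsely in a `Counter` keyed by word pairs rather than in a dense matrix.

```python
from collections import Counter

def cooccurrence(sentences, m=2):
    """Count how often word j appears within m positions of word i."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - m), min(len(tokens), i + m + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, tokens[j])] += 1
    return counts

toy = [["flight", "was", "late"], ["flight", "was", "cancelled"]]
P = cooccurrence(toy, m=1)
print(P[("flight", "was")], P[("was", "late")], P[("flight", "late")])  # 2 1 0
```

The dense version of this table has one row and one column per vocabulary word, which is where the dimensionality problem listed above comes from.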

Simple solution: apply a dimensionality-reduction method, usually SVD

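Singular Value Decomposition

$M = U \Sigma V^{*}$  
$Mv = \sigma u$  
$M^{*}u = \sigma v$   
$U$, $V$ are unitary matrices whose columns are the left and right singular vectors $u$ and $v$  
$\Sigma$ - diagonal matrix of the singular values $\sigma$


Complexity: $O(n^2 m)$ for an $n \times m$ matrix with $n < m$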
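
A hedged sketch of turning a co-occurrence matrix into low-dimensional word vectors with a truncated SVD. The 4x4 matrix is a made-up toy example, and taking the rows of $U_k \Sigma_k$ as embeddings is one common convention, not the only one.

```python
import numpy as np

# Toy symmetric co-occurrence matrix (hypothetical counts for 4 words).
M = np.array([[0, 2, 0, 0],
              [2, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)

# NumPy returns M = U @ diag(s) @ Vh, with singular values sorted in descending order.
U, s, Vh = np.linalg.svd(M, full_matrices=False)

k = 2                                  # number of dimensions to keep
embeddings = U[:, :k] * s[:k]          # one k-dimensional vector per word
M_k = embeddings @ Vh[:k]              # rank-k approximation of M

print(s)
print(embeddings)
```

Words 2 and 3 have identical rows in `M`, so they receive identical embeddings, which is the sense in which the reduced vectors capture distributional similarity.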

### 2.4 GloVe

<img src="images/glove.png" style="height:300px">

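$J(\theta) = \frac{1}{2} \sum_{i,j=1}^W f(P_{ij})(u_i^T v_j - \log P_{ij})^2$,  
where $W$ is the vocabulary size and $f$ is a weighting function that limits the influence of very frequent pairs.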
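
A minimal sketch of evaluating the GloVe objective as a weighted squared error between $u_i^T v_j$ and $\log P_{ij}$. The weighting function and its constants (`x_max=100`, `alpha=0.75`) are assumptions taken from the GloVe paper rather than from these notes, and the co-occurrence counts and vectors are toy values.

```python
import numpy as np

def glove_weight(p, x_max=100.0, alpha=0.75):
    """Assumed weighting f: grows as (p / x_max)^alpha, capped at 1; f(0) = 0."""
    return np.where(p < x_max, (p / x_max) ** alpha, 1.0)

def glove_loss(P, U, V):
    """0.5 * sum_ij f(P_ij) * (u_i . v_j - log P_ij)^2 over pairs with P_ij > 0."""
    mask = P > 0
    log_p = np.log(np.where(mask, P, 1.0))   # placeholder log(1) = 0 where the count is zero
    err = U @ V.T - log_p
    return 0.5 * np.sum(glove_weight(P) * mask * err ** 2)

rng = np.random.default_rng(0)
P = np.array([[0, 2, 0],
              [2, 0, 1],
              [0, 1, 0]], dtype=float)       # toy co-occurrence counts
U = rng.normal(scale=0.1, size=(3, 2))       # vectors u_i
V = rng.normal(scale=0.1, size=(3, 2))       # vectors v_j
loss = glove_loss(P, U, V)
print(loss)
```

Since $f(0) = 0$, pairs that never co-occur drop out of the sum, so only the nonzero entries of the matrix need to be visited during training.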