<h1 style='text-align: center; color: lightblue; font-size: 40px'> NLP: word embeddings </h1>
<h2 style='text-align: center; color: lightblue; font-size: 30px'> a kind of autoencoder ? </h2>

# nn.Embedding

In [1]:
import torch
from torch import nn
embedding = nn.Embedding(1000,128)
embedding

Embedding(1000, 128)

In [2]:
embedding.num_embeddings

1000

In [3]:
for param in embedding.parameters():
    print(param)

Parameter containing:
tensor([[ 0.8772,  0.7212,  0.6579,  ...,  0.5811,  1.2449, -2.6291],
        [-0.0957, -0.1208, -0.2715,  ...,  0.1107, -0.2024, -1.0312],
        [ 0.2956, -0.1486, -1.2484,  ..., -0.3240,  0.0238, -1.4764],
        ...,
        [-0.4767,  1.1193, -1.6117,  ...,  0.5408, -0.0215, -1.4875],
        [ 0.0148,  0.2336, -0.4190,  ...,  2.2828,  0.0482, -1.1248],
        [ 1.4917, -0.0458, -0.7208,  ..., -0.8328, -0.0042,  0.0886]],
       requires_grad=True)


In [13]:
embedding(torch.LongTensor([3])).shape

torch.Size([1, 128])
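
The lookup above can be read as plain row selection in the weight matrix. Here is a minimal sketch, assuming the same `nn.Embedding(1000, 128)` layer; the batch of indices is made up for illustration.

```python
import torch
from torch import nn

embedding = nn.Embedding(1000, 128)

# A batch of 4 indices returns one 128-dimensional row per index.
indices = torch.LongTensor([3, 17, 3, 999])  # arbitrary example indices
vectors = embedding(indices)
print(vectors.shape)

# Looking up index 3 is the same as reading row 3 of the weight matrix.
same_as_row = torch.equal(vectors[0], embedding.weight[3])
print(same_as_row)
```

Indices must be integers in the range `0` to `num_embeddings - 1`; the same index always maps to the same row.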

# Word2vec

Word2vec is similar to an autoencoder in that it encodes each word as a vector. But rather than training against the input words through reconstruction, word2vec trains each word against the words that neighbor it in the input corpus.

<img src="images/word2vec_diagrams.png">

A very good step-by-step tutorial to understand how word2vec works: <br />
https://towardsdatascience.com/implementing-word2vec-in-pytorch-skip-gram-model-e6bae040d2fb
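
To make "training words against their neighbors" concrete, here is a rough sketch of how skip-gram training pairs could be built from a sentence. The function name `skipgram_pairs`, the `window` parameter, and the toy sentence are assumptions for illustration; they are not taken from the tutorial above.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the context window at the sentence boundaries.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()  # toy corpus
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

In a skip-gram model, each pair becomes one training example: the center word goes in, and the network is asked to predict the context word.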

# Encoding for categorical variables

Imagine you want to predict energy consumption based on the category of building.

In [6]:
import numpy as np
import pandas as pd

In [7]:
categories = pd.Series(['school', 'hospital', 'parking', 'school'])
df = pd.DataFrame(categories)
df.rename({0:'categories'}, axis=1, inplace=True)
df

Unnamed: 0,categories
0,school
1,hospital
2,parking
3,school


You cannot pass "hospital" or "parking" to a model, because models need numbers. So you need to turn all the possible categories into numbers. The first thing you think of is one-hot encoding... <br />

In [8]:
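# dtype=int keeps the dummy columns as 0/1 (pandas 2.x defaults to booleans)
df2 = pd.get_dummies(df, dtype=int)
# add the original column back for comparison
df2['categories'] = df['categories']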
df2

Unnamed: 0,categories_hospital,categories_parking,categories_school,categories
0,0,0,1,school
1,1,0,0,hospital
2,0,1,0,parking
3,0,0,1,school


Alas, you have more than 200 possible categories, and passing that many new columns to your model would:
* drown your model (especially tree-based models) in too many sparse columns
* make it train forever

Another strategy is to count the number of occurrences of each class, and create a single new column with that count:

In [9]:
df

Unnamed: 0,categories
0,school
1,hospital
2,parking
3,school


In [10]:
df3 = df.copy()
# compute the counts once, then map each category to its count
df3['count'] = df3['categories'].map(df3['categories'].value_counts())
df3

Unnamed: 0,categories,count
0,school,2
1,hospital,1
2,parking,1
3,school,2


There are plenty of small tricks you can add to this. For instance, you could take the log of this count if its distribution is right-skewed... <br /> There are also other approaches, such as target encoding:

<img src='images/target-encoding.png'>
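
As a rough sketch of target encoding, each category is replaced by the mean of the target for that category. The `energy` values below are made up for illustration (the notebook does not provide any), and the column name `target_enc` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'categories': ['school', 'hospital', 'parking', 'school'],
    'energy': [100.0, 300.0, 50.0, 120.0],  # made-up target values
})

# Mean of the target per category...
category_means = df.groupby('categories')['energy'].mean()
# ...mapped back onto each row as a single numeric column.
df['target_enc'] = df['categories'].map(category_means)
print(df)
```

In practice the means should be computed on the training data only (often fold by fold), otherwise the target leaks into the feature.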

And now you know a new one, with embeddings (!):
* use an embedding to convert your categorical variables into numbers
* you can pick how many numbers you want (say, a combination of 50 numbers to represent each category)
* train your neural network (it will automatically adjust those 50 numbers for each category)
* you can even feed those 50 columns (instead of 200 one-hot building columns) into your random forest!
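
The steps above could look roughly like the sketch below. It is a toy setup under several assumptions: the `energy` values are made up, the embedding has 2 dimensions instead of 50 to keep the output readable, and the tiny model, optimizer, and number of steps are arbitrary choices for illustration.

```python
import pandas as pd
import torch
from torch import nn

torch.manual_seed(0)

df = pd.DataFrame({
    'categories': ['school', 'hospital', 'parking', 'school'],
    'energy': [100.0, 300.0, 50.0, 120.0],  # made-up target values
})

# 1. Turn each category into an integer index.
codes = df['categories'].astype('category').cat.codes
x = torch.tensor(codes.to_numpy(), dtype=torch.long)
y = torch.tensor(df['energy'].to_numpy(), dtype=torch.float32).unsqueeze(1)

# 2. Pick how many numbers represent each category (50 in the text, 2 here).
n_categories = int(codes.max()) + 1
emb_dim = 2
model = nn.Sequential(
    nn.Embedding(n_categories, emb_dim),
    nn.Linear(emb_dim, 1),
)

# 3. Train: the embedding rows are updated along with the other weights.
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

# 4. Reuse the learned vectors as plain numeric columns for another model.
learned = model[0].weight.detach().numpy()
emb_cols = pd.DataFrame(
    learned[x.numpy()],
    columns=[f'emb_{i}' for i in range(emb_dim)],
)
features = pd.concat([df, emb_cols], axis=1)
print(features)
```

The `emb_0`, `emb_1` columns can then be passed to a random forest or any other model in place of the one-hot columns; rows with the same category share the same embedding values.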