# Word Embedding
***

- Word embedding is a method used to find relations between words
- Typically defined as a way to convert words to vectors that capture their context

**Preprocessing**
1. `Load in Data`
2. `Remove Stop Words`
3. `Convert to Bigram`
4. `Convert Bigram to One-Hot-Encodings`

**Training**
1. `Split Bigram to Train, Test Data`
2. `Create Linear Model`
3. `Obtain Weights After Training`
4. `Visualize Points`
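
The training steps above can be sketched as follows. This is a minimal sketch, not the notebook's final model: the toy data, the layer sizes, the optimizer settings, and the two-layer bias-free design are all assumptions made for illustration, and the train/test split and the plotting step are left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy data standing in for the one-hot bigram pairs.
vocab_size, embedding_dim = 4, 2
X = torch.eye(vocab_size)        # one-hot feature word, one row per word
y = torch.tensor([1, 2, 3, 0])   # index of the label word (argmax of its one-hot vector)

# Assumed architecture: one-hot -> embedding -> vocabulary scores, no biases.
model = nn.Sequential(
    nn.Linear(vocab_size, embedding_dim, bias=False),
    nn.Linear(embedding_dim, vocab_size, bias=False),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# Column i of the first layer's weight matrix is the learned vector for word i.
embeddings = model[0].weight.detach().T   # shape: (vocab_size, embedding_dim)
print(embeddings.shape)
```

Because the inputs are one-hot, multiplying by the first layer's weights simply selects one column, which is why those weights can be read off as the word vectors after training.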

In [135]:
import torch
import torch.nn as nn

import numpy as np

In [136]:
def remove_stop_words(data):
    stop_words = ["is", "a", "has", "an"]

    removed_stop_words_list = []
    
    for line in data:
        # split() drops the trailing newline and any repeated whitespace
        removed_stop_words_list.append([word for word in line.split() if word not in stop_words])

    return removed_stop_words_list
                
def bigrams(data):
    bigram_list = []

    for words in data:
        # Slide a window of size two over each sentence's word list
        for j in range(len(words) - 1):
            bigram_list.append([words[j], words[j+1]])

    return bigram_list

def vocabulary(bigram_data):
    vocab_list = []
    for bigram in bigram_data:
        vocab_list.extend(bigram)

    # Sort so the word order (and so the one-hot layout) is stable across runs
    return sorted(set(vocab_list))

def one_hot_encoder(vocab_data, bigram_data):
    one_hot_values = {}
    for i, key in enumerate(vocab_data):
        one_hot_values[key] = [0 if i != j else 1 for j in range(len(vocab_data))]

    # Build a new list so the caller's bigram_data is left unchanged
    encoded = [[one_hot_values[X], one_hot_values[y]] for X, y in bigram_data]

    return np.array(encoded)

**Preprocessing**
1. Load in Data
   - Uses a sample text file (`text_data.txt`) containing names and adjectives
   - Python's **`open()`** function opens the text file
   - **`readlines()`** returns the file as a list of lines, each keeping its trailing "\n"


In [139]:
# The with-block closes the file once the lines have been read
with open("text_data.txt", "r") as file:
    data = file.readlines()

2. Remove Stop Words
   - Our **`remove_stop_words()`** helper first cleans each sentence (removes "\n") and splits it into words
   - It then removes the stop words ("is", "a", "an", "has") from the sentence

In [140]:
cleaned_data = remove_stop_words(data)

3. Convert to Bigrams
   - Our **`bigrams()`** helper converts the cleaned sentences in `cleaned_data` to bigrams
   - It uses a sliding window of size two to collect every pair of adjacent words
  
**What is a Bigram?**
- A sequence of two adjacent words, the first being the feature and the second being the label
- Example: "I am great" will become ("I", "am") and ("am", "great")
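
The example above can be reproduced with a short sliding-window sketch. The helper name `sliding_bigrams` is hypothetical and is not part of the notebook; it works on one tokenized sentence, whereas the notebook's `bigrams()` loops over a list of sentences.

```python
def sliding_bigrams(tokens):
    """Return every adjacent (feature, label) pair in a list of tokens."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

print(sliding_bigrams("I am great".split()))
# [('I', 'am'), ('am', 'great')]
```

A sentence of n words yields n - 1 bigrams, so a one-word sentence contributes nothing.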
      

In [141]:
bigram_data = bigrams(cleaned_data)

4. Convert Bigrams to One-Hot-Encodings
   - Our **`vocabulary()`** helper collects the unique words that appear in the bigrams
   - Our **`one_hot_encoder()`** helper then replaces each word in every bigram with its one-hot vector
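
To make the resulting array shape concrete, here is a minimal sketch of the encoding on two bigrams. The three-word vocabulary and its ordering are assumptions chosen for illustration, and `np.eye` is used as a shortcut in place of the notebook's list comprehension.

```python
import numpy as np

vocab = ["am", "great", "I"]                       # hypothetical vocabulary order
index = {word: i for i, word in enumerate(vocab)}  # word -> position of its 1
identity = np.eye(len(vocab), dtype=int)           # row i is the one-hot vector for word i

pairs = [("I", "am"), ("am", "great")]
encoded = np.array([[identity[index[a]], identity[index[b]]] for a, b in pairs])

print(encoded.shape)  # (2, 2, 3): bigrams x (feature, label) x vocabulary size
print(encoded[0, 0])  # [0 0 1], the one-hot vector for "I"
```

Slicing with `encoded[:, 0]` and `encoded[:, 1]` then separates the feature vectors from the label vectors.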

In [142]:
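# Build the vocabulary first; one_hot_encoder() needs it to size the vectors
vocab_data = vocabulary(bigram_data)
train_data = one_hot_encoder(vocab_data, bigram_data)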

In [144]:
X = train_data[:, 0]
y = train_data[:, 1]

In [None]:
class Embedding_Scratch(nn.Module):
    def __init__(self):
        super().__init__()
        

In [12]:
embedding = nn.Embedding(10, 20)

embedding.state_dict()

OrderedDict([('weight',
              tensor([[ 1.4541, -1.0955, -0.5818, -1.1664,  1.9851,  0.6057,  1.1191,  1.0689,
                        1.0880,  0.1913, -1.0502,  0.1446,  0.6950, -1.3089,  0.1055, -1.2217,
                       -0.6265,  2.2836, -1.3186, -0.4641],
                      [-1.0683, -0.5650, -0.6572,  0.1054, -0.5782,  0.2241, -1.5169,  1.3078,
                       -0.0769,  0.7008, -1.9838,  0.5673, -0.8595, -2.1326, -0.6864, -0.6746,
                        0.4995, -0.2576, -0.7156, -1.2703],
                      [-0.8015,  1.4487,  0.2416, -0.1133,  1.6529,  0.0532, -0.2236,  0.2681,
                        2.6695,  0.8156,  1.9102, -0.2560, -0.0401, -0.5461,  1.2636,  0.5680,
                        0.5692,  0.3456,  0.2229, -0.4243],
                      [-1.0524, -0.1332, -0.4791, -0.2953,  1.1115, -0.7103,  0.9557, -2.0772,
                       -0.5244, -0.9411,  1.8240,  1.0541,  0.9309, -0.7129,  0.9434,  0.2365,
                        0.5055, -0.5