# Word Embeddings

# **One-Hot encoding**

**Why is it called one-hot?**

**After each word is one-hot encoded, only one position has an element of 1 and the other positions are all 0.**

For example,
the sentence **"the boy is crying"** (assuming there are only four English words in the world), after one-hot encoding,

**the corresponds to (1, 0, 0, 0)**

**boy corresponds to (0, 1, 0, 0)**

**is corresponds to (0, 0, 1, 0)**

**crying corresponds to (0, 0, 0, 1)**

Each word corresponds to a position in the vector, and this position represents the word.

But this approach requires a very high dimension: if the vocabulary contains 100,000 words, then each word must be represented by a vector of length 100,000.

**the corresponds to (1, 0, 0, 0, ..., 0) (length is 100,000)**

**boy corresponds to (0, 1, 0, 0, ..., 0)**

**is corresponds to (0, 0, 1, 0, ..., 0)**

**crying corresponds to (0, 0, 0, 1, ..., 0)**

The result is a set of high-dimensional sparse tensors.
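
The four-word example above can be reproduced in a few lines of NumPy. This is a minimal sketch: the helper name `one_hot` and the index order of the vocabulary are illustrative choices, not part of any library.

```python
import numpy as np

# Toy vocabulary from the example; the index order is an arbitrary choice.
vocab = ["the", "boy", "is", "crying"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(word_to_index), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

encoded = {word: one_hot(word, word_to_index) for word in vocab}
for word, vec in encoded.items():
    print(word, vec)
```

With a 100,000-word vocabulary the same function would return vectors of length 100,000, each containing a single 1.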




### Disadvantages of One-Hot Encoding

One-hot encoding is simple and easy to use, but its disadvantages are also obvious:

>The length of the word vector is equal to the length of the vocabulary, and the word vector is extremely sparse. When the vocabulary is large, the computational complexity will be very large.

>Any two words are orthogonal, meaning that the relationship between words cannot be obtained from the One-Hot code

>The distance between any two words is equal, and the semantic relevance of the two words cannot be reflected from the distance


## **Embedding**
In contrast, word embedding embeds words into a low-dimensional dense space.

For example, take the same sentence **"the boy is crying"** (again assuming that there are only 4 English words in the world). If each word is embedded as a single number, the encoding might become:

**the corresponds to (0.1)**

**boy corresponds to (0.14)**

**is corresponds to (0)**

**crying corresponds to (0.82)**

Now assume that the embedding space has 256 dimensions (256, 512, or 1024 dimensions are typical; the larger the vocabulary, the higher the dimension tends to be). Then:

**the corresponds to (0.1, 0.2, 0.4, 0, ...) (vector length is 256)**

**boy corresponds to (0.23, 0.14, 0, 0, ...)**

**is corresponds to (0, 0, 0.41, 0.9, ...)**

**crying corresponds to (..., 0.82, 0, 0.14, ...)**

One-hot encoding is very simple, but the spatial dimension is high, and the distance between any two distinct words is always $\sqrt{2}$.
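
A quick numerical check of this claim, as a small sketch using a 4-word vocabulary: every pair of distinct one-hot vectors has a dot product of 0 (they are orthogonal) and a Euclidean distance of $\sqrt{2}$, so neither quantity says anything about meaning.

```python
import numpy as np

# Rows of the identity matrix are the one-hot vectors of a 4-word vocabulary.
one_hot_vectors = np.eye(4)

dots = []
dists = []
for i in range(4):
    for j in range(i + 1, 4):
        u, v = one_hot_vectors[i], one_hot_vectors[j]
        dots.append(float(u @ v))                   # orthogonality check
        dists.append(float(np.linalg.norm(u - v)))  # Euclidean distance

print(set(dots))   # a single value: every pair is orthogonal
print(set(dists))  # a single value: every pair is equally far apart
```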

But in practice, the word **boy should be very close to the word man** (because they are closely related), and **the word cat should be very far from the word stone** (because they are basically unrelated).

An embedding space has few dimensions and can carry structure.

For example, the differences between vectors can reflect gender, age, and so on (this requires training; an untrained embedding layer has no such structure), for example:

**man-woman = boy-girl**

**man-father = woman-mother**

In Keras, the Embedding layer requires two arguments: the vocabulary size (the number of distinct tokens) and the embedding dimension.

In [None]:
from tensorflow import keras
from keras.layers import Embedding
embedding_layer = Embedding(1000,64)

1000: the vocabulary size (it can be thought of as the number of all words in the vocabulary).

64: the dimension of the embedding space. The 64 values can be thought of as 64 attributes. As an analogy, a person can be described by eye shape, nose shape, mouth shape, height, weight, age, and so on; taken together, these attributes describe one person (one word). A thousand such descriptions cover all the words in the vocabulary.

Embedding layer input: a two-dimensional tensor with the shape (samples, sequential_length)

* samples: the different sentences.
* sequential_length: the number of words in each sentence. Each word is represented by an integer index, so a sentence is a row of sequential_length integers.

Embedding layer output: a three-dimensional tensor with the shape (samples, sequential_length, dimensionality)

* samples: the different sentences.
* sequential_length: the number of words in each sentence.
* dimensionality: the number of channels (the embedding dimension). The vector of values across all channels at a given sample and position represents one word; for example, the slice (0, 0, :) is the vector for the first word of the first sentence.

The embedding layer can be regarded as a matrix. Suppose the input has shape (100, 20), that is, 100 sequences of length 20, the vocabulary size is 10000, and the layer is Embedding(10000, 8); then the output has shape (100, 20, 8). This is because, after one-hot encoding, each sequence can be regarded as a (20, 10000) matrix, and multiplying it by the (10000, 8) embedding matrix gives a (20, 8) matrix. Passing 100 such sequences through the embedding layer therefore yields (100, 20, 8).
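
The matrix view can be checked numerically. The sketch below uses NumPy rather than Keras, and deliberately small stand-in sizes (a vocabulary of 50, a batch of 4, sequences of length 5) so the one-hot tensor stays tiny; the random matrix plays the role of the trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 50, 8   # small stand-ins for 10000 and 8
batch, seq_len = 4, 5           # small stand-ins for 100 and 20

embedding_matrix = rng.normal(size=(vocab_size, embed_dim))
token_ids = rng.integers(0, vocab_size, size=(batch, seq_len))

# Route 1: one-hot encode every token, then multiply by the embedding matrix.
one_hot = np.eye(vocab_size)[token_ids]   # shape (batch, seq_len, vocab_size)
via_matmul = one_hot @ embedding_matrix   # shape (batch, seq_len, embed_dim)

# Route 2: look up the rows directly, skipping the one-hot step.
via_lookup = embedding_matrix[token_ids]

print(via_matmul.shape)
print(np.allclose(via_matmul, via_lookup))
```

Both routes give the same tensor, which is why an embedding layer can be implemented as a plain row lookup instead of a large sparse matrix product.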

In [None]:
#Instantiate an Embedding layer

from keras.models import Sequential
from keras.layers import Flatten,Dense,Embedding

model = Sequential()
model.add(Embedding(10000,8,input_length=20))
model.add(Flatten())
model.add(Dense(1,activation='sigmoid'))
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 20, 8)             80000     
                                                                 
 flatten_1 (Flatten)         (None, 160)               0         
                                                                 
 dense_1 (Dense)             (None, 1)                 161       
                                                                 
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________


In [None]:
from keras.datasets import imdb
from keras import preprocessing
from keras.models import Sequential
from keras.layers import Flatten,Dense,Embedding
from keras_preprocessing.sequence import pad_sequences

max_features = 10000  # vocabulary size: keep only the 10,000 most frequent words
maxlen = 20           # every review is padded or truncated to 20 tokens

(x_train,y_train),(x_test,y_test) = imdb.load_data(num_words=max_features)

x_train = pad_sequences(x_train,maxlen=maxlen)
x_test = pad_sequences(x_test,maxlen=maxlen)


model = Sequential()

model.add(Embedding(max_features,8,input_length=maxlen))  # output: (samples, maxlen, 8)
model.add(Flatten())                                      # output: (samples, maxlen * 8)
model.add(Dense(1,activation='sigmoid'))                  # probability of a positive review


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [None]:
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['acc'])
history = model.fit(x_train,y_train,epochs=10,batch_size=32,validation_split=0.2,verbose=1)

Instructions for updating:
Use tf.cast instead.
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### It can be seen that the embedding layer has 8 × 10000 = 80,000 trainable parameters. Each row of the trained embedding matrix is the vector for one word.

>Use the Embedding layer and classifier on IMDB data.

>The IMDB dataset built into Keras comes with reviews already labeled as positive or negative and already converted to sequences of word indices. We use the Embedding layer and a classifier to train a neural network and see how well it performs.


# BOW

**The bag-of-words model is a commonly used document representation method in the field of information retrieval.**

>In information retrieval, the BOW model ignores a document's word order, grammar, syntax and other factors, and treats the document as a collection of words. Each word's appearance in the document is assumed to be independent of whether other words appear. **(The words are unordered.)**

>The Bag-of-words model (BoW model) ignores the grammar and word order of a text, and uses a set of unordered words to express a text or a document.

#### Let's take an example

`John likes to watch movies. Mary likes too.`

`John also likes to watch football games.`

Build a dictionary based on the words that appear in the above two sentences:

`{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}`


The dictionary contains 10 words, each with a unique index. Note that the index order does not have to follow the order in which the words appear in the text. According to this dictionary, we re-express the above two texts as the following two vectors:

`[1, 2, 1, 1, 1, 0, 0, 0, 1, 1]`

`[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]`


Each of these two vectors contains 10 elements, where the i-th element is the number of times the i-th word in the dictionary appears in the text.
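
The counting step can be sketched with the standard library alone. The dictionary is copied from the example above; the function name `bow_vector` and the simple `\w+` tokenizer are assumptions made for this illustration.

```python
from collections import Counter
import re

# Dictionary from the example above (1-based indices).
vocabulary = {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
              "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

def bow_vector(text, vocabulary):
    """Count how often each dictionary word occurs in the text."""
    tokens = re.findall(r"\w+", text)   # split on non-word characters
    counts = Counter(tokens)
    vec = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vec[index - 1] = counts[word]   # shift to 0-based list positions
    return vec

doc1 = bow_vector("John likes to watch movies. Mary likes too.", vocabulary)
doc2 = bow_vector("John also likes to watch football games.", vocabulary)
print(doc1)
print(doc2)
```

Shuffling the words inside either text leaves its vector unchanged, which is exactly the sense in which the model ignores word order.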

Now imagine a **huge document set D with a total of M documents**. After all the words in the document are extracted, they form a dictionary containing N words. Using the Bag-of-words model, **each document can be represented as an N-dimensional vector**.


Therefore, the BoW model can be considered as a statistical histogram. It is used in text retrieval and processing applications.

## TF-IDF

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a commonly used weighting technique in information retrieval and text mining.

TF-IDF is a statistical method for evaluating how important a word is to a document within a collection or corpus. The importance of the word increases in proportion to the number of times it appears in the document, but is offset by how frequently it appears across the corpus.

* **Term frequency (TF)**: the number of times a given word appears in a document. This count is usually normalized (divided by the total number of words in the document, so the result lies between 0 and 1) to avoid favoring long documents, because a term is likely to appear more often in a long document than in a short one, whether or not it is important.

> **TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).**

Term frequency (TF) thus indicates how often a term (keyword) appears in the text, relative to the length of the text.

**Formula of Tf**    ![title](img/tf.png)


where $n_{i,j}$ is the number of occurrences of the word $t_i$ in the file $d_j$, and the denominator is the sum of the occurrences of all words in the file $d_j$.

* **Inverse document frequency (IDF)**: a measure of the general importance of a word. The main idea is that the fewer documents contain the term t, the larger its IDF, which means the term is good at distinguishing between categories. The IDF of a specific word is calculated by dividing the total number of files by the number of files containing the word, and then taking the logarithm of the quotient.

>**IDF(t) = log_e(Total number of documents / Number of documents with term t in it).**

**Formula of Idf**  ![title](img/idf1.png)


where

* $|D|$: the total number of files in the corpus

* $|\{j : t_i \in d_j\}|$: the number of files containing the word $t_i$ (that is, the number of files with $n_{i,j} \neq 0$). If the word does not occur in the corpus, the divisor is zero, so $1 + |\{j : t_i \in d_j\}|$ is generally used instead.


**So, Formula of tf-Idf** ![title](img/ttttt.png)

#### Example:

Consider a document containing 100 words in which the word cat appears 3 times.

The **term frequency (Tf) for cat** is then **(3 / 100) = 0.03**. Now, assume we have 10 million documents and the word cat appears in one thousand of these.

Then, using a base-10 logarithm, the **inverse document frequency (Idf)** is calculated as **log10(10,000,000 / 1,000) = 4.** (With the natural logarithm, the value would be ln(10,000) ≈ 9.21; the choice of base only rescales the weights.)

Thus, the **Tf-idf** weight is the product of these quantities: **0.03 * 4 = 0.12.**
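
The arithmetic of this example can be reproduced directly. This is a minimal sketch: the helper names `tf` and `idf` are made up for the illustration, and the base-10 logarithm is chosen to match the numbers in the example.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: occurrences of the term divided by document length."""
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    """Inverse document frequency, using a base-10 logarithm as in the example."""
    return math.log10(total_docs / docs_with_term)

tf_cat = tf(3, 100)                 # cat appears 3 times in a 100-word document
idf_cat = idf(10_000_000, 1_000)    # 10 million documents, 1,000 contain cat
tfidf_cat = tf_cat * idf_cat

print(tf_cat, idf_cat, round(tfidf_cat, 2))
```

For a term that might not occur in the corpus at all, the denominator would need the `1 +` adjustment described above to avoid a division by zero.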

#### TF-IDF application

1.  **Search engine**
2.  **Keyword extraction**
3.  **Text similarity**
4.  **Text summary**

# N-gram

**Wikipedia definition**:

**In computational linguistics, an n-gram is a sequence of n consecutive items in a text (the items can be phonemes, syllables, letters, words or base pairs)**

N-grams of texts are widely used in the field of text mining and natural language processing. They are basically a set of co-occurring words within a defined window and when computing the n-grams, we typically move one word forward or more depending upon the scenario.

>For example, for the sentence **“The cow jumps over the moon”**. If **N=2** (known as bigrams), then the ngrams would be:

* the cow
* cow jumps
* jumps over
* over the
* the moon

In n-gram terminology, **n = 1 is a unigram**, **n = 2 is a bigram**, and **n = 3 is a trigram**.

For **n ≥ 4**, the number is used directly, as in **4-gram, 5-gram**.

N-grams are often used for comparing sentence similarity, fuzzy queries, judging whether a sentence is plausible, sentence correction, etc.
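
A single helper covers every value of n. The sketch below assumes the text has already been split into words on whitespace and lowercased; the function name `ngrams` is an illustrative choice.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The cow jumps over the moon".lower().split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of k words yields k - n + 1 n-grams, so the six-word sentence above gives five bigrams and four trigrams.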


The n-gram can represent the semantic association reflected by the positional relationship between words. Before explaining the n-gram model, we start from the probability of a whole sentence.

Suppose a sentence S is an ordered arrangement of n words, and is written as: ![title](img/gif.gif)


We abbreviate it as $W_1^n$; then the probability of this sentence is: ![title](img/ng.gif)

For each single factor, which is the probability that a word appears given the words before it, we can use the conditional probability formula to get: ![title](img/ng1.gif)

The last expression is estimated from frequencies in the corpus. However, sentences (or text with the punctuation removed) can be very long, and words that appear much earlier have little influence on predicting the current word. We therefore use the Markov assumption that the probability of a word depends only on the n-1 words before it. This is the idea behind the n-gram model.

So the above formula becomes: ![title](img/ng2.gif)
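
For n = 2, the usual way to estimate these probabilities is by counting: P(w | prev) ≈ count(prev, w) / count(prev). The sketch below applies that estimate to a made-up three-sentence corpus; the corpus, the `<s>` and `</s>` boundary markers, and the function names are assumptions for this illustration, and no smoothing is applied.

```python
from collections import Counter

# Made-up toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    ["<s>", "i", "like", "cats", "</s>"],
    ["<s>", "i", "like", "dogs", "</s>"],
    ["<s>", "you", "like", "cats", "</s>"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def bigram_prob(prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """Multiply the bigram probabilities along the sentence."""
    prob = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        prob *= bigram_prob(prev, word)
    return prob

p = sentence_prob(["<s>", "i", "like", "cats", "</s>"])
print(p)  # (2/3) * (2/2) * (2/3) * (2/2) = 4/9
```

Any bigram that never occurs in the corpus gets probability 0 under this estimate, which is why practical language models add smoothing.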


#### Determination of N in N-gram

To choose the value of N, "Language Modeling with Ngrams" uses the **perplexity** metric. The lower the perplexity, the better the language model.

>The article uses a Wall Street Journal database with a dictionary size of 19,979. The training set contains 38 million words and the test set contains 1.5 million words.

For different N-grams, calculate their respective perplexity.

![title](img/formula.png)

The results show that Tri-gram's Perplexity is the smallest, so it works best.
![title](img/result.png)
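
As a rough illustration of the metric, the sketch below uses the common definition of perplexity as the inverse probability of the test text, normalized by its number of words: PP = P(w_1 ... w_N)^(-1/N). It takes the per-word probabilities assigned by some model as input; the function name and the example probabilities are made up, and the computation runs in log space to avoid underflow on long texts.

```python
import math

def perplexity(word_probs):
    """Perplexity of a text, given the probability the model gave each word."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)  # log P(w_1 ... w_N)
    return math.exp(-log_prob / n)                   # P^(-1/N)

# A model that spreads its probability over 10 equally likely words...
uniform = perplexity([0.1] * 20)
# ...versus one that gives every observed word probability 0.5.
confident = perplexity([0.5] * 20)

print(uniform, confident)
```

The first model has a perplexity of about 10 and the second about 2: a lower value means the model was, on average, less surprised by each word.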

### Unigram Implementation

In [None]:
import jieba

text = "I am going to the United States"
cut = jieba.cut(text)
sent = list(cut)
print(sent)

['I', ' ', 'am', ' ', 'going', ' ', 'to', ' ', 'the', ' ', 'United', ' ', 'States']


### Bigram Implementation

In [None]:
sent = "I will go to United States"
tokens = sent.split(" ")

# Pair each word with the word that follows it
bigrams = []
for i in range(len(tokens) - 1):
    bigrams.append(tokens[i] + " " + tokens[i + 1])

print(bigrams)

['I will', 'will go', 'go to', 'to United', 'United States']


### Trigram Implementation

In [None]:
import re

# Punctuation to strip before splitting into words
punctuation_pattern = re.compile(r'[.,!?"]')

sent = "I will go to United States"
no_punctuation_sent = re.sub(punctuation_pattern, " ", sent)
lst_sent = no_punctuation_sent.split()  # split() also drops empty strings
trigram = []
for i in range(len(lst_sent) - 2):
    trigram.append(lst_sent[i] + " " + lst_sent[i + 1] + " " + lst_sent[i + 2])

In [None]:
trigram

['I will go', 'will go to', 'go to United', 'to United States']

# Word2Vec

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/005_BOKTIAR_AHMED_BAPPY/My_classes/FSDS-Bootcamp/NLP/Text-Representation

/content/drive/MyDrive/005_BOKTIAR_AHMED_BAPPY/My_classes/FSDS-Bootcamp/NLP/Text-Representation


In [3]:
import numpy as np
import pandas as pd
import gensim
import os

In [4]:
!pip install --upgrade gensim --user

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
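story = []
data_dir = 'data'  # folder holding the text files (assumed; adjust to your layout)
for filename in os.listdir(data_dir):
    if filename == '.ipynb_checkpoints':
        continue  # skip Jupyter's checkpoint directory
    with open(os.path.join(data_dir, filename)) as f:
        corpus = f.read()
    # Split into sentences, then lowercase and tokenize each sentence
    raw_sent = sent_tokenize(corpus)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))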

In [7]:
len(story)

8602

In [8]:
story

[['game',
  'of',
  'thrones',
  'book',
  'one',
  'of',
  'song',
  'of',
  'ice',
  'and',
  'fire',
  'by',
  'george',
  'martin',
  'prologue',
  'we',
  'should',
  'start',
  'back',
  'gared',
  'urged',
  'as',
  'the',
  'woods',
  'began',
  'to',
  'grow',
  'dark',
  'around',
  'them'],
 ['the', 'wildlings', 'are', 'dead'],
 ['do', 'the', 'dead', 'frighten', 'you'],
 ['ser',
  'waymar',
  'royce',
  'asked',
  'with',
  'just',
  'the',
  'hint',
  'of',
  'smile'],
 ['gared', 'did', 'not', 'rise', 'to', 'the', 'bait'],
 ['he',
  'was',
  'an',
  'old',
  'man',
  'past',
  'fifty',
  'and',
  'he',
  'had',
  'seen',
  'the',
  'lordlings',
  'come',
  'and',
  'go'],
 ['dead', 'is', 'dead', 'he', 'said'],
 ['we', 'have', 'no', 'business', 'with', 'the', 'dead'],
 ['are', 'they', 'dead'],
 ['royce', 'asked', 'softly'],
 ['what', 'proof', 'have', 'we'],
 ['will', 'saw', 'them', 'gared', 'said'],
 ['if',
  'he',
  'says',
  'they',
  'are',
  'dead',
  'that',
  'proof',


In [9]:
story[0]

['game',
 'of',
 'thrones',
 'book',
 'one',
 'of',
 'song',
 'of',
 'ice',
 'and',
 'fire',
 'by',
 'george',
 'martin',
 'prologue',
 'we',
 'should',
 'start',
 'back',
 'gared',
 'urged',
 'as',
 'the',
 'woods',
 'began',
 'to',
 'grow',
 'dark',
 'around',
 'them']

In [10]:
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)

In [11]:
model.build_vocab(story)

In [12]:
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(322490, 447775)

In [13]:
model.wv.most_similar('daenerys')

[('still', 0.9989397525787354),
 ('two', 0.9989286661148071),
 ('an', 0.9989266395568848),
 ('all', 0.9989076256752014),
 ('dothraki', 0.9989068508148193),
 ('while', 0.9989063143730164),
 ('three', 0.9989057183265686),
 ('other', 0.9989027976989746),
 ('bran', 0.9988930225372314),
 ('illyrio', 0.998889148235321)]

In [14]:

model.wv.doesnt_match(['jon','rikon','robb','arya','sansa','bran'])



'bran'

In [15]:
model.wv.similarity('arya','sansa')

0.9997505
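
The number returned by `similarity` is the cosine similarity between the two word vectors. A minimal NumPy sketch of that computation follows; the three small vectors are made up for illustration and are not taken from the trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a
c = np.array([-3.0, 0.0, 1.0])   # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

Applied to the trained model, `cosine_similarity(model.wv['arya'], model.wv['sansa'])` should agree with `model.wv.similarity('arya', 'sansa')` up to floating-point error. Values very close to 1 for almost every pair, as in the outputs above, usually mean the vectors have not separated much, which is common with a small corpus and few training epochs.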

In [17]:
model.wv['king'].shape

(100,)

In [18]:
model.wv.get_normed_vectors()

array([[-0.11576346,  0.06064302,  0.08447791, ..., -0.10003971,
         0.09316063,  0.01833937],
       [-0.11743305,  0.06115282,  0.08263383, ..., -0.09658253,
         0.0886247 ,  0.01195555],
       [-0.09199662,  0.05163564,  0.0779503 , ..., -0.08789031,
         0.08356459, -0.01736212],
       ...,
       [-0.08012906,  0.11552733,  0.10509431, ..., -0.10942402,
         0.08625774,  0.02648468],
       [-0.08696802,  0.08001342,  0.05666172, ..., -0.11312983,
         0.06970461, -0.02556758],
       [-0.02454497,  0.12391928,  0.0076316 , ..., -0.06914677,
         0.08990334, -0.02390001]], dtype=float32)

In [19]:
model.wv.get_normed_vectors().shape

(3840, 100)

In [26]:

y = model.wv.index_to_key

In [27]:
len(y)

3840

In [20]:

from sklearn.decomposition import PCA

In [21]:

pca = PCA(n_components=3)

In [22]:

X = pca.fit_transform(model.wv.get_normed_vectors())

In [23]:
X

array([[-0.01521304,  0.19706357,  0.00503732],
       [-0.0226582 ,  0.17192905,  0.00427064],
       [-0.05249996, -0.01761406, -0.00416719],
       ...,
       [ 0.02249863,  0.02134849,  0.05642961],
       [-0.02372259,  0.05915458, -0.01235134],
       [ 0.04225143, -0.11634178, -0.02176546]], dtype=float32)

In [24]:
X.shape

(3840, 3)

In [28]:
import plotly.express as px
fig = px.scatter_3d(X[200:300],x=0,y=1,z=2, color=y[200:300])
fig.show()