<a href="https://colab.research.google.com/github/AIWalaBro/Complete_NLP/blob/main/2_word_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embeddings


Whenever we apply any algorithm to textual data, we need to convert the text to a numeric form. Hence, there arises a need for some pre-processing techniques that can convert our text to numbers.

This is where One-Hot Encoding (OHE), Bag-of-Words (BoW), and TF-IDF come into play. BoW and TF-IDF, in particular, are techniques that convert whole sentences into numeric vectors.

### Example:
- Review 1: This movie is very scary and long
- Review 2: This movie is not scary and is slow
- Review 3: This movie is spooky and good

You can see that there are some contrasting reviews about the movie as well as its length and pace. Imagine looking at a thousand reviews like these. Clearly, there are a lot of interesting insights we can draw from them and build upon to gauge how well the movie performed.

However, we cannot simply give these sentences to a machine learning model and ask it to tell us whether a review was positive or negative. We first need to perform certain text preprocessing steps.


**Word embedding is one such technique, in which we represent the text using vectors. The more popular forms of word embeddings are:**

1. BoW, which stands for Bag of Words
2. TF-IDF, which stands for Term Frequency-Inverse Document Frequency

### Why do we need Word Embeddings?

As we know, many machine learning algorithms and almost all deep learning architectures cannot process strings or plain text in their raw form.

They require numbers as inputs to perform any sort of task, such as classification, regression, or clustering. Also, given the huge amount of data that exists in text form, it is important to extract knowledge from it and build useful applications.

In short, to build any model in machine learning or deep learning, the final data has to be in numerical form, because models don’t understand text or image data directly as humans do.

### How do NLP models learn patterns from text data?

To convert text data into numerical data, we need some smart techniques, known as vectorization or, in the NLP world, word embeddings.

Vectorization, or word embedding, is the process of converting text data into numerical vectors. Those vectors are later used to build various machine learning models. In other words, we extract features from the text with the aim of building natural language processing models.

### Getting Familiar with the Terminology

Before understanding vectorization, here are a few terms that you need to know.

**Document**
A document is a single text data point. For example, a review of a particular product by a user.

**Corpus**
It is the collection of all the documents present in our dataset.

**Feature**
Every unique word in the corpus is considered as a feature.

**For example**, let’s consider the 2 documents shown below:

**Sentences:**
**Dog hates a cat. It loves to go out and play.**
**Cat loves to play with a ball.**

We can build a corpus from the above 2 documents just by combining them.

**Corpus = “Dog hates a cat. It loves to go out and play. Cat loves to play with a ball.”**

And features will be all unique words:

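**Features: [‘and’, ‘ball’, ‘cat’, ‘dog’, ‘go’, ‘hates’, ‘it’, ‘loves’, ‘out’, ‘play’, ‘to’, ‘with’]**

Note that the one-letter word ‘a’ is left out of this list; many tokenizers, including scikit-learn's default, drop single-character tokens.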
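The feature list above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the tokenization rule (lowercase, keep runs of two or more word characters, which drops the one-letter word "a") is an assumption chosen to match the list shown, not a rule stated in the text.

```python
import re

# The two example documents from the text.
documents = [
    "Dog hates a cat. It loves to go out and play.",
    "Cat loves to play with a ball.",
]

# The corpus is simply all documents combined.
corpus = " ".join(documents)

# Assumed tokenization rule: lowercase, keep runs of two or more word
# characters (so the one-letter word "a" is dropped).
tokens = re.findall(r"\b\w\w+\b", corpus.lower())

# Features are the unique words, sorted alphabetically.
features = sorted(set(tokens))
print(features)
```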

## One-Hot Encoding (OHE)

In this technique, each unique word in the vocabulary is represented by a vector whose length equals the vocabulary size. The vector holds a 1 at the position assigned to that word and a 0 at every other position.

![2.png](attachment:2.png)

Example:

**Sentence: I am teaching NLP in Python**

A word in this sentence may be “NLP”, “Python”, “teaching”, etc.

A dictionary is the list of all unique words present in the sentence. So, a dictionary may look like this:

**Dictionary: [‘I’, ‘am’, ‘teaching’, ‘NLP’, ‘in’, ‘Python’]**

Therefore, the vector representation according to the above dictionary is:

**Vector for NLP: [0,0,0,1,0,0]**

**Vector for Python:  [0,0,0,0,0,1]**

This is just a very simple method to represent a word in vector form.
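As a minimal sketch of the idea, the vectors above can be produced as follows. The helper name `one_hot` is hypothetical, and the dictionary is built in order of first appearance so that positions match the example.

```python
def one_hot(word, dictionary):
    """Return a one-hot vector for `word` over `dictionary` (a list of unique words)."""
    vector = [0] * len(dictionary)
    vector[dictionary.index(word)] = 1  # raises ValueError for unknown words
    return vector

sentence = "I am teaching NLP in Python"

# Build the dictionary in order of first appearance
# (dict.fromkeys removes duplicates and keeps insertion order).
dictionary = list(dict.fromkeys(sentence.split()))

print(dictionary)
print(one_hot("NLP", dictionary))     # [0, 0, 0, 1, 0, 0]
print(one_hot("Python", dictionary))  # [0, 0, 0, 0, 0, 1]
```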

#### Disadvantages of One-hot Encoding

1. The size of each vector is equal to the number of unique words in the vocabulary, so the vectors become very large (and almost entirely zero) as the vocabulary grows.

2. One-hot encoding does not capture the relationships between different words. Therefore, it does not convey information about the context.
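Both disadvantages can be checked directly. The sketch below reuses the six-word dictionary from the example and uses the dot product as a simple (assumed) measure of similarity: every pair of distinct words scores 0, so no pair of words looks more related than any other.

```python
from itertools import combinations

dictionary = ["I", "am", "teaching", "NLP", "in", "Python"]

# One-hot vectors: the vector for word i has a 1 only at position i.
vectors = {
    word: [int(i == j) for j in range(len(dictionary))]
    for i, word in enumerate(dictionary)
}

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Similarity score for every pair of distinct words.
pair_scores = {
    (a, b): dot(vectors[a], vectors[b])
    for a, b in combinations(dictionary, 2)
}

print(set(pair_scores.values()))  # {0}: all pairs are equally unrelated
print(len(vectors["NLP"]))        # 6: vector length equals vocabulary size
```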

## Bag of Words (BoW) Model

The Bag of Words (BoW) model is the simplest form of text representation in numbers. As the name suggests, we represent a sentence as a bag-of-words vector (a string of numbers).

Let’s recall the three movie reviews we saw earlier:

- This movie is very scary and long
- This movie is not scary and is slow
- This movie is spooky and good

We will first build a vocabulary from all the unique words in the above three reviews. The vocabulary consists of these 11 words: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’,  ‘slow’, ‘spooky’,  ‘good’.

We can now take each of these words and count how many times it occurs in each of the three movie reviews above (0 if it does not occur). This will give us 3 vectors for 3 reviews:

![1.jpg](attachment:1.jpg)

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]

Vector of Review 2: [1 1 2 0 1 1 0 1 1 0 0]

Vector of Review 3: [1 1 1 0 0 1 0 0 0 1 1]

And that’s the core idea behind a Bag of Words (BoW) model.

#### Drawbacks of using a Bag-of-Words (BoW) Model

In the above example, the vectors have length 11. However, we start facing issues when we come across new sentences:

1. If the new sentences contain new words, the vocabulary grows, and the length of the vectors grows with it.
2. The vectors also contain many 0s, resulting in a sparse matrix (which is what we would like to avoid).
3. We retain no information about the grammar of the sentences or the order of the words in the text.

As a second example, the bag-of-words model converts the following three sentences into fixed-length vectors by counting how many times each word appears:

- Text processing is necessary.
- Text processing is necessary and important.
- Text processing is easy.

If we take out the unique words in all these sentences, the vocabulary will consist of these 7 words: {‘Text’, ‘processing’, ‘is’, ‘necessary’, ‘and’, ‘important’, ‘easy’}.

To carry out bag-of-words, we will simply have to count the number of times each word appears in each of the documents.

![2.jpg](attachment:2.jpg)

This gives the following vectors of fixed length 7, one for each document:

Document 1: [1,1,1,1,0,0,0]

Document 2: [1,1,1,1,1,1,0]

Document 3: [1,1,1,0,0,0,1]
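The counting step can be sketched in plain Python as shown below. The tokenization rule (lowercase, strip punctuation) and the choice to order the vocabulary by first appearance are assumptions made so that the columns line up with the vocabulary listed above.

```python
import re

documents = [
    "Text processing is necessary.",
    "Text processing is necessary and important.",
    "Text processing is easy.",
]

def tokenize(text):
    # Assumed rule: lowercase and keep alphanumeric runs, dropping punctuation.
    return re.findall(r"\w+", text.lower())

# Vocabulary in order of first appearance across the documents.
vocabulary = list(dict.fromkeys(tok for doc in documents for tok in tokenize(doc)))

# One count per vocabulary word for each document.
bow_vectors = [
    [tokenize(doc).count(word) for word in vocabulary]
    for doc in documents
]

print(vocabulary)
for vector in bow_vectors:
    print(vector)
```

Each printed row has one entry per vocabulary word, so every document maps to a vector of the same length regardless of how many words it contains.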

#### Limitations of Bag-of-Words

If we use bag-of-words to generate vectors for large documents, the vectors will be very long and will contain mostly zeros, which leads to sparse vectors.

Bag-of-words also ignores word order, so it loses much of the information about the meaning of the text.

*For example:* if we consider these two sentences – “Text processing is easy but tedious.” and “Text processing is tedious but easy.” – a bag-of-words model would create the same vectors for both of them, even though they have different meanings.
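This loss of word order is easy to verify. The sketch below uses `collections.Counter` as a stand-in for a bag-of-words vector; the helper name `bag_of_words` and the tokenization rule are assumptions for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Word counts for a sentence, ignoring case, punctuation, and word order."""
    return Counter(re.findall(r"\w+", text.lower()))

a = "Text processing is easy but tedious."
b = "Text processing is tedious but easy."

print(a == b)                              # False: the sentences differ
print(bag_of_words(a) == bag_of_words(b))  # True: their bags of words do not
```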

Example:
    
- This burger is very tasty and affordable.
- This burger is not tasty and is affordable.
- This burger is very very delicious.

**Unique words: [“affordable”, “and”, “burger”, “delicious”, “is”, “not”, “tasty”, “this”, “very”]**

In [57]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This burger is very tasty and affordable.",
    "This burger is not tasty and is affordable.",
    "This burger is very very delicious.",
]
countvect = CountVectorizer()
x = countvect.fit_transform(corpus)  # sparse document-term count matrix
result = x.toarray()
print(result)

[[1 1 1 0 1 0 1 1 1]
 [1 1 1 0 2 1 1 1 0]
 [0 0 1 1 1 0 0 1 2]]


In [58]:
import nltk
import re
import numpy as np

# Sample text used in the examples below
text = """If learning what is data science sounded interesting, understanding what does this job roles is all about
                will me much more interesting to you.
                Data scientists are among the most recent analytical data professionals who have the technical
                ability to handle complicated issues as well as the desire to investigate what questions need to be answered.
                They're a mix of mathematicians, computer scientists, and trend forecasters.
                They're also in high demand and well-paid because they work in both the business and IT sectors.
                On a daily basis, a data scientist may do the following tasks."""

In [59]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [60]:
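# Split the text into sentences, then normalise each one
dataset = nltk.sent_tokenize(text)
for i in range(len(dataset)):
  dataset[i] = dataset[i].lower()               # lowercase
  dataset[i] = re.sub(r'\W', ' ', dataset[i])   # replace non-word characters with spaces
  dataset[i] = re.sub(r'\s+', ' ', dataset[i])  # collapse runs of whitespace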

In [61]:
dataset

['if learning what is data science sounded interesting understanding what does this job roles is all about will me much more interesting to you ',
 'data scientists are among the most recent analytical data professionals who have the technical ability to handle complicated issues as well as the desire to investigate what questions need to be answered ',
 'they re a mix of mathematicians computer scientists and trend forecasters ',
 'they re also in high demand and well paid because they work in both the business and it sectors ',
 'on a daily basis a data scientist may do the following tasks ']

### Creating a bag of words from scratch

In [62]:
word2count = {}

# Count how many times each word occurs across all sentences
for data in dataset:
  words = nltk.word_tokenize(data)
  for word in words:
    word2count[word] = word2count.get(word, 0) + 1

In [63]:
word2count

{'if': 1,
 'learning': 1,
 'what': 3,
 'is': 2,
 'data': 4,
 'science': 1,
 'sounded': 1,
 'interesting': 2,
 'understanding': 1,
 'does': 1,
 'this': 1,
 'job': 1,
 'roles': 1,
 'all': 1,
 'about': 1,
 'will': 1,
 'me': 1,
 'much': 1,
 'more': 1,
 'to': 4,
 'you': 1,
 'scientists': 2,
 'are': 1,
 'among': 1,
 'the': 5,
 'most': 1,
 'recent': 1,
 'analytical': 1,
 'professionals': 1,
 'who': 1,
 'have': 1,
 'technical': 1,
 'ability': 1,
 'handle': 1,
 'complicated': 1,
 'issues': 1,
 'as': 2,
 'well': 2,
 'desire': 1,
 'investigate': 1,
 'questions': 1,
 'need': 1,
 'be': 1,
 'answered': 1,
 'they': 3,
 're': 2,
 'a': 3,
 'mix': 1,
 'of': 1,
 'mathematicians': 1,
 'computer': 1,
 'and': 3,
 'trend': 1,
 'forecasters': 1,
 'also': 1,
 'in': 2,
 'high': 1,
 'demand': 1,
 'paid': 1,
 'because': 1,
 'work': 1,
 'both': 1,
 'business': 1,
 'it': 1,
 'sectors': 1,
 'on': 1,
 'daily': 1,
 'basis': 1,
 'scientist': 1,
 'may': 1,
 'do': 1,
 'following': 1,
 'tasks': 1}

In [64]:
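import heapq
# Keep up to the 100 most frequent words; this text has fewer than 100 unique words, so all are kept
freq_words = heapq.nlargest(100, word2count, key=word2count.get)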

In [65]:
freq_words

['the',
 'data',
 'to',
 'what',
 'they',
 'a',
 'and',
 'is',
 'interesting',
 'scientists',
 'as',
 'well',
 're',
 'in',
 'if',
 'learning',
 'science',
 'sounded',
 'understanding',
 'does',
 'this',
 'job',
 'roles',
 'all',
 'about',
 'will',
 'me',
 'much',
 'more',
 'you',
 'are',
 'among',
 'most',
 'recent',
 'analytical',
 'professionals',
 'who',
 'have',
 'technical',
 'ability',
 'handle',
 'complicated',
 'issues',
 'desire',
 'investigate',
 'questions',
 'need',
 'be',
 'answered',
 'mix',
 'of',
 'mathematicians',
 'computer',
 'trend',
 'forecasters',
 'also',
 'high',
 'demand',
 'paid',
 'because',
 'work',
 'both',
 'business',
 'it',
 'sectors',
 'on',
 'daily',
 'basis',
 'scientist',
 'may',
 'do',
 'following',
 'tasks']

In [66]:
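X = []
for data in dataset:
  tokens = set(nltk.word_tokenize(data))  # tokenize each sentence only once
  vector = []
  for word in freq_words:
    # Binary bag of words: 1 if the word occurs in the sentence, else 0
    if word in tokens:
      vector.append(1)
    else:
      vector.append(0)
  X.append(vector)
X = np.asarray(X)
print(X)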

[[0 1 1 1 0 0 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0]
 [1 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0]
 [0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0]
 [1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
  0]
 [1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
  1]]


In [67]:
import nltk

paragraph =  """If learning what is data science sounded interesting, understanding what does this job roles is all about
                will me much more interesting to you.
                Data scientists are among the most recent analytical data professionals who have the technical
                ability to handle complicated issues as well as the desire to investigate what questions need to be answered.
                They're a mix of mathematicians, computer scientists, and trend forecasters.
                They're also in high demand and well-paid because they work in both the business and IT sectors.
                On a daily basis, a data scientist may do the following tasks."""

In [68]:
# cleaning text
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

In [69]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [70]:
ps = PorterStemmer()
wordnet = WordNetLemmatizer()  # created for reference; only the stemmer is used below
sentences = nltk.sent_tokenize(paragraph)

# Build the stop-word set once instead of on every iteration
stop_words = set(stopwords.words('english'))

corpus = []
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])  # keep letters only
    review = review.lower()
    review = review.split()
    # Drop stop words and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [71]:
corpus

['learn data scienc sound interest understand job role much interest',
 'data scientist among recent analyt data profession technic abil handl complic issu well desir investig question need answer',
 'mix mathematician comput scientist trend forecast',
 'also high demand well paid work busi sector',
 'daili basi data scientist may follow task']

In [72]:
paragraph

"If learning what is data science sounded interesting, understanding what does this job roles is all about\n                will me much more interesting to you.\n                Data scientists are among the most recent analytical data professionals who have the technical\n                ability to handle complicated issues as well as the desire to investigate what questions need to be answered.\n                They're a mix of mathematicians, computer scientists, and trend forecasters.\n                They're also in high demand and well-paid because they work in both the business and IT sectors.\n                On a daily basis, a data scientist may do the following tasks."

In [73]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
result = cv.fit_transform(corpus).toarray()


In [74]:
result

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 1,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
       [1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]])

## TF-IDF Vectorization

One problem with the bag-of-words approach is that it treats every word as equally important, even though some words are repeated far more often than others and say little about what a particular document is about. TF-IDF was designed to address this problem.

Term frequency-inverse document frequency (TF-IDF) gives a measure of the importance of a word that depends on how frequently it occurs in a document and in the corpus as a whole.

To understand TF-IDF, we will first look at its two components separately:

- Term frequency (TF)
- Inverse document frequency (IDF)

##### Term Frequency
Term frequency denotes how often a word occurs in a document. For a given word (x) and a particular document (y), it is defined as the ratio of the number of times x appears in y to the total number of words in y.

![tf.png](attachment:tf.png)

**For example,** consider the following document:

**Document: Cat loves to play with a ball**

For the above sentence, the term frequency of the word ‘cat’ is: tf(‘cat’) = 1 / 6

**Note:** The sentence “Cat loves to play with a ball” has 7 words in total, but the one-letter word ‘a’ is ignored, which leaves 6.
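The calculation above can be written as a small function. This is a minimal sketch: the helper name `term_frequency` is hypothetical, and the rule that drops one-letter words is an assumption chosen to match the note above. It does not handle empty documents.

```python
import re
from fractions import Fraction

def term_frequency(term, document):
    """Occurrences of `term` divided by the number of counted words in `document`."""
    # Assumed rule: lowercase and ignore one-letter words such as "a".
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    return Fraction(tokens.count(term.lower()), len(tokens))

tf_cat = term_frequency("cat", "Cat loves to play with a ball")
print(tf_cat)         # 1/6
print(float(tf_cat))
```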

In [76]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = ["This burger is very tasty and affordable.", "This burger is not tasty and is affordable.", "This burger is very very delicious."]
tfidf_counter = TfidfVectorizer()
vectors = tfidf_counter.fit_transform(corpus)

feature_names = tfidf_counter.get_feature_names_out()
print(f'features names are:{feature_names}')

# Convert the sparse matrix to a dense array and label the columns with the feature names
df = pd.DataFrame(vectors.toarray(), columns=feature_names)
print(df)

features names are:['affordable' 'and' 'burger' 'delicious' 'is' 'not' 'tasty' 'this' 'very']
   affordable       and    burger  delicious        is       not     tasty  \
0    0.414896  0.414896  0.322204   0.000000  0.322204  0.000000  0.414896   
1    0.346117  0.346117  0.268791   0.000000  0.537582  0.455102  0.346117   
2    0.000000  0.000000  0.282851   0.478909  0.282851  0.000000  0.000000   

       this      very  
0  0.322204  0.414896  
1  0.268791  0.000000  
2  0.282851  0.728445  
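To see where these numbers come from, the sketch below recomputes them in plain Python using the formula that scikit-learn documents for `TfidfVectorizer` with its default settings: raw counts as term frequency, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. Note that this term frequency is a raw count rather than the ratio defined earlier; the normalization step makes the difference irrelevant here. The tokenization regex and all helper names are assumptions for illustration.

```python
import math
import re

corpus = [
    "This burger is very tasty and affordable.",
    "This burger is not tasty and is affordable.",
    "This burger is very very delicious.",
]

def tokenize(text):
    # Assumed rule: lowercase, keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(text) for text in corpus]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Raw count times IDF for each term, scaled to unit Euclidean length."""
    raw = [doc.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf = [tfidf_row(doc) for doc in docs]

print(vocab)
for row in tfidf:
    print([round(value, 6) for value in row])
```

Words that appear in every review, such as "burger", get the lowest IDF (exactly 1 under this formula), while a word that appears in only one review, such as "delicious", gets the highest.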

