# Data Preprocessing Techniques for Natural Language Processing

## 1. Bag of Words

The idea is to **represent each sentence as a bag of words**, disregarding grammar and word order. In other words, only the occurrence of words in a sentence defines its meaning for the model.

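For example, given the two sentences "I have a dog" and "You have a cat", the vocabulary is (I, have, a, dog, you, cat). Each position in a sentence's representation counts how often the corresponding vocabulary word occurs, so the first sentence ("I have a dog") becomes 1,1,1,1,0,0, while the second sentence ("You have a cat") becomes 0,1,1,0,1,1.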

Pros and cons of Bag of Words:

1. Pros:
    1. Simple to implement and easy to understand.
2. Cons:
    1. As the input data grows, the vocabulary grows with it. This makes the representation matrix much larger and the computations more expensive.
    2. The matrix contains many 0s (i.e., it is a sparse matrix). Stored densely, a sparse matrix wastes a lot of memory while carrying little information.
    3. The biggest disadvantage of Bag of Words is that it cannot capture grammar or semantics, because word order and context are discarded.

In [None]:
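# bag of words
def calculate_bag_of_words(text, sentence):
    # lowercase and split on whitespace so we count words, not characters
    tokens = sentence.lower().split()
    # create a dictionary with every vocabulary word starting at 0
    freq_dict = dict.fromkeys(text, 0)
    # loop over the words in the sentence
    for word in tokens:
        # update the frequency of words that belong to the vocabulary
        if word in freq_dict:
            freq_dict[word] += 1
    # return the word-frequency dictionary
    return freq_dict

# the vocabulary is lowercased to match the tokenization above
text = ['i', 'have', 'a', 'dog', 'you', 'cat']
s1 = "I have a dog"
s2 = "You have a cat"
ans1 = calculate_bag_of_words(text, s1)
ans2 = calculate_bag_of_words(text, s2)
print(ans1)  # {'i': 1, 'have': 1, 'a': 1, 'dog': 1, 'you': 0, 'cat': 0}
print(ans2)  # {'i': 0, 'have': 1, 'a': 1, 'dog': 0, 'you': 1, 'cat': 1}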
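
The counts can also be laid out as fixed-order vectors, which is the form the 1,1,1,1,0,0 example uses. The sketch below is one possible way to do that in plain Python. The helper names `build_vocabulary` and `to_count_vector` are hypothetical, and whitespace tokenization with lowercasing is an assumption rather than a requirement of Bag of Words.

```python
# Hypothetical helper names; plain Python, no external libraries.
def build_vocabulary(sentences):
    """Collect unique lowercase words in order of first appearance."""
    vocabulary = []
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def to_count_vector(vocabulary, sentence):
    """Return one count per vocabulary word, in vocabulary order."""
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocabulary]


sentences = ["I have a dog", "You have a cat"]
vocabulary = build_vocabulary(sentences)
matrix = [to_count_vector(vocabulary, s) for s in sentences]

# Fraction of zero entries: a rough measure of how sparse the matrix is.
zeros = sum(row.count(0) for row in matrix)
sparsity = zeros / (len(matrix) * len(vocabulary))

print(vocabulary)  # ['i', 'have', 'a', 'dog', 'you', 'cat']
print(matrix)      # [[1, 1, 1, 1, 0, 0], [0, 1, 1, 0, 1, 1]]
print(sparsity)
```

Even with two short sentences, a third of the entries are zero. With a realistic vocabulary, most entries of each row would be zero, which is the sparsity problem listed among the cons.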

## 2. Word2Vec
Word2Vec represents each word in a text corpus as a vector in an N-dimensional space (the embedding space). The meaning of a word is defined by the contexts in which it appears.

There are two variants of Word2Vec:
1. Continuous Bag-of-Words (CBOW)
2. Skip-Gram

### CBOW
CBOW is a technique where the center word is predicted from its neighboring words. If our input sentence is **“I am reading the book.”**, then the inputs and labels for a window size of 3 (the center word plus one word on each side) would be:
- I, reading, for the label am
- am, the, for the label reading
- reading, book, for the label the
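
The list above can be generated mechanically. The following is a minimal sketch of the pair construction only, not of Word2Vec training. The function name `cbow_pairs` is hypothetical, and it assumes that a window size of 3 means the center word plus one neighbor on each side, and that positions without a full window are skipped.

```python
def cbow_pairs(tokens, window_size=3):
    """Build (context words, center word) pairs for full windows only."""
    half = window_size // 2  # number of neighbors on each side
    pairs = []
    for i in range(half, len(tokens) - half):
        # Neighbors to the left and right of position i, excluding i itself.
        context = tokens[i - half:i] + tokens[i + 1:i + half + 1]
        pairs.append((context, tokens[i]))
    return pairs


tokens = "I am reading the book".split()
pairs = cbow_pairs(tokens)
for context, label in pairs:
    print(context, "->", label)
```

Each printed line pairs the model's input (the context words) with the label it should predict (the center word).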

### Skip-Gram
In the Skip-Gram approach, given the center word, the model has to predict its neighboring words. It is quite literally the opposite of CBOW. Skip-Gram is typically slower to train than CBOW, but it tends to represent rare words better.

Let our given input sentence be “I am reading the book.” The corresponding Skip-Gram pairs for a window size of 3 would be:

- am, for labels I and reading
- reading, for labels am and the
- the, for labels reading and book
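
The Skip-Gram pairs can be sketched the same way. The function name `skip_gram_pairs` is hypothetical, and the sketch assumes a window size of 3 means one neighbor on each side of the center word, with edge positions skipped to match the list above. Each (center, neighbor) combination is emitted as a separate training pair.

```python
def skip_gram_pairs(tokens, window_size=3):
    """Build (center word, neighboring word) pairs for full windows only."""
    half = window_size // 2  # number of neighbors on each side
    pairs = []
    for i in range(half, len(tokens) - half):
        for j in range(i - half, i + half + 1):
            if j != i:  # the center word is the input, never its own label
                pairs.append((tokens[i], tokens[j]))
    return pairs


tokens = "I am reading the book".split()
pairs = skip_gram_pairs(tokens)
for center, label in pairs:
    print(center, "->", label)
```

Compared with CBOW, the roles are swapped: the center word is the input and each neighbor is a label.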

## 3. GloVe

GloVe (Global Vectors) is based on **matrix factorization techniques** applied to the word-context matrix. It first constructs a large matrix of (words x contexts) co-occurrence information: for each word (the rows), it counts how frequently that word appears in each context (the columns) in a large corpus. The number of contexts is of course large, since it is essentially combinatorial in size. This matrix is then factorized to obtain a lower-dimensional vector for each word.
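
To make the counting step concrete, here is a minimal sketch that builds co-occurrence counts for a toy corpus. It covers only the counting, not the factorization or GloVe's actual training objective. The function name `cooccurrence_counts` is hypothetical, and treating "context" as the words within one position of the target word is an assumption made for illustration.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=1):
    """Count how often each (word, context word) pair occurs within the window."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    counts[(word, tokens[j])] += 1
    return counts


corpus = ["I have a dog", "You have a cat"]
counts = cooccurrence_counts(corpus, window=1)

print(counts[("have", "a")])   # 2: "have a" appears in both sentences
print(counts[("a", "dog")])    # 1
print(counts[("dog", "cat")])  # 0: the two words never share a window
```

Each key corresponds to one cell of the (words x contexts) matrix. Pairs that never occur are simply absent from the counter, which is a compact way to store a matrix that is mostly zeros.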

What is the difference between Word2Vec and GloVe?

Word2Vec embeddings are learned by training a shallow feedforward neural network, while GloVe embeddings are learned through matrix factorization techniques.

## Reference
- [Word2Vec: A Study of Embeddings in NLP](https://pyimagesearch.com/2022/07/11/word2vec-a-study-of-embeddings-in-nlp/)