# NLP : Word Embedding_1

## Count Vectorizer:
It vectorizes the given text on the basis of word frequency, using one of the following two models as its base:

### 1. BOW: Bag of Words Model:
The BOW model is used in NLP to represent a given text, sentence, or document as a collection (bag) of **words, without giving any importance to grammar or to the order in which the words occur**. It keeps track of how often each word occurs in the text, and these counts can be used as features in many models.

Let's understand this with an example:

Text1 = "I went to have a cup of coffee but I ended up having lunch with her."

Text2 = "I don't understand, what is the problem here?"

BOW1 = {I:2, went:1, to:1, have:1, a:1, cup:1, of:1, coffee:1, but:1, ended:1, up:1, having:1, lunch:1, with:1, her:1}

BOW2= {I:1, don't:1, understand:1, what:1, is:1, the:1, problem:1, here:1}
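The two dictionaries above can be reproduced with a few lines of Python. This is a minimal sketch rather than how any particular library works: the `bag_of_words` helper and its regular expression (which keeps apostrophes so that "don't" stays one token) are assumptions made for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, ignoring order and punctuation."""
    # Hypothetical tokenizer: runs of letters and apostrophes, case preserved.
    tokens = re.findall(r"[A-Za-z']+", text)
    return Counter(tokens)

text1 = "I went to have a cup of coffee but I ended up having lunch with her."
text2 = "I don't understand, what is the problem here?"

bow1 = bag_of_words(text1)
bow2 = bag_of_words(text2)

print(dict(bow1))
print(dict(bow2))
```

Only "I" in Text1 has a count of 2; every other word occurs once, as in BOW1 and BOW2.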

**- BOW is mainly used for feature extraction:**
* The dictionary above is converted into a list that contains *only the frequency counts*, and on that basis the most frequent terms receive the largest weights.
* But "stop words" are the most frequent words that appear in a raw document.
* Thus, a high frequency count doesn't mean that the word is important. **To resolve this problem, "Tf-idf" was introduced.** We will discuss it later.

### 2. n-gram Model:

As discussed above, the BOW model doesn't keep the sequence of words in a given text; only the frequency of words matters. It doesn't take into account the context of the sentence, or grammatical patterns such as a verb following a proper noun. The n-gram model is used in such cases to preserve some of the local context of the text.
An n-gram is a sequence of n consecutive words from a given text/document.

When,

    n=1, we call it a "unigram".
    n=2, it is called a "bigram".
    n=3, it is called a "trigram".
    And so on...
    
Let's understand this with an example:

Text1= "I went to have a cup of coffee but I ended up having lunch with her."

* Unigram
        [I, went, to, have, a, cup, of, coffee, but, I, ended, up, having, lunch, with, her]

* Bi-gram
        [I went],[went to],[to have],[have a],[a cup],[cup of],[of coffee],[coffee but],[but I],[I ended],[ended up],[up having],[having lunch],[lunch with],[with her]

* Tri-gram
        [I went to],[went to have],[to have a],[have a cup],[a cup of],[cup of coffee],[of coffee but],[coffee but I],[but I ended],[I ended up],[ended up having],[up having lunch],[having lunch with],[lunch with her].

**Note: We can clearly see that the BOW model is nothing but the n-gram model with n=1.**
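A sliding window over the token list is enough to generate these n-grams. The sketch below is illustrative only: the `ngrams` helper and the tokenization (strip the final period, then split on whitespace) are assumptions, not part of any library.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order of appearance."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text1 = "I went to have a cup of coffee but I ended up having lunch with her."
# Assumed tokenization: drop the final period, then split on whitespace.
tokens = text1.rstrip(".").split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(len(unigrams), unigrams[:3])
print(len(bigrams), bigrams[:3])
print(len(trigrams), trigrams[:3])
```

A text with 16 tokens yields 15 bigrams and 14 trigrams, which matches the lists above.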

#### Skip-grams
Skip-grams are a type of n-gram in which the words keep their original order but need not be adjacent in the given text, i.e. some words can be skipped.

Example:

Text2= "I don't understand, what is the problem here?"

1-skip 2-grams (we form 2-grams while skipping 1 word between the two words of each pair):

[I understand, don't what, understand is, what the, is problem, the here].
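The list above can be generated by pairing each word with the word two positions later. This is a rough sketch under stated assumptions: `skip_bigrams` is a hypothetical helper that skips exactly `k` words, which is how the example above reads, and the punctuation stripping is a simplification. Some definitions of k-skip-n-grams also include pairs with fewer than `k` skipped words; that variant is not shown here.

```python
def skip_bigrams(tokens, k):
    """Pair each token with the token k + 1 positions later, i.e. skip exactly k words."""
    step = k + 1
    return [f"{tokens[i]} {tokens[i + step]}" for i in range(len(tokens) - step)]

text2 = "I don't understand, what is the problem here?"
# Assumed tokenization: split on whitespace and strip surrounding punctuation.
tokens = [word.strip(",.?!") for word in text2.split()]

print(skip_bigrams(tokens, 1))
```

With `k=0` the same helper returns ordinary bigrams.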


Let's see the implementation of Count Vectorizer in Python:

## BOW : Bag of Words
* The fitted vectorizer returns its feature names **in ascending (alphabetical) order**.

In [3]:
#Example with a single document
#Without stop-word removal

from sklearn.feature_extraction.text import CountVectorizer

from nltk.corpus import stopwords
import pandas as pd

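#A single document (commas separate the documents in the list)
string=["This is an example of bag of words!"]

#CountVectorizer tokenizes the text and builds the vocabulary
vect1=CountVectorizer()

vect1.fit_transform(string)
#get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print("bag of words",vect1.get_feature_names_out().tolist())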

bag of words ['an', 'bag', 'example', 'is', 'of', 'this', 'words']


In [5]:
vect1.vocabulary_ # the vocabulary_ attribute maps each token to its column index

{'this': 5, 'is': 3, 'an': 0, 'example': 2, 'of': 4, 'bag': 1, 'words': 6}
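The relation between this mapping and the feature-name order can be checked without scikit-learn. In the small sketch below the dictionary is copied by hand from the output above, and the variable names are illustrative.

```python
# Mapping copied from the output above: token -> column index.
vocabulary = {'this': 5, 'is': 3, 'an': 0, 'example': 2, 'of': 4, 'bag': 1, 'words': 6}

# Sorting the tokens by their index recovers the column order.
columns = sorted(vocabulary, key=vocabulary.get)
print(columns)

# In this example the indexes coincide with the alphabetical order of the tokens.
print(columns == sorted(vocabulary))
```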

### Fit and transform, then check which vocabulary words are present
* This is widely used for document or subject classification

In [6]:
c_vect= CountVectorizer()
c_vect.fit(string)  #string=["This is an example of bag of words!"]

CountVectorizer()

In [7]:
string2= ['Lets understand is of words is']

c_new_vect= c_vect.transform(string2)

print("Text Present at",c_new_vect.toarray())

#Compare with the feature names (column order)
print("original indexes",vect1.get_feature_names_out().tolist())

Text Present at [[0 0 0 2 1 0 1]]
original indexes ['an', 'bag', 'example', 'is', 'of', 'this', 'words']


* **Explanation**: the array shows how many times each word from string (1) occurs in string2.

ex: 
- 'an' is not present in string2, so: 0
- 'bag' is not present in string2, so: 0
- 'example' is not present in string2, so: 0
- 'is' is present and occurs 2 times in string2, so: 2

and so on...
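The counting that `transform` performs against an already fitted vocabulary can be imitated in plain Python. This is a sketch, not scikit-learn's implementation: `count_against_vocabulary` is a hypothetical helper, and its regular expression (lowercased tokens of two or more word characters) only approximates the vectorizer's default tokenization.

```python
import re
from collections import Counter

def count_against_vocabulary(text, feature_names):
    """Count only the words that already appear in feature_names."""
    # Lowercase the text and keep tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tokens)
    # Words outside the vocabulary ("lets", "understand") are simply ignored.
    return [counts[name] for name in feature_names]

feature_names = ['an', 'bag', 'example', 'is', 'of', 'this', 'words']
row = count_against_vocabulary('Lets understand is of words is', feature_names)
print(row)
```

The resulting row has one entry per known word, in the same column order as the feature names.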

In [9]:
## Bag of words with stop-word removal (no extra preprocessing step is needed)
# CountVectorizer accepts a stop_words argument, so the stop words don't have to be filtered out separately
# stopwords.words('english') needs the NLTK stopwords corpus: nltk.download('stopwords')

stpwords = stopwords.words('english')

string = ["This is an example of bag of words!"]
vect1=CountVectorizer(stop_words=stpwords)
print(vect1)

CountVectorizer(stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...])


In [10]:
vect1.fit_transform(string)
print("bag of words:",vect1.get_feature_names_out().tolist())
print("vocab       :",vect1.vocabulary_)

bag of words: ['bag', 'example', 'words']
vocab       : {'example': 1, 'bag': 0, 'words': 2}




In [12]:
#Helper that returns the document-term matrix as a DataFrame
def text_matrix(message,countvect):
    terms_doc = countvect.fit_transform(message)
    return pd.DataFrame(terms_doc.toarray(),columns=countvect.get_feature_names_out())

In [14]:
message=['We are slowly slowly making progress in Natural Language Processing',
          "We will get there","But practice is the only mantra for success"]

c_vect = CountVectorizer()
print("Below matrix is the Bag of Words approach")
text_matrix(message,c_vect)

Below matrix is the Bag of Words approach


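| | are | but | for | get | in | is | language | making | mantra | natural | only | practice | processing | progress | slowly | success | the | there | we | will |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |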


## n-grams:

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

from nltk.tokenize import word_tokenize

string=["This is an example of gram!"]

vect1 = CountVectorizer(ngram_range=(1,1)) # unigram
vect1.fit_transform(string)

vect2 = CountVectorizer(ngram_range=(2,2)) # bigram
vect2.fit_transform(string)
# with ngram_range=(2,3) the vocabulary would contain all bigrams and all trigrams together

vect3 = CountVectorizer(ngram_range=(3,3)) # trigram
vect3.fit_transform(string)

vect4 = CountVectorizer(ngram_range=(4,4))
vect4.fit_transform(string)

print("1-gram   :",vect1.get_feature_names_out().tolist())
print("2-gram   :",vect2.get_feature_names_out().tolist())
print("3-gram   :",vect3.get_feature_names_out().tolist())
print("4-gram   :",vect4.get_feature_names_out().tolist())

1-gram   : ['an', 'example', 'gram', 'is', 'of', 'this']
2-gram   : ['an example', 'example of', 'is an', 'of gram', 'this is']
3-gram   : ['an example of', 'example of gram', 'is an example', 'this is an']
4-gram   : ['an example of gram', 'is an example of', 'this is an example']
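The `ngram_range` argument is a `(min_n, max_n)` pair, so a range such as `(2, 3)` produces bigrams and trigrams in a single vocabulary. The plain-Python sketch below imitates that behaviour for the same sentence; `ngram_features` is a hypothetical helper, and its tokenization (lowercase, strip punctuation, split on whitespace) is a simplifying assumption rather than the vectorizer's own code.

```python
def ngram_features(text, ngram_range):
    """Collect every n-gram with min_n <= n <= max_n and sort them alphabetically."""
    min_n, max_n = ngram_range
    # Assumed tokenization: lowercase, strip punctuation, split on whitespace.
    tokens = [word.strip("!?.,").lower() for word in text.split()]
    features = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return sorted(features)

mixed = ngram_features("This is an example of gram!", (2, 3))
print(len(mixed))
print(mixed)
```

For this sentence the 5 bigrams and 4 trigrams end up interleaved in one alphabetically sorted list of 9 features.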


