#### Outline.
1. Frequency-Based Embedding
2. Prediction-Based Embedding
3. Word Embedding: Word2vec & GloVe
4. Keras Embedding Layer

In [1]:
import numpy as np
import pandas as pd

## 1. Frequency-Based Embedding

This includes the `Count Vector`, the `tf-idf Vector`, and the `Co-occurrence Matrix`.

### 1.1. Using `CountVectorizer` with `TfidfTransformer`

First, consider a few simple sentences.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document?',
          'this Document is not yours..']

cvect = CountVectorizer()
X = cvect.fit_transform(corpus)

print("There are %d sentences in this corpus"%(X.shape[0]))
print('The number of the different words is :', X.shape[1], ", and ... they are:")
print(cvect.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2

There are 5 sentences in this corpus
The number of the different words is : 11 , and ... they are:
['and', 'document', 'first', 'is', 'not', 'one', 'second', 'the', 'third', 'this', 'yours']


- First, the vectorizer counts how many `different words` occur in the corpus (here `11`); note that `"document"` and `"Document"` are treated as the same word, because every token is converted to lowercase: `"document"`.
- Only the second document contains a word with `frequency = 2`: the word `document`.
- The `unique` words in the corpus are sorted `alphabetically`, starting with "**a**nd" and ending with "**y**ours".
- `Punctuation` (such as `"?"` or `"!"`) is ignored.
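
The behaviour described in the list above can be sketched with the standard library alone. This is only an illustration, not scikit-learn's actual implementation: the regular expression is assumed to match `CountVectorizer`'s default token pattern (words of two or more characters), and `tokenize`, `vocab` and `counts` are names introduced here.

```python
import re
from collections import Counter

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document?',
          'this Document is not yours..']

# Assumption: same pattern as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    # Lowercase first, then keep only word tokens (punctuation is dropped).
    return TOKEN_PATTERN.findall(text.lower())

# Unique words, sorted alphabetically.
vocab = sorted({word for doc in corpus for word in tokenize(doc)})

# One row of counts per document, one column per vocabulary word.
counts = [[Counter(tokenize(doc))[word] for word in vocab] for doc in corpus]

print(len(vocab), vocab)
print(counts[1])  # counts for the second document
```

The row printed for the second document should contain a `2` in the `document` column, in line with the second bullet.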



In [3]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

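# A fixed vocabulary is passed, so min_df and max_df are ignored and only the listed unigrams are counted.
vocabulary = cvect.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
pipe = Pipeline([('count', CountVectorizer(vocabulary = vocabulary, min_df=2, max_df=0.5, ngram_range=(1,2))),
                 ('tfid', TfidfTransformer(smooth_idf=False, use_idf=True))]).fit(corpus)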
pipe['count'].transform(corpus).toarray()

array([[0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0],
       [0, 2, 0, 1, 0, 0, 1, 1, 0, 1, 0],
       [1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1]], dtype=int64)

Next, **compute the `IDF` values.** (An `idf` is constant per corpus, and **accounts** for the ratio of documents that include the word.)

In [4]:
table = pd.DataFrame({"fea_names": cvect.get_feature_names_out(), "idf_smooth_False)": pipe['tfid'].idf_})
table

Unnamed: 0,fea_names,idf_smooth_False)
0,and,2.609438
1,document,1.223144
2,first,1.916291
3,is,1.0
4,not,2.609438
5,one,2.609438
6,second,2.609438
7,the,1.223144
8,third,2.609438
9,this,1.0


According to https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L987-L992 and https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html?highlight=tfidf#sklearn.feature_extraction.text.TfidfTransformer

- If `smooth_idf=False`, the formula used to compute the `tf-idf` of a word `w` in a document `d` of a document set is 

                                    tf-idf(w, d) = tf(w, d) * idf(w), 

and the `idf` is computed as 

                                        idf(w) = log [ n / df(w) ] + 1, 

where `n` is the `total number of documents in the corpus` and `df(w)` is the `document frequency of w`, that is, the number of documents in the document set that contain the word `w`. 

The effect of adding `“1”` to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, ***will not be entirely ignored***. 

For example: 
1. The word `"is"`. We have a corpus of `5 sentences/documents` and all of them contain this word, so 

                                    idf("is") = log(5 / 5) + 1 = 1

2. The word `"and"` appears in only one document, so

                                    idf("and") = log(5 / 1) + 1 ≈ 2.609 
                                
Note that `log` here is the `natural logarithm`.
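
As a rough check of the worked examples, the document frequencies can be read directly off the count matrix printed earlier: `df(w)` is the number of rows with a non-zero entry in the column of `w`. The sketch below hard-codes that matrix; the names `count_matrix`, `df` and `idf` are introduced here for illustration only.

```python
import math

vocab = ['and', 'document', 'first', 'is', 'not', 'one',
         'second', 'the', 'third', 'this', 'yours']

# Count matrix copied from the CountVectorizer output above (rows = documents).
count_matrix = [[0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0],
                [0, 2, 0, 1, 0, 0, 1, 1, 0, 1, 0],
                [1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0],
                [0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0],
                [0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1]]

n = len(count_matrix)

# Document frequency: in how many documents does each word occur at least once?
df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(len(vocab))]

# Unsmoothed idf: log(n / df) + 1, with the natural logarithm.
idf = [math.log(n / d) + 1 for d in df]

for word, d, value in zip(vocab, df, idf):
    print(f"{word:10s} df={d}  idf={value:.6f}")
```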

***Note that the `idf` formula above differs from the standard textbook notation that defines the idf as***

                                    idf(w) = log [ n / (df(w) + 1) ].

- If `smooth_idf=True` (the default), the constant `“1”` is added to the numerator and denominator of the idf, as if an extra document had been seen containing every term in the collection exactly once, which ***prevents zero divisions:*** 

                                    idf(w) = log [ (1 + n) / (1 + df(w)) ] + 1.
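
To see how the three variants differ, the following sketch evaluates them side by side for `n = 5` and every possible document frequency. The helper names `idf` and `textbook_idf` are hypothetical and are not part of scikit-learn.

```python
import math

def idf(n, df, smooth=True):
    """idf as described above: smoothed (default) or unsmoothed."""
    if smooth:
        return math.log((1 + n) / (1 + df)) + 1
    return math.log(n / df) + 1

def textbook_idf(n, df):
    """The textbook variant quoted above: log(n / (df + 1))."""
    return math.log(n / (df + 1))

n = 5
print("df  unsmoothed  smoothed  textbook")
for df in range(1, n + 1):
    print(f"{df:2d}  {idf(n, df, smooth=False):10.6f}  "
          f"{idf(n, df):8.6f}  {textbook_idf(n, df):8.6f}")
```

For a word that occurs in every document (`df = n`), both scikit-learn variants give exactly 1, whereas the textbook variant becomes negative.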

In [5]:
pipe = Pipeline([('count', CountVectorizer(vocabulary = vocabulary, 
                                           min_df=2, max_df=0.5, ngram_range=(1,2))),
                 ('tfid', TfidfTransformer(smooth_idf=True, use_idf=True))]).fit(corpus)

table["idf_smooth_True"] = pipe['tfid'].idf_
table

Unnamed: 0,fea_names,idf_smooth_False),idf_smooth_True
0,and,2.609438,2.098612
1,document,1.223144,1.182322
2,first,1.916291,1.693147
3,is,1.0,1.0
4,not,2.609438,2.098612
5,one,2.609438,2.098612
6,second,2.609438,2.098612
7,the,1.223144,1.182322
8,third,2.609438,2.098612
9,this,1.0,1.0


In [6]:
## verify the idf_value of the word "and"

np.log(5) + 1, np.log((1 + 5)/(1+1))+1

(2.6094379124341005, 2.09861228866811)

**Compute the tf-idf score.** Depending on how the `idf` values are computed, the `tf-idf` is defined by

                                tf-idf(w, d) = tf(w, d) * idf(w)

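Recall that `TF` stands for **`term frequency`**, defined here as *the number of times that word `w` occurs in document `d`*.

For example, in the first sentence (`d = 1`), the word `"and"` does not occur, so `tf("and", d=1) = 0`.

By default, `TfidfTransformer` also L2-normalizes each document vector (`norm='l2'`), so the reported scores are the products above divided by the Euclidean norm of the document's row.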
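
The scores of the first document can be reproduced by hand as a sanity check. The sketch below assumes the smoothed idf and the default L2 normalization of `TfidfTransformer`; the term counts and document frequencies are copied from the count matrix above, and the variable names are illustrative.

```python
import math

n = 5  # number of documents in the corpus

# Non-zero term counts of the first document: 'this is the first document'.
tf = {"document": 1, "first": 1, "is": 1, "the": 1, "this": 1}

# Document frequencies of those words in the whole corpus.
df = {"document": 4, "first": 2, "is": 5, "the": 4, "this": 5}

# Smoothed idf, then the raw product tf * idf.
raw = {w: tf[w] * (math.log((1 + n) / (1 + df[w])) + 1) for w in tf}

# L2 normalization: divide by the Euclidean norm of the document vector.
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {w: v / norm for w, v in raw.items()}

for word, score in sorted(tfidf.items()):
    print(f"{word:10s} {score:.6f}")
```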

See the table below.

In [7]:
vocabulary = cvect.get_feature_names_out().tolist()
pipe = Pipeline([('count', CountVectorizer(vocabulary = vocabulary, 
                                           min_df=2, max_df=0.5, ngram_range=(1,2))),
                 ('tfid', TfidfTransformer(smooth_idf=True, use_idf = True))]).fit(corpus)

count_vector = pipe['count'].transform(corpus).toarray()  ## equivalent with CountVectorizer.fit_transform(corpus)

tf_idf_vector = pipe['tfid'].transform(count_vector)
tf_idf_vector[0].toarray()
table["tfidf_smooth_True_1st_doc"] = tf_idf_vector[0].T.toarray()
table["tfidf_smooth_True_2nd_doc"] = tf_idf_vector[1].T.toarray()
table["tfidf_smooth_True_3rd_doc"] = tf_idf_vector[2].T.toarray()
table

Unnamed: 0,fea_names,idf_smooth_False),idf_smooth_True,tfidf_smooth_True_1st_doc,tfidf_smooth_True_2nd_doc,tfidf_smooth_True_3rd_doc
0,and,2.609438,2.098612,0.0,0.0,0.514923
1,document,1.223144,1.182322,0.42712,0.646126,0.0
2,first,1.916291,1.693147,0.611659,0.0,0.0
3,is,1.0,1.0,0.361255,0.273244,0.245363
4,not,2.609438,2.098612,0.0,0.0,0.0
5,one,2.609438,2.098612,0.0,0.0,0.514923
6,second,2.609438,2.098612,0.0,0.573434,0.0
7,the,1.223144,1.182322,0.42712,0.323063,0.290099
8,third,2.609438,2.098612,0.0,0.0,0.514923
9,this,1.0,1.0,0.361255,0.273244,0.245363


### 1.2. `TfidfVectorizer` is equivalent to the first method

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer()
tfidf_matrix =  tf.fit_transform(corpus)
feature_names = tf.get_feature_names_out().tolist()
print(feature_names)

['and', 'document', 'first', 'is', 'not', 'one', 'second', 'the', 'third', 'this', 'yours']


**View the `tf-idf` scores produced by `TfidfVectorizer`, starting with the values for the first sentence.**

In [9]:
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(corpus)
M = tfidf_vectorizer_vectors.toarray()
M[0,:]

array([0.        , 0.42712001, 0.6116585 , 0.36125537, 0.        ,
       0.        , 0.        , 0.42712001, 0.        , 0.36125537,
       0.        ])

**tf-idf values using `TfidfVectorizer`**

In [10]:
pd.DataFrame({"fea_names": feature_names, 
              "tfidf_TfVec_1st_doc": M[0, :], 
              "tfidf_TfVec_2nd_doc": M[1, :],
              "tfidf_TfVec_3rd_doc": M[2, :]})

Unnamed: 0,fea_names,tfidf_TfVec_1st_doc,tfidf_TfVec_2nd_doc,tfidf_TfVec_3rd_doc
0,and,0.0,0.0,0.514923
1,document,0.42712,0.646126,0.0
2,first,0.611659,0.0,0.0
3,is,0.361255,0.273244,0.245363
4,not,0.0,0.0,0.0
5,one,0.0,0.0,0.514923
6,second,0.0,0.573434,0.0
7,the,0.42712,0.323063,0.290099
8,third,0.0,0.0,0.514923
9,this,0.361255,0.273244,0.245363


In summary, the main differences between the two modules are as follows:

- With `TfidfTransformer`, you first compute word counts using `CountVectorizer`, then compute the `Inverse Document Frequency (IDF)` values, and only then compute the `tf-idf scores`.

- With `TfidfVectorizer`, in contrast, **you do all three steps at once**. It computes the word counts, IDF values, and tf-idf scores all using the same dataset.
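
A quick way to convince yourself that the two routes agree is to run both with their default parameters and compare the resulting matrices. This is a minimal sketch; the names `two_step`, `one_step` and `same` are introduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document?',
          'this Document is not yours..']

# Route 1: counts first, then idf weighting and normalization.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts).toarray()

# Route 2: everything at once.
one_step = TfidfVectorizer().fit_transform(corpus).toarray()

same = np.allclose(two_step, one_step)
print(same)
```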

**When to use what?**
So now you may be wondering why you should use more steps than necessary if you can get everything done in one step. Well, there are cases where you want to use `TfidfTransformer` over `TfidfVectorizer`, and it is sometimes not that obvious. Here is a general guideline:

- If you need the `term frequency` (term count) vectors for `different tasks`, use `TfidfTransformer`.
- If you need to compute `tf-idf scores` on documents within your `“training”` dataset, use `TfidfVectorizer`.
- If you need to compute `tf-idf scores` on documents **`outside your “training”`** dataset, use either one; both will work.
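
The last case relies on the fact that the idf values are learned once from the training documents and then reused for new text. The standard-library sketch below illustrates that split under the same assumptions as before (assumed default token pattern, smoothed idf, L2 normalization); `fit_idf` and `transform` are hypothetical helper names, not scikit-learn functions.

```python
import math
import re

train = ['this is the first document',
         'this document is the second document',
         'and this is the third one',
         'is this the first document?',
         'this Document is not yours..']

def tokenize(text):
    # Assumption: mirrors the default tokenization (lowercase, 2+ character words).
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def fit_idf(docs):
    """Learn smoothed idf values from the training documents only."""
    n = len(docs)
    df = {}
    for doc in docs:
        for word in set(tokenize(doc)):
            df[word] = df.get(word, 0) + 1
    return {word: math.log((1 + n) / (1 + d)) + 1 for word, d in df.items()}

def transform(text, idf):
    """Score a document with previously learned idf values."""
    raw = {}
    for word in tokenize(text):
        if word in idf:  # words never seen during fitting are dropped
            raw[word] = raw.get(word, 0.0) + idf[word]
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {word: v / norm for word, v in raw.items()} if norm else {}

idf = fit_idf(train)
new_vec = transform("the third unseen document", idf)
print(new_vec)
```

The word `unseen` does not appear in the result because it was not part of the training vocabulary, and `third` scores higher than `the` because it is rarer in the training corpus.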

### 1.3. Co-occurrence Matrix.

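In text processing, a co-occurrence matrix records how often pairs of words occur together within a given context window. (The term is also used in image analysis, where it describes the distribution of co-occurring pixel values at a given offset.) The function below uses the smallest window, adjacent words (bigrams): the entry at row `current` and column `previous` counts how many times `previous` is immediately followed by `current`.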

In [11]:
import nltk
from nltk import bigrams
import itertools

def generate_co_occurrence_matrix(corpus):
    
    ## Create the vocabulary_list, set and indexes
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_index = {word: i for i, word in enumerate(vocab)}
 
    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))
 
    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))
 
    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))
 
    # Loop through the bigrams taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    
    # create matrix
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)
 
    # return the matrix and the index
    return co_occurrence_matrix, vocab_index

In [16]:
text_1 = [["never", "say", "never", "because", "never", "give", "up"]]

# Create one list using many lists
data = list(itertools.chain.from_iterable(text_1))
matrix, vocab_index = generate_co_occurrence_matrix(data)
  
data_matrix = pd.DataFrame(matrix, index=vocab_index,
                             columns=vocab_index)
data_matrix

Unnamed: 0,never,because,up,give,say
never,0.0,1.0,0.0,0.0,1.0
because,1.0,0.0,0.0,0.0,0.0
up,0.0,0.0,0.0,1.0,0.0
give,1.0,0.0,0.0,0.0,0.0
say,1.0,0.0,0.0,0.0,0.0
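
The same counts can be cross-checked without `nltk`. The sketch below is only an illustration: it builds the adjacent-word pairs with `zip` and uses a sorted vocabulary, so its row and column order is deterministic and differs from the table above; `pair_counts` and `matrix` are names introduced here.

```python
from collections import Counter

tokens = ["never", "say", "never", "because", "never", "give", "up"]

# Sorted vocabulary gives a reproducible row/column order.
vocab = sorted(set(tokens))

# Count each (previous, current) pair of adjacent words.
pair_counts = Counter(zip(tokens, tokens[1:]))

# matrix[i][j]: how often vocab[j] (previous) is directly followed by vocab[i] (current).
matrix = [[pair_counts[(previous, current)] for previous in vocab]
          for current in vocab]

print(vocab)
for word, row in zip(vocab, matrix):
    print(f"{word:8s} {row}")
```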


In [17]:
text_2 = [['Where', 'are', 'you', 'now'],
             ['Why', 'did', 'you', 'used', 'Python'],
             ['When', 'you', 'leave', 'there'],
             ['What', 'companies', 'use', 'Python'],
             ['In', 'the', 'begining', 'of', 'Python'],
             ["ok", "Jane", "is", "fat", "but", "Adam", "is", "tall"]]
 
data = list(itertools.chain.from_iterable(text_2))
matrix, vocab_index = generate_co_occurrence_matrix(data)
  
data_matrix = pd.DataFrame(matrix, index=vocab_index,
                             columns=vocab_index)
data_matrix

Unnamed: 0,Why,companies,is,now,When,What,begining,used,Where,are,...,ok,Python,there,tall,but,use,In,Adam,did,of
Why,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
companies,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
is,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
now,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
When,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
What,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
begining,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
used,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Where,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
are,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
