## VECTORIZATION TECHNIQUES

Text vectorization is feature extraction for text data: it creates a set of numeric variables (features) for each observation, where an observation is a text document. We'll consider three vectorized representations of text: **bag-of-words**, **n-grams** and **TF-IDF**. <br>

Let's vectorize a small corpus about "blue skies and blue cheese", similar to the one used in the video lecture:

In [8]:
corpus = ['the sky is blue',
          'sky is blue and sky is beautiful', 
          'the beautiful sky is so blue',
          'i love blue cheese']

We'll use the built-in vectorizers from scikit-learn, a Python library for machine learning.

### Bag-of-Words Representation

We'll start with the bag-of-words representation (`CountVectorizer`). You can see the documentation here:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [11]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_BOW = CountVectorizer(max_features=1000)  # define a vectorizer (keeps at most 1000 features)
# fit_transform builds the vocabulary (dictionary) of the corpus and vectorizes the documents in one step
BOW_matrix = vectorizer_BOW.fit_transform(corpus).toarray()
# Column names are the vocabulary words; get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
pd.DataFrame(np.round(BOW_matrix, 2), columns=vectorizer_BOW.get_feature_names_out())
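
To see what the vectorizer does under the hood, here is a minimal plain-Python sketch of the same idea. It assumes `CountVectorizer`'s default settings (lowercasing and a token pattern that keeps only words of two or more characters); the names `tokenize`, `token_counts`, `vocabulary` and `bow_rows` are our own and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = ['the sky is blue',
          'sky is blue and sky is beautiful',
          'the beautiful sky is so blue',
          'i love blue cheese']

# Same regular expression as CountVectorizer's default token_pattern:
# a token is a run of two or more word characters, so 'i' is dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(document):
    """Lowercase the document and split it into word tokens."""
    return TOKEN_PATTERN.findall(document.lower())

# Count the tokens of each document separately.
token_counts = [Counter(tokenize(doc)) for doc in corpus]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted(set().union(*token_counts))

# One row per document, one column per vocabulary word.
bow_rows = [[counts[word] for word in vocabulary] for counts in token_counts]

print(vocabulary)
for row in bow_rows:
    print(row)
```

Each row should match the corresponding row of the DataFrame produced by `CountVectorizer`. Note that word order within a document is lost: only the counts remain.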

### Vectorization Using N-grams
<br>

The `ngram_range` parameter of `CountVectorizer` controls which n-grams are extracted. From the scikit-learn documentation:

**ngram_range** : tuple (min_n, max_n), default=(1, 1) <br>
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.

In [16]:
vectorizer_Bi_Grams = CountVectorizer(max_features=1000, ngram_range=(2, 2))
Bi_Grams_matrix = vectorizer_Bi_Grams.fit_transform(corpus).toarray()
pd.DataFrame(np.round(Bi_Grams_matrix, 2), columns=vectorizer_Bi_Grams.get_feature_names_out())



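| | and sky | beautiful sky | blue and | blue cheese | is beautiful | is blue | is so | love blue | sky is | so blue | the beautiful | the sky |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 2 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |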
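
To make the effect of `ngram_range` concrete, here is a small plain-Python sketch that lists the word n-grams of a single document. The helper `word_ngrams` is a hypothetical illustration, not a scikit-learn function; it assumes the same lowercasing and default token pattern that `CountVectorizer` uses.

```python
import re

def word_ngrams(document, min_n, max_n):
    """Return all word n-grams of the document with min_n <= n <= max_n."""
    # Tokenize as CountVectorizer does by default: lowercase, words of 2+ characters.
    tokens = re.findall(r"(?u)\b\w\w+\b", document.lower())
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n consecutive tokens over the document.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

print(word_ngrams('the sky is blue', 1, 1))  # unigrams only
print(word_ngrams('the sky is blue', 1, 2))  # unigrams and bigrams
print(word_ngrams('the sky is blue', 2, 2))  # bigrams only
```

With `(2, 2)` the document 'the sky is blue' yields 'the sky', 'sky is' and 'is blue', which are exactly the three non-zero columns of row 0 in the output above.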


### Vectorization with Term Frequency – Inverse Document Frequency (TF-IDF)

Now let's do feature extraction (vectorization) using the TF-IDF approach. We'll use `TfidfVectorizer`, which is equivalent to `CountVectorizer` followed by `TfidfTransformer`. <br> <br> See the full documentation of the weighting here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer <br> <br>


In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer 

vectorizer_TF_IDF = TfidfVectorizer(norm = None, smooth_idf = True)
TF_IDF_matrix = vectorizer_TF_IDF.fit_transform(corpus).toarray()
pd.DataFrame(np.round(TF_IDF_matrix, 2), columns=vectorizer_TF_IDF.get_feature_names_out())

Have a look at the IDF weights (one weight per feature, in the order of the feature names):

In [None]:
print(np.round(vectorizer_TF_IDF.idf_,2))

[1.92 1.51 1.   1.92 1.22 1.92 1.22 1.92 1.51]
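
These weights can be recomputed by hand. With `smooth_idf=True`, scikit-learn uses idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents that contain the term t. The sketch below applies this formula in plain Python; the variable names are our own.

```python
import math
import re
from collections import Counter

corpus = ['the sky is blue',
          'sky is blue and sky is beautiful',
          'the beautiful sky is so blue',
          'i love blue cheese']

n_documents = len(corpus)

# Document frequency: in how many documents does each word appear?
# Using set() counts a word at most once per document.
document_frequency = Counter()
for doc in corpus:
    tokens = re.findall(r"(?u)\b\w\w+\b", doc.lower())
    document_frequency.update(set(tokens))

# Smoothed IDF, in alphabetical order of the words.
idf = {word: math.log((1 + n_documents) / (1 + df)) + 1
       for word, df in sorted(document_frequency.items())}

for word, weight in idf.items():
    print(f"{word:10s} df={document_frequency[word]}  idf={weight:.2f}")
```

The word 'blue' appears in all four documents, so it gets the lowest possible weight of 1, while words that appear in a single document (such as 'cheese') get the highest weight.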


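It's a good idea to normalize the TF-IDF matrix. With `norm='l2'`, each row (document vector) is scaled to unit Euclidean length, so all entries lie between 0 and 1 and documents of different lengths become comparable. Some text mining models require normalized matrices. The `norm` parameter is used for this purpose (you can look it up in the documentation):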

In [27]:
vectorizer_TF_IDF = TfidfVectorizer(norm = 'l2', smooth_idf = True)
TF_IDF_matrix = vectorizer_TF_IDF.fit_transform(corpus).toarray()
pd.DataFrame(np.round(TF_IDF_matrix, 2), columns=vectorizer_TF_IDF.get_feature_names_out())



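| | and | beautiful | blue | cheese | is | love | sky | so | the |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.4 | 0.0 | 0.49 | 0.0 | 0.49 | 0.0 | 0.6 |
| 1 | 0.44 | 0.35 | 0.23 | 0.0 | 0.56 | 0.0 | 0.56 | 0.0 | 0.0 |
| 2 | 0.0 | 0.43 | 0.29 | 0.0 | 0.35 | 0.0 | 0.35 | 0.55 | 0.43 |
| 3 | 0.0 | 0.0 | 0.35 | 0.66 | 0.0 | 0.66 | 0.0 | 0.0 | 0.0 |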
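
Row 0 of this output can be reproduced by hand, which shows what the L2 normalization does. The sketch below is a plain-Python illustration for the document 'the sky is blue' only; the document frequencies are read off the corpus above, and the variable names are our own.

```python
import math

n_documents = 4

# Document frequencies of the four words in 'the sky is blue',
# counted over the whole corpus.
document_frequency = {'blue': 4, 'is': 3, 'sky': 3, 'the': 2}

# Each word occurs once in the document, so TF = 1 and the raw
# TF-IDF value is simply the smoothed IDF weight.
raw_tfidf = {word: 1 * (math.log((1 + n_documents) / (1 + df)) + 1)
             for word, df in document_frequency.items()}

# L2 normalization: divide every entry by the Euclidean length of the row.
l2_norm = math.sqrt(sum(value ** 2 for value in raw_tfidf.values()))
normalized = {word: value / l2_norm for word, value in raw_tfidf.items()}

for word, value in normalized.items():
    print(f"{word:5s} {value:.2f}")

# After normalization the squared entries of the row sum to 1.
print(round(sum(value ** 2 for value in normalized.values()), 6))
```

The printed values agree with row 0 of the normalized matrix: the relative importance of the words is unchanged, but the row now has length 1.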


### **<font color=green> EXERCISE 3: You are given a new small corpus called corpus_exercise (see below). Your ultimate task is to normalize (pre-process) the corpus and produce the TF-IDF and the Bag-of-Words representations of the data. Follow the steps below to complete this exercise:</font>**

Step 1. Download the file Text_Normalization_Function.ipynb from Canvas and put it into the same directory(!) as the current Jupyter notebook. That file defines a relatively sophisticated text normalization function. (OPTIONAL: you can explore what that file does when you are done with this exercise.)

Step 2. Run the file Text_Normalization_Function.ipynb to define the text normalization function:

In [None]:
%run ./Text_Normalization_Function.ipynb

Step 3. Define the corpus_exercise text corpus:

In [None]:
corpus_exercise = ['python is great for text mining',
          'anyone can learn python and do text mining', 
          'python can go without eating for days',
          'python can be a great pet']

Step 4. Normalize the corpus_exercise text corpus and call its normalized version NORM_corpus:

In [None]:
NORM_corpus = normalize_corpus(corpus_exercise)
NORM_corpus

['python great text mining',
 'anyone learn python text mining',
 'python without eat day',
 'python great pet']

Step 5. Compute and print out the TF-IDF and the Bag-of-Words representations for NORM_corpus (WRITE the lines of code needed in the cell below):

### <font color=green> Answer for E3:

The bag-of-words representation of the normalized corpus (NORM_corpus):

In [None]:
vectorizer_BOW = CountVectorizer(max_features=1000) #BOW = bag-of-words
BOW_matrix = vectorizer_BOW.fit_transform(NORM_corpus).toarray()
pd.DataFrame(np.round(BOW_matrix, 2), columns=vectorizer_BOW.get_feature_names_out())

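| | anyone | day | eat | great | learn | mining | pet | python | text | without |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |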


The TF-IDF representation of the normalized corpus (NORM_corpus):

In [None]:
vectorizer_TF_IDF = TfidfVectorizer(norm = 'l2', smooth_idf = True)
TF_IDF_matrix = vectorizer_TF_IDF.fit_transform(NORM_corpus).toarray()
pd.DataFrame(np.round(TF_IDF_matrix, 2), columns=vectorizer_TF_IDF.get_feature_names_out())

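| | anyone | day | eat | great | learn | mining | pet | python | text | without |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.54 | 0.0 | 0.54 | 0.0 | 0.36 | 0.54 | 0.0 |
| 1 | 0.53 | 0.0 | 0.0 | 0.0 | 0.53 | 0.42 | 0.0 | 0.28 | 0.42 | 0.0 |
| 2 | 0.0 | 0.55 | 0.55 | 0.0 | 0.0 | 0.0 | 0.0 | 0.29 | 0.0 | 0.55 |
| 3 | 0.0 | 0.0 | 0.0 | 0.57 | 0.0 | 0.0 | 0.73 | 0.38 | 0.0 | 0.0 |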


### <font color=green> End of Answer