<a href="https://colab.research.google.com/gist/Melvinchen0404/2afad4796dfcca3125d7c2c851a5a238/text_vectorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NLP Technique 5: Text Vectorization
**Text vectorization** refers to the process of converting text into **numerical representation** \
\
2 METHODS for **text vectorization** are: \
**METHOD 1:** The **bag-of-words (BoW)** model; \
**METHOD 2:** **TF-IDF** or **Term Frequency-Inverse Document Frequency** \

**Word embedding** is the representation of words for text analysis, typically in the form of **real-valued vectors** that encode the meaning of the words such that the **words that are closer in the vector space are expected to be similar in meaning** \
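NOTE: **METHOD 1** and **METHOD 2** produce sparse, count-based vectors for whole documents rather than dense vectors for individual words, so they count as **word embedding** techniques only in the loose sense of mapping text to vectors \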

SOURCES: \
https://sites.pitt.edu/~naraehan/presentation/Movie%20Reviews%20sentiment%20analysis%20with%20Scikit-Learn.html \
https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/ \
https://towardsdatascience.com/sentiment-analysis-a-how-to-guide-with-movie-reviews-9ae335e6bcb2

**METHOD 1:** The **bag-of-words (BoW)** model \
The **bag-of-words (BoW)** model is the simplest form of text representation in numbers \
Each text (e.g., a sentence or a document) is represented as the **bag (multiset) of its words**
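
The multiset idea can be sketched in plain Python before any library is involved. This is a minimal illustration only: the helper `bag_of_words` and its letters-only tokenizer are assumptions made for this example, not part of `nltk` or Scikit-Learn.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return the multiset (word -> count) of the lowercase words in text."""
    # Assumed tokenizer for illustration: runs of letters, lowercased.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


bag = bag_of_words("This movie is not scary and is slow.")
print(bag)
```

Word order is discarded; only the number of times each word occurs is kept, which is why 'is' is stored once with a count of 2.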

**STEP 1 of METHOD 1:** Import `nltk` and `CountVectorizer` \
`CountVectorizer` allows us to convert a collection of text documents to a **matrix of token counts**

In [1]:
# import CountVectorizer, nltk
from sklearn.feature_extraction.text import CountVectorizer
import nltk

In [2]:
# Turn off pretty printing of jupyter notebook... it generates long lines
%pprint

Pretty printing has been turned OFF


**STEP 2 of METHOD 1:** Initialize the `CountVectorizer` \
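With its default settings, `CountVectorizer` converts all letters into **lowercase form** and splits the text on **punctuation**, keeping only tokens of two or more alphanumeric characters \
It does **not** remove **stopwords** (a set of very common words like 'the', 'a', 'and', etc.) unless `stop_words` is set, e.g., `stop_words='english'` \
Its minimum document frequency is set with `min_df`; `min_df=1` (the default) keeps every term that appears in at least one document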

In [7]:
# Initialize a CountVectorizer with its default tokenizer, which lowercases
#    the text and drops punctuation (stopwords are kept).
# Minimum document frequency set to 1 (the default).
convert = CountVectorizer(min_df=1)
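
To see how stopwords are handled, one option is to compare the vocabulary learned with the default settings against the vocabulary learned with `stop_words='english'`. The sketch below uses a single review and the illustrative names `default_vocab` and `stop_vocab`.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This movie is very scary and long."]

# Default settings: no stopword list is applied.
default_vec = CountVectorizer()
default_vec.fit(docs)
default_vocab = sorted(default_vec.vocabulary_)

# Scikit-Learn's built-in English stopword list is applied.
stop_vec = CountVectorizer(stop_words="english")
stop_vec.fit(docs)
stop_vocab = sorted(stop_vec.vocabulary_)

print("default:  ", default_vocab)
print("stopwords:", stop_vocab)
```

Words such as 'is' and 'and' appear in the first vocabulary but not in the second.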

**STEP 3 of METHOD 1:** Set the **corpus** \
Here, our **corpus** comprises 3 movie reviews: \
REVIEW 1: This movie is very scary and long. \
REVIEW 2: This movie is not scary and is slow. \
REVIEW 3: This movie is spooky and good. \

**STEP 4 of METHOD 1:** Use the `.fit_transform` method to adapt `convert` to the supplied text data (or `corpus`) and create and return a **count-vectorized output**

In [8]:
corpus = ['This movie is very scary and long.',
        'This movie is not scary and is slow.',
        'This movie is spooky and good.']

In [9]:
corpus_counts = convert.fit_transform(corpus)

# convert now contains a vocabulary dictionary that maps unique words to index numbers
convert.vocabulary_

{'this': 9, 'movie': 4, 'is': 2, 'very': 10, 'scary': 6, 'and': 0, 'long': 3, 'not': 5, 'slow': 7, 'spooky': 8, 'good': 1}
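
The index numbers in `vocabulary_` are the column positions of the count matrix, assigned in alphabetical order of the words. A small sketch for listing the words in column order (the name `terms_by_column` is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This movie is very scary and long.',
          'This movie is not scary and is slow.',
          'This movie is spooky and good.']

convert = CountVectorizer(min_df=1)
convert.fit(corpus)

# Sort the words by their index so that position i names column i.
terms_by_column = sorted(convert.vocabulary_, key=convert.vocabulary_.get)
print(terms_by_column)
```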

**STEP 5 of METHOD 1:** Use the `.shape` attribute to confirm that the count matrix has a dimension of 3 (we have REVIEWS 1-3 in our `corpus`) by 11 (we have 11 **unique words**) \
**STEP 6 of METHOD 1:** Generate the **vector** for each of REVIEWS 1-3 using the `.toarray()` method \
Our vocabulary of **11 unique words** is as follows (in accordance with the index numbering from the previous step): \
'and', 'good', 'is', 'long', 'movie', 'not', 'scary', 'slow', 'spooky', 'this', 'very' \
Each entry of the **vector** is the number of times the corresponding **unique word** occurs in the review. A word that is absent is marked with 0, and 'is', which occurs twice in REVIEW 2, is marked with 2 \
**Vector** of REVIEW 1 (This movie is very scary and long.): [1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1] \
**Vector** of REVIEW 2 (This movie is not scary and is slow.): [1, 0, 2, 0, 1, 1, 1, 1, 0, 1, 0] \
**Vector** of REVIEW 3 (This movie is spooky and good.): [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0]  

In [11]:
# corpus_counts has a dimension of 3 (there are 3 reviews) by 11 (there are 11 unique words)
corpus_counts.shape

(3, 11)

In [12]:
# this vector is small enough to view in a full, non-sparse form! 
corpus_counts.toarray()

array([[1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1],
       [1, 0, 2, 0, 1, 1, 1, 1, 0, 1, 0],
       [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0]])
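
The matrix above stores counts, which is why 'is' has the value 2 in REVIEW 2. When only the presence or absence of a word matters, `CountVectorizer` accepts `binary=True`. A minimal sketch of the difference, with the illustrative names `count_matrix` and `binary_matrix`:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This movie is very scary and long.',
          'This movie is not scary and is slow.',
          'This movie is spooky and good.']

# Default: each entry is the number of occurrences of the word.
count_matrix = CountVectorizer().fit_transform(corpus).toarray()

# binary=True: every non-zero count is clipped to 1.
binary_matrix = CountVectorizer(binary=True).fit_transform(corpus).toarray()

print(count_matrix[1])   # REVIEW 2 as counts
print(binary_matrix[1])  # REVIEW 2 as presence/absence
```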

**METHOD 2:** **TF-IDF** or **Term Frequency-Inverse Document Frequency** \
**Term Frequency (TF)** is a measure of how frequently a term, $t$, appears in a document, $d$ \
$tf_{t,d} = \frac{n_{t,d}}{\text{Number of terms in document $d$}}$, where $n_{t,d}$ is the number of times $t$ occurs in $d$ \

Consider REVIEW 2: This movie is not scary and is slow. \
Number of terms in REVIEW 2 = 8 \
The **TF** of every word in the vocabulary ('and', 'good', 'is', 'long', 'movie', 'not', 'scary', 'slow', 'spooky', 'this', 'very') with respect to REVIEW 2 is:

TF(and) = $\frac{1}{8}$ \
TF(good) = $\frac{0}{8} = 0$ \
TF(is) = $\frac{2}{8} = \frac{1}{4}$ \
TF(long) = $\frac{0}{8} = 0$ \
TF(movie) = $\frac{1}{8}$ \
TF(not) = $\frac{1}{8}$ \
TF(scary) = $\frac{1}{8}$ \
TF(slow) = $\frac{1}{8}$ \
TF(spooky) = $\frac{0}{8} = 0$ \
TF(this) = $\frac{1}{8}$ \
TF(very) = $\frac{0}{8} = 0$ \
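
The TF values above can be reproduced with a few lines of plain Python. This is a sketch under the assumption of a simple letters-only tokenizer; `Fraction` is used so that the results print as the same fractions as in the list.

```python
import re
from fractions import Fraction

review_2 = "This movie is not scary and is slow."
vocabulary = ['and', 'good', 'is', 'long', 'movie', 'not',
              'scary', 'slow', 'spooky', 'this', 'very']

# Assumed tokenizer for illustration: runs of letters, lowercased.
tokens = re.findall(r"[a-z]+", review_2.lower())

# TF = occurrences of the term in the document / number of terms in the document.
tf = {term: Fraction(tokens.count(term), len(tokens)) for term in vocabulary}

for term, value in tf.items():
    print(f"TF({term}) = {value}")
```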

**IDF (Inverse Document Frequency)** is a measure of how rare, and hence how informative, a term is across the corpus \
$idf_t = \log{\frac{\text{Number of documents}}{\text{Number of documents with the term $t$}}}$ \
The values below use the base-10 logarithm \
IDF('this') = $\log{(3/3)} = 0$ \
IDF('movie') = $\log{(3/3)} = 0$ \
IDF('is') = $\log{(3/3)} = 0$ \
IDF('not') = $\log{(3/1)} \approx 0.48$ \
IDF('scary') = $\log{(3/2)} \approx 0.18$ \
IDF('and') = $\log{(3/3)} = 0$ \
IDF('slow') = $\log{(3/1)} \approx 0.48$ \

NOTE: Terms like 'is', 'this', 'and', etc have little importance and their **IDF** score is reduced to 0, whereas other terms like 'long', 'scary', 'slow', etc are more important and enjoy a higher **IDF** score \
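
A short sketch of the IDF computation over the three reviews, assuming a base-10 logarithm and a simple letters-only tokenizer (the helper name `idf` is illustrative):

```python
import math
import re

corpus = ['This movie is very scary and long.',
          'This movie is not scary and is slow.',
          'This movie is spooky and good.']

# One set of distinct words per review (assumed tokenizer: runs of letters).
doc_words = [set(re.findall(r"[a-z]+", doc.lower())) for doc in corpus]


def idf(term):
    """Textbook IDF: log10(number of documents / documents containing term)."""
    containing = sum(term in words for words in doc_words)
    return math.log10(len(corpus) / containing)


for term in ['this', 'movie', 'is', 'not', 'scary', 'and', 'slow']:
    print(f"IDF({term}) = {idf(term):.2f}")
```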

We can compute the **TF-IDF** score from the **TF** and **IDF** scores of each term \
$tfidf_{t,d} = tf_{t,d} \times idf_t$ \
TF-IDF('this', REVIEW 2) = $\frac{1}{8} \times 0 = 0$ \
TF-IDF('movie', REVIEW 2) =  $\frac{1}{8} \times 0 = 0$ \
TF-IDF('is', REVIEW 2) = $\frac{1}{4} \times 0 = 0$ \
TF-IDF('not', REVIEW 2) = $\frac{1}{8} \times 0.48 = 0.06$ \
TF-IDF('scary', REVIEW 2) = $\frac{1}{8} \times 0.18 \approx 0.022$ \
TF-IDF('and', REVIEW 2) = $\frac{1}{8} \times 0 = 0$ \
TF-IDF('slow', REVIEW 2) = $\frac{1}{8} \times 0.48 = 0.06$ \
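
Putting the two pieces together, the TF-IDF scores of REVIEW 2 can be checked with the sketch below. It follows the textbook formulas above with a base-10 logarithm; the helpers `tokenize`, `tf` and `idf` are illustrative names, not library functions.

```python
import math
import re

corpus = ['This movie is very scary and long.',
          'This movie is not scary and is slow.',
          'This movie is spooky and good.']


def tokenize(text):
    # Assumed tokenizer for illustration: runs of letters, lowercased.
    return re.findall(r"[a-z]+", text.lower())


docs = [tokenize(doc) for doc in corpus]
review_2 = docs[1]


def tf(term, doc):
    return doc.count(term) / len(doc)


def idf(term):
    containing = sum(term in doc for doc in docs)
    return math.log10(len(docs) / containing)


tfidf_review_2 = {term: tf(term, review_2) * idf(term)
                  for term in sorted(set(review_2))}

for term, score in tfidf_review_2.items():
    print(f"TF-IDF({term}, REVIEW 2) = {score:.3f}")
```

Because the IDF is not rounded before multiplying, the scores may differ in the last digit from values worked out by hand with rounded intermediate results.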

NOTE: **TF-IDF** gives higher values to terms that are rare across the corpus \
The **TF-IDF** score is high when both the **IDF** score is high (i.e., the term is **rare in the corpus** or all the documents combined) and the **TF** score is high (i.e., the term is **frequent in a single document**) \

The **bag-of-words (BoW)** model (**METHOD 1**) merely creates a set of vectors containing the **frequency count of word occurrences** in the corpus \
On the other hand, the **TF-IDF** model (**METHOD 2**) also weights each term by how informative it is, down-weighting terms that appear in most documents. $\therefore$ **METHOD 2** often fares better as input to **machine learning** models

**STEP 1 of METHOD 2:** Import the `TfidfTransformer` \
This will allow us to **transform** a **count matrix** to a normalized **TF-IDF** representation \
**STEP 2 of METHOD 2:** Use the `fit_transform()` method to convert the raw **frequency counts** of words into **TF-IDF** scores \
NOTE: Under the **standard textbook** method: \
$idf_t = \log{\frac{\text{Number of documents}}{\text{Number of documents with the term $t$}}}$ \
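Under the **Scikit-Learn (Sklearn)** method (with the default `smooth_idf=True`, using the natural logarithm): \
$idf_t = \ln{\left(\frac{\text{Number of documents} \ + \ 1}{\text{Number of documents with the term $t$} \ + \ 1}\right)} + 1$ \
The **Scikit-Learn (Sklearn)** method ensures that terms that occur in all documents in the training set or corpus (i.e., terms with a zero **IDF** score under the **standard textbook** approach) will not be completely ignored. It also uses the raw count $n_{t,d}$ as the **TF** and rescales each document vector to unit Euclidean (L2) length, so its scores differ from the hand-computed values above \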
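
The effect of the smoothing can be seen by evaluating both IDF formulas for a few document frequencies from our corpus of 3 reviews. This is a sketch in plain Python: the textbook column uses the base-10 logarithm and the smoothed column uses the natural logarithm, as in Scikit-Learn's default `smooth_idf=True` setting.

```python
import math

n_docs = 3
# Number of reviews that contain each term (taken from the corpus above).
doc_freq = {"this": 3, "scary": 2, "slow": 1}

comparison = {}
for term, df in doc_freq.items():
    textbook = math.log10(n_docs / df)
    smoothed = math.log((1 + n_docs) / (1 + df)) + 1
    comparison[term] = (textbook, smoothed)
    print(f"{term:>6}: textbook = {textbook:.3f}, smoothed = {smoothed:.3f}")
```

A term found in every review ('this') has a textbook IDF of 0 but a smoothed IDF of 1, so it still contributes to the document vector.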

In [14]:
# Convert raw frequency counts into TF-IDF (Term Frequency -- Inverse Document Frequency) values
from sklearn.feature_extraction.text import TfidfTransformer
fooTfmer = TfidfTransformer()

# Again, fit and transform
corpus_tfidf = fooTfmer.fit_transform(corpus_counts)

In [15]:
# TF-IDF values are returned
# Each row has been rescaled to unit Euclidean (L2) length
# Terms that are found across many documents are weighted down ('and' vs. 'scary')
corpus_tfidf.toarray()

array([[0.29628336, 0.        , 0.29628336, 0.50165133, 0.29628336,
        0.        , 0.38151877, 0.        , 0.        , 0.29628336,
        0.50165133],
       [0.26359985, 0.        , 0.5271997 , 0.        , 0.26359985,
        0.44631334, 0.3394328 , 0.44631334, 0.        , 0.26359985,
        0.        ],
       [0.32052772, 0.54270061, 0.32052772, 0.        , 0.32052772,
        0.        , 0.        , 0.        , 0.54270061, 0.32052772,
        0.        ]])
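
The array above can be reproduced by hand to confirm what `TfidfTransformer` does with its default settings: multiply the raw counts by the smoothed IDF, then divide each row by its Euclidean length. The sketch below uses NumPy; the names `manual_tfidf` and `sklearn_tfidf` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ['This movie is very scary and long.',
          'This movie is not scary and is slow.',
          'This movie is spooky and good.']

counts_sparse = CountVectorizer().fit_transform(corpus)
counts = counts_sparse.toarray()

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Raw counts times IDF, then L2-normalize each row.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(counts_sparse).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```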

**STEP 3 of METHOD 2:** **Transform** new data or reviews into **count-vectorized** form \

Suppose we introduce a new review (REVIEW 4) \
REVIEW 4: This movie is filled with good actors and is very good. \
**STEP 3A of METHOD 2:** Define REVIEW 4 as `newdata` \
**STEP 3B of METHOD 2:** **Transform** new data or reviews into **count-vectorized** form using the `transform()` method. No refitting is needed, because `convert` already holds the vocabulary learned from `corpus` \
This will yield the **vector** of REVIEW 4 under the **BoW** model \
Unseen words (e.g., 'actors', 'filled', 'with') are ignored

In [16]:
# A list of new documents
newdata = ["This movie is filled with good actors and is very good."]

In [17]:
newdata_counts = convert.transform(newdata)
newdata_counts.toarray()

array([[1, 2, 2, 0, 1, 0, 0, 0, 0, 1, 1]])
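
To check which words of REVIEW 4 were dropped, the tokens produced by the vectorizer's own analyzer (available through `build_analyzer()`) can be compared with the fitted vocabulary. A minimal sketch, with the illustrative names `tokens` and `unseen`:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This movie is very scary and long.',
          'This movie is not scary and is slow.',
          'This movie is spooky and good.']
newdata = ["This movie is filled with good actors and is very good."]

convert = CountVectorizer(min_df=1)
convert.fit(corpus)

# Tokenize REVIEW 4 exactly as the vectorizer would.
analyzer = convert.build_analyzer()
tokens = analyzer(newdata[0])

# Words that are not in the vocabulary learned from the corpus.
unseen = [token for token in tokens if token not in convert.vocabulary_]
print(unseen)

# Only the known words are counted in the transformed vector.
newdata_counts = convert.transform(newdata).toarray()
print(newdata_counts.sum(), "of", len(tokens), "tokens counted")
```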

**STEP 3C of METHOD 2:** **Transform** the **count-vectorized** new data using the fitted `fooTfmer` to derive the **TF-IDF** scores of the terms \

In [19]:
# Again, transform only (no fitting), using the fitted TfidfTransformer
newdata_tfidf = fooTfmer.transform(newdata_counts)
newdata_tfidf.toarray()

array([[0.2165043 , 0.7331473 , 0.43300861, 0.        , 0.2165043 ,
        0.        , 0.        , 0.        , 0.        , 0.2165043 ,
        0.36657365]])
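
The two-step route used in this notebook (`CountVectorizer` followed by `TfidfTransformer`) can also be collapsed into a single step with Scikit-Learn's `TfidfVectorizer`. The sketch below assumes default settings for all three classes and checks that both routes give the same scores; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

corpus = ['This movie is very scary and long.',
          'This movie is not scary and is slow.',
          'This movie is spooky and good.']
newdata = ["This movie is filled with good actors and is very good."]

# Two-step route: counts first, then TF-IDF weighting.
counter = CountVectorizer()
weigher = TfidfTransformer()
two_step_train = weigher.fit_transform(counter.fit_transform(corpus))
two_step_new = weigher.transform(counter.transform(newdata))

# One-step route: raw text straight to TF-IDF.
one_step = TfidfVectorizer()
one_step_train = one_step.fit_transform(corpus)
one_step_new = one_step.transform(newdata)

same_train = np.allclose(two_step_train.toarray(), one_step_train.toarray())
same_new = np.allclose(two_step_new.toarray(), one_step_new.toarray())
print(same_train, same_new)
```

Keeping the two steps separate, as done above, is useful when the raw counts are needed as well as the TF-IDF scores.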