Feature extraction from text (also called text vectorization or text representation) is the process of converting text into a numerical representation.
There are different techniques for doing this, broadly split into machine learning (ML) and deep learning (DL) approaches.
In this notebook we will look at 3 techniques:
- Bag of Words
- Ngram
- TF-IDF

In [1]:
# Importing the Dependencies

import numpy as np
import pandas as pd

In [11]:
# Creating a Dataframe

df = pd.DataFrame({'text':['people watch campusx', 'campusx watch campusx', 'people write comment', 'campusx write comment'],
                  'output':[1,1,0,0]})
df

                    text  output
0   people watch campusx       1
1  campusx watch campusx       1
2   people write comment       0
3  campusx write comment       0


## 1.Bag Of Words

In [12]:
# Bag-of-words (BoW) is a text representation that describes each document by its word counts, ignoring word order.
# To implement BoW we use CountVectorizer from the sklearn.feature_extraction.text module

from sklearn.feature_extraction.text import CountVectorizer

# Creating the CountVectorizer object

cv = CountVectorizer()

In [13]:
# Implementation
# To apply BoW to the DataFrame we call .fit_transform and pass the column we want to encode.
# Here we apply it to the 'text' column.

bow = cv.fit_transform(df.text)

In [18]:
type(bow)

scipy.sparse._csr.csr_matrix

bow is a sparse matrix; to view its contents we need to convert it into an array.

In [19]:
bow.toarray()

array([[1, 0, 1, 1, 0],
       [2, 0, 0, 1, 0],
       [0, 1, 1, 0, 1],
       [1, 1, 0, 0, 1]])

So it is a (4, 5) matrix, with each row encoded according to the vocabulary.
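Because the result is sparse, its shape and the number of stored non-zero counts can be inspected without converting it to a dense array. A minimal sketch, re-creating the same four documents (the list name `docs` is introduced here only for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same four documents as df.text above
docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

bow = CountVectorizer().fit_transform(docs)

print(bow.shape)  # (number of documents, vocabulary size)
print(bow.nnz)    # number of explicitly stored non-zero counts
```

For a large corpus most entries are zero, which is why the sparse form is usually kept until a dense array is really needed.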

In [14]:
# To check the vocabulary we use cv.vocabulary_

cv.vocabulary_

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}

As observed, a vocabulary of 5 words is created, with indices assigned in alphabetical order:
- campusx : 0 index
- comment : 1 index
- people : 2 index
- watch : 3 index
- write : 4 index

Each row is now encoded on the basis of this vocabulary.

In [20]:
# To check the encoding of the first row (index 0): 'people watch campusx'
bow[0].toarray()

array([[1, 0, 1, 1, 0]])
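The same encoding can be reproduced by hand to see what is going on. This is only a sketch: it splits on whitespace, which is a simplification of CountVectorizer's real tokenization, and the helper name `encode` is hypothetical.

```python
from collections import Counter

docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

# Vocabulary: unique tokens sorted alphabetically, mapped to column indices
unique_tokens = sorted({token for d in docs for token in d.split()})
vocab = {term: idx for idx, term in enumerate(unique_tokens)}

def encode(text, vocab):
    """Count how often each vocabulary term occurs in `text` (unknown tokens are ignored)."""
    counts = Counter(text.split())
    # dicts keep insertion order, so columns follow the sorted vocabulary
    return [counts.get(term, 0) for term in vocab]

manual_bow = [encode(d, vocab) for d in docs]
print(vocab)
print(manual_bow)
```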

In [17]:
# To encode a new text, we call .transform on it (passed as a list of documents) and then convert the result
# into an array to see the encoding.

cv.transform(['campusx watch and write comment of campusx']).toarray()

array([[2, 1, 0, 1, 1]])

Here we can observe that 'and' and 'of' were not part of the vocabulary, yet the output size is unchanged. That is because the encoding is done against the fitted vocabulary, and any out-of-vocabulary (OOV) word is simply ignored. This avoids the OOV issue that we faced with one-hot encoding (OHE).
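To see which tokens are silently dropped, we can tokenize the new text ourselves and compare against the vocabulary. A rough sketch, assuming the default settings of CountVectorizer (lowercasing plus the default `token_pattern` listed in its signature); the names `known` and `dropped` are made up for this example.

```python
import re

# Vocabulary learned from the four training documents
vocab = {'campusx': 0, 'comment': 1, 'people': 2, 'watch': 3, 'write': 4}

# Default token_pattern of CountVectorizer: tokens of 2 or more word characters
token_pattern = re.compile(r'(?u)\b\w\w+\b')

new_text = 'campusx watch and write comment of campusx'
tokens = token_pattern.findall(new_text.lower())

known = [t for t in tokens if t in vocab]        # tokens that get counted
dropped = [t for t in tokens if t not in vocab]  # OOV tokens that are ignored

print(known)
print(dropped)
```

The price of this robustness is that information carried by OOV words is lost, which is worth keeping in mind when the new text differs a lot from the training corpus.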

### Hyperparameter Tuning of CountVectorizer

In [22]:
# CountVectorizer accepts many parameters, listed below, which makes it a very flexible class

# class sklearn.feature_extraction.text.CountVectorizer(*, input='content', 
#                                                       encoding='utf-8', 
#                                                       decode_error='strict', 
#                                                       strip_accents=None, 
#                                                       lowercase=True, 
#                                                       preprocessor=None, 
#                                                       tokenizer=None, 
#                                                       stop_words=None, 
#                                                       token_pattern='(?u)\b\w\w+\b', 
#                                                       ngram_range=(1, 1), 
#                                                       analyzer='word', 
#                                                       max_df=1.0, min_df=1, 
#                                                       max_features=None, 
#                                                       vocabulary=None, 
#                                                       binary=False, 
#                                                       dtype=<class 'numpy.int64'>)

# max_features=None
# This parameter keeps only the most frequent terms and drops the rarer ones: the vocabulary is limited to the
# top max_features terms, ranked by their frequency across the corpus. With max_features=1, only the single
# most frequent term is used for encoding.
# In the example above, campusx has the highest frequency (4), so max_features=1 encodes the data on the basis
# of campusx only.

cv1 = CountVectorizer(max_features=1)
bow1 = cv1.fit_transform(df.text).toarray()
bow1

array([[1],
       [2],
       [0],
       [1]])

As observed, document 3 (row 3) does not contain campusx, so it has been encoded as 0.
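We can double-check which term max_features=1 should keep by counting corpus-wide frequencies ourselves. A small sketch using only the standard library; whitespace splitting stands in for the real tokenizer, and `corpus_freq` is a name introduced here.

```python
from collections import Counter

docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

# Total number of occurrences of each term across the whole corpus
corpus_freq = Counter(token for d in docs for token in d.split())

top_term, top_count = corpus_freq.most_common(1)[0]
print(corpus_freq)
print(top_term, top_count)
```

All other terms occur only twice, so campusx is the single feature that survives.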

In [24]:
# binary=False
# When set to True, this parameter sets all non-zero counts to 1. It is used when the occurrence of a
# particular word matters more than its frequency.
# So with binary=True, a word that occurs more than once in a document is still encoded as 1.

cv2 = CountVectorizer(binary=True)
bow2 = cv2.fit_transform(df.text).toarray()
bow2

array([[1, 0, 1, 1, 0],
       [1, 0, 0, 1, 0],
       [0, 1, 1, 0, 1],
       [1, 1, 0, 0, 1]])

As observed, in the 2nd row campusx occurs 2 times, but it is still encoded as 1 because we set binary=True.
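In effect, binary=True clips every non-zero count to 1. The sketch below applies that clipping by hand to the count matrix obtained earlier (copied in as a plain list, so no library is needed); `counts` and `binary_counts` are names used only here.

```python
# Count matrix produced by the default CountVectorizer on the four documents
counts = [[1, 0, 1, 1, 0],
          [2, 0, 0, 1, 0],
          [0, 1, 1, 0, 1],
          [1, 1, 0, 0, 1]]

# Replace every non-zero count with 1 (presence / absence only)
binary_counts = [[1 if c > 0 else 0 for c in row] for row in counts]

print(binary_counts)
```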

## 2.Ngrams

In [None]:
# N-grams are contiguous sequences of n words, symbols, or tokens in a document.
# With n-grams, the vocabulary is built from sequences of more than 1 word.
# Bi-gram: the vocabulary is built from 2 consecutive words
# Tri-gram: the vocabulary is built from 3 consecutive words

# To implement n-grams, we use the ngram_range parameter of CountVectorizer,
# to which we pass a tuple (min_n, max_n):
# for unigram ~ BoW = (1,1)
# for bigram = (2,2)
# for trigram = (3,3)

In [26]:
# Bi-gram

cv3 = CountVectorizer(ngram_range=(2,2))

In [31]:
# bigram vector
bigram = cv3.fit_transform(df.text).toarray()

bigram

array([[0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 1],
       [0, 1, 0, 0, 0, 1]])

In [29]:
# Checking bi-gram vocabulary

cv3.vocabulary_

{'people watch': 2,
 'watch campusx': 4,
 'campusx watch': 0,
 'people write': 3,
 'write comment': 5,
 'campusx write': 1}
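To make the idea concrete, the bigrams can also be generated by sliding a window of 2 words over each document. This is an illustrative sketch, not how scikit-learn does it internally; the helper `word_ngrams` is hypothetical and splits on whitespace.

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive words in `text` (empty list if text is too short)."""
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

print(word_ngrams('people watch campusx', 2))

# Unique bigrams over the corpus, sorted alphabetically like the vocabulary indices
bigram_vocab = sorted({gram for d in docs for gram in word_ngrams(d, 2)})
print(bigram_vocab)
```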

In [32]:
# Trigram

cv4 = CountVectorizer(ngram_range=(3,3))

# Trigram Vector

trigram = cv4.fit_transform(df.text).toarray()

trigram

array([[0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 1, 0, 0]])

In [33]:
# Trigram Vocabulary

cv4.vocabulary_

{'people watch campusx': 2,
 'campusx watch campusx': 0,
 'people write comment': 3,
 'campusx write comment': 1}

In [34]:
# Special cases
# Instead of passing (1,1), (2,2) or (3,3) as ngram_range, we can pass a range such as
# (1,2): this creates the vocabulary for unigrams + bigrams, so in total we will have 11 terms in the vocabulary

# Combination of Uni-gram and Bi-Gram

cv5 = CountVectorizer(ngram_range=(1,2))

# Combined vector

comb_vec = cv5.fit_transform(df.text).toarray()

comb_vec

array([[1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
       [2, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1],
       [1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1]])

In [36]:
# Combined vector vocabulary

cv5.vocabulary_

{'people': 4,
 'watch': 7,
 'campusx': 0,
 'people watch': 5,
 'watch campusx': 8,
 'campusx watch': 1,
 'write': 9,
 'comment': 3,
 'people write': 6,
 'write comment': 10,
 'campusx write': 2}

As we can see, we have a combined vocabulary of unigrams and bigrams, and the resulting shape is (4, 11).

In [37]:
# Combination of Uni, Bi, and Tri Gram

cv6 = CountVectorizer(ngram_range=(1,3))

# Combined vector

comb_vec1 = cv6.fit_transform(df.text).toarray()

comb_vec1

array([[1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
       [2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
       [1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]])

In [38]:
# Combined vector vocabulary

cv6.vocabulary_

{'people': 6,
 'watch': 11,
 'campusx': 0,
 'people watch': 7,
 'watch campusx': 12,
 'people watch campusx': 8,
 'campusx watch': 1,
 'campusx watch campusx': 2,
 'write': 13,
 'comment': 5,
 'people write': 9,
 'write comment': 14,
 'people write comment': 10,
 'campusx write': 3,
 'campusx write comment': 4}

As we can see, we have a combined vocabulary of all unigrams, bigrams, and trigrams, and the resulting shape is (4, 15).
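The vocabulary sizes for the different ngram_range settings can be compared side by side. A short sketch that refits a fresh CountVectorizer for each range on the same four documents; the dictionary name `sizes` is introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

# Map each ngram_range to the number of terms in the fitted vocabulary
sizes = {}
for ngram_range in [(1, 1), (2, 2), (3, 3), (1, 2), (1, 3)]:
    vectorizer = CountVectorizer(ngram_range=ngram_range).fit(docs)
    sizes[ngram_range] = len(vectorizer.vocabulary_)

print(sizes)
```

A combined range simply adds up the vocabularies of its parts (5 + 6 = 11, and 5 + 6 + 4 = 15), so the feature space grows quickly on a real corpus.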

In [39]:
# If we go further, say to 4-grams (quadgrams), scikit-learn raises an error because no row contains 4
# consecutive words from which to build the vocabulary.

cv7 = CountVectorizer(ngram_range=(4,4))

quard_gram = cv7.fit_transform(df.text).toarray()

ValueError: empty vocabulary; perhaps the documents only contain stop words

As observed, scikit-learn raises a ValueError because the vocabulary is empty.
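If the n-gram size is chosen programmatically, it may be worth guarding against this error. One possible sketch is shown below; the helper `fit_ngrams` is hypothetical, and catching every ValueError is a simplifying assumption (a stricter version could inspect the message).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

def fit_ngrams(docs, n):
    """Return the n-gram count matrix, or None when fitting fails (e.g. empty vocabulary)."""
    try:
        return CountVectorizer(ngram_range=(n, n)).fit_transform(docs)
    except ValueError:
        # Raised here when no document has n consecutive words
        return None

print(fit_ngrams(docs, 3).shape)
print(fit_ngrams(docs, 4))
```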

## 3.TF-IDF 

In [None]:
# Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in NLP and information
# retrieval. It measures how important a term is within a document relative to a collection of documents.

# TF-IDF assigns a weight to a particular word/term in a given document:
# Weight(t, D) = TF(t in the given document D) * IDF(t)
# We don't need to do the calculation by hand; it is built in.
# We use sklearn.feature_extraction.text.TfidfVectorizer to perform this operation

In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Creating the tf_idf object

tf_idf = TfidfVectorizer()

# To create the vectors we use the same approach as with CountVectorizer: .fit_transform gives us a sparse matrix,
# which we can convert into an array using .toarray()

vec = tf_idf.fit_transform(df.text).toarray()

vec

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [42]:
# Since the IDF values are fixed once the vectorizer is fitted, we can review them using obj.idf_

tf_idf.idf_

array([1.22314355, 1.51082562, 1.51082562, 1.51082562, 1.51082562])
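These values can be reproduced by hand. With its default smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. A minimal sketch with the standard library; whitespace splitting stands in for the real tokenizer, and `doc_freq` and `idf` are names introduced here.

```python
import math

docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

n_docs = len(docs)
vocab = sorted({token for d in docs for token in d.split()})

# Document frequency: in how many documents does each term appear at least once?
doc_freq = {term: sum(term in d.split() for d in docs) for term in vocab}

# Smoothed IDF, as used by TfidfVectorizer's default settings
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

for term in vocab:
    print(term, doc_freq[term], round(idf[term], 8))
```

campusx appears in 3 of the 4 documents, so it gets the lowest IDF; the other terms appear in 2 documents each and share the same, higher value.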

In [43]:
# Checking the model Vocabulary

tf_idf.vocabulary_

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}

As observed, the vocabulary of TF-IDF is the same as that of BoW; the only difference is that the values are weights rather than raw counts.
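To connect the weights back to the counts, the TF-IDF rows can be rebuilt by hand: multiply each raw count by the term's IDF, then scale the row to unit length. This sketch assumes the TfidfVectorizer defaults (smooth_idf=True, norm='l2', raw counts as TF) and whitespace tokenization; `tfidf_row` is a hypothetical helper.

```python
import math

docs = ['people watch campusx', 'campusx watch campusx',
        'people write comment', 'campusx write comment']

n_docs = len(docs)
vocab = sorted({token for d in docs for token in d.split()})

# Smoothed IDF per vocabulary term: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + sum(term in d.split() for d in docs))) + 1
       for term in vocab]

def tfidf_row(text):
    """Raw count * IDF for every vocabulary term, then L2-normalize the row."""
    tokens = text.split()
    raw = [tokens.count(term) * weight for term, weight in zip(vocab, idf)]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # avoid dividing by zero for empty rows
    return [x / norm for x in raw]

rows = [tfidf_row(d) for d in docs]
for row in rows:
    print([round(x, 8) for x in row])
```

In the second row campusx occurs twice, so its weight dominates the row even though its IDF is the lowest in the vocabulary.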