# **01. Feature Extraction Techniques from Texts**
---


### 🧑‍💼 **Shuvendu Pritam Das**  
*Data Science / ML Enthusiast*  

- **GitHub:** [SPritamDas](https://github.com/SPritamDas/My-Profile)  
- **LinkedIn:** [Shuvendu Pritam Das](https://www.linkedin.com/in/shuvendupritamdas/)  
- **Email:** shuvendupritamdas181@gmail.com  
---

In [1]:
import numpy as np
import pandas as pd
import sklearn

In [2]:
data = pd.DataFrame(['I love India and Odisha and Balasore and Soro','I love Indians','Indians love India and Indians'])

In [3]:
data

Unnamed: 0,0
0,I love India and Odisha and Balasore and Soro
1,I love Indians
2,Indians love India and Indians


# Preprocessing (Tokenization)

In [4]:
data1 = data.copy().rename(columns={0:'text'})

In [5]:
import nltk
from nltk.tokenize import word_tokenize
import pandas as pd

# Download NLTK 'punkt' tokenizer (used by word_tokenize)
nltk.download('punkt')

def tokenizer(text):
    tokenized_data = word_tokenize(text)
    return tokenized_data

data1['tokenized_text'] = data1['text'].apply(tokenizer)
data1

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,text,tokenized_text
0,I love India and Odisha and Balasore and Soro,"[I, love, India, and, Odisha, and, Balasore, a..."
1,I love Indians,"[I, love, Indians]"
2,Indians love India and Indians,"[Indians, love, India, and, Indians]"


# Bag of Words

`class sklearn.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)`

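**Remarks:**
Even without a separate text-preprocessing step, `CountVectorizer` can be applied directly to raw text, because hyperparameters such as `lowercase`, `stop_words`, `token_pattern`, and `tokenizer` control the preprocessing and tokenization it performs internally.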
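As a rough sketch of what this means in practice, `build_analyzer()` returns the function that `CountVectorizer` uses internally to turn raw text into tokens. The second vectorizer below uses `lowercase=False` and a one-character `token_pattern`; these values are illustrative assumptions, not settings used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = 'I love India and Odisha and Balasore and Soro'

# Default analyzer: lowercases the text and keeps only tokens with
# two or more word characters, so the single-letter token "I" is dropped.
default_tokens = CountVectorizer().build_analyzer()(text)
print(default_tokens)

# Illustrative alternative: keep the original case and allow 1-character tokens.
custom = CountVectorizer(lowercase=False, token_pattern=r'(?u)\b\w+\b')
custom_tokens = custom.build_analyzer()(text)
print(custom_tokens)
```

This is why the vocabulary printed by the default vectorizer contains `india` in lower case and has no entry for `I`.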

In [6]:
data2 = data.copy()
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

vectorizer.fit(data2[0])

# Transform the text data into a bag-of-words representation
bag_of_words = vectorizer.transform(data2[0])
# Observation: vocabulary indices follow the alphabetical order of the words
print('vocabulary: ',vectorizer.vocabulary_)
print('bag of words: \n',bag_of_words.toarray())

vocabulary:  {'love': 4, 'india': 2, 'and': 0, 'odisha': 5, 'balasore': 1, 'soro': 6, 'indians': 3}
bag of words: 
 [[3 1 1 0 1 1 1]
 [0 0 0 1 1 0 0]
 [1 0 1 2 1 0 0]]


In [7]:
vectorizer.transform(['I love my India and love my indians']).toarray()

array([[1, 0, 1, 1, 2, 0, 0]])

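**Observation:**

- The matrix is sparse: most entries are 0, because each document uses only a few of the vocabulary words.
- The varying-length problem is solved: every document becomes a vector of length V (the vocabulary size), regardless of how many words it contains.
- Out-of-vocabulary words do not raise an error: `my` was not seen during `fit`, so it is simply ignored at transform time.
- BOW captures semantic similarity only roughly: documents that share many words get a high cosine similarity.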
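The first and third observations can be checked with a small sketch on the same three sentences. The test sentence `'my country'` is a made-up example in which every word is out of vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love India and Odisha and Balasore and Soro',
    'I love Indians',
    'Indians love India and Indians',
]

vectorizer = CountVectorizer().fit(corpus)

# transform() returns a SciPy sparse matrix: only non-zero counts are stored.
bow = vectorizer.transform(corpus)
print('shape:', bow.shape, 'stored non-zeros:', bow.nnz)

# Neither 'my' nor 'country' was seen during fit, so the row is all zeros.
oov_row = vectorizer.transform(['my country']).toarray()
print(oov_row)
```

An all-zero row means the vectorizer did not fail, but it also kept no information about the new words.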

**Problems related to Cosine Similarity**

- D1 : This is good.
- D2: This is not good.
- Vocabulary: ['This','is','good','not']


| Documents | This | is | good | not |
|-----------|------|----|------|-----|
| D1        | 1    | 1  | 1    | 0   |
| D2        | 1    | 1  | 1    | 1   |

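$$\cos\theta = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert} = \frac{3}{\sqrt{3} \cdot 2} = \frac{\sqrt{3}}{2} \approx 0.866, \qquad \theta = 30^\circ$$

So although the two sentences have opposite meanings, their vectors are only 30° apart (cosine similarity of about 0.87). A small change in wording that causes a large change in meaning, such as adding "not", is therefore not captured by BOW.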
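The same arithmetic can be reproduced with a few lines of plain Python. The helper name `cosine_similarity` is our own for this sketch and is not taken from any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vocabulary order: ['This', 'is', 'good', 'not']
d1 = [1, 1, 1, 0]  # "This is good."
d2 = [1, 1, 1, 1]  # "This is not good."

sim = cosine_similarity(d1, d2)
angle = math.degrees(math.acos(sim))
print(round(sim, 3), round(angle, 1))
```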

  

**Hyper-Parameters**
1. `binary=True`: turns the count matrix into a 0/1 matrix, so any term that occurs at least once is recorded as 1 and an absent term as 0.
- This setting is often used in `Sentiment Analysis`, where the presence of a word usually matters more than how often it occurs.

2. `max_features=x`
- Keeps only the `x` terms with the highest total frequency across the corpus.
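A short sketch of both hyperparameters on the same three sentences. The value `max_features=3` is chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love India and Odisha and Balasore and Soro',
    'I love Indians',
    'Indians love India and Indians',
]

# binary=True: counts above 1 (e.g. 'and' x3, 'indians' x2) are clipped to 1.
binary_vec = CountVectorizer(binary=True)
binary_bow = binary_vec.fit_transform(corpus).toarray()
print(binary_bow)

# max_features=3: keep only the 3 terms with the highest corpus-wide counts.
top3_vec = CountVectorizer(max_features=3)
top3_vec.fit(corpus)
print(sorted(top3_vec.vocabulary_))
```

In this corpus `and` occurs 4 times and `indians` and `love` occur 3 times each, so those are the three terms that survive.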

# n-grams

## Bi-grams

In [8]:
data3 = data.copy()
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance
vectorizer = CountVectorizer(ngram_range=(2,2)) #bi-grams

vectorizer.fit(data3[0])

# Transform the text data into a bag-of-bigrams representation
bag_of_words = vectorizer.transform(data3[0])
print('vocabulary: ',vectorizer.vocabulary_)  # Observation: vocabulary indices follow the alphabetical order of the bi-grams
print('bag of words: \n',bag_of_words.toarray())

vocabulary:  {'love india': 7, 'india and': 5, 'and odisha': 2, 'odisha and': 9, 'and balasore': 0, 'balasore and': 4, 'and soro': 3, 'love indians': 8, 'indians love': 6, 'and indians': 1}
bag of words: 
 [[1 0 1 1 1 1 0 1 0 1]
 [0 0 0 0 0 0 0 0 1 0]
 [0 1 0 0 0 1 1 1 0 0]]


## Tri-grams

In [9]:
data3 = data.copy()
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance
vectorizer = CountVectorizer(ngram_range=(3,3)) #tri-grams

vectorizer.fit(data3[0])

# Transform the text data into a bag-of-trigrams representation
bag_of_words = vectorizer.transform(data3[0])
print('vocabulary: ',vectorizer.vocabulary_)  # Observation: vocabulary indices follow the alphabetical order of the tri-grams
print('bag of words: \n',bag_of_words.toarray())

vocabulary:  {'love india and': 6, 'india and odisha': 4, 'and odisha and': 1, 'odisha and balasore': 7, 'and balasore and': 0, 'balasore and soro': 2, 'indians love india': 5, 'india and indians': 3}
bag of words: 
 [[1 1 1 0 1 0 1 1]
 [0 0 0 0 0 0 0 0]
 [0 0 0 1 0 1 1 0]]


## Unigrams + Bi-grams

In [10]:
data3 = data.copy()
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance
vectorizer = CountVectorizer(ngram_range=(1,2)) # unigrams + bi-grams

vectorizer.fit(data3[0])

# Transform the text data into a combined unigram and bi-gram count representation
bag_of_words = vectorizer.transform(data3[0])
print('vocabulary: ',vectorizer.vocabulary_)  # Observation: vocabulary indices follow the alphabetical order of the terms
print('bag of words: \n',bag_of_words.toarray())

vocabulary:  {'love': 11, 'india': 7, 'and': 0, 'odisha': 14, 'balasore': 5, 'soro': 16, 'love india': 12, 'india and': 8, 'and odisha': 3, 'odisha and': 15, 'and balasore': 1, 'balasore and': 6, 'and soro': 4, 'indians': 9, 'love indians': 13, 'indians love': 10, 'and indians': 2}
bag of words: 
 [[3 1 0 1 1 1 1 1 1 0 0 1 1 0 1 1 1]
 [0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0]
 [1 0 1 0 0 0 0 1 1 2 1 1 1 0 0 0 0]]


## Unigrams + Bi-grams + Tri-grams

In [11]:
data3 = data.copy()
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance
vectorizer = CountVectorizer(ngram_range=(1,3)) # unigrams + bi-grams + tri-grams

vectorizer.fit(data3[0])

# Transform the text data into a combined unigram, bi-gram and tri-gram count representation
bag_of_words = vectorizer.transform(data3[0])
print('vocabulary: ',vectorizer.vocabulary_)  # Observation: vocabulary indices follow the alphabetical order of the terms
print('bag of words: \n',bag_of_words.toarray())

vocabulary:  {'love': 17, 'india': 10, 'and': 0, 'odisha': 21, 'balasore': 7, 'soro': 24, 'love india': 18, 'india and': 11, 'and odisha': 4, 'odisha and': 22, 'and balasore': 1, 'balasore and': 8, 'and soro': 6, 'love india and': 19, 'india and odisha': 13, 'and odisha and': 5, 'odisha and balasore': 23, 'and balasore and': 2, 'balasore and soro': 9, 'indians': 14, 'love indians': 20, 'indians love': 15, 'and indians': 3, 'indians love india': 16, 'india and indians': 12}
bag of words: 
 [[3 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 0 1 1 1 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0]
 [1 0 0 1 0 0 0 0 0 0 1 1 1 0 2 1 1 1 1 1 0 0 0 0 0]]


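**Benefits**
- Captures some local word order and context (for example, `love india` versus `indians love`), which single words cannot.
- Every document still maps to a vector of the same length.
- Unseen n-grams at prediction time are ignored without raising an error.

**Problems**
- The vocabulary, and therefore the vector length, is larger than with plain BOW (for this corpus, 7 unigrams versus 25 features with `ngram_range=(1,3)`).
- Sparsity remains, and typically gets worse.
- If an important out-of-vocabulary word or n-gram appears at test time, its information is lost; the only way to include it is to refit the vectorizer.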
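To make the growth in vector length concrete, the sketch below refits the vectorizer on the same three sentences for each `ngram_range` used above and records the vocabulary size.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love India and Odisha and Balasore and Soro',
    'I love Indians',
    'Indians love India and Indians',
]

# Map each ngram_range to the number of features it produces.
sizes = {}
for ngram_range in [(1, 1), (2, 2), (3, 3), (1, 2), (1, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range).fit(corpus)
    sizes[ngram_range] = len(vec.vocabulary_)

for ngram_range, size in sizes.items():
    print(ngram_range, '->', size, 'features')
```

Even on three short sentences, combining unigrams, bi-grams and tri-grams more than triples the number of columns compared with unigrams alone.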

# Tf-Idf

In [12]:
data4 = data.copy()
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer instance
vectorizer = TfidfVectorizer()
vectorizer.fit(data4[0])

# Transform the text data into a TF-IDF representation
tf_idf = vectorizer.transform(data4[0])
print('vocabulary: ',vectorizer.vocabulary_)
print('tf-idf table \n',pd.DataFrame(tf_idf.toarray()))

vocabulary:  {'love': 4, 'india': 2, 'and': 0, 'odisha': 5, 'balasore': 1, 'soro': 6, 'indians': 3}
tf-idf table 
           0         1         2         3         4         5         6
0  0.754975  0.330901  0.251658  0.000000  0.195435  0.330901  0.330901
1  0.000000  0.000000  0.000000  0.789807  0.613356  0.000000  0.000000
2  0.389158  0.000000  0.389158  0.778317  0.302216  0.000000  0.000000


# Custom Features