## Bag of words and TF-IDF
A bag-of-words (BoW) model is a way of extracting features from text for use in modeling, such as with machine learning algorithms.
It turns arbitrary text into a fixed-length vector by counting how many times each word appears, ignoring word order.
https://www.mygreatlearning.com/blog/bag-of-words/
https://www.geeksforgeeks.org/bag-of-words-bow-model-in-nlp/

In [1]:
# Python3 code for preprocessing text
import nltk
import re
import numpy as np

In [2]:
# Sample text to process
text1 = "The quick brown fox jumps over the lazy little dog"
text1

'The quick brown fox jumps over the lazy little dog'

In [8]:
dataset = nltk.sent_tokenize(text1)
dataset

['The quick brown fox jumps over the lazy little dog']

- Convert the text to lower case.
- Replace all non-word characters, including punctuation, with spaces.
- Collapse runs of whitespace into a single space.

In [10]:
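dataset = nltk.sent_tokenize(text1)
for i in range(len(dataset)):
    # Lowercase the sentence
    dataset[i] = dataset[i].lower()
    # Replace every non-word character (punctuation included) with a space
    dataset[i] = re.sub(r'\W', ' ', dataset[i])
    # Collapse runs of whitespace into a single space
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])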

In [5]:
dataset

['the quick brown fox jumps over the lazy little dog']

Next, we obtain the most frequent words in the text by applying the following steps:
- We declare a dictionary to hold our bag of words.
- We tokenize each sentence into words.
- For each word in the sentence, we check whether the word already exists in the dictionary.
- If it does, we increment its count by 1. If it doesn't, we add it to the dictionary with a count of 1.

In [11]:
# Count how often each word occurs across all sentences
word2count = {}
for data in dataset:
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1

In [7]:
word2count

{'the': 2,
 'quick': 1,
 'brown': 1,
 'fox': 1,
 'jumps': 1,
 'over': 1,
 'lazy': 1,
 'little': 1,
 'dog': 1}

In [12]:
import heapq
freq_words = heapq.nlargest(100, word2count, key=word2count.get)

In [13]:
freq_words

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'little', 'dog']
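
The counting and top-N steps above can also be sketched with the standard library alone. This is a minimal sketch, not the notebook's method: it assumes a plain whitespace split is an acceptable stand-in for `nltk.word_tokenize`, and the names `tokens`, `word_counts`, and `top_words` are illustrative.

```python
import re
from collections import Counter

text = "The quick brown fox jumps over the lazy little dog"

# Lowercase, replace non-word characters with spaces, then split on whitespace.
# Assumption: str.split stands in for nltk.word_tokenize on this simple text.
tokens = re.sub(r"\W", " ", text.lower()).split()

# Counter builds the word -> count dictionary in one step.
word_counts = Counter(tokens)

# most_common(n) returns up to n (word, count) pairs, highest count first.
top_words = [word for word, _ in word_counts.most_common(100)]

print(word_counts["the"])  # 2
print(top_words[0])        # the
```

`Counter.most_common` plays the same role as `heapq.nlargest` keyed on the counts.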

### Step 3: Building the Bag of Words model
In this step we construct one vector per sentence that records whether each frequent word occurs in that sentence: 1 if the word is present, 0 if it is not.
This can be implemented with the following code:

In [14]:
X = []
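for data in dataset:
    # Tokenize once per sentence; a set gives fast membership tests
    tokens = set(nltk.word_tokenize(data))
    # 1 if the frequent word occurs in the sentence, else 0
    vector = [1 if word in tokens else 0 for word in freq_words]
    X.append(vector)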
X = np.asarray(X)

In [15]:
X

array([[1, 1, 1, 1, 1, 1, 1, 1, 1]])
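
The vector above records only presence or absence, so the repeated word "the" still maps to 1. A count-based variant, closer to the definition of bag of words at the top of this section, might look like the sketch below. It assumes `str.split` as a stand-in for `nltk.word_tokenize`, and the names `vocabulary` and `count_vector` are illustrative.

```python
sentence = "the quick brown fox jumps over the lazy little dog"

# Same word order as the freq_words list shown earlier.
vocabulary = ["the", "quick", "brown", "fox", "jumps",
              "over", "lazy", "little", "dog"]

# Assumption: whitespace splitting is enough for this preprocessed sentence.
tokens = sentence.split()

# One entry per vocabulary word: how many times it occurs in the sentence.
count_vector = [tokens.count(word) for word in vocabulary]

print(count_vector)  # [2, 1, 1, 1, 1, 1, 1, 1, 1]
```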

### TF-IDF

TF-IDF (term frequency-inverse document frequency) weights each word's count by how rare the word is across the documents, so a word that appears in every document contributes less than one that appears in only a few.

In [16]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
 
sentence_1="This is a good job. I will not miss it for anything"
sentence_2="This is not good at all"

In [20]:
# Without smooth IDF
print("Without Smoothing:")
# Define the TF-IDF vectorizer (use ngram_range=(2, 2) for bigrams only)
tf_idf_vec = TfidfVectorizer(use_idf=True, smooth_idf=False, ngram_range=(1, 1), stop_words='english')
# Fit on both sentences and transform them into a TF-IDF matrix
tf_idf_data = tf_idf_vec.fit_transform([sentence_1, sentence_2])

# Create a DataFrame with one column per term
tf_idf_dataframe = pd.DataFrame(tf_idf_data.toarray(), columns=tf_idf_vec.get_feature_names_out())
print(tf_idf_dataframe)
print("\n")

# With smooth IDF
tf_idf_vec_smooth = TfidfVectorizer(use_idf=True, smooth_idf=True, ngram_range=(1, 1), stop_words='english')
tf_idf_data_smooth = tf_idf_vec_smooth.fit_transform([sentence_1, sentence_2])

print("With Smoothing:")
tf_idf_dataframe_smooth = pd.DataFrame(tf_idf_data_smooth.toarray(), columns=tf_idf_vec_smooth.get_feature_names_out())
print(tf_idf_dataframe_smooth)

Without Smoothing:
       good       job      miss
0  0.385372  0.652491  0.652491
1  1.000000  0.000000  0.000000


With Smoothing:
       good       job      miss
0  0.449436  0.631667  0.631667
1  1.000000  0.000000  0.000000
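
To see where these numbers come from, the first row can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults: idf = ln(n / df) + 1 without smoothing, idf = ln((1 + n) / (1 + df)) + 1 with smoothing, followed by L2 normalization of each row. The term counts are typed in manually from the columns above (the remaining words are dropped as English stop words), and the helper `tfidf_row` is hypothetical, not part of any library.

```python
import math

# Term counts left after stop-word removal, taken from the output columns.
docs = [
    {"good": 1, "job": 1, "miss": 1},  # sentence_1
    {"good": 1},                       # sentence_2
]
terms = ["good", "job", "miss"]
n_docs = len(docs)

def tfidf_row(doc, smooth):
    """Hypothetical helper: TF-IDF weights for one document, L2-normalized."""
    weights = []
    for term in terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for d in docs if term in d)
        if smooth:
            idf = math.log((1 + n_docs) / (1 + df)) + 1
        else:
            idf = math.log(n_docs / df) + 1
        weights.append(doc.get(term, 0) * idf)
    # Scale the row to unit Euclidean length.
    norm = math.sqrt(sum(w * w for w in weights))
    return [round(w / norm, 6) for w in weights]

row_plain = tfidf_row(docs[0], smooth=False)
row_smooth = tfidf_row(docs[0], smooth=True)
print(row_plain)   # [0.385372, 0.652491, 0.652491]
print(row_smooth)  # [0.449436, 0.631667, 0.631667]
```

The second sentence contains only "good" after stop-word removal, so normalization scales its single non-zero weight to 1.0 in both cases. Smoothing lowers the idf of the rarer words "job" and "miss", which is why the weight of "good" rises in the first row.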
