# Vectorization

**Vectorization is an important step in text processing before the text is passed to a model.** A computer cannot understand words directly, so the text is first converted to numbers, which are then passed through a neural network to train the parameters of the model.

Over time, vectorization has evolved from simple static methods such as one-hot encoding and bag-of-words to neural-network-based, trainable and dynamic techniques such as Word2Vec and contextual/positional encodings.

In this notebook we will explore simple vectorization techniques, which are:
- One-hot encoding (OHE)
- Bag of words
- TF-IDF

We will build them from scratch, using NLTK for preprocessing, and will also use scikit-learn's `TfidfVectorizer` to get TF-IDF vectors from a library.

### Loading dataset

In [1]:
import numpy as np
import pandas as pd
import nltk
import kagglehub
import os
import string

import warnings
warnings.filterwarnings("ignore")


In [2]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/harshagarwal/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/harshagarwal/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [4]:
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/harshagarwal/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
from nltk.stem import PorterStemmer

## Text preprocessing (tokenization), as performed in the *tokenization/basic_tokenizers* notebook

In [6]:
path = kagglehub.dataset_download("selimkhaled50/new-data")
text = pd.read_json(os.path.join(path, os.listdir(path)[0]))

text['tag'] = text.intents.apply(lambda x: x['tag'])
text['patterns'] = text.intents.apply(lambda x: ' '.join(x['patterns']))
text['responses'] = text.intents.apply(lambda x: ' '.join(x['responses']))

text.drop(columns=['intents'], inplace=True)

def remove_punctuations(text):
    return ''.join(i for i in list(text) if i not in string.punctuation)

texts = text.copy()
texts['patterns'] = texts.patterns.apply(lambda x: remove_punctuations(x))
texts['responses'] = texts.responses.apply(lambda x: remove_punctuations(x))
texts['tag'] = texts.tag.apply(lambda x: remove_punctuations(x))

def convert_to_lowercase(text):
    return text.lower()

texts['patterns'] = texts.patterns.apply(lambda x: convert_to_lowercase(x))
texts['responses'] = texts.responses.apply(lambda x: convert_to_lowercase(x))
texts['tag'] = texts.tag.apply(lambda x: convert_to_lowercase(x))

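# Use a separate name so the imported nltk `stopwords` module is not shadowed
# (shadowing it makes this cell fail when it is run a second time)
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])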

texts['patterns'] = texts.patterns.apply(lambda x: remove_stopwords(x))
texts['responses'] = texts.responses.apply(lambda x: remove_stopwords(x))

lemmatizer = WordNetLemmatizer()

def lematize_text_n(text):
    return ' '.join(lemmatizer.lemmatize(word, pos="n") for word in text.split())

def lematize_text_v(text):
    return ' '.join(lemmatizer.lemmatize(word, pos="v") for word in text.split())

texts['patterns'] = texts.patterns.apply(lambda x: lematize_text_v(x))

stemmer = PorterStemmer()

def stem_text(text):
    return " ".join(stemmer.stem(word) for word in text.split())

texts['responses'] = texts.responses.apply(lambda x: stem_text(x))

texts

Unnamed: 0,tag,patterns,responses
0,greeting,anyone ola hey hi howdy konnichiwa guten tag h...,hello tell feel today hi bring today hi feel t...
1,morning,great wake good start day great start day good...,good morn hope good night sleep feel today
2,afternoon,nice day good start day good day good afternoo...,good afternoon day go
3,evening,nice day nice morning good night good morning ...,good even day
4,night,good even good night nice nice day good night ...,good night get proper sleep good night sweet d...
...,...,...,...
75,fact28,concern mental health im worry mental health i...,import thing talk someon trust might friend co...
76,fact29,sure im well im feel well know im well know po...,belief thought feel behaviour signific impact ...
77,fact30,keep touch people stay touch friends do mainta...,lot peopl alon right dont lone togeth think di...
78,fact31,difference stress anxiety stress anxiety diffe...,stress anxieti often use interchang overlap st...


Now we have the pre-processed text, and we need to convert it into vectors.

In [7]:
texts['patterns'][0], text['patterns'][0]

('anyone ola hey hi howdy konnichiwa guten tag hi hola hey bonjour hello',
 'Is anyone there? Ola Hey there Hi there Howdy Konnichiwa Guten tag Hi Hola Hey Bonjour Hello')

In [8]:
texts['responses'][0], text['responses'][0]

('hello tell feel today hi bring today hi feel today great see feel current hello glad see your back what go world right',
 "Hello there. Tell me how are you feeling today? Hi there. What brings you here today? Hi there. How are you feeling today? Great to see you. How do you feel currently? Hello there. Glad to see you're back. What's going on in your world right now?")

## One-Hot Encoding

In one-hot encoding, we first build a vocabulary, or choose an existing one. Every word in our corpus is then represented by a vector of the same size as the vocabulary, with a 1 at the index of that word and 0 at every other index.

Let's make a simple vocabulary

In [9]:
vocab = sorted(list(set(' '.join(texts['responses'].unique()).split(' '))))
len(vocab), vocab

(754,
 ['1',
  '10',
  '1318',
  '1524',
  '18',
  '2',
  '24',
  '3',
  '4',
  '5',
  '6',
  '7',
  '7090',
  '75',
  '8',
  '9',
  '9152987821',
  'abil',
  'abl',
  'abus',
  'access',
  'accord',
  'ach',
  'act',
  'activ',
  'actual',
  'addit',
  'adolesc',
  'adult',
  'advic',
  'advis',
  'affect',
  'afraid',
  'afternoon',
  'age',
  'agent',
  'aggress',
  'ai',
  'aid',
  'aim',
  'alcohol',
  'allevi',
  'alon',
  'along',
  'alright',
  'also',
  'altern',
  'although',
  'alway',
  'america',
  'andor',
  'anger',
  'anoth',
  'answer',
  'anxieti',
  'anyon',
  'anyth',
  'anyway',
  'appetit',
  'appli',
  'approach',
  'appropri',
  'area',
  'arent',
  'aris',
  'around',
  'ask',
  'aspect',
  'assist',
  'assit',
  'associ',
  'assum',
  'attent',
  'attitud',
  'author',
  'avail',
  'avoid',
  'aw',
  'awar',
  'away',
  'back',
  'background',
  'base',
  'becom',
  'begin',
  'behav',
  'behavior',
  'behaviour',
  'behind',
  'belief',
  'benefici',
  'benef

So, we combined all the words in the responses column and kept the unique ones to make our vocabulary. Each one-hot vector will therefore be of size 754.

Let's now form a basic vocabulary vector

In [10]:
vector = np.zeros(shape=(len(vocab)))
vector.shape, vector

((754,),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
   

So, we just made our basic vector. 

**Let's vectorize our first sentence.**

In [11]:
texts['responses'][0]

'hello tell feel today hi bring today hi feel today great see feel current hello glad see your back what go world right'

In [12]:
ohe_0 = np.zeros((len(texts['responses'][0].split(" ")), len(vector)))
ohe_0.shape, ohe_0

((23, 754),
 array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], shape=(23, 754)))

In [13]:
# Set a 1 in row i at the vocabulary index of the i-th word
for i, word in enumerate(texts['responses'][0].split(" ")):
    word_index = vocab.index(word)
    ohe_0[i, word_index] = 1

ohe_0

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], shape=(23, 754))

In [14]:
ohe_0[0].sum()

np.float64(1.0)

Now we have the one-hot matrix for the 1st sentence: one row per word, with each row summing to 1. Similarly, we can do OHE for every sentence.
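
The steps above can be wrapped into a small reusable function. This is a minimal sketch, not the notebook's own code: `build_index` and `one_hot_encode` are hypothetical helper names, and the toy vocabulary is made up from a few words of the first response. It looks words up in a dictionary instead of calling `vocab.index`, which avoids scanning the whole list for every word.

```python
def build_index(vocab):
    """Map each vocabulary word to its position (hypothetical helper)."""
    return {word: i for i, word in enumerate(vocab)}


def one_hot_encode(sentence, word_to_index):
    """Return one one-hot row (a list of 0/1) per word in the sentence."""
    rows = []
    for word in sentence.split(" "):
        row = [0] * len(word_to_index)
        row[word_to_index[word]] = 1  # raises KeyError for a word outside the vocabulary
        rows.append(row)
    return rows


# Toy vocabulary, sorted the same way as in the notebook
toy_vocab = sorted(set("hello tell feel today hi".split(" ")))
toy_index = build_index(toy_vocab)
encoded = one_hot_encode("hi feel today", toy_index)

for row in encoded:
    print(row)
```

Each printed row has exactly one 1, at the position of that word in `toy_vocab`.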

In [15]:
responses = list(texts['responses'].unique())

max_len = 0
for i in responses:
    if len(i.split(" "))>max_len:
        max_len = len(i.split(" "))

max_len

204

### Now, let's work out the size of our input vectors.

- The first dimension is the number of data points (80 here).
- Each data point is a sequence of word vectors, and each word vector has the value 1 at the index of that word in the vocabulary list.

So, the overall size will be **(number of data points x sequence length x size of vocab)**

Here the sequence length is the maximum number of words in any data point (204 here). Shorter data points are padded with all-zero rows, which ensures that all data points are of equal size.
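
To get a feel for how wasteful this representation is, here is a rough back-of-the-envelope calculation. It is only a sketch: the three numbers are the shapes reported in this notebook, and the 8 bytes per cell assumes NumPy's default `float64` dtype for `np.zeros`.

```python
# Shapes reported in this notebook: data points, padded length, vocabulary size
n_samples, max_len, vocab_size = 80, 204, 754

total_cells = n_samples * max_len * vocab_size
bytes_float64 = total_cells * 8      # assumes float64, i.e. 8 bytes per cell
max_nonzero = n_samples * max_len    # at most one 1 per word position

print("cells:", total_cells)
print("size in MB:", bytes_float64 / 1e6)
print("largest possible fraction of non-zero cells:", max_nonzero / total_cells)
```

Even for 80 short texts the array takes close to 100 MB, and at most one cell in every 754 can be non-zero.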

In [16]:
vectors_full = np.zeros((len(texts), max_len,len(vocab)))
vectors_full.shape

(80, 204, 754)

In [17]:
# ind: data point, j: word position, vocab.index(word): position in the vocabulary
for ind, response in enumerate(responses):
    for j, word in enumerate(response.split(" ")):
        vectors_full[ind, j, vocab.index(word)] = 1

vectors_full

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       ...,

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0.

In [18]:
vectors_full[0][0].sum(), vectors_full[0][200].sum()

(np.float64(1.0), np.float64(0.0))

So, we have one-hot encoded the responses column of our dataframe.

## Now let's build Bag-of-words

### Intuition

Bag of words is a vectorization technique based on the frequency count of every word. It does not consider the order or the context of words in a sentence.

The bag-of-words vector is built from a vocabulary of unique words. For each sentence, we count how many times every word occurs and store that count at the word's index in the sentence's vector.

The difference between BOW and the previous OHE is that OHE encodes every sentence into a matrix of size (word count in sentence x vocabulary length), whereas BOW encodes every sentence into a single vector of size (1 x vocabulary length). We assign 1 vector to each sentence, of the size of our vocabulary, where the entry for a word is its count in that sentence.

E.g. "i love ice creams and i like summers" => vocab: ['and', 'creams', 'i', 'ice', 'like', 'love', 'summers'] => BOW: [1, 1, 2, 1, 1, 1, 1]
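
The example above can be reproduced in a few lines. This is a minimal sketch using `collections.Counter` from the standard library; the variable names are only for illustration and are kept separate from the notebook's `vocab`.

```python
from collections import Counter

sentence = "i love ice creams and i like summers"
tokens = sentence.split(" ")

example_vocab = sorted(set(tokens))   # unique words in alphabetical order
counts = Counter(tokens)              # word -> number of occurrences
bow = [counts[word] for word in example_vocab]

print(example_vocab)
print(bow)
```

The word "i" occurs twice, so its entry is 2 and every other entry is 1.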

Let's build BOW from Scratch!

In [19]:
# Vocab for patterns column

vocab = sorted(list(set(' '.join(texts['patterns'].unique()).split(' '))))
len(vocab), vocab

(330,
 ['able',
  'absolutely',
  'advice',
  'affect',
  'afraid',
  'afternoon',
  'alot',
  'already',
  'alright',
  'anecdote',
  'another',
  'answer',
  'anxiety',
  'anxious',
  'anymore',
  'anyone',
  'anything',
  'appear',
  'appreciate',
  'approach',
  'arrive',
  'ask',
  'assistance',
  'au',
  'available',
  'aware',
  'away',
  'awful',
  'bad',
  'become',
  'believe',
  'better',
  'bonjour',
  'boyfriend',
  'break',
  'bring',
  'brother',
  'burn',
  'bye',
  'call',
  'cannot',
  'cant',
  'cause',
  'chance',
  'cheerful',
  'child',
  'choices',
  'commit',
  'common',
  'concern',
  'connections',
  'consider',
  'contemplate',
  'continue',
  'control',
  'correct',
  'correctly',
  'could',
  'crazy',
  'create',
  'creator',
  'crucial',
  'cure',
  'currently',
  'dad',
  'day',
  'days',
  'define',
  'depress',
  'depression',
  'deserve',
  'die',
  'difference',
  'differences',
  'different',
  'discuss',
  'dislike',
  'disorder',
  'do',
  'doesnt'

#### So, every sentence in our dataset will be of size (1 x 330) and our BOW vectorization will be of size (80 x 1 x 330)

In [20]:
dataset_size = np.zeros((len(texts), 1, len(vocab)))
dataset_size.shape

(80, 1, 330)

In [21]:
ind = 0
for i in texts['patterns'].unique():
    word_dict = {}
    for c, word in enumerate(list(set(i.split(" ")))):
        word_dict[word] = i.split(" ").count(word)
    print(word_dict)
    for k, v in word_dict.items():
        vocab_index = vocab.index(k)
        word_count = v
        dataset_size[ind][0][vocab_index] = word_count

    ind += 1

{'hola': 1, 'tag': 1, 'hi': 2, 'ola': 1, 'howdy': 1, 'hello': 1, 'anyone': 1, 'guten': 1, 'konnichiwa': 1, 'bonjour': 1, 'hey': 2}
{'great': 3, 'morning': 1, 'nice': 3, 'good': 5, 'start': 9, 'day': 8, 'wake': 2, 'warm': 1}
{'morning': 3, 'nice': 4, 'good': 8, 'day': 7, 'start': 3, 'afternoon': 1}
{'morning': 3, 'great': 1, 'nice': 3, 'good': 8, 'day': 4, 'start': 2, 'even': 1, 'night': 4}
{'great': 3, 'nice': 4, 'good': 4, 'day': 2, 'wonderful': 1, 'even': 4, 'night': 5}
{'alright': 1, 'later': 1, 'goodbye': 3, 'bye': 4, 'see': 1, 'well': 1, 'revoir': 1, 'fare': 1, 'au': 1, 'sayonara': 1, 'thee': 1, 'ok': 1}
{'help': 2, 'useful': 1, 'thats': 2, 'assistance': 1, 'much': 4, 'helpful': 2, 'thank': 7}
{'vvvvvvvvvvvvv': 1, 'person': 1, 'indeed': 1, 'xcaxxczcq': 1, 'mbndjjfjfjsssf': 1, 'n': 1, 'rain': 1, 'theres': 1, 'something': 1, 'someone': 2}
{'anything': 1, 'significance': 2, 'lot': 1, 'nothing': 4, 'much': 3, 'wasnt': 1, 'isnt': 1, 'significant': 6}
{'whats': 1, 'call': 1, 'name': 3, 

In [22]:
dataset_size[0]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 2., 2., 0., 1., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 

So, we are now done with Bag Of Words. Vectors are now ready to pass to models.

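**Disadvantages of BOW:**

- Weak representation: word order and context are not accounted for, so sentences with the same words in a different order get the same vector
- Sparse vectors: most entries are 0
- Fails for a new word that is not in the vocabulary

**Advantages of BOW:**

- Less sparse and more compact than OHE: one vector per sentence instead of one vector per word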
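
The first and third disadvantages are easy to see on a toy example. In the sketch below, `bow_vector` is a hypothetical helper and the three-word vocabulary is made up. Note one difference from the loop used earlier in this notebook: there, `vocab.index` would raise a `ValueError` for an unseen word, while this sketch silently drops it.

```python
from collections import Counter


def bow_vector(sentence, vocabulary):
    """Count each vocabulary word in the sentence; unknown words are ignored."""
    counts = Counter(sentence.split(" "))
    return [counts[word] for word in vocabulary]


demo_vocab = ["bites", "dog", "man"]

a = bow_vector("dog bites man", demo_vocab)
b = bow_vector("man bites dog", demo_vocab)
c = bow_vector("cat bites man", demo_vocab)  # "cat" is not in the vocabulary

print(a, b, a == b)
print(c)
```

The first two sentences mean different things but get identical vectors, and the word "cat" leaves no trace in the third vector.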

## TF-IDF (Term frequency - Inverse Document Frequency)

TF-IDF is usually a better technique for word representation than BOW and OHE. The vectors are still sparse, but instead of a raw count each word gets a weight that reflects how informative it is: a word that is frequent in one sentence but rare in the rest of the dataset scores highest. In this notebook, every sentence is treated as one document. There are 2 terms in TF-IDF, let's understand both:

- **TF (Term Frequency):** As the name suggests, it is the frequency of a word in the sentence.

TF = number of times the word appears in the sentence / total number of words in the sentence

- **IDF (Inverse Document Frequency):** Simply put, it is the inverse of document frequency. But what is document frequency? Document frequency is the ratio of the number of documents in which the word appears to the total number of documents.

Document Frequency = number of documents where word w appears / total number of documents

Taking the inverse of this ratio and applying a log gives:

IDF = log(total number of documents / number of documents where word w appears)

The log is applied so that IDF doesn't shoot up for words that appear in very few documents.

Now **TF-IDF = TF x IDF**
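
Before applying this to the dataset, here is a small worked example of the formulas above. It is a sketch on three made-up documents, using the natural log; the helper names `tf` and `idf` are only for illustration.

```python
import math

docs = ["good day", "good night", "nice day today"]
tokenized = [d.split(" ") for d in docs]
n_docs = len(tokenized)


def tf(word, tokens):
    """Occurrences of the word divided by the number of words in the document."""
    return tokens.count(word) / len(tokens)


def idf(word):
    """log(total documents / documents containing the word).

    Assumes the word occurs in at least one document, otherwise this divides by zero.
    """
    df = sum(word in tokens for tokens in tokenized)
    return math.log(n_docs / df)


scores = [{w: round(tf(w, t) * idf(w), 3) for w in t} for t in tokenized]
for doc, score in zip(docs, scores):
    print(doc, "=>", score)
```

In the second document, "night" appears in only one document and so gets a higher score than "good", which appears in two, even though both occur once in that sentence.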


Let's build TF-IDF on the pattern column

In [23]:
vocabs = {v: 0 for v in vocab}
vocabs

{'able': 0,
 'absolutely': 0,
 'advice': 0,
 'affect': 0,
 'afraid': 0,
 'afternoon': 0,
 'alot': 0,
 'already': 0,
 'alright': 0,
 'anecdote': 0,
 'another': 0,
 'answer': 0,
 'anxiety': 0,
 'anxious': 0,
 'anymore': 0,
 'anyone': 0,
 'anything': 0,
 'appear': 0,
 'appreciate': 0,
 'approach': 0,
 'arrive': 0,
 'ask': 0,
 'assistance': 0,
 'au': 0,
 'available': 0,
 'aware': 0,
 'away': 0,
 'awful': 0,
 'bad': 0,
 'become': 0,
 'believe': 0,
 'better': 0,
 'bonjour': 0,
 'boyfriend': 0,
 'break': 0,
 'bring': 0,
 'brother': 0,
 'burn': 0,
 'bye': 0,
 'call': 0,
 'cannot': 0,
 'cant': 0,
 'cause': 0,
 'chance': 0,
 'cheerful': 0,
 'child': 0,
 'choices': 0,
 'commit': 0,
 'common': 0,
 'concern': 0,
 'connections': 0,
 'consider': 0,
 'contemplate': 0,
 'continue': 0,
 'control': 0,
 'correct': 0,
 'correctly': 0,
 'could': 0,
 'crazy': 0,
 'create': 0,
 'creator': 0,
 'crucial': 0,
 'cure': 0,
 'currently': 0,
 'dad': 0,
 'day': 0,
 'days': 0,
 'define': 0,
 'depress': 0,
 'depression

In [24]:
words = ' '.join(texts['patterns'].unique()).split(" ")
words

['anyone',
 'ola',
 'hey',
 'hi',
 'howdy',
 'konnichiwa',
 'guten',
 'tag',
 'hi',
 'hola',
 'hey',
 'bonjour',
 'hello',
 'great',
 'wake',
 'good',
 'start',
 'day',
 'great',
 'start',
 'day',
 'good',
 'start',
 'day',
 'warm',
 'start',
 'day',
 'great',
 'start',
 'day',
 'nice',
 'start',
 'good',
 'start',
 'day',
 'nice',
 'wake',
 'nice',
 'start',
 'day',
 'good',
 'morning',
 'good',
 'start',
 'day',
 'nice',
 'day',
 'good',
 'start',
 'day',
 'good',
 'day',
 'good',
 'afternoon',
 'nice',
 'morning',
 'good',
 'morning',
 'nice',
 'day',
 'good',
 'start',
 'day',
 'nice',
 'good',
 'start',
 'day',
 'good',
 'day',
 'good',
 'morning',
 'nice',
 'day',
 'nice',
 'morning',
 'good',
 'night',
 'good',
 'morning',
 'good',
 'start',
 'day',
 'nice',
 'night',
 'good',
 'even',
 'good',
 'start',
 'day',
 'great',
 'night',
 'good',
 'day',
 'good',
 'morning',
 'good',
 'night',
 'good',
 'even',
 'good',
 'night',
 'nice',
 'nice',
 'day',
 'good',
 'night',
 'wonderfu

In [25]:
for k, v in vocabs.items():
    times = words.count(k)
    vocabs[k] = times

vocabs

{'able': 1,
 'absolutely': 3,
 'advice': 5,
 'affect': 6,
 'afraid': 1,
 'afternoon': 1,
 'alot': 1,
 'already': 7,
 'alright': 1,
 'anecdote': 1,
 'another': 2,
 'answer': 2,
 'anxiety': 12,
 'anxious': 5,
 'anymore': 4,
 'anyone': 2,
 'anything': 10,
 'appear': 1,
 'appreciate': 1,
 'approach': 1,
 'arrive': 1,
 'ask': 10,
 'assistance': 8,
 'au': 1,
 'available': 10,
 'aware': 7,
 'away': 10,
 'awful': 1,
 'bad': 1,
 'become': 3,
 'believe': 3,
 'better': 12,
 'bonjour': 1,
 'boyfriend': 1,
 'break': 6,
 'bring': 1,
 'brother': 1,
 'burn': 1,
 'bye': 4,
 'call': 3,
 'cannot': 2,
 'cant': 12,
 'cause': 25,
 'chance': 1,
 'cheerful': 1,
 'child': 19,
 'choices': 1,
 'commit': 2,
 'common': 4,
 'concern': 3,
 'connections': 3,
 'consider': 1,
 'contemplate': 1,
 'continue': 4,
 'control': 12,
 'correct': 6,
 'correctly': 1,
 'could': 7,
 'crazy': 2,
 'create': 2,
 'creator': 1,
 'crucial': 1,
 'cure': 12,
 'currently': 1,
 'dad': 2,
 'day': 21,
 'days': 4,
 'define': 5,
 'depress': 11,

These are counts over the whole corpus. For TF-IDF we need to compute the frequencies separately for each sentence.

In [26]:
# Building TF-IDF for every sentence (one dict of word -> score per sentence)
import math

documents = [doc.split(" ") for doc in texts['patterns'].unique()]
n_documents = len(texts)

tfidf = []
for tokens in documents:
    # Raw count of every word in this sentence
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    # TF: count divided by the number of words in the sentence
    tf = {k: c / len(tokens) for k, c in counts.items()}
    # IDF: log(total documents / documents that contain the word)
    idf = {k: math.log(n_documents / sum(k in d for d in documents)) for k in counts}
    tfidf.append({k: round(tf[k] * idf[k], 2) for k in tf})

tfidf

[{'anyone': 0.28,
  'ola': 0.34,
  'hey': 0.67,
  'hi': 0.67,
  'howdy': 0.34,
  'konnichiwa': 0.34,
  'guten': 0.34,
  'tag': 0.34,
  'hola': 0.34,
  'bonjour': 0.34,
  'hello': 0.34},
 {'great': 0.28,
  'wake': 0.27,
  'good': 0.38,
  'start': 0.84,
  'day': 0.75,
  'warm': 0.14,
  'nice': 0.26,
  'morning': 0.1},
 {'nice': 0.43,
  'day': 0.81,
  'good': 0.75,
  'start': 0.35,
  'afternoon': 0.17,
  'morning': 0.38},
 {'nice': 0.32,
  'day': 0.46,
  'morning': 0.38,
  'good': 0.75,
  'night': 0.57,
  'start': 0.23,
  'even': 0.14,
  'great': 0.12},
 {'good': 0.42,
  'even': 0.64,
  'night': 0.8,
  'nice': 0.48,
  'day': 0.26,
  'wonderful': 0.19,
  'great': 0.39},
 {'goodbye': 0.77,
  'ok': 0.18,
  'bye': 1.03,
  'fare': 0.26,
  'thee': 0.26,
  'well': 0.15,
  'alright': 0.26,
  'see': 0.22,
  'later': 0.26,
  'sayonara': 0.26,
  'au': 0.26,
  'revoir': 0.26},
 {'thats': 0.39,
  'helpful': 0.39,
  'thank': 1.21,
  'much': 0.55,
  'help': 0.27,
  'assistance': 0.15,
  'useful': 0.19},

#### So, every sentence in our dataset will be of size (1 x 330) and our TF-IDF vectorization will be of size (80 x 1 x 330)

In [27]:
tfidf_size = np.zeros((len(texts), 1, len(vocabs)))
tfidf_size.shape

(80, 1, 330)

In [28]:
ind = 0
for i in texts['patterns'].unique():
    word_dict = tfidf[ind]
    print(word_dict)
    for k, v in word_dict.items():
        vocab_index = vocab.index(k)
        tfidf_value = v
        tfidf_size[ind][0][vocab_index] = tfidf_value

    ind += 1

tfidf_size

{'anyone': 0.28, 'ola': 0.34, 'hey': 0.67, 'hi': 0.67, 'howdy': 0.34, 'konnichiwa': 0.34, 'guten': 0.34, 'tag': 0.34, 'hola': 0.34, 'bonjour': 0.34, 'hello': 0.34}
{'great': 0.28, 'wake': 0.27, 'good': 0.38, 'start': 0.84, 'day': 0.75, 'warm': 0.14, 'nice': 0.26, 'morning': 0.1}
{'nice': 0.43, 'day': 0.81, 'good': 0.75, 'start': 0.35, 'afternoon': 0.17, 'morning': 0.38}
{'nice': 0.32, 'day': 0.46, 'morning': 0.38, 'good': 0.75, 'night': 0.57, 'start': 0.23, 'even': 0.14, 'great': 0.12}
{'good': 0.42, 'even': 0.64, 'night': 0.8, 'nice': 0.48, 'day': 0.26, 'wonderful': 0.19, 'great': 0.39}
{'goodbye': 0.77, 'ok': 0.18, 'bye': 1.03, 'fare': 0.26, 'thee': 0.26, 'well': 0.15, 'alright': 0.26, 'see': 0.22, 'later': 0.26, 'sayonara': 0.26, 'au': 0.26, 'revoir': 0.26}
{'thats': 0.39, 'helpful': 0.39, 'thank': 1.21, 'much': 0.55, 'help': 0.27, 'assistance': 0.15, 'useful': 0.19}
{'person': 0.27, 'someone': 0.5, 'rain': 0.4, 'n': 0.4, 'vvvvvvvvvvvvv': 0.4, 'theres': 0.4, 'something': 0.21, 'inde

array([[[0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.]],

       ...,

       [[0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.]]], shape=(80, 1, 330))

In [29]:
tfidf_size[1]

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.75,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.38, 0.  ,
        0.28, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 

In [30]:
tfidf[1]

{'great': 0.28,
 'wake': 0.27,
 'good': 0.38,
 'start': 0.84,
 'day': 0.75,
 'warm': 0.14,
 'nice': 0.26,
 'morning': 0.1}

## So, we are now done with TF-IDF

However, it is still sparse and doesn't carry contextual or positional relationships.

### Everything above was done using unigrams (single tokens/words). The same techniques can be applied to bi-grams, tri-grams or n-grams as well
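
As a quick illustration of what an n-gram is, the sketch below slides a window of `n` tokens over a sentence. The `ngrams` function is a hypothetical helper written for this example, not a library call.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "good start day".split(" ")

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

Each n-gram then becomes its own vocabulary entry, which is why the vocabulary grows quickly once bi-grams are included.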

## TF-IDF using sklearn with n-grams of size 1 and 2

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = texts['patterns'].unique()

tfv = TfidfVectorizer(ngram_range=(1, 2))
vect = tfv.fit_transform(corpus)
vect

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1930 stored elements and shape (80, 1403)>

In [32]:
tfv.get_feature_names_out()

array(['able', 'absolutely', 'absolutely correct', ..., 'youre robot',
       'youre stupid', 'youre useless'], shape=(1403,), dtype=object)

In [33]:
vect.shape

(80, 1403)

In [34]:
pd.DataFrame(vect[:5].toarray(), columns=tfv.get_feature_names_out())

Unnamed: 0,able,absolutely,absolutely correct,absolutely right,advice,advice need,advice something,affect,affect affect,affect mental,...,yes would,youre,youre absolutely,youre correct,youre crazy,youre insane,youre right,youre robot,youre stupid,youre useless
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
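
The values produced by sklearn will generally not match our from-scratch numbers. As far as the scikit-learn documentation describes the defaults, `TfidfVectorizer` uses the raw count as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales every row to unit L2 norm. The sketch below re-implements that formula in pure Python on two made-up documents. It is only an approximation of the library's behaviour: it splits on spaces, whereas sklearn has its own tokenization, and `smoothed_tfidf` is a hypothetical helper name.

```python
import math


def smoothed_tfidf(docs):
    """TF-IDF with smoothed IDF and L2 row normalization (pure-Python sketch)."""
    tokenized = [d.split(" ") for d in docs]
    terms = sorted({t for tokens in tokenized for t in tokens})
    n = len(tokenized)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {
        t: math.log((1 + n) / (1 + sum(t in tokens for tokens in tokenized))) + 1
        for t in terms
    }
    rows = []
    for tokens in tokenized:
        raw = [tokens.count(t) * idf[t] for t in terms]    # TF is the raw count here
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0   # guard against an all-zero row
        rows.append([x / norm for x in raw])
    return terms, rows


terms, rows = smoothed_tfidf(["good day", "good night"])
print(terms)
for row in rows:
    print([round(x, 3) for x in row])
```

With this variant a word that appears in every document, such as "good" here, still gets a non-zero weight, whereas our from-scratch IDF would give it log(1) = 0.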


### We are now done with basic word vectorization