One hot encoding

In [1]:
import numpy as np

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "She sells seashells by the seashore.",
    "Peter Piper picked a peck of pickled peppers."
]

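# Build the vocabulary: lowercase every whitespace-separated token.
# Punctuation stays attached, so "dog." and "dog" would be distinct entries.
unique_words = set()
for sentence in corpus:
    for word in sentence.split():
        unique_words.add(word.lower())

# Assign each word an integer index. Sets are unordered, so the exact
# indices can differ from run to run.
word_to_index = {}
for i, word in enumerate(unique_words):
    word_to_index[word] = i
word_to_index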

{'picked': 0,
 'seashore.': 1,
 'of': 2,
 'jumped': 3,
 'she': 4,
 'peck': 5,
 'by': 6,
 'seashells': 7,
 'peppers.': 8,
 'dog.': 9,
 'lazy': 10,
 'a': 11,
 'piper': 12,
 'over': 13,
 'sells': 14,
 'brown': 15,
 'the': 16,
 'peter': 17,
 'fox': 18,
 'pickled': 19,
 'quick': 20}

In [2]:
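# Encode every word of every sentence as a vector with a single 1
# at the position given by word_to_index.
one_hot_vectors = []
for sentence in corpus:
    sentence_vectors = []
    for word in sentence.split():
        vector = np.zeros(len(unique_words))
        vector[word_to_index[word.lower()]] = 1
        sentence_vectors.append(vector)
    one_hot_vectors.append(sentence_vectors)

# Print the vectors for the first sentence, one per word.
for vector in one_hot_vectors[0]:
    print(vector)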

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


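One-hot encoding represents each word as a vector as long as the vocabulary, with a single 1 at that word's index. It records which word is present but ignores the semantic relationships between words: every pair of distinct words is equally far apart.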
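The encoding above keeps punctuation attached to tokens and depends on set ordering. As a minimal sketch of one way to make it reproducible, the variant below strips punctuation with a regular expression and sorts the vocabulary. The helper names `tokenize` and `one_hot` are illustrative assumptions, and plain lists stand in for NumPy arrays.

```python
import re

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "She sells seashells by the seashore.",
    "Peter Piper picked a peck of pickled peppers.",
]

def tokenize(sentence):
    # Hypothetical helper: lowercase and keep only alphabetic runs,
    # so "dog." becomes "dog".
    return re.findall(r"[a-z]+", sentence.lower())

# Sorting the vocabulary makes the indices identical on every run.
vocabulary = sorted({word for sentence in corpus for word in tokenize(sentence)})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # Hypothetical helper: a list of zeros with a single 1 at the word's index.
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

first_sentence = [one_hot(word) for word in tokenize(corpus[0])]
print(len(vocabulary))   # vocabulary size
print(one_hot("fox"))
```

The vocabulary still has 21 entries, but the indices no longer change between runs.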

Bag of words

A bag-of-words (BoW) representation counts how often each vocabulary word occurs in a document. Words that occur more often may carry more significance in the text, so the counts give a rough signal of term importance. However, BoW vectors are still sparse, and they ignore word order and the semantic meaning of words.

In [6]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
data = pd.DataFrame({
    "text": [
        "people watch ineuron",
        "ineuron watch ineuron",
        "people write comment",
        "ineuron write comment",
    ],
    "output": [1, 1, 0, 0],
})

In [13]:
BOW = CountVectorizer()
document_matrix = BOW.fit_transform(data['text'])
print(BOW.vocabulary_)
print(BOW.get_feature_names_out())

print(document_matrix.toarray())

{'people': 2, 'watch': 3, 'ineuron': 1, 'write': 4, 'comment': 0}
['comment' 'ineuron' 'people' 'watch' 'write']
[[0 1 1 1 0]
 [0 2 0 1 0]
 [1 0 1 0 1]
 [1 1 0 0 1]]
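The count matrix above can be reproduced by hand, which shows what the vectorizer is computing. This is a minimal sketch that assumes plain whitespace splitting is good enough for this toy corpus; `CountVectorizer`'s own tokenizer does more (for example, it lowercases text by default).

```python
from collections import Counter

documents = [
    "people watch ineuron",
    "ineuron watch ineuron",
    "people write comment",
    "ineuron write comment",
]

tokenized = [doc.split() for doc in documents]

# Columns follow the alphabetically sorted vocabulary.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one count per vocabulary term.
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row is one document, and the second row holds a 2 in the `ineuron` column because that word appears twice in "ineuron watch ineuron".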


In [14]:
# ngram_range=(2, 2) keeps only two-word sequences (bigrams) as features.
bigram = CountVectorizer(ngram_range=(2, 2))
bigramvocab = bigram.fit_transform(data['text'])

In [15]:
bigram.vocabulary_

{'people watch': 2,
 'watch ineuron': 4,
 'ineuron watch': 0,
 'people write': 3,
 'write comment': 5,
 'ineuron write': 1}

In [16]:
# ngram_range=(3, 3) keeps only three-word sequences (trigrams) as features.
trigram = CountVectorizer(ngram_range=(3, 3))
trigramvocab = trigram.fit_transform(data['text'])
trigram.vocabulary_

{'people watch ineuron': 2,
 'ineuron watch ineuron': 0,
 'people write comment': 3,
 'ineuron write comment': 1}

In [18]:
# ngram_range=(1, 3) combines unigrams, bigrams, and trigrams in one vocabulary.
mix = CountVectorizer(ngram_range=(1, 3))
mix_vocab = mix.fit_transform(data["text"])
mix.vocabulary_

{'people': 6,
 'watch': 11,
 'ineuron': 1,
 'people watch': 7,
 'watch ineuron': 12,
 'people watch ineuron': 8,
 'ineuron watch': 2,
 'ineuron watch ineuron': 3,
 'write': 13,
 'comment': 0,
 'people write': 9,
 'write comment': 14,
 'people write comment': 10,
 'ineuron write': 4,
 'ineuron write comment': 5}
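To see where the 15 features come from, the n-grams can be generated directly from the tokens. This is a sketch under the assumption of whitespace tokenization; the `ngrams` helper is a hypothetical name, not part of scikit-learn.

```python
documents = [
    "people watch ineuron",
    "ineuron watch ineuron",
    "people write comment",
    "ineuron write comment",
]

def ngrams(tokens, n):
    # Hypothetical helper: every run of n consecutive tokens, joined by spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Collect unigrams, bigrams, and trigrams, mirroring ngram_range=(1, 3).
features = set()
for doc in documents:
    tokens = doc.split()
    for n in range(1, 4):
        features.update(ngrams(tokens, n))

# Index the features in alphabetical order.
vocabulary = {term: i for i, term in enumerate(sorted(features))}
print(len(vocabulary))
print(vocabulary)
```

Each three-word document contributes 3 unigrams, 2 bigrams, and 1 trigram. The vocabulary grows quickly as the n-gram range widens, which makes the sparsity problem worse.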

Term Frequency (TF):
The term frequency of a word within a document represents how frequently the word appears in that document.

Inverse Document Frequency (IDF):
The inverse document frequency measures how rare a term is across the entire collection of documents. Terms that appear in fewer documents receive a higher weight.

TF-IDF multiplies a term's frequency in a document by its IDF. Like BoW, it treats each word independently and does not capture the relationships or semantic meaning between words.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
tfidf=TfidfVectorizer()
tfidf.fit_transform(data["text"]).toarray()

array([[0.        , 0.49681612, 0.61366674, 0.61366674, 0.        ],
       [0.        , 0.8508161 , 0.        , 0.52546357, 0.        ],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027],
       [0.61366674, 0.49681612, 0.        , 0.        , 0.61366674]])

In [21]:
tfidf.get_feature_names_out()

array(['comment', 'ineuron', 'people', 'watch', 'write'], dtype=object)

In [22]:
tfidf.idf_

array([1.51082562, 1.22314355, 1.51082562, 1.51082562, 1.51082562])
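The numbers above can be checked by hand. With its default settings, `TfidfVectorizer` uses a smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t, and then scales each row to unit L2 norm. The sketch below assumes those defaults and whitespace tokenization, and uses only the standard library.

```python
import math
from collections import Counter

documents = [
    "people watch ineuron",
    "ineuron watch ineuron",
    "people write comment",
    "ineuron write comment",
]

tokenized = [doc.split() for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(documents)

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Smoothed IDF (assumed to match the scikit-learn default, smooth_idf=True).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

tfidf_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    # Raw term count times IDF, then L2-normalize the row.
    row = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in row))
    tfidf_matrix.append([value / norm for value in row])

for term in vocabulary:
    print(term, round(idf[term], 8))
for row in tfidf_matrix:
    print([round(value, 8) for value in row])
```

`ineuron` appears in three of the four documents, so its IDF (about 1.223) is lower than that of the terms that appear in two (about 1.511).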
