### One Hot Encoding:
One-hot encoding is a representation of categorical variables as binary vectors. It is commonly used in machine learning and natural language processing tasks when dealing with categorical data. In this encoding scheme, each category is represented by a binary vector, where all elements are zero except for the index that corresponds to the category, which is marked with a 1.
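
As a minimal sketch of the definition above, the snippet below builds the one-hot vector for a single category in plain Python. The category list and the helper name `one_hot` are illustrative assumptions, not part of any library.

```python
# Hypothetical category list, used only for illustration
categories = ["cat", "dog", "fish"]

def one_hot(category, categories):
    """Return a binary vector with a single 1 at the category's index."""
    vector = [0] * len(categories)
    vector[categories.index(category)] = 1
    return vector

for c in categories:
    print(c, one_hot(c, categories))
```

Each vector has the same length as the category list and contains exactly one 1.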



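| Sentence | all | at | bad | not | its |
|---|---|---|---|---|---|
| its not bad at all | 1 | 1 | 1 | 1 | 1 |
| its not bad | 0 | 0 | 1 | 1 | 1 |
| its bad | 0 | 0 | 1 | 0 | 1 |
| not bad at all | 1 | 1 | 1 | 1 | 0 |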

Here the scheme is applied at the sentence level. Each unique word in the set of sentences is assigned one position in the vector, so a single word's one-hot vector contains exactly one 1, and a sentence's vector is the element-wise OR of the one-hot vectors of its words. The length of the vector equals the total number of unique words in the set.

In this case, the unique words are: "all", "at", "bad", "not", "its". The length of the one-hot vectors is 5, corresponding to these five unique words.

For each sentence, the one-hot vector is created by placing a 1 in the position that corresponds to the presence of a word in the sentence and 0 in the positions corresponding to the absence of that word. If a sentence contains multiple occurrences of a word, the corresponding position in the vector remains 1.
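
The sentence-level encoding described above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the helper name `encode_sentence` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
# Vocabulary in the same column order as the table above
vocab = ["all", "at", "bad", "not", "its"]
sentences = ["its not bad at all", "its not bad", "its bad", "not bad at all"]

def encode_sentence(sentence, vocab):
    """Mark each vocabulary word as present (1) or absent (0) in the sentence."""
    words = set(sentence.split())  # a set, so repeated words still map to a single 1
    return [1 if word in words else 0 for word in vocab]

encoded = [encode_sentence(s, vocab) for s in sentences]
for sentence, vector in zip(sentences, encoded):
    print(f"{sentence:<20} {vector}")
```

The printed vectors reproduce the rows of the table, one per sentence.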

### One Hot Encoding From Scratch

In [1]:
# import the required libraries
import pandas as pd
from nltk.tokenize import word_tokenize, sent_tokenize
# Create a corpus
corpus = 'GeeksforGeeks is not a website or a company. GeeksforGeeks is a coding environment. Thats what i love. Which is good. Not bad at all'


In [2]:
seq = set() # Initializes an empty set to collect the unique words

for word in word_tokenize(corpus):
    if word != '.': # adding words other than '.'
        seq.add(word)
        
seq = list(seq) # Converts the set to a list; set order is arbitrary, so the word indices can differ between runs

print(seq)


['is', 'environment', 'website', 'a', 'coding', 'good', 'GeeksforGeeks', 'Which', 'or', 'at', 'i', 'Thats', 'all', 'Not', 'not', 'bad', 'company', 'what', 'love']


In [3]:
data = [] # Initializes an empty list 

for sent in sent_tokenize(corpus): #Tokenizing corpus into sentences
    index = []
    for word in word_tokenize(sent): # Tokenizing sentences into word
        if(word != '.'):
            index.append(seq.index(word)) # appending the index of that word
            
    data.append(index)


In [4]:
data # assigning index to each word in each sentence

[[6, 0, 14, 3, 2, 8, 3, 16],
 [6, 0, 3, 4, 1],
 [11, 17, 10, 18],
 [7, 0, 5],
 [13, 15, 9, 12]]

In [5]:
fin = []

for indexes in data: # Iterates through each list of indices (indexes) in the data list
    enc = [0 for x in range(len(seq))] # Initializes a list enc of zeros with a length equal to the number of unique words in the vocabulary (seq)
    for index in indexes:
        # Marks the positions in the enc list with 1 for each word present in the sentence.
        enc[index] = 1 
    fin.append(enc) # Appends the binary encoding (enc) for the current sentence to the fin list.

fin

[[1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
 [1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
 [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0]]

In [6]:
df = pd.DataFrame(fin, columns = seq) # Creates a pandas DataFrame with one column per vocabulary word
df['Sentences_'] = sent_tokenize(corpus) # Add 'Sentences_' column with tokenized sentences
df.head()

Unnamed: 0,is,environment,website,a,coding,good,GeeksforGeeks,Which,or,at,i,Thats,all,Not,not,bad,company,what,love,Sentences_
0,1,0,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,GeeksforGeeks is not a website or a company.
1,1,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,GeeksforGeeks is a coding environment.
2,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,1,Thats what i love.
3,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,Which is good.
4,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,1,0,0,0,Not bad at all


### One Hot Encoding With Sklearn

In [10]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd

In [11]:
encoder=OneHotEncoder()

In [12]:
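# Each full string is treated as one category; the encoder does not split it into words
text=['i love dogs','i hate cats','i like turtles','i love the dogs']
# OneHotEncoder expects a 2D array of shape (n_samples, n_features)
text=np.reshape(text,(-1,1))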

In [16]:
encoder.fit(text)
one_hot_text=encoder.transform(text)

In [17]:
print(one_hot_text.toarray())

[[0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]]


#### One word at a time

In [20]:
text=['cat','dog','fish']
data=pd.get_dummies(text)
data

Unnamed: 0,cat,dog,fish
0,True,False,False
1,False,True,False
2,False,False,True


### Bag of Words

The bag of words (BoW) model is a simplified representation of text data used in natural language processing. It treats a document as an unordered collection of words (a multiset, or "bag"), ignoring grammar and word order and keeping only how often each word occurs. The process involves tokenization, creating a vocabulary of unique words, and generating one feature vector per document whose entries are the word counts. Although it discards word order and the relationships between words, BoW is a versatile and widely used technique for tasks like text classification and sentiment analysis.
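
The three steps named above (tokenization, vocabulary, count vectors) can be sketched compactly with `collections.Counter`. The documents here are made up for illustration, and whitespace splitting is a simple stand-in for a real tokenizer.

```python
from collections import Counter

# Made-up documents for illustration
docs = ["the cat sat on the mat", "the dog sat"]

# Step 1: tokenize (whitespace split as a simple stand-in for a real tokenizer)
tokenized = [doc.split() for doc in docs]

# Step 2: build a sorted vocabulary so the column order is reproducible
vocab = sorted({word for tokens in tokenized for word in tokens})

# Step 3: count the occurrences of each vocabulary word per document
bow = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    bow.append([counts[word] for word in vocab])

print(vocab)
for row in bow:
    print(row)
```

Sorting the vocabulary is a design choice made here so that the columns come out in the same order on every run; a plain `set` gives no such guarantee.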

In [21]:
import pandas as pd

# Initialize an empty list to store feature vectors
feature_vectors = []

# Sample text data
word_text = ["The is a very very cute cat", "The dog is very cute"]

# Tokenize the text into lists of words
tokens = [token.split() for token in word_text]

# Create a vocabulary list of unique words
vocab = list(set([word for text in tokens for word in text]))

# Iterate over each text in the tokenized list
for text in tokens:
    # Initialize a feature vector with zeros for each word in the vocabulary
    feature_vector = [0] * len(vocab)
    
    # Iterate over each word in the text
    for word in text:
        # Update the feature vector based on the count of each word in the text
        feature_vector[vocab.index(word)] = text.count(word)

    # Append the feature vector to the list of feature vectors
    feature_vectors.append(feature_vector)

# Create a DataFrame with one row per feature vector, one column per vocabulary word, and the word_text as an additional column
df = pd.DataFrame(feature_vectors, columns=vocab)
df['word_corpus'] = word_text


In [22]:
df

Unnamed: 0,is,cute,a,The,very,cat,dog,word_corpus
0,1,1,1,1,2,1,0,The is a very very cute cat
1,1,1,0,1,1,0,1,The dog is very cute


### Binary Bag of Words:
A binary bag-of-words (BoW) representation is a way to represent text data in natural language processing. In this model, each document is transformed into a binary vector, where each element indicates the presence (1) or absence (0) of a specific word from a predefined vocabulary. This representation simplifies information by focusing on word presence, making it useful for tasks like text classification and information retrieval where the emphasis is on identifying relevant terms rather than their frequency. It is closely related to one-hot encoding, where each individual word is represented by a vector with a single "1" at the index corresponding to its position in the vocabulary: the binary BoW vector of a document is the element-wise OR of the one-hot vectors of its words. Neither representation records how often a word occurs.

In [23]:
import pandas as pd

# Initialize an empty list to store feature vectors
feature_vectors = []

# Sample text data
word_text = ["The is a very very cute cat", "The dog is very cute"]

# Tokenize the text into lists of words
tokens = [token.split() for token in word_text]

# Create a vocabulary list of unique words
vocab = list(set([word for text in tokens for word in text]))

# Iterate over each text in the tokenized list
for text in tokens:
    # Initialize a feature vector with zeros for each word in the vocabulary
    feature_vector = [0] * len(vocab)
    
    # Iterate over each word in the text
    for word in text:
        # Check if the word is in the vocabulary
        if word in vocab:
            # Update the feature vector to indicate the presence of the word
            feature_vector[vocab.index(word)] = 1

    # Append the feature vector to the list of feature vectors
    feature_vectors.append(feature_vector)

# Create a DataFrame with one row per feature vector, one column per vocabulary word, and the word_text as an additional column
df = pd.DataFrame(feature_vectors, columns=vocab)
df['word_corpus'] = word_text
df

Unnamed: 0,is,cute,a,The,very,cat,dog,word_corpus
0,1,1,1,1,1,1,0,The is a very very cute cat
1,1,1,0,1,1,0,1,The dog is very cute
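
The relationship between the count-based and binary representations can be sketched as follows: clipping every count to at most 1 turns a BoW vector into its binary form. The count rows below are copied from the Bag of Words output earlier in this document; the variable names are illustrative.

```python
# Column order as shown in the outputs: is, cute, a, The, very, cat, dog
count_vectors = [
    [1, 1, 1, 1, 2, 1, 0],  # "The is a very very cute cat"
    [1, 1, 0, 1, 1, 0, 1],  # "The dog is very cute"
]

# Any positive count becomes 1; zero stays 0
binary_vectors = [[1 if count > 0 else 0 for count in row] for row in count_vectors]

for row in binary_vectors:
    print(row)
```

Only the repeated word "very" changes (2 becomes 1). For comparison, scikit-learn's `CountVectorizer` accepts a `binary=True` argument that has the same effect on its output.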


### Count Vectorizer

In [24]:
text_data = [
    "GFG is providing a new Deep Learning Course which is really good",
    "We will be studying Deep Learning from today",
    "I want a Deep sleep today"
]

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

In [27]:
vec=CountVectorizer()

In [28]:
trs=vec.fit_transform(text_data)
print(trs.toarray())

[[0 1 1 0 1 1 2 1 1 1 1 0 0 0 0 0 1 0]
 [1 0 1 1 0 0 0 1 0 0 0 0 1 1 0 1 0 1]
 [0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0]]


In [29]:
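# vocabulary_ is a dict (term -> column index) whose key order does not match the column order,
# so use get_feature_names_out() to label the columns correctly
df=pd.DataFrame(trs.toarray(),columns=vec.get_feature_names_out())

In [30]:
df

Unnamed: 0,be,course,deep,from,gfg,good,is,learning,new,providing,really,sleep,studying,today,want,we,which,will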
0,0,1,1,0,1,1,2,1,1,1,1,0,0,0,0,0,1,0
1,1,0,1,1,0,0,0,1,0,0,0,0,1,1,0,1,0,1
2,0,0,1,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0
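
To make the column order of the count matrix concrete, here is a rough pure-Python approximation of the vectorizer's default behaviour. It is a sketch under stated assumptions, not scikit-learn's implementation: it assumes lowercasing, tokens of two or more word characters, and alphabetically sorted columns, and the names `tokenize`, `first_seen`, and `columns` are hypothetical.

```python
import re

text_data = [
    "GFG is providing a new Deep Learning Course which is really good",
    "We will be studying Deep Learning from today",
    "I want a Deep sleep today",
]

def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

# Terms in order of first appearance in the corpus
first_seen = []
for doc in text_data:
    for token in tokenize(doc):
        if token not in first_seen:
            first_seen.append(token)

# Assumed column order of the count matrix: alphabetical
columns = sorted(first_seen)
vocabulary = {term: i for i, term in enumerate(columns)}  # term -> column index

print(columns)
print(vocabulary["gfg"], vocabulary["deep"])
```

The first-appearance order starts with "gfg", while the alphabetical column order starts with "be". Column labels therefore have to come from the sorted feature names, not from the order in which terms were first encountered. The single-character tokens "a" and "I" are dropped by the two-character minimum.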


#### Dealing with stopwords in Count Vectorizer

In [31]:
vec=CountVectorizer(stop_words='english')

In [32]:
trs=vec.fit_transform(text_data)
print(trs.toarray())

[[1 1 1 1 1 1 1 1 0 0 0 0]
 [0 1 0 0 1 0 0 0 0 1 1 0]
 [0 1 0 0 0 0 0 0 1 0 1 1]]


In [33]:
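# Label the columns with get_feature_names_out(); the key order of vocabulary_ does not match the column order
df=pd.DataFrame(trs.toarray(),columns=vec.get_feature_names_out())
df

Unnamed: 0,course,deep,gfg,good,learning,new,providing,really,sleep,studying,today,want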
0,1,1,1,1,1,1,1,1,0,0,0,0
1,0,1,0,0,1,0,0,0,0,1,1,0
2,0,1,0,0,0,0,0,0,1,0,1,1


#### Dealing with N-Gram in Count Vectorizer

In [34]:
vec=CountVectorizer(ngram_range=(2,3))

In [35]:
trs=vec.fit_transform(text_data)
print(trs.toarray())

[[0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 0]
 [1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 1 0 0 1 1]
 [0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0]]


In [36]:
# Label the columns with get_feature_names_out(); the key order of vocabulary_ does not match the column order
df=pd.DataFrame(trs.toarray(),columns=vec.get_feature_names_out())
df

Unnamed: 0,be studying,be studying deep,course which,course which is,deep learning,deep learning course,deep learning from,deep sleep,deep sleep today,from today,...,studying deep,studying deep learning,want deep,want deep sleep,we will,we will be,which is,which is really,will be,will be studying
0,0,0,1,1,1,1,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
1,1,1,0,0,1,0,1,0,0,1,...,1,1,0,0,1,1,0,0,1,1
2,0,0,0,0,0,0,0,1,1,0,...,0,0,1,1,0,0,0,0,0,0
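
To show what `ngram_range=(2,3)` produces for a single document, the sketch below generates word bigrams and trigrams in plain Python. The helper `ngrams` is hypothetical and only illustrates the idea. The token list assumes the third sentence after lowercasing and after dropping the single-character tokens "I" and "a".

```python
def ngrams(tokens, n_min, n_max):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    result = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result

# Third sentence, lowercased, without single-character tokens
tokens = "want deep sleep today".split()
grams = ngrams(tokens, 2, 3)
print(grams)
```

Four tokens yield three bigrams and two trigrams, which are exactly the five features that are non-zero in the last row of the matrix above.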
