# <center> <font size = 24 color = 'steelblue'> <b>One Hot Encoding

## Overview:

This notebook implements one-hot encoding for representing text data. It covers acquiring and cleaning the data, generating a vocabulary, and building a one-hot encoded matrix, then displays the matrix to show how categorical text data is transformed into a numerical format for machine learning tasks.

# <a id= 'h0'> 
<font size = 4>
    
**Table of contents:**<br>
[1. Introduction](#h1)<br>
[2. Data acquisition and cleaning](#h2)<br>
[3. Vocabulary generation](#h3)<br>
[4. Creation of the one hot encoded matrix](#h4)<br>
[5. Display one hot encoded matrix](#h5)<br>

## <a id = 'h1'>
    
<font size = 10 color = 'midnightblue'> <b> Introduction

<div class="alert alert-block alert-success">    
<font size=4> 

**One-hot encoding is one of the most common and basic methods for converting a token into a vector.**<br>

<b>The process involves:</b>
  - Assigning a unique integer index to each word.
  - Converting this integer index, denoted *i*, into a binary vector of size N, where N is the vocabulary size.
  - The vector is all zeros except for the i-th entry, which is set to 1.

<br>

<b>Example:</b><br>
In our example, the **corpus** consists of the following words:
- **apple**
- **banana**
- **cherry**

<b>Encoding Process:</b>
1. **Assign Unique Index**: 
   - **apple**: Index 0
   - **banana**: Index 1
   - **cherry**: Index 2

2. **Convert Index to Binary Vector**:
   Each index is converted into a one-hot encoded binary vector of size N (where N is the vocabulary size).

3. **Resulting Vectors**:
   - **apple**: `[1 0 0]`
   - **banana**: `[0 1 0]`
   - **cherry**: `[0 0 1]`

<b>Significance in NLP:</b>
One-hot encoding gives every discrete token a numerical form that algorithms can consume, which enables tasks such as text classification, sentiment analysis, and other NLP applications. It keeps the representation of text data simple and suitable for numerical computation.

</font>
</div>




![One-hot encoding illustration](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/NLP/one-hot-encoding.png)
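
The three steps from the introduction can be sketched in plain Python for the apple/banana/cherry corpus. This is a minimal illustration, not the notebook's own implementation; the names `word_to_index` and `one_hot` are hypothetical helpers introduced only for this example.

```python
# Toy corpus taken from the introduction's example
corpus = ["apple", "banana", "cherry"]

# Step 1: assign a unique integer index to each word
word_to_index = {word: i for i, word in enumerate(corpus)}


def one_hot(word, word_to_index):
    """Return a length-N list with a single 1 at the word's index (hypothetical helper)."""
    vector = [0] * len(word_to_index)   # N zeros, N = vocabulary size
    vector[word_to_index[word]] = 1     # set the i-th entry to 1
    return vector


# Steps 2 and 3: convert every word to its one-hot vector
for word in corpus:
    print(word, one_hot(word, word_to_index))
```

Each printed vector has exactly one non-zero entry, at the position of the word's index.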


In [8]:
import re
import pandas as pd
from string import punctuation

## <a id = 'h2'>
    
<font size = 10 color = 'midnightblue'> <b> Data acquisition and cleaning

<font size = 5 color = powderblue> <b>  Define the set of statements

In [9]:
dataset = [
    "The weather today is fantastic, with clear skies and a gentle breeze.",
    "Reading is a great way to escape reality and immerse oneself in different worlds.",
    "Climate change is a pressing global issue that requires immediate attention.",
    "Exercise is crucial for maintaining good physical and mental health.",
    "Learning a new language can be challenging but incredibly rewarding."
]

<font size = 5 color = powderblue> <b>  Remove punctuation

In [10]:
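# Character class matching any single punctuation mark; re.escape keeps
# characters such as ']' , '\' and '-' from being read as regex syntax
pat = re.compile('[{}]'.format(re.escape(punctuation)))
new_dataset = []
for s in dataset:
    # lowercase the sentence, then delete every punctuation character
    new_dataset.append(pat.sub('', s.lower()))
new_dataset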

['the weather today is fantastic with clear skies and a gentle breeze',
 'reading is a great way to escape reality and immerse oneself in different worlds',
 'climate change is a pressing global issue that requires immediate attention',
 'exercise is crucial for maintaining good physical and mental health',
 'learning a new language can be challenging but incredibly rewarding']
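
As an alternative to a regular expression, the same cleaning step could be done with `str.translate` and a deletion table from `str.maketrans`. The sketch below assumes the goal is simply to lowercase the text and drop every character in `string.punctuation`; the helper name `strip_punctuation` is hypothetical.

```python
from string import punctuation

# Translation table that maps every punctuation character to "delete"
_DELETE_PUNCTUATION = str.maketrans('', '', punctuation)


def strip_punctuation(text):
    """Lowercase the text and remove all ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(_DELETE_PUNCTUATION)


sample = "The weather today is fantastic, with clear skies and a gentle breeze."
print(strip_punctuation(sample))
```

For this sample the result matches the first cleaned sentence shown above.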

[top](#h0)

## <a id = 'h3'>  
<font size = 10 color = 'midnightblue'> <b>  Create a set of unique words as vocabulary from documents.

In [11]:
# Unique words across all sentences, sorted so the column order is stable
vocab = sorted(set(' '.join(new_dataset).split()))
print(vocab)

['a', 'and', 'attention', 'be', 'breeze', 'but', 'can', 'challenging', 'change', 'clear', 'climate', 'crucial', 'different', 'escape', 'exercise', 'fantastic', 'for', 'gentle', 'global', 'good', 'great', 'health', 'immediate', 'immerse', 'in', 'incredibly', 'is', 'issue', 'language', 'learning', 'maintaining', 'mental', 'new', 'oneself', 'physical', 'pressing', 'reading', 'reality', 'requires', 'rewarding', 'skies', 'that', 'the', 'to', 'today', 'way', 'weather', 'with', 'worlds']


In [12]:
len(vocab)

49
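
Because the vocabulary is sorted, a word's position in the list can serve as its integer index. The sketch below builds such a lookup on two shortened sentences; the `sentences` list and the name `word_to_index` are assumptions made for illustration and do not appear in the notebook.

```python
# Two shortened, already-cleaned sentences (illustrative data only)
sentences = [
    "the weather today is fantastic",
    "reading is a great way to escape reality",
]

# Same recipe as above: unique words, sorted for a stable order
vocab = sorted(set(" ".join(sentences).split()))

# Map each word to its position in the sorted vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}

print(len(vocab))
print(word_to_index["is"])
```

The word "is" occurs in both sentences but is counted once, since the vocabulary is built from a set.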

[top](#h0)

## <a id = 'h4'>    
<font size = 10 color = 'midnightblue'> <b> Creating one hot encoded matrix for each sentence.

In [13]:
d = {}
for i, sentence in enumerate(new_dataset):

    # split the cleaned sentence into its words
    s = sentence.split()

    # zero-filled DataFrame: one row per word in the sentence, one column per vocabulary word
    df = pd.DataFrame(0, columns=vocab, index=s)

    # assign 1 to the cells where the word in the sentence (index of df) matches the column name
    for word in s:
        df.loc[word, word] = 1

    # store the matrix in a dictionary keyed by sentence number
    d[f'sent{i}'] = df
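
The same matrix can also be built without a per-word loop by selecting rows of an identity matrix. This is a sketch of an alternative technique rather than the notebook's method; it assumes NumPy is available, and the small `vocab`, `sentence`, and `word_to_index` values are illustrative stand-ins.

```python
import numpy as np

# Small illustrative vocabulary and sentence (not the notebook's data)
vocab = ["a", "and", "breeze", "clear", "the"]
sentence = "the clear breeze".split()

# Integer index of each word in the sentence
word_to_index = {word: i for i, word in enumerate(vocab)}
indices = [word_to_index[word] for word in sentence]

# Row i of the identity matrix is the one-hot vector for index i,
# so indexing with the list of indices stacks one vector per word
matrix = np.eye(len(vocab), dtype=int)[indices]
print(matrix)
```

Each row of `matrix` corresponds to one word of the sentence and contains a single 1 in that word's vocabulary column.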

[top](#h0)

## <a id = 'h5'> 
<font size = 10 color = 'midnightblue'> <b>  Displaying results for one of the sentences

In [14]:
print(f"\nThe one hot encoding for the sentence : \n \"{dataset[0]}\" is :\n")
d['sent0']


The one hot encoding for the sentence : 
 "The weather today is fantastic, with clear skies and a gentle breeze." is :



Unnamed: 0,a,and,attention,be,breeze,but,can,challenging,change,clear,...,rewarding,skies,that,the,to,today,way,weather,with,worlds
the,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
weather,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
today,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
is,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fantastic,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
with,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
clear,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
skies,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
and,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
