# One-Hot Encoding

One-Hot Encoding (OHE) is a technique for encoding categorical data: it converts text or categories into binary vectors whose entries are 0 or 1.

Each unique word or category gets its own position in the vector, set to:
- 1 → present
- 0 → absent

Unlike Bag of Words, frequency does NOT matter: a word that occurs three times in a document is still encoded as 1.
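
The difference between presence and frequency can be sketched in plain Python. This is only an illustration: the vocabulary and the document below are hypothetical, chosen so that one word repeats.

```python
# Hypothetical toy vocabulary and document, chosen only for illustration.
vocabulary = ["learning", "love", "machine", "python"]
document = "python love python".split()

# Bag of Words: count how many times each vocabulary word occurs.
count_vector = [document.count(word) for word in vocabulary]

# Presence encoding: 1 if the word occurs at all, otherwise 0.
binary_vector = [1 if word in document else 0 for word in vocabulary]

print(count_vector)   # [0, 1, 0, 2]
print(binary_vector)  # [0, 1, 0, 1]
```

The two vectors differ only in the `python` column, where the count is 2 but the presence flag stays at 1.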

### One-Hot Encoding vs Bag of Words

| Feature     | One-Hot Encoding               | Bag of Words                   |
| ----------- | ------------------------------ | ------------------------------ |
| Values      | 0 / 1                          | Frequency count                |
| Repetition  | Ignored                        | Counted                        |
| Matrix size | One column per vocabulary word | One column per vocabulary word |
| Use case    | Presence detection             | Importance via frequency       |


#### Bag of Words
`CountVectorizer()`

#### One-Hot Encoding
`CountVectorizer(binary=True)`

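#### Are Binary Bag of Words and One-Hot Encoding the same?
Yes, for text encoded at the document level.

Binary Bag of Words = One-Hot Encoding (for text). Strictly speaking, a one-hot vector for a single word contains exactly one 1; the document vector used here marks every word the document contains, so it can hold several 1s.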

Both:
- Ignore frequency
- Use 0 / 1
- Capture only presence or absence
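
A short sketch of this equivalence, assuming scikit-learn is installed. The two documents are hypothetical and include a repeated word so that the count and binary outputs actually differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; the first repeats a word on purpose.
docs = ["python love python", "machine learning"]

# Plain Bag of Words keeps the frequency of each word.
counts = CountVectorizer().fit_transform(docs).toarray()

# binary=True records only whether each word is present.
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(counts)
print(binary)

# Clipping every non-zero count to 1 reproduces the binary matrix.
print(((counts > 0).astype(int) == binary).all())
```

With no repeated words in a document, both vectorizers would produce identical matrices, which is the case for the corpus used in the cells below.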

In [1]:
import re
import nltk
import pandas as pd

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')


[nltk_data] Downloading package stopwords to C:\Users\Purvi
[nltk_data]     jain\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
sentences = [
    "I love Machine Learning",
    "Machine Learning is powerful",
    "I love Python and Machine Learning"
]


In [3]:
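corpus = []
stop_words = set(stopwords.words('english'))

for s in sentences:
    # Replace every non-letter character with a space
    review = re.sub('[^a-zA-Z]', ' ', s)
    # Lowercase and split into individual words
    review = review.lower().split()
    # Drop English stop words such as "i", "is" and "and"
    review = [w for w in review if w not in stop_words]
    # Rejoin the remaining words into one cleaned string
    review = ' '.join(review)
    corpus.append(review)

corpus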


['love machine learning',
 'machine learning powerful',
 'love python machine learning']

In [4]:
cv = CountVectorizer(binary=True)
x = cv.fit_transform(corpus)


In [6]:
cv.vocabulary_

{'love': 1, 'machine': 2, 'learning': 0, 'powerful': 3, 'python': 4}
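
The indices in the dictionary above follow the alphabetical order of the words, not the order in which they first appear, which is why `'learning'` gets 0 even though `'love'` is seen first. A minimal pure-Python sketch of that assignment follows; it approximates the observed behavior and is not scikit-learn's actual implementation, and `unique_words` and `vocabulary` are illustrative names.

```python
# Cleaned corpus copied from the preprocessing output above.
corpus = [
    "love machine learning",
    "machine learning powerful",
    "love python machine learning",
]

# Collect the unique words, then sort them alphabetically.
unique_words = sorted({word for doc in corpus for word in doc.split()})

# A word's column index is its position in the sorted list.
vocabulary = {word: index for index, word in enumerate(unique_words)}
print(vocabulary)
```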

In [7]:
x.toarray()

array([[1, 1, 1, 0, 0],
       [1, 0, 1, 1, 0],
       [1, 1, 1, 0, 1]])

In [8]:
cv.get_feature_names_out()

array(['learning', 'love', 'machine', 'powerful', 'python'], dtype=object)
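
To read a row of the matrix, pair each column with its feature name and keep the names whose entry is 1. The sketch below does this in plain Python with the values copied from the outputs above; `decoded` is an illustrative name. `CountVectorizer` also provides an `inverse_transform` method that should give a comparable result.

```python
# Values copied from the outputs above.
feature_names = ["learning", "love", "machine", "powerful", "python"]
matrix = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1],
]

# For each row, keep the feature names whose column holds a 1.
decoded = [
    [name for name, flag in zip(feature_names, row) if flag == 1]
    for row in matrix
]

for words in decoded:
    print(words)
```

The recovered words come back in column order, not sentence order: the encoding records which words are present but discards their original positions.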

In [5]:
ohe_df = pd.DataFrame(
    x.toarray(),
    columns=cv.get_feature_names_out()
)

ohe_df


|   | learning | love | machine | powerful | python |
| - | -------- | ---- | ------- | -------- | ------ |
| 0 | 1        | 1    | 1       | 0        | 0      |
| 1 | 1        | 0    | 1       | 1        | 0      |
| 2 | 1        | 1    | 1       | 0        | 1      |
