# One-Hot Encoding

One-Hot Encoding is a method to represent categorical data as binary vectors.  
In Natural Language Processing (NLP), it is used to represent words or tokens in a way that machine learning models can process.

In this representation:
- Each word in the vocabulary is assigned a unique index.
- The word is represented by a binary vector of length equal to the vocabulary size.
- All vector elements are 0 except for the index corresponding to the word, which is 1.

---

## How It Works
1. **Build Vocabulary**: Extract all unique words from the corpus.
2. **Assign Index**: Assign a unique index to each word.
3. **Vector Representation**:  
   - For a vocabulary of size v, each word is represented as a vector of length v.  
   - If the word's index is i, position i (counting from 0, as in the example below) is 1 and all other positions are 0.

Example:  
Vocabulary = { "cat": 0, "dog": 1, "fish": 2 }  
- "cat" → [1, 0, 0]  
- "dog" → [0, 1, 0]  
- "fish" → [0, 0, 1]  
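
The three steps above can be sketched on a tiny text corpus. This is a minimal illustration, not a definitive implementation: the two-sentence `corpus`, the whitespace tokenization, and the helper name `one_hot` are assumptions made for the example.

```python
# Hypothetical two-sentence corpus, tokenized by splitting on whitespace (an assumption)
corpus = ["the cat sat", "the dog sat"]

# Step 1: build the vocabulary from all unique tokens
tokens = [word for sentence in corpus for word in sentence.split()]
vocabulary = sorted(set(tokens))

# Step 2: assign a unique index to each word
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Step 3: represent a word as a vector of length len(vocabulary)
def one_hot(word):
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

# A sentence becomes one row per token
sentence_matrix = [one_hot(word) for word in "the dog sat".split()]
for word, row in zip("the dog sat".split(), sentence_matrix):
    print(word, row)
```

With the vocabulary sorted alphabetically, each sentence turns into a matrix with one row per token and one column per vocabulary word.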

---

## Advantages
- Simple and easy to implement.
- Gives every word its own distinct vector and imposes no artificial ordering between words.

---

## Disadvantages
- **High Dimensionality**: For large vocabularies, vectors become very large and sparse.
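- **No Semantic Meaning**: Every pair of distinct one-hot vectors is equally far apart, so similar words (for example "cat" and "dog") are no closer in vector space than unrelated ones.
- **Inefficiency**: Costs more memory and computation in deep learning models than dense embeddings such as Word2Vec or GloVe.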
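
The first two disadvantages can be checked numerically. The sketch below reuses the cat/dog/fish vocabulary from the earlier example; the helper names `one_hot` and `cosine_similarity` and the 50,000-word vocabulary size are assumptions chosen for illustration.

```python
import math

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vocabulary = {"cat": 0, "dog": 1, "fish": 2}
vectors = {word: one_hot(i, len(vocabulary)) for word, i in vocabulary.items()}

# No semantic meaning: every pair of different words has similarity 0
similarity_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
similarity_cat_fish = cosine_similarity(vectors["cat"], vectors["fish"])
print(similarity_cat_dog, similarity_cat_fish)

# High dimensionality: with a hypothetical 50,000-word vocabulary,
# each vector holds a single 1 and 49,999 zeros
vocab_size = 50_000
zero_fraction = (vocab_size - 1) / vocab_size
print(f"Fraction of zeros per vector: {zero_fraction:.5f}")
```

Because the single 1 of each vector sits in a different position, the dot product of any two different words is 0, regardless of how related the words are.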



# Manual Implementation Without Libraries


In [1]:
data = ["Red", "Blue", "Green", "Red"]

In [2]:
# Get unique categories
unique_values = sorted(set(data))
print("Unique Categories:", unique_values)

Unique Categories: ['Blue', 'Green', 'Red']


In [3]:
# Create mapping from category to index
category_to_index = {cat: i for i, cat in enumerate(unique_values)}
print("Category to Index Mapping:", category_to_index)

Category to Index Mapping: {'Blue': 0, 'Green': 1, 'Red': 2}


In [4]:
# Convert to one-hot vectors
one_hot_result = []
for item in data:
    vector = [0] * len(unique_values)  # start with all zeros
    vector[category_to_index[item]] = 1  # put 1 at the correct position
    one_hot_result.append(vector)

In [5]:
print("\nManual One-Hot Encoding Result:")
for item, vector in zip(data, one_hot_result):
    print(f"{item} -> {vector}")


Manual One-Hot Encoding Result:
Red -> [0, 0, 1]
Blue -> [1, 0, 0]
Green -> [0, 1, 0]
Red -> [0, 0, 1]
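
The cells above can also be wrapped into small reusable helpers that decode a vector back to its category. This is a sketch under stated assumptions: the function names `fit_categories`, `encode`, and `decode` are hypothetical, and mapping an unseen category to an all-zero vector is one possible convention, not the only one.

```python
def fit_categories(values):
    """Return the sorted unique categories and a category -> index mapping."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return categories, index

def encode(value, index):
    """One-hot encode `value`; an unseen value yields all zeros (an assumption)."""
    vector = [0] * len(index)
    position = index.get(value)
    if position is not None:
        vector[position] = 1
    return vector

def decode(vector, categories):
    """Return the category at the position of the 1, or None if there is no 1."""
    return categories[vector.index(1)] if 1 in vector else None

data = ["Red", "Blue", "Green", "Red"]
categories, index = fit_categories(data)

encoded = [encode(item, index) for item in data]
decoded = [decode(vector, categories) for vector in encoded]
unseen = encode("Purple", index)  # "Purple" never appeared in data

print(encoded)
print(decoded)
print(unseen)
```

Keeping the mapping separate from the encoding step means the same mapping can be reused on new data, which is also what the fitted encoder in the Scikit-learn section provides.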


# Using Pandas

In [6]:
import pandas as pd

In [7]:
df = pd.DataFrame({'Color': ["Red", "Blue", "Green", "Red"]})

In [8]:
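# One-hot encode the 'Color' column: each category becomes its own column
# (pandas >= 2.0 returns boolean columns; pass dtype=int to get 0/1 instead)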
df_encoded = pd.get_dummies(df, columns=['Color'], prefix='Color')

In [17]:
print("Pandas One-Hot Encoding:")
print(df_encoded)

Pandas One-Hot Encoding:
   Color_Blue  Color_Green  Color_Red
0       False        False       True
1        True        False      False
2       False         True      False
3       False        False       True


# Using Scikit-learn

In [18]:
from sklearn.preprocessing import OneHotEncoder

In [19]:
df = pd.DataFrame({'Color': ["Red", "Blue", "Green", "Red"]})

In [20]:
# Create encoder
encoder = OneHotEncoder(sparse_output=False)  # sparse_output=False returns a dense NumPy array (scikit-learn >= 1.2; older versions used sparse=False)

In [21]:
# Fit and transform
encoded_array = encoder.fit_transform(df[['Color']])

In [22]:
# Convert to DataFrame
encoded_df = pd.DataFrame(encoded_array, columns=encoder.get_feature_names_out(['Color']))

In [23]:
print("Scikit-learn One-Hot Encoding:")
print(encoded_df)

Scikit-learn One-Hot Encoding:
   Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          0.0        1.0
