# One Hot Encoding

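One hot encoding is a technique for representing categorical variables as numerical values in a machine learning model. Each category becomes its own binary column, set to 1 for the rows that belong to that category and 0 otherwise.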
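As a rough illustration of the idea, the sketch below encodes a small list by hand using only the standard library. The `one_hot` helper and the `colors` list are hypothetical names made up for this example; they are not part of any library.

```python
def one_hot(values):
    """Return the sorted categories and one binary row per input value."""
    categories = sorted(set(values))  # fixed, reproducible column order
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature
categories, encoded = one_hot(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Every row contains exactly one 1, which is where the name "one hot" comes from.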

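### The advantages of using one hot encoding include:
1. It allows the use of categorical variables in models that require numerical input.
2. It can improve model performance, because each category gets its own feature and the model can learn a separate effect for each one.
3. It avoids imposing a false ordering. Encoding unordered categories as integers (e.g. "red" = 0, "green" = 1, "blue" = 2) suggests an order and a distance between them that do not exist, whereas one-hot columns treat every category as equally distinct. For variables that do have a natural ordering (e.g. "small", "medium", "large"), ordinal encoding may be the better fit.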
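The ordering point in the list above can be checked with a little arithmetic. The sketch below compares the gaps between three hypothetical colour categories under an arbitrary integer coding and under one-hot coding; the category names and integer codes are assumptions chosen purely for illustration.

```python
import math

# Arbitrary integer labels for an unordered feature (hypothetical values)
label_codes = {"red": 0, "green": 1, "blue": 2}

# The same categories as one-hot vectors
one_hot_codes = {"red": (1, 0, 0), "green": (0, 1, 0), "blue": (0, 0, 1)}

pairs = [("red", "green"), ("green", "blue"), ("red", "blue")]

# Gap between two categories under each encoding
label_gaps = {(a, b): abs(label_codes[a] - label_codes[b]) for a, b in pairs}
one_hot_gaps = {(a, b): math.dist(one_hot_codes[a], one_hot_codes[b]) for a, b in pairs}

for pair in pairs:
    print(pair, label_gaps[pair], round(one_hot_gaps[pair], 3))
```

With integer labels, "red" ends up twice as far from "blue" as from "green", purely because of how the codes were assigned. With one-hot vectors every pair is the same distance apart.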

### The disadvantages of using one hot encoding include:
1. It can lead to increased dimensionality, as a separate column is created for each category in the variable.
2. It can lead to sparse data, as most observations will have a value of 0 in most of the one-hot encoded columns.

One-hot encoding is a powerful technique for handling categorical data, but it can lead to increased dimensionality, sparsity, and overfitting. It is important to use it cautiously and to consider other methods such as ordinal encoding or binary encoding.
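To get a feel for the dimensionality and sparsity costs, the sketch below one-hot encodes a made-up column with 50 distinct values using pandas' `get_dummies` and measures how many cells are zero. The `city_…` values and the row count are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical high-cardinality feature: 1,000 rows cycling through 50 city codes
cities = pd.Series([f"city_{i % 50}" for i in range(1000)], name="city")

encoded = pd.get_dummies(cities)  # one column per distinct city
n_rows, n_cols = encoded.shape

# Each row has exactly one non-zero cell, so the rest of the row is zeros
zero_fraction = 1 - encoded.to_numpy().sum() / (n_rows * n_cols)

print(f"columns created: {n_cols}")
print(f"share of zero cells: {zero_fraction:.2%}")
```

A single input column turns into one column per category, and the share of zeros grows as the number of categories grows, since each row still holds only one 1.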

### One Hot Encoding using the scikit-learn Library
- Scikit-learn (sklearn) is a popular machine-learning library in Python that provides numerous tools for data preprocessing. It provides a OneHotEncoder class that we use to encode categorical variables as binary vectors.

In [6]:
import pandas as pd 
from sklearn.preprocessing import OneHotEncoder

# Building a dummy employee dataset for example 
data = {
    'Employee id ': [10,20,15,25,30],
    'Gender': ['M', 'F', 'F', 'M', 'F'],
    'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice'],
}

# Converting into a Pandas dataframe 
df = pd.DataFrame(data)
print(f"Employee data : \n{df}")

# Extract categorical columns from the dataframe 
# Here we extract the columns with object datatype as they are the categorical columns 
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()

# Initialize OneHotEncoder
# sparse_output=False returns a dense NumPy array (this parameter needs scikit-learn 1.2 or later)
encoder = OneHotEncoder(sparse_output=False)

# Apply one-hot encoding to the categorical columns 
one_hot_encoded = encoder.fit_transform(df[categorical_columns])

# Create a DataFrame with the one-hot encoded columns
# We use get_feature_names_out() to get the column names for the encoded data
one_hot_df = pd.DataFrame(
    one_hot_encoded,
    columns=encoder.get_feature_names_out(categorical_columns),
)

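# Concatenate the one-hot encoded dataframe with the original dataframe
# (both share the default 0..n-1 index, so the rows line up)
df_encoded = pd.concat([df, one_hot_df], axis=1)

# Drop the original categorical columns, which are now redundant
df_encoded = df_encoded.drop(categorical_columns, axis=1)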

# Display the resulting dataframe 
print(f"Encoded Employee data : \n{df_encoded}")



Employee data : 
   Employee id  Gender Remarks
0            10      M    Good
1            20      F    Nice
2            15      F    Good
3            25      M   Great
4            30      F    Nice
Encoded Employee data : 
   Employee id   Gender_F  Gender_M  Remarks_Good  Remarks_Great  Remarks_Nice
0            10       0.0       1.0           1.0            0.0           0.0
1            20       1.0       0.0           0.0            0.0           1.0
2            15       1.0       0.0           1.0            0.0           0.0
3            25       0.0       1.0           0.0            1.0           0.0
4            30       1.0       0.0           0.0            0.0           1.0


### When to use one hot encoding
One hot encoding is best suited to categorical data with few distinct values, where there is no need to capture a semantic or quantitative relationship between the categories.
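
One simple way to act on this is to count the distinct values in each categorical column before encoding. The sketch below is a minimal helper under stated assumptions: `split_by_cardinality`, its `max_categories` threshold, and the `Name` column are all hypothetical, and the right threshold depends on the dataset and the model.

```python
import pandas as pd


def split_by_cardinality(df, columns, max_categories=10):
    """Split columns into those with few distinct values and those with many."""
    few, many = [], []
    for col in columns:
        if df[col].nunique() <= max_categories:
            few.append(col)   # reasonable candidates for one-hot encoding
        else:
            many.append(col)  # consider another encoding for these
    return few, many


df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F"],
    "Remarks": ["Good", "Nice", "Good", "Great", "Nice"],
    "Name": ["Ann", "Bob", "Cara", "Dev", "Eli"],  # hypothetical, unique per row
})

# A threshold of 3 is used here only because the example data is tiny
few, many = split_by_cardinality(df, ["Gender", "Remarks", "Name"], max_categories=3)
print("one-hot candidates:", few)
print("high-cardinality columns:", many)
```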