# Feature Encoding

### 1. One-Hot Encoding

In [1]:
import pandas as pd
# Sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)
print(df)
# One-Hot Encoding
encoded_data = pd.get_dummies(df, columns=['Color'])
print(encoded_data)

   Color
0    Red
1  Green
2   Blue
3    Red
   Color_Blue  Color_Green  Color_Red
0       False        False       True
1       False         True      False
2        True        False      False
3       False        False       True


### 2. Label Encoding

`LabelEncoder` assigns integer codes in alphabetical order of the labels, so `Bird` maps to 0, `Cat` to 1, and `Dog` to 2:

  Animal  Animal_encoded
0    Dog               2
1    Cat               1
2   Bird               0
3    Dog               2
4   Bird               0

In [2]:
from sklearn.preprocessing import LabelEncoder
# Sample data
data = {'Animal': ['Dog', 'Cat', 'Bird', 'Dog', "Bird"]}
df = pd.DataFrame(data)
print(df)

# Label Encoding
label_encoder = LabelEncoder()
df['Animal_encoded'] = label_encoder.fit_transform(df['Animal'])
print(df)

  Animal
0    Dog
1    Cat
2   Bird
3    Dog
4   Bird
  Animal  Animal_encoded
0    Dog               2
1    Cat               1
2   Bird               0
3    Dog               2
4   Bird               0


### 3. Ordinal Encoding

In [3]:
from sklearn.preprocessing import OrdinalEncoder
# Sample data
data = {'Size': ['Small', 'Medium', 'Large', 'Medium']}
df = pd.DataFrame(data)
print(df)

# Ordinal Encoding
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_encoded'] = ordinal_encoder.fit_transform(df[['Size']])
print(df)

     Size
0   Small
1  Medium
2   Large
3  Medium
     Size  Size_encoded
0   Small           0.0
1  Medium           1.0
2   Large           2.0
3  Medium           1.0


### 4. Frequency Encoding

Frequency encoding replaces each category with how often it occurs in the column (here, its relative frequency), so the encoded value preserves information about the category's prevalence.

In [4]:
import pandas as pd

data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'blue', 'red']
})

frequency_encoding = data['color'].value_counts(normalize=True)
data['color_encoded'] = data['color'].map(frequency_encoding)
print(data)


   color  color_encoded
0    red       0.333333
1  green       0.333333
2   blue       0.333333
3  green       0.333333
4   blue       0.333333
5    red       0.333333


### 5. Target Encoding

Target encoding replaces each categorical value with the mean of the target variable for that category, which makes it useful for certain supervised learning tasks.

In [5]:
import pandas as pd

data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'blue', 'red'],
    'target': [1, 0, 1, 1, 0, 0]
})

# Calculate the mean of the target variable for each category
mean_encoded = data.groupby('color')['target'].mean()
data['color_encoded'] = data['color'].map(mean_encoded)
print(data)


   color  target  color_encoded
0    red       1            0.5
1  green       0            0.5
2   blue       1            0.5
3  green       1            0.5
4   blue       0            0.5
5    red       0            0.5


### 6. Hashing Encoding

Hashing encoding uses a hash function to map categories into a fixed number of numerical columns. It is efficient for high-cardinality categorical variables, though distinct categories may collide in the same column.

With `input_type='string'`, `FeatureHasher.transform` expects each sample to be an iterable of strings, so passing the column directly (`hasher.transform(data['color'])`) does not hash each category as a single token. The cell below wraps every value in a one-element list first.


In [8]:
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'blue', 'red']
})

# Convert the column to a list of lists
color_list = data['color'].apply(lambda x: [x]).tolist()

hasher = FeatureHasher(input_type='string', n_features=5)
hashed_features = hasher.transform(color_list)
hashed_data = pd.DataFrame(hashed_features.toarray(), columns=[f'feature_{i}' for i in range(hashed_features.shape[1])])
print(hashed_data)


   feature_0  feature_1  feature_2  feature_3  feature_4
0       -1.0        0.0        0.0        0.0        0.0
1        0.0        0.0        0.0        0.0        1.0
2       -1.0        0.0        0.0        0.0        0.0
3        0.0        0.0        0.0        0.0        1.0
4       -1.0        0.0        0.0        0.0        0.0
5       -1.0        0.0        0.0        0.0        0.0


# How to Do Data Feature Encoding in Python

Data feature encoding is an essential preprocessing step that involves transforming categorical data into numerical formats that machine learning models can interpret. There are several methods for feature encoding in Python, most commonly using the `pandas` and `scikit-learn` libraries. Below are the primary methods of feature encoding:

### 1. One-Hot Encoding

One-hot encoding converts categorical variables into a series of binary columns. Each category value is converted into a column, and an observation is marked with a 1 in the column corresponding to its category.

```python
import pandas as pd

data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'blue', 'red']
})

# Using pandas get_dummies
one_hot_encoded_data = pd.get_dummies(data, columns=['color'])
print(one_hot_encoded_data)
```
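By default, recent pandas versions print the indicator columns as `True`/`False`. A minimal sketch, assuming integer 0/1 columns are preferred, passes `dtype=int` to `get_dummies` and checks that every row is marked in exactly one column:

```python
import pandas as pd

data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'blue', 'red']
})

# dtype=int emits 0/1 integers instead of boolean indicator columns
one_hot_int = pd.get_dummies(data, columns=['color'], dtype=int)
print(one_hot_int)

# Each observation carries a 1 in exactly one category column
row_sums = one_hot_int.sum(axis=1)
print(row_sums.tolist())
```

The new columns are named after the original column and the category, in sorted category order.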

### 2. Label Encoding

Label encoding converts categorical values into integer values. Each unique category is assigned a unique integer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'blue', 'red']
})

label_encoder = LabelEncoder()
data['color_encoded'] = label_encoder.fit_transform(data['color'])
print(data)
```
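To see which integer was assigned to which category, the fitted encoder can be inspected. The sketch below uses the `classes_` attribute and `inverse_transform` of `LabelEncoder`; the `mapping` dictionary is only an illustrative helper, not part of scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'blue', 'red']
})

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(data['color'])

# classes_ holds the sorted labels; the position of each label is its code
mapping = {label: code for code, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels
restored = label_encoder.inverse_transform(codes)
print(list(restored))
```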

### 3. Ordinal Encoding

Ordinal encoding is similar to label encoding, but it is used when the categorical values have an inherent order or ranking.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'large', 'small']
})

ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
data['size_encoded'] = ordinal_encoder.fit_transform(data[['size']])
print(data)
```
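The `categories` argument is what carries the ranking. As a rough comparison, the sketch below encodes the same column twice: once with the default settings, where categories are inferred and sorted alphabetically, and once with the explicit order. The column names `size_default` and `size_ranked` are illustrative only:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'large', 'small']
})

# Default: categories are inferred and sorted alphabetically (large, medium, small)
default_codes = OrdinalEncoder().fit_transform(data[['size']]).ravel()

# Explicit: categories follow the intended ranking (small < medium < large)
ranked_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ranked_codes = ranked_encoder.fit_transform(data[['size']]).ravel()

# The encoder returns floats; cast to int for readability
data['size_default'] = default_codes.astype(int)
data['size_ranked'] = ranked_codes.astype(int)
print(data)
```

With the default ordering, `small` receives the largest code, which reverses the intended ranking.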

### 4. Frequency Encoding

Frequency encoding replaces each category with the frequency of its occurrence.

```python
import pandas as pd

data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'blue', 'red']
})

frequency_encoding = data['color'].value_counts(normalize=True)
data['color_encoded'] = data['color'].map(frequency_encoding)
print(data)
```
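In the example above every color appears twice, so all rows receive the same value. The following sketch uses hypothetical data with unequal counts to make the effect visible, and also shows absolute counts as an alternative to relative frequencies:

```python
import pandas as pd

# Hypothetical data with unequal category counts: red x3, blue x2, green x1
data = pd.DataFrame({
    'color': ['red', 'red', 'red', 'green', 'blue', 'blue']
})

# Relative frequency (share of rows) per category
relative_freq = data['color'].value_counts(normalize=True)
data['color_freq'] = data['color'].map(relative_freq)

# Absolute count per category, an alternative when raw counts matter
absolute_count = data['color'].value_counts()
data['color_count'] = data['color'].map(absolute_count)
print(data)
```

Note that categories with the same frequency become indistinguishable after this encoding.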

### 5. Target Encoding

Target encoding involves replacing a categorical value with the mean of the target variable for that category.

```python
import pandas as pd

data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'blue', 'red'],
    'target': [1, 0, 1, 1, 0, 0]
})

# Calculate the mean of the target variable for each category
mean_encoded = data.groupby('color')['target'].mean()
data['color_encoded'] = data['color'].map(mean_encoded)
print(data)
```
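Because the encoding is derived from the target, one common precaution is to learn the category means on training rows only and then apply them to other rows. The sketch below assumes that setup; the `train` and `new_rows` frames are hypothetical, and falling back to the global mean for unseen categories is one possible choice rather than a fixed rule:

```python
import pandas as pd

# Hypothetical training rows
train = pd.DataFrame({
    'color': ['red', 'red', 'green', 'green', 'blue', 'blue'],
    'target': [1, 1, 0, 1, 0, 0]
})
# Hypothetical new rows; 'purple' never appears in the training data
new_rows = pd.DataFrame({'color': ['red', 'blue', 'purple']})

# Learn the per-category means from the training rows only
category_means = train.groupby('color')['target'].mean()
global_mean = train['target'].mean()

# Unseen categories map to NaN, so fall back to the global mean (an assumption)
new_rows['color_encoded'] = new_rows['color'].map(category_means).fillna(global_mean)
print(new_rows)
```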

### 6. Hashing Encoding

Hashing encoding uses a hash function to convert categories into numerical values. This method is useful for high-cardinality categorical variables.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'blue', 'red']
})

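hasher = FeatureHasher(input_type='string', n_features=5)
# Each sample must be an iterable of strings, so wrap every value in a list
hashed_features = hasher.transform(data['color'].apply(lambda x: [x]).tolist())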
hashed_data = pd.DataFrame(hashed_features.toarray(), columns=[f'feature_{i}' for i in range(hashed_features.shape[1])])
print(hashed_data)
```
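Since the number of output columns is fixed, distinct categories can land in the same column. A minimal sketch, using hypothetical `city_*` labels, hashes 10 categories into 4 columns so that some collisions are unavoidable:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Hypothetical high-cardinality column: 10 distinct categories
categories = [f'city_{i}' for i in range(10)]

# Each sample must be an iterable of strings, so wrap every value in a list
samples = [[value] for value in categories]

# Only 4 output columns, so some of the 10 categories must share a column
hasher = FeatureHasher(input_type='string', n_features=4)
hashed = hasher.transform(samples).toarray()

hashed_data = pd.DataFrame(
    hashed,
    index=categories,
    columns=[f'feature_{i}' for i in range(hashed.shape[1])]
)
print(hashed_data)

# Column index that each category was hashed into
bucket = hashed_data.abs().to_numpy().argmax(axis=1)
print(pd.Series(bucket, index=categories).value_counts())
```

Increasing `n_features` lowers the chance of collisions at the cost of a wider feature matrix.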

### Example Code

Here’s a comprehensive example that demonstrates several encoding techniques on a sample dataset:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Sample Data
data = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'blue', 'red'],
    'size': ['small', 'medium', 'large', 'medium', 'large', 'small'],
    'target': [1, 0, 1, 1, 0, 0]
})

# One-Hot Encoding
one_hot_encoded = pd.get_dummies(data, columns=['color'])
print("One-Hot Encoded Data:\n", one_hot_encoded)

# Label Encoding
label_encoder = LabelEncoder()
data['color_label_encoded'] = label_encoder.fit_transform(data['color'])
print("Label Encoded Data:\n", data)

# Ordinal Encoding
ordinal_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
data['size_ordinal_encoded'] = ordinal_encoder.fit_transform(data[['size']])
print("Ordinal Encoded Data:\n", data)

# Frequency Encoding
frequency_encoding = data['color'].value_counts(normalize=True)
data['color_frequency_encoded'] = data['color'].map(frequency_encoding)
print("Frequency Encoded Data:\n", data)

# Target Encoding
mean_encoded = data.groupby('color')['target'].mean()
data['color_target_encoded'] = data['color'].map(mean_encoded)
print("Target Encoded Data:\n", data)
```

### Summary

- **One-Hot Encoding:** Useful for nominal categorical variables without an inherent order.
- **Label Encoding:** Simple and quick, but can impose an unintended ordinal relationship.
- **Ordinal Encoding:** Best for ordinal categorical variables with a clear order.
- **Frequency Encoding:** Uses category frequencies to encode, preserving information about the category's prevalence.
- **Target Encoding:** Encodes based on the target variable's mean value per category, useful for certain supervised learning tasks.
- **Hashing Encoding:** Efficient for high-cardinality features, though it may introduce collisions.

Choose the encoding technique based on your data characteristics and the requirements of your machine learning model.