#### Categorical Encoders

#### Target Encoding

Target encoding replaces each category of a categorical variable with an aggregate of the target variable (the variable you are trying to predict) computed over the rows in that category, most commonly the mean. It is particularly useful for high-cardinality categorical features, where one-hot encoding would create a large number of columns. Because the encoding is derived from the target, the encoder should be fit on the training data only, as the example below does, so that test labels do not leak into the features.
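
To make the idea concrete, here is a minimal sketch of mean target encoding written by hand with pandas. It uses the same toy `City`/`Purchase` data as the example that follows; the names `city_means` and `City_encoded` are introduced only for illustration. It computes plain per-category means on the whole frame, which is a simplification: in practice the means would be computed on the training split only, and library encoders such as `TargetEncoder` typically smooth each category mean toward the global mean, so their output can differ from these raw values.

```python
import pandas as pd

# Toy data (same values as the example that follows)
df = pd.DataFrame({
    'City': ['A', 'B', 'A', 'B', 'C', 'A', 'C', 'B', 'C', 'A'],
    'Purchase': [1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
})

# Mean of the target for each category (no smoothing; illustrative only)
city_means = df.groupby('City')['Purchase'].mean()

# Replace each category with the mean of the target for that category
df['City_encoded'] = df['City'].map(city_means)

print(city_means)
print(df)
```

Here every row for city `A` has `Purchase == 1`, so `A` maps to 1.0, while `C` (all zeros) maps to 0.0 and `B` falls in between.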

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from category_encoders import TargetEncoder

# Generate a synthetic dataset
data = {
    'City': ['A', 'B', 'A', 'B', 'C', 'A', 'C', 'B', 'C', 'A'],
    'Purchase': [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)

# Separate features and target variable
X = df[['City']]
y = df['Purchase']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply target encoding to the training set and transform the testing set
encoder = TargetEncoder()
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)

# Train a Logistic Regression Model on the encoded training set
model = LogisticRegression()
model.fit(X_train_encoded, y_train)

# Make predictions on the encoded testing set
y_pred = model.predict(X_test_encoded)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.0


#### One-Hot Encoding

One-hot encoding is a technique used to convert categorical variables into a binary matrix (1s and 0s) to be used as input for machine learning algorithms. This is necessary because many machine learning models require numerical input, and one-hot encoding is a way to represent categorical variables numerically without introducing ordinal relationships between categories.
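
As a quick illustration of what the binary matrix looks like, the sketch below one-hot encodes a small `Color` column with `pandas.get_dummies`. This is a different helper from the scikit-learn `OneHotEncoder` used in the example that follows, chosen here only because it prints a readable table; the frame `colors` and the variable `one_hot` are illustrative names.

```python
import pandas as pd

# A tiny categorical column with three distinct values
colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red']})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(colors, columns=['Color'], dtype=int)

print(one_hot)
```

Each row contains a single 1 in the column for its category and 0 elsewhere, so no category is treated as larger or smaller than another.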



In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score

# Create a sample dataset
data = {
    'Color': ['Red', 'Green', 'Blue', 'Red', 'Green', 'Blue', 'Green', 'Red', 'Blue', 'Green'],
    'Outcome': [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)

# Separate features and target variable
X = df[['Color']]
y = df['Outcome']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply one-hot encoding to the training set and transform the testing set
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop the first category's column to avoid perfect multicollinearity
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

# Train a logistic regression model on the encoded training set
model = LogisticRegression()
model.fit(X_train_encoded, y_train)

# Make predictions on the encoded testing set
y_pred = model.predict(X_test_encoded)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.0
