What is Feature Engineering?

Feature engineering is a crucial step in the data pre-processing phase of machine learning, where data scientists and machine learning engineers create new features or modify existing ones to improve the performance of machine learning models. The goal of feature engineering is to provide the model with informative, non-redundant, and interpretable data that captures the underlying structure of the dataset. This process can significantly enhance model accuracy and performance by leveraging domain knowledge and mathematical transformations.

Key Aspects of Feature Engineering Include:

1. Creation of New Features: Involves generating new features from the existing data, which might be more relevant to the prediction task. This could include combining two or more features, extracting parts of a date-time stamp (like the day of the week, month, or year), or creating interaction terms that capture the relationship between different variables.

2. Feature Transformation: Applying transformations to features to change their distribution or scale. Common transformations include normalization, standardization, log transformation, and power transformations. These are especially important for algorithms that assume normally distributed data and for algorithms that are sensitive to feature scale, such as k-nearest neighbors (KNN) and gradient descent-based algorithms.

3. Feature Selection: Identifying the most relevant features to use in model training. This involves removing irrelevant, redundant, or noisy data that can detract from model performance. Techniques for feature selection include filter methods, wrapper methods, and embedded methods.

4. Feature Extraction: Techniques like Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and t-distributed Stochastic Neighbor Embedding (t-SNE) reduce the number of features in a dataset while retaining as much useful structure as possible: PCA keeps the directions of greatest variance, LDA keeps the directions that best separate the classes, and t-SNE preserves local neighborhoods, mainly for visualization. This is particularly useful for high-dimensional data.

5. Handling Missing Values: Developing strategies for dealing with missing data, such as imputation (filling in missing values with the mean, median, mode, or using more complex algorithms), or creating binary indicators that signal whether data was missing.

6. Encoding Categorical Variables: Converting categorical variables into a numerical form that ML models can consume. Techniques include one-hot encoding, label encoding, and target encoding.

7. Working with Different Modalities: Feature engineering also includes applying the above techniques to different data modalities, such as temporal, textual, and geospatial data.

Feature engineering is often considered more of an art than a science, requiring creativity, intuition, and domain knowledge. The quality and relevance of the features used can often make a more significant difference in the performance of a machine learning model than the choice of model itself. It enables models to learn better from the data, leading to more accurate predictions.
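
A few of the aspects listed above can be sketched on a toy table. This is a minimal illustration, not a recipe: the `orders` DataFrame and its column names are made up for this example, and median imputation is just one of the strategies mentioned.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration
orders = pd.DataFrame({
    "order_time": pd.to_datetime(["2024-01-05 09:30", "2024-01-06 18:45", "2024-01-08 12:00"]),
    "price": [120.0, 80.0, np.nan],
    "quantity": [2, 5, 1],
})

# Creation: extract parts of a timestamp
orders["day_of_week"] = orders["order_time"].dt.dayofweek  # Monday=0 ... Sunday=6
orders["month"] = orders["order_time"].dt.month

# Missing values: keep a binary indicator, then impute with the median
orders["price_missing"] = orders["price"].isna().astype(int)
orders["price"] = orders["price"].fillna(orders["price"].median())

# Creation: an interaction term that combines two features
orders["revenue"] = orders["price"] * orders["quantity"]

# Transformation: log1p compresses a right-skewed scale
orders["log_revenue"] = np.log1p(orders["revenue"])

print(orders)
```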

First is **Feature Encoding**, a crucial step in machine learning that transforms categorical data into a numerical format. This is essential because most machine learning algorithms require numerical input. By encoding categorical features, we convert them into a representation that preserves their information while making them compatible with these algorithms. Common encoding techniques include ordinal encoding, one-hot encoding, target encoding, and frequency encoding.

Second is **Discretization**, in which we take continuous numerical data and convert it into categorical form by dividing it into bins. This process is also called **Binning**.

Third, we will also learn to encode the target feature or output column using techniques like Label Encoding, Label Binarizer, and MultiLabel Binarizer.
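
As a small sketch of the binning idea mentioned above, `pd.cut` can turn a numeric column into labelled intervals. The bin edges and labels here are illustrative assumptions, not values used elsewhere in this notebook.

```python
import pandas as pd

ages = pd.Series([5, 17, 30, 64, 80])

# Intervals are right-closed by default: (0, 18], (18, 65], (65, 120]
age_group = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

print(age_group.tolist())
```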

Machine learning is all about data, which we broadly classify into two forms: numerical data and categorical data. Numerical data includes features like age, weight, and marks, while categorical data includes features like gender and state. Categorical data comes in two types:

1. **Ordinal data:** This is categorical data with an intrinsic order, like feedback ratings (poor, good, excellent). If there's an order in categorical data, we call it ordinal data.
2. **Nominal data:** This is categorical data without an intrinsic order, like gender.
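
The ordinal/nominal distinction can also be expressed directly in pandas. The sketch below is only an illustration with made-up values: an ordered categorical dtype carries the ranking, while a plain categorical does not.

```python
import pandas as pd

rating = pd.Series(["good", "poor", "excellent", "good"])

# Ordinal: declare the order explicitly so codes and comparisons respect it
rating_type = pd.CategoricalDtype(categories=["poor", "good", "excellent"], ordered=True)
rating_ordered = rating.astype(rating_type)
rating_codes = rating_ordered.cat.codes.tolist()   # position of each value in the declared order
above_poor = (rating_ordered > "poor").tolist()    # comparisons are allowed for ordered data

# Nominal: categories are only distinct labels; no order is defined
state = pd.Series(["Delhi", "Goa", "Delhi"]).astype("category")
is_ordered = state.cat.ordered

print(rating_codes, above_poor, is_ordered)
```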

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('/content/customer.csv').drop(columns=['age','gender'])

In [None]:
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


### 1. Ordinal Encoding

Ordinal encoding is a data preprocessing technique used in machine learning to convert categorical variables that have a natural, ordered relationship into a numerical format. This method assigns a unique integer to each category based on its rank or position in that order.

**How Ordinal Encoding Works**

The core principle of ordinal encoding is to replace categorical labels with integers that preserve the inherent sequence of the data. For this to be effective, the variable must be ordinal, meaning its categories have a logical progression.

For instance, consider a variable for clothing size. The categories "Small," "Medium," and "Large" have a clear order. Ordinal encoding would convert them as follows:

- Small → 0

- Medium → 1

- Large → 2

This numerical representation allows machine learning algorithms, which primarily operate on numbers, to interpret the hierarchical nature of the feature.

**When to Use Ordinal Encoding**

This technique is most appropriate under specific circumstances:

- Presence of Inherent Order: It should only be used for categorical features where the categories have a meaningful, ranked relationship. Examples include educational levels ("High School," "Bachelor's," "Master's"), customer satisfaction ratings ("Dissatisfied," "Neutral," "Satisfied"), or economic status ("Low," "Medium," "High").

- Tree-Based Models: Algorithms like Decision Trees, Random Forests, and Gradient Boosting are well-suited for ordinally encoded data. These models can effectively use the ordered nature of the integers to make splits and decisions.

**Key Assumptions and Limitations**

While useful, ordinal encoding operates on a crucial assumption that can also be its main limitation:

- Assumption of Equal Intervals: The technique implicitly assumes that the numerical distance between each category is equal. For example, in the "Small, Medium, Large" mapping (0, 1, 2), a model might interpret the difference between "Small" and "Medium" as being the same as the difference between "Medium" and "Large." This may not be true in reality and can mislead certain types of models, such as linear models or Support Vector Machines, which are sensitive to the magnitude of feature values.

- Not for Nominal Data: It is unsuitable for nominal categorical variables, where no intrinsic order exists (e.g., colors like "Red," "Green," "Blue"). Applying ordinal encoding to such data would introduce a false and misleading order, likely harming the model's performance. For nominal data, techniques like one-hot encoding are preferred.
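
One practical consequence is worth a quick sketch: if no order is supplied, scikit-learn's `OrdinalEncoder` sorts the categories, which for strings means alphabetical order. The example below reuses the clothing sizes from above; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

sizes = np.array([["Small"], ["Medium"], ["Large"]], dtype=object)

# Default: categories are sorted, so Large=0, Medium=1, Small=2
default_codes = np.asarray(OrdinalEncoder().fit_transform(sizes)).ravel().tolist()

# Explicit order: Small=0, Medium=1, Large=2
ordered = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordered_codes = np.asarray(ordered.fit_transform(sizes)).ravel().tolist()

print(default_codes)
print(ordered_codes)
```

Passing `categories=` explicitly, as the cells below do, is what makes the integers follow the intended ranking.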

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:2], df.iloc[:,-1], test_size=0.2)

In [None]:
X_train

Unnamed: 0,review,education
34,Average,School
18,Good,School
11,Good,UG
9,Good,UG
48,Good,UG
47,Good,PG
4,Average,UG
22,Poor,PG
7,Poor,School
39,Poor,PG


In [None]:
# specify order
oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']])

In [None]:
X_train = oe.fit_transform(X_train)
X_test = oe.transform(X_test)

In [None]:
X_train

Unnamed: 0,review,education
34,1.0,0.0
18,2.0,0.0
11,2.0,1.0
9,2.0,1.0
48,2.0,1.0
47,2.0,2.0
4,1.0,1.0
22,0.0,2.0
7,0.0,0.0
39,0.0,2.0


In [None]:
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]

In [None]:
oe.feature_names_in_

array(['review', 'education'], dtype=object)

In [None]:
oe.n_features_in_

2

In [None]:
oe.inverse_transform(np.array([0,2]).reshape(1,2))

array([['Poor', 'PG']], dtype=object)

In [None]:
oe.get_feature_names_out()

array(['review', 'education'], dtype=object)

In [None]:
# set unknown value
oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']],
                    handle_unknown='use_encoded_value',
                    unknown_value=-1)
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:2], df.iloc[:,-1], test_size=0.2)
X_train = oe.fit_transform(X_train)
oe.transform(np.array(['Poor','college']).reshape(1,2))



Unnamed: 0,review,education
0,0.0,-1.0


Infrequent categories, often referred to as "rare categories," are categories within a categorical variable that appear very seldom in the dataset. These categories are characterized by having a low frequency or count compared to other categories within the same feature.

How to handle:

- Aggregation: Combining rare categories into a single "Other" category to reduce the feature's cardinality and simplify the model.

- Encoding with Special Treatment: Using encoding techniques that specifically account for the rarity of categories, such as setting a `min_frequency` or `max_categories` threshold in `OrdinalEncoder`, or employing target encoding where the influence of rare categories is mitigated.

- Exclusion: In some cases, particularly when a category is extremely rare, it might be justified to exclude those data points from the analysis if it's believed they do not add value or could introduce noise.
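
The aggregation option can be sketched with plain pandas. The threshold `min_count` is a hypothetical value chosen only for this example; in practice it depends on the dataset.

```python
import pandas as pd

animals = pd.Series(["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3 + ["horse"] * 2)

min_count = 4  # hypothetical threshold for what counts as "rare"
counts = animals.value_counts()
rare = counts[counts < min_count].index

# Replace every rare category with a single "Other" label
grouped = animals.where(~animals.isin(rare), "Other")

print(grouped.value_counts().to_dict())
```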

In [None]:
# handling infrequent categories
X = np.array([['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 +['snake'] * 3 + ['horse'] * 2], dtype=object).T
X

array([['dog'],
       ['dog'],
       ['dog'],
       ['dog'],
       ['dog'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['snake'],
       ['snake'],
       ['snake'],
       ['horse'],
       ['horse']], dtype=object)

### `max_categories` in OrdinalEncoder (with grouping)

- **Purpose**: Limits the number of categories encoded for a column.
- **How it works**:
  - If a column has **more unique categories than `max_categories`**, the encoder keeps only the **`max_categories - 1` most frequent categories**.
  - All other, less frequent categories are **grouped into a single "other" (infrequent) category**, which counts toward the limit.
- **Example**:
  - Column has categories `A, B, C, D, E`, listed from most to least frequent.
  - `max_categories=3` → the encoder keeps the top 2 categories (`A`, `B`) as-is.
  - All remaining categories (`C, D, E`) are combined into **one "other" category**, giving 3 output codes in total.
- **Benefit**: Reduces cardinality and keeps each rare category from getting its own code.

In [None]:
enc = OrdinalEncoder(max_categories=3).fit(X)

In [None]:
enc.infrequent_categories_

[array(['dog', 'horse', 'snake'], dtype=object)]

In [None]:
enc.transform(np.array([['cat','rabbit','snake','dog']]).reshape(4,1))

Unnamed: 0,x0
0,0.0
1,1.0
2,2.0
3,2.0


### `min_frequency` in OrdinalEncoder

- **Purpose**: Groups together *rare categories* based on how often they appear in the data.
- **How it works**:
  - Instead of specifying a fixed number of top categories (like `max_categories`),
    you set a **minimum occurrence threshold**.
  - Any category that appears **less than the given threshold** is combined into a single
    **"other" category**.
- **Parameter options**:
  - An **integer** (e.g., `6`) → categories with counts **≥ 6** are kept; all others grouped as “other”.
  - A **float** between `0` and `1` → treated as a fraction of the total sample size
    (e.g., `0.05` keeps categories that occur in at least 5% of the rows).
- **Benefit**: Automatically handles rare or noisy categories without manually counting them.




In [None]:
enc = OrdinalEncoder(min_frequency=4).fit(X)
enc.infrequent_categories_

[array(['horse', 'snake'], dtype=object)]

In [None]:
enc.transform(np.array([['cat','rabbit','snake','dog','horse']]).reshape(5,1))

Unnamed: 0,x0
0,0.0
1,2.0
2,3.0
3,1.0
4,3.0


In [None]:
# handling missing data

# Example categorical data with missing values
data = [['Cat'], [np.nan], ['Dog'], ['Fish'], [np.nan]]

# Setting encoded_missing_value to -1, indicating we want missing values to be encoded as -1
encoder = OrdinalEncoder(encoded_missing_value=-1)

encoded_data = encoder.fit_transform(data)

print(encoded_data)

    x0
0  0.0
1 -1.0
2  1.0
3  2.0
4 -1.0


### Label Encoder

- **Purpose**: Encodes the **target/output feature** in classification problems.
- **Key Points**:
  - Not used for input features; it is **target/label encoding**, not feature encoding.
  - Converts categorical labels into **integer values**.
  - Useful when the output column contains categories like `['cat', 'dog', 'mouse']`.
- **Example Mapping** (classes are sorted, so the integers follow alphabetical order, not any meaningful rank):

| Original Label | Encoded Value |
|----------------|---------------|
| cat            | 0             |
| dog            | 1             |
| mouse          | 2             |




In [None]:
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:2], df.iloc[:,-1], test_size=0.2)

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

In [None]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [None]:
le.inverse_transform(np.array([1,1,0]))

array(['Yes', 'Yes', 'No'], dtype=object)

### 2. OneHotEncoder

- **Purpose**: Used for **categorical input features** that are **nominal**, i.e., have **no intrinsic order**.  
  Examples: `brand` = ['Samsung', 'Apple', 'OnePlus']

- **Why not use Ordinal Encoding here?**
  - Ordinal encoding assigns integer values based on order.
  - For nominal data, integers can mislead algorithms into thinking some categories are "greater" or "more important."
  - Some algorithms may give **unintended weight** to certain categories.

- **How OneHotEncoder works**:
  - For each category in a feature, a **new binary column** is created.
  - If a row has that category, the column gets `1`; otherwise, `0`.
  - Example:

| brand      | brand_Samsung | brand_Apple | brand_OnePlus |
|------------|---------------|-------------|---------------|
| Samsung    | 1             | 0           | 0             |
| Apple      | 0             | 1           | 0             |
| OnePlus    | 0             | 0           | 1             |

- **Key Points**:
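  - If a feature has `n` categories, it creates **n new features** (or `n - 1` with `drop='first'`, which avoids the dummy variable trap).
  - By default, scikit-learn returns a **sparse matrix**, since most entries are zeros; pass `sparse_output=False` to get a dense array.
  - Called **"one-hot"** because, for each encoded feature, exactly **one column is 1** in every row and the rest are 0.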




In [None]:
cars = pd.read_csv('cars.csv').drop(columns=['km_driven','owner'])
cars.head()

Unnamed: 0,brand,fuel,selling_price
0,Maruti,Diesel,450000
1,Skoda,Diesel,370000
2,Honda,Petrol,158000
3,Hyundai,Diesel,225000
4,Maruti,Petrol,130000


In [None]:
cars.isnull().sum()

Unnamed: 0,0
brand,0
fuel,0
selling_price,0


In [None]:
X = cars.iloc[:,0:2]
y = cars.iloc[:,-1]

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
X_train

Unnamed: 0,brand,fuel
6518,Tata,Petrol
6144,Honda,Petrol
6381,Hyundai,Diesel
438,Maruti,Diesel
5939,Maruti,Petrol
...,...,...
5226,Mahindra,Diesel
5390,Maruti,Diesel
860,Hyundai,Petrol
7603,Maruti,Diesel


In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
ohe = OneHotEncoder()

In [None]:
ohe = OneHotEncoder(sparse_output=False)

In [None]:
X_train = ohe.fit_transform(X_train)
X_train

Unnamed: 0,brand_Ambassador,brand_Ashok,brand_Audi,brand_BMW,brand_Chevrolet,brand_Daewoo,brand_Datsun,brand_Fiat,brand_Force,brand_Ford,...,brand_Renault,brand_Skoda,brand_Tata,brand_Toyota,brand_Volkswagen,brand_Volvo,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol
6518,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5226,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5390,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
860,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
7603,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [None]:
X_train.shape

(6502, 36)

In [None]:
ohe.categories_

[array(['Ambassador', 'Ashok', 'Audi', 'BMW', 'Chevrolet', 'Daewoo',
        'Datsun', 'Fiat', 'Force', 'Ford', 'Honda', 'Hyundai', 'Isuzu',
        'Jaguar', 'Jeep', 'Kia', 'Land', 'Lexus', 'MG', 'Mahindra',
        'Maruti', 'Mercedes-Benz', 'Mitsubishi', 'Nissan', 'Opel',
        'Peugeot', 'Renault', 'Skoda', 'Tata', 'Toyota', 'Volkswagen',
        'Volvo'], dtype=object),
 array(['CNG', 'Diesel', 'LPG', 'Petrol'], dtype=object)]

In [None]:
ohe.feature_names_in_

array(['brand', 'fuel'], dtype=object)

In [None]:
ohe.n_features_in_

2

In [None]:
ohe.get_feature_names_out()

array(['brand_Ambassador', 'brand_Ashok', 'brand_Audi', 'brand_BMW',
       'brand_Chevrolet', 'brand_Daewoo', 'brand_Datsun', 'brand_Fiat',
       'brand_Force', 'brand_Ford', 'brand_Honda', 'brand_Hyundai',
       'brand_Isuzu', 'brand_Jaguar', 'brand_Jeep', 'brand_Kia',
       'brand_Land', 'brand_Lexus', 'brand_MG', 'brand_Mahindra',
       'brand_Maruti', 'brand_Mercedes-Benz', 'brand_Mitsubishi',
       'brand_Nissan', 'brand_Opel', 'brand_Peugeot', 'brand_Renault',
       'brand_Skoda', 'brand_Tata', 'brand_Toyota', 'brand_Volkswagen',
       'brand_Volvo', 'fuel_CNG', 'fuel_Diesel', 'fuel_LPG',
       'fuel_Petrol'], dtype=object)

In [None]:
pd.DataFrame(data = X_train,columns = ohe.get_feature_names_out())

Unnamed: 0,brand_Ambassador,brand_Ashok,brand_Audi,brand_BMW,brand_Chevrolet,brand_Daewoo,brand_Datsun,brand_Fiat,brand_Force,brand_Ford,...,brand_Renault,brand_Skoda,brand_Tata,brand_Toyota,brand_Volkswagen,brand_Volvo,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol
6518,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5226,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5390,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
860,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
7603,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [None]:
ohe.inverse_transform(np.array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0. , 0. , 1.]).reshape(1,36))

array([['Mercedes-Benz', 'Petrol']], dtype=object)

# The Dummy Variable Trap

The **Dummy Variable Trap** is a scenario where the features created by one-hot encoding are linearly dependent: any one dummy column can be computed exactly from the others. This is known as **perfect multicollinearity**, and it is a problem for models that estimate one coefficient per feature and need those features to be linearly independent, such as unregularized Linear and Logistic Regression.

---

## Why Does It Happen?

When we one-hot encode a categorical feature with '$k$' categories, we create '$k$' new binary columns (dummy variables). The trap occurs because the information in these '$k$' columns is redundant—the value of any one column can be perfectly predicted from the values of the others.

For example, if we have a feature "Season" with four categories (Winter, Spring, Summer, Fall), we create four dummy variables.

| Season_Winter | Season_Spring | Season_Summer | Season_Fall |
| :-----------: | :-----------: | :-----------: | :-----------: |
|       1       |       0       |       0       |       0       |
|       0       |       1       |       0       |       0       |
|       0       |       0       |       1       |       0       |
|       0       |       0       |       0       |       1       |

If a row is not Winter, Spring, or Summer, it *must* be Fall. This means:
$$\text{Season\_Winter} + \text{Season\_Spring} + \text{Season\_Summer} + \text{Season\_Fall} = 1$$

This perfect relationship confuses linear models because they can't determine the independent influence of each category on the outcome.

---

## The Solution: Drop One Column

The solution is simple: for a feature with '$k$' categories, we only use **$k-1$** dummy variables.

We drop one of the columns. The dropped category becomes the **baseline** or **reference category**. Its effect is captured by the model's intercept, and it is represented by a row where all the other dummy variables for that feature are zero.

**Example (Dropping `Season_Fall`):**

| Original | Season_Winter | Season_Spring | Season_Summer | Interpretation                       |
| :------: | :-----------: | :-----------: | :-----------: | ------------------------------------ |
|  Winter  |       1       |       0       |       0       | It is Winter.                        |
|  Spring  |       0       |       1       |       0       | It is Spring.                        |
|  Summer  |       0       |       0       |       1       | It is Summer.                        |
|   Fall   |     **0** |     **0** |     **0** | It is not Winter, Spring, or Summer. |
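
The redundancy can be seen numerically by comparing matrix ranks. This is a minimal sketch with a made-up list of seasons: with an intercept column plus all four dummies, one column is redundant; after dropping `Season_Fall`, every column is needed.

```python
import numpy as np

seasons = np.array(["Winter", "Spring", "Summer", "Fall", "Winter", "Summer"])
levels = ["Winter", "Spring", "Summer", "Fall"]

# One dummy column per season (k = 4)
dummies = np.column_stack([(seasons == s).astype(float) for s in levels])
intercept = np.ones((len(seasons), 1))

full = np.hstack([intercept, dummies])             # 5 columns
reduced = np.hstack([intercept, dummies[:, :-1]])  # Season_Fall dropped, 4 columns

rank_full = np.linalg.matrix_rank(full)        # smaller than the number of columns
rank_reduced = np.linalg.matrix_rank(reduced)  # equal to the number of columns

print(full.shape[1], rank_full)
print(reduced.shape[1], rank_reduced)
```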

---

## When Should You Care?

You primarily need to worry about the dummy variable trap when using models that are sensitive to multicollinearity.

* **Affected Models:**
    * Linear Regression
    * Logistic Regression
    * Linear Discriminant Analysis (LDA)

* **Unaffected Models:**
    * Decision Trees
    * Random Forest
    * XGBoost / Gradient Boosting
    * K-Nearest Neighbors (KNN)

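Tree-based models are generally immune because they select features one at a time and do not rely on the same mathematical assumptions as linear models. KNN is unaffected in this sense as well, because it compares distances between rows instead of estimating a coefficient per feature.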

---

## Practical Implementation

Most data science libraries have a simple parameter to handle this automatically:

* **Pandas:** `pd.get_dummies(df, drop_first=True)`
* **Scikit-learn:** `OneHotEncoder(drop='first')`

In [None]:
X = cars.iloc[:,0:2]
y = cars.iloc[:,-1]

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

ohe = OneHotEncoder(drop='first',sparse_output=False)
ohe.fit_transform(X_train).shape

(6502, 34)

In [None]:
ohe.drop_idx_

array([0, 0], dtype=object)

In [None]:
# handling rare categories
X_train['brand'].value_counts()

Unnamed: 0_level_0,count
brand,Unnamed: 1_level_1
Maruti,1953
Hyundai,1127
Mahindra,635
Tata,586
Toyota,391
Honda,369
Ford,320
Chevrolet,185
Renault,183
Volkswagen,154


In [None]:
cars['fuel'].value_counts()

Unnamed: 0_level_0,count
fuel,Unnamed: 1_level_1
Diesel,4402
Petrol,3631
CNG,57
LPG,38


In [None]:
# using min frequency

ohe = OneHotEncoder(sparse_output=False, min_frequency=100)
ohe.fit_transform(X_train).shape

(6502, 14)

In [None]:
ohe.get_feature_names_out()

array(['brand_Chevrolet', 'brand_Ford', 'brand_Honda', 'brand_Hyundai',
       'brand_Mahindra', 'brand_Maruti', 'brand_Renault', 'brand_Tata',
       'brand_Toyota', 'brand_Volkswagen', 'brand_infrequent_sklearn',
       'fuel_Diesel', 'fuel_Petrol', 'fuel_infrequent_sklearn'],
      dtype=object)

In [None]:
# using max_categories
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore', max_categories=15)
ohe.fit_transform(X_train).shape

(6502, 19)

In [None]:
ohe.get_feature_names_out()

array(['brand_BMW', 'brand_Chevrolet', 'brand_Ford', 'brand_Honda',
       'brand_Hyundai', 'brand_Jaguar', 'brand_Mahindra', 'brand_Maruti',
       'brand_Nissan', 'brand_Renault', 'brand_Skoda', 'brand_Tata',
       'brand_Toyota', 'brand_Volkswagen', 'brand_infrequent_sklearn',
       'fuel_CNG', 'fuel_Diesel', 'fuel_LPG', 'fuel_Petrol'], dtype=object)

### Handling Unknown Categories During Prediction (OneHotEncoder)

- **Context**:  
  After training a model, the **OneHotEncoder** is used again during prediction  
  (e.g., when transforming new test data or live user inputs).  
  Sometimes, the new data may contain a **category that was never seen** during training.

- **Default Behavior (`handle_unknown='error'`)**:  
  - If a new category appears during prediction, the encoder will raise a **ValueError**.
  - Example:
    - Training categories: `Red`, `Blue`, `Green`
    - Prediction data contains: `Yellow`
    - ❌ Error: `Found unknown categories ['Yellow'] in column 0 during transform`

- **Safe Option (`handle_unknown='ignore'`)**:  
  - When creating the encoder, set:
    ```python
    OneHotEncoder(handle_unknown='ignore')
    ```
  - During prediction, any unseen category is encoded as **all zeros** across the dummy columns.
  - Example:
    - Training categories: `Red`, `Blue`, `Green`
    - Prediction contains: `Yellow`
    - Encoding for `Yellow` → `[0, 0, 0]`

- **Key Takeaways**:
  - Always fit the encoder on the **training data only**, and then reuse it for prediction.
  - `handle_unknown='ignore'` ensures the model will **not crash** when a new category appears.
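  - The model interprets an all-zero vector as a **neutral signal** (no known category matched). If `drop='first'` is also set, the dropped baseline category is encoded as all zeros too, so the two cases become indistinguishable.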


In [None]:
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe.fit_transform(X_train)

ohe.transform(np.array(['local','Petrol']).reshape(1,2))



Unnamed: 0,brand_Ambassador,brand_Ashok,brand_Audi,brand_BMW,brand_Chevrolet,brand_Daewoo,brand_Datsun,brand_Fiat,brand_Force,brand_Ford,...,brand_Renault,brand_Skoda,brand_Tata,brand_Toyota,brand_Volkswagen,brand_Volvo,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [None]:
ohe.inverse_transform(np.array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1.]).reshape(1,36))

array([[None, 'Petrol']], dtype=object)

### LabelBinarizer

- **Purpose**:  
  - `LabelBinarizer` is used to **one-hot encode the target (output) feature**.  
  - While `LabelEncoder` converts target categories into **single numeric labels** (0, 1, 2, …),  
    sometimes we need the output to be **one-hot encoded** instead of a single number.

- **When to Use**:
  - Useful in **multi-class classification** problems where the model expects  
    one-hot encoded targets.
  - Examples:
    - **Deep learning** models with a `softmax` output layer.
    - **Logistic Regression** with a **one-vs-all** strategy.
    - Datasets like **Iris**, where the target feature (`species`) has 3 classes.

- **Example**:
  Suppose the target `y` contains three categories: `['yes', 'no', 'maybe']`.

  - Using `LabelEncoder` (classes are sorted alphabetically before integers are assigned):
    ```
    maybe → 0
    no    → 1
    yes   → 2
    ```
    ➡️ Only a **single column** with numeric labels.

  - Using `LabelBinarizer`:
    ```
    maybe → [1, 0, 0]
    no    → [0, 1, 0]
    yes   → [0, 0, 1]
    ```
    ➡️ **Three new columns**, one per class in sorted order (`maybe`, `no`, `yes`).
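
A quick way to confirm how the two encoders treat `['yes', 'no', 'maybe']` is to run both. The sketch below also shows a detail that is easy to miss: with only two classes, `LabelBinarizer` returns a single 0/1 column instead of two.

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

y_demo = ["yes", "no", "maybe"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(y_demo).tolist()
classes = le_demo.classes_.tolist()  # sorted class labels

lb_demo = LabelBinarizer()
one_hot = lb_demo.fit_transform(y_demo).tolist()

# Binary target: one column is enough to distinguish two classes
binary_shape = LabelBinarizer().fit_transform(["No", "Yes", "No"]).shape

print(classes, codes)
print(one_hot)
print(binary_shape)
```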




In [None]:
from sklearn.preprocessing import LabelBinarizer

# Sample target variable for a multi-class classification problem
y = ['cat', 'dog', 'fish', 'dog', 'cat']

# Initialize the LabelBinarizer
lb = LabelBinarizer()

# Fit and transform the target variable
y_binarized = lb.fit_transform(y)

print("Binarized labels:\n", y_binarized)

# Inverse transform to recover original labels
y_original = lb.inverse_transform(y_binarized)

print("Original labels:\n", y_original)


Binarized labels:
 [[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]]
Original labels:
 ['cat' 'dog' 'fish' 'dog' 'cat']


### MultiLabelBinarizer

- **Purpose**:
  - Used for **multi-label problems**, where each sample can be associated with **multiple target labels** simultaneously.
  - Transforms a list of lists (or similar iterables) of labels into a **binary matrix** where each column corresponds to a unique label and a `1` indicates the presence of that label for a given sample.
- **When to Use**:
  - When the output column is multi-label. For example, if you have movie summaries and need to identify all applicable genres (a movie can have multiple genres).
- **How it Works**:
  - It identifies all unique labels across all samples.
  - It creates a binary column for each unique label.
  - For each sample, it places a `1` in the columns corresponding to the labels present in that sample, and `0` otherwise.
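
The three steps above can be written out by hand in plain Python. This is only a sketch of the idea with hypothetical movie genres, not how `MultiLabelBinarizer` is implemented internally.

```python
# Hypothetical genre labels, one set per movie
movies = [{"action", "comedy"}, {"drama"}, {"comedy", "drama"}]

# Step 1: collect every unique label (sorted for a stable column order)
genres = sorted(set().union(*movies))

# Steps 2-3: one column per label, 1 where the movie has that label, 0 otherwise
matrix = [[int(g in labels) for g in genres] for labels in movies]

print(genres)
print(matrix)
```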

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

# Example multi-label data
y = [('red', 'blue'), ('blue', 'green'), ('green',), ('red',)]

# Initialize MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Fit and transform the data to binary matrix format
Y = mlb.fit_transform(y)

print("Binary matrix:\n", Y)
print("Class labels:", mlb.classes_)

# Inverse transform to recover original labels
y_inv = mlb.inverse_transform(Y)
print("Inverse transformed labels:", y_inv)


Binary matrix:
 [[1 0 1]
 [1 1 0]
 [0 1 0]
 [0 0 1]]
Class labels: ['blue' 'green' 'red']
Inverse transformed labels: [('blue', 'red'), ('blue', 'green'), ('green',), ('red',)]


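### 3. Count Encoder/Frequency Encoder

Count encoding replaces each category with the number of times it appears in the training data; frequency encoding uses the proportion of rows instead of the raw count.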

In [None]:
# dataset generation
import pandas as pd
import numpy as np
import category_encoders as ce

# Simulating a dataset
data = {
    'Age': np.random.randint(20, 60, size=100).astype(float),  # Random ages between 20 and 60
    'State': np.random.choice(['Karnataka', 'Tamil Nadu', 'Maharashtra', 'Delhi', 'Telangana'], size=100),
    'Education': np.random.choice(['High School', 'UG', 'PG'], size=100),
    'Package': np.random.rand(100) * 100  # Random package values for demonstration
}

# Introducing missing values in 'Age' column (5%)
np.random.seed(0)  # Seeds only the choice of missing indices; the data above is drawn before this call, so it changes between runs
missing_indices = np.random.choice(data['Age'].shape[0], replace=False, size=int(data['Age'].shape[0] * 0.05))
data['Age'][missing_indices] = np.nan

df = pd.DataFrame(data)

df.head()

Unnamed: 0,Age,State,Education,Package
0,21.0,Delhi,UG,23.322807
1,54.0,Telangana,UG,58.130542
2,,Delhi,UG,86.313852
3,55.0,Maharashtra,UG,88.035997
4,52.0,Delhi,PG,23.668519


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['Package']), df['Package'], test_size=0.2, random_state=42)

In [None]:
X_train.head()

Unnamed: 0,Age,State,Education
55,,Delhi,PG
88,51.0,Telangana,PG
26,,Karnataka,PG
42,48.0,Karnataka,High School
69,48.0,Maharashtra,UG


In [None]:
X_train['State'].value_counts()

Unnamed: 0_level_0,count
State,Unnamed: 1_level_1
Tamil Nadu,22
Telangana,19
Delhi,15
Karnataka,12
Maharashtra,12


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import sklearn

In [None]:
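class CountEncoder(BaseEstimator, TransformerMixin):
    """Replace each category with the number of times it appeared during fit."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # Resolve the columns at fit time without overwriting the constructor parameter
        self.columns_ = list(X.columns) if self.columns is None else list(self.columns)
        # Learned state carries a trailing underscore, per scikit-learn convention
        self.count_map_ = {col: X[col].value_counts().to_dict() for col in self.columns_}
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns_:
            # Categories not seen during fit are encoded as 0
            X[col] = X[col].map(self.count_map_[col]).fillna(0)
        return X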

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('age_missing', SimpleImputer(strategy='mean'), ['Age']),
        ('cat_state', CountEncoder(), ['State']),
        ('education_ordinal', OrdinalEncoder(), ['Education'])
    ])

sklearn.set_config(transform_output="pandas")

In [None]:
preprocessor.fit_transform(X_train)

Unnamed: 0,age_missing__Age,cat_state__State,education_ordinal__Education
55,40.8,15,1.0
88,51.0,19,1.0
26,40.8,12,1.0
42,48.0,12,0.0
69,48.0,12,2.0
...,...,...,...
60,21.0,12,1.0
71,54.0,15,2.0
14,52.0,22,0.0
92,41.0,19,0.0


In [None]:
# using category_encoders (this import replaces the custom CountEncoder defined above)
from category_encoders.count import CountEncoder

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('age_missing', SimpleImputer(strategy='mean'), ['Age']),
        ('cat_state', CountEncoder(normalize=True), ['State']),
        ('education_ordinal', OrdinalEncoder(), ['Education'])
    ])
sklearn.set_config(transform_output="pandas")

In [None]:
preprocessor.fit_transform(X_train)

Unnamed: 0,age_missing__Age,cat_state__State,education_ordinal__Education
55,40.8,0.1875,1.0
88,51.0,0.2375,1.0
26,40.8,0.1500,1.0
42,48.0,0.1500,0.0
69,48.0,0.1500,2.0
...,...,...,...
60,21.0,0.1500,1.0
71,54.0,0.1875,2.0
14,52.0,0.2750,0.0
92,41.0,0.2375,0.0
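
With `normalize=True`, the encoded values above are each state's training count divided by the number of training rows. A minimal sketch in plain Python, using the counts from the `value_counts()` output shown earlier (the variable names are illustrative, not part of any library):

```python
# Training-set counts for 'State', copied from the value_counts() output
state_counts = {
    "Tamil Nadu": 22,
    "Telangana": 19,
    "Delhi": 15,
    "Karnataka": 12,
    "Maharashtra": 12,
}

n_rows = sum(state_counts.values())  # number of training rows

# Frequency encoding: share of training rows that fall in each category
state_freq = {state: count / n_rows for state, count in state_counts.items()}

for state, freq in state_freq.items():
    print(f"{state}: {freq:.4f}")
```

Because frequencies are relative, they stay comparable when the training set grows, whereas raw counts scale with the number of rows.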


In [None]:
# frequency encoding

In [None]:
# parameters
import pandas as pd
import numpy as np
import category_encoders as ce

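# Simulating a dataset
# Note: mixing np.nan with strings makes NumPy store the literal string 'nan',
# so pandas reports no missing values and 'nan' is counted like any other category
np.random.seed(42)  # For reproducibility
data = {
    'State': np.random.choice(['Karnataka', 'Tamil Nadu', 'Maharashtra', 'Delhi', 'Telangana', np.nan], size=100),
    'Education': np.random.choice(['High School', 'UG', 'PG', np.nan], size=100)
}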
df = pd.DataFrame(data)

df.head(25)


Unnamed: 0,State,Education
0,Delhi,PG
1,Telangana,High School
2,Maharashtra,High School
3,Telangana,High School
4,Telangana,PG
5,Tamil Nadu,High School
6,Maharashtra,
7,Maharashtra,High School
8,Maharashtra,
9,Telangana,


In [None]:
df.isnull().sum()

Unnamed: 0,0
State,0
Education,0


In [None]:
# Initialize the CountEncoder with explicit missing/unknown handling
encoder = ce.CountEncoder(
    cols=['State', 'Education'],  # Columns to encode; None selects categorical columns automatically
    handle_missing='error',  # Raise an error if missing values (NaN) are found
    handle_unknown='error',  # Raise an error at transform time for categories not seen during fit
)

In [None]:
# Fit and transform the dataset
encoder.fit_transform(df)

#print(encoded_df.head(25))

Unnamed: 0,State,Education
0,25,34
1,17,27
2,11,27
3,17,27
4,17,34
...,...,...
95,25,27
96,25,16
97,17,23
98,11,23


In [None]:
encoder.mapping

{'State': State
 Delhi          25
 Tamil Nadu     19
 Telangana      17
 nan            17
 Maharashtra    11
 Karnataka      11
 Name: count, dtype: int64,
 'Education': Education
 PG             34
 High School    27
 nan            23
 UG             16
 Name: count, dtype: int64}

In [None]:
new_data = pd.DataFrame({'State': ['Bihar'], 'Education': ['UG']})

encoder.transform(new_data)

ValueError: Missing data found in column State at transform time.

In [None]:
np.random.seed(0)  # For reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D', 'E', 'F', np.nan], size=100, p=[0.3, 0.25, 0.15, 0.15, 0.05, 0.05, 0.05]),
    'Value': np.random.rand(100)
}

df = pd.DataFrame(data)

df.sample(10)


Unnamed: 0,Category,Value
91,C,0.209844
29,B,0.290078
2,C,0.735194
50,C,0.149448
44,C,0.806194
78,A,0.704414
33,C,0.298282
65,B,0.855803
75,A,0.223925
45,C,0.703889


In [None]:
df['Category'].value_counts()


In [None]:
encoder = ce.CountEncoder(
    cols=['Category'],
    min_group_size=10,  # Categories with fewer than 10 occurrences are combined into one group
    min_group_name='salman',  # Custom name for the combined group (instead of the default)
)

# Fit and transform the dataset (passed as a one-column DataFrame)
encoded_df = encoder.fit_transform(df[['Category']])

# Display the original and encoded data for comparison
df['Encoded'] = encoded_df['Category']
print(df.head(20))

In [None]:
encoder.mapping
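
As a rough illustration of what `min_group_size` is meant to do, the sketch below merges every category whose count falls below a threshold into one combined group and encodes that group by the summed count. It is a simplified stand-in written in plain Python, not the library's implementation, and edge cases may be handled differently there. The toy labels, the `combine_small_groups` helper, and the `'other'` group name are all hypothetical.

```python
from collections import Counter

def combine_small_groups(values, min_group_size, group_name="other"):
    """Count-encode values, pooling categories rarer than min_group_size."""
    counts = Counter(values)
    # Categories that occur fewer than min_group_size times
    small = {cat for cat, n in counts.items() if n < min_group_size}
    pooled_count = sum(counts[cat] for cat in small)

    # Mapping from (possibly pooled) category to its encoded count
    mapping = {cat: n for cat, n in counts.items() if cat not in small}
    if small:
        mapping[group_name] = pooled_count

    encoded = [pooled_count if v in small else counts[v] for v in values]
    return mapping, encoded

# Hypothetical toy data: 'C' and 'D' are rare
labels = ["A"] * 6 + ["B"] * 4 + ["C"] * 2 + ["D"] * 1
mapping, encoded = combine_small_groups(labels, min_group_size=3)
print(mapping)
print(encoded[-3:])
```

Pooling rare categories keeps the encoder from assigning near-arbitrary small counts to values that appear only a handful of times.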

### Binary Encoder

In [None]:
import pandas as pd
import category_encoders as ce

# Sample dataset
data = {
    'Item': ['Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8'],
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Elderberry', 'Fig', 'Grape', 'Honeydew']
}
df = pd.DataFrame(data)

df


In [None]:
# Initialize the Binary Encoder
encoder = ce.BinaryEncoder(cols=['Fruit'], return_df=True)

# Fit and transform the data
df_encoded = encoder.fit_transform(df)

# Display the encoded data ('Fruit' is replaced by its binary columns)
print(df_encoded)
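
The idea behind binary encoding can be sketched without any library: give each category an integer code, then write that code in binary, one column per bit. The version below is a minimal illustration that assumes codes start at 1 in order of first appearance. The `binary_encode` helper is hypothetical, and the library's exact code assignment and number of columns may differ.

```python
def binary_encode(values):
    """Map each distinct value to an integer code, then to its binary digits."""
    # Assumption: codes start at 1 and follow order of first appearance
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)

    # Number of bits needed to represent the largest code
    n_bits = max(codes.values()).bit_length()

    # Most significant bit first, zero-padded to a fixed width
    return {
        v: [int(bit) for bit in format(code, f"0{n_bits}b")]
        for v, code in codes.items()
    }

fruits = ['Apple', 'Banana', 'Cherry', 'Date', 'Elderberry', 'Fig', 'Grape', 'Honeydew']
encoded_fruits = binary_encode(fruits)
for fruit, bits in encoded_fruits.items():
    print(f"{fruit:<11}", bits)
```

Eight categories fit in four bit columns here, whereas one-hot encoding would need eight columns.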

### Target Encoder

In [None]:
# using category_encoders

import pandas as pd
import category_encoders as ce

# Sample data
data = {
    'Feature': ['A', 'B', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Target': [1, 0, 0, 1, 1, 1, 0, 1]
}
df = pd.DataFrame(data)

# Separating the feature and target columns
X = df.drop('Target', axis=1)
y = df['Target']

# Initialize the TargetEncoder
encoder = ce.TargetEncoder(cols=['Feature'])

# Fit the encoder using the feature data and target variable
encoder.fit(X, y)

# Transform the data
encoded = encoder.transform(X)

# Show the original and encoded data side by side (suffix avoids two columns named 'Feature')
print(pd.concat([df, encoded.add_suffix('_Encoded')], axis=1))


In [None]:
encoder.mapping

In [None]:
!pip install --upgrade scikit-learn==1.4.0

In [None]:
# using sklearn
import pandas as pd
from sklearn.preprocessing import TargetEncoder

# Sample data
data = {
    'Feature': ['A', 'B', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Target': [1, 0, 0, 1, 1, 1, 0, 1]
}
df = pd.DataFrame(data)

# Separating the feature and target columns
X = df.drop('Target', axis=1)
y = df['Target']

# Initialize the TargetEncoder
encoder = TargetEncoder(smooth=0.0)

# Fit the encoder using the feature data and target variable
encoder.fit(X, y)

# Transform the data
encoded = encoder.transform(X)

encoded


Unnamed: 0,Feature
0,0.666667
1,0.333333
2,0.666667
3,0.333333
4,1.0
5,0.666667
6,0.333333
7,1.0
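
With `smooth=0.0` and a plain `fit` followed by `transform`, each category is replaced by the mean of the target within that category, which is what the table above shows. The plain-Python sketch below reproduces those means and also shows how a positive smoothing value shrinks each category mean toward the global mean. The shrinkage form is written to mirror the formula described in the scikit-learn documentation for binary and continuous targets. Treat it as an illustration rather than the library's code; `target_encode` is a hypothetical helper.

```python
from collections import defaultdict

features = ['A', 'B', 'A', 'B', 'C', 'A', 'B', 'C']
targets = [1, 0, 0, 1, 1, 1, 0, 1]

def target_encode(features, targets, smooth=0.0):
    """Per-category target mean, shrunk toward the global mean by `smooth`."""
    sums, counts = defaultdict(float), defaultdict(int)
    for f, t in zip(features, targets):
        sums[f] += t
        counts[f] += 1

    global_mean = sum(targets) / len(targets)

    # Weighted blend: category weight n_i / (n_i + smooth), global weight smooth / (n_i + smooth)
    return {
        cat: (sums[cat] + smooth * global_mean) / (counts[cat] + smooth)
        for cat in counts
    }

unsmoothed = target_encode(features, targets, smooth=0.0)
smoothed = target_encode(features, targets, smooth=2.0)
print(unsmoothed)
print(smoothed)
```

Categories with few rows move furthest toward the global mean, which is the point of smoothing: a mean computed from two or three rows is a noisy estimate.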


### Weight of Evidence

In [None]:
!pip install category_encoders



In [None]:
import pandas as pd
import category_encoders as ce

# Example dataset
data = {
    'Feature': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Target': [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)

# Define the features and target
X = df[['Feature']]
y = df['Target']

# Initialize and fit the WOEEncoder
encoder = ce.WOEEncoder(cols=['Feature'])
X_encoded = encoder.fit_transform(X, y)

# Display the original and encoded data
df['Feature_Encoded'] = X_encoded
print(df)


  Feature  Target  Feature_Encoded
0       A       1         0.000000
1       B       0        -0.405465
2       A       0         0.000000
3       C       1         0.405465
4       B       1        -0.405465
5       A       0         0.000000
6       C       1         0.405465
7       B       0        -0.405465
8       A       1         0.000000
9       C       0         0.405465
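
The encoded values can be checked by hand. Weight of Evidence for a category is the log of the ratio between its share of positive rows and its share of negative rows. The sketch below assumes additive smoothing of 1.0 on both counts (the `regularization` argument), an assumption chosen because it reproduces the values in the table above. `weight_of_evidence` is a hypothetical helper, not the library's code.

```python
import math

features = ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C']
targets = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

def weight_of_evidence(features, targets, regularization=1.0):
    """log(share of positives / share of negatives) per category, with smoothing."""
    total_pos = sum(targets)
    total_neg = len(targets) - total_pos

    woe = {}
    for cat in sorted(set(features)):
        pos = sum(t for f, t in zip(features, targets) if f == cat)
        neg = sum(1 - t for f, t in zip(features, targets) if f == cat)
        # Smoothed share of all positives / negatives that fall in this category
        pos_share = (pos + regularization) / (total_pos + 2 * regularization)
        neg_share = (neg + regularization) / (total_neg + 2 * regularization)
        woe[cat] = math.log(pos_share / neg_share)
    return woe

woe = weight_of_evidence(features, targets)
for cat, value in woe.items():
    print(cat, round(value, 6))
```

A positive value means the category is over-represented among positive rows, a negative value means the opposite, and zero means the category carries no evidence either way.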
