# 🌈 1️⃣ Why Do We Need Encoding?

- In **Data Science** and **Machine Learning**, most algorithms work only with numbers, not text.

- So when you have **categorical data** (like “Red”, “Blue”, “Green”), you must convert it into **numerical form**.

- But we cannot just assign numbers directly like:

    ```mathematica
        Red → 1
        Blue → 2
        Green → 3
    ```
- because the model might think Green > Blue > Red, implying a false order relationship 😕.
<br>

### To fix this, we use One Hot Encoding (OHE), which creates a binary column (0 or 1) for each category. There are two methods for this:

- #### **Method 1 - Using `pandas.get_dummies()`**
- #### **Method 2 - Using `scikit-learn's OneHotEncoder`**
<br>
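
The false-order problem can be made concrete with a small sketch. It uses the Red/Blue/Green example from above; the variable names are illustrative, and `pandas.get_dummies()` is used only as a convenient way to build the one-hot vectors.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green"], name="color")

# Naive integer codes: Green (3) looks "further" from Red (1) than Blue (2) does.
int_codes = colors.map({"Red": 1, "Blue": 2, "Green": 3}).to_numpy()
print(abs(int_codes[2] - int_codes[0]), abs(int_codes[1] - int_codes[0]))

# One-hot vectors: every pair of distinct colors is equally far apart.
one_hot = pd.get_dummies(colors, dtype=int).to_numpy()
dist = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=-1)
print(np.round(dist, 3))
```

With integer codes the distances depend on an arbitrary numbering, while the one-hot distance matrix has the same value for every pair of different colors.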

### 🌟 ➡️ **When to Use `get_dummies()` vs `OneHotEncoder()`**

| Feature             | `pd.get_dummies()`       | `OneHotEncoder()`                          |
| ------------------- | ------------------------ | ------------------------------------------ |
| Ease of Use         | Very simple for EDA      | Ideal for ML pipelines                     |
| Returns             | DataFrame                | SciPy sparse matrix by default (dense NumPy array via `.toarray()`) |
| Integration         | Works easily with Pandas | Works with scikit-learn models             |
| Control over output | Limited                  | More flexible (drop, handle_unknown, etc.) |
| Used in             | Quick preprocessing      | Production-grade preprocessing             |


<br>
<hr>



## 💡 2️⃣ One Hot Encoding: A Real-Life Analogy

- Imagine you’re filling out a form asking for your favorite fruit:

| Person    | Favorite Fruit |
| --------- | -------------- |
| Prajwal   | Apple          |
| Kushal    | Banana         |
| Akash     | Mango          |
| Uday      | Apple          |

- If we use One Hot Encoding, we create new columns for each fruit:

| Person    | Apple | Banana | Mango |
| --------- | ----- | ------ | ----- |
| Prajwal   | 1     | 0      | 0     |
| Kushal    | 0     | 1      | 0     |
| Akash     | 0     | 0      | 1     |
| Uday      | 1     | 0      | 0     |
- Each fruit gets its own indicator column, showing 1 if it’s selected, else 0.
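
The fruit table above can be reproduced with a few lines of pandas. This is a sketch only: passing empty strings for `prefix` and `prefix_sep` is just one way to keep the bare fruit names as column headers.

```python
import pandas as pd

fruit = pd.DataFrame({
    "Person": ["Prajwal", "Kushal", "Akash", "Uday"],
    "Favorite Fruit": ["Apple", "Banana", "Mango", "Apple"],
})

# Empty prefix and separator keep the column names as "Apple", "Banana", "Mango"
encoded = pd.get_dummies(
    fruit, columns=["Favorite Fruit"], prefix="", prefix_sep="", dtype=int
)
print(encoded)
```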

<br>
<hr>



## 3️⃣ 🌟 Why It’s Called “One Hot”

Because, for each row, **only one** of the category columns is **“hot”** (1) at a time.
The others are **cold (0)**.

This encoding avoids implying **any ordinal relationship** between categories,
meaning no ordering is imposed **among the different categories of a single categorical column/feature**.
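
The "exactly one hot column per row" property is easy to check, and it also means the original category can be recovered from the encoding. A small sketch with illustrative data:

```python
import pandas as pd

fruits = pd.Series(["Apple", "Banana", "Mango", "Apple"])
one_hot = pd.get_dummies(fruits, dtype=int)

# Every row has exactly one 1 (the "hot" column)
row_sums = one_hot.sum(axis=1)
print(row_sums.tolist())

# The hot column's label is the original category
recovered = one_hot.idxmax(axis=1)
print(recovered.tolist())
```

Note that this holds for the full encoding; once one column per feature is dropped (as with `drop_first=True`), the dropped category is represented by a row of all zeros.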

<br>
<hr>
<br>


# ➡️ **One Hot Encoding by Method 1: using `pandas.get_dummies()`**

In [46]:
import pandas as pd
from sklearn.impute import SimpleImputer

In [47]:
df = pd.read_csv("loan_data_set.csv")
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [48]:
# One Hot Encoding transforms categorical data,
# so we only fill the missing values of the categorical columns
df.select_dtypes(include="object").isnull().sum()



Loan_ID           0
Gender           13
Married           3
Dependents       15
Education         0
Self_Employed    32
Property_Area     0
Loan_Status       0
dtype: int64

In [49]:
# selecting the categorical columns to fill the missing values in them
catColumns = df.select_dtypes(include = "object").columns.tolist()

In [50]:
# Instantiating the SimpleImputer with strategy="most_frequent" (fills gaps with the mode)

modeImputer = SimpleImputer(strategy="most_frequent")

In [51]:
# now filling the missing values
for col in catColumns:
    df[[col]] = modeImputer.fit_transform(df[[col]])

In [52]:
# df1 is an independent copy, used later for the second way of calling pd.get_dummies()
df1 = df.copy(deep=True)

In [53]:
# now checking for the missing value in catColumns
df[catColumns].isnull().sum()

Loan_ID          0
Gender           0
Married          0
Dependents       0
Education        0
Self_Employed    0
Property_Area    0
Loan_Status      0
dtype: int64

### 🌟🌟🌟 **Important Note**: we use only the `Gender` and `Married` columns for One Hot Encoding. **`One-hot encoding (OHE)` can be applied to other categorical columns too, but it `comes with significant caveats` and is `generally not the optimal approach when a single categorical feature contains a large number of unique`, rarely repeated values `(a scenario known as high cardinality)`**.

### If `OHE` is performed on a feature with **high cardinality**, it **leads to the curse of dimensionality**:
### 💀💀💀 The **`curse of dimensionality`** appears in machine learning when adding more features to a dataset increases computational cost, makes the data sparser, and can degrade model performance. This occurs because the volume of the feature space grows exponentially with each new feature, so a fixed number of samples covers it more and more thinly, making it harder for algorithms to find patterns and requiring far more data to remain effective.

#### ➡️ That's why we chose the `Gender` and `Married` features: their cardinality is 2 (each column holds only two categories, such as Yes/No).
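
Before encoding, it can help to measure the cardinality of each categorical column and keep only the low-cardinality ones. The sketch below uses a tiny made-up frame rather than the loan dataset, and the `max_categories` cut-off is a hypothetical threshold chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003", "LP004"],  # unique per row: high cardinality
    "Gender": ["Male", "Female", "Male", "Male"],
    "Married": ["No", "Yes", "Yes", "No"],
})

# Number of distinct values per column
cardinality = toy.nunique()
print(cardinality.to_dict())

max_categories = 3  # hypothetical cut-off, tune for your own data
low_card_cols = cardinality[cardinality <= max_categories].index.tolist()
print(low_card_cols)
```

An identifier column such as `Loan_ID` has one category per row, so one-hot encoding it would add as many columns as there are rows without giving the model anything to generalize from.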

In [54]:
to_encode = df[["Gender", "Married"]]  # low-cardinality columns, chosen for the reason given above
to_encode.head()

Unnamed: 0,Gender,Married
0,Male,No
1,Male,Yes
2,Male,Yes
3,Male,Yes
4,Male,No


In [55]:
pd.get_dummies(to_encode, drop_first=True)

Unnamed: 0,Gender_Male,Married_Yes
0,True,False
1,True,True
2,True,True
3,True,True
4,True,False
...,...,...
609,False,False
610,True,True
611,True,True
612,True,True


## 🌟 Note: **We get `bool` values for the encoded categorical features instead of 0 and 1**

### ⚙️ Why Pandas Switched to bool

- Since pandas 2.0, `get_dummies()` returns `bool` columns by default (earlier versions returned `uint8`). It’s a semantically accurate and memory-friendly choice:

- Each dummy column is a binary indicator, so it logically fits a boolean type.

- **bool** columns take 1 byte per value, far less than the 8 bytes per value of `int64` columns.

- **Most ML libraries** (e.g., `scikit-learn`, `xgboost`) convert boolean columns to numeric values when training a model, so it’s usually safe to leave them as `bool`.
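
The dtype can also be requested directly through the `dtype` parameter of `get_dummies()`, which avoids a separate `astype` call. The sketch below compares memory use on a synthetic column; the data and variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["Male", "Female", "Male"] * 1000, name="Gender")

as_bool = pd.get_dummies(s, dtype=bool)
as_int = pd.get_dummies(s, dtype="int64")

# Bytes used by the dummy columns only (index excluded)
bool_bytes = int(as_bool.memory_usage(index=False).sum())
int_bytes = int(as_int.memory_usage(index=False).sum())
print(bool_bytes, int_bytes)
```

Here the `int64` version needs eight times the memory of the `bool` version, which matters once the number of rows or dummy columns grows.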

In [56]:
en_df = pd.get_dummies(to_encode, drop_first=True).astype(int) #remember astype
en_df

Unnamed: 0,Gender_Male,Married_Yes
0,1,0
1,1,1
2,1,1
3,1,1
4,1,0
...,...,...
609,0,0
610,1,1
611,1,1
612,1,1


# ⚡ Avoiding the “Dummy Variable Trap”

1. 🧩 The real concept: "We only need (n−1) dummy variables"


- For a categorical feature with n unique categories, we only need (n−1) dummy variables.

- The reason:

    - **One of the categories can be inferred from the others, so including all n columns causes `multicollinearity` (known as the “dummy variable trap”)**.

<br>

2. 🧩 Which one should we drop? 🤔

- Mathematically, it doesn’t matter which one you drop:
- you can drop the first, last, or any one of them, and an unregularized linear model will make the same predictions. Only the interpretation of the coefficients changes, because the dropped category becomes the baseline.

- ✅ All that matters is that you drop exactly one dummy column per categorical feature.

<br>

3. 🧩 Then why does `drop_first=True` drop the first one?


- It’s just a convention in `pandas.get_dummies()`.

- Pandas has to decide which column to drop automatically.

- It drops the first level of each feature: the alphabetically first category for plain string columns, or the first entry of the category order for `Categorical` columns.
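
The multicollinearity can be seen numerically by comparing the rank of the design matrix with and without the extra dummy column. This is a sketch on made-up data; `design_matrix` is a hypothetical helper that simply prepends an intercept column.

```python
import numpy as np
import pandas as pd

area = pd.Series(["Urban", "Rural", "Semiurban", "Urban", "Rural", "Semiurban"])

def design_matrix(dummies: pd.DataFrame) -> np.ndarray:
    """Prepend an intercept column of ones to the dummy columns."""
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy(dtype=float)])

full = design_matrix(pd.get_dummies(area))                      # intercept + n dummies
reduced = design_matrix(pd.get_dummies(area, drop_first=True))  # intercept + (n-1) dummies

full_rank = int(np.linalg.matrix_rank(full))
reduced_rank = int(np.linalg.matrix_rank(reduced))
print(full.shape[1], full_rank)
print(reduced.shape[1], reduced_rank)
```

With all n dummies the matrix has more columns than its rank, because the dummy columns sum to the intercept column. After dropping one dummy, the number of columns equals the rank, so the redundancy is gone.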

In [57]:
# dropping the original columns that now have encoded replacements
df.drop(columns=["Gender","Married"],inplace = True)

In [58]:
# concatenating df and the encoded DataFrame along the columns
df = pd.concat([df, en_df], axis = 1)

### 🌟 Note: **Understanding `axis` in Pandas**

- In Pandas (and NumPy):

| axis     | Meaning                                        |
| -------- | ---------------------------------------------- |
| `axis=0` | Vertical (rows): means *stack below*           |
| `axis=1` | Horizontal (columns): means *join side-by-side* |


👉 So:

- axis=0 → adds more rows (think stacking books vertically 📚)

- axis=1 → adds more columns (think adding columns side by side in a table 🧱)
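
A quick way to see the difference is to compare the shapes produced by `pd.concat` along each axis. The two small frames below are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"c": [5, 6]})

# axis=0 stacks rows below each other; ignore_index renumbers the rows
stacked = pd.concat([left, left], axis=0, ignore_index=True)

# axis=1 places columns side by side, matching rows by index label
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```

Because `axis=1` aligns rows by their index labels, both frames should share the same index; otherwise pandas fills the unmatched rows with `NaN`.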


In [59]:
df

Unnamed: 0,Loan_ID,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,Gender_Male,Married_Yes
0,LP001002,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y,1,0
1,LP001003,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,1,1
2,LP001005,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,1,1
3,LP001006,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,1,1
4,LP001008,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y,0,0
610,LP002979,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y,1,1
611,LP002983,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y,1,1
612,LP002984,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y,1,1


<Br>
<hr>
<br>

#### ➡️ Alternatively, we can encode the columns and get them back inside the DataFrame in a single call

- Here the newly created encoded columns are returned as part of the DataFrame, because we pass the whole frame together with `columns=["Gender", "Married"]` as arguments.

In [60]:
encoded_df = pd.get_dummies(df1, columns=["Gender", "Married"], drop_first=True)
encoded_df.head()

Unnamed: 0,Loan_ID,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,Gender_Male,Married_Yes
0,LP001002,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y,True,False
1,LP001003,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,True,True
2,LP001005,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,True,True
3,LP001006,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,True,True
4,LP001008,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,True,False


In [61]:
encoded_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Dependents         614 non-null    object 
 2   Education          614 non-null    object 
 3   Self_Employed      614 non-null    object 
 4   ApplicantIncome    614 non-null    int64  
 5   CoapplicantIncome  614 non-null    float64
 6   LoanAmount         592 non-null    float64
 7   Loan_Amount_Term   600 non-null    float64
 8   Credit_History     564 non-null    float64
 9   Property_Area      614 non-null    object 
 10  Loan_Status        614 non-null    object 
 11  Gender_Male        614 non-null    bool   
 12  Married_Yes        614 non-null    bool   
dtypes: bool(2), float64(4), int64(1), object(6)
memory usage: 54.1+ KB


In [62]:
bool_col = encoded_df.select_dtypes(include="bool").columns.tolist()
bool_col

['Gender_Male', 'Married_Yes']

In [63]:
# convert all boolean dummy columns to 0/1 integers in one step
encoded_df[bool_col] = encoded_df[bool_col].astype(int)

In [64]:
encoded_df

Unnamed: 0,Loan_ID,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,Gender_Male,Married_Yes
0,LP001002,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y,1,0
1,LP001003,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,1,1
2,LP001005,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,1,1
3,LP001006,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,1,1
4,LP001008,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y,0,0
610,LP002979,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y,1,1
611,LP002983,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y,1,1
612,LP002984,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y,1,1


<br>
<Hr>
<br>


# ➡️ **One Hot Encoding by Method 2: using scikit-learn’s `OneHotEncoder`**

- #### This method is often used in ML pipelines.

In [65]:
from sklearn.preprocessing import OneHotEncoder

#### 🧠 **TL;DR**
- Scikit-learn, commonly known as sklearn, is an open-source Python library focused on machine learning and statistical modeling. It provides a comprehensive set of tools and algorithms for machine learning tasks, making it a central library for implementing machine learning workflows in Python.
- `preprocessing` is a submodule of this library.
- `OneHotEncoder` is a class in that submodule.

In [66]:
df2 = pd.read_csv("loan_data_set.csv")

In [67]:
df2.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [68]:
categoricalColumns = df2.select_dtypes(include = "object").columns.tolist()
categoricalColumns

['Loan_ID',
 'Gender',
 'Married',
 'Dependents',
 'Education',
 'Self_Employed',
 'Property_Area',
 'Loan_Status']

In [69]:
for col in categoricalColumns:
    df2[[col]] = modeImputer.fit_transform(df2[[col]])

In [70]:
df2[categoricalColumns].isnull().sum()

Loan_ID          0
Gender           0
Married          0
Dependents       0
Education        0
Self_Employed    0
Property_Area    0
Loan_Status      0
dtype: int64

#### We select the `Gender` and `Married` features/columns for the same reason given above: in short, their cardinality is 2

In [71]:
en_data = df2[["Married","Gender"]] # Data to be encoded by OHE

In [72]:
ohe = OneHotEncoder(drop="first")  # drop one dummy per feature to avoid the dummy variable trap (see above)

In [73]:
ohe.fit_transform(en_data)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 903 stored elements and shape (614, 2)>

### 🌟 NOTE: **Sparse Matrix**

#### By default, scikit-learn's `OneHotEncoder` returns a sparse matrix to save memory and improve computational efficiency, because the output of one-hot encoding is usually full of zeros.


- What is a **Sparse Matrix**?
    - A sparse matrix is a matrix (a rectangular array of numbers) in which most of the elements are zero.

- **Dense Matrix**: A standard matrix where every element, including all the zeros, is explicitly stored in memory.

- **Sparse Matrix**: A specialized data structure that only stores the non-zero elements and their coordinates (row and column indices).

🌟 The advantage is significant: if a matrix is 99% zeros, a sparse matrix only has to store the 1% of non-zero data, saving a huge amount of memory and speeding up computations that would otherwise waste time multiplying or adding zeros.
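
To get a feel for the savings, the sketch below builds a synthetic one-hot matrix in SciPy's CSR format and compares its storage with what a dense `float64` array of the same shape would need. The sizes (10,000 rows, 1,000 categories) are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_rows, n_categories = 10_000, 1_000

# One-hot matrix: exactly one 1 per row, everything else 0
rows = np.arange(n_rows)
cols = rng.integers(0, n_categories, size=n_rows)
data = np.ones(n_rows)
one_hot_sparse = sparse.csr_matrix((data, (rows, cols)), shape=(n_rows, n_categories))

# A dense float64 array stores every cell, zeros included
dense_bytes = n_rows * n_categories * 8

# CSR stores only the non-zero values plus two index arrays
sparse_bytes = (
    one_hot_sparse.data.nbytes
    + one_hot_sparse.indices.nbytes
    + one_hot_sparse.indptr.nbytes
)
print(dense_bytes, sparse_bytes)
```

The dense version needs 80 MB for this shape, while the sparse version stays well under 1 MB, since only one value per row is non-zero.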



## 💀 **Since OHE returns a sparse matrix, we convert it to an array and then to a DataFrame**

In [74]:
encoded_array = ohe.fit_transform(en_data).toarray()
encoded_array

array([[0., 1.],
       [1., 1.],
       [1., 1.],
       ...,
       [1., 1.],
       [1., 1.],
       [0., 0.]])

In [75]:
encoded_dataframe = pd.DataFrame(encoded_array, columns= ohe.get_feature_names_out() )

### ⚠️ There is no need to manually pass the column names.

```python
encoded_dataframe = pd.DataFrame(
    encoded_array,
    columns=ohe.get_feature_names_out()
)
```
- 💡 By default, `get_feature_names_out()` automatically uses the names from `ohe.feature_names_in_`, which are learned during `.fit()` when the encoder is fitted on a DataFrame.

### ⚙️ TL;DR Summary

| Step              | Action                                               | Explanation                                   |
| ----------------- | ---------------------------------------------------- | --------------------------------------------- |
| Fit encoder       | `ohe.fit(df[['col1', 'col2']])`                      | Learns categories and remembers feature names |
| Transform         | `ohe.transform(...)` or `fit_transform(...)`         | Converts categories to binary vectors         |
| Get feature names | `ohe.get_feature_names_out()`                        | Automatically uses stored names               |
| Error cause       | Passed manual list not matching `feature_names_in_`  | Encoder found mismatch                        |
| Fix               | Call `get_feature_names_out()` **without arguments** | Safest and cleanest                           |


In [76]:
encoded_dataframe

Unnamed: 0,Married_Yes,Gender_Male
0,0.0,1.0
1,1.0,1.0
2,1.0,1.0
3,1.0,1.0
4,0.0,1.0
...,...,...
609,0.0,0.0
610,1.0,1.0
611,1.0,1.0
612,1.0,1.0


In [79]:
# drop() returns a new DataFrame, so the result must be assigned before concatenating
df2 = df2.drop(columns=["Married", "Gender"])
final_df = pd.concat([df2, encoded_dataframe], axis=1)

In [80]:
final_df.head()

Unnamed: 0,Loan_ID,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,Married_Yes,Gender_Male
0,LP001002,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y,0.0,1.0
1,LP001003,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,1.0,1.0
2,LP001005,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,1.0,1.0
3,LP001006,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,1.0,1.0
4,LP001008,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,0.0,1.0


## 💀 Limitations of OHE

#### One-hot encoding (OHE) can be used, but it comes with significant caveats and is generally not the optimal approach when a single categorical feature contains a large number of unique, rarely repeated values (a scenario known as high cardinality).

#### Here is a breakdown of why, and what to consider instead:

- **Why is One-Hot Encoding problematic with high cardinality?**
One-hot encoding creates a new binary column for every unique category in the feature. If a single feature has, say, 1,000 unique values, OHE will create 1,000 new columns.

1. **Curse of Dimensionality:**

    - **Training cost**: Dramatically increasing the number of features makes model training slower and more computationally expensive.

    - **Sparsity:** The resulting dataset will be very sparse (mostly zeros), which can negatively impact the performance of many machine learning algorithms, especially linear models.

2. **Overfitting:**

    - If many categories appear only once or a few times, the model might learn a unique weight for each rare category. This weight will be based on very little data, causing the model to overfit to the training set and generalize poorly to new, unseen data.

3. **Memory Issues**:

    - Creating hundreds or thousands of new columns can quickly exhaust system memory, especially with large datasets.

<Hr>

#### **Recommended Alternatives for High Cardinality Features**

Instead of basic OHE, consider these techniques which are better suited for features with many unique, non-repeated values:

1. Grouping/Binning (The Simplest Fix)
    - Method: Combine the very rare categories (those that appear below a certain frequency threshold, e.g., 5% of the data) into a single new category, often named "Other" or "Rare".

    - After: Apply One-Hot Encoding to the resulting feature, which now has a manageable number of categories.

    - Benefit: Greatly reduces the number of columns and helps the model generalize better by treating all very rare occurrences as a single group.

2. Target Encoding (Powerful for Supervised Learning)
    - Method: Replace each category with the mean of the target variable (the variable you are trying to predict) for that specific category. For example, if you are predicting house price, you would replace a neighborhood name with the average house price for that neighborhood.

    - Benefit: Converts the categorical feature to a single numerical feature, capturing the relationship between the category and the target, and avoiding the curse of dimensionality.

    - Caveat: Can lead to data leakage and overfitting unless it is implemented carefully (e.g., with cross-validation or smoothing techniques).

3. Feature Hashing (The "Hashing Trick")
    - Method: Converts the categories into a fixed, pre-defined number of features using a hash function, regardless of how many unique categories there are.

    - Benefit: Directly controls the output dimensionality and handles new categories (in the test set) automatically.

    - Caveat: The new features are no longer directly interpretable (they are just "bins"), and different categories can accidentally map to the same bin (collisions).

4. Embedding (For Deep Learning)
    - Method: In a neural network, the categories are mapped to a lower-dimensional, dense vector space, and the values in this vector (embeddings) are learned during the training process.

    - Benefit: Captures complex relationships and similarity between categories automatically.
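
The first three alternatives can be sketched in a few lines. Everything below is illustrative: the `city`/`target` data is made up, `group_rare` and `target_encode` are hypothetical helpers written for this example, and the smoothing formula is one common variant rather than the only way to do target encoding. For simplicity the target statistics are computed on the full toy frame; in practice they should come from training folds only to avoid leakage.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C", "D", "E"],
    "target": [1, 0, 1, 1, 1, 0, 1, 0],
})

def group_rare(series: pd.Series, min_freq: float, other: str = "Other") -> pd.Series:
    """Replace categories whose relative frequency is below min_freq with `other`."""
    freq = series.map(series.value_counts(normalize=True))
    return series.where(freq >= min_freq, other)

def target_encode(categories: pd.Series, target: pd.Series, smoothing: float = 5.0) -> pd.Series:
    """Per-category target mean, shrunk toward the global mean for small categories."""
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return categories.map(smoothed)

# 1. Grouping: C, D and E each cover 12.5% of the rows, below the 20% threshold
df["city_grouped"] = group_rare(df["city"], min_freq=0.2)

# 2. Target encoding: one numeric column instead of one column per city
df["city_te"] = target_encode(df["city"], df["target"])

# 3. Feature hashing: a fixed number of output columns, whatever the cardinality
hasher = FeatureHasher(n_features=4, input_type="string")
hashed = hasher.transform([[c] for c in df["city"]]).toarray()

print(df)
print(hashed.shape)
```

Grouping keeps the encoding interpretable, target encoding keeps the feature count at one, and hashing fixes the output width up front at the cost of possible collisions.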

**In summary**: While you can use One-Hot Encoding, a single feature whose categorical values are rarely repeated is a high cardinality problem, which makes an alternative like Grouping/Binning followed by OHE, or Target Encoding, a much more robust and efficient choice.