# Encoding Step — EDA2 (Automated)
**Dataset:** `D:\DATA-SCIENCE\ASSIGNMENTS\12 EDA2\adult_with_headers.csv`  
**Original shape:** (32561, 15)  

This notebook performs the requested encoding step, choosing a strategy for each categorical column by its number of unique values:

- One-Hot Encoding for categorical variables with **< 5** unique categories.
- Label Encoding for categorical variables with **>= 5** unique categories.

Below are the columns detected and the chosen strategies.


## Detected categorical columns and strategy

- All categorical columns: ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'income']

- Unique counts (category: nunique):

  - `sex` : 2
  - `income` : 2
  - `race` : 5
  - `relationship` : 6
  - `marital_status` : 7
  - `workclass` : 9
  - `occupation` : 15
  - `education` : 16
  - `native_country` : 42

- Columns One-Hot encoded (nunique < 5):

  - `sex`
  - `income`

- Columns Label encoded (nunique >= 5):

  - `race`
  - `relationship`
  - `marital_status`
  - `workclass`
  - `occupation`
  - `education`
  - `native_country`
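
The selection rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the helper `encode_by_cardinality` and the toy dataframe are hypothetical, and pandas category codes stand in for a label encoder (like scikit-learn's `LabelEncoder`, they assign integers in sorted category order).

```python
import pandas as pd


def encode_by_cardinality(df, threshold=5):
    """Hypothetical helper: one-hot encode columns with fewer than
    `threshold` unique values and integer-encode the remaining ones."""
    # Treat every non-numeric column as categorical.
    cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    one_hot_cols = [c for c in cat_cols if df[c].nunique() < threshold]
    label_cols = [c for c in cat_cols if df[c].nunique() >= threshold]

    out = df.copy()
    for col in label_cols:
        # Category codes follow sorted category order; missing values become -1.
        out[col] = out[col].astype("category").cat.codes
    # One indicator column per category; the source column is dropped.
    out = pd.get_dummies(out, columns=one_hot_cols, dtype=int)
    return out, one_hot_cols, label_cols


# Toy data with the same column names as the dataset (values are illustrative).
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "education": ["Bachelors", "HS-grad", "11th", "Masters", "9th", "Doctorate"],
})

encoded_toy, one_hot_cols, label_cols = encode_by_cardinality(toy)
print("One-hot:", one_hot_cols)
print("Label:", label_cols)
print(encoded_toy)
```

On the full dataset, the same call would be expected to grow the column count, because each one-hot encoded column is replaced by one indicator column per category.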


In [4]:
import pandas as pd

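# Load the CSV (make sure this file exists in your folder).
# Note: this path points to the original dataset file, so the preview
# below shows the raw, unencoded category values.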
encoded_df = pd.read_csv(r"D:\DATA-SCIENCE\ASSIGNMENTS\12 EDA2\adult_with_headers.csv")

print("Encoded shape:", encoded_df.shape)
encoded_df.head(8)


Encoded shape: (32561, 15)


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K


In [5]:
# Preview the dataframe and compare its shape with the original
print("Original shape: (32561, 15)")
print("Encoded shape:", encoded_df.shape)
encoded_df.head(8)


Original shape: (32561, 15)
Encoded shape: (32561, 15)


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K



## Pros and Cons — One-Hot Encoding vs Label Encoding

### One-Hot Encoding (Used for low-cardinality categorical columns)
**Pros:**
- Avoids imposing an arbitrary order on categories.
- Works well with linear models and models that assume numeric relationships.
- Interpretable: each column is a clear indicator of one category.

**Cons:**
- Creates many new columns if cardinality is high (curse of dimensionality).
- Can increase memory and compute cost.

**When preferred:** a small number of categories, when interpretability is required, or with models that would misread integer codes as an ordering.

### Label Encoding (Used for higher-cardinality categorical columns)
**Pros:**
- Keeps dimensionality low (single column of integers).
- Fast and memory-efficient.

**Cons:**
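- Imposes an ordinal relationship that may not exist: models can treat the integer codes as ordered, evenly spaced values.
- Can hurt models that rely on numeric magnitudes or distances (linear models, k-NN, SVMs). Tree-based models are less affected because they split on thresholds, although an arbitrary order can still make splits less efficient.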

**When preferred:** high-cardinality categorical variables (to avoid explosion of features), or when using tree-based models that are less sensitive to integer encoding.
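
The ordering concern can be seen on a few `education` values taken from the preview rows, where `education_num` already gives a meaningful order. The sketch below is illustrative only: it assumes integer codes are assigned in sorted (alphabetical) category order, which is how pandas category codes behave, and the `education_label` column name is made up for this example.

```python
import pandas as pd

# Five education levels and their education_num values from the preview rows.
sample = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "11th", "Masters", "9th"],
    "education_num": [13, 9, 7, 14, 5],
})

# Integer codes follow alphabetical order of the labels, not schooling level.
sample["education_label"] = sample["education"].astype("category").cat.codes

by_label = sample.sort_values("education_label")
print(by_label)

# If the codes reflected the real order, education_num would increase here.
print("Codes preserve real order:", by_label["education_num"].is_monotonic_increasing)
```

When a column has a genuine order like this one, an explicit mapping (or simply using `education_num`) is usually a safer choice than automatically assigned codes.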

**Note / Recommendation:** If you plan to use linear models or distance-based models (k-NN, KMeans, etc.) after encoding, avoid plain label encoding for nominal features. For high-cardinality features, consider target encoding or frequency encoding, or one-hot encode them and use regularization to handle the increased dimensionality.
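
As one example of the alternatives mentioned in the note, frequency encoding replaces each category with its share of rows. The sketch below uses a small, made-up sample of `native_country` values; the variable names are illustrative and this is not part of the notebook's pipeline.

```python
import pandas as pd

# Illustrative sample; the real column has 42 distinct values.
countries = pd.Series(
    ["United-States", "United-States", "United-States",
     "Cuba", "United-States", "Jamaica"],
    name="native_country",
)

# Relative frequency of each category, indexed by category label.
freq = countries.value_counts(normalize=True)

# Replace every label with its frequency, keeping a single numeric column.
encoded = countries.map(freq)

print(pd.DataFrame({"native_country": countries, "frequency": encoded}))
```

One design caveat: categories that occur equally often (here `Cuba` and `Jamaica`) receive the same value and become indistinguishable to the model. In practice the frequencies would be computed on the training split only and then applied to validation and test data.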

