
✅ PART 1 — Day 4 TASKS (What YOU must DO)
# **1. Handle Categorical Data**

Learn and practice:

Label Encoding

One-Hot Encoding

Apply on your Employee dataset:

Gender

City

Education

EverBenched

### 1. Create a Sample Employee Dataset

Let's start by creating a sample Pandas DataFrame that contains the categorical features listed above: `Gender`, `City`, `Education`, and `EverBenched`.

In [44]:
import pandas as pd
import numpy as np

# Create a sample Employee DataFrame
data = {
    'EmployeeID': range(1, 11),
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'San Francisco', 'Chicago', 'Los Angeles', 'New York', 'San Francisco', 'Chicago'],
    'Education': ['Bachelors', 'Masters', 'PhD', 'Bachelors', 'Masters', 'Bachelors', 'PhD', 'Masters', 'Bachelors', 'Masters'],
    'EverBenched': ['No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes'],
    'Salary': np.random.randint(50000, 120000, 10)
}

df = pd.DataFrame(data)

print("Original DataFrame:")
display(df.head())
print("\nData Types:")
df.info()  # info() prints its summary and returns None, so wrapping it in display() only adds a stray "None"

Original DataFrame:

| | EmployeeID | Gender | City | Education | EverBenched | Salary |
|---|---|---|---|---|---|---|
| 0 | 1 | Male | New York | Bachelors | No | 53630 |
| 1 | 2 | Female | Los Angeles | Masters | Yes | 58094 |
| 2 | 3 | Male | Chicago | PhD | No | 108665 |
| 3 | 4 | Female | New York | Bachelors | No | 103537 |
| 4 | 5 | Female | San Francisco | Masters | Yes | 106032 |

Data Types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   EmployeeID   10 non-null     int64 
 1   Gender       10 non-null     object
 2   City         10 non-null     object
 3   Education    10 non-null     object
 4   EverBenched  10 non-null     object
 5   Salary       10 non-null     int64 
dtypes: int64(2), object(4)
memory usage: 612.0+ bytes

### 2. Label Encoding

**What it is:** Label Encoding converts each category in a feature into a numerical label. For example, 'Male' might become `0` and 'Female' might become `1`.

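**When to use it:** It suits ordinal categorical data (where there's a natural order, like 'Low', 'Medium', 'High') or cases where the number of unique categories is small and you don't want to create many new columns. Note that scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order of the class names, so for ordinal data you must check that the resulting integers match the intended order or define the mapping explicitly. `LabelEncoder` is also designed primarily for target labels; `OrdinalEncoder` is the feature-oriented equivalent.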

**Caveats:** For nominal (unordered) data, assigning arbitrary numerical labels might mislead machine learning models into assuming an ordinal relationship where none exists. For example, if 'City A' is `0` and 'City B' is `1`, the model might incorrectly infer that 'City B' is 'greater' than 'City A'.

In [45]:
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
le = LabelEncoder()

# Apply Label Encoding to the 'Gender' column
df_encoded_le = df.copy() # Create a copy to avoid modifying the original df directly
df_encoded_le['Gender_Encoded'] = le.fit_transform(df_encoded_le['Gender'])

print("DataFrame after Label Encoding on 'Gender':")
display(df_encoded_le[['Gender', 'Gender_Encoded']].head())

print("\nClasses learned by the encoder:", le.classes_)

DataFrame after Label Encoding on 'Gender':


| | Gender | Gender_Encoded |
|---|---|---|
| 0 | Male | 1 |
| 1 | Female | 0 |
| 2 | Male | 1 |
| 3 | Female | 0 |
| 4 | Female | 0 |



Classes learned by the encoder: ['Female' 'Male']
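
Because `LabelEncoder` orders classes alphabetically, it will not respect a meaningful ranking on its own. For a feature with a natural order, one option is `OrdinalEncoder` with the category order spelled out. The snippet below is a minimal sketch: the small `edu` DataFrame is hypothetical, and the ordering Bachelors < Masters < PhD is an assumption made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample; the order Bachelors < Masters < PhD is an assumption
edu = pd.DataFrame({'Education': ['Bachelors', 'Masters', 'PhD', 'Bachelors']})

# Pass the categories in the intended order instead of relying on alphabetical sorting
ordinal_encoder = OrdinalEncoder(categories=[['Bachelors', 'Masters', 'PhD']])
edu['Education_Encoded'] = ordinal_encoder.fit_transform(edu[['Education']]).ravel()

print(edu)
```

The encoded column follows the order given in `categories`, so the integers carry the ranking you intended rather than an accident of spelling.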


### 3. One-Hot Encoding

**What it is:** One-Hot Encoding converts each category value into a new column and assigns a `1` to the column corresponding to the category and `0` to all other new columns. This avoids the ordinal relationship issue present in Label Encoding.

**When to use it:** It's ideal for nominal categorical data (where categories have no inherent order, like 'City', 'Color').

**Caveats:** It can lead to a significant increase in the number of features (curse of dimensionality) if there are many unique categories, potentially making the model training slower and more memory-intensive.

In [46]:
from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder
# handle_unknown='ignore' prevents errors when encountering new categories during transformation
# sparse_output=False ensures a dense array is returned instead of a sparse matrix
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Apply One-Hot Encoding to 'City', 'Education', and 'EverBenched' columns
# Fit and transform the selected columns
ohe_transformed = ohe.fit_transform(df[['City', 'Education', 'EverBenched']])

# Get feature names for the new columns
ohe_feature_names = ohe.get_feature_names_out(['City', 'Education', 'EverBenched'])

# Create a DataFrame from the transformed data
df_ohe = pd.DataFrame(ohe_transformed, columns=ohe_feature_names, index=df.index)

# Concatenate the one-hot encoded DataFrame with the original (excluding original categorical columns)
df_final_encoded = pd.concat([df.drop(columns=['City', 'Education', 'EverBenched']), df_ohe], axis=1)

print("DataFrame after One-Hot Encoding:")
display(df_final_encoded.head())

print("\nOriginal columns of One-Hot Encoded features:", ohe.feature_names_in_)
print("Generated One-Hot Encoded column names:", ohe_feature_names)

DataFrame after One-Hot Encoding:


| | EmployeeID | Gender | Salary | City_Chicago | City_Los Angeles | City_New York | City_San Francisco | Education_Bachelors | Education_Masters | Education_PhD | EverBenched_No | EverBenched_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 53630 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 2 | Female | 58094 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | 3 | Male | 108665 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 3 | 4 | Female | 103537 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 5 | Female | 106032 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |



Original columns of One-Hot Encoded features: ['City' 'Education' 'EverBenched']
Generated One-Hot Encoded column names: ['City_Chicago' 'City_Los Angeles' 'City_New York' 'City_San Francisco'
 'Education_Bachelors' 'Education_Masters' 'Education_PhD'
 'EverBenched_No' 'EverBenched_Yes']


The `df_final_encoded` DataFrame now holds the one-hot encoded versions of `City`, `Education`, and `EverBenched`. `Gender` is still in its original text form there; its label-encoded version lives in `df_encoded_le['Gender_Encoded']` and has to be merged in separately if you want both encodings in one DataFrame. Choose the encoding method based on the nature of each categorical feature and the requirements of your machine learning model.

# **2. Feature Scaling (Normalization)**

Learn:

Min-Max Scaling

Standardization (StandardScaler)

Apply on:

Age

ExperienceInCurrentDomain

### 1. Add 'Age' and 'ExperienceInCurrentDomain' to the DataFrame

Before we can apply scaling, we need to ensure the numerical features 'Age' and 'ExperienceInCurrentDomain' exist in our DataFrame. I'll add some random integer data for these columns.

In [47]:
import numpy as np

# Add 'Age' and 'ExperienceInCurrentDomain' columns to the DataFrame
# Assuming a plausible range for age and experience
df['Age'] = np.random.randint(22, 60, 10)
df['ExperienceInCurrentDomain'] = np.random.randint(0, 15, 10)

print("DataFrame with new 'Age' and 'ExperienceInCurrentDomain' columns:")
display(df.head())
print("\nData Types after adding new columns:")
df.info()  # info() prints its summary and returns None, so no display() is needed

DataFrame with new 'Age' and 'ExperienceInCurrentDomain' columns:

| | EmployeeID | Gender | City | Education | EverBenched | Salary | Age | ExperienceInCurrentDomain |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | New York | Bachelors | No | 53630 | 33 | 2 |
| 1 | 2 | Female | Los Angeles | Masters | Yes | 58094 | 39 | 10 |
| 2 | 3 | Male | Chicago | PhD | No | 108665 | 43 | 14 |
| 3 | 4 | Female | New York | Bachelors | No | 103537 | 38 | 6 |
| 4 | 5 | Female | San Francisco | Masters | Yes | 106032 | 27 | 14 |

Data Types after adding new columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   EmployeeID                 10 non-null     int64 
 1   Gender                     10 non-null     object
 2   City                       10 non-null     object
 3   Education                  10 non-null     object
 4   EverBenched                10 non-null     object
 5   Salary                     10 non-null     int64 
 6   Age                        10 non-null     int64 
 7   ExperienceInCurrentDomain  10 non-null     int64 
dtypes: int64(4), object(4)
memory usage: 772.0+ bytes

### 2. Min-Max Scaling

**What it is:** Min-Max Scaling (often called Normalization) transforms features by scaling each feature to a given range, typically between 0 and 1. The formula for Min-Max Scaling is:

`X_scaled = (X - X_min) / (X_max - X_min)`

**When to use it:**
*   When the data is not approximately Gaussian, or when you want values bounded to a specific range (e.g., 0 to 1).
*   With algorithms that work best when inputs lie in a small, fixed range, such as neural networks.

**Caveats:** It is sensitive to outliers, as outliers will shift the min/max values and thus the scaling range.

In [48]:
from sklearn.preprocessing import MinMaxScaler

# Initialize MinMaxScaler
min_max_scaler = MinMaxScaler()

# Create a copy of the DataFrame to store scaled values
df_scaled_minmax = df.copy()

# Apply Min-Max Scaling to 'Age' and 'ExperienceInCurrentDomain'
df_scaled_minmax[['Age_MinMaxScaled', 'ExperienceInCurrentDomain_MinMaxScaled']] = min_max_scaler.fit_transform(df[['Age', 'ExperienceInCurrentDomain']])

print("DataFrame after Min-Max Scaling on 'Age' and 'ExperienceInCurrentDomain':")
display(df_scaled_minmax[['Age', 'Age_MinMaxScaled', 'ExperienceInCurrentDomain', 'ExperienceInCurrentDomain_MinMaxScaled']].head())

DataFrame after Min-Max Scaling on 'Age' and 'ExperienceInCurrentDomain':


| | Age | Age_MinMaxScaled | ExperienceInCurrentDomain | ExperienceInCurrentDomain_MinMaxScaled |
|---|---|---|---|---|
| 0 | 33 | 0.206897 | 2 | 0.0 |
| 1 | 39 | 0.413793 | 10 | 0.666667 |
| 2 | 43 | 0.551724 | 14 | 1.0 |
| 3 | 38 | 0.37931 | 6 | 0.333333 |
| 4 | 27 | 0.0 | 14 | 1.0 |
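
The Min-Max formula can be checked by hand. The sketch below uses four hypothetical ages, chosen so that the minimum is 27 and the maximum is 56, and compares a manual calculation with `MinMaxScaler`. With that range, an age of 33 maps to (33 - 27) / (56 - 27) = 6/29 ≈ 0.2069.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages; MinMaxScaler expects a 2D array of shape (n_samples, n_features)
ages = np.array([[27.0], [33.0], [39.0], [56.0]])

# Apply X_scaled = (X - X_min) / (X_max - X_min) directly
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The scaler learns the same min and max during fit
scaled = MinMaxScaler().fit_transform(ages)

print(manual.ravel())
print("Manual and MinMaxScaler agree:", np.allclose(manual, scaled))
```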


### 3. Standardization (StandardScaler)

**What it is:** Standardization (often called Z-score normalization) transforms features to have a mean of 0 and a standard deviation of 1. The formula for standardization is:

`X_scaled = (X - μ) / σ`

where `μ` is the mean of the feature and `σ` is the standard deviation.

**When to use it:**
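*   When features are roughly Gaussian (normal) or have no hard bounds. Standardization does not require normality, but it behaves well on such data. Models such as regularized Linear Regression, Logistic Regression, and SVMs do not assume normally distributed features; they benefit from standardization because regularization penalties and gradient-based or margin-based optimization are sensitive to feature scale.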
*   When algorithms are sensitive to the scale of input features, such as those that use distance calculations (e.g., K-Nearest Neighbors, K-Means Clustering).

**Caveats:** Unlike Min-Max Scaling, standardization does not bound values to a specific range, which might be an issue for some algorithms that require inputs within a certain range.

In [49]:
from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
standard_scaler = StandardScaler()

# Create a copy of the DataFrame to store scaled values
df_scaled_standard = df.copy()

# Apply Standardization to 'Age' and 'ExperienceInCurrentDomain'
df_scaled_standard[['Age_StandardScaled', 'ExperienceInCurrentDomain_StandardScaled']] = standard_scaler.fit_transform(df[['Age', 'ExperienceInCurrentDomain']])

print("DataFrame after Standardization on 'Age' and 'ExperienceInCurrentDomain':")
display(df_scaled_standard[['Age', 'Age_StandardScaled', 'ExperienceInCurrentDomain', 'ExperienceInCurrentDomain_StandardScaled']].head())

DataFrame after Standardization on 'Age' and 'ExperienceInCurrentDomain':


| | Age | Age_StandardScaled | ExperienceInCurrentDomain | ExperienceInCurrentDomain_StandardScaled |
|---|---|---|---|---|
| 0 | 33 | -0.57802 | 2 | -1.896182 |
| 1 | 39 | 0.144505 | 10 | 0.0 |
| 2 | 43 | 0.626188 | 14 | 0.948091 |
| 3 | 38 | 0.024084 | 6 | -0.948091 |
| 4 | 27 | -1.300544 | 14 | 0.948091 |
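
The z-score formula can be verified the same way. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), which is NumPy's default, whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). The values below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical experience values in years, shaped (n_samples, n_features)
experience = np.array([[2.0], [6.0], [10.0], [14.0]])

mu = experience.mean()
sigma = experience.std()  # NumPy default ddof=0 (population std), matching StandardScaler

# Apply X_scaled = (X - mu) / sigma directly
manual = (experience - mu) / sigma
scaled = StandardScaler().fit_transform(experience)

print(manual.ravel())
print("Manual and StandardScaler agree:", np.allclose(manual, scaled))
```

After scaling, the column has mean 0 and standard deviation 1, but individual values are not confined to any fixed range.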


# **3. Create New Features (Feature Engineering)**

Practice:

Age groups (Young, Mid, Senior)

Experience buckets

JoiningYear → Years in company

### 1. Add 'JoiningYear' to the DataFrame

First, we need a 'JoiningYear' column to calculate 'Years in company'. I'll add a plausible range of joining years to the DataFrame.

In [50]:
# Add 'JoiningYear' column to the DataFrame
# Assuming employees joined between 2005 and 2023
df['JoiningYear'] = np.random.randint(2005, 2024, 10)

print("DataFrame with new 'JoiningYear' column:")
display(df.head())

DataFrame with new 'JoiningYear' column:


| | EmployeeID | Gender | City | Education | EverBenched | Salary | Age | ExperienceInCurrentDomain | JoiningYear |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | New York | Bachelors | No | 53630 | 33 | 2 | 2008 |
| 1 | 2 | Female | Los Angeles | Masters | Yes | 58094 | 39 | 10 | 2008 |
| 2 | 3 | Male | Chicago | PhD | No | 108665 | 43 | 14 | 2006 |
| 3 | 4 | Female | New York | Bachelors | No | 103537 | 38 | 6 | 2009 |
| 4 | 5 | Female | San Francisco | Masters | Yes | 106032 | 27 | 14 | 2011 |


### 2. Create Age Groups (Young, Mid, Senior)

We'll categorize the 'Age' column into meaningful groups like 'Young', 'Mid', and 'Senior'. This can help in understanding different age demographics within the employee base.

In [51]:
# Define age bins and labels
age_bins = [0, 30, 45, 65]
age_labels = ['Young', 'Mid', 'Senior']

# Create 'AgeGroup' column using pd.cut
df['AgeGroup'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False)

print("DataFrame with 'AgeGroup' feature:")
display(df[['Age', 'AgeGroup']].head())

DataFrame with 'AgeGroup' feature:


| | Age | AgeGroup |
|---|---|---|
| 0 | 33 | Mid |
| 1 | 39 | Mid |
| 2 | 43 | Mid |
| 3 | 38 | Mid |
| 4 | 27 | Young |
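
The `right=False` argument decides which group an age exactly on a boundary falls into. The sketch below reuses the same bins and labels on four hypothetical boundary ages and compares both settings.

```python
import pandas as pd

ages = pd.Series([29, 30, 44, 45])  # hypothetical ages sitting on or next to bin edges
age_bins = [0, 30, 45, 65]
age_labels = ['Young', 'Mid', 'Senior']

# right=False gives left-closed bins: [0, 30), [30, 45), [45, 65)
left_closed = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)

# right=True (the default) gives right-closed bins: (0, 30], (30, 45], (45, 65]
right_closed = pd.cut(ages, bins=age_bins, labels=age_labels, right=True)

print(pd.DataFrame({'Age': ages, 'right=False': left_closed, 'right=True': right_closed}))
```

With `right=False`, an employee aged exactly 30 is 'Mid' and one aged 45 is 'Senior'; with the default they would be 'Young' and 'Mid'. Values outside the outermost edges become `NaN` in either case.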


### 3. Create Experience Buckets

Similar to age groups, bucketing 'ExperienceInCurrentDomain' can help in analyzing employees based on their experience levels rather than exact years.

In [52]:
# Define experience bins and labels
experience_bins = [0, 3, 7, 15, np.inf] # np.inf for the last open-ended bin
experience_labels = ['Entry-Level', 'Junior', 'Mid-Level', 'Senior']

# Create 'ExperienceBucket' column using pd.cut
df['ExperienceBucket'] = pd.cut(df['ExperienceInCurrentDomain'], bins=experience_bins, labels=experience_labels, right=False)

print("DataFrame with 'ExperienceBucket' feature:")
display(df[['ExperienceInCurrentDomain', 'ExperienceBucket']].head())

DataFrame with 'ExperienceBucket' feature:


| | ExperienceInCurrentDomain | ExperienceBucket |
|---|---|---|
| 0 | 2 | Entry-Level |
| 1 | 10 | Mid-Level |
| 2 | 14 | Mid-Level |
| 3 | 6 | Junior |
| 4 | 14 | Mid-Level |


### 4. Calculate Years in Company

This feature represents how long an employee has been with the company, computed as the current calendar year minus 'JoiningYear'. Because the code reads the current year from the system clock, the values depend on when the notebook is run. Tenure can be a useful indicator of loyalty or career progression.

In [53]:
import datetime

# Get the current year
current_year = datetime.datetime.now().year

# Calculate 'YearsInCompany'
df['YearsInCompany'] = current_year - df['JoiningYear']

print("DataFrame with 'YearsInCompany' feature:")
display(df[['JoiningYear', 'YearsInCompany']].head())

DataFrame with 'YearsInCompany' feature:


| | JoiningYear | YearsInCompany |
|---|---|---|
| 0 | 2008 | 18 |
| 1 | 2008 | 18 |
| 2 | 2006 | 20 |
| 3 | 2009 | 17 |
| 4 | 2011 | 15 |
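
Since `datetime.datetime.now()` changes with the clock, rerunning the notebook in a later year shifts every tenure value. If reproducible numbers matter, one option is to pin the reference year. In the sketch below, `REFERENCE_YEAR` and the `joining` DataFrame are hypothetical names and values chosen for illustration.

```python
import pandas as pd

REFERENCE_YEAR = 2024  # hypothetical snapshot year, fixed so results do not drift over time

joining = pd.DataFrame({'JoiningYear': [2008, 2011, 2021]})
joining['YearsInCompany'] = REFERENCE_YEAR - joining['JoiningYear']

print(joining)
```

In a real dataset, the natural choice for the reference year is the year in which the data was collected.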


# **4. Split Data**

Learn and practice:

Train-test split (80-20 or 70-30)

### 1. Train-Test Split

**What it is:** Train-test splitting is a technique used to divide a dataset into two subsets: a training set and a testing set. The model learns from the training data, and its performance is evaluated on the testing data.

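**Why it's important:** Evaluating on held-out data estimates the model's generalization capability, i.e., how well it performs on new, unseen data. It does not prevent overfitting by itself, but it makes overfitting visible as a gap between training and testing performance.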

**Common Ratios:** Typical split ratios include 80% training / 20% testing or 70% training / 30% testing.

In [54]:
from sklearn.model_selection import train_test_split

# 'Salary' is the target (y) for this demonstration; the remaining numeric columns are the features (X).
# Start from df (which now includes Age, ExperienceInCurrentDomain, JoiningYear, and YearsInCompany)
# and attach 'Gender_Encoded' from df_encoded_le by merging on EmployeeID.
# In a real pipeline, you would split your final processed feature set instead.
df_combined_for_split = df.copy()
if 'Gender_Encoded' not in df_combined_for_split.columns:
    df_combined_for_split = pd.merge(df_combined_for_split, df_encoded_le[['EmployeeID', 'Gender_Encoded']], on='EmployeeID', how='left')

# Define features (X) and target (y)
# We will drop original categorical columns and 'EmployeeID' which is an identifier
features_to_drop = ['EmployeeID', 'Gender', 'City', 'Education', 'EverBenched', 'AgeGroup', 'ExperienceBucket']
X = df_combined_for_split.drop(columns=features_to_drop + ['Salary'])
y = df_combined_for_split['Salary']

# Display the features (X) and target (y) before splitting
print("Features (X) before split:")
display(X.head())
print("\nTarget (y) before split:")
display(y.head())

# Perform the train-test split
# test_size=0.20 reserves 20% of the rows for testing and 80% for training
# random_state makes the split reproducible
# stratify is omitted: it preserves class proportions for classification targets,
# and a continuous target such as Salary, where nearly every value is unique, cannot be stratified.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

print("\nShape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

print("\nX_train head:")
display(X_train.head())

print("\ny_train head:")
display(y_train.head())

Features (X) before split:


| | Age | ExperienceInCurrentDomain | JoiningYear | YearsInCompany | Gender_Encoded |
|---|---|---|---|---|---|
| 0 | 33 | 2 | 2008 | 18 | 1 |
| 1 | 39 | 10 | 2008 | 18 | 0 |
| 2 | 43 | 14 | 2006 | 20 | 1 |
| 3 | 38 | 6 | 2009 | 17 | 0 |
| 4 | 27 | 14 | 2011 | 15 | 0 |



Target (y) before split:


| | Salary |
|---|---|
| 0 | 53630 |
| 1 | 58094 |
| 2 | 108665 |
| 3 | 103537 |
| 4 | 106032 |



Shape of X_train: (8, 5)
Shape of X_test: (2, 5)
Shape of y_train: (8,)
Shape of y_test: (2,)

X_train head:


| | Age | ExperienceInCurrentDomain | JoiningYear | YearsInCompany | Gender_Encoded |
|---|---|---|---|---|---|
| 5 | 29 | 4 | 2021 | 5 | 1 |
| 0 | 33 | 2 | 2008 | 18 | 1 |
| 7 | 42 | 12 | 2013 | 13 | 0 |
| 2 | 43 | 14 | 2006 | 20 | 1 |
| 9 | 42 | 13 | 2014 | 12 | 0 |



y_train head:


| | Salary |
|---|---|
| 5 | 50673 |
| 0 | 53630 |
| 7 | 82978 |
| 2 | 108665 |
| 9 | 118663 |
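
The split above omits `stratify` because `Salary` is continuous. For a classification target, stratifying keeps the class balance the same in both subsets. The following sketch uses a hypothetical imbalanced binary target (80% class 0, 20% class 1); `X_demo`, `y_demo`, and the other names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 rows, of which 10 belong to the positive class
X_demo = pd.DataFrame({'feature': np.arange(50)})
y_demo = pd.Series([0] * 40 + [1] * 10, name='target')

# stratify=y_demo keeps the 80/20 class ratio in both the training and the testing set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```

Without `stratify`, a small test set can end up with very few or even no positive examples, which makes evaluation metrics unreliable.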


# **5. Target Variable Understanding**

Identify target column: LeaveOrNot

Separate X (features) and y (target)

### 1. Add 'LeaveOrNot' to the DataFrame

To demonstrate target variable understanding, we first need a 'LeaveOrNot' column. The cell below adds a randomly generated binary column (1 = left, 0 = stayed) to simulate whether an employee left the company.

In [55]:
# Add 'LeaveOrNot' column to the DataFrame (e.g., 0 for staying, 1 for leaving)
df['LeaveOrNot'] = np.random.randint(0, 2, 10) # 0 or 1 randomly

print("DataFrame with 'LeaveOrNot' column:")
display(df.head())

DataFrame with 'LeaveOrNot' column:


| | EmployeeID | Gender | City | Education | EverBenched | Salary | Age | ExperienceInCurrentDomain | JoiningYear | AgeGroup | ExperienceBucket | YearsInCompany | LeaveOrNot |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | New York | Bachelors | No | 53630 | 33 | 2 | 2008 | Mid | Entry-Level | 18 | 1 |
| 1 | 2 | Female | Los Angeles | Masters | Yes | 58094 | 39 | 10 | 2008 | Mid | Mid-Level | 18 | 1 |
| 2 | 3 | Male | Chicago | PhD | No | 108665 | 43 | 14 | 2006 | Mid | Mid-Level | 20 | 0 |
| 3 | 4 | Female | New York | Bachelors | No | 103537 | 38 | 6 | 2009 | Mid | Junior | 17 | 1 |
| 4 | 5 | Female | San Francisco | Masters | Yes | 106032 | 27 | 14 | 2011 | Young | Mid-Level | 15 | 1 |


### 2. Separate X (Features) and y (Target)

Now that we have our target column, we can explicitly separate the dataset into independent variables (features, typically denoted as X) and the dependent variable (target, typically denoted as y). This is a standard practice before training any machine learning model.

In [56]:
# 'LeaveOrNot' is the target variable (y).
# Build df_master, a single DataFrame holding every processed feature:
#   - df supplies Age, ExperienceInCurrentDomain, JoiningYear, YearsInCompany, and LeaveOrNot
#   - df_encoded_le supplies Gender_Encoded (merged on EmployeeID)
#   - df_ohe supplies the one-hot columns for City, Education, and EverBenched (same row index as df)
df_master = pd.merge(df.copy(), df_encoded_le[['EmployeeID', 'Gender_Encoded']], on='EmployeeID', how='left')
df_master = pd.concat([df_master, df_ohe], axis=1)

# Define the target variable
target_column = 'LeaveOrNot'
y = df_master[target_column]

# Define features (X) by dropping the target and other non-feature columns
# Drop original categorical columns if their one-hot encoded versions are present
# Drop EmployeeID and Salary if Salary is not the target.

columns_to_drop_from_X = [
    'EmployeeID', 'Gender', 'City', 'Education', 'EverBenched', 'Salary',
    'AgeGroup', 'ExperienceBucket', target_column
]

# Filter out columns that don't exist in df_master
columns_to_drop_from_X = [col for col in columns_to_drop_from_X if col in df_master.columns]

X = df_master.drop(columns=columns_to_drop_from_X)

print("Features (X) head:")
display(X.head())
print("\nTarget (y) head:")
display(y.head())

print("\nShape of X:", X.shape)
print("Shape of y:", y.shape)

Features (X) head:


| | Age | ExperienceInCurrentDomain | JoiningYear | YearsInCompany | Gender_Encoded | City_Chicago | City_Los Angeles | City_New York | City_San Francisco | Education_Bachelors | Education_Masters | Education_PhD | EverBenched_No | EverBenched_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | 2 | 2008 | 18 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 39 | 10 | 2008 | 18 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | 43 | 14 | 2006 | 20 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 3 | 38 | 6 | 2009 | 17 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 27 | 14 | 2011 | 15 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |



Target (y) head:


| | LeaveOrNot |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 0 |
| 3 | 1 |
| 4 | 1 |



Shape of X: (10, 14)
Shape of y: (10,)


# **6. Mini Preprocessing Pipeline**

Do steps in order:

Handle missing values

Encode categorical columns

Scale numerical columns

Split train/test

This section builds an end-to-end data preprocessing pipeline on the `df_master` DataFrame: handling missing values, encoding categorical columns, scaling numerical columns, and performing a train-test split. It begins by introducing the concept of a preprocessing pipeline, then adds artificial missing values to `df_master` in both a numerical and a categorical feature so there is something to impute.

## Introduce Preprocessing Pipeline



### What is a Data Preprocessing Pipeline?

A Data Preprocessing Pipeline is a sequence of steps applied to raw data to prepare it for machine learning model training. Instead of applying each preprocessing step individually, a pipeline combines them into a single, cohesive workflow. This approach ensures that all transformations (like handling missing values, encoding, and scaling) are applied consistently across both training and testing datasets.

### Benefits of Using a Preprocessing Pipeline:

1.  **Ensuring Consistency**: A pipeline guarantees that the same transformations, with the same learned parameters (e.g., mean, standard deviation for scaling, or categories for encoding), are applied to both the training data and any new, unseen data (like the test set or future deployment data). This prevents discrepancies that can lead to incorrect model predictions.
2.  **Preventing Data Leakage**: Data leakage occurs when information from the test set "leaks" into the training process. For example, if you scale your entire dataset (train + test) before splitting, the scaling parameters (min/max or mean/std) would be influenced by the test set. A pipeline ensures that transformations learn their parameters *only* from the training data and then apply those *learned* parameters to the test data, preventing this crucial error.
3.  **Simplifying Workflow and Reproducibility**: Pipelines streamline complex preprocessing tasks into a single, manageable object. This makes the code cleaner, easier to understand, and much simpler to reproduce. You can apply the entire sequence of steps with a single `.fit()` and `.transform()` call.
4.  **Easier Hyperparameter Tuning and Model Evaluation**: When using techniques like cross-validation, pipelines seamlessly integrate with grid search or randomized search. The preprocessing steps are applied within each fold, ensuring that each model is trained and evaluated on data that has been preprocessed correctly and independently for that fold.
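
To make the data leakage point concrete, the sketch below fits a `StandardScaler` on the training rows only and then reuses the learned mean and standard deviation on the test rows. The `ages` array is randomly generated, hypothetical data; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical ages for 20 employees, shaped (n_samples, n_features)
rng = np.random.default_rng(0)
ages = rng.integers(22, 60, size=(20, 1)).astype(float)

# Split first, so the test rows never influence the scaling parameters
ages_train, ages_test = train_test_split(ages, test_size=0.2, random_state=42)

scaler = StandardScaler()
ages_train_scaled = scaler.fit_transform(ages_train)  # learn mean/std from training rows only
ages_test_scaled = scaler.transform(ages_test)        # reuse those parameters on the test rows

print("Mean learned from training data:", scaler.mean_[0])
print("Mean of the full dataset:       ", ages.mean())
```

The two printed means generally differ. Calling `fit_transform` on the full dataset before splitting would bake test-set information into the scaler, which is exactly the leakage described above.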

### Typical Steps in a Preprocessing Pipeline:

A typical preprocessing pipeline for tabular data might involve:

*   **Handling Missing Values**: Imputing missing data using strategies like mean, median, mode, or a more sophisticated model.
*   **Encoding Categorical Data**: Converting categorical features into numerical formats (e.g., Label Encoding, One-Hot Encoding).
*   **Scaling Numerical Features**: Adjusting the range or distribution of numerical features (e.g., Min-Max Scaling, Standardization).
*   **Feature Engineering**: Creating new features from existing ones.
*   **Dimensionality Reduction**: Techniques like PCA to reduce the number of features.

## Add Missing Values for Demonstration

To demonstrate imputation, the cell below copies `df_master` to `df_pipeline` and sets roughly 20% of the values in `Age` (numerical) and `City` (categorical) to `NaN`. Both samples use `random_state=42`, so the same rows lose both values. Note that `df_master` already contains the one-hot columns derived from `City`, so those columns still reflect the original cities; in a real pipeline, imputation would run before encoding.



In [57]:
import numpy as np

# 1. Make a copy of the df_master DataFrame and name it df_pipeline
df_pipeline = df_master.copy()

# 2. Randomly select approximately 20% of the rows in the Age column and set their values to np.nan
missing_age_indices = df_pipeline.sample(frac=0.2, random_state=42).index
df_pipeline.loc[missing_age_indices, 'Age'] = np.nan

# 3. Randomly select approximately 20% of the rows in the City column and set their values to np.nan
missing_city_indices = df_pipeline.sample(frac=0.2, random_state=42).index
df_pipeline.loc[missing_city_indices, 'City'] = np.nan

print("DataFrame with introduced missing values:")
display(df_pipeline.head())

print("\nMissing values count for 'Age' and 'City':")
print(df_pipeline[['Age', 'City']].isnull().sum())

DataFrame with introduced missing values:


| | EmployeeID | Gender | City | Education | EverBenched | Salary | Age | ExperienceInCurrentDomain | JoiningYear | AgeGroup | ... | Gender_Encoded | City_Chicago | City_Los Angeles | City_New York | City_San Francisco | Education_Bachelors | Education_Masters | Education_PhD | EverBenched_No | EverBenched_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | New York | Bachelors | No | 53630 | 33.0 | 2 | 2008 | Mid | ... | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 2 | Female | NaN | Masters | Yes | 58094 | NaN | 10 | 2008 | Mid | ... | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | 3 | Male | Chicago | PhD | No | 108665 | 43.0 | 14 | 2006 | Mid | ... | 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 3 | 4 | Female | New York | Bachelors | No | 103537 | 38.0 | 6 | 2009 | Mid | ... | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 5 | Female | San Francisco | Masters | Yes | 106032 | 27.0 | 14 | 2011 | Young | ... | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |



Missing values count for 'Age' and 'City':
Age     2
City    2
dtype: int64


## 1. Handle Missing Values

Missing data is a common issue in real-world datasets and can significantly impact the performance and reliability of machine learning models. Handling missing values is a crucial step in data preprocessing.

### Common Strategies for Handling Missing Values:

1.  **Deletion**:
    *   **Row Deletion (Listwise Deletion)**: Remove entire rows that contain any missing values. This is simple but can lead to significant data loss, especially if many rows have missing values.
    *   **Column Deletion**: Remove columns that have a large number of missing values (e.g., more than a certain percentage). This is done if a column provides little information due to being mostly empty.

2.  **Imputation**: Replace missing values with substitute values. This is generally preferred over deletion as it preserves more data.
    *   **Mean/Median Imputation**: Replace missing numerical values with the mean or median of the non-missing values in that column. Mean is sensitive to outliers, while median is more robust.
    *   **Mode Imputation**: Replace missing categorical values (or numerical values if appropriate) with the most frequent value (mode) in that column.
    *   **Forward Fill / Backward Fill**: Replace missing values using the previous (forward fill) or next (backward fill) valid observation. This is often used for time-series data.
    *   **Advanced Imputation**: Use more sophisticated methods like K-Nearest Neighbors (KNN) Imputation, Regression Imputation, or even machine learning models to predict missing values based on other features.
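
Several of the strategies above are one-liners in pandas. The sketch below applies them to a tiny hypothetical DataFrame, `demo`, in which the age of 90 plays the role of an outlier so the difference between mean and median imputation is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing Age, one missing City, and an outlier (90)
demo = pd.DataFrame({
    'Age': [25.0, np.nan, 35.0, 90.0],
    'City': ['Chicago', 'Chicago', None, 'New York'],
})

dropped = demo.dropna()                                  # row deletion: keeps only complete rows
age_mean = demo['Age'].fillna(demo['Age'].mean())        # mean imputation, pulled up by the outlier
age_median = demo['Age'].fillna(demo['Age'].median())    # median imputation, more robust
city_mode = demo['City'].fillna(demo['City'].mode()[0])  # mode imputation for a categorical column
age_ffill = demo['Age'].ffill()                          # forward fill: carry the previous value forward

print("Rows left after dropna:", len(dropped))
print("Mean-imputed Age:  ", age_mean.tolist())
print("Median-imputed Age:", age_median.tolist())
print("Mode-imputed City: ", city_mode.tolist())
```

Here the mean fills the gap with 50.0 while the median fills it with 35.0, which shows how a single outlier can distort mean imputation.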

### Choosing an Imputation Strategy:

The choice of strategy depends on:
*   **Type of data**: Numerical or categorical.
*   **Nature of missingness**: Is it Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR)?
*   **Impact on distribution**: How does the imputation affect the variable's distribution?
*   **Model sensitivity**: Some models are more robust to missing values than others.

For our current task, we will use simple imputation strategies: **mean for numerical columns** and **mode for categorical columns**.

The cell below applies these strategies to the `df_pipeline` DataFrame: mean imputation for the numerical column `Age` and mode imputation for the categorical column `City`.



In [58]:
from sklearn.impute import SimpleImputer

# Create a copy
df_imputed = df_pipeline.copy()

# --- Impute Numerical ('Age') ---
numerical_imputer = SimpleImputer(strategy='mean')
# We use .ravel() to turn the 2D output back into a 1D format Pandas expects
df_imputed['Age'] = numerical_imputer.fit_transform(df_imputed[['Age']]).ravel()

# --- Impute Categorical ('City') ---
categorical_imputer = SimpleImputer(strategy='most_frequent')
# .ravel() is especially important here for categorical/object data
df_imputed['City'] = categorical_imputer.fit_transform(df_imputed[['City']]).ravel()

print("DataFrame after imputing missing 'Age' (mean) and 'City' (mode):")
display(df_imputed.head())

print("\nMissing values count after imputation:")
print(df_imputed[['Age', 'City']].isnull().sum())

DataFrame after imputing missing 'Age' (mean) and 'City' (mode):


Unnamed: 0,EmployeeID,Gender,City,Education,EverBenched,Salary,Age,ExperienceInCurrentDomain,JoiningYear,AgeGroup,...,Gender_Encoded,City_Chicago,City_Los Angeles,City_New York,City_San Francisco,Education_Bachelors,Education_Masters,Education_PhD,EverBenched_No,EverBenched_Yes
0,1,Male,New York,Bachelors,No,53630,33.0,2,2008,Mid,...,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
1,2,Female,Chicago,Masters,Yes,58094,38.75,10,2008,Mid,...,0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,3,Male,Chicago,PhD,No,108665,43.0,14,2006,Mid,...,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
3,4,Female,New York,Bachelors,No,103537,38.0,6,2009,Mid,...,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,5,Female,San Francisco,Masters,Yes,106032,27.0,14,2011,Young,...,0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0



Missing values count after imputation:
Age     0
City    0
dtype: int64


**Reasoning**:
The previous code failed because `SimpleImputer.fit_transform` returns a 2D array, which caused a `ValueError` when directly assigned to a single DataFrame column. To fix this, I will use `.ravel()` to flatten the output of `fit_transform` for the 'City' column before assignment.
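
To make the shape issue described above concrete, here is a minimal sketch on a made-up single-column frame (`cities` is a hypothetical name, not a variable from this notebook). It only shows the shapes before and after `.ravel()`; whether the un-flattened assignment raises an error can depend on the pandas version and the column dtype.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing city (object dtype, NaN as the gap)
cities = pd.DataFrame(
    {"City": pd.Series(["Chicago", "New York", np.nan, "Chicago"], dtype=object)}
)

imputer = SimpleImputer(strategy="most_frequent")

# Double brackets give a 2D input; the output is 2D as well
filled_2d = imputer.fit_transform(cities[["City"]])
# Flatten to 1D so it lines up with a single DataFrame column
filled_1d = filled_2d.ravel()

print(filled_2d.shape)
print(filled_1d.shape)

cities["City"] = filled_1d
print(cities["City"].tolist())
```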



In [59]:
from sklearn.impute import SimpleImputer

# Create a copy to work on, ensuring not to overwrite the df_pipeline with NaNs
df_imputed = df_pipeline.copy()

# --- Impute Numerical Missing Values (e.g., 'Age') with the Mean ---
# Initialize SimpleImputer for numerical columns with mean strategy
# We'll impute 'Age' as it's the numerical column with introduced NaNs
numerical_imputer = SimpleImputer(strategy='mean')

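# Fit and transform the 'Age' column. The double brackets pass a 2D (n, 1) frame, as SimpleImputer expects;
# .ravel() flattens the 2D result back to 1D for column assignment.
df_imputed['Age'] = numerical_imputer.fit_transform(df_imputed[['Age']]).ravel()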

# --- Impute Categorical Missing Values (e.g., 'City') with the Mode ---
# Initialize SimpleImputer for categorical columns with most_frequent strategy
# We'll impute 'City' as it's the categorical column with introduced NaNs
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform the 'City' column. Use .ravel() to flatten the output for correct assignment.
df_imputed['City'] = categorical_imputer.fit_transform(df_imputed[['City']]).ravel()

print("DataFrame after imputing missing 'Age' (mean) and 'City' (mode):")
display(df_imputed.head())

print("\nMissing values count after imputation:")
print(df_imputed[['Age', 'City']].isnull().sum())

DataFrame after imputing missing 'Age' (mean) and 'City' (mode):


Unnamed: 0,EmployeeID,Gender,City,Education,EverBenched,Salary,Age,ExperienceInCurrentDomain,JoiningYear,AgeGroup,...,Gender_Encoded,City_Chicago,City_Los Angeles,City_New York,City_San Francisco,Education_Bachelors,Education_Masters,Education_PhD,EverBenched_No,EverBenched_Yes
0,1,Male,New York,Bachelors,No,53630,33.0,2,2008,Mid,...,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
1,2,Female,Chicago,Masters,Yes,58094,38.75,10,2008,Mid,...,0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,3,Male,Chicago,PhD,No,108665,43.0,14,2006,Mid,...,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
3,4,Female,New York,Bachelors,No,103537,38.0,6,2009,Mid,...,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,5,Female,San Francisco,Masters,Yes,106032,27.0,14,2011,Young,...,0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0



Missing values count after imputation:
Age     0
City    0
dtype: int64


## 2. Encode Categorical Columns

Categorical data, which represents qualitative rather than quantitative information, needs to be converted into a numerical format before it can be used by most machine learning algorithms. This process is called categorical encoding.

### Common Categorical Encoding Strategies:

1.  **Label Encoding**:
    *   **What it is**: Assigns a unique integer to each category. scikit-learn's `LabelEncoder` sorts the categories first, so 'Blue', 'Green', 'Red' become 0, 1, 2 regardless of the order in which they appear.
    *   **When to use it**: Best suited for **ordinal** categorical data (where there's a natural order, e.g., 'Low', 'Medium', 'High'), provided the integer codes follow that order (an alphabetical mapping would put 'High' before 'Low'), or when the number of unique categories is small and the model is not sensitive to artificial ordering.
    *   **Caveats**: For **nominal** (unordered) data, Label Encoding can introduce an artificial ordinal relationship that a model might misinterpret, leading to poorer performance.

2.  **One-Hot Encoding**:
    *   **What it is**: Creates new binary columns for each category in the original feature. If a data point belongs to a category, the corresponding new column gets a `1`, and all other new columns for that feature get `0`.
    *   **When to use it**: Ideal for **nominal** categorical data (where categories have no inherent order, like 'City', 'Color', 'Gender'). It avoids the ordinal assumption issue of Label Encoding.
    *   **Caveats**: Can lead to a significant increase in the number of features (curse of dimensionality) if there are many unique categories, making the model training slower and more memory-intensive.

### Choosing an Encoding Strategy for our `df_imputed`:

*   **For `Gender` (Male/Female)**: Since there are only two categories and no inherent order, One-Hot Encoding is generally safer, but Label Encoding is often acceptable as it maps to 0 and 1, similar to a binary flag. Given `Gender_Encoded` already exists from previous steps (using Label Encoding), we will reuse that.
*   **For `City`, `Education`, `EverBenched`**: These are nominal categorical features with multiple categories. One-Hot Encoding is the most appropriate strategy to avoid implying an arbitrary order.

For the purpose of building our preprocessing pipeline, we will apply One-Hot Encoding to the `City`, `Education`, and `EverBenched` columns.

**Reasoning**:
The previous step handled missing values. Now, following the preprocessing pipeline, we need to encode the categorical columns. As outlined in the preceding markdown, we will One-Hot Encode 'City', 'Education', and 'EverBenched', and retain the already Label Encoded 'Gender_Encoded' column while dropping the original 'Gender' column and any pre-existing one-hot encoded columns to ensure a clean transformation.



In [60]:
from sklearn.preprocessing import OneHotEncoder

# Create a copy to work on, ensuring not to modify df_imputed directly
df_encoded_step = df_imputed.copy()

# Identify columns for One-Hot Encoding
categorical_cols_to_ohe = ['City', 'Education', 'EverBenched']

# Initialize OneHotEncoder
# handle_unknown='ignore' prevents errors for new categories during transformation
# sparse_output=False ensures a dense array is returned
ohe_pipeline = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Apply One-Hot Encoding to the selected categorical columns
ohe_transformed_data = ohe_pipeline.fit_transform(df_encoded_step[categorical_cols_to_ohe])

# Get the new feature names for the one-hot encoded columns
ohe_feature_names_pipeline = ohe_pipeline.get_feature_names_out(categorical_cols_to_ohe)

# Create a DataFrame from the one-hot encoded data
df_ohe_pipeline = pd.DataFrame(ohe_transformed_data, columns=ohe_feature_names_pipeline, index=df_encoded_step.index)

# Drop the original categorical columns that were just encoded
# Also drop the original 'Gender' column, as 'Gender_Encoded' is already present and will be kept
cols_to_drop_original = ['Gender'] + categorical_cols_to_ohe

# Additionally, drop any pre-existing one-hot encoded columns (from df_master's initial creation) to avoid redundancy
existing_ohe_from_master = [col for col in df_encoded_step.columns if col.startswith(('City_', 'Education_', 'EverBenched_'))]
cols_to_drop_all = cols_to_drop_original + existing_ohe_from_master

df_encoded_step = df_encoded_step.drop(columns=cols_to_drop_all)

# Concatenate the processed DataFrame with the new one-hot encoded DataFrame
df_final_encoded_pipeline = pd.concat([df_encoded_step, df_ohe_pipeline], axis=1)

print("DataFrame after Encoding Categorical Columns (Originals dropped, new OHE added, Gender_Encoded retained):")
display(df_final_encoded_pipeline.head())

print("\nShape of the final encoded DataFrame:", df_final_encoded_pipeline.shape)
print("\nColumns in the final encoded DataFrame:", df_final_encoded_pipeline.columns.tolist())

DataFrame after Encoding Categorical Columns (Originals dropped, new OHE added, Gender_Encoded retained):


Unnamed: 0,EmployeeID,Salary,Age,ExperienceInCurrentDomain,JoiningYear,AgeGroup,ExperienceBucket,YearsInCompany,LeaveOrNot,Gender_Encoded,City_Chicago,City_Los Angeles,City_New York,City_San Francisco,Education_Bachelors,Education_Masters,Education_PhD,EverBenched_No,EverBenched_Yes
0,1,53630,33.0,2,2008,Mid,Entry-Level,18,1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
1,2,58094,38.75,10,2008,Mid,Mid-Level,18,1,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,3,108665,43.0,14,2006,Mid,Mid-Level,20,0,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
3,4,103537,38.0,6,2009,Mid,Junior,17,1,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,5,106032,27.0,14,2011,Young,Mid-Level,15,1,0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0



Shape of the final encoded DataFrame: (10, 19)

Columns in the final encoded DataFrame: ['EmployeeID', 'Salary', 'Age', 'ExperienceInCurrentDomain', 'JoiningYear', 'AgeGroup', 'ExperienceBucket', 'YearsInCompany', 'LeaveOrNot', 'Gender_Encoded', 'City_Chicago', 'City_Los Angeles', 'City_New York', 'City_San Francisco', 'Education_Bachelors', 'Education_Masters', 'Education_PhD', 'EverBenched_No', 'EverBenched_Yes']


## 3. Scale Numerical Columns

Feature scaling is a method used to standardize or normalize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

### Why Feature Scaling?

Most machine learning algorithms that use distance calculations (like K-Nearest Neighbors, Support Vector Machines) or gradient descent (like Linear Regression, Logistic Regression, Neural Networks) are sensitive to the scale of input features. If features have vastly different scales, the feature with a larger range might dominate the loss function, making the model biased towards that feature and hindering the learning process.

### Common Feature Scaling Strategies:

1.  **Min-Max Scaling (Normalization)**:
    *   **What it is**: Transforms features by scaling each feature to a given range, typically between 0 and 1. The formula is `X_scaled = (X - X_min) / (X_max - X_min)`.
    *   **When to use it**: When you know that the distribution of your data is not Gaussian or when you want to bound values to a specific range (e.g., 0 to 1). It's often used with algorithms like neural networks.
    *   **Caveats**: It is sensitive to outliers, as outliers will shift the min/max values and thus the scaling range.

2.  **Standardization (Z-score Normalization)**:
    *   **What it is**: Transforms features to have a mean of 0 and a standard deviation of 1. The formula is `X_scaled = (X - μ) / σ`, where `μ` is the mean and `σ` is the standard deviation.
    *   **When to use it**: A common default when features are roughly Gaussian (normal), and for algorithms that are sensitive to the scale of input features, such as regularized Linear Regression, Logistic Regression, and SVMs. These models do not require normally distributed features, but they typically train more reliably on centered, unit-variance inputs.
    *   **Caveats**: Unlike Min-Max Scaling, standardization does not bound values to a specific range, which might be an issue for some algorithms that require inputs within a certain range.
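
As a quick check of the two formulas above, the following sketch scales a small made-up column containing one outlier and compares scikit-learn's output with the formulas applied by hand. The array `values` is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature with one outlier (500)
values = np.array([[10.0], [20.0], [30.0], [40.0], [500.0]])

minmax = MinMaxScaler().fit_transform(values).ravel()
standard = StandardScaler().fit_transform(values).ravel()

# The same transformations written out from the formulas
manual_minmax = ((values - values.min()) / (values.max() - values.min())).ravel()
# np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses
manual_standard = ((values - values.mean()) / values.std()).ravel()

print("min-max :", np.round(minmax, 3))
print("z-score :", np.round(standard, 3))
```

In the Min-Max output the outlier takes the value 1 and squeezes the other four points close to 0, which is the sensitivity mentioned in the caveat.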

### Choosing a Scaling Strategy for our `df_final_encoded_pipeline`:

For this exercise, we will apply **Standardization** to numerical features like 'Age', 'ExperienceInCurrentDomain', 'Salary', 'YearsInCompany', and 'JoiningYear', as it's a robust general-purpose scaling method suitable for a wide range of algorithms. We will create a separate `df_scaled_step` DataFrame to hold the scaled values.

**Reasoning**:
Following the explanation of feature scaling, I will now apply Standardization to the identified numerical columns in `df_final_encoded_pipeline` to prepare the data for modeling.



In [61]:
from sklearn.preprocessing import StandardScaler

# Create a copy to store scaled values
df_scaled_step = df_final_encoded_pipeline.copy()

# Identify numerical columns for scaling. Exclude 'EmployeeID' as it's an identifier.
numerical_cols_to_scale = ['Age', 'ExperienceInCurrentDomain', 'Salary', 'YearsInCompany', 'JoiningYear']

# Initialize StandardScaler
standard_scaler_pipeline = StandardScaler()

# Apply Standardization to the selected numerical columns
df_scaled_step[numerical_cols_to_scale] = standard_scaler_pipeline.fit_transform(df_scaled_step[numerical_cols_to_scale])

print("DataFrame after Standardization of numerical columns:")
display(df_scaled_step.head())

print("\nMean and Standard Deviation after scaling for a sample column ('Age'):")
print(f"Mean of Age_Scaled: {df_scaled_step['Age'].mean():.2f}")
print(f"Std Dev of Age_Scaled: {df_scaled_step['Age'].std():.2f}")

DataFrame after Standardization of numerical columns:


Unnamed: 0,EmployeeID,Salary,Age,ExperienceInCurrentDomain,JoiningYear,AgeGroup,ExperienceBucket,YearsInCompany,LeaveOrNot,Gender_Encoded,City_Chicago,City_Los Angeles,City_New York,City_San Francisco,Education_Bachelors,Education_Masters,Education_PhD,EverBenched_No,EverBenched_Yes
0,1,-1.411921,-0.740166,-1.896182,-0.843696,Mid,Entry-Level,0.843696,1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
1,2,-1.221338,0.0,0.0,-0.843696,Mid,Mid-Level,0.843696,1,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,3,0.937701,0.547079,0.948091,-1.339988,Mid,Mid-Level,1.339988,0,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
3,4,0.71877,-0.096543,-0.948091,-0.59555,Mid,Junior,0.59555,1,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,5,0.82529,-1.512513,0.948091,-0.099258,Young,Mid-Level,0.099258,1,0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0



Mean and Standard Deviation after scaling for a sample column ('Age'):
Mean of Age_Scaled: 0.00
Std Dev of Age_Scaled: 1.05
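
The standard deviation of 1.05 printed above is expected rather than a sign of a scaling problem: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to the sample standard deviation (`ddof=1`). With 10 rows the ratio is sqrt(10/9), roughly 1.054. The sketch below reproduces this on ten made-up ages (the values are assumptions, not the notebook's data).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Ten hypothetical ages, matching only the row count of the sample DataFrame
ages = pd.DataFrame(
    {"Age": [33.0, 38.75, 43.0, 38.0, 27.0, 31.0, 56.0, 42.0, 36.0, 42.75]}
)

scaled = pd.Series(StandardScaler().fit_transform(ages)[:, 0])

print(f"pandas default (ddof=1): {scaled.std():.3f}")
print(f"population     (ddof=0): {scaled.std(ddof=0):.3f}")
print(f"sqrt(n / (n - 1))      : {np.sqrt(10 / 9):.3f}")
```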


## 4. Split Data into Training and Testing Sets

Train-test splitting is a fundamental step in machine learning model development. It involves dividing the dataset into two distinct subsets: a training set and a testing set.

### What it is:
*   **Training Set**: This subset of the data is used to train the machine learning model. The model learns patterns and relationships from this data.
*   **Testing Set**: This subset is used to evaluate the performance of the trained model on unseen data. It helps in assessing how well the model generalizes to new, real-world examples.

### Why it's important:
*   **Detects Overfitting**: By evaluating the model on data it has never seen before, we can identify if the model has simply memorized the training data (overfitting) rather than learning generalizable patterns.
*   **Estimates Generalization Performance**: The performance metrics obtained from the test set provide an unbiased estimate of how the model will perform on new, future data.
*   **Model Validation**: It is a crucial step in validating the model's effectiveness and ensuring its reliability.

### Common Split Ratios:
Typical split ratios include:
*   **80% training / 20% testing**
*   **70% training / 30% testing**

The choice of ratio often depends on the size of the dataset and the specific problem. For smaller datasets, a larger training set (e.g., 80%) might be preferred to give the model more data to learn from, while for very large datasets, even a smaller test set percentage (e.g., 10%) can still contain enough examples for a reliable performance estimate.
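
A minimal sketch of an 80/20 split on synthetic data follows. The arrays `X_demo` and `y_demo` are made up for illustration; `stratify` is used so that both subsets keep the same class proportions as the full data, which matters most when classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 rows, 2 features, 80% class 0 and 20% class 1
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 40 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train size:", len(X_tr), "| share of class 1:", y_tr.mean())
print("test size :", len(X_te), "| share of class 1:", y_te.mean())
```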

**Reasoning**:
Following the explanation of train-test splitting, I will now implement the split using the `df_scaled_step` DataFrame, defining 'LeaveOrNot' as the target variable and the remaining relevant columns as features.



In [62]:
from sklearn.model_selection import train_test_split

# Create a copy to work on, ensuring not to modify df_scaled_step directly
df_for_split = df_scaled_step.copy()

# Define the target variable (y)
target_column = 'LeaveOrNot'
y_pipeline = df_for_split[target_column]

# Define features (X) by dropping the target and other non-feature columns
# 'EmployeeID' is an identifier and 'AgeGroup', 'ExperienceBucket' are categorical groupings
# that might be redundant or replaced by other features/encoding.
columns_to_drop_from_X_pipeline = [
    'EmployeeID', 'AgeGroup', 'ExperienceBucket', target_column
]

# Filter out columns that don't exist in df_for_split to avoid errors
columns_to_drop_from_X_pipeline = [col for col in columns_to_drop_from_X_pipeline if col in df_for_split.columns]

X_pipeline = df_for_split.drop(columns=columns_to_drop_from_X_pipeline)

print("Features (X) before split:")
display(X_pipeline.head())
print("\nTarget (y) before split:")
display(y_pipeline.head())

# Perform the train-test split
# Using test_size=0.2 (20% for testing, 80% for training) and random_state for reproducibility
X_train_pipeline, X_test_pipeline, y_train_pipeline, y_test_pipeline = train_test_split(
    X_pipeline, y_pipeline, test_size=0.2, random_state=42, stratify=y_pipeline
)

print("\nShape of X_train_pipeline:", X_train_pipeline.shape)
print("Shape of X_test_pipeline:", X_test_pipeline.shape)
print("Shape of y_train_pipeline:", y_train_pipeline.shape)
print("Shape of y_test_pipeline:", y_test_pipeline.shape)

print("\nX_train_pipeline head:")
display(X_train_pipeline.head())

print("\ny_train_pipeline head:")
display(y_train_pipeline.head())

Features (X) before split:


Unnamed: 0,Salary,Age,ExperienceInCurrentDomain,JoiningYear,YearsInCompany,Gender_Encoded,City_Chicago,City_Los Angeles,City_New York,City_San Francisco,Education_Bachelors,Education_Masters,Education_PhD,EverBenched_No,EverBenched_Yes
0,-1.411921,-0.740166,-1.896182,-0.843696,0.843696,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
1,-1.221338,0.0,0.0,-0.843696,0.843696,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,0.937701,0.547079,0.948091,-1.339988,1.339988,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
3,0.71877,-0.096543,-0.948091,-0.59555,0.59555,0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.82529,-1.512513,0.948091,-0.099258,0.099258,0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0



Target (y) before split:


Unnamed: 0,LeaveOrNot
0,1
1,1
2,0
3,1
4,1



Shape of X_train_pipeline: (8, 15)
Shape of X_test_pipeline: (2, 15)
Shape of y_train_pipeline: (8,)
Shape of y_test_pipeline: (2,)

X_train_pipeline head:


Unnamed: 0,Salary,Age,ExperienceInCurrentDomain,JoiningYear,YearsInCompany,Gender_Encoded,City_Chicago,City_Los Angeles,City_New York,City_San Francisco,Education_Bachelors,Education_Masters,Education_PhD,EverBenched_No,EverBenched_Yes
6,0.425639,2.220498,0.237023,0.397033,-0.397033,1,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,0.937701,0.547079,0.948091,-1.339988,1.339988,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
7,-0.15896,0.418355,0.474045,0.397033,-0.397033,0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
1,-1.221338,0.0,0.0,-0.843696,0.843696,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
0,-1.411921,-0.740166,-1.896182,-0.843696,0.843696,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0



y_train_pipeline head:


Unnamed: 0,LeaveOrNot
6,1
2,0
7,0
1,1
0,1


## Final Task

### Subtask:
Summarize the end-to-end preprocessing pipeline steps and their results.


## Summary:

### Data Analysis Key Findings

*   **Missing Value Introduction & Handling**: Approximately 20% of rows in the 'Age' (numerical) and 'City' (categorical) columns were intentionally set to missing values (NaN). These were successfully imputed using the mean strategy for 'Age' and the mode strategy for 'City', resulting in zero missing values in these columns post-imputation.
*   **Categorical Encoding**: The 'City', 'Education', and 'EverBenched' categorical columns were effectively transformed using One-Hot Encoding. This process also correctly removed the original categorical columns and any redundant pre-existing encoded columns, while retaining a label-encoded 'Gender\_Encoded' column. The final encoded DataFrame had 19 columns in the provided sample (shape `(10, 19)`).
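*   **Numerical Scaling**: Key numerical features, including 'Age', 'ExperienceInCurrentDomain', 'Salary', 'YearsInCompany', and 'JoiningYear', were standardized using `StandardScaler`. Post-scaling, the 'Age' column had a mean of approximately 0.00 and a standard deviation of approximately 1.05. The value is 1.05 rather than 1.00 because pandas `.std()` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` uses the population standard deviation (`ddof=0`); with 10 rows the ratio is sqrt(10/9) ≈ 1.05, so the standardization itself was successful.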
*   **Train-Test Split**: The fully preprocessed data was split into training and testing sets, with 80% allocated for training and 20% for testing. 'LeaveOrNot' was designated as the target variable, and non-feature columns like 'EmployeeID', 'AgeGroup', and 'ExperienceBucket' were excluded from the feature set. The sample data resulted in `X_train_pipeline` having a shape of (8, 15) and `X_test_pipeline` having a shape of (2, 15).

### Insights or Next Steps

*   The established preprocessing pipeline successfully transforms raw data by handling missing values, encoding categorical features, and scaling numerical features, preparing it for robust machine learning model training.
*   The next logical step is to train a machine learning model on the `X_train_pipeline` and `y_train_pipeline` data, and then evaluate its performance using `X_test_pipeline` and `y_test_pipeline`.


# **7. Write Notes**

In notes.md:

What is feature engineering?

Why preprocessing is important?

What confused you today?

**What is Feature Engineering?**

Ans - Feature Engineering is the process of creating new features or transforming existing ones from the raw data to improve the performance of machine learning models. Instead of directly using the raw data, we engineer features that are more informative and better capture the underlying patterns in the data. Examples from our session include creating 'Age Groups', 'Experience Buckets', and 'Years in Company' from 'Age', 'ExperienceInCurrentDomain', and 'JoiningYear' respectively.
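
A sketch of how such derived features could be built is shown below. The bin edges, the 'Senior' labels, and `REFERENCE_YEAR` are assumptions chosen for illustration; the actual cut-offs used earlier in the session may differ.

```python
import pandas as pd

# Hypothetical employees
emp = pd.DataFrame({
    "Age": [27, 33, 43, 58],
    "ExperienceInCurrentDomain": [1, 4, 9, 20],
    "JoiningYear": [2011, 2008, 2006, 2015],
})

REFERENCE_YEAR = 2026  # assumption: the year used to compute tenure

# Bin a continuous column into labeled groups (edges are assumptions)
emp["AgeGroup"] = pd.cut(
    emp["Age"], bins=[0, 30, 50, 120], labels=["Young", "Mid", "Senior"]
)
emp["ExperienceBucket"] = pd.cut(
    emp["ExperienceInCurrentDomain"],
    bins=[-1, 2, 6, 15, 100],
    labels=["Entry-Level", "Junior", "Mid-Level", "Senior"],
)

# Derive a new numeric feature from an existing one
emp["YearsInCompany"] = REFERENCE_YEAR - emp["JoiningYear"]

print(emp)
```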

**Why is Preprocessing Important?**

Ans - Data Preprocessing is a crucial step in preparing raw data for machine learning models. It involves cleaning, transforming, and organizing data to make it suitable for analysis. Its importance stems from several key benefits:

**Ensuring Consistency:**

It guarantees that the same transformations are applied consistently to both training and new data, preventing discrepancies.

**Preventing Data Leakage:**

It helps avoid situations where information from the test set inadvertently influences the training process.

**Improving Model Performance:**

Many machine learning algorithms perform better or even require data to be in a specific format or scale. Preprocessing handles issues like missing values, inconsistent data types, and varying scales of features, which can significantly impact a model's accuracy and stability.

**Simplifying Workflow and Reproducibility:**

Pipelines streamline complex preprocessing tasks, making code cleaner, easier to understand, and reproducible.

**What confused you today?**

Ans - The main sticking point was a technical detail: `SimpleImputer.fit_transform` returns a 2D array, which caused a `ValueError` when directly assigned to a single DataFrame column. This was resolved by using `.ravel()` to flatten the output before assignment. Working through details like this one step at a time helps keep the preprocessing code robust.