# **Data Wrangling and Missing Value Handling**

## **Introduction**

### Why is Data Wrangling Important?
- **Real-world data** is often messy, containing missing values, inconsistencies, and irrelevant features.
- **Unprocessed data** can lead to inaccurate insights and unreliable models.
- **Effective data wrangling** transforms raw data into a structured, clean format suitable for analysis and machine learning.

### **Key Steps in Data Wrangling:**
1. **Data Cleaning** – Handling missing values, removing outliers, and standardizing formats.
2. **Data Transformation** – Encoding categorical variables, normalizing numerical data, and feature engineering.
3. **Data Integration** – Merging, reshaping, and aggregating data for better analysis.

## **Dataset Used for Demonstration**
We will use a **synthetic dataset** designed to teach **Data Wrangling** techniques by addressing common data issues. The dataset contains:

### **1. Numerical Features**
- **Age** – Contains missing values and outliers (e.g., unrealistic values).
- **Salary** – Has missing values that require imputation.
- **Work Experience** – Some missing entries, demonstrating handling techniques.
- **Job Satisfaction Score** – Skewed distribution, useful for transformations.
- **Customer Satisfaction Rating** – Ranges from 1 to 10, useful for normalization.

### **2. Categorical Features**
- **Name** – Contains duplicates and inconsistencies (e.g., different cases, extra spaces).
- **Department** – Includes typos and inconsistent categories.
- **Education Level** – Ordinal categorical variable requiring encoding.
- **Remote Work** – Binary categorical feature useful for one-hot encoding.
- **Performance Score** – Imbalanced target variable, demonstrating resampling techniques.

### **3. Date and Currency Fields**
- **Join Date** – Stored in mixed formats, demonstrating date parsing.
- **Bonus** – Contains currency symbols (`$`, `€`), requiring conversion to numeric values.

### **4. Key Data Wrangling Challenges Covered**
- **Handling missing values** in Salary, Work Experience, and Age using imputation techniques.
- **Dealing with outliers** in Age by capping, removing, or transforming values.
- **Fixing inconsistencies** in categorical data through text standardization and deduplication.
- **Encoding categorical variables** for machine learning compatibility.
- **Normalizing and transforming numerical features** to improve data distributions.
- **Addressing imbalanced target variables** through resampling techniques.

This dataset is structured to provide hands-on experience in **Data Wrangling**, helping students learn essential techniques for real-world data preprocessing.

Let's start by loading the dataset. 🚀


In [2]:

# Import required libraries
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from imblearn.over_sampling import SMOTE


# Load the dataset
dataset = pd.read_csv("./data/data_wrangling_dataset.csv")

# Display first few rows
dataset.head()


Unnamed: 0,ID,Name,Age,Salary,Join_Date,Department,Education_Level,Work_Experience,Performance_Score,Bonus,Remote_Work,Job_Satisfaction,Customer_Satisfaction
0,1,Hank,39,,2015-01-01,IT,Master's,,5,€$6566,1,3.668445,7
1,2,Eve,33,63198.0,2015-01-02,HR,High School,23.0,4,4969,1,0.423328,3
2,3,David,41,43065.0,2015-01-03,Finance,Bachelor's,16.0,4,6420,0,1.759563,6
3,4,David,50,65048.0,2015-01-04,IT,Bachelor's,20.0,3,1607,0,3.000364,6
4,5,Charlie,32,80992.0,2015-01-05,HR,High School,1.0,4,6674,0,2.425639,2



# **1. Data Cleaning**

## **1.1 Handling Missing Values**

### **Why do Missing Values Occur?**
- **Data collection errors** (e.g., sensor malfunctions, survey non-responses).
- **Human errors** (e.g., incorrect data entry).
- **Different data sources** (some sources may not have certain attributes).

### **Methods to Handle Missing Values:**

1. **Deletion (Dropping Missing Values):**
   - **When to use?** If only a small percentage of data is missing.
   - **Drawback:** Can result in loss of valuable data.

2. **Imputation (Filling Missing Values):**
   - **Mean/Median Imputation** (for numerical data) – the mean works well when the data is roughly symmetric; the median is more robust to skew and outliers.
   - **Mode Imputation** (for categorical data) – replaces missing values with the most frequent value.
   - **Forward/Backward Fill** (for time-series data) – fills missing values based on previous or next observations.
   - **KNN Imputation** – predicts missing values based on similar data points.
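
Forward and backward fill are the only methods in this list that depend on row order, so a tiny time-indexed example may help. This is a minimal sketch on a hypothetical series of daily readings (the `readings` data is invented for illustration and is not part of the demonstration dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps (not from the demonstration dataset)
readings = pd.Series(
    [10.0, np.nan, np.nan, 13.0, np.nan],
    index=pd.date_range("2015-01-01", periods=5, freq="D"),
)

forward_filled = readings.ffill()   # carry the last known value forward
backward_filled = readings.bfill()  # pull the next known value backward

print(pd.DataFrame({"raw": readings, "ffill": forward_filled, "bfill": backward_filled}))
```

Note that backward fill leaves a trailing gap unfilled (and forward fill a leading one), because there is no later (or earlier) observation to copy from.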


In [3]:
df = dataset.copy(deep=True)

In [4]:
# Check missing values
print("Missing values before handling:")
print(df.isnull().sum())

Missing values before handling:
ID                         0
Name                       0
Age                        0
Salary                   167
Join_Date                  0
Department                 0
Education_Level            0
Work_Experience          125
Performance_Score          0
Bonus                      0
Remote_Work                0
Job_Satisfaction           0
Customer_Satisfaction      0
dtype: int64


In [None]:
# Drop rows with missing values (not recommended if data loss is high)
df_dropped = df.dropna()

In [None]:
# Impute numerical values using mean and median
# (Salary and Work_Experience are the columns with missing values in this dataset)
imputer_mean = SimpleImputer(strategy='mean')
df['Salary'] = imputer_mean.fit_transform(df[['Salary']]).ravel()

imputer_median = SimpleImputer(strategy='median')
df['Work_Experience'] = imputer_median.fit_transform(df[['Work_Experience']]).ravel()

# Impute categorical values (Department) using the mode
imputer_mode = SimpleImputer(strategy='most_frequent')
df['Department'] = imputer_mode.fit_transform(df[['Department']]).ravel()

# Alternative: predictive imputation using KNN, shown on a fresh copy
# because the columns above no longer contain missing values
knn_cols = ['Age', 'Salary', 'Work_Experience']
df_knn = dataset.copy(deep=True)
knn_imputer = KNNImputer(n_neighbors=3)
df_knn[knn_cols] = knn_imputer.fit_transform(df_knn[knn_cols])


# Removing duplicates
df = df.drop_duplicates()

# Check missing values after handling
print("Missing values after handling:")
print(df.isnull().sum())

df.head()


In [None]:
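# Numerical columns: fill with the median
# (assign the result back; chained fillna(..., inplace=True) triggers warnings in newer pandas)
for col in ['Age', 'Salary', 'Work_Experience']:
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: fill with the mode
for col in ['Department', 'Education_Level', 'Remote_Work']:
    df[col] = df[col].fillna(df[col].mode()[0])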

df.head()


## **1.2 Handling Outliers**

### **What are Outliers?**
- **Outliers** are extreme values that differ significantly from other observations.
- They can be caused by **errors** or **natural variations** in the data.

### **Methods to Detect and Remove Outliers:**

1. **Z-Score Method**:
   - Measures how many standard deviations a data point is from the mean.
   - If the absolute Z-score is greater than 3, the value is considered an outlier.

2. **Interquartile Range (IQR) Method**:
   - Flags values below **Q1 − 1.5 × IQR** or above **Q3 + 1.5 × IQR**, where IQR = Q3 − Q1.
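
Removing rows is not the only option: the dataset overview also mentions capping. As a rough sketch, values outside the IQR fences can be clipped to the fence instead of dropped. The `ages` series below is hypothetical, and the 1.5 multiplier is the usual convention rather than a fixed rule:

```python
import pandas as pd

# Hypothetical ages; 150 is an unrealistic entry
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 41, 150])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences, then cap them instead of removing the rows
is_outlier = (ages < lower) | (ages > upper)
capped = ages.clip(lower=lower, upper=upper)

print(f"fences: [{lower}, {upper}], outliers flagged: {is_outlier.sum()}")
```

Capping keeps the row count unchanged, which matters when other columns in the same row still carry useful information.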



In [None]:
# Using Z-Score: keep rows whose Age is within 3 standard deviations of the mean
z_scores = np.abs(zscore(df['Age']))
df = df[z_scores < 3]

# Using IQR: keep rows whose Salary lies inside the 1.5 * IQR fences
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Salary'] >= Q1 - 1.5 * IQR) & (df['Salary'] <= Q3 + 1.5 * IQR)]

df.head()

## **1.3 Standardizing Data Formats**

### **Why Is Standardizing Data Formats Important?**

- Ensures Consistency: Data often comes in different formats, making uniform processing essential.

- Facilitates Comparisons: Uniform formats enable accurate analysis and computations.

- Reduces Errors: Inconsistent formats can lead to misinterpretations and processing issues.

- Improves Data Quality: Standardization simplifies data cleaning and validation.

### **Common Data Formats to Standardize**

1. **Date and Time Standardization:** Standardizing date formats ensures accurate sorting, filtering, and time-based analysis. Using a universally accepted format, such as ISO 8601, enhances consistency.

2. **Numeric Data Standardization:** Removing extra characters like currency symbols and ensuring a consistent decimal notation is essential for correct calculations.

3. **Categorical Data Standardization:** Variations in categorical values (e.g., different capitalizations or abbreviations) can cause inconsistencies. Standardizing these values improves reliability.

4. **Text and String Formatting:** Removing unnecessary spaces, special characters, and ensuring uniform capitalization enhances text processing.
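
Items 3 and 4 can be sketched with pandas string methods. The frame below is hypothetical, loosely modeled on the kind of issues described for the Name and Department columns, and the `canonical` mapping is an assumption that would need to be built from the variants actually present in the data:

```python
import pandas as pd

# Hypothetical messy values (not taken from the demonstration dataset)
staff = pd.DataFrame({
    "Name": ["  alice ", "ALICE", "Bob", "bob  "],
    "Department": ["IT", "it", "Finanace", "finance"],
})

# Trim whitespace and unify capitalization
staff["Name"] = staff["Name"].str.strip().str.title()
staff["Department"] = staff["Department"].str.strip().str.lower()

# Map known typos and variants to canonical labels (assumed mapping)
canonical = {"it": "IT", "finanace": "Finance", "finance": "Finance"}
staff["Department"] = staff["Department"].replace(canonical)

# Rows that differed only by formatting are now exact duplicates
staff = staff.drop_duplicates().reset_index(drop=True)
print(staff)
```

Standardizing before deduplicating matters here: `drop_duplicates()` compares values exactly, so `"ALICE"` and `"  alice "` would otherwise be treated as different people.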



In [None]:
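# Standardizing date format (entries that cannot be parsed become NaT)
# For genuinely mixed formats, pandas 2.0+ also accepts format='mixed'
df['Join_Date'] = pd.to_datetime(df['Join_Date'], errors='coerce')

# Converting the currency column to numeric by stripping the $ and € symbols
df['Bonus'] = df['Bonus'].astype(str).str.replace(r'[$€]', '', regex=True).astype(float)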

df.head()


# **2. Data Pre-processing**

## **2.1 Encoding Categorical Variables**

### **Why do we need Encoding?**
- Machine learning models require **numerical data**.
- Encoding converts categorical values into numbers.

### **Types of Encoding:**

1. **Label Encoding**:
   - Assigns **a unique integer** to each category.
   - **Best for** ordinal categorical data (e.g., Small, Medium, Large), provided the integers follow the category order.

2. **One-Hot Encoding**:
   - Creates separate **binary columns** for each category.
   - **Best for** nominal categorical data (e.g., Cities, Colors).
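
When the order of an ordinal variable matters, one option is to state that order explicitly rather than rely on an encoder's default ordering. The sketch below uses a hypothetical frame; the three education levels and their ranking in `education_order` are assumptions and should be adjusted to the levels actually present:

```python
import pandas as pd

# Hypothetical sample (column names mirror the dataset, values are illustrative)
people = pd.DataFrame({
    "Education_Level": ["Bachelor's", "High School", "Master's", "High School"],
    "Department": ["IT", "HR", "Finance", "IT"],
})

# Assumed ranking, lowest to highest
education_order = ["High School", "Bachelor's", "Master's"]
people["Education_Code"] = pd.Categorical(
    people["Education_Level"], categories=education_order, ordered=True
).codes

# One-hot encode the nominal column into 0/1 indicator columns
encoded = pd.get_dummies(people, columns=["Department"], dtype=int)
print(encoded)
```

Any value missing from `education_order` receives the code `-1`, which is a convenient way to spot typos or unexpected categories.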


In [None]:
# Label Encoding (note: LabelEncoder assigns integers in sorted label order,
# which may not match the true ranking of the education levels)
label_encoder = LabelEncoder()
df['Education_Level'] = label_encoder.fit_transform(df['Education_Level'])

# One-Hot Encoding for nominal columns
df = pd.get_dummies(df, columns=['Department', 'Remote_Work'], drop_first=True)

df.head()

## **2.2 Handling Imbalanced Datasets**

- **Why It Matters**: In imbalanced datasets, one class significantly outweighs others, leading to biased models that favor the dominant class.
- **Strategies for Handling Imbalance**:
  - **Resampling Methods**: Oversampling the minority class or undersampling the majority class can help balance the dataset.
  - **Synthetic Data Generation**: Techniques like SMOTE (Synthetic Minority Over-sampling Technique) generate synthetic examples to balance class distribution.
  - **Algorithmic Approaches**: Some machine learning models handle imbalance better by adjusting class weights.
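
The simplest resampling method, random oversampling, can be sketched with pandas alone. The `data` frame is hypothetical, and the fixed `random_state` is only there to make the sketch repeatable:

```python
import pandas as pd

# Hypothetical imbalanced data: six rows of class 0, two rows of class 1
data = pd.DataFrame({
    "feature": [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 3.0, 3.2],
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority_size = data["label"].value_counts().max()

parts = []
for _, group in data.groupby("label"):
    extra = majority_size - len(group)
    if extra > 0:
        # Duplicate randomly chosen minority rows until the class matches the majority
        group = pd.concat([group, group.sample(n=extra, replace=True, random_state=42)])
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
print(balanced["label"].value_counts())
```

Unlike SMOTE, this only repeats existing rows rather than synthesizing new ones. Either way, resampling is usually applied to the training split only, so that evaluation data keeps the original class distribution.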


In [None]:
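# Handling imbalanced datasets
# SMOTE needs numeric features without missing values, so keep only numeric/boolean
# columns and drop ID, which is an identifier rather than a feature
X = df.drop(columns=['ID', 'Performance_Score']).select_dtypes(include=['number', 'bool'])
y = df['Performance_Score']
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
df = pd.concat([X_resampled, y_resampled], axis=1)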

df.head()

## **2.3 Feature Selection**
- **Why It Matters**: Irrelevant or low-variance features can add noise and reduce model performance.
- **Methods for Feature Selection**:
  - **Variance Thresholding**: Removing features with very low variance helps eliminate those that contribute little to predictive power.
  - **Correlation Analysis**: Highly correlated features can be redundant and may be removed to simplify the model.
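  - **Model-Based Selection**: Feature importance scores from a trained model (e.g., a tree-based model) can be used to rank features and drop the least useful ones.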
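
Correlation analysis can be sketched as follows. The data is synthetic (`a_copy` is deliberately built as a near-duplicate of `a`), and the `cutoff` of 0.95 is an assumed value, not a recommendation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
features = pd.DataFrame({
    "a": x,
    "a_copy": 2 * x + rng.normal(scale=0.01, size=200),  # nearly redundant with "a"
    "b": rng.normal(size=200),
})

corr = features.corr().abs()
# Keep only the upper triangle so each pair of features is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

cutoff = 0.95  # assumed cutoff; choose per problem
to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
reduced = features.drop(columns=to_drop)

print("dropped:", to_drop)
```

Using the upper triangle means that, within each highly correlated pair, the column that appears later is dropped and the earlier one is kept.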

In [None]:
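# Feature selection: dropping low-variance numeric columns manually
# The threshold is the variance of a binary feature that takes one value 99% of the time
threshold = 0.01 * (1 - 0.01)
numeric_cols = df.select_dtypes(include='number').columns.drop('Performance_Score', errors='ignore')
low_variance_cols = [col for col in numeric_cols if df[col].var() < threshold]
df = df.drop(columns=low_variance_cols)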


# **3. Data Transformation**

## **3.1 Scaling & Normalization**

### **Why Scale Data?**
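- Prevents features with large numeric ranges from dominating those with small ranges, so that **all features contribute on a comparable scale**.
- Improves performance in distance- and variance-based algorithms like KNN, SVM, and PCA.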

### **Methods:**
1. **Min-Max Scaling**: Scales data to a range **[0,1]**.
2. **Standardization (Z-score)**: Centers data around **mean = 0, std = 1**.
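
Both methods reduce to one-line formulas, which a small NumPy sketch can make concrete. The `values` array is made up for illustration:

```python
import numpy as np

values = np.array([20.0, 30.0, 40.0, 50.0, 60.0])

# Min-Max scaling: (x - min) / (max - min)
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization: (x - mean) / std, with the population std (ddof=0),
# which is also what StandardScaler uses
standardized = (values - values.mean()) / values.std()

print(min_max)
print(standardized)
```

Min-Max scaling is sensitive to extreme values because a single outlier sets the minimum or maximum, so outlier handling typically comes first.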


In [None]:

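num_cols = ['Age', 'Salary', 'Work_Experience']

# Min-Max Scaling: rescales each column to [0, 1] (kept in a separate copy)
df_minmax = df.copy()
df_minmax[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# Standardization: an alternative to Min-Max scaling (mean 0, std 1)
# Pick one of the two; applying both to the same columns just overwrites the first
df[num_cols] = StandardScaler().fit_transform(df[num_cols])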

df.head()

In [None]:
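# Log transformation for skewed, non-negative data using numpy
# Salary is excluded: it was standardized above and may contain values <= -1,
# for which log1p is undefined (log-transform before scaling if both are needed)
for col in ['Bonus', 'Job_Satisfaction']:
    df[col] = np.log1p(df[col])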

df.head()

In [None]:
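# Feature engineering: creating new features
# Note: this ratio is only meaningful on unscaled values, so in practice
# compute it before the scaling step above
df['Experience_per_Year'] = df['Work_Experience'] / (df['Age'] + 1)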

In [None]:
df.to_csv("cleaned_synthetic_data.csv", index=False)