# Introduction to Data Preprocessing

- Real-world datasets are often incomplete, inconsistent, or messy.
- Data may contain:
  - Missing values (empty fields)
  - Outliers (extreme or unusual values)
  - Errors or inconsistencies
- **Data preprocessing** is the process of cleaning and transforming data to make it suitable for machine learning models.
- **Importance of preprocessing:**
  - Supports accurate and reliable models.
  - Reduces the risk of biased or misleading insights.
  - Improves the performance of statistical and ML algorithms.


# Missing Values

- **Definition:** Missing values occur when no data value is recorded for a feature in an observation.
- **Causes:**
  - Human or data entry errors
  - Sensor or measurement errors
  - Incomplete surveys or logs
- **Problems caused by missing values:**
  - Biased statistical calculations
  - Poor machine learning model performance
- **Example:** In a survey dataset, some participants may not provide their age or income.

# Techniques to Handle Missing Values

- **Deletion:**
  - Remove rows or columns containing missing values.
  - Simple but may lead to significant data loss.
- **Imputation:**
  - Fill missing values with statistical measures:
    - Mean or median for numerical columns
    - Mode for categorical columns
  - Advanced methods: KNN imputation, regression-based imputation.
- **Placeholder value:**
  - Replace missing categorical values with 'Unknown' or a special category.
- **Teaching tip:** Always evaluate the impact of missing data before choosing a technique.
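
The three options above can be compared side by side on a toy table. The following is a minimal sketch using pandas; the `toy` DataFrame, its column names, and its values are invented for illustration and are not part of the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numerical and one categorical column with gaps
toy = pd.DataFrame({
    "age": [25.0, np.nan, 30.0, 22.0],
    "city": ["Oslo", None, "Lima", "Oslo"],
})

# Deletion: drop every row that has at least one missing value
deleted = toy.dropna()

# Imputation: median for the numerical column, mode for the categorical one
imputed = toy.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# Placeholder: keep the row and mark the missing category explicitly
placeholder = toy.copy()
placeholder["city"] = placeholder["city"].fillna("Unknown")

print(len(toy), "rows before deletion,", len(deleted), "after")
print(imputed)
print(placeholder)
```

Deletion loses one of four rows here, while imputation and the placeholder keep every row at the cost of introducing values that were never observed.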

# Outliers

- **Definition:** Outliers are data points that are significantly different from other observations.
- **Causes:**
  - Measurement or data entry errors
  - Rare but valid natural events
- **Problems caused by outliers:**
  - Skew statistical measures like mean and standard deviation
  - Reduce accuracy of ML models that are sensitive to extreme values (e.g., linear regression through its squared-error loss, KNN through its distance calculations)
- **Example:** In a salary dataset, a few extremely high salaries may distort the average salary.

# Techniques to Detect and Handle Outliers

- **Detection methods:**
  - **IQR (Interquartile Range):** Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged as outliers, where IQR = Q3 - Q1.
  - **Z-score method:** Data points with |z| > 3 (more than three standard deviations from the mean) are flagged as outliers.
  - **Visualization:** Use boxplots, histograms, or scatterplots to visually identify outliers.
- **Handling methods:**
  - **Removal:** Delete rows containing outliers.
  - **Transformation:** Apply log or square root to reduce the impact of extreme values.
  - **Capping / Flooring:** Replace extreme values with the nearest acceptable value, such as the IQR bound or a chosen percentile.
- **Teaching tip:** Outliers are not always errors—sometimes they provide important insights.
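
The z-score rule and the log transformation listed above can be sketched as follows. The `salaries` series is hypothetical data made up for this illustration, and the choice of the population standard deviation (`ddof=0`) is an assumption; the sample standard deviation would work as well.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 19 typical salaries plus one extreme value
salaries = pd.Series([30000 + 500 * i for i in range(19)] + [1_000_000])

# Z-score: distance from the mean in units of standard deviation
z_scores = (salaries - salaries.mean()) / salaries.std(ddof=0)
z_outliers = salaries[z_scores.abs() > 3]
print("Flagged by z-score:", z_outliers.index.tolist())

# Log transformation: compresses the upper tail instead of deleting rows
log_salaries = np.log1p(salaries)
print("Raw range:", salaries.max() - salaries.min())
print("Log range:", round(log_salaries.max() - log_salaries.min(), 2))
```

One caveat on small samples: a single extreme value also inflates the standard deviation, so with only a handful of points no value may reach |z| > 3 (with seven points, for example, the largest attainable |z| is below 3). The IQR rule is less affected because the quartiles barely move when one value is extreme.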

# Summary

- Missing values and outliers are common in real-world datasets.
- Proper handling is crucial for:
  - Accurate statistics
  - Reliable machine learning models
  - Meaningful insights from data
- Key takeaways:
  - Missing values: deletion, imputation, or placeholder values
  - Outliers: detection using IQR or z-score, handling by removal, transformation, or capping
- Remember: Always analyze the dataset context before making preprocessing decisions.

# Missing Value and Outlier Handling

In this section, we will implement data preprocessing techniques on our dataset:

1. Load and explore the dataset.
2. Handle missing values.
3. Detect outliers in numerical columns.
4. Handle outliers by removal or capping.
5. Validate the cleaned dataset.

We will use **classes and functions** with proper docstrings to make the code modular and professional.


# Step 1: Loading and Exploring the Dataset

In this step, we will:
- Import necessary Python libraries.
- Load the dataset into a pandas DataFrame (here the Penguins CSV from the seaborn-data repository; the Lab 4 dataset or any other CSV file works the same way).
- Explore the dataset by checking its shape, column types, and basic statistics.
- Identify numerical and categorical columns, and check for missing values.

In [1]:
# Step 1: Import libraries and load dataset
import pandas as pd
import numpy as np

# Optional: visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Load Penguins dataset from URL
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(url)

# Basic exploration
print("Dataset shape:", df.shape)
print("\nColumn names and data types:\n", df.dtypes)
print("\nFirst five rows of the dataset:\n", df.head())
print("\nSummary statistics of numerical columns:\n", df.describe())

# Identify numerical and categorical columns
numerical_cols = df.select_dtypes(include='number').columns.tolist()
categorical_cols = df.select_dtypes(include='object').columns.tolist()
print("\nNumerical columns:", numerical_cols)
print("Categorical columns:", categorical_cols)

# Check for missing values
print("\nMissing values per column:\n", df.isnull().sum())

Dataset shape: (344, 7)

Column names and data types:
 species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object

First five rows of the dataset:
   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    MALE  
1       3800.0  FEMALE  
2       3250.0  FEMALE  
3          NaN     NaN  
4       3450.0  FEMALE  

Summary statistics of numerical columns:
        bill_length_mm  bill_depth_mm  flipper_length_mm 

# Step 2: Handling Missing Values

In this step, we will handle missing values in the dataset.

- Identify columns with missing data.
- For **numerical columns**, we will fill missing values using the **mean** of the column.
- For **categorical columns**, we will fill missing values using a **placeholder value** ('Unknown').
- We will implement this in a **class-based approach** with proper docstrings, so it can be reused for other datasets.
- After handling, we will verify that there are no missing values left.

In [2]:
class MissingValueHandler:
    """
    A class to handle missing values in a dataset.

    Methods:
    --------
    handle_numerical(df, numerical_cols)
        Fills missing numerical values with the column mean.

    handle_categorical(df, categorical_cols, placeholder='Unknown')
        Fills missing categorical values with a placeholder value.
    """

    def handle_numerical(self, df, numerical_cols):
        """
        Fill missing values in numerical columns with the column mean.

        Parameters:
        -----------
        df : pandas.DataFrame
            Dataset containing numerical columns.
        numerical_cols : list
            List of numerical column names.

        Returns:
        --------
        pandas.DataFrame
            Dataset with missing numerical values filled.
        """
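        df = df.copy()  # work on a copy so the caller's DataFrame is not modified
        for col in numerical_cols:
            mean_val = df[col].mean()  # NaN values are skipped by default
            df[col] = df[col].fillna(mean_val)
        return df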

    def handle_categorical(self, df, categorical_cols, placeholder='Unknown'):
        """
        Fill missing values in categorical columns with a placeholder.

        Parameters:
        -----------
        df : pandas.DataFrame
            Dataset containing categorical columns.
        categorical_cols : list
            List of categorical column names.
        placeholder : str
            Value to fill missing categorical data with (default 'Unknown').

        Returns:
        --------
        pandas.DataFrame
            Dataset with missing categorical values filled.
        """
        df = df.copy()  # work on a copy so the caller's DataFrame is not modified
        for col in categorical_cols:
            df[col] = df[col].fillna(placeholder)
        return df

In [3]:
# Create object and handle missing values
mv_handler = MissingValueHandler()

# Handle numerical columns
df = mv_handler.handle_numerical(df, numerical_cols)

# Handle categorical columns
df = mv_handler.handle_categorical(df, categorical_cols)

# Lab note on the 'MissingValueHandler' class:
# it preprocesses datasets by handling missing values. Numerical columns are
# filled with mean values, and categorical columns are filled with a placeholder,
# so the dataset has no missing values and is ready for further analysis.

# Verify missing values handled
print("\nMissing values after handling:\n", df.isnull().sum())


Missing values after handling:
 species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64


# Step 3: Detecting Outliers

In this step, we will identify outliers in the dataset.

- Outliers are data points significantly different from the rest of the data.
- Problems caused by outliers:
  - Skew statistical measures like mean and standard deviation
  - Reduce accuracy of machine learning models
- We will use the **Interquartile Range (IQR)** method:
  - Q1 = 25th percentile, Q3 = 75th percentile
  - IQR = Q3 - Q1
  - Any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is considered an outlier
- We will implement this in a **class-based approach** with proper docstrings.

In [4]:
class OutlierDetection:
    """
    A class to detect outliers in numerical columns of a dataset using the IQR method.

    Methods:
    --------
    detect_outliers(df, numerical_cols)
        Returns a dictionary containing outlier indices for each numerical column.
    """

    def detect_outliers(self, df, numerical_cols):
        """
        Detect outliers in numerical columns using the IQR method.

        Parameters:
        -----------
        df : pandas.DataFrame
            Dataset containing numerical columns.
        numerical_cols : list
            List of numerical column names.

        Returns:
        --------
        dict
            Dictionary where keys are column names and values are lists of outlier row indices.
        """
        outlier_indices = {}
        for col in numerical_cols:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)].index.tolist()
            outlier_indices[col] = outliers
        return outlier_indices

In [5]:
# Create object and detect outliers
outlier_detector = OutlierDetection()
outliers_dict = outlier_detector.detect_outliers(df, numerical_cols)

# Lab note on the 'OutlierDetection' class:
# it identifies outliers in numerical columns using the IQR method and returns
# the indices of rows containing outlier values for each column, so we can
# decide how to handle them in the next preprocessing step.

# Display outliers per numerical column
for col, indices in outliers_dict.items():
    print(f"Column '{col}' has {len(indices)} outlier(s) at indices: {indices}")

Column 'bill_length_mm' has 0 outlier(s) at indices: []
Column 'bill_depth_mm' has 0 outlier(s) at indices: []
Column 'flipper_length_mm' has 0 outlier(s) at indices: []
Column 'body_mass_g' has 0 outlier(s) at indices: []


# Step 4: Handling Outliers

In this step, we will preprocess the dataset to handle outliers.

- Outliers can distort statistical analysis and reduce machine learning model performance.
- Common handling techniques:
  - **Removal:** Delete rows containing outliers.
  - **Capping/Flooring:** Replace extreme values with nearest acceptable values.
- We will implement **outlier removal** using a class-based approach.
- After handling, we will verify that the dataset is cleaned of extreme values.

In [6]:
class OutlierHandler:
    """
    A class to handle outliers in numerical columns of a dataset.

    Methods:
    --------
    remove_outliers(df, numerical_cols)
        Removes rows containing outliers in any numerical column using the IQR method.
    """

    def remove_outliers(self, df, numerical_cols):
        """
        Remove rows containing outliers based on the IQR method.

        Parameters:
        -----------
        df : pandas.DataFrame
            Dataset containing numerical columns.
        numerical_cols : list
            List of numerical column names.

        Returns:
        --------
        pandas.DataFrame
            Dataset with rows containing outliers removed.
        """
        outlier_indices = set()
        for col in numerical_cols:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            indices = df[(df[col] < lower_bound) | (df[col] > upper_bound)].index
            outlier_indices.update(indices)
        df_clean = df.drop(index=outlier_indices)
        return df_clean

In [7]:
# Create object and remove outliers
outlier_handler = OutlierHandler()
df_clean = outlier_handler.remove_outliers(df, numerical_cols)

# Lab note on the 'OutlierHandler' class:
# it removes rows containing outliers in numerical columns using the IQR method,
# leaving the dataset cleaned and ready for further analysis or machine learning tasks.

# Verify dataset after outlier removal
print("Original dataset shape:", df.shape)
print("Dataset shape after removing outliers:", df_clean.shape)
print("\nFirst five rows after outlier handling:\n", df_clean.head())

Original dataset shape: (344, 7)
Dataset shape after removing outliers: (344, 7)

First five rows after outlier handling:
   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen        39.10000       18.70000         181.000000   
1  Adelie  Torgersen        39.50000       17.40000         186.000000   
2  Adelie  Torgersen        40.30000       18.00000         195.000000   
3  Adelie  Torgersen        43.92193       17.15117         200.915205   
4  Adelie  Torgersen        36.70000       19.30000         193.000000   

   body_mass_g      sex  
0  3750.000000     MALE  
1  3800.000000   FEMALE  
2  3250.000000   FEMALE  
3  4201.754386  Unknown  
4  3450.000000   FEMALE  
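
Step 4 lists capping as an alternative to removal but implements only removal. A minimal sketch of IQR-based capping follows; the function name `cap_outliers_iqr`, the parameter `k`, and the `salary_df` data are illustrative assumptions, not part of the lab code.

```python
import pandas as pd


def cap_outliers_iqr(df, numerical_cols, k=1.5):
    """Return a copy of df with values outside the IQR bounds clipped to the bounds."""
    capped = df.copy()
    for col in numerical_cols:
        q1 = capped[col].quantile(0.25)
        q3 = capped[col].quantile(0.75)
        iqr = q3 - q1
        # clip() replaces values below/above the bounds with the bounds themselves
        capped[col] = capped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return capped


# Hypothetical salary data with one very high and one very low value
salary_df = pd.DataFrame(
    {"Salary": [30000, 32000, 31000, 1000000, 33000, 35000, 100]}, dtype=float
)
capped_df = cap_outliers_iqr(salary_df, ["Salary"])
print(capped_df)
```

Unlike removal, capping keeps all seven rows; the two extreme salaries are replaced by the upper and lower IQR bounds, so the row count is preserved at the cost of altering the recorded values.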


# Step 5: Validation of Preprocessed Dataset

In this step, we will validate the dataset after handling missing values and outliers:

- Verify that there are **no missing values** left.
- Check that **outliers have been removed**.
- Inspect the **shape of the dataset**.
- Preview the **first few rows** to ensure data integrity.
- This ensures the dataset is ready for further analysis or machine learning tasks.
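
The first two checks in this list can also be written as assertions that fail loudly. The sketch below is one possible approach, not the lab's required method; `iqr_bounds`, `validate_cleaned`, and the `raw` data are hypothetical names and values. It compares the cleaned data against bounds computed on the data before removal, on the assumption that this is the intended reference, because recomputing quartiles after removal shifts the bounds.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) IQR bounds for a numerical Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def validate_cleaned(original, cleaned, numerical_cols):
    """Raise AssertionError if cleaned has NaNs or values outside original's bounds."""
    assert cleaned.isnull().sum().sum() == 0, "missing values remain"
    for col in numerical_cols:
        lower, upper = iqr_bounds(original[col])
        assert cleaned[col].between(lower, upper).all(), f"outliers remain in {col}"
    return True


# Hypothetical data: remove the outliers, then validate the result
raw = pd.DataFrame(
    {"Salary": [30000.0, 32000.0, 31000.0, 1000000.0, 33000.0, 35000.0, 100.0]}
)
low, high = iqr_bounds(raw["Salary"])
cleaned = raw[raw["Salary"].between(low, high)]
print(validate_cleaned(raw, cleaned, ["Salary"]))
```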

In [8]:
# Check for missing values
print("Missing values per column:\n", df_clean.isnull().sum())

# Check dataset shape
print("\nShape of the cleaned dataset:", df_clean.shape)

# Preview first five rows
print("\nFirst five rows of the cleaned dataset:\n", df_clean.head())

# Optional: basic statistics to ensure values are reasonable
print("\nSummary statistics of cleaned numerical columns:\n", df_clean[numerical_cols].describe())

Missing values per column:
 species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

Shape of the cleaned dataset: (344, 7)

First five rows of the cleaned dataset:
   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen        39.10000       18.70000         181.000000   
1  Adelie  Torgersen        39.50000       17.40000         186.000000   
2  Adelie  Torgersen        40.30000       18.00000         195.000000   
3  Adelie  Torgersen        43.92193       17.15117         200.915205   
4  Adelie  Torgersen        36.70000       19.30000         193.000000   

   body_mass_g      sex  
0  3750.000000     MALE  
1  3800.000000   FEMALE  
2  3250.000000   FEMALE  
3  4201.754386  Unknown  
4  3450.000000   FEMALE  

Summary statistics of cleaned numerical columns:
        bill_length_mm  bill_depth_mm  flipper_length_mm  body_ma

# Step 6: Feature Scaling

In this step, we will preprocess the dataset to make it ready for machine learning models:

- **Feature Scaling:**  
  - Ensures numerical features are on a similar scale.  
  - Common techniques:  
    - **Min-Max Scaling** (scales values between 0–1).  
    - **Standardization** (transforms data to mean=0 and std=1).  
    - **Robust Scaling** (uses median and IQR, less sensitive to outliers).  
    - **Max-Abs Scaling** (divides each feature by its maximum absolute value, so values fall between -1 and 1).  
  - Scaling is important for distance-based models (KNN, SVM), for gradient-based models such as neural networks, and for other models sensitive to feature magnitude.

- We will implement these scaling techniques in a **class-based approach** with proper docstrings.
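
To make the four techniques concrete, the sketch below writes out the arithmetic behind each one with NumPy on a small hypothetical array `x`. The formulas are meant to mirror the default settings of the corresponding scikit-learn scalers (population standard deviation for standardization, the 25th to 75th percentile range for robust scaling); treat that correspondence as an assumption to verify against the library documentation.

```python
import numpy as np

# Hypothetical feature values
x = np.array([150.0, 160.0, 170.0, 180.0])

# Min-Max: shift by the minimum, divide by the range
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the population std (ddof=0)
standard = (x - x.mean()) / x.std()

# Robust: subtract the median, divide by the interquartile range
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

# Max-Abs: divide by the largest absolute value
max_abs = x / np.abs(x).max()

for name, values in [("min-max", min_max), ("standard", standard),
                     ("robust", robust), ("max-abs", max_abs)]:
    print(f"{name:>8}: {np.round(values, 4)}")
```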

In [9]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler

class FeaturePreprocessor:
    """
    A class to perform feature scaling on a dataset.

    Methods:
    --------
    scale_features(df, numerical_cols, method='minmax')
        Scales numerical features using one of the following methods:
        - 'minmax': Min-Max Scaling
        - 'standard': Standardization (Z-score Scaling)
        - 'robust': Robust Scaling (median and IQR)
        - 'maxabs': Max-Abs Scaling
    """

    def scale_features(self, df, numerical_cols, method='minmax'):
        """
        Scale numerical columns using the specified method.

        Parameters:
        -----------
        df : pandas.DataFrame
            Dataset containing numerical columns.
        numerical_cols : list
            List of numerical column names.
        method : str
            Scaling method: 'minmax', 'standard', 'robust', or 'maxabs'.

        Returns:
        --------
        pandas.DataFrame
            Dataset with scaled numerical features.
        """
        if method == 'minmax':
            scaler = MinMaxScaler()
        elif method == 'standard':
            scaler = StandardScaler()
        elif method == 'robust':
            scaler = RobustScaler()
        elif method == 'maxabs':
            scaler = MaxAbsScaler()
        else:
            raise ValueError("Method must be 'minmax', 'standard', 'robust', or 'maxabs'")

        df_scaled = df.copy()
        df_scaled[numerical_cols] = scaler.fit_transform(df[numerical_cols])
        return df_scaled

In [21]:
# Create object and preprocess features
preprocessor = FeaturePreprocessor()

# Apply different scaling methods on numerical columns
df_minmax = preprocessor.scale_features(df_clean, numerical_cols, method='minmax')
df_standard = preprocessor.scale_features(df_clean, numerical_cols, method='standard')
df_robust = preprocessor.scale_features(df_clean, numerical_cols, method='robust')
df_maxabs = preprocessor.scale_features(df_clean, numerical_cols, method='maxabs')

# Lab note on the 'FeaturePreprocessor' class:
# it applies four types of scaling techniques (Min-Max, Standard, Robust, Max-Abs)
# on numerical features, transforming the data into a suitable range or
# distribution so it is ready for machine learning models.

# Verify scaled datasets
print("Min-Max Scaling (first 5 rows):\n", df_minmax[numerical_cols].head(), "\n")
print("Standard Scaling (first 5 rows):\n", df_standard[numerical_cols].head(), "\n")
print("Robust Scaling (first 5 rows):\n", df_robust[numerical_cols].head(), "\n")
print("Max-Abs Scaling (first 5 rows):\n", df_maxabs[numerical_cols].head(), "\n")

Min-Max Scaling (first 5 rows):
    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
0        0.254545       0.666667           0.152542     0.291667
1        0.269091       0.511905           0.237288     0.305556
2        0.298182       0.583333           0.389831     0.152778
3        0.429888       0.482282           0.490088     0.417154
4        0.167273       0.738095           0.355932     0.208333 

Standard Scaling (first 5 rows):
    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
0   -8.870812e-01   7.877425e-01          -1.422488    -0.565789
1   -8.134940e-01   1.265563e-01          -1.065352    -0.503168
2   -6.663195e-01   4.317192e-01          -0.422507    -1.192003
3   -1.307172e-15   1.806927e-15           0.000000     0.000000
4   -1.328605e+00   1.092905e+00          -0.565361    -0.941517 

Robust Scaling (first 5 rows):
    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
0       -0.558266       0.451613          -0.695652 

# Test Case 1: Fill Missing Values in Numerical Column

- Dataset contains a numerical column with missing values.
- Task: Use `MissingValueHandler` to fill missing values with the column mean.

In [23]:
# Sample dataset
df_test1 = pd.DataFrame({
    'Age': [25, np.nan, 30, 22, np.nan]
})

# Handle missing values
mv_handler_test = MissingValueHandler()
print(df_test1)
df_test2 = mv_handler_test.handle_numerical(df_test1, ['Age'])
print(df_test2)

    Age
0  25.0
1   NaN
2  30.0
3  22.0
4   NaN
         Age
0  25.000000
1  25.666667
2  30.000000
3  22.000000
4  25.666667


# Test Case 2: Fill Missing Values in Categorical Column

- Dataset contains a categorical column with missing values.
- Task: Use `MissingValueHandler` to fill missing values with 'Unknown'.

In [27]:
df_test3 = pd.DataFrame({
    'Species': ['Adelie', None, 'Chinstrap', None]
})

# Handle missing values
print(df_test3)
df_test4 = mv_handler_test.handle_categorical(df_test3, ['Species'])
print("\n\n")
print(df_test4)

     Species
0     Adelie
1       None
2  Chinstrap
3       None



     Species
0     Adelie
1    Unknown
2  Chinstrap
3    Unknown


# Test Case 3: Detect Outliers in Numerical Column

- Dataset contains a numerical column with extreme values.
- Task: Use `OutlierDetection` to identify outliers using IQR.

In [29]:
df_test5 = pd.DataFrame({
    'Salary': [30000, 32000, 31000, 1000000, 33000, 35000, 100]
})

outlier_detector_test = OutlierDetection()
outliers_test3 = outlier_detector_test.detect_outliers(df_test5, ['Salary'])
print(outliers_test3)

{'Salary': [3, 6]}
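
To see why indices 3 and 6 are flagged, the arithmetic can be reproduced by hand. The sketch below uses plain Python and a hypothetical helper, `quantile_linear`, which interpolates linearly between neighbouring sorted values (the default interpolation of pandas' `quantile`).

```python
def quantile_linear(sorted_values, q):
    """Quantile with linear interpolation between neighbouring sorted values."""
    position = q * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] + fraction * (sorted_values[high] - sorted_values[low])


salaries = [30000, 32000, 31000, 1000000, 33000, 35000, 100]
ordered = sorted(salaries)

q1 = quantile_linear(ordered, 0.25)  # halfway between 30000 and 31000
q3 = quantile_linear(ordered, 0.75)  # halfway between 33000 and 35000
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

flagged = [i for i, value in enumerate(salaries)
           if value < lower_bound or value > upper_bound]

print(q1, q3, iqr, lower_bound, upper_bound)  # 30500.0 34000.0 3500.0 25250.0 39250.0
print(flagged)                                # [3, 6]
```

Only 1000000 (index 3) lies above the upper bound of 39250, and only 100 (index 6) lies below the lower bound of 25250.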


# Test Case 4: Remove Outliers from Numerical Column

- Dataset contains outliers in a numerical column.
- Task: Use `OutlierHandler` to remove rows containing outliers.

In [14]:
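# Reuse the salary dataset from Test Case 3 (df_test5);
# df_test3 holds the 'Species' column and has no 'Salary' column.
outlier_handler_test = OutlierHandler()
df_test4_clean = outlier_handler_test.remove_outliers(df_test5, ['Salary'])
print(df_test4_clean)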

   Salary
0   30000
1   32000
2   31000
4   33000
5   35000


# Test Case 5: Scale Numerical Features

- Dataset contains numerical features with different ranges.
- Task: Use `FeaturePreprocessor` to scale numerical columns using Min-Max scaling.

In [31]:
df_test5 = pd.DataFrame({
    'Height': [150, 160, 170, 180],
    'Weight': [50, 60, 70, 90]
})

preprocessor_test = FeaturePreprocessor()
df_test5_scaled = preprocessor_test.scale_features(df_test5, ['Height', 'Weight'], method='minmax')
print(df_test5_scaled)

     Height  Weight
0  0.000000    0.00
1  0.333333    0.25
2  0.666667    0.50
3  1.000000    1.00


# Experiment 6 Summary and Future Classes

## Summary of Experiment 6: Data Preprocessing

In this experiment, we focused on preprocessing a dataset to make it ready for analysis or machine learning tasks. The main steps included:

1. **Loading and Exploring the Dataset**
   - Imported the dataset and examined its structure, column types, and missing values.
   - Identified numerical and categorical columns.

2. **Handling Missing Values**
   - Implemented a `MissingValueHandler` class to fill missing numerical values with the mean and categorical values with a placeholder.
   - Verified that the dataset had no missing values after preprocessing.

3. **Detecting Outliers**
   - Implemented an `OutlierDetection` class using the Interquartile Range (IQR) method.
   - Checked numerical columns for outliers that could skew the analysis; none were flagged in the Penguins dataset.

4. **Handling Outliers**
   - Implemented an `OutlierHandler` class to remove rows containing outliers.
   - No rows were dropped for the Penguins dataset (the shape stayed at 344 rows), since no values fell outside the IQR bounds.

5. **Validation**
   - Verified the dataset for missing values and outliers.
   - Checked the dataset shape and previewed first few rows to ensure data integrity.

6. **Feature Scaling**
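   - Implemented a `FeaturePreprocessor` class to scale numerical features with Min-Max, Standard, Robust, or Max-Abs scaling.
   - Produced scaled versions of the numerical columns; the categorical columns were left unchanged and would still need encoding before most models can use them.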

7. **Test Cases**
   - Practiced preprocessing tasks on small datasets to reinforce concepts.
   - Covered missing value handling, outlier detection/removal, and scaling.

---

## Future Classes / Extensions

- **Advanced Feature Engineering**
  - Creating new features from existing ones to improve model performance.
  - Handling date/time or text features.

- **Dimensionality Reduction**
  - Techniques like PCA or t-SNE to reduce feature space.

- **Encoding for High Cardinality Features**
  - Target encoding, frequency encoding for categorical features with many levels.

- **Automated Preprocessing Pipelines**
  - Using `scikit-learn` Pipelines to combine preprocessing steps and model training seamlessly.

- **Handling Imbalanced Data**
  - Techniques like SMOTE, oversampling, or undersampling.

These extensions will build on the preprocessing foundations learned in this experiment and prepare the dataset for robust machine learning workflows.
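
As a preview of the pipeline extension mentioned above, the sketch below chains mean imputation and standardization for numerical columns with a placeholder fill for a categorical column, using scikit-learn's `Pipeline`, `ColumnTransformer`, and `SimpleImputer`. The `toy` data and the step names are hypothetical; this is one possible arrangement, not a prescribed design for the future classes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with gaps in both column types
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, np.nan, 3450.0],
    "sex": pd.Series(["MALE", "FEMALE", np.nan, "FEMALE"], dtype=object),
})
numeric_cols = ["bill_length_mm", "body_mass_g"]
categorical_cols = ["sex"]

# Numerical columns: fill with the mean, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill with a placeholder value
categorical_steps = SimpleImputer(strategy="constant", fill_value="Unknown")

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

result = preprocess.fit_transform(toy)
print(result)
```

Bundling the steps this way means the same fitted transformations can later be applied to new data with `preprocess.transform(...)`.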