# Data Handling and Preprocessing 

## Introduction
The quality of your data directly impacts the accuracy of your analysis and model performance.
Raw data often contains inconsistencies, errors, and irrelevant information that can distort results and lead to flawed insights. Data preprocessing is a way to mitigate this problem.

**Data preprocessing** is a key aspect of data preparation. It refers to any processing applied to raw data to ready it for further analysis or processing tasks. 

Traditionally, data preprocessing has been an essential preliminary step in data analysis. More recently, the same techniques have been adapted to prepare data for training machine learning and AI models and for making inferences with them.

Thus, data preprocessing may be defined as the process of converting raw data into a format that can be processed more efficiently and accurately in tasks such as: 

- Data analysis
- Machine learning 
- Data science
- AI


## Purpose of Data Splitting

Data splitting is the process of dividing a dataset into separate subsets for training,
validation, and testing.

### Why is data splitting important?
- To evaluate model performance on unseen data
- To detect overfitting
- To tune hyperparameters correctly
- To simulate real-world deployment scenarios

Without a proper split, there is no way to tell whether a model has memorized the training data or learned patterns that generalize.

## Typical Steps in Data Preprocessing

The following sequence outlines a standard workflow:

1. Acquire the dataset
2. Import necessary libraries
3. Load the dataset
4. Explore and check for missing values
5. Encode non-numerical (categorical) data
6. Scale/normalize features
7. Split data into training, validation, and test sets 

### Understanding Data Before Preprocessing

Before performing any data handling or preprocessing steps, it is important to understand what the data represents. In machine learning, data is typically organized into **features** and a **target variable**.

**Features** are the input variables that describe the data. They contain information that the model uses to learn patterns. Examples include numerical values such as age, height, attendance percentage, or hours studied.

The **target variable** is the output we want the model to predict. Depending on the problem, it can represent a category (such as pass/fail) or a numerical value (such as a score or price).

Data can also be classified based on its type:
- **Numerical data** represents measurable quantities and can be continuous or discrete.
- **Categorical data** represents labels or categories and usually needs to be converted into numerical form before modeling.

Understanding the role and type of each column in the dataset helps in choosing the correct preprocessing techniques. This step ensures that the decisions made during data cleaning, encoding, and scaling are meaningful and appropriate for the problem being solved.
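
As a minimal sketch of this idea, the snippet below separates features from the target and groups columns by type. The small table and the variable names (`df_example`, `numeric_cols`, `categorical_cols`) are hypothetical and only mirror the student example used later in this module.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df_example = pd.DataFrame({
    "hours_studied": [2, 5, 8],
    "attendance": [65.0, 75.0, 88.0],
    "result": ["Fail", "Fail", "Pass"],
})

# Separate the input features from the target variable.
features = df_example.drop(columns="result")
target = df_example["result"]

# Group columns by data type: numerical vs. everything else (categorical here).
numeric_cols = df_example.select_dtypes(include="number").columns.tolist()
categorical_cols = df_example.select_dtypes(exclude="number").columns.tolist()

print("Feature columns:", features.columns.tolist())
print("Numerical columns:", numeric_cols)
print("Categorical columns:", categorical_cols)
```

Here the only categorical column is the target itself, which is why it is the one that gets encoded later.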


## Step 1: Acquiring the Dataset

The first step in any machine learning workflow is acquiring the dataset. In real-world projects, data is often collected from external sources such as databases, APIs, or public repositories. However, for learning and experimentation, creating a dataset manually is often a better starting point.

In this step, we create our own dataset instead of downloading one. This allows us to clearly understand what each feature represents and how the data is structured, without any hidden complexities. By designing the dataset ourselves, we have full control over the values and can easily relate them to real-world meaning.

The dataset represents a simple student performance scenario:
- `hours_studied` indicates the number of hours a student spent studying.
- `attendance` represents the student’s attendance percentage.
- `result` is the target variable that shows whether the student **passed** or **failed** the exam.



In [8]:
import pandas as pd

data = {
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "attendance": [60, 65, 70, 72, 75, 80, 85, 88, 90, 95],
    "result": ["Fail", "Fail", "Fail", "Fail", "Fail",
               "Pass", "Pass", "Pass", "Pass", "Pass"]
}

df = pd.DataFrame(data)

df.to_csv("student_data.csv", index=False)

df


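|   | hours_studied | attendance | result |
|---|---|---|---|
| 0 | 1 | 60 | Fail |
| 1 | 2 | 65 | Fail |
| 2 | 3 | 70 | Fail |
| 3 | 4 | 72 | Fail |
| 4 | 5 | 75 | Fail |
| 5 | 6 | 80 | Pass |
| 6 | 7 | 85 | Pass |
| 7 | 8 | 88 | Pass |
| 8 | 9 | 90 | Pass |
| 9 | 10 | 95 | Pass |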


## Step 2: Importing the Required Libraries

Once the dataset is ready, the next step is to import the necessary libraries that will help us work with the data. In Python, machine learning tasks rely heavily on a few powerful libraries that simplify data handling, preprocessing, and model development.

In this notebook, we use:
- **pandas** to handle and manipulate tabular data using DataFrames
- **numpy** to perform numerical operations efficiently
- **scikit-learn** to access tools for preprocessing and data splitting

Importing these libraries at the beginning ensures that all required functionality is available as we move through the workflow. It also keeps the notebook organized and avoids repeated imports later.

This step does not perform any computation on the data yet. It simply prepares the working environment so that we can load, explore, and preprocess the dataset smoothly in the following steps.


In [9]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split


## Step 3: Loading the Dataset

After setting up the working environment, the next step is to load the dataset into the notebook. Since we created and saved our dataset as a CSV file in the previous step, we now read that file into memory so that it can be explored and processed.

The dataset is loaded using the pandas library and stored in a DataFrame. A DataFrame organizes data in a tabular format with rows and columns, similar to a spreadsheet. This makes it easy to view, analyze, and manipulate the data.

Once the dataset is loaded, it is good practice to display the first few rows. This helps us verify that the data has been loaded correctly and gives us a quick overview of the features, their values, and the overall structure of the dataset.

At this stage, we are not modifying the data in any way. The goal is simply to make sure the dataset is available and ready for exploration and preprocessing in the next steps.


In [10]:
df = pd.read_csv("student_data.csv")
df.head()

|   | hours_studied | attendance | result |
|---|---|---|---|
| 0 | 1 | 60 | Fail |
| 1 | 2 | 65 | Fail |
| 2 | 3 | 70 | Fail |
| 3 | 4 | 72 | Fail |
| 4 | 5 | 75 | Fail |


## Step 4: Checking for Missing (Null) Values

Before moving forward with preprocessing, it is important to check whether the dataset contains any missing or null values. Missing values can occur due to data entry errors, incomplete records, or issues during data collection. If not handled properly, they can lead to incorrect results or errors during model training.

In this step, we examine each column of the dataset to identify whether any values are missing. This allows us to understand the completeness of the data and decide whether any cleaning action is required.

If **missing values are found**, common approaches include:
- Removing rows that contain missing values (when very few are missing)
- Replacing missing values with a representative value such as the mean, median, or most frequent value
- Using more advanced imputation techniques when necessary

If **no missing values are found**, no cleaning action is required for this step. In such cases, the dataset is already complete, and we can confidently move on to the next preprocessing stage.

For our manually created dataset, there are no missing values. This confirms that the data is complete, allowing us to proceed without performing any imputation or removal. Checking for missing values is still an essential step, even when the dataset appears simple, because it is the only way to be sure the data is complete.


In [11]:

df.info()

df.isnull().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   hours_studied  10 non-null     int64 
 1   attendance     10 non-null     int64 
 2   result         10 non-null     object
dtypes: int64(2), object(1)
memory usage: 372.0+ bytes


hours_studied    0
attendance       0
result           0
dtype: int64

In real-world datasets, missing or null values are very common. They may occur due to incomplete data collection, human error, or system issues. Before moving further in the preprocessing pipeline, it is important to check whether the dataset contains any missing values and decide how to handle them.


In [17]:
df.isnull().sum()


hours_studied    0
attendance       0
result           0
dtype: int64
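
As a minimal sketch of the detect-and-remove approach listed earlier, the snippet below introduces one missing value into a hypothetical copy of the data, counts the gaps per column, and drops the affected row. The frame `df_missing` and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the student data with one attendance value missing.
df_missing = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "attendance": [60, 65, 70, np.nan, 75],
    "result": ["Fail", "Fail", "Fail", "Fail", "Fail"],
})

# Count missing values in each column.
missing_counts = df_missing.isnull().sum()
print(missing_counts)

# Drop every row that contains at least one missing value.
df_dropped = df_missing.dropna()
print(len(df_missing), "rows before,", len(df_dropped), "rows after")
print("Remaining index labels:", df_dropped.index.tolist())
```

Two side effects are worth knowing: a numeric column that holds `NaN` is stored as `float64`, so whole numbers display as `60.0`, `65.0`, and so on, and `dropna()` keeps the original index labels, which leaves a gap where the removed row used to be.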

### Approaches to Handling Missing Data
In many real-world datasets, removing rows with missing values may lead to significant data loss, especially when missing values occur frequently or when the dataset is small. In such cases, alternative strategies are often preferred.

One common approach is **imputation**, where missing values are replaced with a representative value. For numerical data, this could be the mean or median of the column. For categorical data, the most frequent value or a separate category such as “Unknown” may be used.

Another important consideration is the importance of the feature. If a feature plays a critical role in the prediction task, preserving its data through imputation is often better than deleting records.

The choice of how to handle missing values depends on:
- The size of the dataset
- The percentage of missing values
- The importance of the affected feature
- The type of machine learning model being used

Understanding these options helps in making informed preprocessing decisions, even when a simple method like row removal is used for demonstration purposes.
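
The imputation strategies described above could look roughly like the following sketch. The `section` column, the frame `df_gaps`, and all values are hypothetical; the median is used for the numerical column and an explicit "Unknown" category for the categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in a numerical and one in a categorical column.
df_gaps = pd.DataFrame({
    "attendance": [60.0, 65.0, np.nan, 72.0, 95.0],
    "section": ["A", None, "A", "B", "A"],
})

df_imputed = df_gaps.copy()

# Numerical column: replace the gap with the median of the observed values.
attendance_median = df_gaps["attendance"].median()
df_imputed["attendance"] = df_imputed["attendance"].fillna(attendance_median)

# Categorical column: mark the gap with a separate "Unknown" category.
df_imputed["section"] = df_imputed["section"].fillna("Unknown")

print(df_imputed)
```

Swapping `median()` for `mean()`, or `"Unknown"` for `df_gaps["section"].mode()[0]`, gives the other strategies mentioned above. In a real project the fill values would be computed from the training data only.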


## Step 5: Encoding Categorical Data

Most machine learning models require numerical input. Since the `result` column contains text values ("Pass" and "Fail"), we convert them into numerical labels.

Label Encoding assigns an integer to each class in sorted order, so here:
- `Fail` → 0
- `Pass` → 1

This transformation makes the target variable suitable for model training.


In [18]:
encoder = LabelEncoder()
df["result"] = encoder.fit_transform(df["result"])
df


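|   | hours_studied | attendance | result |
|---|---|---|---|
| 0 | 1 | 60.0 | 0 |
| 1 | 2 | 65.0 | 0 |
| 2 | 3 | 70.0 | 0 |
| 4 | 5 | 75.0 | 0 |
| 5 | 6 | 80.0 | 1 |
| 6 | 7 | 85.0 | 1 |
| 7 | 8 | 88.0 | 1 |
| 8 | 9 | 90.0 | 1 |
| 9 | 10 | 95.0 | 1 |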
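
The mapping chosen by the encoder can be checked rather than assumed. The sketch below uses a separate, hypothetical encoder (`demo_encoder`) on a short list of labels to show how `classes_` and `inverse_transform` reveal and reverse the encoding.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, used only to inspect the mapping.
labels = ["Fail", "Pass", "Pass", "Fail"]

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order; each position is the code.
mapping = {str(label): code for code, label in enumerate(demo_encoder.classes_)}
print("Mapping:", mapping)

# inverse_transform converts the integer codes back to the original labels.
decoded = demo_encoder.inverse_transform(encoded)
print("Decoded:", decoded.tolist())
```

Keeping the fitted encoder around is useful later, because model predictions come back as integers and usually need to be translated into readable labels.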


## Step 6: Feature Scaling

Feature scaling ensures that numerical features have similar ranges. This is important because many machine learning algorithms are sensitive to the scale of input data.

We use Standardization, which transforms features so that:
- Mean ≈ 0
- Standard deviation ≈ 1

Only the input features are scaled; the target variable is not.


In [19]:
X = df.drop("result", axis=1)
y = df["result"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled


array([[-1.56524758, -1.65278774],
       [-1.22983739, -1.21007674],
       [-0.89442719, -0.76736574],
       [-0.2236068 , -0.32465473],
       [ 0.1118034 ,  0.11805627],
       [ 0.4472136 ,  0.56076727],
       [ 0.78262379,  0.82639387],
       [ 1.11803399,  1.00347827],
       [ 1.45344419,  1.44618927]])

### Why Feature Scaling is Important

Feature scaling is an essential preprocessing step for many machine learning algorithms. In real-world datasets, different features often exist on very different scales. For example, one feature might represent a small numerical range, while another might have much larger values.

If such features are used directly, algorithms that rely on distance calculations or gradient-based optimization may give more importance to features with larger numerical values. This can lead to biased learning and poor model performance.

Standardization, which was used in the previous step, transforms features so that they have a mean of zero and a standard deviation of one. This ensures that all features contribute more equally during training.

It is also important to note that not all models require feature scaling. Tree-based models such as decision trees and random forests are generally unaffected by feature scale. However, models like logistic regression, support vector machines, and neural networks benefit significantly from scaled data.

Understanding why feature scaling is applied helps in choosing the right preprocessing steps for different types of machine learning models.
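
To illustrate the point about tree-based models, the sketch below trains the same decision tree on raw and on standardized versions of a small, hypothetical feature matrix and compares the predictions. The data and variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: hours studied and attendance percentage.
X_raw = np.array(
    [[1, 60], [2, 65], [3, 70], [5, 75], [6, 80], [8, 88], [9, 90], [10, 95]],
    dtype=float,
)
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Standardize the same features.
X_std = StandardScaler().fit_transform(X_raw)

# Train one tree on each version, with the same random seed.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_raw, y_demo)
tree_std = DecisionTreeClassifier(random_state=0).fit(X_std, y_demo)

# Compare the predictions of the two trees on their respective inputs.
same_predictions = bool((tree_raw.predict(X_raw) == tree_std.predict(X_std)).all())
print("Identical predictions:", same_predictions)
```

Standardization preserves the ordering of values within each feature, and a tree only asks whether a value lies above or below a threshold, so the splits it finds are equivalent.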


### Min–Max Normalization and Z-score Standardization

Feature scaling can be performed using different techniques depending on the nature of the data and the machine learning algorithm being used. Two of the most commonly used scaling methods are **Min–Max Normalization** and **Z-score Standardization**.



#### Min–Max Normalization

Min–Max Normalization rescales the values of a feature to a fixed range, usually between **0 and 1**. This is done by subtracting the minimum value of the feature and dividing by the range (maximum minus minimum).

After Min–Max Normalization:
- The smallest value becomes 0
- The largest value becomes 1
- All other values fall between 0 and 1

This method is useful when the data does not contain extreme outliers and when the model benefits from bounded input values. It is commonly used in algorithms such as neural networks and distance-based models where relative distances between values matter.

However, Min–Max Normalization is sensitive to outliers. A single extreme value can significantly affect the scaling of the entire feature.
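
A minimal sketch of Min–Max Normalization and its sensitivity to outliers follows. The two small arrays are hypothetical; the second differs from the first only in its last value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature values without and with an extreme outlier.
values = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
values_outlier = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

# Manual formula: (x - min) / (max - min)
manual = (values - values.min()) / (values.max() - values.min())

# The same result using scikit-learn.
scaled = MinMaxScaler().fit_transform(values)
scaled_outlier = MinMaxScaler().fit_transform(values_outlier)

print("Scaled:", scaled.ravel())
print("Scaled with outlier:", scaled_outlier.ravel().round(4))
```

Without the outlier the values spread evenly over the range from 0 to 1. With it, the four ordinary values are squeezed into roughly the bottom 3% of the range.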


#### Z-score Standardization

Z-score Standardization, often called simply standardization, transforms the data so that it has:
- A mean of 0
- A standard deviation of 1

Instead of limiting values to a specific range, this method centers the data around zero and scales it based on how much each value deviates from the mean.

Z-score Standardization is widely used when the data follows a roughly normal distribution and is especially effective for algorithms that rely on gradient descent or assume standardized input, such as logistic regression, support vector machines, and linear models.

Unlike Min–Max Normalization, standardization is less affected by outliers, although extreme values still shift the mean and inflate the standard deviation. This generally makes it more stable for real-world datasets.
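
The z-score formula, z = (x − mean) / standard deviation, can be checked by hand against `StandardScaler`. The sketch below uses the nine `hours_studied` values from the encoded table in Step 5; the variable names are chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The nine hours_studied values from the encoded table in Step 5.
hours = np.array([[1.0], [2.0], [3.0], [5.0], [6.0], [7.0], [8.0], [9.0], [10.0]])

# Manual z-score. np.std uses the population formula (ddof=0) by default,
# which is also what StandardScaler uses.
manual_z = (hours - hours.mean()) / hours.std()

sk_z = StandardScaler().fit_transform(hours)

print("First value:", round(float(sk_z[0, 0]), 4))
print("Mean after scaling:", round(float(sk_z.mean()), 6))
print("Std after scaling:", round(float(sk_z.std()), 6))
```

Note that pandas' `Series.std()` defaults to the sample formula (`ddof=1`), so a manual calculation done with pandas will differ slightly unless `ddof=0` is passed.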


## Step 7: Splitting the Dataset

After completing all preprocessing steps, the dataset is split into separate subsets to ensure fair and reliable model evaluation. Splitting the data allows us to test how well a model performs on unseen data, rather than just memorizing the training examples.

In this step, the dataset is divided into three parts:

- **Training set (70%)**  
  This portion of the data is used to train the machine learning model. The model learns patterns, relationships, and trends from this data by adjusting its internal parameters.

- **Validation set (20%)**  
  The validation set is used during model development to tune hyperparameters and evaluate intermediate performance. It helps in detecting overfitting and guiding decisions such as model selection and parameter adjustment.

- **Test set (10%)**  
  The test set is kept completely separate and is used only once the model training and tuning are complete. It provides an unbiased estimate of the model’s final performance on unseen data.

This separation is important because it prevents information from the test data from influencing the training process. By evaluating the model on data it has never seen before, we obtain a more realistic measure of how the model will perform in real-world scenarios.

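Splitting the dataset into training, validation, and test sets is a standard best practice in machine learning and plays a crucial role in detecting overfitting and ensuring reliable results. In the code below, the split is done with two calls to `train_test_split`: the first holds out 30% of the samples, and the second sends one third of that held-out portion (10% overall) to the test set, leaving the remaining 20% for validation.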


In [20]:
X_train, X_temp, y_train, y_temp = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=1/3, random_state=42
)

print("Training samples:", len(X_train))
print("Validation samples:", len(X_val))
print("Test samples:", len(X_test))


Training samples: 6
Validation samples: 2
Test samples: 1
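
The proportions are easier to verify on a larger dataset. The sketch below repeats the same two-stage split on a hypothetical set of 100 samples; the array names are chosen so they do not overwrite the variables above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset of 100 samples, used only to make the percentages readable.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# Stage 1: hold out 30% of the samples.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Stage 2: one third of the held-out samples becomes the test set.
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=1/3, random_state=42
)

total = len(X_demo)
for name, part in [("train", X_tr), ("validation", X_va), ("test", X_te)]:
    print(f"{name}: {len(part)} samples ({len(part) / total:.0%})")
```

With only nine rows, as in the run above, the requested test fraction is rounded up to a whole number of samples, which is why the printed sizes are 6, 2, and 1 rather than exact 70/20/10 proportions.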


## Data Leakage

After splitting the dataset into training, validation, and test sets, it is important to understand the concept of **data leakage**. Data leakage occurs when information from outside the training set is accidentally used during model training, leading to overly optimistic performance results.

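One common source of data leakage is improper preprocessing order. For example, if a scaler or encoder is fitted on the entire dataset before splitting, statistics such as the mean and standard deviation are computed partly from validation and test rows, so the model indirectly gains information from data it should never have seen. This can cause the model to perform well during evaluation but fail in real-world scenarios. Note that the walkthrough above scales the full dataset in Step 6 and splits it in Step 7, which keeps the demonstration short but is exactly the ordering to avoid in practice.

To avoid data leakage, split the data first, then fit preprocessing steps such as scaling and encoding only on the training data and apply them to the validation and test data using the same fitted parameters.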

Following a consistent and well-ordered preprocessing pipeline helps ensure that model evaluation is fair and realistic. Awareness of data leakage is an important part of developing reliable machine learning systems and is a key best practice in real-world applications.
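
A leakage-free ordering might look like the following sketch: split first, fit the scaler on the training portion only, then reuse the fitted scaler for the other two sets. The synthetic data and the `_raw` and `_scaled` variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical unscaled features (hours studied, attendance %) and labels.
rng = np.random.default_rng(0)
X_raw = rng.uniform(low=[0, 50], high=[10, 100], size=(50, 2))
y_raw = (X_raw[:, 0] > 5).astype(int)

# 1. Split BEFORE any fitting takes place.
X_train_raw, X_temp_raw, y_train, y_temp = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=42
)
X_val_raw, X_test_raw, y_val, y_test = train_test_split(
    X_temp_raw, y_temp, test_size=1/3, random_state=42
)

# 2. Fit the scaler on the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)

# 3. Apply the same fitted parameters to validation and test data.
X_val_scaled = scaler.transform(X_val_raw)
X_test_scaled = scaler.transform(X_test_raw)

print("Scaler mean (from training data only):", scaler.mean_.round(2))
print("Validation mean after scaling:", X_val_scaled.mean(axis=0).round(2))
```

The validation and test sets will generally not have a mean of exactly zero after scaling. That is expected, because their statistics were never used to fit the scaler.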


### Task for the Reader

To reinforce your understanding of data handling and preprocessing, complete the following tasks using the steps demonstrated in this module.

1. Create a small dataset of your own with at least three features and one target variable. Example ideas include student performance, house prices, or product sales.

2. Load the dataset into a pandas DataFrame and inspect its structure using appropriate functions.

3. Introduce at least one missing value intentionally and:
   - Detect the missing value
   - Handle it using either removal or imputation

4. Identify any categorical feature in your dataset and convert it into numerical form using an appropriate encoding technique.

5. Apply feature scaling using either Min–Max Normalization or Z-score Standardization and observe how the feature values change.

6. Split the dataset into training, validation, and test sets using a suitable ratio.

7. Briefly explain why the order of preprocessing steps is important and how incorrect ordering can lead to data leakage.
