# **Blood Transfusion Service Center Dataset Preprocessing and Splitting**

In this tutorial, we will preprocess the **Blood Transfusion Service Center Dataset** from the UCI Machine Learning Repository. We will normalize the features, encode the target variable, and split the dataset into training, validation, and test sets.

## **Step 1: Import Necessary Libraries**

We start by importing the required libraries: `pandas` for data manipulation, `train_test_split` from `sklearn.model_selection` to split the dataset, `MinMaxScaler` for normalization, `LabelEncoder` for encoding categorical labels, and `fetch_ucirepo` from the `ucimlrepo` package to download the dataset.


In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from ucimlrepo import fetch_ucirepo


## **Step 2: Fetch the Dataset**

We fetch the Blood Transfusion Service Center Dataset with the `fetch_ucirepo` function, which downloads it from the UCI Machine Learning Repository. The argument `id=176` selects this dataset.

In [None]:
# Fetch dataset from UCI repository
blood_transfusion_service_center = fetch_ucirepo(id=176)


## **Step 3: Extract Features and Target**

We extract the data as pandas DataFrames: `X` contains the input features and `y` contains the target variable, which indicates whether a person donated blood.

In [None]:
# Extract data as pandas DataFrame
X = blood_transfusion_service_center.data.features
y = blood_transfusion_service_center.data.targets


## **Step 4: Display the First Few Rows**

We display the first few rows of the features and of the target variable to understand the dataset's structure.

In [None]:
# Display the first few rows of the dataset
print("First few rows of the dataset:\n", X.head())
print("Target variable (Class):\n", y.head())


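## **Step 5: Encode the Target Variable**

If the target variable `y` contains categorical values such as 'yes'/'no', we use `LabelEncoder` to convert them to the integers 0 and 1, which machine learning algorithms can process directly. Because `y` is a single-column DataFrame and `LabelEncoder` expects a one-dimensional array, we flatten it first.

In [None]:
# Encode the target variable (if not already binary numerically encoded)
label_encoder = LabelEncoder()
# Flatten the single-column target to 1-D before encoding
y_encoded = label_encoder.fit_transform(y.values.ravel())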
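To make the encoding step concrete, here is a minimal sketch on a made-up 'yes'/'no' column. The `toy_y` frame and its `donated` column name are hypothetical stand-ins, not the dataset's real target. It shows that `LabelEncoder` assigns integers in sorted label order and that `inverse_transform` recovers the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical single-column target standing in for a 'yes'/'no' label column
toy_y = pd.DataFrame({"donated": ["no", "yes", "no", "no", "yes"]})

toy_encoder = LabelEncoder()
# ravel() flattens the (n, 1) values into the 1-D array LabelEncoder expects
toy_encoded = toy_encoder.fit_transform(toy_y.values.ravel())

# Classes are stored in sorted order, so 'no' maps to 0 and 'yes' maps to 1
print("Classes:", list(toy_encoder.classes_))
print("Encoded:", list(toy_encoded))

# inverse_transform maps the integers back to the original labels
toy_decoded = toy_encoder.inverse_transform(toy_encoded)
print("Decoded:", list(toy_decoded))
```

If the target is already stored as 0/1 integers, the encoder leaves the values unchanged, so the step is harmless either way.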


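## **Step 6: Normalize the Features**

We apply Min-Max scaling to normalize the features. Each feature is rescaled to the range [0, 1], which helps algorithms that are sensitive to feature magnitudes, such as distance-based and gradient-based methods. Note that the scaler here is fitted on the full dataset before splitting, so the minimum and maximum it learns also reflect validation and test rows; for a strict evaluation, fit the scaler on the training set only.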

In [None]:
# Normalize the features using Min-Max Scaling
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
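As a quick check of what Min-Max scaling computes, the sketch below applies the per-column formula `(x - min) / (max - min)` by hand to a small made-up array (`toy_X` is illustrative data, not taken from the dataset) and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix: 3 rows, 2 columns on very different scales
toy_X = np.array([[2.0, 50.0],
                  [4.0, 10.0],
                  [10.0, 30.0]])

# Manual Min-Max scaling, computed column by column (axis=0)
col_min = toy_X.min(axis=0)
col_max = toy_X.max(axis=0)
manual_scaled = (toy_X - col_min) / (col_max - col_min)

# The same transformation with scikit-learn
sk_scaled = MinMaxScaler().fit_transform(toy_X)

print(manual_scaled)
print("Matches MinMaxScaler:", np.allclose(manual_scaled, sk_scaled))
```

Each column ends up with its smallest value at 0 and its largest at 1, regardless of the original units.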


## **Step 7: Convert Normalized Features Back to DataFrame**

`fit_transform` returns a NumPy array, which drops the column names. We convert the normalized features back to a DataFrame to restore the column names and make later manipulation easier.

In [None]:
# Convert the normalized features back to a DataFrame for easier manipulation
df_normalized = pd.DataFrame(X_normalized, columns=X.columns)
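When the original DataFrame has a non-default index, it can be worth passing the index along too, so that scaled rows can still be traced back to the original rows after shuffling. The sketch below illustrates this on a made-up frame; `toy_features` and its index labels are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up features with a non-default index (e.g. record identifiers)
toy_features = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]},
    index=[10, 20, 30],
)

toy_array = MinMaxScaler().fit_transform(toy_features)  # plain NumPy array

# Restore both the column names and the original index labels
toy_scaled = pd.DataFrame(
    toy_array, columns=toy_features.columns, index=toy_features.index
)
print(toy_scaled)
```

With a default 0..n-1 index, omitting `index=` gives the same result, so this only matters for frames with meaningful index labels.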


## **Step 8: Split the Dataset**

We split the normalized dataset into training (70%), validation (15%), and test (15%) sets in two stages. First, `test_size=0.3` holds out 30% of the rows as a temporary set. Then `test_size=0.5` divides that temporary set in half, giving validation and test sets of 15% each. Setting `random_state=42` makes both splits reproducible.

In [None]:
# Split the normalized dataset into training (70%), validation (15%), and test (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(df_normalized, y_encoded, test_size=0.3, random_state=42)

# Further split the temporary set into validation (15%) and test (15%) sets
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
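A possible variant of this step, shown here only as a sketch on synthetic data, combines two refinements: passing `stratify` to `train_test_split` so each split keeps roughly the same class ratio, and fitting `MinMaxScaler` on the training rows only so no validation or test statistics leak into the scaling. The names `X_raw`, `y_raw`, and the `f1`..`f4` columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 200 rows, 4 features, 25% positive class
rng = np.random.default_rng(0)
X_raw = pd.DataFrame(rng.uniform(0, 100, size=(200, 4)),
                     columns=["f1", "f2", "f3", "f4"])
y_raw = np.array([1] * 50 + [0] * 150)

# Stratified 70/30 split, then a stratified 50/50 split of the held-out part
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_raw, y_raw, test_size=0.3, stratify=y_raw, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# Fit the scaler on the training rows only, then reuse it for the other splits
train_scaler = MinMaxScaler().fit(X_tr)
X_tr_s = pd.DataFrame(train_scaler.transform(X_tr),
                      columns=X_raw.columns, index=X_tr.index)
X_va_s = pd.DataFrame(train_scaler.transform(X_va),
                      columns=X_raw.columns, index=X_va.index)
X_te_s = pd.DataFrame(train_scaler.transform(X_te),
                      columns=X_raw.columns, index=X_te.index)

print("Split sizes:", len(X_tr_s), len(X_va_s), len(X_te_s))
print("Positive share in training set:", y_tr.mean())
```

With this ordering, validation and test values can fall slightly outside [0, 1], which is expected because their extremes were not seen when the scaler was fitted.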


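## **Step 9: Print the Size of Each Split**

We print the shape of each split to verify the proportions. The row counts should be close to 70%, 15%, and 15% of the dataset, and every split should have the same number of columns.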

In [None]:
# Print the size of each split
print("\nTraining set size: ", X_train.shape)
print("Validation set size: ", X_val.shape)
print("Test set size: ", X_test.shape)
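Shapes alone leave the percentage arithmetic to the reader. A small helper can report each split's share of the total directly. The function `split_fractions` and the 1000-row placeholder frame below are hypothetical, written only to illustrate the check.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_fractions(**splits):
    """Return each named split's share of the total number of rows."""
    total = sum(len(part) for part in splits.values())
    return {name: len(part) / total for name, part in splits.items()}

# Placeholder data: 1000 rows with a single dummy feature
toy_frame = pd.DataFrame({"feature": range(1000)})

# Same two-stage 70/15/15 split as in the tutorial
toy_train, toy_temp = train_test_split(toy_frame, test_size=0.3, random_state=42)
toy_val, toy_test = train_test_split(toy_temp, test_size=0.5, random_state=42)

fractions = split_fractions(train=toy_train, val=toy_val, test=toy_test)
print(fractions)
```

On datasets whose row count is not a multiple of 20, the shares will be close to, but not exactly, 0.70, 0.15, and 0.15 because of rounding.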


## **Step 10: Optional Sample Verification**

As an optional step, we can print the first few rows of each split to confirm that the data has been divided correctly and that the values are normalized.

In [None]:
# Optional: Print a few samples from each split to verify the process
print("\nFirst few training samples:\n", X_train.head())
print("\nFirst few validation samples:\n", X_val.head())
print("\nFirst few test samples:\n", X_test.head())


## **Step 11: Optional Class Distribution Check**

Lastly, we can check the class distribution in the training set to confirm that the target contains only the encoded values 0 and 1 and to see how balanced the two classes are.

In [None]:
# Optional: Print the class distribution to ensure correct encoding
print("\nEncoded target distribution in training set:", pd.Series(y_train).value_counts())
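The same check can be extended to all three splits and expressed as proportions, which makes it easier to spot a split whose class ratio drifts from the others. The helper `class_shares` and the toy label arrays below are hypothetical, written only to illustrate the idea.

```python
import numpy as np
import pandas as pd

def class_shares(**label_sets):
    """Build a table of class proportions, one column per named split."""
    return pd.DataFrame({
        name: pd.Series(labels).value_counts(normalize=True).sort_index()
        for name, labels in label_sets.items()
    })

# Made-up encoded labels for three splits
toy_train_labels = np.array([0, 0, 0, 1])
toy_val_labels = np.array([0, 1])
toy_test_labels = np.array([0, 0, 1, 1])

shares = class_shares(train=toy_train_labels,
                      val=toy_val_labels,
                      test=toy_test_labels)
print(shares)
```

Rows are the class labels and columns are the splits. A class that is missing from a split shows up as `NaN`, which can be replaced with `shares.fillna(0)` if preferred.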
