## Introduction

Preprocessing is a critical step in preparing data for neural networks. Raw data often contains inconsistencies, missing values, and features on very different scales, all of which can hinder the performance of a machine learning model. Preprocessing makes the data clean, well-structured, and ready for training. It includes handling missing values, scaling numerical features, and encoding categorical variables. Careful preprocessing usually improves the model's accuracy and helps training converge faster.


# Preprocessing the Titanic Dataset for ANN


This notebook demonstrates various preprocessing techniques on the Titanic dataset. 
We will go through the following steps to prepare the data for an artificial neural network (ANN):
- Loading the dataset
- Handling missing values
- Scaling numerical features
- Encoding categorical variables
- Splitting the dataset into training and testing sets
- Converting data into PyTorch tensors
    

## Loading the Titanic Dataset


We will load the Titanic dataset using pandas. The dataset can be downloaded from Kaggle or other sources.                   
**Explanation:** This code imports the pandas library, which is essential for data manipulation and analysis in Python. We load the Titanic dataset using the read_csv function, specifying the path to the dataset file. The head() function displays the first five rows of the dataset, allowing us to preview the data and understand its structure, including the columns and types of information present.

In [2]:
# Import necessary libraries
import pandas as pd
# Load the Titanic dataset
data = pd.read_csv('titanic.csv')
# Display the first few rows
data.head()
    

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


## Handling Missing Values


Handling missing values is crucial for machine learning models. In this step, we fill missing `Age` values with the column mean and missing `Embarked` values with the most frequent category.
**Explanation:** In this section, we first check for missing values in the dataset using the isnull() function, which returns a boolean DataFrame indicating missing entries. The sum() function counts the number of missing values in each column. We then fill the Age column with its mean value and the Embarked column with its most frequent value (mode) using the fillna() method. As the output shows, this does not remove every gap: Fare still has one missing value and Cabin has 327. The remaining Fare gap is filled with 0 in the final tensor-conversion step, and Cabin is not used as a model input there.
    

In [3]:
# Checking for missing values
data.isnull().sum()

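# Fill missing 'Age' with the mean and missing 'Embarked' with the mode
# (assigning the result back avoids the chained-assignment pitfall of inplace=True)
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])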

# Check again for missing values
data.isnull().sum()
    

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
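
The same mean/mode strategy can be sketched on a small frame where every gap is filled, including `Fare`. The `toy` DataFrame and its values are made up for illustration and are not taken from the Titanic file; this is one possible way to close the remaining gaps, not the notebook's own cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic columns used above
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Fare": [7.25, 8.05, np.nan, 53.10],
    "Embarked": ["S", None, "S", "C"],
})

# Numeric columns: fill with the column mean (missing values are skipped when averaging)
for col in ["Age", "Fare"]:
    toy[col] = toy[col].fillna(toy[col].mean())

# Categorical column: fill with the most frequent value
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

remaining = int(toy.isnull().sum().sum())
print(toy)
print("Remaining missing values:", remaining)
```

Mean imputation is simple but sensitive to outliers; for a skewed column such as a fare, the median may be a reasonable alternative.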

## Scaling Numerical Features


Neural networks perform better when input features are on a similar scale. We will use StandardScaler to scale numerical features such as `Age` and `Fare`.
**Explanation:** This section uses StandardScaler from the sklearn.preprocessing module to standardize the numerical features Age and Fare. Scaling is important for neural networks, as it brings all features to a similar scale, which improves convergence during training. The fit_transform() method calculates the mean and standard deviation of each specified feature and rescales it to zero mean and unit variance; the shape of the distribution itself is unchanged, so a skewed feature stays skewed. The transformed values replace the original Age and Fare columns, and we display the first few rows of the scaled features to verify the changes.
    

In [4]:
from sklearn.preprocessing import StandardScaler
# Scaling 'Age' and 'Fare' features
scaler = StandardScaler()
data[['Age', 'Fare']] = scaler.fit_transform(data[['Age', 'Fare']])

# Display scaled features
data[['Age', 'Fare']].head()
    

| | Age | Fare |
|---|---|---|
| 0 | 0.334993 | -0.497811 |
| 1 | 1.32553 | -0.51266 |
| 2 | 2.514175 | -0.464532 |
| 3 | -0.25933 | -0.482888 |
| 4 | -0.655545 | -0.417971 |
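
The cell above fits the scaler on the full dataset. A common alternative, sketched here on synthetic numbers, is to fit on the training rows only and reuse those statistics for the test rows, so that no information from the test set leaks into preprocessing. The `features` array is random data standing in for Age and Fare, and the 80/20 slicing is an assumption made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features: two columns standing in for Age and Fare
rng = np.random.default_rng(0)
features = rng.normal(loc=[30.0, 35.0], scale=[14.0, 50.0], size=(100, 2))
train, test = features[:80], features[80:]

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean and std from the training rows only
test_scaled = scaler.transform(test)        # reuse the training statistics on the test rows

# StandardScaler computes z = (x - mean) / std using the population std (ddof=0)
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print("Matches manual formula:", np.allclose(train_scaled, manual))
print("Train mean:", train_scaled.mean(axis=0).round(6))
print("Train std:", train_scaled.std(axis=0).round(6))
```

The scaled test rows will generally not have exactly zero mean or unit variance, which is expected: they are measured against the training distribution.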


## Encoding Categorical Variables


Categorical variables must be converted into numerical format for neural networks. We will use the pandas get_dummies() function for this purpose.
**Explanation:** In this code block, we handle categorical variables by performing one-hot encoding on the Sex and Embarked columns. One-hot encoding replaces each categorical column with binary indicator columns, one per category, so that the model receives purely numeric input. The get_dummies() function from pandas creates these columns (e.g., 'male' and 'female' for Sex). The drop_first=True parameter drops the first category of each variable, which removes a redundant column and avoids perfect multicollinearity: with only Sex_male kept, a False value already implies female. We then display the first few rows of the encoded data to confirm the encoding process.

In [5]:
# One-hot encoding 'Sex' and 'Embarked' columns
encoded_data = pd.get_dummies(data, columns=['Sex', 'Embarked'], drop_first=True)
# Display encoded data
encoded_data.head()
    

| | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | 0.334993 | 0 | 0 | 330911 | -0.497811 | NaN | True | True | False |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | 1.32553 | 1 | 0 | 363272 | -0.51266 | NaN | False | False | True |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | 2.514175 | 0 | 0 | 240276 | -0.464532 | NaN | True | True | False |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | -0.25933 | 0 | 0 | 315154 | -0.482888 | NaN | True | False | True |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | -0.655545 | 1 | 1 | 3101298 | -0.417971 | NaN | False | False | True |


## Splitting the Dataset into Training and Testing Sets


We will split the dataset into training and testing sets to train and evaluate our model.                     
**Explanation:** This section splits the dataset into features (X) and the target variable (y). We drop the Survived column from the feature set and assign it to the target variable. The train_test_split() function from sklearn.model_selection is then used to randomly split the dataset into training and testing sets, with 80% of the data used for training and 20% for testing (as specified by test_size=0.2). The random_state=42 parameter fixes the random seed so that the same split is produced on every run. We check the shapes of the training and testing sets to verify the split. Note that X still contains non-numeric columns such as Name, Ticket, and Cabin, which would have to be dropped or encoded before being passed to a network.

In [6]:
from sklearn.model_selection import train_test_split

# Define feature set (X) and target (y)
X = encoded_data.drop('Survived', axis=1)
y = encoded_data['Survived']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of training and testing sets
X_train.shape, X_test.shape
    

((334, 12), (84, 12))
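
When the classes are imbalanced, a purely random split can leave the two sets with different class proportions. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both sets. The arrays `X_demo` and `y_demo` are synthetic and the 60/40 class balance is an assumption chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 2 features, 60 negatives and 40 positives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

# stratify=y_demo preserves the 60/40 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Shapes:", X_tr.shape, X_te.shape)
print("Positive rate (train, test):", y_tr.mean(), y_te.mean())
```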

## Converting Data to PyTorch Tensors


For training an ANN using PyTorch, we need to convert the dataset into tensors.                   
**Explanation:** In this final section, we convert the data into PyTorch tensors, which are necessary for training the artificial neural network. Note that this cell builds the tensors from five numeric columns of the full data DataFrame (Pclass, Age, SibSp, Parch, Fare) rather than from the train/test split above, and replaces any remaining NaN values with 0. We use the torch.tensor() function to create tensors from the NumPy arrays of the features (X) and labels (y). The dtype parameters set the features to 32-bit floats and the labels to 64-bit integers, the type PyTorch classification losses expect for class indices. A TensorDataset is then created, which pairs the features with the labels, and a DataLoader is instantiated to manage batches of data. Setting batch_size=32 feeds the model 32 samples at a time, and shuffle=True reshuffles the samples every epoch so that batches do not depend on the original row order. The loop takes the first batch from the data_loader and prints its shapes to confirm that the batching is done correctly.

In [15]:
import torch
from torch.utils.data import TensorDataset, DataLoader
# Select features and target variable, fill NaN values
X = data[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].fillna(0).values  # Fill NaNs with 0
y = data['Survived'].values  # Target variable
# Convert to PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)  # Continuous features
y_tensor = torch.tensor(y, dtype=torch.long)  # Categorical labels
# Create TensorDataset
dataset = TensorDataset(X_tensor, y_tensor)
# Create DataLoader for batching
data_loader = DataLoader(dataset, batch_size=32, shuffle=True)
# Iterate through the DataLoader
for batch_X, batch_y in data_loader:
    print("Batch X shape:", batch_X.shape)  # Check the shape of features
    print("Batch Y shape:", batch_y.shape)  # Check the shape of labels
    break  # Break after the first batch for inspection



Batch X shape: torch.Size([32, 5])
Batch Y shape: torch.Size([32])
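
The cell above uses only five numeric columns. To also feed one-hot encoded columns to the network, the boolean indicators need to be cast to floats first. The sketch below shows one way to do this on a synthetic frame; the `frame` DataFrame, its random values, and the `feature_cols` selection are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical frame mixing scaled numeric columns with a boolean indicator column
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "Age": rng.normal(size=64),
    "Fare": rng.normal(size=64),
    "Sex_male": rng.integers(0, 2, size=64).astype(bool),
    "Survived": rng.integers(0, 2, size=64),
})

feature_cols = ["Age", "Fare", "Sex_male"]

# to_numpy(dtype=...) casts booleans to 0.0/1.0 alongside the float columns
X_tensor = torch.tensor(frame[feature_cols].to_numpy(dtype="float32"))
y_tensor = torch.tensor(frame["Survived"].to_numpy(dtype="int64"))

loader = DataLoader(TensorDataset(X_tensor, y_tensor), batch_size=32, shuffle=True)
batch_X, batch_y = next(iter(loader))
print("Batch X shape:", batch_X.shape)
print("Batch Y shape:", batch_y.shape)
```

In practice the same conversion would be applied separately to the training and testing splits, with shuffling enabled only for the training loader.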


## Conclusion

In this tutorial, we walked through the preprocessing steps needed to prepare the Titanic dataset for an artificial neural network (ANN): handling missing values, scaling numerical features, encoding categorical variables, splitting the data into training and testing sets, and converting it into PyTorch tensors. Following these steps gives the network clean, consistently scaled, fully numeric input, which helps it train more reliably and achieve better results.
