# Train a Neural Network with a Data Loader
In PyTorch, a **data loader** is a utility that helps in efficiently loading and iterating over data during the training or evaluation of a machine learning model. It is particularly useful when working with large datasets that cannot fit entirely into memory.

In this tutorial, you will learn how to create a data loader to load batches of data from a CSV file. This is essential when your CSV file is too large to be loaded into your computer's memory at once.
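
As a minimal sketch of reading a CSV file piece by piece, pandas can return an iterator of smaller DataFrames when `chunksize` is passed to `read_csv`. The in-memory CSV text and its three columns below are made up for illustration and stand in for a large file on disk.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = (
    "Glucose,BMI,Outcome\n"
    "148,33.6,1\n"
    "85,26.6,0\n"
    "183,23.3,1\n"
    "89,28.1,0\n"
    "137,43.1,1\n"
)

chunk_sizes = []
# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # [2, 2, 1]
```

Only one chunk is held in memory at a time, which is the same idea a data loader applies to batches.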

We will still use the Pima Indians Diabetes dataset, which can be found at https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database.

### Import libraries

In [1]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import TensorDataset, DataLoader

### Load the dataset

In [2]:
# Load the dataset
data = pd.read_csv("https://raw.githubusercontent.com/yangliuiuk/data/main/diabetes.csv")

# Display the first few rows of the dataset
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


### Preprocess the data

In [3]:
# Split the data into features (X) and target variable (y)
X = data.drop('Outcome', axis=1)
y = data['Outcome']

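# Use oversampling to address the class imbalance issue.
# Note: because resampling happens before the train/test split, duplicated rows can land in both sets,
# which can inflate the test accuracy. Resampling only the training split avoids this.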

from imblearn.over_sampling import RandomOverSampler

oversampler = RandomOverSampler(random_state=42)
X, y = oversampler.fit_resample(X, y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# StandardScaler returns NumPy arrays, so X_train and X_test are no longer pandas DataFrames.
# We convert them back to DataFrames so that X and y (a pandas Series) can be converted to PyTorch tensors with the same syntax.

X_train = pd.DataFrame(X_train) 
X_test = pd.DataFrame(X_test) 
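
The standardization step can be checked on a toy example. This is a small sketch with made-up numbers (the `_demo` names are hypothetical): it shows that `StandardScaler` returns a NumPy array and that the test data is scaled with the mean and standard deviation learned from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: three training rows and one test row with two features.
X_train_demo = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test_demo = np.array([[3.0, 10.0]])

scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean/std, then scales
X_test_scaled = scaler_demo.transform(X_test_demo)        # reuses the training mean/std

print(type(X_train_scaled))  # a NumPy array, not a DataFrame
print(scaler_demo.mean_)     # per-feature training means
print(X_test_scaled)         # test row scaled with training statistics
```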


### Create PyTorch Tensors

In [4]:
# Convert the data to PyTorch tensors
# X_train.values returns the underlying NumPy array of the DataFrame, which torch.tensor() accepts as input.
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long) 
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long) 

print(X_train_tensor.shape, y_train_tensor.shape, X_test_tensor.shape, y_test_tensor.shape)

torch.Size([800, 8]) torch.Size([800]) torch.Size([200, 8]) torch.Size([200])


The data type of y_train_tensor is torch.long (a 64-bit integer), so each value is an integer class label.

For instance, if we have two classes, each y value can be either 0 or 1. If we have three classes, each y value can be either 0, 1, or 2. 

This representation of y is more flexible for multi-class classification.
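
To illustrate the label format, here is a small sketch with made-up logits for a three-class problem. The labels are a 1-D `torch.long` tensor with one integer per sample, which is the form `nn.CrossEntropyLoss` accepts for class-index targets.

```python
import torch
import torch.nn as nn

# Made-up raw outputs for 2 samples and 3 classes.
logits = torch.tensor([[2.0, -1.0, 0.5],
                       [0.1, 0.2, 3.0]])
# One integer class label per sample, shape (2,), no unsqueeze needed.
labels = torch.tensor([0, 2], dtype=torch.long)

loss = nn.CrossEntropyLoss()(logits, labels)
print(labels.dtype, labels.shape)  # torch.int64 torch.Size([2])
print(loss.item())                 # a single scalar averaged over the 2 samples
```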

We print out the shape attribute for each tensor to ensure each tensor is in the correct shape.

In this example, we no longer need to call unsqueeze(1) on the y tensors. Also, when constructing the neural network, we don't need to add a sigmoid function to the final output layer to output a class probability.

Instead, the output layer will just produce raw class weights, also called **logits**. Logits are the raw, unnormalized outputs of a neural network before any activation function is applied to them.

For instance, if the class weights are [100, 50], class 0 is more likely than class 1.
Note that this doesn't mean there is a 2/3 chance of class 0 and a 1/3 chance of class 1. Raw class weights produced by a linear layer can be negative, so they cannot be normalized directly into probabilities.

There is one more step to convert raw class weights into class probabilities, which is the **softmax**. It applies the exponential function to each raw class weight to make it positive, then divides each positive weight by the sum of all positive weights to obtain the class probability distribution.

For instance, suppose the raw class weights for class 0 and class 1 are [-1, 0.5]. Then the positive class weights are [e^-1, e^0.5] = [0.37, 1.65], and the class probabilities are:

probability of class 0 = 0.37 / (0.37 + 1.65) = 0.18

probability of class 1 = 1.65 / (0.37 + 1.65) = 0.82

So the class probability distribution is [0.18, 0.82]
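
The arithmetic above can be reproduced with a few lines of plain Python. This is only a sketch of the softmax computation for the example weights [-1, 0.5], using the standard library rather than PyTorch.

```python
import math

logits = [-1.0, 0.5]                  # raw class weights for class 0 and class 1
exps = [math.exp(z) for z in logits]  # exponentiate to make every weight positive
total = sum(exps)
probs = [e / total for e in exps]     # normalize so the values sum to 1

print([round(e, 2) for e in exps])    # [0.37, 1.65]
print([round(p, 2) for p in probs])   # [0.18, 0.82]
```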

Suppose the real class label is "0" and the output of the neural network is the raw class weights [-1, 0.5]. How do we compute the loss between them?
Fortunately, we can skip the step of computing the class probability distribution from the raw neural network outputs, because PyTorch loss functions can do this step for us.

We can just use the **CrossEntropyLoss** loss function. Note that it is different from the binary cross-entropy loss (**BCELoss** in PyTorch), although the two are closely related. CrossEntropyLoss is more flexible because it also works when the data has more than two classes.
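
As a sketch of what the loss function does with the example above (raw class weights [-1, 0.5], true label 0), the snippet below compares `nn.CrossEntropyLoss` with the negative log of the softmax probability of the true class. It also shows the link to binary cross-entropy: for two classes, applying `nn.BCEWithLogitsLoss` to the difference between the class 1 and class 0 weights gives the same value.

```python
import math

import torch
import torch.nn as nn

logits = torch.tensor([[-1.0, 0.5]])  # one sample, two classes
label = torch.tensor([0])             # true class is 0

# CrossEntropyLoss applies softmax internally and takes -log(p_true_class).
ce = nn.CrossEntropyLoss()(logits, label)

# Manual version: softmax first, then the negative log-probability of class 0.
p0 = torch.softmax(logits, dim=1)[0, 0].item()
manual = -math.log(p0)

# Two-class special case: sigmoid(z1 - z0) equals the softmax probability of class 1.
diff = logits[:, 1] - logits[:, 0]
bce = nn.BCEWithLogitsLoss()(diff, label.float())

print(round(ce.item(), 4), round(manual, 4), round(bce.item(), 4))
```

All three numbers agree (about 1.70), which is large because the network assigned only about 0.18 probability to the true class.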


### Create data loaders

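In this step, we will create data loaders for the training and testing sets. A data loader draws batches (small subsets) of the data. With shuffle=True, the training batches are drawn in a new random order each epoch, while the test loader (shuffle=False) keeps the original order.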

In [5]:
# Create TensorDataset objects for train and test data
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

# Create DataLoader objects for train and test datasets
train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False)
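
To see what a data loader yields, the sketch below builds one over random stand-in tensors with the same shape as the training data (800 rows, 8 features); the `_demo` names are hypothetical. With a batch size of 64, the 800 samples split into 12 full batches and one final batch of 32.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random stand-ins with the same shapes as X_train_tensor and y_train_tensor.
X_demo = torch.randn(800, 8)
y_demo = torch.randint(0, 2, (800,))

loader_demo = DataLoader(TensorDataset(X_demo, y_demo), batch_size=64, shuffle=True)

# Each iteration yields one (inputs, labels) pair.
batch_sizes = [labels.size(0) for _, labels in loader_demo]

print(len(loader_demo))                 # 13 batches per epoch
print(batch_sizes[0], batch_sizes[-1])  # 64 32
```

Passing `drop_last=True` to `DataLoader` would discard the smaller final batch if equal batch sizes are needed.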

### Define the neural network architecture

In [6]:
# Define the neural network architecture
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out
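
A quick way to check an architecture like this is to pass a random batch through it and inspect the output shape. The sketch below uses an `nn.Sequential` stand-in with the same layer structure (linear, ReLU, linear); the sizes 8, 16, and 2 are assumptions chosen to match this tutorial's data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in with the same structure: Linear -> ReLU -> Linear (assumed sizes 8, 16, 2).
net_demo = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

batch = torch.randn(5, 8)              # 5 samples with 8 features each
logits = net_demo(batch)               # raw class weights, one row per sample
probs = torch.softmax(logits, dim=1)   # only needed when probabilities are wanted

print(logits.shape)      # torch.Size([5, 2])
print(probs.sum(dim=1))  # every row sums to 1
```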

### Create the neural network instance

In [7]:
# Define hyperparameters

input_size = X_train_tensor.shape[1]
hidden_size = X_train_tensor.shape[1] * 2 # The size of hidden layer is arbitrarily chosen and can be tuned.
num_classes = 2 # Number of classes in your multi-class classification problem

# Instantiate the neural network model
model = NeuralNetwork(input_size, hidden_size, num_classes)

### Define the loss function and optimizer

In [8]:
criterion = nn.CrossEntropyLoss() # CrossEntropyLoss takes raw logits and integer class labels
optimizer = optim.Adam(model.parameters(), lr=0.001)

### Train the neural network

In [9]:
# Training loop
num_epochs = 200
for epoch in range(num_epochs):
    for inputs, labels in train_dataloader:
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        
        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Print progress: the loss of every batch (one line per batch) in every 10th epoch
        if (epoch+1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print()

# Evaluate the model's accuracy on the training set
with torch.no_grad():
    correct = 0
    total = 0
    for inputs, labels in train_dataloader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)  # index of the largest logit is the predicted class
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    accuracy = correct / total
    print('Accuracy:', accuracy)
    

Epoch [10/200], Loss: 0.5433
Epoch [10/200], Loss: 0.6416
Epoch [10/200], Loss: 0.5771
Epoch [10/200], Loss: 0.6109
Epoch [10/200], Loss: 0.6133
Epoch [10/200], Loss: 0.6329
Epoch [10/200], Loss: 0.5947
Epoch [10/200], Loss: 0.5496
Epoch [10/200], Loss: 0.5785
Epoch [10/200], Loss: 0.5699
Epoch [10/200], Loss: 0.5853
Epoch [10/200], Loss: 0.6293
Epoch [10/200], Loss: 0.6253
Epoch [20/200], Loss: 0.5859
Epoch [20/200], Loss: 0.6455
Epoch [20/200], Loss: 0.5864
Epoch [20/200], Loss: 0.4444
Epoch [20/200], Loss: 0.4306
Epoch [20/200], Loss: 0.4291
Epoch [20/200], Loss: 0.5313
Epoch [20/200], Loss: 0.4595
Epoch [20/200], Loss: 0.5083
Epoch [20/200], Loss: 0.5090
Epoch [20/200], Loss: 0.4595
Epoch [20/200], Loss: 0.5276
Epoch [20/200], Loss: 0.6587
Epoch [30/200], Loss: 0.5736
Epoch [30/200], Loss: 0.4036
Epoch [30/200], Loss: 0.3784
Epoch [30/200], Loss: 0.4292
Epoch [30/200], Loss: 0.5113
Epoch [30/200], Loss: 0.5404
Epoch [30/200], Loss: 0.4333
Epoch [30/200], Loss: 0.4806
Epoch [30/200]
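
The log above contains one line per batch. An alternative, sketched below on synthetic data, is to accumulate the loss over all batches and print a single mean value per logged epoch. The data, the model sizes, and the `_demo` names are assumptions for illustration only, not the tutorial's dataset or results.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic, easily separable data: the label depends only on the first feature.
X_demo = torch.randn(128, 8)
y_demo = (X_demo[:, 0] > 0).long()
loader_demo = DataLoader(TensorDataset(X_demo, y_demo), batch_size=32, shuffle=True)

model_demo = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
criterion_demo = nn.CrossEntropyLoss()
optimizer_demo = optim.Adam(model_demo.parameters(), lr=0.01)

num_epochs_demo = 20
epoch_losses = []
for epoch in range(num_epochs_demo):
    running_loss, n_samples = 0.0, 0
    for inputs, labels in loader_demo:
        optimizer_demo.zero_grad()
        loss = criterion_demo(model_demo(inputs), labels)
        loss.backward()
        optimizer_demo.step()
        # Weight each batch loss by its size so the epoch mean is per sample.
        running_loss += loss.item() * labels.size(0)
        n_samples += labels.size(0)
    epoch_losses.append(running_loss / n_samples)
    # One line per logged epoch, printed after the batch loop has finished.
    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch + 1}/{num_epochs_demo}], mean loss: {epoch_losses[-1]:.4f}")
```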

### Evaluate the model on the testing set

In [10]:
# Evaluate the model's accuracy on the testing set
with torch.no_grad():
    correct = 0
    total = 0
    for inputs, labels in test_dataloader:
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)  # index of the largest logit is the predicted class
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    accuracy = correct / total
    print('Accuracy:', accuracy)

Accuracy: 0.755
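
Since the same accuracy loop is used for both the training and testing sets, it could be wrapped in a helper. The function name `evaluate_accuracy` and the tiny hand-set model below are hypothetical; the model is fixed so that it predicts class 1 for positive inputs and class 0 otherwise, which makes the result easy to verify by hand.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader


def evaluate_accuracy(model, dataloader):
    """Return the fraction of samples whose predicted class matches the label."""
    model.eval()  # switch layers such as dropout to evaluation behavior
    correct, total = 0, 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for inputs, labels in dataloader:
            predicted = model(inputs).argmax(dim=1)  # class with the largest logit
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    return correct / total


# Hand-set linear model: logits are [-x, x], so it predicts 1 when x > 0.
toy_model = nn.Linear(1, 2)
with torch.no_grad():
    toy_model.weight.copy_(torch.tensor([[-1.0], [1.0]]))
    toy_model.bias.zero_()

X_toy = torch.tensor([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = torch.tensor([0, 0, 1, 0])  # the last label disagrees with the model
toy_loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=2, shuffle=False)

toy_accuracy = evaluate_accuracy(toy_model, toy_loader)
print("Accuracy:", toy_accuracy)  # Accuracy: 0.75
```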
