# Lab 3

In the previous labs, we perform mortality prediction based on the last visit's diagnosis codes using DNN. This practice igrnoes massive information in the previous visits of a patient. Thus, Starting from this lab, we will play with sequential visit data. That is, each patient will have a sequence of visists. 

However, MLP is quite unsatisfying when dealing with such rich structure data. This lab introduces convolutional neural networks (CNNs), a powerful family of neural networks that are designed for precisely this purpose.

Table of Contents:
- Convolutions for Images
- Padding and Stride
- Pooling
- Assignment

Some contents of this lab are adapted from [Dive into Deep Learning](https://d2l.ai) and [Official PyTorch Tutorials](https://pytorch.org/tutorials/).

In [1]:
import os
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd

In [2]:
# set seed
seed = 24
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)

All these lines set the seed for different random number generators, so that your code will produce the same results every time you run it‚Äîhelpful for debugging and sharing experiments!

In [3]:
DATA_PATH = "../LAB3-lib/data"
assert os.path.isdir(DATA_PATH)
!ls {DATA_PATH}

data.csv


## 1. Convolution Operation

Though we will deal with sequential data (1D) in the assignment. Let us first start with some images data (2D) to 
build out intuition. 

Convolutional neural networks are efficient architectures for exploring structure in image data.

Convolution operation take an input tensor and a kernel tensor and produce an output tensor through convolution operation. Let us ignore channels for now and see how this works with two-dimensional data and hidden representations. In the figure below, the input is a two-dimensional tensor with a height of 3 and width of 3. We mark the shape of the tensor as 3x3 or (3, 3). The height and width of the kernel are both 2. The shape of the kernel window (or convolution window) is given by the height and width of the kernel (here it is 2x2).

<img src='./img/convolution.svg'>

In the two-dimensional cross-correlation operation, we begin with the convolution window positioned at the top-left corner of the input tensor and slide it across the input tensor, both from left to right and top to bottom. When the convolution window slides to a certain position, the input subtensor contained in that window and the kernel tensor are multiplied elementwise and the resulting tensor is summed up yielding a single scalar value. This result gives the value of the output tensor at the corresponding location. Here, the output tensor has a height of 2 and width of 2 and the four elements are derived from the two-dimensional cross-correlation operation:

$$
\begin{split}0\times0+1\times1+3\times2+4\times3=19,\\
1\times0+2\times1+4\times2+5\times3=25,\\
3\times0+4\times1+6\times2+7\times3=37,\\
4\times0+5\times1+7\times2+8\times3=43.\end{split}
$$

Simple Explanation
Sure! Here‚Äôs an explanation for a curious kid:

---

**Imagine you have a small picture made up of numbers in a grid, like this:**

```
0 1 2
3 4 5
6 7 8
```

**You also have a tiny "window" (called a kernel), like this:**

```
0 1
2 3
```

**The job of a convolution operation is to slide this tiny window over your big grid, and at every spot, it mixes the numbers together in a special way. Here‚Äôs how it works:**

1. **Put the tiny window on the top-left corner of the big grid.**
2. **Multiply each number in the window by the number it covers in the grid.**
3. **Add all those results together to get one new number.**
4. **Move the window one step to the right, do it again, and then down‚Äîuntil you‚Äôve covered the whole grid!**

**For example, at the top-left:**

- Window covers: 0, 1, 3, 4
- Kernel numbers: 0, 1, 2, 3
- Multiply: (0√ó0), (1√ó1), (3√ó2), (4√ó3)
- Add: 0 + 1 + 6 + 12 = **19**

**You do this at four spots, and collect four new numbers into a new, smaller grid.**

**This is how computers use convolution to ‚Äúlook at‚Äù pictures and find patterns, like shapes or edges, step by step. It‚Äôs like sliding a tiny magnifying glass over a photo and writing down what you see at each spot!**

---


### Exercise 1 [10 points]

Calculate the output shape for a convolutional layer: given the input tensor shape $(n_w, n_h)$, the kernel tensor shape $(k_w, k_h)$, calculate the output tensor shape. For example, the output shape for the figure above is $(2, 2)$.

In [4]:
def conv_output_shape_1(n_w, n_h, k_w, k_h):
    
    """
    TODO: Calculate the output tensor shape.
    Note the output should a tuple with two elements (width, height). 
    """
    output_width = n_w - k_w + 1
    output_height = n_h - k_h + 1
    #raise NotImplementedError
    return (output_width, output_height)

Every time you move the kernel across the input, you need to make sure the whole kernel fits inside. The formula tells you how many places it can fit horizontally and vertically, giving you the size of the output.

In summary:

    The formula calculates how many times you can slide the kernel over the input, and that‚Äôs exactly the number of output values you get‚Äîjust like in the example!


In [5]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

assert conv_output_shape_1(n_w=7, n_h=7, k_w=3, k_h=3) == (5, 5)
assert conv_output_shape_1(n_w=7, n_h=9, k_w=4, k_h=2) == (4, 8)



### Exercise 2 [10 points]

Implement the 2D convolution function, which accepts an input tensor X and a kernel tensor K and returns an output tensor Y.

In [6]:
def corr2d(X, K):
    """ TODO: Compute 2D convolution. """
    # your code here
    h, w = K.shape
    out_h = X.shape[0] - h + 1
    out_w = X.shape[1] - w + 1
    Y = torch.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            Y[i, j] = (X[i:i+h, j:j+w] * K).sum()
    return Y
    #raise NotImplementedError

In [7]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
assert torch.allclose(corr2d(X, K), torch.tensor([[19., 25.], [37., 43.]]))



Simple explanation
Absolutely! Let‚Äôs walk through how `corr2d(X, K)` works step by step with your example.

### **Input tensors**
```python
import torch

X = torch.tensor([[0.0, 1.0, 2.0],
                  [3.0, 4.0, 5.0],
                  [6.0, 7.0, 8.0]])

K = torch.tensor([[0.0, 1.0],
                  [2.0, 3.0]])
```

### **How Corr2D computes the output**

#### 1. **Kernel shape**  
K is 2√ó2.

#### 2. **Output shape**  
- Output height: 3 - 2 + 1 = 2  
- Output width: 3 - 2 + 1 = 2  
So output will be 2√ó2.

#### 3. **Sliding and calculation**

**Position (0, 0):**  
Cover X[0:2, 0:2] =  
```
[[0, 1],
 [3, 4]]
```
Multiply elementwise by K:  
```
[[0√ó0, 1√ó1],
 [3√ó2, 4√ó3]] = [0, 1, 6, 12]
```
Sum: 0 + 1 + 6 + 12 = **19**

---

**Position (0, 1):**  
Cover X[0:2, 1:3] =  
```
[[1, 2],
 [4, 5]]
```
Multiply:  
```
[[1√ó0, 2√ó1],
 [4√ó2, 5√ó3]] = [0, 2, 8, 15]
```
Sum: 0 + 2 + 8 + 15 = **25**

---

**Position (1, 0):**  
Cover X[1:3, 0:2] =  
```
[[3, 4],
 [6, 7]]
```
Multiply:  
```
[[3√ó0, 4√ó1],
 [6√ó2, 7√ó3]] = [0, 4, 12, 21]
```
Sum: 0 + 4 + 12 + 21 = **37**

---

**Position (1, 1):**  
Cover X[1:3, 1:3] =  
```
[[4, 5],
 [7, 8]]
```
Multiply:  
```
[[4√ó0, 5√ó1],
 [7√ó2, 8√ó3]] = [0, 5, 14, 24]
```
Sum: 0 + 5 + 14 + 24 = **43**

---

### **Final Output**
So, `corr2d(X, K)` will return:
```
[[19, 25],
 [37, 43]]
```

### **In Code**
```python
output = torch.tensor([[19., 25.],
                      [37., 43.]])
```

This matches the expected result!

**Let me know if you want to see the actual `corr2d` implementation in code!**

## 2. Padding and Stride

In several cases, we incorporate techniques, including padding and strided convolutions, that affect the size of the output. As motivation, note that since kernels generally have width and height greater than 1, after applying many successive convolutions, we tend to wind up with outputs that are considerably smaller than our input. If we start with a 240√ó240 pixel image, 10 layers of 5x5 convolutions reduce the image to 200√ó200 pixels, slicing off 30%
of the image and with it obliterating any interesting information on the boundaries of the original image. Padding is the most popular tool for handling this issue.

In other cases, we may want to reduce the dimensionality drastically, e.g., if we find the original input resolution to be unwieldy. Strided convolutions are a popular technique that can help in these instances.

### 2.1 Padding

As described above, one tricky issue when applying convolutional layers is that we tend to lose pixels on the perimeter of our image. Since we typically use small kernels, for any given convolution, we might only lose a few pixels, but this can add up as we apply many successive convolutional layers. One straightforward solution to this problem is to add extra pixels of filler around the boundary of our input image, thus increasing the effective size of the image. Typically, we set the values of the extra pixels to zero. In the figure below, we pad a 3√ó3 input, increasing its size to 5√ó5. The corresponding output then increases to a 4√ó4 matrix. The shaded portions are the first output element as well as the input and kernel tensor elements used for the output computation: 0√ó0+0√ó1+0√ó2+0√ó3=0.

<img src='./img/conv-pad.svg'>

### 2.2 Stride

When computing the convolution, we start with the convolution window at the top-left corner of the input tensor, and then slide it over all locations both down and to the right. In previous examples, we default to sliding one element at a time. However, sometimes, either for computational efficiency or because we wish to downsample, we move our window more than one element at a time, skipping the intermediate locations.

We refer to the number of rows and columns traversed per slide as the stride. So far, we have used strides of 1, both for height and width. Sometimes, we may want to use a larger stride. The figure below shows a two-dimensional convolution operation with a stride of 3 vertically and 2 horizontally. The shaded portions are the output elements as well as the input and kernel tensor elements used for the output computation:  0√ó0+0√ó1+1√ó2+2√ó3=8, 0√ó0+6√ó1+0√ó2+0√ó3=6. We can see that when the second element of the first column is outputted, the convolution window slides down three rows. The convolution window slides two columns to the right when the second element of the first row is outputted. When the convolution window continues to slide two columns to the right on the input, there is no output because the input element cannot fill the window (unless we add another column of padding).

Simply explained:

Padding and Stride in Convolutions ‚Äî Explained Simply

Imagine you have a big picture made of numbers, like a giant grid. When you use a small window (the kernel) to look at this picture and do math, sometimes the edges of the picture get left out and disappear as you do this over and over.
Padding

    Padding is like putting a frame of zeros around your picture.
    If your picture is too small, you add extra empty squares around the edges so your window can fit and look at every part of the picture.
    This way, you don‚Äôt lose the edge pieces when you slide your window!

Why do we need padding?

    If you keep using your window without padding, your picture shrinks and you lose important information near the edges.
    Padding helps keep the picture big and keeps all the information, even after many steps.

Stride

    Stride is how many steps your window moves each time.
    If stride is 1, your window moves just one square at a time.
    If stride is bigger, like 2 or 3, your window jumps ahead by 2 or 3 squares each time.
    This means you get fewer results, and your answer grid is smaller.

Why use a bigger stride?

    Sometimes, your picture is so big it‚Äôs hard to work with, so you use a bigger stride to make your work easier and faster.
    It‚Äôs like skipping spaces when you walk, so you get across the room quicker!

Summary:
Padding helps you keep the edges of your picture safe.
Stride lets you move your window faster and shrink the result.
Both are ways to control how you look at pictures with computers!

### Exercise 3 [10 points]

Calculate the output shape for a convolutional layer with padding: given the input tensor shape $(n_w, n_h)$, the kernel tensor shape $(k_w, k_h)$, padding size $(p_w, p_h)$, stride size $(s_w, s_h)$, calculate the output tensor shape.

In [8]:
def conv_output_shape_2(n_w, n_h, k_w, k_h, p_w, p_h, s_w, s_h):
    
    """
    TODO: Calculate the output tensor shape.
    Note the output should a tuple with two elements (width, height). 
    """
    # your code here
    output_width = (n_w + 2*p_w - k_w) // s_w + 1
    output_height = (n_h + 2*p_h - k_h) // s_h + 1
    return (output_width, output_height)
    #raise NotImplementedError

In [9]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

assert conv_output_shape_2(n_w=7, n_h=7, k_w=3, k_h=3, p_w=1, p_h=1, s_w=1, s_h=1) == (7, 7)
assert conv_output_shape_2(n_w=7, n_h=7, k_w=3, k_h=3, p_w=0, p_h=0, s_w=2, s_h=2) == (3, 3)
assert conv_output_shape_2(n_w=7, n_h=9, k_w=4, k_h=2, p_w=0, p_h=1, s_w=2, s_h=1) == (2, 10)



## 3. Multiple Input and Multiple Output Channels

Denote by $c_i$ and $c_o$ the number of input and output channels, respectively, and let $k_h$ and $k_w$ be the height and width of the kernel. To get an output with multiple channels, we can create a kernel tensor of shape $c_i \times k_h \times k_w$ for every output channel. We concatenate them on the output channel dimension, so that the shape of the convolution kernel is $c_o \times c_i \times k_h \times k_w$. In convolution operations, the result on each output channel is calculated from the convolution kernel corresponding to that output channel and takes input from all channels in the input tensor.

<img src='./img/conv-channel.svg'>

In the figure above, the number of input and output channels are 3 and 2. And there are $2 \times 3$ sets of kernels.

Multiple Input and Multiple Output Channels ‚Äî Like Making a Fancy Sandwich!

Imagine you‚Äôre making a sandwich, but instead of just bread and cheese, you have three layers of yummy stuff inside (like peanut butter, jelly, and bananas). Each layer is a ‚Äúchannel.‚Äù
Input Channels

    Think of a picture with three versions stacked up‚Äîlike a color photo made of red, green, and blue layers.
    These are your input channels. (For example: 3 channels.)

Output Channels

    Now, imagine you want to make TWO different kinds of sandwiches from those layers‚Äîmaybe one with extra peanut butter and one with extra jelly.
    These are your output channels. (For example: 2 channels.)

Kernels ‚Äî The Recipe

    For each output sandwich, you need a special recipe (kernel) that tells you how to mix and match the layers from your input.
    If you have 3 input channels and want 2 output channels, you need 2 √ó 3 = 6 recipes (kernels)!
    Each output channel gets its own set of recipes to create a new layer from all the input layers.

How Does It Work?

    To make the first output sandwich, you use three recipes‚Äîone for peanut butter, one for jelly, one for bananas‚Äîand mix them together.
    To make the second output sandwich, you do the same but with different recipes.
    You end up with two new sandwich layers (output channels), each made from all the input layers using their own special mixing rules.

In short:
If your input is made of three layers and you want to make two different new layers, you need a separate set of recipes for each output layer, using all the input layers every time. That‚Äôs how computers mix information from different channels to create new ‚Äúpictures‚Äù!

## 4. Pooling

Often, as we process images, we want to gradually reduce the spatial resolution of our hidden representations, aggregating information so that the higher up we go in the network, the larger the receptive field (in the input) to which each hidden node is sensitive.

Like convolutional layers, pooling operators consist of a fixed-shape window that is slid over all regions in the input according to its stride, computing a single output for each location traversed by the fixed-shape window (sometimes known as the pooling window). However, unlike the cross-correlation computation of the inputs and kernels in the convolutional layer, the pooling layer contains no parameters (there is no kernel). Instead, pooling operators are deterministic, typically calculating either the maximum or the average value of the elements in the pooling window. These operations are called maximum pooling (max pooling for short) and average pooling, respectively.

In both cases, as with the cross-correlation operator, we can think of the pooling window as starting from the upper-left of the input tensor and sliding across the input tensor from left to right and top to bottom. At each location that the pooling window hits, it computes the maximum or average value of the input subtensor in the window, depending on whether max or average pooling is employed.

<img src='./img/pooling.svg'>

The output tensor in the figure above has a height of 2 and a width of 2. The four elements are derived from the maximum value in each pooling window:

$$
\begin{split}\max(0, 1, 3, 4)=4,\\
\max(1, 2, 4, 5)=5,\\
\max(3, 4, 6, 7)=7,\\
\max(4, 5, 7, 8)=8.\\\end{split}
$$

Maxpooling simplified:

Here's a kid-friendly explanation of pooling in pictures and neural networks:
Pooling ‚Äî Like Picking the Best (or Average) in Each Group!

Imagine you have a big grid of numbers, like a game board. You want to make the grid smaller, but you still want to keep the most important parts!
How does pooling work?

    You use a small window (like a 2√ó2 square) and slide it over your big grid, one step at a time.
    At each spot, you only keep one number for that window.

Max Pooling

    For each window, you look at all the numbers inside, and you pick the biggest one.
    So if your window covers the numbers [0, 1, 3, 4], you keep 4 because it's the biggest.
    You do this for every window, and you build a new, smaller grid with just those biggest numbers!

Average Pooling

    Instead of picking the biggest, you add up all the numbers in the window and divide by how many there are.
    So if your window covers [0, 1, 3, 4], you add them up (0+1+3+4=8) and divide by 4, so you keep 2.

Why do we use pooling?

    Pooling helps shrink your grid so your computer has less to look at.
    It also helps keep the most important information, even when the grid gets smaller.

Example:

If you have this grid:
Code

0 1 2
3 4 5
6 7 8

and you use max pooling with a 2√ó2 window, you get:
Code

4 5
7 8

Because those are the biggest numbers in each window!

### Exercise 4 [10 points]

Implement a 2D max pooling layer from scratch, which accepts an input tensor X and pool size and returns an output tensor Y.

In [10]:
def maxpool2d(X, pool_size):
    # your code here
    h, w = pool_size
    out_h = X.shape[0] - h + 1
    out_w = X.shape[1] - w + 1
    Y = torch.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            Y[i, j] = X[i:i+h, j:j+w].max()
    return Y
    #raise NotImplementedError

In [11]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
assert torch.allclose(maxpool2d(X, (2, 2)), torch.tensor([[4., 5.], [7., 8.]]))



Absolutely! Here‚Äôs a **line-by-line explanation** of your `maxpool2d` function:

---

```python
def maxpool2d(X, pool_size):
```
- **Defines** a function named `maxpool2d` that takes:
  - `X`: a 2D tensor (like a grid of numbers, such as an image)
  - `pool_size`: a tuple (height, width) that tells you the size of the pooling window.

---

```python
    h, w = pool_size
```
- **Unpacks** the pool size tuple into `h` (window height) and `w` (window width).

---

```python
    out_h = X.shape[0] - h + 1
    out_w = X.shape[1] - w + 1
```
- **Calculates** the size of the output tensor:
  - `out_h`: number of rows in the output grid.
    - `X.shape[0]` is the number of rows in `X`.
    - By subtracting the window height and adding 1, you get the number of places your window can slide down.
  - `out_w`: number of columns in the output grid.
    - Same logic, using columns and window width.

---

```python
    Y = torch.zeros((out_h, out_w))
```
- **Creates a new tensor** `Y` of zeros with the shape of the output grid.
  - This will store the maximum values found in each window.

---

```python
    for i in range(out_h):
        for j in range(out_w):
```
- **Loops over every possible position** for the top-left corner of the window:
  - `i` goes over all rows in the output.
  - `j` goes over all columns in the output.

---

```python
            Y[i, j] = X[i:i+h, j:j+w].max()
```
- **For each position**:
  - Selects a window from `X` that starts at row `i` and column `j` and is of size `h` by `w`.
  - Finds the maximum value in that window.
  - Stores that maximum in the output tensor `Y` at position `[i, j]`.

---

```python
    return Y
```
- **Returns** the output tensor `Y` containing all the max-pooled values.

---

### **Summary for a Curious Kid**
- This function slides a small box over a big grid of numbers.
- At each spot, it looks inside the box and picks the biggest number.
- It builds a new, smaller grid out of just those biggest numbers!

Let me know if you want a step-by-step example with numbers!

## 5. CNN with PyTorch

Luckily, PyTorch has all kinds of convolution and pooling operations implemented for us ([link](https://pytorch.org/docs/stable/nn.html#convolution-layers)). For the previous image example, we can use [`nn.Conv2d()`](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html#torch.nn.Conv2d) and [`nn.MaxPool2d`](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html#torch.nn.MaxPool2d).

For example, the code below implements a 2D convolution layer with 3 input channels, 8 output channels, kernel shape (3, 3), stride shape (2, 2), and no padding.

In [12]:
m = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=2, padding=0)

If we have an image of shape (3, 224, 224), after this convolution layer, the output shape will be (8, 111, 111). Let us verify this.

In [13]:
# the first dimension is the batch size (1 in this case, since we only have one image)
img = torch.randn(1, 3, 224, 224)
m(img).shape

torch.Size([1, 8, 111, 111])


How does Conv2d use input channels, output channels, and kernel size?
What are channels?

    Imagine a color image:
        3 channels = Red, Green, Blue layers stacked together.
    In deep learning, each ‚Äúchannel‚Äù is like a separate sheet of information.

What does kernel_size mean?

    The kernel is like a small window that slides over your image.
    kernel_size=3 means the window is a 3√ó3 square (not 3, but 3 by 3).

How does input and output channels work?

    Input channels (3): The Conv2d layer expects images with 3 channels (like RGB).
    Output channels (8): The Conv2d layer creates 8 new images, each one is a different ‚Äúview‚Äù or pattern found by the layer.

How are the kernels arranged?

    For each output channel, the layer has its own set of kernels.
    Each kernel looks at all input channels.
    So, for every output channel, you need kernels for each input channel.

The shape of the kernel tensor is:
Code

(out_channels, in_channels, kernel_height, kernel_width)

For your example:

    out_channels = 8
    in_channels = 3
    kernel_height = 3
    kernel_width = 3

So, the total kernel tensor shape is:
Code

(8, 3, 3, 3)

Visual analogy:

    Imagine you want to bake 8 cakes (output channels).
    For each cake, you use 3 different flavors of batter (input channels).
    And each flavor needs its own 3√ó3 mixing pattern (kernel).
    So you have 8 cakes √ó 3 batters √ó 3√ó3 mixing patterns.

Summary:

    kernel_size means how big the window is (3√ó3).
    in_channels is how many input layers.
    out_channels is how many new output layers.
    The kernel tensor is shaped (8, 3, 3, 3), not just 3√ó8.



In the assignment, on the other hand, we will play with sequential data. That is, each patient will be represented as a sequence of visits, and each visit will be represented as a set of diagnosis codes (a one-hot vector).

Denote the number of visits for a patient as $n$, and the total number of diagnosis codes as $m$, this patient can be represented as a matrix of shape $(n, m)$. 

For example, let us say there are 30 diagnosis codes in total. And there is a patient with 3 visits. Then the patient can be represented as:

In [14]:
# the first dimension is the batch size (1 in this case, since we only have one patient)
# the second dimension is the total number of diagnosis codes
# the third dimension is the total number of visits
patient = torch.randn(1, 30, 3)

We can then perform 1D convolution to capture the temporal information. The code below implements an 1D convolution layer with 30 input channels, 16 output channels, kernel shape 2, stride shape 1, and no padding.

In [15]:
m = nn.Conv1d(in_channels=30, out_channels=16, kernel_size=2, stride=1, padding=0)

After convolution, we should have a tensor of shape (16, 2).

In [16]:
m(patient).shape

torch.Size([1, 16, 2])

## Assignment [60 points]

In this assignment, you will use [MIMIC-III Demo](https://physionet.org/content/mimiciii-demo/) dataset, which contains all intensive care unit (ICU) stays for 100 patients. The task is Mortality Prediction.

### Load Data

In the previous lab, we have preprocessed the data. Thus, for this lab, we will directly use the processed data.

In [17]:
!ls {DATA_PATH}

data.csv


Here are the helper fuctions and CustomDataset from the previous lab. 

The only difference is that, starting from this lab, we will use the entire patient visit instead of only the last visit. Due to this reason, we will only keep patients with more than one visits.

In [18]:
# two helper functions

TOTAL_NUM_CODES = 271


def read_csv(filename):
    """ reading csv from filename """
    data = []
    with open(filename, "r") as file:
        csv_reader = csv.DictReader(file, delimiter=',')
        for row in csv_reader:
            data.append(row)
    header = list(data[0].keys())
    return header, data


def to_one_hot(label, num_class):
    """ convert to one hot label """
    one_hot_label = [0] * num_class
    for i in label:
        one_hot_label[i] = 1
    return one_hot_label

In [19]:
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    
    def __init__(self):
        # read the csv
        self._df = pd.read_csv(f'{DATA_PATH}/data.csv')
        # split diagnosis code index by ';' and convert it to integer
        self._df.icd9 = self._df.icd9.apply(lambda x: [int(i) for i in x.split(';')])
        # build data dict
        self._build_data_dict()
        # a list of subject ids
        self._subj_ids = list(self._data.keys())
        # sort the subject ids to maintain a fixed order
        self._subj_ids.sort()
    
    def _build_data_dict(self):
        """ 
        build SUBJECT_ID to ADMISSION dict
            - subject_id
                - icd9: a list of ICD9 code index
                - mortality: 0/1 morality label
        """
        dict_data = {}
        df = self._df.groupby('subject_id').agg({'mortality': lambda x: x.iloc[0], 'icd9': list}).reset_index()
        for idx, row in df.iterrows():
            subj_id = row.subject_id
            # only keep patients with more than 1 visit
            if len(row.icd9) >= 2:
                dict_data[subj_id] = {}
                dict_data[subj_id]['icd9'] = row.icd9
                dict_data[subj_id]['mortality'] = row.mortality
        self._data = dict_data
    
    def __len__(self):
        """ return the number of samples (i.e. patients). """
        return len(self._subj_ids)
    
    def __getitem__(self, index):
        """ generates one sample of data. """
        # obtain the subject id
        subj_id = self._subj_ids[index]
        # obtain the data dict by subject id
        data = self._data[subj_id]
        # convert last admission's diagnosis code index to one hot
        x = torch.tensor([to_one_hot(visit, TOTAL_NUM_CODES) for visit in data['icd9']], dtype=torch.float32)
        # mortality label
        y = torch.tensor(data['mortality'], dtype=torch.float32)
        return x, y

Absolutely! Let‚Äôs create a simple example to show how your `CustomDataset` class works step by step.

---

## 1. Sample CSV Data

Suppose your CSV (`data.csv`) looks like this:

| subject_id | icd9        | mortality |
|------------|-------------|-----------|
| 1001       | 12;45;100   | 0         |
| 1001       | 33;56       | 0         |
| 1002       | 5;99        | 1         |
| 1002       | 7           | 1         |
| 1003       | 5           | 0         |  ‚Üê Only one visit, will be ignored

---

## 2. What Happens Step By Step

### **a. Read and Process Data**
- Only patients with 2 or more visits are kept: 1001 and 1002.

### **b. After Processing**

- For **1001**:
  - icd9: `[[12,45,100], [33,56]]`
  - mortality: `0`
- For **1002**:
  - icd9: `[[5,99], [7]]`
  - mortality: `1`

### **c. One-Hot Encoding (Assume `TOTAL_NUM_CODES = 101` for simplicity)**

- The `to_one_hot` function will create a 101-length array of 0s, with 1s at the indices of the codes.

#### For 1001‚Äôs first admission `[12,45,100]`:
```
[0, 0, ..., 1 at 12, ..., 1 at 45, ..., 1 at 100]
```
#### For 1001‚Äôs second admission `[33,56]`:
```
[0, 0, ..., 1 at 33, ..., 1 at 56, ..., 0]
```

---

## 3. What Does `__getitem__` Return?

### For patient at index 0 (subject_id 1001):

- `x` will be a tensor of shape `(2, 101)`:
  - First row: one-hot for `[12,45,100]`
  - Second row: one-hot for `[33,56]`
- `y` will be a tensor: `0.0` (float)

### For patient at index 1 (subject_id 1002):

- `x`: tensor of shape `(2, 101)`:
  - First row: one-hot for `[5,99]`
  - Second row: one-hot for `[7]`
- `y`: tensor: `1.0` (float)

---

## 4. Example in Code

```python
import torch

TOTAL_NUM_CODES = 101

def to_one_hot(label, num_class):
    one_hot_label = [0] * num_class
    for i in label:
        one_hot_label[i] = 1
    return one_hot_label

# Example for 1001
visits = [[12, 45, 100], [33, 56]]
x = torch.tensor([to_one_hot(visit, TOTAL_NUM_CODES) for visit in visits], dtype=torch.float32)
y = torch.tensor(0.0, dtype=torch.float32)
print("x shape:", x.shape)  # (2, 101)
print("First admission one-hot (indices 12,45,100):", x[0].nonzero(as_tuple=True)[0].tolist())
print("Second admission one-hot (indices 33,56):", x[1].nonzero(as_tuple=True)[0].tolist())
print("y:", y)
```

**Output:**
```
x shape: torch.Size([2, 101])
First admission one-hot (indices 12,45,100): [12, 45, 100]
Second admission one-hot (indices 33,56): [33, 56]
y: tensor(0.)
```

---

## **Summary**

- Each patient is represented as a sequence of admissions.
- Each admission is a one-hot vector marking which diagnoses they had.
- The label is their mortality (0 or 1).
- When you get a patient from the dataset, you get their admissions (as a tensor of one-hot vectors) and their mortality label.

Let me know if you want to see this as a runnable script or with more/different data!

In [20]:
dataset = CustomDataset()
print('Size of dataset:', len(dataset))

Size of dataset: 14


Copilot said: Great question! Let‚Äôs break down why `CustomDataset()`

Great question! Let‚Äôs break down why CustomDataset() works and prints Size of dataset: 14 even though you didn‚Äôt pass any data directly as an argument.
1. Where Does the Data Come From?

Look at the CustomDataset class‚Äôs __init__ method:
Python

def __init__(self):
    # read the csv
    self._df = pd.read_csv(f'{DATA_PATH}/data.csv')
    ...

Key point:

    The data is loaded from a CSV file on disk: f'{DATA_PATH}/data.csv'.
    DATA_PATH is a variable that must be defined somewhere in your code or notebook (often at the top).
    pd.read_csv(...) reads the CSV file and loads the data into a pandas DataFrame (self._df).

2. How Does the Class Know What Data to Use?

    The class is hardcoded to look for the file at a specific location (DATA_PATH/data.csv).
    You don‚Äôt need to pass the data as an argument because the class is designed to always load its data from that file.
    This is a design choice: the data source (file location) is fixed inside the class, not passed as a parameter.

3. Why Does len(dataset) Return 14?

    After reading the CSV, the class processes and filters the data (for example, filtering only patients with more than 1 visit).
    The list of valid patients (subject IDs) is stored in self._subj_ids.
    __len__ returns the length of this list.
    In your case, after filtering, there are 14 valid patients in the CSV file.


In [21]:
from torch.utils.data.dataset import random_split


split = int(len(dataset)*0.7)

lengths = [split, len(dataset) - split]
train_dataset, test_dataset = random_split(dataset, lengths)

print("Length of train dataset:", len(train_dataset))
print("Length of test dataset:", len(test_dataset))

Length of train dataset: 9
Length of test dataset: 5


In [22]:
# Calculate the split indices for 70% train, 30% test
split = int(len(dataset) * 0.7)  # 0.7 * 14 = 9.8 ‚Üí 9 after int()
lengths = [split, len(dataset) - split]  # [9, 5]

# Randomly split the dataset
train_dataset, test_dataset = random_split(dataset, lengths)


Here is an example of $x$, and $y$. 

In [23]:
x, y = train_dataset[0]
print(f'Example x (shape {x.shape}):\n', x)
print(f'Example y:\n', y)

Example x (shape torch.Size([2, 271])):
 tensor([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

We can see that $x$ is of shape $(2, 271)$, which means there are $271$ diagnosis codes in total, and this patient has two visits. It is in one-hot format. A $1$ in position $i$ means that diagnosis code of index $i$ appears in the that visit.

And $y$ is either $0$ or $1$.

### Padding [20 points]

Note that the first dimension of $x$ can be different for different patients (i.e., different patients will have different number of visits). Thus we need to implement a padding function (similar to the zero padding in images).

To achieve this goal, we will implement a special collage function. This collate function `collate_fn()` will be called by `DataLoader` after fetching a list of samples using the indices from `CustomDataset` to collate the list of samples into batches.

For example, assume the `DataLoader` gets a list of two samples (here, assume the total number of codes is 3). 

```
[ [ [0, 1, 0], [1, 0, 1] ], 
  [ [0, 0, 1], [0, 1, 1], [0, 1, 1] ] ]
```

where the first patient has two visits `[0, 1, 0]` and `[1, 0, 1]` and the second patient has three visits `[0, 0, 1]`, `[0, 1, 1]`, and `[0, 1, 1]`.

The collate function `collate_fn()` is supposed to pad them into the same shape (2, 3), where 2 is the number of patients, and 3 is the maximum number of visits.

```
[ [ [0, 1, 0], [1, 0, 1], *[0, 0, 0]* ], 
  [ [0, 0, 1], [0, 1, 1],  [0, 1, 1] ] ]
```

Simplified explanation
Here‚Äôs a simplified version of your text:

Different patients can have different numbers of visits, so their data arrays have varying first dimensions. To batch this data, we need to pad shorter patient records (with zeros, like zero-padding in images) so that all patients in a batch have the same number of visits.

We use a custom collate function, collate_fn(), in the DataLoader to do this padding automatically. For example, if one patient has 2 visits and another has 3, we pad the first patient‚Äôs data to have 3 visits (adding a row of zeros):

Original samples:
[
  [[0, 1, 0], [1, 0, 1]],                 # 2 visits
  [[0, 0, 1], [0, 1, 1], [0, 1, 1]]       # 3 visits
]

After padding:
[
  [[0, 1, 0], [1, 0, 1], [0, 0, 0]],     # now 3 visits (last one is padding)
  [[0, 0, 1], [0, 1, 1], [0, 1, 1]]      # unchanged
]

This way, all patient data in a batch has the same shape.

In [24]:
def collate_fn(data):
    """
    TODO: Collate the the list of samples into batches. For each patient, you need to pad the diagnosis
        sequences to the sample shape (max # visits, total # diagnosis codes).
    
    Arguments:
        data: a list of samples fetched from `CustomDataset`
        
    Outputs:
        x: a tensor of shape (# patients, total # diagnosis codes, max # visits) of type torch.float
        y: a tensor of shape (# patients) of type torch.float
        
    Note that you can obtains the list of diagnosis codes and the list of mortality labels
        using: `sequences, labels = zip(*data)`
    """

    sequences, labels = zip(*data)

    y = torch.tensor(labels, dtype=torch.float)
    
    num_patients = len(sequences)
    num_visits = [patient.shape[0] for patient in sequences]
    total_num_codes = sequences[0].shape[1]

    max_num_visits = max(num_visits)
    
    x = torch.zeros((num_patients, total_num_codes, max_num_visits), dtype=torch.float)

    for i_patient, patient in enumerate(sequences):
        for j_visit, visit in enumerate(patient):
            x[i_patient, :, j_visit] = visit.float()
            # your code here
            #raise NotImplementedError
    
    return x, y

Simplified example
Suppose you have two patients, and each patient has a different number of visits. For each visit, you have a 3-code diagnosis vector.
Python

import torch

# Patient 1: 2 visits, each with 3 codes
patient1 = torch.tensor([
    [1, 0, 0],  # visit 1
    [0, 1, 1]   # visit 2
])

# Patient 2: 3 visits, each with 3 codes
patient2 = torch.tensor([
    [0, 1, 0],  # visit 1
    [1, 0, 1],  # visit 2
    [0, 0, 1]   # visit 3
])

# Labels (e.g. mortality, just for demonstration)
label1 = 0
label2 = 1

# Combine into data list as (sequence, label) pairs
data = [
    (patient1, label1),
    (patient2, label2)
]

What collate_fn Does

    Finds the max number of visits (here: 3, since patient2 has 3 visits)
    Pads patient1‚Äôs visits to 3 (adds a row of zeros)
    Organizes the data into a tensor of shape (number of patients, number of codes, max visits)

Step-by-Step Output

Let‚Äôs show this in code by filling in the missing part of your function and running it:
Python

def collate_fn(data):
    sequences, labels = zip(*data)
    y = torch.tensor(labels, dtype=torch.float)
    num_patients = len(sequences)
    num_visits = [patient.shape[0] for patient in sequences]
    total_num_codes = sequences[0].shape[1]
    max_num_visits = max(num_visits)
    x = torch.zeros((num_patients, total_num_codes, max_num_visits), dtype=torch.float)
    for i_patient, patient in enumerate(sequences):
        for j_visit, visit in enumerate(patient):
            x[i_patient, :, j_visit] = visit.float()
    return x, y

x, y = collate_fn(data)

print("x (batched visits):")
print(x)
print("x shape:", x.shape)
print("y (labels):", y)

Output
Code

x (batched visits):
tensor([[[1., 0., 0.],
         [0., 1., 0.],
         [0., 1., 0.]],
        [[0., 1., 0.],
         [1., 0., 0.],
         [0., 1., 1.]]])
x shape: torch.Size([2, 3, 3])
y (labels): tensor([0., 1.])

Explanation:

    Each patient‚Äôs data is padded so they both have 3 visits.
    For patient 1, the third visit is all zeros (padding).
    The result is ready for batch processing in deep learning.


In [25]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

from torch.utils.data import DataLoader


loader = DataLoader(train_dataset, batch_size=4, collate_fn=collate_fn)
loader_iter = iter(loader)
x, y = next(loader_iter)

assert x.dtype == torch.float
assert y.dtype == torch.float

assert x.shape[:-1] == (4, 271)
assert y.shape == (4,)

for i in range(4):
    real_x, real_y = train_dataset[i]
    for j in range(real_x.shape[0]):
        visit = real_x[j]
        got = x[i, :, j]
        assert all(visit == got)
        assert real_y == y[i]



We need to pad the sequences into the same length so that we can do batch training on GPU, which will run much faster. Or, if they have different length, we have to process them one by one. This is extremely slow, especially with a large dataset.

You may also wonder will this padding add some extra noise to the dataset (because we change the number of visits for some patients). The answer is: it depends. Sometimes, padding will bring in some noise and we need to have a separate mask to remove the noise later (you will see this in the next lab).

But in this lab, it does not matter. Because zero padding will not affect the convolution operation. Zero times zero is still zero (assume we do not have bias parameter).

This code sets up PyTorch data loaders to efficiently provide batches of data from your training and test datasets. The custom collate_fn function is used to pad and arrange each batch. Shuffling is enabled for training data to improve learning. Finally, the code prints out the number of batches for both training and testing.

### Data Loader

Now, we can load the dataset into the data loader.

In [26]:
from torch.utils.data import DataLoader

# how many samples per batch to load
batch_size = 4

# prepare dataloaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, collate_fn=collate_fn)

print("# of train batches:", len(train_loader))
print("# of test batches:", len(test_loader))

# of train batches: 3
# of test batches: 2


In [27]:
train_iter = iter(train_loader)
x, y = next(train_iter)

print('Shape of a batch x:', x.shape)
print('Shape of a batch y:', y.shape)

Shape of a batch x: torch.Size([4, 271, 3])
Shape of a batch y: torch.Size([4])


Copilot said: Let‚Äôs break down what these shapes mean: ###

Let‚Äôs break down what these shapes mean:
x: torch.Size([4, 271, 3])

    x is your batch of input data (for example, visit sequences for patients).
    4: The batch contains 4 samples (e.g., 4 patients).
    271: Each visit is represented by a vector of 271 diagnosis codes (so you have 271 features per visit).
    3: Each patient (in this batch) has data for up to 3 visits (padded if needed).

So:
x is a 3-dimensional tensor with shape:
[number of patients in batch, number of codes per visit, number of visits]
Or:
[batch, codes, visits]
Example: 4 patients, 271 codes, 3 visits each.

### Build the Model [20 points]

Now, let us build a 1D CNN model. For each patient, the CNN model will take an input tensor of shape (# of visits, total # of codes), and produce an output tensor of 1-dim (0 for non-mortality, 1 for moratality). The detailed model architecture is shown in the table below.

Layers | Configuration | Activation Function
--- | --- | ---
convolution | in channels 271, out channels 32, kernel size 2, stride 1, padding 0, bias False | -
dropout | probability 0.5 | - 
fully connected | input size 32, output size 1 | Sigmoid

Note that you have to set `bias=Flase` for the convolution layer. Only in this way can we ignore the noise introduced by padding.

ü§î Why 1D CNN?

Because your data is sequential in one direction ‚Äî over time (visits).
üîç Your input:

Each patient‚Äôs input is shaped like:

(number of visits, number of medical codes)
‚Üí for example: (10, 271)

That means:

    Each row is a visit (a point in time),

    Each column is a medical code (features at that visit).

So the input is like a sequence of visit-vectors ‚Äî similar to a time series.
‚úÖ Why 1D is appropriate:

    1D CNN looks for patterns along the sequence of visits, using a sliding window (the kernel).

    It treats each visit as a "time step" and learns local temporal patterns (e.g., how a disease progresses across visits).

    2D CNN would be for images (2D spatial data), not appropriate here.

üß† Analogy:

Think of it like this:

    1D CNN is to time series or sequential visit data what

    2D CNN is to images

In [28]:
"""
TODO: Build the CNN shown above.
HINT: Consider using `nn.Conv1d`, `nn.MaxPool1d`, `nn.Dropout`, `nn.Linear`, `torch.sigmoid`.
"""

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        # DO NOT change the names
        self.conv = nn.Conv1d(in_channels=271, out_channels=32, kernel_size=2, stride=1, padding=0, bias=False)
        self.dropout = nn.Dropout(p=0.5)
        self.fc = nn.Linear(32, 1)
        
        
        # your code here
        #raise NotImplementedError

    def forward(self, x):
        """
        TODO: 1. pass x through the convolution layer
              2. pass x through the dropout layer
              3. sum x by the last dimension (i.e., visits)
              4. pass x through the linear and sigmoid layer
        """
        # your code here
         
        x = self.conv(x)       # ‚Üí [batch, 32, new_seq_len]
        x = self.dropout(x)
        x = x.sum(dim=2)       # sum across time (visits)
        x = self.fc(x)
        x = torch.sigmoid(x)
        return x

            
        #raise NotImplementedError

In [29]:
# initialize the CNN
model = Net()
print(model)

Net(
  (conv): Conv1d(271, 32, kernel_size=(2,), stride=(1,), bias=False)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=32, out_features=1, bias=True)
)


In [30]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

model = Net()

assert model.conv.in_channels == 271
assert model.conv.out_channels == 32
assert model.conv.kernel_size == (2,)
assert model.conv.stride == (1,)
assert model.conv.padding == (0,)
assert model.conv.bias is None
assert model.fc.in_features == 32
assert model.fc.out_features == 1

train_iter = iter(train_loader)
x, y = next(train_iter)
output = model.forward(x)
assert output.shape == (4, 1), "Net() is wrong!"



Now that we have a network, let's see what happens when we pass in some data.

In [31]:
model = Net()

# Grab some data 
train_iter = iter(train_loader)
x, y = next(train_iter)

# Forward pass through the network
output = model.forward(x)

print('Input x shape:', x.shape)
print('Output shape: ', output.shape)

Input x shape: torch.Size([4, 271, 3])
Output shape:  torch.Size([4, 1])


### Train the Network [20 points]

In this step, you will train the CNN model.

In [32]:
"""
TODO: Define the loss (BCELoss), assign it to `criterion`.

REFERENCE: https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html#torch.nn.BCELoss
"""

criterion = nn.BCELoss()

# your code here
#raise NotImplementedError

In [33]:
"""
TODO: Define the optimizer (SGD) with learning rate 0.01, assign it to `optimizer`.

REFERENCE: https://pytorch.org/docs/stable/optim.html
"""

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# your code here
#raise NotImplementedError

In [34]:
'''
AUTOGRADER CELL. DO NOT MODIFY THIS.
'''

assert type(criterion) is nn.modules.loss.BCELoss, "criterion is not BCELoss!"
assert type(optimizer) is torch.optim.SGD, "optimizer is not SGD!"
assert optimizer.param_groups[0]['lr'] == 0.01, "learning rate is not 0.01!"



Now we can train the model. The following two cell are exactly the same as previous lab.

In [35]:
from sklearn.metrics import *

#input: Y_score,Y_pred,Y_true
#output: accuracy, auc, precision, recall, f1-score
def classification_metrics(Y_score, Y_pred, Y_true):
    acc, auc, precision, recall, f1score = accuracy_score(Y_true, Y_pred), \
                                           roc_auc_score(Y_true, Y_score), \
                                           precision_score(Y_true, Y_pred), \
                                           recall_score(Y_true, Y_pred), \
                                           f1_score(Y_true, Y_pred)
    return acc, auc, precision, recall, f1score


#input: model, loader
def evaluate(model, loader):
    model.eval()
    all_y_true = torch.LongTensor()
    all_y_pred = torch.LongTensor()
    all_y_score = torch.FloatTensor()
    for x, y in loader:
        # pass the input through the model
        y_hat = model(x)
        # convert shape from [batch size, 1] to [batch size]
        y_hat = y_hat.view(y_hat.shape[0])
        y_pred = (y_hat > 0.5).type(torch.float)
        all_y_true = torch.cat((all_y_true, y.to('cpu')), dim=0)
        all_y_pred = torch.cat((all_y_pred,  y_pred.to('cpu')), dim=0)
        all_y_score = torch.cat((all_y_score,  y_hat.to('cpu')), dim=0)
        
    acc, auc, precision, recall, f1 = classification_metrics(all_y_score.detach().numpy(), 
                                                             all_y_pred.detach().numpy(), 
                                                             all_y_true.detach().numpy())
    print(f"acc: {acc:.3f}, auc: {auc:.3f}, precision: {precision:.3f}, recall: {recall:.3f}, f1: {f1:.3f}")
    return

üñºÔ∏è Example:

Let‚Äôs say you have two batches.

Batch 1:

    y = [1, 0]

    y_pred = [1, 0]

    y_hat = [0.9, 0.2]

Batch 2:

    y = [0, 1]

    y_pred = [0, 1]

    y_hat = [0.3, 0.8]

After cat, you get:

all_y_true = [1, 0, 0, 1]
all_y_pred = [1, 0, 0, 1]
all_y_score = [0.9, 0.2, 0.3, 0.8]

In [36]:
print("model perfomance before training:")
evaluate(model, train_loader)
evaluate(model, test_loader)

model perfomance before training:
acc: 0.667, auc: 0.571, precision: 0.750, recall: 0.857, f1: 0.800
acc: 0.800, auc: 0.500, precision: 0.750, recall: 1.000, f1: 0.857


üèãÔ∏è‚Äç‚ôÇÔ∏è High-Level Summary

    Train for n_epochs (e.g. 10 times over the training dataset).

    For each mini-batch of data:

        Do a forward pass.

        Calculate the loss.

        Do a backward pass to compute gradients.

        Update model weights using the optimizer (SGD).

    After each epoch, print the loss and evaluate on both train & test sets.

In [37]:
# number of epochs to train the model
# feel free to change this
n_epochs = 10

# prep model for training
model.train()

for epoch in range(n_epochs):
    
    train_loss = 0
    for x, y in train_loader:
        """ Step 1. clear gradients """
        optimizer.zero_grad()
        """  Step 2. perform forward pass using `model`, save the output to y_hat """
        y_hat = model(x)
        """ Step 3. calculate the loss using `criterion`, save the output to loss. """
        # convert shape from [batch size, 1] to [batch size]
        y_hat = y_hat.view(y_hat.shape[0])
        loss = criterion(y_hat, y)
        """ Step 4. backward pass """
        loss.backward()
        """ Step 5. optimization """
        optimizer.step()
        """ Step 6. record loss """
        train_loss += loss.item()
        
    train_loss = train_loss / len(train_loader)
    print('Epoch: {} \tTraining Loss: {:.6f}'.format(epoch+1, train_loss))
    evaluate(model, train_loader)
    evaluate(model, test_loader)

Epoch: 1 	Training Loss: 0.655572
acc: 0.778, auc: 0.571, precision: 0.778, recall: 1.000, f1: 0.875
acc: 0.800, auc: 0.500, precision: 0.750, recall: 1.000, f1: 0.857
Epoch: 2 	Training Loss: 0.642107
acc: 0.778, auc: 0.643, precision: 0.778, recall: 1.000, f1: 0.875
acc: 0.800, auc: 0.500, precision: 0.750, recall: 1.000, f1: 0.857
Epoch: 3 	Training Loss: 0.620835
acc: 0.778, auc: 0.857, precision: 0.778, recall: 1.000, f1: 0.875
acc: 0.800, auc: 0.500, precision: 0.750, recall: 1.000, f1: 0.857
Epoch: 4 	Training Loss: 0.664610
acc: 0.778, auc: 0.929, precision: 0.778, recall: 1.000, f1: 0.875
acc: 0.800, auc: 0.500, precision: 0.750, recall: 1.000, f1: 0.857
Epoch: 5 	Training Loss: 0.593949
acc: 0.778, auc: 0.929, precision: 0.778, recall: 1.000, f1: 0.875
acc: 0.800, auc: 0.500, precision: 0.750, recall: 1.000, f1: 0.857
Epoch: 6 	Training Loss: 0.575149
acc: 0.778, auc: 1.000, precision: 0.778, recall: 1.000, f1: 0.875
acc: 0.600, auc: 0.167, precision: 0.600, recall: 1.000, f1

üí° How many updates happen per epoch?

    If train_loader gives batches of size 4 and your dataset has 100 patients,

    Then:
    100 patients / 4 per batch = 25 batches per epoch

    So the weights are updated 25 times per epoch (i.e., after each batch).

üìå What does "Stochastic Gradient Descent" really mean?

    Stochastic = using a random sample of the data to estimate the gradient

    That ‚Äúrandom sample‚Äù is typically a mini-batch (not just a single row)

    In PyTorch, torch.utils.data.DataLoader shuffles data & gives mini-batches

üÜö What if you updated after each row (i.e. pure SGD)?

    That would be very noisy and inefficient for large datasets.

    Most practical implementations use mini-batch SGD, which balances efficiency and convergence stability.

The result is bad due to very limited data. The model overfits the training data very fast.

You are encouraged to try this on the whole MIMIC-III dataset. The result will be much more promising!