## Download data

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits

In [1]:
!pip install pytorch-lightning


In [1]:
from sklearn.datasets import load_digits

In [2]:
data = load_digits().data

In [3]:
data.shape

(1797, 64)

In [4]:
data[0]

array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])

In [5]:
targets = load_digits().target

In [6]:
targets.shape

(1797,)

In [7]:
targets[0]

0

### Split data into training and testing

To split the data into training and test sets, use the `train_test_split()` function from sklearn.

The `stratify` parameter keeps the class distribution the same in the training and test sets. We pass `stratify=targets` so that each digit appears in both sets in the same proportion as in the full dataset.

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
train_X, test_X, train_y, test_y = train_test_split(data, targets, test_size=0.2, stratify=targets)

In [10]:
train_X.shape

(1437, 64)

In [11]:
train_X[0]

array([ 0.,  0.,  9., 15., 13.,  0.,  0.,  0.,  0.,  5., 14.,  7., 13.,
        2.,  0.,  0.,  0., 12., 10.,  1., 13.,  0.,  0.,  0.,  0.,  4.,
        7.,  6., 11.,  0.,  0.,  0.,  0.,  0.,  0., 10.,  6.,  0.,  0.,
        0.,  0.,  0.,  1., 15.,  0.,  0.,  0.,  0.,  0.,  0.,  9., 11.,
        0.,  6.,  5.,  0.,  0.,  0., 11., 16., 16., 16., 16.,  3.])

In [12]:
test_X.shape

(360, 64)

In [13]:
train_X[1]

array([ 0.,  1., 14., 16., 12.,  0.,  0.,  0.,  0.,  5., 16.,  9., 16.,
        6.,  0.,  0.,  0.,  3., 11.,  0., 14.,  9.,  0.,  0.,  0.,  0.,
        0.,  0., 10., 10.,  0.,  0.,  0.,  0.,  0.,  0., 14., 10.,  0.,
        0.,  0.,  0.,  0., 10., 16.,  5.,  0.,  0.,  0.,  2., 15., 16.,
       14.,  8., 12.,  2.,  0.,  0., 11., 16., 16., 16., 15.,  5.])

In [14]:
train_y

array([2, 2, 8, ..., 8, 9, 4])
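
The effect of `stratify` can be checked by comparing class shares across the splits. The following is a minimal sketch, not part of the original notebook; `random_state=0` is an assumption added only to make the result repeatable.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X, y = digits.data, digits.target

# random_state is an assumption, added only so the sketch is repeatable
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Share of each digit class in the full set and in both splits
full_share = np.bincount(y) / len(y)
train_share = np.bincount(train_y) / len(train_y)
test_share = np.bincount(test_y) / len(test_y)

for digit in range(10):
    print(
        f"{digit}: full={full_share[digit]:.3f} "
        f"train={train_share[digit]:.3f} test={test_share[digit]:.3f}"
    )
```

With stratification the three columns should agree to within rounding; without it, the shares in the smaller test split can drift noticeably.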

## Datamodule

### Dataset

In [15]:
from torch.utils.data import Dataset
import pytorch_lightning as pl

In [16]:
class DigitsDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        x = self.data[idx] / 16.0  # load_digits pixel values range from 0 to 16, not 0 to 255
        y = self.targets[idx]
        
        return x, y

### Datamodule

https://pytorch-lightning.readthedocs.io/en/stable/data/datamodule.html

A `LightningDataModule` is a class that groups everything needed to prepare the datasets.

`prepare_data` - downloads or loads the data

`setup` - splits the data into training and test sets and builds the datasets

`train_dataloader`, `val_dataloader` and `test_dataloader` - return the dataloader for the corresponding split

In [17]:
from torch.utils.data import DataLoader

In [19]:
class DigitsDatamodule(pl.LightningDataModule):
    def __init__(self, batch_size = 32):
        super().__init__()
        self.batch_size = batch_size
    
    def prepare_data(self):
        self.data = load_digits().data
        self.targets = load_digits().target
    
    def setup(self, stage=None):
        self.train_X, self.test_X, self.train_y, self.test_y = train_test_split(
            self.data,
            self.targets,
            train_size=0.8,
            stratify=self.targets
        )
        
        self.train_dataset = DigitsDataset(self.train_X, self.train_y)
        self.test_dataset = DigitsDataset(self.test_X, self.test_y)
    
    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True)
    
    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size, shuffle=False)  # evaluation data does not need shuffling


## Model

In [20]:
from torch import nn
import torch.nn.functional as F
from torch import optim

https://pytorch-lightning.readthedocs.io/en/stable/starter/converting.html

The `LightningModule` class combines the definitions of the model and the way it is trained

`__init__` - contains definitions of layers used in the model

`forward` - contains definitions of how the input passes through all layers

`configure_optimizers` - creates and returns an optimizer

`training_step` - implements model training from the perspective of one batch

`validation_step` - implements model validation from the perspective of one batch

In [21]:
class DigitsModel(pl.LightningModule):
    def __init__(self, input_size, num_classes):
        super().__init__()
        
        self.loss_function = nn.CrossEntropyLoss()
        
        self.fc1 = nn.Linear(input_size, 100)
        self.fc2 = nn.Linear(100, num_classes)
        
    def forward(self, x):
        out = self.fc1(x)
        out = F.relu(out)
        out = self.fc2(out)
        
        # Return raw logits in every mode: nn.CrossEntropyLoss applies log-softmax itself
        
        return out
    
    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters())
        return optimizer
    
    def training_step(self, batch, batch_idx):
        inputs, labels = batch # returns x and y
        outputs = self.forward(inputs.float()) # prediction
        loss = self.loss_function(outputs, labels.long())
        
        self.log('train_loss', loss)
        
        return loss
    
    def validation_step(self, batch, batch_idx):
        inputs, labels = batch # returns x and y
        outputs = self.forward(inputs.float()) # prediction
        loss = self.loss_function(outputs, labels.long())
        
        self.log('val_loss', loss)
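
`nn.CrossEntropyLoss` expects raw logits, so any softmax belongs outside the loss computation. The sketch below uses made-up logits for two samples and three classes to show that the loss equals log-softmax followed by the negative log-likelihood, and that feeding probabilities instead distorts the value.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Made-up logits for 2 samples and 3 classes, with integer class labels
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 3.0, 0.3]])
labels = torch.tensor([0, 1])

loss_fn = nn.CrossEntropyLoss()
loss_from_logits = loss_fn(logits, labels)

# Equivalent computation: log-softmax followed by negative log-likelihood
manual_loss = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Passing probabilities applies softmax a second time inside the loss
loss_from_probs = loss_fn(F.softmax(logits, dim=1), labels)

print(loss_from_logits.item(), manual_loss.item(), loss_from_probs.item())
```

The first two values match, while the third is larger because the probabilities are squashed into a narrow range before the internal softmax.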


## Training loop

Create a datamodule - it contains the training and test sets

In [22]:
data_module = DigitsDatamodule()

In [23]:
data_module.prepare_data()

In [24]:
data_module.setup()

In [25]:
print(next(iter(data_module.train_dataloader()))[0].shape)

torch.Size([32, 64])


In [26]:
print(next(iter(data_module.train_dataloader()))[1].shape)

torch.Size([32])


Create a model - it contains the training logic

In [27]:
model = DigitsModel(64, 10)

Create a trainer - an object in which we set training parameters such as the number of epochs, GPU usage, etc.

In [28]:
trainer = pl.Trainer(
    max_epochs=100,
    accelerator="gpu",
    log_every_n_steps=10,
)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


We start the training with the `fit` method to which we provide the model and datamodule

In [29]:
trainer.fit(model, data_module)

C:\Users\Kubus\my_projects\pytorch_tutor\.venv\Lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
Missing logger folder: C:\Users\Kubus\my_projects\pytorch_tutor\lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type             | Params | Mode 
-----------------------------------------------------------
0 | loss_function | CrossEntropyLoss | 0      | train
1 | fc1           | Linear           | 6.5 K  | train
2 | fc2           | Linear           | 1.0 K  | train
-----------------------------------------------------------
7.5 K     Trainable params
0         Non-trainable params
7.5 K     Total params
0.030     Total estimated model params size (MB)
C:\Users\Kubus\my_projects\pytorch_tutor\.venv\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Conside

Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=100` reached.


In [51]:
!pip install tensorboard


In [32]:
%reload_ext tensorboard
%tensorboard --logdir=./lightning_logs/

Reusing TensorBoard on port 6006 (pid 20496), started 22:21:10 ago. (Use '!kill 20496' to kill it.)

## Saving the model

In [53]:
trainer.save_checkpoint('model.ckpt')

## Loading the model

In [54]:
new_model = DigitsModel.load_from_checkpoint('model.ckpt', input_size = 64, num_classes = 10)

## Metrics

In [55]:
!pip install torchmetrics






In [33]:
import torchmetrics

We create the metrics in the `__init__` method. It is important that the training and validation metrics are separate objects, because each metric accumulates its own state.

The metrics are updated in the `training_step` and `validation_step` methods.

In [34]:
class DigitsModel(pl.LightningModule):
    def __init__(self, input_size, num_classes):
        super().__init__()
        
        self.loss_function = nn.CrossEntropyLoss()
        
        self.fc1 = nn.Linear(input_size, 100)
        self.fc2 = nn.Linear(100, num_classes)
        
        self.train_accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes) # training metric
        self.val_accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes) # validation metric
    
    def forward(self, x):
        out = self.fc1(x)
        out = F.relu(out)
        out = self.fc2(out)
        
        # Return raw logits in every mode: nn.CrossEntropyLoss applies log-softmax itself
        
        return out
    
    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters())
        return optimizer
    
    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        
        outputs = self.forward(inputs.float())
        loss = self.loss_function(outputs, labels.long())
        
        self.log('train_loss', loss)
        
        outputs = F.softmax(outputs, dim=1)
        
        self.train_accuracy(outputs, labels)
        self.log('train_accuracy', self.train_accuracy, on_epoch=True, on_step=False) # logs the training-set accuracy once per epoch
        
        return loss

    def validation_step(self, batch, batch_idx):
        inputs, labels = batch
        
        outputs = self.forward(inputs.float())
        loss = self.loss_function(outputs, labels.long())
        
        self.log('val_loss', loss)
        outputs = F.softmax(outputs, dim=1)
        self.val_accuracy(outputs, labels)
        self.log('val_accuracy', self.val_accuracy, on_epoch=True, on_step=False)
        
        return loss


In [35]:
model_1 = DigitsModel(64, 10)

In [36]:
trainer = pl.Trainer(max_epochs=100, accelerator="gpu")

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


In [37]:
trainer.fit(model_1, data_module)

C:\Users\Kubus\my_projects\pytorch_tutor\.venv\Lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name           | Type               | Params | Mode 
--------------------------------------------------------------
0 | loss_function  | CrossEntropyLoss   | 0      | train
1 | fc1            | Linear             | 6.5 K  | train
2 | fc2            | Linear             | 1.0 K  | train
3 | train_accuracy | MulticlassAccuracy | 0      | train
4 | val_accuracy   | MulticlassAccuracy | 0      | train
--------------------------------------------------------------
7.5 K     Trainable params
0         Non-trainable params
7.5 K     Total params
0.030     Total estimated model params size (MB)
C:\Users\Kubus\my_projects\pytorch_tutor\.venv\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:424: The 'train_dataloader' does not

Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=100` reached.


In [38]:
%reload_ext tensorboard
%tensorboard --logdir=./lightning_logs/


Reusing TensorBoard on port 6006 (pid 20496), started 22:22:01 ago. (Use '!kill 20496' to kill it.)

## Additional metrics

In [39]:
class DigitsModel(pl.LightningModule):
    def __init__(self, input_size, num_classes):
        super().__init__()
        
        self.loss_function = nn.CrossEntropyLoss()
        
        self.fc1 = nn.Linear(input_size, 100)
        self.fc2 = nn.Linear(100, num_classes)
        
        self.train_accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)
        self.val_accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)
        
        self.train_macro_f1 = torchmetrics.F1Score(num_classes=num_classes, average='macro', task="multiclass")
        self.val_macro_f1 = torchmetrics.F1Score(num_classes=num_classes, average='macro', task="multiclass")
    
    def forward(self, x):
        out = self.fc1(x)
        out = F.relu(out)
        out = self.fc2(out)
        # Return raw logits in every mode: nn.CrossEntropyLoss applies log-softmax itself
        
        return out
    
    def configure_optimizers(self):
        return optim.Adam(self.parameters())
    
    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self.forward(inputs.float())
        loss = self.loss_function(outputs, labels.long())
        self.log('train_loss', loss)
        
        self.train_accuracy(outputs, labels)
        self.log('train_accuracy', self.train_accuracy, on_epoch=True, on_step=False)
        
        self.train_macro_f1(outputs, labels)
        self.log("train_macro_f1", self.train_macro_f1, on_epoch=True, on_step=False)
        
        return loss
    
    def validation_step(self, batch, batch_idx):
        inputs, labels = batch
        
        outputs = self.forward(inputs.float())
        loss = self.loss_function(outputs, labels.long())
        self.log('val_loss', loss)
        
        self.val_accuracy(outputs, labels)
        self.log('val_accuracy', self.val_accuracy, on_epoch=True, on_step=False)
        
        self.val_macro_f1(outputs, labels)
        self.log("val_macro_f1", self.val_macro_f1, on_epoch=True, on_step=False)
        
        return loss


In [40]:
model = DigitsModel(64, 10)

In [41]:
trainer = pl.Trainer(
    max_epochs=100,
    accelerator="gpu",
)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


In [42]:
trainer.fit(model, data_module)

C:\Users\Kubus\my_projects\pytorch_tutor\.venv\Lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name           | Type               | Params | Mode 
--------------------------------------------------------------
0 | loss_function  | CrossEntropyLoss   | 0      | train
1 | fc1            | Linear             | 6.5 K  | train
2 | fc2            | Linear             | 1.0 K  | train
3 | train_accuracy | MulticlassAccuracy | 0      | train
4 | val_accuracy   | MulticlassAccuracy | 0      | train
5 | train_macro_f1 | MulticlassF1Score  | 0      | train
6 | val_macro_f1   | MulticlassF1Score  | 0      | train
--------------------------------------------------------------
7.5 K     Trainable params
0         Non-trainable params
7.5 K     Total params
0.030     Total estimated model params size (MB)
C:\Users\Kubus\my_projects\pytorch_tutor\.

Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=100` reached.


In [43]:
%reload_ext tensorboard
%tensorboard --logdir=./lightning_logs/

Reusing TensorBoard on port 6006 (pid 20496), started 22:22:54 ago. (Use '!kill 20496' to kill it.)

# Exercises

The aim of the exercise is to create a model to detect whether a patient has diabetes based on laboratory results

https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset

Download data

In [101]:
!pip install requests






In [102]:
!wget -O diabetes_data.csv https://pastebin.com/raw/qdrUE0E0 # does not work on windows

'wget' is not recognized as an internal or external command,
operable program or batch file.
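
Since `wget` is not available on Windows by default, the file can also be fetched from Python. The following is a minimal cross-platform sketch; the helper name `download` is hypothetical, and the standard-library `urllib` is used here in place of the `requests` package installed above.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download(url: str, destination: str) -> Path:
    """Download `url` to `destination` unless the file already exists."""
    target = Path(destination)
    if not target.exists():
        urlretrieve(url, str(target))
    return target


# Usage with the URL from the cell above (requires network access):
# download("https://pastebin.com/raw/qdrUE0E0", "diabetes_data.csv")
```

Skipping the download when the file already exists keeps repeated notebook runs from hitting the server again.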


In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('diabetes_data.csv')

Preview of the dataset

We will try to predict the value of the `Outcome` column from the laboratory results and patient characteristics stored in the other columns.

In [3]:
data

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [4]:
feature_columns = data.drop(columns=['Outcome'])

In [5]:
feature_columns

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63
764,2,122,70,27,0,36.8,0.340,27
765,5,121,72,23,112,26.2,0.245,30
766,1,126,60,0,0,30.1,0.349,47


We need to standardize the data for each column - subtract the mean and divide by the standard deviation

In [6]:
def standardize(x):
    x_std = x.copy(deep=True)
    for c in feature_columns:
        x_std[c] = (x_std[c] - x_std[c].mean()) / x_std[c].std()
    
    return x_std

data = standardize(data)
data.head()


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,0.63953,0.847771,0.149543,0.906679,-0.692439,0.20388,0.468187,1.425067,1
1,-0.844335,-1.122665,-0.160441,0.530556,-0.692439,-0.683976,-0.364823,-0.190548,0
2,1.233077,1.942458,-0.263769,-1.287373,-0.692439,-1.102537,0.604004,-0.105515,1
3,-0.844335,-0.997558,-0.160441,0.154433,0.123221,-0.493721,-0.920163,-1.040871,0
4,-1.141108,0.503727,-1.503707,0.906679,0.765337,1.408828,5.481337,-0.020483,1


In [7]:
data.to_numpy()

array([[ 0.63953049,  0.84777132,  0.1495433 , ...,  0.46818687,
         1.42506672,  1.        ],
       [-0.84433482, -1.12266474, -0.16044119, ..., -0.36482303,
        -0.19054773,  0.        ],
       [ 1.23307662,  1.94245802, -0.26376935, ...,  0.6040037 ,
        -0.10551539,  1.        ],
       ...,
       [ 0.34275743,  0.00329872,  0.1495433 , ..., -0.68474712,
        -0.27558007,  0.        ],
       [-0.84433482,  0.15968254, -0.47042568, ..., -0.37085933,
         1.1699697 ,  1.        ],
       [-0.84433482, -0.87245064,  0.04621514, ..., -0.4734765 ,
        -0.87080644,  0.        ]])
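
Standardization can be verified on a small example: after the transformation every feature column should have mean 0 and standard deviation 1. The sketch below uses the `Glucose` and `BMI` values of the first four rows shown earlier and a vectorized form of the same formula; note that pandas `.std()` computes the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

# First four rows of two feature columns, plus the target column
df = pd.DataFrame(
    {
        "Glucose": [148.0, 85.0, 183.0, 89.0],
        "BMI": [33.6, 26.6, 23.3, 28.1],
        "Outcome": [1, 0, 1, 0],
    }
)
features = [c for c in df.columns if c != "Outcome"]

standardized = df.copy()
# Subtract the column mean and divide by the column standard deviation
standardized[features] = (df[features] - df[features].mean()) / df[features].std()

print(standardized[features].mean().round(6))
print(standardized[features].std().round(6))
```

The `Outcome` column is left untouched because it is the label, not a feature.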

## 1. Create dataset and datamodule

### a) Divide the data into input (x) and output (y):

y - outcome column
x - remaining columns

Convert x and y to numpy

In [8]:
y = data["Outcome"]
x = data.drop("Outcome", axis=1)

In [9]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [10]:
import numpy as np

x = np.array(x)
y = y.to_numpy(dtype=float).reshape(-1, 1)  # float targets as a column vector

x

array([[ 0.63953049,  0.84777132,  0.1495433 , ...,  0.20387991,
         0.46818687,  1.42506672],
       [-0.84433482, -1.12266474, -0.16044119, ..., -0.68397621,
        -0.36482303, -0.19054773],
       [ 1.23307662,  1.94245802, -0.26376935, ..., -1.10253696,
         0.6040037 , -0.10551539],
       ...,
       [ 0.34275743,  0.00329872,  0.1495433 , ..., -0.73471085,
        -0.68474712, -0.27558007],
       [-0.84433482,  0.15968254, -0.47042568, ..., -0.24004815,
        -0.37085933,  1.1699697 ],
       [-0.84433482, -0.87245064,  0.04621514, ..., -0.20199718,
        -0.4734765 , -0.87080644]])

In [11]:
y

array([[1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],

In [12]:
from torch.utils.data import Dataset, DataLoader

### b) Create dataset class

In [13]:
class DiabetesDataset(Dataset):
    def __init__(self, data, targets):
        self.data = data
        self.targets = targets
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        x = self.data[idx]
        y = self.targets[idx]
        
        return x, y


In [14]:
dataset = DiabetesDataset(x, y)

In [15]:
sample_data, sample_target = dataset[0]

In [16]:
sample_data

array([ 0.63953049,  0.84777132,  0.1495433 ,  0.90667906, -0.69243932,
        0.20387991,  0.46818687,  1.42506672])

In [17]:
sample_target

array([1.])

### c) Create Datamodule class

The class should:
* In the `prepare_data` method:
    * load data from disk
* In the `setup` method:
    * standardize data
    * divide into x and y
    * split x and y into training and test sets
    * create training and test datasets from the DiabetesDataset class
* In the `train_dataloader` method
    * return the dataloader for the training set
* In the `val_dataloader` method
    * return dataloader for validation set

In [18]:
import pytorch_lightning as pl
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

In [19]:
class DiabetesDataModule(pl.LightningDataModule):
    def __init__(self, batch_size = 32):
        super().__init__()
        self.batch_size = batch_size
    
    def prepare_data(self) -> None:
        self.data = pd.read_csv('diabetes_data.csv')
        
    def _standardize(self, x):
        x_std = x.copy(deep=True)
        feature_columns = x.drop(columns=['Outcome'])  # use the argument, not the global `data`
    
        for c in feature_columns:
            x_std[c] = (x_std[c] - x_std[c].mean()) / x_std[c].std()
        
        return x_std
    
    def setup(self, stage = None):
        data = self._standardize(self.data)
        y = data["Outcome"]
        x = data.drop("Outcome", axis=1)
        
        x = np.array(x)
        y = y.to_numpy(dtype=float).reshape(-1, 1)  # float targets as a column vector
        
        train_X, test_X, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=42)
        
        self.train_dataset = DiabetesDataset(train_X, train_y)
        self.test_dataset = DiabetesDataset(test_X, test_y)
    
    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True)
    
    def val_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size, shuffle=False)


In [20]:
data_module = DiabetesDataModule()

In [21]:
data_module.prepare_data()

In [22]:
data_module.setup()

Test - inspect the shape of one batch

In [23]:
next(iter(data_module.train_dataloader()))[0].shape

torch.Size([32, 8])

In [24]:
next(iter(data_module.train_dataloader()))[1].shape

torch.Size([32, 1])

## 2. Create and train model

Implement any neural network to solve the classification problem

Use binary cross-entropy (`BCELoss`) as the loss function, since we are dealing with binary classification

The sigmoid function should be used as the last activation function in the network

Add a precision metric for the training and validation sets
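
The sigmoid matters because `nn.BCELoss` expects probabilities in the range [0, 1]. The sketch below, with made-up logits, shows the two usual pairings: sigmoid followed by `BCELoss`, or `BCEWithLogitsLoss` applied to the raw logits. On the CPU, passing raw logits to `BCELoss` raises an error; on a GPU the same mistake typically surfaces as a `CUDA error: device-side assert triggered`.

```python
import torch
from torch import nn

# Made-up raw network outputs (any real value) and float targets of shape (N, 1)
logits = torch.tensor([[2.0], [-1.5], [0.3]])
labels = torch.tensor([[1.0], [0.0], [1.0]])

# Option 1: squash the logits into (0, 1) first, then apply BCELoss
probs = torch.sigmoid(logits)
bce = nn.BCELoss()(probs, labels)

# Option 2: BCEWithLogitsLoss fuses the sigmoid into the loss
bce_logits = nn.BCEWithLogitsLoss()(logits, labels)

# Raw logits fall outside [0, 1], so BCELoss rejects them
try:
    nn.BCELoss()(logits, labels)
    raised = False
except (RuntimeError, ValueError):
    raised = True

print(bce.item(), bce_logits.item(), raised)
```

Both options give the same loss value; the fused variant is generally the more numerically stable of the two.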

In [25]:
import torch.nn.functional as F
from torch import nn
from torch import optim
import torchmetrics
import torch

In [35]:
class DiabetesNet(pl.LightningModule):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        
        self.loss_function = nn.BCELoss()
        
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()
        
        # Binary task: the metrics threshold the predicted probability at 0.5
        self.train_accuracy = torchmetrics.Accuracy(task="binary")
        self.val_accuracy = torchmetrics.Accuracy(task="binary")
        
        self.train_pred = torchmetrics.Precision(task="binary")
        self.val_pred = torchmetrics.Precision(task="binary")
        
    def forward(self, x):
        out = self.fc1(x)
        out = F.relu(out)
        out = self.fc2(out)
        
        # nn.BCELoss expects probabilities in [0, 1], so the sigmoid comes before the loss
        return self.sigmoid(out)
    
    def configure_optimizers(self):
        return optim.Adam(self.parameters())
    
    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        inputs = inputs.float()
        labels = labels.float()
        outputs = self.forward(inputs)
        
        loss = self.loss_function(outputs, labels)
        self.log('train_loss', loss)
        
        # Metrics take the probabilities and integer 0/1 targets
        self.train_accuracy(outputs, labels.int())
        self.log('train_accuracy', self.train_accuracy, on_epoch=True, on_step=False)
        
        self.train_pred(outputs, labels.int())
        self.log("train_pred", self.train_pred, on_epoch=True, on_step=False)
        
        return loss
    
    def validation_step(self, batch, batch_idx):
        inputs, labels = batch
        inputs = inputs.float()
        labels = labels.float()
        outputs = self.forward(inputs)
        loss = self.loss_function(outputs, labels)
        self.log('val_loss', loss)
        
        self.val_accuracy(outputs, labels.int())
        self.log('val_accuracy', self.val_accuracy, on_epoch=True, on_step=False)
        
        self.val_pred(outputs, labels.int())
        self.log("val_pred", self.val_pred, on_epoch=True, on_step=False)
        
        return loss


In [36]:
model = DiabetesNet(8, 32)

In [37]:
trainer = pl.Trainer(
    max_epochs=100,
    accelerator="gpu",
)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


In [38]:
trainer.fit(model, data_module)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


In [188]:
%reload_ext tensorboard
%tensorboard --logdir=./lightning_logs/

Reusing TensorBoard on port 6006 (pid 20496), started 1:39:19 ago. (Use '!kill 20496' to kill it.)