[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](
https://colab.research.google.com/github/CMU-IDeeL/CMU-IDeeL.github.io/blob/master/F25/document/Recitation_0_Series/0.6/0_6_Google_Colab.ipynb)

# Recitation 0: Introduction to Google Colab

# What's in this video?

- Basics of Google Colab and its overview
- Bash and Magic Commands
- Session and Runtime
- Managing your files using Google Drive
- Saving and Loading Model Checkpoints
- Managing Dataset
- Colab Pro or Colab Pro+

# Basics

#### Google Colab

- Colab is developed by Google Research and provides a Jupyter Notebook-style Python execution environment accessible directly through a web browser.
- Its main benefit is access to computing resources.
- The free tier gives you access to CPUs and a Tesla T4 GPU.
- To access more powerful GPUs such as the L4 and A100, you can pay for Google Colab Pro or Pro+ (https://colab.research.google.com/signup).

#### Accessing Colab
- Go to https://colab.research.google.com/ to create and access your notebooks
- Directly from Google Drive
- From your GitHub repository
- Upload from local system

This recitation assumes basic knowledge of using Jupyter Notebooks, so please familiarize yourself with it if you haven't already.


# Bash and Magic Commands

Colab runs in a Linux environment, and you can run shell commands from a notebook cell by prefixing them with `!`.

#### Bash Commands

The `!nvidia-smi` command displays real-time GPU information, including:
- the GPU model (such as Tesla T4 or A100),
- memory usage (used vs. total),
- GPU utilization, which shows whether your code is actively using the GPU,
- temperature,
- driver and CUDA versions, which help you confirm compatibility with PyTorch.

In [None]:
!nvidia-smi

Mon Jul 14 05:04:36 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   32C    P0             46W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
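The same fields can also be read from Python instead of being scanned by eye. The sketch below is one possible approach, not part of the recitation notebook: it calls `nvidia-smi` with its `--query-gpu` and `--format=csv` options and returns an empty list on a CPU-only runtime. The helper names `parse_gpu_line` and `query_gpus` are hypothetical.

```python
import shutil
import subprocess

# Fields requested from `nvidia-smi --query-gpu`; this order fixes the CSV column order.
FIELDS = ["name", "memory.used", "memory.total",
          "utilization.gpu", "temperature.gpu", "driver_version"]


def parse_gpu_line(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(FIELDS, values))


def query_gpus():
    """Return one dict per GPU, or an empty list when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []  # CPU-only runtime: the tool is not on PATH
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []  # the tool exists but could not talk to a GPU
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]


for gpu in query_gpus():
    print(gpu)
```

With `nounits`, memory values are reported in MiB and utilization in percent, so the strings can be converted to numbers if you want to log them during training.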
                                                

In [None]:
!pip install torch
import gc
import os
import torch
import torch.nn as nn
import torch.optim as optim

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

In [None]:
!ls
# !cd ..   # note: `!cd` runs in a temporary subshell, so the change does not persist; use `%cd ..` instead
# !mkdir

sample_data


#### Magic commands
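- `%time` reports the CPU and wall-clock time of a single statement, so it is only reliable for CPU-bound code.
- GPU operations are launched asynchronously, so their elapsed time is harder to measure. A common rule of thumb is to time them manually: set `start = time.time()` before your code, set `end = time.time()` after it, and compute `elapsed = end - start`.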

In [None]:
%time result = [x**2 for x in range(100000)]
%time result = list(map(lambda x: x**2, range(100000)))

CPU times: user 7.31 ms, sys: 61 µs, total: 7.38 ms
Wall time: 7.33 ms
CPU times: user 9.79 ms, sys: 1.95 ms, total: 11.7 ms
Wall time: 11.7 ms
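The manual timing pattern described above can be wrapped in a small helper. This is a minimal sketch under a few assumptions: the helper name `timed` and its `sync` parameter are hypothetical, and it uses `time.perf_counter()` rather than `time.time()` because it is a monotonic clock better suited to measuring intervals.

```python
import time


def timed(fn, sync=None):
    """Run fn() and return (result, elapsed_seconds).

    sync is an optional zero-argument callable that blocks until pending
    device work has finished. It is called before the clock starts and
    again before the clock stops.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return result, elapsed


squares, seconds = timed(lambda: [x**2 for x in range(100000)])
print(f"{len(squares)} items in {seconds:.4f} s")
```

On a GPU runtime, passing `sync=torch.cuda.synchronize` should make the measurement include the time the GPU actually spends on the work, rather than only the time needed to launch it.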


# Runtime

In [None]:
!nvidia-smi
import torch
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device: ", DEVICE)

Mon Jul 14 05:05:02 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   32C    P0             46W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

### Utilizing Free GPU/TPU Resources

#### Changing Runtime
- Runtime > Change runtime type
- Select GPU/TPU and High-RAM option



#### GPUs: Training Time of ResNet50
- T4: 1x speedup (baseline)
- V100: 3.6x speedup (compared to T4)
- A100: 10x speedup (compared to T4)
- TPU: a completely different architecture that imposes many constraints on training
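As a rough worked example of what these speedup figures mean, the sketch below converts a T4 training time into estimates for the other GPUs. It assumes the ResNet50 speedups listed above carry over to your own workload, which is often not the case, and the 10-hour baseline is an illustrative number only.

```python
# Speedups relative to a T4, taken from the list above.
SPEEDUP_VS_T4 = {"T4": 1.0, "V100": 3.6, "A100": 10.0}


def estimated_hours(t4_hours, gpu):
    """Estimate training time on `gpu` from the time measured on a T4."""
    return t4_hours / SPEEDUP_VS_T4[gpu]


for gpu in SPEEDUP_VS_T4:
    print(f"{gpu}: {estimated_hours(10.0, gpu):.2f} h")
# T4: 10.00 h
# V100: 2.78 h
# A100: 1.00 h
```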

### Restart Session vs Restart Runtime


Restart session
- Restarts the Python session while staying connected to the same Colab backend VM, similar to restarting a Jupyter Notebook kernel.
- Runtime > Restart session
- Clears all session variables; files in the `/content` folder are kept

Restart runtime (disconnects the cloud-based VM in the backend)
- Frees up resources and discards all variables, files, and memory.
- Runtime > Disconnect and delete runtime
- Deletes the session
- Files in the `/content` folder are lost
- Switching GPUs also deletes the current runtime

In [None]:
torch.cuda.empty_cache()  # Clear unused GPU memory cached by PyTorch to free up space.
gc.collect()  # Call the Python's garbage collector to release unused memory.

30
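The two calls in the cell above can be combined into one helper. This is a sketch rather than the notebook's own code, and the name `free_memory` is hypothetical. It runs the garbage collector first, because `torch.cuda.empty_cache()` can only release cached blocks that no live tensor still occupies. The `try`/`except` guard only keeps the snippet importable outside Colab.

```python
import gc

try:
    import torch  # preinstalled on Colab
except ImportError:
    torch = None


def free_memory():
    """Collect unreachable Python objects, then release PyTorch's cached GPU blocks.

    Returns the number of unreachable objects found by the garbage collector.
    """
    collected = gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


print("Objects collected:", free_memory())
```

Memory held by tensors you still reference is not freed, so delete large variables (for example with `del`) before calling the helper.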

# Sample Helpful Code Snippets

### Mounting to Google Drive

Mounting is very useful because Colab's local runtime is temporary: when the session disconnects or the VM resets, all files in `/content` are lost. The mount command gives the Colab notebook **access** to your Google Drive.
After mounting, you can read and write files under `/content/drive/MyDrive`.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive
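Once the Drive is mounted, writing to it works like writing to any other folder. The sketch below picks a Drive folder when the mount exists and falls back to a local folder otherwise, so the same code also runs outside Colab. The helper `output_dir`, the `rec0_demo` folder, and the `./outputs` fallback are hypothetical names chosen for illustration.

```python
import os

# Where "My Drive" appears after drive.mount("/content/drive").
DRIVE_ROOT = "/content/drive/MyDrive"


def output_dir(subdir, drive_root=DRIVE_ROOT, fallback="./outputs"):
    """Return a writable directory on Drive if it is mounted, else under `fallback`."""
    base = drive_root if os.path.isdir(drive_root) else fallback
    path = os.path.join(base, subdir)
    os.makedirs(path, exist_ok=True)
    return path


notes_dir = output_dir("rec0_demo")
with open(os.path.join(notes_dir, "hello.txt"), "w") as f:
    f.write("written from Colab\n")
print("Saved to", notes_dir)
```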


### Saving/Loading files - Model checkpoints

A checkpoint is a saved snapshot of your model’s **state** at a particular point in training.
It lets you resume training later or reload the model for inference — even if the Colab session crashes or disconnects.

Checkpoints typically save:
- Model parameters (weights)
- Optimizer and scheduler state (for example, the current learning rate)
- Loss (the metric the model aims to minimize, which measures how wrong a prediction is compared to the true value)
- Epoch/step number (how far along training is)

In the next section, we are saving the model weights, optimizer state, scheduler state, training epoch, and metrics into a checkpoint file on Google Drive.

The model weights capture what the model has learned so far, while the optimizer and scheduler states ensure that learning can resume with the exact same configuration. The epoch tells us how far along training was, and the metrics (like accuracy) help track performance at that point.

All of this is bundled into a `.pt` file so that we can reload it later and pick up exactly where we left off.


In [None]:
class MLP(nn.Module):
    def __init__(self, size):
        super(MLP, self).__init__()
        self.layers = []
        for in_dim, out_dim in zip(size[:-2], size[1:-1]):
            self.layers.extend([
                nn.Linear(in_dim, out_dim),
                nn.ReLU(),
                nn.BatchNorm1d(out_dim),
                nn.Dropout(0.5),
            ])
        self.layers.append(nn.Linear(size[-2], size[-1]))
        self.model = nn.Sequential(*self.layers)
        self.model.apply(self.init_param)

    def init_param(self, param):
        if isinstance(param, nn.Linear):
            nn.init.xavier_uniform_(param.weight)

    def forward(self, x):
        return self.model(x)

# Define your model
model = MLP([40, 2048, 512, 256, 71])

# Define the epoch, metrics, optimizer, and scheduler
epoch = 5
metrics = {'accuracy': 0.85}
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

#### Checkpoint Saving and Loading Model

In [None]:
def save_model(model, optimizer, scheduler, metrics, epoch, path):
    """
    Saves the model and other related states to a checkpoint file.

    Functionality:
    - Saves the model's state dictionary, optimizer state, scheduler state,
      metrics, and epoch to the specified file checkpoint path.
    """
    torch.save(
        {'model_state_dict'         : model.state_dict(),
         'optimizer_state_dict'     : optimizer.state_dict(),
         'scheduler_state_dict'     : scheduler.state_dict(),
         'metric'                   : metrics,
         'epoch'                    : epoch},
         path)

def load_model(model, optimizer=None, scheduler=None, path='./checkpoint.pth'):
    """
    Loads the model and other related states from a checkpoint file.

    Functionality:
    - Loads the checkpoint from the specified file path using `torch.load`.
    - Restores the model's state dictionary from the checkpoint.
    - Optionally restores the optimizer and scheduler states if they are provided.
    """
    # weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
    checkpoint = torch.load(path, weights_only=False)
    model.load_state_dict(checkpoint['model_state_dict'])
    # Optimizer and scheduler are restored only when the caller passes them in.
    if optimizer is not None:
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    if scheduler is not None:
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
    epoch = checkpoint['epoch']
    metrics = checkpoint['metric']
    return model, optimizer, scheduler, epoch, metrics

#### Saving Model

In [None]:
# Define the directory and checkpoint's file path.
CHECKPOINT_DIR = '/content/drive/MyDrive/Checkpoints'
MODEL_SAVE_PATH = os.path.join(CHECKPOINT_DIR, '11785_f25_rec0_google_colab_checkpoint.pt')
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

# Save the model.
save_model(model, optimizer, scheduler, metrics, epoch, MODEL_SAVE_PATH)
print(f"Model saved to {MODEL_SAVE_PATH}")

Model saved to /content/drive/MyDrive/Checkpoints/11785_f25_rec0_google_colab_checkpoint.pt


#### Loading Model

In [None]:
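# Create a new instance of the same model architecture.
loaded_model = MLP([40, 2048, 512, 256, 71])

# Create a fresh optimizer and scheduler bound to the new model's parameters,
# so that resumed training updates loaded_model rather than the original model.
new_optimizer = optim.Adam(loaded_model.parameters(), lr=0.001)
new_scheduler = optim.lr_scheduler.StepLR(new_optimizer, step_size=10, gamma=0.1)

# Load the model, optimizer, and other saved states.
loaded_model, loaded_optimizer, loaded_scheduler, loaded_epoch, loaded_metrics = load_model(
    loaded_model, new_optimizer, new_scheduler, MODEL_SAVE_PATH
)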

# Verify the loaded states.
print(f"Model loaded. Resumed at epoch {loaded_epoch} with metrics: {loaded_metrics}")

Model loaded. Resumed at epoch 5 with metrics: {'accuracy': 0.85}
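The loaded epoch is what lets training pick up where it left off. The sketch below shows one way to structure a resumable loop. It assumes the stored epoch is 0-indexed and was fully completed before the checkpoint was written. `run_training`, `train_one_epoch`, and `save_fn` are hypothetical names, and the lambdas stand in for real training and saving code.

```python
def run_training(train_one_epoch, save_fn, total_epochs, last_epoch=-1):
    """Run the remaining epochs and save a checkpoint after each one.

    last_epoch is the epoch number stored in the checkpoint, or -1 for a
    fresh start. Returns the list of epochs that were run.
    """
    completed = []
    for epoch in range(last_epoch + 1, total_epochs):
        metrics = train_one_epoch(epoch)
        save_fn(epoch, metrics)  # a crash after this point loses at most one epoch
        completed.append(epoch)
    return completed


saved = {}
done = run_training(
    train_one_epoch=lambda epoch: {"accuracy": 0.0},  # placeholder for real training
    save_fn=lambda epoch, metrics: saved.update(epoch=epoch, metrics=metrics),
    total_epochs=8,
    last_epoch=5,  # for example, the epoch returned by load_model
)
print(done)  # [6, 7]
```

In this notebook, `save_fn` could wrap a call to `save_model(model, optimizer, scheduler, metrics, epoch, MODEL_SAVE_PATH)`, and `last_epoch` would be `loaded_epoch`.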


### Managing dataset

Obtaining dataset
- Kaggle Command
- Manually uploading
- Download/uploading dataset every time
- Move dataset from Google Drive into content folder
- Connect to GCP or AWS

In [None]:
# Downloads dataset from kaggle
!pip install --upgrade --force-reinstall --no-deps kaggle==1.5.8
!mkdir -p /root/.kaggle

# Retrieve the Kaggle username and API key stored in Colab's Secrets panel
import json
from google.colab import userdata
kaggle_username = userdata.get('USER_NAME')
kaggle_api_key = userdata.get('KAGGLE_API_KEY')

# Creates Kaggle config file with retrieved username and API key
with open("/root/.kaggle/kaggle.json", "w") as f:
    json.dump({"username": kaggle_username, "key": kaggle_api_key}, f)

# Sets appropriate permissions for Kaggle config file
!chmod 600 /root/.kaggle/kaggle.json

# Make sure to join the competition on Kaggle before running this command!!
# Downloads dataset of the competition using Kaggle API.
!kaggle competitions download -c 11785-spring-25-hw-1-p-2

# Unzips downloaded dataset into given directory folder.
!unzip -qo /content/11785-spring-25-hw-1-p-2.zip -d '/content/11785-spring-25-hw-1-p-2'

Collecting kaggle==1.5.8
  Downloading kaggle-1.5.8.tar.gz (59 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.2/59.2 kB 2.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.8-py3-none-any.whl size=73249 sha256=5831f813bc7a66e9c65b3df67bc4ac15f1b5295fa4af87fe32c9a5d40d3bbe55
  Stored in directory: /root/.cache/pip/wheels/b5/23/bd/d33cbf399584fa44fa049711892d333954a50ed4b86948109e
Successfully built kaggle
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.7.4.5
    Uninstalling kaggle-1.7.4.5:
      Successfully uninstalled kaggle-1.7.4.5
Successfully installed kaggle-1.5.8
401 - Unauthorized
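For the option of moving a dataset from Google Drive into the content folder, the sketch below copies a zip archive to local disk and extracts it there. Reading many small files directly from a mounted Drive tends to be slower than reading from the VM's local disk. The helper `stage_dataset`, the `/content/data` default, and the example path in the comment are hypothetical.

```python
import os
import shutil
import zipfile


def stage_dataset(zip_on_drive, local_root="/content/data"):
    """Copy a zip archive to local disk, extract it, and return the extraction directory."""
    os.makedirs(local_root, exist_ok=True)
    local_zip = os.path.join(local_root, os.path.basename(zip_on_drive))
    shutil.copy(zip_on_drive, local_zip)  # one large sequential copy from Drive

    # Extract into a folder named after the archive, e.g. data.zip -> <local_root>/data
    name = os.path.splitext(os.path.basename(zip_on_drive))[0]
    extract_dir = os.path.join(local_root, name)
    with zipfile.ZipFile(local_zip) as archive:
        archive.extractall(extract_dir)

    os.remove(local_zip)  # free local disk space once extraction is done
    return extract_dir


# Example call (hypothetical path on a mounted Drive):
# data_dir = stage_dataset("/content/drive/MyDrive/Datasets/my_dataset.zip")
```

The extracted copy lives in the temporary runtime, so it has to be staged again after the runtime is deleted; the archive on Drive is left untouched.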

# Important Considerations for Students

- Session Timeout: Google Colab sessions may time out after a certain period of inactivity. To prevent this, remember to save your work frequently and consider using Colab Pro to extend session runtimes.

- Limited Persistent Storage: While Google Colab saves your notebooks on Google Drive, the storage space is limited. Make sure to clean up unnecessary files or download your work to your local machine to free up space.

- Resource Limits: Free Google Colab accounts have some resource limitations, such as GPU availability and maximum session runtimes. For resource-intensive projects, consider upgrading to Colab Pro for improved performance.

# Colab Pro

- Longer session runtime, reducing risk of timeout
- Priority access to GPU
- Increased storage
- Background Execution (Google Colab Pro+)