CSYE7105 Assignment
Part 1: 34 Points
Question 1: Difference Between CPU and GPU [5 Points]
Central Processing Unit (CPU):

Purpose: General-purpose processor designed to handle a wide variety of tasks.
Architecture:
Few Powerful Cores: Typically 2 to 16 cores in consumer CPUs.
High Clock Speed: Individual cores run at high frequencies (~3-5 GHz).
Processing:
Sequential Processing: Optimized for tasks that require complex computations and can't be easily parallelized.
Low Latency: Quick response times for individual tasks.
Memory Hierarchy:
Large Cache Memory: Multiple levels (L1, L2, L3) to reduce latency.
Use Cases:
General Computing: Running operating systems, applications, and performing tasks that require complex logic.
Graphics Processing Unit (GPU):

Purpose: Specialized processor designed for parallel processing, primarily for rendering graphics.
Architecture:
Thousands of Smaller Cores: Designed for handling multiple tasks simultaneously.
Lower Clock Speed: Cores run at lower frequencies (~1-2 GHz).
Processing:
Parallel Processing: Optimized for tasks that can be divided into smaller, similar subtasks.
High Throughput: Maximizing the amount of processed data over time.
Memory Hierarchy:
High Bandwidth Memory: Optimized for handling large data sets.
Use Cases:
Graphics Rendering: 3D rendering, image processing.
Compute-Intensive Tasks: Machine learning, scientific simulations.
Key Differences:

Processing Style: CPUs excel at sequential tasks with complex logic, while GPUs excel at parallel tasks with simple computations.
Core Count and Power: CPUs have fewer, more powerful cores; GPUs have many, less powerful cores.
Performance Goals: CPUs aim for low latency; GPUs aim for high throughput.
Application Domains: CPUs handle general computing tasks; GPUs handle tasks requiring massive parallelism.
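The latency-versus-throughput trade-off above can be made concrete with a toy cost model. The sketch below is illustrative only: the core counts, per-item times, and fixed launch overhead are made-up placeholder values, not measurements of any real CPU or GPU, and the helper name batch_time is hypothetical.

```python
def batch_time(n_items, cores, seconds_per_item, fixed_overhead=0.0):
    """Idealized time to process n_items spread evenly over `cores` workers."""
    rounds = -(-n_items // cores)  # ceiling division: items each core must handle
    return fixed_overhead + rounds * seconds_per_item


# Placeholder parameters (assumptions, not benchmarks):
# a CPU with few fast cores, a GPU with many slower cores plus a launch cost.
CPU = dict(cores=8, seconds_per_item=1e-9)
GPU = dict(cores=4096, seconds_per_item=4e-9, fixed_overhead=1e-5)

for n in (100, 10_000_000):
    cpu_t = batch_time(n, **CPU)
    gpu_t = batch_time(n, **GPU)
    winner = "CPU" if cpu_t < gpu_t else "GPU"
    print(f"n={n:>10,}  cpu={cpu_t:.2e}s  gpu={gpu_t:.2e}s  faster: {winner}")
```

Under these assumed numbers the small batch favors the CPU because the fixed overhead dominates, while the large batch favors the GPU because the work is spread over far more cores.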
Question 2: GPU Memory Hierarchy [5 Points]
Understanding the GPU memory hierarchy is essential for optimizing performance in GPU computing. The hierarchy includes several types of memory, each with different characteristics:

Registers:

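Location: In the register file of each Streaming Multiprocessor (SM), partitioned among its resident threads.
Size: Very small per thread (up to 255 32-bit registers per thread on current NVIDIA GPUs).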
Speed: Fastest memory available to threads.
Usage: Storing variables and data needed immediately by threads.
Shared Memory:

Location: On-chip memory within each Streaming Multiprocessor (SM).
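Size: Limited (tens of KB up to a few hundred KB per SM depending on architecture, often carved from the same on-chip array as L1).
Speed: Nearly as fast as registers when accesses avoid bank conflicts.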
Usage: Shared among threads in the same thread block for efficient communication.
L1 Cache:

Location: On-chip within each SM.
Size: Small (tens of KB).
Speed: Very fast.
Usage: Caches global and local memory accesses to reduce traffic to L2 and off-chip memory.
L2 Cache:

Location: Shared across all SMs.
Size: Larger than L1 (MB range).
Speed: Slower than L1 but faster than global memory.
Usage: Reduces global memory access latency for all SMs.
Global Memory:

Location: Off-chip DRAM.
Size: Large (several GBs).
Speed: High latency compared to on-chip memory.
Usage: Main memory for the GPU, accessible by all threads.
Constant Memory:

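Location: Off-chip device memory, read through a dedicated on-chip constant cache.
Size: Small (up to 64 KB).
Speed: Fast when cached, and fastest when all threads in a warp read the same address, because the value is broadcast.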
Usage: Storing read-only data that is constant across threads.
Texture Memory:

Location: Portion of global memory with specialized cache.
Size: Depends on GPU.
Speed: Optimized for spatial locality.
Usage: Handling 2D spatial data, beneficial for certain algorithms.
Local Memory:

Location: Off-chip global memory.
Size: Depends on usage.
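Speed: Same latency as global memory on a cache miss; accesses go through the L1/L2 caches.
Usage: Used when registers are insufficient for a thread's data (register spilling) or for per-thread arrays that cannot be kept in registers.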
Key Points:

Latency and Bandwidth: On-chip memories (registers, shared memory) have low latency and high bandwidth. Off-chip memories (global memory) have higher latency.
Optimizing Performance: Effective GPU programming minimizes global memory accesses and maximizes the use of shared memory and registers.
Memory Access Patterns: Coalesced memory accesses and proper use of caches improve performance.
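To illustrate why coalesced access matters, the sketch below counts how many memory segments one warp touches for different access strides. It is a simplified model, not a description of any specific GPU: the 128-byte segment size, the 4-byte element size, and the function name transactions_per_warp are assumptions chosen for illustration.

```python
WARP_SIZE = 32        # threads that issue a memory instruction together
SEGMENT_BYTES = 128   # assumed size of one memory transaction
ELEMENT_BYTES = 4     # e.g. one float32 per thread


def transactions_per_warp(stride):
    """Count distinct memory segments touched when thread t reads element t * stride."""
    segments = {
        (thread * stride * ELEMENT_BYTES) // SEGMENT_BYTES
        for thread in range(WARP_SIZE)
    }
    return len(segments)


for stride in (1, 2, 32):
    print(f"stride {stride:>2}: {transactions_per_warp(stride)} transaction(s)")
```

In this model, consecutive threads reading consecutive elements (stride 1) are served by a single segment, while a stride of 32 elements forces one segment per thread.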
Question 3: Tasks on CPU Node [10 Points]
Step-by-Step Instructions:

Request a CPU Node on Discovery:

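Open a terminal and use the Slurm scheduler to request an interactive session. Partition names are site-specific; short is assumed here for a general CPU partition, so confirm with sinfo or your course instructions:
srun --partition=short --nodes=1 --ntasks=1 --cpus-per-task=4 --time=01:00:00 --pty /bin/bash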
Activate Your Own Conda Environment:

Load the Anaconda module if not already loaded:
module load anaconda3/2020.11
Create a new Conda environment (if you haven't already):
conda create -n myenv python=3.8
Activate your Conda environment:
conda activate myenv
Enter Python Interactive Mode and Check PyTorch Version:

Install PyTorch in your environment if necessary:
conda install pytorch torchvision torchaudio cpuonly -c pytorch
Start Python interactive mode:
python
Import PyTorch and check the version:
import torch
print(torch.__version__)
Exit Python:
exit()
Deactivate Your Conda Environment:

Deactivate the environment:
conda deactivate
Screenshot Instructions:

Take a screenshot showing the terminal with the commands executed and the output, including the PyTorch version.
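As a complement to the interactive check, the sketch below reads the installed version from package metadata without importing the package, which is handy for confirming what an environment contains. The helper name installed_version is hypothetical, and the result reflects what the package metadata records; torch.__version__ remains the check the assignment asks for.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("torch", "torchvision", "torchaudio"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```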
Question 4: Tasks on GPU Node [14 Points]
Step-by-Step Instructions:

Request a GPU Node on Discovery:

Request an interactive GPU node:
srun --partition=gpu --gres=gpu:1 --nodes=1 --ntasks=1 --cpus-per-task=4 --time=01:00:00 --pty /bin/bash
Load CUDA Module [1 Point]:

Load CUDA 11.x module:
module load cuda/11.3
Use nvidia-smi for GPU Information [8 Points]:

Static Information:

Run:
nvidia-smi
Explanation:
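Displays the driver version, the highest CUDA version that driver supports (not necessarily the toolkit version loaded with module load), the GPU name, and total and used memory.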
Shows the overall status of the GPU at the moment.
Dynamic Information:

Run:
nvidia-smi -l 1
Explanation:
Updates the GPU information every second.
Useful for monitoring real-time GPU usage while running applications.
GPU Memory and Hierarchy Explanation:

Global Memory:
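The "FB Memory Usage" section of the detailed query output (nvidia-smi -q -d MEMORY) shows the Frame Buffer (device memory) usage; the default table reports the same figure in its "Memory-Usage" column.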
Indicates how much memory is allocated and available.
GPU Utilization:
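The "Utilization" section of nvidia-smi -q -d UTILIZATION (the "GPU-Util" column in the default table) shows the percentage of the last sample period during which at least one kernel was running.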
Helps identify if the GPU is being effectively utilized.
Temperature and Power:
Monitoring these ensures the GPU is operating within safe limits.
Screenshots:

Take two screenshots:
Static nvidia-smi output.
Dynamic nvidia-smi output while an application is running (if possible).
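The same information can be collected programmatically with nvidia-smi's query mode. The sketch below is one possible approach: the chosen field list and the helper names parse_gpu_csv and query_gpus are illustrative, and the code returns an empty list on nodes where nvidia-smi is missing or fails.

```python
import shutil
import subprocess

FIELDS = ["name", "memory.total", "memory.used", "utilization.gpu", "temperature.gpu"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi if it is on PATH; return [] when it is absent or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


print(query_gpus())
```

With nounits, memory values are reported in MiB, utilization in percent, and temperature in degrees Celsius.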
Use PyTorch to Check GPU Availability [5 Points]:

Activate Conda Environment:
conda activate myenv
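Install PyTorch with CUDA Support (if myenv still holds the cpuonly build from Question 3, remove the cpuonly package first or use a separate environment):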
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
Start Python Interactive Mode:
python
Check if GPU is Available [2 Points]:
import torch
print("CUDA available:", torch.cuda.is_available())
Get Current GPU Device ID [2 Points]:
current_device = torch.cuda.current_device()
print("Current CUDA Device ID:", current_device)
print("CUDA Device Name:", torch.cuda.get_device_name(current_device))
Screenshot [1 Point]:
Capture the Python session showing the above commands and their outputs.
Exit Python and Deactivate Conda Environment:
exit()
Then, back in the shell:
conda deactivate
Part 2: 66 Points
Implementing DDP on Multiple CPUs and Analyzing Speedup Performance
Objective:

Modify the existing multi-GPU parallel code to implement Distributed Data Parallel (DDP) on multiple CPUs and analyze the calculation speedup performance.

Steps:

Access the Code File:

Copy CSYE7105-pyt05.ipynb from /courses/CSYE7105.202510/shared/week11 to your working directory.
Understanding the Existing Code:

The current code utilizes multiple GPUs for parallel processing.
It likely uses PyTorch's DataParallel or DDP for GPUs.
Modify the Code for CPU DDP:

Import Necessary Libraries:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
Initialize the Process Group:

Use the gloo backend for CPU.
def setup(rank, world_size):
    dist.init_process_group(
        backend='gloo',
        init_method='tcp://127.0.0.1:29500',
        rank=rank,
        world_size=world_size
    )
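A fixed port such as 29500 can collide with another job on a shared node. One common workaround, sketched below, is to let the operating system pick a free port in the parent process and pass it to every rank. The helper names find_free_port and make_init_method are hypothetical, and there is a small window in which another process could claim the port before the ranks bind it.

```python
import socket


def find_free_port():
    """Ask the OS for a currently unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


def make_init_method(port):
    """Build the tcp:// rendezvous URL in the same form setup() uses."""
    return f"tcp://127.0.0.1:{port}"


port = find_free_port()  # choose once in the parent, then pass to every rank
print(make_init_method(port))
```

All ranks must receive the same port, so the value would be chosen once before spawning and handed to each process as an extra argument.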
Define the Model and Wrap with DDP:

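def train(rank, world_size):
    setup(rank, world_size)
    # Model initialization (Net is assumed to be the model class defined in the notebook)
    model = Net()
    # On CPU, DDP takes no device_ids; gradients are all-reduced over gloo
    ddp_model = DDP(model)
    # Rest of the training code
    # Release the process group once training finishes
    dist.destroy_process_group()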
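What DDP adds on top of a plain model is gradient averaging: each process computes gradients on its own data shard, and an all-reduce leaves every process with the mean. The pure-Python sketch below imitates that step for a one-parameter linear model to show that, with equal-sized shards, the averaged gradient matches the full-batch gradient. The data values and the function name mse_gradient are made up for illustration; no PyTorch calls are involved.

```python
import math


def mse_gradient(w, xs, ys):
    """d/dw of mean((w*x - y)^2) for a one-parameter linear model."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]  # toy data, roughly y = 2x
w, world_size = 0.5, 4

# Each rank computes a gradient on its own equal-sized shard ...
local_grads = [
    mse_gradient(w, xs[rank::world_size], ys[rank::world_size])
    for rank in range(world_size)
]
# ... and the all-reduce step leaves every rank with the average.
averaged = sum(local_grads) / world_size

full_batch = mse_gradient(w, xs, ys)
print(math.isclose(averaged, full_batch, rel_tol=1e-9))
```

Because every process applies the same averaged gradient, the model replicas stay identical after each optimizer step.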
Adjust the DataLoader with DistributedSampler:

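from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets

train_dataset = datasets.MNIST(...)
# Each rank draws its own shard; do not also pass shuffle=True to the DataLoader
train_sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, sampler=train_sampler)
# Call train_sampler.set_epoch(epoch) at the start of each epoch to reshuffle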
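To see how the sampler divides the data, the sketch below reproduces the index assignment for the non-shuffled case in plain Python: the index list is padded so it divides evenly, then each rank takes every world_size-th index starting at its own rank. The function name shard_indices is hypothetical, and the sketch assumes the dataset has at least world_size samples.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Indices one rank receives, mimicking DistributedSampler(shuffle=False).

    Assumes dataset_len >= world_size.
    """
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    indices += indices[: total - dataset_len]  # pad by repeating leading samples
    return indices[rank:total:world_size]


for rank in range(4):
    print(rank, shard_indices(10, 4, rank))
```

With 10 samples and 4 ranks, every rank gets 3 indices, and samples 0 and 1 appear twice because of the padding.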
Modify the Main Function to Spawn Processes:

def main():
    world_size = 4  # Number of worker processes (one per CPU core)
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()
Running the Modified Code:

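Request Resources on Discovery (one task with four cores, so that each process started by mp.spawn can run on its own core; the partition name short is an assumption to confirm with sinfo):

srun --partition=short --nodes=1 --ntasks=1 --cpus-per-task=4 --time=02:00:00 --pty /bin/bash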
Load Necessary Modules and Activate Environment:

module load anaconda3/2020.11
conda activate myenv
Run the Jupyter Notebook via OOD:

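Start a Jupyter session through Open OnDemand, requesting the same number of CPU cores as world_size.
Run the notebook and ensure all processes start correctly. Because mp.spawn starts fresh interpreters that must import the target function, functions defined only in notebook cells often fail to load; if that happens, save the training code to a .py file and run that file from the notebook.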
Analyzing Calculation Speedup Performance:

Measure Training Time:

Record the total training time using multiple CPUs.
Also, run the training on a single CPU and record the time.
Calculate Speedup:

speedup = single_cpu_time / multi_cpu_time
Plotting the Results:

Create a plot with the number of CPUs on the x-axis and training time on the y-axis.
Discuss the scaling behavior.
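The speedup formula extends naturally to parallel efficiency (speedup divided by the number of processes), which makes departures from linear scaling easier to see. The sketch below tabulates both; the timings are placeholders rather than measured results, and the function name scaling_table is hypothetical.

```python
def scaling_table(times_by_procs):
    """Map process count -> (time, speedup, efficiency) relative to the 1-process run."""
    baseline = times_by_procs[1]
    table = {}
    for procs in sorted(times_by_procs):
        elapsed = times_by_procs[procs]
        speedup = baseline / elapsed
        table[procs] = (elapsed, speedup, speedup / procs)
    return table


# Placeholder timings in seconds; replace with the values you measured.
example_times = {1: 100.0, 2: 55.0, 4: 32.0}

for procs, (elapsed, speedup, efficiency) in scaling_table(example_times).items():
    print(f"{procs} procs: {elapsed:6.1f} s  speedup {speedup:.2f}  efficiency {efficiency:.0%}")
```

An efficiency of 100% would mean perfectly linear scaling; values that fall as processes are added point to overhead worth discussing.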
Documenting Node Information:

Add Node Details in Markdown:

At the beginning or end of the notebook, include:
Node Name: Use !hostname in a code cell.
Number of CPUs Used: Mention world_size or the number of tasks.
Hardware Specifications: Any relevant details.
Example:

### Node Information

- **Node Name**: compute-node-01
- **Number of CPUs Used**: 4
- **Hardware Specs**: Intel Xeon CPU E5-2680 v4 @ 2.40GHz
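Rather than typing these details by hand, they can be gathered in a code cell. The sketch below is one way to do it with the standard library; the function names usable_cpus and node_info_markdown are hypothetical, and platform.processor() often returns only an architecture string on Linux, so a command such as lscpu may be needed for the full CPU model name.

```python
import os
import platform
import socket


def usable_cpus():
    """CPUs this process may run on (on Linux this typically reflects the Slurm allocation)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def node_info_markdown(world_size):
    """Render the node details as a Markdown snippet for the notebook."""
    lines = [
        "### Node Information",
        "",
        f"- **Node Name**: {socket.gethostname()}",
        f"- **Number of CPUs Used**: {world_size} (of {usable_cpus()} usable)",
        f"- **Hardware Specs**: {platform.processor() or platform.machine()}",
    ]
    return "\n".join(lines)


print(node_info_markdown(world_size=4))
```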
Discussion:

Analyze Results:

Comment on the observed speedup.
Discuss any overheads or inefficiencies.
Explain whether the scaling is linear and why it may not be.
Possible Issues:

Communication Overhead: With more CPUs, the communication between processes may introduce overhead.
Data Loading Bottlenecks: Ensure data loading is efficient and not a bottleneck.
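One way to quantify these overheads is the Karp-Flatt metric, which estimates the serial fraction e from a measured speedup S on p processes as e = (1/S - 1/p) / (1 - 1/p). The sketch below applies it to placeholder speedups, not measured ones; the function name karp_flatt is chosen for illustration.

```python
def karp_flatt(speedup, procs):
    """Experimentally determined serial fraction for a run on `procs` > 1 processes."""
    if procs < 2:
        raise ValueError("the metric is defined only for two or more processes")
    return (1 / speedup - 1 / procs) / (1 - 1 / procs)


# Placeholder speedups; substitute the values computed from your own timings.
example_speedups = {2: 1.8, 4: 3.2, 8: 5.0}

for procs, speedup in example_speedups.items():
    print(f"{procs} procs: serial fraction ~ {karp_flatt(speedup, procs):.3f}")
```

If the estimated fraction stays roughly constant as processes are added, a fixed serial portion is the likely limit; if it grows, overhead such as communication is increasing with the process count.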
Finalizing the Notebook:

Ensure all cells have been run and outputs are saved.
Check for any errors or warnings.
Save the notebook with all outputs visible.
Submission:

Submit the completed Jupyter notebook as per the assignment guidelines.
Ensure the node information and analysis are included.
End of Assignment
Note: The above steps provide detailed instructions for completing the assignment tasks. Ensure that you follow each step carefully and include the required screenshots and explanations in your submission.