# Tufts HPC Job Manager

This notebook helps you:
1. View available cluster resources
2. Check your current job sessions
3. Kill all your jobs (with confirmation)

**IMPORTANT**: Run this on a compute node or login node, NOT in Cursor locally.


## 1. Check Available Resources


In [7]:
#
# #currently on 8891
#  Check storage usage
!echo "=== Storage Usage ==="
!df -h /cluster/tufts/datalab
!df -h /cluster/tufts/em212class
!echo ""

# Check cluster partitions and resources
!echo "=== Cluster Partitions ==="
!sinfo -o "%P %a %l %D %c %m %G" | sed -n '1,30p'
!echo ""

# Check GPU availability
!echo "=== GPU Partitions ==="
!sinfo -o "%P %a %l %D %c %m %G" | grep gpu
!echo ""

# Check if hpctools is available
!echo "=== HPC Tools ==="
!module load hpctools 2>/dev/null && echo "hpctools loaded" || echo "hpctools not available"


=== Storage Usage ===


Filesystem                       Size  Used Avail Use% Mounted on
10.246.194.77:/projects/datalab  2.3T  2.3T  6.7G 100% /cluster/tufts/datalab
Filesystem                          Size  Used Avail Use% Mounted on
10.246.194.78:/projects/em212class 1000G  8.9G  991G   1% /cluster/tufts/em212class

=== Cluster Partitions ===
PARTITION AVAIL TIMELIMIT NODES CPUS MEMORY GRES
interactive up 4:00:00 2 36 248000 (null)
batch* up 7-00:00:00 71 36+ 120000+ (null)
mpi up 7-00:00:00 70 36+ 120000+ (null)
gpu up 7-00:00:00 1 64 190000 gpu:a100:2
gpu up 7-00:00:00 10 64+ 756121+ gpu:a100:8
gpu up 7-00:00:00 1 72 248000 gpu:p100:4
largemem up 7-00:00:00 2 36+ 1000000 (null)
preempt up 7-00:00:00 3 64+ 190000+ gpu:a100:2
preempt up 7-00:00:00 9 64+ 756121+ gpu:a100:8
preempt up 7-00:00:00 9 128 248000+ gpu:l40:4
preempt up 7-00:00:00 1 64 368000 gpu:v100:3
preempt up 7-00:00:00 1 72 248000 gpu:p100:4
preempt up 7-00:00:00 1 72 256000 gpu:p100:6
preempt up 7-00:00:00 1 64 168000 gpu:v100:2
preempt up 

## 2. Check Your Current Jobs


In [8]:
# Show ONLY your current jobs (zwu09)
!echo "=== Your Current Jobs (zwu09 only) ==="
!squeue -u zwu09
!echo ""

# Count jobs by state for zwu09 only
!echo "=== Job Summary (zwu09 only) ==="
!squeue -u zwu09 --format="%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R" --noheader | awk '{print $5}' | sort | uniq -c
!echo ""

# Show detailed info for running jobs (zwu09 only)
!echo "=== Running Job Details (zwu09 only) ==="
!squeue -u zwu09 --format="%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R" --noheader | grep ' R ' | while read line; do
!    jobid=$(echo $line | awk '{print $1}')
!    echo "Job $jobid:"
!    scontrol show jobid -dd $jobid | grep -E "(JobId|JobName|UserId|State|RunTime|TimeLimit|NodeList|NumNodes|NumCPUs|ReqMem|GRES)"
!    echo "---"
!done


=== Your Current Jobs (zwu09 only) ===


             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          15655472       gpu     bash    zwu09  R      14:17      1 cc1gpu005

=== Job Summary (zwu09 only) ===
      1 R

=== Running Job Details (zwu09 only) ===
/bin/bash: -c: line 1: syntax error: unexpected end of file
Job :
JobId=9699735 JobName=launch_experiment.sh
   UserId=areddy05(34017) GroupId=areddy05(7786) MCS_label=N/A
   JobState=PENDING Reason=ReqNodeNotAvail,_May_be_reserved_for_other_job Dependency=(null)
   RunTime=00:00:00 TimeLimit=00:15:00 TimeMin=N/A
   ReqNodeList=p1cmp010 ExcNodeList=(null)
   NodeList=
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
JobId=15579658 JobName=bash
   UserId=zhuang12(31612) GroupId=student(6150) MCS_label=N/A
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=6-11:14:24 TimeLimit=7-00:00:00 TimeMin=N/A
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=d1cmp001
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T

## 3. Kill All Your Jobs (CONFIRMATION REQUIRED)

**⚠️ WARNING: This will cancel ALL your jobs! ⚠️**

Run the cell below and confirm when prompted.


In [9]:
# Simple job killer - specify job IDs manually
# COMMENTED OUT FOR "RUN ALL" - Uncomment if you need to kill jobs
"""
print("=== Manual Job Cancellation ===")
print("Enter job IDs to cancel (comma-separated, or 'all' for all zwu09 jobs)")
print("Example: 15655172,15655173")
print("Or type 'all' to cancel all zwu09 jobs")
print()

# Get job IDs from user
job_input = input("Job IDs to cancel: ").strip()

if job_input.lower() == 'all':
    print("🔄 Cancelling all zwu09 jobs...")
    result = subprocess.run(['scancel', '-u', 'zwu09'], capture_output=True, text=True, shell=True)
    if result.returncode == 0:
        print("✅ All zwu09 jobs cancelled!")
    else:
        print(f"❌ Error: {result.stderr}")
else:
    # Parse job IDs
    job_ids = [jid.strip() for jid in job_input.split(',') if jid.strip()]
    
    if not job_ids:
        print("❌ No valid job IDs provided.")
    else:
        print(f"🔄 Cancelling jobs: {', '.join(job_ids)}")
        
        # Cancel each job individually
        for job_id in job_ids:
            result = subprocess.run(['scancel', job_id], capture_output=True, text=True, shell=True)
            if result.returncode == 0:
                print(f"✅ Job {job_id} cancelled")
            else:
                print(f"❌ Error cancelling job {job_id}: {result.stderr}")

print("\n=== Verification ===")
!squeue -u zwu09
"""

print("ℹ️  Job killer commented out for 'Run All'. Uncomment if needed.")


ℹ️  Job killer commented out for 'Run All'. Uncomment if needed.


## 4. Quick Command Reference MUST ONLY BE RUN AT https://ondemand.pax.tufts.edu/pun/sys/shell/ssh/login.pax.tufts.edu

**Manual commands to run in terminal:**

```bash
# Show your jobs
squeue -u zwu09

# Cancel specific job by ID
scancel 15655172

# Cancel multiple jobs by ID
scancel 15655172 15655173 15655174

# Cancel ALL your jobs
scancel -u zwu09

# Show job details
scontrol show jobid -dd 15655172
```


## 5. GPU Demos and Cool Stuff

Now let's test your GPU and run some awesome demos!


In [10]:
# Check GPU availability and info
!echo "=== GPU Information ==="
!nvidia-smi
!echo ""

# Check PyTorch CUDA availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("❌ CUDA not available - check your GPU allocation!")


=== GPU Information ===
Thu Sep 11 21:21:29 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:1C:00.0 Off |                    0 |
| N/A   33C    P0              41W / 250W |      3MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                            

In [11]:
# Check datalab space and clean up if needed
!echo "=== Datalab Storage Check ==="
!df -h /cluster/tufts/datalab
!echo ""
!echo "=== Your datalab directory contents ==="
!ls -la /cluster/tufts/datalab/zwu09/
!echo ""
!echo "=== Directory sizes ==="
!du -sh /cluster/tufts/datalab/zwu09/* 2>/dev/null | sort -hr | head -10


=== Datalab Storage Check ===
Filesystem                       Size  Used Avail Use% Mounted on
10.246.194.77:/projects/datalab  2.3T  2.3T  6.7G 100% /cluster/tufts/datalab

=== Your datalab directory contents ===
total 5220231
drwxrws---  2 zwu09    datalab       4096 Sep 11 19:39 .
drwxrwsr-x 92 pflora01 datalab       4096 Sep 10 18:53 ..
drwxr-xr-x  2 zwu09    datalab       4096 May  2 08:21 bin
drwxr-sr-x  2 zwu09    datalab       4096 May  2 06:16 .cache
drwxrwsr-x  2 zwu09    datalab       4096 Sep 11 21:19 caches
drwxrwxr-x  2 zwu09    datalab       4096 May  2 07:50 dateutil
drwxr-sr-x  2 zwu09    datalab       4096 Sep 11 19:20 envs
-rw-r--r--  1 zwu09    datalab 5343667350 May  9 05:36 goodreads_reviews_dedup.json.gz
drwxr-sr-x  2 zwu09    datalab       4096 Sep 11 17:32 .ipynb_checkpoints
-rw-rw-r--  1 zwu09    datalab        312 Sep 11 19:22 jlab_8890.log
drwxr-xr-x  2 zwu09    datalab       4096 May  2 08:21 joblib
drwxr-xr-x  2 zwu09    datalab       4096 May  2 08:21 jo

: 

In [None]:
# Set up cache directories to avoid filling home directory
import os
os.environ['HF_HOME'] = '/cluster/tufts/datalab/zwu09/caches/huggingface'
os.environ['TRANSFORMERS_CACHE'] = '/cluster/tufts/datalab/zwu09/caches/huggingface'
os.environ['PIP_CACHE_DIR'] = '/cluster/tufts/datalab/zwu09/caches/pip'
os.environ['TORCH_HOME'] = '/cluster/tufts/datalab/zwu09/caches/torch'
os.environ['TMPDIR'] = '/cluster/tufts/datalab/zwu09/tmp'

# Create cache directories
import subprocess
subprocess.run(['mkdir', '-p', os.environ['HF_HOME'], os.environ['PIP_CACHE_DIR'], os.environ['TORCH_HOME'], os.environ['TMPDIR']])

print("✅ Cache directories set up:")
print(f"HF_HOME: {os.environ['HF_HOME']}")
print(f"PIP_CACHE_DIR: {os.environ['PIP_CACHE_DIR']}")
print(f"TORCH_HOME: {os.environ['TORCH_HOME']}")
print(f"TMPDIR: {os.environ['TMPDIR']}")

# Install packages with explicit cache directory
# COMMENTED OUT FOR "RUN ALL" - Uncomment if you need to install packages
"""
print("\n📦 Installing packages to datalab...")
%pip install --cache-dir /cluster/tufts/datalab/zwu09/caches/pip diffusers transformers accelerate safetensors bitsandbytes einops --quiet
"""

print("ℹ️  Package installation commented out for 'Run All'. Uncomment if needed.")


✅ Cache directories set up:
HF_HOME: /cluster/tufts/datalab/zwu09/caches/huggingface
PIP_CACHE_DIR: /cluster/tufts/datalab/zwu09/caches/pip
TORCH_HOME: /cluster/tufts/datalab/zwu09/caches/torch
TMPDIR: /cluster/tufts/datalab/zwu09/tmp


In [None]:
# Clean up potential junk files
!echo "=== Cleaning up potential junk ==="
!echo "Checking for large files that can be deleted..."

# Check for large log files
!find /cluster/tufts/datalab/zwu09 -name "*.log" -size +100M 2>/dev/null | head -5
!echo ""

# Check for temporary files
!find /cluster/tufts/datalab/zwu09 -name "*.tmp" -o -name "*.temp" -o -name "*~" 2>/dev/null | head -10
!echo ""

# Check for old cache files
!find /cluster/tufts/datalab/zwu09 -name "__pycache__" -type d 2>/dev/null | head -5
!echo ""

# Check for large model files that might be duplicates
!find /cluster/tufts/datalab/zwu09 -name "*.bin" -o -name "*.safetensors" -o -name "*.pt" -o -name "*.pth" 2>/dev/null | head -10
!echo ""

print("💡 If you see large files above, you can delete them with:")
print("   rm -rf /path/to/large/file")
print("   find /cluster/tufts/datalab/zwu09 -name '__pycache__' -type d -exec rm -rf {} +")


In [None]:
# Stable Diffusion XL Demo - Generate AI Art!
# NOTE: Make sure packages are installed first by uncommenting the install cell above

try:
    from diffusers import DiffusionPipeline
    import torch
    
    print("🎨 Loading Stable Diffusion XL...")
    print("This may take a few minutes on first run (downloading model)")
    
    # Use a lightweight, fast model for testing
    model_id = "stabilityai/sdxl-turbo"  # Fast, lightweight SDXL variant
    
    # Load pipeline
    pipe = DiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        use_safetensors=True,
        variant="fp16"
    )
    
    # Move to GPU
    pipe = pipe.to("cuda")
    
    print("✅ Model loaded successfully!")
    print(f"Model: {model_id}")
    print(f"Device: {pipe.device}")
    
    # Generate an image
    prompt = "a beautiful landscape with mountains and a lake, digital art, high quality"
    print(f"\n🎨 Generating image with prompt: '{prompt}'")
    
    with torch.no_grad():
        image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
    
    # Save the image
    output_path = "/cluster/tufts/datalab/zwu09/generated_art.png"
    image.save(output_path)
    print(f"✅ Image saved to: {output_path}")
    
    # Display the image
    from IPython.display import display, Image
    display(image)
    
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("💡 Solution: Uncomment the package installation cell above and run it first")
    print("   Then restart the kernel and run this cell again")
except Exception as e:
    print(f"❌ Error: {e}")
    print("💡 Make sure you're running on a GPU compute node with CUDA available")


ImportError: cannot import name 'cached_download' from 'huggingface_hub' (/cluster/home/zwu09/.local/lib/python3.12/site-packages/huggingface_hub/__init__.py)

In [None]:
# TinyLlama Demo - Chat with a Small Language Model
# NOTE: Make sure packages are installed first by uncommenting the install cell above

try:
    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch
    
    print("🤖 Loading TinyLlama...")
    print("This is a small but capable language model!")
    
    # Load model and tokenizer
    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    
    print("✅ TinyLlama loaded successfully!")
    
    # Chat function
    def chat_with_tinyllama(prompt, max_length=100):
        # Format prompt
        formatted_prompt = f"<|user|>\n{prompt}<|assistant|>\n"
        
        # Tokenize
        inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
        
        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_length,
                temperature=0.7,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
        
        # Decode response
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Extract just the assistant's response
        response = response.split("<|assistant|>")[-1].strip()
        
        return response
    
    # Test the model
    test_prompts = [
        "What is machine learning?",
        "Write a short poem about AI",
        "Explain quantum computing in simple terms"
    ]
    
    print("\n🤖 Testing TinyLlama:")
    for i, prompt in enumerate(test_prompts, 1):
        print(f"\n--- Test {i} ---")
        print(f"Human: {prompt}")
        response = chat_with_tinyllama(prompt)
        print(f"TinyLlama: {response}")
    
    print("\n✅ TinyLlama demo complete!")
    
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("💡 Solution: Uncomment the package installation cell above and run it first")
    print("   Then restart the kernel and run this cell again")
except Exception as e:
    print(f"❌ Error: {e}")
    print("💡 Make sure you're running on a GPU compute node with CUDA available")


In [None]:
# GPU Performance Benchmark
import torch
import time

print("⚡ GPU Performance Benchmark")
print("=" * 40)

# Test matrix multiplication performance
def benchmark_gpu():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Device: {device}")
    
    # Test different matrix sizes
    sizes = [1024, 2048, 4096]
    
    for size in sizes:
        print(f"\nTesting {size}x{size} matrices...")
        
        # Create random matrices
        a = torch.randn(size, size, device=device, dtype=torch.float32)
        b = torch.randn(size, size, device=device, dtype=torch.float32)
        
        # Warm up
        for _ in range(3):
            _ = torch.mm(a, b)
        
        # Benchmark
        torch.cuda.synchronize()
        start_time = time.time()
        
        for _ in range(10):
            result = torch.mm(a, b)
        
        torch.cuda.synchronize()
        end_time = time.time()
        
        avg_time = (end_time - start_time) / 10
        gflops = (2 * size**3) / (avg_time * 1e9)  # Approximate GFLOPS
        
        print(f"  Average time: {avg_time*1000:.2f} ms")
        print(f"  Approximate GFLOPS: {gflops:.2f}")

# Run benchmark
benchmark_gpu()

# Memory usage
print(f"\n💾 GPU Memory Usage:")
print(f"  Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"  Cached: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

print("\n✅ Benchmark complete!")


## 6. Interactive Demos

Try these interactive demos! Run each cell to test different AI capabilities.


In [None]:
# Interactive Image Generation
def generate_custom_image(prompt, steps=4):
    """Generate an image with your custom prompt"""
    print(f"🎨 Generating: '{prompt}'")
    
    with torch.no_grad():
        image = pipe(prompt, num_inference_steps=steps, guidance_scale=0.0).images[0]
    
    # Save with timestamp
    import datetime
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"/cluster/tufts/datalab/zwu09/custom_art_{timestamp}.png"
    image.save(filename)
    
    print(f"✅ Saved to: {filename}")
    display(image)
    return image

# Try your own prompts!
# Uncomment and modify these examples:

# generate_custom_image("a futuristic city with flying cars, cyberpunk style")
# generate_custom_image("a cute robot playing with a cat, cartoon style")
# generate_custom_image("abstract art with vibrant colors and geometric shapes")

print("🎨 Ready for custom image generation!")
print("Use: generate_custom_image('your prompt here')")


In [None]:
# Interactive Chat with TinyLlama
def chat_interactive():
    """Interactive chat session with TinyLlama"""
    print("🤖 TinyLlama Chat - Type 'quit' to exit")
    print("=" * 50)
    
    while True:
        user_input = input("\nYou: ")
        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("👋 Goodbye!")
            break
        
        try:
            response = chat_with_tinyllama(user_input, max_length=150)
            print(f"TinyLlama: {response}")
        except Exception as e:
            print(f"❌ Error: {e}")

# Uncomment to start interactive chat:
# chat_interactive()

print("🤖 Ready for interactive chat!")
print("Use: chat_interactive() to start chatting")


In [None]:
# Show ONLY your current jobs (zwu09)
!echo "=== Your Current Jobs (zwu09 only) ==="
!squeue -u zwu09
!echo ""

# Count jobs by state for zwu09 only
!echo "=== Job Summary (zwu09 only) ==="
!squeue -u zwu09 --format="%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R" --noheader | awk '{print $5}' | sort | uniq -c
!echo ""

# Show detailed info for running jobs (zwu09 only)
!echo "=== Running Job Details (zwu09 only) ==="
!squeue -u zwu09 --format="%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R" --noheader | grep ' R ' | while read line; do
!    jobid=$(echo $line | awk '{print $1}')
!    echo "Job $jobid:"
!    scontrol show jobid -dd $jobid | grep -E "(JobId|JobName|UserId|State|RunTime|TimeLimit|NodeList|NumNodes|NumCPUs|ReqMem|GRES)"
!    echo "---"
!done


=== Your Current Jobs (zwu09 only) ===
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          15655472       gpu     bash    zwu09  R      11:08      1 cc1gpu005

=== Job Summary (zwu09 only) ===
      1 R

=== Running Job Details (zwu09 only) ===
/bin/bash: -c: line 1: syntax error: unexpected end of file
Job :
JobId=9699735 JobName=launch_experiment.sh
   UserId=areddy05(34017) GroupId=areddy05(7786) MCS_label=N/A
   JobState=PENDING Reason=ReqNodeNotAvail,_May_be_reserved_for_other_job Dependency=(null)
   RunTime=00:00:00 TimeLimit=00:15:00 TimeMin=N/A
   ReqNodeList=p1cmp010 ExcNodeList=(null)
   NodeList=
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
JobId=15579658 JobName=bash
   UserId=zhuang12(31612) GroupId=student(6150) MCS_label=N/A
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=6-11:11:16 TimeLimit=7-00:00:00 TimeMin=N/A
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=d1cmp001
   NumNodes=1 NumC