# **Task 1: LLM Fundamentals & Generation Analysis**

## **Overview**
This task provides a **hands-on introduction** to the fundamentals of large language models (LLMs) using the **Llama 3-8B Instruct** model.  
It walks through model loading, text generation experiments, and key performance analyses, focusing on **latency**, **diversity**, and **memory usage**.

---

## **Step 1: Environment Setup (Cell 2)**

- **Objective:**  
  Create a clean and reproducible environment for running all experiments.

- **Actions:**
  - **Import Libraries:**  
    Load all required libraries for model handling, generation, and visualization.  
  - **Configure Environment:**  
    Define a `set_seed` function to ensure reproducibility by fixing random seeds.  
    Verify that a **GPU** is available for efficient computation.
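
As a minimal sketch of what `set_seed` might look like, the block below seeds Python's RNG and, when they are installed, NumPy and PyTorch. The optional-import structure is an assumption made so the sketch runs anywhere; in the notebook itself the imports would normally sit at the top of the cell.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs (the latter two only if installed)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            # Seed every visible GPU as well
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Two runs with the same seed should produce the same random draw.
set_seed(123)
first_draw = random.random()
set_seed(123)
second_draw = random.random()
print("reproducible:", first_draw == second_draw)
```

Seeding makes sampling repeatable on one machine and software stack; it does not by itself guarantee bit-identical results across different GPUs or library versions.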

---

## **Step 2: Model Loading (Cell 4)**

- **Objective:**  
  Load the target **Llama 3-8B Instruct** model and prepare it for inference.

- **Actions:**
  - **Select Model:**  
    Set `MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"`.  
  - **Load with Fallback:**  
    Use a `try...except` block to attempt loading the main model.  
    If it fails (e.g., due to limited memory), load a smaller fallback model (`FALLBACK_MODEL_ID`) to ensure the notebook remains executable.  
  - **Model Summary:**  
    Print key information about the loaded model, including size and configuration.
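
The fallback logic can be sketched independently of any particular library. In the block below, `load_first_available` and its `loader` argument are hypothetical names introduced for illustration; the notebook is free to inline the same loop with `try...except` directly.

```python
from typing import Any, Callable, Sequence, Tuple


def load_first_available(
    candidates: Sequence[str],
    loader: Callable[[str], Any],
) -> Tuple[str, Any]:
    """Try each model id in order and return (model_id, loaded_object) for the first success."""
    errors = []
    for model_id in candidates:
        try:
            return model_id, loader(model_id)
        except Exception as exc:  # e.g. out-of-memory or missing access rights
            errors.append(f"{model_id}: {exc!r}")
    raise RuntimeError("All candidates failed to load:\n" + "\n".join(errors))


# Demonstration with a fake loader that fails for the first candidate.
def fake_loader(model_id: str) -> str:
    if model_id == "primary":
        raise MemoryError("simulated out-of-memory")
    return f"loaded:{model_id}"


chosen_id, loaded = load_first_available(["primary", "fallback"], fake_loader)
print(chosen_id, loaded)
```

With Transformers, one plausible `loader` would return the pair `(AutoTokenizer.from_pretrained(model_id), AutoModelForCausalLM.from_pretrained(model_id))`; precision and device placement arguments are left to the notebook.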

---

## **Step 3: Experiment 1 — Effect of Temperature on Diversity (Cell 5)**

- **Objective:**  
  Understand how the **temperature** hyperparameter influences the **randomness and diversity** of generated text.

- **Actions:**
  - **Set Parameters:**  
    Define a fixed input `PROMPT` and a list of `TEMPERATURES` to test.  
  - **Generate Samples:**  
    Loop through each temperature value and generate multiple text samples using  
    `model.generate(..., do_sample=True)` so that temperature actually affects randomness.  
  - **Record Results:**  
    Save all generated outputs for side-by-side comparison and qualitative analysis.
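
To see why temperature changes diversity, it helps to look at what it does to the next-token distribution: logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. The sketch below uses made-up toy logits (not taken from any model) and reports the entropy of the resulting distribution.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) using a numerically stable form."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; higher means a flatter, more random distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


toy_logits = [2.0, 1.0, 0.5, 0.1]  # illustrative values only
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(
        f"T={temperature}: max prob={max(probs):.3f}, "
        f"entropy={entropy_bits(probs):.3f} bits"
    )
```

Entropy rises with temperature, which is the mechanism behind the more varied samples expected at higher settings.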

---

## **Step 4: Experiment 2 — Input Length vs. Prefill Latency (Cell 6)**

- **Objective:**  
  Demonstrate that **longer input sequences** lead to **higher prefill latency** before generation begins.

- **Actions:**
  - **Define Input Sizes:**  
    Create a list of token lengths (`INPUT_LENGTHS`) to test various input sizes.  
  - **Precise Timing:**  
    Use `torch.cuda.Event` for high-precision GPU time measurement.  
  - **Measure Prefill Stage:**  
    For each input length, perform warm-up runs, then measure time for generating one token (`max_new_tokens=1`) to isolate prefill cost.  
  - **Visualize:**  
    Plot a line graph showing **input length (x-axis)** vs. **average latency (y-axis)**.
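
The warm-up and timing procedure can be sketched as a small helper. `time_call_ms` is a hypothetical name; it uses `torch.cuda.Event` when PyTorch and a GPU are available and falls back to `time.perf_counter` otherwise (the fallback is an assumption added so the sketch runs on any machine).

```python
import time

try:
    import torch
    HAS_CUDA = torch.cuda.is_available()
except ImportError:
    torch = None
    HAS_CUDA = False


def time_call_ms(fn, n_warmup=2, n_trials=5):
    """Run fn() n_warmup times untimed, then return n_trials latencies in milliseconds."""
    for _ in range(n_warmup):
        fn()
    timings = []
    for _ in range(n_trials):
        if HAS_CUDA:
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            torch.cuda.synchronize()  # make sure earlier GPU work is finished
            start.record()
            fn()
            end.record()
            torch.cuda.synchronize()  # wait until the end event has been reached
            timings.append(start.elapsed_time(end))  # elapsed_time returns ms
        else:
            t0 = time.perf_counter()
            fn()
            timings.append((time.perf_counter() - t0) * 1000.0)
    return timings


# Stand-in workload; in the notebook this would wrap a one-token generate call.
latencies = time_call_ms(lambda: sum(range(10_000)), n_warmup=1, n_trials=3)
print([round(t, 4) for t in latencies])
```

In the experiment, `fn` would be something like `lambda: model.generate(**inputs, max_new_tokens=1, do_sample=False)`, so that each measurement is dominated by the prefill pass.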

---

## **Step 5: Experiment 3 — Real-Time Memory Tracking During Generation (Cell 8)**

- **Objective:**  
  Track and visualize **GPU memory usage** step-by-step as tokens are generated, illustrating the growth of the **KV Cache**.

- **Actions:**
  - **Design a “Memory Hook”:**  
    Implement a custom class `MemoryUsageCallback` inheriting from `transformers.StoppingCriteria`.  
    Its `__call__` method records current GPU memory after each token generation, returning `False` to continue.  
  - **Run Monitored Generation:**  
    After GPU warm-up, create an instance of the callback and pass it to `model.generate()` via `stopping_criteria`.  
  - **Analyze and Plot:**  
    Convert recorded memory data into a DataFrame and plot the **memory usage curve** over generation steps.
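
One possible shape for the callback is sketched below. The `read_bytes` parameter is an assumption added for illustration so the class can be exercised without a GPU; by default it reads `torch.cuda.memory_allocated(device)`. The stand-in base class is likewise only there so the sketch runs when Transformers is not installed.

```python
try:
    from transformers import StoppingCriteria
except ImportError:
    class StoppingCriteria:  # hypothetical stand-in, used only without transformers
        pass


class MemoryUsageCallback(StoppingCriteria):
    """Records GPU memory at every generation step and never stops generation."""

    def __init__(self, device, read_bytes=None):
        if read_bytes is None:
            import torch
            read_bytes = lambda: torch.cuda.memory_allocated(device)
        self.read_bytes = read_bytes
        self.samples_bytes = []

    def __call__(self, input_ids, scores, **kwargs):
        # Called once per generated token by model.generate()
        self.samples_bytes.append(self.read_bytes())
        return False  # False means "keep generating"

    def increases_mb(self):
        """Memory growth relative to the first recorded step, in MiB."""
        if not self.samples_bytes:
            return []
        baseline = self.samples_bytes[0]
        return [(b - baseline) / (1024 ** 2) for b in self.samples_bytes]


# Demonstration with a fake memory reader instead of a real GPU.
fake_readings = iter([1 * 1024 ** 2, 2 * 1024 ** 2, 4 * 1024 ** 2])
demo = MemoryUsageCallback(device=None, read_bytes=lambda: next(fake_readings))
for _ in range(3):
    demo(None, None)
print(demo.increases_mb())
```

During the real run the instance would be wrapped in a `StoppingCriteriaList` and passed to `model.generate()` through the `stopping_criteria` argument.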

---

## **Step 6: Verifying the Space Complexity of KV Cache (Cell 9)**

- **Objective:**  
  Quantitatively check that memory usage grows **linearly** with sequence length, which is the empirical signature of the KV cache's **O(L)** space complexity.

- **Actions:**
  - **Linear Regression:**  
    Apply `scipy.stats.linregress` to fit a linear model between the number of generated tokens and memory increase.  
  - **Interpret Results:**  
    - The **slope** is the average memory growth (in MB) per generated token.  
    - An **R² value** close to 1 (e.g., > 0.99) indicates a strong linear fit, which is consistent with the expected **O(L)** relationship. It is supporting evidence from one run, not a formal proof.
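
To make the slope and R² concrete, here is a dependency-free least-squares sketch. `fit_line` is a hypothetical helper that computes the same slope, intercept, and squared correlation that `scipy.stats.linregress` reports (as `slope`, `intercept`, and `rvalue ** 2`); the data is synthetic and made up purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Synthetic trace: 0.125 MB per step plus a small deterministic wobble.
steps = list(range(1, 51))
memory_mb = [0.125 * s + 3.0 + 0.01 * ((s % 3) - 1) for s in steps]

slope, intercept, r_squared = fit_line(steps, memory_mb)
print(f"slope={slope:.4f} MB/token, intercept={intercept:.3f} MB, R^2={r_squared:.5f}")
```

On a real trace, a staircase pattern or a low R² would be a hint to look at what is being measured (allocated versus reserved memory, for example) before drawing conclusions about complexity.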

---

## **Step 7: Summary and Analysis (Cell 10)**

- **Objective:**  
  Summarize all experimental findings and explain the underlying principles observed.

- **Actions:**
  - **Write Analysis:**  
    Discuss results and plots from all experiments, highlighting:  
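    - How temperature trades output diversity against coherence.  
    - How prefill latency grows with input length (close to linear at moderate lengths, with attention's quadratic term becoming more visible as inputs get longer).  
    - The linear memory growth caused by KV caching.  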
  - Synthesize these insights into a clear, data-driven conclusion about **LLM generation efficiency** and **scalability**.
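
When discussing the memory results, a back-of-the-envelope estimate of KV-cache size per token gives a useful reference point for the measured slope. The sketch below assumes a Llama 3 8B style configuration (32 layers, 8 key/value heads, head dimension 128) and 16-bit cache entries; these numbers are assumptions that should be checked against `model.config` and the dtype actually loaded.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes added to the KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Assumed configuration values; verify against model.config before relying on them.
per_token_bytes = kv_cache_bytes_per_token(
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
    bytes_per_element=2,  # fp16 / bf16
)
per_token_mib = per_token_bytes / (1024 ** 2)
print(f"{per_token_bytes} bytes per token = {per_token_mib} MiB per token")
```

Under these assumptions the estimate is 131072 bytes (0.125 MiB) per token for a single sequence. A measured slope that differs noticeably may reflect a different dtype, a fallback model with another configuration, or allocator overhead.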


In [None]:
### Cell 2: Environment Setup and Dependency Installation
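# TODO: import all required libraries for the lab (os, random, time, numpy, pandas, torch, transformers, etc.)
# Later cells also rely on: typing.Optional, matplotlib.pyplot as plt, seaborn as sns,
# scipy.stats.linregress, and transformers' AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, logging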

RESULTS_DIR = "./results"
FIGURES_DIR = "./figures"

# TODO: create the results/figures directories if they do not exist
# os.makedirs(RESULTS_DIR, exist_ok=True)
# os.makedirs(FIGURES_DIR, exist_ok=True)

# TODO: configure logging verbosity and select DEVICE (cuda vs. cpu)
# logging.set_verbosity_warning()
# if torch.cuda.is_available():
#     DEVICE = torch.device("cuda")
#     # print GPU diagnostics here
# else:
#     DEVICE = torch.device("cpu")
#     # print CPU-only notice

# TODO: print environment diagnostics (CUDA version, PyTorch version, etc.)

def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs for reproducible lab runs."""
    ...

def require_gpu(task: str) -> None:
    """Raise a descriptive error if a GPU is required but not available."""
    ...

# TODO: call set_seed and configure visualisation defaults (sns, matplotlib)
# sns.set_theme(...)
# plt.rcParams.update(...)
# print("Environment initialised.")


In [None]:
# ### Cell 3: Hugging Face Login
# from huggingface_hub import login
# from getpass import getpass

# # Check if a Hugging Face token is already set in the environment.
# if not os.getenv("HUGGING_FACE_HUB_TOKEN"):
#     try:
#         # Prompt user for Hugging Face access token if not found.
#         hf_token = getpass("Please enter your Hugging Face access token: ")
#         login(token=hf_token, add_to_git_credential=True)
#         print("Hugging Face login successful!")
#     except Exception as e:
#         print(f"Login failed: {e}. Model loading may fail later.")
# else:
#     print("Hugging Face token detected.")

In [None]:
### Cell 4: Load Model and Tokenizer
MODEL_ID = "..."  # TODO: primary model identifier
FALLBACK_MODEL_ID = "..."  # TODO: fallback model identifier

model: Optional[AutoModelForCausalLM] = None
tokenizer: Optional[AutoTokenizer] = None

candidates = [MODEL_ID, FALLBACK_MODEL_ID]

for candidate in candidates:
    # TODO: attempt to load tokenizer/model and break once successful
    pass

# TODO: raise an error if both candidates fail to load

# TODO: ensure tokenizer/model pad tokens are configured
# tokenizer.pad_token = tokenizer.eos_token
# model.config.pad_token_id = tokenizer.pad_token_id

# TODO: move model to eval mode/device and print summary stats
# model.eval()
# print(...)


In [None]:
### Cell 5: Experiment 1 - Effect of Temperature on Generation Diversity
print("--- Experiment 1: Temperature sweep ---")

PROMPT = "..."  # TODO: provide the experiment prompt
TEMPERATURES = [...]  # TODO: define the temperatures to test
NUM_SAMPLES_PER_TEMP = ...  # TODO: samples per temperature
MAX_NEW_TOKENS = ...  # TODO: maximum generation length

# TODO: ensure model and tokenizer are loaded before running

records = []

for temp in TEMPERATURES:
    for sample_id in range(1, NUM_SAMPLES_PER_TEMP + 1):
        # TODO: tokenize prompt, generate completion, and decode output
        pass

# TODO: build df_temperature and summary_temperature from records
# df_temperature = pd.DataFrame(records)
# summary_temperature = df_temperature.groupby(...).agg(...)

# TODO: persist CSV artifacts and display summary statistics
# df_temperature.to_csv(...)
# summary_temperature.to_csv(...)
# print("Temperature sweep complete. Summary statistics:")


In [None]:
### Cell 6: Experiment 2 - Effect of Input Length on Prefilling Latency
print("--- Experiment 2: Input length vs. prefill latency ---")

LATENCY_INPUT_LENGTHS = [...]  # TODO: sorted list of token lengths
NUM_WARMUP = ...  # TODO: number of warmup passes
NUM_TRIALS = ...  # TODO: number of timed trials
MAX_NEW_TOKENS = ...  # TODO: max generation length per latency run (Step 4 uses 1 to isolate prefill); this overwrites the Cell 5 value

latency_records = []

for input_length in LATENCY_INPUT_LENGTHS:
    # TODO: build a dummy prompt of the desired token length
    # inputs = tokenizer(...)
    # warmup generations
    # time trial generations and record latency/throughput
    pass

# TODO: assemble df_latency from latency_records
# df_latency = pd.DataFrame(latency_records)

# TODO: compute summary statistics and optionally plot latency curves
# summary_latency = ...
# plt.figure(...)
# plt.savefig(...)


In [None]:
### Cell 8: Experiment 3 - Real-time GPU Memory Usage During Generation
print("--- Experiment 3: Real-time memory trace ---")

df_memory_steps = pd.DataFrame()

if DEVICE.type != "cuda":
    print("Skipped: Experiment 3 requires a CUDA GPU.")
else:
    MAX_GENERATION_LENGTH = ...  # TODO: max tokens for traced generation
    PROMPT = "..."  # TODO: seed prompt for memory trace
    INPUT_LENGTH = ...  # TODO: desired prompt length in tokens

    # TODO: prepare input_ids at the specified length
    # input_ids = ...

    class MemoryUsageCallback(StoppingCriteria):
        def __init__(self, device):
            ...

        def __call__(self, input_ids, scores, **kwargs):
            ...

        def increases_mb(self):
            ...

    callback = MemoryUsageCallback(DEVICE)

    # TODO: run warmup generation without callback
    # with torch.no_grad():
    #     model.generate(...)

    # TODO: run traced generation with callback attached
    # with torch.no_grad():
    #     model.generate(...)

    # TODO: convert callback outputs into df_memory_steps and save CSV
    # df_memory_steps = pd.DataFrame(...)
    # df_memory_steps.to_csv(...)

    # TODO: plot per-token memory increase and save the figure
    # fig, ax = plt.subplots(...)
    # sns.lineplot(...)
    # plt.savefig(...)
    # print(df_memory_steps.tail())


In [None]:
### Cell 9: Space Complexity Verification
print("--- Space complexity verification ---")

if DEVICE.type != "cuda":
    print("Skipped: requires GPU trace from the previous cell.")
elif df_memory_steps.empty:
    print("No memory trace data available. Run Experiment 3 before this analysis.")
else:
    # TODO: run linear regression on df_memory_steps and report slope/intercept/R^2
    # slope, intercept, r_value, p_value, std_err = linregress(...)
    # print(...)
    pass


In [None]:
### Cell 10: List all generated artifacts for Task 1
print("Task 1 complete. Generated artifacts:")

# TODO: iterate over output directories and list generated files
# if os.path.isdir(FIGURES_DIR):
#     ...
# if os.path.isdir(RESULTS_DIR):
#     ...
