## Exercises XP: W8_D2

## LLMs and Open Source Models
---

## What You'll Learn

- Identify the degrees of openness in LLMs and what each enables  
- Understand and compare open-source licenses  
- Assess the strengths and trade-offs of various open LLMs  
- Evaluate which models work best in specific use cases or constraints  
- Explore open LLM tools and communities—without requiring GPUs  
- Gain practical skills in LLM model selection and deployment preparation  

---

## What You Will Create

- A comparative LLM analysis table  
- A licensing compatibility checklist for SaaS products  
- A quiz-style reflection on LLM selection  
- A local deployment readiness checklist  
- A benchmark-based model match guide  
- A hardware upgrade plan or cost-benefit comparison (local vs. cloud)  

---

## Exercise 1: Open Source Levels Reflection

**Goal:** Understand the three openness levels (Fully Open, Weights Released, Architecture Only) and their implications for retraining and domain-specific use cases.

**Deliverables:**
- Comparative paragraph (3-5 sentences)  
- Healthcare prompt reflection (1-2 sentences)  

---

## Exercise 2: License Check for SaaS Use

**Goal:** Compare two open-source models' licenses (e.g., Mistral-7B, Llama-2-7B) and identify commercial use permissions and restrictions.

**Deliverables:**
- Model selection with URLs  
- Completed Markdown checklist comparing license type, commercial use, and restrictions  

---

## Exercise 3: LLM Matchmaker Challenge

**Goal:** Match suitable open LLMs to three teams (LegalTech, EdTech, NGO) based on constraints (CPU-only, math focus, multilingual).

**Deliverables:**
- Search filter summary  
- Filled table with chosen models and justifications  

---

## Exercise 4: Local Readiness Audit

**Goal:** Assess your system's specs (RAM, disk, OS) and check readiness for local deployment with tools like `llama.cpp`.

**Deliverables:**
- Audit table with ✅/❌  
- Notes on required upgrades or optimizations  

---

## Exercise 5: Benchmark-Based Model Explorer

**Goal:** Use Hugging Face Leaderboard benchmarks (HellaSwag, MMLU) to compare three models and propose ideal use cases.

**Deliverables:**
- Comparison table with scores, license, and ideal use case  
- Optional reflection paragraph on benchmarks vs. hype  

---

## Exercise 6: Cloud vs Local Deployment Plan

**Goal:** Compare pros and cons of cloud vs. local deployment and (optional) test inference speed on Colab.

**Deliverables:**
- 5-bullet comparison (local vs. cloud)  
- Optional Colab test report: model used, response time, and notes

### Exercise 1: Open Source Levels Reflection

### 1. Gather Definitions

### Open Source Levels

**Fully Open**  
The model's full architecture and pretrained weights are publicly released.  
This level allows anyone to inspect, modify, and retrain the model without major restrictions.

**Weights Released**  
Only the pretrained weights are provided, while the full source code might remain closed.  
Users can fine-tune the model but cannot deeply modify its internal design.

**Architecture Only**  
Only the model's structural blueprint is shared, without pretrained weights.  
Developers must train the model entirely from scratch, which requires significant data and compute resources.


### Key Characteristics

**Fully Open**
- Open: architecture + pretrained weights  
- Can do: inspect, modify, fine-tune freely  
- Cannot do: very little is restricted beyond the license terms

**Weights Released**
- Open: pretrained weights only  
- Can do: fine-tune on your data  
- Cannot do: change model architecture or see full training details

**Architecture Only**
- Open: architecture only  
- Can do: replicate or train from scratch  
- Cannot do: use pretrained capabilities or fine-tune


### Comparison Table

| Openness Level    | What’s Open?                      | Impact on Retraining/Modifying                    |
|-------------------|-----------------------------------|---------------------------------------------------|
| Fully Open        | Architecture + pretrained weights | Can fully inspect, modify, and fine-tune          |
| Weights Released  | Pretrained weights only           | Can fine-tune but cannot alter architecture       |
| Architecture Only | Model structure only (no weights) | Must train from scratch; no pretrained advantages |
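
The three levels can also be written down as a small lookup, which makes the fine-tuning consequence of each level explicit. This is a minimal sketch that follows the framing of the table; the `OpennessLevel` class and its field names are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpennessLevel:
    name: str
    architecture_open: bool
    weights_open: bool

    @property
    def can_fine_tune(self) -> bool:
        # Fine-tuning starts from pretrained weights, so they must be released.
        return self.weights_open

    @property
    def must_train_from_scratch(self) -> bool:
        # Only the blueprint is available: every weight has to be learned anew.
        return self.architecture_open and not self.weights_open


LEVELS = [
    OpennessLevel("Fully Open", architecture_open=True, weights_open=True),
    OpennessLevel("Weights Released", architecture_open=False, weights_open=True),
    OpennessLevel("Architecture Only", architecture_open=True, weights_open=False),
]

for level in LEVELS:
    print(f"{level.name}: fine-tune={level.can_fine_tune}, "
          f"train from scratch={level.must_train_from_scratch}")
```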


### Comparative Paragraph

Fully open models provide complete access to both architecture and pretrained weights, allowing full inspection, modification, and retraining without restrictions.  
Models with weights released give users access to pretrained weights but not the full source code, enabling fine-tuning while limiting structural changes.  
Architecture-only models share only the blueprint of the model, requiring training from scratch and offering no pretrained benefits.  
The key trade-off is between flexibility and effort: more openness offers easier adaptation, while less openness demands more resources to achieve similar results.


### Healthcare Reflection

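For a healthcare-specific assistant that must be retrained on sensitive clinical data, a fully open model is the safest choice, and released weights are the minimum requirement because an architecture-only model would have to be trained from scratch.  
Full openness allows both fine-tuning and inspection on local infrastructure, which helps demonstrate compliance with medical standards and data privacy requirements.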

### Exercise 2: License Check for SaaS Use

- [x] **mistralai/Mistral-7B-Instruct**  
  - License type: Apache 2.0 (permissive, open source)  
  - Commercial use allowed: Yes (no major restrictions)  
  - Restrictions:  
    - [ ] Redistribution must retain the license text and copyright notices (Apache 2.0 attribution terms)  
    - [ ] No user or geographic limitations declared  

- [x] **meta-llama/Llama-2-7b-chat-hf**  
  - License type: Llama 2 Community License (source-available, not OSI-certified)  
  - Commercial use allowed: Yes, but with conditions  
  - Restrictions:  
    - [ ] Products with more than 700M monthly active users must request a separate license from Meta  
    - [ ] Attribution required under license terms  
    - [ ] Must comply with Meta's Acceptable Use Policy and applicable laws (including trade-compliance rules)  
    - [ ] Outputs may not be used to improve other LLMs (Llama 2 derivatives excepted)  
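
The commercial-use part of this checklist can be sketched as a small lookup that flags blockers for a SaaS product. This is only an illustration: the `LicenseTerms` class and `saas_blockers` function are hypothetical names, the terms are hand-copied from the checklist above, and nothing here replaces reading the actual license text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LicenseTerms:
    name: str
    commercial_use: bool
    mau_cap: Optional[int]  # monthly-active-user threshold; None means no cap


# Terms transcribed by hand from the checklist (assumption: still current).
LICENSES = {
    "mistralai/Mistral-7B-Instruct": LicenseTerms("Apache 2.0", True, None),
    "meta-llama/Llama-2-7b-chat-hf": LicenseTerms(
        "Llama 2 Community License", True, 700_000_000
    ),
}


def saas_blockers(model_id: str, monthly_active_users: int) -> List[str]:
    """Return the reasons a SaaS product could not ship this model as-is."""
    terms = LICENSES[model_id]
    blockers = []
    if not terms.commercial_use:
        blockers.append("commercial use not permitted")
    if terms.mau_cap is not None and monthly_active_users > terms.mau_cap:
        blockers.append(
            f"more than {terms.mau_cap:,} monthly active users: separate license needed"
        )
    return blockers


for model_id in LICENSES:
    print(model_id, saas_blockers(model_id, monthly_active_users=50_000))
```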


### Exercise 3: LLM Matchmaker Challenge

### Team Needs Analysis

**LegalTech**  
- Constraint: CPU-only inference  
- Focus: logical reasoning performance  
- Goal: fast and lightweight chatbot

**EdTech**  
- Constraint: low-end laptop hardware  
- Focus: math and logic benchmarks (GSM8K, MATH)  
- Goal: efficient model with good accuracy

**Global NGO**  
- Constraint: supports 5+ languages  
- Focus: multilingual benchmarks (e.g., FLORES-200)  
- Goal: broad language coverage with reasonable size


### Search Filters Used

**LegalTech (logic-heavy, CPU)**  
- Filters: `logic`, `≤7B parameters`, `CPU compatible / quantized`  
- Candidate Models:  
  1. mistralai/Mistral-7B-Instruct  
  2. TinyLlama/TinyLlama-1.1B-Chat-v1.0  
  3. OpenAssistant/pythia-1.4b-sft-v8-8k-steps

**EdTech (math + logic, low-end laptop)**  
- Filters: `math`, `logic`, `≤7B parameters`, `quantized`  
- Candidate Models:  
  1. WizardLM/WizardMath-7B-V1.1  
  2. NousResearch/Llama-2-7b-hf-QLoRA  
  3. EleutherAI/pythia-2.8b

**Global NGO (multilingual)**  
- Filters: `multilingual`, `≤7B parameters`, `CPU friendly`  
- Candidate Models:  
  1. bigscience/bloom-3b  
  2. facebook/m2m100_418M  
  3. Helsinki-NLP/opus-mt-mul-en


### LLM Matchmaker - Final Selection

| Team        | Needs                                         | Your Pick (Justification)                                                                 |
|-------------|-----------------------------------------------|------------------------------------------------------------------------------------------|
| LegalTech   | Fast logic-heavy chatbot on CPU               | **mistralai/Mistral-7B-Instruct** - strong reasoning, efficient size, CPU-friendly quantization |
| EdTech      | Math/logic-focused model for low-end laptops  | **WizardLM/WizardMath-7B-V1.1** - tuned for math benchmarks; needs 4-bit quantization to fit low-end hardware |
| Global NGO  | Multilingual support (5+ languages)           | **bigscience/bloom-3b** - supports many languages, lightweight for multilingual tasks     |
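
The filtering step behind this table can be sketched in a few lines. The example below is a rough illustration under stated assumptions: parameter counts are the nominal sizes read from the model names, the tags simply mirror the search filter each candidate was listed under, and `Candidate` and `shortlist` are hypothetical names rather than a Hugging Face API.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    model_id: str
    params_b: float  # nominal size in billions, taken from the model name
    tags: set = field(default_factory=set)  # hand-assigned from the search filters


CANDIDATES = [
    Candidate("mistralai/Mistral-7B-Instruct", 7.0, {"logic"}),
    Candidate("TinyLlama/TinyLlama-1.1B-Chat-v1.0", 1.1, {"logic"}),
    Candidate("WizardLM/WizardMath-7B-V1.1", 7.0, {"math", "logic"}),
    Candidate("EleutherAI/pythia-2.8b", 2.8, {"math", "logic"}),
    Candidate("bigscience/bloom-3b", 3.0, {"multilingual"}),
    Candidate("facebook/m2m100_418M", 0.418, {"multilingual"}),
]


def shortlist(candidates, required_tag, max_params_b=7.0):
    """Keep candidates with the tag and within the size budget, smallest first."""
    matches = [
        c for c in candidates
        if required_tag in c.tags and c.params_b <= max_params_b
    ]
    return sorted(matches, key=lambda c: c.params_b)


for team, tag in [("LegalTech", "logic"), ("EdTech", "math"), ("Global NGO", "multilingual")]:
    print(team, [c.model_id for c in shortlist(CANDIDATES, tag)])
```

Sorting by size only surfaces the lightest options first; the final pick still depends on benchmark quality, which this sketch does not model.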


### Exercise 4: Local Readiness Audit

#### Step 1: Collect system specs

```python
# Check RAM, disk space, and kernel version on Colab
# (a leading "!" runs the line as a shell command from a notebook cell)
!free -h      # Shows total and available RAM
!df -h        # Displays disk usage
!uname -r     # Prints the kernel version
```

```
               total        used        free      shared  buff/cache   available
Mem:            12Gi       659Mi       9.1Gi       1.0Mi       3.0Gi        11Gi
Swap:             0B          0B          0B
Filesystem      Size  Used Avail Use% Mounted on
overlay         113G   39G   75G  34% /
tmpfs            64M     0   64M   0% /dev
shm             5.7G     0  5.7G   0% /dev/shm
/dev/root       2.0G  1.2G  775M  61% /usr/sbin/docker-init
/dev/sda1        74G   41G   34G  55% /opt/bin/.nvidia
tmpfs           6.4G   36K  6.4G   1% /var/colab
tmpfs           6.4G     0  6.4G   0% /proc/acpi
tmpfs           6.4G     0  6.4G   0% /proc/scsi
tmpfs           6.4G     0  6.4G   0% /sys/firmware
6.1.123+
```


### Local Readiness Audit

| Requirement               | Your System Specs          | Meets Requirement? |
|---------------------------|----------------------------|--------------------|
| RAM (≥ 16 GB)             | 12 GB                      | ❌                 |
| Free Disk Space (≥ 40 GB) | 75 GB available            | ✅                 |
| OS (Linux/WSL2)           | Linux kernel 6.1.123+ (Colab) | ✅              |


Your current environment meets the disk space and OS requirements but falls short on RAM (12 GB instead of the recommended 16 GB).
Running a 7B parameter model is possible with quantization (e.g., 4-bit) but may require memory optimization or smaller models.
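
To see why quantization makes a 7B model plausible within 12 GB, the weight footprint can be estimated as parameters × bits per weight ÷ 8. The sketch below covers the weights only; the KV cache, activations, and runtime overhead come on top, and real quantized formats store extra scaling data, so actual files are somewhat larger. The function name is an illustrative assumption.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
```

Under this estimate a 7B model needs about 14 GB at 16-bit precision, which exceeds the 12 GB available, while the 4-bit version needs about 3.5 GB.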

### llama.cpp Readiness & Upgrade Plan

- CPU instruction sets: AVX2/SSE supported (compatible with llama.cpp)  
- Compiler tools: GCC and CMake available (ready for local builds)  
- Limitation: 12 GB RAM (below 16 GB recommended)  
- Workarounds: Use 4-bit quantized models or choose smaller models (2-3B parameters)  
- Optional upgrade: Run on a local machine with ≥ 16 GB RAM or rent a cloud GPU for larger models
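
The checks in this list can be scripted with the Python standard library on Linux. The following is a minimal sketch, assuming the usual `/proc/cpuinfo` and `/proc/meminfo` layouts found on x86 Linux; the helper names are hypothetical and the script only reports facts, it does not build or run `llama.cpp`.

```python
import os
import shutil


def cpu_flags(cpuinfo_text):
    """Extract CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def mem_total_gb(meminfo_text):
    """Read MemTotal (reported in KiB) from /proc/meminfo-style text, in GB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024 / 1e9
    return 0.0


def missing_tools(names=("gcc", "cmake")):
    """Return the build tools that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Only run the live audit where the Linux /proc files exist.
if os.path.exists("/proc/cpuinfo") and os.path.exists("/proc/meminfo"):
    with open("/proc/cpuinfo") as f:
        flags = cpu_flags(f.read())
    with open("/proc/meminfo") as f:
        ram_gb = mem_total_gb(f.read())
    free_disk_gb = shutil.disk_usage("/").free / 1e9
    print("AVX2 supported:", "avx2" in flags)
    print(f"RAM: {ram_gb:.1f} GB (16 GB recommended)")
    print(f"Free disk: {free_disk_gb:.1f} GB (40 GB recommended)")
    print("Missing build tools:", missing_tools() or "none")
```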


### Exercise 5: Benchmark-Based Model Explorer

### Benchmark Comparison

| Model Name                    | HellaSwag Score | MMLU Score | License Type              | Ideal Use Case                              |
|-------------------------------|-----------------|------------|---------------------------|---------------------------------------------|
| mistralai/Mistral-7B-Instruct | 84              | 60         | Apache 2.0 (permissive)   | Reasoning-focused chatbots and agents       |
| meta-llama/Llama-2-7b-chat-hf | 81              | 55         | Llama 2 Community License | General-purpose chat with some restrictions |
| bigscience/bloom-3b           | 70              | 40         | BigScience RAIL License   | Multilingual support, basic Q&A             |
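
One simple way to turn the table into a ranking is to average the two scores per model. The sketch below uses the scores exactly as listed above; the unweighted mean is an assumption made for illustration, since the two benchmarks measure different abilities and a real selection would weight them by use case.

```python
# Scores copied from the comparison table above.
SCORES = {
    "mistralai/Mistral-7B-Instruct": {"HellaSwag": 84, "MMLU": 60},
    "meta-llama/Llama-2-7b-chat-hf": {"HellaSwag": 81, "MMLU": 55},
    "bigscience/bloom-3b": {"HellaSwag": 70, "MMLU": 40},
}


def rank_by_mean(scores):
    """Return (model, mean score) pairs sorted from highest to lowest mean."""
    means = {
        model: sum(results.values()) / len(results)
        for model, results in scores.items()
    }
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


for model, mean in rank_by_mean(SCORES):
    print(f"{model}: {mean:.1f}")
```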


### Reflection

Benchmark scores provide a standardized way to compare LLMs on commonsense reasoning (HellaSwag) and academic knowledge (MMLU).  
Rather than relying on hype or marketing claims, these metrics help align a model's strengths with real use cases, so the chosen LLM is more likely to meet performance expectations for specific tasks.


### Exercise 6: Cloud vs. Local Deployment Plan

### Cloud vs Local Deployment - Key Points

- **Local (Pro)**: No network latency; full control over data and environment.  
- **Local (Con)**: Requires powerful hardware upfront (e.g., ≥16 GB RAM, fast CPU/GPU).  
- **Cloud (Pro)**: Easy access to high-end GPUs and scalable resources on demand.  
- **Cloud (Con)**: Ongoing costs and dependency on internet connection.  
- **Overall**: Cloud is ideal for rapid prototyping; local is better for private, long-term use.
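
The cost side of this trade-off can be made concrete with a break-even estimate: the number of usage hours after which buying hardware becomes cheaper than renting. This is a rough sketch with hypothetical prices chosen only for illustration; it ignores depreciation, maintenance, and idle time.

```python
def break_even_hours(hardware_cost, cloud_rate_per_hour, local_cost_per_hour=0.0):
    """Hours of use after which local hardware is cheaper than cloud rental."""
    saving_per_hour = cloud_rate_per_hour - local_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("cloud is never more expensive per hour; no break-even point")
    return hardware_cost / saving_per_hour


# Hypothetical figures: 1500 for a workstation upgrade, 1.00/hour for a cloud GPU,
# 0.10/hour for local electricity. Replace them with real quotes.
hours = break_even_hours(1500, 1.00, 0.10)
print(f"Break-even after ~{hours:.0f} hours of use")
```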


```python
# Log into Hugging Face to access gated models like Llama-2
from huggingface_hub import login

# Paste your token when prompted (kept secret in Colab)
login()
```


```python
# Load and test Llama-2-7B-Chat-HF
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"

# Load tokenizer and model (device_map="auto" will use GPU if available)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

# Prepare a test prompt and move it to the same device as the model
inputs = tokenizer("Explain why open-source LLMs are valuable for education:", return_tensors="pt").to(model.device)

# Measure generation time only (decoding and printing are excluded)
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=50)
elapsed = time.time() - start

print(tokenizer.decode(outputs[0]))
print("Elapsed:", elapsed, "seconds")
```



```
<s> Explain why open-source LLMs are valuable for education:

Open-source LLMs are valuable for education because they provide a cost-effective and flexible way for students to learn and gain hands-on experience with AI technology. Here are some reasons why:

1. Accessibility
Elapsed: 4.90664005279541 seconds
```


### Cloud vs Local Deployment - 5 Key Points

- **Local (Pro)**: No internet latency; full control over data and hardware.  
- **Local (Con)**: Requires powerful hardware (≥16 GB RAM, strong CPU/GPU) and setup.  
- **Cloud (Pro)**: Easy access to powerful GPUs; good for scaling and rapid prototyping.  
- **Cloud (Con)**: Ongoing costs; dependent on internet availability and provider limits.  
- **Overall**: Cloud works best for temporary or scalable workloads; local is ideal for private, long-term projects.

---

### Optional Colab Test

- **Model tested**: meta-llama/Llama-2-7b-chat-hf  
- **Response time**: ~4.9 seconds  
- **Observation**: About 4.9 seconds for up to 50 new tokens is reasonable for a 7B model on Colab; quantized versions would use less memory and may run faster, depending on the hardware and runtime.
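
The response time is easier to compare across models when expressed as throughput. The sketch below converts the measured run into tokens per second, assuming the full `max_new_tokens=50` budget was generated (no early stop) and that the measured time is dominated by generation; the function name is illustrative.

```python
def tokens_per_second(new_tokens: int, elapsed_seconds: float) -> float:
    """Average generation throughput for one request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return new_tokens / elapsed_seconds


# Values from the Colab test above: 50 new tokens in about 4.9 seconds.
rate = tokens_per_second(50, 4.90664005279541)
print(f"~{rate:.1f} tokens/s")  # prints "~10.2 tokens/s"
```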
