# Lightweight LLM Setup for Google Colab

This notebook is a step-by-step guide to setting up and running a lightweight open-source language model (such as Phi-2 or TinyLlama) from Hugging Face with `transformers` and `accelerate` on a Google Colab free-tier GPU. It also installs `sympy`, `numpy`, and `scikit-learn` for mathematical and machine learning tasks, and keeps RAM usage low throughout.

## 1. Install Dependencies

In [None]:
!pip install transformers accelerate einops scipy sympy numpy scikit-learn bitsandbytes -q
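
After installation, it can help to confirm which versions Colab actually resolved, since model-loading and quantization options differ between releases. The following is a minimal sketch that uses only the standard library; the helper name `report_versions` is illustrative and not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the install cell.
PACKAGES = ["transformers", "accelerate", "einops", "scipy", "sympy",
            "numpy", "scikit-learn", "bitsandbytes"]

def report_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in report_versions().items():
    print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```

If a package shows as not installed, re-running the install cell without `-q` makes pip's error output visible.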

## 2. Verify GPU Availability

In [None]:
import torch

if torch.cuda.is_available():
    print(f"GPU is available: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory Total: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("GPU not available. Running on CPU.")
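
Total memory is only part of the picture, because memory already allocated in the session is not available for the model. The sketch below reports free memory with `torch.cuda.mem_get_info()`; the wrapper `gpu_memory_report` is a hypothetical helper that falls back to a message when PyTorch or a GPU is missing.

```python
def gpu_memory_report():
    """Return a one-line summary of free and total GPU memory."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed."
    if not torch.cuda.is_available():
        return "No CUDA GPU detected."
    # mem_get_info returns (free_bytes, total_bytes) for the current device.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return f"GPU memory free: {free_bytes / 1e9:.2f} GB of {total_bytes / 1e9:.2f} GB"

print(gpu_memory_report())
```

Running this again after loading the model gives a rough idea of how much headroom is left for generation.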

## 3. Load Lightweight Language Model

We will use `microsoft/phi-2` as an example. You can replace this with other lightweight models like `TinyLlama/TinyLlama-1.1B-Chat-v1.0`.

**Optimizations for RAM:**
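- `torch_dtype="auto"` or `torch.float16`: `"auto"` loads the weights in the dtype recorded in the checkpoint's configuration, which is typically half precision for these models; `torch.float16` forces half precision explicitly. Half precision stores 2 bytes per parameter instead of 4, roughly halving the memory needed for the weights.
- `device_map="auto"`: Leverages `accelerate` to place model layers across the available devices (GPU first, then CPU), so the model can load even when it does not fit entirely on the GPU.
- `trust_remote_code=True`: Allows `transformers` to execute custom modeling code shipped in the model repository. Some models require it; recent `transformers` releases support Phi-2 natively, so it may be unnecessary there. Only enable it for repositories you trust.
- `load_in_8bit=True` or `load_in_4bit=True` (optional, pick one): Further reduces memory by quantizing the model weights. This requires `bitsandbytes` and a CUDA GPU, and may slightly reduce generation speed or accuracy. Recent `transformers` releases expect these options inside a `BitsAndBytesConfig` passed as `quantization_config`. We start without quantization; add it if memory is still an issue.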
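
To make the effect of these options concrete, the arithmetic can be sketched as parameter count times bytes per parameter. The parameter counts are the approximate figures implied by the model names (2.7B for Phi-2, 1.1B for TinyLlama), and the function name `weight_memory_gb` is illustrative.

```python
# Storage cost per parameter for each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}

# Approximate parameter counts (assumption: rounded published sizes).
MODELS = {
    "microsoft/phi-2": 2.7e9,
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0": 1.1e9,
}

def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for model_name, num_params in MODELS.items():
    print(model_name)
    for dtype in BYTES_PER_PARAM:
        print(f"  {dtype:<8} ~{weight_memory_gb(num_params, dtype):.1f} GB")
```

Under these assumptions Phi-2 needs about 10.8 GB in float32 but about 5.4 GB in float16. Activations, the key-value cache, and quantization overhead come on top, so treat the figures as lower bounds.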

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "microsoft/phi-2" # Example model, can be changed
# model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0" # Another option

try:
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
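        torch_dtype="auto", # Use the checkpoint's dtype; pass torch.float16 to force half precision
        device_map="auto", # Let accelerate place layers on the GPU, spilling to CPU if needed
        trust_remote_code=True, # Trailing comma keeps the call valid if an option below is uncommented
        # load_in_8bit=True, # Uncomment for 8-bit quantization if needed (do not combine with 4-bit)
        # load_in_4bit=True, # Uncomment for 4-bit quantization if needed (do not combine with 8-bit)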
    )
    print(f"Model '{model_id}' loaded successfully.")
    print(f"Model device: {model.device}") # Device of the first parameter; model.hf_device_map lists the per-layer placement
except Exception as e:
    print(f"Error loading model: {e}")
    print("If you are seeing an out-of-memory error, try the following:")
    print("1. Ensure your Colab runtime has a GPU (Runtime > Change runtime type > T4 GPU).")
    print("2. Restart the Colab Kernel (Runtime > Restart session) and try again.")
    print("3. Uncomment 'load_in_8bit=True' or 'load_in_4bit=True' in the model loading cell for further RAM optimization (this requires 'bitsandbytes').")
    print("4. Try an even smaller model if available.")
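
If quantization turns out to be necessary, one way to express it is through `BitsAndBytesConfig`, which recent `transformers` releases expect in place of the bare `load_in_8bit` / `load_in_4bit` keywords. The sketch below builds the keyword arguments for `from_pretrained`; `build_load_kwargs` is a hypothetical helper, and the NF4 quantization type and float16 compute dtype are illustrative choices rather than settings taken from this notebook.

```python
def build_load_kwargs(bits=None):
    """Return keyword arguments for AutoModelForCausalLM.from_pretrained.

    bits=None loads without quantization; bits=8 or bits=4 adds a
    bitsandbytes quantization config.
    """
    kwargs = {"device_map": "auto", "torch_dtype": "auto"}
    if bits is None:
        return kwargs
    if bits not in (4, 8):
        raise ValueError("bits must be None, 4, or 8")

    # Imported lazily so the unquantized path has no extra dependencies.
    import torch
    from transformers import BitsAndBytesConfig

    if bits == 8:
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
    else:
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",             # assumption: NF4 weights
            bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 compute
        )
    return kwargs

print(build_load_kwargs())
```

A quantized load would then look like `AutoModelForCausalLM.from_pretrained(model_id, **build_load_kwargs(bits=4))`, run on a GPU runtime with `bitsandbytes` installed.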

## 4. Simple Inference Example

In [None]:
if 'model' in globals() and 'tokenizer' in globals(): # Check if model and tokenizer were loaded successfully
    prompt = "Write a short story about a curious cat exploring a spaceship."
    
    # Ensure tokenizer has a padding token defined, if not, set it to eos_token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = tokenizer.pad_token_id # Keep the model config in sync with the tokenizer

    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=True, padding=True).to(model.device) # Move inputs to model's device

    # Generate text
    try:
        print("Generating text...")
        outputs = model.generate(
            **inputs,
            max_new_tokens=100, # Adjust as needed for longer/shorter responses
            # num_beams=5, # Optional: for beam search
            # early_stopping=True, # Optional: only affects beam search (use together with num_beams)
            # no_repeat_ngram_size=2, # Optional: to prevent repetitive n-grams
            eos_token_id=tokenizer.eos_token_id, # Stop generation at EOS token
            pad_token_id=tokenizer.pad_token_id # Set pad token id for generation
        )
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print("\nGenerated Text:")
        print(generated_text)
    except Exception as e:
        print(f"Error during generation: {e}")
else:
    print("Model or tokenizer not loaded. Skipping inference example.")
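
The cell above passes a bare instruction, and the decoded output repeats the prompt before the continuation. As a sketch of two common refinements: the Phi-2 model card describes an `Instruct: ...\nOutput:` layout for instruction-style prompts, and slicing off the prompt tokens leaves only the newly generated text. Both helper names are hypothetical, and the layout applies to Phi-2 only; chat models such as TinyLlama define their own templates.

```python
def format_phi2_prompt(instruction):
    """Wrap an instruction in the 'Instruct: ... Output:' layout."""
    return f"Instruct: {instruction.strip()}\nOutput:"

def new_tokens_only(output_ids, prompt_length):
    """Drop the echoed prompt tokens from a generated sequence."""
    return output_ids[prompt_length:]

prompt = format_phi2_prompt(
    "Write a short story about a curious cat exploring a spaceship."
)
print(prompt)
```

With the notebook's variables, `new_tokens_only(outputs[0], inputs["input_ids"].shape[-1])` can be passed to `tokenizer.decode` to print just the continuation.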

## 5. Math and ML Library Examples

This section checks that `sympy`, `numpy`, and `scikit-learn` are installed and importable by running a small example with each.

In [None]:
print("--- Sympy Example ---")
from sympy import symbols, Eq, solve
x, y = symbols('x y')
eq1 = Eq(x + y, 10)
eq2 = Eq(x - y, 4)
solution = solve((eq1, eq2), (x, y))
print(f"Solving equations: \n{eq1}\n{eq2}")
print(f"Solution: {solution}")

print("\n--- NumPy Example ---")
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(f"NumPy array:\n{arr}")
print(f"Sum of all elements: {np.sum(arr)}")
print(f"Mean of all elements: {np.mean(arr)}")

print("\n--- Scikit-learn Example ---")
from sklearn.linear_model import LinearRegression
model_sklearn = LinearRegression()
print("Scikit-learn LinearRegression model initialized.")
print(f"Model parameters: {model_sklearn.get_params()}")

print("\nAll libraries imported and tested successfully!")