# Setting

In [None]:
# import os
# Set this before importing transformers so the library picks it up.
# os.environ['HF_HOME'] = '/run/cache/'


'''
HF_HOME Environment Variable: In Hugging Face’s transformers library,
the HF_HOME variable is used to specify the directory where the library downloads and stores model files, configurations, and other assets.
By default, this is usually in a folder like ~/.cache/huggingface on most systems.

Setting HF_HOME: By defining HF_HOME as /run/cache/,
you are directing Hugging Face to store all these files in /run/cache/ instead of the default location.
This can be useful if you want to:

Control where these cached files go.
Use a faster storage option (e.g., RAM disk).
Save space in your main storage directory.

Why This Matters in Colab: In a Colab session, pointing HF_HOME at a directory such as /run/cache/
may give faster read/write operations than a mounted network drive, because the files stay on the runtime’s ephemeral local storage.
However, remember that Colab’s runtime is temporary,
so if you want to persist the models between sessions,
consider mounting Google Drive and setting HF_HOME to a directory there.
'''


In [None]:
# export HF_HOME='/run/cache/'
# jupyter notebook

'''
export HF_HOME='/run/cache/':

This command is used in a Unix-based shell (like Bash) to set the environment variable HF_HOME to the path /run/cache/.
Unlike the Python os.environ assignment, which only affects the current Python process and the child processes it starts,
export sets the variable for every process started from that terminal session (e.g., other applications or scripts).
The HF_HOME environment variable would direct Hugging Face libraries to use /run/cache/ for caching models,
similar to the Python setup, but at the shell-session level.

jupyter notebook:

This command starts a Jupyter Notebook server. With HF_HOME set beforehand,
any notebooks launched would inherit this environment variable.
This means that if you use Hugging Face’s transformers library in Jupyter,
it would also cache its files in /run/cache/,
as specified by the HF_HOME variable set in the shell.
'''

### Key Difference Between Shell and Python Environment Variables
Python Environment (os.environ): Only affects the current Python process and the child processes it starts. Once you exit Python or restart the runtime, the variable is gone.

Shell Environment (export): Persists for the duration of the terminal session and is inherited by every child process or application (e.g., Jupyter Notebook) started in that session.

In Colab, the Python os.environ approach is usually sufficient, as long as it runs before transformers is imported. If you're running commands outside of Colab (e.g., on a local system), export HF_HOME=... ensures that any application started from that terminal sees the HF_HOME setting.
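
The inheritance behavior described above can be checked with a minimal sketch. It sets HF_HOME through os.environ and then reads it back from a child Python process. The one-line child command is only an illustration, not part of the original notebook.

```python
import os
import subprocess
import sys

# Set the variable for this Python process only
os.environ["HF_HOME"] = "/run/cache/"

# A child process started from Python inherits the modified environment
child_code = "import os; print(os.environ.get('HF_HOME'))"
child_value = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()

print(child_value)
```

The shell that launched Python is unaffected: after the Python process exits, `echo $HF_HOME` in that terminal still shows whatever the shell had before.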

### Step 1: Mount Google Drive
Mount Google Drive in Colab and set HF_HOME to a directory within it so that Hugging Face models and data persist across sessions:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# After running this, you’ll be prompted to authorize access to your Google Drive. Once authorized, your Google Drive will be accessible at /content/drive.

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Step 2: Set HF_HOME to a Directory in Google Drive
Choose or create a folder within Google Drive to store Hugging Face models. Here’s how to set it up:

In [None]:
import os

# Define the path in Google Drive
hf_cache_dir = '/content/drive/MyDrive/huggingface_cache'

# Create the directory if it doesn't already exist
os.makedirs(hf_cache_dir, exist_ok=True)

# Set the HF_HOME environment variable to the chosen path
os.environ['HF_HOME'] = hf_cache_dir


In [None]:
'''
Explanation
hf_cache_dir: This variable holds the path to a directory in Google Drive where Hugging Face files will be saved.
Here, it’s set to '/content/drive/MyDrive/huggingface_cache', but you can change this path to any folder within your Google Drive.
os.makedirs(): This ensures the directory exists; it will create it if it doesn’t.
os.environ['HF_HOME']: Sets the environment variable so that any downloads or cached files from Hugging Face will go to this directory.
'''

### Verification
After running the above setup, whenever you download a Hugging Face model, it will be saved in the specified Google Drive folder. You can check this by listing files in that folder:

In [None]:
# List files in the cache directory to confirm downloads
os.listdir(hf_cache_dir)

# This setup ensures that any models downloaded during your Colab session will be available in Google Drive, making them persistent across different sessions.

['hub']
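
Beyond listing the top-level folder, it can help to see how much space the cache takes up. The sketch below walks the cache directory and sums file sizes. `cache_size_bytes` is a hypothetical helper, and the fallback path assumes the default location mentioned earlier.

```python
import os

def cache_size_bytes(cache_dir):
    """Sum the sizes of all regular files under cache_dir (hypothetical helper)."""
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks so a file reachable through a link is counted once
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total

# Fall back to the usual default location if HF_HOME is not set
hf_cache_dir = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
print(f"{cache_size_bytes(hf_cache_dir) / 1e6:.1f} MB cached in {hf_cache_dir}")
```

A missing directory simply reports 0.0 MB, since os.walk yields nothing for a path that does not exist.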

# Open source LLM

In [None]:
!rm -f minsearch.py
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-11-06 14:01:41--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py’


2024-11-06 14:01:41 (85.2 MB/s) - ‘minsearch.py’ saved [3832/3832]



In [None]:
!pip install -U transformers accelerate bitsandbytes



In [None]:
import requests
import minsearch

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

index.fit(documents)

<minsearch.Index at 0x7bcd0c324310>
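
To make the flattening step easier to follow, here is a small offline sketch using made-up records with the same shape the loop above expects (a list of courses, each holding a list of documents). The sample texts and the second course name are hypothetical.

```python
# Hypothetical sample with the same nesting as documents.json
documents_raw = [
    {
        "course": "data-engineering-zoomcamp",
        "documents": [
            {"text": "Yes, you can still join.", "section": "General", "question": "Can I still join?"},
        ],
    },
    {
        "course": "mlops-zoomcamp",  # hypothetical second course
        "documents": [
            {"text": "Use Python 3.", "section": "Setup", "question": "Which Python version?"},
        ],
    },
]

documents = []
for course in documents_raw:
    for doc in course["documents"]:
        # Copy the course name onto each record so it can serve as a keyword filter
        documents.append({**doc, "course": course["course"]})

print(documents[0])
```

Each flattened record now carries the four fields the index uses: question, text, section, and course.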

In [None]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return results

In [None]:
def build_prompt(query, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

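def llm(prompt):
    # `client` is assumed to be an OpenAI client created elsewhere
    # (for example with `client = OpenAI()`); it is not defined in the cells shown here.
    # This version is replaced by a local FLAN-T5 implementation further down.
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content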

In [None]:
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer
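
To see what the model actually receives, the sketch below runs the prompt-building step on its own, without the index or an LLM. It repeats the template from above so it runs standalone. `sample_results` is a hypothetical stand-in for the search output.

```python
PROMPT_TEMPLATE = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

def build_prompt(query, search_results):
    # Concatenate each retrieved FAQ entry into one context block
    context = ""
    for doc in search_results:
        context += f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    return PROMPT_TEMPLATE.format(question=query, context=context).strip()

# Hypothetical search result standing in for the output of search(query)
sample_results = [
    {"section": "General", "question": "Can I still join?", "text": "Yes, you can still join."},
]

prompt = build_prompt("Can I still join?", sample_results)
print(prompt)
```

Printing the prompt like this is a quick way to check how long the context gets before sending it to a model with a limited input length.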

### Disk usage
The command !df -h in Colab shows disk usage across the mounted filesystems in a human-readable format (-h).

In [None]:
!df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay         236G   55G  182G  23% /
tmpfs            64M     0   64M   0% /dev
shm              26G  4.0K   26G   1% /dev/shm
/dev/root       2.0G  1.2G  820M  59% /usr/sbin/docker-init
tmpfs            27G  1.6M   27G   1% /var/colab
/dev/nvme0n1p1  242G   71G  172G  30% /kaggle/input
tmpfs            27G     0   27G   0% /proc/acpi
tmpfs            27G     0   27G   0% /proc/scsi
tmpfs            27G     0   27G   0% /sys/firmware
drive           100G   47G   54G  47% /content/drive


Filesystem: This column lists the specific disk partitions or types of storage resources available in the Colab environment.

Size: The total capacity of each filesystem or disk partition.

Used: The amount of storage currently in use.

Avail: The available storage left on each filesystem.

Use%: The percentage of the filesystem's total space that is in use.

Mounted on: The directory where each filesystem is "mounted," meaning where in the directory structure it can be accessed.

Explanation of Each Line

overlay: The main filesystem that Colab allocates for your runtime environment, with 236 GB total and 182 GB free (23% in use). This is where most files you create or save during your session are stored.

/dev/shm: Shared memory for inter-process communication, with a large allocation (26 GB) that is almost entirely free.

/dev/root: A smaller system partition with 2.0 GB total and 820 MB free, where essential system binaries and services might be stored. This is almost entirely managed by Colab.

/var/colab: A temporary in-memory filesystem (tmpfs), often used by Colab internally. It reports 27 GB of capacity but holds only 1.6 MB.

/dev/nvme0n1p1: A faster NVMe drive, mounted at /kaggle/input, where data files, particularly those accessed during computations, may be cached. This drive has 242 GB total, with 172 GB available.

tmpfs entries (/proc/acpi, /proc/scsi, /sys/firmware): Temporary file systems generally used for reading device and system information, often showing no data usage as they’re mostly read-only.

drive: The mounted Google Drive at /content/drive, with 100 GB total and 54 GB available.

In a Colab environment, these filesystems serve specific purposes, mainly to support system functions and provide temporary storage for runtime operations. Note that Colab storage is typically ephemeral, so data not saved explicitly (e.g., to Google Drive) may be lost after the session ends.
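
The same information can be read from Python with the standard library, which is handy for checking free space before downloading a large model. This is a minimal sketch. `describe_disk` is a hypothetical helper, and its numbers may differ slightly from df because of rounding and reserved blocks.

```python
import shutil

def describe_disk(path):
    """Return a one-line summary of total, free, and used space for the filesystem holding path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    gib = 1024 ** 3
    percent_used = 100 * usage.used / usage.total
    return (
        f"{path}: {usage.total / gib:.1f} GiB total, "
        f"{usage.free / gib:.1f} GiB free, {percent_used:.0f}% used"
    )

print(describe_disk("/"))
```

In Colab, calling it with "/content/drive" instead of "/" would report on the mounted Google Drive, assuming the drive is mounted.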

In [None]:
# pip install accelerate
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [None]:
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
input_ids

tensor([[13959,  1566,    12,  2968,    10,   571,   625,    33,    25,    58,
             1]], device='cuda:0')

### Why the tensor has 11 token IDs instead of 10
The tensor is two-dimensional with shape (1, 11): one sequence containing 11 token IDs. The extra ID is a special token added by the tokenizer. The T5 tokenizer appends an end-of-sequence token (</s>, ID 1) to every input. Other tokenizers use different special tokens; BERT, for example, wraps the input in [CLS] and [SEP].

Here’s the breakdown:

Input Sequence: "translate English to German: How old are you?"

Tokenization: The tokenizer splits this input into 10 tokens. Here each word or punctuation mark happens to map to one token, but T5's SentencePiece tokenizer can split rarer words into several subword pieces.

Special Token Addition: The tokenizer appends the </s> token (ID 1) to the sequence, making the total length 11 tokens.

### Why Add a Special Token?
The special </s> token helps the model understand where the input ends, which is especially useful for tasks involving sequences like translation, where you might be working with different lengths and boundaries in input and output text.
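
To make the counts concrete, the sketch below uses the token IDs from the tokenizer output above as a plain Python list, so it needs no model or GPU. It treats ID 1 as the end-of-sequence token, which matches the trailing 1 in the tensor.

```python
# Token IDs copied from the tokenizer output shown above
input_ids = [[13959, 1566, 12, 2968, 10, 571, 625, 33, 25, 58, 1]]

EOS_ID = 1  # end-of-sequence ID, taken from the last position of the sequence

# The tensor holds one sequence (batch size 1) of 11 token IDs
batch_size = len(input_ids)
sequence_length = len(input_ids[0])
print((batch_size, sequence_length))  # (1, 11)

# Dropping the special token leaves the 10 tokens that come from the text itself
content_ids = [token_id for token_id in input_ids[0] if token_id != EOS_ID]
print(len(content_ids))  # 10
print(input_ids[0][-1] == EOS_ID)  # True
```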

In [None]:
# # Necessary imports
# from transformers import T5Tokenizer, T5ForConditionalGeneration

# # Initialize the tokenizer and model
# tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto")

# # Tokenize without special tokens during the tokenization process
# input_text = "translate English to German: How old are you?"
# input_ids = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).input_ids.to("cuda")


In [None]:
# input_ids

In [None]:
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))



<pad> Wie alt sind Sie?</s>


In [None]:
outputs

# <pad> Wie alt sind Sie?</s>
# tensor([[   0, 2739, 4445,  436,  292,   58,    1]], device='cuda:0')
# 0 = <pad>, ..., 1 = </s>

tensor([[   0, 2739, 4445,  436,  292,   58,    1]], device='cuda:0')

### Put everything together

In [None]:
input_text = "translate English to German: How old are you?"

# move below lines into def llm(prompt):

# input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
# outputs = model.generate(input_ids)
# result = tokenizer.decode(outputs[0])



In [None]:
# result

'<pad> Wie alt sind Sie?</s>'

### Parameters
top_k=50:

Top-K Sampling: At each step of text generation, the model considers only the top K highest-scoring options (here, 50 options), ignoring the rest. This effectively reduces the likelihood of selecting low-probability, less logical words. Increasing the K value can make the generated text more creative but may also introduce more variability.

top_p=0.95:

Top-P (Nucleus Sampling): Similar to top_k, but selects options based on cumulative probability. At each generation step, the model selects from a subset of tokens whose cumulative probability reaches p (here, 0.95). Using top_p allows the generated text to be more flexible without fixing the K value, suitable for tasks that require a balance between creativity and coherence.

do_sample=False:

Enable Random Sampling: If set to True, the model will randomly choose words within the limits of top_k or top_p. Setting it to False means the model will always choose the highest-scoring option (greedy decoding, or beam search when num_beams is greater than 1), which suits more consistent and rigorous outputs. When do_sample=False, top_k, top_p, and temperature have no effect, and the generated text tends to be more consistent but less creative.

num_beams=4:

Number of Beams in Beam Search: Sets the number of beams for beam search, which is a search strategy that considers multiple potential paths during generation and then selects the one with the highest score. Here, it is set to use 4 beams. Beam search can improve the quality of the generated text and is usually applied to tasks requiring high accuracy, such as translation or summarization.

no_repeat_ngram_size=3:

Avoid Repeating N-grams: Sets the size of n (here, 3), so the model avoids generating the same n-gram (such as three consecutive words) repeatedly in the text. This parameter is helpful for reducing overly repetitive content.

early_stopping=True:

Early Stopping: During beam search, generation stops as soon as num_beams complete candidates (sequences that end in an EOS, or end-of-sequence, token) have been found, instead of continuing to search for better ones. Enabling this option helps produce text of a suitable length, especially when overly long output is not needed.
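
The difference between top_k and top_p is easier to see on a toy distribution. The following is a simplified pure-Python sketch of the two filters, not the transformers implementation. The four-token vocabulary and its probabilities are made up for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[token] for token in kept)
    return {token: probs[token] / total for token in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p, then renormalize."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[token] for token in kept)
    return {token: probs[token] / total for token in kept}

# Hypothetical next-token distribution
toy = {"Sie": 0.5, "du": 0.3, "ihr": 0.15, "wir": 0.05}

top2 = top_k_filter(toy, k=2)       # always exactly 2 candidates
nucleus = top_p_filter(toy, p=0.9)  # as many candidates as needed to cover 90%

print(sorted(top2))     # ['Sie', 'du']
print(sorted(nucleus))  # ['Sie', 'du', 'ihr']
```

With top_k the number of candidates is fixed, while with top_p it grows or shrinks depending on how peaked the distribution is. Sampling then draws from the renormalized candidates.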

In [None]:
def build_prompt(query, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

def llm(prompt, generate_params=None):
    """Generate an answer with the local FLAN-T5 model loaded above."""
    if generate_params is None:
        generate_params = {}

    # Send the inputs to whichever device the model was placed on
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    outputs = model.generate(
        input_ids,
        max_length=generate_params.get("max_length", 100),
        num_beams=generate_params.get("num_beams", 5),
        do_sample=generate_params.get("do_sample", False),
        # temperature, top_k and top_p only take effect when do_sample=True
        temperature=generate_params.get("temperature", 1.0),
        top_k=generate_params.get("top_k", 50),
        top_p=generate_params.get("top_p", 0.95),
    )
    # skip_special_tokens drops <pad> and </s> from the decoded text
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result
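
The llm function above passes temperature, top_k, and top_p even when do_sample is False, where they have no effect. One possible alternative, sketched below, is to assemble the keyword arguments first and drop the sampling-only ones when sampling is off. `build_generate_kwargs` is a hypothetical helper. The defaults mirror the values used in the function above.

```python
DEFAULTS = {
    "max_length": 100,
    "num_beams": 5,
    "do_sample": False,
    "temperature": 1.0,
    "top_k": 50,
    "top_p": 0.95,
}
SAMPLING_ONLY = ("temperature", "top_k", "top_p")

def build_generate_kwargs(generate_params=None):
    """Merge caller overrides into the defaults and drop unused sampling options."""
    kwargs = {**DEFAULTS, **(generate_params or {})}
    if not kwargs["do_sample"]:
        # These options are only used when sampling is enabled
        for name in SAMPLING_ONLY:
            kwargs.pop(name, None)
    return kwargs

print(build_generate_kwargs())
print(build_generate_kwargs({"do_sample": True, "top_k": 10}))
```

Inside llm, the result could then be unpacked into the call, for example `model.generate(input_ids, **build_generate_kwargs(generate_params))`.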

In [None]:
# outputs = model.generate(
#     input_ids,
#     max_new_tokens=100,
#     num_beams=4,
#     no_repeat_ngram_size=3,
#     early_stopping=True,
#     do_sample=False,  # Sampling is not used
#     temperature=1.0   # This can be removed, as it is not used when sampling is disabled
# )

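# Longer response:
# Use max_length
# outputs = model.generate(input_ids,
#                          max_length=512)  # Total length limit; for decoder-only models this includes the prompt tokens,
#                                           # for encoder-decoder models such as T5 it counts only the generated tokens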

# Or use max_new_tokens
# outputs = model.generate(input_ids,
#                          max_new_tokens=256)  # Set the number of new tokens to generate, excluding input tokens

# max_new_tokens is generally easier to control as it only considers the newly generated content

# Recommended values:
# Short response: 50-100 tokens
# Medium response: 100-256 tokens
# Long response: 256-512 tokens

In [None]:
# rag("I just discovered the course. Can I still join it?")

"<pad> Yes, even if you don't register, you're still eligible to submit the homeworks. Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.</s>"

In [None]:
# rag("What is the limitation about the course?")

'<pad> You can follow the course at your own pace after it finishes.</s>'

In [None]:
# rag("What is the limitation about the course? What are the requirements?")

'<pad> You can start by installing and setting up all the dependencies and requirements: Google cloud account Google Cloud SDK Python 3 (installed with Anaconda) Terraform Git Look over the prerequisites and syllabus to see if you are comfortable with these subjects.</s>'

In [None]:
# rag("What is the limitation about the course? What are the requirements for our professional background?")

'<pad> GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites</s>'

In [None]:
rag("I just discovered the course. Can I still join it?")



"Yes, even if you don't register, you're still eligible to submit the homeworks. Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute."