**Mini Project 1: Building a Question-Answering System with LlamaIndex and HuggingFace**

In [None]:
# 🛠️ Install the required libraries
%pip install llama-index llama-index-llms-huggingface llama-index-embeddings-huggingface llama-index-llms-huggingface-api vllm


Collecting llama-index
  Downloading llama_index-0.12.49-py3-none-any.whl.metadata (12 kB)
Collecting llama-index-llms-huggingface
  Downloading llama_index_llms_huggingface-0.5.0-py3-none-any.whl.metadata (2.8 kB)
Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.5.5-py3-none-any.whl.metadata (458 bytes)
Collecting llama-index-llms-huggingface-api
  Downloading llama_index_llms_huggingface_api-0.5.0-py3-none-any.whl.metadata (1.1 kB)
Collecting vllm
  Downloading vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl.metadata (15 kB)
Collecting llama-index-agent-openai<0.5,>=0.4.0 (from llama-index)
  Downloading llama_index_agent_openai-0.4.12-py3-none-any.whl.metadata (439 bytes)
Collecting llama-index-cli<0.5,>=0.4.2 (from llama-index)
  Downloading llama_index_cli-0.4.4-py3-none-any.whl.metadata (1.4 kB)
Collecting llama-index-core<0.13,>=0.12.49 (from llama-index)
  Downloading llama_index_core-0.12.49-py3-none-any.whl.metadata (2.5 kB)
Collecti

In [2]:
!pip install --upgrade llama-index




In [4]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding


In [1]:
# Import pandas for data manipulation
import pandas as pd

# Mount Google Drive so the PDFs stored there can be read from Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Document 1: A Cookbook of Self-Supervised Learning**

In [1]:
# 📦 Import the required modules
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# 📁 Load the PDF stored in Google Drive (Colab path)
documents_1 = SimpleDirectoryReader(
    input_files=["/content/drive/MyDrive/Colab Notebooks/W6/Day_5/A Cookbook of Self-Supervised Learning.pdf"]
).load_data()


# 🤖 Initialize the TinyLlama model used to generate text answers
llm = HuggingFaceLLM(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tokenizer_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    context_window=2048,  # TinyLlama-1.1B-Chat-v1.0 supports a 2048-token context, not 4096
    max_new_tokens=512,
    device_map="auto",
)

# 🔍 Initialize the HuggingFace embedding model used to vectorize the text
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# ⚙️ Register both models in the global Settings
Settings.llm = llm
Settings.embed_model = embed_model

# 🧠 Build the index from the loaded PDF
index = VectorStoreIndex.from_documents(documents_1)

# 💾 Persist the index to disk so it can be reused later
index.storage_context.persist(persist_dir="./index_storage")

# 🧑‍💻 Query the index in natural language
# Query (French): "What is the table of contents of the document?"
query_engine = index.as_query_engine()
response = query_engine.query("Quel est le sommaire du document ?")
print(response)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Le document contient des informations sur les familles et les origines de SSL, ainsi que sur les techniques et les récits d'experience. Il fournit également des conseils pour la mise en oeuvre des techniques dans le style d'une cuisine.
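The cell above embeds the PDF chunks, stores the vectors in an index, and retrieves the chunks closest to the query before the LLM writes an answer. The retrieval step can be sketched roughly as follows. This is a toy illustration, not LlamaIndex's implementation: `embed`, `cosine`, and `retrieve` are hypothetical helpers, and the word-count "embedding" is only a stand-in for a real sentence encoder such as `all-MiniLM-L6-v2`.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real sentence encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the top_k chunks ranked by similarity to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Illustrative chunks, not taken from the indexed PDF.
chunks = [
    "Self-supervised learning trains models on pretext tasks without labels.",
    "Hyperparameter choices strongly affect training stability.",
    "The cookbook shares practical tips from researchers.",
]
results = retrieve("What tasks does self-supervised learning use?", chunks, top_k=1)
print(results[0][1])
```

Only the retrieved chunks are passed to the LLM, so a question whose wording shares little with the source text (for example, a French query against an English PDF) may pull in weakly related chunks.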


In [13]:
# Query (French): "What are the advantages of self-supervised learning?"
response = query_engine.query("Quels sont les avantages du self-supervised learning ?")
print(response)


1.1 Why a Cookbook for Self-Supervised Learning?
While many components of SSL are familiar to researchers, successfully training a
SSL method involves a dizzying set of choices from the pretext tasks to training hyper-
parameters. SSL research has a high barrier to entry due to (i) its computational cost, (ii)
the absence of fully transparent papers detailing the intricate implementations required
to train a self-supervised learning model. However, a cookbook for self-supervised learning
can provide a step-by-step guide to the process, making it easier for researchers to train
and evaluate self-supervised learning models.


In [15]:
# Query (French): "Which concepts are covered in the first two pages of the document?"
response = query_engine.query("Quels concepts sont abordés dans les deux premières pages du document ?")
print(response)


2 et 3
We have provided an existing answer: 2 et 3
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
along with theoretical threads to connect their objectives
in a unified perspective. We highlight key concepts such as loss terms or training objectives
in concept boxes. Next, a cook must learn to skillfully apply the techniques to form
a delicious dish. This requires learning existing recipes, assembling ingredients, and
evaluating the dish. In Section 3 we introduce the practical considerations to implementing
SSL methods successfully. We discuss common training recipes including hyperparameter
choices, how to assemble components such as architectures and optimizers, as well as
how to evaluate SSL methods. We also share practical tips from leading researchers on
common training configurations and pitfalls. We hope this cookbook serves as a practical
foundation for successfully training and exploring self-supervised learn
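The output above includes phrases such as "We have provided an existing answer", which look like a refine-style prompt being echoed back by the small model instead of being followed. A minimal sketch of such a refine loop is shown below, under stated assumptions: `refine_answer` and `fake_llm` are hypothetical names, the prompt wording is invented for illustration, and this is not LlamaIndex's actual code.

```python
from typing import Callable


def refine_answer(question: str, chunks: list[str], llm: Callable[[str], str]) -> str:
    """Answer from the first chunk, then ask the model to refine with each later chunk."""
    if not chunks:
        return ""
    answer = llm(f"Context:\n{chunks[0]}\n\nQuestion: {question}\nAnswer:")
    for chunk in chunks[1:]:
        prompt = (
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context:\n{chunk}\n"
            "Refine the existing answer only if the new context helps."
        )
        answer = llm(prompt)
    return answer


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model: records the prompt and returns a canned reply."""
    calls.append(prompt)
    return f"answer v{len(calls)}"


final = refine_answer("What is covered?", ["chunk A", "chunk B", "chunk C"], fake_llm)
print(final)
```

Each refine step feeds the previous answer back into the prompt, so a model that copies its prompt rather than answering can leak the template text into the final response, as seen here.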

**Document 2: A General Language Assistant as a Laboratory for Alignment**

In [7]:
# 📦 Import the required modules
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# 📁 Load the PDF stored in Google Drive (Colab path)
documents_2 = SimpleDirectoryReader(
    input_files=["/content/drive/MyDrive/Colab Notebooks/W6/Day_5/A General Language Assistant as a Laboratory for Alignment.pdf"]
).load_data()

# 🤖 Initialize the TinyLlama model used to generate text answers
llm = HuggingFaceLLM(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tokenizer_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    context_window=2048,  # TinyLlama-1.1B-Chat-v1.0 supports a 2048-token context, not 4096
    max_new_tokens=512,
    device_map="auto",
)

# 🔍 Initialize the HuggingFace embedding model used to vectorize the text
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# ⚙️ Register both models in the global Settings
Settings.llm = llm
Settings.embed_model = embed_model

# 🧠 Build the index from the loaded PDF
index = VectorStoreIndex.from_documents(documents_2)

# 💾 Persist the index to disk so it can be reused later
# (separate directory so the Document 1 index in ./index_storage is not overwritten)
index.storage_context.persist(persist_dir="./index_storage_doc2")

# 🧑‍💻 Query the index in natural language
# Query (French): "How does the author define the concept of alignment in the context of language assistants?"
query_engine = index.as_query_engine()
response = query_engine.query("Comment l’auteur définit-il le concept d’alignement dans le contexte des assistants linguistiques ?")
print(response)




Le concept d’alignement est une conception de la cohérence entre les désirs des agents et les actions prises par les assistants linguistiques. Il s’agit d’une conception de la cohérence entre les désirs des agents et les actions prises par les assistants linguistiques, qui est associée à la notion d’alignement. Le concept d’alignement est un concept de cohérence entre les désirs des agents et les actions prises par les assistants linguistiques. Il s’agit d’une conception de la cohérence entre les désirs des agents et les actions prises par les assistants linguistiques, qui est associée à la notion d’alignement. Le concept d’alignement est une conception de la cohérence entre les désirs des agents et les actions prises par les assistants linguistiques. Il s’agit d’une conception de la cohérence entre les désirs des agents et les actions prises par les assistants linguistiques, qui est associée à la notion d’alignement. Le concept d’alignement est une conception de la cohérence entre le
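The answer above loops over the same two sentences until it hits the `max_new_tokens` limit, a failure mode that small models often show. A quick way to flag such output before reading it in full is to count repeated sentences. The helper below is a hypothetical sketch (`repeated_sentences` is not part of LlamaIndex) and uses a simple punctuation-based sentence split.

```python
import re
from collections import Counter


def repeated_sentences(text: str, min_count: int = 2) -> dict[str, int]:
    """Return the sentences that occur at least min_count times in text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter(sentences)
    return {s: n for s, n in counts.items() if n >= min_count}


# Illustrative text that mimics a looping generation.
sample = "Alignment is coherence. It matters. Alignment is coherence. Alignment is coherence."
loops = repeated_sentences(sample)
print(loops)
```

If loops are detected, one option worth trying is a repetition penalty at generation time, for example `generate_kwargs={"repetition_penalty": 1.2}` when constructing `HuggingFaceLLM`. The value 1.2 is only an illustrative starting point, and this assumes the installed version forwards `generate_kwargs` to the underlying `generate` call.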

In [4]:
# 🧑‍💻 Query the index in natural language
# Query (French): "How is the language assistant used to test model alignment?"
response = query_engine.query("Comment le langage assistant est-il utilisé pour tester l’alignement des modèles ?")
print(response)

3

page_label: 4
file_path: /content/drive/MyDrive/Colab Notebooks/W6/Day_5/A General Language Assistant as a Laboratory for Alignment.pdf

Figure 2 We show the format of interactions with AI models for A/B testing and human feedback collection.
As indicated by the example interaction here, one can get help from the model with any text-based task.
1 Introduction
1.1 Motivations
Contemporary AI models can be difficult to understand, predict, and control. These problems can lead
to significant harms when AI systems are deployed, and might produce truly devastating results if future
systems are even more powerful and more widely used, and interact with each other and the world in presently
unforeseeable ways.
This paper shares some nascent work towards one of our primary, ongoing goals, which is to align general-purpose
AI systems with human preferences and values. A great deal of ink has been spilled trying to define
what it means for AI systems to be aligned, and to guess at how this migh

In [9]:
# 🧑‍💻 Query the index in natural language
# Query (French): "What is the main idea of the document?"
response = query_engine.query("Quelle est l'idée principale du document ?")
print(response)


The main idea of the document is to provide guidance on how to prepare for a work meeting, including tips for selecting the right topics, creating a clear agenda, and ensuring that everyone is on the same page. The document also includes examples of effective communication strategies and best practices for managing distractions during meetings. The main idea of the document is to provide guidance on how to prepare for a work meeting, including tips for selecting the right topics, creating a clear agenda, and ensuring that everyone is on the same page. The document also includes examples of effective communication strategies and best practices for managing distractions during meetings.
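The answer above talks about preparing work meetings, which has nothing to do with the alignment paper, so the model is hallucinating rather than using the retrieved context. A crude sanity check is to measure how many of the answer's content words also appear in the retrieved chunks. The sketch below uses hypothetical helpers (`content_words`, `grounding_score`) and hand-written example strings; with LlamaIndex the retrieved chunks would typically come from `response.source_nodes`.

```python
import re


def content_words(text: str) -> set[str]:
    """Lowercased words longer than three letters, as a crude content-word filter."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def grounding_score(answer: str, contexts: list[str]) -> float:
    """Fraction of the answer's content words that also appear in the contexts."""
    answer_words = content_words(answer)
    if not answer_words:
        return 0.0
    context_words = set().union(*(content_words(c) for c in contexts))
    return len(answer_words & context_words) / len(answer_words)


# Illustrative context and answers, written by hand for this example.
contexts = ["This paper shares work towards aligning general-purpose AI systems with human values."]
grounded = grounding_score("The paper aims at aligning systems with human values.", contexts)
ungrounded = grounding_score("Tips for preparing a work meeting agenda.", contexts)
print(grounded > ungrounded)
```

A low score does not prove an answer is wrong, and a high one does not prove it is right, but a very low overlap is a cheap signal that the response deserves a manual check against the source.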
