<a href="https://colab.research.google.com/github/Giraud-Pierre/DeepLearning_FineTuneLLama2Project/blob/main/src/PipelineWithRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**How to use this notebook :**

This notebook intends to use a fine-tuned LLM version of LLama2 to answer questions students of UQAC may have about their school programs and classes.

- Run the load data to load Data from the github (alternatively, you can put whatever files, just create a folder called "data" in the colab session and add the files there (accepted extensions: txt, csv, doc, docx, xml, pdf, epub, hwp, ipynb, ppt))
- Run the setup which will install and import all required libraries and setup the LLM and the RAG model (can take a couple dozens of minutes depending on the size of the data files)
- Do not run the "Test the LLM" part because it can take a few minutes and it is unnecessary (unless you want to see how it works)
- Run the Pipeline Setup
- You can then go to the Pipeline section and change the user_query to whatever you want and run the pipeline function

#**Load Data**

In [1]:
!mkdir data
!cd data
!git init
!git remote add origin -f https://github.com/Giraud-Pierre/DeepLearning_FineTuneLLama2Project.git
!git sparse-checkout init --cone
!git sparse-checkout set data/RAG
!git pull origin main
!cd ../
!mv data/RAG/CyclesSupérieurs.csv data/CyclesSupérieurs.csv
!mv data/RAG/PremierCycles.csv data/PremierCycles.csv
!rmdir data/RAG
!rm data/README.md
!rm README.md

[33mhint: Using 'master' as the name for the initial branch. This default branch name[m
[33mhint: is subject to change. To configure the initial branch name to use in all[m
[33mhint: [m
[33mhint: 	git config --global init.defaultBranch <name>[m
[33mhint: [m
[33mhint: Names commonly chosen instead of 'master' are 'main', 'trunk' and[m
[33mhint: 'development'. The just-created branch can be renamed via this command:[m
[33mhint: [m
[33mhint: 	git branch -m <name>[m
Initialized empty Git repository in /content/.git/
Updating origin
remote: Enumerating objects: 62, done.[K
remote: Counting objects: 100% (62/62), done.[K
remote: Compressing objects: 100% (57/57), done.[K
remote: Total 62 (delta 13), reused 0 (delta 0), pack-reused 0[K
Unpacking objects: 100% (62/62), 5.81 MiB | 1.96 MiB/s, done.
From https://github.com/Giraud-Pierre/DeepLearning_FineTuneLLama2Project
 * [new branch]      main       -> origin/main
From https://github.com/Giraud-Pierre/DeepLearning_FineTu

#**Setup**

##Install All the Required Packages

In [2]:
# Installs to run the LLM
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7


[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.2/244.2 kB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m72.9/72.9 kB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.5/92.5 MB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.4/7.4 MB[0m [31m26.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.4/77.4 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m26.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m22.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━

In [3]:
#Installs to run the RAG model
!pip install -q llama-index llama-index-embeddings-huggingface auto-gptq optimum

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m23.5/23.5 MB[0m [31m25.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m417.0/417.0 kB[0m [31m10.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m15.4/15.4 MB[0m [31m27.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m49.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m171.5/171.5 kB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m297.6/297.6 kB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.2/13.2 MB[0m [31m30.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.1/199.1 kB[0m [31m15.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━

##Import All the Required Libraries

In [4]:
#Import to run the LLM
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel, PeftConfig
import torch


from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

  warn("The installed version of bitsandbytes was compiled without GPU support. "


/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32


##Import, setup and test the RAG model

In [5]:
# import any embedding model on HF hub (https://huggingface.co/spaces/mteb/leaderboard)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/94.8k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [6]:
Settings.llm = None
Settings.chunk_size = 256 # fragment size (in chars) for the RAG
Settings.chunk_overlap = 25 # overlap between 2 adjacent fragment 4
                            # (so that no context is missing or not
                            # comprehensible because of the fragmentation)

LLM is explicitly disabled. Using MockLLM.


In [7]:
# import document in "data" directory
documents = SimpleDirectoryReader("data").load_data()

print(len(documents)) #number of documents

2


In [None]:
# Vector storing the documents (can take a few minutes depending
# on the size of the documents)
index = VectorStoreIndex.from_documents(documents)

top_k = 3 # number of fragments to get at each query
          # recommended: 3 to 5

# RAG configuration
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k,
)

# RAG query_engine configuration
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)

In [None]:
# RAG testing
query = "Quelles sont les cours disponibles pour la maîtrise en informatique ?"
response = query_engine.query(query)

# Creating a context-like string that will receive the information from the RAG
# about the query
context = "Context:\n"
for i in range(top_k):
    # Show all the fragments selected by the RAG for being most likely
    # connected to the query
    print(i)
    context = context + response.source_nodes[i].text + "\n\n"

print(context)

##Import and setup the LLM

In [None]:
# bitsandbytes parameters
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

In [None]:
# load fine-tuned model from huggingFaces
model_name = "pirroflamme/Llama2_Finetuned_DeepLearning"
model = AutoModelForCausalLM.from_pretrained(model_name,
                  quantization_config=bnb_config,
                  device_map="auto"
)

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

#**Test the LLM with and without RAG**

In [28]:
# Test the LLM without RAG

prompt = """[INST] Quelle sont tous les cours disponible pour le programme maîtrise de l'informatique jeux vidéo ? [/INST]""" # = User query

model.eval()

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])

<s> [INST] Quelle sont tous les cours disponible pour le programme maîtrise de l'informatique jeux vidéo ? [/INST] Les cours disponibles pour le programme 'maîtrise de l'informatique jeux vidéo' incluent: (3.0 cr.).</s>

* 1INF101 Understanding the Basics of Computing and Information Technology (3.0 cr.)
* 1INF102 Introduction to Programming (3.0 cr.)
* 1INF103 Introduction to Data Structures and Algorithms (3.0 cr.)
* 1INF104 Introduction to Software Engineering (3.0 cr.)
* 1INF105 Introduction to Computer Networks (3.0 cr.)
* 1INF106 Introduction to Database Systems (3.0 cr.)
* 1INF107 Introduction to Operating Systems (3.0 cr.)
* 1INF108 Introduction to Web Development (3.0 cr.)
* 1INF109 Introduction to Cybersecurity (3.0 cr.)
* 1INF110 Introduction to Artificial Intelligence (3.0 cr.)
* 1INF111 Introduction to Machine Learning (3.0 cr.)
* 1INF112 Introduction to Data Science (3.0 cr.)
* 1INF113 Introduction to the


In [23]:
# prompt template (based on user query)
prompt_template_w_context = lambda context, prompt_engineering, query: f"""
[INST]
{context}

{prompt_engineering}

{query}
[/INST]
"""

In [24]:
prompt_engineering = 'Répond à la question suivante en utilisant le contexte si celui-ci est utile:'

In [29]:
#Assembling the prompt using the user query, the context from the RAG from the RAG testing section and the prompt engineering
query = "Quelle sont tous les cours disponible pour le programme maîtrise de l'informatique jeux vidéo ?"
prompt = prompt_template_w_context(context,prompt_engineering, query)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])

<s> 
[INST]
Context:
Recherche opérationnelle ((8INF259 et 8MQG210) ou (8INF259 et 8STT117))
Sept à onze cours parmi les suivants (vingt et un crédits)
Mathématique
8GEM108 Géométrie
8GEN444 Statistiques de l'ingénieur
8GMA105 Structures numériques
8MAP109 Calcul numérique et symbolique
8MAP116 Résolution de problèmes
8MAT309 Histoire de la mathématique
8MAT700 Sujet spécial en mathématique
8MAT702 Sujet spécial II en mathématique
8THE105 Ensembles,

Calcul avancé II (8MAP107)
8MAP120 Équations différentielles et séries de Fourier (8MAP107)
8MAT100 Analyse réelle I
8MAT206 Théorie des nombres
8MAT309 Histoire de la mathématique
8MAT432 L'art de la preuve en mathématique
8MAT513 Analyse numérique (8PRO107)
8MAT700 Sujet spécial en mathématique
8MAT702 Sujet spécial II en mathématique
8ROP602 Projet
COURS D'ENRICHISSEMENT
Un cours d'enrichissement (trois crédits)

(12/2023), 8ALG135
Algèbre linéaire
(3.0 cr.)
Introduire les concepts et les résultats de base de l'algèbre linéaire et ainsi

#**Pipeline setup**

##Prompt template

In [27]:
# prompt template (based on user query)
prompt_template_w_context = lambda context, prompt_engineering, user_query: f"""
[INST]
{context}

{prompt_engineering}

{user_query}
[/INST]
"""

##Context from RAG

In [28]:
def get_context(user_query, top_k):
  response = query_engine.query(query)
  context = "Context: \n "
  for i in range(top_k):
    # Show all the fragments selected by the RAG for being most likely
    # connected to the query
    context += response.source_nodes[i].text + "\n\n"

  return context

##Prompt_engineering

In [29]:
prompt_engineering = 'Répond à la question suivante en utilisant le contexte si celui-ci est utile:'

#**Pipeline**

In [25]:
def pipeline(user_query, prompt_engineering, top_k = 3):
  prompt = prompt_template_w_context(get_context(user_query, top_k),prompt_engineering, user_query)
  inputs = tokenizer(prompt, return_tensors="pt")
  outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)

  return outputs

## Pipeline Testing

In [30]:
user_query = "Que veux dire 8INF892 ?"

print(pipeline(user_query,prompt_engineering,top_k))

IndexError: list index out of range