In [1]:
import pypdf
import chromadb
import urllib3
import accelerate
import sentence_transformers
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from transformers import AutoModelForCausalLM, AutoTokenizer



In [2]:
%env TOKENIZERS_PARALLELISM=True
%env TF_CPP_MIN_LOG_LEVEL=3

env: TOKENIZERS_PARALLELISM=True
env: TF_CPP_MIN_LOG_LEVEL=3


## Extracting text data from PDF Files

In [4]:
# Create a loader object for the PDF file
loader = PyPDFLoader('../data/article.pdf')

type(loader)

langchain_community.document_loaders.pdf.PyPDFLoader

In [5]:
# Load the PDF file (one Document per page)
pages = loader.load()

pages

 Document(metadata={'source': '../data/article.pdf', 'page': 1, 'page_label': '2'}, page_content='without a major commitment to upskilling and retraining existing workers, RAND Europe \nfound that it will be a loss for employees and a loss for employers. \nThere are no simple solutions here. Companies need to become more agile in reallocating and \nredeploying their existing workforce to better meet their needs instead of trying to hire their \nway out of the skills gap. They also need to do more to help these employees acquire \ntechnical skills, such as programming and data analysis, as well as interpersonal skills, such \nas teamwork, that are essential for success. National governments can help by investing in \nvocational programs and other support for displaced workers. \nAn important step would be to develop a common "skills language," researchers wrote. This \nwould ensure that candidates and employers have the same understanding when using terms \nlike "Cloud Engineer" or "AI 

In [7]:
page = pages[1]

print("Page content: ", page.page_content[0:500])

Page content:  without a major commitment to upskilling and retraining existing workers, RAND Europe 
found that it will be a loss for employees and a loss for employers. 
There are no simple solutions here. Companies need to become more agile in reallocating and 
redeploying their existing workforce to better meet their needs instead of trying to hire their 
way out of the skills gap. They also need to do more to help these employees acquire 
technical skills, such as programming and data analysis, as well as


In [8]:
print("Metadata:", page.metadata)

Metadata: {'source': '../data/article.pdf', 'page': 1, 'page_label': '2'}


## Splitting Text Data in Chunks

**chunk_size = 1000**: Each resulting chunk contains at most 1000 characters.

**chunk_overlap = 20**: Consecutive chunks may share up to 20 characters: the end of one chunk is repeated at the beginning of the next, so text cut at a chunk boundary still appears with some of its surrounding context.

Why split the text:

Chunking is useful whenever long texts need to be processed or analyzed, for example:

- Input for language models: Most LLMs can only process a limited number of tokens per request. Splitting the text into smaller chunks keeps each input within that limit.

- Data analysis and indexing: Search engines and data processing pipelines commonly split text into smaller chunks because small units are easier to index and retrieve.

- Context maintenance: The overlap between consecutive chunks keeps text that straddles a chunk boundary visible in both neighbors, so long documents can be processed piece by piece with less loss of coherence.
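To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally (that class also tries to split on separators such as paragraph and line breaks, so its chunks are often shorter than the maximum). The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: every chunk starts
    (chunk_size - chunk_overlap) characters after the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 250          # 2500 characters of toy text
chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=20)

print(len(chunks))                   # number of chunks
print([len(c) for c in chunks])      # length of each chunk
print(chunks[0][-20:] == chunks[1][:20])  # the overlap is shared
```

The last 20 characters of each chunk reappear at the start of the next one, which is the behavior the `chunk_overlap` parameter describes.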

In [9]:
# Create the chunk text separator
splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 20)

In [10]:
# Apply the splitter to the pages and extract the chunks (documents)
docs = splitter.split_documents(pages)

print("Total of Chunks (Documents):", len(docs))

print("Last Chunk Content (Document):", docs[-1])

Total of Chunks (Documents): 7
Last Chunk Content (Document): page_content='Those who fail to keep up with this natural evolution will be left behind, as we have seen 
many times throughout human history. Learn as much as you can about different subjects, 
from interpersonal skills to technical skills. The only limit to what you can learn is the one 
you impose on yourself. 
“Be Good at Learning.” Stay in a constant state of learning.' metadata={'source': '../data/article.pdf', 'page': 1, 'page_label': '2'}


---

## Loading Text Data Vectors into the Vector Database

The code below builds a semantic search system: a vector database (`vectordb`) stores an embedding for each chunk, and a question is answered by retrieving the chunks whose embeddings are semantically most similar to the embedding of the question.

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

https://www.trychroma.com/

In [11]:
# Create the vector database
vectordb = Chroma.from_documents(documents = docs,
                                 embedding = HuggingFaceEmbeddings(model_name = "all-MiniLM-L6-v2"),
                                 persist_directory = "vectordb/chroma/")



**Chroma.from_documents(documents=docs)**: Creates a vector database using the provided documents (stored in the docs variable). These documents can be texts, articles, or any type of textual data that you want to index.

**HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")**: Uses the Hugging Face all-MiniLM-L6-v2 semantic embedding model to transform texts into numeric vectors. Embeddings are mathematical representations of texts that capture their semantic meaning.

**persist_directory="vectordb/chroma/"**: Specifies the directory where the vector database will be saved (persisted), so that it can be reused in future sessions without having to reprocess the documents.
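As a rough illustration of what "capturing semantic meaning" buys you, the sketch below compares toy vectors with cosine similarity, one common way to measure how close two embeddings are. The three-dimensional vectors are made up for the example (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and the distance function a Chroma collection actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not produced by any real model)
query_vec = [0.9, 0.1, 0.0]
same_topic_vec = [0.8, 0.2, 0.1]
other_topic_vec = [0.0, 0.1, 0.9]

sim_same = cosine_similarity(query_vec, same_topic_vec)
sim_other = cosine_similarity(query_vec, other_topic_vec)

print("same topic: ", round(sim_same, 3))
print("other topic:", round(sim_other, 3))
```

A vector search returns the stored chunks whose vectors score best against the question's vector under the chosen measure.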

In [12]:
# Total number of documents (chunks) stored in the collection
vectordb._collection.count()

7

## Testing Vector Search Parameters

In [13]:
# Define a question
question = "Has the COVID-19 pandemic accelerated the pace of digital development around the world?"

In [14]:
# Perform the vector search
relevant_points = vectordb.max_marginal_relevance_search(question, k = 2, fetch_k = 3)
print(relevant_points)



**max_marginal_relevance_search()**: Searches the vector database using maximal marginal relevance (MMR). MMR selects documents that are relevant to the question while penalizing documents that are too similar to ones already selected, so the results are both relevant and diverse instead of being near-duplicates of each other. Read the PDF manual in Chapter 16 for more details.

Parameters:

**question**: The natural-language question whose semantic similarity to the documents in the vector database is computed.

**k=2**: Defines the number of final documents that will be returned as the most relevant.

**fetch_k=3**: Specifies that the algorithm should initially search for the 3 most relevant documents and then apply the MMR technique to select the 2 most diverse and relevant.
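The two-stage selection described by `k` and `fetch_k` can be sketched in plain Python as follows. This is an illustrative implementation of the general MMR idea, not the library's code: the function names, the toy vectors, and the `lambda_mult` trade-off weight (1.0 means relevance only, lower values favor diversity) are assumptions made for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    # Stage 1: keep only the fetch_k documents most similar to the query
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    candidates = ranked[:fetch_k]

    # Stage 2: greedily pick documents that are relevant but not redundant
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors: documents 0 and 1 are near-duplicates, document 2 differs
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.3], [1.0, 0.31], [1.0, -0.4], [0.0, 1.0]]

diverse = mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5)
relevance_only = mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=1.0)
print(diverse, relevance_only)
```

With the balanced weight, the second pick skips the near-duplicate in favor of the more distinct document; with relevance only, the two most similar documents are returned.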

In [15]:
print(relevant_points[0])

page_content='The Most Important Skill in the Age of Artificial Intelligence 
The COVID-19 pandemic accelerated the pace of digital development worldwide, as 
everything—from meetings to medical consultations—moved online. This might sound 
overwhelmingly positive. 
For tens of millions of workers, it was not. 
They may not have the necessary skills to compete in this new world. These are accountants, 
typists, and executive secretaries searching for jobs in a new economy where hired candidates 
have titles like "Cloud Engineer" or "Growth Hacker" on their résumés. Without a concerted 
effort to retrain them, researchers at RAND Europe have found that they are likely to be left 
behind. 
And not just them. The cost of this growing skills gap will be measured in trillions of dollars 
and will hit hardest in places that lack reliable digital infrastructure, such as internet access or 
widespread digital literacy. As the global economy struggles to recover from the impact of' metadata={'p

## Defining the LLM

https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct

In [16]:
# Set the model name as it appears on the Hugging Face Hub
llm_model_name = "Qwen/Qwen2.5-1.5B-Instruct"

In [17]:
# Load the model
model = AutoModelForCausalLM.from_pretrained(llm_model_name, 
                                             torch_dtype = "auto", 
                                             device_map = "auto")

In [18]:
# Load the tokenizer from the model
tokenizer = AutoTokenizer.from_pretrained(llm_model_name)

---

## Setting the Context

In [19]:
# Defining the question
question = "Has the COVID-19 pandemic accelerated the pace of digital development around the world?"

# Extract the context of the question (i.e. perform vector search)
context = vectordb.max_marginal_relevance_search(question, k = 2, fetch_k = 3)

## Setting the Prompt

In [20]:
# Create the prompt
prompt = f"""
You are an expert assistant. You use the context provided as your complementary knowledge base to answer the question.
context = {context}
question = {question}
answer =
"""
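Note that interpolating `context` directly places the Python representation of the retrieved `Document` objects, metadata included, into the prompt. One possible alternative, sketched here under assumptions, joins only the `page_content` of each result. The `FakeDocument` class is a hypothetical stand-in so the example runs without the vector database, and `build_prompt` is an illustrative helper, not part of the notebook; answers produced with such a prompt may differ from the ones shown here.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the same two attributes used above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_prompt(context_docs, question):
    # Keep only the text of each retrieved chunk, separated by blank lines
    context_text = "\n\n".join(doc.page_content.strip() for doc in context_docs)
    return (
        "You are an expert assistant. You use the context provided as your "
        "complementary knowledge base to answer the question.\n"
        f"context = {context_text}\n"
        f"question = {question}\n"
        "answer =\n"
    )


demo_docs = [
    FakeDocument("First passage.", {"page": 0}),
    FakeDocument("Second passage.", {"page": 1}),
]
demo_prompt = build_prompt(demo_docs, "What do the passages say?")
print(demo_prompt)
```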

In [21]:
# Create the list of system and user messages
messages = [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are an expert assistant."},
{"role": "user", "content": prompt}
]

## Prompt Tokenization

In [22]:
# Apply the chat template
text = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)

text
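For intuition about what the chat template produces, the sketch below renders a message list in the ChatML-style layout that Qwen2.5 chat models are assumed to use here. The function `render_chatml` is hypothetical; the real template ships with the tokenizer and may differ in details, so treat this as an approximation of the string and not as the library's implementation.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate a ChatML-style rendering of a list of chat messages."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # An open assistant turn tells the model to start its answer here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are an expert assistant."},
    {"role": "user", "content": "What is a vector database?"},
]
rendered = render_chatml(demo_messages)
print(rendered)
```

Setting `add_generation_prompt = True` is what appends the open assistant turn, so generation continues as the assistant's reply.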



In [23]:
# Apply tokenization
model_inputs = tokenizer([text], return_tensors = "pt").to(model.device)

print(model_inputs)

{'input_ids': tensor([[151644,   8948,    198,   2610,    525,   1207,  16948,     11,   3465,
            553,  54364,  14817,     13,   1446,    525,    458,   6203,  17847,
             13, 151645,    198, 151644,    872,    271,   2610,    525,    458,
           6203,  17847,     13,   1446,    990,    279,   2266,   3897,    438,
            697,  57435,   6540,   2331,    311,   4226,    279,   3405,    624,
           2147,    284,    508,   7524,  54436,  12854,   2893,   1210,    220,
             15,     11,    364,   2893,   6106,   1210,    364,     16,    516,
            364,   2427,   1210,   4927,    691,  38181,  15995,  24731,   2150,
           7495,   1131,    785,   7496,  43821,  27482,    304,    279,  13081,
            315,  58194,  21392,   1124,  88230,  19966,     12,     16,     24,
          27422,  48758,    279,  17857,    315,   7377,   4401,  15245,     11,
            438,   1124,    811,   1204,   1596,  87858,  16261,    311,   6457,
          7443

## Generating Answers with the LLM

In [24]:
generated_ids = model.generate(**model_inputs, max_new_tokens = 512)

print(generated_ids)

tensor([[151644,   8948,    198,   2610,    525,   1207,  16948,     11,   3465,
            553,  54364,  14817,     13,   1446,    525,    458,   6203,  17847,
             13, 151645,    198, 151644,    872,    271,   2610,    525,    458,
           6203,  17847,     13,   1446,    990,    279,   2266,   3897,    438,
            697,  57435,   6540,   2331,    311,   4226,    279,   3405,    624,
           2147,    284,    508,   7524,  54436,  12854,   2893,   1210,    220,
             15,     11,    364,   2893,   6106,   1210,    364,     16,    516,
            364,   2427,   1210,   4927,    691,  38181,  15995,  24731,   2150,
           7495,   1131,    785,   7496,  43821,  27482,    304,    279,  13081,
            315,  58194,  21392,   1124,  88230,  19966,     12,     16,     24,
          27422,  48758,    279,  17857,    315,   7377,   4401,  15245,     11,
            438,   1124,    811,   1204,   1596,  87858,  16261,    311,   6457,
          74437,   2293,  94

In [25]:
# Unpack the responses
# Goal: keep only the tokens generated by the model, i.e. the part of the output that comes
# after the input tokens. For decoder-only (autoregressive) models, generate() returns the
# prompt tokens followed by the newly generated tokens, so the prompt must be sliced off.
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]

print(generated_ids)

[tensor([  9454,     11,    279,  19966,     12,     16,     24,  27422,    702,
         12824,  48758,    279,  17857,    315,   7377,   4401,  30450,     13,
          1096,    374,  29476,    504,   3807,   3501,    304,    279,   2661,
          2266,   1447,     16,     13,   3070,  24841,    311,   8105,  95518,
           576,   2197,  33845,    429,   4297,  87858,  16261,    311,   6457,
         74437,   2293,  94818,   2860,   4152,    311,    279,  27422,     13,
          1096,   6407,  14807,    264,   5089,  30803,    304,    279,  24376,
           323,  17590,    315,   7377,  14310,   1119,   5257,  13566,    315,
          2272,    382,     17,     13,   3070,  71503,    389,  35698,  95518,
          1084,   8388,    429,   1657,   7337,    879,   1033,   8597,  11889,
           311,  20259,    304,    264,    501,   8584,    448,  15311,   1075,
           330,  16055,  28383,      1,    476,    330,     38,  19089,  88065,
             1,    389,    862,  65213,
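The slicing step can be checked on plain Python lists. The token IDs below are made up for illustration; only the slicing logic mirrors the cell above.

```python
# Hypothetical token IDs: the output repeats the prompt, then adds new tokens
prompt_ids = [[101, 7, 8, 9]]
full_output_ids = [[101, 7, 8, 9, 42, 43, 102]]

# Drop the first len(prompt) tokens of each output sequence
new_token_ids = [
    output[len(prompt):]
    for prompt, output in zip(prompt_ids, full_output_ids)
]

print(new_token_ids)  # [[42, 43, 102]]
```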

In [26]:
# Apply the decode to get the generated text
response = tokenizer.batch_decode(generated_ids, skip_special_tokens = True)[0]

print(response)

Yes, the COVID-19 pandemic has indeed accelerated the pace of digital development globally. This is evident from several points in the given context:

1. **Shift to Online**: The document mentions that everything—from meetings to medical consultations—moved online due to the pandemic. This shift indicates a significant acceleration in the adoption and integration of digital technologies into various aspects of life.

2. **Impact on Workers**: It notes that many workers who were previously unable to compete in a new economy with titles like "Cloud Engineer" or "Growth Hacker" on their resumes were left behind. This suggests that the traditional workforce is being disrupted by the rise of new job requirements tied to digital skills.

3. **Skills Gap**: The context highlights that there is a growing skills gap between what businesses are looking for in employees (often related to digital skills) and what the current workforce possesses. This gap is exacerbated by the pandemic-induced chan

## Question and Answer System using Our Database

In [27]:
# Define the question
question = "How many jobs does the World Economic Forum estimate will be lost to automation in the coming years?"

# Extract the context from the question (i.e. perform vector search)
context = vectordb.max_marginal_relevance_search(question, k = 2, fetch_k = 3)

# Create the prompt
prompt = f"""
You are an expert assistant. You use the provided context as your supplemental knowledge base to answer the question.
context = {context}
question = {question}
answer =
"""

# Create the list of system and user messages
messages = [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are an expert assistant."},
{"role": "user", "content": prompt}
]

# Apply the chat template
text = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)

# Apply tokenization
model_inputs = tokenizer([text], return_tensors = "pt").to(model.device)

# Generate response with LLM
generated_ids = model.generate(**model_inputs, max_new_tokens = 512)

# Unpack the answers
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]

# Apply the decode to obtain the generated text
response = tokenizer.batch_decode(generated_ids, skip_special_tokens = True)[0]

print(response)

According to the information provided in the context, the World Economic Forum estimates that 85 million jobs could be lost to automation in the next three years across more than a dozen industries.
