**Table of contents**<a id='toc0_'></a>    
  - [Importing libraries](#toc1_1_1_)    
  - [The Gemini API](#toc1_1_2_)    
    - [The API Key](#toc1_1_2_1_)    
    - [Available embedding models](#toc1_1_2_2_)    
  - [Topic specific dataset](#toc1_1_3_)    
    - [JSON format](#toc1_1_3_1_)    
    - [PDF format](#toc1_1_3_2_)    
  - [The embedding database](#toc1_1_4_)    
    - [Changes to the new embedding models](#toc1_1_4_1_)    
    - [Vector database](#toc1_1_4_2_)    
  - [Getting the relevant documents](#toc1_1_5_)    
  - [Prompting the Gemini model](#toc1_1_6_)    
  - [Generating the response](#toc1_1_7_)    
    - [Getting the model](#toc1_1_7_1_)    
    - [Prompting the model](#toc1_1_7_2_)    
  - [The pipeline](#toc1_1_8_)    

### <a id='toc1_1_1_'></a>[Importing libraries](#toc0_)

In [11]:
!pip install langchain

In [1]:
!pip install google-generativeai==0.3.2
!pip install chromadb
!pip install pandas
!pip install PyPDF2
!pip install python-dotenv

In [2]:
import os
# from dotenv import load_dotenv
from pprint import pprint

import pandas as pd

import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings

import google.generativeai as genai

from IPython.display import Markdown

In [3]:
genai.__version__

'0.3.2'

### <a id='toc1_1_2_'></a>[The Gemini API](#toc0_)

#### <a id='toc1_1_2_1_'></a>[The API Key](#toc0_)

If you don't have an API Key, create one [here](https://makersuite.google.com/app/apikey).

In [4]:
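# load_dotenv()  # uncomment (together with the import above) to read the key from a .env file

# Read the key from the environment; never hard-code or commit a real API key.
genai.configure(api_key=os.getenv('GEMINI_API_KEY'))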
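
If the key is missing, it helps to fail early with a clear message rather than at the first API call. The following is a minimal sketch, not part of the original notebook: `load_api_key` is a hypothetical helper, and the variable name `GEMINI_API_KEY` is taken from the cell above.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY", environ=None):
    """Return the API key stored in an environment variable.

    `environ` defaults to os.environ; passing a plain dict makes the
    helper easy to test. This helper is hypothetical (an assumption).
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it or put it in a .env file."
        )
    return key
```

It could then be used as `genai.configure(api_key=load_api_key())`.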

#### <a id='toc1_1_2_2_'></a>[Available embedding models](#toc0_)

In [5]:
for m in genai.list_models():
    if 'embedContent' in m.supported_generation_methods:
        print(m.name)

models/embedding-001
models/text-embedding-004


### <a id='toc1_1_3_'></a>[Topic specific dataset](#toc0_)

#### <a id='toc1_1_3_1_'></a>[JSON format](#toc0_)

In [6]:
import json

with open('../data/data.json') as f:
    data = json.load(f)

FileNotFoundError: [Errno 2] No such file or directory: '../data/data.json'

In this example, each record in the JSON file has the following format:

```python
{
    "instruction": "...",
    "input": "...",
    "output": "..."
}
```

In [15]:
pprint(data[0])

{'input': 'Definition de l’Environnement',
 'instruction': '',
 'output': 'Selon la norme ISO 14001, l’environnement est un milieu dans '
           'lequel un organisme fonctionne, incluant l’air, l’eau, la terre, '
           'les ressources naturelles, la flore, la faune, les êtres humains '
           'et leurs interrelations.'}


We convert each record into a single string that concatenates the non-empty values of its three keys.

In [16]:
documents = []

for item in data:
    entry = ""
    if item['instruction'] != '':
        entry += f"Instruction : {item['instruction']}\n"

    if item['input'] != '':
        entry += f"Input : {item['input']}\n"

    if item['output'] != '':
        entry += f"Output : {item['output']}"

    documents.append(entry)

len(documents)

398

In [20]:
pprint(documents[0])

('Input : Definition de l’Environnement\n'
 'Output : Selon la norme ISO 14001, l’environnement est un milieu dans lequel '
 'un organisme fonctionne, incluant l’air, l’eau, la terre, les ressources '
 'naturelles, la flore, la faune, les êtres humains et leurs interrelations.')


#### <a id='toc1_1_3_2_'></a>[PDF format](#toc0_)

In [7]:
from PyPDF2 import PdfReader

Let's start by reading the PDF file and extracting its text.

In [35]:
def extract_text_from_pdf(file_path):
    pdf_reader = PdfReader(file_path)
    num_pages = len(pdf_reader.pages)

    page_offset = 4
    text = ""

    for page in range(page_offset, num_pages):
        text += pdf_reader.pages[page].extract_text()

    return text


text = extract_text_from_pdf(r"C:\Users\Mohammed Aftab\Downloads\asset-management.pdf")
print(text)

Fundamentals of Asset Management 5 By the end of this workshop you should be able to 
address these five questions  
What is  
AM?  Why do 
AM?  What  
“deliverables”  
do I get?  How to  
 do it?  How do I 
move  
forward?  The Fundamentals of Asset Management 
Executive Overview 
 
A Hands-On Approach 
 
 Fundamentals of Asset Management 7 Emerging utility business conditions 
Increasing demand for utility services 
Diminishing resources 
Leveling of production efficiencies 
Increasing restrictions on output 
Aging infrastructure 
Result: increasingly expensive treatment options Fundamentals of Asset Management 8 Emerging utility business conditions 
Aging customer base 
Diminishing technical labor pool 
Larger and more sophisticated facilities 
Loss of knowledge with personnel retirements 
Public resistance to rate increases 
Result: increasingly complex management environment Fundamentals of Asset Management 9 Changing utility business environment 
Demand to do more with

The previous output shows that we have some work to do: the text contains many characters that should be removed. The following function removes some of these unwanted characters, but for your own documents you may want to spend more time cleaning the text.

In [36]:
def clean_extracted_text(text):
    cleaned_text = ""

    for i, line in enumerate(text.split('\n')):
        if len(line) > 10 and i > 70:
            cleaned_text += line + '\n'

    cleaned_text = cleaned_text.replace('.', '')
    cleaned_text = cleaned_text.replace('~', '')
    cleaned_text = cleaned_text.replace('©', '')
    cleaned_text = cleaned_text.replace('_', '')
    cleaned_text = cleaned_text.replace(';:;', '')
    return cleaned_text

In [37]:
cleaned_text = clean_extracted_text(text)
len(cleaned_text)

11076
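
As an illustration of what further cleaning might look like, the sketch below targets two artifacts visible in the extracted text: the repeated slide line `Fundamentals of Asset Management <number>` and private-use glyphs such as `\uf06c` that the PDF font uses for bullets. The function name `extra_clean` and the regular expressions are assumptions made for this example, not part of the original notebook.

```python
import re

# Repeated slide line seen in the extracted text, e.g.
# "Fundamentals of Asset Management 18" (pattern is an assumption).
SLIDE_LINE_RE = re.compile(r"Fundamentals of Asset Management \d+\s*")
# Unicode private-use code points (e.g. '\uf06c'), often bullet glyphs.
PRIVATE_USE_RE = re.compile(r"[\ue000-\uf8ff]")


def extra_clean(text):
    """Hypothetical extra cleaning pass for the extracted PDF text."""
    # Turn each slide line into a line break so slides stay separated.
    text = SLIDE_LINE_RE.sub("\n", text)
    # Drop private-use glyphs that carry no readable content.
    text = PRIVATE_USE_RE.sub("", text)
    # Collapse runs of spaces/tabs, trim each line, drop empty lines.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)
```

Such a pass could be applied before or after `clean_extracted_text`, depending on which artifacts matter for your document.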

Now, let's use `RecursiveCharacterTextSplitter` to split the cleaned text into chunks so that we can store them in the vector database.

In [38]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    add_start_index=True,
)
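
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter in plain Python. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which first tries to split on separators such as paragraph and line breaks; the function `sliding_chunks` is a hypothetical illustration only.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size windows that overlap by chunk_overlap.

    Returns (start_index, chunk) pairs, loosely mirroring what
    add_start_index=True records. Simplified sketch, not LangChain's logic.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the settings used in the cell above, consecutive windows start 900 characters apart, so the last 100 characters of one chunk are repeated at the start of the next.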

In [39]:
texts = text_splitter.create_documents([cleaned_text])
pprint(texts[0].page_content)

('\uf06cApplied to the entire portfolio  of infrastructure assets at \n'
 'all levels of the organization \n'
 '\uf06cSeeking to minimize total costs  of acquiring, operating, \n'
 'maintaining, and renewing assets…  \n'
 '\uf06cWithin an environment of limited resources \n'
 '\uf06cWhile continuously delivering the service levels  \n'
 'customers desire and regulators require \n'
 '\uf06cAt an acceptable level of risk to the organization Fundamentals of '
 'Asset Management 18 Renew \n'
 'Maintain Operate View 2: Life cycle business processes \n'
 'Support processes: \n'
 '•Demand management \n'
 '•Knowledge of assets \n'
 '•CIP validation \n'
 '•Accounting & economics \n'
 '•Condition & performance \n'
 'monitoring \n'
 '•Business risk exposure \n'
 '•Human resource \n'
 'management \n'
 '•Review & continuous \n'
 'improvement Core \n'
 'Processes Plan \n'
 'Acquire Dispose Fundamentals of Asset Management 19 Sustainable, best value '
 'service delivery Service  \n'
 'Delivery View 3

Next, we build the `documents` list so that we can pass it to the `create_chroma_db` function.

In [40]:
documents = []

for chunk in texts:
    documents.append(chunk.page_content)

pprint(documents[0])

('\uf06cApplied to the entire portfolio  of infrastructure assets at \n'
 'all levels of the organization \n'
 '\uf06cSeeking to minimize total costs  of acquiring, operating, \n'
 'maintaining, and renewing assets…  \n'
 '\uf06cWithin an environment of limited resources \n'
 '\uf06cWhile continuously delivering the service levels  \n'
 'customers desire and regulators require \n'
 '\uf06cAt an acceptable level of risk to the organization Fundamentals of '
 'Asset Management 18 Renew \n'
 'Maintain Operate View 2: Life cycle business processes \n'
 'Support processes: \n'
 '•Demand management \n'
 '•Knowledge of assets \n'
 '•CIP validation \n'
 '•Accounting & economics \n'
 '•Condition & performance \n'
 'monitoring \n'
 '•Business risk exposure \n'
 '•Human resource \n'
 'management \n'
 '•Review & continuous \n'
 'improvement Core \n'
 'Processes Plan \n'
 'Acquire Dispose Fundamentals of Asset Management 19 Sustainable, best value '
 'service delivery Service  \n'
 'Delivery View 3

### <a id='toc1_1_4_'></a>[The embedding database](#toc0_)

We will create a [custom embedding function](https://docs.trychroma.com/embeddings#custom-embedding-functions) that calls the Gemini API. Given a set of documents, this function returns their vectors (embeddings).

#### <a id='toc1_1_4_1_'></a>[Changes to the new embedding models](#toc0_)

The newer embedding model, `embedding-001`, accepts a `task_type` parameter and an optional `title` (valid only with `task_type=RETRIEVAL_DOCUMENT`).

These parameters apply only to the newest embedding models. The task types are:

Task Type | Description
---       | ---
RETRIEVAL_QUERY | Specifies the given text is a query in a search/retrieval setting.
RETRIEVAL_DOCUMENT | Specifies the given text is a document in a search/retrieval setting.
SEMANTIC_SIMILARITY | Specifies the given text will be used for Semantic Textual Similarity (STS).
CLASSIFICATION | Specifies that the embeddings will be used for classification.
CLUSTERING | Specifies that the embeddings will be used for clustering.
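
The rule that `title` is valid only for `RETRIEVAL_DOCUMENT` can be enforced before calling the API. The sketch below only builds the keyword arguments; `embed_kwargs` is a hypothetical helper, and the parameter names (`model`, `content`, `task_type`, `title`) are the ones passed to `genai.embed_content` in the next cell.

```python
def embed_kwargs(content, task_type, title=None, model="models/embedding-001"):
    """Build keyword arguments for an embedding call (hypothetical helper).

    Raises ValueError if a title is supplied with a task type other than
    retrieval_document, since the title is only valid for that task type.
    """
    kwargs = {"model": model, "content": content, "task_type": task_type}
    if title is not None:
        if task_type.lower() != "retrieval_document":
            raise ValueError("title is only valid with retrieval_document")
        kwargs["title"] = title
    return kwargs


# Possible usage (requires a configured genai client):
# genai.embed_content(**embed_kwargs(text, "retrieval_query"))["embedding"]
```

Note that the embedding function defined in the next cell uses `retrieval_document` for every call, including queries. Using `retrieval_query` for the question text is a possible variation, not something the original notebook does.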

In [41]:
class GeminiEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        model = 'models/embedding-001'
        # for better results, try to provide a title for each input if the corpus is covering a lot of domains
        title = "Asset Management"

        return genai.embed_content(
            model=model,
            content=input,
            task_type="retrieval_document",
            title=title)["embedding"]

#### <a id='toc1_1_4_2_'></a>[Vector database](#toc0_)

The `create_chroma_db` function creates a new collection if it doesn't exist, or reuses the existing one, in the database stored at the path you specify (`"../database/"` in this example). It then loops over the documents and adds each one to the collection; Chroma computes the embeddings through the collection's embedding function.

We call `time.sleep()` between insertions because the free API has a rate limit of 60 requests per minute.

In [20]:
import time
from tqdm import tqdm

In [42]:
def create_chroma_db(documents, name):
    chroma_client = chromadb.PersistentClient(path="../database/")

    db = chroma_client.get_or_create_collection(
        name=name, embedding_function=GeminiEmbeddingFunction())

    # Offset new ids by the current size so they don't collide with existing entries
    initial_size = db.count()
    for i, d in tqdm(enumerate(documents), total=len(documents), desc="Creating Chroma DB"):
        db.add(
            documents=d,
            ids=str(i + initial_size)
        )
        time.sleep(0.5)  # pause between requests to respect the API rate limit
    return db


def get_chroma_db(name):
    chroma_client = chromadb.PersistentClient(path="../database/")
    return chroma_client.get_collection(name=name, embedding_function=GeminiEmbeddingFunction())
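
A fixed `time.sleep(0.5)` does not by itself guarantee at most 60 requests per minute; it relies on each request also taking some time. One alternative is to enforce a minimum interval between calls. The class below is a minimal sketch of that idea, with hypothetical names; the `clock` and `sleep` parameters exist only so the logic can be tested without waiting.

```python
import time


class MinIntervalThrottle:
    """Ensure consecutive calls to wait() are at least min_interval apart."""

    def __init__(self, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval not yet elapsed.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In `create_chroma_db`, a `throttle.wait()` call before each `db.add(...)` could replace the fixed sleep.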

In [43]:
db = create_chroma_db(documents, "sme_db")
db.count()

Creating Chroma DB: 100%|██████████████████████████████████████████████████████████████| 13/13 [00:14<00:00,  1.15s/it]


18

Let's check that the database contains something.

In [44]:
pd.DataFrame(db.peek(5))

Unnamed: 0,ids,embeddings,metadatas,documents,uris,data
0,0,"[0.020391362, -0.044631794, -0.052501127, 0.00...",,• Relative asset size compares the total asset...,,
1,1,"[0.01335519, -0.038834024, -0.05995581, -0.021...",,130% in 202 2-2023 \n• 4 peers improved their...,,
2,10,"[0.00021351333, -0.007915891, -0.04445033, -0....",,Levels of Service \nSection - 2 State of the A...,,
3,11,"[0.020765554, -0.017808527, -0.0010893681, 0.0...",,"3When should I repair , when should I rehabili...",,
4,12,"[0.02734408, -0.020287659, 0.008568732, 0.0325...",,Monitoring performance is a key to reliability...,,


Each document is embedded into a vector with 768 dimensions.

In [45]:
len(pd.DataFrame(db.peek(5)).iloc[0]["embeddings"])

768

### <a id='toc1_1_5_'></a>[Getting the relevant documents](#toc0_)

Chroma collections can be queried in a variety of ways using the `.query` method. Here we query by a set of `query_texts`: Chroma first embeds each query text with the collection's embedding function defined above, and then performs the search with the generated embedding.

In [46]:
def get_relevant_passages(query, db, n_results=5):
    passages = db.query(query_texts=[query], n_results=n_results)[
        'documents'][0]
    return passages
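
Conceptually, the query step embeds the question and returns the documents whose vectors are closest to it. The sketch below illustrates that idea with a brute-force cosine-similarity ranking in plain Python. It is only a conceptual model: Chroma uses its own index, and its configured distance metric may differ. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, documents, k=5):
    """Return the k documents most similar to query_vec (brute force)."""
    scored = sorted(
        zip(documents, doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in scored[:k]]
```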

In [48]:
question = "asset management"
passages = get_relevant_passages(question, db, n_results=5)

Markdown(passages[0])

performance? 4 What are my best O&M and 
CIP investment strategies? 5 What is my best long-term 
funding strategy? 1 What is the current state of my assets? 
Decision making Fundamentals of Asset Management 50 The Bear and the Butterfly

### <a id='toc1_1_6_'></a>[Prompting the Gemini model](#toc0_)

Now that we have found the relevant passages in our set of documents, we can use them to construct a prompt to pass into the Gemini API.

In [49]:
def make_prompt(query, relevant_passage):
    escaped = relevant_passage.replace("'", "").replace('"', "")
    # prompt = f"""question : {query}.\n
    # Your answer :
    # """

    prompt = f"""question : {query}.\n
    Additional Information:\n {escaped}\n
  What is AM?.\n
    Response :
    """

    # prompt = f"""question : {query}.\n
    # Additional information:\n {escaped}\n
    # If you find that the question is unrelated to the additional information, you can ignore it and answer 'OUT OF CONTEXT' first if the question is out of context, and then answer the question even if it is out of context, making clear to the user that this answer has no relation to the context.\n
    # Your answer :
    # """

    # prompt = f"""The questions that will be asked relate to the environmental management system. Here is the question: {query}.\nTry to answer the question using the following additional information, which may help you answer it.\nAdditional information:\n {escaped}
    # Your answer :
    # """

    return prompt
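
The active prompt above always appends the fixed line `What is AM?.`, whatever `query` contains. A variant that relies only on the query and the retrieved context might look like the sketch below. The function name `make_grounded_prompt` and the instruction wording are assumptions for illustration, not the notebook's prompt.

```python
def make_grounded_prompt(query, context):
    """Build a prompt from the user's query and retrieved context only.

    Hypothetical variant of make_prompt: no hard-coded question, and an
    explicit instruction to admit when the context is insufficient.
    """
    return "\n".join([
        "Answer the question using only the additional information below.",
        "If the information is not sufficient, say so.",
        "",
        f"Question: {query}",
        "",
        "Additional information:",
        context.strip(),
        "",
        "Response:",
    ])
```

Building the prompt from a list of lines avoids the stray indentation that a triple-quoted string inside a function introduces.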

We take the relevant documents returned by the `.query` method and convert them from a list into a single string. This string is the context that will be given to the model alongside the question in order to get good results.

In [50]:
def convert_pasages_to_list(passages):
    context = ""

    for passage in passages:
        context += passage + "\n"

    return context
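
If the retrieved passages are long, it can be useful to number them and cap the total context size. The sketch below shows one way to do this; `build_context` and the `max_chars` budget are hypothetical and not part of the original notebook.

```python
def build_context(passages, max_chars=8000):
    """Join passages into a numbered context string within a size budget.

    Passages are added in order until the next one would exceed
    max_chars; the remaining ones are dropped. The budget value is an
    arbitrary assumption.
    """
    parts = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        block = f"[{i}] {passage.strip()}"
        # +1 accounts for the newline that joins consecutive blocks.
        if used + len(block) + 1 > max_chars:
            break
        parts.append(block)
        used += len(block) + 1
    return "\n".join(parts)
```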

In [51]:
prompt = make_prompt(question, convert_pasages_to_list(passages))
Markdown(prompt)

question : asset management.

    Additional Information:
 performance? 4 What are my best O&M and 
CIP investment strategies? 5 What is my best long-term 
funding strategy? 1 What is the current state of my assets? 
Decision making Fundamentals of Asset Management 50 The Bear and the Butterfly
Applied to the entire portfolio  of infrastructure assets at 
all levels of the organization 
Seeking to minimize total costs  of acquiring, operating, 
maintaining, and renewing assets…  
Within an environment of limited resources 
While continuously delivering the service levels  
customers desire and regulators require 
At an acceptable level of risk to the organization Fundamentals of Asset Management 18 Renew 
Maintain Operate View 2: Life cycle business processes 
Support processes: 
•Demand management 
•Knowledge of assets 
•CIP validation 
•Accounting & economics 
•Condition & performance 
monitoring 
•Business risk exposure 
•Human resource 
management 
•Review & continuous 
improvement Core 
Processes Plan 
Acquire Dispose Fundamentals of Asset Management 19 Sustainable, best value service delivery Service  
Delivery View 3: Core AM program elements 
Organizational 
Issues People 
Issues Lifecycle 
Practices Information 
Total Asset 
Management
not perceived as adding value  
2 The ―Life Cycle‖ Principle—all assets pass through a discernable life cycle, the 
understanding of which enhances appropriate management 
3 The ―Failure‖ Principle— usage and the operating environment work to break-
down all assets; failure occurs when an asset can not do what is required by the 
user in its operating environment 
4 The ―Failure Modes‖ Principle—not all assets fail in the same way  
5 The ―Probability‖ Principle—not all assets of the same age fail at the same time  
6 The ―Consequence‖ Principle— not all failures have the same consequences 
7 The ―Total Cost of Ownership‖ Principle—there exists a minimum optimal 
investment over the life cycle of an asset that best balances performance and 
cost given a target level of service and a designated level of risk Fundamentals of Asset Management 25 View 8: Enterprise asset management plan 
Levels of Service 
Section - 2 State of the Assets 
Section - 1 Growth & Demand 
Section - 3
Monitoring performance is a key to reliability 
Time Performance Vibration 
Increasing Decreasing Excellent Poor Fundamentals of Asset Management 35 Understanding how our assets fail 
Experience indicates…  
Failure can be subjected to systematic study – a 
30-70% of equipment maintenance activity is typically 
misdirected – it is not cost effectively deterring failure Fundamentals of Asset Management 36 Understanding how our assets fail 
From the science of failure - tools for proactive  
management 
Root cause analysis 
Failure mode, effects, and criticality analysis (FMECA) 
Condition-based monitoring, failure/survival curves 
Predictive maintenance (PdM) 
Proactive maintenance (zero breakdown, reliability 
centered maintenance, total productive maintenance) 
Reliability centered management (design, O&M) 
AM is all about managing the potential to fail Fundamentals of Asset Management 37 Our investment toolkit 
Maintenance
Maintenance 
•Major Repair  – repair beyond normal periodic maintenance, relatively 
minor in nature, anticipated in the long-term operation of the asset; 
no enhancement of capabilities; typically funded by operating budget  
•Refurbish/Rehabilitate – replacement of a component part or parts or 
equivalent intervention sufficient to return the asset to level of 
performance above minimum acceptable level; may include minor 
enhancement of capabilities; typically funded out of capital budgets 
•Without enhancement  – substitution of an entire asset with a new 
or equivalent asset without enhancement of capabilities 
•With enhancement  - substitution of an entire asset with a new or 
equivalent asset with enhanced capabilities 
―Augmentation‖  Fundamentals of Asset Management 38 Failure mode-based management logic 
Significant Are Not 
Significant 
Cannot Be Prevented 
by Maintenance Can Be Prevented 
by Maintenance Prevention 
Effective? 
Redesign, Replace, 
Run to Failure,


  What is AM?.

    Response :
    

### <a id='toc1_1_7_'></a>[Generating the response](#toc0_)

#### <a id='toc1_1_7_1_'></a>[Getting the model](#toc0_)

In [32]:
model = genai.GenerativeModel('gemini-pro')

#### <a id='toc1_1_7_2_'></a>[Prompting the model](#toc0_)

In [33]:
answer = model.generate_content(prompt)
Markdown(answer.text)

The document does not have information about the 2nd largest Fund Administrator as measured by private fund assets.

### <a id='toc1_1_8_'></a>[The pipeline](#toc0_)

Now we combine everything to create the following pipeline:
1. Provide the question.
2. Search the Chroma database for relevant documents (passages).
3. Convert the passages from a list to a string (context).
4. Create the prompt.
5. Give the question + context to the model.
6. Get the answer.
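
The steps above can be sketched as a single function that receives each stage as a callable, which keeps the pipeline easy to test without calling the API. The name `answer_question` and its parameters are hypothetical; the commented usage shows how the notebook's own functions might be plugged in.

```python
def answer_question(question, retrieve, to_context, to_prompt, generate, n_results=5):
    """Run the retrieval-augmented pipeline for one question.

    Each stage is passed in as a callable (an assumption of this sketch):
      retrieve(question, n_results) -> list of passages
      to_context(passages)          -> context string
      to_prompt(question, context)  -> prompt string
      generate(prompt)              -> answer text
    """
    passages = retrieve(question, n_results)   # step 2
    context = to_context(passages)             # step 3
    prompt = to_prompt(question, context)      # step 4
    return generate(prompt)                    # steps 5 and 6


# Possible wiring with the functions defined in this notebook:
# answer_question(
#     question,
#     retrieve=lambda q, n: get_relevant_passages(q, db, n_results=n),
#     to_context=convert_pasages_to_list,
#     to_prompt=make_prompt,
#     generate=lambda p: model.generate_content(p).text,
# )
```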

In [34]:
# Step 1
# question = "Give me the number of planets in the solar system"
question = "What are the Recommended Next Steps?"

# Step 2
db = get_chroma_db("sme_db")
passages = get_relevant_passages(question, db, n_results=5)

# Step 3
context = convert_pasages_to_list(passages)

# Step 4
prompt = make_prompt(question, context)

# Step 5
model = genai.GenerativeModel('gemini-pro')
answer = model.generate_content(prompt)

# Step 6
Markdown(answer.text)

SSC