# NCCS Embedding Script

In this notebook, I explore which embedding model and vector database work best for semantic search. Each option is evaluated on two criteria:
1) accuracy
2) speed
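
A minimal sketch of how the two criteria could be measured together, assuming a small set of hand-labelled queries is available. The names `evaluate`, `toy_search`, and the labelled pairs are hypothetical and not part of the original notebook. `search_fn` stands in for any function that returns the ids of the top-k retrieved chunks.

```python
import time


def evaluate(search_fn, labelled_queries, k=3):
    """Return (hit_rate, mean_latency_seconds) for a retrieval function.

    search_fn(query, k) must return a list of chunk ids (hypothetical interface).
    labelled_queries is a list of (query, expected_chunk_id) pairs.
    """
    if not labelled_queries:
        raise ValueError("at least one labelled query is required")
    hits = 0
    total_seconds = 0.0
    for query, expected_id in labelled_queries:
        start = time.perf_counter()
        retrieved_ids = search_fn(query, k)
        total_seconds += time.perf_counter() - start
        # a query counts as a hit when the expected chunk is in the top k
        if expected_id in retrieved_ids[:k]:
            hits += 1
    n = len(labelled_queries)
    return hits / n, total_seconds / n


# toy stand-in for a real vector store, for illustration only
def toy_search(query, k):
    canned = {
        "carbon tax": ["b2a0c7c4402a-3", "b2a0c7c4402a-14"],
        "nccs": ["2afae8eb58f8-0"],
    }
    return canned.get(query, [])[:k]


labelled = [
    ("carbon tax", "b2a0c7c4402a-3"),
    ("nccs", "2afae8eb58f8-0"),
    ("hydrogen", "adbff36bed2c-10"),
]
hit_rate, latency = evaluate(toy_search, labelled)
print(f"hit rate@3: {hit_rate:.2f}, mean latency: {latency:.6f}s")
```

Swapping `toy_search` for a thin wrapper around a real vector store would give comparable numbers across candidates, as long as the same labelled queries are reused.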

Useful notebooks
* https://github.com/pinecone-io/examples/blob/master/generation/langchain/handbook/05-langchain-retrieval-augmentation.ipynb

*Updates*: 
I will be using Pinecone, as it has given me accurate results at good speed.

In [2]:
import time
import os
import textwrap
import json
import pinecone

from tqdm.auto import tqdm
from uuid import uuid4
from getpass import getpass
from sentence_transformers import SentenceTransformer
from langchain.vectorstores import Pinecone

from embeddings import LocalHuggingFaceEmbeddings

# Embedding

I will be using locally run Hugging Face embeddings (a sentence-transformers model), which avoids the cost of a paid embedding API and should be accurate enough for this task.
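
The `LocalHuggingFaceEmbeddings` class comes from a local `embeddings` module that is not shown in this notebook. Below is a minimal sketch of what such a wrapper might look like, assuming it simply delegates to `SentenceTransformer.encode`. The class name `LocalEmbeddingsSketch`, the optional `model` argument, and the stub model are hypothetical and only illustrate the `embed_documents` / `embed_query` interface used later.

```python
class LocalEmbeddingsSketch:
    """Hypothetical stand-in for the local embeddings wrapper (not the original code)."""

    def __init__(self, model_id, model=None):
        if model is None:
            # imported lazily so the class can be exercised with a stub model
            from sentence_transformers import SentenceTransformer
            model = SentenceTransformer(model_id)
        self.model = model

    def embed_documents(self, texts):
        # encode a list of texts and return plain lists of floats
        vectors = self.model.encode(list(texts))
        return [[float(x) for x in vector] for vector in vectors]

    def embed_query(self, text):
        # encode a single text and return one plain list of floats
        return [float(x) for x in self.model.encode(text)]


class _StubModel:
    """Tiny fake model so the sketch runs without downloading weights."""

    def encode(self, inputs):
        if isinstance(inputs, str):
            return [float(len(inputs)), 1.0]
        return [[float(len(text)), 1.0] for text in inputs]


sketch = LocalEmbeddingsSketch("multi-qa-mpnet-base-dot-v1", model=_StubModel())
query_vector = sketch.embed_query("abc")
document_vectors = sketch.embed_documents(["a", "abcd"])
print(query_vector)
print(document_vectors)
```

Returning plain Python lists rather than arrays keeps the vectors easy to serialise when they are sent to a vector database.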

In [3]:
documents = []

with open('train_data.jsonl', 'r') as f:
    for line in f: 
        data = json.loads(line)
        documents.append(data)

print(len(documents))
print(documents[5])

309
{'id': 'e9a90f5fe727-2', 'text': 'In 2009, Singapore pledged to reduce our emissions by 16% below BAU levels by 2020 ahead of the Copenhagen Summit. Singapore has achieved this pledge with a 32% reduction below BAU levels in 2020.', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/singapores-climate-targets/overview'}


In [4]:
embeddings = LocalHuggingFaceEmbeddings('multi-qa-mpnet-base-dot-v1')

In [5]:
res = embeddings.embed_query(documents[5]["text"])

print(len(res)) # 768-dimension

768


# Vector Databases

## Pinecone

Pinecone has become very popular recently, so the service may occasionally have issues under heavy demand.

In [6]:
# Initialise pinecone 

# find API key in console at app.pinecone.io
YOUR_API_KEY = getpass("Pinecone API Key: ")
# find ENV (cloud region) next to API key in console
YOUR_ENV = input("Pinecone environment: ")

index_name = 'langchain-retrieval-augmentation'
pinecone.init(
    api_key=YOUR_API_KEY,
    environment=YOUR_ENV
)

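# create the index only if it does not already exist, so the cell can be re-run
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        metric='dotproduct',  # multi-qa-mpnet-base-dot-v1 is tuned for dot-product similarity
        dimension=768  # must match the embedding size
    )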

Pinecone API Key: ········
Pinecone environment: northamerica-northeast1-gcp


Connect to the initialised index. 

In [8]:
index = pinecone.GRPCIndex(index_name)

index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

### Indexing

Here we embed our data and insert it into Pinecone. Recall that each chunk was already assigned an id during data cleaning (for example `e9a90f5fe727-2`).

Hence, we simply need to embed the chunks and upload them, together with their ids and metadata, into Pinecone in batches.
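
Batching can be sketched independently of Pinecone. The helper `batched` below is hypothetical (it is not part of the original notebook) and shows one way to split records into fixed-size groups without losing the final partial group.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items, including the last partial one."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover records that did not fill a whole batch
        yield batch


sizes = [len(batch) for batch in batched(range(309), 100)]
print(sizes)
```

With 309 records and a batch size of 100, the last batch holds only 9 records, so any upload loop needs to handle a partial batch as well as full ones.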

In [9]:
batch_limit = 100

texts = []
metadatas = []

for i, record in enumerate(tqdm(documents)):
    record_texts = record['text'] 
    metadata = {
        'id': str(record['id']),
        'source': record['source'],
        'text': record_texts
    }
    
    # append these to current batches
    texts.append(record_texts)
    metadatas.append(metadata)
     
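    # if we have reached the batch_limit we can add texts to Pinecone index
    if len(texts) >= batch_limit:
        ids = [metadata["id"] for metadata in metadatas]
        embeds = embeddings.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []

# upload the final partial batch (9 of the 309 records here); the run
# recorded below was made without this step, so it shows 300 vectors
if texts:
    ids = [metadata["id"] for metadata in metadatas]
    embeds = embeddings.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))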

100%|█████████████████████████████████████████| 309/309 [00:03<00:00, 84.26it/s]


In [10]:
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 300}},
 'total_vector_count': 300}
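
The stats above report 300 vectors, while 309 documents were loaded earlier. A small check along these lines would surface that gap. `count_gap` is a hypothetical helper, and the dictionary simply mirrors the printed stats; it assumes the stats object can be read like a mapping.

```python
def count_gap(expected_count, stats):
    """Return how many expected vectors are absent according to an index stats mapping."""
    return expected_count - stats["total_vector_count"]


# values copied from the outputs shown in this notebook
stats = {
    "dimension": 768,
    "index_fullness": 0.0,
    "namespaces": {"": {"vector_count": 300}},
    "total_vector_count": 300,
}
gap = count_gap(309, stats)
print(f"{gap} of 309 documents are not in the index")
```

A gap of 9 matches the size of the last partial batch (309 records with a batch size of 100), which suggests that the trailing records were never uploaded in this run.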

### Querying

In [11]:
text_field = "text"

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embeddings.embed_query, text_field
)
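
Conceptually, a similarity search embeds the query and returns the `k` stored vectors that score highest under the index metric, which is dot product here. The brute-force sketch below illustrates that idea with toy 3-dimensional vectors. The names `dot`, `top_k`, and the `chunk-*` ids are hypothetical, and a managed service such as Pinecone does not necessarily scan every vector like this.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vector, stored, k=3):
    """Rank stored (id, vector) pairs by dot product with the query, highest first."""
    scored = [(dot(query_vector, vector), doc_id) for doc_id, vector in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# toy vectors standing in for 768-dimensional embeddings
stored = [
    ("chunk-a", [1.0, 0.0, 0.0]),
    ("chunk-b", [0.5, 0.5, 0.0]),
    ("chunk-c", [0.0, 0.0, 2.0]),
]
results = top_k([2.0, 1.0, 0.0], stored, k=2)
print(results)
```

Because dot product is not normalised, vectors with larger magnitude can score higher than shorter ones pointing in a similar direction, which is one reason the index metric should match the metric the embedding model was trained for.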

In [12]:
query = "What is the national climate change secretariat?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='The National Climate Change Secretariat (NCCS) was established on 1 July 2010 under the Prime Minister’s Office (PMO) to develop and implement Singapore’s domestic and international policies and strategies to tackle climate change. NCCS is part of the Strategy Group which supports the Prime Minister and his Cabinet to establish priorities and strengthen strategic alignment across Government', metadata={'id': '2afae8eb58f8-0', 'source': 'https://www.nccs.gov.sg/who-we-are/about-nccs'}),
 Document(page_content='Read more about Singapore’s climate actions.\nPromoting International Co-operation on Climate Change\nSingapore also participates in other multilateral efforts that support a comprehensive and holistic approach to dealing with climate change including discussions under the World Trade Organization (WTO), the World Intellectual Property Organization (WIPO), the International Maritime Organization (IMO) and the International Civil Aviation Organization (ICAO)

In [13]:
query = "What is the IMCC?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='INTER-MINISTERIAL COMMITTEE ON CLIMATE CHANGE\nThe Inter-Ministerial Committee on Climate Change (IMCCC) enhances Whole-of-Government coordination on climate change policies to ensure that Singapore is prepared for the impacts of climate change. Established in 2007, IMCCC is chaired by Mr Teo Chee Hean, Senior Minister and Coordinating Minister for National Security.\nIMCCC Members\nChairman\n\nMr Teo Chee Hean, Senior Minister and Coordinating Minister for National Security\n\nMembers', metadata={'id': '6fcb210b296d-0', 'source': 'https://www.nccs.gov.sg/who-we-are/inter-ministerial-committee-on-climate-change'}),
 Document(page_content='IMCCC Executive Committee\nIMCCC is supported by an Executive Committee (Exco) comprising the permanent secretaries of the respective Ministries. The IMCCC Exco oversees the work of the Long-Term Emissions and Mitigation Working Group (LWG), Resilience Working Group (RWG), Sustainability Working Group (SWG), Green Economy Worki

In [14]:
query = "Who is head of civil service?"  # results are WRONG

vectorstore.similarity_search(
    query,  # our search query
    k=5  # return 5 most relevant docs
)

[Document(page_content='Mr Lee Chuan Teck, Permanent Secretary (Trade and Industry)(Development)\nMr Loh Ngai Seng, Permanent Secretary (Transport)\nDr Beh Swan Gin, Chairman, Economic Development Board\nMr Ravi Menon, Managing Director, Monetary Authority of Singapore\n\nSecretariat\n\nMr Benedict Chia, Director General (Climate Change), National Climate Change Secretariat, Strategy Group, Prime Minister’s Office\nMr Heng Jian Wei, Director (Policy), National Climate Change Secretariat, Strategy Group, Prime Minister’s Office', metadata={'id': '6fcb210b296d-5', 'source': 'https://www.nccs.gov.sg/who-we-are/inter-ministerial-committee-on-climate-change'}),
 Document(page_content='Mr Desmond Lee, Minister for National Development\nMs Indranee Rajah, Minister, Prime Minister’s Office, Second Minister for Finance and Second Minister for National Development\n\nCommittees and Work Groups Addressing Singapore’s Climate Change-related Issues\n\nIMCCC Executive Committee', metadata={'id': '6f

In [15]:
query = "What is Singapore's emission targets?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='In line with the agreement adopted in Paris in December 2015, Singapore has made a further commitment to reduce our Emissions Intensity by 36 per cent from 2005 levels by 2030, and stabilise our greenhouse gas emissions with the aim of peaking around 2030.', metadata={'id': 'f3adcac3f6db-21', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/singapore-and-international-efforts'}),
 Document(page_content='In 2009, Singapore pledged to reduce our emissions by 16% below BAU levels by 2020 ahead of the Copenhagen Summit. Singapore has achieved this pledge with a 32% reduction below BAU levels in 2020.', metadata={'id': 'e9a90f5fe727-2', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/singapores-climate-targets/overview'}),
 Document(page_content='On 31 March 2020, Singapore submitted its enhanced Nationally Determined Contribution (NDC) and Long-Term Low-Emissions Development Strategy (LEDS) document to the UNFCCC. Singapore’s enhanced NDC no

In [16]:
query = "What can hydrogen be used for?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='Experimenting with the use of advanced hydrogen technologies at the cusp of commercial readiness through pathfinder projects;\nInvesting in research and development (R&D) to unlock key technological bottlenecks;\nPursuing international collaboration to enable supply chains for low-carbon hydrogen;\nUndertaking long-term land and infrastructure planning; and', metadata={'id': 'adbff36bed2c-11', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/mitigation-efforts/power'}),
 Document(page_content='Low-Carbon Hydrogen\n\nGiven its potential as an alternative fuel and industrial feedstock, low-carbon hydrogen has emerged as a key potential decarbonisation pathway for Singapore. Although many low-carbon hydrogen technologies and supply chains are still nascent, Singapore is taking steps to prepare for hydrogen deployment. Our National Hydrogen Strategy is organised around five key thrusts:', metadata={'id': 'adbff36bed2c-10', 'source': 'https://www.nccs.gov

In [17]:
query = "What can carbon capture be used for?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='Carbon Capture, Utilisation and Storage (CCUS)\nWe are also exploring possible CCUS deployment pathways. Carbon dioxide captured could be sequestered in suitable sub-surface geological formations, utilised as feedstock for synthetic fuels or as building materials through mineralisation. Singapore will continue to monitor technological and market developments, and scale up deployment as pathways become techno-economically viable. Read more about CCUS in Singapore.\nNatural Gas', metadata={'id': 'adbff36bed2c-14', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/mitigation-efforts/power'}),
 Document(page_content='We are also exploring possible CCUS deployment pathways. Carbon dioxide captured from industrial facilities in Singapore could be sequestered in suitable sub-surface geological formations, utilised as feedstock for synthetic fuels or as building materials through mineralisation. Singapore will continue to monitor technological and market deve

In [18]:
query = "What is Singapore's carbon tax?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='The carbon tax forms part of Singapore’s comprehensive suite of mitigation measures to support the transition to a low-carbon economy.', metadata={'id': 'b2a0c7c4402a-1', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/mitigation-efforts/carbontax'}),
 Document(page_content='Singapore’s carbon tax underpins our net zero targets and climate mitigation efforts by providing an effective economic signal to steer producers and consumers away from carbon-intensive goods and services, hold businesses accountable for their emissions, and enhance the business case for the development of low-carbon solutions. In all, the carbon tax currently covers 80% of our total greenhouse gas (GHG) emissions from about 50 facilities in the manufacturing, power, waste, and water sectors', metadata={'id': 'b2a0c7c4402a-0', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/mitigation-efforts/carbontax'}),
 Document(page_content='Carbon Tax in Singapore from 2019 t

In [19]:
query = "How much is the carbon tax?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='To support our net zero target, the carbon tax will be raised to S$25/tCO2e in 2024 and 2025, and S$45/tCO2e in 2026 and 2027, with a view to reaching S$50-80/tCO2e by 2030. This will strengthen the price signal and impetus for businesses and individuals to reduce their carbon footprint in line with national climate goals.', metadata={'id': 'b2a0c7c4402a-3', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/mitigation-efforts/carbontax'}),
 Document(page_content='In all, the carbon tax currently covers 80% of our total GHG emissions from about 50 facilities in the manufacturing, power, waste, and water sectors. Facilities in other sectors also indirectly face a carbon price on the electricity they consume as power generation companies are expected to pass on some degree of their own tax burden through increased electricity tariffs', metadata={'id': 'b2a0c7c4402a-14', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/mitigation-efforts/carbo

In [20]:
query = "Where will the revenue for the carbon tax go to?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='The Government does not expect to derive additional revenue from the carbon tax increase in this decade. The revenue will be used to support decarbonisation efforts and the transition to a green economy, and cushion the impact on businesses and households.\nUse of International Carbon Credits', metadata={'id': 'b2a0c7c4402a-5', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/mitigation-efforts/carbontax'}),
 Document(page_content='To support our net zero target, the carbon tax will be raised to S$25/tCO2e in 2024 and 2025, and S$45/tCO2e in 2026 and 2027, with a view to reaching S$50-80/tCO2e by 2030. This will strengthen the price signal and impetus for businesses and individuals to reduce their carbon footprint in line with national climate goals.', metadata={'id': 'b2a0c7c4402a-3', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/mitigation-efforts/carbontax'}),
 Document(page_content='In all, the carbon tax currently covers 80% of ou

In [21]:
query = "What can individuals do to fight climate change?" # not very good

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='SINGAPORE AND INTERNATIONAL EFFORTS\nThe causes and impact of climate change can only be addressed effectively by a concerted international effort. Every country needs to play its part to reduce global concentrations of greenhouse gases (GHGs) and adapt to the impact of climate change.', metadata={'id': 'f3adcac3f6db-0', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/singapore-and-international-efforts'}),
 Document(page_content='From designating 2018 as the Year of Climate Action to the annual Climate Action Week, the government has made significant efforts to rally for collective action. In 2019, the #RecycleRight Citizens’ Workgroup was convened to look at how to improve household recycling In 2020, the Alliances for Action, as well as conversations on sustainability to emerge stronger together with citizens, businesses, and NGOs were started.', metadata={'id': 'fd806c39b1f6-1', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/overvi

In [22]:
query = "What can firms do to fight climate change?" # not very good

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='INDUSTRY\nTo support businesses in their decarbonisation journeys, the Government has introduced a suite of measures to help companies improve energy efficiency, reduce emissions, and seize opportunities in the green economy.\nEnergy Efficiency Measures', metadata={'id': '571461f1b8f1-0', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/mitigation-efforts/industry'}),
 Document(page_content='GREEN GROWTH OPPORTUNITIES\nClimate change, fossil fuel depletion and rapid urbanisation are driving countries to deploy cleaner and more sustainable energy solutions. While climate change clearly poses significant global challenges, it also provides strong incentives for entrepreneurship, research and development (R&D) and creative problem-solving to help cities and communities anticipate, prepare for and adapt to its impact.', metadata={'id': '32c7cfe4a783-0', 'source': 'https://www.nccs.gov.sg/singapores-climate-action/overview/green-growth-opportunities'}),
 