# Introduction

This notebook builds a generative AI assistant that helps with understanding BCBS (Basel Committee on Banking Supervision) regulations.

We will build a retrieval-augmented generation (RAG) pipeline around an LLM, with conversation memory enabled, and supply a large set of BCBS documents as context.

Hopefully, I will be able to attach this to a Front End for

# Installation and Authentication



##### **Install google-cloud-aiplatform, langchain and chromadb**

In [2]:
!pip install -q --upgrade google-cloud-aiplatform
!pip install -q --upgrade langchain unstructured
!pip install -q --upgrade chromadb
!pip install -q --upgrade gradio
!pip install -q "unstructured[pdf]"

##### **Authenticate**

In [1]:
import sys

if "google.colab" in sys.modules:
  from google.colab import auth as google_auth
  google_auth.authenticate_user()

# Import Vertex AI, LangChain and ChromaDB libraries

In [2]:
import langchain

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from langchain.llms.base import LLM
from langchain.document_loaders import GCSDirectoryLoader
from langchain.embeddings.base import Embeddings
from langchain.embeddings import VertexAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

from pydantic import BaseModel, root_validator
from typing import Any, Mapping, Optional, List, Dict

import gradio as gr
import markdown

import vertexai

from langchain.llms import VertexAI

import time




In [3]:
PROJECT_ID = "grey-gradient" # @param {type:'string'}
REGION = "us-central1" # @param {type:'string'}

#Initialize Vertex AI SDK
vertexai.init(project=PROJECT_ID, location = REGION)

# Designing the Basic LLM Architecture

In [4]:
loader = GCSDirectoryLoader(project_name = PROJECT_ID, bucket = "grey-gradient-02")
documents = loader.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
documents[0].page_content

In [6]:
# Paragraph numbers such as '20.0' contain a decimal point,
# so str.isdigit() cannot be used to detect them.
test = '20.0'

test.isdigit()

False
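
The check above shows why `str.isdigit()` is not enough to spot paragraph numbers like `20.0`: the decimal point makes it return `False`. A minimal sketch of a regex-based check follows. The helper name `is_paragraph_number` is hypothetical and not part of the notebook; it reuses the number shape that the split pattern in the next cell targets.

```python
import re

# Hypothetical helper: accepts "20" or "20.0", i.e. digits with an
# optional single decimal part.
_NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?")


def is_paragraph_number(text: str) -> bool:
    """Return True if the whole string is an integer or decimal number."""
    return _NUMBER.fullmatch(text) is not None


print("20.0".isdigit())              # isdigit rejects the decimal point
print(is_paragraph_number("20.0"))   # the regex accepts it
print(is_paragraph_number("LEX10"))  # section codes are not numbers
```

Using `fullmatch` rather than `match` keeps strings such as `20.0abc` from being accepted.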

In [None]:
import re
page_content = documents[0].page_content
pattern = r'\n\n([0-9]+(?:\.[0-9]+)?)\n\n'
split_result = re.split(pattern, page_content)
split_result

["Basel Committee on Banking Supervision\n\nLEX\n\nLarge exposures\n\nLarge exposures regulation limits the maximum loss that a bank could face in the event of a sudden counterparty failure to a level that does not endanger the bank's solvency. This standard requires banks to measure their exposures to a single counterparty or a group of connected counterparties and limit the size of large exposures in relation to their capital.\n\nThis document has been generated on 01/02/2024 based on the Basel Framework data available on the BIS website (www.bis.org).\n\n© Bank for International Settlements 2024. All rights reserved.\n\nContents\n\nDefinitions and application\n\nRequirements\n\nExposure measurement\n\nLarge exposure rules for global systemically important banks\n\n1/20",
 '4',
 '9',
 '11',
 '21\n\nLEX10\n\nDefinitions and application\n\nCross references to LEX30 updated.\n\nVersion effective as of 01 Jan 2023\n\nCross references to LEX30 updated.\n\n2/20\n\nRationale and objectives 
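
Because the pattern contains a capturing group, `re.split` keeps the matched numbers in the result, which is why standalone entries such as `'4'` and `'11'` appear in the output above. The sketch below, on a made-up sample string, shows one way to pair each captured number with the text that follows it. The function name `split_numbered` is an assumption for illustration only.

```python
import re
from typing import List, Tuple

PATTERN = r"\n\n([0-9]+(?:\.[0-9]+)?)\n\n"


def split_numbered(text: str) -> List[Tuple[str, str]]:
    """Split on standalone numbers and pair each number with the text after it."""
    parts = re.split(PATTERN, text)
    # With one capturing group the result alternates:
    # [preamble, number_1, body_1, number_2, body_2, ...]
    pairs = [("", parts[0])]
    for i in range(1, len(parts) - 1, 2):
        pairs.append((parts[i], parts[i + 1]))
    return pairs


# Made-up sample text, not taken from the BCBS documents.
sample = "Intro text\n\n10.1\n\nFirst paragraph.\n\n10.2\n\nSecond paragraph."
for number, body in split_numbered(sample):
    print(repr(number), repr(body))
```

Keeping the number alongside its paragraph would let it be stored in the chunk metadata instead of being embedded as a separate, nearly empty chunk.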

In [11]:
from langchain.docstore.document import Document
import re

loader = GCSDirectoryLoader(project_name = PROJECT_ID, bucket = "grey-gradient-02")
documents = loader.load()

#text_splitter = RecursiveCharacterTextSplitter(chunk_size = 400, chunk_overlap=12)
#docs = text_splitter.split_documents(documents)
#print(f"# of documents = {len(docs)}")

page_content = documents[0].page_content
#page_content = page_content.replace('\n', ' ')
#page_content = page_content.replace('\r', ' ')
#page_content = page_content.replace('\t', ' ')
#sentence_list = page_content.split('.')
#stripped_sentences = []
#for sentence in sentence_list:
  #str_sen = sentence.strip()
  #stripped_sentences.append(str_sen)
#stripped_sentences_no_num = [sentence for sentence in stripped_sentences if not sentence.isdigit()]

pattern = r'\n\n([0-9]+(?:\.[0-9]+)?)\n\n'
split_result = re.split(pattern, page_content)

chunked_doc = []
for para in split_result:
  chunked_doc.append(Document(page_content = para, metadata = documents[0].metadata))


REQUESTS_PER_MINUTE = 600

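# The keyword argument is `model_name` (lowercase). Vertex AI publishes its
# text embedding models under the `textembedding-gecko` name.
embedding = VertexAIEmbeddings(
    requests_per_minute = REQUESTS_PER_MINUTE,
    model_name = 'textembedding-gecko@001'
    )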

test_vector_db = Chroma.from_documents(chunked_doc, embedding)

# Expose the index to the retriever
retriever = test_vector_db.as_retriever(
    search_type="similarity",
    search_kwargs={"k":2}
)
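
To make the `search_type="similarity"` and `k=2` settings concrete, here is a toy sketch of top-k retrieval over hand-written vectors. It uses cosine similarity purely for illustration; the distance metric Chroma actually applies depends on how the collection is configured, and the vectors below are made up rather than real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, score) for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 2-D "embeddings" for three chunks and one query.
doc_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query_vec = [1.0, 0.1]
print(top_k(query_vec, doc_vecs, k=2))
```

With `k=2`, only the two best-scoring chunks are handed to the LLM, so a larger `k` trades a longer prompt for more context.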



In [12]:
llm = VertexAI(
    model_name = "text-bison",
    max_output_tokens = 256,
    temperature = 0.1,
    top_p = 0.9,
    top_k = 40,
    verbose = True)


qa = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff",
    retriever = retriever,
    verbose = True,
    return_source_documents = True
)
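
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below illustrates that idea in plain Python. The prompt wording and the function name `build_stuff_prompt` are assumptions for illustration and are not LangChain's actual template.

```python
from typing import List


def build_stuff_prompt(question: str, chunks: List[str]) -> str:
    """Concatenate every retrieved chunk into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for the retriever's two results.
chunks = ["Chunk about collateral.", "Chunk about unfunded credit protection."]
print(build_stuff_prompt("Which techniques are eligible?", chunks))
```

Since everything goes into one prompt, the combined size of the retrieved chunks has to fit within the model's context window, which is one reason to keep `k` small.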

In [28]:
query = "Explain the Eligible credit risk mitigation techniques, in bullet points"
result = qa({"query": query})
print(result['result'])



> Entering new RetrievalQA chain...

> Finished chain.
 **Eligible credit risk mitigation techniques include:**

- Techniques that meet the minimum requirements and eligibility criteria for the recognition of unfunded credit protection.
- Financial collateral that qualifies as eligible financial collateral under the standardized approach for risk-based capital requirement purposes.
