<a href="https://colab.research.google.com/github/PedroNunes99/LokaTechAssessment/blob/main/LokaAssessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Coding part for Loka Assessment

In [None]:
# install necessary packages in our environment
!pip install chromadb
!pip install sentence-transformers
!pip install langchain
!pip install langchainhub

In [None]:
# unzip the zip file with the dataset (if necessary)
!unzip sagemaker_documentation.zip

In [3]:
# All the necessary imports

import os

from langchain_community.document_loaders import DirectoryLoader, TextLoader # load every matching file in a directory as plain text
from langchain.text_splitter import RecursiveCharacterTextSplitter # splits text by recursively trying a list of separators
from langchain_community.embeddings import HuggingFaceEmbeddings # Hugging Face sentence-transformers embedding models
from langchain_community.vectorstores import Chroma # vector database library
from langchain_community.llms import HuggingFaceHub # Hugging Face Hub -> platform with open-source models
from langchain.chains import RetrievalQA # chain for question answering over a retriever


In [4]:
# set the environment variable with the key needed to access the Hugging Face API (prompted at runtime, never hardcoded)
from getpass import getpass
os.environ['HUGGING_FACE_HUB_API_KEY'] = getpass('Hugging Face API token: ')

In [5]:
sagemaker_doc_path = "sagemaker_documentation/"

# load all .md files from dataset
loader = DirectoryLoader(sagemaker_doc_path, glob="./*.md", loader_cls=TextLoader)
files = loader.load()

# create splitter; separators are tried in list order, so '\n' is applied before '\n\n'
splitter = RecursiveCharacterTextSplitter(separators=['\n', '\n\n', '#', '##', '###', ' ', ''],
                                          chunk_size=1000, chunk_overlap=200)

# split documents into chunks
docs = splitter.split_documents(files)
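
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not LangChain's algorithm: `RecursiveCharacterTextSplitter` additionally tries to break on the listed separators. The helper name `chunk_with_overlap` and the sample string are hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut at
    a window boundary still appears whole in one of the two neighbours.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

With the notebook's settings (`chunk_size=1000`, `chunk_overlap=200`), each window would advance by 800 characters in this simplified model.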

In [None]:
# initialize pre-trained embeddings from the Hugging Face platform (defaults to sentence-transformers/all-mpnet-base-v2)
embeddings = HuggingFaceEmbeddings()

In [7]:
# create the vector database from the generated chunks and the embeddings; persist_directory also saves it to disk
doc_search = Chroma.from_documents(docs, embeddings, persist_directory="./chroma_db")

# To load from disk, uncomment this:
# doc_search = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
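
Conceptually, the vector store answers a query by embedding it and returning the chunks whose vectors are closest. Below is a minimal sketch of that idea with toy 3-dimensional vectors and cosine similarity. The vectors, the chunk names, and the `top_k` helper are made up for illustration, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k):
    """Return the names of the k stored vectors most similar to query_vec."""
    scored = sorted(
        indexed.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# toy "embeddings" (assumption: real ones have hundreds of dimensions)
toy_index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.2, 0.0]
nearest = top_k(query_vec, toy_index, k=2)
print(nearest)
```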

In [8]:
repo_id = "tiiuae/falcon-7b-instruct"

# create pre-trained llm for text-generation and question-answering from huggingface api
llm = HuggingFaceHub(huggingfacehub_api_token = os.environ['HUGGING_FACE_HUB_API_KEY'],
                     repo_id=repo_id, model_kwargs={'temperature': 0.2, 'max_length':1000})

In [9]:

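# build the QA chain: retrieve the 3 most similar chunks and pass them to the llm in a single prompt ('stuff')
retrieval_chain = RetrievalQA.from_chain_type(
    llm,
    chain_type='stuff',
    return_source_documents=True,
    retriever=doc_search.as_retriever(search_kwargs={"k": 3})
)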
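
With `chain_type='stuff'`, the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question. The sketch below imitates that step in plain Python. The template text is only an approximation of LangChain's default QA prompt, and `build_stuff_prompt` is a hypothetical helper, not a LangChain function.

```python
# Approximation of the default QA prompt (assumption: exact wording may differ)
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is SageMaker?",
)
print(prompt)
```

The trailing `Helpful Answer:` marker is the string that `process_llm_response` splits on to separate the generated answer from the prompt.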

In [10]:
# process_sources -> prints the unique source files of the retrieved chunks
def process_sources(sources):
  # get list of sources (removing duplicates)
  source_values = list({doc.metadata['source'] for doc in sources})
  print('Sources:')
  for source in source_values:
    print(source)


# process_llm_response -> prints the Question, Answer and Sources
def process_llm_response(response):
  query = response['query']
  # remove paragraph breaks (double newlines) from the generated text
  result = response['result'].replace('\n\n', '')
  sources = response['source_documents']
  # keep only the text after the prompt's 'Helpful Answer:' marker
  if 'Helpful Answer:' in result:
    final_result = result.split('Helpful Answer:')[1]
  else:
    final_result = 'Sorry but I do not have information for your question.'
  print('Question: ', query)
  print()
  print('Answer: ', final_result)
  print()
  process_sources(sources)
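
The parsing step can also be written as a pure function that returns the answer instead of printing it, which makes it easy to check on its own. This is a sketch of the same logic as `process_llm_response`; the name `extract_answer` and the sample strings are illustrative assumptions, not real model output.

```python
FALLBACK = 'Sorry but I do not have information for your question.'
MARKER = 'Helpful Answer:'


def extract_answer(result: str) -> str:
    """Return the text after the first 'Helpful Answer:' marker, or a fallback."""
    result = result.replace('\n\n', '')  # same paragraph-break removal as the notebook
    if MARKER not in result:
        return FALLBACK
    return result.split(MARKER)[1]


# made-up response text for illustration
sample_result = "Some context.\n\nQuestion: example question?\nHelpful Answer: example answer"
print(extract_answer(sample_result))
print(extract_answer("text without the marker"))
```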

### Testing examples from assessment file



In [11]:
query1 = "What is SageMaker?"
process_llm_response(retrieval_chain.invoke(query1))

Question:  What is SageMaker?

Answer:  SageMaker is a cloud\-based machine learning platform that allows you to build, train, and deploy machine learning models. It provides a variety of tools and services to help you build and deploy machine learning models quickly and easily.SageMaker is a cloud\-based machine learning platform that allows you to build, train, and deploy machine learning models. It provides a variety of tools and services to help you build and deploy machine learning models quickly and easily.

Sources:
sagemaker_documentation/sagemaker-marketplace.md


In [12]:
query2 = "What are all AWS regions where SageMaker is available?"
process_llm_response(retrieval_chain.invoke(query2))

Question:  What are all AWS regions where SageMaker is available?

Answer:  SageMaker is available in all AWS regions.

Sources:
sagemaker_documentation/sagemaker-mkt-find-subscribe.md


In [13]:
query3 = "How to check if an endpoint is KMS encrypted?"
process_llm_response(retrieval_chain.invoke(query3))

Question:  How to check if an endpoint is KMS encrypted?

Answer:  To check if an endpoint is KMS encrypted, you can use the following command:```
aws kms get-key-policy --name <name of the endpoint> --query "KeyPolicyArn" --output text --region <region of the endpoint>
```This will return the ARN of the key policy associated with the endpoint. If the ARN is present, it means that the endpoint is KMS encrypted.

Sources:
sagemaker_documentation/aws-resource-sagemaker-endpointconfig.md
sagemaker_documentation/kubernetes-sagemaker-components-install.md


In [14]:
query4 = "What are SageMaker Geospatial capabilities?"
process_llm_response(retrieval_chain.invoke(query4))

Question:  What are SageMaker Geospatial capabilities?

Answer:  SageMaker Geospatial capabilities include the ability to use geospatial data to train and deploy machine learning models. You can use geospatial data to train and deploy models that can be used to analyze location\-based data, such as predicting sales for a store in a specific location. You can also use geospatial data to train and deploy models that can be used to analyze location\-based data, such as predicting sales for a store in a specific location.


Sources:
sagemaker_documentation/sagemaker-projects-whatis.md
sagemaker_documentation/sagemaker-marketplace.md
