## Exploratory Data Analysis of the AWS documentation

**Author:** Cristian C. Velandia C.

**Creation Date:** 2024-03-01

This EDA inspects the different types of Markdown files found in the dataset. The analysis helps identify and select the most suitable strategy for chunking the data for RAG.
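One way to compare file types before choosing a chunking strategy is to profile how many headers of each level a file contains. The sketch below is a minimal, standard-library illustration; the function name `count_header_levels` and the sample text are assumptions made for this example, not part of the original notebook.

```python
import re
from collections import Counter

# ATX header: 1 to 6 '#' characters followed by whitespace and some text.
HEADER_RE = re.compile(r"^(#{1,6})\s+\S")
# Opening or closing line of a fenced code block.
FENCE_RE = re.compile(r"^\s*(```|~~~)")


def count_header_levels(markdown_text: str) -> Counter:
    """Count ATX headers per level, ignoring lines inside fenced code blocks."""
    counts = Counter()
    in_fence = False
    for line in markdown_text.splitlines():
        if FENCE_RE.match(line):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADER_RE.match(line)
        if match:
            counts[len(match.group(1))] += 1
    return counts


# Hypothetical sample that mimics the shape of the AWS documentation files.
sample = (
    "# Title\n\nIntro\n\n## Syntax\n\n```\n# not a header\n```\n\n"
    "### JSON\n\n## Properties\n"
)
profile = count_header_levels(sample)
print(dict(profile))
```

Files dominated by level-2 headers are likely good candidates for header-based splitting, while files with few headers may need a size-based splitter instead.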

In [128]:
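# Clean file names to remove spaces from them
import os

data_dir = r"D:\Documents\GitHub\knowledge_pal_assistant\0_data"
for f in os.listdir(data_dir):
    r = f.replace(" ", "")
    if r != f:
        # os.listdir returns bare names, so join them with the directory before renaming
        os.rename(os.path.join(data_dir, f), os.path.join(data_dir, r))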
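Removing spaces can make two different file names collide (for example `a b.md` and `ab.md`). A possible collision-aware variant using `pathlib` is sketched below; the helper name `strip_spaces_from_names` and the temporary demo directory are assumptions for illustration only.

```python
import tempfile
from pathlib import Path


def strip_spaces_from_names(directory: Path) -> list[tuple[str, str]]:
    """Rename files whose names contain spaces; return (old, new) name pairs."""
    renamed = []
    # Materialize the listing first so renames do not disturb the iteration.
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        new_name = path.name.replace(" ", "")
        if new_name == path.name:
            continue
        target = path.with_name(new_name)
        if target.exists():
            # Skip instead of overwriting an existing file.
            continue
        path.rename(target)
        renamed.append((path.name, new_name))
    return renamed


# Demo on a throwaway directory rather than the real data folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "sage maker.md").write_text("# demo")
    (demo_dir / "clean.md").write_text("# demo")
    result = strip_spaces_from_names(demo_dir)
    remaining = sorted(p.name for p in demo_dir.iterdir())

print(result)
print(remaining)
```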

### Check data with Markdown loader

In [129]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

In [130]:
markdown_path = r"D:\Documents\GitHub\knowledge_pal_assistant\0_data\amazon-sagemaker-toolkits.md"
loader = UnstructuredMarkdownLoader(markdown_path)

In [131]:
data = loader.load()

In [132]:
data

[Document(page_content="Using the SageMaker Training and Inference Toolkits\n\nThe SageMaker Training and SageMaker Inference toolkits implement the functionality that you need to adapt your containers to run scripts, train algorithms, and deploy models on SageMaker. When installed, the library defines the following for users:\n+ The locations for storing code and other resources. \n+ The entry point that contains the code to run when the container is started. Your Dockerfile must copy the code that needs to be run into the location expected by a container that is compatible with SageMaker. \n+ Other information that a container needs to manage deployments for training and inference.\n\nSageMaker Toolkits Containers Structure\n\nWhen SageMaker trains a model, it creates the following file folder structure in the container's /opt/ml directory.\n\n/opt/ml\n├── input\n│   ├── config\n│   │   ├── hyperparameters.json\n│   │   └── resourceConfig.json\n│   └── data\n│       └── <channel_name

In [133]:
data[0].page_content

"Using the SageMaker Training and Inference Toolkits\n\nThe SageMaker Training and SageMaker Inference toolkits implement the functionality that you need to adapt your containers to run scripts, train algorithms, and deploy models on SageMaker. When installed, the library defines the following for users:\n+ The locations for storing code and other resources. \n+ The entry point that contains the code to run when the container is started. Your Dockerfile must copy the code that needs to be run into the location expected by a container that is compatible with SageMaker. \n+ Other information that a container needs to manage deployments for training and inference.\n\nSageMaker Toolkits Containers Structure\n\nWhen SageMaker trains a model, it creates the following file folder structure in the container's /opt/ml directory.\n\n/opt/ml\n├── input\n│   ├── config\n│   │   ├── hyperparameters.json\n│   │   └── resourceConfig.json\n│   └── data\n│       └── <channel_name>\n│           └── <inp

In [134]:
markdown_path1 = r"D:\Documents\GitHub\knowledge_pal_assistant\0_data\aws-properties-events-rule-sagemakerpipelineparameter.md"
loader1 = UnstructuredMarkdownLoader(markdown_path1, mode="elements")
data1 = loader1.load()

In [135]:
for d in data1:
    print("-"*100)
    print(d)

----------------------------------------------------------------------------------------------------
page_content='AWS::Events::Rule SageMakerPipelineParameter' metadata={'source': 'D:\\Documents\\GitHub\\knowledge_pal_assistant\\0_data\\aws-properties-events-rule-sagemakerpipelineparameter.md', 'last_modified': '2024-03-01T15:34:44', 'page_number': 1, 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': 'D:\\Documents\\GitHub\\knowledge_pal_assistant\\0_data', 'filename': 'aws-properties-events-rule-sagemakerpipelineparameter.md', 'category': 'Title'}
----------------------------------------------------------------------------------------------------
page_content='Name/Value pair of a parameter to start execution of a SageMaker Model Building Pipeline.' metadata={'source': 'D:\\Documents\\GitHub\\knowledge_pal_assistant\\0_data\\aws-properties-events-rule-sagemakerpipelineparameter.md', 'last_modified': '2024-03-01T15:34:44', 'page_number': 1, 'languages': ['eng'], 'pa

## Split Markdown on custom tags

In [136]:
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_community.document_loaders import TextLoader

In [137]:
# just ingest the Markdown file raw
data2 = TextLoader(markdown_path1).load()

In [138]:
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers = False)
md_header_splits = markdown_splitter.split_text(str(data2[0].page_content))
md_header_splits

[Document(page_content='# AWS::Events::Rule SageMakerPipelineParameter<a name="aws-properties-events-rule-sagemakerpipelineparameter"></a>  \nName/Value pair of a parameter to start execution of a SageMaker Model Building Pipeline\\.', metadata={'Header 1': 'AWS::Events::Rule SageMakerPipelineParameter<a name="aws-properties-events-rule-sagemakerpipelineparameter"></a>'}),
 Document(page_content='## Syntax<a name="aws-properties-events-rule-sagemakerpipelineparameter-syntax"></a>  \nTo declare this entity in your AWS CloudFormation template, use the following syntax:  \n### JSON<a name="aws-properties-events-rule-sagemakerpipelineparameter-syntax.json"></a>  \n```\n{\n"[Name](#cfn-events-rule-sagemakerpipelineparameter-name)" : String,\n"[Value](#cfn-events-rule-sagemakerpipelineparameter-value)" : String\n}\n```  \n### YAML<a name="aws-properties-events-rule-sagemakerpipelineparameter-syntax.yaml"></a>  \n```\n[Name](#cfn-events-rule-sagemakerpipelineparameter-name): String\n[Value](#

In [139]:
for d in md_header_splits:
    print("-"*100)
    print(d)

----------------------------------------------------------------------------------------------------
page_content='# AWS::Events::Rule SageMakerPipelineParameter<a name="aws-properties-events-rule-sagemakerpipelineparameter"></a>  \nName/Value pair of a parameter to start execution of a SageMaker Model Building Pipeline\\.' metadata={'Header 1': 'AWS::Events::Rule SageMakerPipelineParameter<a name="aws-properties-events-rule-sagemakerpipelineparameter"></a>'}
----------------------------------------------------------------------------------------------------
page_content='## Syntax<a name="aws-properties-events-rule-sagemakerpipelineparameter-syntax"></a>  \nTo declare this entity in your AWS CloudFormation template, use the following syntax:  \n### JSON<a name="aws-properties-events-rule-sagemakerpipelineparameter-syntax.json"></a>  \n```\n{\n"[Name](#cfn-events-rule-sagemakerpipelineparameter-name)" : String,\n"[Value](#cfn-events-rule-sagemakerpipelineparameter-value)" : String\n}\n
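To make the behaviour seen in the output above easier to reason about, the following is a simplified, standard-library sketch of header-based splitting. It is not LangChain's implementation: `split_on_headers` and the sample text are hypothetical, and the sketch only mirrors two observed properties, namely that the header line stays in the chunk (as with `strip_headers = False`) and that `###` headers stay inside their parent `##` chunk when only levels 1 and 2 are configured.

```python
import re


def split_on_headers(text: str, max_level: int = 2) -> list[dict]:
    """Split Markdown into chunks that each start at a header of level <= max_level."""
    header_re = re.compile(r"^(#{1,%d})\s+(.*)$" % max_level)
    chunks = []
    current = {"header": None, "lines": []}
    in_fence = False
    for line in text.splitlines():
        # Toggle on fence lines so '#' comments inside code are not treated as headers.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else header_re.match(line)
        if match:
            if current["lines"]:
                chunks.append(current)
            current = {"header": match.group(2).strip(), "lines": [line]}
        else:
            current["lines"].append(line)
    if current["lines"]:
        chunks.append(current)
    return [
        {"header": c["header"], "page_content": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


# Hypothetical sample shaped like the CloudFormation reference pages.
sample = (
    "# Rule\nIntro text.\n## Syntax\n### JSON\n```\n# comment\n```\n"
    "## Properties\nName: String\n"
)
chunks = split_on_headers(sample)
for chunk in chunks:
    print("-" * 40)
    print(chunk)
```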

---

In [140]:
markdown_path2 = "D:\\Documents\\GitHub\\knowledge_pal_assistant\\0_data\\sagemaker-controls.md"
data3 = TextLoader(markdown_path2).load()

In [141]:
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers = False)
md_header_splits = markdown_splitter.split_text(str(data3[0].page_content))


In [142]:
for d in md_header_splits:
    print("-"*100)
    print(d)

----------------------------------------------------------------------------------------------------
page_content='# Amazon SageMaker controls<a name="sagemaker-controls"></a>  \nThese controls are related to SageMaker resources\\.' metadata={'Header 1': 'Amazon SageMaker controls<a name="sagemaker-controls"></a>'}
----------------------------------------------------------------------------------------------------
page_content='## \\[SageMaker\\.1\\] Amazon SageMaker notebook instances should not have direct internet access<a name="sagemaker-1"></a>  \n**Related requirements:** PCI DSS v3\\.2\\.1/1\\.2\\.1, PCI DSS v3\\.2\\.1/1\\.3\\.1, PCI DSS v3\\.2\\.1/1\\.3\\.2, PCI DSS v3\\.2\\.1/1\\.3\\.4, PCI DSS v3\\.2\\.1/1\\.3\\.6, NIST\\.800\\-53\\.r5 AC\\-21, NIST\\.800\\-53\\.r5 AC\\-3, NIST\\.800\\-53\\.r5 AC\\-3\\(7\\), NIST\\.800\\-53\\.r5 AC\\-4, NIST\\.800\\-53\\.r5 AC\\-4\\(21\\), NIST\\.800\\-53\\.r5 AC\\-6, NIST\\.800\\-53\\.r5 SC\\-7, NIST\\.800\\-53\\.r5 SC\\-7\\(11\\), NIST\\.80
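The chunks above still carry `<a name="..."></a>` anchors and backslash-escaped punctuation (`\.`, `\[`, `\-`), which add noise to both the text and the header metadata. A possible cleaning step, applied before splitting, is sketched below. The regular expressions and the name `clean_markdown` are assumptions for this example; note that it would also unescape characters inside code blocks, which may or may not be desirable.

```python
import re

# Empty named anchors such as <a name="sagemaker-controls"></a>.
ANCHOR_RE = re.compile(r'<a name="[^"]*"></a>')
# A backslash followed by a Markdown-escapable punctuation character.
ESCAPE_RE = re.compile(r"\\([\\`*_{}\[\]()#+\-.!])")


def clean_markdown(text: str) -> str:
    """Remove empty named anchors and unescape backslash-escaped punctuation."""
    text = ANCHOR_RE.sub("", text)
    return ESCAPE_RE.sub(r"\1", text)


raw = (
    '# Amazon SageMaker controls<a name="sagemaker-controls"></a>\n'
    "These controls are related to SageMaker resources\\.\n"
    "## \\[SageMaker\\.1\\] Notebook instances<a name=\"sagemaker-1\"></a>\n"
    "NIST\\.800\\-53\\.r5 AC\\-3\\(7\\)"
)
cleaned = clean_markdown(raw)
print(cleaned)
```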

---
### Test directory loader to speed up data consumption

In [143]:
from langchain_community.document_loaders import DirectoryLoader

In [144]:
markdown_path2 = "D:\\Documents\\GitHub\\knowledge_pal_assistant\\0_data"
data4 = DirectoryLoader(
    markdown_path2,
    glob="*.md",
    recursive=True,
    loader_cls=TextLoader,
    use_multithreading=True,
    show_progress=True,
).load()

100%|██████████| 336/336 [00:00<00:00, 6224.51it/s]


In [145]:
data4[0].metadata["source"]

'D:\\Documents\\GitHub\\knowledge_pal_assistant\\0_data\\amazon-sagemaker-toolkits.md'

In [146]:
# Create a hashmap of documents so the source can be added to each chunk afterwards
final_data = {x.metadata["source"]: x.page_content for x in data4}

In [147]:
final_data[data4[0].metadata["source"]]

'# Using the SageMaker Training and Inference Toolkits<a name="amazon-sagemaker-toolkits"></a>\n\nThe [SageMaker Training](https://github.com/aws/sagemaker-training-toolkit) and [SageMaker Inference](https://github.com/aws/sagemaker-inference-toolkit) toolkits implement the functionality that you need to adapt your containers to run scripts, train algorithms, and deploy models on SageMaker\\. When installed, the library defines the following for users:\n+ The locations for storing code and other resources\\. \n+ The entry point that contains the code to run when the container is started\\. Your Dockerfile must copy the code that needs to be run into the location expected by a container that is compatible with SageMaker\\. \n+ Other information that a container needs to manage deployments for training and inference\\. \n\n## SageMaker Toolkits Containers Structure<a name="sagemaker-toolkits-structure"></a>\n\nWhen SageMaker trains a model, it creates the following file folder structure 

In [148]:
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers = False)

In [None]:
# from langchain.document_transformers import  (unfinished import; this cell was never executed)

In [160]:
md_header_splits = []

for source, docs in final_data.items():
    for chunk in markdown_splitter.split_text(docs):
        md_header_splits.append({
            "page_content": chunk.page_content + "\n 'data source =  {0}'".format(source),
            "metadata": chunk.metadata,
        })
    break  # only process the first file while testing

In [162]:
md_header_splits[0]

{'page_content': '# Using the SageMaker Training and Inference Toolkits<a name="amazon-sagemaker-toolkits"></a>  \nThe [SageMaker Training](https://github.com/aws/sagemaker-training-toolkit) and [SageMaker Inference](https://github.com/aws/sagemaker-inference-toolkit) toolkits implement the functionality that you need to adapt your containers to run scripts, train algorithms, and deploy models on SageMaker\\. When installed, the library defines the following for users:\n+ The locations for storing code and other resources\\.\n+ The entry point that contains the code to run when the container is started\\. Your Dockerfile must copy the code that needs to be run into the location expected by a container that is compatible with SageMaker\\.\n+ Other information that a container needs to manage deployments for training and inference\\.\n \'data source =  D:\\Documents\\GitHub\\knowledge_pal_assistant\\0_data\\amazon-sagemaker-toolkits.md\'',
 'metadata': {'Header 1': 'Using the SageMaker T
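Appending the source path to `page_content` means the file path is embedded together with the text. An alternative worth considering is to keep the source in the chunk metadata, next to the header fields. The sketch below illustrates the idea with plain dictionaries shaped like the chunk above; `attach_source` and the sample values are hypothetical.

```python
def attach_source(chunks: list[dict], source: str) -> list[dict]:
    """Return copies of the chunks with the source path stored in their metadata."""
    return [
        {
            "page_content": chunk["page_content"],
            # Copy the metadata so the original chunk dictionaries stay untouched.
            "metadata": {**chunk["metadata"], "source": source},
        }
        for chunk in chunks
    ]


# Hypothetical chunks shaped like the header-split output.
example_chunks = [
    {"page_content": "# Title\nIntro.", "metadata": {"Header 1": "Title"}},
    {
        "page_content": "## Section\nBody.",
        "metadata": {"Header 1": "Title", "Header 2": "Section"},
    },
]
tagged = attach_source(example_chunks, "0_data/example.md")
print(tagged[1]["metadata"])
```

With this layout the chunk text stays unchanged, and the source can still be shown to the user or used as a filter at retrieval time.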