# Dataset for Embeddings Creation

**Author:** Cristian C. Velandia C.

**Creation Date:** 2024-03-02

This notebook creates a dataset, stored in Parquet format, with the data needed to generate the embeddings and then upsert them to the Pinecone vector database (VDB).

The data is cleaned before the embedding process.

In [1]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import MarkdownHeaderTextSplitter
import pandas as pd

---
## Data Loading

In [2]:
markdown_path2 = "D:\\Documents\\GitHub\\knowledge_pal_assistant\\0_data"
# Define the directory loader. Be careful with use_multithreading: the files may be returned in a different order on every run
data4 = DirectoryLoader(markdown_path2, glob = "*.md", recursive = True, loader_cls = TextLoader, use_multithreading = True, show_progress = True, sample_seed = 1).load()

  0%|          | 0/336 [00:00<?, ?it/s]

100%|██████████| 336/336 [00:00<00:00, 600.44it/s]
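
Because the document ids assigned later in this notebook depend on the order of the loaded files, it can help to sort the documents by source path after loading. The following is a minimal sketch of that idea. `Doc` and `sort_by_source` are hypothetical names used only for illustration; `Doc` is a stand-in that mimics the `page_content` and `metadata` attributes of the loaded documents.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def sort_by_source(docs):
    """Return the documents ordered by their 'source' path, so the order is reproducible."""
    return sorted(docs, key=lambda d: d.metadata["source"])


# Two documents that arrive in a non-deterministic order
docs = [
    Doc("b text", {"source": "0_data/b.md"}),
    Doc("a text", {"source": "0_data/a.md"}),
]
ordered = sort_by_source(docs)
print([d.metadata["source"] for d in ordered])
```

Applied to the loader output, the same sort would make the ids stable across runs, assuming the set of files does not change.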


In [3]:
data4[0].metadata["source"]

'D:\\Documents\\GitHub\\knowledge_pal_assistant\\0_data\\aws-properties-events-rule-sagemakerpipelineparameter.md'

In [4]:
len(data4)

336

---
## Data Chunking and Cleaning

In [5]:
# Create a dictionary mapping each source path to its content, so the source can later be added to each chunk
final_data = {x.metadata["source"]: x.page_content for x in data4}

In [6]:
len(final_data)

336

In [7]:
# Check the result for the first document
final_data[data4[0].metadata["source"]]

'# AWS::Events::Rule SageMakerPipelineParameter<a name="aws-properties-events-rule-sagemakerpipelineparameter"></a>\n\nName/Value pair of a parameter to start execution of a SageMaker Model Building Pipeline\\.\n\n## Syntax<a name="aws-properties-events-rule-sagemakerpipelineparameter-syntax"></a>\n\nTo declare this entity in your AWS CloudFormation template, use the following syntax:\n\n### JSON<a name="aws-properties-events-rule-sagemakerpipelineparameter-syntax.json"></a>\n\n```\n{\n  "[Name](#cfn-events-rule-sagemakerpipelineparameter-name)" : String,\n  "[Value](#cfn-events-rule-sagemakerpipelineparameter-value)" : String\n}\n```\n\n### YAML<a name="aws-properties-events-rule-sagemakerpipelineparameter-syntax.yaml"></a>\n\n```\n  [Name](#cfn-events-rule-sagemakerpipelineparameter-name): String\n  [Value](#cfn-events-rule-sagemakerpipelineparameter-value): String\n```\n\n## Properties<a name="aws-properties-events-rule-sagemakerpipelineparameter-properties"></a>\n\n`Name`  <a name=

In [8]:
# Define over which headers the data will be split and chunked
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
# Define the splitter object
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers = False)
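
To make the idea of header-based splitting concrete, here is a simplified pure-Python sketch. It is not the library's implementation: `split_on_headers` is a hypothetical helper, it ignores code fences, and it keeps the header line inside each chunk (similar in spirit to `strip_headers = False`).

```python
def split_on_headers(text, headers):
    """Split markdown text into chunks at the given header prefixes.

    headers: list of (prefix, label) pairs, e.g. [("#", "Header 1"), ("##", "Header 2")].
    Returns a list of dicts with 'page_content' and 'metadata' (the headers in effect).
    Simplification: lines inside code fences are not treated specially.
    """
    # Check longer prefixes first so that "##" is never mistaken for "#"
    ordered = sorted(headers, key=lambda h: len(h[0]), reverse=True)
    chunks, current_lines, active = [], [], {}

    def flush():
        # Close the chunk collected so far, tagging it with the active headers
        content = "\n".join(current_lines).strip()
        if content:
            chunks.append({"page_content": content, "metadata": dict(active)})
        current_lines.clear()

    for line in text.splitlines():
        match = next(((p, lab) for p, lab in ordered if line.startswith(p + " ")), None)
        if match:
            flush()
            prefix, label = match
            # A new header invalidates headers at the same or a deeper level
            for other_prefix, other_label in headers:
                if len(other_prefix) >= len(prefix):
                    active.pop(other_label, None)
            active[label] = line[len(prefix):].strip()
        current_lines.append(line)
    flush()
    return chunks


sample = "# Title\nIntro line\n## Syntax\nSome syntax\n## Properties\nSome props"
chunks = split_on_headers(sample, [("#", "Header 1"), ("##", "Header 2")])
for c in chunks:
    print(c["metadata"], "->", repr(c["page_content"]))
```

Each chunk carries the headers it sits under, which is what later allows the header hierarchy to be stored as metadata next to the chunk text.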

In [9]:
md_header_splits = []  # Accumulates the chunks of the whole corpus

# Go through all sources and documents; doc_id avoids shadowing the built-in id()
for doc_id, (source, doc_text) in enumerate(final_data.items()):
    for i, chunk in enumerate(markdown_splitter.split_text(doc_text)):
        # Append the source path to the chunk text
        page_content = chunk.page_content + "\n 'data source =  {0}'".format(source)
        md_header_splits.append({"id": doc_id, "chunk": i, "page_content": page_content,
                                 "source": source, "metadata": chunk.metadata})

In [10]:
len(md_header_splits)

1046

In [11]:
md_header_splits[0]

{'id': 0,
 'chunk': 0,
 'page_content': '# AWS::Events::Rule SageMakerPipelineParameter<a name="aws-properties-events-rule-sagemakerpipelineparameter"></a>  \nName/Value pair of a parameter to start execution of a SageMaker Model Building Pipeline\\.\n \'data source =  D:\\Documents\\GitHub\\knowledge_pal_assistant\\0_data\\aws-properties-events-rule-sagemakerpipelineparameter.md\'',
 'source': 'D:\\Documents\\GitHub\\knowledge_pal_assistant\\0_data\\aws-properties-events-rule-sagemakerpipelineparameter.md',
 'metadata': {'Header 1': 'AWS::Events::Rule SageMakerPipelineParameter<a name="aws-properties-events-rule-sagemakerpipelineparameter"></a>'}}

In [12]:
# Create a DataFrame from the list of dicts to store, deduplicate, and clean the chunks
final_df = pd.DataFrame(md_header_splits)

In [13]:
# Optional: remove stop words
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Uncomment on first use to download the stop-word list
# (word_tokenize also needs NLTK's punkt tokenizer data)
# from nltk import download
# download('stopwords')

stop_words = set(stopwords.words('english'))
# Generate tokens
final_df["tokens"] = final_df["page_content"].apply(word_tokenize)
# Remove stop words (case-insensitive match)
final_df["nostopw_page_content"] = final_df["tokens"].apply(lambda x: " ".join([w for w in x if w.lower() not in stop_words]))
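
The stop-word filter above can be illustrated without NLTK. The sketch below is only an approximation under stated assumptions: `STOP_WORDS` is a tiny hand-written list (NLTK's English list is much larger), `remove_stop_words` is a hypothetical helper, and tokens are split on whitespace, which is cruder than `word_tokenize`.

```python
# Tiny illustrative stop-word list; an assumption, not NLTK's actual list
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "with"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words from text, comparing in lower case and keeping the original casing."""
    tokens = text.split()  # whitespace tokenization, simpler than word_tokenize
    return " ".join(t for t in tokens if t.lower() not in stop_words)


cleaned = remove_stop_words("Automating Amazon SageMaker with Amazon EventBridge")
print(cleaned)
```

Removing stop words shortens the text, but it also changes what is embedded, so it is worth comparing retrieval quality with and without this step.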

In [14]:
final_df.head()

Unnamed: 0,id,chunk,page_content,source,metadata,tokens,nostopw_page_content
0,0,0,# AWS::Events::Rule SageMakerPipelineParameter...,D:\Documents\GitHub\knowledge_pal_assistant\0_...,{'Header 1': 'AWS::Events::Rule SageMakerPipel...,"[#, AWS, :, :Events, :, :Rule, SageMakerPipeli...",# AWS : :Events : :Rule SageMakerPipelineParam...
1,0,1,"## Syntax<a name=""aws-properties-events-rule-s...",D:\Documents\GitHub\knowledge_pal_assistant\0_...,{'Header 1': 'AWS::Events::Rule SageMakerPipel...,"[#, #, Syntax, <, a, name=, '', aws-properties...",# # Syntax < name= '' aws-properties-events-ru...
2,0,2,"## Properties<a name=""aws-properties-events-ru...",D:\Documents\GitHub\knowledge_pal_assistant\0_...,{'Header 1': 'AWS::Events::Rule SageMakerPipel...,"[#, #, Properties, <, a, name=, '', aws-proper...",# # Properties < name= '' aws-properties-event...
3,1,0,# Automating Amazon SageMaker with Amazon Even...,D:\Documents\GitHub\knowledge_pal_assistant\0_...,{'Header 1': 'Automating Amazon SageMaker with...,"[#, Automating, Amazon, SageMaker, with, Amazo...",# Automating Amazon SageMaker Amazon EventBrid...
4,1,1,"## Training job state change<a name=""eventbrid...",D:\Documents\GitHub\knowledge_pal_assistant\0_...,{'Header 1': 'Automating Amazon SageMaker with...,"[#, #, Training, job, state, change, <, a, nam...",# # Training job state change < name= '' event...


In [15]:
# Add the chunk number, source, and chunk text to the metadata
final_df["metadata"] = final_df.loc[:, ["chunk", "source", "metadata", "page_content"]].apply(lambda x:  {**x["metadata"],**{"chunk" : x["chunk"], "source" : x["source"], "text" : x["page_content"]}}, axis=1)
final_df["metadata"].head()


0    {'Header 1': 'AWS::Events::Rule SageMakerPipel...
1    {'Header 1': 'AWS::Events::Rule SageMakerPipel...
2    {'Header 1': 'AWS::Events::Rule SageMakerPipel...
3    {'Header 1': 'Automating Amazon SageMaker with...
4    {'Header 1': 'Automating Amazon SageMaker with...
Name: metadata, dtype: object
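
The metadata merge relies on dictionary unpacking, where keys given later override earlier ones. A minimal sketch with plain dictionaries follows; `build_metadata` is a hypothetical helper and the values are made up for illustration.

```python
def build_metadata(header_meta, chunk, source, text):
    """Merge the header metadata with per-chunk fields; later keys win on conflict."""
    return {**header_meta, "chunk": chunk, "source": source, "text": text}


header_meta = {"Header 1": "Title", "Header 2": "Syntax"}
merged = build_metadata(header_meta, chunk=1, source="0_data/example.md", text="## Syntax ...")
print(merged)
# The original header dictionary is left untouched
print(header_meta)
```

Since the full chunk text is stored under `text`, the metadata grows with the chunk size, which may matter if the vector store limits metadata size.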

In [16]:
final_df["metadata"][4]

{'Header 1': 'Automating Amazon SageMaker with Amazon EventBridge<a name="automating-sagemaker-with-eventbridge"></a>',
 'Header 2': 'Training job state change<a name="eventbridge-training"></a>',
 'chunk': 1,
 'source': 'D:\\Documents\\GitHub\\knowledge_pal_assistant\\0_data\\automating-sagemaker-with-eventbridge.md',
 'text': '## Training job state change<a name="eventbridge-training"></a>  \nIndicates a change in the status of a SageMaker training job\\.  \nIf the value of `TrainingJobStatus` is `Failed`, the event contains the `FailureReason` field, which provides a description of why the training job failed\\.  \n```\n{\n"version": "0",\n"id": "844e2571-85d4-695f-b930-0153b71dcb42",\n"detail-type": "SageMaker Training Job State Change",\n"source": "aws.sagemaker",\n"account": "123456789012",\n"time": "2018-10-06T12:26:13Z",\n"region": "us-east-1",\n"resources": [\n"arn:aws:sagemaker:us-east-1:123456789012:training-job/kmeans-1"\n],\n"detail": {\n"TrainingJobName": "89c96cc8-dded-4

In [17]:
final_df[final_df["id"] == 0]

Unnamed: 0,id,chunk,page_content,source,metadata,tokens,nostopw_page_content
0,0,0,# AWS::Events::Rule SageMakerPipelineParameter...,D:\Documents\GitHub\knowledge_pal_assistant\0_...,{'Header 1': 'AWS::Events::Rule SageMakerPipel...,"[#, AWS, :, :Events, :, :Rule, SageMakerPipeli...",# AWS : :Events : :Rule SageMakerPipelineParam...
1,0,1,"## Syntax<a name=""aws-properties-events-rule-s...",D:\Documents\GitHub\knowledge_pal_assistant\0_...,{'Header 1': 'AWS::Events::Rule SageMakerPipel...,"[#, #, Syntax, <, a, name=, '', aws-properties...",# # Syntax < name= '' aws-properties-events-ru...
2,0,2,"## Properties<a name=""aws-properties-events-ru...",D:\Documents\GitHub\knowledge_pal_assistant\0_...,{'Header 1': 'AWS::Events::Rule SageMakerPipel...,"[#, #, Properties, <, a, name=, '', aws-proper...",# # Properties < name= '' aws-properties-event...


In [18]:
# Generate the final id as "<document id>-<chunk number>"
final_df["id"] = final_df["id"].astype('str') + "-" + final_df["chunk"].astype('str')
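
Since these ids will identify records in the vector database, a quick uniqueness check can be useful before saving. This is a small sketch of such a check; `make_ids` and `find_duplicates` are hypothetical helpers, and the pairs are sample values.

```python
from collections import Counter


def make_ids(pairs):
    """Build '<document id>-<chunk number>' strings from (doc_id, chunk) pairs."""
    return [f"{doc_id}-{chunk}" for doc_id, chunk in pairs]


def find_duplicates(ids):
    """Return the ids that occur more than once."""
    return [value for value, count in Counter(ids).items() if count > 1]


ids = make_ids([(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)])
print(ids)
print(find_duplicates(ids))
```

An empty duplicates list means every chunk has a distinct id, so a later write should not silently overwrite another chunk.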

In [19]:
# Drop columns that are now redundant (both are stored in the metadata)
final_df.drop(columns = ["chunk", "source"], inplace = True)

In [20]:
final_df.shape

(1046, 5)

In [21]:
final_df.head()

Unnamed: 0,id,page_content,metadata,tokens,nostopw_page_content
0,0-0,# AWS::Events::Rule SageMakerPipelineParameter...,{'Header 1': 'AWS::Events::Rule SageMakerPipel...,"[#, AWS, :, :Events, :, :Rule, SageMakerPipeli...",# AWS : :Events : :Rule SageMakerPipelineParam...
1,0-1,"## Syntax<a name=""aws-properties-events-rule-s...",{'Header 1': 'AWS::Events::Rule SageMakerPipel...,"[#, #, Syntax, <, a, name=, '', aws-properties...",# # Syntax < name= '' aws-properties-events-ru...
2,0-2,"## Properties<a name=""aws-properties-events-ru...",{'Header 1': 'AWS::Events::Rule SageMakerPipel...,"[#, #, Properties, <, a, name=, '', aws-proper...",# # Properties < name= '' aws-properties-event...
3,1-0,# Automating Amazon SageMaker with Amazon Even...,{'Header 1': 'Automating Amazon SageMaker with...,"[#, Automating, Amazon, SageMaker, with, Amazo...",# Automating Amazon SageMaker Amazon EventBrid...
4,1-1,"## Training job state change<a name=""eventbrid...",{'Header 1': 'Automating Amazon SageMaker with...,"[#, #, Training, job, state, change, <, a, nam...",# # Training job state change < name= '' event...


In [22]:
save_folder = "D:\\Documents\\GitHub\\knowledge_pal_assistant\\2_outputs\\"
final_df.to_parquet(save_folder + "chunks.parquet", index = False, engine = "pyarrow", compression= "brotli")