# Pre-processing and Embedding Documentation

## Introduction
In this notebook, we will cover the steps involved in pre-processing documents and generating embeddings using OpenAI's models. The process includes loading documents, pre-processing and data exploration, tokenizing, handling documents of different token lengths, and finally, generating embeddings.

## 1. Document Loading
In this section, we will load the documents that we will be working with.


In [87]:
import zipfile
import os

# ZIP folder path
zip_path = 'RawData/sagemaker_documentation.zip'
# Unzipped folder path
unzip_path = 'RawData/'

# Unzip the folder
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(unzip_path)


# List the files in the unzipped folder
files = os.listdir('RawData/sagemaker_documentation')
print("Number of files in the unzipped folder: ", len(files))


Number of files in the unzipped folder:  336


In [88]:
# Build the path to the unzipped folder
# (os.listdir order is arbitrary, and RawData/ also contains the ZIP file itself)
unzipped_folder = os.path.join(unzip_path, 'sagemaker_documentation')
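
The folder name can also be read from the archive itself instead of being typed in or taken from a directory listing. The following is a minimal sketch, assuming the archive contains exactly one top-level folder. The helper name `top_level_folder` and the in-memory demo archive are illustrative and not part of the original notebook.

```python
import io
import zipfile


def top_level_folder(zip_file):
    """Return the single top-level folder name inside a ZIP archive.

    Raises ValueError if the archive has zero or several top-level entries.
    """
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        # Each name looks like 'folder/file.md'; keep the part before the first '/'.
        roots = {name.split('/', 1)[0] for name in zip_ref.namelist()}
    if len(roots) != 1:
        raise ValueError(f"Expected one top-level folder, found: {sorted(roots)}")
    return roots.pop()


# Demo with a small in-memory archive (hypothetical file names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as demo_zip:
    demo_zip.writestr('sagemaker_documentation/a.md', '# A')
    demo_zip.writestr('sagemaker_documentation/b.md', '# B')
buffer.seek(0)

print(top_level_folder(buffer))  # sagemaker_documentation
```

In the notebook, the same helper could be called with `zip_path` and the result joined onto `unzip_path` with `os.path.join`.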


## 2. Pre-processing and Data Exploration
In this section, we load the Markdown files into a DataFrame, strip HTML and Markdown markup, and count tokens with `tiktoken`, the tokenizer library for OpenAI's models.


In [89]:
import os
import pandas as pd

# Directory containing .md files
directory = unzipped_folder

# List to hold the data for the DataFrame
data = []

# Iterate over all files in the directory
for filename in os.listdir(directory):
    if filename.endswith('.md'):
        filepath = os.path.join(directory, filename)
        with open(filepath, 'r', encoding='utf-8') as file:
            content = file.read()
        data.append({'document_name': filename, 'content': content})

# Create the DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
df.head()

Unnamed: 0,document_name,content
0,amazon-sagemaker-toolkits.md,# Using the SageMaker Training and Inference T...
1,asff-resourcedetails-awssagemaker.md,"# AwsSageMaker<a name=""asff-resourcedetails-aw..."
2,automating-sagemaker-with-eventbridge.md,# Automating Amazon SageMaker with Amazon Even...
3,aws-properties-events-rule-sagemakerpipelinepa...,# AWS::Events::Rule SageMakerPipelineParameter...
4,aws-properties-events-rule-sagemakerpipelinepa...,# AWS::Events::Rule SageMakerPipelineParameter...


In [90]:
# Using beautifulsoup to preprocess the content
from bs4 import BeautifulSoup

# Function to preprocess the content
def preprocess_content(content):
    soup = BeautifulSoup(content, 'html.parser')
    return soup.get_text()

# Preprocess the content
df['content'] = df['content'].apply(preprocess_content)
df.head()

Unnamed: 0,document_name,content
0,amazon-sagemaker-toolkits.md,# Using the SageMaker Training and Inference T...
1,asff-resourcedetails-awssagemaker.md,# AwsSageMaker\n\nThe following are examples o...
2,automating-sagemaker-with-eventbridge.md,# Automating Amazon SageMaker with Amazon Even...
3,aws-properties-events-rule-sagemakerpipelinepa...,# AWS::Events::Rule SageMakerPipelineParameter...
4,aws-properties-events-rule-sagemakerpipelinepa...,# AWS::Events::Rule SageMakerPipelineParameter...


In [91]:
# Extract the title from the content: in these markdown files, the title is on the first line, after the # symbol
df['title'] = df['content'].apply(lambda x: x.split('\n')[0].replace('#', '').strip())

# Remove any HTML tags left in the title
# (regex=True is required in pandas 2.0+, where str.replace matches literally by default)
df['title'] = df['title'].str.replace('<.*?>', '', regex=True)

# Remove the title from the content by dropping everything up to the first blank line
df['content'] = df['content'].apply(lambda x: '\n\n'.join(x.split('\n\n')[1:]))

df.head()

Unnamed: 0,document_name,content,title
0,amazon-sagemaker-toolkits.md,# Using the SageMaker Training and Inference T...,Using the SageMaker Training and Inference Too...
1,asff-resourcedetails-awssagemaker.md,# AwsSageMaker\n\nThe following are examples o...,AwsSageMaker
2,automating-sagemaker-with-eventbridge.md,# Automating Amazon SageMaker with Amazon Even...,Automating Amazon SageMaker with Amazon EventB...
3,aws-properties-events-rule-sagemakerpipelinepa...,# AWS::Events::Rule SageMakerPipelineParameter...,AWS::Events::Rule SageMakerPipelineParameter
4,aws-properties-events-rule-sagemakerpipelinepa...,# AWS::Events::Rule SageMakerPipelineParameter...,AWS::Events::Rule SageMakerPipelineParameters
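
To make the title/body split concrete, here is a minimal sketch on a single string, assuming the heading is on the first line and a blank line separates it from the body. The function name `split_title_and_body` and the sample text are hypothetical. Unlike the cell above, the sketch strips only leading `#` characters rather than every `#` on the line.

```python
import re


def split_title_and_body(markdown_text):
    """Split a Markdown document into (title, body).

    Assumes the first line is the heading and that a blank line
    separates it from the body.
    """
    first_line = markdown_text.split('\n')[0]
    # Drop leading '#' markers and any leftover HTML tags such as <a name="...">.
    title = re.sub(r'<.*?>', '', first_line.lstrip('#')).strip()
    # Everything after the first blank-line-separated block is the body.
    body = '\n\n'.join(markdown_text.split('\n\n')[1:])
    return title, body


sample = '# AwsSageMaker<a name="asff-example"></a>\n\nThe following are examples.\n\nMore text.'
title, body = split_title_and_body(sample)
print(title)  # AwsSageMaker
print(body)
```

A document whose first block is not a heading would lose that block, so this approach relies on every file starting with a title line.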


In [93]:
# Function to clean markdown content

import re

def clean_markdown(text):
    # Remove images and stray markdown characters; links and headers are kept
    text = re.sub(r'!\[.*?\]\(.*?\)', '', text)  # remove images
    # text = re.sub(r'\[.*?\]\(.*?\)', '', text)  # remove links
    # text = re.sub(r'#', '', text)  # remove headers
    text = re.sub(r'[*_`~]', '', text)  # remove emphasis, code, and strikethrough markers (also strips these characters inside URLs)
    text = re.sub(r'\s+', ' ', text)  # collapse all whitespace, including newlines, into single spaces
    text = text.replace("'", '"')  # replace single quotes with double quotes
    return text.strip()

# Clean the content
df['content'] = df['content'].apply(clean_markdown)
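
A short, self-contained sketch of the cleaning step on one sample string may help show what survives and what does not. The function below mirrors the behavior of the cell above in condensed form, and the sample text and URL are made up for illustration.

```python
import re


def clean_markdown(text):
    text = re.sub(r'!\[.*?\]\(.*?\)', '', text)  # drop image tags
    text = re.sub(r'[*_`~]', '', text)           # drop emphasis, code, and strikethrough markers
    text = re.sub(r'\s+', ' ', text)             # collapse all whitespace, newlines included
    text = text.replace("'", '"')                # single quotes become double quotes
    return text.strip()


sample = "![logo](img/logo.png)\n**Train** a model with `fit`.\n\nSee [the docs](https://example.com/my_page)."
print(clean_markdown(sample))
# Train a model with fit. See [the docs](https://example.com/mypage).
```

Note the side effect visible in the output: the underscore inside the URL is removed along with the formatting markers, so links that contain `_` no longer resolve after cleaning.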


The embedding model used later, `text-embedding-ada-002`, uses the same `cl100k_base` encoding as GPT-4, so we count tokens with `tiktoken` using that encoding.

In [94]:
import tiktoken
embedding_encoding = "cl100k_base"

# Function to count tokens using tiktoken
def count_tokens(text, encoding_name=embedding_encoding):
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    return len(tokens)

df['token_count'] = df['content'].apply(lambda x: count_tokens(x))

In [95]:
# show how many documents have more than 8000 tokens
df[df['token_count'] > 8000].shape[0]

1

In [96]:
# show the highest token count
print("Highest token count: {}".format(df['token_count'].max()))
print("---------------------------------")

# show the lowest token count
print("Lowest token count: {}".format(df['token_count'].min()))
print("---------------------------------")

# show the average token count
print("Average token count: {}".format(df['token_count'].mean()))
print("---------------------------------")

# show df shape
print("Dataframe shape: {}".format(df.shape))

Highest token count: 13868
---------------------------------
Lowest token count: 0
---------------------------------
Average token count: 799.1071428571429
---------------------------------
Dataframe shape: (336, 4)


In [97]:
# Let's add a column that identifies the document id, starting from 1
df['id'] = range(1, len(df) + 1)
# Now let's reorder the columns
df = df[['id', 'title', 'document_name', 'content', 'token_count']]
df.head(2)

Unnamed: 0,id,title,document_name,content,token_count
0,1,Using the SageMaker Training and Inference Too...,amazon-sagemaker-toolkits.md,The [SageMaker Training](https://github.com/aw...,736
1,2,AwsSageMaker,asff-resourcedetails-awssagemaker.md,The following are examples of the AWS Security...,466


OpenAI's `text-embedding-ada-002` embedding model accepts at most 8191 tokens per input, so we split the data into two datasets: documents with 8000 or more tokens and documents with fewer than 8000. This makes it easy to work separately with the documents that require chunking.

In [98]:
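# Split the data: df2 holds rows with 8000 or more tokens, df keeps rows with fewer than 8000
# (.copy() avoids pandas' SettingWithCopyWarning when columns are added later)
df2 = df[df['token_count'] >= 8000].copy()
df = df[df['token_count'] < 8000].copy()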


In [99]:
# add a column to show the number of chars in each document  
df2['char_count'] = df2['content'].apply(len)
df2.head(2)  

Unnamed: 0,id,title,document_name,content,token_count,char_count
280,281,Use Amazon SageMaker Jobs,kubernetes-sagemaker-jobs.md,This section is based on the original version ...,13868,57612


### 2.1 Token Length Handling
OpenAI's embedding model `text-embedding-ada-002` supports up to 8191 tokens per input. The documents have already been separated into two datasets: `df` for documents with fewer than 8000 tokens and `df2` for those at or above that threshold, which require chunking. Before choosing a chunk size, we check the largest document in `df2`.


In [None]:
# Print the maximum character and token counts among the documents that need chunking
max_chars = df2['char_count'].max()
print("Max number of chars in documents with more tokens: {}".format(max_chars))

max_tokens = df2['token_count'].max()
print("Max number of tokens in documents with more tokens: {}".format(max_tokens))



Max number of chars in documents with more tokens: 57612
Max number of tokens in documents with more tokens: 13868


Now we need to decide how many tokens each chunk should contain. This is an interesting problem, since we need to balance the number of chunks against the number of tokens in each chunk. For this reason, we will use a simple heuristic to pick the chunk size.

In [101]:
# Define the overlap of the chunks
overlap = 200

# We have the max number of tokens and max number of chars in the documents dataframe;
# with this information we can decide how to split the documents

# Target about 4000 tokens per chunk, roughly half of the embedding model's limit
num_of_chunks = max_tokens // 4000 + 1
print("Number of chunks based on half of the embedding model's limit: {}".format(num_of_chunks))
# Characters per chunk, including the overlap
chars_per_chunk = max_chars // num_of_chunks + overlap
print("Number of chars per chunk: {}".format(chars_per_chunk))



Number of chunks based on half of the embedding model's limit: 4
Number of chars per chunk: 14603
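
The sizing arithmetic in the cell above can be wrapped in a small function and related back to tokens. This is only a sketch: the function name `chunk_budget` is hypothetical, and the characters-per-token ratio is a rough average taken from the single largest document, so it will differ for other texts.

```python
def chunk_budget(max_tokens, max_chars, target_tokens=4000, overlap=200):
    """Reproduce the sizing heuristic from the cell above.

    Returns (number of chunks, characters per chunk including overlap).
    """
    num_chunks = max_tokens // target_tokens + 1
    chars_per_chunk = max_chars // num_chunks + overlap
    return num_chunks, chars_per_chunk


# Figures reported above for the largest document.
max_tokens, max_chars = 13868, 57612

num_chunks, chars_per_chunk = chunk_budget(max_tokens, max_chars)
print(num_chunks, chars_per_chunk)  # 4 14603

# Rough average for this one document; other texts will differ.
chars_per_token = max_chars / max_tokens
print(round(chars_per_token, 2))                 # 4.15
print(round(chars_per_chunk / chars_per_token))  # 3515 (approximate tokens per chunk)
```

At roughly 4.15 characters per token, a chunk of 14603 characters comes to about 3500 tokens, comfortably below the 8191-token limit.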


In [102]:
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

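# Initialize the RecursiveCharacterTextSplitter; chunk_size and chunk_overlap are measured in characters.
# Note: the 32000-character chunk_size used here is larger than the per-chunk figure computed above.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=32000, chunk_overlap=overlap)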

# Function to split the content and create chunks
def split_and_chunk_content(row):
    doc = Document(page_content=row['content'])
    split_docs = text_splitter.split_documents([doc])
    
    chunks = []
    for i, chunk in enumerate(split_docs):
        chunk_id = f"{row['id']}-{i + 1}"
        chunks.append({
            'id': row['id'],
            'title': row['title'], 
            'document_name': row['document_name'],
            'content': chunk.page_content,
            'token_count': len(chunk.page_content.split()),  # rough word count only; recomputed with tiktoken below
            'chunk_id': chunk_id
        })
    return chunks

In [103]:
# Apply the function to each row and collect the chunks
chunked_data = []
for _, row in df2.iterrows():
    chunked_data.extend(split_and_chunk_content(row))

# Create a new DataFrame from the chunked data
df2 = pd.DataFrame(chunked_data)

# Recompute the token counts with tiktoken, replacing the rough word counts set during chunking
df2['token_count'] = df2['content'].apply(count_tokens)

print(f"Number of chunks: {len(df2)}")
df2.head(5)

Number of chunks: 2


Unnamed: 0,id,title,document_name,content,token_count,chunk_id
0,281,Use Amazon SageMaker Jobs,kubernetes-sagemaker-jobs.md,This section is based on the original version ...,7900,281-1
1,281,Use Amazon SageMaker Jobs,kubernetes-sagemaker-jobs.md,with SageMaker](https://docs.aws.amazon.com/sa...,6011,281-2
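
To illustrate what chunk size and overlap mean, the following sketch cuts a string into fixed-size character windows that share a few characters. It is a deliberately simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally, since that splitter prefers to break on separators such as paragraphs and spaces. The function name `split_with_overlap` and the toy values are assumptions made for the example.

```python
def split_with_overlap(text, chunk_size, overlap):
    """Cut text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


doc_id = 281
text = "abcdefghijklmnopqrstuvwxyz"
for i, chunk in enumerate(split_with_overlap(text, chunk_size=10, overlap=3), start=1):
    # Chunk ids follow the same '<document id>-<chunk number>' pattern as above.
    print(f"{doc_id}-{i}", chunk)
```

Each window starts `chunk_size - overlap` characters after the previous one, so the last three characters of one chunk reappear at the start of the next.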


### 2.2 Working on the Dataset That Needs Splitting
Choosing the chunk size is a trade-off between the number of chunks and the number of tokens per chunk, and the oversized document was split above using a simple character-based heuristic. The documents that did not need splitting each count as a single chunk, so they receive a `chunk_id` ending in `-1`.


In [104]:
# Create col chunk_id in df, based on the id + "-1"
df['chunk_id'] = df['id'].astype(str) + "-1"
df.head(2)

Unnamed: 0,id,title,document_name,content,token_count,chunk_id
0,1,Using the SageMaker Training and Inference Too...,amazon-sagemaker-toolkits.md,The [SageMaker Training](https://github.com/aw...,736,1-1
1,2,AwsSageMaker,asff-resourcedetails-awssagemaker.md,The following are examples of the AWS Security...,466,2-1


### 2.3 Rejoining the Split Datasets
The non-split dataset now has a `chunk_id` column, matching the split dataset, so the two can be concatenated into a single dataset.


In [105]:
# Concatenate the two dataframes
df = pd.concat([df, df2], ignore_index=True)
print(f"Number of chunks: {len(df)}")
df.head()

Number of chunks: 337


Unnamed: 0,id,title,document_name,content,token_count,chunk_id
0,1,Using the SageMaker Training and Inference Too...,amazon-sagemaker-toolkits.md,The [SageMaker Training](https://github.com/aw...,736,1-1
1,2,AwsSageMaker,asff-resourcedetails-awssagemaker.md,The following are examples of the AWS Security...,466,2-1
2,3,Automating Amazon SageMaker with Amazon EventB...,automating-sagemaker-with-eventbridge.md,Amazon EventBridge monitors status change even...,6717,3-1
3,4,AWS::Events::Rule SageMakerPipelineParameter,aws-properties-events-rule-sagemakerpipelinepa...,Name/Value pair of a parameter to start execut...,265,4-1
4,5,AWS::Events::Rule SageMakerPipelineParameters,aws-properties-events-rule-sagemakerpipelinepa...,These are custom parameters to use when the ta...,189,5-1


## 3. Generating Embeddings Using OpenAI
In this section, we will generate embeddings for the documents using OpenAI's embedding model.

In [110]:
# Combine the title and content into a single column
df['combined'] = '#' + df['title'] + '\n\nContent: ' + df['content']
df.head(2)

Unnamed: 0,id,title,document_name,content,token_count,chunk_id,combined
0,1,Using the SageMaker Training and Inference Too...,amazon-sagemaker-toolkits.md,The [SageMaker Training](https://github.com/aw...,736,1-1,#Using the SageMaker Training and Inference To...
1,2,AwsSageMaker,asff-resourcedetails-awssagemaker.md,The following are examples of the AWS Security...,466,2-1,#AwsSageMaker\n\nContent: The following are ex...


In [111]:
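from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable,
# which keeps the key out of the notebook
client = OpenAI()
embedding_model = "text-embedding-ada-002"

def get_embedding(text, model=embedding_model):
    # Newlines are replaced with spaces before the text is sent to the model
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Get the embeddings for the combined title and content
df['embedding'] = df['combined'].apply(get_embedding)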

df.head(2)

Unnamed: 0,id,title,document_name,content,token_count,chunk_id,combined,embedding
0,1,Using the SageMaker Training and Inference Too...,amazon-sagemaker-toolkits.md,The [SageMaker Training](https://github.com/aw...,736,1-1,#Using the SageMaker Training and Inference To...,"[-0.008473106659948826, 0.016663100570440292, ..."
1,2,AwsSageMaker,asff-resourcedetails-awssagemaker.md,The following are examples of the AWS Security...,466,2-1,#AwsSageMaker\n\nContent: The following are ex...,"[-0.01526849064975977, 0.029873132705688477, 0..."
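
The cell above sends one request per row. Since the embeddings endpoint also accepts a list of inputs, the rows could be grouped into batches instead. The sketch below shows only the batching logic. `embed_in_batches` and `fake_embed_batch` are hypothetical names, and the fake function stands in for a real API call so the example runs offline.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed texts by calling `embed_batch` on slices of at most `batch_size` items.

    `embed_batch` is a hypothetical callable: list of strings in,
    list of vectors out, in the same order.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        # Same newline handling as get_embedding above.
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_batch returned a different number of vectors")
        embeddings.extend(vectors)
    return embeddings


# Stand-in for a real API call: each "embedding" is just the text length.
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]


texts = ["#Title A\n\nContent: one", "#Title B\n\nContent: three", "#Title C"]
print(embed_in_batches(texts, fake_embed_batch, batch_size=2))
# [[22.0], [24.0], [8.0]]
```

With the real client, `embed_batch` might wrap `client.embeddings.create(input=batch, model=embedding_model)` and return the `embedding` of each item in `response.data`. Treat that wiring as an assumption to check against the current API reference, including any per-request input limits.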


In [118]:
# Reuse the combined column to store a dict with the chunk number, source (document_name), text, title, and document id
df['combined'] = df.apply(lambda x: {
    'chunk': int(x['chunk_id'].split('-')[1]),
    'source': x['document_name'],
    'text': x['content'],
    'title': x['title'],
    'doc-id': x['id']
}, axis=1)

df.head(2)

Unnamed: 0,id,title,document_name,content,token_count,chunk_id,combined,embedding
0,1,Using the SageMaker Training and Inference Too...,amazon-sagemaker-toolkits.md,The [SageMaker Training](https://github.com/aw...,736,1-1,"{'chunk': 1, 'source': 'amazon-sagemaker-toolk...","[-0.008473106659948826, 0.016663100570440292, ..."
1,2,AwsSageMaker,asff-resourcedetails-awssagemaker.md,The following are examples of the AWS Security...,466,2-1,"{'chunk': 1, 'source': 'asff-resourcedetails-a...","[-0.01526849064975977, 0.029873132705688477, 0..."


In [123]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

# print the first row of the combined column
pp.pprint(df['combined'][0])


{   'chunk': 1,
    'doc-id': 1,
    'source': 'amazon-sagemaker-toolkits.md',
    'text': 'The [SageMaker '
            'Training](https://github.com/aws/sagemaker-training-toolkit) and '
            '[SageMaker '
            'Inference](https://github.com/aws/sagemaker-inference-toolkit) '
            'toolkits implement the functionality that you need to adapt your '
            'containers to run scripts, train algorithms, and deploy models on '
            'SageMaker\\. When installed, the library defines the following '
            'for users: + The locations for storing code and other '
            'resources\\. + The entry point that contains the code to run when '
            'the container is started\\. Your Dockerfile must copy the code '
            'that needs to be run into the location expected by a container '
            'that is compatible with SageMaker\\. + Other information that a '
            'container needs to manage deployments for training and '
            ...

In [126]:
# Before saving, reorder the columns and rename some of them for later use
df = df[['chunk_id', 'embedding', 'combined', 'document_name', 'title']].rename(
    columns={'chunk_id': 'id', 'embedding': 'values', 'combined': 'metadata'}
)


In [127]:
# Save the DataFrame to a CSV file
df.to_csv('sagemaker_documentation_embeddings.csv', index=False)
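
One consequence of saving to CSV is that the `values` lists and `metadata` dicts are written as their string representations, so they need to be parsed when the file is read back. A minimal sketch using only the standard library follows. The row contents are made-up stand-ins with the same column names, and an in-memory buffer replaces the real file.

```python
import ast
import csv
import io

# A tiny stand-in for the saved file (hypothetical values, same column names).
rows = [{
    'id': '1-1',
    'values': str([0.1, -0.2, 0.3]),
    'metadata': str({'chunk': 1, 'source': 'example.md', 'text': 'Hello',
                     'title': 'Example', 'doc-id': 1}),
}]

buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['id', 'values', 'metadata'])
writer.writeheader()
writer.writerows(rows)

# Read the data back and turn the string cells into Python objects again.
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    row['values'] = ast.literal_eval(row['values'])      # str -> list of floats
    row['metadata'] = ast.literal_eval(row['metadata'])  # str -> dict
    restored.append(row)

print(type(restored[0]['values']).__name__, restored[0]['metadata']['source'])
# list example.md
```

With pandas, the same parsing could likely be done by passing `ast.literal_eval` through the `converters` argument of `pd.read_csv` for the `values` and `metadata` columns.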