# Preparing your data for the model

Now that you have your data, you need to prepare it for use with our chosen model.

We're using [all-mpnet-base-v2 from Hugging Face](https://huggingface.co/sentence-transformers/all-mpnet-base-v2). This is a widely used sentence-embedding model for natural language processing tasks such as similarity search.

To make our content usable with this model, we need to segment our text into chunks.

Models often have a character or token limit. `all-mpnet-base-v2` has a limit of 384 tokens (word pieces) and truncates any input beyond that.

We want each chunk to hold as complete a thought as possible. A complete thought that is split across two segments will likely still be matched in both, whereas a truncated thought may lose its point or carry irrelevant content.

We're going to split our content into tweet-like segments of approximately 300 characters, with an overlap of around 20 characters between neighboring segments. We'll also split on phrase boundaries like punctuation or newlines.
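To make the idea concrete, here is a minimal standard-library sketch of splitting on phrase boundaries with a size limit and a small overlap. It is a simplified illustration, not LangChain's actual algorithm: the function name `split_into_chunks` is hypothetical, and a single phrase longer than `chunk_size` is kept whole rather than split further.

```python
import re


def split_into_chunks(text: str, chunk_size: int = 300, overlap: int = 20) -> list[str]:
    """Greedily pack phrases into chunks of roughly chunk_size characters.

    Hypothetical helper for illustration only; it does not reproduce
    RecursiveCharacterTextSplitter's behavior exactly.
    """
    # Break after sentence-ending punctuation or at newlines.
    pieces = re.split(r"(?<=[.!?])\s+|\n+", text)
    phrases = [p.strip() for p in pieces if p and p.strip()]

    chunks: list[str] = []
    current = ""
    for phrase in phrases:
        candidate = f"{current} {phrase}".strip()
        if current and len(candidate) > chunk_size:
            # The current chunk is full: store it and start a new one,
            # carrying the last `overlap` characters forward for context.
            chunks.append(current)
            tail = current[-overlap:] if overlap else ""
            current = f"{tail} {phrase}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"This is sentence number {i}." for i in range(40))
    for chunk in split_into_chunks(sample):
        print(len(chunk), repr(chunk[:40]))
```

Because each chunk ends on a phrase boundary, a chunk is less likely to stop in the middle of a thought, and the overlap gives the next chunk a little context from the previous one.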

**This is a choice**. You can experiment with the parameters to fit your needs. You can also look at other models that have 

We need to read the contents of our text file.

LangChain provides convenient wrappers for the work we're doing in this notebook.

LangChain gives us the ability to select our [embeddings](https://python.langchain.com/docs/integrations/text_embedding/). It also gives us an interface to perform a [recursive split by character](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter/).

## Let's Make it Happen

Let's create a function that:
- reads the file
- parses the metadata from the file
- splits the content into separate documents with unique identifiers

In [None]:
import frontmatter
import os
import pathlib
import uuid

import arrow
from dotenv import load_dotenv
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from opensearchpy import OpenSearch, helpers

load_dotenv()

INDEX_NAME = os.getenv("INDEX_NAME")
CONNECTION_STRING = os.getenv("OPENSEARCH_SERVICE_URI")
client = OpenSearch(CONNECTION_STRING, use_ssl=True, timeout=100)

fmt = r"MMMM[\s+]D[\w+,\s+]YYYY"

# define splitter    
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=20,
    separators=[".", "!", "?", "\n"],
)

# define embeddings. These options are all the defaults and not explicitly needed.
embeddings = HuggingFaceEmbeddings(
    model_name = "sentence-transformers/all-mpnet-base-v2",
    model_kwargs = {'device': 'cpu'},
    encode_kwargs = {'normalize_embeddings': False}
)

    
def load_data(file: pathlib.Path):
    """Chunk one file, create embeddings, and bulk-index the chunks in OpenSearch."""
    # Parse the front matter (metadata) and the body of the file
    frontmatter_post = frontmatter.loads(file.read_text())
    # Fields shared by every chunk of this file
    base_data = {
        "_index": INDEX_NAME,
        "title": frontmatter_post["title"],
        "description": frontmatter_post["description"],
        "url": frontmatter_post["url"],
        "pub_date": arrow.get(frontmatter_post["pub_date"], fmt).date().isoformat(),
    }

    docs = []

    post_chunks = splitter.create_documents([frontmatter_post.content])
    for post_chunk in post_chunks:
        # One OpenSearch document per chunk, each with its own random ID
        doc = {
            **base_data,
            "_id": str(uuid.uuid4()),
            "content": post_chunk.page_content,
            "content_vector": embeddings.embed_documents([post_chunk.page_content])[0],
        }
        docs.append(doc)
    print(len(docs))
    # helpers.bulk returns a (success_count, errors) tuple
    response = helpers.bulk(client, docs)
    return response
        

Now that we have our function, we'll call it for every file in our transcripts directory. Each call sends one file's chunks to OpenSearch through the bulk helper, so the files are ingested one at a time, making it easier to restart in the event of an error.

In [None]:
directory = pathlib.Path("../transcripts")

for file in directory.iterdir():
    load_data(file)
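If restarting after an error matters to you, one option is to remember which files already succeeded and skip them on the next run. The sketch below is an assumption layered on top of the loop above, not part of the original notebook: `ingest_all` and the `done` set are hypothetical names, and `loader` stands in for `load_data`. Skipping finished files is worthwhile here because `load_data` assigns a fresh random `_id` to every chunk, so re-running a file that was already ingested would index its chunks a second time.

```python
import pathlib
from typing import Callable, Iterable


def ingest_all(
    files: Iterable[pathlib.Path],
    loader: Callable[[pathlib.Path], object],
    done: set[str],
) -> list[tuple[pathlib.Path, Exception]]:
    """Run loader on every file whose name is not in done.

    Hypothetical helper: names of files that load successfully are added
    to done, and failures are collected instead of stopping the run.
    """
    failures: list[tuple[pathlib.Path, Exception]] = []
    for file in sorted(files):
        if file.name in done:
            continue  # already ingested in an earlier run
        try:
            loader(file)
        except Exception as exc:  # keep going; report the failure afterwards
            failures.append((file, exc))
        else:
            done.add(file.name)
    return failures


# Possible usage in this notebook (assumes `directory` and `load_data` exist):
# done = set()
# failures = ingest_all(directory.iterdir(), load_data, done)
# for file, exc in failures:
#     print(f"{file.name} failed: {exc}")
```

Calling `ingest_all` again with the same `done` set retries only the files that failed.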


**EXTRA** - If you want to run this on all of the files, change the `directory` path in the block above to `../transcripts_complete` and run the block again.

In this notebook we did a lot! We chunked. We generated embeddings. We also added our example documents to our OpenSearch index.

In the next notebook, we'll look at what we can do now that our transcripts have embeddings: we'll interact with our data in OpenSearch and use it in our RAG pattern with LangChain.

[![Implement our RAG Pattern](https://img.shields.io/badge/3-Implement%20in%20RAG-153a5a?style=for-the-badge&labelColor=ec6147)](3-implement-in-rag.ipynb)
