# Data Processing

This notebook explores how to take the data we ingested and stored in an S3 bucket and load it into a vector database.

See the Full Embeddings section for an important note about an error to be fixed later.

In [6]:
pwd

'c:\\Users\\RaviB\\GitHub\\vegan-ai-nutritionist'

## Managing Dependencies

In [5]:
import os
os.chdir("..")

In [69]:
# Switch to the directory containing the pyproject.toml file
os.chdir("modules/data_processing")

# Install libraries using poetry, uncomment and change library names as needed
!poetry update


# Switch back to the root directory
os.chdir("../..")

Updating dependencies
Resolving dependencies... (26.9s)

Package operations: 0 installs, 5 updates, 0 removals

  - Updating platformdirs (4.3.2 -> 4.3.3): Downloading...

In [11]:
import os
import pandas as pd
import boto3
import json
from dotenv import load_dotenv

load_dotenv()

True

## Loading Raw Data

We need to transform the data a bit: we want to save each section name as metadata and store it alongside the embeddings.

In [12]:
s3 = boto3.client('s3')

bucket_name = os.environ.get('AWS_BUCKET_NAME')

response = s3.list_objects_v2(Bucket=bucket_name)
response["ResponseMetadata"]["HTTPStatusCode"]

200

In [13]:
file_name = 'vegan_research_papers.json'

# Get the file contents from S3
response = s3.get_object(Bucket=bucket_name, Key=file_name)

In [14]:
# Load the JSON data from the file contents
data = json.loads(response['Body'].read())

In [15]:
data[0]

{'meta_data': {'content_type': 'Article',
  'url': [{'format': '',
    'platform': '',
    'value': 'http://dx.doi.org/10.1007/s12237-023-01313-8'}],
  'title': 'Responses of Coastal Wetlands to Rising Sea-Level Revisited: The Importance of Organic Production',
  'publication_name': 'Estuaries and Coasts',
  'doi': '10.1007/s12237-023-01313-8',
  'publication_date': '2024-11-01',
  'starting_page': '1735',
  'ending_page': '1749',
  'open_access': 'true',
  'abstract': {'h1': 'Abstract',
   'p': 'A network of 15 Surface Elevation Tables (SETs) at North Inlet estuary, South Carolina, has been monitored on annual or monthly time scales beginning from 1990 to 1996 and continuing through 2022. Of 73 time series in control plots, 12 had elevation gains equal to or exceeding the local rate of sea-level rise (SLR, 0.34\xa0cm/year). Rising marsh elevation in North Inlet is dominated by organic production and, we hypothesize, is proportional to net ecosystem production. The rate of elevation ga

Let's explore a sample.

In [16]:
sample_pdf = data[1]
sample_text = sample_pdf['content']
sample_meta_data = sample_pdf['meta_data']

In [17]:
sample_meta_data

{'content_type': 'Article',
 'url': [{'format': '',
   'platform': '',
   'value': 'http://dx.doi.org/10.1007/s00217-024-04565-1'}],
 'title': 'Valorization of plant proteins for meat analogues design—a comprehensive review',
 'publication_name': 'European Food Research and Technology',
 'doi': '10.1007/s00217-024-04565-1',
 'publication_date': '2024-10-01',
 'starting_page': '2479',
 'ending_page': '2513',
 'open_access': 'true',
 'abstract': {'h1': 'Abstract',
  'p': 'Animal proteins from meat and its stuffs have recently been one of main concerns in the drive for sustainable food production. This viewpoint suggests that there are exciting prospects to reformulate meat products that are produced more sustainably and may also have health benefits by substituting high-protein nonmeat ingredients for some of the meat. Considering these pre-existing conditions, this review critically reviews recent data on extenders from several sources, including pulses, plant-based components, plant by

In [18]:
sample_text[0:5]

[{'section': 'Introduction',
  'body': "Meat is recognized as a very popular food item worldwide and it is well known as an excellent quality protein source with other nutritional characteristics along with its appealing taste. With the growing rate of the planet's population, the need for food security is rising as well, and to feed this growing population a greater amount of good quality food having proper protein, fat, and other nutrition is required. Meanwhile, increased environmental footprint awareness plays a significant role in meat analogues supply for the sustainable and transparent food security of the planet. Animal is the solitary bioresource of meat protein and with rapid population growth, the need for meat protein is also increasing. Various data show that the demand will be magnified near to twice by 2050 [Changes in the different meat prices as per FAO meat price index. (Data Source: OECD-FAO Agricultural Outlook 2022–2031)Meat Greenhouse gas emissions intensity per r

Put the section name in the metadata and give each text section the full metadata. That way we get all the information back whenever one of these texts is returned by a similarity search later on.

In [19]:
updated_sample_data = []

for text_section in sample_text:
    meta_data = sample_meta_data.copy()
    
    meta_data['section'] = text_section['section']
    
    body = text_section['body']
    
    updated_sample_data.append({'meta_data': meta_data, 'body': body})

Take a look at a couple of examples below to make sure it works.

In [20]:
updated_sample_data[0]

{'meta_data': {'content_type': 'Article',
  'url': [{'format': '',
    'platform': '',
    'value': 'http://dx.doi.org/10.1007/s00217-024-04565-1'}],
  'title': 'Valorization of plant proteins for meat analogues design—a comprehensive review',
  'publication_name': 'European Food Research and Technology',
  'doi': '10.1007/s00217-024-04565-1',
  'publication_date': '2024-10-01',
  'starting_page': '2479',
  'ending_page': '2513',
  'open_access': 'true',
  'abstract': {'h1': 'Abstract',
   'p': 'Animal proteins from meat and its stuffs have recently been one of main concerns in the drive for sustainable food production. This viewpoint suggests that there are exciting prospects to reformulate meat products that are produced more sustainably and may also have health benefits by substituting high-protein nonmeat ingredients for some of the meat. Considering these pre-existing conditions, this review critically reviews recent data on extenders from several sources, including pulses, plant-

In [21]:
updated_sample_data[3]

{'meta_data': {'content_type': 'Article',
  'url': [{'format': '',
    'platform': '',
    'value': 'http://dx.doi.org/10.1007/s00217-024-04565-1'}],
  'title': 'Valorization of plant proteins for meat analogues design—a comprehensive review',
  'publication_name': 'European Food Research and Technology',
  'doi': '10.1007/s00217-024-04565-1',
  'publication_date': '2024-10-01',
  'starting_page': '2479',
  'ending_page': '2513',
  'open_access': 'true',
  'abstract': {'h1': 'Abstract',
   'p': 'Animal proteins from meat and its stuffs have recently been one of main concerns in the drive for sustainable food production. This viewpoint suggests that there are exciting prospects to reformulate meat products that are produced more sustainably and may also have health benefits by substituting high-protein nonmeat ingredients for some of the meat. Considering these pre-existing conditions, this review critically reviews recent data on extenders from several sources, including pulses, plant-

So here we iterate through one pdf file, copy its metadata (which is the same for every section of that file), and add the section name to the copy. This becomes the new metadata, which we save alongside the section text.

In [22]:
def transform_paper_data(pdf_content, original_meta_data):
    updated_data = []

    for text_section in pdf_content:
        meta_data = original_meta_data.copy()
        
        meta_data['section'] = text_section['section']
        
        body = text_section['body']
        
        updated_data.append({'body': body, 'meta_data': meta_data})
        
    return updated_data
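
One caveat with `dict.copy()` is that it is shallow: nested values such as the `url` list and the `abstract` dict are still shared between every section of a paper. That is harmless as long as nothing mutates them. If that ever becomes a concern, a deep copy is one option. The sketch below uses a hypothetical variant, `transform_paper_data_deep`, and toy data that only mimics the shape of the real records.

```python
import copy


def transform_paper_data_deep(pdf_content, original_meta_data):
    """Hypothetical variant of transform_paper_data that deep-copies the metadata."""
    updated_data = []
    for text_section in pdf_content:
        # deepcopy so nested dicts/lists are not shared between sections
        meta_data = copy.deepcopy(original_meta_data)
        meta_data["section"] = text_section["section"]
        updated_data.append({"body": text_section["body"], "meta_data": meta_data})
    return updated_data


# Toy data (not from the real dataset), shaped like one paper
toy_meta = {"title": "Toy paper", "abstract": {"h1": "Abstract", "p": "Toy abstract."}}
toy_content = [
    {"section": "Introduction", "body": "Intro text."},
    {"section": "Methods", "body": "Methods text."},
]

toy_result = transform_paper_data_deep(toy_content, toy_meta)
print([item["meta_data"]["section"] for item in toy_result])
```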

In [23]:
sample_transformed_data = transform_paper_data(sample_text, sample_meta_data)
sample_transformed_data_section_0 = sample_transformed_data[0]
sample_transformed_data[0]

{'body': "Meat is recognized as a very popular food item worldwide and it is well known as an excellent quality protein source with other nutritional characteristics along with its appealing taste. With the growing rate of the planet's population, the need for food security is rising as well, and to feed this growing population a greater amount of good quality food having proper protein, fat, and other nutrition is required. Meanwhile, increased environmental footprint awareness plays a significant role in meat analogues supply for the sustainable and transparent food security of the planet. Animal is the solitary bioresource of meat protein and with rapid population growth, the need for meat protein is also increasing. Various data show that the demand will be magnified near to twice by 2050 [Changes in the different meat prices as per FAO meat price index. (Data Source: OECD-FAO Agricultural Outlook 2022–2031)Meat Greenhouse gas emissions intensity per regionThis paper collects and s

In [24]:
len(sample_transformed_data)

47

### Full PDF Data

Below is just how to collect all pdf data in this transformed way, but we won't use it as we just want to explore one sample for now.

In [25]:
full_pdf_data = []

for pdf_data in data:
    pdf_content_text, og_meta_data = pdf_data['content'], pdf_data['meta_data']
    
    full_pdf_data += transform_paper_data(pdf_content_text, og_meta_data)

In [26]:
len(full_pdf_data)

4825

## Vector Embeddings

Now that we have moved the section into the metadata, we just have the text to worry about. We need to embed it, for which we'll use the Amazon Titan embedding model through Amazon Bedrock.

In [27]:
from langchain_community.embeddings import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate





In [29]:
import boto3
from langchain_aws import BedrockEmbeddings

bedrock=boto3.client(service_name="bedrock-runtime")
bedrock_embeddings=BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", client=bedrock)

In [30]:
embedding_sample = bedrock_embeddings.embed_query(sample_text[0]['body'])
#embedding_sample

LangChain's Document object will handle storing the metadata for us, so we move everything into it now.

In [31]:
Document(page_content=sample_transformed_data_section_0['body'], metadata=sample_transformed_data_section_0['meta_data'])

Document(metadata={'content_type': 'Article', 'url': [{'format': '', 'platform': '', 'value': 'http://dx.doi.org/10.1007/s00217-024-04565-1'}], 'title': 'Valorization of plant proteins for meat analogues design—a comprehensive review', 'publication_name': 'European Food Research and Technology', 'doi': '10.1007/s00217-024-04565-1', 'publication_date': '2024-10-01', 'starting_page': '2479', 'ending_page': '2513', 'open_access': 'true', 'abstract': {'h1': 'Abstract', 'p': 'Animal proteins from meat and its stuffs have recently been one of main concerns in the drive for sustainable food production. This viewpoint suggests that there are exciting prospects to reformulate meat products that are produced more sustainably and may also have health benefits by substituting high-protein nonmeat ingredients for some of the meat. Considering these pre-existing conditions, this review critically reviews recent data on extenders from several sources, including pulses, plant-based components, plant b

Of course we need to iterate over everything in our list of docs, which in this case will make up one pdf file.

In [32]:
def convert_to_doc_format(transformed_data):
    documents = []
    for content in transformed_data:
        doc = Document(page_content=content['body'], metadata=content['meta_data'])
        documents.append(doc)
    return documents

In [33]:
sample_docs = convert_to_doc_format(sample_transformed_data)
sample_docs[0]

Document(metadata={'content_type': 'Article', 'url': [{'format': '', 'platform': '', 'value': 'http://dx.doi.org/10.1007/s00217-024-04565-1'}], 'title': 'Valorization of plant proteins for meat analogues design—a comprehensive review', 'publication_name': 'European Food Research and Technology', 'doi': '10.1007/s00217-024-04565-1', 'publication_date': '2024-10-01', 'starting_page': '2479', 'ending_page': '2513', 'open_access': 'true', 'abstract': {'h1': 'Abstract', 'p': 'Animal proteins from meat and its stuffs have recently been one of main concerns in the drive for sustainable food production. This viewpoint suggests that there are exciting prospects to reformulate meat products that are produced more sustainably and may also have health benefits by substituting high-protein nonmeat ingredients for some of the meat. Considering these pre-existing conditions, this review critically reviews recent data on extenders from several sources, including pulses, plant-based components, plant b

In [34]:
len(sample_docs)

47

We still might want to chunk the text further. If a section is too long, we need to split it with LangChain's text splitter while making sure every resulting chunk keeps the same metadata.

In [35]:
def chunk_text(document, chunk_size=10000, chunk_overlap=1000):
    raw_text = document.page_content
    meta_data = document.metadata
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    
    texts = text_splitter.split_text(raw_text)
    docs = [Document(page_content=t, metadata=meta_data) for t in texts]
    
    return docs
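
To build some intuition for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window splitter. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter additionally tries to break on separators such as paragraph and line breaks); the function name `sliding_window_chunks` is made up for this illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


toy_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(toy_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks share `chunk_overlap` characters of context.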

In [36]:
chunk_text(sample_docs[0], 1000, 250)

[Document(metadata={'content_type': 'Article', 'url': [{'format': '', 'platform': '', 'value': 'http://dx.doi.org/10.1007/s00217-024-04565-1'}], 'title': 'Valorization of plant proteins for meat analogues design—a comprehensive review', 'publication_name': 'European Food Research and Technology', 'doi': '10.1007/s00217-024-04565-1', 'publication_date': '2024-10-01', 'starting_page': '2479', 'ending_page': '2513', 'open_access': 'true', 'abstract': {'h1': 'Abstract', 'p': 'Animal proteins from meat and its stuffs have recently been one of main concerns in the drive for sustainable food production. This viewpoint suggests that there are exciting prospects to reformulate meat products that are produced more sustainably and may also have health benefits by substituting high-protein nonmeat ingredients for some of the meat. Considering these pre-existing conditions, this review critically reviews recent data on extenders from several sources, including pulses, plant-based components, plant 

In [37]:
chunked_docs = []

for doc in sample_docs:
    chunked_docs += chunk_text(doc, 5000, 500)

In [38]:
len(chunked_docs)

56

Now we can extract just the text and get embeddings for it.

In [None]:
texts = [d.page_content for d in chunked_docs]

embeddings = bedrock_embeddings.embed_documents(texts)

In [36]:
documents_with_embeddings = []

for doc, embedding in zip(chunked_docs, embeddings):
    doc_with_embedding = {
        "embedding": embedding,
        "text": doc.page_content,
        "metadata": doc.metadata
    }
    documents_with_embeddings.append(doc_with_embedding)
    
print(len(documents_with_embeddings))
#print(documents_with_embeddings[0])

56


This function does everything, from getting the embeddings to storing them properly with the metadata.

In [40]:
def generate_embeddings(documents):
    texts = [doc.page_content for doc in documents]
    
    embeddings = bedrock_embeddings.embed_documents(texts)
    
    documents_with_embeddings = []
    
    for doc, embedding in zip(documents, embeddings):
        doc_with_embedding = {
            "embedding": embedding,
            "text": doc.page_content,
            "metadata": doc.metadata
        }
        documents_with_embeddings.append(doc_with_embedding)
    
    return documents_with_embeddings
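
Since every call to Bedrock costs time and money, it can help to exercise this pairing logic offline first. The sketch below is one way to do that under stated assumptions: `SimpleDoc` and `FakeEmbeddings` are hypothetical stand-ins for LangChain's `Document` and the Bedrock embeddings client, and `generate_embeddings_with` is a variant of the function above that takes the embedder as a parameter.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class FakeEmbeddings:
    """Hypothetical embedder returning tiny deterministic vectors (no AWS call)."""

    def embed_documents(self, texts):
        return [[float(len(t)), float(t.count(" "))] for t in texts]


def generate_embeddings_with(embedder, documents):
    texts = [doc.page_content for doc in documents]
    embeddings = embedder.embed_documents(texts)
    # zip() silently truncates, so fail loudly if the counts differ
    if len(embeddings) != len(documents):
        raise ValueError("number of embeddings does not match number of documents")
    return [
        {"embedding": emb, "text": doc.page_content, "metadata": doc.metadata}
        for doc, emb in zip(documents, embeddings)
    ]


toy_docs = [
    SimpleDoc("plant protein", {"section": "Introduction"}),
    SimpleDoc("gluten", {"section": "Usage"}),
]
toy_records = generate_embeddings_with(FakeEmbeddings(), toy_docs)
print(toy_records[0])
```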

So to put it all together in one place, this is how we get our vector embeddings for one pdf file.

In [118]:
sample_docs = convert_to_doc_format(sample_transformed_data)

chunked_docs = []

for doc in sample_docs:
    chunked_docs += chunk_text(doc, 5000, 500)
    
documents_with_embeddings = generate_embeddings(chunked_docs)

# Compare the number of embedded chunks with the number of original sections
# to get an idea of how many sections were split further
print(len(documents_with_embeddings))
print(len(sample_text))

56
47


### Full Embeddings

Let's try this on the full dataset.

NOTE: There is currently an error where some of the chunked docs have too many input tokens (the limit is roughly 8,000; the check below uses 8192). We need a tokenizer from Hugging Face to check the length of the input IDs, since we can't use Amazon's Titan model for this, so the count will only be an estimate.

There are some import errors with torch and Hugging Face though, so for now we only process the first 50 pdf files to avoid this error (this slice of the loaded data is currently hardcoded in the modules/data_processing/src/data_processing.py file).
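
Until the tokenizer imports work, one possible stopgap is a conservative character-based estimate. The sketch below assumes roughly four characters per token, which is only a rule of thumb for English text and not a property of the Titan tokenizer. The helper names `estimate_tokens` and `split_until_within_budget` are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; chars_per_token is an assumed heuristic."""
    return int(len(text) / chars_per_token) + 1


def split_until_within_budget(text, max_tokens, chars_per_token=4.0):
    """Recursively halve text (preferring a space near the middle) until each piece fits."""
    if estimate_tokens(text, chars_per_token) <= max_tokens or len(text) <= 1:
        return [text] if text else []

    middle = len(text) // 2
    split_at = text.rfind(" ", 0, middle)
    if split_at <= 0:
        # No usable space in the first half; fall back to a hard split
        split_at = middle

    left = text[:split_at]
    right = text[split_at:].lstrip()
    return (
        split_until_within_budget(left, max_tokens, chars_per_token)
        + split_until_within_budget(right, max_tokens, chars_per_token)
    )


toy_text = "word " * 100
toy_pieces = split_until_within_budget(toy_text, max_tokens=50)
print(len(toy_pieces), [estimate_tokens(p) for p in toy_pieces])
```

Choosing a budget well below the real limit leaves headroom for the cases where the estimate is too optimistic.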

In [41]:
full_docs = convert_to_doc_format(full_pdf_data)

full_chunked_docs = []

for doc in full_docs:
    full_chunked_docs += chunk_text(doc, 5000, 500)

In [63]:
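full_texts = [doc.page_content for doc in full_chunked_docs]

# NOTE: `tokenizer` is the Hugging Face tokenizer loaded in the Open-Source Embeddings
# section below, so its counts only approximate what the Titan model sees.
invalid_texts = []
valid_texts = []

for text in full_texts:
    token_count = len(tokenizer.encode(text))
    if token_count <= 8192:
        valid_texts.append(text)
    else:
        # Still too long after chunking; set it aside for now
        invalid_texts.append(text)
        print("Text exceeds token limit even after chunking.")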

In [None]:
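# Keep only the chunks that passed the token check so docs and embeddings stay aligned
valid_text_set = set(valid_texts)
valid_docs = [doc for doc in full_chunked_docs if doc.page_content in valid_text_set]

embeddings = bedrock_embeddings.embed_documents([doc.page_content for doc in valid_docs])

documents_with_embeddings = []

for doc, embedding in zip(valid_docs, embeddings):
    doc_with_embedding = {
        "embedding": embedding,
        "text": doc.page_content,
        "metadata": doc.metadata
    }
    documents_with_embeddings.append(doc_with_embedding)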

In [None]:
full_documents_with_embeddings = generate_embeddings(full_chunked_docs)

## Connecting to Vector Database

First, let's see if we can connect to the vector database that we created using Terraform.

In [98]:
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
from dotenv import load_dotenv

load_dotenv()

opensearch_endpoint = os.environ.get('OPENSEARCH_ENDPOINT')

AWS_ACCESS_KEY = os.environ.get('AWS_ACCESS_KEY_ID')
AWS_SECRET_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')
AWS_REGION = os.environ.get('AWS_REGION')

In [135]:
awsauth = AWS4Auth(AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_REGION, 'es')

# Create the OpenSearch client
client = OpenSearch(
    hosts=[{'host': opensearch_endpoint, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

# Test connection
info = client.info()
print(info)

{'name': 'b618476fdb5879478fc667c4fb6cd473', 'cluster_name': '590184030535:vegan-pdf-data', 'cluster_uuid': 'D2l3HY8VSk-qIbco5LH9gg', 'version': {'distribution': 'opensearch', 'number': '2.5.0', 'build_type': 'tar', 'build_hash': 'unknown', 'build_date': '2024-05-02T06:25:23.555552Z', 'build_snapshot': False, 'lucene_version': '9.4.2', 'minimum_wire_compatibility_version': '7.10.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'The OpenSearch Project: https://opensearch.org/'}


So it worked. Next we need the embedding dimension, which we will use when creating our index body.

In [119]:
embedding_dimension = len(documents_with_embeddings[0]['embedding'])
print(f"Embedding dimension: {embedding_dimension}")

Embedding dimension: 1536


Recall what the metadata looks like. We need to describe its fields in the index body.

In [125]:
documents_with_embeddings[0]['metadata']

{'content_type': 'Article',
 'url': [{'format': '',
   'platform': '',
   'value': 'http://dx.doi.org/10.1007/s00217-024-04565-1'}],
 'title': 'Valorization of plant proteins for meat analogues design—a comprehensive review',
 'publication_name': 'European Food Research and Technology',
 'doi': '10.1007/s00217-024-04565-1',
 'publication_date': '2024-10-01',
 'starting_page': '2479',
 'ending_page': '2513',
 'open_access': 'true',
 'abstract': {'h1': 'Abstract',
  'p': 'Animal proteins from meat and its stuffs have recently been one of main concerns in the drive for sustainable food production. This viewpoint suggests that there are exciting prospects to reformulate meat products that are produced more sustainably and may also have health benefits by substituting high-protein nonmeat ingredients for some of the meat. Considering these pre-existing conditions, this review critically reviews recent data on extenders from several sources, including pulses, plant-based components, plant by

In [126]:
from opensearchpy import OpenSearch, RequestsHttpConnection
from opensearchpy.exceptions import NotFoundError
from requests_aws4auth import AWS4Auth
import os

index_name = 'vegan_papers_index'

# Define the index settings and mappings
index_body = {
    'settings': {
        'index': {
            'knn': True  # Enable k-NN for vector similarity search
        }
    },
    'mappings': {
        'properties': {
            'embedding': {
                'type': 'knn_vector',
                'dimension': embedding_dimension
            },
            'text': {
                'type': 'text'
            },
            'metadata': {
                'properties': {
                    'content_type': {'type': 'keyword'},
                    'url': {
                        'type': 'nested',
                        'properties': {
                            'format': {'type': 'keyword'},
                            'platform': {'type': 'keyword'},
                            'value': {'type': 'keyword'}
                        }
                    },
                    'title': {'type': 'text'},
                    'publication_name': {'type': 'text'},
                    'doi': {'type': 'keyword'},
                    'publication_date': {'type': 'date', 'format': 'yyyy-MM-dd'},
                    'starting_page': {'type': 'integer'},
                    'ending_page': {'type': 'integer'},
                    'open_access': {'type': 'boolean'},
                    'abstract': {
                        'properties': {
                            'h1': {'type': 'text'},
                            'p': {'type': 'text'}
                        }
                    },
                    'section': {'type': 'text'}
                }
            }
        }
    }
}

# Delete the index if it exists, then create it
try:
    client.indices.delete(index=index_name)
    print(f"Deleted existing index '{index_name}'.")
except NotFoundError:
    print(f"Index '{index_name}' does not exist. Creating a new one.")

response = client.indices.create(index=index_name, body=index_body)
print(f"Created index '{index_name}': {response}")


Index 'vegan_papers_index' does not exist. Creating a new one.
Created index 'vegan_papers_index': {'acknowledged': True, 'shards_acknowledged': True, 'index': 'vegan_papers_index'}
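
Note that the mapping declares `starting_page` and `ending_page` as integers and `open_access` as a boolean, while the raw metadata stores all three as strings. OpenSearch will generally coerce such values, but converting them explicitly before indexing makes the intent clearer. A minimal sketch, using a hypothetical helper `normalize_metadata` and a toy record shaped like the metadata above:

```python
import copy


def normalize_metadata(meta_data):
    """Return a copy of meta_data with page numbers as ints and open_access as a bool."""
    normalized = copy.deepcopy(meta_data)

    for key in ("starting_page", "ending_page"):
        value = normalized.get(key)
        # Only convert clean digit strings; leave anything unexpected untouched
        if isinstance(value, str) and value.strip().isdigit():
            normalized[key] = int(value)

    open_access = normalized.get("open_access")
    if isinstance(open_access, str):
        normalized["open_access"] = open_access.strip().lower() == "true"

    return normalized


toy_meta = {
    "doi": "10.1007/s00217-024-04565-1",
    "starting_page": "2479",
    "ending_page": "2513",
    "open_access": "true",
}
toy_normalized = normalize_metadata(toy_meta)
print(toy_normalized)
```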


Now we prepare our data to be stored into the vector database.

In [127]:
actions = []
for i, doc in enumerate(documents_with_embeddings):
    action = {
        '_index': index_name,
        '_id': i,
        '_source': {
            'embedding': doc['embedding'],
            'text': doc['text'],
            'metadata': doc['metadata']
        }
    }
    actions.append(action)

In [129]:
from opensearchpy.helpers import bulk

success, _ = bulk(client, actions)
print(f"Indexed {success} documents into index '{index_name}'.")

Indexed 56 documents into index 'vegan_papers_index'.
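
Using the loop counter as `_id` works for a single paper, but the IDs would collide if several papers were indexed in separate batches. One possible alternative, sketched under the assumption that every chunk's metadata carries a `doi`, is to derive the ID from the DOI plus the chunk's position within that paper. `build_actions` is a hypothetical helper; it is written as a generator because the `bulk` helper accepts any iterable of actions.

```python
def build_actions(documents_with_embeddings, index_name):
    """Yield bulk actions with IDs of the form '<doi>-<chunk position>'."""
    chunk_counts = {}
    for doc in documents_with_embeddings:
        # 'unknown-doi' is an assumed fallback for records without a DOI
        doi = doc["metadata"].get("doi", "unknown-doi")
        position = chunk_counts.get(doi, 0)
        chunk_counts[doi] = position + 1
        yield {
            "_index": index_name,
            "_id": f"{doi}-{position}",
            "_source": {
                "embedding": doc["embedding"],
                "text": doc["text"],
                "metadata": doc["metadata"],
            },
        }


# Toy records (not real embeddings) to show the resulting IDs
toy_docs = [
    {"embedding": [0.1, 0.2], "text": "chunk A0", "metadata": {"doi": "10.1/a"}},
    {"embedding": [0.3, 0.4], "text": "chunk A1", "metadata": {"doi": "10.1/a"}},
    {"embedding": [0.5, 0.6], "text": "chunk B0", "metadata": {"doi": "10.2/b"}},
]
toy_actions = list(build_actions(toy_docs, "vegan_papers_index"))
print([action["_id"] for action in toy_actions])
```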


Let's perform a vector search to test out our database. We can query something similar to a selected example, say sample_text[5].

In [131]:
sample_text[5]

{'section': 'Usage of gluten protein for meat analogues',
 'body': 'Wheat gluten is a significant component of many analogues. Because it is a by-product of the creation of colossal wheat starch, its price is appealing to the industry. In contrast to soy, the insoluble protein is left behind after the soluble and dispersible components of wheat are only removed by washing them with water [Gliadin (prolamin) is soluble in alcohol, whereas glutenin (glutelin) is soluble in diluted acid ['}

In [136]:
# Make the query similar to the sample above
query_text = "Is wheat gluton a significant component of meat analogues?"
query_embedding = bedrock_embeddings.embed_query(query_text)

In [137]:
search_body = {
    "size": 5,
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": 5
            }
        }
    },
    "_source": ["text", "metadata"]
}

response = client.search(index=index_name, body=search_body)

In [138]:
for hit in response['hits']['hits']:
    print(f"Score: {hit['_score']}")
    print(f"Title: {hit['_source']['metadata']['title']}")
    print(f"Text: {hit['_source']['text'][:200]}...")  # Show first 200 chars
    print()

Score: 0.004691359
Title: Valorization of plant proteins for meat analogues design—a comprehensive review
Text: Wheat gluten is a significant component of many analogues. Because it is a by-product of the creation of colossal wheat starch, its price is appealing to the industry. In contrast to soy, the insolubl...

Score: 0.0043517556
Title: Valorization of plant proteins for meat analogues design—a comprehensive review
Text: some specific color that changes through different processes such as cooking and smoking. Similarly, meat analogues should have meat mimic color and color change characteristics during processing. Sev...

Score: 0.00420129
Title: Valorization of plant proteins for meat analogues design—a comprehensive review
Text: Meat analogues are made by combining a variety of ingredients through different texturizing techniques. For optimum binding of all the necessary ingredients (flavor, color, stabilizers, emulsifier, th...

Score: 0.004073237
Title: Valorization of plant p
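
The scores look small because of how the k-NN plugin turns distances into scores. Assuming the index uses the `l2` space (the index body above does not set a `space_type`, and `l2` is, to my knowledge, the default), the score is 1 / (1 + d), where d is the squared Euclidean distance between the query and document vectors. The helper names below are hypothetical and only illustrate that relationship.

```python
def l2_score(query_vector, doc_vector):
    """Score for the l2 space: 1 / (1 + squared Euclidean distance)."""
    if len(query_vector) != len(doc_vector):
        raise ValueError("vectors must have the same dimension")
    squared_distance = sum((q - d) ** 2 for q, d in zip(query_vector, doc_vector))
    return 1.0 / (1.0 + squared_distance)


def squared_distance_from_score(score):
    """Invert the formula above to recover the squared distance from a score."""
    return 1.0 / score - 1.0


toy_score = l2_score([0.0, 0.0], [3.0, 4.0])  # squared distance is 9 + 16 = 25
print(toy_score, squared_distance_from_score(toy_score))
```

Under that assumption, a higher score simply means a smaller distance, so the ranking matters more than the absolute values.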

### Delete Index

Now that we have confirmed everything works, let's delete the index, since this was just an exploratory notebook. We will implement the full method in the data_processing module.

In [139]:
try:
    client.indices.delete(index=index_name)
    print(f"Deleted existing index '{index_name}'.")
except NotFoundError:
    print(f"Index '{index_name}' does not exist.")

Deleted existing index 'vegan_papers_index'.


## Open-Source Embeddings

The Titan model is not great for chatbot use cases, so fine-tuning on our data is better done with an open-source model. We'll use the Falcon 7B-Instruct model. Moreover, the Titan embeddings from Bedrock don't expose the tokenized text, which we need: we are hitting a too-many-tokens error and have to chunk the text by token count rather than by character count. An open-source model gives us that flexibility.

So essentially we are redoing the vector embeddings section above, but this time using the Falcon model.

In [None]:
from transformers import AutoTokenizer

model = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)

We'll still need this function to convert our transformed data into documents.

In [66]:
def convert_to_doc_format(transformed_data):
    documents = []
    for content in transformed_data:
        doc = Document(page_content=content['body'], metadata=content['meta_data'])
        documents.append(doc)
    return documents

In [68]:
full_docs = convert_to_doc_format(full_pdf_data)

In [74]:
full_docs[0].page_content

'Accelerating relative sea-level rise (RSLR) threatens coastal wetlands globally, and their survival depends on the ability of wetlands to build soil vertically and maintain their elevation in the tidal frame. Studies using the Surface Elevation Table–Marker Horizon (SET-MH) method are advancing our understanding of the processes contributing to the sustainability of coastal wetland elevation in the face of rising sea levels, and their management implications, resulting in an approach that allows for comprehensive and systematic monitoring of wetland elevation change over a wide range of coastal environments globally. Collectively, the 27 articles in this special issue present an examination of current advances at quantifying and understanding subsurface process influences on wetland elevation change and wetland responses to sea-level rise, drawing on research presentations from two special sessions at the Coastal and Estuarine Research Federation (CERF) 2021 conference plus additional

In [75]:
len(tokenizer.encode(full_docs[0].page_content))

837

In [77]:
invalid_texts = []

for doc in full_docs:
    if len(tokenizer.encode(doc.page_content)) > 2048:
        invalid_texts.append(doc.page_content)

In [79]:
len(invalid_texts)

126
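
Sections that exceed the limit need to be split by token count rather than by character count. The sketch below shows one possible way to do that. It takes the `encode` and `decode` functions as parameters so that it can be demonstrated with a toy whitespace tokenizer; passing a Hugging Face tokenizer's `encode` and `decode` methods instead is an assumption that has not been tested here. The name `chunk_by_tokens` is hypothetical.

```python
def chunk_by_tokens(text, encode, decode, max_tokens, overlap_tokens=0):
    """Split text into pieces of at most max_tokens tokens, overlapping by overlap_tokens."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")

    token_ids = encode(text)
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + max_tokens]
        chunks.append(decode(window))
        # Stop once the current window reaches the last token
        if start + max_tokens >= len(token_ids):
            break
    return chunks


# Toy whitespace "tokenizer" used only for this demonstration
def toy_encode(text):
    return text.split()


def toy_decode(tokens):
    return " ".join(tokens)


toy_token_chunks = chunk_by_tokens("a b c d e f g", toy_encode, toy_decode, max_tokens=3, overlap_tokens=1)
print(toy_token_chunks)  # ['a b c', 'c d e', 'e f g']
```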

In [None]:
def chunk_tokens(doc, chunk_size, max_tokens=2048):