## 1. Document/text processing and embedding creation

Ingredients:
* PDF document of choice (could be any kind of document).
* Embedding model of choice.

Steps:
1. Import PDF document.
2. Process text for embedding.
3. Embed text chunks with embedding model.
4. Save embeddings to file for later use.

### Import PDF Document

In [1]:
import os
import requests # help download stuff

# Get PDF Document
pdf_path = "human-nutrition-text.pdf"

# Download
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # URL of the PDF to download
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    # Send a GET request to the URL
    response = requests.get(url)
    if response.status_code == 200:
        # Write the response body to the local path
        with open(pdf_path, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {pdf_path}")
    else:
        print(f"[INFO] Failed to download the file. Status Code: {response.status_code}")
else:
    print(f"File {pdf_path} exists.")

File human-nutrition-text.pdf exists.


In [2]:
import fitz # requires: PyMuPDF
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text"""
    cleaned_text = text.replace("\n", " ").strip()

    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF, reads its text page by page, and collects simple per-page statistics"""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number - 41, # offset so page numbers match the book's printed numbering
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(".")),
                                "page_token_count": len(text) / 4, # 1 token ~4 chars
                                "text": text
                               })
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path)
pages_and_texts[:2]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]
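
The empty page above reports a `page_word_count` of 1 because `"".split(" ")` returns `['']`. The sketch below shows that behaviour next to `str.split()` with no argument, which drops empty strings. The sample strings are made up for illustration, and swapping in `split()` is a suggestion rather than what the notebook does.

```python
empty_page = ""
spaced_page = "Human Nutrition:  2020 Edition"  # hypothetical text with a double space

# split(" ") breaks on every single space, so empty strings are counted as words
naive_empty = len(empty_page.split(" "))
naive_spaced = len(spaced_page.split(" "))

# split() with no argument breaks on runs of whitespace and drops empty strings
clean_empty = len(empty_page.split())
clean_spaced = len(spaced_page.split())

print(naive_empty, naive_spaced)  # 1 5
print(clean_empty, clean_spaced)  # 0 4
```

For rough page statistics either count is fine. The difference only matters if the word counts are used for something stricter than a sanity check.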

In [3]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 936,
  'page_char_count': 1488,
  'page_word_count': 222,
  'page_sentence_count_raw': 15,
  'page_token_count': 372.0,
  'text': 'The Essential Elements of  Physical Fitness  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  Cardiorespiratory Endurance  Cardiorespiratory endurance is enhanced by aerobic training which  involves activities that increase your heart rate and breathing such  as walking, jogging, or biking. Building cardiorespiratory endurance  through aerobic exercise is an excellent way to maintain a healthy  weight. Working on this element of physical fitness also improves  your circulatory system. It boosts your ability to supply the body’s  cells with oxygen and nutrients, and to remove carbon dioxide and  metabolic waste. Aerobic exercise is continuous exercise (lasting  more than 2 minutes) that can range from low to high levels of  intensity. In addition, aerobic exercise increases heart and  brea

In [4]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 1 | 1 | 0.0 | |
| 2 | -39 | 320 | 54 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 3 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 147 | 3 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


In [5]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 14.18 | 287.0 |
| std | 348.86 | 560.38 | 95.83 | 9.54 | 140.1 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 190.5 |
| 50% | 562.5 | 1231.5 | 216.0 | 13.0 | 307.88 |
| 75% | 864.25 | 1603.5 | 272.0 | 19.0 | 400.88 |
| max | 1166.0 | 2308.0 | 430.0 | 82.0 | 577.0 |


### Further text processing (splitting pages into sentences)

There are two ways to do this:
1. Split on `"."`.
2. Use an NLP library such as spaCy or NLTK.
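
To see why splitting on `"."` gives only a rough sentence count, consider the sketch below. The sample text and the regular expression are illustrative assumptions, not part of the notebook's pipeline. The regex is only a stand-in for a smarter splitter and has blind spots of its own, such as abbreviations.

```python
import re

text = "Zinc intake was 2.5 mg per day. That is low."

# Naive approach: every period is a boundary, including the decimal point,
# and the final period leaves a trailing empty string.
naive_pieces = text.split(".")
print(len(naive_pieces))  # 4

# Illustrative alternative: split only where sentence-ending punctuation
# is followed by whitespace and a capital letter.
regex_pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
print(regex_pieces)  # ['Zinc intake was 2.5 mg per day.', 'That is low.']
```

The naive split reports four pieces for a two-sentence text, which is consistent with `page_sentence_count_raw` acting as a loose upper estimate.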

In [6]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline
nlp.add_pipe("sentencizer")

# Create document instance as an example
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2

# Print out our sentences split
list(doc.sents)

[This is a sentence., This is another sentence.]

In [34]:
pages_and_texts[:2]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition',
  'sentences': ['Human Nutrition: 2020 Edition'],
  'page_sentence_count_spacy': 1,
  'sentence_chunks': [['Human Nutrition: 2020 Edition']],
  'num_chunks': 1},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': '',
  'sentences': [],
  'page_sentence_count_spacy': 0,
  'sentence_chunks': [],
  'num_chunks': 0}]

In [8]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings (the default type is a spaCy Span)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

In [9]:
random.sample(pages_and_texts, k=1)

[{'page_number': 212,
  'page_char_count': 1323,
  'page_word_count': 224,
  'page_sentence_count_raw': 18,
  'page_token_count': 330.75,
  'text': 'The Beverage Panel recommends an even lower intake of calories  from beverages than IOM—10 percent or less of total caloric intake.  Table 3.10 Recommendations of the Beverage Panel  Beverage  Servings per day*  Water  ≥ 4 (women), ≥ 6 (men)  Unsweetened coffee and tea  ≤ 8 for tea, ≤ 4 for coffee  Nonfat and low-fat milk; fortified soy drinks ≤ 2  Diet beverages with sugar substitutes  ≤ 4  100 percent fruit juices, whole milk, sports  drinks  ≤ 1  Calorie-rich beverages without nutrients  ≤ 1, less if trying to lose  weight  *One serving is eight ounces.  Source: Beverage Panel Recommendations and Analysis. University  of North Carolina, Chapel Hill. US Beverage Guidance Council.  http://www.cpc.unc.edu/projects/nutrans/policy/beverage/us- beverage-panel. Accessed November 6, 2012.  Sources of Drinking Water  The Beverage Panel recommend

In [10]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 14.18 | 287.0 | 10.32 |
| std | 348.86 | 560.38 | 95.83 | 9.54 | 140.1 | 6.3 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 190.5 | 5.0 |
| 50% | 562.5 | 1231.5 | 216.0 | 13.0 | 307.88 | 10.0 |
| 75% | 864.25 | 1603.5 | 272.0 | 19.0 | 400.88 | 15.0 |
| max | 1166.0 | 2308.0 | 430.0 | 82.0 | 577.0 | 28.0 |


### Chunking our sentences together

In [11]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

def split_list(input_list: list[str],
               slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]
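
As a quick check of the slicing logic, the number of chunks should equal the sentence count divided by the chunk size, rounded up, with any remainder landing in a shorter final chunk. The sketch below re-declares `split_list` so that it runs on its own; the sample sizes are arbitrary, apart from 28, which is the largest spaCy sentence count in the statistics above.

```python
import math

def split_list(input_list: list, slice_size: int = 10) -> list[list]:
    """Split a list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

chunk_sizes = {}
for n in (0, 5, 10, 25, 28):
    chunks = split_list(list(range(n)))
    # Number of chunks is ceil(n / slice_size); an empty list yields no chunks
    assert len(chunks) == math.ceil(n / 10)
    chunk_sizes[n] = [len(chunk) for chunk in chunks]

print(chunk_sizes)
# {0: [], 5: [5], 10: [10], 25: [10, 10, 5], 28: [10, 10, 8]}
```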

In [12]:
# Loop through pages and text, and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                        slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

In [13]:
random.sample(pages_and_texts, k=1)

[{'page_number': 130,
  'page_char_count': 1404,
  'page_word_count': 225,
  'page_sentence_count_raw': 18,
  'page_token_count': 351.0,
  'text': 'longer than three months significantly reduces the incidence and  severity of diarrhea and respiratory illnesses.1  Zinc supplementation also has been found to be therapeutically  beneficial for the treatment of leprosy, tuberculosis, pneumonia,  and the common cold. Equally important to remember is that  multiple studies show that it is best to obtain your minerals and  vitamins from eating a variety of healthy foods.  Just as undernutrition compromises immune system health, so  does overnutrition. People who are obese are at increased risk for  developing immune system disorders such as asthma, rheumatoid  arthritis, and some cancers. Both the quality and quantity of fat  affect immune system function. High intakes of saturated and trans  fats negatively affect the immune system, whereas increasing your  intake of omega-3 fatty acids, fou

In [14]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 14.18 | 287.0 | 10.32 | 1.53 |
| std | 348.86 | 560.38 | 95.83 | 9.54 | 140.1 | 6.3 | 0.64 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 190.5 | 5.0 | 1.0 |
| 50% | 562.5 | 1231.5 | 216.0 | 13.0 | 307.88 | 10.0 | 1.0 |
| 75% | 864.25 | 1603.5 | 272.0 | 19.0 | 400.88 | 15.0 | 2.0 |
| max | 1166.0 | 2308.0 | 430.0 | 82.0 | 577.0 | 28.0 | 3.0 |


### Splitting each chunk into its own item

In [15]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the list of sentences into a single paragraph-like string
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" => ". A" (will work for any uppercase letter)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token ~4 chars

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)

1843
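
Joining sentence strings with an empty separator can leave no space after a period, which is what the `re.sub` call repairs. The sketch below isolates that step in a hypothetical helper, `join_sentences`, and runs it on made-up sentences. It also shows a side effect worth knowing about: any period followed directly by a capital letter is rewritten, including ones inside abbreviations.

```python
import re

def join_sentences(sentences: list[str]) -> str:
    """Join sentences as in the loop above, then restore the space after periods."""
    joined = "".join(sentences).replace("  ", " ").strip()
    return re.sub(r"\.([A-Z])", r". \1", joined)  # ".A" => ". A"

# Sentence boundary is repaired
repaired = join_sentences(["Iron is a mineral.", "It is essential."])
print(repaired)  # Iron is a mineral. It is essential.

# Decimals are untouched because the period is followed by a digit
decimal = join_sentences(["Intake was 2.5 mg."])
print(decimal)  # Intake was 2.5 mg.

# Abbreviations with internal periods are also split apart
abbreviation = join_sentences(["Grown in the U.S.A."])
print(abbreviation)  # Grown in the U. S. A.
```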

In [16]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 658,
  'sentence_chunk': 'Image by Allison Calabrese / CC BY 4.0  Iron Toxicity The body excretes little iron and therefore the potential for accumulation in tissues and organs is considerable. Iron accumulation in certain tissues and organs can cause a host of health problems in children and adults including extreme fatigue, arthritis, joint pain, and severe liver and heart toxicity. In children, death has occurred from ingesting as little as 200 mg of iron and therefore it is critical to keep iron supplements out of children’s reach. The IOM has set tolerable upper intake levels of iron (Table 11.2 “Dietary Reference Intakes for Iron”). Mostly a hereditary disease, hemochromatosis is the result of a genetic mutation that leads to abnormal iron metabolism and an accumulation of iron in certain tissues such as the liver, pancreas, and heart. The signs and symptoms of hemochromatosis are similar to those of iron overload 658 | Iron',
  'chunk_char_count': 918,
  'chunk_

In [17]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1843.0 | 1843.0 | 1843.0 | 1843.0 |
| mean | 583.38 | 734.1 | 112.74 | 183.52 |
| std | 347.79 | 447.51 | 71.24 | 111.88 |
| min | -41.0 | 12.0 | 3.0 | 3.0 |
| 25% | 280.5 | 315.0 | 45.0 | 78.75 |
| 50% | 586.0 | 745.0 | 115.0 | 186.25 |
| 75% | 890.0 | 1118.0 | 173.0 | 279.5 |
| max | 1166.0 | 1830.0 | 297.0 | 457.5 |


### Filter out short chunks of text

Very short chunks are often just page headers, footers, or citation fragments, so they may not contain much useful information.

In [29]:
# Show random chunks with 30 tokens or fewer
min_token_length = 30
for _, row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 11.75 | Text: Accessed March 17, 2018. Sports Nutrition | 961
Chunk token count: 19.25 | Text: 2018). Centers for Disease Control and 998 | The Causes of Food Contamination
Chunk token count: 28.75 | Text: Accessed September 22, 2017. Dietary, Behavioral, and Physical Activity Recommendations for Weight Management | 505
Chunk token count: 24.25 | Text: These activities are available in the web-based textbook and not available in the Magnesium | 643
Chunk token count: 9.5 | Text: 742 | Building Healthy Eating Patterns


In [31]:
# Keep only the rows with more than 30 tokens (drops the short chunks)
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]
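
The same threshold filter can also be written without pandas. The following is a minimal sketch using a list comprehension over hypothetical sample records shaped like the chunk dictionaries above; the texts and the boundary record are made up to show that the comparison is strict.

```python
min_token_length = 30

# Hypothetical records: one short footer, one chunk exactly at the threshold, one longer chunk
sample_chunks = [
    {"page_number": 742, "sentence_chunk": "short footer text", "chunk_token_count": 9.5},
    {"page_number": 100, "sentence_chunk": "boundary chunk text", "chunk_token_count": 30.0},
    {"page_number": -38, "sentence_chunk": "longer chunk text", "chunk_token_count": 52.5},
]

# Strict ">" means a chunk with exactly 30 tokens is dropped as well
kept_chunks = [chunk for chunk in sample_chunks
               if chunk["chunk_token_count"] > min_token_length]

kept_pages = [chunk["page_number"] for chunk in kept_chunks]
print(kept_pages)  # [-38]
```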

In [33]:
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_number': 382,
  'sentence_chunk': 'Learning Activities Technology Note: The second edition of the Human Nutrition Open Educational Resource (OER) textbook features interactive learning activities. These activities are available in the web-based textbook and not available in the downloadable versions (EPUB, Digital PDF, Print_PDF, or Open Document). Learning activities may be used across various mobile devices, however, for the best user experience it is strongly recommended that users complete these activities using a desktop or laptop computer and in Google Chrome.  An interactive or media element has been excluded from this version of the text. You can view it online here: http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=254  382 | Protein Digestion and Absorption',
  'chunk_char_count': 745,
  'chunk_word_count': 106,
  'chunk_token_count': 186.25}]

### Embedding our text chunks