# Data Preprocessing for Constructing the RAG Vector Database

In [1]:
import os
import re

import pdfplumber
import pytesseract
import requests
from bs4 import BeautifulSoup
from PIL import Image

## Dietary Guidelines for Americans, 2020-2025

The Dietary Guidelines for Americans, 2020-2025 provides advice on what to eat and drink to meet nutrient needs, promote health, and help prevent chronic disease. This edition of the Dietary Guidelines is the first to provide guidance for healthy dietary patterns by life stage, from birth through older adulthood, including women who are pregnant or lactating. The PDF was downloaded from https://www.dietaryguidelines.gov/resources/2020-2025-dietary-guidelines-online-materials

In [2]:
def is_meaningful_paragraph(text):
    """
    Function to determine if a paragraph is meaningful based on specific patterns.
    """
    irrelevant_patterns = [
        r"^(\d+)?(\s*[A-Za-z]+:\s*)?$",     # Page numbers or sole image captions
        r"^USDA is an equal opportunity.*$", # Equal opportunity statement
        r"^(.*?[a-zA-Z]+\.){10,}.*$",        # Lengthy legal disclaimers or contact details
        r"https?://\S+",                     # URLs, which are usually references (re.match only checks the start of the text)
        r"^To file a program discrimination complaint.*$",
        r"^For more information, contact.*$",
        r"^(Mail:|Fax:|Email:).*",           # Contact details
        r"^NOTE:.*$",                        # Notes potentially across sections
        r"^(See Appendix.*|Appendix \d+.*)$",  # References to appendices
        r"^The total dietary pattern should not exceed Dietary Guidelines limits.*$",  # Repeated advisory
        r"^All foods are assumed to be in nutrient-dense forms.*$",
        r"^be within the Acceptable Macronutrient Distribution Ranges.*$",
        r"^\s*Page\s+\d+",                   # Matches page numbers
    ]

    for pattern in irrelevant_patterns:
        if re.match(pattern, text.strip(), re.IGNORECASE):
            return False
    return True
    
def clean_text(text):
    # Remove the Unicode replacement character (U+FFFD) left behind by failed decoding
    return text.replace('\ufffd', '')


def extract_text_from_image(image):
    """
    Use OCR to extract text from images.
    Ensure Tesseract is properly installed and configured on your system.
    """
    return pytesseract.image_to_string(image)

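def extract_text_from_pdf(pdf_path):
    """
    Extract the filtered page text, plus OCR text from embedded images, from a PDF.
    """
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages):
            if page_number < 11:  # page_number is zero-based: skip the first 11 pages, which are not relevant
                continue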
            text_content = page.extract_text()
            if text_content:
                paragraphs = re.split(r'\n{2,}', text_content)
                for paragraph in paragraphs:
                    if is_meaningful_paragraph(paragraph):
                        text += paragraph + "\n"
            
            # Check if there are images and process them
            if page.images:
                for image in page.images:
                    if 'bbox' in image:
                        with page.within_bbox(image['bbox']) as cropped_page:
                            pil_image = cropped_page.to_image().original
                            image_text = extract_text_from_image(pil_image)
                            image_text = clean_text(image_text)
                            if image_text and is_meaningful_paragraph(image_text):
                                text += image_text + '\n'
    return text

def save_text_to_txt(text, output_path):
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(text)

def process_and_save_pdf(pdf_path, output_dir):
    base_filename = os.path.splitext(os.path.basename(pdf_path))[0]
    output_path = os.path.join(output_dir, f"{base_filename}.txt")
    text_from_pdf = extract_text_from_pdf(pdf_path)
    print(text_from_pdf[:5000])
    save_text_to_txt(text_from_pdf, output_path)

def main(pdf_files, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    for pdf_file in pdf_files:
        print(f"Processing {pdf_file}...")
        process_and_save_pdf(pdf_file, output_dir)

# List of PDF file paths
pdf_files = [
    "C:/Users/Qianwen Li/Desktop/Dietary_Guidelines_for_Americans_2020-2025.pdf" 
    # The file was downloaded from https://www.dietaryguidelines.gov/resources/2020-2025-dietary-guidelines-online-materials
]

output_directory = './extracted_texts'

main(pdf_files, output_directory)

Processing C:/Users/Qianwen Li/Desktop/Dietary_Guidelines_for_Americans_2020-2025.pdf...
The Guidelines
Make every bite count with the Dietary Guidelines for Americans�
Here’s how:
1
Follow a healthy dietary pattern at every life stage�
At every life stage—infancy, toddlerhood, childhood, adolescence, adulthood, pregnancy, lactation, and
older adulthood—it is never too early or too late to eat healthfully.
• For about the first 6 months of life, exclusively feed infants human milk. Continue to feed infants
human milk through at least the first year of life, and longer if desired. Feed infants iron-fortified infant
formula during the first year of life when human milk is unavailable. Provide infants with supplemental
vitamin D beginning soon after birth.
• At about 6 months, introduce infants to nutrient-dense complementary foods. Introduce infants to
potentially allergenic foods along with other complementary foods. Encourage infants and toddlers
to consume a variety of foods from all 
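
The paragraph filter can be sanity-checked on a few hand-written strings before it is run over the whole PDF. The following is a minimal sketch, not the notebook's own code: `PATTERNS` is a hypothetical three-entry subset of the pattern list, `is_meaningful` is an illustrative stand-in for `is_meaningful_paragraph`, and the sample strings are invented for the check. It also illustrates why every list entry needs its own trailing comma, since Python silently concatenates adjacent string literals into a single pattern.

```python
import re

# Hypothetical subset of the filter patterns, for illustration only.
PATTERNS = [
    r"^(Mail:|Fax:|Email:).*",   # contact details
    r"^NOTE:.*$",                # notes
    r"^\s*Page\s+\d+",           # page numbers
]

def is_meaningful(text, patterns=PATTERNS):
    """Return False if the stripped text matches any boilerplate pattern."""
    stripped = text.strip()
    return not any(re.match(p, stripped, re.IGNORECASE) for p in patterns)

# Invented sample strings, not taken from the PDF.
samples = [
    "Page 12",
    "NOTE: values are rounded.",
    "Follow a healthy dietary pattern at every life stage.",
]
results = {s: is_meaningful(s) for s in samples}

# Without a comma, the two literals merge into ONE pattern,
# which no longer filters out a line that starts with "NOTE:".
merged = [r"^(Mail:|Fax:|Email:).*" r"^NOTE:.*$"]
merged_keeps_note = is_meaningful("NOTE: values are rounded.", merged)

for sample, keep in results.items():
    print(f"{keep!s:5}  {sample}")
print("merged pattern keeps the NOTE line:", merged_keeps_note)
```

Running a check like this on a handful of known boilerplate lines is a cheap way to confirm that each pattern is actually active.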

## Wikibooks: Fundamentals of Human Nutrition
This wikibook is part of the UF Food Science and Human Nutrition Department course, Fundamentals of Human Nutrition. The instructor of this course has a PhD in human nutrition and works in both nutrition education and research. The aim of this textbook is to provide an open, trustworthy educational resource on international human nutrition. The instructor has taught at both the undergraduate and professional levels, with a strong emphasis on the incorporation of nutrition knowledge in health care and disease prevention. Their research expertise spans nutritional biochemistry, community nutrition, and nutrition in higher education. The URLs of the wikibook pages used here are listed below.

In [3]:
urls = [
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Defining_Nutrition",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Nutritional_Science",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/History_of_Vitamins_and_Minerals",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Average_Macronutrient_Distribution_Range",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Dietary_Reference_Intakes",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/International_Dietary_Guidelines",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/MyPlate",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Dietary_Planning",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Dietary_Assessment",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Gastrointestinal_system",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Digestion",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Absorption",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Gut_health",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Defining_Carbohydrates",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Storage",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Functions",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Sugar_and_Disease",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Dietary_intake_Carbs",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Defining_Proteins",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Protein_quality",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Synthesis",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Protein_Functions",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Dietary_intake",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Proteins_and_Health",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Vegetarian_Diets",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Defining_lipids",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Lipid_storage",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Lipid_Functions",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Lipid_intake",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Lipids_and_Health",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Vitamin_A",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Vitamin_D",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Vitamin_E",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Vitamin_K",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Thiamin",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Riboflavin",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Niacin",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Folate",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Vitamin_B12",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Vitamin_B6",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Pantothenic_acid",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Vitamin_C",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Sodium",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Chloride",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Potassium",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Water",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Calcium",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Phosphorous",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Magnesium",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Iron",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Zinc",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Iodine",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Selenium",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Copper",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Manganese_Trace",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Chromium",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Molybdenum",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Fluoride",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Glucose",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Amino_acids",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Glycerol_and_fatty_acids",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Citric_Acid_cycle",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Electron_transport_chain",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Alcohol_metabolism",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Energy_expenditure",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Body_composition",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Weight_management",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Fitness_basics",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Energy_systems",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Diet_and_Fitness",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Hydration",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Ergogenic_Aids",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Anorexia_Nervosa",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Bulemia_Nervosa",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Binge_Eating_Disorder",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Non-categorized_Eating_Disorder",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Nutrition_and_Mental_Health",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Nutrition_and_Bioactive_Compounds",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Food_Wars",
    "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/Nutrition_as_a_Profession"
]
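
Every URL in this list shares the same prefix, so the list could also be generated from the page titles alone. The sketch below shows one way to do that; `BASE_URL`, `build_urls`, and `example_titles` are hypothetical names introduced here for illustration, and only three of the titles are shown.

```python
from urllib.parse import quote

# Common prefix of all the wikibook pages listed above.
BASE_URL = "https://en.wikibooks.org/wiki/Fundamentals_of_Human_Nutrition/"

def build_urls(page_titles, base_url=BASE_URL):
    """Join each page title onto the base URL, percent-encoding unsafe characters."""
    return [base_url + quote(title) for title in page_titles]

# A few titles copied from the list above.
example_titles = ["Defining_Nutrition", "Vitamin_B12", "Non-categorized_Eating_Disorder"]
example_urls = build_urls(example_titles)

for url in example_urls:
    print(url)
```

Keeping only the titles makes typos in the shared prefix impossible and keeps the list easier to scan.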

In [4]:
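# Fetch the webpage HTML and extract the text
def fetch_text(url):
    """Return the main article text of a page, or an empty string on failure."""
    try:
        response = requests.get(url, timeout=30)  # avoid hanging on a stalled connection
    except requests.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")
        return ""
    if response.status_code != 200:
        print(f"Failed to fetch {url} (HTTP {response.status_code})")
        return ""
    soup = BeautifulSoup(response.text, 'html.parser')
    # Keep only the main content, which sits in the element with id "mw-content-text"
    content_div = soup.find(id='mw-content-text')
    if content_div is None:
        print(f"Content not found for {url}")
        return ""
    return content_div.get_text(separator='\n', strip=True)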

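def clean_text(text):
    # Note: this redefines the clean_text helper from the PDF section.
    # Remove bracketed numbers typically used for references, e.g. "[1]"
    text = re.sub(r'\[\d+\]', '', text)
    # Remove edit links, brackets included; get_text() splits them across
    # lines as "[", "edit", "|", "edit source", "]"
    text = re.sub(r'\[\s*edit\s*\|\s*edit source\s*\]', '\n', text)
    # Remove reference markers that were split across lines, e.g. "[\n1\n]"
    text = re.sub(r'\[\n\d+\n\]', '\n', text)
    # Remove excessive newlines from text
    text = re.sub(r'\n+', '\n', text)

    return text.strip()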


# Combine all page texts into a single text file
combined_text = ""
for url in urls:
    page_text = fetch_text(url)
    if page_text:
        combined_text += page_text + "\n\n"  # Add some spacing between sections

combined_text_clean = clean_text(combined_text)
with open("combined_text.txt", "w", encoding="utf-8") as text_file:
    text_file.write(combined_text_clean)

print("Text content saved to combined_text.txt")


Text content saved to combined_text.txt
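
Because `get_text(separator='\n', strip=True)` puts every text node on its own line, edit links and reference markers reach the cleaning step spread across several lines. The sketch below checks a set of cleaning regexes against a short sample in that shape. It is an illustration under assumptions: `clean_wiki_text` is a hypothetical helper, and `sample` is a shortened, hand-written string rather than real fetched output.

```python
import re

def clean_wiki_text(text):
    """Strip edit links and reference markers that are split across lines."""
    # "[ edit | edit source ]" with any whitespace (including newlines) between tokens
    text = re.sub(r'\[\s*edit\s*\|\s*edit source\s*\]', '', text)
    # "[1]" as well as "[\n1\n]"
    text = re.sub(r'\[\s*\d+\s*\]', '', text)
    # Collapse the blank lines left behind by the removals
    text = re.sub(r'\n\s*\n+', '\n', text)
    return text.strip()

# Hand-written sample mimicking the one-node-per-line layout.
sample = (
    "1.1 Defining Nutrition\n[\nedit\n|\nedit source\n]\n"
    "Nutrition is a science (2014).\n[\n1\n]\nOur bodies"
)
cleaned = clean_wiki_text(sample)
print(cleaned)
```

Allowing `\s*` between the tokens lets one pattern cover both the single-line and the split-line form of each marker.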


In [5]:
print(combined_text[:5000])  # raw text, before clean_text is applied

TOC
Fundamentals of Human Nutrition
Defining Nutrition
Nutritional Science
1.1 Defining Nutrition
[
edit
|
edit source
]
1.1.1 You Are What You Eat
[
edit
|
edit source
]
Nutrition is the science that interprets the interaction of nutrients and other substances in food in relation to maintenance, growth, reproduction, health and disease of an organism. It includes food intake, absorption, assimilation, biosynthesis, catabolism and excretion (2014).
[
1
]
Our bodies contain similar nutrients to the food we eat. Therefore, depending on what kind of food we are consuming and the contents of that food, we are affecting our nutrient levels and over all, our health. On average, the human body is 6% minerals, carbohydrates, and other nutrients, 16% fat, 16% protein, and 62% water. Of course these percentages vary for every individual person depending on diet and lifestyle. Our bodies contain similar nutrients to the food we eat. Therefore, depending on what kind of food we are consuming and t
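
Before text like this goes into a vector database, it is usually split into chunks that can be embedded one at a time. This notebook does not show that step, so the following is only a minimal sketch that assumes fixed-size character windows with overlap. `chunk_text`, `chunk_size`, and `overlap` are hypothetical names and values, and the right chunking strategy depends on the embedding model and the retriever.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Tiny demonstration string so the windows are easy to inspect.
demo = "abcdefghij" * 3  # 30 characters
demo_chunks = chunk_text(demo, chunk_size=12, overlap=4)

for chunk in demo_chunks:
    print(repr(chunk))
```

The overlap repeats the tail of each chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.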