# Data Ingestion

This project is about creating an AI vegan nutrition advisor. Here we explore and experiment with APIs to collect our context data, on which we will ultimately build a RAG solution. We use the Springer API to get research papers and articles on veganism. This is the most freely accessible API for this kind of work; some others, like PubMed, are good but require a subscription. The docs can be found here: https://docs-dev.springernature.com/docs/#api-endpoints/api-endpoints.

For our environment, we use Poetry in each module folder, so we can just navigate to the relevant directory on the command line, i.e., move to the modules/data_ingestion folder. Then, to activate the environment there, run

```bash
poetry shell
```

(inside WSL if on Windows). Then run

```bash
jupyter notebook
```

and you will get a localhost link. Next, in a notebook (in any directory with your env file, the root directory in this case), click the `select_kernel` button at the top right, click `Select Another Kernel`, and then `Existing Jupyter Server`. Paste the localhost link there and press Enter. You can also type `pwd` in a code cell to double-check your current directory as needed.

In [1]:
pwd

'/mnt/c/Users/RaviB/GitHub/vegan-ai-nutritionist'

If you need to install any library, you can do so here with the command `poetry add library` (replace library with whatever you want). 

In [32]:
import os

# Switch to the directory containing the pyproject.toml file
os.chdir("modules/data_ingestion")

# Install libraries using poetry, uncomment and change library names as needed
!poetry add python-dotenv

#switch back to the root directory
os.chdir("../..")

Using version ^1.0.1 for python-dotenv

Updating dependencies
Resolving dependencies... (2.0s)

Package operations: 1 install, 0 updates, 0 removals

  - Installing python-dotenv (1.0.1)

Writing lock file


Double check the current directory again. Keep in mind the env file is in the root directory so you should be there (or wherever you put it).

In [2]:
pwd

'/mnt/c/Users/RaviB/GitHub/vegan-ai-nutritionist'

Let's load in our libraries and env variables.

In [117]:
import os
import requests
from dotenv import load_dotenv

load_dotenv()

True
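
As far as I know, a `True` from `load_dotenv()` only tells us that at least one variable was loaded, not that the specific key we need is present. A minimal sketch of a stricter check follows; `require_env` is a hypothetical helper name, not part of python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration: it fails early instead of letting
    a missing key surface later as a failed API request.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is missing or empty")
    return value


# Usage in this notebook would look like:
# springer_api_key = require_env("SPRINGER_NATURE_API")
```

Failing at load time makes a misnamed or missing entry in the env file easier to spot than an error status from the API later on.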

## Getting Meta Data

The Springer API has two endpoints: one to get metadata, and the other to get the full text of a paper if it is open access. We need the DOI (digital object identifier) from the metadata first, so that we can later query the endpoint that returns the full text, so let's start there.

In [118]:
springer_api_key = os.getenv("SPRINGER_NATURE_API")
base_url = "http://api.springernature.com/openaccess/json"

After some experimentation, it looks like the maximum number of results per request, `p`, is 25, and not 20 as the documentation says.

In [119]:
query = "vegan nutrition"  # Search for vegan-related papers
params = {
    "q": query,
    "s": 1,
    "p": 25,
    "api_key": springer_api_key
}
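
To see what these parameters turn into on the wire, the query string can be built by hand with the standard library. This is only an inspection sketch: `build_request_url` is a hypothetical helper and `YOUR_KEY` is a placeholder, and the URL that `requests` actually sends should look much the same but is not guaranteed to be byte-identical.

```python
from urllib.parse import urlencode


def build_request_url(base_url: str, params: dict) -> str:
    """Join a base URL and a dict of query parameters into one URL string."""
    return f"{base_url}?{urlencode(params)}"


demo_params = {
    "q": "vegan nutrition",  # search query; the space gets encoded
    "s": 1,                  # starting record
    "p": 25,                 # records per request
    "api_key": "YOUR_KEY",   # placeholder, not a real key
}

print(build_request_url("http://api.springernature.com/openaccess/json", demo_params))
# http://api.springernature.com/openaccess/json?q=vegan+nutrition&s=1&p=25&api_key=YOUR_KEY
```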

In [120]:
response = requests.get(base_url, params=params)
response.status_code

401

In [93]:
results = response.json()

In [94]:
meta_data_lst = []

for record in results.get("records", []):
    if record.get("openAccess"):
        meta_data = {
            "content_type": record.get("contentType"),
            "url": record.get("url"),
            "title": record.get("title"),
            "publication_name": record.get("publicationName"),
            "doi": record.get("doi"),
            "publication_date": record.get("publicationDate"),
            "starting_page": record.get("startingPage"),
            "ending_page": record.get("endingPage"),
            "abstract": record.get("abstract")
        }
        meta_data_lst.append(meta_data)
    
print(len(meta_data_lst))

25


In [88]:
meta_data_lst

[{'content_type': 'Article',
  'url': [{'format': '',
    'platform': '',
    'value': 'http://dx.doi.org/10.1007/s00217-024-04565-1'}],
  'title': 'Valorization of plant proteins for meat analogues design—a comprehensive review',
  'publication_name': 'European Food Research and Technology',
  'doi': '10.1007/s00217-024-04565-1',
  'publication_date': '2024-10-01',
  'starting_page': '2479',
  'ending_page': '2513',
  'abstract': {'h1': 'Abstract',
   'p': 'Animal proteins from meat and its stuffs have recently been one of main concerns in the drive for sustainable food production. This viewpoint suggests that there are exciting prospects to reformulate meat products that are produced more sustainably and may also have health benefits by substituting high-protein nonmeat ingredients for some of the meat. Considering these pre-existing conditions, this review critically reviews recent data on extenders from several sources, including pulses, plant-based components, plant byproducts, an

Everything above works so let's make a function for this.

In [121]:
def get_meta_data(query, api_key, base_url="http://api.springernature.com/openaccess/json", starting_record=1, max_records=25):
    """Collect metadata for papers that match the query from the Springer Nature API.

    Args:
        query (str): The search query, e.g. "vegan".
        api_key (str): The API key for accessing the Springer Nature API.
        base_url (str, optional): The endpoint of the Springer Nature API to get metadata. 
            Defaults to "http://api.springernature.com/openaccess/json".
        starting_record (int, optional): The starting record number. Defaults to 1.
        max_records (int, optional): The max number of records you can query at a time. 
            Defaults to 25, and is capped at 25 if set higher.

    Returns:
        list: A list of metadata dictionaries for open-access papers that match the query.
        If the request fails, returns an error message string instead.
    """
    
    if max_records > 25:
        max_records = 25
    
    params = {
        "q": query,
        "api_key": api_key,
        "s": starting_record,
        "p": max_records,
    }
    
    response = requests.get(base_url, params=params)
    status_code = response.status_code
    
    if status_code == 200:
        results = response.json()
        
        meta_data_lst = []

        for record in results.get("records", []):
            # Double check if the record is open access
            if record.get("openAccess"):
                meta_data = {
                    "content_type": record.get("contentType"),
                    "url": record.get("url"),
                    "title": record.get("title"),
                    "publication_name": record.get("publicationName"),
                    "doi": record.get("doi"),
                    "publication_date": record.get("publicationDate"),
                    "starting_page": record.get("startingPage"),
                    "ending_page": record.get("endingPage"),
                    "open_access": record.get("openAccess"),
                    "abstract": record.get("abstract")
                }
                meta_data_lst.append(meta_data)
            
        return meta_data_lst
    else:
        return "Error: request failed with status code {}".format(status_code)
    

Keep in mind that the starting record is a record number, not a page number. So if you want more records than the limit of 25, run the query again with starting_record set to the previous starting_record value plus the previous max_records value.
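
The offset arithmetic can be sketched as a small generator. `starting_records` is a hypothetical helper name, and the cap of 25 is the limit observed in this notebook rather than a documented value.

```python
def starting_records(total_records: int, page_size: int = 25):
    """Yield the 1-based starting record for each request needed to cover total_records."""
    page_size = min(page_size, 25)  # cap observed during experimentation above
    yield from range(1, total_records + 1, page_size)


print(list(starting_records(100)))  # [1, 26, 51, 76]
```

Each value can be passed as `starting_record` to fetch consecutive, non-overlapping batches.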

In [123]:
springer_api_key = os.getenv("SPRINGER_NATURE_API")
base_url = "http://api.springernature.com/openaccess/json"

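# query and api_key are the first two positional arguments; base_url is passed by keyword
list1 = get_meta_data("vegan", springer_api_key, base_url=base_url, starting_record=1, max_records=25)
list2 = get_meta_data("vegan", springer_api_key, base_url=base_url, starting_record=26, max_records=25)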

In [156]:
list1[0:2]

[{'content_type': 'Article',
  'url': [{'format': '',
    'platform': '',
    'value': 'http://dx.doi.org/10.1007/s00217-024-04565-1'}],
  'title': 'Valorization of plant proteins for meat analogues design—a comprehensive review',
  'publication_name': 'European Food Research and Technology',
  'doi': '10.1007/s00217-024-04565-1',
  'publication_date': '2024-10-01',
  'starting_page': '2479',
  'ending_page': '2513',
  'open_access': 'true',
  'abstract': {'h1': 'Abstract',
   'p': 'Animal proteins from meat and its stuffs have recently been one of main concerns in the drive for sustainable food production. This viewpoint suggests that there are exciting prospects to reformulate meat products that are produced more sustainably and may also have health benefits by substituting high-protein nonmeat ingredients for some of the meat. Considering these pre-existing conditions, this review critically reviews recent data on extenders from several sources, including pulses, plant-based compone
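
The output above shows that `url` is a list of dictionaries, `abstract` is a dictionary, and `open_access` is the string `'true'` rather than a boolean. Since any non-empty string is truthy in Python, a check like `if record.get("openAccess")` would also pass for the string `'false'`. The sketch below flattens one record into simpler values. `flatten_meta_data` is a hypothetical helper, and it assumes every record has the same shape as the one printed here, which has not been verified against the whole API.

```python
def flatten_meta_data(meta_data: dict) -> dict:
    """Return a copy of a metadata dict with url, abstract and open_access simplified."""
    urls = meta_data.get("url") or []
    abstract = meta_data.get("abstract") or {}

    flat = dict(meta_data)
    # Keep only the first URL value, if there is one
    flat["url"] = urls[0].get("value") if urls else None
    # Keep only the abstract text when the abstract is a dict with a 'p' key
    flat["abstract"] = abstract.get("p") if isinstance(abstract, dict) else abstract
    # Compare against the string 'true' instead of relying on truthiness
    flat["open_access"] = str(meta_data.get("open_access")).lower() == "true"
    return flat


example = {
    "doi": "10.1007/s00217-024-04565-1",
    "url": [{"format": "", "platform": "", "value": "http://dx.doi.org/10.1007/s00217-024-04565-1"}],
    "open_access": "true",
    "abstract": {"h1": "Abstract", "p": "(abstract text shortened for this example)"},
}
print(flatten_meta_data(example)["url"])
```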

## Getting the Full Text

Now let's explore how to get the text.

In [None]:
import xml.etree.ElementTree as ET

In [138]:
base_url = "http://api.springernature.com/openaccess/jats"
doi = "10.1007/s00217-024-04565-1"  # Replace with the actual DOI of the paper you're interested in
doi_ex = list1[0]["doi"]
params = {
    "api_key": springer_api_key,
    "q": doi_ex  # Query using DOI or other identifier
}

In [139]:
response = requests.get(base_url, params=params)
response.status_code

200

In [140]:
xml_content = response.content
root = ET.fromstring(xml_content)
root

<Element 'response' at 0x7f537c5f5c10>

In [149]:
body_section = root.find(".//body")
full_text = []

if body_section is not None:
    for section in body_section.findall(".//sec"):
        section_title = section.find("title")
        if section_title is not None and section_title.text:
            section_title_text = section_title.text
            #full_text.append(f"Section: {section_title.text}")
        else:
            section_title_text = ""
        
        paragraph_text = ""
        for paragraph in section.findall(".//p"):
            if paragraph.text:
                paragraph_text += paragraph.text
        
        full_text.append(
            {
                "section": section_title_text, 
                "body": paragraph_text
            }
        )
        
#print(full_text)

In [151]:
def get_full_text(doi, api_key, base_url="http://api.springernature.com/openaccess/jats"):
    """
    Retrieves the full text content of a journal article given its DOI and API key.

    Args:
        doi (str): The DOI of the article.
        api_key (str): The API key for accessing the Springer Nature API.
        base_url (str, optional): The base URL for the API. Defaults to "http://api.springernature.com/openaccess/jats".

    Returns:
        list: A list of dictionaries, where each dictionary contains the section title and body text of the article.
        If the request fails, returns an error message.
    """
    
    params = {
        "q": doi,
        "api_key": api_key
    }
    
    response = requests.get(base_url, params=params)
    status_code = response.status_code
    
    if status_code == 200:
        xml_content = response.content
        root = ET.fromstring(xml_content)
        
        body_section = root.find(".//body")
        full_text = []
        
        if body_section is not None:
            for section in body_section.findall(".//sec"):
                section_title = section.find("title")
                
                if section_title is not None and section_title.text:
                    section_title_text = section_title.text
                else:
                    section_title_text = ""
                
                paragraph_text = ""
                for paragraph in section.findall(".//p"):
                    if paragraph.text:
                        paragraph_text += paragraph.text
                
                full_text.append(
                    {
                        "section": section_title_text, 
                        "body": paragraph_text
                    }
                )
        
        return full_text
    else:
        return "Error: request failed with status code {}".format(status_code)

In [152]:
get_full_text("10.1007/s00217-024-04565-1", springer_api_key)

[{'section': 'Introduction',
  'body': "Meat is recognized as a very popular food item worldwide and it is well known as an excellent quality protein source with other nutritional characteristics along with its appealing taste. With the growing rate of the planet's population, the need for food security is rising as well, and to feed this growing population a greater amount of good quality food having proper protein, fat, and other nutrition is required. Meanwhile, increased environmental footprint awareness plays a significant role in meat analogues supply for the sustainable and transparent food security of the planet. Animal is the solitary bioresource of meat protein and with rapid population growth, the need for meat protein is also increasing. Various data show that the demand will be magnified near to twice by 2050 [Changes in the different meat prices as per FAO meat price index. (Data Source: OECD-FAO Agricultural Outlook 2022–2031)Meat Greenhouse gas emissions intensity per r
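
One thing to note in the output above is that the text seems to break off at the opening bracket of a citation. In ElementTree, `paragraph.text` only holds the text before the first child element, so anything after an inline tag is skipped. A possible alternative, sketched here on a made-up XML snippet (the `xref` tag is an assumption about how citations are marked up), is to join everything with `itertext()`:

```python
import xml.etree.ElementTree as ET


def paragraph_text(paragraph: ET.Element) -> str:
    """Join all text inside a paragraph, including text nested in inline tags."""
    joined = "".join(paragraph.itertext())
    # Collapse runs of whitespace left over from the XML layout
    return " ".join(joined.split())


# Made-up paragraph for illustration, not taken from the API response
sample = ET.fromstring(
    "<p>Demand may double by 2050 [<xref>1</xref>], according to some data.</p>"
)

print(sample.text)             # Demand may double by 2050 [
print(paragraph_text(sample))  # Demand may double by 2050 [1], according to some data.
```

Swapping this into the paragraph loop would also be a natural place to add a space between paragraphs, which the current concatenation omits.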

## Saving the Data

Let's actually edit the get_meta_data function to immediately get the full text as soon as we get the doi value. This way, we don't have to iterate twice unnecessarily.

In [158]:
def get_paper_data(query, api_key, base_url="http://api.springernature.com/openaccess/json", starting_record=1, max_records=25):
    """Collect metadata and full text for papers that match the query from the Springer Nature API.

    Args:
        query (str): The search query, e.g. "vegan".
        api_key (str): The API key for accessing the Springer Nature API.
        base_url (str, optional): The endpoint of the Springer Nature API to get metadata. 
            Defaults to "http://api.springernature.com/openaccess/json".
        starting_record (int, optional): The starting record number. Defaults to 1.
        max_records (int, optional): The max number of records you can query at a time. 
            Defaults to 25, and is capped at 25 if set higher.

    Returns:
        list: A list of dictionaries, each with the "meta_data" and "content" (full text) of a paper.
        If the metadata request fails, returns an error message string instead.
    """
    
    if max_records > 25:
        max_records = 25
    
    params = {
        "q": query,
        "api_key": api_key,
        "s": starting_record,
        "p": max_records,
    }
    
    response = requests.get(base_url, params=params)
    status_code = response.status_code
    
    if status_code == 200:
        results = response.json()
        
        papers_data = []

        for record in results.get("records", []):
            # Double check if the record is open access
            if record.get("openAccess"):
                meta_data = {
                    "content_type": record.get("contentType"),
                    "url": record.get("url"),
                    "title": record.get("title"),
                    "publication_name": record.get("publicationName"),
                    "doi": record.get("doi"),
                    "publication_date": record.get("publicationDate"),
                    "starting_page": record.get("startingPage"),
                    "ending_page": record.get("endingPage"),
                    "open_access": record.get("openAccess"),
                    "abstract": record.get("abstract")
                }
                
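                # Fetch the full text right away, inside the open-access check,
                # so meta_data is always defined and matches this record
                full_text = get_full_text(record.get("doi"), api_key)

                record_data = {
                    "meta_data": meta_data,
                    "content": full_text
                }

                papers_data.append(record_data)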
            
        return papers_data
    else:
        return "Error: request failed with status code {}".format(status_code)
    

In [159]:
testing_data = get_paper_data("vegan", springer_api_key)

In [163]:
testing_data[0]

{'meta_data': {'content_type': 'Article',
  'url': [{'format': '',
    'platform': '',
    'value': 'http://dx.doi.org/10.1007/s00217-024-04565-1'}],
  'title': 'Valorization of plant proteins for meat analogues design—a comprehensive review',
  'publication_name': 'European Food Research and Technology',
  'doi': '10.1007/s00217-024-04565-1',
  'publication_date': '2024-10-01',
  'starting_page': '2479',
  'ending_page': '2513',
  'open_access': 'true',
  'abstract': {'h1': 'Abstract',
   'p': 'Animal proteins from meat and its stuffs have recently been one of main concerns in the drive for sustainable food production. This viewpoint suggests that there are exciting prospects to reformulate meat products that are produced more sustainably and may also have health benefits by substituting high-protein nonmeat ingredients for some of the meat. Considering these pre-existing conditions, this review critically reviews recent data on extenders from several sources, including pulses, plant-

This function gets all our data in one go. You just need to run it several times, and each run returns up to 25 records. Keep in mind the rate limit is 100 requests per minute, and each call makes one metadata request plus one full-text request per record through its nested get_full_text call, so up to 26 requests per call.

In [161]:
import time
import numpy as np

data = []

for i in np.arange(1, 100, 25):
    print(i)
    data += get_paper_data("vegan", springer_api_key, starting_record=i, max_records=25)
    
    # Each get_paper_data call makes up to 26 requests (1 metadata + 1 per record)
    # and the limit is 100 requests per minute, so longer runs need a pause between calls
    #time.sleep(16) # uncomment for longer runs

1
26
51
76
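
Counting from the code, one `get_paper_data` call issues one metadata request plus one `get_full_text` request per returned record. A rough way to pick a pause between calls is sketched below. `seconds_between_calls` is a hypothetical helper, and the default of 100 requests per minute is the figure quoted in this notebook.

```python
def seconds_between_calls(requests_per_call: int, limit_per_minute: int = 100) -> float:
    """Return the minimum average spacing between calls that stays within the rate limit."""
    return 60.0 * requests_per_call / limit_per_minute


# 1 metadata request + up to 25 full-text requests per get_paper_data call
print(seconds_between_calls(26))  # 15.6
```

This ignores the time the requests themselves take, so it errs on the slow side, which seems the safer direction for a rate limit.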
