We are going to ingest four scientific papers from two different disciplines:

1. [LIGO: The Laser Interferometer Gravitational-Wave Observatory](https://arxiv.org/pdf/0711.3041.pdf)
2. [Livestock as a potential biological control agent for an invasive wetland plant](https://peerj.com/articles/567/)
3. [Quantum black holes as classical space factories](https://arxiv.org/abs/2308.15519)
4. [Effects of Invasive Goats (Capra hircus) on Mediterranean Island Communities](https://deepblue.lib.umich.edu/handle/2027.42/117674)

The goal of this notebook is to create a file [data/text.pkl](data/text.pkl) that contains the text of the papers in a format that is easy to work with, with an OpenAI embedding already generated for each line of text.

## Dependencies

The following packages are required in addition to those in [requirements.txt](requirements.txt):

```
pip install openai==0.28.0 python-dotenv==1.0.0 tqdm==4.66.1 PyMuPDF==1.23.3
```

You will also need `OPENAI_API_KEY` defined in your `.env` file; alternatively, modify the code below to use your API key directly.

## Pre-Processing

The PDFs were downloaded and manually processed into text files. The raw text can be extracted with PyMuPDF (imported as `fitz`), for example:

```python
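import fitz  # PyMuPDF

# Concatenate the plain text of every page in the PDF
doc = fitz.open('docs/example.pdf')
text = ""
for page in doc:
    text += page.get_text()

# Write the extracted text out for manual clean-up
with open('docs/example.txt', 'w', encoding='utf-8') as f:
    f.write(text)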
```

Further processing involved:
1. Extracting the title onto a single line
2. Extracting the content into a line per paragraph
   - Authors have been removed
   - Citations / references have been removed
   - Figure / Table captions remain, each on their own line

Each file thus contains 32 lines: the title, followed by the first 31 paragraphs of content.
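
Because this clean-up is manual, a quick structural check on each processed file can catch stray blank lines or a missing paragraph. The following is a minimal sketch, not part of the original notebook; the helper name `check_processed_file` and its `expected_lines` parameter are hypothetical, and it assumes one title line plus 31 paragraph lines per file.

```python
from pathlib import Path

def check_processed_file(path, expected_lines=32):
    """Return the stripped lines of a processed file, or raise ValueError if malformed."""
    lines = [line.strip() for line in Path(path).read_text(encoding="utf-8").splitlines()]

    # Every line should hold either the title or one paragraph, never be empty
    blank = [i for i, line in enumerate(lines, start=1) if not line]
    if blank:
        raise ValueError(f"{path}: blank line(s) at {blank}")

    # Title + 31 paragraphs = 32 lines (an assumption taken from the text above)
    if len(lines) != expected_lines:
        raise ValueError(f"{path}: expected {expected_lines} lines, found {len(lines)}")
    return lines
```

For example, `check_processed_file('docs/goats.txt')` would return the 32 lines of that file or raise if the file does not have the expected shape.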

## Load Text into DataFrame

In [1]:
import pandas as pd

# Function to read a text file and return its lines as a list
def read_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.readlines()

# Read the lines from the four text files
goats_lines = read_file('docs/goats.txt')
gravity_lines = read_file('docs/gravity.txt')
blackholes_lines = read_file('docs/blackholes.txt')
invasivegoats_lines = read_file('docs/invasivegoats.txt')

# Create DataFrames for each file
df_goats = pd.DataFrame({'type': 'goats', 'text': goats_lines})
df_gravity = pd.DataFrame({'type': 'gravity', 'text': gravity_lines})
df_blackholes = pd.DataFrame({'type': 'blackholes', 'text': blackholes_lines})
df_invasivegoats = pd.DataFrame({'type': 'invasivegoats', 'text': invasivegoats_lines})

# Combine the DataFrames
df_combined = pd.concat([df_goats, df_gravity, df_blackholes, df_invasivegoats]).reset_index(drop=True)

# Show the combined DataFrame
print(df_combined)

              type                                               text
0            goats  Livestock as a potential biological control ag...
1            goats  Invasive species threaten biodiversity and inc...
2            goats  Invasive species globally threaten biodiversit...
3            goats  Invasive plants that form expansive monocultur...
4            goats  Under natural field settings, there is broad s...
..             ...                                                ...
123  invasivegoats  Soil Effects. In this system, soil chemical ch...
124  invasivegoats  Effects on Arthropods. We did not observe an e...
125  invasivegoats  Our observations of increasing estimated arthr...
126  invasivegoats  Effects of Seabirds. Seabirds drive these isle...
127  invasivegoats  Seabirds can also reduce plant biomass by tram...

[128 rows x 2 columns]


## Compute OpenAI Embeddings

In [2]:
openai_embed_model = "text-embedding-ada-002"

import os
from dotenv import load_dotenv
if not load_dotenv('.env',override=True):
    raise Exception("Couldn't load .env file")

envVars = ['OPENAI_API_KEY']
missing = []

for var in envVars:
    if var not in os.environ:
        missing.append(var)

if missing:
    raise EnvironmentError(f'These environment variables are missing: {missing}')

import openai
import time
import numpy as np
from tqdm import tqdm

def get_embeddings(text_list):
    embedding_list = []
    for text in tqdm(text_list, desc='Getting embeddings', unit='text'):
        done = False
        while not done:
            try:
                response = openai.Embedding.create(input=[text], engine=openai_embed_model)
                embedding_list.append(response['data'][0]['embedding'])
                done = True
            except Exception as e:
                print(f"Exception occurred: {e}. Retrying in 5 seconds...")
                time.sleep(5)
    
    return embedding_list

embedding_results = get_embeddings(df_combined['text'])
df_combined['openaiEmbeddings'] = embedding_results
df_combined

Getting embeddings: 100%|██████████| 128/128 [00:42<00:00,  3.04text/s]


| | type | text | openaiEmbeddings |
|---|---|---|---|
| 0 | goats | Livestock as a potential biological control ag... | [-0.014313258230686188, -0.009999215602874756,... |
| 1 | goats | Invasive species threaten biodiversity and inc... | [-0.014021526090800762, -0.01595783233642578, ... |
| 2 | goats | Invasive species globally threaten biodiversit... | [-0.002096143551170826, -0.012656336650252342,... |
| 3 | goats | Invasive plants that form expansive monocultur... | [-0.012120183557271957, -0.0009395491215400398... |
| 4 | goats | Under natural field settings, there is broad s... | [-0.010701132006943226, -0.006763485725969076,... |
| ... | ... | ... | ... |
| 123 | invasivegoats | Soil Effects. In this system, soil chemical ch... | [0.018184464424848557, -0.0065793865360319614,... |
| 124 | invasivegoats | Effects on Arthropods. We did not observe an e... | [0.006668189540505409, -0.007469975855201483, ... |
| 125 | invasivegoats | Our observations of increasing estimated arthr... | [0.02312556654214859, -0.0031290866900235415, ... |
| 126 | invasivegoats | Effects of Seabirds. Seabirds drive these isle... | [-0.005768843926489353, -0.022223688662052155,... |
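
The loop in `get_embeddings` keeps retrying on any exception, so a permanent failure such as an invalid API key would make it loop forever. One possible variation, sketched here as an assumption rather than as part of the original notebook, caps the number of attempts and waits longer after each failure. The names `with_retries`, `max_attempts`, `base_delay` and `sleep` are hypothetical.

```python
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, giving up after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as e:
            # Out of attempts: re-raise the last error instead of looping forever
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed: {e}. Retrying in {delay:.0f} seconds...")
            sleep(delay)
```

Inside `get_embeddings`, the API call could then be wrapped as `with_retries(lambda: openai.Embedding.create(input=[text], engine=openai_embed_model))`. The `sleep` parameter exists only so the waiting can be replaced in tests.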


In [3]:
# Each of the four types should have 32 non-null values (128 rows in total)
print(df_combined.groupby('type').count())

               text  openaiEmbeddings
type                                 
blackholes       32                32
goats            32                32
gravity          32                32
invasivegoats    32                32


Finally, we save the DataFrame for use in the next notebook.

In [None]:
df_combined.to_pickle('data/text.pkl')
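
Before relying on the saved file in later notebooks, it may be worth checking that every embedding has the expected shape. The sketch below is an illustrative addition, not part of the original notebook; the helper name `check_embeddings` is hypothetical, and the default of 1536 is the vector length returned by `text-embedding-ada-002`.

```python
import math

def check_embeddings(embeddings, expected_dim=1536):
    """Return the number of embeddings checked, raising ValueError on a bad one."""
    count = 0
    for i, vec in enumerate(embeddings):
        # Every vector from the same model should have the same length
        if len(vec) != expected_dim:
            raise ValueError(f"row {i}: expected {expected_dim} values, found {len(vec)}")
        # NaN or infinite values would break later similarity calculations
        if not all(math.isfinite(x) for x in vec):
            raise ValueError(f"row {i}: contains NaN or infinite values")
        count += 1
    return count
```

Calling `check_embeddings(df_combined['openaiEmbeddings'])` on the DataFrame built above should return the number of rows if all embeddings are well formed.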