In [None]:
import pandas as pd
import numpy as np

## 1. Getting an article with the SQL Database

To make all of our data easy to access, we have combined the articles as text into a SQLite database. First upload the database file to your Google Drive or directly to Google Colab and connect to it. You can then retrieve any article by querying the database with its DOI.

In [None]:
import sqlite3
conn = sqlite3.connect('/content/db_bwl.db')
cursor = conn.cursor()
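
Before querying, it can help to check which tables and columns the database actually contains. The following is a minimal sketch on an in-memory stand-in database; the helper names `list_tables` and `list_columns` and the reduced `documents` table are assumptions made for illustration. In the notebook you would pass the real `conn` instead of `demo_conn`.

```python
import sqlite3

def list_tables(connection):
    """Return the names of all tables in the database."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]

def list_columns(connection, table):
    """Return the column names of `table` via SQLite's table_info pragma."""
    # PRAGMA statements cannot take bound parameters, so only pass trusted table names.
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]  # index 1 holds the column name

# Stand-in database for illustration only (hypothetical, reduced schema).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE documents (key TEXT PRIMARY KEY, title TEXT, abstract TEXT, text_nougat TEXT)"
)

tables = list_tables(demo_conn)
columns = list_columns(demo_conn, "documents")
print(tables)
print(columns)
```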

In [None]:
def fetch_all_keys():
    cursor.execute("SELECT key FROM documents WHERE text_nougat IS NOT NULL")
    return [row[0] for row in cursor.fetchall()]

def query_document(doi):
    cursor.execute("SELECT key, text, text_nougat, abstract, title, filepath, vhb_journal_title, vhb_issn FROM documents WHERE key=?", (doi,))
    doc = cursor.fetchone()
    if doc is None:  # fetchone() returns None when no row matches the DOI
        raise KeyError(f"No document with DOI {doi!r} in the database")
    return {'doi': doc[0], 'text': doc[1], 'text_nougat': doc[2], 'abstract': doc[3], 'title': doc[4], 'filepath': doc[5], 'vhb_journal_title': doc[6], 'vhb_issn': doc[7]}
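
As an alternative to building the dictionary by index, `sqlite3.Row` lets you access columns by name. This is only a sketch on stand-in data: the function name `query_document_safe`, the reduced table and the DOI values are made up for illustration.

```python
import sqlite3

# Stand-in data for illustration; the DOI, title and abstract are made up.
demo_conn = sqlite3.connect(":memory:")
demo_conn.row_factory = sqlite3.Row  # rows now support access by column name
demo_conn.execute("CREATE TABLE documents (key TEXT PRIMARY KEY, title TEXT, abstract TEXT)")
demo_conn.execute(
    "INSERT INTO documents VALUES (?, ?, ?)",
    ("10.0000/example", "An example title", "An example abstract."),
)

def query_document_safe(connection, doi):
    """Return the matching row as a dict, or None if the DOI is not in the table."""
    row = connection.execute(
        "SELECT key AS doi, title, abstract FROM documents WHERE key = ?", (doi,)
    ).fetchone()
    return dict(row) if row is not None else None

found = query_document_safe(demo_conn, "10.0000/example")
missing = query_document_safe(demo_conn, "10.0000/not-there")
print(found)
print(missing)
```

Returning `None` for an unknown DOI is one possible design; raising an error is equally reasonable if a missing article should stop the run.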

In [None]:
all_dois_in_db = fetch_all_keys()  # all DOIs in the database that have a Nougat text (text_nougat is not NULL)

In [None]:
# Fetch a document by its DOI and look at, for example, its text and title.
# The returned dict has more fields than text_nougat and title. Feel free to look at these.
doi = '10.1287/isre.1110.0411'
doc = query_document(doi)
text_raw = doc['text_nougat']
title = doc['title']

## 2. Filtering for relevant sentences

### Idea A. Using Embeddings

Here we use the bge-base-en-v1.5 embedding model. If you want to, you can use any other embedding model instead, e.g. one from this benchmark: https://huggingface.co/spaces/mteb/leaderboard. The bge-base model is already a strong starting point, though.

As you have already learned in the seminar, embeddings are well suited to finding relevant passages in text. Feel free to improve the process by using a different model, pre-processing the text, or replacing our basic query with a better one.
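
To make the idea concrete, here is a toy sketch of embedding-based retrieval: each text chunk and the query are vectors, and the chunks closest to the query are returned. The 3-dimensional vectors and the helper names `cosine_similarity` and `top_k` are invented for illustration. Real embeddings have hundreds of dimensions, and the vector store used in this notebook may rank by a different distance measure (for example Euclidean distance).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Hypothetical toy "embeddings" (not produced by any real model).
query_vec = [1.0, 0.0, 0.0]
chunk_vecs = [
    [0.9, 0.1, 0.0],  # points almost in the same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.7, 0.7, 0.0],  # somewhere in between
]

best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```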

In [None]:
!pip install langchain faiss-cpu sentence-transformers -qq
from langchain.vectorstores import FAISS

In [None]:
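from langchain.embeddings import HuggingFaceBgeEmbeddings
# HuggingFaceBgeEmbeddings is the LangChain wrapper intended for BGE models.
# The variable name is kept so that the cells below still work.
instructor_embeddings = HuggingFaceBgeEmbeddings(model_name='BAAI/bge-base-en-v1.5',
                                                 model_kwargs={"device": "cuda"},
                                                 encode_kwargs={"normalize_embeddings": True})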

In [None]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator=".",
    chunk_size=500,
    chunk_overlap=100,
    length_function=len,
)
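
To see what `chunk_size` and `chunk_overlap` roughly mean, here is a simplified sketch of greedy chunking with overlap. It is not LangChain's actual implementation; the function `chunk_text` is a hypothetical stand-in that only illustrates the idea of merging pieces up to a size limit and repeating the tail of one chunk at the start of the next.

```python
def chunk_text(text, separator=".", chunk_size=500, chunk_overlap=100):
    """Merge separator-delimited pieces into chunks of at most chunk_size characters
    (a single oversized piece is kept as is) and start each new chunk with trailing
    pieces of the previous one totalling at most chunk_overlap characters."""
    joiner = separator + " "
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], []
    for piece in pieces:
        candidate = joiner.join(current + [piece])
        if current and len(candidate) > chunk_size:
            chunks.append(joiner.join(current))
            # Collect trailing pieces of the finished chunk as overlap.
            overlap, total = [], 0
            for prev in reversed(current):
                if total + len(prev) > chunk_overlap:
                    break
                overlap.insert(0, prev)
                total += len(prev)
            current = overlap
        current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks

demo_chunks = chunk_text(
    "Alpha beta. Gamma delta. Epsilon zeta. Eta theta.",
    chunk_size=25,
    chunk_overlap=12,
)
for chunk in demo_chunks:
    print(chunk)
```

With these toy settings each chunk holds two sentences and shares one sentence with its neighbour, which is the effect the overlap is meant to achieve: a relevant sentence is less likely to be cut off from its context.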

In [None]:
text_split = text_splitter.split_text(text_raw)

In [None]:
docsearch = FAISS.from_texts(text_split, instructor_embeddings)
QUERY = 'we use machine learning model'
text_relevant = [page.page_content for page in docsearch.similarity_search(QUERY, k=5)]

### Idea B. Using Keyword Search

Here we use a self-built function that checks whether a sentence contains a specific combination of keywords. Feel free to improve it if you would like to pursue this direction. This basic version is likely to miss many important sentences.

In [None]:
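import re

def classify_sentence(sentence):
    """Return True if the sentence has a first-person word and a search keyword."""
    sentence = sentence.lower()
    relevant_keywords = ['i', 'our', 'we']
    search_keywords = ['machine learning']

    # Match whole words only: a plain substring test finds 'i' in almost every sentence.
    def contains(keyword):
        return re.search(rf"\b{re.escape(keyword)}\b", sentence) is not None

    has_relevant_keywords = any(contains(word) for word in relevant_keywords)
    has_search_keywords = any(contains(word) for word in search_keywords)

    return has_relevant_keywords and has_search_keywords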

In [None]:
text_split = text_raw.split(".")

In [None]:
text_relevant = [sentence.strip() for sentence in text_split if classify_sentence(sentence)]
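
Splitting on every `"."` also cuts numbers such as `0.01` in half. One possible improvement, sketched here under the assumption that sentences start with an uppercase letter, is to split only where sentence-ending punctuation is followed by whitespace and a capital. The helper `split_sentences` is hypothetical and will still split after abbreviations such as "et al." when a capitalised word follows.

```python
import re

# Split after '.', '!' or '?' only when followed by whitespace and an uppercase letter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Split text into sentences without breaking decimal numbers."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]

sample = "We train a model with a learning rate of 0.01. It reaches 91.5% accuracy. Results follow."
sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```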

### Idea C. Other Possibilities

Instead of filtering sentences with embeddings or keywords, you could use specific parts of the text, for example the abstract of the article.

Another idea is to use a classification model that predicts which sentences are relevant. Search online or think about how you could find the few sentences in an article that are most informative about the thing you want to know.

In [None]:
# Example using the abstract
text_relevant = doc['abstract']

The abstract is a special case because it is already stored as a separate field in the database. If you want to use other sections, such as the method section of the article, you need a more creative way to locate the section in the text.
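
One possible way to get at other sections is to look for headings. The sketch below assumes that the text contains Markdown-style headings (lines starting with `#`), which you should verify on a few articles first. The function `extract_section` and the sample text are made up for illustration.

```python
import re

def extract_section(markdown_text, keyword):
    """Return the body of the first section whose heading contains `keyword`
    (case-insensitive), or None if no heading matches."""
    # A heading is a line that starts with one or more '#' characters.
    headings = list(
        re.finditer(r"^(#+)[ \t]*(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE)
    )
    for i, match in enumerate(headings):
        if keyword.lower() in match.group(2).lower():
            start = match.end()
            # The section ends where the next heading (of any level) begins.
            end = headings[i + 1].start() if i + 1 < len(headings) else len(markdown_text)
            return markdown_text[start:end].strip()
    return None

sample = (
    "# Introduction\nSome intro.\n\n"
    "## 3 Research Method\nWe train a random forest.\n\n"
    "## 4 Results\nIt works."
)
method_text = extract_section(sample, "method")
print(method_text)
```

Because the section is cut at the next heading of any level, subsections are not included. Extend the end condition if you need them.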

## 3. Extract an entity from the text

Here we use a large language model to extract the names of the machine learning models. Ideas for improving this step include writing a better prompt than the current one or fine-tuning the language model.
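
As a starting point for prompt experiments, the sketch below assembles a prompt that lists the passages separately and states the expected output format explicitly. The function `build_prompt` and its wording are only one hypothetical variant, and whether it actually improves the extraction has to be tested.

```python
def build_prompt(passages, entity_description="machine learning models"):
    """Assemble an instruction prompt that asks for a JSON array of entity names."""
    if isinstance(passages, str):  # accept a single string as well as a list
        passages = [passages]
    context = "\n".join(f"- {p.strip()}" for p in passages if p.strip())
    return (
        "### Instruction:\n"
        f"Read the following passages from a research article:\n{context}\n\n"
        f"List the {entity_description} that the authors use. "
        'Answer only with a JSON array of strings, for example ["model A", "model B"]. '
        "If none are mentioned, answer with [].\n\n"
        "### Response:\n"
    )

example_prompt = build_prompt(["We use a random forest.", "  We compare it with an SVM. "])
print(example_prompt)
```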

In [None]:
!pip3 install "transformers>=4.32.0" "optimum>=1.12.0" accelerate -qq
!pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  -qq

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/OpenOrca-Platypus2-13B-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=200,
    do_sample=False  # greedy, deterministic decoding; temperature only applies when sampling
)

In [None]:
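# text_relevant is a list of passages for Ideas A and B and a single string for Idea C.
context = text_relevant if isinstance(text_relevant, str) else " ".join(text_relevant)

prompt_template = f'''### Instruction:
Text: "{context}"

What machine learning models are used in the text? Return the data in JSON format.
### Response:
'''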

In [None]:
print(pipe(prompt_template)[0]['generated_text'].split("### Response:")[-1])
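
Since the prompt asks for JSON, you will usually want to parse the answer rather than just print it. The sketch below looks for the first JSON object or array in the generated text and returns `None` if nothing parseable is found. The helper `parse_json_answer` and the sample output are made up for illustration; real model output can be messier.

```python
import json
import re

def parse_json_answer(generated_text):
    """Return the first JSON object or array after '### Response:', or None."""
    answer = generated_text.split("### Response:")[-1]
    # Greedy match from the first opening bracket to the last matching closing one.
    match = re.search(r"(\{.*\}|\[.*\])", answer, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

# Hypothetical model output, not produced by the model above.
sample_output = (
    "### Instruction:\nText: \"...\"\n### Response:\n"
    '{"models": ["random forest", "SVM"]}\nI hope this helps.'
)
parsed = parse_json_answer(sample_output)
print(parsed)
```

In the notebook you would call it as `parse_json_answer(pipe(prompt_template)[0]['generated_text'])`.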