# Analysis of Abstracts Using a Large Language Model (LLM)

## Introduction

In this Jupyter Notebook, we analyze abstracts loaded from a CSV file using a Large Language Model (LLM). The goal is to assign each abstract to a set of predefined topics and to extract specific details, such as methods, new materials, and funding sources, by prompting the LLM.

### Objectives

- **Data Loading**: Import and preprocess abstracts from a CSV file.
- **Text Analysis**: Utilize LLMs to analyze the content of the abstracts.

### Tools and Libraries

- **LangChain**: To interface with the LLM (here, `llama3` served through Ollama).

### Workflow

1. **Data Import**: Load the CSV file containing the abstracts.
2. **LLM Integration**: Use the LLM to perform various NLP tasks.

By the end of this notebook, you will have seen how to use an LLM to tag scientific abstracts with topics and to extract structured information from them.

In [2]:
#!pip install  pymupdf python-Levenshtein nltk
#!pip install  git+https://github.com/huggingface/transformers.git
!pip install --upgrade transformers



[notice] A new release of pip is available: 23.3.1 -> 24.2
[notice] To update, run: pip install --upgrade pip


1. **Data Import**: Load the CSV file containing the abstracts.


In [2]:
import csv
from langchain_community.llms import Ollama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from downloader import *
#!pip install git+https://github.com/facebookresearch/nougat
#%pip install -qU langchain-text-splitters

In [3]:
from LLM import LLM
from pathlib import Path

llm = LLM()
llm.pdf_to_md(Path("scholarly_creative_work/papers/10.2533_chimia.2019.1028.pdf"))


OSError: [Errno 30] Read-only file system: '/home/m/mehrad/brikiyou/.cache/huggingface'
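
The `OSError` above indicates that the default Hugging Face cache directory under the home folder is not writable. One possible workaround, assuming the local `LLM` class loads its models through the Hugging Face libraries, is to point the `HF_HOME` environment variable at a writable location before those libraries are imported. The helper name and the temporary directory below are illustrative choices, not part of the original notebook.

```python
import os
import tempfile
from pathlib import Path


def use_writable_hf_cache(cache_dir=None):
    """Point the Hugging Face cache at a writable directory and return its path."""
    # Assumption: a folder under the system temp directory is writable.
    if cache_dir is None:
        cache_dir = Path(tempfile.gettempdir()) / "hf_cache"
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # HF_HOME has to be set before transformers / huggingface_hub are imported.
    os.environ["HF_HOME"] = str(cache_dir)
    return cache_dir


cache_path = use_writable_hf_cache()
print(cache_path)
```

Running this in the first cell of the notebook, before `from LLM import LLM`, gives the setting a chance to take effect.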

In [12]:
def read_csv(file_path, num_lines):
    """Read the first num_lines lines (header included) and map each DOI to its abstract."""
    with open(file_path, 'r', encoding="utf8", errors='ignore') as file:
        reader = csv.reader(file)
        dico = {}
        for i, row in enumerate(reader):
            if i >= num_lines:
                break
            if i == 0:
                continue  # skip the header row
            # Column 17 holds the DOI and column 4 holds the abstract.
            dico[row[17]] = {"abstract": row[4]}
            # Optional PDF download, currently disabled:
            # try:
            #     downloader = Downloader(row[17], 'doi', f"pdfs/{row[17]}.pdf")
            #     downloader.download()
            #     dico[row[17]].update({"pdf": True})
            # except Exception as e:
            #     dico[row[17]].update({"pdf": False})
            #     print(e)
        print(dico)
        return dico

file_path = "filtered_AC_publications.csv"
num_lines = 5
dico = read_csv(file_path, num_lines)

# Append each DOI to pdfs.txt, one per line.
with open("pdfs.txt", "a") as f:
    for key in dico:
        f.write(key + "\n")


{'10.1038/s41586-020-2746-2': {'abstract': 'The genetic circuits that allow cancer cells to evade destruction by the host immune system remain poorly understood1\x963. Here, to identify a phenotypically robust core set of genes and pathways that enable cancer cells to evade killing mediated by cytotoxic T\xa0lymphocytes (CTLs), we performed genome-wide CRISPR screens across a panel of genetically diverse mouse cancer cell lines that were cultured in the presence of CTLs. We identify a core set of 182\xa0genes across these mouse cancer models, the individual perturbation of which increases either the sensitivity or the resistance of cancer cells to CTL-mediated toxicity. Systematic exploration of our dataset using genetic co-similarity reveals the hierarchical and coordinated manner in which genes and pathways act in cancer cells to orchestrate their evasion of CTLs, and shows that discrete functional modules that control the interferon response and tumour necrosis factor (TNF)-induced 
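
The printed abstract contains `\x96` and `\xa0`, which suggests the CSV text originated as Windows-1252 (where byte `0x96` is an en dash) and passed through a mismatched decoding step. The following is a minimal sketch of a cleanup pass under that assumption; `repair_cp1252_controls` is a hypothetical helper, not a function from the notebook.

```python
def repair_cp1252_controls(text):
    """Map stray C1 control characters to their Windows-1252 meaning."""
    repaired = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                # Reinterpret the code point as a single Windows-1252 byte.
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                ch = ""  # this byte has no Windows-1252 meaning; drop it
        repaired.append(ch)
    # Replace non-breaking spaces with ordinary spaces.
    return "".join(repaired).replace("\xa0", " ")


sample = "poorly understood1\x963 ... cytotoxic T\xa0lymphocytes"
print(repair_cp1252_controls(sample))
```

Applying such a pass to each abstract before prompting keeps control characters out of the text sent to the model.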

## LLM Analysis

In [None]:

def get_topic(abstract: str):
    """Ask the LLM which of the predefined topics apply to an abstract."""
    if abstract is None:
        raise ValueError("Abstract is required")

    llm = Ollama(model='llama3', temperature=0.2)
    parser = JsonOutputParser()

    # AI / accelerated materials discovery / SDLs / autonomous labs / high-throughput experimentation / high-throughput DFT
    topics = ["Machine Learning", "Batteries", "AI", "accelerated materials discovery", "Self Driving Labs", "autonomous labs", "high-throughput experimentation", "high-throughput DFT"]

    # Passed to the template as a variable value, so its braces are not treated as placeholders.
    format_instructions = """

    The output needs to be formatted as follows:

    {
    "topic": {
    "topic1": ["Keyword1", "Keyword2", "Keyword3"],
    "topic2": ["Keyword1", "Keyword2", "Keyword3"]
    }
    }


    Only output the dictionary above, nothing else.
    """

    prompt = PromptTemplate(
        template="You are a text assistant. Identify which topics from the following list apply to the text given to you: {topics}. \n Here's the text: {abstract}. \n\n Note: A single text can belong to multiple topics, so please list all relevant topics. {format_instructions}",
        input_variables=["format_instructions", "abstract", "topics"]
    )

    chain = prompt | llm | parser
    result = chain.invoke({"format_instructions": format_instructions, "abstract": abstract, "topics": topics})
    # The parsed JSON has a single top-level key; return the mapping stored under it.
    return list(result.values())[0]


print(get_topic("The development of high-performance batteries is crucial for the future of electric vehicles. The current generation of batteries are not able to provide the range and power required for long-distance travel. This project aims to develop new materials for batteries that can provide higher energy density and faster charging times."))

def get_info(abstract: str = None, **kwargs):
    """Ask the LLM one question per keyword argument and return the answers by key."""
    if abstract is None:
        raise ValueError("Abstract is required")

    llm = Ollama(model='llama3', temperature=0.5)
    # The template is the same for every question, so build the chain once.
    prompt = PromptTemplate(
        template="You are a text assistant. Please provide the following information: {question}. \n\n Here's the text: {abstract}. \n\n If the text doesn't contain any information about the topic given, output: 'N/A'",
        input_variables=["abstract", "question"]
    )
    chain = prompt | llm

    dico = {}
    for key, question in kwargs.items():
        print(key, question)
        dico[key] = chain.invoke({"abstract": abstract, "question": question})
    print(dico)
    return dico
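
A language model may return topic names that are not in the list it was given, or vary their capitalization. The sketch below shows one way to filter the parsed output of `get_topic` against the allowed list. `keep_allowed_topics` is a hypothetical helper, and the case-insensitive matching is an assumption about what counts as the same topic.

```python
ALLOWED_TOPICS = [
    "Machine Learning", "Batteries", "AI", "accelerated materials discovery",
    "Self Driving Labs", "autonomous labs", "high-throughput experimentation",
    "high-throughput DFT",
]


def keep_allowed_topics(parsed, allowed=ALLOWED_TOPICS):
    """Keep only entries whose topic name appears in the allowed list."""
    if not isinstance(parsed, dict):
        return {}
    # Map lower-cased names back to the canonical spelling.
    canonical = {name.lower(): name for name in allowed}
    kept = {}
    for topic, keywords in parsed.items():
        name = canonical.get(str(topic).strip().lower())
        if name is None:
            continue  # the model produced a topic outside the list
        if isinstance(keywords, (list, tuple)):
            kept[name] = list(keywords)
        else:
            kept[name] = [str(keywords)]
    return kept


example = {"batteries": ["energy density", "fast charging"], "Cooking": ["recipes"]}
print(keep_allowed_topics(example))
```

In this example only the `Batteries` entry survives, with its keywords unchanged.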
    
    


    

    


## Workflow orchestration

In [None]:
for key in dico:
    dico[key].update({"topic":get_topic(dico[key]["abstract"])})
    dico[key].update(get_info(dico[key]["abstract"],affiliation="What affiliations do the authors or characters in the text have?",
                        new_materials="Does the text mention any new materials or discoveries?",
                        screening_algorithms="Are there any screening algorithms or systematic procedures discussed in the text?",
                        ai_algorithms="Does the text reference any AI algorithms or methods related to artificial intelligence?",
                        workflow="Can you describe the workflow or process followed in the text?",
                        methods="Can you summarize the methods or approaches mentioned in the text?",
                        models="What models or frameworks are discussed or used in the text?",
                        funding="Does the text mention any funding sources or sponsors?"))

print(dico)
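
If the model returns malformed JSON for one abstract, the loop above stops and the remaining abstracts are not processed. The following sketch isolates failures per record. `annotate` is a hypothetical wrapper, and `fake_topic` and `fake_info` are stand-ins for `get_topic` and `get_info` so that the example runs without a model.

```python
def annotate(records, topic_fn, info_fn, questions):
    """Add topics and answers to each record; collect failures instead of stopping."""
    failures = {}
    for key, record in records.items():
        try:
            record["topic"] = topic_fn(record["abstract"])
            record.update(info_fn(record["abstract"], **questions))
        except Exception as exc:  # e.g. unparsable model output
            failures[key] = str(exc)
    return failures


# Stand-ins for get_topic and get_info (assumptions for illustration only).
def fake_topic(abstract):
    if not abstract:
        raise ValueError("Abstract is required")
    return {"Batteries": ["energy density"]}


def fake_info(abstract, **questions):
    return {name: "N/A" for name in questions}


records = {
    "doi-1": {"abstract": "New battery materials."},
    "doi-2": {"abstract": ""},
}
questions = {"funding": "Does the text mention any funding sources or sponsors?"}
failures = annotate(records, fake_topic, fake_info, questions)
print(failures)
```

Keeping the questions in a dictionary also makes it easier to add or remove a question without touching the loop.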

### Writing the results to a JSON file

In [None]:


import json

with open("output.json", 'w') as file:
    json.dump(dico, file)
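
Abstracts often contain non-ASCII characters, which `json.dump` escapes by default. A minimal sketch of writing and reloading the results as readable UTF-8 follows. The helper names, the demo record, and the temporary file location are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_results(results, path):
    """Write results as indented UTF-8 JSON and return the path."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as file:
        # ensure_ascii=False keeps characters such as 'µ' readable in the file.
        json.dump(results, file, ensure_ascii=False, indent=2)
    return path


def load_results(path):
    """Read results back from a UTF-8 JSON file."""
    with Path(path).open("r", encoding="utf-8") as file:
        return json.load(file)


# Hypothetical record, not taken from the dataset.
demo = {"10.1000/example": {"abstract": "Particles of 2 µm", "topic": {"Batteries": ["anode"]}}}
demo_path = save_results(demo, Path(tempfile.gettempdir()) / "output_demo.json")
print(load_results(demo_path) == demo)
```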



In [1]:
from pathlib import Path
import requests


def go_get_pdf_link(doi):
    # Placeholder: the DOI is not used yet and a fixed URL is returned.
    return 'https://dacemirror.sci-hub.se/journal-article/459ea6cdde8059dec98aaac7493f90f7/wood2010.pdf#nav'


filename = Path('pdfs/metadata.pdf')
url = 'https://dacemirror.sci-hub.se/journal-article/459ea6cdde8059dec98aaac7493f90f7/wood2010.pdf#nav'
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error instead of saving the error page
filename.write_bytes(response.content)


387585
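
The number above is the byte count returned by `write_bytes`; it does not confirm that the saved file is a PDF, since a server can answer with an HTML page and still report status 200. A small check on the file signature, sketched below with a hypothetical helper name, can catch that case. PDF files begin with the marker `%PDF-`.

```python
def looks_like_pdf(data):
    """Return True if the bytes start with the PDF signature."""
    # Tolerate leading whitespace before the signature.
    return data[:1024].lstrip().startswith(b"%PDF-")


print(looks_like_pdf(b"%PDF-1.7\n..."))
print(looks_like_pdf(b"<html><body>Not found</body></html>"))
print(looks_like_pdf(b""))
```

The check could be applied to `response.content` before the file is written.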

## Created Software/Dataset

Binary: Yes/No

Output: numbers/DOI

In [2]:
import os
import json
import subprocess

import pandas as pd
from tqdm import tqdm

not_downloadable = []
Invalid_DOIs = []
downloaded_papers = []

# Candidate mirrors; not used by the loop below, which lets scidownl pick one.
mirrors = [
    "https://sci-hub.ee",
    "https://sci-hub.ren",
    "https://sci-hub.wf"
]

df = pd.read_csv('filtered_AC_publications.csv', encoding='ISO-8859-1', low_memory=False, delimiter=',')

for i, row in tqdm(df.iterrows(), total=len(df)):
    if i <= 1600:  # 1100 - 1500
        continue

    doi = row['DOI']
    # Skip missing DOIs and DOIs that contain parentheses.
    if pd.isna(doi) or "(" in doi or ")" in doi:
        print(f"\nPaper ID {row['ID']} doesn't have the DOI.")
        Invalid_DOIs.append(row['ID'])
        continue

    output_path = f"papers/{doi.replace('/', '_')}.pdf"
    if os.path.exists(output_path):
        continue  # already downloaded in an earlier run

    # Pass the arguments as a list so the DOI is never interpreted by a shell.
    command = ["scidownl", "download", "--doi", doi, "--out", output_path]
    # Alternative: python -m PyPaperBot --doi {doi_url} --dwn-dir papers/
    try:
        print(doi)
        subprocess.run(command, check=True, timeout=20)
    except Exception as e:
        print(e)

    # The file on disk is the source of truth for success.
    if not os.path.exists(output_path):
        print(f"\nDownload failed for PaperID {row['ID']} with DOI: {doi}.")
        not_downloadable.append((row['ID'], doi))
    else:
        downloaded_papers.append((row['ID'], doi))

stats = {'Downloaded': downloaded_papers,
         'Non-downloaded': not_downloadable,
         'Invalid_DOIs': Invalid_DOIs}

with open('stats.json', 'w') as f:
    json.dump(stats, f, indent=4)


  0%|          | 0/2733 [00:00<?, ?it/s]


Paper ID 617509 doesn't have the DOI.
10.1016/j.tifs.2022.05.003


[INFO] | 2024/09/12 18:30:30 | Run scihub tasks. Tasks information: 
[INFO] | 2024/09/12 18:30:30 |          DOI(s): ['10.1016/j.tifs.2022.05.003']
[INFO] | 2024/09/12 18:30:30 |          Output: papers/10.1016_j.tifs.2022.05.003.pdf
[INFO] | 2024/09/12 18:30:30 |      SciHub Url: <auto.availability_first>
[INFO] | 2024/09/12 18:30:30 | Choose scihub url [0]: https://sci-hub.st
[INFO] | 2024/09/12 18:30:30 | <- Request: scihub_url=https://sci-hub.st, source=DoiSource[type=doi, id=10.1016/j.tifs.2022.05.003], proxies={}
[INFO] | 2024/09/12 18:30:30 | -> Response: status_code=200, content_length=0
[INFO] | 2024/09/12 18:30:30 | Choose scihub url [1]: https://sci-hub.ru
[INFO] | 2024/09/12 18:30:30 | <- Request: scihub_url=https://sci-hub.ru, source=DoiSource[type=doi, id=10.1016/j.tifs.2022.05.003], proxies={}
[INFO] | 2024/09/12 18:30:31 | -> Response: status_code=200, content_length=0
[INFO] | 2024/09/12 18:30:31 | Choose scihub url [2]: http://sci-hub.se
[INFO] | 2024/09/12 18:30:31 |


Download failed for PaperID 618099 with DOI: 10.1016/j.tifs.2022.05.003.
10.1002/smm2.1117


[INFO] | 2024/09/12 18:30:37 | Run scihub tasks. Tasks information: 
[INFO] | 2024/09/12 18:30:37 |          DOI(s): ['10.1002/smm2.1117']
[INFO] | 2024/09/12 18:30:37 |          Output: papers/10.1002_smm2.1117.pdf
[INFO] | 2024/09/12 18:30:37 |      SciHub Url: <auto.availability_first>
[INFO] | 2024/09/12 18:30:37 | Choose scihub url [0]: https://sci-hub.st
[INFO] | 2024/09/12 18:30:37 | <- Request: scihub_url=https://sci-hub.st, source=DoiSource[type=doi, id=10.1002/smm2.1117], proxies={}
[INFO] | 2024/09/12 18:30:38 | -> Response: status_code=200, content_length=0
[INFO] | 2024/09/12 18:30:38 | Choose scihub url [1]: https://sci-hub.ru
[INFO] | 2024/09/12 18:30:38 | <- Request: scihub_url=https://sci-hub.ru, source=DoiSource[type=doi, id=10.1002/smm2.1117], proxies={}
[INFO] | 2024/09/12 18:30:38 | -> Response: status_code=200, content_length=0
[INFO] | 2024/09/12 18:30:38 | Choose scihub url [2]: http://sci-hub.se
[INFO] | 2024/09/12 18:30:38 | <- Request: scihub_url=http://sci-h

KeyboardInterrupt: 

[INFO] | 2024/09/12 18:30:39 | -> Response: status_code=200, content_length=0
[INFO] | 2024/09/12 18:30:39 | Choose scihub url [3]: http://sci-hub.ru
[INFO] | 2024/09/12 18:30:39 | <- Request: scihub_url=http://sci-hub.ru, source=DoiSource[type=doi, id=10.1002/smm2.1117], proxies={}
[INFO] | 2024/09/12 18:30:39 | -> Response: status_code=200, content_length=0
[INFO] | 2024/09/12 18:30:39 | Choose scihub url [4]: http://sci-hub.mobi
[INFO] | 2024/09/12 18:30:39 | <- Request: scihub_url=http://sci-hub.mobi, source=DoiSource[type=doi, id=10.1002/smm2.1117], proxies={}
[INFO] | 2024/09/12 18:30:40 | Choose scihub url [5]: https://sci-hub.se
[INFO] | 2024/09/12 18:30:40 | <- Request: scihub_url=https://sci-hub.se, source=DoiSource[type=doi, id=10.1002/smm2.1117], proxies={}
[INFO] | 2024/09/12 18:30:40 | -> Response: status_code=200, content_length=0
[INFO] | 2024/09/12 18:30:40 | Choose scihub url [6]: http://sci-hub.st
[INFO] | 2024/09/12 18:30:40 | <- Request: scihub_url=http://sci-hub.s
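
The download cell derives each output file name from the DOI and skips rows whose DOI is missing or contains parentheses. The sketch below gathers that logic into one helper. `doi_to_pdf_path` is a hypothetical name, and the regular expression is a common heuristic for modern DOIs rather than an official definition, so it may reject some valid identifiers.

```python
import re

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_pdf_path(doi, folder="papers"):
    """Return the output path for a DOI, or None if it does not look like one."""
    if not isinstance(doi, str):
        return None  # pandas represents missing cells as float NaN
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        return None
    # Slashes are replaced so the DOI can serve as a file name.
    return f"{folder}/{doi.replace('/', '_')}.pdf"


print(doi_to_pdf_path("10.1016/j.tifs.2022.05.003"))
print(doi_to_pdf_path(float("nan")))
```

The first call yields the same path that appears in the log above, `papers/10.1016_j.tifs.2022.05.003.pdf`, and the second returns `None`.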