# Analysis of Abstracts Using a Large Language Model (LLM)

## Introduction

This Jupyter notebook analyzes abstracts loaded from a CSV file using a Large Language Model (LLM). The goal is to assign each abstract to one or more predefined topics and to extract specific details, such as affiliations, methods, models, and funding sources, by asking the LLM targeted questions.

### Objectives

- **Data Loading**: Import and preprocess abstracts from a CSV file.
- **Text Analysis**: Use an LLM to classify each abstract by topic and to answer targeted questions about its content.

### Tools and Libraries

- **LangChain**: To interface with the LLM (here, a `llama3` model served through Ollama).

### Workflow

1. **Data Import**: Load the CSV file containing the abstracts.
2. **LLM Integration**: Use the LLM to perform various NLP tasks.

By the end of this notebook, you will have seen how to use an LLM to classify scientific abstracts by topic and to extract structured information from them.
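The two workflow steps can be sketched end to end without a model or a real file. This is a minimal illustration only: the column names (`DOI`, `Abstract`), the sample rows, and the `fake_llm` stand-in are all hypothetical and do not come from the notebook's CSV or from LangChain.

```python
import csv
import io

# Hypothetical two-row CSV; the real export's header names may differ.
SAMPLE_CSV = """DOI,Abstract
10.0000/example.1,New cathode materials for batteries.
10.0000/example.2,A study of T-cell signalling.
"""


def load_abstracts(handle, max_rows):
    """Return {doi: abstract} for at most max_rows data rows."""
    reader = csv.DictReader(handle)
    abstracts = {}
    for i, row in enumerate(reader):
        if i >= max_rows:
            break
        abstracts[row["DOI"]] = row["Abstract"]
    return abstracts


def fake_llm(text):
    """Stand-in for an LLM call: flags abstracts that mention batteries."""
    return ["Batteries"] if "batter" in text.lower() else []


abstracts = load_abstracts(io.StringIO(SAMPLE_CSV), max_rows=10)
results = {doi: fake_llm(text) for doi, text in abstracts.items()}
print(results)
```

Reading columns by header name rather than by position is one way to keep the loader working if the export's column order changes.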

## Data Import


In [1]:
import csv

from langchain_community.llms import Ollama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate

# Local helper module; only the Downloader class is used below.
from downloader import Downloader


In [2]:
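DOI_COLUMN = 17      # column index of the DOI in the export
ABSTRACT_COLUMN = 4  # column index of the abstract


def read_csv(file_path, num_lines):
    """Read up to num_lines CSV lines (header included) and try to fetch each PDF."""
    dico = {}
    with open(file_path, 'r', encoding="utf8", errors='ignore', newline='') as file:
        reader = csv.reader(file)
        for i, row in enumerate(reader):
            if i >= num_lines:
                break
            if i == 0:
                continue  # skip the header row
            doi = row[DOI_COLUMN]
            dico[doi] = {"abstract": row[ABSTRACT_COLUMN]}
            try:
                # DOIs contain "/", which would otherwise be read as a directory separator.
                pdf_path = f"pdfs/{doi.replace('/', '_')}.pdf"
                downloader = Downloader(doi, 'doi', pdf_path)
                downloader.download()
                dico[doi]["pdf"] = True
            except Exception as e:
                # Record the failure and keep going with the next row.
                dico[doi]["pdf"] = False
                print(e)

    print(dico)
    return dico


file_path = "all(Scholarly & creative works_Obje).csv"
num_lines = 100
dico = read_csv(file_path, num_lines)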


[1m[INFO][0m | [32m2024/09/10 20:35:58[0m | [1mChoose scihub url [0]: https://sci-hub.st[0m
[1m[INFO][0m | [32m2024/09/10 20:35:58[0m | [1m<- Request: scihub_url=https://sci-hub.st, source=DoiSource[type=doi, id=10.1038/s41586-020-2746-2], proxies={'http': 'socks5://127.0.0.1:7890'}[0m


[]
0


[1m[INFO][0m | [32m2024/09/10 20:35:58[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:35:58[0m | [1mChoose scihub url [1]: https://sci-hub.mobi[0m
[1m[INFO][0m | [32m2024/09/10 20:35:58[0m | [1m<- Request: scihub_url=https://sci-hub.mobi, source=DoiSource[type=doi, id=10.1038/s41586-020-2746-2], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:35:59[0m | [1mChoose scihub url [2]: https://sci-hub.ru[0m
[1m[INFO][0m | [32m2024/09/10 20:35:59[0m | [1m<- Request: scihub_url=https://sci-hub.ru, source=DoiSource[type=doi, id=10.1038/s41586-020-2746-2], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:35:59[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:35:59[0m | [1mChoose scihub url [3]: http://sci-hub.se[0m
[1m[INFO][0m | [32m2024/09/10 20:35:59[0m | [1m<- Request: scihub_url=http://sci-hub.se, s

Pdf not found for 10.1038/s41586-020-2746-2
[]
0


[1m[INFO][0m | [32m2024/09/10 20:36:00[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:00[0m | [1mChoose scihub url [1]: https://sci-hub.mobi[0m
[1m[INFO][0m | [32m2024/09/10 20:36:00[0m | [1m<- Request: scihub_url=https://sci-hub.mobi, source=DoiSource[type=doi, id=10.1186/s12885-016-2539-z], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:00[0m | [1mChoose scihub url [2]: https://sci-hub.ru[0m
[1m[INFO][0m | [32m2024/09/10 20:36:00[0m | [1m<- Request: scihub_url=https://sci-hub.ru, source=DoiSource[type=doi, id=10.1186/s12885-016-2539-z], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:00[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:00[0m | [1mChoose scihub url [3]: http://sci-hub.se[0m
[1m[INFO][0m | [32m2024/09/10 20:36:00[0m | [1m<- Request: scihub_url=http://sci-hub.se, s

Pdf not found for 10.1186/s12885-016-2539-z
[]
0


[1m[INFO][0m | [32m2024/09/10 20:36:01[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:01[0m | [1mChoose scihub url [1]: https://sci-hub.mobi[0m
[1m[INFO][0m | [32m2024/09/10 20:36:01[0m | [1m<- Request: scihub_url=https://sci-hub.mobi, source=DoiSource[type=doi, id=10.1038/nmeth.2895], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:01[0m | [1mChoose scihub url [2]: https://sci-hub.ru[0m
[1m[INFO][0m | [32m2024/09/10 20:36:01[0m | [1m<- Request: scihub_url=https://sci-hub.ru, source=DoiSource[type=doi, id=10.1038/nmeth.2895], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:02[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:02[0m | [1mChoose scihub url [3]: http://sci-hub.se[0m
[1m[INFO][0m | [32m2024/09/10 20:36:02[0m | [1m<- Request: scihub_url=http://sci-hub.se, source=DoiSourc

Pdf not found for 10.1038/nmeth.2895
[]
0


[1m[INFO][0m | [32m2024/09/10 20:36:03[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:03[0m | [1mChoose scihub url [1]: https://sci-hub.mobi[0m
[1m[INFO][0m | [32m2024/09/10 20:36:03[0m | [1m<- Request: scihub_url=https://sci-hub.mobi, source=DoiSource[type=doi, id=10.1016/j.celrep.2012.09.016], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:03[0m | [1mChoose scihub url [2]: https://sci-hub.ru[0m
[1m[INFO][0m | [32m2024/09/10 20:36:03[0m | [1m<- Request: scihub_url=https://sci-hub.ru, source=DoiSource[type=doi, id=10.1016/j.celrep.2012.09.016], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:04[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:04[0m | [1mChoose scihub url [3]: http://sci-hub.se[0m
[1m[INFO][0m | [32m2024/09/10 20:36:04[0m | [1m<- Request: scihub_url=http://sci-hub

Pdf not found for 10.1016/j.celrep.2012.09.016
[]
0


[1m[INFO][0m | [32m2024/09/10 20:36:04[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:04[0m | [1mChoose scihub url [1]: https://sci-hub.mobi[0m
[1m[INFO][0m | [32m2024/09/10 20:36:04[0m | [1m<- Request: scihub_url=https://sci-hub.mobi, source=DoiSource[type=doi, id=10.1158/1535-7163.MCT-09-0495], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:05[0m | [1mChoose scihub url [2]: https://sci-hub.ru[0m
[1m[INFO][0m | [32m2024/09/10 20:36:05[0m | [1m<- Request: scihub_url=https://sci-hub.ru, source=DoiSource[type=doi, id=10.1158/1535-7163.MCT-09-0495], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:05[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:05[0m | [1mChoose scihub url [3]: http://sci-hub.se[0m
[1m[INFO][0m | [32m2024/09/10 20:36:05[0m | [1m<- Request: scihub_url=http://sci-h

Pdf not found for 10.1158/1535-7163.MCT-09-0495
[]
0


[1m[INFO][0m | [32m2024/09/10 20:36:06[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:06[0m | [1mChoose scihub url [1]: https://sci-hub.mobi[0m
[1m[INFO][0m | [32m2024/09/10 20:36:06[0m | [1m<- Request: scihub_url=https://sci-hub.mobi, source=DoiSource[type=doi, id=10.1182/blood-2009-07-231191], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:06[0m | [1mChoose scihub url [2]: https://sci-hub.ru[0m
[1m[INFO][0m | [32m2024/09/10 20:36:06[0m | [1m<- Request: scihub_url=https://sci-hub.ru, source=DoiSource[type=doi, id=10.1182/blood-2009-07-231191], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:06[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:06[0m | [1mChoose scihub url [3]: http://sci-hub.se[0m
[1m[INFO][0m | [32m2024/09/10 20:36:06[0m | [1m<- Request: scihub_url=http://sci-hub

Pdf not found for 10.1182/blood-2009-07-231191
[]
0


[1m[INFO][0m | [32m2024/09/10 20:36:07[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:07[0m | [1mChoose scihub url [1]: https://sci-hub.mobi[0m
[1m[INFO][0m | [32m2024/09/10 20:36:07[0m | [1m<- Request: scihub_url=https://sci-hub.mobi, source=DoiSource[type=doi, id=10.1038/nmeth.3472], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:07[0m | [1mChoose scihub url [2]: https://sci-hub.ru[0m
[1m[INFO][0m | [32m2024/09/10 20:36:07[0m | [1m<- Request: scihub_url=https://sci-hub.ru, source=DoiSource[type=doi, id=10.1038/nmeth.3472], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:08[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:08[0m | [1mChoose scihub url [3]: http://sci-hub.se[0m
[1m[INFO][0m | [32m2024/09/10 20:36:08[0m | [1m<- Request: scihub_url=http://sci-hub.se, source=DoiSourc

Pdf not found for 10.1038/nmeth.3472
[]
0


[1m[INFO][0m | [32m2024/09/10 20:36:08[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:08[0m | [1mChoose scihub url [1]: https://sci-hub.mobi[0m
[1m[INFO][0m | [32m2024/09/10 20:36:08[0m | [1m<- Request: scihub_url=https://sci-hub.mobi, source=DoiSource[type=doi, id=10.1074/jbc.M113.469262], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:09[0m | [1mChoose scihub url [2]: https://sci-hub.ru[0m
[1m[INFO][0m | [32m2024/09/10 20:36:09[0m | [1m<- Request: scihub_url=https://sci-hub.ru, source=DoiSource[type=doi, id=10.1074/jbc.M113.469262], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:09[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:09[0m | [1mChoose scihub url [3]: http://sci-hub.se[0m
[1m[INFO][0m | [32m2024/09/10 20:36:09[0m | [1m<- Request: scihub_url=http://sci-hub.se, sourc

Pdf not found for 10.1074/jbc.M113.469262
[]
0


[1m[INFO][0m | [32m2024/09/10 20:36:10[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:10[0m | [1mChoose scihub url [1]: https://sci-hub.mobi[0m
[1m[INFO][0m | [32m2024/09/10 20:36:10[0m | [1m<- Request: scihub_url=https://sci-hub.mobi, source=DoiSource[type=doi, id=10.1016/j.molcel.2005.02.029], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:10[0m | [1mChoose scihub url [2]: https://sci-hub.ru[0m
[1m[INFO][0m | [32m2024/09/10 20:36:10[0m | [1m<- Request: scihub_url=https://sci-hub.ru, source=DoiSource[type=doi, id=10.1016/j.molcel.2005.02.029], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:10[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:10[0m | [1mChoose scihub url [3]: http://sci-hub.se[0m
[1m[INFO][0m | [32m2024/09/10 20:36:10[0m | [1m<- Request: scihub_url=http://sci-hub

Pdf not found for 10.1016/j.molcel.2005.02.029
[]
0


[1m[INFO][0m | [32m2024/09/10 20:36:11[0m | [1m-> Response: status_code=200, content_length=0[0m
[1m[INFO][0m | [32m2024/09/10 20:36:11[0m | [1mChoose scihub url [1]: https://sci-hub.mobi[0m
[1m[INFO][0m | [32m2024/09/10 20:36:11[0m | [1m<- Request: scihub_url=https://sci-hub.mobi, source=DoiSource[type=doi, id=10.1074/jbc.M307200200], proxies={'http': 'socks5://127.0.0.1:7890'}[0m
[1m[INFO][0m | [32m2024/09/10 20:36:11[0m | [1mChoose scihub url [2]: https://sci-hub.ru[0m
[1m[INFO][0m | [32m2024/09/10 20:36:11[0m | [1m<- Request: scihub_url=https://sci-hub.ru, source=DoiSource[type=doi, id=10.1074/jbc.M307200200], proxies={'http': 'socks5://127.0.0.1:7890'}[0m


KeyboardInterrupt: 
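The log above shows responses with `status_code=200` but `content_length=0`, so a successful status code alone does not confirm that a PDF was retrieved. One possible guard, sketched under the assumption that the raw response bytes are available, is to check for the `%PDF-` signature that PDF files begin with. The helper name `looks_like_pdf` is hypothetical; the internals of the `Downloader` class are not shown in this notebook, so where such a check would hook in is left open.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if data is non-empty and starts with the PDF signature."""
    return data.startswith(b"%PDF-")


# An empty body, like the content_length=0 responses in the log, is rejected.
print(looks_like_pdf(b""))
# An HTML error or block page is rejected as well.
print(looks_like_pdf(b"<html>blocked</html>"))
# A body that begins with the PDF signature is accepted.
print(looks_like_pdf(b"%PDF-1.7\n"))
```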

## LLM Analysis

In [3]:
def get_topic(abstract: str):
    """Ask the LLM which of the predefined topics apply to an abstract."""
    if abstract is None:
        raise ValueError("Abstract is required")

    llm = Ollama(model='llama3', temperature=0.2)
    parser = JsonOutputParser()

    # Candidate topics the model may assign to the text.
    topics = ["Machine Learning", "Batteries", "AI", "accelerated materials discovery", "Self Driving Labs", "autonomous labs", "high-throughput experimentation", "high-throughput DFT"]

    # Output format passed to the prompt as a variable, so its braces are not parsed as placeholders.
    format_instructions = """

    The output needs to be formatted as follows:

    {
    "topic": {
    "topic1": ["Keyword1", "Keyword2", "Keyword3"],
    "topic2": ["Keyword1", "Keyword2", "Keyword3"]
    }
    }

    Only output the dictionary above, nothing else with it.
    """

    prompt = PromptTemplate(
        template="You are a text assistant. Identify which topics from the following list apply to the text given to you: {topics}. \n Here's the text: {abstract}. \n\n Note: A single text can belong to multiple topics, so please list all relevant topics. {format_instructions}",
        input_variables=["format_instructions", "abstract", "topics"]
    )

    # Pipe the prompt into the model, then parse the reply as JSON.
    chain = prompt | llm | parser
    result = chain.invoke({"format_instructions": format_instructions, "abstract": abstract, "topics": topics})
    # The reply has a single top-level key ("topic"); return the mapping under it.
    return list(result.values())[0]


print(get_topic("The development of high-performance batteries is crucial for the future of electric vehicles. The current generation of batteries are not able to provide the range and power required for long-distance travel. This project aims to develop new materials for batteries that can provide higher energy density and faster charging times."))

def get_info(abstract: str = None, **kwargs):
    """Ask the LLM one question per keyword argument and collect the answers by key."""
    if abstract is None:
        raise ValueError("Abstract is required")

    llm = Ollama(model='llama3', temperature=0.5)
    # The same prompt and chain are reused for every question.
    prompt = PromptTemplate(
        template="You are a text assistant. Please provide the following information: {question} \n\n Here's the text: {abstract}. \n\n If the text doesn't contain any information about the topic given, output: 'N/A'",
        input_variables=["abstract", "question"]
    )
    chain = prompt | llm

    dico = {}
    for key, question in kwargs.items():
        print(key, question)
        dico[key] = chain.invoke({"abstract": abstract, "question": question})
    print(dico)
    return dico


{'MACHINE LEARNING': [], 'BATTERIES': ['batteries'], 'AI': [], 'ACCELERATED MATERIALS DISCOVERY': ['materials discovery'], 'SELF DRIVING LABS': [], 'AUTONOMOUS LABS': [], 'HIGH-THROUGHPUT EXPERIMENTATION': ['high-throughput experimentation', 'high-throughput DFT']}
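In the output above, the model returned the topic names in upper case and included topics with empty keyword lists. A possible post-processing step, sketched here as an assumption rather than as part of the original notebook, maps the returned keys back onto the allowed topic list case-insensitively and drops empty or unlisted topics. The helper name `normalize_topics` and the `UNLISTED` sample key are hypothetical.

```python
ALLOWED_TOPICS = [
    "Machine Learning", "Batteries", "AI", "accelerated materials discovery",
    "Self Driving Labs", "autonomous labs",
    "high-throughput experimentation", "high-throughput DFT",
]


def normalize_topics(raw: dict) -> dict:
    """Map model-returned keys onto the allowed list and drop empty topics."""
    canonical = {name.casefold(): name for name in ALLOWED_TOPICS}
    cleaned = {}
    for key, keywords in raw.items():
        name = canonical.get(key.strip().casefold())
        # Skip keys outside the allowed list and topics with no keywords.
        if name is not None and keywords:
            cleaned[name] = list(keywords)
    return cleaned


raw = {"MACHINE LEARNING": [], "BATTERIES": ["batteries"], "UNLISTED": ["x"]}
print(normalize_topics(raw))
```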


## Workflow orchestration

In [4]:
# One question per field to extract from each abstract.
questions = {
    "affiliation": "What affiliations do the authors or characters in the text have?",
    "new_materials": "Does the text mention any new materials or discoveries?",
    "screening_algorithms": "Are there any screening algorithms or systematic procedures discussed in the text?",
    "ai_algorithms": "Does the text reference any AI algorithms or methods related to artificial intelligence?",
    "workflow": "Can you describe the workflow or process followed in the text?",
    "methods": "Can you summarize the methods or approaches mentioned in the text?",
    "models": "What models or frameworks are discussed or used in the text?",
    "funding": "Does the text mention any funding sources or sponsors?",
}

for key in dico:
    abstract = dico[key]["abstract"]
    dico[key]["topic"] = get_topic(abstract)
    dico[key].update(get_info(abstract, **questions))

print(dico)

affiliation What affiliations do the authors or characters in the text have?
new_materials Does the text mention any new materials or discoveries?
screening_algorithms Are there any screening algorithms or systematic procedures discussed in the text?
ai_algorithms Does the text reference any AI algorithms or methods related to artificial intelligence?
workflow Can you describe the workflow or process followed in the text?
methods Can you summarize the methods or approaches mentioned in the text?
models What models or frameworks are discussed or used in the text?
funding Does the text mention any funding sources or sponsors?
{'affiliation': 'Based on the provided text, here is the information you requested:\n\nThe authors of the text are not explicitly mentioned. However, the characters in the text can be categorized as follows:\n\n* Cancer cells\n* Cytotoxic T-lymphocytes (CTLs)\n* Mouse cancer cell lines\n* Genes and pathways involved in cancer evasion (e.g., Ptpn2, Socs1, Adar1, Fitm
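The prompt in `get_info` asks the model to output 'N/A' when the text has no relevant information, but the raw reply may still carry quotes, whitespace, or a trailing period. A small normalization step, sketched below as an assumption (the helper name `normalize_answer` is hypothetical and not part of the notebook), turns such replies into `None` so that missing values are easy to filter later.

```python
def normalize_answer(answer: str):
    """Return None when the model signals that the text has no answer."""
    # Remove surrounding whitespace and any wrapping quote characters.
    text = answer.strip().strip("'\"").strip()
    if text.rstrip(".").upper() == "N/A":
        return None
    return text


print(normalize_answer(" 'N/A' "))
print(normalize_answer("N/A."))
print(normalize_answer("Funded by a national research grant."))
```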

### Writing to a JSON file

In [5]:


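import json  # missing in the original cell, which raised a NameError

# Save the collected results; indent for readability and keep non-ASCII characters as-is.
with open("output.json", 'w', encoding="utf8") as file:
    json.dump(dico, file, indent=2, ensure_ascii=False)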
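To confirm that the results survive serialization, the file can be read back and compared with the in-memory dictionary. The sketch below uses a hypothetical single-record dictionary and a temporary directory instead of the notebook's `dico` and `output.json`, so it runs on its own.

```python
import json
import os
import tempfile

# Hypothetical record standing in for the notebook's results dictionary.
records = {
    "10.0000/example": {"abstract": "Text with a non-ASCII character: é", "pdf": False}
}

path = os.path.join(tempfile.mkdtemp(), "output.json")

# Write with an explicit encoding so non-ASCII text is stored consistently.
with open(path, "w", encoding="utf-8") as handle:
    json.dump(records, handle, indent=2, ensure_ascii=False)

# Read the file back and check that nothing was lost.
with open(path, encoding="utf-8") as handle:
    restored = json.load(handle)

print(restored == records)
```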