# Analysis of Abstracts Using a Large Language Model (LLM)

## Introduction

This Jupyter notebook analyzes abstracts loaded from a CSV file using a Large Language Model (LLM). The goal is to use the LLM to identify key themes in the abstracts and to run several natural language processing (NLP) tasks on them.

### Objectives

- **Data Loading**: Import and preprocess abstracts from a CSV file.
- **Text Analysis**: Utilize LLMs to analyze the content of the abstracts.
- **Entity Recognition**: Identify and classify entities within the abstracts.
- **Sentiment Analysis**: Determine the sentiment expressed in the abstracts.
- **Topic Modeling**: Extract and visualize key topics discussed in the abstracts.
- **Summary Generation**: Generate concise summaries of the abstracts.

### Tools and Libraries

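- **LangChain**: To interface with the LLM (`llama3`, run through Ollama) and to parse its JSON output.
- **csv** (Python standard library): To read the CSV file.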

### Workflow

1. **Data Import**: Load the CSV file containing the abstracts.
2. **Preprocessing**: Clean and preprocess the text data.
3. **LLM Integration**: Use the LLM to perform various NLP tasks.
4. **Analysis and Visualization**: Analyze the results and visualize the findings.

By the end of this notebook, you will have seen how an LLM can be used to analyze scientific abstracts and extract structured information, such as topics and keywords, from them.
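Step 2 of the workflow (preprocessing) can be illustrated with a minimal sketch. The helper name `clean_abstract` and the specific cleaning rules are assumptions chosen for illustration, not part of the original notebook; real exports may need different rules.

```python
import re
import unicodedata

def clean_abstract(text: str) -> str:
    """Normalize an abstract before sending it to the LLM (hypothetical helper)."""
    # Normalize Unicode so compatibility characters (e.g. non-breaking spaces) become plain ones.
    text = unicodedata.normalize("NFKC", text)
    # Drop HTML-like tags that sometimes survive in exported metadata.
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace (including newlines and tabs) into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_abstract("  The <i>genetic</i>\n circuits\tremain   poorly understood. "))
# prints: The genetic circuits remain poorly understood.
```

Keeping the cleaning step separate from the loading step makes it easy to inspect the raw text first and adjust the rules afterwards.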

1. **Data Import**: Load the CSV file containing the abstracts.


In [1]:
import csv
from langchain_community.llms import Ollama
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate


In [2]:
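def read_csv(file_path, num_lines):
    """Return a {DOI: abstract} dict built from the first `num_lines` rows of the CSV."""
    # Column positions in this export: 17 = "DOI", 4 = "Abstract OR Description".
    doi_col, abstract_col = 17, 4
    # newline='' lets the csv module handle line breaks inside quoted fields.
    with open(file_path, 'r', encoding="utf8", errors='ignore', newline='') as file:
        reader = csv.reader(file)
        dico = {}
        for i, row in enumerate(reader):
            if i >= num_lines:
                break
            # The header row (i == 0) is not skipped, so the dict also maps 'DOI' to the column title.
            dico[row[doi_col]] = row[abstract_col]
            print(', '.join(row))
        print(dico)
        return dico

file_path = "all(Scholarly & creative works_Obje).csv"
num_lines = 10
read_csv(file_path, num_lines)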


ID, Scholarly & creative work type, Reporting date 1, Reporting date 2, Abstract OR Description, Date of acceptance, Addresses OR Organizer, Altmetric attention score, File(s) confidential, Associated Identifiers OR Associated identifiers, Author's licence, Authors, Author URL, Collections, Commissioning body, Confidential report OR Confidential, File(s) confidentiality reason, DOI, Edition, Editors OR Supervisors, eISSN, File(s) embargo release date, External identifiers, Field citation ratio, Date submitted, Conference finish date OR End date OR Finish date, ISBN-10, ISBN-13, Is compliant with institutional policy, File(s) embargoed, Open access, ISSN, Issue, Published proceedings OR Journal, Keywords OR Labels, Language, Conference place OR Country OR Presentation location OR Location, Medium, Presented at OR Name of conference, Notes, Chapter number OR Report number OR Article number, Number of chapters in book OR Number of chapters OR Number of pieces, OA location file version, OA

{'DOI': 'Abstract OR Description',
 '10.1038/s41586-020-2746-2': 'The genetic circuits that allow cancer cells to evade destruction by the host immune system remain poorly understood13. Here, to identify a phenotypically robust core set of genes and pathways that enable cancer cells to evade killing mediated by cytotoxic Tlymphocytes (CTLs), we performed genome-wide CRISPR screens across a panel of genetically diverse mouse cancer cell lines that were cultured in the presence of CTLs. We identify a core set of 182genes across these mouse cancer models, the individual perturbation of which increases either the sensitivity or the resistance of cancer cells to CTL-mediated toxicity. Systematic exploration of our dataset using genetic co-similarity reveals the hierarchical and coordinated manner in which genes and pathways act in cancer cells to orchestrate their evasion of CTLs, and shows that discrete functional modules that control the interferon response and tumour necrosis factor (TNF
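The header row printed above shows that the two columns of interest are titled `DOI` and `Abstract OR Description`. As an alternative to positional indices, the following sketch looks them up by title with `csv.DictReader`, which also keeps the header row out of the result. The function name `read_abstracts_by_name` and the choice to skip rows with a missing DOI or abstract are assumptions made for this example.

```python
import csv

def read_abstracts_by_name(file_path, num_rows,
                           doi_col="DOI",
                           abstract_col="Abstract OR Description"):
    """Return {DOI: abstract} for the first `num_rows` data rows that have both fields."""
    abstracts = {}
    with open(file_path, "r", encoding="utf8", errors="ignore", newline="") as file:
        # DictReader consumes the header row and keys each row by column title.
        reader = csv.DictReader(file)
        for i, row in enumerate(reader):
            if i >= num_rows:
                break
            doi = (row.get(doi_col) or "").strip()
            abstract = (row.get(abstract_col) or "").strip()
            # Skip rows without a DOI or abstract instead of storing empty keys.
            if doi and abstract:
                abstracts[doi] = abstract
    return abstracts
```

It could be called as `read_abstracts_by_name(file_path, 10)`; looking columns up by title keeps the code working if the export later adds or reorders columns.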

LLM Analysis 

In [4]:
Llm = Ollama(model='llama3', temperature=0.7)

def get_topic(abstract: str):
    """Ask the LLM which of the predefined topics apply to `abstract`."""
    if not abstract:
        raise ValueError("Abstract is required")

    parser = JsonOutputParser()

    # AI / accelerated materials discovery / SDLs / autonomous labs / high-throughput experimentation / high-throughput DFT
    topics = ["Machine Learning", "Batteries", "AI", "accelerated materials discovery", "Self Driving Labs", "autonomous labs", "high-throughput experimentation", "high-throughput DFT"]

    # Passed in as a variable value, so the literal braces are not parsed as template fields.
    format_instructions = """

    The output needs to be formatted as follows:

    {
    "topic": {
    "Machine Learning": ["Keyword1", "Keyword2", "Keyword3"],
    "Batteries": ["Keyword1", "Keyword2", "Keyword3"]
    }
    }
    """

    prompt = PromptTemplate(
        template="Please identify which topics from the following list apply to the text given to you: {topics}. Here's the text: {abstract}. Note: A single text can belong to multiple topics, so please list all relevant topics. {format_instructions}",
        input_variables=["format_instructions", "abstract", "topics"]
    )

    # Pipe the prompt into the model, then parse the reply as JSON.
    chain = prompt | Llm | parser
    result = chain.invoke({"format_instructions": format_instructions, "abstract": abstract, "topics": topics})
    return result['topic']


print(get_topic("The development of high-performance batteries is crucial for the future of electric vehicles. The current generation of batteries are not able to provide the range and power required for long-distance travel. This project aims to develop new materials for batteries that can provide higher energy density and faster charging times."))
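Because the model's reply is free text parsed as JSON, it may contain topics outside the predefined list or values that are not lists. A minimal sketch of a post-processing check follows; `filter_topics` and `ALLOWED_TOPICS` are hypothetical names introduced here, and case-insensitive matching is an assumption about what counts as the same topic.

```python
ALLOWED_TOPICS = ["Machine Learning", "Batteries", "AI", "accelerated materials discovery",
                  "Self Driving Labs", "autonomous labs", "high-throughput experimentation",
                  "high-throughput DFT"]

def filter_topics(raw, allowed=ALLOWED_TOPICS):
    """Keep only allowed topics whose value is a list of keywords (hypothetical helper)."""
    if not isinstance(raw, dict):
        return {}
    # Match case-insensitively, but report the canonical spelling from `allowed`.
    canonical = {name.lower(): name for name in allowed}
    cleaned = {}
    for topic, keywords in raw.items():
        name = canonical.get(str(topic).strip().lower())
        if name is None or not isinstance(keywords, list):
            continue
        cleaned[name] = [str(k) for k in keywords]
    return cleaned

print(filter_topics({"batteries": ["energy density"], "Electric Vehicles": ["range"]}))
# prints: {'Batteries': ['energy density']}
```

In the notebook, this could wrap the call as `filter_topics(get_topic(abstract))`.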

def get_info(abstract:str = None, **kwargs):
    if abstract is None:
        raise ValueError("Abstract is required")
    
    dico = read_csv(file_path, num_lines)
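The `get_info` function above is left unfinished in the source. The following is not a completion of it, only a sketch of how a DOI-to-abstract mapping such as `dico` could be run through a topic function one entry at a time. `classify_all` and `fake_classify` are hypothetical names; `fake_classify` is a stand-in so the example runs without a model.

```python
def classify_all(abstracts, classify):
    """Apply `classify` to every abstract; return (results, errors) keyed by DOI."""
    results, errors = {}, {}
    for doi, abstract in abstracts.items():
        try:
            results[doi] = classify(abstract)
        except Exception as exc:  # e.g. an empty abstract or malformed JSON from the model
            errors[doi] = str(exc)
    return results, errors

# Stand-in classifier (assumption); in the notebook, get_topic would be passed instead.
def fake_classify(abstract):
    if not abstract:
        raise ValueError("Abstract is required")
    return {"Batteries": ["battery"]} if "batter" in abstract.lower() else {}

results, errors = classify_all({"10.0/a": "New battery materials.", "10.0/b": ""}, fake_classify)
print(results)  # prints: {'10.0/a': {'Batteries': ['battery']}}
print(errors)   # prints: {'10.0/b': 'Abstract is required'}
```

Collecting failures per DOI rather than stopping at the first one means a single bad reply does not discard the results already obtained.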


    

    
