<a href="https://colab.research.google.com/github/StarCycle/CitationInfo2Txt/blob/main/FetchAbstractsOfCitationsFromGoogleScholar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 👋 Hey, let's understand a research field in 5 minutes! 🌟

*   This notebook automatically fetches the citation records of a given paper from Google Scholar.
*   If a citing paper is available on arXiv, its full abstract is retrieved as well.
*   The results are saved to a text file, together with a prompt asking an LLM to summarize the citations.
*   Run it on Google Colab for fast access to Google Scholar.

# 💡 Step 1: Install Dependencies

In [None]:
!pip install tqdm
!pip install scholarly

# 📚 Step 2: Find a famous paper in this field on Google Scholar and get its **citedby_id**

How do you get the **citedby_id**?

1. First, search for the paper on Google Scholar by its title, like:

<div align=center>
	<img width = '500' height ='150' src = "https://github.com/user-attachments/assets/946af137-110b-474d-87fa-0c59d7fcb4e7"/></div>

2. Then click the **"Cited by xxxx"** link, like:

<div align=center>
	<img width = '500' height ='90' src = "https://github.com/user-attachments/assets/4d38607d-b3c7-4701-be84-463284834471"/></div>

3. You will now see the **citedby_id** in the URL of the page.

<div align=center>
	<img width = '500' height ='50' src = "https://github.com/user-attachments/assets/986e09ec-555f-4cae-b03a-cc93d3a7d2a1"/></div>
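
If you would rather not copy the number by hand, the ID can also be pulled out of the URL in code. The sketch below assumes the ID is the value of the `cites` query parameter, which is how "cited by" URLs on Google Scholar usually look. The helper name `extract_citedby_id` and the sample URL are illustrative, not part of this notebook.

```python
from urllib.parse import urlparse, parse_qs

def extract_citedby_id(url):
    """Return the value of the `cites` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("cites")
    return values[0] if values else None

# Illustrative URL; paste the one from your own browser instead.
example_url = "https://scholar.google.com/scholar?cites=2960712678066186980&as_sdt=2005&hl=en"
print(extract_citedby_id(example_url))
```

The returned value is a string, which can be assigned to `citedby_id` in Step 3 as is.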

# 💪 Step 3: Set the **citedby_id** below and generate the prompt file

Note:

* A single search can only retrieve a limited number of citations (perhaps under 120); beyond that, Google will block your IP address.
* **No worries!** You are using Google Colab! If Google blocks you, just restart the kernel of this notebook!
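
Assuming the citation iterator fetches results lazily, as generators typically do, the number of items pulled from it determines how many requests go out. Here is a minimal sketch of capping the iteration with `itertools.islice`. It uses a stand-in generator (`fake_citations`, hypothetical) instead of a real Google Scholar query, so it runs offline.

```python
from itertools import islice

fetched = []  # records every item the generator was actually asked to produce

def fake_citations():
    """Stand-in for a lazy citation iterator; each yield simulates one fetch."""
    n = 0
    while True:
        n += 1
        fetched.append(n)
        yield {"bib": {"title": f"Paper {n}"}}

max_citations = 10
results = list(islice(fake_citations(), max_citations))

# Both numbers are equal: islice never asks for item max_citations + 1.
print(len(results), len(fetched))
```

A loop that checks `idx > max_citations` and then breaks has already pulled one extra item by the time it stops. `islice` avoids that extra fetch.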

In [None]:
# Settings
output_file = "result.txt"
max_citations = 10  # keep this below about 120, otherwise Google Scholar may block further requests
citedby_id = 2960712678066186980  # the ID found in Step 2

In [None]:
import re
from itertools import islice

import requests
import xml.etree.ElementTree as ET
from scholarly import scholarly
from tqdm import tqdm

# Instructions for the LLM, written at the top of the output file.
prompt_description = (
    "Below are detailed citation records for a given article. Each entry includes the title, authors, "
    "publication year, number of citations, the link to the article, and its abstract if available. "
    "For articles hosted on arXiv, the abstract is retrieved directly from the arXiv API. Please "
    "summarize these entries as thoroughly as possible, highlighting key research contributions, "
    "novelties, and relevance to the cited work."
)

entries = [prompt_description]
query = scholarly.search_citedby(str(citedby_id))

# islice stops after max_citations results, so no extra result is requested from Google Scholar.
limited_query = islice(query, max_citations)
for idx, result in enumerate(tqdm(limited_query, total=max_citations, desc="Processing citations", unit=" citation"), start=1):

    try:
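        # Use .get() so a missing field does not cause the whole entry to be skipped.
        bib = result.get('bib', {})
        title = bib.get('title', 'Unknown title')
        authors = ", ".join(bib.get('author', [])) or 'Unknown authors'
        pub_year = bib.get('pub_year', 'Unknown year')
        num_citations = result.get('num_citations', 0)
        pub_url = result.get('pub_url', 'No link available')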

        abstract = "Abstract not available."
        # Match both abstract and PDF links, e.g. arxiv.org/abs/<id> or arxiv.org/pdf/<id>.
        arxiv_id_match = re.search(r'arxiv\.org/(?:abs|pdf)/([0-9]{4}\.[0-9]{4,5})', pub_url)
        if arxiv_id_match:
            arxiv_id = arxiv_id_match.group(1)
            url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
            response = requests.get(url, timeout=30)  # do not hang on a stalled connection
            if response.status_code == 200:
                namespace = {'default': 'http://www.w3.org/2005/Atom'}
                root = ET.fromstring(response.text)
                summary = root.find('.//default:summary', namespace)
                if summary is not None and summary.text:
                    abstract = summary.text.strip()

        entry = (
            f"### Citation {idx} ###\n"
            f"Title: {title}\n"
            f"Authors: {authors}\n"
            f"Publication Year: {pub_year}\n"
            f"Number of Citations: {num_citations}\n"
            f"Link: {pub_url}\n"
            f"Abstract: {abstract}\n"
        )
        entries.append(entry)

    except Exception as e:
        print(f"Error processing citation {idx}: {e}")

with open(output_file, "w", encoding="utf-8") as file:
    file.write("\n\n".join(entries))

print(f"Prompt has been saved to {output_file}")
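
The arXiv step can be tried offline before spending any Google Scholar requests. The sketch below is a simplified illustration, not the notebook's exact code. The helpers `extract_arxiv_id` and `parse_abstract` and the `sample_feed` string are made up for this example. The feed only mimics the Atom layout that the arXiv API is assumed to return, and the regular expression covers new-style IDs only.

```python
import re
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}
# New-style arXiv IDs: four digits, a dot, four or five digits, optional version suffix.
# Old-style IDs such as hep-th/9901001 are not handled by this pattern.
ARXIV_ID_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?")

def extract_arxiv_id(url):
    """Return the arXiv ID without its version suffix, or None if the URL has none."""
    match = ARXIV_ID_RE.search(url)
    return match.group(1) if match else None

def parse_abstract(atom_xml):
    """Return the first entry's summary with whitespace collapsed, or None."""
    root = ET.fromstring(atom_xml)
    summary = root.find("atom:entry/atom:summary", ATOM_NS)
    if summary is None or summary.text is None:
        return None
    return " ".join(summary.text.split())

# Made-up feed that mimics the structure of an Atom response.
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>An example title</title>
    <summary>
      A made-up abstract
      spanning two lines.
    </summary>
  </entry>
</feed>"""

print(extract_arxiv_id("https://arxiv.org/pdf/2303.08774v2"))
print(parse_abstract(sample_feed))
```

Collapsing whitespace with `" ".join(text.split())` keeps each abstract on a single line, which makes the prompt file easier to read.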

# 🎉 Step 4: Give the generated prompt file to any LLM (ChatGPT, Claude, ChatGLM...)
# Happy researching! 🎉😊