<font face="Times New Roman" size=5>
<div dir=rtl align="center">
<font face="Times New Roman" size=5>
In The Name of God
</font>
<br>
<img src="https://logoyar.com/content/wp-content/uploads/2021/04/sharif-university-logo.png" alt="University Logo" width="150" height="150">
<br>
<font face="Times New Roman" size=4 align=center>
Sharif University of Technology - Department of Electrical Engineering
</font>
<br>
<font color="#008080" size=6>
Foundations of Data Science
</font>
<hr/>
<font color="#800080" size=5>
Phase 0 Report: Web Scraping and Data Processing for AI Paper Retrieval
<br>
</font>
<font size=5>
Instructor: Dr. Khalaj
<br>
</font>
<font size=4>
Fall 2024
<br>
<font face="Times New Roman" size=4>
Ali Sadeghiyan 400101464
</font>

</div></font>

## **1. Introduction**  
The objective of this project is to develop an AI assistant that helps users find relevant **scholarly papers**. The dataset provided (DBLP) contains papers up to 2017, but to enhance the assistant’s capability, we need to **retrieve more recent papers** from **online sources**. This phase focuses on **web scraping**, where we extract relevant research papers from **Semantic Scholar** for five key topics in artificial intelligence:  

- **Foundation Models**  
- **Generative Models**  
- **Large Language Models (LLM)**  
- **Vision-Language Models (VLM)**  
- **Diffusion Models**  

This report details the **crawling methodology**, **data storage format**, and **processing steps**, as well as the **challenges encountered** and **next steps**.  

## **2. Data Collection Methodology**  
To ensure efficient data retrieval, we designed an automated **scraping pipeline** using the **Semantic Scholar API**. The main objectives were:  

1. **Querying Papers:** The script searches for each topic and retrieves **relevant papers published after 2017**.  
2. **Sorting and Filtering:** The results are sorted by **relevance**, and only specific fields are extracted, including **title, abstract, authors, and citation count**.  
3. **Pagination Handling:** Since the API returns a limited number of results per request, we implemented a mechanism to **iteratively fetch additional papers** until the required number is reached.  
4. **Rate Limiting:** To avoid being blocked due to excessive requests, the script monitors the API’s response and implements a **retry strategy** when necessary.  
5. **Error Handling:** The script ensures robustness by handling network failures, missing data fields, and unexpected API responses.  

The scraping process was conducted for three year ranges: **2017-2020**,**2021-2023** and **2024-2025**, with a goal of collecting at least **2000 research papers per topic**.  

## **3. Data Storage and Structure**  
The extracted data was structured and saved in **JSON format**, ensuring easy access and further processing. Each topic has a separate file, and the format is as follows:  

```json
[
  {
    "year_range": "2017-2023",
    "papers": [
      {
        "title": "Example Paper Title",
        "abstract": "This paper explores...",
        "authors": "Author1, Author2",
        "citations": 23
      }
    ]
  }
]
```
This structured approach allows easy merging, filtering, and querying for further analysis.  

## **4. Merging and Processing Data**  
After collecting the research papers, additional **data processing steps** were implemented:  

- **Merging Paper Lists:** Since each topic had papers from different year ranges, the script combines them into a single list for **consistency**.  
- **Handling Missing Data:** Some papers lack abstracts or author names. Default placeholders were used to ensure that all records remain structured.  
- **Removing Duplicates:** If a paper appeared multiple times across year ranges, it was identified and removed to maintain dataset integrity.  

A separate script was developed to **load, clean, and merge** these datasets, ensuring that all data is correctly formatted for later use.  

## **5. Data Verification**  
To confirm the quality of the scraped data, we implemented a **randomized paper selection** method. This allows us to randomly sample and inspect papers from each dataset. The script ensures that:  

- The retrieved papers contain **meaningful abstracts and author information**.  
- The **citations field** is correctly extracted and stored.  
- There are no **empty or incomplete entries** in the dataset.  

## **6. Challenges and Considerations**  
While implementing the scraping and data processing pipeline, several challenges arose:  

- **API Rate Limits:** The Semantic Scholar API enforces request restrictions, requiring **delays and retries** to avoid getting blocked.  
- **Inconsistent Metadata:** Some papers lack essential details like abstracts or author names, requiring **data cleaning techniques**.  
- **Handling Large-Scale Data:** Collecting and processing thousands of papers required **efficient file handling** to avoid performance issues.  
- **Filtering Irrelevant Results:** Ensuring that retrieved papers **strictly belong to the specified topics** was a challenge, as some results contained loosely related content.  

## **7. Conclusion**  
This phase successfully established a **structured web scraping and data processing framework** for retrieving AI research papers. The pipeline ensures **efficient data retrieval, storage, and cleaning**, forming the foundation for the **next phase of AI assistant development**.  

The collected data will now be used for **training and evaluation**, paving the way for building a system that can effectively assist researchers in finding relevant academic papers.  



# Crawling:

In [8]:
import requests
import time
import json
import os

# Function to scrape research papers from the Semantic Scholar API
def scrape_semantic_scholar_api(topic, year_range, limit=1000):
    """
    Fetches research papers related to the specified topic from the Semantic Scholar API.

    Parameters:
        topic (str): The research topic to search for.
        year_range (str): The range of years to filter papers (e.g., "2017-2023").
        limit (int): The maximum number of papers to retrieve.

    Returns:
        list: A list of dictionaries containing paper details (title, abstract, authors, citations).
    """
    api_url = "https://api.semanticscholar.org/graph/v1/paper/search"
    collected_papers = []
    current_offset = 0

    while len(collected_papers) < limit:
        if current_offset % 300 == 0:
            print(f"Scraped: {len(collected_papers)}/{limit}")

        # Construct query parameters for the API request
        query_params = {
            "query": topic,
            "fields": "title,abstract,authors,citationCount",
            "offset": current_offset,
            "limit": 100,
            "year": year_range,
            "sort": "relevance"
        }

        try:
            response = requests.get(api_url, params=query_params)

            # Handle rate limits
            if response.status_code == 429:
                print("Hit rate limit. Pausing for 10 seconds...")
                time.sleep(10)
                continue

            # Handle unexpected errors
            if response.status_code != 200:
                print(f"Error: Received {response.status_code}. Message: {response.json()}.")
                break

            response_data = response.json()
            paper_list = response_data.get("data", [])

            # Stop if no more papers are found
            if not paper_list:
                print("No additional papers found.")
                break

            # Process each paper and extract relevant information
            for paper in paper_list:
                try:
                    collected_papers.append({
                        'title': paper.get("title", "No Title"),
                        'abstract': paper.get("abstract", "No Abstract"),
                        'authors': ", ".join([author.get("name", "Unknown") for author in paper.get("authors", [])]),
                        'citations': paper.get("citationCount", 0)
                    })

                    # Stop if the required limit is reached
                    if len(collected_papers) >= limit:
                        break

                except Exception as parse_error:
                    print(f"Error processing paper data: {parse_error}")

            # Move to the next set of results
            current_offset += 100
            time.sleep(5)  # Avoid making too many requests too quickly

        except requests.exceptions.RequestException as request_error:
            print(f"Network or request error: {request_error}. Retrying in 10 seconds...")
            time.sleep(10)

    return collected_papers

# Function to save scraped papers to a JSON file
def save_to_json_file(topic, all_year_data):
    """
    Saves the collected paper data to a JSON file.

    Parameters:
        topic (str): The research topic.
        all_year_data (list): The collected data for all year ranges.
    """
    if not all_year_data:
        print(f"No data to save for topic: {topic}")
        return

    # Create a sanitized filename based on the topic name
    file_name = f"{topic.replace(' ', '_')}_data.json"

    # Save the data to a JSON file with proper formatting
    with open(file_name, 'w', encoding='utf-8') as f:
        json.dump(all_year_data, f, ensure_ascii=False, indent=4)

    print(f"Data saved to {file_name}")

# Define topics and year ranges for scraping
topics = ["Foundation Models", "Generative Models", "LLM", "VLM", "Diffusion Models"]
year_ranges = ["2017-2020","2021-2023","2024-2025"]

# Iterate through each topic and collect research papers
for topic in topics:
    print(f"Scraping data for topic: {topic}")
    all_year_data = []

    # Scrape papers for each year range
    for year_range in year_ranges:
        data = scrape_semantic_scholar_api(topic, year_range, limit=1000)
        all_year_data.append({'year_range': year_range, 'papers': data})

    # Save the collected data to a JSON file
    save_to_json_file(topic, all_year_data)
    print(f"Completed scraping for topic: {topic}")


Scraping data for topic: Foundation Models
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Data saved to Foundation_Models_data.json
Completed scraping for topic: Foundation Models
Scraping data for topic: Generative Models
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Data saved to Generative_Models_data.json
Completed scraping for topic: Generative Models
Scraping data for topic: LLM
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 825/1000
Error: Received 400. Message: {'error': 'Requested data for this limit and/or offset is not available'}.
Scraped: 0/1000
Scraped: 300/1000
Hit rate limit. Pausing for 10 seconds...
Scraped: 300/1000
Scraped: 600/100

# Cleaning the dataset and merge:

In [9]:
import json
import os

# Function to load existing JSON data from a file
def load_data_from_file(file_path):
    """
    Loads JSON data from the given file if it exists.

    Parameters:
        file_path (str): The path to the JSON file.

    Returns:
        list: The loaded JSON data or an empty list if the file is missing or corrupted.
    """
    if os.path.exists(file_path):
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                return json.load(file)
        except json.JSONDecodeError:
            print(f"Error: Unable to parse {file_path}. File may be empty or corrupted.")
            return []
    else:
        print(f"Warning: {file_path} not found.")
        return []

# Function to save data to a JSON file
def save_data_to_file(file_path, data):
    """
    Saves the given data to a JSON file with proper formatting.

    Parameters:
        file_path (str): The path to save the JSON data.
        data (list): The JSON data to store.
    """
    with open(file_path, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=4)
    print(f"Success: Data saved to {file_path}")

# Function to merge "papers" sections from different JSON parts
def merge_papers_in_file(file_path):
    """
    Merges the "papers" lists from multiple sections in a JSON file into a single list.

    Parameters:
        file_path (str): The path to the JSON file containing multiple "papers" sections.
    """
    data = load_data_from_file(file_path)

    # Ensure there are exactly two sections to merge
    if len(data) == 2:
        merged_papers = data[0].get('papers', []) + data[1].get('papers', [])
        merged_data = [{'papers': merged_papers}]
        save_data_to_file(file_path, merged_data)
    else:
        print(f"Info: No merge required for {file_path}. Expected 2 sections, found {len(data)}.")

# List of research topic files to process
file_names = [
    "Foundation_Models_data.json",
    "Generative_Models_data.json",
    "LLM_data.json",
    "VLM_data.json",
    "Diffusion_Models_data.json"
]

# Process each file and merge paper sections
for file_name in file_names:
    merge_papers_in_file(file_name)


Info: No merge required for Foundation_Models_data.json. Expected 2 sections, found 3.
Info: No merge required for Generative_Models_data.json. Expected 2 sections, found 3.
Info: No merge required for LLM_data.json. Expected 2 sections, found 3.
Info: No merge required for VLM_data.json. Expected 2 sections, found 3.
Info: No merge required for Diffusion_Models_data.json. Expected 2 sections, found 3.


# Visualization:

In [12]:
import json
import os
import random

# Function to load JSON data from a file
def load_data_from_file(file_path):
    """
    Loads JSON data from the specified file if it exists.

    Parameters:
        file_path (str): Path to the JSON file.

    Returns:
        list: Loaded JSON data or an empty list if the file is missing or corrupted.
    """
    if os.path.exists(file_path):
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                return json.load(file)
        except json.JSONDecodeError:
            print(f"Error: Could not parse {file_path}. The file may be empty or corrupted.")
            return []
    else:
        print(f"Warning: {file_path} not found.")
        return []

# Function to display a random paper from the "papers" list in the JSON file
def show_random_paper_from_file(file_path):
    """
    Displays a randomly selected research paper from the given JSON file.

    Parameters:
        file_path (str): Path to the JSON file.
    """
    data = load_data_from_file(file_path)

    if data and isinstance(data, list) and 'papers' in data[0]:
        papers = data[0]['papers']

        if papers:
            random_paper = random.choice(papers)  # Select a random paper
            print(f"\nRandom paper from {file_path}:")
            print(f"Title: {random_paper.get('title', 'No Title')}")
            print(f"Abstract: {random_paper.get('abstract', 'No Abstract')}")
            print(f"Authors: {random_paper.get('authors', 'No Authors')}")
            print(f"Citations: {random_paper.get('citations', 0)}")
            print("-" * 40)
        else:
            print(f"Info: No papers found in {file_path}.")
    else:
        print(f"Info: No valid 'papers' key found in {file_path}, or the file is empty.")

# List of research topic files
file_names = [
    "Foundation_Models_data.json",
    "Generative_Models_data.json",
    "LLM_data.json",
    "VLM_data.json",
    "Diffusion_Models_data.json"
]

# Display a random paper from each file
for file_name in file_names:
    show_random_paper_from_file(file_name)



Random paper from Foundation_Models_data.json:
Title: An evaluation of mathematical models for the outbreak of COVID-19
Abstract: Abstract Mathematical modelling performs a vital part in estimating and controlling the recent outbreak of coronavirus disease 2019 (COVID-19). In this epidemic, most countries impose severe intervention measures to contain the spread of COVID-19. The policymakers are forced to make difficult decisions to leverage between health and economic development. How and when to make clinical and public health decisions in an epidemic situation is a challenging question. The most appropriate solution is based on scientific evidence, which is mainly dependent on data and models. So one of the most critical problems during this crisis is whether we can develop reliable epidemiological models to forecast the evolution of the virus and estimate the effectiveness of various intervention measures and their impacts on the economy. There are numerous types of mathematical m