<font face="Times New Roman" size=5>
<div dir=rtl align="center">
<font face="Times New Roman" size=5>
In The Name of God
</font>
<br>
<img src="https://logoyar.com/content/wp-content/uploads/2021/04/sharif-university-logo.png" alt="University Logo" width="150" height="150">
<br>
<font face="Times New Roman" size=4 align=center>
Sharif University of Technology - Department of Electrical Engineering
</font>
<br>
<font color="#008080" size=6>
Foundations of Data Science
</font>
<hr/>
<font color="#808000" size=5>
Phase 0 Report and Code
<br>
</font>
<font size=5>
Instructor: Dr. Khalaj
<br>
</font>
<font size=4>
Fall 2024
<br>
<font face="Times New Roman" size=4>
Amirreza Tanevardi 400100898
</font>
</font>

</div></font>

# Crawling:

In [None]:
import requests
import time
import json
import os

# Function to fetch research papers from the Semantic Scholar API
def fetch_papers(topic, year_range, max_papers=1000):
    """
    Retrieves research papers from the Semantic Scholar API based on a given topic.

    Parameters:
        topic (str): The research topic to search for.
        year_range (str): The range of publication years (e.g., "2017-2023").
        max_papers (int): The maximum number of papers to retrieve.

    Returns:
        list: A list of dictionaries containing paper details (title, abstract, authors, citations).
    """
    base_url = "https://api.semanticscholar.org/graph/v1/paper/search"
    papers_collected = []
    offset = 0

    while len(papers_collected) < max_papers:
        if offset % 300 == 0:
            print(f"Progress: {len(papers_collected)}/{max_papers}")

        # Define API request parameters
        params = {
            "query": topic,
            "fields": "title,abstract,authors,citationCount",
            "offset": offset,
            "limit": 100,
            "year": year_range,
            "sort": "relevance"
        }

        try:
            # A timeout keeps a stalled connection from hanging the loop indefinitely
            response = requests.get(base_url, params=params, timeout=30)

            # Handle rate limit restrictions
            if response.status_code == 429:
                print("Rate limit exceeded. Waiting for 10 seconds...")
                time.sleep(10)
                continue

            # Handle non-successful responses (print the raw body, which may not be valid JSON)
            if response.status_code != 200:
                print(f"Error {response.status_code}: {response.text}")
                break

            response_data = response.json()
            paper_entries = response_data.get("data", [])

            # Stop if no further papers are found
            if not paper_entries:
                print("No more papers available.")
                break

            # Process and store paper details
            for paper in paper_entries:
                try:
                    # `or` also covers fields that are present but null,
                    # which the default argument of dict.get() does not
                    papers_collected.append({
                        'title': paper.get("title") or "No Title",
                        'abstract': paper.get("abstract") or "No Abstract",
                        'authors': ", ".join(author.get("name") or "Unknown" for author in paper.get("authors") or []),
                        'citations': paper.get("citationCount") or 0
                    })

                    # Stop if the paper limit is reached
                    if len(papers_collected) >= max_papers:
                        break

                except Exception as e:
                    print(f"Error processing paper details: {e}")

            # Move to the next batch of results
            offset += 100
            time.sleep(5)  # Avoid overwhelming the server

        except requests.exceptions.RequestException as e:
            print(f"Request error: {e}. Retrying in 10 seconds...")
            time.sleep(10)

    return papers_collected

# Function to store scraped research papers into a JSON file
def store_in_json(topic, collected_data):
    """
    Saves research papers to a JSON file.

    Parameters:
        topic (str): The research topic.
        collected_data (list): The gathered research papers categorized by year range.
    """
    if not collected_data:
        print(f"No data available for topic: {topic}")
        return

    # Generate a valid filename based on the topic
    filename = f"{topic.replace(' ', '_')}_papers.json"

    # Save the research data into a formatted JSON file
    with open(filename, 'w', encoding='utf-8') as file:
        json.dump(collected_data, file, ensure_ascii=False, indent=4)

    print(f"Data successfully saved to {filename}")

# Topics and year ranges for data collection
topics = ["Foundation Models", "Generative Models", "LLM", "VLM", "Diffusion Models"]
year_ranges = ["2017-2020", "2021-2023", "2024-2025"]

# Iterate over topics and fetch relevant research papers
for topic in topics:
    print(f"Starting data collection for topic: {topic}")
    compiled_data = []

    # Retrieve papers for each specified year range
    for year_range in year_ranges:
        papers = fetch_papers(topic, year_range, max_papers=1000)
        compiled_data.append({'year_range': year_range, 'papers': papers})

    # Store the fetched data into a JSON file
    store_in_json(topic, compiled_data)
    print(f"Completed data collection for topic: {topic}")

Scraping data for topic: Foundation Models
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Scraped: 0/1000
Scraped: 300/1000
Scraped: 600/1000
Scraped: 900/1000
Data saved to Foundation_Models_data.json
Completed scraping for topic: Foundation Models
Scraping data for topic: Generative Models
Scraped: 0/1000
Scraped: 300/1000
Hit rate limit. Pausing for 10 seconds...
Scraped: 300/1000
Hit rate limit. Pausing for 10 seconds...
Scraped: 600/1000
Hit rate limit. Pausing for 10 seconds...
Scraped: 600/1000
Hit rate limit. Pausing for 10 seconds...
Scraped: 600/1000
Hit rate limit. Pausing for 10 seconds...
Scraped: 600/1000
Hit rate limit. Pausing for 10 seconds...
Hit rate limit. Pausing for 10 seconds...
Hit rate limit. Pausing for 10 seconds...
Scraped: 900/1000
Scraped: 0/1000
Hit rate limit. Pausing for 10 seconds...
Scraped: 0/1000
Hit rate limit. Pausing for 10 seconds...
Hit rate limit. Paus

# The Report


# **Report on the Research Paper Scraping Code**

## **1. Overview**  
This Python script collects research papers on a fixed set of topics from the **Semantic Scholar API**. For each topic it searches three publication-year ranges, keeps four fields from every result, and writes one JSON file per topic for later analysis.

---

## **2. Key Components**

### **a. Libraries Used**
- **`requests`**: Handles HTTP requests to the Semantic Scholar API.
- **`time`**: Adds a 5-second pause after each page of results and a 10-second wait after a rate-limit response or a request error.
- **`json`**: Formats and saves the collected data as JSON files.
- **`os`**: (Imported but unused in the script.)

---

## **3. Functions Explained**

### **a. `fetch_papers(topic, year_range, max_papers=1000)`**

- **Purpose:**  
  Retrieves research papers based on a given topic and publication year range from the Semantic Scholar API.

- **Parameters:**  
  - `topic` *(str)*: The research topic (e.g., "LLM", "Diffusion Models").  
  - `year_range` *(str)*: The publication years to filter results (e.g., "2017-2020").  
  - `max_papers` *(int, default=1000)*: The maximum number of papers to collect.

- **Process:**  
  1. **API Request Setup:**  
     Uses the Semantic Scholar API with query parameters:
     - `query`: Topic of interest.
     - `fields`: Requests only `title`, `abstract`, `authors`, and `citationCount`.
     - `offset` and `limit`: Handles pagination (fetches 100 papers per request).
     - `year`: Limits papers to the specified year range.
     - `sort`: Sorts by relevance.

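  2. **Handling API Rate Limits:**  
     If the API returns status code **429** (rate limit exceeded), the script waits 10 seconds and retries the same offset. There is no cap on the number of retries.

  3. **Error Handling:**  
     - For any other non-200 status code, prints the status and the response body, then stops collecting for that year range and returns the papers gathered so far.
     - On a `requests` exception (e.g., a connection failure), waits 10 seconds and retries the same offset, again with no retry cap.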

  4. **Data Processing:**  
     Extracts key details from each paper:
     - **Title**
     - **Abstract**
     - **Authors** (joined into a single string)
     - **Citation Count**

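  5. **Progress Tracking:**  
     Prints the running count whenever the offset is a multiple of 300, i.e., before every third request. Because a retry after a 429 repeats the same offset, the same progress line can appear more than once.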

  6. **Termination Conditions:**  
     - Stops if no more papers are available.
     - Stops once the maximum paper limit (`max_papers`) is reached.
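
Steps 2 and 3 retry without a limit, so a persistent 429 would keep the loop running indefinitely. The sketch below shows one way to bound the retries and lengthen the wait each time. It is an illustration, not the script's code: `get_with_retry`, its parameters, and the `send` callable are hypothetical names, and `send` is assumed to return a `(status_code, payload)` pair so the logic can be exercised without network access.

```python
import time


def get_with_retry(send, max_retries=5, base_delay=10.0, sleep=time.sleep):
    """Call send() until it returns a non-429 status or retries run out.

    send is a hypothetical zero-argument callable returning (status_code, payload).
    """
    status, payload = send()
    for attempt in range(max_retries):
        if status != 429:
            break
        # Wait 10 s, 20 s, 40 s, ... before trying again
        sleep(base_delay * 2 ** attempt)
        status, payload = send()
    return status, payload


# Stand-in for the real request: two rate-limit responses, then success.
responses = iter([(429, None), (429, None), (200, {"data": []})])
waits = []
result = get_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as an argument lets the demo record the waits instead of actually pausing. In the real loop, `send` could wrap the `requests.get` call, and a final status of 429 would be treated like any other failed request.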

---

### **b. `store_in_json(topic, collected_data)`**

- **Purpose:**  
  Saves the collected research papers into a JSON file.

- **Parameters:**  
  - `topic` *(str)*: The research topic (used for naming the output file).
  - `collected_data` *(list)*: A list of `{'year_range': ..., 'papers': [...]}` entries, one per year range.

- **Process:**  
  1. **File Naming:**  
     Replaces spaces in the topic with underscores to create a valid filename (e.g., `Diffusion_Models_papers.json`).

  2. **Data Storage:**  
     Writes the data as UTF-8 JSON with 4-space indentation and `ensure_ascii=False`, so non-ASCII characters in titles and author names are stored as-is rather than escaped.

  3. **Output Confirmation:**  
     Prints a success message once the data is saved.
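
To check a saved file, it can be read back and summarized per year range. The following is a minimal sketch that assumes the layout produced by the main loop (a list of entries with `year_range` and `papers` keys). `count_papers_by_range`, the file name, and the sample records are made up for illustration.

```python
import json
import os
import tempfile


def count_papers_by_range(path):
    """Return {year_range: number_of_papers} for one saved topic file."""
    with open(path, encoding="utf-8") as f:
        compiled = json.load(f)
    return {entry["year_range"]: len(entry["papers"]) for entry in compiled}


# Made-up sample in the same shape as the script's output.
sample = [
    {"year_range": "2017-2020", "papers": [
        {"title": "A", "abstract": "No Abstract", "authors": "X", "citations": 3},
        {"title": "B", "abstract": "No Abstract", "authors": "Y", "citations": 0},
    ]},
    {"year_range": "2021-2023", "papers": [
        {"title": "C", "abstract": "No Abstract", "authors": "Z", "citations": 7},
    ]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Example_Topic_papers.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=4)
    counts = count_papers_by_range(path)

print(counts)
```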

---

## **4. Main Execution Flow**

### **Topics and Year Ranges:**
- **Topics:**  
  - *"Foundation Models"*, *"Generative Models"*, *"LLM"*, *"VLM"*, *"Diffusion Models"*

- **Year Ranges:**  
  - *"2017-2020"*, *"2021-2023"*, *"2024-2025"*

### **Steps:**
1. **Iterate Over Topics:**  
   For each topic, the script:
   - Prints a message indicating the start of data collection.
   - Initializes an empty list `compiled_data` to store papers.

2. **Fetch Papers for Each Year Range:**  
   Calls `fetch_papers()` for each specified year range (up to 1000 papers per range).

3. **Save Data:**  
   Calls `store_in_json()` to save the compiled data into a JSON file.

4. **Completion Message:**  
   Prints a message indicating the completion of data collection for the topic.
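
The fixed 5-second pause after every page puts a floor on the total run time. The arithmetic below is a rough estimate that assumes every year range returns the full 1000 papers and that no request is retried; it counts only the pauses, not network time. `minimum_run` is a hypothetical helper, not part of the script.

```python
import math


def minimum_run(topics, year_ranges, max_papers=1000, page_size=100, pause_s=5):
    """Return (total_requests, total_pause_seconds) for a full, retry-free run."""
    requests_per_range = math.ceil(max_papers / page_size)
    total_requests = topics * year_ranges * requests_per_range
    return total_requests, total_requests * pause_s


total_requests, pause_seconds = minimum_run(topics=5, year_ranges=3)
print(total_requests, pause_seconds / 60)  # 150 requests, 12.5 minutes of pauses
```

Each 429 response adds another 10 seconds on top of this, which is why the captured output for "Generative Models" advances more slowly than for "Foundation Models".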

---

## **5. Features and Error Handling**

- **Rate Limiting:**  
  Handles API throttling by pausing for 10 seconds when needed.

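- **Error Handling:**  
  Catches request exceptions (retrying after 10 seconds) and per-paper processing errors (the error is printed and the paper is skipped). Any non-200 status other than 429 ends collection for that year range.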

- **Progress Monitoring:**  
  Prints the number of papers collected so far before every third request.

- **Organized Data Storage:**  
  Writes one JSON file per topic; inside each file, papers are grouped by year range.
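
One data-processing detail is worth illustrating: `dict.get(key, default)` falls back to the default only when the key is missing, not when its value is `None`. Assuming the API can return a field such as `abstract` with a null value (an assumption, not verified here), the placeholder text would not be applied. The sketch below shows the difference; `normalize_paper` and the sample record are hypothetical.

```python
# Made-up record with fields that are present but null.
raw = {"title": "Example", "abstract": None, "authors": None}

with_default = raw.get("abstract", "No Abstract")  # key exists, so the default is ignored
with_or = raw.get("abstract") or "No Abstract"     # falls back on None as well


def normalize_paper(paper):
    """Build the stored record, replacing missing or null fields with placeholders."""
    return {
        "title": paper.get("title") or "No Title",
        "abstract": paper.get("abstract") or "No Abstract",
        "authors": ", ".join(a.get("name") or "Unknown" for a in paper.get("authors") or []),
        "citations": paper.get("citationCount") or 0,
    }


print(with_default, with_or)
print(normalize_paper(raw))
```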

