# Collect Semantic Scholar data

The script collects data about research papers using the Semantic Scholar API, based on a set of predefined topics.

The script queries Semantic Scholar for research papers published between 2014 and 2024, retrieving up to 100 papers per year per topic. The topics span artificial intelligence, machine learning, and data science: Data Science, Artificial Intelligence (queried both in full and as "AI"), Machine Learning (ML), Deep Learning, Natural Language Processing (NLP), Computer Vision, Neural Networks, Reinforcement Learning, Robotics, Generative AI, Explainable AI, and Predictive Analytics.

Additional topics include Data Mining, Statistical Learning, Anomaly Detection, Time Series Analysis, Graph Analytics, Speech Recognition, Transfer Learning, Sentiment Analysis, Ethical AI, Meta-Learning, AI Security, and Big Data.
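
As a rough sizing check, the collection loop issues one request per topic per year. The sketch below assumes the 25 seed topics defined in the script's `main()` function and the default arguments of `fetch_papers`; the variable names are illustrative and do not appear in the script.

```python
# Back-of-the-envelope sizing for one collection run.
# Assumptions: 25 seed topics (as in main()), years 2014-2024 inclusive,
# at most 100 papers per request, and a 1-second pause after each request.
n_topics = 25
years = range(2014, 2024 + 1)
limit_per_year = 100
pause_seconds = 1

n_requests = n_topics * len(years)                     # one request per topic per year
max_rows = n_requests * limit_per_year                 # upper bound on CSV rows
min_pause_minutes = n_requests * pause_seconds / 60    # time spent sleeping only

print(n_requests)                    # 275
print(max_rows)                      # 27500
print(round(min_pause_minutes, 1))   # 4.6
```

The row count is an upper bound: a request can return fewer than 100 papers, and failed requests contribute none.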

Each research paper retrieved includes multiple metadata fields, which are stored in a CSV file. The metadata columns and their meanings are as follows:

* title: The title of the research paper.
* authors: A comma-separated list of authors' names.
* year: The year the paper was published.
* citations: The total number of times the paper has been cited.
* abstract: A brief summary of the paper's content.
* venue: The journal, conference, or repository where the paper was published.
* url: A direct link to the paper on Semantic Scholar.
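
The mapping from one API record to the columns above can be sketched as follows. The sample record is invented for illustration and is not real API output, and `to_row` is a hypothetical helper; the `or ""` fallbacks are a defensive variation that assumes a field may come back as null, which the script itself does not handle.

```python
# Hypothetical record shaped like one item of the search response's "data" list.
# All values are invented for illustration.
sample_paper = {
    "title": "An Example Paper",
    "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
    "year": 2020,
    "citationCount": 12,
    "abstract": None,  # assumed here to be possibly missing
    "venue": "Example Conference",
    "url": "https://example.org/paper",
}

def to_row(paper):
    """Flatten one paper record into the seven CSV columns."""
    return {
        "title": paper.get("title") or "",
        "authors": ", ".join(a.get("name") or "" for a in paper.get("authors") or []),
        "year": paper.get("year"),
        "citations": paper.get("citationCount") or 0,
        "abstract": paper.get("abstract") or "",
        "venue": paper.get("venue") or "",
        "url": paper.get("url") or "",
    }

row = to_row(sample_paper)
print(list(row))       # column names in CSV order
print(row["authors"])  # A. Author, B. Author
```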

Because each query returns at most 100 results per topic and year, the dataset is a sample of what Semantic Scholar's search returns for each topic rather than an exhaustive list. It still provides a broad basis for analyzing trends in AI research.
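
A first pass at trend analysis might count papers per year. The sketch below is one possible approach, not part of the script: it uses a tiny invented sample in the same column layout as the CSV, and it assumes the same paper can be returned for more than one topic (for example "AI" and "Artificial Intelligence"), so rows are deduplicated by `url` first.

```python
import csv
import io
from collections import Counter

# Tiny invented sample in the same column layout as the collected CSV.
sample_csv = """title,authors,year,citations,abstract,venue,url
Paper A,"X. One, Y. Two",2022,10,,Venue 1,https://example.org/a
Paper A,"X. One, Y. Two",2022,10,,Venue 1,https://example.org/a
Paper B,Z. Three,2023,4,,Venue 2,https://example.org/b
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Keep the first occurrence of each URL.
unique = {}
for row in rows:
    unique.setdefault(row["url"], row)

# Count the deduplicated papers per publication year.
papers_per_year = Counter(int(r["year"]) for r in unique.values())
print(sorted(papers_per_year.items()))  # [(2022, 1), (2023, 1)]
```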

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import requests
import pandas as pd
import time

class ResearchPapersCollector:
    def __init__(self, save_path="/content/drive/My Drive/AITrendAnalysis-project/semantic_scholar_data_2202.csv"):
        self.api_url = "https://api.semanticscholar.org/graph/v1/paper/search"
        self.headers = {'User-Agent': 'AI-Trends-Analyzer/1.0'}
        self.save_path = save_path

    def fetch_papers(self, query, year_start=2014, year_end=2024, limit_per_year=100):
        """
        Fetch research papers on a specific topic from Semantic Scholar.
        """
        all_papers = []
        for year in range(year_start, year_end + 1):
            print(f"Fetching papers for {query} in year {year}...")
            params = {
                "query": query,
                "fields": "title,authors,year,citationCount,abstract,venue,url",
                "year": year,
                "limit": limit_per_year
            }
            try:
                # A timeout keeps a stalled connection from hanging the whole run
                response = requests.get(self.api_url, params=params, headers=self.headers, timeout=30)
                response.raise_for_status()
                data = response.json()

                for paper in data.get("data", []):
                    all_papers.append({
                        "title": paper.get("title", ""),
                        "authors": ", ".join([a["name"] for a in paper.get("authors", [])]),
                        "year": paper.get("year", ""),
                        "citations": paper.get("citationCount", 0),
                        "abstract": paper.get("abstract", ""),
                        "venue": paper.get("venue", ""),
                        "url": paper.get("url", "")
                    })

                time.sleep(1)  # Avoid rate limits
            except Exception as e:
                print(f"Error fetching papers for {query} in {year}: {e}")

        return all_papers

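    def collect_data(self, topics, batch_size=500):
        """
        Collect research papers for multiple topics. After each topic, append the
        accumulated papers to the CSV once at least 'batch_size' have built up.
        Returns a DataFrame of every paper saved during this run.
        """
        all_data = []
        saved_frames = []  # Batches already written to disk
        for topic in topics:
            papers = self.fetch_papers(topic)
            all_data.extend(papers)

            # Save once at least 'batch_size' papers have accumulated
            if len(all_data) >= batch_size:
                df = pd.DataFrame(all_data)
                # Write the header only when the file does not exist yet
                df.to_csv(self.save_path, mode='a', header=not pd.io.common.file_exists(self.save_path), index=False)
                print(f"Saved {len(all_data)} research papers to {self.save_path}")
                saved_frames.append(df)
                all_data = []  # Reset data to start collecting the next batch

        # If there are any remaining papers, save them as well
        if all_data:
            df = pd.DataFrame(all_data)
            df.to_csv(self.save_path, mode='a', header=not pd.io.common.file_exists(self.save_path), index=False)
            print(f"Saved the remaining {len(all_data)} research papers to {self.save_path}")
            saved_frames.append(df)

        # Return everything saved in this run (an empty DataFrame if nothing was fetched)
        if not saved_frames:
            return pd.DataFrame()
        return pd.concat(saved_frames, ignore_index=True)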

def main():
    seed_topics = [ "Data Science", "AI",
        'Artificial Intelligence', 'Machine Learning',
        'Deep Learning', 'Natural Language Processing',
        'Computer Vision', 'Neural Networks',
        'Reinforcement Learning', 'Robotics',
        'Generative AI', 'Explainable AI', 'Predictive Analytics',
        'Data Mining', 'Statistical Learning',
        'Anomaly Detection', 'Time Series Analysis', 'Graph Analytics',
        'Speech Recognition', 'Transfer Learning',
        'Sentiment Analysis', 'Ethical AI', "Meta-Learning", "AI Security",
        "Big Data"
    ]

    collector = ResearchPapersCollector()
    df = collector.collect_data(seed_topics)
    print(df.head())

if __name__ == "__main__":
    main()


Fetching papers for Data Science in year 2014...
Fetching papers for Data Science in year 2015...
Fetching papers for Data Science in year 2016...
Fetching papers for Data Science in year 2017...
Fetching papers for Data Science in year 2018...
Fetching papers for Data Science in year 2019...
Fetching papers for Data Science in year 2020...
Fetching papers for Data Science in year 2021...
Fetching papers for Data Science in year 2022...
Fetching papers for Data Science in year 2023...
Fetching papers for Data Science in year 2024...
Saved 1100 research papers to /content/drive/My Drive/AITrendAnalysis-project/semantic_scholar_data_2202.csv
Fetching papers for AI in year 2014...
Fetching papers for AI in year 2015...
Fetching papers for AI in year 2016...
Fetching papers for AI in year 2017...
Fetching papers for AI in year 2018...
Fetching papers for AI in year 2019...
Fetching papers for AI in year 2020...
Fetching papers for AI in year 2021...
Fetching papers for AI in year 2022...
F