# Scraping Top Repositories For Topics On GitHub

## Introduction

### Web scraping
- Web scraping is an automated method used to extract data from websites. It involves fetching a webpage, parsing its content, and retrieving specific information using tools like requests for HTTP requests and BeautifulSoup or Scrapy for HTML parsing. Web scraping is commonly used for data mining, price monitoring, competitive analysis, research, and automation.


### GitHub and the problem statement
- GitHub is a web-based platform for version control and collaboration that allows developers to manage and share their code using Git. It provides features like repositories, branches, pull requests, issue tracking, and CI/CD integration, making it a key tool for open-source and professional software development.
- In this project, we extract data about the top repositories for each topic listed on GitHub's Topics page.

### Tools used
- Python, pandas, BeautifulSoup, requests, and the standard-library modules os, logging, pathlib, and datetime

## Steps

1. Import the necessary libraries.
2. Scrape ***https://github.com/topics*** to extract each topic's name, description, and page URL.
3. Store the extracted data in a DataFrame.
4. For each topic, scrape the top 20-25 repositories from the topic page to extract the username, repo name, stars, and repo URL.
5. Store the data extracted for each topic in a CSV file in the format below:

    username,repo_name,stars,repo_url
    mrdoob,three.js,105k,https://github.com/mrdoob/three.js
    libgdx,libgdx,23.8k,https://github.com/libgdx/libgdx

## Importing all the necessary libraries

In [1]:
import os
import logging
import requests
import pandas as pd
from pathlib import Path
from datetime import datetime
from bs4 import BeautifulSoup

In [2]:
BASE_URL = "https://github.com"

In [3]:
timest = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
logs_dir = 'LOGS'

os.makedirs(logs_dir,exist_ok=True)
log_filename = os.path.join(logs_dir, f"github_scraping_{timest}.log")

logging.basicConfig(
    filename = log_filename,
    level= logging.INFO,
    format = "%(asctime)s - %(levelname)s - %(message)s",
    datefmt = "%Y-%m-%d %H:%M:%S"
)

## Helper functions

### Defining function to fetch web page content

In [4]:
def get_response(url):
    """Fetches the response from a given URL."""
    logging.info(f"Fetching URL : {url}")
    try:
        response = requests.get(url, timeout=10)  # Time out instead of hanging on an unresponsive server
        if response.status_code != 200: # Check if the request was successful (HTTP 200 OK)
            logging.error(f"Failed to load page: {url}. Status Code:{response.status_code}")
            raise Exception(f'Failed to load page: {url}. Status Code:{response.status_code}')
        return response
    except requests.exceptions.RequestException as e:
        logging.critical(f"Request error for {url}: {e}")
        raise
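
`get_response` makes a single attempt per URL. If transient failures or rate limiting become a problem, one option is a shared `requests.Session` that retries automatically. The sketch below is an assumption about how this could be set up, not part of the notebook: `make_session`, the retry counts, the status list, and the User-Agent string are all illustrative choices.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=3, backoff_factor=0.5):
    """Build a Session that retries GET requests on transient HTTP errors.

    Hypothetical helper: the defaults here are illustrative, not tuned values.
    """
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,  # wait progressively longer between attempts
        status_forcelist=(429, 500, 502, 503, 504),  # statuses treated as retryable
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    # Identify the client; the exact string is an arbitrary example.
    session.headers.update({"User-Agent": "github-topics-scraper (example)"})
    return session


session = make_session()
# A request would then look like this (not executed here):
# response = session.get("https://github.com/topics", timeout=10)
```

Reusing one session also keeps the underlying connection open between requests, which tends to help when many topic pages are fetched in a row.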

### Defining function to parse HTML content

In [5]:
def get_soup(res):
    """Parses the HTML content from an HTTP response using BeautifulSoup."""
    
    logging.info("Parsing HTML content with BeautifulSoup")
    try:
        page_content = res.text # Extract raw HTML content
        if not page_content:
            logging.warning("Warning: The response content is empty.")
        soup = BeautifulSoup(page_content, 'html.parser') # Parse HTML with BeautifulSoup
        logging.info("Successfully parsed HTML content.")
        return soup
    except Exception as e:
        logging.error(f"Error while parsing HTML: {e}")
        raise

### Defining function to extract topic name

In [6]:
def get_topic_name(topic):
    """Extracts the topic name from a given BeautifulSoup element."""
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'  # Class used to identify topic title
    logging.info("Extracting topic name...")
    topic_title_tag = topic.find('p',{'class': selection_class}) # Locate the <p> tag containing the topic name
    if topic_title_tag:
        # Extract and return the text content of the topic
        topic_name = topic_title_tag.get_text().strip()
        logging.info(f"Extracted topic name: {topic_name}")
        return topic_name
    logging.warning("Topic name not found!")
    return ""
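
One caveat of matching on the full class string, as `get_topic_name` does: BeautifulSoup compares a multi-class string against the exact value of the `class` attribute, so the lookup stops matching if GitHub reorders the classes or adds one. The sketch below illustrates this on a hypothetical HTML fragment (the markup is made up for the example) and shows a CSS selector that depends on only two of the classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same classes as the topic title, in a different order.
html = '<div><p class="mt-1 f3 Link--primary lh-condensed mb-0">3D</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Exact-string match, as used in get_topic_name: finds nothing when the order differs.
exact = soup.find("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# CSS selector: matches any <p> that carries both classes, in any order.
by_selector = soup.select_one("p.f3.Link--primary")

print(exact)                             # None
print(by_selector.get_text(strip=True))  # 3D
```

Which classes are stable enough to select on is a judgment call that depends on GitHub's current markup, so the selector above is only an example.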

### Defining function to extract topic description

In [7]:
def get_topic_desc(topic):
    """Extracts the topic description from a given BeautifulSoup element."""
    selection_class = 'f5 color-fg-muted mb-0 mt-1' # Class used to identify topic description
    logging.info("Extracting topic description...")
    topic_desc_tag  = topic.find('p', {'class': selection_class}) # Locate the <p> tag containing the description
    if topic_desc_tag:
        topic_desc = topic_desc_tag.get_text().strip()  # Extract and clean the topic description
        logging.info(f"Extracted topic description: {topic_desc[:50]}...")  # Log first 50 chars
        return topic_desc
    logging.warning("Topic description not found!")
    return ""

### Defining function to extract topic URL

In [8]:
def get_topic_url(topic):
    """Extracts the full URL of a GitHub topic from a given BeautifulSoup element."""
    global BASE_URL # Using the global BASE_URL to construct the full URL
    selection_class = 'no-underline flex-1 d-flex flex-column' # Class used to identify topic link
    logging.info("Extracting topic URL...")
    topic_url_tag = topic.find('a', {'class': selection_class}) # Locate the <a> tag containing the URL
    if topic_url_tag:
        try:
            topic_url = f"{BASE_URL}{topic_url_tag['href']}"  # Construct full topic URL
            logging.info(f"Successfully extracted topic URL: {topic_url}")
            return topic_url
        except Exception as e:
            logging.error(f"Encountered error while extracting topic url: {e}")
            raise
    logging.warning("Topic URL not found!")
    return ""

### Defining function to extract GitHub Topics Data and store in a dictionary

In [9]:
def get_topics_data(soup):
    """Extracts GitHub topics (name, description, and URL) from a BeautifulSoup object."""
    global BASE_URL # Using a global base URL
    topics_dict = dict() # Dictionary to store topic details
    topic_name = topic_desc = topic_url = ""
    
    # Find all div tags that contain topic information
    logging.info("Extracting GitHub topics from the page...")
    topics_div = soup.find_all('div',{'class':'py-4 border-bottom d-flex flex-justify-between'})

    if not topics_div:
        logging.warning("No topics found on the page.")  # Log if no topics are found
        return topics_dict
    
    # Loop through each topic div to extract relevant data
    for topic in topics_div:
        # getting title, description and url
        topic_name  = get_topic_name(topic) # Extract topic name
        topic_desc  = get_topic_desc(topic) # Extract topic description
        topic_url   = get_topic_url(topic)  # Extract topic URL
        
        if topic_name:
            topics_dict[topic_name] = {
                'topic_name': topic_name,
                'topic_description': topic_desc,
                'topic_url': topic_url
            }
            logging.info(f"Extracted data for topic: {topic_name}")
        else:
            logging.warning("A topic was found without a name. Skipping entry.")
            continue
    logging.info(f"Total topics extracted: {len(topics_dict)}")
    return topics_dict # Return the dictionary containing topic details

### Defining function to create a DataFrame from Topics Dictionary

In [10]:
def create_topics_dataframe(topics_dict):
    """Converts a dictionary of GitHub topics into a Pandas DataFrame."""

    # Convert dictionary to DataFrame, transpose to match correct format
    logging.info("Converting topics dictionary into a Pandas DataFrame...")
    if not topics_dict:  # Check if the dictionary is empty
        logging.warning("The topics dictionary is empty. Returning an empty DataFrame.")
        return pd.DataFrame(columns=['topic_name', 'topic_description', 'topic_url'])

    df = pd.DataFrame(topics_dict).transpose().reset_index(drop=True)
    df.columns = ['topic_name', 'topic_description', 'topic_url']
    logging.info(f"Successfully created DataFrame with {df.shape[0]} topics.")
    
    return df
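
As a possible alternative to transposing, `pd.DataFrame.from_dict` with `orient="index"` treats each outer key as a row directly. The sketch below runs on a small hypothetical dictionary in the shape that `get_topics_data` returns; the descriptions and the second topic are placeholders.

```python
import pandas as pd

# Hypothetical sample in the shape produced by get_topics_data.
topics_dict = {
    "3D": {
        "topic_name": "3D",
        "topic_description": "Example description",
        "topic_url": "https://github.com/topics/3d",
    },
    "example-topic": {
        "topic_name": "example-topic",
        "topic_description": "Another example description",
        "topic_url": "https://github.com/topics/example-topic",
    },
}

columns = ["topic_name", "topic_description", "topic_url"]

# Each outer key becomes one row; passing columns fixes the column order explicitly.
topics_df = pd.DataFrame.from_dict(
    topics_dict, orient="index", columns=columns
).reset_index(drop=True)

print(topics_df)
```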

### Defining function to extract repository information

In [11]:
def get_repo_info(repo):
    """Extracts repository details (username, repo name, URL, and stars) from a BeautifulSoup element."""
    # Initialise every field, including stars, so the return below never hits an unbound name
    username = repo_name = repo_url = stars = ""
    selection_class = 'f3 color-fg-muted text-normal lh-condensed' # Class identifying repo owner and name

    logging.info("Extracting repository information...")
    try:
        username_tag = repo.find('h3',{'class': selection_class}) # Find the <h3> tag containing repo details
        if username_tag:
            a_tags = username_tag.find_all('a') # Extract all <a> tags
            if len(a_tags)>=2: # Ensure there are at least two <a> tags (username and repo name)
                username  = a_tags[0].get_text().strip()
                repo_name = a_tags[1].get_text().strip()
                repo_url  = f"{BASE_URL}{a_tags[1]['href']}" # Construct full repository URL
                logging.info(f"Extracted repo: {repo_name} by {username}")
        # Find the repository star count
        star_tag = repo.find('span',{'id':'repo-stars-counter-star'})
        if star_tag:
            stars = star_tag.get_text().strip()
        if not username or not repo_name or not repo_url or not stars:
            logging.warning("Some repository details are missing.")
    except Exception as e:
        logging.error(f"Error extracting repository information: {e}")
    # Return after the try block; a return inside finally would also swallow unexpected errors
    return username, repo_name, repo_url, stars

### Defining function to extract all the repositories from a GitHub topic page

In [12]:
def get_all_repos(soup):
    """Extracts all repository elements from a GitHub topic page using BeautifulSoup."""
    selection_class = 'border rounded color-shadow-small color-bg-subtle my-4' # Class identifying repository sections

    logging.info("Extracting all repository elements from the GitHub topic page...")
    repos = soup.find_all('article',{'class':selection_class}) # Extract all repository elements

    if not repos:
        logging.warning("No repositories found on the topic page.")  # Log warning if no repositories found
    else:
        logging.info(f"Successfully extracted {len(repos)} repositories.")
    return repos

### Defining function to scrape repository data from a GitHub Topic Page

In [13]:
def scrape_repo(url):
    """Scrapes repository details (name, owner, stars, URL) from a GitHub topic page."""
    logging.info(f"Starting repository scraping for: {url}")
    repo_dict = dict() # Dictionary to store repository data

    try:
        response = get_response(url) # Fetch the topic page content
        soup = get_soup(response) # Parse the page using BeautifulSoup
        repos = get_all_repos(soup) # Extract repository elements
        
        if not repos:
            logging.warning(f"No repositories found at {url}.")
            return repo_dict
            
        # Loop through each repository element and extract details
        for repo in repos:
            username, repo_name, repo_url, stars = get_repo_info(repo)
            # Store extracted details, keyed by owner. Note: a second repository
            # from the same owner on this page overwrites the first.
            if username:
                repo_dict[username] = {
                    'repo_name': repo_name,
                    'stars': stars,
                    'repo_url': repo_url
                }
                logging.info(f"Extracted repository: {repo_name} by {username}")
        logging.info(f"Successfully scraped {len(repo_dict)} repositories from {url}.")
    except Exception as e:
        logging.error(f"Error while scraping repositories from {url}: {e}")
    # Return after the try block rather than from finally, which would also swallow unexpected errors
    return repo_dict

### Defining function for scraping GitHub Topics

In [14]:
def scrape_topics():
    """Scrapes GitHub Topics, extracts details, and returns a DataFrame."""
    topics_url = "https://github.com/topics" # URL of GitHub Topics page

    logging.info(f"Starting to scrape GitHub topics from {topics_url}")
    try:
        response = get_response(topics_url) # Fetch page content
        soup = get_soup(response) # Parse HTML using BeautifulSoup
        topics_dict = get_topics_data(soup) 
        if not topics_dict:  # Check if any topics were extracted
            logging.warning("No topics were extracted. Returning an empty DataFrame.")
            return pd.DataFrame(columns=['topic_name', 'topic_description', 'topic_url'])
        topics_df = create_topics_dataframe(topics_dict)  # Convert to DataFrame
        logging.info(f"Successfully scraped {len(topics_df)} topics.")
        return topics_df
    except Exception as e:
        logging.error(f"Error while scraping GitHub topics: {e}")
        return pd.DataFrame(columns=['topic_name', 'topic_description', 'topic_url']) 

## Scraping GitHub Topics and Repositories

In [15]:
def scrape_topics_repo():
    """Scrapes GitHub topics and their repositories, saving data into CSV files."""

    logging.info("Starting GitHub topic and repository scraping process.")
    
    topics_df = scrape_topics() # Fetch GitHub topics
    if topics_df.empty:  # Check if no topics were extracted
        logging.warning("No topics found. Exiting repository scraping process.")
        return

    
    for topic_url in topics_df['topic_url']:
        topic_name = topic_url.split('/')[-1] # Extract topic name from URL
        logging.info(f"Scraping top repositories for: {topic_name}")
        repo_dict = scrape_repo(topic_url) # Scrape repositories for the topic
        
        if not repo_dict:  # Check if no repositories were found
            logging.warning(f"No repositories found for {topic_name}. Skipping CSV creation.")
            continue
        
        repo_df = pd.DataFrame.from_dict(repo_dict).transpose() # Convert to DataFrame
        
        # Format DataFrame: reset index and rename columns
        repo_df.reset_index(inplace=True)
        repo_df.rename(columns={'index':'username'},inplace=True)
        
        # Ensure 'Data' directory exists
        os.makedirs('Data',exist_ok=True)
        file_path = Path("Data") / f"{topic_name}.csv" # Define CSV file path
        
        if file_path.exists(): # Check if file already exists
            logging.info(f"File {file_path} already exists. Skipping...")
            continue
        try:
            repo_df.to_csv(file_path, index=False)  # Save DataFrame to CSV
            logging.info(f"Successfully saved repositories for {topic_name} to {file_path}.")
        except Exception as e:
            logging.error(f"Error saving {file_path}: {e}")

In [16]:
scrape_topics_repo()

## Summary

This project successfully **scrapes GitHub Topics** and their **top repositories**, extracting key details and storing them in CSV files.  

### **What This Notebook Does**
- Fetches **GitHub Topics** from the GitHub Topics page.
- Extracts **topic details** (name, description, URL).
- Scrapes **top repositories** under each topic.
- Extracts **repository details** (Owner, Name, Stars, URL).
- Saves the collected data into **CSV files** for analysis.
- Implements **logging** to track scraping progress and handle errors.

### **Data Storage**
- Each topic's repositories are saved in **Data/topic_name.csv**.
- Example:  

In [17]:
pd.read_csv('Data/3d.csv').head()

,username,repo_name,stars,repo_url
0,mrdoob,three.js,105k,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,28.3k,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,23.8k,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,23.7k,https://github.com/BabylonJS/Babylon.js
4,FreeCAD,FreeCAD,23.3k,https://github.com/FreeCAD/FreeCAD
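
The `stars` column is stored as GitHub displays it, for example `105k` or `28.3k`, so it is text rather than a number. For numeric analysis, a conversion along these lines could be applied. `parse_star_count` is a hypothetical helper, not part of the notebook, and support for an `m` suffix is an assumption, since only `k` appears in the output above. The result is approximate because the displayed value is already rounded.

```python
from decimal import Decimal

# Multipliers for abbreviated counts; "m" is an assumption, only "k" is seen in the data.
_SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_star_count(text):
    """Convert a displayed star count such as '28.3k' to an int, or None if empty."""
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        return None
    multiplier = 1
    if cleaned[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[cleaned[-1]]
        cleaned = cleaned[:-1]
    # Decimal avoids binary floating-point rounding surprises.
    return int(Decimal(cleaned) * multiplier)


for shown in ["105k", "28.3k", "950"]:
    print(shown, parse_star_count(shown))
```

With pandas, the same function could be applied to the column via `repo_df["stars"].map(parse_star_count)`.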


### **Future Work**
- **Further analyze CSV data** using **Pandas, SQL, or visualization tools**.
- **Improve scraping speed** using parallel processing.
- **Add database storage** for structured data retrieval.
- **Extract additional repository details** (Forks, Issues, Contributors).
- **Implement pagination** to scrape **more topics and repositories** by iterating through multiple pages (`?page=1`, `?page=2`, etc.).
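
The pagination item above could start from something like the following sketch. It only builds `?page=N` URLs and loops until a page comes back empty; `paginated_urls` and `collect_pages` are hypothetical names, and the fetching step is passed in as a function so that the loop can be tried without network access. Whether every GitHub page honours the `page` parameter in this form is an assumption to verify.

```python
def paginated_urls(base_url, max_pages):
    """Yield base_url with ?page=1 through ?page=max_pages appended."""
    for page in range(1, max_pages + 1):
        yield f"{base_url}?page={page}"


def collect_pages(base_url, fetch_items, max_pages=3):
    """Collect items page by page, stopping early at the first empty page.

    fetch_items is any callable that takes a URL and returns a list of items.
    """
    collected = []
    for url in paginated_urls(base_url, max_pages):
        items = fetch_items(url)
        if not items:
            break  # an empty page is treated as the end of the listing
        collected.extend(items)
    return collected


# Stand-in for a real fetcher: two pages of fake items, then nothing.
fake_pages = {
    "https://github.com/topics?page=1": ["topic-a", "topic-b"],
    "https://github.com/topics?page=2": ["topic-c"],
}
print(collect_pages("https://github.com/topics", lambda url: fake_pages.get(url, []), max_pages=5))
```

In the notebook, `fetch_items` could wrap the existing `get_response`, `get_soup`, and `get_all_repos` calls for a single page.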