# GitHub Topic Scraper
This notebook scrapes [GitHub Topics](https://github.com/topics) to extract topic titles, descriptions, URLs, and the top repositories under each topic, including star counts and usernames. All data is saved as CSV files.

## Step 1: Import required libraries

In [27]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import os

### Helper: Convert GitHub star notation to integer

In [28]:
def parse_star_count(star_str):
    """Convert star strings like '105k' or '987' into integers."""
    star_str = star_str.strip().lower().replace(',', '')
    if star_str.endswith('k'):
        # round() guards against float products that land just below a whole number
        return round(float(star_str[:-1]) * 1000)
    return int(star_str)
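
As a quick sanity check, the helper can be exercised on a few representative strings. The sketch below repeats the function so that it runs on its own. It uses `round` instead of `int` on the multiplied value, which is an assumption made here to avoid float truncation, and the sample inputs are illustrative rather than scraped.

```python
def parse_star_count(star_str):
    """Convert star strings like '105k' or '987' into integers."""
    star_str = star_str.strip().lower().replace(',', '')
    if star_str.endswith('k'):
        # round() instead of int(): an assumption, to avoid float truncation
        return round(float(star_str[:-1]) * 1000)
    return int(star_str)


# Illustrative inputs (not scraped data)
examples = ['987', '1,234', '105k', '16.1k', ' 2.5K\n']
parsed = [parse_star_count(s) for s in examples]
print(parsed)
```

Note that a string without a `k` suffix must be a plain integer; anything else raises `ValueError`, which is probably the desired behavior for unexpected markup.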

## Step 2: Define scraping functions for repositories under a topic

In [29]:
def get_topic_page(topic_url):
    """Download a topic page and return it as a parsed BeautifulSoup document."""
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page! {}'.format(topic_url))
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


base_url = "https://github.com"


def get_repo_info(h3_tag, star_tag):
    """Return the username, repository name, star count, and repository URL."""
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


def get_topic_repos(topic_doc):
    """Collect the repositories listed on a topic page into a DataFrame."""
    h3_selection = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection})
    star_tags = topic_doc.find_all('span', class_='Counter js-social-count')

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    # Repository headings and star counters are matched by position
    for i in range(len(repo_tags)):
        username, repo_name, stars, repo_url = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)

    return pd.DataFrame(topic_repos_dict)


def scrape_topic(topic_url, path):
    """Scrape one topic page and save its repositories to a CSV file at path."""
    if os.path.exists(path):
        print('Skipping {}, already exists!'.format(path))
        return

    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
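
The functions above pair repository headings with star counters by position, so a missing counter would shift or break the pairing. One possible alternative, sketched below, looks up both tags inside the same container. It assumes each repository sits in its own `<article>` element; that wrapper, the `SAMPLE_HTML` markup, and the `extract_repos` helper are hypothetical and only modeled on the class names used above, so the real page structure should be checked before relying on it.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"

# Hypothetical markup modeled on the class names used in this notebook
SAMPLE_HTML = """
<article>
  <h3 class="f3 color-fg-muted text-normal lh-condensed">
    <a href="/octocat">octocat</a> /
    <a href="/octocat/hello-world">hello-world</a>
  </h3>
  <span class="Counter js-social-count">1.5k</span>
</article>
<article>
  <h3 class="f3 color-fg-muted text-normal lh-condensed">
    <a href="/example-user">example-user</a> /
    <a href="/example-user/demo">demo</a>
  </h3>
</article>
"""


def parse_star_count(star_str):
    """Convert star strings like '105k' or '987' into integers."""
    star_str = star_str.strip().lower().replace(',', '')
    if star_str.endswith('k'):
        return round(float(star_str[:-1]) * 1000)
    return int(star_str)


def extract_repos(doc):
    """Return one dict per repository, skipping incomplete entries."""
    rows = []
    for article in doc.find_all('article'):
        h3_tag = article.find('h3')
        star_tag = article.find('span', class_='js-social-count')
        if h3_tag is None or star_tag is None:
            continue  # heading or star counter missing: skip this entry
        a_tags = h3_tag.find_all('a')
        if len(a_tags) < 2:
            continue  # expected an owner link and a repository link
        rows.append({
            'username': a_tags[0].text.strip(),
            'repo_name': a_tags[1].text.strip(),
            'stars': parse_star_count(star_tag.text),
            'repo_url': BASE_URL + a_tags[1]['href'],
        })
    return rows


rows = extract_repos(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(rows)
```

The second sample entry has no star counter, so it is skipped rather than being paired with a counter that belongs to another repository. The resulting list of dicts can be passed straight to `pd.DataFrame`.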

## Step 3: Define scraping functions for the topics page

In [30]:
def get_topic_titles(doc):
    """Return the title of every topic listed on the topics page."""
    select_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': select_class})
    return [tag.text for tag in topic_title_tags]


def get_topic_desc(doc):
    """Return the description of every topic listed on the topics page."""
    desc_select = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags = doc.find_all('p', {'class': desc_select})
    return [tag.text for tag in topic_desc_tags]


def get_topic_urls(doc):
    """Return the absolute URL of every topic listed on the topics page."""
    link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    base_url = "https://github.com"
    return [base_url + tag['href'] for tag in link_tags]


def scrape_topics():
    """Scrape the topics page into a DataFrame of titles, descriptions, and URLs."""
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        # The original referenced an undefined name (topic_url) here
        raise Exception('Failed to load page! {}'.format(topics_url))

    doc = BeautifulSoup(response.text, 'html.parser')

    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_desc(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)


scrape_topics()

    
    
    

| | title | description | url |
|---|---|---|---|
| 0 | 3D | \n 3D refers to the use of three-dime... | https://github.com/topics/3d |
| 1 | Ajax | \n Ajax is a technique for creating i... | https://github.com/topics/ajax |
| 2 | Algorithm | \n Algorithms are self-contained sequ... | https://github.com/topics/algorithm |
| 3 | Amp | \n Amp is a non-blocking concurrency ... | https://github.com/topics/amphp |
| 4 | Android | \n Android is an operating system bui... | https://github.com/topics/android |
| 5 | Angular | \n Angular is an open source web appl... | https://github.com/topics/angular |
| 6 | Ansible | \n Ansible is a simple and powerful a... | https://github.com/topics/ansible |
| 7 | API | \n An API (Application Programming In... | https://github.com/topics/api |
| 8 | Arduino | \n Arduino is an open source platform... | https://github.com/topics/arduino |
| 9 | ASP.NET | \n ASP.NET is a web framework for bui... | https://github.com/topics/aspnet |
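
The descriptions in this output start with a newline and padding because the text of each tag is stored as scraped. If cleaner values are wanted, one option is to collapse the whitespace before storing them. The `clean_text` helper below is a hypothetical addition, not part of the notebook, and the sample strings are made up to mimic the padding seen above.

```python
def clean_text(raw):
    """Collapse runs of whitespace, including newlines, into single spaces."""
    return ' '.join(raw.split())


# Made-up strings that mimic the padding in the scraped descriptions
raw_descriptions = [
    '\n          Example description   with\n uneven spacing.\n        ',
    'Already clean.',
]
cleaned = [clean_text(text) for text in raw_descriptions]
print(cleaned)
```

In the notebook this could be applied where the text is collected, for example by storing `clean_text(tag.text)` instead of `tag.text` in `get_topic_desc`.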


## Step 4: Loop through all topics and save their top repositories as CSV

In [31]:

def scrape_topics_repos():
    """Scrape the list of topics, then save each topic's top repositories as CSV."""
    print('Scraping List of Topics:')
    topics_df = scrape_topics()

    os.makedirs('data', exist_ok=True)

    for _, row in topics_df.iterrows():
        print('Scraping top repositories for {}'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
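
Because the CSV file name is built directly from the topic title, a title containing a path separator or another character that file systems reject would make the write fail or land in an unexpected place. A minimal sketch of one way to guard against this follows. The `topic_csv_path` helper and the `CI/CD` title are hypothetical, and the set of replaced characters is an assumption based on common file system restrictions.

```python
import re

# Characters that are path separators or commonly rejected in file names
UNSAFE_CHARS = re.compile(r'[\\/:*?"<>|]')


def topic_csv_path(title, out_dir='data'):
    """Build the CSV path for a topic, replacing unsafe characters with '_'."""
    safe_title = UNSAFE_CHARS.sub('_', title.strip())
    return '{}/{}.csv'.format(out_dir, safe_title or 'untitled')


paths = [topic_csv_path(t) for t in ['ASP.NET', 'C++', 'CI/CD']]
print(paths)
```

Only clearly unsafe characters are replaced, so titles such as `C` and `C++` still map to different files and the skip-if-exists check in `scrape_topic` does not confuse them.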



## Step 5: Run the full scraping process

In [32]:
scrape_topics_repos()

Scraping List of Topics:
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command-line interface
Scraping top repositories for Clojure
Scraping top repositories for Code quality
Scraping top repositories for Code review
Scra