# Scraping Top Repositories for Topics on GitHub

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
- For each topic, we'll get the top repositories listed on its topic page
- For each repository, we'll grab the repo name, username, star count, and repo URL
- For each topic, we'll create a CSV file containing its repositories

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

base_url = 'https://github.com'

In [2]:
def get_topics_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    
    return doc

In [3]:
doc = get_topics_page()

In [4]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
        
    return topic_titles

In [5]:
titles = get_topic_titles(doc)

In [6]:
def get_topic_description(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_desc = []
    
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip())
        
    return topic_desc

In [7]:
descriptions = get_topic_description(doc)

In [8]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
        
    return topic_urls

In [9]:
urls = get_topic_urls(doc)

In [10]:
topics_dict = {'Topic_Title': titles, 'Topic_description': descriptions, 'Topic_url': urls}

In [11]:
topics = pd.DataFrame(topics_dict)
topics.head()

| | Topic_Title | Topic_description | Topic_url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
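
Building the DataFrame from a dictionary of lists only works when the lists have equal length. If one of the CSS class selectors matches fewer tags than the others, `pd.DataFrame` raises a `ValueError`. The sketch below shows one way to check the lengths up front and get a clearer message. `check_equal_lengths` is a hypothetical helper, not part of the notebook, and the sample values are made up for illustration.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists differ in length; return the lengths otherwise."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths


# Made-up sample values, not scraped data
sample_titles = ['3D', 'Ajax']
sample_urls = ['https://github.com/topics/3d', 'https://github.com/topics/ajax']

print(check_equal_lengths(titles=sample_titles, urls=sample_urls))
```

In the notebook, a call such as `check_equal_lengths(titles=titles, descriptions=descriptions, urls=urls)` could run just before `topics_dict` is built.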


In [12]:
os.makedirs('./Scrap_Github_Repos', exist_ok=True)  # to_csv does not create missing folders
topics.to_csv('./Scrap_Github_Repos/Github_Topics.csv', index=False)

# Get the top repositories from a topic page

In [13]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc
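
The page loaders raise as soon as a response is not 200, which aborts a long scraping run on the first failed request. A minimal sketch of a retry wrapper follows, on the assumption that some failures are temporary. `fetch_with_retry` and its `retries` and `delay` parameters are hypothetical names, and the demonstration uses a stand-in for `requests.get` so that it runs without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) until it returns a response whose status_code is 200."""
    for attempt in range(1, retries + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < retries:
            # Wait a little longer after each failed attempt
            time.sleep(delay * attempt)
    raise Exception('Failed to load page {} after {} attempts'.format(url, retries))


# Offline demonstration with a stand-in for requests.get (hypothetical, no network access)
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


attempts = []


def flaky_fetch(url):
    # Fails twice, then succeeds
    attempts.append(url)
    return FakeResponse(200 if len(attempts) >= 3 else 503)


response = fetch_with_retry(flaky_fetch, 'https://github.com/topics/3d', delay=0)
print(response.status_code, len(attempts))
```

With the real library, the call would be `fetch_with_retry(requests.get, topic_url)`. Exceptions raised by the fetch function itself, such as connection errors, are not caught in this sketch.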

In [14]:
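def parse_star_count(stars):
    # Convert a star label such as '1.2k' or '987' to an integer
    stars = stars.strip()
    
    if stars[-1] == 'k':
        # round() avoids float truncation surprises when scaling by 1000
        return round(float(stars[:-1]) * 1000)
    
    return int(stars)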
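
The star parser above handles plain integers and the `k` suffix. A slightly more defensive variant is sketched below, under the assumption that a label might also contain thousands separators or an `m` suffix; whether GitHub ever renders counts that way is not confirmed here. `parse_count` is a hypothetical name and the inputs are made up.

```python
def parse_count(text):
    """Convert a count label such as '1.2k', '2m' or '1,234' to an int."""
    cleaned = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    suffix = cleaned[-1:]
    if suffix in multipliers:
        # round() guards against float artifacts when scaling the number
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)


for label in [' 1.2k ', '987', '1,234', '2m']:
    print(repr(label), '->', parse_count(label))
```

An empty or otherwise unparseable label still raises `ValueError`, which makes a changed page layout visible instead of silently recording a wrong count.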

In [15]:
def get_repo_info(h3_tag, star_tag):
    # Returns the username, repo name, star count and repo URL for one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url

In [16]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})

    # Get star tags
    star_tags=topic_doc.find_all('span',{'id':'repo-stars-counter-star'})
    
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info (zip pairs each repo tag with its star tag and stops at the shorter list)
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)
        
    return pd.DataFrame(topic_repos_dict)

In [17]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

In [18]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_description(doc),
        'url': get_topic_urls(doc)
    }
    
    return pd.DataFrame(topics_dict)

In [19]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('./scraped_repos', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], './scraped_repos/{}.csv'.format(row['title']))
        print('Scraping complete for "{}" repositories'.format(row['title']))
        print()
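
The loop above builds each file name from the topic title. Titles can contain spaces, dots, or other characters, and a title with a path separator would break the path. One alternative, sketched here, is to name the file after the last segment of the topic URL, which is already a URL-safe slug. `csv_path_for_topic` and its `folder` parameter are hypothetical names, not part of the notebook.

```python
from urllib.parse import urlparse


def csv_path_for_topic(topic_url, folder='./scraped_repos'):
    """Build a CSV path from the last path segment (slug) of a topic URL."""
    slug = urlparse(topic_url).path.rstrip('/').split('/')[-1]
    return '{}/{}.csv'.format(folder, slug)


for url in ['https://github.com/topics/3d', 'https://github.com/topics/amphp']:
    print(url, '->', csv_path_for_topic(url))
```

Inside the loop, `scrape_topic(row['url'], csv_path_for_topic(row['url']))` would use this in place of the title-based path. The trade-off is that file names follow the slug (`amphp`) instead of the display title (`Amp`).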

In [20]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping complete for "3D" repositories

Scraping top repositories for "Ajax"
Scraping complete for "Ajax" repositories

Scraping top repositories for "Algorithm"
Scraping complete for "Algorithm" repositories

Scraping top repositories for "Amp"
Scraping complete for "Amp" repositories

Scraping top repositories for "Android"
Scraping complete for "Android" repositories

Scraping top repositories for "Angular"
Scraping complete for "Angular" repositories

Scraping top repositories for "Ansible"
Scraping complete for "Ansible" repositories

Scraping top repositories for "API"
Scraping complete for "API" repositories

Scraping top repositories for "Arduino"
Scraping complete for "Arduino" repositories

Scraping top repositories for "ASP.NET"
Scraping complete for "ASP.NET" repositories

Scraping top repositories for "Atom"
Scraping complete for "Atom" repositories

Scraping top repositories for "Awesome Lists"
Scraping complete 