# Scraping Top Repositories for Topics on GitHub

Web scraping is a technique for extracting information from websites. It allows you to acquire non-tabular or poorly structured data from websites and convert it into a usable, structured format, such as a .csv file or spreadsheet.

GitHub is a website that helps developers store and manage their code, as well as track and control changes to their code. 

We will use Python's requests, BeautifulSoup and Pandas libraries to scrape GitHub pages.

Here are the steps we'll follow:
- We are going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description.
- For each topic, we'll get the top 25 repositories in the topic from the topic page.
- For each repository, we'll grab the repo name, username, stars and repo URL.
- For each topic, we'll create a CSV file.


## Scrape the list of topics from GitHub

- Use requests to download the page.
- Use BeautifulSoup to parse and extract information.
- Convert to a Pandas DataFrame. 

Let's write a function to download the page.

In [63]:
!pip install requests beautifulsoup4 pandas --upgrade --quiet

In [32]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
def get_topic_page():
    # Return a BeautifulSoup document holding the parsed page that lists the topics on GitHub
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Check for a successful response
    if response.status_code != 200:
        # Use topics_url here; the name topic_url is not defined in this function
        raise Exception('Failed to load page {}'.format(topics_url))

    #Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [33]:
doc = get_topic_page()

The result is a `BeautifulSoup` object, which we can search for tags such as `a`.

In [34]:
type(doc)

bs4.BeautifulSoup

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the `class` : `f3 lh-condensed mb-0 mt-1 Link--primary`

![.](https://i.imgur.com/NUOa4Qy.png)

In [35]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class':selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` can be used to get the list of titles.

In [36]:
titles = get_topic_titles(doc)

In [37]:
len(titles)

30

In [38]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
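
One detail about the lookup in `get_topic_titles` is worth knowing: when a multi-word string is passed as the class, BeautifulSoup compares it against the exact value of the `class` attribute, so the match depends on the classes appearing in that order. A CSS selector passed to `select` tests each class independently. The snippet below is a minimal sketch on a made-up HTML fragment (an assumption for illustration, not GitHub's actual markup) that shows the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second <p> lists the same classes in a different order.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
"""
sample_doc = BeautifulSoup(html, 'html.parser')

# Exact-string match on the class attribute: only the first <p> qualifies.
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: every listed class must be present, in any order.
by_css = sample_doc.select('p.f3.lh-condensed.Link--primary')

exact_titles = [tag.text for tag in exact]
css_titles = [tag.text for tag in by_css]
print(exact_titles)
print(css_titles)
```

If a page reorders its classes, the selector form may keep working where the exact-string form returns an empty list.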

Similarly, we can define functions for descriptions and URLs.

In [39]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'             
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descriptions = []
    for tag in topic_desc_tags:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions

In [40]:
descriptions = get_topic_descs(doc)

In [41]:
len(descriptions)

30

In [42]:
descriptions[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

To get topic URLs, we can pick `a` tags with the `class` : `d-flex no-underline`

![.](https://i.imgur.com/MJsNDWQ.png)

In [43]:
def get_topic_urls(doc):    
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    base_url = "https://github.com" 
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

In [44]:
URL = get_topic_urls(doc)

In [45]:
len(URL)

30

In [46]:
URL[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

Let's put this all together into a single function.

In [47]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Check for a successful response
    if response.status_code != 200:
        # Use topics_url here; the name topic_url is not defined in this function
        raise Exception('Failed to load page {}'.format(topics_url))

    #Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    topics_dict = {
        'title' : get_topic_titles(topic_doc),
        'description' : get_topic_descs(topic_doc),
        'url' : get_topic_urls(topic_doc)
    }
    return pd.DataFrame(topics_dict)
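
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if one of the selectors matches fewer tags than the others. As a rough sketch, a hypothetical helper (`check_equal_lengths` is not part of the original notebook) can report which column is short before the DataFrame is built. The sample data is made up for illustration.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns

# Illustrative data only: the description list is one item short.
sample = {
    'title': ['3D', 'Ajax'],
    'description': ['Ajax is a technique for creating interactive web applications.'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}

try:
    check_equal_lengths(sample)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```

The error message lists the length of every column, which makes it easier to see which selector needs updating.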

## Get the top repositories from a topic page

- Download the topic page
- Get the repository info such as username, repository name, stars and the repository URL.


In [48]:
def get_topic_page(topic_url):
    # Note: this redefines the earlier get_topic_page so that it accepts any topic URL
    #Download the page
    response = requests.get(topic_url)
    #Check a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [49]:
topic_doc = get_topic_page('https://github.com/topics/3d')
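
`requests.get` has no default timeout, so a stalled connection can block the scraper indefinitely. Below is a hedged sketch of a variant with a timeout. `fetch_doc` and its `get` parameter are hypothetical names introduced here so the function can be tried without network access; note that `raise_for_status()` raises only for 4xx and 5xx responses, which is slightly different from the `!= 200` check used above.

```python
import requests
from bs4 import BeautifulSoup

def fetch_doc(url, timeout=10, get=requests.get):
    """Hypothetical variant of get_topic_page that applies a timeout.

    `get` is injectable so the function can be exercised with a stand-in.
    """
    # Give up if the server does not respond within `timeout` seconds.
    response = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

Calling `fetch_doc('https://github.com/topics/3d')` would then behave like `get_topic_page`, except that it fails with an exception instead of hanging when the server does not answer.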

To get the repo info, we can pick `h3` tags with the `class` : `f3 color-text-secondary text-normal lh-condensed`

Each `h3` tag contains two `a` tags: the first holds the username and the second holds the repo name:

![.](https://i.imgur.com/v1OO5f9.png)

We can use the helper function `get_repo_info` to extract such details.

We use another helper function, `parse_star_count`, to convert the star count text (which may end in `k` for thousands) into an integer.

In [50]:
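def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    # A trailing 'k' means thousands
    if stars_str[-1] == 'k':
        # round() avoids losing a unit to floating-point error, which int() truncation can do
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)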
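
Counts on a web page may also be rendered with other suffixes or with thousands separators. The following is a sketch of a more defensive parser, not the notebook's method: `parse_count` is a hypothetical name, and the `m` suffix and comma handling are assumptions about formats that might appear.

```python
def parse_count(count_str):
    """Convert strings such as '73.8k', '1.2m' or '1,234' to an int."""
    # Assumed suffixes: 'k' for thousands, 'm' for millions (case-insensitive).
    multipliers = {'k': 1_000, 'm': 1_000_000}
    cleaned = count_str.strip().lower().replace(',', '')
    if cleaned and cleaned[-1] in multipliers:
        # round() rather than int(), so floating-point error cannot drop a unit.
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)

print(parse_count(' 73.8k '))
print(parse_count('1,234'))
```

An empty or non-numeric string still raises `ValueError`, which seems preferable to silently recording a wrong star count.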

In [51]:
def get_repo_info(h3_tag, star_tag):
    #returns all the required info about the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    base_url = "https://github.com" 
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
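
To see what `get_repo_info` reads from each `h3` tag, here is a small sketch that runs the same steps on a hand-written fragment. The markup is a simplified assumption that mimics the structure described above; it is not copied from GitHub's page.

```python
from bs4 import BeautifulSoup

# Made-up markup: the first <a> holds the username, the second the repo name.
html = """
<h3 class="f3 color-text-secondary text-normal lh-condensed">
  <a href="/mrdoob">mrdoob</a> /
  <a href="/mrdoob/three.js">three.js</a>
</h3>
"""
h3_tag = BeautifulSoup(html, 'html.parser').find('h3')

a_tags = h3_tag.find_all('a')
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()
# The href is relative, so the base URL is prepended.
repo_url = 'https://github.com' + a_tags[1]['href']

print(username, repo_name, repo_url)
```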

To get the star tags, we can pick `a` tags with the class: `social-count float-none`.

![.](https://i.imgur.com/vM8oIJc.png)

In [52]:
def get_topic_repos(topic_doc):
    #Get H3 tags containing repo title, repo url and username
    h3_selector = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selector })
    #get star tags
    star_tags = topic_doc.find_all('a',{'class' : 'social-count float-none'})
    
    #get repo info
    topics_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topics_repos_dict['username'].append(repo_info[0])
        topics_repos_dict['repo_name'].append(repo_info[1])
        topics_repos_dict['stars'].append(repo_info[2])
        topics_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topics_repos_dict)

In [53]:
get_topic_repos(topic_doc)

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 73800 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 18800 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 14800 | https://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 14700 | https://github.com/BabylonJS/Babylon.js |
| 4 | aframevr | aframe | 13000 | https://github.com/aframevr/aframe |
| 5 | ssloy | tinyrenderer | 11100 | https://github.com/ssloy/tinyrenderer |
| 6 | lettier | 3d-game-shaders-for-beginners | 11000 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 9800 | https://github.com/FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 8600 | https://github.com/metafizzy/zdog |
| 9 | CesiumGS | cesium | 7400 | https://github.com/CesiumGS/cesium |


In [54]:
import os
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The {} already exists, skipping".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # index=False keeps the DataFrame's row index out of the CSV
    topic_df.to_csv(path, index=False)

In [55]:
scrape_topic('https://github.com/topics/3d', '3d.csv')

The 3d.csv already exists, skipping


New file created:

![.](https://i.imgur.com/ELRsBJC.png)

## Putting it all together

- We have a function to get the list of topics.
- We have a function to create a CSV file for scraped repos from a topic page.
- Let's create a function to put them together.


In [56]:
def scrape_topics_repos():
    print("Scraping list of topics")
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
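
`scrape_topics_repos` uses each topic title directly as a file name, so a title containing a character such as `/` would be read as a path separator. As a sketch under that assumption, a hypothetical helper (`safe_filename` is not part of the original notebook) could normalise titles first. The last input below is a made-up title used only to show the replacement.

```python
import re

def safe_filename(title):
    """Hypothetical helper: turn a topic title into a conservative file name."""
    # Replace each run of characters other than letters, digits, '.', '_' or '-' with '_'.
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', title.strip())
    # Drop leading/trailing dots and underscores; fall back to a placeholder if nothing is left.
    return cleaned.strip('._') or 'untitled'

print(safe_filename('Awesome Lists'))
print(safe_filename('ASP.NET'))
print(safe_filename('foo/bar'))
```

One trade-off of this approach is that two different titles can map to the same file name, so the existing "already exists" check would skip the second one.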

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [57]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
The data/3D.csv already exists, skipping
Scraping top repositories for "Ajax"
The data/Ajax.csv already exists, skipping
Scraping top repositories for "Algorithm"
The data/Algorithm.csv already exists, skipping
Scraping top repositories for "Amp"
The data/Amp.csv already exists, skipping
Scraping top repositories for "Android"
The data/Android.csv already exists, skipping
Scraping top repositories for "Angular"
The data/Angular.csv already exists, skipping
Scraping top repositories for "Ansible"
The data/Ansible.csv already exists, skipping
Scraping top repositories for "API"
The data/API.csv already exists, skipping
Scraping top repositories for "Arduino"
The data/Arduino.csv already exists, skipping
Scraping top repositories for "ASP.NET"
The data/ASP.NET.csv already exists, skipping
Scraping top repositories for "Atom"
The data/Atom.csv already exists, skipping
Scraping top repositories for "Awesome Lists"
The data/Awesome L

We can check that the CSVs were created properly by reading one of them with Pandas.

In [58]:
pd.read_csv('data/Algorithm.csv')

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | jwasham | coding-interview-university | 190000 | https://github.com/jwasham/coding-interview-un... |
| 1 | CyC2018 | CS-Notes | 137000 | https://github.com/CyC2018/CS-Notes |
| 2 | trekhleb | javascript-algorithms | 118000 | https://github.com/trekhleb/javascript-algorithms |
| 3 | TheAlgorithms | Python | 116000 | https://github.com/TheAlgorithms/Python |
| 4 | yangshun | tech-interview-handbook | 56700 | https://github.com/yangshun/tech-interview-han... |
| 5 | kdn251 | interviews | 53600 | https://github.com/kdn251/interviews |
| 6 | azl397985856 | leetcode | 43800 | https://github.com/azl397985856/leetcode |
| 7 | algorithm-visualizer | algorithm-visualizer | 35300 | https://github.com/algorithm-visualizer/algori... |
| 8 | crossoverJie | JCSprout | 26300 | https://github.com/crossoverJie/JCSprout |
| 9 | donnemartin | interactive-coding-challenges | 23500 | https://github.com/donnemartin/interactive-cod... |
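
Beyond spot-checking a single file, it can be handy to load every CSV at once. The sketch below assumes the files live in a `data` directory as created above; `load_all_topics` and the added `topic` column are hypothetical names introduced for illustration.

```python
import glob
import os

import pandas as pd

def load_all_topics(data_dir='data'):
    """Read every CSV in `data_dir` into one DataFrame with a `topic` column."""
    frames = []
    for path in sorted(glob.glob(os.path.join(data_dir, '*.csv'))):
        df = pd.read_csv(path)
        # Derive the topic from the file name, e.g. 'data/Algorithm.csv' -> 'Algorithm'
        df['topic'] = os.path.splitext(os.path.basename(path))[0]
        frames.append(df)
    if not frames:
        # No CSV files found: return an empty DataFrame rather than failing in concat
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

With the combined frame, something like `load_all_topics().sort_values('stars', ascending=False).head()` would show the most starred repositories across all scraped topics.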


In [59]:
!pip install jovian --upgrade --quiet

In [60]:
import jovian

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project="scraping-github-topics-repositories")


## References and Future Work

In this project, I scraped the top repositories for the topics featured on GitHub.

The project is a part of this tutorial by Jovian:
    https://www.youtube.com/watch?v=RKsLLG-bzEY
    
Ideas for future work:
- I have scraped only the 30 topics on the first page; I will try to scrape all the featured topics from GitHub.
- I will try to include the most forked repositories in the future.