# Web Scraping Top Repositories from GitHub Topics

📌**Web Scraping** - Web scraping (or data scraping) is a technique used to collect content and data from the internet. This data is usually saved in a local file so that it can be manipulated and analyzed as needed.

📌**GitHub** - GitHub is a website and cloud-based service that helps developers store and manage their code, as well as track and control changes to it.

📌**Tools Used**:
- Python
- requests
- Pandas
- Beautiful Soup (BS4)
- os

📌**Project Outline**:
- We are going to scrape https://github.com/topics.
- We will get a list of topics; for each topic we will collect its name, page URL, and description.
- For each topic, we will get the top 25 repositories from the topic page.
- For each topic, we will create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
```

## 📍 Scraping List of Topics from GitHub

First, we import the libraries required for the project:
- **requests** - The de facto standard library for making HTTP requests in Python. It hides the complexity of making requests behind a simple API, so that you can focus on interacting with services and consuming data in your application.
- **Beautiful Soup** - A Python library for pulling data out of HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parse tree.

In [1]:
import os
import requests
import pandas as pd

from bs4 import BeautifulSoup

Next, we download the contents of the GitHub Topics page using 'requests' and parse it using Beautiful Soup ('BS4').

In [2]:
def scrape_topics_repos():
    topic_url = 'https://github.com/topics'

    # Download the url using requests.get()
    response = requests.get(topic_url)

    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))

    # Parse the downloaded page using Beautiful Soup
    topic =  BeautifulSoup(response.text, 'html.parser')

    return topic

In [3]:
topic = scrape_topics_repos()
type(topic)

bs4.BeautifulSoup

Once the page is downloaded and parsed, we use 'find_all()' from the Beautiful Soup library to get the topic titles, descriptions, and URLs. It extracts elements from an HTML page by matching the HTML tags and classes passed as parameters.

We will define a separate function for each piece of data; each function returns its results as a list.
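
As a small offline illustration of how 'find_all()' matches tags and classes, the sketch below parses a hand-written HTML snippet. The markup and class names are invented for this example and are not copied from GitHub. One detail worth knowing: when the class filter is a string containing several class names, Beautiful Soup compares it with the exact value of the `class` attribute, so the order of the classes matters, whereas a CSS selector passed to `select()` does not depend on the order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not GitHub's real HTML.
html = """
<div>
  <p class="f3 title">3D</p>
  <p class="title f3">Ajax</p>
  <p class="f5 desc">Some description</p>
</div>
"""

doc = BeautifulSoup(html, 'html.parser')

# A multi-class string is compared with the exact class attribute value,
# so only the paragraph whose attribute reads "f3 title" is returned.
exact = [tag.text for tag in doc.find_all('p', {'class': 'f3 title'})]

# A single class name matches any tag that carries that class.
single = [tag.text for tag in doc.find_all('p', class_='title')]

# A CSS selector requires both classes but ignores their order.
css = [tag.text for tag in doc.select('p.f3.title')]

print(exact)
print(single)
print(css)
```

This is one reason the long class strings used in this notebook are fragile: if GitHub reorders or renames a single class, the exact-string search returns nothing.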

In [4]:
def get_topic_title(doc):

    # find_all() or findAll() helps to find the elements from the page using html tags and classes
    topic_title = doc.find_all('p', { 'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topicTitles = []
    for tag in topic_title:
        topicTitles.append(tag.text)
    return topicTitles

def get_topic_desc(doc):
    topic_desc = doc.find_all('p', {'class':"f5 color-fg-muted mb-0 mt-1"})
    topicDesc = []
    for tag in topic_desc:
        topicDesc.append(tag.text.strip())
    return topicDesc

def get_topic_url(doc):
    topic_links = doc.find_all('a', {'class':'no-underline flex-1 d-flex flex-column'})
    topicUrls = []
    baseurl = 'https://github.com'
    for tag in topic_links:
        topicUrls.append(baseurl + tag['href'])
    return topicUrls

In [5]:
topicTitles = get_topic_title(topic)
topicTitles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [6]:
topicDesc = get_topic_desc(topic)
topicDesc

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azu

In [7]:
topicUrls = get_topic_url(topic)
topicUrls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

All the data is stored in lists. We convert it into a pandas DataFrame so that we can export it to a CSV file or use it for further scraping.

In [8]:
def topics_data(topic):
    topics_dict = {
        'Title': get_topic_title(topic),
        'Description':get_topic_desc(topic),
        'Url':get_topic_url(topic)
    }

    return pd.DataFrame(topics_dict)
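
To make the list-to-DataFrame-to-CSV step concrete without touching the network, here is a minimal sketch that builds the same kind of dictionary from two hand-typed topics and writes the CSV into an in-memory buffer. The use of `io.StringIO` is an assumption made for illustration; the notebook itself writes to files on disk.

```python
import io

import pandas as pd

# Two topics typed by hand, mirroring the structure returned by topics_data().
topics_dict = {
    'Title': ['3D', 'Ajax'],
    'Description': [
        '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'Url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}

# Every list must have the same length, otherwise pandas raises a ValueError.
df = pd.DataFrame(topics_dict)

# Write the CSV into an in-memory text buffer instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Passing `index=False` keeps the row numbers out of the CSV, so the first line of the output is just the column names.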

In [9]:
topics_data(topic)

,Title,Description,Url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


## 📍 Scraping repositories of each topic

The 'topicUrls' variable stores the list of all topic URLs from the downloaded topics page.

To extract the repository data for each topic, we have to download and parse each topic page. Using the list of topic URLs created earlier, we follow the same steps as we did for the main GitHub Topics page.

The function below is similar to the one defined at the beginning of this notebook; the only change is that it takes a topic URL as its input parameter.

In [10]:
def get_page_info(topicUrl):
    # Download the HTML document for the given topic URL
    topic_doc = requests.get(topicUrl)

    if topic_doc.status_code != 200:
        raise Exception('Failed to load page {}'.format(topicUrl))

    # Parse the HTML document using Beautiful Soup (only reached when the status code is 200)
    topic_page = BeautifulSoup(topic_doc.text, 'html.parser')

    return topic_page

In [11]:
topic_page = get_page_info(topicUrls[0])
type(topic_page)

bs4.BeautifulSoup

To get the repository data, we have to find out where the information is stored in the HTML page.
While inspecting the page, I found that the information I want, i.e. the repository names and star counts, is stored in 'h3' and 'span' tags respectively, so we scrape it using Beautiful Soup.

**Note**: The HTML structure of the page may change over time, so anyone who wants to use this notebook should inspect the page and update this code chunk to match the latest tags and classes.

In [12]:
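# Get the repository heading tags and star-count tags from the topic page
# (topic_page, not the main topics page stored in 'topic')
h1_tags = topic_page.find_all('h3', {"class":"f3 color-fg-muted text-normal lh-condensed"})
star_tags = topic_page.find_all('span', {"class":"Counter js-social-count"})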

The star counts are shown on the page in abbreviated 'k' form.
For example, 77000 is shown as 77k.

So we convert each star count to an integer.

In [13]:
def parse_star_count(stars_str):
    starsStr = stars_str.strip()

    if starsStr[-1] == 'k':
        return int(float(starsStr[:-1]) * 1000)

    return int(starsStr)
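
The function above covers the 'k' suffix seen on the page. As a hedged sketch of a slightly more defensive variant, the hypothetical helper below also accepts an 'm' suffix and thousands separators; neither format is shown in this notebook's output, so both are assumptions. It uses `round()` rather than `int()` so that a tiny floating-point error in the multiplication cannot truncate the result by one.

```python
def parse_star_count_flexible(stars_str):
    """Convert strings such as '77k', '1.2m', '1,234' or '987' to an int.

    Hypothetical helper: the 'm' suffix and the comma handling are
    assumptions, not formats confirmed on the GitHub page.
    """
    text = stars_str.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}

    # A trailing suffix means the number is abbreviated.
    if text and text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])

    return int(text)


for raw in ['77k', ' 69.7k\n', '1.2m', '1,234', '987']:
    print(repr(raw), parse_star_count_flexible(raw))
```

An empty string still raises a `ValueError` here, which is arguably preferable to silently recording a repository with zero stars.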

Once the pages are parsed and the star counts are formatted, we extract the main fields for each repository:
username, repository name, repository URL, and stars.

In [14]:
def get_repo_info(h1_tag, star_tag):
    baseurl = 'https://github.com'
    # Return all the required info about a repository

    aTags = h1_tag.find_all('a')
    username = aTags[0].text.strip()
    reponame = aTags[1].text.strip()
    repourl = baseurl + aTags[1]['href']
    stars = parse_star_count(star_tag.text)

    return username, reponame, repourl, stars

We collect all the data in a Python dictionary and then create a DataFrame from it,
using the pandas DataFrame constructor.

In [15]:
def get_topic_repos(topic):
    # Get the repository heading tags and star-count tags
    h1_tags = topic.find_all('h3', {"class":"f3 color-fg-muted text-normal lh-condensed"})
    star_tags = topic.find_all('span', {"class":"Counter js-social-count"})

    # Initializing a python dictionary
    topic_repos_dict = {'username':[], 'repo_name':[], 'stars':[], 'repo_url':[]}

    for i in range(len(h1_tags)):
        repo_info = get_repo_info(h1_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[3])
        topic_repos_dict['repo_url'].append(repo_info[2])

    return pd.DataFrame(topic_repos_dict)
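
`get_topic_repos` pairs the heading tags and the star tags by index, so it raises an `IndexError` if a page yields fewer star tags than headings. The sketch below shows one possible guard, using plain lists in place of parsed tags so it runs without downloading anything. The function name `pair_repos_with_stars` is hypothetical and not part of the notebook.

```python
def pair_repos_with_stars(repo_names, star_counts):
    """Pair each repository name with its star count.

    Refuses mismatched input instead of failing part-way through the loop.
    """
    if len(repo_names) != len(star_counts):
        raise ValueError(
            'Got {} repositories but {} star counts'.format(
                len(repo_names), len(star_counts)))

    # zip() walks both lists in step, so no manual indexing is needed.
    return [{'repo_name': name, 'stars': stars}
            for name, stars in zip(repo_names, star_counts)]


rows = pair_repos_with_stars(['three.js', 'libgdx'], [77500, 19500])
print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`, which is an alternative to filling a dictionary of lists.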

In [31]:
get_topic_repos(topic_page)

,username,repo_name,stars,repo_url
0,mrdoob,three.js,77500,https://github.com/mrdoob/three.js
1,libgdx,libgdx,19500,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,16300,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,15600,https://github.com/BabylonJS/Babylon.js
4,aframevr,aframe,13500,https://github.com/aframevr/aframe
5,ssloy,tinyrenderer,11900,https://github.com/ssloy/tinyrenderer
6,lettier,3d-game-shaders-for-beginners,11800,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,10500,https://github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,8900,https://github.com/metafizzy/zdog
9,CesiumGS,cesium,8200,https://github.com/CesiumGS/cesium


The last step is to convert each DataFrame into a CSV file and save it on your local machine.

Before that, we create a directory to hold all the CSV files, using the 'os' module.

## 📍 Create Data Directory and save as CSV Files

In [16]:
def scrape_topic(topic_url, path):

    # Skip topics whose CSV file has already been written
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return

    topic_df = get_topic_repos(get_page_info(topic_url))
    # 'path' already ends in '.csv', so it is used as-is
    topic_df.to_csv(path, index=False)
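
Topic titles are used directly as file names, which works for the titles shown in this notebook but could break for a title containing a path separator such as '/'. As a sketch only, the hypothetical helper `topic_csv_path` below turns a title into a conservative file name; the set of allowed characters is an assumption and can be adjusted.

```python
import re
from pathlib import Path


def topic_csv_path(title, directory='data'):
    """Build a CSV path from a topic title (hypothetical helper)."""
    # Replace each run of characters other than letters, digits,
    # '.', '+' and '-' with a single underscore.
    safe = re.sub(r'[^A-Za-z0-9.+-]+', '_', title).strip('_')
    return Path(directory) / (safe + '.csv')


def should_skip(path):
    """Return True when the CSV for a topic has already been written."""
    return Path(path).exists()


for title in ['Command line interface', 'C++', 'ASP.NET']:
    print(topic_csv_path(title).name)
```

Keeping '.', '+' and '-' means titles such as 'C++' and 'ASP.NET' stay recognisable, while spaces and slashes are replaced.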

Lastly, we call all the functions created above.

**Note** - Some pages may fail to download because of a server-side error on GitHub's side. If that happens, run the scrape_topics_csv() function again to retry.
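
Instead of re-running everything by hand, a failed download can be retried automatically. The sketch below is one possible approach, not part of the original notebook: `fetch_with_retries` is a hypothetical wrapper that accepts any callable raising an exception on failure (for example `get_page_info`), and the attempt count and delay are arbitrary defaults.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url) up to `attempts` times, waiting between failures.

    `fetch` is any callable that raises an exception when the download
    fails. The last exception is re-raised if every attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                time.sleep(delay_seconds * attempt)
    raise last_error
```

With this wrapper, a call such as `fetch_with_retries(get_page_info, topic_url)` could stand in for the direct call to `get_page_info(topic_url)`.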

In [17]:
def scrape_topics_csv():
    print('Scraping Topics')
    topics = scrape_topics_repos()
    topics_df = topics_data(topics)

    os.makedirs('data', exist_ok=True)

    for index, row in topics_df.iterrows():
        print('Scraping Top Repositories for ', format(row['Title']))
        scrape_topic(row['Url'], 'data/{}.csv'.format(row['Title']))

In [18]:
scrape_topics_csv()

Scraping Topics
Scraping Top Repositories for  3D
Scraping Top Repositories for  Ajax
Scraping Top Repositories for  Algorithm
Scraping Top Repositories for  Amp
Scraping Top Repositories for  Android
Scraping Top Repositories for  Angular
Scraping Top Repositories for  Ansible
Scraping Top Repositories for  API
Scraping Top Repositories for  Arduino
Scraping Top Repositories for  ASP.NET
Scraping Top Repositories for  Atom
Scraping Top Repositories for  Awesome Lists
Scraping Top Repositories for  Amazon Web Services
Scraping Top Repositories for  Azure
Scraping Top Repositories for  Babel
Scraping Top Repositories for  Bash
Scraping Top Repositories for  Bitcoin
Scraping Top Repositories for  Bootstrap
Scraping Top Repositories for  Bot
Scraping Top Repositories for  C
Scraping Top Repositories for  Chrome
Scraping Top Repositories for  Chrome extension
Scraping Top Repositories for  Command line interface
Scraping Top Repositories for  Clojure
Scraping Top Repositories for  Code qua

**Summary** -
- Scraped the top 30 topics, in alphabetical order, from https://github.com/topics.
- For each scraped topic, extracted the top 30 repositories in the order shown on the website.
- After gathering all the data, converted it into pandas DataFrames.
- Lastly, created 30 CSV files, one per scraped topic, containing the top repositories.

**References** -
- <a href='https://github.com/topics'>GitHub Topics</a>
- <a href='https://beautiful-soup-4.readthedocs.io/en/latest/#'>Beautiful Soup Documentation</a>
- <a href='https://jovian.ai/moustapha00864/python-web-scraping-project-guide'>Jovian</a>

**Future ideas** -
- Scrape the repositories of the remaining topics on the GitHub Topics page.
- Scrape the topics with the highest number of repositories, sorted by repository count.
- Include repository tags, which would help in analysing which trends are followed together.
- Extract repository metrics such as forks, pull requests, and open and closed issues.
- Add timestamps of when each repository was last updated.