# Scrape GitHub for top repositories

Web scraping is the process of using bots to extract content and data from a website.
Web scraping helps with market research and lets businesses keep track of their competitors in the market.

In this project I have made a simple Python program using BeautifulSoup, which scrapes the topics page on GitHub to collect the top repositories for each of the top topics.

https://github.com/topics/

I am using Python because of its rich libraries and out of personal preference. I am using the BeautifulSoup library for this project because the GitHub topics page is not dynamic, so the downloaded HTML already contains the data we need.

The general outline of the steps we will follow is:

*   Scrape the main topics page from GitHub
*   Extract the topic title, topic description and url of each topic.
*   Then, for each individual topic, we will scrape again to get the top repositories on that topic.
*   We will collect the name of the repository, the author of the repository, the number of stars, and the url of the repository, and store them in a csv file.
*   Finally, we will store all of this data in a folder.


First of all let's install and import all the necessary libraries.

In [1]:
!pip install requests --quiet
!pip install beautifulsoup4 --quiet
!pip install pandas --quiet



In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

## Scraping the topics page from GitHub.

In [3]:
def get_main_page():
  # Function to download and process the main page.

  # First we will use requests to download the page.
  # PS.: If we wish to scrape the second page of the topics then we have to add
  #  "?page=2" at the end of the given url.
  topic_url = 'https://github.com/topics'
  response = requests.get(topic_url)

  # Check if the download was successful.
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))
  
  # Since this page is not dynamic we can use BeautifulSoup to parse it.
  doc = BeautifulSoup(response.text, 'html.parser')

  # Return the processed page.
  return doc
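
The comment about scraping further pages can be made concrete with a small helper. This is only a sketch: `topics_page_url` is a hypothetical name that is not part of the original program, and it assumes the topics listing accepts a `page` query parameter.

```python
from urllib.parse import urlencode

TOPICS_URL = 'https://github.com/topics'

def topics_page_url(page=1):
  # Hypothetical helper: builds the url for one page of the topics listing.
  # Assumes the listing is paginated with a "page" query parameter.
  if page < 1:
    raise ValueError('page must be 1 or greater')
  if page == 1:
    return TOPICS_URL
  return TOPICS_URL + '?' + urlencode({'page': page})

print(topics_page_url(2))  # https://github.com/topics?page=2
```

The returned url could then be passed to `requests.get()` in place of the fixed `topic_url`.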

## Extracting the topics information.

Here we have 3 functions to extract 3 details, i.e., the name of the topic, the description of the topic, and the url of the topic page.

In each of these functions we take the processed page and search for the tags which contain the required information for each topic. We do this using the BeautifulSoup function find_all(), which searches the page and returns the tags we want. We can specify a particular class along with it to narrow down our search.

We can use the inspect element feature to get the particular class and type of tag for the data we want.
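
To see how the class filter behaves, here is a small sketch on a made-up HTML snippet (the markup and class names are invented for illustration, not copied from GitHub). When the class is given as one string containing spaces, find_all() compares it with the whole class attribute, so the order of the classes matters. Searching for a single class, or using a CSS selector with select(), is less strict.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<div>
  <p class="f3 title">3D</p>
  <p class="f5 desc">Topics about 3D modeling.</p>
  <p class="title f3">Ajax</p>
</div>
"""
doc = BeautifulSoup(html, 'html.parser')

# Whole class string: matches only tags whose class attribute is exactly this.
exact = doc.find_all('p', {'class': 'f3 title'})

# Single class: matches every 'p' tag that carries the class "title".
single = doc.find_all('p', class_='title')

# CSS selector: matches tags having both classes, in any order.
css = doc.select('p.f3.title')

print([tag.text for tag in exact])   # ['3D']
print([tag.text for tag in single])  # ['3D', 'Ajax']
print([tag.text for tag in css])     # ['3D', 'Ajax']
```

If the site reorders or adds a class, the exact-string search stops matching, so the single-class or selector form may be more forgiving.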

In [4]:
def get_topic_titles(doc):
  # Function to extract the titles of the topics.

  # The tag which contains the title of the topics is a 'p' tag and has the 
  # following class.
  topic_title_tags_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p', {'class': topic_title_tags_class})

  # We store all the titles in a list.
  topic_titles = []

  for tag in topic_title_tags:
    topic_titles.append(tag.text)

  return topic_titles

In [5]:
def get_topic_decs(doc):
  # Function to extract the descriptions of the topics.

  # The tag which contains the description is also a 'p' tag and has the
  # following class.
  topic_desc_tag_class = 'f5 color-text-secondary mb-0 mt-1'
  topic_desc_tag = doc.find_all('p', {'class': topic_desc_tag_class})

  # We store it in a list.
  topic_desc = []

  for tag in topic_desc_tag:
    topic_desc.append(tag.text.strip())

  return topic_desc

In [6]:
def get_topic_urls(doc):
  # Function to extract the urls of the topics.

  # The tag which contains the url is an 'a' tag having the following class.
  topic_link_tag_class = 'd-flex no-underline'
  topic_link_tag = doc.find_all('a', {'class': topic_link_tag_class})

  # We store the urls along with the base url to make it a fully functioning 
  # url.
  topic_urls = []
  base_url = 'https://github.com'

  for tag in topic_link_tag:
    topic_urls.append(base_url + tag['href'])

  return topic_urls
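
As an alternative to joining the strings by hand, the standard library can resolve a relative link against the base url. The sketch below uses a hypothetical helper name, `absolute_url`, and sample paths chosen for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def absolute_url(href):
  # Hypothetical helper: resolves a relative href against the base url.
  # An href that is already absolute is returned unchanged.
  return urljoin(BASE_URL, href)

print(absolute_url('/topics/3d'))
print(absolute_url('https://github.com/topics/ajax'))
```

This keeps working if a link on the page is already a full url, where plain concatenation would produce a broken address.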

Now let's have one function which calls the above functions to complete the first part.

In [7]:
def scrape_topics():
  # Function to scrape the topics page and save it in a dataframe.

  # Gets the main topics page.
  doc = get_main_page()

  # Creates a dictionary to store all the details.
  topics_dict = {
      'title': get_topic_titles(doc),
      'description': get_topic_decs(doc),
      'url': get_topic_urls(doc)
  }

  # Makes the dictionary into a dataframe and returns it.
  return pd.DataFrame(topics_dict)
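
Since the three lists are collected by separate searches, they can end up with different lengths if one of the class names stops matching, and pandas will not build a dataframe from lists of unequal length. A small check, sketched here with a hypothetical helper `check_equal_lengths`, gives a clearer error message in that case.

```python
def check_equal_lengths(columns):
  # Hypothetical helper: verifies that every list in the dictionary has the
  # same length before it is turned into a dataframe.
  lengths = {name: len(values) for name, values in columns.items()}
  if len(set(lengths.values())) > 1:
    raise ValueError('Column lengths differ: {}'.format(lengths))
  return columns

sample = {
    'title': ['3D', 'Ajax'],
    'description': ['About 3D.', 'About Ajax.'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax']
}
check_equal_lengths(sample)
```

The checked dictionary can then be passed to `pd.DataFrame()` as before.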

## Scrape each individual topic for the repositories list.

Now we use the dataframe created previously and use the link given for each topic to download that page and get the top repositories.

In [8]:
def download_a_repo(link_to_repo):
  # Function to download and process the repository page.

  # First of all we download the repository page using the url from the 
  # dataframe.
  response = requests.get(link_to_repo)

  # Check if the download was successful.
  if response.status_code != 200:
    raise Exception('Failed to load the page {}'.format(link_to_repo))
  
  # Process the page using BeautifulSoup and return it.
  page_doc = BeautifulSoup(response.text, 'html.parser')

  return page_doc

## Extract the details of the top repositories for a topic.

To make things simpler, let's make a helper function that extracts information from the repository page. This helper function will take one of the repositories in a topic and extract its details. We also need a function to convert strings like "52k" into the integer 52000, as the number of stars is given in this format.

In [9]:
def parse_stringToInt(count):
  # Function to convert a string such as "52k" to integer 52000.

  # First we remove the excess spaces.
  count = count.strip()

  # If there is a "k" then we remove the "k" and multiply by 1000.
  if count[-1] == 'k':
    return int(float(count[:-1]) * 1000)
  
  return int(count)
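
A slightly more tolerant variant of this conversion is sketched below. The name `parse_count` is hypothetical, and the extra formats it accepts (an "m" suffix, upper-case suffixes, and thousands separators such as "1,234") are assumptions about what the page might show, not something confirmed here. It uses Decimal instead of float so that a value like "52.3k" is scaled exactly.

```python
from decimal import Decimal

# Assumed suffixes; only "k" is known to appear in the scraped data.
SUFFIXES = {'k': 1000, 'm': 1000000}

def parse_count(text):
  # Hypothetical helper: converts strings such as "52k" or "1,234" to int.
  cleaned = text.strip().lower().replace(',', '')
  if not cleaned:
    raise ValueError('empty count')

  # Look up the multiplier for the last character, defaulting to 1.
  multiplier = SUFFIXES.get(cleaned[-1], 1)
  if multiplier != 1:
    cleaned = cleaned[:-1]

  return int(Decimal(cleaned) * multiplier)

print(parse_count('52k'))    # 52000
print(parse_count(' 1.2k ')) # 1200
print(parse_count('1,234'))  # 1234
```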

In [10]:
def get_repo_info_helper(parent_tag, star_tag):
  # Function which returns all the details of a repository.

  # In the website both the name of the author and the name of the project are
  # kept in separate 'a' tags under one 'h3' tag. We find both child tags to
  # get what we want.
  child_tags = parent_tag.find_all('a')

  author = child_tags[0].text.strip()
  name = child_tags[1].text.strip()

  # In the website the url of the repository is conveniently kept in the repo 
  # name tag itself. So we take it and add the base url.
  base_url = 'https://github.com'
  url = base_url + child_tags[1]['href']

  # For the number of stars we take the star_tag and extract and process the 
  # text using the above function.
  stars = parse_stringToInt(star_tag.text.strip())

  # We return all 4 details in order.
  return author, name, stars, url

The following function will use the above helper function and extract details for all of the top repositories in a topic.

In [11]:
def get_repo_info(page_doc):
  # Function which extracts the details of all the top repositories of a topic.

  # We use the inspect feature to get the tag type and class of the required 
  # tags.
  repo_tags_class = 'f3 color-fg-muted text-normal lh-condensed'
  repo_tags = page_doc.find_all('h3', {'class': repo_tags_class})

  repo_star_class = 'social-count float-none'
  repo_star = page_doc.find_all('a', {'class': repo_star_class})

  # We create a dictionary to store all the results.
  repo_dict = {
      'username': [],
      'repo_name': [],
      'stars': [],
      'url': []
  }

  # We fill the dictionary by repeatedly calling the helper function for all 
  # the repositories in a topic.
  for repo_tag, star_tag in zip(repo_tags, repo_star):
    author, name, stars, url = get_repo_info_helper(repo_tag, star_tag)
    repo_dict['username'].append(author)
    repo_dict['repo_name'].append(name)
    repo_dict['stars'].append(stars)
    repo_dict['url'].append(url)

  # We convert this into a dataframe and return it.
  return pd.DataFrame(repo_dict)

Now that we have the data we require, all that is left is to make a csv file and save it.

In [12]:
def csv_repo(url, path):
  # Function which takes a topic url and extracts the details of all the 
  # repositories in it and saves it in a csv file in the given path.

  # Checks if the path provided already exists or not.
  if os.path.exists(path):
    print('The file {} already exists. Skipping...'.format(path))
    return

  # We call the above functions one by one and save the result in a csv file.
  repo_df = get_repo_info(download_a_repo(url))
  repo_df.to_csv(path, index=False)

## Putting everything together.

Here we make the master function which uses all the above functions and creates a folder to store the results.

In [13]:
def scrape_github_topics():
  # Function to scrape the topics page of GitHub and store the details of the 
  # top repositories of the top topics.

  # First we scrape the main page and get the topics list.
  print('Scraping list of topics :')
  topics_df = scrape_topics()

  # Let's save the list of the topics as a csv file as well.
  topics_df.to_csv('topics.csv', index=False)

  # Make a folder to store the data neatly.
  os.makedirs('data', exist_ok=True)

  # We go through the topics list and scrape each topic to extract the details 
  # of the top repositories in it and make a csv file and store it in the 
  # folder.
  for index, row in topics_df.iterrows():
    print('Scraping top repositories for "{}"'.format(row['title']))
    csv_repo(row['url'], 'data/{}.csv'.format(row['title']))
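
The topic title is used directly as the file name, which works for titles like "ASP.NET" but could fail for a title containing a character such as "/". One possible safeguard is sketched below; `safe_filename` is a hypothetical helper, and the set of characters it keeps is an arbitrary choice.

```python
import re

def safe_filename(title):
  # Hypothetical helper: replaces every run of characters outside the allowed
  # set (letters, digits, ".", "+", "#", "_", "-") with one underscore.
  cleaned = re.sub(r'[^A-Za-z0-9.+#_-]+', '_', title).strip('_')

  # Fall back to a fixed name if nothing usable is left.
  return cleaned or 'untitled'

print(safe_filename('ASP.NET'))  # ASP.NET
print(safe_filename('CI/CD'))    # CI_CD
```

The path could then be built as `'data/{}.csv'.format(safe_filename(row['title']))`.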

Now let's run the entire thing.

In [16]:
if __name__ == "__main__":
    scrape_github_topics()

Scraping list of topics :
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/Atom.csv al

## Summary and scope for future work.

Summary:

In this project I have tried to implement a simple web scraping program, which is used to scrape the details of the top repositories of the top topics on the site.

I have used the Python language and the BeautifulSoup library in it for the project.


Ideas for future work:

This project can serve as a foundation for various applications such as:


*   Analysing how useful a particular repository is.
*   Analysing how active a particular topic is.
*   Keeping a track of the latest developments in an area.
*   and many others.

Link for the BeautifulSoup library :

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

PS.: We can also save the data directly to cloud storage by first mounting the drive and then copying the files there.