<a href="https://colab.research.google.com/github/SaharshGit/Web-Scraping-Github-Topics/blob/main/Scraping_github_topics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Scraping-GitHub-Topics-Repositories



- In this project we scrape data from the GitHub topics page: https://github.com/topics

- We will be using Python, Pandas, Requests and BeautifulSoup

- Objective
 - We'll get the list of topics. For each topic we'll get the topic title, topic page URL and topic description.
 - For each topic we'll get the top 20 repositories in the topic.
 - For each repository we'll get the repo name, username, stars and repo URL.
 - For each topic we'll create a CSV file in the following format:

   Repo Name,Username,Stars,Repo URL
   Info-Classifier,SaharshGit,1,https://github.com/SaharshGit/Info-Classifier
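
The target layout can be checked with a few lines of standard-library Python. This is a minimal sketch, not the notebook's own code: it writes the sample row above to an in-memory buffer with `csv.DictWriter` and reads it back. The variable names are illustrative.

```python
import csv
import io

# Column order taken from the format shown above.
FIELDNAMES = ["Repo Name", "Username", "Stars", "Repo URL"]

# The sample row from the format description.
rows = [
    {
        "Repo Name": "Info-Classifier",
        "Username": "SaharshGit",
        "Stars": 1,
        "Repo URL": "https://github.com/SaharshGit/Info-Classifier",
    }
]

# Write to an in-memory buffer instead of a file so nothing touches disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back yields one dict per repository; all values come back as strings.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Note that a CSV round trip loses types: `Stars` is written as the integer `1` but read back as the string `"1"`, so it needs converting before any numeric work.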



## Scrape the list of topics from GitHub

- Use the Requests library to download the page
- Use BeautifulSoup to parse and extract information
- Convert the data into a Pandas DataFrame
- Convert the DataFrame into a CSV file

Let's write a function to download the topics page

In [17]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os

In [18]:
def get_topics_page():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)                   # download the page using the requests library
    if response.status_code != 200:                      # fail loudly on a non-200 response
        raise Exception('failed to load {}'.format(topic_url))
    doc = BeautifulSoup(response.text, 'html.parser')    # parse the HTML with BeautifulSoup
    return doc                                           # return the parsed BeautifulSoup document

Create functions to extract topic titles, descriptions and URLs

In [19]:
def get_topic_titles(doc):
  selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'      # HTML class string that uniquely identifies the p tags holding the titles
  topic_title_tags = doc.find_all('p',{'class':selection_class})
  # topic_title_tags = doc.find_all('p', class_=selection_class)

  topic_titles=[]
  for tags in topic_title_tags:
    topic_titles.append(tags.text)                                 # storing titles into a list

  return topic_titles

In [20]:
def get_topic_desc(doc):
  desc_selector = 'f5 color-fg-muted mb-0 mt-1'                    # HTML class string that uniquely identifies the p tags holding the topic descriptions
  topic_desc_tags = doc.find_all('p', class_ = desc_selector)

  topic_desc=[]
  for desc in topic_desc_tags:
    topic_desc.append(desc.text.strip())                           # storing descriptions into a list

  return topic_desc

In [21]:
def get_topic_urls(doc):
  urls_selector = 'no-underline flex-1 d-flex flex-column'          # HTML class string that uniquely identifies the a tags holding the topic URLs
  topic_link_tags = doc.find_all('a',{'class':urls_selector})

  topic_urls=[]
  base_url="https://github.com"
  for url in topic_link_tags:
    topic_urls.append(base_url+url['href'])                          # storing urls into a list

  return topic_urls
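
String concatenation works here because the `href` values on the topics page are root-relative paths such as `/topics/3d`. As a hedged alternative (not what the notebook uses), the standard library's `urllib.parse.urljoin` resolves relative links and leaves already-absolute ones alone. The helper name and the sample `href` values below are illustrative.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def to_absolute(href, base_url=BASE_URL):
    """Resolve an href against the site root (hypothetical helper)."""
    return urljoin(base_url, href)

# A root-relative link, like the ones on the topics page.
relative = to_absolute("/topics/3d")
# An already-absolute link is returned unchanged.
absolute = to_absolute("https://github.com/topics/ajax")

print(relative)
print(absolute)
```

With plain concatenation the second call would produce a doubled host, so `urljoin` is the safer choice if the page ever mixes link styles.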

Let's put this all together into a single function

In [22]:
def scrape_topics():
  doc = get_topics_page()                    # download and parse the topics page (defined above)

  topics_dict = {
      'title': get_topic_titles(doc),
      'description': get_topic_desc(doc),
      'url': get_topic_urls(doc)
  }

  return pd.DataFrame(topics_dict)

In [23]:
scrape_topics()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |
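
Building a DataFrame from a dict of lists only works when every list has the same length; if one selector misses a tag, `pd.DataFrame` raises a `ValueError`. The following is a sketch in plain Python, assuming one wants to detect that mismatch explicitly. `columns_to_rows` is a hypothetical helper, not part of the notebook, and the sample values are copied from the output above.

```python
def columns_to_rows(columns):
    """Turn {'col': [values...]} into a list of row dicts (hypothetical helper).

    Raises ValueError when the columns differ in length, which is the same
    situation that makes building a DataFrame from the dict fail.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("column lengths differ: {}".format(lengths))
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]

# Two aligned columns produce one dict per topic.
topics = {
    "title": ["3D", "Ajax"],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
rows = columns_to_rows(topics)
print(rows[0])

# One list shorter than the other, as if a selector had missed a tag.
try:
    columns_to_rows({"title": ["3D", "Ajax"], "url": ["https://github.com/topics/3d"]})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The error message lists the length of each column, which points directly at the selector that went out of sync.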


## Scrape each topic's information

- Use the Requests library to download the page
- Use BeautifulSoup to parse and extract information
- Convert the data into a Pandas DataFrame
- Convert the DataFrame into a CSV file

Let's write a function that takes a topic URL, downloads that topic's page with Requests and parses it with BeautifulSoup

In [24]:
import os

def get_topic_page(topic_url):
  # download the page
  response = requests.get(topic_url)
  # check for a successful response
  if response.status_code != 200:
    raise Exception('failed to load {}'.format(topic_url))
  # parse the page using BeautifulSoup
  topic_doc = BeautifulSoup(response.text, 'html.parser')

  return topic_doc

The next cell defines helpers for the function further below: `get_repo_info` extracts the username, repo name, repo URL and star count of one repository, and `parse_star_count` converts the star label to an integer

In [25]:
base_url = 'https://github.com'

def get_repo_info(h3_tag, star_tag):
  # returns all the required info about a repo
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tag.text)

  return username, repo_name, repo_url, stars

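def parse_star_count(stars_str):
  stars_str = stars_str.strip()                    # the tag text may carry surrounding whitespace
  if stars_str[-1] == 'k':                         # e.g. '1.2k' means 1200
    return round(float(stars_str[:-1]) * 1000)     # round so float error cannot truncate the count

  return int(stars_str)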
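
The conversion can be exercised on a few sample strings. This is a sketch under assumptions: star counts are shown either as plain integers or with a trailing `k` for thousands, possibly surrounded by whitespace. The function name and the sample inputs are made up for illustration. It uses `round` rather than `int` on the scaled value so floating-point error cannot truncate a count by one.

```python
def star_count_from_text(text):
    """Convert a star label such as '1.2k' or '87' to an int (illustrative)."""
    text = text.strip()
    if text.lower().endswith("k"):
        # '1.2k' -> 1.2 * 1000, rounded to the nearest whole star
        return round(float(text[:-1]) * 1000)
    return int(text)

samples = ["87", "1.2k", "  45.6k\n", "100k"]
counts = [star_count_from_text(s) for s in samples]
print(counts)
```

The abbreviated form is lossy: `1.2k` could stand for any count from roughly 1150 to 1249, so the parsed values are approximations for large repositories.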

This function takes a BeautifulSoup doc and extracts information about the top 20 repositories in the selected topic

In [26]:
def get_topic_repos(topic_doc):
  # get the h3 tag containing username, repo_name, repo_url
  h3_selection_class= 'f3 color-fg-muted text-normal lh-condensed'
  repo_tags = topic_doc.find_all('h3',{'class':h3_selection_class})
  # get the span tags containing the star counts
  star_tag = topic_doc.find_all('span',{'class':'Counter js-social-count'})

  # get repo info
  repo_info_dict = {
    'username':[],
    'repo_name':[],
    'repo_url':[],
    'stars':[]
  }
  for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tag[i])
    repo_info_dict['username'].append(repo_info[0])
    repo_info_dict['repo_name'].append(repo_info[1])
    repo_info_dict['repo_url'].append(repo_info[2])
    repo_info_dict['stars'].append(repo_info[3])

  return pd.DataFrame(repo_info_dict)

This function uses the previously defined functions to build a DataFrame containing the username, repo name, repo URL and stars of each repository, then writes that DataFrame to a .csv file. If the file already exists, the topic is skipped

In [27]:
def scrape_topic(topic_url, path):
  if os.path.exists(path):
    print('file {} already exists. Skipping...'.format(path))
    return
  topic_df = get_topic_repos(get_topic_page(topic_url))
  topic_df.to_csv(path, index=False)         # do not write the DataFrame index as a column
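
The existence check makes the scraper resumable: rerunning after an interruption skips every topic whose CSV is already on disk, so no request is sent for it. Below is a minimal sketch of that pattern with a stub in place of the network call. `write_once` and `fake_fetch` are hypothetical names, not part of the notebook.

```python
import os
import tempfile

def write_once(path, fetch):
    """Write fetch() to path unless the file already exists (hypothetical helper).

    Returns True if the file was written, False if it was skipped.
    """
    if os.path.exists(path):
        print("file {} already exists. Skipping...".format(path))
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(fetch())
    return True

calls = []

def fake_fetch():
    # Stand-in for the download-and-parse step; records each invocation.
    calls.append(1)
    return "username,repo_name,repo_url,stars\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "3D.csv")
    first = write_once(csv_path, fake_fetch)   # file is created
    second = write_once(csv_path, fake_fetch)  # file exists, fetch is not called
```

The trade-off is staleness: an existing file is never refreshed, so the CSV has to be deleted by hand to pick up new star counts.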

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a .csv file of the scraped repos from a topic page

- Let's create a function to put them together

In [28]:
import os

def scrape_topic_repos():
  print('scraping list of topics')
  topics_df = scrape_topics()

  os.makedirs('Data', exist_ok=True)
  for index, row in topics_df.iterrows():
    print('scraping top repos for {}'.format(row['title']))
    scrape_topic(row['url'],'Data/{}.csv'.format(row['title']))

Let's run it to scrape the top repos for every topic on the GitHub topics page

In [29]:
scrape_topic_repos()

scraping list of topics
scraping top repos for 3D
scraping top repos for Ajax
scraping top repos for Algorithm
scraping top repos for Amp
scraping top repos for Android
scraping top repos for Angular
scraping top repos for Ansible
scraping top repos for API
scraping top repos for Arduino
scraping top repos for ASP.NET
scraping top repos for Awesome Lists
scraping top repos for Amazon Web Services
scraping top repos for Azure
scraping top repos for Babel
scraping top repos for Bash
scraping top repos for Bitcoin
scraping top repos for Bootstrap
scraping top repos for Bot
scraping top repos for C
scraping top repos for Chrome
scraping top repos for Chrome extension
scraping top repos for Command-line interface
scraping top repos for Clojure
scraping top repos for Code quality
scraping top repos for Code review
scraping top repos for Compiler
scraping top repos for Continuous integration
scraping top repos for C++
scraping top repos for Cryptocurrency
scraping top repos for Crystal
