# Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description, and view the results in a dataframe.
- For each topic, we'll get the top 25 repositories in the topic from the topic page, using the topic page URL from the previous step.
- For each repository, we'll get the repo name, username, stars and repo URL.
- Then for each topic we'll create a CSV file in the following format (using the data obtained for all the repositories in that topic):
```
Repo_Name,Username,Stars,Repo_url
```

For example:
```
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
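
As a minimal sketch of this target format, the snippet below writes the two example rows with Python's standard `csv` module. The header names follow the columns the scraping code later in this notebook writes (`Repo_Name`, `Username`, `Stars`, `Repo_url`). The `rows` list and the in-memory buffer are illustrative assumptions; the notebook itself builds the files with pandas.

```python
import csv
import io

# Column names as written by the scraping code later in this notebook.
header = ['Repo_Name', 'Username', 'Stars', 'Repo_url']

# Illustrative rows taken from the example above.
rows = [
    ['three.js', 'mrdoob', 69700, 'https://github.com/mrdoob/three.js'],
    ['libgdx', 'libgdx', 18300, 'https://github.com/libgdx/libgdx'],
]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```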

# Import dependencies

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os

# Functions to get the dataframe with details for the top GitHub topics

In [2]:
def get_topic_titles(doc):
  topic_title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary' # found with the browser's Inspect Element tool
  topic_title_tags = doc.find_all('p', class_=topic_title_class)
  topic_titles = []
  for tag in topic_title_tags:
    topic_titles.append(tag.text)
  return topic_titles

def get_topic_descs(doc):
  topic_desc_class = 'f5 color-fg-muted mb-0 mt-1'
  topic_desc_tags = doc.find_all('p', class_=topic_desc_class)
  topic_descs = []
  for tag in topic_desc_tags: 
    topic_descs.append(tag.text.strip())
  return topic_descs

def get_topic_urls(doc):
  topic_url_class = 'no-underline flex-1 d-flex flex-column'
  topic_link_tags = doc.find_all('a', class_=topic_url_class)
  topic_urls = []
  base_url = 'https://github.com'
  for tag in topic_link_tags:
    topic_urls.append( base_url + tag['href'] )
  return topic_urls

def get_topics_df(topics_url):
  response = requests.get(topics_url)
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topics_url))
  
  doc = BeautifulSoup(response.text, 'html.parser')
  topic_titles = get_topic_titles(doc)
  topic_descs = get_topic_descs(doc)
  topic_urls = get_topic_urls(doc)

  topics_df = pd.DataFrame( {
    'Topic_Titles': topic_titles,
    'Topic_Descriptions': topic_descs,
    'Topic_urls': topic_urls
    } )

  return topics_df

In [3]:
topics_url = 'https://github.com/topics'
df = get_topics_df(topics_url)
df

,Topic_Titles,Topic_Descriptions,Topic_urls
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet
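
`get_topic_urls` builds absolute URLs by concatenating `base_url` with each `href`. A slightly more defensive sketch, assuming the `href` values may be either relative or already absolute, uses `urllib.parse.urljoin` from the standard library. The helper name `to_absolute_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def to_absolute_url(href, base_url=BASE_URL):
    """Resolve an href against base_url; absolute hrefs pass through unchanged."""
    return urljoin(base_url, href)

print(to_absolute_url('/topics/3d'))
print(to_absolute_url('https://github.com/topics/ajax'))
```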


# Functions to get the repo details for each topic

In [4]:
def get_repo_info(repo_tags, stars_tags):
  usernames = []
  repo_urls = []
  repo_names = []
  for tag in repo_tags:
    a_tags = tag.find_all('a')

    repo_name = a_tags[1].text.strip()
    repo_names.append(repo_name)

    base_url = 'https://github.com'
    repo_url = base_url + a_tags[1]['href']
    repo_urls.append(repo_url)

    username = a_tags[0].text.strip()
    usernames.append(username)
  
  stars_list = []
  for tag in stars_tags:
    stars = tag['aria-label']
    stars = ''.join(filter(str.isdigit, stars))
    stars = int(stars)
    stars_list.append(stars)

  return repo_names, usernames, stars_list, repo_urls

def get_topic_repos_df(topic_url):
  response = requests.get(topic_url)
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))
  
  topic_doc = BeautifulSoup(response.text, 'html.parser')
  
  h3_parent_tag_class = 'f3 color-fg-muted text-normal lh-condensed'
  repo_tags = topic_doc.find_all('h3', {'class': h3_parent_tag_class})

  stars_selection_class = 'Counter js-social-count'
  stars_tags = topic_doc.find_all('span', class_=stars_selection_class)

  repo_names, usernames, stars_list, repo_urls = get_repo_info(repo_tags, stars_tags)

  topic_repo_dict = {
  'Repo_Name': repo_names,
  'Username': usernames,
  'Stars': stars_list,
  'Repo_url': repo_urls
  }
  topic_df = pd.DataFrame(topic_repo_dict)
  return topic_df

def scrape_topic(topic_url, fname):
  folder_name = 'Scraped_Data'
  os.makedirs(folder_name, exist_ok=True) # Make a folder to store all the CSV files with the scraped data
  path = os.path.join(folder_name, fname + '.csv')

  # Skip topics that were already scraped in an earlier run
  if os.path.exists(path):
    print('The file ' + path + ' already exists. Skipping...')
    return
  
  df = get_topic_repos_df(topic_url)
  df.to_csv(path, index=False) # index=False keeps the row index out of the CSV
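
`get_repo_info` extracts the star count by keeping only the digits of the `aria-label` text, which works when the label holds the full number. If a page instead exposed abbreviated counts such as `69.7k`, a digit filter would yield `697`. The following is a sketch of a hypothetical helper, `parse_star_count`, that handles both forms. The input formats are assumptions and are not confirmed against GitHub's markup.

```python
from decimal import Decimal

def parse_star_count(stars_str):
    """Convert strings such as '69.7k', '1,234' or '512' to an int."""
    text = stars_str.strip().lower().replace(',', '')
    multiplier = 1
    if text.endswith('k'):
        multiplier = 1000
        text = text[:-1]
    elif text.endswith('m'):
        multiplier = 1000000
        text = text[:-1]
    # Decimal avoids binary floating-point rounding surprises
    return int(Decimal(text) * multiplier)

print(parse_star_count('69.7k'))
print(parse_star_count('1,234'))
```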

# Final function
This function calls the previous functions for each topic.

In [5]:
def scrape_topics_top_repos(topics_page_url):
  topics_df = get_topics_df(topics_page_url)

  for index, row in topics_df.iterrows():
    print('Scraping top repositories for "{}"'.format(row['Topic_Titles']))
    scrape_topic(row['Topic_urls'], row['Topic_Titles'])
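
Because `get_topic_repos_df` raises an exception when a page fails to load, a single failed request stops the whole loop. One possible variation, sketched here under the assumption that skipping a failed topic is acceptable, wraps each call in `try`/`except` and pauses between requests. The names `scrape_all`, `delay_seconds` and the stub `fake_scrape` are hypothetical.

```python
import time

def scrape_all(topics, scrape_fn, delay_seconds=1.0):
    """Call scrape_fn(url, title) for each (title, url) pair.

    Returns the titles that failed so they can be retried later.
    """
    failed = []
    for title, url in topics:
        try:
            scrape_fn(url, title)
        except Exception as error:
            print('Skipping "{}": {}'.format(title, error))
            failed.append(title)
        time.sleep(delay_seconds)  # pause between requests
    return failed

# Stub standing in for scrape_topic so the sketch runs offline.
def fake_scrape(url, title):
    if title == 'Ajax':
        raise Exception('Failed to load page {}'.format(url))

topics = [('3D', 'https://github.com/topics/3d'),
          ('Ajax', 'https://github.com/topics/ajax')]
failed = scrape_all(topics, fake_scrape, delay_seconds=0)
print(failed)
```

In the notebook, `scrape_fn` would be `scrape_topic`, and `topics` could be built with `zip(topics_df['Topic_Titles'], topics_df['Topic_urls'])`.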

In [6]:
topics_page_url = 'https://github.com/topics'
scrape_topics_top_repos(topics_page_url)

Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scraping top repositories for "

# Read any generated CSV file to verify that it has the correct format

In [7]:
df = pd.read_csv('./Scraped_Data/Angular.csv')
df

,Repo_Name,Username,Stars,Repo_url
0,free-programming-books-zh_CN,justjavac,105154,https://github.com/justjavac/free-programming-...
1,angular,angular,90398,https://github.com/angular/angular
2,storybook,storybookjs,80451,https://github.com/storybookjs/storybook
3,33-js-concepts,leonardomso,57891,https://github.com/leonardomso/33-js-concepts
4,ionic-framework,ionic-team,49494,https://github.com/ionic-team/ionic-framework
5,prettier,prettier,46680,https://github.com/prettier/prettier
6,30-Days-Of-JavaScript,Asabeneh,38556,https://github.com/Asabeneh/30-Days-Of-JavaScript
7,sheetjs,SheetJS,33605,https://github.com/SheetJS/sheetjs
8,angular-cli,angular,26253,https://github.com/angular/angular-cli
9,components,angular,23653,https://github.com/angular/components
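
Beyond inspecting the table by eye, the format can also be checked programmatically. This sketch uses only the standard `csv` module and a small in-memory sample in place of a real file; it confirms the header and that every `Stars` value consists of digits. The function name `check_repo_csv` is hypothetical.

```python
import csv
import io

EXPECTED_HEADER = ['Repo_Name', 'Username', 'Stars', 'Repo_url']

def check_repo_csv(file_obj):
    """Return True if the header matches and every Stars value is all digits."""
    reader = csv.DictReader(file_obj)
    if reader.fieldnames != EXPECTED_HEADER:
        return False
    # A short row yields None for missing fields, so fall back to ''
    return all((row['Stars'] or '').isdigit() for row in reader)

sample = io.StringIO(
    'Repo_Name,Username,Stars,Repo_url\n'
    'angular,angular,90398,https://github.com/angular/angular\n'
)
print(check_repo_csv(sample))
```

For a real file, the same function could be given `open('./Scraped_Data/Angular.csv', newline='')`.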
