# Scraping Top Repositories for Topics on GitHub

TODO:
- Introduction about web scraping
- Introduction about GitHub and the problem statement
- Tools used (Python, requests, Beautiful Soup, pandas)

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
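As a minimal sketch of this target format, the snippet below writes the two sample rows with Python's built-in `csv` module. The helper name `rows_to_csv` is made up for illustration; the notebook itself writes its files with pandas later on.

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize (repo_name, username, stars, repo_url) tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['Repo Name', 'Username', 'Stars', 'Repo URL'])
    writer.writerows(rows)
    return buffer.getvalue()

# The two sample rows shown in the format above
sample_rows = [
    ('three.js', 'mrdoob', 69700, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 18300, 'https://github.com/libgdx/libgdx'),
]
print(rows_to_csv(sample_rows))
```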

## Use the requests library to download web pages

In [50]:
pip install requests --upgrade --quiet

In [51]:
import requests

In [52]:
topics_url = 'https://github.com/topics'

In [53]:
response = requests.get(topics_url)

In [54]:
response.status_code

200

In [55]:
len(response.text)

188855

In [56]:
page_contents = response.text

In [57]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-dkuYFW+ra8yYSt342e5pJEeslPSjMcrMvNxlYZMyM/X+/WJHDPvoCuGq3LFojI7B0dQWwZNRiPMnbi9IfUgTaA==" rel="stylesheet" href="https://github.githubassets.com/assets/light-764b98156fab6bcc984addf8d9ee6924.css" /><link crossorigin="anonymous" media="all" integrity="sha512-UrAu23+eyncWvaQFwsLbgSKtmLb2aH1bcT4hJnnRdkaPuY1eu9bumt33FyHHFDX8hskTUNWNkIsMCz7F
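A 1000-character slice of raw HTML is hard to read in a notebook. One option, sketched below, is to save the downloaded text to a file and open it in a browser or editor. The helper `save_page` and the file name are assumptions for illustration, and a short stand-in string replaces `response.text` so the sketch runs without a network call.

```python
import tempfile
from pathlib import Path

def save_page(contents, path):
    """Write downloaded HTML to `path` so it can be opened in a browser or editor."""
    target = Path(path)
    target.write_text(contents, encoding='utf-8')
    return target

# Stand-in for response.text, so the sketch runs without a network call
sample_html = '<!DOCTYPE html>\n<html lang="en"><head><meta charset="utf-8"></head></html>'
saved = save_page(sample_html, Path(tempfile.gettempdir()) / 'topics_sample.html')
print(saved.name)
```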

## Use Beautiful Soup to parse and extract information

In [59]:
pip install beautifulsoup4 --upgrade --quiet


## Scrape the list of topics from GitHub

Here's how we'll do it:

- use requests to download the page
- use BS4 to parse and extract information
- convert to a Pandas dataframe

Let's write a function to download the page.
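A minimal sketch of such a function is shown below, following the same status-code check used elsewhere in this notebook. The name `download_page`, the `timeout` value, and the `get` parameter are assumptions for illustration; `get` exists only so the function can be exercised without network access.

```python
import requests

def download_page(url, timeout=10, get=requests.get):
    """Return the HTML of `url`, raising if the request does not succeed.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be tried out with a stand-in instead of a real request.
    """
    response = get(url, timeout=timeout)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    return response.text
```

In the notebook it could be called as `page_contents = download_page(topics_url)`.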

In [60]:
from bs4 import BeautifulSoup

In [61]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [62]:
doc


<!DOCTYPE html>

<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-764b98156fab6bcc984addf8d9ee6924.css" integrity="sha512-dkuYFW+ra8yYSt342e5pJEeslPSjMcrMvNxlYZMyM/X+/WJHDPvoCuGq3LFojI7B0dQWwZNRiPMnbi9IfUgTaA==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-52b02edb7f9eca7716bda405c2c2db81.css" integrity="sha512-UrAu23+eyncWvaQFwsLbgSKtmLb2aH1bcT4

In [63]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p',{ 'class' : selection_class})

In [64]:
len(topic_title_tags)

30

In [65]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
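Passing the full class string to `find_all` matches only tags whose `class` attribute is exactly that string, so the lookup breaks if GitHub reorders or adds a class. As a hedged alternative, Beautiful Soup's `select` accepts CSS selectors that require only the classes you name. The sketch below runs on a small inline sample copied from the page structure rather than on the live page; which classes are stable enough to rely on is an assumption.

```python
from bs4 import BeautifulSoup

# Inline sample mirroring one topic card from the topics page
sample_html = '''
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
  3D modeling is the process of virtually developing the surface and structure of a 3D object.
</p>
</a>
'''
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Each ".name" requires one class; order and extra classes do not matter
sample_title_tags = sample_doc.select('p.f3.lh-condensed')
print([tag.text for tag in sample_title_tags])
```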

In [66]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'

topic_desc_tags = doc.find_all('p', {'class' : desc_selector})

In [67]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [68]:
topic_title_tag0 = topic_title_tags[0]

In [69]:
link_tag = topic_title_tag0.parent

In [70]:
link_tag

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>

In [71]:
topic_link_tags = [x.parent for x in topic_title_tags]

In [72]:
len(topic_link_tags)

30

In [73]:
topic_link_tags[:5]

[<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>
 </a>, <a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>
 </a>, <a class="no-underline flex-1 d-flex flex-column" href="/topics/algorithm">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>
 </a>, <a class="no-underline flex-1 d-flex flex-column" href="/topics/amphp">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>
 <p class="f5 c

In [74]:
topic_link_tags[0]['href']

'/topics/3d'

In [87]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [78]:
topic_titles =[x.text for x in topic_title_tags]
topic_titles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [82]:
topic_descs = [x.text.strip() for x in topic_desc_tags]
topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [86]:
topic_urls = []
base_url = 'https://github.com'
for tag in topic_link_tags:
  topic_urls.append(base_url + tag['href'])

topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
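String concatenation works here because every `href` is a root-relative path. A slightly more defensive sketch uses `urljoin` from the standard library, which also copes with an `href` that is already absolute. The sample list below is made up for illustration.

```python
from urllib.parse import urljoin

site_root = 'https://github.com'
sample_hrefs = ['/topics/3d', '/topics/ajax', 'https://github.com/topics/algorithm']

# urljoin resolves relative paths against site_root and leaves absolute URLs unchanged
joined_urls = [urljoin(site_root, href) for href in sample_hrefs]
print(joined_urls)
```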

In [88]:
pip install pandas --quiet

In [89]:
import pandas as pd

In [90]:
topics_dict = {
    'title' : topic_titles,
    'description' : topic_descs,
    'url' : topic_urls
}

In [91]:
topics_df = pd.DataFrame(topics_dict)
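`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which would happen if, for example, one topic had no description tag. A small guard can report which column is short before the DataFrame is built. `check_equal_lengths` is a hypothetical helper, sketched here on sample data.

```python
def check_equal_lengths(columns):
    """Raise ValueError listing each column's length if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns

sample_columns = {
    'title': ['3D', 'Ajax'],
    'description': ['3D modeling is ...', 'Ajax is ...'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
check_equal_lengths(sample_columns)  # passes silently when all lengths match
```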

In [92]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


## Let's Create a CSV file out of it

In [94]:
topics_df.to_csv('topics.csv', index=False)

## Getting information out of a topic page

In [95]:
topic_page_url = topic_urls[0]

In [96]:
topic_page_url

'https://github.com/topics/3d'

In [97]:
response = requests.get(topic_page_url)

In [98]:
response.status_code

200

In [99]:
len(response.text)

674918

In [100]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [101]:
div_selection_class = 'd-flex flex-auto'
repo_tags = topic_doc.find_all('div' , {'class' : div_selection_class })

In [103]:
len(repo_tags)

30

In [105]:
a_tags = repo_tags[0].find_all('a')

In [107]:
a_tags[0]

<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>

In [109]:
a_tags[0].text.strip()

'mrdoob'

In [110]:
a_tags[1].text.strip()

'three.js'

In [113]:
base_url = 'https://github.com'

repo_url = base_url + a_tags[1]['href']

print(repo_url)

https://github.com/mrdoob/three.js


In [114]:
star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'})

In [115]:
len(star_tags)

30

In [117]:
star_tags[0].text.strip()

'79.3k'

In [118]:
def parse_star_count(stars_str):
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    return  int(float(stars_str[:-1]) * 1000)
  return int(stars_str)

In [119]:
parse_star_count(star_tags[0].text.strip())

79300
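`parse_star_count` handles the `k` suffix seen on this page. As a hedged extension, the sketch below also accepts an `m` suffix and thousands separators; whether GitHub ever renders counts that way is an assumption. It rounds instead of truncating, which guards against floating-point products that land just below a whole number. The name `parse_count` is hypothetical.

```python
def parse_count(count_str):
    """Convert strings such as '79.3k', '1.2m' or '1,234' to an int."""
    text = count_str.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids losing one when the float product is slightly low
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_count('79.3k'))
```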

In [122]:
def get_repo_info(div_tag, star_tag) :
  # returns all the required info about a repository
  a_tags = div_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tag.text.strip())
  return username, repo_name, stars, repo_url 

In [123]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 79300, 'https://github.com/mrdoob/three.js')
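`get_repo_info` returns a plain tuple, so callers have to remember the field order. One possible refinement, sketched below, is a named tuple: fields can be read by name, while positional access such as `repo_info[0]` keeps working. `RepoInfo` is a hypothetical name, not part of the notebook.

```python
from typing import NamedTuple

class RepoInfo(NamedTuple):
    username: str
    repo_name: str
    stars: int
    repo_url: str

info = RepoInfo('mrdoob', 'three.js', 79300, 'https://github.com/mrdoob/three.js')

# Access by name is self-documenting
print(info.repo_name, info.stars)
# It is still a tuple, so existing index-based code is unaffected
print(info[0])
```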

In [124]:
topics_repos_dict = {
    'username' :[],
    'repo_name' :[],
    'stars' : [],
    'repo_url' :[]
}

for i in range(len(repo_tags)):
  repo_info = get_repo_info(repo_tags[i], star_tags[i])
  topics_repos_dict['username'].append(repo_info[0])
  topics_repos_dict['repo_name'].append(repo_info[1])
  topics_repos_dict['stars'].append(repo_info[2])
  topics_repos_dict['repo_url'].append(repo_info[3])

In [127]:
topics_repo_df = pd.DataFrame(topics_repos_dict)

In [128]:
topics_repo_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 79300 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 19700 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 16900 | https://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 15900 | https://github.com/BabylonJS/Babylon.js |
| 4 | aframevr | aframe | 13800 | https://github.com/aframevr/aframe |
| 5 | lettier | 3d-game-shaders-for-beginners | 12200 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | ssloy | tinyrenderer | 12100 | https://github.com/ssloy/tinyrenderer |
| 7 | FreeCAD | FreeCAD | 10700 | https://github.com/FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 9000 | https://github.com/metafizzy/zdog |
| 9 | CesiumGS | cesium | 8300 | https://github.com/CesiumGS/cesium |


## Final Code:

In [158]:
import os
def get_topic_page(topic_url):
  # Download the page
  response = requests.get(topic_url)
  # Check for a successful response
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))
  # Parse using Beautiful Soup
  topic_doc = BeautifulSoup(response.text, 'html.parser')
  return topic_doc

def get_repo_info(div_tag, star_tag) :
  # returns all the required info about a repository
  a_tags = div_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tag.text.strip())
  return username, repo_name, stars, repo_url 

def get_topic_repos(topic_doc):
  
  # Get the div tag containing repo title, repo URL and username
  div_selection_class = 'd-flex flex-auto'
  repo_tags = topic_doc.find_all('div' , {'class' : div_selection_class })
  # Get star tags
  star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'})

  topics_repos_dict = {
    'username' :[],
    'repo_name' :[],
    'stars' : [],
    'repo_url' :[]
    }
  # Get repo info
  for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topics_repos_dict['username'].append(repo_info[0])
    topics_repos_dict['repo_name'].append(repo_info[1])
    topics_repos_dict['stars'].append(repo_info[2])
    topics_repos_dict['repo_url'].append(repo_info[3])

  return pd.DataFrame(topics_repos_dict)

def scrape_topic(topic_url, path):
  # Skip topics that were already scraped in a previous run
  if os.path.exists(path):
    print('The file {} already exists. Skipping...'.format(path))
    return
  topic_df = get_topic_repos(get_topic_page(topic_url))
  topic_df.to_csv(path, index=False)


In [137]:
get_topic_repos(get_topic_page(topic_urls[6])).to_csv('ansible.csv', index=False)


Write a single function to:
1.   Get the list of topics from the topics page
2.   Get the list of top repos from the individual topic pages
3.   For each topic, create a CSV of the top repos for the topic



In [151]:
def get_topic_titles(doc):
  selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p',{ 'class' : selection_class})

  topic_titles =[x.text for x in topic_title_tags]
  return topic_titles

def get_topic_descs(doc):
  desc_selector = 'f5 color-fg-muted mb-0 mt-1'
  topic_desc_tags = doc.find_all('p', {'class' : desc_selector})

  topic_descs = [x.text.strip() for x in topic_desc_tags]
  return topic_descs

def get_topic_urls(doc):
  selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p',{ 'class' : selection_class})

  topic_link_tags = [x.parent for x in topic_title_tags]
  topic_urls = []
  base_url = 'https://github.com'
  for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
  return topic_urls

def scrape_topics():
  topics_url = 'https://github.com/topics'
  response = requests.get(topics_url)
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topics_url))
  # Parse the page downloaded here instead of relying on a global `doc`
  doc = BeautifulSoup(response.text, 'html.parser')

  topics_dict = {
      'title' : get_topic_titles(doc) ,
      'description' : get_topic_descs(doc),
      'url' : get_topic_urls(doc)
  }
  return pd.DataFrame(topics_dict)

In [152]:
scrape_topics()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [145]:
for index, row in topics_df.iterrows():
  print(row['title'], row['url'])

3D https://github.com/topics/3d
Ajax https://github.com/topics/ajax
Algorithm https://github.com/topics/algorithm
Amp https://github.com/topics/amphp
Android https://github.com/topics/android
Angular https://github.com/topics/angular
Ansible https://github.com/topics/ansible
API https://github.com/topics/api
Arduino https://github.com/topics/arduino
ASP.NET https://github.com/topics/aspnet
Atom https://github.com/topics/atom
Awesome Lists https://github.com/topics/awesome
Amazon Web Services https://github.com/topics/aws
Azure https://github.com/topics/azure
Babel https://github.com/topics/babel
Bash https://github.com/topics/bash
Bitcoin https://github.com/topics/bitcoin
Bootstrap https://github.com/topics/bootstrap
Bot https://github.com/topics/bot
C https://github.com/topics/c
Chrome https://github.com/topics/chrome
Chrome extension https://github.com/topics/chrome-extension
Command line interface https://github.com/topics/cli
Clojure https://github.com/topics/clojure
Code quality h

In [161]:
def scrape_topics_repos():
  print('Scraping list of topics')
  topics_df = scrape_topics()

  os.makedirs('data', exist_ok = True)

  for index, row in topics_df.iterrows():
    print('Scraping top repositories for {}'.format(row['title']))
    scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

In [162]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Atom
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command line interface
Scraping top repositories for Clojure
Scraping top repositories for Code quality
Scraping top
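
The CSV paths in this run are built straight from topic titles. None of the titles listed here causes trouble, but a title containing a path separator would produce an unexpected path. As a precaution, a file name could be derived from a simplified form of the title. `topic_filename` and its choice of allowed characters are assumptions for illustration.

```python
import re

def topic_filename(title):
    """Build a file name such as 'chrome-extension.csv' from a topic title."""
    # Replace every run of characters outside a-z, 0-9, '+' and '#' with one hyphen
    slug = re.sub(r'[^a-z0-9+#]+', '-', title.lower()).strip('-')
    return '{}.csv'.format(slug or 'topic')

for sample_title in ['Chrome extension', 'C++', 'ASP.NET', 'COVID-19']:
    print(sample_title, '->', topic_filename(sample_title))
```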