# **Top Repositories for GitHub Topics**


# Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.

## Project Outline

- We are going to scrape https://github.com/topics.
- From that page we will get a list of topics. For each topic, we will get the topic title, page URL, description, and the top 25 repos.
- For each repo, we will get the repo name, stars, and URL.
- For each topic, we will create a CSV file.
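
As a rough sketch of the planned output, each per-topic CSV file could be laid out as shown below. The column names match the ones used later in this notebook; the helper name `write_repos_csv` is hypothetical, and the single sample row is written to an in-memory buffer rather than to disk.

```python
import csv
import io

# Columns planned for each per-topic CSV file.
FIELDNAMES = ['username', 'repo_name', 'stars', 'repo_url']

def write_repos_csv(rows, file_obj):
    """Write a list of repo dicts to an open text file in CSV format."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

# Example with one row, written to an in-memory buffer instead of a file.
sample_rows = [{
    'username': 'mrdoob',
    'repo_name': 'three.js',
    'stars': 92900,
    'repo_url': 'https://github.com/mrdoob/three.js',
}]
buffer = io.StringIO()
write_repos_csv(sample_rows, buffer)
print(buffer.getvalue())
```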






# Use the requests library to download web pages

In [1]:
# Installation of Requests
# !pip install requests --upgrade --quiet

In [2]:
import requests
import pandas as pd

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
# Check whether the request was successful
response.status_code

200
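
A possible alternative to inspecting `status_code` by hand is `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses. The sketch below wraps the download in a helper; the name `download_page` and the 10-second timeout are assumptions, not part of the original notebook.

```python
import requests

def download_page(url, timeout=10):
    """Return the HTML text of `url`, raising an error if the request fails."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the status code is 4xx or 5xx.
    response.raise_for_status()
    return response.text

# Usage (performs a real network request):
# html = download_page('https://github.com/topics')
# print(len(html))
```

Passing a timeout keeps the call from waiting indefinitely if the server does not answer.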

In [6]:
len(response.text)  # it's not advisable to display the full text

157026

In [7]:
content_text = response.text

In [8]:
content_text[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-8cafbcbd78f4.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-31dc14e38457.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="

# Use Beautiful Soup to parse and extract information

In [9]:
!pip install beautifulsoup4 --upgrade --quiet

In [10]:
from bs4 import BeautifulSoup

In [11]:
doc = BeautifulSoup(content_text, 'html.parser')

In [12]:
# Get the tags that hold the topic titles
topic_title_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [13]:
# Check how many titles use this class
len(topic_title_tags)

30

In [14]:
topic_title_tags[:10]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>]

In [15]:
topic_titles= []
for tag in topic_title_tags:
  topic_titles.append(tag.text)

In [16]:
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']
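
The loop that builds `topic_titles` can also be written as a list comprehension. The sketch below runs on a small hand-written HTML fragment (an assumption standing in for the downloaded page) so that it works offline.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the structure of the topics page.
sample_html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
'''
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Same class string as in the notebook; one list entry per matching <p> tag.
title_tags = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
sample_titles = [tag.text.strip() for tag in title_tags]
print(sample_titles)
```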


In [17]:
# Get the tags that hold the topic descriptions

topic_desc_tags = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})


In [18]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [19]:
topic_desc= []
for tag in topic_desc_tags:
  topic_desc.append(tag.text.strip())

In [20]:
topic_desc[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

**To find the link to each topic page**

In [21]:
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})

In [22]:
len(topic_link_tags)

30

In [23]:
topic_link_tags[0]

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>

The link tag also wraps the title and description tags, which we don't need here, so below we read only its `href` attribute.

In [24]:
topic_urls =[]
base_url = 'https://github.com'

for tag in topic_link_tags:
  topic_urls.append(base_url + tag['href'])

In [25]:
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
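
Because each link tag wraps both the title and the description (as the output of `topic_link_tags[0]` shows), all three fields can be read from one tag at a time. This keeps the values aligned even if a topic happens to have no description. A minimal sketch under that assumption, run on a hand-written fragment; the helper name `parse_topic_links` is hypothetical:

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://github.com'

def parse_topic_links(doc):
    """Return one dict per topic, reading title, description and URL from each link tag."""
    topics = []
    for link in doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'}):
        p_tags = link.find_all('p')
        # Fall back to an empty string when a <p> tag is missing.
        title = p_tags[0].text.strip() if len(p_tags) > 0 else ''
        description = p_tags[1].text.strip() if len(p_tags) > 1 else ''
        topics.append({
            'title': title,
            'description': description,
            'url': BASE_URL + link['href'],
        })
    return topics

# Hand-written fragment copied from the structure of one topic link.
sample_html = '''
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
  3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
</p>
</a>
'''
sample_topics = parse_topic_links(BeautifulSoup(sample_html, 'html.parser'))
print(sample_topics)
```

A list of dicts like this can be passed directly to `pd.DataFrame`.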

In [26]:
import pandas as pd

In [27]:
topics_dict = {
    'title': topic_titles,
    'description': topic_desc,
    'urls': topic_urls
}

In [28]:

topics_df = pd.DataFrame(topics_dict)

In [29]:
topics_df

| | title | description | urls |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


# Create CSV file(s) with the extracted information

In [30]:
#from google.colab import files
#topics_df.to_csv('topics.csv', index=None)
#files.download('topics.csv')
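
The commented-out cell relies on `google.colab`, which is only available inside Colab. Outside Colab, a sketch like the following writes a DataFrame to disk and reads it back. The one-row DataFrame and the temporary directory are stand-ins for `topics_df` and the real output path.

```python
import os
import tempfile

import pandas as pd

# Stand-in for topics_df with a single row.
sample_df = pd.DataFrame({
    'title': ['3D'],
    'description': ['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.'],
    'urls': ['https://github.com/topics/3d'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'topics.csv')
    # index=False leaves the row index out of the file.
    sample_df.to_csv(csv_path, index=False)
    loaded_df = pd.read_csv(csv_path)

print(loaded_df.columns.tolist())
```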

# **Getting information out of a topic page**

In [31]:
topic_page_url = topic_urls[0]

In [32]:
topic_page_url

'https://github.com/topics/3d'

In [33]:

response1 = requests.get(topic_page_url)
# Check successful response
if response1.status_code != 200:
  raise Exception('Failed to load page {}'.format(topic_page_url))


In [34]:
len(response1.text)

466726

Let's extract the repo details for just the 3D topic.

In [35]:
topic_doc = BeautifulSoup(response1.text, 'html.parser')

In [36]:
repo_tags = topic_doc.find_all('h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'})

In [37]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [38]:
len(repo_tags)

20

In [39]:
a_tags = repo_tags[0].find_all('a')

In [40]:
a_tags

[<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [41]:
a_tags[0].text.strip()

'mrdoob'

In [42]:
a_tags[1].text.strip()

'three.js'

In [43]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [44]:
star_tags = topic_doc.find_all('span', {'id': 'repo-stars-counter-star'})

In [45]:
len(star_tags)

20

In [46]:
star_tags[0].text.strip()

'92.9k'

In [47]:
def parse_star_count(stars_str):
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    return int(float(stars_str[:-1]) * 1000)
  return int(stars_str)

In [48]:
parse_star_count(star_tags[0].text)

92900
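
`parse_star_count` assumes the count is either a plain integer or ends in `k`. A slightly more defensive sketch is shown below. Handling of an `m` suffix and of thousands separators is an assumption about formats the page might display, not something observed in this notebook, and the name `parse_star_count_safe` is hypothetical. It uses `round` rather than `int` so that floating-point error cannot truncate the result.

```python
def parse_star_count_safe(stars_str):
    """Convert strings such as '92.9k', '1,234' or '512' to an integer."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    cleaned = stars_str.strip().lower().replace(',', '')
    if not cleaned:
        raise ValueError('empty star count')
    suffix = cleaned[-1]
    if suffix in multipliers:
        # Drop the suffix, scale the number, and round to the nearest integer.
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)

print(parse_star_count_safe(' 92.9k '))
print(parse_star_count_safe('1,234'))
```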

A general function that wraps these steps and can be applied to each repo:

In [49]:
def get_repo_info(h3_tag, star_tag):
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tag.text)
  return username, repo_name, stars, repo_url


In [50]:
get_repo_info(repo_tags[10], star_tags[10])

('isl-org', 'Open3D', 9100, 'https://github.com/isl-org/Open3D')

In [51]:
topics_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
  repo_info = get_repo_info(repo_tags[i],star_tags[i] )
  topics_repos_dict['username'].append(repo_info[0])
  topics_repos_dict['repo_name'].append(repo_info[1])
  topics_repos_dict['stars'].append(repo_info[2])
  topics_repos_dict['repo_url'].append(repo_info[3])
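
The loop indexes `star_tags` by position, so it assumes both lists have the same length. A sketch that makes this assumption explicit and pairs the tags with `zip` follows. The function name `build_repo_rows` is hypothetical, and plain Python values stand in for the tags so the example runs without a downloaded page.

```python
def build_repo_rows(repo_tags, star_tags, get_info):
    """Pair each repo tag with its star tag and collect one dict per repo."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            'Expected one star tag per repo tag, got {} and {}'.format(
                len(repo_tags), len(star_tags)))
    columns = ['username', 'repo_name', 'stars', 'repo_url']
    rows = []
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        rows.append(dict(zip(columns, get_info(repo_tag, star_tag))))
    return rows

# Stand-in extractor: here the "tags" are already plain Python values.
def fake_get_info(repo, stars):
    username, repo_name = repo
    repo_url = 'https://github.com/{}/{}'.format(username, repo_name)
    return username, repo_name, stars, repo_url

rows = build_repo_rows([('mrdoob', 'three.js')], [92900], fake_get_info)
print(rows)
```

In the notebook, `get_repo_info` would take the place of `fake_get_info`, and the resulting list of dicts can be passed to `pd.DataFrame`.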

In [52]:
topics_repos_df = pd.DataFrame(topics_repos_dict)

In [53]:
topics_repos_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 92900 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 23100 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 21600 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 20900 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 17300 | https://github.com/ssloy/tinyrenderer |
| 5 | lettier | 3d-game-shaders-for-beginners | 15700 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | aframevr | aframe | 15500 | https://github.com/aframevr/aframe |
| 7 | FreeCAD | FreeCAD | 14400 | https://github.com/FreeCAD/FreeCAD |
| 8 | CesiumGS | cesium | 10600 | https://github.com/CesiumGS/cesium |
| 9 | metafizzy | zdog | 9900 | https://github.com/metafizzy/zdog |


In [54]:
import os


# FINAL CODE

In [55]:
import os

import requests
import pandas as pd
from bs4 import BeautifulSoup

In [56]:
def get_topics_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Check successful response (the original referenced an undefined name here)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc
        

In [57]:
def get_topic_titles(doc):
    topic_title_tags = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

get_topic_titles can be used to get the list of titles


In [58]:
titles = get_topic_titles(doc)

In [59]:
len(titles)

30

In [60]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [61]:
def get_topics_descs(doc):
    topic_desc_tags = doc.find_all('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
    topic_desc = []
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip())
    return topic_desc

In [62]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

Let's put this all together into a single function

In [63]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Check successful response (the original referenced an undefined name here)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topics_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [64]:
scrape_topics()[:3]

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |


# Get the top 25 repositories from a topic page 

Here we define another function that takes `topic_url` as a parameter, downloads that topic's page, and returns it as a parsed document.

In [65]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


In [None]:
get_topic_page('https://github.com/topics/3d')

In [67]:

def parse_star_count(stars):
    stars = stars.strip()
    if stars[-1] == 'k':
        return int(float(stars[:-1]) * 1000)
    return int(stars)


In [68]:
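def get_repo_info(h3_tag, star_tag):
    # Defined locally so the final code does not depend on earlier cells
    base_url = 'https://github.com'
    # The first <a> tag holds the owner, the second holds the repo
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text)
    return username, repo_name, stars, repo_url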

In [69]:
def get_topic_repos(topic_doc):
    repo_tags = topic_doc.find_all('h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'})
    star_tags = topic_doc.find_all('span', {'id': 'repo-stars-counter-star'})

    topics_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topics_repos_dict['username'].append(repo_info[0])
        topics_repos_dict['repo_name'].append(repo_info[1])
        topics_repos_dict['stars'].append(repo_info[2])
        topics_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topics_repos_dict)

In [70]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

In [71]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
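
Topic titles are used directly as file names in `scrape_topics_repos`. The titles seen in this notebook (such as `ASP.NET` or `Command line interface`) work as file names, but a title containing a path separator would not. One way to guard against that is sketched below. The helper `safe_filename`, its replacement rule, and the sample title `CI/CD` are assumptions for illustration.

```python
import re

def safe_filename(title, extension='.csv'):
    """Turn a topic title into a file name without separators or spaces."""
    # Replace each run of characters other than letters, digits, '.', '+', '_' or '-'.
    cleaned = re.sub(r'[^A-Za-z0-9.+_-]+', '_', title.strip())
    # Avoid names that are empty or consist only of dots.
    cleaned = cleaned.strip('.') or 'untitled'
    return cleaned + extension

for title in ['3D', 'ASP.NET', 'C++', 'Command line interface', 'CI/CD']:
    print(safe_filename(title))
```

The path passed to `scrape_topic` could then be built as `'data/' + safe_filename(row['title'])`.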


In [72]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin