
# Scraping Top Repositories for Topics on GitHub



Here are the steps we'll follow:
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top repositories listed on the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file in the following format:

      Repo Name,Username,Stars,Repo URL
      three.js,mrdoob,69700,https://github.com/mrdoob/three.js
      libgdx,libgdx,18300,https://github.com/libgdx/libgdx
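
As a rough illustration of that target format, the sketch below writes the two sample rows with Python's standard `csv` module. The `rows_to_csv` helper and the `HEADER` constant are hypothetical names used only here; the notebook itself writes its files with pandas.

```python
import csv
import io

# Column order taken from the format described above
HEADER = ['Repo Name', 'Username', 'Stars', 'Repo URL']

def rows_to_csv(rows):
    """Return CSV text: the header line followed by one line per repository."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

sample_rows = [
    ('three.js', 'mrdoob', 69700, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 18300, 'https://github.com/libgdx/libgdx'),
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Storing the star count as an integer keeps the column free of thousands separators, so no field needs quoting.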


## Scrape the list of topics from GitHub

Here's how we'll do it:
- use requests to download the page
- use bs4 to parse and extract info
- convert to a pandas DataFrame
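
The parsing step can be tried offline first. The sketch below runs BeautifulSoup on a small, hand-written HTML sample instead of the live page. The sample markup is hypothetical: it borrows the class names used in this notebook and assumes that each topic's title and description sit inside the same link element, which may not match GitHub's real markup. `parse_topic_cards` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the nesting of <p> inside <a> is an assumption.
SAMPLE_HTML = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D modeling is the process of virtually developing the surface and structure of a 3D object.
  </p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    Ajax is a technique for creating interactive web applications.
  </p>
</a>
"""

def parse_topic_cards(html):
    """Return one dict per topic link found in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    topics = []
    # class_ matches tags that carry this class, possibly among others
    for link in soup.find_all('a', class_='no-underline'):
        title_tag = link.find('p', class_='f3')
        desc_tag = link.find('p', class_='f5')
        topics.append({
            'Title': title_tag.get_text(strip=True) if title_tag else None,
            'Descriptions': desc_tag.get_text(strip=True) if desc_tag else None,
            'URL': 'https://www.github.com' + link['href'],
        })
    return topics

topics = parse_topic_cards(SAMPLE_HTML)
for topic in topics:
    print(topic['Title'], topic['URL'])
```

A list of dicts like `topics` can be passed straight to `pd.DataFrame`. Reading the title, description and link from the same element also keeps the three values of a topic together if one of them happens to be missing.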

Let's start by downloading the page.

In [1]:
# Import Important Libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://github.com/topics'
response = requests.get(url)
response.status_code

200

In [2]:
len(response.text)

144806

In [3]:
page_contents = response.text
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-UXiu4O52iBFkqt6Kx5t+pqHYP2/LWWIw9+l5ia74TWw+xPzpH44BFfAQp7yzCe0XFGZa72Xiqyml6tox1KkUjw==" rel="stylesheet" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" /><link crossorigin="anonymous" media="all" integrity="sha512-IX1PnI5wWBz8Kgb1JI0f2QFa/WuRQQHJHe0vkKinQzsxRlNb4b8NgODX5htSZVAAk

In [4]:
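# Write with an explicit encoding so non-ASCII characters in the page don't fail on some platforms
with open('website.html', 'w', encoding='utf-8') as f:
  f.write(page_contents)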

In [5]:
soup = BeautifulSoup(response.text, 'html.parser')
# print(soup.prettify())

title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title = soup.find_all('p', {'class':title_class})
# len(topic_title)
topic_title[:5]


[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [6]:
desc_selector = "f5 color-fg-muted mb-0 mt-1"
topic_desc = soup.find_all('p', {'class': desc_selector})

topic_desc[:5]


[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [7]:
link_selector = 'no-underline flex-1 d-flex flex-column'
topic_link = soup.find_all('a', link_selector)
len(topic_link)
topic_link[0]['href']

'/topics/3d'

In [8]:
base_link = 'https://www.github.com'
topic0_url = base_link + topic_link[0]['href']
print(topic0_url)

https://www.github.com/topics/3d
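
String concatenation works here because the `href` values seen above start with `/`. As a hedged alternative, the standard library's `urljoin` resolves relative links and leaves absolute ones untouched. `BASE_LINK` and `to_absolute` are illustrative names, not part of the notebook.

```python
from urllib.parse import urljoin

# Same base as the notebook's base_link
BASE_LINK = 'https://www.github.com'

def to_absolute(href, base=BASE_LINK):
    """Resolve a link found in the page against the site root."""
    return urljoin(base, href)

relative_result = to_absolute('/topics/3d')
absolute_result = to_absolute('https://www.github.com/topics/ajax')
print(relative_result)
print(absolute_result)
```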


In [9]:
# List of Titles
topic_titles = []
for tag in topic_title:
  topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [10]:
# List of Descriptions
topic_descriptions = []

for desc in topic_desc:
  topic_descriptions.append(desc.text.strip())

print(topic_descriptions[:3])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.']


In [11]:
# List of all topic URLs
topic_urls = []

for url in topic_link:
  topic_urls.append(base_link + url['href'])
print(topic_urls)

['https://www.github.com/topics/3d', 'https://www.github.com/topics/ajax', 'https://www.github.com/topics/algorithm', 'https://www.github.com/topics/amphp', 'https://www.github.com/topics/android', 'https://www.github.com/topics/angular', 'https://www.github.com/topics/ansible', 'https://www.github.com/topics/api', 'https://www.github.com/topics/arduino', 'https://www.github.com/topics/aspnet', 'https://www.github.com/topics/atom', 'https://www.github.com/topics/awesome', 'https://www.github.com/topics/aws', 'https://www.github.com/topics/azure', 'https://www.github.com/topics/babel', 'https://www.github.com/topics/bash', 'https://www.github.com/topics/bitcoin', 'https://www.github.com/topics/bootstrap', 'https://www.github.com/topics/bot', 'https://www.github.com/topics/c', 'https://www.github.com/topics/chrome', 'https://www.github.com/topics/chrome-extension', 'https://www.github.com/topics/cli', 'https://www.github.com/topics/clojure', 'https://www.github.com/topics/code-quality', 

In [12]:
topics_dict = { 
    
    'Title' : topic_titles,
    'Descriptions': topic_descriptions,
    'URL': topic_urls
}
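
`pd.DataFrame` raises a `ValueError` when the lists in a dict like this differ in length, which can happen if a topic on the page lacks a description. A minimal sketch of a pre-check that reports which column is short; `check_equal_lengths` is a hypothetical helper and the sample data is a two-row excerpt of the lists above.

```python
def check_equal_lengths(columns):
    """Return the length of each column; raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

sample = {
    'Title': ['3D', 'Ajax'],
    'Descriptions': [
        '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'URL': ['https://www.github.com/topics/3d', 'https://www.github.com/topics/ajax'],
}
print(check_equal_lengths(sample))
```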

In [13]:
topics_df = pd.DataFrame(topics_dict)

In [14]:
topics_df

,Title,Descriptions,URL
0,3D,3D modeling is the process of virtually develo...,https://www.github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://www.github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://www.github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://www.github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://www.github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://www.github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://www.github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://www.github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://www.github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://www.github.com/topics/aspnet


In [15]:
topics_df.to_csv('topics.csv', index=False)

## Getting information out of a topic page

In [16]:
topic_page_url = topic_urls[0]

topic_page_url

'https://www.github.com/topics/3d'

In [17]:
resp = requests.get(topic_page_url)

In [18]:
resp.status_code


200

In [19]:
len(resp.text)

435528

In [20]:
soup1 = BeautifulSoup(resp.text, 'html.parser')
# print(soup1)
selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = soup1.find_all('h3',{'class':selection_class})


In [21]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [22]:
len(repo_tags)

20

In [23]:
a_tags = repo_tags[2].find_all('a')


In [24]:
a_tags[0].text.strip()

'pmndrs'

In [25]:
a_tags[1].text.strip()

'react-three-fiber'

In [26]:
base_url = 'https://www.github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://www.github.com/pmndrs/react-three-fiber


In [27]:
# class_selector = 'tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default'
star_tags = soup1.find_all('span', {'class':'Counter js-social-count'})


In [28]:
len(star_tags)

20

In [29]:
star_tags[0]

<span aria-label="84557 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="84,557">84.6k</span>

In [30]:
star_tags[0].text

'84.6k'

In [31]:
def parse_star_count(stars_str):
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    return int(float(stars_str[:-1]) * 1000)
  return int(stars_str)
    
  

In [32]:
parse_star_count(star_tags[0].text)

84600

In [33]:
# <----- Practice for making functions --------->
stars_str = '84.2k'
# stars_str.strip()
# stars_str[-1]
int(float(stars_str[:-1]) * 1000)

84200

In [34]:
star_tags[1]['title']

'20,358'
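
The `title` attribute holds the exact count with a thousands separator, while the visible text is abbreviated. The sketch below is one way to turn either form into an integer. `stars_to_int` is a hypothetical helper, and support for an `m` suffix is an assumption, since only the `k` form appears in this notebook.

```python
def stars_to_int(text):
    """Convert strings such as '84.6k', '20,358' or '512' to an integer."""
    cleaned = text.strip().lower().replace(',', '')
    # 'm' is assumed by analogy with 'k'; it is not seen in the output above
    multipliers = {'k': 1_000, 'm': 1_000_000}
    suffix = cleaned[-1] if cleaned else ''
    if suffix in multipliers:
        # round() guards against floating-point results just below a whole number
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)

print(stars_to_int('84.6k'))
print(stars_to_int('20,358'))
```

When the exact count is available in `title`, it is the better source, because the abbreviated text loses precision (84,557 becomes 84.6k).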

In [35]:
# Making functions

def get_repo_info(h1,star):
  a_tags = h1.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repository_url = base_url + a_tags[1]['href']
  stars = star['title']
  return username, repo_name, stars , repository_url

In [36]:
get_repo_info(repo_tags[1],star_tags[1])

('libgdx', 'libgdx', '20,358', 'https://www.github.com/libgdx/libgdx')

In [37]:
topic_repos_dict = {
    
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
  repo_info = get_repo_info(repo_tags[i],star_tags[i])
  topic_repos_dict['username'].append(repo_info[0])
  topic_repos_dict['repo_name'].append(repo_info[1])
  topic_repos_dict['stars'].append(repo_info[2])
  topic_repos_dict['repo_url'].append(repo_info[3])


In [38]:
repo_info

('YadiraF', 'PRNet', '4,672', 'https://www.github.com/YadiraF/PRNet')

In [39]:
topic_repos_df = pd.DataFrame(topic_repos_dict)
topic_repos_df

,username,repo_name,stars,repo_url
0,mrdoob,three.js,84557,https://www.github.com/mrdoob/three.js
1,libgdx,libgdx,20358,https://www.github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,19185,https://www.github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,18162,https://www.github.com/BabylonJS/Babylon.js
4,aframevr,aframe,14458,https://www.github.com/aframevr/aframe
5,ssloy,tinyrenderer,14457,https://www.github.com/ssloy/tinyrenderer
6,lettier,3d-game-shaders-for-beginners,13530,https://www.github.com/lettier/3d-game-shaders...
7,FreeCAD,FreeCAD,11971,https://www.github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,9298,https://www.github.com/metafizzy/zdog
9,CesiumGS,cesium,9098,https://www.github.com/CesiumGS/cesium


In [40]:
# to_csv returns None when given a path, so there is nothing to assign
topic_repos_df.to_csv('topic_repos.csv', index=False)

# Final Code

In [46]:
# Making Functions
import os

def get_topic_page(topics_url):
    # Download the page
    response = requests.get(topics_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse using Beautiful soup
    s1 = BeautifulSoup(response.text, 'html.parser')
    return s1

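def get_repo_info(h3_tag, star_tag):
  # The h3 tag holds two links: the first to the owner, the second to the repository
  base_url = 'https://www.github.com'
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repository_url = base_url + a_tags[1]['href']
  # The title attribute holds the exact count, e.g. '20,358'
  stars = int(star_tag['title'].replace(',', ''))
  return username, repo_name, stars, repository_url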

def get_topic_repos(s1):
  # Get h3 tags containing repo_title, repo URL and username
  selection_class = 'f3 color-fg-muted text-normal lh-condensed'
  repo_tags = s1.find_all('h3',{'class':selection_class})

  # Get Star tag
  star_tags = s1.find_all('span', {'class':'Counter js-social-count'})

  topic_repos_dict = {
      
      'username': [],
      'repo_name': [],
      'stars': [],
      'repo_url': []
  }

  # Pair each repo heading with its star counter; zip stops at the shorter list
  for repo_tag, star_tag in zip(repo_tags, star_tags):
    username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
    topic_repos_dict['username'].append(username)
    topic_repos_dict['repo_name'].append(repo_name)
    topic_repos_dict['stars'].append(stars)
    topic_repos_dict['repo_url'].append(repo_url)
  
  return pd.DataFrame(topic_repos_dict)

def scrape_topic(topics_url, path):
  if os.path.exists(path):
    print('The file {} already exists. Skipping....'.format(path))
    return
  topics_dataframe = get_topic_repos(get_topic_page(topics_url))
  topics_dataframe.to_csv(path, index=False)


Write a single function to:
1. Get the list of topics from the topics page.
2. Get the list of top repos from the individual topic pages.
3. For each topic, create a CSV of the top repos for the topic.


In [47]:
def get_topic_titles(soup):
  title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title = soup.find_all('p', {'class':title_class})
  
  topic_titles = []
  for tag in topic_title:
    topic_titles.append(tag.text)
  return topic_titles

def get_topic_descriptions(soup):
  desc_selector = "f5 color-fg-muted mb-0 mt-1"
  topic_desc = soup.find_all('p', {'class': desc_selector})
  
  topic_descriptions = []
  for desc in topic_desc:
    topic_descriptions.append(desc.text.strip())
  return topic_descriptions


def get_topic_urls(soup):
  link_selector = 'no-underline flex-1 d-flex flex-column'
  topic_link = soup.find_all('a', link_selector)
  base_link = 'https://www.github.com'
  
  topic_urls = []
  for url in topic_link:
    topic_urls.append(base_link + url['href'])
  return topic_urls

def scrape_topics():
  url = 'https://github.com/topics'
  response = requests.get(url)
  if response.status_code != 200:
    raise Exception("Failed to load page {}".format(url))

  soup = BeautifulSoup(response.text, 'html.parser')

  topics_dict = { 
    
    'Title' : get_topic_titles(soup),
    'Descriptions': get_topic_descriptions(soup),
    'URL': get_topic_urls(soup)
  }
  return pd.DataFrame(topics_dict)


In [49]:
def scrape_topics_repos():
  print('Scraping list of topics')
  topics_df = scrape_topics()

  os.makedirs('data', exist_ok = True)

  for index, row in topics_df.iterrows():
    print('Scraping top repositories for "{}"'.format(row['Title']))
    scrape_topic(row['URL'], 'data/{}.csv'.format(row['Title']))
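
File names built from titles inherit whatever characters the title contains, such as the spaces in "Chrome extension", and a title containing a path separator would break the path. One possible alternative, sketched under the assumption that every topic URL ends in a unique slug like the ones listed earlier, is to name each file after the last URL segment. `csv_path_for_topic` is a hypothetical helper.

```python
from urllib.parse import urlparse

def csv_path_for_topic(topic_url, folder='data'):
    """Build '<folder>/<slug>.csv' from the last path segment of a topic URL."""
    slug = urlparse(topic_url).path.rstrip('/').split('/')[-1]
    return '{}/{}.csv'.format(folder, slug)

print(csv_path_for_topic('https://www.github.com/topics/chrome-extension'))
print(csv_path_for_topic('https://www.github.com/topics/cli'))
```

Inside the loop this would replace `'data/{}.csv'.format(row['Title'])` with `csv_path_for_topic(row['URL'])`.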

In [50]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin