Here are the steps we'll follow:

We're going to scrape https://github.com/topics
We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
For each topic, we'll get the top 25 repositories in the topic from the topic page
For each repository, we'll grab the repo name, username, stars, and repo URL
For each topic, we'll create a CSV file in the following format:
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
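
To make the target format concrete, here is a minimal sketch that writes rows in that layout using only the standard library. The helper name repos_to_csv and the in-memory buffer are assumptions made for illustration; the notebook itself saves its files with pandas.

```python
import csv
import io


def repos_to_csv(rows):
    """Return CSV text with the header used in the format above.

    `rows` is an iterable of (repo_name, username, stars, repo_url) tuples.
    This helper is hypothetical and not part of the notebook.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()


# The two sample rows from the format description above.
sample_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
print(repos_to_csv(sample_rows))
```

Using csv.writer rather than joining strings by hand means a value that contains a comma is quoted automatically.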

In [112]:
!pip install requests



In [113]:
import requests

In [114]:
topics_url= "https://github.com/topics"

In [115]:
response= requests.get(topics_url)

In [116]:
response.status_code

200

In [117]:
len(response.text)

155526

In [118]:
start_content= response.text
start_content[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-0946cdc16f15.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-3946c959759a.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="h

In [119]:
# Drop non-ASCII characters so that writing the file below cannot raise a Unicode error
start_content = start_content.encode('ascii', 'ignore').decode()

In [120]:
with open("web_page.html", "w") as f:
    f.write(start_content)
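
Stripping non-ASCII characters works, but it also discards any accented letters or symbols on the page. A possible alternative, sketched here on a made-up sample string and a temporary directory (both assumptions, not part of the notebook), is to keep the text intact and name the encoding when opening the file.

```python
import os
import tempfile

# Made-up sample text containing two non-ASCII characters.
sample = "caf\u00e9 \u2713 plain"

# Approach used in the notebook: silently drop every non-ASCII character.
ascii_only = sample.encode("ascii", "ignore").decode()

# Alternative: write and read the file as UTF-8 so nothing is lost.
path = os.path.join(tempfile.mkdtemp(), "web_page.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(repr(ascii_only))
print(round_trip == sample)
```

The first print shows the text with the two characters missing, while the UTF-8 round trip returns the original string unchanged.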

In [121]:
from bs4 import BeautifulSoup

In [122]:
doc= BeautifulSoup(start_content, "html.parser")

In [124]:
topic_tags= doc.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")

In [127]:
description_tags= doc.find_all("p", class_="f5 color-fg-muted mb-0 mt-1")

In [128]:
link_tags= doc.find_all("a", class_="no-underline flex-1 d-flex flex-column")

In [129]:
link= "https://github.com"

In [130]:
link

'https://github.com'

In [131]:
# Topic names, e.g. "3D"
topic_titles = [tag.text for tag in topic_tags]

# Short descriptions, stripped of surrounding whitespace
description_brief = [tag.text.strip() for tag in description_tags]

# Absolute topic URLs: each href is a path relative to https://github.com
link_titles = [link + tag["href"] for tag in link_tags]


In [132]:
import pandas as pd

In [133]:
topic_dict={
    "title": topic_titles,
    "description" : description_brief,
    "link" : link_titles
}

df=pd.DataFrame(topic_dict)
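
pd.DataFrame raises a ValueError when the lists in the dictionary have different lengths, which can happen if one of the three class selectors stops matching some topics. As a hedged sketch, a small check like the one below could be run before building the DataFrame to make that failure easier to diagnose. The function name check_equal_lengths and the sample values are assumptions for illustration.

```python
def check_equal_lengths(columns):
    """Return the length of each column, or raise if they differ.

    `columns` maps column names to lists, like topic_dict above.
    This helper is hypothetical and not part of the notebook.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("column lengths differ: {}".format(lengths))
    return lengths


# Made-up sample shaped like topic_dict.
sample_columns = {
    "title": ["3D", "Ajax"],
    "description": ["3D refers to...", "Ajax is a technique..."],
    "link": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
print(check_equal_lengths(sample_columns))
```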

In [134]:
df

Unnamed: 0,title,description,link
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [135]:
df.to_csv("webpage.csv", index=False)

In [176]:
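def get_repo_info(h3_tag, star_tag):
    # The h3 tag holds two links: the first is the user, the second is the repository
    a_tags = h3_tag.find_all("a")
    user_name = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = link + a_tags[1]["href"]
    # parse_star_count must be defined before this function is called
    stars = parse_star_count(star_tag.text.strip())
    return user_name, repo_name, stars, repo_url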
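
get_repo_info calls parse_star_count, which is not defined in any cell shown here. The following is a minimal sketch of what such a helper could look like, assuming the star label is text such as "69.7k" or "812". That assumption fits the rounded star values in the output (for example 92300), but the exact label format on the page is not confirmed by the notebook.

```python
def parse_star_count(stars_str):
    """Convert a star label such as '69.7k' or '812' to an integer.

    Assumed label format: an optional trailing 'k' means thousands.
    """
    text = stars_str.strip().lower().replace(",", "")
    if text.endswith("k"):
        # '69.7k' -> 69700; round() guards against floating-point error.
        return int(round(float(text[:-1]) * 1000))
    return int(text)


print(parse_star_count("69.7k"))
print(parse_star_count("812"))
```

The cell defining this helper has to run before get_repo_info is called, otherwise Python raises a NameError.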

In [181]:
topic_repo_dict= {
    "username": [],
    "repo_name": [],
    "stars":  [],
    "repo_url": []
}

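# repo_tag and star_tags are not defined in the cells shown here; they hold the
# h3 tags and star-count span tags of one topic page, found with the same
# selectors that get_topic_repos uses
for i in range(len(repo_tag)):
    user_name, repo_name, stars, repo_url = get_repo_info(repo_tag[i], star_tags[i])
    topic_repo_dict["username"].append(user_name)
    topic_repo_dict["repo_name"].append(repo_name)
    topic_repo_dict["stars"].append(stars)
    topic_repo_dict["repo_url"].append(repo_url)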
    

In [183]:
df2= pd.DataFrame(topic_repo_dict)

In [184]:
df2

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,92300,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,22700,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21600,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,20700,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,17000,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,15500,https://github.com/lettier/3d-game-shaders-for...
6,aframevr,aframe,15400,https://github.com/aframevr/aframe
7,FreeCAD,FreeCAD,14200,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10500,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,9700,https://github.com/metafizzy/zdog


In [185]:
def get_topic_repos(page_url):
    response = requests.get(page_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(page_url))
    topic_doc = BeautifulSoup(response.text, "html.parser")
    # h3 tags containing the username, repo name and repo URL
    repo_tags = topic_doc.find_all("h3", class_="f3 color-fg-muted text-normal lh-condensed")
    # span tags containing the star count of each repo
    star_tags = topic_doc.find_all("span", class_="Counter js-social-count")

    # Collect the info for every repo on the page
    topic_repo_dict = {
        "username": [],
        "repo_name": [],
        "stars": [],
        "repo_url": []
    }

    for i in range(len(repo_tags)):
        user_name, repo_name, stars, repo_url = get_repo_info(repo_tags[i], star_tags[i])
        topic_repo_dict["username"].append(user_name)
        topic_repo_dict["repo_name"].append(repo_name)
        topic_repo_dict["stars"].append(stars)
        topic_repo_dict["repo_url"].append(repo_url)

    return pd.DataFrame(topic_repo_dict)

In [205]:
import os

In [216]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(topic_url)
    topic_df.to_csv(path, index=False)

In [218]:
def scrape_top_repo_name():
    for index,row in df.iterrows():
        print(row["title"], row["link"])

In [219]:
scrape_top_repo_name()

3D https://github.com/topics/3d
Ajax https://github.com/topics/ajax
Algorithm https://github.com/topics/algorithm
Amp https://github.com/topics/amphp
Android https://github.com/topics/android
Angular https://github.com/topics/angular
Ansible https://github.com/topics/ansible
API https://github.com/topics/api
Arduino https://github.com/topics/arduino
ASP.NET https://github.com/topics/aspnet
Atom https://github.com/topics/atom
Awesome Lists https://github.com/topics/awesome
Amazon Web Services https://github.com/topics/aws
Azure https://github.com/topics/azure
Babel https://github.com/topics/babel
Bash https://github.com/topics/bash
Bitcoin https://github.com/topics/bitcoin
Bootstrap https://github.com/topics/bootstrap
Bot https://github.com/topics/bot
C https://github.com/topics/c
Chrome https://github.com/topics/chrome
Chrome extension https://github.com/topics/chrome-extension
Command line interface https://github.com/topics/cli
Clojure https://github.com/topics/clojure
Code quality h

In [220]:
def scrape_topics_repos():
    print('Scraping list of topics')
    # scrape_top_repo_name() only prints the topics and returns None,
    # so the loop below reads them from the global DataFrame df
    scrape_top_repo_name()

    os.makedirs('data', exist_ok=True)
    for index, row in df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['link'], 'data/{}.csv'.format(row['title']))
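
The file name is built from the topic title, which can contain spaces and punctuation (for example "Command line interface"). A title containing a path separator would break the path. One possible alternative, sketched below, is to name each file after the last segment of the topic URL, which is already unique per topic. The helpers topic_slug and csv_path_for are hypothetical names, not part of the notebook.

```python
def topic_slug(topic_url):
    """Return the last path segment of a topic URL, e.g. 'aspnet'."""
    return topic_url.rstrip("/").rsplit("/", 1)[-1]


def csv_path_for(topic_url, folder="data"):
    """Build the CSV path for a topic from its URL (hypothetical helper)."""
    return "{}/{}.csv".format(folder, topic_slug(topic_url))


# URLs taken from the topic list printed by the notebook.
for url in [
    "https://github.com/topics/3d",
    "https://github.com/topics/aspnet",
    "https://github.com/topics/chrome-extension",
]:
    print(csv_path_for(url))
```

With this approach the call inside the loop would pass csv_path_for(row['link']) as the path instead of formatting the title.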

In [221]:
scrape_topics_repos()

Scraping list of topics
3D https://github.com/topics/3d
Ajax https://github.com/topics/ajax
Algorithm https://github.com/topics/algorithm
Amp https://github.com/topics/amphp
Android https://github.com/topics/android
Angular https://github.com/topics/angular
Ansible https://github.com/topics/ansible
API https://github.com/topics/api
Arduino https://github.com/topics/arduino
ASP.NET https://github.com/topics/aspnet
Atom https://github.com/topics/atom
Awesome Lists https://github.com/topics/awesome
Amazon Web Services https://github.com/topics/aws
Azure https://github.com/topics/azure
Babel https://github.com/topics/babel
Bash https://github.com/topics/bash
Bitcoin https://github.com/topics/bitcoin
Bootstrap https://github.com/topics/bootstrap
Bot https://github.com/topics/bot
C https://github.com/topics/c
Chrome https://github.com/topics/chrome
Chrome extension https://github.com/topics/chrome-extension
Command line interface https://github.com/topics/cli
Clojure https://github.com/topic