# Scraping Top Repositories for Topics on GitHub

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
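
The CSV format above can be sketched with Python's built-in `csv` module. This is only a minimal illustration, not the notebook's final code (which uses pandas later on); the helper name `repos_to_csv` and the sample rows are assumptions made for the example.

```python
import csv
import io

HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

def repos_to_csv(rows):
    """Return CSV text: the header row followed by one line per repository."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

# Sample rows taken from the format shown above
sample_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
csv_text = repos_to_csv(sample_rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing an open file to `csv.writer` would work the same way.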

In [171]:
!pip install requests --upgrade --quiet

In [172]:
import requests

In [173]:
topics_url = "https://github.com/topics"

In [174]:
response = requests.get(topics_url)

In [175]:
response.status_code

200
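
A status code of 200 means the request succeeded. As a hedged sketch of how a failed request might be handled, the helper below retries a few times before giving up. The name `fetch_with_retry`, its parameters, and the `FakeResponse` stand-in are assumptions for illustration; in this notebook the `fetch` argument would be `requests.get`.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) until it returns status 200 or the attempts run out."""
    last_status = None
    for attempt in range(attempts):
        response = fetch(url)
        last_status = response.status_code
        if last_status == 200:
            return response
        # Wait before trying again, but not after the final attempt
        if attempt < attempts - 1:
            time.sleep(delay)
    raise RuntimeError("Failed to load {} (last status {})".format(url, last_status))


class FakeResponse:
    """Stand-in for a requests.Response, used only to demonstrate the helper."""
    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text

# Simulate one failed request followed by a successful one
_replies = [FakeResponse(503), FakeResponse(200, "<html></html>")]

def fake_fetch(url):
    return _replies.pop(0)

result = fetch_with_retry(fake_fetch, "https://github.com/topics", delay=0)
print(result.status_code)
```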

In [176]:
len(response.text)

171089

In [177]:
page_contents = response.text

In [178]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  >\n\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-38f1bf52eeeb.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-56010aa53a8f.css" /><link data-color-theme="dark_dimmed" crossor

In [179]:
# Save the HTML to a local file so it can be inspected offline

with open('webpage.html','w', encoding = 'UTF-8') as f:
    f.write(page_contents)
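
Saving the page makes it possible to develop the parsing code without requesting the site again. A minimal sketch of that round trip, assuming a hypothetical helper pair `save_page` / `load_page` and a temporary directory in place of the working folder:

```python
import os
import tempfile

def save_page(path, html):
    """Write the HTML text to disk using UTF-8."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)

def load_page(path):
    """Read the HTML text back with the same encoding."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()

# A tiny stand-in for the downloaded page contents
sample_html = "<html><body><p>Browse popular topics on GitHub.</p></body></html>"

with tempfile.TemporaryDirectory() as folder:
    page_path = os.path.join(folder, 'webpage.html')
    save_page(page_path, sample_html)
    restored = load_page(page_path)

print(restored == sample_html)
```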

In [180]:
from bs4 import BeautifulSoup  

In [181]:
soup = BeautifulSoup(page_contents , "html.parser")

In [182]:
type(soup)

bs4.BeautifulSoup

In [183]:
p_tags = soup.find_all('p')
len(p_tags)


69

In [184]:
p_tags[:5]

[<p>We read every piece of feedback, and take your input very seriously.</p>,
 <p class="text-small color-fg-muted">
             To see all available qualifiers, see our <a class="Link--inTextBlock" href="https://docs.github.com/en/search-github/github-code-search/understanding-github-code-search-syntax">documentation</a>.
           </p>,
 <p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Sketch
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Sketch is a vector graphics editor for Apple's macOS, used primarily for user interface and icon design.</p>]

In [185]:
# Topic Tags
topic_title_tags = soup.find_all('p', class_ = "f3 lh-condensed mb-0 mt-1 Link--primary")

In [186]:
len(topic_title_tags)
# 30 <p> tags share this class, one per topic

30

In [187]:
# Topic description tags

topic_desc_tags = soup.find_all('p', class_ = "f5 color-fg-muted mb-0 mt-1")

In [188]:
len(topic_desc_tags)

30

In [189]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [190]:
topic_link_tags = soup.find_all('a', class_ = "no-underline flex-1 d-flex flex-column")

In [191]:
len(topic_link_tags)

30

In [192]:
topic_link_tags[:5]
# The first link belongs to the 3D topic; its href attribute holds the relative URL we need

[<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/algorithm">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/amphp">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>
 <p cl

In [193]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
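
String concatenation works here because every `href` on the page is a root-relative path. As an alternative sketch, the standard library's `urllib.parse.urljoin` also copes with hrefs that are already absolute. The helper name `absolute_url` is an assumption for this example.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href):
    """Resolve a relative or absolute href against the GitHub base URL."""
    return urljoin(BASE_URL, href)

print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```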


In [194]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

    
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [195]:
topic_desc =[]

for tag in topic_desc_tags:
    topic_desc.append(tag.text.strip())
    
print(topic_desc[:5])

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.']
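
`strip()` removes the leading and trailing whitespace that the page's indentation leaves around each description. If a tag's text also contained line breaks in the middle, a slightly broader clean-up could collapse every run of whitespace into a single space. A minimal sketch, with the helper name `clean_text` and the sample string as assumptions:

```python
def clean_text(raw):
    """Collapse all runs of whitespace (including newlines) into single spaces."""
    return " ".join(raw.split())

# Made-up text with indentation and a line break in the middle
raw_desc = "\n          Ajax is a technique for creating\n   interactive web applications.\n        "
print(clean_text(raw_desc))
```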


In [196]:
topic_url = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_url.append(base_url + tag['href'])
    
topic_url

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [197]:
# Creating a CSV file from the information extracted above

!pip install pandas --upgrade



In [198]:
import pandas as pd


In [199]:
topics_dict  = { 'Title': topic_titles,
          'Description': topic_desc,
           'url' : topic_url
          }

df = pd.DataFrame(topics_dict)
df


Unnamed: 0,Title,Description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [200]:
df.to_csv('Topics.csv')

# Getting information out of a topic page

In [201]:
topic_page_url = topic_url[0]


In [202]:
topic_page_url

'https://github.com/topics/3d'

In [203]:
response = requests.get(topic_page_url)

In [204]:
response.status_code

200

In [205]:
len(response.text)

488929

In [206]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [207]:
h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

In [208]:
len(repo_tags)

20

In [209]:
a_tags= repo_tags[0].find_all('a')

In [210]:
a_tags

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">
             three.js
 </a>]

In [211]:
a_tags[0].text.strip()

'mrdoob'

In [212]:
a_tags[1].text.strip()

'three.js'

In [213]:
a_tags[1]['href']

'/mrdoob/three.js'

In [214]:
base_url =  'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [215]:
star_tags=  topic_doc.find_all('span',class_="Counter js-social-count")

In [216]:
len(star_tags)

20

In [217]:
star_tags[0].text.strip()

'96.7k'

In [218]:
# Convert a star count string such as '96.7k' into an integer (96700)

def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    # A trailing 'k' means the number is given in thousands
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [219]:
parse_star_count(star_tags[0].text.strip())

96700
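
`int(float(...) * 1000)` truncates, so a floating-point product that lands just below a whole number would lose one star. A hedged variant using `decimal.Decimal` avoids that and also accepts thousands separators. The name `parse_count` and the `'m'` suffix are assumptions; only the `'k'` suffix appears on the page scraped here.

```python
from decimal import Decimal

# Multipliers for abbreviated counts; 'm' is an assumption, not seen on the page
SUFFIXES = {'k': 1_000, 'm': 1_000_000}

def parse_count(text):
    """Convert strings such as '96.7k' or '1,234' into an integer."""
    cleaned = text.strip().lower().replace(',', '')
    if not cleaned:
        raise ValueError("empty count")
    multiplier = SUFFIXES.get(cleaned[-1], 1)
    if multiplier != 1:
        cleaned = cleaned[:-1]
    # Decimal arithmetic is exact for these inputs, so int() does not truncate unexpectedly
    return int(Decimal(cleaned) * multiplier)

print(parse_count('96.7k'))
print(parse_count('1,234'))
```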

In [220]:


def get_repo_info(h3_tag, star_tag):
    # The h3 tag holds two links: the first for the owner, the second for the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())

    return username, repo_name, stars, repo_url

    



In [221]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 96700, 'https://github.com/mrdoob/three.js')

In [222]:
topic_repo_dict = {
        'username' : [],
        'repo_name' : [],
        'star' : [],
        'repo_url' : []
    }
for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i] , star_tags[i])
    topic_repo_dict['username'].append(repo_info[0])
    topic_repo_dict['repo_name'].append(repo_info[1])
    topic_repo_dict['star'].append(repo_info[2])
    topic_repo_dict['repo_url'].append(repo_info[3])
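
The loop above pairs each repository heading with a star counter by position, which assumes both lists have the same length. A hedged sketch of the same pairing using `zip`, with an explicit length check first. Plain tuples and integers stand in for the parsed tags, and `pair_repos_with_stars` is a hypothetical name.

```python
def pair_repos_with_stars(repos, stars):
    """Combine (username, repo_name) pairs with star counts into row dictionaries."""
    if len(repos) != len(stars):
        raise ValueError(
            "Found {} repos but {} star counts".format(len(repos), len(stars))
        )
    rows = []
    for (username, repo_name), star in zip(repos, stars):
        rows.append({'username': username, 'repo_name': repo_name, 'star': star})
    return rows

# Stand-in data copied from the output shown in this notebook
repos = [('mrdoob', 'three.js'), ('pmndrs', 'react-three-fiber')]
stars = [96700, 24900]
rows = pair_repos_with_stars(repos, stars)
print(rows[0])
```

A list of dictionaries like `rows` can be passed to `pd.DataFrame` directly, just like the dictionary of lists used in the notebook.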

In [223]:
topic_repo_dict

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'ssloy',
  'lettier',
  'FreeCAD',
  'aframevr',
  'CesiumGS',
  'blender',
  'MonoGame',
  'metafizzy',
  'isl-org',
  'timzhang642',
  'a1studmuffin',
  'nerfstudio-project',
  'domlysz',
  'FyroxEngine',
  'google',
  'openscad'],
 'repo_name': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'aframe',
  'cesium',
  'blender',
  'MonoGame',
  'zdog',
  'Open3D',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'nerfstudio',
  'BlenderGIS',
  'Fyrox',
  'model-viewer',
  'openscad'],
 'star': [96700,
  24900,
  22300,
  21900,
  18700,
  16600,
  16200,
  15900,
  11400,
  10500,
  10500,
  10100,
  9900,
  9300,
  7500,
  7400,
  7000,
  6900,
  6200,
  6200],
 'repo_url': ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon

In [224]:
topic_repo_df = pd.DataFrame(topic_repo_dict)

In [225]:
topic_repo_df

Unnamed: 0,username,repo_name,star,repo_url
0,mrdoob,three.js,96700,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,24900,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,22300,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,21900,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,18700,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,16600,https://github.com/lettier/3d-game-shaders-for...
6,FreeCAD,FreeCAD,16200,https://github.com/FreeCAD/FreeCAD
7,aframevr,aframe,15900,https://github.com/aframevr/aframe
8,CesiumGS,cesium,11400,https://github.com/CesiumGS/cesium
9,blender,blender,10500,https://github.com/blender/blender


# Get the top 25 repositories from a topic page


In [281]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception("Failed to load the page {}".format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # The h3 tag holds two links: the first for the owner, the second for the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())

    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get the h3 tags containing the username and repo name
    h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get the star tags
    star_tags = topic_doc.find_all('span', {'class': "Counter js-social-count"})
    # Collect the info for every repo on the page
    topic_repo_dict = {
        'username': [],
        'repo_name': [],
        'star': [],
        'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repo_dict['username'].append(repo_info[0])
        topic_repo_dict['repo_name'].append(repo_info[1])
        topic_repo_dict['star'].append(repo_info[2])
        topic_repo_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repo_dict)

def scrape_topic(topic_url, topic_name):
    # Writes the repos of one topic to '<topic_name>.csv'
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(topic_name + '.csv', index=None)

In [264]:
topic_url[6]

'https://github.com/topics/ansible'

In [265]:

# topic_doc = get_topic_page(topic_url[6])
# topic_repos = get_topic_repos(topic_doc)
# topic_repos

# Let's do the above in one line so that it is easy to check multiple topic pages

get_topic_repos(get_topic_page(topic_url[6])).to_csv('ansible.csv',index=None)
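
The file name is typed by hand here. When file names are derived from topic titles such as 'Command line interface' or 'ASP.NET', it can help to normalize them first. The sketch below is one possible approach and is not part of the original notebook; `title_to_filename` is a hypothetical helper.

```python
import re

def title_to_filename(title):
    """Build a CSV file name from a topic title.

    Keeps lowercase letters, digits, '+' and '.', and turns every other
    run of characters into a single '-'.
    """
    slug = re.sub(r"[^a-z0-9+.]+", "-", title.strip().lower()).strip("-")
    return slug + ".csv"

for title in ["Command line interface", "ASP.NET", "C++", "COVID-19"]:
    print(title_to_filename(title))
```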

Write a function to:

    1. Get the list of topics from the topics page
    2. Get the list of top repos from the individual topic pages
    3. For each topic, create a CSV of the top repos for the topic

In [266]:
def get_topic_titles(doc):
    topic_title_tags = doc.find_all('p', class_="f3 lh-condensed mb-0 mt-1 Link--primary")
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_desc(doc):
    topic_desc_tags = doc.find_all('p', class_="f5 color-fg-muted mb-0 mt-1")
    topic_desc = []
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip())
    return topic_desc

def get_topic_url(doc):
    topic_link_tags = doc.find_all('a', class_="no-underline flex-1 d-flex flex-column")
    topic_url = []
    base_url = 'https://github.com'

    for tag in topic_link_tags:
        topic_url.append(base_url + tag['href'])
    return topic_url


def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception("Failed to load the page {}".format(topics_url))
    # Parse the page we just downloaded instead of relying on a global `soup`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_desc(doc),
        'url': get_topic_url(doc)
    }
    return pd.DataFrame(topics_dict)


# Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topic page
- Let's create a function to put them together

In [284]:
import os

def scrape_topics_repos():
    print('Scraping list of Topics')
    topics_df = scrape_topics()
    # Create a folder to store one CSV file per topic
    os.makedirs('data', exist_ok=True)

    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        # scrape_topic appends '.csv' to the name it is given
        scrape_topic(row['url'], 'data/' + row['title'])


In [285]:
scrape_topics_repos()

Scraping list of Topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
S