### Web scraping top repositories in github

- It is a process of extracting and parsing the data using a computer program from website in automated way
- This is a technique used to extract data from websites for R and D

check the code below for reference: 

https://jovian.ai/aakashns-6l3/scraping-github-topics-repositories-rough

https://jovian.ai/aakashns/python-web-scraping-project-guide

####  Step 1 : Pick a website and describe your objective

- Browse through different sites and pick on to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Juptyer notebook. Use the "New" button above.

##### Outline : 

- The website we are scraping is https://github.com/topics
- we will get the list of topics and for each topic title, topic page url and topic description
- we will get top 25 repositories
- for each repository we will grab repo_name, repo_url, stars, username
- we will create csv file for each topic

#### step 2 : Use Request library to download webpages

- Inspect the HTML pagesand find the right URL
- Download and save webpage using request library
- create a function for automatic downloading for different searches/ questions

In [1]:
! pip install requests



In [2]:
import requests

In [3]:
topics_url = "https://github.com/topics"

In [4]:
response = requests.get(topics_url)

In [5]:
#HTTP response status codes indicate whether a specific HTTP request has been successfully completed.
response.status_code

200

- Informational responses (100–199)
- Successful responses (200–299)
- Redirection messages (300–399)
- Client error responses (400–499)
- Server error responses (500–599)

In [6]:
response.text

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-L06pZD/4Yecj8D8pY5aYfA7oKG6CI8/hlx2K9ZlXOS/j5TnYEjrusaVa9ZIb9O3/tBHmnRFLzaC1ixcafWtaAg==" rel="stylesheet" href="https://github.githubassets.com/assets/light-2f4ea9643ff861e723f03f296396987c.css" /><link crossorigin="anonymous" media="all" integrity="sha512-xcx3R1NmKjgOAE2DsCHYbus068pwqr4i3Xaa1osduISrxqYFi3zIaBLqjzt5FM9VSHqFN7mneFXK73Z9

In [7]:
len(response.text)

146866

In [8]:
content = response.text
content[:500]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" cro'

In [9]:
    
with open('webpag.html', "w", encoding="utf-8") as f:
    f.write(content)


with open('webpage.html','w') as f:

    f.write(content)
    
- Above two line code is replaced with below as unicode encode error is occuring

with open('webpag.html', "w", encoding="utf-8") as f:

    f.write(content)
    
 - below is the example   

In [10]:
with open('webpage.html','w') as f:
    f.write(content)

UnicodeEncodeError: 'charmap' codec can't encode character '\u21b5' in position 51204: character maps to <undefined>

#### step 3 : Use Beautifulsoup to extract and parse the information

- parse and explore the structure of downloaded webage using beautiful soup
- use right methods to exttract the info
- create functions to extract from the page into list and dictionary

In [11]:
! pip install beautifulsoup4



In [12]:
from bs4 import BeautifulSoup

In [22]:
p_doc = BeautifulSoup(content, 'html.parser')
p_doc


<!DOCTYPE html>

<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-2f4ea9643ff861e723f03f296396987c.css" integrity="sha512-L06pZD/4Yecj8D8pY5aYfA7oKG6CI8/hlx2K9ZlXOS/j5TnYEjrusaVa9ZIb9O3/tBHmnRFLzaC1ixcafWtaAg==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-c5cc774753662a380e004d83b021d86e.css" integrity="sha512-xcx3R1NmKjgOAE2DsCHYbus068pwqr4i3Xa

In [14]:
type(p_doc)

bs4.BeautifulSoup

In [23]:
topic_title_tag = p_doc.find_all('p')

In [24]:
len(topic_title_tag)

67

In [25]:
topic_title_tag[:10]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Tensorflow
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">TensorFlow is an open source software library for numerical computation.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Mastodon
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Mastodon is a free, decentralized, open-source microblogging network.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Ethereum
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Ethereum is a distributed public blockchain network.</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f3 lh-condensed mb-

In [26]:
topic_title_tag = p_doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
len(topic_title_tag)

30

In [19]:
topic_title_tag

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [29]:
desc_select = "f5 color-fg-muted text-center mb-0 mt-1"
topic_desc_tag = p_doc.find_all('p',{'class': desc_select})
topic_desc_tag

[<p class="f5 color-fg-muted text-center mb-0 mt-1">TensorFlow is an open source software library for numerical computation.</p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Mastodon is a free, decentralized, open-source microblogging network.</p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Ethereum is a distributed public blockchain network.</p>]

In [57]:
topic_desc_tag = p_doc.find_all('p',{'class':"f5 color-fg-muted text-center mb-0 mt-1",'class':"f5 color-fg-muted mb-0 mt-1"})
print(len(topic_desc_tag))
print(topic_desc_tag)

30
[<p class="f5 color-fg-muted mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>, <p class="f5 color-fg-muted mb-0 mt-1">
              Ajax is a technique for creating interactive web applications.
            </p>, <p class="f5 color-fg-muted mb-0 mt-1">
              Algorithms are self-contained sequences that carry out a variety of tasks.
            </p>, <p class="f5 color-fg-muted mb-0 mt-1">
              Amp is a non-blocking concurrency framework for PHP.
            </p>, <p class="f5 color-fg-muted mb-0 mt-1">
              Android is an operating system built by Google designed for mobile devices.
            </p>, <p class="f5 color-fg-muted mb-0 mt-1">
              Angular is an open source web application platform.
            </p>, <p class="f5 color-fg-muted mb-0 mt-1">
              Ansible is a simple and powerful automation engine.
            </p>, <p class="f5 color-fg-muted 

In [32]:
topic_desc_tag[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

In [34]:
desc_select = "f5 color-text-secondary text-center mb-0 mt-1"
topic_desc_tag = p_doc.find_all('p',{'class': desc_select})
topic_desc_tag

[]

In [40]:
topic_url = p_doc.find_all('a',{"class":"no-underline d-flex flex-column flex-justify-center","class":"d-flex no-underline"})

In [56]:
topic_title_tag0 = topic_title_tag[0]
topic_title_tag0.parent

<div class="flex-auto">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>
</div>

In [41]:
len(topic_url)

30

In [43]:
topic_url[0]["href"]

'/topics/3d'

In [46]:
topic0_url = "https://github.com"+topic_url[0]["href"]

In [48]:
print(topic0_url)

https://github.com/topics/3d


In [50]:
topic_title_tag[0].text

'3D'

In [55]:
title_tags = []

for tag_text in topic_title_tag:
    title_tags.append(tag_text.text)
print(title_tags)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [63]:
desc_tags = []

for desc in topic_desc_tag:
    desc_tags.append(desc.text.strip())
print(desc_tags)

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency framework for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source hardware and software company and maker community.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a clo

In [70]:
url_tags = []
base_url = "https://github.com"

for url in topic_url:
    url_tags.append(base_url+url["href"])
    
print(url_tags)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

#### step 4 : Create CSV file with tthe extracted information

- create functions for end to end extraction , parsing and saving the data
- execute function to get the parsed data saved in csv file
- download and check the csv file

In [71]:
import pandas as pd

In [72]:
topic_dict = {
    "Title":title_tags,
    "Description" : desc_tags,
    "Link" : url_tags
}

In [73]:
repository_topic = pd.DataFrame(topic_dict)
repository_topic

Unnamed: 0,Title,Description,Link
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [74]:
repository_topic.to_csv("Topics_list")


In [75]:
# to remove index
repository_topic.to_csv("Topics_list",index = None)

#### Getting information from a topic page

In [77]:
topic_page_url = url_tags[0]

In [78]:
topic_page_url

'https://github.com/topics/3d'

In [80]:
response1 = requests.get(topic_page_url)

In [82]:
response1.status_code

200

In [84]:
len(response1.text)

637865

In [94]:
p_doc2 = BeautifulSoup(response1.text, 'html.parser')

In [95]:
getusername_tag = p_doc2.find_all('a')

In [96]:
getusername_tag

[<a class="px-2 py-4 color-bg-info-inverse color-text-white show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>,
 <a aria-label="Homepage" class="mr-4" data-ga-click="(Logged out) Header, go to homepage, icon:logo-wordmark" href="https://github.com/">
 <svg aria-hidden="true" class="octicon octicon-mark-github color-text-white" data-view-component="true" height="32" version="1.1" viewbox="0 0 16 16" width="32">
 <path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0016 

In [99]:
user_name_class ="f3 color-fg-muted text-normal lh-condensed"
user_name_tags = p_doc2.find_all("h3",{'class':user_name_class})

In [100]:
user_name_tags

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>          /
           <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d8

In [102]:
len(user_name_tags)

30

In [106]:
atag = user_name_tags[0].find_all("a")
atag

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-view-component="tr

In [109]:
atag[0].text.strip()

'mrdoob'

In [112]:
print(atag[1].text.strip())

three.js


In [115]:
print(base_url + atag[1]["href"])

https://github.com/mrdoob/three.js


In [116]:
star_class = "social-count float-none"
star_tag = p_doc2.find_all("a",{'class' :star_class})

In [117]:
len(star_tag)

30

In [118]:
star_tag[0]

<a class="social-count float-none" data-ga-click="Explore, go to repository stargazers, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"STARGAZERS","click_visual_representation":"STARGAZERS_NUMBER","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4f3c0fb1ad4e5a9f72ed698531bf27b302fcc5846d9458e33ceeeeb05888b64c" data-view-component="true" href="/mrdoob/three.js/stargazers">
          75.3k
</a>

In [120]:
star_tag[0].text.strip()

'75.3k'

In [121]:
# defining a function for converting 75.3k into integer
def parse_star_count(star_str):
    star_str = star_str.strip()
    if star_str[-1] == "k":
        return(int(float(star_str[:-1])* 1000))
    return int(star_str)

    

In [122]:
parse_star_count(star_tag[0].text.strip())

75300

In [123]:
def get_repository_info(h3tag, star_tag):
    # return all the info about the repository
    atag = h3tag.find_all("a")
    username = atag[0].text.strip()
    repositoryname = atag[1].text.strip()
    repositoryurl = base_url + atag[1]["href"]
    starcount = parse_star_count(star_tag.text.strip())
    return username, repositoryname, repositoryurl, starcount

In [179]:
get_repository_info(user_name_tags[2],star_tag[2])

('pmndrs',
 'react-three-fiber',
 'https://github.com/pmndrs/react-three-fiber',
 15500)

In [180]:
topic_repository_dict = {
    'username':[],
    'repository_name' :[],
    'repository_url':[],
    'star_count' : []
}

for i in range(len(user_name_tags)):
    repositoryinfo = get_repository_info(user_name_tags[i],star_tag[i])
    topic_repository_dict["username"].append(repositoryinfo[0])
    topic_repository_dict["repository_name"].append(repositoryinfo[1])
    topic_repository_dict["repository_url"].append(repositoryinfo[2])
    topic_repository_dict["star_count"].append(repositoryinfo[3])

In [182]:
topic_repository_dict

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'ssloy',
  'lettier',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'spritejs',
  'tensorspace-team',
  'domlysz',
  'jagenjo',
  'YadiraF',
  'AaronJackson',
  'openscad',
  'ssloy',
  'mosra',
  'blender',
  'google',
  'gfxfundamentals',
  'cleardusk',
  'jasonlong',
  'rg3dengine',
  'cnr-isti-vclab',
  'antvis'],
 'repository_name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'spritejs',
  'tensorspace',
  'BlenderGIS',
  'webglstudio.js',
  'PRNet',
  'vrn',
  'openscad',
  'tinyraytracer',
  'magnum',
  'blender',
  'model-viewer',
  'webgl-fundamentals',
  '3DDFA',
  'isometric-contributions',
  'rg3d',
  'meshlab',
  'L7'],
 'repository_url': ['https://github.com/mrdoob/thre

In [130]:
topic_repository_df = pd.DataFrame(topic_repository_dict)

In [131]:
topic_repository_df

Unnamed: 0,username,repository_name,repository_url,star_count
0,mrdoob,three.js,https://github.com/mrdoob/three.js,75300
1,libgdx,libgdx,https://github.com/libgdx/libgdx,19200
2,pmndrs,react-three-fiber,https://github.com/pmndrs/react-three-fiber,15500
3,BabylonJS,Babylon.js,https://github.com/BabylonJS/Babylon.js,15200
4,aframevr,aframe,https://github.com/aframevr/aframe,13200
5,ssloy,tinyrenderer,https://github.com/ssloy/tinyrenderer,11500
6,lettier,3d-game-shaders-for-beginners,https://github.com/lettier/3d-game-shaders-for...,11400
7,FreeCAD,FreeCAD,https://github.com/FreeCAD/FreeCAD,10100
8,metafizzy,zdog,https://github.com/metafizzy/zdog,8800
9,CesiumGS,cesium,https://github.com/CesiumGS/cesium,7600


In [171]:
def get_topic_repo(topic_url):
    # Download the page
    response2 = requests.get(topic_url)
    
    # check for successful response
    if response2.status_code!=200:
        raise Exception("Failed to load page{}".format(topic_url))
     
    #Parse using beautifulsoup
    topic_doc = BeautifulSoup(response2.text, 'html.parser')
    
    # get repository username, repository url, repository title
    user_name_class ="f3 color-fg-muted text-normal lh-condensed"
    user_name_tags = p_doc2.find_all("h3",{'class':user_name_class})
    
    #Get star tags
    star_class = "social-count float-none"
    star_tag = p_doc2.find_all("a",{'class' :star_class})
    
    # Let us now get all the repository information
    topic_repository_dict = {
    'username':[],
    'repository_name' :[],
    'repository_url':[],
    'star_count' : []
    }
    
    for i in range(len(user_name_tags)):
        repositoryinfo = get_repository_info(user_name_tags[i],star_tag[i])
        topic_repository_dict["username"].append(repositoryinfo[0])
        topic_repository_dict["repository_name"].append(repositoryinfo[1])
        topic_repository_dict["repository_url"].append(repositoryinfo[2])
        topic_repository_dict["star_count"].append(repositoryinfo[3])
    return pd.DataFrame(topic_repository_dict)


def get_repository_info(h3tag, star_tag):
    # return all the info about the repository
    atag = h3tag.find_all("a")
    username = atag[0].text.strip()
    repositoryname = atag[1].text.strip()
    repositoryurl = base_url + atag[1]["href"]
    starcount = parse_star_count(star_tag.text.strip())
    return username, repositoryname, repositoryurl, starcount

In [172]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    
    # check for successful response
    if response.status_code!=200:
        raise Exception("Failed to load page{}".format(topic_url))
     
    #Parse using beautifulsoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def get_repository_info(h3tag, star_tag):
    # return all the info about the repository
    atag = h3tag.find_all("a")
    username = atag[0].text.strip()
    repositoryname = atag[1].text.strip()
    repositoryurl = base_url + atag[1]["href"]
    starcount = parse_star_count(star_tag.text.strip())
    return username, repositoryname, repositoryurl, starcount


def get_topic_repos(topic_doc):
     # get repository username, repository url, repository title
    user_name_class ="f3 color-fg-muted text-normal lh-condensed"
    user_name_tags = p_doc2.find_all("h3",{'class':user_name_class})
    
    #Get star tags
    star_class = "social-count float-none"
    star_tag = p_doc2.find_all("a",{'class' :star_class})
    
    # Let us now get all the repository information
    topic_repository_dict = {
    'username':[],
    'repository_name' :[],
    'repository_url':[],
    'star_count' : []
    }
    
    for i in range(len(user_name_tags)):
        repositoryinfo = get_repository_info(user_name_tags[i],star_tag[i])
        topic_repository_dict["username"].append(repositoryinfo[0])
        topic_repository_dict["repository_name"].append(repositoryinfo[1])
        topic_repository_dict["repository_url"].append(repositoryinfo[2])
        topic_repository_dict["star_count"].append(repositoryinfo[3])
    return pd.DataFrame(topic_repository_dict)

In [162]:
url_tags

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [173]:
url1 = url_tags[2]
url1

'https://github.com/topics/algorithm'

In [183]:
get_topic_repos(get_topic_page(url_tags[5])).to_csv("algorithm.csv",index = None)

In [177]:
topic2_doc = get_topic_page(url1)

In [175]:
topic2_repos = get_topic_repos(topic2_doc)

In [176]:
topic2_repos

Unnamed: 0,username,repository_name,repository_url,star_count
0,mrdoob,three.js,https://github.com/mrdoob/three.js,75300
1,libgdx,libgdx,https://github.com/libgdx/libgdx,19200
2,pmndrs,react-three-fiber,https://github.com/pmndrs/react-three-fiber,15500
3,BabylonJS,Babylon.js,https://github.com/BabylonJS/Babylon.js,15200
4,aframevr,aframe,https://github.com/aframevr/aframe,13200
5,ssloy,tinyrenderer,https://github.com/ssloy/tinyrenderer,11500
6,lettier,3d-game-shaders-for-beginners,https://github.com/lettier/3d-game-shaders-for...,11400
7,FreeCAD,FreeCAD,https://github.com/FreeCAD/FreeCAD,10100
8,metafizzy,zdog,https://github.com/metafizzy/zdog,8800
9,CesiumGS,cesium,https://github.com/CesiumGS/cesium,7600


#### Writeing single function to:

- get list of topics from topics page
- get list of top repositories from each topic page
- for each topic, create a csv of top repositories