# Scraping Top Repositories for Topics on GitHub

### Web-scraping
- Web scraping is an automated way to collect large amounts of data from websites. Most of that data arrives as unstructured HTML, which is then converted into structured data (a spreadsheet or a database) so that it can be used in other applications.


![](https://1000logos.net/wp-content/uploads/2021/05/GitHub-logo.png)
### GitHub.com
- GitHub is a website and cloud-based service that helps developers store and manage their code, as well as track and control changes to their code. It is commonly used to host open-source projects.

- In this project I will scrape the top repositories for each topic listed on https://github.com/topics

- I will be using pandas, requests and Beautiful Soup

In [499]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

## Pick a website and describe your objective

- Browse through different sites and pick one to scrape.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.



#### Project Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
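
As a rough illustration of the target format, the sketch below writes two records to CSV text with Python's standard `csv` module. The helper name `repos_to_csv` and the sample rows are only for illustration; the notebook itself writes its files with pandas later on.

```python
import csv
import io

# Column order taken from the outline above
FIELDNAMES = ["Repo Name", "Username", "Stars", "Repo URL"]

def repos_to_csv(rows):
    """Serialize a list of repo dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

sample_rows = [
    {"Repo Name": "three.js", "Username": "mrdoob", "Stars": 69700,
     "Repo URL": "https://github.com/mrdoob/three.js"},
    {"Repo Name": "libgdx", "Username": "libgdx", "Stars": 18300,
     "Repo URL": "https://github.com/libgdx/libgdx"},
]

csv_text = repos_to_csv(sample_rows)
print(csv_text)
```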


## Use the requests library to download web pages

In [500]:
!pip install requests



In [501]:
import requests

In [502]:
topics_url = 'https://github.com/topics'

In [503]:
response = requests.get(topics_url)

In [504]:
response.status_code

200

In [505]:
len(response.text)

140769

In [506]:
page_contents = response.text

In [507]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" /><link crossorigin="anonymous" media="all" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKcX

In [508]:
with open('webpage.html', 'w',encoding="utf-8") as f:
    f.write(page_contents)

## Use Beautiful Soup to parse and extract information

In [509]:
!pip install beautifulsoup4



In [510]:
from bs4 import BeautifulSoup

In [511]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [512]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class': selection_class})

In [513]:
len(topic_title_tags)

30

In [514]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
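
Passing the full class string to `find_all` matches only tags whose `class` attribute is exactly that string, in that order. A possibly more tolerant alternative is a CSS selector through `select`, which matches on individual classes. The sketch below runs on a small hand-written HTML sample, not on the live page; the second `<p>` deliberately lists its classes in a different order.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two topic cards (not real page output)
sample_html = '''
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
  <p class="lh-condensed f3 mt-1 mb-0 Link--primary">Ajax</p>
</a>
'''
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact class-string match, as used in the notebook
exact_tags = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
exact_titles = [tag.text for tag in exact_tags]

# CSS selector: any <p> carrying both classes, in any order
css_tags = sample_doc.select('p.f3.lh-condensed')
css_titles = [tag.text for tag in css_tags]

print(exact_titles)
print(css_titles)
```

Either approach still depends on GitHub's class names, which can change without notice.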

In [515]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class': selection_class})
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': desc_selector})
topic_link_tags=doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})

In [517]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [518]:
topic_link_tags=doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
topic_link_tags[:5]

[<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/algorithm">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/amphp">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>
 <p class="f

In [519]:
'https://github.com'+topic_link_tags[0]['href']

'https://github.com/topics/3d'
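
String concatenation works here because every `href` on the page is a root-relative path. A small sketch using `urllib.parse.urljoin` from the standard library, which also copes with an `href` that is already absolute:

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Root-relative href, as found on the topics page
topic_url = urljoin(base_url, '/topics/3d')

# An href that is already absolute passes through unchanged
already_absolute = urljoin(base_url, 'https://github.com/topics/ajax')

print(topic_url)
print(already_absolute)
```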

In [520]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class': selection_class})
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': desc_selector})
topic_link_tags=doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
topic_titles=[]

for titles in topic_title_tags:
    topic_titles.append(titles.text)

    
topic_titles 


['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [522]:
# Select the title, description and link tags of every topic card
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': desc_selector})
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})

# Extract plain values from the tags
topic_titles = [tag.text for tag in topic_title_tags]
topic_descs = [tag.text.strip() for tag in topic_desc_tags]
topic_links = ['https://github.com' + tag['href'] for tag in topic_link_tags]

topic_links

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [523]:
topic_descs=[]

for descs in topic_desc_tags:
    topic_descs.append(descs.text.strip())

    
topic_descs   

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azu

In [525]:
dictopics = {'Title':topic_titles, 'Topic page URL':topic_links,'Topic description':topic_descs}

In [526]:
topic_df=pd.DataFrame(dictopics)

In [527]:
topic_df

Unnamed: 0,Title,Topic page URL,Topic description
0,3D,https://github.com/topics/3d,3D modeling is the process of virtually develo...
1,Ajax,https://github.com/topics/ajax,Ajax is a technique for creating interactive w...
2,Algorithm,https://github.com/topics/algorithm,Algorithms are self-contained sequences that c...
3,Amp,https://github.com/topics/amphp,Amp is a non-blocking concurrency library for ...
4,Android,https://github.com/topics/android,Android is an operating system built by Google...
5,Angular,https://github.com/topics/angular,Angular is an open source web application plat...
6,Ansible,https://github.com/topics/ansible,Ansible is a simple and powerful automation en...
7,API,https://github.com/topics/api,An API (Application Programming Interface) is ...
8,Arduino,https://github.com/topics/arduino,Arduino is an open source hardware and softwar...
9,ASP.NET,https://github.com/topics/aspnet,ASP.NET is a web framework for building modern...
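
The three lists are matched up purely by position, so a topic card without a description would make the lists differ in length. One alternative, sketched here on a hand-written HTML sample, is to read every field from the enclosing `<a>` card. The function name `parse_topic_cards` and the shortened selectors are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the second card has no description on purpose
sample_html = '''
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D modeling is the process of virtually developing the surface and structure of a 3D object.
  </p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</a>
'''

def parse_topic_cards(doc):
    """Return one dict per topic card, with None for any missing field."""
    rows = []
    for card in doc.select('a.no-underline.flex-column'):
        title_tag = card.select_one('p.f3')
        desc_tag = card.select_one('p.f5')
        rows.append({
            'Title': title_tag.text.strip() if title_tag else None,
            'Topic page URL': 'https://github.com' + card['href'],
            'Topic description': desc_tag.text.strip() if desc_tag else None,
        })
    return rows

topic_rows = parse_topic_cards(BeautifulSoup(sample_html, 'html.parser'))
for row in topic_rows:
    print(row)
```

A list of dicts like `topic_rows` can be passed to `pd.DataFrame` directly.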


## Create CSV file(s) with the extracted information

In [528]:
topic_df.to_csv('topic_df.csv',index=None)

## Getting information from the topic page

In [529]:
topic_links[0]

'https://github.com/topics/3d'

In [530]:
response=requests.get(topic_links[0])

In [531]:
response.status_code

200

In [532]:
len(response.text)

641585

In [533]:
response.text[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" /><link crossorigin="anonymous" media="all" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKcX

In [534]:
topic_doc=BeautifulSoup(response.text,'html.parser')

In [535]:
h3_selection_class='f3 color-fg-muted text-normal lh-condensed'
repo_tags=topic_doc.find_all('h3',{'class':h3_selection_class})

In [536]:
a_tags=repo_tags[0].find_all('a')

In [537]:
a_tags[0].text.strip()

'mrdoob'

In [538]:
a_tags[1].text.strip()

'three.js'

In [539]:
base_url='https://github.com'
repo_url=base_url+a_tags[1]['href']

In [540]:
repo_url

'https://github.com/mrdoob/three.js'

In [541]:
star_tags=topic_doc.find_all('span',{'class':'Counter js-social-count'})
len(star_tags)

30

In [542]:
star_tags[0].text

'82.4k'

In [543]:
def parse_star_count(star_str):
    star_str=star_str.strip()
    if star_str[-1]=='k':
        return int(float(star_str[:-1])*1000)
    return int(star_str)

In [544]:
parse_star_count(star_tags[0].text)

82400
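
`int(float(...) * 1000)` truncates, so a floating-point product that lands just below the intended value would come out one lower. A hedged variant using `decimal.Decimal` avoids that. The name `parse_count`, the `m` suffix and the comma handling are assumptions; only the `k` suffix was seen on the page.

```python
from decimal import Decimal

# 'k' appears in the scraped output; 'm' is an assumption for larger counts
SUFFIXES = {'k': 1_000, 'm': 1_000_000}

def parse_count(text):
    """Convert strings such as '82.4k' or '950' to an integer."""
    text = text.strip().lower().replace(',', '')
    if text and text[-1] in SUFFIXES:
        # Decimal arithmetic is exact for these short decimal strings
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)

print(parse_count('82.4k'))
print(parse_count(' 950 '))
```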

In [545]:
def get_repo_info(repo_tag,star_tag):
    a_tags=repo_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    stars=parse_star_count(star_tag.text.strip())
    repo_url=base_url+a_tags[1]['href']
    return username,repo_name,stars,repo_url
    

In [546]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 82400, 'https://github.com/mrdoob/three.js')

In [547]:
repos_dict={'username':[],'repo_name':[],'stars':[],'repo_url':[]}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    repos_dict['username'].append(repo_info[0])
    repos_dict['repo_name'].append(repo_info[1])
    repos_dict['stars'].append(repo_info[2])
    repos_dict['repo_url'].append(repo_info[3])
    
    

In [548]:
topic_repos_df=pd.DataFrame(repos_dict)
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,82400,https://github.com/mrdoob/three.js
1,libgdx,libgdx,20000,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,18200,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,17400,https://github.com/BabylonJS/Babylon.js
4,aframevr,aframe,14200,https://github.com/aframevr/aframe
5,ssloy,tinyrenderer,13800,https://github.com/ssloy/tinyrenderer
6,lettier,3d-game-shaders-for-beginners,13000,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,11400,https://github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,9200,https://github.com/metafizzy/zdog
9,CesiumGS,cesium,8700,https://github.com/CesiumGS/cesium


In [549]:


def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page')

    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def get_repo_info(repo_tag,star_tag):
    a_tags=repo_tag.find_all('a')
    username=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    stars=parse_star_count(star_tag.text.strip())
    repo_url=base_url+a_tags[1]['href']
    return username,repo_name,stars,repo_url


def get_topic_repos(topic_doc):
    
    # getting tags containing username and repo_name
    h3_selection_class='f3 color-fg-muted text-normal lh-condensed' 
    repo_tags=topic_doc.find_all('h3',{'class':h3_selection_class})
    
    #getting tags containing stars
    star_tags=topic_doc.find_all('span',{'class':'Counter js-social-count'})
    
    topic_repos_dict={'username':[],'repo_name':[],'stars':[],'repo_url':[]} 
   
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

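def scrape_topic(topic_url, path):
    # Skip topics whose CSV file was already written by an earlier run
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # path already ends in '.csv' (see scrape_topic_repos), so write to it directly;
    # appending '.csv' again would mean the os.path.exists check above never matches
    topic_df.to_csv(path, index=None)
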
    

In [550]:
url4=topic_links[5]
url4

'https://github.com/topics/angular'

In [551]:
url4_doc=get_topic_page(url4)

In [552]:
get_topic_repos(url4_doc)

Unnamed: 0,username,repo_name,stars,repo_url
0,justjavac,free-programming-books-zh_CN,93100,https://github.com/justjavac/free-programming-...
1,angular,angular,81700,https://github.com/angular/angular
2,storybookjs,storybook,71300,https://github.com/storybookjs/storybook
3,leonardomso,33-js-concepts,49300,https://github.com/leonardomso/33-js-concepts
4,ionic-team,ionic-framework,47300,https://github.com/ionic-team/ionic-framework
5,prettier,prettier,42700,https://github.com/prettier/prettier
6,SheetJS,sheetjs,30300,https://github.com/SheetJS/sheetjs
7,angular,angular-cli,25400,https://github.com/angular/angular-cli
8,angular,components,22700,https://github.com/angular/components
9,NativeScript,NativeScript,21300,https://github.com/NativeScript/NativeScript


## Writing functions to scrape every topic
- Get the list of topics from the topics page.
- Get the list of top repos for each of the topics.
- For each topic, create a CSV file of its top repos.


In [553]:
def scrape_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles=[]

    for titles in topic_title_tags:
        topic_titles.append(titles.text)

    return topic_titles 

def scrape_topic_desc(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs=[]

    for descs in topic_desc_tags:
        topic_descs.append(descs.text.strip())
    return topic_descs
    
def scrape_topic_url(doc):
    topic_link_tags=doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
    topic_urls=[]

    for links in topic_link_tags:
        topic_urls.append('https://github.com'+links['href'])


    return topic_urls 

def scrape_topics():
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse the freshly downloaded page instead of relying on the global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {'title': scrape_topic_titles(doc),
                   'description': scrape_topic_desc(doc),
                   'topic_url': scrape_topic_url(doc)}
    return pd.DataFrame(topics_dict)

In [554]:
scrape_topics()

Unnamed: 0,title,description,topic_url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [555]:
def scrape_topic_repos():
    print('Scraping list of topics from github')
    topics_df=scrape_topics()
    
    os.makedirs('data',exist_ok=True)
    for index,row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['topic_url'],'data/{}.csv'.format(row['title']))
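
The file names above come from topic titles, which are free text ("Command line interface", "C++"). A title containing a path separator would break the path. One option, sketched below, is to name each file after the last segment of the topic URL. The helper `topic_csv_path` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlparse

def topic_csv_path(topic_url, directory='data'):
    """Build a CSV path from the last segment of a topic URL."""
    slug = urlparse(topic_url).path.rstrip('/').split('/')[-1]
    return '{}/{}.csv'.format(directory, slug)

print(topic_csv_path('https://github.com/topics/chrome-extension'))
print(topic_csv_path('https://github.com/topics/cli'))
```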


In [557]:
scrape_topic_repos()

Scraping list of topics from github
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Cloj

Exception: Failed to load page
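
The run stopped on a page that did not return status 200. The output does not show which status came back, so the cause is unknown; one common cause when many pages are requested in quick succession is rate limiting. Under that assumption, the sketch below retries a failed download with growing pauses. `fetch_with_retry` is a hypothetical helper, and the demonstration uses a stand-in for `requests.get` so that it runs offline.

```python
import time

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the retries run out.

    fetch is any callable returning an object with .status_code and .text
    (requests.get fits this shape). sleep is a parameter so the pause can
    be replaced in a demonstration or test.
    """
    for attempt in range(retries + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response.text
        if attempt < retries:
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt
            sleep(base_delay * 2 ** attempt)
    raise Exception('Failed to load page {} (status {})'.format(url, response.status_code))

# Offline demonstration: a fake fetcher that fails twice, then succeeds.
# The 429 status is an assumption chosen for the demo.
class FakeResponse:
    def __init__(self, status_code, text=''):
        self.status_code = status_code
        self.text = text

attempts = []

def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        return FakeResponse(429)
    return FakeResponse(200, '<html>ok</html>')

delays = []
html = fetch_with_retry(flaky_fetch, 'https://github.com/topics/clojure', sleep=delays.append)
print(len(attempts), delays, html)
```

In the notebook, `get_topic_page` could call `fetch_with_retry(requests.get, topic_url)` and pass the returned text to `BeautifulSoup`.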