# Scraping GitHub Topics

## Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

### Outline
- Scrape through https://github.com/topics
- Get a list of all topics on the first page. For each topic, fetch the topic title, description and URL
- Get a list of all the repositories on each topic page. For each repository, fetch the repository name, owner name, star count and URL
- Create a CSV file of all the topic information collected
- Create a CSV file of the repository information collected for each topic, like the following:
---
Repository Name,Username,Stars,URL

three.js,mrdoob,72500,https://github.com/mrdoob/three.js

libgdx,libgdx,18600,https://github.com/libgdx/libgdx

---

### Use the requests library to download webpages

In [1]:
#Install the requests library
!pip install requests --upgrade --quiet

In [2]:
import requests

In [3]:
topics_url="https://github.com/topics"

In [4]:
response=requests.get(topics_url)

In [5]:
# Checking whether the request was successful. A status code of 200 implies success.
print(response.status_code)

200
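
A status code of 200 means success, but a 4xx or 5xx response should stop the scrape before any parsing happens. The sketch below wraps that check in a small helper. `check_response` is a hypothetical name that does not appear in the notebook; it relies on `Response.raise_for_status()` from the requests library. The `Response` objects are built by hand so that the cell runs without network access.

```python
import requests

def check_response(response):
    """Return the response unchanged if it succeeded, else raise requests.HTTPError."""
    # raise_for_status() raises for 4xx and 5xx codes and does nothing otherwise
    response.raise_for_status()
    return response

# Hand-built responses (an assumption for illustration) so the example runs offline
ok = requests.Response()
ok.status_code = 200

missing = requests.Response()
missing.status_code = 404

print(check_response(ok).status_code)

try:
    check_response(missing)
except requests.HTTPError as error:
    print("Request failed:", error)
```

In a real run the helper would be applied to the result of `requests.get(...)`, ideally with a `timeout` argument so that a stalled connection does not hang the notebook.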


In [6]:
len(response.text)

130773

In [7]:
response.text[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-J/5cWm5rrVuxkSgldaK1emf5j30Bs5mRgu0uhuHrG+iwf9mD2LOrkQ32SyN5PADLWzkSDxLS3bW/ScsiM44wzw==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-27fe5c5a6e6bad5bb191282575a2b57a.css" />\n  \n    <link crossorigin="anonymous" media="all" integrity="sha512-W0Cb3tYIxIb58LtOmiY++k5siW1IkzkqaHOXMJpsrZBWMGoaw8M3r5f7RRxa1heGJEDanaTJmAqCJUoMytKNxA==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-5b409bded608c486f9f0bb4e9a263efa.css" />\n    \n    \

In [8]:
#Saving the html file of the topics page
with open("topics.html", "w", encoding="utf-8") as f:
    f.write(response.text)
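
Saving the page is most useful if later cells can parse the saved copy instead of downloading it again. The following is a minimal sketch of that round trip using `pathlib`. The helper names `save_html` and `load_html` are hypothetical, and a temporary directory stands in for the notebook's working folder.

```python
import tempfile
from pathlib import Path

def save_html(html_text, path):
    """Write the downloaded HTML to disk as UTF-8."""
    Path(path).write_text(html_text, encoding="utf-8")

def load_html(path):
    """Read a previously saved HTML file back into a string."""
    return Path(path).read_text(encoding="utf-8")

# A temporary directory keeps the example from touching real files
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "topics.html"
    save_html("<p>3D</p>", demo_path)
    restored = load_html(demo_path)

print(restored)
```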

### Use BeautifulSoup to parse the HTML code and extract information

In [9]:
#Install the BeautifulSoup library
!pip install beautifulsoup4 --upgrade --quiet

In [10]:
from bs4 import BeautifulSoup

In [11]:
# Parsing the response.text in HTML format and storing in the "soup" object
soup=BeautifulSoup(response.text, "html.parser")

In [12]:
#Searching all the tags containing topic titles
topics_title_tags=soup.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")
len(topics_title_tags)

30

In [13]:
topics_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [14]:
#Collecting the topic titles in a list
topics_title=list()
for tag in topics_title_tags:
    topics_title.append(tag.text)

topics_title[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
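
Passing the full class string to `find_all` matches only tags whose `class` attribute is exactly that string, so the search breaks if the page reorders or adds classes. A CSS selector that names a subset of the classes may be more tolerant. The sketch below compares the two on a small hand-written HTML sample; the second `<p>` with reordered classes is an assumption made up for illustration, not something observed on the GitHub page.

```python
from bs4 import BeautifulSoup

# Hand-written sample; the reordered classes on the second tag are hypothetical
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="mt-1 f3 Link--primary lh-condensed mb-0">Ajax</p>
<p class="f5 color-text-secondary mb-0 mt-1">A description, not a title.</p>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact class-string match, as used in the notebook
exact_matches = sample_soup.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")

# CSS selector: any <p> carrying both classes, in any order
selector_matches = sample_soup.select("p.f3.Link--primary")

exact_titles = [tag.text for tag in exact_matches]
selector_titles = [tag.text for tag in selector_matches]

print(exact_titles)
print(selector_titles)
```

The list comprehensions also replace the explicit `append` loop without changing the result.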

In [15]:
#Searching all the tags containing topic descriptions
topics_desc_tags=soup.find_all("p", class_="f5 color-text-secondary mb-0 mt-1")
len(topics_desc_tags)

30

In [16]:
topics_desc_tags[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

In [17]:
#Collecting the topic descriptions in a list
topics_desc=list()
for tag in topics_desc_tags:
    topics_desc.append(tag.text.strip())

topics_desc[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [18]:
#Searching all the tags containing topics URLs
topics_url_tags=soup.find_all("a", class_="d-flex no-underline")
len(topics_url_tags)

30

In [19]:
topics_url_tags[0]

<a class="d-flex no-underline" data-ga-click="Explore, go to 3d, location:All featured topics" href="/topics/3d">
<div class="color-bg-info f4 color-text-tertiary text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
<div class="d-sm-flex flex-auto">
<div class="flex-auto">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">
              3D modeling is the process of virtually developing the surface and structure of a 3D object.
            </p>
</div>
<div class="d-inline-block js-toggler-container starring-container">
<a aria-label="You must be signed in to star a topic" class="btn btn-sm d-flex flex-items-center" data-ga-click="Explore, click star button when signed out,
        action:topics#index;
        text:Star" href="/login?return_to=%2Ftopics%2F3d" title="You must be signed in to star a topic">
<svg aria-hidden="true" class="octicon octicon-star

In [20]:
#Collecting the topic URLs in a list
topics_url=list()
base_url="https://github.com"
for tag in topics_url_tags:
    topic_url=base_url+tag.get("href")
    topics_url.append(topic_url)

topics_url[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
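
String concatenation works here because every `href` on the page is a site-relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library handles relative and absolute links alike. The third `href` below is a made-up absolute link added only to show that case.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# The first two hrefs mirror the page; the absolute one is a hypothetical extra
sample_hrefs = ["/topics/3d", "/topics/ajax", "https://github.com/topics/android"]

# urljoin resolves relative paths against the base and leaves absolute URLs alone
joined_urls = [urljoin(base_url, href) for href in sample_hrefs]

print(joined_urls)
```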

### Create a CSV file of the extracted information

In [21]:
#Install the pandas library
!pip install pandas --upgrade --quiet

In [22]:
import pandas as pd

In [23]:
#Creating a table of topics
topics_dict={"Title":topics_title, "Description":topics_desc, "URL":topics_url}
topics_df=pd.DataFrame(topics_dict)
topics_df

Unnamed: 0,Title,Description,URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet
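
Building the table from three separately collected lists assumes that every topic has a title, a description and a link. If one topic lacked a description, the lists would differ in length and `pd.DataFrame` would raise an error. One possible safeguard is to walk each topic card and read all three fields from inside it. The sketch below does that on a hand-written sample; `extract_topic_rows` is a hypothetical helper, and the second card without a description is invented for illustration.

```python
from bs4 import BeautifulSoup

# Simplified sample modelled on the card markup; the second card is hypothetical
cards_html = """
<a class="d-flex no-underline" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-text-secondary mb-0 mt-1">
    3D modeling is the process of virtually developing the surface and structure of a 3D object.
  </p>
</a>
<a class="d-flex no-underline" href="/topics/no-description">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">No description</p>
</a>
"""

def extract_topic_rows(doc):
    """Return one dict per topic card so the fields always stay aligned."""
    rows = []
    for card in doc.select("a.d-flex.no-underline"):
        title_tag = card.select_one("p.f3")
        desc_tag = card.select_one("p.f5")
        rows.append({
            "Title": title_tag.text.strip() if title_tag else "",
            "Description": desc_tag.text.strip() if desc_tag else "",
            "URL": "https://github.com" + card.get("href"),
        })
    return rows

topic_rows = extract_topic_rows(BeautifulSoup(cards_html, "html.parser"))
for row in topic_rows:
    print(row)
```

A list of dicts like `topic_rows` can be passed straight to `pd.DataFrame` to get the same three columns.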


In [24]:
#Saving the table in a CSV file without the index column
topics_df.to_csv('Topics.csv', index=False)

## Getting information from each topic page

In [25]:
import requests
topic_url_1=topics_url[0]
topic_url_1

'https://github.com/topics/3d'

In [26]:
response=requests.get(topic_url_1)

In [27]:
response.status_code

200

In [28]:
len(response.text)

614759

In [29]:
response.text[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-J/5cWm5rrVuxkSgldaK1emf5j30Bs5mRgu0uhuHrG+iwf9mD2LOrkQ32SyN5PADLWzkSDxLS3bW/ScsiM44wzw==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-27fe5c5a6e6bad5bb191282575a2b57a.css" />\n  \n    <link crossorigin="anonymous" media="all" integrity="sha512-W0Cb3tYIxIb58LtOmiY++k5siW1IkzkqaHOXMJpsrZBWMGoaw8M3r5f7RRxa1heGJEDanaTJmAqCJUoMytKNxA==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-5b409bded608c486f9f0bb4e9a263efa.css" />\n    \n    \

In [30]:
#Saving the html file of the topic_1 page
with open("topic_1.html", "w", encoding="utf-8") as f:
    f.write(response.text)

In [31]:
# Parsing the response.text in HTML format and storing in the "soup" object
soup=BeautifulSoup(response.text, "html.parser")

### Collecting repository name, username, and links to the repository 

In [32]:
# Searching for all the tags containing repository information
repo_tags=soup.find_all("h3", class_="f3 color-text-secondary text-normal lh-condensed")
len(repo_tags)

30

In [33]:
repo_tags[0]

<h3 class="f3 color-text-secondary text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904

In [34]:
#Creating a list of all the usernames
usernames=list()
for tag in repo_tags:
    usernames.append(tag.find_all("a")[0].text.strip())

In [35]:
usernames[:5]

['mrdoob', 'libgdx', 'BabylonJS', 'pmndrs', 'aframevr']

In [36]:
#Creating a list of all the repository names
repo_names=list()
for tag in repo_tags:
    repo_names.append(tag.find_all("a")[1].text.strip())

In [37]:
repo_names[:5]

['three.js', 'libgdx', 'Babylon.js', 'react-three-fiber', 'aframe']

In [38]:
#Creating a list of all the repository URLs
repo_urls=list()
base_url="https://github.com"
for tag in repo_tags:
    repo_url=base_url+tag.find_all("a")[1].get("href")
    repo_urls.append(repo_url)

In [39]:
repo_urls[:5]

['https://github.com/mrdoob/three.js',
 'https://github.com/libgdx/libgdx',
 'https://github.com/BabylonJS/Babylon.js',
 'https://github.com/pmndrs/react-three-fiber',
 'https://github.com/aframevr/aframe']
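
The owner and repository name also appear in the repository link itself, so they can be recovered from the `href` without relying on the position of each `<a>` tag. The helper below is a small sketch of that idea; `split_repo_path` is a hypothetical name, and the sample paths are taken from the URLs listed above.

```python
def split_repo_path(href):
    """Split a repository href such as "/mrdoob/three.js" into (owner, name)."""
    # Drop the surrounding slashes, then split on the first remaining one
    owner, name = href.strip("/").split("/", 1)
    return owner, name

repo_hrefs = ["/mrdoob/three.js", "/libgdx/libgdx", "/BabylonJS/Babylon.js"]
pairs = [split_repo_path(href) for href in repo_hrefs]

print(pairs)
```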

### Collecting repository stars

In [40]:
# Searching for all the tags containing repository stars
star_tags=soup.find_all("a", class_="social-count float-none")
len(star_tags)

30

In [41]:
#Collecting all the text in star tags
star_str=list()
for tag in star_tags:
    star_str.append(tag.text.strip())

In [42]:
star_str[:5]

['72.7k', '18.6k', '14.4k', '14k', '12.9k']

In [43]:
#Creating a function to convert the star values from string to integer. For example, "72.6k" will convert to 72600.
def convert(string):
    string=string.strip()
    if string[-1]=="k":
        #round() guards against floating-point results that land just below the intended integer
        return int(round(float(string[:-1])*1000))
    return int(string)

In [44]:
#Creating a list of all the repository stars
repo_stars=list()
for star in star_str:
    count=convert(star)
    repo_stars.append(count)

In [45]:
repo_stars[:5]

[72700, 18600, 14400, 14000, 12900]
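
The conversion above covers the `k` suffix seen on this page. As a sketch of a slightly more defensive variant, the function below uses `decimal.Decimal` to avoid floating-point rounding and also accepts an `m` suffix and thousands separators. Both of those extra formats are assumptions; the notebook output shows only plain numbers with a `k` suffix. `parse_star_count` is a hypothetical name.

```python
from decimal import Decimal

# Only "k" appears in the scraped data; "m" is an assumed extension
SUFFIX_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}

def parse_star_count(text):
    """Convert strings such as "72.7k" or "1,234" to an integer star count."""
    cleaned = text.strip().lower().replace(",", "")
    multiplier = 1
    if cleaned and cleaned[-1] in SUFFIX_MULTIPLIERS:
        multiplier = SUFFIX_MULTIPLIERS[cleaned[-1]]
        cleaned = cleaned[:-1]
    # Decimal arithmetic keeps 72.7 * 1000 exact
    return int(Decimal(cleaned) * multiplier)

parsed = [parse_star_count(s) for s in ["72.7k", "18.6k", "14k", "987"]]
print(parsed)
```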

### Create a CSV file of the extracted information

In [46]:
#Creating a table of repositories
repo_dict={"Repository Name":repo_names, "Username":usernames, "Stars":repo_stars, "URL":repo_urls}
Repo_1_df=pd.DataFrame(repo_dict)
Repo_1_df

Unnamed: 0,Repository Name,Username,Stars,URL
0,three.js,mrdoob,72700,https://github.com/mrdoob/three.js
1,libgdx,libgdx,18600,https://github.com/libgdx/libgdx
2,Babylon.js,BabylonJS,14400,https://github.com/BabylonJS/Babylon.js
3,react-three-fiber,pmndrs,14000,https://github.com/pmndrs/react-three-fiber
4,aframe,aframevr,12900,https://github.com/aframevr/aframe
5,tinyrenderer,ssloy,10900,https://github.com/ssloy/tinyrenderer
6,3d-game-shaders-for-beginners,lettier,10700,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,9600,https://github.com/FreeCAD/FreeCAD
8,zdog,metafizzy,8500,https://github.com/metafizzy/zdog
9,cesium,CesiumGS,7300,https://github.com/CesiumGS/cesium


In [47]:
#Saving the table in a CSV file without the index column
Repo_1_df.to_csv('Repo_1.csv', index=False)
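
A quick way to confirm that the CSV has the intended layout is to read it back. The sketch below writes a two-row sample table to an in-memory buffer instead of a file and loads it again with `pd.read_csv`. The sample rows are copied from the table above, and the names `sample_df` and `restored_df` are hypothetical.

```python
import io

import pandas as pd

sample_df = pd.DataFrame({
    "Repository Name": ["three.js", "libgdx"],
    "Username": ["mrdoob", "libgdx"],
    "Stars": [72700, 18600],
    "URL": [
        "https://github.com/mrdoob/three.js",
        "https://github.com/libgdx/libgdx",
    ],
})

# Write to an in-memory buffer so the example leaves no files behind
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
buffer.seek(0)

# Read the CSV text back and inspect the columns and star counts
restored_df = pd.read_csv(buffer)
print(list(restored_df.columns))
print(restored_df["Stars"].tolist())
```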

# Writing functions to get all the topics

## Function to create a list of topic titles

In [48]:
def get_topics_title(doc):
    topics_title_tags=doc.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")
    topics_title=list()
    for tag in topics_title_tags:
        topics_title.append(tag.text)
    return topics_title

## Function to create a list of topic descriptions

In [49]:
def get_topics_desc(doc):
    topics_desc_tags=doc.find_all("p", class_="f5 color-text-secondary mb-0 mt-1")
    topics_desc=list()
    for tag in topics_desc_tags:
        topics_desc.append(tag.text.strip())
    return topics_desc

## Function to create a list of topic URLs

In [50]:
def get_topics_url(doc):
    topics_url_tags=doc.find_all("a", class_="d-flex no-underline")
    topics_url=list()
    base_url="https://github.com"
    for tag in topics_url_tags:
        url=base_url+tag.get("href")
        topics_url.append(url)
    return topics_url

## Function to create a table of topics

In [51]:
def get_topics_table():
    topics_url="https://github.com/topics"
    response=requests.get(topics_url)
    if response.status_code!=200:
        raise Exception("Failed to load page.")
    soup_1=BeautifulSoup(response.text, "html.parser")
    topics_dict={
        "Title":get_topics_title(soup_1),
        "Description":get_topics_desc(soup_1),
        "URL":get_topics_url(soup_1)
    }
    df=pd.DataFrame(topics_dict)
    # Save the table before returning; a statement placed after return never runs
    df.to_csv('Topics.csv', index=False)
    return df

In [52]:
get_topics_table()

Unnamed: 0,Title,Description,URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


# Writing functions to get repositories of all the topics

## Function to get repository name, username and URL

In [53]:
def get_repo(soup):
    repo_tags=soup.find_all("h3", class_="f3 color-text-secondary text-normal lh-condensed")
    usernames=list()
    repo_names=list()
    repo_urls=list()
    repo=list()
    base_url="https://github.com"
    for tag in repo_tags:
        usernames.append(tag.find_all("a")[0].text.strip())
        repo_names.append(tag.find_all("a")[1].text.strip())
        repo_url=base_url+tag.find_all("a")[1].get("href")
        repo_urls.append(repo_url)
    repo.append(usernames)
    repo.append(repo_names)
    repo.append(repo_urls)
    return repo

## Function to get repository stars

In [54]:
def get_repo_stars(soup):
    star_tags=soup.find_all("a", class_="social-count float-none")
    star_str=list()
    for tag in star_tags:
        star_str.append(tag.text.strip())
    repo_stars=list()
    for star in star_str:
        count=convert(star)
        repo_stars.append(count)
    return repo_stars

## Function to create tables of repository for all topics

In [55]:
import os

In [56]:
import shutil

In [57]:
def get_repos_table():
    print("Scraping through all topics at https://github.com/topics")
    # Project directory on the author's machine; change this to suit your setup
    project_dir="C:\\Users\\shakkhar paul\\Desktop\\Project 1"
    folder="Topics Information"
    os.chdir(project_dir)
    # Fetch the topics while still in the project directory
    topics_df=get_topics_table()
    # Start from an empty output folder on every run, whether or not it already exists
    if os.path.isdir(folder):
        shutil.rmtree(folder)
    os.mkdir(folder)
    os.chdir(os.path.join(project_dir, folder))
    headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
    for index,row in topics_df.iterrows():
        topic_name=row["Title"]
        topic_url=row["URL"]
        print("Scraping through",topic_name,"repositories at",topic_url)
        response=requests.get(topic_url, headers=headers)
        # Treat every 4xx and 5xx status code as a failure
        if response.status_code<200 or response.status_code>=400:
            raise Exception("Failed to load page {} (status code {})".format(topic_url, response.status_code))
        soup_2=BeautifulSoup(response.text, "html.parser")
        # Parse the repository tags once and unpack the three lists
        username_list,repo_name_list,repo_url_list=get_repo(soup_2)
        repo_star_list=get_repo_stars(soup_2)
        repos_dict={"Repository Name":repo_name_list, "Username":username_list, "Stars":repo_star_list, "URL":repo_url_list}
        repos_df=pd.DataFrame(repos_dict)
        fname=topic_name+".csv"
        if os.path.exists(fname):
            print("The file",fname,"already exists. Skipping . . . ")
            continue
        repos_df.to_csv(fname, index=False)
    print("ALL FILES CREATED")
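
Two details of the function above may need care on other machines: it changes the working directory to a hard-coded path, and it uses the topic title directly as a file name. The sketch below shows one way to avoid both, using `pathlib` and a small sanitizing helper. `safe_csv_name` and `prepare_output_dir` are hypothetical names, the topic title "Input/Output" is invented to show a problematic character, and a temporary directory stands in for the project folder.

```python
import re
import tempfile
from pathlib import Path

def safe_csv_name(topic_name):
    """Build a CSV file name, replacing characters Windows forbids in file names."""
    cleaned = re.sub(r'[<>:"/\\|?*]', "_", topic_name).strip()
    return cleaned + ".csv"

def prepare_output_dir(parent, folder="Topics Information"):
    """Create the output folder if needed and return its path, without chdir."""
    out_dir = Path(parent) / folder
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir

# A temporary directory keeps the example from touching real project files
with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = prepare_output_dir(tmp_dir)
    target = out_dir / safe_csv_name("Input/Output")
    target.write_text("Repository Name,Username,Stars,URL\n", encoding="utf-8")
    created = sorted(p.name for p in out_dir.iterdir())

print(created)
```

With this approach, `repos_df.to_csv(out_dir / safe_csv_name(topic_name), index=False)` would write each file to the output folder while the notebook's working directory stays unchanged.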

In [58]:
get_repos_table()

Scraping through all topics at https://github.com/topics
Scraping through 3D repositories at https://github.com/topics/3d
Scraping through Ajax repositories at https://github.com/topics/ajax
Scraping through Algorithm repositories at https://github.com/topics/algorithm
Scraping through Amp repositories at https://github.com/topics/amphp
Scraping through Android repositories at https://github.com/topics/android
Scraping through Angular repositories at https://github.com/topics/angular
Scraping through Ansible repositories at https://github.com/topics/ansible
Scraping through API repositories at https://github.com/topics/api
Scraping through Arduino repositories at https://github.com/topics/arduino
Scraping through ASP.NET repositories at https://github.com/topics/aspnet
Scraping through Atom repositories at https://github.com/topics/atom
Scraping through Awesome Lists repositories at https://github.com/topics/awesome
Scraping through Amazon Web Services repositories at https://github.co

In [59]:
import jovian

In [None]:
jovian.commit(filename='C:\\Users\\shakkhar paul\\Desktop\\Project 1\\Project_1.ipynb')