# Scraping Top Repositories for Topics on GitHub


### Web Scraping

Web scraping refers to the extraction of data from a website. The extracted information is then exported into a format that is more useful to the user, such as a spreadsheet or an API.

### Problem Statement

Scrape the `topics` page of GitHub and create a list of the top repositories for each topic listed on it.

`GitHub` is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere. It contains millions of repositories on various topics related to programming.


### Tools Used

Python, requests, BeautifulSoup and Pandas

## Steps involved in Web Scraping

#### First, Pick a website and describe your objective:

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you would like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter Notebook.

#### Then, Scrape as follows:


1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the **HTML content of the webpage**. For this task, we will use `requests`, a third-party HTTP library for Python.


2. Once we have accessed the HTML content, we are left with the task of **parsing the data**. Since most HTML data is nested, we cannot reliably extract data through simple string processing. We need a parser that can create a nested/tree structure of the HTML data. We'll use `BeautifulSoup` for parsing.


3. Now, all we need to do is **navigate and search the parse tree** that we created, i.e. tree traversal. We'll use `BeautifulSoup` for searching and extraction as well.
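
The parse and search steps above can be tried without any network access. The following is only a minimal sketch using Python's standard-library `html.parser` on a small, made-up HTML string (`SAMPLE_HTML` and `ParagraphCollector` are hypothetical names chosen for illustration). The rest of this notebook uses `requests` and `BeautifulSoup`, which make the same steps much more convenient.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for a downloaded page (step 1 would normally fetch it).
SAMPLE_HTML = """
<html><body>
  <a href="/topics/3d"><p class="title">3D</p></a>
  <a href="/topics/ajax"><p class="title">Ajax</p></a>
</body></html>
"""


class ParagraphCollector(HTMLParser):
    """Collects the text found inside every <p> tag while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Only keep text that appears between <p> and </p>.
        if self.in_p:
            self.paragraphs[-1] += data


collector = ParagraphCollector()
collector.feed(SAMPLE_HTML)      # step 2: parse the HTML
print(collector.paragraphs)      # step 3: use the extracted data
```

Writing such a handler class for every kind of tag quickly becomes tedious, which is one reason to prefer a library that builds a searchable tree.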


### Project Outline:

- We're going to scrape https://github.com/topics.
- We'll get a list of all the topics. For each topic, we'll get topic title, topic page URL and topic description.
- For each topic, we'll get the top repositories listed on the topic page (the first page of a topic listed 20 repositories when this notebook was run).
- For each repository, we'll grab the repo name, username, stars and the repo URL.
- For each topic, we'll create a CSV file in the following format:

````
Repo name,Username,Stars,Repository URL
three.js,mrdoob,88500,https://github.com/mrdoob/three.js
react-three-fiber,pmndrs,21100,https://github.com/pmndrs/react-three-fiber

````

    


## Getting a list of topics on GitHub

### Use the requests library to download the web page

**requests**: The `requests` module allows you to send HTTP requests using Python.

- An HTTP request is made by a client to a named host, which is located on a server. The aim of the request is to access a resource on the server.

- The HTTP request returns a Response object with all the response data (content, encoding, status_code, etc.).

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

Scraping has 3 parts: first you get the webpage, then you parse it (that is, you break it down into its components), and finally you search the parsed page and extract the information you want.

So, first let's get the webpage.

In [3]:
topics_url = 'https://github.com/topics'

Now, we need to download this webpage using the `requests.get()` method. It returns a response object, which can be stored in a variable.
- The response object contains the content, encoding and the status_code.

In [4]:
response = requests.get(topics_url)

Now, to check if the request was successful, we can check the status_code of `response`. If it's between 200 and 299, the request was successful.

**Note:** Every HTTP request that you make to a URL, whether from a browser or from Python, gets a response with a status_code.

In [5]:
response.status_code

200

Okay, so the status_code is 200 which means the request was successful. The webpage is downloaded in the `response`.
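
The "between 200 and 299" rule can be written as a tiny helper. This is only an illustrative sketch; `is_success` is a hypothetical name, not something provided by `requests`. If you prefer an exception over a manual check, `requests` also offers `response.raise_for_status()`, which raises an error for 4xx and 5xx responses.

```python
def is_success(status_code):
    """Return True for any 2xx HTTP status code."""
    return 200 <= status_code <= 299


# Try the helper on a few common status codes.
for code in (200, 204, 301, 404, 500):
    print(code, is_success(code))
```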

To see the content of the webpage, we can use `response.text` (an attribute, not a method). However, it's not a wise idea to display all the content here, as it contains a lot of characters and can slow things down.

In [6]:
character_count = len(response.text)
character_count

152363

That's a lot of characters! 

But, we *can* display some of it here.

In [6]:
page_contents = response.text

In [8]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-719f1193e0c0.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-0c343b529849.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

In [9]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

# This writes the page contents to a local file named webpage.html (UTF-8 encoded, so non-ASCII characters are saved safely).
# When you open the file in an editor, you see the page content written in HTML.
# When you open the file in its rendered form, you see a local copy of the GitHub page.
# Clicking the links on this local copy won't work, as only this one page has been saved locally.

So, we want to grab the right information out of this HTML file. This can be done using `Beautiful Soup`.


### Use Beautiful Soup to parse and extract information

**BeautifulSoup**: The Python library for pulling data out of HTML and XML files.

In [7]:
!pip install beautifulsoup4 --upgrade --quiet

In [8]:
from bs4 import BeautifulSoup

# We installed the beautifulsoup4 package, which provides the bs4 module.
# From the bs4 module, we import the BeautifulSoup class.

In [9]:
doc = BeautifulSoup(page_contents,'html.parser')

# Here, we parse page_contents; the 'html.parser' argument selects Python's built-in HTML parser.
# BeautifulSoup can work with other parsers as well, so we specify which one to use.
# Then, we save the parsed document into 'doc'.

In [12]:
type(doc)

bs4.BeautifulSoup

Thus, `doc` is a BeautifulSoup object.

Now, we can actually find things inside the webpage by using `queries`. Commonly used methods for the same are `find()` and `find_all()`.

*After inspecting for the topic title **'3D'** and navigating through the HTML content for it, we found that it is under a **'p'** tag. So, let's find all the p tags.*

In [12]:
p_tags = doc.find_all('p')

In [15]:
len(p_tags)

67

*We have found a total of 67 p_tags, which is probably far more than the actual number of distinct topics on the webpage. So, we need to be a little more specific than just a 'p' tag.*

In [16]:
p_tags[:5]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Vim
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Vim is a console-run text editor program.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Terraform
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">An infrastructure-as-code tool for building, changing, and versioning infrastructure safely and efficiently.</p>]

*As we can see above, the p tags obtained include not only topic titles but also other content that we don't need.*

*So, to be more specific, we'll also specify the class of the p tag.*

In [10]:
title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p',{'class':title_class})

In [18]:
len(topic_title_tags)

30

In [19]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

*So, now we have tags for just the topic titles.*
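
It helps to know how the class filter used above behaves. The sketch below runs on a small, made-up HTML snippet (`sample_html` is an assumption for illustration, with class names copied from the outputs above) and compares the dictionary filter with a CSS selector passed to `select()`. Treat it as an illustration only: GitHub's class names can change at any time, so any selector may need adjusting.

```python
from bs4 import BeautifulSoup

# Made-up snippet reusing class names seen in the outputs above.
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">Vim</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description.</p>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# A multi-class string is compared against the whole class attribute value.
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# A CSS selector matches any <p> carrying both classes, in any order.
by_selector = sample_doc.select('p.f3.Link--primary')

print([tag.text for tag in exact])
print([tag.text for tag in by_selector])
```

The exact-string filter is stricter, which is what keeps the centred "featured" titles (such as Vim) out of `topic_title_tags`.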

*Now, let's get the tags for topic description.*

In [11]:
description_class = 'f5 color-fg-muted mb-0 mt-1'

topic_desc_tags = doc.find_all('p', {'class' : description_class })

In [21]:
len(topic_desc_tags)

30

*Thus, we have obtained as many tags as the number of topics. So, probably we have got tags for all the descriptions.*

In [22]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

*Now, we've got the topic title tags and topic description tags. One more thing that we need is the topic page url because from that page, we'll get more things.*

*To obtain the topic page url tag, we have 2 ways: we can either fetch the parent of topic_title_tag which contains the required href or we can fetch the 'a' tag containing the href.*

In [23]:
topic_title_tags0 = topic_title_tags[0]

In [35]:
topic_title_tags0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [37]:
topic_title_tags0.parent

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>

In [12]:
topic_url_tags = doc.find_all('a',{'class': 'no-underline flex-1 d-flex flex-column'})

In [31]:
len(topic_url_tags)

30

In [38]:
topic_url_tags[:5]

[<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/algorithm">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/amphp">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>
 <p class="f

In [17]:
topic_url_tags[0]['href']

# Here, we extract the 'href' attribute from the 'a' tag for first topic.

'/topics/3d'

In [18]:
'https://github.com' + topic_url_tags[0]['href']

'https://github.com/topics/3d'

In [19]:
topic0_url = 'https://github.com' + topic_url_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


*So, we've constructed the url to the topic page of first topic.*

*Now that we've obtained the tags for all the 3 things - topic_title, topic_description and topic_url, we'll extract the relevant information from these tags.*

In [46]:
# To get the title of the topic from topic_title_tags, we'll use the .text attribute. It is available on a single tag, not on the list of tags.
# Here, we obtain the topic title from the first tag.

topic_title_tags[0].text

'3D'

In [13]:
# Here, we create an empty list 'topic_titles'.
# Then, we run a for loop over topic_title_tags and for each tag, we extract the topic_title(tag.text) and add it to the list topic_titles.
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
topic_titles[:5]


['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [13]:
# We are doing the same for topic_description tags as well to extract the topic descriptions.

topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text)
    
topic_descs[:5]

['\n          3D modeling is the process of virtually developing the surface and structure of a 3D object.\n        ',
 '\n          Ajax is a technique for creating interactive web applications.\n        ',
 '\n          Algorithms are self-contained sequences that carry out a variety of tasks.\n        ',
 '\n          Amp is a non-blocking concurrency library for PHP.\n        ',
 '\n          Android is an operating system built by Google designed for mobile devices.\n        ']

In [14]:
# We can see above that there is whitespace (newlines and spaces) at the beginning and end of these descriptions.
# An easy fix for this is to use the strip() method. It removes leading and trailing whitespace from a string.

topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']
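
The same clean-up can also be written as a list comprehension, which is a common Python idiom for "build a list by transforming each item". A minimal sketch on plain strings (`raw_descs` is made-up sample data copied from the output above):

```python
# Raw descriptions as they come out of the tags, with surrounding whitespace.
raw_descs = [
    '\n          Ajax is a technique for creating interactive web applications.\n        ',
    '\n          Amp is a non-blocking concurrency library for PHP.\n        ',
]

# One expression replaces the empty list, the for loop and the append call.
clean_descs = [desc.strip() for desc in raw_descs]
print(clean_descs)
```

With real tags, the equivalent would be `[tag.text.strip() for tag in topic_desc_tags]`.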

*Let's do the same for topic_url_tags as well. But here, we want the 'href' attribute and not the text.*

In [19]:
topic_urls = []

for tag in topic_url_tags:
    topic_urls.append(tag['href'])
    
topic_urls[:5]

['/topics/3d',
 '/topics/ajax',
 '/topics/algorithm',
 '/topics/amphp',
 '/topics/android']

In [15]:
# Above, we didn't get the complete url but just a part of it.
# So, we can add a base url to get the complete link.

topic_urls = []
base_url = "https://github.com"

for tag in topic_url_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls[:5]


['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
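
String concatenation works here because every `href` starts with a slash. As a hedged alternative, the standard library's `urllib.parse.urljoin` handles both relative and already-absolute links. The `hrefs` list below is made-up sample data:

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Two relative links and one link that is already absolute.
hrefs = ['/topics/3d', '/topics/ajax', 'https://github.com/topics/android']

# urljoin resolves each href against the base URL.
full_urls = [urljoin(base_url, href) for href in hrefs]
print(full_urls)
```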

*Now, we have `3 lists` one each for topic `title`, `description` and `url`. And now we want to create a table or a data frame from these lists.* 

*So, we'll use the `pandas` library of Python to create the data frame.*

*Steps to create a data frame from a bunch of lists are as follows:*

- Create a dictionary from the lists. Each key is a column name and each value is the corresponding list.

- Pass this dictionary to the `pd.DataFrame()` constructor of pandas to create a table out of the lists.
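
A data frame built this way needs all lists to have the same length, so it can be worth checking that before building the table (a mismatch usually means one of the class filters missed or over-matched some tags). The helper below is a hypothetical sketch, not part of pandas:

```python
def build_columns(**columns):
    """Return the columns as a dict after checking that all lists are equally long."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns


# Made-up sample lists standing in for topic_titles and topic_urls.
sample_dict = build_columns(
    title=['3D', 'Ajax'],
    url=['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(sample_dict)
```

The returned dictionary can then be passed to `pd.DataFrame()` as described in the steps above.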




In [16]:
!pip install pandas --quiet

In [17]:
import pandas as pd

In [18]:
topics_dict = {
    'title':topic_titles,
    'description':topic_descs,
    'url':topic_urls
}


In [19]:
topics_df = pd.DataFrame(topics_dict)

In [20]:
topics_df

|  | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


### Create CSV file with the extracted information

In [67]:
topics_df.to_csv('topics.csv',index = None)

# This creates a CSV file; index = None leaves the row index out of the file (the documented form is index = False).
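
To see what `to_csv()` actually writes, and to confirm that the data can be read back, here is a small round-trip sketch. It uses an in-memory buffer instead of a file and a made-up two-row data frame (`sample_df`), so it does not touch `topics.csv`:

```python
import io

import pandas as pd

# Made-up data frame with the same kind of columns as topics_df.
sample_df = pd.DataFrame({
    'title': ['3D', 'Ajax'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

# Write the CSV text into an in-memory buffer, without the row index.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the CSV text back into a new data frame.
restored_df = pd.read_csv(io.StringIO(csv_text))
print(restored_df.columns.tolist())
```

Because the index was left out, the restored data frame has only the `title` and `url` columns.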

*So, here we are done with the first step of the project. Next, we have to extract information from each individual topic page.
Let's do this!*

## Getting information out of a topic page

*So, we'll repeat the same steps as above for an individual topic page.*

*First, we'll download the web page. Then, we'll parse it. And, finally we'll extract the desired information from it one by one.*

In [32]:
topic_page_url = topic_urls[0]

In [33]:
topic_page_url

'https://github.com/topics/3d'

In [34]:
response = requests.get(topic_page_url)

In [35]:
response.status_code

200

In [36]:
len(response.text)

453550

In [37]:
topic0_doc = BeautifulSoup(response.text,'html.parser')

*Okay, so now we have the parsed document and we can navigate through it to pick the right information.*

*We want the user_name which is under an 'a' tag. However, there are other 'a' tags which contain information other than the user_name, like repo_name. So, we'll pick the 'h3' tag under which the user_name tag is. This way, we can pick the repo_name tag as well which is under the same 'h3' tag.*

In [38]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'

repo_tags = topic0_doc.find_all('h3',{'class': h3_selection_class})

In [39]:
len(repo_tags)

20

*We've got 20 repo_tags, which means there are 20 repositories listed under the topic '3D' on this web page.*
*Each repository tag contains 'a' tags containing the user_name and repo_name. The 'a' tag containing repo_name also contains the repo_url.*

In [40]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [35]:
a_tags = repo_tags[0].find_all('a')

In [36]:
# Getting the user_name

a_tags[0].text.strip()

'mrdoob'

In [37]:
# Getting the repo_name

a_tags[1].text.strip()

'three.js'

In [31]:
# Getting the repo_url

base_url = 'https://github.com'

repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


*So, we've been able to construct the URL to the repository page.*

*Now, we are left with the star count of the repository. Let's do it!*

In [41]:
star_class = 'Counter js-social-count'
star_tags = topic0_doc.find_all('span',{'class' : star_class})

In [67]:
len(star_tags)

20

*So, we have got 20 star_tags which is equal to the number of repositories on this page. Thus, we have got all the tags containing number of stars.*

In [71]:
star_tags[0].text

'88.5k'

*Next, we have to define a function that converts a star count written with a 'k' suffix (thousands) into a number.*

In [21]:
# Defining a function 'stars_into_number'

def stars_into_number(stars_str):      # the function takes the stars string as an argument
    stars_str = stars_str.strip()      # when converting strings, it's better to strip them first
    if stars_str[-1] == 'k':           # if the last character of the string is 'k'
        # take all the characters before the last one, convert them into a float,
        # multiply by 1000 and convert the result into an integer
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)              # if not, just convert the string into an integer

*Following is a step-by-step breakdown of how this function is defined:*

In [34]:
stars_str = '69.7k'

In [35]:
stars_str.strip()

'69.7k'

In [41]:
stars_str[-1]              # [-1] returns the last character

'k'

In [42]:
stars_str[:-1]             # [:-1] returns all the characters before the last character

'69.7'

In [44]:
float(stars_str[:-1]) * 1000

69700.0

In [45]:
int(float(stars_str[:-1]) * 1000)

69700

In [49]:
stars_into_number(star_tags[0].text)

88600
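
The function above covers counts such as `'88.5k'` and plain integers. As a hedged sketch of a slightly more defensive variant, the hypothetical `parse_star_count` below also accepts an `'m'` suffix and thousands separators. Both are assumptions about formats the page might use, not something observed in this notebook. It uses `round()` instead of `int()` so that a floating-point product landing just below a whole number is not truncated.

```python
def parse_star_count(stars_str):
    """Convert strings like '88.5k', '1.2m' or '1,234' into an integer."""
    text = stars_str.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}   # the 'm' suffix is an assumption
    if text and text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


for sample in ('88.5k', ' 950 ', '1.2m', '1,234'):
    print(sample, parse_star_count(sample))
```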

*Okay, so far we've been figuring out where all the required information, i.e., user_name, repo_name, repo_url and star_count, is present inside the parsed `topic0_doc`. And, we've worked with a particular repo tag, `repo_tags[0]`, to understand the methods to extract them.*

*Now, our end goal is to create a data frame containing the above 4 details for all the repositories inside a topic.*

- First, we need lists containing these details. 4 lists for 4 columns.
- Then, we create a dictionary of these lists.
- And, finally we pass the dictionary to pd.DataFrame() to create the table.

*To create the lists, we need to first define a function that will retrieve the 4 details for each repository in a topic.*

In [22]:
#Defining a function that returns all the info about a repository
def get_repo_info(repo_tag,star_tag):
    a_tags = repo_tag.find_all('a')
    user_name = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = stars_into_number(star_tag.text)
    return user_name, repo_name, stars , repo_url
    

In [43]:
# Let's call this function to get all the info about the first repo under the topic '3D'

get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 88600, 'https://github.com/mrdoob/three.js')

*Okay, so the function is working fine.*

*Now, we need to get the info for all the repositories in this topic and form lists of the same. Doing this manually would be really cumbersome and time-consuming, so we'll run a `for loop` that will do the job for us.*

In [44]:
user_names = []
repo_names = []
stars_counts = []
repo_urls = []


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    user_names.append(repo_info[0])
    repo_names.append(repo_info[1])
    stars_counts.append(repo_info[2])
    repo_urls.append(repo_info[3])
    

In [42]:
user_names[:5],repo_names[:5],stars_counts[:5],repo_urls[:5]

(['mrdoob', 'pmndrs', 'libgdx', 'BabylonJS', 'ssloy'],
 ['three.js', 'react-three-fiber', 'libgdx', 'Babylon.js', 'tinyrenderer'],
 [88600, 21200, 21100, 19300, 15900],
 ['https://github.com/mrdoob/three.js',
  'https://github.com/pmndrs/react-three-fiber',
  'https://github.com/libgdx/libgdx',
  'https://github.com/BabylonJS/Babylon.js',
  'https://github.com/ssloy/tinyrenderer'])
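
The four `append` calls can also be expressed by collecting the tuples first and then "transposing" them with `zip(*...)`. The sketch below uses made-up tuples copied from the output above in place of real `get_repo_info()` results:

```python
# Each tuple has the shape returned by get_repo_info():
# (user_name, repo_name, stars, repo_url)
repo_infos = [
    ('mrdoob', 'three.js', 88600, 'https://github.com/mrdoob/three.js'),
    ('pmndrs', 'react-three-fiber', 21200, 'https://github.com/pmndrs/react-three-fiber'),
    ('libgdx', 'libgdx', 21100, 'https://github.com/libgdx/libgdx'),
]

# zip(*repo_infos) groups the first items together, then the second items, and so on.
user_names, repo_names, stars_counts, repo_urls = (
    list(column) for column in zip(*repo_infos)
)

print(user_names)
print(stars_counts)
```

In the same spirit, `for repo_tag, star_tag in zip(repo_tags, star_tags)` avoids indexing with `range(len(...))`. Note that `zip` stops at the shorter list, so it is worth checking that both lists have the same length first.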

*Now, we'll create a dictionary of these lists like we did before.*

In [45]:
topic0_repos_dict = {
    'username':user_names,
    'repo_name':repo_names,
    'stars':stars_counts,
    'repo_url':repo_urls
}

In [46]:
topic0_repos_df = pd.DataFrame(topic0_repos_dict)

In [47]:
topic0_repos_df

|  | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 88600 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 21200 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 21100 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 19300 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 15900 | https://github.com/ssloy/tinyrenderer |
| 5 | aframevr | aframe | 15000 | https://github.com/aframevr/aframe |
| 6 | lettier | 3d-game-shaders-for-beginners | 14400 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 13100 | https://github.com/FreeCAD/FreeCAD |
| 8 | CesiumGS | cesium | 9800 | https://github.com/CesiumGS/cesium |
| 9 | metafizzy | zdog | 9600 | https://github.com/metafizzy/zdog |


*Yay! So we've created a dataframe for all the repositories of the topic `3D`.*

## Getting information out of all the topic pages


*All this while, we've been working with the topic `3D`. But, we wanna do this not just for the topic '3D' but for all the topics.*

*So, we'll do it in the following steps:*


- Create a function *get_topic_page()* that will receive the topic_url, download the web page and return a parsed version of it.
- Define another function *get_repo_info()* that fetches all the repo information, like username, repo_name, etc., upon receiving a repo_tag and a star_tag.
- Define one more function, *get_topic_repos()*, that will receive the parsed doc, fetch all the repo_tags and star_tags from it and then call the get_repo_info() function repeatedly using a `for loop` to create the 4 lists containing the 4 details of the repos.

In [23]:
#Getting topic page
def get_topic_page(topic_url):
    #Download the page
    response = requests.get(topic_url)
    #if the request fails, then we raise an exception   
    if response.status_code != 200:  
        raise Exception('Failed to load page {}'.format(topic_url))
    #if not, then we parse the contents of the downloaded web page       
    topic_doc = BeautifulSoup(response.text,'html.parser') 
    return topic_doc


#Defining a function that will get all the repo info
def get_repo_info(repo_tag,star_tag):
    a_tags = repo_tag.find_all('a')
    user_name = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = stars_into_number(star_tag.text)
    return user_name, repo_name, stars , repo_url
    
    
#Getting topic repositories repo_tags and star_tags
def get_topic_repos(topic_doc):
    
    #then, we fetch the repo_tags and star_tags from the parsed doc

    #getting the repo_tags
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed' 
    repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class}) 
    
    #getting the star_tags
    star_class = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span',{'class' : star_class}) 
    
    #Creating a dictionary containing the columns
    topic_repos_dict = {'username' : [],'repo_name': [],'stars':[],'repo_url':[]}
    
    #Putting the repo information into dictionary lists
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    #Returning a data frame from this dictionary
    return pd.DataFrame(topic_repos_dict)

#Indentation is really important in Python. Be careful with it.
 

In [49]:
url4 = topic_urls[4]
url4

'https://github.com/topics/android'

In [50]:
topic4_doc = get_topic_page(url4)

In [51]:
type(topic4_doc)

bs4.BeautifulSoup

In [52]:
topic4_df = get_topic_repos(topic4_doc)

In [53]:
topic4_df

|  | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 150000 | https://github.com/flutter/flutter |
| 1 | facebook | react-native | 107000 | https://github.com/facebook/react-native |
| 2 | justjavac | free-programming-books-zh_CN | 99400 | https://github.com/justjavac/free-programming-... |
| 3 | Genymobile | scrcpy | 76100 | https://github.com/Genymobile/scrcpy |
| 4 | Hack-with-Github | Awesome-Hacking | 60900 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 5 | google | material-design-icons | 47500 | https://github.com/google/material-design-icons |
| 6 | wasabeef | awesome-android-ui | 45300 | https://github.com/wasabeef/awesome-android-ui |
| 7 | Solido | awesome-flutter | 45100 | https://github.com/Solido/awesome-flutter |
| 8 | square | okhttp | 43500 | https://github.com/square/okhttp |
| 9 | android | architecture-samples | 42200 | https://github.com/android/architecture-samples |


*We can achieve the above result in a single line of code by nesting functions.*

In [37]:
get_topic_repos(get_topic_page(topic_urls[4]))

|  | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 150000 | https://github.com/flutter/flutter |
| 1 | facebook | react-native | 107000 | https://github.com/facebook/react-native |
| 2 | justjavac | free-programming-books-zh_CN | 99300 | https://github.com/justjavac/free-programming-... |
| 3 | Genymobile | scrcpy | 76100 | https://github.com/Genymobile/scrcpy |
| 4 | Hack-with-Github | Awesome-Hacking | 60900 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 5 | google | material-design-icons | 47500 | https://github.com/google/material-design-icons |
| 6 | wasabeef | awesome-android-ui | 45300 | https://github.com/wasabeef/awesome-android-ui |
| 7 | Solido | awesome-flutter | 45100 | https://github.com/Solido/awesome-flutter |
| 8 | square | okhttp | 43500 | https://github.com/square/okhttp |
| 9 | android | architecture-samples | 42200 | https://github.com/android/architecture-samples |


*Here we go!*

*Let's do the same for another topic.*

In [38]:
get_topic_repos(get_topic_page(topic_urls[5]))

|  | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | justjavac | free-programming-books-zh_CN | 99300 | https://github.com/justjavac/free-programming-... |
| 1 | angular | angular | 86100 | https://github.com/angular/angular |
| 2 | storybookjs | storybook | 76400 | https://github.com/storybookjs/storybook |
| 3 | leonardomso | 33-js-concepts | 54700 | https://github.com/leonardomso/33-js-concepts |
| 4 | ionic-team | ionic-framework | 48600 | https://github.com/ionic-team/ionic-framework |
| 5 | prettier | prettier | 44800 | https://github.com/prettier/prettier |
| 6 | Asabeneh | 30-Days-Of-JavaScript | 32800 | https://github.com/Asabeneh/30-Days-Of-JavaScript |
| 7 | SheetJS | sheetjs | 32100 | https://github.com/SheetJS/sheetjs |
| 8 | angular | angular-cli | 25900 | https://github.com/angular/angular-cli |
| 9 | angular | components | 23300 | https://github.com/angular/components |


*Let's create a CSV file for this topic 'angular'.*

In [67]:
get_topic_repos(get_topic_page(topic_urls[5])).to_csv('angular.csv',index = None)

*This creates the CSV file of top repositories for the topic 'angular'.*

## Putting it all together

*Now, we'll clean things up and put it all together. So, we'll write a few functions to:*

1. Get a list of topics from the topics page.
2. Get a list of top repos from the individual topic pages.
3. Create a CSV file containing info of the top repos for each topic.

In [24]:
def get_topic_titles(doc):
    title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p',{'class':title_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    description_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class' : description_class })
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_url_tags = doc.find_all('a',{'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = "https://github.com"
    for tag in topic_url_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

#This function scrapes the topic page to get a list of all the topics    
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:  
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text,'html.parser') 
    
    topics_dict = {'title': get_topic_titles(doc),'description': get_topic_descs(doc),'url': get_topic_urls(doc)}
    topics_df = pd.DataFrame(topics_dict)
    topics_df.to_csv('topics.csv',index = None)
    return topics_df

    

*Let's try to run this function and see if it works.*

In [71]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


*Yay!!! It worked absolutely fine. We got the list of topics in a single data frame.* 

*Next, we'll create a function that takes a topic's URL and name, builds the data frame of its top repositories with get_topic_repos(), and creates a CSV file out of it.*

In [25]:
#Creating a CSV file from this dataframe    
def scrape_a_topic(topic_url,topic_name):
    #for this topic, create a data frame by calling the function get_topic_repos
    topic_repos_df = get_topic_repos(get_topic_page(topic_url))
    #then, create a csv file from this data frame
    topic_repos_df.to_csv(topic_name + '.csv', index = None)

*Now, we'll need to call this function scrape_a_topic() for each topic in the topics_df, which basically means iterating over each row.*

*Iteration over each row* means we want to perform an action for each row. For example, here we want to print the `title` and `url` for each row in topics_df.

In [75]:
for index, rows in topics_df.iterrows():
    print(rows['title'], rows['url'])

3D https://github.com/topics/3d
Ajax https://github.com/topics/ajax
Algorithm https://github.com/topics/algorithm
Amp https://github.com/topics/amphp
Android https://github.com/topics/android
Angular https://github.com/topics/angular
Ansible https://github.com/topics/ansible
API https://github.com/topics/api
Arduino https://github.com/topics/arduino
ASP.NET https://github.com/topics/aspnet
Atom https://github.com/topics/atom
Awesome Lists https://github.com/topics/awesome
Amazon Web Services https://github.com/topics/aws
Azure https://github.com/topics/azure
Babel https://github.com/topics/babel
Bash https://github.com/topics/bash
Bitcoin https://github.com/topics/bitcoin
Bootstrap https://github.com/topics/bootstrap
Bot https://github.com/topics/bot
C https://github.com/topics/c
Chrome https://github.com/topics/chrome
Chrome extension https://github.com/topics/chrome-extension
Command line interface https://github.com/topics/cli
Clojure https://github.com/topics/clojure
Code quality h

*So, now we know how to fetch the topic_name and topic_url from each row in topics_df and perform the function scrape_a_topic() for each one of them by iterating over each row in the data frame.*

### Mega Function

*Now, let's write our mega function that will scrape the list of top repositories from all these topics.*

In [26]:
def scrape_topics_repos():
    print('Scraping list of topics from GitHub')
    #first, we get a dataframe containing a list of all the topics using the function scrape_topics() and create a CSV of it
    topics_df = scrape_topics() 
    
    #then, we get the individual topic_name and topic_url from this topics_dataframe for each topic by iterating over each row
    
    
    for index,rows in topics_df.iterrows(): #we can iterate over the rows of a data frame using this syntax
        print('Scraping top repositories for "{}"'.format(rows['title']))
        scrape_a_topic(rows['url'], rows['title'])

In [93]:
scrape_topics_repos()

Scraping list of topics from GitHub
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Cloj

*Yayy! It's done!*

*We created the mega function that scrapes the topics page, then scrapes each individual topic page and finally creates a list of top repositories for each topic.*

### Mega Function with a twist

*Now, we want to put all these files into a single folder. For that, we need to make a few changes in our functions.*

- import `os`
- the second argument of scrape_a_topic() function will be a `path`
- create a folder using `os.makedirs` in the function scrape_topics_repos()
- the 'path' argument for scrape_a_topic() function when it is called inside scrape_topics_repos() function will be a `folder_name/topic_name.csv`
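
The changes listed above can be sketched in isolation as follows. This is only a minimal illustration, not the notebook's final code: the helper name `build_csv_path` and its default folder name `data` are assumptions made for this example, and `os.path.join` is used here instead of string formatting so that the path separator suits the operating system.

```python
import os

def build_csv_path(topic_name, folder_name='data'):
    # Hypothetical helper: make sure the folder exists, then
    # return the path folder_name/topic_name.csv
    os.makedirs(folder_name, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder_name, '{}.csv'.format(topic_name))
```

On Linux or macOS, `build_csv_path('Angular')` would create the folder `data` if it is missing and return `data/Angular.csv`, which could then be passed as the `path` argument of scrape_a_topic().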

*After making the required changes, we have the following functions as our final code:*

## Final Code

In [27]:
def get_topic_titles(doc):
    title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p',{'class':title_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    description_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class' : description_class })
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_url_tags = doc.find_all('a',{'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = "https://github.com"
    for tag in topic_url_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

#This function scrapes the topic page to get a list of all the topics    
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:  
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text,'html.parser') 
    
    topics_dict = {'title': get_topic_titles(doc),'description': get_topic_descs(doc),'url': get_topic_urls(doc)}
    topics_df = pd.DataFrame(topics_dict)
    topics_df.to_csv('topics.csv',index = None)
    return topics_df

    

In [28]:
#Creating a CSV file from this dataframe    

def scrape_a_topic(topic_url,path):
    #for this topic, create a data frame by calling the function get_topic_repos
    topic_repos_df = get_topic_repos(get_topic_page(topic_url))
    #then, create a csv file from this data frame
    topic_repos_df.to_csv(path, index = None)

In [31]:
import os
def scrape_topics_repos():
    print('Scraping list of topics from GitHub')
    #first, we get a dataframe containing a list of all the topics using the function scrape_topics() and create a CSV of it
    topics_df = scrape_topics() 
    #then, we get the individual topic_name and topic_url from this topics_dataframe for each topic by iterating over each row
    #we want to put all the CSV files into a single folder, so let's create a folder before we start scraping the repo list
    os.makedirs('data', exist_ok = True)
    for index,rows in topics_df.iterrows(): #we can iterate over the rows of a data frame using this syntax
        print('Scraping top repositories for "{}"'.format(rows['title']))
        scrape_a_topic(rows['url'],'data/{}.csv'.format(rows['title']))

In [32]:
scrape_topics_repos()

Scraping list of topics from GitHub
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Cloj

*It worked absolutely fine. We have obtained topics.csv and the CSV files for all the topics inside the folder `data`.*

## References and Future Work

### Summary

- We scraped the `topics` page on GitHub and created a list of all the topics.
- Then, we created a list of top repositories for a single topic.
- Then, we wrote a function that can create the list for any topic.
- Then, we created a mega function that can scrape the topics page, create the list of topics and create the list of top repositories for all the topics.
- And, finally we made a few changes in the mega function so that all the created lists can be saved into a single folder.

### References

- The web page we scraped: https://github.com/topics
- requests documentation: https://requests.readthedocs.io/
- BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- A website that helped me with the Python coding: https://www.geeksforgeeks.org/

### Ideas for Future Work

- We have scraped just the first page of topics on GitHub, which contains about 30 topics. There are many more pages of topics that can be scraped.
- The second page can be obtained by appending `?page=2` to the URL of the topics page, and similarly for the rest of the pages.
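
The pagination idea above could be sketched like this. It is only an illustration under stated assumptions: the helper `get_topics_page_url` and the constant `TOPICS_URL` are hypothetical names introduced here, and the sketch only builds the page URLs. Each URL would still need to be fetched and parsed with the functions defined earlier.

```python
TOPICS_URL = 'https://github.com/topics'

def get_topics_page_url(page_number):
    # Hypothetical helper: page 1 is the plain topics URL,
    # later pages add the ?page=N query string
    if page_number < 1:
        raise ValueError('page_number must be 1 or greater')
    if page_number == 1:
        return TOPICS_URL
    return '{}?page={}'.format(TOPICS_URL, page_number)

# Build the URLs of the first three pages of topics
page_urls = [get_topics_page_url(n) for n in range(1, 4)]
print(page_urls)
```

How many pages exist is not known in advance, so a real loop would also need a stopping condition, for example stopping when a page returns no topics.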
