<h1 style="color:Orange;font-size:170%;"> Scraping Top Repositories for Topics on GitHub </h1>

<img src="https://i.imgur.com/6zM7JBq.png" width="1000">

Web scraping is the process of extracting and parsing data from websites automatically with a computer program. It's a useful technique for creating datasets for research and learning. GitHub is a web-based version-control and collaboration platform where software developers can upload their own code and projects. To scrape it we will download the HTML pages with the `requests` library and parse them with the Python library [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). After downloading a page we will extract the content we are looking for, store it in a dataframe using [pandas](https://pandas.pydata.org/docs/), and save the collected data in CSV files.
The page we will scrape is [https://github.com/topics](https://github.com/topics).


## Outline:
These are the steps we will follow:
- We're going to scrape [https://github.com/topics](https://github.com/topics)
- We'll get a list of topics. For each topic, we'll get the topic title, page URL, and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page.
- For each repository we'll grab the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
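
As a rough illustration of this target format, the sketch below writes the two sample rows with Python's standard `csv` module into an in-memory buffer. The `rows` list and the use of `io.StringIO` are illustrative assumptions; the notebook itself writes its files with pandas.

```python
import csv
import io

# Sample rows taken from the format shown above
rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Repo Name", "Username", "Stars", "URL"])  # header line
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```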


In [92]:
import jovian
project_name='Scraping Top Repositories for Topics on GitHub'
jovian.commit(project=project_name)

<IPython.core.display.Javascript object>

[jovian] Updating notebook "desousa-andreas/scraping-top-repositories-for-topics-on-github" on https://jovian.ai/
[jovian] Committed successfully! https://jovian.ai/desousa-andreas/scraping-top-repositories-for-topics-on-github


'https://jovian.ai/desousa-andreas/scraping-top-repositories-for-topics-on-github'

## Scrape the list of topics from GitHub

The first step is to download and parse the web page. We will do this several times, so it is convenient to write a function that takes a URL as input and returns a BeautifulSoup object containing the parsed HTML.
The steps are as follows:
- Use Requests to download the HTML
- Use bs4 to parse the HTML
- Return a BeautifulSoup object


In [93]:
import requests
from bs4 import BeautifulSoup

def get_topic_page(topic_url):
    """
    Download the web page at the given URL and return it as a parsed BeautifulSoup document.
    Libraries needed:
        import requests
        from bs4 import BeautifulSoup
    :param topic_url: address of the page to scrape
    :return: BeautifulSoup doc
    """

    # download the page
    r = requests.get(topic_url)
    # check response
    if r.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # parse the HTML with Beautiful Soup
    topic_doc = BeautifulSoup(r.text, 'html.parser')
    return topic_doc

Let's try out our function, check the output type and try to find the `a` tag for testing.

In [94]:
url = 'https://github.com/topics'
doc = get_topic_page(url)
type(doc)

bs4.BeautifulSoup

In [95]:
doc.find('a')

<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

Let's create some helper functions to parse information from the topics page.

`get_topic_title` returns the list of topic titles. To get the titles we select the `p` tags with `class="f3 lh-condensed mb-0 mt-1 Link--primary"`, as shown in the image.
We use the browser's inspect tool to find the right tag and class.

![](https://i.imgur.com/ONm9jvi.png)



`get_topic_title` takes a BeautifulSoup object as input, searches it for the title tags, and appends each topic title to a list.

In [96]:
def get_topic_title(doc):
    """
    Retrieve the titles of the topics.
    :param doc: BeautifulSoup object
    :return: a list of title strings
    """

    selected_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', class_=selected_class)

    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)

    return topic_titles


Let's try out our function.

In [97]:
titles = get_topic_title(doc)
len(titles), titles[:10]

(30,
 ['3D',
  'Ajax',
  'Algorithm',
  'Amp',
  'Android',
  'Angular',
  'Ansible',
  'API',
  'Arduino',
  'ASP.NET'])
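
One detail worth knowing about the class-based lookup in `get_topic_title`: when the class string contains several space-separated classes, Beautiful Soup matches it against the exact value of the `class` attribute, so the same classes in a different order do not match. The sketch below shows this on a small hand-written HTML snippet (not GitHub's real markup) and compares it with a CSS selector via `select`, which ignores order. The snippet and variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only; the second title lists its classes in a different order
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
"""
snippet = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: order matters
exact = [p.text for p in snippet.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")]

# CSS selector: matches any <p> carrying both classes, in any order
by_selector = [p.text for p in snippet.select("p.f3.Link--primary")]

print(exact)
print(by_selector)
```

If GitHub reorders or renames these utility classes, the exact-string lookup returns an empty list, so it is worth checking the length of the result after each scrape.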

As we can see, we have a list of 30 titles. Similarly, we write two more functions: `get_topic_descs` to get the topic descriptions and `get_topic_url` to get the topic URLs.

In [98]:
def get_topic_descs(doc):
    """
    Retrieve the descriptions of the topics.
    :param doc: BeautifulSoup object
    :return: a list of description strings
    """

    selected_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', class_=selected_class)
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())

    return topic_descs

In [99]:
descs = get_topic_descs(doc)
descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [100]:
def get_topic_url(doc):
    """
    Retrieve the URLs of the topics.
    :param doc: BeautifulSoup object
    :return: a list of absolute URL strings
    """

    selected_class = 'd-flex no-underline'
    topic_link_tags = doc.find_all('a', class_=selected_class)
    topic_urls = []
    base_url = 'https://github.com'

    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])

    return topic_urls

In [101]:
topic_urls = get_topic_url(doc)
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
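
String concatenation works here because the `href` values returned for these links are site-relative paths. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with links that are already absolute. The `hrefs` list below is made-up sample input.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# Made-up sample hrefs: two relative paths and one link that is already absolute
hrefs = ["/topics/3d", "/topics/ajax", "https://github.com/topics/android"]

topic_urls_joined = [urljoin(base_url, href) for href in hrefs]
print(topic_urls_joined)
```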

Let's put it all together in a single function called `scrape_topics`. The titles, descriptions, and URLs are stored in a dictionary called `topics_dict`, which is then converted to a pandas DataFrame so it is easy to work with.

In [102]:
import pandas as pd

def scrape_topics(url):
    """
    Scrape the topics page at url and return a DataFrame with one row per topic.
    """
    # Parse the page at the requested URL rather than relying on a global `doc`
    doc = get_topic_page(url)

    topics_dict = {
        'title': get_topic_title(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_url(doc)
    }
    return pd.DataFrame(topics_dict)

Let's try out our function.

In [103]:
topics_df = scrape_topics(url)
topics_df.head()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


In [104]:
type(topics_df)

pandas.core.frame.DataFrame
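
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the class selectors stops matching after a page redesign. A minimal sketch of an explicit check is shown below; `build_topic_rows` is a hypothetical helper, not part of the notebook.

```python
def build_topic_rows(titles, descriptions, urls):
    """Combine three parallel lists into one dict per topic, failing early on a length mismatch."""
    lengths = {"title": len(titles), "description": len(descriptions), "url": len(urls)}
    if len(set(lengths.values())) != 1:
        raise ValueError("Scraped lists have different lengths: {}".format(lengths))
    return [
        {"title": t, "description": d, "url": u}
        for t, d, u in zip(titles, descriptions, urls)
    ]

# Sample input shaped like the lists returned by the helper functions
rows = build_topic_rows(
    ["3D", "Ajax"],
    ["3D modeling ...", "Ajax is a technique ..."],
    ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
)
print(rows[0]["title"], rows[0]["url"])
```

A list of dicts like this can be passed directly to `pd.DataFrame`.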

## Get the top 25 repositories from a topic page

Now we can scrape the top 25 repositories from a topic page. First we use `get_topic_page` to download and parse the page.

Let's test the function

In [106]:
topic_doc = get_topic_page('https://github.com/topics')

The next function converts a repository's star count from a string such as `'76.3k'` to an integer.

In [107]:
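def parse_star_count(stars_str):
    # Work on the stripped string throughout so surrounding whitespace cannot break the conversion
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        # '76.3k' -> 76300; round() guards against floating-point truncation
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)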


Let's see how the function works. The next lines of code retrieve the number of stars. As we can see in the image, the tag is `a` and the class is `social-count js-social-count`.

![](https://i.imgur.com/7howgRz.png)

In [108]:
# Retrieve 3D topic
r = requests.get('https://github.com/topics/3d')
topic_doc = BeautifulSoup(r.text, 'html.parser')

# Define a selection tag
a_selection_class = 'social-count js-social-count'
star_tags = topic_doc.findAll('a',class_ = a_selection_class)
star_tags[0].text.strip()

'76.3k'

In [109]:
parse_star_count(star_tags[0].text.strip())

76300
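
GitHub abbreviates large counts, and the function above handles the `k` suffix. The sketch below is a slightly more defensive variant, assuming counts might also carry an `m` suffix or thousands separators (an assumption, since only `k` values appear in this notebook). `parse_count` and `SUFFIXES` are hypothetical names, and `Decimal` is used to avoid floating-point rounding.

```python
from decimal import Decimal

# Multipliers for the suffixes we assume may appear; only 'k' is seen in this notebook
SUFFIXES = {"k": 1_000, "m": 1_000_000}

def parse_count(text):
    """Convert strings such as '76.3k', '1.2m' or '1,234' to an int."""
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty count string")
    multiplier = SUFFIXES.get(cleaned[-1])
    if multiplier is not None:
        # Decimal keeps '76.3' exact, so the product is a whole number
        return int(Decimal(cleaned[:-1]) * multiplier)
    return int(cleaned)

print(parse_count(" 76.3k\n"))
```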

The second step is to write a function that returns all the information about a repository. As we saw in the image above, the information is contained in `a` tags. As shown in the picture, we can read the repository URL from the `href` attribute.

![](https://i.imgur.com/7howgRz.png)

The next function combines some of the previous functions to get all the information we want about one repository.

In [110]:
def get_repo_info(h3_tag, star_tag):
    # Return the username, repo name, URL, and star count for one repository.
    # h3_tag holds two links (owner, then repository); star_tag is the matching star counter.
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = 'https://github.com' + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars

Let's test the function. As we can see in the image below, we can select the `h3` tags with the `f3 color-fg-muted text-normal lh-condensed` class.

![](https://i.imgur.com/BeWTUYC.png)

In [111]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.findAll('h3', class_ = h3_selection_class)
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 76300)

One of the last functions we create is `get_topic_repos`, which takes a BeautifulSoup document, extracts the information we need, and returns a pandas DataFrame.

In [123]:
def get_topic_repos(topic_doc):

    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.findAll('h3', class_ = h3_selection_class)

    a_selection_class = 'social-count js-social-count'
    star_tags = topic_doc.findAll('a',class_ = a_selection_class)

    topic_repo_dic = {'username': [],'repo_name': [],'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repo_dic['username'].append(repo_info[0])
        topic_repo_dic['repo_name'].append(repo_info[1])
        topic_repo_dic['repo_url'].append(repo_info[2])
        topic_repo_dic['stars'].append(repo_info[3])

    return pd.DataFrame(topic_repo_dic)

The last function checks whether the CSV file already exists. If it does not, the function retrieves the repository information, converts it to a DataFrame with pandas, and saves it to the given path.

In [113]:
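import os

def scrape_topic(topic_url, path):
    # Skip topics whose CSV file was already written
    if os.path.exists(path):
        print('the file {} already exists. Skipping...'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # index=False keeps the DataFrame index out of the CSV
    topic_df.to_csv(path, index=False)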
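
The skip-if-exists check makes the scraper safe to re-run: files written by an earlier run are left untouched. The sketch below reproduces that pattern in isolation with the standard library, using a temporary directory and a hypothetical `write_if_missing` helper in place of the real download.

```python
import os
import tempfile

def write_if_missing(path, text):
    """Write text to path unless the file already exists; return True if a file was written."""
    if os.path.exists(path):
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "3D.csv")
    first = write_if_missing(path, "username,repo_name,stars,repo_url\n")
    second = write_if_missing(path, "this text is never written\n")
    with open(path, encoding="utf-8") as f:
        content = f.read()

print(first, second)
```

One consequence of this design is that stale files are never refreshed; deleting a CSV is the way to force a new scrape for that topic.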

## Putting it all together
- We have a function to get the list of topics
- We have a function to create a CSV file for the repos scraped from a topic page
- Let's create a function to put them together.

In [118]:
import os

def scrape_topics_repos(url):
    print('Scraping list of topics')
    topics_df = scrape_topics(url)
    folder_name = "Scraped_csv"
    try:
        os.makedirs(folder_name, exist_ok = True)
    except OSError:
        print ("Creation of the directory %s failed" % folder_name)
    else:
        print ("Successfully created the directory %s " % folder_name)

    for index, row in topics_df.iterrows():
        print('Scraping top repositories for {}'.format(row['title']))
        scrape_topic(row['url'], folder_name + '/{}.csv'.format(row['title']))

Let's run `scrape_topics_repos` to scrape the top repos for all the topics on the first page of https://github.com/topics

In [125]:
scrape_topics_repos(url)

Scraping list of topics
Successfully created the directory Scraped_csv 
Scraping top repositories for 3D
the file Scraped_csv/3D.csv already exists. Skipping...
Scraping top repositories for Ajax
the file Scraped_csv/Ajax.csv already exists. Skipping...
Scraping top repositories for Algorithm
the file Scraped_csv/Algorithm.csv already exists. Skipping...
Scraping top repositories for Amp
the file Scraped_csv/Amp.csv already exists. Skipping...
Scraping top repositories for Android
the file Scraped_csv/Android.csv already exists. Skipping...
Scraping top repositories for Angular
the file Scraped_csv/Angular.csv already exists. Skipping...
Scraping top repositories for Ansible
the file Scraped_csv/Ansible.csv already exists. Skipping...
Scraping top repositories for API
the file Scraped_csv/API.csv already exists. Skipping...
Scraping top repositories for Arduino
the file Scraped_csv/Arduino.csv already exists. Skipping...
Scraping top repositories for ASP.NET
the file Scraped_csv/ASP.NE
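
Topic titles are used directly as file names here. That works for the titles shown, but a title containing a path separator or other special character could produce an invalid path. A minimal sketch of a sanitizer follows; `safe_filename` is a hypothetical helper and the set of allowed characters is an assumption.

```python
import re

def safe_filename(title):
    """Replace characters outside letters, digits, '.', '_', '+', '#' and '-' with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9._+#-]+", "_", title.strip())
    # Fall back to a placeholder if nothing usable remains
    return cleaned.strip("_") or "untitled"

# The last title is made up to show a path separator and a space being replaced
for title in ["3D", "ASP.NET", "C++", "CI/CD pipelines"]:
    print(title, "->", safe_filename(title) + ".csv")
```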

We can check that the CSV files were created properly.

## Read and display CSV with pandas

Let's check that the data is correct. We will read a CSV file with pandas as follows.

In [126]:
Test_df = pd.read_csv("Scraped_csv/3D.csv")
Test_df.head()

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 76300 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 19300 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 15800 | https://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 15300 | https://github.com/BabylonJS/Babylon.js |
| 4 | aframevr | aframe | 13300 | https://github.com/aframevr/aframe |


## References and future work



Summary of what we did:
* We created helper functions for scraping repositories on GitHub.
* We scraped information from the first page of topics and the top repositories of each topic.

References to links we found useful:
* [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Pandas Documentation](https://pandas.pydata.org/docs/)

Ideas for future work:
* Implement a function to scrape multiple pages.
* Also scrape the logo of each repository.
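
For the first idea, one possible starting point is to generate one URL per listing page. The sketch below assumes the topics listing accepts a `page` query parameter (for example `https://github.com/topics?page=2`); that parameter and the `topic_page_urls` helper are assumptions, not something verified in this notebook.

```python
from urllib.parse import urlencode

def topic_page_urls(base_url, pages):
    """Return one URL per page number from 1 to pages, using an assumed 'page' query parameter."""
    return ["{}?{}".format(base_url, urlencode({"page": n})) for n in range(1, pages + 1)]

page_urls = topic_page_urls("https://github.com/topics", 3)
print(page_urls)
```

Each URL could then be passed to `scrape_topics` and the resulting DataFrames combined with `pd.concat`.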
