# Web Scraping Top Repositories from GitHub Topics

📌**Web Scraping** - Web scraping (or data scraping) is a technique used to collect content and data from the internet. This data is usually saved in a local file so that it can be manipulated and analyzed as needed.

📌**GitHub** - GitHub is a website and cloud-based service that helps developers store and manage their code, as well as track and control changes to it.

📌**Tools Used**:
- Python
- requests
- Pandas
- Beautiful Soup (BS4)
- os

📌**Project Outline**:
- We are going to scrape https://github.com/topics.
- We will get a list of topics. For each topic, we will get the topic name, the topic page URL and the topic description.
- For each topic, we will get the top repositories listed on the first page of its topic page.
- For each topic, we will create a CSV file in the following format:

```
username,repo_name,stars,repo_url
mrdoob,three.js,69700,https://github.com/mrdoob/three.js
```
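
As a rough illustration of this file layout, the sketch below writes one repository row with the standard library's `csv` module. The column names follow the DataFrame columns built later in this notebook; the helper name `rows_to_csv` is hypothetical and is not part of the project code, which uses pandas for this step.

```python
import csv
import io

# Column names follow the DataFrame built later in the notebook
FIELDS = ['username', 'repo_name', 'stars', 'repo_url']


def rows_to_csv(rows):
    """Serialize a list of repository dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = [{
    'username': 'mrdoob',
    'repo_name': 'three.js',
    'stars': 69700,
    'repo_url': 'https://github.com/mrdoob/three.js',
}]

csv_text = rows_to_csv(sample)
print(csv_text)
```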

## 📍 Scraping the List of Topics from GitHub

First, we import all the libraries required in the project:
- **requests** - The requests library is the de facto standard for making HTTP requests in Python. It hides the complexity of making requests behind a simple API so that you can focus on interacting with services and consuming data in your application.
- **Beautiful Soup** - A Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

In [1]:
import os
import requests
import pandas as pd

from bs4 import BeautifulSoup

Next, we download the contents of the GitHub Topics page using 'requests' and parse it using Beautiful Soup (BS4).

In [140]:
def scrape_topics_repos():
    topic_url = 'https://github.com/topics?page=6'

    # Download the url using requests.get()
    response = requests.get(topic_url)

    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))

    # Parse the downloaded page using Beautiful Soup
    topic =  BeautifulSoup(response.text, 'html.parser')

    return topic

In [142]:
topic = scrape_topics_repos()
type(topic)

bs4.BeautifulSoup
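
The function above requests one fixed page of the topics listing (`?page=6`). A minimal sketch of building the URL for any page number is shown below. The helper name `topics_page_url` is hypothetical, and the assumption that other pages follow the same `page` query parameter is taken only from the URL used in this notebook.

```python
from urllib.parse import urlencode

TOPICS_URL = 'https://github.com/topics'


def topics_page_url(page=1):
    """Build the URL of one page of the topics listing (hypothetical helper)."""
    if page < 1:
        raise ValueError('page must be 1 or greater')
    # urlencode takes care of formatting the query string
    return '{}?{}'.format(TOPICS_URL, urlencode({'page': page}))


print(topics_page_url(6))
```

Passing the result to `requests.get()` instead of a hard-coded string would let the same download function cover several pages.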

Once we have downloaded and parsed the page, we use 'find_all()' from the Beautiful Soup library to get the topic titles, descriptions and URLs. It extracts content from an HTML page by matching the HTML tags and classes passed as parameters.

We will define separate functions to scrape each piece of data and save it into a list.
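
To show what matching on a tag and a class involves, here is a small sketch that uses only the standard library's `html.parser` instead of Beautiful Soup. It is not how Beautiful Soup is implemented and it is not the code used in this project. The class `ClassTextCollector`, the helper `find_text` and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every `tag` element that carries `css_class`."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.buffer = []    # text pieces of the element being read
        self.results = []   # stripped text of each matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Count nested elements of the same tag so the right end tag closes the match
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get('class') or '').split()
            if self.css_class in classes:
                self.depth = 1
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append(''.join(self.buffer).strip())


def find_text(html, tag, css_class):
    """Return the text of all matching elements in document order."""
    collector = ClassTextCollector(tag, css_class)
    collector.feed(html)
    collector.close()
    return collector.results


# Hypothetical markup, loosely modelled on the classes used in this notebook
sample_html = (
    '<a href="/topics/sql"><p class="f3 lh-condensed">SQL</p>'
    '<p class="f5 color-fg-muted"> SQL is a standard language. </p></a>'
    '<a href="/topics/swift"><p class="f3 lh-condensed">Swift</p></a>'
)

titles = find_text(sample_html, 'p', 'f3')
descriptions = find_text(sample_html, 'p', 'f5')
print(titles)
print(descriptions)
```

Beautiful Soup hides all of this bookkeeping behind a single 'find_all()' call, which is why the project uses it.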

In [143]:
def get_topic_title(doc):

    # find_all() or findAll() helps to find the elements from the page using html tags and classes
    topic_title = doc.find_all('p', { 'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topicTitles = []
    for tag in topic_title:
        topicTitles.append(tag.text)
    return topicTitles

def get_topic_desc(doc):
    topic_desc = doc.find_all('p', {'class':"f5 color-fg-muted mb-0 mt-1"})
    topicDesc = []
    for tag in topic_desc:
        topicDesc.append(tag.text.strip())
    return topicDesc

def get_topic_url(doc):
    topic_links = doc.find_all('a', {'class':'no-underline flex-1 d-flex flex-column'})
    topicUrls = []
    baseurl = 'https://github.com'
    for tag in topic_links:
        topicUrls.append(baseurl + tag['href'])
    return topicUrls
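
`get_topic_url` builds absolute URLs by concatenating the base URL and each `href`. As an alternative sketch, `urllib.parse.urljoin` from the standard library does the same job and also leaves already-absolute links unchanged. The helper name `absolute_urls` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'


def absolute_urls(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url (hypothetical helper)."""
    return [urljoin(base_url, href) for href in hrefs]


# One relative link and one link that is already absolute
urls = absolute_urls(['/topics/sql', 'https://github.com/topics/swift'])
print(urls)
```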

In [144]:
topicTitles = get_topic_title(topic)
topicTitles

['SpaceVim',
 'Spring Boot',
 'SQL',
 'Storybook',
 'Support',
 'Swift',
 'Symfony',
 'Telegram',
 'Tensorflow',
 'Terminal',
 'Terraform',
 'Testing',
 'Twitter',
 'TypeScript',
 'Ubuntu',
 'Unity',
 'Unreal Engine',
 'Vagrant',
 'Vim',
 'Virtual reality',
 'Vue.js',
 'Wagtail',
 'Web Components',
 'Web app',
 'Webpack',
 'Windows',
 'WordPlate',
 'WordPress',
 'Xamarin',
 'XML']

In [145]:
topicDesc = get_topic_desc(topic)
topicDesc

['SpaceVim is a community-driven distribution of the vim editor that allows managing your plugins in layers.',
 'Spring Boot is a coding and configuration model for Java applications.',
 'SQL is a standard language for storing, retrieving and manipulating data in databases.',
 'Storybook is a UI development environment for your UI components.',
 'Get your team and customers the help they need.',
 'Swift is a modern programming language focused on safety, performance, and expressivity.',
 'Symfony is a set of reusable PHP components and a web framework.',
 'Telegram is a non-profit, cloud-based instant messaging service.',
 'TensorFlow is an open source software library for numerical computation.',
 'The terminal is an interface in which you can type and execute text-based commands.',
 'An infrastructure-as-code tool for building, changing, and versioning infrastructure safely and efficiently.',
 'Eliminate bugs and ship with more confidence by adding these tools to your workflow.',
 'T

In [146]:
topicUrls = get_topic_url(topic)
topicUrls

['https://github.com/topics/spacevim',
 'https://github.com/topics/spring-boot',
 'https://github.com/topics/sql',
 'https://github.com/topics/storybook',
 'https://github.com/topics/support',
 'https://github.com/topics/swift',
 'https://github.com/topics/symfony',
 'https://github.com/topics/telegram',
 'https://github.com/topics/tensorflow',
 'https://github.com/topics/terminal',
 'https://github.com/topics/terraform',
 'https://github.com/topics/testing',
 'https://github.com/topics/twitter',
 'https://github.com/topics/typescript',
 'https://github.com/topics/ubuntu',
 'https://github.com/topics/unity',
 'https://github.com/topics/unreal-engine',
 'https://github.com/topics/vagrant',
 'https://github.com/topics/vim',
 'https://github.com/topics/virtual-reality',
 'https://github.com/topics/vue',
 'https://github.com/topics/wagtail',
 'https://github.com/topics/web-components',
 'https://github.com/topics/webapp',
 'https://github.com/topics/webpack',
 'https://github.com/topics/wi

All the data is now stored in lists. We convert it into a pandas DataFrame so that we can export it as a CSV file or use it for further scraping.

In [147]:
def topics_data(topic):
    topics_dict = {
        'Title': get_topic_title(topic),
        'Description':get_topic_desc(topic),
        'Url':get_topic_url(topic)
    }

    return pd.DataFrame(topics_dict)

In [148]:
topics_data(topic)

Unnamed: 0,Title,Description,Url
0,SpaceVim,SpaceVim is a community-driven distribution of...,https://github.com/topics/spacevim
1,Spring Boot,Spring Boot is a coding and configuration mode...,https://github.com/topics/spring-boot
2,SQL,"SQL is a standard language for storing, retrie...",https://github.com/topics/sql
3,Storybook,Storybook is a UI development environment for ...,https://github.com/topics/storybook
4,Support,Get your team and customers the help they need.,https://github.com/topics/support
5,Swift,Swift is a modern programming language focused...,https://github.com/topics/swift
6,Symfony,Symfony is a set of reusable PHP components an...,https://github.com/topics/symfony
7,Telegram,"Telegram is a non-profit, cloud-based instant ...",https://github.com/topics/telegram
8,Tensorflow,TensorFlow is an open source software library ...,https://github.com/topics/tensorflow
9,Terminal,The terminal is an interface in which you can ...,https://github.com/topics/terminal
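
Building a DataFrame from a dictionary of lists only works when all lists have the same length; otherwise pandas raises a `ValueError`. If one selector stops matching, the lists can drift apart, so a length check before the conversion gives a clearer error. The sketch below is plain Python, and the helper name `check_column_lengths` is hypothetical.

```python
def check_column_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths


good = {
    'Title': ['SQL', 'Swift'],
    'Description': ['SQL is ...', 'Swift is ...'],
    'Url': ['https://github.com/topics/sql', 'https://github.com/topics/swift'],
}
lengths = check_column_lengths(good)
print(lengths)

# A missing description is reported before any DataFrame is built
bad = dict(good, Description=['SQL is ...'])
try:
    check_column_lengths(bad)
except ValueError as error:
    message = str(error)
    print(message)
```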


## 📍 Scraping repositories of each topic

The 'topicUrls' variable stores the list of all topic URLs from the topics page we downloaded.

To extract the repository data for each topic, we have to download and parse each topic page. Using the list of topic URLs we created earlier, we follow the same steps as we did for the main GitHub topics page.

We create a function similar to the one at the very beginning of this notebook; the only change is that it takes a topic URL as its input parameter.

In [149]:
def get_page_info(topicUrl):
    #Get a particular link html document
    topic_doc = requests.get(topicUrl)

    if topic_doc.status_code != 200:
        raise Exception('Failed to load page {}'.format(topicUrl))

    #Parse the html document using Beautiful soup if the status code is valid 200
    topic_page =  BeautifulSoup(topic_doc.text, 'html.parser')

    return topic_page

In [150]:
topic_page = get_page_info(topicUrls[0])
type(topic_page)

bs4.BeautifulSoup

To get the repository data, we have to find out where the information is stored in the HTML page.
While inspecting the page, I found that the repository names and star counts are stored in 'h3' and 'span' tags respectively, so we scrape that information using Beautiful Soup.

**Note**: The HTML structure of the page may change over time, so anyone reusing this notebook should inspect the page and update the tags and classes in this code chunk to match the current markup.

In [151]:
#Get the repository and star tags from the parsed topic page (not the topics listing page)
h1_tags = topic_page.find_all('h3', {"class":"f3 color-fg-muted text-normal lh-condensed"})
star_tags = topic_page.find_all('span', {"class":"Counter js-social-count"})

Large star counts are abbreviated on the page with a 'k' suffix.
For example, 77000 is displayed as 77k.

So we convert each star count string to an integer.

In [152]:
def parse_star_count(stars_str):
    starsStr = stars_str.strip()

    if starsStr[-1] == 'k':
        return int(float(starsStr[:-1]) * 1000)

    return int(starsStr)
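
A slightly more defensive variant of this conversion is sketched below, with a few worked inputs. It rounds instead of truncating, to guard against floating-point error in the multiplication, and it tolerates thousands separators. The name `parse_count` is hypothetical, and support for commas is an assumption rather than something observed on the page.

```python
def parse_count(text):
    """Convert a display count such as '77k', '1.4k' or '472' to an int (hypothetical helper)."""
    cleaned = text.strip().lower().replace(',', '')
    if not cleaned:
        raise ValueError('empty count')
    if cleaned.endswith('k'):
        # round() avoids losing a unit to floating-point error, which int() would truncate
        return round(float(cleaned[:-1]) * 1000)
    return int(cleaned)


for raw in ['77k', ' 1.4k ', '472', '1,234']:
    print(repr(raw), parse_count(raw))
```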

Once the pages are parsed and the star counts are formatted, we extract the main fields for each repository: username, repository name, repository URL and stars.

In [153]:
def get_repo_info(h1_tag, star_tag):
    baseurl = 'https://github.com'
    # Return all the required info about a repository

    aTags = h1_tag.find_all('a')
    username = aTags[0].text.strip()
    reponame = aTags[1].text.strip()
    repourl = baseurl + aTags[1]['href']
    stars = parse_star_count(star_tag.text)

    return username, reponame, repourl, stars

We save all the data in a Python dictionary and then create a DataFrame from it using the pandas 'DataFrame' constructor.

In [154]:
def get_topic_repos(topic):
    #Get the repositories and stars tag
    h1_tags = topic.findAll('h3', {"class":"f3 color-fg-muted text-normal lh-condensed"})
    star_tags = topic.findAll('span', {"class":"Counter js-social-count"})

    # Initializing a python dictionary
    topic_repos_dict = {'username':[], 'repo_name':[], 'stars':[], 'repo_url':[]}

    for i in range(len(h1_tags)):
        repo_info = get_repo_info(h1_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[3])
        topic_repos_dict['repo_url'].append(repo_info[2])

    return pd.DataFrame(topic_repos_dict)

In [155]:
get_topic_repos(topic_page)

Unnamed: 0,username,repo_name,stars,repo_url
0,SpaceVim,SpaceVim,17700,https://github.com/SpaceVim/SpaceVim
1,wsdjeg,vim-galore-zh_cn,8800,https://github.com/wsdjeg/vim-galore-zh_cn
2,wsdjeg,DotFiles,1400,https://github.com/wsdjeg/DotFiles
3,wsdjeg,vim-chat,472,https://github.com/wsdjeg/vim-chat
4,Gabirel,Hack-SpaceVim,403,https://github.com/Gabirel/Hack-SpaceVim
5,ctjhoa,spacevim,368,https://github.com/ctjhoa/spacevim
6,wsdjeg,FlyGrep.vim,293,https://github.com/wsdjeg/FlyGrep.vim
7,wsdjeg,GitHub.vim,192,https://github.com/wsdjeg/GitHub.vim
8,Martins3,My-Linux-Config,144,https://github.com/Martins3/My-Linux-Config
9,wsdjeg,JavaUnit.vim,104,https://github.com/wsdjeg/JavaUnit.vim


The last step is to convert each DataFrame into a CSV file and save it on your local machine.

Before that, we create a directory to hold all the CSV files, using the 'os' module.

## 📍 Create Data Directory and save as CSV Files

In [156]:
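def scrape_topic(topic_url, path):
    # Skip topics whose CSV file has already been saved
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return

    # Download the topic page, extract its repositories and save them as CSV
    topic_df = get_topic_repos(get_page_info(topic_url))
    topic_df.to_csv(path, index=False)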
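
A skip-if-exists check only has an effect when the function returns before it writes the file. The pattern is shown in isolation below, using a temporary directory so nothing is left on disk. The helper name `save_if_missing` and the file contents are hypothetical.

```python
import os
import tempfile


def save_if_missing(path, text):
    """Write text to path unless the file already exists. Return True if written."""
    if os.path.exists(path):
        return False  # leave the existing file untouched
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)
    return True


with tempfile.TemporaryDirectory() as data_dir:
    target = os.path.join(data_dir, 'SQL.csv')
    first = save_if_missing(target, 'username,repo_name\n')
    second = save_if_missing(target, 'overwritten\n')
    with open(target, encoding='utf-8') as handle:
        content = handle.read()

print(first, second)
```

The second call returns without writing, so the content from the first call is preserved.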

Lastly, we call all the functions created above.

**Note** - Some pages might fail to download because of a backend error on GitHub's side. If that happens, re-run the scrape_topics_csv() function.
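
Instead of re-running by hand, a failed download could be retried automatically. The sketch below is a generic retry helper demonstrated on a stand-in function that fails twice before succeeding. The names `retry` and `flaky_download` are hypothetical, the attempt count and delay are arbitrary choices, and no real request is made.

```python
import time


def retry(func, attempts=3, delay=0.0):
    """Call func() until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError('attempts must be 1 or greater')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # wait before the next try
    raise last_error


calls = {'count': 0}


def flaky_download():
    """Stand-in for a page download that fails on the first two calls."""
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('temporary failure')
    return 'ok'


result = retry(flaky_download, attempts=3)
print(result, calls['count'])
```

In the notebook, the same idea could wrap a call such as `get_page_info(topic_url)`, ideally with a non-zero delay so that GitHub is not hit with rapid repeated requests.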

In [157]:
def scrape_topics_csv():
    print('Scraping Topics')
    topics = scrape_topics_repos()
    topics_df = topics_data(topics)

    os.makedirs('data', exist_ok=True)

    for index, row in topics_df.iterrows():
        print('Scraping Top Repositories for ', format(row['Title']))
        scrape_topic(row['Url'], 'data/{}.csv'.format(row['Title']))
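
The file name above is taken directly from the topic title, so a title containing a path separator or other special characters could produce an invalid path. One possible safeguard, sketched here under that assumption, is to normalize the title first. The helper name `safe_filename` and the replacement rules are hypothetical choices.

```python
import re


def safe_filename(title, extension='.csv'):
    """Turn a topic title into a conservative file name (hypothetical helper)."""
    # Replace every run of characters outside letters, digits, '.', '_' and '-' with '-'
    slug = re.sub(r'[^A-Za-z0-9._-]+', '-', title.strip())
    slug = slug.strip('-.').lower()
    return (slug or 'topic') + extension


for title in ['Web app', 'Vue.js', 'C/C++', 'Virtual reality']:
    print(title, '->', safe_filename(title))
```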

In [158]:
scrape_topics_csv()

Scraping Topics
Scraping Top Repositories for  SpaceVim
Scraping Top Repositories for  Spring Boot
Scraping Top Repositories for  SQL
Scraping Top Repositories for  Storybook
Scraping Top Repositories for  Support
Scraping Top Repositories for  Swift
Scraping Top Repositories for  Symfony
Scraping Top Repositories for  Telegram
Scraping Top Repositories for  Tensorflow
Scraping Top Repositories for  Terminal
Scraping Top Repositories for  Terraform
Scraping Top Repositories for  Testing
Scraping Top Repositories for  Twitter
Scraping Top Repositories for  TypeScript
Scraping Top Repositories for  Ubuntu
Scraping Top Repositories for  Unity
Scraping Top Repositories for  Unreal Engine
Scraping Top Repositories for  Vagrant
Scraping Top Repositories for  Vim
Scraping Top Repositories for  Virtual reality
Scraping Top Repositories for  Vue.js
Scraping Top Repositories for  Wagtail
Scraping Top Repositories for  Web Components
Scraping Top Repositories for  Web app
Scraping Top Repositorie

**Summary** -
- We scraped 30 topics, listed in alphabetical order, from page 6 of https://github.com/topics.
- For each topic we scraped, we extracted the top repositories in the order they appear on the topic page.
- After gathering the data, we converted it into pandas DataFrames.
- Lastly, we created 30 CSV files, one for each topic we scraped, containing its top repositories.

**References** -
- <a href='https://github.com/topics'>GitHub Topics</a>
- <a href='https://beautiful-soup-4.readthedocs.io/en/latest/#'>Beautiful Soup Documentation</a>
- <a href='https://jovian.ai/moustapha00864/python-web-scraping-project-guide'>Jovian</a>

**Future ideas** -
- Scrape the repositories of the remaining topics on the GitHub topics pages.
- Scrape the topics with the highest number of repositories, sorted by repository count.
- Include repository tags, which would help in analysing trends across related repositories.
- Extract repository metrics such as forks, pull requests, and open and closed issues.
- Add a timestamp showing when each repository was last updated.