# Scraping the top repositories for topics from GitHub

**Web scraping:** Put simply, it is a method of extracting content and data from websites.

**GitHub:** It is a platform that fosters collaboration and communication between developers. It provides an environment for developing and sharing code.

**Method and tools:** There are many ways and tools to do this, but in this project I have used Python with the BS4 (BeautifulSoup4) library for scraping.

### Steps for the project:
- We are going to scrape https://github.com/topics
- We will get a list of topics. For each topic, we will extract the topic name, topic description and topic URL.
- For each topic, we will get the top 30 repositories with the repo name, repo username, stars and repo URL.
- Finally, we are going to create a CSV file for each topic with the respective repo details.

## Scraping the list of topics from GitHub

- Use requests to load the HTML contents of the website
- Parse the HTML content using BS4
- Convert the information into a pandas dataframe


In [1]:
# Importing all the required libraries:

import os
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
# Here, I will define a function which will download the page:

def download_page(url):
    response = requests.get(url)
    # Stop here if the page did not load successfully, instead of parsing an error page.
    if response.status_code != 200:
        raise Exception('There is something wrong with {}'.format(url))
    response_contents = response.text

    # Now we will parse the contents using BeautifulSoup:
    parsed_contents = BeautifulSoup(response_contents,'html.parser')
    return parsed_contents

In [3]:
url = 'https://github.com/topics'
parsed_contents = download_page(url)
print('Page loaded successfully')

Page loaded successfully


I have not printed the whole parsed_contents because it is too big. Reaching the "Page loaded successfully" message tells us that the function ran to completion.

In [4]:
# We will now define another function which will give us all the topic titles. This function takes parsed_contents as its argument.

def get_topic_titles(parsed_content):

    # I have checked the page source for the tag and class that hold the titles
    
    selected_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = parsed_content.find_all('p',{'class':selected_class})
    
    # We can make a list of topics
    topic_titles = []
    for tags in topic_title_tags:
        topic_titles.append(tags.text)
    return topic_titles

In [5]:
topic_titles = get_topic_titles(parsed_contents)
topic_titles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']
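The title lookup above depends on the full class string of the `<p>` tag. As far as I understand, BeautifulSoup treats a multi-class string passed this way as an exact match on the class attribute, so it stops matching if GitHub adds or reorders a class. The sketch below shows the idea of matching on a subset of classes instead. It uses the standard library `html.parser` rather than BS4 so that it runs without any extra packages, and both the `ClassTextCollector` name and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every <tag> whose class list contains all wanted classes."""

    def __init__(self, tag, wanted_classes):
        super().__init__()
        self.tag = tag
        self.wanted = set(wanted_classes.split())
        self.depth = 0        # greater than 0 while we are inside a matching element
        self.buffer = []      # text pieces of the element currently being read
        self.results = []     # stripped text of every matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside a match: only track nesting of the same tag name.
            if tag == self.tag:
                self.depth += 1
            return
        if tag == self.tag:
            classes = set((dict(attrs).get('class') or '').split())
            if self.wanted <= classes:   # subset test, order does not matter
                self.depth = 1
                self.buffer = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append(''.join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


# Hypothetical markup that imitates the structure of the topics page.
SAMPLE_HTML = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-text-secondary mb-0 mt-1">3D modeling is the process of ...</p>
  <p class="mt-1 Link--primary f3 extra-class">Ajax</p>
</div>
"""

collector = ClassTextCollector('p', 'f3 Link--primary')
collector.feed(SAMPLE_HTML)
titles = collector.results
print(titles)
```

The second title is still found even though its classes are reordered and one is added, which is the behaviour an exact string match would lose.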

In [6]:
# Similarly we are going to define functions for topic description and topic url

def get_topic_desc(parsed_contents):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = parsed_contents.find_all('p',{'class': desc_selector})

    topic_desc = []
    for desc in topic_desc_tags:
        topic_desc.append(desc.text.strip())  # strip() removes leading and trailing whitespace from the description.
    return topic_desc

def get_topic_url(parsed_contents):
    topic_link_tag = parsed_contents.find_all('a',{'class':'d-flex no-underline'})

    topic_urls = []
    base_url = 'http://github.com'
    for urls in topic_link_tag:
        topic_urls.append(base_url + urls['href'])
    return topic_urls
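The hrefs scraped from the page are site-relative paths, so they have to be combined with a base URL. Below is a minimal sketch of an alternative to plain string concatenation, using `urllib.parse.urljoin` from the standard library. The helper name `absolute_url` is an assumption made for illustration and is not part of the project code.

```python
from urllib.parse import urljoin

BASE_URL = 'http://github.com'


def absolute_url(href, base_url=BASE_URL):
    # urljoin resolves a site-relative path against the base URL,
    # whether or not the base ends with a slash.
    return urljoin(base_url, href)


print(absolute_url('/topics/3d'))
print(absolute_url('/topics/3d', 'http://github.com/'))
```

Both calls produce the same URL, so the result does not depend on how the base URL happens to be written.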

In [7]:
topic_desc = get_topic_desc(parsed_contents)
topic_desc

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'A

In [8]:
topic_urls = get_topic_url(parsed_contents)
topic_urls

['http://github.com/topics/3d',
 'http://github.com/topics/ajax',
 'http://github.com/topics/algorithm',
 'http://github.com/topics/amphp',
 'http://github.com/topics/android',
 'http://github.com/topics/angular',
 'http://github.com/topics/ansible',
 'http://github.com/topics/api',
 'http://github.com/topics/arduino',
 'http://github.com/topics/aspnet',
 'http://github.com/topics/atom',
 'http://github.com/topics/awesome',
 'http://github.com/topics/aws',
 'http://github.com/topics/azure',
 'http://github.com/topics/babel',
 'http://github.com/topics/bash',
 'http://github.com/topics/bitcoin',
 'http://github.com/topics/bootstrap',
 'http://github.com/topics/bot',
 'http://github.com/topics/c',
 'http://github.com/topics/chrome',
 'http://github.com/topics/chrome-extension',
 'http://github.com/topics/cli',
 'http://github.com/topics/clojure',
 'http://github.com/topics/code-quality',
 'http://github.com/topics/code-review',
 'http://github.com/topics/compiler',
 'http://github.com/to

Now that we have lists of topic names, descriptions and URLs, we are going to make a pandas dataframe out of them.

In [9]:
# Making df using these lists:

topic_df = pd.DataFrame(list(zip(topic_titles, topic_desc,topic_urls)),
               columns =['title', 'description', 'url'])
topic_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | http://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | http://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | http://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | http://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | http://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | http://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | http://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | http://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | http://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | http://github.com/topics/aspnet |
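One thing to keep in mind with `zip` is that it stops at the shortest input. If one selector misses a topic, the three lists go out of step and the mismatch is dropped silently. Below is a small sketch of a length check before combining the columns. It uses plain Python only, and the `combine_columns` helper is a hypothetical name rather than something from pandas or from this project.

```python
def combine_columns(**columns):
    """Zip equally long columns into a list of row dictionaries."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Fail loudly instead of letting zip() drop the extra items.
        raise ValueError('Column lengths differ: {}'.format(lengths))
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


rows = combine_columns(
    title=['3D', 'Ajax'],
    description=['3D modeling is ...', 'Ajax is a technique ...'],
    url=['http://github.com/topics/3d', 'http://github.com/topics/ajax'],
)
print(rows[0])
```

The resulting list of dictionaries can be passed to `pd.DataFrame` directly.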


So far we have defined functions for loading the page and getting the topic titles, descriptions and URLs, and we have created a dataframe out of them. Now we are going to merge all these functions and steps into a single function, so that we don't need to call each function separately. We will get our dataframe with topic titles, descriptions and URLs by calling just that one function.

#### Merging all the above steps into a single function:

In [10]:
def topic_details(url):
    def download_page(url):
        response = requests.get(url)
        # Stop here if the page did not load successfully.
        if response.status_code != 200:
            raise Exception('There is something wrong with {}'.format(url))
        response_contents = response.text

        # Now we will parse the contents using BeautifulSoup:
        parsed_contents = BeautifulSoup(response_contents,'html.parser')
        return parsed_contents

    def get_topic_titles(parsed_content):

        # I have checked the page source for the tag and class that hold the titles

        selected_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
        topic_title_tags = parsed_content.find_all('p',{'class':selected_class})

        # We can make a list of topics
        topic_titles = []
        for tags in topic_title_tags:
            topic_titles.append(tags.text)
        return topic_titles

    def get_topic_desc(parsed_contents):
        desc_selector = 'f5 color-text-secondary mb-0 mt-1'
        topic_desc_tags = parsed_contents.find_all('p',{'class': desc_selector})

        topic_desc = []
        for desc in topic_desc_tags:
            topic_desc.append(desc.text.strip())  # strip() removes leading and trailing whitespace.
        return topic_desc

    def get_topic_url(parsed_contents):
        topic_link_tag = parsed_contents.find_all('a',{'class':'d-flex no-underline'})

        topic_urls = []
        base_url = 'http://github.com'
        for urls in topic_link_tag:
            topic_urls.append(base_url + urls['href'])
        return topic_urls

    # Call the helpers defined above, so the function does not depend on global variables:
    parsed_contents = download_page(url)
    topic_titles = get_topic_titles(parsed_contents)
    topic_desc = get_topic_desc(parsed_contents)
    topic_urls = get_topic_url(parsed_contents)

    topic_df = pd.DataFrame(list(zip(topic_titles, topic_desc,topic_urls)),
               columns =['title', 'description', 'url'])

    return topic_df

In [11]:
url = 'https://github.com/topics'
topic_df = topic_details(url)
topic_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | http://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | http://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | http://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | http://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | http://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | http://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | http://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | http://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | http://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | http://github.com/topics/aspnet |


As we can see, we get our dataframe by calling a single function.

Now that we have all our topics and their details, we are going to grab the top repositories for each topic.
We will follow a similar method.

In [12]:
# Let's define a function to load the topic page containing top repos:

def download_repo_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('There is some error in {}'.format(topic_url))
    response_contents = response.text
    
    parsed_contents = BeautifulSoup(response_contents,'html.parser')
    return parsed_contents

In [13]:
topic_df['url'][0]

'http://github.com/topics/3d'

In [14]:
# First we will be using our first topic url to get the top repos:
first_topic_repo_page = download_repo_page(topic_df['url'][0])
print('Loaded Successfully')

Loaded Successfully


In [15]:
# repo_tags contains the repo username, repo name and repo url; star_tags contains the number of stars.

repo_tags = first_topic_repo_page.find_all('h3',{'class':'f3 color-text-secondary text-normal lh-condensed'})
star_tags = first_topic_repo_page.find_all('a',{'class': 'social-count float-none'})

In [16]:
# Now we will define a function which will extract username, repo name, repo url and stars count from repo_tag and star_tag

def get_repo_info(repo_tags,star_tags):
    # returns all info for a repo
    a_tags = repo_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    base_url = 'http://github.com/'
    # The href already starts with '/', so the joined URL contains '//' after the host.
    repo_url = base_url + a_tags[1]['href'].strip()
    
    # Defining a function so that it will convert our star count to integer
    def star_counts_converter(stars):
        stars = stars.strip()
        if stars[-1] == 'k':
            return int(float(stars[:-1]) * 1000)
        return int(stars)
    star_counts = star_counts_converter(star_tags.text.strip())
    return username,repo_name,star_counts,repo_url

In [17]:
# We will check whether our function is working or not for the first repo:

first_repo_details = get_repo_info(repo_tags[0],star_tags[0])
first_repo_details

('mrdoob', 'three.js', 72800, 'http://github.com//mrdoob/three.js')

It seems to be working fine. Now we are going to get the details for all the top repos. Remember, for now we are only getting info for our first topic.
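The star conversion is easy to try out on its own. Below is a hedged variant that uses `decimal.Decimal` instead of `float`, so the multiplication does not rely on binary floating-point rounding. The `parse_star_count` name, the handling of an `m` suffix and the removal of thousands separators are all assumptions for illustration. Only the `k` suffix appears in the scraped data shown here.

```python
from decimal import Decimal

# Suffix multipliers; only 'k' is confirmed by the scraped pages, 'm' is an assumption.
MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}


def parse_star_count(text):
    """Convert strings such as '72.8k' or '950' to an integer."""
    cleaned = text.strip().lower().replace(',', '')
    if cleaned and cleaned[-1] in MULTIPLIERS:
        # Decimal keeps '72.8' exact, so 72.8 * 1000 is exactly 72800.
        return int(Decimal(cleaned[:-1]) * MULTIPLIERS[cleaned[-1]])
    return int(cleaned)


print(parse_star_count('72.8k'))
print(parse_star_count(' 950 '))
```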

In [18]:
# We are going to create a dictionary with the information about the repos and create a dataframe out of it:

repo_info_dict = {
    'username':[],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
    repo_details = get_repo_info(repo_tags[i],star_tags[i])
    repo_info_dict['username'].append(repo_details[0])
    repo_info_dict['repo_name'].append(repo_details[1])
    repo_info_dict['stars'].append(repo_details[2])
    repo_info_dict['repo_url'].append(repo_details[3])
    
pd.DataFrame(repo_info_dict)

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 72800 | http://github.com//mrdoob/three.js |
| 1 | libgdx | libgdx | 18700 | http://github.com//libgdx/libgdx |
| 2 | BabylonJS | Babylon.js | 14400 | http://github.com//BabylonJS/Babylon.js |
| 3 | pmndrs | react-three-fiber | 14000 | http://github.com//pmndrs/react-three-fiber |
| 4 | aframevr | aframe | 12900 | http://github.com//aframevr/aframe |
| 5 | ssloy | tinyrenderer | 10900 | http://github.com//ssloy/tinyrenderer |
| 6 | lettier | 3d-game-shaders-for-beginners | 10700 | http://github.com//lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 9600 | http://github.com//FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 8500 | http://github.com//metafizzy/zdog |
| 9 | CesiumGS | cesium | 7300 | http://github.com//CesiumGS/cesium |
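The same details can also be kept as one record per repository instead of one list per column. Below is a minimal sketch of that layout, written to CSV text with the standard library `csv` module in place of pandas so that it runs without extra packages. The `rows_to_csv` helper is a hypothetical name, and the two sample rows are copied from the output above.

```python
import csv
import io

FIELDNAMES = ['username', 'repo_name', 'stars', 'repo_url']

# One dictionary per repository, taken from the first two rows of the table.
rows = [
    {'username': 'mrdoob', 'repo_name': 'three.js', 'stars': 72800,
     'repo_url': 'http://github.com//mrdoob/three.js'},
    {'username': 'libgdx', 'repo_name': 'libgdx', 'stars': 18700,
     'repo_url': 'http://github.com//libgdx/libgdx'},
]


def rows_to_csv(rows, fieldnames):
    """Return the rows as CSV text with a header line."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, FIELDNAMES)
print(csv_text)
```

A list of dictionaries like `rows` is also accepted by `pd.DataFrame`, so this layout works with the pandas approach as well.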


Here, we have the top 30 repo details for our first topic. We need to extract the same information for each of our topics.
Before that, let's merge the above steps into a single function.

In [19]:
# Defining a function to get the list of repositories and their details using the topic url.

def topic_repo_details(topic_url):
    def download_repo_page(topic_url):
        # Using the requests library we will load the topic url as html:
        response = requests.get(topic_url)
        if response.status_code != 200:
            raise Exception('There is some error in {}'.format(topic_url))
        response_content = response.text

        # Using html.parser for parsing our html file:
        parsed_content = BeautifulSoup(response_content,'html.parser')

        # repo_tags and star_tags:
        repo_tags = parsed_content.find_all('h3',{'class':'f3 color-text-secondary text-normal lh-condensed'})
        star_tags = parsed_content.find_all('a',{'class': 'social-count float-none'})
        return repo_tags, star_tags

    # Here I have defined a function to get information about a repo
    # This will take a single repo tag and its star tag
    def get_repo_info(repo_tags,star_tags):
        # returns all info for a repo
        a_tags = repo_tags.find_all('a')
        username = a_tags[0].text.strip()
        repo_name = a_tags[1].text.strip()
        base_url = 'http://github.com/'
        # The href already starts with '/', so the joined URL contains '//' after the host.
        repo_url = base_url + a_tags[1]['href'].strip()

        # Defining a function so that it will convert our star count to integer
        def star_counts_converter(stars):
            stars = stars.strip()
            if stars[-1] == 'k':
                return int(float(stars[:-1]) * 1000)
            return int(stars)
        star_counts = star_counts_converter(star_tags.text.strip())
        return username,repo_name,star_counts,repo_url

    # Load the page and collect the tags before building the dictionary:
    repo_tags, star_tags = download_repo_page(topic_url)

    # We will create a dictionary out of our details:
    repo_info_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_details = get_repo_info(repo_tags[i],star_tags[i])
        repo_info_dict['username'].append(repo_details[0])
        repo_info_dict['repo_name'].append(repo_details[1])
        repo_info_dict['stars'].append(repo_details[2])
        repo_info_dict['repo_url'].append(repo_details[3])

    return pd.DataFrame(repo_info_dict)

In [20]:
topic_repo_details(topic_df['url'][0])

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 72800 | http://github.com//mrdoob/three.js |
| 1 | libgdx | libgdx | 18700 | http://github.com//libgdx/libgdx |
| 2 | BabylonJS | Babylon.js | 14400 | http://github.com//BabylonJS/Babylon.js |
| 3 | pmndrs | react-three-fiber | 14000 | http://github.com//pmndrs/react-three-fiber |
| 4 | aframevr | aframe | 12900 | http://github.com//aframevr/aframe |
| 5 | ssloy | tinyrenderer | 10900 | http://github.com//ssloy/tinyrenderer |
| 6 | lettier | 3d-game-shaders-for-beginners | 10700 | http://github.com//lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 9600 | http://github.com//FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 8500 | http://github.com//metafizzy/zdog |
| 9 | CesiumGS | cesium | 7300 | http://github.com//CesiumGS/cesium |


Now we have a single function which takes topic_url and gives us the top 30 repositories with their details.

In [21]:
# Let's define a function which will save our repo details to a CSV file.

def save_to_csv(topic_url,topic_name):
    file_name = topic_name + '.csv'
    if os.path.exists(file_name):
        print('The file {} already exists. Skipping...'.format(file_name))
        return  # Do not scrape the topic again or overwrite the existing file
    topics_df = topic_repo_details(topic_url)
    topics_df.to_csv(file_name,index=False)
    print('Successfully scraped topic "{}"'.format(topic_name))

In [22]:
save_to_csv(topic_df['url'][0],topic_df['title'][0])

Successfully scraped topic "3D"


Now that we have all the information and functions, we just need one last function which will iterate over every topic and save the repo details for each of them.

In [23]:
def github_topic_scraper(url):
    print('Scraping the list of topics')
    topics_df = topic_details(url)
    for index,row in topics_df.iterrows():
        print('Scraping the top repositories for "{}"'.format(row['title']))
        save_to_csv(row['url'],row['title'])
    print('Scraping successful')

In [24]:
github_topic_scraper(url)

Scraping the list of topics
Scraping the top repositories for "3D"
The file 3D.csv already exists. Skipping...
Scraping the top repositories for "Ajax"
Successfully scraped topic "Ajax"
Scraping the top repositories for "Algorithm"
Successfully scraped topic "Algorithm"
Scraping the top repositories for "Amp"
Successfully scraped topic "Amp"
Scraping the top repositories for "Android"
Successfully scraped topic "Android"
Scraping the top repositories for "Angular"
Successfully scraped topic "Angular"
Scraping the top repositories for "Ansible"
Successfully scraped topic "Ansible"
Scraping the top repositories for "API"
Successfully scraped topic "API"
Scraping the top repositories for "Arduino"
Successfully scraped topic "Arduino"
Scraping the top repositories for "ASP.NET"
Successfully scraped topic "ASP.NET"
Scraping the top repositories for "Atom"
Successfully scraped topic "Atom"
Scraping the top repositories for "Awesome Lists"
Successfully scraped 
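The run above sends one request per topic in quick succession. Below is a hedged sketch of a loop that pauses between requests and retries a failed page a few times. The `scrape_all` helper, its parameters and the `.../topics/broken` URL are hypothetical. The fetch function is passed in as an argument, so the example runs offline with a fake fetcher in place of `requests`.

```python
import time


def scrape_all(urls, fetch, delay_seconds=1.0, max_attempts=3, sleep=time.sleep):
    """Fetch every URL with a pause between requests and a few retries."""
    results, failures = {}, {}
    for url in urls:
        for attempt in range(1, max_attempts + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception as error:
                if attempt == max_attempts:
                    failures[url] = str(error)      # give up on this URL
                else:
                    sleep(delay_seconds * attempt)  # wait longer after each failure
        sleep(delay_seconds)                        # pause before the next URL
    return results, failures


# Fake fetcher used only for this demonstration: one URL fails once, one always fails.
calls = {}

def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url.endswith('ajax') and calls[url] == 1:
        raise ConnectionError('temporary failure')
    if url.endswith('broken'):
        raise ConnectionError('always fails')
    return 'page for ' + url


urls = [
    'http://github.com/topics/3d',
    'http://github.com/topics/ajax',
    'http://github.com/topics/broken',   # hypothetical URL, not a scraped topic
]
results, failures = scrape_all(urls, fake_fetch, sleep=lambda seconds: None)
print(sorted(results))
print(failures)
```

In the real scraper, `fetch` could be a function such as `topic_repo_details`, with the default `time.sleep` left in place.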

#### We have successfully defined a function which will save a CSV file for each topic containing the details of its top 30 repositories.
**This completes our project.**