<a href="https://colab.research.google.com/github/SARAnsH23072001/Scraping-Top-Repositories-for-Topics-on-GitHub/blob/main/scraping_github_topics_repositories.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Scraping Top Repositories for Topics on GitHub**

## Importing Libraries


In [None]:
import pandas as pd
import os
import requests
from bs4 import BeautifulSoup

# Scraping the list of topics from GitHub

First, download the GitHub Topics page and parse it with Beautiful Soup.

In [None]:
# Download the topics page and return it as a parsed BeautifulSoup document
def get_topics_page():

    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

# Store the parsed topics page in 'doc'
doc = get_topics_page()

Scraping the topic names from the topics page.

In [None]:
# This function will return a list of topic names.
def get_topic_titles(doc): 

    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

# Store the list of topic names in 'titles'
titles = get_topic_titles(doc)

In [None]:
len(titles)

30

In [None]:
# Top 5 topics from topic page
print("Top 5 topics are:")
for i in range(5):
  print("-",titles[i])

Top 5 topics are:
- 3D
- Ajax
- Algorithm
- Amp
- Android


Similarly, the following functions get the description and URL of each topic.

In [None]:
# Getting the description of each topic

def get_topic_descs(doc): # This function will return the description of each topic

    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs
    
# Store the description of each topic in 'desc'
desc = get_topic_descs(doc) 

In [None]:
# Description of first 5 topics
for i in range(5):
  print("-",desc[i])

- 3D modeling is the process of virtually developing the surface and structure of a 3D object.
- Ajax is a technique for creating interactive web applications.
- Algorithms are self-contained sequences that carry out a variety of tasks.
- Amp is a non-blocking concurrency library for PHP.
- Android is an operating system built by Google designed for mobile devices.


In [None]:
# Getting URL of each topic

def get_topic_urls(doc): # This function will return URL of each topic

    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    
# Store the URL of each topic in 'url'
url = get_topic_urls(doc) 

In [None]:
# URL of first 5 topics
for i in range(5):
  print("-",url[i])

- https://github.com/topics/3d
- https://github.com/topics/ajax
- https://github.com/topics/algorithm
- https://github.com/topics/amphp
- https://github.com/topics/android
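
`get_topic_urls` builds absolute URLs by concatenating `base_url` with each `href`, which works as long as every `href` starts with `/`. As a hedged alternative sketch, the standard library's `urllib.parse.urljoin` resolves root-relative links and leaves already-absolute links unchanged. The name `to_absolute_url` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def to_absolute_url(href, base_url=BASE_URL):
    # Hypothetical helper: resolve an href against the site root.
    # Root-relative links are joined to base_url; absolute links pass through.
    return urljoin(base_url, href)

print(to_absolute_url('/topics/3d'))
print(to_absolute_url('https://github.com/topics/ajax'))
```

Inside `get_topic_urls`, the call `to_absolute_url(tag['href'])` could stand in for `base_url + tag['href']`.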


Putting this all together into a single function

In [None]:
# This function will return a pandas DataFrame containing the title, description and URL of each topic
def scrape_topics():

    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

topics_details = scrape_topics()

In [None]:
# Showing details of each topic
topics_details

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


# Getting the top 25 repositories from a topic page




In [None]:
# This function will download and parse the page at the given topic URL
def get_topic_page(topic_url):
    # Downloading the page
    response = requests.get(topic_url)
    # Checking for successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parsing using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


In [None]:
doc = get_topic_page('https://github.com/topics/3d')

In [None]:

def parse_star_count(stars):

    # Convert a star count such as '1.2k' or '850' to an integer
    stars = stars.strip()
    if stars[-1] == 'k':
        return int(float(stars[:-1]) * 1000)
    return int(stars)
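
`parse_star_count` truncates with `int()` after a floating-point multiplication, so a product that lands a hair below a whole number would lose one star. Below is a minimal sketch of a more defensive variant. It assumes counts might also appear with an uppercase `K` or thousands separators, which is an assumption rather than something observed on the scraped page. The name `parse_star_count_safe` is hypothetical.

```python
def parse_star_count_safe(stars):
    # Hypothetical variant of parse_star_count.
    # Normalise whitespace, case, and thousands separators (assumed formats).
    stars = stars.strip().lower().replace(',', '')
    if stars.endswith('k'):
        # round() instead of int() avoids losing a unit to float error
        return round(float(stars[:-1]) * 1000)
    return int(stars)

print(parse_star_count_safe(' 12.3k '))
print(parse_star_count_safe('850'))
```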
    

In [None]:
base_url = 'https://github.com'

def get_repo_info(repo_tag, star_tag):

    # Returns the username, repo name, star count and URL of a repository.
    # The first link in the tag is the owner, the second is the repository.
    a_tags = repo_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
    

In [None]:
def get_topic_repos(topic_doc):
  
    # Get the article tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all('article',{'class':'border rounded color-shadow-small color-bg-subtle my-4'})

    # Get star tags
    star_tags=topic_doc.find_all('span',{'id':'repo-stars-counter-star'})
    
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

In [None]:
def scrape_topic(topic_url, path):

    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
    

## Putting it all together

In [None]:
def scrape_topics_repos():

    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
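
The loop above uses each topic title directly as a file name. That works for the titles shown in this notebook, but a title containing a path separator such as `/` would be read as a subdirectory. Below is a minimal sketch of a sanitizing step. The name `topic_csv_path` is a hypothetical helper, and the set of characters it keeps is an assumption.

```python
import os
import re

def topic_csv_path(title, directory='data'):
    # Hypothetical helper: keep word characters, dots, plus signs, dashes
    # and spaces; replace anything else (e.g. '/') with an underscore.
    safe_title = re.sub(r'[^\w.+\- ]', '_', title).strip()
    return os.path.join(directory, safe_title + '.csv')

print(topic_csv_path('ASP.NET'))
print(topic_csv_path('CI/CD'))
```

In `scrape_topics_repos`, `topic_csv_path(row['title'])` could replace the `'data/{}.csv'.format(row['title'])` expression.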
        

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [None]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin

We can check that the CSVs were created properly
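
One way to do this check is sketched below, using the standard-library `csv` module rather than pandas so that it has no extra dependencies. The name `count_csv_rows` is a hypothetical helper. The demo writes a small made-up file to a temporary directory. In the notebook you would call it on the `data` directory instead.

```python
import csv
import os
import tempfile

# Column names written by get_topic_repos
EXPECTED_COLUMNS = ['username', 'repo_name', 'stars', 'repo_url']

def count_csv_rows(directory):
    # Hypothetical helper: map each CSV file name to its number of data rows,
    # raising an error if a file does not start with the expected header.
    counts = {}
    for name in sorted(os.listdir(directory)):
        if not name.endswith('.csv'):
            continue
        path = os.path.join(directory, name)
        with open(path, newline='', encoding='utf-8') as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header != EXPECTED_COLUMNS:
                raise ValueError('Unexpected header in {}: {}'.format(name, header))
            counts[name] = sum(1 for _ in reader)
    return counts

# Demo with made-up placeholder data (not real scraped results)
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'example.csv'), 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(EXPECTED_COLUMNS)
        writer.writerow(['some-user', 'some-repo', 1200, 'https://example.com/some-user/some-repo'])
    demo_counts = count_csv_rows(tmp)

print(demo_counts)
```

Calling `count_csv_rows('data')` after `scrape_topics_repos()` would return one entry per topic file, which makes empty or malformed files easy to spot.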