# GitHub Topics Repository Scraper

### Project Outline:


- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:
```
Repo_Name,Username,Stars,Repo_URL
three.js,mrdoob,101000,https://github.com/mrdoob/three.js
libgdx,libgdx,23000,https://github.com/libgdx/libgdx
```

### Scraping the list of topics from GitHub

- Set up the environment by importing the necessary libraries (BeautifulSoup, requests, pandas, os)


In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import os

### Send an HTTP Request:
   - Use the requests library to fetch the content of the GitHub Topics page.

In [2]:
topics_url = 'https://github.com/topics'
response = requests.get(topics_url)
if response.status_code != 200:
    raise Exception(f"Failed to load page {topics_url}")

### Parse the HTML Content:
   - Use BeautifulSoup to parse the HTML content of the page. This allows you to navigate and search the HTML structure easily.

In [3]:
doc = BeautifulSoup(response.text, 'html.parser')

### Extracting Topic Information:
   - Find the HTML elements containing the topic information. Typically, each topic is enclosed in a `<div>` tag with specific classes.

In [4]:
topics = doc.find_all('div', class_="py-4 border-bottom d-flex flex-justify-between")

#### Each `div` with the class `"py-4 border-bottom d-flex flex-justify-between"` contains the topic_name, topic_description and topic_url

![](https://i.imgur.com/iMiIvAc.png)
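
Passing a full class string to `class_` only matches elements whose `class` attribute is exactly that string, so the lookup stops matching if the classes are reordered or another class is added. As a sketch of a more tolerant alternative, a CSS selector passed to `select()` matches any element that carries all of the listed classes. The HTML snippet and the name `demo_doc` below are invented for illustration and are not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Invented markup: the second div has the same classes in another order plus an extra one
html = """
<div class="py-4 border-bottom d-flex flex-justify-between"><p>3D</p></div>
<div class="border-bottom py-4 d-flex flex-justify-between extra"><p>Ajax</p></div>
"""
demo_doc = BeautifulSoup(html, 'html.parser')

# Exact-string match: only the first div has exactly this class attribute
exact = demo_doc.find_all('div', class_='py-4 border-bottom d-flex flex-justify-between')
print(len(exact))

# CSS selector: both divs carry all four classes, regardless of order or extras
tolerant = demo_doc.select('div.py-4.border-bottom.d-flex.flex-justify-between')
print([div.p.text for div in tolerant])
```

Either approach still depends on class names that GitHub may change at any time.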

#### The Function `scrape_topics()` iterates over all the topics found above, collects each topic's info and converts it into a DataFrame

In [5]:
def scrape_topics():

    # Base URL to prepend to the relative path parsed from each 'href'
    # (no trailing slash, because the 'href' values already start with '/')
    base_url = 'https://github.com'

    # Dictionary to collect the topic details
    topics_dict = {
        'title': [],
        'description': [],
        'url': []
    }

    # Iterate over every topic block to get topic_name, topic_description, topic_url
    for topic in topics:
        topic_name = topic.find('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary').text.strip()
        topic_description = topic.find('p', class_='f5 color-fg-muted mb-0 mt-1').text.strip()
        topic_link = topic.find('a', class_='no-underline flex-1 d-flex flex-column')
        topic_url = topic_link['href']
        topics_dict['title'].append(topic_name)
        topics_dict['description'].append(topic_description)
        topics_dict['url'].append(base_url + topic_url)

    # Convert the dictionary into a DataFrame using pandas
    return pd.DataFrame(topics_dict)

In [6]:
scrape_topics() # Here's the list of topics with title, description and url from `https://github.com/topics`

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |
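
Joining a base URL and a relative `href` by string concatenation is sensitive to stray slashes on either side. As an alternative sketch, `urllib.parse.urljoin` from the standard library produces the same URL whether or not the base ends with `/`. The `hrefs` values below are examples in the shape seen in the output above, not values fetched live.

```python
from urllib.parse import urljoin

base_url = 'https://github.com/'
hrefs = ['/topics/3d', '/topics/ajax']  # example relative links, as found in the 'href' attribute

for href in hrefs:
    # urljoin resolves the relative link against the base URL
    print(urljoin(base_url, href))
```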


#### The Function `parse_stars_count()` converts the star count from a string (with an optional `k` or `M` suffix) into an integer

In [7]:
def parse_stars_count(stars_str):
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    elif stars_str[-1] == 'M':
        return int(float(stars_str[:-1]) * 1000000)
    else:
        return int(stars_str)
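
A slightly more defensive variant could look like the sketch below. The name `parse_count` is hypothetical, and the handling of surrounding whitespace, thousands separators and an uppercase `K` is an assumption rather than something observed on the page. It uses `round` instead of `int` so that a product landing just below a whole number is not truncated.

```python
def parse_count(text):
    """Convert strings such as '101k', '1.2M' or '23,000' to an int."""
    text = text.strip().replace(',', '')
    if not text:
        raise ValueError("empty star count")
    multipliers = {'k': 1_000, 'm': 1_000_000}
    suffix = text[-1].lower()
    if suffix in multipliers:
        # Scale the numeric part and round to the nearest whole number
        return round(float(text[:-1]) * multipliers[suffix])
    return int(text)

for sample in ['101k', '26.8K', '1.2M', '23,000', ' 512 ']:
    print(repr(sample), parse_count(sample))
```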

#### The Function `fetch_topic_page()` downloads and parses the topic page.


In [8]:
def fetch_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    
    # Check successful response
    if response.status_code != 200:
        raise Exception(f"Failed to load page {topic_url}")
        
    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

#### Each `div` with the class `"d-flex flex-justify-between flex-items-start flex-wrap gap-2 my-3"` contains the repo_name, repo_user_name and repo_rating
![](https://i.imgur.com/2kUfMrP.png)

#### The Function `extract_repo_info()` extracts repository information from the parsed HTML.

In [9]:
def extract_repo_info(topic_doc):
    # Create the dictionary with repository details
    repos_dict = {
        'repo_name': [],
        'user_name': [],
        'repo_rating': [],
        'repo_url': []
    }

    base_url = 'https://github.com/'

    # Each matching 'div' holds the user name, repo name and star count of one repository
    repository = topic_doc.find_all('div', class_='d-flex flex-justify-between flex-items-start flex-wrap gap-2 my-3')

    for repo in repository:
        # After splitting on whitespace, token 0 is the user name, token 2 is the repo name
        # and the last token is the star count
        cleaned_repo = repo.text.strip().replace('\n', '').split()
        user_name = cleaned_repo[0]
        repo_name = cleaned_repo[2]
        repos_dict['repo_name'].append(repo_name)
        repos_dict['user_name'].append(user_name)
        repos_dict['repo_rating'].append(parse_stars_count(cleaned_repo[-1]))
        repos_dict['repo_url'].append(base_url + user_name + '/' + repo_name)

    return pd.DataFrame(repos_dict)

In [10]:
url = 'https://github.com/topics/3d' # URL of one topic page, used to show sample output
topic_doc = fetch_topic_page(url)
df = extract_repo_info(topic_doc)
df

| | repo_name | user_name | repo_rating | repo_url |
|---|---|---|---|---|
| 0 | three.js | mrdoob | 101000 | https://github.com/mrdoob/three.js |
| 1 | react-three-fiber | pmndrs | 26800 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 23000 | https://github.com/libgdx/libgdx |
| 3 | Babylon.js | BabylonJS | 22800 | https://github.com/BabylonJS/Babylon.js |
| 4 | tinyrenderer | ssloy | 19900 | https://github.com/ssloy/tinyrenderer |
| 5 | FreeCAD | FreeCAD | 18400 | https://github.com/FreeCAD/FreeCAD |
| 6 | 3d-game-shaders-for-beginners | lettier | 17500 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | aframe | aframevr | 16500 | https://github.com/aframevr/aframe |
| 8 | cesium | CesiumGS | 12500 | https://github.com/CesiumGS/cesium |
| 9 | blender | blender | 12300 | https://github.com/blender/blender |
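
To make the token indices used in `extract_repo_info()` easier to follow, here is a sketch on a hard-coded string. The sample text is an assumption about what `repo.text` looks like for one repository block (user name, a slash, repo name, then the star counter); the real markup may differ.

```python
# Assumed shape of repo.text for a single repository block
sample_text = """
    mrdoob /
    three.js

    Star
    101k
"""

# Same cleaning steps as in extract_repo_info()
tokens = sample_text.strip().replace('\n', '').split()
print(tokens)

# Token 0 is the user name, token 2 the repo name, the last token the star count
user_name, repo_name, stars = tokens[0], tokens[2], tokens[-1]
print(f"https://github.com/{user_name}/{repo_name}", stars)
```

Because the indices rely on this exact token order, a change in the page layout would silently shift the columns, so it is worth spot-checking a few rows after each run.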


#### The Function `scrape_topic_repos()`:
- It calls `scrape_topics()` to get the topic information
- For each topic, it calls `fetch_topic_page()` and `extract_repo_info()` to download the topic page and extract the repo information
- It saves the scraped repos of each topic to a CSV file named after the topic

In [11]:
def scrape_topic_repos():
    
    topics_df = scrape_topics()
    
    # Create a new directory for CSV files
    
    directory = "GitHub_Topic_Repos"
    if not os.path.exists(directory):
        os.makedirs(directory)
    
    # Iterate over rows using itertuples() and process each topic    
    
    for row in topics_df.itertuples(index=False):
        topic_name = row.title
        topic_url = row.url
        print(f"Scraping repositories for topic: {topic_name} from URL: {topic_url}")
        
        # Calling fetch_topic_page and extract_repo_info to iterate over every topic_url and get repo_name, repo_user_name, repo_rating, repo_url
        
        topic_docs = fetch_topic_page(topic_url)
        df = extract_repo_info(topic_docs)
        
        # Saving the extracted info into a .csv file
        
        file_name = f"{directory}/{topic_name}.csv"
        df.to_csv(file_name, index=False)
        print(f"Data for topic '{topic_name}' has been saved to {file_name}")   
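
Topic titles are used directly as file names here. As a sketch of a safer alternative, the hypothetical helper `topic_csv_path()` below replaces characters that are awkward in file names and builds the path with `os.path.join`. The set of characters to keep is an assumption, and `CI/CD` is a made-up title used only to show the replacement.

```python
import os
import re

def topic_csv_path(directory, topic_name):
    """Build a CSV path for a topic, replacing characters that are unsafe in file names."""
    # Keep letters, digits, '.', '+', '#', '_' and '-'; replace everything else with '_'
    safe_name = re.sub(r'[^A-Za-z0-9.+#_-]', '_', topic_name.strip())
    return os.path.join(directory, f"{safe_name}.csv")

print(topic_csv_path('GitHub_Topic_Repos', '3D'))
print(topic_csv_path('GitHub_Topic_Repos', 'ASP.NET'))
print(topic_csv_path('GitHub_Topic_Repos', 'CI/CD'))  # made-up title containing a slash
```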

#### Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [12]:
scrape_topic_repos() # scrapes every topic and writes one CSV file per topic

Scraping repositories for topic: 3D from URL: https://github.com/topics/3d
Data for topic '3D' has been saved to GitHub_Topic_Repos/3D.csv
Scraping repositories for topic: Ajax from URL: https://github.com/topics/ajax
Data for topic 'Ajax' has been saved to GitHub_Topic_Repos/Ajax.csv
Scraping repositories for topic: Algorithm from URL: https://github.com/topics/algorithm
Data for topic 'Algorithm' has been saved to GitHub_Topic_Repos/Algorithm.csv
Scraping repositories for topic: Amp from URL: https://github.com/topics/amphp
Data for topic 'Amp' has been saved to GitHub_Topic_Repos/Amp.csv
Scraping repositories for topic: Android from URL: https://github.com/topics/android
Data for topic 'Android' has been saved to GitHub_Topic_Repos/Android.csv
Scraping repositories for topic: Angular from URL: https://github.com/topics/angular
Data for topic 'Angular' has been saved to GitHub_Topic_Repos/Angular.csv
Scraping repositories for topic: Ansible from URL: https://github.com/topics/

![](https://i.imgur.com/fo0Orgt.png)

#### We can check that the CSVs were created properly using pandas

In [13]:
# Read and display one of the CSV files

pd.read_csv('GitHub_Topic_Repos/C.csv')

| | repo_name | user_name | repo_rating | repo_url |
|---|---|---|---|---|
| 0 | scrcpy | Genymobile | 106000 | https://github.com/Genymobile/scrcpy |
| 1 | neovim | neovim | 80400 | https://github.com/neovim/neovim |
| 2 | obs-studio | obsproject | 57800 | https://github.com/obsproject/obs-studio |
| 3 | awesome-cpp | fffaraz | 57400 | https://github.com/fffaraz/awesome-cpp |
| 4 | git | git | 51200 | https://github.com/git/git |
| 5 | FFmpeg | FFmpeg | 44000 | https://github.com/FFmpeg/FFmpeg |
| 6 | open-source-mac-os-apps | serhii-londar | 40600 | https://github.com/serhii-londar/open-source-m... |
| 7 | vim | vim | 35700 | https://github.com/vim/vim |
| 8 | curl | curl | 34900 | https://github.com/curl/curl |
| 9 | interview | huihut | 33900 | https://github.com/huihut/interview |
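
To check every file rather than a single one, a sketch like the following could compare each CSV header with the expected columns and flag files without data rows. It uses the standard-library `csv` module and a temporary directory with two hand-written files so that it runs on its own; the helper name `find_bad_csvs` is hypothetical.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = ['repo_name', 'user_name', 'repo_rating', 'repo_url']

def find_bad_csvs(directory, expected=EXPECTED_COLUMNS):
    """Return names of CSV files whose header differs from `expected` or that have no data rows."""
    bad = []
    for file_name in sorted(os.listdir(directory)):
        if not file_name.endswith('.csv'):
            continue
        with open(os.path.join(directory, file_name), newline='', encoding='utf-8') as f:
            rows = list(csv.reader(f))
        if not rows or rows[0] != expected or len(rows) < 2:
            bad.append(file_name)
    return bad

# Demo on a temporary directory: one valid file and one with a header only
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'good.csv'), 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows([
            EXPECTED_COLUMNS,
            ['libgdx', 'libgdx', 23000, 'https://github.com/libgdx/libgdx'],
        ])
    with open(os.path.join(tmp, 'empty.csv'), 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow(EXPECTED_COLUMNS)
    result = find_bad_csvs(tmp)
    print(result)
```

Pointing `find_bad_csvs` at the `GitHub_Topic_Repos` directory would run the same check on the scraped files.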


## Summary:

- Scrape the GitHub Topics page and get a list of topics.
- Extract each topic's title, description, and URL.
- For each topic, fetch the top repositories.
- Extract repository details including name, username, star count, and URL.
- Save the extracted repository data into CSV files, with each file named after the respective topic.

## References:

- https://imgur.com/ - used to host the screenshots
- https://pandas.pydata.org/docs/index.html - pandas documentation
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/ - BeautifulSoup documentation

## Future Work:

- To get more topics, we can add an end_url such as `"/go/?page=2"` and join the base_url, topic_url and end_url, so that we can parse more topics
- The same can be done for the repos, with an end_url such as `"/?page=2"`
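
As a sketch of how the page URLs could be built, the hypothetical helper `with_page()` below sets a `page` query parameter on a URL with `urllib.parse`. It assumes that pages are selected through a `page` query parameter, as the `?page=2` in the notes above suggests; this has not been verified against the site here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_page(url, page):
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    # Keep existing query parameters, dropping any earlier `page` value
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != 'page']
    query.append(('page', str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))

topic_url = 'https://github.com/topics/3d'
page_urls = [with_page(topic_url, n) for n in range(1, 4)]
print(page_urls)
```

Each generated URL could then be passed to `fetch_topic_page()` and the resulting DataFrames concatenated.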

