## Scraping GitHub Top Repositories for Featured Topics

**Web scraping** is the process of automatically extracting data from websites. It enables users to collect structured information from web pages for analysis, automation, and integration into various applications. This technique is widely used in data science, business intelligence, and competitive analysis.

**GitHub** is a version control and collaboration platform where developers manage and share code. It categorizes projects based on topics, making it easier to discover repositories related to specific technologies and domains.

**Project Objective**
In this project, we aim to create CSV files containing repository and user details for the top repositories under featured topics on GitHub. 

The tools used in this project include:
- Python: The core programming language for data extraction and processing.
- Requests: To send HTTP requests and download pages from GitHub.
- BeautifulSoup: To parse the downloaded HTML and extract topic and repository details.
- OS: To manage file operations and directories.
- Pandas: For data manipulation and saving extracted details into CSV files.

**Here are the steps we'll follow:**

- We're going to scrape 'https://github.com/topics'
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file in the following format:

```
username,repository_name,stars,repository_url
mrdoob,three.js,69700,https://github.com/mrdoob/three.js
```

#### Installing the required Libraries

In [1]:
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet
!pip install pandas --quiet



#### Importing the libraries

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

#### Scrape the list of topics from GitHub
Steps to follow:
- Use requests to download the page
- Use BeautifulSoup to parse the HTML and extract the information
- Convert the extracted lists to a pandas DataFrame

##### Create a function to download a page and return it as a BeautifulSoup object


In [3]:
# This is the Base Url for accessing pages further and is declared globally. (Can also be declared wherever required instead)
base_url = 'https://github.com'

In [4]:
def get_topic_page(topic_url):
    # Get the page from the given topic URL
    response = requests.get(topic_url)
    # Raise an exception if the page was not downloaded successfully
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse the downloaded HTML with BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc
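
A slightly more defensive variant of the download step could look like the sketch below. It is an illustration rather than the notebook's own method: the function name `fetch_page`, the `timeout` value and the User-Agent string are assumptions made for this example. Note that `raise_for_status()` only raises for 4xx and 5xx responses, whereas the function above rejects every status other than 200.

```python
import requests
from bs4 import BeautifulSoup


def fetch_page(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object."""
    # Hypothetical User-Agent string, used here only for illustration.
    headers = {'User-Agent': 'github-topics-scraper-example'}
    # Without a timeout, requests can wait indefinitely on a stalled connection.
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

It can be called the same way as the original, for example `fetch_page('https://github.com/topics')`.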

##### Create a function that parses the topics page with BeautifulSoup and returns a DataFrame with the columns title, description and url
- First fetch the list of tags using the ***find*** and ***find_all*** methods of bs4
- Extract the text and links from those tags to build lists of titles, descriptions and URLs, then store them in a pandas DataFrame
- *(Optional) Save the topics list as a CSV file.*
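
As a small illustration of the two methods mentioned above, the sketch below runs `find` and `find_all` on an inline HTML snippet. The markup and the class names (`topic-link`, `topic-title`, `topic-desc`) are made up for this example and are much simpler than GitHub's real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the topics page.
html = """
<a class="topic-link" href="/topics/3d">
  <p class="topic-title">3D</p>
  <p class="topic-desc">3D modeling creates a representation of an object.</p>
</a>
<a class="topic-link" href="/topics/ajax">
  <p class="topic-title">Ajax</p>
  <p class="topic-desc">Ajax is a technique for interactive web applications.</p>
</a>
"""
doc = BeautifulSoup(html, 'html.parser')

# find returns the first matching tag, or None when nothing matches.
first_title = doc.find('p', class_='topic-title')
missing = doc.find('p', class_='does-not-exist')

# find_all returns every matching tag as a list.
titles = [tag.text.strip() for tag in doc.find_all('p', class_='topic-title')]
urls = ['https://github.com' + a['href'] for a in doc.find_all('a', class_='topic-link')]
```

Here `titles` holds the text of both title tags and `urls` holds the absolute topic URLs built from each `href`.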

In [5]:
def scrape_topics_from_github():
    # Get the page located at 'topics_url'
    topics_url = 'https://github.com/topics'
    doc = get_topic_page(topics_url)
    
    # Preparing the dataframe of topics page by extracting the titles, descriptions and urls for each topic -
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    # Convert the data into dataframe and return the df
    topics_df = pd.DataFrame(topics_dict)

    # Create a CSV file for dataframe to store the topics details
    # topics_df.to_csv('github_topics.csv', index = False)
    
    return topics_df

# Helper function that takes the parsed topics page (a BeautifulSoup object) and returns the list of topic titles
def get_topic_titles(doc):
    topic_selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': topic_selection_class})
    topic_titles = []
    for title in topic_title_tags:
        topic_titles.append(title.text.strip())
    return topic_titles

# Helper function that takes the parsed topics page (a BeautifulSoup object) and returns the list of topic descriptions
def get_topic_descs(doc):
    desc_selection_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selection_class})
    topic_descs = []
    for desc in topic_desc_tags:
        topic_descs.append(desc.text.strip())
    return topic_descs

# Helper function that takes the parsed topics page (a BeautifulSoup object) and returns the list of topic URLs
def get_topic_urls(doc):
    links_selection_class = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': links_selection_class})
    # Each href is relative (e.g. /topics/3d), so prepend the global base_url
    # to get the full address, e.g. https://github.com/topics/3d
    topic_urls = []
    for url in topic_link_tags:
        topic_urls.append(base_url + url['href'].strip())
    return topic_urls
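
The three helpers above build their lists independently, so if one topic lacked a description the lists would have different lengths and `pd.DataFrame` would raise a `ValueError`. One possible way around this, sketched below under assumptions, is to walk each topic link once and read the title and description from inside it. The function name `extract_topic_rows`, the sample markup and its class names are hypothetical and do not reflect GitHub's real HTML.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the second topic has no description on purpose.
SAMPLE_HTML = """
<a class="topic-link" href="/topics/3d">
  <p class="topic-title">3D</p>
  <p class="topic-desc">3D modeling creates a representation of an object.</p>
</a>
<a class="topic-link" href="/topics/ajax">
  <p class="topic-title">Ajax</p>
</a>
"""


def extract_topic_rows(doc, base_url='https://github.com'):
    """Return one dict per topic so the fields of a topic stay together."""
    rows = []
    for link in doc.find_all('a', class_='topic-link'):
        title_tag = link.find('p', class_='topic-title')
        desc_tag = link.find('p', class_='topic-desc')
        rows.append({
            'title': title_tag.text.strip() if title_tag else '',
            # A missing description becomes an empty string instead of shifting rows.
            'description': desc_tag.text.strip() if desc_tag else '',
            'url': base_url + link['href'].strip(),
        })
    return rows


topics_df = pd.DataFrame(extract_topic_rows(BeautifulSoup(SAMPLE_HTML, 'html.parser')))
```

Whether this layout applies depends on the description actually being nested inside the link on the live page, which this sketch simply assumes.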


##### Getting information for each topic page using the topic url and saving to CSV file
- Open and download the topic URL page
- Create a directory to store all your files
- Parse and extract the details like - username, repository name, stars and repository URL
- Create the topic-wise CSV files.

##### Helper functions for fetching details from a topic page

##### This function takes the heading and star tags of one repository and returns username, repo_name, stars and repo_url

In [6]:
# Fetch the details of one repository: username, repo name, stars and repo URL
def get_repo_details(header_tags, star_tags):
    # The first link in the heading is the user, the second is the repository
    a_tags = header_tags.find_all('a')
    # Strip the whitespace and newlines that surround the link text
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    # Convert the star count string (e.g. '69.7k') to an integer
    stars = parse_stars(star_tags.text)
    repo_url = base_url + a_tags[1]['href']
    
    return username, repo_name, stars, repo_url


# Function to convert the star string value to integer
def parse_stars(star):
    star = star.strip()
    if star[-1].lower() == 'k':
        return int(float(star[:-1]) * 1000)
    return int(star)
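
`parse_stars` multiplies a float and then truncates with `int()`, so binary floating-point rounding could in principle leave a product just below the intended integer. A possible alternative, shown here as a sketch, does the arithmetic with `decimal.Decimal`. The name `parse_stars_exact`, the `m` suffix and the comma handling are assumptions about formats that may or may not appear on the page.

```python
from decimal import Decimal


def parse_stars_exact(star_text):
    """Convert strings such as '69.7k' or '1,234' to an integer."""
    text = star_text.strip().lower().replace(',', '')
    multiplier = 1
    if text.endswith('k'):
        multiplier = 1000
        text = text[:-1]
    elif text.endswith('m'):  # assumed suffix for millions
        multiplier = 1_000_000
        text = text[:-1]
    # Decimal keeps '69.7' exact, so the product is exactly 69700.
    return int(Decimal(text) * multiplier)


print(parse_stars_exact('69.7k'))  # 69700
print(parse_stars_exact('987'))    # 987
```

An empty or non-numeric string still raises an exception here, as it does in the original function.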


##### A function that takes the BeautifulSoup object of a topic page and returns a DataFrame with the details of its top repositories

In [7]:
# Build a DataFrame of the repositories listed on a topic page
def get_repos_for_topics(topic_doc):
    # Extract the repositories tags from the page
    repo_tags = topic_doc.find_all('h3', {'class' : 'f3 color-fg-muted text-normal lh-condensed'})
    star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'})
    
    # Create a dict to store the details
    topics_repo_dict = {
        'username' : [],
        'repository_name' : [],
        'stars' : [],
        'repository_url' : []
        }
    
    # Traverse through each repository to fetch the details
    for i in range(len(repo_tags)):
        username, repo_name, stars, repo_url = get_repo_details(repo_tags[i], star_tags[i])
        #print(i, username, repo_name, stars, repo_url)
        topics_repo_dict['username'].append(username)
        topics_repo_dict['repository_name'].append(repo_name)
        topics_repo_dict['stars'].append(stars)
        topics_repo_dict['repository_url'].append(repo_url)

    return pd.DataFrame(topics_repo_dict)
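
The loop above indexes both tag lists by position, so it would raise an `IndexError` if the page returned fewer star counters than repository headings. The sketch below shows one possible variation that pairs the two lists with `zip`, which stops at the shorter list. The function name `extract_repo_rows`, the sample markup and its class names are invented for this example, and the star count is left as the raw string.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for one repository entry.
SAMPLE_HTML = """
<article>
  <h3 class="repo-heading">
    <a href="/mrdoob">
      mrdoob
    </a> /
    <a href="/mrdoob/three.js">
      three.js
    </a>
  </h3>
  <span class="star-count">69.7k</span>
</article>
"""


def extract_repo_rows(doc, base_url='https://github.com'):
    """Pair each repository heading with its star counter."""
    headings = doc.find_all('h3', class_='repo-heading')
    counters = doc.find_all('span', class_='star-count')
    rows = []
    # zip stops at the shorter list, so a missing counter cannot cause an IndexError.
    for heading, counter in zip(headings, counters):
        user_link, repo_link = heading.find_all('a')[:2]
        rows.append({
            'username': user_link.text.strip(),
            'repository_name': repo_link.text.strip(),
            'stars': counter.text.strip(),
            'repository_url': base_url + repo_link['href'],
        })
    return rows


rows = extract_repo_rows(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
```

The trade-off is that `zip` drops unmatched entries silently, so comparing `len(headings)` with `len(counters)` first may be worthwhile.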



The functions below create a directory for the generated files and parse each topic page using the helper functions implemented above.

In [8]:
# Create the output directory in the current working directory
def create_dir():
    dirname = "Github_topics_files"
    try:
        os.mkdir(dirname)
        # Report success only when the directory was actually created
        print('Directory {} created.'.format(dirname))
    except FileExistsError:
        print('Directory {} already exists.'.format(dirname))
    except OSError as err:
        print(f"Error creating directory: {err}")
    return dirname

# Open the topic page and save its repository details to a CSV file in the given directory
def parse_topic_page(topic_url, topic_name, dirname):
    print('Processing {}...'.format(topic_name))
    # Fetch the topic page with url = topic_url
    topic_doc = get_topic_page(topic_url)
    # Use the page to extract repositories
    topic_repos_df = get_repos_for_topics(topic_doc)

    # Build the path of the CSV file for this topic
    fname = topic_name + '.csv'
    fpath = os.path.join(dirname, fname)
    print('Generting the file {} for topic {}...'.format(fname, topic_name))
    topic_repos_df.to_csv(fpath, index = False)
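
Topic titles are used directly as file names, which would break for a title containing a path separator. A possible alternative using `pathlib` is sketched below. The helper names `ensure_dir` and `safe_csv_path` are hypothetical, and the set of characters that gets replaced is an assumption chosen for this example.

```python
import re
from pathlib import Path


def ensure_dir(dirname):
    """Create the directory if needed and return it as a Path."""
    path = Path(dirname)
    # exist_ok=True makes repeated runs succeed without raising FileExistsError.
    path.mkdir(parents=True, exist_ok=True)
    return path


def safe_csv_path(directory, topic_name):
    """Build a CSV path, replacing characters that are risky in file names."""
    # Keep letters, digits, dots, underscores, spaces and hyphens; replace the rest.
    cleaned = re.sub(r'[^A-Za-z0-9._ -]', '_', topic_name).strip()
    return Path(directory) / f'{cleaned}.csv'


print(safe_csv_path('Github_topics_files', 'ASP.NET').name)  # ASP.NET.csv
print(safe_csv_path('Github_topics_files', 'C/C++').name)    # C_C__.csv
```

The resulting `Path` can be passed straight to `DataFrame.to_csv`.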
    

##### This is the main function (driver) where the program starts

In [9]:
def scrape_github_main():
    # Get the list of topics from github Topics page
    print('Getting the list of top featured topics from Github...')
    topics_df = scrape_topics_from_github()

    # Create a directory
    dirname = create_dir()
    
    # Fetch each topics page and create the csv file
    for index, row in topics_df.iterrows():
        parse_topic_page(row['url'], row['title'], dirname)

    print('All files are generated in the directory {}'.format(dirname))
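
In the driver above, a single failing topic page stops the whole run, and the requests are sent back to back. The sketch below shows one way the loop could isolate failures and pause between topics. The function `process_topics`, its `handler` callback and the one-second default delay are assumptions for illustration, not part of the original notebook.

```python
import time


def process_topics(topics, handler, delay_seconds=1.0):
    """Call handler(url, title) for each (title, url) pair and collect failures."""
    failed = []
    for position, (title, url) in enumerate(topics):
        # Pause between requests, but not before the first one.
        if position > 0:
            time.sleep(delay_seconds)
        try:
            handler(url, title)
        except Exception as err:
            # Record the failure and continue with the remaining topics.
            print(f'Skipping {title}: {err}')
            failed.append(title)
    return failed
```

With the notebook's own functions it could be called as `process_topics(zip(topics_df['title'], topics_df['url']), lambda url, title: parse_topic_page(url, title, dirname))`.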

Execute the code cell below to scrape the top repositories for all the topics on the first page of https://github.com/topics

In [10]:
scrape_github_main()

Getting the list of top featured topics from Github...
Directory Github_topics_files created.
Processing 3D...
Generting the file 3D.csv for topic 3D...
Processing Ajax...
Generting the file Ajax.csv for topic Ajax...
Processing Algorithm...
Generting the file Algorithm.csv for topic Algorithm...
Processing Amp...
Generting the file Amp.csv for topic Amp...
Processing Android...
Generting the file Android.csv for topic Android...
Processing Angular...
Generting the file Angular.csv for topic Angular...
Processing Ansible...
Generting the file Ansible.csv for topic Ansible...
Processing API...
Generting the file API.csv for topic API...
Processing Arduino...
Generting the file Arduino.csv for topic Arduino...
Processing ASP.NET...
Generting the file ASP.NET.csv for topic ASP.NET...
Processing Awesome Lists...
Generting the file Awesome Lists.csv for topic Awesome Lists...
Processing Amazon Web Services...
Generting the file Amazon Web Services.csv for topic Amazon Web Services...
Proces

#### Validation steps
Once the code has run, the following checks confirm that the files were generated correctly:
- First, make sure all the files were generated with the proper names and inside the proper directory.
- Next, validate the content of the files: they must be comma-separated files with 4 columns (username, repository name, stars, repository URL). Compare a few files against the corresponding pages on GitHub to check that the names match and appear in the same order (top to bottom).
- Additionally, we can use pandas to open the CSV files and validate the data and structure.
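
The pandas check from the last step could be sketched as follows. The function names `validate_topic_csv` and `validate_directory` and the specific rules (integer stars, URLs starting with `https://github.com/`) are assumptions chosen for this example. The column names match the dictionary keys used when the DataFrame is built.

```python
from pathlib import Path

import pandas as pd

EXPECTED_COLUMNS = ['username', 'repository_name', 'stars', 'repository_url']


def validate_topic_csv(csv_path):
    """Return a list of problems found in one CSV file (empty if none)."""
    problems = []
    df = pd.read_csv(csv_path)
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append(f'unexpected columns: {list(df.columns)}')
        return problems
    if df.empty:
        problems.append('no rows')
    if df.isna().any().any():
        problems.append('missing values')
    if not pd.api.types.is_integer_dtype(df['stars']):
        problems.append('stars column is not integer')
    # Every repository URL is expected to point at github.com.
    urls = df['repository_url'].astype(str)
    if not urls.str.startswith('https://github.com/').all():
        problems.append('repository_url outside github.com')
    return problems


def validate_directory(dirname='Github_topics_files'):
    """Validate every CSV file in the directory, keyed by file name."""
    return {path.name: validate_topic_csv(path)
            for path in sorted(Path(dirname).glob('*.csv'))}
```

Calling `validate_directory()` after a run returns a dictionary in which any non-empty list points at a file worth inspecting by hand.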

## **Conclusion**  
With this project, we successfully scraped and stored details of the top repositories for featured topics on GitHub. Using **Python, Requests, BeautifulSoup, OS, and Pandas**, we automated the extraction of key repository and user details, making it easier to analyze trending projects.  
## **Useful References**  
During development, the following resources were particularly helpful:  
- [GitHub Topics](https://github.com/topics) – Repository categories and featured topics  
- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#) – Web scraping guide  
- [Python Pathlib](https://docs.python.org/3/library/pathlib.html) – File system operations  
- [Python Input/Output](https://docs.python.org/3/tutorial/inputoutput.html) – Data handling  
- [YouTube Tutorial](https://www.youtube.com/live/RKsLLG-bzEY?si=gHjxeLaRf35veaIF) – Related learning resource  

## **Future Enhancements**  
This project can be extended with additional features, such as:  
- **Data Analysis**: Using Python libraries to analyze trends in GitHub topics.  
- **Multi-Page Scraping**: Extracting data from multiple pages for deeper insights.  
- **User Interaction**: Allowing users to input topics and customize the data extraction process.  

This is just the beginning: there’s much more to explore and enhance! 🚀 