# github-topic-scraper

In this project, I scrape data on the top 20 repositories for each of the first 30 topics listed on github.com/topics.

URL used: https://github.com/topics <br>
Tools used: Python (requests, BeautifulSoup, pandas) <br>

Final desired output: <br>

A topics.csv file in the format: topic_title | topic_description | topic_page_url <br>

For each topic, a CSV file with its top 20 repos. <br>
Format: username | repository_name | no. of stars | repository_url <br>

# Project Outline

 1. Use the requests library to load the topics page. <br>
 2. Create a BeautifulSoup object to parse the page's HTML. <br>
 3. Find the tags for each of the data elements we want to scrape. <br>
 4. Use those tags to extract the information. <br>
 5. Create a pandas DataFrame from the extracted information. <br>
 6. Convert the DataFrame to a CSV file and save it. <br>


# Import all required libraries

In [1]:
import os
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Get the topics page and information for each topic

In [2]:
def get_topics():
    topics_url = "https://github.com/topics"
    response_top = requests.get(topics_url)
    if response_top.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response_top.text,'html.parser')
    
    #get all titles
    topic_title = []
    topic_title_tags = doc.find_all('p' ,{'class' : 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    for i in topic_title_tags:
        topic_title.append(i.text)  # .text gives the text between the opening and closing tags
        
    #get all desc
    topic_desc = []
    topic_desc_tags = doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})
    for j in topic_desc_tags:
        topic_desc.append(j.text.strip())  # strip() removes leading and trailing whitespace
        
    #get all urls
    topic_urls = []
    base = 'https://github.com'
    topic_urls_tags = doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
    for k in topic_urls_tags:
        topic_urls.append(base + k['href'])  # href holds a relative path, so prepend the base URL
    d = {
        'title': topic_title,
        'descriptions': topic_desc,
        'url': topic_urls,
    }
    
    return pd.DataFrame(d)
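
The desired output lists a topics.csv file, but none of the cells writes one. A minimal sketch of how that could be done follows. It assumes the three names in the format line are meant as column headers, and `save_topics` is a hypothetical helper that is not part of the notebook. The sample row stands in for the result of `get_topics()` so that the sketch runs offline.

```python
import os
import tempfile

import pandas as pd


def save_topics(topics_df, path='topics.csv'):
    """Hypothetical helper: write the topics DataFrame using the documented headers."""
    # Map the DataFrame's column names to the names listed for topics.csv
    out = topics_df.rename(columns={
        'title': 'topic_title',
        'descriptions': 'topic_description',
        'url': 'topic_page_url',
    })
    out.to_csv(path, index=False)
    return out


# Stand-in for get_topics(); the description is a placeholder, not scraped text
sample = pd.DataFrame({
    'title': ['3D'],
    'descriptions': ['placeholder description'],
    'url': ['https://github.com/topics/3d'],
})

# Write to a temporary folder so the example leaves no files behind
out_path = os.path.join(tempfile.mkdtemp(), 'topics.csv')
saved = save_topics(sample, out_path)
print(pd.read_csv(out_path).columns.tolist())
```

In the notebook itself this would presumably be called as `save_topics(get_topics())`.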

# Get information from each repository in a DataFrame

In [3]:
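def parse_star_count(stars_str):
    # Convert a star count such as '81.5k' or '950' to an integer.
    stars_str = stars_str.strip()  # tag text may carry surrounding whitespace
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)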
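
To make the star parsing easier to check, here is a small sketch of a more tolerant variant. `parse_star_count_safe` is a hypothetical name, and the comma and uppercase `K` inputs are assumptions about what the page might show, not formats observed in the scraped data.

```python
def parse_star_count_safe(stars_str):
    """Hypothetical, more tolerant variant of parse_star_count."""
    # Normalise whitespace, letter case and thousands separators first
    text = stars_str.strip().lower().replace(',', '')
    if text.endswith('k'):
        # round() guards against floating-point artifacts before converting to int
        return int(round(float(text[:-1]) * 1000))
    return int(text)


for raw in ['81.5k', ' 950\n', '1,234', '2K']:
    print(repr(raw), '->', parse_star_count_safe(raw))
```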

In [4]:
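def get_repo_info(h3_tag, star_count):
    # Returns (username, repo name, repo URL, star count) for one repository heading.
    base_url = "https://github.com"
    a_tag = h3_tag.find_all('a')
    # The first link is the owner's profile; the second is the repository itself
    username = a_tag[0].text.strip()
    reponame = a_tag[1].text.strip()
    repo_url = base_url + a_tag[1]['href']
    star_num = parse_star_count(star_count)
    return username, reponame, repo_url, star_num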

In [5]:
def get_topic_repos(topic_url):
    resp = requests.get(topic_url)
    
    if resp.status_code != 200:
        raise Exception ('Failed to load page {}'.format(topic_url))
    topic_doc = BeautifulSoup(resp.text,'html.parser')
    repo_tags = topic_doc.find_all('h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'})
    # The <a> tags holding the owner and repo name have no class attribute,
    # so we reach them through their parent <h3> tags instead.
    star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'})
    
    
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': [],
    }
    # Pair each repository heading with its star counter by position
    for h3_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, repo_url, stars = get_repo_info(h3_tag, star_tag.text)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)
    return pd.DataFrame(topic_repos_dict)
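
The tag lookups above can be tried offline on a small HTML fragment. The markup below is a hypothetical imitation of one repository card, reusing the class names from the code; the live page may be structured differently, and `example-user` and `example-repo` are made-up names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a single repository card on a topic page
sample_html = """
<article>
  <h3 class="f3 color-fg-muted text-normal lh-condensed">
    <a href="/example-user">example-user</a> /
    <a href="/example-user/example-repo">example-repo</a>
  </h3>
  <span class="Counter js-social-count">1.2k</span>
</article>
"""

doc = BeautifulSoup(sample_html, 'html.parser')

# A class string containing spaces is matched against the exact attribute value
h3_tag = doc.find('h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'})

# The class-less <a> tags are reached through their parent <h3>
links = h3_tag.find_all('a')
username = links[0].text.strip()
repo_name = links[1].text.strip()
repo_url = 'https://github.com' + links[1]['href']
stars_text = doc.find('span', {'class': 'Counter js-social-count'}).text.strip()

print(username, repo_name, repo_url, stars_text)
```

Because the lookup depends on the exact class string, any change to GitHub's CSS classes would make `find` return `None`, so the selectors are worth re-checking whenever the scraper returns empty results.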

# Create a folder 'data' with CSV files for each topic

In [6]:
def scrape_top_repos():
    topics_df = get_topics()
    # Create the output folder once, before looping over the topics
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        fname = os.path.join('data', row['title'] + '.csv')
        # Skip topics that were already scraped in an earlier run
        if not os.path.exists(fname):
            print("Scraping top 20 repos for topic " + row['title'])
            repos_df = get_topic_repos(row['url'])
            repos_df.to_csv(fname, index=False)
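
Since each file name is built directly from the topic title, a title containing a path separator or another reserved character would produce an invalid path. The sketch below shows one way to guard against that. `safe_filename` is a hypothetical helper, and `CI/CD` is an invented title used only to show the replacement.

```python
import os
import re


def safe_filename(title, folder='data'):
    """Hypothetical helper: build a CSV path that is safe on common filesystems."""
    # Replace path separators and characters that Windows rejects in file names
    cleaned = re.sub(r'[<>:"/\\|?*]', '_', title).strip()
    return os.path.join(folder, cleaned + '.csv')


for title in ['ASP.NET', 'Chrome extension', 'CI/CD']:
    print(title, '->', safe_filename(title))
```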

In [7]:
scrape_top_repos()

Scraping top 20 repos for topic 3D
Scraping top 20 repos for topic Ajax
Scraping top 20 repos for topic Algorithm
Scraping top 20 repos for topic Amp
Scraping top 20 repos for topic Android
Scraping top 20 repos for topic Angular
Scraping top 20 repos for topic Ansible
Scraping top 20 repos for topic API
Scraping top 20 repos for topic Arduino
Scraping top 20 repos for topic ASP.NET
Scraping top 20 repos for topic Atom
Scraping top 20 repos for topic Awesome Lists
Scraping top 20 repos for topic Amazon Web Services
Scraping top 20 repos for topic Azure
Scraping top 20 repos for topic Babel
Scraping top 20 repos for topic Bash
Scraping top 20 repos for topic Bitcoin
Scraping top 20 repos for topic Bootstrap
Scraping top 20 repos for topic Bot
Scraping top 20 repos for topic C
Scraping top 20 repos for topic Chrome
Scraping top 20 repos for topic Chrome extension
Scraping top 20 repos for topic Command line interface
Scraping top 20 repos for topic Clojure
Scraping top 20 repos for topic

# References and Future Work

References:
- https://jovian.com/
- https://pypi.org/project/beautifulsoup4/
- https://realpython.com/python-web-scraping-practical-introduction/
- https://www.w3schools.com/python/pandas/default.asp

<br>
Ideas for Future Work: <br>
- Schedule the script with a tool such as Task Scheduler so that scraping and data collection run automatically at fixed intervals, keeping the dataset up to date without manual intervention. <br>
- Implement more robust data cleaning with pandas to handle missing or inconsistent data. Explore strategies for imputing missing values and removing outliers to improve overall data quality. <br>
- Use BeautifulSoup to extract the repository descriptions as well, then analyze them further, for example with sentiment analysis or keyword extraction. <br>
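
As a starting point for the data-cleaning idea, a rough sketch is shown below. `clean_repos` is a hypothetical function, and the rules it applies (coerce stars to numbers, drop incomplete rows, drop duplicate URLs) are assumptions about what cleaning might be needed, not results from the scraped data.

```python
import pandas as pd


def clean_repos(df):
    """Hypothetical cleaning pass over one topic's repository DataFrame."""
    # Turn star counts into numbers; unparseable values become NaN
    cleaned = df.assign(stars=pd.to_numeric(df['stars'], errors='coerce'))
    # Drop rows that are missing any of the key fields
    cleaned = cleaned.dropna(subset=['username', 'repo_name', 'stars'])
    # Keep a single row per repository URL
    cleaned = cleaned.drop_duplicates(subset='repo_url')
    cleaned = cleaned.assign(stars=cleaned['stars'].astype(int))
    return cleaned.reset_index(drop=True)


# Made-up rows with one duplicate and one missing star count
sample = pd.DataFrame({
    'username': ['a', 'a', 'b', 'c'],
    'repo_name': ['x', 'x', 'y', 'z'],
    'stars': [100, 100, None, '250'],
    'repo_url': ['u1', 'u1', 'u2', 'u3'],
})
result = clean_repos(sample)
print(result)
```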