# Scraping Top Repositories for Topics on GitHub

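Web scraping is the process of programmatically downloading web pages and extracting information from them. GitHub groups repositories under topics, listed at https://github.com/topics. In this project we collect the top repositories listed for each topic and save them as CSV files.

The tools used are Python, requests, BeautifulSoup, pandas and os.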

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics
- For each topic, we'll get the topic title, page URL and topic description
- For each topic, we'll get the top 20 repositories
- For each repository, we'll get the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file in the following format:
```
username,repo_name,stars,repo_url
```

## Imports

In [1]:
import requests
from bs4 import BeautifulSoup
import os
import pandas as pd

## Scraping the list of topics from GitHub

Explanation:

- use requests to download the page
- use bs4 to parse and extract information
- use os to create a directory
- convert the information into a pandas dataframe
- save the dataframe into the directory

Let's write a function to get the topics, their descriptions and their page links.

In [2]:
def scrape_topics():
    topics_url = "https://github.com/topics" # URL of the page we want to scrape
    # download the page
    response = requests.get(topics_url)
    if response.status_code != 200: # a 200 status code means the page was returned successfully
        raise Exception("Failed to load page {}".format(topics_url))
        
    page_contents = response.text
    #parse the html
    doc = BeautifulSoup(page_contents, 'html.parser')
    
    #get the respective information using the doc
    topic_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_desc(doc),
        'url': get_topic_link(doc),
    }
    #convert the dictionary into a pandas DataFrame
    topics_df = pd.DataFrame(topic_dict)
    
    return topics_df

## Create the helper functions for extracting the title, description and URL of topics from doc

The class attribute of an HTML element is used to specify exactly which elements to target.
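To illustrate how class-based selection behaves, here is a small sketch run against a made-up HTML fragment (the markup and class names are illustrative assumptions, not GitHub's actual page). It relies on documented BeautifulSoup behaviour: a multi-word `class_` string is compared against the exact attribute value, while a CSS selector matches the listed classes in any order.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment for illustration only (not GitHub's real markup).
SAMPLE_HTML = """
<div>
  <p class="f3 title">3D</p>
  <p class="title f3">Ajax</p>
  <p class="f5 desc">Some description</p>
</div>
"""

sample_doc = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A multi-word string must equal the class attribute exactly, order included.
exact_matches = [p.text for p in sample_doc.find_all("p", class_="f3 title")]

# A single class name matches any tag that carries that class.
single_matches = [p.text for p in sample_doc.find_all("p", class_="title")]

# A CSS selector matches the listed classes regardless of their order.
css_matches = [p.text for p in sample_doc.select("p.f3.title")]

print(exact_matches)
print(single_matches)
print(css_matches)
```

Because the exact-string form depends on the order of the classes in the page source, the class strings used in the helper functions may need updating whenever the site changes its markup.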

In [3]:
def get_topic_titles(doc):
    # find the paragraph tags with the specified class name
    select_topic_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all('p',{'class':select_topic_class})
    
    topic_titles = []

    for tag in topic_title_tags: # iterate through the tags and extract the text inside (the title)
        topic_titles.append(tag.text)
        
    return topic_titles

In [4]:
def get_topic_desc(doc):
    # find the paragraph tags with the specified class name
    select_desc_class = "f5 color-text-secondary mb-0 mt-1"
    topic_desc_tags = doc.find_all('p',{'class':select_desc_class})
    
    topic_descriptions = []

    for tag in topic_desc_tags: # iterate through the tags and extract the text inside (the description)
        topic_descriptions.append(tag.text.strip())
        
    return topic_descriptions

In [5]:
def get_topic_link(doc):
    # find the a tags with the specified class name
    select_link_class = "d-flex no-underline"
    topic_link_tags = doc.find_all('a', {'class':select_link_class})
    
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags: # iterate through the tags and build the full URL from each relative href
        topic_urls.append(base_url + tag['href'])
    
    return topic_urls

## Create a function to get the topic page using its link

In [6]:
def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topic_url))
    # parse html
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

## Get the repository information for each repository listed on a particular topic page

In [7]:
def get_topic_repos(topic_doc):
    #classes of elements we want to scrape
    repo_class = "f3 color-text-secondary text-normal lh-condensed"
    star_class = 'social-count float-none'
    
    #find all the respective tags
    repo_tags = topic_doc.find_all('h1', class_=repo_class)
    star_tags = topic_doc.find_all('a', {'class': star_class})
    
    # create a dictionary to store the information we are about to extract
    topic_repos_dict = {
        'username':[],
        'repo_name': [],
        'stars': [],
        'repo_url': []
        }    
    
    for i in range(len(repo_tags)): #iterate through the tags
        repo_info = get_repo_info(repo_tags[i], star_tags[i]) # returns a tuple with the fields we want
        
        topic_repos_dict['username'].append(repo_info[0])  # 1st element is the username
        topic_repos_dict['repo_name'].append(repo_info[1]) # 2nd element is the repository name
        topic_repos_dict['stars'].append(repo_info[2])     # 3rd element is the star count
        topic_repos_dict['repo_url'].append(repo_info[3])  # 4th element is the repository URL
    
    return pd.DataFrame(topic_repos_dict)

## Create the function to retrieve the information from the tags that contain repository and stars information

In [8]:
def get_repo_info(h1_tag, star_tag):
    # the two a tags hold the username and repository name respectively, so extract them
    a_tags = h1_tag.find_all('a')
    base_url = 'https://github.com'
    
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href'] # a relative path is mentioned in a tag so we combine it with base url
    
    # the stars are in string format, e.g. '76k', which we want to convert to 76000
    stars = parse_star_count(star_tag.text.strip()) # function to extract stars as type integer
    
    # the order must match how the caller indexes the result: username, repo name, stars, URL
    return username, repo_name, stars, repo_url
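Positional return values are easy to misorder, so one alternative is to return a dictionary keyed by column name. The following is only a sketch under stated assumptions: the `<h1>` markup is made up for illustration, and `repo_info_as_dict` is a hypothetical helper, not part of the notebook.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"

# Made-up markup for illustration only (not GitHub's real page source).
SAMPLE_H1 = """
<h1>
  <a href="/octo-user"> octo-user </a> /
  <a href="/octo-user/octo-repo"> octo-repo </a>
</h1>
"""

def repo_info_as_dict(h1_tag, stars):
    """Hypothetical variant that returns the fields by name instead of by position."""
    owner_link, repo_link = h1_tag.find_all("a")[:2]
    return {
        "username": owner_link.text.strip(),
        "repo_name": repo_link.text.strip(),
        "stars": stars,
        # the href is a relative path, so prepend the base URL
        "repo_url": BASE_URL + repo_link["href"],
    }

sample_h1 = BeautifulSoup(SAMPLE_H1, "html.parser").h1
sample_row = repo_info_as_dict(sample_h1, 1200)
print(sample_row)
```

A list of such dictionaries can be passed directly to `pd.DataFrame`, which uses the keys as column names.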

## Create the function to extract the number of stars

In [9]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)
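The parser above handles plain integers and a `k` suffix. A slightly more defensive sketch might also accept an `m` suffix and thousands separators. Whether GitHub ever renders star counts in those forms is an assumption here, and `parse_count` is a hypothetical name.

```python
def parse_count(text):
    """Convert strings such as '76k', '1.2m' or '1,234' to an int."""
    multipliers = {"k": 1_000, "m": 1_000_000}
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty star count")
    suffix = cleaned[-1]
    if suffix in multipliers:
        # round() rather than int() guards against floating-point truncation
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)

for raw in ["76k", "1.2m", "1,234", " 987 "]:
    print(raw, "->", parse_count(raw))
```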

## Create a function to put all of it together

In [10]:
def scrape_topics_repos():
    dir_name = "./top_topic_repos" # directory to store the CSV files in
    try:
        if os.path.isdir(dir_name): # check if it exists
            print('top_topic_repos directory exists')
        else: # create the directory if it doesn't exist
            os.mkdir(dir_name)
    except OSError:
        print("Creation of the directory %s failed" % dir_name)
        return # nothing can be saved without the directory
        
    print("Scraping list of topics")
    topics_df = scrape_topics() # function from first part
    
    for index, row in topics_df.iterrows(): # iterate through the topic infos
        print("scraping top repositories for {}".format(row['title']))
        scrape_topic(row['url'], row['title'], dir_name) # helper function to call the other functions
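The directory step can also be written with the standard library's `os.makedirs(..., exist_ok=True)`, which does not raise if the directory already exists. A minimal sketch follows. The helper name `ensure_dir` is hypothetical, and a temporary directory is used so the example leaves nothing behind.

```python
import os
import tempfile

def ensure_dir(dir_name):
    """Create dir_name if needed and report whether it exists afterwards."""
    os.makedirs(dir_name, exist_ok=True)  # no error if it is already there
    return os.path.isdir(dir_name)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "top_topic_repos")
    first = ensure_dir(target)   # creates the directory
    second = ensure_dir(target)  # second call finds it and does nothing

print(first, second)
```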

### Helper function to call all the other functions for each topic

In [11]:
def scrape_topic(topic_url, topic_name, dir_name):
    # filename
    floc = dir_name + '/' + topic_name + ".csv"
    if os.path.exists(floc): # check if it exists and don't scrape if it does
        print("The file {}.csv already exists :)".format(topic_name))
        return
    # get the dataframe with the repository information for this topic
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # save the dataframe as a CSV file, without the index column
    topic_df.to_csv(floc, index=False)
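Topic titles are used directly as file names, which works for names such as `ASP.NET` or `Awesome Lists` but could fail for a title containing a path separator. One possible precaution, sketched here with a hypothetical helper `safe_filename`, is to replace characters outside a conservative allowed set. The allowed set is an assumption, and distinct titles could in principle map to the same file name.

```python
import re

def safe_filename(topic_name, extension=".csv"):
    """Build a file name from a topic title, replacing risky characters with '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9.+#_-]+", "_", topic_name.strip())
    cleaned = cleaned.strip("._")  # avoid hidden files and stray separators
    return (cleaned or "topic") + extension

for title in ["Awesome Lists", "ASP.NET", "C/C++", "///"]:
    print(repr(title), "->", safe_filename(title))
```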

In [12]:
scrape_topics_repos()

top_topic_repos directory exists
Scraping list of topics
scraping top repositories for 3D
The file 3D.csv already exists :)
scraping top repositories for Ajax
The file Ajax.csv already exists :)
scraping top repositories for Algorithm
The file Algorithm.csv already exists :)
scraping top repositories for Amp
The file Amp.csv already exists :)
scraping top repositories for Android
The file Android.csv already exists :)
scraping top repositories for Angular
The file Angular.csv already exists :)
scraping top repositories for Ansible
The file Ansible.csv already exists :)
scraping top repositories for API
The file API.csv already exists :)
scraping top repositories for Arduino
The file Arduino.csv already exists :)
scraping top repositories for ASP.NET
The file ASP.NET.csv already exists :)
scraping top repositories for Atom
The file Atom.csv already exists :)
scraping top repositories for Awesome Lists
The file Awesome Lists.csv already exists :)
scraping top repositories for Amazon Web 

## Ideas for future work

- We can append "?page=n" to the topics page URL, where n is the page number, to get the information from the rest of the pages as well.

- We plan to build more web scraping projects in the future. Do check out our README file for ideas and our repositories for other projects.
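The "?page=n" idea can be sketched as a small URL generator. This only builds the URLs; how many pages exist and how the site responds to each is not verified here, and `topics_page_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode

TOPICS_URL = "https://github.com/topics"

def topics_page_urls(last_page, base_url=TOPICS_URL):
    """Return the topics URLs for pages 1..last_page (inclusive)."""
    return [
        "{}?{}".format(base_url, urlencode({"page": n}))
        for n in range(1, last_page + 1)
    ]

for url in topics_page_urls(3):
    print(url)
```

Each generated URL could then be downloaded and parsed in the same way as the first topics page, ideally with a short pause between requests.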

# Thank you and do give my repository a star if this was worth your time