# Scraping Top Repositories of GitHub Topics (Advanced)

### What is Web Scraping?
`Web scraping` is the process of extracting and parsing data from websites in an automated fashion using a computer program.
It’s a useful technique for creating datasets for research and learning.

### Project Outline:

* I'm going to scrape https://github.com/topics
* I'll get a list of topics. For each topic, I'll get the topic title, topic page URL and topic description.
* For each topic, I'll get the top repositories listed on the topic page.
* For each repository, I'll grab the repo name, username, stars and repo URL.
* For each topic, I'll create a CSV file in the format shown below:

```
UserName,RepoName,RepoURL,StarsCount
mrdoob,three.js,https://github.com/mrdoob/three.js,69700
libgdx,libgdx,https://github.com/libgdx/libgdx,18300
```

## Writing a single function to:
- Get the list of topics from the topics page
- Get the list of top repos from the individual topic pages
- For each topic, create a CSV of the top repos for the topic

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

In [2]:
# URL to scrape
topics_url = 'https://github.com/topics'
response = requests.get(topics_url)
page_contents = response.text
doc = BeautifulSoup(page_contents, 'html.parser')
base_url = "https://github.com"

In [3]:
# Converting Star count to numerical value
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)
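
The conversion above covers counts such as `69.7k`. As a hedged sketch, the hypothetical `parse_count` below extends the same idea to thousands separators and an `m` suffix. Both extra formats are assumptions about what a page might show, not something observed in the pages scraped here.

```python
# Hypothetical variant of parse_star_count. The 'm' suffix and the comma
# handling are assumptions, not formats confirmed on the scraped pages.
def parse_count(count_str):
    count_str = count_str.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if count_str and count_str[-1] in multipliers:
        # round() guards against floating-point error before converting to int
        return round(float(count_str[:-1]) * multipliers[count_str[-1]])
    return int(count_str)

print(parse_count('69.7k'))
print(parse_count('1,234'))
```

Using `round` instead of truncating with `int` is a design choice of this sketch: it keeps a product that lands just below a whole number from being cut down by one.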

In [4]:
# Function to download the topic page
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    # Checking successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc





# Function to extract information about the repo
def get_repo_info(h1_tag, star_tag):
    # Returns all the required info about a repository.
    # The first <a> inside the h1 holds the username, the second holds the repo name.
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    # The href is relative (e.g. /user/repo), so prefix it with the site root
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars
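
To see the tag handling in isolation, here is a small sketch that runs the same `find_all('a')` logic on a hand-written snippet. The markup is a simplified, hypothetical stand-in and is not GitHub's real HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, simplified for illustration; not GitHub's actual page.
sample_html = '''
<h1 class="f3">
  <a href="/mrdoob">mrdoob</a> /
  <a href="/mrdoob/three.js">three.js</a>
</h1>
'''

h1_tag = BeautifulSoup(sample_html, 'html.parser').find('h1')
a_tags = h1_tag.find_all('a')

# First link is the user, second link is the repository
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()
repo_url = 'https://github.com' + a_tags[1]['href']

print(username, repo_name, repo_url)
```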





# Function to extract repos of particular topic
def get_topic_repos(topic_doc):
    # Get the h1 tags containing repo title, repo URL and username
    h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h1', {'class': h1_selection_class} )
    # Get star tags
    star_tags = topic_doc.find_all('a', { 'class': 'social-count float-none'})
    
    topic_repos_dict = { 'UserName': [], 'RepoName': [],'RepoURL': [],'StarsCount': []}

    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['UserName'].append(repo_info[0])
        topic_repos_dict['RepoName'].append(repo_info[1])
        topic_repos_dict['RepoURL'].append(repo_info[2])
        topic_repos_dict['StarsCount'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)




# Function to create a CSV file containing the repo info for a topic
def scrape_topic(topic_url, path):
    # Skip topics that were already scraped in an earlier run
    if os.path.exists(path):
        print("The file {} already exists. Skipping...\n".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # index=False keeps the DataFrame's row index out of the CSV
    topic_df.to_csv(path, index=False)
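
The skip-if-exists behaviour can be tried without touching the network. The sketch below uses the standard-library `csv` module in place of pandas, purely so the snippet has no dependencies. `write_rows_once` is a hypothetical helper, not part of the notebook.

```python
import csv
import os
import tempfile

# Hypothetical helper: writes the CSV only when the file is not there yet.
def write_rows_once(path, header, rows):
    if os.path.exists(path):
        return False
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return True

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, '3D.csv')
    header = ['UserName', 'RepoName', 'RepoURL', 'StarsCount']
    rows = [['mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 69700]]
    first = write_rows_once(path, header, rows)   # file is created
    second = write_rows_once(path, header, rows)  # file exists, so it is skipped
    print(first, second)
```

One consequence of this pattern is that stale files are never refreshed: to re-scrape a topic, its CSV has to be deleted first.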

In [11]:
# Function to extract topic titles
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles




# Function to extract topic description
def get_topic_descs(doc):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs




# Function to extract topic URLs
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    


    
# Function to create Topics DataFrame
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the page that was just downloaded instead of relying on the global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
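
Building a DataFrame from a dictionary of lists only works when every list has the same length; otherwise pandas raises an error. If a class name on the page changes, one selector may return fewer tags than the others. As a sketch, the hypothetical `check_equal_lengths` below reports which column is off before the DataFrame is built. The sample descriptions are placeholders.

```python
# Hypothetical helper: fails early, naming each column and its length.
def check_equal_lengths(columns):
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns

good = {
    'title': ['3D', 'Ajax'],
    'description': ['placeholder description 1', 'placeholder description 2'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
# Simulates a selector that matched nothing
bad = dict(good, description=[])

check_equal_lengths(good)
try:
    check_equal_lengths(bad)
except ValueError as error:
    print(error)
```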

### Below is the single function that scrapes the top repositories for each topic on GitHub, creates a DataFrame for each topic, converts it to a CSV file, and saves it in the `GithubTopicsData` folder.

In [6]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('GithubTopicsData', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories of topic: "{}"'.format(row['title']))
        r = requests.get(row['url'])
        if r.status_code != 200:
            print('Failed to load page {}\n'.format(row['url']))
        else:
            scrape_topic(row['url'], 'GithubTopicsData/{}.csv'.format(row['title']))
    return "=================================== Done ===================================="
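
The CSV path is built directly from the topic title, so a title containing a path separator would point at a folder that does not exist. A minimal sketch of one way to guard against that, assuming a hypothetical `safe_filename` helper and a made-up title `CI/CD`:

```python
import re

# Hypothetical helper: replaces characters that are unsafe in file names.
def safe_filename(title):
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title.strip())
    # Fall back to a fixed name if nothing is left
    return cleaned or 'untitled'

print(safe_filename('3D'))
print(safe_filename('CI/CD'))
```

The result could then be used in place of `row['title']` when formatting the CSV path.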

In [12]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories of topic: "3D"
The file GithubTopicsData/3D.csv already exists. Skipping...

Scraping top repositories of topic: "Ajax"
The file GithubTopicsData/Ajax.csv already exists. Skipping...

Scraping top repositories of topic: "Algorithm"
The file GithubTopicsData/Algorithm.csv already exists. Skipping...

Scraping top repositories of topic: "Amp"
The file GithubTopicsData/Amp.csv already exists. Skipping...

Scraping top repositories of topic: "Android"
The file GithubTopicsData/Android.csv already exists. Skipping...

Scraping top repositories of topic: "Angular"
The file GithubTopicsData/Angular.csv already exists. Skipping...

Scraping top repositories of topic: "Ansible"
The file GithubTopicsData/Ansible.csv already exists. Skipping...

Scraping top repositories of topic: "API"
The file GithubTopicsData/API.csv already exists. Skipping...

Scraping top repositories of topic: "Arduino"
The file GithubTopicsData/Arduino.csv already exists. 



In [17]:
import time

def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('GithubTopicsData', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories of topic: "{}"'.format(row['title']))
        r = requests.get(row['url'])
        if r.status_code != 200:
            # The page did not load; wait before scrape_topic requests it again
            time.sleep(5)
        scrape_topic(row['url'], 'GithubTopicsData/{}.csv'.format(row['title']))
    return "=================================== Done ===================================="
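
The wait-then-retry idea can also be written as a small reusable loop. This is a sketch under stated assumptions: `fetch_with_retry`, `FakeResponse`, and the retry count are all hypothetical, and `fetch` stands for any callable such as `requests.get` that returns an object with a `status_code`.

```python
import time

# Hypothetical helper: retries a fetch a fixed number of times with a delay.
def fetch_with_retry(fetch, url, retries=3, delay=5, sleep=time.sleep):
    for attempt in range(1, retries + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        # Only wait if another attempt is still to come
        if attempt < retries:
            sleep(delay)
    raise Exception('Failed to load page {} after {} attempts'.format(url, retries))

# Stand-in for a real HTTP response, used here so the demo needs no network.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

statuses = iter([429, 200])  # first call fails, second succeeds
waits = []                   # records delays instead of actually sleeping
result = fetch_with_retry(lambda url: FakeResponse(next(statuses)),
                          'https://github.com/topics/3d', sleep=waits.append)
print(result.status_code, waits)
```

Passing `sleep` as a parameter is a convenience for testing; in normal use the default `time.sleep` applies.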

In [18]:
scrape_topics_repos()  # Redefined above with a delay before re-requesting pages that fail to load

Scraping list of topics
Scraping top repositories of topic: "3D"
The file GithubTopicsData/3D.csv already exists. Skipping...

Scraping top repositories of topic: "Ajax"
The file GithubTopicsData/Ajax.csv already exists. Skipping...

Scraping top repositories of topic: "Algorithm"
The file GithubTopicsData/Algorithm.csv already exists. Skipping...

Scraping top repositories of topic: "Amp"
The file GithubTopicsData/Amp.csv already exists. Skipping...

Scraping top repositories of topic: "Android"
The file GithubTopicsData/Android.csv already exists. Skipping...

Scraping top repositories of topic: "Angular"
The file GithubTopicsData/Angular.csv already exists. Skipping...

Scraping top repositories of topic: "Ansible"
The file GithubTopicsData/Ansible.csv already exists. Skipping...

Scraping top repositories of topic: "API"
The file GithubTopicsData/API.csv already exists. Skipping...

Scraping top repositories of topic: "Arduino"
The file GithubTopicsData/Arduino.csv already exists. 

