# Top Repositories for GitHub Topics

### Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.

#### Project Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 25 repositories in the topic from the topic page.
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file in the following format:

```
Repo name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

### Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
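
The download step above could be sketched as follows. The helper names `page_filename` and `download_page`, the `pages` directory, and the `fetch` parameter are assumptions made for illustration and are not part of the project code. `fetch` falls back to `requests.get`, which lets the function be tried out with a stand-in fetcher when no network is available.

```python
import os


def page_filename(url):
    """Derive a local file name from the last path segment of a URL."""
    name = url.rstrip('/').split('/')[-1]
    return name + '.html'


def download_page(url, directory='pages', fetch=None):
    """Download url, save its HTML under directory, and return the saved path."""
    if fetch is None:
        # Imported lazily so the helper can be exercised with a stand-in fetcher.
        import requests
        fetch = requests.get
    response = fetch(url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, page_filename(url))
    with open(path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return path
```

With these assumptions, `download_page('https://github.com/topics/3d')` would write the page to `pages/3d.html`, and the same function can be reused for any other topic URL.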

### Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
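
To make the parsing step concrete, here is a minimal sketch that runs Beautiful Soup over a small hand-written snippet. The markup, the `repo` class name, and the `extract_repos` helper are hypothetical stand-ins; the real GitHub pages use different class names, which have to be found by inspecting the page source.

```python
from bs4 import BeautifulSoup

# A tiny hand-written snippet standing in for a downloaded page (hypothetical markup).
sample_html = """
<div>
  <h3 class="repo"><a href="/mrdoob">mrdoob</a> / <a href="/mrdoob/three.js">three.js</a></h3>
  <h3 class="repo"><a href="/libgdx">libgdx</a> / <a href="/libgdx/libgdx">libgdx</a></h3>
</div>
"""


def extract_repos(html, base_url='https://github.com'):
    """Return one dictionary per <h3 class="repo"> heading found in the HTML."""
    doc = BeautifulSoup(html, 'html.parser')
    repos = []
    for h3 in doc.find_all('h3', class_='repo'):
        # The first link holds the username, the second the repository.
        a_tags = h3.find_all('a')
        repos.append({
            'username': a_tags[0].text.strip(),
            'repo name': a_tags[1].text.strip(),
            'repo URL': base_url + a_tags[1]['href'],
        })
    return repos


repos = extract_repos(sample_html)
```

Returning a list of dictionaries keeps each repository's fields together, and the list can be passed directly to `pd.DataFrame` later on.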

### Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.
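
The verification step could look like the sketch below, which reads a CSV back with pandas and runs a few sanity checks. The `check_repos_csv` helper and the particular checks are assumptions chosen for illustration. The sample text follows the layout from the project outline and is read from memory through `io.StringIO`; a file path can be passed in the same way.

```python
import io

import pandas as pd

# Two rows in the CSV layout from the project outline.
csv_text = (
    "Repo name,Username,Stars,Repo URL\n"
    "three.js,mrdoob,69700,https://github.com/mrdoob/three.js\n"
    "libgdx,libgdx,18300,https://github.com/libgdx/libgdx\n"
)


def check_repos_csv(source, expected_columns):
    """Read a CSV back with pandas and run a few sanity checks on it."""
    df = pd.read_csv(source)
    if list(df.columns) != list(expected_columns):
        raise ValueError('Unexpected columns: {}'.format(list(df.columns)))
    if df.isnull().values.any():
        raise ValueError('The file contains missing values')
    return df


df = check_repos_csv(
    io.StringIO(csv_text),
    ['Repo name', 'Username', 'Stars', 'Repo URL'],
)
```

Passing the expected column names as an argument lets the same check be reused if the column layout of the saved files differs from the outline.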

### Document and share your work

- Add proper headings and documentation in your Jupyter notebook.
- (Optional) Write a blog post about your project and share it online.

In [1]:
import requests

In [2]:
from bs4 import BeautifulSoup

In [3]:
import pandas as pd

In [4]:
import os

In [5]:
base_url = 'https://github.com'

In [6]:
def get_topic_page(topic_url):
    """Download a topic page and return it as a parsed BeautifulSoup document."""
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_parsed = BeautifulSoup(response.text, 'html.parser')
    return topic_parsed


def parse_star_count(stars_str):
    """Convert a star label such as '69.7k' or '835' to an integer."""
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)


def get_repo_info(h3_tag, star_tag):
    """ Return all the required info about a repository. """
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


def get_topic_repos(parsed_doc):
    repo_tags = parsed_doc.find_all('h3', {'class': "f3 color-fg-muted text-normal lh-condensed"})
    star_tags = parsed_doc.find_all('span', {'class': "Counter js-social-count"})
    
    topic_repos_dict = {
        'username' : [],
        'repo name' : [],
        'stars' : [],
        'repo URL' : []
    }
    
    # Pair each repository heading with its star counter.
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo URL'].append(repo_url)
        
    return pd.DataFrame(topic_repos_dict)


def scrape_topic(topic_url, path):
    """Scrape the top repositories for one topic and save them to a CSV file."""
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

    
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    parsed_doc = BeautifulSoup(response.text, 'html.parser')
    
    topic_title_tags = parsed_doc.find_all('p', {'class': "f3 lh-condensed mb-0 mt-1 Link--primary"})
    topic_desc_tags = parsed_doc.find_all('p', {'class': "f5 color-fg-muted mb-0 mt-1"})
    topic_link_tags = parsed_doc.find_all('a', {'class': "no-underline flex-1 d-flex flex-column"})
    
    topic_titles = [topic.text for topic in topic_title_tags]
    topic_descs = [topic.text.strip() for topic in topic_desc_tags]
    topic_urls = [base_url + topic['href'] for topic in topic_link_tags]
    
    topics_dict = {
        'Title' : topic_titles,
        'Description' : topic_descs,
        'URL' : topic_urls
    }
    
    return pd.DataFrame(topics_dict)
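
As a side note on the star-count conversion used above, here is a slightly more defensive sketch. The name `parse_star_count_safe` and the extra cases it handles (an empty label, thousands separators, an uppercase `K`) are assumptions about what a page might contain, not behavior observed in the scraped data.

```python
def parse_star_count_safe(stars_str):
    """Convert a star label such as '69.7k' or '1,234' to an integer.

    Returns None for an empty label instead of raising IndexError.
    """
    text = stars_str.strip().lower().replace(',', '')
    if not text:
        return None
    if text.endswith('k'):
        # round() guards against float results that land just below a whole number.
        return int(round(float(text[:-1]) * 1000))
    return int(text)
```

Returning `None` for an empty label keeps one malformed entry from aborting a whole topic, at the cost of a possible missing value in the `stars` column.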

### Write a single function to:
1. Get the list of topics from the topics page.
2. Get the list of top repos from the individual topic pages.
3. For each topic, create a CSV file of the top repos for the topic.

In [7]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('Topics Data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['Title']))
        scrape_topic(row['URL'], 'Topics Data/{}.csv'.format(row['Title']))

In [8]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
The file Topics Data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file Topics Data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file Topics Data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file Topics Data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file Topics Data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file Topics Data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file Topics Data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file Topics Data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file Topics Data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file Topics Data/ASP.NET.csv already exists. Skippi