# Scraping Top Repositories For GitHub Topics

### TO DO:

- Browse the GitHub topics site and select the top topics to scrape.
- Identify the information to scrape from the site and decide the format of the output CSV file.
- Summarize the project idea and outline the strategy in a Jupyter notebook.
- Tools used: Python, pandas, BeautifulSoup, requests.

### Project Outline

- The site to scrape: https://github.com/topics
- I'll extract a list of topics from the site. For each topic, I'll extract the topic title, topic page URL and topic description.
- For each topic, I'll get the top 25 repositories from the topic page.
- For each repository, I'll grab the repo name, username, stars and repo URL.
- For each topic, I'll create a CSV file in the following format:

Repo Name,Username,Stars,Repo URL  
three.js,mrdoob,69700,https://github.com/mrdoob/three.js  
libgdx,libgdx,18300,https://github.com/libgdx/libgdx  
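
As a minimal sketch of how rows in this format could be produced, the snippet below serializes two sample rows (copied from the example above) to CSV text. It uses the standard library's `csv` module instead of pandas to stay dependency-free, and the helper name `repos_to_csv` is hypothetical. `DataFrame.to_csv(index=False)` in pandas should give an equivalent layout.

```python
import csv
import io

# Sample rows copied from the example above; real data would come from the scraper.
repos = [
    {'Repo Name': 'three.js', 'Username': 'mrdoob', 'Stars': 69700,
     'Repo URL': 'https://github.com/mrdoob/three.js'},
    {'Repo Name': 'libgdx', 'Username': 'libgdx', 'Stars': 18300,
     'Repo URL': 'https://github.com/libgdx/libgdx'},
]

def repos_to_csv(rows):
    """Serialize repo dicts to CSV text (hypothetical helper, not part of the notebook)."""
    fieldnames = ['Repo Name', 'Username', 'Stars', 'Repo URL']
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()    # header row: Repo Name,Username,Stars,Repo URL
    writer.writerows(rows)  # one line per repository
    return buffer.getvalue()

csv_text = repos_to_csv(repos)
print(csv_text)
```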


### Scraping a list of topics from GitHub

Steps Taken:

- Use requests to download the GitHub topics page
- Use BeautifulSoup (bs4) to parse the page and extract information
- Convert the extracted data to a pandas DataFrame

### Step 1: Creating a function that uses requests and BeautifulSoup to download the page

In [27]:
import requests 
from bs4 import BeautifulSoup

def get_topic_page():
    # download the page
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    # Checking the status of the page (response)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    #parse using BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [31]:
doc = get_topic_page()

In [32]:
type(doc)

bs4.BeautifulSoup

### Step 2: Creating helper functions to parse information

#### To get topic titles, we can pick the `p` tags whose `class` is "f3 lh-condensed mb-0 mt-1 Link--primary"

![](https://imgur.com/ezVrsA4.png)

In [33]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

#### The `get_topic_titles` function extracts the list of topic titles

In [34]:
titles = get_topic_titles(doc)
len(titles)

30

In [35]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
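
One detail worth knowing about this kind of lookup: when a multi-class string is passed to `find_all`, BeautifulSoup compares it against the exact value of the `class` attribute, so a tag with the same classes in a different order is not matched. The sketch below illustrates this on a small hand-written HTML fragment (hypothetical, not GitHub's real markup) and shows a CSS selector as a more order-tolerant alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment for illustration; GitHub's real markup may differ.
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">Not a title</p>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# The multi-class string is matched against the exact attribute value,
# so the second tag (same classes, different order) is missed.
exact = [tag.text for tag in sample_doc.find_all(
    'p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})]

# A CSS selector matches on class membership, regardless of order.
by_selector = [tag.text for tag in sample_doc.select('p.f3.Link--primary')]

print(exact)
print(by_selector)
```

Either approach depends on GitHub's current class names, which can change without notice, so an empty result is a hint to re-inspect the page.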

#### Similar to `get_topic_titles` above, the functions below extract the topic descriptions and URLs

Extracting the topic descriptions: function definition and example output

In [36]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [37]:
description = get_topic_descs(doc)

In [38]:
description[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

Extracting the topic URLs: function definition and example output

In [39]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

In [42]:
links = get_topic_urls(doc)

In [43]:
links[:3]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm']
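
String concatenation works here because each `href` is a site-relative path such as `/topics/3d`. As a hedged alternative, the sketch below uses `urllib.parse.urljoin` from the standard library, which also copes with an `href` that is already absolute. The helper name `to_absolute` is hypothetical.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

def to_absolute(href):
    """Resolve an href against base_url; an already-absolute href is returned unchanged."""
    return urljoin(base_url, href)

relative_result = to_absolute('/topics/3d')
absolute_result = to_absolute('https://github.com/topics/ajax')
print(relative_result)
print(absolute_result)
```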

#### Combining the title, description and URL functions into a single function

In [53]:
import pandas as pd

In [56]:
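def scrape_topics():
    # Download the topics page
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the downloaded page here instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    # Each helper returns one column of the final table
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    topics_df = pd.DataFrame(topics_dict)
    return topics_df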
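
Building a DataFrame from a dict of lists requires all lists to have the same length; otherwise pandas raises a `ValueError`. If one selector stops matching after a markup change, the three columns could drift apart. The sketch below is one possible guard, assuming the three lists are meant to be parallel. The name `combine_topic_columns` is hypothetical and the function is not part of the notebook.

```python
def combine_topic_columns(titles, descriptions, urls):
    """Zip three parallel lists into row dicts, failing loudly on a length mismatch."""
    lengths = {
        'title': len(titles),
        'description': len(descriptions),
        'url': len(urls),
    }
    if len(set(lengths.values())) != 1:
        # Report every column length so the broken selector is easy to spot
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return [
        {'title': t, 'description': d, 'url': u}
        for t, d, u in zip(titles, descriptions, urls)
    ]

rows = combine_topic_columns(
    ['3D', 'Ajax'],
    ['3D refers to the use of three-dimensional graphics.',
     'Ajax is a technique for creating interactive web applications.'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(rows[0]['url'])
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` to get the same three columns.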

### Extracting the top 25 repositories in the topic from the topic page.