# Scraping Top Repositories For GitHub Topics

### TO DO:

- Browse the GitHub topics site and select the top topics to scrape.
- Identify the information to scrape from the site and decide the format of the output CSV file.
- Summarize the project idea and outline the strategy in a Jupyter notebook.
- Tools used: Python, pandas, BeautifulSoup, requests.

### Project Outline

- The site to scrape: https://github.com/topics
- Extract a list of topics from the site. For each topic, I'll extract the topic title, topic page URL and topic description.
- For each topic, I'll get the top 25 repositories from the topic page.
- For each repository, I'll grab the repo name, username, stars and repo URL.
- For each topic, I'll create a CSV file in the following format:

Repo Name,Username,Stars,Repo URL  
three.js,mrdoob,69700,https://github.com/mrdoob/three.js  
libgdx,libgdx,18300,https://github.com/libgdx/libgdx  
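
As a rough illustration of the CSV layout above, the sketch below writes the two sample rows with Python's standard `csv` module. The `write_repos_csv` helper and the in-memory buffer are assumptions made for this example only; the notebook itself saves files through pandas.

```python
import csv
import io

# Column order taken from the sample format shown above
FIELDNAMES = ['Repo Name', 'Username', 'Stars', 'Repo URL']

def write_repos_csv(rows, file_obj):
    """Write repository rows (a list of dicts) in the target CSV layout."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDNAMES, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)

sample_rows = [
    {'Repo Name': 'three.js', 'Username': 'mrdoob', 'Stars': 69700,
     'Repo URL': 'https://github.com/mrdoob/three.js'},
    {'Repo Name': 'libgdx', 'Username': 'libgdx', 'Stars': 18300,
     'Repo URL': 'https://github.com/libgdx/libgdx'},
]

# An in-memory buffer stands in for a real file in this demo
buffer = io.StringIO()
write_repos_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```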


### Scraping a list of topics from GitHub

Steps Taken:

- Use requests to download the GitHub topics page
- Use BeautifulSoup (bs4) to parse the page and extract information
- Convert the extracted data to a pandas DataFrame

### Step 1: Creating a function that uses requests and BeautifulSoup to download the page

In [1]:
import requests 
from bs4 import BeautifulSoup

def get_topic_page():
    # download the page
    topic_url = 'https://github.com/topics'
    response = requests.get(topic_url)
    # Checking the status of the page (response)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    #parse using BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [2]:
doc = get_topic_page()

In [3]:
type(doc)

bs4.BeautifulSoup
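
One possible variation on the download step, offered as a sketch rather than the notebook's method: pass a `timeout` so a stalled connection cannot hang the run, and let `raise_for_status()` report HTTP errors. Note that `raise_for_status()` only raises for 4xx/5xx responses, which is slightly looser than the `!= 200` check above. The `fetch_page` name, its `get` parameter, and the `FakeResponse` stub are all hypothetical; the stub exists only so the example runs without network access.

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(url, timeout=10, get=requests.get):
    """Download `url` and return the parsed document.

    `get` is a hypothetical injection point so the function can be
    exercised offline; by default it is the real requests.get.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return BeautifulSoup(response.text, 'html.parser')


class FakeResponse:
    """Stand-in for requests.Response, used only in this offline demo."""
    text = '<html><head><title>Topics</title></head></html>'

    def raise_for_status(self):
        return None


def fake_get(url, timeout):
    # Ignores the URL and returns canned HTML instead of hitting the network
    return FakeResponse()


demo_doc = fetch_page('https://github.com/topics', get=fake_get)
print(demo_doc.title.text)
```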

### Step 2: Creating helper functions to parse information

#### To get topic titles, we can pick the `p` tags with the `class` "f3 lh-condensed mb-0 mt-1 Link--primary"

![](https://imgur.com/ezVrsA4.png)

In [4]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

#### The `get_topic_titles` function extracts the list of titles

In [8]:
titles = get_topic_titles(doc)
len(titles)

30

Example of scraped list of titles.

In [9]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
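
A caveat worth illustrating: passing a multi-class string to `find_all` matches the `class` attribute as an exact string, so a tag with the same classes in a different order is missed. A CSS selector via `select` checks each class independently. The sketch below uses made-up HTML (not GitHub's real markup) to show the difference; which classes to select on is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second tag has the same classes in a different order
sample_html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description, not a title.</p>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact-string match on the whole class attribute (the approach used above)
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: the tag must carry both classes, in any order
by_selector = sample_doc.select('p.f3.Link--primary')

exact_titles = [tag.text for tag in exact]
selector_titles = [tag.text for tag in by_selector]
print(exact_titles)
print(selector_titles)
```

The selector form is less likely to break if the page reorders its class names, at the cost of possibly matching more tags than intended.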

#### Similar to `get_topic_titles` above, the functions below extract the topic descriptions and URLs

###### Extracting the topic descriptions: function code and demonstration

In [10]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [11]:
description = get_topic_descs(doc)

Example of scraped descriptions

In [13]:
description[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

###### Extracting the topic URLs: function code and demonstration

In [14]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

In [15]:
links = get_topic_urls(doc)

Example of scraped topic URLs

In [17]:
links[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
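
The URL-building step concatenates a base URL with each relative `href`. As an alternative sketch, the standard library's `urljoin` does the same for relative paths and also leaves already-absolute links untouched. The `to_absolute` helper name is an assumption made for this example.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def to_absolute(href, base_url=BASE_URL):
    """Resolve an href against the base URL; absolute hrefs pass through."""
    return urljoin(base_url, href)

relative_result = to_absolute('/topics/3d')
absolute_result = to_absolute('https://github.com/topics/ajax')
print(relative_result)
print(absolute_result)
```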

#### Combining the above titles, descriptions and URLs into a single function

In [18]:
import pandas as pd

In [27]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the page just downloaded instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

Let's print the DataFrame.

In [28]:
scrape_topics().head()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
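
Building a DataFrame from three separately scraped lists assumes the lists line up one to one. If a selector misses a tag, the lists differ in length and the columns no longer correspond. A minimal sketch of a guard is shown below; `build_topic_rows` is a hypothetical helper, and its list-of-dicts result can be passed straight to `pd.DataFrame`.

```python
def build_topic_rows(titles, descriptions, urls):
    """Combine three parallel lists into row dicts, refusing mismatched lengths."""
    if not (len(titles) == len(descriptions) == len(urls)):
        raise ValueError(
            'Column lengths differ: titles={}, descriptions={}, urls={}'.format(
                len(titles), len(descriptions), len(urls)))
    return [
        {'title': title, 'description': desc, 'url': url}
        for title, desc, url in zip(titles, descriptions, urls)
    ]

rows = build_topic_rows(
    ['3D', 'Ajax'],
    ['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
     'Ajax is a technique for creating interactive web applications.'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(rows[0]['title'], rows[0]['url'])
```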


### Extracting the top 25 repositories in the topic from the topic page.

#### TO_DO:
- Let's now download a single topic page from its topic URL

In [37]:
def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    # Checking the status of the page (response)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #parse using beautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

Example.

In [38]:
doc = get_topic_page('https://github.com/topics/3d')

#### TO_DO: 
- Scrape the star count of each repository on the topic page
- Convert the star counts to integers
- Scrape the `h3` tags that hold each repository's username, name and URL

Let's first scrape the number of stars for each repository 

In [40]:
star_tags = doc.find_all('span', id='repo-stars-counter-star')

In [41]:
len(star_tags)

20

In [42]:
star_tags[0].text

'97.7k'

Let's convert the star count string to an integer.

In [43]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':  # a trailing 'k' means thousands, e.g. '97.7k'
        return int(float(stars_str[:-1]) * 1000)  # stars_str[:-1] drops the 'k'
    return int(stars_str)

In [44]:
print(parse_star_count(star_tags[0].text))

97700
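
The parser above handles plain integers and a `k` suffix. The sketch below extends the same idea to two further formats that are assumptions here, not something observed in the scraped page: an `m` suffix for millions and comma thousands separators. It also uses `round` instead of `int` so that floating-point error cannot truncate a value by one. The name `parse_star_count_extended` is hypothetical.

```python
def parse_star_count_extended(stars_str):
    """Convert strings such as '97.7k', '1.2m', '1,234' or '532' to an int."""
    text = stars_str.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    suffix = text[-1]  # raises IndexError on an empty string
    if suffix in multipliers:
        # round() guards against results like 1099.9999999 being truncated
        return round(float(text[:-1]) * multipliers[suffix])
    return int(text)

for sample in ['97.7k', '1.2m', '1,234', '532']:
    print(sample, parse_star_count_extended(sample))
```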


In [None]:


def get_rep_info(h3_tag, star_tag):
    # Returns all the required information about a repository
    base_url = 'https://github.com'  # href values on the page are relative
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

Next, a function that collects this information for every repository on a topic page:

In [None]:

def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('span', id='repo-stars-counter-star')
    
    topic_repos_dict = {
        'username':[],
        'repo_name':[],
        'stars':[],
        'repo_url':[]
        }
    
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_rep_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)
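
The index-based loop above raises an `IndexError` if the page yields fewer star tags than `h3` tags. As a sketch of one alternative, `zip` pairs the two lists and stops at the shorter one, which avoids the crash but silently drops any unmatched entries. The HTML below is invented for the demo and only imitates the class names and `id` used above; `extract_repos` and `parse_stars` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Invented markup that imitates the selectors used in this notebook
sample_topic_html = """
<article>
  <h3 class="f3 color-fg-muted text-normal lh-condensed">
    <a href="/mrdoob">mrdoob</a> / <a href="/mrdoob/three.js">three.js</a>
  </h3>
  <span id="repo-stars-counter-star">69.7k</span>
</article>
<article>
  <h3 class="f3 color-fg-muted text-normal lh-condensed">
    <a href="/libgdx">libgdx</a> / <a href="/libgdx/libgdx">libgdx</a>
  </h3>
  <span id="repo-stars-counter-star">18.3k</span>
</article>
"""

def parse_stars(text):
    """Minimal star parser for the demo: handles '69.7k' and plain integers."""
    text = text.strip()
    if text.endswith('k'):
        return round(float(text[:-1]) * 1000)
    return int(text)

def extract_repos(topic_doc, base_url='https://github.com'):
    """Return one dict per repository, pairing h3 tags with star tags via zip."""
    h3_tags = topic_doc.find_all(
        'h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'})
    star_tags = topic_doc.find_all('span', id='repo-stars-counter-star')
    repos = []
    for h3_tag, star_tag in zip(h3_tags, star_tags):
        a_tags = h3_tag.find_all('a')
        repos.append({
            'username': a_tags[0].text.strip(),
            'repo_name': a_tags[1].text.strip(),
            'stars': parse_stars(star_tag.text),
            'repo_url': base_url + a_tags[1]['href'],
        })
    return repos

repos = extract_repos(BeautifulSoup(sample_topic_html, 'html.parser'))
for repo in repos:
    print(repo)
```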

In [None]:
import os

In [None]:

def scrape_topic(topic_url, path):
    # Skip topics whose CSV file already exists so they are not re-downloaded
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
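
The skip-if-exists check is what makes repeated runs cheap: a second call for the same path does nothing. The sketch below isolates that behaviour in a temporary directory, with plain text standing in for the scraped CSV. `write_once` is a hypothetical helper written for this demo, not part of the notebook.

```python
import os
import tempfile

def write_once(path, text):
    """Write `text` to `path` unless the file already exists.

    Returns True if the file was written, False if it was skipped.
    """
    if os.path.exists(path):
        return False
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, '3D.csv')
    first_call = write_once(csv_path, 'username,repo_name,stars,repo_url\n')
    second_call = write_once(csv_path, 'this text is never written\n')
    with open(csv_path, encoding='utf-8') as f:
        saved_text = f.read()

print(first_call, second_call)
```

One consequence of this design is that stale files are never refreshed; deleting a CSV is the way to force a re-scrape of that topic.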

### Summary 
- There is a function that extracts a list of topics
- There is a function that creates a CSV file storing the data scraped from a topic page
- Let's create a function that puts them together

In [None]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    # Create a directory named 'data' to hold the scraped CSV files
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows(): # Iterating over rows 
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [None]:
scrape_topics_repos()
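
The CSV files are named after topic titles. Titles are display text and could in principle contain characters that are awkward in file names (an assumption, not something observed in the sample above), whereas the last segment of each topic URL is already a URL-safe slug. A sketch of deriving the path from the URL instead is shown below; `csv_path_for_topic` is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlparse

def csv_path_for_topic(topic_url, directory='data'):
    """Build a CSV path from the last path segment of a topic URL."""
    slug = posixpath.basename(urlparse(topic_url).path.rstrip('/'))
    return '{}/{}.csv'.format(directory, slug)

amp_path = csv_path_for_topic('https://github.com/topics/amphp')
three_d_path = csv_path_for_topic('https://github.com/topics/3d/')
print(amp_path)
print(three_d_path)
```

With this approach the "Amp" topic would be saved as `data/amphp.csv`, matching its URL rather than its title.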