# Top Repos For GitHub Topics

### Plan

- First, we need to identify the information we want to scrape from the site.
- For each topic, we will get the topic title, topic description, and topic page URL.
- For each topic, we will get the top 18 repos.
- For each repo, we will grab the repo name, username, stars, and repo URL.
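
The fields listed above could be captured as two small record types. This is only an illustrative sketch: the class names `Topic` and `Repo` are hypothetical and are not used elsewhere in this notebook, which stores the data in lists and DataFrames instead.

```python
from dataclasses import dataclass, asdict


@dataclass
class Topic:
    # Hypothetical record for one entry on the GitHub topics listing page.
    title: str
    description: str
    url: str


@dataclass
class Repo:
    # Hypothetical record for one repository card on a topic page.
    username: str
    repo_name: str
    stars: int
    repo_url: str


# Example values taken from the outputs shown later in this notebook.
example_topic = Topic(
    title="Ajax",
    description="Ajax is a technique for creating interactive web applications.",
    url="https://github.com/topics/ajax",
)
example_repo = Repo("ljianshu", "Blog", 7900, "https://github.com/ljianshu/Blog")
print(asdict(example_repo))
```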

## Request the web pages

In [3]:
import requests
from bs4 import BeautifulSoup

def get_page():
    topics_url = "https://github.com/topics"
    response = requests.get(topics_url)

    # Stop here on failure instead of parsing an error page
    if response.status_code != 200:
        raise Exception(f"Failed to request page. Status code: {response.status_code}")

    soup = BeautifulSoup(response.text, "html.parser")

    return soup

In [4]:
doc = get_page()

### Now it is time to parse information from the page.

- We are going to start with the topic titles.

- We observed during our inspection of the page that the titles of the topics are enclosed within a `p` tag that has a specific class.

In [5]:
def get_topics_titles(doc):
    title_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    titles_tags = doc.find_all('p', title_class)

    topic_titles = []

    for tag in titles_tags:
        topic_titles.append(tag.text)

    return topic_titles


In [6]:
titles = get_topics_titles(doc)

In [7]:
len(titles)

30

In [8]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

- Now we are going to do the same for the topic descriptions.
- During our inspection of the page, we observed that each topic description is enclosed within a `p` tag that has a specific class.

In [9]:
def get_topics_description(doc):
    description_class = "f5 color-fg-muted mb-0 mt-1"
    description_tags = doc.find_all('p', description_class)

    topic_description = []

    for tag in description_tags:
        topic_description.append(tag.text)

    return topic_description

In [10]:
descriptions = get_topics_description(doc)

In [11]:
descriptions[:5]

['\n          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.\n        ',
 '\n          Ajax is a technique for creating interactive web applications.\n        ',
 '\n          Algorithms are self-contained sequences that carry out a variety of tasks.\n        ',
 '\n          Amp is a non-blocking concurrency library for PHP.\n        ',
 '\n          Android is an operating system built by Google designed for mobile devices.\n        ']
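
The descriptions above still carry the newlines and indentation from the page markup. A minimal sketch of one way to tidy them, assuming a hypothetical helper named `clean_text` that is not part of the original notebook:

```python
def clean_text(raw):
    # Hypothetical helper: collapse runs of whitespace (including newlines)
    # into single spaces and trim both ends.
    return " ".join(raw.split())


raw_descriptions = [
    "\n          Ajax is a technique for creating interactive web applications.\n        ",
    "\n          Amp is a non-blocking concurrency library for PHP.\n        ",
]
cleaned = [clean_text(d) for d in raw_descriptions]
print(cleaned)
```

A lighter option would be to append `tag.text.strip()` inside `get_topics_description`; `strip()` trims only the ends, which is enough for the descriptions shown here.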

- Now we are going to do the same for the topic URLs.
- During our inspection of the page, we observed that each topic link is an `a` tag that has a specific class; its `href` attribute holds the relative URL.

In [12]:
def get_topics_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

In [13]:
urls = get_topics_urls(doc)

In [14]:
urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
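
Plain concatenation works when each `href` is a site-relative path such as `/topics/3d`. As a hedged alternative, the standard library's `urllib.parse.urljoin` also copes with an `href` that is already absolute. The helper name `to_absolute` is hypothetical:

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"


def to_absolute(href, base_url=BASE_URL):
    # Hypothetical helper: resolve an href against the site root.
    # Relative paths are joined to the base; absolute URLs pass through.
    return urljoin(base_url, href)


print(to_absolute("/topics/3d"))
print(to_absolute("https://github.com/topics/ajax"))
```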

- Now we put this all together in a single function.

In [15]:
import pandas as pd

def topics():
    topics_url = "https://github.com/topics"
    response = requests.get(topics_url)

    # Stop here on failure instead of parsing an error page
    if response.status_code != 200:
        raise Exception(f"Failed to request page. Status code: {response.status_code}")

    soup = BeautifulSoup(response.text, 'html.parser')

    # Named topics_dict so it does not shadow the function name
    topics_dict = {
        "Title": get_topics_titles(soup),
        "Description": get_topics_description(soup),
        "URL": get_topics_urls(soup)
    }

    return pd.DataFrame(topics_dict)
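
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary differ in length, which would happen if one of the class selectors missed a tag. A small sketch of a check that reports which column is short before the DataFrame is built; the helper name `check_equal_lengths` is hypothetical and the sample values are placeholders:

```python
def check_equal_lengths(columns):
    # Hypothetical helper: map each column name to the number of scraped
    # values it holds, and fail loudly if the counts disagree.
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Scraped columns differ in length: {lengths}")
    return lengths


good = {"Title": ["3D", "Ajax"], "Description": ["a", "b"], "URL": ["u1", "u2"]}
print(check_equal_lengths(good))

bad = {"Title": ["3D", "Ajax"], "Description": ["a"], "URL": ["u1", "u2"]}
try:
    check_equal_lengths(bad)
except ValueError as err:
    print(err)
```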

## Get top repositories from a specific topic page

In [16]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)

    # Stop here on failure instead of parsing an error page
    if response.status_code != 200:
        raise Exception(f"Failed to request page. Status code: {response.status_code}")

    soup = BeautifulSoup(response.text, 'html.parser')

    return soup
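
Since this function is called once per topic, a single failed request can interrupt a long run. The following is a sketch, not part of the original notebook, of retrying a request a few times before giving up. The name `fetch_with_retry` and its parameters are hypothetical; `fetch` stands in for any callable such as `requests.get`, and the demo uses a fake response so it runs without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    # Hypothetical helper. `fetch` is any callable that takes a URL and
    # returns an object with a `status_code` attribute.
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < retries:
            # Wait a little longer after each failed attempt.
            time.sleep(delay * attempt)
    raise RuntimeError(
        f"Giving up on {url} after {retries} attempts "
        f"(last status code: {response.status_code})"
    )


class FakeResponse:
    # Stand-in for a real response object, used only for this demo.
    def __init__(self, status_code):
        self.status_code = status_code


statuses = iter([429, 200])
result = fetch_with_retry(
    lambda url: FakeResponse(next(statuses)),
    "https://github.com/topics/ajax",
    delay=0,
)
print(result.status_code)
```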

- Get the number of stars

In [18]:
def get_stars_count(stars):
    # Convert a star count such as '7.9k' or '512' to an integer
    stars = stars.strip()

    if stars.endswith('k'):
        return int(float(stars[:-1]) * 1000)
    return int(stars)

- Get the other information like repo name, username...

In [19]:
base_url = 'https://github.com'
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url =  base_url + a_tags[1]['href']
    stars = get_stars_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

- Let's work on an example.

In [20]:
topic = get_topic_page("https://github.com/topics/ajax")

In [21]:
repo_tags = topic.find_all('h3', class_ = "f3 color-fg-muted text-normal lh-condensed")

In [22]:
star_tags = topic.find_all('span', { 'class': 'Counter js-social-count'})


In [23]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":36322912,"originating_url":"https://github.com/topics/ajax","user_id":null}}' data-hydro-click-hmac="12f49832acde3976362bbfb12cd832c71c4ab9ef369e0eeebf0b983120696b87" data-turbo="false" data-view-component="true" href="/ljianshu">
            ljianshu
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":137582912,"originating_url":"https://github.com/topics/ajax","user_id":null}}' data-hydro-click-hmac="ab016938047a79c8cdccd70fba114ec58aabce0f116110f58883626113c73a0c" data-turbo="false" data-view-compone

In [24]:
star_tags[0]

<span aria-label="7881 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="7,881">7.9k</span>

In [25]:
get_repo_info(repo_tags[0], star_tags[0])


('ljianshu', 'Blog', 7900, 'https://github.com/ljianshu/Blog')
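
The result above reports 7900 stars because it is parsed from the abbreviated text `7.9k`, while the star tag shown earlier also carries the exact figure in its `title` attribute (`7,881`). A hedged sketch of a parser that prefers the exact figure when it is present; the function name `parse_star_count` is hypothetical, and it assumes the `title` attribute always has the comma-separated form seen in this one example:

```python
def parse_star_count(display_text, title_text=None):
    # Hypothetical helper. Prefer the exact figure (e.g. "7,881") when given.
    if title_text:
        digits = title_text.replace(",", "").strip()
        if digits.isdigit():
            return int(digits)

    # Otherwise fall back to the abbreviated figure (e.g. "7.9k").
    text = display_text.strip().lower()
    if text.endswith("k"):
        # round() guards against floating-point noise in the product.
        return round(float(text[:-1]) * 1000)
    return int(text)


print(parse_star_count("7.9k", "7,881"))
print(parse_star_count("7.9k"))
print(parse_star_count("512"))
```

With the tags from the example above, this could be called as `parse_star_count(star_tags[0].text, star_tags[0].get('title'))`.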

#### Since the example works, we can now put everything together.

In [26]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all('h3', {'class': 'f3 color-fg-muted text-normal lh-condensed'})

    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}

    # Get repo info, pairing each repo heading with its star counter
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)

    return pd.DataFrame(topic_repos_dict)

In [27]:
import os

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

In [28]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['Title']))
        scrape_topic(row['URL'], 'data/{}.csv'.format(row['Title']))
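
The topic title is used directly as the file name here. That works for the titles shown in this notebook, but a title containing a path separator would be read as a directory. A minimal sketch of a guard, assuming a hypothetical helper named `safe_filename`; the title `CI/CD` in the demo is an invented example, not one taken from the scraped data:

```python
import re


def safe_filename(title, extension=".csv"):
    # Hypothetical helper: replace characters that are path separators or
    # are rejected by common file systems, and leave everything else alone.
    stem = re.sub(r'[\\/:*?"<>|]+', "_", title).strip()
    return (stem or "untitled") + extension


for title in ["Chrome extension", "ASP.NET", "CI/CD"]:
    print(safe_filename(title))
```

Because only problematic characters are replaced, file names for ordinary titles stay the same as the ones `scrape_topics_repos` already produces.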


In [29]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command-line interface"
Scraping top repositories for "Clojure"
Scraping top repositories for "Code quality"