# Top Repositories for GitHub Topics

## Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarise your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

### Project Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top repositories listed on the topic page (the page showed 20 when this notebook was run)
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file in the following format:

```
Repo Name, Username, Stars, Repo URL
three.js, mrdoob, 69700, https://github.com/mrdoob/three.js
```
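
As a rough sketch of the target format, the standard-library `csv` module can produce a file with this header and one row per repository. The helper name `write_repos_csv` is hypothetical and only for illustration; the notebook itself writes its CSV files with pandas later on.

```python
import csv
import io

HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

def write_repos_csv(rows, out):
    """Write the header plus one line per (repo_name, username, stars, repo_url) tuple."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    writer.writerows(rows)

# Write to an in-memory buffer so the sketch leaves no file behind
buffer = io.StringIO()
write_repos_csv(
    [("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js")],
    buffer,
)
print(buffer.getvalue())
```

Passing an open file object instead of the `StringIO` buffer would write the same content to disk.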


## Use the requests library to download webpages

In [1]:
!pip install requests --upgrade --quiet

# --quiet suppresses pip's installation output



In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
response.status_code

#Informational responses (100 – 199)
#Successful responses (200 – 299)
#Redirection messages (300 – 399)
#Client error responses (400 – 499)
#Server error responses (500 – 599)

200
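
The status code ranges listed in the comments above can be turned into a small lookup. This is a minimal sketch using only the standard library; `status_class` and `STATUS_CLASSES` are hypothetical names introduced here, not part of `requests`.

```python
from http import HTTPStatus

# Response classes keyed by the first digit of the status code
STATUS_CLASSES = {
    1: "informational",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def status_class(code):
    """Return the response class name for an HTTP status code."""
    try:
        return STATUS_CLASSES[code // 100]
    except KeyError:
        raise ValueError("not a valid HTTP status code: {}".format(code))

for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

In the notebook, `status_class(response.status_code)` would give the class of the response just received. `requests` also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.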

In [6]:
page_content = response.text

In [7]:
page_content[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-fe3f886b577a.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-a1dbeda2886c.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="https:/

In [8]:
# Save the HTML page as a local copy
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_content)
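
Once a local copy exists, it can double as a simple cache so the page is not downloaded again on every run. The sketch below assumes a hypothetical helper `load_or_fetch` that takes the download step as a callable; a fake fetch function stands in for the real request so the example runs offline.

```python
import tempfile
from pathlib import Path

def load_or_fetch(path, fetch):
    """Return the cached text at `path`, calling `fetch()` only if the file is missing."""
    path = Path(path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch()
    path.write_text(html, encoding="utf-8")
    return html

# Demonstration with a stand-in for the real download
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "webpage.html"
    calls = []

    def fake_fetch():
        calls.append(1)
        return "<html>demo</html>"

    first = load_or_fetch(target, fake_fetch)   # file missing: fetch is called
    second = load_or_fetch(target, fake_fetch)  # file present: read from disk
    print(first == second, len(calls))
```

In the notebook this could be used as `load_or_fetch('webpage.html', lambda: requests.get(topics_url).text)`.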

## Use Beautiful Soup to parse and extract information

In [9]:
!pip install beautifulsoup4 --upgrade --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jupyter-server 1.13.5 requires pywinpty<2; os_name == "nt", but you have pywinpty 2.0.2 which is incompatible.


In [10]:
from bs4 import BeautifulSoup

In [11]:
doc = BeautifulSoup(page_content, 'html.parser')

In [12]:
# Select the topic titles shown on the GitHub topics page
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class' : selection_class})

In [13]:
len(topic_title_tags)

30

In [14]:
topic_title_tags[:10]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>]

In [15]:
# Select the topic descriptions shown on the page
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class' : desc_selector})

In [16]:
len(topic_desc_tags)

30

In [17]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [18]:
# Get the topic URLs from the 'a' tags
topic_title_tag0 = topic_title_tags[0]

In [19]:
parent_tag = topic_title_tag0.parent  # the enclosing <a> tag, which carries the topic link

In [20]:
link_selector = 'no-underline flex-1 d-flex flex-column'
topic_link_tags = doc.find_all('a', {'class' : link_selector})

In [21]:
len(topic_link_tags)

30

In [22]:
topic_link_tags[0]

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>
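
Since each `a` tag wraps both the title and the description, one alternative is to read all three fields from the link tag in a single pass, which keeps them aligned even if one list would otherwise come out shorter. This is a sketch under assumptions: `parse_topic_link` is a hypothetical helper, the sample HTML is a trimmed copy of the output above, and it assumes the first `p` is the title and the second is the description.

```python
from bs4 import BeautifulSoup

sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
  3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
</p>
</a>
"""

def parse_topic_link(a_tag, base_url="https://github.com"):
    """Extract title, description and URL from one topic link tag."""
    p_tags = a_tag.find_all("p")
    return {
        "title": p_tags[0].get_text(strip=True),        # assumed: first <p> is the title
        "description": p_tags[1].get_text(strip=True),  # assumed: second <p> is the description
        "url": base_url + a_tag["href"],
    }

sample_doc = BeautifulSoup(sample_html, "html.parser")
topics = [parse_topic_link(a) for a in sample_doc.find_all("a", class_="no-underline")]
print(topics[0])
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame`.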

In [23]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [24]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
print(topic_titles)    

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [25]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())  # strip() removes leading and trailing whitespace
    
topic_descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [26]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
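
String concatenation works here because every `href` is a path starting with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with an `href` that is already an absolute URL:

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# A relative path is resolved against the base URL
relative = urljoin(base_url, "/topics/3d")

# An absolute URL is returned unchanged
absolute = urljoin(base_url, "https://github.com/topics/ajax")

print(relative)
print(absolute)
```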

In [27]:
import pandas as pd

In [28]:
topic_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}

In [29]:
topic_df = pd.DataFrame(topic_dict)

In [30]:
topic_df[:10]

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


## Create CSV file(s) with the extracted information

In [31]:
topic_df.to_csv('topics.csv', index=False)  # index=False leaves the row index out of the file

## Getting information out of a topic page

In [32]:
topic_page_url = topic_urls[0]

In [33]:
topic_page_url

'https://github.com/topics/3d'

In [34]:
response = requests.get(topic_page_url)

In [35]:
response.status_code

200

In [36]:
len(response.text)

460131

In [37]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [38]:
# The repository headings on the topic page are h3 tags
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class' : h3_selection_class})

In [39]:
len(repo_tags)

20

In [40]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [41]:
# Get the 'a' tags holding the username and the repository name

a_tags = repo_tags[0].find_all('a')

In [42]:
#username

a_tags[0].text.strip()

'mrdoob'

In [43]:
# repository name

a_tags[1].text.strip()

'three.js'

In [44]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [45]:
star_tags = topic_doc.find_all('span', {'id' : 'repo-stars-counter-star'})

In [46]:
len(star_tags)

20

In [47]:
#stars
star_tags[0].text.strip()

'90.1k'

In [48]:
# Convert a star count such as '90.1k' into an integer

def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [49]:
parse_star_count(star_tags[0].text.strip())

90100
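
`parse_star_count` handles the `k` suffix seen on this page. A slightly more defensive variant is sketched below. It assumes, without confirmation from the page, that counts might also appear with an `m` suffix or with thousands separators. The name `parse_count` is hypothetical, and `Decimal` is used so the multiplication stays exact before converting to `int`.

```python
from decimal import Decimal

# Multipliers for the suffixes this sketch understands (the 'm' suffix is an assumption)
SUFFIXES = {"k": 1_000, "m": 1_000_000}

def parse_count(text):
    """Convert strings such as '90.1k', '1,234' or '987' into an integer."""
    text = text.strip().lower().replace(",", "")
    multiplier = 1
    if text and text[-1] in SUFFIXES:
        multiplier = SUFFIXES[text[-1]]
        text = text[:-1]
    return int(Decimal(text) * multiplier)

for raw in ("90.1k", "1,234", " 987 ", "1.2m"):
    print(repr(raw), "->", parse_count(raw))
```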

In [50]:
import os

base_url = 'https://github.com'

def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # Return the username, repo name, star count and repo URL for one repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class' : h3_selection_class})
    # Get the star tags
    star_tags = topic_doc.find_all('span', {'id' : 'repo-stars-counter-star'})

    topics_repo_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []
    }

    # Get repo info, pairing each repo heading with its star counter
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        repo_info = get_repo_info(repo_tag, star_tag)
        topics_repo_dict['username'].append(repo_info[0])
        topics_repo_dict['repo_name'].append(repo_info[1])
        topics_repo_dict['stars'].append(repo_info[2])
        topics_repo_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topics_repo_dict)

def scrape_topic(topic_url, topic_name):
    # Write the topic's repositories to '<topic_name>.csv' unless the file already exists
    fname = topic_name + '.csv'
    if os.path.exists(fname):
        print("The file {} already exists . skipping...".format(fname))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(fname, index=False)
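
`scrape_topic` uses the topic title directly as the file name. That works for the titles shown so far, but a title containing a character such as `/` would not be a valid file name. Below is a minimal sketch of one way to sanitise titles. The helper `topic_filename` is hypothetical, and note that it also replaces spaces with underscores, so the resulting names differ from the ones the notebook writes.

```python
import re

def topic_filename(topic_name, extension=".csv"):
    """Build a file name from a topic title, replacing unsafe characters with '_'."""
    # Keep letters, digits, '.', '_', '+' and '-'; collapse anything else into one '_'
    slug = re.sub(r"[^A-Za-z0-9._+-]+", "_", topic_name.strip()).strip("_")
    # Fall back to a generic name if nothing usable remains
    return (slug or "topic") + extension

for title in ("3D", "ASP.NET", "C++", "Command line interface", "CI/CD"):
    print(title, "->", topic_filename(title))
```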

In [51]:
# Repository info for the first repository

get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 90100, 'https://github.com/mrdoob/three.js')

In [52]:
url4 = topic_urls[4]
url4

'https://github.com/topics/android'

In [53]:
topic4_doc = get_topic_page(url4)

In [54]:
topic4_repos = get_topic_repos(topic4_doc)
topic4_repos

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 151000 | https://github.com/flutter/flutter |
| 1 | facebook | react-native | 108000 | https://github.com/facebook/react-native |
| 2 | justjavac | free-programming-books-zh_CN | 101000 | https://github.com/justjavac/free-programming-... |
| 3 | Genymobile | scrcpy | 79700 | https://github.com/Genymobile/scrcpy |
| 4 | Hack-with-Github | Awesome-Hacking | 62800 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 5 | google | material-design-icons | 47800 | https://github.com/google/material-design-icons |
| 6 | Solido | awesome-flutter | 46000 | https://github.com/Solido/awesome-flutter |
| 7 | wasabeef | awesome-android-ui | 45800 | https://github.com/wasabeef/awesome-android-ui |
| 8 | square | okhttp | 43700 | https://github.com/square/okhttp |
| 9 | android | architecture-samples | 42400 | https://github.com/android/architecture-samples |


In [55]:
get_topic_repos(get_topic_page( topic_urls[4]))

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 151000 | https://github.com/flutter/flutter |
| 1 | facebook | react-native | 108000 | https://github.com/facebook/react-native |
| 2 | justjavac | free-programming-books-zh_CN | 101000 | https://github.com/justjavac/free-programming-... |
| 3 | Genymobile | scrcpy | 79700 | https://github.com/Genymobile/scrcpy |
| 4 | Hack-with-Github | Awesome-Hacking | 62800 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 5 | google | material-design-icons | 47800 | https://github.com/google/material-design-icons |
| 6 | Solido | awesome-flutter | 46000 | https://github.com/Solido/awesome-flutter |
| 7 | wasabeef | awesome-android-ui | 45800 | https://github.com/wasabeef/awesome-android-ui |
| 8 | square | okhttp | 43700 | https://github.com/square/okhttp |
| 9 | android | architecture-samples | 42400 | https://github.com/android/architecture-samples |


## Write a Single Function to:


1. Get the list of topics from the topics page
2. Get the list of top repos from each individual topic page
3. For each topic, create a CSV of the top repos for the topic

In [56]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class' : selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class' : desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    link_selector = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class' : link_selector})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls


def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the freshly downloaded page instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title' : get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [57]:
def scrape_topic_repo():
    print("scrapping list of topics:")
    topic_df = scrape_topics()
    for index, row in topic_df.iterrows():
        print('Scrapping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],row['title'])
    

In [58]:
scrape_topic_repo()

scrapping list of topics:
Scrapping top repositories for "3D"
The file 3D.csv already exists . skipping...
Scrapping top repositories for "Ajax"
The file Ajax.csv already exists . skipping...
Scrapping top repositories for "Algorithm"
The file Algorithm.csv already exists . skipping...
Scrapping top repositories for "Amp"
The file Amp.csv already exists . skipping...
Scrapping top repositories for "Android"
The file Android.csv already exists . skipping...
Scrapping top repositories for "Angular"
The file Angular.csv already exists . skipping...
Scrapping top repositories for "Ansible"
The file Ansible.csv already exists . skipping...
Scrapping top repositories for "API"
The file API.csv already exists . skipping...
Scrapping top repositories for "Arduino"
The file Arduino.csv already exists . skipping...
Scrapping top repositories for "ASP.NET"
The file ASP.NET.csv already exists . skipping...
Scrapping top repositories for "Atom"
The file Atom.csv already exists . skipping...
Scrappi