## Pick a website and describe your objective
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

### Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:
```
    Repo Name,Username,Stars,Repo URL
    three.js,mrdoob,69700,https://github.com/mrdoob/three.js
    libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
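
The target format above can be produced with Python's built-in `csv` module. The following is a minimal sketch that uses the two sample rows from the format description; the helper name `rows_to_csv` is hypothetical and does not appear elsewhere in this notebook.

```python
import csv
import io

# Column order taken from the sample CSV format above
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

def rows_to_csv(rows):
    """Serialize (repo_name, username, stars, repo_url) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

sample_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
print(rows_to_csv(sample_rows))
```

Later in the notebook the same job is done with pandas; the standard-library version is shown only to make the file layout concrete.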

## Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.

In [1]:
#!pip install requests --upgrade --quiet --user

In [2]:
import requests

In [3]:
topics_url = "https://github.com/topics"

In [4]:
response = requests.get(topics_url)

### HTTP response status codes

- Informational responses (100–199)
- Successful responses (200–299)
- Redirection messages (300–399)
- Client error responses (400–499)
- Server error responses (500–599)
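
As a small illustration of the ranges above, the sketch below maps a numeric status code to its class. The functions `status_category` and `describe_status` are hypothetical helpers written for this example and are not part of `requests`; `HTTPStatus` comes from Python's standard library.

```python
from http import HTTPStatus

def status_category(code):
    """Return the class of an HTTP status code, following the ranges listed above."""
    if 100 <= code <= 199:
        return "Informational"
    if 200 <= code <= 299:
        return "Successful"
    if 300 <= code <= 399:
        return "Redirection"
    if 400 <= code <= 499:
        return "Client error"
    if 500 <= code <= 599:
        return "Server error"
    raise ValueError("Not a valid HTTP status code: {}".format(code))

def describe_status(code):
    """Combine the code, its standard phrase (if known) and its class."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes in a valid range that the standard library does not name
        phrase = "Unknown"
    return "{} {} ({})".format(code, phrase, status_category(code))

print(describe_status(200))
print(describe_status(404))
```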

In [5]:
response.status_code

200

In [6]:
len(response.text)

174221

In [7]:
# response.text  # Avoid displaying the full text: it is large and can slow down the notebook

In [8]:
page_contents = response.text

In [9]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-E9wnWjoxQmh5A1jiWVYDPKOvA8VPf0iKQYoc+9ycMJvtAi9gOSlaUci+W2smxFIlWkV8hkX+O27S8NIB59iIDw==" rel="stylesheet" href="https://github.githubassets.com/assets/light-13dc275a3a314268790358e25956033c.css" /><link crossorigin="anonymous" media="all" integrity="sha512-nYSv3KrFhMlGUpjkFQBLMEN6HvHhijcoubQLjV3DWlcABEi2yDYf6KGUjRubJ5R+dJnKXR7jA4wu5Dg2

In [10]:
with open('webpage.html', 'w', encoding="utf-8") as f:
    f.write(page_contents)

### Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.

In [11]:
#!pip install beautifulsoup4 --user --upgrade --quiet

In [12]:
from bs4 import BeautifulSoup

In [13]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [14]:
doc


<!DOCTYPE html>

<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-13dc275a3a314268790358e25956033c.css" integrity="sha512-E9wnWjoxQmh5A1jiWVYDPKOvA8VPf0iKQYoc+9ycMJvtAi9gOSlaUci+W2smxFIlWkV8hkX+O27S8NIB59iIDw==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-9d84afdcaac584c9465298e415004b30.css" integrity="sha512-nYSv3KrFhMlGUpjkFQBLMEN6HvHhijcoubQ

In [15]:
type(doc)

bs4.BeautifulSoup

In [16]:
p_tags = doc.find_all('p')

In [17]:
len(p_tags)

67

In [18]:
p_tags[:5]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Phaser
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Phaser is a fun, free, and fast 2D game framework for making HTML5 games for desktop and mobile web browsers.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Bot
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">A bot is an application that runs automated tasks over the Internet.</p>]

In [22]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class':selection_class})

In [23]:
len(topic_title_tags)

30
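
Passing a string with several class names to `find_all`, as in the cell above, matches only tags whose `class` attribute is exactly that string, in the same order. The sketch below contrasts this with a CSS selector via `select`, which matches the listed classes regardless of order or extra classes. The HTML snippet is hand-written for illustration and is not the real GitHub page.

```python
from bs4 import BeautifulSoup

# Hand-written sample: two title paragraphs with differently ordered classes
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">Phaser</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match on the full class string
exact = soup.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# CSS selector: every listed class must be present, in any order
by_css = soup.select("p.f3.lh-condensed.Link--primary")

print([tag.text for tag in exact])
print([tag.text for tag in by_css])
```

In this sample the exact-string search finds only the first paragraph, while the selector also finds the second. If the site reorders or adds classes, a selector on a few stable classes may keep working where the exact string does not.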

In [25]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [26]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class':desc_selector})

In [29]:
len(topic_desc_tags)

30

In [30]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency framework for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [31]:
topic_title_tags0 = topic_title_tags[0]

In [34]:
parent_tag = topic_title_tags0.parent  # the enclosing <a> tag, which carries the topic link
parent_tag

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>

In [36]:
topic_link_tags = doc.find_all("a", {'class': 'no-underline flex-1 d-flex flex-column'})

In [37]:
len(topic_link_tags)

30

In [40]:
topic_link_tags[0]['href']

'/topics/3d'

In [42]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
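
String concatenation works here because every `href` starts with `/`. As an alternative sketch, `urljoin` from the standard library resolves a link against a base URL and also leaves already absolute links untouched. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    """Resolve a relative or absolute href against the base URL."""
    return urljoin(base, href)

print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```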


In [45]:
topic_title_tags[0].text

'3D'

In [49]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [48]:
topic_desc_tags[0].text

'\n          3D modeling is the process of virtually developing the surface and structure of a 3D object.\n        '

In [51]:
topic_descrptions = []

for tag in topic_desc_tags:
    topic_descrptions.append(tag.text.strip())
print(topic_descrptions)

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency framework for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source hardware and software company and maker community.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a clo

In [52]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [None]:
#!pip install pandas --user --upgrade --quiet

In [55]:
import pandas as pd

In [53]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descrptions,
    'url': topic_urls
}
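
`pd.DataFrame` expects the lists in such a dictionary to have the same length, so it can help to check this before building the table, for example if one selector matched fewer tags than another. The sketch below uses a hypothetical helper, `check_equal_lengths`, and made-up sample data.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists in `columns`, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    # An empty dict has no columns, so report zero rows
    return next(iter(lengths.values()), 0)

sample_columns = {
    "title": ["3D", "Ajax"],
    "description": ["3D modeling ...", "Ajax is a technique ..."],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
print(check_equal_lengths(sample_columns))
```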

In [57]:
topic_df = pd.DataFrame(topics_dict)
topic_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


### Create CSV file(s) with the extracted information

    Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
    Execute the function with different inputs to create a dataset of CSV files.
    Verify the information in the CSV files by reading them back using Pandas.

In [59]:
topic_df.to_csv('topics.csv', index=False)

In [60]:
df = pd.read_csv("topics.csv")
df.head()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


## Getting information out of a topic page

In [61]:
topic_page_url = topic_urls[0]

In [62]:
topic_page_url

'https://github.com/topics/3d'

In [63]:
response = requests.get(topic_page_url)

In [64]:
response.status_code

200

In [65]:
len(response.text)

663970

In [67]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [71]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

In [72]:
len(repo_tags)

30

In [73]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d897521

In [74]:
a_tags = repo_tags[0].find_all('a')

In [76]:
a_tags

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-view-component="tr

In [75]:
a_tags[0].text.strip()

'mrdoob'

In [77]:
a_tags[1].text.strip()

'three.js'

In [78]:
a_tags[1]['href']

'/mrdoob/three.js'

In [79]:
base_url = "https://github.com"
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [81]:
star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
len(star_tags)

30

In [82]:
star_tags

[<span aria-label="77287 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="77,287">77.3k</span>,
 <span aria-label="19467 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="19,467">19.5k</span>,
 <span aria-label="16208 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="16,208">16.2k</span>,
 <span aria-label="15555 users starred this repository" class="Counter js-social-count" data-pjax

In [84]:
star_tags[0].text.strip()

'77.3k'

In [86]:
stars_str = '77.3k'

In [87]:
stars_str[-1]

'k'

In [89]:
int(float(stars_str[:-1]) * 1000)

77300

In [94]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [95]:
parse_star_count(star_tags[0].text.strip())

77300
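
A slightly more defensive variant is sketched below. It assumes, without confirmation from the page, that counts might also appear with an `m` suffix or with thousands separators such as the `77,287` seen in the `title` attribute; the name `parse_count` is hypothetical. It uses `round` rather than `int` so that floating-point error in the multiplication cannot truncate the result by one.

```python
def parse_count(text):
    """Parse strings such as '77.3k', '1.2m', '77,287' or '987' into an int."""
    text = text.strip().lower().replace(",", "")
    # The 'm' suffix is an assumption; only 'k' is seen in the scraped output
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

for sample in ["77.3k", "1.2m", "77,287", " 987 "]:
    print(sample, parse_count(sample))
```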

In [98]:
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()    
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url

In [99]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 77300, 'https://github.com/mrdoob/three.js')

In [100]:
get_repo_info(repo_tags[1], star_tags[1])

('libgdx', 'libgdx', 19500, 'https://github.com/libgdx/libgdx')

In [101]:
get_repo_info(repo_tags[2], star_tags[2])

('pmndrs',
 'react-three-fiber',
 16200,
 'https://github.com/pmndrs/react-three-fiber')

In [103]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])    
    topic_repos_dict['stars'].append(repo_info[2])    
    topic_repos_dict['repo_url'].append(repo_info[3])    
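
The positional appends above can also be written generically with `zip`, pairing each column name with the matching field of a record. The sketch below uses plain tuples in place of parsed tags; `records_to_columns` and `COLUMNS` are hypothetical names.

```python
# Same column order as the tuple returned by get_repo_info
COLUMNS = ("username", "repo_name", "stars", "repo_url")

def records_to_columns(records, column_names=COLUMNS):
    """Turn a list of tuples into a dict of column lists, one list per name."""
    columns = {name: [] for name in column_names}
    for record in records:
        for name, value in zip(column_names, record):
            columns[name].append(value)
    return columns

records = [
    ("mrdoob", "three.js", 77300, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 19500, "https://github.com/libgdx/libgdx"),
]
columns = records_to_columns(records)
print(columns["repo_name"])
```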

In [104]:
topic_repos_dict

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'ssloy',
  'lettier',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'spritejs',
  'domlysz',
  'tensorspace-team',
  'jagenjo',
  'YadiraF',
  'openscad',
  'AaronJackson',
  'blender',
  'ssloy',
  'google',
  'mosra',
  'gfxfundamentals',
  'cleardusk',
  'jasonlong',
  'rg3dengine',
  'cnr-isti-vclab',
  'antvis'],
 'repo_name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'spritejs',
  'BlenderGIS',
  'tensorspace',
  'webglstudio.js',
  'PRNet',
  'openscad',
  'vrn',
  'blender',
  'tinyraytracer',
  'model-viewer',
  'magnum',
  'webgl-fundamentals',
  '3DDFA',
  'isometric-contributions',
  'rg3d',
  'meshlab',
  'L7'],
 'stars': [77300,
  19500,
  16200,
  15600,
  13500,
  1

In [105]:
topic_repos_df = pd.DataFrame(topic_repos_dict)
topic_repos_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 77300 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 19500 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 16200 | https://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 15600 | https://github.com/BabylonJS/Babylon.js |
| 4 | aframevr | aframe | 13500 | https://github.com/aframevr/aframe |
| 5 | ssloy | tinyrenderer | 11800 | https://github.com/ssloy/tinyrenderer |
| 6 | lettier | 3d-game-shaders-for-beginners | 11800 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 10400 | https://github.com/FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 8900 | https://github.com/metafizzy/zdog |
| 9 | CesiumGS | cesium | 8100 | https://github.com/CesiumGS/cesium |


In [None]:
'''
def get_topic_repos(topic_url):
    
    # Download the page
    response = requests.get(topic_url)
    
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')    
    
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    # Get repo info
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
        }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])

        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])    
        topic_repos_dict['stars'].append(repo_info[2])    
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)
'''

In [126]:
url4 = topic_urls[4]

In [127]:
url4

'https://github.com/topics/android'

In [128]:
topic4_doc = get_topic_page(url4)

In [129]:
topic4_doc


<!DOCTYPE html>

<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-13dc275a3a314268790358e25956033c.css" integrity="sha512-E9wnWjoxQmh5A1jiWVYDPKOvA8VPf0iKQYoc+9ycMJvtAi9gOSlaUci+W2smxFIlWkV8hkX+O27S8NIB59iIDw==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-9d84afdcaac584c9465298e415004b30.css" integrity="sha512-nYSv3KrFhMlGUpjkFQBLMEN6HvHhijcoubQ

In [130]:
topic4_repos = get_topic_repos(topic4_doc)

In [131]:
topic4_repos

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 134000 | https://github.com/flutter/flutter |
| 1 | justjavac | free-programming-books-zh_CN | 86100 | https://github.com/justjavac/free-programming-... |
| 2 | Genymobile | scrcpy | 59300 | https://github.com/Genymobile/scrcpy |
| 3 | Hack-with-Github | Awesome-Hacking | 48200 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 4 | google | material-design-icons | 44600 | https://github.com/google/material-design-icons |
| 5 | wasabeef | awesome-android-ui | 41800 | https://github.com/wasabeef/awesome-android-ui |
| 6 | square | okhttp | 41300 | https://github.com/square/okhttp |
| 7 | android | architecture-samples | 39900 | https://github.com/android/architecture-samples |
| 8 | square | retrofit | 39200 | https://github.com/square/retrofit |
| 9 | Solido | awesome-flutter | 38800 | https://github.com/Solido/awesome-flutter |


In [133]:
topic_urls[5]

'https://github.com/topics/angular'

In [134]:
get_topic_repos(get_topic_page(topic_urls[5]))

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | justjavac | free-programming-books-zh_CN | 86100 | https://github.com/justjavac/free-programming-... |
| 1 | angular | angular | 78500 | https://github.com/angular/angular |
| 2 | storybookjs | storybook | 67700 | https://github.com/storybookjs/storybook |
| 3 | ionic-team | ionic-framework | 45900 | https://github.com/ionic-team/ionic-framework |
| 4 | leonardomso | 33-js-concepts | 45700 | https://github.com/leonardomso/33-js-concepts |
| 5 | prettier | prettier | 41500 | https://github.com/prettier/prettier |
| 6 | SheetJS | sheetjs | 28600 | https://github.com/SheetJS/sheetjs |
| 7 | angular | angular-cli | 25100 | https://github.com/angular/angular-cli |
| 8 | angular | components | 22400 | https://github.com/angular/components |
| 9 | NativeScript | NativeScript | 20800 | https://github.com/NativeScript/NativeScript |


In [135]:
get_topic_repos(get_topic_page(topic_urls[5])).to_csv('angular.csv', index=False)

In [167]:
import os

def get_topic_page(topic_url):
     
    # Download the page
    response = requests.get(topic_url)
    
    # Check successful response
    if response.status_code !=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser') 
    
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()    
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):   
    
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    # Get repo info
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
        }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])

        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])    
        topic_repos_dict['stars'].append(repo_info[2])    
        topic_repos_dict['repo_url'].append(repo_info[3])   
        
    return pd.DataFrame(topic_repos_dict)


def scrape_topic(topic_url,path):
    
    if os.path.exists(path):
        print('The file "{}" already exists. Skipping...'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
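
The skip-if-exists check in `scrape_topic` makes repeated runs cheap, because topics that already have a file are not downloaded again. The sketch below isolates that pattern with plain text files in a temporary folder; `write_if_missing` is a hypothetical helper and no network access is involved.

```python
import os
import tempfile

def write_if_missing(path, text):
    """Write `text` to `path` unless the file already exists.

    Returns True if the file was written and False if it was skipped.
    """
    if os.path.exists(path):
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "3d.csv")
    first = write_if_missing(path, "username,repo_name,stars,repo_url\n")
    second = write_if_missing(path, "this text is never written\n")
    with open(path, encoding="utf-8") as f:
        saved = f.read()

print(first, second)
```

One consequence of this design is that a stale or partially written file is never refreshed, so deleting the file is the way to force a new scrape.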

### Write a single function to:

1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. For each topic, create a CSV of the top repos for the topic

In [136]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class':selection_class})
    
    topic_titles = []

    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class':desc_selector})
    
    topic_descriptions = []

    for tag in topic_desc_tags:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions



def get_topic_urls(doc):
    topic_link_tags = doc.find_all("a", {'class': 'no-underline flex-1 d-flex flex-column'})

    topic_urls = []
    base_url = 'https://github.com'

    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])

    return topic_urls


def scrape_topics():
    topic_url = "https://github.com/topics"
    response = requests.get(topic_url)
    
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    # Parse the downloaded page here instead of relying on the global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    
    return pd.DataFrame(topics_dict)
    

In [138]:
scrape_topics()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [142]:
for index, row in topic_df.iterrows():
    print(row['title'], row['url'])

3D https://github.com/topics/3d
Ajax https://github.com/topics/ajax
Algorithm https://github.com/topics/algorithm
Amp https://github.com/topics/amphp
Android https://github.com/topics/android
Angular https://github.com/topics/angular
Ansible https://github.com/topics/ansible
API https://github.com/topics/api
Arduino https://github.com/topics/arduino
ASP.NET https://github.com/topics/aspnet
Atom https://github.com/topics/atom
Awesome Lists https://github.com/topics/awesome
Amazon Web Services https://github.com/topics/aws
Azure https://github.com/topics/azure
Babel https://github.com/topics/babel
Bash https://github.com/topics/bash
Bitcoin https://github.com/topics/bitcoin
Bootstrap https://github.com/topics/bootstrap
Bot https://github.com/topics/bot
C https://github.com/topics/c
Chrome https://github.com/topics/chrome
Chrome extension https://github.com/topics/chrome-extension
Command line interface https://github.com/topics/cli
Clojure https://github.com/topics/clojure
Code quality h

In [170]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('GitHub-Top-Repos', exist_ok=True)
    
    # Iterate over the freshly scraped topics rather than the global `topic_df`
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'GitHub-Top-Repos/{}.csv'.format(row['title']))
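
Topic titles such as "Chrome extension" contain spaces, and a title containing a character like `/` would not be usable as a file name. One possible alternative, sketched below under that assumption, is to name each file after the last segment of the topic URL. The helpers `topic_slug` and `csv_path_for` are hypothetical and are not used by the function above.

```python
from urllib.parse import urlparse

def topic_slug(topic_url):
    """Return the last path segment of a topic URL, e.g. 'chrome-extension'."""
    path = urlparse(topic_url).path
    return path.rstrip("/").split("/")[-1]

def csv_path_for(topic_url, folder="GitHub-Top-Repos"):
    """Build the CSV path for a topic from its URL instead of its title."""
    return "{}/{}.csv".format(folder, topic_slug(topic_url))

print(csv_path_for("https://github.com/topics/chrome-extension"))
print(csv_path_for("https://github.com/topics/aspnet"))
```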

In [171]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin