## Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

### Outline

- We're going to scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, topic URL, and topic description.
- For each topic, we'll get the top 25 repos from the topic page.
- For each repo, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file with the columns `username`, `repo_name`, `repo_url`, and `stars`.
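
As a rough illustration of the per-topic CSV layout, here is a minimal sketch using only the standard library. The column names are taken from the DataFrame built later in this notebook; the helper name `repos_to_csv` and the constant `FIELDNAMES` are hypothetical and exist only for this example.

```python
import csv
import io

# Column names taken from the DataFrame built later in this notebook
FIELDNAMES = ["username", "repo_name", "repo_url", "stars"]

def repos_to_csv(rows):
    """Serialize a list of repo dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# One sample row, matching the first repo scraped for the 3D topic
sample = [{"username": "mrdoob", "repo_name": "three.js",
           "repo_url": "https://github.com/mrdoob/three.js", "stars": 104000}]
print(repos_to_csv(sample))
```

The notebook itself writes these files with pandas; this sketch only shows what a finished file is expected to contain.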

## Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
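
The last bullet above could be sketched as a small helper like the one below. This is an assumption-laden sketch, not the notebook's own code: `download_page` and its `get` parameter are hypothetical names, and `get` exists only so the function can be exercised without network access (it falls back to `requests.get`).

```python
from pathlib import Path

def download_page(url, path, get=None):
    """Download `url`, save the HTML to `path`, and return the character count."""
    if get is None:
        import requests  # imported lazily so the helper can be tried offline
        get = requests.get
    response = get(url)
    # Fail loudly instead of saving an error page
    if response.status_code != 200:
        raise RuntimeError(
            "Failed to load page {} (status {})".format(url, response.status_code)
        )
    Path(path).write_text(response.text, encoding="utf-8")
    return len(response.text)
```

A call such as `download_page('https://github.com/topics', 'webpage.html')` would then replace the separate download and save steps shown in the cells below.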

In [7]:
!pip install requests --upgrade --quiet

In [8]:
import requests

In [9]:
topics_url = 'https://github.com/topics'

In [10]:
response = requests.get(topics_url)

In [11]:
response.status_code

200

In [12]:
len(response.text)

205543

In [13]:
top_lines = response.text[:100000]

In [14]:
page_content = response.text

In [15]:
# Save the first 100,000 characters locally for inspection
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(top_lines)

## Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract information from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
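
As a hedged sketch of the extraction step, the example below parses a tiny hand-written HTML snippet (a stand-in, not the real GitHub markup) and collects titles, descriptions, and URLs into a list of dictionaries. It uses CSS selectors via `select`, which match on individual classes; as far as I know, passing a multi-class string to `find_all` matches the `class` attribute's exact string value, so it is more sensitive to class order. The function name `extract_topics` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a topics page; the real page is far larger
html = """
<div>
  <a class="no-underline flex-grow-0" href="/topics/3d"></a>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1"> 3D graphics and modeling. </p>
</div>
"""

def extract_topics(doc, base_url="https://github.com"):
    """Return one dict per topic, built from three parallel tag lists."""
    titles = doc.select("p.f3.lh-condensed")
    descs = doc.select("p.f5.color-fg-muted")
    links = doc.select("a.no-underline.flex-grow-0")
    return [
        {
            "title": title.get_text(strip=True),
            "description": desc.get_text(strip=True),
            "url": base_url + link["href"],
        }
        for title, desc, link in zip(titles, descs, links)
    ]

topics = extract_topics(BeautifulSoup(html, "html.parser"))
print(topics)
```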


In [16]:
!pip install beautifulsoup4 --upgrade --quiet

In [17]:
from bs4 import BeautifulSoup

In [18]:
doc = BeautifulSoup(page_content, 'html.parser')

In [19]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

title_p_tags = doc.find_all('p',{'class':selection_class})

In [20]:
len(title_p_tags)

30

In [21]:
title_p_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [22]:
ss_class = 'f5 color-fg-muted mb-0 mt-1'

desc_p_tags = doc.find_all('p',{'class':ss_class})

In [23]:
len(desc_p_tags)

30

In [24]:
desc_p_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [25]:
topic_url_tags = doc.find_all('a',{'class':'no-underline flex-grow-0'})

In [26]:
len(topic_url_tags)

30

In [27]:
topic_url_tags[5]['href']

'/topics/angular'

In [28]:
topic_titles = []
topic_desc = []

for tag in title_p_tags:
    topic_titles.append(tag.text)
for desc in desc_p_tags:
    topic_desc.append(desc.text.strip())

In [29]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_url_tags:
    topic_urls.append(base_url + tag['href'])

In [30]:
topic_urls[:2]

['https://github.com/topics/3d', 'https://github.com/topics/ajax']
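
String concatenation works here because every `href` starts with `/`. As an alternative sketch, the standard library's `urljoin` resolves relative links and leaves absolute ones untouched. The helper name `to_absolute` is hypothetical.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

def to_absolute(href):
    """Resolve a relative href such as '/topics/3d' against base_url."""
    return urljoin(base_url, href)

print(to_absolute('/topics/angular'))  # https://github.com/topics/angular
```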

In [31]:
topic_desc[:2]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.']

In [32]:
!pip install pandas --upgrade --quiet

In [33]:
import pandas as pd

In [34]:
topics_dict = {
    'title': topic_titles,
    'description': topic_desc,
    'url': topic_urls
}

In [35]:
topics_df = pd.DataFrame(topics_dict)

In [36]:
topics_df

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet
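
Building a DataFrame from a dictionary of lists relies on all three lists having the same length (30 each in this run). If one selector silently matched fewer tags, the columns would no longer line up. A small guard like the sketch below makes that failure explicit before the DataFrame is built. `check_equal_lengths` is a hypothetical helper, not part of pandas.

```python
def check_equal_lengths(columns):
    """Return the shared length of all column lists, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    # An empty dict has no columns, so report a length of 0
    return next(iter(lengths.values()), 0)

print(check_equal_lengths({'title': ['3D', 'Ajax'], 'url': ['/topics/3d', '/topics/ajax']}))  # 2
```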


## Create CSV file(s) with the extracted information

In [37]:
topics_df.to_csv('topics.csv', index=False)

## Getting info out of a topic page

In [38]:
topic_page_url = topic_urls[0]

In [39]:
topic_page_url

'https://github.com/topics/3d'

In [40]:
response = requests.get(topic_page_url)

In [41]:
response.status_code

200

In [42]:
len(response.text)

520248

In [43]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [44]:
repo_tags = topic_doc.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})

In [45]:
len(repo_tags)

20

In [46]:
a_tags = repo_tags[0].find_all('a')

In [47]:
a_tags[0].text.strip()

'mrdoob'

In [48]:
star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})

In [49]:
len(star_tags)

20

In [50]:
star_tags[0].text.strip()

'104k'

In [51]:
def parse_star_count(stars_cnt):
    # Convert a star label such as '104k' or '987' to an integer
    stars_cnt = stars_cnt.strip()
    if stars_cnt.endswith('k'):
        return int(float(stars_cnt[:-1]) * 1000)
    return int(stars_cnt)

In [52]:
parse_star_count(star_tags[0].text)

104000
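
A slightly more defensive variant is sketched below. It assumes labels might also carry an `m` suffix or thousands separators (neither appears in this notebook's output, so treat both as assumptions), and it uses `Decimal` to sidestep any floating-point rounding concerns when multiplying. The name `parse_star_count_exact` is hypothetical.

```python
from decimal import Decimal

# Suffix multipliers; 'm' is an assumption, the notebook only encounters 'k'
_MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}

def parse_star_count_exact(text):
    """Convert labels such as '104k', '23.6k', or '987' to an integer."""
    text = text.strip().lower().replace(',', '')
    if text and text[-1] in _MULTIPLIERS:
        return int(Decimal(text[:-1]) * _MULTIPLIERS[text[-1]])
    return int(text)

print(parse_star_count_exact('23.6k'))  # 23600
```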

In [53]:
def get_repo_info(h3_tag, star_tag):
    # Return the username, repo name, repo URL, and star count for one repo
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars

In [54]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 104000)

In [55]:
info_repo = {
    'username' : [],
    'repo_name' : [],
    'repo_url' : [],
    'stars' : [],
}
for i in range(len(repo_tags)):
    repo_info = (get_repo_info(repo_tags[i],star_tags[i]))
    info_repo['username'].append(repo_info[0])
    info_repo['repo_name'].append(repo_info[1])
    info_repo['repo_url'].append(repo_info[2])
    info_repo['stars'].append(repo_info[3])

In [56]:
info_repo_df = pd.DataFrame(info_repo)

In [57]:
info_repo_df.to_csv("Repo_Info.csv", index=False)

In [58]:
info_repo_df

Unnamed: 0,username,repo_name,repo_url,stars
0,mrdoob,three.js,https://github.com/mrdoob/three.js,104000
1,pmndrs,react-three-fiber,https://github.com/pmndrs/react-three-fiber,28000
2,libgdx,libgdx,https://github.com/libgdx/libgdx,23600
3,BabylonJS,Babylon.js,https://github.com/BabylonJS/Babylon.js,23500
4,FreeCAD,FreeCAD,https://github.com/FreeCAD/FreeCAD,22700
5,ssloy,tinyrenderer,https://github.com/ssloy/tinyrenderer,21100
6,lettier,3d-game-shaders-for-beginners,https://github.com/lettier/3d-game-shaders-for...,18300
7,aframevr,aframe,https://github.com/aframevr/aframe,16800
8,blender,blender,https://github.com/blender/blender,13900
9,CesiumGS,cesium,https://github.com/CesiumGS/cesium,13200
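
The loop above indexes `repo_tags` and `star_tags` in parallel, which assumes both selectors returned the same number of tags (20 each here). A minimal sketch of a guard that pairs the two lists and refuses mismatched lengths is shown below. `pair_up` is a hypothetical helper.

```python
def pair_up(repo_tags, star_tags):
    """Pair each repo tag with its star tag, refusing mismatched lengths."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            "Found {} repo tags but {} star tags".format(len(repo_tags), len(star_tags))
        )
    return list(zip(repo_tags, star_tags))

# Plain strings stand in for Beautiful Soup tags in this demonstration
print(pair_up(['repo-a', 'repo-b'], ['1k', '2k']))
```

With real tags, the loop could then read `for h3_tag, star_tag in pair_up(repo_tags, star_tags):` and pass each pair to `get_repo_info`.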


In [91]:
import os

def get_topic_page(topic_url):
    #Download the page
    response = requests.get(topic_url)
    #check response code
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topic_url))
    #Parse using beautifulsoup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # Return the username, repo name, repo URL, and star count for one repo
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, repo_url, stars

def get_topic_repos(topic_doc):
    # Get the h3 tags that hold the username and repo name links
    repo_tags = topic_doc.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})
    # Get the span tags that hold the star counts
    star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})

    info_repo = {
        'username' : [],
        'repo_name' : [],
        'repo_url' : [],
        'stars' : [],
    }
    #get repo info
    for i in range(len(repo_tags)):
        repo_info = (get_repo_info(repo_tags[i],star_tags[i]))
        info_repo['username'].append(repo_info[0])
        info_repo['repo_name'].append(repo_info[1])
        info_repo['repo_url'].append(repo_info[2])
        info_repo['stars'].append(repo_info[3])
    return pd.DataFrame(info_repo,index=None)

def scrape_topic(topic_url,topic_name):
    fname = topic_name + '.csv'
    if os.path.exists(fname):
        print("The file {} already exists".format(fname))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(fname, index=False)
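
`scrape_topic` uses the topic title directly as the file name, so titles containing spaces or path separators end up in the file name as-is. One possible way to normalize them is sketched below. `safe_csv_name` is a hypothetical helper, and the set of allowed characters is an arbitrary choice for this example.

```python
import re

def safe_csv_name(topic_name):
    """Build a CSV file name, replacing unusual characters with underscores."""
    # Keep letters, digits, dot, underscore, plus, and hyphen; collapse the rest
    cleaned = re.sub(r'[^A-Za-z0-9._+-]+', '_', topic_name.strip())
    return (cleaned or 'topic') + '.csv'

print(safe_csv_name('Awesome Lists'))  # Awesome_Lists.csv
```

Note that switching to such names would change which existing files the `os.path.exists` check finds.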

In [60]:
url4 = topic_urls[4]

In [61]:
topic_4_doc = get_topic_page(url4)

In [62]:
topic4_repos = get_topic_repos(topic_4_doc)

In [63]:
topic4_repos

Unnamed: 0,username,repo_name,repo_url,stars
0,flutter,flutter,https://github.com/flutter/flutter,168000
1,facebook,react-native,https://github.com/facebook/react-native,120000
2,Genymobile,scrcpy,https://github.com/Genymobile/scrcpy,117000
3,justjavac,free-programming-books-zh_CN,https://github.com/justjavac/free-programming-...,112000
4,Hack-with-Github,Awesome-Hacking,https://github.com/Hack-with-Github/Awesome-Ha...,88000
5,Solido,awesome-flutter,https://github.com/Solido/awesome-flutter,54500
6,tldr-pages,tldr,https://github.com/tldr-pages/tldr,53200
7,wasabeef,awesome-android-ui,https://github.com/wasabeef/awesome-android-ui,51300
8,google,material-design-icons,https://github.com/google/material-design-icons,51000
9,laurent22,joplin,https://github.com/laurent22/joplin,47400


In [64]:
get_topic_repos(get_topic_page(topic_urls[5]))

Unnamed: 0,username,repo_name,repo_url,stars
0,justjavac,free-programming-books-zh_CN,https://github.com/justjavac/free-programming-...,112000
1,angular,angular,https://github.com/angular/angular,96700
2,storybookjs,storybook,https://github.com/storybookjs/storybook,85200
3,leonardomso,33-js-concepts,https://github.com/leonardomso/33-js-concepts,64500
4,ionic-team,ionic-framework,https://github.com/ionic-team/ionic-framework,51300
5,prettier,prettier,https://github.com/prettier/prettier,49800
6,Asabeneh,30-Days-Of-JavaScript,https://github.com/Asabeneh/30-Days-Of-JavaScript,43800
7,SheetJS,sheetjs,https://github.com/SheetJS/sheetjs,35300
8,angular,angular-cli,https://github.com/angular/angular-cli,26800
9,wailsapp,wails,https://github.com/wailsapp/wails,26500


## Scaling this code

Write a single function to:
1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. For each topic, create a csv of the top repos of the topic
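
The three steps above could be tied together roughly as in the sketch below. It is only an outline under stated assumptions: `scrape_all`, `list_topics`, `scrape_one`, and `delay_seconds` are hypothetical names, the two callables are passed in so the loop can be tried without network access, and the pause between requests is a courtesy choice rather than anything GitHub documents for this page.

```python
import time

def scrape_all(list_topics, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url, title) for each (title, url) pair from list_topics().

    Returns a list of (title, error message) pairs for topics that failed,
    so one bad page does not stop the whole run.
    """
    failures = []
    for title, url in list_topics():
        try:
            scrape_one(url, title)
        except Exception as error:
            failures.append((title, str(error)))
        # Pause briefly between topic pages
        time.sleep(delay_seconds)
    return failures

# Offline demonstration with stand-in callables
done = []
failed = scrape_all(
    lambda: [('3D', 'https://github.com/topics/3d')],
    lambda url, title: done.append(title),
    delay_seconds=0,
)
print(done, failed)  # ['3D'] []
```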

In [65]:
topics_url

'https://github.com/topics'

In [66]:
def get_topic_title(doc):
    
    title_p_tags = doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles = []
    for tag in title_p_tags:
        topic_titles.append(tag.text.strip())
    return topic_titles

def get_topic_desc(doc):
    desc_p_tags = doc.find_all('p',{'class':'f5 color-fg-muted mb-0 mt-1'})
    topic_desc = []
    for desc in desc_p_tags:
        topic_desc.append(desc.text.strip())
    return topic_desc

def get_topic_urls(doc):
    
    topic_url_tags = doc.find_all('a',{'class':'no-underline flex-grow-0'})
    topic_urls = []
    base_url = 'https://github.com'
    
    for tag in topic_url_tags:
        topic_urls.append(base_url + tag['href'])

    return topic_urls


def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topics_url))
    # Parse the freshly downloaded page instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'titles': get_topic_title(doc),
        'descs': get_topic_desc(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)



In [67]:
scrape_topics()

Unnamed: 0,titles,descs,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [68]:
topics_df = scrape_topics()

In [69]:
topics_df

Unnamed: 0,titles,descs,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [76]:
for index,row in topics_df.iterrows():
    print(row['titles'],'=> ', row['url'])

3D =>  https://github.com/topics/3d
Ajax =>  https://github.com/topics/ajax
Algorithm =>  https://github.com/topics/algorithm
Amp =>  https://github.com/topics/amphp
Android =>  https://github.com/topics/android
Angular =>  https://github.com/topics/angular
Ansible =>  https://github.com/topics/ansible
API =>  https://github.com/topics/api
Arduino =>  https://github.com/topics/arduino
ASP.NET =>  https://github.com/topics/aspnet
Awesome Lists =>  https://github.com/topics/awesome
Amazon Web Services =>  https://github.com/topics/aws
Azure =>  https://github.com/topics/azure
Babel =>  https://github.com/topics/babel
Bash =>  https://github.com/topics/bash
Bitcoin =>  https://github.com/topics/bitcoin
Bootstrap =>  https://github.com/topics/bootstrap
Bot =>  https://github.com/topics/bot
C =>  https://github.com/topics/c
Chrome =>  https://github.com/topics/chrome
Chrome extension =>  https://github.com/topics/chrome-extension
Command-line interface =>  https://github.com/topics/cli
Cloj

In [87]:
def scrape_topic_repos():
    print("Scraping starts:")
    topics_df = scrape_topics()
    for index,row in topics_df.iterrows():
        print('Scraping top repos for "{}"'.format(row['titles']))
        scrape_topic(row['url'],row['titles'])

In [88]:
scrape_topic_repos()

Scraping starts:
Scraping top repos for "3D"
Scraping top repos for "Ajax"
Scraping top repos for "Algorithm"
Scraping top repos for "Amp"
Scraping top repos for "Android"
Scraping top repos for "Angular"
Scraping top repos for "Ansible"
Scraping top repos for "API"
Scraping top repos for "Arduino"
Scraping top repos for "ASP.NET"
Scraping top repos for "Awesome Lists"
Scraping top repos for "Amazon Web Services"
Scraping top repos for "Azure"
Scraping top repos for "Babel"
Scraping top repos for "Bash"
Scraping top repos for "Bitcoin"
Scraping top repos for "Bootstrap"
Scraping top repos for "Bot"
Scraping top repos for "C"
Scraping top repos for "Chrome"
Scraping top repos for "Chrome extension"
Scraping top repos for "Command-line interface"
Scraping top repos for "Clojure"
Scraping top repos for "Code quality"
Scraping top repos for "Code review"
Scraping top repos for "Compiler"
Scraping top repos for "Continuous integration"
Scraping top repos for "C+

In [92]:
scrape_topic_repos()

Scraping starts:
Scraping top repos for "3D"
The file 3D.csv already exists
Scraping top repos for "Ajax"
The file Ajax.csv already exists
Scraping top repos for "Algorithm"
The file Algorithm.csv already exists
Scraping top repos for "Amp"
The file Amp.csv already exists
Scraping top repos for "Android"
The file Android.csv already exists
Scraping top repos for "Angular"
The file Angular.csv already exists
Scraping top repos for "Ansible"
The file Ansible.csv already exists
Scraping top repos for "API"
The file API.csv already exists
Scraping top repos for "Arduino"
The file Arduino.csv already exists
Scraping top repos for "ASP.NET"
The file ASP.NET.csv already exists
Scraping top repos for "Awesome Lists"
The file Awesome Lists.csv already exists
Scraping top repos for "Amazon Web Services"
The file Amazon Web Services.csv already exists
Scraping top repos for "Azure"
The file Azure.csv already exists
Scraping top repos for "Babel"
The file Babel.csv already exists
Sc