## Web scraping from scratch

## Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

### Project outline

- We're going to scrape https://github.com/topics.
- We will get a list of topics. For each topic, we will get the topic title, topic page URL, and topic description.
- For each repository listed under a topic, we will grab the repo name, username, stars, and repo URL.
- For each topic, we will create a CSV file in the following format:
---
Repo Name,Username,Stars,Repo URL
---
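
As a minimal sketch of that target format, the standard `csv` module can write the header and one row. The sample row reuses the three.js values that appear later in this notebook; `repos_to_csv` is a hypothetical helper name used only for illustration, not part of the project code.

```python
import csv
import io


def repos_to_csv(rows):
    """Serialize (repo_name, username, stars, repo_url) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    # Header row in the format described in the project outline
    writer.writerow(['Repo Name', 'Username', 'Stars', 'Repo URL'])
    writer.writerows(rows)
    return buffer.getvalue()


sample = [('three.js', 'mrdoob', 69900, 'https://github.com/mrdoob/three.js')]
csv_text = repos_to_csv(sample)
print(csv_text)
```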

## Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
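
The download-and-save steps in the list above could be wrapped in one function, roughly as follows. This is a sketch under assumptions: `download_page` and its `fetch` parameter are hypothetical names, and `fetch` exists only so the helper can be tried without network access. By default it falls back to `requests.get`.

```python
import os
import tempfile


def download_page(url, path, fetch=None):
    """Download `url`, save the HTML to `path`, and return the page text."""
    if fetch is None:
        import requests  # imported lazily so the helper can be exercised offline
        fetch = requests.get
    response = fetch(url)
    # Anything other than 200 is treated as a failed download
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    with open(path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return response.text


# Offline demonstration with a stand-in response object (illustration only)
class FakeResponse:
    status_code = 200
    text = '<html><body>topics</body></html>'


demo_path = os.path.join(tempfile.mkdtemp(), 'topics.html')
html = download_page('https://github.com/topics', demo_path,
                     fetch=lambda url: FakeResponse())
print(len(html))
```

Calling `download_page(url, path)` without `fetch` performs a real HTTP request through `requests`.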

In [1]:
!pip install requests --upgrade

Requirement already up-to-date: requests in c:\users\trevour\anaconda3\envs\wrk-env\lib\site-packages (2.25.1)


In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
#checking if the request was successful
response.status_code

200

In [6]:
len(response.text)

126251

In [7]:
page_content = response.text

In [8]:
page_content[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-7KjiGvJiLLy6LJPGf3m67ejAdgQsgDdnxZYoaI6+Agd0ZxHKTCjoKZgaf3PgUjURCcVceAwySJJJWgitRskDiA==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-eca8e21af2622cbcba2c93c67f79baed.css" />\n    <link crossorigin="anonymous" media="all" integrity="sha512-dDsAoT3mMaA8gyLZkshXL3vrnDAuIv4cNq2iN06+o44rOFIngYNNiTehUUzNuMoBXMaDg0MLhEaZNumoCiLJkw==" rel="stylesheet" href="https://github.githubassets.com/assets/behaviors-743b00a13de631a03c8322d992c8572f.css" />\n    \n    \n    \n    <link crossorigin="anonymous" media="all" integrity="sha512-Rzg

In [9]:
# Save the downloaded HTML locally so it can be inspected in an editor
fname = 'web.html'
with open(fname, 'w', encoding='utf-8') as f:
    f.write(page_content)

## Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
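
Before working on the real page, the parse-and-extract steps listed above can be tried on a small inline document. The HTML and its class names (`topic-link`, `topic-title`, `topic-desc`) are made up for this sketch and are much simpler than GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like a list of topic cards
html = """
<div>
  <a class="topic-link" href="/topics/3d">
    <p class="topic-title">3D</p>
    <p class="topic-desc">
      3D modeling is the process of virtually developing the surface and structure of a 3D object.
    </p>
  </a>
  <a class="topic-link" href="/topics/ajax">
    <p class="topic-title">Ajax</p>
    <p class="topic-desc">
      Ajax is a technique for creating interactive web applications.
    </p>
  </a>
</div>
"""

doc = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag; .text gives the text inside a tag
titles = [p.text for p in doc.find_all('p', {'class': 'topic-title'})]
# Descriptions carry surrounding whitespace, so strip it
descs = [p.text.strip() for p in doc.find_all('p', {'class': 'topic-desc'})]
# Attributes are read with dictionary-style indexing
links = [a['href'] for a in doc.find_all('a', {'class': 'topic-link'})]

print(titles)
print(links)
```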


In [10]:
#installing beautiful soup
!pip install beautifulsoup4 --upgrade --quiet

In [11]:
from bs4 import BeautifulSoup

In [12]:
doc = BeautifulSoup(page_content,"html.parser")

In [13]:
type(doc)

bs4.BeautifulSoup

In [14]:
attribute = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class':attribute})

In [15]:
len(topic_title_tags)

30

We are looking for the 3D topic, which is the first topic on the page.

In [16]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [17]:
topic_desc_tags = doc.find_all('p',{'class':'f5 color-text-secondary mb-0 mt-1'})

In [18]:
len(topic_desc_tags)

30

In [20]:
topic_desc_tags[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

In [22]:
topic_title_tag0 = topic_title_tags[0]

In [24]:
div_tag = topic_title_tag0.parent

In [25]:
topic_link_tags = doc.find_all('a',{'class':'d-flex no-underline'})

In [26]:
len(topic_link_tags)

30

In [27]:
topic_link_tags[0]['href']

'/topics/3d'

In [28]:
topic0_url = 'https://github.com' + topic_link_tags[0]['href']

In [29]:
print(topic0_url)

https://github.com/topics/3d
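
String concatenation works here because every `href` starts with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves a relative `href` against the base URL and leaves an already absolute URL untouched. The `hrefs` list below is sample data, not scraped output.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'
hrefs = ['/topics/3d', '/topics/ajax']  # sample relative links

# Resolve each relative link against the site root
topic_urls = [urljoin(base_url, href) for href in hrefs]
print(topic_urls)

# An absolute URL passes through unchanged
absolute = urljoin(base_url, 'https://github.com/topics/android')
print(absolute)
```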


In [30]:
# Extract clean values from the tags, starting with the titles
topic_titles = []

# The text of each title tag is the topic title
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [32]:
topic_desc = []
# Strip the surrounding whitespace from each description
for tag in topic_desc_tags:
    topic_desc.append(tag.text.strip())
print(topic_desc[:5])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency framework for PHP.', 'Android is an operating system built by Google designed for mobile devices.']


In [33]:
topic_urls = []
base_url = 'https://github.com'
# Build an absolute URL from each relative href
for tag in topic_link_tags:
    topic_urls.append(base_url+tag['href'])
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [35]:
#creating csv from the info we have
import pandas as pd
topics_dict = {
    'title': topic_titles,
    'description':topic_desc,
    'url':topic_urls
}

In [36]:
topics_df = pd.DataFrame(topics_dict)

In [37]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


## Create CSV file(s) with the extracted information

- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.
- Execute the function with different inputs to create a dataset of CSV files.
- Verify the information in the CSV files by reading them back using Pandas.

In [38]:
# index=False leaves the DataFrame's row index out of the file
topics_df.to_csv('topics.csv', index=False)
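
To verify a saved file, it can be read back with `pd.read_csv` and compared with the original DataFrame. The sketch below uses a two-row sample and a temporary directory (both assumptions for illustration) so that it does not touch the real `topics.csv`.

```python
import os
import tempfile

import pandas as pd

# Two sample rows with the same columns as topics_df
sample_df = pd.DataFrame({
    'title': ['3D', 'Ajax'],
    'description': [
        '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

path = os.path.join(tempfile.mkdtemp(), 'topics_sample.csv')
sample_df.to_csv(path, index=False)

# Read the file back and check that columns and values survived the round trip
loaded_df = pd.read_csv(path)
print(list(loaded_df.columns))
print(loaded_df.to_dict('list') == sample_df.to_dict('list'))
```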

## Getting information out of a topic page

In [39]:
topic_page_url = topic_urls[0]

In [40]:
topic_page_url

'https://github.com/topics/3d'

In [41]:
response = requests.get(topic_page_url)

In [42]:
response.status_code

200

In [43]:
len(response.text)

580469

In [44]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [45]:
# Each repository heading is an h1 tag that holds the owner and repo links
repo_tags = topic_doc.find_all('h1', {'class': 'f3 color-text-secondary text-normal lh-condensed'})

In [46]:
len(repo_tags)

30

In [47]:
a_tags = repo_tags[0].find_all('a')

In [48]:
a_tags[0]

<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" href="/mrdoob">
            mrdoob
</a>

In [49]:
a_tags[0].text.strip()

'mrdoob'

In [50]:
a_tags[1].text.strip()

'three.js'

In [51]:
#getting the repo url
a_tags[0]['href']

'/mrdoob'

In [52]:
a_tags[1]['href']

'/mrdoob/three.js'

In [53]:
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [54]:
star_tags = topic_doc.find_all("a",{'class':'social-count float-none'})

In [55]:
len(star_tags)

30

In [56]:
star_tags[0].text.strip()

'69.9k'

In [61]:
star1 = star_tags[0].text.strip()

In [64]:
star1[:-1]

'69.9'

In [65]:
#converting the star count to an int
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)
    

In [66]:
parse_star_count(star_tags[0].text.strip())

69900
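
The function above handles plain integers and the `k` suffix, which is all this page shows. A slightly more defensive variant is sketched below. The `m` suffix and comma separators are assumptions about formats that might appear, and `parse_star_count_safe` is a hypothetical name chosen so it does not replace the original function.

```python
def parse_star_count_safe(stars_str):
    """Convert strings such as '69.9k', '1.2m' or '1,234' to an int."""
    text = stars_str.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids losing a unit when the float product lands just below an integer
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


print(parse_star_count_safe('69.9k'))
print(parse_star_count_safe('1,234'))
```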

In [67]:
def get_repo_info(h1_tag, star_tag):
    # Returns all the required info about a repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [68]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 69900, 'https://github.com/mrdoob/three.js')

In [69]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
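
The loop above walks `repo_tags` and `star_tags` by index, which assumes both lists have the same length. The same column-building step can be sketched with `zip` and tuple unpacking. The example runs on plain sample tuples instead of live tags, and `build_columns` is a hypothetical helper name.

```python
def build_columns(repo_infos):
    """Turn (username, repo_name, stars, repo_url) tuples into a dict of columns."""
    columns = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}
    for username, repo_name, stars, repo_url in repo_infos:
        columns['username'].append(username)
        columns['repo_name'].append(repo_name)
        columns['stars'].append(stars)
        columns['repo_url'].append(repo_url)
    return columns


# Sample data standing in for the parsed tags
names = [('mrdoob', 'three.js'), ('libgdx', 'libgdx')]
star_counts = [69900, 18300]
site = 'https://github.com'

# zip pairs each repo with its star count and stops at the shorter list
repo_infos = [
    (user, repo, count, '{}/{}/{}'.format(site, user, repo))
    for (user, repo), count in zip(names, star_counts)
]
sample_columns = build_columns(repo_infos)
print(sample_columns['repo_url'])
```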

In [58]:
topic_repos_dict

{'username': ['mrdoob',
  'libgdx',
  'BabylonJS',
  'pmndrs',
  'aframevr',
  'ssloy',
  'FreeCAD',
  'metafizzy',
  'lettier',
  'CesiumGS',
  'a1studmuffin',
  'timzhang642',
  'spritejs',
  'tensorspace-team',
  'jagenjo',
  'intel-isl',
  'AaronJackson',
  'YadiraF',
  'openscad',
  'domlysz',
  'ssloy',
  'mosra',
  'cleardusk',
  'gfxfundamentals',
  'jasonlong',
  'google',
  'blender',
  'antvis',
  'pissang',
  'tinyobjloader'],
 'repo_name': ['three.js',
  'libgdx',
  'Babylon.js',
  'react-three-fiber',
  'aframe',
  'tinyrenderer',
  'FreeCAD',
  'zdog',
  '3d-game-shaders-for-beginners',
  'cesium',
  'SpaceshipGenerator',
  '3D-Machine-Learning',
  'spritejs',
  'tensorspace',
  'webglstudio.js',
  'Open3D',
  'vrn',
  'PRNet',
  'openscad',
  'BlenderGIS',
  'tinyraytracer',
  'magnum',
  '3DDFA',
  'webgl-fundamentals',
  'isometric-contributions',
  'model-viewer',
  'blender',
  'L7',
  'claygl',
  'tinyobjloader'],
 'stars': [69700,
  18300,
  13800,
  12900,
  1260

In [70]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [71]:
topic_repos_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 69900 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 18300 | https://github.com/libgdx/libgdx |
| 2 | BabylonJS | Babylon.js | 13900 | https://github.com/BabylonJS/Babylon.js |
| 3 | pmndrs | react-three-fiber | 13000 | https://github.com/pmndrs/react-three-fiber |
| 4 | aframevr | aframe | 12700 | https://github.com/aframevr/aframe |
| 5 | ssloy | tinyrenderer | 10500 | https://github.com/ssloy/tinyrenderer |
| 6 | FreeCAD | FreeCAD | 9200 | https://github.com/FreeCAD/FreeCAD |
| 7 | metafizzy | zdog | 8400 | https://github.com/metafizzy/zdog |
| 8 | lettier | 3d-game-shaders-for-beginners | 8400 | https://github.com/lettier/3d-game-shaders-for... |
| 9 | CesiumGS | cesium | 6900 | https://github.com/CesiumGS/cesium |


In [72]:
import os
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h1_tag, star_tag):
    # Returns all the required info about a repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username,repo_name,stars,repo_url


def get_topic_repos(topic_doc):
    # Takes the parsed topic page returned by get_topic_page
    repo_tags = topic_doc.find_all('h1', {'class': 'f3 color-text-secondary text-normal lh-condensed'})
    star_tags = topic_doc.find_all('a', {'class': 'social-count float-none'})
    # Get repo info
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)


def scrape_topic(topic_url, path):
    # Skip topics whose CSV was already written on an earlier run
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)

To finish, we write a function that will:
- Get the list of topics from the topics page
- Get the list of top repos from each individual topic page
- Create a CSV of the top repos for each topic


In [73]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the downloaded page so the helpers below do not rely on a global doc
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)


In [74]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
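
The topic title is used directly as the file name. That works for the titles shown in this notebook, but a title containing a path separator would not produce a valid file name. One possible guard is sketched below. `topic_csv_path` is a hypothetical helper, and the set of characters it keeps is an assumption, not a rule taken from the project.

```python
import re


def topic_csv_path(title, folder='data'):
    """Build a CSV path from a topic title, replacing characters that are unsafe in file names."""
    # Keep letters, digits, dot, underscore, plus and hyphen; collapse everything else to '_'
    safe = re.sub(r'[^A-Za-z0-9._+-]+', '_', title.strip())
    return '{}/{}.csv'.format(folder, safe)


print(topic_csv_path('Awesome Lists'))
print(topic_csv_path('C++'))
```

With this helper, the call inside the loop would become `scrape_topic(row['url'], topic_csv_path(row['title']))`.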

In [75]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin