Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. Follow these steps to build a web scraping project from scratch using Python and its ecosystem of libraries:


1. Use the requests library to download a web page

In [6]:
import requests

In [7]:
url = "https://github.com/topics"
response = requests.get(url)

In [8]:
response.status_code

200

In [9]:
len(response.text)


206142

In [10]:
page_content = response.text

In [11]:
page_content[:1000]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  \n  >\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-7aa84bb7e11e.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-f65db3e8d171.css" /><link data-color-theme="dark_dimmed" cross

In [12]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_content)
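
The cells above confirm the status code by eye. A minimal sketch of the same download wrapped in a helper that raises on an error status (4xx or 5xx) instead; `download_page` and its `session` parameter are hypothetical names, and the helper only assumes an object with a requests-style `get` method, such as `requests.Session()`.

```python
def download_page(session, url, timeout=10):
    """Fetch url with a requests-style session and return the HTML text.

    Hypothetical helper: `session` is assumed to behave like
    requests.Session (a get() method that returns a response object).
    """
    response = session.get(url, timeout=timeout)
    # raise_for_status() raises an exception for 4xx and 5xx responses
    response.raise_for_status()
    return response.text


# Usage with the real library (requires network access):
#   import requests
#   page_content = download_page(requests.Session(), "https://github.com/topics")
```

Passing the session in as an argument keeps the helper easy to try out with a stand-in object when no network is available.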

2. Use Beautiful Soup to parse and extract information

In [13]:
from bs4 import BeautifulSoup

In [14]:
soup = BeautifulSoup(page_content, 'html.parser')

In [15]:
type(soup)


bs4.BeautifulSoup

In [16]:
class_name = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = soup.find_all('p', {'class': class_name})

# Ensure that topic_title_tags is not empty
if topic_title_tags:
    for tag in topic_title_tags:
        text = tag.text.strip()  # Strip the text from each tag
        print(text)  # Output the cleaned text
else:
    print("No tags found with that class!")

3D
Ajax
Algorithm
Amp
Android
Angular
Ansible
API
Arduino
ASP.NET
Awesome Lists
Amazon Web Services
Azure
Babel
Bash
Bitcoin
Bootstrap
Bot
C
Chrome
Chrome extension
Command-line interface
Clojure
Code quality
Code review
Compiler
Continuous integration
C++
Cryptocurrency
Crystal


In [17]:
disc_class = 'f5 color-fg-muted mb-0 mt-1'
topic_disc_tag = soup.find_all('p',{'class' : disc_class})


# Ensure that topic_disc_tag is not empty
if topic_disc_tag:
    for tag in topic_disc_tag:
        text = tag.text.strip()  # Strip the text from each tag
        print(text)  # Output the cleaned text

3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
Ajax is a technique for creating interactive web applications.
Algorithms are self-contained sequences that carry out a variety of tasks.
Amp is a non-blocking concurrency library for PHP.
Android is an operating system built by Google designed for mobile devices.
Angular is an open source web application platform.
Ansible is a simple and powerful automation engine.
An API (Application Programming Interface) is a collection of protocols and subroutines for building software.
Arduino is an open source platform for building electronic devices.
ASP.NET is a web framework for building modern web apps and services.
An awesome list is a list of awesome things curated by the community.
Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.
Azure is a cloud computing service created by Microsoft.
Babel is a compiler for writing next generation JavaScript, today.

In [18]:
topic_disc_tag

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Angular is an open source web application platform.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ansible is a simple and powerful automation engine.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           An API (App

In [19]:
topic_title_tag0 = topic_title_tags[0]

In [20]:
topic_title_tag0.parent

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>

In [21]:
url_class = 'no-underline flex-grow-0'
topic_link_tags = soup.find_all('a',{'class':url_class})

In [22]:
topic_link_tags[:5]

[<a class="no-underline flex-grow-0" href="/topics/3d">
 <div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
             #
           </div>
 </a>,
 <a class="no-underline flex-grow-0" href="/topics/ajax">
 <img alt="ajax" class="rounded mr-3" height="64" src="https://raw.githubusercontent.com/github/explore/8be26d91eb231fec0b8856359979ac09f27173fd/topics/ajax/ajax.png" width="64"/>
 </a>,
 <a class="no-underline flex-grow-0" href="/topics/algorithm">
 <div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
             #
           </div>
 </a>,
 <a class="no-underline flex-grow-0" href="/topics/amphp">
 <img alt="amphp" class="rounded mr-3" height="64" src="https://raw.githubusercontent.com/github/explore/99fe59c0f4fb5d6545311440b4ce89a0d82b0804/topics/amphp/amphp.png" width="64"/>
 </a>,
 <a class

In [23]:
topics_url = []
base_url = "https://github.com/"  # Each href already starts with "/", so the joined URLs contain "//"

for tag in topic_link_tags:
    topics_url.append(base_url + tag['href'])


In [24]:
topics_url

['https://github.com//topics/3d',
 'https://github.com//topics/ajax',
 'https://github.com//topics/algorithm',
 'https://github.com//topics/amphp',
 'https://github.com//topics/android',
 'https://github.com//topics/angular',
 'https://github.com//topics/ansible',
 'https://github.com//topics/api',
 'https://github.com//topics/arduino',
 'https://github.com//topics/aspnet',
 'https://github.com//topics/awesome',
 'https://github.com//topics/aws',
 'https://github.com//topics/azure',
 'https://github.com//topics/babel',
 'https://github.com//topics/bash',
 'https://github.com//topics/bitcoin',
 'https://github.com//topics/bootstrap',
 'https://github.com//topics/bot',
 'https://github.com//topics/c',
 'https://github.com//topics/chrome',
 'https://github.com//topics/chrome-extension',
 'https://github.com//topics/cli',
 'https://github.com//topics/clojure',
 'https://github.com//topics/code-quality',
 'https://github.com//topics/code-review',
 'https://github.com//topics/compiler',
 'ht
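
The URLs listed above contain a doubled slash because the base ends with `/` and every `href` starts with `/`. A minimal sketch of one way to avoid that, using `urljoin` from the standard library; the sample `hrefs` are copied from the link tags shown earlier, and `topic_urls` is a name used only for this illustration.

```python
from urllib.parse import urljoin

base_url = "https://github.com/"

# Sample hrefs copied from the link tags shown earlier
hrefs = ["/topics/3d", "/topics/ajax", "/topics/algorithm"]

# urljoin resolves each href against the base, so the slash is not doubled
topic_urls = [urljoin(base_url, href) for href in hrefs]
print(topic_urls)
```

`urljoin` gives the same result whether or not the base URL ends with a slash, which makes it less fragile than string concatenation.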

In [25]:
import pandas as pd

In [26]:
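topics_dict = {
    # Note: the title and description lists hold bs4 Tag objects rather than
    # strings, so the table below shows each cell's contents in brackets.
    "title": topic_title_tags,
    "description": topic_disc_tag,
    "url": topics_url,
}

In [27]:
topics_df = pd.DataFrame(topics_dict)

In [28]:
topics_df

Unnamed: 0,title,description,url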
0,[3D],[\n 3D refers to the use of three-dim...,https://github.com//topics/3d
1,[Ajax],[\n Ajax is a technique for creating ...,https://github.com//topics/ajax
2,[Algorithm],[\n Algorithms are self-contained seq...,https://github.com//topics/algorithm
3,[Amp],[\n Amp is a non-blocking concurrency...,https://github.com//topics/amphp
4,[Android],[\n Android is an operating system bu...,https://github.com//topics/android
5,[Angular],[\n Angular is an open source web app...,https://github.com//topics/angular
6,[Ansible],[\n Ansible is a simple and powerful ...,https://github.com//topics/ansible
7,[API],[\n An API (Application Programming I...,https://github.com//topics/api
8,[Arduino],[\n Arduino is an open source platfor...,https://github.com//topics/arduino
9,[ASP.NET],[\n ASP.NET is a web framework for bu...,https://github.com//topics/aspnet
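
The description cells in the table above begin with `\n` and leading spaces, which is the raw whitespace inside each `<p>` tag. A small sketch of normalizing such strings before they go into a table; `clean_text` is a hypothetical helper, and the sample strings are taken from the descriptions printed earlier.

```python
def clean_text(raw):
    """Collapse runs of whitespace and trim both ends of a string."""
    return " ".join(raw.split())


# Raw strings shaped like the text inside the description tags
raw_descriptions = [
    "\n          Ajax is a technique for creating interactive web applications.\n        ",
    "\n          Angular is an open source web application platform.\n        ",
]

descriptions = [clean_text(text) for text in raw_descriptions]
print(descriptions)
```

With Beautiful Soup tags, the same helper could be applied to `tag.text` so that the table columns hold plain strings.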


In [29]:
topics_df.to_csv('topics.csv' , index=False)

3. Extract information from a topic page


In [30]:
topic_pg_Url = topics_url[0]

In [31]:
topic_pg_Url

'https://github.com//topics/3d'

In [32]:
response = requests.get(topic_pg_Url)

In [33]:
topic_doc =  BeautifulSoup(response.text, 'html.parser')

In [34]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'  # Class of the <h3> repository headings
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

In [35]:
repo_tags


[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">mrdoob</a>          /
           <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/t

In [36]:
a_tags = repo_tags[0].find_all('a')


In [37]:
a_tags

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">mrdoob</a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">three.js</a>]

In [38]:
a_tags[0].text.strip()

'mrdoob'

In [39]:
a_tags[1].text.strip()

'three.js'

In [40]:
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com//mrdoob/three.js


In [41]:
star_selection_class = 'Counter js-social-count'
star_tag = topic_doc.find_all('span',{'class' : star_selection_class})


In [42]:
star_tag[0].text.strip()

'104k'

In [49]:
def parse_star_count(stars_str):
    if not stars_str:  # Handle empty or None values
        return 0
    stars_str = stars_str.strip().lower()  # Normalize case
    try:
        if stars_str[-1] == 'k':  # Convert 'k' notation
            return int(float(stars_str[:-1]) * 1000)
        return int(stars_str)  # Convert to integer
    except ValueError:
        return 0  # Return 0 if conversion fails


In [79]:
i = len(repo_tags) - 1  # Index of the repository card to inspect (here, the last one on the page)
parse_star_count(star_tag[i].text.strip())

8000
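
The conversion above truncates with `int()` and only knows the `k` suffix. A hedged sketch of a slightly more tolerant variant; `parse_star_count_v2` is a hypothetical name, and the `m` suffix and comma handling are assumptions, since only values such as `104k` appear in the output shown here.

```python
def parse_star_count_v2(stars_str):
    """Convert strings such as '104k' or '1,234' to an integer, or 0 on failure.

    The 'm' suffix and the comma handling are assumptions; only the 'k'
    suffix appears in the scraped output.
    """
    if not stars_str:  # Handle empty or None values
        return 0
    text = stars_str.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    try:
        if text and text[-1] in multipliers:
            # round() avoids losing a unit to floating-point truncation
            return round(float(text[:-1]) * multipliers[text[-1]])
        return int(text)
    except (ValueError, OverflowError):
        return 0  # Fall back to 0 when the text is not a number


for sample in ["104k", "8k", "1,234", "n/a"]:
    print(sample, parse_star_count_v2(sample))
```

Returning 0 on failure mirrors the original function; raising an exception instead would make unexpected formats easier to notice.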

In [51]:
type(star_tag)

bs4.element.ResultSet

In [77]:
def get_repo_info(h3_tag, star_tag):
    # Each repository heading holds two links: the owner and the repository
    a_tags = h3_tag.find_all('a')
    
    # Debugging a_tags
    print(f"a_tags: {a_tags}")
    if len(a_tags) < 2:
        print("Error: Expected at least 2 <a> tags, found", len(a_tags))
        return None

    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = a_tags[1].get('href', 'No URL Found')  # Avoid KeyError

    # Debugging star_tag
    print(f"star_tag: {star_tag}")
    
    # NOTE: the caller passes one Tag (star_tag[i]) rather than a list, so this
    # check is False and the else branch always runs, which is why the star
    # count in the output below is 0.
    if isinstance(star_tag, list) and len(star_tag) > 0:
        star_text = star_tag[i].text.strip()
        print(f"Extracted star count: {star_text}")
        star = parse_star_count(star_text)
    else:
        print("Error: star_tag is empty or invalid!")
        star = 0  # Default value if no stars are found
    
    return username, repo_name, repo_url, star


In [78]:
get_repo_info(repo_tags[i],star_tag[i])

a_tags: [<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":5639024,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c4bd43ca998833f4e79bc1825f1edab46bba9c459adaa89d6ea8e90b9b863030" data-turbo="false" data-view-component="true" href="/domlysz">domlysz</a>, <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":19577136,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="eedffc7c9a3f608ddfe6390f4a31c3ad533fb1ecff23689cf0d6950373aa7fec" data-turbo="false" data-view-component="true" href="/domlysz/BlenderGIS">BlenderGIS</a>]
star_tag: <span aria-label="8029 use

('domlysz', 'BlenderGIS', '/domlysz/BlenderGIS', 0)

In [74]:
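topic_repo_dict = {
    'username' : [],
    'repo_url' : [],
    'repo_name' : [],
    'star'     : [],
}


# NOTE: get_repo_info returns (username, repo_name, repo_url, star), so
# repo_info[1] is the name and repo_info[2] is the path. They are appended
# under each other's keys here, as the output below shows.
for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tag[i])
    topic_repo_dict['username'].append(repo_info[0])
    topic_repo_dict['repo_url'].append(repo_info[1])
    topic_repo_dict['repo_name'].append(repo_info[2])
    topic_repo_dict['star'].append(repo_info[3])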

a_tags: [<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">mrdoob</a>, <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">three.js</a>]
star_tag: <span aria-label="104453 users starre

In [72]:
topic_repo_dict

{'username': ['mrdoob',
  'pmndrs',
  'libgdx',
  'BabylonJS',
  'FreeCAD',
  'ssloy',
  'lettier',
  'aframevr',
  'blender',
  'CesiumGS',
  '4ian',
  'isl-org',
  'MonoGame',
  'mapbox',
  'metafizzy',
  'nerfstudio-project',
  'timzhang642',
  'cocos',
  'FyroxEngine',
  'domlysz'],
 'repo_url': ['three.js',
  'react-three-fiber',
  'libgdx',
  'Babylon.js',
  'FreeCAD',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'aframe',
  'blender',
  'cesium',
  'GDevelop',
  'Open3D',
  'MonoGame',
  'mapbox-gl-js',
  'zdog',
  'nerfstudio',
  '3D-Machine-Learning',
  'cocos-engine',
  'Fyrox',
  'BlenderGIS'],
 'repo_name': ['/mrdoob/three.js',
  '/pmndrs/react-three-fiber',
  '/libgdx/libgdx',
  '/BabylonJS/Babylon.js',
  '/FreeCAD/FreeCAD',
  '/ssloy/tinyrenderer',
  '/lettier/3d-game-shaders-for-beginners',
  '/aframevr/aframe',
  '/blender/blender',
  '/CesiumGS/cesium',
  '/4ian/GDevelop',
  '/isl-org/Open3D',
  '/MonoGame/MonoGame',
  '/mapbox/mapbox-gl-js',
  '/metafizzy/zd
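
In the dictionary above the names appear under `'repo_url'` and the paths under `'repo_name'`, a mix-up that is easy to make when a tuple is read by position. A sketch of one way to guard against it, using `typing.NamedTuple` so every field is addressed by name; `RepoInfo` and `rows_to_columns` are hypothetical names, and the two sample rows reuse values from the outputs above, with the star counts `104k` and `8000` written as integers.

```python
from typing import NamedTuple


class RepoInfo(NamedTuple):
    """One repository card; field names replace positional indexes."""
    username: str
    repo_name: str
    repo_url: str
    star: int


def rows_to_columns(rows):
    """Turn a list of RepoInfo rows into a dict of equal-length lists."""
    columns = {field: [] for field in RepoInfo._fields}
    for row in rows:
        for field in RepoInfo._fields:
            columns[field].append(getattr(row, field))
    return columns


rows = [
    RepoInfo("mrdoob", "three.js", "/mrdoob/three.js", 104000),
    RepoInfo("domlysz", "BlenderGIS", "/domlysz/BlenderGIS", 8000),
]
columns = rows_to_columns(rows)
print(columns["repo_name"])
print(columns["repo_url"])
```

The resulting dict of lists has the same shape as `topic_repo_dict`, so it can be handed to `pd.DataFrame` in the same way.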

In [94]:
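def get_topic_page(topic_url):
    # Download a single topic page and parse it
    response = requests.get(topic_url)

    # Stop here on failure instead of parsing an error page
    if response.status_code != 200:
        raise Exception(f"Failed to retrieve topic page {topic_url}: {response.status_code}")
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc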




def get_repo_info(h3_tag, star_tag):
    """Return (username, repo_name, repo_url, star) for one repository card."""
    a_tags = h3_tag.find_all('a')
    
    # Debugging a_tags
    print(f"a_tags: {a_tags}")
    if len(a_tags) < 2:
        print("Error: Expected at least 2 <a> tags, found", len(a_tags))
        return None

    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = a_tags[1].get('href', 'No URL Found')  # Avoid KeyError

    # Debugging star_tag
    print(f"star_tag: {star_tag}")
    
    # star_tag is the single <span> for this repository, so read its text directly
    if star_tag is not None:
        star_text = star_tag.text.strip()
        print(f"Extracted star count: {star_text}")
        star = parse_star_count(star_text)
    else:
        print("Error: star_tag is missing!")
        star = 0  # Default value if no stars are found
    
    return username, repo_name, repo_url, star


def get_topic_repo(topic_doc):
    # Repository headings are <h3> tags, each holding the owner and repo links
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

    star_selection_class = 'Counter js-social-count'
    star_tags = topic_doc.find_all('span', {'class': star_selection_class})

    topic_repo_dict = {
        'username': [],
        'repo_name': [],
        'repo_url': [],
        'star': [],
    }

    # Pair each heading with its star counter
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        repo_info = get_repo_info(repo_tag, star_tag)
        if repo_info is None:  # Skip cards without the two expected links
            continue
        username, repo_name, repo_url, star = repo_info
        topic_repo_dict['username'].append(username)
        topic_repo_dict['repo_name'].append(repo_name)
        topic_repo_dict['repo_url'].append(repo_url)
        topic_repo_dict['star'].append(star)

    return topic_repo_dict

In [107]:
get_topic_repo(get_topic_page(topics_url[4]))

a_tags: [<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":14101776,"originating_url":"https://github.com/topics/android","user_id":null}}' data-hydro-click-hmac="57b50c473d9a5d57c6672a2acd8bb64c660641c9b469b6b790d686e665d9c9a4" data-turbo="false" data-view-component="true" href="/flutter">flutter</a>, <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":31792824,"originating_url":"https://github.com/topics/android","user_id":null}}' data-hydro-click-hmac="92b9db70c29beb44f8125354236ea64618a41baf47aa8749b61a63531608e541" data-turbo="false" data-view-component="true" href="/flutter/flutter">flutter</a>]
star_tag: <span aria-label="168

{'username': ['flutter',
  'facebook',
  'Genymobile',
  'justjavac',
  'Hack-with-Github',
  'Solido',
  'tldr-pages',
  'wasabeef',
  'google',
  'laurent22',
  'appwrite',
  'square',
  'android',
  'square',
  'skylot',
  'dcloudio',
  'fastlane',
  'termux',
  '2dust',
  'PhilJay'],
 'repo_name': ['flutter',
  'react-native',
  'scrcpy',
  'free-programming-books-zh_CN',
  'Awesome-Hacking',
  'awesome-flutter',
  'tldr',
  'awesome-android-ui',
  'material-design-icons',
  'joplin',
  'appwrite',
  'okhttp',
  'architecture-samples',
  'retrofit',
  'jadx',
  'uni-app',
  'fastlane',
  'termux-app',
  'v2rayNG',
  'MPAndroidChart'],
 'repo_url': ['/flutter/flutter',
  '/facebook/react-native',
  '/Genymobile/scrcpy',
  '/justjavac/free-programming-books-zh_CN',
  '/Hack-with-Github/Awesome-Hacking',
  '/Solido/awesome-flutter',
  '/tldr-pages/tldr',
  '/wasabeef/awesome-android-ui',
  '/google/material-design-icons',
  '/laurent22/joplin',
  '/appwrite/appwrite',
  '/square/okhtt

In [108]:
topic_repo_df = pd.DataFrame(topic_repo_dict)

In [109]:
topic_repo_df

Unnamed: 0,username,repo_url,repo_name,star
0,mrdoob,three.js,/mrdoob/three.js,0
1,pmndrs,react-three-fiber,/pmndrs/react-three-fiber,0
2,libgdx,libgdx,/libgdx/libgdx,0
3,BabylonJS,Babylon.js,/BabylonJS/Babylon.js,0
4,FreeCAD,FreeCAD,/FreeCAD/FreeCAD,0
5,ssloy,tinyrenderer,/ssloy/tinyrenderer,0
6,lettier,3d-game-shaders-for-beginners,/lettier/3d-game-shaders-for-beginners,0
7,aframevr,aframe,/aframevr/aframe,0
8,blender,blender,/blender/blender,0
9,CesiumGS,cesium,/CesiumGS/cesium,0
