<a href="https://colab.research.google.com/github/KalihoseMigisha/python-web-scraping/blob/main/notebooks/Python_Web_Scraping_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1 Problem Statement
- Extracting relevant information from websites is essential for research, business intelligence, and decision-making, but collecting and organizing web data by hand is time-consuming and inefficient. This project develops a Python-based web scraping solution from scratch, specifically for static web pages, using BeautifulSoup. Selenium is better suited to pages rendered by JavaScript and Scrapy to large crawling jobs; for parsing and extracting data from static HTML content, BeautifulSoup together with Requests is sufficient. The project provides a structured approach to automated data extraction, processing, and storage while adhering to ethical and legal considerations.

# 2 Mount Google Drive (Connection)

In [None]:
# Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Here are the steps we'll follow:**

- We're going to scrape https://github.com/topics
- We'll get a list of topics.
- For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top repositories listed on the topic page (the original plan was 25; the page returned 20 at the time of this run)
- For each repository, we'll grab the **repo name, username, stars and repo URL**
- For each topic, we'll create a CSV file in the following format: **Repo Name,Username,Stars,Repo URL**
- Example:
  - three.js,mrdoob,69700,https://github.com/mrdoob/three.js
  - libgdx,libgdx,18300,https://github.com/libgdx/libgdx
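
The CSV layout described in the steps above can be sketched with the standard `csv` module. This is only an illustration of the target format: the helper name `rows_to_csv` and the constant `HEADER` are hypothetical, and the two rows are the example values from the list above.

```python
import csv
import io

# Column order taken from the format described in the steps above
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

def rows_to_csv(rows):
    """Serialize (repo_name, username, stars, repo_url) tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

# Example rows copied from the list above
example_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]

csv_text = rows_to_csv(example_rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of file paths; in the notebook the same rows would go to a file on the mounted drive.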

**website link**
  - https://github.com/topics

**Beautiful Soup and Requests**

In [None]:
# Importing BeautifulSoup from the bs4 module to parse HTML and extract data
from bs4 import BeautifulSoup

# Importing the requests library to send HTTP requests and retrieve web page content
import requests

In [None]:
# URL of the website to be scraped
url = 'https://github.com/topics'

In [None]:
# Send a GET request to retrieve the content of the specified URL
requests.get(url)

<Response [200]>

# 3 HTTP Responses
- 200 OK: The request was successful, and the server returned the requested page.
- 301 Moved Permanently: The resource has been permanently moved to a new URL.
- 403 Forbidden: The server refuses to authorize the request.
- 404 Not Found: The requested page could not be found on the server.
- 500 Internal Server Error: The server encountered an unexpected error.
- 502 Bad Gateway: The server was acting as a gateway and received an invalid response.
- 503 Service Unavailable: The server is temporarily unavailable (e.g., due to maintenance).
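
The status codes listed above are also available from Python's standard library, which can be handy when logging scraper results. The following is a minimal sketch; the helper names `describe_status` and `is_success` are hypothetical and not part of the notebook.

```python
from http import HTTPStatus

def describe_status(code):
    """Return 'code phrase' for a known HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define
        return f"{code} (unrecognized status code)"
    return f"{status.value} {status.phrase}"

def is_success(code):
    """Treat any 2xx code as a successful response."""
    return 200 <= code < 300

for code in (200, 301, 403, 404, 500, 502, 503):
    outcome = "success" if is_success(code) else "not success"
    print(describe_status(code), "->", outcome)
```

In the notebook, the value to pass in would be `page.status_code` from the `requests.get(url)` call.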

In [None]:
# fetches and stores the page content (and other response data) from URL
page = requests.get(url)

In [None]:
# Parsing the HTML content of the fetched page using BeautifulSoup
# 'html.parser' is a built-in parser in Python that helps parse and navigate HTML content
soup = BeautifulSoup(page.text, 'html.parser')

In [None]:
# Printing the parsed HTML content of the webpage
# This prints raw HTML (not very readable)
print(soup)


<!DOCTYPE html>

<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-74231a1f3bbb.css" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-8a995f0bacd4.css" media="all" rel="stylesheet"><link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com/

In [None]:
# Formats and prints the parsed HTML in a structured, indented format for better readability
print(soup.prettify())

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-74231a1f3bbb.css" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-8a995f0bacd4.css" media="all" rel="stylesheet">
    <link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://

## 3.1 Extracting the topics being discussed
  - **Key Note:**
     - The class name should be correctly defined (f3 lh-condensed mb-0 mt-1 Link--primary)
     - To find the class name you need to inspect (F12) any topic in the website (https://github.com/topics)
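
Why the class name has to be "correctly defined": when a multi-class string is passed to `find_all`, BeautifulSoup compares it against the exact value of the `class` attribute, so the order of the classes matters. A CSS selector via `select` matches each class independently. The sketch below uses two made-up `<p>` tags (the second one with reordered classes is an assumption for illustration, not something seen on the GitHub page).

```python
from bs4 import BeautifulSoup

# Two made-up tags: the second carries the same classes in a different order
SAMPLE_HTML = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
"""

soup_demo = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact string match on the class attribute: order-sensitive
exact = soup_demo.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# CSS selector: matches any <p> that has both classes, in any order
by_css = soup_demo.select("p.f3.Link--primary")

exact_titles = [p.get_text(strip=True) for p in exact]
css_titles = [p.get_text(strip=True) for p in by_css]

print(exact_titles)
print(css_titles)
```

If GitHub reorders or adds utility classes, the exact-string approach returns an empty list, so checking `len(topics)` after each extraction (as done below) is a useful safeguard.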

In [None]:
# Ensure the class name is correctly defined
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

# Extract all <p> elements with the specified class
topics = soup.find_all('p', {'class': selection_class})

# Retrieve only the text content of the topics
topic_titles = [topic.get_text(strip=True) for topic in topics]

# Print or use the extracted topics
print(topic_titles)


['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command-line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'C++', 'Cryptocurrency', 'Crystal']


In [None]:
# list of topics being discussed
topic_titles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command-line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'C++',
 'Cryptocurrency',
 'Crystal']

In [None]:
len(topic_titles)

30

In [None]:
# Get the parent of the first topic title tag (the enclosing <a> element)
topics[0].parent

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
        </p>
</a>

In [None]:
# total number of topics
len(topics)

30

## 3.2 Extracting the topic descriptions
  - class: (f5 color-fg-muted mb-0 mt-1)

In [None]:
# Define the correct class for topic descriptions
description_class = 'f5 color-fg-muted mb-0 mt-1'

# Find all <p> elements that match the topic description class
topic_descriptions = soup.find_all('p', {'class': description_class})

# Extract and clean the text content
descriptions = [desc.get_text(strip=True) for desc in topic_descriptions]

# Print or use the extracted topic descriptions
print(descriptions)


['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud computing service created by Microsoft.', 'Babel is a compiler for w

In [None]:
descriptions

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source platform for building electronic devices.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azure is a cloud computing service created by Microsoft.',
 'Babel is a c

In [None]:
# total number of descriptions
len(descriptions)

30

In [None]:
# Define the correct class for topic titles
topic_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

# Find all <p> elements that match the topic title class
topics = soup.find_all('p', {'class': topic_class})

# Extract topic text and associated URLs (if available)
# Note: the <a> tag wraps the <p> (see topics[0].parent above),
# so the link is found by looking upward, not inside the <p>
topic_data = []
for topic in topics:
    link_tag = topic.find_parent('a')  # Nearest enclosing <a> tag
    if link_tag is not None and 'href' in link_tag.attrs:
        href = link_tag['href']  # Relative URL, e.g. /topics/3d
        title = topic.get_text(strip=True)  # Extract the topic text
        topic_data.append((title, href))

# Print extracted topics with their relative URLs
for title, href in topic_data:
    print(f"Topic: {title}, URL: {href}")


## 3.3 Extracting the topic links
  - class: (no-underline flex-1 d-flex flex-column)

In [None]:
# Define the class for the topic link
topic_link_class = 'no-underline flex-1 d-flex flex-column'

# Find all <a> elements with the specified class
topic_links = soup.find_all('a', class_=topic_link_class)

# Extract URLs into a list
urls = [link['href'] for link in topic_links if 'href' in link.attrs]

# Print the list of URLs
print(urls)


['/topics/3d', '/topics/ajax', '/topics/algorithm', '/topics/amphp', '/topics/android', '/topics/angular', '/topics/ansible', '/topics/api', '/topics/arduino', '/topics/aspnet', '/topics/awesome', '/topics/aws', '/topics/azure', '/topics/babel', '/topics/bash', '/topics/bitcoin', '/topics/bootstrap', '/topics/bot', '/topics/c', '/topics/chrome', '/topics/chrome-extension', '/topics/cli', '/topics/clojure', '/topics/code-quality', '/topics/code-review', '/topics/compiler', '/topics/continuous-integration', '/topics/cpp', '/topics/cryptocurrency', '/topics/crystal']


In [None]:
# list of extracted urls without base url('https://github.com')
urls

['/topics/3d',
 '/topics/ajax',
 '/topics/algorithm',
 '/topics/amphp',
 '/topics/android',
 '/topics/angular',
 '/topics/ansible',
 '/topics/api',
 '/topics/arduino',
 '/topics/aspnet',
 '/topics/awesome',
 '/topics/aws',
 '/topics/azure',
 '/topics/babel',
 '/topics/bash',
 '/topics/bitcoin',
 '/topics/bootstrap',
 '/topics/bot',
 '/topics/c',
 '/topics/chrome',
 '/topics/chrome-extension',
 '/topics/cli',
 '/topics/clojure',
 '/topics/code-quality',
 '/topics/code-review',
 '/topics/compiler',
 '/topics/continuous-integration',
 '/topics/cpp',
 '/topics/cryptocurrency',
 '/topics/crystal']

In [None]:
# Construct the full URL for the first topic by appending the relative path to GitHub's base URL
topics0_url = "https://github.com" + topic_links[0]['href']
# Print the complete URL of the first topic
print(topics0_url)

https://github.com/topics/3d
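
As an alternative to string concatenation, the standard library's `urljoin` resolves a relative `href` against a base URL and leaves absolute URLs untouched. A minimal sketch, where `BASE_URL` and `absolute_url` are illustrative names:

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the base URL."""
    return urljoin(base, href)

# Relative path, as extracted from the topic links
print(absolute_url("/topics/3d"))

# Already absolute: returned unchanged
print(absolute_url("https://github.com/topics/ajax"))
```

This removes the need for a manual `startswith('/')` check when building full URLs.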


In [None]:
# Define the base URL
base_url = 'https://github.com'

# Define the class for the topic link
topic_link_class = 'no-underline flex-1 d-flex flex-column'

# Find all <a> elements with the specified class
topic_links = soup.find_all('a', class_=topic_link_class)

# Extract URLs and prepend base_url if necessary
urls = [base_url + link['href'] if link['href'].startswith('/') else link['href']
        for link in topic_links if 'href' in link.attrs]

# Print the list of full URLs
print(urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/topics/continuous-integration', 'ht

In [None]:
# list of extracted urls with base url('https://github.com')
urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compiler',
 'https://github.com/topics/co

In [None]:
print("Number of topic titles:", len(topic_titles))
print("Number of descriptions:", len(descriptions))
print("Number of URLs:", len(urls))

Number of topic titles: 30
Number of descriptions: 30
Number of URLs: 30


**Key Note**
  - To create a Pandas DataFrame from these lists, all of them must have the same length.
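
Equal lengths alone do not guarantee that each title lines up with its own description and URL. One way to keep them aligned is to walk each topic card (the `<a>` element) and read the title and description from inside it. The sketch below runs on a small inline sample: the first card is copied from the `topics[0].parent` output above, while the second card without a description is a made-up case. The function name `extract_topic_rows` is hypothetical.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
  3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
</p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</a>
"""

def extract_topic_rows(html, base_url="https://github.com"):
    """Return one dict per topic card so fields cannot drift out of alignment."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("a", class_="no-underline flex-1 d-flex flex-column"):
        title_tag = card.find("p", class_="Link--primary")
        desc_tag = card.find("p", class_="color-fg-muted")
        rows.append({
            "Topic Title": title_tag.get_text(strip=True) if title_tag else None,
            # A missing description becomes None instead of shifting later rows
            "Description": desc_tag.get_text(strip=True) if desc_tag else None,
            "URL": base_url + card.get("href", ""),
        })
    return rows

rows = extract_topic_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

A list of dicts like this can be passed directly to `pd.DataFrame(rows)`.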

# 4 Creating a Pandas DataFrame

In [None]:
import pandas as pd

In [None]:
# Create a dictionary to organize topic data
topics_dict = {
    # The key 'Topic Title' maps to the list of topic titles
    "Topic Title": topic_titles,

    # The key 'Description' maps to the list of descriptions corresponding to each topic
    "Description": descriptions,

    # The key 'URL' maps to the list of URLs for each topic
    "URL": urls
}

In [None]:
# Create a DataFrame from the topics_dict dictionary
topics_df = pd.DataFrame(topics_dict)

In [None]:
topics_df

Unnamed: 0,Topic Title,Description,URL
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [None]:
# Create a CSV file from the topics_df DataFrame
topics_df.to_csv('topics.csv', index=False)

# 5 Getting Information Out of a Single Page
- There are 30 topics in total.
- Our aim is to work with the first topic page and scrape its contents.
- **The first topic is '3D'**

In [None]:
# Retrieve the URL of the first topic from the 'urls' list
topics_page_url = urls[0]

# Display the URL of the first topic
topics_page_url


'https://github.com/topics/3d'

Open: https://github.com/topics/3d
- Key fields we're interested in:
  - username
  - repository
  - repository link
  - stars

In [None]:
# Send an HTTP GET request to the URL stored in 'topics_page_url'
response = requests.get(topics_page_url)

# Display the HTTP status code of the response to check if the request was successful
response.status_code

200

In [None]:
# Get the length of the response text (HTML content) from the GET request
len(response.text)

515357

In [None]:
# Parse the HTML content of the page using BeautifulSoup
topic_doc = BeautifulSoup(response.text, 'html.parser')

**Key Comment**
- **topic_doc** now holds the structured version of the webpage, allowing you to navigate and extract specific elements using BeautifulSoup methods (like .find(), .find_all(), .select(), etc.).

## 5.1 Extracting the Repo tags
   - Repo tags = Repository name + Username
   - Use **topic_doc (which contains the parsed HTML)** to extract **repo_tags** by selecting the correct class name.

In [None]:
# Define the class for the repository tag
repo_tag_class = 'f3 color-fg-muted text-normal lh-condensed'

# Find all <h3> elements with the specified class
repo_titles = topic_doc.find_all('h3', class_=repo_tag_class)

# Extract the text from the repository tags into a list
repo_tags = [repo_tag.get_text(strip=True) for repo_tag in repo_titles]

# Print the list of repository titles
print(repo_tags)

['mrdoob/three.js', 'pmndrs/react-three-fiber', 'libgdx/libgdx', 'BabylonJS/Babylon.js', 'FreeCAD/FreeCAD', 'ssloy/tinyrenderer', 'lettier/3d-game-shaders-for-beginners', 'aframevr/aframe', 'blender/blender', 'CesiumGS/cesium', '4ian/GDevelop', 'isl-org/Open3D', 'MonoGame/MonoGame', 'mapbox/mapbox-gl-js', 'metafizzy/zdog', 'nerfstudio-project/nerfstudio', 'timzhang642/3D-Machine-Learning', 'DavidHDev/react-bits', 'microsoft/TRELLIS', 'cocos/cocos-engine']


In [None]:
# Print the number of extracted repository tags
print(len(repo_tags))

20


In [None]:
# the first repo tag
repo_tags[0]

'mrdoob/three.js'

## 5.2 Extracting usernames from repo tags
  - repo tags (username + repository name)

In [None]:
# Define the class for the repository tag
repo_tag_class = 'f3 color-fg-muted text-normal lh-condensed'

# Find all <h3> elements with the specified class
repo_tags = topic_doc.find_all('h3', class_=repo_tag_class)

# Extract the username from the repository tags into a list
usernames = [repo_tag.get_text(strip=True).split('/')[0] for repo_tag in repo_tags if '/' in repo_tag.get_text(strip=True)]

# Print the list of usernames
print(usernames)

['mrdoob', 'pmndrs', 'libgdx', 'BabylonJS', 'FreeCAD', 'ssloy', 'lettier', 'aframevr', 'blender', 'CesiumGS', '4ian', 'isl-org', 'MonoGame', 'mapbox', 'metafizzy', 'nerfstudio-project', 'timzhang642', 'DavidHDev', 'microsoft', 'cocos']


In [None]:
# the first username
usernames[0]

'mrdoob'

## 5.3 Extracting repository names from repo tags
  - repo tags (username + repository name)

In [None]:
# Define the class for the repository tag
repo_tag_class = 'f3 color-fg-muted text-normal lh-condensed'

# Find all <h3> elements with the specified class
repo_tags = topic_doc.find_all('h3', class_=repo_tag_class)

# Extract the repository name from the repository tags into a list
repo_names = [repo_tag.get_text(strip=True).split('/')[1] for repo_tag in repo_tags if '/' in repo_tag.get_text(strip=True)]

# Print the list of repository names
print(repo_names)

['three.js', 'react-three-fiber', 'libgdx', 'Babylon.js', 'FreeCAD', 'tinyrenderer', '3d-game-shaders-for-beginners', 'aframe', 'blender', 'cesium', 'GDevelop', 'Open3D', 'MonoGame', 'mapbox-gl-js', 'zdog', 'nerfstudio', '3D-Machine-Learning', 'react-bits', 'TRELLIS', 'cocos-engine']


In [None]:
# the first repository name
repo_names[0]

'three.js'
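
Sections 5.2 and 5.3 split the same `username/repository` text twice. A small sketch of doing it once with `str.partition`, which also makes malformed tags easy to skip. The helper `split_repo_tag` is a hypothetical name, and the third input string is a made-up malformed value.

```python
def split_repo_tag(tag_text):
    """Split 'username/repository' into a (username, repo_name) tuple.

    Returns None when the text does not contain both parts.
    """
    owner, sep, repo = tag_text.strip().partition("/")
    if not sep or not owner.strip() or not repo.strip():
        return None
    return owner.strip(), repo.strip()

tags = ["mrdoob/three.js", "pmndrs/react-three-fiber", "not-a-repo-tag"]

# Keep only tags that split cleanly into two parts
pairs = [pair for pair in map(split_repo_tag, tags) if pair is not None]
print(pairs)
```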

## 5.4 Extracting a repo url by concatenating base url, username, and repository name

In [None]:
base_url = 'https://github.com'

In [None]:
# Define the base URL
base_url = 'https://github.com'

# Extract the usernames and repository names into separate lists
username = [repo_tag.get_text(strip=True).split('/')[0] for repo_tag in repo_tags if '/' in repo_tag.get_text(strip=True)]
repo_name = [repo_tag.get_text(strip=True).split('/')[1] for repo_tag in repo_tags if '/' in repo_tag.get_text(strip=True)]

# Loop through both lists and concatenate the base URL, username, and repository name
full_urls = [f"{base_url}/{u}/{r}" for u, r in zip(username, repo_name)]

# Print the list of full URLs
for repo_url in full_urls:
    print(repo_url)


https://github.com/mrdoob/three.js
https://github.com/pmndrs/react-three-fiber
https://github.com/libgdx/libgdx
https://github.com/BabylonJS/Babylon.js
https://github.com/FreeCAD/FreeCAD
https://github.com/ssloy/tinyrenderer
https://github.com/lettier/3d-game-shaders-for-beginners
https://github.com/aframevr/aframe
https://github.com/blender/blender
https://github.com/CesiumGS/cesium
https://github.com/4ian/GDevelop
https://github.com/isl-org/Open3D
https://github.com/MonoGame/MonoGame
https://github.com/mapbox/mapbox-gl-js
https://github.com/metafizzy/zdog
https://github.com/nerfstudio-project/nerfstudio
https://github.com/timzhang642/3D-Machine-Learning
https://github.com/DavidHDev/react-bits
https://github.com/microsoft/TRELLIS
https://github.com/cocos/cocos-engine


## 5.5 Extracting the Star Counts

In [None]:
# Define the class for the star count
star_class = 'Counter js-social-count'

# Find all elements with the specified class
star_tags = topic_doc.find_all(class_=star_class)

# Extract the star count into a list
star_counts = [star.get_text(strip=True) for star in star_tags]

# Print the list of star counts
print(star_counts)


['105k', '28.5k', '23.9k', '23.8k', '23.7k', '21.6k', '18.5k', '17k', '14.7k', '13.5k', '13.5k', '12.1k', '11.9k', '11.5k', '10.5k', '10k', '9.9k', '9.9k', '8.7k', '8.5k']


In [None]:
# Length of the star counts (number of repositories with star counts)
len(star_counts)

20

In [None]:
# Print the first star count from the list
print(star_counts[0])

105k


- Parse and clean the star count using a helper function below

In [None]:
# Parse and clean the star count using a helper function

def parse_star_count(value):
    # Dictionary to map suffixes to corresponding multipliers
    suffixes = {'k': 1000, 'm': 1000000, 'b': 1000000000}

    # Normalize the text: trim whitespace and drop thousands separators
    value = value.strip().replace(',', '')

    # Extract the suffix (last character)
    suffix = value[-1].lower()

    # If the suffix is known, scale the numeric part; round() guards
    # against float artifacts before converting to int
    if suffix in suffixes:
        return int(round(float(value[:-1]) * suffixes[suffix]))

    # No suffix: the whole string is the number (e.g. '5' or '987')
    return int(value)

# Example usage:
value = '105k'
numeric_value = parse_star_count(value)
print(numeric_value)


105000
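
An alternative sketch of the same conversion that avoids binary floating point altogether by using `decimal.Decimal`. The name `stars_to_int` is hypothetical; the first five sample strings come from the `star_counts` output above, while `'987'` and `'1,234'` are made-up inputs to show the no-suffix path.

```python
from decimal import Decimal

MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}

def stars_to_int(text):
    """Convert strings such as '28.5k' or '987' to an integer star count."""
    cleaned = text.strip().lower().replace(",", "")
    if cleaned and cleaned[-1] in MULTIPLIERS:
        # Decimal keeps '28.5' exact, so the product is exactly 28500
        return int(Decimal(cleaned[:-1]) * MULTIPLIERS[cleaned[-1]])
    return int(cleaned)

sample = ["105k", "28.5k", "23.9k", "17k", "9.9k", "987", "1,234"]
parsed = [stars_to_int(s) for s in sample]
print(parsed)
```

Note that the abbreviated values on the page are already rounded by GitHub, so the parsed numbers are approximations of the true star counts.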


## 5.6 Extracting repository information

In [None]:
# Function to extract repository information (username, repository name, star count, and repository URL)
def get_repo_info(h3_tag, star_tag):
    # Find all <a> tags within the <h3> tag
    a_tags = h3_tag.find_all('a')

    # Extract and clean the username (first <a> tag)
    username = a_tags[0].text.strip()

    # Extract and clean the repository name (second <a> tag)
    repo_name = a_tags[1].text.strip()

    # Construct the repository URL by concatenating the base URL and the href from the second <a> tag
    repo_url = base_url + a_tags[1]['href']

    # Parse and clean the star count using a helper function (assuming parse_star_count exists)
    stars = parse_star_count(star_tag.text.strip())

    # Return the extracted information as a tuple: (username, repo_name, stars, repo_url)
    return username, repo_name, stars, repo_url

In [None]:
# Call the function get_repo_info to extract repository details from the first repo tag and star tag
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 105000, 'https://github.com/mrdoob/three.js')

In [None]:
# Extracting information for all repositories
# List to store extracted repository information
repo_info_list = []

# Loop through each repo_tag and star_tag to extract information
for repo_tag, star_tag in zip(repo_tags, star_tags):
    # Extract username, repo name, stars, and repo URL for each repository
    repo_info = get_repo_info(repo_tag, star_tag)

    # Append the extracted information to the list
    repo_info_list.append(repo_info)

In [None]:
repo_info_list

[('mrdoob', 'three.js', 105000, 'https://github.com/mrdoob/three.js'),
 ('pmndrs',
  'react-three-fiber',
  28500,
  'https://github.com/pmndrs/react-three-fiber'),
 ('libgdx', 'libgdx', 23900, 'https://github.com/libgdx/libgdx'),
 ('BabylonJS', 'Babylon.js', 23800, 'https://github.com/BabylonJS/Babylon.js'),
 ('FreeCAD', 'FreeCAD', 23700, 'https://github.com/FreeCAD/FreeCAD'),
 ('ssloy', 'tinyrenderer', 21600, 'https://github.com/ssloy/tinyrenderer'),
 ('lettier',
  '3d-game-shaders-for-beginners',
  18500,
  'https://github.com/lettier/3d-game-shaders-for-beginners'),
 ('aframevr', 'aframe', 17000, 'https://github.com/aframevr/aframe'),
 ('blender', 'blender', 14700, 'https://github.com/blender/blender'),
 ('CesiumGS', 'cesium', 13500, 'https://github.com/CesiumGS/cesium'),
 ('4ian', 'GDevelop', 13500, 'https://github.com/4ian/GDevelop'),
 ('isl-org', 'Open3D', 12100, 'https://github.com/isl-org/Open3D'),
 ('MonoGame', 'MonoGame', 11900, 'https://github.com/MonoGame/MonoGame'),
 ('ma

In [None]:
len(repo_info_list)

20

# 6 Results and Output

In [None]:
# Create a dictionary to organize repository data
repo_info_dict = {
    # The key 'Username' maps to the list of usernames for each repository
    "Username": [repo[0] for repo in repo_info_list],

    # The key 'Repository Name' maps to the list of repository names for each repository
    "Repository Name": [repo[1] for repo in repo_info_list],

    # The key 'Stars' maps to the list of stars for each repository
    "Stars": [repo[2] for repo in repo_info_list],

    # The key 'Repository URL' maps to the list of URLs for each repository
    "Repository URL": [repo[3] for repo in repo_info_list]
}

# Create a DataFrame from the repo_info_dict dictionary
repo_info_df = pd.DataFrame(repo_info_dict)

In [None]:
repo_info_df

Unnamed: 0,Username,Repository Name,Stars,Repository URL
0,mrdoob,three.js,105000,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,28500,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,23900,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,23800,https://github.com/BabylonJS/Babylon.js
4,FreeCAD,FreeCAD,23700,https://github.com/FreeCAD/FreeCAD
5,ssloy,tinyrenderer,21600,https://github.com/ssloy/tinyrenderer
6,lettier,3d-game-shaders-for-beginners,18500,https://github.com/lettier/3d-game-shaders-for...
7,aframevr,aframe,17000,https://github.com/aframevr/aframe
8,blender,blender,14700,https://github.com/blender/blender
9,CesiumGS,cesium,13500,https://github.com/CesiumGS/cesium


**IMPORTANT COMMENT**
  - The repository information above covers a single topic only, **the (3D) topic**, but we have 30 different topics in total.
  - So we need to repeat what we did for the 3D topic for all the other topics.
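
Since the plan is one CSV file per topic, each topic needs a file name. One option, shown here as an assumption rather than the notebook's chosen approach, is to reuse the slug at the end of the topic URL, which avoids characters such as `+` or spaces that appear in some topic titles. The helper `csv_name_for_topic` is a hypothetical name.

```python
from urllib.parse import urlparse

def csv_name_for_topic(topic_url):
    """Derive a CSV file name from the last path segment of a topic URL."""
    slug = urlparse(topic_url).path.rstrip("/").rsplit("/", 1)[-1]
    return f"{slug}.csv"

# A few topic URLs taken from the list extracted earlier
topic_urls = [
    "https://github.com/topics/3d",
    "https://github.com/topics/chrome-extension",
    "https://github.com/topics/cpp",
]

for topic_url in topic_urls:
    print(topic_url, "->", csv_name_for_topic(topic_url))
```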

In [None]:
def get_topic_page(topic_url):
    # Send an HTTP GET request to the topic URL
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception(f"Failed to load page {topic_url}")
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

# Function to extract repository information (username, repository name, star count, and repository URL)
def get_repo_info(h3_tag, star_tag):
    # Find all <a> tags within the <h3> tag
    a_tags = h3_tag.find_all('a')

    # Extract and clean the username (first <a> tag)
    username = a_tags[0].text.strip()

    # Extract and clean the repository name (second <a> tag)
    repo_name = a_tags[1].text.strip()

    # Construct the repository URL by concatenating the base URL and the href from the second <a> tag
    repo_url = base_url + a_tags[1]['href']

    # Parse and clean the star count using a helper function (assuming parse_star_count exists)
    stars = parse_star_count(star_tag.text.strip())

    # Return the extracted information as a tuple: (username, repo_name, stars, repo_url)
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Find all <h3> repository tags with the specified class
    repo_tags = topic_doc.find_all('h3', class_=repo_tag_class)
    # Find all star count elements with the specified class
    star_tags = topic_doc.find_all(class_=star_class)

    topics_repos_dict = {
        "Username": [],
        "Repository Name": [],
        "Stars": [],
        "Repository URL": []
    }

    # Get repo info; zip stops at the shorter list if the counts ever differ
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        repo_info = get_repo_info(repo_tag, star_tag)
        topics_repos_dict["Username"].append(repo_info[0])
        topics_repos_dict["Repository Name"].append(repo_info[1])
        topics_repos_dict["Stars"].append(repo_info[2])
        topics_repos_dict["Repository URL"].append(repo_info[3])

    return pd.DataFrame(topics_repos_dict)
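
When looping over 30 topic pages, a transient 5xx response would make `get_topic_page` raise on the first failure. The sketch below shows one possible retry wrapper. It is an assumption, not part of the notebook: `fetch_with_retry` is a hypothetical helper, and `fetch` is any callable that returns a `(status_code, text)` pair, which keeps the example runnable without network access.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the retries run out.

    fetch must return a (status_code, text) tuple.
    """
    last_status = None
    for attempt in range(1, retries + 1):
        status, text = fetch(url)
        if status == 200:
            return text
        last_status = status
        # Retry only on transient server-side errors; give up on anything else
        if status not in (500, 502, 503) or attempt == retries:
            break
        # Wait a little longer after each failed attempt
        sleep(delay * attempt)
    raise RuntimeError(f"Failed to load page {url} (last status: {last_status})")

# Fake fetcher for demonstration: fails twice with 503, then succeeds
calls = []

def fake_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        return 503, ""
    return 200, "<html>ok</html>"

html = fetch_with_retry(fake_fetch, "https://github.com/topics/3d", sleep=lambda seconds: None)
print(len(calls), html)
```

With Requests, `fetch` could wrap `requests.get(url, timeout=10)` and return `(response.status_code, response.text)`; the returned text would then be passed to `BeautifulSoup` as in `get_topic_page`.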

In [None]:
# Function to fetch and parse the topic webpage
def get_topic_page(topic_url):
    """
    Sends an HTTP GET request to the topic URL and parses the page using BeautifulSoup.

    Args:
        topic_url (str): The URL of the GitHub topic page.

    Returns:
        BeautifulSoup: Parsed HTML document.
    """
    response = requests.get(topic_url)

    # Check if the request was successful (status code 200)
    if response.status_code != 200:
        raise Exception(f"Failed to load page {topic_url}")

    # Parse the HTML content
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

# Function to extract repository information (username, repository name, star count, and repository URL)
def get_repo_info(h3_tag, star_tag):
    """
    Extracts repository details from the given HTML tags.

    Args:
        h3_tag (Tag): The <h3> tag containing repository details.
        star_tag (Tag): The tag containing the star count.

    Returns:
        tuple: (username, repo_name, stars, repo_url)
    """
    # Find all <a> tags within the <h3> tag (username and repository name are inside <a> tags)
    a_tags = h3_tag.find_all('a')

    # Extract the username from the first <a> tag
    username = a_tags[0].text.strip()

    # Extract the repository name from the second <a> tag
    repo_name = a_tags[1].text.strip()

    # Construct the full repository URL
    repo_url = base_url + a_tags[1]['href']

    # Parse and clean the star count (assuming parse_star_count is defined)
    stars = parse_star_count(star_tag.text.strip())

    return username, repo_name, stars, repo_url

# Function to extract repository data from a topic page
def get_topic_repos(topic_doc):
    """
    Extracts repository details from a given topic page.

    Args:
        topic_doc (BeautifulSoup): Parsed HTML document of a GitHub topic page.

    Returns:
        DataFrame: A pandas DataFrame containing repository information.
    """
    # Find all repository tags based on the given CSS class
    repo_tags = topic_doc.find_all('h3', class_=repo_tag_class)

    # Find all star count tags based on the given CSS class
    star_tags = topic_doc.find_all(class_=star_class)

    # Dictionary to store extracted repository data
    topics_repos_dict = {
        "Username": [],
        "Repository Name": [],
        "Stars": [],
        "Repository URL": []
    }

    # Loop through each repository found on the page
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topics_repos_dict["Username"].append(repo_info[0])       # Store username
        topics_repos_dict["Repository Name"].append(repo_info[1])  # Store repository name
        topics_repos_dict["Stars"].append(repo_info[2])          # Store star count
        topics_repos_dict["Repository URL"].append(repo_info[3]) # Store repository URL

    # Convert the dictionary into a Pandas DataFrame
    return pd.DataFrame(topics_repos_dict)
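
Both versions of `get_repo_info` rely on a `parse_star_count` helper that is assumed to be defined elsewhere in the notebook. For readers who do not have it at hand, here is a minimal sketch of what such a helper might look like. The name `parse_star_count_sketch` and the handling of a `k` suffix and thousands separators are assumptions for illustration, not the notebook's own definition.

```python
def parse_star_count_sketch(stars_str):
    """Convert a star label such as '69.7k' into an integer (hypothetical helper)."""
    # Drop surrounding whitespace and thousands separators
    text = stars_str.strip().replace(',', '')
    if text.lower().endswith('k'):
        # '69.7k' -> 69.7 * 1000; round() guards against floating-point error
        return int(round(float(text[:-1]) * 1000))
    return int(text)


print(parse_star_count_sketch('69.7k'))  # 69700
print(parse_star_count_sketch('105k'))   # 105000
print(parse_star_count_sketch('1,234'))  # 1234
```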


In [None]:
# a link to each topic, we have 30 topics
urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compiler',
 'https://github.com/topics/co

In [None]:
# Fetch the fifth link (index 4) from our list of URLs
url4 = urls[4]

In [None]:
# printing the fifth link
url4

'https://github.com/topics/android'

In [None]:
# Fetch and parse the HTML content of the topic page at url4
topic4_doc = get_topic_page(url4)

In [None]:
# Extract repository information from the parsed topic page
topic4_repos = get_topic_repos(topic4_doc)

In [None]:
# DataFrame containing repository details for topic4
topic4_repos

Unnamed: 0,Username,Repository Name,Stars,Repository URL
0,flutter,flutter,169000,https://github.com/flutter/flutter
1,facebook,react-native,121000,https://github.com/facebook/react-native
2,Genymobile,scrcpy,120000,https://github.com/Genymobile/scrcpy
3,justjavac,free-programming-books-zh_CN,113000,https://github.com/justjavac/free-programming-...
4,Hack-with-Github,Awesome-Hacking,90600,https://github.com/Hack-with-Github/Awesome-Ha...
5,Solido,awesome-flutter,55300,https://github.com/Solido/awesome-flutter
6,tldr-pages,tldr,54500,https://github.com/tldr-pages/tldr
7,wasabeef,awesome-android-ui,52000,https://github.com/wasabeef/awesome-android-ui
8,google,material-design-icons,51300,https://github.com/google/material-design-icons
9,laurent22,joplin,48400,https://github.com/laurent22/joplin


**Alternatively,**

In [None]:
# link to fifth topic
urls[4]

'https://github.com/topics/android'

In [None]:
# Fetch and extract repository information from the topic page at urls[4]
get_topic_repos(get_topic_page(urls[4]))

Unnamed: 0,Username,Repository Name,Stars,Repository URL
0,flutter,flutter,169000,https://github.com/flutter/flutter
1,facebook,react-native,121000,https://github.com/facebook/react-native
2,Genymobile,scrcpy,120000,https://github.com/Genymobile/scrcpy
3,justjavac,free-programming-books-zh_CN,113000,https://github.com/justjavac/free-programming-...
4,Hack-with-Github,Awesome-Hacking,90600,https://github.com/Hack-with-Github/Awesome-Ha...
5,Solido,awesome-flutter,55300,https://github.com/Solido/awesome-flutter
6,tldr-pages,tldr,54500,https://github.com/tldr-pages/tldr
7,wasabeef,awesome-android-ui,52000,https://github.com/wasabeef/awesome-android-ui
8,google,material-design-icons,51300,https://github.com/google/material-design-icons
9,laurent22,joplin,48400,https://github.com/laurent22/joplin


In [None]:
# Save the Android topic's repositories to a CSV file, without the index column
get_topic_repos(get_topic_page(urls[4])).to_csv('android.csv', index=False)

In [None]:
# link to the seventh topic, urls[6]
urls[6]

'https://github.com/topics/ansible'

In [None]:
# Fetch and extract repository information from the topic page at urls[6]
get_topic_repos(get_topic_page(urls[6]))

Unnamed: 0,Username,Repository Name,Stars,Repository URL
0,bregman-arie,devops-exercises,72500,https://github.com/bregman-arie/devops-exercises
1,ansible,ansible,64500,https://github.com/ansible/ansible
2,trailofbits,algo,29300,https://github.com/trailofbits/algo
3,MichaelCade,90DaysOfDevOps,27800,https://github.com/MichaelCade/90DaysOfDevOps
4,StreisandEffect,streisand,23300,https://github.com/StreisandEffect/streisand
5,kubernetes-sigs,kubespray,16800,https://github.com/kubernetes-sigs/kubespray
6,ansible,awx,14500,https://github.com/ansible/awx
7,semaphoreui,semaphore,11600,https://github.com/semaphoreui/semaphore
8,easzlab,kubeasz,10800,https://github.com/easzlab/kubeasz
9,netbootxyz,netboot.xyz,10000,https://github.com/netbootxyz/netboot.xyz


In [None]:
# Save the Ansible topic's repositories (urls[6]) to a CSV file, without the index column
get_topic_repos(get_topic_page(urls[6])).to_csv('ansible.csv', index=False)

In [None]:
# link to the twenty-first topic, urls[20]
urls[20]

'https://github.com/topics/chrome-extension'

In [None]:
# Fetch and extract repository information from the topic page at urls[20]
get_topic_repos(get_topic_page(urls[20]))

Unnamed: 0,Username,Repository Name,Stars,Repository URL
0,jaywcjlove,linux-command,33000,https://github.com/jaywcjlove/linux-command
1,refined-github,refined-github,26100,https://github.com/refined-github/refined-github
2,openai-translator,openai-translator,24300,https://github.com/openai-translator/openai-tr...
3,darkreader,darkreader,20500,https://github.com/darkreader/darkreader
4,dailydotdev,daily,19300,https://github.com/dailydotdev/daily
5,gildas-lormeau,SingleFile,17200,https://github.com/gildas-lormeau/SingleFile
6,AutomaApp,automa,16200,https://github.com/AutomaApp/automa
7,checkly,headless-recorder,15100,https://github.com/checkly/headless-recorder
8,immersive-translate,immersive-translate,15000,https://github.com/immersive-translate/immersi...
9,unbug,codelf,14200,https://github.com/unbug/codelf


In [None]:
# Save the Chrome extension topic's repositories (urls[20]) to a CSV file, without the index column
get_topic_repos(get_topic_page(urls[20])).to_csv('chrome-extension.csv', index=False)

# 6 Technical Comments

**Techniques to avoid being banned when scraping**

1. Rotating proxies: in addition to `time.sleep`, route requests through rotating proxies (services such as ScraperAPI or Bright Data, or free proxy lists) so that a single IP address is less likely to be blocked.

2. Randomized User-Agents: vary the browser headers between requests, for example with `fake_useragent`.

3. Session persistence: use `requests.Session()` to maintain cookies across requests.

4. Delays and randomization: use `time.sleep(random.uniform(3, 10))` to vary the delay between requests.

5. Headless browsing: use Selenium or Playwright when necessary.
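
Several of the techniques listed above can be combined in one small helper. The sketch below is a minimal illustration, not a tested anti-blocking solution: `polite_get`, the `USER_AGENTS` pool, and the default delay bounds are assumptions introduced here, and the session argument is expected to behave like a `requests.Session`.

```python
import random
import time

# Illustrative User-Agent strings (assumed values); a real pool would be larger
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def polite_get(session, url, min_delay=3.0, max_delay=10.0, sleep=time.sleep):
    """Wait a random interval, then fetch url through a persistent session."""
    # Randomized delay so requests do not arrive at a fixed rhythm
    sleep(random.uniform(min_delay, max_delay))
    # Pick a different User-Agent header at random for each request
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = session.get(url, headers=headers, timeout=30)
    # Raise an exception on 4xx/5xx responses
    response.raise_for_status()
    return response


# Usage (requires the requests package and network access):
# import requests
# session = requests.Session()  # keeps cookies between calls
# page = polite_get(session, "https://github.com/topics")
```

Passing the session in as an argument keeps cookies across calls and makes the helper easy to try with a stand-in object. A proxy could be supplied through the `proxies` argument that `requests` accepts.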

#7 NEXT ACTIVITY

1. Get the list of topics from the topics page (https://github.com/topics)
  - https://github.com/topics is the topics page
2. Get the list of top repos from the individual topic pages (Example: https://github.com/topics/3d)
  - 3D is a topic
  - https://github.com/topics/3d is the 3D topic page
3. For each topic, create a CSV of the top repos for the topic

**Loop through the URLs, fetch each topic page, and extract repository info**
- repository info contains the username, repository name, star count, and repository URL

In [None]:
for url in urls:  # Loop through each URL in the list
    repo_info = get_topic_repos(get_topic_page(url))  # Fetch and extract repo info
    print(repo_info)  # Print or store the extracted information

              Username                Repository Name   Stars  \
0               mrdoob                       three.js  105000   
1               pmndrs              react-three-fiber   28500   
2               libgdx                         libgdx   23900   
3            BabylonJS                     Babylon.js   23800   
4              FreeCAD                        FreeCAD   23700   
5                ssloy                   tinyrenderer   21600   
6              lettier  3d-game-shaders-for-beginners   18500   
7             aframevr                         aframe   17000   
8              blender                        blender   14700   
9             CesiumGS                         cesium   13500   
10                4ian                       GDevelop   13500   
11             isl-org                         Open3D   12100   
12            MonoGame                       MonoGame   11900   
13              mapbox                   mapbox-gl-js   11500   
14           metafizzy   
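
The loop above stops at the first page that fails to load, because `get_topic_page` raises an exception. One possible variation, sketched here, keeps going and records the failures separately. `collect_topic_repos` and its `scrape` parameter are hypothetical names introduced for this illustration.

```python
def collect_topic_repos(urls, scrape):
    """Apply scrape to every URL, keeping successes and failures apart."""
    results = {}   # url -> whatever scrape returns (for example a DataFrame)
    failures = {}  # url -> error message
    for url in urls:
        try:
            results[url] = scrape(url)
        except Exception as exc:
            # Record the error and continue with the next URL
            failures[url] = str(exc)
    return results, failures


# Usage with the notebook's own functions:
# results, failures = collect_topic_repos(
#     urls, lambda u: get_topic_repos(get_topic_page(u))
# )
```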

**For each topic, create a CSV of the top repos for the topic**

In [None]:
import pandas as pd
from urllib.parse import quote

# List of topic names (these can be read from a file or defined manually)
topic_titles = [
    '3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino',
    'ASP.NET', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin',
    'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command-line interface', 'Clojure',
    'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'C++', 'Cryptocurrency', 'Crystal'
]

# Dynamically generate URLs from topic titles.
# Note: a slug derived from a title can differ from the slug in the scraped `urls` list
# (for example 'Amazon Web Services' becomes 'amazon-web-services', while that list uses 'aws').
urls = [f'https://github.com/topics/{quote(topic.lower().replace(" ", "-"))}' for topic in topic_titles]

# Loop through the URLs and topics together
for url, topic in zip(urls, topic_titles):
    print(f"Fetching data for topic: {topic} from URL: {url}")

    try:
        topic_doc = get_topic_page(url)         # Fetch and parse the topic page
        repo_info = get_topic_repos(topic_doc)  # DataFrame of repository details

        # A DataFrame has no single truth value, so test .empty explicitly
        if not repo_info.empty:
            # Tag each row with its topic and save to CSV
            repo_info['Topic'] = topic
            filename = f"{topic.replace(' ', '_').replace('/', '_')}_repos.csv"
            repo_info.to_csv(filename, index=False)
            print(f"Saved data for {topic} to {filename}")
        else:
            print(f"No data available for {topic} at {url}")
    except Exception as e:
        print(f"Failed to load page for {topic} at {url}: {e}")

Fetching data for topic: 3D from URL: https://github.com/topics/3d
Saved data for 3D to 3D_repos.csv
Fetching data for topic: Ajax from URL: https://github.com/topics/ajax
Saved data for Ajax to Ajax_repos.csv
Fetching data for topic: Algorithm from URL: https://github.com/topics/algorithm
Saved data for Algorithm to Algorithm_repos.csv
Fetching data for topic: Amp from URL: https://github.com/topics/amp
Saved data for Amp to Amp_repos.csv
Fetching data for topic: Android from URL: https://github.com/topics/android
Saved data for Android to Android_repos.csv
Fetching data for topic: Angular from URL: https://github.com/topics/angular
Saved data for Angular to Angular_repos.csv
Fetching data for topic: Ansible from URL: https://github.com/topics/ansible
Saved data for Ansible to Ansible_repos.csv
Fetching data for topic: API from URL: https://github.com/topics/api
Saved data for API to API_repos.csv
Fetching data for topic: Arduino from URL: https://github.com/topics/arduino
Saved data 
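
The filenames above are derived from the topic titles. An alternative, sketched here under the assumption that every topic URL ends with the topic slug, is to name each file after that slug, which matches the `android.csv` and `chrome-extension.csv` names used in the earlier cells. `csv_name_from_url` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlparse


def csv_name_from_url(topic_url):
    """Build a CSV filename from the last path segment of a topic URL."""
    # '/topics/chrome-extension' -> 'chrome-extension'
    slug = urlparse(topic_url).path.rstrip('/').rsplit('/', 1)[-1]
    return f"{slug}.csv"


print(csv_name_from_url('https://github.com/topics/chrome-extension'))  # chrome-extension.csv
print(csv_name_from_url('https://github.com/topics/android'))           # android.csv
```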

**Key Note**

- Conceptually, our data has the shape (30, 20, 4):
  - 30 URLs (one per topic)
  - 20 repositories per URL
  - 4 attributes per repository (username, repository name, star count, and repository URL)
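
To make the shape concrete, the sketch below builds a placeholder nested list with the same dimensions and counts its rows and values. The `None` entries are stand-ins rather than scraped data, and the calculation assumes every topic page yields exactly 20 repositories.

```python
n_topics, repos_per_topic, n_attributes = 30, 20, 4

# Placeholder structure: topics -> repositories -> attributes
data = [
    [[None] * n_attributes for _ in range(repos_per_topic)]
    for _ in range(n_topics)
]

total_rows = n_topics * repos_per_topic    # one row per repository
total_values = total_rows * n_attributes   # one value per attribute

print(len(data), len(data[0]), len(data[0][0]))  # 30 20 4
print(total_rows, total_values)                  # 600 2400
```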