# GitHub Scraper for Training Data Extraction

On the recommendation of **Pilar Hidalgo** of the [CodeGPT](https://codegpt.co) team, I used the [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) library to extract text from a given GitHub repository and produce a curated text file suitable for use as training data for [CodeGPT](https://codegpt.co) agents.

I made this Jupyter Notebook using a customized [CodeGPT](https://codegpt.co) Coding Copilot agent powered by [Ollama](https://ollama.com/)/[codellama](https://docs.codegpt.co/es/docs/tutorial-ai-providers/ollama), running on my local machine in Visual Studio Code.

## 1. Libraries Installation and Imports

[**BeautifulSoup**](https://pypi.org/project/beautifulsoup4/) is a Python library designed for quick turnaround projects like screen-scraping. It provides simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree, which makes it easier to work with HTML or XML files. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8, so you don't have to think about encodings.

Here's a summary of its capabilities:

Parsing: Beautiful Soup can parse anything you give it, and does the heavy lifting of handling different types of markup, making it easier to scrape data from web pages.

Navigating the Parse Tree: It provides simple ways to navigate through the parse tree, such as finding all the links, texts, or tags of a certain type.

Searching: You can search for elements with specific classes, ids, or text using filters, which makes it very powerful for extracting information from HTML/XML.

Modifying the Parse Tree: Beautiful Soup allows you to change the tree, such as adding, removing, or modifying tags, which is useful for cleaning up or changing scraped data.

Output: After parsing and manipulating the data, you can render the tree as a nicely formatted Unicode string, for example with `prettify()`.
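The capabilities listed above can be tried on a small inline document before touching a live page. The following is a minimal sketch, assuming `beautifulsoup4` is installed; the HTML snippet and variable names are made up for illustration only.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration.
html = """
<html><head><title>Demo page</title></head>
<body>
  <h1 id="main">Hello</h1>
  <p class="intro">First paragraph with a <a href="/docs">link</a>.</p>
  <p>Second paragraph.</p>
  <script>console.log("noise");</script>
</body></html>
"""

# Parsing: build the parse tree with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Navigating: reach a tag through attribute-style access.
page_title = soup.title.string

# Searching: filter by tag name, CSS class, or id.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
intro = soup.find("p", class_="intro").get_text()
heading = soup.find(id="main").get_text()

# Modifying: remove a tag (and its contents) from the tree.
soup.script.decompose()

# Output: render the modified tree as an indented Unicode string.
pretty = soup.prettify()

print(page_title, hrefs, intro, heading, sep="\n")
```

Removing the `<script>` tag before rendering is the kind of clean-up step that becomes useful later, when the goal is readable text rather than markup.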

In [15]:
# Install the beautifulsoup4 package (the import happens in a later cell)
%pip install beautifulsoup4

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


The **Requests** library in Python is a simple, yet powerful HTTP library designed for human beings. It allows you to send HTTP/1.1 requests easily, without the need to manually add query strings to your URLs, or to form-encode your POST data.
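To see what "without manually adding query strings or form-encoding POST data" means in practice, the sketch below builds two requests without sending them and inspects the result. It assumes `requests` is installed; `example.com` and the parameter names are placeholders, not part of this project.

```python
import requests

# Build (but do not send) a GET request. The host and parameters are
# placeholders chosen for this illustration.
get_req = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "pv hawk", "page": 2},
).prepare()

# Requests appends and encodes the query string for us.
print(get_req.url)

# Build (but do not send) a POST request with form data.
post_req = requests.Request(
    "POST",
    "https://example.com/submit",
    data={"name": "PV Hawk", "lang": "python"},
).prepare()

# The dictionary is form-encoded into the body, and the matching
# Content-Type header is set automatically.
print(post_req.body)
print(post_req.headers["Content-Type"])
```

Preparing a request instead of sending it keeps the example offline, which is convenient when experimenting inside a notebook.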

In [16]:
# Install the Requests library (the import happens in the next cell)
%pip install requests

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [17]:
# Import the necessary libraries: requests for HTTP requests and BeautifulSoup from bs4 for web scraping.
import requests
from bs4 import BeautifulSoup

# Wrap the web scraping code in a try-except block to handle potential exceptions.
try:
    # Define the URL of the webpage we want to scrape.
    url = "https://github.com/LukasBommes/PV-Hawk"
    
    # Use the requests.get method to perform an HTTP GET request to the specified URL.
    # The 'with' statement ensures that the underlying connection is released afterwards.
    with requests.get(url, timeout=30) as response:
        # Raise requests.exceptions.HTTPError for 4xx/5xx responses so the except block below handles them.
        response.raise_for_status()
        # Parse the content of the response using BeautifulSoup.
        # The response.text contains the HTML content of the page.
        # 'html.parser' is the parser that BeautifulSoup uses to parse the HTML content.
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Placeholder for the scraping logic...
        # Here you would add the code to find specific elements, tags, attributes, or text within the parsed HTML.
        # Example: titles = soup.find_all('h1') to find all <h1> tags and store them in the 'titles' variable.
        
# Catch any exception that might be raised by the requests library, such as network issues, URL errors, etc.
except requests.exceptions.RequestException as e:
    # Print an error message along with the exception details.
    print("An error occurred while making the request", e)

Let's check the title of the extracted page:

In [5]:
soup.title.string

'GitHub - LukasBommes/PV-Hawk: Tool for the extraction and mapping of photovoltaics modules from IR drone videos of utility-scale PV plants (my PhD project)'

Let's see what the parsed tree looks like:

In [6]:
print(soup.prettify())

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-f13f84a2af0d.css" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-1ee85695b584.css" media="all" rel="stylesheet">
    <link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://

Let's extract the links from the soup:

In [7]:
# Import the regular expressions module, which is used for searching patterns within text.
import re

# Initialize an empty list to store the links that match our search criteria.
links = []

# Loop through all anchor ('a') elements that have an href attribute in the parsed HTML stored in 'soup'.
# Passing href=True skips anchors without an href, for which link.get('href') would return None and make re.search fail.
for link in soup.find_all('a', href=True):
    # Use a regular expression to search for href values containing the '/LukasBommes/PV-Hawk' pattern.
    if re.search(r'/LukasBommes/PV-Hawk', link['href']):
        # If the pattern is found in the href attribute, append the URL to the 'links' list.
        links.append(link['href'])

# After the loop, 'links' contains all the URLs that include '/LukasBommes/PV-Hawk'.
# In a notebook, a bare expression on the last line of a cell is displayed as the cell output.
links

['/LukasBommes/PV-Hawk',
 '/LukasBommes/PV-Hawk/blob/master/LICENSE',
 '/LukasBommes/PV-Hawk/stargazers',
 '/LukasBommes/PV-Hawk/forks',
 '/LukasBommes/PV-Hawk/branches',
 '/LukasBommes/PV-Hawk/tags',
 '/LukasBommes/PV-Hawk/activity',
 '/LukasBommes/PV-Hawk',
 '/LukasBommes/PV-Hawk/issues',
 '/LukasBommes/PV-Hawk/pulls',
 '/LukasBommes/PV-Hawk/actions',
 '/LukasBommes/PV-Hawk/projects',
 '/LukasBommes/PV-Hawk/security',
 '/LukasBommes/PV-Hawk/pulse',
 '/LukasBommes/PV-Hawk',
 '/LukasBommes/PV-Hawk/issues',
 '/LukasBommes/PV-Hawk/pulls',
 '/LukasBommes/PV-Hawk/actions',
 '/LukasBommes/PV-Hawk/projects',
 '/LukasBommes/PV-Hawk/security',
 '/LukasBommes/PV-Hawk/pulse',
 '/LukasBommes/PV-Hawk/branches',
 '/LukasBommes/PV-Hawk/tags',
 '/LukasBommes/PV-Hawk/branches',
 '/LukasBommes/PV-Hawk/tags',
 '/LukasBommes/PV-Hawk/commits/master/',
 '/LukasBommes/PV-Hawk/commits/master/',
 '/LukasBommes/PV-Hawk/tree/master/.github/workflows',
 '/LukasBommes/PV-Hawk/tree/master/.github/workflows',
 '/Lu

Let's check how many links we have extracted:

In [8]:
len(links)

73
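The list printed above contains the same paths several times (for example `/LukasBommes/PV-Hawk/issues`), so each of them would be downloaded more than once later on. A minimal sketch of order-preserving de-duplication follows; `sample_links` is a shortened, hand-copied stand-in for the real `links` list.

```python
# A shortened sample of the hrefs printed above, duplicates included.
sample_links = [
    "/LukasBommes/PV-Hawk",
    "/LukasBommes/PV-Hawk/issues",
    "/LukasBommes/PV-Hawk/pulls",
    "/LukasBommes/PV-Hawk",
    "/LukasBommes/PV-Hawk/issues",
]

# dict.fromkeys keeps the first occurrence of each key and preserves
# insertion order, so the original ordering of the links survives.
unique_links = list(dict.fromkeys(sample_links))

print(len(sample_links), "->", len(unique_links))
print(unique_links)
```

Applying the same one-liner to `links` would reduce the number of HTTP requests without changing which pages end up in the training text.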

Now let's remove all absolute links that begin with "http", keeping only the relative links that lead to subsections of the main GitHub repository:

In [18]:
# Use a list comprehension to filter out absolute URLs from the 'links' list.
# The re.match function is used to check if a URL starts with 'http' (indicating an absolute URL).
# The caret (^) in the regular expression denotes the start of the string.
# If the URL does not start with 'http', it is a relative URL and is included in the new list.

links = [link for link in links if not re.match(r'^http', link)]

# After filtering, 'links' now contains only relative URLs.
# Output the number of relative URLs in the 'links' list using the len function.
len(links)

70
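Checking for a leading `http` works for the links found here. As an alternative sketch, `urllib.parse.urlparse` from the standard library can classify a link by its parsed scheme and host, which also catches protocol-relative links such as `//host/path`. The helper name `is_relative` and the sample list are hypothetical.

```python
from urllib.parse import urlparse

def is_relative(href: str) -> bool:
    """Return True when the href has neither a scheme nor a host."""
    parts = urlparse(href)
    return not parts.scheme and not parts.netloc

# Hypothetical sample covering the three common shapes of a link.
sample = [
    "/LukasBommes/PV-Hawk/issues",             # root-relative path
    "https://github.com/LukasBommes/PV-Hawk",  # absolute URL
    "//github.com/LukasBommes/PV-Hawk",        # protocol-relative URL
]

relative_only = [href for href in sample if is_relative(href)]
print(relative_only)
```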

In [10]:
# Import the urljoin function from urllib.parse to create absolute URLs from relative paths.
from urllib.parse import urljoin

# Initialize an empty list to store the text content from each URL.
texts = []

# Loop through each URL in the 'links' list.
for link in links:
    # Resolve the relative URL in 'link' against the base URL 'https://github.com'.
    # This creates a full absolute URL.
    url = urljoin("https://github.com", link)
    
    # Send an HTTP GET request to the URL and store the response.
    response = requests.get(url)
    
    # Parse the response text (HTML content) using BeautifulSoup with the 'html.parser' parser.
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find all text within the HTML document and append it to the 'texts' list.
    # The find_all function with the string=True argument retrieves all text nodes in the document.
    # (The older text=True spelling is deprecated in recent Beautiful Soup releases.)
    texts.append(soup.find_all(string=True))

# Print the list of text content from each URL.
print(texts)
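The `urljoin` function imported in the cell above resolves a relative path against a base URL, which avoids hand-built string concatenation. A short sketch of its behavior, using the repository path from this notebook and a placeholder absolute URL:

```python
from urllib.parse import urljoin

base = "https://github.com"

# A root-relative path is resolved against the base host.
joined = urljoin(base, "/LukasBommes/PV-Hawk/issues")
print(joined)

# An already absolute URL is returned unchanged, so a stray absolute
# link in the list would not be mangled by the join.
unchanged = urljoin(base, "https://example.com/page")
print(unchanged)
```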


Now, let's complement the text extracted from the main GitHub repository with additional training data from relevant online publications:

In [13]:
# Import the requests module to send HTTP requests using Python.
import requests
# Import BeautifulSoup from the bs4 module to parse HTML and XML documents.
from bs4 import BeautifulSoup

# Insert here a list of URLs from which we want to scrape text.
pub_links = [
    "https://onlinelibrary.wiley.com/doi/10.1002/pip.3448",
    "https://onlinelibrary.wiley.com/doi/10.1002/pip.3518",
    "https://onlinelibrary.wiley.com/doi/10.1002/pip.3564"
]

# Initialize an empty list to store the text content from each publication link.
texts2 = []

# Loop through each publication link in the 'pub_links' list.
for url in pub_links:
    # Send an HTTP GET request to the URL and store the response.
    # The 'requests.get' function fetches the content of the URL.
    response = requests.get(url)
    
    # Parse the response text (HTML content) using BeautifulSoup with the 'html.parser' parser.
    # 'html.parser' is a built-in Python parser that BeautifulSoup uses to parse HTML.
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Find all text within the HTML document and append it to the 'texts2' list.
    # The 'find_all' function with the 'string=True' argument retrieves all text nodes in the document.
    # (The older 'text=True' spelling is deprecated in recent Beautiful Soup releases.)
    texts2.append(soup.find_all(string=True))

# Print the list of text content from each publication link.
# This will output the text content scraped from each URL in the 'pub_links' list.
print(texts2)
    


[['html', 'Just a moment...', '*{box-sizing:border-box;margin:0;padding:0}html{line-height:1.15;-webkit-text-size-adjust:100%;color:#313131}button,html{font-family:system-ui,-apple-system,BlinkMacSystemFont,Segoe UI,Roboto,Helvetica Neue,Arial,Noto Sans,sans-serif,Apple Color Emoji,Segoe UI Emoji,Segoe UI Symbol,Noto Color Emoji}@media (prefers-color-scheme:dark){body{background-color:#222;color:#d9d9d9}body a{color:#fff}body a:hover{color:#ee730a;text-decoration:underline}body .lds-ring div{border-color:#999 transparent transparent}body .font-red{color:#b20f03}body .big-button,body .pow-button{background-color:#4693ff;color:#1d1d1d}body #challenge-success-text{background-image:url(
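The output above starts with `'Just a moment...'` followed by raw CSS. This suggests, though it is only an inference from the printed text, that the publisher returned an interstitial check page rather than the article, and it shows that collecting every text node also collects the contents of `<style>` and `<script>` tags. The sketch below shows one possible way to keep only visible text and to flag such pages; both helper names and the title heuristic are assumptions, not part of the original notebook.

```python
from bs4 import BeautifulSoup

def visible_text(html: str) -> str:
    """Return the human-readable text of an HTML page as one string."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop tags whose contents are code or styling rather than prose.
    for tag in soup.find_all(["script", "style", "noscript"]):
        tag.decompose()
    # Strip each text node and join the non-empty ones with spaces.
    return soup.get_text(separator=" ", strip=True)

def looks_like_challenge(html: str) -> bool:
    """Heuristic (an assumption): flag pages titled 'Just a moment...'."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title and soup.title.string else ""
    return title.strip().lower().startswith("just a moment")

# A made-up page imitating the shape of the output shown above.
sample = (
    "<html><head><title>Just a moment...</title>"
    "<style>*{margin:0}</style></head>"
    "<body><p>Checking your browser</p></body></html>"
)

print(looks_like_challenge(sample))
print(visible_text(sample))
```

Pages flagged this way are probably better skipped than written into the training file, since they carry no content from the publication itself.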

Now, let's merge the main text (`texts`) and the additional text (`texts2`) into a single string.
Then we will export it to a .txt file and calculate its size.

In [14]:
# Import the os module to interact with the operating system.
import os

# Initialize an empty string to concatenate all the text content.
full_text = ""

# Loop through each text content in the 'texts' list.
for text in texts:
    # Convert the text content to a string and concatenate it with a space to 'full_text'.
    full_text += str(text) + " "

# Loop through each text content in the 'texts2' list.
for text in texts2:
    # Convert the text content to a string and concatenate it to 'full_text'.
    full_text += str(text)

# Define the name of the file where the scraped text will be saved.
file_name = "scraped.txt"

# Open the file in write mode ('w') as 'file', encoding the text as UTF-8.
# If the file does not exist, it will be created.
with open(file_name, "w", encoding="utf-8") as file:
    # Write the concatenated text content to the file.
    file.write(full_text)

# Use the os.stat function to get the status of the file.
# The st_size attribute of the result gives the size of the file in bytes.
file_size = os.stat(file_name).st_size

# Convert the file size from bytes to megabytes (1 MB = 1024 * 1024 bytes).
file_size_mb = file_size / 1024 / 1024

# Print the size of the file in megabytes.
print("File Size:", file_size_mb)

File Size: 1.5266180038452148
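Calling `str()` on a list of text nodes writes the Python list representation into the file, including brackets, quotes, and escaped `\n` sequences. If plain text is preferred, the nodes can be joined instead. A minimal sketch, using small hypothetical stand-ins for `texts` and `texts2`:

```python
# Hypothetical stand-ins for 'texts' and 'texts2': one inner list of
# text nodes per scraped page.
sample_texts = [["Readme", "\n", "  Install  "], ["Issues", "\n\n"]]
sample_texts2 = [["Abstract", " PV modules "]]

def flatten_pages(pages):
    """Join the non-empty, stripped text nodes of each page."""
    page_strings = []
    for nodes in pages:
        cleaned = [str(node).strip() for node in nodes]
        page_strings.append(" ".join(part for part in cleaned if part))
    # Separate pages with a blank line so page boundaries stay visible.
    return "\n\n".join(page_strings)

clean_text = flatten_pages(sample_texts + sample_texts2)
print(clean_text)
```

A file produced this way would have a different size from the one reported above, since the list punctuation and whitespace-only nodes are dropped.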


The file must be smaller than 20 MB to be uploaded to https://codegpt.co
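The size limit can be checked programmatically before uploading. The sketch below is a minimal example, assuming that "20 MB" means 20 * 1024 * 1024 bytes (consistent with the conversion used in the cell above); the helper name `fits_upload_limit` is hypothetical, and a temporary file stands in for `scraped.txt`.

```python
import os
import tempfile

# Assumption: the 20 MB limit is interpreted as 20 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 20 * 1024 * 1024

def fits_upload_limit(path: str, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True when the file at 'path' is smaller than 'limit' bytes."""
    return os.path.getsize(path) < limit

# Demonstrate on a small temporary file rather than the real export.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("sample training text")
    tmp_path = tmp.name

within_limit = fits_upload_limit(tmp_path)
within_tiny_limit = fits_upload_limit(tmp_path, limit=5)
os.remove(tmp_path)

print(within_limit, within_tiny_limit)
```

In the notebook itself, the call would be `fits_upload_limit("scraped.txt")`.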

For comments and suggestions, find me on Discord or email me at zeryan.guerra@gmail.com