# part 1

# The goal of this assignment is:

Enhancing Code Quality: Through refactoring and use of Pythonic constructs such as list comprehensions, iterators, and generators, improve the maintainability and readability of the code.

Applying SOLID Principles: Refactor the code to adhere to SOLID principles, ensuring better organization, flexibility, and scalability of the codebase.

Improving Functionality and Control: By changing the loop into an iterator and using generators, you gain better control over the execution flow and resource management of the crawler.

Document and Communicate: Provide a clear analysis of the refactoring done and potential improvements based on design principles.

# Exercise 1

In [4]:
def apply_function_to_data(data, func):
    """
    Apply a function to each element in a list of data.

    Parameters:
    data (list): A list of data elements.
    func (function): A function to apply to each data element.

    Returns:
    list: A list of new values obtained by applying the function to each data element.
    """
    return [func(x) for x in data]

# Example usage:
data = [1, 2, 3, 4, 5]
func = lambda x: x * 2

result = apply_function_to_data(data, func)
print(result)  # Output: [2, 4, 6, 8, 10]


[2, 4, 6, 8, 10]


In [1]:
def apply_functions(data, *funcs):
    """
    Apply one or more functions to a list of data.

    Parameters:
    data (list): A list of data elements.
    *funcs (function): One or more functions to apply to the data.

    Returns:
    list: A list of lists containing the results of applying each function to the data.
    """
    return [[func(x) for x in data] for func in funcs]

# Below is an example of using the function above.
data = [1, 2, 3, 4, 5]
func1 = lambda x: x * 2
func2 = lambda x: x + 1

result = apply_functions(data, func1, func2)
print(result)  # Output: [[2, 4, 6, 8, 10], [2, 3, 4, 5, 6]]

[[2, 4, 6, 8, 10], [2, 3, 4, 5, 6]]
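
Since the assignment goals also mention generators, the same idea can be written lazily. The following is only a sketch: `iter_apply_functions`, `double`, and `increment` are hypothetical names that do not appear in the exercise, and `data` is assumed to be a list (or another re-iterable sequence).

```python
def iter_apply_functions(data, *funcs):
    """Lazily yield one list of results per function (hypothetical helper)."""
    for func in funcs:
        # Each inner list is only built when the caller asks for it.
        yield [func(x) for x in data]


def double(x):
    return x * 2


def increment(x):
    return x + 1


lazy_results = iter_apply_functions([1, 2, 3, 4, 5], double, increment)
first = next(lazy_results)   # only `double` has been applied so far
second = next(lazy_results)  # now `increment` has been applied too
print(first, second)
```

Compared with `apply_functions`, nothing is computed until `next()` is called, which may matter when the functions are expensive.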


# Exercise 2

In [26]:
# The code below was provided with the assignment.

import urllib.request
import ssl
from bs4 import BeautifulSoup
import re

def hack_ssl():
    """Ignores the certificate errors"""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx

def open_url(url):
    """Reads URL file as a big string and cleans the HTML file to make it more readable. Input: URL, output: soup object"""
    ctx = hack_ssl()
    try:
        response = urllib.request.urlopen(url, context=ctx)
        html = response.read()
        soup = BeautifulSoup(html, 'html.parser')
        return soup
    except urllib.error.HTTPError as e:
        print(f"HTTP Error: {e.code} - {e.reason}")
        return None
    except urllib.error.URLError as e:
        print(f"URL Error: {e.reason}")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

def read_hrefs(soup):
    """Get a list of all anchor tags from the soup object (callers read each tag's href). Input: soup object"""
    reflist = []
    tags = soup('a')
    for tag in tags:
        reflist.append(tag)
    return reflist

def read_li(soup):
    """Get from soup object a list of list items. Input: soup object"""
    lilist = []
    tags = soup('li')
    for tag in tags:
        lilist.append(tag)
    return lilist

def get_phone(info):
    """Extracts phone number from the list of information. Input: list of BeautifulSoup objects"""
    reg = r"(?:(?:00|\+)?[0-9]{4})?(?:[ .-][0-9]{3}){1,5}"
    phone = [s for s in filter(lambda x: 'Telefoon' in str(x), info)]
    try:
        phone = str(phone[0])
    except:
        phone = [s for s in filter(lambda x: re.findall(reg, str(x)), info)]
        try:
            phone = str(phone[0])
        except:
            phone = ""
    return phone.replace('Facebook', '').replace('Telefoon:', '')

def get_email(soup):
    """Extracts email from the soup object. Input: soup object"""
    try:
        email = [s for s in filter(lambda x: '@' in str(x), soup)]
        email = str(email[0])[4:-5]
        bs = BeautifulSoup(email, features="html.parser")
        email = bs.find('a').attrs['href'].replace('mailto:', '')
    except:
        email = ""
    return email

def remove_html_tags(text):
    """Remove HTML tags from a string"""
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

def fetch_sidebar(soup):
    """Reads HTML file and extracts sidebar. Input: HTML, output: sidebar"""
    sidebar = soup.findAll(attrs={'class': 'sidebar'})
    if sidebar:
        return sidebar[0]
    return None

def extract(url):
    """Extracts and formats the URL part needed"""
    text = str(url)
    text = text.split('"')[0] + "/"
    return text

def main(url):
    """Main function to execute the script."""
    print('fetch urls')
    s = open_url(url)
    if s is None:
        print("Failed to retrieve the URL. Exiting.")
        return

    reflist = read_hrefs(s)

    # Print some of the fetched hrefs for inspection
    for tag in reflist[:10]:  # Print first 10 for inspection
        print(tag.get('href'))

    print('getting sub-urls')
    sub_urls = [tag.get('href') for tag in reflist if '/sportaanbieders' in tag.get('href', '')]
    sub_urls = sub_urls[3:]

    print('extracting the data')
    print(f'{len(sub_urls)} sub-urls')

    for sub in sub_urls:
        try:
            sub = extract(sub)
            site = url.rstrip('/') + sub
            print(f"Fetching site: {site}")
            soup = open_url(site)
            if soup is None:
                print(f"Failed to retrieve the site: {site}. Skipping.")
                continue
            info = fetch_sidebar(soup)
            if info is None:
                print(f"No sidebar found for site: {site}. Skipping.")
                continue
            info = read_li(info)
            phone = get_phone(info)
            phone = remove_html_tags(phone).strip()
            email = get_email(info)
            email = remove_html_tags(email).replace("/", "")
            print(f'{site} ; {phone} ; {email}')
        except Exception as e:
            print(f"An error occurred while processing {sub}: {e}")
            continue

# Run the main function
if __name__ == "__main__":
    main("https://sport050.nl/sportaanbieders/alle-aanbieders/")


fetch urls
HTTP Error: 404 - Not Found
Failed to retrieve the URL. Exiting.


# Step 1: Simple refactor improvements

In [28]:
# We have refactored the existing web crawler code into a class called Crawler to improve maintainability and reusability.
# The main logic of the program has been encapsulated in a method named crawl_site().

import urllib.request
import ssl
from bs4 import BeautifulSoup
import re

class Crawler:
    def __init__(self, base_url):
        self.base_url = base_url

    def hack_ssl(self):
        """Ignores the certificate errors"""
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        return ctx

    def open_url(self, url):
        """Reads URL file as a big string and cleans the HTML file to make it more readable. Input: URL, output: soup object"""
        ctx = self.hack_ssl()
        try:
            response = urllib.request.urlopen(url, context=ctx)
            html = response.read()
            soup = BeautifulSoup(html, 'html.parser')
            return soup
        except urllib.error.HTTPError as e:
            print(f"HTTP Error: {e.code} - {e.reason}")
            return None
        except urllib.error.URLError as e:
            print(f"URL Error: {e.reason}")
            return None
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            return None

    def read_hrefs(self, soup):
        """Get a list of all anchor tags from the soup object (callers read each tag's href). Input: soup object"""
        reflist = []
        tags = soup('a')
        for tag in tags:
            reflist.append(tag)
        return reflist

    def read_li(self, soup):
        """Get from soup object a list of list items. Input: soup object"""
        lilist = []
        tags = soup('li')
        for tag in tags:
            lilist.append(tag)
        return lilist

    def get_phone(self, info):
        """Extracts phone number from the list of information. Input: list of BeautifulSoup objects"""
        reg = r"(?:(?:00|\+)?[0-9]{4})?(?:[ .-][0-9]{3}){1,5}"
        phone = [s for s in filter(lambda x: 'Telefoon' in str(x), info)]
        try:
            phone = str(phone[0])
        except:
            phone = [s for s in filter(lambda x: re.findall(reg, str(x)), info)]
            try:
                phone = str(phone[0])
            except:
                phone = ""
        return phone.replace('Facebook', '').replace('Telefoon:', '')

    def get_email(self, soup):
        """Extracts email from the soup object. Input: soup object"""
        try:
            email = [s for s in filter(lambda x: '@' in str(x), soup)]
            email = str(email[0])[4:-5]
            bs = BeautifulSoup(email, features="html.parser")
            email = bs.find('a').attrs['href'].replace('mailto:', '')
        except:
            email = ""
        return email

    def remove_html_tags(self, text):
        """Remove HTML tags from a string"""
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)

    def fetch_sidebar(self, soup):
        """Reads HTML file and extracts sidebar. Input: HTML, output: sidebar"""
        sidebar = soup.findAll(attrs={'class': 'sidebar'})
        if sidebar:
            return sidebar[0]
        return None

    def extract(self, url):
        """Extracts and formats the URL part needed"""
        text = str(url)
        text = text.split('"')[0] + "/"
        return text

    def crawl_site(self):
        """Main function to execute the script."""
        print('fetch urls')
        s = self.open_url(self.base_url)
        if s is None:
            print("Failed to retrieve the URL. Exiting.")
            return

        reflist = self.read_hrefs(s)

        # Print some of the fetched hrefs for inspection
        for tag in reflist[:10]:  # Print first 10 for inspection
            print(tag.get('href'))

        print('getting sub-urls')
        sub_urls = [tag.get('href') for tag in reflist if '/sportaanbieders' in tag.get('href', '')]
        sub_urls = sub_urls[3:]

        print('extracting the data')
        print(f'{len(sub_urls)} sub-urls')

        for sub in sub_urls:
            try:
                sub = self.extract(sub)
                site = self.base_url.rstrip('/') + sub
                print(f"Fetching site: {site}")
                soup = self.open_url(site)
                if soup is None:
                    print(f"Failed to retrieve the site: {site}. Skipping.")
                    continue
                info = self.fetch_sidebar(soup)
                if info is None:
                    print(f"No sidebar found for site: {site}. Skipping.")
                    continue
                info = self.read_li(info)
                phone = self.get_phone(info)
                phone = self.remove_html_tags(phone).strip()
                email = self.get_email(info)
                email = self.remove_html_tags(email).replace("/", "")
                print(f'{site} ; {phone} ; {email}')
            except Exception as e:
                print(f"An error occurred while processing {sub}: {e}")
                continue



def main():
    url = "https://sport050.nl/sportaanbieders/alle-aanbieders/"
    crawler = Crawler(url)
    crawler.crawl_site()

if __name__ == "__main__":
    main()


fetch urls
HTTP Error: 404 - Not Found
Failed to retrieve the URL. Exiting.


In [29]:
# Removing the lambda expressions from the class to improve readability
import urllib.request
import ssl
from bs4 import BeautifulSoup
import re

class Crawler:
    def __init__(self, base_url):
        self.base_url = base_url

    def hack_ssl(self):
        """Ignores the certificate errors"""
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        return ctx

    def open_url(self, url):
        """Reads URL file as a big string and cleans the HTML file to make it more readable. Input: URL, output: soup object"""
        ctx = self.hack_ssl()
        try:
            response = urllib.request.urlopen(url, context=ctx)
            html = response.read()
            soup = BeautifulSoup(html, 'html.parser')
            return soup
        except urllib.error.HTTPError as e:
            print(f"HTTP Error: {e.code} - {e.reason}")
            return None
        except urllib.error.URLError as e:
            print(f"URL Error: {e.reason}")
            return None
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            return None

    def read_hrefs(self, soup):
        """Get a list of all anchor tags from the soup object (callers read each tag's href). Input: soup object"""
        tags = soup('a')
        reflist = [tag for tag in tags]
        return reflist

    def read_li(self, soup):
        """Get from soup object a list of list items. Input: soup object"""
        tags = soup('li')
        lilist = [tag for tag in tags]
        return lilist

    def get_phone(self, info):
        """Extracts phone number from the list of information. Input: list of BeautifulSoup objects"""
        reg = r"(?:(?:00|\+)?[0-9]{4})?(?:[ .-][0-9]{3}){1,5}"
        phone = [s for s in info if 'Telefoon' in str(s)]
        if phone:
            phone = str(phone[0])
        else:
            phone = [s for s in info if re.findall(reg, str(s))]
            if phone:
                phone = str(phone[0])
            else:
                phone = ""
        return phone.replace('Facebook', '').replace('Telefoon:', '')

    def get_email(self, soup):
        """Extracts email from the soup object. Input: soup object"""
        email_tags = [s for s in soup if '@' in str(s)]
        if email_tags:
            email = str(email_tags[0])[4:-5]
            bs = BeautifulSoup(email, features="html.parser")
            email = bs.find('a').attrs['href'].replace('mailto:', '')
        else:
            email = ""
        return email

    def remove_html_tags(self, text):
        """Remove HTML tags from a string"""
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)

    def fetch_sidebar(self, soup):
        """Reads HTML file and extracts sidebar. Input: HTML, output: sidebar"""
        sidebar = soup.findAll(attrs={'class': 'sidebar'})
        if sidebar:
            return sidebar[0]
        return None

    def extract(self, url):
        """Extracts and formats the URL part needed"""
        text = str(url)
        text = text.split('"')[0] + "/"
        return text

    def crawl_site(self):
        """Main function to execute the script."""
        print('fetch urls')
        s = self.open_url(self.base_url)
        if s is None:
            print("Failed to retrieve the URL. Exiting.")
            return

        reflist = self.read_hrefs(s)

        # Print some of the fetched hrefs for inspection
        for tag in reflist[:10]:  # Print first 10 for inspection
            print(tag.get('href'))

        print('getting sub-urls')
        sub_urls = [tag.get('href') for tag in reflist if '/sportaanbieders' in tag.get('href', '')]
        sub_urls = sub_urls[3:]

        print('extracting the data')
        print(f'{len(sub_urls)} sub-urls')

        for sub in sub_urls:
            try:
                sub = self.extract(sub)
                site = self.base_url.rstrip('/') + sub
                print(f"Fetching site: {site}")
                soup = self.open_url(site)
                if soup is None:
                    print(f"Failed to retrieve the site: {site}. Skipping.")
                    continue
                info = self.fetch_sidebar(soup)
                if info is None:
                    print(f"No sidebar found for site: {site}. Skipping.")
                    continue
                info = self.read_li(info)
                phone = self.get_phone(info)
                phone = self.remove_html_tags(phone).strip()
                email = self.get_email(info)
                email = self.remove_html_tags(email).replace("/", "")
                print(f'{site} ; {phone} ; {email}')
            except Exception as e:
                print(f"An error occurred while processing {sub}: {e}")
                continue
                
                


def main():
    """Main function to create a Crawler instance and start crawling."""
    base_url = "https://sport050.nl/sportaanbieders/alle-aanbieders/"
    crawler = Crawler(base_url)
    crawler.crawl_site()

if __name__ == "__main__":
    main()

# Summary of the changes in this cell:
#
# read_hrefs method: replaced the explicit for loop with a list comprehension:
#     reflist = [tag for tag in tags]
#
# read_li method: replaced the explicit for loop with a list comprehension:
#     lilist = [tag for tag in tags]
#
# get_phone method: replaced filter() plus a lambda with list comprehensions:
#     phone = [s for s in info if 'Telefoon' in str(s)]
#     phone = [s for s in info if re.findall(reg, str(s))]
#
# get_email method: replaced filter() plus a lambda with a list comprehension:
#     email_tags = [s for s in soup if '@' in str(s)]
#
# By using list comprehensions instead of filter() with lambda expressions, the code becomes more readable and Pythonic.

fetch urls
HTTP Error: 404 - Not Found
Failed to retrieve the URL. Exiting.
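
To illustrate the lambda removal described in the cell above, here is a small offline comparison. The `items` list is invented sample data standing in for the `<li>` tags of a sidebar; it is not taken from the real site.

```python
# Invented sample data standing in for the sidebar's <li> tags.
items = [
    "<li>Telefoon: 050 123 456</li>",
    "<li>Facebook</li>",
    "<li>info@example.org</li>",
]

# Original style: filter() plus a lambda, wrapped in a redundant comprehension.
with_lambda = [s for s in filter(lambda x: "Telefoon" in str(x), items)]

# Refactored style: the condition moves into the comprehension itself.
with_comprehension = [s for s in items if "Telefoon" in str(s)]

print(with_lambda == with_comprehension)
```

Both expressions select the same elements in the same order; the second simply avoids the extra function call and the extra layer of syntax.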


In [17]:
# We have removed the lambda expressions, replacing them with regular methods.
# Here is the final structure and the necessary changes made:


import urllib.request
from bs4 import BeautifulSoup
import ssl
import re
import csv

class Crawler:
    def __init__(self):
        self.base_url = "https://sport050.nl/sportaanbieders/alle-aanbieders/"

    @staticmethod
    def hack_ssl():
        """Ignores the certificate errors"""
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        return ctx

    def open_url(self, url):
        """Reads URL file as a big string and cleans the HTML file to make it more readable. Input: URL, output: soup object"""
        ctx = self.hack_ssl()
        try:
            html = urllib.request.urlopen(url, context=ctx).read()
            soup = BeautifulSoup(html, 'html.parser')
            return soup
        except urllib.error.HTTPError as e:
            print(f'HTTPError: {e.code} for URL: {url}')
            return None
        except urllib.error.URLError as e:
            print(f'URLError: {e.reason} for URL: {url}')
            return None

    def read_hrefs(self, soup):
        """Gets from soup object a list of anchor tags with provider links, gets the href keys and returns them. Input: soup object"""
        tags = soup.select('a.provider-link')
        return [tag.get('href') for tag in tags]

    def read_li(self, soup):
        return [tag for tag in soup('li')]

    def get_phone(self, info):
        reg = r"(?:(?:00|\+)?[0-9]{4})?(?:[ .-][0-9]{3}){1,5}"
        phone = self.filter_phone(info)
        if phone:
            phone = str(phone[0])
        else:
            phone = self.filter_reg(info, reg)
            if phone:
                phone = str(phone[0])
            else:
                phone = ""   
        return phone.replace('Facebook', '').replace('Telefoon:', '')

    def filter_phone(self, info):
        return [s for s in info if 'Telefoon' in str(s)]

    def filter_reg(self, info, reg):
        return [s for s in info if re.findall(reg, str(s))]

    def get_email(self, soup):
        try: 
            email = self.filter_email(soup)
            email = str(email[0])[4:-5]
            bs = BeautifulSoup(email, features="html.parser")
            email = bs.find('a').attrs['href'].replace('mailto:', '')
        except:
            email = ""
        return email

    def filter_email(self, soup):
        return [s for s in soup if '@' in str(s)]

    @staticmethod
    def remove_html_tags(text):
        """Remove HTML tags from a string"""
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)

    def fetch_sidebar(self, soup):
        """Extracts the first element with class 'sidebar' from the soup object. Raises IndexError if there is none. Input: soup object, output: sidebar tag"""
        sidebar = soup.findAll(attrs={'class': 'sidebar'})
        return sidebar[0]

    def extract(self, url):
        text = str(url)
        text = text[26:].split('"')[0] + "/"
        return text

    def crawl_site(self):
        print('Fetching URLs')
        s = self.open_url(self.base_url)
        if s is None:
            print("Failed to fetch the main URL. Exiting.")
            return

        reflist = self.read_hrefs(s)

        print('Getting sub-URLs')
        sub_urls = reflist
        print(f'{len(sub_urls)} sub-URLs')

        with open('sport_suppliers.csv', mode='w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file, delimiter=';')
            writer.writerow(['URL', 'Phone Number', 'Email Address'])

            for sub in sub_urls:
                try:
                    site = self.base_url[:-1] + sub
                    soup = self.open_url(site)
                    if soup is None:
                        continue

                    info = self.fetch_sidebar(soup)
                    info = self.read_li(info)
                    phone = self.get_phone(info)
                    phone = self.remove_html_tags(phone).strip()
                    email = self.get_email(info)
                    email = self.remove_html_tags(email).replace("/", "")
                    writer.writerow([site, phone, email])
                    print(f'{site} ; {phone} ; {email}')
                except Exception as e:
                    print(e)
                    continue

if __name__ == "__main__":
    crawler = Crawler()
    crawler.crawl_site()

    
# In this class I have changed
#     phone = [s for s in filter(lambda x: 'Telefoon' in str(x), info)]
# to
#     def filter_phone(self, info):
#         return [s for s in info if 'Telefoon' in str(s)]
#
# and
#     phone = [s for s in filter(lambda x: re.findall(reg, str(x)), info)]
# to
#     def filter_reg(self, info, reg):
#         return [s for s in info if re.findall(reg, str(s))]
#
# and
#     email = [s for s in filter(lambda x: '@' in str(x), soup)]
# to
#     def filter_email(self, soup):
#         return [s for s in soup if '@' in str(s)]






Fetching URLs
HTTPError: 404 for URL: https://sport050.nl/sportaanbieders/alle-aanbieders/
Failed to fetch the main URL. Exiting.
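
The nested if/else in `get_phone` only ever needs the first matching item, so one possible further simplification is `next()` over a generator expression. This is a sketch, not part of the submitted class: `first_match`, `has_label`, `looks_like_phone`, and `find_phone` are hypothetical names, the regular expression is copied from `get_phone`, and the final `.replace(...)` clean-up step is left out.

```python
import re

# Pattern copied from get_phone in the cell above.
PHONE_PATTERN = r"(?:(?:00|\+)?[0-9]{4})?(?:[ .-][0-9]{3}){1,5}"


def first_match(items, predicate, default=""):
    """Return str() of the first item satisfying predicate, else default."""
    return next((str(item) for item in items if predicate(item)), default)


def has_label(item):
    return "Telefoon" in str(item)


def looks_like_phone(item):
    return re.search(PHONE_PATTERN, str(item)) is not None


def find_phone(info):
    """Prefer an item labelled 'Telefoon'; fall back to the regex."""
    return first_match(info, has_label) or first_match(info, looks_like_phone)


print(find_phone(["Facebook", "Telefoon: 050 123 456"]))
```

Because the generator expression stops at the first hit, the remaining items are never converted to strings or tested.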


# Step 2: Change the loop into an iterator


In [31]:
# Here we change the loop over the sub-URLs into an iterator.
import urllib.request
import ssl
from bs4 import BeautifulSoup
import re

class Crawler:
    def __init__(self, base_url):
        self.base_url = base_url

    def hack_ssl(self):
        """Ignores the certificate errors"""
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        return ctx

    def open_url(self, url):
        """Reads URL file as a big string and cleans the HTML file to make it more readable. Input: URL, output: soup object"""
        ctx = self.hack_ssl()
        try:
            response = urllib.request.urlopen(url, context=ctx)
            html = response.read()
            soup = BeautifulSoup(html, 'html.parser')
            return soup
        except urllib.error.HTTPError as e:
            print(f"HTTP Error: {e.code} - {e.reason}")
            return None
        except urllib.error.URLError as e:
            print(f"URL Error: {e.reason}")
            return None
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            return None

    def read_hrefs(self, soup):
        """Get a list of all anchor tags from the soup object (callers read each tag's href). Input: soup object"""
        return [tag for tag in soup('a')]

    def read_li(self, soup):
        """Get from soup object a list of list items. Input: soup object"""
        return [tag for tag in soup('li')]

    def get_phone(self, info):
        """Extracts phone number from the list of information. Input: list of BeautifulSoup objects"""
        reg = r"(?:(?:00|\+)?[0-9]{4})?(?:[ .-][0-9]{3}){1,5}"
        phone = [s for s in info if 'Telefoon' in str(s)]
        if phone:
            phone = str(phone[0])
        else:
            phone = [s for s in info if re.findall(reg, str(s))]
            if phone:
                phone = str(phone[0])
            else:
                phone = ""
        return phone.replace('Facebook', '').replace('Telefoon:', '')

    def get_email(self, soup):
        """Extracts email from the soup object. Input: soup object"""
        try:
            email = [s for s in soup if '@' in str(s)]
            if email:
                email = str(email[0])[4:-5]
                bs = BeautifulSoup(email, features="html.parser")
                email = bs.find('a').attrs['href'].replace('mailto:', '')
            else:
                email = ""
        except:
            email = ""
        return email

    def remove_html_tags(self, text):
        """Remove HTML tags from a string"""
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)

    def fetch_sidebar(self, soup):
        """Reads HTML file and extracts sidebar. Input: HTML, output: sidebar"""
        sidebar = soup.findAll(attrs={'class': 'sidebar'})
        if sidebar:
            return sidebar[0]
        return None

    def extract(self, url):
        """Extracts and formats the URL part needed"""
        text = str(url)
        return text.split('"')[0] + "/"

    class SubUrlIterator:
        def __init__(self, hrefs):
            self.hrefs = hrefs
            self.index = 0

        def __iter__(self):
            return self

        def __next__(self):
            if self.index >= len(self.hrefs):
                raise StopIteration
            href = self.hrefs[self.index]
            self.index += 1
            return href

    def crawl_site(self):
        """Main method to start crawling."""
        print('fetch urls')
        soup = self.open_url(self.base_url)
        if soup is None:
            print("Failed to retrieve the URL. Exiting.")
            return

        reflist = self.read_hrefs(soup)

        print('getting sub-urls')
        sub_urls = [tag.get('href') for tag in reflist if '/sportaanbieders' in tag.get('href', '')]
        sub_urls = sub_urls[3:]

        print('extracting the data')
        print(f'{len(sub_urls)} sub-urls')

        # Use the SubUrlIterator here
        for sub in self.SubUrlIterator(sub_urls):
            try:
                sub = self.extract(sub)
                site = self.base_url.rstrip('/') + sub
                print(f"Fetching site: {site}")
                soup = self.open_url(site)
                if soup is None:
                    print(f"Failed to retrieve the site: {site}. Skipping.")
                    continue
                info = self.fetch_sidebar(soup)
                if info is None:
                    print(f"No sidebar found for site: {site}. Skipping.")
                    continue
                info = self.read_li(info)
                phone = self.get_phone(info)
                phone = self.remove_html_tags(phone).strip()
                email = self.get_email(info)
                email = self.remove_html_tags(email).replace("/", "")
                print(f'{site} ; {phone} ; {email}')
            except Exception as e:
                print(f"An error occurred while processing {sub}: {e}")
                continue


def main():
    """Main function to create a Crawler instance and start crawling."""
    base_url = "https://sport050.nl/sportaanbieders/alle-aanbieders/"
    crawler = Crawler(base_url)
    crawler.crawl_site()

if __name__ == "__main__":
    main()

# Conversion to the iterator pattern:
#
# The nested SubUrlIterator class implements the iterator protocol by defining the __iter__ and __next__ methods.
# __iter__ method: returns the iterator object itself, so an instance can be used in iteration contexts (e.g., in a for loop).
# __next__ method: returns the next sub-URL from the list and advances the index. If there are no more sub-URLs, it raises StopIteration.
# Loop transformation: crawl_site() now iterates over self.SubUrlIterator(sub_urls) instead of over the list directly; fetching and parsing each page still happens in the loop body.


fetch urls
HTTP Error: 404 - Not Found
Failed to retrieve the URL. Exiting.
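
The iterator from the cell above can also be driven by hand, which is where the extra control over execution comes from. In this sketch the class body is repeated so the snippet runs on its own, and the `hrefs` list is invented sample data.

```python
class SubUrlIterator:
    """Same logic as the nested class above, repeated so this runs alone."""

    def __init__(self, hrefs):
        self.hrefs = hrefs
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.index >= len(self.hrefs):
            raise StopIteration
        href = self.hrefs[self.index]
        self.index += 1
        return href


hrefs = ["/a", "/b", "/c"]  # invented sample paths

manual = SubUrlIterator(hrefs)
first = next(manual)         # pull a single item by hand
rest = list(manual)          # consume whatever is left
builtin = list(iter(hrefs))  # the built-in list iterator yields the same sequence
print(first, rest, builtin)
```

For a plain list, `iter(hrefs)` already provides this behaviour; a custom iterator class becomes more useful once it carries extra state, such as a limit or a set of visited URLs.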


# Step 3: Make use of a generator

In [33]:
# Below is a summary of the refactoring of the Crawler class to incorporate a generator:

# Generator implementation: added the __iter__() method to make Crawler iterable. Because the method contains yield,
# calling it returns a generator that yields tuples of crawled data (site URL, phone number, and email).
import urllib.request
import ssl
from bs4 import BeautifulSoup
import re

class Crawler:
    def __init__(self, base_url):
        self.base_url = base_url

    def hack_ssl(self):
        """Ignores the certificate errors"""
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        return ctx

    def open_url(self, url):
        """Reads URL file as a big string and cleans the HTML file to make it more readable. Input: URL, output: soup object"""
        ctx = self.hack_ssl()
        try:
            response = urllib.request.urlopen(url, context=ctx)
            html = response.read()
            soup = BeautifulSoup(html, 'html.parser')
            return soup
        except urllib.error.HTTPError as e:
            print(f"HTTP Error: {e.code} - {e.reason}")
            return None
        except urllib.error.URLError as e:
            print(f"URL Error: {e.reason}")
            return None
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            return None

    def read_hrefs(self, soup):
        """Get a list of all anchor tags from the soup object (callers read each tag's href). Input: soup object"""
        return [tag for tag in soup('a')]

    def read_li(self, soup):
        """Get from soup object a list of list items. Input: soup object"""
        return [tag for tag in soup('li')]

    def get_phone(self, info):
        """Extracts phone number from the list of information. Input: list of BeautifulSoup objects"""
        reg = r"(?:(?:00|\+)?[0-9]{4})?(?:[ .-][0-9]{3}){1,5}"
        phone = [s for s in info if 'Telefoon' in str(s)]
        if phone:
            phone = str(phone[0])
        else:
            phone = [s for s in info if re.findall(reg, str(s))]
            if phone:
                phone = str(phone[0])
            else:
                phone = ""
        return phone.replace('Facebook', '').replace('Telefoon:', '')

    def get_email(self, soup):
        """Extracts email from the soup object. Input: soup object"""
        try:
            email = [s for s in soup if '@' in str(s)]
            if email:
                email = str(email[0])[4:-5]
                bs = BeautifulSoup(email, features="html.parser")
                email = bs.find('a').attrs['href'].replace('mailto:', '')
            else:
                email = ""
        except:
            email = ""
        return email

    def remove_html_tags(self, text):
        """Remove HTML tags from a string"""
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)

    def fetch_sidebar(self, soup):
        """Reads HTML file and extracts sidebar. Input: HTML, output: sidebar"""
        sidebar = soup.findAll(attrs={'class': 'sidebar'})
        if sidebar:
            return sidebar[0]
        return None

    def extract(self, url):
        """Extracts and formats the URL part needed"""
        text = str(url)
        return text.split('"')[0] + "/"

    def __iter__(self):
        """Generator to yield crawled websites"""
        print('fetch urls')
        soup = self.open_url(self.base_url)
        if soup is None:
            print("Failed to retrieve the URL. Exiting.")
            return
        
        reflist = self.read_hrefs(soup)

        print('getting sub-urls')
        sub_urls = [tag.get('href') for tag in reflist if '/sportaanbieders' in tag.get('href', '')]
        sub_urls = sub_urls[3:]

        print('extracting the data')
        print(f'{len(sub_urls)} sub-urls')

        for sub in sub_urls:
            try:
                sub = self.extract(sub)
                site = self.base_url.rstrip('/') + sub
                print(f"Fetching site: {site}")
                soup = self.open_url(site)
                if soup is None:
                    print(f"Failed to retrieve the site: {site}. Skipping.")
                    continue
                info = self.fetch_sidebar(soup)
                if info is None:
                    print(f"No sidebar found for site: {site}. Skipping.")
                    continue
                info = self.read_li(info)
                phone = self.get_phone(info)
                phone = self.remove_html_tags(phone).strip()
                email = self.get_email(info)
                email = self.remove_html_tags(email).replace("/", "")
                yield (site, phone, email)
            except Exception as e:
                print(f"An error occurred while processing {sub}: {e}")
                continue


def main():
    """Main function to create a Crawler instance and use the generator."""
    base_url = "https://sport050.nl/sportaanbieders/alle-aanbieders/"
    crawler = Crawler(base_url)

    # Use the generator to get a few results
    for site, phone, email in crawler:
        print(f'{site} ; {phone} ; {email}')
        # Optional: break after a few results for testing
        # break  # Uncomment this line to limit output during testing

if __name__ == "__main__":
    main()

    
# This approach leverages Python's generator capabilities to efficiently handle and
# iterate over a potentially large set of data.

fetch urls
HTTP Error: 404 - Not Found
Failed to retrieve the URL. Exiting.
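
A generator only does work when a value is requested, so a caller can stop early without fetching every page. The sketch below demonstrates this offline with `itertools.islice`; `crawl`, `fake_fetch`, and the URL list are hypothetical stand-ins for the real crawler and its network requests.

```python
from itertools import islice


def crawl(sub_urls, fetch):
    """Yield (url, payload) pairs one at a time; `fetch` stands in for the real request."""
    for url in sub_urls:
        yield url, fetch(url)


fetched = []  # records which URLs were actually requested


def fake_fetch(url):
    fetched.append(url)
    return f"content of {url}"


urls = [f"/provider-{n}" for n in range(100)]

# Take only the first three results; the other 97 URLs are never fetched.
first_three = list(islice(crawl(urls, fake_fetch), 3))
print(len(first_three), len(fetched))
```

This is the same effect as the commented-out `break` in `main()`, expressed without modifying the loop body.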


# Step 4: Come up with enhancements

1. Single Responsibility Principle (SRP)
Current Issue:

The Crawler class handles multiple responsibilities: managing URLs, fetching and processing HTML, and filtering links.
Enhancement:

Separation of Concerns: extract the different responsibilities into separate classes. For instance:
Fetcher: handles the fetching of HTML content.
LinkExtractor: extracts and filters links from the HTML.
Crawler: manages the overall crawling process.
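
The split into Fetcher, LinkExtractor, and Crawler could look roughly like the sketch below. It is an illustration under assumptions, not the assignment's implementation: the method names (`fetch`, `extract`, `sub_urls`) are hypothetical, the Fetcher serves canned pages from a dictionary instead of using the network, and the standard-library `html.parser` replaces BeautifulSoup so the snippet has no third-party dependencies.

```python
from html.parser import HTMLParser


class Fetcher:
    """Only responsibility: turn a URL into HTML text."""

    def __init__(self, pages):
        self.pages = pages  # canned pages keep the sketch offline

    def fetch(self, url):
        return self.pages.get(url, "")


class _AnchorCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class LinkExtractor:
    """Only responsibility: pull matching hrefs out of HTML text."""

    def __init__(self, must_contain):
        self.must_contain = must_contain

    def extract(self, html):
        collector = _AnchorCollector()
        collector.feed(html)
        return [h for h in collector.hrefs if self.must_contain in h]


class Crawler:
    """Only responsibility: coordinate the other two."""

    def __init__(self, fetcher, extractor):
        self.fetcher = fetcher
        self.extractor = extractor

    def sub_urls(self, start_url):
        return self.extractor.extract(self.fetcher.fetch(start_url))


pages = {"/start": '<a href="/sportaanbieders/x">x</a><a href="/contact">c</a>'}
crawler = Crawler(Fetcher(pages), LinkExtractor("/sportaanbieders"))
print(crawler.sub_urls("/start"))
```

Each class can now be tested and changed in isolation; for example, the link filter can be exercised on a string without any fetching at all.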

2. Open/Closed Principle (OCP)
Current Issue:

Adding new features or changing behavior might require modifying the Crawler class itself.

Enhancement:

Extend Behavior: design Fetcher and LinkExtractor so that their behavior can be extended without modifying them. For instance, we could subclass Fetcher to implement a different fetching strategy and pass that subclass to the Crawler through composition.
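
As a sketch of extension without modification, the snippet below adds caching in a subclass while the base class stays untouched. Both `Fetcher` and `CachingFetcher` are hypothetical here, and the base class returns canned text rather than making a request.

```python
class Fetcher:
    """Hypothetical base fetcher; returns canned text so the sketch stays offline."""

    def fetch(self, url):
        return f"<html>{url}</html>"


class CachingFetcher(Fetcher):
    """Adds caching by extension; Fetcher itself is not modified."""

    def __init__(self):
        self.cache = {}
        self.misses = 0

    def fetch(self, url):
        if url not in self.cache:
            self.misses += 1
            self.cache[url] = super().fetch(url)
        return self.cache[url]


fetcher = CachingFetcher()
fetcher.fetch("/a")
fetcher.fetch("/a")  # served from the cache, no second miss
print(fetcher.misses)
```

Any code that accepts a `Fetcher` can be handed a `CachingFetcher` instead, so the new behavior arrives without editing existing classes.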

3. Liskov Substitution Principle (LSP)
Current Issue:

If subclasses are added (like CustomFetcher), they must be substitutable for the Fetcher class without altering the expected behavior.

Enhancement:

Ensure that any subclass of Fetcher or LinkExtractor maintains the expected interface and behavior.
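
One way to make "expected interface and behavior" concrete is to write the contract down and check subclasses against it. The contract assumed here (return a string, or `None` on failure, and never raise) mirrors how `open_url` behaves in the cells above; the class names and `honours_contract` are hypothetical.

```python
class Fetcher:
    """Assumed contract: fetch(url) returns a str, or None on failure; it never raises."""

    def fetch(self, url):
        return "<html></html>"


class FailingFetcher(Fetcher):
    """Substitutable: signals failure with None, as the contract allows."""

    def fetch(self, url):
        return None


class RaisingFetcher(Fetcher):
    """Not substitutable: breaks the contract by raising."""

    def fetch(self, url):
        raise RuntimeError("boom")


def honours_contract(fetcher):
    """Return True if fetch() behaves as callers of Fetcher expect."""
    try:
        result = fetcher.fetch("/any")
    except Exception:
        return False
    return result is None or isinstance(result, str)


print([honours_contract(f) for f in (Fetcher(), FailingFetcher(), RaisingFetcher())])
```

A check like this could run as a shared test for every Fetcher subclass, so a violation shows up before the Crawler depends on it.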

5. Dependency Inversion Principle (DIP)
Current Issue:

The Crawler class is directly dependent on concrete implementations of Fetcher and LinkExtractor.

Enhancement:

Invert Dependencies: depend on abstractions (interfaces) rather than concrete implementations. This way, we can inject different implementations without changing the Crawler.
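
In Python, such an abstraction can be expressed with `typing.Protocol` (available since Python 3.8). The sketch below is one possible shape: `HtmlSource`, `InMemorySource`, and `page_length` are hypothetical names chosen for illustration, and the in-memory source replaces real network access.

```python
from typing import Optional, Protocol


class HtmlSource(Protocol):
    """Abstraction the crawler depends on (hypothetical name)."""

    def fetch(self, url: str) -> Optional[str]: ...


class Crawler:
    def __init__(self, source: HtmlSource):
        self.source = source  # injected, never constructed inside the class

    def page_length(self, url: str) -> int:
        html = self.source.fetch(url)
        return len(html) if html is not None else 0


class InMemorySource:
    """One concrete implementation; a network-backed one could replace it."""

    def __init__(self, pages):
        self.pages = pages

    def fetch(self, url):
        return self.pages.get(url)


crawler = Crawler(InMemorySource({"/a": "<p>hi</p>"}))
print(crawler.page_length("/a"), crawler.page_length("/missing"))
```

`InMemorySource` does not inherit from `HtmlSource`; it satisfies the protocol structurally, which keeps test doubles and alternative fetchers cheap to write.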