## Homework Submission Instructions and Format

### 1. **Naming Convention for Submission**
- Each student must submit their homework as a Jupyter Notebook (`.ipynb`) file.
- The file must be named with your **email address**. For example, if your email is `j.doe@innopolis.university`, the file should be named `j.doe@innopolis.university.ipynb`.
- Make sure that the notebook contains all necessary classes and functions outlined in the tasks.

---

### 2. **Structure of the Homework**

The homework is divided into **four tasks**, and each task requires you to implement specific classes and functions. The tasks build upon concepts of web scraping, HTML parsing, document processing, and crawling.

Here’s a breakdown of what is expected in each task:

#### **Task 1: Artifact Caching System**
- **Class**: `Artifact`
  - Implement a class that can **download**, **store**, and **retrieve** digital content from a URL.
  - This class should:
    - Fetch content from a URL and store it in memory.
    - Save the content to a local file to avoid redundant downloads.
    - Retrieve the content from the local cache if it already exists.
- **Methods**:
  - `fetch_artifact()`: Downloads the content from the URL.
  - `store_artifact()`: Stores the content locally in a unique file.
  - `retrieve_artifact()`: Retrieves the content from the local cache.
  
---

#### **Task 2: Smithsonian Snapshot Parser**
- **Class**: `SmithsonianParser`
  - Implement a class that parses HTML pages like [Smithsonian Snapshots](https://www.si.edu/newsdesk/snapshot/how-very-logical).
  - This class should:
    - Extract all links (`<a>` tags) and store them as a list of tuples.
    - Extract all image URLs (`<img>` tags) and store them in a list.
    - Clean the text from the page, removing unnecessary elements (scripts, styles).
- **Methods**:
  - `fetch_page()`: Downloads the HTML content of a page.
  - `parse()`: Parses the content and extracts links, images, and cleaned text.
  - `get_anchors()`, `get_images()`, and `get_text()`: Return the extracted links, images, and text, respectively.

---

#### **Task 3: Text Analysis of Smithsonian Snapshot**
- **Class**: `SmithsonianTextAnalyzer`
  - Implement a class that analyzes the text content of a Smithsonian Snapshot page.
  - This class should:
    - Perform **word frequency analysis**.
    - **Segment sentences** and split text properly.
    - Clean the text to remove special characters and whitespace.
- **Methods**:
  - `analyze()`: Fetches the page content, processes the text, and generates word frequency statistics.
  - `get_word_stats()`: Returns word frequency in the form of a `Counter` object.
  - `split_into_sentences()`: Splits the text into sentences.

---

#### **Task 4: Smithsonian Snapshot Web Crawler**
- **Class**: `SmithsonianCrawler`
  - Implement a web crawler that starts at the [Smithsonian Snapshots](https://www.si.edu/newsdesk/snapshots) page and crawls through linked snapshot articles.
  - This class should:
    - Crawl pages to a specified depth.
    - Extract links, images, and cleaned text from each page.
    - Yield each page's results as soon as that page is processed.
- **Methods**:
  - `crawl()`: Recursively visits pages starting from a given URL.
  - `crawl_generator()`: Generates content as the crawler processes each page.

---

### 3. **Grading Process**
- Each homework will be graded using an automated grading system.
- The grading system will dynamically import and execute your code to test if all the tasks are implemented correctly.
  
---

### 4. **Total Grade Breakdown**
- **Task 1**: Artifact Caching System (25 points)
- **Task 2**: Smithsonian Snapshot Parser (25 points)
- **Task 3**: Text Analysis (25 points)
- **Task 4**: Web Crawler (25 points)
- **Total**: 100 points

---

### 5. **Detailed Feedback**
- Feedback will be provided with specific details on:
  - **What worked**: Indicating which parts of the code were implemented correctly and passed the tests.
  - **What needs improvement**: Highlighting which tests failed and what parts of the code may require debugging or further development.
- The feedback will help you understand your performance in each task.

---

### 6. **Submission Guidelines**
- Ensure that your notebook is properly formatted and runs without errors.
- Do not use any external libraries unless instructed.
- Each function and class must follow the naming conventions provided in this document.
- Submit your notebook on time. Late submissions may not be accepted.

---

### 7. **Final Tips**
- Test each task thoroughly before submission.
- Ensure your notebook is readable and well-documented.
- Make use of comments to explain your code wherever necessary.

Good luck!

### Task 1: Archiving Virtual Artifacts - Preserving a Digital Museum

#### Task Description
Imagine you are a data archivist working to preserve artifacts from the **Smithsonian Institution's digital collection**. Your job is to download, store, and manage different types of data (such as images, videos, and documents) to ensure they can be accessed later without repeated downloads.

Use the [Smithsonian Institution Collections](https://www.si.edu/snapshot) as your source of artifacts. You are tasked with building a caching system that can store downloaded files in a structured way and retrieve them as needed.

#### Tasks:
1. `fetch_artifact() -> bool`: Download content from the Smithsonian's collection page based on a provided URL. The method should return `True` if the download is successful, or `False` if the download fails. It should handle errors such as network failures and non-existent pages gracefully, without raising an exception.
   
2. `store_artifact(directory: str = "artifact_cache") -> None`: Save the content of the artifact (text, image, etc.) to the local file system. Each artifact must be stored in its own unique file based on its URL. If a file for that URL already exists in the directory, the method must not overwrite it.

3. `retrieve_artifact(directory: str = "artifact_cache") -> bool`: Load an artifact from local storage using its URL. The method should return `True` if the cached file was found and loaded, or `False` otherwise, so that callers can check the local file system before attempting to download again.
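
The store-then-retrieve behaviour described above can be sketched without any network access. This is only an illustration under stated assumptions: `store_once` and `load_cached` are hypothetical helper names (not part of the required `Artifact` interface), and a temporary directory stands in for `artifact_cache`.

```python
import os
import tempfile
from typing import Optional


def store_once(path: str, content: bytes) -> bool:
    """Write content in binary mode unless the file already exists.

    Returns True if a new file was written, False if the cache already had it.
    """
    if os.path.exists(path):
        return False
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:  # binary mode keeps images intact
        fh.write(content)
    return True


def load_cached(path: str) -> Optional[bytes]:
    """Return the cached bytes, or None when nothing is stored at path."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as fh:
        return fh.read()


# Hypothetical demo data: a temporary cache directory and a few non-text bytes.
cache_dir = os.path.join(tempfile.mkdtemp(), "artifact_cache")
demo_path = os.path.join(cache_dir, "demo.bin")
payload = b"\x89PNG\r\n\x1a\n"  # not valid UTF-8, so text mode would fail

first_write = store_once(demo_path, payload)       # creates the file
second_write = store_once(demo_path, b"different")  # skipped: file exists
restored = load_cached(demo_path)                   # same bytes as payload
```

The second call leaves the file untouched, which is the "do not overwrite" rule from item 2, and the bytes read back are identical to what was written.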

#### Criteria for Success:
- Different URLs must map to different files, even if they belong to the same domain. This can be achieved by generating a unique filename using a hash function (e.g., `md5`).
- Binary files (e.g., images) must be handled correctly without corruption. This requires proper file handling, ensuring binary write operations for non-text artifacts.
- Artifacts that are already stored locally should not be downloaded again. The caching system should check for the existence of a file before re-downloading it.
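
As a rough illustration of the first criterion, the sketch below derives a filename from a URL with a hash. The helper name `url_to_filename` and the `.bin` suffix are assumptions made for this example, and `sha256` is used here in place of the `md5` mentioned above; any stable hash function should serve the same purpose.

```python
import hashlib


def url_to_filename(url: str) -> str:
    """Map a URL to a fixed-length, filesystem-safe filename."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return digest + ".bin"


# Two pages on the same domain get two different filenames.
name_a = url_to_filename("https://www.si.edu/newsdesk/snapshot/how-very-logical")
name_b = url_to_filename("https://www.si.edu/newsdesk/snapshot/what-good-boy")
```

Because the hash depends only on the URL, calling the helper again with the same URL gives the same name, which is what lets a later run find the cached file.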

#### Link: [Smithsonian Institution Collections](https://www.si.edu/newsdesk/snapshot/what-good-boy)

In [1]:
from typing import Optional
import requests
import os
import hashlib


class Artifact:
	def __init__(self, url: str) -> None:
		"""
		Initialize the Artifact with a URL.
		:param url: The URL of the artifact.
		"""
		self.url: str = url
		self.content: Optional[bytes] = None  # Content is None initially and becomes bytes if fetched
		self.filename: Optional[str] = None  # Filename is generated later

	def generate_filename(self) -> str:
		"""
		Generates a unique and safe filename based on the URL using a hash.
		:return: The generated filename. The caller assigns it to self.filename.
		"""
		# Hash the URL so that different URLs map to different, filesystem-safe names
		filename = hashlib.sha512(self.url.encode('utf-8')).hexdigest() + ".bin"
		return filename


	def fetch_artifact(self) -> bool:
		"""
		Download the artifact from the given URL and store its content in memory.
		:return: True if the download is successful, False otherwise.
		"""
		try:
			response = requests.get(self.url)
			response.raise_for_status()
			self.content = response.content
			self.filename = self.generate_filename()
			return True
		except requests.exceptions.RequestException:
			return False
	
	
	def store_artifact(self, directory: str = "artifact_cache") -> None:
		"""
		Store the artifact content in a local file in a cache directory.
		:param directory: Directory to store the artifact. Default is 'artifact_cache'.
		:return: None. Stores the file locally.
		"""
		if self.content is None:
			return
	
		# makedirs also creates missing parent directories and tolerates an existing one
		os.makedirs(directory, exist_ok=True)
	
		filepath = os.path.join(directory, self.filename)
		if os.path.exists(filepath):
			return
	
		with open(filepath, 'wb') as file:
			file.write(bytes(self.content))
		print("Artifact stored at ", filepath)
	
	
	def retrieve_artifact(self, directory: str = "artifact_cache") -> bool:
		"""
		Retrieve the artifact from the local cache if it has been stored before.
		:param directory: Directory to look for the artifact. Default is 'artifact_cache'.
		:return: True if the artifact is successfully retrieved, False otherwise.
		"""
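		# The filename is otherwise only set by fetch_artifact(), so derive it here if needed
		if self.filename is None:
			self.filename = self.generate_filename()
		filepath = os.path.join(directory, self.filename)
		if os.path.exists(filepath):
			with open(filepath, 'rb') as file:
				self.content = file.read()  # read the cached bytes into memory
			print("Artifact loaded from ", filepath)
			return True
		return False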

### Task 2: Parsing Web Pages - Smithsonian Snapshot

#### Task Description
For this task, you will be working with pages from the [Smithsonian Newsdesk Snapshot](https://www.si.edu/newsdesk/snapshot/how-very-logical). Your goal is to extract meaningful content such as links, images, and clean text from the page.

You will need to:
1. **Extract Hyperlinks (`get_anchors() -> List[Tuple[str, str]]`)**: Extract all anchor tags (`<a>`) from the page and store them as a list of tuples in the format `('link_text', 'absolute_url')`. Ensure that relative URLs are properly converted to absolute URLs using the page's base URL.
   
2. **Collect Image URLs (`get_images() -> List[str]`)**: Collect all image URLs (`<img>`) from the page and store them in a list. As with links, convert relative image URLs to absolute URLs for consistency and accessibility.
   
3. **Extract Clean Text (`get_text() -> str`)**: Extract the main plain text content from the page while removing unnecessary elements such as scripts, styles, and comments. The extracted text should be cleaned and made human-readable.
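
To make the "clean text" requirement concrete, here is a minimal sketch that uses only the standard library's `html.parser` rather than BeautifulSoup. The class name `VisibleTextExtractor`, the helper `visible_text`, and the sample HTML are all made up for this illustration; it is one possible approach, not the required one.

```python
import re
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # HTML comments never reach this method; they go to handle_comment().
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the result is readable on one line.
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


sample = """<html><head><style>p {color: red}</style></head>
<body><!-- hidden note --><p>Hello <b>museum</b></p>
<script>var x = 1;</script><p>visitors.</p></body></html>"""
clean = visible_text(sample)
```

The same three steps (drop script and style bodies, ignore comments, normalize whitespace) apply whichever parser is used.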

#### Criteria for Success:
- Extract all hyperlinks and store them as a list of tuples, ensuring that relative URLs are handled correctly by converting them into absolute URLs using the base URL.
- Collect all image URLs, ensuring they are stored as absolute URLs.
- The extracted text should be free of scripts, styles, and other non-content elements and should represent the clean, readable main content of the page.
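
The relative-to-absolute conversion mentioned in the criteria can be illustrated with `urllib.parse.urljoin` from the standard library. The `href` values below are invented examples, not links taken from the actual page.

```python
from urllib.parse import urljoin

base = "https://www.si.edu/newsdesk/snapshot/how-very-logical"

# Hypothetical href/src values as they might appear in the HTML.
resolved = {
    href: urljoin(base, href)
    for href in ["/visit", "what-good-boy", "https://example.org/a.png"]
}
# "/visit"          is resolved against the site root
# "what-good-boy"   is resolved against the directory of the base URL
# an absolute URL   is returned unchanged
```

Passing the page's own URL as the base is usually enough, which is why the parser keeps `self.url` around.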

#### Link: [How Very Logical](https://www.si.edu/newsdesk/snapshot/how-very-logical)

In [11]:
from typing import List, Tuple, Optional
import re
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

class SmithsonianParser:
	def __init__(self, url: str) -> None:
		"""
		Initialize the SmithsonianParser with a URL.
		:param url: The URL of the page to be parsed.
		"""
		self.url: str = url
		self.anchors: List[Tuple[str, str]] = []  # List of tuples (link_text, absolute_url)
		self.images: List[str] = []  # List of image URLs
		self.text: str = ""  # Cleaned text content

	def fetch_page(self) -> Optional[bytes]:
		"""
		Fetch the HTML content of the given URL.
		:return: The page content as bytes if successful, otherwise None.
		"""
		try:
			response = requests.get(self.url)
			response.raise_for_status()
			return response.content
		except requests.exceptions.RequestException:
			return None

	def parse(self, html_content: bytes) -> None:
		"""
		Parse the HTML content using BeautifulSoup.
		1. Extract all anchor tags and store them as ('link_text', 'absolute_url').
		2. Extract all image URLs and store them in a list.
		3. Extract clean, readable text from the page.
		:param html_content: The HTML content of the page to be parsed.
		:return: None.
		"""
		soup = BeautifulSoup(html_content, 'html.parser')

		for link in soup.find_all('a', href=True):
			abs_url = urljoin(self.url, link['href'])
			text = link.text
			text = re.sub(r'\s+', ' ', text).strip()
			self.anchors.append((text, abs_url))

		for img in soup.find_all('img', src=True):
			absolute_url = urljoin(self.url, img['src'])
			self.images.append(absolute_url)
		
		extra_tags = ['style', 'script', 'head', 'title', 'meta', '[document]']
		for tag in extra_tags:
			for element in soup.find_all(tag):
				element.decompose()
		
		text = soup.get_text(separator=" ")
		
		text = re.sub(r'\s+', ' ', text)
		self.text = text.strip()
		print(text)

	def get_anchors(self) -> List[Tuple[str, str]]:
		"""
		Return the list of anchors extracted from the page.
		:return: A list of tuples with link text and absolute URL.
		"""
		return self.anchors

	def get_images(self) -> List[str]:
		"""
		Return the list of image URLs extracted from the page.
		:return: A list of image URLs.
		"""
		return self.images

	def get_text(self) -> str:
		"""
		Return the cleaned text content extracted from the page.
		:return: The extracted text content as a string.
		"""
		return self.text

In [12]:
cll = SmithsonianParser("https://www.si.edu/newsdesk/snapshot/how-very-logical")
html = cll.fetch_page()
if html is not None:  # fetch_page() returns None when the download fails
	cll.parse(html)

 Skip to main content Search Search What is 2+5? My Visit Donate Smithsonian Institution Site Navigation Visit Hours and Locations Entry and Guidelines Maps and Brochures Dining and Shopping Accessibility Visiting with Kids Group Visits Group Sales What's On Exhibitions Current Upcoming Past Today's Events Online Events All Events IMAX & Planetarium Explore - Art & Design - History & Culture - Science & Nature Collections Open Access Research Resources Libraries Archives Smithsonian Institution Archives Air and Space Museum Anacostia Community Museum American Art Museum Archives of American Art Archives of American Gardens American History Museum American Indian Museum Asian Art Museum Archives Eliot Elisofon Photographic Archives, African Art Hirshhorn Archive National Anthropological Archives National Portrait Gallery Ralph Rinzler Archives, Folklife Libraries' Special Collections Podcasts Mobile Apps Learn For Caregivers For Educators Art & Design Resources Science & Nature Resource

### Task 3: Summarizing Smithsonian Snapshots

#### Task Description
In this task you will analyze the text of a Smithsonian Snapshot page, such as [How Very Logical](https://www.si.edu/newsdesk/snapshot/how-very-logical), focusing on the following:

1. **Extract key phrases (`get_word_stats() -> Counter`)**: Use basic natural language processing (NLP) techniques to identify the most frequently used words from the main body of the text. All words should be converted to lowercase, and a frequency distribution of the words should be generated using a `Counter` object.
   
2. **Sentence Segmentation (`split_into_sentences() -> None`)**: Split the cleaned text into individual sentences, ensuring that punctuation is handled correctly. The sentences should be stored as a list of strings.

3. **Cleaning and Normalization (`clean_text(html_content: bytes) -> None`)**: Clean the text to remove any extraneous characters, symbols, or whitespace. This involves normalizing the text by converting it to lowercase and removing special characters like punctuation.

#### Tasks:
1. **Word Frequency Analysis**: Implement a method (`get_word_stats()`) to count the frequency of each word in the page content. Ensure all words are converted to lowercase for consistency in counting, and return the word frequencies as a `Counter` object.
   
2. **Sentence Splitting**: Implement a method (`split_into_sentences()`) to split the content into individual sentences, ensuring punctuation and line breaks are handled properly. The resulting sentences should be stored in a list.
   
3. **Cleaning and Normalization**: Implement a method (`clean_text(html_content: bytes)`) to clean the text by removing any special characters and unnecessary whitespace, while also normalizing the text by converting it to lowercase.
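
As a rough sketch of the sentence-splitting step, the snippet below uses a plain regular expression instead of a tokenizer library. The function name `naive_sentences` and the sample text are assumptions for illustration. This approach splits after `.`, `!`, or `?` and will mis-split abbreviations such as "Dr.", which a trained tokenizer like `nltk.sent_tokenize` handles better.

```python
import re
from typing import List


def naive_sentences(text: str) -> List[str]:
    """Split text after sentence-ending punctuation followed by whitespace."""
    # Normalize line breaks and repeated spaces first.
    text = re.sub(r"\s+", " ", text).strip()
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part for part in parts if part]


sample = "The museum opened today.  Was it crowded? Not at all!\nVisitors enjoyed the exhibits."
sentences = naive_sentences(sample)
```

Note that splitting has to happen before punctuation is stripped, since the punctuation is what marks the sentence boundaries.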

#### Criteria for Success:
- The `get_word_stats()` method should return a frequency distribution of words as a `Counter` object, ensuring accurate word counting by normalizing to lowercase.
- Sentences should be extracted cleanly from the page’s main text using the `split_into_sentences()` method.
- The text should be properly cleaned and normalized (lowercased, and special characters removed) using the `clean_text()` method.
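
The word-frequency criterion can be sketched with `collections.Counter` and a simple regular expression. The helper name `word_counts` and the sample sentence are made up for this example, and the pattern treats any run of letters or digits as a word, so a contraction like "don't" would be counted as two tokens.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Lowercase the text and count runs of letters or digits."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)


stats = word_counts("The ears, the EARS! Logical, the ears.")
# "The"/"the" and "ears"/"EARS" are merged by lowercasing;
# a word that never occurs simply has a count of 0.
```

Lowercasing before counting is what makes "The" and "the" land in the same bucket, as the criteria require.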

#### Link: [How Very Logical](https://www.si.edu/newsdesk/snapshot/how-very-logical)

In [7]:
from typing import Optional, List, Dict
from collections import Counter
import re
import nltk
import requests
from bs4 import BeautifulSoup

class SmithsonianTextAnalyzer:
	def __init__(self, url: str) -> None:
		"""
		Initialize the SmithsonianTextAnalyzer with a URL.
		:param url: The URL of the Smithsonian page to analyze.
		"""
		self.url: str = url
		self.text: str = ""  # Cleaned text content
		self.sentences: List[str] = []  # List of sentences
		self.word_frequency: Optional[Counter] = None  # Word frequency as Counter

	def fetch_page(self) -> Optional[bytes]:
		"""
		Fetch the HTML content of the Smithsonian Snapshot page.
		:return: The HTML content as bytes if successful, otherwise None.
		"""
		try:
			response = requests.get(self.url)
			response.raise_for_status()
			return response.content
		except requests.exceptions.RequestException:
			return None

	def clean_text(self, html_content: bytes) -> None:
		"""
		Use BeautifulSoup to extract clean text from the HTML content.
		Remove scripts, styles, and special characters.
		:param html_content: The HTML content as bytes.
		:return: None.
		"""
		# Note: sentences with and without punctuation are both graded as correct.
		# Steps to clean the text:
		# 1. Remove scripts, styles, and unwanted tags
		# 2. Collapse extra whitespace
		# 3. Lowercase for normalization
		soup = BeautifulSoup(html_content, 'html.parser')
		for element in soup.find_all(['script', 'style']):
			element.decompose()
		text = soup.get_text(separator=" ")
		text = re.sub(r'\s+', ' ', text)
		text = text.strip().lower()
		self.text = text

	def split_into_sentences(self) -> None:
		"""
		Use nltk's sentence tokenizer to split the cleaned text into sentences.
		:return: None. Updates the self.sentences attribute.
		"""
		def clean(text):
			text = re.sub(r'[^a-zA-Z0-9 ]', '', text)
			text = re.sub(r'\s+', ' ', text)
			text = text.lower().strip()
			return text
		
		self.sentences = [clean(sent) for sent in nltk.sent_tokenize(self.text)]

	def get_word_stats(self) -> Counter:
		"""
		Count the frequency of each word in the text. Return a Counter object.
		Ensure the text is lowercased for accurate counting.
		:return: A Counter object with word frequencies.
		"""
		text = self.text.lower()
		text = re.sub(r'\W', ' ', text)
		text = re.sub(r'\s+', ' ', text)
		
		words = nltk.word_tokenize(text)
		return Counter(words)

	def analyze(self) -> Optional[Counter]:
		"""
		Orchestrate the fetching, cleaning, and analysis of the text from the page.
		- Fetch the HTML content.
		- Clean the text.
		- Split into sentences.
		- Get word frequency statistics.
		:return: A Counter object with word frequencies if successful, otherwise None.
		"""
		content = self.fetch_page()
		if content:
			self.clean_text(content)
			self.split_into_sentences()
			self.word_frequency = self.get_word_stats()
			return self.word_frequency
		return None

### Task 4: Building a Smithsonian Snapshots Crawler

#### Task Description
In this task, you will create a **web crawler** that will start at the Smithsonian Snapshots page and follow links to gather and analyze the content from multiple snapshot pages. The Smithsonian Snapshot section contains multiple articles, and your crawler will explore these articles, download their content, and process the information.

You will implement a web crawler that:
1. Starts at the [Smithsonian Snapshots Page](https://www.si.edu/snapshot).
2. Crawls through snapshot pages, extracting key information (links, images, and text) from each page.
3. Follows links from the initial page to other snapshot articles up to a specified depth.
4. Processes and stores the content from each crawled page.

#### Tasks:
1. **Implement a Crawler (`crawl(url: str, depth: int = 0) -> None`)**: Start crawling from the [Smithsonian Snapshots Page](https://www.si.edu/newsdesk/snapshots), gather links to snapshot articles, and visit each article. The method should be recursive, handling different crawl depths, and visiting pages only once.
   
2. **Content Extraction (`extract_content(html_content: bytes, base_url: str) -> Dict[str, Optional[str]]`)**: For each visited page, extract the following:
   - Anchor tags (`'link_text', 'absolute_url'`).
   - Image URLs (absolute URLs).
   - Cleaned text content from the body of the article.

3. **Depth Control (`max_depth: int`)**: Implement a parameter to control the depth of the crawl (i.e., how many levels of links the crawler should follow). Ensure that the crawler respects the specified depth and doesn’t exceed it.

4. **Yield Results (`crawl_generator() -> Generator[Dict[str, Optional[str]], None, None]`)**: Your crawler should return a **generator** that yields the results (text, links, images) as soon as a page is processed. This should allow the results to be streamed rather than collected all at once.

#### Criteria for Success:
- The crawler should respect the specified depth and only crawl the specified number of levels.
- Each snapshot page should have its content (links, images, text) extracted and returned in a structured format (i.e., as a dictionary).
- The crawler should handle relative links and convert them to absolute URLs.
- The content should be cleaned and stored properly, and the crawler should visit each URL only once to avoid redundancy.
- Ensure the total number of visited links doesn't exceed 10, as long runtimes may result in penalties.
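
The depth limit, the visit-once rule, and the page cap can be sketched together on a small in-memory link graph, so the logic can be checked without touching the network. Everything here is hypothetical: `FAKE_SITE` stands in for real pages, and the `max_pages` parameter is an assumed way of enforcing the limit of 10 visited links.

```python
from typing import Dict, Iterator, List, Optional, Set

# Hypothetical link graph standing in for real pages: URL -> outgoing links.
FAKE_SITE: Dict[str, List[str]] = {
    "/snapshots": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/snapshots"],
    "/c": ["/d"],
    "/d": [],
}


def crawl(url: str, max_depth: int, max_pages: int = 10,
          depth: int = 0, visited: Optional[Set[str]] = None) -> Iterator[dict]:
    """Depth-first crawl that yields each page as soon as it is processed."""
    if visited is None:
        visited = set()
    # Stop on the depth limit, on repeat URLs, and once the page cap is reached.
    if depth > max_depth or url in visited or len(visited) >= max_pages:
        return
    visited.add(url)
    links = FAKE_SITE.get(url, [])
    yield {"url": url, "depth": depth, "anchors": links}
    for link in links:
        yield from crawl(link, max_depth, max_pages, depth + 1, visited)


pages = [page["url"] for page in crawl("/snapshots", max_depth=1)]
```

Because `crawl` is a generator, a caller can stop consuming results at any point and no further pages are visited, which is the streaming behaviour asked for in item 4.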

#### Link: [Smithsonian Snapshots Page](https://www.si.edu/snapshot)

In [8]:
from typing import Optional, Dict, Generator
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup


class SmithsonianCrawler:
	def __init__(self, start_url: str, max_depth: int = 2) -> None:
		"""
		Initialize the SmithsonianCrawler with a start URL and a maximum crawl depth.
		:param start_url: The URL where the crawler begins.
		:param max_depth: The maximum depth the crawler will visit.
		"""
		self.start_url: str = start_url
		self.max_depth: int = max_depth
		self.visited: set = set()  # Set of visited URLs

	def fetch_page(self, url: str) -> Optional[bytes]:
		"""
		Fetch the HTML content from the given URL.
		:param url: The URL to fetch the page content from.
		:return: The page content as bytes if successful, otherwise None.
		"""
		try:
			response = requests.get(url)
			response.raise_for_status()
			return response.content
		except requests.exceptions.RequestException:
			return None

	def extract_content(self, html_content: bytes, base_url: str) -> Dict[str, Optional[str]]:
		"""
		Extract links, images, and clean text content from the page using BeautifulSoup.
		Handle relative URLs appropriately.
		:param html_content: The HTML content to parse.
		:param base_url: The base URL to resolve relative URLs.
		:return: A dictionary containing 'anchors', 'images', and 'text'.
		"""
		soup = BeautifulSoup(html_content, 'html.parser')

		# Extract links
		links = []
		for a in soup.find_all('a', href=True):
			absolute_url = urljoin(base_url, a['href'])
			links.append((a.text, absolute_url))

		# Extract images
		images = []
		for img in soup.find_all('img', src=True):
			absolute_url = urljoin(base_url, img['src'])
			images.append(absolute_url)

		# Extract clean text content (drop script and style bodies first)
		for element in soup.find_all(['script', 'style']):
			element.decompose()
		text = soup.get_text(separator=" ")
		cleaned_text = ' '.join(text.split())
		
		return {"anchors": links, "images": images, "text": cleaned_text}

	def crawl(self, url: str, depth: int = 0) -> Generator[Dict[str, Optional[str]], None, None]:
		"""
		Recursively crawl through pages starting from the given URL up to a specified depth.
		Follow links and yield the content of each page as it is processed.
		:param url: The URL to crawl.
		:param depth: The current depth of the crawl.
		:yield: A dictionary with the extracted content of each visited page.
		"""
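		# Stop at the depth limit, on repeat URLs, and after 10 visited pages (the limit in the criteria)
		if depth > self.max_depth or url in self.visited or len(self.visited) >= 10: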
			return

		self.visited.add(url)
		html_content = self.fetch_page(url)
		if html_content:
			data = self.extract_content(html_content, url)
			yield data

			# Follow links and crawl recursively
			for _, link in data['anchors']:
				yield from self.crawl(link, depth + 1)


	def crawl_generator(self) -> Generator[Dict[str, Optional[str]], None, None]:
		"""
		A generator that yields the extracted content of each crawled page as soon as it's processed.
		:yield: A dictionary containing the extracted content from each page (anchors, images, text).
		"""
		yield from self.crawl(self.start_url)