## Homework Submission Instructions and Format

### 1. **Naming Convention for Submission**
- Each student must submit their homework as a Jupyter Notebook (`.ipynb`) file.
- The file must be named with your **email address**. For example, if your email is `j.doe@innopolis.university`, the file should be named `j.doe@innopolis.university.ipynb`.
- Make sure that the notebook contains all necessary classes and functions outlined in the tasks.

---

### 2. **Structure of the Homework**

The homework is divided into **four tasks**, and each task requires you to implement specific classes and functions. The tasks build upon concepts of web scraping, HTML parsing, document processing, and crawling.

Here’s a breakdown of what is expected in each task:

#### **Task 1: Artifact Caching System**
- **Class**: `Artifact`
  - Implement a class that can **download**, **store**, and **retrieve** digital content from a URL.
  - This class should:
    - Fetch content from a URL and store it in memory.
    - Save the content to a local file to avoid redundant downloads.
    - Retrieve the content from the local cache if it already exists.
- **Methods**:
  - `fetch_artifact()`: Downloads the content from the URL.
  - `store_artifact()`: Stores the content locally in a unique file.
  - `retrieve_artifact()`: Retrieves the content from the local cache.
  
---

#### **Task 2: Smithsonian Snapshot Parser**
- **Class**: `SmithsonianParser`
  - Implement a class that parses HTML pages like [Smithsonian Snapshots](https://www.si.edu/newsdesk/snapshot/how-very-logical).
  - This class should:
    - Extract all links (`<a>` tags) and store them as a list of tuples.
    - Extract all image URLs (`<img>` tags) and store them in a list.
    - Clean the text from the page, removing unnecessary elements (scripts, styles).
- **Methods**:
  - `fetch_page()`: Downloads the HTML content of a page.
  - `parse()`: Parses the content and extracts links, images, and cleaned text.
  - `get_anchors()`, `get_images()`, and `get_text()`: Return the extracted anchors, image URLs, and cleaned text, respectively.

---

#### **Task 3: Text Analysis of Smithsonian Snapshot**
- **Class**: `SmithsonianTextAnalyzer`
  - Implement a class that analyzes the text content of a Smithsonian Snapshot page.
  - This class should:
    - Perform **word frequency analysis**.
    - **Segment sentences** and split text properly.
    - Clean the text to remove special characters and whitespace.
- **Methods**:
  - `analyze()`: Fetches the page content, processes the text, and generates word frequency statistics.
  - `get_word_stats()`: Returns word frequency in the form of a `Counter` object.
  - `split_into_sentences()`: Splits the text into sentences.

---

#### **Task 4: Smithsonian Snapshot Web Crawler**
- **Class**: `SmithsonianCrawler`
  - Implement a web crawler that starts at the [Smithsonian Snapshots](https://www.si.edu/newsdesk/snapshots) page and crawls through linked snapshot articles.
  - This class should:
    - Crawl pages to a specified depth.
    - Extract links, images, and cleaned text from each page.
    - Yield the results for each page as soon as that page is processed.
- **Methods**:
  - `crawl()`: Recursively visits pages starting from a given URL.
  - `crawl_generator()`: Generates content as the crawler processes each page.

---

### 3. **Grading Process**
- Each homework will be graded using an automated grading system.
- The grading system will dynamically import and execute your code to test if all the tasks are implemented correctly.
  
---

### 4. **Total Grade Breakdown**
- **Task 1**: Artifact Caching System (25 points)
- **Task 2**: Smithsonian Snapshot Parser (25 points)
- **Task 3**: Text Analysis (25 points)
- **Task 4**: Web Crawler (25 points)
- **Total**: 100 points

---

### 5. **Detailed Feedback**
- Feedback will be provided with specific details on:
  - **What worked**: Indicating which parts of the code were implemented correctly and passed the tests.
  - **What needs improvement**: Highlighting which tests failed and what parts of the code may require debugging or further development.
- The feedback will help you understand your performance in each task.

---

### 6. **Submission Guidelines**
- Ensure that your notebook is properly formatted and runs without errors.
- Do not use any external libraries unless instructed.
- Each function and class must follow the naming conventions provided in this document.
- Submit your notebook on time. Late submissions may not be accepted.

---

### 7. **Final Tips**
- Test each task thoroughly before submission.
- Ensure your notebook is readable and well-documented.
- Make use of comments to explain your code wherever necessary.

Good luck!

### Task 1: Archiving Virtual Artifacts - Preserving a Digital Museum

#### Task Description
Imagine you are a data archivist working to preserve artifacts from the **Smithsonian Institution's digital collection**. Your job is to download, store, and manage different types of data (such as images, videos, and documents) to ensure they can be accessed later without repeated downloads.

Use the [Smithsonian Institution Collections](https://www.si.edu/snapshot) as your source of artifacts. You are tasked with building a caching system that can store downloaded files in a structured way and retrieve them as needed.

#### Tasks:
1. `fetch_artifact()`: Download content from the Smithsonian's collection page based on a provided URL. The method should return `True` if successful, or `False` if the download fails.
2. `store_artifact()`: Save the content of the artifact (text, image, etc.) in a local file system. Each artifact must be stored in its own unique file based on its URL.
3. `retrieve_artifact()`: Load an artifact from local storage by its URL, so that cached content is reused instead of being downloaded again. The method should return `True` if the artifact was found in the cache, or `False` otherwise.

#### Criteria for Success:
- Different URLs must map to different files, even if they belong to the same domain.
- Binary files (e.g., images) must be handled correctly without corruption.
- Artifacts that are already stored locally should not be downloaded again.
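
The three criteria above can be illustrated with a minimal sketch. It is not the required `Artifact` class: `url_to_filename`, `get_cached`, and `fake_download` are hypothetical names, SHA-256 is only one possible hash choice, and the download is simulated so that the example runs offline.

```python
import hashlib
import tempfile
from pathlib import Path


def url_to_filename(url):
    """Map a URL to a filesystem-safe name via its SHA-256 hex digest."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def get_cached(url, cache_dir, download):
    """Return the bytes for `url`, calling `download(url)` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / url_to_filename(url)
    if path.exists():
        return path.read_bytes()   # cache hit: nothing is downloaded
    data = download(url)
    path.write_bytes(data)         # bytes are written unchanged, so images stay intact
    return data


calls = []  # records every simulated download


def fake_download(url):
    """Hypothetical stand-in for a real HTTP request (assumption, not a real API)."""
    calls.append(url)
    return b"\x89PNG\r\n\x1a\n" + url.encode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "artifact_cache"
    first = get_cached("https://www.si.edu/a", cache, fake_download)
    again = get_cached("https://www.si.edu/a", cache, fake_download)
    other = get_cached("https://www.si.edu/b", cache, fake_download)
    stored_files = len(list(cache.iterdir()))
```

Two URLs on the same domain end up in two separate files, and the second request for the first URL is served from disk without another download.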

#### Link: [Smithsonian Institution Collections](https://www.si.edu/newsdesk/snapshot/what-good-boy)

import hashlib

class Artifact:
    def __init__(self, url):
        self.url = url
        self.content = None
        self.filename = None

    def generate_filename(self):
        """
        Generates a unique and safe filename based on the URL.
        You will need to use a hash function (hint: hashlib).
        """
        # Your code here
        pass

    def fetch_artifact(self):
        """
        Download the artifact from the given URL and store its content in memory.
        If the download is successful, return True. Otherwise, return False.
        """
        # Your code here
        pass

    def store_artifact(self, directory="artifact_cache"):
        """
        Store the artifact content in a local file in a cache directory.
        Ensure the file is stored with a unique name to avoid overwriting.
        """
        # Your code here
        pass

    def retrieve_artifact(self, directory="artifact_cache"):
        """
        Retrieve the artifact from the local cache if it has been stored before.
        Return True if successful, False otherwise.
        """
        # Your code here
        pass

### Task 2: Parsing Web Pages - Smithsonian Snapshot

#### Task Description
For this task, you will be working with pages from the [Smithsonian Newsdesk Snapshot](https://www.si.edu/newsdesk/snapshot/how-very-logical). Your goal is to extract meaningful content such as links, images, and clean text from the page.

You will need to:
1. Extract all hyperlinks (anchor tags) from the page and store them as a list of tuples `('link_text', 'absolute_url')`. Make sure to handle relative links by converting them to absolute URLs.
2. Collect all image URLs in a list. Ensure relative URLs are converted to absolute URLs.
3. Extract the plain text from the page, ignoring scripts, styles, and comments.
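
The three steps above can be sketched on a small inline HTML string. This is a rough illustration only: it swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra dependencies, and the class name `PageExtractor` and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collect anchors, image URLs, and visible text from an HTML string."""

    SKIPPED = {"script", "style"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchors = []       # list of (link_text, absolute_url)
        self.images = []        # list of absolute image URLs
        self.text_parts = []
        self._skip_depth = 0    # > 0 while inside <script> or <style>
        self._href = None       # absolute href of the <a> currently open
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a" and attrs.get("href"):
            self._href = urljoin(self.base_url, attrs["href"])
            self._link_text = []
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._href is not None:
            text = " ".join("".join(self._link_text).split())
            self.anchors.append((text, self._href))
            self._href = None

    def handle_data(self, data):
        if self._skip_depth:
            return              # ignore script and style bodies
        if self._href is not None:
            self._link_text.append(data)
        self.text_parts.append(data)

    def get_text(self):
        # Comments never reach handle_data, so they are excluded automatically.
        return " ".join(" ".join(self.text_parts).split())


html = """
<html><head><style>p { color: red; }</style>
<script>var x = 1;</script></head>
<body><!-- hidden note -->
<p>Hello <a href="/visit">Visit  us</a></p>
<img src="images/cat.png">
</body></html>
"""

parser = PageExtractor("https://www.si.edu/newsdesk/snapshot/how-very-logical")
parser.feed(html)
parser.close()
```

`urljoin` resolves `/visit` against the site root and `images/cat.png` against the directory of the current page, which is the behavior you need for both anchors and images.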

#### Criteria for Success:
- Extract all links as `('link_text', 'absolute_url')` and handle relative URLs.
- Extract all image URLs as absolute URLs.
- Clean and extract the main text from the document.

#### Link: [How Very Logical](https://www.si.edu/newsdesk/snapshot/how-very-logical)

class SmithsonianParser:
    def __init__(self, url):
        self.url = url
        self.anchors = []
        self.images = []
        self.text = ""

    def fetch_page(self):
        """
        Fetch the HTML content of the given URL. If the request is successful,
        return the page content; otherwise, return None.
        """
        # Your code here
        pass

    def parse(self, html_content):
        """
        Parse the HTML content using BeautifulSoup. You need to:
        1. Extract all anchor tags and store them as ('link_text', 'absolute_url').
        2. Extract all image URLs and store them in a list.
        3. Extract clean, readable text from the page.
        """
        # Your code here
        pass

    def get_anchors(self):
        """
        Return the list of anchors extracted from the page.
        """
        return self.anchors

    def get_images(self):
        """
        Return the list of image URLs extracted from the page.
        """
        return self.images

    def get_text(self):
        """
        Return the cleaned text content extracted from the page.
        """
        return self.text

Anchors (Links):
('Skip to main content', 'https://www.si.edu/newsdesk/snapshot/how-very-logical#main-content')
('My Visit', 'https://www.si.edu/myvisit')
('Donate', 'http://go.si.edu/si-give')
('Smithsonian Institution', 'https://www.si.edu/')
('Visit', 'https://www.si.edu/visit')
('Hours and Locations', 'https://www.si.edu/visit/hours')
('Entry and Guidelines', 'https://www.si.edu/visit/tips')
('Maps and Brochures', 'https://www.si.edu/visit/maps')
('Dining and Shopping', 'https://www.si.edu/dining')
('Accessibility', 'https://www.si.edu/visit/accessibility')
('Visiting with Kids', 'https://www.si.edu/visit/kids')
('Group Visits', 'https://www.si.edu/visit/groups')
('Group Sales', 'https://www.si.edu/groupsales')
("What's On", 'https://www.si.edu/whats-on')
('Exhibitions', 'https://www.si.edu/exhibitions')
('Current', 'https://www.si.edu/exhibitions')
('Upcoming', 'https://www.si.edu/exhibitions/upcoming')
('Past', 'https://www.si.edu/exhibitions/past')
("Today's Events", 'https://ww

### Task 3: Summarizing Smithsonian Snapshots

#### Task Description
You will analyze the text content of a Smithsonian Snapshot page, such as [How Very Logical](https://www.si.edu/newsdesk/snapshot/how-very-logical), focusing on the following:

1. **Extract key phrases**: Use basic natural language processing (NLP) techniques to identify the most frequently used words and key phrases from the main body of the text.
2. **Sentence segmentation**: Split the text into individual sentences, making sure to handle punctuation and proper sentence breaks appropriately.
3. **Clean the text**: Remove any extraneous characters, symbols, or whitespace.
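
As a rough illustration of sentence segmentation, the sketch below splits on whitespace that follows `.`, `!`, or `?`. It is a regular-expression approximation, not nltk's tokenizer (which the skeleton asks for), and it will split incorrectly after abbreviations such as "Dr."; the function name `split_sentences` and the sample text are assumptions.

```python
import re


def split_sentences(text):
    """Split text after sentence-ending punctuation followed by whitespace."""
    normalized = " ".join(text.split())   # collapse line breaks and repeated spaces
    if not normalized:
        return []
    return re.split(r"(?<=[.!?])\s+", normalized)


sample = "The museum opens at ten.  Is entry free?\nYes! Come early."
sentences = split_sentences(sample)
# sentences == ['The museum opens at ten.', 'Is entry free?', 'Yes!', 'Come early.']
```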

#### Tasks:
1. **Word Frequency Analysis**: Implement a method to count the frequency of each word in the page content, converting all words to lowercase.
2. **Sentence Splitting**: Implement a method to split the content into individual sentences, being mindful of punctuation and line breaks.
3. **Cleaning and Normalization**: Clean the text to remove any special characters or unnecessary whitespace.
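
A minimal sketch of normalization followed by counting is shown below. The helper names `normalize` and `word_stats` are hypothetical, and the character filter assumes ASCII text, so accented letters would be dropped as well; adjust the pattern if that is not what you want.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, remove everything except a-z, 0-9 and whitespace, collapse spaces."""
    lowered = text.lower()
    stripped = re.sub(r"[^a-z0-9\s]", "", lowered)
    return " ".join(stripped.split())


def word_stats(text):
    """Return a Counter of word frequencies in the normalized text."""
    return Counter(normalize(text).split())


stats = word_stats("What's on? What's  NEW: art, Art & ART!")
# stats == Counter({'art': 3, 'whats': 2, 'on': 1, 'new': 1})
```

Lowercasing before counting is what makes `art`, `Art`, and `ART` fall into the same bucket.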

#### Criteria for Success:
- The `get_word_stats()` method should return a frequency distribution of words as a `Counter` object.
- Sentences should be extracted cleanly from the page’s main text.
- The text should be normalized: lowercased, with special characters removed.

#### Link: [How Very Logical](https://www.si.edu/newsdesk/snapshot/how-very-logical)

import nltk
from collections import Counter

class SmithsonianTextAnalyzer:
    def __init__(self, url):
        self.url = url
        self.text = ""
        self.sentences = []
        self.word_frequency = None

    def fetch_page(self):
        """
        Fetch the HTML content of the Smithsonian Snapshot page.
        Return the page content if successful, else return None.
        """
        # Your code here
        pass

    def clean_text(self, html_content):
        """
        Use BeautifulSoup to extract clean text from the HTML content.
        Remove scripts, styles, and special characters.
        """
        # Your code here
        pass

    def split_into_sentences(self):
        """
        Use nltk's sentence tokenizer to split the cleaned text into sentences.
        """
        # Your code here
        pass

    def get_word_stats(self):
        """
        Count the frequency of each word in the text. Return a Counter object.
        Ensure the text is lowercased for accurate counting.
        """
        # Your code here
        pass

    def analyze(self):
        """
        Orchestrate the fetching, cleaning, and analysis of the text from the page.
        - Fetch the HTML content.
        - Clean the text.
        - Split into sentences.
        - Get word frequency statistics.
        """
        # Your code here
        pass

['how very logical smithsonian institution skip to main content search search what is 25 search my visit donate smithsonian institution site navigation visit hours and locations entry and guidelines maps and brochures dining and shopping accessibility visiting with kids group visits group sales whats on exhibitions current upcoming past todays events online events all events imax planetarium explore art design history culture science nature collections open access research resources libraries archives smithsonian institution archives air and space museum anacostia community museum american art museum archives of american art archives of american gardens american history museum american indian museum asian art museum archives eliot elisofon photographic archives african art hirshhorn archive national anthropological archives national portrait gallery ralph rinzler archives folklife libraries special collections podcasts mobile apps learn for caregivers for educators art design resources

### Task 4: Building a Smithsonian Snapshots Crawler

#### Task Description
In this task, you will create a **web crawler** that will start at the Smithsonian Snapshots page and follow links to gather and analyze the content from multiple snapshot pages. The Smithsonian Snapshot section contains multiple articles, and your crawler will explore these articles, download their content, and process the information.

You will implement a web crawler that:
1. Starts at the [Smithsonian Snapshots Page](https://www.si.edu/snapshot).
2. Crawls through snapshot pages, extracting key information (links, images, and text) from each page.
3. Follows links from the initial page to other snapshot articles up to a specified depth.
4. Processes and stores the content from each crawled page.

#### Tasks:
1. **Implement a Crawler**: Start crawling from the [Smithsonian Snapshots Page](https://www.si.edu/newsdesk/snapshots), gather links to snapshot articles, and visit each article.
2. **Content Extraction**: For each visited page, extract:
   - Anchor tags (`'link_text', 'absolute_url'`).
   - Image URLs (absolute URLs).
   - Cleaned text content from the body of the article.
3. **Depth Control**: Implement a parameter to control the depth of the crawl (i.e., how many levels of links the crawler should follow).
4. **Yield Results**: Your crawler should return a **generator** that yields the results (text, links, images) as soon as a page is processed, rather than collecting everything before returning.
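
Depth control and yielding can be sketched without any network access by crawling an in-memory link graph. Everything here is an assumption made for illustration: the `site` dictionary stands in for real pages, the start page counts as depth 0, and the free function `crawl` is not the required `SmithsonianCrawler` method.

```python
def crawl(links, url, max_depth, depth=0, visited=None):
    """Yield (url, depth) for each page within max_depth, visiting each page once."""
    if visited is None:
        visited = set()
    if depth > max_depth or url in visited:
        return
    visited.add(url)
    yield url, depth                      # the result is available before recursing
    for child in links.get(url, []):
        yield from crawl(links, child, max_depth, depth + 1, visited)


# Hypothetical site structure: page -> pages it links to.
site = {
    "/snapshot": ["/snapshot/a", "/snapshot/b"],
    "/snapshot/a": ["/snapshot/c", "/snapshot"],
    "/snapshot/b": [],
    "/snapshot/c": ["/snapshot/d"],
}

shallow = list(crawl(site, "/snapshot", max_depth=1))
deeper = list(crawl(site, "/snapshot", max_depth=2))
```

With `max_depth=1` the crawl stops at the pages linked from the start page; the `visited` set keeps the back-link from `/snapshot/a` to `/snapshot` from causing a loop.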

#### Criteria for Success:
- The crawler must not follow links beyond the specified depth.
- Each snapshot page should have its content (links, images, text) extracted and returned.
- The crawler should handle relative links and convert them to absolute URLs.
- The content should be cleaned and stored properly.

#### Link: [Smithsonian Snapshots Page](https://www.si.edu/snapshot)

# Make sure the total number of visited links does not exceed 10, or you might get 0 for the whole assignment due to long runtime during grading!
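
One possible way to enforce such a cap is to stop consuming the generator after a fixed number of pages. The sketch below assumes the crawler fetches lazily (one page per yielded result); `visit_pages` is a hypothetical stand-in for a crawler generator, and `MAX_PAGES` is an illustrative name.

```python
from itertools import islice

visited_log = []  # records every simulated page visit


def visit_pages():
    """Hypothetical endless crawler: each iteration 'visits' one more page."""
    n = 0
    while True:
        n += 1
        visited_log.append(n)
        yield f"page-{n}"


MAX_PAGES = 10
results = list(islice(visit_pages(), MAX_PAGES))
```

Because `islice` never asks for an eleventh item, the generator is not advanced past the tenth page. An alternative is to check the size of the `visited` set inside the crawler before each fetch.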

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class SmithsonianCrawler:
    def __init__(self, start_url, max_depth=2):
        self.start_url = start_url
        self.max_depth = max_depth
        self.visited = set()

    def fetch_page(self, url):
        """
        Fetch the HTML content from the given URL.
        Return the page content if successful, else return None.
        """
        # Your code here
        pass

    def extract_content(self, html_content, base_url):
        """
        Extract links, images, and clean text content from the page using BeautifulSoup.
        Handle relative URLs appropriately.
        """
        # Your code here
        pass

    def crawl(self, url, depth=0):
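        """
        Recursively crawl through pages starting from the given URL up to a specified depth.
        Follow links and yield the extracted content of each page as it is processed.
        This method must be a generator, because crawl_generator() delegates to it with `yield from`.
        """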
        # Your code here
        pass

    def crawl_generator(self):
        """
        A generator that yields the extracted content of each crawled page as soon as it's processed.
        """
        yield from self.crawl(self.start_url)


Crawling https://www.si.edu/snapshot at depth 0...

Crawled Page Content:
Anchors: [('Skip to main content', 'https://www.si.edu/snapshot#main-content'), ('My Visit', 'https://www.si.edu/myvisit'), ('Donate', 'http://go.si.edu/si-give'), ('Smithsonian Institution', 'https://www.si.edu/'), ('Visit', 'https://www.si.edu/visit'), ('Hours and Locations', 'https://www.si.edu/visit/hours'), ('Entry and Guidelines', 'https://www.si.edu/visit/tips'), ('Maps and Brochures', 'https://www.si.edu/visit/maps'), ('Dining and Shopping', 'https://www.si.edu/dining'), ('Accessibility', 'https://www.si.edu/visit/accessibility'), ('Visiting with Kids', 'https://www.si.edu/visit/kids'), ('Group Visits', 'https://www.si.edu/visit/groups'), ('Group Sales', 'https://www.si.edu/groupsales'), ("What's On", 'https://www.si.edu/whats-on'), ('Exhibitions', 'https://www.si.edu/exhibitions'), ('Current', 'https://www.si.edu/exhibitions'), ('Upcoming', 'https://www.si.edu/exhibitions/upcoming'), ('Past', 'https://w