[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/assignment_notebooks/Webscraping.ipynb)

# Webscraping Assignment

Reminder: you are permitted to work with another classmate on this assignment. If you do, please submit a single notebook with both of your names at the top.

## Due date

Friday, February 24, 2023 (12:00 pm)

## Assignment description

In this project you will write a Jupyter Notebook or R Markdown file to scrape a selected website. You will need to:

1. Write a function that takes a URL as input and returns the HTML of the page as a string.
2. Inspect the HTML of the page and use regular expressions to extract the documents within the page.
3. Model the documents in a corpus.
4. Analyze the corpus using the bag of words model.
5. Implement a TF-IDF model to extract the n most important words for each document in the corpus.
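
As a rough illustration of what "bag of words" means in step 4, the sketch below counts tokens in a made-up two-sentence corpus. The names `docs`, `bags`, `vocab`, and `vectors` are hypothetical, and splitting on whitespace stands in for the fuller preprocessing described later in this notebook.

```python
from collections import Counter

# Toy corpus; these strings are made up for illustration.
docs = ["the cat sat on the mat", "the dog sat"]

# A bag of words maps each token to its count and ignores word order.
bags = [Counter(doc.split()) for doc in docs]

# Build a shared, sorted vocabulary and one count vector per document.
vocab = sorted(set().union(*bags))
vectors = [[bag[word] for word in vocab] for bag in bags]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row of `vectors` describes one document, and each column corresponds to one vocabulary word, which is the same shape a TF-IDF matrix has.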

### Objective

This assignment reinforces previous lecture topics on the linguistic background, properties of language, information theory, and Regular Expressions.


## Submission medium

Jupyter Notebook or R Markdown file. See additional instructions in the final section of this document.

## Code Dependencies

You will need the following packages. All of them must be installed except `re`, which is part of the Python standard library:

- `requests`
- `re`
- `beautifulsoup4`
- `nltk`
- `pandas`
- `numpy`
- `matplotlib`


## Grading

This assignment is worth 10 points. For extra credit (1 point added to your final grade), create a heatmap of the TF-IDF matrix.

## Write a function that takes a URL as input and returns the HTML of the page as a string

### 1.1 Write a function that takes a URL as input and returns the HTML of the page as a string

In [15]:
import requests

def get_html(url: str) -> str:
    """Get the HTML of a webpage and return the HTML as a string.
    
    Parameters
    ----------
    url : str
        The URL of the webpage to scrape.
    
    Returns
    -------
    str
        The HTML of the webpage as a string.
    """
    ## YOUR CODE HERE
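    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors such as 404
    html_source: str = response.text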
    assert isinstance(html_source, str), "The HTML should be a string."
    return html_source

### 1.2 Inspect the HTML of the page. Can you identify any patterns in the HTML that might be useful for extracting the documents within the page?
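
One possible way to start looking for patterns is to print a slice of the raw string with `repr`, which makes otherwise invisible characters such as `\r\n` line endings visible. The sketch below uses a made-up sample string rather than the assignment page; `sample`, `preview`, and `candidate_headings` are hypothetical names.

```python
# Hypothetical sample text standing in for a downloaded page.
sample = "TITLE ONE\r\n\r\nFirst body line.\r\nSecond body line.\r\n\r\nTITLE TWO\r\n"

def preview(text: str, start: int = 0, width: int = 40) -> str:
    """Return the repr() of a slice so control characters are visible."""
    return repr(text[start:start + width])

print(preview(sample))

# Listing the lines written entirely in capitals can hint at headings.
candidate_headings = [line for line in sample.splitlines() if line.isupper()]
print(candidate_headings)  # ['TITLE ONE', 'TITLE TWO']
```

Whether capitalized lines actually mark document boundaries on the real page is something to check by reading the text itself.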

In [16]:
# Extract the HTML source code from the URL (this is the same URL we used in class)
url = "https://www.gutenberg.org/files/1/1-0.txt"

# YOUR CODE HERE

### 1.3 Use the BeautifulSoup library to create a BeautifulSoup object from the HTML string

In [21]:
from bs4 import BeautifulSoup as bs4

# YOUR CODE HERE

### 1.3 Extract the HTML body text and examine the contents.

In [25]:
# Please explain what the following line of code does in the cell below.
body = soup.find("body")

### 1.4 Use regular expressions to extract the documents within the page

In [117]:
import re

# Your regex here to capture the documents
doc_extractor = r""

# Explain this line of code in the cell below.
# __Note:__ The `re.MULTILINE` flag makes `^` and `$` match at the start and
# end of every line, not only at the start and end of the whole string.
found_documents: list = re.findall(doc_extractor, body.text, re.MULTILINE)

assert len(found_documents) == 9, "Please check your regex. You should have found a total of 9 documents."

## If you are having trouble with the regex, remember that you can use regex101.com to test and debug.


Explain: `found_documents = re.findall(doc_extractor, body.text, re.MULTILINE)`
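
As background for the flags involved, here is a small sketch on a toy string (not the assignment text, and not the expected `doc_extractor`). It contrasts the default behavior of `^` and `$` with `re.MULTILINE`, and shows that `re.DOTALL` is the flag that lets `.` match line breaks. The variable names and both patterns are assumptions made for illustration.

```python
import re

# Toy text (not the assignment page): two "documents", each introduced
# by a title line written in capital letters.
text = "ALPHA\nline one\nline two\nBETA\nline three\n"

title = r"^[A-Z]+$"

# Default: ^ and $ anchor only at the start and end of the whole string.
no_flag = re.findall(title, text)

# re.MULTILINE: ^ and $ also anchor at the start and end of each line.
multiline = re.findall(title, text, re.MULTILINE)

# A title line, then everything up to the next title line or the end.
# The lazy .*? can only run across line breaks when re.DOTALL is set.
document = r"^[A-Z]+\n.*?(?=^[A-Z]+$|\Z)"
multiline_only = re.findall(document, text, re.MULTILINE)
with_dotall = re.findall(document, text, re.MULTILINE | re.DOTALL)

print(no_flag)              # []
print(multiline)            # ['ALPHA', 'BETA']
print(len(multiline_only))  # 0
print(len(with_dotall))     # 2
```

A pattern that must span several lines therefore needs either `re.DOTALL` or a character class such as `[\s\S]` in place of `.`.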



### 1.5 Explore the contents of the documents

In the matched documents, you will find a header that Project Gutenberg adds to each text. For this assignment, I have provided a cleaning function that removes the Gutenberg headers for you.

In [209]:
def clean_gutenberg(text: str) -> str:
    """Clean the text of a Gutenberg document.
    
    Parameters
    ----------
    text : str
        The text of a Gutenberg document.
    
    Returns
    -------
    str
        The cleaned text of the document.
    """
    text = re.sub(r"\[Etext #\d+\]", "", text)
    text = re.sub(r"(\r\n)+", " ", text)
    text = re.sub(r"^The Project Gutenberg.*?Independence\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ \*\*\*\*The Project Gutenberg Etext of The U. S. Bill of Rights\*\*\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ November.*?EST", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ \*\*The Project.*?, USA", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ \*\*\*\*\*The Project.*?corrections\. \*\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ The Project.*?1775\.", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ Officially.*?calendar\]", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ \*\*The Project.*?, 1865", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ The Project.*?, 1861", "", text, flags=re.MULTILINE)
    
    return text.strip()

In [219]:
corpus = []

for i, doc in enumerate(found_documents):
    # YOUR CODE HERE
    pass  # placeholder so the cell runs; replace with your code

In [220]:
# Explore the corpus here

# Analyze the above corpus of documents using TF-IDF

In the following cells, I would like you to complete these preprocessing steps:

1. Tokenize the documents
2. Lemmatize the tokens
3. Remove stop words
4. Remove punctuation
5. Apply TF-IDF to the corpus
    * You can write a TF-IDF model from scratch or use the `sklearn` library (installed as `scikit-learn`)

_tip: see lecture notebooks 4, 5, and 6 for examples of how to work with pandas_


In [221]:
### TIP ###
## If you want to work with pandas, create a DataFrame with one row per document and columns for the document number and the text
import pandas as pd
corpus = pd.DataFrame({"docID": range(len(corpus)), "text": corpus})


## Tokenize the documents
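
A minimal sketch of one tokenization approach, assuming that lower-casing and a simple regular expression are good enough for your purposes. The function name `tokenize` and the pattern are illustrative choices, not a required solution; `nltk.word_tokenize` is an alternative that needs additional tokenizer data to be downloaded first.

```python
import re
from typing import List

def tokenize(text: str) -> List[str]:
    """Lower-case the text and return its word tokens.

    A token is a run of letters, optionally followed by an apostrophe
    and more letters (so "don't" stays in one piece).
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

tokens = tokenize("Don't panic: it's only a test.")
print(tokens)  # ["don't", 'panic', "it's", 'only', 'a', 'test']
```

Note that this pattern discards digits and punctuation as a side effect, which may or may not be what you want for dates and article numbers.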

In [228]:
## Your code here

## Lemmatize the tokens

In [224]:
## Your code here

## Remove stop words

You can use the `nltk` library to remove stop words.
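
The filtering itself can be sketched without any downloads. The example below uses a tiny hand-written stop list, which is an assumption made for illustration; with `nltk` you would typically build the set from `nltk.corpus.stopwords.words("english")` after running `nltk.download("stopwords")`. The names `STOP_WORDS` and `remove_stop_words` are hypothetical.

```python
# A tiny hand-written stop list for illustration only; a real list
# (for example nltk's English list) is much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

filtered = remove_stop_words(["The", "right", "of", "the", "people"])
print(filtered)  # ['right', 'people']
```

Storing the stop words in a `set` keeps each membership test fast, which matters once the corpus grows.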

In [225]:
## Your code here

## Remove punctuation
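
One common approach, sketched here under the assumption that only ASCII punctuation needs to be removed, is a translation table built from `string.punctuation`. The names `PUNCT_TABLE` and `strip_punctuation` are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(tokens):
    """Remove punctuation from each token and drop tokens left empty."""
    cleaned = (tok.translate(PUNCT_TABLE) for tok in tokens)
    return [tok for tok in cleaned if tok]

result = strip_punctuation(["Hello", ",", "world", "!", "end."])
print(result)  # ['Hello', 'world', 'end']
```

`string.punctuation` does not include typographic characters such as curly quotes, so check your documents for those and extend the table if needed.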

In [226]:
## Your code here

## Analyze the documents and corpus using TF-IDF
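
If you choose to write TF-IDF from scratch, the sketch below shows one possible formulation on a made-up, already-tokenized corpus. It assumes term frequency is the count divided by the document length and inverse document frequency is the natural log of N / df; other weighting schemes exist. `sklearn`'s `TfidfVectorizer`, for example, smooths the IDF and normalizes each row by default, so its numbers will differ. The names `tf_idf` and `top_n` are hypothetical.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (made up for illustration).
docs = [
    ["liberty", "union", "people"],
    ["people", "rights", "people"],
    ["union", "war"],
]

def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document.

    Assumes tf = count / document length and idf = ln(N / df).
    """
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

def top_n(doc_scores, n=2):
    """Return the n highest-scoring terms of one document."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:n]

scores = tf_idf(docs)
for i, doc_scores in enumerate(scores):
    print(i, top_n(doc_scores, n=1))
```

In this toy corpus the top term per document is the one that appears in no other document, which is the behavior TF-IDF is designed to produce.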

In [227]:
## Your code here

# Submission Instructions

Please submit your assignment as a Jupyter Notebook or R Markdown file. You can submit your assignment as a link to a Google Colab notebook or a link to a GitHub repository. If you are submitting a link to a GitHub repository, please make sure that your repository is public. If you email the notebook to me, please zip the file before sending it.