[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/assignment_notebooks/Webscraping.ipynb)

# Webscraping Assignment

Reminder: you are permitted to work with another classmate on this assignment. If you do, please submit a single notebook with both of your names at the top.

## Due date

Friday, February 24, 2023, at 12:00 pm

## Assignment description

In this project you will write a Jupyter Notebook or R Markdown file to scrape a selected website. You will need to:

1. Write a function that takes a URL as input and returns the HTML of the page as a string.
2. Inspect the HTML of the page and use regular expressions to extract the documents within the page.
3. Model the documents in a corpus.
4. Analyze the corpus using the bag-of-words model.
5. Implement a TF-IDF model to extract the n most important words for each document in the corpus.

### Objective

This assignment reinforces previous lecture topics on the linguistic background, properties of language, information theory, and Regular Expressions.


## Submission medium

Jupyter Notebook or R Markdown file. See additional instructions in the final section of this document.

## Code Dependencies

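You will need the following packages. All except `re`, which ships with the Python standard library, must be installed:

- `requests`
- `re`
- `beautifulsoup4`
- `lxml` (the parser passed to BeautifulSoup in section 1.3)
- `nltk`
- `spacy` (with the `en_core_web_sm` model, used in the tokenization cell)
- `scikit-learn`
- `pandas`
- `numpy`
- `matplotlib`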


## Grading

This assignment is worth 10 points. Extra credit: 1 point is added to your final grade if you create a heatmap of the TF-IDF matrix.

## Write a function that takes a URL as input and returns the HTML of the page as a string

### 1.1 Write a function that takes a URL as input and returns the HTML of the page as a string

In [1]:
import requests

def get_html(url) -> str:
    """Get the HTML of a webpage and return the HTML as a string.
    
    Parameters
    ----------
    url : str
        The URL of the webpage to scrape.
    
    Returns
    -------
    str
        The HTML of the webpage as a string.
    """
    ## YOUR CODE HERE
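    response = requests.get(url, timeout=30)  # avoid hanging forever on a slow server
    response.raise_for_status()  # raise on HTTP errors instead of returning an error page
    html_source: str = response.text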
    assert isinstance(html_source, str), "The HTML should be a string."
    return html_source

### 1.2 Inspect the HTML of the page. Can you identify any patterns in the HTML that might be useful for extracting the documents within the page?

In [2]:
# Extract the HTML source code from the URL (this is the same URL we used in class)
url = "https://www.gutenberg.org/files/1/1-0.txt"

html_source = get_html(url)

### 1.3 Use the BeautifulSoup library to create a BeautifulSoup object from the HTML string

In [3]:
from bs4 import BeautifulSoup as bs4

# YOUR CODE HERE
soup = bs4(html_source, "lxml")

### 1.3 Extract the HTML body text and examine the contents.

In [5]:
# Please explain what the following line of code does in the cell below.
body = soup.find("body")
body


     NOTE:  This file combines the first two Project Gutenberg
     files, both of which were given the filenumber #1. There are
     several duplicate files here. There were many updates over
     the years.  All of the original files are included in the
     "old" subdirectory which may be accessed under the "More
     Files" listing in the PG Catalog of this file. No changes
     have been made in these original etexts.



**Welcome To The World of Free Plain Vanilla Electronic Texts**

**Etexts Readable By Both Humans and By Computers, Since 1971**

*These Etexts Prepared By Hundreds of Volunteers and Donations*

Below you will find the first nine Project Gutenberg Etexts, in
one file, with one header for the entire file.  This is to keep
the overhead down, and in response to requests from Gopher site
keeper to eliminate as much of the headers as possible.

However, for legal and financial reasons, we must request these
headers be left at the beginning of each file that is posted 

### 1.4 Use regular expressions to extract the documents within the page

In [9]:
import re

# Your regex here to capture the documents

# Success option 1
#doc_extractor = r"(?<=\[Etext #\d])(.*?)(?=\[Etext #\d]|\*\*\*End of)" # This one requires re.DOTALL

# Success option 2
doc_extractor = r"(?<=\[Etext #\d])([^\f]+?)(?=\[Etext #\d]|\*\*\*End of)" 

# Explain this line of code in the cell below.
# __Note:__ `re.MULTILINE` only changes where `^` and `$` match. Option 2 spans
# multiple lines because `[^\f]` matches newline characters; option 1 needs
# `re.DOTALL` so that `.` matches them too.
found_documents: list = re.findall(doc_extractor, body.text, flags=re.MULTILINE)

assert len(found_documents) == 9, "Please check your regex. You should have found a total 9 documents."

## If you are having trouble with the regex, remember that you can use regex101.com to test and debug.
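
To see what the flags actually do, the following minimal sketch runs both patterns from the cell above against a made-up string (`toy_text` is hypothetical, not the Gutenberg file). `re.MULTILINE` only changes where `^` and `$` match. It is `re.DOTALL` that lets `.` cross line breaks, while a negated class such as `[^\f]` matches newlines with no flag at all.

```python
import re

# Hypothetical miniature of the Gutenberg file: two documents,
# each introduced by an [Etext #n] marker.
toy_text = "[Etext #1]\nFirst doc\nsecond line\n[Etext #2]\nSecond doc\n***End of"

option_1 = r"(?<=\[Etext #\d])(.*?)(?=\[Etext #\d]|\*\*\*End of)"
option_2 = r"(?<=\[Etext #\d])([^\f]+?)(?=\[Etext #\d]|\*\*\*End of)"

# `.` cannot cross a newline under re.MULTILINE alone, so nothing matches.
dot_multiline = re.findall(option_1, toy_text, flags=re.MULTILINE)
# With re.DOTALL, `.` also matches "\n" and both documents are captured.
dot_dotall = re.findall(option_1, toy_text, flags=re.DOTALL)
# The negated class matches "\n" without any flag.
negated_class = re.findall(option_2, toy_text)

print(len(dot_multiline), len(dot_dotall), len(negated_class))  # prints: 0 2 2
```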


In [10]:
found_documents

['\r\n\r\n\r\nThe Project Gutenberg Etext of The Declaration of Independence.\r\n\r\nAll of the original Project Gutenberg Etexts from the\r\n1970\'s were produced in ALL CAPS, no lower case.  The\r\ncomputers we used then didn\'t have lower case at all.\r\n\r\n\r\nThis is a retranscription of one of the first Project\r\nGutenberg Etexts, officially dated December, 1971--\r\nand now officially re-released on December 31, 1993--\r\n\r\n\r\nThe United States Declaration of Independence was the first Etext\r\nreleased by Project Gutenberg, early in 1971.  The title was stored\r\nin an emailed instruction set which required a tape or diskpack be\r\nhand mounted for retrieval.  The diskpack was the size of a large\r\ncake in a cake carrier, cost $1500, and contained 5 megabytes, of\r\nwhich this file took 1-2%.  Two tape backups were kept plus one on\r\npaper tape.  The 10,000 files we hope to have online by the end of\r\n2001 should take about 1-2% of a comparably priced drive in 2001.\r\n

Explain: `found_documents = re.findall(doc_extractor, body.text, flags=re.MULTILINE)`



## 1.5 Explore the contents of the Documents

In the matched documents, you will find a heading that Project Gutenberg added to each text. For this assignment, I have provided a cleaning function that removes the Gutenberg headings for you.

In [17]:
def clean_gutenberg(text: str) -> str:
    """Clean the text of a Gutenberg document.
    
    Parameters
    ----------
    text : str
        The text of a Gutenberg document.
    
    Returns
    -------
    str
        The cleaned text of the document.
    """
    text = re.sub(r"\[Etext #\d+\]", "", text)
    text = re.sub(r"(\r\n)+", " ", text)
    text = re.sub(r"^ ?The Project Gutenberg.*?Independence\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*\*\*The Project Gutenberg Etext of The U. S. Bill of Rights\*\*\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?November.*?EST", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*The Project.*?, USA", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*\*\*\*The Project.*?corrections\. \*\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?The Project.*?1775\.", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?Officially.*?calendar\]", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?\*\*The Project.*?, 1865", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ ?The Project.*?, 1861", "", text, flags=re.MULTILINE)
    
    return text.strip()

In [18]:
corpus = []

for i, doc in enumerate(found_documents):
    # YOUR CODE HERE
    cleaned_doc = clean_gutenberg(doc)
    corpus.append(cleaned_doc)

In [19]:
# Explore the corpus here
corpus

["THE DECLARATION OF INDEPENDENCE OF THE UNITED STATES OF AMERICA When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume, among the Powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty, and the pursuit of Happiness. That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on 

# Analyze the above corpus of documents using TF-IDF

In the following cells, please complete these preprocessing steps:

1. Tokenize the documents
2. Lemmatize the tokens
3. Remove stop words
4. Remove punctuation
5. Apply TF-IDF to the corpus
    * You can write a TF-IDF model from scratch or use the `sklearn` library

_tip: see lecture notebooks 4, 5, and 6 for examples of how to work with pandas_
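
For the "from scratch" route, a minimal sketch using only the standard library might look like the following. The helper name `tf_idf` and the toy token lists are made up for illustration, and the IDF variant is an assumption: it uses the smoothed definition documented by scikit-learn, `ln((1 + N) / (1 + df)) + 1`, with raw counts as term frequency. Other weightings are equally valid for this assignment.

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: score} dict per document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in docs:
        term_counts = Counter(tokens)  # raw term frequency
        scores.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_counts.items()
        })
    return scores

# Made-up, already preprocessed documents (tokenized, lemmatized, no stop words).
toy_docs = [["liberty", "law", "law"], ["law", "union"], ["union", "union", "peace"]]
toy_scores = tf_idf(toy_docs)

# Terms of the first document, highest score first.
print(sorted(toy_scores[0], key=toy_scores[0].get, reverse=True))
```

In this toy corpus "law" still outranks "liberty" in the first document, even though it occurs in two documents, because it appears twice there.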


In [20]:
### TIP ###
## if you want to work with pandas create a dataframe with documents as rows and columns for the document number and the text
import pandas as pd
corpus = pd.DataFrame({"docID": range(len(corpus)), "text": corpus})


## Tokenize the documents

In [23]:
## Your code here

# tokenize the corpus using SpaCy
import spacy

NLP = spacy.load("en_core_web_sm")
corpus['tokens'] = corpus['text'].apply(lambda x: [token.text.lower() for token in NLP(x)])
corpus

|  | docID | text | tokens |
|---|---|---|---|
| 0 | 0 | THE DECLARATION OF INDEPENDENCE OF THE UNITED ... | [the, declaration, of, independence, of, the, ... |
| 1 | 1 | The United States Bill of Rights. The Ten Orig... | [the, united, states, bill, of, rights, ., the... |
| 2 | 2 | We observe today not a victory of party but a ... | [we, observe, today, not, a, victory, of, part... |
| 3 | 3 | Four score and seven years ago, our fathers br... | [four, score, and, seven, years, ago, ,, our, ... |
| 4 | 4 | THE CONSTITUTION OF THE UNITED STATES OF AMERI... | [the, constitution, of, the, united, states, o... |
| 5 | 5 | No man thinks more highly than I do of the pat... | [no, man, thinks, more, highly, than, i, do, o... |
| 6 | 6 | In the name of God, Amen. We, whose names are... | [in, the, name, of, god, ,, amen, ., , we, ,,... |
| 7 | 7 | Fellow countrymen: At this second appearing t... | [fellow, countrymen, :, , at, this, second, a... |
| 8 | 8 | Fellow citizens of the United States: in comp... | [fellow, citizens, of, the, united, states, :,... |


In [24]:
corpus_tokens = (corpus
                 .explode('tokens')
                 .drop(columns=['text'])
                 )

corpus_tokens

|  | docID | tokens |
|---|---|---|
| 0 | 0 | the |
| 0 | 0 | declaration |
| 0 | 0 | of |
| 0 | 0 | independence |
| 0 | 0 | of |
| ... | ... | ... |
| 8 | 8 | angels |
| 8 | 8 | of |
| 8 | 8 | our |
| 8 | 8 | nature |


## Lemmatize the tokens

In [27]:
## Your code here

# Lemmatize the corpus using SpaCy
corpus_tokens['lemmas'] = corpus_tokens['tokens'].apply(lambda x: "".join([token.lemma_ for token in NLP(x)]))
corpus_tokens

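|  | docID | tokens | lemmas |
|---|---|---|---|
| 0 | 0 | the | the |
| 0 | 0 | declaration | declaration |
| 0 | 0 | of | of |
| 0 | 0 | independence | independence |
| 0 | 0 | of | of |
| ... | ... | ... | ... |
| 8 | 8 | angels | angel |
| 8 | 8 | of | of |
| 8 | 8 | our | our |
| 8 | 8 | nature | nature |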


## Remove stop words

You can remove stop words with either the `nltk` library or the `spaCy` library.

In [30]:
## Your code here

# Remove stopwords from the corpus using SpaCy
corpus_tokens['no_stopwords'] = corpus_tokens['lemmas'].apply(lambda x: "".join([token.text for token in NLP(x) if not token.is_stop]))

# drop rows with empty no_stopwords
corpus_tokens = corpus_tokens[corpus_tokens['no_stopwords'] != '']
corpus_tokens

|  | docID | tokens | lemmas | no_stopwords |
|---|---|---|---|---|
| 0 | 0 | declaration | declaration | declaration |
| 0 | 0 | independence | independence | independence |
| 0 | 0 | united | united | united |
| 0 | 0 | states | state | state |
| 0 | 0 | america | america | america |
| ... | ... | ... | ... | ... |
| 8 | 8 | surely | surely | surely |
| 8 | 8 | , | , | , |
| 8 | 8 | angels | angel | angel |
| 8 | 8 | nature | nature | nature |


## Remove punctuation

In [31]:
## Your code here

# Drop punctuation-only (and digit-only) rows: keep terms that contain at least one letter
corpus_tokens = corpus_tokens[corpus_tokens['no_stopwords'].str.contains(r'[^\W\d_]', regex=True)]
corpus_tokens

|  | docID | tokens | lemmas | no_stopwords |
|---|---|---|---|---|
| 0 | 0 | declaration | declaration | declaration |
| 0 | 0 | independence | independence | independence |
| 0 | 0 | united | united | united |
| 0 | 0 | states | state | state |
| 0 | 0 | america | america | america |
| ... | ... | ... | ... | ... |
| 8 | 8 | union | union | union |
| 8 | 8 | touched | touch | touch |
| 8 | 8 | surely | surely | surely |
| 8 | 8 | angels | angel | angel |
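
The pattern `[^\W\d_]` matches a single letter, so the filter keeps every row whose term contains at least one letter. That removes pure punctuation and pure digits, but terms that mix punctuation with letters pass through, which is why entries such as `--such` and `.and` show up in the term tables later in this notebook. The sketch below shows this on a few sample strings. The helper `strip_edge_punctuation` is a hypothetical extra cleaning step, not part of the assignment code.

```python
import re

HAS_LETTER = re.compile(r"[^\W\d_]")  # same pattern as the filter above

samples = ["declaration", ",", "--such", "1776", "_", ".and"]
kept = [s for s in samples if HAS_LETTER.search(s)]
print(kept)  # ['declaration', '--such', '.and']

def strip_edge_punctuation(term: str) -> str:
    """Hypothetical helper: trim non-word characters and underscores from both ends."""
    return re.sub(r"^[\W_]+|[\W_]+$", "", term)

cleaned = [strip_edge_punctuation(s) for s in kept]
print(cleaned)  # ['declaration', 'such', 'and']
```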


## Analyze the documents and corpus using TF-IDF

In [32]:
## Your code here

# First calculate the term frequency for each document
term_freq = (corpus_tokens
             .groupby(['docID', 'no_stopwords'])
             .agg({'no_stopwords': 'count'})
             .rename(columns={'no_stopwords': 'term_freq'})
             .reset_index()
             .rename(columns={'no_stopwords': 'term'})
)
term_freq

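|  | docID | term | term_freq |
|---|---|---|---|
| 0 | 0 | --such | 1 |
| 1 | 0 | abdicate | 1 |
| 2 | 0 | abolish | 4 |
| 3 | 0 | absolute | 3 |
| 4 | 0 | absolve | 1 |
| ... | ... | ... | ... |
| 3060 | 8 | worse | 1 |
| 3061 | 8 | worthy | 1 |
| 3062 | 8 | write | 4 |
| 3063 | 8 | wrong | 1 |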


In [33]:
# Document frequency

document_freq = (term_freq
                 .groupby(['docID', 'term'])
                 .size()
                 .unstack()
                 .sum()
                 .reset_index()
                 .rename(columns={0: 'document_freq'})
)
document_freq

|  | term | document_freq |
|---|---|---|
| 0 | --between | 1.0 |
| 1 | --such | 1.0 |
| 2 | .a | 1.0 |
| 3 | .and | 1.0 |
| 4 | .ask | 1.0 |
| ... | ... | ... |
| 1972 | x | 1.0 |
| 1973 | yea | 1.0 |
| 1974 | year | 6.0 |
| 1975 | york | 1.0 |


In [34]:
# Merge the term frequency and document frequency dataframes
term_freq = term_freq.merge(document_freq)

In [36]:
n_docs_in_corpus = len(corpus)
n_docs_in_corpus

9

In [38]:
import numpy as np
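# As parenthesized, this computes ln(2 + N / (1 + df)); it is not
# scikit-learn's smoothed IDF, which is ln((1 + N) / (1 + df)) + 1.
term_freq['idf'] = np.log((1 + n_docs_in_corpus / (1 + term_freq['document_freq']) + 1))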
term_freq

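|  | docID | term | term_freq | document_freq | idf |
|---|---|---|---|---|---|
| 0 | 0 | --such | 1 | 1.0 | 1.871802 |
| 1 | 0 | abdicate | 1 | 1.0 | 1.871802 |
| 2 | 0 | abolish | 4 | 2.0 | 1.609438 |
| 3 | 2 | abolish | 1 | 2.0 | 1.609438 |
| 4 | 0 | absolute | 3 | 2.0 | 1.609438 |
| ... | ... | ... | ... | ... | ... |
| 3060 | 8 | wisely | 1 | 1.0 | 1.871802 |
| 3061 | 8 | withal | 1 | 1.0 | 1.871802 |
| 3062 | 8 | withhold | 1 | 1.0 | 1.871802 |
| 3063 | 8 | worse | 1 | 1.0 | 1.871802 |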
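
The `idf` values in this table can be checked by hand. With N = 9 documents, the expression in the cell above evaluates to `ln(1 + N / (1 + df) + 1)`, so a term found in one document gets `ln(6.5) ≈ 1.871802`. The sketch below reproduces that arithmetic and, for comparison only, evaluates the smoothed IDF that scikit-learn documents, `ln((1 + N) / (1 + df)) + 1`. The function names are made up for illustration, and either weighting gives a usable ranking.

```python
import math

N_DOCS = 9  # number of documents in the corpus, as computed above

def idf_as_written(df: float) -> float:
    """IDF exactly as parenthesized in the cell above: ln(1 + N / (1 + df) + 1)."""
    return math.log(1 + N_DOCS / (1 + df) + 1)

def idf_sklearn_smooth(df: float) -> float:
    """Smoothed IDF as documented by scikit-learn: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + N_DOCS) / (1 + df)) + 1

# Compare both variants for terms found in 1, 2, and all 9 documents.
for df in (1, 2, 9):
    print(df, round(idf_as_written(df), 6), round(idf_sklearn_smooth(df), 6))
```

Under the scikit-learn variant a term that occurs in every document gets an IDF of exactly 1, and rarer terms are weighted more strongly relative to it than under the expression used in this notebook.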


In [39]:
term_freq['tf_idf'] = term_freq['term_freq'] * term_freq['idf']
term_freq.sort_values(by='tf_idf', ascending=False)

|  | docID | term | term_freq | document_freq | idf | tf_idf |
|---|---|---|---|---|---|---|
| 858 | 4 | shall | 191 | 9.0 | 1.064711 | 203.359751 |
| 875 | 4 | state | 130 | 5.0 | 1.252763 | 162.859186 |
| 961 | 4 | united | 55 | 5.0 | 1.252763 | 68.901963 |
| 2129 | 4 | president | 34 | 3.0 | 1.446919 | 49.195245 |
| 526 | 4 | law | 34 | 6.0 | 1.189584 | 40.445858 |
| ... | ... | ... | ... | ... | ... | ... |
| 689 | 0 | place | 1 | 8.0 | 1.098612 | 1.098612 |
| 927 | 7 | time | 1 | 8.0 | 1.098612 | 1.098612 |
| 208 | 0 | december | 1 | 8.0 | 1.098612 | 1.098612 |
| 854 | 0 | shall | 1 | 9.0 | 1.064711 | 1.064711 |


In [43]:
from sklearn import preprocessing
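# axis=0 scales the whole tf_idf column by a single L2 norm, i.e. across all
# documents at once rather than within each document.
term_freq['tfidf_norm'] = preprocessing.normalize(term_freq[['tf_idf']], axis=0, norm='l2')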
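
Because `axis=0` is applied to a single column, every score is divided by one corpus-wide norm, so long documents still dominate. If the goal is to compare documents of different lengths, a common alternative is to normalize each document's vector separately. The following is a minimal standard-library sketch of that idea. The rows are made-up `(docID, term, tf_idf)` triples, not values from this corpus.

```python
import math
from collections import defaultdict

# Made-up (docID, term, tf_idf) rows for illustration.
rows = [(0, "state", 12.0), (0, "right", 5.0), (1, "shall", 3.0), (1, "jury", 4.0)]

# Sum of squared scores per document.
squared_sums = defaultdict(float)
for doc_id, _term, score in rows:
    squared_sums[doc_id] += score ** 2

# Divide each score by the L2 norm of its own document.
normalized = [
    (doc_id, term, score / math.sqrt(squared_sums[doc_id]))
    for doc_id, term, score in rows
]
print(normalized)
```

After this step each document's scores form a unit-length vector, which is the form that cosine-similarity comparisons between documents expect.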

In [49]:
top_n_terms = term_freq.sort_values(by=['docID', 'tf_idf'], ascending=[True, False]).groupby(['docID']).head(5)

In [50]:
top_n_terms

|  | docID | term | term_freq | document_freq | idf | tf_idf | tfidf_norm |
|---|---|---|---|---|---|---|---|
| 872 | 0 | state | 10 | 5.0 | 1.252763 | 12.52763 | 0.034919 |
| 814 | 0 | right | 10 | 6.0 | 1.189584 | 11.895841 | 0.033158 |
| 678 | 0 | people | 10 | 7.0 | 1.139434 | 11.394343 | 0.03176 |
| 523 | 0 | law | 9 | 6.0 | 1.189584 | 10.706257 | 0.029842 |
| 397 | 0 | government | 9 | 7.0 | 1.139434 | 10.254909 | 0.028584 |
| 855 | 1 | shall | 17 | 9.0 | 1.064711 | 18.100083 | 0.050451 |
| 873 | 1 | state | 8 | 5.0 | 1.252763 | 10.022104 | 0.027935 |
| 815 | 1 | right | 7 | 6.0 | 1.189584 | 8.327088 | 0.02321 |
| 524 | 1 | law | 6 | 6.0 | 1.189584 | 7.137504 | 0.019895 |
| 499 | 1 | jury | 4 | 3.0 | 1.446919 | 5.787676 | 0.016132 |
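
The top-n selection is a sort followed by a per-group cut: order the rows by document and descending score, then keep the first n rows of each document. A minimal sketch of the same logic without pandas is shown below. The helper `top_n` is hypothetical, and the rows are a handful of `(docID, term, tf_idf)` values rounded from the table above.

```python
from itertools import groupby
from operator import itemgetter

# (docID, term, tf_idf) rows, rounded from the table above.
rows = [
    (0, "right", 11.9), (0, "state", 12.5), (0, "people", 11.4),
    (1, "state", 10.0), (1, "shall", 18.1),
]

def top_n(rows, n):
    """Return {docID: [terms]} with the n highest-scoring terms per document."""
    # Sort by document first, then by descending score within each document.
    ordered = sorted(rows, key=lambda row: (row[0], -row[2]))
    return {
        doc_id: [term for _doc, term, _score in list(group)[:n]]
        for doc_id, group in groupby(ordered, key=itemgetter(0))
    }

top_2 = top_n(rows, 2)
print(top_2)  # {0: ['state', 'right'], 1: ['shall', 'state']}
```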


# Submission Instructions

Please submit your assignment as a Jupyter Notebook or R Markdown file. You can submit your assignment as a link to a Google Colab notebook or a link to a GitHub repository. If you are submitting a link to a GitHub repository, please make sure that your repository is public. If you email the notebook to me, please zip the file before sending it.