[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/assignment_notebooks/Webscraping.ipynb)

# Webscraping Assignment

Reminder: you are permitted to work with another classmate on this assignment. If you do, please submit a single notebook with both of your names at the top.

## Due date

Friday, February 24, 2023 (12:00 pm)

## Assignment description

In this project you will write a Jupyter Notebook or R Markdown file to scrape a selected website. You will need to:

1. Write a function that takes a URL as input and returns the HTML of the page as a string.
2. Inspect the HTML of the page and use regular expressions to extract the documents within the page.
3. Model the documents in a corpus.
4. Analyze the corpus using the bag-of-words model.
5. Implement a TF-IDF model to extract the n most important words for each document in the corpus.

### Objective

This assignment reinforces previous lecture topics on the linguistic background, properties of language, information theory, and Regular Expressions.


## Submission medium

Jupyter Notebook or R Markdown file. See the additional instructions in the final section of this document.

## Code Dependencies

You will need the following packages. All of them must be installed except `re`, which ships with the Python standard library:

- `requests`
- `re`
- `beautifulsoup4`
- `nltk`
- `pandas`
- `numpy`
- `matplotlib`


## Grading

This assignment is worth 10 points. Extra credit: 1 point added to your final grade if you create a heatmap of the TF-IDF matrix.

## Write a function that takes a URL as input and returns the HTML of the page as a string

### 1.1 Write a function that takes a URL as input and returns the HTML of the page as a string

In [1]:
import requests

def get_html(url: str) -> str:
    """Get the HTML of a webpage and return the HTML as a string.
    
    Parameters
    ----------
    url : str
        The URL of the webpage to scrape.
    
    Returns
    -------
    str
        The HTML of the webpage as a string.
    """
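    response = requests.get(url, timeout=30)  # avoid hanging forever on a slow server
    response.raise_for_status()  # raise on HTTP errors instead of returning an error page
    html_source: str = response.text
    assert isinstance(html_source, str), "The HTML should be a string."
    return html_source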

### 1.2 Inspect the HTML of the page. Can you identify any patterns in the HTML that might be useful for extracting the documents within the page?

In [2]:
# Extract the HTML source code from the URL (this is the same URL we used in class)
url = "https://www.gutenberg.org/files/1/1-0.txt"

# YOUR CODE HERE
html = get_html(url)
#html

The following patterns are notable:

 - `[Etext #x]`, where `x` is the document number, appears before each etext.
 - The marker is usually followed by a line of the form "The Project Gutenberg Etext of *title*".
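
One way to confirm that the marker really precedes each document is to list every occurrence together with its position. The sketch below runs on a short made-up string (`sample` is hypothetical, not the actual Gutenberg text); in the notebook the same call could be pointed at `html` instead.

```python
import re

# Hypothetical stand-in for the downloaded text; the real page is much longer.
sample = (
    "[Etext #1]\r\nThe Project Gutenberg Etext of Document One\r\n"
    "body of the first document\r\n"
    "[Etext #2]\r\nThe Project Gutenberg Etext of Document Two\r\n"
    "body of the second document\r\n"
)

# Record where each marker starts and which document number it carries.
markers = [
    (match.start(), int(match.group(1)))
    for match in re.finditer(r"\[Etext #(\d+)\]", sample)
]

for position, number in markers:
    print(f"marker #{number} starts at character {position}")
```

If the marker count equals the number of documents you expect, the marker is a reasonable delimiter for the extraction regex.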

### 1.3 Use the BeautifulSoup library to create a BeautifulSoup object from the HTML string

In [3]:
!pip install lxml

Collecting lxml
  Using cached lxml-4.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (7.1 MB)
Installing collected packages: lxml
Successfully installed lxml-4.9.2


In [4]:
from bs4 import BeautifulSoup as bs4

soup = bs4(html,'lxml')
# YOUR CODE HERE

### 1.3 Extract the HTML body text and examine the contents.

In [5]:
# Please explain what the following line of code does in the cell below.
body = soup.find("body")
#body

The code
```python
soup.find("body")
```
tells BeautifulSoup to return the first `<body>` tag in the HTML, along with everything nested inside it. Since an HTML document can contain only one body, this is the only `<body>` tag.

### 1.4 Use regular expressions to extract the documents within the page

In [6]:
import re

# Your regex here to capture the documents
doc_extractor = r"(?<=\[Etext #\d\])[^\f]+?(?=\[Etext #\d\]|(?=End of The Project Gutenberg EBook of The Declaration of Independence))"
# Explain this line of code in the cell below.
# __Note:__ `re.MULTILINE` makes `^` and `$` match at every line boundary. It does not
# change how `[^\f]` behaves: that class already matches newlines, so a match can span lines.
found_documents: list = re.findall(doc_extractor, body.text, re.MULTILINE)


assert len(found_documents) == 9, "Please check your regex. You should have found a total 9 documents."

## If you are having trouble with the regex, remember that you can use regex101.com to test and debug.


Explain: `documents = re.findall(doc_extractor, body.text, re.MULTILINE)`



### Answer:
`doc_extractor` is a regular expression that uses a lookbehind and a lookahead to match the text lying between two `[Etext #x]` markers, or between the last marker and the closing line "End of The Project Gutenberg EBook of The Declaration of Independence". The body of the pattern, `[^\f]+?`, lazily matches one or more characters that are not a form feed (`\f`), which includes newlines. The alternation inside the lookahead is needed to capture the ninth document, since no `[Etext #x]` marker follows it. The code:
```python
found_documents: list = re.findall(doc_extractor, body.text, re.MULTILINE)
```
returns a list of every substring of `body.text` that matches `doc_extractor`. The `re.MULTILINE` flag makes `^` and `$` match at the start and end of each line in `body.text`; because `doc_extractor` contains neither anchor, the flag does not change the result here.
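
The lookaround behaviour can be checked on a toy string. This is a minimal sketch: `sample` and its closing line "END OF FILE" are made up for illustration, and the pattern only mirrors the shape of `doc_extractor`.

```python
import re

# Hypothetical miniature of the page: two markers, then a closing line.
sample = "[Etext #1]\nfirst doc\nline two\n[Etext #2]\nsecond doc\nEND OF FILE"

# Lookbehind for a marker, lazy match of non-form-feed characters,
# lookahead for the next marker or the (made-up) closing line.
pattern = r"(?<=\[Etext #\d\])[^\f]+?(?=\[Etext #\d\]|END OF FILE)"

with_flag = re.findall(pattern, sample, re.MULTILINE)
without_flag = re.findall(pattern, sample)

print(with_flag)
print(with_flag == without_flag)
```

The markers themselves are not part of the matches because lookarounds consume no characters, and both calls return the same list because the pattern contains no `^` or `$` for the flag to act on.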


### 1.5 Explore the contents of the documents

In the matched documents, you will find a heading that Project Gutenberg added to the start of each text. For this assignment, I have provided a cleaner function that removes these Gutenberg headings for you.

In [7]:
def clean_gutenberg(text: str) -> str:
    """Clean the text of a Gutenberg document.
    
    Parameters
    ----------
    text : str
        The text of a Gutenberg document.
    
    Returns
    -------
    str
        The cleaned text of the document.
    """
    text = re.sub(r"\[Etext #\d+\]", "", text)
    text = re.sub(r"(\r\n)+", " ", text)
    text = re.sub(r"^The Project Gutenberg.*?Independence\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ \*\*\*\*The Project Gutenberg Etext of The U. S. Bill of Rights\*\*\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ November.*?EST", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ \*\*The Project.*?, USA", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ \*\*\*\*\*The Project.*?corrections\. \*\*\*", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ The Project.*?1775\.", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ Officially.*?calendar\]", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ \*\*The Project.*?, 1865", "", text, flags=re.MULTILINE)
    text = re.sub(r"^ The Project.*?, 1861", "", text, flags=re.MULTILINE)
    
    return text.strip()
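
To see what the first two substitutions in `clean_gutenberg` do, they can be applied by hand to a small fragment. The string `raw` below is hypothetical, and the heading-specific patterns from the function are left out of this sketch.

```python
import re

# Hypothetical fragment in the same shape as one matched document.
raw = "[Etext #2]\r\n\r\nThe United States\r\nBill of Rights.\r\n"

no_marker = re.sub(r"\[Etext #\d+\]", "", raw)   # drop the marker
one_line = re.sub(r"(\r\n)+", " ", no_marker)    # each run of CRLFs becomes one space

print(repr(one_line.strip()))
```

Collapsing the line breaks first leaves a single leading space, which may be why several of the later patterns in the function begin with `^ `.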

In [8]:
corpus = []

for i, doc in enumerate(found_documents):
    # YOUR CODE HERE
    corpus.append(clean_gutenberg(doc))

In [9]:
# Explore the corpus here
corpus[0][0:200]

"The Project Gutenberg Etext of The Declaration of Independence. All of the original Project Gutenberg Etexts from the 1970's were produced in ALL CAPS, no lower case.  The computers we used then didn'"

# Analyze the above corpus of documents using TF-IDF

In what follows, I would like you to carry out these preprocessing steps:

1. Tokenize the documents
2. Lemmatize the tokens
3. Remove stop words
4. Remove punctuation
5. Apply TF-IDF to the corpus
    * You can write a TF-IDF model from scratch or use the `sklearn` library

_tip: see lecture notebooks 4, 5, and 6 for examples of how to work with pandas_


In [10]:
### TIP ###
## if you want to work with pandas create a dataframe with documents as rows and columns for the document number and the text
import pandas as pd
corpus = pd.DataFrame({"docID": range(len(corpus)), "text": corpus})

In [11]:
corpus.head()

|   | docID | text |
|---|---|---|
| 0 | 0 | The Project Gutenberg Etext of The Declaration... |
| 1 | 1 | The United States Bill of Rights. The Ten Orig... |
| 2 | 2 | We observe today not a victory of party but a ... |
| 3 | 3 | Four score and seven years ago, our fathers br... |
| 4 | 4 | THE CONSTITUTION OF THE UNITED STATES OF AMERI... |


In [12]:
!pip install nltk

Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Using cached regex-2022.10.31-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (770 kB)
Installing collected packages: regex, nltk
Successfully installed nltk-3.8.1 regex-2022.10.31


## Tokenize the documents

In [14]:
## Your code here
from nltk.tokenize import word_tokenize
import warnings
warnings.filterwarnings('ignore')

In [15]:
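# Tokenize every document. Assigning the whole column avoids chained indexing
# (corpus['text'][i] = ...), which pandas warns about and may not write back.
corpus['text'] = corpus['text'].apply(word_tokenize)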

In [16]:
corpus

|   | docID | text |
|---|---|---|
| 0 | 0 | [The, Project, Gutenberg, Etext, of, The, Decl... |
| 1 | 1 | [The, United, States, Bill, of, Rights, ., The... |
| 2 | 2 | [We, observe, today, not, a, victory, of, part... |
| 3 | 3 | [Four, score, and, seven, years, ago, ,, our, ... |
| 4 | 4 | [THE, CONSTITUTION, OF, THE, UNITED, STATES, O... |
| 5 | 5 | [No, man, thinks, more, highly, than, I, do, o... |
| 6 | 6 | [In, the, name, of, God, ,, Amen, ., We, ,, wh... |
| 7 | 7 | [Fellow, countrymen, :, At, this, second, appe... |
| 8 | 8 | [Fellow, citizens, of, the, United, States, :,... |


## Lemmatize the tokens

In [17]:
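#imports
from nltk.stem.porter import PorterStemmer

# Note: this cell applies Porter stemming rather than lemmatization, so the results
# are truncated stems (e.g. "observ", "victori"), not dictionary lemmas.
stemmer = PorterStemmer()
#stem every token in every document
corpus['text'] = corpus['text'].apply(lambda tokens: [stemmer.stem(token) for token in tokens])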
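
The cell above uses `PorterStemmer`, so it may help to look at what stemming does to a few words from the corpus. The sketch below assumes `nltk` is installed; `tiny_stop_list` is a hypothetical two-word list used only for illustration, not NLTK's stop word list.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the tokenized corpus shown earlier.
words = ["United", "observe", "victory", "highly", "this"]
stems = [stemmer.stem(word) for word in words]
print(stems)

# A stop list made of unstemmed words no longer matches a stemmed token:
# "this" has become "thi", so it slips through the filter.
tiny_stop_list = {"this", "the"}
survivors = [stem for stem in stems if stem not in tiny_stop_list]
print(survivors)
```

This suggests one design choice to weigh: removing stop words before stemming, or stemming the stop list as well, so that forms like "thi" do not remain in the corpus.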

In [18]:
corpus

|   | docID | text |
|---|---|---|
| 0 | 0 | [the, project, gutenberg, etext, of, the, decl... |
| 1 | 1 | [the, unit, state, bill, of, right, ., the, te... |
| 2 | 2 | [we, observ, today, not, a, victori, of, parti... |
| 3 | 3 | [four, score, and, seven, year, ago, ,, our, f... |
| 4 | 4 | [the, constitut, of, the, unit, state, of, ame... |
| 5 | 5 | [no, man, think, more, highli, than, i, do, of... |
| 6 | 6 | [in, the, name, of, god, ,, amen, ., we, ,, wh... |
| 7 | 7 | [fellow, countrymen, :, at, thi, second, appea... |
| 8 | 8 | [fellow, citizen, of, the, unit, state, :, in,... |


## Remove stop words

You can use the `nltk` library to remove stop words.

In [19]:
## Your code here
#imports
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [20]:
#keep only the tokens that are not stop words
stop_word_set = set(stop_words)  #set membership tests are faster than list lookups
corpus['text'] = corpus['text'].apply(
    lambda lemmas: [lemma for lemma in lemmas if lemma not in stop_word_set]
)

In [21]:
corpus

|   | docID | text |
|---|---|---|
| 0 | 0 | [project, gutenberg, etext, declar, independ, ... |
| 1 | 1 | [unit, state, bill, right, ., ten, origin, ame... |
| 2 | 2 | [observ, today, victori, parti, celebr, freedo... |
| 3 | 3 | [four, score, seven, year, ago, ,, father, bro... |
| 4 | 4 | [constitut, unit, state, america, ,, 1787, peo... |
| 5 | 5 | [man, think, highli, patriot, ,, well, abil, ,... |
| 6 | 6 | [name, god, ,, amen, ., ,, whose, name, underw... |
| 7 | 7 | [fellow, countrymen, :, thi, second, appear, t... |
| 8 | 8 | [fellow, citizen, unit, state, :, complianc, c... |


## Remove punctuation

In [23]:
## Your code here
import string

#single punctuation characters and single digits
punct = set(string.punctuation) | set(string.digits)
#same approach as above, but drop tokens that exactly equal one of these characters
corpus['text'] = corpus['text'].apply(
    lambda lemmas: [lemma for lemma in lemmas if lemma not in punct]
)

## Analyze the documents and corpus using TF-IDF

In [29]:
## Your code here
#each row for a token per docid
corpus_tokens = (corpus
                  .explode('text'))
term_frequency = (corpus_tokens
                  .groupby(by=['docID', 'text'])
                  .agg({'text': 'count'})
                  .rename(columns={'text': 'term_frequency'})
                  .reset_index()
                  .rename(columns={'text': 'term'})
                 )
term_frequency

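|   | docID | term | term_frequency |
|---|---|---|---|
| 0 | 0 | '' | 4 |
| 1 | 0 | 's | 2 |
| 2 | 0 | -- | 3 |
| 3 | 0 | 1-2 | 2 |
| 4 | 0 | 10000 | 1 |
| ... | ... | ... | ... |
| 3819 | 8 | written | 5 |
| 3820 | 8 | wrong | 1 |
| 3821 | 8 | year | 4 |
| 3822 | 8 | yet | 3 |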


Not all punctuation and digits were removed. The filter only drops tokens that exactly equal a single punctuation character or a single digit, so multi-character tokens such as `''`, `--`, the possessive `'s`, and digit combinations like `1-2` and `10000` remain.
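
The effect is easy to reproduce on a handful of tokens. In the sketch below, `tokens` is a hand-picked list based on the table above, and the `letters_only` rule is one possible stricter filter, offered as an assumption rather than as part of the assignment.

```python
import string

# Tokens of the kind that appear in the term-frequency table above.
tokens = ["state", "''", "'s", "--", "1-2", "10000", ",", "7", "year"]

# The notebook's rule: drop tokens equal to ONE punctuation mark or ONE digit.
single_chars = set(string.punctuation) | set(string.digits)
exact_match = [token for token in tokens if token not in single_chars]

# A stricter alternative (assumption): keep only tokens made entirely of letters.
letters_only = [token for token in tokens if token.isalpha()]

print(exact_match)
print(letters_only)
```

The stricter rule also discards hyphenated words and anything containing an apostrophe, so whether it is appropriate depends on the corpus.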

In [35]:
#now get document frequency
document_frequency = (term_frequency
                      .groupby(['docID', 'term'])
                      .size()
                      .unstack()
                      .sum()
                      .reset_index()
                      .rename(columns={0: 'document_frequency'}))
document_frequency

|   | term | document_frequency |
|---|---|---|
| 0 | '' | 5.0 |
| 1 | 'd | 1.0 |
| 2 | 's | 5.0 |
| 3 | -- | 5.0 |
| 4 | .a | 1.0 |
| ... | ... | ... |
| 2062 | year | 6.0 |
| 2063 | yet | 3.0 |
| 2064 | york | 1.0 |
| 2065 | young | 1.0 |
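
The difference between the two counts can be made concrete without pandas. This is a minimal sketch on a hypothetical three-document corpus (`docs` is made up); it is not the notebook's implementation, only the same two definitions written with `collections.Counter`.

```python
from collections import Counter

# Hypothetical corpus of already-cleaned tokens, keyed by document ID.
docs = {
    0: ["state", "right", "state"],
    1: ["state", "year"],
    2: ["year", "peopl", "year"],
}

# Term frequency: how often each term occurs inside one document.
term_frequency = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

# Document frequency: in how many documents each term occurs at least once.
document_frequency = Counter(
    term for tokens in docs.values() for term in set(tokens)
)

print(term_frequency[0]["state"])
print(document_frequency["state"], document_frequency["year"], document_frequency["peopl"])
```

Using `set(tokens)` is what keeps a term that is repeated within one document from being counted more than once toward its document frequency.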


In [42]:
#Count documents, merge tables, create idf 
import numpy as np
documents_in_corpus = term_frequency['docID'].nunique()
term_frequency = term_frequency.merge(document_frequency, on='term')  # join on the shared 'term' column
term_frequency['idf'] = np.log((1 + documents_in_corpus) / (1 + term_frequency['document_frequency'])) + 1
term_frequency

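|   | docID | term | term_frequency | document_frequency | idf |
|---|---|---|---|---|---|
| 0 | 0 | '' | 4 | 5.0 | 1.510826 |
| 1 | 2 | '' | 1 | 5.0 | 1.510826 |
| 2 | 4 | '' | 2 | 5.0 | 1.510826 |
| 3 | 7 | '' | 2 | 5.0 | 1.510826 |
| 4 | 8 | '' | 14 | 5.0 | 1.510826 |
| ... | ... | ... | ... | ... | ... |
| 3819 | 8 | withal | 1 | 1.0 | 2.609438 |
| 3820 | 8 | withhold | 1 | 1.0 | 2.609438 |
| 3821 | 8 | wors | 1 | 1.0 | 2.609438 |
| 3822 | 8 | written | 5 | 1.0 | 2.609438 |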
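
The `idf` column follows the smoothed formula `ln((1 + N) / (1 + df)) + 1`, where `N` is the number of documents and `df` is the document frequency. As far as I know this is the same form scikit-learn documents for `TfidfTransformer` with `smooth_idf=True`. The sketch below recomputes a few values by hand; the helper name `smoothed_idf` is hypothetical.

```python
import math

def smoothed_idf(n_documents: int, document_frequency: int) -> float:
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

n_documents = 9  # the corpus in this notebook has nine documents

for df in (1, 5, 9):
    print(df, round(smoothed_idf(n_documents, df), 6))
```

A term that occurs in all nine documents gets an IDF of exactly 1, so its TF-IDF score equals its raw term frequency; rarer terms are weighted up from there.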


In [44]:
#create tf-idf
term_frequency['tfidf'] = term_frequency['term_frequency'] * term_frequency['idf']
#note: this view is ordered by raw term frequency, not by tf-idf score
term_frequency.sort_values(by=['term_frequency'], ascending=False)

|   | docID | term | term_frequency | document_frequency | idf | tfidf |
|---|---|---|---|---|---|---|
| 1392 | 4 | shall | 191 | 9.0 | 1.000000 | 191.000000 |
| 1419 | 4 | state | 132 | 5.0 | 1.510826 | 199.428982 |
| 1585 | 4 | unit | 55 | 5.0 | 1.510826 | 83.095409 |
| 110 | 4 | ani | 42 | 8.0 | 1.105361 | 46.425142 |
| 1420 | 8 | state | 39 | 5.0 | 1.510826 | 58.922199 |
| ... | ... | ... | ... | ... | ... | ... |
| 2091 | 3 | forth | 1 | 5.0 | 1.510826 | 1.510826 |
| 2092 | 4 | forth | 1 | 5.0 | 1.510826 | 1.510826 |
| 2093 | 5 | forth | 1 | 5.0 | 1.510826 | 1.510826 |
| 2094 | 7 | forth | 1 | 5.0 | 1.510826 | 1.510826 |
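
The assignment asks for the n most important words per document, which means ranking by `tfidf` within each `docID`. The sketch below shows one way to do that with pandas, using a few rows copied from the table above; the helper name `top_n_terms` is hypothetical.

```python
import pandas as pd

# Rows copied from the table above (documents 4 and 8 only).
scores = pd.DataFrame({
    "docID": [4, 4, 4, 4, 8],
    "term": ["shall", "state", "unit", "ani", "state"],
    "tfidf": [191.000000, 199.428982, 83.095409, 46.425142, 58.922199],
})

def top_n_terms(frame: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return the n highest-scoring terms for each document."""
    ordered = frame.sort_values(["docID", "tfidf"], ascending=[True, False])
    return ordered.groupby("docID").head(n).reset_index(drop=True)

top_terms = top_n_terms(scores, n=2)
print(top_terms)
```

In document 4, "state" outranks "shall" once the scores are sorted by `tfidf`, even though "shall" has the higher raw term frequency.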


# Submission Instructions

Please submit your assignment as a Jupyter Notebook or R Markdown file. You can submit your assignment as a link to a Google Colab notebook or a link to a GitHub repository. If you are submitting a link to a GitHub repository, please make sure that your repository is public. If you email the notebook to me, please zip the file before sending it.