# Web Mining and Applied NLP (44-620)

## Web Scraping and NLP with Requests, BeautifulSoup, and spaCy

### Student Name: Alexandra Ledgerwood
Github Repo: https://github.com/ALedgerwood/Module-6

Perform the tasks described in the Markdown cells below. When you have completed the assignment, make sure all of your code cells have been run (and have output beneath them) and that you have committed and pushed ALL of your changes to your assignment repository.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a python file (`.py`), then import and run the appropriate code to answer the question.

1. Write code that extracts the article html from https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/ and dumps it to a .pkl (or other appropriate file)

# Question 1
### Extract article HTML from website

In [1]:
import requests

response = requests.get('https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/')

print(response.status_code)
print(response.headers['content-type'])
# Uncomment the next line to print the full HTML text; it's long, so re-comment it when done
# print(response.text)

200
text/html; charset=UTF-8


In [2]:
from bs4 import BeautifulSoup

# parser = 'html5lib'
parser = 'html.parser'

soup = BeautifulSoup(response.text, parser)
# Uncomment the next lines to explore the full page contents; they're long, so re-comment them when done
# print(soup)
# print(soup.prettify())

In [4]:
# find_all is the current name for the older findAll alias
for header in soup.find_all('h1'):
    print('h1 header:', header)
    # print('h1 text:', header.text)

h1 header: <h1 class="site-title">
<a href="https://web.archive.org/web/20210327165005/https://hackaday.com/" rel="home">Hackaday</a>
</h1>
h1 header: <h1 class="entry-title" itemprop="name">How Laser Headlights Work</h1>
h1 header: <h1 class="screen-reader-text">Post navigation</h1>
h1 header: <h1 class="widget-title">Search</h1>
h1 header: <h1 class="widget-title">Never miss a hack</h1>
h1 header: <h1 class="widget-title">Subscribe</h1>
h1 header: <h1 class="widget-title">If you missed it</h1>
h1 header: <h1 class="widget-title">Our Columns</h1>
h1 header: <h1 class="widget-title">Search</h1>
h1 header: <h1 class="widget-title">Never miss a hack</h1>
h1 header: <h1 class="widget-title">Subscribe</h1>
h1 header: <h1 class="widget-title">If you missed it</h1>
h1 header: <h1 class="widget-title">Categories</h1>
h1 header: <h1 class="widget-title">Our Columns</h1>
h1 header: <h1 class="widget-title">Recent comments</h1>
h1 header: <h1 class="widget-title">Now on Hackaday.io</h1>
h1 heade

In [6]:
# Reuse the laser headlights page fetched in cell 1 instead of requesting a different article
article_html = response.text

# pickle works similarly to json, but stores information in a binary format
# json files are readable by humans; pickle files, not so much

# BeautifulSoup objects don't pickle well, so cache the text of the web page instead
# (this avoids repeated requests, which is polite to web developers),
# or dump it to an html file you can read in later as a regular file
import pickle
with open('laser-headlights.pkl', 'wb') as f:
    pickle.dump(article_html, f)

In [9]:
with open('laser-headlights.pkl', 'rb') as f:
    article_html = pickle.load(f)
# Parse the cached HTML so the article element comes from the pickle, not the earlier soup
article_element = BeautifulSoup(article_html, parser).find('article')
# Print the article element to check it; it's long, so comment this out once inspected
print(article_element)

<article class="post-466450 post type-post status-publish format-standard has-post-thumbnail hentry category-car-hacks category-engineering category-featured category-laser-hacks category-slider tag-laser tag-laser-headlight tag-laser-headlights tag-light" id="post-466450" itemscope="" itemtype="http://schema.org/Article">
<header class="entry-header">
<h1 class="entry-title" itemprop="name">How Laser Headlights Work</h1>
<div class="entry-meta">
<a class="comments-counts" href="https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/#comments"><span class="icon-hackaday icon-hackaday-comment"></span>
                130 Comments            </a>
<ul class="meta-authors vcard author">
<li>by:</li>
<span class="fn"><a class="author url fn" href="https://web.archive.org/web/20210327165005/https://hackaday.com/author/lewinday/" rel="author" title="Posts by Lewin Day">Lewin Day</a></span>
</ul>
</div><!-- .entry-meta -->
<div class="entry-meta en

2. Read in your article's html source from the file you created in question 1 and print its text (use `.get_text()`)

# Question 2
### Get the article text
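
A minimal sketch of one way to approach this, assuming the HTML was cached as a pickled string in question 1. The helper names `load_cached_html` and `html_to_text` are illustrative and not part of the assignment; `bs4` is imported inside the function so the pickle helper can be used on its own.

```python
import pickle


def load_cached_html(path):
    """Return the HTML string that was pickled to `path` (hypothetical helper)."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def html_to_text(html, parser='html.parser'):
    """Parse the HTML and return the text of its <article> element.

    Falls back to the text of the whole document if no <article> is found.
    """
    from bs4 import BeautifulSoup  # third-party: beautifulsoup4

    soup = BeautifulSoup(html, parser)
    article = soup.find('article')
    target = article if article is not None else soup
    return target.get_text()
```

Typical use would be `print(html_to_text(load_cached_html(pkl_path)))`, where `pkl_path` is the file written in question 1.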

3. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case).  Print the common tokens with an appropriate label.  Additionally, print the tokens with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

4. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).
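
Questions 3 and 4 differ only in which token attribute is counted, so one possible sketch handles both. The function names, the `attr` parameter, and the `en_core_web_sm` model name are assumptions for illustration; `is_punct`, `is_stop`, `is_space`, `text`, and `lemma_` are standard spaCy token attributes.

```python
from collections import Counter


def is_interesting(token):
    """Keep tokens that are not punctuation, stop words, or whitespace."""
    return not (token.is_punct or token.is_stop or token.is_space)


def most_common_items(doc, attr='text', n=5):
    """Count lower-cased `attr` values ('text' or 'lemma_') of interesting tokens."""
    counts = Counter(
        getattr(token, attr).lower() for token in doc if is_interesting(token)
    )
    return counts.most_common(n)


def load_doc(text, model='en_core_web_sm'):
    """Run `text` through a trained spaCy pipeline (the model name is an assumption)."""
    import spacy  # third-party; the model must be downloaded separately

    return spacy.load(model)(text)


# Example use (commented out because it needs spaCy and a downloaded model):
# doc = load_doc(article_text)
# for token, freq in most_common_items(doc, 'text'):
#     print(f'Token: {token!r}  Frequency: {freq}')
# for lemma, freq in most_common_items(doc, 'lemma_'):
#     print(f'Lemma: {lemma!r}  Frequency: {freq}')
```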

5. Define the following methods:
    * `score_sentence_by_token(sentence, interesting_token)` that takes a sentence and a list of interesting tokens and returns the number of times that any of the interesting tokens appear in the sentence divided by the number of words in the sentence
    * `score_sentence_by_lemma(sentence, interesting_lemmas)` that takes a sentence and a list of interesting lemmas and returns the number of times that any of the interesting lemmas appear in the sentence divided by the number of words in the sentence
    
You may find some of the code from the in-class notes useful; feel free to reuse those methods (rewrite them in this cell as well).  Test them by showing the score of the first sentence in your article using the frequent tokens and frequent lemmas identified in questions 3 and 4.
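
The two methods could be sketched as below, assuming `sentence` is a spaCy `Span` (or any iterable of tokens) and that "words" means tokens that are neither punctuation nor whitespace. That reading of "words" and the private helper `_score` are assumptions, not something the assignment specifies.

```python
def _score(sentence, interesting, attr):
    """Fraction of the sentence's words whose lower-cased `attr` is in `interesting`."""
    # Assumption: "words" are tokens that are neither punctuation nor whitespace
    words = [t for t in sentence if not (t.is_punct or t.is_space)]
    if not words:
        return 0.0  # avoid dividing by zero on empty or punctuation-only sentences
    wanted = set(interesting)
    hits = sum(1 for t in words if getattr(t, attr).lower() in wanted)
    return hits / len(words)


def score_sentence_by_token(sentence, interesting_token):
    """Share of words in `sentence` whose lower-cased text is an interesting token."""
    return _score(sentence, interesting_token, 'text')


def score_sentence_by_lemma(sentence, interesting_lemmas):
    """Share of words in `sentence` whose lower-cased lemma is an interesting lemma."""
    return _score(sentence, interesting_lemmas, 'lemma_')
```

With a parsed `doc`, the first sentence is available as `list(doc.sents)[0]`, so a test call might look like `score_sentence_by_token(list(doc.sents)[0], frequent_tokens)`, where `frequent_tokens` is a hypothetical name for the list from question 3.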

6. Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

7. Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?
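
Questions 6 and 7 follow the same pattern, so a single hedged sketch can cover both. Here `score_fn` stands for either scoring method from question 5; the helper names, the axis labels, and the default of 10 bins are illustrative choices. `most_common_range` is a plain-Python cross-check of which equal-width bin is fullest, not a replacement for reading the histogram.

```python
def sentence_scores(sentences, score_fn, interesting):
    """Score every sentence with `score_fn` (token- or lemma-based)."""
    return [score_fn(sentence, interesting) for sentence in sentences]


def plot_score_histogram(scores, title, bins=10):
    """Draw a labelled histogram of sentence scores and return the figure and axes."""
    import matplotlib.pyplot as plt  # third-party

    fig, ax = plt.subplots()
    ax.hist(scores, bins=bins)
    ax.set_title(title)
    ax.set_xlabel('Sentence score')
    ax.set_ylabel('Number of sentences')
    return fig, ax


def most_common_range(scores, bins=10):
    """Return the (low, high) edges of the fullest equal-width bin."""
    lo, hi = min(scores), max(scores)
    if lo == hi:
        return (lo, hi)  # all scores are identical
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in scores:
        # The maximum score is placed in the last bin rather than past it
        idx = min(int((s - lo) / width), bins - 1)
        counts[idx] += 1
    best = counts.index(max(counts))
    return (lo + best * width, lo + (best + 1) * width)
```

For example, `plot_score_histogram(sentence_scores(doc.sents, score_sentence_by_token, frequent_tokens), 'Sentence scores (tokens)')` would plot the token-based scores, assuming `doc` and `frequent_tokens` come from the earlier questions.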

8. Which tokens and lemmas would be omitted from the lists generated in questions 3 and 4 if we only wanted to consider nouns as interesting words?  How might we change the code to only consider nouns? Put your answer in this Markdown cell (you can edit it by double clicking it).
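
One possible code change, shown as a sketch: add a part-of-speech check to the filter. The function name `most_common_nouns` is hypothetical. spaCy's `token.pos_` is only filled in by a pipeline that includes a tagger, and proper nouns are tagged `PROPN` rather than `NOUN`, so whether to include them is a separate choice.

```python
from collections import Counter


def most_common_nouns(doc, attr='lemma_', n=5):
    """Count lower-cased `attr` values of tokens tagged as common nouns."""
    counts = Counter(
        getattr(token, attr).lower()
        for token in doc
        # Same filter as before, plus a part-of-speech check
        if token.pos_ == 'NOUN'
        and not (token.is_punct or token.is_stop or token.is_space)
    )
    return counts.most_common(n)
```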