# Web Mining and Applied NLP (44-620)

## Web Scraping and NLP with Requests, BeautifulSoup, and spaCy

### Student Name: Jacob Sellinger

Perform the tasks described in the Markdown cells below.  When you have completed the assignment make sure your code cells have all been run (and have output beneath them) and ensure you have committed and pushed ALL of your changes to your assignment repository.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a Python file (`.py`), then import and run the appropriate code to answer the question.

### Set Up
1. Create the virtual environment:
    py -m venv .venv
2. Activate the virtual environment:
    .venv\Scripts\Activate
3. Confirm that the notebook is using the interpreter from `.venv`, just in case
4. Get to work
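
Step 3 can also be checked programmatically. This is a minimal sketch, assuming a standard `venv` environment; the helper name `in_virtualenv` is hypothetical. Inside a venv, `sys.prefix` points at the environment directory while `sys.base_prefix` still points at the base installation.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # In a venv, sys.prefix is the environment directory and
    # sys.base_prefix is the installation it was created from.
    return sys.prefix != sys.base_prefix

print("Running inside a virtual environment:", in_virtualenv())
```

If this prints `False`, the notebook kernel is probably using the global interpreter rather than `.venv`.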

In [1]:
## Confirm Path just in case

import sys
import os

print("--- Python Environment Diagnostics ---")
print("Python Executable:", sys.executable)
print("Python Version:", sys.version)
print("PYTHONPATH (Environment Variable):", os.environ.get('PYTHONPATH', 'Not set'))
print("sys.path (Interpreter's Search Path):")
for p in sys.path:
    print(f"  - {p}")
print("--- End Diagnostics ---\n")

--- Python Environment Diagnostics ---
Python Executable: c:\Users\jacob\AppData\Local\Programs\Python\Python313\python.exe
Python Version: 3.13.0 (tags/v3.13.0:60403a5, Oct  7 2024, 09:38:07) [MSC v.1941 64 bit (AMD64)]
PYTHONPATH (Environment Variable): Not set
sys.path (Interpreter's Search Path):
  - c:\Users\jacob\AppData\Local\Programs\Python\Python313\python313.zip
  - c:\Users\jacob\AppData\Local\Programs\Python\Python313\DLLs
  - c:\Users\jacob\AppData\Local\Programs\Python\Python313\Lib
  - c:\Users\jacob\AppData\Local\Programs\Python\Python313
  - 
  - c:\Users\jacob\AppData\Local\Programs\Python\Python313\Lib\site-packages
  - c:\Users\jacob\AppData\Local\Programs\Python\Python313\Lib\site-packages\win32
  - c:\Users\jacob\AppData\Local\Programs\Python\Python313\Lib\site-packages\win32\lib
  - c:\Users\jacob\AppData\Local\Programs\Python\Python313\Lib\site-packages\Pythonwin
--- End Diagnostics ---



1. Write code that extracts the article html from https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/ and dumps it to a .pkl (or other appropriate file)

In [26]:
# Options considered: pickling (.pkl), JSON, or loading straight into a pandas DataFrame
# Choosing JSON because it has plenty of examples and is not Python-specific

import requests
from bs4 import BeautifulSoup
import json

#Let's create a function just so I can do this later

def web_scrape(url):
    """Return the page's text as one continuous string with the HTML tags stripped."""
    response = requests.get(url)
    # Stop early on HTTP errors instead of saving an error page
    response.raise_for_status()
    html_text = response.text
    web_soup = BeautifulSoup(html_text, 'html.parser')
    web_soup_text = web_soup.text

    return web_soup_text
    
def save_to_json(data, filename):
    with open(filename, 'w') as f:
        json.dump(data, f, indent = 4)
    print(f"file saved successfully to {filename}")


save_to_json(web_scrape("https://en.wikipedia.org/wiki/Logging_(computing)"), "test.json")

file saved successfully to test.json
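
Question 1 also allows a `.pkl` file. Below is a minimal sketch of that alternative, assuming the HTML is already in hand as a string (in this notebook that would be `response.text`). The helper names `save_html_pickle` and `load_html_pickle`, the file name, and the sample HTML are all hypothetical.

```python
import os
import pickle
import tempfile

def save_html_pickle(html, filename):
    """Write the HTML string to a pickle file."""
    with open(filename, "wb") as f:  # pickle needs binary mode
        pickle.dump(html, f)

def load_html_pickle(filename):
    """Read the HTML string back from a pickle file."""
    with open(filename, "rb") as f:
        return pickle.load(f)

# Stand-in for response.text; this is not the real article
sample_html = "<article><h1>Title</h1><p>Body text.</p></article>"
pkl_path = os.path.join(tempfile.gettempdir(), "article_example.pkl")
save_html_pickle(sample_html, pkl_path)
restored_html = load_html_pickle(pkl_path)
print(restored_html == sample_html)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.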


2. Read in your article's html source from the file you created in question 1 and print its text (use `.get_text()`)

In [None]:
# I used a JSON file so this is how I will interact with it

def load_json(filename):
    with open(filename, 'r') as f:
        data = json.load(f)
    print(data)
    return data

load_json("test.json")

3. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case).  Print the common tokens with an appropriate label.  Additionally, print the tokens with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

In [None]:
# The JSON file holds a single long string, which json.load returns as a Python str, so the text still needs to be processed
# The first step is pre-processing, i.e. cleaning the text
import spacy
import spacytextblob

def preprocess_text(text):
    "Following the natural spacy pipeline of tokenization, lemmatization, removal of junk, tagging, NER"
    #Initialize language and doc
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)

    #Tokenization
    tokens = [token.text.lower() for token in doc]

    #Lemmatization
    lemmas = [token.lemma_.lower() for token in doc]

    #Removing Junk (stop words, plus anything non-alphabetic such as punctuation and whitespace)
    filtered_tokens = [token.text.lower() for token in doc if not token.is_stop and token.is_alpha]

    #Filtered Lemmas
    filtered_lemmas = [token.lemma_.lower() for token in doc if not token.is_stop and token.is_alpha]

    #Part of Speech Tagging
    pos_tags = [(token.text.lower(), token.pos_) for token in doc]

    #Named Entity Recognition (NER)
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    print(tokens, lemmas, filtered_tokens, filtered_lemmas, pos_tags, entities)
    return tokens, lemmas, filtered_tokens, filtered_lemmas, pos_tags, entities


example = load_json("test.json")
example_processed_data = preprocess_text(example)

# Next, to answer the actual question, we will use the collections module.
# This was suggested by Gemini for its efficiency with large data sets.

from collections import Counter

def spacy_count(data, top_x):
    # Count items
    counter = Counter(data)
    #Count unique items and occurrences
    unique_items = dict(counter)
    top_x_items = counter.most_common(top_x)

    print(unique_items, top_x_items)
    return unique_items, top_x_items

spacy_count(example_processed_data[2], 5)
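
The question asks for labelled output. One possible way to print the result of `Counter.most_common` with labels is sketched below on a made-up token list. The helper `report_top_items` and the sample tokens are assumptions, not part of the notebook.

```python
from collections import Counter

def report_top_items(items, top_x, label):
    """Print the top_x most frequent items with labels and return them."""
    top_items = Counter(items).most_common(top_x)
    print(f"Top {top_x} {label}s: {[item for item, _ in top_items]}")
    for item, count in top_items:
        print(f"  {label}: {item!r:<12} frequency: {count}")
    return top_items

# Made-up tokens for illustration only
sample_tokens = ["laser", "light", "laser", "beam", "light", "laser"]
top_tokens = report_top_items(sample_tokens, 2, "token")
```

In this notebook the equivalent call would be `report_top_items(example_processed_data[2], 5, "token")`.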


4. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace).

In [48]:
spacy_count(example_processed_data[3], 5)

{'logging': 8, 'computing': 5, 'wikipedia': 5, 'jump': 1, 'content': 6, 'main': 4, 'menu': 2, 'sidebar': 4, 'hide': 4, 'navigation': 1, 'pagecontentscurrent': 1, 'eventsrandom': 1, 'articleabout': 1, 'wikipediacontact': 1, 'contribute': 1, 'helplearn': 1, 'editcommunity': 1, 'portalrecent': 1, 'changesupload': 2, 'filespecial': 1, 'page': 4, 'search': 11, 'appearance': 2, 'donate': 2, 'create': 3, 'account': 2, 'log': 67, 'personal': 1, 'tool': 1, 'pages': 1, 'editor': 1, 'learn': 2, 'contributionstalk': 1, 'types': 2, 'toggle': 3, 'subsection': 1, 'event': 17, 'transaction': 13, 'message': 16, 'server': 19, 'reference': 2, 'table': 2, 'language': 2, 'catalàčeštinadeutschespañolفارسیfrançais한국어հայերենbahasa': 1, 'indonesiaitalianoעבריתқазақшаmagyarnederlands日本語norsk': 1, 'bokmålpolskiportuguêsрусскийsimple': 1, 'englishslovenčinasuomisvenskatürkçeукраїнська粵語中文': 1, 'edit': 2, 'link': 3, 'articletalk': 1, 'english': 1, 'readeditview': 2, 'history': 4, 'tools': 2, 'actions': 1, 'general

({'logging': 8,
  'computing': 5,
  'wikipedia': 5,
  'jump': 1,
  'content': 6,
  'main': 4,
  'menu': 2,
  'sidebar': 4,
  'hide': 4,
  'navigation': 1,
  'pagecontentscurrent': 1,
  'eventsrandom': 1,
  'articleabout': 1,
  'wikipediacontact': 1,
  'contribute': 1,
  'helplearn': 1,
  'editcommunity': 1,
  'portalrecent': 1,
  'changesupload': 2,
  'filespecial': 1,
  'page': 4,
  'search': 11,
  'appearance': 2,
  'donate': 2,
  'create': 3,
  'account': 2,
  'log': 67,
  'personal': 1,
  'tool': 1,
  'pages': 1,
  'editor': 1,
  'learn': 2,
  'contributionstalk': 1,
  'types': 2,
  'toggle': 3,
  'subsection': 1,
  'event': 17,
  'transaction': 13,
  'message': 16,
  'server': 19,
  'reference': 2,
  'table': 2,
  'language': 2,
  'catalàčeštinadeutschespañolفارسیfrançais한국어հայերենbahasa': 1,
  'indonesiaitalianoעבריתқазақшаmagyarnederlands日本語norsk': 1,
  'bokmålpolskiportuguêsрусскийsimple': 1,
  'englishslovenčinasuomisvenskatürkçeукраїнська粵語中文': 1,
  'edit': 2,
  'link': 3,
  

5. Define the following methods:
    * `score_sentence_by_token(sentence, interesting_token)` that takes a sentence and a list of interesting tokens and returns the number of times that any of the interesting words appear in the sentence divided by the number of words in the sentence
    * `score_sentence_by_lemma(sentence, interesting_lemmas)` that takes a sentence and a list of interesting lemmas and returns the number of times that any of the interesting lemmas appear in the sentence divided by the number of words in the sentence
    
You may find some of the code from the in-class notes useful; feel free to use those methods (rewrite them in this cell as well).  Test them by showing the score of the first sentence in your article using the frequent tokens and frequent lemmas identified in question 3.

In [None]:
def score_sentence_by_token(sentence, interesting_token, nlp_model = "en_core_web_sm"):
    #Initialize language and doc
    nlp = spacy.load(nlp_model)
    doc = nlp(sentence)

    all_tokens = [token.text.lower() for token in doc if token.is_alpha]
    hits = sum(1 for token in all_tokens if token in interesting_token)

    total_tokens = len(all_tokens)

    if total_tokens == 0:
        return 0.0
    
    score = hits/total_tokens
    return score

def score_sentence_by_lemma(sentence, interesting_lemma, nlp_model = "en_core_web_sm"):
    #Initialize language and doc
    nlp = spacy.load(nlp_model)
    doc = nlp(sentence)

    # Keep stop words so the denominator is the number of words in the sentence, matching score_sentence_by_token
    all_lemmas = [token.lemma_.lower() for token in doc if token.is_alpha]
    hits = sum(1 for token in all_lemmas if token in interesting_lemma)

    total_lemmas = len(all_lemmas)

    if total_lemmas == 0:
        return 0.0
    
    score = hits/total_lemmas
    return score
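
Both functions compute the same ratio: matches divided by the number of words. The small sketch below reproduces that arithmetic on an already-tokenized sentence, which may help when checking a score by hand without loading a spaCy model. The helper `score_words` and the sample words are hypothetical.

```python
def score_words(words, interesting_words):
    """Fraction of words that appear in interesting_words."""
    if not words:
        return 0.0  # avoid dividing by zero on an empty sentence
    interesting = set(interesting_words)  # set gives fast membership tests
    hits = sum(1 for word in words if word in interesting)
    return hits / len(words)

# Made-up, already lower-cased words for illustration only
sample_words = ["laser", "headlights", "are", "very", "bright"]
sample_score = score_words(sample_words, ["laser", "bright"])
print(f"Score: {sample_score}")
```

Here two of the five words are interesting, so the score is 2/5.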

6. Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

7. Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?
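
For questions 6 and 7, the most common range can be read off the histogram, and it can also be computed directly as a cross-check. This is a hedged sketch, assuming every score lies in [0, 1] and that ten equal-width bins are used. The function name, the bin count, and the sample scores are assumptions.

```python
from collections import Counter

def most_common_range(scores, bins=10):
    """Return (low, high, count) for the fullest equal-width bin on [0, 1]."""
    # A score of exactly 1.0 is placed in the last bin
    indices = [min(int(score * bins), bins - 1) for score in scores]
    index, count = Counter(indices).most_common(1)[0]
    return index / bins, (index + 1) / bins, count

# Made-up scores for illustration only
sample_scores = [0.05, 0.12, 0.15, 0.18, 0.5]
low, high, count = most_common_range(sample_scores)
print(f"Most common range: {low} to {high} ({count} sentences)")
```

The function expects a non-empty list. The bin edges only match the plotted histogram if the plot uses the same number of bins over the same range.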

8. Which tokens and lexemes would be omitted from the lists generated in questions 3 and 4 if we only wanted to consider nouns as interesting words?  How might we change the code to only consider nouns? Put your answer in this Markdown cell (you can edit it by double clicking it).
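
One way to approach the code change: spaCy exposes a coarse part-of-speech tag on each token as `token.pos_`, with `NOUN` for common nouns and `PROPN` for proper nouns. The sketch below applies that filter to `(text, pos)` pairs shaped like the `pos_tags` list built in question 3. The sample pairs are hand-written for illustration, not model output, and `keep_nouns` is a hypothetical helper.

```python
def keep_nouns(pos_tags, include_proper=False):
    """Keep the words whose part-of-speech tag marks them as nouns."""
    wanted = {"NOUN", "PROPN"} if include_proper else {"NOUN"}
    return [text for text, pos in pos_tags if pos in wanted]

# Hand-written (text, pos) pairs, not real model output
sample_pos_tags = [
    ("laser", "NOUN"), ("headlights", "NOUN"), ("shine", "VERB"),
    ("bright", "ADJ"), ("bmw", "PROPN"),
]
nouns = keep_nouns(sample_pos_tags)
print(nouns)
```

Inside the spaCy list comprehensions, the equivalent extra condition would be `token.pos_ == "NOUN"`.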