# Web Mining and Applied NLP (CSIS 44-620)

## P6: Web Scraping, NLP (Requests, BeautifulSoup, and spaCy) & Engage

### 
Author: Data-Git-Hub <br>
GitHub Project Repository Link: https://github.com/Data-Git-Hub/web-scraping <br>
6 July 2025 <br>

### Introduction
In this project, I explore the fundamentals of web scraping and natural language processing (NLP) using Python within a Jupyter Notebook environment. The primary objective is to extract textual data from web sources and perform basic NLP tasks to analyze and interpret that data. This includes using libraries such as `requests` to fetch web content, `BeautifulSoup` to parse HTML, and `spaCy` with `spacytextblob` to process and analyze text for sentiment, subjectivity, and linguistic patterns. <br>

Web scraping is an essential skill in data analytics and business intelligence, enabling analysts to gather real-time or hard-to-find data from public web pages. NLP extends this capability by allowing us to interpret the collected text, uncover hidden insights, and support data-driven decision-making. Together, these skills allow for scalable and automated information extraction that can inform strategy, research, and communication analysis. <br>

This project also demonstrates effective use of Python virtual environments, version control with GitHub, and professional documentation practices. All code has been executed prior to submission, and final versions have been exported to HTML to ensure accessibility and review readiness. The final submission includes code, outputs, visualizations, and reflections on the process. <br>

### Imports
Python libraries are collections of pre-written code that provide specific functionality, making programming more efficient and reducing the need to write code from scratch. These libraries cover a wide range of applications, including data analysis, machine learning, web development, and automation. Some libraries, such as `os`, `sys`, `math`, `json`, and `datetime`, come built in with Python as part of its standard library, providing essential functions for file handling, system operations, mathematical computations, and data serialization. Other popular third-party libraries, like `pandas`, `numpy`, `matplotlib`, `seaborn`, and `scikit-learn`, must be installed separately and are widely used in data science and machine learning. The extensive availability of libraries in Python's ecosystem makes it a versatile and powerful programming language for various domains. <br>

`beautifulsoup4` is a Python library used for parsing HTML and XML documents. It provides Pythonic methods for navigating, searching, and modifying the parse tree, making it ideal for web scraping tasks. BeautifulSoup is particularly useful for extracting data from web pages with inconsistent or poorly structured HTML. It works well with parsers like `html5lib` and `lxml`. <br>
https://www.crummy.com/software/BeautifulSoup/bs4/doc/ <br>

`html5lib` is a pure-Python HTML parser designed to parse documents the same way modern web browsers do. It is especially useful for handling malformed or messy HTML. When used with `beautifulsoup4`, it provides robust parsing capabilities that help ensure accurate and tolerant extraction of web content. <br>
https://html5lib.readthedocs.io/en/latest/ <br>

`ipykernel` allows Jupyter Notebooks to run Python code by providing the kernel interface used to execute cells and handle communication between the front-end and the Python interpreter. <br>
https://ipykernel.readthedocs.io/en/latest/ <br>

`jupyterlab` is the next-generation user interface for Project Jupyter. It offers a flexible, extensible environment for interactive computing with support for code, markdown, visualizations, and terminals all within a tabbed workspace. JupyterLab enhances productivity by allowing users to organize notebooks, text editors, and data file viewers side by side. <br>
https://jupyterlab.readthedocs.io/en/stable/ <br>

`Matplotlib` is a widely used data visualization library that allows users to create static, animated, and interactive plots. It provides extensive tools for generating various chart types, including line plots, scatter plots, histograms, and bar charts, making it a critical library for exploratory data analysis. <br>
https://matplotlib.org/stable/contents.html <br>

`notebook` is the Python package that powers the classic Jupyter Notebook interface. It provides a web-based environment for writing and running code in interactive cells, supporting rich media, visualizations, and markdown documentation. The notebook server manages the execution of kernels and renders notebooks in a browser. This tool is foundational for data analysis, teaching, and exploratory programming workflows. <br>
https://jupyter-notebook.readthedocs.io/en/stable/ <br>

`Pandas` is a powerful data manipulation and analysis library that provides flexible data structures, such as DataFrames and Series. It is widely used for handling structured datasets, enabling easy data cleaning, transformation, and aggregation. Pandas is essential for data preprocessing in machine learning and statistical analysis. <br>
https://pandas.pydata.org/docs/ <br>

The `requests` library simplifies making HTTP requests in Python, allowing you to send GET, POST, and other types of requests to interact with APIs or web services. <br>
https://docs.python-requests.org/en/latest/ <br>

`spaCy` is an advanced NLP library for Python that provides tools for tokenization, part-of-speech tagging, named entity recognition, and more, using pre-trained pipelines. <br>
https://spacy.io/ <br>

`spacytextblob` is a plugin for spaCy that adds sentiment analysis capabilities by integrating TextBlob's polarity and subjectivity scores into spaCy’s pipeline. <br>
https://github.com/AndrewIbrahim/spacy-textblob <br>

`TextBlob` is a Python library for processing textual data, built on top of `nltk` and `pattern`. It provides a simple API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, translation, and sentiment analysis. Its intuitive design and built-in sentiment scoring functions make it especially useful for quick prototyping and educational applications. <br>
https://textblob.readthedocs.io/en/dev/ <br>
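
Before running the notebook, it can help to confirm that the packages described above are importable in the active virtual environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the list of import names (for example `bs4` for `beautifulsoup4`) is an assumption based on how these packages are usually imported.

```python
from importlib.util import find_spec

# Import names for the packages described above (assumed; note that
# beautifulsoup4 is imported as "bs4").
REQUIRED_IMPORTS = [
    "bs4", "html5lib", "ipykernel", "jupyterlab", "matplotlib",
    "notebook", "pandas", "requests", "spacy", "spacytextblob", "textblob",
]


def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


def report_environment():
    """Print a short summary of which required packages are missing."""
    missing = missing_packages(REQUIRED_IMPORTS)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Calling `report_environment()` in the first code cell gives a quick check that the environment matches the list above.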

### Tasks
Perform the tasks described in the Markdown cells below. When you have completed the assignment, make sure all of your code cells have been run (and have output beneath them) and that you have committed and pushed ALL of your changes to your assignment repository. <br>

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a Python file (`.py`), then import and run the appropriate code to answer the question. <br>

#### Section 1. 
Write code that extracts the article HTML from https://web.archive.org/web/20210327165005/https://hackaday.com/2021/03/22/how-laser-headlights-work/ and dumps it to a `.pkl` file (or another appropriate file type). <br>
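
One possible shape for this step is sketched below. It assumes the article body sits inside an `<article>` element (an assumption about the page markup, with a fallback to the whole document), and the helper names `fetch_article_html`, `save_html`, and `load_html` are hypothetical. The network call lives in its own function so the pickle helpers can be exercised offline.

```python
import pickle
from pathlib import Path

ARTICLE_URL = (
    "https://web.archive.org/web/20210327165005/"
    "https://hackaday.com/2021/03/22/how-laser-headlights-work/"
)


def fetch_article_html(url: str = ARTICLE_URL) -> str:
    """Download the page and return the HTML of its <article> element."""
    # Third-party imports are kept local so the pickle helpers below
    # can be used without these packages or a network connection.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors
    soup = BeautifulSoup(response.text, "html5lib")
    article = soup.find("article")  # assumption: the page has an <article> tag
    # Fall back to the whole document if no <article> element is found.
    return str(article if article is not None else soup)


def save_html(html: str, path: Path) -> None:
    """Pickle the HTML string to disk."""
    with path.open("wb") as handle:
        pickle.dump(html, handle)


def load_html(path: Path) -> str:
    """Read a pickled HTML string back from disk."""
    with path.open("rb") as handle:
        return pickle.load(handle)
```

A call such as `save_html(fetch_article_html(), Path("laser_headlights_article.pkl"))` would then write the file; the file name here is only an illustration.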

#### Section 2. 
Read in your article's HTML source from the file you created in Section 1 and print its text (use `.get_text()`). <br>
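
A minimal sketch of this step follows, assuming a pickle file that holds the article HTML as a string. The helper name `article_text_from_pickle` is hypothetical, and the built-in `html.parser` is used here only to keep dependencies small; `html5lib` would be a more lenient choice.

```python
import pickle
from pathlib import Path


def article_text_from_pickle(path: Path) -> str:
    """Load pickled HTML and return its visible text."""
    from bs4 import BeautifulSoup  # third-party: beautifulsoup4

    with path.open("rb") as handle:
        html = pickle.load(handle)
    # "html.parser" ships with Python; "html5lib" tolerates messier markup.
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text()
```

In the notebook, `print(article_text_from_pickle(Path("laser_headlights_article.pkl")))` would display the text; the file name is an assumed example.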

#### Section 3. 
Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case). Print the common tokens with an appropriate label. Additionally, print the tokens with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace). <br>
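
The counting logic can be sketched as below. The helper names are hypothetical, and the model name `en_core_web_sm` is an assumption about which trained pipeline is installed. The counting function only relies on the `text`, `is_punct`, `is_stop`, and `is_space` attributes that spaCy tokens expose, so it accepts a spaCy `Doc` or any iterable of token-like objects.

```python
from collections import Counter


def is_interesting(token) -> bool:
    """Keep tokens that are not punctuation, stop words, or whitespace."""
    return not (token.is_punct or token.is_stop or token.is_space)


def most_common_tokens(doc, n: int = 5):
    """Return the n most frequent lower-cased tokens with their counts."""
    counts = Counter(
        token.text.lower() for token in doc if is_interesting(token)
    )
    return counts.most_common(n)


def report_tokens(article_text: str, model: str = "en_core_web_sm") -> None:
    """Run a trained spaCy pipeline over the text and print the results."""
    import spacy  # third-party; the model must already be downloaded

    doc = spacy.load(model)(article_text)
    common = most_common_tokens(doc)
    print("Most common tokens:", [text for text, _ in common])
    for text, count in common:
        print(f"Token: {text!r}, Frequency: {count}")
```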

#### Section 4. 
Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels). Make sure to remove things we don't care about (punctuation, stopwords, whitespace). <br>
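
Counting lemmas differs from counting tokens mainly in which attribute is read: `token.lemma_` groups inflected forms such as "lasers" and "laser" under one entry. The sketch below uses a hypothetical helper name and works on a spaCy `Doc` or any iterable of objects with `lemma_`, `is_punct`, `is_stop`, and `is_space` attributes.

```python
from collections import Counter


def most_common_lemmas(doc, n: int = 5):
    """Return the n most frequent lower-cased lemmas with their counts."""
    counts = Counter(
        token.lemma_.lower()
        for token in doc
        # Skip punctuation, stop words, and whitespace tokens.
        if not (token.is_punct or token.is_stop or token.is_space)
    )
    return counts.most_common(n)
```

With a loaded pipeline `nlp`, looping over `most_common_lemmas(nlp(article_text))` and printing each pair with a label such as `Lemma:` and `Frequency:` would cover the required output.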

#### Section 5. 

Define the following methods:

* `score_sentence_by_token(sentence, interesting_token)`, which takes a sentence and a list of interesting tokens and returns the number of times that any of the interesting words appear in the sentence divided by the number of words in the sentence <br>
* `score_sentence_by_lemma(sentence, interesting_lemmas)`, which takes a sentence and a list of interesting lemmas and returns the number of times that any of the interesting lemmas appear in the sentence divided by the number of words in the sentence <br>
    
You may find some of the code from the in-class notes useful; feel free to reuse those methods (rewrite them in this cell as well). Test them by showing the score of the first sentence in your article using the frequent tokens and frequent lemmas identified in Sections 3 and 4. <br>
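
One way the two scoring methods could look is sketched here. It assumes that "words" means every token that is neither punctuation nor whitespace, and that an empty sentence scores 0.0; both are interpretation choices rather than part of the task statement. The helper `_words` is hypothetical.

```python
def _words(sentence):
    """Tokens that count as words: all except punctuation and whitespace."""
    return [t for t in sentence if not (t.is_punct or t.is_space)]


def score_sentence_by_token(sentence, interesting_token):
    """Fraction of words whose lower-cased text is an interesting token."""
    words = _words(sentence)
    if not words:
        return 0.0  # avoid dividing by zero on empty sentences
    wanted = {token.lower() for token in interesting_token}
    hits = sum(1 for t in words if t.text.lower() in wanted)
    return hits / len(words)


def score_sentence_by_lemma(sentence, interesting_lemmas):
    """Fraction of words whose lower-cased lemma is an interesting lemma."""
    words = _words(sentence)
    if not words:
        return 0.0  # avoid dividing by zero on empty sentences
    wanted = {lemma.lower() for lemma in interesting_lemmas}
    hits = sum(1 for t in words if t.lemma_.lower() in wanted)
    return hits / len(words)
```

With a spaCy `Doc` named `doc`, the first sentence is available as `list(doc.sents)[0]` and can be passed directly to either function.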

#### Section 6. 
Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)? <br>
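
The plotting step might look like the sketch below. Both function names are hypothetical. `most_common_range` is a small standard-library helper that mirrors equal-width binning between the minimum and maximum score, which can support the written answer about the most common range; it assumes a non-empty list, and bin edges may differ slightly from a plotting library's because of floating-point rounding.

```python
def most_common_range(scores, bins=10):
    """Return (low, high, count) for the fullest of `bins` equal-width bins."""
    low, high = min(scores), max(scores)  # assumes scores is not empty
    width = (high - low) / bins or 1.0  # guard against identical scores
    counts = [0] * bins
    for score in scores:
        # The maximum score falls on the top edge, so it joins the last bin.
        index = min(int((score - low) / width), bins - 1)
        counts[index] += 1
    best = max(range(bins), key=counts.__getitem__)
    return low + best * width, low + (best + 1) * width, counts[best]


def plot_score_histogram(scores, kind="tokens", bins=10):
    """Plot a labeled histogram of sentence scores."""
    import matplotlib.pyplot as plt  # third-party

    plt.hist(scores, bins=bins, edgecolor="black")
    plt.title(f"Sentence scores using frequent {kind}")
    plt.xlabel("Score (interesting words / words in sentence)")
    plt.ylabel("Number of sentences")
    plt.show()
```

The same two helpers apply unchanged to lemma-based scores by passing `kind="lemmas"`.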

#### Section 7. 
Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)? <br>

#### Section 8. 
Which tokens and lexemes would be omitted from the lists generated in Sections 3 and 4 if we only wanted to consider nouns as interesting words? How might we change the code to only consider nouns? Put your answer in this Markdown cell (you can edit it by double-clicking it). <br>
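
As one illustration of the code change, spaCy tokens carry a coarse part-of-speech tag in `token.pos_`, so a noun-only filter can be added to the counting condition. The sketch below uses a hypothetical helper name and keeps only tokens tagged `"NOUN"`; proper nouns are tagged `"PROPN"` and would be excluded unless that tag is added as well.

```python
from collections import Counter


def most_common_nouns(doc, n: int = 5, use_lemma: bool = False):
    """Return the n most frequent nouns (as text or lemma) with counts."""
    counts = Counter(
        (token.lemma_ if use_lemma else token.text).lower()
        for token in doc
        if token.pos_ == "NOUN"  # common nouns only; "PROPN" is excluded
        and not (token.is_punct or token.is_stop or token.is_space)
    )
    return counts.most_common(n)
```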