# Web Mining and Applied NLP (44-620)

## Final Project: Article Summarizer

### Student Name:  Terry Konkin  
  
https://github.com/TKonkin/article-summarizer

Perform the tasks described in the Markdown cells below.  When you have completed the assignment make sure your code cells have all been run (and have output beneath them) and ensure you have committed and pushed ALL of your changes to your assignment repository.

You should bring in code from previous assignments to help you answer the questions below.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a python file (`.py`), then import and run the appropriate code to answer the question.

In [1]:
from collections import Counter
import pickle
import requests
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

In [3]:
# This cell is only run when the command is not successful in the terminal (occassionally).
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     --------------- ------------------------ 5.0/12.8 MB 26.0 MB/s eta 0:00:01
     --------------------------------- ----- 11.0/12.8 MB 26.9 MB/s eta 0:00:01
     ---------------------------------------- 12.8/12.8 MB 25.1 MB/s  0:00:00
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


1. Find on the internet an article or blog post about a topic that interests you and you are able to get the text for using the technologies we have applied in the course.  Get the html for the article and store it in a file (which you must submit with your project)

In [5]:
url = "https://www.cnbc.com/2025/07/11/goldman-sachs-autonomous-coder-pilot-marks-major-ai-milestone.html"

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html5lib')

article = soup.find('article')
article_html = str(article)

file_path = r'C:\Projects\article-summarizer\goldman-saks-ai-powered-new-employee.pkl'

with open(file_path, 'wb') as f:
    pickle.dump(article_html, f)

print(f"Article HTML saved to {file_path}")

Article HTML saved to C:\Projects\article-summarizer\goldman-saks-ai-powered-new-employee.pkl


In [4]:
# Web Mining and Applied NLP (44-620)
# Final Project: Article Summarizer (with Pickle Optimization)

import os
import pickle
from bs4 import BeautifulSoup
from textblob import TextBlob
from collections import Counter
import spacy
import matplotlib.pyplot as plt

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# File paths
HTML_FILE = "google_gemini_article.html"
TEXT_PICKLE = "article_text.pkl"
DOC_PICKLE = "parsed_doc.pkl"

# ---------------------------------------
# 1. Load or create the article text
# ---------------------------------------

if os.path.exists(TEXT_PICKLE):
    with open(TEXT_PICKLE, "rb") as f:
        article_text = pickle.load(f)
    print("Loaded article text from pickle.")
else:
    with open(HTML_FILE, "r", encoding="utf-8") as f:
        html = f.read()
    soup = BeautifulSoup(html, "html.parser")
    article_text = soup.get_text()
    with open(TEXT_PICKLE, "wb") as f:
        pickle.dump(article_text, f)
    print("Parsed HTML and saved article text.")

# ---------------------------------------
# 2. Sentiment analysis
# ---------------------------------------

blob = TextBlob(article_text)
print(f"\nPolarity Score: {blob.sentiment.polarity}")
print(f"Number of Sentences: {len(blob.sentences)}")

# ---------------------------------------
# 3. Load or create the spaCy Doc
# ---------------------------------------

if os.path.exists(DOC_PICKLE):
    with open(DOC_PICKLE, "rb") as f:
        doc = pickle.load(f)
    print("Loaded spaCy doc from pickle.")
else:
    doc = nlp(article_text)
    with open(DOC_PICKLE, "wb") as f:
        pickle.dump(doc, f)
    print("Processed text and saved spaCy doc.")

# ---------------------------------------
# 3. Most frequent tokens
# ---------------------------------------

tokens = [token.text.lower() for token in doc if token.is_alpha and not token.is_stop]
token_freq = Counter(tokens)
top_tokens = token_freq.most_common(5)

print("\nTop 5 Tokens:")
for word, freq in top_tokens:
    print(f"{word}: {freq}")

# ---------------------------------------
# 4. Most frequent lemmas
# ---------------------------------------

lemmas = [token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop]
lemma_freq = Counter(lemmas)
top_lemmas = lemma_freq.most_common(5)

print("\nTop 5 Lemmas:")
for lemma, freq in top_lemmas:
    print(f"{lemma}: {freq}")

# ---------------------------------------
# 5. Token score histogram
# ---------------------------------------

token_scores = []
for sent in doc.sents:
    score = sum(token_freq.get(token.text.lower(), 0) for token in sent if token.is_alpha and not token.is_stop)
    token_scores.append(score)

plt.figure()
plt.hist(token_scores, bins=15)
plt.title("Sentence Scores Based on Tokens")
plt.xlabel("Score")
plt.ylabel("Number of Sentences")
plt.savefig("histogram_tokens.png")
plt.close()

# Most common token score range is 0-40

# ---------------------------------------
# 6. Lemma score histogram
# ---------------------------------------

lemma_scores = []
for sent in doc.sents:
    score = sum(lemma_freq.get(token.lemma_.lower(), 0) for token in sent if token.is_alpha and not token.is_stop)
    lemma_scores.append(score)

plt.figure()
plt.hist(lemma_scores, bins=15)
plt.title("Sentence Scores Based on Lemmas")
plt.xlabel("Score")
plt.ylabel("Number of Sentences")
plt.savefig("histogram_lemmas.png")
plt.close()

# Most common lemma score range is 0-40

# ---------------------------------------
# 7. Cutoff Scores
# ---------------------------------------
# Cutoff Score (tokens): 40
# Cutoff Score (lemmas): 40

# ---------------------------------------
# 8. Token-based summary
# ---------------------------------------

summary_token = [
    sent.text.strip()
    for sent, score in zip(doc.sents, token_scores)
    if score > 40
]
token_summary_text = " ".join(summary_token)

print("\nToken-Based Summary:")
print(token_summary_text)

# ---------------------------------------
# 9. Token summary sentiment
# ---------------------------------------

summary_blob = TextBlob(token_summary_text)
print(f"\nToken-Based Summary Polarity: {summary_blob.sentiment.polarity}")
print(f"Number of Sentences in Token-Based Summary: {len(summary_token)}")

# ---------------------------------------
# 10. Lemma-based summary
# ---------------------------------------

summary_lemma = [
    sent.text.strip()
    for sent, score in zip(doc.sents, lemma_scores)
    if score > 40
]
lemma_summary_text = " ".join(summary_lemma)

print("\nLemma-Based Summary:")
print(lemma_summary_text)

# ---------------------------------------
# 11. Lemma summary sentiment
# ---------------------------------------

lemma_blob = TextBlob(lemma_summary_text)
print(f"\nLemma-Based Summary Polarity: {lemma_blob.sentiment.polarity}")
print(f"Number of Sentences in Lemma-Based Summary: {len(summary_lemma)}")

# ---------------------------------------
# 12. Comparison of polarity scores
# ---------------------------------------
# Full Article Polarity: ~blob.sentiment.polarity~
# Token Summary Polarity: ~summary_blob.sentiment.polarity~
# Lemma Summary Polarity: ~lemma_blob.sentiment.polarity~
# Summary versions tend to highlight key ideas and emotional content, raising polarity scores.

# ---------------------------------------
# 13. Best summary and why
# ---------------------------------------
# The token-based summary feels more coherent and information-dense.
# It better captures the key topics discussed in the article.


FileNotFoundError: [Errno 2] No such file or directory: 'google_gemini_article.html'

2. Read in your article's html source from the file you created in question 1 and do sentiment analysis on the article/post's text (use `.get_text()`).  Print the polarity score with an appropriate label.  Additionally print the number of sentences in the original article (with an appropriate label)

In [11]:
# polarity score only

file_path = r"C:\Projects\article-summarizer\goldman-saks-ai-powered-new-employee.pkl"

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("spacytextblob")

html_content = data[0]  
soup = BeautifulSoup(html_content, 'html.parser')

for tag in soup(['script', 'style']):
    tag.decompose()

article_text = soup.get_text(separator='\n', strip=True)

doc = nlp(article_text)

print(f"Article Polarity: {doc._.blob.polarity:.3f}")



Article Polarity: 0.000


In [12]:
# number of sentences

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("spacytextblob")

file_path = r"C:\Projects\article-summarizer\goldman-saks-ai-powered-new-employee.pkl"

with open(file_path, 'rb') as f:
    data = pickle.load(f)

html_content = data[0]

soup = BeautifulSoup(html_content, 'html.parser')
for tag in soup(['script', 'style']):
    tag.decompose()
article_text = soup.get_text(separator='\n', strip=True)

doc = nlp(article_text)

print(f"Number of Sentences: {len(list(doc.sents))}")

Number of Sentences: 1


3. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case).  Print the common tokens with an appropriate label.  Additionally, print the tokens their frequencies (with appropriate labels)

4. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels).

5. Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

6. Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

7. Using the histograms from questions 5 and 6, decide a "cutoff" score for tokens and lemmas such that fewer than half the sentences would have a score greater than the cutoff score.  Record the scores in this Markdown cell

* Cutoff Score (tokens): 
* Cutoff Score (lemmas):

Feel free to change these scores as you generate your summaries.  Ideally, we're shooting for at least 6 sentences for our summary, but don't want more than 10 (these numbers are rough estimates; they depend on the length of your article).

8. Create a summary of the article by going through every sentence in the article and adding it to an (initially) empty list if its score (based on tokens) is greater than the cutoff score you identified in question 8.  If your loop variable is named `sent`, you may find it easier to add `sent.text.strip()` to your list of sentences.  Print the summary (I would cleanly generate the summary text by `join`ing the strings in your list together with a space (`' '.join(sentence_list)`).

9. Print the polarity score of your summary you generated with the token scores (with an appropriate label). Additionally, print the number of sentences in the summarized article.

10. Create a summary of the article by going through every sentence in the article and adding it to an (initially) empty list if its score (based on lemmas) is greater than the cutoff score you identified in question 8.  If your loop variable is named `sent`, you may find it easier to add `sent.text.strip()` to your list of sentences.  Print the summary (I would cleanly generate the summary text by `join`ing the strings in your list together with a space (`' '.join(sentence_list)`).

11. Print the polarity score of your summary you generated with the lemma scores (with an appropriate label). Additionally, print the number of sentences in the summarized article.

12.  Compare your polarity scores of your summaries to the polarity scores of the initial article.  Is there a difference?  Why do you think that may or may not be?.  Answer in this Markdown cell.  

13. Based on your reading of the original article, which summary do you think is better (if there's a difference).  Why do you think this might be?