# Web Mining and Applied NLP (44-620)

## Final Project: Article Summarizer

### Student Name: Tyler Stanton

#### GitHub Repo: https://github.com/S566248/article-summarizer

Perform the tasks described in the Markdown cells below.  When you have completed the assignment make sure your code cells have all been run (and have output beneath them) and ensure you have committed and pushed ALL of your changes to your assignment repository.

You should bring in code from previous assignments to help you answer the questions below.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a python file (`.py`), then import and run the appropriate code to answer the question.

Adding the imports I am expecting to use.

In [5]:
from collections import Counter
import pickle
import requests
import spacy
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import string

!pip list

print('All prereqs installed.')

Package            Version
------------------ -----------
annotated-types    0.6.0
asttokens          2.4.1
beautifulsoup4     4.12.3
blis               0.7.11
catalogue          2.0.10
certifi            2024.2.2
charset-normalizer 3.3.2
click              8.1.7
cloudpathlib       0.16.0
colorama           0.4.6
comm               0.2.2
confection         0.1.4
contourpy          1.2.1
cycler             0.12.1
cymem              2.0.8
debugpy            1.8.1
decorator          5.1.1
executing          2.0.1
fonttools          4.51.0
idna               3.7
ipykernel          6.29.4
ipython            8.23.0
jedi               0.19.1
Jinja2             3.1.3
jupyter_client     8.6.1
jupyter_core       5.7.2
kiwisolver         1.4.5
langcodes          3.3.0
MarkupSafe         2.1.5
matplotlib         3.8.4
matplotlib-inline  0.1.7
murmurhash         1.0.10
nest-asyncio       1.6.0
numpy              1.26.4
packaging          24.0
parso              0.8.4
pillow             10.3.0
pip  

1. Find an article or blog post on the internet about a topic that interests you and whose text you can retrieve using the technologies we have applied in the course.  Get the HTML for the article and store it in a file (which you must submit with your project).

In [7]:
url = "https://reflector.uindy.edu/2024/04/10/the-caitlin-clark-effect-how-the-iowa-star-is-changing-womens-sports/"

# Send a GET request to the URL to fetch the HTML content
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Get the HTML content
    html_content = response.text

    # Write the HTML content to a file
    with open("article_html.html", "w", encoding="utf-8") as file:
        file.write(html_content)
        print("HTML content has been successfully saved to 'article_html.html'")
else:
    print("Failed to fetch HTML content from the URL")

HTML content has been successfully saved to 'article_html.html'


2. Read in your article's HTML source from the file you created in question 1 and do sentiment analysis on the article/post's text (use `.get_text()`).  Print the polarity score with an appropriate label.  Additionally, print the number of sentences in the original article (with an appropriate label).

In [8]:
from textblob import TextBlob

# Read the HTML source from the file
with open("article_html.html", "r", encoding="utf-8") as file:
    html_content = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_content, "html.parser")

# Extract the text from the article
article_text = soup.get_text()

# Perform sentiment analysis on the article's text
blob = TextBlob(article_text)
polarity_score = blob.sentiment.polarity

# Print the polarity score with an appropriate label
print("Polarity Score:", polarity_score)

# Count the number of sentences in the original article
sentences = blob.sentences
num_sentences = len(sentences)

# Print the number of sentences with an appropriate label
print("Number of Sentences:", num_sentences)

ModuleNotFoundError: No module named 'textblob'
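
The error above means the `textblob` package cannot be imported from the kernel that ran the cell. As a minimal sketch (the helper name `is_installed` is hypothetical, not part of the assignment), the standard library can check whether a top-level module is importable before the analysis cell runs:

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can find the top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("textblob"):
    print("textblob is available.")
else:
    # Assumption: the package is installed from PyPI under the name 'textblob'
    print("textblob is missing; install it (for example with `pip install textblob`) and restart the kernel.")
```

If the check reports the package as missing, installing it into the same environment as the notebook kernel and re-running the cell for question 2 should clear the `ModuleNotFoundError`.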

3. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case).  Print the common tokens with an appropriate label.  Additionally, print the tokens with their frequencies (with appropriate labels).

In [None]:
# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Process the article text using spaCy
doc = nlp(article_text)

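# Get all lowercase tokens, skipping stop words, punctuation, and whitespace
# (get_text() leaves many newline tokens that would otherwise dominate the counts)
tokens = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct and not token.is_space]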

# Count the frequency of each token
token_freq = Counter(tokens)

# Get the 5 most frequent tokens
common_tokens = token_freq.most_common(5)

# Print the common tokens with their frequencies
print("Common Tokens:")
for token, freq in common_tokens:
    print(f"{token}: {freq}")

4. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels).

In [None]:
# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Process the article text using spaCy
doc = nlp(article_text)

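# Get all lowercase lemmas, skipping stop words, punctuation, and whitespace
# (get_text() leaves many newline tokens that would otherwise dominate the counts)
lemmas = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct and not token.is_space]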

# Count the frequency of each lemma
lemma_freq = Counter(lemmas)

# Get the 5 most frequent lemmas
common_lemmas = lemma_freq.most_common(5)

# Print the common lemmas with their frequencies
print("Common Lemmas:")
for lemma, freq in common_lemmas:
    print(f"{lemma}: {freq}")

5. Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

In [None]:

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Process the article text using spaCy
doc = nlp(article_text)

# Interesting tokens: the most frequent tokens found in question 3
interesting_tokens = {token for token, _ in common_tokens}

# Assumed scoring: fraction of each sentence's words that are interesting tokens
sentence_scores = []
for sentence in doc.sents:
    # Get the lowercase words in the sentence (no punctuation or whitespace)
    words = [token.text.lower() for token in sentence if not token.is_punct and not token.is_space]
    if not words:
        continue  # skip sentences without words to avoid dividing by zero
    score = sum(1 for word in words if word in interesting_tokens) / len(words)
    sentence_scores.append(score)

# Plot a histogram of the scores
plt.hist(sentence_scores, bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Sentence Scores (using Tokens)')
plt.xlabel('Sentence Score')
plt.ylabel('Frequency')
plt.show()

# Comment: The most common range of scores seems to be around 0, meaning most sentences contain few or none of the most frequent tokens.
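
Question 5 asks for a per-sentence score based on tokens. As a minimal sketch, assuming the score is the fraction of a sentence's words that appear among the most frequent tokens, the arithmetic can be shown on plain lists without spaCy. The function name `score_words` and the example data are hypothetical and are not taken from the article:

```python
def score_words(words, interesting_tokens):
    """Return the fraction of words that are in the set of interesting tokens."""
    if not words:
        return 0.0  # an empty sentence has no interesting words
    hits = sum(1 for word in words if word.lower() in interesting_tokens)
    return hits / len(words)


# Hypothetical example data for illustration only
interesting = {"clark", "game", "women"}
sentence_words = ["Clark", "changed", "the", "game"]

example_score = score_words(sentence_words, interesting)
print(f"Example sentence score: {example_score}")
```

Here two of the four words are interesting, so the score is 0.5. The same idea carries over to lemmas by passing lemma strings instead of token text.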

6. Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

In [None]:

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Process the article text using spaCy
doc = nlp(article_text)

# Interesting lemmas: the most frequent lemmas found in question 4
interesting_lemmas = {lemma for lemma, _ in common_lemmas}

# Assumed scoring: fraction of each sentence's words whose lemma is interesting
sentence_scores = []
for sentence in doc.sents:
    # Get the lowercase lemmas in the sentence (no punctuation or whitespace)
    lemmas = [token.lemma_.lower() for token in sentence if not token.is_punct and not token.is_space]
    if not lemmas:
        continue  # skip sentences without words to avoid dividing by zero
    score = sum(1 for lemma in lemmas if lemma in interesting_lemmas) / len(lemmas)
    sentence_scores.append(score)

# Plot a histogram of the scores
plt.hist(sentence_scores, bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Sentence Scores (using Lemmas)')
plt.xlabel('Sentence Score')
plt.ylabel('Frequency')
plt.show()

# Comment: The most common range of scores seems to be around 0, meaning most sentences contain few or none of the most frequent lemmas.


7. Using the histograms from questions 5 and 6, decide on a "cutoff" score for tokens and lemmas such that fewer than half the sentences would have a score greater than the cutoff score.  Record the scores in this Markdown cell.

* Cutoff Score (tokens): 
* Cutoff Score (lemmas):

Feel free to change these scores as you generate your summaries.  Ideally, we're shooting for at least 6 sentences for our summary, but don't want more than 10 (these numbers are rough estimates; they depend on the length of your article).
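
The cutoff can also be derived from the score list rather than read off the histogram. The following is a sketch under stated assumptions: `pick_cutoff` is a hypothetical helper, the example scores are made up, and the rule it applies (take the score just below the top few sentences, capped so that fewer than half the sentences exceed it) is one reasonable reading of the requirement above.

```python
def pick_cutoff(scores, max_sentences):
    """Return a cutoff that fewer than half of the scores, and at most
    max_sentences of them, are strictly greater than."""
    if not scores:
        raise ValueError("scores must not be empty")
    if max_sentences < 0:
        raise ValueError("max_sentences must not be negative")
    ranked = sorted(scores, reverse=True)
    # Cap k so that k is always below half the number of sentences
    k = min(max_sentences, (len(ranked) - 1) // 2)
    # At most k scores can be strictly greater than the value at index k
    return ranked[k]


# Hypothetical scores for illustration only
example_scores = [0.1, 0.5, 0.3, 0.0, 0.4, 0.2]
cutoff = pick_cutoff(example_scores, max_sentences=10)
selected = [s for s in example_scores if s > cutoff]
print(f"Cutoff: {cutoff}, sentences kept: {len(selected)}")
```

Because ties at the cutoff value are excluded by the strict comparison, the summary can come out shorter than `max_sentences`; lowering the cutoff slightly by hand is still an option.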

8. Create a summary of the article by going through every sentence in the article and adding it to an (initially) empty list if its score (based on tokens) is greater than the cutoff score you identified in question 7.  If your loop variable is named `sent`, you may find it easier to add `sent.text.strip()` to your list of sentences.  Print the summary (I would cleanly generate the summary text by `join`ing the strings in your list together with a space (`' '.join(sentence_list)`)).

In [None]:
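# Define the cutoff score
cutoff_score = 0  # Adjust this value based on your findings in question 7

# The most frequent tokens from question 3, as a set for fast lookup
most_common_tokens = {token for token, _ in common_tokens}

def score_sentence_by_token(sentence, interesting_tokens):
    """Assumed scoring: fraction of the sentence's words that are interesting tokens."""
    words = [token.text.lower() for token in sentence if not token.is_punct and not token.is_space]
    if not words:
        return 0.0
    return sum(1 for word in words if word in interesting_tokens) / len(words)

# Create an empty list to store selected sentences
summary_sentences = []

# Iterate through every sentence in the article
for sentence in doc.sents:
    # Calculate the score for the sentence using tokens
    score = score_sentence_by_token(sentence, most_common_tokens)
    # Check if the score is greater than the cutoff
    if score > cutoff_score:
        # Add the sentence to the summary
        summary_sentences.append(sentence.text.strip())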

# Join the selected sentences into a summary
summary_text = ' '.join(summary_sentences)

# Print the summary
print(summary_text)

9. Print the polarity score of the summary you generated with the token scores (with an appropriate label). Additionally, print the number of sentences in the summarized article.

In [None]:
from textblob import TextBlob  # requires the textblob package to be installed

# Load the summary text into a spaCy pipeline
summary_doc = nlp(summary_text)

# Calculate the sentiment polarity score of the summary with TextBlob
summary_polarity_score = TextBlob(summary_text).sentiment.polarity

# Print the polarity score of the summary with an appropriate label
print("Polarity Score of the Summary (Token Scores):", summary_polarity_score)

# Count the number of sentences in the summarized article
num_sentences_summary = len(list(summary_doc.sents))

# Print the number of sentences in the summarized article with an appropriate label
print("Number of Sentences in the Summarized Article:", num_sentences_summary)

10. Create a summary of the article by going through every sentence in the article and adding it to an (initially) empty list if its score (based on lemmas) is greater than the cutoff score you identified in question 7.  If your loop variable is named `sent`, you may find it easier to add `sent.text.strip()` to your list of sentences.  Print the summary (I would cleanly generate the summary text by `join`ing the strings in your list together with a space (`' '.join(sentence_list)`)).

In [None]:
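# The most frequent lemmas from question 4, as a set for fast lookup
interesting_lemmas = {lemma for lemma, _ in common_lemmas}

def score_sentence_by_lemma(sentence, interesting_lemmas):
    """Assumed scoring: fraction of the sentence's words whose lemma is interesting."""
    lemmas = [token.lemma_.lower() for token in sentence if not token.is_punct and not token.is_space]
    return sum(1 for lemma in lemmas if lemma in interesting_lemmas) / len(lemmas) if lemmas else 0.0

# Initialize an empty list to store sentences that meet the cutoff score criteria
summary_sentences = []

# Iterate through every sentence in the article
for sent in doc.sents:
    # Calculate the score for each sentence based on lemmas
    sentence_score = score_sentence_by_lemma(sent, interesting_lemmas)
    # Check if the score is greater than the cutoff score identified in question 7
    if sentence_score > cutoff_score:
        # Add the sentence to the list
        summary_sentences.append(sent.text.strip())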

# Join the sentences in the list to generate the summary text
summary_text = ' '.join(summary_sentences)

# Print the summary text
print("Summary based on Lemmas:")
print(summary_text)


11. Print the polarity score of the summary you generated with the lemma scores (with an appropriate label). Additionally, print the number of sentences in the summarized article.

In [None]:

from textblob import TextBlob  # requires the textblob package to be installed

# Calculate the polarity score of the lemma-based summary text with TextBlob
summary_polarity = TextBlob(summary_text).sentiment.polarity

# Count the number of sentences in the summarized article
num_sentences_summary = len(summary_sentences)

# Print the polarity score of the summary with an appropriate label
print("Polarity score of the summary (based on Lemmas):", summary_polarity)

# Print the number of sentences in the summarized article with an appropriate label
print("Number of sentences in the summarized article:", num_sentences_summary)

12. Compare the polarity scores of your summaries to the polarity score of the initial article.  Is there a difference?  Why do you think that may or may not be?  Answer in this Markdown cell.

13. Based on your reading of the original article, which summary do you think is better (if there's a difference)?  Why do you think this might be?