# Web Mining and Applied NLP (44-620)

## Final Project: Article Summarizer

### Student Name: Tesfamariam
GitHub: https://github.com/Tesfamariam100/module7-p7-article-summarizer/blob/main/article-summarizer.ipynb

Date: Dec. 05 2024


Perform the tasks described in the Markdown cells below.  When you have completed the assignment make sure your code cells have all been run (and have output beneath them) and ensure you have committed and pushed ALL of your changes to your assignment repository.

You should bring in code from previous assignments to help you answer the questions below.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a python file (`.py`), then import and run the appropriate code to answer the question.

In [2]:
# Create and activate a Python virtual environment. 
# Before starting the project, try all these imports FIRST
# Address any errors you get running this code cell 
# by installing the necessary packages into your active Python environment.
# Try to resolve issues using your materials and the web.
# If that doesn't work, ask for help in the discussion forums.
# You can't complete the exercises until you import these - start early! 
# We also import pickle and Counter (included in the Python Standard Library).

from collections import Counter
import pickle
import requests
import spacy
from bs4 import BeautifulSoup
import json
import html5lib
import matplotlib.pyplot as plt
from spacytextblob.spacytextblob import SpacyTextBlob

!pip list

print('All prereqs installed.')

Package                   Version
------------------------- --------------
annotated-types           0.7.0
anyio                     4.7.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 3.0.0
async-lru                 2.0.4
attrs                     24.2.0
babel                     2.16.0
beautifulsoup4            4.12.3
bleach                    6.2.0
blis                      1.0.1
catalogue                 2.0.10
certifi                   2024.8.30
cffi                      1.17.1
charset-normalizer        3.4.0
click                     8.1.7
cloudpathlib              0.20.0
colorama                  0.4.6
comm                      0.2.2
confection                0.1.5
contourpy                 1.3.0
cycler                    0.12.1
cymem                     2.0.10
debugpy                   1.8.9
decorator                 5.1.1
defusedxml                0.7.1
exceptiongroup            1.2.2
executing      



1. Find an article or blog post on the internet about a topic that interests you, whose text you can retrieve using the technologies we have applied in the course.  Get the HTML for the article and store it in a file (which you must submit with your project).

In [3]:
# Import libraries
import requests  # Library to make HTTP requests for websites
import pickle    # Library used to save and load Python objects by serializing and deserializing

# Request the page and stop early if the server returns an error status
fao_page = requests.get('https://www.fao.org/home/en/')
fao_page.raise_for_status()
fao_html = fao_page.text

# Use Pickle library to serialize the fao_html string and store in file fao_homepage.pkl
# The serialized file can later be retrieved and used without requesting from url
with open('fao_homepage.pkl', 'wb') as file:
    pickle.dump(fao_html, file)

print('FAO homepage HTML saved to fao_homepage.pkl')

FAO homepage HTML saved to fao_homepage.pkl
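
To confirm that a saved file reads back unchanged, a minimal round-trip sketch follows. The helper names `save_html` and `load_html`, the sample HTML string, and the temporary file name are illustrative assumptions, not part of the assignment.

```python
import pickle
import tempfile
from pathlib import Path


def save_html(html: str, path: Path) -> None:
    """Serialize an HTML string to disk with pickle (hypothetical helper)."""
    with open(path, "wb") as file:
        pickle.dump(html, file)


def load_html(path: Path) -> str:
    """Read back an HTML string that was written by save_html."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Round-trip a small stand-in page inside a temporary directory.
sample_html = "<html><body><article><p>Hello, FAO.</p></article></body></html>"
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "sample_page.pkl"
    save_html(sample_html, pkl_path)
    restored_html = load_html(pkl_path)

print(restored_html == sample_html)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.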


2. Read in your article's HTML source from the file you created in question 1 and do sentiment analysis on the article/post's text (use `.get_text()`).  Print the polarity score with an appropriate label.  Additionally, print the number of sentences in the original article (with an appropriate label).

In [18]:
import pickle
from bs4 import BeautifulSoup
from textblob import TextBlob


# Load the HTML content from the pickle file
pickle_file_path = 'fao_homepage.pkl'  # Adjust the file path if needed
with open(pickle_file_path, 'rb') as file:
    fao_html = pickle.load(file)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(fao_html, 'html.parser')

# Locate the main content: prefer the <article> tag, then fall back to <main>
content_tag = soup.find('article')
if not content_tag:
    content_tag = soup.find('main')

# Extract text from the content tag if one was found; otherwise use the whole page
if content_tag:
    text = content_tag.get_text(separator="\n", strip=True)
else:
    text = soup.get_text(separator="\n", strip=True)

# Check if text was extracted
if text:
    print("Article Text Extracted Successfully.")

    # Perform sentiment analysis using TextBlob
    blob = TextBlob(text)
    polarity_score = blob.sentiment.polarity

    # Print the polarity score and sentiment label
    print(f"Polarity score: {polarity_score:.2f}")
    if polarity_score > 0:
        print("The sentiment of the article is positive.")
    elif polarity_score < 0:
        print("The sentiment of the article is negative.")
    else:
        print("The sentiment of the article is neutral.")

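    # Count the number of sentences in the article (required by the question).
    # blob.sentences relies on NLTK tokenizer data; if it is missing, run
    # `python -m textblob.download_corpora` once and re-run this cell.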
    sentences = blob.sentences
    num_sentences = len(sentences)
    print(f"Number of sentences in the article: {num_sentences}")
else:
    print("No article text found.")

Article Text Extracted Successfully.
Polarity score: 0.18
The sentiment of the article is positive.


MissingCorpusError: 
Looks like you are missing some required data for this feature.

To download the necessary data, simply run

    python -m textblob.download_corpora

or use the NLTK downloader to download the missing data: http://nltk.org/data.html
If this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.
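
If the tokenizer data cannot be installed, a rough sentence count is still possible with the standard library. The sketch below is a naive approximation, not TextBlob's tokenizer: it splits after `.`, `!`, or `?` followed by whitespace, so abbreviations and headings without end punctuation will be miscounted. The function name `count_sentences_naive` and the sample text are assumptions made for illustration.

```python
import re


def count_sentences_naive(text: str) -> int:
    """Approximate the sentence count by splitting after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty pieces so that an empty input counts as zero sentences.
    return len([part for part in parts if part])


sample_text = "The page loaded. It has many links! Is that enough?"
print(f"Approximate number of sentences: {count_sentences_naive(sample_text)}")
```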


3. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case).  Print the common tokens with an appropriate label.  Additionally, print the tokens with their frequencies (with appropriate labels).

In [19]:
import spacy
from collections import Counter

# Load the small English model (en_core_web_sm) into spaCy
nlp = spacy.load("en_core_web_sm")

# Process the text with the spaCy pipeline
doc = nlp(text)

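# Collect lower-cased token texts, filtering out stopwords and punctuation.
# Whitespace-only tokens (such as '\n') pass this filter; add
# `and not token.is_space` to the condition to exclude them.
tokens = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct]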

# Get the frequencies of tokens
token_frequencies = Counter(tokens)

# Get the 5 most common tokens
common_tokens = token_frequencies.most_common(5)

# Print the common tokens with their frequencies
print("The 5 most frequent tokens (excluding stopwords and punctuation):")
for token, freq in common_tokens:
    print(f"Token: '{token}', Frequency: {freq}")

The 5 most frequent tokens (excluding stopwords and punctuation):
Token: '\n', Frequency: 171
Token: 'fao', Frequency: 55
Token: 'food', Frequency: 41
Token: '\n                ', Frequency: 19
Token: '\n                            ', Frequency: 18
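
Three of the five entries above are whitespace-only tokens, which are neither stop words nor punctuation. The sketch below shows the effect of dropping them before counting. It works on plain strings so it runs without a spaCy model; in the spaCy list comprehension the equivalent condition is `not token.is_space`. The function name `top_tokens` and the sample list are illustrative assumptions.

```python
from collections import Counter


def top_tokens(tokens, n=5):
    """Return the n most common lower-cased tokens, skipping whitespace-only ones."""
    # str.strip() returns an empty (falsy) string for whitespace-only tokens.
    cleaned = [token.lower() for token in tokens if token.strip()]
    return Counter(cleaned).most_common(n)


sample_tokens = ["\n", "FAO", "food", "\n        ", "fao", "\n", "Food", "fao"]
for token, freq in top_tokens(sample_tokens, n=2):
    print(f"Token: '{token}', Frequency: {freq}")
```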


4. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels).

In [20]:
import spacy
from collections import Counter

# Load the small English model (en_core_web_sm) into spaCy
nlp = spacy.load("en_core_web_sm")

# Process the text with the spaCy pipeline
doc = nlp(text)

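# Extract lemmas, convert to lowercase, and filter out stopwords and punctuation.
# Whitespace-only tokens (such as '\n') pass this filter; add
# `and not token.is_space` to the condition to exclude them.
lemmas = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]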

# Get the frequencies of lemmas
lemma_frequencies = Counter(lemmas)

# Get the 5 most common lemmas
common_lemmas = lemma_frequencies.most_common(5)

# Print the common lemmas with their frequencies
print("The 5 most frequent lemmas (excluding stopwords and punctuation):")
for lemma, freq in common_lemmas:
    print(f"Lemma: '{lemma}', Frequency: {freq}")

The 5 most frequent lemmas (excluding stopwords and punctuation):
Lemma: '\n', Frequency: 171
Lemma: 'fao', Frequency: 55
Lemma: 'food', Frequency: 44
Lemma: '\n                ', Frequency: 19
Lemma: 'work', Frequency: 18


5. Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

In [21]:
import spacy
import matplotlib.pyplot as plt
from textblob import TextBlob

# Load the small English model (en_core_web_sm) into spaCy
nlp = spacy.load("en_core_web_sm")

# SpacyTextBlob() cannot be constructed without arguments (it raised
# "TypeError: __init__() missing 1 required positional argument: 'nlp'"),
# so polarity is computed with TextBlob directly on each sentence's text.

# Process the text with spaCy
doc = nlp(text)

# List to store sentiment scores (polarity) of each sentence
sentiment_scores = []

# Loop through the sentences in the document
for sent in doc.sents:
    sentiment_scores.append(TextBlob(sent.text).sentiment.polarity)  # Polarity score for each sentence

# Plot the histogram of the sentiment scores based on tokens
plt.figure(figsize=(8, 6))
plt.hist(sentiment_scores, bins=20, edgecolor='black')
plt.title('Sentiment Score Distribution per Sentence (Tokens)')
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
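
The cell above scores each sentence by sentiment polarity. Another reading of "scores (using tokens)" in question 5, offered here as an assumption rather than something this notebook confirms, is the share of a sentence's words that belong to the most frequent tokens from question 3. The sketch below illustrates that alternative with plain lists of strings; `score_sentence`, `interesting`, and the sample sentences are hypothetical names and data.

```python
def score_sentence(sentence_tokens, interesting_tokens):
    """Fraction of the sentence's words that appear in interesting_tokens."""
    # Ignore whitespace-only tokens so they do not dilute the score.
    words = [token.lower() for token in sentence_tokens if token.strip()]
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in interesting_tokens)
    return hits / len(words)


# Stand-in for the frequent tokens found in question 3.
interesting = {"fao", "food"}
sample_sentences = [
    ["FAO", "supports", "food", "security"],
    ["Read", "more", "here"],
]
scores = [score_sentence(tokens, interesting) for tokens in sample_sentences]
print(scores)
```

With this definition, scores fall between 0 and 1, so the histogram range differs from the polarity range of -1 to 1.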

6. Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

In [22]:
import spacy
import matplotlib.pyplot as plt
from textblob import TextBlob

# Load the small English model (en_core_web_sm) into spaCy
nlp = spacy.load("en_core_web_sm")

# SpacyTextBlob() cannot be constructed without arguments (it raised
# "TypeError: __init__() missing 1 required positional argument: 'nlp'"),
# so polarity is computed with TextBlob directly on each sentence's lemmas.

# Process the text with spaCy
doc = nlp(text)

# List to store sentiment scores (polarity) of each sentence using lemmas
sentiment_scores_lemmas = []

# Loop through the sentences in the document
for sent in doc.sents:
    # Join the lemmas of the alphabetic tokens in the sentence
    lemmas = [token.lemma_ for token in sent if token.is_alpha]
    lemma_text = " ".join(lemmas)

    # Score the lemma text and store its polarity
    sentiment_scores_lemmas.append(TextBlob(lemma_text).sentiment.polarity)

# Plot the histogram of the sentiment scores based on lemmas
plt.figure(figsize=(8, 6))
plt.hist(sentiment_scores_lemmas, bins=20, edgecolor='black')
plt.title('Sentiment Score Distribution per Sentence (Lemmas)')
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Comment on the most common range of sentiment scores based on the histogram
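
Questions 5 and 6 ask for the most common range of scores. Besides reading it off the plot, the range can be computed by counting scores per bin. The sketch below assumes fixed, equal-width bins over the polarity range of -1 to 1, so its bin edges will not necessarily match those that `plt.hist` chooses from the data. The function name `most_common_range` and the sample scores are illustrative.

```python
from collections import Counter


def most_common_range(scores, low=-1.0, high=1.0, bins=20):
    """Return ((bin_start, bin_end), count) for the fullest equal-width bin."""
    if not scores:
        raise ValueError("scores must not be empty")
    width = (high - low) / bins
    counts = Counter()
    for score in scores:
        # Clamp the index so that values at or beyond the edges land in a valid bin.
        index = int((score - low) / width)
        index = max(0, min(index, bins - 1))
        counts[index] += 1
    index, count = counts.most_common(1)[0]
    return (low + index * width, low + (index + 1) * width), count


sample_scores = [0.0, 0.1, 0.2, 0.6, -0.3, 0.0]
score_range, count = most_common_range(sample_scores, bins=4)
print(f"Most common range: {score_range[0]:.2f} to {score_range[1]:.2f} ({count} sentences)")
```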

7. Using the histograms from questions 5 and 6, decide on a "cutoff" score for tokens and lemmas such that fewer than half the sentences would have a score greater than the cutoff score.  Record the scores in this Markdown cell.

* Cutoff Score (tokens): 
* Cutoff Score (lemmas):

Feel free to change these scores as you generate your summaries.  Ideally, we're shooting for at least 6 sentences for our summary, but don't want more than 10 (these numbers are rough estimates; they depend on the length of your article).
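
One caveat when picking a cutoff: the median leaves at most half of the scores above it, which can be exactly half rather than fewer than half. The sketch below, using only the standard library, finds the smallest observed score that leaves strictly fewer than half of the sentences above it. The names `count_above` and `pick_cutoff` and the sample scores are assumptions for illustration.

```python
from statistics import median


def count_above(scores, cutoff):
    """Number of scores strictly greater than the cutoff."""
    return sum(1 for score in scores if score > cutoff)


def pick_cutoff(scores):
    """Smallest observed score with strictly fewer than half the scores above it."""
    if not scores:
        raise ValueError("scores must not be empty")
    for candidate in sorted(set(scores)):
        if count_above(scores, candidate) < len(scores) / 2:
            return candidate
    return max(scores)  # Defensive fallback; the largest score always qualifies.


sample_scores = [0.0, 0.0, 0.1, 0.2, 0.25, 0.4, 0.5, 0.8]
median_kept = count_above(sample_scores, median(sample_scores))
cutoff = pick_cutoff(sample_scores)
print(f"Sentences above the median: {median_kept}")
print(f"Cutoff: {cutoff}, sentences above it: {count_above(sample_scores, cutoff)}")
```

Raising the cutoff further shortens the summary, which helps when aiming for the suggested 6 to 10 sentences.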

In [23]:
import numpy as np

# Calculate the median score for tokens. Question 5 stores its scores in
# `sentiment_scores`; the name `sentiment_scores_tokens` raised a NameError.
# The median leaves at most half of the scores above it.
cutoff_score_tokens = np.median(sentiment_scores)

# Calculate the median score for lemmas
cutoff_score_lemmas = np.median(sentiment_scores_lemmas)

# Print the cutoff scores
print(f"Cutoff Score (tokens): {cutoff_score_tokens:.2f}")
print(f"Cutoff Score (lemmas): {cutoff_score_lemmas:.2f}")

8. Create a summary of the article by going through every sentence in the article and adding it to an (initially) empty list if its score (based on tokens) is greater than the cutoff score you identified in question 7.  If your loop variable is named `sent`, you may find it easier to add `sent.text.strip()` to your list of sentences.  Print the summary (I would cleanly generate the summary text by `join`ing the strings in your list together with a space (`' '.join(sentence_list)`)).

In [24]:
from textblob import TextBlob

# List to hold the sentences that are part of the summary based on tokens
summary_sentences = []

# Keep each sentence whose polarity score is greater than the token cutoff
for sent in doc.sents:
    token_score = TextBlob(sent.text).sentiment.polarity
    if token_score > cutoff_score_tokens:
        summary_sentences.append(sent.text.strip())

# Generate the summary text by joining the sentences
summary_text = ' '.join(summary_sentences)

# Print the summary
print(summary_text)
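
The summary step in question 8 pairs each sentence with its score and keeps those above the cutoff. A minimal sketch with plain lists follows; `build_summary`, the sample sentences, and the sample scores are hypothetical stand-ins for the spaCy sentences and the scores computed earlier.

```python
def build_summary(sentences, scores, cutoff):
    """Keep sentences whose score exceeds the cutoff and join them with spaces."""
    kept = [
        sentence.strip()
        for sentence, score in zip(sentences, scores)
        if score > cutoff
    ]
    return kept, " ".join(kept)


sample_sentences = [
    "  FAO works on food security. ",
    "Read more.",
    "New report on rice prices. ",
]
sample_scores = [0.5, 0.0, 0.3]
kept_sentences, summary = build_summary(sample_sentences, sample_scores, cutoff=0.2)
print(summary)
print(f"Number of sentences in the summary: {len(kept_sentences)}")
```

Returning the list alongside the joined text keeps the sentence count available for the follow-up question without re-splitting the summary.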

9. Print the polarity score of the summary you generated with the token scores (with an appropriate label). Additionally, print the number of sentences in the summarized article.

In [25]:
from textblob import TextBlob

# Convert the summary text (built in question 8) into a TextBlob object
summary_blob = TextBlob(summary_text)

# Store the summary's polarity under its own name so that the article's
# polarity_score from question 2 stays available for the comparison in question 12
summary_polarity_tokens = summary_blob.sentiment.polarity

# Count the number of sentences in the summary
num_summary_sentences = len(summary_sentences)

# Print the results
print(f"Polarity score of the summary (based on tokens): {summary_polarity_tokens:.2f}")
print(f"Number of sentences in the summarized article: {num_summary_sentences}")

10. Create a summary of the article by going through every sentence in the article and adding it to an (initially) empty list if its score (based on lemmas) is greater than the cutoff score you identified in question 7.  If your loop variable is named `sent`, you may find it easier to add `sent.text.strip()` to your list of sentences.  Print the summary (I would cleanly generate the summary text by `join`ing the strings in your list together with a space (`' '.join(sentence_list)`)).

In [26]:
from textblob import TextBlob

# List to hold the sentences that are part of the summary based on lemmas
lemma_summary_sentences = []

# Iterate over every sentence in the article
for sent in doc.sents:
    # Score the sentence as in question 6: polarity of its joined lemmas.
    # (token.rank is a vocabulary index, not a score, and
    # spacy.lang.en.stop_words is a module, so `not in` raised a TypeError.)
    lemma_text = " ".join(token.lemma_ for token in sent if token.is_alpha)
    lemma_score = TextBlob(lemma_text).sentiment.polarity

    # Add the sentence to the summary if its score is greater than the cutoff
    if lemma_score > cutoff_score_lemmas:
        lemma_summary_sentences.append(sent.text.strip())

# Generate the summary text by joining the sentences
lemma_summary_text = ' '.join(lemma_summary_sentences)

# Print the summary
print(lemma_summary_text)

11. Print the polarity score of the summary you generated with the lemma scores (with an appropriate label). Additionally, print the number of sentences in the summarized article.

In [27]:
from textblob import TextBlob

# Generate a TextBlob object for the lemma-based summary (built in question 10)
summary_blob = TextBlob(lemma_summary_text)

# Calculate the polarity score of the summary
polarity_score_lemma_summary = summary_blob.sentiment.polarity

# Print the polarity score and the number of sentences in the summary
print(f"Polarity score of the summary (based on lemmas): {polarity_score_lemma_summary:.2f}")
print(f"Number of sentences in the summary: {len(lemma_summary_sentences)}")

12.  Compare the polarity scores of your summaries to the polarity score of the initial article.  Is there a difference?  Why do you think that may or may not be?  Answer in this Markdown cell.

13. Based on your reading of the original article, which summary do you think is better (if there's a difference).  Why do you think this might be?