# Web Mining and Applied NLP (44-620)

## Final Project: Article Summarizer

### Student Name: Hayley Massey

### https://github.com/HMas522/article-summarizer

Perform the tasks described in the Markdown cells below. When you have completed the assignment, make sure your code cells have all been run (and have output beneath them), and ensure you have committed and pushed ALL of your changes to your assignment repository.

You should bring in code from previous assignments to help you answer the questions below.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a Python file (`.py`), then import and run the appropriate code to answer the question.

# Prerequisites

In [49]:
# Create and activate a Python virtual environment. 
# Before starting the project, try all these imports FIRST
# Address any errors you get running this code cell 
# by installing the necessary packages into your active Python environment.
# Try to resolve issues using your materials and the web.
# If that doesn't work, ask for help in the discussion forums.
# You can't complete the exercises until you import these - start early! 
# We also import pickle and Counter (included in the Python Standard Library).

from collections import Counter
import pickle
import requests
import spacy
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

!pip list

print('All prereqs installed.')

All prereqs installed.

Package            Version
------------------ -----------
annotated-types    0.6.0
asttokens          2.4.1
beautifulsoup4     4.12.3
blis               0.7.11
catalogue          2.0.10
certifi            2024.2.2
charset-normalizer 3.3.2
click              8.1.7
cloudpathlib       0.16.0
colorama           0.4.6
comm               0.2.2
confection         0.1.4
contourpy          1.2.1
cycler             0.12.1
cymem              2.0.8
debugpy            1.8.1
decorator          5.1.1
en-core-web-sm     3.7.1
exceptiongroup     1.2.0
executing          2.0.1
fonttools          4.51.0
idna               3.7
ipykernel          6.29.4
ipython            8.23.0
jedi               0.19.1
Jinja2             3.1.3
jupyter_client     8.6.1
jupyter_core       5.7.2
kiwisolver         1.4.5
langcodes          3.3.0
MarkupSafe         2.1.5
matplotlib         3.8.4
matplotlib-inline  0.1.7
murmurhash         1.0.10
nest-asyncio       1.6.0
numpy              1.26.4
packagi

# Question 1
1. Find an article or blog post online about a topic that interests you and whose text you can retrieve using the technologies we have applied in the course. Get the HTML for the article and store it in a file (which you must submit with your project).

Write code that extracts the article HTML from https://frieren.fandom.com/wiki/Frieren_Wiki and dumps it to a .pkl (or other appropriate) file.

In [46]:
# set up imports
import requests
from bs4 import BeautifulSoup
import pickle

# request the page
response = requests.get('https://frieren.fandom.com/wiki/Frieren_Wiki')

# parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# soup.find("ul") returns only the first <ul> element on the page;
# convert it to a string so it can be pickled
names_html = str(soup.find("ul"))

# pickle the HTML string to names_html.pkl
with open("names_html.pkl", "wb") as file:
    pickle.dump(names_html, file)
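
The save step can be checked on its own, without a network request. The following is a minimal sketch, assuming a short placeholder string in place of `response.text`; the helper names `save_html` and `load_html` and the file name `article_html.pkl` are hypothetical and not part of the assignment.

```python
import pickle
import tempfile
from pathlib import Path


def save_html(html: str, path: Path) -> None:
    """Serialize an HTML string to `path` with pickle."""
    with open(path, "wb") as file:
        pickle.dump(html, file)


def load_html(path: Path) -> str:
    """Read back an HTML string previously written by save_html."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Hypothetical stand-in for response.text; no network request is made here.
sample_html = "<article><p>Example paragraph.</p></article>"

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "article_html.pkl"
    save_html(sample_html, pkl_path)
    restored = load_html(pkl_path)

print(restored == sample_html)
```

In the real cell it may also be worth calling `response.raise_for_status()` before parsing, so a failed download is not pickled by mistake.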


# Question 2
2. Read in your article's HTML source from the file you created in question 1 and do sentiment analysis on the article/post's text (use `.get_text()`). Print the polarity score with an appropriate label. Additionally, print the number of sentences in the original article (with an appropriate label).

In [47]:
# set up imports
import pickle
from bs4 import BeautifulSoup

# open and load file previously created
with open("names_html.pkl", "rb") as file:
    names_html = pickle.load(file)

# parse the stored HTML string
soup = BeautifulSoup(names_html, "html.parser")

# re-select the <ul> element so .get_text() can be called on it
names_html = soup.find("ul")

# print the element's text content
print(names_html.get_text())






 Explore

 




 Main Page




 Discuss




All Pages




Community




Interactive Maps




Recent Blog Posts








World

 




Locations
 




Central Lands




Northern Lands




Southern Lands







Story Arcs




Timeline




Spells




Ecology








Characters

 




List of Characters




Frieren's Party
 




Frieren




Fern




Stark




Sein







Hero Party
 




Himmel




Frieren




Heiter




Eisen







Continental Magic Association
 




Lernen




Genau




Sense




Falsch




Serie







First-Class Mage Exam Candidates
 




Kanne




Lawine




Übel




Land




Wirbel




Denken




Methode




Edel




Other Candidates







Demons
 




List of Demons




Demon King




Seven Sages of Destruction
 




Aura




Macht




Böse




Grausam







Greater Demons
 




Solitär




Tot




Rivale










Other
 




Flamme




Graf Granat




Kraft











Media

 




Manga
 




Chapters and Volumes




Story Arcs




Weekly Shonen Sunday




O

# Question 3
3. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent tokens (converted to lower case). Print the common tokens with an appropriate label. Additionally, print the tokens with their frequencies (with appropriate labels).

In [54]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
doc = nlp(names_html.get_text())

def key_words(token):
    # keep tokens that are not whitespace, punctuation, or stop words
    return not (token.is_space or token.is_punct or token.is_stop)

# lower-cased text of every kept token (duplicates retained for counting)
tokens = [token.text.lower() for token in doc if key_words(token)]
token_freq = Counter(tokens)
print("Most Frequent Tokens:")
print(token_freq.most_common(5))

Most Frequent Tokens:
[('policy', 4), ('lands', 3), ('frieren', 3), ('demons', 3), ('anime', 3)]
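
The question also asks for the tokens and their frequencies to be printed with labels. A minimal sketch of one way to do that follows, assuming a plain list of words in place of the filtered spaCy tokens; the helper name `frequency_report` and the sample data are hypothetical.

```python
from collections import Counter


def frequency_report(words, n=5):
    """Return (word, count) pairs for the n most common lower-cased words."""
    counts = Counter(word.lower() for word in words)
    return counts.most_common(n)


# Hypothetical word list standing in for the filtered spaCy tokens.
sample_words = ["Policy", "lands", "policy", "Frieren", "lands", "policy"]

top_words = frequency_report(sample_words, n=2)

# Print the common tokens first, then each token with its frequency.
print("Most frequent tokens:", [word for word, _ in top_words])
for word, count in top_words:
    print(f"Token: {word!r}  Frequency: {count}")
```

The same helper would work for lemmas if it were given `token.lemma_` strings instead of `token.text` strings.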


# Question 4
4. Load the article text into a trained `spaCy` pipeline, and determine the 5 most frequent lemmas (converted to lower case).  Print the common lemmas with an appropriate label.  Additionally, print the lemmas with their frequencies (with appropriate labels).

In [53]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
doc = nlp(names_html.get_text())

def key_words(token):
    # keep tokens that are not whitespace, punctuation, or stop words
    return not (token.is_space or token.is_punct or token.is_stop)

# lower-cased lemma of every kept token (duplicates retained for counting)
lemmas = [token.lemma_.lower() for token in doc if key_words(token)]
lemma_freq = Counter(lemmas)
print("Most Frequent Lemmas:")
print(lemma_freq.most_common(5))

Most Frequent Lemmas:
[('policy', 4), ('lands', 3), ('frieren', 3), ('demons', 3), ('anime', 3)]


# Question 5
5. Make a list containing the scores (using tokens) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores. From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?
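
The notebook does not spell out how a sentence is scored. One common definition, assumed here, is the number of "interesting" words in a sentence divided by the total number of words in that sentence. The sketch below uses plain lists in place of spaCy sentences; `score_sentence` and the sample data are hypothetical.

```python
def score_sentence(words, interesting):
    """Fraction of the words in one sentence that are in `interesting`."""
    if not words:
        # An empty sentence has no words to score.
        return 0.0
    hits = sum(1 for word in words if word in interesting)
    return hits / len(words)


# Hypothetical data: each inner list holds the lower-cased words of one sentence.
sentences = [
    ["the", "party", "travels", "north"],
    ["demons", "attack", "the", "party"],
    ["it", "rains"],
]
interesting_tokens = {"party", "demons"}

token_scores = [score_sentence(words, interesting_tokens) for words in sentences]
print(token_scores)
```

With spaCy, each word list could be built from `[token.text.lower() for token in sent if not (token.is_space or token.is_punct)]` for every `sent` in `doc.sents`, and the set of interesting tokens could come from the most frequent tokens found earlier. Passing the resulting list to `plt.hist()`, followed by `plt.title()`, `plt.xlabel()`, and `plt.ylabel()`, would draw the histogram. For lemma scores, the same function applies if the word lists hold `token.lemma_.lower()` values.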

# Question 6
6. Make a list containing the scores (using lemmas) of every sentence in the article, and plot a histogram with appropriate titles and axis labels of the scores.  From your histogram, what seems to be the most common range of scores (put the answer in a comment after your code)?

# Question 7
7. Using the histograms from questions 5 and 6, decide a "cutoff" score for tokens and lemmas such that fewer than half the sentences would have a score greater than the cutoff score. Record the scores in this Markdown cell.

* Cutoff Score (tokens): 
* Cutoff Score (lemmas):

Feel free to change these scores as you generate your summaries.  Ideally, we're shooting for at least 6 sentences for our summary, but don't want more than 10 (these numbers are rough estimates; they depend on the length of your article).
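
A cutoff can also be checked numerically rather than read off the histogram. The sketch below assumes the high median as a starting point: because it is an actual data value at or above the middle of the sorted scores, fewer than half of the scores can be strictly greater than it. The helper names and sample scores are hypothetical.

```python
import statistics


def pick_cutoff(scores):
    """Return a cutoff that fewer than half of the scores strictly exceed."""
    return statistics.median_high(scores)


def count_above(scores, cutoff):
    """Number of scores strictly greater than the cutoff."""
    return sum(1 for score in scores if score > cutoff)


# Hypothetical sentence scores standing in for the histogram data.
sample_scores = [0.0, 0.1, 0.1, 0.25, 0.3, 0.5]

cutoff = pick_cutoff(sample_scores)
selected = count_above(sample_scores, cutoff)
print(f"Cutoff: {cutoff}  Sentences above cutoff: {selected} of {len(sample_scores)}")
```

If this starting cutoff selects more than about 10 sentences, raising it and re-running `count_above` shows how many sentences the summary would keep.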

# Question 8
8. Create a summary of the article by going through every sentence in the article and adding it to an (initially) empty list if its score (based on tokens) is greater than the cutoff score you identified in question 7. If your loop variable is named `sent`, you may find it easier to add `sent.text.strip()` to your list of sentences. Print the summary (I would cleanly generate the summary text by `join`ing the strings in your list together with a space: `' '.join(sentence_list)`).
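
The filter-and-join step described above can be sketched with plain strings and precomputed scores. This is a minimal illustration, not the required solution; `build_summary` and the sample data are hypothetical, and real sentence text would come from `sent.text` in the spaCy document.

```python
def build_summary(sentences, scores, cutoff):
    """Join, in original order, the sentences whose score exceeds the cutoff."""
    sentence_list = [
        text.strip()
        for text, score in zip(sentences, scores)
        if score > cutoff
    ]
    return " ".join(sentence_list)


# Hypothetical sentences and scores; real values would come from the article.
sample_sentences = [" First sentence. ", "Second sentence.", "Third sentence.\n"]
sample_scores = [0.4, 0.1, 0.3]

summary = build_summary(sample_sentences, sample_scores, cutoff=0.2)
print(summary)
```

Stripping each sentence before joining avoids stray newlines and doubled spaces in the printed summary.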

# Question 9
9. Print the polarity score of the summary you generated with the token scores (with an appropriate label). Additionally, print the number of sentences in the summarized article.

# Question 10
10. Create a summary of the article by going through every sentence in the article and adding it to an (initially) empty list if its score (based on lemmas) is greater than the cutoff score you identified in question 7. If your loop variable is named `sent`, you may find it easier to add `sent.text.strip()` to your list of sentences. Print the summary (I would cleanly generate the summary text by `join`ing the strings in your list together with a space: `' '.join(sentence_list)`).

# Question 11
11. Print the polarity score of the summary you generated with the lemma scores (with an appropriate label). Additionally, print the number of sentences in the summarized article.

# Question 12
12. Compare the polarity scores of your summaries to the polarity score of the initial article. Is there a difference? Why do you think that may or may not be? Answer in this Markdown cell.

# Question 13
13. Based on your reading of the original article, which summary do you think is better (if there's a difference)? Why do you think this might be?