"""
The goal of this milestone is for you to scrape data from an actual website and start to analyze the results. The project should be completed individually in Python without the assistance of any artificial intelligence.

Web Scraper - Take all of the text from the top 20 articles coming from one of the following web news sources:

espn.com
cnn.com
goodblacknews.org
huffingtonpost.com
ign.com
theonion.com

Statistics - Print a list of the titles of the web pages you are pulling text from and then print the mean and median number of words for all 20 articles.

Visualize - Create a Word Cloud from your list of most common words that shows the top 50 (or up to 200) words, and a bar chart that shows the relative frequencies of the 15 most frequent words. (Note: these should be words that the average viewer of the website would see, not code from the HTML.)

"""

In [None]:
#A bit of code: downloads the linked HTML page into this notebook's folder
#code example is from https://pythonexamples.org/python-download-from-url/
"""import requests

URL = "https://pythonexamples.org/"
response = requests.get(URL)

with open("download.html", "wb") as htmlFile:
    htmlFile.write(response.content)
    print('Download completed.')
"""


Plan:
-Download main webpage: https://www.cnn.com
-Have it figure out what is an article and what is not
-Generate a list of links to articles
-For the first 20 relevant links, download each link's HTML to this folder
-For each page, I need: page title, word count, list of all words
-The text comes from <title>, <h1>-<h6>, <li>, <p>, <a>, <article>
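
The link-filtering steps in the plan can be sketched without touching the network. This is only a minimal sketch, not the method the notebook uses: it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser` so it runs with no extra installs. The names `ArticleLinkParser`, `find_article_links`, and `sample_html` are hypothetical, and the sample page is made up. It assumes that article links are site-relative paths ending in ".html".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArticleLinkParser(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; a tag may have no href
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_article_links(html_text, base_url, limit=20):
    """Return up to `limit` unique absolute links that look like articles."""
    parser = ArticleLinkParser()
    parser.feed(html_text)
    links = []
    for href in parser.hrefs:
        # Assumed pattern: articles start with "/" and end with ".html"
        if href.startswith("/") and href.endswith(".html"):
            full = urljoin(base_url, href)
            if full not in links:
                links.append(full)
    return links[:limit]


# Made-up sample page, for illustration only
sample_html = """
<a href="/2024/01/01/world/story-one/index.html">One</a>
<a href="/videos">Videos</a>
<a>No href here</a>
<a href="/2024/01/01/world/story-one/index.html">One again</a>
<a href="https://example.com/other.html">External</a>
"""
article_links = find_article_links(sample_html, "https://www.cnn.com")
print(article_links)
```

Skipping duplicates matters because a front page can link the same story from both a headline and an image, which would otherwise use up two of the 20 slots.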

In [None]:
#Code that downloads the main CNN page as a file
#help coming from https://lxml.de/parsing.html#parsers for lxml documentation
import requests
from bs4 import BeautifulSoup as bs
import nltk
nltk.download("punkt")
import matplotlib.pyplot as plt

base_url = "https://www.cnn.com"
response = requests.get(base_url)
#response encoding is utf-8

with open("main_page.html", "wb") as htmlFile:
    htmlFile.write(response.content)

#now we have main page downloaded, and we can parse it for article links
#the encoding must be given explicitly: the platform default is not always utf-8, and then reading the file can fail
with open("main_page.html", "r", encoding="utf-8") as fileWrap:
    #this converts the open file into a single string variable
    fileText: str = fileWrap.read()

#turning the downloaded file into a BS4 object
bsText = bs(fileText, "html.parser")

#list to store all valid article links
articles = []

for link in bsText.find_all('a'):
    #href is None when the <a> tag has no href attribute
    currentLink = link.get('href')
    if currentLink is None:
        continue
    #pattern to save: all articles start with "/" and end with ".html"
    if currentLink.startswith('/') and currentLink.endswith('.html'):
        fullLink = base_url + currentLink
        #skip duplicates, since a page can link the same article more than once
        if fullLink not in articles:
            articles.append(fullLink)

""" 
for link in articles:
    print(link)
 """
# now that we have all the article links, we take the first 20 and analyze the results

wordsList = {}
for articleLink in articles[:20]:
    #turns the entire request page into a BS4 object
    thisArticle = bs(requests.get(articleLink).text, "html.parser")

    #pages without an <article> tag have no article text to count
    if thisArticle.article is None:
        continue

    #now narrow down the text we're interested in to just the article itself
    articleText = thisArticle.article.text.strip().casefold()
    #NLTK functions found on google and configured through documentation from https://www.nltk.org/
    cleanText = ' '.join(articleText.split())
    wordTokens = nltk.word_tokenize(cleanText)

    for word in wordTokens:
        #the tokenizer also returns punctuation; keep only tokens with a letter or digit
        if not any(ch.isalnum() for ch in word):
            continue
        # if the current word already exists as a key in the wordsList
        if word in wordsList:
            wordsList[word] += 1
        else:
            wordsList[word] = 1

# Sort the word counts in descending order.
# assistance received from https://realpython.com/sort-python-dictionary/ for figuring out parameters
wordsList = sorted(wordsList.items(), key=lambda item: item[1], reverse=True)

for item in wordsList:
    print(item)

#too many words to fit on the frequency chart, now we cut the end of the list
print("Length=", len(wordsList))
wordsList = wordsList[:50]
print("Length=", len(wordsList))

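# Create a bar chart of the 15 most frequent words.
topWords = wordsList[:15]
plt.bar([word for word, count in topWords], [count for word, count in topWords])
#rotate the labels so the words do not overlap
plt.xticks(rotation=45, ha="right")
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.title("Word Frequency Chart")
plt.show()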
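
The Statistics step also asks for the page titles and the mean and median word count across the articles. The following is a minimal sketch of that step using the standard library's `statistics` module and `collections.Counter`. It assumes the words of each article have already been collected per title; `article_words` and its contents are made-up placeholders, not scraped results.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical stand-in for the scraped data: page title -> list of words
article_words = {
    "Article A": ["the", "cat", "sat", "on", "the", "mat"],
    "Article B": ["the", "dog", "ran"],
    "Article C": ["a", "bird", "sang", "the", "song"],
}

# Print the titles, then the number of words in each article
for title in article_words:
    print(title)
word_counts = [len(words) for words in article_words.values()]

print("Mean words per article:", mean(word_counts))
print("Median words per article:", median(word_counts))

# Combined frequency table across all articles, most frequent first
frequencies = Counter()
for words in article_words.values():
    frequencies.update(words)
top_words = frequencies.most_common(15)
print(top_words)
```

Keeping one word list per article, instead of a single combined dictionary, is what makes the per-article counts available for the mean and median. `Counter.most_common` returns `(word, count)` pairs, the same shape that the bar chart code unpacks.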