# TGIS 501 Final Project - Web Scraping and Natural Language Processing

In this project file, I use the lxml, BeautifulSoup, pandas, and textstat libraries to pull a list of all available page URLs in the domain www.tpchd.org, then parse the main content area of each page and evaluate its reading level. The end product is a CSV file that contains the URL and reading level of all 330+ pages.
<br>
Runtime is about five and a half (5.5) minutes.
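
The reading level reported here is the Flesch-Kincaid grade level, which textstat derives from word, sentence, and syllable counts. As a minimal sketch of the published formula only (the helper name `fk_grade` and the sample counts are illustrative assumptions, and textstat's own counting and rounding may differ):

```python
def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts (hypothetical helper, not part of textstat)."""
    if words == 0 or sentences == 0:
        raise ValueError("need at least one word and one sentence")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Illustrative counts: 100 words in 5 sentences with 150 syllables.
print(round(fk_grade(100, 5, 150), 2))  # 9.91
```

Longer sentences and more syllables per word both push the grade up, which is why unpunctuated list items can inflate a page's score.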

In [21]:
from bs4 import BeautifulSoup  # for parsing html objects to find our content area of interest
import requests                # for getting the XML sitemap from tpchd.org
from requests import get       # for calling HTML objects from the web into BeautifulSoup
from lxml import etree         # for parsing the xml to extract the URL list
import pandas                  # for zipping lists into an easy-to-use, exportable data frame
import csv                     # for delivering the final output in an Excel-friendly format
from time import sleep         # in case tpchd.org gets mad that we spam server requests (can slow rate)
from random import randint     # in case we need to slow requests on random intervals to mimic human behavior

from textstat.textstat import textstat  # this magical module calculates readability in exactly the index I need



# Open sitemap showing all pages from URL to xml
sitereq = requests.request('GET', 'http://www.tpchd.org/sitemap-page-1.xml')
sitemapxml = sitereq.text
sitemapxml = sitemapxml.replace('ï»¿<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">', "<urlset>")
# I had to do a replace because the xml header I was getting was throwing off the etree parser.
# This scrubs the stray byte-order-mark characters (ï»¿) and strips the namespace from the root <urlset> tag.

# Write xml to file
with open('sitemap.xml', 'w') as f:
    f.write(sitemapxml)

# Next, I create an xml ElementTree so I can easily iterate through.
tree = etree.parse('sitemap.xml')
root = tree.getroot()

# this will do the iterating, and will only save the URL to a list, the rest of the xml is not needed.   
# The xml structure was a little wonky, so I append URL by element, not by iterating through <loc> tags.
urllist = []
score = []
for i in range(len(root)):
    urlname = root[i][0].text
    urllist.append(urlname)

#print (urllist)

# Now, we iterate through the URL list, request each page's HTML with get(), and pass the HTML to a BeautifulSoup object.
# Once we have the BS object, find all <div> elements with the content class we are looking for (main text).
# If the server starts shutting us out for spamming, a sleep(randint(1, 3)) call inside the loop adds a 1 to 3 second pause between requests.
for things in urllist:
    response = get(things)
    text = response.text
    soup = BeautifulSoup(text, 'html.parser')
    test = soup.find_all('div',class_="content_area normal_content_area clearfix ") #this pulls a list of all the Content Area widgets on a page.
    pgtext = []
    print("*", end="", flush=True) #makeshift status bar that prints a star every time a page is completed.
    
    if(len(test) == 0):
        score.append('No Content')
    #this says that 'for each visible text element in each content widget on each page, save that text to a list'.
    else: 
        for tag in test:
            parsediv = tag.find_all(["p", "h1", "h2", "h3", "h4", "h5", "h6", "td", "li"])
            
            for stuff in parsediv:
                if stuff.name =="li":
                    pgtext.append(stuff.text+".")

                else:
                    pgtext.append(stuff.text) 
        
        # Now we join the list into a single text string so textstat can compute the Flesch-Kincaid level (otherwise, it's just a list of text fragments).
        txtstr = " ".join(pgtext[0:-1])
        
        if(textstat.lexicon_count(txtstr)==0):  # another error handling routine, in case any div has characters, but none are words.
            score.append('No content')
        else:
            
            # catching other text string errors
            try:
                rdscore = textstat.flesch_kincaid_grade(txtstr)
                score.append(rdscore)
            except TypeError:
                score.append('Content not text')    # non-text content (nums) raises TypeError
            #except ZeroDivisionError:
                #score.append('Content empty') <---ignore these, this was just a test item.

# Next is a Pandas dataframe to organize our URL list and reading scores into a table.
siteinventory = pandas.DataFrame(list(zip(urllist, score)), columns = ['URL', 'FK Reading Level'])

# And then... save the DataFrame to CSV. That's it!
siteinventory.to_csv('siteinventory.csv')
print('Assessment of page readability is complete.')
    

**********************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************Assessment of page readability is complete.
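
The string replace in the sitemap step above works, but the same URL list can likely be read without editing the XML text. The following is a hedged sketch, not the method used in this project: it swaps in the standard library's `xml.etree.ElementTree` for lxml so it runs without extra packages, and it assumes the raw bytes come from something like `requests.get(...).content`. The function name `sitemap_urls` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the sitemap's root <urlset> element.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(raw: bytes) -> list:
    """Return the text of every <loc> element in a sitemap given as raw bytes."""
    # "utf-8-sig" decodes UTF-8 and drops a leading byte-order mark if present.
    clean = raw.decode("utf-8-sig").encode("utf-8")
    root = ET.fromstring(clean)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Made-up sample standing in for the real sitemap response body.
sample = (
    b"\xef\xbb\xbf"
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://www.example.org/</loc><lastmod>2018-01-01</lastmod></url>"
    b"<url><loc>https://www.example.org/about</loc></url>"
    b"</urlset>"
)
print(sitemap_urls(sample))
```

Searching for `<loc>` by its namespaced tag name also removes the dependence on `<loc>` being the first child of each `<url>` element.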


## Test box for finding readability

I used a test code snippet so I didn't have to run through 300+ pages to troubleshoot the readability calculation module [textstat](https://pypi.python.org/pypi.textstat). I chose the two test URLs deliberately: one is a control without a defined content area, and the other is a complex page whose content areas are split into two separate divs that use H3, H2, and P text tags and a table list (< tl >). All content showed up in the test print, so I am confident that multiple content divs will not break the function.
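
The multiple-div check can also be reproduced offline on a small HTML string. This is only a sketch under stated assumptions: it uses the standard library's `html.parser` rather than BeautifulSoup so it runs without a network connection or extra packages, the class `ContentAreaText` is a hypothetical helper, and the sample markup is invented rather than taken from tpchd.org.

```python
from html.parser import HTMLParser

TEXT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "td", "li"}


class ContentAreaText(HTMLParser):
    """Collect text from TEXT_TAGS inside any <div> whose classes include content_area."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # >0 while inside a content-area div (counts nested divs)
        self.tag_depth = 0   # >0 while inside one of TEXT_TAGS
        self.chunks = []     # one string per outermost text tag

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self.div_depth or "content_area" in classes:
                self.div_depth += 1
        elif self.div_depth and tag in TEXT_TAGS:
            self.tag_depth += 1
            if self.tag_depth == 1:
                self.chunks.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag in TEXT_TAGS and self.tag_depth:
            self.tag_depth -= 1

    def handle_data(self, data):
        if self.tag_depth:
            self.chunks[-1] += data


# Invented page: one navigation div to skip and two separate content divs.
html = (
    '<div class="nav"><p>Skip me</p></div>'
    '<div class="content_area normal_content_area clearfix "><h2>First</h2><p>One.</p></div>'
    '<div class="content_area normal_content_area clearfix ">'
    "<h3>Second</h3><table><tr><td>Cell</td></tr></table></div>"
)
parser = ContentAreaText()
parser.feed(html)
print(parser.chunks)
```

Text from both content divs is collected in document order, while the navigation div is ignored.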



<br>

In [2]:
from bs4 import BeautifulSoup  # for parsing html objects to find our content area of interest
from requests import get       # for getting sitemap xml, and calling HTML object from the web into BeautifulSoup
from lxml import etree         # for parsing the xml to extract the URL list
import pandas as pd            # for zipping lists into an easy-to-use, exportable data frame
import csv                     # for delivering the final output in an Excel-friendly format
from time import sleep         # in case tpchd.org gets mad that we spam the site with requests (can slow requests)
from random import randint     # in case we need to slow requests on random intervals to mimic human behavior

from textstat.textstat import textstat  # this magical module calculates readability in exactly the index I need


url = ['https://www.tpchd.org/','https://www.tpchd.org/healthy-people/antibiotic-awareness']
score = []

# Iterate through the URL list, request each page's HTML with get(), and pass the HTML to a BeautifulSoup object
# Once we have the BS object, find all <div> elements with the content class we are looking for (main text)
for things in url:
    response = get(things)
    text = response.text
    soup = BeautifulSoup(text, 'html.parser')
    test = soup.find_all('div',class_="content_area normal_content_area clearfix ")
    pgtext = []
    print("*", end="", flush=True) #makeshift status bar that prints a star every time a page is completed.
    
    if(len(test) == 0):
        score.append('No Content')
    else:
        for tag in test:
            parsediv = tag.find_all(["p", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "li"])
            for stuff in parsediv:
                pgtext.append(stuff.text)
        
        # then join the list into a text string and run it through the textstat module for flesch-kincaid level
        txtstr = " ".join(pgtext)
        rdscore = textstat.flesch_kincaid_grade(txtstr)
        score.append(rdscore)
        
print(score)    
    


['No Content', 9.1]


## Test Cell
Want to view the specific text from a page that got a weird readability score? Just replace the URL after "url = " with the page you wish to test, then click Run. Make sure you leave the brackets and quotation marks ( ['...'] ) in place.

In [1]:
from bs4 import BeautifulSoup  # for parsing html objects to find our content area of interest
from requests import get       # for getting sitemap xml, and calling HTML object from the web into BeautifulSoup
from lxml import etree         # for parsing the xml to extract the URL list
import pandas as pd            # for zipping lists into an easy-to-use, exportable data frame
import csv                     # for delivering the final output in an Excel-friendly format
from time import sleep         # in case tpchd.org gets mad that we spam the site with requests (can slow requests)
from random import randint     # in case we need to slow requests on random intervals to mimic human behavior

from textstat.textstat import textstat  # this magical module calculates readability in exactly the index I need


url = ['https://www.tpchd.org/i-want-to-/about-us/about-the-health-department/values']  #<-----------Here's where you put your test URL

score = []


for things in url:
    response = get(things)
    text = response.text
    soup = BeautifulSoup(text, 'html.parser')
    test = soup.find_all('div',class_="content_area normal_content_area clearfix ")
    pgtext = []   
    if(len(test) == 0):
        score.append('No Content')
    else:
        for tag in test:
            parsediv = tag.find_all(["p", "h1", "h2", "h3", "h4", "h5", "h6", "td", "li"])
            
            for stuff in parsediv:
                if stuff.name in ("li", "h1"):  # list items and top-level headings get a closing period
                    pgtext.append(stuff.text+".")

                else:
                    pgtext.append(stuff.text)

txtstr = " ".join(pgtext[0:-1])
rdscore = textstat.flesch_kincaid_grade(txtstr)
score.append(rdscore)
        
print(txtstr)
print(score)



Integrity. Respect.
[14.7]
