# TGIS 501 Final Project - Web Scraping and Natural Language Processing

In this project file, I use the lxml, BeautifulSoup, pandas, and textstat libraries to pull a list of all published topic pages on www.tpchd.org, parse the main content area of each page, and evaluate its reading level. The end product is a CSV file that contains each URL and its reading level.
<br>
<br>
Runtime is about five and a half (5.5) minutes.
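
For reference, the reading level computed later is the Flesch-Kincaid grade level. The sketch below shows the standard published formula, assuming the word, sentence, and syllable counts are already known. `fk_grade` is a hypothetical helper written for illustration; textstat does its own counting and rounding, so its results may differ slightly.

```python
def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts (illustrative helper, not textstat's code)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    # Standard formula: 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Example: 100 words in 5 sentences with 150 syllables in total.
print(round(fk_grade(100, 5, 150), 2))
```

Longer sentences and longer words both raise the grade, which is why dense clinical pages tend to score higher than general-audience pages.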

In [2]:
from bs4 import BeautifulSoup  # for parsing html objects to find our content area of interest
import requests                # for getting the XML sitemap from tpchd.org
from requests import get       # for calling HTML objects from the web into BeautifulSoup
from lxml import etree         # for parsing the xml to extract the URL list
import pandas                  # for zipping lists into an easy-to-use, exportable data frame
import csv                     # for delivering the final output in an Excel-friendly format
from time import sleep         # in case tpchd.org gets mad that we spam server requests (can slow rate)
from random import randint     # in case we need to slow requests on random intervals to mimic human behavior
import re                      # topic widgets have item-specific titles, so we will use regular expressions (regex) to capture them all (works like wildcard)

from textstat.textstat import textstat  # this magical module calculates readability in exactly the index I need



# Request the sitemap that lists all topic pages and read the response as XML text
sitereq = requests.request('GET', 'http://www.tpchd.org/sitemap-topic.xml')
sitemapxml = sitereq.text
sitemapxml = sitemapxml.replace('ï»¿<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">', "<urlset>")
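# I had to do a replace because the XML header I was getting was throwing off the etree parser.
# The leading 'ï»¿' is the UTF-8 byte-order mark (BOM) decoded as Latin-1 characters.
# The replace scrubs it, drops the XML declaration, and swaps the namespaced root tag for a plain <urlset>.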

# Write xml to file
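# The with-block closes the file even if the write fails; UTF-8 is what the parser assumes by default.
with open('topicmap.xml', 'w', encoding='utf-8') as f:
    f.write(sitemapxml)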

# Next, I create an xml ElementTree so I can easily iterate through.
tree = etree.parse('topicmap.xml')
root = tree.getroot()

# This loop does the iterating and saves only the URL to a list; the rest of the XML is not needed.
# The XML structure was a little irregular, so I take the first child of each <url> element by position
# instead of searching for <loc> tags.
urllist = []
score = []
for i in range(len(root)):
    urlname = root[i][0].text
    urllist.append(urlname)

#print (urllist)

# Now, we iterate through the URL list, request each page's HTML with get(), and pass it to a BeautifulSoup object.
# Once we have the BS object, find all <div> elements with the content class we are looking for (main text).
# sleep() is commented out below; uncomment it to insert a 1 to 3 second pause between requests
# if the server starts shutting me out for spamming.
for page in urllist:
    #sleep(randint(1,3))   
    response = get(page)
    text = response.text
    soup = BeautifulSoup(text, 'html.parser')
    test = soup.find_all('div', class_= "topic-content")
    pgtext = []
    print("*", end="", flush=True) #makeshift status bar that prints a star every time a page is completed.
    
    for tag in test:
        # Note: a <ul>/<ol> element's text already includes the text of its <li> children,
        # so list items are collected twice here.
        parsediv = tag.find_all(["p", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "li"])
        for stuff in parsediv:
            pgtext.append(stuff.text)

    # Join the list into one text string to run through textstat for the Flesch-Kincaid grade level.
    txtstr = " ".join(pgtext)

    # Append exactly one score per URL so the two lists stay aligned when they are zipped later.
    if len(test) == 0:
        score.append('No Content')
    elif textstat.lexicon_count(txtstr) == 0:  # div has characters but no words
        score.append('No content')
    else:
        # Catch other text string errors; a single try (no loop) avoids retrying forever on failure.
        try:
            rdscore = textstat.flesch_kincaid_grade(txtstr)
            score.append(rdscore)
        except TypeError:
            score.append('Content not text')

# Next is a Pandas dataframe to organize our URL list and reading scores into a table.
siteinventory = pandas.DataFrame(list(zip(urllist, score)), columns = ['URL', 'FK Reading Level'])

# And then... save the DataFrame to CSV. That's it!
siteinventory.to_csv('topicinventory.csv')
print ('Complete.')
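
As an aside to the sitemap step above, the URL list could also be pulled out without the string replace by parsing the raw bytes and querying `<loc>` elements through the sitemap namespace. This is only a sketch under stated assumptions: it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs on its own, `extract_urls` is a hypothetical helper, and the sample bytes and `example.org` URLs are made up for illustration.

```python
import codecs
import xml.etree.ElementTree as ET

# Namespace declared on the <urlset> root of a standard sitemap.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(raw_bytes):
    """Return the text of every <loc> inside a <url> entry (illustrative helper)."""
    # Drop a leading UTF-8 byte-order mark if the server sent one.
    if raw_bytes.startswith(codecs.BOM_UTF8):
        raw_bytes = raw_bytes[len(codecs.BOM_UTF8):]
    root = ET.fromstring(raw_bytes)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


# Made-up sample standing in for the bytes of the HTTP response body.
sample = codecs.BOM_UTF8 + (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b'<url><loc>http://example.org/a</loc><lastmod>2018-01-01</lastmod></url>'
    b'<url><loc>http://example.org/b</loc></url>'
    b'</urlset>'
)
print(extract_urls(sample))
```

With requests, the raw bytes are available as `response.content`, so no intermediate file would be needed in this variant.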
    

*May 21, 2018 (View this document as a PDF) Influenza activity was moderate in Pierce County during the 2017-2018 season. This compares to higher activity nationally this season and higher activity in the county last season. Oct. 2, 2017 (week 40) through April 12, 2018 (week 15), we received notification of 548 influenza‑associated hospitalizations and 31 influenza-associated deaths in Pierce County. The median age of hospitalized patients was 66.5 years (range: two weeks to 100 years). The median age of people who died was 81 years (range: 44 to 102 years). Influenza-associated hospitalization peaked in January, declined weeks four through seven, then trended upward again to a second peak in week 11 (week ending March 17, 2018). The second peak was largely driven by an increase in influenza B activity, a trend also seen nationally. Risk Factors Our epidemiology nurses review hospitalization records for demographic information and influenza-related complication risk factors. Of 548 re

*May 9, 2017 What are the procedures our school should follow if a student has head lice? Head lice are tiny parasites that can live on the human head. Head lice survive by sucking blood from the scalp. Lice eggs (called “nits”) can attach to strands of head hair. Head lice can cause the head to itch but have not been proven to cause disease. 
 Head lice are very common among children and adolescents. Students of any income, social, or racial status can get head lice. Head lice are not a result of a student's poor personal hygiene, or unsanitary conditions at home, in the community, or at school. However, head lice historically carry a stigma of uncleanliness, so having head lice can often be embarrassing for the affected students, parents, families, and school. 
 Tacoma-Pierce County Health Department follows current recommendations from the American Academy of Pediatrics, the National Association of School Nurses, and the federal Centers for Disease Control and Prevention to address 

*March 14, 2017 Most people in the United States are protected against measles through vaccination, so local measles cases are uncommon. However, every year measles is brought into the United States by unvaccinated travelers (Americans or foreign visitors) who get measles while they are in other countries. They can spread measles to other people who are not protected against measles, which sometimes leads to outbreaks. It is important for health care clinics to ask patients about recent travel history when they present an illness concern. 
 The clinical presentation of measles is usually predictable. 
Two to four day prodrome of fever (usually 101° or higher) plus cough, coryza, and conjunctivitis (the “three Cs”).
After two to four days of fever and 3Cs, eruption of maculo-papular rash, starting at the hairline and spreading downward.
 Two to four day prodrome of fever (usually 101° or higher) plus cough, coryza, and conjunctivitis (the “three Cs”). After two to four days of fever and

*Travel Associated Illness
 September 1, 2017
 Ask About Travel
 Outbreaks of Ebola virus disease and Middle East Respiratory Virus in 2014 and 2015 have increased awareness of the importance of asking about patients about travel. Asking about travel has always been important when evaluating patients who present with possible infectious diseases. Many diseases, including measles, hepatitis A, typhoid fever and many types of infectious diarrhea are commonly associated with international travel. 
 Most travel-related infections become apparent soon after travel, but incubation periods vary. Consult CDC Traveler’s Health for specific country by country listing of common communicable diseases. 
 Domestic travel may also be a risk factor for specific illnesses. For example, outbreaks of legionellosis have been associated with domestic hotels and resorts. Measles has been transmitted by exposures in United States airports. Evaluating an Ill Traveler
 When evaluating an ill patient with recen

*Expedited Partner Therapy (EPT)
 Jan. 14, 2017 Examine and treat all the patient’s sex partners from the previous 60 days. 
 If this is not possible, offer medication to all sex partners whom patient is able to contact. 
 Expedited Partner Therapy provides for the treatment of sex partners of infected individuals without requiring partners to be tested or seen by healthcare providers. 
 Gonorrhea EPT provides second-line therapy. Therefore patients taking only these oral meds should get a test of cure at the infected site 10–14 days after treatment.   For more information, see the 2015 STD Treatment Guidelines Update. 
 Free medication for your patient’s partner(s) is available from participating pharmacies only. A prescription EPT Fax form is needed and participating pharmacies are listed on the form. Men Who Have Sex with Men
 Expedited Partner Therapy (EPT) is intended for partners of lab confirmed heterosexual cases. 
 EPT isn’t appropriate for partners of men who have sex with me

*April 19, 2017 
 Meningococcal disease is a sudden, severe illness caused by the bacterium Neisseria meningitidis. The disease manifests most commonly as meningitis and/or meningococcemia, but may also cause pneumonia, arthritis, or pericarditis. The symptoms include sudden high fever, chills, severe headache, stiff neck and back, nausea, vomiting, purpural rash, decreased level of consciousness, difficulty breathing, and seizures. 
 Epidemiology During 2005-2011, an estimated 800-1,200 cases of meningococcal disease occurred annually in the United States, representing an incidence of 0.3 cases per 100,000 population. Incidence has declined annually since a peak of disease in the late 1990s. Asymptomatic carriage of N. meningitidis in the nose and throat is common, possibly as high as 20%. Though these individuals may not exhibit symptoms or illness, they can spread the infection to others. Since 2005, declines have occurred among all age groups and in all vaccine-contained serogroups

*April 18, 2017 Learn why public health is essential in APHA’s “What Is Public Health” document. Read Article What Is Public Health?. 
*March 1, 2016 Healthcare Worker Immunization Pre-exposure evaluation for healthcare personnel previously vaccinated with complete, ≥ three dose hepatitis B vaccine series who have not had postvaccination serologic testing.*  Source: MMWR December 20, 2013  * Should be performed one to two months after the last dose of vaccine using a quantitative method that allows detection of the protective concentration of anti-HBs (≥10 mIU/mL) (e.g., enzyme-linked immunosorbent assay [ELISA]). † A nonresponder is defined as a person with anti-HBs <10 mIU/mL after ≥6 doses of hepatitis B vaccine. Persons who do not have a protective concentration of anti-HBs after revaccination should be tested for HBsAg. A common reason for non-reponse to vaccine is that the person is already infected with hepatitis B. If positive, the person should be referred for evaluation for h

*April 17, 2014 Sputum Specimens for Acid Fast Smear and Culture Suspicion of Active TB Notify the Tacoma-Pierce County Health Department (Health Department) if active TB is suspected or of any positive AFB smear or culture: 
(253) 798-6410- Phone
(253) 798-7666- Confidential Fax
 (253) 798-6410- Phone (253) 798-7666- Confidential Fax Patients without Insurance 
Contact Peggy Cooley, RN at (253) 798-2861 or Matthew Rollosson, RN at (253) 798-6052.
    
They will contact the patient to arrange the drop off and pick up of sputum collection cups.


Testing will be done at the Washington State Public Health Lab at no charge to the patient.
 Contact Peggy Cooley, RN at (253) 798-2861 or Matthew Rollosson, RN at (253) 798-6052.
    
They will contact the patient to arrange the drop off and pick up of sputum collection cups.

 
They will contact the patient to arrange the drop off and pick up of sputum collection cups.
 They will contact the patient to arrange the drop off and pick up of sput

*August 23, 2017 Objectives 
Describe the impact of modern vaccines on individual & public health.
Identify factors other than hesitancy that contribute to under-immunization.
Discuss the origins & scope of vaccine hesitancy.
Define the relative roles of science, culture & emotion in parents’ vaccine decision-making.
Recognize the central role of values in immunization policy-making & implementation.
 Describe the impact of modern vaccines on individual & public health. Identify factors other than hesitancy that contribute to under-immunization. Discuss the origins & scope of vaccine hesitancy. Define the relative roles of science, culture & emotion in parents’ vaccine decision-making. Recognize the central role of values in immunization policy-making & implementation. There Was Never an Age of Reason Vaccines, Vaccine Hesitancy, and Vaccine Decision Making This E-Course offers continuing education credits and is hosted by WithinReach and funded by Kaiser WA. Register for Vaccines E-Co

*February 17, 2016 Guidelines for Treatment of Latent Tuberculosis Infection High-Priority Candidates for Treatment of LTBI Positive QFT (greater than 0.35 I.U.) TST ≥5 mm 
HIV-positive persons.
Recent contacts of person with infectious TB.
Persons with fibrotic changes on CXR suggestive of previous TB; or inadequate treatment.
Persons with organ transplants or immunosuppression therapy.
 HIV-positive persons. Recent contacts of person with infectious TB. Persons with fibrotic changes on CXR suggestive of previous TB; or inadequate treatment. Persons with organ transplants or immunosuppression therapy. TST ≥10 mm 
Recent arrivals (<five years) from endemic areas.
Substance abusers.
Residents/employees of health care, correctional or long-term care facilities.
Children and adolescents exposed to high-risk adults.
Persons at high-risk for certain medical conditions.
 Recent arrivals (<five years) from endemic areas. Substance abusers. Residents/employees of health care, correctional or l

*
*June 7, 2017  Background and Epidemiology West Nile virus (WNV) infections first emerged as a public health problem in the United States in the late 1990s. It is a mosquito-borne flavivirus in the same family as yellow fever, dengue fever and St. Louis encephalitis. Other routes of transmission in rare situations are blood transfusions, organ transplant, transplacental, breastfeeding, and percutaneous injuries of laboratory workers. In the United States, outbreaks occur from late spring through autumn when mosquitoes are active. Usually, WNV outbreaks are associated with bird die-offs and cases in horses and other mammals that may precede or occur simultaneously with human cases. In Washington State, WNV activity has historically been very low. Since surveillance began, the highest level of WNV activity occurred in 2009 when there were 38 human cases (36 were acquired in Washington State), 73 cases in horses or other mammals, and 22 dead birds tested positive. In 2016, there were 9 

*August 1, 2017 
 Providers and facilities should immediately report to the Health Department if they suspect Middle East Respiratory Syndrome Coronavirus (MERS-CoV) and institute infection control precautions (airborne, contact and standard precautions). 
 In September of 2012, the Saudi Arabia Ministry of Health announced a new form of coronavirus, since named Middle East Respiratory Syndrome Coronavirus (MERS-CoV).  So far all cases have been linked (by residence or travel) to countries in or near the Arabian Peninsula*, most in Saudi Arabia.  As of June 2017, there have been nearly 2,000 cases of MERS with at least 691 related deaths. Only 2 patients have ever been identified in the U.S., both health care workers exposed while providing care to patients in Saudi Arabia. Although the majority of human cases of MERS have been attributed to human-to-human infections, camels are likely to be a major reservoir host for MERS-CoV. However, the exact role of camels in transmission of the v

*March 8, 2018 “In 2016, about 3,621 Washington State 12th graders had tried heroin at least once and even more (about 4,526) use pain killers to get high in any given month.” -Healthy Youth Survey    If a student overdoses, a medication called naloxone could save their life. To prevent and prepare for opiate overdose, schools can: 
Develop a standing order for naloxone (Narcan) administration and a protocol for overdose response.
Educate staff and students about the Washington Good Samaritan Law .
Teach staff and students what naloxone is and how to use it.
Make sure staff and students know where they can get naloxone.
Make youth-friendly opiate safety education materials available.
    
Opiate Overdose Brochure.
Good Samaritan Law Poster.


 Develop a standing order for naloxone (Narcan) administration and a protocol for overdose response. Educate staff and students about the Washington Good Samaritan Law . Teach staff and students what naloxone is and how to use it. Make sure staff 

[]
*