> This Python script scrapes text data from a series of websites, cleans and formats the data as unstructured text, and stores the unstructured text blocks in a dataframe format for future analysis.  

> The script first defines a series of functions that are used to 1) grab all text (with HTML tags), 2) grab all text (no HTML tags), and 3) grab visible text that consumers would see if they opened the webpage manually. 

> Then, additional functions are defined to crawl through websites, which involves putting in a top-level domain (i.e. apple.com) and then pulling URLs for all websites associated with that domain (i.e. apple.com/ipad).  This will allow us to generate a list of links to scrape text from that are associated with a single provider website.

> Once we have the set of links and the text-grabbing functions defined, this script runs one master webscrape, which runs the text-grabbing functions on each link, cleans the text, and stores the output in a dataframe format (i.e. a spreadsheet where the first column is the URL, the second column is the block of all text with tags, the third column is the block of all text without tags, and the fourth column is the block of all visible text.  Once we have this for each URL, we can merge with our human-coded data (at the provider ID level) which will allow us to run future analyses determining if certain text elements are more common in sites that humans coded as having CMS marketing mentions.

>> Last updated: 9/6/2016
   Author: Matt Green

In [1]:
#import packages needed for the webscrape
import os, nltk, pandas as pd, numpy as np, bs4, urllib, re, robobrowser, requests, csv, collections, scrapy
from urllib.request import urlopen
from urllib.error import URLError, HTTPError
from bs4 import BeautifulSoup, NavigableString
from robobrowser import RoboBrowser
from nltk import *
from collections import defaultdict
from scrapy.spiders import SitemapSpider

### 1) Write Functions Used to Pull Text Elements from a Website Input

#### Grab all text (with HTML tags)

> This function takes as an input a URL, opens the link, reads in the HTML data, and parses the data into a usable format using Python's BeautifulSoup package.  The function returns a full HTML markup that can be stored as an object for future steps.

In [2]:
def get_text_with_tags(URL):
    hdr={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'} #this bit of code above allows Python to read in most major website formats
    request=urllib.request.Request(URL, headers=hdr)
    html=urlopen(request).read()
    soup=BeautifulSoup(html, "lxml")
    data=soup.prettify()
    return data

In [4]:
#to test the code, enter a URL into the function (enclosed in quotes to denote a string value)
get_text_with_tags("http://www.stmarysmadison.com")



#### Grab all text (without HTML tags)

> This function also takes as input a URL, opens the link, reads in the HTML data, and parses using BeautifulSoup.  However, this function goes a step further and additionally strips all HTML tags from the parsed data, returning the block of tagless text.

In [4]:
def get_text_without_tags(URL):
    hdr={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
    request=urllib.request.Request(URL, headers=hdr)
    html=urlopen(request).read()
    soup=BeautifulSoup(html, "lxml")
    data=soup.findAll(text=lambda text:isinstance(text, NavigableString)) #this piece of code strips out the HTML tags
    list1=u' '.join(data)
    return list1

In [5]:
#test the code
get_text_without_tags("http://www.srdlc.org/")

'html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" \n \n Southeastern Renal Dialysis :: Southeast Iowa \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \t\n#search {\nbackground: url(/themes/srd_theme/images/SEIRD_02.jpg) no-repeat;\nwidth:524px;\nheight:71px;}\n\t\t\n#searchwrapper {\nwidth:524px; /*follow your image\'s size*/\nheight:71px;/*follow your image\'s size*/\nleft:0px;\ntop:0px;\npadding:0px;\nmargin:0px;\nposition:relative; /*important*/}\n\n#searchresultwrapper {\n/*width:176px; follow your image\'s size*/\n/*height:26px; follow your image\'s size\nbackground-image:url(../images/search.png);*/\n/*background-repeat:no-repeat; important*/\nmargin: 0px 0px 0px 0px; /* left was 15px */\nposition:relative; /*important*/\npadding: 0px;\nborder:0px;}\n \n#searchwrapper form { display:inline ; }\n \n.searchbox {\nborder:0px; /*important*/\nbackground-color:transparent; /*important*/\nposition:absolute; /*important*/\ncol

#### Grab all visible text

> This function also takes as input a URL, opens the link, reads in the HTML data, parses using BeautifulSoup, and strips all HTML tags from the parsed data. However, the end goal of the function is to filter the large block of HTML text to text that would be visible to the consumer upon opening the website manually. In order to do this, we have to include an additional function called visible, which takes a block of text as an input, iterates over each element in the block of text, and creates a series of conditions for whether each element is determined to be part of visible text.  The function excludes any text that is "False" for this function and only returns text that is "True" for being visible.

In [9]:
#function to filter out visible text
def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element.encode('utf-8'))):
        return False
    elif re.match('^\n$', str(element)): 
        return False
    elif element.startswith(' start'):
        return False
    elif element.startswith(' end'):
        return False
    elif element.startswith(' mobile'):
        return False
    elif element.startswith(' END'):
        return False
    elif element.startswith(' BEGIN'):
        return False
    elif re.search('^Begin Body$', str(element)):
        return False
    elif re.search('^End Body$', str(element)):
        return Falsee
    elif re.search( '^ \xa0 $', str(element)):
        return False
    elif re.search('^\xa0$', str(element)):
        return False
    elif re.search('MailChimp', str(element)):
        return False
    elif re.search('^>$', str(element)):
        return False
    elif re.search('^<$', str(element)):
        return False
    elif re.match('^ $', str(element)):
        return False
    elif re.match('^, $', str(element)):
        return False
    elif re.match('^.$', str(element)):
        return False
    elif re.match('^ | $', str(element)):
        return False
    elif re.match('^____$', str(element)):
        return False
    elif re.match('^___$', str(element)):
        return False
    elif re.match('^Page.+Div.+End$', str(element)):
        return False
    elif re.match('^PAGE.+CONTENT$', str(element)):
        return False
    elif re.match('^.+Div.+End$', str(element)):
        return False
    elif re.match('^end top_wording$', str(element)):
        return False
    elif re.match('^end body$', str(element)):
        return False
    elif re.match('^end footerLeft$', str(element)):
        return False
    elif re.match('^end top$', str(element)):
        return False
    elif re.match('^end container$', str(element)):
        return False
    elif re.match('^end footer$', str(element)):
        return False
    elif re.search('\n\t\t.+', str(element)):
        return False
    elif element.startswith('[if lt IE 10]'):
        return False
    elif element.startswith('[if lt IE 9]'):
        return False
    elif element.endswith('[endif]'):
        return False
    elif re.search('<a', str(element)):
        return False
    elif element.startswith('[if gt IE 8]'):
        return False
    elif re.search('<script', str(element)):
        return False
    elif re.search('<option', str(element)):
        return False
    elif re.search('<span', str(element)):
        return False
    elif re.search('<input', str(element)):
        return False
    else:
        return True

In [5]:
#function to scrape visible text
def get_visible_text(URL):
    hdr={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
    request=urllib.request.Request(URL, headers=hdr)
    html=urlopen(request).read()
    soup=BeautifulSoup(html, "lxml")
    data=soup.findAll(text=True)
    list1=[i for i in data if visible(i)] #the "if" condition here filters out visible text using the function above
    join=' '.join(list1)
    return join

In [10]:
#test code
get_visible_text("http://www.stmarysmadison.com")

'You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page. \r\n                Turn on more accessible mode\r\n             \r\n                Turn off more accessible mode\r\n             \r\n                Skip Ribbon Commands\r\n             \r\n                Skip to main content\r\n             \r\n                Turn off Animations\r\n             \r\n                Turn on Animations\r\n             Sign In St. Mary\'s Hospital - Madison\r\n                                    \r\n                                    \r\n                                    \r\n                                    \r\n                                    \r\n                                    \r\n                                    \r\n                                     Connect with Us\r\n                                    \r\n                                    \r\n                                    \r\n                         

### 2) Write Function to Clean Text Blocks

> This function below works to clean the text blocks generated by the text-grabbing functions above (with the exception of get_text_with_tags, which we are leaving as an HTML object in order to parse it later if need be), removing extraneous characters and cleaning up some of the formatting to make the block look more like a normal block of readable text.  The key element of this function is a user-defined dictionary, reps, which tells Python which characters in the string need to be replaced and what they need to be replaced with.  For example, the first element of the dictionary, ('\xa0':' '), tells Python to replace all character strings of "\xa0" with a blank space.  If the user wishes to replace more characters in the string, he/she would only need to add more entries to the dictionary and re-run the function.

In [9]:
reps={'\xa0':' ', '\r':'', '\n':'', '\'':''}
def clean_text(text,dic):
    for i,j in dic.items():
        text=text.replace(i,j)
    return text

### 3) Write Function to Grab Text Elements and Store in Spreadsheet Format for One Site

> The first step in drafting the master webscrape code will be to generate a code that can do the entire webscrape for one site.  The code will read in a given URL, then grab each type of text (all, all without tags, visible), clean and format the text, store as a dictionary element (with the master key being the URL itself), and convert the dictionary into a pandas dataframe.  The output should look just like a row on an Excel spreadsheet.

In [10]:
def webscrape(URL):
    alltext=get_text_with_tags(URL)
    all_notags=get_text_without_tags(URL)
    all_notags_cleaned=clean_text(all_notags,reps)
    visible=get_visible_text(URL)
    visible_clean=clean_text(visible,reps)
    key=URL
    dict1={key:[alltext, all_notags_cleaned, visible_clean]}
    df=pd.DataFrame.from_dict(dict1, orient="index")
    df.columns=['All Text', 'All Text (No Tags)', 'Visible Text']
    return df

In [11]:
webscrape("http://www.srdlc.org/")

Unnamed: 0,All Text,All Text (No Tags),Visible Text
http://www.srdlc.org/,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...","html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitiona...",Locations Contact Information 1213 S Gear Ave ...


### 3a) See if the Webscrape Function works for a List of Sites

> Now that we can see that the webscrape function works successfully for one site, we will need to test to see if it can yield a similar spreadsheet when we feed in a list of sites.  For this, we will test this webscrape function on all the root URLs for our provider websites.  We will run this in a long loop function that returns the dataframe for each and appends it to a master blank dataframe.

In [12]:
#first, will need to import the spreadsheet that has our URL list on it
#this code sets the working directory to our project folder and lists the contents
os.chdir("P:\\7670\Common\\2. Ad hoc requests (Task 2F )\Star Ratings\Cross-site\Marketing scan\August 2016\\99_Idea Lab\Work\Week 1")
os.listdir()

['Week 1 documentation.docx',
 'archive',
 'In-Sample Websites_083016.xlsx',
 'OOS websites',
 'CMS AMs with Star Rating text.xlsx']

In [16]:
#next, will read in the spreadsheet "In-Sample Websites" as a pandas dataframe
website_sheet=pd.ExcelFile("In-Sample Websites_083016.xlsx")
df1=website_sheet.parse('Clean Data') #loads in the worksheet "Clean Data" as the data source
df1[:5] #shows the first five rows

Unnamed: 0,PrimaryKey,Provider Type,SampleID,FULL URL,ROOT URL
0,5527291,DFC,1,https://www.davita.com/find-a-dialysis-center/...,https://www.davita.com/
1,325332,DFC,2,https://www.freseniuskidneycare.com/dialysis-c...,https://www.freseniuskidneycare.com/
2,4335053,DFC,3,http://www.prairielakes.com/locations/prairie-...,http://www.prairielakes.com/
3,3926815,DFC,5,http://www.dciinc.org/punxsutawney/,http://www.dciinc.org/
4,1126857,DFC,7,http://www.usrenalcare.com/locations/calhoun,http://www.usrenalcare.com/


In [17]:
#we then create a list of URLs from the column "ROOT URL", which we will use as inputs into the webscrape function
url_list=df1["ROOT URL"].tolist()
url_list

['https://www.davita.com/',
 'https://www.freseniuskidneycare.com/',
 'http://www.prairielakes.com/',
 'http://www.dciinc.org/',
 'http://www.usrenalcare.com/',
 'http://www.reevescountyhospital.com/',
 'http://www.gundersenhealth.org/',
 'http://www.winthrop.org/',
 'https://www.davita.com/',
 'http://www.usrenalcare.com/',
 'http://www.foxvalleydialysis.com/',
 'http://www.bbgh.org/',
 'https://harbinclinic.com/',
 'http://www.bmhutah.org/',
 'http://www.satellitehealth.com/',
 'https://www.unityhealth.org/',
 'http://www.pskc.net/',
 'http://www.sanfordhealth.org/',
 'http://www.davita.com/',
 'http://www.satellitehealth.com/',
 'http://www.usrenalcare.com/',
 'http://www.dciinc.org/',
 'https://www.davita.com/',
 'http://www.hkcdialysis.com/',
 'http://mayoclinichealthsystem.org/',
 'http://www.renalventures.com/',
 'http://www.tbh.org/',
 'http://www.usrenalcare.com/',
 'https://www.davita.com/',
 'https://www.davita.com/',
 'http://www.usrenalcare.com/',
 'http://www.usrenalcare.

In [18]:
#to test the webscrape function, will iterate over the list above and append results to a master blank dataframe
#NOTE: THIS CODE TAKES ABOUT 40 MINS TO RUN
master=pd.DataFrame() 
for url in url_list:
    try:
        row=webscrape(url)
        master=master.append(row)
    except URLError as e:  #if the webscrape code doesn't work for a site, this code will print the site URL and the nature of the error
        if hasattr(e, 'reason'):
            print(url)
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
        elif hasattr(e, 'code'):
            print(url)
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)

http://citehealth.com/
We failed to reach a server.
Reason:  Forbidden
http://citehealth.com/
We failed to reach a server.
Reason:  Forbidden
http://citehealth.com/
We failed to reach a server.
Reason:  Forbidden
http://citehealth.com/
We failed to reach a server.
Reason:  Forbidden
http://citehealth.com/
We failed to reach a server.
Reason:  Forbidden
https://www.floridahospital.com/
We failed to reach a server.
Reason:  [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)
https://www.henrycountyhospital.org/
We failed to reach a server.
Reason:  [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)
https://adirondackhealth.org/
We failed to reach a server.
Reason:  [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)
http://www.pionet.net/
We failed to reach a server.
Reason:  Forbidden
http://southgatehealthcare.com/
We failed to reach a server.
Reason:  [WinError 10060] A connection attempt failed because the connected party d

In [19]:
#name columns and view first 5 records of master test file
master.columns=['All Text', 'All Text (No Tags)', 'Visible Text']
master[:5]

Unnamed: 0,All Text,All Text (No Tags),Visible Text
https://www.davita.com/,"<!DOCTYPE html>\n<html lang=""en-US"" xml:lang=""...","html [if lt IE 7]> <html class=""no-js lt...",[if gt IE 9]><! I Have Early-Stage Kidney Dise...
https://www.freseniuskidneycare.com/,<!DOCTYPE html>\n<!--[if lt IE 7]> <html ...,"html [if lt IE 7]> <html class=""no-js lt-...",Skip to main content About Us Contact Us Españ...
http://www.prairielakes.com/,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...","html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitiona...",locations online bill pay For Physicians Caree...
http://www.dciinc.org/,"<!DOCTYPE html>\n<html lang=""en-US"">\n <head>\...","html Dialysis Clinic, Inc. | Dialysis Clini...",Find a Clinic Careers Contact Us Legal /ABOUT...
http://www.usrenalcare.com/,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...","html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitiona...",U.S. Renal Care Homepage En español Locations ...


In [20]:
#let's write this to a .csv file just in case (which will save in the project folder working directory)
master.to_csv("master_output_rootURLS_9.9.16.csv")