# Word Counting Web Scraper

This document shares a word-counting web scraper that returns a data frame of URLs and, for each URL, the count of user-specified words in the page text.

Inputs:
    URLs as a list, e.g. ['yahoo.com', 'google.com', 'gsk.com']
    Words as a list, e.g. ['patient', 'healthcare']
    
Output: 
    Pandas DataFrame object with rows as websites and columns as word counts
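
As a rough illustration of the input and output described above, here is a minimal sketch that counts only the requested words in text that has already been fetched. The function name `count_target_words`, the regular-expression tokenizer, and the sample texts are assumptions made for this example; they are not part of the notebook below.

```python
import re
from collections import Counter

def count_target_words(texts_by_url, words):
    """Return {url: {word: count}} for the requested words only."""
    targets = [w.lower() for w in words]
    table = {}
    for url, text in texts_by_url.items():
        # Lower-case the text and split it into runs of letters, digits and apostrophes.
        counts = Counter(re.findall(r"[a-z0-9']+", text.lower()))
        # A Counter returns 0 for words that never occur.
        table[url] = {w: counts[w] for w in targets}
    return table

# Hypothetical page text keyed by URL, so the example needs no network access.
sample = {
    "example.com/a": "Patient care matters. Every patient counts.",
    "example.com/b": "Healthcare costs rose.",
}
result = count_target_words(sample, ["patient", "healthcare"])
print(result)
```

Passing `result` to `pandas.DataFrame.from_dict(result, orient="index")` should give one row per URL and one column per word.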

In [2]:
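## First step: import the necessary packages

# requests handles our interaction with the websites.
import requests

# pandas is a data manipulation library; the collected data is stored in its DataFrames.
import pandas

# BeautifulSoup is used for parsing HTML.
from bs4 import BeautifulSoup

# TextBlob provides text-analytics helpers such as tokenization and noun-phrase extraction.
from textblob import TextBlob

import codecs
# Counter tallies how often each word occurs.
from collections import Counter
# simplejson parses the JSON returned by the news search API.
import simplejson
# pivot_ui renders a DataFrame as an interactive pivot table.
from pivottablejs import pivot_ui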

First I will declare a function that takes one URL and returns the visible text of that page. Then I will walk through the function step by step to show why each piece is necessary.

In [3]:
# Declare a function called getTextfromUrl that receives a url as input.
def getTextfromUrl(url):
    # Python attempts everything indented under "try"; if any of it raises an
    # error, the commands indented beneath "except" run instead.
    try:
        # If the url does not start with a scheme, prepend http:// to it.
        if not url.lower().startswith(('http://', 'https://')):
            url = 'http://' + url

        # Use the requests library to fetch the url. This returns a Response object
        # containing the page source, cookies, and other information about the page.
        # An error is raised if the server does not respond within 5 seconds.
        page = requests.get(url, timeout=5)

        # Take the page source from the Response object (via .text) and create a
        # BeautifulSoup object to make the source easier to manipulate.
        soup = BeautifulSoup(page.text, 'html.parser')

        # Remove JavaScript and CSS from the BeautifulSoup object.
        for script in soup(["script", "style"]):
            script.extract()

        # Assign the remaining text to a variable called text, dropping non-ASCII characters.
        text = soup.get_text().encode('ascii', 'ignore').decode('ascii')

    # If any of the above code fails, assign an empty string "" to text.
    except Exception:
        text = ""

    # Return the value in the variable "text".
    return text


The next few cells walk through the code in more depth to show why each piece is necessary. If you already understand the function above, you can skip ahead.

First, let's go through an example using the URL of a New York Times article.

In [4]:
#setting value of url variable to an article about Trump
url = "www.nytimes.com/2015/09/11/us/politics/looking-to-score-with-republican-debate-viewers-not-floor-donald-trump.html?_r=0"

In [5]:
# If we try to request this url as-is, it fails because the scheme (http://) is missing.
page = requests.get(url, timeout=5)

MissingSchema: Invalid URL 'www.nytimes.com/2015/09/11/us/politics/looking-to-score-with-republican-debate-viewers-not-floor-donald-trump.html?_r=0': No schema supplied. Perhaps you meant http://www.nytimes.com/2015/09/11/us/politics/looking-to-score-with-republican-debate-viewers-not-floor-donald-trump.html?_r=0?

In [6]:
# That is why the function prepends http:// when the url does not already start with a scheme.

if not url.lower().startswith(('http://', 'https://')):
    url = 'http://' + url

# Now the url includes http://, and the request works (no error message appears).
print(url)
page = requests.get(url, timeout=5)

http://www.nytimes.com/2015/09/11/us/politics/looking-to-score-with-republican-debate-viewers-not-floor-donald-trump.html?_r=0
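
A substring test for 'http' can misfire on a URL whose path merely contains those letters, so checking the prefix is a little stricter. The sketch below shows that idea in isolation; the helper name `ensure_scheme` and the example URLs are hypothetical.

```python
def ensure_scheme(url, default="http://"):
    """Prepend a scheme unless the url already starts with http:// or https://."""
    if url.lower().startswith(("http://", "https://")):
        return url
    return default + url

# A bare host name gets the default scheme.
print(ensure_scheme("www.nytimes.com"))
# "http" appears in the path here, but there is still no scheme, so one is added.
print(ensure_scheme("example.com/how-http-works"))
```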


In [7]:
#Let's look at what the page.text value looks like before we clean it up with Beautiful Soup.
page.text



Not very nice. This is mostly because there is a lot of unnecessary junk on the page. Luckily, we can remove that with the BeautifulSoup library.

In [8]:
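# Parse the page source, naming the parser explicitly, and pull out all of its text.
soup = BeautifulSoup(page.text, 'html.parser')
text = soup.get_text()
print(text)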

 




In Second Republican Debate, Donald Trump’s Rivals Seek Points, Not Knockout - The New York Times
window.NREUM||(NREUM={}),__nr_require=function(n,e,t){function r(t){if(!e[t]){var o=e[t]={exports:{}};n[t][0].call(o.exports,function(e){var o=n[t][1][e];return r(o?o:e)},o,o.exports)}return e[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t.length;o++)r(t[o]);return r}({QJf3ax:[function(n,e){function t(n){function e(e,t,a){n&&n(e,t,a),a||(a={});for(var u=c(e),f=u.length,s=i(a,o,r),p=0;f>p;p++)u[p].apply(s,t);return s}function a(n,e){f[n]=c(n).concat(e)}function c(n){return f[n]||[]}function u(){return t(e)}var f={};return{on:a,emit:e,create:u,listeners:c,_events:f}}function r(){return{}}var o="nr@context",i=n("gos");e.exports=t()},{gos:"7eSDFh"}],ee:[function(n,e){e.exports=n("QJf3ax")},{}],3:[function(n,e){function t(n){return function(){r(n,[(new Date).getTime()].concat(i(arguments)))}}var r=n("handle"),o=n(1),i=n(2);"undefined"==typeof window.newr

That is closer, but there is still a lot of noise in there: JavaScript and CSS. I'm going to remove them in the next cell, printing each piece as it is removed so you can see what is being dropped.

In [9]:


# Let's remove all JavaScript and CSS, printing each piece for educational purposes.
for script in soup(["script", "style"]):
    print(script)
    script.extract()

<script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(n,e,t){function r(t){if(!e[t]){var o=e[t]={exports:{}};n[t][0].call(o.exports,function(e){var o=n[t][1][e];return r(o?o:e)},o,o.exports)}return e[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t.length;o++)r(t[o]);return r}({QJf3ax:[function(n,e){function t(n){function e(e,t,a){n&&n(e,t,a),a||(a={});for(var u=c(e),f=u.length,s=i(a,o,r),p=0;f>p;p++)u[p].apply(s,t);return s}function a(n,e){f[n]=c(n).concat(e)}function c(n){return f[n]||[]}function u(){return t(e)}var f={};return{on:a,emit:e,create:u,listeners:c,_events:f}}function r(){return{}}var o="nr@context",i=n("gos");e.exports=t()},{gos:"7eSDFh"}],ee:[function(n,e){e.exports=n("QJf3ax")},{}],3:[function(n,e){function t(n){return function(){r(n,[(new Date).getTime()].concat(i(arguments)))}}var r=n("handle"),o=n(1),i=n(2);"undefined"==typeof window.newrelic&&(newrelic=window.NREUM);var a=["setPageViewName","addPageAction","s
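
To see the same idea on a page small enough to read, here is a sketch that drops `<script>` and `<style>` content from a tiny inline HTML string. It deliberately uses the standard library's `html.parser.HTMLParser` rather than BeautifulSoup so that it runs without extra packages (Python 3); the class name `VisibleTextParser` and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = ("<html><head><title>Demo</title><style>p {color: red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><p>Hello world</p></body></html>")
parser = VisibleTextParser()
parser.feed(html)
parser.close()
visible = " ".join(parser.chunks)
print(visible)
```

Only the title and paragraph text survive; the CSS rule and the JavaScript statement are skipped.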

Now we'll see that the text looks much better.  

In [10]:
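# Drop non-ASCII characters, then decode back to a regular string for printing.
text = soup.get_text().encode('ascii', 'ignore').decode('ascii')
print(text)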

 




In Second Republican Debate, Donald Trumps Rivals Seek Points, Not Knockout - The New York Times






































































































































NYTimes.com no longer supports Internet Explorer 9 or earlier. Please upgrade your browser.
LEARN MORE 









Sections

Home

Search
Skip to content
Skip to navigation
View mobile version




The New York Times





Politics|In Second Republican Debate, Donald Trumps Rivals Seek Points, Not Knockout



Advertisement







Search


Log In


0
Settings




Close search

search sponsored by






Search NYTimes.com



Clear this text input



Go






http://nyti.ms/1Nl2lOV




Loading...




See next articles





See previous articles








 







Advertisement








Politics 
In Second Republican Debate, Donald Trumps Rivals Seek Points, Not Knockout

By PATRICK HEALY and JONATHAN MARTINSEPT. 10, 2015


Inside



Supported by








Photo





Perfect! We can now grab only the text of a URL. Next, we need to extract value from that text. Let's start by counting how often each word appears.

In [12]:
# text was already reduced to ASCII above, so it can be passed to TextBlob directly.
blob = TextBlob(text)

In [13]:
blob.noun_phrases

WordList(['republican', u'donald trumps rivals seek points', 'knockout', 'york', 'nytimes.com', u'internet explorer', 'please', u'learn more sections', u'search skip', 'skip', u'navigation view', u'mobile version', 'york', 'politics|in', 'republican', u'donald trumps rivals seek points', u'knockout advertisement search log', u'settings close', u'search search', u'search nytimes.com clear', u'text input', 'loading', u'previous articles', u'advertisement politics', 'republican', u'donald trumps rivals seek points', 'knockout', u'patrick healy', u'jonathan martinsept', 'supported', u'photo gov', u'john kasich', 'ohio', u'emphatic point', 'republican', u'credit doug mills/the', 'york', u'advertisement continue', u'main story', 'continue', u'main story', 'share', u'page continue', u'main story', 'continue', u'main story', 'republican', u'presidential candidates', u'make-or-break consequences', u'aggressive new tactics', u'donald j. trump', u'race enters', u'combative phase.with', u'onetime 

In [14]:
countDict=Counter([word.lower() for word in blob.words])

In [15]:
countDict

Counter({u'the': 81, u'to': 52, u'and': 33, u'mr': 32, u'of': 28, u'in': 25, u'a': 21, u'debate': 20, u'he': 19, u'for': 18, u'on': 17, u'trump': 15, u'that': 14, u'more': 12, u'his': 12, u'new': 11, u'main': 11, u'not': 11, u'with': 11, u'story': 10, u'reading': 10, u'continue': 10, u'times': 9, u'their': 9, u'said': 9, u'york': 8, u'who': 8, u'bush': 8, u'at': 8, u'2015': 7, u'republican': 7, u'was': 7, u'see': 7, u'are': 7, u'by': 7, u'about': 7, u'as': 7, u'search': 6, u'fiorina': 6, u'first': 6, u'say': 6, u'have': 6, u'august': 6, u'trumps': 6, u'advertisement': 6, u'like': 6, u'you': 6, u'campaign': 5, u'from': 5, u'today': 5, u'after': 5, u'they': 5, u'out': 5, u'points': 5, u'candidates': 5, u'do': 5, u'last': 5, u'polls': 5, u'will': 5, u'is': 5, u'if': 5, u'advisers': 5, u'has': 5, u'rivals': 5, u'but': 5, u'an': 5, u'go': 4, u'video': 4, u'opinion': 4, u'next': 4, u"'s": 4, u'this': 4, u'can': 4, u'knockout': 4, u'carson': 4, u'seek': 4, u'mrs': 4, u'him': 4, u'subscribe': 
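
The counts above are dominated by function words such as "the" and "to". One optional way to surface the more informative words is to filter a small stop-word list before ranking. The sketch below is only an illustration: the stop-word list and the helper name `top_content_words` are assumptions, and the sample counts are a few entries copied from the output above.

```python
from collections import Counter

# Illustrative stop-word list (an assumption, not taken from the notebook).
STOP_WORDS = {"the", "to", "and", "of", "in", "a"}

def top_content_words(counts, n=3, stop_words=STOP_WORDS):
    """Return the n most common words that are not in stop_words."""
    filtered = Counter({w: c for w, c in counts.items() if w not in stop_words})
    return filtered.most_common(n)

# A few counts copied from the Counter output above.
sample = Counter({"the": 81, "to": 52, "and": 33, "mr": 32, "debate": 20, "trump": 15})
top = top_content_words(sample)
print(top)
```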

In [16]:
df=pandas.DataFrame.from_dict(countDict, orient='index')
df.columns=['WordCount']
df['url']=url
df.reset_index(inplace=True)
df.rename(columns={'index':'Word'},inplace=True)
df.head()

   Word     WordCount  url
0  all      1          http://www.nytimes.com/2015/09/11/us/politics/...
1  dance    1          http://www.nytimes.com/2015/09/11/us/politics/...
2  skip     2          http://www.nytimes.com/2015/09/11/us/politics/...
3  month    1          http://www.nytimes.com/2015/09/11/us/politics/...
4  manager  1          http://www.nytimes.com/2015/09/11/us/politics/...


Let's wrap the process of turning text into a table like this in a function.

In [17]:
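def textToTable(text):
    # Tokenize the text and count each lower-cased word.
    blob = TextBlob(text)
    countDict = Counter([word.lower() for word in blob.words])
    # One row per word, with the count in a single column.
    df = pandas.DataFrame.from_dict(countDict, orient='index')
    df.columns = ['WordCount']
    # Note: url is read from the global variable set in an earlier cell.
    df['url'] = url
    # Move the words out of the index into a column named "Word".
    df.reset_index(inplace=True)
    df.rename(columns={'index': 'Word'}, inplace=True)
    return df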

In [18]:
textToTable(text).head()

   Word     WordCount  url
0  all      1          http://www.nytimes.com/2015/09/11/us/politics/...
1  dance    1          http://www.nytimes.com/2015/09/11/us/politics/...
2  skip     2          http://www.nytimes.com/2015/09/11/us/politics/...
3  month    1          http://www.nytimes.com/2015/09/11/us/politics/...
4  manager  1          http://www.nytimes.com/2015/09/11/us/politics/...


Perfect. Now we can get text from a URL and turn it into a table with the frequency of each word that appears. This is great, but processing one URL at a time is cumbersome, so let's build one large table of word frequencies by URL for many different URLs.

I'll start with a list of URLs for stories about Mr. Trump, courtesy of the Google News API.

In [19]:
url_base = 'https://ajax.googleapis.com/ajax/services/search/news'
params = dict(q="Donald Trump", v=1.0, ned="us", rsz=8)
rsp = requests.get(url_base, params=params)
print(rsp.url)
results=simplejson.loads(rsp.text)

https://ajax.googleapis.com/ajax/services/search/news?q=Donald+Trump&ned=us&rsz=8&v=1.0


In [20]:
urlList=[i['unescapedUrl'] for i in results['responseData']['results']]
urlList

[u'http://www.salon.com/2015/09/10/emasculated_white_men_love_donald_trump_the_real_reason_a_billionaire_bozo_rules_the_gop/',
 u'http://www.foxnews.com/entertainment/2015/09/11/donald-trump-buys-miss-universe-organization/',
 u'http://www.cnn.com/2015/09/11/politics/donald-trump-9-11-tweet/',
 u'http://www.cbsnews.com/news/donald-trump-not-such-a-big-hit-in-china/',
 u'http://www.nytimes.com/politics/first-draft/2015/09/11/donald-trump-leads-in-iowa-in-new-quinnipiac-poll/',
 u'http://www.cnn.com/2015/07/22/politics/maevewest-donald-trump-2016-silent-majority/index.html?eref=rss_latest',
 u'http://www.cbsnews.com/news/donald-trumps-feud-with-bobby-jindal-escalates/',
 u'http://www.usnews.com/news/the-report/articles/2015/09/11/only-donald-trump-can-beat-donald-trump']

In [21]:
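MasterDf = pandas.DataFrame(columns=["Word", "WordCount", "url"])
# textToTable reads the global variable url, which this loop sets on each pass.
for url in urlList:
    try:
        text = getTextfromUrl(url)
        df = textToTable(text)
        # DataFrame.append was removed in pandas 2.0; pandas.concat does the same job.
        MasterDf = pandas.concat([MasterDf, df], ignore_index=True)
    except Exception:
        # Skip any URL whose text could not be fetched or tabulated.
        pass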
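
An alternative way to build the combined table is to pass the URL explicitly instead of relying on a global variable, and to collect plain rows in a list before building the table once. The sketch below shows that pattern with the standard library only; the helper name `text_to_rows`, the whitespace tokenizer, and the sample pages are assumptions for illustration.

```python
from collections import Counter

def text_to_rows(text, url):
    """Return one {'Word', 'WordCount', 'url'} dict per distinct word in text."""
    counts = Counter(text.lower().split())
    return [{"Word": w, "WordCount": c, "url": url} for w, c in counts.items()]

# Hypothetical pre-fetched page text keyed by URL, so no network access is needed.
pages = {
    "http://example.com/a": "trump debate trump",
    "http://example.com/b": "debate night",
}

all_rows = []
for page_url, page_text in pages.items():
    # Each page contributes its own rows, tagged with the URL they came from.
    all_rows.extend(text_to_rows(page_text, page_url))

print(len(all_rows))
```

Calling `pandas.DataFrame(all_rows)` on the finished list should produce a table with the same three columns as MasterDf.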


For quick fun, let's see whether Trump is more frequently referred to by his first or last name.

In [22]:
# DataFrame.sort was removed from pandas; sort_values is its replacement.
MasterDf[(MasterDf["Word"]=="trump") | (MasterDf["Word"]=="donald")].sort_values("WordCount", ascending=False)

      Word    WordCount  url
447   trump   52         http://www.salon.com/2015/09/10/emasculated_wh...
4122  trump   25         http://www.usnews.com/news/the-report/articles...
2084  trump   24         http://www.cnn.com/2015/09/11/politics/donald-...
303   donald  21         http://www.salon.com/2015/09/10/emasculated_wh...
3049  trump   19         http://www.cnn.com/2015/07/22/politics/maevewe...
2250  trump   16         http://www.cbsnews.com/news/donald-trump-not-s...
3565  trump   16         http://www.cbsnews.com/news/donald-trumps-feud...
1971  donald  12         http://www.cnn.com/2015/09/11/politics/donald-...
2565  donald  9          http://www.cbsnews.com/news/donald-trump-not-s...
2885  trump   9          http://www.nytimes.com/politics/first-draft/20...
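
To answer the first-name versus last-name question with a single number per word, the per-URL counts can be summed. The sketch below does this for the ten rows shown in the output above, using only the standard library; the variable names are illustrative.

```python
from collections import defaultdict

# (Word, WordCount) pairs copied from the ten rows of output above.
rows = [
    ("trump", 52), ("trump", 25), ("trump", 24), ("donald", 21), ("trump", 19),
    ("trump", 16), ("trump", 16), ("donald", 12), ("donald", 9), ("trump", 9),
]

totals = defaultdict(int)
for word, count in rows:
    # Add each URL's count to the running total for that word.
    totals[word] += count

print(dict(totals))
```

On the full table, `MasterDf.groupby("Word")["WordCount"].sum()` would compute the same kind of total for every word.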


Next we're going to make this data easier to explore by creating an interactive JavaScript pivot table from it.

In [25]:
pivot_ui(MasterDf[MasterDf["WordCount"]>1])