Welcome to my notes on Data Gathering!

As an aside, please remember that this code was written while working through Joel Grus's book "Data Science from Scratch" and is not uniquely my own. However, the code and comments are all typed out by hand to help the material stick.

In [22]:
#egrep.py
import sys, re

# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command-line
regex = sys.argv[1]
count = 0
match = 0

# for every line passed into the script
for line in sys.stdin:
    count += 1
    # if it matches the regex, write it to stdout
    if re.search(regex, line):
        sys.stdout.write(line)
        match += 1
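# report the share of matching lines on stderr so it doesn't end up in the piped output;
# skip the report for empty input, which would otherwise raise ZeroDivisionError
if count > 0:
    sys.stderr.write("%.2f percent of lines contained the regular expression\n"
                     % (100.0 * match / count))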

To count how many lines in a file contain a number from the Windows command line, you'd use:

In [None]:
type SomeFile.txt | python egrep.py "[0-9]" | python line_count.py

On a Unix system, you'd use:

In [None]:
cat SomeFile.txt | python egrep.py "[0-9]" | python line_count.py
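The pipelines above rely on a line_count.py script that isn't shown in these notes. As a rough sketch of what the whole pipeline computes, here is an in-process version. The names matching_lines and count_lines and the sample text are my own placeholders, not from the book, and it runs on an in-memory stream instead of sys.stdin:

```python
import io
import re


def matching_lines(regex, lines):
    """Yield the lines that contain a match for regex (the lines egrep.py writes out)."""
    return (line for line in lines if re.search(regex, line))


def count_lines(lines):
    """Count the items in any iterable of lines (a file, sys.stdin, a generator)."""
    return sum(1 for _ in lines)


# Hypothetical sample input standing in for SomeFile.txt
sample = io.StringIO("room 101\nno digits here\n42\n")
print(count_lines(matching_lines("[0-9]", sample)))  # prints 2
```

A standalone line_count.py would presumably just call print(count_lines(sys.stdin)) after importing sys.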

Here's a little script that prints the most commonly occurring words in its input:

In [1]:
import sys
from collections import Counter

try:
    num_words = int(sys.argv[1])
except (IndexError, ValueError):
    # the argument is missing or is not an integer
    print("usage: most_common_words.py num_words")
    sys.exit(1)

counter = Counter(word.lower()
                  for line in sys.stdin
                  for word in line.strip().split()
                  if  word)

# print each count followed by its word, separated by a tab
for word, count in counter.most_common(num_words):
    sys.stdout.write(str(count))
    sys.stdout.write("\t")
    sys.stdout.write(word)
    sys.stdout.write("\n")

After which you'd call it with:

In [None]:
type the_bible.txt | python most_common_words.py 10

This is a simple natural language processing technique for finding word frequencies in a document. You can take the analysis further by engineering features from these counts, for example "Term Frequency - Inverse Document Frequency" (TF-IDF), which down-weights words that appear in many documents and so puts more emphasis on rarer words.
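To make the TF-IDF idea concrete, here is a minimal sketch in Python 3. It assumes one common formulation (term frequency as a word's share of the document, inverse document frequency as log(N / number of documents containing the word)); other variants exist. The tf_idf function and the three toy documents are made up for illustration and are not from the book:

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document.

    tf  = occurrences of the word in the document / words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # count each word once per document it appears in
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words)
        scores.append({word: (count / total) * math.log(n_docs / doc_freq[word])
                       for word, count in counts.items()})
    return scores


# Toy corpus (hypothetical): "the" appears everywhere, "cat" in one document only
docs = ["the cat sat", "the dog sat", "the bird flew"]
scores = tf_idf(docs)
print(scores[0])
```

In the first document, "the" scores zero because it occurs in every document, while "cat" scores higher than "sat" because it occurs in fewer documents.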

To continue, the textbook covers some simple CSV reading and processing in Python. I don't feel it's necessary to write that up here, so I'm going to skip straight to the web scraping portion.

HTML and the Parsing Thereof:

In [None]:
# This is just example HTML, not meant for running.
<html>
    <head>
        <title>A web page</title>
    </head>
    <body>
        <p id="author">Joel Grus</p>
        <p id="subject">Data Science</p>
    </body>
</html>

First, we'll import the necessary tools, then fetch a page and parse it:

In [5]:
from bs4 import BeautifulSoup
import requests
html = requests.get("http://www.example.com").text
soup = BeautifulSoup(html, 'html5lib')
print(soup)

<!DOCTYPE html>
<html><head>
    <title>Example Domain</title>

    <meta charset="utf-8"/>
    <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
    <meta content="width=device-width, initial-scale=1" name="viewport"/>
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain i

In [7]:
# Extracting the first paragraph:
first_paragraph = soup.find('p')
print(first_paragraph)

<p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>


In [10]:
# Extracting just the text contents:
first_paragraph_text = soup.p.text
first_paragraph_word = soup.p.text.split()
print("Just the texts: " + str(first_paragraph_text) + "\n")
print("Just the words: " + str(first_paragraph_word))

Just the texts: This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.

Just the words: [u'This', u'domain', u'is', u'established', u'to', u'be', u'used', u'for', u'illustrative', u'examples', u'in', u'documents.', u'You', u'may', u'use', u'this', u'domain', u'in', u'examples', u'without', u'prior', u'coordination', u'or', u'asking', u'for', u'permission.']


In [16]:
# Extracting a tag's attributes by treating the tag like a dict
first_paragraph_id   = soup.p.get('id') # returns None if there is no 'id'
#first_paragraph_id2  = soup.p['id']     # raises KeyError if there is no 'id'
print("First paragraph id: " + str(first_paragraph_id))

First paragraph id: None


In [18]:
# Extracting multiple tags:
all_paragraphs      = soup.find_all('p')
paragraphs_with_ids = [p for p in soup('p') if p.get('id')]  # keep only the <p> tags that have an id
print("All paragraphs: " + str(all_paragraphs))
print("IDs: " + str(paragraphs_with_ids))

All paragraphs: [<p>This domain is established to be used for illustrative examples in documents. You may use this\n    domain in examples without prior coordination or asking for permission.</p>, <p><a href="http://www.iana.org/domains/example">More information...</a></p>]
IDs: []


In [None]:
# Extracting tags with a specific class:
important_paragraphs  = soup('p', {'class' : 'important'})
important_paragraphs2 = soup('p', 'important')
important_paragraphs3 = [p for p in soup('p')
                         if 'important' in p.get('class', [])]
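Since example.com has no ids or classes to find, here is a small sketch that runs the same lookups against the example HTML from earlier. Two things are my own assumptions: I added class="important" to the second paragraph so the class lookup has something to match, and I used Python's built-in 'html.parser' instead of 'html5lib' to avoid the extra dependency. The variable names are placeholders:

```python
from bs4 import BeautifulSoup

# The example page from above; class="important" is added here for illustration
EXAMPLE_HTML = """
<html>
    <head>
        <title>A web page</title>
    </head>
    <body>
        <p id="author">Joel Grus</p>
        <p id="subject" class="important">Data Science</p>
    </body>
</html>
"""

soup = BeautifulSoup(EXAMPLE_HTML, "html.parser")

# ids of every <p> that has one
paragraph_ids = [p.get("id") for p in soup("p") if p.get("id")]

# text of every <p> whose class list includes 'important'
important_texts = [p.text for p in soup("p", "important")]

print(paragraph_ids)    # ['author', 'subject']
print(important_texts)  # ['Data Science']
```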

A treatise in Analysis:

This is an example, taken from the book, of scraping data from O'Reilly's site for analysis:

In [21]:
url  = "http://shop.oreilley.com/category/browse-subjects/" + \
       "data.do?sortby=publicationDate&page=1"
soup = BeautifulSoup(requests.get(url).text, 'html5lib')
print(soup)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="text/javascript" http-equiv="Content-Script-Type"/>
<script type="text/javascript">
function getCookie(c_name) { // Local function for getting a cookie value
    if (document.cookie.length > 0) {
        c_start = document.cookie.indexOf(c_name + "=");
        if (c_start!=-1) {
        c_start=c_start + c_name.length + 1;
        c_end=document.cookie.indexOf(";", c_start);

        if (c_end==-1) 
            c_end = document.cookie.length;

        return unescape(document.cookie.substring(c_start,c_end));
        }
    }
    return "";
}
function setCookie(c_name, value, expiredays) { // Local function for setting a value of a cookie
    var exdate = new Date();
    exdate.setDate(exdate.getDate()+expiredays);
    document.cookie = c_name + "=" + escape(value) + ((expiredays==null) ? "" : ";expire

So what we've done at this point 