# Chapter 9 - Getting Data

We will spend a large amount of time gathering, cleaning, and transforming data. This chapter is a basic Python tutorial on pulling data in: reading files, parsing, good heuristics and functions to use, that sort of thing.

In [1]:
# egrep.py: stdin and stdout for using Python on the command line like a pro (or plebeian)
# the regex comes from the first command-line argument

import sys, re

regex = sys.argv[1]
for line in sys.stdin:
    if re.search(regex, line):
        sys.stdout.write(line)
        
# here's one (line_count.py) that counts the lines and spits out the count
count = 0
for line in sys.stdin:
    count += 1
print(count)

# how many lines contain numbers? In a windows command line:
# type SomeFile.txt | python egrep.py "[0-9]" | python line_count.py
# not sure why the Windows pipe command looks like a forward slash in jupyter but oh well
# You can put python in the path to avoid typing it. 

#You could build some pretty elaborate data pipelines this way...

0
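
To try the logic of egrep.py and line_count.py without a shell, the same steps can be wrapped in functions that accept any iterable of lines. This is only a sketch: `grep_lines` and `count_lines` are names made up here, and `io.StringIO` stands in for `sys.stdin`.

```python
import io
import re

def grep_lines(pattern, lines):
    """Yield only the lines that match the regex (what egrep.py does)."""
    for line in lines:
        if re.search(pattern, line):
            yield line

def count_lines(lines):
    """Count the lines in any iterable (what line_count.py does)."""
    return sum(1 for _ in lines)

# io.StringIO stands in for sys.stdin so this runs anywhere
fake_stdin = io.StringIO("no digits here\nroom 101\n42\nstill nothing\n")
digit_line_count = count_lines(grep_lines("[0-9]", fake_stdin))
print(digit_line_count)
```

Chaining the two generators mirrors the shell pipe: each stage only ever sees one line at a time.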


In [2]:
# Script that counts the words in its input and writes out the most common ones
from collections import Counter
try:
    num_words = int(sys.argv[1])
except (IndexError, ValueError):  # missing or non-integer argument
    print("usage: most_common_words.py num_words")
    sys.exit(1)  # non-zero exit code signals an error

counter = Counter(word.lower() for line in sys.stdin for word in line.strip().split() if word)

for word, count in counter.most_common(num_words):
    sys.stdout.write(str(count))
    sys.stdout.write("\t")
    sys.stdout.write(word)
    sys.stdout.write("\n")
    
# after which we could run: type the_bible.txt | python most_common_words.py 10

usage: most_common_words.py num_words


SystemExit: 1

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
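
The script above exits inside a notebook because `sys.argv[1]` is not a number there. A sketch of the same counting logic as a plain function, which can be called from a notebook; `most_common_words` as a function name and the sample lines are assumptions for illustration.

```python
from collections import Counter

def most_common_words(lines, num_words):
    """Return the num_words most frequent lowercased words as (word, count) pairs."""
    counter = Counter(word.lower()
                      for line in lines
                      for word in line.strip().split())
    return counter.most_common(num_words)

sample_lines = ["The cat saw the dog", "the dog ran"]
top_two = most_common_words(sample_lines, 2)
for word, count in top_two:
    print(count, word, sep="\t")
```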


In [3]:
# The basics of text files
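# Reading files with open (I'm typically reading in spreadsheet data with pandas, but let's do the from-scratch thing)

file_for_reading = open('reading_file.txt', 'r')     # 'r' is read-only
file_for_writing = open('writing_file.txt', 'w')     # 'w' is write: it wipes the file if it already exists
file_for_appending = open('append_file.txt', 'a')    # 'a' is append: it adds to the end of the file
file_for_writing.close()                             # files opened this way must be closed by hand

# a with block closes the file automatically
# (filename, function_that_gets_data_from and process are placeholders)
with open(filename, 'r') as f:
    data = function_that_gets_data_from(f)
    
process(data)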

starts_with_hash = 0

with open('input.txt', 'r') as f:
    for line in f:
        if re.match("^#", line):
            starts_with_hash += 1
            
            #I'll probably never do this but good to know
            
# maybe we want some email addresses and we want just the domain

def get_domain(email_address):
    return email_address.lower().split("@")[-1]

with open('email_address.txt', 'r') as f:
    domain_counts = Counter(get_domain(line.strip()) for line in f if "@" in line)
    


FileNotFoundError: [Errno 2] No such file or directory: 'reading_file.txt'
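
Since none of those files exist here, this is a small sketch of the domain-counting step on an in-memory list. The addresses are made up, and the list stands in for the lines of a file.

```python
from collections import Counter

def get_domain(email_address):
    """Lowercase the address and keep everything after the last '@'."""
    return email_address.lower().split("@")[-1]

# made-up lines standing in for the contents of a file
lines = ["joel@example.com\n", "not an address\n",
         "Kim@Example.com\n", "sam@mail.example.org\n"]

# skip lines without an '@', strip the newline, then count by domain
domain_counts = Counter(get_domain(line.strip()) for line in lines if "@" in line)
print(domain_counts.most_common(1))
```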

Splitting each line by hand is fragile when we don't know what's in the file: fields can contain commas, tabs, or even newlines. Most of the time it's a CSV anyway, so here he goes through the csv module. I would just use pandas (which he recommends). I might humor him or I might give up halfway through. We'll see.

In [6]:
import csv

# The book opens the file in binary mode ('rb'), which is a Python 2 habit.
# In Python 3 the csv module wants text mode with newline=''.
with open('tab_delimited_stock_prices.txt', 'r', newline='') as f:
    reader = csv.DictReader(f, delimiter='\t')  # the file is tab-delimited, so split on tabs
    for row in reader:
        date = row['date']
        symbol = row['symbol']
        closing_price = float(row['closing_price'])
        process(date, symbol, closing_price)  # process is a placeholder

# even if you don't have headers you can still use DictReader by passing the keys as fieldnames.



FileNotFoundError: [Errno 2] No such file or directory: 'tab_delimited_stock_prices.txt'
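
The remark about files without headers can be tried without any file on disk. A minimal sketch, assuming made-up tab-delimited rows and column names chosen here to match the cell above:

```python
import csv
import io

# made-up, header-less, tab-delimited rows standing in for a real file
raw = "6/20/2014\tAAPL\t90.91\n6/20/2014\tMSFT\t41.68\n"

# fieldnames supplies the keys that a header row would normally provide
reader = csv.DictReader(io.StringIO(raw, newline=""),
                        fieldnames=["date", "symbol", "closing_price"],
                        delimiter="\t")
rows = [{"date": r["date"],
         "symbol": r["symbol"],
         "closing_price": float(r["closing_price"])}
        for r in reader]
print(rows[0])
```

Every value comes back as a string, which is why the price is converted with `float` explicitly.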

I'm going to call it here because the next part is more interesting.

In [19]:
# Scraping the web. We use Beautiful Soup. Pseudocode upcoming
from bs4 import BeautifulSoup
import requests
import html5lib  # not used directly; importing it confirms the 'html5lib' parser is installed

html = requests.get('https://www.wikipedia.com').text
soup = BeautifulSoup(html, 'html5lib')

# finding the first <p> tag

first_paragraph = soup.find('p')
first_paragraph_text = soup.p.text
first_paragraph_words = soup.p.text.split()

first_paragraph_id = soup.p['id'] # raises KeyError if no id
first_paragraph_id2 = soup.p.get('id') # returns None if no id
all_paragraphs = soup.find_all('p')
paragraphs_with_ids = [p for p in soup('p') if p.get('id')]

important_paragraphs = soup('p', {'class' : 'important'})
important_paragraphs2 = soup('p', 'important')
important_paragraphs3 = [p for p in soup('p') if 'important' in p.get('class', [])]

# can keep going with this type of logic. Ex. we want every <span> element that is contained inside a <div>
spans_in_divs = [span for div in soup('div') for span in div('span')] # lol 

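
Because the cell above depends on the network, here is an offline sketch of the "important paragraphs" idea. It deliberately swaps Beautiful Soup for the standard library's `html.parser.HTMLParser` so it runs with no installs. The HTML snippet and the `ParagraphCollector` class are invented for illustration, and Beautiful Soup's `soup('p', 'important')` remains the shorter way to do this.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text and classes of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []    # finished paragraphs
        self._current = None    # the <p> we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            classes = (dict(attrs).get("class") or "").split()
            self._current = {"classes": classes, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "p" and self._current is not None:
            self.paragraphs.append(self._current)
            self._current = None

html = """<html><body>
<p id="intro">First paragraph.</p>
<p class="important">Read <b>this</b> one.</p>
<div><p class="important note">And this.</p></div>
</body></html>"""

collector = ParagraphCollector()
collector.feed(html)
important = [p["text"] for p in collector.paragraphs if "important" in p["classes"]]
print(important)
```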

In [23]:
# Gather O'Reilly books about data. First we check the robots.txt and the terms of the site for scraping

# Crawl-delay: 30
# Request-rate: 1/30

# hopefully this still works and I can access the URL he gives in the book

url = 'http://shop.oreilly.com/category/browse-subjects/data.do?sortby=publicationDate&page=1'
soup = BeautifulSoup(requests.get(url).text, 'html5lib')

It looks like O'Reilly has significantly changed how book browsing works, so it's no longer possible to parse the book data this way. They've moved to a subscription service, so the /category part of the URL doesn't even exist in the shop anymore. O'Reilly wants you to be all in. No thanks, not right now. I'm just going to read along and hopefully this section doesn't matter much for the rest of the book.
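
Checking robots.txt, which the cell above does by hand, can also be done with the standard library's `urllib.robotparser`. A minimal sketch, assuming a made-up robots.txt (the `Disallow` line and the example.com URLs are not from the real site) and a reasonably recent Python 3:

```python
from urllib.robotparser import RobotFileParser

# a made-up robots.txt using the two directives noted above
robots_txt = """\
User-agent: *
Crawl-delay: 30
Request-rate: 1/30
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())   # parse text directly, no network needed

delay = parser.crawl_delay("*")         # seconds to wait between requests
rate = parser.request_rate("*")         # named tuple with .requests and .seconds
allowed = parser.can_fetch("*", "http://example.com/category/data.do")
blocked = parser.can_fetch("*", "http://example.com/private/page")
print(delay, rate, allowed, blocked)
```

For a live site, `set_url(...)` followed by `read()` fetches the file instead of `parse(...)`; the value of `delay` is what would go into `sleep(...)` between page requests.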

In [24]:
# here is the whole chapter in a couple of functions that don't work (the site has changed)
import re
from time import sleep

from bs4 import BeautifulSoup
import requests

def book_info(td):
    """Given a <td> Tag for one book, pull out its details as a dict."""
    title = td.find('div', 'thumbheader').a.text
    by_author = td.find('div', 'AuthorName').text
    authors = [x.strip() for x in re.sub("^By ", "", by_author).split(",")]
    # isbn_link was never defined in my notes; assuming it is the href of the title link
    isbn_link = td.find('div', 'thumbheader').a.get('href')
    isbn = re.match(r"/product/(.*)\.do", isbn_link).groups()[0]
    date = td.find("span", "directorydate").text.strip()
    
    return {
        'title' : title,
        'authors' : authors,
        'isbn' : isbn,
        'date' : date
    }

base_url = 'http://shop.oreilly.com/category/browse-subjects/data.do?sortby=publicationDate&page='

books = []
NUM_PAGES = 31 # some number that was true at the time

for page_num in range(1, NUM_PAGES + 1):
    print('souping page', page_num, ",", len(books), "found so far")
    
    url = base_url + str(page_num)
    soup = BeautifulSoup(requests.get(url).text, 'html5lib')
    
    for td in soup('td', 'thumbtext'):
        if not is_video(td): # I never wrote is_video, so this call is just a placeholder
            books.append(book_info(td))
    
    sleep(30) # respecting the robots.txt

# there are many ways to scrape data and it is a bit of an art. He just goes on to plot the number of books, which I'll skip

## Using APIs! Finally something useful that I'd like to see done from scratch
No more scraping websites that don't exist anymore

In [31]:
# Intro to JSON. They look like dicts. Let's create one for this book


{ 'title' : "Data Science From Scratch",
 'author' : "Joel Grus",
 'publicationYear' : "2014",
 'topics' : ["data", "science", "data science"]
}

# parsing JSON using json

import json
serialized = """{ "title" : "Data Science From Scratch",
 "author" : "Joel Grus",
 "publicationYear" : "2014",
 "topics" : ["data", "science", "data science"]
}"""

deserialized = json.loads(serialized)
if "data science" in deserialized['topics']:
    print(deserialized)
    


{'title': 'Data Science From Scratch', 'author': 'Joel Grus', 'publicationYear': '2014', 'topics': ['data', 'science', 'data science']}
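
One detail worth a quick sketch: the Python dict at the top of the cell uses single quotes, which is fine for Python but is not valid JSON. The example below round-trips the same record with `json.dumps` and `json.loads` and shows that a single-quoted string is rejected.

```python
import json

book = {"title": "Data Science From Scratch",
        "author": "Joel Grus",
        "publicationYear": "2014",
        "topics": ["data", "science", "data science"]}

serialized = json.dumps(book, indent=1)   # dict -> JSON text (always double quotes)
round_tripped = json.loads(serialized)    # JSON text -> dict
print(round_tripped == book)

# JSON requires double quotes, so a Python-style literal fails to parse
try:
    json.loads("{'title': 'oops'}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False
print(single_quotes_ok)
```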


In [37]:
# Sometimes they hate us and only give xml, on which you can use Beautiful Soup

# Now we'll take a look at GitHub's API, which hopefully still lets you do stuff unauthenticated
endpoint = "https://api.github.com/users/kladar/repos" # my github instead of Joel's in the book

repos = json.loads(requests.get(endpoint).text)

from dateutil.parser import parse

dates = [parse(repo['created_at']) for repo in repos]
month_count = Counter(date.month for date in dates)
weekday_counts = Counter(date.weekday() for date in dates)

print(month_count, weekday_counts)

# get...languages?
last_5_repos = sorted(repos, key=lambda r: r['created_at'], reverse=True)[:5]

last_5_langs = [repo['language'] for repo in last_5_repos]
print(last_5_langs)
# jeez, I need to update the languages associated with my repos

Counter({3: 3, 12: 2, 5: 1, 2: 1, 6: 1}) Counter({4: 3, 0: 2, 5: 1, 1: 1, 6: 1})
['R', 'Jupyter Notebook', None, 'Matlab', None]
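
The counting above can be reproduced offline. This sketch assumes a few made-up records with the same `created_at` and `language` fields, and it swaps `dateutil` for the standard library's `datetime.strptime`, which works here because the timestamps share one fixed format.

```python
from collections import Counter
from datetime import datetime

# made-up records shaped like the fields used above
repos = [
    {"created_at": "2018-03-02T17:20:11Z", "language": "Python"},
    {"created_at": "2017-12-25T12:00:00Z", "language": None},
    {"created_at": "2018-03-05T09:00:00Z", "language": "R"},
]

def parse_created_at(stamp):
    """Parse a timestamp of the form 2018-03-02T17:20:11Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")

dates = [parse_created_at(repo["created_at"]) for repo in repos]
month_counts = Counter(d.month for d in dates)
weekday_counts = Counter(d.weekday() for d in dates)   # Monday is 0

# these timestamps sort correctly as plain strings, newest first
newest_first = sorted(repos, key=lambda r: r["created_at"], reverse=True)
langs = [r["language"] for r in newest_first if r["language"] is not None]
print(month_counts, weekday_counts, langs)
```

Filtering out `None` avoids the gaps seen in the real output when a repo has no detected language.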


Most of the time you don't need to work at this level of detail. For any API worth its salt you can use a library someone has already built to access it. Hopefully it's built well, but whatever it is, it's probably better than the from-scratch way. Sometimes you might just have to roll your own.

Lists of APIs with Python wrappers can be found at Python API and Python for Beginners. For a directory of web APIs in general, go to Programmable Web. Scraping is the last refuge of the data scientist.



In [59]:
# We'll try accessing the twitter API using twython and our credentials. Hopefully this doesn't destroy my will to live.

from twython import Twython

# load in our credentials from a file that is kept out of the repo (and never print them!)
with open('credentials.json') as f:
    credentials = json.load(f)
    cred_strings = list(credentials.values())  # relies on the order of the keys in the file
twitter = Twython(cred_strings[0], cred_strings[1])
# search for tweets containing data science

for status in twitter.search(q='"data science"')['statuses']:
    user = status['user']['screen_name']#.encode('utf-8')
    text = status['text']#.encode('utf-8')
    print(user, ':', text)
    # apparently Python 3 handles unicode more nicely so we can try without the encoding.
    # It works!

EcoInternet3 : #Trump's Environmental Agenda Could Cause 80000 More Deaths, According to EPA's Own Data: Science Alert https://t.co/WmJQQXyAJ0
BioDataScience : What other major should I choose to complement my data science major? I am considering economics or econometrics. W… https://t.co/Ez6UyNsSQz
couponcodes24 : RT @uninetprof: #unp Launches one of a kind  course on  "How to Build and Deploy Machine Learning, Deep Learning &amp; NLP Models with Dockers"…
Hax0r_g1rl : RT @MSFTImagine: Improve your #deeplearning skills with #Python libraries. Download this cheat sheet for Python in data science: https://t.…
Modern_Robot : Kid: I picked Yoshi because I liked him.
This guy: https://t.co/BHQOsTigHs
TwtSignin : RT @Ronald_vanLoon: 80+ Free #DataScience Books
by @DataScienceCtrl | 

Read full article here:
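
Relying on the order of `credentials.values()` is fragile, and printing the values puts secrets into the notebook output. A sketch of a safer pattern follows; the key names in `REQUIRED_KEYS` are an assumption about how the JSON file could be laid out, and `load_credentials` and `mask` are hypothetical helpers, not part of twython.

```python
import json

# assumed key names for the JSON file
REQUIRED_KEYS = ("app_key", "app_secret", "oauth_token", "oauth_token_secret")

def load_credentials(text):
    """Parse the JSON text and check that every expected key is present."""
    creds = json.loads(text)
    missing = [k for k in REQUIRED_KEYS if k not in creds]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return creds

def mask(secret, visible=4):
    """Show only the first few characters so output never holds the full secret."""
    return secret[:visible] + "*" * (len(secret) - visible)

# dummy values standing in for the contents of the real file
fake_file_contents = json.dumps({k: "dummy-" + k for k in REQUIRED_KEYS})
creds = load_credentials(fake_file_contents)
print({k: mask(v) for k, v in creds.items()})
```

With named keys, the client can be built from `creds["app_key"]` and `creds["app_secret"]` instead of positional indexes.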

In [63]:
# Now we get into the twitter firehose with the streamer

from twython import TwythonStreamer
tweets = []

class MyStreamer(TwythonStreamer):
    
    def on_success(self, data):
        # not every message is a tweet, so use .get() to avoid a KeyError
        if data.get('lang') == 'en':
            tweets.append(data)
            print('received tweet #', len(tweets))
        
        if len(tweets) >= 100: # he does 1000 in the book...
            self.disconnect() #run!
                  
    def on_error(self, status_code, data):
        print(status_code, data)
        self.disconnect()
                  
                  
stream = MyStreamer(cred_strings[0], cred_strings[1], cred_strings[2], cred_strings[3])

stream.statuses.filter(track='data')
# appending to a global variable is bad. It is known.

received tweet # 1
received tweet # 2
received tweet # 3
received tweet # 4
received tweet # 5
received tweet # 6
received tweet # 7
received tweet # 8
received tweet # 9
received tweet # 10
received tweet # 11
received tweet # 12
received tweet # 13
received tweet # 14
received tweet # 15
received tweet # 16
received tweet # 17
received tweet # 18
received tweet # 19
received tweet # 20
received tweet # 21
received tweet # 22
received tweet # 23
received tweet # 24
received tweet # 25
received tweet # 26
received tweet # 27
received tweet # 28
received tweet # 29
received tweet # 30
received tweet # 31
received tweet # 32
received tweet # 33
received tweet # 34
received tweet # 35
received tweet # 36
received tweet # 37
received tweet # 38
received tweet # 39
received tweet # 40
received tweet # 41
received tweet # 42
received tweet # 43
received tweet # 44
received tweet # 45
received tweet # 46
received tweet # 47
received tweet # 48
received tweet # 49
received tweet # 50
received 
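
On the point about appending to a global: one alternative is to keep the collected tweets on an object. The sketch below is independent of twython; `TweetCollector` and the sample messages are invented here, and in a real `TwythonStreamer` subclass the same state could live on `self`.

```python
class TweetCollector:
    """Holds collected tweets on the instance instead of in a module-level list."""

    def __init__(self, limit, lang="en"):
        self.limit = limit
        self.lang = lang
        self.tweets = []

    @property
    def done(self):
        return len(self.tweets) >= self.limit

    def on_success(self, data):
        """Keep the message if it is a tweet in our language; return True when full."""
        if not self.done and data.get("lang") == self.lang:
            self.tweets.append(data)
        return self.done

# made-up messages; the second has no 'lang' key, like a non-tweet notice
messages = [{"lang": "en", "text": "a"}, {"delete": {}}, {"lang": "fr", "text": "b"},
            {"lang": "en", "text": "c"}, {"lang": "en", "text": "d"}]

collector = TweetCollector(limit=2)
for message in messages:
    if collector.on_success(message):
        break   # in a real streamer this is where disconnect() would be called
print([t["text"] for t in collector.tweets])
```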

In [66]:
#wow that was long. 1000 is way too many. going back up and changing to 100

hashtags = Counter(hashtag['text'].lower() for tweet in tweets for hashtag in tweet["entities"]["hashtags"])
print(hashtags.most_common(5))

# if we were doing this for real we would save the tweets to a file or database and not have them in an in-mem list.


[('blockchain', 5), ('data', 4), ('datascience', 2), ('tech', 2), ('theresistance', 2)]
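
As a sketch of the "save the tweets to a file" idea, the example below appends each tweet as one line of JSON and then recounts hashtags from the file. The tweets are made up, the temp directory is only there so the example cleans up easily, and the one-document-per-line layout is a choice made here, not something from the book.

```python
import json
import os
import tempfile
from collections import Counter

# made-up tweets with the same nesting used above
tweets = [
    {"text": "x", "entities": {"hashtags": [{"text": "Data"}, {"text": "blockchain"}]}},
    {"text": "y", "entities": {"hashtags": [{"text": "data"}]}},
]

path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")
with open(path, "a", encoding="utf-8") as f:
    for tweet in tweets:
        f.write(json.dumps(tweet) + "\n")   # one JSON document per line

# read the file back and count hashtags exactly as before
with open(path, encoding="utf-8") as f:
    reloaded = [json.loads(line) for line in f]

hashtags = Counter(tag["text"].lower()
                   for tweet in reloaded
                   for tag in tweet["entities"]["hashtags"])
print(hashtags.most_common(1))
```

Appending line by line means a crash partway through a stream loses at most the tweet being written.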


## This concludes Chapter 9. 
What a great chapter. I feel I learned a lot about being a data scientist, and that blockchain is way too common a hashtag.

Of course Joel recommends Pandas at the end of this chapter. Which is what I use all damn day, baby.