# Chapter 9. Getting Data
## stdin and stdout
If you run your Python scripts at the command line, you can pipe data through them using sys.stdin and sys.stdout. For example, here is a script that reads in lines of text and spits back out the ones that match a regular expression:

In [None]:
# egrep.py
import sys, re

# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command line
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if it matches the regex, write it to stdout
    if re.search(regex, line):
        sys.stdout.write(line)

And here’s one that counts the lines it receives and then writes out the count:

In [None]:
# line_count.py
import sys

count = 0
for line in sys.stdin:
    count += 1

# print goes to sys.stdout
print(count)

You could then use these to count how many lines of a file contain numbers.
* Windows: type SomeFile.txt | python egrep.py "[0-9]" | python line_count.py
* Linux: cat SomeFile.txt | python egrep.py "[0-9]" | python line_count.py

The | is the pipe character, which means “use the output of the left command as the input
of the right command.”
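
To make the data flow of the pipe concrete, the following sketch runs the same two stages inside a single Python process, with an in-memory stream standing in for stdin. The helper names `matching_lines` and `count_lines` and the sample text are made up for illustration; they are not part of the scripts above.

```python
import io
import re

def matching_lines(lines, regex):
    """Yield only the lines that match regex (the job of egrep.py)."""
    for line in lines:
        if re.search(regex, line):
            yield line

def count_lines(lines):
    """Count the lines received (the job of line_count.py)."""
    return sum(1 for _ in lines)

# hypothetical stand-in for the contents of SomeFile.txt
fake_stdin = io.StringIO("no digits here\nline 2\nroom 101\n")

# the output of the left stage becomes the input of the right stage
digit_line_count = count_lines(matching_lines(fake_stdin, "[0-9]"))
print(digit_line_count)
```

Because each stage only iterates over lines, either one could be swapped for another filter without touching the other, which is the same property the shell pipe gives you.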

Similarly, here's a script that counts the words in its input and writes out the most common ones:

In [None]:
# most_common_words.py
import sys
from collections import Counter

# pass in number of words as first argument
try:
    num_words = int(sys.argv[1])
except (IndexError, ValueError):  # argument missing or not an integer
    print("usage: most_common_words.py num_words")
    sys.exit(1) # non-zero exit code indicates error

counter = Counter(word.lower()                         # lowercase words
                  for line in sys.stdin                # for each line of input
                  for word in line.strip().split()     # split on spaces
                  if word)                             # skip empty 'words'

for word, count in counter.most_common(num_words):
    sys.stdout.write(str(count))
    sys.stdout.write("\t")
    sys.stdout.write(word)
    sys.stdout.write("\n")

After that, you could do something like: <br>
cat the_bible.txt | python most_common_words.py 10 <br>
<img src='https://i.imgur.com/e1XkiGo.jpg' width='100px' style='float:left'>

## Reading Files
### The Basics of Text Files
The first step to working with a text file is to obtain a *file object* using **open()**:

In [None]:
# 'r' means read-only
file_for_reading = open('reading_file.txt', 'r')

# 'w' is write—will destroy the file if it already exists!
file_for_writing = open('writing_file.txt', 'w')

# 'a' is append—for adding to the end of the file
file_for_appending = open('appending_file.txt', 'a')

# don't forget to close your files when you're done
file_for_writing.close()

Because it is easy to forget to close your files, you should always use them in a **with** block, at the end of which they will be closed automatically:

In [1]:
with open(filename, 'r') as f:
    data = function_that_gets_data_from(f)

# at this point f has already been closed, so don't try to use it
process(data)

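
The claim that the file is closed when the **with** block ends can be checked through the file object's `closed` attribute. This is a minimal sketch that creates its own throwaway file; the path and the file contents are assumptions made only so the example runs on its own.

```python
import os
import tempfile

# create a throwaway file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("hello\n")

with open(path, "r") as f:
    data = f.read()
    still_open = not f.closed   # the file is open inside the block

closed_after = f.closed         # the file is closed once the block exits
print(still_open, closed_after)
```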

If you need to read a whole text file, you can just iterate over the lines of the file using **for**:

In [None]:
import re
starts_with_hash = 0

with open('the_bible.txt', 'r') as file:
    for line in file:  # look at each line in the file
        if re.match("^#", line):  # use a regex to see if it starts with '#'
            starts_with_hash += 1  # if it does, add 1 to the count

print(starts_with_hash)

Every line you get this way ends in a newline character, so you’ll often want to **strip()** it before doing anything with it.

For example, imagine you have a file full of email addresses, one per line, and that you need to generate a histogram of the domains. The rules for correctly extracting domains are somewhat subtle (e.g., the Public Suffix List), but a good first approximation is to just take the parts of the email addresses that come after the @. (Which gives the wrong answer for email addresses like joel@mail.datasciencester.com.)

In [None]:
from collections import Counter

def get_domain(email_address):
    """split on '@' and return the last piece"""
    return email_address.lower().split("@")[-1]
# For example, "abdul_rainbolt1@aol.com".split("@")
# abdul_rainbolt1 — [0]  aol.com — [1]
# [-1] — first from the end, which is the same as [1] in this case.

with open('email_addresses.txt', 'r') as f:
    domain_counts = Counter(get_domain(line.strip())
                            for line in f
                            if "@" in line)
print(domain_counts)
# Output will be something like this:
# Counter({'aol.com': 61, 'outlook.com': 61, 'hotmail.com': 49, 'yahoo.com': 45})

# Without line.strip(), the trailing newline stays attached to each domain:
# Counter({'aol.com\n': 61, 'outlook.com\n': 61, 'hotmail.com\n': 48, 'yahoo.com\n': 45})
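
The effect of **strip()** can be reproduced without a file on disk. The sketch below feeds a few made-up addresses through an in-memory stream, once with stripping and once without; the addresses and the name `fake_file` are hypothetical.

```python
import io
from collections import Counter

def get_domain(email_address):
    """split on '@' and return the last piece"""
    return email_address.lower().split("@")[-1]

# hypothetical addresses; the last line has no trailing newline
fake_file = io.StringIO("a@aol.com\nb@AOL.com\nnot an address\nc@yahoo.com")

stripped = Counter(get_domain(line.strip())
                   for line in fake_file
                   if "@" in line)

fake_file.seek(0)  # rewind so the same lines can be read again
unstripped = Counter(get_domain(line)
                     for line in fake_file
                     if "@" in line)

print(stripped)
print(unstripped)
```

Without stripping, a final line that lacks a newline is counted under a different key from the rest of its domain, so totals can silently split across two entries.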

## Delimited Files. CSV
The hypothetical email addresses file we just processed had one address per line. More frequently you’ll work with files with lots of data on each line. These files are very often either *comma-separated* or *tab-separated*. Each line has several fields, with a comma (or a tab) indicating where one field ends and the next field starts.

This starts to get complicated when you have fields with commas and tabs and newlines in them (which you inevitably do). For this reason, it’s pretty much always a mistake to try to parse them yourself. Instead, you should use Python’s csv module (or the pandas library).
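
As a small illustration of why hand parsing goes wrong, the sketch below splits one quoted record on commas and then parses the same record with the csv module. The sample record is made up for the example.

```python
import csv
import io

# one record whose middle field contains a comma, quoted as CSV requires
raw = 'test2,"success, kind of",Tuesday\n'

naive_fields = raw.strip().split(",")            # also splits inside the quotes
csv_fields = next(csv.reader(io.StringIO(raw)))  # respects the quotes

print(naive_fields)
print(csv_fields)
```

The naive split yields four pieces with stray quote characters, while the csv module returns the three fields that were intended.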

In [3]:
import csv

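For technical reasons that you should feel free to blame on Microsoft, in Python 2 you should always work with *csv* files in *binary mode* by including a **b** after the **r** or **w** (see Stack Overflow). In Python 3, which the examples here use, open the file in text mode instead and pass **newline=''** to **open()**, as the csv module documentation recommends. On Ubuntu, leaving this out causes no visible problems.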

If your file has no headers (which means you probably want each row as a *list*, and which places the burden on you to know what’s in each column), you can use **csv.reader()** to iterate over the rows, each of which will be an appropriately split list.

For example, if we had a comma-delimited file of stock prices: <br>
6/20/2014,AAPL,90.91 <br>
6/20/2014,MSFT,41.68 <br>
6/20/2014,FB,64.5 <br>
6/19/2014,AAPL,91.86 <br>
6/19/2014,MSFT,41.51 <br>
6/19/2014,FB,4.34 <br>

In [4]:
with open('comma_delimited_stock_prices.txt', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])
        print(date, symbol, closing_price)



If your file has headers: <br>
date:symbol:closing_price <br>
6/20/2014:AAPL:90.91 <br>
6/20/2014:MSFT:41.68 <br> 
6/20/2014:FB:64.5 <br>

you can either skip the header row (with an initial call to **next(reader)**) or get each row as a dict (with the headers as keys) by using **csv.DictReader()**:

In [None]:
with open('colon_delimited_stock_prices.txt', 'r') as f:
    reader = csv.DictReader(f, delimiter=':')
    for row in reader:
        date = row["date"]
        symbol = row["symbol"]
        closing_price = float(row["closing_price"])
        print(date, symbol, closing_price)
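
The other option, skipping the header row by hand, might look like the following sketch. In Python 3 the row is consumed with the built-in `next()`. An in-memory stream stands in for colon_delimited_stock_prices.txt so the example runs without the file; the name `fake_file` is an assumption.

```python
import csv
import io

# in-memory stand-in for colon_delimited_stock_prices.txt
fake_file = io.StringIO(
    "date:symbol:closing_price\n"
    "6/20/2014:AAPL:90.91\n"
    "6/20/2014:MSFT:41.68\n"
    "6/20/2014:FB:64.5\n"
)

reader = csv.reader(fake_file, delimiter=':')
header = next(reader)  # consume the header row before the loop
rows = [(date, symbol, float(price)) for date, symbol, price in reader]

print(header)
print(rows)
```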

Even if your file doesn’t have headers you can still use **DictReader()** by passing it the keys as a **fieldnames** parameter.

In [None]:
with open('comma_delimited_stock_prices.txt', 'r') as f:
    fieldnames = ['date', 'symbol', 'closing_price']
    reader = csv.DictReader(f, delimiter=',', fieldnames=fieldnames)
    for row in reader:
        date = row["date"]
        symbol = row["symbol"]
        closing_price = float(row["closing_price"])
        print(date, symbol, closing_price)

You can similarly write out delimited data using **csv.writer()**:

In [None]:
today_prices = {'AAPL': 90.91, 'MSFT': 41.68, 'FB': 64.5}

with open('comma_delimited_stock_prices2.txt', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for stock, price in today_prices.items():
        writer.writerow([stock, price])

**csv.writer()** will do the right thing if your fields themselves have commas in them. Your own hand-rolled writer probably won’t. For example, if you attempt:

In [None]:
results = [["test1","success", "Monday"],
           ["test2","success, kind of", "Tuesday"],
           ["test3","failure, kind of", "Wednesday"],
           ["test4","failure, utter", "Thursday"]]


# don't do this!
with open('bad_csv.txt', 'w') as f:
    for row in results:
        f.write(",".join(map(str, row)))  # might have too many commas in it!
        f.write("\n")  # row might have newlines as well!

You will end up with a csv file that looks like: <br> <br>
test1,success,Monday <br>
test2,success, kind of,Tuesday <br>
test3,failure, kind of,Wednesday <br>
test4,failure, utter,Thursday <br>

and that no one will ever be able to make sense of.
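
For contrast, here is a sketch of the same rows written with **csv.writer()**. It writes to an in-memory buffer rather than a real file (an assumption made so the example is self-contained), then reads the text back to confirm that each row still has three fields.

```python
import csv
import io

results = [["test1", "success", "Monday"],
           ["test2", "success, kind of", "Tuesday"],
           ["test3", "failure, kind of", "Wednesday"],
           ["test4", "failure, utter", "Thursday"]]

buffer = io.StringIO()  # in-memory stand-in for an output file
writer = csv.writer(buffer, delimiter=',')
writer.writerows(results)  # fields containing commas get quoted
written = buffer.getvalue()
print(written)

# reading the text back recovers the original three fields per row
buffer.seek(0)
round_tripped = list(csv.reader(buffer))
```

The fields that contain commas come out wrapped in double quotes, which is what lets a reader tell field separators apart from commas inside the data.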