# Agenda

1. Files (reading from them, and writing to them)
2. Comprehensions
3. Passing functions as arguments 

# Files

We're going to discuss reading/writing plain-text files.  

If/when we want to read from an existing file (or write to a new one), we cannot do it ourselves. We'll need an agent to do it on our behalf.  In the programming world, we often call such agents "file handles." In the Python world, we use "file objects," meaning that we get back a file object from the operating system (via Python), and then we read/write/manipulate our file using that file object.

Actually, as of Python 3, `open` can return several different types of objects, depending on how you open the file (for example, in text mode vs. binary mode). They are thus collectively known as "file-like objects."

Typically you can open a file for reading or for writing, but not both.

To open a file in Python, and get a file object back, we invoke the `open` function:

- The first argument is mandatory -- it's the name of the file, as a string
- The second argument is optional, telling Python whether you want to read from or write to the file. By default, we read from a file, which is the same as passing `'r'` as the second argument.  If you want to write to a file (and we'll talk more about this later), then you use `'w'` as the second argument.
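
Here's a minimal sketch of the two modes: it writes a couple of lines to a scratch file and then reads them back. The file name `example.txt` and the temporary directory are assumptions, made up only so that the example can run anywhere.

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory (illustration only)
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, 'example.txt')

f = open(path, 'w')        # 'w' creates the file, or empties it if it already exists
f.write('hello\n')
f.write('world\n')
f.close()                  # closing flushes the data to disk

f = open(path)             # no second argument, so the mode defaults to 'r'
contents = f.read()        # fine here, because we know the file is tiny
f.close()

print(contents)
```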

In [3]:
# I'm on a Unix machine (a Mac) which has a file called /etc/passwd -- containing all of the usernames
# on the system.  I love to play with this file...

f = open('/etc/passwd') # if you're on Windows, be sure to use a raw string, meaning: r before the opening '' 

# Use raw strings when working with Windows paths

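To avoid clashes between Python's interpretation of backslashes and Windows' use of backslashes as path separators, put an `r` before the opening quote. In a raw string, Python treats each backslash as a literal character rather than as the start of an escape sequence, so you get the same string as if you had doubled every backslash yourself: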

```python
path = r'c:\Users\abcd\efgh\ijkl.txt'
```
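
To see what the `r` prefix does, here's a small sketch comparing a raw string with a regular string in which the backslashes were doubled by hand. The path is the same made-up one as above.

```python
raw_path = r'c:\Users\abcd\efgh\ijkl.txt'         # raw string: backslashes are kept as-is
escaped_path = 'c:\\Users\\abcd\\efgh\\ijkl.txt'  # regular string: each backslash doubled by hand

print(raw_path == escaped_path)   # both literals describe the same string
print(len(r'\n'), len('\n'))      # raw: backslash + n (2 characters); regular: one newline character
```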

In [4]:
type(f)

_io.TextIOWrapper

In [5]:
# what is the printed representation of my file object?

f

<_io.TextIOWrapper name='/etc/passwd' mode='r' encoding='UTF-8'>

In [6]:
# how can I read the contents of the file into Python?

# Option 1 (a bad one): read everything from the file into a Python string

s = f.read()

In [7]:
print(s) # this will now print the contents of the file

##
# User Database
# 
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33

# Why not use `f.read()`?

Answer: You don't know how big the file is. This reads the entirety of the file into memory, creating a string. If that file is 2TB in size, Python will try (and most likely fail) to read everything in and create a string.

You can give `f.read()` an argument, the number of characters to read, but that's kind of annoying.
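
As a rough sketch of what reading with an argument looks like, the following reads four characters at a time until nothing is left. It uses `io.StringIO` as a stand-in for a real file (an assumption, so that the example is self-contained); the chunk size of 4 is arbitrary.

```python
import io

# io.StringIO stands in for a real text file here (illustration only);
# it supports the same read() method as the object returned by open().
f = io.StringIO('abcdefghij')

chunks = []
while True:
    chunk = f.read(4)      # read at most 4 characters
    if not chunk:          # an empty string means we've reached the end
        break
    chunks.append(chunk)

print(chunks)
```

You have to manage the loop and the end-of-file check yourself, which is why this is more annoying than it looks.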

# Better: Iterate over the file

This is the standard way to read a file in Python. When you iterate:

- over a string, you get the characters
- over a list or tuple, you get the elements
- over a dict, you get the keys
- over a file, you get the lines -- one string at a time, each string ending with `'\n'`

This way, memory is only allocated for the current line, and the odds that a single line will be too large to fit in memory are pretty small.
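
The list above can be sketched in a few lines. Passing each object to `list` iterates over it and collects what we get back. The `io.StringIO` object is an assumption standing in for a real file, so that the example doesn't depend on any file on disk.

```python
import io

print(list('abc'))                     # string: we get the characters
print(list([10, 20, 30]))              # list: we get the elements
print(list({'a': 1, 'b': 2}))          # dict: we get the keys

# Stand-in for a file object (illustration only)
fake_file = io.StringIO('first line\nsecond line\n')
lines = list(fake_file)                # file: we get the lines, each ending with '\n'
print(lines)
```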

In [8]:
f = open('/etc/passwd')  # create the file object

for one_line in f:       # read one line at a time into one_line
    print(one_line)      # print each line

##

# User Database

# 

# Note that this file is consulted directly only when the system is running

# in single-user mode.  At other times this information is provided by

# Open Directory.

#

# See the opendirectoryd(8) man page for additional information about

# Open Directory.

##

nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false

root:*:0:0:System Administrator:/var/root:/bin/sh

daemon:*:1:1:System Services:/var/root:/usr/bin/false

_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico

_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false

_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false

_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false

_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false

_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false

_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false

_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/fal

In [9]:
# let's run "strip" on each string we get, thus removing whitespace from both sides,
# including \n at the end of the string

f = open('/etc/passwd')  # create the file object

for one_line in f:               # read one line at a time into one_line
    print(one_line.strip())      # print each line

##
# User Database
#
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33:

In [10]:
# we can combine this onto one line

for one_line in open('/etc/passwd'):    # when we exit the loop, there will be no references to our file...
    print(one_line.strip())             # ... so CPython will close it automatically.

##
# User Database
#
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33:
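
Automatic closing when the last reference disappears depends on how the Python implementation manages memory. If you'd rather not rely on that, one common alternative is the `with` statement, which closes the file as soon as its block ends. Here's a sketch; the scratch file `users.txt` and its contents are made up so that the example is self-contained.

```python
import os
import tempfile

# Hypothetical scratch file, created only so the example can run anywhere
path = os.path.join(tempfile.mkdtemp(), 'users.txt')
with open(path, 'w') as outfile:
    outfile.write('root:*:0:0\nnobody:*:-2:-2\n')

stripped_lines = []
with open(path) as infile:             # the file is closed when the block ends
    for one_line in infile:
        stripped_lines.append(one_line.strip())

print(stripped_lines)
print(infile.closed)                   # the variable still exists, but the file is closed
```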

In [11]:
f = open('/etc/passwd')
s1 = f.read()
s2 = f.read()

In [12]:
len(s1)

8160

In [13]:
len(s2)

0

In [14]:
# if we want to read from a file again, after going through it the whole way (and having
# the bookmark at the end), we can invoke the "seek" method:

f.seek(0)   # move the bookmark to the start of the file, at character 0

s2 = f.read()

In [15]:
len(s2)

8160
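
The same behavior can be reproduced without `/etc/passwd`. This sketch uses `io.StringIO` as a stand-in for a file object (an assumption, for portability), with made-up contents.

```python
import io

f = io.StringIO('abc\ndef\n')   # stand-in for a file object (illustration only)

first = f.read()     # reads everything; the bookmark is now at the end
second = f.read()    # nothing left to read, so we get an empty string
f.seek(0)            # move the bookmark back to the start
third = f.read()     # the whole content again

print(len(first), len(second), len(third))
```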

In [16]:
# one example of how we can work with files
# let's say I want to print the usernames in /etc/passwd.  How can I do that?

for one_line in open('/etc/passwd'):
    print(one_line.strip())

##
# User Database
#
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33:

In [20]:
# option 1: get each line, up to the first colon

for one_line in open('/etc/passwd'):
    if one_line[0] != '#':
        first_colon_at = one_line.index(':')  # get the numerical index / location
        print(one_line[:first_colon_at])         # get a slice from one_line, up to that location

nobody
root
daemon
_uucp
_taskgated
_networkd
_installassistant
_lp
_postfix
_scsd
_ces
_appstore
_mcxalr
_appleevents
_geod
_devdocs
_sandbox
_mdnsresponder
_ard
_www
_eppc
_cvs
_svn
_mysql
_sshd
_qtss
_cyrus
_mailman
_appserver
_clamav
_amavisd
_jabber
_appowner
_windowserver
_spotlight
_tokend
_securityagent
_calendar
_teamsserver
_update_sharing
_installer
_atsserver
_ftp
_unknown
_softwareupdate
_coreaudiod
_screensaver
_locationd
_trustevaluationagent
_timezone
_lda
_cvmsroot
_usbmuxd
_dovecot
_dpaudio
_postgres
_krbtgt
_kadmin_admin
_kadmin_changepw
_devicemgr
_webauthserver
_netbios
_warmd
_dovenull
_netstatistics
_avbdeviced
_krb_krbtgt
_krb_kadmin
_krb_changepw
_krb_kerberos
_krb_anonymous
_assetcache
_coremediaiod
_launchservicesd
_iconservices
_distnote
_nsurlsessiond
_displaypolicyd
_astris
_krbfast
_gamecontrollerd
_mbsetupuser
_ondemand
_xserverdocs
_wwwproxy
_mobileasset
_findmydevice
_datadetectors
_captiveagent
_ctkd
_applepay
_hidd
_cmiodalassistants
_analyticsd
_fps

In [22]:
# option 2: break each line into a list, and grab index 0

for one_line in open('/etc/passwd'):
    if one_line[0] != '#':
        print(one_line.split(':')[0])   # split returns a list of strings, based on a string

nobody
root
daemon
_uucp
_taskgated
_networkd
_installassistant
_lp
_postfix
_scsd
_ces
_appstore
_mcxalr
_appleevents
_geod
_devdocs
_sandbox
_mdnsresponder
_ard
_www
_eppc
_cvs
_svn
_mysql
_sshd
_qtss
_cyrus
_mailman
_appserver
_clamav
_amavisd
_jabber
_appowner
_windowserver
_spotlight
_tokend
_securityagent
_calendar
_teamsserver
_update_sharing
_installer
_atsserver
_ftp
_unknown
_softwareupdate
_coreaudiod
_screensaver
_locationd
_trustevaluationagent
_timezone
_lda
_cvmsroot
_usbmuxd
_dovecot
_dpaudio
_postgres
_krbtgt
_kadmin_admin
_kadmin_changepw
_devicemgr
_webauthserver
_netbios
_warmd
_dovenull
_netstatistics
_avbdeviced
_krb_krbtgt
_krb_kadmin
_krb_changepw
_krb_kerberos
_krb_anonymous
_assetcache
_coremediaiod
_launchservicesd
_iconservices
_distnote
_nsurlsessiond
_displaypolicyd
_astris
_krbfast
_gamecontrollerd
_mbsetupuser
_ondemand
_xserverdocs
_wwwproxy
_mobileasset
_findmydevice
_datadetectors
_captiveagent
_ctkd
_applepay
_hidd
_cmiodalassistants
_analyticsd
_fps
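
Both options assume that every non-comment line contains a colon. If a file might also contain blank lines or lines without a colon, one possible variation (a sketch, not the only way to do it) is `str.partition`, which never raises an exception. The sample lines below are made up, in the same format as `/etc/passwd`, and `io.StringIO` stands in for the file.

```python
import io

# A few sample lines in /etc/passwd format (made up for illustration)
sample = io.StringIO(
    '##\n'
    '# User Database\n'
    '\n'
    'root:*:0:0:System Administrator:/var/root:/bin/sh\n'
    'daemon:*:1:1:System Services:/var/root:/usr/bin/false\n'
)

usernames = []
for one_line in sample:
    if one_line.startswith('#') or not one_line.strip():
        continue                       # skip comments and blank lines
    username, colon, rest = one_line.partition(':')
    if colon:                          # only keep lines that actually contain a colon
        usernames.append(username)

print(usernames)
```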

# Exercise: Sum numbers

1. In my zipfile is a file called `nums.txt`. Each line of that file contains either one integer or no integers. There might be whitespace on one side of the integer or the other.
2. One line contains just whitespace.
3. Go through the file, one line at a time, and sum the numbers.  (Total is 83)

In [24]:
!cat nums.txt

5
	10     
	20
  	3
		   	20        

 25


In [27]:
total = 0
for one_line in open('nums.txt'):
    if one_line.strip():          # if we're left with an empty string after stripping, ignore
        total += int(one_line)    # if something is left, turn it into an int and add to total
    
print(total)    

83


In [28]:
total = 0
for one_line in open('nums.txt'):
    if one_line.strip().isdigit():   # only if the stripped line consists entirely of digits...
        total += int(one_line)       # ... turn it into an int and add to total
    
print(total)    

83


In [29]:
for one_line in open('nums.txt'):
    total = 0                        # bug: total is reset to 0 on every iteration...
    if one_line.strip().isdigit():   # only if the stripped line consists entirely of digits...
        total += int(one_line)       # ... turn it into an int and add to total
    
print(total)                         # ... so we only see the number from the final line

25
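
One thing to keep in mind about the `isdigit` version: `'-3'.isdigit()` returns `False`, so negative numbers would be skipped silently. A possible alternative, sketched here with made-up sample data (not the contents of `nums.txt`), is to try the conversion and ignore the lines where it fails.

```python
import io

# Made-up sample data; note the negative number and the stray word
sample = io.StringIO('5\n\t10   \n\n  -3\nabc\n 25\n')

total = 0                          # initialized once, before the loop
for one_line in sample:
    try:
        total += int(one_line)     # int() ignores surrounding whitespace
    except ValueError:
        pass                       # blank or non-numeric line: skip it

print(total)
```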


# Exercise: `wc` -- word count

1. Unix comes with a `wc` command, which we can run on a file. It'll tell us:
    - The number of lines in the file (including blank lines)
    - The number of words in the file (assuming words are separated by whitespace)
    - The number of characters in the file (including whitespace, such as `' '` and `'\n'`)
2. I want you to write a program that implements this in Python.  Given a file (and you can use the text file I've provided, `wcfile.txt`), get all three of those statistics.
3. If you want, you can also add a fourth statistic, namely the number of *different* (or unique) words in the file.

In [30]:
!cat wcfile.txt

This is a test file.

It contains 28 words and 20 different words.

It also contains 165 characters.

It also contains 11 lines.

It is also self-referential.

Wow!


In [31]:
# if you're in Jupyter, you can run commands in your OS by putting ! and then the command
# at the front of a line

!wc wcfile.txt

 11  28 165 wcfile.txt


"whitespace" is a term that in Python refers to:

- `' '` (space character)
- `'\n'` (newline)
- `'\r'` (carriage return)
- `'\t'` (tab)
- `'\v'` (vertical tab)
- `'\f'` (form feed)

If you use `str.strip` without an argument, then it removes any or all of the above that it finds at either end of the string.



In [33]:
s = '   \t\t\t\n\n\r\ra   b   c    \t\t\t\v\v\v\n\n '

s.strip()

'a   b   c'

In [34]:
s.split()     # split without an argument uses runs of one or more whitespace characters as delimiters

['a', 'b', 'c']

In [37]:
lines = 0
characters = 0
words = 0

filename = 'wcfile.txt'

for one_line in open(filename):
    lines += 1
    characters += len(one_line)
    words += len(one_line.split())
    
print(f'{lines=}')    
print(f'{characters=}')
print(f'{words=}')

lines=11
characters=165
words=28
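
For the optional fourth statistic, one approach is to collect the words in a set, which keeps only one copy of each. This is a sketch using made-up text in an `io.StringIO` object (an assumption, standing in for the file), not the contents of `wcfile.txt`.

```python
import io

# Made-up text standing in for a file (not the contents of wcfile.txt)
sample = io.StringIO('This is a test.\n\nThis test is short.\n')

lines = 0
words = 0
characters = 0
unique_words = set()

for one_line in sample:
    lines += 1
    characters += len(one_line)
    line_words = one_line.split()
    words += len(line_words)
    unique_words.update(line_words)   # a set keeps only one copy of each word

print(f'{lines=}')
print(f'{characters=}')
print(f'{words=}')
print(f'{len(unique_words)=}')
```

Note that this treats `'test.'` and `'test'` as different words, because `split` doesn't remove punctuation, and `'This'` and `'this'` would also count separately.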


In [None]:
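lines = 0
words = 0
characters = 0

for one_line in open(filename):
    lines += 1
    for one_space in one_line:
        if one_space == " ":       # this counts space characters, not words...
            words += 1             # ... so the result generally won't match the word count
    for one_character in one_line:
        characters += 1            # same result as len(one_line), but slower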

print(f'lines = {lines}\nwords = {words}\ncharacters = {characters}')

In [39]:
one_line = 'this is a bunch of words'

one_line.count(' ')  # how many times does ' ' appear in this string?

5
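
Counting spaces and counting words aren't the same thing, as this short comparison shows. The second string is made up to show what happens with repeated spaces and a tab.

```python
one_line = 'this is a bunch of words'

print(one_line.count(' '))        # number of space characters
print(len(one_line.split()))      # number of words: one more than the spaces here

spaced = 'this   is\ta  bunch'    # repeated spaces and a tab between words
print(spaced.count(' '))          # counts every single space character
print(len(spaced.split()))        # split() treats each run of whitespace as one delimiter
```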

# Performance tips

1. Python stores each string's length as part of the string object, so `len` doesn't need to count the characters; it retrieves the length immediately. So invoking `len` on a string is super fast.  A `for` loop will take much longer.
2. In general, the built-in data structures' methods are written in C, and are typically going to be faster than code we write ourselves.

In Jupyter, we have a bunch of "magic commands" that start with `%`. They aren't passed along to Python, and allow us to try lots of different things. If you use `%timeit` followed by some code, it'll tell us how long that code takes to run, so we can compare alternatives and see which runs faster.

In [40]:
%timeit len(one_line)

25.2 ns ± 0.643 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


In [41]:
%%timeit 

total = 0
for one_character in one_line:
    total += 1

581 ns ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
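
If you're not in Jupyter, the standard library's `timeit` module can make a similar comparison. This is a rough sketch; the helper function `count_with_loop` and the choice of 100,000 runs are assumptions made for illustration, and the timings will differ from machine to machine.

```python
import timeit

one_line = 'this is a bunch of words'

def count_with_loop(text):
    """Count characters the slow way, one at a time (hypothetical helper)."""
    total = 0
    for one_character in text:
        total += 1
    return total

# number=100_000 is an arbitrary choice; timeit returns the total seconds for all runs
len_time = timeit.timeit(lambda: len(one_line), number=100_000)
loop_time = timeit.timeit(lambda: count_with_loop(one_line), number=100_000)

print(f'len:  {len_time:.4f} seconds')
print(f'loop: {loop_time:.4f} seconds')
```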
