# Agenda

1. Files (reading from them, and writing to them)
2. Comprehensions
3. Passing functions as arguments 

# Files

We're going to discuss reading/writing plain-text files.  

If/when we want to read from an existing file (or write to a new one), we cannot do it ourselves. We'll need an agent to do it on our behalf.  In the programming world, we often call such agents "file handles." In the Python world, we use "file objects," meaning that we get back a file object from the operating system (via Python), and then we read/write/manipulate our file using that file object.

Actually, as of Python 3, `open` can return one of several different types of objects, depending on how we open the file. They are thus often known collectively as "file-like objects."

Typically you can open a file for reading or for writing, but not both.

To open a file in Python, and get a file object back, we invoke the `open` function:

- The first argument is mandatory -- it's the name of the file, as a string
- The second argument is optional, telling Python whether you want to read from or write to the file. By default, we read from a file, which is the same as passing `'r'` as the second argument.  If you want to write to a file (and we'll talk more about this later), then you use `'w'` as the second argument.
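Here's a minimal sketch of both arguments in use. The file name `example.txt` and the temporary directory are assumptions made for illustration, so that the code doesn't touch any real file:

```python
import os
import tempfile

# hypothetical file in a throwaway directory, so nothing real is overwritten
filename = os.path.join(tempfile.mkdtemp(), 'example.txt')

f = open(filename, 'w')     # second argument 'w': open for writing
f.write('hello\n')
f.close()

f = open(filename)          # no second argument: same as open(filename, 'r')
default_mode = f.mode       # the file object remembers how it was opened
f.close()

f = open(filename, 'r')     # explicitly asking to read
explicit_mode = f.mode
f.close()

print(default_mode, explicit_mode)
```

Both file objects report the same mode, because leaving out the second argument and passing `'r'` mean the same thing.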

In [3]:
# I'm on a Unix machine (a Mac) which has a file called /etc/passwd -- containing all of the usernames
# on the system.  I love to play with this file...

f = open('/etc/passwd') # if you're on Windows, be sure to use a raw string, meaning: r before the opening '' 

# Use raw strings when working with Windows paths

To avoid clashes between Python's interpretation of backslashes and Windows' use of them in paths, put an `r` before the opening quote. In such a "raw string," Python treats each backslash as a literal character, just as if you had doubled it yourself:

```python
path = r'c:\Users\abcd\efgh\ijkl.txt'
```
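To see what the `r` changes, here's a small sketch comparing a raw string with a regular string whose backslashes were doubled by hand. The path is made up for illustration:

```python
raw_path = r'c:\Users\abcd\table.txt'         # raw string: backslashes are kept as-is
escaped_path = 'c:\\Users\\abcd\\table.txt'   # regular string: each backslash doubled by hand

same_path = (raw_path == escaped_path)        # both spell the same path

tab_length = len('\t')                        # in a regular string, \t is one tab character
raw_tab_length = len(r'\t')                   # in a raw string, it's a backslash and a t

print(same_path, tab_length, raw_tab_length)
```

Without the `r` (or the doubling), the `\t` in `\table.txt` would silently turn into a tab character, which is exactly the sort of clash we're trying to avoid.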

In [4]:
type(f)

_io.TextIOWrapper

In [5]:
# what is the printed representation of my file object?

f

<_io.TextIOWrapper name='/etc/passwd' mode='r' encoding='UTF-8'>

In [6]:
# how can I read the contents of the file into Python?

# Option 1 (a bad one): read everything from the file into a Python string

s = f.read()

In [7]:
print(s) # this will now print the contents of the file

##
# User Database
# 
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33

# Why not use `f.read()`?

Answer: You don't know how big the file is. This reads the entirety of the file into memory, creating a string. If that file is 2TB in size, Python will try (and most likely fail) to read everything in and create a string.

You can give `f.read()` an argument, the number of characters to read, but that's kind of annoying.
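Here's a sketch of what reading in fixed-size pieces might look like. I'm using `io.StringIO`, a file-like object backed by a string, as a stand-in for a real file, and the chunk size of 10 is arbitrary:

```python
import io

# io.StringIO stands in for a real file; the content is made up
f = io.StringIO('abcdefghijklmnopqrstuvwxyz')

chunks = []
while True:
    chunk = f.read(10)      # read up to 10 characters
    if not chunk:           # an empty string means we've reached the end
        break
    chunks.append(chunk)

print(chunks)
```

This keeps memory use under control, but notice how much bookkeeping we need just to know when to stop.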

# Better: Iterate over the file

This is the standard way to read a file in Python. When you iterate:

- over a string, you get the characters
- over a list or tuple, you get the elements
- over a dict, you get the keys
- over a file, you get the lines -- one string at a time, each string ending with `'\n'`

In this way, the odds that a single line will be very large -- too large for memory -- are pretty small.  That memory is only allocated for the current line.
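The four cases in the list above can be sketched with `list`, which iterates over its argument and thus shows what a `for` loop would receive. The `io.StringIO` object is an assumption here, standing in for a real file:

```python
import io

from_string = list('ab')                    # string: characters
from_list = list([10, 20])                  # list: elements
from_dict = list({'x': 1, 'y': 2})          # dict: keys

# io.StringIO stands in for a real file here
from_file = list(io.StringIO('first\nsecond\n'))   # file: lines, each ending with '\n'

print(from_string)
print(from_list)
print(from_dict)
print(from_file)
```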

In [8]:
f = open('/etc/passwd')  # create the file object

for one_line in f:       # read one line at a time into one_line
    print(one_line)      # print each line

##

# User Database

# 

# Note that this file is consulted directly only when the system is running

# in single-user mode.  At other times this information is provided by

# Open Directory.

#

# See the opendirectoryd(8) man page for additional information about

# Open Directory.

##

nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false

root:*:0:0:System Administrator:/var/root:/bin/sh

daemon:*:1:1:System Services:/var/root:/usr/bin/false

_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico

_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false

_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false

_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false

_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false

_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false

_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false

_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/fal

In [9]:
# let's run "strip" on each string we get, thus removing whitespace from both sides,
# including \n at the end of the string

f = open('/etc/passwd')  # create the file object

for one_line in f:               # read one line at a time into one_line
    print(one_line.strip())      # print each line

##
# User Database
#
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33:

In [10]:
# we can combine this onto one line

for one_line in open('/etc/passwd'):    # when we exit the loop, there will be no references to our file...
    print(one_line.strip())             # ... so it'll close automatically.

##
# User Database
#
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33:

In [11]:
f = open('/etc/passwd')
s1 = f.read()
s2 = f.read()

In [12]:
len(s1)

8160

In [13]:
len(s2)

0

In [14]:
# if we want to read from a file again, after going through it the whole way (and having
# the bookmark at the end), we can invoke the "seek" method:

f.seek(0)   # move the bookmark to the start of the file, at character 0

s2 = f.read()

In [15]:
len(s2)

8160

In [16]:
# one example of how we can work with files
# let's say I want to print the usernames in /etc/passwd.  How can I do that?

for one_line in open('/etc/passwd'):
    print(one_line.strip())

##
# User Database
#
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33:

In [20]:
# option 1: get each line, up to the first colon

for one_line in open('/etc/passwd'):
    if one_line[0] != '#':
        first_colon_at = one_line.index(':')  # get the numerical index / location
        print(one_line[:first_colon_at])         # get a slice from one_line, up to that location

nobody
root
daemon
_uucp
_taskgated
_networkd
_installassistant
_lp
_postfix
_scsd
_ces
_appstore
_mcxalr
_appleevents
_geod
_devdocs
_sandbox
_mdnsresponder
_ard
_www
_eppc
_cvs
_svn
_mysql
_sshd
_qtss
_cyrus
_mailman
_appserver
_clamav
_amavisd
_jabber
_appowner
_windowserver
_spotlight
_tokend
_securityagent
_calendar
_teamsserver
_update_sharing
_installer
_atsserver
_ftp
_unknown
_softwareupdate
_coreaudiod
_screensaver
_locationd
_trustevaluationagent
_timezone
_lda
_cvmsroot
_usbmuxd
_dovecot
_dpaudio
_postgres
_krbtgt
_kadmin_admin
_kadmin_changepw
_devicemgr
_webauthserver
_netbios
_warmd
_dovenull
_netstatistics
_avbdeviced
_krb_krbtgt
_krb_kadmin
_krb_changepw
_krb_kerberos
_krb_anonymous
_assetcache
_coremediaiod
_launchservicesd
_iconservices
_distnote
_nsurlsessiond
_displaypolicyd
_astris
_krbfast
_gamecontrollerd
_mbsetupuser
_ondemand
_xserverdocs
_wwwproxy
_mobileasset
_findmydevice
_datadetectors
_captiveagent
_ctkd
_applepay
_hidd
_cmiodalassistants
_analyticsd
_fps

In [22]:
# option 2: break each line into a list, and grab index 0

for one_line in open('/etc/passwd'):
    if one_line[0] != '#':
        print(one_line.split(':')[0])   # split returns a list of strings, based on a string

nobody
root
daemon
_uucp
_taskgated
_networkd
_installassistant
_lp
_postfix
_scsd
_ces
_appstore
_mcxalr
_appleevents
_geod
_devdocs
_sandbox
_mdnsresponder
_ard
_www
_eppc
_cvs
_svn
_mysql
_sshd
_qtss
_cyrus
_mailman
_appserver
_clamav
_amavisd
_jabber
_appowner
_windowserver
_spotlight
_tokend
_securityagent
_calendar
_teamsserver
_update_sharing
_installer
_atsserver
_ftp
_unknown
_softwareupdate
_coreaudiod
_screensaver
_locationd
_trustevaluationagent
_timezone
_lda
_cvmsroot
_usbmuxd
_dovecot
_dpaudio
_postgres
_krbtgt
_kadmin_admin
_kadmin_changepw
_devicemgr
_webauthserver
_netbios
_warmd
_dovenull
_netstatistics
_avbdeviced
_krb_krbtgt
_krb_kadmin
_krb_changepw
_krb_kerberos
_krb_anonymous
_assetcache
_coremediaiod
_launchservicesd
_iconservices
_distnote
_nsurlsessiond
_displaypolicyd
_astris
_krbfast
_gamecontrollerd
_mbsetupuser
_ondemand
_xserverdocs
_wwwproxy
_mobileasset
_findmydevice
_datadetectors
_captiveagent
_ctkd
_applepay
_hidd
_cmiodalassistants
_analyticsd
_fps

# Exercise: Sum numbers

1. In my zipfile is a file called `nums.txt`. Each line of that file contains either one integer or no integers. There might be whitespace on one side of the integer or the other.
2. One line contains just whitespace.
3. Go through the file, one line at a time, and sum the numbers.  (Total is 83)

In [24]:
!cat nums.txt

5
	10     
	20
  	3
		   	20        

 25


In [27]:
total = 0
for one_line in open('nums.txt'):
    if one_line.strip():          # if we're left with an empty string after stripping, ignore
        total += int(one_line)    # if something is left, turn it into an int and add to total
    
print(total)    

83


In [28]:
total = 0
for one_line in open('nums.txt'):
    if one_line.strip().isdigit():   # only if the stripped line contains nothing but digits...
        total += int(one_line)       # ... turn it into an int and add to total
    
print(total)    

83
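One limitation of `str.isdigit` is that it returns `False` for a string such as `'-5'`, so negative numbers would be silently skipped. A possible alternative, sketched here with made-up data in an `io.StringIO` rather than the real `nums.txt`, is to try the conversion and ignore the lines on which `int` raises `ValueError`:

```python
import io

# made-up stand-in for a file such as nums.txt, including a negative number
f = io.StringIO('5\n\t10   \n  -3\n   \n20\n')

total = 0
for one_line in f:
    try:
        total += int(one_line)    # int ignores whitespace around the number
    except ValueError:            # blank or non-numeric line
        pass

print(total)
```

The trade-off is that this version also silently ignores lines containing junk, which may or may not be what you want.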


In [29]:
for one_line in open('nums.txt'):
    total = 0                        # bug: total is reset to 0 on every iteration
    if one_line.strip().isdigit():   # only if the stripped line contains nothing but digits...
        total += int(one_line)       # ... turn it into an int and add to total
    
print(total)    

25


# Exercise: `wc` -- word count

1. Unix comes with a `wc` command, which we can run on a file. It'll tell us:
    - The number of lines in the file (including blank lines)
    - The number of words in the file (assuming words are separated by whitespace)
    - The number of characters in the file (including whitespace, such as ' ' and '\n')
2. I want you to write a program that implements this in Python.  Given a file (and you can use the text file I've provided, `wcfile.txt`), get all three of those statistics.
3. If you want, you can also add a fourth statistic, namely the number of *different* (or unique) words in the file.

In [30]:
!cat wcfile.txt

This is a test file.

It contains 28 words and 20 different words.

It also contains 165 characters.

It also contains 11 lines.

It is also self-referential.

Wow!


In [31]:
# if you're in Jupyter, you can run commands in your OS by putting ! and then the command
# at the front of a line

!wc wcfile.txt

 11  28 165 wcfile.txt


"Whitespace" is a term that in Python includes:

- `' '` (space character)
- `'\n'` (newline)
- `'\r'` (carriage return)
- `'\t'` (tab)
- `'\v'` (vertical tab)
- `'\f'` (form feed)

If you use `str.strip` without an argument, then it removes any and all of the above that it finds at either end of the string.



In [33]:
s = '   \t\t\t\n\n\r\ra   b   c    \t\t\t\v\v\v\n\n '

s.strip()

'a   b   c'

In [34]:
s.split()     #split without an argument uses one or more whitespace characters as delimiters

['a', 'b', 'c']

In [46]:
lines = 0
characters = 0
words = 0

filename = 'wcfile.txt'

for one_line in open(filename):
    lines += 1
    characters += len(one_line)
    words += len(one_line.split())
    print('\t', one_line.split())
    
print(f'{lines=}')    
print(f'{characters=}')
print(f'{words=}')

	 ['This', 'is', 'a', 'test', 'file.']
	 []
	 ['It', 'contains', '28', 'words', 'and', '20', 'different', 'words.']
	 []
	 ['It', 'also', 'contains', '165', 'characters.']
	 []
	 ['It', 'also', 'contains', '11', 'lines.']
	 []
	 ['It', 'is', 'also', 'self-referential.']
	 []
	 ['Wow!']
lines=11
characters=165
words=28


In [None]:
lines = 0
words = 0
characters = 0

for one_line in open(filename):
    lines += 1
    for one_space in one_line:
        if one_space == " ":
            words +=1
    for one_character in one_line:
        characters += 1

print(f'lines = {lines}\nwords = {words}\ncharacters = {characters}')

In [39]:
one_line = 'this is a bunch of words'

one_line.count(' ')  # how many times does ' ' appear in this string?

5

# Performance tips

1. Because strings are immutable, their lengths are known to Python, and can be retrieved immediately. So invoking `len` on a string is super fast.  A `for` loop will take much longer.
2. In general, the built-in data structures' methods are written in C, and are typically going to be faster than code we write ourselves.

In Jupyter, we have a bunch of "magic commands" that start with `%`. They aren't passed along to Python, and allow us to try lots of different things. If you use `%timeit` followed by some code, it'll tell you how long that code takes to run, so you can compare two approaches and see which is faster.

In [40]:
%timeit len(one_line)

25.2 ns ± 0.643 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


In [41]:
%%timeit 

total = 0
for one_character in one_line:
    total += 1

581 ns ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
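If you aren't in Jupyter, the standard library's `timeit` module can make a similar comparison. This is a rough sketch; the function names and the repetition count of 100,000 are my own choices, and the numbers will vary from machine to machine:

```python
import timeit

one_line = 'this is a bunch of words'

def count_with_len():
    return len(one_line)

def count_with_loop():
    total = 0
    for one_character in one_line:
        total += 1
    return total

# each call runs the function 100,000 times and returns the total time in seconds
len_time = timeit.timeit(count_with_len, number=100_000)
loop_time = timeit.timeit(count_with_loop, number=100_000)

print(f'{len_time=:.4f}')
print(f'{loop_time=:.4f}')
```

Both measurements include the overhead of a function call, so the gap will look smaller than in the `%timeit` results, but the loop should still be clearly slower.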


In [44]:
!cat wcfile.txt

This is a test file.

It contains 28 words and 20 different words.

It also contains 165 characters.

It also contains 11 lines.

It is also self-referential.

Wow!


In [47]:
# what if we want unique words?
# whenever you hear the word "unique" or "distinct," think of a set

lines = 0
characters = 0
words = 0
unique_words = set()  # empty set of unique words

filename = 'wcfile.txt'

for one_line in open(filename):
    lines += 1
    characters += len(one_line)
    words += len(one_line.split())
    unique_words.update(one_line.split())   # turn the current line into a list of words, and add to unique_words
    
unique_words_count = len(unique_words)

print(f'{lines=}')    
print(f'{characters=}')
print(f'{words=}')
print(f'{unique_words_count=}')

lines=11
characters=165
words=28
unique_words_count=20


In [48]:
unique_words

{'11',
 '165',
 '20',
 '28',
 'It',
 'This',
 'Wow!',
 'a',
 'also',
 'and',
 'characters.',
 'contains',
 'different',
 'file.',
 'is',
 'lines.',
 'self-referential.',
 'test',
 'words',
 'words.'}

# Writing to files

Writing to files is similar in many ways to reading:

- We have to open the file (but we need to specify that we want to write, usually with `'w'` as the second argument to `open`)
- We can write any string we want to the file with the `.write` method -- which does not add `'\n'` to the end of what it writes -- you need to do that yourself!

Where things get tricky with writing to a file has to do with your computer's optimization of resources. Typically, when you write to a file, the data isn't saved to disk right away; it first goes into a buffer in memory.  It's actually written only when you "flush" the buffer to disk, or when you "close" the file, which flushes the buffer along the way.

If you know when you want to close the file, then it's often easiest/best/idiomatic to use the `with` statement. This automatically flushes + closes the file at the end of the `with` block.
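We can check that `with` really closes the file by looking at the file object's `closed` attribute. This is a small sketch; the file name and temporary directory are assumptions, chosen so that nothing real gets overwritten:

```python
import os
import tempfile

# hypothetical file in a throwaway directory
filename = os.path.join(tempfile.mkdtemp(), 'with-demo.txt')

with open(filename, 'w') as f:
    f.write('abcd\n')
    closed_inside = f.closed     # still open inside the block

closed_outside = f.closed        # closed as soon as the block ended

with open(filename) as f:        # read it back: the data reached the file
    contents = f.read()

print(closed_inside, closed_outside, repr(contents))
```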

# ***WARNING***

If you open a file for writing, then one of two things will happen:

- You will get an error, saying that you cannot open the file.
- The file will exist with 0 bytes in it.  If anything was in the file before, it isn't there any more.

Python actually does support the `'x'` open option, which means: Open the file for writing, but don't destroy anything that already exists.  You'll get an error if you name a file that already exists.

You can also use `'a'` for *append* mode, meaning that anything you write will be added to the end of the file, not replace existing content.
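Here's a sketch of how `'w'`, `'a'`, and `'x'` behave differently on the same file. Again, the file name and temporary directory are made up for the example:

```python
import os
import tempfile

# hypothetical file in a throwaway directory, so nothing real is overwritten
filename = os.path.join(tempfile.mkdtemp(), 'modes-demo.txt')

with open(filename, 'w') as f:      # 'w' creates the file (or empties an existing one)
    f.write('first\n')

with open(filename, 'a') as f:      # 'a' adds to the end of what's already there
    f.write('second\n')

with open(filename) as f:
    after_append = f.read()

try:
    open(filename, 'x')             # 'x' refuses to open a file that already exists
    x_failed = False
except FileExistsError:
    x_failed = True

with open(filename, 'w') as f:      # opening with 'w' again discards the old content
    pass

size_after_w = os.path.getsize(filename)

print(repr(after_append), x_failed, size_after_w)
```

Notice that the final `with` block doesn't write anything at all, and yet the file ends up empty. Just opening with `'w'` is enough to destroy the old content.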

In [49]:
# simple-minded writing to a file

f = open('myfile.txt', 'w')   # open for writing

f.write('abcd\n')
f.write('efghijk\n')
f.write('end of file!\n')

f.close()   # this flushes the buffer + closes the file

In [50]:
!cat myfile.txt

abcd
efghijk
end of file!


In [52]:
# better way to write to files using "with"


with open('myfile.txt', 'w') as f:   # backwards variable assignment, maybe?

    f.write('*abcd\n')
    f.write('*efghijk\n')
    f.write('*end of file!\n')

    # automatically flushes + closes at the end of the block

In [53]:
!cat myfile.txt

*abcd
*efghijk
*end of file!


In [54]:
# you can use "with" to read from files, too:
# wc, re-implemented using with

# what if we want unique words?
# whenever you hear the word "unique" or "distinct," think of a set

lines = 0
characters = 0
words = 0
unique_words = set()  # empty set of unique words

filename = 'wcfile.txt'

with open(filename) as f:  # open the file for reading, assign to f
    for one_line in f:     # iterate over the lines in f
        lines += 1
        characters += len(one_line)
        words += len(one_line.split())
        unique_words.update(one_line.split())  
    
unique_words_count = len(unique_words)

print(f'{lines=}')    
print(f'{characters=}')
print(f'{words=}')
print(f'{unique_words_count=}')



lines=11
characters=165
words=28
unique_words_count=20


In [59]:
# let's iterate over a bunch of files, and get their lengths line by line

# I'll get the files in the current directory that end with ".txt" using the glob.glob function
# in the "glob" module

import glob

for one_filename in glob.glob('*.txt'):
    with open(one_filename) as f:
        total = 0
        for one_line in f:
            total += len(one_line)
            
        print(f'{one_filename}: {total}')
        
        # here, we run f.close() before getting the next file

mini-access-log.txt: 36562
nums.txt: 42
shoe-data.txt: 1676
linux-etc-passwd.txt: 2683
wcfile.txt: 165
myfile.txt: 29


In [60]:
# the "current directory" in Jupyter is wherever you ran Jupyter

%pwd

'/Users/reuven/Courses/Current/Deluxe-2023-python'

In [61]:
# if you run a Python program from the command line, then the "current directory"
# is relative to wherever you ran the program

# but glob can handle complex paths

glob.glob('/etc/c*/*')

['/etc/cups/snmp.conf.pre-update',
 '/etc/cups/printers.conf.O',
 '/etc/cups/cupsd.conf',
 '/etc/cups/snmp.conf.default',
 '/etc/cups/cups-files.conf.default',
 '/etc/cups/ppd',
 '/etc/cups/cupsd.conf.default',
 '/etc/cups/printers.conf.pre-update',
 '/etc/cups/cups-files.conf',
 '/etc/cups/psnormalizer.convs',
 '/etc/cups/thnuclnt.types',
 '/etc/cups/certs',
 '/etc/cups/cups-files.conf.pre-update',
 '/etc/cups/cupsd.conf.pre-update',
 '/etc/cups/snmp.conf',
 '/etc/cups/printers.conf',
 '/etc/cups/thnuclnt.convs',
 '/etc/cups/cupsd.conf.O',
 '/etc/cups/interfaces']

In [62]:
!dir

Deluxe\ -\ 2023-03March-21.html   deluxe-2023-03March-23.zip
Deluxe\ -\ 2023-03March-21.ipynb  deluxe-2023-03March-27.zip
Deluxe\ -\ 2023-03March-23.html   deluxe-2023-03March-29.zip
Deluxe\ -\ 2023-03March-23.ipynb  exercise-files.zip
Deluxe\ -\ 2023-03March-27.html   linux-etc-passwd.txt
Deluxe\ -\ 2023-03March-27.ipynb  mini-access-log.txt
Deluxe\ -\ 2023-03March-29.html   myfile.txt
Deluxe\ -\ 2023-03March-29.ipynb  nums.txt
Deluxe\ -\ 2023-04April-03.ipynb  shoe-data.txt
deluxe-2023-03March-21.zip	  wcfile.txt


In [63]:
%magic

# Functional programming

This approach to programming basically says:

- We don't want to modify data -- even if the data structure is mutable
- We're going to try to assign as few times as possible
- We're going to treat functions as if they were data

There are two parts of the functional programming world that we often use in Python that you should know:

1. Comprehensions (and especially list comprehensions)
2. Passing functions as arguments
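As a quick taste of the second item, here's a sketch using `sorted`, whose `key` parameter takes a function. The helper `last_letter` is hypothetical, written just for this example:

```python
words = 'this is a bunch of words'.split()

def last_letter(one_word):        # hypothetical helper, written for this example
    return one_word[-1]

by_length = sorted(words, key=len)                 # pass the built-in len, without parentheses
by_last_letter = sorted(words, key=last_letter)    # pass our own function the same way

print(by_length)
print(by_last_letter)
```

In both cases we pass the function itself, not the result of calling it. `sorted` then calls it once on each element to decide the ordering.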

# List comprehensions

In [64]:
# let's assume that I have a list of 10 numbers
# and I want a new list of those numbers squared

# the typical way that I would do this is as follows:

numbers = list(range(10))

output = []

for one_number in numbers:
    output.append(one_number ** 2)
    
output    

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [65]:
# in such cases, where:

# I have an iterable (in this case, a list)
# I want a new list
# there's a clear Python expression that maps between them

# comprehensions to the rescue!

[one_number ** 2             # expression -- SELECT
 for one_number in numbers]  # iteration  -- FROM

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# List comprehensions: Notes

1. We're creating a new list, and thus we use `[]`.  This new list can be assigned to a variable, passed to a function or method, etc.
2. The first line of the comprehension is an expression -- any Python expression whatsoever -- any operation, function, or method. The value of this expression will be put into the output list, at the same position as the input value it was computed from.
3. The second line (which actually executes first) is a `for` loop, just like any other `for` loop we've seen (more or less), except that it doesn't have `:` or a body after.

In [66]:
# example 1:

mylist = ['abcd', 'ef', 'ghi']

'*'.join(mylist)   # this works fine!

'abcd*ef*ghi'

In [67]:
# what if I have a list of integers?

mylist = [100, 200, 300]

'*'.join(mylist)  # you can only run join with an argument that's an iterable of strings

TypeError: sequence item 0: expected str instance, int found

In [69]:
# I have: a list of integers
# I want: a list of strings
# I can translate each integer to a string with str()

'*'.join([str(one_item)
         for one_item in mylist])

'100*200*300'

In [70]:
# example 2: title without title

# The str.title method returns a new string in which every word's first character
# is capitalized, but the other characters are lowercase

s = 'this is a bunch of words'
s.title()

'This Is A Bunch Of Words'

In [74]:
# how can I get the same result as the title method, if
# I don't have that method, but I do have str.capitalize (which does the same thing
# for a single word)?

' '.join([one_word.capitalize()
         for one_word in s.split()])

'This Is A Bunch Of Words'

# Using comprehensions

1. You have an iterable (string, list, dict, file)
2. You want a new list based on it, with the same number of elements
3. You can describe a Python expression that converts each element in the first to an element in the second

```python 
[100, 200, 300]  # gets converted to ['100', '200', '300'] with str
```

# Exercises: Comprehensions

1. Create a string containing integers, separated by whitespace (e.g., `'10 20 30'`). Sum the numbers in the string.
2. Create a string containing words, separated by whitespace. How many characters are in the string, ignoring the whitespace?

In [81]:
# Create a string containing integers, 
# separated by whitespace (e.g., '10 20 30'). Sum the numbers in the string.

s = '10 20 30'

# I have: a list of strings
# I want: a list of integers
# I can translate each string to an integer with int()

sum([int(one_item)
     for one_item in s.split()])

60

In [83]:
# Create a string containing words, separated by whitespace. 
# How many characters are in the string, ignoring the whitespace?

s = 'this is a bunch of words'

len(s.replace(' ', ''))

19

In [87]:
sum([len(one_item)
  for one_item in s.split()])

19

In [88]:
word_lengths = [len(one_item)
  for one_item in s.split()]

In [89]:
word_lengths

[4, 2, 1, 5, 2, 5]

# Next up

1. Files and comprehensions
2. Conditions and comprehensions
3. Sorting + passing functions


In [90]:
# since a comprehension needs an input that's iterable
# since files are iterable...
# maybe we can use files in our comprehensions?

[one_line
 for one_line in open('/etc/passwd')]

['##\n',
 '# User Database\n',
 '# \n',
 '# Note that this file is consulted directly only when the system is running\n',
 '# in single-user mode.  At other times this information is provided by\n',
 '# Open Directory.\n',
 '#\n',
 '# See the opendirectoryd(8) man page for additional information about\n',
 '# Open Directory.\n',
 '##\n',
 'nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false\n',
 'root:*:0:0:System Administrator:/var/root:/bin/sh\n',
 'daemon:*:1:1:System Services:/var/root:/usr/bin/false\n',
 '_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico\n',
 '_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false\n',
 '_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false\n',
 '_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false\n',
 '_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false\n',
 '_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false\n',
 '_scsd:*:31:31:Service Configuration Servi

In [92]:
# can we get a list of the usernames
# in /etc/passwd (i.e., the first field, before the first :, on each line)?

# using a condition in our comprehension means that the number of output elements
# might not be the same as the number of input elements -- it might be smaller, thanks
# to filtering via our condition

[one_line.split(':')[0]               # expression  -- SELECT
 for one_line in open('/etc/passwd')  # iteration   -- FROM
 if not one_line.startswith('#')    ]  # condition   -- WHERE

['nobody',
 'root',
 'daemon',
 '_uucp',
 '_taskgated',
 '_networkd',
 '_installassistant',
 '_lp',
 '_postfix',
 '_scsd',
 '_ces',
 '_appstore',
 '_mcxalr',
 '_appleevents',
 '_geod',
 '_devdocs',
 '_sandbox',
 '_mdnsresponder',
 '_ard',
 '_www',
 '_eppc',
 '_cvs',
 '_svn',
 '_mysql',
 '_sshd',
 '_qtss',
 '_cyrus',
 '_mailman',
 '_appserver',
 '_clamav',
 '_amavisd',
 '_jabber',
 '_appowner',
 '_windowserver',
 '_spotlight',
 '_tokend',
 '_securityagent',
 '_calendar',
 '_teamsserver',
 '_update_sharing',
 '_installer',
 '_atsserver',
 '_ftp',
 '_unknown',
 '_softwareupdate',
 '_coreaudiod',
 '_screensaver',
 '_locationd',
 '_trustevaluationagent',
 '_timezone',
 '_lda',
 '_cvmsroot',
 '_usbmuxd',
 '_dovecot',
 '_dpaudio',
 '_postgres',
 '_krbtgt',
 '_kadmin_admin',
 '_kadmin_changepw',
 '_devicemgr',
 '_webauthserver',
 '_netbios',
 '_warmd',
 '_dovenull',
 '_netstatistics',
 '_avbdeviced',
 '_krb_krbtgt',
 '_krb_kadmin',
 '_krb_changepw',
 '_krb_kerberos',
 '_krb_anonymous',
 '_asse

In [93]:
d = {'a':10, 'b':20, 'c':30}

# let's turn our dict into tuples

[(key, value)
 for key, value in d.items()]

[('a', 10), ('b', 20), ('c', 30)]

In [94]:
# return a list of tuples from our dict,
# where the key is in uppercase letters

[(key.upper(), value)
 for key, value in d.items()]

[('A', 10), ('B', 20), ('C', 30)]

In [95]:
# return a list of tuples from our dict,
# where the key is *not* a vowel

[(key, value)
 for key, value in d.items()
 if key not in 'aeiou']

[('b', 20), ('c', 30)]

In [96]:
!ls *.txt

linux-etc-passwd.txt  myfile.txt  shoe-data.txt
mini-access-log.txt   nums.txt	  wcfile.txt


# Exercise: Summing `nums.txt` (with a comprehension)

Sum the numbers in `nums.txt`, but this time, use a comprehension rather than a regular `for` loop.

In [97]:
# with open(filename) as f:
#     [one_line 
#      for one_line in f]

In [105]:
sum([int(one_line)
 for one_line in open('nums.txt')
 if one_line.strip().isdigit() ])    # remove lines only containing whitespace

83

In [106]:
!head shoe-data.txt

Adidas	orange	43
Nike	black	41
Adidas	black	39
New Balance	pink	41
Nike	white	44
New Balance	orange	38
Nike	pink	44
Adidas	pink	44
New Balance	orange	39
New Balance	black	43


# `shoe-data.txt`

This file contains 100 lines of data. Each line contains three columns of values. The columns are separated with tab (`'\t'`) characters:

- brand
- color
- size

How can I turn this file into a list of 100 dicts? Each dict should have three key-value pairs, with keys "brand", "color", and "size".

Try to do this using a list comprehension:

- Each line of the file contains one record, which should be turned into one dict
- All 100 dicts will have the same keys, but different values
- It's OK to keep the sizes as strings, rather than ints, for our purposes
- You'll probably be best off writing a function that takes a string and returns a dict, and that is invoked as part of the comprehension's expression

In [116]:
filename = 'shoe-data.txt'

def line_to_dict(s):
    fields = s.strip().split('\t')    
    
    return {'brand': fields[0],
           'color': fields[1],
           'size': fields[2]}

[line_to_dict(one_line)
 for one_line in open(filename)]

[{'brand': 'Adidas', 'color': 'orange', 'size': '43'},
 {'brand': 'Nike', 'color': 'black', 'size': '41'},
 {'brand': 'Adidas', 'color': 'black', 'size': '39'},
 {'brand': 'New Balance', 'color': 'pink', 'size': '41'},
 {'brand': 'Nike', 'color': 'white', 'size': '44'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '38'},
 {'brand': 'Nike', 'color': 'pink', 'size': '44'},
 {'brand': 'Adidas', 'color': 'pink', 'size': '44'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '39'},
 {'brand': 'New Balance', 'color': 'black', 'size': '43'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '44'},
 {'brand': 'Nike', 'color': 'black', 'size': '41'},
 {'brand': 'Adidas', 'color': 'orange', 'size': '37'},
 {'brand': 'Adidas', 'color': 'black', 'size': '38'},
 {'brand': 'Adidas', 'color': 'pink', 'size': '41'},
 {'brand': 'Adidas', 'color': 'white', 'size': '36'},
 {'brand': 'Adidas', 'color': 'orange', 'size': '36'},
 {'brand': 'Nike', 'color': 'pink', 'size': '41'},
 {'brand': '

In [117]:
filename = 'shoe-data.txt'

def line_to_dict(s):
    brand, color, size = s.strip().split('\t')       # unpacking
    
    return {'brand': brand,
           'color': color,
           'size': size}

[line_to_dict(one_line)
 for one_line in open(filename)]

[{'brand': 'Adidas', 'color': 'orange', 'size': '43'},
 {'brand': 'Nike', 'color': 'black', 'size': '41'},
 {'brand': 'Adidas', 'color': 'black', 'size': '39'},
 {'brand': 'New Balance', 'color': 'pink', 'size': '41'},
 {'brand': 'Nike', 'color': 'white', 'size': '44'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '38'},
 {'brand': 'Nike', 'color': 'pink', 'size': '44'},
 {'brand': 'Adidas', 'color': 'pink', 'size': '44'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '39'},
 {'brand': 'New Balance', 'color': 'black', 'size': '43'},
 {'brand': 'New Balance', 'color': 'orange', 'size': '44'},
 {'brand': 'Nike', 'color': 'black', 'size': '41'},
 {'brand': 'Adidas', 'color': 'orange', 'size': '37'},
 {'brand': 'Adidas', 'color': 'black', 'size': '38'},
 {'brand': 'Adidas', 'color': 'pink', 'size': '41'},
 {'brand': 'Adidas', 'color': 'white', 'size': '36'},
 {'brand': 'Adidas', 'color': 'orange', 'size': '36'},
 {'brand': 'Nike', 'color': 'pink', 'size': '41'},
 {'brand': '

In [118]:
# I can call int() and get a new integer
# I can call str() and get a new string
# can I call dict() and get a new dict?  Yes, we just need to pass it a list of lists or a list of tuples

dict([ ('a', 10),  ('b', 20),  ('c', 30)  ])

{'a': 10, 'b': 20, 'c': 30}

In [None]:
# if I can get the keys (brand, color, size) into the 0 index of each tuple
# and th

filename = 'shoe-data.txt'

def line_to_dict(s):
    brand, color, size = s.strip().split('\t')       # unpacking
    
    return {'brand': brand,
           'color': color,
           'size': size}

[line_to_dict(one_line)
 for one_line in open(filename)]
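Since `dict` accepts a list of two-element tuples, one possible way to build each record is to pair the keys with the values using the built-in `zip` function. This is only a sketch of where that idea might lead; `zip` hasn't been introduced in this session, and the sample line is typed in by hand (based on the first line of `shoe-data.txt`) rather than read from the file:

```python
keys = ['brand', 'color', 'size']

def line_to_dict(s):
    values = s.strip().split('\t')
    return dict(zip(keys, values))     # zip pairs each key with the value at the same index

one_line = 'Adidas\torange\t43\n'      # hand-typed sample, tab-separated like the file
one_record = line_to_dict(one_line)

print(one_record)
```

The function can then be used in the comprehension exactly as before.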