LIN 373 UTexas :: Jessy Li

Original Material (list comprehensions and file handling) from Katrin Erk

## List comprehensions

What do you think this will do?

In [1]:
text = "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29."
words = text.split()
lst = [w.lower() for w in words]
print(lst)

# lst = []
# for w in words:
#     lst.append(w.lower())

['pierre', 'vinken,', '61', 'years', 'old,', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'nov.', '29.']


We can put the list back together into a sentence:

In [4]:
s = "@".join(lst)
print(s)

pierre@vinken,@61@years@old,@will@join@the@board@as@a@nonexecutive@director@nov.@29.
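The string that `join()` is called on is used as the separator between the items. As a small sketch (with a shortened word list), joining on a single space instead of "@" gives something that reads like a sentence again, although the lowercasing is of course not undone:

```python
lst = ['pierre', 'vinken,', '61', 'years', 'old,']

# The string before .join() is placed between the items of the list.
sentence = " ".join(lst)
print(sentence)   # pierre vinken, 61 years old,
```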


Try: given a list of sentences, return a list containing the number of words in each sentence, again using `split()`.

In [5]:
sentences = ["Winter is coming.",
            "When you play a game of thrones you win or you die.",
            "Different roads sometimes lead to the same castle."]
[len(sentence.split()) for sentence in sentences]

[3, 12, 8]

What do you think this will do?

In [6]:
mylist = ['for', 'a', 'minute', 'he', 'scarcely', 
          'realised', 'what', 'this', 'meant', 'and', 
          'although', 'the', 'heat', 'was', 'excessive', 
          'he', 'clambered', 'down', 'into', 'the', 
          'pit', 'close', 'to', 'the', 'bulk', 
          'to', 'see', 'the', 'thing', 'more', 'clearly']

mystopwords = ["the", "a", "to", "for", "he", "she", "it", "what", "and"]

[w for w in mylist if w not in mystopwords]

# lst = []
# for w in mylist:
#     if w not in mystopwords:
#         lst.append(w)

['minute',
 'scarcely',
 'realised',
 'this',
 'meant',
 'although',
 'heat',
 'was',
 'excessive',
 'clambered',
 'down',
 'into',
 'pit',
 'close',
 'bulk',
 'see',
 'thing',
 'more',
 'clearly']

So we have seen uses of list comprehensions that transform each item on the list, and uses that filter a list. You can do both at the same time. Here's an example task: 
Given a list of numbers,
* If a number is even, drop it
* If a number is odd, double it
* And return the result as another list of numbers

To figure out whether a number is even, we can check if the number is divisible by two:

In [7]:
print(19 % 5)
print(5 % 2 == 0)
print(4 % 2 == 0)

4
False
True


In [11]:
def drop_even_double_odd(intlist):
    return [num*2 for num in intlist if num % 2 != 0]
    
drop_even_double_odd([1,2,3,4])

[2, 6]
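The position of the `if` matters. A trailing `if` filters items out, while a conditional expression (`a if condition else b`) placed before the `for` picks a value for every item and keeps the list the same length. A small sketch with made-up numbers:

```python
nums = [1, 2, 3, 4]

# Trailing "if": even numbers are dropped, odd ones are doubled.
filtered = [n * 2 for n in nums if n % 2 != 0]

# Conditional expression: every number is kept; only the odd ones are doubled.
kept = [n * 2 if n % 2 != 0 else n for n in nums]

print(filtered)  # [2, 6]
print(kept)      # [2, 2, 6, 4]
```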

## Addressing punctuation

Punctuation often gets in our way if we want to process words. For example, above we had the sentence 
```
"Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29." 
```
(This sentence is famous among computational linguists because it is the first sentence in the Wall Street Journal corpus, which everyone uses as an example sentence.) 

If we split this sentence into words to process it, the result will, among other things, contain "old,", which is unfortunately a different string from "old" and won't be counted as the same word. So we will often want to separate punctuation from the words. 

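But what is punctuation? A first attempt is to cut off the last character of a word whenever it is not a letter. Note that this also shortens '61' to '6' and removes only one of the trailing characters of '29??!!!'. Python's `string` module offers a better starting point, `string.punctuation`: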

In [16]:
lst = ['pierre', 'vinken,', '61', 'years', 'old,', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'nov.', '29??!!!']
[x[:-1] if not x[-1].isalpha() else x for x in lst]

['pierre',
 'vinken',
 '6',
 'years',
 'old',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'nov',
 '29??!!']

In [17]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Now how can we remove punctuation at the end of a word? Take a look at Python string functions `strip()`, `lstrip()`, and `rstrip()`:

In [24]:
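# strip with a " " argument: removes only spaces, from both ends
"   hello\n\t     ".strip(" ")

# rstrip with a " " argument: removes only spaces, from the right end
"   hello\n\t     ".rstrip(" ")

# lstrip without arguments: removes all whitespace from the left end
"   hello\n\t     ".lstrip()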

'hello\n\t     '

In [25]:
# rstrip with one string argument
"29??!!!".rstrip("?!")

'29'
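One detail that is easy to miss: the argument of `rstrip()` is treated as a set of characters, not as a suffix. A short sketch with made-up strings:

```python
# The order of the characters in the argument does not matter.
print("29??!!!".rstrip("!?"))   # 29

# rstrip keeps removing as long as the last character is in the set ...
print("banana".rstrip("an"))    # b

# ... and stops at the first character that is not in the set.
print("wow!?x!".rstrip("?!"))   # wow!?x
```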

In [28]:
# rstrip all punctuation characters
"29??!!!....,,".rstrip(string.punctuation)

lst = ['pierre', 'vinken,', '61', 'years', 'old,', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'nov.', '29??!!!']
[x.rstrip(string.punctuation) for x in lst]

['pierre',
 'vinken',
 '61',
 'years',
 'old',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'nov',
 '29']
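Punctuation can also appear at the start of a word, for example an opening quote or bracket. As a sketch with made-up tokens, `strip()` removes punctuation from both ends; punctuation inside a word is left alone, and a token that consists only of punctuation becomes an empty string that we may want to drop:

```python
import string

tokens = ['"hello,"', "(nov.)", "u.s.", "don't", "..."]

# strip() works on both ends; characters inside the word are untouched.
cleaned = [t.strip(string.punctuation) for t in tokens]
print(cleaned)     # ['hello', 'nov', 'u.s', "don't", '']

# Filter out tokens that were nothing but punctuation.
nonempty = [t for t in cleaned if t]
print(nonempty)    # ['hello', 'nov', 'u.s', "don't"]
```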

## Reading and writing files in Python

### Getting a sample file

As an example of a file to read, we will use a relatively small, unannotated corpus from Project Gutenberg (http://www.gutenberg.org/wiki/Main_Page).
Project Gutenberg is an online collection of texts whose copyright has expired. It contains texts in many languages. 

We will work with the First Project Gutenberg Collection of Edgar Allan Poe at http://www.gutenberg.org/etext/1062.
Please download the file, making sure to choose a Plain Text version. The file will be called pg1062.txt. Put it on your Desktop.

### Reading a file

You should now have the file pg1062.txt on your Desktop. Accessing this file in Python is easy -- the most complicated part is going to be figuring out how to tell Python about the file's location on your system. On my Unix system, the following lines of Python will print the whole file contents to your screen, line by line:

In [30]:
f = open('/Users/jl67946/Desktop/pg1062.txt')
for line in f:
    print(line)
f.close()

﻿The Project Gutenberg EBook of First Gutenberg Collection of Edgar Allan

Poe, by Edgar Allan Poe



This eBook is for the use of anyone anywhere at no cost and with

almost no restrictions whatsoever.  You may copy it, give it away or

re-use it under the terms of the Project Gutenberg License included

with this eBook or online at www.gutenberg.net





Title: First Gutenberg Collection of Edgar Allan Poe



Author: Edgar Allan Poe



Posting Date: June 6, 2010 [EBook #1062]

Release Date: October, 1997



Language: English





*** START OF THIS PROJECT GUTENBERG EBOOK GUTENBERG COLLECTION--E. A. POE ***









Produced by Levent Kurnaz and Jose Menendez















This is our second experimental effort at cataloguing multiple items in

a single file.  In the first instance we use the same index number for

each item, and just used multiple entries for that file in the index.

In this, the second instance, we have used separate index numbers for

the collection and for all th

Note that if you put the text file in the same directory as the notebook or script you are working with, you only need to write:
`f = open("pg1062.txt")`
By default, Python looks for files in the *current directory*.

If you have a Mac, you will probably just need to substitute your user name for `jl67946`. 

If you are running Linux, you will have to write `/home/YOURUSERNAME/Desktop/pg1062.txt`.

If you have a Windows system, you probably need to write `f = open(r"C:\Desktop\pg1062.txt")`. Note the "r" before the opening double quote: It tells Python not to interpret the "\" as the beginning of an escape sequence such as "\n".
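If you would rather not type the path by hand, the standard library module `pathlib` can build it for you. This is only a sketch: it assumes the file was saved to a folder called `Desktop` directly inside your home directory, which may not be true on every system.

```python
from pathlib import Path

# Path.home() is the current user's home directory on any operating system.
# Assumption: the download ended up in a "Desktop" folder inside it.
path = Path.home() / "Desktop" / "pg1062.txt"
print(path)

if path.exists():
    with open(path, encoding="utf-8") as f:
        print(f.readline())
else:
    print("File not found; check where your browser saved it.")
```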

Any time you read a file, the lines of Python code you write for that will be almost the same. Think of a file as something like a box: You have to open it first, then you can access its contents, and then you close it.

In [31]:
# “open” takes as its argument a file name, which may include directory information. 
# Do not forget to start with the “open” command! 
# You need the file object that it returns to read the file.
f = open("/Users/jl67946/Desktop/pg1062.txt")

# “open” returns a file object. This, then, can be used to access the file contents.
type(f)

_io.TextIOWrapper

In [None]:
# We can iterate through the lines of a file as if it were a list.
# Note that "line" is just a variable name. I could have named it anything else.
# “line” is a variable that will be filled by each line of the file in turn. 
for line in f:
    print(line)

In [32]:
# After reading the file, you close the file object. 
# This is not strictly necessary if you are only reading the file
# -- if you are writing, it is necessary -- but it is good practice.
f.close()
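You may have noticed the empty lines in the printed file contents. Each line read from a file still ends in its newline character, and `print()` adds a second one. The sketch below uses a small throwaway file (created with `tempfile`, so it runs anywhere) to show the line endings and two ways of avoiding the double spacing:

```python
import os
import tempfile

# Create a small throwaway file so that the example runs on any machine.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")

f = open(tmp.name)
lines = [line for line in f]
f.close()
print(lines)                    # ['first line\n', 'second line\n']

for line in lines:
    print(line.rstrip("\n"))    # remove the newline before printing
for line in lines:
    print(line, end="")         # or tell print() not to add its own

os.remove(tmp.name)             # clean up the throwaway file
```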

### Other ways of reading a file

You can also read the whole file into a single string, using `read()`:

In [33]:
f = open("/Users/jl67946/Desktop/pg1062.txt")
myfilecontents = f.read()
f.close()
myfilecontents

'\ufeffThe Project Gutenberg EBook of First Gutenberg Collection of Edgar Allan\nPoe, by Edgar Allan Poe\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.net\n\n\nTitle: First Gutenberg Collection of Edgar Allan Poe\n\nAuthor: Edgar Allan Poe\n\nPosting Date: June 6, 2010 [EBook #1062]\nRelease Date: October, 1997\n\nLanguage: English\n\n\n*** START OF THIS PROJECT GUTENBERG EBOOK GUTENBERG COLLECTION--E. A. POE ***\n\n\n\n\nProduced by Levent Kurnaz and Jose Menendez\n\n\n\n\n\n\n\nThis is our second experimental effort at cataloguing multiple items in\na single file.  In the first instance we use the same index number for\neach item, and just used multiple entries for that file in the index.\nIn this, the second instance, we have used separate index numbers for\nthe collection and for 
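The `'\ufeff'` at the very beginning of the string is a byte order mark (BOM) that some programs put at the start of a UTF-8 file; it is not part of the text. One way to get rid of it is to open the file with the `utf-8-sig` encoding. A sketch with a throwaway file that starts with a BOM, like pg1062.txt does:

```python
import os
import tempfile

# Write a throwaway file whose first character is a byte order mark.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("\ufeffThe Project Gutenberg EBook\n")

with open(tmp.name, encoding="utf-8") as f:
    raw = f.read()       # plain UTF-8 keeps the BOM as a character
with open(tmp.name, encoding="utf-8-sig") as f:
    clean = f.read()     # utf-8-sig skips a BOM if there is one

print(repr(raw))         # '\ufeffThe Project Gutenberg EBook\n'
print(repr(clean))       # 'The Project Gutenberg EBook\n'
os.remove(tmp.name)
```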

Or you can read one single line only, using `readline()`:

In [35]:
# read the next line of the file
f = open("/Users/jl67946/Desktop/pg1062.txt")
line1 = f.readline()
line2 = f.readline()
line3 = f.readline()
print(line1)
print(line2)
print(line3)

﻿The Project Gutenberg EBook of First Gutenberg Collection of Edgar Allan

Poe, by Edgar Allan Poe





We all tend to forget `f.close()`, and Python anticipates that. If you open the file in a `with` block, you do not have to close it yourself; it is closed for you when the block ends:

In [37]:
with open("/Users/jl67946/Desktop/pg1062.txt") as f:
    for line in f:
        print(line)

# Looping over open(...) directly also works, but then nothing closes the file explicitly:
# for line in open("/Users/jl67946/Desktop/pg1062.txt"):
#     print(line)

﻿The Project Gutenberg EBook of First Gutenberg Collection of Edgar Allan

Poe, by Edgar Allan Poe



This eBook is for the use of anyone anywhere at no cost and with

almost no restrictions whatsoever.  You may copy it, give it away or

re-use it under the terms of the Project Gutenberg License included

with this eBook or online at www.gutenberg.net





Title: First Gutenberg Collection of Edgar Allan Poe



Author: Edgar Allan Poe



Posting Date: June 6, 2010 [EBook #1062]

Release Date: October, 1997



Language: English





*** START OF THIS PROJECT GUTENBERG EBOOK GUTENBERG COLLECTION--E. A. POE ***









Produced by Levent Kurnaz and Jose Menendez















This is our second experimental effort at cataloguing multiple items in

a single file.  In the first instance we use the same index number for

each item, and just used multiple entries for that file in the index.

In this, the second instance, we have used separate index numbers for

the collection and for all th

Inside a `with` block you can still iterate over the lines of the file with a `for` loop.

Exercise: Can you count the lines in the file pg1062.txt?

In [50]:
# with open("/Users/jl67946/Desktop/pg1062.txt") as f:
# #     print(len(f.read().split("\n")))
#     print(len(f.read().split("\n")[-1]))
    
# counter = 0
# with open("/Users/jl67946/Desktop/pg1062.txt") as f:
#     for line in f:
#         counter += 1
# counter

with open("/Users/jl67946/Desktop/pg1062.txt") as f:
    s = f.read()
s

'\ufeffThe Project Gutenberg EBook of First Gutenberg Collection of Edgar Allan\nPoe, by Edgar Allan Poe\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.net\n\n\nTitle: First Gutenberg Collection of Edgar Allan Poe\n\nAuthor: Edgar Allan Poe\n\nPosting Date: June 6, 2010 [EBook #1062]\nRelease Date: October, 1997\n\nLanguage: English\n\n\n*** START OF THIS PROJECT GUTENBERG EBOOK GUTENBERG COLLECTION--E. A. POE ***\n\n\n\n\nProduced by Levent Kurnaz and Jose Menendez\n\n\n\n\n\n\n\nThis is our second experimental effort at cataloguing multiple items in\na single file.  In the first instance we use the same index number for\neach item, and just used multiple entries for that file in the index.\nIn this, the second instance, we have used separate index numbers for\nthe collection and for 
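Here is one possible way to count lines, sketched on a small throwaway file rather than on pg1062.txt. It also shows why splitting on "\n" can be off by one: if the text ends in a newline, the split produces one extra, empty string at the end.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("one\ntwo\nthree\n")

# Iterate over the file and count; only the current line is held in memory.
with open(tmp.name) as f:
    n_lines = sum(1 for line in f)

# Read everything, then split. Compare split("\n") with splitlines().
with open(tmp.name) as f:
    text = f.read()
n_split = len(text.split("\n"))        # includes a trailing empty string
n_splitlines = len(text.splitlines())  # does not

print(n_lines, n_split, n_splitlines)  # 3 4 3
os.remove(tmp.name)
```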

The file contains multiple stories by E.A. Poe. Suppose we want to count only the lines in the file that pertain to “The Raven”. How should we do it?
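One possible approach, sketched here on a made-up miniature "file": go through the lines, start counting after a line that marks the beginning of the story, and stop at a line that marks the beginning of the next one. The function name `count_lines_between` and both marker strings are assumptions for illustration; look at the actual file to find out how the stories are delimited there (a title may also appear in a table of contents, for instance).

```python
def count_lines_between(lines, start_marker, end_marker):
    """Count the lines after start_marker up to, but not including, end_marker."""
    counting = False
    count = 0
    for line in lines:
        text = line.strip()
        if not counting:
            if text == start_marker:
                counting = True   # counting starts with the next line
        elif text == end_marker:
            break                 # the next story begins: stop
        else:
            count += 1
    return count

# Made-up sample; the marker strings are hypothetical.
sample = ["THE RAVEN\n", "\n", "Once upon a midnight dreary,\n",
          "while I pondered, weak and weary,\n", "\n",
          "THE MASQUE OF THE RED DEATH\n", "\n"]
print(count_lines_between(sample, "THE RAVEN", "THE MASQUE OF THE RED DEATH"))  # 4
```

Because an open file can be iterated over like a list of lines, the same function would also accept a file object as its first argument.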

### Writing files

We still open the file using `open()`. By default, `open()` opens a file for reading; we tell Python that we want to write to the file by saying `open(filename, 'w')`. If a file with that name already exists, its previous contents are overwritten:

In [54]:
f = open("myoutfile.txt", 'w')
#f.write("hello")
print("hello", file=f)
f.write("world\n")
f.close()
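To check what ended up in the file, we can read it back. The sketch below writes to a file in the system's temporary directory (the file name `myoutfile_demo.txt` is made up) and also shows the mode `'a'`, which appends to an existing file instead of overwriting it:

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "myoutfile_demo.txt")

with open(path, "w") as f:    # "w" creates the file or empties an existing one
    print("hello", file=f)    # print() adds the newline itself
    f.write("world\n")        # write() does not, so we add "\n" explicitly

with open(path, "a") as f:    # "a" appends to the end of the file
    f.write("again\n")

with open(path) as f:
    contents = f.read()
print(repr(contents))         # 'hello\nworld\nagain\n'
os.remove(path)
```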

## Objects

Suppose we are going to handle various characters from Game of Thrones. Specifically, we have the following pieces of information:
* name
* gender
* house
* episode introduced
* episode died
* quotes

For example,
* name: `Tyrion Lannister`
* gender: `Male`
* house: `Lannister`
* episode introduced: `1`
* episode died: `None`
* quotes:

```
["It was your blade I needed, not your love.",
 "I am not questioning your honor, I am denying its existence.",
 "It’s not easy being drunk all the time. If it were easy, everyone would do it.",
 ...]
```

There are several ways to do this; what are they?
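One of these ways, sketched here as a point of comparison, is a dictionary with one key per piece of information. The key names are chosen for this example only. It works, but nothing guarantees that every character has the same keys, and a misspelled key is only noticed when the lookup runs:

```python
tyrion = {
    "name": "Tyrion Lannister",
    "gender": "Male",
    "house": "Lannister",
    "episode_intro": 1,
    "episode_died": None,
    "quotes": ["It was your blade I needed, not your love."],
}

print(tyrion["house"])       # Lannister
print(tyrion.get("hosue"))   # None: the typo goes unnoticed here
```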

Let's talk about creating our own **type** of data: `GoTCharacter`. A user-defined type is also called a class. A class definition looks like this:

In [2]:
class GoTCharacter(object):
    """A Game of Thrones character"""

This header indicates that the new class is a `GoTCharacter`, which is a kind of `object`, which is a built-in generic type.

We can now associate the various **attributes** with the user-defined GoTCharacter type:

In [7]:
class GoTCharacter(object):
    """A Game of Thrones character"""
    
    def __init__(self):
        self.name = ""
        self.gender = ""
        self.house = ""
        self.episode_intro = None
        self.episode_died = None
        self.quotes = []

To create an object of type `GoTCharacter` and store it in a variable, we call the class like a function:

In [8]:
tyrion = GoTCharacter()

We can then populate the tyrion **object** with information:

In [9]:
tyrion.name = "Tyrion"
tyrion.gender = "Male"
tyrion.house = "Lannister"
tyrion.episode_intro = 1
tyrion.episode_died = None
tyrion.quotes = ["It was your blade I needed, not your love.",
  "I am not questioning your honor, I am denying its existence.",
  "It’s not easy being drunk all the time. If it were easy, everyone would do it.",
  ...]

We can require that whenever a GoTCharacter gets created, name/gender/house information has to be available:

In [11]:
class GoTCharacter(object):
    """A Game of Thrones character"""
    
    def __init__(self, name, gender, house): ## note the syntax
        self.name = name
        self.gender = gender
        self.house = house
        self.episode_intro = 0
        self.episode_died = 80
        self.quotes = []
        
        


Attributes are mutable:

In [14]:
jon = GoTCharacter("Jon", "Male", "Stark")
tyrion.quotes.append("I need to speak to someone with hair")
print(tyrion.quotes)

['It was your blade I needed, not your love.', 'I am not questioning your honor, I am denying its existence.', 'It’s not easy being drunk all the time. If it were easy, everyone would do it.', Ellipsis, 'I need to speak to someone with hair']


Exercise: given a list of GoTCharacters `character_list`, calculate the average length of survival (the number of episodes between a character's introduction and death).
Use a list comprehension!

In [15]:
import numpy as np
np.mean([1,2,3])

2.0

In [16]:
def get_avg_lifespan(character_list):
    pass

We could put functions inside a class; these functions are called **methods**:

In [None]:
class GoTCharacter(object):
    """A Game of Thrones character"""
    
    def __init__(self, name, gender, house):
        self.name = name
        self.gender = gender
        self.house = house
        self.episode_intro = 0
        self.episode_died = 0
        self.quotes = []
        
    def get_lifespan(self):
        # lifespan = episode of death minus episode of introduction
        return self.episode_died - self.episode_intro
        
    


Now we can modify the `get_avg_lifespan` function:

In [None]:
def get_avg_lifespan(character_list):
    pass
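Here is one possible solution, as a sketch rather than the only answer. To keep it short and self-contained, the class is trimmed down to the attributes the example needs, and the constructor takes the episode numbers directly (an assumption made for this example). The episode numbers are made up. The average is computed with `sum()` and `len()`; `np.mean` would work just as well.

```python
class GoTCharacter(object):
    """A Game of Thrones character, trimmed down for this example"""

    def __init__(self, name, episode_intro, episode_died):
        self.name = name
        self.episode_intro = episode_intro
        self.episode_died = episode_died

    def get_lifespan(self):
        # episode of death minus episode of introduction
        return self.episode_died - self.episode_intro


def get_avg_lifespan(character_list):
    # One lifespan per character, collected with a list comprehension.
    lifespans = [c.get_lifespan() for c in character_list]
    return sum(lifespans) / len(lifespans)


# Made-up characters and episode numbers, purely for illustration.
characters = [GoTCharacter("A", 1, 11), GoTCharacter("B", 5, 25)]
avg = get_avg_lifespan(characters)
print(avg)   # 15.0
```

Note that this sketch assumes every character in the list has died; a character whose `episode_died` is `None` would need to be filtered out first.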