<figure>
  <IMG SRC="../../logo/logo.png" WIDTH=250 ALIGN="right">
</figure>

# IHE Python course, 2017

## Reading text files

T.N.Olsthoorn, Feb 27, 2017

Reading and writing files is one of the essential functions of computers, as it is the connection between volatile data in the computer's memory and permanent data on disk or in the cloud.

Three ways of reading text files are shown, and we'll illustrate how to handle their contents. We'll take the README.md file as an example, as it is readily available on this site.

## Finding the file in the directory tree
The first step is finding the file in the directory structure. We'll start with an illustration of how this can be done using some functions in the `os` module, which deal with directories.

In [1]:
import os

In [2]:
os.listdir()  # make a list of the files in the current directory, so that we may handle them.

['.DS_Store',
 '.ipynb_checkpoints',
 'dealingWithStrings.ipynb',
 'introspection.ipynb',
 'readingText.ipynb']

Turn the relative reference to the current directory, '.', into an absolute path.

Then specify the file name that we're looking for.

Then, in a while-loop, look upward to an ever higher directory in the directory tree until we have found the one that contains our file.

Walking upward in the tree is done by cutting off the tail of the path in each step.

In [3]:
apth = os.path.abspath('.')

fname = 'README.md'

print("Starting in folder <{}>,\n to look for file <{}> ...".format(apth, fname))
print()

print("I'm searching: ...")
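while fname not in os.listdir(apth):
    parent = os.path.dirname(apth)  # cut off the tail of the path
    if parent == apth:              # reached the root: stop searching
        break
    apth = parent
    print(apth)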
    
if fname in os.listdir(apth):
    print("... Yep, got'm!")
else:
    print("... H'm, missed him")
    
print("\nOk, file <{}> is in folder: ".format(fname), apth)
print('\nHere is the list of files in this folder:')
os.listdir(apth)

Starting in folder </Users/Theo/GRWMODELS/Python_projects/IHEcourse2017/exercises/Mar07>,
 to look for file <README.md> ...

I'm searching: ...
/Users/Theo/GRWMODELS/Python_projects/IHEcourse2017/exercises
/Users/Theo/GRWMODELS/Python_projects/IHEcourse2017
... Yep, got'm!

Ok, file <README.md> is in folder:  /Users/Theo/GRWMODELS/Python_projects/IHEcourse2017

Here is the list of files in this folder:


['.DS_Store',
 '.git',
 '.gitignore',
 '.ipynb_checkpoints',
 'exercises',
 'IMG_20170221_155447496.jpg',
 'IntroPythonCourse.pptx',
 'LICENSE',
 'logo',
 'mbakker7-exploratory_computing_with_python-ca15088',
 'photos',
 'README.md',
 'sandbox']
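
The same upward search can also be written with the standard-library `pathlib` module. What follows is only a sketch of an alternative to the loop above, not the method used in this notebook; the function name `find_upward` is hypothetical. It returns `None` when the file is not found anywhere between the starting folder and the root.

```python
from pathlib import Path


def find_upward(fname, start="."):
    """Return the nearest folder at or above `start` that contains `fname`.

    Returns None when no such folder exists (hypothetical helper).
    """
    start = Path(start).resolve()
    # Try start itself first, then its parent, grandparent, ... up to the root
    for folder in [start, *start.parents]:
        if (folder / fname).is_file():
            return folder
    return None


folder = find_upward("README.md")
print(folder)  # a Path object, or None when README.md was not found
```

Returning `None` leaves it to the caller to decide what to do when the file is missing.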

## Reading the file (at once)

When we know where the file is, we can open it for reading.

Opening the file yields a reader object, through which we can read the file.

`reader = open(path, 'r')`

`s = reader.read()`

`reader.close()`

The problem with this is that, when we are exploring the reader, we easily reach the end of the file, after which nothing more is read and `s` is an empty string. Furthermore, when we experiment, we may easily open the same file many times and forget to close it.

The with statement is a solution to that, because it automatically closes the file when we finish its block.

With the with statement we may read the entire file into a string like so.

In [4]:
with open(os.path.join(apth, fname), 'r') as reader:
    s = reader.read()

It's `read()` that swallows the entire file at once and dumps its contents in the string `s`.

Check that the reader is, indeed, closed after we finished the `with` block:

In [5]:
reader.closed

True
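
Both pitfalls mentioned above can be reproduced in a few lines. This is a small sketch that works on a temporary file, so that it runs anywhere; the file name `demo.txt` and the variable names are made up for the illustration.

```python
import os
import tempfile

# Write a small throw-away file to experiment with (demo.txt is a made-up name)
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as writer:
    writer.write("first line\nsecond line\n")

reader = open(path, "r")
first = reader.read()    # reads everything up to the end of the file
second = reader.read()   # we are already at the end: nothing left to read
reader.close()           # easy to forget when opening by hand

print(repr(first))   # 'first line\nsecond line\n'
print(repr(second))  # ''

with open(path, "r") as reader:
    s = reader.read()
print(reader.closed)  # True
```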

Then show the contents of the string `s`:

In [6]:
print(s)

# IHE-python-course-2017
Material for the extracurricular python course given at UNESCO-IHE from 21-Feb-2017

Upon request of the virtually entire student community studying at UNESCO-IHE in Delft in 2017, an extracurricular course on Python was given as it was realized that learning python is an essential asset for future engineers and scientists.

Next to my own exercises, we'll borrow much of the exercises Prof. Mark Bakker's tutorials for all students at the faculty of CEGS at TUDelft, which are availalbe on github. Other material from the internet may also be used.

The focus is on the needs of scientists and engineers.

The students are supposed to install the Anaconda packages on their laptop and the lessons will be on Tuesdays 16:45-17:30 as from Feb 21, 2017 in the auditorium of IHE at the Westvest in Delft.

For your own practicing I advice the exercises in the book (2015) **"Automate the Boring Stuff with Python"** by Al Sweigart. The book can be read free on the Internet. T

## Counting words and phrases

Now that we have the entire file read into a single string, `s`, we can just as well analyze it a bit, by counting the number of words, letters, and the frequency of each letter.

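Just split the string into words at each space and count them. (Calling `s.split()` without arguments would instead split on any run of whitespace, including newlines, and can give a different count.)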

In [7]:
print("There are {} words in file {}".format(len(s.split(sep=' ')), fname))

There are 362 words in file README.md


We might estimate the number of sentences by counting the number of periods '.'

One way is to use the `.` as a separator:

In [8]:
nPhrases = len(s.split(sep='.'))  # also works without the keyword sep

print("We find {} phrases in file {}".format(nPhrases, fname))

We find 24 phrases in file README.md


We could just as well count the number of dots in `s` directly, using one of the string methods, in this case `s.count()`:

In [9]:
print("There are {} dots in file {}".format(s.count('.'), fname))

There are 23 dots in file README.md
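
The difference between these ways of counting is easiest to see on a short string. The sketch below uses a made-up sample text instead of the README; it shows that splitting on `' '` also produces empty strings where two spaces follow each other, and that splitting on `'.'` yields one piece more than there are dots.

```python
text = "One sentence.  Two  spaces here.\nA new line."  # made-up sample

by_space = text.split(sep=" ")  # splits at every single space
by_white = text.split()         # splits on any run of whitespace

print(by_space)
print(by_white)
print(len(by_space), len(by_white))  # 9 8

n_pieces = len(text.split("."))
n_dots = text.count(".")
print(n_pieces, n_dots)  # 4 3
```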


## Counting the non-whitespace characters

Now let's see how many non-whitespace characters there are in s.

A coarse way to remove whitespace would be splitting s and rejoining the obtained list of words without any whitespace like so:

In [14]:
s1 = "".join(s.split()).lower()  # also convert all letters to lowercase
print(s1)

#ihe-python-course-2017materialfortheextracurricularpythoncoursegivenatunesco-ihefrom21-feb-2017uponrequestofthevirtuallyentirestudentcommunitystudyingatunesco-iheindelftin2017,anextracurricularcourseonpythonwasgivenasitwasrealizedthatlearningpythonisanessentialassetforfutureengineersandscientists.nexttomyownexercises,we'llborrowmuchoftheexercisesprof.markbakker'stutorialsforallstudentsatthefacultyofcegsattudelft,whichareavailalbeongithub.othermaterialfromtheinternetmayalsobeused.thefocusisontheneedsofscientistsandengineers.thestudentsaresupposedtoinstalltheanacondapackagesontheirlaptopandthelessonswillbeontuesdays16:45-17:30asfromfeb21,2017intheauditoriumofiheatthewestvestindelft.foryourownpracticingiadvicetheexercisesinthebook(2015)**"automatetheboringstuffwithpython"**byalsweigart.thebookcanbereadfreeontheinternet.theexcercisesareassimpleaspossible,systematicallyaranged,ineachstepexplainingonenewaspect,everythinginitisextremelyclearlyexplainedandthebookisevenrelaxedreading.eachchapt

## All characters in a list

If we convert a string into a list, we get the list of its individual characters.

In [15]:
list(s1)

['#',
 'i',
 'h',
 'e',
 '-',
 'p',
 'y',
 't',
 'h',
 'o',
 'n',
 '-',
 'c',
 'o',
 'u',
 'r',
 's',
 'e',
 '-',
 '2',
 '0',
 '1',
 '7',
 'm',
 'a',
 't',
 'e',
 'r',
 'i',
 'a',
 'l',
 'f',
 'o',
 'r',
 't',
 'h',
 'e',
 'e',
 'x',
 't',
 'r',
 'a',
 'c',
 'u',
 'r',
 'r',
 'i',
 'c',
 'u',
 'l',
 'a',
 'r',
 'p',
 'y',
 't',
 'h',
 'o',
 'n',
 'c',
 'o',
 'u',
 'r',
 's',
 'e',
 'g',
 'i',
 'v',
 'e',
 'n',
 'a',
 't',
 'u',
 'n',
 'e',
 's',
 'c',
 'o',
 '-',
 'i',
 'h',
 'e',
 'f',
 'r',
 'o',
 'm',
 '2',
 '1',
 '-',
 'f',
 'e',
 'b',
 '-',
 '2',
 '0',
 '1',
 '7',
 'u',
 'p',
 'o',
 'n',
 'r',
 'e',
 'q',
 'u',
 'e',
 's',
 't',
 'o',
 'f',
 't',
 'h',
 'e',
 'v',
 'i',
 'r',
 't',
 'u',
 'a',
 'l',
 'l',
 'y',
 'e',
 'n',
 't',
 'i',
 'r',
 'e',
 's',
 't',
 'u',
 'd',
 'e',
 'n',
 't',
 'c',
 'o',
 'm',
 'm',
 'u',
 'n',
 'i',
 't',
 'y',
 's',
 't',
 'u',
 'd',
 'y',
 'i',
 'n',
 'g',
 'a',
 't',
 'u',
 'n',
 'e',
 's',
 'c',
 'o',
 '-',
 'i',
 'h',
 'e',
 'i',
 'n',
 'd',
 'e'

## The unique characters in a set

By turning the string into a set, we get the set of its unique characters:

In [16]:
set(s1)

{'"',
 '#',
 "'",
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 ':',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z'}

## The number of occurrences of each non-white character in the file

To count the frequency of each character we could use those from the set as keys in a dict. We can generate the dict with the frequency of each character in a dict comprehension that combines the unique character as a key with the method `count(key)` applied to `s1`, the string without whitespace:

In [17]:
from pprint import pprint  # pretty printer from the standard library

ccnt = {c : s1.count(c) for c in set(s1)}
pprint(ccnt)

Let's order the characters by their frequency of occurrence in the file.

We can do so in one line, but this needs some explanation.

First, we generate a list from the dict in which each item is a list of 2 items, namely [char, number].

Second, we apply `sorted` to that list to get a sorted list. But we don't want it sorted by character but by number. Therefore, we use the `key` argument. It tells `sorted` that items are to be compared on their second value (`lambda x: x[1]`).

Finally, this yields the list that we want, but with the largest frequency at the bottom. So we turn this list upside down by using the slice [::-1] at the end.

Here it is:

In [159]:
sorted([[k, ccnt[k]] for k in ccnt.keys()], key=lambda x: x[1])[::-1]

[['e', 236],
 ['t', 186],
 ['s', 140],
 ['a', 131],
 ['o', 122],
 ['n', 120],
 ['i', 118],
 ['r', 102],
 ['h', 78],
 ['l', 77],
 ['u', 56],
 ['c', 56],
 ['d', 55],
 ['f', 46],
 ['p', 36],
 ['w', 33],
 ['m', 32],
 ['b', 29],
 ['y', 28],
 ['g', 24],
 ['.', 23],
 [',', 19],
 ['k', 17],
 ['v', 15],
 ['x', 14],
 ['*', 12],
 ['2', 11],
 ['1', 11],
 ['0', 8],
 ['-', 8],
 ['7', 6],
 ["'", 5],
 ['q', 4],
 [':', 3],
 ['z', 3],
 ['5', 2],
 ['j', 2],
 ['"', 2],
 ['#', 1],
 ['4', 1],
 ['8', 1],
 ['3', 1],
 ['6', 1],
 ['(', 1],
 [')', 1]]
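
The standard library also offers `collections.Counter`, which does the counting and the ordering by frequency in one go. The sketch below is an alternative to the dict comprehension above, not the approach used in this notebook; it uses a short stand-in string rather than the README text.

```python
from collections import Counter

# Short stand-in for s1: no whitespace, all lowercase
s1 = "".join("Reading text files".split()).lower()

ccnt = Counter(s1)           # dict-like: character -> number of occurrences
print(ccnt["e"])             # 3
print(ccnt.most_common(1))   # [('e', 3)]
print(ccnt.most_common())    # all characters, most frequent first
```

Note that `most_common()` returns tuples instead of 2-item lists, and that characters with equal counts may come out in a different order than with the reversed sorted list.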

## Reading the file and returning a list of strings, one per line

For this we use `reader.readlines()` instead of `reader.read()`:

In [18]:
with open(os.path.join(apth, fname), 'r') as reader:
    s = reader.readlines()

type(s)

list

In [19]:
pprint(s)



From this point onward, you can analyse each line in sequence, pick out lines, etc.
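
As a small sketch of what such an analysis may look like, the lines below strip the trailing newline from each line, drop the empty lines and pick out the heading lines. The list `lines` is a made-up stand-in for the result of `readlines()`.

```python
# Made-up stand-in for the list returned by reader.readlines()
lines = ["# Title\n", "First paragraph.\n", "\n", "Second paragraph.\n"]

stripped = [line.rstrip("\n") for line in lines]          # drop the newlines
non_empty = [line for line in stripped if line.strip()]   # skip blank lines
headings = [line for line in stripped if line.startswith("#")]

for nr, line in enumerate(stripped, start=1):  # number the lines from 1
    print(nr, repr(line))

print(len(non_empty), headings)  # 3 ['# Title']
```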

## Reading a single line and lines one by one

Often you don't want to read the entire file into memory (into a single string) at once. It might exhaust the computer's memory if the file size were gigabytes, as can easily be the case with the output of some models. And even if it doesn't run out of memory, your PC may still become very slow with large files. So a better and more generally applicable way to read a file is line by line, based on the newline characters that are embedded in it.

In that case you can read the file line by line, one line at a time, using not `reader.read()` or `reader.readlines()` but `reader.readline()`:

In [191]:
with open(os.path.join(apth, fname), 'r') as reader:
    s = reader.readline()

type(s)

str

In [192]:
print(s)

# IHE-python-course-2017



This yields a string, the first line of the file in this case.

The problem now is that no more lines can be read from this file, because with the `with` statement, the file closes automatically as soon as Python reaches the end of its block:

In [193]:
s = reader.readline()

ValueError: I/O operation on closed file.

Therefore, we should either not use the `with` statement and close the file by hand when we're done, or put everything that we do with the strings that we read inside the `with` block.

We may be tempted to put the reader in a while-loop like so

s=[]
while True:
    s.append(reader.readline())
    
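But don't do that, because the while-loop will never end: at the end of the file, `readline()` just keeps returning an empty string. We therefore have to test for the empty string and break out of the loop ourselves: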

In [194]:
with open(os.path.join(apth, fname), 'r') as reader:
    lines = []
    while True:
        s = reader.readline()
        if s=="":
            break
        lines.append(s)

pprint(lines)


['# IHE-python-course-2017\n',
 'Material for the extracurricular python course given at UNESCO-IHE from '
 '21-Feb-2017\n',
 '\n',
 'Upon request of the virtually entire student community studying at '
 'UNESCO-IHE in Delft in 2017, an extracurricular course on Python was given '
 'as it was realized that learning python is an essential asset for future '
 'engineers and scientists.\n',
 '\n',
 "Next to my own exercises, we'll borrow much of the exercises Prof. Mark "
 "Bakker's tutorials for all students at the faculty of CEGS at TUDelft, which "
 'are availalbe on github. Other material from the internet may also be '
 'used.\n',
 '\n',
 'The focus is on the needs of scientists and engineers.\n',
 '\n',
 'The students are supposed to install the Anaconda packages on their laptop '
 'and the lessons will be on Tuesdays 16:45-17:30 as from Feb 21, 2017 in the '
 'auditorium of IHE at the Westvest in Delft.\n',
 '\n',
 'For your own practicing I advice the exercises in the book (2015) 

In [195]:
reader.readline?
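
A file object that was opened for reading can also be iterated over directly in a for-loop. It then yields one line at a time and stops by itself at the end of the file, so no test for the empty string is needed. The sketch below again uses a made-up temporary file, `demo.txt`, so that it runs anywhere.

```python
import os
import tempfile

# Throw-away file with three lines (demo.txt is a made-up name)
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as writer:
    writer.write("line 1\nline 2\nline 3\n")

n_lines = 0
n_chars = 0
with open(path, "r") as reader:
    for line in reader:                    # one line at a time
        n_lines += 1
        n_chars += len(line.rstrip("\n"))  # do not count the newline

print(n_lines, n_chars)  # 3 18
```

Only one line is held in memory at a time, which is what makes this suitable for very large files.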

Of course, there is much, much more, but this is probably the most important basic knowledge about file reading. Writing text files is straightforward: you open a file with `open(fname, 'w')` for writing or `open(fname, 'a')` for appending, and you can start writing lines to it. Don't forget to close it when done. Better still, use the `with` statement to make sure that the file is automatically closed when its block is done.
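
A minimal sketch of writing and appending, using a temporary file with the made-up name `notes.txt`:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")  # made-up file name

with open(path, "w") as writer:     # 'w' creates the file or overwrites it
    writer.write("first line\n")    # write() does not add a newline itself

with open(path, "a") as writer:     # 'a' appends to the end of the file
    writer.write("second line\n")

with open(path, "r") as reader:
    content = reader.read()
print(content)
```

Be careful with `'w'`: it empties an existing file as soon as it is opened.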