# <center><b>Python for Data Science</b></center>
# <center><b>Lesson 16</b></center>
# <center><b>File Formats and File I/O</b></center>

<center><i>Adapted from:</i></center>
<center>*****************</center>

<center>How to Think Like a Computer Scientist: Interactive Edition</center>
<center>Reading Files with Python (Stack Abuse)</center>

<b>Resources:<br></b>

- [How to Think Like a Computer Scientist: Interactive Edition](https://runestone.academy/ns/books/published/thinkcspy/index.html)
- [Reading Files with Python (StackAbuse)](https://stackabuse.com/reading-files-with-python/)
***
- [Python File I/O (Programiz)](https://www.programiz.com/python-programming/file-operation)
- [Best Explanation of Python File I/O(Input/Output) with Examples](https://www.cyberithub.com/python-file-io-input-output-with-examples/)
- [File Handling In Python, Python File IO, Python Read & Write Files (Simplilearn)](https://www.youtube.com/watch?v=DmHSwTiD5Tk)
- [Python File Handling - How to Create, Open, Read & Write (Intellipaat)](https://intellipaat.com/blog/tutorial/python-tutorial/python-file-handling-i-o/)
- [File Read Write (Stanford Computer Science)](https://cs.stanford.edu/people/nick/py/python-file.html)

##  <span style="color:green">TABLE OF CONTENTS</span>

1. [Working with Data Files](#1)<br>
2. [Finding a File in Your File System](#2)<br>
3. [Reading a File](#3)<br>
4. [Iterating Over Lines in a File](#4)<br>
5. [Alternative File Reading Methods](#5)<br>
a. [Using a <b>while</b> Loop to Read a File](#5a)<br>
6. [Writing Text Files](#6)<br>
7. [Using <b>with</b> for Files](#7)<br>

In [6]:
# set up notebook to display multiple output in one cell

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

print('The notebook is set up to display multiple output in one cell.')

The notebook is set up to display multiple output in one cell.


<div class="alert alert-block alert-warning">
    <b><font size="4">Files needed for this presentation:</font></b>
</div>

[ccdata.txt](https://drive.google.com/file/d/1xU9pxWJ01GXyHDbSBjb9Tp9yjGzOYQKD/view?usp=share_link)<br>
[mydata.txt](https://drive.google.com/file/d/1L3WInlmf9i2siOHLYuYAa5RoJo7QsSyb/view?usp=share_link)

<a class="anchor" id="1"></a>
<div class="alert alert-block alert-info">
<b><font size="4">1. Working with Data Files</font></b></div>

- So far, the data we have used in this course has either been coded directly into the program or entered by the user. In real life, data usually resides in files. For example, images live in files on your hard drive. Web pages, word processing documents, and music are other examples of data that live in files. In this unit we will introduce the Python concepts necessary to use data from files in our programs.
<p>&nbsp;</p>
- For our purposes, we will assume that our data files are text files–that is, files filled with characters. The Python programs that you write are stored as text files. 
<p>&nbsp;</p>
- We can create these files in any of a number of ways. For example, we could use a text editor to type in and save the data. We could also download the data from a website and then save it in a file. Regardless of how the file is created, Python will allow us to manipulate the contents.
<p>&nbsp;</p>
- In Python, we must open files before we can use them and close them when we are done with them. As you might expect, once a file is opened it becomes a Python object just like all other data. The table below shows the functions and methods that can be used to open and close files.

![files1.PNG](attachment:files1.PNG)

<a class="anchor" id="2"></a>
<div class="alert alert-block alert-info">
<b><font size="4">2. Finding a File in Your File System</font></b></div>

- Opening a file requires that you, as a programmer, and Python agree about the location of the file on your disk. 
- The way that files are located on disk is by their path. 
- You can think of the filename as the short name for a file, and the path as the full name. 
- For example, on a Mac, if you save the file hello.txt in your home directory, the path to that file is /Users/yourname/hello.txt. On a Windows machine the path looks a bit different, but the same principles apply. For example, on Windows the path might be C:\Users\yourname\My Documents\hello.txt.

![path%20separaters.PNG](attachment:path%20separaters.PNG)

- You can access files in sub-folders, also called directories, under your home directory by adding a slash and the name of the folder. 
- For example, if you had a file called hello.py in a folder called CS150 that is inside a folder called PyCharmProjects under your home directory, then the full name for the file hello.py is /Users/yourname/PyCharmProjects/CS150/hello.py. 
- This is called an absolute file path. 
- An absolute file path typically only works on a specific computer. Think about it for a second. What other computer in the world is going to have an absolute file path that starts with /Users/yourname?

***

- If a file is not in the same folder as your Python program, you need to tell the computer how to reach it. 
- A relative file path starts from the current working directory, which is normally the folder that contains your Python program or notebook, and follows the computer’s file hierarchy. 
- A file hierarchy contains folders, which contain files and other sub-folders. 
- Specifying a sub-folder is easy – you simply specify the sub-folder’s name. 
- To specify a parent folder you use the special .. notation because every file and folder has one unique parent. 
- You can use the .. notation multiple times in a file path to move multiple levels up a file hierarchy. 
- Here is an example file hierarchy that contains multiple folders, files, and sub-folders. 
- Folders in the diagram are displayed in bold type.

![a1-2.jpg](attachment:a1-2.jpg)

Using the example file hierarchy above, the program, <mark>myPythonProgram.py</mark> could access each of the data files using the following relative file paths:

- data1.txt

- ../myData/data2.txt

- ../myData/data3.txt

- ../../otherFiles/extraData/data4.txt

<b>Here’s the important rule to remember:</b> 

- If your file and your Python program are in the same directory you can simply use the filename like this: <mark>open('myfile.txt', 'r')</mark>. 
- If your file and your Python program are in different directories then you must refer to one or more directories, either in a relative file path to the file like this: <mark>open('../myData/data3.txt', 'r')</mark>, or in an absolute file path like <mark>open('/users/bmiller/myFiles/allProjects/myData/data3.txt', 'r')</mark>.
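The difference between relative and absolute paths can be checked from Python itself. The sketch below is only an illustration using the standard-library `pathlib` module, which this lesson does not otherwise cover. The path `../myData/data3.txt` is borrowed from the example hierarchy above and does not need to exist on your machine.

```python
from pathlib import Path

# Relative paths are interpreted starting from the current working directory.
cwd = Path.cwd()
print("Current working directory:", cwd)

# A relative path from the example hierarchy (the file does not need to exist).
relative = Path("../myData/data3.txt")
print(relative.is_absolute())          # False

# Joining it to the working directory gives the absolute path Python would use.
absolute = (cwd / relative).resolve()
print(absolute.is_absolute())          # True
print(absolute)
```

The last line prints a different result on every computer, which is exactly why absolute paths are rarely portable.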

For more information on how to find files in your file system, you can refer to the document below:

[Finding a File in Your File System](https://docs.google.com/document/d/1qNj9PSRvaxngbpEPlzyDZMsTXFd8d5sKbx3zolfb6OU/edit?usp=sharing)

<a class="anchor" id="3"></a>
<div class="alert alert-block alert-info">
<b><font size="4">3. Reading a File</font></b></div>

- As an example, suppose we have a text file called <mark>ccdata.txt</mark> that contains the following data representing statistics about climate change. The format of the data file is as follows:

![ccdata.PNG](attachment:ccdata.PNG)


- Although it would be possible to consider entering this data by hand each time it is used, you can imagine that it would be time-consuming and error-prone to do this. In addition, it is likely that there could be data from more sources and other years.

***

- To open this file, we would call the <mark>open</mark> function. The variable, <mark>fileref</mark>, now holds a reference to the file object returned by <mark>open</mark>. 
- When we are finished with the file, we can close it by using the <mark>close</mark> method. After the file is closed, any further attempt to read from or write to fileref will result in an error (a ValueError).

In [14]:
# To open the ccdata.txt file

fileref = open("ccdata.txt", "r")

In [None]:
# To close the ccdata.txt file

fileref.close()

<b>Note</b>

A common mistake is to get confused about whether you are providing a variable name or a string literal as an input to the open function. In the code above, “ccdata.txt” is a string literal that should correspond to the name of a file on your computer. 

If you put something without quotes, like open(x, "r"), it will be treated as a variable name. In this example, x should be a variable that’s already been bound to a string value like “ccdata.txt”.
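Both points above, passing a variable instead of a string literal and the error raised after closing, can be tried in one short cell. This is a minimal sketch: the file name `example.txt` and its single line of content are made up for illustration, and the file is created in a temporary folder so the cell runs on any machine.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere (hypothetical name).
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "example.txt")   # a variable bound to a string
setup = open(filename, "w")
setup.write("1850 -0.37 2.24E-7\n")
setup.close()

# No quotes around filename: Python looks up the variable's value.
fileref = open(filename, "r")
first_line = fileref.readline()
fileref.close()
print(fileref.closed)        # True

# Reading from a closed file raises ValueError.
try:
    fileref.readline()
    error_raised = False
except ValueError as err:
    error_raised = True
    print("Error:", err)
```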

<a class="anchor" id="4"></a>
<div class="alert alert-block alert-info">
<b><font size="4">4. Iterating Over Lines in a File</font></b></div>

- We will now use this text file, ccdata.txt, as input in a program that will do some data processing. 
- In the program, we will read each line of the file and print it with some additional text. 
- Because text files are sequences of lines of text, we can use the <mark>for loop</mark> to iterate through each line of the file.

- A line of a file is defined to be a sequence of characters up to and including a special character called the <b>newline character</b>. 
- If you evaluate a string that contains a newline character you will see the character represented as \n. If you print a string that contains a newline you will not see the \n, you will just see its effects. When you are typing a Python program and you press the enter or return key on your keyboard, the editor inserts a newline character into your text at that point.

- As the <b>for loop</b> iterates through each line of the file the loop variable will contain the current line of the file as a string of characters. 
- The general pattern for processing each line of a text file is as follows:

![files2.PNG](attachment:files2.PNG)
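In code, the pattern looks roughly like the sketch below. The file `sample.txt` and its two lines are invented for the example. Printing `repr(line)` next to `print(line)` also shows the difference between seeing the newline character as `\n` and seeing only its effect.

```python
import os
import tempfile

# Build a two-line sample file (hypothetical name and contents).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
out = open(path, "w")
out.write("first line\nsecond line\n")
out.close()

infile = open(path, "r")
collected = []
for line in infile:              # one iteration per line of the file
    collected.append(line)       # each line still ends with its newline
    print(repr(line))            # repr shows the \n explicitly
    print(line)                  # print shows only the newline's effect
infile.close()
```

Because each line already ends in a newline and `print` adds its own, the printed lines appear double-spaced.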

### Review: Python String split() Method

The split() method splits a string into a list of substrings. Called with no arguments, it splits on runs of whitespace, so each word becomes a list item.

In [11]:
sample_text = "today is election day"

x = sample_text.split()

print(x)
print(type(x))

['today', 'is', 'election', 'day']
<class 'list'>
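The lines of ccdata.txt separate their columns with many spaces, so it helps to see how `split()` treats repeated whitespace. The sketch below uses one row in the same layout as the file. The comparison with `split(" ")` is added only for contrast and is not part of the lesson's own code.

```python
row = "1850                  -0.37                                       2.24E-7\n"

# With no argument, split() treats any run of whitespace as one separator
# and ignores the trailing newline.
fields = row.split()
print(fields)            # ['1850', '-0.37', '2.24E-7']

# With an explicit separator, every single space counts, producing empty strings.
pieces = row.split(" ")
print(len(pieces))
```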


- To process all of our climate change data, we will use a <mark>for loop</mark> to iterate over the lines of the file. 

- Using the <b>split method</b>, we can break each line into a list containing all the fields of interest about climate change. 

- We can then take the values corresponding to year, global average temperature, and global emissions to construct a simple sentence.

In [8]:
# Iterating over the lines of a file using a for loop

ccfile = open("ccdata.txt", "r")

for aline in ccfile:
    values = aline.split()
    print(values)
    print(type(values))
    print('In', values[0], 'the average temp. was', values[1], '°C and CO2 emissions were', values[2], 'gigatons.')
    print()
    
ccfile.close()

['1850', '-0.37', '2.24E-7']
<class 'list'>
In 1850 the average temp. was -0.37 °C and CO2 emissions were 2.24E-7 gigatons.

['1860', '-0.34', '3.94E-7']
<class 'list'>
In 1860 the average temp. was -0.34 °C and CO2 emissions were 3.94E-7 gigatons.

['1870', '-0.28', '6.6E-7']
<class 'list'>
In 1870 the average temp. was -0.28 °C and CO2 emissions were 6.6E-7 gigatons.

['1880', '-0.24', '1.1']
<class 'list'>
In 1880 the average temp. was -0.24 °C and CO2 emissions were 1.1 gigatons.

['1890', '-0.42', '1.72']
<class 'list'>
In 1890 the average temp. was -0.42 °C and CO2 emissions were 1.72 gigatons.

['1900', '-0.2', '2.38']
<class 'list'>
In 1900 the average temp. was -0.2 °C and CO2 emissions were 2.38 gigatons.

['1910', '-0.49', '3.34']
<class 'list'>
In 1910 the average temp. was -0.49 °C and CO2 emissions were 3.34 gigatons.

['1920', '-0.25', '4.01']
<class 'list'>
In 1920 the average temp. was -0.25 °C and CO2 emissions were 4.01 gigatons.

['1930', '-0.14', '4.53']
<class 'list'>
...
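Every item produced by `split()` is a string, even when it looks like a number. If the values are needed for arithmetic or comparisons, they can be converted first. The sketch below is an optional extension of the loop body, using one line copied from the output above. The variable names `year`, `temp`, and `emissions` are chosen here for illustration.

```python
# One line in the same layout as ccdata.txt (values copied from the output above).
aline = "1860                  -0.34                                       3.94E-7\n"

values = aline.split()
year = int(values[0])            # '1860' -> 1860
temp = float(values[1])          # '-0.34' -> -0.34
emissions = float(values[2])     # float() accepts scientific notation such as '3.94E-7'

print(year + 10)                 # arithmetic now works: 1870
print(temp < 0)                  # True
print(emissions)
```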

<a class="anchor" id="5"></a>
<div class="alert alert-block alert-info">
<b><font size="4">5. Alternative File Reading Methods</font></b></div>

In addition to the <mark>for loop</mark>, Python provides three methods to read data from the input file. 

1. The <b>readline method</b> reads one line from the file and returns it as a string. The string returned by readline will contain the newline character at the end. This method returns the empty string when it reaches the end of the file. 

2. The <b>readlines method</b> returns the contents of the entire file as a list of strings, where each item in the list represents one line of the file. 

3. It is also possible to read the entire file into a single string with <b>read</b>. 

The table below summarizes these methods.

![read%20methods.PNG](attachment:read%20methods.PNG)

<b>Note:</b>

A common error that novice programmers make is not realizing that all of these ways of reading the file contents use up the file. After you call readlines(), calling it again returns an empty list.

We need to reopen the file before each read so that we start from the beginning. Each file has a marker that denotes the current read position in the file. Any time one of the read methods is called the marker is moved to the character immediately following the last character returned. In the case of readline this moves the marker to the first character of the next line in the file. In the case of read or readlines the marker is moved to the end of the file.
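The effect of the marker can be observed directly. The sketch below uses a small stand-in file created on the fly (a hypothetical name and contents, not ccdata.txt). It also shows `seek(0)`, a file method not otherwise used in this lesson, which moves the marker back to the start as an alternative to reopening the file.

```python
import os
import tempfile

# Small stand-in file (hypothetical); ccdata.txt would behave the same way.
path = os.path.join(tempfile.mkdtemp(), "marker_demo.txt")
out = open(path, "w")
out.write("1850 -0.37\n1860 -0.34\n")
out.close()

infile = open(path, "r")
first = infile.readlines()    # marker is now at the end of the file
second = infile.readlines()   # nothing left to read
print(len(first), len(second))   # 2 0

infile.seek(0)                # move the marker back to the start
third = infile.readlines()
infile.close()
print(third == first)         # True
```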

In [28]:
# readline() example

infile = open("ccdata.txt", "r")
aline = infile.readline()   # reads only the first line, including its trailing newline
infile.close()
aline

'1850                  -0.37                                       2.24E-7\n'

In [13]:
# readlines() example

infile = open("ccdata.txt", "r")
linelist = infile.readlines()   # list with one string per line
infile.close()
print(len(linelist))
print()
print(linelist[0:4])

18

['1850                  -0.37                                       2.24E-7\n', '1860                  -0.34                                       3.94E-7\n', '1870                  -0.28                                       6.6E-7\n', '1880                  -0.24                                       1.1\n']


In [14]:
# read() example

infile = open("ccdata.txt", "r")
filestring = infile.read()   # the entire file as a single string
infile.close()
print(len(filestring))
print()
print(filestring[:256])

1281

1850                  -0.37                                       2.24E-7
1860                  -0.34                                       3.94E-7
1870                  -0.28                                       6.6E-7
1880                  -0.24        


<a class="anchor" id="5a"></a>
<div class="alert alert-block alert-info">
    <b><font size="4">a. Using a <b>while</b> Loop to Read a File</font></b></div>

- Another way to read a file is to use a <mark>while loop</mark>. This is important because many other programming languages do not support the for loop style for reading files, but they do support the pattern shown here.

In [26]:
infile = open("ccdata.txt", "r")
line = infile.readline()
while line:
    values = line.split()
    print(values)
    print(type(values))
    print('In', values[0], 'the average temp. was', values[1], '°C and CO2 emissions were', values[2], 'gigatons.')
    line = infile.readline()
    print()
    
infile.close()

['1850', '-0.37', '2.24E-7']
<class 'list'>
In 1850 the average temp. was -0.37 °C and CO2 emissions were 2.24E-7 gigatons.

['1860', '-0.34', '3.94E-7']
<class 'list'>
In 1860 the average temp. was -0.34 °C and CO2 emissions were 3.94E-7 gigatons.

['1870', '-0.28', '6.6E-7']
<class 'list'>
In 1870 the average temp. was -0.28 °C and CO2 emissions were 6.6E-7 gigatons.

['1880', '-0.24', '1.1']
<class 'list'>
In 1880 the average temp. was -0.24 °C and CO2 emissions were 1.1 gigatons.

['1890', '-0.42', '1.72']
<class 'list'>
In 1890 the average temp. was -0.42 °C and CO2 emissions were 1.72 gigatons.

['1900', '-0.2', '2.38']
<class 'list'>
In 1900 the average temp. was -0.2 °C and CO2 emissions were 2.38 gigatons.

['1910', '-0.49', '3.34']
<class 'list'>
In 1910 the average temp. was -0.49 °C and CO2 emissions were 3.34 gigatons.

['1920', '-0.25', '4.01']
<class 'list'>
In 1920 the average temp. was -0.25 °C and CO2 emissions were 4.01 gigatons.

['1930', '-0.14', '4.53']
<class 'list'>
...

- There are several important things to notice in this code:

- In line 2 we have the statement <mark>line = infile.readline()</mark>. We call this initial read the <b>priming read</b>. It is very important because the while condition needs to have a value for the line variable.

- The readline method will return the empty string if there is no more data in the file. An empty string is an empty sequence of characters. When Python is looking for a Boolean condition, as in while line:, it treats an empty sequence type as False, and a non-empty sequence as True. Remember that a blank line in the file actually has a single character, the \n character (newline). So, the only way that a line of data from the file can be empty is if you are reading at the end of the file, and the while condition becomes False.

- Finally, notice that the last line of the body of the while loop performs another readline. This statement will reassign the variable line to the next line of the file. It represents the change of state that is necessary for the iteration to function correctly. Without it, there would be an infinite loop processing the same line of data over and over.
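The truth values that drive the loop can be checked directly. The sketch below is a small illustration with an invented three-line file whose middle line is blank. It shows that the blank line does not stop the loop, while the empty string returned at the end of the file does.

```python
import os
import tempfile

# Sample file with a blank line in the middle (hypothetical contents).
path = os.path.join(tempfile.mkdtemp(), "blank_demo.txt")
out = open(path, "w")
out.write("alpha\n\nbeta\n")
out.close()

print(bool(""))      # False: empty string, as returned at end of file
print(bool("\n"))    # True: a blank line still holds one character

infile = open(path, "r")
count = 0
line = infile.readline()         # priming read
while line:                      # stops only when readline returns ''
    count = count + 1
    line = infile.readline()     # advance to the next line
infile.close()
print(count)                     # 3
```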

<a class="anchor" id="6"></a>
<div class="alert alert-block alert-info">
<b><font size="4">6. Writing Text Files</font></b></div>

- One of the most commonly performed data processing tasks is to read data from a file, manipulate it in some way, and then write the resulting data out to a new data file to be used for other purposes later. 

- To accomplish this, the open function discussed above can also be used to create a new file prepared for writing. 


- Note in the table above that the only difference between opening a file for writing and opening a file for reading is the use of the 'w' flag instead of the 'r' flag as the second parameter. 

- When we open a file for writing, a new, empty file with that name is created and made ready to accept our data. As before, the function returns a reference to the new file object.

- The table above shows one additional file method that we have not used thus far. The <b>write method</b> allows us to add data to a text file. 

- Recall that text files contain sequences of characters. We usually think of these character sequences as being the lines of the file, where each line ends with the newline \n character. Be very careful to notice that the write method takes one parameter, a string. When invoked, the characters of the string will be added to the end of the file. This means that it is the programmer’s job to include the newline characters as part of the string if desired.
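The two requirements of the write method, a single string argument and explicit newlines, can be seen in a short sketch. The file name `write_demo.txt` and the values written are assumptions made for this illustration only.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "write_demo.txt")   # hypothetical file name

outfile = open(path, "w")
year = 1850
temp = -0.37
# write() accepts a single string, so numbers must be converted with str().
outfile.write(str(year) + " " + str(temp) + "\n")
# Without '\n' the next write continues on the same line.
outfile.write("1860")
outfile.write(" -0.34\n")
outfile.close()

check = open(path, "r")
contents = check.read()
check.close()
print(repr(contents))     # '1850 -0.37\n1860 -0.34\n'
```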

***

- As an example, consider the ccdata.txt file once again. Assume that we have been asked to provide a file consisting of only the year and the global emissions for each record. The year should come first, followed by the global emissions, separated by a tab.

- To construct this file, we will approach the problem using an algorithm similar to the one above. After opening the file, we will iterate through the lines, break each line into its parts, choose the parts that we need, and then output them. Eventually, the output will be written to a file.

- The program below solves part of the problem. Notice that it reads the data and creates a string consisting of the year followed by the global emissions. In this example, we simply print the lines as they are created.

In [22]:
# write method example ... part 1 -- print lines of new file as they are created

infile = open("ccdata.txt", "r")
aline = infile.readline()
print("Year\tEmission\n")
while aline:
    items = aline.split()
    dataline = items[0] + '\t' + items[2]
    print(dataline)
    aline = infile.readline()

infile.close()

Year	Emission

1850	2.24E-7
1860	3.94E-7
1870	6.6E-7
1880	1.1
1890	1.72
1900	2.38
1910	3.34
1920	4.01
1930	4.53
1940	5.5
1950	6.63
1960	10.5
1970	16
1980	20.3
1990	22.6
2000	24.9
2010	32.7
2019	33.3


- When we run this program, we see the lines of output on the screen. 
- Once we are satisfied that it is creating the appropriate output, the next step is to add the necessary pieces to produce an output file and write the data lines to it. 
- To start, we need to open a new output file by adding another call to the <b>open</b> function, <b>outfile = open("emissiondata.txt",'w')</b>, using the <b>'w' flag.</b> 
- We can choose any file name we like. 
- If the file does not exist, it will be created. 
- However, if the file does exist, it will be reinitialized as empty and you will lose any previous contents.
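Because the loss of previous contents happens silently, it can be worth seeing once on a file that does not matter. The sketch below uses a throwaway file with a hypothetical name. The `os.path.exists` check at the end is one possible precaution and is not part of the lesson's program.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "overwrite_demo.txt")   # hypothetical file name

# First run: the file does not exist yet, so 'w' creates it.
outfile = open(path, "w")
outfile.write("old contents\n")
outfile.close()

# Second run: opening with 'w' again empties the file immediately.
outfile = open(path, "w")
outfile.close()
print(os.path.getsize(path))     # 0

# Checking for an existing file first is one way to avoid accidental loss.
already_there = os.path.exists(path)
print(already_there)             # True
```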

***

- Once the file has been created, we just need to call the <b>write method</b> passing the string that we wish to add to the file. 
- In this case, the string is already being printed so we will just change the <b>print</b> into a call to the <b>write method</b>. 

- However, there is one additional part of the data line that we need to include. The newline character needs to be concatenated to the end of the line. The entire line now becomes <b>outfile.write(dataline + '\n')</b>. 
- We also need to close the file when we are done.
<p>&nbsp;</p>
The complete program is shown below.

In [33]:
# write method example ... part 2 -- add the necessary pieces to produce an output file and write the data lines to it

infile = open("ccdata.txt", "r")
outfile = open("emissiondata.txt", "w")

aline = infile.readline()
outfile.write("Year \tEmission\n")
while aline:
    items = aline.split()
    dataline = items[0] + '\t' + items[2]
    # write() returns the number of characters written; the notebook echoes these values below
    outfile.write(dataline + '\n')
    aline = infile.readline()

infile.close()
outfile.close()

15

13

13

12

9

10

10

10

10

10

9

10

10

8

10

10

10

10

10

In [35]:
infile_2 = open("emissiondata.txt", "r")
filestring_2 = infile_2.read()
infile_2.close()
print(filestring_2)

Year 	Emission
1850	2.24E-7
1860	3.94E-7
1870	6.6E-7
1880	1.1
1890	1.72
1900	2.38
1910	3.34
1920	4.01
1930	4.53
1940	5.5
1950	6.63
1960	10.5
1970	16
1980	20.3
1990	22.6
2000	24.9
2010	32.7
2019	33.3



<a class="anchor" id="7"></a>
<div class="alert alert-block alert-info">
<b><font size="4">7. Using <b>with</b> for Files</font></b></div>

<b>Note</b>

- This section is a bit of an advanced topic and can be easily skipped. But with statements are becoming very common and it doesn’t hurt to know about them in case you run into one in the wild.

- Now that you have seen and practiced a bit with opening and closing files, there is another mechanism that Python provides that takes care of the often-forgotten close. Forgetting to close a file does not necessarily cause a runtime error in the kinds of programs you typically write in an introductory CS course. However, if you are writing a program that runs for days or weeks at a time and does a lot of file reading and writing, you may run into trouble.

- In version 2.5 Python introduced the concept of a context manager. The context manager automates the process of doing common operations at the start of some task, as well as automating certain operations at the end of some task. In the context of reading and writing a file, the normal operation is to open the file and assign it to a variable. At the end of working with a file the common operation is to make sure that file is closed.

- The Python <b>with statement</b> makes using context managers easy. The general form of a with statement is:

![with.PNG](attachment:with.PNG)

- When the program exits the with block, the context manager handles the cleanup that normally happens at the end of the task, such as closing a file. A simple example will clear up all of this abstract discussion of contexts.

In [38]:
with open('mydata.txt') as md:
    print(md)
    for line in md:
        print(line)
print(md)

<_io.TextIOWrapper name='mydata.txt' mode='r' encoding='cp1252'>
1 2 3

4 5 6
<_io.TextIOWrapper name='mydata.txt' mode='r' encoding='cp1252'>


- The first line of the with statement opens the file and assigns it to md. We can then iterate over the file in any of the usual ways, and when we are done we simply stop indenting and let Python take care of closing the file and cleaning up. Note that md still refers to the file object after the block, but the file itself is closed.
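As a further illustration, the year-and-emissions program from Section 6 could be written with `with` so that neither file needs an explicit close. This is a sketch under stated assumptions: the file names `cc_sample.txt` and `emission_sample.txt` and the two data rows are stand-ins created in a temporary folder, not the lesson's actual files.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
src = os.path.join(folder, "cc_sample.txt")        # hypothetical stand-in for ccdata.txt
dst = os.path.join(folder, "emission_sample.txt")  # hypothetical output name

with open(src, "w") as setup:
    setup.write("1850 -0.37 2.24E-7\n1860 -0.34 3.94E-7\n")

# One with statement can manage both files; both are closed when the block ends.
with open(src, "r") as infile, open(dst, "w") as outfile:
    for aline in infile:
        items = aline.split()
        outfile.write(items[0] + "\t" + items[2] + "\n")

print(infile.closed, outfile.closed)    # True True

with open(dst, "r") as check:
    result = check.read()
print(result)
```

The `closed` attribute confirms that both files were closed as soon as the indented block finished.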