# File Inputs and Outputs (I/O)

Learning Objectives: By the end of this notebook, you should be able to:
1. Read data from a text file and write data to a text file
2. Access numerical data inside of a binary-encoded file
3. Describe how data is written to and accessed from "pickle" files 

**Import modules required for this notebook**

In [1]:
# import the struct and pickle modules
import struct
import pickle

## Reading and Writing Text Files

Both reading and writing text files are done with the built-in `open` function.

### Reading Text Files
The `open` function takes a path to a file and an optional argument describing how the file will be opened. By default, `open` opens files in reading mode (`mode='r'`). Try this on some [buoy data](https://github.com/ProfMikeWood/intro_to_python_book/blob/main/files/Monterey Bay Buoy Data.txt) obtained from the [National Data Buoy Center](https://www.ndbc.noaa.gov/station_page.php?station=46269) for a site in Monterey Bay:

In [2]:
# open the buoy data file
file_object = open('Monterey Bay Buoy Data.txt')

# read all of the lines of the file
lines = file_object.read()

# close the file
file_object.close()

# see how long the file is (how many characters are in the file)
print('File Length:',len(lines))

# split the lines by the new line character to
# see how many lines there are
lines = lines.split('\n')
print('Number of Lines:',len(lines))

# print the first five lines
print('First Five Lines:')
for i in range(5):
    print(lines[i])

File Length: 1341408
Number of Lines: 15073
First Five Lines:
#YY  MM DD hh mm WDIR WSPD GST  WVHT   DPD   APD MWD   PRES  ATMP  WTMP  DEWP  VIS  TIDE
#yr  mo dy hr mn degT m/s  m/s     m   sec   sec degT   hPa  degC  degC  degC   mi    ft
2022 01 01 00 00 999 99.0 99.0  1.32  7.41  5.74 263 9999.0  11.9  12.0 999.0 99.0 99.00
2022 01 01 00 30 999 99.0 99.0  1.28  8.00  5.80 261 9999.0  11.6  12.0 999.0 99.0 99.00
2022 01 01 01 00 999 99.0 99.0  1.26  8.33  5.88 258 9999.0  11.2  12.0 999.0 99.0 99.00


As seen above, the `file_object` must be opened, read, and then closed when finished. An open file holds on to an operating system resource until it is closed, so it is generally not recommended to keep many files open at the same time. The `with` keyword bundles the `open` and `close` steps together: the commands in the indented block run while the file is open, and the file is closed automatically when the block ends. In other words:

In [3]:
# open the buoy data file using the with keyword
with open('Monterey Bay Buoy Data.txt') as file_object:
    lines = file_object.read()
# the 'with' block automatically closes the file

# see how long the file is (how many characters are in the file)
print('File Length:',len(lines))

# split the lines by the new line character to
# see how many lines there are
lines = lines.split('\n')
print('Number of Lines:',len(lines))

# print the first five lines
print('First Five Lines:')
for i in range(5):
    print(lines[i])

File Length: 1341408
Number of Lines: 15073
First Five Lines:
#YY  MM DD hh mm WDIR WSPD GST  WVHT   DPD   APD MWD   PRES  ATMP  WTMP  DEWP  VIS  TIDE
#yr  mo dy hr mn degT m/s  m/s     m   sec   sec degT   hPa  degC  degC  degC   mi    ft
2022 01 01 00 00 999 99.0 99.0  1.32  7.41  5.74 263 9999.0  11.9  12.0 999.0 99.0 99.00
2022 01 01 00 30 999 99.0 99.0  1.28  8.00  5.80 261 9999.0  11.6  12.0 999.0 99.0 99.00
2022 01 01 01 00 999 99.0 99.0  1.26  8.33  5.88 258 9999.0  11.2  12.0 999.0 99.0 99.00
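
Reading the whole file with `read` and then splitting works well for small files. As an alternative, a file object can also be looped over directly, which yields one line at a time. The block below is a minimal sketch of that pattern; the file `sample_buoy.txt` and its contents are hypothetical stand-ins written to a temporary directory, not the real buoy data.

```python
import os
import tempfile

# build a small stand-in file (hypothetical sample, not the real buoy data)
sample_text = (
    "#YY  MM DD hh mm WVHT\n"
    "#yr  mo dy hr mn    m\n"
    "2022 01 01 00 00 1.32\n"
    "2022 01 01 00 30 1.28\n"
)
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_buoy.txt')
with open(sample_path, 'w') as file_object:
    file_object.write(sample_text)

# loop over the file object directly: each iteration yields one line
data_lines = []
with open(sample_path) as file_object:
    for line in file_object:
        # remove the trailing newline character
        line = line.rstrip('\n')
        # skip the header lines, which start with '#'
        if line.startswith('#'):
            continue
        data_lines.append(line)

print('Number of data lines:', len(data_lines))
print(data_lines[0])
```

Because only one line is held at a time inside the loop, this approach can be convenient when a file is too large to read comfortably in one go.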


### Writing Text Files
Writing text files is just as easy as reading them - just change the mode! For example, we may want to generate a quick readme note to include in our directory. We can do this as follows:

In [4]:
# define a string with a readme note
note = "To write a text file, change the mode to 'w' "

# output your note as a readme file
f = open('readme.txt','w')
f.write(note)
f.close()
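
One detail worth keeping in mind: the `'w'` mode starts from an empty file every time, erasing anything that was there before. The sketch below contrasts it with the append mode `'a'`, which adds to the end of an existing file. The file name `notes.txt` is a hypothetical example placed in a temporary directory.

```python
import os
import tempfile

# hypothetical file name inside a temporary directory
notes_path = os.path.join(tempfile.mkdtemp(), 'notes.txt')

# 'w' creates the file (or erases it if it already exists)
with open(notes_path, 'w') as f:
    f.write('first line\n')

# 'a' keeps the existing contents and adds to the end
with open(notes_path, 'a') as f:
    f.write('second line\n')

# read the file back to check that both lines are present
with open(notes_path) as f:
    contents = f.read()

print(contents)
```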

When writing text to a text file, we can format our text and name our files in such a way that other programs can interpret the data. One of the clearest examples is data stored in a comma-separated value (CSV) format. These types of files can be read by Excel and many other programs. Let's try this ourselves:

In [5]:
# make a header line that has columns for
# year, month and day, separated by commas
header = 'year,month,day'

# make two data lines with days of the year
data_line_1 = '2023,1,17'
data_line_2 = '2023,4,8'

# combine the header and the data lines,
# separated by a newline character
output = header + '\n' + data_line_1 + '\n' + data_line_2

# write the output to a file
with open('year_data.csv','w') as file_obj:
    file_obj.write(output)

# open the file in a spreadsheet program - what does it look like?
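
Building the comma-separated lines by hand works for simple values. As an alternative sketch, Python's standard `csv` module can write the same kind of file from a list of rows and also handles values that themselves contain commas. The file name `year_data_demo.csv` is a hypothetical example written to a temporary directory.

```python
import csv
import os
import tempfile

# rows to write: a header row followed by two data rows
rows = [['year', 'month', 'day'],
        [2023, 1, 17],
        [2023, 4, 8]]

# hypothetical output file in a temporary directory
csv_path = os.path.join(tempfile.mkdtemp(), 'year_data_demo.csv')

# newline='' lets the csv module control the line endings itself
with open(csv_path, 'w', newline='') as file_obj:
    writer = csv.writer(file_obj)
    writer.writerows(rows)

# read the rows back; every value is returned as a string
with open(csv_path, newline='') as file_obj:
    rows_read = list(csv.reader(file_obj))

print(rows_read)
```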

### &#x1F914; Mini-Exercise
Goal: Write a text file that describes the ocean conditions when the waves in Monterey Bay were biggest during the year 2022. 

Read in the wave data and search through each line for the one with the biggest waves (`WVHT`, column 9, i.e. index 8). Then, write out a text file that includes the two header lines and the line with the biggest wave. Store your lines with each component separated by commas in a file called "Monterey Bay Biggest Wave 2022.csv".

In [6]:
# open the buoy file and read the lines
with open('Monterey Bay Buoy Data.txt') as file_object:
    lines = file_object.read()

# split the lines at the new line indicator
lines = lines.split('\n')

# store the first two lines in variables "header" and "units"
header = lines[0]
units = lines[1]

# loop through the remainder of the lines and
# find the one with the biggest wave
biggest_wave_value = 0
biggest_line = ''
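for line in lines[2:]:
    line = line.split()
    if len(line)>8:
        wave_height = float(line[8])
        # skip 99.00, assumed here to mark a missing WVHT measurement
        if wave_height<99 and wave_height>biggest_wave_value:
            biggest_wave_value = wave_height
            biggest_line = line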

# make an output string combining the lines
# for header, units, and the biggest wave day
output_line = ','.join(header.split()) +'\n'+ \
              ','.join(units.split()) +'\n'+ ','.join(biggest_line)

# save as a csv with the file name Monterey Bay Biggest Wave 2022.csv
with open('Monterey Bay Biggest Wave 2022.csv','w') as f:
    f.write(output_line)

## Working with binary data
Many older programs store their data in binary files. While this practice is waning, it's still a good idea to be able to read and write data in binary format.

### Writing integer data to binary
To write a list of integers in binary format, use `bytearray` to convert the list of numbers to bytes, and use the `wb` mode to write in binary. Each value must fit in a single byte (0 to 255):

In [7]:
# declare a list of integer values
numbers = [10, 20, 30, 40, 50, 60]

# convert the list to a byte array
barray = bytearray(numbers)

# open a file_object to write as a binary file
file_obj = open("number_list.bin","wb")

# write array into the file
file_obj.write(barray)

# close the file
file_obj.close()

### Reading integer data from a binary file
When reading from a binary file, be sure to use the `rb` mode:

In [8]:
# open the binary file for reading using a with block
with open("number_list.bin", "rb") as file_obj:

    # read in all of the bytes from the file
    lines = file_obj.read()

# print the lines
print(lines)

# the list command will return back the list,
# converting from a binary array
print(list(lines))

b'\n\x14\x1e(2<'
[10, 20, 30, 40, 50, 60]
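
The `bytearray` approach stores each integer in exactly one byte, so it only works for values from 0 to 255. The sketch below shows what happens with a larger value and, as one possible workaround, uses the built-in `int.to_bytes` and `int.from_bytes` methods with an assumed width of 2 bytes and little-endian byte order.

```python
# values between 0 and 255 take one byte each
small_numbers = [10, 20, 30]
print(len(bytearray(small_numbers)))

# a value above 255 does not fit in a single byte
too_big = [10, 300]
try:
    bytearray(too_big)
    fits_in_bytes = True
except ValueError as error:
    fits_in_bytes = False
    print('Could not convert:', error)

# one workaround: encode the integer using more than one byte
# (2 bytes and little-endian order are assumptions for this sketch)
encoded = (300).to_bytes(2, byteorder='little')
decoded = int.from_bytes(encoded, byteorder='little')
print(encoded, decoded)
```

Whoever reads the file back needs to know the width and byte order that were used, which is the same idea the `struct` module formalizes with its format strings.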


### Writing float data to a binary file
The `struct` module provides a mechanism by which various types of data can be encoded in binary and packed into a file. It's uncommon to write out data in this format in Python, but it's common to receive binary-encoded files from other programs. Here, we'll write a binary file with float values and then examine how we can read the data back.

In [9]:
# make up a list of 4 float values for testing
grid = [1.1, 2.2, 3.3, 4.4]

# define the format for output - either f or d (for float or double)
record_format = 'd'

# define the output using the struct.pack function.
# the format string is the number of elements (4)
# followed by the format type
output = struct.pack('4' + record_format, *grid)

# write the binary data to a file called float.bin
with open("float.bin","wb") as f:
    f.write(output)

As you can see, when packing data into the binary file, the format must be specified. For a full list of different format types, see the [struct documentation page](https://docs.python.org/3/library/struct.html).
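
To make the role of the format string more concrete, the sketch below packs the same four values in a few different ways. The `<` and `>` prefixes (little-endian and big-endian byte order) are part of the `struct` format syntax; which one a given file uses depends on the program that wrote it, so the choices here are only illustrative.

```python
import struct

grid = [1.1, 2.2, 3.3, 4.4]

# 'd' uses 8 bytes per value, 'f' uses 4 bytes per value
packed_double = struct.pack('<4d', *grid)
packed_float = struct.pack('<4f', *grid)
print(len(packed_double), len(packed_float))  # prints 32 16

# calcsize reports the size of a format without packing anything
print(struct.calcsize('<4d'))

# the same values in big-endian order produce different bytes
packed_big = struct.pack('>4d', *grid)
print(packed_big == packed_double)

# single-precision floats lose some precision on the round trip
roundtrip = struct.unpack('<4f', packed_float)
print(roundtrip[0])
```

Reading a file with the wrong format type or byte order will usually not raise an error; it simply produces the wrong numbers, so it helps to confirm the format with whoever produced the file.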

### Reading float data from a binary file
Reading structured data from a binary file is similar to writing - you need to know the record format as well as the number of items in the file.

In [10]:
# define the record format, identical to the code block above
record_format = 'd'

# use the struct.calcsize method to determine
# the size of this format, in binary
record_size = struct.calcsize(record_format)

# make an empty list of values to keep track of during the file reading
values = []

# open the float.bin file and read in the 4 values we wrote previously
with open("float.bin","rb") as f:
    # loop through the 4 values, reading a portion of size record_size
    for i in range(4):

        # store the binary value into a variable
        binary_value = f.read(record_size)

        # use the struct.unpack method to convert
        # the binary representation to a float
        # then, add the value to the list
        values.append(struct.unpack(record_format,binary_value)[0])

# print out the values
print(values)

[1.1, 2.2, 3.3, 4.4]
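
When the number of values is not known ahead of time, one option is to read the whole file and work out the count from its size. The sketch below assumes the file contains only values of a single format; `float_demo.bin` is a hypothetical file written to a temporary directory so the example is self-contained.

```python
import os
import struct
import tempfile

record_format = 'd'
record_size = struct.calcsize(record_format)

# write a small demo file (hypothetical name) with four doubles
demo_path = os.path.join(tempfile.mkdtemp(), 'float_demo.bin')
with open(demo_path, 'wb') as f:
    f.write(struct.pack('4d', 1.1, 2.2, 3.3, 4.4))

# read all of the bytes at once
with open(demo_path, 'rb') as f:
    raw_bytes = f.read()

# the number of values is the file size divided by the record size
n_values = len(raw_bytes) // record_size

# unpack every value with a single call
values = list(struct.unpack(str(n_values) + record_format, raw_bytes))
print(n_values, values)
```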


### &#x1F914; Mini-Exercise
Goal: Read in data from a binary file given information about its contents

A scientist using an old and outdated coding language has passed you a data file. They mention that the file has two columns of double-precision data, and each column has 119 rows. Read the file contents into two lists. Then, print the first and last values from each list (decoded from binary).

In [11]:
# define the record format, identical to the code block above
record_format = 'd'

# use the struct.calcsize method to determine
# the size of this format, in binary
record_size = struct.calcsize(record_format)

# make two lists to store the data
column_1 = []
column_2 = []

# open the data_file file and read in the values into the 2 lists
with open("data_file","rb") as f:
    
    # loop through the 119 rows,
    # reading two values from each row,
    # with each value of size record_size 
    for i in range(119):

        # read the first value, and store into a binary value
        binary_value = f.read(record_size)

        # use the struct.unpack method to convert
        # the binary representation to a float
        # and add to the first list
        column_1.append(struct.unpack(record_format,binary_value)[0])

        # read the second value, and store into a binary value
        binary_value_2 = f.read(record_size)

        # use the struct.unpack method to convert
        # the binary representation to a float
        # and add to the second list
        column_2.append(struct.unpack(record_format,binary_value_2)[0])

print(column_1[0], column_1[-1])
print(column_2[0], column_2[-1])


1900.0 2018.0
0.0 0.056842864990234374
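
As an alternative sketch for row-structured files, `struct.iter_unpack` steps through a block of bytes one record at a time. The example assumes, as in the exercise, that the values are stored row by row (first column, second column, then the next row). The file `two_column_demo.bin` and its values are made up for illustration.

```python
import os
import struct
import tempfile

# made-up rows of (column 1, column 2) values
rows = [(1900.0, 0.0), (1901.0, 0.5), (1902.0, 1.0)]

# each row holds two doubles
row_format = '2d'

# write the rows to a hypothetical demo file
demo_path = os.path.join(tempfile.mkdtemp(), 'two_column_demo.bin')
with open(demo_path, 'wb') as f:
    for row in rows:
        f.write(struct.pack(row_format, *row))

# read all of the bytes back
with open(demo_path, 'rb') as f:
    raw_bytes = f.read()

# iter_unpack yields one tuple per row
column_1 = []
column_2 = []
for first, second in struct.iter_unpack(row_format, raw_bytes):
    column_1.append(first)
    column_2.append(second)

print(column_1)
print(column_2)
```

Note that `iter_unpack` expects the total number of bytes to be an exact multiple of the row size, so a file with a header or trailing padding would need to be trimmed first.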


## Pickling Files

Pickle files are an extremely flexible data storage type unique to Python. As the name suggests, you can treat a pickle file like a pickle jar: it can hold almost any Python object (and, to continue the analogy, it preserves that data until the jar is opened). For example, you can create a dictionary and store it in a pickle file:

In [12]:
# make a dictionary for the days in a month
date_dict = {'Jan':31, 'Feb': 28, 'Mar':31, 'Apr': 30, 'May':31, 'Jun': 30,
             'Jul':31, 'Aug': 31, 'Sep':30, 'Oct': 31, 'Nov':30, 'Dec': 31}

# store the dictionary in a pickle file
with open('date_dict.pickle', 'wb') as file_obj:
    pickle.dump(date_dict, file_obj)

Then, reading from the pickle file is similar to writing:

In [13]:
# load the dictionary back in from the pickle file
with open('date_dict.pickle', 'rb') as file_obj:
    dict_from_pickle = pickle.load(file_obj)

# print the dictionary
print(dict_from_pickle)

{'Jan': 31, 'Feb': 28, 'Mar': 31, 'Apr': 30, 'May': 31, 'Jun': 30, 'Jul': 31, 'Aug': 31, 'Sep': 30, 'Oct': 31, 'Nov': 30, 'Dec': 31}


To continue the analogy, a pickle jar does not need to hold only cucumbers (err... dictionaries). In other words, you can put a mix of different types of objects in a single pickle file:

In [14]:
# make a list of the years 2000 to 2020
year = list(range(2000, 2021))

# make a string to describe the contents
string = 'This pickle file contains a list of years 2000-2020 '+\
         'and a dictionary for the number of days in each month'

# write the dict, year, and string to the pickle file called date_data.pickle
with open('date_data.pickle', 'wb') as file_obj:
    pickle.dump(date_dict, file_obj)
    pickle.dump(year, file_obj)
    pickle.dump(string, file_obj)

Then, you can read the objects back in the same order they were written:

In [15]:
# load the objects back in from the pickle file
with open('date_data.pickle', 'rb') as file_obj:
    obj_1 = pickle.load(file_obj)
    obj_2 = pickle.load(file_obj)
    obj_3 = pickle.load(file_obj)

print(obj_1)
print(obj_2)
print(obj_3)

{'Jan': 31, 'Feb': 28, 'Mar': 31, 'Apr': 30, 'May': 31, 'Jun': 30, 'Jul': 31, 'Aug': 31, 'Sep': 30, 'Oct': 31, 'Nov': 30, 'Dec': 31}
[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
This pickle file contains a list of years 2000-2020 and a dictionary for the number of days in each month


Weird!
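
Reading objects back one `pickle.load` call at a time requires knowing how many objects were written. If that number is unknown, one common pattern is to keep loading until `pickle.load` raises an `EOFError` at the end of the file. The sketch below uses a hypothetical file `demo.pickle` in a temporary directory and a few made-up objects.

```python
import os
import pickle
import tempfile

# a few made-up objects of different types
objects_to_store = [{'Jan': 31, 'Feb': 28}, [2000, 2001], 'a short note']

# write each object to a hypothetical pickle file
pickle_path = os.path.join(tempfile.mkdtemp(), 'demo.pickle')
with open(pickle_path, 'wb') as file_obj:
    for obj in objects_to_store:
        pickle.dump(obj, file_obj)

# load objects until the end of the file is reached
loaded_objects = []
with open(pickle_path, 'rb') as file_obj:
    while True:
        try:
            loaded_objects.append(pickle.load(file_obj))
        except EOFError:
            # no more objects in the file
            break

print(len(loaded_objects))
print(loaded_objects)
```

An alternative design is to place all of the objects in a single list or dictionary and pickle that one container, so a single `pickle.load` call returns everything.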