# File I/O

> <font color='green'>CS196 - Lecture 7</font>
>
> **Instructor:** *Dr. V*

---

----
### Writing to a file

To write text to a file, you have to
- open a file (use global function `open()`)
  - make sure you are opening it in `'w'` (*write*) mode
  - assign that opened file stream to a variable (object)
- write to the file (call the `.write()` method of that file object)
- close the file when you are done (call the `.close()` method of that file object)

In [None]:
f = open('myfile.txt','w')  # the second argument, 'w', means the file is being opened in write mode
f.write('hello ')
f.write('world\n')
f.close()

# what do you think is now in myfile.txt?

The file object also has a `.writelines()` method to write multiple strings.

Calling `f.writelines( items )` is equivalent to using a for-loop --

    for item in items:
        f.write(item)

In [None]:
mydata = ['hello', 'world', '!!!']

f = open('myfile.txt','w')  # the second argument, 'w', means the file is being opened in write mode
f.writelines( mydata )
f.close()

# what do you think is now in myfile.txt?

Neither `f.write` nor `f.writelines` will add newline characters for you.
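
A minimal sketch to check this claim: write three strings that contain no newlines, then read the file back. The file name `demo.txt` and the temporary directory are assumptions, chosen only so the example does not touch `myfile.txt`.

```python
import os
import tempfile

# hypothetical scratch file, kept out of the way of myfile.txt
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

f = open(path, 'w')
f.writelines(['hello', 'world', '!!!'])  # nothing is inserted between the items
f.close()

f = open(path)
content = f.read()
f.close()

print(repr(content))  # 'helloworld!!!'
```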


In [None]:
mydata = ['hello\n', 'world\n', '!!!\n']

f = open('myfile.txt','w')  # the second argument, 'w', means the file is being opened in write mode
f.writelines( mydata )
f.close()

# what do you think is now in myfile.txt?

If you have a list of strings without newlines and you want to write them to a file with newlines after each string, you can do this:

In [None]:
mydata = ['hello', 'world', '!!!']

f = open('myfile.txt','w')  # the second argument, 'w', means the file is being opened in write mode
f.writelines( txt+'\n' for txt in mydata )
f.close()

# what do you think is now in myfile.txt?

----
### Flush

If you have not closed the file yet, some of the changes you've written via `f.write` or `f.writelines` may still be in a buffer.

If you want to flush the buffer and finish writing out all the changes to the file before closing, you can call `f.flush()`.

In [None]:
mydata = ['hello\n', 'world\n', '!!!\n']

f = open('myfile.txt','w')  # the second argument, 'w', means the file is being opened in write mode
f.writelines( mydata )

# what do you think is in the file right now?

In [None]:
f.flush()
# what do you think is in the file right now?

In [None]:
f.close()
# you should still close the file at some point (or it will close automatically when you exit out of python)
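
One way to observe buffering is to read the file through a second, independent file object before and after the flush. This is only a sketch: the scratch path is an assumption, and whether anything is visible before the flush depends on the buffer size and platform, so the code makes no promise about `before`.

```python
import os
import tempfile

# hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

f = open(path, 'w')
f.write('hello\n')

# a second file object sees only what has actually reached the file
reader = open(path)
before = reader.read()   # usually '' because the text is still in the buffer
reader.close()

f.flush()                # push the buffered text out to the file

reader = open(path)
after = reader.read()    # the flushed text is now visible
reader.close()

f.close()

print(repr(before), repr(after))
```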

----
### Reading a file

To read text from a file, just use the `open` function to open a file (without the `w` *write* flag).

Then you can just do a `.read()` to grab all text in the file (and don't forget to close the file once you are done with it) --

In [None]:
# what does this output?

f = open('myfile.txt')
data = f.read()
f.close()

print(data)


For any file `f` opened in read-mode, you can call `f.readlines()` to split up the file by line breaks, and save all of the lines of text in that file into a list of strings --

In [None]:
f = open('myfile.txt')  # this file is being opened for reading
data = f.readlines()
f.close()

# what does this output?
print(data)

Another way to read lines of text from a file is to use a `for` loop.

A file object is iterable.

When you iterate through a file opened in *read* mode, you will be getting lines of text from that file, one at a time --

In [None]:
# what does this output?

f = open('myfile.txt')

for line in f:
    print(line)

f.close()


hello

world

!!!

hello

world

!!!
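
The blank lines in the output above appear because each line read from the file still ends with its own `\n`, and `print` then adds another one. The sketch below shows two common ways around that; the scratch file is an assumption made so the example is self-contained.

```python
import os
import tempfile

# hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

f = open(path, 'w')
f.writelines(['hello\n', 'world\n', '!!!\n'])
f.close()

cleaned = []
f = open(path)
for line in f:
    print(line, end='')                 # option 1: stop print from adding its own newline
    cleaned.append(line.rstrip('\n'))   # option 2: strip the trailing newline from the line
f.close()

print(cleaned)  # ['hello', 'world', '!!!']
```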



----
### Appending to an existing file

When you open a file for writing in `'w'` mode, you will be erasing all prior content in that file (i.e., truncating).

If you do not wish to truncate the file, open it in *append* mode `'a'` instead of *write* mode `'w'` --

In [None]:
mydata = ['hello\n', 'world\n', '!!!\n']

f = open('myfile.txt','a')  # the second argument, 'a', means the file is being opened in **append** mode
f.writelines( mydata )
f.writelines( mydata )
f.close()

# what do you think is now in myfile.txt?

# what if i run this cell again?
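
The difference between the two modes can be sketched on a scratch file (the path is an assumption): appending keeps what is already there, while reopening in `'w'` mode discards it.

```python
import os
import tempfile

# hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

f = open(path, 'w')
f.write('first\n')
f.close()

f = open(path, 'a')      # append: existing content is kept
f.write('second\n')
f.close()

f = open(path)
after_append = f.read()
f.close()

f = open(path, 'w')      # write: existing content is truncated
f.write('third\n')
f.close()

f = open(path)
after_write = f.read()
f.close()

print(repr(after_append))  # 'first\nsecond\n'
print(repr(after_write))   # 'third\n'
```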

----
### FileNotFoundError

If you attempt to open a file for reading, but no such file exists, python will raise a `FileNotFoundError` --

In [None]:
f = open('somefile.txt')

You can use `try`-`except` to ensure your code works smoothly, even when a file might be missing --

In [None]:
try:
    f = open('somefile.txt')
    d = f.read()
    f.close()
except FileNotFoundError:
    d = ''

# what does this print?
print('File content:', d)

----
### File opening flags

In addition to reading, writing, and appending, there are other modes that a file can be opened in.

Here are modes for file access:
- `r` Open for reading (default); if the file does not exist, raise `FileNotFoundError`
- `w` Open for writing; the file is created if it does not exist, and if it does exist, its contents are deleted (i.e., truncated)
- `x` Open for exclusive creation (writing); if the file already exists, raise `FileExistsError`
- `a` Open for appending (i.e., writing to the end of the file); the file is created if it does not exist

Each of the modes above can be applied to binary (rather than text) files by adding the letter `b` to the mode name:
- `rb` Open binary file for reading; if file does not exist, raise `FileNotFoundError`
- `wb` Open binary file for writing; if file exists, its contents are deleted
- `xb` Open binary file for writing; if file exists raise `FileExistsError`
- `ab` Open binary file for appending (i.e., writing to the end of file)

Each of the modes above can be altered for updating files (i.e., having both read and write access), by adding `+` to the mode name:
- `r+` Open for updating; if file does not exist, raise `FileNotFoundError`
- `w+` Open for updating; if file exists, its contents are deleted
- `x+` Open for updating; if file exists raise `FileExistsError`
- `a+` Open for updating; if file does not exist, it is created; writing cursor is at the end of file
- `rb+` Open binary file for updating; if file does not exist, raise `FileNotFoundError`
- `wb+` Open binary file for updating; if file exists, its contents are deleted
- `xb+` Open binary file for updating; if file exists raise `FileExistsError`
- `ab+` Open binary file for updating; if file does not exist, it is created; writing cursor is at the end of file
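
Two of the less familiar modes from the lists above, `x` and `r+`, can be tried out on a scratch file. This is a minimal sketch; the path is an assumption and the file does not exist before the first `open`.

```python
import os
import tempfile

# hypothetical scratch file; it does not exist yet
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

f = open(path, 'x')       # succeeds: the file is created
f.write('hello world\n')
f.close()

try:
    f = open(path, 'x')   # fails: the file already exists
    f.close()
    already_existed = False
except FileExistsError:
    already_existed = True

f = open(path, 'r+')      # update mode: read and write, no truncation
first_word = f.read(5)
f.seek(0)                 # move back to the start of the file
f.write('HELLO')          # overwrites the first five characters in place
f.close()

f = open(path)
content = f.read()
f.close()

print(already_existed)    # True
print(repr(content))      # 'HELLO world\n'
```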


----
### Using a `with`-`as` code block

Instead of having to first open a file, and then close it, you can use a `with-as` block to open a file, do what's needed with it, and when the block of code ends, the file is automatically closed --

In [None]:
# open myfile.txt in append mode
with open('myfile.txt','a') as f:
    # write hello world to myfile.txt
    f.write('hello world\n')

# myfile.txt is automatically closed after the with-as block
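
You can check this yourself with the file object's `closed` attribute. The sketch below (scratch path assumed) also shows that the file is closed even when the block ends because of an exception.

```python
import os
import tempfile

# hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:
    f.write('hello world\n')
    inside = f.closed         # False: still open inside the block

after = f.closed              # True: closed as soon as the block ended

try:
    with open(path) as g:
        raise ValueError('something went wrong')
except ValueError:
    pass

closed_after_error = g.closed  # True: closed despite the exception

print(inside, after, closed_after_error)  # False True True
```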


----
### Data serialization using pickle

Imagine you have this data --

In [None]:
data = {
    'Student 1': {
        'Name': "Alice", 'Age' :20, 'Grade':4
    },
    'Student 2': {
        'Name':'Bob', 'Age':21, 'Grade':5
    },
    'Student 3': {
        'Name':'Elena', 'Age':17, 'Grade':8
    }
}

# what do you think this prints?
print( data['Student 1'] )

If you wanted to save this data to a file, you can convert it to a string, and write it to the file as text --

In [None]:
with open('students.txt','w') as f:
    f.write( str(data) )

# what do you think is in students.txt file right now?

Later you want to load the data from the file where you saved it.

Easy enough -- you just `.read` from that file, right?

In [None]:
# now let's imagine we want to load that saved data from a file...
with open('students.txt') as f:
    data = f.read()

# what do you think this prints?
print(data)

In [None]:
# what do you think this prints?
print( data['Student 1'] )

What's the issue?

Why doesn't accessing `data['Student 1']` work?

We saved our data, we loaded the data back up, but something got lost in the translation...

The problem is that the data was saved as a string of text, and then loaded back in as a string of text.

Technically, there is a way to evaluate a string as if it were python code by using the built-in `eval` function...

In [None]:
data = eval(data)

# what do you think this prints?
print( data['Student 1'] )

However, this is not safe.

If an attacker were to inject malicious code into your students.txt file, `eval` would just run that code.
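
As an aside, the standard library's `ast.literal_eval` accepts only python literals (strings, numbers, tuples, lists, dicts, sets, booleans, `None`) and rejects everything else, so it is a more cautious option than `eval` for this particular case. A minimal sketch, using a small made-up dictionary and a made-up "malicious" string:

```python
import ast

# the text we might have read back from a file
text = str({'Student 1': {'Name': 'Alice', 'Age': 20, 'Grade': 4}})

data = ast.literal_eval(text)      # parses literals only
print(data['Student 1']['Name'])   # Alice

try:
    # a function call is not a literal, so it is refused rather than executed
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True

print(rejected)  # True
```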

There are better ways to save and load python objects.

One such way is to use the `pickle` library --

In [None]:
import pickle 

# imagine you want to save this data...
data = {
    'Student 1': {
        'Name': "Alice", 'Age' :20, 'Grade':4
    },
    'Student 2': {
        'Name':'Bob', 'Age':21, 'Grade':5
    },
    'Student 3': {
        'Name':'Elena', 'Age':17, 'Grade':8
    }
}

# you can pickle this data and dump it into a binary file as such...
with open('students.pkl','wb') as f:
    pickle.dump(data, f)

In [None]:
# and now we can load the data from students.pkl back into a python object...
with open('students.pkl','rb') as f:
    data = pickle.load(f)

# what do you think this prints?
print( data['Student 1'] )

Note that `students.pkl` was opened in `wb` mode, not `w` for writing, and in `rb` mode, rather than `r` for reading.

This is because pickle files are binary, not text.

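Binary formats are often more compact than the equivalent text, but they are not human-readable.

Keep in mind that `pickle` is not safe for untrusted data either: loading a maliciously crafted pickle file can run arbitrary code, so only unpickle files that come from a source you trust.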

#### Definition

- Turning data from a native object into a byte stream is called **serialization**.
- Turning serialized data from a byte stream into a native object is called **deserialization**.
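
The byte stream itself can be inspected with `pickle.dumps` and `pickle.loads`, which work on `bytes` objects in memory instead of files. A short sketch with a made-up record:

```python
import pickle

data = {'Name': 'Alice', 'Age': 20, 'Grade': 4}

blob = pickle.dumps(data)        # serialization: object -> bytes
print(type(blob))                # <class 'bytes'>

restored = pickle.loads(blob)    # deserialization: bytes -> object
print(restored == data)          # True
```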

----
### Data serialization using JSON

Another way to serialize data is with JSON.

JSON is a simple, standard, and VERY popular data serialization format.

Most web APIs use JSON for sending and receiving data.

Every modern programming language has a JSON library; so, if you serialize your data to a JSON file, another program can easily parse the data in that file and make use of it.

Use the `json.dump` function to save data to a file in JSON format --

In [1]:
import json 

# imagine you want to save this data...
data = {
    'Student 1': {
        'Name': "Alice", 'Age' :20, 'Grade':4
    },
    'Student 2': {
        'Name':'Bob', 'Age':21, 'Grade':5
    },
    'Student 3': {
        'Name':'Elena', 'Age':17, 'Grade':8
    }
}

# you can use json.dump to dump the data into a json file as such...
with open('students.json','w') as f:
    json.dump(data, f)

And later you can load the data from JSON back into a python object using the `json.load` function --

In [None]:
# and now we can load the data from students.json back into a python object...
with open('students.json','r') as f:
    data = json.load(f)

# what do you think this prints?
print( data['Student 1'] )

{'Name': 'Alice', 'Age': 20, 'Grade': 4}


You do not always have to write JSON to a file, or read JSON from a file.

In addition to the `json.dump` and `json.load` functions, there are also the `json.dumps` and `json.loads` functions, which dump and load your data to and from strings (rather than file streams).

This is very useful for sending data in a standard JSON format between programs (via network sockets or pipes).

In [None]:
json.dumps( data )

'{"Student 1": {"Name": "Alice", "Age": 20, "Grade": 4}, "Student 2": {"Name": "Bob", "Age": 21, "Grade": 5}, "Student 3": {"Name": "Elena", "Age": 17, "Grade": 8}}'

`json.dumps( data )` looks only slightly different than `str( data )`...

In [None]:
str( data )

"{'Student 1': {'Name': 'Alice', 'Age': 20, 'Grade': 4}, 'Student 2': {'Name': 'Bob', 'Age': 21, 'Grade': 5}, 'Student 3': {'Name': 'Elena', 'Age': 17, 'Grade': 8}}"

So what's the point?

Why should we import the `json` library and use `json.dumps( data )` instead of just using `str( data )` to serialize our python dictionaries and lists and such?
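
One part of the answer shows up as soon as the data contains `True`, `False`, or `None`. The sketch below uses a made-up dictionary: `str` produces python syntax, while `json.dumps` produces standard JSON that `json.loads` (or a JSON parser in any other language) can read back.

```python
import json

data = {'name': 'Alice', 'enrolled': True, 'advisor': None}

as_str = str(data)
as_json = json.dumps(data)

print(as_str)   # {'name': 'Alice', 'enrolled': True, 'advisor': None}
print(as_json)  # {"name": "Alice", "enrolled": true, "advisor": null}

restored = json.loads(as_json)   # round-trips back to the same dictionary

try:
    json.loads(as_str)           # single quotes, True, None are not valid JSON
    str_is_json = True
except json.JSONDecodeError:
    str_is_json = False

print(str_is_json)  # False
```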

You can also make json look prettier, with newlines and indentations.

This is very useful to make JSON more human-readable.

In [None]:
print( json.dumps( data, indent=2) )

{
  "Student 1": {
    "Name": "Alice",
    "Age": 20,
    "Grade": 4
  },
  "Student 2": {
    "Name": "Bob",
    "Age": 21,
    "Grade": 5
  },
  "Student 3": {
    "Name": "Elena",
    "Age": 17,
    "Grade": 8
  }
}


When you are trying to `json.dump` or `json.dumps` an object that cannot be turned into JSON, you will get a `TypeError` --

In [None]:
class Foo: pass
f=Foo()

json.dumps(f)

When you are trying to `json.load` or `json.loads` a string that is NOT a valid JSON string, you will get a `JSONDecodeError` --

In [None]:
json.loads('asdf')

If you are not sure whether some string is JSON or not, you can use `try`-`except` to check --

In [None]:
# what is the output of this code?

s = 'asdf'

try:
    data = json.loads( s )
    print('original string was JSON')
except json.JSONDecodeError:
    data = s
    print('original string was not JSON')

print( data )

----
### Pickle vs JSON

What's the difference between pickle and json?

pickle | json
--------|--------
pickle files are binary | json files are text
pickle is just for python | json is a cross-language serialization format
pickle can serialize instances of most python classes | json only stores `dict`, `list`, `str`, `int`, `float`, `bool`, `None`

So when should you use pickle, rather than json to store data?

When should you use json, rather than pickle to store data?

In [None]:
data = {
    'Student 1': {
        'Name': "Alice", 'Age' :20, 'Grade':4
    },
    'Student 2': {
        'Name':'Bob', 'Age':21, 'Grade':5
    },
    'Student 3': {
        'Name':'Elena', 'Age':17, 'Grade':8
    }
}

# should i pickle this data, or save it as a JSON file?

In [None]:
from dataclasses import dataclass

@dataclass
class Student:
    name: str
    age: int
    grade: int

data = {
    'Student 1': Student('Alice',20,4),
    'Student 2': Student('Bob', 21, 5),
    'Student 3': Student('Elena', 17, 8)
}

# should i pickle this data, or save it as a JSON file?
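
If JSON is preferred here, one possible approach (a sketch, not the only option) is to convert each dataclass instance to a plain dictionary with `dataclasses.asdict` before dumping, and to rebuild the objects after loading. The variable names `plain`, `text`, `loaded`, and `restored` are made up for this example.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Student:
    name: str
    age: int
    grade: int

data = {'Student 1': Student('Alice', 20, 4)}

# dataclass instances -> plain dicts, which json knows how to store
plain = {key: asdict(student) for key, student in data.items()}
text = json.dumps(plain)
print(text)  # {"Student 1": {"name": "Alice", "age": 20, "grade": 4}}

# plain dicts -> dataclass instances again
loaded = json.loads(text)
restored = {key: Student(**fields) for key, fields in loaded.items()}
print(restored == data)  # True
```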

In [None]:
from collections import namedtuple

Student = namedtuple('Student', ('name','age','grade'))
data = {
    'Student 1': Student('Alice',20,4),
    'Student 2': Student('Bob', 21, 5),
    'Student 3': Student('Elena', 17, 8)
}

# should i pickle this data, or save it as a JSON file?
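
Something worth knowing before deciding: a namedtuple is a tuple, so `json.dumps` accepts it without complaint but stores it as a JSON array, and the field names are lost. A small sketch of that behavior, and of `_asdict()` as one way to keep the names:

```python
import json
from collections import namedtuple

Student = namedtuple('Student', ('name', 'age', 'grade'))
alice = Student('Alice', 20, 4)

text = json.dumps(alice)
print(text)                  # ["Alice", 20, 4]

restored = json.loads(text)
print(restored)              # ['Alice', 20, 4]  (a plain list, field names are gone)

# converting to a dict first keeps the field names in the JSON
text_with_names = json.dumps(alice._asdict())
print(text_with_names)       # {"name": "Alice", "age": 20, "grade": 4}
```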

----
### CSV

CSV is another very popular standard data storage file format.

All modern spreadsheet applications can import and export data to/from CSV files.

Python has a `csv` module you can import to make it easy for you to read and write `.csv` files.

Here's how you might use the `csv.writer` function to write a `.csv` file in python --

In [None]:
import csv

data = [
    ['Name','Age','Grade'],
    ['Alice',20,4],
    ['Bob',21,5],
    ['Elena',17,8]
]

with open('students.csv','w',newline='') as f:
    csvWriter = csv.writer(f)
    csvWriter.writerows( data )


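Note that you have to specify a named argument `newline=''` when opening the file.
This is because the `csv` module does its own newline handling; without `newline=''`, an extra `\r` is written on platforms that use `\r\n` line endings (e.g., Windows), which shows up as blank rows.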

Below is an example for how to use `csv.reader` to read data from a `.csv` file --

In [None]:
data = []

with open('students.csv',newline='') as f:
    csvReader = csv.reader(f)
    for row in csvReader:
        data.append( row )

print(data)
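
Note that `csv.reader` returns every field as a string, so the ages and grades come back as `'20'` and `'4'` rather than `20` and `4`. A minimal sketch of converting them back, assuming the same three-column layout and using a scratch path so `students.csv` is left alone:

```python
import csv
import os
import tempfile

# hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'students.csv')

rows = [['Name', 'Age', 'Grade'], ['Alice', 20, 4], ['Bob', 21, 5]]
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows(rows)

with open(path, newline='') as f:
    reader = csv.reader(f)
    header = next(reader)   # first row holds the column names
    raw = list(reader)      # remaining rows, every field is a string

print(raw)  # [['Alice', '20', '4'], ['Bob', '21', '5']]

# convert the numeric columns back to int
students = [[name, int(age), int(grade)] for name, age, grade in raw]
print(students)  # [['Alice', 20, 4], ['Bob', 21, 5]]
```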

----
### Summary

You can open files in python by using the `open()` function.
- the first argument for the `open()` function is the file path
- the second argument is the mode in which you are opening the file
  - if this argument is not specified, the default mode is `r` (reading a text file)
  - other often-used modes include
    - `w` to write to a text file
    - `a` to append to a text file
    - `wb` to write to a binary file
    - `rb` to read from a binary file

If you want to save a native object to a file, you can
- pickle it using the `pickle` library,
- save it as a JSON string using the `json` library, or
- save lists of data as rows in a `.csv` file by using the `csv` library

Converting some native object into a byte stream is called **serialization**.  
The reverse of this -- turning a byte stream into a native object -- is called **deserialization**.

----
### Assignment 6

(*due before next lecture*)

Create a Jupyter notebook called `CS196-a6.ipynb`

**DO NOT INCLUDE YOUR NAME ANYWHERE IN THIS FILE OR IN FILENAME**

In this notebook you should have the following:

1. write, append, and read to/from text file
  - open a file called `mytext.txt` in write mode, and write a few lines of text to it (and close it)
  - re-open the same file in append mode, and write some more lines of text to it
  - for every line in `mytext.txt`, print that line
2. pickle some data, then load it
  - create some custom class (feel free to use `dataclass` if you want)
  - create an object of that class called `data`
  - use `pickle` library to save `data` into a file called `mydata.pkl`
  - load data from `mydata.pkl` into a new variable, `data2`
3. use json to save and load a dictionary
  - create a dictionary called `data` with at least 3 items
  - use `json` library to save this dictionary to a file called `mydata.json`
  - load data from `mydata.json` into a new variable, `data2`
4. create a list of lists, save as csv
  - create a list `data`
  - append a few lists of strings to `data`
  - use `csv` library to save `data` to a file called `mydata.csv`
  - load data from `mydata.csv` into a new variable, `data2`

Add docstrings and comments (and/or markdown) where appropriate.

Code will be evaluated for:
1. code is written and works as intended (e.g., correct calls, correct output, no errors)
2. clean/efficient code (e.g., no unnecessary code)
3. naming conventions (e.g., class names are UpperCamelCase)
4. readability (e.g., meaningful names, separation of code into separate cells)
5. documentation (e.g., docstrings, comments, argument type specification)
* click "View Rubric" on blackboard under this assignment for more details