# File IO in Python

Python provides us with built-in mechanisms (functions, libraries, etc) which enable us to interact with files on the computer which is running Python.

<u>Why do we want to programmatically store and read data?</u>
* For Data Science, this is how we might read in some raw data and process it ready for other analysis.
* Once we have written procedures for calculations, feeding them data manually, whether through `input` or by typing values into the code, is inefficient. We want to do bulk processing.

Python provides built-ins for dealing with files (often called <u>file objects</u>; the operating system's own low-level handle underneath is a file descriptor). Python knows how to talk to the operating system, so for the most part it doesn't matter whether it's Windows, Mac, or Linux! Python handles that complex stuff for you.

## Opening files

Python provides a function called `open()` which accepts various arguments (https://docs.python.org/3.8/library/functions.html#open). There are some further examples to look at after this lecture (https://docs.python.org/3.8/tutorial/inputoutput.html#tut-files).

We are concerned with `file` and `mode` for now. You must provide a file. Other keywords, such as `mode`, are optional.

For example, this will open a file called `my_file.txt`. I'm telling the function to set `mode` to `'r'`, which stands for 'read mode'. I will only be allowed to read the contents, not write anything to the file (or modify its contents).
```python
f = open('my_file.txt', mode='r')
```

### Local vs Absolute File Paths
The first argument is always a file path. These can come in two flavours:

1) <u>Local</u> (also called relative) - These are relative to Python's current working directory. Typically this is the same directory your ipynb is in. If you're in "My Documents" it will look for files there.
E.g `"my_file.txt"`, or `"./my_file.txt"`

2) <u>Absolute</u> - These are the full file paths. These will work regardless of where your python script is executed from. E.g `"C:/Users/Ashley/Downloads/my_file.txt"`
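
To see the difference for yourself, here is a small sketch using the standard `pathlib` module. The file name is just the one from the example above, and the file doesn't need to exist for this to run.

```python
from pathlib import Path

# The directory Python resolves local (relative) paths against.
cwd = Path.cwd()
print("Current working directory:", cwd)

# A local path: no drive letter or leading slash.
relative = Path("my_file.txt")
print(relative.is_absolute())   # False

# Joining it onto the current working directory gives the absolute path
# that open('my_file.txt') would actually look at.
absolute = cwd / relative
print(absolute.is_absolute())   # True
print(absolute)
```

If `open()` can't find a file you are sure exists, printing `Path.cwd()` is usually the quickest way to find out where Python is really looking.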

### Open Modes
With files we may want to open them with different permissions. This depends on what our purpose for opening the file is. Are we wanting to just read the contents? Or do we need to modify the data of the file itself?
Secondly, are we dealing with plain text (e.g. human-readable English), or are we dealing with raw data (bytes: those 1's and 0's, binary data)?

* `r` - Open file for reading (default)
* `w` - Open file for writing (this will completely destroy anything there first!)
* `a` - Append. Open the file for writing, but will add to the end rather than destroying it.
--
* `b` - Binary mode.
* `t` - Default text mode.

We can combine the symbols above to build the mode string: one of `r`, `w`, or `a`, optionally followed by `b` or `t`.
E.g:

`mode='r'`, or `mode='rb'` (Read text vs Read bytes). 

`mode='w'`, or `mode='a'` (Write file - overwrites. Vs Appending)


```python
f = open('my_file.txt', mode="a") # Remember "" or '', interchangeable.
```

By Default, if I don't specify `b`, all of these modes will be dealing with normal text. This is what we'll be using throughout. But bytes do have their place!
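
Here is a small sketch of what the `b` actually changes. It writes a throwaway file into a temporary folder (the folder and the file name `demo.txt` are made up for this example), then reads it back in text mode and in binary mode.

```python
import os
import shutil
import tempfile

# A throwaway folder, so we don't touch any real files.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.txt")

# Text mode write ('w' is the same as 'wt').
f = open(path, mode="w")
f.write("Hello")
f.close()

# Text mode read gives us a str.
f = open(path, mode="r")
as_text = f.read()
f.close()

# Binary mode read gives us the raw bytes.
f = open(path, mode="rb")
as_bytes = f.read()
f.close()

shutil.rmtree(tmp_dir)  # Tidy up the temporary folder.

print(type(as_text), as_text)     # <class 'str'> Hello
print(type(as_bytes), as_bytes)   # <class 'bytes'> b'Hello'
```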

<center>
    <b>Example opening Binary file in Text Editor (bad)</b>   
    <img width="50%" src="https://www.nayuki.io/res/what-are-binary-and-text-files/binary-file-in-text-editor.png">

## Reading/Writing Files (https://docs.python.org/3.8/tutorial/inputoutput.html#reading-and-writing-files)

Once we successfully open our file, the variable holds a file object (in text mode, a `TextIOWrapper`).

In [1]:
f = open('my_file.txt')
print( type(f) )

<class '_io.TextIOWrapper'>


These objects have some methods we can use. (https://docs.python.org/3.8/library/io.html#io.TextIOWrapper)

* `readline` - Allows us to read in a single line from our file. If we do this iteratively, it gets the next line.
```python
f = open('long_text.txt')
print(f.readline())
```

* `write` - This accepts a string (in the case of text), which we can write. This will be our 'line'. Remember: Are we in write mode or append mode? If append mode, this will go at the end of the file.
```python
f = open('junk.txt', mode='a')
f.write("After the war, I went back to New York")
```

* `readlines` - Like `readline`, however it returns a list of ALL the remaining lines in the file.
* `writelines` - Like `write`, however it writes a list of lines to the file. (Remember the mode!)

In [2]:
f = open('long_text.txt')
print(f.readline())

Dear Theodosia, what to say to you?



In [3]:
f = open('long_text.txt')
print(f.readline())
print(f.readline()) # Each call to this will get the next line and so on.
print(f.readline())

Dear Theodosia, what to say to you?

You have my eyes

You have your mother's name
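
Why the blank lines between each lyric in the output? Each string from `readline` still has its `"\n"` on the end, and `print` adds another one. Here is a small sketch of how to see and strip it. It builds its own two-line stand-in for `long_text.txt` in a temporary folder, so the contents are an assumption rather than the real file.

```python
import os
import shutil
import tempfile

# Build a small stand-in for long_text.txt in a temporary folder.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "long_text.txt")
f = open(path, mode="w")
f.write("Dear Theodosia, what to say to you?\nYou have my eyes\n")
f.close()

f = open(path)
first = f.readline()
f.close()

print(repr(first))          # 'Dear Theodosia, what to say to you?\n'
print(first.rstrip("\n"))   # Dear Theodosia, what to say to you?

# Looping over the file object also gives one line at a time.
f = open(path)
cleaned = [line.rstrip("\n") for line in f]
f.close()

shutil.rmtree(tmp_dir)  # Tidy up the temporary folder.

print(cleaned)  # ['Dear Theodosia, what to say to you?', 'You have my eyes']
```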



In [5]:
f = open('junk.txt', mode='w') # Overwrites anything in the file with what I'm about to put.
f.write("After the war, I went back to New York") # It will spit out how many characters were written.
f.close() # Close the file so the text is definitely saved before we reopen it.

f = open('junk.txt', mode='r') # Let's prove it's there!
print(f.readlines())

['After the war, I went back to New York']


In [6]:
f = open('junk.txt', mode='a') # Let's append some more lyrics.
my_lines = [
    "Mmmbop, ba duba dop",
    "Ba du bop, ba duba dop",
    "Ba du bop, ba duba dop"
]
f.writelines(my_lines) # No return for this one.
f.close() # Close the file so the new lines are definitely saved.

#Prove to ourselves.
f = open('junk.txt', mode='r')
for entry in f.readlines():
    print(entry)

#Uh-Oh! Where's our new lines?
# From Docs: "Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end."
# We need to add "\n"  which is a special newline character to each of our lines :(


After the war, I went back to New YorkMmmbop, ba duba dopBa du bop, ba duba dopBa du bop, ba duba dop


In [11]:
f = open('junk.txt', mode='w') # WRITE MODE
my_lines = [
    "Mmmbop, ba duba dop\n",
    "Ba du bop, ba duba dop\n",
    "Ba du bop, ba duba dop\n"
]
f.writelines(my_lines) # No return for this one.
f.close() # Close the file so the new lines are definitely saved.

#Prove to ourselves.
f = open('junk.txt', mode='r')
lines = f.readlines()
for i, entry in enumerate(lines):
    print(i, entry)
    
print(lines)
    
# Three distinct entries!

0 Mmmbop, ba duba dop

1 Ba du bop, ba duba dop

2 Ba du bop, ba duba dop

['Mmmbop, ba duba dop\n', 'Ba du bop, ba duba dop\n', 'Ba du bop, ba duba dop\n']
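
Typing `"\n"` onto the end of every line by hand gets old quickly. As a rough sketch, we can add it as we write instead, since `writelines` accepts anything it can loop over. The temporary folder here is only so the example doesn't clobber your real `junk.txt`.

```python
import os
import shutil
import tempfile

lyrics = [
    "Mmmbop, ba duba dop",
    "Ba du bop, ba duba dop",
]

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "junk.txt")

f = open(path, mode="w")
# Add the newline to each lyric on the way out.
f.writelines(line + "\n" for line in lyrics)
f.close()

f = open(path)
lines_back = f.readlines()
f.close()

shutil.rmtree(tmp_dir)  # Tidy up the temporary folder.

print(lines_back)  # ['Mmmbop, ba duba dop\n', 'Ba du bop, ba duba dop\n']
```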


In [15]:
f = open('junk.txt', mode='w') # WRITE MODE
my_lines = [
    "Mmmbop, ba duba dop",
    "Ba du bop, ba duba dop",
    "Ba du bop, ba duba dop"
]
# f.writelines(my_lines) # No return for this one.
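for lyric in my_lines:
    print(lyric)
    f.write(lyric) # No "\n" is added, so everything lands on one line.
f.close() # Close the file so the text is definitely saved before we reopen it.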
    
f = open('junk.txt', mode='r')
lines = f.readlines()
print(len(lines))

Mmmbop, ba duba dop
Ba du bop, ba duba dop
Ba du bop, ba duba dop
1


# Introduction to Common File Types
We often represent files with certain types. Think about the files you commonly use: docx, pdf, etc. These are specific ways a file is formatted so that it is understandable.

<u>Why do we need these files?</u>
They give us a common representation for transferring data: both sides can understand the format.
E.g. if I give you a PDF, your browser knows how to interpret that data to display something. Likewise you can export a PDF and I can read it. Same goes for any data format we use! (Even ipynb)

## CSV
CSV stands for Comma-Separated Values. It is arguably the simplest form of data format we use today. (Microsoft Teams spits out attendance in this format!).

It is simplistic, with each row representing some data entity (record). This could be a person, or a single data entry. Typically we use the columns of such a data format to store attributes. This could be a person's age, e-mail, date of birth, etc.

Each value is separated by a comma. If the data entry is empty, we still need to put the comma.

This data format is **flat**, it has no inherent structure other than a table. Just values separated by commas. We often use the first row of a CSV file to denote the names of the headers - otherwise it can be difficult to know what each field is meant to be.

If we want to represent rich data here, with structure (Think Lists within Dictionaries, Dictionaries within Dictionaries, Dictionaries within Lists), we have to flatten everything first. We can lose some important context to data this way, and it can be tricky to manage.
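
As a quick sketch of what this looks like in practice, Python's standard `csv` module can write and read this format for us. The people and e-mail addresses below are invented, and `io.StringIO` is standing in for a real file (with a real file you would pass `newline=''` to `open()`, as the `csv` documentation recommends).

```python
import csv
import io

# Made-up records. Grace has no age recorded.
rows = [
    {"name": "Ashley", "age": "30", "email": "ashley@example.com"},
    {"name": "Grace", "age": "", "email": "grace@example.com"},
]

# Write: the first row holds the header names.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "age", "email"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back: every value comes back as a string.
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)
print(records[1]["name"])        # Grace
print(records[1]["age"] == "")   # True
```

Notice that Grace's row still has both commas (`Grace,,grace@example.com`) even though the age is empty.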

<center>
    <b>CSV (Nice Example):</b>
<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2017/02/27131537/DF_1.png"></img>
<br>
<b>CSV (Not so nice):</b>
<img src="https://www.howtogeek.com/wp-content/uploads/2018/04/img_5acfaa319c745.png"></img>
<br>

## JSON
JSON stands for JavaScript Object Notation. It is primarily used for web data communication (Between your browser and the web server), as the data can only be text (for the most part). The concept behind this was to have a rich data format which facilitated easy communication between browser and the server by having a format which closely resembles the one used by the language itself.

As this format was widely adopted by the web, other fields began to adopt it for their purposes. You will commonly find JSON used as file formats in non-web related domains.

JSON is far more complex as a notation, with use of `{}`, `:`, `,`, `""`, etc. In fact, JSON objects share a striking similarity with the Dictionaries we've been using in Python!

As it allows such a rich description of the data, it enables us to not only store values, but also the structure of the data itself! It enables us to have nested structures (Dict in a Dict, etc).
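
Here is a small sketch with the standard `json` module. The student record is invented for illustration. Notice that the nested list and dictionary survive the round trip to text and back.

```python
import json

# A made-up record with a list and a dict nested inside a dict.
student = {
    "name": "Grace",
    "modules": ["Programming", "Statistics"],
    "contact": {"email": "grace@example.com", "phone": None},
}

# Python -> JSON text (this is what we would write to a file).
text = json.dumps(student, indent=2)
print(text)

# JSON text -> Python (this is what we would do after reading a file).
restored = json.loads(text)
print(restored == student)            # True
print(restored["contact"]["email"])   # grace@example.com
```

`None` becomes `null` in the JSON text and comes back as `None` when loaded.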

<center>
<b>JSON (Example):</b>
<img src="https://alexwebdevelop.com/wp-content/uploads/2020/05/json-nested-objects.png"></img>
<br>

## Other Types? XML, YAML - Many more
CSV and JSON are not the only data formats available to us. Many more exist, a common example being XML.

Readability is an important aspect of data formats. Especially where humans are meant to interact with them. E.g Lots of Microsoft data output is XML, and is overly verbose!

Anybody can make a data format, and decide what things mean. Its widespread adoption and usefulness dictate how good a data format it is!

<center>
    <b>XML (Example, overly verbose, lots of tags, looks very 'HTML'-esque):</b>
<img src="https://i2.sitepoint.com/graphics/achapter01-rawxmlie.png"></img>

# Exceptions
You may have already encountered errors before now when doing workshops. Think back to Workshop 1! `"Grace" + 5`. So far these have just been red scary messages which we can't do anything about. Python executes the problematic code and something bad happens (blows up at runtime).

```python
x = "Grace" + 5
```

In [16]:
x = "Grace" + 5

TypeError: can only concatenate str (not "int") to str

Python provides a way to 'catch' errors that are thrown. It allows us to attempt code that we know may be 'risky', and to do something if that error is encountered.

We need to introduce two new keywords to you: `try` and `except`. They follow the general formula:

```python
try:
    # 'Risky' code here
except SomeExceptionType as friendly_name:
    # Do something with friendly_name
```

In [17]:
try:
    x = "Grace" + 5
except Exception as err:
    print(err)
    print(type(err))

can only concatenate str (not "int") to str
<class 'TypeError'>


You may notice two things here:
1) `Exception` is a Type. All Errors in Python are actually Types. This allows us to check for them, and even derive from them.

2) `Exception` is the most generic type of error we normally catch. In fact, almost ALL other errors in Python stem from this (a few special ones, such as `KeyboardInterrupt` and `SystemExit`, stem from `BaseException` instead). `Exception` will catch nearly any error. Notice that the Type of our error here was `TypeError`, yet it entered the `Exception` Type block. This is the equivalent of saying `Dog` is a type of `Animal`. Therefore, if we're catching all animals, of course Dog matches this.
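
We can check these "is a type of" relationships ourselves with the built-in `issubclass`. This is only a quick sketch, but it also shows that a handful of built-ins, such as `KeyboardInterrupt`, sit outside `Exception`, under `BaseException`.

```python
# issubclass(A, B) answers the question "is A a type of B?"
print(issubclass(TypeError, Exception))                 # True
print(issubclass(ZeroDivisionError, ArithmeticError))   # True
print(issubclass(ArithmeticError, Exception))           # True

# A few special errors are NOT a type of Exception...
print(issubclass(KeyboardInterrupt, Exception))         # False
# ...but everything is a type of BaseException.
print(issubclass(KeyboardInterrupt, BaseException))     # True

# The full family tree of one error, most specific first.
lineage = [cls.__name__ for cls in ZeroDivisionError.__mro__]
print(lineage)
# ['ZeroDivisionError', 'ArithmeticError', 'Exception', 'BaseException', 'object']
```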

In general, it is a good idea to use specific errors. This is beneficial as it allows you to execute specific code in response.
We can stack as many `except` clauses here as we need.

E.g
```python
try:
    x = "Grace" + 5
except TypeError as e:
    print("Specific")
except Exception as err:
    print("Catch-all")
```

In [18]:
try:
    x = "Grace" + 5
except TypeError as e:
    print("Specific")
except Exception as err:
    print("Catch-all")

Specific


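<u><b>Note:</b></u>
Python checks the `except` clauses from top to bottom and runs the first one that matches. It will only ever run a single `except` clause here. Notice how it never prints "Catch-all"? This is why the specific clauses go first and the generic one goes last.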
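
Which clause runs depends on the order the clauses are written in: Python works down the list and takes the first one that fits. A small sketch of the same error with the two clauses in each order:

```python
# Specific clause first: TypeError is checked before Exception.
try:
    x = "Grace" + 5
except TypeError:
    first_result = "Specific"
except Exception:
    first_result = "Catch-all"

# Catch-all first: Exception already fits a TypeError,
# so the TypeError clause below it can never run.
try:
    x = "Grace" + 5
except Exception:
    second_result = "Catch-all"
except TypeError:
    second_result = "Specific"

print(first_result)    # Specific
print(second_result)   # Catch-all
```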

What if we modify our `try` block? Let's make some arithmetic errors!

In [21]:
try:
    x = 5 / 0
except TypeError as e:
    print("Specific")
except Exception as err:
    print("Catch-all")
    print(type(err))

Catch-all
<class 'ZeroDivisionError'>


I've done a bad thing with numbers. Dividing by zero is a `ZeroDivisionError`, which is a form of `ArithmeticError`. Because I haven't written an `except` clause for it, it's gone to the only one which fits: `Exception` (that wonderful most generic of errors).

<u> What if I didn't have `except Exception:` there?</u> 

Well, then it would behave as your Python scripts have thus far. Python will error out, red text everywhere!

In [None]:
try:
    x = 5 / 0
except TypeError as e:
    print("Specific")
#except Exception as err:
#    print("Catch-all")
#    print(type(err))

You can see a whole bunch of built-in exceptions at the following documentation page: https://docs.python.org/3.8/library/exceptions.html

## Exceptions in File Handling
When you look up a function in the documentation, the developers should tell you which Exceptions that function can 'throw'. (https://docs.python.org/3.8/library/functions.html?highlight=open#open) However, it is not always clear!

<u>How am I to know?</u> Lots of trial and error. If a new exception pops up that is expected, and not just a mistake on your part, then you can write the `except` clause for it!

When opening a file, if it doesn't exist we'll get a `FileNotFoundError`.

```python
f = open('file_not_existing.txt')
# Do some file stuff.
print(f.read())
print("All done with file stuff now. I should close it!")
f.close()
```

In [22]:
f = open('file_not_existing.txt')
# Do some file stuff.
print(f.read())
print("All done with file stuff now. I should close it!")
f.close()

FileNotFoundError: [Errno 2] No such file or directory: 'file_not_existing.txt'

In [26]:
f = None
try:
    f = open('file_not_existing.txt')
except FileNotFoundError as err:
    print("Whoops, we need an actual file!")

print("We can actually continue doing things though :)")
# We shouldn't really continue unless we have f correctly defined though!

# Do some file stuff.
#print(f.read())
#print("All done with file stuff now. I should close it!")
#f.close()

Whoops, we need an actual file!
We can actually continue doing things though :)
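
One way to avoid carrying on without a valid file is to check whether `f` is still `None` before using it. This is only a sketch: the path points into a fresh temporary folder so that the file is guaranteed to be missing, and `status` is a made-up variable to record what happened.

```python
import os
import tempfile

# A path inside a brand new, empty folder: the file cannot exist.
tmp_dir = tempfile.mkdtemp()
missing_path = os.path.join(tmp_dir, "file_not_existing.txt")

f = None
try:
    f = open(missing_path)
except FileNotFoundError as err:
    print("Whoops, we need an actual file!")

if f is not None:
    # Only do the file stuff if open() actually worked.
    print(f.read())
    f.close()
    status = "read"
else:
    print("No file, so skipping the file stuff.")
    status = "skipped"

os.rmdir(tmp_dir)  # Tidy up the temporary folder.
print(status)      # skipped
```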


What if something bad happens whilst we're dealing with a file? Python raises an error, and the line that closes the file is never reached.

As we need to always ensure the file is closed once we're done with it, this poses a problem.

<u>Solution?</u>

We could have many `except` clauses, hope that they're all there. Or use a generic Catch-all. But we'd still need to do file close in each one...

Introducing: `finally`. Finally is part of the whole `try` `except` structure. It allows us to try some code, catch any potential exceptions, and then finally execute some other expressions.
This will be executed <u>no matter what</u>. Whether the try block raises an error, or not. This is perfect for our use-case.

<u>Caution:</u> Keep this block as simple and clean as possible. If an error happens here, you're still in trouble.
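
To convince ourselves of the "no matter what" part, here is a tiny sketch that records which blocks run, once when the `try` succeeds and once when it fails. The `events` list is just a made-up way of keeping track.

```python
events = []

# Case 1: nothing goes wrong in the try block.
try:
    events.append("try ok")
except TypeError:
    events.append("except")
finally:
    events.append("finally")

# Case 2: the try block raises a TypeError.
try:
    x = "Grace" + 5
except TypeError:
    events.append("except")
finally:
    events.append("finally")

print(events)  # ['try ok', 'finally', 'except', 'finally']
```

The `finally` block ran both times, after everything else in its `try` statement.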

For this example, let's assume our file exists. We want to do some things to it, so we have some Python code to execute. If we get an error here, we'll never get to our file close. Let's use `"Grace" + 5` as our poisoned chalice.
    
```python
f = open('my_file.txt')
# Do some file stuff.
print(f.read())
x = "Grace" + 5 # Our 'wrong' bit of code. This could easily be ANY error.
print("All done with file stuff now. I should close it!")
f.close()
```

In [27]:
f = open('my_file.txt')
# Do some file stuff.
print(f.read())
x = "Grace" + 5 # Our 'wrong' bit of code. This could easily be ANY error.
print("All done with file stuff now. I should close it!")
f.close()

Hello Students!



TypeError: can only concatenate str (not "int") to str

Let's fix this up with our `try` and `except` keywords.

In [28]:
f = open('my_file.txt')

try:
    # Do some file stuff.
    print(f.read())
    x = "Grace" + 5 # Our 'wrong' bit of code. This could easily be ANY error.
    print("All done with file stuff now. I should close it!")
    f.close() # This still won't get executed!
except TypeError as e: # I know the Type here. Could use `Exception` as the most generic.
    print("Drat. Something went wrong: ", e)

Hello Students!

Drat. Something went wrong:  can only concatenate str (not "int") to str


**Mixing in our finally**

In [29]:
f = open('my_file.txt')

try:
    # Do some file stuff.
    print(f.read())
    x = "Grace" + 5 # Our 'wrong' bit of code. This could easily be ANY error.
except TypeError as e: # I know the Type here. Could use `Exception` as the most generic.
    print("Drat. Something went wrong: ", e)
finally: # Executed no matter what!
    print("All done with file stuff now. I should close it!")
    f.close() # Now this always gets executed!

Hello Students!

Drat. Something went wrong:  can only concatenate str (not "int") to str
All done with file stuff now. I should close it!


# Context Managers - Best Practice
So far we have had to remember to close the file ourselves, and be very conscious of potential errors coming from somewhere and ruining our day.

Introducing <u>Context Managers</u>. These 'wrap' our code and provide a convenient access to files within a specific context. They automatically handle closing the file for us.

We need a new keyword here. Called `with`.

Benefits:
* Reduces the amount of "boilerplate" code that we need. **Handles file closing for us**. Decluttered -> More Pythonic.
* If you continually 'open' files without closing them, eventually you'll hit a limit! Only a limited number of files can be open by your OS at any time.
* Primarily, if an exception is encountered, Python automatically closes the file properly!

```python
f = open('...')  # Open first: if this fails there is nothing to close.
try:
    # Do stuff
finally:
    f.close()
```
Becomes:
```python
with open('...') as f:
    # Do stuff
```

Notably, this uses the `as` keyword which we introduced in Workshop 2 to provide a handle to the file itself. When we move outside of the block, the file will be closed.
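
File objects have a `closed` attribute, so we can check this for ourselves. In this sketch the file lives in a temporary folder (so the path is an assumption, not your real `my_file.txt`), and we reuse `"Grace" + 5` to blow up inside the `with` block.

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "my_file.txt")

# Writing works with a context manager too.
with open(path, mode="w") as my_file:
    my_file.write("Hello Students!\n")

print(my_file.closed)  # True: closed as soon as we left the block.

try:
    with open(path) as my_file:
        contents = my_file.read()
        x = "Grace" + 5  # Our 'wrong' bit of code again.
except TypeError as e:
    print("Drat. Something went wrong:", e)

print(my_file.closed)  # True: closed even though an error was raised.
print(contents)

shutil.rmtree(tmp_dir)  # Tidy up the temporary folder.
```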

In [30]:
f = open('my_file.txt')
print(f.read())

Hello Students!



In [31]:
with open('my_file.txt') as my_file:
    print(my_file.read())

Hello Students!

