# 🛠 IFQ718 Module 03 Exercises-04

## 🔍  Context: File input and output

The file system on your computer provides long-term storage that is retained even after the machine has been powered down. This is unlike main memory, which is intended for short-term storage with fast read and write access. When executing Python code, the functions and variables that you create persist in memory until the kernel/interpreter exits. This may be as short as milliseconds, or as long as hours if you leave the kernel idle while you work through these Notebooks (which is not a problem). The data that we create in memory can be transferred to the file system by *writing* it, then restored later by *reading* it again. This is called file input/output (IO); i.e., the data that is read from a file is *input* to our program, and the data that the program writes to disk is its *output*.

In this Notebook, we will introduce you to reading from and writing to the file system. 

### Writing to file

Let's begin by creating some files that we can later read.

To do this, we need to open a *file handler* (Python calls this a *file object*) with a specified *path* and *mode*.

The *path* is determined by you. It could be:
* an absolute path
   * i.e., a path that begins at the root of your file system
   * On Windows, an absolute path may look like `C:\Users\Jake\Documents\myfile.txt`
   * On macOS, an absolute path may look like `/Users/Jake/Documents/myfile.txt`; on Linux, it may look like `/home/jake/Documents/myfile.txt`
   
* a relative path
   * i.e., a path that is relative to your current working directory
   * a relative path may look like `myfile.txt` (as the file is within the current working directory)
   * or, a relative path may look like `../myfile.txt` (as the file is in the directory one level up from the current working directory)
   * or, a relative path may look like `../../myfile.txt` (as the file is in the directory two levels up from the current working directory)
   * or, a relative path may look like `data/myfile.txt` (as the file is in the `data` directory that is within the current working directory)
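
To see how Python interprets these paths, the sketch below turns a few relative paths into absolute ones. It uses the standard-library `pathlib` module, which is not otherwise used in this Notebook, and the file names are only examples (the files do not need to exist). The output depends on your own current working directory.

```python
from pathlib import Path

# The current working directory is the starting point for relative paths
cwd = Path.cwd()
print(f'Current working directory: {cwd}')

# Each relative path is resolved against the current working directory
relative_paths = ['myfile.txt', '../myfile.txt', 'data/myfile.txt']
resolved = {}
for relative in relative_paths:
    resolved[relative] = (cwd / relative).resolve()
    print(f'{relative} -> {resolved[relative]}')
```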
   
The most commonly used options for the *mode* are:

| Mode | Description |
|:----:|:-----------:|
| `r`  | For reading only; the file must already exist |
| `w`  | For writing; creates the file, or overwrites it if it already exists |
| `a`  | For writing to the end of a file (i.e., "appending"); creates the file if it does not exist |
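
The difference between `w` and `a` is easiest to see by trying both on the same file. The following is a minimal sketch: the file name `modes_demo.txt` is made up for illustration, and the file is created in a temporary directory (using the standard-library `tempfile` module) so that none of your own files are touched.

```python
import os
import shutil
import tempfile

# Work inside a temporary directory so no existing files are affected
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

fp = open(path, 'w')   # 'w' creates the file
fp.write('first\n')
fp.close()

fp = open(path, 'w')   # opening with 'w' again discards 'first'
fp.write('second\n')
fp.close()

fp = open(path, 'a')   # 'a' keeps the existing content and adds to the end
fp.write('third\n')
fp.close()

fp = open(path, 'r')   # 'r' reads the file without changing it
content = fp.read()
fp.close()

print(content)

# Remove the temporary directory and the file inside it
shutil.rmtree(tmp_dir)
```

The printed content should contain `second` and `third`, but not `first`, because the second `open(path, 'w')` emptied the file before writing.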


**Write some text to file**

In [None]:
fp = open('message.txt', 'w')

message = "Hello, World!"

fp.write(message)
    
fp.close()

**Read the file**

In [None]:
fp = open('message.txt', 'r')

print(fp.readlines())
    
fp.close()

**Write multiple lines to a file**

In [None]:
fp = open('messages.txt', 'w')

message = "Hello, World!"

for i in range(5):
    fp.write(message)
    
fp.close()

**Again, read the file**

In [None]:
fp = open('messages.txt', 'r')

print(fp.readlines())
    
fp.close()

Oh dear, what has happened here? 

There is only one line in the `messages.txt` file, because `write()` does not add a newline character after the text. Open the file in the text editor, from the Jupyter file explorer, to see this for yourself.

Let's fix the code to achieve what I intended.

**Attempt 2, write multiple lines to a file**

In [None]:
fp = open('messages.txt', 'w')

message = "Hello, World!"

for i in range(5):
    fp.write(f'{message}\n')
    
fp.close()

**Read...**

In [None]:
fp = open('messages.txt', 'r')

print(fp.readlines())
    
fp.close()

Hmm, but the newline character `\n` is still at the end of each line.

I will remove it using `.strip()`, which removes all leading and trailing whitespace (including the newline) from a string.

In [None]:
fp = open('messages.txt', 'r')

lines = fp.readlines()
lines_stripped = []

for line in lines:
    lines_stripped.append(line.strip())
    
print(lines_stripped)
    
fp.close()

Great!

But, here is an even tidier way:

In [None]:
fp = open('messages.txt', 'r')

lines = []

# I am able to read the file line by line using a for-loop and the file handler
for line in fp:
    lines.append(line.strip())
    
print(lines)
    
fp.close()

Okay, that is the same output, and the code is a bit shorter.
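
If you have already met list comprehensions, the loop can be shortened further. This is only a sketch of one alternative, not a required technique; it writes its own small file, `messages_demo.txt` (a made-up name), so that the cell can run on its own.

```python
# Recreate a small file so this cell can run on its own (hypothetical file name)
fp = open('messages_demo.txt', 'w')
for i in range(3):
    fp.write('Hello, World!\n')
fp.close()

fp = open('messages_demo.txt', 'r')

# A list comprehension builds the list of stripped lines in one statement
lines = [line.strip() for line in fp]

fp.close()

print(lines)
```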

### Context managers

Speaking of making code shorter... this is a good opportunity to introduce the *context manager*. It is particularly useful for reading and writing files.

In the context of file IO, a context manager will handle opening and closing the file for us. 

The code within the context of the manager will be able to access the file. When that block of code ends, the file is closed automatically, so any code outside the manager's context can no longer read from or write to it.

To create a context manager, use the `with` statement:

In [None]:
with open('messages.txt', 'r') as fp:
    lines = []
    for line in fp:
        lines.append(line.strip())
        
print(lines)

Notice the lack of `fp.close()`, and that even fewer lines are needed to achieve the same result.
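
You can check that the context manager really does close the file by inspecting the file object's `closed` attribute. The sketch below uses a made-up file name, `context_demo.txt`, and a deliberately raised `ValueError` to show that the file is closed even when the code inside the block fails.

```python
# Recreate a small file so this cell can run on its own (hypothetical file name)
with open('context_demo.txt', 'w') as fp:
    fp.write('Hello, World!\n')

with open('context_demo.txt', 'r') as fp:
    print(fp.closed)   # False: the file is open inside the block
    first_line = fp.readline().strip()

print(fp.closed)       # True: the file was closed when the block ended

# The file is closed even if the code inside the block raises an error
try:
    with open('context_demo.txt', 'r') as fp:
        raise ValueError('something went wrong')
except ValueError:
    print(fp.closed)   # True

# Reading from a closed file is not allowed
try:
    fp.read()
except ValueError as error:
    print(error)
```

With a manual `open()` and `close()`, an error raised between the two calls would skip `fp.close()` and leave the file open, which is one reason the `with` form is generally preferred.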

### ✍ Activity 1: Parse a FASTA-formatted file

> In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The format originates from the FASTA software package, but has now become a near universal standard in the field of bioinformatics.
>
> Source [Wikipedia](https://en.wikipedia.org/wiki/FASTA_format)

Your task in this activity is to read the FASTA-formatted file and print the headers within the file.

Headers are the lines beginning with `>`. Other lines contain the genetic sequences, and can be ignored.

The next code cell will download the FASTA file from GitHub for you and save it in the `data` directory.

In [None]:
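import os
import urllib.request
os.makedirs('data', exist_ok=True)  # urlretrieve fails if the target directory does not exist
urllib.request.urlretrieve('https://raw.githubusercontent.com/jupyterlab/jupyterlab-demo/master/data/zika_assembled_genomes.fasta', 'data/zika_assembled_genomes.fasta')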

In [None]:
# Write your code in this cell

with open('data/zika_assembled_genomes.fasta', 'r') as fp:
    pass  # replace this line with your code

### ✍ Activity 2: Separate the FASTA file into multiple files

As you discovered, the FASTA file that you downloaded in the previous activity contains multiple assembled genomes. 

Here, your task is to separate each genome into its own file.

Hints:
   * Remember that the name of the file you are creating/writing to is a Python `string`. 
      * Update the `string` representing the file name for each genome.
   * Try naming the new files appropriately, based on the FASTA header
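
As a starting point for the hints above, here is one possible way to turn a header into a file name. The header below is made up for illustration, so the real headers in the file will look different, and the choice to keep only the first word is an assumption that you may want to change.

```python
# A made-up header for illustration; real headers in the file will differ
header = '>EXAMPLE001.1 Example virus isolate demo, complete genome'

# Drop the leading '>' and keep the first word as an identifier
identifier = header[1:].split()[0]

# Replace characters that are awkward in file names
safe_identifier = identifier.replace('/', '_').replace('|', '_')

# Build the file name as an ordinary Python string
file_name = f'data/{safe_identifier}.fasta'
print(file_name)
```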

In [None]:
# Write your code here