# Part 5: File Handling

**NOTE: For the live demo, local file systems will be accessed and examples of the code below will be run locally. You will need to change the file paths before running much of the below code on your computer.**

If you are running Python in the cloud, using Google Colab to run a Jupyter notebook, additional steps are needed to access external files (not recommended for this portion of the workshop). A tutorial on how to do this is available here: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92

Python modules used in demo:
* `csv`
* `json`
* `pandas` (may require installation if never used)
* `Bio` (biopython)


---

## Opening & closing a file

To read from or write to a file, we must first open it. The `open()` function takes two main arguments - the filepath, and the mode in which to open the file (e.g. read, write, append) - and returns a file object. The most common mode arguments are as follows:

*   `r` - **Read (default).** Opens a file for reading; error if file does not exist
*   `w` - **Write.** Opens a file for writing; creates the file if it does not exist (warning: this will overwrite an existing file!)
*   `a` - **Append.** Opens a file for appending; creates the file if it does not exist

When you are finished with a file, it is important to close it so that any buffered data is written to disk and the system resources it holds are released. This is done using the `close()` method.

Below is an example of opening, then immediately closing a file.

In [None]:
fp = open('/path/to/file.txt', 'r')     # open file in 'read' mode
fp.close()                              # close file

Instead of having to explicitly close a file, Python provides the `with` statement, which automatically cleans up after you:

In [None]:
with open('/path/to/file.txt', 'r') as fp:
    # do some work with fp
    pass    # placeholder so the block is valid Python

This is equivalent to the following try-finally block (where `close()` is always run):

In [None]:
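fp = open('/path/to/file.txt', 'r')     # opened before 'try': if this fails, there is nothing to close
try:
    # do some work with fp
    pass
finally:
    fp.close()                          # runs whether or not an error occurred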
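
To confirm that the `with` statement really does close the file, even when an error occurs inside the block, here is a small self-contained sketch. It works in a temporary directory; the file name `demo.txt` and the deliberately raised `ValueError` are illustrative assumptions, not part of the workshop files.

```python
import os
import tempfile

# create a scratch directory so the example does not touch real files
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.txt')   # hypothetical file name

with open(demo_path, 'w') as fp:
    fp.write('hello\n')
    print(fp.closed)        # the file is still open inside the block

print(fp.closed)            # the file was closed on leaving the block

# the file is also closed when an exception interrupts the block
try:
    with open(demo_path, 'r') as fp2:
        raise ValueError('something went wrong')   # simulated error
except ValueError:
    pass

print(fp2.closed)
```

The `closed` attribute of a file object is `False` while the file is open and `True` once it has been closed.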

---

## Reading Files

The file object returned by `open()` has 3 methods to read data:

*   `read()` - stores all the data into one text string; useful for small files where you want to do text manipulation on the entire file
*   `readlines()` - reads all the lines of the file at once, and returns a list of strings (each element corresponds to one line)
*   `readline()` - reads individual lines one at a time (increments); each time it is called, it reads another line. Great for handling very large files one line at a time
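
To make the difference between the three methods concrete, the sketch below applies each of them to the same small file. The file is created in a temporary directory first, and its name (`sample.txt`) and contents are made up for illustration.

```python
import os
import tempfile

# build a small three-line file to read back (hypothetical name)
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w') as fp:
    fp.write('first\nsecond\nthird\n')

with open(sample_path, 'r') as fp:
    whole_text = fp.read()          # one string containing the entire file

with open(sample_path, 'r') as fp:
    all_lines = fp.readlines()      # list of strings, one per line

with open(sample_path, 'r') as fp:
    line_one = fp.readline()        # first call returns the first line
    line_two = fp.readline()        # second call returns the next line

print(repr(whole_text))
print(all_lines)
print(repr(line_one), repr(line_two))
```

Note that every line keeps its trailing `\n`, which is why the examples in this section call `strip()` before printing.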

Below is an example of reading a file line-by-line:

In [None]:
with open('/path/to/file.txt', 'r') as fp:
    line = fp.readline()
    count = 0
    while line:
        count += 1
        print("[{}]: {}".format(count, line.strip()))
        line = fp.readline()

We can do even better, by taking advantage of Python syntax to easily iterate through the file object, line-by-line:

In [None]:
with open('/path/to/file.txt', 'r') as fp:
    for count, line in enumerate(fp, start=1):   # start=1 so numbering begins at 1, as in the previous example
        print("[{}]: {}".format(count, line.strip()))

---

## Writing files

Writing files is quite straightforward in Python. We use the `open()` function as before, this time opening the file in write mode. We can then use the `write()` method to write strings to the file. Note that `write()` does not add a line break for you: to end a line, include `\n` at the end of the string.

Note that if no file exists at the given filepath, it will be created automatically. Conversely, if a file with that name already exists, it will be overwritten (and the previous data lost).

Below is a simple example of writing 10 lines to a file.

In [None]:
with open('/path/to/outfile.txt', 'w') as fp:
    for i in range(10):
        fp.write("This is line {}\n".format(i))
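
The difference between write mode and append mode is easiest to see by running both against the same file. The following is a minimal sketch using a temporary directory; the file name `log.txt` and the text written are illustrative assumptions.

```python
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), 'log.txt')   # hypothetical name

with open(log_path, 'w') as fp:     # creates the file
    fp.write('first run\n')

with open(log_path, 'w') as fp:     # 'w' again: previous contents are discarded
    fp.write('second run\n')

with open(log_path, 'a') as fp:     # 'a': new text is added to the end
    fp.write('third run\n')

with open(log_path, 'r') as fp:
    contents = fp.read()

print(contents)
```

Only the second and third lines survive: reopening the file in `w` mode erased `first run`, while `a` mode kept `second run` and added to it.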

---

## Reading TSV and CSV files

Datasets are often shared in one of the following file formats:

* **CSV (Comma Separated Values)** - A highly compatible and prevalent format. Not easily human readable.
* **TSV (Tab Separated Values)** - Easy to read for humans, easy to work with, and often more efficient for software. Tabs are generally a more robust delimiter, since they rarely show up in data (unlike commas).

Remember that different file extensions do not affect the ability of your program to read its contents (e.g. using .txt for TSV files).
Below is an example of a function that can read TSV or CSV files line-by-line, simply by changing the delimiter you pass as an argument:

In [3]:
def readFile(filepath, sep):
    try:
        with open(filepath, 'r') as fp:
            for line in fp:
                rowArray = line.strip().split(sep)   # remove the trailing newline, then split into fields
                # do something with rowArray (e.g. print, save, populate objects etc.)
    except FileNotFoundError as e:
        print("Error: File not found.")
    except OSError as e:
        print("System-related or IO error.")
    except Exception as e:
        print("Sorry, can't read the file.")
    return


# testing the function
readFile("myCSVdata.csv", ",")  # use a comma as the separator argument (CSV)
readFile("myTSVdata.tsv", "\t")  # use a tab as the separator argument (TSV)

Error: File not found.
Error: File not found.


There are also both built-in and external tools to help read TSV and CSV files.

Python has a built-in `csv` library, which helps both read from and write to CSV files (it can also handle TSV files, if you change the delimiter from `,` to `\t`). Below is an example:

In [None]:
import csv

with open('myfile.csv', newline='') as fp:   # newline='' is recommended when using the csv module
    csv_reader = csv.reader(fp, delimiter=',')
    lineCount = 0
    for rowArray in csv_reader:
        if lineCount == 0:          # header row
            # do something with header row
            lineCount += 1
        else:
            # do something with rowArray
            lineCount += 1

    print('Processed {} lines.'.format(lineCount))
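
Since the `csv` module can write as well as read, here is a hedged sketch of a full round trip with a tab delimiter. It uses `csv.writer` to create a small TSV file and `csv.DictReader` to read it back, so each row can be accessed by column name. The file name `genes.tsv` and the gene data are made up for illustration.

```python
import csv
import os
import tempfile

tsv_path = os.path.join(tempfile.mkdtemp(), 'genes.tsv')   # hypothetical name

rows = [
    ['gene', 'length'],     # header row
    ['abc1', '1200'],       # made-up data
    ['xyz2', '845'],
]

# newline='' lets the csv module control line endings itself
with open(tsv_path, 'w', newline='') as fp:
    writer = csv.writer(fp, delimiter='\t')
    writer.writerows(rows)

# DictReader uses the first row as the keys for every following row
with open(tsv_path, 'r', newline='') as fp:
    records = list(csv.DictReader(fp, delimiter='\t'))

for record in records:
    print(record['gene'], record['length'])
```

With `DictReader` there is no need to track the header row by hand, and all values are read back as strings.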

Reading CSV and TSV files can also be done with external libraries like **pandas** (an excellent library for working with datasets in general), which reads and writes to and from "pandas dataframe objects".

Here's the simplest example:

In [None]:
import pandas as pd
df = pd.read_csv('mydata.csv') # reading csv file into pandas dataframe object
print(df)

df.to_csv('outfile.csv')    # writing pandas dataframe object to file

There are many additional arguments you can pass to the `read_csv()` function to specify things like the presence of a header row (`header`), the index column (`index_col`), and the separator (`sep`, so you can also read TSV files by passing `sep='\t'`).

A full discussion of the Pandas library is beyond the scope of this workshop, however it is highly encouraged that you explore it afterwards if you are doing data analysis work: https://pandas.pydata.org/

---

## Working with FASTA files

FASTA files - like CSV or TSV - contain plain text formatted data, and are used to represent nucleotide or peptide sequences. This file format is very common in bioinformatics and computational biology - you are likely to have come across this format, or worked with these files in the past if you are in biology.

A sequence in FASTA format begins with a single-line description (preceded by '>'), followed by lines of sequence data. Files can contain a single sequence, or many thousands of sequences! Common extensions are `.fasta` and `.fa`.

There are different ways of reading and handling FASTA-formatted data.

One method is to simply use external libraries to read and process FASTA files. For example, here is an example using the `SeqIO` module from Biopython:

In [None]:
!pip install biopython # installing Biopython (for Google Colab)

In [None]:
from Bio import SeqIO

filenameIn = 'someFastaFile.fasta'
filenameOut = 'outputFastaFile.fasta'

with open(filenameOut, 'w') as fp:
    # SeqIO.parse accepts a file path directly, so no separate open() is needed
    for record in SeqIO.parse(filenameIn, 'fasta'):
        # do something (print or edit the record)
        print('SequenceID = ' + record.id)
        print('Seq = ' + str(record.seq) + '\n')    # convert the Seq object to a plain string

        # write the record to the new FASTA file
        SeqIO.write(record, fp, 'fasta')

To install BioPython (an external library) on your machine, you can follow the instructions here: https://biopython.org/wiki/Download

However, you can also create your own tools to read and process FASTA files, without having to rely on external libraries. Below is an example:

In [None]:
class FastaRecord:
    """Class representing a FASTA record."""

    def __init__(self, description):
        """Initialise an instance of the FastaRecord class."""
        self.description = description.strip()
        self.sequences = []

    def addSeqLine(self, sequence):
        """Add a sequence line to the FastaRecord."""
        self.sequences.append(sequence.strip())

    def __repr__(self):
        """Generate string representation of FastaRecord objects."""
        lines = [self.description,]
        lines.extend(self.sequences)
        return '\n'.join(lines)

class FastaParser:
    """Class for parsing FASTA files; populates FastaRecord objects."""

    def __init__(self, filepath):
        """Initialise new instance of FastaParser."""
        self.filepath = filepath

    def __iter__(self):
        """Yield FastaRecord instances."""
        fastaObj = None
        with open(self.filepath, 'r') as f:
            for line in f:
                if line.startswith('>'):            # beginning of new fasta record
                    if fastaObj:
                        yield fastaObj              # release existing fasta record
                    fastaObj = FastaRecord(line)    # instantiate new fasta record
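                elif fastaObj and line.strip():     # skip blank lines and any text before the first '>'
                    # add additional sequence line to fasta record
                    fastaObj.addSeqLine(line)
        if fastaObj:                                # nothing to yield if the file had no records
            yield fastaObj                          # release last fasta record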


# let's look at an example of how we would use our parser
parser = FastaParser("someFastaFile.fa")
for record in parser:
    print(record)
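
As a more compact alternative to the two classes above, the same idea can be sketched as a single generator function that yields `(description, sequence)` tuples. This is only one possible design: the function name `read_fasta`, the file name `toy.fa`, and the sequences are all hypothetical and chosen for illustration.

```python
import os
import tempfile

def read_fasta(filepath):
    """Yield (description, sequence) tuples from a FASTA file."""
    description = None
    chunks = []
    with open(filepath, 'r') as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue                        # skip blank lines
            if line.startswith('>'):
                if description is not None:
                    yield description, ''.join(chunks)
                description = line[1:]          # drop the leading '>'
                chunks = []
            elif description is not None:
                chunks.append(line)             # collect sequence lines
    if description is not None:
        yield description, ''.join(chunks)      # release the last record


# build a tiny two-record FASTA file to test with (made-up sequences)
fasta_path = os.path.join(tempfile.mkdtemp(), 'toy.fa')
with open(fasta_path, 'w') as fp:
    fp.write('>seq1 first toy record\nACGT\nTTGA\n>seq2\nGGCC\n')

records = list(read_fasta(fasta_path))
for description, sequence in records:
    print(description, len(sequence))
```

Unlike the class-based parser, this version joins the sequence lines into one string, which is convenient when you only need sequence lengths or simple string operations.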

Experiment with the code above, or have a go at writing your own custom FASTA parser!

---

## Reading JSON files

JSON (JavaScript Object Notation) is an open-standard, human-readable file format for data transfer. It is a very common format, and many programming languages (including Python) have libraries to generate and parse JSON files and formatted data. This format can be particularly useful for saving object attributes, or as simple configuration files.

JSON files use the `.json` extension.

Below is an example of JSON-formatted data:


```
{
    "firstname": "Eisha",
    "lastname": "Ahmed",
    "isStudent": true,
    "favNumber": 163,
    "phoneNumbers": [
        {
            "type": "home",
            "number": "514-123-4567"
        },
        {
            "type": "mobile",
            "number": "514-555-5555"
        }
    ]
}
```

Looks a lot like a Python dictionary! Indeed, we can write Python objects like dictionaries and lists to JSON files by importing the `json` module. Similarly, we can read JSON files into a Python dictionary. See below for an example of reading and writing to a JSON file.

In [None]:
import json

# here's an example of a nested dictionary
data = {
    "firstname": "Eisha",
    "lastname": "Ahmed",
    "isStudent": True,
    "favNumber": 163,
    "phoneNumbers": [
        {"type": "home", "number": "514-123-4567"},
        {"type": "mobile", "number": "514-555-5555"},
    ]
}

# let's export it to a json file using json.dump()
with open("filename.json", "w") as fp:
    json.dump(data, fp)

# we can just as easily read a json file using json.load()
with open("filename.json", "r") as fp2:
    data2 = json.load(fp2)

print(data2)

What if we have a JSON formatted string? (e.g. text read directly from a file or another program) Easy! Just use `json.loads()`.

In [None]:
import json

dataString = """
{
    "firstname": "Eisha",
    "lastname": "Ahmed",
    "isStudent": true,
    "favNumber": 163,
    "phoneNumbers": [
        {"type": "home", "number": "514-123-4567"},
        {"type": "mobile", "number": "514-555-5555"}
    ]
}
"""

data3 = json.loads(dataString)
print(data3)
print(data3['favNumber'])
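
The reverse direction works the same way: `json.dumps()` turns a Python object into a JSON formatted string. The sketch below also passes the `indent` argument to produce a more readable layout; the `settings` dictionary is made-up example data.

```python
import json

# made-up example data
settings = {"name": "demo", "threshold": 0.5, "tags": ["a", "b"], "active": True}

compact = json.dumps(settings)              # dictionary -> JSON formatted string
pretty = json.dumps(settings, indent=4)     # same data, indented for readability

print(compact)
print(pretty)

# converting back gives a dictionary equal to the original
restored = json.loads(pretty)
print(restored == settings)
```

Notice that Python's `True` is written as `true` in the JSON output, and is converted back to `True` when the string is loaded again.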

This external reference contains a great tutorial on working with JSON formatted data in Python: https://realpython.com/python-json/