# File IO

This notebook provides an introduction to file input and output (IO). Since we're working in Google Colab, we'll be reading and writing files to our Google Drive, but exactly the same principles apply when reading and writing files to your local file system.

## Importing files to Google Colab

There are two basic ways to access your files from Colab. If your files are already on Google Drive, you can read files from there. If not, you can upload your files directly to Colab. To upload files directly to Colab, we need the 'files' module.

In [None]:
from google.colab import files

my_file = files.upload()

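This will prompt you to select a file from your computer to upload. `files.upload()` returns a dictionary that maps each uploaded file name to the contents of that file as bytes.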

Rather than uploading each file from our computer to Colab, we can mount our Google Drive and then read and write directly to it. When you run this cell, you'll be prompted to authorise Colab to access your Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Once mounted, the files in your Google Drive will be accessible in the directory "/content/drive/My Drive". You can list the contents of your Drive using the notebook command !ls.

In [None]:
!ls "/content/drive/My Drive"

Try creating an empty text file called 'example.txt' in your Drive, then checking that it's visible in Colab.

### Aside: modules and packages and libraries, oh my!

A module is a collection of resources, such as functions, that have been bundled together in a single file. Importing a module in your file makes the contents of that module available for you to use in your code. You've already used a module when you learned about regular expressions - we imported the 're' module, which is part of Python's standard library, to get access to a collection of functions related to regular expressions.

Often, multiple related modules will be bundled together so that they can be distributed as a single resource. A collection of modules is known as a package, and is typically the way in which Python resources are distributed.

A general collection of resources, such as multiple packages that may or may not be related, is often referred to as a library. For example, the 'Python standard library' is the collection of all of the default resources that are included with Python. Don't worry if this sounds a bit abstract right now - you'll learn more about using libraries later.

In the cells above, 'google' is the top-level package, and 'colab' is a sub-package containing the tools provided by Colab. 'drive' and 'files' are modules in the 'colab' sub-package that contain functionality for reading and writing data on Google Drive or your local system, respectively.
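The same import patterns work with the standard library, which makes them easy to try outside Colab. The sketch below is only an illustration: it imports the 're' module as a whole, then imports the 'parse' module from the 'urllib' package in the same way that 'files' is imported from 'google.colab'. The example strings are made up.

```python
# Import a whole module, then reach its functions through the module name.
import re

# Find every run of upper-case letters in a made-up string.
matches = re.findall(r"[A-Z]+", "abcDEFghiJK")
print(matches)

# Import one module from a package, mirroring 'from google.colab import files'.
# 'urllib' is a package in the standard library and 'parse' is a module inside it.
from urllib import parse

# Use a function from the imported module: encode a space for use in a URL.
quoted = parse.quote("My Drive")
print(quoted)
```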

## Reading a FASTA file

Protein sequences can be stored in FASTA format. Each entry starts with a header line, denoted by '>', which contains the name of the protein. The line(s) following the header contain the sequence of the protein. For example:

\>NAME_OF_PROTEIN<br>
TRKFLKRIMNVLFRQFVYTVISWPMEKQVYMWFTLASHTPAMWGIWDNGYTKCILVENSLTLYQNKNHNFDDQPMCRWVT<br>
WVLHSFTTWNVNMPYVRCQMQTGSEGRITEEPANVLPAQQNHHYAAYDEGTVAKEFGQHDLAEYDHSTVDPYCPTIFNKD<br>
MKNLIHTFNASIAFGYHFHVQQSHKPADIGQHWNKKSKPAKFDLMNRWMIMTWTDYQCSYKPIGSATLCEQWFYEWHCFM<br>
RWELPIWFAAPNVVVASSPSVQHLGNADIRSIQADYFHLFMGADLCKIKCVMTYIMPKISWVQHYQCCSCAHLKMNPKHY<br>
FRKHCWSVMPEHGAVGEDHLDYYEKPWMVIMMPGQGIYWTAEFRTNSTCCWAGAATIDEPYLVLNPTRYSVGLNTMRHNW

For the following exercises, you'll need to use the file 'proteins.fasta' from Canvas. We suggest that you upload the file to your Google Drive in advance, and do all of your reading and writing in Google Drive, as it will let you read and write files exactly as if you were working on your computer.
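Before starting the exercises, it may help to see the general pattern for writing and reading a text file in Python. This is a minimal sketch rather than a solution: it uses a temporary directory and a hypothetical file name ('example.txt') so that it runs anywhere, and the file contents are made up. In Colab you would pass a path in your Drive instead.

```python
import os
import tempfile

# A hypothetical file in a temporary directory, so the sketch runs anywhere.
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, "example.txt")

# Mode "w" creates the file (or overwrites it) and opens it for writing.
# The 'with' statement closes the file automatically at the end of the block.
with open(example_path, "w") as handle:
    handle.write("first line\n")
    handle.write("second line\n")

# Mode "r" opens the file for reading; looping over the handle
# gives one line at a time, including its trailing newline.
lines = []
with open(example_path, "r") as handle:
    for line in handle:
        lines.append(line.rstrip("\n"))  # drop the trailing newline

print(lines)
```

Calling `handle.read()` instead of looping would return the whole file as a single string.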

### Exercise 1

Write a function that takes the name of a FASTA file and reads the contents of the file into a string or a list.

In [4]:

def read_fasta(input_file):
    pass

read_fasta('/content/drive/My Drive/proteins.fasta')

### Exercise 2
Modify your function to read only the headers from the FASTA file and store the headers in a list - one entry per header.

### Exercise 3 
Modify your function to read the headers and sequences from the FASTA file and store them in a list, in the order that they appear in the file (i.e. \[header1, sequence1, header2, sequence2, ...\]).

### Exercise 4
Write a new function that reads the headers and sequences from a FASTA file, and outputs only the first and second proteins to a new FASTA file. You can re-use your read_fasta function to make this simpler.

In [5]:
def copy_fasta(input_file, output_file):
    pass

copy_fasta('/content/drive/My Drive/proteins.fasta', '/content/drive/My Drive/first_two_proteins.fasta')

## Advanced exercises

### Exercise 5

Write a program that reads in a FASTA file and outputs a comma-separated values (CSV) file with two columns: the header, and the length of the sequence. Make sure not to include any newline characters in your count. The expected content of the output file is:

Protein, Count <br>
PROTEIN_A, 17<br>
PROTEIN_B, 54<br>
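One way to produce CSV output is Python's built-in 'csv' module, although writing the lines yourself with `write` works just as well. The sketch below assumes made-up rows and a hypothetical output file ('counts.csv') in a temporary directory; it is not a solution to the exercise. Note that `csv.writer` separates fields with a comma only, without a following space.

```python
import csv
import os
import tempfile

# Made-up data for illustration: (name, count) pairs.
rows = [("sample_1", 3), ("sample_2", 8)]

# A hypothetical output file in a temporary directory.
csv_path = os.path.join(tempfile.mkdtemp(), "counts.csv")

# newline="" lets the csv module control line endings itself.
with open(csv_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Name", "Count"])  # header row
    writer.writerows(rows)              # one line per (name, count) pair

# Read the file back to check what was written; every field comes back as a string.
with open(csv_path, newline="") as handle:
    written = list(csv.reader(handle))

print(written)
```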

### Exercise 6

Read all the headers and all the sequences and store them in memory. For each sequence, identify the most prevalent amino acid.
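As a hint, the standard library's `collections.Counter` can count how often each character appears in a string. The sketch below uses a short made-up sequence rather than one from 'proteins.fasta', and it is only one of several possible approaches (a dictionary and a loop would also work).

```python
from collections import Counter

# A short made-up sequence, used only for illustration.
toy_sequence = "AACAGA"

# Counter maps each character to the number of times it occurs.
counts = Counter(toy_sequence)

# most_common(1) returns a list holding the single most frequent
# (character, count) pair.
residue, count = counts.most_common(1)[0]
print(residue, count)
```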

### Exercise 7

Long runs of hydrophobic residues are more common in membrane proteins and are found less frequently in soluble proteins \[[1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2242367/)\]. Write a function that will read a FASTA file and determine whether each sequence is composed mainly of hydrophobic or mainly of polar residues. There is still some controversy over which residues should be considered hydrophobic and which hydrophilic, so take \[[2](https://www.sigmaaldrich.com/life-science/metabolomics/learning-center/amino-acid-reference-chart.html)\] as your baseline.

### Exercise 8

Modify your solution to Exercise 7 so that it counts how many hydrophobic residues have been found in a row.