# Lecture 4 - Modules, File I/O, CSVs
### University of California, Berkeley - Spring 2022

## What have we learned?

- Lists
- Dictionaries
- Functions

## Modules
Last time, we learned how to write and use functions.  
Luckily, we don't have to create all the code alone. Many people have written functions that perform various tasks.  
These functions are grouped into packages, called _modules_, which are usually designed to address a specific set of tasks.  
These functions are not always 'ready to use' by our code, but rather have to be _imported_ into it.

Let's start with an example. Suppose we want to do some trigonometry. We could (in principle) write our own _sin_, _cos_, _tan_, etc., but it is much easier to use the built-in _math_ module.

In [2]:
# first, we import the module
import math

Now we can use values and functions within the math module

In [3]:
twice_pi = 2*math.pi
print(twice_pi)
radius = 2.3
perimeter = twice_pi * radius
print(perimeter)
print(math.sin(math.pi/6))
print(math.cos(math.pi/3))

6.283185307179586
14.451326206513047
0.49999999999999994
0.5000000000000001
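
Notice that the last two results are not exactly 0.5. Floating-point numbers are stored with limited precision, so tiny rounding errors like these are normal. A minimal sketch of comparing such values with `math.isclose()`, which is also part of the `math` module (the variable names are just for illustration):

```python
import math

half_sin = math.sin(math.pi / 6)
half_cos = math.cos(math.pi / 3)

# A direct comparison is fragile because of tiny rounding errors
exact_match = (half_sin == 0.5)

# math.isclose() compares within a small relative tolerance instead
close_sin = math.isclose(half_sin, 0.5)
close_cos = math.isclose(half_cos, 0.5)

print(exact_match, close_sin, close_cos)
```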


If you only need one or two functions from a module, you can import them, instead of the whole module. This way, we don't have to call the module name every time.

In [4]:
# import required functions from math
from math import pi,sin,cos

In [5]:
# now we can use values and functions within the math module
twice_pi = 2*pi
print(twice_pi)
print(sin(pi/6))
print(cos(pi/3))

6.283185307179586
0.49999999999999994
0.5000000000000001


OK, cool, but how do I know which modules I need?

You can view a module's documentation to see which functions are available and how to use them. Each module in the Python standard library has a documentation page, for example: https://docs.python.org/3/library/math.html  
Two more useful links:  
https://pypi.python.org/pypi - the Python Package Index (PyPI), a repository of third-party Python packages  
https://wiki.python.org/moin/NumericAndScientific - a list of scientific modules
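
You can also explore a module directly from Python. A small sketch, using the built-in `dir()` function and the documentation string attached to each function (the name `public_names` is just for illustration):

```python
import math

# dir() lists every name defined in the module; skip the internal names
# that start with an underscore
public_names = [name for name in dir(math) if not name.startswith('_')]
print(public_names[:10])

# Each function carries its own short documentation string
print(math.sqrt.__doc__)

# In an interactive session, help(math.sqrt) shows the same documentation
```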

### Installing modules
Many Python modules are included within the core distribution, and all you have to do is `import` them. However, many other modules need to be downloaded and installed first.  
Python has built-in tools for installing modules, but sometimes things go wrong. Therefore, try the following methods, in this order.  

#### 1. Use PIP
PIP is a built-in program which (usually) makes it easy to install packages and modules.  
Rather than running PIP from within the notebook, we'll use the Pyzo IEP shell, usually located at _C:\pyzo\IEP.exe_. This interactive shell can run useful commands.

We'll enter our commands in this shell window. For example, to get a list of all the modules already installed (not including built-in modules) and their versions, we can type `pip freeze`.

To install a new package, all we have to do is: `pip install packagename`, and that's it. Just make sure there are no error messages raised during the installation.  
If, for some reason, things don't work out that well, proceed to option 2.

#### 2. Use Conda
Conda is another useful tool, rather similar to PIP. To use it, just type `conda` in the IEP shell. To get the list of installed modules, type `conda list`. To install a module, type `conda install packagename`.  
If this doesn't work either, proceed to option 3.

#### 3. Use Windows binaries installation
If nothing else works, you can try looking for your package [on this website](http://www.lfd.uci.edu/~gohlke/pythonlibs/). It contains many downloadable installers which you can just click through to easily install a package. Make sure to choose the download that fits your Python version and operating system.  
Not all modules are available through this website. If you don't find your module there, you might have to install from source. Details [here](https://docs.python.org/3/install/).
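
Whichever method you use, it is worth checking that Python can actually find the module afterwards. A minimal sketch, assuming a hypothetical helper called `is_installed` built on the standard `importlib.util.find_spec()` function:

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a module with this name (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None

print(is_installed('math'))                # part of the core distribution
print(is_installed('no_such_module_xyz'))  # made-up name that should not exist
```

Of course, simply trying `import packagename` and watching for an error works just as well.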

## File I/O
So far, we only used rather small data, like numbers, short strings and short lists. We stored the data in variables (i.e. in memory) and manipulated it there. But what happens if we need to store large amounts of data?
- Whole genomes
- List of all insect species
- Multiple numeric values  
  
This is what files are for!

### Why do we need files?
- Store large amounts of data
- Use data in multiple sessions
- Use data outside python
- Provide data for other tools/programs

We'll start with simple text files and proceed to more complex formats.  
Let's read the list of crop plants located in `data/crops.txt`.

### Reading files
Whenever we want to work with a file, we first need to _open_ it. This is, not surprisingly, done using the `open` function.  
This function returns a file object which we can then use.

In [7]:
crops_file = open('data/crops.txt','r')

The `open` function receives two parameters: the path to the file you want to open and the mode of opening (both strings). In this case - 'r' for 'read'.  
Notice the / instead of \ in the path. This is the easiest way to avoid path errors. Also note that this command alone doesn't read anything; it just creates the file object (sometimes called a file handle).  
In fact, we'll usually call `open` inside a `with` statement, which closes the file automatically when the indented block ends:

In [8]:
with open('data/crops.txt','r') as crops_file:
    # indented block
    # do stuff with file
    pass
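
To see what `with` saves us, here is a small sketch comparing the two styles. It writes a throwaway file first so it does not depend on the course data; the path and variable names are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example does not need data/crops.txt
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w') as demo_file:
    demo_file.write('Musa textilis\n')

# Without 'with', we are responsible for closing the file ourselves
handle = open(demo_path, 'r')
first_line = handle.readline()
handle.close()

# With 'with', the file is closed as soon as the indented block ends
with open(demo_path, 'r') as demo_file:
    same_line = demo_file.readline()

# Both file objects report that they are closed
print(handle.closed, demo_file.closed)
```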

OK, so what can we do with files?  
The most common task would be to read the file line by line.  

#### Looping over the file object
We can simply use a _for_ loop to go over all lines. This is the best practice, and also very simple to use:

In [19]:
with open('data/crops.txt','r') as crops_file:
    for line in crops_file:
        if line.startswith('Musa'):   # check if line starts with a given string
            print(line)

Musa balbisiana

Musa spp.

Musa textilis



Oops, why did we get double newlines?  
Each line in the file ends with a _newline_ character. Although it is invisible in most editors, it is certainly there! In python, a newline is represented as `\n`.  
`print()` adds its own newline after whatever it prints, on top of the newline already at the end of each line read from the file, so we end up with double newlines.  
We can use `strip()` to remove the newline (and any other whitespace) from both ends of the line.

In [20]:
with open('data/crops.txt','r') as crops_file:
    for line in crops_file:
        line = line.strip()
        if line.startswith('Musa'):
            print(line)

Musa balbisiana
Musa spp.
Musa textilis
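
`strip()` has a few close relatives that are handy when you want to keep some of the whitespace. A short sketch on a made-up line (`repr()` is used so the invisible characters show up in the printout):

```python
raw_line = '  Musa spp. \n'   # made-up line with extra spaces, for illustration

print(repr(raw_line.strip()))        # removes whitespace from both ends
print(repr(raw_line.rstrip('\n')))   # removes only the trailing newline
print(repr(raw_line.rstrip()))       # removes all trailing whitespace

# Alternatively, keep the line as-is and tell print() not to add its own newline
print('Musa spp.\n', end='')
```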


![Musa species](http://www.replicatedtypo.com/wp-content/uploads/2010/08/Picture-49.png)

#### Reading the entire file - read()
Another option is to read the entire file as a big string with the `read()` method.  
Careful with this one! This is not recommended for large files.

In [21]:
with open('data/crops.txt','r') as crops_file:
    entire_file = crops_file.read()
    print(entire_file[:102]) # print first 102 characters

﻿Abelmoschus caillei
Abelmoschus esculentus
Acacia mearnsii
Acacia senegal
Acacia seyal
Acca sellowian


#### Reading line by line with readline()
The `readline()` method allows us to read a single line each time. It works very well when combined with a _while_ loop, giving us good control of the program flow.

In [23]:
with open('data/crops.txt','r') as crops_file:
    line = crops_file.readline()    # read first line
    while line:
        line = line.strip()
        if line.startswith('Triticum'):
            print(line)
        line = crops_file.readline()    # read next line

Triticum aestivum
Triticum dicoccum
Triticum durum
Triticum monococcum
Triticum spelta
Triticum turanicum


__REMEMBER__ to always read the next line within the while loop. Otherwise, you'll get stuck in an infinite loop, processing the first line over and over again...

There are other methods you can use to read files. For example, take a look at the `readlines()` method here:  
https://docs.python.org/3/tutorial/inputoutput.html
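
As a quick taste, `readlines()` returns all the lines at once as a list of strings. A minimal sketch on a throwaway file with three made-up lines standing in for the real crops file:

```python
import os
import tempfile

# Throwaway file with three made-up lines, standing in for data/crops.txt
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_crops.txt')
with open(demo_path, 'w') as demo_file:
    demo_file.write('Musa balbisiana\nTriticum durum\nZea mays\n')

with open(demo_path, 'r') as demo_file:
    all_lines = demo_file.readlines()   # list of strings, newlines included

print(len(all_lines))
print(all_lines[-1].strip())
```

Like `read()`, this loads the whole file into memory, so the same warning about large files applies.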

### Summary
Whenever working with a file, there are three elements to keep in mind:
- File __path__ - the actual location of the file on the hard drive (use `/` rather than `\`).
- File __object__ - the way files are handled in Python.
- File __contents__ - what is extracted from the file, depending on the method used on the file object.

## Practice!

Use one of the file-reading techniques shown above to:  
1) Print the last line in the file.  
2) Find out how many _Garcinia_ species are in the file (use the `startswith()` method).

In [24]:
with open('data/crops.txt','r') as crops_file:
    entire_file = crops_file.read()
    lines_list = entire_file.split('\n')
    print(lines_list[-1])

Ziziphus zizyphus |


In [25]:
with open('data/crops.txt','r') as crops_file:
    garcinia_count = 0
    for line in crops_file:
        if line.startswith('Garcinia'):
            garcinia_count += 1
    print(garcinia_count)

7


### Writing to a file
To write to a file, we first have to open it for writing. This is done using one of two modes: 'w' or 'a'.  
'w', for write, will let you write into the file. If it doesn't exist, it'll be automatically created. If it exists and already has some content, __the content will be overwritten!__  
'a', for append, is very similar, only it will not overwrite, but add your text to the end of an existing file. 

In [27]:
with open('data/output.txt','w') as out_file:
    # indented block
    # write into file...
    pass

Writing is done using good old `print()`, only we add the argument `file=<file object>`.

In [28]:
with open('data/output.txt','w') as out_file:
    print('This is the first line', file=out_file)
    line = 'Another line'
    print(line, file=out_file)
    seq1 = 'ATTAGCGGATA'
    seq2 = 'GGCATATAT'
    print(seq1 + seq2, file=out_file)
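
The difference between the 'w' and 'a' modes is easiest to see by writing to the same file several times and reading it back. A small sketch using a throwaway file (the path and the text are made up for illustration):

```python
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), 'demo_output.txt')

# 'w' creates the file and writes into it
with open(demo_path, 'w') as out_file:
    print('first version', file=out_file)

# opening with 'w' again discards the previous content
with open(demo_path, 'w') as out_file:
    print('second version', file=out_file)

# 'a' keeps the existing content and adds to the end
with open(demo_path, 'a') as out_file:
    print('appended line', file=out_file)

with open(demo_path, 'r') as in_file:
    contents = in_file.read()
print(contents)
```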

### Parsing files
Parsing is _"the process of analyzing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar."_ (definition from _Wikipedia_).  
More simply, parsing is reading a file in a specific format, 'slurping' the data and storing it in a data structure of your choice (list, dictionary etc.). We can then use this structure to analyze, print or simply view the data in a certain way.  
Each file format has its own set of 'rules', and therefore needs to be parsed in a tailored manner. Here we will see an example very relevant for biologists.

### The FASTA format

FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. Each sequence has a header. Header lines start with '>'.  
The file camelus.fasta includes five sequences of Camelus species. In this parsing example, we'll arrange the data in this file in a dictionary, so that the key is the id number from the header, and the value is the sequence.  

In [29]:
from IPython.display import FileLink
FileLink('lec4_files/camelus.fasta')

We'll start by writing the parsing function.

In [30]:
def parse_fasta(file_name):
    """
    Receives a path to a fasta file, and returns a dictionary where the keys
    are the sequence IDs and the values are the sequences.
    """
    # create an empty dictionary to store the sequences
    sequences = {}
    # open fasta file for reading
    with open(file_name,'r') as f:
        # Loop over file lines
        for line in f:
            line = line.strip()
            # if header line
            if line.startswith('>'):
                seq_id = line.split('|')[1]
                # start a new, empty sequence for this ID
                sequences[seq_id] = ''
            # if sequence line (blank lines are skipped)
            elif line:
                # append, so a sequence split over several lines is kept whole
                sequences[seq_id] += line
    return sequences
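
To see where the `[1]` comes from, it helps to split a single header by hand. The header below is made up in the style this parser expects (fields separated by `|`); the real headers in camelus.fasta may look different.

```python
# Made-up header for illustration; not taken from camelus.fasta
header = '>gi|123456789|gb|AB000001.1| Camelus example sequence'

fields = header.strip().split('|')
print(fields)

seq_id = fields[1]      # the ID number, used as the dictionary key
accession = fields[3]   # the accession field, three positions after the '>gi' part
print(seq_id, accession)
```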

Now we can use the result. For example, let's print the first 10 nucleotides of every sequence.

In [32]:
camelus_seq = parse_fasta('data/camelus.fasta')
for seq_id in camelus_seq:
    print(seq_id," - ",camelus_seq[seq_id][:10])

156447993  -  AGAGTCTTTG
4584390  -  AGGGAATCGC
543859288  -  ATGGCCTGGG
441414656  -  ATGAGACTGT
206246577  -  ATGACAAACA


![camelus](http://creagrus.home.montereybay.com/Camel_Oman-1.jpg)

In [35]:
# new function
def parse_fasta_30_nuc(file_name):
    """
    Receives a path to a fasta file, and returns a dictionary where the keys
    are the sequence gb accession numbers and the values are the first 30
    nucleotides of the sequences.
    """
    # create an empty dictionary to store the sequences
    sequences = {}
    # open fasta file for reading
    with open(file_name,'r') as f:
        # Loop over file lines
        for line in f:
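            # if header line
            if line.startswith('>'):
                gb = line.split('|')[3]
                # start a new, empty sequence for this accession
                sequences[gb] = ''
            # if sequence line
            else:
                # append, then keep only the first 30 nucleotides
                sequences[gb] = (sequences[gb] + line.strip())[:30]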
    return sequences

# parse file
camelus_seq = parse_fasta_30_nuc('data/camelus.fasta')

# write to new file
with open('data/4b_output.txt','w') as of:
    for gb_id in camelus_seq:
        print(gb_id + ':',camelus_seq[gb_id], file=of)

### The CSV format
Comma separated values (CSV) is a very common and useful format for storing tabular data. It is similar to an Excel file, only it is completely text based. Let's have a look at an example file, both using Excel and a simple text editor.

We can, quite easily, create our own functions for dealing with CSV files, for example by splitting each line by commas. However, Python has a built-in module for exactly this purpose, so why bother?
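
One reason not to bother: splitting by commas breaks as soon as a value contains a comma itself. A minimal sketch on a made-up line, comparing a naive `split(',')` with the `csv` module (`io.StringIO` just lets us treat a string as if it were a file):

```python
import csv
import io

# Made-up line where one value contains a comma inside quotes
line = 'Col-0,"Cologne, Germany",12.5\n'

# Naive approach: the quoted value is cut in two
naive_fields = line.strip().split(',')

# csv.reader understands the quotes and keeps the value whole
csv_fields = next(csv.reader(io.StringIO(line)))

print(naive_fields)
print(csv_fields)
```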

#### Reading CSV files
The simplest way to read a CSV file is to use the module's `reader` function. This function receives a file object (created with `open()`) and returns a reader object.

In [40]:
import csv

In [42]:
experiments_file = 'data/electrolyte_leakage.csv'
with open(experiments_file, 'r') as f:
    csv_reader = csv.reader(f)

Once we have defined the csv reader, we can use it to iterate over the file lines. Each row is returned as a list of the column values.

In [43]:
experiments_file = 'data/electrolyte_leakage.csv'
with open(experiments_file, 'r') as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        print(row[0])

accession
101AV/Ge-0
157AV/Ita-0
162AV/Ct-1
163AV/Can-0
166AV/Cvi-0
172AV/Bur-0
180AV/Blh-1
186AV/Col-0
200AV/Gre-0
215AV/Mh-1
224AV/Oy-0
236AV/Shadahra
252AV/Akita
257AV/Sakata
25AV/Jea
266AV/N13
42AV/Bl-1
62AV/St-0
70AV/Kn-0
83AV/Edi-0
8AV/Pyl-1
91AV/Tsu-0
94AV/Mt-0
96AV/An-1
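
Notice that the first value printed is the column header, not an accession. Two common ways to deal with the header row are sketched below on a small made-up table: only the `accession` column name comes from the output above, while the `value` column and its numbers are hypothetical.

```python
import csv
import io

# Made-up table standing in for the real file
demo_csv = io.StringIO('accession,value\n101AV/Ge-0,1.5\n157AV/Ita-0,2.0\n')

# Option 1: consume the header row with next() before looping
csv_reader = csv.reader(demo_csv)
header = next(csv_reader)
accessions = [row[0] for row in csv_reader]
print(header, accessions)

# Option 2: csv.DictReader uses the header row as keys,
# so columns can be accessed by name instead of by position
demo_csv.seek(0)
rows = list(csv.DictReader(demo_csv))
print(rows[0]['accession'])
```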


#### Writing CSV files
Writing is also rather straightforward. The csv module supplies the `csv.writer()` function, which returns a writer object with the method `writerow()`. This method receives a list and writes it to the file as one CSV line.

In [44]:
new_file = 'data/out_csv.csv'
with open(new_file, 'w', newline='') as fo:    # notice the 'w' instead of 'r'
    csv_writer = csv.writer(fo)
    csv_writer.writerow(['these','are','the','column','headers'])
    csv_writer.writerow(['and','these','are','the','values'])
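
A quick way to check the result is to read the file back. The sketch below uses a throwaway file and made-up rows, and writes them all at once with `writerows()`, the multi-row companion of `writerow()`:

```python
import csv
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), 'demo_out.csv')

# Made-up rows; note the comma inside the last value
rows = [['accession', 'note'],
        ['101AV/Ge-0', 'control, replicate 1']]

with open(demo_path, 'w', newline='') as fo:
    csv.writer(fo).writerows(rows)   # write all rows in one call

# Read the file back as rows
with open(demo_path, 'r', newline='') as f:
    read_back = list(csv.reader(f))

# Read it again as plain text to see what was actually written
with open(demo_path, 'r', newline='') as f:
    raw_text = f.read()

print(read_back)
print(raw_text)
```

The value containing a comma is wrapped in quotes in the file, so it comes back as a single value when read with `csv.reader`.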

## Congrats!

The notebook is available at https://github.com/Naghipourfar/molecular-biomechanics/