<a href="https://colab.research.google.com/github/How-to-Learn-to-Code/python-class/blob/master/Lesson_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Lesson 2 - Data structures and reading and writing data
Learning objectives: Students will be able to import data from csv files and compare and
contrast basic data structures.
Specific coding skills:
* data structures (lists, dictionaries, pandas data frame)
* libraries
* installing and loading the pandas library
* file paths and making new files and directories
* import data from csv file (`pandas.read_csv()`)
* export data to csv file (`pandas.DataFrame.to_csv()`)

### Introduction
Data structures are exactly what the name suggests: structures that hold data together. In other words, they are used to store a collection of related data. They are particularly helpful when working with experimental data sets.

There are four built-in data structures in Python - list, tuple, dictionary and set. In this class we will focus on built-in lists and dictionaries as well as data frames from the pandas library. 

### Built-in data structures

#### Lists

A `list` is a data structure that holds an ordered collection of items i.e. you can store a sequence of items in a list. This is easy to imagine if you can think of a shopping list where you have a list of items to buy, except that you probably have each item on a separate line in your shopping list whereas in Python you put commas in between them.

The list of items should be enclosed in square brackets so that Python understands that you are specifying a list. Once you have created a list, you can add, remove or search for items in the list. Since we can add and remove items, we say that a list is a mutable data type i.e. this type can be altered (source: [data structures](https://python.swaroopch.com/data_structures.html)).

Below is an example of a list containing three different DNA sequences.

In [0]:
DNAlist = ['AACTCACCG', 'GCAACTCG', 'TTCAGGCA']
print(DNAlist)

A list is an example of usage of *objects* and *classes*. When we use a variable `i` and assign a value to it, say the integer `5`, you can think of it as creating an object (i.e. instance) `i` of class (i.e. type) `int`.

A class can also have *methods* i.e. functions defined for use with respect to that class only. You can use these pieces of functionality only when you have an object of that class. For example, Python provides an `append` method for the `list` class which allows you to add an item to the end of the list (source: [data structures](https://python.swaroopch.com/data_structures.html)).

To illustrate this let's add '`GGCTACAAC`' to our list `DNAlist`.

In [0]:
DNAlist.append('GGCTACAAC')
print(DNAlist)

Now you try: make a list called `numList` containing the even numbers from 6 to 18.

In [0]:
numList = [6,8,10,12,14,16,18]

Now add the number 20 to the list.

In [0]:
numList.append(20)
print(numList)

You can use the function `len()` to find the length (i.e. number of entries) of a list. Use this function to find the length of `numList`.

In [0]:
len(numList)
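The text above mentions that you can also search for and remove items in a list. Here is a minimal sketch of both, using a hypothetical list named `dna_list` that holds the same sequences as the earlier example. The `in` operator, `index()`, and `remove()` are standard features of Python lists.

```python
# Hypothetical example list (same sequences as the DNA example above).
dna_list = ['AACTCACCG', 'GCAACTCG', 'TTCAGGCA']

# Search: the `in` operator checks whether an item is present.
has_seq = 'GCAACTCG' in dna_list
print(has_seq)

# Search: index() returns the position of the first match, counting from 0.
position = dna_list.index('GCAACTCG')
print(position)

# Remove: remove() deletes the first matching item from the list in place.
dna_list.remove('GCAACTCG')
print(dna_list)
print(len(dna_list))
```

Because lists are mutable, `remove()` changes `dna_list` itself rather than returning a new list.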

#### Dictionaries
A dictionary is like an address book where you can find the address or contact details of a person by knowing only their name, i.e. we associate keys (names) with values (details). Note that each key must be unique, just as you could not find the correct information if two people had exactly the same name.

Note that you can use only immutable objects (like strings) for the keys of a dictionary but you can use either immutable or mutable objects for the values of the dictionary.

Pairs of keys and values are specified in a dictionary by using the notation `d = {key1 : value1, key2 : value2 }`. Notice that the key-value pairs are separated by a colon and the pairs are separated themselves by commas and all this is enclosed in a pair of curly braces.

Remember that key-value pairs in a dictionary are not sorted. Since Python 3.7 a dictionary keeps its pairs in the order in which they were inserted, but if you want any other order (for example alphabetical), you will have to sort them yourself before using them (source: [data structures](https://python.swaroopch.com/data_structures.html)).

The following example of a dictionary might be useful if you wanted to keep track of ages of patients in a clinical trial. 

In [0]:
agesDict = {'Karen P.' : 53, 'Jessica M.': 47, 'David G.' : 45, 'Susan K.' : 57, 'Eric O.' : 50}
print(agesDict)
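Once a dictionary exists, you will usually want to look up values and add new pairs. The sketch below uses a shortened, hypothetical dictionary named `ages` to show both, and also shows what happens if you try to use a mutable object (a list) as a key.

```python
# Hypothetical, shortened version of the ages dictionary.
ages = {'Karen P.': 53, 'Jessica M.': 47}

# Look up a value by putting its key in square brackets.
karen_age = ages['Karen P.']
print(karen_age)

# Add a new key-value pair (or overwrite an existing one) by assignment.
ages['David G.'] = 45
print(ages)

# A mutable object such as a list cannot be used as a key:
# Python raises a TypeError.
try:
    ages[['Eric', 'O.']] = 50
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
print(list_key_allowed)
```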

Now make your own dictionary called `dosesDict` with the following information about doses (in mg/day) for the clinical trial. 



*   Placebo : 0
*   Dose 1 : 20
*   Dose 2 : 40
*   Dose 3 : 60




In [0]:
dosesDict = {'Placebo' : 0, 'Dose 1' : 20, 'Dose 2' : 40, 'Dose 3' : 60}
print(dosesDict)
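If you need the contents of a dictionary in a particular order, one possible approach is the built-in `sorted()` function. This is a small sketch with a hypothetical dictionary `doses` whose pairs were deliberately entered out of order.

```python
# Hypothetical dictionary with the pairs entered out of order.
doses = {'Dose 2': 40, 'Placebo': 0, 'Dose 3': 60, 'Dose 1': 20}

# sorted() on a dictionary returns a list of its keys in sorted order.
by_name = sorted(doses)
print(by_name)

# To sort by value, sort the (key, value) pairs and tell sorted()
# to compare the second element of each pair.
by_dose = sorted(doses.items(), key=lambda pair: pair[1])
print(by_dose)
```

Note that `sorted()` returns a new list and leaves the dictionary itself unchanged.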

### Libraries
A Python library is a collection of functions and methods that lets you perform many actions without writing the code yourself.

In this course we will discuss a few libraries including pandas for storing and managing data and seaborn for plotting.

Many libraries, including both pandas and seaborn, come installed with Anaconda. However, these libraries are not imported automatically when you open a Jupyter notebook, so we have to tell Python which libraries we would like to use.

This is how you import the pandas library:

In [0]:
import pandas as pd

We imported the `pandas` library and gave it the alias `pd`. This way, when we want to use pandas we only have to type `pd`.

Let's do the same thing with the `numpy` package. We will want to give it the standard alias `np`. 

In [0]:
import numpy as np

### Pandas


*pandas* is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is built on top of a scientific computing library called *numpy*. pandas loads numpy internally, so you only need to import numpy yourself when you want to call numpy functions directly.

pandas DataFrames can store multiple data types (ints, floats, strings, etc.). Typically, pandas is useful for analyzing data that have multiple entries in separate rows and different information in each column. While pandas has both Series (1-dimensional) and DataFrame (2-dimensional) classes, we will focus on the DataFrame class.

It is possible to write code to make DataFrames by hand, like we did to learn about lists and dictionaries. However, you will most likely import data directly from a file.
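For completeness, here is a minimal sketch of building a DataFrame by hand from a dictionary of lists. The name `trial` and the dose assigned to each patient are made up for illustration; only the patient names and ages come from the dictionary example earlier in this lesson.

```python
import pandas as pd

# Each key becomes a column name; each list becomes that column's values.
# The dose assignments are hypothetical.
trial = pd.DataFrame({
    'patient': ['Karen P.', 'Jessica M.', 'David G.'],
    'age': [53, 47, 45],
    'dose_mg_per_day': [0, 20, 40],
})

print(trial)
print(trial.shape)  # (number of rows, number of columns)
```

All of the lists must have the same length, because each one fills a column of the same table.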

The following example loads a BED file. BED files are tab-delimited files with one line for each genomic region. Our file includes the following information about genes. 


0.   chromosome
1.   start position
2.   end position
3.   name
4.   score (zero for all our genomic regions)
5.   strand (+ for forward and - for reverse)



In [0]:
df1 = pd.read_csv('https://raw.githubusercontent.com/sksuzuki/How-to-Learn-to-Code-2018/master/data/3/ENCFF239FSU.bed', sep='\t',header = None)
df1.head()

The first line of code above reads the file from the specified url. Since BED files are tab delimited we use `sep = '\t'` to define how the file is separated. Additionally, since our file does not have a header we specify `header = None`. The default is to treat the first row as the header. The second line displays the *head* of the DataFrame, which is its first five rows.

Right now our DataFrame isn't very easy to interpret. Since we didn't have a header, pandas assigned the column names as numbers.

In [0]:
df1.columns

We can assign more descriptive column names by assigning a list of names to `df1.columns`.

In [0]:
df1.columns = ['chromosome', 'start', 'end', 'gene', 'score', 'strand']
df1.head()
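As an alternative, `pd.read_csv()` has a `names` parameter that sets the column names while the file is being read. The sketch below assumes two made-up BED-style lines held in a string (read through `io.StringIO`) instead of the real file, so it runs without a download. The names `bed_text`, `bed_columns`, and `bed_df` are hypothetical.

```python
import io
import pandas as pd

# Two made-up, tab-separated lines standing in for a real BED file.
bed_text = "chr1\t100\t200\tgeneA\t0\t+\nchr2\t300\t450\tgeneB\t0\t-\n"

bed_columns = ['chromosome', 'start', 'end', 'gene', 'score', 'strand']

# names= supplies the column names; header=None says the data has no header row.
bed_df = pd.read_csv(io.StringIO(bed_text), sep='\t', header=None, names=bed_columns)
print(bed_df)
```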

We can also sort our DataFrame by column. For example, we can sort by the `gene` column. Note that we are using `.head()` so we only see the first five rows. You can try without it to see the entire DataFrame, but it is large!

In [0]:
df1.sort_values(by='gene').head()

Now try sorting by chromosome

In [0]:
df1.sort_values(by='chromosome').head()
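`sort_values()` also accepts a few options that are often useful. The sketch below uses a small made-up DataFrame called `regions` (not the real BED data) to show sorting in descending order and sorting by more than one column.

```python
import pandas as pd

# Small made-up table standing in for the BED data.
regions = pd.DataFrame({
    'chromosome': ['chr2', 'chr1', 'chr1'],
    'start': [300, 500, 100],
    'gene': ['geneB', 'geneC', 'geneA'],
})

# ascending=False sorts from largest to smallest.
by_start_desc = regions.sort_values(by='start', ascending=False)
print(by_start_desc)

# A list of column names sorts by the first column, then breaks ties with the second.
by_chrom_then_start = regions.sort_values(by=['chromosome', 'start'])
print(by_chrom_then_start)
```

`sort_values()` returns a new, sorted DataFrame and leaves the original unchanged, so assign the result to a variable if you want to keep it. Also keep in mind that chromosome names are text, so they sort alphabetically (`chr10` comes before `chr2`).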

#### Importing from local files

You will often want to import data from your own computer instead of a url. Therefore, you need to be able to use file paths to import the file you want. You can think of this as navigating through your files and folders with words. 

To practice this we are going to import data from the file 'yeastCellCycle.csv' which you can download [here](https://docs.google.com/spreadsheets/d/1pxDYrX59_yEsLdE5rUFDH13cAP50L0jPrrZ16AeCXnQ/edit#gid=1109053608). Make sure to download it as a csv. You can save the file wherever you want, just make sure you know where it is.

This next part is going to be different for everyone. We need to write the file path for our 'yeastCellCycle.csv'. On the author's computer, this file is located in the following location:

> `Desktop`

>        rclass

>             csvfiles

>                  yeastCellCycle.csv

File paths are written the same way on Mac and Linux operating systems. However, Windows is different. To handle this difference we will use the *os* library, which contains functions to get information on local directories, files, processes, and environment variables. We will specifically be using the following two functions from this library:

* The os.path.join() function constructs a pathname out of one or more partial pathnames.
* The os.path.expanduser() function will expand a pathname that uses ~ to represent the current user's home directory. This works on any platform where users have a home directory, including Linux, Mac OS X, and Windows. 

The following code will only work if your file is saved in the exact same place. You should change the code below to match the location where you saved the file.

In [0]:
import os
filepath = os.path.join(os.path.expanduser('~'),'Desktop', 'rclass', 'csvfiles', 'yeastCellCycle.csv')
print(filepath)
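Two other functions from the *os* library can help when working with paths: `os.path.exists()` tells you whether a path points to something real, and `os.makedirs()` creates new folders. The sketch below is only an illustration. It assumes a temporary folder (created with `tempfile.mkdtemp()`) in place of your home directory so that it does not touch your own files, and the names `base`, `new_dir`, and `missing_file` are hypothetical.

```python
import os
import tempfile

# A temporary folder stands in for the home directory in this sketch.
base = tempfile.mkdtemp()

# Build the folder path and create it, including any missing parent folders.
# exist_ok=True means no error is raised if the folder already exists.
new_dir = os.path.join(base, 'rclass', 'csvfiles')
os.makedirs(new_dir, exist_ok=True)
print(os.path.isdir(new_dir))

# Check whether a file exists before trying to read it.
missing_file = os.path.join(new_dir, 'yeastCellCycle.csv')
print(os.path.exists(missing_file))
```

Checking a path first can make errors easier to understand: if `os.path.exists()` returns `False`, the problem is the path rather than the file contents.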

We can now load the data using `pd.read_csv()`. Saving the file path under the variable `filepath` means we don't have to type out that whole path. 

In [0]:
df2 = pd.read_csv(filepath, sep = ',')

You have now successfully imported data to a DataFrame from a csv file on your computer! Now let's learn how to save DataFrames to your computer. The function `pandas.DataFrame.to_csv()` will allow you to do this. If you run the following line, `df1` will be saved as a .csv file called ENCFF239FSU.csv in your current working directory (to find out what your current working directory is, run `os.getcwd()`).

In [0]:
df1.to_csv('ENCFF239FSU.csv')

If you want the file to save in a different location you can specify the path to which the file should be saved. 

In [0]:
df1.to_csv(os.path.join(os.path.expanduser('~'),'Desktop', 'rclass', 'csvfiles', 'ENCFF239FSU.csv'))
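By default, `to_csv()` also writes the row labels (0, 1, 2, ...) as an unnamed first column. If you do not want them in the file, you can pass `index=False`. The sketch below assumes a tiny made-up DataFrame called `small` and a temporary folder, then reads the file back in to check the result.

```python
import os
import tempfile
import pandas as pd

# Tiny made-up DataFrame and a temporary output location.
small = pd.DataFrame({'gene': ['geneA', 'geneB'], 'score': [0, 0]})
out_path = os.path.join(tempfile.mkdtemp(), 'small.csv')

# index=False leaves the row labels out of the saved file.
small.to_csv(out_path, index=False)

# Read the file back in to confirm that only the two data columns were saved.
reloaded = pd.read_csv(out_path)
print(reloaded)
```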

Hopefully, you can see how importing data from and exporting data to csv or other delimiter separated files will be helpful for research. Now it's time to practice your skills with built-in data structures and data frames!

### Exercises
The following exercises will help you better understand lists, dictionaries, and DataFrames.

1. Make a list containing the following numbers 1, 4, 25, 7, 9, 12, 15, 16, and 21. 

2. Find the following information about the list you made in part 1: length, minimum, and maximum. You may need to google to find functions that can help you. 

3. Make a dictionary that describes the price of five medications. 
>* Lisinopril: 	23.07
>* Gabapentin: 86.27
>* Sildenafil: 	169.94 
>* Amoxicillin: 17.76
>* Prednisone: 13.81

4. Create a DataFrame from the file located at https://raw.githubusercontent.com/daynefiler/rclass/master/heights.csv

5. Sort the DataFrame from question 4 by height. 

6. Create a DataFrame from a csv file (or other delimiter separated file) on your computer. If you encounter errors, do your best to google solutions.
