In [None]:
%%HTML
<style>
.rendered_html table, .rendered_html th, .rendered_html tr, .rendered_html td {
     font-size: 90%;
}
</style>

# Data frames, pandas and data-science

* In bioinformatics we often handle big tables of data
* We've seen ways of handling tables of data (`dict`s, `list`s, `list`s of `dict`s, `dict`s of `list`s).

For example:

|id             |Cancer1          |Cancer2          |Normal1          |Normal2          |
|---------------|-----------------|-----------------|-----------------|-----------------|
|ENSG00000000003|5.95826622337557 |8.81097601937505 |7.53216385716448 |6.63781231813915 |
|ENSG00000000005|-1.41151988912058|-1.43750916858247|-1.31684622365868|-1.29089271034514|
|ENSG00000000419|9.42730899215926 |9.15577673744174 |9.44121243114581 |9.20685569248805 |
|ENSG00000000457|9.76744540012193 |9.56745990257189 |9.2770807794418  |8.96465517406554 |
|ENSG00000000460|7.62719592443423 |7.59893474744653 |7.9111353616632  |7.5232621375324  |
|ENSG00000000938|5.10676083839103 |4.70915305298166 |4.2068607495857  |4.13169270397857 |
|ENSG00000000971|6.61464002977705 |5.31104669007017 |5.75484059089178 |7.61981857371949 |
|ENSG00000001036|11.9369955115137 |10.7788078234204 |9.34068768940202 |11.2829622683915 |
|ENSG00000001084|10.1643213325872 |10.0713806061553 |9.7803495466056  |9.72496464284139 |

In [None]:
#! /usr/bin/env python
import sys
import os
















In [None]:

print("There are %i rows" % nrows)

In [None]:

print("Sum of expression of Cancer1 is %.2f" % sum_expression)

In [None]:


print("Expression of %s in Cancer1 is %.2f and in Cancer2 is %.2f" % 
      (gene_of_interest, Cancer1[index_of_gene], Cancer2[index_of_gene]))
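
For comparison, here is a minimal sketch of how the three questions above could be answered with plain lists. It assumes a comma-separated file; the text below is a stand-in built from three rows of the table above (values rounded), and the chosen gene is only an example.

```python
import csv
import io

# Stand-in for the real file: three rows from the table above, rounded to 2 d.p.
csv_text = """id,Cancer1,Cancer2
ENSG00000000003,5.96,8.81
ENSG00000000005,-1.41,-1.44
ENSG00000000419,9.43,9.16
"""

# Build one list per column, keeping the rows in file order.
ids, Cancer1, Cancer2 = [], [], []
for row in csv.DictReader(io.StringIO(csv_text)):
    ids.append(row["id"])
    Cancer1.append(float(row["Cancer1"]))
    Cancer2.append(float(row["Cancer2"]))

nrows = len(ids)
sum_expression = sum(Cancer1)

gene_of_interest = "ENSG00000000419"  # example choice
index_of_gene = ids.index(gene_of_interest)

print("There are %i rows" % nrows)
print("Sum of expression of Cancer1 is %.2f" % sum_expression)
print("Expression of %s in Cancer1 is %.2f and in Cancer2 is %.2f" %
      (gene_of_interest, Cancer1[index_of_gene], Cancer2[index_of_gene]))
```

Every question needs its own bookkeeping (parallel lists, index lookups), which is the work pandas takes over.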


## Pandas

![](panda_logo.jpg)

![](pandas.jpg)

In [None]:
import pandas

list1 = ["foo", "bar", "baz", "ping"]
list2 = ["A", "A", "B", "B"]
list3 = [2,4,6,10]
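
One way to turn these three lists into a table is to pass a dictionary of lists to `pandas.DataFrame`. This is a sketch: the column names `name` and `group` are assumptions chosen for illustration, while `value` is the column name used later in this notebook.

```python
import pandas

list1 = ["foo", "bar", "baz", "ping"]
list2 = ["A", "A", "B", "B"]
list3 = [2, 4, 6, 10]

# Each dictionary key becomes a column name, each list becomes that column.
# "name" and "group" are illustrative names, not taken from the original cell.
test_table = pandas.DataFrame({"name": list1, "group": list2, "value": list3})

print(test_table)
```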






Most usefully, we can read in big tables of data quickly and easily:

`head` allows us to just look at the first few rows:
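
A minimal sketch of both steps, assuming a comma-separated file. To keep the example self-contained it reads from an in-memory string holding five rows of the table above (values rounded); with a real file you would pass its path to `read_csv` and, for tab-separated data, add `sep="\t"`.

```python
import io
import pandas

# Stand-in for a file on disk: five rows from the table above, rounded to 2 d.p.
csv_text = """id,Cancer1,Cancer2,Normal1,Normal2
ENSG00000000003,5.96,8.81,7.53,6.64
ENSG00000000005,-1.41,-1.44,-1.32,-1.29
ENSG00000000419,9.43,9.16,9.44,9.21
ENSG00000000457,9.77,9.57,9.28,8.96
ENSG00000000460,7.63,7.60,7.91,7.52
"""

# read_csv accepts a path or any file-like object.
RNA_expression_data = pandas.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default; pass a number to change that.
print(RNA_expression_data.head(3))
```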

## Accessing Columns

`DataFrames` behave a bit like a dictionary of lists.

Each entry in this dictionary is a column in the table:

So to get the gene expressions in the Cancer1 sample, we do:

And to get the expression of the 5th gene, we do:
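
A short sketch of both lookups, using a cut-down copy of the expression table (five rows, values rounded) so that it runs on its own:

```python
import pandas

# Small stand-in for the expression table: five genes, values rounded to 2 d.p.
RNA_expression_data = pandas.DataFrame({
    "id": ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419",
           "ENSG00000000457", "ENSG00000000460"],
    "Cancer1": [5.96, -1.41, 9.43, 9.77, 7.63],
    "Cancer2": [8.81, -1.44, 9.16, 9.57, 7.60],
})

# Square brackets with a column name return that whole column.
cancer1 = RNA_expression_data["Cancer1"]
print(cancer1)

# Rows are numbered from 0 by default, so the 5th gene is entry 4.
fifth_gene_expression = RNA_expression_data["Cancer1"][4]
print(fifth_gene_expression)
```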

## Accessing Rows

As well as extracting columns, we can also extract rows by number, using `iloc` (integer location).

So to get the 4th row (remember, Python counts from 0):

We can access multiple rows using slices, just like for `list` or `str`.

To get the 2nd row to the 4th row:

We can also slice using a list of row numbers. So to get the 1st and 3rd rows:

To get the 2nd, 6th and 10th rows of the gene expression table:
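
The three kinds of `iloc` selection can be sketched on the small test table (the column names `name` and `group` are assumptions). On a table with at least ten rows, the last case would follow the same pattern with `.iloc[[1, 5, 9]]`.

```python
import pandas

# Small test table; "name" and "group" are illustrative column names.
test_table = pandas.DataFrame({
    "name": ["foo", "bar", "baz", "ping"],
    "group": ["A", "A", "B", "B"],
    "value": [2, 4, 6, 10],
})

# A single integer gives one row: position 3 is the 4th row.
fourth_row = test_table.iloc[3]

# A slice stops before its end, so 1:4 gives the 2nd, 3rd and 4th rows.
rows_2_to_4 = test_table.iloc[1:4]

# A list of positions picks out exactly those rows: here the 1st and 3rd.
rows_1_and_3 = test_table.iloc[[0, 2]]

print(fourth_row)
print(rows_2_to_4)
print(rows_1_and_3)
```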

## Naming rows, or "indexing"

At the moment we are calling columns using their names, but using numbers to select rows.

We can assign a column to be the row names, known as the "index":

We can now use names to refer to rows using `.loc` (NOTE: NOT `.iloc`)

`loc` = location using name

`iloc` = integer location

So to get the expression values for `bar`:

So we get the expression data for the gene `ENSG00000001167`, like so:
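
A sketch of the idea on the small test table, assuming its first column is called `name`. `set_index` returns a new table with that column as the row names; the same pattern on the expression table would be `set_index("id")` followed by `.loc` with an Ensembl ID.

```python
import pandas

# Small test table; "name" and "group" are illustrative column names.
test_table = pandas.DataFrame({
    "name": ["foo", "bar", "baz", "ping"],
    "group": ["A", "A", "B", "B"],
    "value": [2, 4, 6, 10],
})

# Use the "name" column as the row names (the index).
# set_index returns a new table; the original is left unchanged.
indexed_table = test_table.set_index("name")

# .loc looks rows up by name rather than by position.
bar_row = indexed_table.loc["bar"]
print(bar_row)
```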

## Run functions on rows and columns

Perhaps we would like to know some stats about our gene expression. For example, what is the average of the `value` column of our test table?

The `mean` method of a column calculates the mean of the values in that column.

So the mean of `[2,4,6,10]` is 5.5.

`test_table["value"].mean()` is like saying:

    column = test_table["value"]
    mean([column[0], column[1], column[2], column[3]])
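
As a runnable sketch (the column names other than `value` are assumptions):

```python
import pandas

# Small test table; "name" and "group" are illustrative column names.
test_table = pandas.DataFrame({
    "name": ["foo", "bar", "baz", "ping"],
    "group": ["A", "A", "B", "B"],
    "value": [2, 4, 6, 10],
})

# Select the column, then ask it for its mean: (2 + 4 + 6 + 10) / 4
average_value = test_table["value"].mean()
print(average_value)
```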


We can also calculate the mean of all the columns in a table like so:

pandas understands `min`, `max`, `median`, `std` (standard deviation), and many more.

To get the mean expression for each gene, we need to change the *axis*.

Think of the axis as being like an axis on a graph:
* Axis 0 is the x axis - the results will have the same names as the column names
* Axis 1 is the y axis - the results will have the same names as the row names. 
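
The difference between the two axes can be sketched on a three-gene stand-in for the expression table (values rounded), with the gene IDs used as row names:

```python
import pandas

# Three genes from the expression table, rounded to 2 d.p., indexed by gene ID.
expression = pandas.DataFrame(
    {
        "Cancer1": [5.96, -1.41, 9.43],
        "Cancer2": [8.81, -1.44, 9.16],
        "Normal1": [7.53, -1.32, 9.44],
        "Normal2": [6.64, -1.29, 9.21],
    },
    index=["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"],
)

# Default (axis=0): one mean per column, labelled with the column names.
sample_means = expression.mean()

# axis=1: one mean per row, labelled with the row names (the gene IDs).
gene_means = expression.mean(axis=1)

print(sample_means)
print(gene_means)
```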

## Doing sums with rows and columns

We can do arithmetic with whole columns at once.

For example, we can multiply a column by 2:

We can also, for example, take the average of the Cancer and Normal expressions:

Let's find those genes that are switched on a large amount in Cancer.

Now we've got the average expression in each condition, we can find the change between the two conditions. 
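
A sketch of these steps on a three-gene stand-in (values rounded). The new column names `Cancer_mean`, `Normal_mean` and `change` are assumptions, and taking the change as a simple difference assumes the values are already on a log scale.

```python
import pandas

# Three genes from the expression table, rounded to 2 d.p., indexed by gene ID.
expression = pandas.DataFrame(
    {
        "Cancer1": [5.96, -1.41, 9.43],
        "Cancer2": [8.81, -1.44, 9.16],
        "Normal1": [7.53, -1.32, 9.44],
        "Normal2": [6.64, -1.29, 9.21],
    },
    index=["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"],
)

# Arithmetic on a column is applied to every row at once.
doubled = expression["Cancer1"] * 2

# Columns can be combined row by row; assigning to a new name adds a column.
expression["Cancer_mean"] = (expression["Cancer1"] + expression["Cancer2"]) / 2
expression["Normal_mean"] = (expression["Normal1"] + expression["Normal2"]) / 2

# Assumption: values are log-scaled, so a difference is a log fold change.
expression["change"] = expression["Cancer_mean"] - expression["Normal_mean"]

print(expression)
```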


## Logical Slicing and filtering

**Find rows where change is bigger than 2 fold**

Use a third type of slicing - logical slicing

What is happening here?

If
* You slice with a vector the same length as the dataframe
* The vector contains only `True` or `False`

... Then pandas will return the rows that correspond to `True` in the vector.

![](slicing.PNG)

Logical tests with `==`, `<`, `>`, `<=` and `>=` are applied to every row:

Now use this `slicer` to slice the table.

So to extract the rows of `RNA_expression_data` where the change is bigger than 2:
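
A sketch of logical slicing on the small test table (column names other than `value` are assumptions). Assuming the expression table has a column named `change`, the same pattern there would be `RNA_expression_data[RNA_expression_data["change"] > 2]`.

```python
import pandas

# Small test table; "name" and "group" are illustrative column names.
test_table = pandas.DataFrame({
    "name": ["foo", "bar", "baz", "ping"],
    "group": ["A", "A", "B", "B"],
    "value": [2, 4, 6, 10],
})

# The comparison is applied to every row, giving one True/False per row.
slicer = test_table["value"] > 4
print(slicer)

# Slicing with that vector keeps only the rows where it is True.
big_values = test_table[slicer]
print(big_values)
```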

## Joining two DataFrames together

* Currently we identify our genes with gene identifiers from the Ensembl database
   - Different people might call the same gene by different names
   - The name might change over time
   - Genes get assigned computer-readable IDs that are unique and guaranteed not to change
* People like names, not codes, so we must translate

 ![](NSD2.PNG)

The IDs here are not in the same order as our other table

* We need to "join" the two tables together, making sure the gene ids match
* In pandas we do this with the "merge" function.

Take the following example:

In [None]:
left = pandas.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                           'A': ['A0', 'A1', 'A2', 'A3'],
                           'B': ['B0', 'B1', 'B2', 'B3']})

right = pandas.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                            'C': ['C0', 'C1', 'C2', 'C3'],
                            'D': ['D0', 'D1', 'D2', 'D3']})
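
A sketch of merging these two tables on their shared `key` column. The tables are redefined here only so that the example runs on its own.

```python
import pandas

left = pandas.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                         'A': ['A0', 'A1', 'A2', 'A3'],
                         'B': ['B0', 'B1', 'B2', 'B3']})

right = pandas.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})

# Rows are matched up by the value in "key", not by their position.
merged = pandas.merge(left, right, on='key')
print(merged)
```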

By default, pandas only keeps rows where the joined column is in both:

In [None]:
left = pandas.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                           'A': ['A0', 'A1', 'A2', 'A3'],
                           'B': ['B0', 'B1', 'B2', 'B3']})

right = pandas.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                            'C': ['C0', 'C1', 'C2', 'C3'],
                            'D': ['D0', 'D1', 'D2', 'D3']})

But we can tell it to keep everything from left, or everything from right:

Or we can tell it to keep everything:
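
The effect of `how` only shows up when the keys differ, so this sketch uses two hypothetical tables, `left_demo` and `right_demo`, that share only `K1` and `K2`:

```python
import pandas

# Hypothetical tables for illustration: K0 is only in left_demo, K3 only in right_demo.
left_demo = pandas.DataFrame({'key': ['K0', 'K1', 'K2'],
                              'A': ['A0', 'A1', 'A2']})
right_demo = pandas.DataFrame({'key': ['K1', 'K2', 'K3'],
                               'C': ['C1', 'C2', 'C3']})

# Default (inner): only keys present in both tables.
inner = pandas.merge(left_demo, right_demo, on='key')

# left / right: keep every row of one table; gaps are filled with NaN.
keep_left = pandas.merge(left_demo, right_demo, on='key', how='left')
keep_right = pandas.merge(left_demo, right_demo, on='key', how='right')

# outer: keep every key from either table.
keep_all = pandas.merge(left_demo, right_demo, on='key', how='outer')

for name, table in [('inner', inner), ('left', keep_left),
                    ('right', keep_right), ('outer', keep_all)]:
    print(name)
    print(table)
```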

| how | description | row names |
| --- | --- | --- |
| inner (default) | Only things in `left` and `right` | intersection of `left` and `right` |
| left | Keep everything from left | Same as `left` |
| right | Keep everything from right | Same as `right` |
| outer | Keep everything from both | Union of `left` and `right` |

For our gene expression table, we want everything from the expression table, and only the rows from the ID table that match:

Finally, we can save our results to a new file:
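
A sketch of both steps together, under some assumptions: the expression table is a three-gene stand-in (values rounded), the ID table is a small hypothetical one with a column called `gene_name`, and the result is written to an in-memory buffer rather than a real file. To write to disk, pass a file name to `to_csv` instead.

```python
import io
import pandas

# Stand-in expression table: three genes, Cancer1 only, rounded to 2 d.p.
expression = pandas.DataFrame({
    "id": ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"],
    "Cancer1": [5.96, -1.41, 9.43],
})

# Hypothetical ID table: different order, one gene missing, one extra gene.
gene_names = pandas.DataFrame({
    "id": ["ENSG00000000419", "ENSG00000000457", "ENSG00000000003"],
    "gene_name": ["DPM1", "SCYL3", "TSPAN6"],
})

# how="left" keeps every expression row; unmatched genes get NaN as their name.
annotated = pandas.merge(expression, gene_names, on="id", how="left")
print(annotated)

# index=False leaves the 0, 1, 2... row numbers out of the saved file.
buffer = io.StringIO()
annotated.to_csv(buffer, index=False)
saved_text = buffer.getvalue()
print(saved_text)
```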