
# Task 1

You will be provided with a file containing the protein sequences from the organism *Campylobacter jejuni*.

There will be one sequence per line. Each line starts with a protein name, followed by `<tab>`, followed by the protein sequence.

    Cj0001 MNPSQILENLKKELSEYENYLSNFNSKAKKIFYEVQSGNKAIINIQAQSAK...
    Cj0002 MKLSINKNTLESAVILCNAEIKASDIGIIKKVESSGFATANAKSIADVIKS...
    Cj0003 MQENYGASNIKVHHMIYEVVDIMAGHCDTIEITTEGSCIVSDNGRGIPVDS...
    Cj0004c LSTGLIIDPDSPLVEANCSNLITNMHASRWLAAIRWMQDSEGLWEIEPED...
    Cj0005c NIGLFGISFENFLGSELPDFKIEGKKDYHGEKPLTAETEIYALDSDFTKP...
    Cj0006 RFNVLLSLLISALSHLELVDTMNILISGENLKTALSYILLGAIAAAISKTN...
    Cj0007 MDLENILENIASYKVICDALEILLEHRGGAEENSGDGAGILIQIPHDFFKT...
    .
    .
    .

## Part i

The first job is to write a program that calculates the % occurrence of each amino acid across the whole proteome.

For example, if the proteome only has two proteins:

    Cj0001 MNPSQIS
    Cj0002 MKLSINL
    
Then the % of `N` would be 14%, because there are 2 Ns (one in each protein) and 14 AAs in total.
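
As a rough check of that arithmetic, here is a minimal sketch of one possible approach, run on the two toy sequences above (7 residues each, 14 in total). The function name `proteome_composition` is made up for illustration, and the sketch assumes each line is `name<tab>sequence` as described earlier.

```python
from collections import Counter

def proteome_composition(lines):
    """Return {amino_acid: percentage} over every sequence in `lines`."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, sequence = line.split("\t")
        counts.update(sequence)  # add one count per residue
    total = sum(counts.values())
    return {aa: 100 * n / total for aa, n in counts.items()}

# The toy proteome from the example above
toy = ["Cj0001\tMNPSQIS", "Cj0002\tMKLSINL"]
composition = proteome_composition(toy)
print(round(composition["N"], 1))
```

For the real file you would pass the open file object instead of the `toy` list.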

## Part ii

Now consider each protein in turn. Write a program that calculates the % occurrence of an amino acid in each protein.

e.g. with the same proteome:

    Cj0001 MNPSQIS
    Cj0002 MKLSINL
    
for serine the result might look like:

    Cj0001 29
    Cj0002 14
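
The serine figures above can be reproduced with a short sketch like the following. It is only one way of doing it: the function name `per_protein_percentage` is hypothetical, and it assumes tab-separated lines with no empty sequences (an empty sequence would cause a division by zero).

```python
def per_protein_percentage(lines, amino_acid):
    """Return {protein_name: % of residues equal to `amino_acid`}."""
    results = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, sequence = line.split("\t")
        results[name] = 100 * sequence.count(amino_acid) / len(sequence)
    return results

# The toy proteome from the example above
toy = ["Cj0001\tMNPSQIS", "Cj0002\tMKLSINL"]
serine = per_protein_percentage(toy, "S")
for name, percentage in serine.items():
    print(name, round(percentage))
```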

Choose an amino acid and calculate the results. Plot the results as a histogram.

Choose two more amino acids and repeat the exercise for them. Plot the results for the three amino acids as three histograms on one page, like so:

![](histoeg.png)

Pick some outliers and comment on whether you think there is any significance to their "unusual" composition, e.g.

> Prot Cj0034 has lots of Serine and because Serine is the best amino acid, it must be a totally awesome protein. Prot Cj0194 has slightly more Alanine than the others, but it is only a slight increase so probably not important

## Histograms

Back in Lecture 8, Roy showed you how to plot "histograms" as bar graphs, but not how to turn columns of numbers into sets of frequencies that could be plotted.

Remember that `gnuplot` wants the centre of each bin for plotting.

So for example if you have the numbers `15,17,30,25,35,23,29` you need to bin them, e.g. the binning might look like:

  | *bin* | *count* |
  |-------|---------|
  | 10-19 | 2       |
  | 20-29 | 3       |
  | 30-39 | 2       |
  
Then the table you want to pass to `gnuplot` might look like:


    15    2
    25    3
    35    2
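
A minimal sketch of the binning step, assuming fixed-width bins that start at multiples of the bin width. The function name `bin_values` and the `width` parameter are illustrative choices, not something defined in the lectures.

```python
from collections import Counter

def bin_values(values, width=10):
    """Return a sorted list of (bin_centre, count) pairs."""
    # Floor each value to the start of its bin, e.g. 17 -> 10 when width is 10
    counts = Counter((v // width) * width for v in values)
    return [(start + width / 2, counts[start]) for start in sorted(counts)]

binned = bin_values([15, 17, 30, 25, 35, 23, 29])
for centre, count in binned:
    print(centre, count)
```

Note that bins containing no values are simply left out of the output in this version; whether that matters depends on how you plot the result.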


In [26]:
#! /usr/bin/env python


    

15.0 2
25.0 3
35.0 2


In [28]:
%%bash


15.0 2
25.0 3
35.0 2


## Part iii

a) Count the number of negatively and positively charged AAs in each protein and calculate the ratio for each protein.

b) Count the number of hydrophilic and hydrophobic residues in each protein. Calculate the ratio.

c) Use gnuplot to plot the positive/negative ratio against the hydrophilic/hydrophobic ratio for each protein. Comment on the relationship.

![](ratios_graph.png)
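
One possible way to compute the two ratios for a single sequence is sketched below. The residue groupings are assumptions made for illustration (for instance, histidine is not counted as positively charged here); deciding which residues belong in each class is a design decision you should make and document yourself. The sequence `MKKDEAL` is a made-up toy example.

```python
# Illustrative residue classes: these groupings are assumptions, not part of the task
POSITIVE = set("KR")
NEGATIVE = set("DE")
HYDROPHILIC = set("RKDENQSTH")
HYDROPHOBIC = set("AVLIMFWC")

def count_in(sequence, residues):
    """Count how many residues of `sequence` belong to the set `residues`."""
    return sum(1 for aa in sequence if aa in residues)

def safe_ratio(top, bottom):
    """Return top / bottom, or None when the denominator is zero."""
    return top / bottom if bottom else None

def protein_ratios(sequence):
    """Return (positive/negative, hydrophilic/hydrophobic) for one sequence."""
    charge = safe_ratio(count_in(sequence, POSITIVE), count_in(sequence, NEGATIVE))
    hydro = safe_ratio(count_in(sequence, HYDROPHILIC), count_in(sequence, HYDROPHOBIC))
    return charge, hydro

charge, hydro = protein_ratios("MKKDEAL")
print(charge, hydro)
```

Proteins with a zero denominator return `None` here, so they would need to be skipped or handled separately before writing the two columns out for gnuplot.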

Select some outliers. Retrieve the sequence and use `BLAST` to identify the protein. Does their strange composition have any significance? You could use online tools to investigate their pI, perhaps.

This is open-ended, but don't go too far down the rabbit hole.

## Part iv

Use Wikipedia to find the definition of the "Walker A motif".

Create a regular expression that finds all the proteins that contain the motif.

If you have time, do some extra analysis to look at the different sequences and sequence compositions of the variable parts of the motifs by pulling them back as groups.
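
As a starting point, here is a sketch that assumes the commonly quoted consensus `G-x(4)-GK-[TS]` for the Walker A motif; check it against the definition you find before relying on it. The function name `find_walker_a` and the test sequence `MAGSPNSGKTTL` are made up for illustration. The two pairs of parentheses capture the variable parts as groups.

```python
import re

# Assumed consensus: G, any four residues, G, K, then T or S.
# Group 1 captures the four variable residues, group 2 the final T/S.
WALKER_A = re.compile(r"G(.{4})GK([ST])")

def find_walker_a(sequence):
    """Return the captured groups of the first match, or None if absent."""
    match = WALKER_A.search(sequence)
    return match.groups() if match else None

hit = find_walker_a("MAGSPNSGKTTL")   # made-up sequence containing the motif
miss = find_walker_a("MKLSINL")       # toy sequence without the motif
print(hit, miss)
```

`search` only reports the first occurrence; `WALKER_A.finditer(sequence)` would walk through every non-overlapping match in a protein.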

## The report

Write it all up in a report.

For each part, have a graphical flow chart to describe your aim and your approach, or use a bulleted list if it's simple.

Then include your code. Include appropriate comments in the code:
* Mark major sections.
* Document places where you have had to make a design decision.
* Explain **particularly** obscure pieces of code.

Try to write code that documents itself. 

You may wish to format code using the website http://hilite.me

Otherwise copy and paste code into Word and use `Courier New` or another fixed-width font.

**Under no circumstances** use screenshots of code.

Code can be captioned using "Code 1:", "Code 2:" etc. for larger blocks of code.

Then include results of the code. Include figure legends where appropriate. 

Finally include your interpretation of the results. 

The report should be no longer than 300 words (not including titles, code, figures and legends).

The report is due in on **Wednesday November the 20th at midnight.**

There will be a hand-in on MOLE.

This task **contributes** to your conduct mark for the project.

The conduct mark is 25% of the whole module. 

Please get in contact if you are struggling. 

# Handling tables with pandas

    id	Cancer1	Cancer2	Normal1	Normal2
    ENSG00000000003	5.95826622337557	8.81097601937505	7.53216385716448	6.63781231813915
    ENSG00000000005	-1.41151988912058	-1.43750916858247	-1.31684622365868	-1.29089271034514
    ENSG00000000419	9.42730899215926	9.15577673744174	9.44121243114581	9.20685569248805
    ENSG00000000457	9.76744540012193	9.56745990257189	9.2770807794418	8.96465517406554
    ENSG00000000460	7.62719592443423	7.59893474744653	7.9111353616632	7.5232621375324
    ENSG00000000938	5.10676083839103	4.70915305298166	4.2068607495857	4.13169270397857
    ENSG00000000971	6.61464002977705	5.31104669007017	5.75484059089178	7.61981857371949
    ENSG00000001036	11.9369955115137	10.7788078234204	9.34068768940202	11.2829622683915
    ENSG00000001084	10.1643213325872	10.0713806061553	9.7803495466056	9.72496464284139


In [32]:
import pandas


|   | id | Cancer1 | Cancer2 | Normal1 | Normal2 |
|---|----|---------|---------|---------|---------|
| 0 | ENSG00000000003 | 5.958266 | 8.810976 | 7.532164 | 6.637812 |
| 1 | ENSG00000000005 | -1.41152 | -1.437509 | -1.316846 | -1.290893 |
| 2 | ENSG00000000419 | 9.427309 | 9.155777 | 9.441212 | 9.206856 |
| 3 | ENSG00000000457 | 9.767445 | 9.56746 | 9.277081 | 8.964655 |
| 4 | ENSG00000000460 | 7.627196 | 7.598935 | 7.911135 | 7.523262 |


The `sep` argument lets us say what separates the columns in the file.
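
A small self-contained sketch of reading tab-separated data with `sep`. The name of the original file is not shown here, so this example reads the first two rows of the table above from an in-memory string via `io.StringIO`; with a real file you would pass its path instead.

```python
import io
import pandas

# First rows of the table above, held in a string instead of a file
text = (
    "id\tCancer1\tCancer2\tNormal1\tNormal2\n"
    "ENSG00000000003\t5.95826622337557\t8.81097601937505\t7.53216385716448\t6.63781231813915\n"
    "ENSG00000000005\t-1.41151988912058\t-1.43750916858247\t-1.31684622365868\t-1.29089271034514\n"
)

# sep="\t" tells pandas that the columns are separated by tabs
table = pandas.read_csv(io.StringIO(text), sep="\t")
print(table.head())
```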

I can get a single column

0    ENSG00000000003
1    ENSG00000000005
2    ENSG00000000419
3    ENSG00000000457
4    ENSG00000000460
Name: id, dtype: object

Or a single row

In [41]:
table.loc[0].head()

id         ENSG00000000003
Cancer1            5.95827
Cancer2            8.81098
Normal1            7.53216
Normal2            6.63781
Name: 0, dtype: object

`loc` is short for location: it selects rows by their index label.
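
The column and row selections above can be sketched on a small in-memory table as follows. The table is rebuilt here from the first two rows of the data so that the example runs on its own; the variable names `ids` and `first_row` are illustrative.

```python
import io
import pandas

# First rows of the expression table, held in a string instead of a file
text = (
    "id\tCancer1\tCancer2\tNormal1\tNormal2\n"
    "ENSG00000000003\t5.95826622337557\t8.81097601937505\t7.53216385716448\t6.63781231813915\n"
    "ENSG00000000005\t-1.41151988912058\t-1.43750916858247\t-1.31684622365868\t-1.29089271034514\n"
)
table = pandas.read_csv(io.StringIO(text), sep="\t")

ids = table["id"]          # a single column, returned as a Series
first_row = table.loc[0]   # the row whose index label is 0

print(ids)
print(first_row)
```

Because no index column was specified, pandas labels the rows 0, 1, 2 and so on, which is why `table.loc[0]` returns the first row.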

Or I can get multiple columns/rows