# <span style="color:red">Before you turn this in, make sure everything runs as expected.</span>

1. **RESTART THE KERNEL** – in the menubar, select Kernel$\rightarrow$Restart
2. **RUN ALL CELLS** – in the menubar, select Cell$\rightarrow$Run All
3. **VALIDATE THE NOTEBOOK** – in the menubar, click the Validate button

## <span style="color:blue">How to Answer Questions</span>

### <span style="color:blue">Python code answers</span>

Enter your answer any place that says
```python
# Enter your code here
```
<span style="color:red">**AND delete the following line.**</span>
```python
raise NotImplementedError # No Answer - remove if you provide an answer
```

### <span style="color:blue">Written answers</span>

Enter your answer any place that says
```
YOUR ANSWER HERE.
```

In [1]:
ANUID = "u7522927"

# Origins of the data

**Run all the cells in this notebook to generate the figures and tables.**

# The provided data

To facilitate your research, I have provided some useful data within the `data/pssm/` directory. 

## TFBS matrices

There are seven transcription factors (TFs) whose binding site (TFBS) definitions have been obtained from the JASPAR website. Some of these are described in [Culley et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3684448/). The table below links to a description of each gene and to its JASPAR entry, which also displays a useful visual representation of the TFBS.

| TF |                                                            Functional Description |                                        TFBS definition |
|--------|---------------------------------------------------------------------------------------------|---------------------------------------------------------|
|  GATA4 |      [description](https://www.genecards.org/cgi-bin/carddisp.pl?gene=GATA4&keywords=gata4) | [at jaspar](http://jaspar.genereg.net/matrix/MA0482.1/) |
|   MEF2 |       [description](https://www.genecards.org/cgi-bin/carddisp.pl?gene=MEF2A&keywords=mef2) | [at jaspar](http://jaspar.genereg.net/matrix/PF0028.1/) |
|   MYOG |        [description](https://www.genecards.org/cgi-bin/carddisp.pl?gene=MYOG&keywords=myog) | [at jaspar](http://jaspar.genereg.net/matrix/MA0500.1/) |
|    NKX2-5 | [description](https://www.genecards.org/cgi-bin/carddisp.pl?gene=NKX2-5&keywords=nkx2-5) | [at jaspar](http://jaspar.genereg.net/matrix/MA0063.1/) |
|   SRF1 |          [description](https://www.genecards.org/cgi-bin/carddisp.pl?gene=SRF&keywords=srf) | [at jaspar](http://jaspar.genereg.net/matrix/PB0078.1/) |
|    TBP |          [description](https://www.genecards.org/cgi-bin/carddisp.pl?gene=TBP&keywords=tbp) | [at jaspar](http://jaspar.genereg.net/matrix/MA0108.1/) |
|   TBX5 |        [description](https://www.genecards.org/cgi-bin/carddisp.pl?gene=TBX5&keywords=TBX5) | [at jaspar](http://jaspar.genereg.net/matrix/MA0807.1/) |

## The sequences

The file `data/pssm/mouse-release-92-v2.fasta` contains the sequences within ±500 bp of the Transcription Start Site (henceforth TSS) of mouse protein-coding genes, which I sampled from [Ensembl](http://www.ensembl.org). All such genes are included except for some that I excluded due to low sequence quality. The labels for these sequences are Ensembl identifiers (called `stableid`). Below is a summary display of that data.

In [2]:
from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/pssm/mouse-release-92-v2.fasta", moltype="dna")
aln

ENSMUSG00000023892,GTAACATCAGTACAGCACAGGATGAATTCGGCCCTGGGCGGATAGCAAAGAGCCACTGAA
ENSMUSG00000056468,.CC.GGGGG.A.AG.G.G.AATCA..AAA.CA.TCC..GCC.GG.TGGCCTCATG.AT.C
ENSMUSG00000039616,.CCCTTCA.A.TTG.TTTCTC.GCCCCG.CT.A.CT.AGTTCACTTT.TC.TGTCT.CCT
ENSMUSG00000024091,T.TC..GGGCAGACAAGAC.CCCTCGGCGA...AAAA..C.CGTCAGCG.CC....CC.C
ENSMUSG00000024056,AC..T.ATGCCGAGAGC.....ATCTACACTTTAGCCCA..GACAGGCCACCA.CA.CCG
ENSMUSG00000054321,TATGA.AATT.TTGC..GGCTT.CC.GGGACG..G.AAAAA.ACAG...A.CAA.AAACC
ENSMUSG00000052469,CCTGTT.GCC.TT.AATATTT.ACC.C..CAAATACTTT.CTGG.GGC.AGAAGTGATGG
ENSMUSG00000024261,CAG...AG.AAC.....ACACTATC.AC.TAATG.TCAGACTATAA.TT.GTAGTAA...
ENSMUSG00000052031,AGCGAG.ATAC..GCACAGA.GC..GAATA.AATGCCTACTTGCTGCTCAC..TTAGA..


## Gene expression measurements
I obtained the file `data/pssm/GSM522308.txt` from the STEMmapper database by [searching for gene expression studies on cardiac progenitor cells](http://stemmapper.sysbiolab.eu). The experiment that generated these data is [described here](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6689). That file contains brief identifying information on the data provenance, and the `log2` "expression values" (larger values mean more gene expression) for each gene plus the quantile into which that expression value fits. I have processed this file, producing the tab delimited file `data/pssm/GSM522308-edited.tsv`. This is easier to parse and has the Ensembl `stableid` for each gene (that's a unique identifier which can be used to relate different types of data).
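
Because the expression values are on a `log2` scale, a difference of 1 between two genes corresponds to a two-fold difference in expression. The sketch below illustrates this using the `Log2_exprs_val` entries for Lrrc59 and Cer1 from the file. The variable names are my own and are not part of the provided data or code.

```python
# log2 expression values for two genes, taken from GSM522308-edited.tsv
lrrc59_log2 = 12.2256
cer1_log2 = 7.0004

# a difference on the log2 scale becomes a ratio on the original scale
log2_difference = lrrc59_log2 - cer1_log2
fold_difference = 2 ** log2_difference

print(f"log2 difference: {log2_difference:.4f}")
print(f"fold difference: {fold_difference:.1f}")
```

So a gap of roughly 5 units on the x-axis of an expression plot reflects a difference of more than 30-fold, not 5-fold.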

In [3]:
from cogent3 import load_table

table = load_table("data/pssm/GSM522308-edited.tsv")

In [4]:
table  # showing an abbreviated representation of the data

Entrez_id,Gene_symbol,Log2_exprs_val,Quantile,stableid
98238,Lrrc59,12.2256,0.9962,ENSMUSG00000020869
12622,Cer1,7.0004,0.9962,ENSMUSG00000038192
21859,Timp3,12.0151,0.9962,ENSMUSG00000020044
71203,4933434M16Rik,6.2854,0.9962,ENSMUSG00000020624
192662,Arhgdia,12.8525,0.9962,ENSMUSG00000025132
...,...,...,...,...
214616,Spata5l1,2.9955,0.0025,ENSMUSG00000074876
71687,Tmem25,4.0927,0.0025,ENSMUSG00000002032
12564,Cdh8,3.1385,0.0013,ENSMUSG00000036510
381695,N4bp2l2,3.4625,0.0013,ENSMUSG00000029655
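
The `stableid` column is what lets you relate an expression measurement to the matching TSS sequence. As a minimal sketch of that idea in plain Python (not the `cogent3` API), the example below matches two small dictionaries keyed by `stableid`. The expression values come from the table above, but the sequence strings are short made-up placeholders, and all variable names are hypothetical.

```python
# expression values keyed by Ensembl stableid (values from the table above)
expression = {
    "ENSMUSG00000020869": 12.2256,  # Lrrc59
    "ENSMUSG00000038192": 7.0004,   # Cer1
}

# placeholder sequences keyed by stableid (made up for illustration only)
sequences = {
    "ENSMUSG00000020869": "ACGTACGTAC",
    "ENSMUSG00000023892": "TTGACCATGA",
}

# only identifiers present in both data sets can be related
shared = sorted(expression.keys() & sequences.keys())

# pair each shared identifier with its expression value and sequence
joined = {sid: (expression[sid], sequences[sid]) for sid in shared}

for sid, (log2_value, seq) in joined.items():
    print(sid, log2_value, seq)
```

Genes found in only one of the two data sets drop out of the result, so it is worth checking how many identifiers are shared before drawing conclusions.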


The histogram below shows the distribution of expression values in that gene expression data. Larger x-axis values correspond to higher levels of gene expression. The y-axis gives the percentage of genes whose expression value falls within each bin.

In [7]:
import plotly.express as px

fig = px.histogram(
    x=table.columns["Log2_exprs_val"],
    title="Gene expression measurement from heart stem cell experiment : GSM522308",
    labels=dict(x="log<sub>2</sub>(Expression)", y="%"),
    histnorm="percent",  # show the percentage of genes per bin, not raw counts
)
# override the default y-axis title
fig.layout.yaxis.title.text = "%"
fig.show()
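
The percent normalisation used in the histogram above can be reproduced by hand. The sketch below assumes `numpy` is available and uses only the nine expression values visible in the abbreviated table display, not the full data set. The choice of 5 bins is arbitrary and will not match the bins that plotly selects.

```python
import numpy as np

# the nine Log2_exprs_val entries visible in the abbreviated table display
values = np.array(
    [12.2256, 7.0004, 12.0151, 6.2854, 12.8525, 2.9955, 4.0927, 3.1385, 3.4625]
)

# count how many values fall in each of 5 equal-width bins (arbitrary choice)
counts, edges = np.histogram(values, bins=5)

# convert counts to percentages so that the bars sum to 100
percent = 100 * counts / counts.sum()

for left, right, pct in zip(edges[:-1], edges[1:], percent):
    print(f"{left:6.2f} to {right:6.2f}: {pct:5.1f}%")
```

Since the bars sum to 100, the height of a bar depends on the bin width as well as on the data, so compare heights only between histograms that use the same bins.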