Today we'll be looking at the paper "A single-cell transcriptomic atlas of the developing chicken limb", by @chicken-cells - specifically, we'll use [this dataset from the Single Cell Expression Atlas](https://www.ebi.ac.uk/gxa/sc/experiments/E-CURD-13/downloads).  However, we're not actually gonna read their paper, we're just gonna do random stuff!

<b class="sidequests">Spoilers</b>: We don't get much work done, due to a mystery involving chicken breeds in Ensembl.

<details>
    <summary style="color:#C0CF96">Loading the Data</summary>

In [68]:
set.seed(0)
library(SingleCellExperiment)
library(scran)
library(scater)
library(dplyr)
library(biomaRt)

In [41]:
file.path = './localdata/Datasets/E-CURD-13/'

counts.mat <- Matrix::readMM(
    paste0(
        file.path,
        'E-CURD-13-quantification-raw-files/',
        'E-CURD-13.aggregated_filtered_counts.mtx'
    )
)
dim(counts.mat)

In [42]:
counts.rows <- read.csv(
    paste0(
        file.path,
        'E-CURD-13-quantification-raw-files/',
        'E-CURD-13.aggregated_filtered_counts.mtx_rows'
    ),
    sep='\t',
    header=FALSE
)$V1

counts.cols <- read.csv(
    paste0(
        file.path,
        'E-CURD-13-quantification-raw-files/',
        'E-CURD-13.aggregated_filtered_counts.mtx_cols'
    ),
    sep='\t',
    header=FALSE
)$V1

In [53]:
meta.data <- read.csv(
    paste0(
        file.path,
        'ExpDesign-E-CURD-13.tsv'
    ),
    sep='\t',
    header=TRUE,
    row.names='Assay'
)
colnames(meta.data)

In [44]:
rownames(counts.mat) <- counts.rows
colnames(counts.mat) <- counts.cols

In [52]:
# Make sure the metadata is lined up with the count matrix.
# The assay names are row names (read with row.names='Assay'), not a column.
identical(rownames(meta.data), colnames(counts.mat))
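
If the two ever disagree in order, the metadata can be reordered rather than rejected. Below is a minimal sketch using toy stand-ins (`toy.counts` and `toy.meta` are hypothetical, not the real objects). It also shows why the assay names have to be read from `rownames()` once the table is loaded with `row.names='Assay'`.

```r
# Toy stand-ins for the real objects (hypothetical values, for illustration only).
toy.counts <- matrix(
    1:6,
    nrow = 2,
    dimnames = list(c("geneA", "geneB"), c("cell1", "cell2", "cell3"))
)
toy.meta <- data.frame(
    batch = c("b", "c", "a"),
    row.names = c("cell2", "cell3", "cell1")
)

# With row.names set at read time, the assay names live in rownames();
# there is no `Assay` column left to compare against.
stopifnot(is.null(toy.meta$Assay))

# Reorder the metadata rows to follow the matrix columns, failing loudly
# if any cell has no metadata row.
idx <- match(colnames(toy.counts), rownames(toy.meta))
stopifnot(!anyNA(idx))
toy.meta.aligned <- toy.meta[idx, , drop = FALSE]

identical(rownames(toy.meta.aligned), colnames(toy.counts))
```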

In [60]:
sce <- SingleCellExperiment(
    assays=list(counts=counts.mat),
    colData=meta.data
)

In [61]:
sce

class: SingleCellExperiment 
dim: 13645 7688 
metadata(0):
assays(1): counts
rownames(13645): ENSGALG00000000003 ENSGALG00000000011 ...
  ENSGALG00000055127 ENSGALG00000055132
rowData names(0):
colnames(7688): SAMN11526603-AAAAAAATTCAG SAMN11526603-AAAAACAAGTAG ...
  SAMN11526603-TTTTTTTGTGAG SAMN11526603-TTTTTTTTTTTT
colData names(18): Sample.Characteristic.organism.
  Sample.Characteristic.Ontology.Term.organism. ...
  Factor.Value.inferred.cell.type...ontology.labels.
  Factor.Value.Ontology.Term.inferred.cell.type...ontology.labels.
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

</details>

I wanted to perform QC using mitochondrial genes.  This should have been as easy as reusing my code for humans with a few tweaks to look for chicken genes instead.  However, when I tried `biomaRt`, I got no results.

In [86]:
mart <- useDataset("ggallus_gene_ensembl", useMart("ensembl"))

gene.to.symbol.map <- getBM(
    filters="ensembl_gene_id",
    attributes=c(
        "ensembl_gene_id",
        "hgnc_symbol"
    ),
    values=rownames(sce),
    mart=mart
)
gene.to.symbol.map

| ensembl_gene_id `<lgl>` | hgnc_symbol `<lgl>` |
|---|---|


So I went directly to the BioMart [web interface](https://www.ensembl.org/biomart/martview/3d63e0dabdacc8dc33bd4bba9ed2130a), and searched for a gene I knew was in the dataset: `ENSGALG00000000003`

![](./images/no-003.png)

However, you can see that in the top left corner it reports no results!  If you look closely, you can see that the example search is with `ENSGALG00010000002`, which does have results - and in fact adding 10,000,000 to gene numbers tends to create a valid findable gene:

![](./images/yes-003.png)

These genes don't seem to be related to each other though.

If we filter for mitochondrial genes:

![Like this.](./images/filters.png)

Then we do get results:

![](./images/found-mitos.png)

All of these have IDs greater than 10,000,000.  The list is not comprehensive, though: outside BioMart I can find another mitochondrial gene whose ID is less than 10,000,000:

![ENSGALG00000031197](./images/a-missed-mito.png)

> The avian genome encodes the same set of genes (13 proteins, 2 rRNAs and 22 tRNAs) as do other vertebrate mitochondrial DNAs and is organized in a very similar economical fashion.
>
> -- "Sequence and gene organization of the chicken mitochondrial genome: A novel gene order in higher vertebrates", @chicken-mitochondria

From the above, we can see that there should be 24 mt-rRNA and mt-tRNA in total, which is in fact what we find with biomaRt! (The previous image only shows 10, but that's because it only shows 10 per page - the total amount found was 24).

![The entire mitochondrial genome when viewing it starting from a gene larger than 10,000,000.](./images/more-than-million.png)

![The entire mitochondrial genome when viewing it starting from a gene less than 10,000,000.](./images/less-than-million.png)

These two images seem to conclusively show that the gene sets are different; the genes are in a similar place but have different names between the two tracks.


From this, we can hypothesize that the ids less than 10,000,000 are out of date, presenting some frustration for using the data we have.  Can we prove it?

![](./images/no-change.png)

No: the gene has never changed its ID (which is a good thing for sanity purposes but is not ideal for this specific confusion).

However, it seems like there are multiple chicken genomes in Ensembl:

![](./images/chickens.png)

This proves, conclusively, that there are at least two breeds of chicken.  And I was looking at the wrong one.

The question then becomes - how do we actually find the right one?  We can get a list of all the possible biomart-queryable datasets:

In [113]:
all.datasets <- listDatasets(mart = mart)
all.datasets[70:80,]

| | dataset `<I<chr>>` | description `<I<chr>>` | version `<I<chr>>` |
|---|---|---|---|
| 70 | fheteroclitus_gene_ensembl | Mummichog genes (Fundulus_heteroclitus-3.0.2) | Fundulus_heteroclitus-3.0.2 |
| 71 | gaculeatus_gene_ensembl | Stickleback genes (BROAD S1) | BROAD S1 |
| 72 | gevgoodei_gene_ensembl | Goodes thornscrub tortoise genes (rGopEvg1_v1.p) | rGopEvg1_v1.p |
| 73 | gfortis_gene_ensembl | Medium ground-finch genes (GeoFor_1.0) | GeoFor_1.0 |
| 74 | ggallus_gene_ensembl | Chicken genes (bGalGal1.mat.broiler.GRCg7b) | bGalGal1.mat.broiler.GRCg7b |
| 75 | ggorilla_gene_ensembl | Gorilla genes (gorGor4) | gorGor4 |
| 76 | gmorhua_gene_ensembl | Atlantic cod genes (gadMor3.0) | gadMor3.0 |
| 77 | hburtoni_gene_ensembl | Burton's mouthbrooder genes (AstBur1.0) | AstBur1.0 |
| 78 | hcomes_gene_ensembl | Tiger tail seahorse genes (H_comes_QL1_v1) | H_comes_QL1_v1 |
| 79 | hgfemale_gene_ensembl | Naked mole-rat female genes (HetGla_female_1.0) | HetGla_female_1.0 |


But this list offers no chicken option other than `bGalGal1.mat.broiler.GRCg7b`, the maternal broiler haplotype.

Online, we can find an archived version of BioMart that grants access:

![](./images/archived-chicken.png)

And the results line up with prior expectations.

So if we use the `biomaRt` R interface with an archived host, maybe we can get the Red Junglefowl's genes?

In [116]:
mart.archive <- useMart("ensembl", host="https://apr2022.archive.ensembl.org")

In [117]:
all.datasets <- listDatasets(mart = mart.archive)
all.datasets[70:80,]

| | dataset `<I<chr>>` | description `<I<chr>>` | version `<I<chr>>` |
|---|---|---|---|
| 70 | fheteroclitus_gene_ensembl | Mummichog genes (Fundulus_heteroclitus-3.0.2) | Fundulus_heteroclitus-3.0.2 |
| 71 | gaculeatus_gene_ensembl | Stickleback genes (BROAD S1) | BROAD S1 |
| 72 | gevgoodei_gene_ensembl | Goodes thornscrub tortoise genes (rGopEvg1_v1.p) | rGopEvg1_v1.p |
| 73 | gfortis_gene_ensembl | Medium ground-finch genes (GeoFor_1.0) | GeoFor_1.0 |
| 74 | ggallus_gene_ensembl | Chicken genes (GRCg6a) | GRCg6a |
| 75 | ggorilla_gene_ensembl | Gorilla genes (gorGor4) | gorGor4 |
| 76 | gmorhua_gene_ensembl | Atlantic cod genes (gadMor3.0) | gadMor3.0 |
| 77 | hburtoni_gene_ensembl | Burton's mouthbrooder genes (AstBur1.0) | AstBur1.0 |
| 78 | hcomes_gene_ensembl | Tiger tail seahorse genes (H_comes_QL1_v1) | H_comes_QL1_v1 |
| 79 | hgfemale_gene_ensembl | Naked mole-rat female genes (HetGla_female_1.0) | HetGla_female_1.0 |


Victory! `GRCg6a` is the Red Junglefowl.

In [120]:
mart.archive <- useDataset(
    "ggallus_gene_ensembl",
    useMart(
        "ensembl",
        host="https://apr2022.archive.ensembl.org"
    )
)
gene.to.symbol.map <- getBM(
    filters="ensembl_gene_id",
    attributes=c(
        "ensembl_gene_id",
        "hgnc_symbol"
    ),
    values=rownames(sce),
    mart=mart.archive
)

In [121]:
head(gene.to.symbol.map)

| | ensembl_gene_id `<chr>` | hgnc_symbol `<chr>` |
|---|---|---|
| 1 | ENSGALG00000000003 | PANX2 |
| 2 | ENSGALG00000000011 | C10orf88 |
| 3 | ENSGALG00000000038 | |
| 4 | ENSGALG00000000044 | WFIKKN1 |
| 5 | ENSGALG00000000055 | |
| 6 | ENSGALG00000000059 | |


Double victory!

In [133]:
# Keep only the rows whose symbol starts with "MT-"
gene.to.symbol.map[
    grepl("^MT-", gene.to.symbol.map$hgnc_symbol),
]

| | ensembl_gene_id `<chr>` | hgnc_symbol `<chr>` |
|---|---|---|
| 9151 | ENSGALG00000032142 | MT-CO1 |
| 11684 | ENSGALG00000043768 | MT-ND2 |


All that for just two genes.  Oh well.  We'll analyze this dataset in a future blog post.
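
Matching `^MT-` against `hgnc_symbol` only catches genes that happen to carry a human-style symbol, which may be why so few turn up. One alternative that might recover more is to ask BioMart for everything on the mitochondrial contig and use those IDs directly. The sketch below is hedged: `fetch.mito.ids` and `mito.percent` are hypothetical helpers, the contig name `"MT"` is an assumption about the chosen dataset, and the toy counts are made up.

```r
# Hypothetical helper, not part of the original analysis. It assumes the
# mitochondrial contig is called "MT" in the chosen Ensembl dataset and
# needs network access, so it is defined here but not called.
fetch.mito.ids <- function(mart) {
    biomaRt::getBM(
        filters = "chromosome_name",
        attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name"),
        values = "MT",
        mart = mart
    )$ensembl_gene_id
}

# Per-cell mitochondrial percentage from a genes-by-cells count matrix.
mito.percent <- function(counts, mito.ids) {
    is.mito <- rownames(counts) %in% mito.ids
    100 * Matrix::colSums(counts[is.mito, , drop = FALSE]) / Matrix::colSums(counts)
}

# Toy counts (made-up numbers). The two IDs flagged as mitochondrial are
# the ones returned by the symbol search above.
toy <- matrix(
    c(5, 0, 5, 10,
      0, 0, 0, 20),
    nrow = 4,
    dimnames = list(
        c("ENSGALG00000032142", "ENSGALG00000043768",
          "ENSGALG00000000003", "ENSGALG00000000011"),
        c("cell1", "cell2")
    )
)
toy.mito <- mito.percent(toy, c("ENSGALG00000032142", "ENSGALG00000043768"))
toy.mito
```

On the real data, the call would look like `mito.percent(counts(sce), fetch.mito.ids(mart.archive))`, assuming the archive mart is reachable.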

<b style="color:#EB1960">As a learning experience, I would recommend always pinning a specific Ensembl version when querying `biomaRt`</b>, since evidently code that used to work will break when the default assembly changes.

Also, it's a bit frustrating that I can't specify the breed in `biomaRt`, because Ensembl still has the data for the other breed - it's just for some reason not accessible via `biomaRt` which only presents a single breed to query.

## But why?

I'm left with one question: why are there two chickens?  Why is the junglefowl no longer on modern BioMart?  What is their relation? (Ahhhhhhhhhhhh)

I did manage to find a [summary of some chicken breeds](https://www.ed.ac.uk/roslin/national-avian-research-facility/avian-resources/poultry-lines/white-leghorn), but it did not resolve my question.

To investigate further, I went through the history of the Ensembl versions:

* [Version 108](http://oct2022.archive.ensembl.org/Gallus_gallus/Info/Index) (current, October 2022): Default chicken is `bGalGal1.mat.broiler.GRCg7b`
  - But when we look at the [chicken breeds page](http://oct2022.archive.ensembl.org/Gallus_gallus/Info/Breeds) they list it as `bGalGal1.pat.whiteleghornlayer.GRCg7w`
* [Version 107](http://jul2022.archive.ensembl.org/Gallus_gallus/Info/Index) (July 2022): Default chicken is `bGalGal1.mat.broiler.GRCg7b`
  - Although in this version they call it the "maternal broiler"
  - And when we look at the [chicken breeds page](http://jul2022.archive.ensembl.org/Gallus_gallus/Info/Breeds) they list it as `bGalGal1.pat.whiteleghornlayer.GRCg7w`
* [Version 106](http://apr2022.archive.ensembl.org/Gallus_gallus/Info/Index) (April 2022): Default chicken is `GRCg6a`
  - And there is no chicken breeds page
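
With that release history in hand, the earlier advice about pinning a version can be sketched as a small lookup from release number to archive host. This is only an illustration: `ensembl.archives`, `archive.host`, and `pinned.chicken.mart` are hypothetical names, and the table covers just the three releases listed above.

```r
# Release-to-archive mapping copied from the releases discussed in this post;
# extend it by hand for other releases.
ensembl.archives <- c(
    "106" = "https://apr2022.archive.ensembl.org",
    "107" = "https://jul2022.archive.ensembl.org",
    "108" = "https://oct2022.archive.ensembl.org"
)

# Look up the archive host for a release, failing loudly for unknown ones.
archive.host <- function(version) {
    host <- ensembl.archives[as.character(version)]
    if (is.na(host)) {
        stop("No archive host recorded for Ensembl release ", version)
    }
    unname(host)
}

# Needs network access, so it is defined but not called here.
pinned.chicken.mart <- function(version) {
    biomaRt::useMart(
        "ensembl",
        dataset = "ggallus_gene_ensembl",
        host = archive.host(version)
    )
}

archive.host(106)
```

`biomaRt` also provides `useEnsembl()` with a `version` argument and `listEnsemblArchives()`, which may be more convenient than maintaining the table by hand.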

It seems like the chicken breeds pages don't list the default breed, and that all of the genes for `GRCg7w` have IDs beginning with 00015 ([such as here](https://www.ensembl.org/Gallus_gallus_GCA_016700215.2/Gene/Summary?db=core;g=ENSGALG00015014615;r=1:140692888-140816960;t=ENSGALT00015035601)).  This is in contrast with the 00010 beginning for `GRCg7b` and the 00000 beginning for `GRCg6a`.
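
Going by the IDs that actually appear in this post (`ENSGALG00000...` in the dataset and the April 2022 archive, `ENSGALG00010...` in the current default, `ENSGALG00015...` for `GRCg7w`), the prefix gives a quick guess at which assembly an ID belongs to. The helper below, `guess.assembly`, is hypothetical, and its table is an observation from these examples rather than an official Ensembl rule.

```r
# Prefix-to-assembly table inferred only from the IDs seen in this post;
# it is an observation, not an official Ensembl rule.
guess.assembly <- function(ids) {
    # Take the first five digits after the "ENSGALG" stem
    prefix <- substr(sub("^ENSGALG", "", ids), 1, 5)
    known <- c("00000" = "GRCg6a", "00010" = "GRCg7b", "00015" = "GRCg7w")
    # Unknown prefixes come back as NA
    unname(known[prefix])
}

guessed <- guess.assembly(c(
    "ENSGALG00000000003",
    "ENSGALG00010000002",
    "ENSGALG00015014615"
))
guessed
```

Running `table(guess.assembly(rownames(sce)))` before querying would have flagged the mismatch with the default dataset straight away.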

<b style="color:#EB1960">Thus, we can say this:</b>

* There are three chickens in play^[Despite my earlier belief in this being a two-chicken conundrum!], `GRCg6a` ("red junglefowl"), `GRCg7w` ("white leghorn"), and `GRCg7b` ("maternal broiler").
* `GRCg7b` ("maternal broiler") is the current default
* `GRCg6a` ("red junglefowl") used to be the default back in version 106

Looking up "maternal broiler", I am able to find [this page](https://www.ncbi.nlm.nih.gov/assembly/GCF_016699485.2/) talking about the `bGalGal1.mat.broiler.GRCg7b` genome.  It lists its breed as specifically being:

>Breed: Cross of Broiler mother + white leghorn layer father

This only deepens the mystery.  Why did we go from Red Junglefowl to Maternal Broiler?  How does the White Leghorn come into the picture, and is it related to why `GRCg7b` is listed as a crossbreed?

> The chicken has been the most studied bird. It serves as a model organism for basic and biomedical science, and is one of the most commonly consumed food sources for humans.
>
> This trio of samples was collected to create a high-quality reference genome for two haplotypes.
>
> <b style="color:#537FBF">The female offspring (`bGalGal1`) is a cross between a modern broiler breeder mother (`bGalGal2`) and white Leghorn father (`bGalGal3`).</b>
>
> Breeding and sample collection was conducted by Nick Anthony at the University of Arkansas, in coordination with Wesley Warren, Ron Okimoto, Hans Cheng, and Rachel Hawken.
>
> Blood from these samples were used to create high-quality reference genome assemblies of each haplotype, coordinated by Erich Jarvis, Wesley Warren, and Olivier Fedrigo, at the Rockefeller University Vertebrate Genome Lab, as part of the Vertebrate Genomes Project (VGP) and the Genome Reference Consortium (GRC).
>
> -- [Webpage](https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_016699485.2/) for `GRCg7b`

The webpage identifies `GRCg7b` as the haplotype inherited from the broiler mother (`bGalGal2`) and `GRCg7w` as the one inherited from the white Leghorn father (`bGalGal3`).

It seems `bGalGal1` does not have its own assembly, as [the list of chicken assemblies](https://www.ncbi.nlm.nih.gov/data-hub/genome/?taxon=9031) does not include it.  The most information I can get as to `bGalGal1` is [on its sample page](https://www.ncbi.nlm.nih.gov/biosample/14342210), which is significantly less detailed than [its parent's](https://www.ncbi.nlm.nih.gov/biosample/SAMN15960293/).  As an aside, it seems like a lot of new chicken assemblies were made in 2022, which may be interesting for the future.

According to [the page](https://www.ncbi.nlm.nih.gov/assembly/GCF_016700215.1/) for `bGalGal3`, the `GRCg7w` assembly has been updated to `GRCg7w_WZ`.  I believe this did not require collecting more real-world data, but rather was a re-assembly of the data already collected.

## Tracking the history of the events

![This shows the history of the assembly chosen for the chicken, procured from [here](https://www.ensembl.org/info/website/archives/assembly.html)](./images/chicken-history.png)

On the Ensembl blog, we can get a changelog of what has happened:

> <h3>New Assemblies and/or Annotation</h3>
>
> <h4>Vertebrates</h4>
>
>  * Reannotation of the reference assembly for pig (`Sscrofa11.1`)
>  * <b style="color:#C0CF96">Reannotation of chicken assembly</b> (`GRCg6a`)
>  * <b style="color:#C0CF96">Annotation of 2 new GRC chicken assemblies</b> (`GRCg7w` <b style="color:#C0CF96">and</b> `GRCg7b`)
>  * <b style="color:#C0CF96">Chicken reference will be changed from</b> `GRCg6a` <b style="color:#C0CF96">to</b> `GRCg7b`
>  * Assembly and Gene set will be updated for Tropical clawed frog (Xenopus tropicalis): Xenopus tropicalis v9.1 to UCB Xtro 10.0
>
> -- [Release notes for Ensembl 107](https://www.ensembl.info/2022/05/31/whats-coming-in-ensembl-release-107-ensembl-genomes-54/)

All in all, this has been a really confusing adventure, but I <em>think</em> I learned something...