# Problem Set 16: Pandas

Author: Paul Magwene

- - -

## Instructions

Create a markdown document within JupyterLab and answer the questions below using code blocks that generate the correct outputs. We encourage you to include explanatory text in your markdown document. 

Write "robust" solutions wherever possible. A good rule of thumb for judging whether your solution is appropriately "robust" is to ask yourself "If I added additional observations or variables to this data set, or if the order of variables changed, would my code still compute the right solution?"
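
As a hedged illustration of this rule of thumb, the sketch below contrasts position-based and name-based column selection on a small made-up table. The column names `Name`, `Start`, and `Stop` and all values are hypothetical and are not taken from the assignment data. Selecting by name keeps working when columns are reordered or added; selecting by position silently returns something else.

```python
import pandas as pd

# Hypothetical toy table; names and values are for illustration only.
toy = pd.DataFrame({
    "Name": ["a", "b", "c"],
    "Start": [10, 40, 95],
    "Stop": [30, 80, 99],
})

# Same data with the columns reordered and an extra column added.
reordered = toy[["Stop", "Name", "Start"]].assign(Extra=[0, 0, 0])

# Fragile: depends on "Start" being the second column.
fragile_before = toy.iloc[:, 1].tolist()
fragile_after = reordered.iloc[:, 1].tolist()

# Robust: depends only on the column's name.
robust_before = toy["Start"].tolist()
robust_after = reordered["Start"].tolist()

print(fragile_before == fragile_after)  # False
print(robust_before == robust_after)    # True
```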

Make sure your markdown is nicely formatted -- use headers, bullets, numbering, etc., so that the structure of the document is clear.

When completed, title your Jupyter notebook file as follows (replace `XX` with the assignment number, e.g. `01`, `02`, etc):

-   `netid-assignment_XX-Spring2024.ipynb`

Submit both your markdown file and the generated HTML document via the Assignments submission section on Sakai.

## Data

## Working with a table of features from the Saccharomyces Genome Database (SGD)

The file [`SGD_features.tsv`](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/SGD_features.tsv) is a tab-delimited file I downloaded from SGD that summarizes key pieces of information about genome features in the budding yeast genome.  The original file can be found here: http://sgd-archive.yeastgenome.org/curation/chromosomal_feature/

Here's a short summary of the contents of this file, from the "SGD_features.README" document:

```
1. Information on current chromosomal features in SGD, including Dubious ORFs. 
Also contains coordinates of intron, exons, and other subfeatures that are located within a chromosomal feature.

2. The relationship between subfeatures and the feature in which they
are located is identified by the feature name in column #7 (parent
feature). For example, the parent feature of the intron found in
ACT1/YFL039C will be YFL039C. The parent feature of YFL039C is
chromosome 6.

3. The coordinates of all features are in chromosomal coordinates.

Columns within SGD_features.tab:

1.   Primary SGDID (mandatory)
2.   Feature type (mandatory)
3.   Feature qualifier (optional)
4.   Feature name (optional)
5.   Standard gene name (optional)
6.   Alias (optional, multiples separated by |)
7.   Parent feature name (optional)
8.   Secondary SGDID (optional, multiples separated by |)
9.   Chromosome (optional)
10.  Start_coordinate (optional)
11.  Stop_coordinate (optional)
12.  Strand (optional)
13.  Genetic position (optional)
14.  Coordinate version (optional)
15.  Sequence version (optional)
16.  Description (optional)

Note that "chromosome 17" is the mitochondrial chromosome.
```


Download [`SGD_features.tsv`](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/SGD_features.tsv) to your computer and then load it as a DataFrame using the Pandas `read_table` function, specifying the delimiter argument as a tab:

```python
import pandas as pd

# Load the tab-delimited SGD feature table into a DataFrame
features = pd.read_table("SGD_features.tsv", sep="\t")
```
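
If you'd like to see how `read_table` handles tab-delimited text before working with the full file, here is a minimal sketch that parses a tiny in-memory table. The column names (`ID`, `Kind`, `Start`, `Stop`) and the rows are made up for illustration and are not the columns of `SGD_features.tsv`. It also shows that an empty field is read as a missing value (`NaN`), which is worth keeping in mind for columns the README marks as optional.

```python
import io
import pandas as pd

# Made-up tab-delimited text; the second row has an empty "Stop" field.
text = "ID\tKind\tStart\tStop\nx1\tgene\t5\t20\nx2\tgene\t7\t\n"

# read_table defaults to sep="\t"; it is spelled out here for clarity.
toy = pd.read_table(io.StringIO(text), sep="\t")

print(toy.shape)                    # (2, 4)
print(toy["Stop"].isna().tolist())  # [False, True]
```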

## Problems

1. What are the dimensions of this data set?

2. What are the names of the variables (columns) in this data set?

3. What are the data types of the columns in this data set?
   

4. How many genome features are recognized in the yeast genome?

5. What are the different feature types in the yeast genome?

6. Add a new column to your data frame called "Length" which represents the length, in nucleotides, of each genomic feature in the dataset.  In this data set the start and stop coordinates of each feature are inclusive, so take that into consideration in your calculation.

7. Create a histogram showing the distribution of lengths of the genome features.

8. How many of the genome features are annotated as "ORFs" (open reading frames) (see the feature type column of the data set)?

9. Create a new DataFrame that includes only the ORF features.

10. Sort the ORFs by length, from largest to smallest. What are the 5 largest ORFs in the yeast genome?

11. Creating histograms of ORF lengths

    a. Create a histogram showing the distribution of ORF lengths.
    
    b. Create another histogram showing the distribution of Log10(ORF Lengths).

    For both histograms, use an appropriate number of bins.  Do you find one of these histograms more useful for exploring trends or patterns in the data?  Comment on any patterns you find interesting.

12. How many of the ORFs are designated as "Dubious"? How many are "Verified"?

13.  Create overlapping histograms comparing the distribution of lengths of dubious ORFs and verified ORFs.

14. "Dubious" and "Verified" are not the full set of ORF qualifiers. What are the other Qualifier values?
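
A note on the inclusive coordinates mentioned in problem 6: when both endpoints are counted, a span covers one more position than the plain difference of its endpoints. The sketch below checks this on made-up numbers by counting positions explicitly. The function name `inclusive_length` is hypothetical, and taking the absolute difference (so the result does not depend on which endpoint is larger) is an assumption about how coordinates might be ordered, not something stated above.

```python
def inclusive_length(start, stop):
    """Number of positions from start to stop, counting both endpoints."""
    return abs(stop - start) + 1

# Positions 3, 4, 5, 6, 7 -> five positions, not 7 - 3 = 4.
positions = list(range(3, 7 + 1))
print(len(positions))          # 5
print(inclusive_length(3, 7))  # 5
print(inclusive_length(7, 3))  # 5 (endpoint order does not matter)
```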