# PROGRAMMING1 - FINAL ASSIGNMENT
## Comparing two RNA quantification methods - NanoDrop vs TapeStation
#### Jennefer Beenen - Data Science for Life Sciences - January 2023

---


## Background
Isolated RNA needs to be quantified so that the correct amount of starting material is used for RNA library preparation for sequencing. The input for library preparation can, for example, range from 5 to 200 ng, with an optimum around 100 ng.
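
To illustrate why the measured concentration matters, the volume needed to reach a target input follows from volume = mass / concentration. The sketch below assumes the 100 ng example optimum mentioned above; the function name `volume_for_input` and the two concentrations are hypothetical and not taken from the dataset.

```python
def volume_for_input(concentration_ng_per_ul: float, target_ng: float = 100.0) -> float:
    """Return the volume (µL) that contains `target_ng` of RNA."""
    if concentration_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    return target_ng / concentration_ng_per_ul


# Hypothetical concentrations for the same sample from the two devices.
for device, conc in [("NanoDrop", 50.0), ("TapeStation", 40.0)]:
    print(f"{device}: {volume_for_input(conc):.2f} µL for 100 ng")
```

If the two devices disagree on the concentration, the pipetted volume, and therefore the actual input, differs accordingly.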

The research sequencing facility (RSF) has two devices for RNA quantification, each using a different method:
- [NanoDrop 2000](https://www.thermofisher.com/order/catalog/product/ND-2000) (ThermoFisher) measures light intensity in a range of different light wavelengths. [Manual](https://assets.thermofisher.com/TFS-Assets/CAD/manuals/NanoDrop-2000-User-Manual-EN.pdf)
- [TapeStation 4200](https://www.agilent.com/en/product/automated-electrophoresis/tapestation-systems/tapestation-instruments/4200-tapestation-system-228263) (Agilent) measures fluorescently stained fragments that are separated by electrophoresis. [Manual](http://download.chem.agilent.com/software/4200_tapestation_user_info_package_v2/tapestation%20user%20information/tapestation%20manuals/agilent%204200%20tapestation%20system%20manual_g2991-90000.pdf)
  
  Agilent provides two different RNA kits for the TapeStation:
  - [High Sensitivity RNA](https://www.agilent.com/en/product/automated-electrophoresis/tapestation-systems/tapestation-rna-screentape-reagents/high-sensitivity-rna-screentape-analysis-228267), Quantitative Range of 0.5 - 10 ng/µL
  - [Standard RNA](https://www.agilent.com/en/product/automated-electrophoresis/tapestation-systems/tapestation-rna-screentape-reagents/rna-screentape-analysis-228268), Quantitative Range of 25 - 500 ng/µL
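
The quantitative ranges above explain why a dilution factor is needed: a sample has to fall inside the range of the kit that is used. A minimal sketch of that check follows; the function name `kits_in_range` and the example stock concentration are hypothetical.

```python
# Quantitative ranges (ng/µL) as listed above.
KIT_RANGES = {
    "High Sensitivity RNA": (0.5, 10.0),
    "Standard RNA": (25.0, 500.0),
}


def kits_in_range(concentration_ng_per_ul: float) -> list:
    """Return the kits whose quantitative range covers the concentration."""
    return [
        kit
        for kit, (low, high) in KIT_RANGES.items()
        if low <= concentration_ng_per_ul <= high
    ]


# Hypothetical stock sample of 800 ng/µL, undiluted and after a 10x dilution.
stock = 800.0
print(kits_in_range(stock))       # undiluted
print(kits_in_range(stock / 10))  # diluted to 80 ng/µL
```

Note that concentrations between 10 and 25 ng/µL fall outside both ranges, so such samples would also need diluting before measurement.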

### Aim of this assignment (research question)
Why do the two methods (NanoDrop and TapeStation) sometimes differ in outcome?

Variables: 260/280 ratios (NanoDrop), 260/230 ratios (NanoDrop), RNA Integrity Number (RIN) (TapeStation), and the kit used (TapeStation).

  <details>    
  <summary>
  <font size="3" color="lightblue">Ratios explained</font>
  </summary>
  260/280 ratio
  <blockquote>
  The ratio of absorbance at 260 and 280 nm is used to
  assess the purity of DNA and RNA. A ratio of ~1.8 is generally accepted as “pure” for DNA; a ratio of ~2.0 is
  generally accepted as “pure” for RNA. If the ratio is appreciably lower in either case, it may indicate the
  presence of protein, phenol or other contaminants that absorb strongly at or near 280 nm.
  </blockquote>
  260/230 ratio
  <blockquote>
  This is a secondary measure of nucleic acid purity.
  The 260/230 values for a “pure” nucleic acid are often higher than the respective 260/280 values and are
  commonly in the range of 1.8-2.2. If the ratio is appreciably lower, this may indicate the presence of
  co-purified contaminants.
  </blockquote>
  See NanoDrop manual as reference.
  </details>
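
The purity ratios could be turned into simple flags for later analysis. The sketch below is only an illustration: the manual speaks of "appreciably lower" without giving a number, so both cut-offs are assumptions, and the function name `purity_flags` is hypothetical.

```python
# Assumed cut-offs: the manual only says "appreciably lower", so these
# thresholds are illustrative choices, not values from the manual.
MIN_260_280 = 1.8  # assumption: more than 0.2 below the ~2.0 expected for RNA
MIN_260_230 = 1.8  # lower end of the commonly reported 1.8-2.2 range


def purity_flags(ratio_260_280: float, ratio_260_230: float) -> list:
    """Return a list of human-readable purity warnings (empty if none)."""
    flags = []
    if ratio_260_280 < MIN_260_280:
        flags.append("low 260/280: possible protein or phenol contamination")
    if ratio_260_230 < MIN_260_230:
        flags.append("low 260/230: possible co-purified contaminants")
    return flags


# Hypothetical ratios for a clean and a contaminated sample.
print(purity_flags(2.05, 2.10))
print(purity_flags(1.60, 1.20))
```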

>

  <details>    
  <summary>
  <font size="3" color="lightblue">RIN explained</font>
  </summary>
  <blockquote>
  RINe is calculated at a scale from 1 to 10. 
  A high RINe indicates highly intact RNA, and a low
  RINe a strongly degraded RNA sample.
  </blockquote>
   https://www.agilent.com/cs/library/technicaloverviews/public/5991-6616EN.pdf
  </details>

### Data description
The datasets can be found [here](https://unishare.nl/index.php/s/j78cSJc5gXQidLy). The link is password protected; Jennefer can provide the password.

A staff member of the RSF (Rianna Arjaans) already gathered the measurements from both methods (TODO: how did she do that?) and placed them in one Excel sheet (see 'TS_vs_ND_new.xlsx' in the 'original' folder). The sheet contains 260 entries.

Since multiple sources need to be used for this assignment, Jennefer manually split this sheet into 3 separate files:

- Nanodrop.xlsx
  - Sample, sample name.
  - dil. Factor, dilution factor used to bring the sample into the detection range of the TapeStation method used.
  - Nanodrop (ng/ul), concentration of stock sample measured by the NanoDrop.
  - Nanodrop dil. (ng/ul), concentration of diluted sample measured by the NanoDrop.
  - Cal. Conc. (ng/ul), calculated concentration of `Nanodrop dil. (ng/ul)` multiplied by the `dil. Factor`.
  - 260/280 ratio (~2.0), ratio of absorbance at 260 nm and 280 nm.
  - 260/230 ratio (2.0-2.2), ratio of absorbance at 260 nm and 230 nm.
>

- TapeStation.xlsx
  - Sample, sample name.
  - dil. Factor, dilution factor used to bring the sample into the detection range of the method used.
  - TS (ng/ul), concentration measured by the TapeStation.
  - Cal. Conc (ng/ul), calculated concentration of `TS (ng/ul)` multiplied by the `dil. Factor`.
  - RIN, RNA Integrity Number.
  - TS type, which TapeStation RNA kit was used; HS is High Sensitivity, SS is Standard Sensitivity.
>

- meta.xlsx 
  - Sample, sample name.
  - dil. Factor, dilution factor used to bring the sample into the detection range of the method used.
  - nM library, nanomolar (nM) concentration of the library produced by library preparation.
  - `unnamed column`, comments.
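
Because `Cal. Conc. (ng/ul)` is defined as the diluted measurement multiplied by `dil. Factor`, the stored values can be checked against a recomputation. The sketch below uses a few made-up rows with the column names described above; the `calc_mismatch` column is a hypothetical helper.

```python
import pandas as pd

# Hypothetical rows using the NanoDrop column names described above.
nanodrop = pd.DataFrame(
    {
        "Sample": ["S1", "S2", "S3"],
        "dil. Factor": [10, 5, 1],
        "Nanodrop dil. (ng/ul)": [42.0, 18.5, 96.0],
        # The last stored value is deliberately wrong for the demonstration.
        "Cal. Conc. (ng/ul)": [420.0, 92.5, 90.0],
    }
)

# Recompute the stock concentration from the diluted measurement.
recomputed = nanodrop["Nanodrop dil. (ng/ul)"] * nanodrop["dil. Factor"]

# Flag rows where the stored value deviates from the recomputed one.
nanodrop["calc_mismatch"] = (recomputed - nanodrop["Cal. Conc. (ng/ul)"]).abs() > 1e-6
print(nanodrop[["Sample", "calc_mismatch"]])
```

The same check applies to TapeStation.xlsx with `TS (ng/ul)` and `Cal. Conc (ng/ul)`.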


### Assessment criteria

Conditional
- No data and/or API key information is stored in the repository.
- No hard-coded data paths are used; data paths are provided in a config file.
- At least two data sets are merged into one tidy dataframe.

Graded
- ~~(5 pt) The research question is stated.~~
- ~~(5 pt) Links to sources are provided and a small description about the data~~
- (20 pt) Data quality and data quantity are inspected and reported. Appropriate transformations are applied.
- (20 pt) **Assumptions and presuppositions are made explicit** (chosen data storage method, chosen analysis method, chosen design). An argumentative approach is used to explain the steps, taking data quality and quantity into account. Explanation is provided either as comments in the code or in a separate document.
- (10 pt) Interactive visualization is extracted from correct analysis of (incomplete) data.
- (10 pt) The design supports the research question. The data is informative in relation to the topic. Visualization is functional and attractive. **Figures contain X and Y labels, title and captions**.
- (20 pt) Code is efficiently written, follows the coding style, is free of code smells, and is easy to read. Code is demonstrably robust and flexible.
- (10 pt) All code is stored in the repository, with a **Readme including the most relevant information to implement the code.** Used software is suitably licensed and documented.

- **Comment what you can expect, and comment the results**
- **If you use code snippets from others you should refer to the original author, otherwise you will be accused of plagiarism. Please be prepared to explain your code in a verbal exam.**

### Instructions:

- Define a research question, select data and code your data acquisition, data processing, data analysis and visualization. 
- Use a repository with a commit strategy and write a readme file. 
- Make sure that you document your choices. 

### Data selection

In [3]:
# Import libraries
import yaml
import os
import pandas as pd
import numpy as np

# Get folder_path from config.yaml
with open('config.yaml', 'r') as stream:
    config = yaml.safe_load(stream)

folder_path = config['filepath_final']

# Get all file_names in folder_path
file_names = set(os.listdir(folder_path))
file_names


{'Nanodrop.xlsx', 'TapeStation.xlsx', 'meta.xlsx'}
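
One possible next step is to merge the files on the shared `Sample` column. The sketch below uses small in-memory frames as stand-ins for the Excel files, so the values are made up; the renamed columns `conc_nanodrop` and `conc_tapestation` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Stand-ins for Nanodrop.xlsx and TapeStation.xlsx; in the notebook these
# would come from pd.read_excel(os.path.join(folder_path, file_name)).
nanodrop = pd.DataFrame(
    {"Sample": ["S1", "S2", "S3"], "Cal. Conc. (ng/ul)": [420.0, 92.5, 96.0]}
)
tapestation = pd.DataFrame(
    {
        "Sample": ["S1", "S2", "S4"],
        "Cal. Conc (ng/ul)": [390.0, 101.0, 55.0],
        "RIN": [8.9, 7.2, 9.1],
        "TS type": ["SS", "SS", "HS"],
    }
)

# Give the two concentration columns unambiguous (hypothetical) names.
nanodrop = nanodrop.rename(columns={"Cal. Conc. (ng/ul)": "conc_nanodrop"})
tapestation = tapestation.rename(columns={"Cal. Conc (ng/ul)": "conc_tapestation"})

# An outer merge keeps unmatched samples; `indicator` records their origin.
merged = nanodrop.merge(tapestation, on="Sample", how="outer", indicator=True)
print(merged["_merge"].value_counts())

# Keep only samples measured by both devices for the comparison.
paired = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(paired)
```

Inspecting the `_merge` column before filtering shows how many samples are lost because they appear in only one file, which is worth reporting as part of the data quality check.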

### Data wrangling

In [None]:
# dtype
# quality inspected and reported.
# missing values
# outliers (PLOT?)

### Data exploration

In [None]:
# quantity inspected and reported.
# distribution (PLOT)
# test significance
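
Since every sample is measured by both devices, a paired test is a natural fit for the significance step. The sketch below uses an exact sign-flip permutation test on the paired differences, chosen here because it needs no distributional assumptions and only the standard library; it is one option, not necessarily the test the final analysis will use. All values are made up.

```python
from itertools import product
from statistics import mean

# Hypothetical paired concentrations (ng/µL) for the same six samples.
nanodrop = [105.0, 98.0, 120.0, 87.0, 110.0, 101.0]
tapestation = [96.0, 91.0, 108.0, 85.0, 99.0, 97.0]

diffs = [nd - ts for nd, ts in zip(nanodrop, tapestation)]
observed = mean(diffs)

# Under the null hypothesis of no systematic difference, the sign of each
# paired difference is arbitrary, so enumerate all 2**n sign assignments.
count_extreme = 0
total = 0
for signs in product((1, -1), repeat=len(diffs)):
    permuted = mean([s * d for s, d in zip(signs, diffs)])
    total += 1
    if abs(permuted) >= abs(observed) - 1e-12:
        count_extreme += 1

p_value = count_extreme / total  # two-sided exact p-value
print(f"mean difference = {observed:.2f} ng/µL, p = {p_value:.4f}")
```

Full enumeration is only feasible for small samples; for the 260 entries in the real sheet, a random subset of sign assignments or a library routine would be needed instead.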


### Data visualization

In [None]:
# interactive visualisation