# Session 03
Today we will analyse expression data from melanoma samples.
We will import data from a CSV file using pandas (this will be covered in much more detail later)
and build a dictionary to collect information on gene names, gene lengths and sequencing counts for 10 patient samples.
You first need to install the pandas library in your environment. Use this command in the terminal after activating the environment:
conda install pandas

In [3]:
# Import the pandas library and use it to read the local CSV file with the data. df.head() shows the first 5 rows.
import pandas as pd
df = pd.read_csv('Melanoma_ExpressionData.csv', index_col=0)
df.head()

Unnamed: 0,gene_name,gene_length,patient_1,patient_2,patient_3,patient_4,patient_5,patient_6,patient_7,patient_8,patient_9,patient_10
0,A1BG,3931,1272.36,452.96,288.06,400.11,420.46,877.59,402.77,559.2,269.59,586.66
1,A1CF,2409,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,A2BP1,5897,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,A2LD1,2825,164.38,552.43,201.83,165.12,95.75,636.63,241.56,30.82,105.44,239.19
4,A2ML1,6303,27.0,0.0,0.0,0.0,8.0,0.0,1.0,763.0,0.0,0.0


In [4]:
# Some more info on the data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20500 entries, 0 to 20499
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   gene_name    20500 non-null  object 
 1   gene_length  20500 non-null  int64  
 2   patient_1    20500 non-null  float64
 3   patient_2    20500 non-null  float64
 4   patient_3    20500 non-null  float64
 5   patient_4    20500 non-null  float64
 6   patient_5    20500 non-null  float64
 7   patient_6    20500 non-null  float64
 8   patient_7    20500 non-null  float64
 9   patient_8    20500 non-null  float64
 10  patient_9    20500 non-null  float64
 11  patient_10   20500 non-null  float64
dtypes: float64(10), int64(1), object(1)
memory usage: 2.0+ MB


In [5]:
# Convert the DataFrame into a dictionary: each column name maps to a list of that column's values
samples = df.to_dict(orient='list')
print(samples.keys(), type(samples['gene_name']), len(samples['gene_name']))


dict_keys(['gene_name', 'gene_length', 'patient_1', 'patient_2', 'patient_3', 'patient_4', 'patient_5', 'patient_6', 'patient_7', 'patient_8', 'patient_9', 'patient_10']) <class 'list'> 20500
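Because `orient='list'` stores every column as a list, the values belonging to one gene sit at the same position in each list. The sketch below illustrates that lookup on a small hand-made dictionary. The name `toy_samples` is hypothetical, and its values are copied from the first rows shown above rather than loaded from the CSV file.

```python
# Toy stand-in for `samples`: same layout, first three genes, two patients only.
toy_samples = {
    'gene_name': ['A1BG', 'A1CF', 'A2BP1'],
    'gene_length': [3931, 2409, 5897],
    'patient_1': [1272.36, 0.0, 0.0],
    'patient_2': [452.96, 0.0, 0.0],
}

# Position i refers to the same gene in every list.
i = toy_samples['gene_name'].index('A1BG')
print(i, toy_samples['gene_length'][i], toy_samples['patient_1'][i])
```

The same indexing should work on the full `samples` dictionary, since all twelve lists have 20500 entries in the same gene order.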


The aim of today's session is to find the gene with the highest expression in each of these samples. To do this, we first have to normalise the data.

### Normalizing Over Samples and Genes: RPKM

One of the simplest normalization methods for RNA-seq data is RPKM: reads per
kilobase of transcript per million mapped reads.
RPKM puts together the ideas of normalizing by sample and by gene.
When we calculate RPKM, we are normalizing for both the library size (the sum of each column)
and the gene length.

To work through how RPKM is derived, let's define the following values:

- $C$ = Number of reads mapped to a gene
- $L$ = Exon length in base-pairs for a gene
- $N$ = Total mapped reads in the experiment

First, let's calculate reads per kilobase.

Reads per base would be:
$\frac{C}{L}$

The formula asks for reads per kilobase instead of reads per base.
One kilobase = 1,000 bases, so we'll need to divide length (L) by 1,000.

Reads per kilobase would be:

$\frac{C}{L/1000}  = \frac{10^3C}{L}$

Next, we need to normalize by library size.
If we just divide by the number of mapped reads we get:

$\frac{10^3C}{LN}$

But biologists like thinking in millions of reads so that the numbers don't get
too small. Counting per million reads we get:

$\frac{10^3C}{L(N/10^6)} = \frac{10^9C}{LN}$


In summary, to calculate reads per kilobase transcript per million reads:
$RPKM = \frac{10^9C}{LN}$
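As a quick sanity check of the derivation, the sketch below computes RPKM both step by step and with the closed-form expression, and the two routes should agree. The function name `rpkm` and the input numbers are made up for easy arithmetic. They are not taken from the melanoma data.

```python
def rpkm(count, length, total_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * count / (length * total_reads)

# Made-up numbers chosen for easy arithmetic (not from the melanoma data).
C = 500          # reads mapped to the gene
L = 2_000        # gene length in base pairs
N = 10_000_000   # total mapped reads in the sample

# Stepwise route: first per kilobase, then per million reads.
reads_per_kilobase = C / (L / 1_000)
stepwise = reads_per_kilobase / (N / 1_000_000)

print(stepwise, rpkm(C, L, N))
```

Here the gene is 2 kilobases long and the library holds 10 million reads, so 500 reads become 250 reads per kilobase and then 25 per million.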

Now let's apply RPKM to our data, starting with a single value: the first gene (A1BG) in patient_1.



In [None]:
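# RPKM for a single value: the first gene (index 0) in patient_1
C = samples['patient_1'][0]      # reads mapped to the gene
N = sum(samples['patient_1'])    # library size: total counts for this sample
L = samples['gene_length'][0]    # gene length in base pairs
normed = 1e9 * C / (N * L)
print(f" original value = {C}, normed = {normed}")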


 original value = 1272.36, normed = 7.79183928027979
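The single-value calculation can be extended to every gene in every sample, which is what the session's aim requires. The following is one possible sketch, not a definitive implementation. The helper names `rpkm_normalise` and `top_gene` and the `toy` dictionary are hypothetical. It assumes the dictionary layout used above (`gene_name`, `gene_length`, and columns whose names start with `patient_`) and that no sample has a library size of zero.

```python
def rpkm_normalise(samples):
    """Return a new dict with every patient column converted to RPKM."""
    lengths = samples['gene_length']
    normed = {'gene_name': samples['gene_name'], 'gene_length': lengths}
    for key, counts in samples.items():
        if not key.startswith('patient_'):
            continue  # skip the gene_name and gene_length columns
        library_size = sum(counts)  # N: total counts in this sample
        normed[key] = [
            1e9 * count / (library_size * length)
            for count, length in zip(counts, lengths)
        ]
    return normed


def top_gene(normed, patient):
    """Name of the gene with the highest normalised value in one sample."""
    values = normed[patient]
    best = max(range(len(values)), key=values.__getitem__)
    return normed['gene_name'][best]


# Made-up toy data with the same layout as `samples`.
toy = {
    'gene_name': ['geneA', 'geneB', 'geneC'],
    'gene_length': [1000, 4000, 2000],
    'patient_1': [100.0, 300.0, 100.0],
    'patient_2': [10.0, 10.0, 80.0],
}

toy_normed = rpkm_normalise(toy)
for patient in ('patient_1', 'patient_2'):
    print(patient, top_gene(toy_normed, patient))
```

In the toy data, geneB has the highest raw count in patient_1, but geneA comes out on top after normalisation because it is four times shorter. Under the stated assumptions, passing the real `samples` dictionary to `rpkm_normalise` should work the same way.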
