# Summary Statistics
In these exercises you'll use a real-life medical dataset to learn how to obtain basic statistics from data. This dataset comes from [Gluegrant](https://www.gluegrant.org/), an American project that aims to find which genes are most important for the recovery of severely injured patients! It was slightly edited to remove some complexities, but if you wish to check it out in its full glory, it's available on the website and I can show it to you!

Have fun being a biostatistician for 1 hour! :)

## Objectives

In this exercise, the objective is for you to learn how to use Pandas functions to obtain simple statistics of datasets.

## Dataset information

The dataset is a medical dataset with 184 patients, distributed into 2 test groups, where each group is divided in 2: patients and control.
The dataset is composed of clinical values:
* Patient_id
* Age
* Sex
* Group (to what group they belong)
* Result (outcome of the patient)

and of the gene expression (higher = more expressed):
* Gene1: [MMP9](https://www.ncbi.nlm.nih.gov/gene/4318)
* Gene2: [S100A12](https://www.ncbi.nlm.nih.gov/gene/6283)
* Gene3: [MCEMP1](https://www.ncbi.nlm.nih.gov/gene/199675)
* Gene4: [ACSL1](https://www.ncbi.nlm.nih.gov/gene/2180)
* Gene5: [SLC7A2](https://www.ncbi.nlm.nih.gov/gene/6542)
* Gene6: [CDC14B](https://www.ncbi.nlm.nih.gov/gene/8555)

**Don't worry if you are not from a biological background**, consider that these genes are simply numeric values related to the patient. We will not delve into the biological meaning of any of the genes; we'll only try to find out whether there are differences between the gene values for the different groups!

Ok, introductions aside, **please have fun being a biostatistician for 45 minutes! :P** If you have any questions, please call me or any of the other professors!

### Import Data

In [2]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML

CSS = """
.output {
    flex-direction: row;
}
"""

patient_data = pd.read_csv("../data/Exercises_Summary_Statistics_Data.csv")

patient_data.head()

Unnamed: 0,Patient_id,Age,Sex,Result,Group,Gene1,Gene2,Gene3,Gene4,Gene5,Gene6
0,1,20,male,control,Control,950.444,5609.021,530.861,5561.673,38.539,32.496
1,2,34,female,control,Control,728.066,3337.738,271.314,4444.388,37.117,30.645
2,3,40,female,control,Control,1208.076,4430.424,520.859,4378.873,41.698,29.476
3,4,31,male,control,Control,3426.842,6524.846,842.426,6802.14,36.682,32.125
4,5,21,female,control,Control,3781.265,7916.231,574.768,4937.56,34.877,27.416


# Exercise 1 - Let's get a quick look at the groups

Ok, first let's get a quick look at who is in each of the groups. In medical studies it's important that the control and patient groups aren't too different from each other, so that we can draw relevant conclusions.

### Separate the patients and control into 2 dataframes
Since we are going to compute multiple statistics on the patient and control groups, we should create a variable for each one of the groups, so that we keep our code readable!

_Remember:_ If you want to subset, the command is: Name_of_dataframe[Name_of_dataframe.column == "Value"]

In [16]:
patients = patient_data[patient_data.Group == "Patient"]
control = patient_data[patient_data.Group == "Control"]

### Find out the Age means for each of the groups
_Remember_: To find the mean of a dataframe column, just use Name_of_dataframe.column.mean()

In [148]:
patient_mean = patients.Age.mean()
control_mean = control.Age.mean()

print("The patient mean age is:", patient_mean, "and the control mean age is:", control_mean, "\t")

The patient mean age is: 33.64556962025316 and the control mean age is: 29.884615384615383 	


### Find the Median of each group
As seen in the presentation, the mean can be affected by outliers in the data. Let's check that out with the median.

_Remember_: To find the median of a dataframe column, just use Name_of_dataframe.column.median()

In [147]:
patient_median = patients.Age.median()
control_median = control.Age.median()

print("The patient median age is:", patient_median, "and the control median age is:", control_median, "\t")

The patient median age is: 33.0 and the control median age is: 28.0 	
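
To see why the median is less sensitive to outliers than the mean, here is a small sketch. The ages below are made up for illustration (they are not taken from the exercise dataset), and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical ages, NOT from the exercise dataset
ages = pd.Series([25, 27, 29, 31, 33])
# Same ages, but the last value is replaced by an extreme one
ages_with_outlier = pd.Series([25, 27, 29, 31, 95])

# The mean moves a lot when a single extreme value appears...
print("Mean without outlier:", ages.mean(), "with outlier:", ages_with_outlier.mean())
# ...while the median (the middle value) stays where it was
print("Median without outlier:", ages.median(), "with outlier:", ages_with_outlier.median())
```

A large gap between the mean and the median of a column is therefore a hint that a few extreme values may be pulling the mean.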


### Results - Mean / Median
Is there a significant difference between the mean and the median?

(_Optional_): Is there a significant difference between the age of the Patients and Control? Consider that this dataset is composed mainly of people injured using power tools or other types of machinery; therefore, it's composed mainly of people of working age, 20-ish to 60-ish.

### Find the Standard deviation of each group
Let's see if there is a large deviation from the mean in each of the groups.

_Remember:_ The standard deviation is taken as Name_of_dataframe.column.std()

In [145]:
patient_std = patients.Age.std()
control_std = control.Age.std()

print("The patient std is:", patient_std, "and the control std is:", control_std, "\t")

The patient std is: 11.166987259352279 and the control std is: 10.195398660481787 	
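
One detail worth knowing when comparing results across tools: pandas `.std()` computes the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below uses made-up numbers (not from the exercise dataset) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical values, NOT from the exercise dataset
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()              # pandas default: divides by n - 1 (ddof=1)
population_std = values.std(ddof=0)    # divides by n
numpy_std = np.std(values.to_numpy())  # NumPy default: divides by n (ddof=0)

print("Sample std (pandas default):", sample_std)
print("Population std (ddof=0):", population_std)
print("NumPy default std:", numpy_std)
```

For small groups, such as the 26 controls here, the two versions can differ noticeably, so it helps to state which one is being reported.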


### Find the quantiles
Let's use the quantiles to obtain the dispersion of the groups.

_Remember:_ The quantiles are obtained using the command Name_of_dataframe.column.quantile(q=[fractions]), where each fraction is between 0 and 1 (for example, 0.25 for the 25% quantile)

In [132]:
patient_quantiles = patients.Age.quantile(q=[0, 0.25, 0.5, 0.75, 1])
control_quantiles = control.Age.quantile(q=[0, 0.25, 0.5, 0.75, 1])

print("Patients:\f")
display(pd.DataFrame(patient_quantiles))
print("Control:\f")
display(pd.DataFrame(control_quantiles))
HTML('<style>{}</style>'.format(CSS))

Patients:


Unnamed: 0,Age
0.0,16.0
0.25,24.0
0.5,33.0
0.75,43.0
1.0,55.0


Control:


Unnamed: 0,Age
0.0,17.0
0.25,21.5
0.5,28.0
0.75,34.0
1.0,54.0
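
One way to turn these quantiles into single dispersion numbers is the range (max minus min) and the interquartile range (75% quantile minus 25% quantile). The sketch below rebuilds the two quantile tables shown above as Series, so it runs without the CSV file; the names `patient_iqr`, `control_iqr`, `patient_range` and `control_range` are introduced here for illustration.

```python
import pandas as pd

# Quantiles copied from the output above
fractions = [0, 0.25, 0.5, 0.75, 1]
patient_quantiles = pd.Series([16.0, 24.0, 33.0, 43.0, 55.0], index=fractions)
control_quantiles = pd.Series([17.0, 21.5, 28.0, 34.0, 54.0], index=fractions)

# Interquartile range: spread of the middle 50% of the ages
patient_iqr = patient_quantiles.loc[0.75] - patient_quantiles.loc[0.25]
control_iqr = control_quantiles.loc[0.75] - control_quantiles.loc[0.25]

# Range: spread between the youngest and the oldest person
patient_range = patient_quantiles.loc[1] - patient_quantiles.loc[0]
control_range = control_quantiles.loc[1] - control_quantiles.loc[0]

print("Patients: IQR =", patient_iqr, "Range =", patient_range)
print("Control:  IQR =", control_iqr, "Range =", control_range)
```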


### Results - Interval Statistics

(_Optional_): Do the dispersion statistics show a significant difference in the dispersion of the data?

### Find out how many patients are male and how many are female

Next, let's try to find out the number of each of the sexes and the percentage of males in each of the groups.

_Remember:_ To get a frequency table, use Name_of_dataframe.column.value_counts(). To get the number of a certain group do Name_of_dataframe.column.value_counts()["name_of_group"]

#### Number of male patients

In [142]:
num_male_patients = patients.Sex.value_counts()["male"]
num_female_patients = patients.Sex.value_counts()["female"]

print("The number of male patients is:", num_male_patients, \
      "\nThe number of female patients is:", num_female_patients, \
      "\nAnd the percentage of males is:", num_male_patients / (num_male_patients + num_female_patients), "\t")

The number of male patients is: 98 
The number of female patients is: 60 
And the percentage of males is: 0.620253164557 	


#### Number of male *control* patients

In [144]:
num_male_control = control.Sex.value_counts()["male"]
num_female_control = control.Sex.value_counts()["female"]

print("The number of male control patients is:", num_male_control, \
      "\nThe number of female control patients is:", num_female_control, \
      "\nAnd the percentage of males is:", num_male_control / (num_male_control + num_female_control), "\t")

The number of male control patients is: 17 
The number of female control patients is: 9 
And the percentage of males is: 0.653846153846 	
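
The fractions above are computed by hand from the counts. As an alternative, `value_counts(normalize=True)` returns the proportions directly. A minimal sketch on a made-up column (not from the exercise dataset):

```python
import pandas as pd

# Hypothetical Sex column, NOT from the exercise dataset
sex = pd.Series(["male", "male", "female", "male", "female"])

counts = sex.value_counts()                     # absolute frequencies
proportions = sex.value_counts(normalize=True)  # fractions that sum to 1

print("Number of males:", counts["male"])
print("Fraction of males:", proportions["male"])
print("Percentage of males:", proportions["male"] * 100)
```

Note that the value is a fraction between 0 and 1; multiply by 100 if you want an actual percentage.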


### Results - Percentage of the sexes
(_Optional_): Is there a significant difference between the percentage of male patients and male control patients?

# Exercise 2 - Let the Biostatistics begin

I have selected 6 genes from a total of ~55000. The objective here is for you to try to find genes that differ between the patient group and the control group, using the tools that you learned in Exercise 1.

### Gene 1

In [175]:
gene1_patients = patients.Gene1
gene1_control = control.Gene1

# Mean
mean_gene1_patients = gene1_patients.mean()
mean_gene1_control = gene1_control.mean()

# Median
median_gene1_patients = gene1_patients.median()
median_gene1_control = gene1_control.median()

# Std
std_gene1_patients = gene1_patients.std()
std_gene1_control = gene1_control.std()

print("Patients: Mean =", mean_gene1_patients, "Median =", median_gene1_patients, "Std =", std_gene1_patients, "\t")
print("Control:  Mean =", mean_gene1_control, "Median =", median_gene1_control, "Std =", std_gene1_control, "\t")

Patients: Mean = 13676.395265822784 Median = 13555.230500000001 Std = 3092.69814423062 	
Control:  Mean = 1156.0739230769232 Median = 947.1095 Std = 854.9755207222215 	


### Gene 2

In [176]:
gene2_patients = patients.Gene2
gene2_control = control.Gene2

# Mean
mean_gene2_patients = gene2_patients.mean()
mean_gene2_control = gene2_control.mean()

# Median
median_gene2_patients = gene2_patients.median()
median_gene2_control = gene2_control.median()

# Std
std_gene2_patients = gene2_patients.std()
std_gene2_control = gene2_control.std()

print("Patients: Mean =", mean_gene2_patients, "Median =", median_gene2_patients, "Std =", std_gene2_patients, "\t")
print("Control:  Mean =", mean_gene2_control, "Median =", median_gene2_control, "Std =", std_gene2_control, "\t")

Patients: Mean = 16955.432499999995 Median = 17023.491 Std = 2743.730488293748 	
Control:  Mean = 3439.741 Median = 3067.3015 Std = 1549.3355492407961 	


I will just **ask for one more gene**, since the process is entirely the same!

### Gene 6

In [174]:
gene6_patients = patients.Gene6
gene6_control = control.Gene6

# Mean
mean_gene6_patients = gene6_patients.mean()
mean_gene6_control = gene6_control.mean()

# Median
median_gene6_patients = gene6_patients.median()
median_gene6_control = gene6_control.median()

# Std
std_gene6_patients = gene6_patients.std()
std_gene6_control = gene6_control.std()

print("Patients: Mean =", mean_gene6_patients, "Median =", median_gene6_patients, "Std =", std_gene6_patients, "\t")
print("Control:  Mean =", mean_gene6_control, "Median =", median_gene6_control, "Std =", std_gene6_control, "\t")

Patients: Mean = 30.24018987341772 Median = 29.8615 Std = 4.903053008532824 	
Control:  Mean = 30.018538461538462 Median = 29.8885 Std = 3.6831340701176676 	


### Results - Genes 1, 2 and 6

Of the 3 genes, which ones do you believe are involved in the process of recovery?

_Help:_ Recall that we have 2 groups: a group of patients who are recovering from a severe accident and a control group of people who are fine. You should look at the statistics for the 3 genes (mean, median and standard deviation [this last one is skippable]) and try to find differences!

## Can we do this without so much code?

Can we obtain the previous statistics for the 6 genes without all the effort?

In [190]:
gene_names = ["Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6"]

display(patients[gene_names].describe(percentiles = []))
display(control[gene_names].describe(percentiles = []))

Unnamed: 0,Gene1,Gene2,Gene3,Gene4,Gene5,Gene6
count,158.0,158.0,158.0,158.0,158.0,158.0
mean,13676.395266,16955.4325,8545.115209,10893.511563,40.589082,30.24019
std,3092.698144,2743.730488,2468.762672,1215.103278,4.982023,4.903053
min,4216.792,9000.672,2076.031,5794.872,29.299,20.265
50%,13555.2305,17023.491,8901.6505,11141.3685,40.2865,29.8615
max,21642.619,23432.793,13809.735,13716.356,54.99,56.933


Unnamed: 0,Gene1,Gene2,Gene3,Gene4,Gene5,Gene6
count,26.0,26.0,26.0,26.0,26.0,26.0
mean,1156.073923,3439.741,469.361115,4625.444269,40.434962,30.018538
std,854.975521,1549.335549,146.219868,785.49797,4.406569,3.683134
min,308.67,1729.328,253.831,2943.397,34.29,23.87
50%,947.1095,3067.3015,428.8515,4576.891,39.9575,29.8885
max,3781.265,7916.231,842.426,6802.14,50.098,38.25
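
Another option is to let pandas do the split by group for us, with `groupby` followed by `agg`. The sketch below uses a tiny made-up table (`toy`, with hypothetical values) that only mimics the Group and gene columns of the exercise data.

```python
import pandas as pd

# Hypothetical miniature table, NOT the exercise dataset
toy = pd.DataFrame({
    "Group": ["Patient", "Patient", "Patient", "Control", "Control", "Control"],
    "Gene1": [12000.0, 14000.0, 16000.0, 800.0, 1000.0, 1200.0],
    "Gene6": [29.0, 30.0, 31.0, 29.5, 30.0, 30.5],
})

# One row per group, one (gene, statistic) column pair per requested statistic
summary = toy.groupby("Group")[["Gene1", "Gene6"]].agg(["mean", "median", "std"])
print(summary)

# Individual values are addressed by group and (gene, statistic)
print(summary.loc["Patient", ("Gene1", "mean")])
```

On the real data, the same idea would presumably read `patient_data.groupby("Group")[gene_names].agg(["mean", "median", "std"])`, which avoids creating a separate dataframe per group.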


What if we want a measure of difference for each gene?

In [203]:
display(patients[gene_names].mean() / control[gene_names].mean())



Gene1    11.830035
Gene2     4.929276
Gene3    18.205844
Gene4     2.355128
Gene5     1.003812
Gene6     1.007384
dtype: float64
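
To read these ratios more easily, we could sort them so the genes that differ most between groups come first. The sketch below copies the ratios from the output above into a Series so it runs on its own; the name `ranked` is introduced here for illustration.

```python
import pandas as pd

# Ratios of patient mean to control mean, copied from the output above
ratio = pd.Series({
    "Gene1": 11.830035,
    "Gene2": 4.929276,
    "Gene3": 18.205844,
    "Gene4": 2.355128,
    "Gene5": 1.003812,
    "Gene6": 1.007384,
})

# Largest ratio first: a ratio close to 1 means the group means are nearly equal
ranked = ratio.sort_values(ascending=False)
print(ranked)
```

A ratio of means is only a rough measure: it ignores the spread within each group, so it is best read together with the standard deviations.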
