# Summary Statistics - Exercises
In these exercises we'll use a real-life medical dataset to learn how to obtain basic statistics from data. This dataset comes from [Gluegrant](https://www.gluegrant.org/), an American project that aims to find which genes are more important for the recovery of severely injured patients.


## Objectives

The objective of this exercise is for you to learn how to use pandas functions to obtain simple statistics from datasets.

## Dataset information

The dataset is a medical dataset with 184 patients, distributed into 2 test groups, where each group is divided in 2: patients and control.
The dataset is composed of clinical values:
* Patient.id
* Age
* Sex
* Group (to what group they belong)
* Results (outcome of the patient)

and of the gene expression (higher = more expressed):
* Gene1: [MMP9](https://www.ncbi.nlm.nih.gov/gene/4318)
* Gene2: [S100A12](https://www.ncbi.nlm.nih.gov/gene/6283)
* Gene3: [MCEMP1](https://www.ncbi.nlm.nih.gov/gene/199675)
* Gene4: [ACSL1](https://www.ncbi.nlm.nih.gov/gene/2180)
* Gene5: [SLC7A2](https://www.ncbi.nlm.nih.gov/gene/6542)
* Gene6: [CDC14B](https://www.ncbi.nlm.nih.gov/gene/8555)


In [None]:
import pandas as pd
import numpy as np
import math 

In [None]:
patient_data = pd.read_csv('data/everyone.csv')

patient_data.head()

# Exercise 1 - Let's get a quick look at the groups

OK, first let's get a quick look at who is in each of the groups. In medical studies it's important that the control and patient groups aren't too different from each other, so that we can draw relevant conclusions.

### Separate the patients and control into 2 dataframes
Since we are going to compute multiple statistics on the patient and control groups, we should create a variable for each of the groups, so that we keep our code readable!

_Remember:_ If you want to subset, the command is: `DataFrame[DataFrame.column == "Value"]`
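As a reminder of how this kind of subsetting behaves, here is a minimal sketch on a made-up table. The `toy` DataFrame and its `player`, `team` and `score` columns are hypothetical and have nothing to do with the exercise data.

```python
import pandas as pd

# Hypothetical toy data, not the exercise dataset
toy = pd.DataFrame({
    "player": ["a", "b", "c", "d"],
    "team": ["red", "blue", "red", "blue"],
    "score": [10, 20, 30, 40],
})

# The comparison produces a boolean Series (a mask), one value per row
mask = toy.team == "red"

# Indexing the DataFrame with the mask keeps only the rows where it is True
red_team = toy[mask]
print(red_team)
```

The result is a new DataFrame with the same columns, containing only the matching rows.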

In [None]:
# get one dataframe for each group - Patient and Control 

# patients = ... 
# control = ...

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
assert(patients.Group.unique()[0] == 'Patient')
assert(control.Group.unique()[0] == 'Control')

### Find out the Age means for each of the groups
_Remember:_ To find the mean of a dataframe column, just use `Name_of_dataframe.column.mean()`

In [None]:
# patient_mean =  # Calculate the mean of the patients
# control_mean =  # Calculate the mean of the controls

# YOUR CODE HERE
raise NotImplementedError()


In [None]:
print('The patient mean age is: {} and the control mean age is: {}'.format(patient_mean, control_mean))

Expected output:
    
    The patient mean age is: 33.64556962025316 and the control mean age is: 29.884615384615383 

In [None]:
assert(math.isclose(patient_mean, 33.64556962025316, abs_tol=0.01))

### Find the Median Age of each group
As seen in the presentation, the mean can be affected by outliers in the data. Let's check that out with the median.
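To see the effect of an outlier on both statistics, here is a small sketch with invented numbers. The two Series below are hypothetical and are not taken from the exercise dataset.

```python
import pandas as pd

# Hypothetical ages; the second Series replaces one value with an outlier
ages = pd.Series([30, 31, 32, 33, 34])
ages_with_outlier = pd.Series([30, 31, 32, 33, 94])

# Without the outlier, mean and median agree
print(ages.mean(), ages.median())  # 32.0 32.0

# The outlier pulls the mean up, while the median does not move
print(ages_with_outlier.mean(), ages_with_outlier.median())  # 44.0 32.0
```

A large gap between the mean and the median of a column is therefore a hint that the data may contain outliers or be skewed.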


In [None]:

# patient_median = # Calculate the median age of the patients
# control_median =  # Calculate the median age of the controls

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print('The patient median age is: {} and the control median age is: {}'.format(patient_median,control_median))

Expected output:
    
    The patient median age is: 33.0 and the control median age is: 28.0


In [None]:
assert math.isclose(patient_median, 33)

### Find the standard deviation of the Age of each group
Let's see if there is a large deviation from the mean in each of the groups.
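One detail worth keeping in mind is that the pandas `.std()` method computes the sample standard deviation by default (`ddof=1`), whereas NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below illustrates the difference on invented values that are not part of the exercise data.

```python
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# pandas default: sample standard deviation, dividing by n - 1
sample_std = values.std()

# Population standard deviation, dividing by n
population_std = values.std(ddof=0)

print(sample_std, population_std)
```

Here the population standard deviation is exactly 2.0 and the sample version is slightly larger, because it divides by 7 instead of 8.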

In [None]:

# patient_std = # Standard Deviation age of the patients
# control_std = # Standard Deviation age of the controls

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print('The patient std is: {} and the control std is: {}'.format(patient_std,control_std))

Expected output: 

    The patient std is: 11.16698725935228 and the control std is: 10.19539866048179 	

In [None]:
assert(math.isclose(patient_std, 11.1669872593522, abs_tol=0.01))
assert(math.isclose(control_std, 10.1953986604817, abs_tol=0.01))

### Find the quantiles of the Age
Let's use the quantiles to obtain the dispersion of the groups. Get the 0, 0.25, 0.5, 0.75 and 1 quantiles.
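As a quick illustration, `Series.quantile` accepts a list of probabilities and returns a Series indexed by them. The values below are made up for the example and are not from the exercise dataset.

```python
import pandas as pd

# Hypothetical values, already sorted for readability
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Passing a list returns one quantile per requested probability
q = values.quantile([0, 0.25, 0.5, 0.75, 1])
print(q)

# The interquartile range is a common measure of dispersion
iqr = q.loc[0.75] - q.loc[0.25]
print(iqr)  # 4.0
```

The 0 and 1 quantiles are the minimum and maximum, and the 0.5 quantile is the median.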


In [None]:
# patient_quantiles = # Patient quantiles of the Age
# control_quantiles = # Control quantiles of the Age

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print(patient_quantiles)
print(control_quantiles)

Expected output:

    0.00    16.0
    0.25    24.0
    0.50    33.0
    0.75    43.0
    1.00    55.0
    Name: Age, dtype: float64
    0.00    17.0
    0.25    21.5
    0.50    28.0
    0.75    34.0
    1.00    54.0
    Name: Age, dtype: float64

In [None]:
assert (patient_quantiles.sum()==171)
assert (control_quantiles.sum()==154.5)

### Find out how many patients are male and how many are female

Next, let's try to find out the number of people of each sex and the percentage of males in each of the groups.

_Remember:_ To get a frequency table you can use `pd.crosstab()`

- Rows: Sex
- Columns: Group 
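Here is a minimal sketch of how `pd.crosstab()` builds such a table, using a made-up DataFrame. The `pet` and `city` columns are hypothetical; the `normalize="columns"` option is one possible way to turn counts into per-column shares.

```python
import pandas as pd

# Hypothetical toy data, not the exercise dataset
toy = pd.DataFrame({
    "pet":  ["cat", "cat", "dog", "dog", "dog", "dog"],
    "city": ["A",   "B",   "A",   "A",   "B",   "A"],
})

# First argument becomes the rows, second argument the columns
counts = pd.crosstab(toy["pet"], toy["city"])
print(counts)

# normalize="columns" divides each count by its column total
shares = pd.crosstab(toy["pet"], toy["city"], normalize="columns")
print(shares)
```

In this toy table, 3 of the 4 rows in city `A` are dogs, so the share of dogs in column `A` is 0.75.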

In [None]:
# crosstab = # get a frequency table for Sex and Group 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
crosstab

In [None]:
assert (crosstab.iloc[1,0]==17)
assert (crosstab.iloc[1,1]==98)
assert (crosstab.iloc[0,0]==9)
assert (crosstab.iloc[0,1]==60)

# Exercise 2 

The objective here is for you to try to find genes that differ between the patient group and the control group, using the tools that you learned in Exercise 1.

### Gene 1

In [None]:
#gene1_patients = # get the series of patients for Gene1 
#gene1_control = # get the series of control for Gene1 

# Mean

# mean_gene1_patients = # Gene1 mean for patients
# mean_gene1_control = # Gene1 mean for control

# Median

# median_gene1_patients = # Gene1 median for patients
# median_gene1_control = # Gene1 median for control

# Std

# std_gene1_patients = # Gene1 std for patients
# std_gene1_control = # Gene1 std for control


# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print("Patients: Mean =", mean_gene1_patients, "Median =", median_gene1_patients, "Std =", std_gene1_patients, "\t")
print("Control:  Mean =", mean_gene1_control, "Median =", median_gene1_control, "Std =", std_gene1_control, "\t")



Expected outcome:
    
    Patients: Mean = 13676.395265822784 Median = 13555.230500000001 Std = 3092.69814423062 	
    Control:  Mean = 1156.0739230769232 Median = 947.1095 Std = 854.9755207222215 	

In [None]:
assert math.isclose(mean_gene1_patients, 13676.39526582278, abs_tol=0.01)
assert math.isclose(mean_gene1_control, 1156.073923076923, abs_tol=0.01)
assert math.isclose(median_gene1_patients, 13555.23050000000, abs_tol=0.01)
assert math.isclose(median_gene1_control, 947.1095, abs_tol=0.01)

### Gene 2

In [None]:
#gene2_patients = # get the Series of patients for Gene2
#gene2_control = # get the series of control for Gene2

# Mean

#mean_gene2_patients = # Gene2 mean for patients
#mean_gene2_control = # Gene2 mean for control

# Median

# median_gene2_patients = # Gene2 median for patients
# median_gene2_control = # Gene2 median for control

# Std

# std_gene2_patients = # Gene2 std for patients
# std_gene2_control = # Gene2 std for control


# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print("Patients: Mean =", mean_gene2_patients, "Median =", median_gene2_patients, "Std =", std_gene2_patients, "\t")
print("Control:  Mean =", mean_gene2_control, "Median =", median_gene2_control, "Std =", std_gene2_control, "\t")

Expected outcome: 
    
    Patients: Mean = 16955.4325 Median = 17023.491 Std = 2743.7304882937474 	
    Control:  Mean = 3439.741 Median = 3067.3015 Std = 1549.3355492407961

In [None]:
assert math.isclose(mean_gene2_patients, 16955.4325, abs_tol=0.01)
assert math.isclose(mean_gene2_control, 3439.741, abs_tol=0.01)
assert math.isclose(median_gene2_patients, 17023.491, abs_tol=0.01)
assert math.isclose(median_gene2_control, 3067.3015, abs_tol=0.01)

## Can we do this without so much code?

Can we obtain the previous statistics for the 6 genes without all the effort?

_Remember:_ The `.describe()` method? 
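As a refresher, the sketch below calls `.describe()` on a selection of columns from a made-up DataFrame. The `toy` table and its column names are hypothetical and unrelated to the exercise data.

```python
import pandas as pd

# Hypothetical toy data, not the exercise dataset
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, 20.0, 30.0, 40.0],
    "label": ["a", "b", "c", "d"],
})

# Select several columns with a list of names, then summarise them at once
cols = ["x", "y"]
summary = toy[cols].describe()
print(summary)

# Individual statistics can be read by column and row label
print(summary["x"]["mean"])     # 2.5
print(summary.loc["50%", "y"])  # 25.0
```

The result is itself a DataFrame with one column per variable and one row per statistic (`count`, `mean`, `std`, `min`, `25%`, `50%`, `75%`, `max`).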

### Let's see for Patients

In [None]:
gene_names = ["Gene1", "Gene2"]

# Get a dataframe with the describe() output of both genes for the Patient group
# describe_patients = 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert math.isclose(describe_patients['Gene1']['count'], 158, abs_tol=0.01)
assert math.isclose(describe_patients['Gene1']['std'], 3092.698144, abs_tol=0.01)
assert math.isclose(describe_patients['Gene1']['50%'], 13555.230500, abs_tol=0.01)
assert math.isclose(describe_patients['Gene2']['50%'], 17023.491000, abs_tol=0.01)

### Let's see for  Control

In [None]:
gene_names = ["Gene1", "Gene2"]

# Get a dataframe with the describe() output of both genes for the Control group
# describe_control = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert math.isclose(describe_control['Gene1']['count'], 26, abs_tol=0.01)
assert math.isclose(describe_control['Gene1']['std'], 854.975521, abs_tol=0.01)
assert math.isclose(describe_control['Gene1']['50%'], 947.109500, abs_tol=0.01)
assert math.isclose(describe_control['Gene2']['50%'], 3067.301500, abs_tol=0.01)

## Get a list of the possible results (there is a column `Result`)
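One common way to list the distinct values of a column is `Series.unique()`. The sketch below uses a made-up Series; the `status` name and its values are hypothetical and are not from the exercise dataset.

```python
import pandas as pd

# Hypothetical values, not the exercise dataset
status = pd.Series(["open", "closed", "open", "pending", "closed"])

# unique() returns the distinct values in order of first appearance
unique_values = status.unique()

# Wrap the result in list() to get a plain Python list
unique_list = list(unique_values)
print(unique_list)  # ['open', 'closed', 'pending']
```

If only the number of distinct values is needed, `status.nunique()` gives it directly.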

In [None]:
# get a list with the unique possible results (there is a column for this)
# result_list = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(result_list)==8
assert 'control' in result_list
assert '02: Skilled nursing facility' in result_list