# Week 1 - Exercise solutions

For all these solutions, we assume you have imported `pandas` and read in the metabric dataset.

In [None]:
import pandas as pd

metabric = pd.read_csv('metabric_clinical_and_expression_data.csv')

In [None]:
metabric.head()

### Exercises from the Text

**A)** Create a new column that gives the average expression of all the genes with expression data in metabric.

The "obvious" way to do this is rather cumbersome:

In [None]:
metabric['average_expression']  = (metabric.ESR1 + metabric.ERBB2 + metabric.PGR + metabric.TP53 + metabric.PIK3CA + metabric.GATA3 + metabric.FOXA1 + metabric.MLPH) / 8

A more sophisticated approach uses the subsetting conveniently discussed just after this exercise was posed:

In [None]:
metabric['average_expression'] = metabric.loc[:,'ESR1':'MLPH'].mean(axis=1) # axis=1 ensures the mean is calculated across the columns
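To see what the label slice and `axis=1` are doing, here is a minimal sketch on a small made-up data frame. The column names and values are invented for illustration and are not taken from metabric. Note that label-based slices with `.loc` include both endpoints.

```python
import pandas as pd

# Toy data frame; names and values are made up for illustration
toy = pd.DataFrame({
    "Patient_ID": ["P1", "P2"],
    "GENE_A": [1.0, 4.0],
    "GENE_B": [2.0, 5.0],
    "GENE_C": [3.0, 6.0],
})

# Label-based slicing with .loc includes both endpoints:
# this selects GENE_A, GENE_B and GENE_C
genes = toy.loc[:, "GENE_A":"GENE_C"]

# axis=1 averages across the columns, giving one value per row (patient)
row_means = genes.mean(axis=1)

# The default (axis=0) averages down the rows, giving one value per column (gene)
column_means = genes.mean()
```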

** **

**B)** Write a query to extract data on only those patients who have had both chemotherapy and radiotherapy. Can you compute the average tumour size for such patients? How does this compare to the tumour size for patients who haven't undergone therapy?

Extracting the patients who have undergone chemotherapy and radiotherapy:

In [None]:
metabric[(metabric.Chemotherapy == "YES") & (metabric.Radiotherapy == "YES")]

Compute the average tumour size for these patients.

In [None]:
metabric[(metabric.Chemotherapy == "YES") & (metabric.Radiotherapy == "YES")].Tumour_size.mean()

An alternative method would be to extract the tumour size column first, and then subset by therapy status:

In [None]:
metabric.Tumour_size[(metabric.Chemotherapy == "YES") & (metabric.Radiotherapy == "YES")].mean()

Can you think why this approach might be advantageous?

The final part of this exercise can be done via a simple modification to the conditions.

In [None]:
metabric.Tumour_size[(metabric.Chemotherapy == "NO") & (metabric.Radiotherapy == "NO")].mean()

** ** 

**C)** Write a query to extract data on patients from cohort 1 with either the highest or second highest tumour stage (part of the exercise is to figure out what the top two tumour stages are!). A method introduced in the final section of this session might be useful...

The first step is to figure out what the two largest tumour stages are. The range of values for tumour stage can be obtained using `.unique()`, from which we see the top two stages are 3 and 4.

In [None]:
metabric.Tumour_stage.unique()

Extracting the data is then a relatively simple task.

In [None]:
metabric[(metabric.Tumour_stage.isin([3, 4])) & (metabric.Cohort==1)]

This can actually be achieved more succinctly using the `.nlargest()` method. I wasn't aware of this method until writing this solution, so big congrats to anyone who managed to obtain this solution!

In [None]:
metabric[(metabric.Tumour_stage.isin( pd.Series(metabric.Tumour_stage.unique()).nlargest(n=2) )) & (metabric.Cohort==1)]
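The reason for calling `.unique()` before `.nlargest()` is that the largest stage appears many times, so taking the two largest raw values would just return that stage twice. A minimal sketch on a made-up Series (the values are invented, and `.drop_duplicates()` is used here as an alternative to wrapping `.unique()` in `pd.Series`):

```python
import numpy as np
import pandas as pd

# Toy tumour stages, including a repeat of the top stage and a missing value
stages = pd.Series([1.0, 2.0, 4.0, np.nan, 3.0, 4.0, 2.0])

# Remove repeated values first, then take the two largest;
# nlargest ignores missing values
top_two = stages.drop_duplicates().nlargest(2)

# Boolean mask marking the rows whose stage is one of the top two
mask = stages.isin(top_two)
```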

** **

**D)** Use `.sum()` to get a vector of the amount of missing data for each variable.

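Recall that the `.sum()` method computes the sum of a Series, or the sum of each column when called on a DataFrame. Since `.isna()` returns `True`/`False` values and `True` counts as 1, the sum is the number of missing values in each column.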

In [None]:
metabric.isna().sum()

### Exercise 1

- What are the different values of the `Integrative Cluster` variable? 
- Can you produce a series containing the number of patients in each cluster? (There is a useful method that will help you out with this, but I leave it to you to find out what that is).

The different values of the integrative cluster variable can be found using the `.unique()` method.

In [None]:
metabric.Integrative_cluster.unique()

With some googling, you may have found the method `.value_counts()`, which returns the count for each unique value in a Series.

In [None]:
metabric.Integrative_cluster.value_counts()

### Exercise 2 

- Read the dataset `metabric_clinical_and_expression_data.csv` and store its summary statistics into a new variable called `metabric_summary`.
- Just like the `.read_csv()` method allows reading data from a file, `pandas` provides a `.to_csv()` method to write `DataFrames` to files. Write your summary statistics object into a file called `metabric_summary.csv`. You can use `help(metabric.to_csv)` to get information on how to use this function.
- Use the help information to modify the previous step so that you can generate a Tab Separated Value (TSV) file instead.
- Similarly, explore the method `to_excel()` to produce an Excel spreadsheet containing summary statistics.

In [None]:
# Store summary statistics
metabric_summary = metabric.describe(include="all")
metabric_summary

In [None]:
help(metabric.to_csv)

This is the basic syntax for writing a data frame to a csv file:

In [None]:
metabric_summary.to_csv("metabric_summary.csv")

The following lines show examples of modifications that can be made to this function:

In [None]:
metabric_summary.to_csv("metabric_summary.tsv", sep = '\t') # Creates a TSV (tab-separated values) file

In [None]:
metabric_summary.to_csv("metabric_summary.csv", columns = ["Cohort", "Age_at_diagnosis"]) # Selects only certain columns to write

In [None]:
metabric_summary.to_csv("~/Desktop/metabric_summary.csv", index = False) # Writes to a specific location; index=False omits the row labels (here, the names of the statistics)

In [None]:
# Write an Excel spreadsheet

# help(metabric.to_excel)
metabric_summary.to_excel("metabric_summary.xlsx")

# If this raises "ModuleNotFoundError: No module named 'openpyxl'", install it with:
# pip3 install openpyxl --user

### Exercise 3

- Read the dataset `metabric_clinical_and_expression_data.csv` into a variable e.g. `metabric`.
- Calculate the mean tumour size of patients grouped by vital status and tumour stage
- Find the cohort of patients and tumour stage where the average expression of genes TP53 and FOXA1 is highest
- Do patients with greater tumour size live longer? How about patients with greater tumour stage? How about greater Nottingham_prognostic_index?

This exercise can be done using the `groupby` function.

In [None]:
# Calculate the mean tumour size of patients grouped by vital status and tumour stage
metabric.groupby(['Vital_status', 'Tumour_stage']).mean(numeric_only=True) # numeric_only=True skips the non-numeric columns, which cannot be averaged

It's a bit neater to only display the columns we need information on.

In [None]:
metabric.groupby(['Vital_status', 'Tumour_stage'])['Tumour_size'].mean()

In [None]:
# Find the cohort of patients and tumour stage where the average expression of genes TP53 and FOXA1 is highest

metabric.groupby(['Cohort', 'Tumour_stage'])[['TP53', 'FOXA1']].mean()

To answer problems about survival times, let's restrict ourselves to those patients where we know the survival times.

In [None]:
metabric_deceased = metabric[metabric.Vital_status == 'Died of Disease']

**Note**: Looking only at patients who have already died introduces bias, as we exclude those who have survived long enough not to be deceased yet. The question of how best to incorporate survival data on patients who are still alive (where you only know that their survival time is >= what is recorded) is essentially the whole reason *Survival Analysis* is its own field of statistics.

Tumour Size is a continuous variable, so to answer the question of whether it has an effect on survival times, we are better off looking at correlation.

In [None]:
metabric_deceased.Tumour_size.corr(metabric_deceased.Survival_time, method='spearman')

For tumour stage, we can return to the group by function we know and love(?).

In [None]:
metabric_deceased.groupby('Tumour_stage')['Survival_time'].mean()

When computing grouped averages, it is good practice to display the size of each group.

In [None]:
metabric_deceased.groupby('Tumour_stage')['Survival_time'].agg(['mean', 'size'])

Nottingham prognostic index is a continuous variable, so we are back to computing correlations.

In [None]:
metabric_deceased['Nottingham_prognostic_index'].corr(metabric_deceased['Survival_time'])
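The two correlation calls in this exercise use different methods: Spearman for tumour size, and the default (Pearson) for the Nottingham prognostic index. Pearson measures how close a relationship is to a straight line, while Spearman is Pearson's correlation applied to the ranks, so it only asks whether one variable consistently increases with the other. The sketch below uses made-up numbers to show the difference, and computes the Spearman value from ranks directly rather than via `method='spearman'`.

```python
import pandas as pd

# Made-up data: y always increases with x, but not along a straight line
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# Pearson (the default for .corr) is below 1 because the relationship is curved
pearson = x.corr(y)

# Spearman's correlation is Pearson's correlation of the ranks;
# it equals 1 here because the ordering of x and y agrees perfectly
spearman_via_ranks = x.rank().corr(y.rank())
```

Which of the two is more appropriate depends on whether you care about a linear relationship specifically or any monotonic trend.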

### Exercise 4

Review the section on missing data presented in the lecture. Consulting the [user's guide section dedicated to missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) if necessary, use the functionality provided by pandas to answer the following questions:

- Which variables (columns) of the metabric dataset have missing data?
- Find the IDs of the patients who have missing tumour size and/or missing mutation count data. Which cohorts do they belong to?
- For the patients identified to have missing tumour size data for each cohort, calculate the average tumour size of the patients with tumour size data available within the same cohort to fill in the missing data

In [None]:
# Which variables (columns) of the metabric dataset have missing data?
metabric.info()

Find the IDs and cohorts for patients with missing data in either tumour size or mutation count:

In [None]:
metabric.Patient_ID[(metabric.Tumour_size.isna()) | (metabric.Mutation_count.isna())]

In [None]:
metabric[['Patient_ID', 'Cohort']][(metabric.Tumour_size.isna()) | (metabric.Mutation_count.isna())]

It's worth seeing the number of patients in each cohort with missing data, as well as the proportions:

In [None]:
metabric.Cohort[(metabric.Tumour_size.isna()) | (metabric.Mutation_count.isna())].value_counts()

In [None]:
# normalize=True gives each cohort's share of the patients with missing data (the proportions sum to 1)
metabric.Cohort[(metabric.Tumour_size.isna()) | (metabric.Mutation_count.isna())].value_counts(normalize=True)
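Note that `value_counts(normalize=True)` on this subset gives each cohort's share of the patients with missing data. If you instead want the fraction of each cohort's patients that have missing data, one option is to group the missing-data indicator by cohort and take its mean. A minimal sketch on a made-up data frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: cohort 1 has four patients, cohort 2 has two
toy = pd.DataFrame({
    "Cohort": [1, 1, 1, 1, 2, 2],
    "Tumour_size": [10.0, np.nan, 20.0, 30.0, np.nan, 15.0],
    "Mutation_count": [2.0, 3.0, 4.0, 1.0, np.nan, 4.0],
})

# True for patients missing tumour size and/or mutation count
missing = toy.Tumour_size.isna() | toy.Mutation_count.isna()

# Share of the missing-data patients that belong to each cohort
share_of_missing = toy.Cohort[missing].value_counts(normalize=True)

# Fraction of each cohort's patients that have missing data
# (the mean of a True/False Series is the proportion of True values)
fraction_within_cohort = missing.groupby(toy.Cohort).mean()
```

In this toy example each cohort contributes half of the missing-data patients, yet a quarter of cohort 1 and half of cohort 2 are affected, so the two summaries answer different questions.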

For the patients identified to have missing tumour size data for each cohort, calculate the average tumour size of the patients with tumour size data available within the same cohort to fill in the missing data:

In [None]:
# Compute average tumour sizes for each cohort
avg_tsize_1 = round(metabric.Tumour_size[metabric.Cohort==1].mean(), 1)
avg_tsize_2 = round(metabric.Tumour_size[metabric.Cohort==2].mean(), 1)
avg_tsize_3 = round(metabric.Tumour_size[metabric.Cohort==3].mean(), 1)
avg_tsize_5 = round(metabric.Tumour_size[metabric.Cohort==5].mean(), 1)

# Fill in missing values, using each cohort's own average
metabric.loc[metabric.Cohort==1, 'Tumour_size'] = metabric.loc[metabric.Cohort==1, 'Tumour_size'].fillna(avg_tsize_1)
metabric.loc[metabric.Cohort==2, 'Tumour_size'] = metabric.loc[metabric.Cohort==2, 'Tumour_size'].fillna(avg_tsize_2)
metabric.loc[metabric.Cohort==3, 'Tumour_size'] = metabric.loc[metabric.Cohort==3, 'Tumour_size'].fillna(avg_tsize_3)
metabric.loc[metabric.Cohort==5, 'Tumour_size'] = metabric.loc[metabric.Cohort==5, 'Tumour_size'].fillna(avg_tsize_5)

**Bonus Exercise**:

- The above bit of code is begging to be put into a `for` loop. Have a go at this.
- See if you can write a function that takes in a data frame with columns "Cohort" and "Tumour_size", and returns a new data frame with the missing values of tumour size filled in as above.
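If you want something to compare your attempt against, here is one possible sketch, not the only way to do it. The function name `fill_tumour_size_by_cohort` and the toy data are made up for illustration. It loops over the cohorts present in the data, and a loop-free variant using `groupby(...).transform("mean")` is shown afterwards.

```python
import numpy as np
import pandas as pd

def fill_tumour_size_by_cohort(df):
    """Return a copy of df in which missing Tumour_size values are replaced
    by the mean Tumour_size (rounded to 1 decimal place) of the same cohort."""
    filled = df.copy()  # leave the caller's data frame untouched
    for cohort in filled["Cohort"].dropna().unique():
        in_cohort = filled["Cohort"] == cohort
        cohort_mean = round(filled.loc[in_cohort, "Tumour_size"].mean(), 1)
        filled.loc[in_cohort, "Tumour_size"] = (
            filled.loc[in_cohort, "Tumour_size"].fillna(cohort_mean)
        )
    return filled

# Toy data with one missing tumour size in each cohort
toy = pd.DataFrame({
    "Cohort": [1, 1, 1, 2, 2],
    "Tumour_size": [10.0, np.nan, 20.0, 30.0, np.nan],
})
toy_filled = fill_tumour_size_by_cohort(toy)

# Loop-free alternative: transform("mean") gives every row its cohort's mean,
# and fillna then uses the value at the matching row for each missing entry
cohort_means = toy.groupby("Cohort")["Tumour_size"].transform("mean").round(1)
toy_filled_alt = toy.assign(Tumour_size=toy["Tumour_size"].fillna(cohort_means))
```

Looping over the cohorts found in the data avoids hard-coding the cohort numbers, which is where copy-and-paste mistakes tend to creep in.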