# Week 1 - Data Handling

This week we will be focusing on manipulating and performing basic operations on data within Python. The package we shall be using to do this is `pandas`, which provides data structures and operations for manipulating many different types of data.

Another popular package for this sort of task is `numpy`, which is particularly useful when performing more computationally involved operations on your data. We shall not be covering `numpy` today, but feel free to study it in your own time, e.g. by following [this tutorial](https://numpy.org/devdocs/user/quickstart.html).

## Useful References

There are many resources that you may find useful when working with `pandas`. Here are a few:

- The `pandas` [documentation](https://pandas.pydata.org/docs/reference/index.html) - Contains reference information for all pandas objects, functions and methods.
- The [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) - I find this very helpful as a reference when I want to look up how to perform a particular kind of task, rather than wanting to look up a specific function. It is only introductory so does not by any stretch cover everything, but what it does cover it covers well. It also contains material on `numpy`, `matplotlib` & `seaborn` (two packages commonly used together for data visualisation, and which we will cover next week), and `scikit-learn` (an accessible package for doing basic Machine Learning in Python, which will be covered later on in the course). 

## Course Homepage

Another source of resources for the whole course is the [course GitHub page](https://github.com/AstraZeneca/data-science-python-course). Here you can find all the notebooks used in class, as well as solutions to the exercises. 

## Installing Pandas

The module `pandas` does not come as part of the default Python or Jupyter installations. In order to install it in your system, launch the Command prompt just like we saw in week 0 and run the following command: `pip install pandas --user`. Once the command finishes execution, pandas will be installed in your system.

**Note**: Depending on where you are installing the packages from, you may have to use a proxy, i.e. you may have to enter `pip install pandas --proxy "http://azpzen.astrazeneca.net:9480"` instead of `pip install pandas --user`.

**Note**: If you have any issues installing pandas, please get in touch with one of the trainers or use the Teams page.

**Note**: You can use the same approach to install `numpy`, and other packages.

## Learning Objectives

After this week, you should be comfortable with performing the following tasks using `pandas`:

- Reading in data as a `DataFrame`.
- Selecting a subset of your data.
- Creating new columns and modifying existing ones.
- Summarising data sets via summary values, including separating out the summaries by groups within the data.
- Handling missing data.
- Handling data spread across multiple datasets.

**Note**: Any of these topics could be covered in almost endless depth. It is sufficient for this session to understand the basics of these topics, and how to go about learning them in more depth. 

## Loading pandas

Pandas is typically imported with the alias `pd`.

In [2]:
import pandas as pd

## Reading in Data 

The first step in any data analysis is to read in your data. We shall be using the METABRIC dataset `metabric_clinical_and_expression_data.csv`, which contains various types of information about patients with breast cancer.

Pandas contains different functions for reading in data from different formats. The function for reading in a csv file is `read_csv()`.

In [3]:
metabric = pd.read_csv('metabric_clinical_and_expression_data.csv')

**Note**: You will need to replace `'metabric_clinical_and_expression_data.csv'` with a file path for wherever you have the csv stored. Be aware that the copy path functionality in Windows pastes in file paths with `\` backslashes, which Python treats as escape characters inside ordinary strings. Either change them to `/` forward slashes, or put an `r` in front of the string to make it a raw string, e.g. `pd.read_csv(r"C:\Documents\metabric_clinical_and_expression_data.csv")`.

The type of object the metabric dataset is now stored as is called a `DataFrame`. Typically the rows of a data frame correspond to **observations**, and the columns correspond to **variables**. 

By default, `pandas` will assign each row an index starting from 0, as is standard in Python.
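
To see the default index without needing the METABRIC file, here is a minimal sketch that reads a tiny CSV from an in-memory string. The patient IDs and values are invented for illustration, and `io.StringIO` simply stands in for a file on disk.

```python
import io

import pandas as pd

# Toy CSV text standing in for a file on disk (all values are invented).
csv_text = """Patient_ID,Cohort,Tumour_size
P-001,1,22.0
P-002,1,10.0
P-003,2,15.0
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(type(toy))            # the result is a pandas DataFrame
print(toy.index.tolist())   # default row labels, starting from 0
print(toy.shape)            # (number of rows, number of columns)
```

Each row of the toy data is one observation (a patient) and each column is one variable.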

## Exploring Your Data

### A Refresher on Objects, Attributes and Methods

Recall that objects in Python have **attributes** and **methods**.

An **attribute** is simply a property of the object. They can be accessed via the syntax `object.attribute`.

For example, the `shape` attribute of a data frame contains information on the number of rows and columns.

In [None]:
metabric.shape

The `columns` attribute gives us a list of the variables in the data frame.

In [5]:
metabric.columns

Index(['Patient_ID', 'Cohort', 'Age_at_diagnosis', 'Survival_time',
       'Survival_status', 'Vital_status', 'Chemotherapy', 'Radiotherapy',
       'Tumour_size', 'Tumour_stage', 'Neoplasm_histologic_grade',
       'Lymph_nodes_examined_positive', 'Lymph_node_status', 'Cancer_type',
       'ER_status', 'PR_status', 'HER2_status', 'HER2_status_measured_by_SNP6',
       'PAM50', '3-gene_classifier', 'Nottingham_prognostic_index',
       'Cellularity', 'Integrative_cluster', 'Mutation_count', 'ESR1', 'ERBB2',
       'PGR', 'TP53', 'PIK3CA', 'GATA3', 'FOXA1', 'MLPH'],
      dtype='object')

A **method** is a function specifically designed for a certain type of object. They can be called via the syntax `object.method()`. By default the first argument of the method is the object it is called on; any further arguments must be typed inside the brackets.

For example, the `head()` method returns the first few rows of the data frame it is called on.

In [9]:
metabric.head() # Semantically equivalent to head(metabric)


In [None]:
metabric.head(n=10) # Semantically equivalent to head(metabric, n=10)

`describe()` provides summary statistics on all variables.

In [None]:
metabric.describe()

Methods become much nicer to use than functions when you start using more than one of them.

In [None]:
metabric.head(n=10).describe() # Semantically equivalent to describe( head(metabric, n=10) )

In [None]:
metabric.info()

### Columns

One way in which individual columns can be accessed is as follows:

In [None]:
metabric['Survival_time']

Columns are in fact attributes of the data frame, so they can also be accessed via the usual attribute syntax:

In [None]:
metabric.Survival_time

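**Note**: This only works if the column name is a valid Python name (no spaces or hyphens, and not starting with a digit) and does not clash with an existing attribute or method of the data frame, such as `shape` or `mean`. The square bracket syntax always works.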
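
The following sketch illustrates the limits of attribute access on a toy data frame. The values are invented, and the `shape` column is a hypothetical name chosen only because it clashes with an existing attribute.

```python
import pandas as pd

toy = pd.DataFrame({
    "Survival_time": [140.5, 84.6],
    "3-gene_classifier": ["group A", "group B"],  # invented values
    "shape": ["round", "irregular"],              # hypothetical clashing name
})

# Square brackets work for every column name.
print(toy["3-gene_classifier"])
print(toy["shape"])

# Attribute syntax works for names that look like Python identifiers...
print(toy.Survival_time)

# ...but toy.shape is still the (rows, columns) tuple, not the column.
print(toy.shape)
```

A name such as `3-gene_classifier` cannot be written after a dot at all, so the bracket syntax is the only option for it.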

The columns are a separate type of `pandas` object called a `Series`, which is essentially the `pandas` version of a list. `Series` also have lots of useful attributes and methods, many of which go by the same name as their `DataFrame` counterpart.

In [None]:
metabric.Survival_time.shape

In [None]:
metabric.Survival_time.describe()

In [None]:
metabric.Survival_time.mean()

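In many cases, a method with an obvious meaning for a single variable will simply perform that operation on every column when called on a data frame. For `mean()`, pass `numeric_only=True` so that text columns are skipped; recent versions of pandas raise an error otherwise.

In [None]:
metabric.mean(numeric_only=True)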

#### Column Creation

New columns can be created in a straightforward manner. For example, I can create a column with the current age of the patients as follows:

In [None]:
metabric['Age_today'] = metabric['Age_at_diagnosis'] + metabric['Survival_time']/365

**Note**: When creating new columns, the new column must be specified using the `dataframe['column_name']` syntax.

**<span style="color:blue">Exercise</span>**: Create a version of this column that returns the patient's age at their time of death if they have died, and otherwise returns their current age. **<span style="color:SeaGreen">Challenge</span>**: Do this without using a `for` loop.

## Data Subsetting

Square brackets `[]` are used to select rows and columns in a data frame. For example, one can select a subset of the rows as follows:

In [None]:
metabric[:3]

Recall the use of a `:` colon to define a range.

This can be combined with column selection as above:

In [None]:
metabric[:3]['Survival_time']

However, the two cannot be combined into one pair of indices:

In [None]:
metabric[:3,'Survival_time']

:(

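Selecting rows and columns together by **label** can be done using the `.loc[]` functionality. Note that, unlike ordinary Python slices, label slices in `.loc[]` include the end label, so `:3` selects the rows labelled 0 to 3.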

In [None]:
metabric.loc[:3,'Survival_time']

In [None]:
metabric.loc[:,['Survival_time', 'Tumour_size']]

In [None]:
metabric.loc[:,'ESR1':'MLPH']

Both rows and columns can be specified via **positional** indexing using the `.iloc[]` functionality.

In [None]:
metabric.iloc[:10,1:6]
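
The difference between label-based and positional slicing is easy to miss, so here is a small sketch on invented data. It assumes the default integer index, and then reverses the rows to show what happens once labels and positions no longer coincide.

```python
import pandas as pd

toy = pd.DataFrame({
    "Survival_time": [140.5, 84.6, 163.7, 164.9, 41.4],
    "Tumour_size": [22.0, 10.0, 15.0, 25.0, 40.0],
})

by_label = toy.loc[:3, "Survival_time"]   # labels 0 to 3, end label included
by_position = toy.iloc[:3, 0]             # positions 0 to 2, end excluded
print(len(by_label), len(by_position))

# Reverse the rows: the index labels are now 4, 3, 2, 1, 0.
reversed_toy = toy.iloc[::-1]
print(reversed_toy.loc[:3, "Survival_time"].index.tolist())  # up to label 3
print(reversed_toy.iloc[:3, 0].index.tolist())               # first three rows
```

`.loc[]` always follows the row labels, whereas `.iloc[]` always counts rows from the top, whatever their labels are.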

### Conditional Subsetting

You can also subset a data frame using boolean vectors. This allows you to only include data that meets certain conditions.

For example, let's suppose we are interested in patients with a particularly large tumour.

In [None]:
metabric[metabric.Tumour_size >= 20]

`metabric.Tumour_size >= 20` evaluates to a series of True and False values. When it is put inside the `[]` brackets, pandas selects those rows where the series equals True. You can put anything inside the square brackets that evaluates to a series of True and False values with the same length as the number of rows in the data frame.

In [None]:
metabric.Tumour_size >= 20

More complicated conditions can be specified using `&` (AND), and `|` (OR).

Let's take a look at those who survived a long time despite having a large tumour.

In [None]:
metabric[(metabric.Tumour_size >= 20) & (metabric.Survival_time >= 100)]

Some useful bits of logical syntax:

- `==` equal
- `!=` not equal
- `>` greater than
- `>=` greater than or equal to
- `<` less than
- `<=` less than or equal to

- `&` and
- `|` or
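
As a quick illustration of combining conditions, the sketch below builds two boolean series on invented data and combines them. The `~` (NOT) operator is an addition that does not appear in the list above, but it is standard pandas syntax.

```python
import pandas as pd

toy = pd.DataFrame({
    "Tumour_size": [22.0, 10.0, 35.0, 25.0],
    "Survival_time": [140.5, 84.6, 30.2, 164.9],
})

# Each comparison gives one True/False value per row.
large = toy.Tumour_size >= 20
long_survival = toy.Survival_time >= 100

both = toy[large & long_survival]     # rows meeting both conditions
either = toy[large | long_survival]   # rows meeting at least one condition
not_large = toy[~large]               # ~ flips True and False

print(both.index.tolist())
print(either.index.tolist())
print(not_large.index.tolist())
```

When conditions are written inline rather than stored in variables, wrap each one in parentheses, because `&` and `|` are evaluated before comparisons such as `>=`.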

**<span style="color:blue">Exercise</span>**: Write a query to extract data on only those patients who have had both chemotherapy and radiotherapy. Can you compute the average tumour size for such patients? How does this compare to the tumour size for patients who haven't undergone therapy?

The `.isin()` method can help avoid cumbersome OR statements.

In [None]:
metabric[(metabric.Cohort==1) | (metabric.Cohort==4) | (metabric.Cohort==5)]

In [None]:
metabric[metabric.Cohort.isin([1,4,5])]

**<span style="color:blue">(slightly difficult) Exercise</span>**: Write a query to extract data on patients from cohort 1 with either the highest or second highest tumour stage (part of the exercise is to figure out what the top two tumour stages are! A method introduced in the final section of this session might be useful...). 

## Missing Data

### Identifying Missing Data

A missing value is represented in pandas by `NaN`. Typically calculations in pandas by default skip over `NaN` values.
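
A minimal sketch of this skipping behaviour on an invented series follows. The `skipna` argument is not discussed elsewhere in this session, but it is the standard switch for turning the behaviour off.

```python
import pandas as pd

# One of the three values is missing.
sizes = pd.Series([10.0, float("nan"), 30.0])

print(sizes.mean())               # NaN is skipped: (10 + 30) / 2
print(sizes.mean(skipna=False))   # NaN is kept, so the result is NaN
print(sizes.count(), len(sizes))  # count() ignores NaN, len() does not
```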

Recall the `.info()` method gives you information about the amount of missing data. 

In [None]:
metabric.info()

The location of the missing data can be found using the `isna()` method.

In [None]:
metabric.isna()

**<span style="color:blue">Exercise</span>**: Use `.sum()` to get a vector of the amount of missing data for each variable.

In [None]:
metabric.Cohort[metabric.Tumour_size.isna()]

### Dealing with Missing Data

#### Removal

There are many options for dealing with missing data, one of which is to simply remove samples or variables with incomplete information. The method for this is `dropna()`:

In [None]:
metabric.dropna()

The optional argument `axis` allows you to specify whether to remove rows (the default) or columns (achieved by setting `axis = 1`).

The optional argument `subset` allows you to select only certain variables to check for missing data in.

In [None]:
metabric.dropna(subset=['Tumour_size', 'Tumour_stage'])
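
To make the effect of `axis` and `subset` concrete, here is a small sketch on a toy data frame with invented values.

```python
import pandas as pd

nan = float("nan")
toy = pd.DataFrame({
    "Patient_ID": ["P-001", "P-002", "P-003"],
    "Tumour_size": [22.0, nan, 15.0],
    "Mutation_count": [nan, 2.0, 5.0],
})

rows_complete = toy.dropna()                      # drop rows with any NaN
cols_complete = toy.dropna(axis=1)                # drop columns with any NaN
size_known = toy.dropna(subset=["Tumour_size"])   # only check Tumour_size

print(rows_complete.shape)
print(cols_complete.columns.tolist())
print(size_known.index.tolist())
```

Dropping rows with any missing value can discard a lot of data, which is why restricting the check with `subset` is often the gentler choice.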

#### Replacement

Another option is to fill in the missing data with something sensible. This can be achieved via the method `.fillna()`.

In [None]:
metabric.fillna(value={'Tumour_size':metabric.Tumour_size.mean()})

**<span style="color:SeaGreen">For thought</span>**: 

- What might be the problem with using this method for e.g. Mutation Count?
- Can you think of any other ways to fill in missing values?

The process of filling in missing data with sensible guesses is called **Data Imputation**, and is a whole field of data science in its own right. Whilst it should obviously not be used for things like summary statistics, it can be very powerful when used as a pre-processing step for a ML algorithm, allowing you to extract much more information from your data.

## Grouping

Often a data set will contain natural groups, and you might wish to understand whether these groups display similar or different behaviour in other variables.

In [None]:
metabric.head()

In [None]:
metabric.mean(numeric_only=True) # numeric_only=True skips text columns

For example, we might be interested in the average of other variables within each cohort.

In [None]:
metabric.groupby('Cohort').mean(numeric_only=True) # numeric_only=True skips text columns

This graphic explains what's going on under the hood when you call `groupby(...).mean()`:

![image.png](03.08-split-apply-combine.png)

When you call `.groupby('Cohort')`, you can think of pandas as creating a new data frame for each value of `Cohort`, each consisting exclusively of the data from that cohort.

`.mean()` then computes the mean of each variable within each of these data frames, before stacking all the (now 1 row) data frames and returning this to you.
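
The split, apply and combine steps can be sketched by hand on invented data. This is only a conceptual illustration of the description above, not how pandas implements `groupby` internally.

```python
import pandas as pd

toy = pd.DataFrame({
    "Cohort": [1, 1, 2, 2, 2],
    "Tumour_size": [10.0, 20.0, 30.0, 40.0, 50.0],
})

means = {}
for cohort in toy["Cohort"].unique():
    sub_frame = toy[toy["Cohort"] == cohort]         # split: one cohort only
    means[cohort] = sub_frame["Tumour_size"].mean()  # apply: summarise it
by_hand = pd.Series(means, name="Tumour_size")       # combine: stack results

# The built-in equivalent gives the same numbers in one line.
built_in = toy.groupby("Cohort")["Tumour_size"].mean()

print(by_hand.tolist())
print(built_in.tolist())
```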

You can group by multiple variables:

In [None]:
metabric.groupby(['Cohort', 'Tumour_stage']).mean(numeric_only=True) # numeric_only=True skips text columns

## Multiple Data Sets

Often the information you need is spread across multiple data sets. Sometimes it is practical to deal with the data sets as separate data frames, but often it's easier to join them together. Pandas provides lots of useful functionality to help with this.

### Merging Data Sets

Let's suppose that we have some additional data on the mutation status of various genes.

In [None]:
mutations = pd.read_csv('metabric_mutation_data.csv')

In [None]:
mutations.head()

We might wish to combine this with the other data in metabric. The function for this is `pd.merge()`:

In [None]:
pd.merge(metabric, mutations, on='Patient_ID')

This even works if the rows of the different data frames do not match up!

Suppose that the data had been scrambled on its way to us, such that the rows got randomly permuted.

In [None]:
mutations_scrambled = pd.read_csv('m3t@Brik_MUTat10n_daTA.csv')

In [None]:
mutations_scrambled.head()

By providing the argument `on = "Patient_ID"`, pandas is able to work out which rows correspond to which. 

In [None]:
pd.merge(metabric, mutations_scrambled, on='Patient_ID')
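
Here is a small sketch of key-based matching on invented data, where the second table is in a different row order and does not cover exactly the same patients. The column `TP53_mutated` is a hypothetical name, and the `how` argument is an addition not covered above; it controls which patients are kept.

```python
import pandas as pd

clinical = pd.DataFrame({
    "Patient_ID": ["P-001", "P-002", "P-003"],
    "Cohort": [1, 1, 2],
})

# Different row order; P-003 is absent and P-004 is extra.
mutations_toy = pd.DataFrame({
    "Patient_ID": ["P-002", "P-004", "P-001"],
    "TP53_mutated": [1, 0, 0],   # hypothetical column, invented values
})

# Default (how="inner"): keep only patients present in both tables.
inner = pd.merge(clinical, mutations_toy, on="Patient_ID")

# how="left": keep every patient from the first table, filling gaps with NaN.
left = pd.merge(clinical, mutations_toy, on="Patient_ID", how="left")

print(inner)
print(left)
```

Rows are paired up by the value of `Patient_ID`, never by their position in the table.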

### Concatenating Data Sets

Suppose that we had originally received the data for each cohort separately.

In [None]:
metabric_cohort1 = metabric[metabric["Cohort"]==1]
metabric_cohort2 = metabric[metabric["Cohort"]==2]
metabric_cohort3 = metabric[metabric["Cohort"]==3]
metabric_cohort4 = metabric[metabric["Cohort"]==4]
metabric_cohort5 = metabric[metabric["Cohort"]==5]

Here the function we want is `pd.concat()`, which stacks a list of data frames on top of one another.

In [None]:
pd.concat([metabric_cohort1, metabric_cohort2, metabric_cohort3, metabric_cohort4, metabric_cohort5])
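
One detail worth knowing is that the stacked data frame keeps the original row labels. The sketch below, on invented data, shows this along with the `ignore_index` argument, an addition not covered above that renumbers the rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "Cohort": [1, 2, 1, 2],
    "Tumour_size": [10.0, 20.0, 30.0, 40.0],
})
cohort1 = toy[toy["Cohort"] == 1]
cohort2 = toy[toy["Cohort"] == 2]

stacked = pd.concat([cohort1, cohort2])                        # labels kept
renumbered = pd.concat([cohort1, cohort2], ignore_index=True)  # labels reset

print(stacked.index.tolist())
print(renumbered.index.tolist())
```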

## Replace

When working with categorical data, one may wish to change the labels used for the different categories. For example, a binary classification algorithm may require that the data come encoded as 0 and 1. The helpful method here is `.replace()`, which is available on both `Series` and `DataFrame` objects.

In [None]:
metabric.PR_status

In [None]:
metabric.PR_status.replace(to_replace={'Positive':1, 'Negative':0})

In [None]:
metabric.head()

This also works on whole data frames! For example, we can replace all instances of YES and NO with 1 and 0 respectively.

In [None]:
metabric.replace(to_replace={'YES':1, 'NO':0})

In [None]:
metabric.head()

Another thing you might want to do is to combine different categories.

In [None]:
metabric.Cancer_type.unique()

In [None]:
metabric.replace(to_replace={'Breast Invasive Ductal Carcinoma':'Breast Invasive', 
                             'Breast Invasive Lobular Carcinoma':'Breast Invasive', 
                             'Breast Invasive Mixed Mucinous Carcinoma':'Breast Invasive'}).Cancer_type

### Mutability

When we run `metabric.replace(to_replace={'YES':1, 'NO':0})`, this returns a version of `metabric` with all the `YES`s and `NO`s replaced with `1` and `0` respectively.

In [None]:
metabric.replace(to_replace={'YES':1, 'NO':0})

However if we then call `metabric.head()`, we see that `metabric` itself remains unchanged.

In [None]:
metabric.head()

Like most functions in `pandas`, `replace()` does not modify the data frame it is called upon, rather it returns a new data frame that is a modified version of the original one. If you want the change to be made permanent, please remember to re-assign.

In [None]:
metabric = metabric.replace(to_replace={'YES':1, 'NO':0})
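
The same behaviour can be checked on a toy data frame with invented values. The sketch first keeps the result under a new name, then re-assigns it to the original name.

```python
import pandas as pd

toy = pd.DataFrame({"Chemotherapy": ["YES", "NO", "YES"]})

# replace() returns a modified copy and leaves `toy` as it was.
recoded = toy.replace(to_replace={"YES": 1, "NO": 0})
before = toy.Chemotherapy.tolist()
print(before)
print(recoded.Chemotherapy.tolist())

# Re-assigning makes the change stick under the original name.
toy = toy.replace(to_replace={"YES": 1, "NO": 0})
print(toy.Chemotherapy.tolist())
```

Storing the result under a new name, as with `recoded`, is a reasonable alternative when you still need the original labels.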

In [1]:
metabric.head()


## Summary

We have learnt the basics of accomplishing the following tasks with `pandas`:

- Reading in data as a `DataFrame`.
- Selecting a subset of your data.
- Creating new columns and modifying existing ones.
- Summarising data sets via summary values, including separating out the summaries by groups within the data.
- Handling missing data.
- Handling data spread across multiple datasets.

## Exercises

Have a go at as many of these as you can prior to the exercise recap session on Friday.

<u>Exercises from the text</u> - These exercises appear in the above notebook (look for the blue **<span style="color:blue">Exercise</span>** prompts). They are best done as you review the material, but I have also collated them here for your convenience:

- Create a new column that returns the patient's age at their time of death if they have died, and otherwise returns their current age. **<span style="color:SeaGreen">Challenge</span>**: Do this without using a `for` loop.
- Write a query to extract data on only those patients who have had both chemotherapy and radiotherapy. Can you compute the average tumour size for such patients? How does this compare to the tumour size for patients who haven't undergone therapy?
- Write a query to extract data on patients from cohort 1 with either the highest or second highest tumour stage (part of the exercise is to figure out what the top two tumour stages are! A method introduced in the final section of this session might be useful...). 
- Use `.sum()` to get a vector of the amount of missing data for each variable.

<u>New Exercises</u>

1. This question requires a little bit of googling: 

    - What are the different values of the `Integrative_cluster` variable? 
    - Can you produce a series containing the number of patients in each cluster? (There is a useful method that will help you out with this, but I leave it to you to find out what that is).


2. As well as reading in data, pandas also allows you to write pandas data frames to other formats, such as csv files:

    - Read the dataset `metabric_clinical_and_expression_data.csv` and store its summary statistics into a new variable called `metabric_summary`.
    - Just as the `pd.read_csv()` function allows reading data from a file, `pandas` provides a `.to_csv()` method to write `DataFrames` to files. Write your summary statistics object into a file called `metabric_summary.csv`. You can use `help(metabric.to_csv)` to get information on how to use this method.
    - Use the help information to modify the previous step so that you can generate a Tab Separated Value (TSV) file instead.
    - Similarly, explore the method `to_excel()` to output an Excel spreadsheet containing summary statistics.


3. Some exercises involving conditional filtering. Write some Python code to answer the following questions:

    - Calculate the mean tumour size of patients, grouped by vital status and tumour stage.
    - In which cohort of patients and tumour stage is the average expression of the genes TP53 and FOXA1 the highest?
    - Do patients with greater tumour size live longer? How about patients with higher tumour stage? How about greater Nottingham_prognostic_index?


4. Review the section on missing data presented in the lecture. Consulting the [user's guide section dedicated to missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) and any other materials as necessary, use the functionality provided by pandas to answer the following questions:

    - Which variables (columns) of the metabric dataset have missing data? Are any of them missing substantially more data than others?
    - Find the IDs of patients who have missing tumour size and/or missing mutation count data. Which cohorts do they belong to?
    - For the patients identified to have missing tumour size data for each cohort, fill this in with the average tumour size of the other patients from the same cohort.


5. (Bonus) Try out pandas on a dataset from your own work or from literature/resources you have read/used recently and share with a colleague / rest of the class.

<u>For Later</u>

*Have a look at this exercise once you've completed the session on basic machine learning.*

Recall we discussed that one option for imputing missing data is to perform a linear regression with the response variable as the missing variable, and the inputs as some subset of the other variables. Have a go at implementing this. See the "Missing Data" section of this notebook for example usage of the `.fillna()` method.