### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 05 - Introduction to Pandas

*Written by:* Oliver Scott

**This notebook provides a general introduction to Pandas.**

Do not be afraid to make changes to the code cells to explore how things work!

### What is Pandas?

**[Pandas](https://pandas.pydata.org/)** is a Python package for data analysis, providing functions for analysing, cleaning and manipulating data. Pandas is probably one of the most important tools for data scientists and is the backbone of most data science projects using Python.

Pandas is built on top of NumPy, so much of the NumPy interface carries over to Pandas. Data manipulation often precedes further analysis with other Python packages, such as statistical analysis using [SciPy](https://www.scipy.org/), visualisation using [Matplotlib](https://matplotlib.org/) and machine learning using [scikit-learn](https://scikit-learn.org/stable/). These tools and others make up the Python scientific stack and are essential to learn for a career in informatics or data science. To be effective in Pandas it is essential to have a good grasp of the core concepts in Python (outlined in the first session) along with some familiarity with NumPy. If you get lost, it might be a good idea to look back through the material from the previous sessions.

In this notebook we will learn the basics of Pandas. Pandas is a huge package and deserves an entire lecture series of its own, so here we will learn the fundamentals, which you can build upon if you want to learn more.

-----

## Contents

1. [The Basics](#The-Basics)
2. [Creating DataFrames](#Creating-DataFrames)
3. [Reading Data](#Reading-Data)
4. [Essential Operations](#Essential-Operations)
5. [Slicing and Selecting](#Slicing-and-Selecting)

-----

#### Extra Resources:

- [Pandas Getting Started Guide](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html)

-----


## The Basics

Importing Pandas is no different to any other package/module. Pandas users often use the `pd` alias to keep code clean:


In [None]:
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 5.0])
s

### Core Components

Pandas has two core components, the `Series` and the `DataFrame`.

The `Series` can be imagined as a single column in a data table, whereas the `DataFrame` can be imagined as a full data table made up of multiple `Series`. Both types have a similar interface allowing a user to perform similar operations. DataFrames are similar to spreadsheets that you may have interacted with in software such as Excel. DataFrames are often faster, easier to use and more powerful than spreadsheets.

<p align="center">
  <img src="https://www.datasciencemadesimple.com/wp-content/uploads/2020/05/create-series-in-python-pandas-0.png?ezimgfmt=rs%3Adevice%2Frscb1-1" alt="Pandas DataFrame" width="70%"/>
  <br>
</p>

[Image source](https://www.datasciencemadesimple.com/wp-content/uploads/2020/05/create-series-in-python-pandas-0.png?ezimgfmt=rs%3Adevice%2Frscb1-1)
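
The relationship between the two types can be sketched with a few lines of code. This is a minimal illustration only: the values, the index labels (`'a'`, `'b'`, `'c'`) and the variable names are made up for the example.

```python
import pandas as pd

# Two Series sharing the same index (labels are made up for illustration)
ages = pd.Series([19, 38, 54], index=['a', 'b', 'c'], name='Age')
genders = pd.Series(['M', 'F', 'M'], index=['a', 'b', 'c'], name='Gender')

# Combine the Series into a DataFrame: each Series becomes one column
table = pd.DataFrame({'Age': ages, 'Gender': genders})

# Selecting a single column gives back a Series
age_column = table['Age']
print(type(table).__name__)       # DataFrame
print(type(age_column).__name__)  # Series

# Both types share much of their interface, for example .shape and .head()
print(table.shape)        # (3, 2)
print(age_column.shape)   # (3,)
print(age_column.mean())  # 37.0
```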

-----

## Creating DataFrames

There are numerous ways to create a DataFrame using the Pandas package. In most cases you will want to read in data from a particular file; however, DataFrames can also be constructed from scratch from lists, tuples, NumPy arrays or Pandas Series. Probably the simplest way, however, is from a Python dictionary (`dict`). Suppose we wanted to construct a table like the one below:

| PatientID | Gender | Age | Outcome  |
|-----------|--------|-----|----------|
| 556785    | M      | 19  | Negative |
| 998764    | F      | 38  | Positive |
| 477822    | M      | 54  | Positive |
| 678329    | M      | 22  | Negative |
| 675859    | F      | 41  | Negative |

We can construct this using a Python dictionary where each key corresponds to a column name and each list holds the values for that column, one per row. For this we can use the default constructor `pd.DataFrame()`. Notice that the output also contains an unnamed column with the numbers 0-4; this is the **index** of each row. You may also specify a custom index when constructing a DataFrame, for example `pd.DataFrame(data, index=['Tom', 'Joanne', 'Joe', 'Xander', 'Selena'])`, in which case the index holds the names of the patients.

In [None]:
# This is our dictionary containing the raw data
data = {
    'PatientID': [556785, 998764, 477822, 678329, 675859],
    'Gender': ['M', 'F', 'M', 'M', 'F'],
    'Age': [19, 38, 54, 22, 41],
    'Outcome': ['Negative', 'Positive', 'Positive', 'Negative', 'Negative']
}

# We can now construct a DataFrame like so:
df = pd.DataFrame(data)
df  # show the data
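
As a rough sketch of two other construction routes mentioned above, the snippet below builds the same table with a custom index and from a list of row tuples. The variable names `df_named`, `rows` and `df_rows` are only illustrative.

```python
import pandas as pd

data = {
    'PatientID': [556785, 998764, 477822, 678329, 675859],
    'Gender': ['M', 'F', 'M', 'M', 'F'],
    'Age': [19, 38, 54, 22, 41],
    'Outcome': ['Negative', 'Positive', 'Positive', 'Negative', 'Negative'],
}

# Custom index: there must be exactly one label per row
names = ['Tom', 'Joanne', 'Joe', 'Xander', 'Selena']
df_named = pd.DataFrame(data, index=names)
print(df_named.index.tolist())

# Row-wise construction: each tuple is one row, column names are given separately
rows = [
    (556785, 'M', 19, 'Negative'),
    (998764, 'F', 38, 'Positive'),
    (477822, 'M', 54, 'Positive'),
    (678329, 'M', 22, 'Negative'),
    (675859, 'F', 41, 'Negative'),
]
df_rows = pd.DataFrame(rows, columns=['PatientID', 'Gender', 'Age', 'Outcome'])
print(df_rows.index.tolist())  # default index: [0, 1, 2, 3, 4]
```

The dictionary form is organised by column and the tuple form by row; which one is more convenient usually depends on how the raw data arrives.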

Often you will be working with very large tables of data, making it impractical to view the whole table. Pandas provides the method `.head()` to display the first `n` rows and `.tail()` to display the last `n` rows (both default to `n=5`):

In [None]:
# Display the first three rows
df.head(n=3)

In [None]:
# Display the last two rows
df.tail(n=2)

Accessing an individual column is easy using the same syntax as a Python dictionary `dict`:

In [None]:
gender_column = df['Gender']
gender_column

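If the column label is a string that is also a valid Python identifier (no spaces, not starting with a digit) and does not clash with an existing DataFrame attribute or method, you may also use **dot-syntax** to access the column: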

In [None]:
age_column = df.Age
age_column
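
The limits of dot-syntax are easiest to see on a small example. The sketch below uses a made-up DataFrame whose column labels (`'Blood Pressure'` and `'count'`) were chosen purely to show where dot-syntax breaks down.

```python
import pandas as pd

# Hypothetical column labels chosen to show where dot-syntax breaks down
example = pd.DataFrame({
    'Age': [19, 38],
    'Blood Pressure': [120, 135],  # contains a space
    'count': [1, 2],               # same name as the DataFrame method .count()
})

print(example.Age.tolist())                # dot-syntax works for a simple label
print(example['Blood Pressure'].tolist())  # a label with a space needs brackets
print(callable(example.count))             # True: this is the method, not the column
print(example['count'].tolist())           # brackets always return the column
```

Bracket syntax works for every column label, so it is the safer default when column names come from an external file.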

## Reading Data

Reading and writing data from/to files in multiple formats is an essential part of the data analysis pipeline. Pandas can read data from many file formats, including CSV, JSON, Excel, SQL and [many more](https://pandas.pydata.org/pandas-docs/stable/reference/io.html).

In the folder `data` we have provided a dataset downloaded from the [UK government](https://coronavirus.data.gov.uk/details/cases?areaType=overview&areaName=United%20Kingdom) detailing the number of reported positive COVID-19 test results in the United Kingdom by date reported (up to 31 October 2021). The file is in the CSV format and can be read using the Pandas function `pd.read_csv()`:


In [None]:
cv_data_path = './data/data_2021-Oct-31.csv'  # This is the path to our data

cv_data = pd.read_csv(cv_data_path)
cv_data.head(n=10)

We could also easily write this DataFrame to a new CSV file using the method `df.to_csv()`:

```python
cv_data.to_csv('./data/coronavirus_testing_results.csv')
```

Give it a go. You could also try saving to a different [format](https://pandas.pydata.org/pandas-docs/stable/reference/io.html)!
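
To experiment with writing and reading without touching any files, the round trip can be sketched with an in-memory text buffer. The table below is made up and only stands in for `cv_data`; its column names are not those of the real dataset.

```python
import io
import pandas as pd

# A small made-up table standing in for cv_data
small = pd.DataFrame({'date': ['2021-10-30', '2021-10-31'], 'cases': [10, 12]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
small.to_csv(buffer, index=False)  # index=False leaves out the row index
print(buffer.getvalue())

# Rewind the buffer and read the CSV text back in
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())  # ['date', 'cases']

# With the default index=True the index is written as an extra, unnamed column
with_index = io.StringIO()
small.to_csv(with_index)
with_index.seek(0)
reread = pd.read_csv(with_index)
print(reread.columns.tolist())  # ['Unnamed: 0', 'date', 'cases']
```

Passing `index=False` is a common choice when the index is just the default 0, 1, 2, ... numbering, since it avoids an extra column appearing when the file is read back.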

## Essential Operations

Now that we have loaded some data into a `DataFrame` we can start performing operations to analyse it. Typically, once you have loaded some data you should view it to make sure that it looks correct and to get an idea of what values you will be dealing with. Since we have already covered viewing the data using `.head()`/`.tail()`, the next method you should probably run is `.info()`, which provides essential details about your dataset, including the number of rows and columns, the number of non-null values in each column, the type of data in each column and how much memory the data is taking up:


In [None]:
cv_data.info()

Notice that we have 6 columns, of which four are of type `object` (typically strings) and two are `int64` (integers); these types correspond to the types used in NumPy. The info also tells us that we have 2466 non-null values and no null values in this case. Knowing the datatype of our columns is very important as it determines what operations we can perform on each column (we wouldn't want to calculate the mean of a column containing strings). Just like NumPy, you can also use `.shape` to see the number of rows and columns as a tuple `(rows, columns)`:

In [None]:
cv_data.shape
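
As a small sketch of why column types matter, the example below inspects the types of a made-up table and restricts a calculation to the numeric column. The table and its column names are invented for illustration.

```python
import pandas as pd

# Made-up table with one text column and one integer column
example = pd.DataFrame({
    'area': ['England', 'Wales', 'Scotland'],
    'cases': [30, 10, 20],
})

print(example.dtypes)  # data type of each column

# .shape is a tuple, so it can be unpacked into rows and columns
n_rows, n_cols = example.shape
print(n_rows, n_cols)  # 3 2

# Numeric operations make sense on the integer column
print(example['cases'].mean())  # 20.0

# select_dtypes picks out columns by type, here only the numeric ones
numeric_only = example.select_dtypes(include='number')
print(numeric_only.columns.tolist())  # ['cases']
```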

#### Removing duplicate data

Often input data is noisy and needs cleaning up before we do any further analysis. Data frequently contains duplicated rows, which can distort the results of a statistical analysis. Luckily, Pandas has utilities for dealing with this problem easily. The data we have read does not contain any duplicated rows, so we will arbitrarily create some by duplicating the data and adding it to itself:

In [None]:
duplicated = pd.concat([cv_data, cv_data])  # here we have copied the data and added it to itself
duplicated.shape

Notice that we have to assign the result of `pd.concat()` to a new variable. The function returns a new DataFrame, so the original `cv_data` is left untouched. (`pd.concat()` replaces the older `DataFrame.append()` method, which was removed in pandas 2.0.) We can now easily drop the duplicates using the `.drop_duplicates()` [method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html). It is always a good idea to look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) to see what other arguments these functions accept.

In [None]:
duplicated = duplicated.drop_duplicates()
duplicated.shape

Notice that the shape is now the same as the original data. Also notice that we again assigned the result to a variable (here reusing the same name). Reassigning like this can become tedious, so many Pandas methods offer an `inplace` argument which, when set to `True`, modifies the original data rather than returning a modified copy.

```python
duplicated.drop_duplicates(inplace=True)  # no need to assign to a new variable
```
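
The documentation linked above lists further arguments such as `subset` and `keep`. The sketch below shows one way they might be used, on made-up records in which the same patient appears more than once.

```python
import pandas as pd

# Made-up records: rows 0 and 1 are identical, row 2 repeats the PatientID only
records = pd.DataFrame({
    'PatientID': [1, 1, 1, 2],
    'Outcome': ['Negative', 'Negative', 'Positive', 'Negative'],
})

# .duplicated() flags rows that repeat an earlier row
print(records.duplicated().tolist())  # [False, True, False, False]

# Default: rows must match in every column to count as duplicates
print(len(records.drop_duplicates()))  # 3

# subset: compare on PatientID only; keep='last' retains the final occurrence
latest = records.drop_duplicates(subset='PatientID', keep='last')
print(latest['Outcome'].tolist())  # ['Positive', 'Negative']
```

Checking with `.duplicated()` before dropping anything is a cheap way to confirm that the rows about to be removed are the ones you expect.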

#### Removing Null values (None)

Raw data commonly has missing values that you will need to deal with before further analysis.