# Pandas and Matplotlib Session

In this session, we will cover the basics of Pandas and Matplotlib, two packages used to manipulate, analyse and visualise data. The content herein aims to act as an introduction to Pandas/Matplotlib and is by no means exhaustive. Referring to the [Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) and [Matplotlib](https://matplotlib.org/stable/api/index) documentation is an excellent way to extend your knowledge of these two modules and find answers to edge cases.

Pandas is built on top of NumPy, a library for numerical computing in Python. We will only touch on NumPy briefly in this session. If you are interested in learning more about NumPy, please refer to the [NumPy documentation](https://numpy.org/doc/stable/reference/index.html).

___

Let's start by importing the modules required for the session.

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import time

To illustrate the benefit of using NumPy, take a look at the two code blocks below. Both cube a sequence of numbers, but the first uses a for loop to iterate through a list and the second uses a NumPy array to do the same in a single expression. The second method is much faster, especially for large datasets. Note also that multiplying a list by an integer repeats the list rather than multiplying each element.

In [None]:
number_list = [1, 2, 3, 4, 5, 6, 7, 8]

for num in number_list:
    print(num**3)

# Multiplying a list repeats it; it does not multiply each element
print(
    number_list * 3
)

In [None]:
number_array = np.array(
    [1, 2, 3, 4, 5, 6, 7, 8]
)

print(
  number_array ** 3
)
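
The difference between list and array arithmetic can be seen side by side in the short sketch below. The values are hypothetical toy numbers chosen only for illustration.

```python
import numpy as np

values = [1, 2, 3]          # toy data, not from the session's data set
arr = np.array(values)

repeated = values * 3       # list repetition, not multiplication
scaled = arr * 3            # element-wise multiplication

cubed_list = [v ** 3 for v in values]   # list comprehension version of the loop
cubed_arr = arr ** 3                    # element-wise power

print(repeated)
print(scaled)
print(cubed_list, cubed_arr)
```

The list comprehension is the idiomatic pure-Python equivalent of the for loop; the array expression does the same work without an explicit loop.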

To demonstrate the runtime improvement of the NumPy array over the Python list, we will use the `time` module, which lets us record a timestamp before and after a block of code. We will use it to measure the time it takes to cube every element of a list using a for loop and the time it takes to cube every element of a NumPy array in a single vectorised operation.

In [None]:
number_list = list(range(0, 1000001))
number_array = np.array(range(0, 1000001))

# Find time taken to cube elements in list
begin_time = time.time()
cubes_list = []
for i in number_list:
    cubes_list.append(i**3)
print(time.time() - begin_time)

# Find time taken to cube elements in numpy array
begin_time = time.time()
cubes_array = number_array ** 3
print(time.time() - begin_time)
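
A single timestamp difference can be noisy. As an alternative, the standard-library `timeit` module runs a snippet several times and reports the total time. The sketch below is one possible way to set this up; the helper names `cube_list` and `cube_array` and the array size are assumptions made for illustration.

```python
import timeit

import numpy as np

n = 100_000                      # smaller than the cell above, to keep this quick
number_list = list(range(n))
number_array = np.arange(n)


def cube_list():
    """Cube every element with a pure-Python list comprehension."""
    return [i ** 3 for i in number_list]


def cube_array():
    """Cube every element with a vectorised NumPy operation."""
    return number_array ** 3


# Each measurement runs the function 10 times; the best of 3 repeats is kept
list_seconds = min(timeit.repeat(cube_list, number=10, repeat=3))
array_seconds = min(timeit.repeat(cube_array, number=10, repeat=3))

print(f"list:  {list_seconds:.4f} s for 10 runs")
print(f"array: {array_seconds:.4f} s for 10 runs")
```

Taking the minimum of several repeats reduces the influence of other processes running on the machine.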

For this session, you will need the data set named ```pdb_data_no_dups.csv.gz```, which should be found in this folder's `data` directory. 
___

## Pandas introduction

Pandas is a Python module, underpinned by NumPy, that enables efficient handling of tabular data structures, which are stored as DataFrame objects. These can either be created within a Python script or imported from separate files such as CSVs or Excel spreadsheets.
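
Both routes to a DataFrame can be sketched with a few made-up records. The column names and values below are hypothetical, and `io.StringIO` stands in for a CSV file on disk so that the example is self-contained.

```python
import io

import pandas as pd

# Route 1: build a DataFrame inside a script from a dictionary of columns
from_dict = pd.DataFrame({
    "structure_id": ["A1", "B2", "C3"],   # hypothetical identifiers
    "resolution": [1.8, 2.5, 3.1],        # hypothetical values
})

# Route 2: import tabular text; a real file path would be passed the same way
csv_text = "structure_id,resolution\nA1,1.8\nB2,2.5\nC3,3.1\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_dict)
print(from_csv)
```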

Here, we will be working with a data set of summary information on protein structure determination, taken from [Kaggle](https://www.kaggle.com/shahir/protein-data-set), to introduce basic Pandas commands.

We will start by importing the data set into this Jupyter notebook and assigning it as a DataFrame to the variable ```df```.

In [3]:
df = pd.read_csv('data/pdb_data_no_dups.csv.gz')

In [None]:
df.head()

The attribute ```.columns``` returns the column names of the DataFrame as an array-like ```Index``` object. Let's apply this to our ```df``` DataFrame.

In [None]:
df.columns

### Task 1
Convert the array of columns into a list and assign it to a Python variable.



In [None]:
### Task 1 ###


___
The top five rows of the Pandas DataFrame can be returned by using the method ```.head()```. Run ```.head()``` on your ```df``` to get a brief insight into the data.

In [None]:
### Enter code below


The attribute ```.index``` can give you some information on the number of entries in the DataFrame. Apply this attribute to your DataFrame. Is the result useful?


In [None]:
### Enter code below


### Task 2
Determine the number of rows (entries) in your ```df``` DataFrame.

In [None]:
### Task 2 ###


___
DataFrames are typically structured with a single Python data type assigned to each column. In other words, columns containing numeric data should not include strings (e.g. `1, 2, 3, 'four'` is not good practice). What Python function is used to return the data type of the variables below?

In [None]:
example_string = 'Hello_world'
example_integer = 1234
example_float = 1.234

### Enter code below


Pandas allows you to check the data type contained within a column using the attribute ```.dtype```. 

In [None]:
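# A column is selected from a DataFrame by name, much like a value
# is looked up in a dictionary by key
dict_1 = {
    "key1" : [1, 2, 3, 4],
    "key2" : [10, 20, 30, 40]
}
dict_1["key1"]

# Data type of the values stored in the 'resolution' column
df["resolution"].dtype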
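
To see why mixing types within a column matters, the sketch below compares a clean numeric column with one containing a stray string. The values are toy data, and `pd.to_numeric` with `errors="coerce"` is shown as one possible way to recover numbers, not as the only approach.

```python
import pandas as pd

clean = pd.Series([1, 2, 3, 4])
mixed = pd.Series([1, 2, 3, "four"])   # one string among the integers

print(clean.dtype)   # an integer dtype
print(mixed.dtype)   # object: Pandas falls back to generic Python objects

# Values that cannot be parsed as numbers become NaN
coerced = pd.to_numeric(mixed, errors="coerce")
print(coerced)
```

Because `NaN` is a floating-point value, the coerced column is stored as floats rather than integers.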


Apply `.dtypes` (note the plural) to your entire DataFrame to list the data type of every column. Are the results what you would expect considering your previous insight into the DataFrame?

In [None]:
# Enter code below

Another useful feature of Pandas is the ability to find all the unique values contained within a column. Square brackets ```[]``` are used to select a single column from a DataFrame and the method ```.unique()``` returns the non-duplicate values. Run the cell below to see what the code returns.

In [None]:
df['experimentalTechnique'].unique()

### Task 3
What types of macromolecules are included in your DataFrame?

How could you clean this data?

In [None]:
### Task 3 ###


___
As well as returning unique entries in a column, we can find their frequencies using the method ```.value_counts()```.
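
As a small illustration on made-up labels (not the session's data), `.value_counts()` returns counts sorted from most to least frequent, and the `normalize=True` argument returns fractions instead.

```python
import pandas as pd

# Hypothetical labels used only for illustration
techniques = pd.Series(["X-RAY", "NMR", "X-RAY", "EM", "X-RAY", "NMR"])

counts = techniques.value_counts()                    # absolute frequencies
fractions = techniques.value_counts(normalize=True)   # relative frequencies

print(counts)
print(fractions)
```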

### Task 4

Apply ```.value_counts()``` to an appropriate column of your choice.

In [None]:
### Task 4 ###


___
Often, we only want to work with numeric data from a DataFrame. Columns containing strings can confound analysis if they are not removed. We can create a new DataFrame that contains only floats and integers using the following code:

In [None]:
# DataFrame copy made of df, containing only integers and floats.
df_numeric = df.select_dtypes(include = ["int64", "float64"])

# Displaying new DataFrame.
df_numeric
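
Listing exact type names works when you know the column types in advance. As a hedged alternative, `select_dtypes` also accepts the generic selector `"number"`, which matches any integer or float width. The toy DataFrame below is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "chains": [1, 2, 4],                  # integers
    "resolution": [1.8, 2.5, 3.1],        # floats
    "technique": ["X-RAY", "NMR", "EM"],  # strings
})

# Keep every numeric column regardless of its exact integer or float width
numeric = toy.select_dtypes(include="number")

print(numeric.columns.tolist())
```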

### Task 5
Modify the code above to create a copy of ```df``` that contains only string-type data.

In [None]:
### Task 5 ###


___
A quick summary of numeric data within a DataFrame can be obtained using the method ```.describe()```. Apply this to your ```df```. Ask your supervisor how you can transpose the DataFrame to move your summary statistics returned by ```.describe()``` to the column position.

In [None]:
# Enter code below


A major problem that arises when working with unfamiliar data sets is the presence of missing values, which appear as ```NULL```, ```NA``` or ```NaN```. Rows with any of these values can be removed using the ```.dropna()``` method.
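
The behaviour of `.dropna()` can be tuned with its `how` and `subset` arguments. The sketch below uses a hypothetical four-row DataFrame to show how each option changes which rows survive.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "resolution": [1.8, np.nan, 3.1, np.nan],
    "year": [2001, 2005, np.nan, np.nan],
})

missing_per_column = toy.isna().sum()                   # NaN count in each column

any_missing_dropped = toy.dropna()                      # drop rows with at least one NaN
all_missing_dropped = toy.dropna(how="all")             # drop rows where every value is NaN
resolution_present = toy.dropna(subset=["resolution"])  # only check one column

print(missing_per_column)
print(any_missing_dropped)
print(all_missing_dropped)
print(resolution_present)
```

Note that `.dropna()` returns a new DataFrame and leaves the original unchanged.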

How many entries of your original ```df``` DataFrame are free from missing values for all parameters?

In [None]:
# Enter code below


Does the presence of missing values in a DataFrame affect summary statistics? Write some code in the cell below to test your hypothesis.

In [None]:
# Enter code below


___

The final feature of Pandas we will look at is selecting entries by position and by boolean condition. The indexer ```.iloc[]``` returns the entry at a given integer position. A second argument can be passed to ```.iloc[]``` to return that entry's value in a given column. Pandas, like Python, uses a starting index of ```0```. Run the code below to see ```.iloc[]``` in action.

In [None]:
entry_two = df.iloc[1]                   # Returns the entire second row (entry)

entry_three_column_four = df.iloc[2, 3]  # Returns the single value in row 3, column 4

print(entry_two)
print('-------')                         # Arbitrary separator
print(entry_three_column_four)
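
The shape of what `.iloc[]` returns depends on whether it is given single positions or slices. A minimal sketch on a hypothetical toy DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["A1", "B2", "C3", "D4"],       # hypothetical identifiers
    "year": [2001, 2005, 2010, 2015],
})

second_row = toy.iloc[1]        # one position: a Series holding the whole row
single_value = toy.iloc[2, 1]   # row and column positions: a single value
row_slice = toy.iloc[1:3, 1]    # slice of rows, one column: a Series

print(second_row)
print(single_value)
print(row_slice)
```

As with Python slicing, the end of the slice `1:3` is excluded, so only the rows at positions 1 and 2 are returned.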

Another useful method is ```.loc[]```. It works in a similar way to ```.iloc[]``` but is used as a logical entry selector. Boolean operators are used with ```.loc[]``` to exclusively select entries based on values in a specified column. Run the code below to get a feel for ```.loc[]```.  

In [None]:
# Select only entries solved by X-ray diffraction and assign to new DataFrame
xray_structures = df.loc[df['experimentalTechnique'] == 'X-RAY DIFFRACTION']
print(len(xray_structures))

# Select only entries solved by X-ray diffraction that are also DNA-based
dna_xray_structures = df.loc[(df['experimentalTechnique'] == 'X-RAY DIFFRACTION') & (df['classification'] == 'DNA')]
print(len(dna_xray_structures))

# Shows the DataFrame contains only one unique entry in the 'experimentalTechnique' field.
xray_structures['experimentalTechnique'].unique()

Note that the second variable is defined using the element-wise boolean operator ```&```. The element-wise booleans in Pandas for AND, OR and XOR are ```&```, ```|``` and ```^```, respectively.


It is good practice to use ```.loc[]``` to conditionally select entries from a DataFrame, although it can be omitted (e.g. ```df[df['experimentalTechnique'] == 'X-RAY DIFFRACTION']```) and gives the same result.
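
The element-wise operators can be tried out on a small example. The sketch below uses hypothetical toy values; note that each comparison must be wrapped in parentheses (or stored in its own variable, as here) because `&`, `|` and `^` bind more tightly than `==`.

```python
import pandas as pd

toy = pd.DataFrame({
    "technique": ["X-RAY", "NMR", "X-RAY", "EM"],
    "kind": ["DNA", "Protein", "Protein", "DNA"],
})

is_xray = toy["technique"] == "X-RAY"   # boolean Series, one value per row
is_dna = toy["kind"] == "DNA"

both = toy.loc[is_xray & is_dna]           # AND
either = toy.loc[is_xray | is_dna]         # OR
exactly_one = toy.loc[is_xray ^ is_dna]    # XOR
neither = toy.loc[~(is_xray | is_dna)]     # NOT, using the ~ operator

# Omitting .loc[] selects the same rows
same_without_loc = toy[is_xray & is_dna].equals(both)

print(both)
print(same_without_loc)
```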


### Task 6

What year was the 100th protein structure (```macromoleculeType```) solved solely by ```ELECTRON MICROSCOPY```?

In [None]:
### Task 6 ###


___

## Matplotlib introduction

This section will introduce some of the basics of Matplotlib. It is by no means an exhaustive overview of the package and you should look through the [Matplotlib documentation](https://matplotlib.org/) when working on your own projects in the future.

The terminology of Matplotlib can cause confusion for those unfamiliar with the package. The term ***figure*** refers to the entire canvas of a graph, which is separate from the plot area. An ***axis*** (an `Axes` object in Matplotlib's own terminology) is placed onto the Matplotlib *figure* and defines the area onto which your data will be plotted. A Matplotlib *axis*, ambiguously, includes both x- and y-axes. Multiple *axes* can be plotted onto a single figure.
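
The relationship between a figure and its axes can be sketched as follows. The plotted numbers are arbitrary, and the `matplotlib.use("Agg")` line is an assumption that lets the example run without a display; it should be left out inside a Jupyter notebook.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# One figure holding a 1x2 grid of axes
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

axes[0].plot([0, 1, 2], [0, 1, 4])     # data drawn on the left axes
axes[0].set_title("left axes")

axes[1].plot([0, 1, 2], [4, 1, 0])     # data drawn on the right axes
axes[1].set_title("right axes")

n_axes = len(fig.axes)                 # the figure keeps track of its axes
print(n_axes)
```

Each element of `axes` has its own x- and y-axis, titles and plotted data, while `fig` controls properties of the whole canvas such as its size.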

Let's start by plotting a histogram of the resolution data in the DataFrame. By doing so, we can get an insight into how the data is distributed.

Run the code below. What does each line do? Speak with your supervisor for clarification.

In [None]:
# Initialising figure as a 1x1 grid of axes
resolution_fig, ax_res_hist = plt.subplots(1, 1)

# Define number, size and range of bins
res_bins = np.arange(0, 5, 0.2)

# Plot histogram of non-NaN resolution data using Pandas
ax_res_hist.hist(df['resolution'].dropna(),
                 bins=res_bins
                )

plt.show()         # Displays all axes on figure

### Task 7

Plot a histogram for a numerical parameter from ```df``` of your choice. Inspect the [Matplotlib documentation](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.hist.html?highlight=hist#matplotlib.axes.Axes.hist) for this form of histogram and add additional arguments into your code to format the graph's appearance.

Axis labels can also be added using the methods ```.set_xlabel()``` and ```.set_ylabel()```.
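
A minimal sketch of labelling and formatting, using synthetic random numbers rather than the session's data. The colours, bin count and label text are arbitrary choices, and the `matplotlib.use("Agg")` line should be omitted inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values drawn from a normal distribution, for illustration only
rng = np.random.default_rng(0)
values = rng.normal(loc=2.0, scale=0.5, size=200)

fig, ax = plt.subplots(1, 1)
ax.hist(values, bins=20, color="steelblue", edgecolor="black")

ax.set_xlabel("Value (arbitrary units)")   # label for the x-axis
ax.set_ylabel("Frequency")                 # label for the y-axis
ax.set_title("Synthetic example")
```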

In [None]:
### Task 7 ###


___
Scatter plots can be a good way to identify trends or groups within data.

Run the code below. Is it valid to say the resolution of solved macromolecules has been worsening over time? What reasons are there for the trends observed? Discuss with your supervisor.

In [None]:
# Initialise figure with 1x1 grid of axes
res_year_fig, res_year_ax = plt.subplots(1, 1)

# Drop rows with NaN/NULL values in the resolution or publicationYear columns and
# keep only these two columns, saving computer memory

year_resolution_df = df.dropna(subset=['resolution', 'publicationYear'])[['resolution', 'publicationYear']]

res_year_ax.scatter(year_resolution_df['publicationYear'],   # Plot x-axis data
                    year_resolution_df['resolution']         # Plot y-axis data
                   )

plt.show()   # Display figure

### Task 8

Repurpose the code above to plot *resolution* as a function of *molecular weight*.

Refer to the [Matplotlib documentation](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.scatter.html?highlight=scatter#matplotlib.axes.Axes.scatter) for scatter plots and add some formatting to change the plot's appearance.

In [None]:
### Task 8 ###

fig, axes = plt.subplots(1, 2)
# axes = (<AxesSubplot:> <AxesSubplot:>)
print(axes)

axes[0].scatter(df["publicationYear"], df["resolution"])
axes[1].scatter(df["structureMolecularWeight"], df["resolution"])


___
The final plot we will look at is the bar plot, specifically a horizontal bar plot. Run the code below. Discuss the code with your supervisor if anything is unclear.

In [None]:
# Setting the style of figure
plt.style.use('ggplot')

# Initialise figure with 1x1 grid of axes
fig, ax = plt.subplots(1, 1)

ax.barh(y=df['experimentalTechnique'].value_counts().index,   # Labels added to y-axis
        width=df['experimentalTechnique'].value_counts(),     # Frequency added to x-axis
        log=True         # The difference between frequencies spans several orders of magnitude. Using a semi-log scale makes the data easier to interpret here
       )                 # Comment out 'log=True' to highlight its usefulness here.

plt.show()   # Display figure

### Task 9

What is the most commonly solved type of macromolecule? Plot a bar graph to get your answer.

In [None]:
### Task 9 ###
