# Data Analytics with High Performance Computing - Practical 1
## Introduction to Python for Data Analytics

In DAwHPC we will adopt Python for our practical activities. IPython Notebooks offer an ideal platform for integrating blocks of code with blocks of text written in Markdown.

IPython Notebooks can be read and interacted with in a number of packages. 

Here, we'll use Jupyter, which offers the classic Jupyter Notebook interface and the next-generation JupyterLab. Both can be run locally or on remote servers under Windows, Linux or Mac OSX.
Another Jupyter package is JupyterHub, which provides a team environment for Jupyter Notebook or JupyterLab.

### Learning outcomes

This practical covers the introductory concepts you'll need to use Python for data analytics. It involves three packages:
*  the Pandas library, for data cleaning, data preparation and data analysis
*  the NumPy library, which Pandas builds upon, for working with numerical data in the form of NumPy arrays
*  the Matplotlib library, a package used for general data visualisation in Python

### 1. Getting started

The first part of any Python script is the imports.

We'll import all of the packages we require, setting a shorthand name for each package at the same time.

To run the block of code below, press <kbd>Ctrl</kbd>+<kbd>⏎ Enter</kbd> (Windows/Linux) or <kbd>⌘ CMD</kbd>+<kbd>⏎ Enter</kbd> (OSX) - you can also press <kbd>⇧ Shift</kbd>+<kbd>⏎ Enter</kbd> to run the block and move to the next block.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Pandas provides two main objects: DataFrames and Series.

A DataFrame is an indexed table of rows and columns, consisting of row and column titles and data. The rows and columns are referred to as the 'axes' of the DataFrame, where rows are axis = 0 and columns are axis = 1.

Each row and column is a Series object, which consists of data, in the form of a one-dimensional NumPy `ndarray` and an index, which can be provided as an array-like object or automatically generated by Pandas.
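
As a minimal sketch of the two ways a Series can obtain its index, the example below builds one Series with explicit labels and one without. The names `s_labelled` and `s_default` are illustrative only and are not used elsewhere in this practical.

```python
import numpy as np
import pandas as pd

# Explicit index: the labels are supplied as a Python list
s_labelled = pd.Series([10, 20, 30], index=['w', 'x', 'y'])

# No index given: Pandas generates the integer index 0, 1, 2
s_default = pd.Series([10, 20, 30])

print(type(s_labelled.to_numpy()))  # the data as a one-dimensional NumPy array
print(s_labelled.index)
print(s_default.index)
```

A value can then be looked up by its label, for example `s_labelled['x']`.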

In [None]:
# Let's start by looking at a Pandas DataFrame
# DataFrames can be built in a number of ways - in this first example we'll use a Python Dictionary, or dict, object.
mydict = {'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [3, 6, 9, 12], 'd': [4, 8, 12, 16]}
df1 = pd.DataFrame(data=mydict,index=['w', 'x', 'y', 'z'])
# Notice that the keys from mydict become column titles, and the values in the Python List object passed as the index become row names.
df1

Some properties and attributes of the DataFrame:

In [None]:
print('Length:', len(df1)) # The built-in Python len() function gives the length of an array-like object. For multi-dimensional arrays, the length of axis 0 is given
print('Size:', df1.size) # The DataFrame.size attribute gives the number of elements in the DataFrame
print('Titles: ', df1.index, df1.columns) # Gives the Pandas Index objects holding the row and column names
print('Datatypes:', df1.dtypes) # Gives the datatypes of each column
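
Because `df1` is square, it cannot show which axis is which. The sketch below uses a small non-square DataFrame (the name `demo` and its values are made up for illustration) to show how `len()`, `size`, `shape` and the `axis` argument relate to rows and columns.

```python
import pandas as pd

# Hypothetical 2-row, 3-column DataFrame
demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}, index=['w', 'x'])

n_rows, n_cols = demo.shape       # (length of axis 0, length of axis 1)
col_totals = demo.sum(axis=0)     # work down axis 0 (the rows): one total per column
row_totals = demo.sum(axis=1)     # work along axis 1 (the columns): one total per row

print(len(demo), demo.size, demo.shape)
print(col_totals)
print(row_totals)
```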

### 2. Selecting and retrieving data from DataFrame objects using indexing and location
The first level of retrieving data from a DataFrame is to pass a column name as a key, similarly to a Python dict.

In [None]:
df1['a']

In [None]:
mydict['a']

Notice column 'a' of `df1` is a Pandas Series object, whereas the value corresponding to the key 'a' in `mydict` is a Python List object.

In [None]:
type(df1['a'])

In [None]:
type(mydict['a'])

Note: row names (the `index`) cannot be used in this key-value way at the DataFrame level, so the following raises a KeyError exception:

In [None]:
df1['w']
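
If you want to check for a label without stopping the notebook on an exception, one possible approach is sketched below. It uses a small stand-in DataFrame called `demo` (a name chosen here for illustration) so that `df1` is left untouched.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8]},
                    index=['w', 'x', 'y', 'z'])

try:
    demo['w']                  # 'w' is a row name, not a column title
    found = True
except KeyError as err:
    found = False
    print('KeyError:', err)

# Membership tests show where the label actually lives
print('w' in demo.columns)
print('w' in demo.index)
```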

However, a row name can be passed to the Series object corresponding to a column:

In [None]:
# Either directly
print(df1['a']['w'])
# or by creating a new Series object
column_a = df1['a']
print(column_a['w'])

The next level is to use Pandas indexing.

Locating specific data in DataFrames can become tricky and ambiguous. For this reason, Pandas provides four indexers to help (from the Pandas documentation):
*  DataFrame.at  -  access a value given a row/column pair
*  DataFrame.iat -  access a value given a row/column pair using integer position
*  DataFrame.loc -  access a group of rows and columns by label(s) or a boolean array    
*  DataFrame.iloc -  purely integer-location based indexing for selection by position

In [None]:
df1.at['w','a'] # note the row is given first

In [None]:
df1.iat[0,0]

In [None]:
df1.loc['w']

Lists can also be passed to `loc` or `iloc` to access more than one row:

In [None]:
df1.loc[['w','y']]

Note that with `DataFrame.loc` the ambiguity of the (incorrect) `df1['w']` is removed. This is because `loc` always treats its first argument as row labels, so `df1.loc['w']` retrieves all column data for row 'w' and returns it as a new Series object.

In [None]:
df1.iloc[1:3] # here using a slice meaning integer indices 1 <= i < 3
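
One detail worth seeing side by side is that a label slice with `loc` includes its end label, whereas an integer slice with `iloc` excludes its end position. The sketch below illustrates this on a stand-in DataFrame named `demo` (an illustrative name), and also shows rows and columns being selected together.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8]},
                    index=['w', 'x', 'y', 'z'])

by_label = demo.loc['x':'y']      # label slice: both 'x' and 'y' are included
by_position = demo.iloc[1:3]      # integer slice: positions 1 and 2 only

cell = demo.loc['y', 'b']         # one row label and one column label
block = demo.iloc[[0, 3], [1]]    # lists of row and column positions

print(by_label.equals(by_position))
print(cell)
print(block)
```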

### 3. Retrieving data using comparisons
A very useful way to fetch data from a DataFrame is by using a comparison function, which performs a boolean operation on the data before returning a DataFrame object corresponding to the result of that operation.

Let's have a look at this with a new DataFrame object.

In [None]:
df2 = pd.DataFrame(data = [np.add(np.arange(1,15),10),np.linspace(2000,10000,14),np.random.random(14)]).transpose()
df2

Here we have instantiated a DataFrame object using three NumPy arrays:

0. we generate an array with a range of 1 <= x < 15, then add 10 to each element in that array,
1. we generate an array containing 14 numbers linearly spaced with 2000 <= x <= 10000,
2. we generate an array of 14 random numbers with 0 <= x < 1.

Finally, the function DataFrame.transpose() is applied to swap rows and columns.

Notice that, because neither `index` nor `columns` was passed to the DataFrame constructor, an integer index is generated and the column names are also integers. Pandas does this automatically whenever the data carry no labels of their own.
This could get confusing, so let's set the column names.

In [None]:
df2.columns = ['a','b','c']
df2
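
As an alternative sketch (not the approach used above), the same kind of table can be built from a dict of arrays, which names the columns at construction time and needs no transpose. The name `demo` and the seeded generator `rng` are choices made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # seeded so the random column is repeatable

demo = pd.DataFrame({
    'a': np.arange(1, 15) + 10,          # 11, 12, ..., 24
    'b': np.linspace(2000, 10000, 14),   # 14 evenly spaced values, end points included
    'c': rng.random(14),                 # 14 random values with 0 <= x < 1
})

print(demo.dtypes)   # each column keeps the datatype of its own array
demo.head()
```

Built this way, each column keeps the datatype of the array it came from.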

DataFrame.where() and Series.where() allow conditional retrieval of data:

In [None]:
#Return the whole DataFrame with a single condition
df2.where(df2 < 16)

In [None]:
#Return a boolean Series, also known as a Pandas `mask`
df2.loc[:]['a'] < 16 ## [:] means "all rows"

In [None]:
#Return a DataFrame given a condition (notice the condition in the mask shown above)
df2.where(df2.loc[:]['a'] < 16) ## [:] means "all rows"

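Alternatively, use a Pandas `query`, which takes the whole condition as a string. Unlike `where`, which keeps every row and fills those that fail the condition with NaN, `query` returns only the rows that satisfy the condition.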

In [None]:
df2.query("a < 16")
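
The three selection styles give results of different shapes, which the sketch below makes explicit on a small stand-in DataFrame (`demo` and its values are illustrative).

```python
import pandas as pd

demo = pd.DataFrame({'a': [11, 14, 17, 20], 'b': [1.0, 2.0, 3.0, 4.0]})

mask = demo['a'] < 16             # boolean Series, one entry per row

masked = demo.where(mask)         # same shape as demo; failing rows become NaN
filtered = demo[mask]             # boolean indexing: failing rows are dropped
queried = demo.query('a < 16')    # same rows as boolean indexing

print(masked)
print(filtered)
print(filtered.equals(queried))
```

Use `where` when the result must stay aligned with the original table, and boolean indexing or `query` when only the matching rows are wanted.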

### 4. Performing arithmetic operations
Like the NumPy arrays from which they are constructed, Pandas DataFrames and Series can undergo whole-object operations. A few are demonstrated here.

In [None]:
#Recall df1
df1

In [None]:
df1.add(3)

In [None]:
df1.mul(2)

In [None]:
df1.mod(2)

In [None]:
#Combining operations with DataFrame.where()
df1.where(df1.mod(2) == 0)

In [None]:
#Combining Pandas and NumPy operations with DataFrame.where()
#Find entries that are perfect squares
df1.where(np.sqrt(df1).mod(1) == 0)
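
The method forms used above have operator equivalents. The sketch below checks this on a stand-in DataFrame (`demo` is an illustrative name) and shows one thing the method form adds: an `axis` argument for combining a DataFrame with a Series.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8]})

# Operators and methods give the same result for a scalar operand
same_add = (demo + 3).equals(demo.add(3))
same_mul = (demo * 2).equals(demo.mul(2))
same_mod = (demo % 2).equals(demo.mod(2))
print(same_add, same_mul, same_mod)

# Subtract column 'a' from every column, matching on the row index (axis=0)
relative = demo.sub(demo['a'], axis=0)
print(relative)
```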

### 5. Aggregating and Grouping
Pandas allows us to aggregate (or agg for short) data along a chosen axis according to one or more operations.

In [None]:
df1.agg('sum')

In [None]:
df1.agg('sum',axis=1)

In [None]:
df1.agg(['sum','mean','min'])
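
`agg` also accepts a dict that maps column titles to operations, so that different columns can be aggregated in different ways. A minimal sketch, using an illustrative stand-in DataFrame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8]})

# One operation per column: the result is a Series indexed by column title
per_column = demo.agg({'a': 'sum', 'b': 'mean'})
print(per_column)

# A list of operations: the result is a DataFrame with one row per operation
table = demo.agg(['sum', 'mean'])
print(table)
```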

Grouping of rows with the same value in a given column is very useful.
`DataFrame.groupby()` performs this operation.

In [None]:
# First, we'll create a new DataFrame containing copies of df1 and df1*2.
# Note we now have repeated values in columns and repeated index values
df3 = pd.concat([df1,df1.mul(2)])
df3

In [None]:
# `level=0` groups by index
# We must apply an aggregation (here, sum) to combine the values in each group.
df3.groupby(level=0).sum()

In [None]:
# Grouping by a column moves that column into the index of the result
# Here we group by common values in column 'b', taking the mean of the grouped values.
df3.groupby(by=['b']).mean()
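
If the grouping column should remain an ordinary column, `groupby` takes an `as_index=False` argument. The sketch below compares the two behaviours and applies several aggregations at once; the DataFrame `demo` and its column names `group` and `value` are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'group': ['x', 'y', 'x', 'y'], 'value': [1, 2, 3, 4]})

indexed = demo.groupby('group').sum()                  # 'group' becomes the index
flat = demo.groupby('group', as_index=False).sum()     # 'group' stays a column

# Several aggregations of one column, computed per group
stats = demo.groupby('group')['value'].agg(['sum', 'mean'])

print(indexed)
print(flat)
print(stats)
```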

### 6. Plotting data (from dataframes)

Recall we ran `import matplotlib.pyplot as plt`

We will now use this package to plot some data.
Matplotlib is vast and powerful. Here we will attempt to scratch the surface.

At the simplest level, the pyplot API can be used.

In [None]:
#plt.plot() produces a 2-D line plot
plt.plot([0,5,7,2,9,3]) # plot numbers from a 1-D Python List

In [None]:
#If we provide a Pandas DataFrame, a line is plotted for each column, with the rows along the x-axis
plt.plot(df1,marker='o')
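
To confirm which way round the data are drawn, the sketch below inspects the line objects that `plot` returns: there is one per column, and each can be given a legend label. The stand-in DataFrame `demo` is illustrative and deliberately non-square.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical 3-row, 2-column DataFrame
demo = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6]}, index=['w', 'x', 'y'])

fig, ax = plt.subplots()
lines = ax.plot(demo, marker='o')    # a list with one Line2D object per column

# Label each line with its column title so the legend is meaningful
for line, name in zip(lines, demo.columns):
    line.set_label(name)
ax.legend()

print(len(lines))
```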

For more in-depth plots, it's better to use pyplot in an object-oriented sense.

In [None]:
fig, ax = plt.subplots() # Create a Figure object, which contains a single Axes object
ax.plot(df1,marker='o') # Create a plot on the Axes object
ax.set_xlabel('Rows')
ax.set_ylabel('Values') # the y-axis shows the data values; each line is one column
ax.set_title('Data from df1')
plt.show()

In [None]:
fig, axs = plt.subplots(1,2) # Create a figure with two axes arranged in 1 row and 2 columns
#axs is an array of Axes objects
print(type(axs))
##Create a scatter plot of column 'b' versus column 'c' from df2
axs[0].scatter(df2.loc[:]['c'],df2.loc[:]['b'])
axs[0].set_xlabel('c')
axs[0].set_ylabel('b')

##Create a grouped bar chart from df1: one group per column, one bar per row
n_rows = len(df1.index)    # number of bars in each group
n_cols = len(df1.columns)  # number of groups
#x positions of the groups
xpos = np.arange(n_cols)
#Loop over the rows and draw one set of bars for each
for i in range(n_rows):
    ## shift each set and narrow the bars so that they do not overlap
    ## use DataFrame.iloc to get data from row i
    ## label with row names (used by the legend)
    axs[1].bar(xpos+(i-n_rows/2)/n_rows,df1.iloc[i],label=df1.index[i],width=1/n_rows)
## create and position x-ticks once, label with column names
axs[1].set_xticks(xpos-1/(2*n_rows))
axs[1].set_xticklabels(df1.columns)
##add a basic legend (uses label values from the bar charts on this Axes object)
axs[1].legend()
##neaten up white space
plt.tight_layout()
plt.show()