# D01: Advanced Data Structures

In this lesson we'll meet some data structures from Numpy and Pandas, namely arrays, series and dataframes. As analysts, you'll likely have used similar data structures before and be quite comfortable with them already, but learning how to work with them in Python underpins almost every piece of work that you'll do, so it's important!

## Importing and using Pandas and Numpy

The conventions for importing Pandas and Numpy are as follows:

In [1]:
import pandas as pd
import numpy as np

## Data Structures

As I mentioned previously, we're going to be meeting and working with some new data structures in this section of the course as follows:

* Numpy arrays
* Pandas series
* Pandas dataframes

Knowing how these data structures work and how to convert between them is crucial to performing efficient and accurate data analysis using Python.

### Numpy Arrays

NumPy’s main object is the multidimensional array. It is a table of elements (usually numbers), all of the same type. In Numpy, dimensions are called axes. The number of axes is called the rank.

For example, the coordinates of a point in 3D space [1, 2, 1] is an array of rank 1, because it has one axis. That axis has a length of 3. 

In the example pictured below, the array has rank 2 (it is 2-dimensional). The first dimension (axis) has a length of 2, the second dimension has a length of 3:

You can read more about arrays <a href = "https://docs.scipy.org/doc/numpy-dev/user/quickstart.html">here</a>.
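As a minimal sketch of the rank and axis lengths described above, we can inspect an array's `ndim` and `shape` attributes. The values in `grid` are made up purely for illustration; only the point `[1, 2, 1]` comes from the text.

```python
import numpy as np

point = np.array([1, 2, 1])            # the 3D point from the text: one axis of length 3
grid = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 2.0]])      # hypothetical values: axis 0 has length 2, axis 1 has length 3

print(point.ndim, point.shape)         # number of axes, then the length of each axis
print(grid.ndim, grid.shape)
```

Here `ndim` reports the rank and `shape` reports the length of each axis in order.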

We can create arrays as follows:

In [2]:
arr1 = np.random.random(10)  # Creating an array of 10 random numbers
print(type(arr1))            # Printing the object type
arr1                         # Calling the array

<class 'numpy.ndarray'>


array([ 0.43201999,  0.36313952,  0.60329127,  0.27280615,  0.02794502,
        0.76435247,  0.66089665,  0.28025427,  0.32303552,  0.64034645])

Using the axis and length described above, we can create multi-dimensional arrays as follows:

In [3]:
arr2 = np.random.random((6,4))     # Creating a 6 x 4 array of random numbers (first axis has length 6, second has length 4)
arr2                               

array([[ 0.85767108,  0.7026964 ,  0.66306342,  0.10778892],
       [ 0.1848937 ,  0.60105776,  0.06672499,  0.31467498],
       [ 0.25621571,  0.3158233 ,  0.11069   ,  0.13392968],
       [ 0.21204279,  0.96211675,  0.67852048,  0.49627242],
       [ 0.33418053,  0.82615098,  0.35870906,  0.62960672],
       [ 0.37863942,  0.58306124,  0.45272909,  0.27499952]])

And we can use indexing on them just as we can with other data structures...

In [4]:
arr2[0]  # Array indexing

array([ 0.85767108,  0.7026964 ,  0.66306342,  0.10778892])

In [5]:
arr2[0][0] # Array nested indexing

0.857671083403718
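Numpy arrays also accept a comma-separated index, one entry per axis, which is worth knowing alongside the nested form above. The sketch below uses a hypothetical 6 x 4 array of the integers 0 to 23 (not the random `arr2`) so that the results are easy to check by eye.

```python
import numpy as np

arr = np.arange(24).reshape(6, 4)   # hypothetical 6 x 4 array holding 0..23

first = arr[0, 0]       # same element as arr[0][0], selected in a single indexing step
col = arr[:, 1]         # every row, second column
block = arr[1:3, 2:]    # rows 1 and 2, columns 2 and 3

print(first)
print(col)
print(block)
```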

As before, data structures are dynamic and you can store arrays in other data structures...

In [6]:
arr3 = np.random.random((2,2))
arr4 = np.random.random((2,5))
mylist = [arr3,arr4]               # Creating a list of arrays
mylist

[array([[ 0.88370989,  0.70069979],
        [ 0.22765741,  0.18005701]]),
 array([[ 0.34546416,  0.85956564,  0.8522887 ,  0.42713588,  0.16585739],
        [ 0.90717239,  0.04375428,  0.11055613,  0.06265992,  0.9027934 ]])]

And vice versa, we can use traditional Python data structures to create arrays:

In [9]:
list1 = [1,2,3,4,5]
list2 = [6,7,8,9,0]
arr5 = np.array([list1,list2])
arr5

array([[1, 2, 3, 4, 5],
       [6, 7, 8, 9, 0]])

We can also perform mathematical calculations on arrays quickly and simply:

In [15]:
np.add(arr5,5)                # Adds 5 to every element (returns a new array; arr5 itself is unchanged)
arr6 = np.multiply(arr5,arr4) # Multiplies an array element by element with the values from another array
arr6 = np.round(arr6,3)       # Rounds an array to a given number of decimals
arr6                          

array([[ 0.345,  1.719,  2.557,  1.709,  0.829],
       [ 5.443,  0.306,  0.884,  0.564,  0.   ]])
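The same calculations can be written with the ordinary arithmetic operators. This is a small sketch, assuming an array `a` with the same values as `arr5` and a hypothetical array `b` filled with 0.5 so that the result is predictable.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 0]])    # same values as arr5
b = np.full((2, 5), 0.5)            # hypothetical array with the same shape as a

added = a + 5      # equivalent to np.add(a, 5); returns a new array, a is unchanged
product = a * b    # equivalent to np.multiply(a, b); element by element, not matrix multiplication

print(added)
print(product)
```

Neither operation modifies `a`, so the result needs to be assigned to a name if we want to keep it.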

You can store character data in arrays...

In [16]:
list3 = ['A','B','C','D','E']
arr7 = np.array(list3)
arr7

array(['A', 'B', 'C', 'D', 'E'], 
      dtype='<U1')

However (rather unsurprisingly) you can't perform mathematical functions on character data, making arrays a poor choice for storing it!

In [17]:
np.multiply(arr7,2)

TypeError: ufunc 'multiply' did not contain a loop with signature matching types dtype('<U3') dtype('<U3') dtype('<U3')

However, if we want to store non-numeric data, Pandas is a much better choice!

### Pandas Series

A Pandas series is very similar to a Numpy array...

In [18]:
ser1 = pd.Series(list1)    # Creating a pandas series using the Series class
print(type(ser1))
ser1

<class 'pandas.core.series.Series'>


0    1
1    2
2    3
3    4
4    5
dtype: int64

However there are some differences! Firstly Pandas will create a formal index for our series and display this when we call our series:

In [19]:
print(type(ser1.index))  # Printing the index object details
print(ser1.index)        # Printing the index of a series

<class 'pandas.indexes.range.RangeIndex'>
RangeIndex(start=0, stop=5, step=1)


As we can see our series also has a separate object for an index. We can actually create this ourselves and it doesn't have to be numeric either...

In [21]:
index_list = ['A','B','C','D','E']
ser1 = pd.Series(data=list1,index=index_list)   # Creating a series with a custom index
print(ser1)                                                # Printing the series
ser1['A']                                                  # Calling a value with a custom index value

A    1
B    2
C    3
D    4
E    5
dtype: int64


1
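Once a series has a custom index, it can help to be explicit about whether we are looking a value up by label or by position. The sketch below rebuilds the same series and uses the `.loc` (label) and `.iloc` (position) accessors, which are not otherwise covered in this lesson.

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4, 5], index=['A', 'B', 'C', 'D', 'E'])

by_label = ser.loc['B']          # look up by index label
by_position = ser.iloc[1]        # look up by integer position (counting from 0)
label_slice = ser.loc['B':'D']   # label slices include both end points

print(by_label, by_position)
print(label_slice)
```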

Unlike numpy arrays, series can contain a variety of data types:

In [22]:
list3 = [1,2.1,'Three',True,None] # List of different data types
ser2 = pd.Series(list3)           # Creating a series containing different data types
ser2

0        1
1      2.1
2    Three
3     True
4     None
dtype: object
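To see the difference in practice, here is a rough comparison using a shortened, hypothetical version of the list above. Numpy converts every element to one common type (strings, in this case), whereas the series keeps the original Python objects.

```python
import numpy as np
import pandas as pd

mixed = [1, 2.1, 'Three']   # hypothetical shortened list of mixed types

arr = np.array(mixed)       # numpy converts every element to a string to get one common type
ser = pd.Series(mixed)      # pandas keeps the original Python objects

print(arr.dtype.kind)                      # 'U' indicates a unicode string dtype
print([type(v).__name__ for v in ser])     # the type of each stored element
```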

We can also create a series from a numpy array using the Series class:

In [23]:
arr8 = np.random.random(5)  # Creating a 1D array with 5 records
ser3 = pd.Series(arr8)      # Creating a series from a np array
ser3

0    0.999000
1    0.508048
2    0.988719
3    0.252876
4    0.003493
dtype: float64

However, series are one dimensional, so the array must also be one dimensional for this to work!

In [24]:
ser3 = pd.Series(arr3)   # Failing to create a Series from a multidimensional array!

Exception: Data must be 1-dimensional
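If we do need to get a multidimensional array into Pandas, two possible workarounds are sketched below: flatten the array to one dimension first, or keep both dimensions by using a dataframe instead. The 2 x 2 array here is a hypothetical stand-in for `arr3`.

```python
import numpy as np
import pandas as pd

arr_2d = np.arange(4).reshape(2, 2)       # hypothetical 2 x 2 array

flat_series = pd.Series(arr_2d.ravel())   # flatten to 1-D first: four values in one series
frame = pd.DataFrame(arr_2d)              # or keep both dimensions in a dataframe

print(flat_series)
print(frame)
```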

We can also convert series to arrays using the values attribute (note that there are no parentheses, as it is not a method):

In [25]:
arr9 = ser3.values    # Converting a pandas series to a numpy array
arr9

array([ 0.99899984,  0.5080482 ,  0.9887195 ,  0.25287565,  0.00349292])
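Pandas also provides a `to_numpy()` method on series, which is not used elsewhere in this lesson. As a quick sketch with hypothetical values, both routes give a Numpy array with the same contents here:

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.5, 1.5, 2.5])   # hypothetical values

arr_a = ser.values       # attribute access, no parentheses
arr_b = ser.to_numpy()   # method call, same contents in this case

print(type(arr_a), type(arr_b))
print(np.array_equal(arr_a, arr_b))
```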

### Pandas Dataframes

Dataframes are something which we should already be familiar with as analysts, as they conform to the traditional 'dataset' structure that we're used to dealing with. Creating dataframes is slightly more involved than creating a series or array but still quite simple:

In [26]:
index_data = np.arange(3)           # Index data
data1 = np.random.random(3)         # Row 1 data
data2 = np.random.random(3)         # Row 2 data
data3 = np.random.random(3)         # Row 3 data
cols_data = ['col1','col2','col3']

# Creating a basic dataframe

df1 = pd.DataFrame(data=[data1,data2,data3],  # Specifying the data
                  index=index_data,           # Specifying the index
                  columns=cols_data)          # Specifying the column headers
df1                                           # Calling the dataframe

       col1      col2      col3
0  0.780934  0.729847  0.049975
1  0.473435  0.908981  0.623294
2  0.648444  0.974842  0.828940
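Note that each array passed in the list above becomes a row, not a column. The sketch below illustrates this with two small hypothetical arrays, and shows one way (`np.column_stack`) to have each array become a column instead.

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3])   # hypothetical values
b = np.array([4, 5, 6])

# A list of arrays: each array becomes one row
by_rows = pd.DataFrame([a, b], columns=['col1', 'col2', 'col3'])

# Stacking the arrays side by side first: each array becomes one column
by_cols = pd.DataFrame(np.column_stack([a, b]), columns=['x', 'y'])

print(by_rows)
print(by_cols)
```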


We can also create a dataframe from a dictionary, and like a series, the columns can contain different data types:

In [27]:
# Creating a dataframe from a dictionary

df2 = pd.DataFrame({'col1':['A','B','C'],   # Setting column1 data
                    'col2':[1,2,3]})        # Setting column2 data

df2

  col1  col2
0    A     1
1    B     2
2    C     3


We can also call individual columns from a dataframe...

In [28]:
df2['col1']

0    A
1    B
2    C
Name: col1, dtype: object

And, as it turns out, dataframe columns are actually series!

In [None]:
type(df2['col1'])

So we could actually think of a dataframe as a collection of series, one for each column, all sharing the same index. This makes the link between series and dataframes a bit easier to understand.
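To make that idea concrete, here is a small sketch that builds a dataframe from two separate series and then checks what comes back when we select each column. The names `letters` and `numbers` are just for illustration.

```python
import pandas as pd

letters = pd.Series(['A', 'B', 'C'])
numbers = pd.Series([1, 2, 3])

df = pd.DataFrame({'col1': letters, 'col2': numbers})   # one series per column

for name in df.columns:
    print(name, type(df[name]).__name__)   # each column comes back as a Series

print(df['col1'].index.equals(df.index))   # the column shares the dataframe's index
```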

We'll be looking at dataframes in more depth in the upcoming sections.

## Converting to lists and dicts

There will doubtless be times when you'll want to convert data you have stored in arrays, series and dataframes to the standard data structures in base Python. Often when installing smaller libraries for specific functions, you'll find that they don't recognise the Numpy and Pandas data structures.

Fortunately converting is very simple:

In [29]:
list1 = arr1.tolist()         # Converting an array to a list
list2 = df2['col1'].tolist()  # Converting a series to a list

dict1 = df2['col1'].to_dict() # Converting a series to a dict 
dict2 = df2.to_dict()         # Converting a dataframe to a dict

print(list1)
print(dict1)

print(list2)
print(dict2)

[0.4320199931265416, 0.36313952220360823, 0.6032912720073171, 0.27280615098298056, 0.027945017409577888, 0.7643524705875849, 0.6608966527893316, 0.2802542671328022, 0.32303552165125615, 0.640346449950873]
{0: 'A', 1: 'B', 2: 'C'}
['A', 'B', 'C']
{'col2': {0: 1, 1: 2, 2: 3}, 'col1': {0: 'A', 1: 'B', 2: 'C'}}
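The nested dictionary above is only one of the layouts available. As a brief sketch, `to_dict` also accepts an `orient` argument; two options that are often convenient when handing data to other libraries are shown below, using the same values as `df2`.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C'],
                   'col2': [1, 2, 3]})        # same values as df2

as_lists = df.to_dict(orient='list')          # {column: [values]}
as_records = df.to_dict(orient='records')     # [{column: value, ...}, ...], one dict per row

print(as_lists)
print(as_records)
```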


## Summary

As we've seen, Arrays, Series and Dataframes are all essentially different ways of storing data.

Each structure has its own strengths:

<strong>Arrays:</strong>

* Excellent for performing mathematical functions
* Multidimensional

<strong>Series:</strong>

* Great for storing a set of one dimensional values of mixed types
* Formal indexing
* Integrate very well with Pandas dataframes

<strong>Dataframes:</strong>

* Traditional 'dataset' structure
* Formal indexing
* Good for a wide variety of data of different types

Speaking from experience, I spend most of my time working with dataframes. However it is useful to be able to convert a dataframe column (aka a series!) to a numpy array to perform a function or to transform the data before merging it back to the original dataframe. This is a vital part of using Python for data analysis!
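That round trip might look something like the sketch below. The dataframe values and the choice of `np.sqrt` as the transformation are hypothetical, and here the result is attached as a new column by simple assignment, which is one of several ways to bring it back.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C'],
                   'col2': [1, 4, 9]})     # hypothetical values

values = df['col2'].to_numpy()   # dataframe column (a series) -> numpy array
roots = np.sqrt(values)          # transform with a numpy function
df['col2_sqrt'] = roots          # attach the result to the original dataframe as a new column

print(df)
```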

Many of the smaller libraries that perform niche functions will be built to be compatible with numpy (or even just the basic data structures such as lists, tuples and dictionaries), but not pandas, as pandas is a newer package. The same can be said for many of the data visualisation libraries. This makes understanding, transforming and converting between these data structures especially important.

## Classes

One other thing to note is that both pd.DataFrame and pd.Series are something called 'Classes', which is an Object Oriented Programming (OOP) concept. We won't be delving into how these work behind the scenes, but it is important to know that whilst classes are different to the functions that we've met previously, for now you can just think of them as a special kind of function.
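Without going into the detail, the short sketch below shows roughly what that means in practice: calling the class creates an object (an 'instance') of that class, and the object carries its own methods.

```python
import pandas as pd

ser = pd.Series([1, 2, 3])            # calling the class creates an instance

print(isinstance(pd.Series, type))    # pd.Series is itself a class
print(isinstance(ser, pd.Series))     # ser is an instance of that class
print(ser.sum())                      # instances carry methods such as sum()
```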

## Further Reading

<a href = "https://numpy.org/doc/stable/reference/arrays.ndarray.html">Numpy Array Reference</a><br/>
<a href = "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame">Pandas DataFrame reference</a><br/>
<a href = "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html">Pandas Series reference</a><br/>
<a href = "http://www.jesshamrick.com/2011/05/18/an-introduction-to-classes-and-inheritance-in-python/">Introduction to Classes</a><br/>