# Data Structures in Python

In this chapter, we take a quick look at the data structures available in Python: the built-in ones as well as those offered by NumPy and Pandas.

This chapter is not meant to be a comprehensive introduction, but rather a quick refresher before we dive into Pandas and dataframes.

A more detailed description can be found in the [Python tutorial on data structures](https://docs.python.org/3/tutorial/datastructures.html).

## 1. Tuples

- Tuples are an immutable (cannot be changed after definition) data structure in Python.
- Defined using round parentheses "(" ")".
- Useful for storing permanent records, such as scientific constants (e.g. $\pi$) or private keys.

In [6]:
#Define a tuple:
a_tuple = (1, 2, 3)

print("The values in the tuple are {}, its type is {} and its first element is {}".format(
    a_tuple, type(a_tuple), a_tuple[0]))

The values in the tuple are (1, 2, 3), its type is <class 'tuple'> and its first element is 1


In [7]:
#Tuples can contain heterogeneous data
het_tuple = (13, 33, "a string!", 3.21)

print("Here's a tuple with heterogeneous types: {}".format(het_tuple))

Here's a tuple with heterogeneous types: (13, 33, 'a string!', 3.21)


In [8]:
#Tuples are immutable, i.e. the values stored in them cannot be modified:

#try to change an element stored in het_tuple above:
het_tuple[0] = 12312434

TypeError: 'tuple' object does not support item assignment
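
The failed assignment above stops the cell with an error. As a minimal sketch, the same behaviour can be shown without interrupting the notebook by catching the `TypeError`, and a "modified" tuple can be obtained by building a new one. The name `new_tuple` is illustrative and not part of the original notebook.

```python
# Re-create the tuple from the cells above so this block runs on its own.
het_tuple = (13, 33, "a string!", 3.21)

# Item assignment on a tuple raises TypeError; catch it so execution continues.
try:
    het_tuple[0] = 12312434
except TypeError as err:
    print("Assignment failed: {}".format(err))

# To "change" a value, build a new tuple from pieces of the old one.
# new_tuple is an illustrative name, not part of the original notebook.
new_tuple = (12312434,) + het_tuple[1:]

print("Original tuple is unchanged: {}".format(het_tuple))
print("New tuple: {}".format(new_tuple))
```

Note that `(12312434,)` needs the trailing comma: without it, the parentheses would just wrap an integer rather than define a one-element tuple.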

In [11]:
#It is possible to have nested tuples:
nested_tuple = (a_tuple, het_tuple)

print("Here's a nested tuple {}, and its individual elements {} and {}".format(
    nested_tuple, nested_tuple[0], nested_tuple[1]))

Here's a nested tuple ((1, 2, 3), (13, 33, 'a string!', 3.21)), and its individual elements (1, 2, 3) and (13, 33, 'a string!', 3.21)


## 2. Lists

Lists are similar to tuples: both are ordered sequences that can hold heterogeneous data. The **big** difference is that lists are **mutable**, i.e. their contents can be changed after definition. Lists are defined using square brackets "[" "]".

In [14]:
#Here's how to define a list:
a_list = [1, 3, 5, 6, 7]

#Can host heterogeneous data:
het_list = [1, "string", (1, 3, 5), 4.2]

#Can be changed after definition:
print("This is the original list: {}".format(het_list))
het_list[0] = 10
print("This is the list after modification: {}".format(het_list))

#Can nest lists:
nest_list = [a_list, het_list]
print("A nested list: {}".format(nest_list))

This is the original list: [1, 'string', (1, 3, 5), 4.2]
This is the list after modification: [10, 'string', (1, 3, 5), 4.2]
A nested list: [[1, 3, 5, 6, 7], [10, 'string', (1, 3, 5), 4.2]]
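
Because lists are mutable, they also offer methods that change them in place. The following is a short sketch of a few common ones, starting from the same `a_list` as above; the names `last` and `first_three` are illustrative.

```python
# Start from the same list as in the cell above.
a_list = [1, 3, 5, 6, 7]

a_list.append(9)       # add one element at the end
a_list.insert(0, 0)    # insert the value 0 at index 0
a_list.remove(6)       # remove the first occurrence of the value 6
last = a_list.pop()    # remove and return the last element

print("List after in-place changes: {}".format(a_list))
print("Popped element: {}".format(last))

# Slicing returns a new list and leaves the original untouched.
first_three = a_list[:3]
print("First three elements: {}".format(first_three))
```

None of these methods exist on tuples, which is a direct consequence of tuples being immutable.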


## 3. Sets

A set is an unordered collection with no duplicate elements. Python sets follow the same concept as sets in mathematics, and Python provides the usual set operations such as union, intersection and difference.

In [15]:
a_set = {"lumos", "expelliarmus", "diffindo", "episkey", "lumos", "diffindo"}

print("A set, without any duplicates, though there were some in the definition itself: {}".format(a_set))

A set, without any duplicates, though there were some in the definition itself: {'expelliarmus', 'lumos', 'episkey', 'diffindo'}


In [17]:
#Since a set is unordered, it does not support indexing:
a_set[0]

TypeError: 'set' object does not support indexing

In [18]:
#Check membership of set:
"crucio" in a_set

False

In [21]:
#Set operations:
another_set = {"lumos", "alohomora", "diffindo", "sectumsempra", 'imperio'}

print("In a_set but not another_set: {}".format(a_set - another_set))
print("In a_set or another_set: {}".format(a_set | another_set)) #| represents the "OR" operation
print("In a_set and another_set: {}".format(a_set & another_set)) # & represents the "AND" operation
print("In a_set or another_set, but not both: {}".format(a_set ^ another_set))

In a_set but not another_set: {'episkey', 'expelliarmus'}
In a_set or another_set: {'imperio', 'alohomora', 'lumos', 'episkey', 'diffindo', 'sectumsempra', 'expelliarmus'}
In a_set and another_set: {'lumos', 'diffindo'}
In a_set or another_set, but not both: {'imperio', 'alohomora', 'episkey', 'sectumsempra', 'expelliarmus'}
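
Each of the operators used above also has a named method on the set type, which some readers may find easier to read. The sketch below re-creates both sets so it runs on its own; `common`, `spells` and `unique_spells` are illustrative names.

```python
a_set = {"lumos", "expelliarmus", "diffindo", "episkey"}
another_set = {"lumos", "alohomora", "diffindo", "sectumsempra", "imperio"}

# Named methods give the same results as the operators.
print(a_set.difference(another_set) == (a_set - another_set))
print(a_set.union(another_set) == (a_set | another_set))
print(a_set.intersection(another_set) == (a_set & another_set))
print(a_set.symmetric_difference(another_set) == (a_set ^ another_set))

# The intersection is a subset of both original sets.
common = a_set.intersection(another_set)
print(common.issubset(a_set) and common.issubset(another_set))

# Converting a list to a set is a quick way to drop duplicates.
spells = ["lumos", "lumos", "episkey"]
unique_spells = set(spells)
print(sorted(unique_spells))  # sorted() gives a stable display order
```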


## 4. Dictionaries

A dictionary is a collection of **key:value** pairs whose values are accessed using their *keys* rather than by the positional indexing used for sequences such as lists and tuples.

In [24]:
a_dict = {1: "one", 2: "two", 'three': 3, "a_list": [1, 3, 5, 6], "a_tuple": (3, 2, 1)}

print(a_dict)

{1: 'one', 2: 'two', 'a_list': [1, 3, 5, 6], 'three': 3, 'a_tuple': (3, 2, 1)}


In [28]:
#Get all the keys and values from a dictionary:
keys = a_dict.keys()
vals = a_dict.values()

print("Keys: {}".format(keys))
print("Values: {}".format(vals))

Keys: dict_keys([1, 2, 'a_list', 'three', 'a_tuple'])
Values: dict_values(['one', 'two', [1, 3, 5, 6], 3, (3, 2, 1)])
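
The cells above list the keys and values but do not look anything up. As a minimal sketch, here is how individual values can be read, added and updated by key; `missing` is an illustrative name, and the key `"four"` is added only for this example.

```python
a_dict = {1: "one", 2: "two", "three": 3, "a_list": [1, 3, 5, 6], "a_tuple": (3, 2, 1)}

# Look up values by key using square brackets.
print(a_dict[1])
print(a_dict["a_list"])

# A missing key raises KeyError with square brackets;
# .get() returns a default value instead.
missing = a_dict.get("four", "not found")
print(missing)

# Dictionaries are mutable: add a new pair or overwrite an existing one.
a_dict["four"] = 4
a_dict[2] = "TWO"

# Iterate over key:value pairs together.
for key, value in a_dict.items():
    print("{} -> {}".format(key, value))
```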


# NumPy arrays

So far, we have been dealing mostly with one-dimensional data structures. Higher dimensions can be built by nesting these data types, but the result quickly becomes convoluted and hard to read, which goes against Python's emphasis on readable code.

NumPy is a package that provides multidimensional arrays (or matrices) together with a cleaner way of accessing and manipulating them. It is also computationally efficient, which makes it well suited to large simulations and calculations.

NumPy has extensive functionality that cannot be covered in full here. We therefore only show how to define a NumPy array and direct interested readers to the [NumPy homepage](https://numpy.org/).
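
To illustrate the point about nested structures, here is a small sketch that extracts one column from a 3x2 "matrix", first stored as a nested list and then as a NumPy array. The values and the names `nested`, `matrix`, `col_from_list` and `col_from_array` are made up for this example.

```python
import numpy

# A 3x2 "matrix" stored as a nested list (illustrative values).
nested = [[1, 2], [3, 4], [5, 6]]

# Extracting the second column needs an explicit loop or comprehension.
col_from_list = [row[1] for row in nested]

# The same data as a NumPy array supports [row, column] indexing directly.
matrix = numpy.array(nested)
col_from_array = matrix[:, 1]

print("Column from nested list: {}".format(col_from_list))
print("Column from NumPy array: {}".format(col_from_array))
print("Shape of the array: {}".format(matrix.shape))
```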

In [31]:
#Import numpy
import numpy

#Define an empty array:
array = numpy.array([])

print("Empty array {} and its type {}".format(array, type(array)))

Empty array [] and its type <class 'numpy.ndarray'>


In [33]:
#Initialize an array with 3 rows and 2 columns, i.e. shape (3, 2), and fill it with the value 100
full_array = numpy.full((3, 2), 100)

print("Our array is: \n {}".format(full_array))

Our array is: 
 [[100 100]
 [100 100]
 [100 100]]
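
Once an array exists, its elements can be read and changed with `[row, column]` indexing, and arithmetic is applied element by element. The following sketch assumes the same 3x2 array filled with 100; the names `arr`, `doubled` and `col_sums` are illustrative.

```python
import numpy

# Same shape and fill value as the array above.
arr = numpy.full((3, 2), 100)

# Change a single element (row 0, column 1) and a whole row (row 2).
arr[0, 1] = 7
arr[2, :] = [1, 2]

# Arithmetic acts on every element at once and returns a new array.
doubled = arr * 2

# Reductions can run along an axis: axis=0 sums down each column.
col_sums = arr.sum(axis=0)

print("Modified array: \n {}".format(arr))
print("Doubled array: \n {}".format(doubled))
print("Column sums: {}".format(col_sums))
```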


# Dataframes with Pandas

Dataframes are a versatile data structure that can contain all of the structures we've seen previously. The best way to visualize a dataframe is to think of it as a table in Excel, with each column being named and each row having an index.

Here, we look at how to create a simple dataframe; more advanced methods are covered in subsequent notebooks.

In [34]:
import pandas

In [35]:
#Let's define some numeric data to go into our dataframe:
#Our data will have 5 rows and 3 columns, which we will call att_1, att_2 and att_3
data = numpy.full((5, 3), 100)

#Here we aggregate that data into a dataframe and name the columns
dataframe = pandas.DataFrame(data, columns = ['att_1', 'att_2', 'att_3'])

dataframe

|   | att_1 | att_2 | att_3 |
|---|-------|-------|-------|
| 0 | 100 | 100 | 100 |
| 1 | 100 | 100 | 100 |
| 2 | 100 | 100 | 100 |
| 3 | 100 | 100 | 100 |
| 4 | 100 | 100 | 100 |


In [36]:
#Another, more direct, way to create a dataframe is to use a dictionary:
my_dict = {'att_1': [120, 123, 333, 132], 'att_2': [1, 3, 11, 21], 'att_3': [0, 0.1, 0.0001, 0.4]}

my_df = pandas.DataFrame(my_dict)

my_df

|   | att_1 | att_2 | att_3 |
|---|-------|-------|-------|
| 0 | 120 | 1 | 0.0 |
| 1 | 123 | 3 | 0.1 |
| 2 | 333 | 11 | 0.0001 |
| 3 | 132 | 21 | 0.4 |
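
Since a dataframe behaves like a table with named columns and indexed rows, a natural next step is to select parts of it. The sketch below re-creates `my_df` and shows a few basic selections; the names `att_1`, `row`, `subset` and the new column `att_sum` are illustrative and not part of the original notebook.

```python
import pandas

my_df = pandas.DataFrame({'att_1': [120, 123, 333, 132],
                          'att_2': [1, 3, 11, 21],
                          'att_3': [0, 0.1, 0.0001, 0.4]})

# Select one column by name; the result is a pandas Series.
att_1 = my_df['att_1']

# Select one row by its index label.
row = my_df.loc[2]

# Keep only the rows that satisfy a condition.
subset = my_df[my_df['att_2'] > 2]

# Add a new column computed from existing ones.
my_df['att_sum'] = my_df['att_1'] + my_df['att_2']

print(att_1)
print(row)
print(subset)
print(my_df)
```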
