# pandas

Information based on former lecture material, the 'Python Data Science Handbook', and new examples

Use cases: data management, data manipulation, handling incomplete data, ...

## Series

In [None]:
import numpy as np
import pandas as pd

In [None]:
pd?

In [None]:
# Type "pd." and press Tab to list the available attributes and functions.

pandas Series object: one-dimensional array of indexed data

In [None]:
numbers_data = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5])
print(numbers_data)

A Series consists of an index and a NumPy array.

In [None]:
print(type(numbers_data))
print(type(numbers_data.index))
print(numbers_data.index)
print(type(numbers_data.values))
print(numbers_data.values)

Here, we can access the data in the array via the range index.

In [None]:
print(numbers_data[0])
print(numbers_data[3])

A range index is only one option. We can also define a different index to access our data via more intuitive keys. This works similarly to dictionaries.

In [None]:
numbers_data = pd.Series(np.linspace(0.1, 0.5, 5), index=['A', 'B', 'C', 'D', 'E'])
print(numbers_data)
print(numbers_data['A'])

In [None]:
print(numbers_data.index)

In [None]:
numbers_data.index = [4, 8, 15, 16, 23]
print(numbers_data)
print(numbers_data[4])

Try out other indices and data examples.

In [None]:
pd.Series?

Note that due to the similar key: value structure, we can create a Series from a dictionary.

In [None]:
python_ds_book_dict = {'author': 'J. VanderPlas', 'title': 'Python Data Science Handbook', 'publisher': 'O\'Reilly', 'year': 2016}
python_ds_book_dict

In [None]:
python_ds_book_series = pd.Series(python_ds_book_dict, dtype='str')
python_ds_book_series

Note that the dictionary values can have different types, whereas the values of a Series share a single dtype. Above, dtype='str' converts every value to a string; without it, the mixed values would be stored with the generic object dtype. If the dictionary only contains numerical values, the result is a numerical NumPy array as we are used to.
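A minimal sketch of this dtype behaviour. The example dictionaries are made up for illustration and are not part of the lecture material.

```python
import pandas as pd

# Hypothetical example dictionaries.
mixed = pd.Series({'title': 'Handbook', 'year': 2016})
numeric = pd.Series({'a': 1, 'b': 2, 'c': 3})
floats = pd.Series({'a': 1, 'b': 2.5})

print(mixed.dtype)    # generic object dtype for mixed values
print(numeric.dtype)  # an integer dtype
print(floats.dtype)   # the integer is upcast so that all values are floats
```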

## DataFrames

A DataFrame is a two-dimensional generalisation of a Series. It can be seen as a sort of indexed two-dimensional NumPy array, or as a dictionary of Series that share a common index, with each column having a fixed type.

In [None]:
apalg_in_progress = {'Course': 58,
                     'Exercise Sheets': 55,
                     'Quiz 1': 43,
                     'Quiz 2': 37,
                     'Quiz 3': 28,
                     'Quiz 4': 31,
                     'Quiz 5': 24,
                     'Quiz 6': 20,
                     'Mini Project': 20}

apalg_passed = {'Course': 0,
                'Exercise Sheets': 39,
                'Quiz 1': 42,
                'Quiz 2': 31,
                'Quiz 3': 25,
                'Quiz 4': 30,
                'Quiz 5': 21,
                'Quiz 6': 18,
                'Mini Project': 0}

With what we have seen so far, we can create two Series from these two dictionaries:

In [None]:
in_progress = pd.Series(apalg_in_progress, dtype='int16')
passed = pd.Series(apalg_passed, dtype='int16')

print("Learning in progress:\n")
print(in_progress)
print("\nLearning progress -- numbers passed:\n")
print(passed)

To combine the information, get a better overview, and have more options to work with, we can create a DataFrame.

In [None]:
learning_progress = pd.DataFrame({'in progress': in_progress, 'passed': passed})
learning_progress

Or directly from the dictionaries:

In [None]:
learning_progress = pd.DataFrame({'in progress': apalg_in_progress, 'passed': apalg_passed})
learning_progress

Let's take a closer look at the structure.

In [None]:
tmp1 = {'in progress': in_progress, 'passed': passed}
print(type(tmp1))
print(type(tmp1['in progress']))
print(type(tmp1['passed']))

In [None]:
tmp2 = {'in progress': apalg_in_progress, 'passed': apalg_passed}
print(type(tmp2))
print(type(tmp2['in progress']))
print(type(tmp2['passed']))

That means tmp1 is a dictionary of two Series, and tmp2 is a dictionary of two dictionaries. In both cases, the result is a DataFrame whose columns are Series.

In [None]:
print(type(learning_progress))
print(type(learning_progress['in progress']))
print(type(learning_progress['passed']))

A first approach to access the data is via indices as for dictionaries.

In [None]:
print(tmp2['passed']['Quiz 1'])
print(learning_progress['passed']['Quiz 1'])

However, DataFrames let us handle data in a more concise way (more compact, more efficient, and with additional functionality).

In [None]:
print(learning_progress.values)
print(type(learning_progress.values))

In [None]:
print(learning_progress.size)
print(learning_progress.shape)

There is one common index that indicates the rows.

In [None]:
print(learning_progress.index)
print(type(learning_progress.index))

There is a second index that indicates the columns.

In [None]:
print(learning_progress.columns)
print(type(learning_progress.columns)) 

As stated above there are two perspectives: a DataFrame as a generalization of a two-dimensional NumPy array with row and column indices; or a table from a dictionary with the advantages of a NumPy array. A DataFrame can be created from both perspectives.

In [None]:
A = np.random.randint(1, 5, (3,3))
print(A)

data_from_matrix = pd.DataFrame(A, 
                                columns=['col1', 'col2', 'col3'],
                                index=['row1', 'row2', 'row3'])
data_from_matrix

In [None]:
D = [{'a': 3 * i, 'b': 3*i + 1, 'c': 3*i + 2} for i in range(3)]
print(D)

data_from_list_of_dict = pd.DataFrame(D)
data_from_list_of_dict

We can also create a DataFrame directly from an external file, for example a CSV table.

In [None]:
test_data = pd.read_csv('test_data.csv')
test_data

In [None]:
test_data.index

We might want to use the first column (here named 'rows') as the index.

In [None]:
test_data = pd.read_csv('test_data.csv', index_col='rows')
test_data

In [None]:
test_data.index

Consider the format of the CSV file! If the values are separated by semicolons instead of commas, the delimiter has to be specified.

In [None]:
test_data = pd.read_csv('test_data_semicolon.csv')
test_data

In [None]:
pd.read_csv('test_data_semicolon.csv', delimiter=';')
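The effect of the delimiter can be reproduced without any file on disk. The following sketch uses io.StringIO as a stand-in for a file; the CSV content is a made-up example, not the actual test_data_semicolon.csv.

```python
import io
import pandas as pd

# Hypothetical file content; io.StringIO stands in for a file on disk.
csv_text = "rows;col1;col2\nr1;1;2\nr2;3;4\n"

# Default delimiter ',': every line ends up in a single column.
wrong = pd.read_csv(io.StringIO(csv_text))

# Correct delimiter, and the 'rows' column used as index.
right = pd.read_csv(io.StringIO(csv_text), delimiter=';', index_col='rows')

print(wrong.shape)
print(right)
```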

... and write data back to a file (e.g. CSV):

In [None]:
data_from_list_of_dict.to_csv('test_data_new.csv')
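Note that to_csv also writes the row index as the first column by default. A small round-trip sketch, writing into an in-memory buffer instead of a file (the buffer and the tiny DataFrame are assumptions for illustration):

```python
import io
import pandas as pd

df = pd.DataFrame([{'a': 0, 'b': 1}, {'a': 3, 'b': 4}])

buffer = io.StringIO()
df.to_csv(buffer)            # the index is written as an unnamed first column
print(buffer.getvalue())

buffer.seek(0)               # rewind before reading the buffer again
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Passing index=False to to_csv omits the index column, if it carries no information.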

Note that missing data can be handled. See below.

In [None]:
data1 = {'a': 1, 'b': 2}
data2 = {'b': 4, 'c': 3}

pd.DataFrame([data1, data2], index=['data 1', 'data 2'])
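A short sketch of what happens with the missing entries: they are filled with NaN, which can be detected with isna and, as one possible choice, replaced with fillna. The fill value 0 is only an assumption for illustration.

```python
import pandas as pd

data1 = {'a': 1, 'b': 2}
data2 = {'b': 4, 'c': 3}
df = pd.DataFrame([data1, data2], index=['data 1', 'data 2'])

print(df.isna())      # True where a value is missing
print(df.dtypes)      # columns containing NaN are stored as float
print(df.fillna(0))   # replace missing values by 0 (illustrative choice)
```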

## The Pandas Index
So far, we have seen indices as keys to describe columns and rows. Let's take a closer look at what they are capable of.

In [None]:
index = pd.Index([4, 8, 15, 16, 23, 42])
print(index)

This Index has similar properties to a NumPy array.

In [None]:
print("index.size =", index.size)
print("index.shape =", index.shape)
print("index.ndim =", index.ndim)
print("index.dtype =", index.dtype)

To access parts of an Index, we can use the indexing and slicing notation known from NumPy.

In [None]:
print("index[2] =", index[2])
print("index[3:] =", index[3:])
print("*index[3:] =", *index[3:])

However, note that it is immutable.

In [None]:
index[3] = 103  # raises a TypeError, because an Index cannot be modified in place

Moreover, a Pandas Index allows set operations, e.g., to join data sets. Note, however, that unlike a Python set, an Index is ordered.

Note that while most concepts from the handbook remain valid, some operations are now deprecated. For instance, instead of index & other, use index.intersection(other).

In [None]:
index = pd.Index([1, 2, 3, 5, 4])
other = pd.Index([7, 4, 3, 11])

print("Intersection:", index.intersection(other))
print("Intersection:", *index.intersection(other))
print("Union:", *index.union(other))
print("Difference:", *index.difference(other))
print("Symmetric difference:", *(index.symmetric_difference(other)))
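As a hedged illustration of how such set operations help to join data sets, the following sketch restricts two Series to their common labels before adding them. The Series a and b are made-up examples.

```python
import pandas as pd

# Hypothetical data sets with partially overlapping labels.
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

common = a.index.intersection(b.index)   # labels present in both Series
total = a.loc[common] + b.loc[common]    # addition is aligned on the labels

print(common)
print(total)
```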

## Indexing a Series

An Index is now the underlying structure to set up a (one-dimensional) Series and (two-dimensional) DataFrame. Let's take a closer look.

In [None]:
numbers_data = pd.Series(np.linspace(0.1, 0.5, 5), index=index)
print(numbers_data)
print(id(numbers_data))
print(id(index))
print(id(numbers_data.index))

While an index itself is immutable, a Series can be extended with a new key like a dictionary. A new index is created.

In [None]:
numbers_data[6] = 0.6
print(numbers_data)
print(numbers_data.index)
print(id(numbers_data))
print(id(index))
print(id(numbers_data.index))

We can obtain some information similar to a dictionary...

In [None]:
6 in numbers_data

In [None]:
numbers_data.keys()

In [None]:
list(numbers_data.items())

... and others like a NumPy array (because the Index is ordered). Let's begin with an alphabetical index and distinguish different access types.

In [None]:
numbers_data.index = ['A', 'B', 'C', 'D', 'E', 'F']
numbers_data

### Explicit Access
Slicing. Caution: in contrast to the range notation, the last index is included!

In [None]:
print(numbers_data[:'C'])
print(numbers_data['B':'D'])

Fancy Indexing:

In [None]:
print(numbers_data[['B','D','F']])

Masking:

In [None]:
print(numbers_data[numbers_data > .3])

### Implicit Access

Remember that each index also has an implicit index from 0 to its length - 1. We can use this to access data implicitly.

Caution: now, for slicing, the last index is excluded!

In [None]:
print(numbers_data[:2])
print(numbers_data[1:3])
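# Note: recent pandas versions deprecate integer positions in [] on a non-integer index;
# .iloc (introduced below) is the explicit alternative.
print(numbers_data[[1, 3, 5]])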

So, what happens if we use a numerical index?

In [None]:
numbers_data.index = [1, 2, 3, 4, 5, 6]
print(numbers_data[1])
print(numbers_data[1:])
print(numbers_data[1:3])

Caution! The first case, [1], uses the *explicit* index 1! Slicing, however, uses the *implicit* indices (excluding the last entry).

This invites confusion, errors, and misunderstandings!

The common practice around this is to use the so-called Indexers *loc* (for the explicit index) and *iloc* (for the implicit index).

In [None]:
print(numbers_data.loc[1])
print(numbers_data.iloc[1])

print(numbers_data.loc[1:3])
print(numbers_data.iloc[1:3])

print(numbers_data.loc[[1, 3, 5]])
print(numbers_data.iloc[[1, 3, 5]])
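The difference between the two indexers can be summarised in a self-contained sketch. The Series s below rebuilds the example data with the index 1 to 6.

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], index=[1, 2, 3, 4, 5, 6])

print(s.loc[1])      # label 1: the first entry
print(s.iloc[1])     # position 1: the second entry

print(s.loc[1:3])    # labels 1 to 3, last label included: three entries
print(s.iloc[1:3])   # positions 1 and 2, last position excluded: two entries
```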

In [None]:
numbers_data.loc

In [None]:
numbers_data.loc.obj

In [None]:
numbers_data.loc[:]

In [None]:
numbers_data.loc[1:6]

In [None]:
numbers_data.iloc[0:6]

## Indexing a DataFrame

In [None]:
learning_progress

In [None]:
print(learning_progress['in progress']['Course'])

In [None]:
print(learning_progress['in progress'])
print(learning_progress.passed)
print(learning_progress.values[0])

Main point of confusion: a single index in square brackets sometimes refers to rows (when slicing) and sometimes to columns (when passing a single key).
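A minimal sketch of this behaviour, using a small stand-in for learning_progress with three rows from the table above:

```python
import pandas as pd

# Small stand-in for learning_progress.
df = pd.DataFrame({'in progress': [58, 55, 43], 'passed': [0, 39, 42]},
                  index=['Course', 'Exercise Sheets', 'Quiz 1'])

column = df['passed']                        # single key: selects a column
label_rows = df['Course':'Exercise Sheets']  # label slice: selects rows, end included
position_rows = df[:2]                       # positional slice: selects rows, end excluded

try:
    df['Course']                             # a row label is not a column name
    row_key_works = True
except KeyError:
    row_key_works = False

print(column)
print(label_rows)
print(position_rows)
print(row_key_works)
```

Using loc and iloc with both a row and a column specification avoids this ambiguity.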

TODO: slicing, loc vs iloc; add a column

In [None]:
print(learning_progress.loc['Quiz 1':'Quiz 3', :'in progress'])

In [None]:
print(learning_progress.iloc[2:5, :1])
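Adding a column works like adding a key to a dictionary. The following is only a sketch on a three-row excerpt of the table; the column name 'difference' and its meaning are assumptions chosen for illustration.

```python
import pandas as pd

# Excerpt of the learning progress data (Quiz 1 to Quiz 3).
df = pd.DataFrame({'in progress': [43, 37, 28], 'passed': [42, 31, 25]},
                  index=['Quiz 1', 'Quiz 2', 'Quiz 3'])

# Assigning to a new key adds a column; the values are aligned on the row index.
df['difference'] = df['in progress'] - df['passed']   # hypothetical column name
print(df)

# The new column is accessible via loc (labels) and iloc (positions) alike.
print(df.loc['Quiz 2', 'difference'])
print(df.iloc[1, 2])
```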

# Pandas time series
TODO