# Data structures and data frames
* What is a data frame? What is a data type? In Python, every value has a type, and built-in types such as `int` are not special compared to the more advanced types we will be using.
* Let's start by comparing data frames to the other common data structures you may have come across.

## Scalar
A scalar variable holds just a single value.

In [1]:
a = 42

A variable has a type. Python has the builtin function type() that gives you the type of an object:

In [2]:
type(a)

int

If we create another variable, say a string, we see that it has a different type:

In [3]:
b = "hello"
type(b)

str
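
As a small sketch of the idea that `int` is not special: types are ordinary objects, and `isinstance()` is the usual way to check what type a value has. The variable names below are only illustrative.

```python
a = 42
b = "hello"

# isinstance() is the usual way to check a value's type.
a_is_int = isinstance(a, int)
b_is_int = isinstance(b, int)
print(a_is_int, b_is_int)

# Types are ordinary objects too: the type of `int` is `type`.
print(type(int))

# Calling a type converts a value to that type where possible.
number = int("42")
text = str(42)
print(number + 1, text + "!")
```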

A few other common types:

In [4]:
print(type(1.1))
print(type(1j))

<class 'float'>
<class 'complex'>


## Containers
A container is a collection of values. Various types of containers exist, differing in how the values are stored, which gives them different performance and memory-use characteristics. That is, you choose a container depending on the kind of operations you want to do on your collection of values.
### Lists
A list is an ordered sequence of values. Note that each value can be of a different type. Also, the type of the list does not depend on the type of the contained values:

In [5]:
a_list = [1, "hello"]
type(a_list)

list

In [6]:
type(a_list[0])

int

You can add stuff to a list after you have initially created it:

In [7]:
a_list.append(1.2)
a_list

[1, 'hello', 1.2]
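
A few more everyday list operations, as a minimal sketch (the variable names are made up for illustration):

```python
# A list keeps its elements in order and can mix types freely.
items = [1, "hello"]
items.append(1.2)          # add one element at the end
items.extend([None, 7])    # add several elements at once

first = items[0]           # indexing starts at 0
last = items[-1]           # negative indices count from the end
middle = items[1:3]        # slicing returns a new list

element_types = [type(x).__name__ for x in items]
print(items)
print(element_types)
```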

### Dictionaries
A dictionary is a collection of key-value pairs. You can quickly look up a value by providing the key (since Python 3.7, dictionaries also remember the order in which keys were inserted). E.g. a phone book:

In [8]:
phonebook = {"Janne":123, "Richard":456}

In [9]:
phonebook

{'Janne': 123, 'Richard': 456}

In [10]:
phonebook["Janne"]

123

If you have experience with other programming languages, you might know dictionaries as "associative arrays", "hash tables", or "maps".
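
A short sketch of other common dictionary operations, continuing the phone book idea. The extra entries ("Alice", "Bob") are made up for illustration.

```python
phonebook = {"Janne": 123, "Richard": 456}

phonebook["Alice"] = 789             # add a new key-value pair
has_janne = "Janne" in phonebook     # membership tests look at the keys
missing = phonebook.get("Bob")       # .get() returns None instead of raising KeyError

# Iterate over the key-value pairs.
for name, number in phonebook.items():
    print(name, number)

names = list(phonebook)              # the keys, in insertion order on Python 3.7+
```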
### Numpy arrays
The types above are inefficient for numerical work, because there is no uniformity in them.

Numpy fulfills the need of the numerical computing community for an efficient data structure for dense multi-dimensional arrays.  These are stored in exactly the same way as C/Fortran (and any other language) stores them:

In [11]:
import numpy as np
n = np.array((1, 2, 3))
n2 = np.array(((1, 2, 3), (4, 5, 6)))
n2

array([[1, 2, 3],
       [4, 5, 6]])

Unlike C/Fortran, these are wrapped in Python so you get a lot of methods on them.

You can see the shape of a numpy array with the shape attribute:

In [12]:
n2.shape

(2, 3)

Contrary to a list, each value in a numpy array must be of the same type. You can see the type of the values in a numpy array from the dtype attribute:

In [13]:
n2.dtype

dtype('int64')

In [14]:
n[0] = 4
n

array([4, 2, 3])

In [15]:
n[0] = "hello"

ValueError: invalid literal for int() with base 10: 'hello'
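
The single-type rule also shows up in less obvious ways. The sketch below illustrates two of them, assuming default numpy behaviour: mixed numeric input is upcast to one common dtype, and a float assigned into an integer array is truncated to fit.

```python
import numpy as np

# Mixed int/float input is upcast to a single floating-point dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# Assigning a float into an integer array silently truncates it
# to fit the existing dtype.
ints = np.array([1, 2, 3])
ints[0] = 4.7
print(ints)

# astype() makes an explicit copy with a different dtype.
as_float = ints.astype(float)
print(as_float.dtype)
```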

Why this restriction? It comes down to the "efficient" word above. Since a list can have elements of arbitrary type, it needs an extra layer of indirection:

![a_list in memory](img/a_list.svg)

And for a multidimensional array, it's even worse; each element is then a reference to a nested list etc.

In contrast, a numpy ndarray is stored densely in memory:

![ndarray in memory](img/ndarray.svg)

A multidimensional ndarray is stored in memory as a single one-dimensional data array, and the shape information stored in the metadata is used to calculate the correct element to access.
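
The index calculation can be sketched by hand. This example assumes the default row-major layout and fixes the dtype to `int64` so that the byte counts are predictable.

```python
import numpy as np

n2 = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

flat = n2.ravel()            # the same six values viewed as one dimension
n_rows, n_cols = n2.shape

# In the default row-major (C) layout, element (i, j) sits at
# position i * n_cols + j of the flat data.
i, j = 1, 2
flat_index = i * n_cols + j
print(flat[flat_index], n2[i, j])

# strides gives the number of bytes to step per dimension:
# one row is 3 * 8 bytes, one column is 8 bytes for int64.
print(n2.strides)
```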

Numpy ndarrays are stored in the same way that arrays in C or Fortran are stored. This allows one to use battle-tested C/Fortran code working directly on ndarray data, all glued together with an easy to use Python layer. Essentially, this is what numpy and scipy are mostly about: C/Fortran-style arrays and routines, plus extra metadata that makes things easier.
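
C and Fortran differ in whether rows or columns are contiguous in memory, and numpy can represent both. A minimal sketch, using `np.asfortranarray` to make a column-major copy:

```python
import numpy as np

c_order = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)  # row-major, the numpy default
f_order = np.asfortranarray(c_order)                        # column-major copy

print(c_order.flags['C_CONTIGUOUS'], f_order.flags['F_CONTIGUOUS'])

# The logical contents are identical...
same_values = np.array_equal(c_order, f_order)

# ...but the elements sit in a different order in memory.
# order='K' reads the elements in the order they are stored.
c_memory = c_order.ravel(order='K').tolist()
f_memory = f_order.ravel(order='K').tolist()
print(c_memory, f_memory)
```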

## Data frames
So what, then, is a data frame? In short, it is a data structure for tabular data. It is similar to a two-dimensional numpy ndarray, except that each column can be of a different type (in fact, Pandas data frames are currently implemented roughly as a one-dimensional ndarray per column). Data frames can optionally use one of the columns as an index, similar to e.g. an RDBMS, allowing quicker lookups of rows by the index column.

An additional type of data supported by data frames is categorical data. It is useful when one wants to group rows according to the value of a string column. We'll get back to categorical data later. If you have used R, you'll know categorical data as factors.

In the end, you could look at the data frames we will be using as yet another layer of metadata that makes numpy arrays even more useful.

Let's look at some simple examples:

In [16]:
import pandas as pd

In [17]:
d = pd.DataFrame({'name': ('phys', 'nbe'), 'value': (1, 2)})
d

   name  value
0  phys      1
1   nbe      2


You see here that we have a dataframe with two rows and two columns. No wait, three columns! How come? The first column is the index; a pandas dataframe always has an index. Often you want to use one of the columns in the dataset as the index, e.g. a timestamp. But if you don't specify an index, pandas creates a default one that numbers the rows starting from 0. You can also specify the index explicitly:

In [18]:
d=pd.DataFrame({'name': ('phys', 'nbe')}, index=(1,2))
d

   name
1  phys
2   nbe


We can give the index column a name with:

In [19]:
d.index.name = 'value'
d

       name
value
1      phys
2       nbe
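
Using an existing column as the index, and looking rows up by it, might look like the sketch below. The extra `cs` row is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["phys", "nbe", "cs"], "value": [1, 2, 3]})

# Promote an existing column to be the index.
by_name = df.set_index("name")

# .loc looks up rows (and optionally columns) by label.
nbe_value = by_name.loc["nbe", "value"]
print(nbe_value)

# reset_index() turns the index back into an ordinary column.
restored = by_name.reset_index()
print(restored)
```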


Let's demonstrate some pandas features by loading a toy dataset to play around with. Pandas happens to ship with a few in its test data; here we use the famous iris dataset (the file location may differ between pandas versions):

In [20]:
import os.path
c = pd.read_csv(os.path.join(pd.__path__[0], 'tests/data/iris.csv'))

If the dataset is large, viewing the entire dataset on the screen is often impossible. Pandas has head and tail methods similar to the shell tools:

In [21]:
c.head()

   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa


In [22]:
c.tail(3) # Show the last 3 rows (default 5)

     SepalLength  SepalWidth  PetalLength  PetalWidth            Name
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica


The data type of each column can be retrieved from the dtypes attribute:

In [23]:
c.dtypes

SepalLength    float64
SepalWidth     float64
PetalLength    float64
PetalWidth     float64
Name            object
dtype: object

The last column is of type 'object'. Its values look like they come from a small set of strings, indicating that it might be suitable as a categorical type. Let's see what happens if we try to make a categorical of it:

In [24]:
pd.Categorical(c['Name'])

[Iris-setosa, Iris-setosa, Iris-setosa, Iris-setosa, Iris-setosa, ..., Iris-virginica, Iris-virginica, Iris-virginica, Iris-virginica, Iris-virginica]
Length: 150
Categories (3, object): [Iris-setosa, Iris-versicolor, Iris-virginica]

So there are only 3 different values the name can take. Let's replace that column in the dataset with the corresponding categorical:

In [25]:
c['Name'] = pd.Categorical(c['Name'])

In [26]:
c.dtypes

SepalLength     float64
SepalWidth      float64
PetalLength     float64
PetalWidth      float64
Name           category
dtype: object

If you know up front that a column should be categorical, you can specify that directly in read_csv:

In [27]:
c = pd.read_csv(os.path.join(pd.__path__[0], 'tests/data/iris.csv'), dtype={'Name': 'category'})
c.dtypes

SepalLength     float64
SepalWidth      float64
PetalLength     float64
PetalWidth      float64
Name           category
dtype: object
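
To see what a categorical column buys you, here is a minimal sketch on a tiny made-up table standing in for the iris data. The labels are stored once, each row holds only a small integer code, and grouping by the column works directly.

```python
import pandas as pd

# Made-up rows, not the real iris measurements.
flowers = pd.DataFrame({
    "Name": ["setosa", "virginica", "setosa", "versicolor"],
    "PetalLength": [1.4, 5.1, 1.3, 4.5],
})
flowers["Name"] = flowers["Name"].astype("category")

# The distinct labels are stored once...
categories = list(flowers["Name"].cat.categories)
# ...and each row only holds an integer code pointing at a label.
codes = flowers["Name"].cat.codes.tolist()
print(categories, codes)

# Group by the categorical column; observed=True only reports
# categories that actually occur in the data.
mean_length = flowers.groupby("Name", observed=True)["PetalLength"].mean()
print(mean_length)
```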

In addition to head() and tail(), a useful way of quickly getting some idea about a dataset is the describe() method:

In [28]:
c.describe()

       SepalLength  SepalWidth  PetalLength  PetalWidth
count        150.0       150.0        150.0       150.0
mean      5.843333       3.054     3.758667    1.198667
std       0.828066    0.433594      1.76442    0.763161
min            4.3         2.0          1.0         0.1
25%            5.1         2.8          1.6         0.3
50%            5.8         3.0         4.35         1.3
75%            6.4         3.3          5.1         1.8
max            7.9         4.4          6.9         2.5
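
The individual statistics are also available on their own, and non-numeric columns have their own quick summaries. A small sketch on made-up data (the column names `length` and `label` are hypothetical):

```python
import pandas as pd

# Made-up numbers, not the iris data.
df = pd.DataFrame({"length": [1.0, 2.0, 3.0, 4.0], "label": ["a", "b", "a", "b"]})

summary = df.describe()              # numeric columns only by default
print(summary)

# Individual statistics are also available as methods.
mean_length = df["length"].mean()
median_length = df["length"].median()

# value_counts() is a quick summary for non-numeric columns.
label_counts = df["label"].value_counts()
print(label_counts)
```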


## What is a pandas DataFrame, really?
Technical description:
* A `DataFrame` is a table-like object with rows and columns
* One numpy array per column
* One more numpy array for a special column called the "index"
* Extracting single columns or rows gives you a `Series`, which has the same index as the table but only one "column" (or "row")
* A lot of useful methods for higher-level operations than you would do with raw `numpy`: time series operations, handling of missing data, etc.
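
The points above can be checked on a small table. This is only a sketch; the index labels `a` and `b` are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["phys", "nbe"], "value": [1, 2]}, index=["a", "b"])

column = df["value"]          # one column -> Series
row = df.loc["a"]             # one row -> Series, indexed by the column names

print(type(column).__name__, type(row).__name__)

# The column shares the table's index...
same_index = column.index.equals(df.index)
# ...and its values are available as a numpy array.
values = column.to_numpy()
print(type(values).__name__, values)
```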

Why are higher level data structures better?  Because we can do things with less mental effort.

# Exercises
1. Try to find ten different data types

In [41]:
import subprocess
print(type(""))
print(type(1))
print(type(1.1))
print(type(1j))
print(type((1,2,3)))
print(type(open('/dev/null')))
print(type(np.array([])))  # an ndarray instance; type(np.ndarray) would just give 'type'
print(type(pd.DataFrame([])))
print(type(pd.Series([], dtype=float)))
print(type(type))
print(type(subprocess.Popen))
print(type(print))


<class 'str'>
<class 'int'>
<class 'float'>
<class 'complex'>
<class 'tuple'>
<class '_io.TextIOWrapper'>
<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'type'>
<class 'type'>
<class 'builtin_function_or_method'>
