# Introduction
## Goals
By the end of this course, you should be able to:
- Do basic data analysis using R or Python/Pandas, with a special emphasis on
  - Triton, or other similar HPC cluster environments
  - workflows, I/O strategies, etc. that work on HPC clusters.

What this course is NOT:
- A basic course in programming. We don't expect you to have prior knowledge of R or Python, but some programming experience is required.
- A basic course in statistics / machine learning. As part of the course we'll do some simple stuff, but we expect that you either already understand the statistics or learn it on your own.

Topics that we're going to cover:
- The dataframe data structure, and how it relates to other common data structures.
- Working with dataframes. Indexing, etc.
- Visualizing your results.

# Data structures and data frames
What is a data frame? Let's start by comparing it to the other common data structures you might have come across:

## Scalar
A scalar variable is just a single value:

In [None]:
a = 42

A variable has a type. Python has the builtin function type() that gives you the type of an object:

In [None]:
type(a)

If we create another variable, say, a string, we see that it has a different type:

In [None]:
b = "hello"
type(b)

## Containers
A container is a collection of values. Various types of containers exist, differing in how the values are stored. This gives them different performance and storage efficiency characteristics. That is, you choose a container depending on what kind of operations you want to do on your collection of values.
### Lists
A list is a sequential array of values. Note that each value can be of a different type. Also, the type of the list does not depend on the type of the contained values:

In [None]:
a_list = [1, "hello"]
type(a_list)

In [None]:
type(a_list[0])

You can add stuff to a list after you have initially created it:

In [None]:
a_list.append(1.2)
a_list

### Dictionaries
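A dictionary is a collection of key-value pairs. You can quickly look up a value by providing the key. (Since Python 3.7 a dictionary remembers the order in which keys were inserted, but you access values by key, not by position.) E.g. a phone book: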

In [None]:
phonebook = {"Janne":123, "Richard":456}

In [None]:
phonebook

In [None]:
phonebook["Janne"]

If you have experience with other programming languages, you might know dictionaries as "associative arrays", "hash tables", or "maps".
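As a minimal sketch of the other everyday dictionary operations, the cell below adds an entry, tests for a key, and looks up a key that may be missing. The name `"Maria"` and the number `789` are made up for illustration.

```python
phonebook = {"Janne": 123, "Richard": 456}

# Add a new key-value pair (hypothetical entry, for illustration only).
phonebook["Maria"] = 789

# Membership tests look at the keys, not the values.
has_janne = "Janne" in phonebook

# .get() returns None (or a default you supply) instead of raising KeyError.
missing = phonebook.get("Nobody")
fallback = phonebook.get("Nobody", 0)

# Iterating over a dictionary yields its keys.
names = list(phonebook)
```

Using `phonebook["Nobody"]` directly would raise a `KeyError`, which is why `.get()` is handy when a key might be absent.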
### Numpy arrays
Numpy fulfills the need of the numerical computing community for an efficient data structure for dense multi-dimensional arrays:

In [None]:
import numpy as np
n = np.array((1, 2, 3))
n2 = np.array(((1, 2, 3), (4, 5, 6)))
n2

You can see the shape of a numpy array with the shape attribute:

In [None]:
n2.shape

Unlike in a list, each value in a numpy array must be of the same type. You can see the type of the values in a numpy array from the dtype attribute:

In [None]:
n2.dtype

In [None]:
n[0] = 4
n

In [None]:
n[0] = "hello"  # raises ValueError: "hello" cannot be converted to the array's integer dtype
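The single-type rule also applies when an array is created: numpy picks one dtype that can hold every value. The sketch below shows this, and catches the error from the previous cell so that the notebook keeps running. The variable names are illustrative only.

```python
import numpy as np

# Mixing integers and floats gives a floating-point array.
mixed_numbers = np.array([1, 2.5])
mixed_kind = mixed_numbers.dtype.kind  # "f" means floating point

# Mixing numbers and strings gives an array of strings.
mixed_text = np.array([1, "hello"])
text_kind = mixed_text.dtype.kind  # "U" means unicode string

# Assigning a non-numeric string into an integer array fails.
ints = np.array([1, 2, 3])
try:
    ints[0] = "hello"
    assignment_failed = False
except ValueError:
    assignment_failed = True
```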

Why this restriction? It comes down to the "efficient" word above. Since a list can have elements of arbitrary type, it needs an extra layer of indirection:

![a_list in memory](img/a_list.svg)

And for a multidimensional array, it's even worse; each element is then a reference to a nested list etc.

In contrast, a numpy ndarray is stored densely in memory:

![ndarray in memory](img/ndarray.svg)

A multidimensional ndarray is stored in memory as a single one-dimensional data array, and the shape information stored in the metadata is used to calculate the correct element to access.
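A rough sketch of that index calculation, assuming the default row-major (C) layout and an explicit 64-bit integer dtype. The `strides` attribute gives the number of bytes to step along each dimension.

```python
import numpy as np

a = np.arange(6, dtype=np.int64).reshape(2, 3)
# a is [[0, 1, 2],
#       [3, 4, 5]]

i, j = 1, 2

# Position of element (i, j) in the underlying one-dimensional data.
flat_index = i * a.shape[1] + j

# The same position expressed in bytes, using the strides metadata.
byte_offset = i * a.strides[0] + j * a.strides[1]

# ravel() gives a flattened, one-dimensional view of the data.
same_element = a.ravel()[flat_index] == a[i, j]
```

Here `a.strides` is `(24, 8)`: moving one row down skips three 8-byte integers, and moving one column right skips one.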

Numpy ndarrays are stored in the same way that arrays in C or Fortran are stored. This allows one to use battle-tested C/Fortran code working directly on ndarray data, all glued together with an easy-to-use Python layer. Essentially, this is what most of numpy and scipy is about.
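In practice this means that you call one numpy function on the whole array instead of looping over the elements in Python. A small sketch comparing the two (the array contents are arbitrary):

```python
import numpy as np

values = np.arange(1000, dtype=np.float64)

# Pure Python: the interpreter handles every element one at a time.
loop_total = 0.0
for v in values:
    loop_total += v

# numpy: the loop runs in compiled code over the dense data.
vector_total = values.sum()

# Many operations, such as the dot product, work the same way.
dot = np.dot(values, values)
```

Both approaches give the same answer; for large arrays the numpy version is typically much faster.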

## Data frames
So what, then, is a data frame? In short, it is a data structure for tabular data. It is similar to a two-dimensional numpy ndarray, except that each column can be of a different type (in Pandas, each column is essentially a one-dimensional ndarray). Data frames can optionally use one column as an index, similar to e.g. a table in an RDBMS, allowing quicker lookups of rows when using the index column.
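A minimal sketch of these two points, per-column types and an index column, using the phone book from earlier. The `active` column and the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Janne", "Richard"],
    "extension": [123, 456],
    "active": [True, False],  # hypothetical extra column
})

# Each column has its own dtype.
column_types = df.dtypes

# Use the "name" column as the index, then look up a row by its label.
indexed = df.set_index("name")
janne_extension = indexed.loc["Janne", "extension"]
```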

An additional type of data supported by data frames is categorical data. It is useful when one wants to group a string column according to the string value. If you have used R, you'll know categorical data as factors. We'll get back to categorical data later.
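As a brief preview, and only as a sketch with made-up values, this is roughly what a categorical column looks like in Pandas. The distinct strings are stored once as categories, and each row holds a small integer code pointing to its category.

```python
import pandas as pd

colors = pd.Series(["red", "green", "red", "blue"], dtype="category")

# The distinct values, stored once (sorted when inferred from the data).
categories = list(colors.cat.categories)

# One integer code per row, referring to a position in the categories.
codes = list(colors.cat.codes)

# Counting the rows in each category.
counts = colors.value_counts()
```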

Let's look at some simple examples:

In [None]:
import pandas as pd