# Introduction to Pandas

Tamás Gál (tamas.gal@fau.de)

The latest version of this notebook is available at [https://github.com/Asterics2020-Obelics/School2017/tree/master/pandas](https://github.com/Asterics2020-Obelics/School2017/tree/master/pandas)

In [1]:
import pandas as pd
import numpy as np
import sys

print("Python version: {0}\n"
      "Pandas version: {1}\n"
      "NumPy version: {2}"
      .format(sys.version, pd.__version__, np.__version__))

Python version: 3.6.0 (default, Jan 30 2017, 16:11:40) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]
Pandas version: 0.20.1
NumPy version: 1.12.1


In [2]:
from IPython.core.magic import register_line_magic

@register_line_magic
def shorterr(line):
    """Show only the exception message if one is raised."""
    try:
        output = eval(line)
    except Exception as e:
        print("\x1b[31m\x1b[1m{e.__class__.__name__}: {e}\x1b[0m".format(e=e))
    else:
        return output
    
# The magic stays registered with IPython; only the plain function name is removed.
del shorterr

## The basic data structures in Pandas

### `DataFrame`

In [3]:
data = {'a': [1, 2, 3],
        'b': [4.1, 5.2, 6.3],
        'c': ['foo', 'bar', 'baz'],
        'd': 42}

In [4]:
df = pd.DataFrame(data)
df

   a    b    c   d
0  1  4.1  foo  42
1  2  5.2  bar  42
2  3  6.3  baz  42


In [5]:
type(df)

pandas.core.frame.DataFrame
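
In the dictionary above, the scalar `42` under the key `'d'` is repeated for every row, while the list-valued columns must all have the same length. As a minimal sketch, the same table can also be built row by row from a list of dictionaries. The names `df_cols`, `records` and `df_rows` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Column-oriented input: the scalar 42 is repeated for every row.
df_cols = pd.DataFrame({'a': [1, 2, 3],
                        'b': [4.1, 5.2, 6.3],
                        'c': ['foo', 'bar', 'baz'],
                        'd': 42})

# Row-oriented input: one dictionary per row (illustrative data).
records = [{'a': 1, 'b': 4.1, 'c': 'foo', 'd': 42},
           {'a': 2, 'b': 5.2, 'c': 'bar', 'd': 42},
           {'a': 3, 'b': 6.3, 'c': 'baz', 'd': 42}]
df_rows = pd.DataFrame(records)

# Both constructions should describe the same table.
print(df_cols.equals(df_rows))
```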

### `Series`

In [6]:
df['a']

0    1
1    2
2    3
Name: a, dtype: int64

In [7]:
type(df['a'])

pandas.core.series.Series

In [8]:
df['a'] * 23

0    23
1    46
2    69
Name: a, dtype: int64

In [9]:
np.cos(df['a'])

0    0.540302
1   -0.416147
2   -0.989992
Name: a, dtype: float64

In [10]:
s = pd.Series(np.random.randint(0, 10, 5))
s

0    5
1    4
2    5
3    9
4    9
dtype: int64

In [11]:
s.sort_values()  # indices are kept!

1    4
0    5
2    5
3    9
4    9
dtype: int64

In [12]:
s * s.sort_values()  # and are used to match elements

0    25
1    16
2    25
3    81
4    81
dtype: int64

In [13]:
s * s.sort_values().reset_index(drop=True)

0    20
1    20
2    25
3    81
4    81
dtype: int64
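
The two products above differ because arithmetic between `Series` pairs elements by index label, not by position. The following sketch makes this visible with explicit string labels. The names `left`, `right` and `product` are assumptions made for illustration only.

```python
import pandas as pd

# Two Series with partially overlapping, explicitly labelled indices.
left = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
right = pd.Series([10, 20, 30], index=['z', 'y', 'w'])

# Elements are paired by label: 'y' with 'y', 'z' with 'z'.
product = left * right
print(product)

# Labels that exist in only one operand ('x' and 'w') produce NaN.
print(product.isnull().sum())
```

Calling `reset_index(drop=True)` on one operand, as in the cell above, discards the old labels and therefore changes which elements are matched.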

## Examining a `DataFrame`

In [14]:
df

   a    b    c   d
0  1  4.1  foo  42
1  2  5.2  bar  42
2  3  6.3  baz  42


In [15]:
df.dtypes

a      int64
b    float64
c     object
d      int64
dtype: object

In [16]:
df.columns

Index(['a', 'b', 'c', 'd'], dtype='object')

In [17]:
df.shape

(3, 4)

### Looking into the data

In [18]:
df.head(2)

   a    b    c   d
0  1  4.1  foo  42
1  2  5.2  bar  42


In [19]:
df.tail(2)

   a    b    c   d
1  2  5.2  bar  42
2  3  6.3  baz  42


In [20]:
df.describe()

         a     b     d
count  3.0  3.00   3.0
mean   2.0  5.20  42.0
std    1.0  1.10   0.0
min    1.0  4.10  42.0
25%    1.5  4.65  42.0
50%    2.0  5.20  42.0
75%    2.5  5.75  42.0
max    3.0  6.30  42.0
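
Note that the string column `c` is missing from the summary: by default `describe()` only covers numeric columns when the frame contains any. A small sketch of how the other columns could be included through the `include` parameter follows. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [4.1, 5.2, 6.3],
                   'c': ['foo', 'bar', 'baz'],
                   'd': 42})

# Default: only the numeric columns are summarised.
numeric_summary = df.describe()

# include='all' also covers non-numeric columns; statistics that do not
# apply to a column (such as the mean of strings) are filled with NaN.
full_summary = df.describe(include='all')

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```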


## Indexing and Slicing

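Pandas offers several ways to index and slice data, which can be confusing at first. The main accessors are `.loc` (label-based) and `.iloc` (position-based); the older `.ix` mixed both behaviours and is deprecated.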

In [21]:
df.loc

<pandas.core.indexing._LocIndexer at 0x10e992b00>

In [22]:
df.iloc

<pandas.core.indexing._iLocIndexer at 0x1105f7fd0>

In [23]:
df.ix  # deprecated since pandas 0.20 and removed in pandas 1.0

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix


<pandas.core.indexing._IXIndexer at 0x11064ecf8>
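
As the warning suggests, code that relied on `.ix` can usually be rewritten with `.loc` or `.iloc`. The sketch below assumes a small frame with string row labels. The frame and the names `by_label`, `by_position` and `mixed` are made up for illustration.

```python
import pandas as pd

# Illustrative frame with string row labels.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.1, 5.2, 6.3]},
                  index=['x', 'y', 'z'])

by_label = df.loc['y', 'b']     # row label 'y', column label 'b'
by_position = df.iloc[1, 1]     # second row, second column

# Mixed access (row by position, column by label):
# translate the position into a label first, then use .loc.
mixed = df.loc[df.index[1], 'b']

print(by_label, by_position, mixed)
```

All three expressions address the same cell; being explicit about labels versus positions avoids the ambiguity that led to the deprecation of `.ix`.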

### Using `.loc[]`

`.loc[]` interprets its input as an index label (a "row name"), not as a position.

In [24]:
df.loc[2]

a      3
b    6.3
c    baz
d     42
Name: 2, dtype: object

In [25]:
df['b'].loc[2]

6.2999999999999998

In [26]:
%shorterr df.loc[-1]

KeyError: 'the label [-1] is not in the [index]'


#### Accessing multiple rows/columns

In [27]:
df.loc[[1, 2], ['b', 'd']]

     b   d
1  5.2  42
2  6.3  42


In [28]:
df.loc[1:3, ['a']]  # label-based slice: the end label is included

   a
1  2
2  3


### Using `.iloc[]`

`.iloc[]` interprets its input as an integer position, so negative values count from the end.

In [29]:
df.iloc[2]

a      3
b    6.3
c    baz
d     42
Name: 2, dtype: object

In [30]:
df.iloc[-1]

a      3
b    6.3
c    baz
d     42
Name: 2, dtype: object
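
With the default index, labels and positions coincide, which hides the difference between `.loc[]` and `.iloc[]`. The following sketch uses a hypothetical `Series` whose integer labels differ from the positions; the names `s`, `label_slice` and `pos_slice` are illustrative.

```python
import pandas as pd

# Hypothetical example: integer labels that differ from the positions.
s = pd.Series(['foo', 'bar', 'baz'], index=[10, 20, 30])

print(s.loc[10])     # label 10
print(s.iloc[0])     # position 0 (same element as label 10)
print(s.iloc[-1])    # last position

# Label slices include the end label, positional slices exclude the stop.
label_slice = s.loc[10:20]   # labels 10 and 20
pos_slice = s.iloc[0:1]      # position 0 only

print(len(label_slice), len(pos_slice))
```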