# Pandas - Introduction

This notebook is the first part of a series devoted to the `pandas` library.
It presents the basic objects of this library and how they interact with `numpy`.

In [1]:
# Start using pandas (default import convention)
import pandas as pd
import numpy as np

In [2]:
# Let pandas speak for itself
print(pd.__doc__)


pandas - a powerful data analysis and manipulation library for Python

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and

Visit the official website for well-written documentation: https://pandas.pydata.org

In [3]:
# Current version (1.2.0 when this notebook was last run)
print(pd.__version__)

1.2.0


## Basic objects 

The **pandas** library has a vast API with many useful functions. However, most of it revolves
around two important classes:

* Series
* DataFrame

In this introduction, we will focus on them: what each of them does and how they relate to each other
and to numpy objects.

### Series

Series is a one-dimensional data structure, central to pandas. 

For a complete API, visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

In [4]:
# My first series
series = pd.Series([1, 2, 3])
series

0    1
1    2
2    3
dtype: int64

This looks a bit like a Numpy array, does it not?

Actually, in most cases the Series wraps a Numpy array...

In [5]:
series.values  # The result is a Numpy array

array([1, 2, 3])

...and if we construct the series from a numpy array, it wraps it directly.

In [6]:
zeros_array = np.zeros(10)

# We check object identity
pd.Series(zeros_array).values is zeros_array

True
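As a side note, the pandas documentation recommends `to_numpy()` over `.values` when a NumPy array is needed. The following is a minimal sketch of that alternative; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3])

# to_numpy() returns the values as a NumPy array
as_array = numbers.to_numpy()
print(type(as_array))

# A dtype can be requested explicitly; here the integers are converted to floats
as_float = numbers.to_numpy(dtype="float64")
print(as_float)
```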

But there is something more. Alongside the values, we see that each item (or "row") has a certain label. The collection of labels is called **index**.

In [7]:
series.index

RangeIndex(start=0, stop=3, step=1)

This index (see below) can be used, as its name suggests, to index items of the series.

In [8]:
# Return an element from the series
series.loc[1]

2

In [9]:
# Construction from a dictionary
series_ab = pd.Series({"a": 2, "b": 4})
series_ab

a    2
b    4
dtype: int64
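When a series is built from a dictionary, the keys become the index labels. A small sketch of looking values up by those labels follows; the `prices` data is made up for illustration.

```python
import pandas as pd

# Hypothetical data: the dictionary keys become the index labels
prices = pd.Series({"apple": 1.5, "banana": 0.25, "cherry": 4.0})

print(list(prices.index))                 # the labels, in insertion order
print(prices.loc["banana"])               # one label returns a scalar
print(prices.loc[["cherry", "apple"]])    # a list of labels returns a Series
```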

**Exercise**: Create a series with 5 elements.

In [10]:
result = ...

### DataFrame

A **DataFrame** is pandas' answer to Excel sheets: it is a collection of named columns (or, in our case, a collection of **Series**).
Quite often, we read data frames directly from an external source, but it is also possible to create them from:
* a dict of Series, numpy arrays or other array-like objects
* an iterable of rows (where rows are Series, lists, dictionaries, ...)

In [14]:
# List of lists (no column names)
table = [
    ['a', 1],
    ['b', 3],
    ['c', 5]
]
table_df = pd.DataFrame(table)
table_df

|   | 0 | 1 |
|---|---|---|
| 0 | a | 1 |
| 1 | b | 3 |
| 2 | c | 5 |


In [15]:
# Dict of Series (with column names)
df = pd.DataFrame({
    'number': pd.Series([1, 2, 3, 4], dtype=np.int8),
    'letter': pd.Series(['a', 'b', 'c', 'd'])
})
df

|   | number | letter |
|---|--------|--------|
| 0 | 1 | a |
| 1 | 2 | b |
| 2 | 3 | c |
| 3 | 4 | d |
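Rows can also be given as dictionaries, in which case the keys supply the column names. The sketch below uses made-up rows (one of them incomplete) to show how a missing key is handled.

```python
import pandas as pd

# Hypothetical rows; the last one has no "number" key
rows = [
    {"letter": "a", "number": 1},
    {"letter": "b", "number": 3},
    {"letter": "c"},
]
rows_df = pd.DataFrame(rows)

print(rows_df)          # the missing cell is filled with NaN
print(rows_df.dtypes)   # "number" is stored as float64 because of the NaN
```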


In [16]:
# Numpy array (10x2), specify column names
data = np.random.normal(0, 1, (10, 2))

df = pd.DataFrame(data, columns=['a', 'b'])
df

|   | a | b |
|---|---|---|
| 0 | 0.549428 | -0.085922 |
| 1 | -0.459107 | 0.591276 |
| 2 | -0.844495 | 2.008266 |
| 3 | 1.906335 | -0.462032 |
| 4 | -0.889533 | 0.483394 |
| 5 | 0.497882 | 0.138273 |
| 6 | 0.336032 | 0.944945 |
| 7 | -0.170603 | 0.429885 |
| 8 | -0.31158 | 0.543498 |
| 9 | -1.558922 | -0.814398 |


In [20]:
# A DataFrame also has an index.
df.index

RangeIndex(start=0, stop=10, step=1)

In [19]:
# The columns also form an index.
df.columns

Index(['a', 'b'], dtype='object')

In [18]:
# The row index is shared by all columns
df.index is df["a"].index

True

## D(ata) types

Pandas builds upon the numpy data types (mentioned earlier) and adds a few more.

In [26]:
typed_df = pd.DataFrame({
  "bool": np.arange(5) % 2 == 0,
  "int": range(5),
  "int[nan]": pd.Series([np.nan, 0, 1, 2, 3], dtype="Int64"),
  "float": np.arange(5) * 3.14,
  "object": [None, 1, "2", [3, 4], 5 + 6j],
  "string?": ["a", "b", "c", "d", "e"],
  "string!": pd.Series(["a", "b", "c", "d", "e"], dtype="string"),
  "datetime": pd.date_range('2018-01-01', periods=5, freq='3M'),
  "timedelta": pd.timedelta_range(0, freq="1s", periods=5),
  "category": pd.Series(["animal", "plant", "animal", "animal", "plant"], dtype="category")
})
typed_df

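|   | bool | int | int[nan] | float | object | string? | string! | datetime | timedelta | category |
|---|------|-----|----------|-------|--------|---------|---------|----------|-----------|----------|
| 0 | True | 0 |  | 0.0 |  | a | a | 2018-01-31 | 0 days 00:00:00 | animal |
| 1 | False | 1 | 0.0 | 3.14 | 1 | b | b | 2018-04-30 | 0 days 00:00:01 | plant |
| 2 | True | 2 | 1.0 | 6.28 | 2 | c | c | 2018-07-31 | 0 days 00:00:02 | animal |
| 3 | False | 3 | 2.0 | 9.42 | [3, 4] | d | d | 2018-10-31 | 0 days 00:00:03 | animal |
| 4 | True | 4 | 3.0 | 12.56 | (5+6j) | e | e | 2019-01-31 | 0 days 00:00:04 | plant |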


In [27]:
typed_df.dtypes

bool                    bool
int                    int64
int[nan]               Int64
float                float64
object                object
string?               object
string!               string
datetime      datetime64[ns]
timedelta    timedelta64[ns]
category            category
dtype: object

We will see some of these types used in practice in later analyses.
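To illustrate why the nullable `Int64` type exists, here is a minimal sketch (with illustrative variable names) comparing it with the default behaviour, where a missing value turns an integer column into floats.

```python
import numpy as np
import pandas as pd

# With plain numpy types, NaN forces the whole series to float64
with_gap = pd.Series([1, np.nan, 3])
print(with_gap.dtype)

# The pandas "Int64" extension type keeps integers and marks the gap as missing
nullable = with_gap.astype("Int64")
print(nullable.dtype)
print(nullable.isna().tolist())
```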

## Indices & indexing



In [28]:
abc_series = pd.Series(range(3), index=["a", "b", "c"])
abc_series

a    0
b    1
c    2
dtype: int64

In [29]:
abc_series.index

Index(['a', 'b', 'c'], dtype='object')

In [30]:
abc_series.index = ["c", "d", "e"]  # Changes the labels in-place!
abc_series.index.name = "letter"
abc_series

letter
c    0
d    1
e    2
dtype: int64
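If changing the labels in place is not desired, `Series.rename` with a mapping returns a new series and leaves the original untouched. A short sketch, assuming a fresh series with illustrative names:

```python
import pandas as pd

original = pd.Series(range(3), index=["a", "b", "c"])

# Labels not mentioned in the mapping ("c" here) are kept as they are
renamed = original.rename({"a": "x", "b": "y"})

print(list(original.index))
print(list(renamed.index))
```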

In [31]:
table = [
    ['a', 1],
    ['b', 3],
    ['c', 5]
]
table_df = pd.DataFrame(
    table,
    index=["first", "second", "third"],
    columns=["alpha", "beta"]
)
table_df

|   | alpha | beta |
|---|-------|------|
| first | a | 1 |
| second | b | 3 |
| third | c | 5 |


In [32]:
alpha = table_df["alpha"]  # Simple [] indexing in DataFrame returns Series
alpha

first     a
second    b
third     c
Name: alpha, dtype: object

In [33]:
alpha["second"]             # Simple [] indexing in Series returns scalar values.

'b'

In [34]:
alpha.second   # This also works

'b'

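But be careful! Attribute access does not reach a label that has the same name as an existing attribute or method of the Series; here, `first` resolves to a method: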

In [35]:
alpha.first

<bound method NDFrame.first of first     a
second    b
third     c
Name: alpha, dtype: object>
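The same clash can be reproduced with any label that matches a Series method. The sketch below uses a made-up series with the labels `"mean"` and `"spread"` to show that square brackets always reach the data, while attribute access only works when there is no name conflict.

```python
import pandas as pd

# Hypothetical labels: "mean" collides with Series.mean, "spread" does not
stats = pd.Series([10, 20], index=["mean", "spread"])

print(stats["mean"])          # square brackets return the stored value
print(callable(stats.mean))   # the attribute is the method, not the value
print(stats.spread)           # no clash, so attribute access finds the label
```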

Indexing with a list of column names, such as `["list", "of", "columns"]`, yields a `DataFrame` with those columns.

For example:

In [36]:
table_df[["beta", "alpha"]]

|   | beta | alpha |
|---|------|-------|
| first | 1 | a |
| second | 3 | b |
| third | 5 | c |


`[["column_name"]]` returns a `DataFrame` as well, not a `Series`:

In [37]:
table_df[["alpha"]]

|   | alpha |
|---|-------|
| first | a |
| second | b |
| third | c |


There are two proper ways to index rows and cells in a DataFrame:

- `loc` for label-based indexing
- `iloc` for position-based indexing (it does not use the **index** at all)

Note the square brackets. The mentioned attributes actually are not methods
but special "indexer" objects. They accept one or two arguments specifying
the position along one or both axes.
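The difference between the two indexers is easiest to see when the labels are integers that do not match the positions. This is a minimal sketch with a made-up series; note also that label slices include both ends while position slices exclude the end.

```python
import pandas as pd

# Hypothetical series whose integer labels differ from the positions
scores = pd.Series(["x", "y", "z"], index=[10, 20, 30])

print(scores.loc[10])               # the item labelled 10
print(scores.iloc[0])               # the item at position 0 (the same item)
print(scores.loc[10:20].tolist())   # label slice: both ends are included
print(scores.iloc[0:1].tolist())    # position slice: the end is excluded
```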

**Exercise:** Create a `DataFrame` whose `x` column is $0, \frac{1}{4}\pi, \frac{1}{2}\pi, \ldots, 2\pi$, whose `y` column is `cos(x)`, and whose index consists of the fractions `0, 1/4, 1/2, ..., 2`.

In [38]:
import fractions

index = [fractions.Fraction(n, ___) for n in range(___)]
x = np.___([___ for ___ in ___])
y = ___

df = pd.DataFrame(___, index = ___)

# display
df

TypeError: 'method' object cannot be interpreted as an integer

#### loc


In [None]:
first = table_df.loc["first"]
first

In [None]:
table_df.loc["first", "beta"]            

In [None]:
table_df.loc["first":"second", "beta"]   # Use ranges (inclusive)

#### iloc

In [None]:
table_df.iloc[1]

In [None]:
table_df.iloc[0:4:2]   # Select every second row

In [None]:
table_df.at["first", "beta"]   # `at` is label-based (like loc) but accesses a single cell only

In [None]:
type(table_df.at)
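For single cells, `at` has a position-based counterpart, `iat`, and both can be used for assignment. The following sketch rebuilds a small frame similar to `table_df` (under the illustrative name `cells`) so that it can run on its own.

```python
import pandas as pd

cells = pd.DataFrame(
    {"alpha": ["a", "b"], "beta": [1, 3]},
    index=["first", "second"],
)

print(cells.at["first", "beta"])   # one cell, addressed by labels
print(cells.iat[0, 1])             # the same cell, addressed by positions

# Assigning through `at` changes the cell in place
cells.at["second", "beta"] = 30
print(cells.loc["second", "beta"])
```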

## Modifying DataFrames

Adding a new column works like adding a key/value pair to a dict.
Note that this operation, unlike most others, modifies the DataFrame in place.

In [None]:
from datetime import datetime
table_df["now"] = datetime.now()
table_df

A non-destructive version, which returns a new DataFrame, uses the `assign` method:

In [None]:
table_df.assign(delta = [True, False, True])

In [None]:
# However, the original DataFrame is not changed
table_df
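`assign` also accepts callables, which receive the DataFrame and return the new column. A small sketch with a made-up frame named `base`; the column names `doubled` and `shifted` are illustrative.

```python
import pandas as pd

base = pd.DataFrame({"beta": [1, 3, 5]}, index=["first", "second", "third"])

extended = base.assign(
    doubled=lambda df: df["beta"] * 2,
    # A later argument may refer to a column created earlier in the same call
    shifted=lambda df: df["doubled"] + 1,
)

print(extended)
print(list(base.columns))   # the original frame still has only "beta"
```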

Deleting a column is very easy too.

In [None]:
del table_df["now"]
table_df

The **drop** method works with both rows and columns (creating a new data frame).

In [None]:
table_df.drop("beta", axis=1)

In [None]:
table_df.drop("second", axis=0)
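Instead of `axis`, `drop` can also be called with the `index` and `columns` keywords, which some readers find easier to follow. A minimal sketch on a standalone copy of the example data (named `frame` here for illustration):

```python
import pandas as pd

frame = pd.DataFrame(
    {"alpha": ["a", "b", "c"], "beta": [1, 3, 5]},
    index=["first", "second", "third"],
)

# Remove one column and two rows in a single call; a new frame is returned
trimmed = frame.drop(columns="beta", index=["second", "third"])

print(trimmed)
print(frame.shape)   # the original frame keeps all rows and columns
```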

**Exercise:** Use a combination of `reset_index`, `drop` and `set_index` to transform `table_df` into `pd.DataFrame({'index': table_df.index}, index=table_df["alpha"])`

In [39]:
result = table_df.___.___.___

# display
result

AttributeError: 'DataFrame' object has no attribute '___'

---
**Let's get some data!**