<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="45%" align="right" border="4">

# pandas Basics

Dr. Yves J. Hilpisch

The Python Quants GmbH

<a href='http://tpq.io'>http://tpq.io</a> | <a href='mailto:team@tpq.io'>team@tpq.io</a>

## pandas Basics

In a sense, **`pandas` is built "on top" of `NumPy`**. For example, `NumPy` universal functions will generally work on `pandas` objects as well. We therefore import both to begin with.

In [None]:
from sys import version_info
version_info

In [None]:
import numpy as np
import pandas as pd
import warnings; warnings.simplefilter('ignore')

### First Steps with DataFrame Class

On a rather fundamental level, the `DataFrame` class is designed to manage **indexed and labeled data**.

In [None]:
df = pd.DataFrame([10, 20, 30, 40], columns=['numbers'],
                  index=['a', 'b', 'c', 'd'])
df

This simple example already shows some major features of the `DataFrame` class when it comes to storing data:

* **data**: data itself can be provided in different shapes and types
* **labels**: data is organized in columns which can have custom names
* **index**: there is an index that can take on different formats
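
As a small sketch of these three features, the same kind of object can also be built from a dictionary. The column names, values, and index labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical data: a dict maps column labels to column values.
data = {'numbers': [10, 20, 30], 'floats': [1.5, 2.5, 3.5]}
frame = pd.DataFrame(data, index=['x', 'y', 'z'])

print(frame.dtypes)          # one dtype per column
print(list(frame.columns))   # column labels
print(list(frame.index))     # index labels
```

Each column keeps its own data type, while all columns share the same index.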

Some important **attributes and methods** of the class (I).

In [None]:
df.index  # the index values

In [None]:
df.columns  # the column names

In [None]:
df.loc['c']  # selection via index label (.ix was removed in pandas 1.0)

In [None]:
df.loc[['a', 'd']]  # selection of multiple index labels

Some important **attributes and methods** of the class (II).

In [None]:
df.loc[df.index[1:3]]  # selection via Index object

In [None]:
df.sum()  # sum per column

In [None]:
df.apply(lambda x: x ** 2)  # square of every element

**Vectorized operations** on a `DataFrame` object generally work as on a `NumPy` `ndarray` object.

In [None]:
df ** 2  # again square, this time NumPy-like

**Enlarging the `DataFrame` object** in both dimensions is possible.

In [None]:
df['floats'] = (1.5, 2.5, 3.5, 4.5)
  # new column is generated
df

In [None]:
df['floats']  # selection of column

A whole `DataFrame` object can also be taken to define a new column. In such a case, indices are aligned automatically.

In [None]:
df['names'] = pd.DataFrame(['Yves', 'Guido', 'Felix', 'Francesc'],
                           index=['d', 'a', 'b', 'c'])
df
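
The alignment can be seen in isolation with a minimal sketch. The frame `base` and the `Series` `extra` are hypothetical; `extra` is deliberately ordered differently and lacks one label.

```python
import pandas as pd

base = pd.DataFrame({'numbers': [10, 20, 30]}, index=['a', 'b', 'c'])

# Ordered differently from base and without an entry for label 'c'.
extra = pd.Series(['second', 'first'], index=['b', 'a'])

# Values are matched by index label, not by position.
base['label'] = extra
print(base)
```

Labels that have no counterpart in the assigned object end up as missing values in the new column.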

**Appending data** works similarly &ndash; however, note the index replacement.

In [None]:
pd.concat([df, pd.DataFrame([{'numbers': 100, 'floats': 5.75,
                              'names': 'Henry'}])],
          ignore_index=True)
  # DataFrame.append was removed in pandas 2.0; temporary object, df not changed

Solution: append a DataFrame object, providing the appropriate index information.

In [None]:
df = pd.concat([df, pd.DataFrame({'numbers': 100, 'floats': 5.75,
                                  'names': 'Henry'}, index=['z'])])
  # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df

One of the strengths of `pandas` is working with **missing data**.

In [None]:
df.join(pd.DataFrame([1, 4, 9, 16, 25],
            index=['a', 'b', 'c', 'd', 'y'],
            columns=['squares',]))
  # temporary object

Doing an **`outer` join** preserves all data.

In [None]:
df = df.join(pd.DataFrame([1, 4, 9, 16, 25],
                    index=['a', 'b', 'c', 'd', 'y'],
                    columns=['squares',]),
                    how='outer')
df

Join methods are `inner, outer, left, right`.
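
To make the difference between the four join methods concrete, the following sketch joins two small, hypothetical frames whose indices only partly overlap and collects the resulting index labels.

```python
import pandas as pd

left = pd.DataFrame({'numbers': [10, 20, 30]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'squares': [4, 9, 16]}, index=['b', 'c', 'd'])

# Map each join method to the index labels of the joined frame.
result_index = {}
for how in ['left', 'right', 'inner', 'outer']:
    joined = left.join(right, how=how)
    result_index[how] = list(joined.index)
    print(how, result_index[how])
```

A `left` join keeps the labels of the calling object, `right` those of the other object, `inner` only the shared labels, and `outer` the union of both.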

Although there are **missing values**, the majority of method calls will still work.

In [None]:
df[['numbers', 'squares']].mean()
  # column-wise mean

In [None]:
df[['numbers', 'squares']].std()
  # column-wise standard deviation
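
Why these calls still work can be sketched with a tiny, made-up `Series`: reductions such as `mean` skip missing values by default, and the `skipna` parameter switches that behavior off.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

mean_default = s.mean()             # NaN is skipped: (1 + 3) / 2
mean_strict = s.mean(skipna=False)  # NaN propagates into the result
n_valid = s.count()                 # number of non-missing values

print(mean_default, mean_strict, n_valid)
```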

### Second Steps with DataFrame Class

First, a set of random **dummy data** as an `ndarray` object.

In [None]:
a = np.random.standard_normal((9, 4))
a.round(6)

This can be used to instantiate a `DataFrame` object.

In [None]:
df = pd.DataFrame(a)
df

Adding **column names** ...

In [None]:
df.columns = ['No1', 'No2', 'No3', 'No4']
df

... for easy data selection.

In [None]:
df['No2'][3]  # value in column No2 at index position 3

Adding a **`DatetimeIndex`**.

In [None]:
dates = pd.date_range('2015-1-1', periods=9, freq='M')
dates

In [None]:
df.index = dates
df

Closing the circle: from `DataFrame` object to `ndarray` object.

In [None]:
np.array(df).round(6)

In [None]:
df.values

### Basic Analytics

The `DataFrame` class has many **convenience methods** already built in (I).

In [None]:
df.sum()

In [None]:
df.mean()

The `DataFrame` class has many **convenience methods** already built in (II).

In [None]:
df.cumsum()

There is also a shortcut to a number of often-used **statistics** for numerical data sets, the `describe` method.

In [None]:
df.describe()

You can also directly apply the majority of `NumPy` **universal functions**.

In [None]:
np.sqrt(df)

**Incomplete data** is no problem for `pandas`.

In [None]:
np.sqrt(df).sum()

Neither is efficient **plotting**.

In [None]:
%matplotlib inline
df.cumsum().plot(lw=2.0)

In [None]:
# nicer plotting defaults
import seaborn as sns; sns.set()

In [None]:
df.cumsum().plot(lw=2.0)

### Series Class

There is also a dedicated `Series` class. 

In [None]:
type(df)

In [None]:
df['No1'].iloc[:3]  # first three rows by position (.ix was removed in pandas 1.0)

In [None]:
type(df['No1'])

The main `DataFrame` methods are available for `Series` objects as well.

In [None]:
import matplotlib.pyplot as plt
df['No1'].cumsum().plot(style='r', lw=2.)
plt.xlabel('date')
plt.ylabel('value');

### Vectorized Operations

**Adding** two columns.

In [None]:
df['No1'] + df['No4']

**Multiplying** two columns.

In [None]:
df['No1'] * df['No4']

Using **universal** functions, e.g. `sin`.

In [None]:
np.sin(df)

Using `NumPy` **aggregation** functions, e.g. `mean`.

In [None]:
np.mean(df)

In [None]:
df.mean()

Distinguishing **axes**.

In [None]:
df.mean(axis=0)

In [None]:
df.mean(axis=1)
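
With random data the two results are hard to check by eye. A minimal sketch with a small, hypothetical frame makes the direction of each axis explicit.

```python
import pandas as pd

small = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})

col_means = small.mean(axis=0)  # collapses the rows: one value per column
row_means = small.mean(axis=1)  # collapses the columns: one value per row

print(col_means)
print(row_means)
```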

Using the **`apply`** method with a Python function.

In [None]:
def f(x):
    return x ** 2

In [None]:
df.apply(f)

Using the **`apply`** method with a `lambda` (anonymous) function.

In [None]:
df.apply(lambda x: x ** 0.5)

**Speed comparisons** between universal functions and `apply` method.

In [None]:
%timeit np.sqrt(df)

In [None]:
%timeit df.apply(lambda x: x ** 0.5)
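
Outside of IPython, where the `%timeit` magic is not available, a comparable measurement can be sketched with the standard library's `timeit` module. The frame size and the number of repetitions are arbitrary assumptions, and the timings depend on the machine and the installed versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Absolute values avoid NaN results from square roots of negative numbers.
frame = pd.DataFrame(np.abs(rng.standard_normal((1000, 4))))

t_ufunc = timeit.timeit(lambda: np.sqrt(frame), number=100)
t_apply = timeit.timeit(lambda: frame.apply(lambda x: x ** 0.5), number=100)
print(f'ufunc: {t_ufunc:.4f}s  apply: {t_apply:.4f}s')

# Both approaches should yield the same numbers.
same = np.allclose(np.sqrt(frame).to_numpy(),
                   frame.apply(lambda x: x ** 0.5).to_numpy())
print(same)
```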

## Data Selection

Selecting **columns** (I).

In [None]:
df['No2']

Selecting **columns** (II).

In [None]:
df[['No1', 'No3']]

Selecting **columns** (III).

In [None]:
df.No2

Selecting **columns** (IV).

In [None]:
df.columns

In [None]:
df[df.columns[1]]

Selecting **columns** (V).

In [None]:
df.iloc[:, 1]  # selection by integer position (.ix was removed in pandas 1.0)

Selecting **columns** (VI).

In [None]:
df.loc[:, 'No2']  # selection by label

Selecting **columns** (VII).

In [None]:
df.iloc[:, 1:3]

Selecting **rows** (I).

In [None]:
df.loc[df.index[0]]  # first row via its label (.ix was removed in pandas 1.0)

Selecting **rows** (II).

In [None]:
df.iloc[0]

Selecting **rows** (III).

In [None]:
df.index[0]

In [None]:
df.loc['2015-01-31']

Selecting **rows** (IV).

In [None]:
df[:2]
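
The selection variants above come down to two accessors: `loc` works with labels and `iloc` with integer positions. The following sketch uses a small, hypothetical frame with a daily `DatetimeIndex` to contrast the two, including their different slicing rules.

```python
import pandas as pd

idx = pd.date_range('2015-01-01', periods=4, freq='D')
frame = pd.DataFrame({'No1': [1.0, 2.0, 3.0, 4.0],
                      'No2': [5.0, 6.0, 7.0, 8.0]}, index=idx)

by_label = frame.loc[idx[1], 'No2']  # row label and column label
by_position = frame.iloc[1, 1]       # row position and column position

# Label slices include the end label; position slices exclude the end position.
label_slice = frame.loc['2015-01-02':'2015-01-03']
pos_slice = frame.iloc[1:3]

print(by_label, by_position)
print(label_slice)
```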

## Basic Plotting with pandas

Different **types of plots** with pandas (I).

In [None]:
df.plot()  # DataFrame

Different **types of plots** with pandas (II).

In [None]:
df['No1'].plot()  # Series

Different **types of plots** with pandas (III).

In [None]:
df.hist();

Different **types of plots** with pandas (IV).

In [None]:
df.boxplot(return_type='dict');

Different **types of plots** with pandas (IVa).

In [None]:
eo = lambda x: x.day % 2 == 0
df['eo'] = eo(df.index)
df

Different **types of plots** with pandas (IVb).

In [None]:
df.boxplot(by='eo');
plt.tight_layout()

Different **types of plots** with pandas (V).

In [None]:
df.plot(kind='bar')

Different **types of plots** with pandas (VI).

In [None]:
df.plot(kind='bar', stacked=True)

Different **types of plots** with pandas (VII).

In [None]:
df.plot(kind='barh', stacked=True)

Different **types of plots** with pandas (VIII).

In [None]:
df.plot(x='No1', y='No2', kind='scatter')

Different **types of plots** with pandas (IX).

In [None]:
np.abs(df).plot(kind='area');

Different **types of plots** with pandas (X).

In [None]:
np.power(df, 2).plot(kind='area', stacked=False)

Different **types of plots** with pandas (XI).

In [None]:
np.abs(df['No1']).plot(kind='pie');

Different **types of plots** with pandas (XII).

In [None]:
df.plot(kind='hexbin', x='No1', y='No2', C='No3', reduce_C_function=np.max,
         gridsize=10)

## Financial Time Series

In what follows, we use the **`DataReader` function** to retrieve stock price data from Yahoo! Finance (<a href="http://finance.yahoo.com">http://finance.yahoo.com</a>), analyze the data, and generate different plots of it. `DataReader` used to ship with `pandas` itself and now lives in the separate `pandas-datareader` package.

In [None]:
# import pandas.io.data as web  # old way/location; removed from current pandas versions

In [None]:
from pandas_datareader import data as web  # new way/location

We can **retrieve stock price information** for the German DAX index, for example, with a single line of code from Yahoo! Finance.

In [None]:
%%time
DAX = web.DataReader(name='^GDAXI', data_source='yahoo',
                     start='2000-1-1')
DAX.info()

The **most current data rows**.

In [None]:
DAX.tail()

The **whole history** as downloaded.

In [None]:
DAX['Close'].plot(figsize=(10, 6))

Technical traders are often interested in **trends**.

In [None]:
# old syntax/style; pd.rolling_mean was removed from pandas, kept for reference
# DAX['42d'] = pd.rolling_mean(DAX['Close'], window=42)
# DAX['252d'] = pd.rolling_mean(DAX['Close'], window=252)

In [None]:
# new syntax/style
DAX['42d'] = DAX['Close'].rolling(window=42).mean()
DAX['252d'] = DAX['Close'].rolling(window=252).mean()

In [None]:
DAX[['Close', '42d', '252d']].tail()
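
How the rolling mean behaves is easier to see on a short series. This is a minimal sketch with made-up prices and a window of 3 instead of 42 or 252.

```python
import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# Each value is the mean of the current and the two preceding prices.
sma = prices.rolling(window=3).mean()
print(sma)
```

The first `window - 1` entries are missing because the window is not yet full. By the same logic, the `252d` column only carries values from the 252nd row onward.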

A typical **stock price chart** with the two trends included.

In [None]:
DAX[['Close', '42d', '252d']].plot(figsize=(10, 6));

Calculating **log returns** for the index &ndash; the **wrong way**.

In [None]:
%%time
import math
close = DAX['Close']
log_rets = [0.0]  # there is no return for the first row
for i in range(1, len(DAX)):
    # explicit positional access via .iloc
    log_rets.append(math.log(close.iloc[i] / close.iloc[i - 1]))
DAX['Ret_Loop'] = log_rets

In [None]:
DAX[['Close', 'Ret_Loop']].tail()

Calculating **log returns** for the index &ndash; the **right way**.

In [None]:
%time DAX['Return'] = np.log(DAX['Close'] / DAX['Close'].shift(1))

In [None]:
DAX[['Close', 'Ret_Loop', 'Return']].tail()
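
That the loop and the vectorized version compute the same numbers can be checked without any download. The sketch below uses a short, made-up price series in place of the DAX data.

```python
import math

import numpy as np
import pandas as pd

close = pd.Series([100.0, 101.0, 99.5, 102.0])

# shift(1) moves every price down one row, so the ratio is P_t / P_{t-1}.
vec = np.log(close / close.shift(1))

# Loop version for comparison; it has no entry for the first row.
loop = [math.log(close.iloc[i] / close.iloc[i - 1])
        for i in range(1, len(close))]

# Log returns are additive over time: their sum is the log of last / first.
total = vec.sum()
print(vec)
print(total)
```

The vectorized version leaves a missing value in the first row, since there is no previous price, whereas the loop above fills that row with `0.0`.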

In [None]:
del DAX['Ret_Loop']

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="mailto:yves@tpq.io">yves@tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="http://hilpisch.com" target="_blank">http://hilpisch.com</a> 

**Quant Platform** &mdash; <a href="http://quant-platform.com" target="_blank">http://quant-platform.com</a>

**Python for Finance** &mdash; <a href="http://python-for-finance.com" target="_blank">http://python-for-finance.com</a>

**Derivatives Analytics with Python** &mdash; <a href="http://derivatives-analytics-with-python.com" target="_blank">http://derivatives-analytics-with-python.com</a>

**Python Trainings** &mdash; <a href="http://training.tpq.io" target="_blank">http://training.tpq.io</a>