<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="45%" align="right" border="4">

# pandas Advanced

Dr. Yves J. Hilpisch

The Python Quants GmbH

<a href='http://tpq.io'>http://tpq.io</a> | <a href='mailto:team@tpq.io'>team@tpq.io</a>

## Grouping Operations

**Example `DataFrame`** to work with.

In [None]:
import warnings
warnings.simplefilter('ignore')

In [None]:
import numpy as np
import pandas as pd
import datetime as dt
import seaborn as sns; sns.set()

In [None]:
rows = 10000
index = pd.date_range(dt.datetime.now().date(), periods=rows, freq='H')
df = pd.DataFrame(np.random.standard_normal((rows, 5)),
                  columns=['No1', 'No2', 'No3', 'No4', 'No5'],
                  index=index)

The `DataFrame` info.

In [None]:
df.info()

The `DataFrame` head.

In [None]:
df.head()

Adding a **column to group by**.

In [None]:
df['Gr1'] = np.random.choice(['A', 'B', 'C', 'D'], rows)

In [None]:
df.tail()

Generating a **`DataFrameGroupBy`** object.

In [None]:
grouped = df.groupby('Gr1')

In [None]:
type(grouped)

**Elements** per group.

In [None]:
grouped.size()

Typical **aggregations**.

In [None]:
grouped.sum()

In [None]:
grouped.mean()

**High-level statistics** as an overview.

In [None]:
grouped.describe()

Selecting **group data**.

In [None]:
grouped.get_group('A').head()

Custom **aggregations**.

In [None]:
# string names select the built-in group-wise methods
grouped.aggregate({'No1': 'mean',
                   'No3': 'std'})
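
A minimal sketch of the same idea with **named aggregation**, which sets the result column names explicitly. The frame `demo` and the names `No1_mean` and `No3_std` are illustrative assumptions, not part of the example data above.

```python
import pandas as pd

# small hand-made frame (illustrative values only)
demo = pd.DataFrame({'Gr1': ['A', 'A', 'B', 'B'],
                     'No1': [1.0, 3.0, 2.0, 6.0],
                     'No3': [0.0, 2.0, 1.0, 1.0]})

# named aggregation: result column = (source column, aggregation name)
agg_demo = demo.groupby('Gr1').agg(No1_mean=('No1', 'mean'),
                                   No3_std=('No3', 'std'))
print(agg_demo)
```

The string `'std'` refers to the group-wise sample standard deviation (one degree of freedom removed).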

**Plotting** of grouped data.

In [None]:
%matplotlib inline
grouped.mean().plot(kind='barh')

Introducing a **second column to group by**.

In [None]:
f = lambda x: x.hour % 2 == 0

In [None]:
df['Gr2'] = np.where(f(df.index), 'even', 'odd')

In [None]:
df.tail()
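
The labelling step can be checked in isolation. The following is a small sketch on a hand-made `DatetimeIndex` (the timestamps are arbitrary assumptions); it relies only on the vectorized `hour` attribute and `np.where`.

```python
import numpy as np
import pandas as pd

# four consecutive hourly timestamps (arbitrary illustrative dates)
idx = pd.to_datetime(['2020-01-01 00:00', '2020-01-01 01:00',
                      '2020-01-01 02:00', '2020-01-01 03:00'])

# idx.hour is an integer array, so the parity test runs on all rows at once
labels = np.where(idx.hour % 2 == 0, 'even', 'odd')
print(labels)
```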

Grouping with **multiple columns**.

In [None]:
grouped = df.groupby(['Gr1', 'Gr2'])
grouped.size()
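
Grouping by two columns yields results indexed by pairs of group labels. A minimal sketch, assuming a tiny hand-made frame `demo`, shows how such a result can be pivoted into a table with `unstack()`.

```python
import pandas as pd

# hand-made group labels (illustrative only)
demo = pd.DataFrame({'Gr1': ['A', 'A', 'B', 'B', 'B'],
                     'Gr2': ['even', 'odd', 'even', 'even', 'odd']})

# size() with two keys returns a Series indexed by (Gr1, Gr2) pairs
sizes = demo.groupby(['Gr1', 'Gr2']).size()

# unstack() moves the inner level (Gr2) into the columns
table = sizes.unstack()
print(table)
```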

**Plotting** of the new object data.

In [None]:
grouped.aggregate(['min', 'mean', 'max'])[['No1', 'No2']].boxplot(
                    return_type='dict');

**Filter operations** on `GroupBy` objects.

In [None]:
grouped.filter(lambda x: np.mean(x['No2']) > 0.0).head()
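
Note that `filter` keeps or drops **whole groups**, not single rows. A small sketch with hand-made values (the frame `demo` is an assumption for illustration):

```python
import pandas as pd

demo = pd.DataFrame({'Gr1': ['A', 'A', 'B', 'B'],
                     'No2': [1.0, 2.0, -1.0, -3.0]})

# the function is evaluated once per group; a group is kept in full
# if it returns True, and dropped in full otherwise
kept = demo.groupby('Gr1').filter(lambda g: g['No2'].mean() > 0.0)
print(kept)
```

Group `B` has a negative mean, so both of its rows are removed, while the rows of group `A` keep their original index labels.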

## Joining, Appending, Merging

Let us start with **two small sample `DataFrame`** objects (I).

In [None]:
df1 = pd.DataFrame(['100', '200', '300', '400'], 
                    index=['a', 'b', 'c', 'd'],
                    columns=['A',])
df1

Let us start with **two small sample `DataFrame`** objects (II).

In [None]:
df2 = pd.DataFrame(['200', '150', '50'], 
                    index=['f', 'b', 'd'],
                    columns=['B',])
df2

**Default** operations (I).

In [None]:
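# note: DataFrame.append() was removed in pandas 2.0;
# with newer versions use pd.concat((df1, df2)) instead
df1.append(df2)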

**Default** operations (II).

In [None]:
# pandas >= 2.0: pd.concat((df1, df2), ignore_index=True)
df1.append(df2, ignore_index=True)

**Default** operations (III).

In [None]:
pd.concat((df1, df2))

**Default** operations (IV).

In [None]:
pd.concat((df1, df2), ignore_index=True)
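
What concatenation does to index and columns can be verified on a reduced example. This is a sketch with two hypothetical frames `left` and `right` that mimic the structure of `df1` and `df2`.

```python
import pandas as pd

left = pd.DataFrame({'A': ['100', '200']}, index=['a', 'b'])
right = pd.DataFrame({'B': ['200', '150']}, index=['f', 'b'])

# rows are stacked; a column missing in one frame is filled with NaN
stacked = pd.concat((left, right))

# ignore_index=True discards the labels and renumbers the rows
renumbered = pd.concat((left, right), ignore_index=True)

print(stacked)
print(renumbered)
```

The label `b` appears twice in `stacked`: concatenation does not align rows on the index, it only appends them.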

**Default** operations (V).

In [None]:
df1.join(df2)

**Default** operations (VI).

In [None]:
df2.join(df1)

**Default** operations (VII).

In [None]:
df = pd.DataFrame()
df['A'] = df1['A']
df

In [None]:
df['B'] = df2['B']  # order matters: aligned to the existing index (a-d), row 'f' is dropped
df
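
The difference between column-by-column assignment and building the object in one step comes down to **index alignment**. A minimal sketch, assuming two small hand-made `Series` objects `a` and `b`:

```python
import pandas as pd

a = pd.Series(['100', '200', '300'], index=['a', 'b', 'c'])
b = pd.Series(['200', '150'], index=['f', 'b'])

# column by column: the first assignment fixes the index (a, b, c);
# later columns are aligned to it, so label 'f' is silently dropped
step = pd.DataFrame()
step['A'] = a
step['B'] = b

# all at once: the constructor uses the union of both indexes
union = pd.DataFrame({'A': a, 'B': b})

print(step)
print(union)
```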

**Default** operations (VIII).

In [None]:
df = pd.DataFrame({'A': df1['A'], 'B': df2['B']})
df

**Variants of joining** data (I).

In [None]:
df1.join(df2, how='left')  # default

**Variants of joining** data (II).

In [None]:
df1.join(df2, how='right')

**Variants of joining** data (III).

In [None]:
df1.join(df2, how='inner')

**Variants of joining** data (IV).

In [None]:
df1.join(df2, how='outer')

**Variants of joining** data (V).

In [None]:
df1.join(df2, how='outer', sort=True)
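
As a compact summary, the four join types differ only in which index labels survive. The following sketch rebuilds the two sample objects under the hypothetical names `left` and `right` and collects the resulting labels per join type.

```python
import pandas as pd

left = pd.DataFrame({'A': ['100', '200', '300', '400']},
                    index=['a', 'b', 'c', 'd'])
right = pd.DataFrame({'B': ['200', '150', '50']},
                     index=['f', 'b', 'd'])

# resulting index labels for every join type
labels = {how: list(left.join(right, how=how).index)
          for how in ('left', 'right', 'inner', 'outer')}

for how, idx in labels.items():
    print(how, idx)
```

`left` keeps the labels of the calling object, `right` those of the other object, `inner` the intersection and `outer` the union.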

Adding a **further column** to both `DataFrame` objects.

In [None]:
c = pd.Series([250, 150, 50], index=['b', 'd', 'c'])
df1['C'] = c
df2['C'] = c

Resulting objects.

In [None]:
df1

In [None]:
df2

Default **merging** of the objects.

In [None]:
pd.merge(df1, df2)

Other **merging** variants (I).

In [None]:
pd.merge(df1, df2, how='outer')
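
To see where each row of a merge result comes from, `pd.merge` accepts `indicator=True`. This is a sketch on two hypothetical frames `left` and `right` that share the key column `C`; the values are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'A': ['100', '200'], 'C': [250.0, 50.0]})
right = pd.DataFrame({'B': ['200', '150'], 'C': [250.0, 150.0]})

# inner merge (the default): only key values present in both frames
inner = pd.merge(left, right, on='C')

# outer merge: all key values; the _merge column names each row's origin
outer = pd.merge(left, right, on='C', how='outer', indicator=True)

print(inner)
print(outer)
```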

Other **merging** variants (II).

In [None]:
pd.merge(df1, df2, on='C')  # same as the default: 'C' is the only shared column

Other **merging** variants (III).

In [None]:
pd.merge(df1, df2, left_on='A', right_on='B', how='outer')

Other **merging** variants (IV).

In [None]:
pd.merge(df1, df2, left_index=True, right_index=True)

In [None]:
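# recent pandas raises a MergeError if `on` is combined with left_index/right_index
pd.merge(df1.reset_index(), df2.reset_index(), on=['index', 'C']).set_index('index')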

## High Frequency Data

The final example is about **high frequency data**. To begin with, a couple of imports. 

In [None]:
import numpy as np
import pandas as pd
import datetime as dt
# from urllib.request import urlretrieve
%matplotlib inline

The Norwegian online broker **Netfonds (<a href="http://www.netfonds.no">http://www.netfonds.no</a>)** provides tick data for a multitude of stocks, in particular for American names.

In [None]:
url1 = 'http://hopey.netfonds.no/posdump.php?'
url2 = 'date=%s%s%s&paper=NKE.N&csv_format=csv'
url = url1 + url2

We want to download, combine and analyze **a week's worth of tick data** for the Nike Inc. stock.

In [None]:
year = '2016'
month = '08'
days = ['25', '26', '29', '30', '31']
  # dates might need to be updated
  # to something 'recent enough' (last 2 weeks)

In [None]:
NKE = pd.concat([pd.read_csv(url % (year, month, day),
                             index_col=0, header=0, parse_dates=True)
                 for day in days])
  # pd.concat replaces DataFrame.append(), which was removed in pandas 2.0
NKE.columns = ['bid', 'bdepth', 'bdeptht', 'offer', 'odepth', 'odeptht']
  # shorter column names

The data set now consists of more than **50,000 rows**.

In [None]:
NKE.info()

The **data visualized**.

In [None]:
NKE['bid'].plot(figsize=(10, 6));

A whole **trading day** in pictures.

In [None]:
to_plot = NKE[['bid', 'bdeptht']][
    (NKE.index > dt.datetime(2016, 8, 30, 0, 0))
 &  (NKE.index < dt.datetime(2016, 8, 31, 2, 59))]
  # adjust dates to given data set
to_plot.plot(subplots=True, style='b', figsize=(10, 6));

**Resampling** is easily accomplished with `pandas`.

In [None]:
# NKE_resam = NKE.resample(rule='5min', how='mean')
NKE_resam = NKE.resample(rule='5min').mean()
np.round(NKE_resam.tail(), 2)
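
If the tick data is not at hand, the mechanics of resampling can be reproduced with a few synthetic ticks. The timestamps and bid values below are made up for illustration and do not come from the Netfonds data set.

```python
import numpy as np
import pandas as pd

# irregularly spaced synthetic "ticks" (hypothetical values)
times = pd.to_datetime(['2016-08-30 15:30:01', '2016-08-30 15:31:40',
                        '2016-08-30 15:34:59', '2016-08-30 15:41:10'])
ticks = pd.DataFrame({'bid': [58.0, 58.2, 58.4, 58.1]}, index=times)

# 5-minute bins labelled by their left edge; empty bins become NaN
bars = ticks.resample('5min').mean()

# forward fill carries the last known value into the empty bins
filled = bars['bid'].ffill()

print(bars)
print(filled)
```

The bin starting at 15:35 contains no tick, which is why forward filling is applied before plotting.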

The plot now looks a bit **smoother**.

In [None]:
NKE_resam['bid'].ffill().plot()

In [None]:
def reversal(x):
    # mirror the value around the fixed level 58
    return 2 * 58 - x

In [None]:
NKE_resam['bid'].ffill().apply(reversal).plot();

## Statistical Analyses

Let us generate a **sample data** set to work with.

In [None]:
x = np.linspace(-5, 5, 500)
e = np.random.standard_normal(len(x)) * 2
data = pd.DataFrame({'x': x, 'y': 2 * x ** 2 - 0.5 * x + 3 + e})
data.plot(x='x', y='y', style='r.')

Let us implement an **ordinary least-squares regression analysis** (OLS).

In [None]:
model = np.polyfit(x=data['x'], y=data['y'], deg=2)

In [None]:
model
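
The order of the returned coefficients is easiest to see without noise. The sketch below fits the same quadratic as in the sample data, but without the error term `e`, so the fit should recover the parameters almost exactly.

```python
import numpy as np

x = np.linspace(-5, 5, 50)
y = 2 * x ** 2 - 0.5 * x + 3  # the sample polynomial without noise

# polyfit returns the coefficients from the highest degree down to the constant
coeffs = np.polyfit(x, y, deg=2)

# polyval evaluates the fitted polynomial at the given points
fitted = np.polyval(coeffs, x)

print(np.round(coeffs, 4))
```

With the noisy sample data, the estimates in `model` should lie close to these values but not match them exactly.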

**Accessing and visualizing** the results.

In [None]:
import matplotlib.pyplot as plt

In [None]:
data.plot(x='x', y='y', style='r.')
plt.plot(x, np.polyval(model, x), lw=2.0)


<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="mailto:yves@tpq.io">yves@tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="http://hilpisch.com" target="_blank">http://hilpisch.com</a> 

**Quant Platform** &mdash; <a href="http://quant-platform.com" target="_blank">http://quant-platform.com</a>

**Python for Finance** &mdash; <a href="http://python-for-finance.com" target="_blank">http://python-for-finance.com</a>

**Derivatives Analytics with Python** &mdash; <a href="http://derivatives-analytics-with-python.com" target="_blank">http://derivatives-analytics-with-python.com</a>

**Python Trainings** &mdash; <a href="http://training.tpq.io" target="_blank">http://training.tpq.io</a>