# Advanced `pandas`

The following notebook is dedicated to more advanced operations in Pandas:

- `split-apply-combine` pipeline,
- operations on string columns (string operations, replacement),
- joins on Pandas dataframes.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt  # explicit import instead of the deprecated %pylab magic
plt.style.use("bmh")

In [None]:
import numpy as np
import pandas as pd

In [None]:
titanic_train = pd.read_csv("train.csv", index_col="PassengerId")
titanic_test = pd.read_csv("test.csv", index_col="PassengerId")
titanic = pd.concat([titanic_train, titanic_test], sort=False)

In [None]:
titanic.head()

In [None]:
titanic.groupby("Ticket").size().value_counts()

In [None]:
titanic.groupby("Ticket")["Fare"].mean() / titanic.groupby("Ticket").size()

In [None]:
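# Mean fare per ticket divided by the number of passengers sharing that ticket
fare_per_pass = (titanic.groupby("Ticket")["Fare"].mean()
                 / titanic.groupby("Ticket").size()).rename("fare_per_pass")
(titanic.merge(fare_per_pass, left_on="Ticket", right_index=True, how="left")
 .groupby("Pclass")["fare_per_pass"].mean())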

# Split-apply-combine (`GROUP BY` in Pandas)

Depending on how the result of the `apply` part of the pipeline is structured, Pandas will `combine` differently. Many common operations have shortcuts, which makes them extremely concise. We start with the simplest case: `apply` results in a single scalar per group.

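As a minimal sketch of the scalar case, the three stages can be spelled out by hand and compared with the shortcut. The `toy` frame below is hypothetical data made up for illustration, not the Titanic set.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic set)
toy = pd.DataFrame({"Pclass": [1, 1, 2, 3, 3, 3],
                    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0]})

# split: iterate over (key, sub-frame) pairs
# apply: reduce each sub-frame to one scalar
# combine: collect the scalars into a Series indexed by key
by_hand = pd.Series({key: part["Fare"].mean()
                     for key, part in toy.groupby("Pclass")})

# The built-in shortcut performs all three stages at once
shortcut = toy.groupby("Pclass")["Fare"].mean()
```
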
Entry point to Pandas grouping:

In [None]:
class_groups = titanic.groupby("Pclass")

In [None]:
class_groups

Pandas is smart enough to provide some common operations automatically:

In [None]:
class_groups.mean(numeric_only=True)  # recent pandas raises on string columns such as Name

We can group a single column by a synthetic key:

In [None]:
age_groups = titanic.Parch.groupby((5 + 10*(titanic.Age//10)))

In [None]:
5 + 10*(titanic.Age//10)

In [None]:
age_groups

In [None]:
age_groups.mean().to_frame() # Note index name

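The synthetic key `5 + 10*(Age//10)` maps each age to the midpoint of its decade-wide bin. A small sketch on made-up values (hypothetical, not taken from the dataset) shows the arithmetic, and that rows with a missing key are left out of the result:

```python
import numpy as np
import pandas as pd

# Hypothetical ages and Parch values for illustration
age = pd.Series([4.0, 22.0, 38.0, 29.5, np.nan], name="Age")
parch = pd.Series([1, 0, 0, 2, 0], name="Parch")

# Floor division picks the decade, +5 moves to its midpoint
key = 5 + 10 * (age // 10)

# The key Series keeps the name "Age", which becomes the index name;
# the row with a NaN key does not appear in any group
means = parch.groupby(key).mean()
```
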
NumPy arrays can also be used as grouping keys:

In [None]:
age_groups_npy = titanic.Parch.groupby((5 + 10*(titanic.Age//10)).values)

In [None]:
age_groups_npy.mean().to_frame()

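The practical difference between a Series key and a NumPy key is the index name of the result. A short sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical values for illustration
values = pd.Series([1, 0, 2, 0], name="Parch")
key = pd.Series([25.0, 25.0, 35.0, 35.0], name="Age")

named = values.groupby(key).mean()           # key is a Series: its name is kept
unnamed = values.groupby(key.values).mean()  # key is an ndarray: no name to keep
```
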
We can group by a set of keys:

In [None]:
age_groups_multi = titanic.Parch.groupby([(5 + 10*(titanic.Age//10)), titanic.Pclass])

In [None]:
age_groups_multi = titanic.Parch.groupby([titanic.Pclass, (5 + 10*(titanic.Age//10))])

In [None]:
age_groups_multi

In [None]:
age_groups_multi.mean()

We can restructure the result:

In [None]:
age_groups_multi.mean().unstack()

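A sketch of what `unstack` does to a multi-key result, again on hypothetical toy data: the innermost index level moves to the columns, and key combinations that never occur become `NaN`.

```python
import pandas as pd

# Hypothetical toy data; the (Pclass=2, AgeGroup=35) and
# (Pclass=3, AgeGroup=25) combinations are deliberately absent
toy = pd.DataFrame({"Pclass": [1, 1, 2, 3],
                    "AgeGroup": [25, 35, 25, 35],
                    "Parch": [0, 2, 1, 3]})

long = toy.groupby(["Pclass", "AgeGroup"])["Parch"].mean()  # MultiIndex Series
wide = long.unstack()  # rows: Pclass, columns: AgeGroup
```
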
We can mix column names with actual iterables:

In [None]:
age_groups_mixed = titanic.groupby([(5 + 10*(titanic.Age//10)), "Pclass", "Embarked"])

In [None]:
age_groups_mixed

In [None]:
age_groups_mixed.Parch.mean()

In [None]:
age_groups_mixed.Parch.mean().unstack(level=(0,2))

## `apply` outputs series

In [None]:
class_groups = titanic.groupby("Pclass") # Nothing is calculated yet

In [None]:
class_groups.mean(numeric_only=True)  # recent pandas raises on string columns such as Name

Let's simulate a Series output from the `apply` stage:

In [None]:
titanic.Fare.describe()

In [None]:
class_groups.Fare.describe()

In [None]:
class_groups.apply(lambda x: x.Fare.describe()) # Note column index name

In [None]:
titanic.groupby("Sex").Parch.mean()

In [None]:
class_groups.apply(lambda x: x.groupby("Sex").Parch.mean()) # Note column index name

In [None]:
class_groups.apply(lambda x: x[x.Parch==0].groupby("Sex").size())

The same can be achieved differently, of course:

In [None]:
titanic[titanic.Parch==0].groupby(["Pclass", "Sex"]).size().unstack()

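The equivalence of the two routes can be checked on a small hypothetical frame. This is a sketch under the assumption that every group returns a Series with the same index, in which case `apply` combines the results into a wide table.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({"Pclass": [1, 1, 1, 2, 2, 2],
                    "Sex": ["male", "female", "male", "female", "male", "female"],
                    "Parch": [0, 0, 1, 0, 0, 0]})

# Route 1: nested groupby inside apply (columns selected explicitly)
nested = (toy.groupby("Pclass")[["Sex", "Parch"]]
          .apply(lambda x: x[x.Parch == 0].groupby("Sex").size()))
# If the groups returned differing indexes, pandas would stack them into
# a MultiIndex Series instead; normalize to the wide form in that case
if isinstance(nested, pd.Series):
    nested = nested.unstack()

# Route 2: filter first, then group by both keys and unstack
direct = toy[toy.Parch == 0].groupby(["Pclass", "Sex"]).size().unstack()
```
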
What if the `apply` result has a multi-index of its own?

In [None]:
titanic[titanic.Parch!=0].groupby(["Sex", "Embarked"]).size()

In [None]:
class_groups.apply(lambda x: x[x.Parch!=0].groupby(["Sex", "Embarked"]).size())

In [None]:
result_s = (class_groups
            .apply(lambda x: x[x.Parch!=0].groupby(["Sex", "Embarked"]).size()))

In [None]:
result = (class_groups
          .apply(lambda x: x[x.Parch!=0].groupby(["Sex", "Embarked"]).size())
          .unstack([1,2]))

In [None]:
result

### Intermezzo: indexing a multi-indexed dataframe

In [None]:
result.loc[:, ("female", "C")]

In [None]:
result.loc[:, [("female", "C"), ("female", "S")]]

Positional indexing is slightly different, as it knows nothing about the multi-index *per se*:

In [None]:
result.iloc[:, [0, 1]]

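A self-contained sketch of the same indexing patterns, using a hypothetical table with hand-written numbers in place of `result`:

```python
import pandas as pd

# Hypothetical table with two-level columns (Sex, Embarked)
columns = pd.MultiIndex.from_tuples(
    [("female", "C"), ("female", "S"), ("male", "C"), ("male", "S")],
    names=["Sex", "Embarked"])
table = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                     index=pd.Index([1, 2], name="Pclass"),
                     columns=columns)

one = table.loc[:, ("female", "S")]  # full tuple: a single column as a Series
block = table.loc[:, "female"]       # outer label only: outer level is dropped
by_pos = table.iloc[:, [0, 1]]       # positions only: both levels are kept
```
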
## DataFrame output

In [None]:
titanic[["SibSp", "Parch"]].head()

In [None]:
(titanic[titanic.Parch!=0]
 .groupby(["Sex", "Embarked"])[["SibSp", "Parch"]]
 .mean())

In [None]:
class_groups

In [None]:
(class_groups
 .apply(lambda x: x[x.Parch!=0].groupby(["Sex", "Embarked"])[["SibSp", "Parch"]]
        .mean()))

## Mixing group keys

In [None]:
titanic.head()

In [None]:
titanic_idx = titanic.reset_index().set_index((5 + 10*(titanic.Age//10)))

In [None]:
titanic_idx.head()

`pd.Grouper` is an entry point to complex mixed groupings:

In [None]:
pd.Grouper?

In [None]:
titanic_idx.groupby([pd.Grouper(level="Age"), "Pclass"]).Parch.mean()

In [None]:
titanic_idx.index

In [None]:
titanic_idx.groupby([titanic_idx.index, "Pclass"]).Parch.mean()

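Both spellings above should give the same groups. A minimal sketch on a hypothetical frame whose index is named `Age`:

```python
import pandas as pd

# Hypothetical toy data: the index plays the role of the age group
toy = pd.DataFrame({"Pclass": [1, 1, 2, 2], "Parch": [0, 2, 1, 3]},
                   index=pd.Index([25.0, 25.0, 25.0, 35.0], name="Age"))

# Mix an index level (via pd.Grouper) with a regular column name
via_grouper = toy.groupby([pd.Grouper(level="Age"), "Pclass"])["Parch"].mean()

# Passing the index object itself as a key is equivalent here
via_index = toy.groupby([toy.index, "Pclass"])["Parch"].mean()
```
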
### Intermezzo: on `size` vs. `count`

`size`, as the name suggests, returns the **size** of something, in this case of a group:

In [None]:
titanic.groupby('Pclass').size()  ## how many elements are in each group

`count`, by contrast, **counts** only something specific:

In [None]:
titanic.groupby('Pclass').count()

As you can see, `count` only counts non-missing values, i.e. values that are actually present in the dataframe. Hence, a slightly more elaborate way of getting (almost) the same result is:

In [None]:
titanic.isnull()

In [None]:
titanic.groupby('Pclass').apply(lambda group: group.notnull().sum())

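The difference is easiest to see on a tiny hypothetical frame with a few missing ages: `size` counts rows, `count` counts non-missing values, and their difference is the number of missing values per group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing ages
toy = pd.DataFrame({"Pclass": [1, 1, 2, 2, 2],
                    "Age": [30.0, np.nan, 22.0, np.nan, np.nan]})

sizes = toy.groupby("Pclass").size()            # rows per group
counts = toy.groupby("Pclass")["Age"].count()   # non-missing ages per group
missing = sizes - counts                        # missing ages per group
```
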
# How `S-A-C` is important in exploratory data analysis

In [None]:
titanic.Pclass.value_counts()

In [None]:
titanic["AgeGroup"] = 5 + 10*(titanic.Age//10)
titanic[["Age", "AgeGroup"]]

Let's calculate something non-trivial. For example, the share of each age group and sex combination within each class:

In [None]:
titanic_dna = titanic[titanic.Age.notnull()]  # whether rows with a missing Age should be dropped is debatable: think about it
group_counts = titanic_dna.groupby(['Pclass', 'AgeGroup', 'Sex']).size()/titanic_dna.groupby('Pclass').size()
group_counts.head()

In [None]:
group_counts.unstack(level=1)

A simple way to validate the calculation:

In [None]:
group_counts.groupby(level=0).sum()

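The validation works because the shares within a class must add up to one. A sketch on hypothetical data, relying on the same index alignment by the shared `Pclass` level name that the calculation above uses:

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({"Pclass": [1, 1, 1, 2, 2],
                    "Sex": ["male", "female", "male", "female", "female"]})

# The numerator has a (Pclass, Sex) MultiIndex, the denominator a Pclass index;
# pandas aligns the two on the shared level name "Pclass"
fractions = (toy.groupby(["Pclass", "Sex"]).size()
             / toy.groupby("Pclass").size())

# Sanity check: shares within each class should sum to 1
totals = fractions.groupby(level="Pclass").sum()
```
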
The main advantage of Pandas for EDA is the flexible interoperability between analytics and plotting:

In [None]:
group_counts = group_counts.unstack()

In [None]:
group_counts

In [None]:
plt.figure(figsize=(15, 5))

for pclass in [1, 2, 3]:
    plt.subplot(1, 3, pclass)
    group_counts.loc[pclass].plot(ax=plt.gca())
    plt.ylim(0, 0.35)
    plt.title("Age distribution for Class %i" % pclass, fontsize=12)
plt.tight_layout()

In [None]:
survival_groups = titanic.groupby(['Pclass', 'AgeGroup', 'Sex']).Survived.mean()
survival_groups

In [None]:
survival_groups = survival_groups.unstack()

In [None]:
plt.figure(figsize=(15, 5))

for pclass in [1, 2, 3]:
    plt.subplot(1, 3, pclass)
    survival_groups.loc[pclass].plot(ax=plt.gca())
    plt.ylim(0, 1)
    plt.title("Survived in class %i" % pclass, fontsize=12)
plt.tight_layout()

In [None]:
siblings_groups = titanic.groupby(['Pclass', 'AgeGroup', 'Sex']).SibSp.mean()
siblings_groups = siblings_groups.unstack()

In [None]:
siblings_groups

In [None]:
plt.figure(figsize=(15, 5))

for pclass in [1, 2, 3]:
    plt.subplot(1, 3, pclass)
    siblings_groups.loc[pclass].plot(ax=plt.gca())
    plt.ylim(0, 5)
    plt.title("Siblings in class %i" % pclass, fontsize=12)
plt.tight_layout()

In [None]:
embark_counts = titanic.groupby(['Pclass', 'AgeGroup', 'Sex', 'Embarked']).size()/titanic.groupby('Pclass').size()

embark_counts

In [None]:
embark_counts = embark_counts.unstack([-1, -2])
embark_counts

### Intermezzo: Seaborn in EDA

In [None]:
import seaborn as sns

In [None]:
plt.figure(figsize=(6,6))

# We overlay a jittered strip plot of the data, so we hide the boxplot fliers
sns.boxplot(x="Pclass", y="Age", data=titanic,
            fliersize=0, width=0.3)
sns.stripplot(x="Pclass", y="Age", data=titanic,
              color="k", alpha=0.5, size=3)
plt.title("Age distribution by class", fontsize=12)
plt.tight_layout();

In [None]:
plt.figure(figsize=(6,6))

sns.violinplot(x="Age", y="Pclass", data=titanic,
               split=True, hue="Sex", scale="count", orient="h")
plt.title("Age distribution by class", fontsize=12)
plt.tight_layout();

In [None]:
with plt.style.context("seaborn-ticks"):
    plt.figure(figsize=(6,6))

    sns.violinplot(x="Age", y="Pclass", data=titanic,
                   split=True, hue="Sex", scale="count", orient="h",
                   palette={"male": "lightsteelblue", "female": "firebrick"})
    plt.title("Age distribution by class", fontsize=12)

    sns.despine(left=True)

    plt.ylabel("class")

    plt.tight_layout();

For comparison:

In [None]:
plt.figure(figsize=(15, 5))

for pclass in [1, 2, 3]:
    plt.subplot(1, 3, pclass)
    group_counts.loc[pclass].plot(ax=plt.gca())
    plt.ylim(0, 0.35)
    plt.title("Age distribution for Class %i" % pclass, fontsize=12)
plt.tight_layout()