<a href="https://colab.research.google.com/github/4dsolutions/clarusway_data_analysis/blob/main/DAwPy_S5_6_%28Groupby_and_Useful_Operations%29/Advanced.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a><br/>
[![nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/4dsolutions/clarusway_data_analysis/blob/main/DAwPy_S5_6_%28Groupby_and_Useful_Operations%29/Advanced.ipynb)

___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" 
alt="CLRSWY"></p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:100%; text-align:center; border-radius:10px 10px;">WAY TO REINVENT YOURSELF</p>

# Operating with pandas DataFrames

[SOURCE1](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html)
[SOURCE2](https://datagy.io/python-pivot-tables/)
[SOURCE3](https://medium.com/analytics-vidhya/exploratory-data-analysis-of-titanic-survival-problem-e3af0fb1f276)

In [None]:
import pandas as pd
import numpy as np

In [None]:
from IPython.display import YouTubeVideo

### Jupyter Notebooks

One of the star technologies in the Clarusway universe is Jupyter.  

Remember Julia, Python, R (Ju-Py-teR) as original early adopters of the Jupyter architecture.  

Jupyter Notebooks started out as Python-only IPython Notebooks.  By dint of refactoring, the kernel became a swappable item, i.e. swap out Python for Julia, or R.  Or Haskell.

One thing you can do with a Jupyter Notebook is embed a YouTube video.  

The video might be about the Notebook itself, or at least about the Notebook's topic.  Or it might be something you cite as an object of scholarship: mentioned in the Notebook, but not dwelt on.

For example, we appreciate the Jake VanderPlas corpus on GitHub, especially, for our purposes, the Python Data Science Handbook.  Here's a video of his keynote address to PyCon Colombia in 2019.

In [None]:
YouTubeVideo("zna96tsMIWE")  # https://youtu.be/zna96tsMIWE

### Planet Python

If we think of a subculture, such as Python's, as a planet or nation, then we get some useful geographic metaphors, such as "nearest neighbors".

An aspect of Planet Python is its conferences, such as PyCon and EuroPython.  The latter came first, as Python was hatched in Europe, in the Netherlands in particular, by Guido van Rossum.  Then it spread around the world.  PyCons in the western hemisphere were started by the PSF (Python Software Foundation) as a core promotional campaign for the language.

[Flickr Slides from a Pycon in 2016, Portland, Oregon](https://flickr.com/photos/kirbyurner/albums/72157669197221096)

### Pandas

Part of Planet Python is [pandas](https://en.wikipedia.org/wiki/Pandas_(software)), a project started by Wes McKinney.  He is also the author of *Python for Data Analysis*, which includes the chapter [Getting Started with pandas](https://wesmckinney.com/book/pandas-basics.html).

At this point in our Clarusway course, we have met numpy and the pandas DataFrame, and started looking at that object's principal methods.

### Reviewing GroupBy

The `.groupby` method splits a DataFrame into an iterable sequence of [chunks](https://www.merriam-webster.com/dictionary/chunk), which may then be variously [aggregated](https://www.merriam-webster.com/dictionary/aggregated), for example by summing, averaging or counting.
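To make the split, aggregate, combine idea concrete, here is a minimal sketch that performs the three steps by hand and compares the outcome with `groupby`. The `toy` table and the names `chunks`, `sums`, `by_hand` and `by_groupby` are made up for illustration; they are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's df
toy = pd.DataFrame({"key": ["A", "B", "A", "B"],
                    "value": [1, 2, 3, 4]})

# Split: one chunk (sub-DataFrame) per unique key
chunks = {k: toy[toy["key"] == k] for k in toy["key"].unique()}

# Apply: aggregate each chunk on its own
sums = {k: chunk["value"].sum() for k, chunk in chunks.items()}

# Combine: gather the per-chunk results into one Series
by_hand = pd.Series(sums, name="value").rename_axis("key")

# groupby performs all three steps in one expression
by_groupby = toy.groupby("key")["value"].sum()

print(by_hand)
print(by_groupby)
```

Both printouts should show the same per-key totals; the manual version is only there to show what `groupby` saves us from writing.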

Note below the syntax for starting a random number generator (`rng`) at a specific place.  This object gets used to populate the column `data2`.

In [None]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df

In [None]:
df.groupby("key").data1  # remember, a key column need not be named "key"

What comes back is a `SeriesGroupBy` object: a single Series with grouping enabled, one could put it.  There's nothing tabular worth showing until an aggregator is attached.
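A small sketch of that point, using a made-up `toy` table rather than the notebook's `df`: selecting a column from a groupby only describes the grouping, and values appear once an aggregator such as `sum` is called.

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({"key": ["A", "B", "A"], "data1": [0, 1, 2]})

grouped = toy.groupby("key")["data1"]   # same as toy.groupby("key").data1
print(type(grouped).__name__)           # the grouped object, not yet a table

# Attaching an aggregator produces an ordinary Series indexed by key
result = grouped.sum()
print(result)
```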

Below, in looping over the groupby, we encounter a sequence of (key, group) pairs, one per unique key.

In [None]:
for k, g in df.groupby("key"):
    print(k)  # value of "key" is the key
    print(g[['data1','data2']].agg("sum"))  # aggregator attached, named by string
    print(g.shape)

In [None]:
bykey = df.groupby("key")  # we can save the groupby by naming it...

In [None]:
bykey[['data1']].count()   # ... then using it.  One column selected, aggregator attached.

In [None]:
df.groupby("key").agg("count") # all columns counted

In [None]:
bykey['data1'].sum() # one column summed, but groupwise, by key

In [None]:
df.groupby('key').aggregate(['min', 'median', 'max'])  # multiple aggregators, named by string

In [None]:
df.groupby('key').agg([np.min, np.median, np.max])  # ... aggregators may also be NumPy functions (newer pandas may warn)

In [None]:
df

Jake shows us in one of his examples that a grouping key does not have to be one of the columns; it may be any Series or sequence of the same length as the DataFrame.  Its unique values become the "bucket names" (keys to groups), and the chunks are what fill the buckets.  We're ready for aggregation at that point.

In [None]:
L = [0, 1, 0, 1, 2, 0]
df.groupby(L).sum()  # aggregator attached

This example is also from [The Python Data Science Handbook](https://www.oreilly.com/library/view/python-data-science/9781491912126/).  When the key you wish to group on is the DataFrame index, a Python dict mapping index values to group names also defines an alternate set of groups, each with its own key.

Here's his example:

In [None]:
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
df2.groupby(mapping).sum()

In [None]:
df2.groupby(str.lower).mean()

In [None]:
def filter_func(x):
    return x['data2'].std() > 4  # True keeps the whole group, False drops it

df.groupby('key').std() 

In [None]:
df.groupby('key').filter(filter_func)  # for each group, run the filter function

In [None]:
df['sum_cols'] = df['data1'] + df['data2']  # remember column creation
df

In [None]:
df.drop("sum_cols", axis="columns", inplace=True) # remember dropping columns

In [None]:
df

In [None]:
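def func(x):
    # x is a DataFrame holding one group's rows
    x = x.copy()  # avoid mutating the group in place
    x['sum_cols'] = x['data1'] + x['data2']
    return x

# group_keys=False keeps the original index; otherwise recent pandas also adds
# "key" to the index, and the later groupby('key') calls become ambiguous
df = df.groupby("key", group_keys=False).apply(func)
df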

In [None]:
df.groupby('key').std().filter(regex=r"\d$")  # raw string; keep columns ending in a digit

In [None]:
df.groupby('key').std().filter(regex="^d")

In [None]:
df

In [None]:
df[['data1', 'data2']].transform(lambda x: x + 1)

In [None]:
df

### Reviewing Stack / Unstack

In [None]:
animals = ['Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Cat', 'Cat']
breeds = ['Lab', 'Lab', 'Pug', 'Siamese', 'Asian', 'Asian', 'Bengal']
columns = ['color', 'age', 'weight']

In [None]:
row_index = pd.MultiIndex.from_tuples(zip(animals, breeds))
row_index

In [None]:
names = ['Rover', 'Fido', 'Sydney', 'Felix', 'Tabby', 'Su', 'Tyron']
colors = ['yellow', 'black', 'orange', 'yellow', 'brown', 'chocolate', 'white']
ages = [5, 6, 3, 2, 10, 11, 5]
weights = [12.3, 12.8, 11.0, 4.7, 8.1, 9.2, 5.5]
pets_df = pd.DataFrame({'name' : names,
                        'color' : colors, 
                        'age'   : ages,
                        'weight': weights},
                        index = row_index)
pets_df

In [None]:
pets_df.index.names = ["Animal", "Breed"]

In [None]:
pets_df

In [None]:
stacked = pets_df.stack()
pd.DataFrame(stacked)

In [None]:
pets_df2 = pd.DataFrame({'name' : names,
                        'animal': animals,
                        'breed' : breeds,
                        'color' : colors, 
                        'age'   : ages,
                        'weight': weights})
pets_df2

### Pivoting

In [None]:
# ? pd.pivot

In [None]:
pets_df2.groupby('animal').mean(numeric_only=True)  # skip the text columns, which recent pandas refuses to average

The pivots below get run twice to show how `pivot_table`...

* may be obtained from `pd`, in which case the target DataFrame needs to be passed as the data argument (leftmost), or 
* `pivot_table` may be invoked as a method of the table in question, in which case it should not be passed
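As a quick check of the two calling styles, the sketch below builds a made-up `toy` table (a stand-in, not `pets_df2`) and confirms that the function form and the method form return the same result.

```python
import pandas as pd

# Hypothetical toy table for illustration
toy = pd.DataFrame({"animal": ["Dog", "Dog", "Cat"],
                    "weight": [12.0, 11.0, 5.0]})

# Function form: the DataFrame is passed as the first (data) argument
as_function = pd.pivot_table(toy, values="weight", index="animal", aggfunc="mean")

# Method form: the DataFrame is already known, so it is not passed
as_method = toy.pivot_table(values="weight", index="animal", aggfunc="mean")

print(as_function.equals(as_method))
```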

In [None]:
pets_df2.pivot_table(values=['age', 'weight'], index='animal', aggfunc='mean')  # table known; numeric columns named explicitly

In [None]:
# pets_df2.pivot_table(data=pets_df2, aggfunc=np.mean, index='animal') Error

In [None]:
pd.pivot_table(data=pets_df2, values=['age', 'weight'], index='animal', aggfunc='mean')  # table passed

In [None]:
pets_df2.pivot_table(values=['age', 'weight'], index=['animal','breed'], aggfunc='mean')  # table known

In [None]:
pd.pivot_table(pets_df2, values=['age', 'weight'], index=['animal','breed'], aggfunc='mean') # table passed

Unstack takes the innermost level (lowest) of a hierarchical index and spreads it out as columns.  

Stack takes the lowest level of columns and stacks it up as the innermost level of a hierarchical row index.
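Here is a minimal round-trip sketch of the two operations. The `weights` Series and its `animal` and `size` index levels are invented for illustration and do not come from the pets data.

```python
import pandas as pd

# Hypothetical two-level index: animal (outer) and size (inner)
idx = pd.MultiIndex.from_product([["Cat", "Dog"], ["large", "small"]],
                                 names=["animal", "size"])
weights = pd.Series([6.0, 4.0, 30.0, 8.0], index=idx, name="weight")

wide = weights.unstack()   # innermost index level "size" becomes the columns
print(wide)

tall = wide.stack()        # the column level goes back to being the innermost index level
print(tall)
```

Going from `weights` to `wide` and back to `tall` should leave the values where they started, which is a handy way to convince yourself the two methods are inverses when no data is missing.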

In [None]:
table = pd.pivot_table(pets_df2, values=['age', 'weight'], index=['animal','breed'], aggfunc='mean')
table.unstack()

In [None]:
pets_df2

In [None]:
pets_df2.pivot(index="name", columns="breed")

In [None]:
table = pets_df2.pivot(index="name", columns="breed")  # as above
table.stack()

In [None]:
help(pd.pivot_table)  # in a notebook, pd.pivot_table? also works

In [None]:
pets_df2.pivot_table(values="weight", index="name", columns="breed", fill_value=" ")

In [None]:
pets_df2

In [None]:
pets_df2.pivot_table(index="name", values="weight")

In [None]:
pets_df2.pivot_table(index="name", values="age")

In [None]:
pets_df2[['name', 'color']]

In [None]:
pets_df2.pivot(values="weight", index="name", columns="animal")

In [None]:
pets_df2.pivot_table(values="weight", index="name", columns="animal", fill_value=" ")

In [None]:
YouTubeVideo("5yFox2cReTw")

In [None]:
df = pd.read_excel('https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx')
df

In [None]:
df.info()

In [None]:
df.groupby('Region').Units.agg(np.sum)

In [None]:
pd.pivot_table(df, index='Region', values=['Sales', 'Units'])  # default aggregator is the mean

In [None]:
pd.pivot_table(df, index='Region', values=['Sales', 'Units'], aggfunc='sum')

In [None]:
df[(df.Region == 'North') & (df.Type == "Men's Clothing")].Units.agg('count')

In [None]:
df.groupby('Region').agg({'Sales':np.sum, 'Units':np.mean}) 

In [None]:
df.groupby(['Region','Type']).agg({'Sales':np.sum, 'Units':np.mean}) 

In [None]:
pd.pivot_table(df, index=['Region','Type'], aggfunc={"Sales":np.sum, "Units":np.mean})

In [None]:
pd.pivot_table(df, index=['Type','Region'], aggfunc={"Sales":np.sum, "Units":np.mean})

Let's turn our attention to Jake VanderPlas's online tutorial on pivot tables, hosted on [GitHub Pages](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html).

In [None]:
import seaborn as sns
titanic = sns.load_dataset('titanic')

In [None]:
titanic.info()

In [None]:
titanic.isnull().sum()

In [None]:
titanic.describe().T

Looking ahead to data cleaning, one approach to the missing 177 ages would be to fill them in with random ages within one standard deviation of the mean, i.e. mean-sd to mean+sd where sd = standard deviation.

The function below takes the name of the column to fill and returns a filled-in copy as a new Series.  We assign that Series to a new column, i.e. we leave `age` as is and keep the "filled in" version elsewhere.

In [None]:
def fill_na_age(df, colname):
    mean = df[colname].mean()
    sd = df[colname].std()
    low, high = int(mean - sd), int(mean + sd)  # randint expects integer bounds
    def fill_empty(x):
        if np.isnan(x):
            return np.random.randint(low, high)  # replace with a likely value
        return x  # non-missing values pass through untouched
    return df[colname].apply(fill_empty).astype(float)

Thanks to: Revathi Suriyadeepan<br/>
[Exploratory Data Analysis of Titanic Survival Problem](https://medium.com/analytics-vidhya/exploratory-data-analysis-of-titanic-survival-problem-e3af0fb1f276)<br/>
Part I — Analysis, Cleaning & Visualization<br/>
Dec 30, 2020
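As a variation on the same idea, here is a hedged sketch that fills all the gaps in one vectorized step and uses a seeded generator so the result is repeatable. The helper name `fill_na_uniform`, its `seed` parameter and the `ages` sample are assumptions made for illustration; this is not the method from the cited article.

```python
import numpy as np
import pandas as pd

def fill_na_uniform(series, seed=0):
    """Hypothetical helper: fill NaNs with random whole numbers within one sd of the mean."""
    rng = np.random.default_rng(seed)           # seeded, so results are repeatable
    mean, sd = series.mean(), series.std()      # both skip NaN by default
    low, high = int(mean - sd), int(mean + sd)
    filled = series.astype(float).copy()
    missing = filled.isna()
    # draw one integer per missing entry, from low up to and including high
    filled[missing] = rng.integers(low, high, size=missing.sum(), endpoint=True)
    return filled

ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0])   # made-up sample
print(fill_na_uniform(ages))
```

Drawing all replacements at once avoids a Python-level call per row, and fixing the seed means rerunning the cell gives the same filled column.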

In [None]:
titanic['filled_age'] = fill_na_age(titanic, 'age')
titanic

In [None]:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean')

In [None]:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()

In [None]:
titanic.pivot_table(values='survived', index='sex', columns='class')

In [None]:
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')

In [None]:
fare = pd.qcut(titanic['fare'], 4)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])

In [None]:
births = pd.read_csv("https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv")
births

In [None]:
births.info()

In [None]:
births.describe()

In [None]:
pd.DataFrame(births.groupby("year")["births"].agg("sum"))

In [None]:
births.groupby("year")["births"].agg("sum").idxmax()

In [None]:
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table(values="births", index='decade', columns='gender', aggfunc='sum')

In [None]:
import matplotlib.pyplot as plt
sns.set()  # use Seaborn styles
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year');