# Multiple Custom Grouping Aggregations

This challenge is going to be fairly difficult, but should answer a question that many pandas users face - What is the best way to do a grouping operation that does many custom aggregations? In this context, a 'custom aggregation' is defined as one that is not directly available to use from pandas and one that you must write a custom function for. 

In Pandas Challenge 1, a single aggregation, which required a custom grouping function, was the desired result. In this challenge, you'll need to make several aggregations when grouping. There are a few different solutions to this problem, but depending on how you arrive at your solution, there could arise enormous performance differences. I am looking for a compact, readable solution with very good performance.

### Sales Data

In this challenge, you will be working with some mock sales data found in the sales.csv file. It contains 200,000 rows and 9 columns.

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('sales.csv', parse_dates=['date'])
df.head()

Unnamed: 0,customer_id,date,country,region,delivery_type,cost_type,duration,revenue,cost
0,13763,2019-03-25,Portugal,F,slow,expert,60,553,295
1,13673,2019-12-06,Singapore,I,slow,experienced,60,895,262
2,10287,2018-09-04,India,I,slow,novice,60,857,260
3,14298,2018-06-21,Morocco,F,fastest,expert,120,741,238
4,11523,2019-01-05,Luxembourg,A,fast,expert,120,942,263


In [3]:
df.shape

(200000, 9)

### Challenge

There are many aggregations that you will need to return and it will take some time to understand what they are and how to return them. The following definitions for two time periods will be used throughout the aggregations.

Period **2019H1** is defined as the time period beginning January 1, 2019 and ending June 30, 2019.
Period **2018H1** is defined as the time period beginning January 1, 2018 and ending June 30, 2018.

### Aggregations
Now, I will list all the aggregations that are expected to be returned. Each bullet point represents a single column. Use the first word after the bullet point as the new column name.

For every country and region, return the following:
* recency: Number of days between today's date (9/9/2019) and the maximum value of the 'date' column 
* fast_and_fastest: Number of unique customer_id in period 2019H1 with delivery_type either 'fast' or 'fastest'
* rev_2019: Total revenue for the period 2019H1
* rev_2018: Total revenue for the period 2018H1
* cost_2019: Total cost for period 2019H1
* cost_2019_exp: Total cost for period 2019H1 with cost_type 'expert'
* other_cost: Difference between cost_2019 and cost_2019_exp
* rev_per_60: Total of revenue when duration equals 60 in period 2019H1 divided by number of unique customer_id when duration equals 60 in period 2019H1 
* profit_margin: Take the difference of rev_2019 and cost_2019_exp then divide by rev_2019. Return as percentage
* cost_exp_per_60: Total of cost when duration is 60 and cost_type is 'expert' in period 2019H1 divided by the number of unique customer_id when duration equals 60 and cost_type is 'expert' in period 2019H1 
* growth: Find the percentage growth from revenue in period 2019H1 compared to the revenue in period 2018H1

In [13]:
from datetime import datetime
def date_dif(df):
    d1=datetime.strptime("2019-09-09","%Y-%m-%d")
    d2=max(df["date"])
    return (d2-d1).days

df[["region","country","date"]].groupby(["region", "country"]).apply(date_dif)

region  country             
A       Argentina               86
        Australia               88
        Austria                 88
        Belgium                 88
        Brazil                  88
                                ..
J       Ukraine                 88
        United Arab Emirates    86
        United Kingdom          88
        United States           88
        Vietnam                 86
Length: 520, dtype: int64

In [58]:
start_date=datetime.strptime("2019-01-01","%Y-%m-%d")
end_date= datetime.strptime("2019-06-06","%Y-%m-%d")
mask=(df["date"]<=end_date) & (df["date"]>=start_date)
new_df=df[mask][["customer_id","delivery_type", "country","region"]]
new_df
new_df["fast_and_fastest"]=new_df[df["delivery_type"].isin(["fast","fastest"])].groupby([ "region","country","delivery_type"]).transform("count")


In [61]:
df[mask][["country","region","revenue"]].groupby(["country","region"]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,revenue
country,region,Unnamed: 2_level_1
Argentina,A,123843
Argentina,B,124205
Argentina,C,105129
Argentina,D,112833
Argentina,E,131384
...,...,...
Vietnam,F,125282
Vietnam,G,129496
Vietnam,H,122348
Vietnam,I,115814


In [66]:
df[mask][["cost","country","region"]].groupby(["country","region"]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,cost
country,region,Unnamed: 2_level_1
Argentina,A,40891
Argentina,B,41253
Argentina,C,34648
Argentina,D,37651
Argentina,E,44679
...,...,...
Vietnam,F,40848
Vietnam,G,42525
Vietnam,H,41262
Vietnam,I,39844


In [83]:
def final_agg(df):
    start_date=datetime.strptime("2019-01-01","%Y-%m-%d")
    end_date= datetime.strptime("2019-06-06","%Y-%m-%d")
    H12019=df["date"].between(start_date,end_date)
    H12018=df["date"].between("2018-01-01","2018-06-30")
    duartion=df["duration"].isin(["60"])
    exp=df["cost_type"].isin(["expert"])
    h12019_exp_duration=H12019 & duartion & exp
  

  ##calculate fields:
    rev_2019=df.loc[H12019,"revenue"].sum()/1_000
    rev_2018=df.loc[H12018,"revenue"].sum()/1_000
    cost_2019=df.loc[H12019,"cost"].sum()/1_000
    cost_2019_exp=df.loc[H12019,"cost"][df["cost_type"].isin(["expert"])].sum()/1_000
    other_costs=cost_2019-cost_2019_exp
    rev_per_60=df.loc[H12019,"revenue"][df["duration"].isin(["60"])].sum()/df.loc[H12019,"customer_id"][df["duration"].isin(["60"])].nunique()/1_000
    profit_margin=(rev_2019 - cost_2019)/rev_2019*100
    growth=(rev_2019/rev_2018-1) *100
    cost_exp_per_60= df.loc[h12019_exp_duration,"cost"].sum()/df.loc[h12019_exp_duration,"customer_id"].nunique()/1_000
    d={ "Revenue 2019": rev_2019, "Revenue 2018": rev_2018, "Cost 2019": cost_2019,"Expert cost 2019": cost_2019_exp,
    "Other costs": other_costs, "Revenue per 60d": rev_per_60, "Profit margin":profit_margin,
    "Expert cost for 60 days":cost_exp_per_60, "Growth": growth

    }

    return pd.Series(d)

df.groupby(["region","country"]).apply(final_agg)

Unnamed: 0_level_0,Unnamed: 1_level_0,Revenue 2019,Revenue 2018,Cost 2019,Expert cost 2019,Other costs,Revenue per 60d,Profit margin,Expert cost for 60 days,Growth
region,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
A,Argentina,123.843,82.912,40.891,15.593,25.298,0.769861,66.981582,0.251147,49.366799
A,Australia,118.855,95.464,39.466,12.056,27.410,0.773264,66.794834,0.261667,24.502430
A,Austria,105.700,92.148,34.543,12.756,21.787,0.756845,67.319773,0.250000,14.706776
A,Belgium,113.085,81.593,39.272,13.359,25.913,0.726125,65.272140,0.267231,38.596448
A,Brazil,137.316,83.937,45.951,20.875,25.076,0.763469,66.536310,0.258361,63.594124
...,...,...,...,...,...,...,...,...,...,...
J,Ukraine,107.139,93.077,35.688,11.159,24.529,0.762556,66.690001,0.250048,15.107921
J,United Arab Emirates,100.303,82.973,34.991,11.264,23.727,0.735373,65.114702,0.245000,20.886312
J,United Kingdom,120.896,76.645,40.075,13.031,27.044,0.744635,66.851674,0.248259,57.735012
J,United States,138.124,88.374,46.487,16.473,30.014,0.768615,66.344010,0.260412,56.294838


# Become a pandas expert

If you are looking to completely master the pandas library and become a trusted expert for doing data science work, check out my book [Master Data Analysis with Python][1]. It comes with over 300 exercises with detailed solutions covering the pandas library in-depth.

[1]: https://www.dunderdata.com/master-data-analysis-with-python