#Python - Non-technical introduction

##Day 2 Session 3: "Get a high-level overview: How to use “aggregate” operations?"

This file accompanies the lectures and provides the code for the corresponding slides.

*Note:* If you want to make changes to this document, you need to save your own copy using the "Save copy in Drive" command in the "File" menu.

###Preparation

**Make sure to run the following code before continuing.** It prepares everything (loads the data, ...).

In [None]:
#load the Pandas package
import pandas as pd
#Read the csv file and store it in the variable "myData". Note: this file is hosted in a GitHub repository.
myData=pd.read_csv(filepath_or_buffer="https://raw.githubusercontent.com/bachmannpatrick/Python-Class/master/data/transactions.csv", sep=",")
#Adjust the format of column "TransDate" to datetime
myData["TransDate"]  = pd.to_datetime(myData["TransDate"], dayfirst=True)

###20 - Basic aggregating techniques

####**Slide:** 1. Apply an aggregating function to a variable by a single aggregating dimension (1/2)

In [None]:
myData.groupby("Customer").agg(new_col=("PurchAmount","sum")).reset_index()


Unnamed: 0,Customer,new_col
0,100001,279.90
1,100002,499.95
2,100003,379.90
3,100004,499.95
4,100005,309.80
...,...,...
98775,199995,89.85
98776,199996,179.95
98777,199997,179.70
98778,199998,29.95
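
The pattern above can be tried on a tiny table where the sums are easy to check by hand. This is only a sketch: the DataFrame `toy` and its values are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Hypothetical toy data (not the course file): three transactions, two customers
toy = pd.DataFrame({
    "Customer": [1, 1, 2],
    "PurchAmount": [10.0, 5.0, 7.5],
})

# Named aggregation: new_column_name=(source_column, aggregating_function)
agg_purch = (
    toy.groupby("Customer")
       .agg(AggPurch=("PurchAmount", "sum"))
       .reset_index()  # turn the group key "Customer" back into a regular column
)
print(agg_purch)
```

The result has one row per customer; `reset_index()` only moves `Customer` from the index into a column.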


####**Slide:** 1. Apply an aggregating function to a variable by a single aggregating dimension (2/2)

In [None]:
myData.groupby("Customer").agg(AggPurch=("PurchAmount","sum")).reset_index()


Customer,sum
100001,279.90
100002,499.95
100003,379.90
100004,499.95
100005,309.80
...,...
199995,89.85
199996,179.95
199997,179.70
199998,29.95


####**Slide:** 2. Apply an aggregating function to multiple variables by an aggregating dimension

In [None]:
myData.groupby("Customer").agg(AggPurch=("PurchAmount","sum"),MaxPurch=("PurchAmount","max"),AggQuant=("Quantity", "sum")).reset_index()

Unnamed: 0,Customer,AggPurch,MaxPurch,AggQuant
0,100001,279.90,199.95,2
1,100002,499.95,499.95,1
2,100003,379.90,249.95,2
3,100004,499.95,499.95,1
4,100005,309.80,79.95,4
...,...,...,...,...
98775,199995,89.85,29.95,3
98776,199996,179.95,179.95,1
98777,199997,179.70,29.95,6
98778,199998,29.95,29.95,1
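
Long chains like the one above can become hard to read. A possible way to lay them out, sketched on made-up data (`toy` and its values are assumptions for illustration), is to wrap the chain in parentheses and put one aggregation per line:

```python
import pandas as pd

# Hypothetical toy data (not the course file)
toy = pd.DataFrame({
    "Customer": [1, 1, 2],
    "PurchAmount": [10.0, 5.0, 7.5],
    "Quantity": [2, 1, 3],
})

# One named aggregation per line; several of them may use the same source column
summary = (
    toy.groupby("Customer")
       .agg(
           AggPurch=("PurchAmount", "sum"),  # total spend per customer
           MaxPurch=("PurchAmount", "max"),  # largest single purchase per customer
           AggQuant=("Quantity", "sum"),     # total quantity per customer
       )
       .reset_index()
)
print(summary)
```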


####**Slide:** 3. Apply multiple aggregating functions to the same variable by a single aggregating dimension

In [None]:
myData.groupby("Customer").agg(AggPurch=("PurchAmount","sum"),MaxPurch=("PurchAmount","max"),AggQuant=("Quantity", "sum")).reset_index()

Unnamed: 0,Customer,AggPurch,MaxPurch,AggQuant
0,100001,279.90,199.95,2
1,100002,499.95,499.95,1
2,100003,379.90,249.95,2
3,100004,499.95,499.95,1
4,100005,309.80,79.95,4
...,...,...,...,...
98775,199995,89.85,29.95,3
98776,199996,179.95,179.95,1
98777,199997,179.70,29.95,6
98778,199998,29.95,29.95,1
98779,199999,179.95,179.95,1

####**Slide:** 4. Apply an aggregating function to a variable by multiple aggregating dimensions

In [None]:
myData.groupby(["Customer","TransDate"]).agg(AggPurch=("PurchAmount","sum")).reset_index()
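
No output is shown for this cell, so here is a small sketch of what grouping by two dimensions does. The table `toy` is invented for illustration: customer 1 buys twice on the same day and once on another day.

```python
import pandas as pd

# Hypothetical toy data (not the course file)
toy = pd.DataFrame({
    "Customer": [1, 1, 1, 2],
    "TransDate": pd.to_datetime(["2012-01-05", "2012-01-05", "2012-02-10", "2012-01-05"]),
    "PurchAmount": [10.0, 5.0, 20.0, 7.5],
})

# A list of column names creates one group per (Customer, TransDate) combination
per_day = (
    toy.groupby(["Customer", "TransDate"])
       .agg(AggPurch=("PurchAmount", "sum"))
       .reset_index()
)
print(per_day)
```

The two purchases of customer 1 on 2012-01-05 are added up into a single row, so the result has three rows instead of four.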

####**Slide:** 5. Apply an aggregating function to a variable by an aggregating dimension to a selection of rows

In [None]:
myData.iloc[1:5].groupby("Customer").agg(AggPurch=("PurchAmount", "sum")).reset_index()

Unnamed: 0,Customer,AggPurch
0,120621,99.95
1,149236,119.9
2,172951,199.95


####**Slide:** 6. Apply an aggregating function to the whole dataset

In [None]:
myData["PurchAmount"].sum()

18784784.62
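
If several summary values of one column are needed at once, `agg()` also accepts a list of function names. A minimal sketch on made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data (not the course file)
toy = pd.DataFrame({"PurchAmount": [10.0, 5.0, 7.5]})

# A single aggregating function over the whole column
total = toy["PurchAmount"].sum()

# Several aggregating functions at once; the result is labelled by function name
overview = toy["PurchAmount"].agg(["sum", "max"])
print(total)
print(overview)
```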

####**Slide:** Sidenote: Create new columns in the original DataFrame with the transform() function

In [None]:
#transform() returns one value per original row, so the result fits into myData as a new column
myData["AggPurch"]= myData.groupby("Customer")["PurchAmount"].transform("sum")
print(myData)

        Customer  TransDate  Quantity  PurchAmount    Cost    TransID  AggPurch
0         149332 2005-11-15         1       199.95  107.00  127998739    274.85
1         172951 2008-08-29         1       199.95  108.00  128888288    889.80
2         120621 2007-10-19         1        99.95   49.00  125375247     99.95
3         149236 2005-11-14         1        39.95   18.95  127996226    119.90
4         149236 2007-06-12         1        79.95   35.00  128670302    119.90
...          ...        ...       ...          ...     ...        ...       ...
223186    199997 2012-09-17         1        29.95   13.80  132481149    179.70
223187    199997 2012-09-17         1        29.95   13.80  132481149    179.70
223188    199998 2012-09-17         1        29.95   13.80  132481154     29.95
223189    199999 2012-09-17         1       179.95  109.99  132481165    179.95
223190    199542 2012-09-17         1        39.95   10.50  131973368     39.95

[223191 rows x 7 columns]
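
The difference between aggregating and transforming is easiest to see side by side. The following sketch uses made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data (not the course file)
toy = pd.DataFrame({
    "Customer": [1, 1, 2],
    "PurchAmount": [10.0, 5.0, 7.5],
})

# Aggregating collapses the data: one value per customer
per_customer = toy.groupby("Customer")["PurchAmount"].sum()

# transform() keeps the original shape: the group total is repeated on every row
toy["AggPurch"] = toy.groupby("Customer")["PurchAmount"].transform("sum")

print(per_customer)
print(toy)
```

Both rows of customer 1 receive the same total, which is why the transformed result can be stored directly as a new column.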


###21 - Advanced aggregating techniques


####**Slide:** Aggregate a variable by a transformed aggregating dimension

In [None]:
myData.groupby(myData["TransDate"].dt.to_period("M")).agg(AggPurch=("PurchAmount","sum")).reset_index()

Unnamed: 0,TransDate,AggPurch
0,2004-12,27623.90
1,2005-01,83363.73
2,2005-02,87341.59
3,2005-03,86803.31
4,2005-04,84293.01
...,...,...
92,2012-08,108462.20
93,2012-09,71429.25
94,2012-10,42588.75
95,2012-11,44633.30
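
To see what the transformed dimension looks like, it can help to compute it separately first. A small sketch with made-up dates (`toy` and the helper name `month` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data (not the course file)
toy = pd.DataFrame({
    "TransDate": pd.to_datetime(["2012-01-05", "2012-01-20", "2012-02-10"]),
    "PurchAmount": [10.0, 5.0, 20.0],
})

# Transform each date into its calendar month (a monthly period)
month = toy["TransDate"].dt.to_period("M")

# Group by the transformed values instead of by a column name
monthly = toy.groupby(month).agg(AggPurch=("PurchAmount", "sum")).reset_index()
print(monthly)
```

The two January purchases fall into the same period and are summed together.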


####**Slide:** Sidenote: Chaining saves memory and is faster

In [None]:
myData.groupby("Customer").agg(AggPurch=("PurchAmount", "sum")).reset_index()[lambda x: x['AggPurch'] > 100]

Unnamed: 0,Customer,AggPurch
0,100001,279.90
1,100002,499.95
2,100003,379.90
3,100004,499.95
4,100005,309.80
...,...,...
98769,199989,119.80
98771,199991,199.95
98776,199996,179.95
98777,199997,179.70


In [None]:
#This gives the same result without chaining:
myData_agg=myData.groupby("Customer").agg(AggPurch=("PurchAmount", "sum")).reset_index()
myData_agg.loc[myData_agg["AggPurch"]>100]


Unnamed: 0,Customer,AggPurch
0,100001,279.90
1,100002,499.95
2,100003,379.90
3,100004,499.95
4,100005,309.80
...,...,...
98769,199989,119.80
98771,199991,199.95
98776,199996,179.95
98777,199997,179.70


####**Slide:** Sidenote: Pay attention to operation sequences

In [None]:
myData.groupby("Customer").agg(AggPurch=("PurchAmount", "sum")).reset_index()[lambda x: x["AggPurch"] > 100]

Unnamed: 0,Customer,AggPurch
0,100001,279.90
1,100002,499.95
2,100003,379.90
3,100004,499.95
4,100005,309.80
...,...,...
98769,199989,119.80
98771,199991,199.95
98776,199996,179.95
98777,199997,179.70


In [None]:
#not (!) identical to:
myData.loc[myData["PurchAmount"]>100,].groupby("Customer").agg(AggPurch=("PurchAmount","sum")).reset_index()

Unnamed: 0,Customer,AggPurch
0,100001,199.95
1,100002,499.95
2,100003,379.90
3,100004,499.95
4,100006,349.95
...,...,...
32827,199973,159.95
32828,199978,149.95
32829,199991,199.95
32830,199996,179.95
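
Why the two results differ becomes clear on a tiny example. In this sketch (`toy` is made up for illustration), customer 1 has two purchases that are each below 100 but together exceed 100:

```python
import pandas as pd

# Hypothetical toy data (not the course file)
toy = pd.DataFrame({
    "Customer": [1, 1, 2],
    "PurchAmount": [80.0, 60.0, 120.0],
})

# Aggregate first, then filter: keeps customers whose TOTAL is above 100
agg_then_filter = (
    toy.groupby("Customer")
       .agg(AggPurch=("PurchAmount", "sum"))
       .reset_index()[lambda x: x["AggPurch"] > 100]
)

# Filter first, then aggregate: only single purchases above 100 enter the sum
filter_then_agg = (
    toy.loc[toy["PurchAmount"] > 100]
       .groupby("Customer")
       .agg(AggPurch=("PurchAmount", "sum"))
       .reset_index()
)

print(agg_then_filter)
print(filter_then_agg)
```

Customer 1 appears only in the first result, because the filter in the second version removes both of their purchases before any sum is computed.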


###22 - Combined select and aggregate operations

####**Slide:** Select the first 3 purchases of each customer

In [None]:
myData.sort_values("TransDate").groupby("Customer").head(3)

Unnamed: 0,Customer,TransDate,Quantity,PurchAmount,Cost,TransID
30414,144488,2004-12-16,1,24.95,11.25,127833616
44181,100291,2004-12-16,1,49.95,22.89,124285020
44180,100291,2004-12-16,1,49.95,22.89,124285020
44178,100290,2004-12-16,1,59.95,26.00,124284887
44171,100289,2004-12-16,1,21.95,10.00,124284670
...,...,...,...,...,...,...
223071,186299,2012-12-09,1,34.95,12.65,129592935
223070,186298,2012-12-09,1,49.95,17.00,129592815
216326,183058,2012-12-09,1,69.95,28.57,131969375
101290,121226,2012-12-09,1,19.95,9.50,131969793


####**Slide:** Select the last purchase of each customer

In [None]:
myData.sort_values("TransDate").groupby("Customer").tail(1)

Unnamed: 0,Customer,TransDate,Quantity,PurchAmount,Cost,TransID
44541,100002,2004-12-29,1,499.95,349.00,123490350
44544,100340,2004-12-29,1,29.95,10.84,124295297
44601,100345,2004-12-29,1,189.95,82.91,124295449
44602,100346,2004-12-29,1,49.95,22.60,124295469
44635,100357,2004-12-30,1,29.95,10.00,124295845
...,...,...,...,...,...,...
223071,186299,2012-12-09,1,34.95,12.65,129592935
223070,186298,2012-12-09,1,49.95,17.00,129592815
216326,183058,2012-12-09,1,69.95,28.57,131969375
101290,121226,2012-12-09,1,19.95,9.50,131969793
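
The role of `sort_values()` in the two cells above can be checked on a small table whose rows are deliberately out of date order. This is a sketch on made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data (not the course file), rows NOT in date order
toy = pd.DataFrame({
    "Customer": [1, 2, 1, 1],
    "TransDate": pd.to_datetime(["2012-03-01", "2012-01-15", "2012-01-01", "2012-02-01"]),
    "PurchAmount": [30.0, 7.5, 10.0, 20.0],
})

# Sort by date first so that "first" and "last" refer to time, not to row position
ordered = toy.sort_values("TransDate")

first_two = ordered.groupby("Customer").head(2)  # earliest two purchases per customer
last_one = ordered.groupby("Customer").tail(1)   # most recent purchase per customer

print(first_two)
print(last_one)
```

Without the sort, `head()` and `tail()` would simply follow the row order of the table.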


####**Slide:** Updating columns using an aggregating dimension (1/2)

In [None]:
myData["Count"]=myData.groupby("Customer")["Customer"].transform("count")
print(myData)

        Customer  TransDate  Quantity  PurchAmount    Cost    TransID  Count
0         149332 2005-11-15         1       199.95  107.00  127998739      3
1         172951 2008-08-29         1       199.95  108.00  128888288      4
2         120621 2007-10-19         1        99.95   49.00  125375247      1
3         149236 2005-11-14         1        39.95   18.95  127996226      2
4         149236 2007-06-12         1        79.95   35.00  128670302      2
...          ...        ...       ...          ...     ...        ...    ...
223186    199997 2012-09-17         1        29.95   13.80  132481149      6
223187    199997 2012-09-17         1        29.95   13.80  132481149      6
223188    199998 2012-09-17         1        29.95   13.80  132481154      1
223189    199999 2012-09-17         1       179.95  109.99  132481165      1
223190    199542 2012-09-17         1        39.95   10.50  131973368      1

[223191 rows x 7 columns]


####**Slide:** Updating columns using an aggregating dimension (2/2)

In [None]:
myData["RelDate"]=myData.groupby("Customer").cumcount()+1
print(myData)

        Customer  TransDate  Quantity  ...  AggPurch  Count  RelDate
0         149332 2005-11-15         1  ...    274.85      3        1
1         172951 2008-08-29         1  ...    889.80      4        1
2         120621 2007-10-19         1  ...     99.95      1        1
3         149236 2005-11-14         1  ...    119.90      2        1
4         149236 2007-06-12         1  ...    119.90      2        2
...          ...        ...       ...  ...       ...    ...      ...
223186    199997 2012-09-17         1  ...    179.70      6        5
223187    199997 2012-09-17         1  ...    179.70      6        6
223188    199998 2012-09-17         1  ...     29.95      1        1
223189    199999 2012-09-17         1  ...    179.95      1        1
223190    199542 2012-09-17         1  ...     39.95      1        1

[223191 rows x 9 columns]
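
`cumcount()` numbers the rows of each customer in the order in which they appear in the table. If the numbering should follow the purchase date, one option is to sort first. A sketch on made-up data (`toy` is an assumption for illustration; whether the course data needs this step depends on how the file is ordered):

```python
import pandas as pd

# Hypothetical toy data (not the course file), rows NOT in date order
toy = pd.DataFrame({
    "Customer": [1, 2, 1, 1],
    "TransDate": pd.to_datetime(["2012-03-01", "2012-01-15", "2012-01-01", "2012-02-01"]),
})

# Sort by customer and date so the running number follows time
toy = toy.sort_values(["Customer", "TransDate"])

# cumcount() starts at 0, so add 1 to start the numbering at 1
toy["RelDate"] = toy.groupby("Customer").cumcount() + 1
print(toy)
```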


####**Slide:** Creating a lagged variable

In [None]:
myData["CostLag"]=myData.groupby("Customer")["Cost"].shift(periods=1)
print(myData)

        Customer  TransDate  Quantity  ...  Count  RelDate  CostLag
0         149332 2005-11-15         1  ...      3        1      NaN
1         172951 2008-08-29         1  ...      4        1      NaN
2         120621 2007-10-19         1  ...      1        1      NaN
3         149236 2005-11-14         1  ...      2        1      NaN
4         149236 2007-06-12         1  ...      2        2    18.95
...          ...        ...       ...  ...    ...      ...      ...
223186    199997 2012-09-17         1  ...      6        5    13.80
223187    199997 2012-09-17         1  ...      6        6    13.80
223188    199998 2012-09-17         1  ...      1        1      NaN
223189    199999 2012-09-17         1  ...      1        1      NaN
223190    199542 2012-09-17         1  ...      1        1      NaN

[223191 rows x 10 columns]
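
The `NaN` values in the output are expected: the first purchase of each customer has no previous row to take a value from. A minimal sketch on made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data (not the course file)
toy = pd.DataFrame({
    "Customer": [1, 1, 2, 1],
    "Cost": [5.0, 8.0, 3.0, 6.0],
})

# shift(periods=1) moves each value down by one row WITHIN its customer group
toy["CostLag"] = toy.groupby("Customer")["Cost"].shift(periods=1)
print(toy)
```

The last row belongs to customer 1 and receives 8.0, the cost of that customer's previous purchase, even though a row of customer 2 sits in between.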


In [None]:
myData.iloc[1:6].groupby("Customer", as_index=False)["PurchAmount"].sum()

Unnamed: 0,Customer,PurchAmount
0,120621,99.95
1,140729,129.95
2,149236,119.9
3,172951,199.95


####**Slide:** Cumulating variables

In [None]:
myData["totSpend"] = myData.groupby("Customer")["Cost"].cumsum()
print(myData)

        Customer  TransDate  Quantity  PurchAmount    Cost    TransID  \
0         149332 2005-11-15         1       199.95  107.00  127998739   
1         172951 2008-08-29         1       199.95  108.00  128888288   
2         120621 2007-10-19         1        99.95   49.00  125375247   
3         149236 2005-11-14         1        39.95   18.95  127996226   
4         149236 2007-06-12         1        79.95   35.00  128670302   
...          ...        ...       ...          ...     ...        ...   
223186    199997 2012-09-17         1        29.95   13.80  132481149   
223187    199997 2012-09-17         1        29.95   13.80  132481149   
223188    199998 2012-09-17         1        29.95   13.80  132481154   
223189    199999 2012-09-17         1       179.95  109.99  132481165   
223190    199542 2012-09-17         1        39.95   10.50  131973368   

        totSpend  
0         107.00  
1         108.00  
2          49.00  
3          18.95  
4          53.95  
...      