#Python - Non-technical introduction

##SOLUTION Exercise Day 2 Session 3: "Get a high-level overview: How to use “aggregate” operations?"

This file provides the solutions to the exercises accompanying the lectures.

*Note:* If you want to make changes to this document, you need to save your own copy using the "Save copy in Drive" command in the "File" menu.

###Preparation

**Make sure to run the following code before continuing.** It prepares everything needed for the exercises (imports the modules, loads the data, and converts the date column).

In [None]:
#%% Import relevant modules
import pandas as pd
import numpy as np
# Read in the data
myData = pd.read_csv("https://raw.githubusercontent.com/bachmannpatrick/Python-Class/master/data/transactions.csv")
# Convert TransDate to datetime
myData["TransDate"] = pd.to_datetime(myData["TransDate"])


###20 - Basic techniques for aggregating observations

1. Sum `PurchAmount` by `Customer` and `TransDate`.


In [None]:
myData.groupby(["Customer","TransDate"]).agg(AggPurch=("PurchAmount","sum")).reset_index()


| | Customer | TransDate | AggPurch |
|---|---|---|---|
| 0 | 100001 | 2011-06-25 | 79.95 |
| 1 | 100001 | 2011-08-24 | 199.95 |
| 2 | 100002 | 2004-12-29 | 499.95 |
| 3 | 100003 | 2012-01-23 | 379.90 |
| 4 | 100004 | 2012-05-08 | 499.95 |
| ... | ... | ... | ... |
| 135033 | 199995 | 2012-09-17 | 89.85 |
| 135034 | 199996 | 2012-09-17 | 179.95 |
| 135035 | 199997 | 2012-09-17 | 179.70 |
| 135036 | 199998 | 2012-09-17 | 29.95 |
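
The named-aggregation pattern used in this solution, `new_column=(source_column, function)`, is easier to check by hand on a few rows. The following is a minimal sketch on a hypothetical toy table (`toy` and its values are made up for illustration; only the column names mirror the transactions data).

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "Customer": [1, 1, 1, 2],
    "TransDate": pd.to_datetime(
        ["2011-06-25", "2011-06-25", "2011-08-24", "2012-01-23"]
    ),
    "PurchAmount": [10.0, 5.0, 20.0, 7.5],
})

# Named aggregation: AggPurch is the sum of PurchAmount within each
# (Customer, TransDate) group; reset_index() turns the group keys
# back into ordinary columns.
agg_purch = (
    toy.groupby(["Customer", "TransDate"])
    .agg(AggPurch=("PurchAmount", "sum"))
    .reset_index()
)
print(agg_purch)
```

Customer 1 has two rows on 2011-06-25, so those two amounts end up in a single aggregated row.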


2. Count the number of transactions by `Customer`.

In [None]:
myData.groupby("Customer").agg(PurchAmount=("PurchAmount","count")).reset_index()

Customer
100001    2
100002    1
100003    2
100004    1
100005    4
         ..
199995    3
199996    1
199997    6
199998    1
199999    1
Name: PurchAmount, Length: 98780, dtype: int64

Alternative:

In [None]:
myData.groupby(["Customer"]).agg(Size=("PurchAmount","size")).reset_index()

| | Customer | Size |
|---|---|---|
| 0 | 100001 | 2 |
| 1 | 100002 | 1 |
| 2 | 100003 | 2 |
| 3 | 100004 | 1 |
| 4 | 100005 | 4 |
| ... | ... | ... |
| 98775 | 199995 | 3 |
| 98776 | 199996 | 1 |
| 98777 | 199997 | 6 |
| 98778 | 199998 | 1 |
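
The `count` and `size` solutions return the same numbers only when `PurchAmount` has no missing values: `count` skips missing entries, while `size` counts every row in the group. A small sketch on hypothetical toy data (the missing value is inserted on purpose) shows the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing purchase amount.
toy = pd.DataFrame({
    "Customer": [1, 1, 2],
    "PurchAmount": [10.0, np.nan, 7.5],
})

counts = (
    toy.groupby("Customer")
    .agg(
        Count=("PurchAmount", "count"),  # non-missing values only
        Size=("PurchAmount", "size"),    # all rows, missing or not
    )
    .reset_index()
)
print(counts)
```

For customer 1 the two columns differ (1 versus 2), which is why `size` is the safer choice when the goal is to count transactions rather than recorded amounts.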


###21 - Advanced techniques for aggregating observations

1. Aggregate the purchase amount (sum) of all transactions per customer on a yearly basis for the years 2007 and 2008.

In [None]:
# Note: the result column is named "Size" but holds the summed purchase amount
(
    myData.loc[(myData["TransDate"].dt.year == 2008) | (myData["TransDate"].dt.year == 2007)]
    .groupby([myData["TransDate"].dt.year, "Customer"])
    .agg(Size=("PurchAmount", "sum"))
    .reset_index()
)

| | TransDate | Customer | Size |
|---|---|---|---|
| 0 | 2007 | 100034 | 249.85 |
| 1 | 2007 | 100056 | 59.95 |
| 2 | 2007 | 100176 | 69.95 |
| 3 | 2007 | 100179 | 699.90 |
| 4 | 2007 | 100183 | 99.95 |
| ... | ... | ... | ... |
| 36068 | 2008 | 187644 | 59.95 |
| 36069 | 2008 | 187645 | 124.90 |
| 36070 | 2008 | 187646 | 149.85 |
| 36071 | 2008 | 187647 | 119.95 |
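
The same yearly aggregation can be written with an explicit year column and `isin`, which some readers find easier to follow. This is a sketch of one possible variant on hypothetical toy data, not the solution above; the names `Year` and `TotalPurch` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "Customer": [1, 1, 2, 2, 3],
    "TransDate": pd.to_datetime(
        ["2007-03-01", "2007-09-15", "2008-01-10", "2009-05-05", "2008-07-07"]
    ),
    "PurchAmount": [10.0, 5.0, 20.0, 99.0, 7.5],
})

yearly = (
    toy.assign(Year=toy["TransDate"].dt.year)       # extract the year once
    .loc[lambda d: d["Year"].isin([2007, 2008])]    # keep 2007 and 2008 only
    .groupby(["Year", "Customer"])
    .agg(TotalPurch=("PurchAmount", "sum"))
    .reset_index()
)
print(yearly)
```

The 2009 transaction of customer 2 is dropped by the filter before the groups are formed, so it does not contribute to any total.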


2. How many customers spent more than $50 in total across 2008 and 2009?


In [None]:
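(
    myData.loc[(myData["TransDate"].dt.year == 2008) | (myData["TransDate"].dt.year == 2009)]
    .groupby("Customer")
    .agg(Count=("PurchAmount", "sum"))  # "Count" holds each customer's total purchase amount
    .reset_index()
    [lambda x: x["Count"] > 50]         # keep customers whose total exceeds 50
    .agg(x=("Count", "count"))          # count the remaining customers
)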

| | Count |
|---|---|
| x | 24864 |


In [None]:
# Alternative:
(
    myData.loc[(myData["TransDate"].dt.year == 2008) | (myData["TransDate"].dt.year == 2009)]
    .groupby("Customer")["PurchAmount"]
    .sum()
    [lambda x: x > 50]
    .count()
)
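
Both solutions follow the same three steps: filter the rows to the period, aggregate per customer, then filter the aggregated totals and count what is left. The sketch below walks through these steps on hypothetical toy data; `in_period`, `totals`, and `n_customers` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "Customer": [1, 1, 2, 3, 3],
    "TransDate": pd.to_datetime(
        ["2008-02-01", "2009-03-01", "2008-05-05", "2007-01-01", "2009-12-31"]
    ),
    "PurchAmount": [30.0, 30.0, 40.0, 500.0, 50.0],
})

# Step 1: keep only transactions from 2008 and 2009.
in_period = toy["TransDate"].dt.year.isin([2008, 2009])

# Step 2: total purchase amount per customer within the period.
totals = toy.loc[in_period].groupby("Customer")["PurchAmount"].sum()

# Step 3: count customers whose total is strictly greater than 50.
n_customers = int((totals > 50).sum())
print(n_customers)
```

Customer 3 has a large purchase in 2007, but it falls outside the period, and the remaining total of exactly 50 does not pass the strict `> 50` comparison.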

###22 - Combined select-aggregate operations

1. Add a column to `myData` with the total number of purchases per customer.



In [None]:
myData["Count"]= myData.groupby("Customer")["Customer"].transform("size")
print(myData)

        Customer  TransDate  Quantity  PurchAmount    Cost    TransID  Count
0         149332 2005-11-15         1       199.95  107.00  127998739      3
1         172951 2008-08-29         1       199.95  108.00  128888288      4
2         120621 2007-10-19         1        99.95   49.00  125375247      1
3         149236 2005-11-14         1        39.95   18.95  127996226      2
4         149236 2007-12-06         1        79.95   35.00  128670302      2
...          ...        ...       ...          ...     ...        ...    ...
223186    199997 2012-09-17         1        29.95   13.80  132481149      6
223187    199997 2012-09-17         1        29.95   13.80  132481149      6
223188    199998 2012-09-17         1        29.95   13.80  132481154      1
223189    199999 2012-09-17         1       179.95  109.99  132481165      1
223190    199542 2012-09-17         1        39.95   10.50  131973368      1

[223191 rows x 7 columns]
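
`transform` is used here instead of `agg` because it returns one value per original row rather than one row per group, so the result can be assigned directly as a new column. A minimal sketch on hypothetical toy data contrasts the two shapes.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "Customer": [1, 2, 1, 3, 1],
    "PurchAmount": [10.0, 7.5, 5.0, 3.0, 1.0],
})

# Aggregation: one entry per customer (3 entries).
per_customer = toy.groupby("Customer")["Customer"].size()

# Transformation: the group size is repeated on every row of the group,
# so the result has the same length and index as the original frame.
toy["Count"] = toy.groupby("Customer")["Customer"].transform("size")
print(per_customer)
print(toy)
```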


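2. Create a lead variable for `TransDate`, shifted by one period within each customer.

In [None]:
# shift(periods=-1) pulls in the date from the customer's next row (a lead),
# even though the column is named "TransDateLag"
myData["TransDateLag"]=myData.groupby("Customer")["TransDate"].shift(periods=-1)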
print(myData)

        Customer  TransDate  Quantity  ...    TransID  Count  TransDateLag
0         149332 2005-11-15         1  ...  127998739      3    2005-12-13
1         172951 2008-08-29         1  ...  128888288      4    2008-08-29
2         120621 2007-10-19         1  ...  125375247      1           NaT
3         149236 2005-11-14         1  ...  127996226      2    2007-12-06
4         149236 2007-12-06         1  ...  128670302      2           NaT
...          ...        ...       ...  ...        ...    ...           ...
223186    199997 2012-09-17         1  ...  132481149      6    2012-09-17
223187    199997 2012-09-17         1  ...  132481149      6           NaT
223188    199998 2012-09-17         1  ...  132481154      1           NaT
223189    199999 2012-09-17         1  ...  132481165      1           NaT
223190    199542 2012-09-17         1  ...  131973368      1           NaT

[223191 rows x 8 columns]
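
The sign of `periods` decides the direction: a negative value pulls in the value from a later row (lead), a positive value pulls in the value from an earlier row (lag). `shift` works on the current row order within each group, so the sketch below sorts by customer and date first; that sorting step, the toy data, and the names `NextDate` and `PrevDate` are assumptions added here for illustration.

```python
import pandas as pd

# Hypothetical toy data, deliberately out of chronological order.
toy = pd.DataFrame({
    "Customer": [1, 2, 1, 1],
    "TransDate": pd.to_datetime(
        ["2012-03-01", "2012-01-15", "2012-01-01", "2012-02-01"]
    ),
})

# Sort so that "next row" means "next transaction in time" per customer.
toy = toy.sort_values(["Customer", "TransDate"]).reset_index(drop=True)

grouped = toy.groupby("Customer")["TransDate"]
toy["NextDate"] = grouped.shift(periods=-1)  # lead: date of the following transaction
toy["PrevDate"] = grouped.shift(periods=1)   # lag: date of the preceding transaction
print(toy)
```

The last transaction of each customer has no following row, so its lead is `NaT`; likewise the first transaction has no lag.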
