#Python - Non-technical introduction

##SOLUTION Exercise Day 2 Block 3: "Wrangling data with Python" (2/2)

This file provides exercises accompanying the lectures.

*Note:* If you want to make changes to this document, you need to save your own copy using the "Save copy in Drive" command in the "File" menu.

###Preparation


1.   Read the CSV file and store it in the variable "myData". Note: this file is hosted in a GitHub repository: https://raw.githubusercontent.com/bachmannpatrick/Python-Class/master/data/transactions.csv
2.   Convert the column "TransDate" to datetime format.

In [27]:
import numpy as np
import pandas as pd
from datetime import datetime

In [28]:
myData = pd.read_csv(filepath_or_buffer="https://raw.githubusercontent.com/bachmannpatrick/Python-Class/master/data/transactions.csv", sep=",")
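# Parse the day.month.year strings in "TransDate" into timezone-aware (UTC) datetimes.
# With an explicit format string, dayfirst is not needed.
myData["TransDate"] = pd.to_datetime(
    myData["TransDate"],
    format="%d.%m.%Y",
    utc=True
    )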

In [36]:
myData

Unnamed: 0,Customer,TransDate,Quantity,PurchAmount,Cost,TransID,Count
0,149332,2005-11-15 00:00:00+00:00,1,199.95,107.00,127998739,3
1,172951,2008-08-29 00:00:00+00:00,1,199.95,108.00,128888288,4
2,120621,2007-10-19 00:00:00+00:00,1,99.95,49.00,125375247,1
3,149236,2005-11-14 00:00:00+00:00,1,39.95,18.95,127996226,2
4,149236,2007-06-12 00:00:00+00:00,1,79.95,35.00,128670302,2
...,...,...,...,...,...,...,...
223186,199997,2012-09-17 00:00:00+00:00,1,29.95,13.80,132481149,6
223187,199997,2012-09-17 00:00:00+00:00,1,29.95,13.80,132481149,6
223188,199998,2012-09-17 00:00:00+00:00,1,29.95,13.80,132481154,1
223189,199999,2012-09-17 00:00:00+00:00,1,179.95,109.99,132481165,1
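
To make the effect of the format string more concrete, here is a minimal sketch on two hand-written date strings. The strings are assumed to resemble the raw `TransDate` values (day.month.year), which is implied by the format used above but not shown in this file.

```python
import pandas as pd

# Hypothetical raw values in day.month.year notation (assumed, not read from the file).
raw = pd.Series(["15.11.2005", "29.08.2008"])

# %d = day, %m = month, %Y = four-digit year; utc=True makes the result timezone-aware.
parsed = pd.to_datetime(raw, format="%d.%m.%Y", utc=True)
print(parsed)
```

Once the column is a datetime, accessors such as `.dt.year` become available, which the later exercises rely on.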


###20 - Basic aggregating techniques

1. Sum `PurchAmount` by `Customer` and `TransDate`.


In [29]:
myData.groupby(by=["Customer", "TransDate"]).agg(AggPurch=("PurchAmount", "sum")).reset_index()

Unnamed: 0,Customer,TransDate,AggPurch
0,100001,2011-06-25 00:00:00+00:00,79.95
1,100001,2011-08-24 00:00:00+00:00,199.95
2,100002,2004-12-29 00:00:00+00:00,499.95
3,100003,2012-01-23 00:00:00+00:00,379.90
4,100004,2012-08-05 00:00:00+00:00,499.95
...,...,...,...
135033,199995,2012-09-17 00:00:00+00:00,89.85
135034,199996,2012-09-17 00:00:00+00:00,179.95
135035,199997,2012-09-17 00:00:00+00:00,179.70
135036,199998,2012-09-17 00:00:00+00:00,29.95
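
To see what the named aggregation in the cell above does, the following minimal sketch applies the same pattern to a small hand-made table. The three rows and their values are invented for illustration and are not taken from `transactions.csv`.

```python
import pandas as pd

# Toy data (hypothetical values, not from transactions.csv).
toy = pd.DataFrame({
    "Customer": [1, 1, 2],
    "TransDate": pd.to_datetime(["2007-01-05", "2007-01-05", "2008-03-10"]),
    "PurchAmount": [10.0, 5.0, 20.0],
})

# Named aggregation: the keyword (AggPurch) becomes the output column name,
# the tuple is (input column, aggregation function).
agg_toy = (
    toy.groupby(by=["Customer", "TransDate"])
    .agg(AggPurch=("PurchAmount", "sum"))
    .reset_index()
)
print(agg_toy)
```

The two rows of customer 1 on the same date collapse into a single row, so the result has one row per customer and date.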


2. Count the number of transactions by `Customer`.

In [30]:
myData["Count"] = myData.groupby("Customer")["Customer"].transform("count")

In [31]:
myData

Unnamed: 0,Customer,TransDate,Quantity,PurchAmount,Cost,TransID,Count
0,149332,2005-11-15 00:00:00+00:00,1,199.95,107.00,127998739,3
1,172951,2008-08-29 00:00:00+00:00,1,199.95,108.00,128888288,4
2,120621,2007-10-19 00:00:00+00:00,1,99.95,49.00,125375247,1
3,149236,2005-11-14 00:00:00+00:00,1,39.95,18.95,127996226,2
4,149236,2007-06-12 00:00:00+00:00,1,79.95,35.00,128670302,2
...,...,...,...,...,...,...,...
223186,199997,2012-09-17 00:00:00+00:00,1,29.95,13.80,132481149,6
223187,199997,2012-09-17 00:00:00+00:00,1,29.95,13.80,132481149,6
223188,199998,2012-09-17 00:00:00+00:00,1,29.95,13.80,132481154,1
223189,199999,2012-09-17 00:00:00+00:00,1,179.95,109.99,132481165,1
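
The solution above uses `transform`, which keeps every row and repeats the group result on each of them. As a minimal sketch on invented toy data, the difference to a plain aggregation with `size()` (one row per customer) looks like this:

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({"Customer": [1, 1, 2], "PurchAmount": [10.0, 5.0, 20.0]})

# transform("count") returns one value per original row, so it can be
# assigned straight back as a new column.
toy["Count"] = toy.groupby("Customer")["Customer"].transform("count")

# size() returns one value per group (here: one row per customer).
per_customer = toy.groupby("Customer").size().reset_index(name="Count")

print(toy)
print(per_customer)
```

Which form is preferable depends on whether the count is needed next to each transaction or as a separate summary table.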


###21 - Advanced aggregating techniques

1. Aggregate the purchase amount (sum) of all transactions per customer on a yearly basis for the years 2007 and 2008.

In [33]:
myDataRelevantYears = myData[myData["TransDate"].dt.year.isin([2007, 2008])]
result = myDataRelevantYears.groupby(["Customer", myDataRelevantYears["TransDate"].dt.year])["PurchAmount"].sum().reset_index()
result.rename(columns={"TransDate": "Year", "PurchAmount": "TotalPurchaseAmount"}, inplace=True)
result

Unnamed: 0,Customer,Year,TotalPurchaseAmount
0,100032,2008,139.90
1,100034,2007,249.85
2,100056,2007,59.95
3,100064,2008,149.95
4,100096,2008,69.90
...,...,...,...
36068,187644,2008,59.95
36069,187645,2008,124.90
36070,187646,2008,149.85
36071,187647,2008,119.95


In [42]:
subset_2007_2008 = myData[myData["TransDate"].dt.year.isin([2007, 2008])]

In [44]:
# Derive the year from the subset itself rather than from the full myData frame.
subset_2007_2008.groupby([subset_2007_2008["TransDate"].dt.year, "Customer"]).agg(Size=("PurchAmount", "sum")).reset_index()

Unnamed: 0,TransDate,Customer,Size
0,2007,100034,249.85
1,2007,100056,59.95
2,2007,100176,69.95
3,2007,100179,699.90
4,2007,100183,99.95
...,...,...,...
36068,2008,187644,59.95
36069,2008,187645,124.90
36070,2008,187646,149.85
36071,2008,187647,119.95
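
Both cells above group by a key that is not a column of the frame but a Series derived from `TransDate`. As an alternative sketch, the year can first be stored as an explicit column, which some readers may find easier to follow. The toy rows and the intermediate names `with_year` and `relevant` are invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "Customer": [1, 1, 2, 2],
    "TransDate": pd.to_datetime(
        ["2006-12-31", "2007-02-01", "2007-05-05", "2008-01-01"], utc=True
    ),
    "PurchAmount": [10.0, 5.0, 20.0, 7.5],
})

# Step 1: make the year an ordinary column.
with_year = toy.assign(Year=toy["TransDate"].dt.year)

# Step 2: keep only the years of interest.
relevant = with_year[with_year["Year"].isin([2007, 2008])]

# Step 3: sum the purchase amount per customer and year.
yearly = (
    relevant.groupby(["Customer", "Year"])
    .agg(TotalPurchaseAmount=("PurchAmount", "sum"))
    .reset_index()
)
print(yearly)
```

The 2006 transaction is dropped by the filter, so customer 1 only appears with the 2007 amount.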


2. How many customers spent more than $50 in total between 2008 and 2009?


In [34]:
customer_purchase = myDataRelevantYears.groupby("Customer")["PurchAmount"].sum()
num_customers = customer_purchase[customer_purchase > 50].count()
print(num_customers)

26265
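
The cell above reuses `myDataRelevantYears`, which was filtered to 2007 and 2008 in the previous exercise, so the printed count refers to that period. A minimal sketch of the same computation with the years passed in explicitly (here 2008 and 2009, as named in the question) is shown below on invented toy data. The helper name `count_big_spenders` and all values are hypothetical, and no result for the real file is claimed.

```python
import pandas as pd

def count_big_spenders(df, years, threshold):
    """Count customers whose summed PurchAmount in `years` exceeds `threshold`."""
    in_period = df[df["TransDate"].dt.year.isin(years)]
    totals = in_period.groupby("Customer")["PurchAmount"].sum()
    return int((totals > threshold).sum())

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "Customer": [1, 1, 2, 3],
    "TransDate": pd.to_datetime(
        ["2008-01-05", "2009-06-01", "2008-03-10", "2007-03-10"], utc=True
    ),
    "PurchAmount": [30.0, 30.0, 40.0, 99.0],
})

n = count_big_spenders(toy, years=[2008, 2009], threshold=50)
print(n)
```

On the real data, the same call would be `count_big_spenders(myData, years=[2008, 2009], threshold=50)`.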


###22 - Combined select-aggregate operations

1. Add a column to `myData` with the total number of purchases per customer.



In [39]:
myData["TotalPurchasesPerCustomer"] = myData.groupby("Customer")["TransID"].transform("count")
myData

Unnamed: 0,Customer,TransDate,Quantity,PurchAmount,Cost,TransID,Count,TotalPurchasesPerCustomer
0,149332,2005-11-15 00:00:00+00:00,1,199.95,107.00,127998739,3,3
1,172951,2008-08-29 00:00:00+00:00,1,199.95,108.00,128888288,4,4
2,120621,2007-10-19 00:00:00+00:00,1,99.95,49.00,125375247,1,1
3,149236,2005-11-14 00:00:00+00:00,1,39.95,18.95,127996226,2,2
4,149236,2007-06-12 00:00:00+00:00,1,79.95,35.00,128670302,2,2
...,...,...,...,...,...,...,...,...
223186,199997,2012-09-17 00:00:00+00:00,1,29.95,13.80,132481149,6,6
223187,199997,2012-09-17 00:00:00+00:00,1,29.95,13.80,132481149,6,6
223188,199998,2012-09-17 00:00:00+00:00,1,29.95,13.80,132481154,1,1
223189,199999,2012-09-17 00:00:00+00:00,1,179.95,109.99,132481165,1,1
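
In the output above, rows 223186 and 223187 share the same `TransID`, so counting rows and counting distinct transactions can give different numbers. Which of the two is meant by "number of purchases" is not specified in the exercise. The sketch below, on invented toy data with hypothetical column names, only illustrates the difference between `transform("count")` and `transform("nunique")`.

```python
import pandas as pd

# Toy data (hypothetical values): customer 1 has two rows with the same TransID.
toy = pd.DataFrame({
    "Customer": [1, 1, 1, 2],
    "TransID": [500, 500, 501, 502],
})

# Number of rows per customer (what the solution above computes).
toy["RowsPerCustomer"] = toy.groupby("Customer")["TransID"].transform("count")

# Number of distinct transaction IDs per customer.
toy["DistinctTransPerCustomer"] = toy.groupby("Customer")["TransID"].transform("nunique")

print(toy)
```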


2. Create a lead shifted variable for `TransDate` (by one period) by customer.

In [40]:
myData["YearLag"] = myData.groupby("Customer")["TransDate"].shift(periods=1)

In [41]:
myData

Unnamed: 0,Customer,TransDate,Quantity,PurchAmount,Cost,TransID,Count,TotalPurchasesPerCustomer,YearLag
0,149332,2005-11-15 00:00:00+00:00,1,199.95,107.00,127998739,3,3,NaT
1,172951,2008-08-29 00:00:00+00:00,1,199.95,108.00,128888288,4,4,NaT
2,120621,2007-10-19 00:00:00+00:00,1,99.95,49.00,125375247,1,1,NaT
3,149236,2005-11-14 00:00:00+00:00,1,39.95,18.95,127996226,2,2,NaT
4,149236,2007-06-12 00:00:00+00:00,1,79.95,35.00,128670302,2,2,2005-11-14 00:00:00+00:00
...,...,...,...,...,...,...,...,...,...
223186,199997,2012-09-17 00:00:00+00:00,1,29.95,13.80,132481149,6,6,2012-09-17 00:00:00+00:00
223187,199997,2012-09-17 00:00:00+00:00,1,29.95,13.80,132481149,6,6,2012-09-17 00:00:00+00:00
223188,199998,2012-09-17 00:00:00+00:00,1,29.95,13.80,132481154,1,1,NaT
223189,199999,2012-09-17 00:00:00+00:00,1,179.95,109.99,132481165,1,1,NaT
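
`shift(periods=1)` moves each value down one row within its group, which yields the previous transaction date (a lag). A lead, that is the next transaction date, uses a negative period. Either direction is only meaningful if the rows are ordered by date within each customer, so the sketch below sorts explicitly first. The data and the column names `PrevDate` and `NextDate` are invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical values), deliberately not in date order.
toy = pd.DataFrame({
    "Customer": [1, 2, 1, 1],
    "TransDate": pd.to_datetime(
        ["2007-06-12", "2008-01-01", "2005-11-14", "2009-02-02"], utc=True
    ),
})

# Order rows by date within each customer before shifting.
toy = toy.sort_values(["Customer", "TransDate"]).reset_index(drop=True)

# Lag: date of the customer's previous transaction (NaT for the first one).
toy["PrevDate"] = toy.groupby("Customer")["TransDate"].shift(periods=1)

# Lead: date of the customer's next transaction (NaT for the last one).
toy["NextDate"] = toy.groupby("Customer")["TransDate"].shift(periods=-1)

print(toy)
```

Customers with a single transaction get `NaT` in both columns, because there is no earlier or later row to take the date from.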
