# Summer Activities

**Example data set — summer activities**

I will demonstrate the pandas tricks on a made-up data set containing people's names, their summer activities, the corresponding timestamps, and the money they spent on each activity. A person can do multiple activities at different timestamps.

In [74]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
    ['Chandler Bing','party','2017-08-04 08:00:00',51],
    ['Chandler Bing','party','2017-08-04 13:00:00',60],
    ['Chandler Bing','party','2017-08-04 15:00:00',59],
    ['Harry Kane','football','2017-08-04 13:00:00',80],
    ['Harry Kane','party','2017-08-04 11:00:00',90],
    ['Harry Kane','party','2017-08-04 07:00:00',68],
    ['John Doe','beach','2017-08-04 07:00:00',63],
    ['John Doe','beach','2017-08-04 12:00:00',61],
    ['John Doe','beach','2017-08-04 14:00:00',65],
    ['Joey Tribbiani','party','2017-08-04 09:00:00',54],
    ['Joey Tribbiani','party','2017-08-04 10:00:00',67],
    ['Joey Tribbiani','football','2017-08-04 08:00:00',84],
    ['Monica Geller','travel','2017-08-04 07:00:00',90],
    ['Monica Geller','travel','2017-08-04 08:00:00',96],
    ['Monica Geller','travel','2017-08-04 09:00:00',74],
    ['Phoebe Buffey','travel','2017-08-04 10:00:00',52],
    ['Phoebe Buffey','travel','2017-08-04 12:00:00',84],
    ['Phoebe Buffey','football','2017-08-04 15:00:00',58],
    ['Ross Geller','party','2017-08-04 09:00:00',96],
    ['Ross Geller','party','2017-08-04 11:00:00',81],
    ['Ross Geller','travel','2017-08-04 14:00:00',60]
    ],
    columns=['name','activity','timestamp','money_spent'])

df['timestamp'] = pd.to_datetime(df['timestamp'])
df

| | name | activity | timestamp | money_spent |
|---|---|---|---|---|
| 0 | Chandler Bing | party | 2017-08-04 08:00:00 | 51 |
| 1 | Chandler Bing | party | 2017-08-04 13:00:00 | 60 |
| 2 | Chandler Bing | party | 2017-08-04 15:00:00 | 59 |
| 3 | Harry Kane | football | 2017-08-04 13:00:00 | 80 |
| 4 | Harry Kane | party | 2017-08-04 11:00:00 | 90 |
| 5 | Harry Kane | party | 2017-08-04 07:00:00 | 68 |
| 6 | John Doe | beach | 2017-08-04 07:00:00 | 63 |
| 7 | John Doe | beach | 2017-08-04 12:00:00 | 61 |
| 8 | John Doe | beach | 2017-08-04 14:00:00 | 65 |
| 9 | Joey Tribbiani | party | 2017-08-04 09:00:00 | 54 |


**Let’s say our goal is to predict, based on the given data set, who is the most fun person in the data set :).**

## 1. String commands

In [75]:
df[['first_name', 'last_name']] = df['name'].str.split(" ", n=1, expand=True)

In [76]:
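# reorder the columns (money_spent is kept because the cumsum example later needs it)
df = df.reindex(columns=['name','first_name', 'last_name', 'activity', 'timestamp', 'money_spent'])
df

| | name | first_name | last_name | activity | timestamp | money_spent |
|---|---|---|---|---|---|---|
| 0 | Chandler Bing | Chandler | Bing | party | 2017-08-04 08:00:00 | 51 |
| 1 | Chandler Bing | Chandler | Bing | party | 2017-08-04 13:00:00 | 60 |
| 2 | Chandler Bing | Chandler | Bing | party | 2017-08-04 15:00:00 | 59 |
| 3 | Harry Kane | Harry | Kane | football | 2017-08-04 13:00:00 | 80 |
| 4 | Harry Kane | Harry | Kane | party | 2017-08-04 11:00:00 | 90 |
| 5 | Harry Kane | Harry | Kane | party | 2017-08-04 07:00:00 | 68 |
| 6 | John Doe | John | Doe | beach | 2017-08-04 07:00:00 | 63 |
| 7 | John Doe | John | Doe | beach | 2017-08-04 12:00:00 | 61 |
| 8 | John Doe | John | Doe | beach | 2017-08-04 14:00:00 | 65 |
| 9 | Joey Tribbiani | Joey | Tribbiani | party | 2017-08-04 09:00:00 | 54 |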


In [77]:
del df['first_name']
del df['last_name']
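
The split above relies on the vectorized `.str` accessor. Here is a minimal sketch of what it does, using a small hypothetical Series `names` (an assumption for illustration, not part of the data set above):

```python
import pandas as pd

# Hypothetical sample, a subset of the names used in this post.
names = pd.Series(["Chandler Bing", "Harry Kane", "John Doe"], name="name")

# n=1 splits on the first space only; expand=True returns a DataFrame
# with one column per piece instead of a Series of lists.
parts = names.str.split(" ", n=1, expand=True)
parts.columns = ["first_name", "last_name"]

# Other vectorized string helpers live under the same .str accessor.
upper_last = parts["last_name"].str.upper()

print(parts)
print(upper_last.tolist())
```

Using `n=1` keeps everything after the first space together, so a multi-word last name would stay in a single column.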

## 2. Group by and value_counts

In [80]:
df.groupby('name')['activity'].value_counts()

name            activity
Chandler Bing   party       3
Harry Kane      party       2
                football    1
Joey Tribbiani  party       2
                football    1
John Doe        beach       3
Monica Geller   travel      3
Phoebe Buffey   travel      2
                football    1
Ross Geller     party       2
                travel      1
Name: activity, dtype: int64

The result is indexed by a [multi index](https://pandas.pydata.org/pandas-docs/stable/advanced.html), a valuable pandas feature that allows several levels of index hierarchy. In this case the person's name is level 0 of the index and the activity is level 1.
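
To make the two index levels concrete, here is a small sketch of how such a result can be queried. The frame `df_small` is a hypothetical subset of the data, assumed only for illustration:

```python
import pandas as pd

# Hypothetical subset of the summer activities data.
df_small = pd.DataFrame({
    "name": ["Harry Kane", "Harry Kane", "Harry Kane", "John Doe", "John Doe"],
    "activity": ["party", "party", "football", "beach", "beach"],
})

counts = df_small.groupby("name")["activity"].value_counts()

# Select on level 0 only: all activity counts of one person.
harry = counts.loc["Harry Kane"]

# Select on both levels with a tuple: a single count.
harry_party = counts.loc[("Harry Kane", "party")]

# Cross-section on level 1 (the activity level): one activity, all people.
party_by_person = counts.xs("party", level=1)

print(harry)
print(harry_party)
print(party_by_person)
```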

## 3. Unstack

We can also create features for the summer activity counts per person by applying **unstack** to the code above. Unstack turns rows into columns, so we get the activity counts as features. **By doing unstack we are transforming the last level of the index into columns.** Each activity value becomes a column of the dataframe, and when a person has not done a certain activity, that feature gets a NaN value. Fillna fills all these missing values (activities the person did not do) with 0.

In [81]:
df.groupby('name')['activity'].value_counts().unstack().fillna(0)

| name | beach | football | party | travel |
|---|---|---|---|---|
| Chandler Bing | 0.0 | 0.0 | 3.0 | 0.0 |
| Harry Kane | 0.0 | 1.0 | 2.0 | 0.0 |
| Joey Tribbiani | 0.0 | 1.0 | 2.0 | 0.0 |
| John Doe | 3.0 | 0.0 | 0.0 | 0.0 |
| Monica Geller | 0.0 | 0.0 | 0.0 | 3.0 |
| Phoebe Buffey | 0.0 | 1.0 | 0.0 | 2.0 |
| Ross Geller | 0.0 | 0.0 | 2.0 | 1.0 |
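
The counts above come out as floats because NaN forces a float column before fillna runs. One possible variation, sketched here on a hypothetical subset `df_small`, is to pass `fill_value=0` to unstack so the counts stay integers; `pd.crosstab` is shown as an alternative that should give the same table:

```python
import pandas as pd

# Hypothetical subset of the summer activities data.
df_small = pd.DataFrame({
    "name": ["Harry Kane", "Harry Kane", "Harry Kane", "John Doe", "John Doe"],
    "activity": ["party", "party", "football", "beach", "beach"],
})

# fill_value=0 fills the missing name/activity pairs during the unstack,
# so no NaN is created and the integer dtype is preserved.
features = (
    df_small.groupby("name")["activity"]
    .value_counts()
    .unstack(fill_value=0)
)

# crosstab counts name/activity combinations directly.
features_crosstab = pd.crosstab(df_small["name"], df_small["activity"])

print(features)
print(features_crosstab)
```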


## 3. groupby, diff, shift, and loc + A great tip for efficiency

Knowing the time differences between a person's activities can be quite interesting for predicting who is the most fun person. How long did a person hang out at a party? How long did he/she hang out at the beach? Depending on the activity, this might be useful for us as a feature.

The most straightforward way to calculate the time differences is to group by the person's name and then calculate the difference on the timestamp field using diff():

In [82]:
df = df.sort_values(by=['name','timestamp'])
df['time_diff'] = df.groupby('name')['timestamp'].diff()

If you have a lot of data and you want to save some time (this can be about 10 times faster, depending on your data size), you can skip the groupby: do the diff right after sorting the data, and then blank out the first row of each person, whose difference is computed against the previous person's last record and is therefore not relevant.

In [83]:
df = df.sort_values(by=['name','timestamp'])
df['time_diff'] = df['timestamp'].diff()
df.loc[df.name != df.name.shift(), 'time_diff'] = None

BTW, the useful .shift command shifts the whole column down by one row, so comparing df.name != df.name.shift() shows on which rows the name changes.

And the .loc command is the recommended way to set values of a column for specific rows.
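
As a sanity check, the two approaches can be compared on a small example. This is a sketch on a hypothetical subset `df_small`, not a benchmark:

```python
import pandas as pd

# Hypothetical subset of the data, sorted by person and time.
df_small = pd.DataFrame({
    "name": ["Harry Kane", "Harry Kane", "Harry Kane", "John Doe", "John Doe"],
    "timestamp": pd.to_datetime([
        "2017-08-04 13:00:00", "2017-08-04 11:00:00", "2017-08-04 07:00:00",
        "2017-08-04 07:00:00", "2017-08-04 12:00:00",
    ]),
}).sort_values(by=["name", "timestamp"])

# Approach 1: diff within each person.
grouped = df_small.groupby("name")["timestamp"].diff()

# Approach 2: diff over the whole sorted column, then blank the rows
# where the name differs from the previous row (first row per person).
fast = df_small["timestamp"].diff()
first_row_of_person = df_small["name"] != df_small["name"].shift()
fast[first_row_of_person] = pd.NaT

print(grouped.equals(fast))
```

The second approach only works when the frame is sorted by person first, since it assumes all rows of a person are adjacent.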

To change the time_diff to seconds units:

In [84]:
df['time_diff'] = df.time_diff.dt.total_seconds()

To get the duration per row (the time from each record until the same person's next record), shift time_diff up by one row:

In [86]:
df['row_duration'] = df.time_diff.shift(-1)
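
To see what the upward shift does, here is a sketch on a hypothetical subset `df_small`. It works without a groupby because the first time_diff of each person is missing, so the last row of the previous person receives a missing duration:

```python
import pandas as pd

# Hypothetical subset of the data, already sorted by person and time.
df_small = pd.DataFrame({
    "name": ["Harry Kane", "Harry Kane", "Harry Kane", "John Doe", "John Doe"],
    "timestamp": pd.to_datetime([
        "2017-08-04 07:00:00", "2017-08-04 11:00:00", "2017-08-04 13:00:00",
        "2017-08-04 07:00:00", "2017-08-04 12:00:00",
    ]),
})

# Seconds since the same person's previous record (NaN on the first row).
df_small["time_diff"] = (
    df_small.groupby("name")["timestamp"].diff().dt.total_seconds()
)

# Shift up by one row: seconds until the same person's next record.
df_small["row_duration"] = df_small["time_diff"].shift(-1)

print(df_small)
```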

## 4. Cumcount and Cumsum

These are two really cool groupby methods which can help you with many things. Cumcount creates a cumulative count. For example, we can take only the second activity of each person by grouping by the person's name and then applying cumcount, which numbers each person's activities in order, starting from 0. Then we can take only the second activity of each person by testing ==1 (or the third activity by testing ==2) and applying the resulting mask to the original sorted dataframe.

In [88]:
df = df.sort_values(by=['name','timestamp'])
df2 = df[df.groupby('name').cumcount()==1]

In [89]:
df = df.sort_values(by=['name','timestamp'])
df2 = df[df.groupby('name').cumcount()==2]
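
The intermediate counter is easier to follow when printed. A small sketch on a hypothetical subset `df_small`:

```python
import pandas as pd

# Hypothetical subset of the data, already sorted by person and time.
df_small = pd.DataFrame({
    "name": ["Harry Kane", "Harry Kane", "Harry Kane", "John Doe", "John Doe"],
    "activity": ["party", "party", "football", "beach", "beach"],
    "timestamp": pd.to_datetime([
        "2017-08-04 07:00:00", "2017-08-04 11:00:00", "2017-08-04 13:00:00",
        "2017-08-04 07:00:00", "2017-08-04 12:00:00",
    ]),
})

# cumcount numbers the rows of each group 0, 1, 2, ... in their current order.
order = df_small.groupby("name").cumcount()

second = df_small[order == 1]  # second record of each person
third = df_small[order == 2]   # third record, only for people who have one

print(order.tolist())
print(second)
print(third)
```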

Cumsum is just a cumulative sum of a numeric column. For example, the data set records the money a person spent on each activity in the money_spent column, so you can compute the running total of money spent by each person over the day (this needs money_spent to still be present in df, so keep it when reordering the columns):

In [90]:
df = df.sort_values(by=['name','timestamp'])
df['money_spent_so_far'] = df.groupby('name')['money_spent'].cumsum()
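
Here is a sketch of the running total on a hypothetical subset `df_small`, with values taken from the data set above:

```python
import pandas as pd

# Hypothetical subset of the data, already sorted by person and time.
df_small = pd.DataFrame({
    "name": ["Harry Kane"] * 3 + ["John Doe"] * 3,
    "timestamp": pd.to_datetime([
        "2017-08-04 07:00:00", "2017-08-04 11:00:00", "2017-08-04 13:00:00",
        "2017-08-04 07:00:00", "2017-08-04 12:00:00", "2017-08-04 14:00:00",
    ]),
    "money_spent": [68, 90, 80, 63, 61, 65],
})

# The running total restarts for every person.
df_small["money_spent_so_far"] = (
    df_small.groupby("name")["money_spent"].cumsum()
)

print(df_small)
```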

## 5. groupby, max, min for measuring the duration of activities

In the groupby/diff section above we wanted to know how much time each person spent on each activity. But we overlooked that sometimes we get multiple records for an activity that are actually a continuation of the same activity. So to get the actual activity duration, we should measure the time from the first consecutive appearance of the activity to the last. For that we need to mark the changes in activity and label each row with an activity number, using the .shift and .cumsum commands we saw before. A new activity starts when the activity changes or the person changes.

In [None]:
df['activity_change'] = (df.activity!=df.activity.shift()) | (df.name!=df.name.shift())

Then we will calculate the activity number for each row by grouping per user and applying the glorious .cumsum:

In [None]:
df['activity_num'] = df.groupby('name')['activity_change'].cumsum()
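
The change flag and the resulting activity number can be traced on a hypothetical subset `df_small`:

```python
import pandas as pd

# Hypothetical subset of the data, already sorted by person and time.
df_small = pd.DataFrame({
    "name": ["Harry Kane"] * 3 + ["John Doe"] * 3,
    "activity": ["party", "party", "football", "beach", "beach", "beach"],
})

# True on every row that starts a new activity or a new person.
df_small["activity_change"] = (
    (df_small["activity"] != df_small["activity"].shift())
    | (df_small["name"] != df_small["name"].shift())
)

# Summing the flags per person numbers each consecutive run of an activity.
df_small["activity_num"] = (
    df_small.groupby("name")["activity_change"].cumsum()
)

print(df_small)
```

Harry Kane's two party records share activity number 1 and his football record gets number 2, while all of John Doe's beach records share number 1.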

Now we can calculate the duration of each activity by grouping by name and activity number (and by activity, which doesn't really change the grouping but keeps the activity name in the result) and summing the per-row durations stored in row_duration:

In [91]:
activity_duration = df.groupby(['name','activity_num','activity'])['row_duration'].sum()

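If the per-row durations are still timedeltas (that is, you skipped the conversion of time_diff to seconds earlier), the result is a timedelta as well, and you can get the activity duration in seconds using .dt.total_seconds. If you already converted to seconds, skip this step, because .dt is not available on a numeric column:

In [92]:
# only needed when the durations are timedeltas rather than seconds
activity_duration = activity_duration.dt.total_seconds()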

Then you can get the maximal or minimal activity duration for each person (or the median or mean) using a command like this:

In [93]:
# group on the name level of the index so only the durations are aggregated
activity_duration = activity_duration.groupby(level='name').max()
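
Putting the steps of this section together, here is an end-to-end sketch on a hypothetical subset `df_small`. It assumes the per-row durations are already in seconds, so no timedelta conversion is needed:

```python
import pandas as pd

# Hypothetical subset of the data, already sorted by person and time.
df_small = pd.DataFrame({
    "name": ["Harry Kane"] * 3 + ["John Doe"] * 3,
    "activity": ["party", "party", "football", "beach", "beach", "beach"],
    "timestamp": pd.to_datetime([
        "2017-08-04 07:00:00", "2017-08-04 11:00:00", "2017-08-04 13:00:00",
        "2017-08-04 07:00:00", "2017-08-04 12:00:00", "2017-08-04 14:00:00",
    ]),
})

# Seconds until the same person's next record (NaN on the last record).
time_diff = df_small.groupby("name")["timestamp"].diff().dt.total_seconds()
df_small["row_duration"] = time_diff.shift(-1)

# Number the consecutive runs of the same activity per person.
df_small["activity_change"] = (
    (df_small["activity"] != df_small["activity"].shift())
    | (df_small["name"] != df_small["name"].shift())
)
df_small["activity_num"] = df_small.groupby("name")["activity_change"].cumsum()

# Total seconds per activity run; sum() skips the missing last duration.
activity_duration = (
    df_small.groupby(["name", "activity_num", "activity"])["row_duration"].sum()
)

# Longest activity run per person.
longest = activity_duration.groupby(level="name").max()

print(activity_duration)
print(longest)
```

Note that each run's duration includes the time until the person's next record starts, and a person's final run has no following record, so its last row contributes nothing to the sum.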

## Summary

This was a short Pandas tour using a made-up summer activities data set. I hope you've learned something and enjoyed it. Good luck with your next Pandas project and enjoy the summer!

https://towardsdatascience.com/pandas-tips-and-tricks-33bcc8a40bb9