Adapted from a tutorial by [Shane Lynn](https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/)

# PHONE LOGS WITH GROUPBY

In [None]:
import pandas as pd
data = pd.read_csv("https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2015/06/phone_data.csv")

In [None]:
data.head()

[dateutil](https://dateutil.readthedocs.io/en/stable/) is probably already included in your Python distribution, since pandas depends on it.  If not, you'll need to install it (for example with `pip install python-dateutil`).

In [None]:
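import dateutil.parser  # import the parser submodule explicitly; a bare "import dateutil" is not guaranteed to load it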

In [None]:
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)
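The `dayfirst=True` argument matters because an ambiguous date string can be read two ways. Here is a minimal sketch of the difference, using a made-up sample string (not taken from the phone log):

```python
from dateutil import parser

# Hypothetical sample string, written in day/month/year order
raw = "05/03/15 14:30"

day_first = parser.parse(raw, dayfirst=True)  # read as 5 March 2015
month_first = parser.parse(raw)               # default reads it as 3 May 2015

print(day_first.date(), month_first.date())
```

Without `dayfirst=True`, any entry whose day is 12 or lower would silently land in the wrong month.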

In [None]:
data.head()

In [None]:
# How many non-null entries are in the item column?
data['item'].count()

In [None]:
# What was the longest phone call / data entry?
data['duration'].max()

In [None]:
# What was the average length of a phone call? (the filter keeps only rows where item is 'call')
data['duration'][data['item'] == 'call'].mean()

In [None]:
# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()

In [None]:
# How many entries are there for each month?
data['month'].value_counts()

In [None]:
# Number of non-null unique network entries
data['network'].nunique()

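Here is a quick reference table of common summary functions. Older pandas versions also accepted an optional `level` parameter on each of them, which applied only if the object had a hierarchical index; pandas 2.0 and later drop that parameter in favor of `groupby(level=...)`.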

| Function | Description                                     |
|----------|-------------------------------------------------|
| count    | Number of non-null observations                 |
| sum      | Sum of values                                   |
| mean     | Mean of values                                  |
| mad      | Mean absolute deviation (removed in pandas 2.0) |
| median   | Arithmetic median of values                     |
| min      | Minimum                                         |
| max      | Maximum                                         |
| mode     | Mode                                            |
| abs      | Absolute value                                  |
| prod     | Product of values                               |
| std      | Unbiased standard deviation                     |
| var      | Unbiased variance                               |
| sem      | Unbiased standard error of the mean             |
| skew     | Unbiased skewness (3rd moment)                  |
| kurt     | Unbiased kurtosis (4th moment)                  |
| quantile | Sample quantile (value at %)                    |
| cumsum   | Cumulative sum                                  |
| cumprod  | Cumulative product                              |
| cummax   | Cumulative maximum                              |
| cummin   | Cumulative minimum                              |
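To get a feel for a few of these functions, here is a small sketch on a made-up Series (the values are hypothetical and have nothing to do with the phone log):

```python
import pandas as pd

# Hypothetical values, chosen so the results are easy to check by hand
s = pd.Series([1.0, 2.0, 3.0, 4.0])

print(s.count())            # number of non-null observations
print(s.mean())             # arithmetic mean
print(s.median())           # middle of the sorted values
print(s.quantile(0.5))      # the 50% quantile, same as the median here
print(s.cumsum().tolist())  # running total
print(s.sem())              # standard error of the mean: std / sqrt(n)
```

Reductions like `mean` return a single value, while the cumulative functions (`cumsum`, `cumprod`, `cummax`, `cummin`) return a Series of the same length as the input.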

Note that some NumPy functions, such as `np.mean`, `np.std`, and `np.sum`, happen to exclude NAs by default when they are given a Series as input.
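A quick way to see the NA handling is to compare the same (made-up) values as a pandas Series and as a plain NumPy array. This is only a sketch; the numbers are hypothetical:

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]  # hypothetical values with one missing entry

series_mean = np.mean(pd.Series(values))  # the Series skips the NaN
array_mean = np.mean(np.array(values))    # a plain array propagates the NaN
strict_mean = pd.Series(values).mean(skipna=False)  # opt out of skipping

print(series_mean, array_mean, strict_mean)
```

If you actually want a missing value to poison the result, pass `skipna=False` to the pandas method.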

In [None]:
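# The group labels, one per month. Grouping by the plain string 'month' keeps the keys as plain strings.
data.groupby('month').groups.keys()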

In [None]:
# Number of rows that fall in the 2014-11 group
len(data.groupby('month').groups['2014-11'])
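The `.groups` attribute behaves like a dictionary that maps each group label to the index labels of the rows in that group. Here is a minimal sketch on a tiny made-up frame (the column names mirror the phone log, but the rows are hypothetical):

```python
import pandas as pd

# Hypothetical three-row stand-in for the phone log
df = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12"],
    "duration": [10.0, 20.0, 30.0],
})

groups = df.groupby("month").groups
print(list(groups.keys()))      # the group labels
print(list(groups["2014-11"]))  # row labels belonging to 2014-11

# size() gives the row count of every group in one step
sizes = df.groupby("month").size()
print(sizes)
```

Taking `len()` of one entry in `.groups` works for a single group; `size()` is usually the shorter route when you want the counts for all groups.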

In [None]:
data.dtypes

In [None]:
data.groupby('month').first()

Once you have a GroupBy object, you'll usually pick out one or more columns and apply an aggregating function to them.  Here we want to sum the numbers in the duration column within each month.

In [None]:
data.groupby('month')['duration'].sum()

In [None]:
data.groupby('month')['date'].count() # count the number of entries (of any item type) in each month

In [None]:
# filter, groupby, pick column, and sum
data[data['item'] == 'call'].groupby('network')['duration'].sum()

In [None]:
# group by month and item, select the date column, then count the entries in each group
data.groupby(['month', 'item'])['date'].count()

In [None]:
# group by network_type within month, then count the entries in each group
data.groupby(['month', 'network_type'])['date'].count()

In [None]:
# as_index=False keeps 'month' as a regular column instead of making it the index.
# agg() accepts a dict of column: operation items.
data.groupby('month', as_index=False).agg({"duration": "sum"})
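To see what `as_index=False` changes, compare the two results side by side. This is a minimal sketch on a made-up frame (hypothetical rows, same column names as the phone log):

```python
import pandas as pd

# Hypothetical stand-in for the phone log
df = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12"],
    "duration": [10.0, 20.0, 30.0],
})

# Default: the group labels become the index of the result
indexed = df.groupby("month").agg({"duration": "sum"})

# as_index=False: 'month' stays an ordinary column with a default integer index
flat = df.groupby("month", as_index=False).agg({"duration": "sum"})

print(indexed)
print(flat)
```

Calling `reset_index()` on the default result should give you the same flat layout, so which one you use is mostly a matter of taste.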

In [None]:
# Group the data frame by month and item and extract a number of stats from each group
data.groupby(['month', 'item']).agg({'duration':"sum",      # find the sum of the durations for each group
                                     'network_type': "count", # find the number of network type entries
                                     'date': 'first'})    # get the first date per group

In [None]:
# Group the data frame by month and item and extract a number of stats from each group
# You have the ability to create a multi-index by asking for multiple aggregates per column
data.groupby(['month', 'item']).agg({'duration': ["min", "max", "sum"],      # find the min, max, and sum of the duration column
                                     'network_type': "count", # find the number of network type entries
                                     'date': ['min', 'first', 'nunique']})    # get the min, first, and number of unique dates per group

In [None]:
grouped = data.groupby(['month', 'item']).agg({'duration': ["min", "max", "sum"],      # find the min, max, and sum of the duration column
                                     'network_type': "count", # find the number of network type entries
                                     'date': ['min', 'first', 'nunique']})    # get the min, first, and number of unique dates per group

In [None]:
grouped.index

In [None]:
grouped.columns
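Asking for several aggregates per column gives the result two-level column labels, which can be awkward to work with later. One common way to flatten them is to join the two levels into a single string. The sketch below uses a tiny made-up frame, and the `duration_min`-style names are just one possible naming choice:

```python
import pandas as pd

# Hypothetical stand-in for the phone log
df = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12"],
    "item": ["call", "call", "sms"],
    "duration": [10.0, 20.0, 1.0],
})

grouped = df.groupby(["month", "item"]).agg({"duration": ["min", "max", "sum"]})
print(grouped.columns.tolist())  # tuples such as ('duration', 'min')

# Join each (column, aggregate) tuple into one label, then turn the
# month/item index levels back into ordinary columns
flat = grouped.copy()
flat.columns = ["_".join(col) for col in flat.columns]
flat = flat.reset_index()
print(flat.columns.tolist())
```

Individual cells of the un-flattened result can still be reached with tuples, for example `grouped.loc[("2014-11", "call"), ("duration", "sum")]`.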

# LAB 3

### Show DataFrame of [min, max, mean] for duration of all types of communication, within each month