# Introduction to the Data Science pipeline
### Analyzing the tennis data set

#### Structure of the data file

Open the file `tennis-data.txt` in an editor and peruse its contents. 

You will see an initial set of lines providing meta-information about the data, then 14 lines of actual data (observations), and finally a few additional lines of text information.  The structure of the file is:

```
header
data
footer
```

The columns in the data are separated by whitespace.  Pandas provides functions for reading data in a very wide range of formats, skipping header and footer lines, transforming data types (e.g., string dates of the form 2017-01-25 to actual date types), etc.
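As a minimal sketch of these options (not the notebook's own code), the following reads a small in-memory stand-in for a file with header and footer lines.  The text in `raw`, the column names, and the variable `toy` are all made up for illustration; `read_csv` is combined with `io.StringIO` so the example runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical file contents: 2 header lines, 3 observations, 1 footer line.
raw = """\
toy file header
columns: date outlook play
2017-01-25 sunny no
2017-01-26 rainy yes
2017-01-27 overcast yes
end of data
"""

toy = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",             # columns separated by whitespace
    engine="python",        # skipfooter is only supported by the python engine
    skiprows=2,             # skip the two header lines
    skipfooter=1,           # skip the single footer line
    header=None,            # the remaining lines contain no column names
    names=["date", "outlook", "play"],
    parse_dates=["date"],   # convert the date strings to a datetime column
)
print(toy.dtypes)
```

Only the three observation lines end up in `toy`, and the `date` column holds datetime values rather than strings.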

#### Possible Questions

The data science pipeline is of the form:

```
Questions -> Wrangling -> Exploration -> Modeling -> Communication
```
Pandas can help us with the wrangling, exploration, and communication stages.  It also helps to set up the data structures needed for the modeling stage.

Given the simplicity of the tennis data set, there are not many descriptive statistical questions we can ask.  But some questions could be:

- What is the prior probability of playing tennis?
- What is the frequency count for the various categories in each feature?
- When the outlook is sunny, is the temperature always hot?

Think of a few more ...

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

Read the data set with `read_table` and use the notebook's support for interactive exploration to examine its named (keyword) parameters.  Use the methods `.head()` and `.tail()` to ensure that only the data lines (observations) are read and the header/footer lines are skipped.

You will find it convenient to use `skiprows` in conjunction with `.head()` and `skipfooter` in conjunction with `.tail()`.  I have given the needed values of 5 and 10 below.  I recommend you make these 0 and then experiment with `.head()` and `.tail()`.

#### Reading the data

In [2]:
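# The raw string avoids an invalid-escape warning for '\s';
# skipfooter is only supported by the python engine.
df = pd.read_table('tennis-data.txt', sep=r'\s+', engine='python',
                   skiprows=5, skipfooter=10)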
#df.head()
#df.tail()

You can check to see if all of the observations have been read by determining the length of the data frame, which should be 14.

In [3]:
len(df)

14

#### .dtypes, .info, .describe

We can get high-level information about a data frame with a few methods and attributes. `.dtypes` gives us the data type of each column.

In [4]:
df.dtypes

day            object
outlook        object
temperature    object
humidity       object
wind           object
playtennis     object
dtype: object

`.info()` gives us a few more details (number of non-null values, etc.).

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 6 columns):
day            14 non-null object
outlook        14 non-null object
temperature    14 non-null object
humidity       14 non-null object
wind           14 non-null object
playtennis     14 non-null object
dtypes: object(6)
memory usage: 752.0+ bytes


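`.describe()` gives us summary statistics.  The information returned by `describe` varies based on whether the column has numerical data or not.  The tennis data set only has non-numerical data (strings), so we get the count, the number of unique values, the most frequent value (`top`), and its frequency (`freq`):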

In [6]:
df.describe()

,day,outlook,temperature,humidity,wind,playtennis
count,14,14,14,14,14,14
unique,14,3,3,2,2,2
top,d14,rainy,mild,normal,weak,yes
freq,1,5,6,7,8,9


#### labels, index, columns, values

The rows of a data frame are identified by _labels_.  Unlike the primary keys of a database table, labels need not be unique.  The columns of a data frame are also identified by labels.  The sequence of labels used to identify all the rows is known as the **index** and the sequence of labels used to identify the columns is known as **columns** :-).  The labels of the index and the columns are shown in bold in the notebook.

By default, pandas sets the row labels (index) to be a sequence of integers starting from 0 and takes the column labels from the first line it reads (here, the line right after the skipped header rows).  Hence we get:

In [7]:
df.index

RangeIndex(start=0, stop=14, step=1)

In [8]:
df.columns

Index(['day', 'outlook', 'temperature', 'humidity', 'wind', 'playtennis'], dtype='object')

We usually want to give more meaningful names to row labels.  We can change the index of a data frame with the method `set_index`.  Keep in mind that most operations on data frames do not change the data frame.  Rather, a new one is created.  Sometimes we re-assign the new data frame to the existing variable (as below) and sometimes we assign it to a new variable so that we have access to both data frames.

In [9]:
df = df.set_index('day')
df.head()

day,outlook,temperature,humidity,wind,playtennis
d1,sunny,hot,high,weak,no
d2,sunny,hot,high,strong,no
d3,overcast,hot,high,weak,yes
d4,rainy,mild,high,weak,yes
d5,rainy,cool,normal,weak,yes


If we now ask for the index we get:

In [10]:
df.index

Index(['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9', 'd10', 'd11',
       'd12', 'd13', 'd14'],
      dtype='object', name='day')

We can change the index to any column we want with `set_index`, but in doing so we lose the existing index.

In [11]:
df.set_index('playtennis').head()

playtennis,outlook,temperature,humidity,wind
no,sunny,hot,high,weak
no,sunny,hot,high,strong
yes,overcast,hot,high,weak
yes,rainy,mild,high,weak
yes,rainy,cool,normal,weak
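Because labels need not be unique, using `playtennis` as the index is legal even though its values repeat.  The sketch below checks this with `Index.is_unique`; the frame `toy` is a hypothetical stand-in built from the five rows shown above, not the notebook's `df`.

```python
import pandas as pd

# Hypothetical stand-in for the first five observations of the tennis data.
toy = pd.DataFrame(
    {
        "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy"],
        "playtennis": ["no", "no", "yes", "yes", "yes"],
    },
    index=pd.Index(["d1", "d2", "d3", "d4", "d5"], name="day"),
)

by_play = toy.set_index("playtennis")

print(toy.index.is_unique)      # True: every day label occurs once
print(by_play.index.is_unique)  # False: 'no' and 'yes' each label several rows
```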


But as is the case most of the time, `set_index` creates a new data frame, i.e., the original data frame `df` is not changed.

In [12]:
df.head()

day,outlook,temperature,humidity,wind,playtennis
d1,sunny,hot,high,weak,no
d2,sunny,hot,high,strong,no
d3,overcast,hot,high,weak,yes
d4,rainy,mild,high,weak,yes
d5,rainy,cool,normal,weak,yes


If we want to go back to the default index of a sequence of numbers, we use `reset_index`.  The old index then becomes an ordinary column again.

In [13]:
d2 = df.head()
d2.head()

day,outlook,temperature,humidity,wind,playtennis
d1,sunny,hot,high,weak,no
d2,sunny,hot,high,strong,no
d3,overcast,hot,high,weak,yes
d4,rainy,mild,high,weak,yes
d5,rainy,cool,normal,weak,yes


In [14]:
d2.reset_index().head()

,day,outlook,temperature,humidity,wind,playtennis
0,d1,sunny,hot,high,weak,no
1,d2,sunny,hot,high,strong,no
2,d3,overcast,hot,high,weak,yes
3,d4,rainy,mild,high,weak,yes
4,d5,rainy,cool,normal,weak,yes


#### .values

We can also extract the values of a data frame:

In [15]:
df.values

array([['sunny', 'hot', 'high', 'weak', 'no'],
       ['sunny', 'hot', 'high', 'strong', 'no'],
       ['overcast', 'hot', 'high', 'weak', 'yes'],
       ['rainy', 'mild', 'high', 'weak', 'yes'],
       ['rainy', 'cool', 'normal', 'weak', 'yes'],
       ['rainy', 'cool', 'normal', 'strong', 'no'],
       ['overcast', 'cool', 'normal', 'strong', 'yes'],
       ['sunny', 'mild', 'high', 'weak', 'no'],
       ['sunny', 'cool', 'normal', 'weak', 'yes'],
       ['rainy', 'mild', 'normal', 'weak', 'yes'],
       ['sunny', 'mild', 'normal', 'strong', 'yes'],
       ['overcast', 'mild', 'high', 'strong', 'yes'],
       ['overcast', 'hot', 'normal', 'weak', 'yes'],
       ['rainy', 'mild', 'high', 'strong', 'no']], dtype=object)

But this is **rarely** done in practice; we work with the values of a data frame while they are IN the data frame.

#### Retrieving columns of a DataFrame

We can ask for a single column by specifying the column label

In [16]:
df['playtennis']

day
d1      no
d2      no
d3     yes
d4     yes
d5     yes
d6      no
d7     yes
d8      no
d9     yes
d10    yes
d11    yes
d12    yes
d13    yes
d14     no
Name: playtennis, dtype: object

A single column of a data frame is a data structure known as a **Series** object.

In [17]:
s = df['playtennis']
type(s)

pandas.core.series.Series

A series object can be created on its own.  For now, we will form series objects from columns of a data frame.  The index of a series is the same as the index of the data frame it is part of.

In [18]:
s.index

Index(['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9', 'd10', 'd11',
       'd12', 'd13', 'd14'],
      dtype='object', name='day')

We can extract all the values of a series into a NumPy array.

In [19]:
s.values

array(['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes',
       'yes', 'yes', 'yes', 'no'], dtype=object)

We can extract more than one column from a data frame to create a "sub data frame".

In [20]:
cols = ['outlook', 'playtennis']
df[cols]

day,outlook,playtennis
d1,sunny,no
d2,sunny,no
d3,overcast,yes
d4,rainy,yes
d5,rainy,yes
d6,rainy,no
d7,overcast,yes
d8,sunny,no
d9,sunny,yes
d10,rainy,yes


Often, we DON'T store the column names in a variable and use that to extract.  Rather, we directly specify the list of column names.  The resultant double `[[...]]` may initially appear unusual, but one gets used to it.

In [21]:
df[['outlook','playtennis']]

day,outlook,playtennis
d1,sunny,no
d2,sunny,no
d3,overcast,yes
d4,rainy,yes
d5,rainy,yes
d6,rainy,no
d7,overcast,yes
d8,sunny,no
d9,sunny,yes
d10,rainy,yes


Note that with this double-bracket notation, if we specify only one column name we get a data frame with one column, NOT a series.  Ensure you understand the difference between `df['playtennis']` and `df[['playtennis']]`.

In [22]:
df[['playtennis']]

day,playtennis
d1,no
d2,no
d3,yes
d4,yes
d5,yes
d6,no
d7,yes
d8,no
d9,yes
d10,yes
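The difference between the two forms can be checked directly by looking at the type and shape of each result.  This is a small sketch on a hypothetical two-row frame named `toy`, not on the notebook's `df`.

```python
import pandas as pd

# Hypothetical two-row stand-in for the tennis data.
toy = pd.DataFrame(
    {"outlook": ["sunny", "rainy"], "playtennis": ["no", "yes"]},
    index=pd.Index(["d1", "d4"], name="day"),
)

as_series = toy["playtennis"]    # single brackets: a Series
as_frame = toy[["playtennis"]]   # double brackets: a one-column DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```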


#### Retrieving rows of a DataFrame

We retrieve rows by using one of two indexers, `.loc[]` and `.iloc[]`.  We use `.loc` in conjunction with row labels.  Here are the first 5 rows of the data frame again:

In [23]:
df.head()

day,outlook,temperature,humidity,wind,playtennis
d1,sunny,hot,high,weak,no
d2,sunny,hot,high,strong,no
d3,overcast,hot,high,weak,yes
d4,rainy,mild,high,weak,yes
d5,rainy,cool,normal,weak,yes


When we extract a single row, we get a series.  `loc` (and `iloc`) are indexers, so we use square brackets and NOT parentheses (which we would use if they were functions).

In [24]:
df.loc['d4']

outlook        rainy
temperature     mild
humidity        high
wind            weak
playtennis       yes
Name: d4, dtype: object

When we extract multiple rows, we get a data frame.  Note that again we specify the labels in a list.

In [25]:
df.loc[['d2','d8','d11']]

day,outlook,temperature,humidity,wind,playtennis
d2,sunny,hot,high,strong,no
d8,sunny,mild,high,weak,no
d11,sunny,mild,normal,strong,yes


We can do the usual range indexing (slicing), except that now the end label is included.

In [26]:
df.loc['d3':'d9']

day,outlook,temperature,humidity,wind,playtennis
d3,overcast,hot,high,weak,yes
d4,rainy,mild,high,weak,yes
d5,rainy,cool,normal,weak,yes
d6,rainy,cool,normal,strong,no
d7,overcast,cool,normal,strong,yes
d8,sunny,mild,high,weak,no
d9,sunny,cool,normal,weak,yes


`iloc` does indexing by position as opposed to labels.  Positions are 0-based.

In [27]:
df.iloc[0]

outlook        sunny
temperature      hot
humidity        high
wind            weak
playtennis        no
Name: d1, dtype: object

We can do range indexing.  But this time the end value is NOT included:

In [28]:
df.iloc[0:4]  # will give us the 0th, 1st, 2nd, and 3rd rows

day,outlook,temperature,humidity,wind,playtennis
d1,sunny,hot,high,weak,no
d2,sunny,hot,high,strong,no
d3,overcast,hot,high,weak,yes
d4,rainy,mild,high,weak,yes
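The inclusive end of a `.loc` slice and the exclusive end of an `.iloc` slice can be compared side by side.  A minimal sketch, assuming a hypothetical five-row frame `toy` with the same day labels as the tennis data:

```python
import pandas as pd

# Hypothetical stand-in with the labels d1..d5.
toy = pd.DataFrame(
    {"outlook": ["sunny", "sunny", "overcast", "rainy", "rainy"]},
    index=pd.Index(["d1", "d2", "d3", "d4", "d5"], name="day"),
)

by_label = toy.loc["d2":"d4"]   # label slice: the end label d4 IS included
by_position = toy.iloc[1:4]     # position slice: position 4 is NOT included
shorter = toy.iloc[1:3]         # stops before position 3, so only d2 and d3

print(list(by_label.index))     # ['d2', 'd3', 'd4']
print(list(by_position.index))  # ['d2', 'd3', 'd4']
print(list(shorter.index))      # ['d2', 'd3']
```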


#### value_counts

Now that we know how to extract columns and rows of a data frame, let's return to the series data type.  We often want to do a frequency count of the values of a series.  Suppose we want to count the number of `yes` and `no` values in the playtennis column; we could do:

In [29]:
s=df['playtennis']
s.value_counts()

yes    9
no     5
Name: playtennis, dtype: int64

Note that the result of `value_counts` is itself a series!

In [30]:
s2 = s.value_counts()
type(s2)

pandas.core.series.Series

We can compute the relative frequency of `yes` in the whole data set by counting the total number of `yes` values and dividing by the total number of entries in the series.

In [31]:
s.value_counts().loc['yes']/len(s)

0.6428571428571429

The above expression could be broken into pieces as below, but the above is more idiomatic

In [32]:
cnts = s.value_counts()
cnts.loc['yes']/len(s)

0.6428571428571429

We can also compute the relative frequency of both `yes` and `no` with the expression below.  Recall the notion of **broadcasting**: when a series is operated on by a scalar, all values in the series are operated on (there is an implicit loop).

In [33]:
s.value_counts()/len(s)

yes    0.642857
no     0.357143
Name: playtennis, dtype: float64
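As an aside, `value_counts` also accepts a `normalize=True` argument that returns relative frequencies directly.  The sketch below compares it with the manual division on a hypothetical five-value series named `plays`.

```python
import pandas as pd

# Hypothetical stand-in for the playtennis column (first five observations).
plays = pd.Series(["no", "no", "yes", "yes", "yes"], name="playtennis")

manual = plays.value_counts() / len(plays)       # broadcasting a scalar division
built_in = plays.value_counts(normalize=True)    # same relative frequencies

print(manual["yes"], built_in["yes"])
```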

#### Boolean Indexing

Consider the following boolean expression.  Due to broadcasting we get a series of `True` / `False` values.

In [34]:
df['outlook'] == 'sunny'

day
d1      True
d2      True
d3     False
d4     False
d5     False
d6     False
d7     False
d8      True
d9      True
d10    False
d11     True
d12    False
d13    False
d14    False
Name: outlook, dtype: bool

We can store this series of `True`/`False` values and use it to index a data frame.  We then get those rows for which the boolean mask is `True`

In [35]:
boolean_mask = df['outlook'] == 'sunny'
df[boolean_mask]

day,outlook,temperature,humidity,wind,playtennis
d1,sunny,hot,high,weak,no
d2,sunny,hot,high,strong,no
d8,sunny,mild,high,weak,no
d9,sunny,cool,normal,weak,yes
d11,sunny,mild,normal,strong,yes


The pandorable way of doing this is NOT to explicitly store the boolean mask but rather to use it in place.

In [36]:
d2 = df[df['outlook'] == 'sunny']
d2

day,outlook,temperature,humidity,wind,playtennis
d1,sunny,hot,high,weak,no
d2,sunny,hot,high,strong,no
d8,sunny,mild,high,weak,no
d9,sunny,cool,normal,weak,yes
d11,sunny,mild,normal,strong,yes
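Boolean masks can also be combined, which is one way to approach the earlier question of whether the temperature is always hot when the outlook is sunny.  The sketch below rebuilds just the `outlook` and `temperature` columns in a frame named `toy` (the values are copied from the `df.values` output shown earlier); the variable names are illustrative only.

```python
import pandas as pd

# outlook and temperature for d1..d14, copied from the df.values output.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
temperature = ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
               "mild", "cool", "mild", "mild", "mild", "hot", "mild"]

toy = pd.DataFrame(
    {"outlook": outlook, "temperature": temperature},
    index=pd.Index([f"d{i}" for i in range(1, 15)], name="day"),
)

sunny = toy["outlook"] == "sunny"
hot = toy["temperature"] == "hot"

# & combines two masks element by element (use | for "or", ~ for "not").
sunny_and_hot = toy[sunny & hot]

# Restrict to the sunny rows, then test whether every temperature is 'hot'.
always_hot = (toy.loc[sunny, "temperature"] == "hot").all()

print(sunny_and_hot)
print(always_hot)  # False: d8, d9 and d11 are sunny but not hot
```

When the comparisons are written inline, each one needs its own parentheses, e.g. `toy[(toy['outlook'] == 'sunny') & (toy['temperature'] == 'hot')]`, because `&` binds more tightly than `==`.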


#### drop

Like other operations, `drop` is not a destructive operation.  It returns a new data frame.

In [37]:
df.drop('d2')
df.head()

day,outlook,temperature,humidity,wind,playtennis
d1,sunny,hot,high,weak,no
d2,sunny,hot,high,strong,no
d3,overcast,hot,high,weak,yes
d4,rainy,mild,high,weak,yes
d5,rainy,cool,normal,weak,yes


In [38]:
df.drop('d2').head()

day,outlook,temperature,humidity,wind,playtennis
d1,sunny,hot,high,weak,no
d3,overcast,hot,high,weak,yes
d4,rainy,mild,high,weak,yes
d5,rainy,cool,normal,weak,yes
d6,rainy,cool,normal,strong,no


We can drop multiple rows by specifying the labels in a list.

In [39]:
d1 = df.drop(['d10', 'd12', 'd14'])
d1.tail(5)

day,outlook,temperature,humidity,wind,playtennis
d7,overcast,cool,normal,strong,yes
d8,sunny,mild,high,weak,no
d9,sunny,cool,normal,weak,yes
d11,sunny,mild,normal,strong,yes
d13,overcast,hot,normal,weak,yes


If we want to drop a column, we need to specify an **axis**.  You can think of a data frame with the (0,0) coordinate in the upper left.  axis=0 moves downwards and axis=1 moves to the right.

In [40]:
df.drop('wind', axis=1).head()

day,outlook,temperature,humidity,playtennis
d1,sunny,hot,high,no
d2,sunny,hot,high,no
d3,overcast,hot,high,yes
d4,rainy,mild,high,yes
d5,rainy,cool,normal,yes
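An alternative to `axis=1` is the `columns=` keyword of `drop`, which some readers find easier to remember.  A small sketch on a hypothetical two-row frame `toy`, showing that both spellings give the same result and leave the original untouched:

```python
import pandas as pd

# Hypothetical two-row stand-in for the tennis data.
toy = pd.DataFrame(
    {"outlook": ["sunny", "rainy"], "wind": ["weak", "strong"],
     "playtennis": ["no", "yes"]},
    index=pd.Index(["d1", "d6"], name="day"),
)

with_axis = toy.drop("wind", axis=1)      # the form used in this notebook
with_keyword = toy.drop(columns="wind")   # equivalent keyword form

print(list(with_keyword.columns))  # ['outlook', 'playtennis']
print("wind" in toy.columns)       # True: toy itself is unchanged
```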


In [41]:
df.count()

outlook        14
temperature    14
humidity       14
wind           14
playtennis     14
dtype: int64

#### apply

This is in the spirit of a list comprehension:  when we want to apply a function to a single series or to all the columns of a data frame, we use `apply`.

First, let's see how it works on a series.

In [42]:
s = df['playtennis']
s.head()

day
d1     no
d2     no
d3    yes
d4    yes
d5    yes
Name: playtennis, dtype: object

Let us uppercase all the values.

In [43]:
s.apply(lambda v: v.upper())

day
d1      NO
d2      NO
d3     YES
d4     YES
d5     YES
d6      NO
d7     YES
d8      NO
d9     YES
d10    YES
d11    YES
d12    YES
d13    YES
d14     NO
Name: playtennis, dtype: object

When we apply a function to a data frame the argument to that function is a series.  Hence it doesn't make sense to do

```
df.apply(lambda x: x.upper())
```
because at this point `x` is a series.

Rather we need to `apply` a function that can be applied to series like `count`

In [44]:
df.apply(lambda ser: ser.count())

outlook        14
temperature    14
humidity       14
wind           14
playtennis     14
dtype: int64
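If the goal really is to uppercase every value in a data frame, one option (an addition here, not something the notebook itself uses) is the vectorized `.str` accessor, which does operate on a whole series and can therefore be used inside `DataFrame.apply`.  The frame `toy` below is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical two-row stand-in for the tennis data.
toy = pd.DataFrame(
    {"outlook": ["sunny", "rainy"], "playtennis": ["no", "yes"]},
    index=pd.Index(["d1", "d4"], name="day"),
)

# Each column arrives as a Series; .str.upper() uppercases all of its values.
upper = toy.apply(lambda col: col.str.upper())

print(upper)
```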

We saw `describe` on a data frame earlier.  When `describe` is applied to a series, we get another series.

In [45]:
s.describe()

count      14
unique      2
top       yes
freq        9
Name: playtennis, dtype: object

When we apply `describe` to a data frame, it is applied to each column of the data frame (each of which is a series object).  The resultant collection of series objects is then assembled back into a data frame.

In [46]:
df.apply(lambda ser: ser.describe())

,outlook,temperature,humidity,wind,playtennis
count,14,14,14,14,14
unique,3,3,2,2,2
top,rainy,mild,normal,weak,yes
freq,5,6,7,8,9


#### Combining Series

Let's create by hand a couple of series with different but overlapping indexes.

In [47]:
s1 = pd.Series([80, 70, 90], index='abe bob cathy'.split())
# A scalar value is repeated for every label in the index
s2 = pd.Series(10, index='bob don abe'.split())

print(s1)
print()
print(s2)


abe      80
bob      70
cathy    90
dtype: int64

bob    10
don    10
abe    10
dtype: int64


Due to broadcasting, when we perform an operation on a series as a whole, all of the values in the series are operated on.

In [48]:
s1+5

abe      85
bob      75
cathy    95
dtype: int64

In [None]:
(s1+5)*10

We can join two series together with `pd.concat` (older pandas versions also offered `Series.append`, which was removed in pandas 2.0).

In [49]:
pd.concat([s1, s2])

abe      80
bob      70
cathy    90
bob      10
don      10
abe      10
dtype: int64

Something interesting happens when we do an entry-by-entry operation on two series.

In [51]:
s1.dtype

dtype('int64')

In [52]:
s3=s1+s2
s3

abe      90.0
bob      80.0
cathy     NaN
don       NaN
dtype: float64

In [53]:
s3.dtype

dtype('float64')

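Pandas automatically aligns row labels and performs the operation only on the labels that appear in both series.  The rest are deemed "Not a Number" (`NaN`).  Because `NaN` is a floating-point value, the result has dtype `float64` even though `s1` and `s2` hold integers.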
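If a missing label should count as 0 instead of producing `NaN`, one option is the method form `Series.add` with its `fill_value` argument.  This is a sketch that goes slightly beyond the notebook; it recreates `s1` and `s2` with the same values as above so that it runs on its own.

```python
import pandas as pd

s1 = pd.Series([80, 70, 90], index="abe bob cathy".split())
s2 = pd.Series(10, index="bob don abe".split())

plain = s1 + s2                     # NaN wherever a label is missing on one side
filled = s1.add(s2, fill_value=0)   # a missing value is treated as 0 instead

print(plain)
print()
print(filled)
```

Here `cathy` keeps her 90 and `don` keeps his 10, whereas the plain `+` reports both as `NaN`.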

### Grouping

Similar to the `GROUP BY` clause of SQL, Pandas supports the ability to group rows in a number of ways.

In [54]:
grps = df.groupby('playtennis')

`grps` has a data type of its own.  Its constituent parts are data frames

In [55]:
type(grps)

pandas.core.groupby.DataFrameGroupBy

We can get information on the groups and their constituent rows

In [56]:
grps.groups

{'no': Index(['d1', 'd2', 'd6', 'd8', 'd14'], dtype='object', name='day'),
 'yes': Index(['d3', 'd4', 'd5', 'd7', 'd9', 'd10', 'd11', 'd12', 'd13'], dtype='object', name='day')}

The size of each group is available as a series object

In [57]:
grps.size()

playtennis
no     5
yes    9
dtype: int64
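The group sizes carry the same information as the frequency count obtained earlier with `value_counts`; only the ordering differs (group keys are sorted, while `value_counts` sorts by count).  A minimal sketch on a hypothetical five-row frame `toy`:

```python
import pandas as pd

# Hypothetical stand-in for the first five observations of the tennis data.
toy = pd.DataFrame(
    {"playtennis": ["no", "no", "yes", "yes", "yes"]},
    index=pd.Index(["d1", "d2", "d3", "d4", "d5"], name="day"),
)

sizes = toy.groupby("playtennis").size()     # one entry per group
counts = toy["playtennis"].value_counts()    # one entry per distinct value

print(sizes)
print(counts)
```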

We can get the individual data frames in a group with `.get_group`

In [58]:
grps.get_group('no')

day,outlook,temperature,humidity,wind,playtennis
d1,sunny,hot,high,weak,no
d2,sunny,hot,high,strong,no
d6,rainy,cool,normal,strong,no
d8,sunny,mild,high,weak,no
d14,rainy,mild,high,strong,no


We can also iterate across all the groups in a groupby object.

In [59]:
for k,g in grps:
    print(k)
    print(g)
    print()

no
    outlook temperature humidity    wind playtennis
day                                                
d1    sunny         hot     high    weak         no
d2    sunny         hot     high  strong         no
d6    rainy        cool   normal  strong         no
d8    sunny        mild     high    weak         no
d14   rainy        mild     high  strong         no

yes
      outlook temperature humidity    wind playtennis
day                                                  
d3   overcast         hot     high    weak        yes
d4      rainy        mild     high    weak        yes
d5      rainy        cool   normal    weak        yes
d7   overcast        cool   normal  strong        yes
d9      sunny        cool   normal    weak        yes
d10     rainy        mild   normal    weak        yes
d11     sunny        mild   normal  strong        yes
d12  overcast        mild     high  strong        yes
d13  overcast         hot   normal    weak        yes



Separate them with a list comprehension

In [60]:
lst=[(k,g) for (k,g) in grps]
lst

[('no',     outlook temperature humidity    wind playtennis
  day                                                
  d1    sunny         hot     high    weak         no
  d2    sunny         hot     high  strong         no
  d6    rainy        cool   normal  strong         no
  d8    sunny        mild     high    weak         no
  d14   rainy        mild     high  strong         no),
 ('yes',       outlook temperature humidity    wind playtennis
  day                                                  
  d3   overcast         hot     high    weak        yes
  d4      rainy        mild     high    weak        yes
  d5      rainy        cool   normal    weak        yes
  d7   overcast        cool   normal  strong        yes
  d9      sunny        cool   normal    weak        yes
  d10     rainy        mild   normal    weak        yes
  d11     sunny        mild   normal  strong        yes
  d12  overcast        mild     high  strong        yes
  d13  overcast         hot   normal    weak        yes)]

In [61]:
type(lst[0][1])

pandas.core.frame.DataFrame

In [62]:
lst[0][1]

day,outlook,temperature,humidity,wind,playtennis
d1,sunny,hot,high,weak,no
d2,sunny,hot,high,strong,no
d6,rainy,cool,normal,strong,no
d8,sunny,mild,high,weak,no
d14,rainy,mild,high,strong,no


#### .apply on groupby objects

We can also apply a function to a groupby object.  In this instance the function that is applied takes a data frame as the argument:

In [63]:
grps.apply(lambda d: len(d))

playtennis
no     5
yes    9
dtype: int64

Let's spend some time dissecting the below.

In [64]:
df.apply(lambda s: s.value_counts())

,outlook,temperature,humidity,wind,playtennis
cool,,4.0,,,
high,,,7.0,,
hot,,4.0,,,
mild,,6.0,,,
no,,,,,5.0
normal,,,7.0,,
overcast,4.0,,,,
rainy,5.0,,,,
strong,,,,6.0,
sunny,5.0,,,,


In [68]:
lst = [df[c].value_counts() for c in df.columns]
len(lst)

5

Blend the individual series back into a single data frame.
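One possible way to do this blending, offered as an assumption about what is intended here, is `pd.concat` along `axis=1`, which aligns the per-column counts on their labels and leaves `NaN` where a value never occurs in a column.  The frame `toy` below is a hypothetical stand-in built from the first five observations.

```python
import pandas as pd

# Hypothetical stand-in for the first five observations of the tennis data.
toy = pd.DataFrame(
    {
        "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy"],
        "temperature": ["hot", "hot", "hot", "mild", "cool"],
        "humidity": ["high", "high", "high", "high", "normal"],
        "wind": ["weak", "strong", "weak", "weak", "weak"],
        "playtennis": ["no", "no", "yes", "yes", "yes"],
    },
    index=pd.Index(["d1", "d2", "d3", "d4", "d5"], name="day"),
)

counts_per_column = [toy[c].value_counts() for c in toy.columns]

# keys= names each resulting column explicitly, because the name that
# value_counts gives its result differs between pandas versions.
blended = pd.concat(counts_per_column, axis=1, keys=toy.columns)

print(blended)
```

The result has one row per distinct value found anywhere in the frame and one column per original column, which matches the shape of the `df.apply(lambda s: s.value_counts())` output above.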

#### .agg or .aggregating values

`agg` is similar to `apply` but differs in the following crucial ways.

   [1] `apply` can be used with a series, data frame, or group.  The function being applied takes the components of the data type to which it is applied:

   ```
   Something.apply(lambda x: ______ )
   ```

   If `Something` is

   - a series, then x is a value
   - a data frame, then x is a column (series)
   - a group, then x is a data frame

   [2] `agg` is used here on a group (it is also available on series and data frames), and the function is applied to each column of the data frames in the group, i.e., x is a series.  Also, multiple functions can be used during aggregation.

In [None]:
grps.get_group('no').count()

In [None]:
len(grps.get_group('no'))+10

In [None]:
grps.agg(['count', lambda s: len(s)+10])

In [None]:
def f(s):
    return len(s)+100

grps.agg(['count', f])
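To make the shape of an `agg` result concrete, here is a small sketch using two built-in aggregations named by string, `'count'` and `'nunique'`.  The frame `toy` and the variable `toy_grps` are hypothetical stand-ins built from the first five observations, not the notebook's `df` and `grps`.

```python
import pandas as pd

# Hypothetical stand-in for the first five observations of the tennis data.
toy = pd.DataFrame(
    {
        "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy"],
        "wind": ["weak", "strong", "weak", "weak", "weak"],
        "playtennis": ["no", "no", "yes", "yes", "yes"],
    },
    index=pd.Index(["d1", "d2", "d3", "d4", "d5"], name="day"),
)

toy_grps = toy.groupby("playtennis")

# Each function is applied to each remaining column within each group,
# so the result has one row per group and one column per (column, function).
summary = toy_grps.agg(["count", "nunique"])

print(summary)
print(summary.loc["no", ("outlook", "nunique")])  # 1: both 'no' rows are sunny
```

The columns of `summary` form a two-level index, so an individual cell is addressed with a `(column, function)` tuple as in the last line.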