# Part Two - Filtering, Sorting, and Grouping

In the last tutorial, we looked at some ways of accessing data in the rows and columns of a DataFrame object. We learned how to select a single row or column, which returns a Series object. Then we learned how to use some of the functions that belong to the Series class to learn about the information contained in those rows and columns.

In this tutorial, we're going to learn about grouping, sorting, and filtering. This is a little more complex than what we've done before, but it builds on the same basic concepts. In all cases, the key to understanding what you're doing is to pay attention to the kind of object you're working with (DataFrame, Series, or something else), and to understand the shape of the data it contains.

## Loading data and importing libraries

Because this is a new Notebook, we have to load all the data and libraries we're using again. We'll show the first few rows of the DataFrame again as well, just as a reminder.

In [1]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/SimonCarryer/pandas_tutorial/master/data/dog_registrations.csv')

df['ValidDate'] = pd.to_datetime(df['ValidDate']) # Just formatting the data a little bit nicer

df.head()

Unnamed: 0,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate
0,Dog Individual Female,AM PIT BULL TERRIER,SPOTTED,BUTTER,15001,2007,2007-05-01 15:15:00
1,Dog Individual Female,AM PIT BULL TERRIER,BROWN,SABLE,15001,2007,2007-05-01 15:15:00
2,Dog Individual Neutered Male,MIXED,.,YIP,15001,2007,2007-04-11 15:14:00
3,Dog Individual Male,DOBERMAN PINSCHER,RED,SABER,15003,2007,2007-04-05 15:00:00
4,Dog Individual Spayed Female,MIXED,BLACK,DAISY,15003,2007,2007-05-25 12:15:00


## Filtering

Filtering in Pandas is deceptively simple. It relies on a neat trick involving a boolean Series. Remember in the last tutorial when we compared a Series with an operator (`>`, `<`, or `==`), how it applied that "row-wise", and returned the results of that comparison for each element in the Series?

In [2]:
df['Breed'] == 'AM PIT BULL TERRIER'

0         True
1         True
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
97432    False
97433    False
97434    False
97435    False
97436    False
97437    False
97438    False
97439    False
97440    False
97441    False
97442    False
97443    False
97444    False
97445    False
97446    False
97447    False
97448    False
97449    False
97450    False
97451    False
97452    False
97453    False
97454    False
97455    False
97456    False
97457    False
97458    False
97459    False
97460    False
97461    False
Name: Breed, Length: 97462, dtype: bool

What we see above is a Series object, which contains Boolean ("true or false") values. Pandas lets us use that Series, in conjunction with `loc` (which is written with square brackets rather than parentheses), to filter a DataFrame, like this:

In [3]:
terriers = df['Breed'] == 'AM PIT BULL TERRIER'
df.loc[terriers]

Unnamed: 0,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate
0,Dog Individual Female,AM PIT BULL TERRIER,SPOTTED,BUTTER,15001,2007,5/1/2007 15:15
1,Dog Individual Female,AM PIT BULL TERRIER,BROWN,SABLE,15001,2007,5/1/2007 15:15
73,Dog Individual Spayed Female,AM PIT BULL TERRIER,WHITE/BROWN,XENA,15006,2007,1/9/2007 14:29
132,Dog Individual Female,AM PIT BULL TERRIER,TAN,DORA,15014,2007,2/2/2007 9:22
142,Dog Individual Neutered Male,AM PIT BULL TERRIER,BLACK,BUSTER,15014,2007,1/29/2007 10:47
147,Dog Individual Male,AM PIT BULL TERRIER,WHITE/BLACK,VLADIMORE,15014,2007,6/4/2007 12:07
149,Dog Individual Spayed Female,AM PIT BULL TERRIER,WHITE/BROWN,ANGEL,15014,2007,6/5/2007 12:09
151,Dog Individual Neutered Male,AM PIT BULL TERRIER,BROWN,NONO,15014,2007,4/12/2007 13:53
163,Dog Individual Neutered Male,AM PIT BULL TERRIER,BRINDLE,ROCCO,15014,2007,7/11/2007 14:35
222,Dog Individual Male,AM PIT BULL TERRIER,BRINDLE,ODIS,15014,2007,1/3/2007 11:45


In fact, using the `loc` function is not strictly necessary here. Just `df[terriers]` will do exactly the same thing. That relies on a bit of magic in the background, where Pandas figures out whether you're looking for a set of columns, or filtering rows. I find that notation quite confusing, so for clarity I like to use the `loc` function. `loc` has other uses as well, which you can research yourself.
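To make the comparison concrete, here is a minimal sketch on a tiny made-up table (the `dogs` DataFrame and its values are invented for illustration, not taken from the registrations data). It checks that both notations return the same rows, and shows one of the other uses of `loc`: picking a column at the same time as filtering rows.

```python
import pandas as pd

# A tiny made-up table standing in for the dog registrations data
dogs = pd.DataFrame({
    'Breed': ['BEAGLE', 'MIXED', 'BEAGLE', 'POODLE'],
    'DogName': ['BUTTER', 'YIP', 'SABER', 'DAISY'],
})

beagles = dogs['Breed'] == 'BEAGLE'

# Both notations keep the rows where the boolean Series is True
with_loc = dogs.loc[beagles]
without_loc = dogs[beagles]
print(with_loc.equals(without_loc))

# loc also accepts a column name after the comma, returning a Series
beagle_names = dogs.loc[beagles, 'DogName']
print(beagle_names.tolist())
```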

There are other ways of producing a boolean Series, besides the simple operators. Series have several functions that perform comparisons and return booleans. Here are some examples:

In [7]:
df['Color'].isin(['BROWN', 'TAN'])

0        False
1         True
2        False
3        False
4        False
5        False
6        False
7        False
8         True
9        False
10       False
11       False
12       False
13       False
14       False
15        True
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23        True
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
97432     True
97433    False
97434    False
97435    False
97436    False
97437     True
97438     True
97439    False
97440    False
97441    False
97442    False
97443    False
97444     True
97445    False
97446    False
97447    False
97448    False
97449     True
97450    False
97451    False
97452    False
97453    False
97454    False
97455    False
97456    False
97457    False
97458    False
97459    False
97460    False
97461    False
Name: Color, Length: 97462, dtype: bool

In [8]:
df['ValidDate'].between('20070501', '20080601') # Note that we had to format the date column correctly for this to work.

0         True
1         True
2        False
3        False
4         True
5         True
6         True
7        False
8        False
9        False
10       False
11        True
12       False
13       False
14       False
15        True
16        True
17       False
18       False
19       False
20       False
21       False
22        True
23        True
24       False
25       False
26       False
27        True
28       False
29       False
         ...  
97432    False
97433    False
97434    False
97435    False
97436    False
97437    False
97438    False
97439    False
97440    False
97441    False
97442    False
97443    False
97444    False
97445    False
97446    False
97447    False
97448    False
97449    False
97450    False
97451    False
97452    False
97453    False
97454    False
97455    False
97456    False
97457    False
97458    False
97459    False
97460    False
97461    False
Name: ValidDate, Length: 97462, dtype: bool

In [9]:
df['LicenseType'].isnull() # Trust me this will be very useful

0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
97432    False
97433    False
97434    False
97435    False
97436    False
97437    False
97438    False
97439    False
97440    False
97441    False
97442    False
97443    False
97444    False
97445    False
97446    False
97447    False
97448    False
97449    False
97450    False
97451    False
97452    False
97453    False
97454    False
97455    False
97456    False
97457    False
97458    False
97459    False
97460    False
97461    False
Name: LicenseType, Length: 97462, dtype: bool

In [10]:
df['Breed'].str.contains('MIX') 

# This is a bit sneaky - the `str` attribute accesses a bunch of other methods for Series containing strings.

0        False
1        False
2         True
3        False
4         True
5         True
6        False
7        False
8        False
9        False
10       False
11       False
12        True
13       False
14       False
15       False
16        True
17       False
18        True
19        True
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29        True
         ...  
97432    False
97433    False
97434    False
97435     True
97436    False
97437    False
97438     True
97439    False
97440     True
97441    False
97442     True
97443     True
97444     True
97445    False
97446     True
97447     True
97448    False
97449    False
97450    False
97451    False
97452    False
97453    False
97454    False
97455    False
97456    False
97457     True
97458    False
97459    False
97460    False
97461    False
Name: Breed, Length: 97462, dtype: bool
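One thing to watch with `str.contains` is missing data: depending on your Pandas version, a missing value in the column may come back as a missing value in the result rather than `False`, and such a Series is awkward to filter with. The sketch below, on a small invented table, passes `na=False` so that missing breeds simply count as "no match".

```python
import pandas as pd

# Made-up data with one missing breed
dogs = pd.DataFrame({
    'Breed': ['BEAGLE MIX', 'POODLE', None, 'MIXED'],
    'OwnerZip': [15001, 15003, 15005, 15006],
})

# na=False treats a missing breed as "does not contain MIX"
mixes = dogs['Breed'].str.contains('MIX', na=False)

# isnull flags the missing breed itself
no_breed = dogs['Breed'].isnull()

print(dogs.loc[mixes])
print(dogs.loc[no_breed])
```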

### Filtering Series

You can filter a Series just the same way that you filter a DataFrame.

In [11]:
df['DogName'].loc[terriers]

0            BUTTER
1             SABLE
73             XENA
132            DORA
142          BUSTER
147       VLADIMORE
149           ANGEL
151            NONO
163           ROCCO
222            ODIS
224            KODA
228           JAYDA
235           BONES
236            TANK
288           BELLA
432        BROOKLYN
433           PEARL
546        DESTINEE
644           REESE
645           REESE
646          CEASAR
668          BONNIE
691             JED
709        SKITTLES
792            DUKE
836            JAKE
1067          SAMMY
1115         OAKLEY
1170          FOXIE
1204         TREACH
            ...    
94619         BELLA
94639         SADIE
94644       SPENCER
94645         LAELA
94710        COOKIE
94711         GEMMA
94827      SCARLETT
94951      SAPPHIRE
95046         KOEIS
95055        SIERRA
95084         TYSON
95246        REILLY
95350         CHICA
95667         TITAN
95669         SPIKE
95776        HAMMER
95958         RUFUS
95999        COOKIE
96038         GUCCI


"Vladimore"!? Incredible.

### Multiple Filters

You can apply more than one filter at a time. How you write your code is up to you, but the clearest and most concise way I've found to do this is to define several boolean Series, and then chain them together with the `&` operator. If you're experienced with Python, you might be wondering why we use `&` instead of `and`. It's because `and` tries to treat each whole Series as a single true-or-false value, which Pandas refuses to do (it raises an error). The `&` operator instead works "element-wise": it compares the two Series row by row, and gives `True` only where both are `True`. Observe:

In [12]:
terriers = df['Breed'] == 'AM PIT BULL TERRIER'
brown = df['Color'].str.contains('BROWN')

terriers & brown

0        False
1         True
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
97432    False
97433    False
97434    False
97435    False
97436    False
97437    False
97438    False
97439    False
97440    False
97441    False
97442    False
97443    False
97444    False
97445    False
97446    False
97447    False
97448    False
97449    False
97450    False
97451    False
97452    False
97453    False
97454    False
97455    False
97456    False
97457    False
97458    False
97459    False
97460    False
97461    False
Length: 97462, dtype: bool

Because combining multiple boolean Series gives us a single boolean Series, we can use the result to filter a DataFrame.

In [13]:
df.loc[terriers & brown]

Unnamed: 0,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate
1,Dog Individual Female,AM PIT BULL TERRIER,BROWN,SABLE,15001,2007,2007-05-01 15:15:00
73,Dog Individual Spayed Female,AM PIT BULL TERRIER,WHITE/BROWN,XENA,15006,2007,2007-01-09 14:29:00
149,Dog Individual Spayed Female,AM PIT BULL TERRIER,WHITE/BROWN,ANGEL,15014,2007,2007-06-05 12:09:00
151,Dog Individual Neutered Male,AM PIT BULL TERRIER,BROWN,NONO,15014,2007,2007-04-12 13:53:00
644,Dog Individual Female,AM PIT BULL TERRIER,BROWN,REESE,15017,2007,2007-08-28 15:51:00
645,Dog Individual Spayed Female,AM PIT BULL TERRIER,BROWN,REESE,15017,2007,2007-08-28 14:41:00
646,Dog Individual Neutered Male,AM PIT BULL TERRIER,BROWN,CEASAR,15017,2007,2007-08-28 14:41:00
668,Dog Individual Female,AM PIT BULL TERRIER,BROWN,BONNIE,15017,2007,2007-09-12 15:06:00
709,Dog Individual Male,AM PIT BULL TERRIER,BROWN,SKITTLES,15018,2007,2007-01-25 16:11:00
1067,Dog Individual Neutered Male,AM PIT BULL TERRIER,BROWN,SAMMY,15024,2007,2007-03-09 09:59:00


### Getting Complicated

Feel free to skip this part - it's a bit technical, but there's some extra stuff to learn here. This way of filtering DataFrames works because the boolean Series has the same "index" as the DataFrame - the indexes "align". That means that they have the same length, and all the same index labels. If your boolean Series comes from a column of your DataFrame, that will always work fine.

But you can also make an entirely new Series, with wholly different values. Provided the indexes align, you can use it to filter in exactly the same way.

In [16]:
import numpy as np  # pd.np is no longer available in recent versions of Pandas

# Make a Series object that holds random whole numbers from 0 to 9
random_numbers = pd.Series(np.random.randint(0, 10, size=len(df)))

# Return only rows in df where the random number in "random_numbers" is zero
df.loc[random_numbers == 0]

Unnamed: 0,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate
4,Dog Individual Spayed Female,MIXED,BLACK,DAISY,15003,2007,2007-05-25 12:15:00
20,Dog Individual Neutered Male,SILKY TERRIER,WHITE/BLACK/BROWN,NICO,15003,2007,2007-02-12 14:42:00
28,Dog Individual Neutered Male,GER SHEPHERD,WHITE,CEASAR,15003,2007,2007-04-17 16:10:00
42,Dog Individual Neutered Male,CHIHUAHUA,MULTI,MOOSE,15005,2007,2007-01-11 14:16:00
46,Dog Individual Neutered Male,GOLDEN RETRIEVER,GOLD,SONNY,15005,2007,2007-02-15 09:01:00
47,Dog Individual Neutered Male,ENG SPRINGER SPANIE,WHITE/BLACK,PATCHES,15005,2007,2007-10-01 15:42:00
73,Dog Individual Spayed Female,AM PIT BULL TERRIER,WHITE/BROWN,XENA,15006,2007,2007-01-09 14:29:00
82,Dog Individual Neutered Male,MIXED,BLACK/BROWN,ENZO,15006,2007,2007-05-18 11:18:00
111,Dog Individual Male,MIXED,WHITE/APRICOT,HANK,15014,2007,2007-02-20 15:53:00
130,Dog Senior Citizen or Disability Spayed Female,DOBERMAN PINSCHER,BLACK/BROWN,DAISY GUY,15014,2007,2007-02-08 15:46:00
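To see what "aligning" means in practice, here is a minimal sketch with an invented three-row table whose index labels are 10, 20 and 30. The boolean Series lists the same labels in a different order, and `loc` pairs each `True` or `False` with its row by label, not by position. This is my reading of how `loc` behaves, shown on made-up data.

```python
import pandas as pd

# Made-up table with index labels 10, 20, 30
dogs = pd.DataFrame(
    {'DogName': ['BUTTER', 'SABLE', 'YIP']},
    index=[10, 20, 30],
)

# Same labels as dogs, but listed in a different order
keep = pd.Series([True, False, False], index=[30, 20, 10])

# loc matches on labels: only label 30 is True, so only that row comes back
selected = dogs.loc[keep]
print(selected)
```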


### Problem One

In the cell below:
* Filter the DataFrame to show dogs registered in the 15005 and 15006 zip codes.
* Filter the DataFrame to show dogs registered after 2007
* Count the number of male dogs
* Count the number of white poodles registered in 2008

## Sorting

There are two ways of sorting DataFrames and Series: simple and complicated. Let's do simple first:

### Simple Sorting

There are a couple of functions which sort DataFrames and Series. They are `sort_values` and `sort_index`.

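Here's `sort_values`, first on a Series (ascending, then descending), and then on a DataFrame, using the `by` argument to name one or more columns to sort on: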

In [13]:
df['OwnerZip'].sort_values()

0        15001
66219    15001
66220    15001
66221    15001
66222    15001
66223    15001
29866    15001
29865    15001
29864    15001
29863    15001
29862    15001
29861    15001
66218    15001
66217    15001
1        15001
2        15001
4        15003
29890    15003
29889    15003
29888    15003
29887    15003
29886    15003
29885    15003
29884    15003
29883    15003
29882    15003
29891    15003
29881    15003
29879    15003
29878    15003
         ...  
29854    16059
29855    16059
29856    16059
66200    16059
29844    16059
29846    16059
29845    16059
66209    16059
66214    16059
66213    16059
66212    16059
66202    16059
66203    16059
66211    16059
66210    16059
66204    16059
66208    16059
66207    16059
66206    16059
66205    16059
66216    16066
29857    16066
29858    16066
66215    16066
97458    16066
29859    16229
97459    17821
29860    19350
97460    32828
97461    32828
Name: OwnerZip, Length: 97462, dtype: int64

In [14]:
df['OwnerZip'].sort_values(ascending=False)

97461    32828
97460    32828
29860    19350
97459    17821
29859    16229
97458    16066
66215    16066
29858    16066
29857    16066
66216    16066
66205    16059
66206    16059
66207    16059
66208    16059
66204    16059
66210    16059
66211    16059
66203    16059
66202    16059
66212    16059
66213    16059
66214    16059
66209    16059
29845    16059
29846    16059
29844    16059
66200    16059
29856    16059
29855    16059
29854    16059
         ...  
29878    15003
29879    15003
29881    15003
29891    15003
29882    15003
29883    15003
29884    15003
29885    15003
29886    15003
29887    15003
29888    15003
29889    15003
29890    15003
4        15003
2        15001
1        15001
66217    15001
66218    15001
29861    15001
29862    15001
29863    15001
29864    15001
29865    15001
29866    15001
66223    15001
66222    15001
66221    15001
66220    15001
66219    15001
0        15001
Name: OwnerZip, Length: 97462, dtype: int64

In [15]:
df.sort_values(by='OwnerZip')

Unnamed: 0,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate
0,Dog Individual Female,AM PIT BULL TERRIER,SPOTTED,BUTTER,15001,2007,2007-05-01 15:15:00
66219,Dog Individual Neutered Male,YORKSHIRE TERR MIX,WHITE/BLACK,CHEVY,15001,2009,2009-05-04 10:51:00
66220,Dog Individual Spayed Female,SHETLAND SHEEPDOG,WHITE/BLACK/BROWN,LACEY,15001,2009,2009-06-04 16:14:00
66221,Dog Individual Male,MALTESE,WHITE,BAILEY MARTIN,15001,2009,2009-02-03 15:05:00
66222,Dog Individual Spayed Female,MALTESE,WHITE,ANNIE MARTIN,15001,2009,2009-02-03 15:05:00
66223,Dog Individual Spayed Female,BEAGLE MIX,WHITE/BLACK/BROWN,BELLA,15001,2009,2009-06-29 09:51:00
29866,Dog Individual Female,PLOTT HOUND,BRINDLE,ZOEY,15001,2008,2008-06-23 14:43:00
29865,Dog Individual Spayed Female,GREYHOUND,BRINDLE,LIBBY,15001,2008,2008-02-26 14:08:00
29864,Dog Individual Female,BEAGLE,WHITE/BLACK/BROWN,MODENA,15001,2008,2008-02-19 12:25:00
29863,Dog Individual Neutered Male,YORKSHIRE TERR MIX,WHITE/BLACK,CHEVY,15001,2008,2008-03-04 15:18:00


In [16]:
df.sort_values(by=['Breed', 'DogName'])

Unnamed: 0,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate
45341,Dog Individual Neutered Male,.,WHITE/BLACK,BRUNO,15131,2008,2008-02-25 15:20:00
72443,Dog Individual Spayed Female,.,BLACK,BUCKEYE,15090,2009,2009-03-20 14:35:00
31070,Dog Individual Spayed Female,.,WHITE/BLACK,CHOMPERS,15024,2008,2008-01-09 16:01:00
21726,Dog Individual Female,.,BLACK/BROWN,COCO,15221,2007,2007-05-24 15:24:00
74423,Dog Individual Spayed Female,.,BLACK/BROWN,KELLI,15102,2009,2008-12-31 15:56:00
788,Dog Senior Citizen or Disability Spayed Female,.,WHITE/BLACK/BROWN,LADY,15024,2007,2007-03-13 15:01:00
31039,Dog Senior Citizen or Disability Spayed Female,.,WHITE/BLACK/BROWN,LADY,15024,2008,2008-03-31 15:47:00
67208,Dog Senior Citizen or Disability Spayed Female,.,WHITE/BLACK/BROWN,LADY,15024,2009,2009-03-06 12:30:00
5532,Dog Senior Citizen or Disability Duplicate,.,WHITE,MAGGIE,15084,2007,2007-05-21 10:29:00
37652,Dog Individual Spayed Female,.,BRINDLE,MAIA,15090,2008,2007-12-20 16:02:00


And here's `sort_index`. Since Series and DataFrames usually start out sorted by their index, it's most useful on Series that have ended up in some other order, like the result of `value_counts`.

In [17]:
df['Breed'].value_counts().sort_index()

.                          18
AFFENPINSCHER               5
AFGHAN HOUND               12
AINU                        2
AIREDALE TERRIER          187
AKITA                     272
AKITA MIX                  70
ALAPAHA BLUE BULLDO         1
ALAS KLEE KAI               5
ALAS MALAMUTE             120
ALASKAN SHEPHERD            1
AM BLACK&TAN COONHO         6
AM BULLDOG                288
AM ESKIMO DOG             280
AM FOXHOUND                35
AM PIT BULL TERRIER      2969
AM PITT BULL MIX          751
AM STAFF TERRIER          238
AM TOY TERRIER             29
AM WATER SPANIEL            2
ANATOLIAN SHEPHERD          7
ARGENTINO DOGO             13
AUS CATTLE DOG            199
AUS KELPIE                  8
AUS SHEPHERD              433
AUS TERRIER                12
BASENJI                    43
BASSET HOUND              489
BASSET HOUND MIX           57
BEAGLE                   3239
                         ... 
SPINONE ITALIANO           29
ST BERNARD                244
STAFFORDSH

### Complicated Sorting

There's another way of sorting, which is much more complicated. You can probably ignore this in most cases, but it can become useful. It uses the _positions_ of the rows to provide a sort order. Remember how the `iloc` function can return specific rows of a DataFrame or Series by position? If you provide it a list of all the row positions you want, in the order you want them, it will return your DataFrame sorted the way you like.

In [18]:
import numpy as np  # pd.np is no longer available in recent versions of Pandas

# Make an array holding every row position from 0 to len(df) - 1, in random order
random_order = np.random.permutation(len(df))
df.iloc[random_order]

Unnamed: 0,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate
17480,Dog Individual Neutered Male,BORD COLLIE,BLACK,OREO,15146,2007,2006-12-28 11:46:00
77253,Dog Individual Spayed Female,MIN PINSCHER,BLACK/TAN,LILY,15116,2009,2009-10-14 15:53:00
42215,Dog Individual Male,AM PIT BULL TERRIER,BROWN,SIEGE,15110,2008,2008-02-19 14:06:00
73744,Dog Individual Neutered Male,AM PIT BULL TERRIER,WHITE/BLACK,CHAMP,15101,2009,2009-06-22 10:19:00
37162,Dog Individual Spayed Female,GOLDEN RETRIEVER,GOLD,HANNA,15090,2008,2008-04-02 12:42:00
54259,Dog Individual Female,POODLE STANDARD,APRICOT,LILY,15209,2008,2008-01-08 14:32:00
3668,Dog Individual Male,ROTTWEILER MIX,BLACK/BROWN,BUDDY,15045,2007,2006-12-14 15:59:00
1755,Dog Individual Female,GOLDEN RETRIEVER,BROWN,LACI,15026,2007,2007-09-24 10:09:00
42744,Dog Individual Neutered Male,TERRIER,WHITE/BLACK,BENJI,15116,2008,2008-01-30 11:25:00
2206,Dog Individual Spayed Female,CHIHUAHUA MIX,MULTI,RUBI,15037,2007,2007-06-06 15:30:00
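A random order is not very useful by itself, so here is a sketch of the same idea with a meaningful order. It uses NumPy's `argsort`, which returns the row positions that would sort an array, and passes those positions to `iloc`. The `dogs` table is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
dogs = pd.DataFrame({
    'DogName': ['YIP', 'BUTTER', 'SABLE'],
    'OwnerZip': [15003, 15001, 15002],
})

# argsort gives the row positions that would put OwnerZip in ascending order
order = np.argsort(dogs['OwnerZip'].to_numpy())

# iloc returns the rows in exactly the order of the positions it is given
by_position = dogs.iloc[order]
print(by_position)
```

For a case this simple, `dogs.sort_values(by='OwnerZip')` gives the same result with less effort. The positional approach earns its keep when the order comes from somewhere outside the DataFrame.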


### Problem Two

* Show the DataFrame sorted by the ValidDate
* Show the LicenseTypes, sorted descending
* Show the DataFrame, sorted by ExpYear and ValidDate
* Show the dog names, sorted by zip code

## Grouping

The easiest way to group data in Pandas is with a method we've already seen: the `value_counts` method. This is a `Series` method, so you can call it on any single column. By default, it returns a count of how many times each distinct non-`NaN` value appears in the column, sorted from most to least common.

In [21]:
df['Color'].value_counts().head()

BLACK          15916
BROWN          11920
WHITE           7673
WHITE/BLACK     7489
BLACK/BROWN     7289
Name: Color, dtype: int64

That sorting from most to least common is usually quite useful, but sometimes that's not what you want. In those cases, the `sort_index` method is usually what you're after.

In [32]:
df['ExpYear'].value_counts().sort_index()

2007    29861
2008    36356
2009    31245
Name: ExpYear, dtype: int64
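As a small sketch of those defaults, the example below runs `value_counts` on a made-up Series containing one missing value. It also shows the `dropna=False` argument, which asks for missing values to be counted as their own category.

```python
import pandas as pd

# Made-up colours, including one missing value
colors = pd.Series(['BLACK', 'BROWN', 'BLACK', None, 'WHITE'])

# Default: missing values are left out, most common value first
counts = colors.value_counts()

# dropna=False counts the missing value as well
with_missing = colors.value_counts(dropna=False)

# sort_index puts the counts in alphabetical order of colour instead
alphabetical = counts.sort_index()
print(alphabetical)
```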

But often you don't want to just count things, or you want to group by more than just one column.

This is a little bit trickier than the previous things we've seen, because it involves a new kind of object, the `DataFrameGroupBy` object. You can think of this like a DataFrame that's been squished down real tight. All the information is still in there, it's just that the rows have been collected into groups, ready to be aggregated. You've got to access the data in slightly different ways.

The way we access a `DataFrameGroupBy` object is with the `groupby` function of a DataFrame, like this:

In [18]:
df.groupby('Color')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x0000020730E285C0>

As you can see, grouping by itself doesn't gain us much. We've squished down the DataFrame, but we can't see any of the information it contains. To do that, we need to supply an aggregate function, such as `count`, `sum`, `mean` and so on.

In [19]:
df.groupby('Color').count()

Color,LicenseType,Breed,DogName,OwnerZip,ExpYear,ValidDate
.,489,489,489,489,489,482
APRICOT,392,392,392,392,392,389
BEIGE,314,314,314,314,314,312
BLACK,15916,15916,15916,15916,15916,15861
BLACK WITH WHITE,119,119,119,119,119,119
BLACK/BLUE,9,9,9,9,9,8
BLACK/BRINDLE,53,53,53,53,53,53
BLACK/BROWN,7289,7289,7289,7289,7289,7256
BLACK/BROWN/GREY,68,68,68,68,68,68
BLACK/BROWN/TAN,164,164,164,164,164,164


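You'll notice that what's returned is a count for _every column_. That might be useful if we had a lot of numeric data, and we were using a `sum` or `mean` function, but since we're just counting rows, we get nearly the same information for every column. (The numbers aren't always identical, because `count` skips missing values, which is why ValidDate is slightly lower for some colours.) Better to just count one column.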

In [20]:
df.groupby('Color')['Breed'].count()

Color
.                         489
APRICOT                   392
BEIGE                     314
BLACK                   15916
BLACK WITH WHITE          119
BLACK/BLUE                  9
BLACK/BRINDLE              53
BLACK/BROWN              7289
BLACK/BROWN/GREY           68
BLACK/BROWN/TAN           164
BLACK/CREAM                56
BLACK/FAWN                137
BLACK/GOLD                140
BLACK/GREY                251
BLACK/LIVER                 7
BLACK/MERLE                 5
BLACK/ORANGE                5
BLACK/ORANGE/BRINDLE        4
BLACK/RED                 272
BLACK/SILVER              216
BLACK/TAN                2363
BLACK/TAN/WHITE           323
BLONDE                   1089
BLUE                      457
BLUE/GOLD                  60
BLUE/TAN                   52
BRINDLE                  2822
BROWN                   11920
BROWN/BLONDE               14
BROWN/FAWN                 12
                        ...  
RED/BLONDE                 42
RED/BRINDLE                37
RED/

You'll note that what's returned is a `Series`, with an index of all the different colours in our DataFrame. That means we can treat it exactly the same as any other `Series`, sorting it, filtering it, and so on:

In [21]:
colour_counts = df.groupby('Color')['Breed'].count()
colour_counts.sort_values(ascending=False)

Color
BLACK                   15916
BROWN                   11920
WHITE                    7673
WHITE/BLACK              7489
BLACK/BROWN              7289
WHITE/BLACK/BROWN        5284
WHITE/BROWN              4689
SPOTTED                  4637
BRINDLE                  2822
GOLD                     2647
YELLOW                   2567
BLACK/TAN                2363
TAN                      2199
RED                      2186
MULTI                    1979
FAWN                     1745
WHITE/TAN                1205
BLONDE                   1089
GREY                      856
WHITE/RED                 610
SABLE                     609
OTHER                     593
CREAM                     521
.                         489
WHITE/GREY                465
BUFF                      460
BLUE                      457
CHOCOLATE                 429
APRICOT                   392
RED/BROWN                 326
                        ...  
BLACK WITH WHITE          119
WHITE/BLACK/GREY          112
MERL

In [22]:
brown = colour_counts.index.str.contains('BROWN')

colour_counts.loc[brown]

Color
BLACK/BROWN           7289
BLACK/BROWN/GREY        68
BLACK/BROWN/TAN        164
BROWN                11920
BROWN/BLONDE            14
BROWN/FAWN              12
BROWN/GOLD              36
BROWN/LIVER              6
BROWN/ORANGE             5
BROWN/TAN              186
BROWN/TAN/BRINDLE        7
RED/BROWN              326
WHITE/BLACK/BROWN     5284
WHITE/BROWN           4689
Name: Breed, dtype: int64
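Counting is only one of the aggregate functions mentioned earlier. The sketch below uses a made-up table with an invented numeric column, `WeightKg` (the real data has no such column), to show `mean` on a grouped column. It also shows the difference between `count`, which skips missing values, and `size`, which counts every row in the group.

```python
import pandas as pd

# Made-up data: WeightKg is an invented column, and one DogName is missing
dogs = pd.DataFrame({
    'Breed': ['BEAGLE', 'BEAGLE', 'POODLE', 'POODLE'],
    'DogName': ['BUTTER', None, 'YIP', 'DAISY'],
    'WeightKg': [10.0, 12.0, 20.0, 30.0],
})

grouped = dogs.groupby('Breed')

mean_weight = grouped['WeightKg'].mean()  # average weight per breed
named_dogs = grouped['DogName'].count()   # non-missing names per breed
rows_per_breed = grouped.size()           # all rows per breed

print(mean_weight)
print(named_dogs)
print(rows_per_breed)
```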

You can also group by multiple columns.

In [23]:
df.groupby(['Breed','ExpYear'])['Color'].count()

Breed                  ExpYear
.                      2007         4
                       2008        10
                       2009         4
AFFENPINSCHER          2007         2
                       2008         2
                       2009         1
AFGHAN HOUND           2007         3
                       2008         6
                       2009         3
AINU                   2007         2
AIREDALE TERRIER       2007        62
                       2008        72
                       2009        53
AKITA                  2007        96
                       2008       109
                       2009        67
AKITA MIX              2007        19
                       2008        35
                       2009        16
ALAPAHA BLUE BULLDO    2007         1
ALAS KLEE KAI          2008         3
                       2009         2
ALAS MALAMUTE          2007        41
                       2008        41
                       2009        38
ALASKAN SHEPHERD   

This gets complicated though, because what is returned is still a `Series`, but it has what is called a `MultiIndex` - an index with more than one level. This one has both "ExpYear" and "Breed". Accessing data in Series and DataFrames with a `MultiIndex` is a bit more complicated, and we won't go into it just yet. In the meantime, a handy way to deal with this - if you're grouping by just two columns - is by using the `unstack` function, which pivots your `MultiIndex` Series into a regular `DataFrame`, with columns for your second index level.

In [24]:
df.groupby(['Breed','ExpYear'])['Color'].count().unstack()

Breed / ExpYear,2007,2008,2009
.,4.0,10.0,4.0
AFFENPINSCHER,2.0,2.0,1.0
AFGHAN HOUND,3.0,6.0,3.0
AINU,2.0,,
AIREDALE TERRIER,62.0,72.0,53.0
AKITA,96.0,109.0,67.0
AKITA MIX,19.0,35.0,16.0
ALAPAHA BLUE BULLDO,1.0,,
ALAS KLEE KAI,,3.0,2.0
ALAS MALAMUTE,41.0,41.0,38.0
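The gaps in that table are combinations that never occur in the data, such as a breed with no registrations in a given year. When you're counting, a zero is usually a more honest answer than a blank. Here is a sketch on invented data using the `fill_value` argument of `unstack` to fill those gaps.

```python
import pandas as pd

# Made-up data: AINU only appears in 2007
dogs = pd.DataFrame({
    'Breed': ['AINU', 'AKITA', 'AKITA', 'AKITA'],
    'ExpYear': [2007, 2007, 2008, 2008],
    'Color': ['RED', 'WHITE', 'WHITE', 'BLACK'],
})

counts = dogs.groupby(['Breed', 'ExpYear'])['Color'].count()

# Default: a missing combination becomes NaN
wide = counts.unstack()

# fill_value=0 puts a zero there instead
wide_filled = counts.unstack(fill_value=0)
print(wide_filled)
```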


The other way to handle this is to use the `reset_index` function, which here, inscrutably, turns the Series into a `DataFrame`, moving each level of the index into an ordinary column, so you get a column for each of your `groupby` values.

In [25]:
breed_by_year = df.groupby(['Breed','ExpYear', 'DogName'])['Color'].count()
breed_by_year.reset_index()

Unnamed: 0,Breed,ExpYear,DogName,Color
0,.,2007,COCO,1
1,.,2007,LADY,1
2,.,2007,MAGGIE,1
3,.,2007,PRINCESS,1
4,.,2008,BRUNO,1
5,.,2008,CHOMPERS,1
6,.,2008,LADY,1
7,.,2008,MAIA,1
8,.,2008,MAZIE,1
9,.,2008,PRINCESS,1


Note that the final column is called "Color", but what it actually contains is a count of _rows_ in the Color column. It's just that this is the column we happened to choose to count. This can get confusing.

Also, "Chompers" is a great name.
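One way to sidestep the confusing column name is the `name` argument of `reset_index`, which sets the name of the column holding the values. The sketch below uses made-up data, and "Registrations" is just a label I've picked for illustration.

```python
import pandas as pd

# Made-up data for illustration
dogs = pd.DataFrame({
    'Breed': ['AKITA', 'AKITA', 'BEAGLE'],
    'ExpYear': [2007, 2007, 2008],
    'Color': ['WHITE', 'BLACK', 'BROWN'],
})

counts = dogs.groupby(['Breed', 'ExpYear'])['Color'].count()

# name= labels the count column, instead of reusing "Color"
table = counts.reset_index(name='Registrations')
print(table)
```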

### Grouping by a Series

You'll notice that we've always grouped by the name of a column in the DataFrame. When we do that, there's actually some secret magic happening in the background, which is that Pandas looks in the DataFrame for a column with that name, and then uses that `Series` to group the DataFrame. You can shortcut that by explicitly passing a `Series`.

In [26]:
df.groupby(df['Breed'])

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x0000020730E3DF60>

Or a list of `Series` objects.

In [27]:
df.groupby([df['Breed'], df['DogName']])

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x0000020730E3D320>

Most of the time it's more straightforward just to name the columns, but from time to time you might want to group a DataFrame by some Series that's not a column in your DataFrame. This will work fine, provided (as always) the indexes align.

For example, here we're grouping the DataFrame by a `Series` which contains random numbers from zero to nine - effectively making ten random samples.

In [28]:
import numpy as np  # pd.np is no longer available in recent versions of Pandas

random_numbers = pd.Series(np.random.randint(0, 10, size=len(df)))
df.groupby(random_numbers)['Color'].count()

0    9928
1    9834
2    9621
3    9665
4    9715
5    9652
6    9651
7    9731
8    9800
9    9865
Name: Color, dtype: int64

One useful application for this is to group by a boolean series. This is a bit complicated, but here we are comparing the number of brown dogs to dogs of every other colour, by year.

In [29]:
brown = df['Color'].str.contains('BROWN')

brown.name = 'Brown' # This just makes the table look nicer at the end

df.groupby([df['ExpYear'], brown])['DogName'].count().unstack()

ExpYear / Brown,False,True
2007,20641,9220
2008,25377,10979
2009,21438,9807
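The same pattern works with any Series you derive yourself, not just booleans. As a sketch on invented data, the example below turns the boolean Series into readable labels with `map` before grouping, which gives friendlier column headings than `False` and `True`. The label names are my own choice.

```python
import pandas as pd

# Made-up data for illustration
dogs = pd.DataFrame({
    'Color': ['BROWN', 'WHITE/BROWN', 'BLACK', 'WHITE'],
    'ExpYear': [2007, 2007, 2007, 2008],
    'DogName': ['SABLE', 'XENA', 'DAISY', 'YIP'],
})

# na=False treats a missing colour as "not brown"
is_brown = dogs['Color'].str.contains('BROWN', na=False)

# Swap True/False for readable labels, and name the Series for the table header
shade = is_brown.map({True: 'brown', False: 'other'})
shade.name = 'Shade'

# Group by a column name and our own Series at the same time
table = dogs.groupby(['ExpYear', shade])['DogName'].count().unstack(fill_value=0)
print(table)
```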


### Problem Three

* Which zip code has the most dog registrations?
* What was the most common name for each of 2007, 2008, and 2009?
* What is the average number of dog registrations by year?