### Grouping

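If you're familiar with SQL and the `group by` clause, then this will look familiar, with one important exception: in SQL, the order in which rows are selected does not affect the grouping, because all rows with the same key end up in the same group (as if the rows were implicitly sorted by the group key). That is not the case here: `itertools.groupby` only groups consecutive elements that share the same key.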
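
To make that difference concrete, here is a minimal sketch using a small made-up sequence (`unsorted_data` is just an illustration, not part of the car data set). It shows that keys repeat when equal elements are not adjacent, and that sorting first gives the SQL-like result:

```python
import itertools

unsorted_data = [1, 2, 1, 2, 3]

# groupby only merges *consecutive* equal keys, so keys repeat here
keys_unsorted = [key for key, _ in itertools.groupby(unsorted_data)]

# sorting first brings equal keys together, which is closer to SQL's behavior
keys_sorted = [key for key, _ in itertools.groupby(sorted(unsorted_data))]

print(keys_unsorted)  # [1, 2, 1, 2, 3]
print(keys_sorted)    # [1, 2, 3]
```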

If you're not familiar with the `group by` in SQL, let's consider an example to understand what's going on:

Let's look at the file `cars_2014.csv`:

In [1]:
import itertools

with open('cars_2014.csv') as f:
    for row in itertools.islice(f, 0, 20):
        print(row, end='')

make,model
ACURA,ILX
ACURA,MDX
ACURA,RDX
ACURA,RLX
ACURA,TL
ACURA,TSX
ALFA ROMEO,4C
ALFA ROMEO,GIULIETTA
APRILIA,CAPONORD 1200
APRILIA,RSV4 FACTORY APRC ABS
APRILIA,RSV4 R APRC ABS
APRILIA,SHIVER 750
ARCTIC CAT,1000 XT
ARCTIC CAT,500 XT
ARCTIC CAT,550 XT
ARCTIC CAT,700 LTD
ARCTIC CAT,700 SUPER DUTY DIESEL
ARCTIC CAT,700 XT
ARCTIC CAT,90 2X4 4-STROKE


This file contains car makes and models, ordered by make (so all rows for the same make are already together in the file) and then by model.

We may want to know how many models exist for each make.

This is what a group by is used for: we need to make groups of makes, then count the number of items in each group.

Trivial to do with SQL, but a little more work with Python.
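
For comparison, here is roughly what the SQL version could look like, run through Python's built-in `sqlite3` module. This is only a sketch: the in-memory `cars` table and the handful of rows are assumptions standing in for the CSV file, and the rows are deliberately out of order to show that SQL does not care.

```python
import sqlite3

# Hypothetical in-memory table standing in for cars_2014.csv
rows = [
    ('ACURA', 'ILX'),
    ('ALFA ROMEO', '4C'),
    ('ACURA', 'MDX'),  # deliberately out of order
    ('ALFA ROMEO', 'GIULIETTA'),
    ('ACURA', 'RDX'),
]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cars (make TEXT, model TEXT)')
conn.executemany('INSERT INTO cars VALUES (?, ?)', rows)

# GROUP BY collects all rows with the same make, whatever their order
query = 'SELECT make, COUNT(*) FROM cars GROUP BY make ORDER BY make'
sql_counts = conn.execute(query).fetchall()
conn.close()

print(sql_counts)  # [('ACURA', 3), ('ALFA ROMEO', 2)]
```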

We might try doing it this way:

In [2]:
from collections import defaultdict

makes = defaultdict(int)

with open('cars_2014.csv') as f:
    next(f)  # skip header row
    for row in f:
        make, _ = row.strip('\n').split(',')
        makes[make] += 1
        
for key, value in makes.items():
    print(f'{key}: {value}')

ACURA: 6
ALFA ROMEO: 2
APRILIA: 4
ARCTIC CAT: 96
ARGO: 4
ASTON MARTIN: 5
AUDI: 27
BENTLEY: 2
BLUE BIRD: 1
BMW: 86
BUGATTI: 1
BUICK: 5
CADILLAC: 7
CAN-AM: 61
CHEVROLET: 33
CHRYSLER: 2
DODGE: 7
DUCATI: 4
FERRARI: 6
FIAT: 2
FORD: 34
FREIGHTLINER: 7
GMC: 12
HARLEY DAVIDSON: 29
HINO: 7
HONDA: 91
HUSABERG: 4
HUSQVARNA: 9
HYUNDAI: 13
INDIAN: 3
INFINITI: 8
JAGUAR: 9
JEEP: 5
JOHN DEERE: 19
KAWASAKI: 59
KENWORTH: 11
KIA: 10
KTM: 13
KUBOTA: 4
KYMCO: 28
LAMBORGHINI: 2
LAND ROVER: 6
LEXUS: 14
LINCOLN: 6
LOTUS: 1
MACK: 9
MASERATI: 3
MAZDA: 5
MCLAREN: 2
MERCEDES-BENZ: 60
MINI: 3
MITSUBISHI: 8
NISSAN: 24
PEUGEOT: 3
POLARIS: 101
PORSCHE: 4
RAM: 6
RENAULT: 4
ROLLS ROYCE: 3
SCION: 5
SEAT: 3
SKI-DOO: 67
SMART: 1
SRT: 1
SUBARU: 10
SUZUKI: 48
TESLA: 2
TOYOTA: 19
TRIUMPH: 10
VESPA: 4
VICTORY: 14
VOLKSWAGEN: 16
VOLVO: 8
YAMAHA: 110


Instead of doing all this, we could use the `groupby` function in `itertools`.

Again, it is a lazy iterator, so we'll use lists to see what's happening - but let's use a slightly smaller data set as an example first:

In [3]:
data = (1, 1, 2, 2, 3)

In [4]:
list(itertools.groupby(data))

[(1, <itertools._grouper at 0x204a6988dd8>),
 (2, <itertools._grouper at 0x204a69883c8>),
 (3, <itertools._grouper at 0x204a6988208>)]

As you can see, we ended up with an iterable of tuples. The first element of each tuple is a group key, here the distinct values in `data`: `1`, `2`, and `3`. But what's in the second element of each tuple? Well, it's an iterator, but what does it contain?

In [5]:
it = itertools.groupby(data)
for group in it:
    print(group[0], list(group[1]))

1 [1, 1]
2 [2, 2]
3 [3]


Basically it just contained the grouped elements themselves.

This might seem a bit confusing at first, so let's look at the optional second argument of `groupby`: `key`. The idea behind it is the same as the key functions we have used for sorting and filtering in the past: it is a **function** that returns a grouping key for each element.

Let's try it out with a simple example:

In [6]:
data = (
    (1, 'abc'),
    (1, 'bcd'),
   
    (2, 'pyt'),
    (2, 'yth'),
    (2, 'tho'),
    
    (3, 'hon')
)

So we want to group the data, using the first item of each tuple as the group key:

In [7]:
groups = list(itertools.groupby(data, key=lambda x: x[0]))

In [8]:
print(groups)

[(1, <itertools._grouper object at 0x00000204A6990C50>), (2, <itertools._grouper object at 0x00000204A6990BE0>), (3, <itertools._grouper object at 0x00000204A6990BA8>)]


Once again you'll notice that we have the group keys, each paired with an iterator. Let's see what those iterators contain:

In [9]:
groups = itertools.groupby(data, key=lambda x: x[0])
for group in groups:
    print(group[0], list(group[1]))

1 [(1, 'abc'), (1, 'bcd')]
2 [(2, 'pyt'), (2, 'yth'), (2, 'tho')]
3 [(3, 'hon')]
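
The key does not have to be a lambda, and it does not have to pick out an existing field. As a small sketch (the `pairs` and `words` sequences are made up for illustration), `operator.itemgetter(0)` can stand in for `lambda x: x[0]`, and a computed key such as `len` works just as well, provided the data is already ordered by that key:

```python
import itertools
from operator import itemgetter

pairs = ((1, 'abc'), (1, 'bcd'), (2, 'pyt'), (3, 'hon'))

# itemgetter(0) behaves like lambda x: x[0]
by_first = [
    (key, list(items))
    for key, items in itertools.groupby(pairs, key=itemgetter(0))
]

words = ['a', 'b', 'ab', 'cd', 'abc']  # already ordered by length

# the grouping key is computed from each element
by_length = [
    (key, list(items))
    for key, items in itertools.groupby(words, key=len)
]

print(by_first)   # [(1, [(1, 'abc'), (1, 'bcd')]), (2, [(2, 'pyt')]), (3, [(3, 'hon')])]
print(by_length)  # [(1, ['a', 'b']), (2, ['ab', 'cd']), (3, ['abc'])]
```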


So now let's go back to our car make example.

We want to get all the makes and how many models are in each make.

We could start approaching it this way:

In [10]:
with open('cars_2014.csv') as f:
    make_groups = itertools.groupby(f, key=lambda x: x.split(',')[0])

In [11]:
list(itertools.islice(make_groups, 5))

ValueError: I/O operation on closed file.

What's going on?

Remember that `groupby` is a **lazy** iterator. This means it did not actually do any work when we called it apart from setting up the iterator.

When we called `list()` on that iterator, **then** it went ahead and tried to do the iteration.

However, our `with` (context manager) closed the file by then!

So we will need to do our work inside the context manager.

In [12]:
with open('cars_2014.csv') as f:
    next(f)  # skip header row
    make_groups = itertools.groupby(f, key=lambda x: x.split(',')[0])
    print(list(itertools.islice(make_groups, 5)))

[('ACURA', <itertools._grouper object at 0x00000204A69974A8>), ('ALFA ROMEO', <itertools._grouper object at 0x00000204A6997438>), ('APRILIA', <itertools._grouper object at 0x00000204A65C01D0>), ('ARCTIC CAT', <itertools._grouper object at 0x00000204A6990198>), ('ARGO', <itertools._grouper object at 0x00000204A69885F8>)]


Next, we need to know how many items are in each `itertools._grouper` iterator.

How about calling the `len()` function on the iterator?

In [13]:
with open('cars_2014.csv') as f:
    next(f)  # skip header row
    make_groups = itertools.groupby(f, key=lambda x: x.split(',')[0])
    make_counts = ((key, len(models)) for key, models in make_groups)
    print(list(make_counts))

TypeError: object of type 'itertools._grouper' has no len()

Aww... Iterators don't necessarily implement a `__len__` method - and this one definitely does not.
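
The same thing happens with other iterators. The sketch below uses a plain list iterator (`numbers` is just an illustrative name) to show that `len()` fails, and that counting by iteration works but consumes the iterator in the process:

```python
numbers = iter([10, 20, 30])

error_message = None
try:
    len(numbers)
except TypeError as ex:
    # list iterators do not implement __len__ either
    error_message = str(ex)

# Counting by iterating works, but it consumes the iterator
count = sum(1 for _ in numbers)
remaining = list(numbers)

print(error_message)
print(count)      # 3
print(remaining)  # []
```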

Well, if we think about this, we could simply "replace" each element in the models with a `1`, and sum that up...

In [14]:
with open('cars_2014.csv') as f:
    next(f)  # skip header row
    make_groups = itertools.groupby(f, key=lambda x: x.split(',')[0])
    make_counts = ((key, sum(1 for model in models)) 
                    for key, models in make_groups)
    print(list(make_counts))

[('ACURA', 6), ('ALFA ROMEO', 2), ('APRILIA', 4), ('ARCTIC CAT', 96), ('ARGO', 4), ('ASTON MARTIN', 5), ('AUDI', 27), ('BENTLEY', 2), ('BLUE BIRD', 1), ('BMW', 86), ('BUGATTI', 1), ('BUICK', 5), ('CADILLAC', 7), ('CAN-AM', 61), ('CHEVROLET', 33), ('CHRYSLER', 2), ('DODGE', 7), ('DUCATI', 4), ('FERRARI', 6), ('FIAT', 2), ('FORD', 34), ('FREIGHTLINER', 7), ('GMC', 12), ('HARLEY DAVIDSON', 29), ('HINO', 7), ('HONDA', 91), ('HUSABERG', 4), ('HUSQVARNA', 9), ('HYUNDAI', 13), ('INDIAN', 3), ('INFINITI', 8), ('JAGUAR', 9), ('JEEP', 5), ('JOHN DEERE', 19), ('KAWASAKI', 59), ('KENWORTH', 11), ('KIA', 10), ('KTM', 13), ('KUBOTA', 4), ('KYMCO', 28), ('LAMBORGHINI', 2), ('LAND ROVER', 6), ('LEXUS', 14), ('LINCOLN', 6), ('LOTUS', 1), ('MACK', 9), ('MASERATI', 3), ('MAZDA', 5), ('MCLAREN', 2), ('MERCEDES-BENZ', 60), ('MINI', 3), ('MITSUBISHI', 8), ('NISSAN', 24), ('PEUGEOT', 3), ('POLARIS', 101), ('PORSCHE', 4), ('RAM', 6), ('RENAULT', 4), ('ROLLS ROYCE', 3), ('SCION', 5), ('SEAT', 3), ('SKI-DOO', 6
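
If this logic is needed in more than one place, it could be wrapped in a generator function. The following is only a sketch: `count_by_make` is a hypothetical helper name, and an `io.StringIO` object with a few rows stands in for the real file so that the example is self-contained. With a real file, the result would still have to be consumed inside the `with` block, for the same reason as before.

```python
import io
import itertools


def count_by_make(lines):
    """Yield (make, count) pairs from CSV-like lines already ordered by make."""
    groups = itertools.groupby(lines, key=lambda line: line.split(',')[0])
    for make, rows in groups:
        # consume each group before groupby advances to the next one
        yield make, sum(1 for _ in rows)


# io.StringIO stands in for the open file (assumed sample rows)
sample = io.StringIO(
    'make,model\n'
    'ACURA,ILX\n'
    'ACURA,MDX\n'
    'ALFA ROMEO,4C\n'
)
next(sample)  # skip header row
sample_counts = list(count_by_make(sample))

print(sample_counts)  # [('ACURA', 2), ('ALFA ROMEO', 1)]
```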

#### Caveat

I want to show you something that you may find odd at first. Notice how I iterated through the groups.

Maybe I want to be able to iterate multiple times through that iterator, so let's make a list out of it first:

In [15]:
groups = list(itertools.groupby(data, key=lambda x: x[0]))
for group in groups:
    print(group[0], group[1])

1 <itertools._grouper object at 0x00000204A6A33080>
2 <itertools._grouper object at 0x00000204A6A330B8>
3 <itertools._grouper object at 0x00000204A6A33128>


Ok, so this looks fine - we now have a list containing tuples: the first element is the group key, the second is an iterator. We can check that easily:

In [16]:
it = groups[0][1]

In [17]:
iter(it) is it

True

So yes, this is an iterator - what's in it?

In [18]:
list(it)

[]

Empty?? But we did not iterate through it - what happened?

Let's try again, just in case calling the `iter` function did something odd:

In [19]:
groups = list(itertools.groupby(data, key=lambda x: x[0]))
for group in groups:
    print(group[0], list(group[1]))

1 []
2 []
3 [(3, 'hon')]


So, the 3rd element is OK, but looks like the first two got exhausted somehow...

Let's make sure they are indeed exhausted:

In [20]:
groups = list(itertools.groupby(data, key=lambda x: x[0]))

In [21]:
next(groups[0][1])

StopIteration: 

In [22]:
next(groups[1][1])

StopIteration: 

In [23]:
next(groups[2][1])

(3, 'hon')

So, yes, the first two were exhausted when we converted the groups to a list.

The solution here is actually in the Python docs. Let's take a look:

```
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list
```

The key thing here is that the elements yielded from the different groups are using the **same** underlying iterable over all the elements. As the documentation states, when we advance to the next group, the previous one's iterator is automatically exhausted - it basically iterates over all the elements until it hits the next group key.

Let's see this by stepping through the iteration manually:

In [24]:
groups = itertools.groupby(data, key=lambda x: x[0])

In [25]:
group1 = next(groups)

In [26]:
group1

(1, <itertools._grouper at 0x204a69905f8>)

And the iterator in the tuple is not exhausted:

In [27]:
next(group1[1])

(1, 'abc')

Now, let's try again, but this time we'll advance to group2, and see what is in `group1`'s iterator:

In [28]:
groups = itertools.groupby(data, key=lambda x: x[0])

In [29]:
group1 = next(groups)

In [30]:
group2 = next(groups)

Now `group1`'s iterator has been exhausted (because we moved to `group2`):

In [31]:
next(group1[1])

StopIteration: 

But `group2`'s iterator is still OK:

In [32]:
next(group2[1])

(2, 'pyt')

We know that there are still two elements in `group2`, so let's advance to `group3` and go back and see what's left in `group2`'s iterator:

In [33]:
group3 = next(groups)

In [34]:
next(group2[1])

StopIteration: 

But `group3`'s iterator is just fine:

In [35]:
next(group3[1])

(3, 'hon')

So, be careful with `groupby()`: if you want to save all the data into a list, you cannot first convert the groups iterator into a list. You **must** step through the groups iterator and retrieve each individual iterator's elements into a list before advancing to the next group, the way we did in the earlier examples, or simply use a comprehension:

In [36]:
groups = itertools.groupby(data, key=lambda x: x[0])

In [37]:
groups_list = [(key, list(items)) for key, items in groups]

In [38]:
groups_list

[(1, [(1, 'abc'), (1, 'bcd')]),
 (2, [(2, 'pyt'), (2, 'yth'), (2, 'tho')]),
 (3, [(3, 'hon')])]
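
The same idea works for building a dictionary of lists, which is often handier for lookups. The sketch below also sorts first, since the made-up `records` tuple is not ordered by key; `first` is just an illustrative helper used both as the sort key and as the grouping key.

```python
import itertools

# not ordered by key, so we sort before grouping
records = ((2, 'pyt'), (1, 'abc'), (2, 'yth'), (1, 'bcd'), (3, 'hon'))


def first(item):
    return item[0]


# sorted() is stable, so items keep their relative order within each group
grouped = {
    key: list(items)  # materialize each group before moving on
    for key, items in itertools.groupby(sorted(records, key=first), key=first)
}

print(grouped)
# {1: [(1, 'abc'), (1, 'bcd')], 2: [(2, 'pyt'), (2, 'yth')], 3: [(3, 'hon')]}
```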