# 3. Summary functions and mapping
---
### 3.1. Summary functions
---
`pandas` provides many simple "summary functions" that restructure the data in a useful way. The `describe` method is one of the most used. It generates a high-level summary of the attributes of the given column. The output depends on the `dtype` of the column; the one shown below only makes sense for numerical data.

In [1]:
import pandas as pd
battles_got = pd.read_csv('datasets/battles.csv')
battles_got.attacker_size.describe()

count        24.000000
mean       9942.541667
std       20283.092065
min          20.000000
25%        1375.000000
50%        4000.000000
75%        8250.000000
max      100000.000000
Name: attacker_size, dtype: float64

For string data we get a different output.

In [2]:
battles_got.attacker_king.describe()

count                           36
unique                           4
top       Joffrey/Tommen Baratheon
freq                            14
Name: attacker_king, dtype: object
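
To make the difference concrete, here is a minimal sketch on a small made-up `DataFrame` (the `king` and `army_size` columns and their values are hypothetical, not taken from `battles.csv`). It also suggests why the two `count` values above differ: `count` only includes non-missing entries.

```python
import pandas as pd

# Hypothetical sample data, not the battles dataset used in this tutorial.
df = pd.DataFrame({
    "king": ["Robb", "Joffrey", "Joffrey", "Stannis"],
    "army_size": [100.0, 200.0, 300.0, None],  # one missing value
})

# Numerical column: count, mean, std, min, quartiles, max.
numeric_summary = df["army_size"].describe()

# String column: count, unique, top (most frequent value), freq.
string_summary = df["king"].describe()

print(numeric_summary)
print(string_summary)
```

The missing `army_size` is left out of `count`, so the numerical summary reports 3 entries while the string summary reports 4.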

We can also compute a single summary statistic for a column of a `DataFrame` or for a `Series`. For example, the `mean` function returns the mean of the selected data.

In [3]:
battles_got.attacker_size.mean()

9942.541666666666

To see a list of the unique values we can use the `unique` function.

In [4]:
battles_got.defender_king.unique()

array(['Robb Stark', 'Joffrey/Tommen Baratheon', 'Balon/Euron Greyjoy',
       'Renly Baratheon', nan, 'Mance Rayder', 'Stannis Baratheon'],
      dtype=object)

To see a list of the unique values and how often they occur in the dataset, we can use the `value_counts` method.

In [5]:
battles_got.defender_king.value_counts()

Robb Stark                  14
Joffrey/Tommen Baratheon    13
Balon/Euron Greyjoy          4
Stannis Baratheon            2
Renly Baratheon              1
Mance Rayder                 1
Name: defender_king, dtype: int64
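
Note that `unique` listed `nan` above while `value_counts` did not. A small sketch on a made-up `Series` (hypothetical values) shows this behaviour and the `dropna` parameter, which keeps the missing values in the count:

```python
import pandas as pd

# Hypothetical sample column with one missing value.
kings = pd.Series(["Robb Stark", "Robb Stark", float("nan"), "Mance Rayder"])

distinct = kings.unique()                            # includes the missing value
counts = kings.value_counts()                        # missing values are skipped by default
counts_with_nan = kings.value_counts(dropna=False)   # missing values get their own row

print(distinct)
print(counts)
print(counts_with_nan)
```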

### 3.2. Mapping functions
---
A "map" is a term borrowed from mathematics for a function that takes one set of values and converts, or "maps", it to another set of values, usually in a format we want.

There are two mapping functions that are often used. The `Series` `map` method is the first and the simpler one. `map` takes every value in the column it is called on and converts it to a new value using a function you provide. It is called on a `Series`, and the function receives one value at a time.

In [6]:
attacker_size_mean = battles_got.attacker_size.mean()
battles_got.head().attacker_size.map(lambda p: p - attacker_size_mean)

0    5057.458333
1            NaN
2    5057.458333
3    8057.458333
4   -8067.541667
Name: attacker_size, dtype: float64
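
The `NaN` in row 1 comes from a missing `attacker_size`. A minimal sketch with made-up numbers (not the real dataset) shows how `map` handles this: the function is called once per value, a missing value stays missing, and the original `Series` is left unchanged.

```python
import pandas as pd

# Hypothetical sizes; the second entry is missing.
sizes = pd.Series([15000.0, None, 1875.0])

# mean() skips the missing entry: (15000 + 1875) / 2
center = sizes.mean()

# The lambda receives one scalar at a time; NaN - center is still NaN.
recentered = sizes.map(lambda p: p - center)

print(recentered)
```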

The `DataFrame` `apply` method does the same kind of transformation at the level of the entire dataset, by calling a custom function on each row. It is called on a `DataFrame`; with `axis='columns'`, the function receives each row as a `Series`.

In [7]:
def remean_attacker_size(srs):
    srs.attacker_size = srs.attacker_size - attacker_size_mean
    return srs

battles_got.head().apply(remean_attacker_size, axis='columns')

Unnamed: 0,name,year,battle_number,attacker_king,defender_king,attacker_1,attacker_2,attacker_3,attacker_4,defender_1,...,major_death,major_capture,attacker_size,defender_size,attacker_commander,defender_commander,summer,location,region,note
0,Battle of the Golden Tooth,298,1,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,...,1.0,0.0,5057.458333,4000.0,Jaime Lannister,"Clement Piper, Vance",1.0,Golden Tooth,The Westerlands,
1,Battle at the Mummer's Ford,298,2,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Baratheon,...,1.0,0.0,,120.0,Gregor Clegane,Beric Dondarrion,1.0,Mummer's Ford,The Riverlands,
2,Battle of Riverrun,298,3,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,...,0.0,1.0,5057.458333,10000.0,"Jaime Lannister, Andros Brax","Edmure Tully, Tytos Blackwood",1.0,Riverrun,The Riverlands,
3,Battle of the Green Fork,298,4,Robb Stark,Joffrey/Tommen Baratheon,Stark,,,,Lannister,...,1.0,1.0,8057.458333,20000.0,"Roose Bolton, Wylis Manderly, Medger Cerwyn, H...","Tywin Lannister, Gregor Clegane, Kevan Lannist...",1.0,Green Fork,The Riverlands,
4,Battle of the Whispering Wood,298,5,Robb Stark,Joffrey/Tommen Baratheon,Stark,Tully,,,Lannister,...,1.0,1.0,-8067.541667,6000.0,"Robb Stark, Brynden Tully",Jaime Lannister,1.0,Whispering Wood,The Riverlands,
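
The same pattern can be sketched on a tiny made-up frame (the values are hypothetical, and `remean_row` is an illustrative name). This variant copies the row before changing it, which is a defensive choice on our side rather than something `apply` requires.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the battles data.
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "attacker_size": [10.0, 20.0, 60.0],
})
size_mean = df["attacker_size"].mean()  # 30.0

def remean_row(row):
    # With axis="columns", `row` is a Series holding one row of the frame.
    row = row.copy()  # work on a copy so the input is never touched
    row["attacker_size"] = row["attacker_size"] - size_mean
    return row

remeaned = df.apply(remean_row, axis="columns")

print(remeaned)
```

`apply` returns a new `DataFrame`; `df` itself still holds the original sizes.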


`pandas` can also apply operators between `Series` of equal length. For example, we can combine information from two columns of the dataset with the `+` operator.

In [8]:
battles_got.head().attacker_king + " vs " + battles_got.head().defender_king

0    Joffrey/Tommen Baratheon vs Robb Stark
1    Joffrey/Tommen Baratheon vs Robb Stark
2    Joffrey/Tommen Baratheon vs Robb Stark
3    Robb Stark vs Joffrey/Tommen Baratheon
4    Robb Stark vs Joffrey/Tommen Baratheon
dtype: object

The standard operators (`+`, `-`, `>`, `<`, `==`...) are faster than `map` or `apply`, but they are less flexible: `map` and `apply` can run arbitrary logic, such as conditionals, on each value or row.
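
As a rough illustration, the re-centering done earlier with `map` can be written with a plain operator. The values below are made up for the sketch:

```python
import pandas as pd

sizes = pd.Series([10.0, 20.0, 60.0])  # hypothetical values
size_mean = sizes.mean()               # 30.0

via_map = sizes.map(lambda p: p - size_mean)  # one Python call per element
via_operator = sizes - size_mean              # a single vectorized operation

same_result = via_map.equals(via_operator)

# Comparison operators work element by element in the same way.
above_mean = sizes > size_mean

print(same_result)
print(above_mean)
```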

# 4. Grouping and Sorting
---
### 4.1. Grouping
---
Sometimes we want to split our data into groups and then compute something for each group. To do this, we can use the `groupby` operation.

For example, we can replicate what `value_counts` does using `groupby` by doing the following:

In [9]:
battles_got.groupby('attacker_king').attacker_king.count()

attacker_king
Balon/Euron Greyjoy          7
Joffrey/Tommen Baratheon    14
Robb Stark                  10
Stannis Baratheon            5
Name: attacker_king, dtype: int64

In this case, `groupby` created one group per distinct `attacker_king` value, and `count` returned how many times each value appears.
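
To check the correspondence with `value_counts`, we can compare both results on a small made-up frame (hypothetical values). The counts match; only the ordering differs, since `groupby` sorts by group label and `value_counts` sorts by count.

```python
import pandas as pd

# Hypothetical sample data.
kings = pd.DataFrame(
    {"attacker_king": ["Robb", "Joffrey", "Robb", "Robb", "Joffrey"]}
)

by_group = kings.groupby("attacker_king").attacker_king.count()
by_value_counts = kings.attacker_king.value_counts()

# Same label -> count pairs, regardless of ordering.
same_counts = by_group.to_dict() == by_value_counts.to_dict()

print(by_group)
print(by_value_counts)
```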

`value_counts` is effectively a shortcut for this `groupby` operation, although it sorts the result by count rather than by group label. We can also use any of the summary functions with groups.

In [10]:
battles_got.groupby('attacker_king').attacker_size.min()

attacker_king
Balon/Euron Greyjoy           20.0
Joffrey/Tommen Baratheon     618.0
Robb Stark                   100.0
Stannis Baratheon           4500.0
Name: attacker_size, dtype: float64