# 2. Indexing, selecting, assigning reference
---
### 2.1. Native accessors
---
In Python, we read a property of an object by accessing it as an attribute. Columns in a `pandas` `DataFrame` work in much the same way. To read one column of the loaded CSV file, we access it as an attribute named after that column:

In [1]:
import pandas as pd
battles_got = pd.read_csv('datasets/battles.csv')
battles_got.head().name

0       Battle of the Golden Tooth
1      Battle at the Mummer's Ford
2               Battle of Riverrun
3         Battle of the Green Fork
4    Battle of the Whispering Wood
Name: name, dtype: object

We can do the same with the indexing (`[]`) operator:

In [2]:
battles_got.head()['name']

0       Battle of the Golden Tooth
1      Battle at the Mummer's Ford
2               Battle of Riverrun
3         Battle of the Green Fork
4    Battle of the Whispering Wood
Name: name, dtype: object

If we only want one specific value, we apply the indexing operator a second time to pick the row we want from that column.

In [3]:
battles_got['name'][0]

'Battle of the Golden Tooth'
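
The two access styles are interchangeable for simple column names, but only the bracket form works when a column name is not a valid Python identifier. The following is a minimal sketch on a small made-up frame; `toy` and its `battle type` column (with a space) are illustrative assumptions, not part of the dataset above.

```python
import pandas as pd

# Hypothetical toy data standing in for battles.csv
toy = pd.DataFrame({
    "name": ["Battle of the Golden Tooth", "Battle of Riverrun"],
    "battle type": ["pitched battle", "pitched battle"],
})

# Attribute access and bracket access return the same column
same = toy.name.equals(toy["name"])

# Chained indexing: the column first, then the row label
first_name = toy["name"][0]

# A column name containing a space is only reachable with brackets
first_type = toy["battle type"][0]
```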

### 2.2. Index-based selection
---
To select some specific data you can use accessor operators (`loc` and `iloc`). The `iloc` operator is used to select data based on the position of the data (numerical position in the data).

In [4]:
battles_got.iloc[0]

name                  Battle of the Golden Tooth
year                                         298
battle_number                                  1
attacker_king           Joffrey/Tommen Baratheon
defender_king                         Robb Stark
attacker_1                             Lannister
attacker_2                                   NaN
attacker_3                                   NaN
attacker_4                                   NaN
defender_1                                 Tully
defender_2                                   NaN
defender_3                                   NaN
defender_4                                   NaN
attacker_outcome                             win
battle_type                       pitched battle
major_death                                    1
major_capture                                  0
attacker_size                              15000
defender_size                               4000
attacker_commander               Jaime Lannister
defender_commander  

Both `loc` and `iloc` are row-first, column-second. This is the opposite of the native accessors above, such as `battles_got['name'][0]`, where the column comes first and the row second.
To get a column with `iloc`, we can do the following:

In [5]:
battles_got.head().iloc[:,4]

0                  Robb Stark
1                  Robb Stark
2                  Robb Stark
3    Joffrey/Tommen Baratheon
4    Joffrey/Tommen Baratheon
Name: defender_king, dtype: object

The `:` operator means "everything". It can also be used to indicate a range of values. For example, to select the first two rows we would do:

In [6]:
battles_got.iloc[:2, 4]

0    Robb Stark
1    Robb Stark
Name: defender_king, dtype: object

Or to select only the second and third rows:

In [7]:
battles_got.iloc[1:3, 4]

1    Robb Stark
2    Robb Stark
Name: defender_king, dtype: object

It's also possible to pass a list:

In [8]:
battles_got.iloc[[0,1,4],4]

0                  Robb Stark
1                  Robb Stark
4    Joffrey/Tommen Baratheon
Name: defender_king, dtype: object

Negative numbers can also be used in selection. They count backwards from the end of the values, so `-2:` selects the last two rows.

In [9]:
battles_got.iloc[-2:]

Unnamed: 0,name,year,battle_number,attacker_king,defender_king,attacker_1,attacker_2,attacker_3,attacker_4,defender_1,...,major_death,major_capture,attacker_size,defender_size,attacker_commander,defender_commander,summer,location,region,note
36,Siege of Raventree,300,37,Joffrey/Tommen Baratheon,Robb Stark,Bracken,Lannister,,,Blackwood,...,0.0,1.0,1500.0,,"Jonos Bracken, Jaime Lannister",Tytos Blackwood,0.0,Raventree,The Riverlands,
37,Siege of Winterfell,300,38,Stannis Baratheon,Joffrey/Tommen Baratheon,Baratheon,Karstark,Mormont,Glover,Bolton,...,,,5000.0,8000.0,Stannis Baratheon,Roose Bolton,0.0,Winterfell,The North,
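
The position-based forms above can be collected in one place. This is a sketch on a made-up frame (`toy` is an assumption); its index labels are deliberately not 0, 1, 2, ... to show that `iloc` ignores labels and looks only at positions.

```python
import pandas as pd

# Hypothetical toy frame; the index labels are deliberately not 0..n-1
toy = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "year": [298, 298, 299, 300]},
    index=[10, 20, 30, 40],
)

first_row = toy.iloc[0]        # first row, whatever its label is
year_col = toy.iloc[:, 1]      # every row of the second column
middle = toy.iloc[1:3, 0]      # second and third rows of the first column
picked = toy.iloc[[0, 3], 0]   # rows chosen by a list of positions
last_two = toy.iloc[-2:]       # negative positions count back from the end
```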


### 2.3. Label-based selection
---
The `loc` operator is used to select data based on the index label, not its position.
For example, to get the first entry of a table, we would do:

In [10]:
battles_got.loc[0, 'name']

'Battle of the Golden Tooth'

If the data has meaningful indices, it's easier to select using `loc`.

In [11]:
battles_got.head().loc[:, ['name', 'year', 'battle_number', 'attacker_king', 'defender_king']]

Unnamed: 0,name,year,battle_number,attacker_king,defender_king
0,Battle of the Golden Tooth,298,1,Joffrey/Tommen Baratheon,Robb Stark
1,Battle at the Mummer's Ford,298,2,Joffrey/Tommen Baratheon,Robb Stark
2,Battle of Riverrun,298,3,Joffrey/Tommen Baratheon,Robb Stark
3,Battle of the Green Fork,298,4,Robb Stark,Joffrey/Tommen Baratheon
4,Battle of the Whispering Wood,298,5,Robb Stark,Joffrey/Tommen Baratheon


One important thing to keep in mind when choosing between `iloc` and `loc` is that they use different indexing schemes.

`iloc` uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So `0:10` selects 10 entries, 0,...,9. `loc`, meanwhile, indexes inclusively, so `0:10` selects 11 entries, 0,...,10.
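
The difference is easy to check on a small example. The sketch below uses made-up frames (`toy` and `lettered` are assumptions) and the same slice with both accessors.

```python
import pandas as pd

# Hypothetical toy frame with the default integer index 0..4
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

by_position = toy.iloc[0:3]   # positions 0, 1, 2 (end excluded)
by_label = toy.loc[0:3]       # labels 0, 1, 2, 3 (end included)

# With string labels, the inclusive end of loc reads naturally
lettered = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=list("abcde"))
a_to_c = lettered.loc["a":"c"]
```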

### 2.4. Manipulating the index
---
We can manipulate the index in any way we see fit, by using the `set_index` method.

In [12]:
battles_got.head().set_index("battle_number")

battle_number,name,year,attacker_king,defender_king,attacker_1,attacker_2,attacker_3,attacker_4,defender_1,defender_2,...,major_death,major_capture,attacker_size,defender_size,attacker_commander,defender_commander,summer,location,region,note
1,Battle of the Golden Tooth,298,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,,...,1.0,0.0,15000.0,4000.0,Jaime Lannister,"Clement Piper, Vance",1.0,Golden Tooth,The Westerlands,
2,Battle at the Mummer's Ford,298,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Baratheon,,...,1.0,0.0,,120.0,Gregor Clegane,Beric Dondarrion,1.0,Mummer's Ford,The Riverlands,
3,Battle of Riverrun,298,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,,...,0.0,1.0,15000.0,10000.0,"Jaime Lannister, Andros Brax","Edmure Tully, Tytos Blackwood",1.0,Riverrun,The Riverlands,
4,Battle of the Green Fork,298,Robb Stark,Joffrey/Tommen Baratheon,Stark,,,,Lannister,,...,1.0,1.0,18000.0,20000.0,"Roose Bolton, Wylis Manderly, Medger Cerwyn, H...","Tywin Lannister, Gregor Clegane, Kevan Lannist...",1.0,Green Fork,The Riverlands,
5,Battle of the Whispering Wood,298,Robb Stark,Joffrey/Tommen Baratheon,Stark,Tully,,,Lannister,,...,1.0,1.0,1875.0,6000.0,"Robb Stark, Brynden Tully",Jaime Lannister,1.0,Whispering Wood,The Riverlands,
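
Two details are worth seeing in code: `set_index` returns a new frame rather than changing the original, and `loc` then looks up the new labels. This is a sketch on made-up data (`toy` and `indexed` are assumptions).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "battle_number": [1, 2, 3],
    "name": ["Golden Tooth", "Mummer's Ford", "Riverrun"],
})

indexed = toy.set_index("battle_number")

# The original frame keeps its columns and its default index
original_columns = list(toy.columns)

# loc now looks up battle_number labels; iloc still uses positions
by_label = indexed.loc[3, "name"]
by_position = indexed.iloc[0, 0]
```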


### 2.5. Conditional selection
---
We can also select data based on a condition. The first step is to check, for each row, whether the condition holds.

In [13]:
battles_got.head().attacker_king == 'Robb Stark'

0    False
1    False
2    False
3     True
4     True
Name: attacker_king, dtype: bool

The `==` operator produces a `Series` of `True`/`False` booleans based on an attribute of each record. This result can be used inside of `loc` to select the relevant data.

In [14]:
battles_got.head().loc[battles_got.attacker_king == 'Robb Stark']

Unnamed: 0,name,year,battle_number,attacker_king,defender_king,attacker_1,attacker_2,attacker_3,attacker_4,defender_1,...,major_death,major_capture,attacker_size,defender_size,attacker_commander,defender_commander,summer,location,region,note
3,Battle of the Green Fork,298,4,Robb Stark,Joffrey/Tommen Baratheon,Stark,,,,Lannister,...,1.0,1.0,18000.0,20000.0,"Roose Bolton, Wylis Manderly, Medger Cerwyn, H...","Tywin Lannister, Gregor Clegane, Kevan Lannist...",1.0,Green Fork,The Riverlands,
4,Battle of the Whispering Wood,298,5,Robb Stark,Joffrey/Tommen Baratheon,Stark,Tully,,,Lannister,...,1.0,1.0,1875.0,6000.0,"Robb Stark, Brynden Tully",Jaime Lannister,1.0,Whispering Wood,The Riverlands,


We can also use the ampersand operator `&` to combine two or more conditions, so that only the rows meeting all of them are selected.

In [15]:
battles_got.loc[(battles_got.attacker_king == 'Robb Stark') & (battles_got.attacker_size >= 10000)]

Unnamed: 0,name,year,battle_number,attacker_king,defender_king,attacker_1,attacker_2,attacker_3,attacker_4,defender_1,...,major_death,major_capture,attacker_size,defender_size,attacker_commander,defender_commander,summer,location,region,note
3,Battle of the Green Fork,298,4,Robb Stark,Joffrey/Tommen Baratheon,Stark,,,,Lannister,...,1.0,1.0,18000.0,20000.0,"Roose Bolton, Wylis Manderly, Medger Cerwyn, H...","Tywin Lannister, Gregor Clegane, Kevan Lannist...",1.0,Green Fork,The Riverlands,


If we are interested in rows where at least one of the conditions is met, we need to use the pipe operator `|`.

In [16]:
battles_got.head().loc[(battles_got.attacker_king == 'Robb Stark') | (battles_got.attacker_size >= 10000)]

Unnamed: 0,name,year,battle_number,attacker_king,defender_king,attacker_1,attacker_2,attacker_3,attacker_4,defender_1,...,major_death,major_capture,attacker_size,defender_size,attacker_commander,defender_commander,summer,location,region,note
0,Battle of the Golden Tooth,298,1,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,...,1.0,0.0,15000.0,4000.0,Jaime Lannister,"Clement Piper, Vance",1.0,Golden Tooth,The Westerlands,
2,Battle of Riverrun,298,3,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,...,0.0,1.0,15000.0,10000.0,"Jaime Lannister, Andros Brax","Edmure Tully, Tytos Blackwood",1.0,Riverrun,The Riverlands,
3,Battle of the Green Fork,298,4,Robb Stark,Joffrey/Tommen Baratheon,Stark,,,,Lannister,...,1.0,1.0,18000.0,20000.0,"Roose Bolton, Wylis Manderly, Medger Cerwyn, H...","Tywin Lannister, Gregor Clegane, Kevan Lannist...",1.0,Green Fork,The Riverlands,
4,Battle of the Whispering Wood,298,5,Robb Stark,Joffrey/Tommen Baratheon,Stark,Tully,,,Lannister,...,1.0,1.0,1875.0,6000.0,"Robb Stark, Brynden Tully",Jaime Lannister,1.0,Whispering Wood,The Riverlands,
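
The contrast between `&` and `|` is clearer on a frame small enough to check by eye. The data below is made up (`toy` and the shortened king names are assumptions), and the masks are stored in variables for readability.

```python
import pandas as pd

# Hypothetical toy data; None becomes NaN in the numeric column
toy = pd.DataFrame({
    "attacker_king": ["Joffrey", "Robb", "Robb", "Stannis"],
    "attacker_size": [15000.0, 18000.0, 1875.0, None],
})

is_robb = toy.attacker_king == "Robb"
is_large = toy.attacker_size >= 10000   # a NaN size compares as False

both = toy.loc[is_robb & is_large]      # rows where both conditions hold
either = toy.loc[is_robb | is_large]    # rows where at least one holds
```

When the comparisons are written inline instead of being stored in variables, each one needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>=` in Python.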


`pandas` comes with a few pre-built conditional selectors. One of them is the `isin` method, which lets you select data whose value "is in" a list of values.

In [17]:
battles_got.head().loc[battles_got.attacker_1.isin(['Stark', 'Lannister'])] 

Unnamed: 0,name,year,battle_number,attacker_king,defender_king,attacker_1,attacker_2,attacker_3,attacker_4,defender_1,...,major_death,major_capture,attacker_size,defender_size,attacker_commander,defender_commander,summer,location,region,note
0,Battle of the Golden Tooth,298,1,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,...,1.0,0.0,15000.0,4000.0,Jaime Lannister,"Clement Piper, Vance",1.0,Golden Tooth,The Westerlands,
1,Battle at the Mummer's Ford,298,2,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Baratheon,...,1.0,0.0,,120.0,Gregor Clegane,Beric Dondarrion,1.0,Mummer's Ford,The Riverlands,
2,Battle of Riverrun,298,3,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,...,0.0,1.0,15000.0,10000.0,"Jaime Lannister, Andros Brax","Edmure Tully, Tytos Blackwood",1.0,Riverrun,The Riverlands,
3,Battle of the Green Fork,298,4,Robb Stark,Joffrey/Tommen Baratheon,Stark,,,,Lannister,...,1.0,1.0,18000.0,20000.0,"Roose Bolton, Wylis Manderly, Medger Cerwyn, H...","Tywin Lannister, Gregor Clegane, Kevan Lannist...",1.0,Green Fork,The Riverlands,
4,Battle of the Whispering Wood,298,5,Robb Stark,Joffrey/Tommen Baratheon,Stark,Tully,,,Lannister,...,1.0,1.0,1875.0,6000.0,"Robb Stark, Brynden Tully",Jaime Lannister,1.0,Whispering Wood,The Riverlands,


The `isnull` method and its companion `notnull` let you know which values are empty (`NaN`) and which are not.

In [18]:
battles_got.head().loc[battles_got.attacker_size.isnull()]

Unnamed: 0,name,year,battle_number,attacker_king,defender_king,attacker_1,attacker_2,attacker_3,attacker_4,defender_1,...,major_death,major_capture,attacker_size,defender_size,attacker_commander,defender_commander,summer,location,region,note
1,Battle at the Mummer's Ford,298,2,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Baratheon,...,1.0,0.0,,120.0,Gregor Clegane,Beric Dondarrion,1.0,Mummer's Ford,The Riverlands,


In [19]:
battles_got.head().loc[battles_got.attacker_size.notnull()]

Unnamed: 0,name,year,battle_number,attacker_king,defender_king,attacker_1,attacker_2,attacker_3,attacker_4,defender_1,...,major_death,major_capture,attacker_size,defender_size,attacker_commander,defender_commander,summer,location,region,note
0,Battle of the Golden Tooth,298,1,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,...,1.0,0.0,15000.0,4000.0,Jaime Lannister,"Clement Piper, Vance",1.0,Golden Tooth,The Westerlands,
2,Battle of Riverrun,298,3,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,...,0.0,1.0,15000.0,10000.0,"Jaime Lannister, Andros Brax","Edmure Tully, Tytos Blackwood",1.0,Riverrun,The Riverlands,
3,Battle of the Green Fork,298,4,Robb Stark,Joffrey/Tommen Baratheon,Stark,,,,Lannister,...,1.0,1.0,18000.0,20000.0,"Roose Bolton, Wylis Manderly, Medger Cerwyn, H...","Tywin Lannister, Gregor Clegane, Kevan Lannist...",1.0,Green Fork,The Riverlands,
4,Battle of the Whispering Wood,298,5,Robb Stark,Joffrey/Tommen Baratheon,Stark,Tully,,,Lannister,...,1.0,1.0,1875.0,6000.0,"Robb Stark, Brynden Tully",Jaime Lannister,1.0,Whispering Wood,The Riverlands,
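
The three selectors can be compared side by side. The sketch below runs on made-up data (`toy` is an assumption) with two missing sizes.

```python
import pandas as pd

# Hypothetical toy data; None becomes NaN in the numeric column
toy = pd.DataFrame({
    "attacker_1": ["Lannister", "Stark", "Greyjoy", "Baratheon"],
    "attacker_size": [15000.0, None, 264.0, None],
})

in_list = toy.loc[toy.attacker_1.isin(["Stark", "Lannister"])]
missing = toy.loc[toy.attacker_size.isnull()]
present = toy.loc[toy.attacker_size.notnull()]
```

Since `isnull` and `notnull` are complements, the two selections together cover every row exactly once.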


### 2.6. Assigning data
---
We can assign data to a `DataFrame` easily. We can assign a constant value, which is repeated for every row.

In [20]:
battles_got['note'] = 'LGM345'
battles_got['note'].head()

0    LGM345
1    LGM345
2    LGM345
3    LGM345
4    LGM345
Name: note, dtype: object

Or we can assign an iterable of values, one per row.

In [21]:
battles_got['note'] = range(len(battles_got), 0, -1)
battles_got['note'].head()

0    38
1    37
2    36
3    35
4    34
Name: note, dtype: int32
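
Both kinds of assignment, plus what happens when the iterable has the wrong length, can be sketched on a made-up frame. `toy` and the columns `rank` and `bad` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"name": ["a", "b", "c"]})

toy["note"] = "LGM345"                 # constant, repeated for every row
toy["rank"] = range(len(toy), 0, -1)   # iterable with one value per row

# An iterable whose length does not match the frame is rejected
try:
    toy["bad"] = [1, 2]
    length_error = False
except ValueError:
    length_error = True
```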

# 3. Summary functions and mapping
---
### 3.1. Summary functions
---
`pandas` provides many simple "summary functions", which restructure the data in a useful way. The `describe` method is one of the most used. It generates a summary of the attributes of the given column. The output depends on the column's `dtype`; the one below only makes sense for numerical data.

In [22]:
battles_got.attacker_size.describe()

count        24.000000
mean       9942.541667
std       20283.092065
min          20.000000
25%        1375.000000
50%        4000.000000
75%        8250.000000
max      100000.000000
Name: attacker_size, dtype: float64

For string data we get a different output.

In [23]:
battles_got.attacker_king.describe()

count                           36
unique                           4
top       Joffrey/Tommen Baratheon
freq                            14
Name: attacker_king, dtype: object
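
The two shapes of `describe` output can be compared on a made-up frame (`toy` is an assumption). In both cases, missing values are left out of `count`.

```python
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "attacker_king": ["Robb", "Robb", "Stannis", None],
    "attacker_size": [18000.0, 1875.0, None, 5000.0],
})

numeric_summary = toy.attacker_size.describe()   # count, mean, std, quartiles...
text_summary = toy.attacker_king.describe()      # count, unique, top, freq
```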

We can also get a particular summary statistic about a column in a `DataFrame` or a `Series`. For example, we can see the mean of the selected data using the `mean` method.

In [24]:
battles_got.attacker_size.mean()

9942.541666666666

To see a list of the unique values we can use the `unique` function.

In [25]:
battles_got.defender_king.unique()

array(['Robb Stark', 'Joffrey/Tommen Baratheon', 'Balon/Euron Greyjoy',
       'Renly Baratheon', nan, 'Mance Rayder', 'Stannis Baratheon'],
      dtype=object)

To see a list of the unique values and how often they occur in the dataset, we can use the `value_counts` method.

In [26]:
battles_got.defender_king.value_counts()

Robb Stark                  14
Joffrey/Tommen Baratheon    13
Balon/Euron Greyjoy          4
Stannis Baratheon            2
Mance Rayder                 1
Renly Baratheon              1
Name: defender_king, dtype: int64
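
The outputs above differ in one respect: `unique` lists `nan`, while `value_counts` leaves it out. The sketch below shows this on a made-up `Series` (`kings` is an assumption) and uses the `dropna=False` option of `value_counts` to count missing values as well.

```python
import pandas as pd

# Hypothetical toy data with one missing value
kings = pd.Series(["Robb", "Joffrey", "Robb", None, "Robb"], name="defender_king")

distinct = kings.unique()                            # includes the missing value
counts = kings.value_counts()                        # missing values are left out
counts_with_nan = kings.value_counts(dropna=False)   # missing values are counted
```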

### 3.2. Mapping functions
---
"Map" is a term borrowed from mathematics for a function that takes one set of values and converts, or "maps", it to another set of values in the format we want.

There are two mapping methods that are used often. `Series.map` is the first and the simpler one. `map` takes every value in the column it is called on and converts it to some new value using a function you provide. That function receives a single value from the `Series`, and `map` returns a new `Series` with the results.

In [27]:
attacker_size_mean = battles_got.attacker_size.mean()
battles_got.head().attacker_size.map(lambda p: p - attacker_size_mean)

0    5057.458333
1            NaN
2    5057.458333
3    8057.458333
4   -8067.541667
Name: attacker_size, dtype: float64
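
The same computation on a three-value `Series` makes the arithmetic easy to follow. This is a sketch with made-up numbers (`sizes` and `mean_size` are assumptions).

```python
import pandas as pd

# Hypothetical toy data with one missing value
sizes = pd.Series([15000.0, None, 1875.0], name="attacker_size")
mean_size = sizes.mean()   # NaN is skipped: (15000 + 1875) / 2

# The function receives one value at a time and returns its replacement
centered = sizes.map(lambda size: size - mean_size)
```

`map` returns a new `Series`; `sizes` itself is left unchanged.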

The `DataFrame` `apply` method does the equivalent at the level of the entire dataset. With `axis='columns'`, it calls the function once per row and passes that row in as a `Series`.

In [28]:
def remean_attacker_size(srs):
    srs.attacker_size = srs.attacker_size - attacker_size_mean
    return srs

battles_got.head().apply(remean_attacker_size, axis='columns')

Unnamed: 0,name,year,battle_number,attacker_king,defender_king,attacker_1,attacker_2,attacker_3,attacker_4,defender_1,...,major_death,major_capture,attacker_size,defender_size,attacker_commander,defender_commander,summer,location,region,note
0,Battle of the Golden Tooth,298,1,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,...,1.0,0.0,5057.458333,4000.0,Jaime Lannister,"Clement Piper, Vance",1.0,Golden Tooth,The Westerlands,38
1,Battle at the Mummer's Ford,298,2,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Baratheon,...,1.0,0.0,,120.0,Gregor Clegane,Beric Dondarrion,1.0,Mummer's Ford,The Riverlands,37
2,Battle of Riverrun,298,3,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,...,0.0,1.0,5057.458333,10000.0,"Jaime Lannister, Andros Brax","Edmure Tully, Tytos Blackwood",1.0,Riverrun,The Riverlands,36
3,Battle of the Green Fork,298,4,Robb Stark,Joffrey/Tommen Baratheon,Stark,,,,Lannister,...,1.0,1.0,8057.458333,20000.0,"Roose Bolton, Wylis Manderly, Medger Cerwyn, H...","Tywin Lannister, Gregor Clegane, Kevan Lannist...",1.0,Green Fork,The Riverlands,35
4,Battle of the Whispering Wood,298,5,Robb Stark,Joffrey/Tommen Baratheon,Stark,Tully,,,Lannister,...,1.0,1.0,-8067.541667,6000.0,"Robb Stark, Brynden Tully",Jaime Lannister,1.0,Whispering Wood,The Riverlands,34
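
A smaller version of the same idea is sketched below on made-up data (`toy` and `center_attacker_size` are assumptions). Unlike the function above, this variant copies the row before changing it, which is a cautious choice to keep the function free of side effects.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "name": ["Golden Tooth", "Whispering Wood"],
    "attacker_size": [15000.0, 1875.0],
})
mean_size = toy.attacker_size.mean()

def center_attacker_size(row):
    # row is a Series holding one record; work on a copy of it
    row = row.copy()
    row["attacker_size"] = row["attacker_size"] - mean_size
    return row

# axis="columns" calls the function once per row
centered = toy.apply(center_attacker_size, axis="columns")
```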


`pandas` can also operate between `Series` of equal length. For example, we can combine information from the dataset using some operators.

In [29]:
battles_got.head().attacker_king + " vs " + battles_got.head().defender_king

0    Joffrey/Tommen Baratheon vs Robb Stark
1    Joffrey/Tommen Baratheon vs Robb Stark
2    Joffrey/Tommen Baratheon vs Robb Stark
3    Robb Stark vs Joffrey/Tommen Baratheon
4    Robb Stark vs Joffrey/Tommen Baratheon
dtype: object

These operators (`+`, `>`, `<`, `==`...) are faster than `map` or `apply`, but they are not as flexible.
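
To illustrate the trade-off, the sketch below computes the same centered column twice on made-up data (`toy` is an assumption): once with a plain operator on the whole column, and once with `map`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "attacker_king": ["Joffrey", "Robb"],
    "defender_king": ["Robb", "Joffrey"],
    "attacker_size": [15000.0, 1875.0],
})

# Element-wise operators work on whole columns at once
matchup = toy.attacker_king + " vs " + toy.defender_king
centered_fast = toy.attacker_size - toy.attacker_size.mean()

# The same numbers via map, with one Python call per value
mean_size = toy.attacker_size.mean()
centered_map = toy.attacker_size.map(lambda size: size - mean_size)
```

Both routes give identical values here. `map` becomes worthwhile when the per-value logic cannot be written as a simple expression on columns.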

# 4. Grouping and Sorting
---
### 4.1. Grouping
---
Sometimes we want to group our data to do something specific. To do this, we can use the `groupby` operation.

For example, we can replicate what `value_counts` does using `groupby` by doing the following:

In [30]:
battles_got.groupby('attacker_king').attacker_king.count()

attacker_king
Balon/Euron Greyjoy          7
Joffrey/Tommen Baratheon    14
Robb Stark                  10
Stannis Baratheon            5
Name: attacker_king, dtype: int64

In this case, we created a group and counted how many times each value appears.

`value_counts` is just a shortcut to this `groupby` operation. We can also use any of the summary functions with groups.

In [31]:
battles_got.groupby('attacker_king').attacker_size.min()

attacker_king
Balon/Euron Greyjoy           20.0
Joffrey/Tommen Baratheon     618.0
Robb Stark                   100.0
Stannis Baratheon           4500.0
Name: attacker_size, dtype: float64
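
The relationship between `groupby` and `value_counts` can be checked on a made-up frame (`toy` is an assumption). The counts agree, but the ordering differs: `groupby` sorts by group label, while `value_counts` sorts by frequency.

```python
import pandas as pd

# Hypothetical toy data; one attacker_size is missing
toy = pd.DataFrame({
    "attacker_king": ["Robb", "Joffrey", "Robb", "Stannis", "Robb"],
    "attacker_size": [18000.0, 15000.0, 1875.0, 5000.0, None],
})

by_group = toy.groupby("attacker_king").attacker_king.count()
shortcut = toy.attacker_king.value_counts()

# Any summary function can be applied per group; NaN values are skipped
smallest = toy.groupby("attacker_king").attacker_size.min()
```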