In [None]:
import pandas as pd

# Navigating `pandas`
Pandas is built on the `Series` and `DataFrame` types, so the first task is to get familiar with these.

## `Series`
Loosely speaking a `Series` is something like an ordered Python dictionary. It's easiest to see what I mean by this by mucking about with a simple `Series`. To do that we'll make one.

In [None]:
series = pd.Series([100, 200, 400, 800])
series

Importantly, a `Series` has a single type for all of its values. For example, if we include a string in the list we use to initialise `series` then everything in the `Series` is stored with the generic `object` type:

In [None]:
series = pd.Series([100, 200, 400, "abc"])
series
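
To check the type directly rather than reading it off the printed output, we can look at the `dtype` property. This is a small sketch; the names `ints` and `mixed` are just for illustration.

```python
import pandas as pd

ints = pd.Series([100, 200, 400, 800])
mixed = pd.Series([100, 200, 400, "abc"])

# dtype reports the single type shared by every element of the Series
print(ints.dtype)   # an integer dtype
print(mixed.dtype)  # object

# inside the object Series the values are still ordinary Python objects
print([type(value).__name__ for value in mixed])
```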

Making everything in a `Series` the same type is one aspect of how `pandas` makes computation more efficient. In addition to its values, a `Series` has an _index_, and it can also have a _name_, and its index can have a name. In effect it is like a small data table with just one column.

In [None]:
pop2023 = pd.Series(
    [1_402_000, 202_000, 380_000, 101_000],
    index = ["Auckland", "Wellington", "Christchurch", "Dunedin"])
pop2023.name = "population_2023"
pop2023.index.name = "city"
pop2023

To access individual elements in a `Series` we should use values of its index, _not_ positional indexes.

In [None]:
pop2023["Wellington"]

This may seem confusing initially. The reason it matters is that a `Series` could have index values that are integers, and then it can be ambiguous what we mean when we use a position index.

In [None]:
confusing = pd.Series([1, 2, 3, 4], index = [10, 9, 8, 7])
confusing

In [None]:
confusing[0]


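Because the index here holds integers, `confusing[0]` is treated as a lookup of the _label_ `0`, and since no such label exists it raises a `KeyError`. With a non-integer index, recent pandas 2.x releases emit a deprecation warning for positional access using plain square brackets, meaning that it will work _for now_. Either way, the proper way to access a value in a `Series` by its position is using its `iloc` property.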

In [None]:
confusing.iloc[0]
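
To make the distinction explicit, here is a small sketch that fetches the same value both ways: `.loc` looks up an index label and `.iloc` looks up a position. The variable names `by_label` and `by_position` are only for illustration.

```python
import pandas as pd

confusing = pd.Series([1, 2, 3, 4], index=[10, 9, 8, 7])

by_label = confusing.loc[10]     # the value whose index label is 10
by_position = confusing.iloc[0]  # the value in the first position

print(by_label, by_position)
```

Both lookups reach the first element here, but only because the label `10` happens to sit at position `0`.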

So... if you are intending to access items in a series by their position, use `.iloc[]`. Anyway, let's get back to the `pop2023` example.

In [None]:
pop2023.iloc[1]

Let's make another `Series` (note that the indexes are different, and in a different order).

In [None]:
pop2018 = pd.Series(
    [202_000, 100_000, 161_000, 1_346_000], 
    index = ["Wellington", "Dunedin", "Hamilton", "Auckland"])
pop2018.name = "population_2018"
pop2018.index.name = "city"
pop2018

Now... what happens if we combine them in some way?

In [None]:
pop2023 - pop2018

What's going on here? The `Series` objects use the `index` values to align the data and subtract corresponding elements, and where some element is missing (`Hamilton` is missing in the first `Series` and `Christchurch` in the second one) it arrives at a `NaN` (not a number, effectively a missing result).

The missing results are annoying, but the more important thing here is that `Series` clearly aren't just lists a little bit dressed up! They're much more like single variable data tables.
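
If the missing results get in the way, one option (a sketch, not the only approach) is to drop them with `dropna()`, which keeps only the cities present in both `Series`. The names `change` and `change_complete` are just for illustration.

```python
import pandas as pd

pop2023 = pd.Series(
    [1_402_000, 202_000, 380_000, 101_000],
    index=["Auckland", "Wellington", "Christchurch", "Dunedin"])
pop2018 = pd.Series(
    [202_000, 100_000, 161_000, 1_346_000],
    index=["Wellington", "Dunedin", "Hamilton", "Auckland"])

# aligned on index labels, with NaN where a city is missing from either Series
change = pop2023 - pop2018

# keep only the cities that had a value in both Series
change_complete = change.dropna()
print(change_complete)
```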

## `DataFrame`
A `DataFrame` is a collection of `Series` that _share the same index_.

We can combine a bunch of `Series` that share index values (even if not all index values exist in all `Series`) into a `DataFrame` using `concat`, like this:

In [None]:
pd.concat([pop2018, pop2023], axis = "columns")

That's handy if you happen to have data as `Series` already, although you will often be assembling data from a dictionary of lists, like this:

In [None]:
df = pd.DataFrame(
    data = {"pop2023": [1_402_000, 380_000, 101_000, 175_000, 202_000],
            "pop2018": [1_346_000, 358_000, 100_000, 161_000, 202_000]})
df

By default the index is integers starting at 0. But we can set something more useful.

In [None]:
df.index = ['Auckland', 'Christchurch', 'Dunedin', 'Hamilton', 'Wellington']
df.index.name = "city"
df
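
As an alternative sketch, the index can also be supplied when the `DataFrame` is created, using the `index` parameter and a named `pd.Index`. The variable name `cities` is only for illustration.

```python
import pandas as pd

cities = pd.DataFrame(
    data={"pop2023": [1_402_000, 380_000, 101_000, 175_000, 202_000],
          "pop2018": [1_346_000, 358_000, 100_000, 161_000, 202_000]},
    # a named Index sets both the row labels and the index name in one step
    index=pd.Index(
        ["Auckland", "Christchurch", "Dunedin", "Hamilton", "Wellington"],
        name="city"))
print(cities)
```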

In [None]:
# for convenience make a copy
cities_df = df.copy()

It's important here to notice the difference between the index column and the data columns, which the notebook view helps us with by setting it on a different row than the column names. The index is not data, it's an index!

A `DataFrame` has two dimensions, and we can extract columns, rows, or individual elements accordingly. Columns are extracted by name:

In [None]:
df.pop2023
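
Attribute-style access only works when the column name is a valid Python identifier and doesn't clash with an existing `DataFrame` attribute. The bracket form always works. The sketch below uses a small `demo` table with a made-up column called `pop` (not part of the data above) purely to show the clash with the `DataFrame.pop()` method.

```python
import pandas as pd

demo = pd.DataFrame(
    {"pop2023": [1_402_000, 202_000], "pop": [1, 2]},  # "pop" is a made-up column
    index=["Auckland", "Wellington"])

print(demo["pop2023"])     # same Series as demo.pop2023
print(demo["pop"])         # the column named "pop"
print(callable(demo.pop))  # attribute access finds the DataFrame.pop method instead
```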

And we can extract more than one column at a time by providing a list of the desired columns. The columns in the result appear in the order given in the list, so this also lets us reorder them.

In [None]:
df[["pop2018", "pop2023"]]

Rows are extracted using `loc` or `iloc` and returned as `Series`.

In [None]:
df.loc["Wellington"], df.iloc[4]

You can also request more than one row, and again, reorder them in the process. In this case you'll get the selected rows back as a `DataFrame`.

In [None]:
df.loc[["Wellington", "Auckland", "Hamilton"]]

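**IMPORTANT** when you assign through an indexer such as `.loc` you are changing the `DataFrame` itself, not a separate copy of the selected rows. Whether a selection you store in a variable behaves as a view or as a copy depends on the operation and on the pandas version, so don't rely on it either way.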

In [None]:
df.loc["Auckland"] = df.loc["Auckland"] + 100_000
df

This is not particularly unexpected, and might be what you want, in which case, all good!

But if you want to work with the data in a row or column and perhaps in the process change it, _while leaving the source data intact_, then make a copy:

In [None]:
# restore the original dataframe
df = cities_df.copy()
auckland_data = df.loc["Auckland"].copy()
auckland_data = auckland_data + 100_000
auckland_data, df

### `reindex`
We can reorganise a `DataFrame` by reindexing. If you provide just one list it should be the rows. If you want to reorder columns, you supply a `columns` parameter to the `reindex()` method. Probably the easiest way to do it is to explicitly specify `index` and `columns` lists.

In [None]:
new_row_order = ["Wellington", "Auckland", "Christchurch", "Hamilton", "Dunedin"]
new_col_order = ["pop2018", "pop2023"]
df.reindex(index = new_row_order, columns = new_col_order)
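
One thing to watch for: `reindex()` doesn't complain about labels it can't find. The sketch below uses a cut-down table and asks for `"Tauranga"`, a label chosen only because it is not in the index, to show that the result gets a row of `NaN` for it.

```python
import pandas as pd

cities = pd.DataFrame(
    {"pop2023": [1_402_000, 202_000], "pop2018": [1_346_000, 202_000]},
    index=["Auckland", "Wellington"])

# "Tauranga" is not in the index, so its row is filled with NaN
expanded = cities.reindex(index=["Wellington", "Tauranga", "Auckland"])
print(expanded)
```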

### `drop()`

The `drop()` method is similar to `reindex`, but drops any named rows or columns.

In [None]:
df.drop(index = ["Auckland", "Christchurch"], columns = ["pop2018"])
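
Note that `drop()` returns a new `DataFrame` and leaves the original as it was, as this small sketch on a cut-down table shows (the names `cities` and `smaller` are only for illustration).

```python
import pandas as pd

cities = pd.DataFrame(
    {"pop2023": [1_402_000, 380_000, 202_000],
     "pop2018": [1_346_000, 358_000, 202_000]},
    index=["Auckland", "Christchurch", "Wellington"])

smaller = cities.drop(index=["Auckland"], columns=["pop2018"])

print(smaller.shape)  # rows and columns left after dropping
print(cities.shape)   # the original still has every row and column
```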

### `loc` and `iloc`
There are more ways to index data in `Series` and `DataFrame` objects than there really should be! I think it's good practice to stick to a limited set, as follows. Use `.loc` to index by names (of rows or columns):

In [None]:
df.loc[["Auckland", "Wellington"], ["pop2018"]]

And use `.iloc` to index by position:

In [None]:
df.iloc[:3, 1:]

Notice that `iloc` allows you to use slice notation. You can use slicing with names but... and this is confusing, it's inclusive of the end label, unlike integer slicing:

In [None]:
df.loc["Auckland":"Hamilton", :"pop2023"]
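
To see the difference side by side, this sketch rebuilds the cities table so it runs on its own and slices the rows both ways. The names `by_label` and `by_position` are only for illustration.

```python
import pandas as pd

cities = pd.DataFrame(
    {"pop2023": [1_402_000, 380_000, 101_000, 175_000, 202_000],
     "pop2018": [1_346_000, 358_000, 100_000, 161_000, 202_000]},
    index=["Auckland", "Christchurch", "Dunedin", "Hamilton", "Wellington"])

by_label = cities.loc["Auckland":"Dunedin"]  # includes the Dunedin row
by_position = cities.iloc[0:2]               # stops before position 2 (Dunedin)

print(list(by_label.index))
print(list(by_position.index))
```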

Keep in mind that `.loc` and `.iloc` are _properties_ of the `DataFrame` not methods. That means they are followed by _square brackets_ `[]` not parentheses (as they would be if they were methods).

### Boolean selections
An important special case (that is widely used) is selection using a sequence of boolean values.

In [None]:
df.loc[df.pop2018 > 200_000]

What's happening here? Well, `df.pop2018 > 200_000` gives us a `Series` of boolean values, with the same index as `df`:

In [None]:
df.pop2018 > 200_000

And when we use that as the index it selects the rows where the condition is `True`. We can even apply a condition to the whole `DataFrame` at once, and use the result to change values selectively.

In [None]:
df < 360_000

In [None]:
df[df < 360_000] = 0
df

In [None]:
df = cities_df.copy()
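
If you'd rather not overwrite the data and then restore it from a copy, one possible alternative is the `mask()` method, which replaces values where a condition is `True` and returns a new `DataFrame`. This is a sketch on a cut-down version of the table; the names `cities` and `zeroed` are only for illustration.

```python
import pandas as pd

cities = pd.DataFrame(
    {"pop2023": [1_402_000, 380_000, 101_000],
     "pop2018": [1_346_000, 358_000, 100_000]},
    index=["Auckland", "Christchurch", "Dunedin"])

# replace every value below 360,000 with 0 in a new DataFrame
zeroed = cities.mask(cities < 360_000, 0)

print(zeroed)
print(cities)  # the original values are untouched
```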

That's all a lot to take in, so we'll take a break here!