In [None]:
import pandas as pd
import geopandas as gpd
import random

# Meet `pandas` and `geopandas`
## Introduction
First up, the name `pandas`. According to the preliminaries in this [book by its original developer](https://wesmckinney.com/book/),

> The pandas name itself is derived from _panel data_, an econometrics term for multidimensional structured datasets, and a play on the phrase _Python data analysis_.

Second up, unlike Python, which I find elegant and fun to use, I find `pandas` a bit of a chore to deal with. This is partly because I use it only intermittently, but it is also a complicated API that is in many places not very intuitive. That means its functions and idioms can take a while to really 'stick'. That's been my experience at any rate, and I think it is quite a common one. However, `pandas` is so central to the Python data analysis world, and Python is so central to code development in the GIS world, that whatever reservations we might have about `pandas`, using it is almost unavoidable!

I mention these reservations only to provide fair warning. I expect that much of what we cover today will only really stick for you as you use `pandas` and its close cousin `geopandas` regularly in your everyday work. I'd strongly encourage you to get into the habit of using them so that they become, if not second nature, then a bit less of a challenge to use!

The other important thing to realise at this point is that it's impossible, in the space of a few hours, to cover all that `pandas` and `geopandas` have to offer, because there's a lot! All we can do in these sessions is introduce some basics and give a sense of the scope of these modules and some of their potential. You will hopefully end up better equipped to know where to look in the voluminous documentation as you develop your skills further after these sessions.

## So why use `pandas`?
**First**, because it's everywhere (including in the ESRI suite of tools).

**Second**, because while reading data files in vanilla Python is not difficult (as we saw on Day 1), reading them using `pandas` is even easier!

In [None]:
df = pd.read_csv("data/wellington-gridded-population.csv")
df.head()

And now we can operate on whole columns of the data in a single _vectorised_ operation. Here's a (nonsensical) example, where we add a new column `z` to the data table by adding together the values in `x` and `y`.

In [None]:
df["z"] = df.x + df.y
df.head()

**Third**, because `pandas` is completely geared around the manipulation of data as _series_ (the `Series` data type) or as tables (the `DataFrame` data type), and not at the level of individual data points. That's how the data we usually work with are organised, so it makes sense to use a tool intended for working with such data!
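
To make the relationship between the two types concrete, here is a minimal sketch. The table and its values are made up purely for illustration and are not the workshop data.

```python
import pandas as pd

# Hypothetical values, made up purely for illustration
table = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "y": [10.0, 20.0, 30.0],
})

# Selecting a single column of a DataFrame gives a Series
col = table["x"]
print(type(table).__name__)  # DataFrame
print(type(col).__name__)    # Series

# Operations apply to the whole Series at once, not value by value
total = table["x"] + table["y"]
print(total.tolist())  # [11.0, 22.0, 33.0]
```

A `DataFrame` behaves a lot like a collection of `Series` that share a common index, which is why column-wise operations like the one above are so natural.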

**Fourth**, following on from this, working with data this way has the potential to be (and usually is) much faster than using 'vanilla' Python. This is easily demonstrated. Make a million random numbers and put them in a list. Then also put them in a `pandas.Series`.

In [None]:
# a million random numbers in a list
numbers = [random.random() for i in range(1_000_000)]
# and the same numbers wrapped up in a pandas.Series
s_numbers = pd.Series(numbers)

Now time the process of squaring every item in the `list`, using the `%timeit` 'magic'. This runs the code enough times to get a reliable estimate of the execution time. Note that I am using a list comprehension here, which is already faster (although not by much) than making a new list and appending to it in a `for` loop.

In [None]:
%timeit numbers2 = [x ** 2 for x in numbers] 

Now do the same for the `Series` version, where we simply square the whole series.

In [None]:
%timeit s_numbers2 = s_numbers ** 2

On my Mac mini the `list` operation takes 48.5 milliseconds, while the `Series` operation takes 580 **micro**seconds. In other words, the `Series` version is more than 80 times faster!

That kind of difference in performance matters greatly when it comes to handling real world data!
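
The `%timeit` magic is only available in IPython and Jupyter. If you want to try the same comparison in an ordinary Python script, the standard library `timeit` module is a rough equivalent. This is only a sketch: the smaller list size and the number of repetitions are arbitrary choices, and the timings you get will depend on your machine.

```python
import random
import timeit

import pandas as pd

# A smaller list than in the notebook cells so the script finishes quickly
numbers = [random.random() for _ in range(100_000)]
s_numbers = pd.Series(numbers)

n_runs = 20  # arbitrary number of repetitions

# timeit.timeit returns the total time in seconds for n_runs calls
t_list = timeit.timeit(lambda: [x ** 2 for x in numbers], number=n_runs)
t_series = timeit.timeit(lambda: s_numbers ** 2, number=n_runs)

print(f"list:   {t_list / n_runs * 1e3:.3f} ms per run")
print(f"Series: {t_series / n_runs * 1e3:.3f} ms per run")
```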

The 'magic' here is that `pandas` allows you to avoid explicit loops written in Python. The `for` loops we covered earlier in this course have their place, and you will certainly find uses for them (such as iterating over all the files in a folder). However, pure Python loops run by the interpreter are slow in comparison to the loops implemented in compiled C that `pandas`, and an important underlying module, `numpy`, use to run _vectorised_ operations such as the squaring operation above.

Most of the time when using `pandas`, you can avoid explicitly looping over your data, and when you can, you _certainly should_.
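
As a sketch of what avoiding a loop can look like in practice, logic that would need a `for` loop and an `if` in plain Python can usually be written as a boolean selection instead. The table, column names, and threshold below are hypothetical, made up for illustration.

```python
import pandas as pd

# Hypothetical table, made up for illustration
cells = pd.DataFrame({
    "pop_est": [0, 12, 250, 3, 87],
    "area_km2": [1.0, 1.0, 0.5, 1.0, 0.25],
})

# Loop version: visit each row in turn and test it
dense_loop = []
for pop, area in zip(cells["pop_est"], cells["area_km2"]):
    if pop / area > 50:
        dense_loop.append(pop)

# Vectorised version: compute the density for every row at once,
# then use the resulting True/False Series to select rows
density = cells["pop_est"] / cells["area_km2"]
dense_vec = cells.loc[density > 50, "pop_est"]

print(dense_loop)
print(dense_vec.tolist())
```

Both versions select the same values, but the second never mentions an individual row.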

## `geopandas`
`pandas` provides the basic tabular data piece of the geospatial data puzzle. The other piece (geometry, mapping, projection, spatial operations) is provided by `geopandas`.

As a quick taster, it's easy to make the CSV dataset into a geospatial dataset using `geopandas`. As you might expect, `geopandas` can also read and write GIS formatted files, translate between formats, reproject data, and do all the usual GIS-y things.

In [None]:
gdf = gpd.GeoDataFrame(
    data = {"pop": df.pop_est},
    geometry = gpd.points_from_xy(df.x, df.y), 
    crs = 2193
)
gdf.plot(column = "pop", cmap = "Reds", figsize = (8, 10), markersize = 20)
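
Reprojection, one of the 'usual GIS-y things', is similarly brief. The sketch below uses two made-up points with approximate NZTM coordinates near central Wellington rather than the workshop data, and the output filename in the final comment is hypothetical.

```python
import geopandas as gpd

# Two made-up points near central Wellington, in NZTM (EPSG:2193) metres
pts = gpd.GeoDataFrame(
    {"name": ["a", "b"]},
    geometry=gpd.points_from_xy([1748700, 1749200], [5427900, 5428400]),
    crs=2193,
)

# Reproject to longitude/latitude (WGS84, EPSG:4326)
pts_lonlat = pts.to_crs(4326)
print(pts_lonlat.geometry)

# Writing to a GIS file format is a single call, for example:
# pts_lonlat.to_file("points.gpkg")  # hypothetical output filename
```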

We can even make a web map very easily.

In [None]:
gdf.explore(column = "pop")

So... that's where we're going. It may just take a little while to get there...