# Working with Data

## The importance of exploration

One of the first things that we do when working with any new data set is to familiarise ourselves with it. There are a _huge_ number of ways to do this, but there are no shortcuts to:
* Reading about the data (how it was collected, what the sample size was, etc.)
* Reviewing any accompanying metadata (data about the data, column specs, etc.)
* Looking at the data itself at the row- and column-levels
* Producing descriptive statistics 
* Visualising the data using plots 

In fact, you should use _all_ of these together to really understand where the data came from, how it was handled, and whether there are gaps or other problems. If you're wondering which comes first, I've always liked this approach: _start with a chart_. We're _not_ going to do that here because, first, I want you to get a handle on pandas itself!

## Revisiting Last Week's Data with Pandas

Pandas stands for 'Python Data Analysis Library'; it is designed to provide data scientists working in Python with a set of powerful tools to load, transform, and process large-ish data sets. As a result, it has become something of a *de facto* standard for online tutorials and many of the lessons that you can find online will make use of pandas at some point.

You will want to bookmark [the documentation](http://pandas.pydata.org/pandas-docs/stable/) since you will undoubtedly need to refer to it fairly regularly. _Note_: this link is to the most recent release. Over time there will be updates published and you _may_ find that you no longer have the most up-to-date version. If you find that you are now using an older version of pandas then you'll need to track down the _specific_ version of the documentation that you need from the [home page](http://pandas.pydata.org).

You can always check what version you have installed like this:
```python
import pandas as pd
print(pd.__version__)
```
*Note*: this approach isn't guaranteed to work with _every_ package, but it will work with most of them. Remember that variables and methods starting and ending with '`__`' are **special** names rather than ordinary attributes (truly private ones conventionally start with a single '`_`'), and any interaction with them should be approached very, very carefully.

Let's take a look:

In [None]:
import pandas as pd
help(pd.DataFrame)

On second thought, let's never do that again. Well, at least not _that_ way! You'll have noticed that the help documentation for the DataFrame is not just a bit longer than anything we've seen before, it's massively longer. There's probably quite a lot of intimidating terminology in there too... Right from the start we get things like "Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)." 

## You've already invented pandas!

Here's the thing: in the [last notebook](https://raw.githubusercontent.com/kingsgeocomp/geocomputation/master/Practical-4-Functions%2C%20Packages%20and%20Methods.ipynb) we came close to writing something like pandas from scratch. That's because pandas takes a column-view of data in the same way that our Dictionary-of-Lists did; it's just that it's got a lot more features than our 'simple' tool does. That's why the documentation is so much more forbidding and why pandas is so much more powerful.

But at its heart, a pandas data frame ('df' for short) is just a collection of data series (i.e. columns) with an index. Each Series is like one of our column-lists from the last notebook. And the df is like the dictionary-of-lists that held the data together. You've seen this before, so you already _know_ what's going on... or at least you now have an _analogy_ that you can use to make sense of pandas:
```python
myDataFrame = {
    '<column name 1>': <Series 1>,
    '<column name 2>': <Series 2>,
    '<column name 3>': <Series 3>
}
```
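
To make the analogy concrete, here is a minimal sketch that builds a data frame directly from a dictionary-of-lists. The names and figures are made-up placeholder values, not the data from last week, and `toy_df` is just an illustrative variable name.

```python
import pandas as pd

# A dictionary-of-lists: one key per column, one list per column.
# The values below are illustrative placeholders only.
my_data = {
    'Name': ['Alpha', 'Beta', 'Gamma'],
    'Population': [1000, 2500, 400],
    'Latitude': [51.5, 53.4, 55.9],
}

# pandas accepts this structure directly
toy_df = pd.DataFrame(my_data)

print(type(toy_df['Population']))  # each column is a pandas Series
print(list(toy_df.columns))        # column names come from the dictionary keys
print(toy_df.index.tolist())       # pandas adds a default integer index: [0, 1, 2]
```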

Let's try using pandas with last week's data!

## Load a Remote CSV file in Pandas

In [None]:
df = pd.read_csv('http://www.reades.com/CitiesWithWikipediaData.csv') # Read the remote CSV file
print(type(df)) # What is 'df'?
df.head() # What did we get?

Check it out!

Instead of having to write a 'readRemoteCSV' function and then manually create a Dictionary-of-Lists from that remote file, we just told pandas to read it for us and it automagically converted it to a data structure that we could view. You'll notice that it even figured out where the column names were. 

All we did with `df.head()` was to ask it to print out the first 5 rows of data. If we wanted to only see the first two rows it would be `df.head(2)`. This is pretty handy, right? 

Also, it deliberately mimics the Unix command-line tool `head` (i.e. `head -5 CitiesWithWikipediaData.csv`). So you've learned two tools for the price of one!
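
If you'd like to see the header detection and `head()` at work without a network connection, the sketch below reads a tiny CSV from a string instead of a URL. The CSV contents are invented for illustration; `io.StringIO` simply makes a string behave like a file.

```python
import io
import pandas as pd

# A tiny, made-up CSV; the first line holds the column names
csv_text = """Name,Population,Latitude
Alpha,1000,51.5
Beta,2500,53.4
Gamma,400,55.9
Delta,7200,52.2
"""

# read_csv accepts a URL, a file path, or a file-like object
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.columns.tolist())  # the header row became the column names
print(small_df.head(2))           # only the first two rows
```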

## Describing Numerical Data in Pandas

Let's try a few more things:

In [None]:
df.describe()

You'll probably have seen a fairly prominent warning ("Invalid value encountered in percentile"), and if you look closely you'll see that there are some fields that report '`NaN`' in some of the rows. NaN is 'Not A Number', and if you were to look at the original data you'd see that we're missing the Area and Density for some of the cities in the data set -- pandas uses NaN because it wants us to know that there was _no_ data there. 

**_Why wouldn't we want pandas to default to `0` for 'no data'?_**
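
One way to explore that question is to compare a summary statistic calculated with a missing value against the same statistic when the gap is filled with `0`. This is only a sketch using made-up density values:

```python
import pandas as pd

# Made-up densities; the third value is missing
with_nan = pd.Series([100.0, 200.0, float('nan'), 300.0])
with_zero = with_nan.fillna(0)   # pretend 'no data' had been stored as 0

print(with_nan.count())  # 3: only the non-missing values are counted
print(with_nan.mean())   # 200.0: NaN is skipped by default
print(with_zero.mean())  # 150.0: the fake 0 drags the mean down
```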

### Recap

Let's take a step back for a second to appreciate where we're at: in one function call we loaded a remote data set, parsed it to turn it into data, and then produced a 7-figure summary for _most_ of the columns in the data! That's a bit faster than trying to do it all in Excel, right?

So, just by calling `describe`...
1. We've asked Python to describe the data frame and it has returned a set of columns with descriptive metrics for each.
2. Note what is _missing_ from this list: where are 'Name', 'MetroArea', and a couple of the other columns? Can you think why they weren't reported in the descriptives?

Of course, maybe you don't want the report for all columns, maybe you're just interested in one column:

In [None]:
df.Population.describe()

So now we have the same information, but only for the Population column. We have to do this a _little_ differently because describing the DataFrame does some clever formatting when you're using Jupyter notebooks, and describing a Series requires us to print out the result. Also notice that `dtype` at the end: that tells us the _data type_ is a 64-bit float. You can have strings, floats, integers, booleans, etc. in a DataFrame.

But the really crucial thing is that this introduces _one_ of the two ways that we access a Series in pandas: `<data frame>.<series name>.method`. So we could get similar information on the Name column with:
```python
df.Name.describe()
```
And so forth.

In [None]:
print(df.Name.describe())

Notice that describing a text column gives us an 'object' data type because a String is a complex object, not a simple float or int.
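
The other way to access a Series uses square brackets, just like looking up a key in a dictionary. Here is a short sketch on made-up data showing that the two forms return the same column; the bracket form also works when a column name contains a space, which the dot form cannot handle.

```python
import pandas as pd

# Illustrative placeholder data only
toy_df = pd.DataFrame({
    'Name': ['Alpha', 'Beta', 'Gamma'],
    'Population': [1000, 2500, 400],
    'Metro Area': ['A', 'B', 'C'],   # a space rules out the dot notation
})

dot_style = toy_df.Population         # <data frame>.<series name>
bracket_style = toy_df['Population']  # <data frame>['<series name>']

print(dot_style.equals(bracket_style))  # True
print(toy_df['Metro Area'].tolist())    # ['A', 'B', 'C']
```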

In [None]:
print("The mean population of the cities in the data is: " + str(df.Population.mean()))
print("The median population of the cities in the data is: " + str(df.Population.median()))

### A Challenge for You!

If all of this has made some kind of sense, why not spend four or five minutes (at most) exploring the CSV data from last week using pandas. Try the following:

1. What's the standard deviation of the population? _[Hint: look in the help for a Series, or Google it.]_
2. What's the highest rank (i.e. smallest city) in the data set? _[Hint: you are looking for the maximum value in the id column.]_

Use the coding block below for your exploration.

And notice too that we can ask the data frame to quickly work out a derived variable (such as the mean) just by asking the Series to do the work for us: `<data frame>.<series>.method()`. You might want to have a [look at the documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#series) to see what other methods are available for a data series. It's rather a long list, but most of your descriptive stats are here in [Cumulative / Descriptive Stats](http://pandas.pydata.org/pandas-docs/stable/api.html#computations-descriptive-stats) as are things about [strings](http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling), [categorical data](http://pandas.pydata.org/pandas-docs/stable/api.html#categorical), and [plotting](http://pandas.pydata.org/pandas-docs/stable/api.html#plotting).

## Quick and Dirty Plotting

No, this isn't about this man: ![Niccolo Machiavelli](http://a5.files.biography.com/image/upload/c_fill,cs_srgb,dpr_1.0,g_face,h_300,q_80,w_300/MTE5NDg0MDU1MDQ5MzA3NjYz.jpg "Niccolo Machiavelli") (that's Niccolo Machiavelli) this is about making graphs of our data. This is going to be a long class, so I want you to recognise why it's worth getting the weather data below _into_ a pandas data frame before we go through the pain of working with the MetOffice API.

In [None]:
# You need to run this at _least_ once in each notebook to see your plots
%matplotlib inline 

In [None]:
df.Population.plot.hist() # Population histogram. What's the outlier?

In [None]:
df.Population.plot.box() # Population box plot. More than one outlier?

In [None]:
df.Latitude.plot.hist()

In [None]:
df.Latitude.plot.density()

Kind of handy, no? These aren't the _best_ looking plots, but they are all being generated on-the-fly for you by pandas with no more than a cheery `<data frame>.<data series>.plot.<plot type>`! Since those plots are all just method calls, many of them take optional parameters to change the colour, the notation (scientific or not), and other options. 

This is why we like pandas: it allows us to be _constructively lazy_. We don't need to know _how_ to draw a KDE plot (though it always helps if you don't see what you expected), we just need to know that pandas provides a method that will do it for you. And _that_ is why it's always worth having a [look at the documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#plotting).

### Looking Ahead

This is a very basic introduction to pandas just so that you're familiar with the bare bones of examining a pandas data frame to work out what it contains, how to calculate a few descriptive variables, and how to plot. Now it gets real.

# Pandas with _Real_ Data

That's the last of the 'toy' data; for the remainder of this module we're going to be working with two types of data: data about people (Socio-economic Classification) and data about the environment (weather).

We've selected these two very different types of data on purpose:

1. Because we know that some of you have interests in the human environment, and others in the natural
2. Because these are very different types of data with very different properties
3. Because we'll see that _similar_ workflows can be used with each!

What we want to highlight is that computational approaches are _highly transferrable_ between contexts. The mean or median is not _less_ relevant in one context than another, it's just more or less appropriate as a tool for understanding the data! 

We'll see the Socio-economic Classification data next week and focus on the weather API data this week. This week is harder conceptually because API data is harder to understand -- we've simplified it quite a bit (that's why there is an Appendix) but it's still got some parts that are going to be hard going.

# Finding Data in Pandas

We've got ourselves a pandas data frame containing a set of locations, so how do we go about finding one or more _specific_ rows in the data set rather than just summarising the data via `describe`?

## Searching for a Number

This is the easiest type of search to do in pandas because it looks _most_ like code you've already seen.

### Find All Sites East of 1.1 Degrees Longitude

To translate this into code, we just need to remember that: a) East would be _greater than_; and b) longitude is already a float. So in that case it's...

In [None]:
dfEast = df2[df2.longitude > 1.1]
dfEast.head(7)

Let's break this down:

* `df2.longitude` is obviously the longitude column of our data frame `df2`
* `df2.longitude > 1.1` is therefore a kind of _query_ (or _selection_) of rows where the longitude is greater than 1.1. What it _actually_ does is compare each row's longitude value to 1.1 and remember if the result is `True` or `False`.
* `df2[ ... ]` is _like_ what we do with a list when we write: `myList[3:5]` to select the fourth and fifth elements of a list, but in pandas we can _select_ non-sequential rows because we are using a `boolean` array (a.k.a. list) that looks like this: `[False, False, False, True, True, True, ...]`.
* `dfEast = ...` saves the _result_ of the selection into a new data frame called `dfEast` (data frame East).

You can check what I'm saying about the boolean result using:

In [None]:
df2.longitude > 1.1

And we can check that `dfEast` and `df2` are not the same using `shape`, which gives us the dimensions of the data frame as `(<rows>, <columns>)`:

In [None]:
print(df2.shape)
print(dfEast.shape)

What this first example means is that _anything_ that can be evaluated to `True` or `False` can be used to select rows from a data frame... Let's try some more selections based on numbers...
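
Here is the same idea as a self-contained sketch. The `sites` data frame and its values are made up for illustration (it is not the real station list), but the steps mirror the breakdown above: build the boolean mask, then use it to select rows.

```python
import pandas as pd

# Made-up sites; not the real station list
sites = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'longitude': [-3.2, 1.3, 0.4, 1.7],
})

mask = sites.longitude > 1.1    # one True/False per row
print(mask.tolist())            # [False, True, False, True]

east = sites[mask]              # keep only the rows where the mask is True
print(east.name.tolist())       # ['B', 'D']
print(sites.shape, east.shape)  # (4, 2) (2, 2)
```

Notice that the selected rows ('B' and 'D') are not next to each other, which is something a simple list slice could not give us.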

### Finding the Minimum & Maximum Elevation

The lowest point is in the Fens, and the highest point is, of course, Ben Nevis:

In [None]:
df2[df2.elevation == df2.elevation.min()] # Somewhere in Cambridgeshire

In [None]:
df2[df2.elevation == df2.elevation.???] # Find Ben Nevis

### Finding a Range Between Known Values

Perhaps we aren't just looking for extremes... how about all of the areas between 55 and 55.2 degrees latitude? *[There were 91 the last time I checked.]*

In [None]:
dfRange = df2.loc[ (df2.latitude > 55.0) & (df2.latitude < 55.2) ]
dfRange

That example contains a few new things to which you need to pay attention:
1. You'll see that, with multiple selections, we had to put parentheses around each one -- this is because `&` takes precedence over `>` and `<` in Python, so without the parentheses the comparisons would not be evaluated _first_.
2. We see an '&' (ampersand) which is completely new: it's a logical `AND` that asks pandas to "Find all the rows where condition 1 _and_ condition 2 are both `True`". So it calculates the `True`/`False` for the left side and the `True`/`False` for the right side of the `&`, and then combines them. Look at the appendix to this notebook for more examples and options.
3. We had to add a `.loc` on the end of the `df2` -- the best way to think of this is that it 'freezes' things so as to prepare the data frame to do a search based on the _location_ of some complex selection criteria. We'll see more of this next week.
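
To see what the `&` is doing, it can help to build each condition separately and then combine them. This is a minimal sketch on made-up latitudes (the `sites` data frame is illustrative, not the station data):

```python
import pandas as pd

# Made-up sites; not the real station list
sites = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'latitude': [54.9, 55.05, 55.1, 55.3],
})

above = sites.latitude > 55.0   # boolean Series for the left-hand condition
below = sites.latitude < 55.2   # boolean Series for the right-hand condition
both = above & below            # True only where both are True

print(both.tolist())            # [False, True, True, False]

# The same selection written on one line, with parentheses around each condition
in_range = sites.loc[(sites.latitude > 55.0) & (sites.latitude < 55.2)]
print(in_range.name.tolist())   # ['B', 'C']
```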

### Finding a Range Based on the Distribution

Finally, let's try finding the stations whose elevation is _greater_ than the mean. *[There are 1,330 the last time I checked.]*

In [None]:
dfMean = df2.loc[ df2.elevation > df2.???.mean() ]
print("There are " + str(dfMean.shape[0]) + " sites above the mean elevation of " + str(df2.elevation.mean()))

## Searching on Text & Categories

Numeric searching is all well and good, but what if I'm interested in finding stations in a particular area?

### Searching for a Category

Let's find the names of every station inside the Cairngorms National Park! *[There were 71 the last time I checked.]*

In [None]:
df2[ df2.??? == ??? ].name

### Searching for Part of a String

If you want to find a full match for a string then it's fairly easy and works like everything you've seen before with string matching:

In [None]:
df2[df2.name=='Cairn Gorm Summit']

And pandas also provides a lot of useful tools for searching _inside_ a string, as long as you remember to _tell_ pandas to use the string-methods (notice the format: `<data frame>.<data series>.str.<string method>()`):

In [None]:
df2[df2.name.str.startswith('Beinn A\'')]

In [None]:
df2[df2.name.str.endswith('Summit')]

Searching _inside_ a string is no harder:

In [None]:
df2[df2.name.str.contains('Charn')]

How would you ensure that you found _only_ the Geal Charn stations? 

Combine what you've learned above to create a complex query (two conditions on a single line) using a mix of search criteria:

In [None]:
df2.loc[ (???) & (???'Highland') ]

Finally, I want you to find and print out **_only_ the ID of Heathrow the town _not_ the Airport** using a single line of code. There are _at least_ two ways to retrieve this...

We are going to want that ID for the next step in working with the MetOffice API, but there is a last trick to learn here and that's how to extract an actual value as a string, int, or float from a data series. The thing to remember is that a Series is basically a list with a lot of value-added features. The contents of the list can be found in `<data series>.values`. So to get the 2nd through 5th values of the elevation column it would be:
```python
df2.elevation.values[1:5]
```
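
As a sketch of the general pattern on made-up data (the `sites` data frame, its ids, and its names are all invented), here is how `.values` behaves for a slice and for a selection that matches a single row:

```python
import pandas as pd

# Made-up sites; not the real station list
sites = pd.DataFrame({
    'id': [101, 102, 103, 104, 105, 106],
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'elevation': [5.0, 12.0, 40.0, 310.0, 95.0, 1200.0],
})

print(sites.elevation.values[1:5])  # 2nd through 5th values as a NumPy array

# A selection that matches exactly one row still returns a Series...
match = sites[sites.name == 'D'].id
print(len(match))                   # 1

# ...so index into .values to get the plain value out
site_id = match.values[0]
print(site_id)                      # 104
```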

If you write the selection criteria for Heathrow properly there should only be one item in the list of values that you retrieve, so the right code will include a `[0]` at the end:

That's yet another bunch of 'data' that's difficult for us to read, but by now this should be looking rather familiar to you... perhaps? Hang on a moment! It's a dictionary-of-lists-of-dictionaries-of...

## Creating a DataFrame from a Dictionary

And that, of course, is exactly the type of data structure that we can work with in pandas!

So the _last_ step here is to figure out how to create a new data frame from this dictionary. Here, the MetOffice has _not_ made our lives very easy because the data is packaged in a way that doesn't allow us to easily load it into pandas. If you search online, you'll find plenty of people complaining about how the MetOffice API works. Or doesn't work, if you prefer.

So we're not going to ask you to sort this out for yourselves. Instead, we're going to provide you with a function (!) to take the observation data and convert it into a data frame.
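
As a rough sketch of the general idea only (this is _not_ the function provided for this practical), suppose the observations had already been flattened into a list of dictionaries, one per reading. pandas can build a data frame from that directly. The keys mirror the observation codes used later in this notebook, but the `observations` name and every value are invented for illustration and do not reflect the real MetOffice reply format.

```python
import pandas as pd

# Hypothetical, already-flattened observations: one dictionary per reading.
# The values are stored as strings in this made-up example.
observations = [
    {'T': '11.2', 'H': '81.0', 'ts': '2016-10-01 09:00'},
    {'T': '12.6', 'H': '76.5', 'ts': '2016-10-01 10:00'},
    {'T': '13.1', 'H': '74.0', 'ts': '2016-10-01 11:00'},
]

obs_df = pd.DataFrame(observations)  # keys become columns, dictionaries become rows

print(obs_df.shape)            # (3, 3)
print(sorted(obs_df.columns))  # ['H', 'T', 'ts']
print(type(obs_df['T'][0]))    # still a string, not a float
```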

## Tidying Up

Before we can get back to plotting (again) we have a few more steps to work through:

1. To rename the columns to something a little more useful.
2. To turn the 'ts' field into an _actual_ timeseries so that pandas understands what it is.
3. To convert all of the other series to the right numerical/categorical format.

Let's do this in several stages... 

### Changing Column Names

You may remember that I indicated what the observations returned by each weather station might include:

* D  = Wind Direction
* Dp = Dew Point
* G  = Wind Gust
* H  = Humidity
* Pt = Pressure Tendency
* S  = Wind Speed
* T  = Temperature
* V  = Visibility
* W  = Weather Type
* ts = Time of Day

Given this, and the fact that I've listed these in order, what needs to replace the '???' in the code below?

In [None]:
df3.??? = ['WindDirection','DewPoint','WindGust','Humidity','PressureTendency','WindSpeed','Temperature','Visibility','WeatherType','ts']

In [None]:
df3.head(3)

You should see the 'full' column names now.

### Changing column types

If you were exploring the data frame along the way, you might have already noticed that the description of numeric columns (like Temperature) doesn't seem much like what we had before -- shouldn't we get the 7-figure summary for numeric columns? The problem is that pandas didn't know what we expected the columns to be, so it's treated them all as 'objects' (basically: strings) and not as numeric data types.

So we need to fix that now... as you saw before, there's a method called `astype` that allows us to convert between data types where it's fairly easy for pandas to figure out what we want to do:

In [None]:
df3.Temperature.describe()

In [None]:
for c in ['WindDirection','WeatherType','PressureTendency']:
    df3[c] = df3[c].astype('category')
for c in ['DewPoint','Humidity','Temperature']:
    df3[c] = df3[c].astype('float')
for c in ['WindGust','Visibility']:
    df3[c] = df3[c].astype('int')

In [None]:
df3.Temperature.describe()

That's more like it!
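
If you want to see the before-and-after in isolation, here is a minimal sketch using three made-up temperature readings stored as strings (the `readings` data frame is illustrative only):

```python
import pandas as pd

# Made-up readings stored as strings
readings = pd.DataFrame({'Temperature': ['11.2', '12.6', '13.1']})

as_text = readings.Temperature.describe()
print(as_text.index.tolist())     # count, unique, top, freq

# Convert the column from text to floating-point numbers
readings['Temperature'] = readings['Temperature'].astype('float')

as_number = readings.Temperature.describe()
print(as_number.index.tolist())   # count, mean, std, min, quartiles, max
print(as_number['mean'])
```

The same `describe` call gives a completely different summary once pandas knows that the column holds numbers.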

### Working with Timeseries Data

So that's looking a lot more useful, but as a final step we need to make sure that the temporal data is actually treated as a time series... again, Google is your friend here: `"pandas convert datetime to time series"`. 

Given that we are creating a _new_ column called 'Time' from an _existing_ column called 'ts', what do you think needs to replace the '???'?

In [None]:
df3['Time'] = pd.to_datetime(df3.???.values, infer_datetime_format=True)

And now compare:

In [None]:
df3.ts.head()

In [None]:
df3.Time.head() # A quick check

## Using a Time Series in an Index

We can tell that that type conversion succeeded because we've got a new `dtype`: `datetime64[ns]`. 
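
Here is a small sketch of that conversion on made-up timestamps (the `stamps` Series is invented for illustration). Once the text has been parsed, the parts of each date become available through the `.dt` accessor:

```python
import pandas as pd

# Made-up timestamps stored as text
stamps = pd.Series(['2016-10-01 09:00', '2016-10-01 10:00', '2016-10-02 09:00'])

times = pd.to_datetime(stamps)   # parse the strings into datetimes

print(times.dtype)               # a datetime64 dtype rather than text
print(times.dt.day.tolist())     # [1, 1, 2]
print(times.dt.hour.tolist())    # [9, 10, 9]
```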

We can also now do some really neat things to 'resample' the data based on the fact that we have temporal data; however, to take advantage of this we have to let pandas know that the entire data set is organised by time. We do this by replacing the existing integer index with a datetime one:

In [None]:
df3.index = pd.to_datetime(df3.ts.values, infer_datetime_format=True)
df3.head(3)

In the output above, you'll notice that the left-most column (the one without a name, because it's an _index_, not a column) is now a datetime object. Why is that useful? Well check _this_ out:

In [None]:
df3.Temperature.resample('D').mean()

OK, this is another geeky moment but how cool is that? By telling pandas that our data is temporal, we're now in a position to ask pandas to answer questions like "What was the average daily ('D') temperature at Heathrow?" And we can do this in _one_ line of code.

If you had data at minutes-level resolution, then you could aggregate to Hourly or Daily. In principle, you can also do all sorts of datetime queries around things like "What was the weekly average in the 3rd week of 2016?" or "What was last Friday's weather?".
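
To make the aggregation concrete, here is a minimal sketch with made-up six-hourly temperatures over two days (the values and the `temps` name are illustrative, not real observations):

```python
import pandas as pd

# Made-up temperatures at six-hourly intervals over two days
index = pd.to_datetime([
    '2016-10-01 00:00', '2016-10-01 06:00', '2016-10-01 12:00', '2016-10-01 18:00',
    '2016-10-02 00:00', '2016-10-02 06:00', '2016-10-02 12:00', '2016-10-02 18:00',
])
temps = pd.Series([8.0, 7.0, 13.0, 12.0, 9.0, 8.0, 15.0, 12.0], index=index)

daily = temps.resample('D').mean()   # one mean per calendar day
print(daily.tolist())                # [10.0, 11.0]

# A datetime index also allows selection by a partial date string
second_day = temps.loc['2016-10-02']
print(len(second_day))               # 4
```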

# Plotting!

This has been a long, slow build towards something more exciting: plotting! Well, plotting _again_. In a way, this has been a lot of effort just to make a graph, but let's recognise where we're at:

* We can request data for _any_ location in Britain by changing the location id.
* We can get new data _any_ time we feel like it.
* We can (in a minute) create a plot of that data.
* We can update it continuously in the future!

That's pretty awesome, right?

In [None]:
# This command tells Jupyter that we want 
# the plots to be shown inline (on this 
# web page). You'll always need to do this
# *once* on a notebook.
%matplotlib inline

Pandas can do a _lot_ of different plots, [see for yourself](http://pandas.pydata.org/pandas-docs/stable/visualization.html#visualization-hexbin). Here's a sampling:

In [None]:
df3.Humidity.plot()

In [None]:
df3.Temperature.plot.bar()

In [None]:
df3.Temperature.plot.box() # Handy!

In [None]:
df3.plot.scatter(x='Temperature', y='Humidity') # Spot the problem data point

In [None]:
df3.plot.scatter(x='Temperature', y='Humidity', s=(df3.DewPoint+1)*25);

In [None]:
df3.plot.hexbin(x='Temperature', y='Humidity', gridsize=15)

In [None]:
df3.WindDirection.plot() # Ooops.

I tend to think that that's quite enough to be coping with for one session... over to the script!

# Appendix

The material below here is very helpful if you _really_ want to get to grips with both some powerful computing concepts (hello recursion!) and to understand how I pulled together the data frame through working with the data iteratively to get to grips with the MetOffice reply. However, it is not _necessary_ that you understand these ideas now since they are relatively advanced and will be more directly useful in the _Spatial Analysis_ and _Applied Geocomputation_ modules.

## Logical Comparisons

You have already seen a bit of Boolean logic using 'and', 'or', and 'not'. For reasons that aren't really worth getting into here, when you're dealing with the simple binary True/False data (a.k.a. [bit-wise](https://wiki.python.org/moin/BitwiseOperators)) comparisons then the same operations are written using a slightly different syntax:

1. 'and' becomes '&'
2. 'or' becomes '|'
3. 'not' becomes '~'

These are particularly useful for complex queries in pandas.
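
A short sketch of all three operators on made-up elevations (the `sites` data frame is illustrative only):

```python
import pandas as pd

# Made-up sites; not the real station list
sites = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'elevation': [5.0, 310.0, 95.0, 1200.0],
})

low = sites.elevation < 50
high = sites.elevation > 1000

print((low | high).tolist())    # 'or':  [True, False, False, True]
print((~low).tolist())          # 'not': [False, True, True, True]
print((low & high).tolist())    # 'and': [False, False, False, False]

# Used in a selection: everything that is neither very low nor very high
middle = sites[~(low | high)]
print(middle.name.tolist())     # ['B', 'C']
```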