## Thinking About Data Structures

Instinctively, it seems like we need:

1. To keep the rows in order
2. To keep the columns in order

So if we wanted to do that then the 'right' data structure would clearly be a list-of-lists (LoLs!):
```python
data = [
    ['id', 'Name',           'Rank', 'Longitude',    'Latitude',    'Population'],
    ['1',  'Greater London', '1',    '-18162.92767', '6711153.709', '9787426'],
    ['10', 'Sheffield',      '10',   '-163545.3257', '7055177.403', '685368']
]
```

In [None]:
from urllib.request import urlopen
import csv

url = "http://bit.ly/2vrUFKi" # A bit-link to save space

urlData = [] # Somewhere to store the data

response = urlopen(url)        # Get the data using the urlopen function
csvfile  = csv.reader(response.read().decode('utf-8').splitlines()) # Pass it over to the reader function

for row in csvfile:              
    urlData.append( row )
    
print(urlData) # Check it worked!

If it worked, then you should have this output:
```python
[['id', 'Name', 'Rank', 'Longitude', 'Latitude', 'Population'], ['1', 'Greater London', '1', '-18162.92767', '6711153.709', '9787426'], ['2', 'Greater Manchester', '2', '-251761.802', '7073067.458', '2553379'], ['3', 'West Midlands', '3', '-210635.2396', '6878950.083', '2440986'], ['4', 'West Yorkshire', '4', '-185959.3022', '7145450.207', '1777934'], ['5', 'Glasgow', '5', '-473845.2389', '7538620.144', '1209143'], ['6', 'Liverpool', '6', '-340595.1768', '7063197.083', '864122'], ['7', 'South Hampshire', '7', '-174443.8647', '6589419.084', '855569'], ['8', 'Tyneside', '8', '-187604.3647', '7356018.207', '774891'], ['9', 'Nottingham', '9', '-131672.2399', '6979298.895', '729977'], ['10', 'Sheffield', '10', '-163545.3257', '7055177.403', '685368']]
```
To you that might look a lot _worse_ than the data that you originally had, but to a computer that list-of-lists is something it can work with; check it out:

In [None]:
for c in urlData:                                      # For each row in the list
    print("The population of " + c[1] + " is " + c[5]) # Print out the name and population

The advantage of using the `csv` library over plain old `string.split` is that the csv library knows how to deal with fields that contain commas (_e.g._ `"Cardiff,Caerdydd"`) or even newlines, and so is much more flexible and consistent than our naive `split` approach. The vast majority of _common_ tasks (reading certain types of files, getting remote files, etc.) have libraries that do exactly what you want without you needing to write much code yourself. You should always have a look around online to see if a library exists before deciding that you need to write everything/anything from scratch. The tricky part is knowing what words to use for your search and how to read the answers that you find...
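To make that difference concrete, here is a minimal sketch comparing the two approaches on a single line of text. The `line` string and its values are made up for illustration; they are not taken from the city files.

```python
import csv

# A made-up CSV line in which one field contains a comma inside quotes
line = '11,"Cardiff,Caerdydd",11,447287'

naive  = line.split(',')            # Splits on every comma, even the quoted one
parsed = next(csv.reader([line]))   # csv.reader respects the quotes

print(naive)   # 5 fields: the name has been broken in two
print(parsed)  # 4 fields: the name is kept whole
```

Notice that `csv.reader` also strips the surrounding quotes, so the name comes back as one clean string.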

Let's try this with the 'full' data set:

In [None]:
from urllib.request import urlopen
import csv

url = "http://bit.ly/2iIK9bA" # A bit-link to save space

urlData = [] # Somewhere to store the data

response = urlopen(url)
csvfile = csv.reader(response.read().decode('utf-8').splitlines())

for row in csvfile:              
    urlData.append( row )
    
for c in urlData:                                      # For each row in the list
    print("The population of " + c[1] + " is " + c[5]) # Print out the name and population

#### Hmmmm, is that right???

Have a think about why doing the same thing that worked on one data set might not work on a different one. If you understand what happened here then the next section will make a _lot_ more sense!

## Why 'Obvious' is Not Always 'Right'

But you need to be careful about assuming that, just because something is hard for you to read, it's also hard for a computer to read! I've said before that the way a computer 'thinks' and the way that we think don't always line up naturally. Experienced programmers can think their way _around_ a problem by working _with_ the computer, rather than against it.

Some issues to consider:

- Is the first row of 'data' actually data?
- Do we really care about column _order_, or do we just care about being able to pick the *correct* column?
- Does the LoL approach deal with data efficiently? 

Let's apply this approach to the parsing of our data...

### What's an _Appropriate_ Data Structure?

If you look closely, then using either of the 'Cities' data files to do anything _useful_ presents something of a 'problem': our list-of-lists isn't very easy to navigate. Notice that not only _might_ the location of the Population column be different in the two files (as it turns out, it was, deliberately), but when we want to work out the answer to a simple question such as 'what is the 3rd largest city?' we need to step through a lot of irrelevant data as well: we'd need to write a for loop and make sure that we skip past the name, latitude, longitude, etc. 

That doesn't make much sense since this should all be easier and faster in code than in Excel, but right now it's _harder_, and quite possibly _slower_ as well! When you get into situations like this (having to write a lot of code to do something that should be fast and easy) it is often the case that you've got the wrong _data structure_. 

So how does the experienced programmer get around this? 'Simple' (i.e. neither simple, nor obvious, until you know the answer): she realises that the data is organised the wrong way! We humans naturally tend to think in rows of data: London has the following _attributes_ (population, location, etc.), and York has a different set of attributes. So we read across the row because that's the easiest way for us to read it. But, in short, a list-of-lists does _not_ seem to be the right way to store this data!

Crucially, a computer doesn't have to work that way. For a computer, it's as easy to read _down_ a column as it is to read _across_ a row. In fact, it's easier, because each column has a consistent _type_ of data: one column contains names (strings), another column contains populations (integers), and other columns contain other types of data (floats, etc.). 

Better still, the order of the columns often doesn't matter as long as we know what they are called: it's easier to ask for the 'population column' than it is to ask for the 6th column since, for all we know, the population column might be in a different place for different files but they are all (relatively) likely to use the 'population' label for the column itself.
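As a rough sketch of what 'asking by name' looks like, here are two tiny made-up tables (`file_a` and `file_b` are invented for illustration) that hold the same columns in a different order. Looking the column up by its label gives the same answer for both.

```python
# Two made-up tables with the same columns in a different order
file_a = [['Name', 'Population'], ['Greater London', '9787426']]
file_b = [['Population', 'Name'], ['9787426', 'Greater London']]

found = []
for table in (file_a, file_b):
    header  = table[0]                     # The first row holds the column names
    pop_col = header.index('Population')   # Find the column by name, not by position
    found.append(table[1][pop_col])        # Read the first data row from that column

print(found)  # The same population from both layouts
```

If we had hard-coded the column number instead, one of the two files would have given us the city name rather than the population.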

#### A Dictionary of Lists

So, if we don't care about column order, only row order, then a dictionary of lists would be a nice way to handle things. And why should we care about column order? With our two CSV files above we already saw what a pain it was to fix things when the layout of the columns changed from one data set to the next. If, instead, we can just reference the 'Population' column then it doesn't matter where that column actually is. Why is that? 

Well, here are the first four rows of data from the simple city file as a list-of-lists:

```python
['id', 'Name', 'Rank', 'Longitude', 'Latitude', 'Population'], 
['1', 'Greater London', '1', '-18162.92767', '6711153.709', '9787426'], 
['2', 'Greater Manchester', '2', '-251761.802', '7073067.458', '2553379'], 
['3', 'West Midlands', '3', '-210635.2396', '6878950.083', '2440986']
```

Now, here's how it would look as a dictionary of lists organised by _column_, not by row:

```python
myData = {
    'id'         : [1, 2, 3],
    'Name'       : ['London', 'Manchester', 'West Midlands'],
    'Rank'       : [1, 2, 3],
    'Longitude'  : [-18162.92767, -251761.802, -210635.2396],
    'Latitude'   : [6711153.709, 7073067.458, 6878950.083],
    'Population' : [9787426, 2553379, 2440986],
}

```

What does this do better? Well, for starters, we know that everything in the 'Name' column will be a string, and that everything in the 'Longitude' column is a float, while the 'Population' column contains integers. So that's made life easier already. But let's test this out and see how it works.
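Before testing it, it may help to see one way that a list-of-lists could be turned into a dictionary of lists. This is only a sketch on a cut-down, three-column version of the data; `rows`, `header`, `records` and `columns` are names chosen for this example.

```python
# A cut-down list-of-lists: the first row holds the column names
rows = [
    ['id', 'Name',               'Population'],
    ['1',  'Greater London',     '9787426'],
    ['2',  'Greater Manchester', '2553379'],
]

header, records = rows[0], rows[1:]

# Start with one empty list per column name
columns = {name: [] for name in header}

# Walk the rows once, dropping each value into its column
for record in records:
    for name, value in zip(header, record):
        columns[name].append(value)

# Everything read from a CSV file is a string, so convert where needed
columns['Population'] = [int(v) for v in columns['Population']]

print(columns['Name'])
print(columns['Population'])
```

We still need a loop to _build_ the structure, but once it exists each column holds a single, predictable type of data.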

### Step 3: Do some calculations

Now that we have our final data structure (which, as we will see next, is already available via the Pandas package), we can do some calculations.  

In [None]:
myData = {
    'Name'       : ['London','Manchester','West Midlands'],
    'Rank'       : [1, 2, 3],
    'Longitude'  : [-18162.92767, -251761.802, -210635.2396],
    'Latitude'   : [6711153.709, 7073067.458, 6878950.083],
    'Population' : [9787426, 2553379, 2440986],
}

# Find the population of Manchester
pop = myData['Population'][myData['Name'].index('Manchester')]
print("The population of Manchester is: " + str(pop))

# Find the easternmost city
city = myData['Name'][myData['Longitude'].index(max(myData['Longitude']))]
print("The easternmost city is: " + str(city))

# Find the mean population of the cities
import numpy as np # Need to import a useful package
mean = np.mean(myData['Population'])
print("The mean population is: " + str(mean))

There's a _lot_ of content to process in the code above, so do _not_ rush blindly on if this is confusing. 

**Stop. Think it through. Talk it out with your neighbour and lecturers.** 

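We'll go through each one in turn, but they nearly all work in the same way, and the key thing to notice is that we no longer write any loops of our own: we just call `index`. Python still has to scan the list to find a match, but it does so inside optimised built-in code, which is typically much faster than a hand-written for loop. 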

### The Population of Manchester

The code can look pretty daunting, so let's break it down into two parts.

What would you get if you ran just this code?
```python
myData['Population'][0]
```
Remember that this is a dictionary-of-lists (DoL). So, Python first looks for a key named `Population` in the myData dictionary. It finds out that the value associated with this key is a _list_ (`[9787426, 2553379, 2440986]`). In this example, it just pulls out the first value (index 0), which is `9787426`. Does that make sense?

Now, to the second part:
```python
myData['Name'].index('Manchester')
```

This is very similar: we look in the dictionary for the key `Name` and find that that's _also_ a list (`['London','Manchester','West Midlands']`, since you asked). If you don't remember what `index` does, don't worry, here's the output from Python's `help()` function:
```
Help on built-in function index:

index(...)
    L.index(value, [start, [stop]]) -> integer -- return first index of value.
    Raises ValueError if the value is not present.
```
So all we're doing is asking Python to find out the index of 'Manchester' in the list associated with the dictionary key 'Name' _instead_ of just sticking in a `0` to get the first index value. Putting these two things back together what we're doing is:

* Finding the index (i.e. **row**) of 'Manchester' in the Name column,
* Using that index to read a value out of the Population column.

Notice the complete _absence_ of a for loop?

Does that make sense? If it does then you should be having a kind of Alice-through-the-Looking-Glass moment, because what we've done by taking a column view rather than a row view is to make Python's `index()` method do the work for us. Instead of having to look through each row for a field that matches 'Name' and then check to see if it's 'Manchester', we've pointed Python at the right column immediately and asked it to find the match (which it can do very quickly). Once we have a match then we _also_ have the row number to go and do the lookup in the 'Population' column because the index _is_ the row number!
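If the one-line version is still hard to follow, the same lookup can be written as two separate steps. This sketch uses a cut-down copy of `myData` holding only the two columns we need; `row` is just a name chosen here for the intermediate result.

```python
myData = {
    'Name'       : ['London', 'Manchester', 'West Midlands'],
    'Population' : [9787426, 2553379, 2440986],
}

# Step 1: find the row number of 'Manchester' in the Name column
row = myData['Name'].index('Manchester')

# Step 2: use that row number to read a value out of the Population column
pop = myData['Population'][row]

print(row)  # 1
print(pop)  # 2553379
```

The one-line version simply puts step 1 inside the square brackets of step 2.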

### The Easternmost City

Where this approach really comes into its own is on problems that involve maths. To figure out the easternmost city in this list we need to find the _maximum_ Longitude and then use _that_ value to look up the city name. So let's do the same process of pulling this apart into two steps:

It should be _pretty_ obvious what this does:
```python
myData['Name'][0]
```

But we don't just want the first city in the list, we want the one with the highest longitude. So to achieve that we need to replace the `0` with an index that we found by looking in the `Longitude` list.
```python
myData['Longitude'].index(max(myData['Longitude']))
```

Ugh, that's still a little hard to read, isn't it? Let's write it down another way to make it easier to read:

```python
myData['Longitude'].index(
    max(myData['Longitude'])
)
```

There's the same `.index` which tells us that Python is going to look for something in the list associated with the `Longitude` key. All we've done is change what's _inside_ that index function to `max(myData['Longitude'])`. This is telling Python to find the _maximum_ value in the `myData['Longitude']` list. So to explain this in three steps, what we're doing is:
* Finding the maximum value in the Longitude column (we know there must be one, but we don't know what it is!),
* Finding the index (position) of that maximum value in the Longitude column (now that we know what the value is!),
* Using that index to read a value out of the Name column.

I _am_ a geek, but that's pretty cool, right? In one line of code we managed to quickly find out where the data we needed was even though it involved three discrete steps. Remember how much work it was to find the mean when you were still thinking in _rows_, not _columns_?
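Here are those three steps written out one at a time, as a sketch on a cut-down copy of `myData`. The intermediate names `max_lon` and `row` are introduced just for this example.

```python
myData = {
    'Name'      : ['London', 'Manchester', 'West Midlands'],
    'Longitude' : [-18162.92767, -251761.802, -210635.2396],
}

# Step 1: find the maximum value in the Longitude column
max_lon = max(myData['Longitude'])

# Step 2: find the position (row number) of that value
row = myData['Longitude'].index(max_lon)

# Step 3: use that row number to read from the Name column
city = myData['Name'][row]

print(max_lon)  # -18162.92767
print(row)      # 0
print(city)     # London
```

All of the longitudes here are negative, so the 'maximum' is the one closest to zero.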

### The Average City Size

Yeah, let's try that too.

Here we're going to 'cheat' a little bit: rather than writing our own function, we're going to import a package and use someone _else's_ function. The `numpy` package contains a _lot_ of useful functions that we can call on (if you don't believe me, add "`dir(np)`" on a new line after the `import` statement), and one of them calculates the average of a list or array of data.
```python
import numpy as np # Need to import a useful package
mean = np.mean(myData['Population'])
```
This is where our new approach really comes into its own: because all of the population data is in one place (a.k.a. a _series_ or column), we can just throw the whole list into the `np.mean` function rather than having to use all of those convoluted loops and counters. Simples, right?
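To see what we're being spared, here is a small sketch that works out the same mean twice: once the row-wise way with a loop and counters, and once the column-wise way. To keep it free of extra packages it uses Python's built-in `sum` and `len` rather than `np.mean`; the `rows` list is a made-up row-wise copy of the data.

```python
# Row-wise: step through every row and keep a running total and count
rows = [
    ['London',        9787426],
    ['Manchester',    2553379],
    ['West Midlands', 2440986],
]
total = 0
count = 0
for r in rows:
    total += r[1]   # The population is the second field in each row
    count += 1
row_mean = total / count

# Column-wise: the whole column is already one list
populations = [9787426, 2553379, 2440986]
col_mean = sum(populations) / len(populations)

print(row_mean == col_mean)  # True
```

Both give the same answer, but the column-wise version needs no loop, no counter, and no knowledge of where the population sits within a row.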

### Review!

So the _really_ clever bit in all of this isn't switching from a list-of-lists to a dictionary-of-lists, it's recognising that the latter is a _better_ way to work _with_ the data that we're trying to analyse and that there are useful functions that we can exploit to do the heavy lifting for us. Simply by changing the way that we stored the data in a 'data structure' (i.e. complex arrangement of lists, dictionaries, and variables) we were able to do away with lots of for loops and counters and conditions, and reduce many difficult operations to something that could be done on one line! 

So now, finally, we can bundle up our code into our own function to read data from a URL and report some calculations. 

In [None]:
from urllib.request import urlopen
import csv
import numpy as np

# Define our function
def read_city_data(url):
    """
    Reads a remote CSV file of city data and returns
    a dictionary-of-lists containing the data.
    """
    urlData = {} # Somewhere to store the data

    response = urlopen(url)
    csvfile = csv.reader(response.read().decode('utf-8').splitlines())
    
    headers = next(csvfile)
    
    for h in headers:
        urlData[h] = []
    
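    for row in csvfile:
        for h, value in zip(headers, row): # Pair each value with its column name
            urlData[h].append(value)
    
    # Everything arrives as a string, so convert the numeric columns
    for col in ['id','Rank','Population']:
        urlData[col] = list(map(int, urlData[col])) # map() is lazy in Python 3, so wrap it in list()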
        
    for col in ['Latitude','Longitude']:
        urlData[col] = list(map(float, urlData[col]))
    
    return urlData

Now we can use the function we have defined:

In [None]:
print("URL 1:")
data1 = read_city_data("http://bit.ly/2vrUFKi")
print("\tThe average population in data1 is " + str(np.mean(data1['Population'])))

print("URL 2:")
data2 = read_city_data("http://bit.ly/2iIK9bA")
print("\tThe average population in data2 is " + str(np.mean(data2['Population'])))

Notice how the code to read the population is the _same_ for both data sets (unlike our old LoL with for loops)? See if you can use the code that we saw above to work out the Westernmost city in Britain, the 25th largest city... 

### Bringing it all together...

The function we have created at the end here is the key to the entire practical since it brings together all of the ideas that we've covered this session:

- We've used for loops, dictionaries, and lists to generate a data structure
- We've used no fewer than three packages to do complicated things for us
- We've wrapped some complex operations up in a function that we can re-use
- We've seen how clever ways to do things can make our lives much, much easier

If this didn't make sense to you on the _first_ runthrough, I'd suggest going back through the latter half of the practical _again_ in a couple of days' time -- that will give your brain a little time to wrap itself around the basics before you throw the hard stuff at it. _Don't_ panic if it doesn't all make sense on the _second_ runthrough either -- this is like a language, you need to practice! With luck, the second time you went through this code a little bit _more_ made sense. If you need to do it a third time you'll find that even _more_ makes sense... and so on. 

Conceptually, this is one of the hardest practicals in the entire term because it joins up so many of the seemingly simple ideas that you covered in Code Camp into a very complex 'stew': our basic ingredients (lists, dictionaries, etc.) simmered for a bit and became this heady mix.

#### Brain Teaser

If you want to have a stab at writing the code to print out the 4th most populous city-region then knock yourself out! This can _still_ be done on one line and is _very_ fast even though we are now dealing with quite a bit more data.

Here's a clue: you don't want to use `<list>.sort()` because that will sort your data _in place_ and break the link between the indexes across the columns; you want to use the function `sorted(<list>)` where `<list>` is the variable that holds your data and `sorted(...)` just returns whatever you pass it in a sorted order _without_ changing the original. You'll see why this matters if you get the answer... otherwise, wait a few days for the answers to post.
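To see the difference between the two without giving the game away, here is a sketch on some made-up data (`names` and `scores` are invented and have nothing to do with the cities):

```python
# Made-up data: two 'columns' linked by position, like our dictionary-of-lists
names  = ['a', 'b', 'c']
scores = [30, 10, 20]

ranked = sorted(scores)         # A new, sorted list; scores itself is untouched
print(ranked)                   # [10, 20, 30]
print(scores)                   # [30, 10, 20]

# The link between the two columns still works
print(names[scores.index(30)])  # a

# Sorting in place moves the values, so the same lookup now goes wrong
broken = list(scores)           # Work on a copy so that we keep the original
broken.sort()
print(names[broken.index(30)])  # c
```

After an in-place sort the value `30` sits in the last position, so its index no longer points at the right name.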

In [None]:
# Print out the name of the 4th most populous city-region
city = data2???
print("The fourth most populous city is: " + str(city))