# Moving from spreadsheets to Python

This notebook introduces some basic concepts in Python by showing how common spreadsheet workflows have parallels in coding.

But why do it in Python if you can do it in a spreadsheet? There are two reasons. First, this is just a stepping stone to things that you *can't* do in a spreadsheet. Second, using notebooks can actually make your process much more transparent, and easier to communicate (and remember), than using a spreadsheet.

Let's begin.

## Cell references are like variables

If you've done a calculation in a spreadsheet, chances are you've written a formula like this:

`=A1+A2`

In that situation, you are asking the spreadsheet to add the number in cell A1 to the number in cell A2.

(Another very common formula divides a number by a grand total to calculate a percentage.)

First, of course, the numbers must actually *be* in those cells - and for that to happen, you or someone else must have typed a number into each cell.

In Python, the same principle exists with **variables**. 

A variable is created with the equals sign, `=`, like this:

In [2]:
A1 = 34
A2 = 101

This is called **assigning** a variable. 

Once those two lines have been run, it means we now have two variables - `A1` and `A2` - each of which contains a number.

We can't see them like we could with a spreadsheet, but we can see them in a different way: in code blocks, as part of the story being told by this notebook. 

So we can see variables being created in the code, and we can follow what happens to them next.

At any point we can check what's in that variable by using the `print()` command, putting the variable we want to 'print' in the brackets, like this:

In [None]:
print(A1)

34


This time, when the code is run you can see some **output** underneath. 

The `print()` command simply shows (prints) whatever you tell it to.

So it's a good way of keeping track of the things you are working with, or checking things are doing what you want them to do.
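
As a small illustration of using `print()` to keep track of things, here is a minimal sketch that labels each value as it is printed. The label text in quotes is my own addition for the example; `print()` accepts several items separated by commas and shows them on one line with a space between them.

```python
#store two values in two variables, as above
A1 = 34
A2 = 101
#print each variable with a short text label (the labels are just illustrative)
print("A1 contains", A1)
print("A2 contains", A2)
```

Running this should show `A1 contains 34` on one line and `A2 contains 101` on the next.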

## Performing calculations with variables

Once stored, variables can be used to perform calculations in exactly the same way as cell references in a spreadsheet formula.

The only difference is that you may need to use `print()` to see the results.

For example, if we want to add the variables `A1` and `A2` we might write this code:

In [None]:
#add A1 and A2
A1+A2

Or wrap it in a `print()` command like this:

In [None]:
#add A1 and A2, then print the results
print(A1+A2)

135


In this case, the calculation is performed first, and then the results are used by the `print()` command. 

In both the examples above, however, the results of the calculation aren't stored anywhere. 

So a third way of doing it would be this:

In [None]:
#add A1 and A2 and store in 'mytotal'
mytotal = A1+A2
#print that new variable
print(mytotal)

135


In this case, we've added the two variables `A1` and `A2`, and then stored the result in another, new, variable.

We called that variable `mytotal` - this is just an arbitrary name: we could have called it almost anything.

And then we printed it. 

Of course, as with cell references in spreadsheets, the major advantage of a variable is that if the numbers change (for example, when an updated dataset is released) you only need to change the values of the variable(s) and can leave the rest of the code unchanged.

Below, for example, is all the code so far in one block, but with the value of `A1` changed so that the same code runs on the new number.

In [None]:
#store two values in two variables
A1 = 43
A2 = 101
#add A1 and A2 and store in 'mytotal'
mytotal = A1+A2
#print that new variable
print(mytotal)

144
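
One difference from a spreadsheet is worth sketching here: a spreadsheet formula recalculates as soon as a cell changes, but a Python variable keeps the value it was given until the line that assigns it is run again. The example below is a minimal illustration using the same numbers as above.

```python
#store two values and add them together
A1 = 34
A2 = 101
mytotal = A1+A2
#change A1 - but mytotal is NOT recalculated automatically
A1 = 43
print(mytotal)
#run the calculation again to update mytotal
mytotal = A1+A2
print(mytotal)
```

The first `print()` shows 135 (the old total) and the second shows 144, which is why the whole block needs re-running after a value changes.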


## Adding comments

You'll have noticed that the last few code blocks each had some lines of text that started with a hash symbol, like this:

`#add A1 and A2`

This is a **comment**. 

A comment is created in Python whenever you type a hash. Any text after that hash, up to the end of the line, will not be treated as code - it won't do anything.

This is often used to add explanations about what the next line of code is doing, partly to help other people understand it, but also often to help yourself, when you come back to the code later. 

(Some people will type comments on the same line, after the working code, but this can create very long lines.)
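
To illustrate the two styles, here is a rough sketch using the same calculation as before: one comment on its own line, and one on the same line as the code. The wording of the comments is just for the example.

```python
A1 = 43
A2 = 101
#a comment on its own line, describing the line below
mytotal = A1+A2
print(mytotal) #a comment on the same line, after the working code
```

Both comments are ignored when the code runs, so the only output is the total, 144.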

In a notebook, comments are less essential because you can use the text blocks to add an explanation instead, as I'm doing now. However, it's still a good idea to add comments whenever they might help clarify what's going on - for someone else, or for your future self.

## Applying this to a story: FOI responses

We can now start to use some of these techniques with some data to find a story. 

You can find Freedom of Information statistics on Gov.uk. Specifically, I'm going to work through a simple story based on the [Freedom of Information statistics: April to June 2021 bulletin](https://www.gov.uk/government/statistics/freedom-of-information-statistics-april-to-june-2021/freedom-of-information-statistics-april-to-june-2021-bulletin).

At the top of that page are three download links: a PDF (ugh); "data tables"; and a "csv dataset". 

[Download the data tables](https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1017270/foi-statistics-q2-2021-statistical-tables.ods) and go to the sheet called '10_Exemptions'.

We can see that column F shows how many requests were refused due to the Section 27 exemption, "International relations".

And row 21 shows how many FOI requests were received by the Foreign, Commonwealth and Development Office (FCDO, shortened here to FCO).

The total number of requests received by the FCO is in cell B21, so we can copy that and store it.



In [None]:
#store the number of requests in a new variable called 'fcorequests'
fcorequests = 48

And the number refused due to Section 27 is in cell H21:

In [None]:
#store the number of refusals in a new variable called 'sec27refusals'
sec27refusals = 28

Then we can calculate what percentage of FOI requests to the FCO were refused due to Section 27.

In [None]:
#calculate a percentage by dividing the part (refusals) by the whole (requests)
#and store in a new variable called 'percrefused'
percrefused = sec27refusals/fcorequests
#print it
print(percrefused)
#multiply by 100 to make it easier to 'read' as a percentage
print(percrefused*100)

0.5833333333333334
58.333333333333336


The answer, then, is that 58% of FOI requests to the Foreign Office were refused because the department believed that releasing the information:

> "would, or would be likely to, prejudice (a) relations between the United Kingdom and any other State, (b)relations between the United Kingdom and any international organisation or international court, (c) the interests of the United Kingdom abroad, or (d) the promotion or protection by the United Kingdom of its interests abroad." ([Section 27 of the FOI Act](https://www.legislation.gov.uk/ukpga/2000/36/section/27))

That's quite a lot.
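
The rounding from 58.333... to 58% can also be done in code. This is a minimal sketch using Python's built-in `round()` function, which isn't used elsewhere in this notebook; the variable name `percrounded` is just one I've made up for the example.

```python
#the two numbers from the spreadsheet, as above
fcorequests = 48
sec27refusals = 28
#calculate the proportion refused, as before
percrefused = sec27refusals/fcorequests
#multiply by 100 and round to the nearest whole number
percrounded = round(percrefused*100)
print(percrounded)
#or keep one decimal place by adding a second number in the brackets
print(round(percrefused*100, 1))
```

This prints 58 and then 58.3, so you can choose how much detail to report.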

### Naming variables

You might have noticed that in the example above we didn't use an abstract name for the variable like `B21`. 

Because we can call a variable anything we want, it's much better to choose a name which is meaningful and specific. I chose `fcorequests` because that's exactly what we're storing in that variable, so it's easy to understand what's happening when we use it.

However, note that I didn't call it `fco requests` with a space in the name. 

This is because if you have a space between words in Python (and coding generally), the two words will be interpreted as two separate things.

Let's see what will happen if we try to do that...

In [None]:
fco requests = 48

SyntaxError: invalid syntax

We get a `SyntaxError`. In other words, there's something wrong with the grammar of the line that's being highlighted.

If you get an error like this, look for spaces that can be removed, misspelt commands, missing equals operators, or some of the [causes listed in this article](https://realpython.com/invalid-syntax-python/).
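
If you're ever unsure whether a name is allowed, one way to check (a small sketch, not something this notebook relies on) is Python's built-in `isidentifier()` string method, which reports whether a piece of text follows the basic rules for a name. Note that it only checks the basic rules: it won't warn you about words Python reserves for itself, such as `for`.

```python
#a name with a space in it breaks the rules for variable names
print("fco requests".isidentifier())
#the same name without the space is fine
print("fcorequests".isidentifier())
```

The first line prints `False` and the second prints `True`.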

Errors are part and parcel of coding so don't be afraid of them. They don't always make much sense but they do at least help you identify which line of code is causing the problem, and provide a clue to what might be the problem. 

It's often a good idea to copy the error and google it to see what you can find out about it - you can often find solutions to problems on sites like Stack Overflow, for example.

### Variable naming conventions and errors

In Python [the convention](https://stackoverflow.com/questions/159720/what-is-the-naming-convention-in-python-for-variable-and-function-names) is to use all lower case to name variables, and to use underscores to separate words. 

So strictly speaking we should have called our variables `fco_requests` and `sec27_refusals`, and `perc_refused` like so:

In [None]:
fco_requests = 48
sec27_refusals = 28
#note that you will need to change the variable names in this line of code too
perc_refused = sec27_refusals/fco_requests
#print it
print(perc_refused)
#multiply by 100 to make it easier to 'read' as a percentage
print(perc_refused*100)

0.5833333333333334
58.333333333333336


But really it doesn't matter. You can use whatever approach makes sense for you: you could put a capital letter at the start (`Sec27refusals`) or in the middle (`percRefused`), for example, and it will still work.

One thing to remember, though: adding capital letters makes it harder to remember *exactly* how you spelt a variable name when you want to use it again later - and that makes it more likely that you'll get errors. 

Here's the error you'll get when you try to use a variable but spell it wrong:

In [None]:
#create a variable which starts with a capital letter
Thisisavariable = 27
#print a variable - but this one doesn't start with a capital letter
print(thisisavariable)
#That variable doesn't exist, so we will get a NameError

NameError: name 'thisisavariable' is not defined


This is why personally I always try to keep variable names all lower case, so I don't have to remember which letters are capitalised.
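
To show why capital letters matter, here is a minimal sketch (the variable names and the number 99 are made up for illustration): Python treats names that differ only in capitalisation as two entirely separate variables.

```python
#two variables whose names differ only by one capital letter
refusals = 28
Refusals = 99
#each name holds its own value
print(refusals)
print(Refusals)
```

This prints 28 and then 99, so mixing up the two spellings can give you the wrong number without any error at all.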

### Codifying the process

Now we could have done all of that in Excel. But doing it this way has a number of advantages:

* Once code is written, it can be re-run: when the next set of figures comes out, we can just update the code with the new numbers
* Likewise we can automate other parts of the process: at the moment we've manually typed in the two numbers - but we could write code which fetches the spreadsheet, goes into the relevant sheet, and grabs the relevant numbers
* We can also expand it beyond one part of government, repeating the process for all ministries, for example
* We can sort the results to bring interesting ones to the top
* And we can add automatic visualisation in the notebook, too
* The process is very easy for someone else to understand. If we want to show the process to a colleague, they can see it happening. In a spreadsheet, they'd have to click into the cell where we typed that calculation, and then trace the other cells where numbers were being pulled from.
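
As a rough preview of the 'repeat' and 'sort' ideas in the list above, the sketch below runs the same percentage calculation for several departments and sorts the results. It uses techniques not covered in this notebook (a dictionary, a list and a loop), and only the FCO figures come from the spreadsheet: 'Department A' and 'Department B' and their numbers are made-up placeholders, not real data.

```python
#requests and Section 27 refusals for each department
#only the FCO numbers are from the spreadsheet - the others are made-up placeholders
figures = {
    "FCO": (48, 28),
    "Department A": (200, 10),
    "Department B": (50, 20),
}
#calculate the rounded percentage refused for each department
results = []
for department, (requests, refusals) in figures.items():
    results.append((round(refusals/requests*100), department))
#sort so the highest percentages come first
results.sort(reverse=True)
#print each department with its percentage
for percentage, department in results:
    print(department, percentage)
```

With these placeholder numbers the FCO comes out on top at 58, followed by Department B at 40 and Department A at 5.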

Further Python notebooks will explain how to do all of the above.