# Looping through lists (and a bit of a JSON refresher)

This notebook covers dealing with lists and some other questions from the 11th session.

Let's start by creating a list, and storing it in a variable. 

A list is created by using square brackets - it doesn't matter if those brackets contain anything (if they don't, it's just an empty list).

It's stored in the same way as any other variable: write the name of the variable, then an `=` sign, then the list itself to the right of that.

*Tip: You can make the list easier to read by pressing enter after each comma - Colab will automatically indent the items, and it will still work as code.*

In [None]:
#create a variable containing a list of 3 strings
myurls = ['https://www.thebureauinvestigates.com/profile/jasperjackson',
          'https://www.thebureauinvestigates.com/profile/matthewchapman',
          'https://www.thebureauinvestigates.com/profile/vickygayle']

## Checking if a variable is a list

You can check that a variable is a list by using the `type()` function.

In [None]:
#check what type of variable it is
type(myurls)

list
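As an optional extra, the same check can be written so that it gives a yes/no answer, which is handy inside an `if` statement. This is a minimal sketch: the list `example_urls` and its contents are placeholders made up for illustration.

```python
# a placeholder list, made up for illustration
example_urls = ['https://example.com/a', 'https://example.com/b']

# type() returns the type itself, which can be compared with list
print(type(example_urls) == list)

# isinstance() asks the yes/no question directly
is_a_list = isinstance(example_urls, list)
print(is_a_list)

# a single string is not a list, so this check is False
string_is_a_list = isinstance('https://example.com/a', list)
print(string_is_a_list)
```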

## Checking how many items are in a list

The `len()` function will tell you how many items are in a list (its length).

In [None]:
#check how many items are in the list
len(myurls)

3

## Use an index to grab an item from a list based on its position

Each item in a list can be grabbed by specifying what position it sits at - its **index**. 

You specify the index of an item by putting the number of that position in square brackets after the name of the list.

Note: indexing in Python begins at zero, so the first item is index `0`, the second is index `1`, and so on.

In [None]:
#fetch the first item (the item at position 0)
myurls[0]

'https://www.thebureauinvestigates.com/profile/jasperjackson'

In [None]:
#fetch the second item (the item at position 1)
myurls[1]

'https://www.thebureauinvestigates.com/profile/matthewchapman'

In [None]:
#fetch the third item (the item at position 2)
myurls[2]

'https://www.thebureauinvestigates.com/profile/vickygayle'
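Two other forms of indexing are used later in this notebook, so here is a short sketch of them: a negative index counts from the end of the list, and a slice such as `[0:2]` grabs a range of items (stopping just before the second number). The list `example_urls` is a placeholder made up for illustration.

```python
# a placeholder list, made up for illustration
example_urls = ['https://example.com/profile/a',
                'https://example.com/profile/b',
                'https://example.com/profile/c']

# -1 means 'the last item', however long the list is
last_item = example_urls[-1]
print(last_item)

# a slice from position 0 up to (but not including) position 2
first_two = example_urls[0:2]
print(first_two)
```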

## Getting an error with list positions (indices)

If we try to fetch an item at a position that doesn't exist, we will get an error - specifically an `IndexError` (which is a big clue). 

It will even tell us that the 'list index' is "out of range" - in other words, the list only has a range of indices from 0 to 2, so anything other than those is outside that range.

In [None]:
#try to get the item at position 3 (the fourth item)
myurls[3]

IndexError: list index out of range
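If you are not sure whether a position exists, there are a couple of ways to avoid the error stopping your code. The sketch below shows two possible approaches: checking the position against `len()` first, or wrapping the attempt in `try`/`except`. The names `example_urls`, `position` and `result` are placeholders made up for illustration.

```python
# a placeholder list, made up for illustration
example_urls = ['https://example.com/profile/a',
                'https://example.com/profile/b',
                'https://example.com/profile/c']

position = 3

# option 1: only fetch the item if the position is inside the list
if position < len(example_urls):
    result = example_urls[position]
else:
    result = None
    print("no item at position", position)

# option 2: try it anyway, and catch the IndexError if it happens
try:
    example_urls[position]
except IndexError as error:
    caught_message = str(error)
    print("caught an IndexError:", caught_message)
```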

## Looping through the list

To loop through *all* the items in a list, you can write a `for` loop. 

The first line of a `for` loop follows this pattern:

`for THING in LIST:`

You need to replace `LIST` with the name of your list, and `THING` with the name you want to use for each item as it loops. Most often, `i` is used for that 'thing'.

After the colon, you *must* have at least one **indented** line containing the code that you want to run while it is looping. 

The most basic thing you might want to do is `print()` each item, as it loops. 

In [None]:
#go through each item in the list
#store it in a variable called 'i'
for i in myurls:
  #print 'scraping: ' followed by the variable i
  print("scraping: "+i)
  #print the variable i on its own too
  print(i)

scraping: https://www.thebureauinvestigates.com/profile/jasperjackson
https://www.thebureauinvestigates.com/profile/jasperjackson
scraping: https://www.thebureauinvestigates.com/profile/matthewchapman
https://www.thebureauinvestigates.com/profile/matthewchapman
scraping: https://www.thebureauinvestigates.com/profile/vickygayle
https://www.thebureauinvestigates.com/profile/vickygayle
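The name used for the 'thing' is entirely up to you, so a more descriptive name than `i` can make a loop easier to read. A small sketch, using a made-up placeholder list called `example_urls` and the name `url` for each item:

```python
# a placeholder list, made up for illustration
example_urls = ['https://example.com/profile/a',
                'https://example.com/profile/b',
                'https://example.com/profile/c']

# 'url' is just a name we chose - any valid variable name would work
for url in example_urls:
    print("scraping: " + url)

# after the loop finishes, the variable still holds the last item it was given
print("the last item looped over was: " + url)
```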


## Looping through a list using indices instead

Sometimes we might want to loop through multiple lists at the same time (because they're the same length and would line up with each other, like columns in a table).

To do that we can loop through a range of numbers (indices) instead.

This can mess with your head, so if it's too confusing, just skip it!

In [None]:
listofnames = ['Jasper','Matt','Vicky']
#loop through the numbers generated by the range function
#range(0,3) generates 0, 1 and 2 - it stops before 3
for i in range(0,3):
  #convert to a string
  number_as_string = str(i)
  #combine with another string and print
  print("the number is "+number_as_string)
  #print the item at that position in the url list
  print(myurls[i])
  #print the item at that position in the other list
  print(listofnames[i])

the number is 0
https://www.thebureauinvestigates.com/profile/jasperjackson
Jasper
the number is 1
https://www.thebureauinvestigates.com/profile/matthewchapman
Matt
the number is 2
https://www.thebureauinvestigates.com/profile/vickygayle
Vicky
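If the indices are the confusing part, Python's built-in `zip()` function is one alternative: it pairs up the items from two lists so you can loop through both without any numbers. This is only a sketch of that alternative, not what the code above does, and both lists here are placeholders made up for illustration.

```python
# two placeholder lists of the same length, made up for illustration
example_urls = ['https://example.com/profile/a',
                'https://example.com/profile/b',
                'https://example.com/profile/c']
example_names = ['Ana', 'Ben', 'Cat']

# zip() hands over one item from each list on every loop
for name, url in zip(example_names, example_urls):
    print(name + ": " + url)

# the pairs can also be turned into a list to look at them all at once
paired = list(zip(example_names, example_urls))
print(paired[0])
```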


## Storing information from a loop

If you want to extract information from a loop you'll need to create an empty variable *before* the loop runs in order to store stuff that you extract while looping. 

### Detour: extracting some information

First, we'll need to test some code for extracting the information, which we will then put inside the loop so that it repeats for each item.

This uses the `.split()` method to split a string into a number of smaller strings, wherever it finds a specified character (in the example below, a slash).

It then uses an index to grab the item we want from that resulting list: either its position (index `4`, the fifth item) or `-1`, which always means the last item.

In [None]:
#split the URL string on each slash
spliturl = myurls[0].split("/")
#grab the 5th item from the resulting list
item5 = spliturl[4]
#print it
print(item5)
#grab the last item
lastitem = spliturl[-1]
#print that
print(lastitem)

jasperjackson
jasperjackson
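One thing to watch for with this approach: if a URL ends in a slash, the last item after splitting is an empty string. The sketch below shows that, and one possible fix using the `.rstrip()` string method to remove any trailing slash first. The URL here is a placeholder made up for illustration.

```python
# a placeholder URL that ends with a slash, made up for illustration
with_slash = 'https://example.com/profile/some-person/'

# splitting on slashes leaves an empty string after the final slash
last_part = with_slash.split("/")[-1]
print(repr(last_part))

# removing the trailing slash first gives us the part we actually want
cleaned = with_slash.rstrip("/").split("/")[-1]
print(cleaned)
```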


### Back to the loop

Now we've got that working, we insert it into our loop.

We also create an empty list *before* it runs, which will store whatever we extract.

In [None]:
#create an empty list to store what we extract inside the loop
personids = []

#go through each item in the list
#store it in a variable called 'i'
for i in myurls:
  print("scraping: "+i)
  #print the variable i
  print(i)
  #extract the name at the end
  #split the URL string on each slash
  spliturl = i.split("/")
  print(spliturl)
  #grab the 5th item from the resulting list
  item5 = spliturl[4]
  print(item5)
  #store that in the previously empty list
  personids.append(item5)
  #print the list so far
  print(personids)

scraping: https://www.thebureauinvestigates.com/profile/jasperjackson
https://www.thebureauinvestigates.com/profile/jasperjackson
['https:', '', 'www.thebureauinvestigates.com', 'profile', 'jasperjackson']
jasperjackson
['jasperjackson']
scraping: https://www.thebureauinvestigates.com/profile/matthewchapman
https://www.thebureauinvestigates.com/profile/matthewchapman
['https:', '', 'www.thebureauinvestigates.com', 'profile', 'matthewchapman']
matthewchapman
['jasperjackson', 'matthewchapman']
scraping: https://www.thebureauinvestigates.com/profile/vickygayle
https://www.thebureauinvestigates.com/profile/vickygayle
['https:', '', 'www.thebureauinvestigates.com', 'profile', 'vickygayle']
vickygayle
['jasperjackson', 'matthewchapman', 'vickygayle']


In [None]:
#print the whole list now the loop has finished
print(personids)

['jasperjackson', 'matthewchapman', 'vickygayle']
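Once the loop-and-append pattern feels comfortable, you may come across a shorter way of writing it called a list comprehension. This sketch shows both side by side so you can see they produce the same list. The list `example_urls` is a placeholder made up for illustration.

```python
# a placeholder list, made up for illustration
example_urls = ['https://example.com/profile/a',
                'https://example.com/profile/b',
                'https://example.com/profile/c']

# the long way: an empty list, a loop, and append
ids_from_loop = []
for url in example_urls:
    ids_from_loop.append(url.split("/")[-1])

# the short way: the same steps written on one line
ids_from_comprehension = [url.split("/")[-1] for url in example_urls]

print(ids_from_loop)
print(ids_from_comprehension)
```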


## Get the data out

One way to get the data out is to use that list as a column inside a `pandas` dataframe, and then export that dataframe as a CSV.

In [None]:
#import pandas to create a dataframe and export a CSV
import pandas as pd

The `pd.DataFrame()` function can take a **dictionary** as its ingredient. The key(s) in that dictionary become the column headings, and after each key and a colon you put the name of the list containing the values for that column.

In [None]:
#create a pandas dataframe using that list as the only column
mydata = pd.DataFrame( { "thedata" : personids } )
mydata

Unnamed: 0,thedata
0,jasperjackson
1,matthewchapman
2,vickygayle


In [None]:
#export the dataframe as a CSV
mydata.to_csv("mydata.csv")
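By default the exported CSV includes the dataframe's index (the row numbers) as an extra first column with no heading. If you don't want that, `to_csv()` has an `index` parameter you can set to `False`. The sketch below uses a small made-up dataframe, and leaves out the filename so that the CSV comes back as text we can print rather than being saved as a file.

```python
import pandas as pd

# a small placeholder dataframe, made up for illustration
example_df = pd.DataFrame({"thedata": ["a", "b"]})

# with no filename, to_csv() returns the CSV as a string
csv_with_index = example_df.to_csv()
print(csv_with_index)

# index=False leaves out the column of row numbers
csv_without_index = example_df.to_csv(index=False)
print(csv_without_index)
```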

### Adding extra columns to the dataframe

We could also add extra columns by adding a comma inside the dictionary, and specifying another 'key', followed by a colon, and then the list containing the values for the next column.

In [None]:
#create a pandas dataframe using two lists as columns
mydata = pd.DataFrame( { "thedata" : personids, "theurls" : myurls } )
mydata

Unnamed: 0,thedata,theurls
0,jasperjackson,https://www.thebureauinvestigates.com/profile/...
1,matthewchapman,https://www.thebureauinvestigates.com/profile/...
2,vickygayle,https://www.thebureauinvestigates.com/profile/...
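This only works if the lists are the same length. If they are not, pandas raises a `ValueError`. A minimal sketch of what that looks like, with a `len()` check you could run first. Both lists are placeholders made up for illustration.

```python
import pandas as pd

# two placeholder lists of different lengths, made up for illustration
example_ids = ['a', 'b', 'c']
example_urls = ['https://example.com/profile/a',
                'https://example.com/profile/b']

# checking the lengths first tells us whether the lists will line up
same_length = len(example_ids) == len(example_urls)
print("same length?", same_length)

# trying to build the dataframe anyway raises a ValueError
try:
    pd.DataFrame({"thedata": example_ids, "theurls": example_urls})
    error_raised = False
except ValueError as error:
    error_raised = True
    print("pandas raised a ValueError:", error)
```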


## Using this with an API

Now we've played with loops a bit, we can apply the same techniques to an API.

First, let's go through the process of fetching data from an API and storing it in a dataframe (that we can also export as a CSV). 

Here's a URL adapted from the documentation on the police API - specifically [the documentation on getting data on stop and search by force](https://data.police.uk/docs/method/stops-force/)

In [None]:
#store the URL we want to test
testurl = 'https://data.police.uk/api/stops-force?force=avon-and-somerset&date=2022-01'
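The part of that URL after the `?` is a set of parameters (`force` and `date`), each with a value. As an optional sketch, the same URL can be built from a dictionary of those parameters using `urlencode()` from Python's built-in `urllib.parse` library, which can be easier to change than editing a long string by hand. The variable names here are made up for illustration.

```python
from urllib.parse import urlencode

# the part of the URL before the question mark
base_url = 'https://data.police.uk/api/stops-force'

# the parameters and their values, stored as a dictionary
params = {'force': 'avon-and-somerset', 'date': '2022-01'}

# urlencode() turns the dictionary into force=...&date=...
built_url = base_url + '?' + urlencode(params)
print(built_url)
```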

### Fetching JSON data and storing in a dataframe

To fetch the data from that URL, which is in JSON format, we use the `read_json()` function from pandas (which we imported earlier in this notebook, and renamed `as pd` - make sure you run that earlier code first!)

In [None]:
#fetch the JSON from that URL using pandas's read_json function
testjson = pd.read_json(testurl)
#show it
testjson.head()

Unnamed: 0,age_range,outcome,involved_person,self_defined_ethnicity,gender,legislation,outcome_linked_to_object_of_search,datetime,removal_of_more_than_outer_clothing,outcome_object,location,operation,officer_defined_ethnicity,type,operation_name,object_of_search
0,10-17,Arrest,True,,Male,Police and Criminal Evidence Act 1984 (section 1),,2022-01-24 19:25:00+00:00,0.0,"{'id': 'bu-arrest', 'name': 'Arrest'}","{'latitude': '51.347278', 'street': {'id': 537...",,White,Person search,,Offensive weapons
1,25-34,Khat or Cannabis warning,True,Other ethnic group - Not stated,Male,Misuse of Drugs Act 1971 (section 23),1.0,2022-01-26 10:38:00+00:00,0.0,"{'id': 'bu-khat-or-cannabis-warning', 'name': ...",,,Black,Person search,,Controlled drugs
2,25-34,A no further action disposal,True,Other ethnic group - Not stated,Female,Misuse of Drugs Act 1971 (section 23),,2022-01-06 17:30:00+00:00,0.0,"{'id': 'bu-no-further-action', 'name': 'A no f...","{'latitude': '51.459312', 'street': {'id': 543...",,White,Person search,,Controlled drugs
3,25-34,Community resolution,True,White - English/Welsh/Scottish/Northern Irish/...,Male,Police and Criminal Evidence Act 1984 (section 1),0.0,2022-01-20 01:47:00+00:00,0.0,"{'id': 'bu-community-resolution', 'name': 'Com...","{'latitude': '51.405274', 'street': {'id': 539...",,White,Person search,,Articles for use in criminal damage
4,over 34,A no further action disposal,True,White - English/Welsh/Scottish/Northern Irish/...,Female,Police and Criminal Evidence Act 1984 (section 1),,2022-01-25 00:00:00+00:00,0.0,"{'id': 'bu-no-further-action', 'name': 'A no f...","{'latitude': '51.137118', 'street': {'id': 532...",,White,Person search,,Stolen goods
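As a quick JSON refresher: JSON like this is typically a list of dictionaries, and some values (such as `outcome_object` above) are themselves dictionaries nested inside. The sketch below uses Python's built-in `json` library on a tiny hand-written sample, modelled loosely on the shape of the results above rather than fetched from the API, to show how lists, indices and keys combine.

```python
import json

# a tiny hand-written sample in the same shape as the API results (not real data)
sample_text = '''
[
  {"age_range": "10-17", "outcome_object": {"id": "bu-arrest", "name": "Arrest"}},
  {"age_range": "25-34", "outcome_object": {"id": "bu-no-further-action", "name": "A no further action disposal"}}
]
'''

# convert the JSON text into Python objects: a list of dictionaries
records = json.loads(sample_text)
print(type(records))
print(len(records))

# use an index to get the first dictionary, then a key to get a value
first_age = records[0]["age_range"]
print(first_age)

# loop through the list, digging into the nested dictionary each time
outcome_names = []
for record in records:
    outcome_names.append(record["outcome_object"]["name"])
print(outcome_names)
```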


### Fetching the force id codes

Notice that the URL to get data on stops and searches by force needs to contain the name of the force - not its full name, but a specific 'id' name such as `avon-and-somerset`.

A **list** (wahey!) of those id names is available [from a different part of the API](https://data.police.uk/docs/method/forces/), so we fetch that too in the code below.

In [None]:
#fetch the JSON containing all force id names
forcedf = pd.read_json("https://data.police.uk/api/forces")
forcedf

Unnamed: 0,id,name
0,avon-and-somerset,Avon and Somerset Constabulary
1,bedfordshire,Bedfordshire Police
2,cambridgeshire,Cambridgeshire Constabulary
3,cheshire,Cheshire Constabulary
4,city-of-london,City of London Police
5,cleveland,Cleveland Police
6,cumbria,Cumbria Constabulary
7,derbyshire,Derbyshire Constabulary
8,devon-and-cornwall,Devon & Cornwall Police
9,dorset,Dorset Police


### A column in a dataframe is basically a list

The id names are in the first column of the dataframe. We can access that, and treat it like we would a list*: looping through it, using indices, etc.

*(Technically, a column in a dataframe is a **Series**, not a list, but for looping and the simple indexing used here it behaves in the same way)*

In [None]:
#access the 'id' column in the dataframe, and show the first 3 items
forcedf['id'][0:3]

0    avon-and-somerset
1         bedfordshire
2       cambridgeshire
Name: id, dtype: object
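If you ever need an actual list rather than a Series, the `.tolist()` method converts one. The sketch below also shows one place where a Series can differ from a list: its index labels stay attached to each item, so `.iloc` is the safer way to ask for an item by position. The dataframe here is a small stand-in made up for illustration, not fetched from the API.

```python
import pandas as pd

# a small stand-in dataframe, made up for illustration
example_df = pd.DataFrame({"id": ["avon-and-somerset",
                                  "bedfordshire",
                                  "cambridgeshire"]})

# a column is a Series...
id_column = example_df["id"]
print(type(id_column))

# ...but .tolist() converts it into an ordinary list
id_list = id_column.tolist()
print(id_list)

# after slicing, the remaining items keep their original labels (1 and 2)
later_ids = id_column[1:]
# .iloc[0] means 'the first item by position', whatever its label is
first_of_later = later_ids.iloc[0]
print(first_of_later)
```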

### Looping through that list to create the URLs to get the data for each force

Now we have that list of id names, we can loop through them and insert each one into the URL we used before, in order to get data for all forces. 

First, as before, we need to create an empty variable to store the results. This time we create an empty dataframe, because what we're extracting in the loop is a dataframe, and we need an empty dataframe to keep them all together.

In [None]:
#create an empty dataframe to hold the data we get
allthedata = pd.DataFrame()

#loop through each item in that column
for i in forcedf['id']:
  #print it
  print(i)
  #create a URL with the id inserted in it
  forceurl = "https://data.police.uk/api/stops-force?force="+i
  #print that
  print(forceurl)
  #fetch the JSON at that URL
  forcejson = pd.read_json(forceurl)
  #show the number of columns/rows
  print(forcejson.shape)
  #add the new data to the ongoing dataframe
  #(pd.concat is used because DataFrame.append was removed in pandas 2.0)
  allthedata = pd.concat([allthedata, forcejson])

#show the final results
allthedata.head()

avon-and-somerset
https://data.police.uk/api/stops-force?force=avon-and-somerset
(606, 16)
bedfordshire
https://data.police.uk/api/stops-force?force=bedfordshire
(428, 16)
cambridgeshire
https://data.police.uk/api/stops-force?force=cambridgeshire
(167, 16)
cheshire
https://data.police.uk/api/stops-force?force=cheshire
(0, 0)
city-of-london
https://data.police.uk/api/stops-force?force=city-of-london
(210, 16)
cleveland
https://data.police.uk/api/stops-force?force=cleveland
(471, 16)
cumbria
https://data.police.uk/api/stops-force?force=cumbria
(209, 16)
derbyshire
https://data.police.uk/api/stops-force?force=derbyshire
(145, 16)
devon-and-cornwall
https://data.police.uk/api/stops-force?force=devon-and-cornwall
(582, 16)
dorset
https://data.police.uk/api/stops-force?force=dorset
(193, 16)
durham
https://data.police.uk/api/stops-force?force=durham
(270, 16)
dyfed-powys
https://data.police.uk/api/stops-force?force=dyfed-powys
(372, 16)
essex
https://data.police.uk/api/stops-force?force=esse

Unnamed: 0,age_range,outcome,involved_person,self_defined_ethnicity,gender,legislation,outcome_linked_to_object_of_search,datetime,removal_of_more_than_outer_clothing,outcome_object,location,operation,officer_defined_ethnicity,type,operation_name,object_of_search
0,over 34,A no further action disposal,True,White - English/Welsh/Scottish/Northern Irish/...,Male,Misuse of Drugs Act 1971 (section 23),,2022-03-09 11:55:00+00:00,0.0,"{'id': 'bu-no-further-action', 'name': 'A no f...","{'latitude': '51.461396', 'street': {'id': 543...",,White,Person search,,Controlled drugs
1,18-24,,True,,Male,Police and Criminal Evidence Act 1984 (section 1),,2022-03-17 13:50:00+00:00,0.0,"{'id': '', 'name': ''}","{'latitude': '51.473994', 'street': {'id': 545...",,,Person search,,Evidence of offences under the Act
2,over 34,A no further action disposal,True,White - English/Welsh/Scottish/Northern Irish/...,Male,Misuse of Drugs Act 1971 (section 23),0.0,2022-03-23 11:35:00+00:00,0.0,"{'id': 'bu-no-further-action', 'name': 'A no f...","{'latitude': '51.497897', 'street': {'id': 547...",,White,Person search,,Controlled drugs
3,,A no further action disposal,True,Other ethnic group - Not stated,Male,Misuse of Drugs Act 1971 (section 23),,2022-03-13 03:50:00+00:00,0.0,"{'id': 'bu-no-further-action', 'name': 'A no f...","{'latitude': '51.403618', 'street': {'id': 539...",,Mixed,Person and Vehicle search,,Controlled drugs
4,25-34,A no further action disposal,True,Mixed/Multiple ethnic groups - White and Black...,Male,Misuse of Drugs Act 1971 (section 23),,2022-03-09 11:00:00+00:00,0.0,"{'id': 'bu-no-further-action', 'name': 'A no f...",,,Mixed,Person and Vehicle search,,Controlled drugs


In [None]:
#check the number of columns/rows of the final dataframe
allthedata.shape

(43502, 16)
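Two things you might want to add to a loop like this: skipping any force that returns no data (as one did above, with a shape of `(0, 0)`), and recording which force each row came from, since the combined data does not appear to include a column for that. The sketch below shows one possible way to do both. To keep it runnable without the API, it loops through a dictionary of small made-up dataframes standing in for what `pd.read_json()` would return; the ids `force-a`, `force-b` and `force-c` are invented.

```python
import pandas as pd

# made-up stand-ins for the dataframe fetched for each force
fake_results = {
    "force-a": pd.DataFrame({"type": ["Person search", "Vehicle search"]}),
    "force-b": pd.DataFrame(),
    "force-c": pd.DataFrame({"type": ["Person search"]}),
}

# an empty list to collect each dataframe as we go
frames = []

for force_id, force_df in fake_results.items():
    # skip this force if its dataframe has no rows
    if force_df.empty:
        print("no data for", force_id)
        continue
    # add a column recording which force these rows belong to
    force_df["force"] = force_id
    frames.append(force_df)

# combine everything in one go, renumbering the rows from 0
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

Collecting the dataframes in a list and combining them once at the end gives the same kind of result as adding to a dataframe inside the loop, and tends to be quicker when there are many of them.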

## Scheduling a scraper using the `schedule` library

For scheduling scrapers and other code, you might use a library created to solve that challenge. The `schedule` library [looks like it might do that](https://schedule.readthedocs.io/en/stable/).

It needs installing first. The `time` library is also needed, but that comes with Python, so it only needs importing.

In [None]:
#first we need to install the schedule library
!pip install schedule
#then import it
import schedule
#and import the time library (built into Python, so no install needed)
import time

Collecting schedule
  Downloading schedule-1.1.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: schedule
Successfully installed schedule-1.1.0


You would then search around for tutorials explaining how to use it. Broadly speaking it looks like you store the code you want to schedule inside a function (called `job` below), and then specify that function inside another line of code which says how often you want to schedule something.

In [None]:
#put your scraper code inside a function
def job():
  #replace this with your scraper code
  print("I'm scraping...")

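#run job every 1 second
schedule.every(1).seconds.do(job)

#keep checking whether a job is due - this loop runs until you stop the cell
while True:
  #run any job that is due
  schedule.run_pending()
  #wait 1 second before checking again
  time.sleep(1)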
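A `while True:` loop never ends by itself, so you would have to stop the cell manually. If you want the repeating to stop after a set number of runs, one option is to count the runs and use that count as the loop's condition. The sketch below shows that idea using only the built-in `time` library rather than `schedule`, so it is an illustration of the run-then-wait pattern and not of how `schedule` itself works. The names `max_runs` and `completed_runs` are made up for illustration, and the wait is kept short so the example finishes quickly.

```python
import time

# put your scraper code inside a function
def job(run_number):
    # replace this with your scraper code
    print("I'm scraping... run", run_number)

# how many times we want the job to run before stopping
max_runs = 3
# a counter to keep track of how many runs have finished
completed_runs = 0

# keep looping only while there are runs left to do
while completed_runs < max_runs:
    job(completed_runs + 1)
    completed_runs += 1
    # wait a short time before the next run
    time.sleep(0.1)

print("finished after", completed_runs, "runs")
```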