# Creating functions in Python (for scraping)

In previous notebooks we covered:

* How to create variables in Python (to store things like URLs for scraping, and the data from pages that we scrape); 
* How to loop through lists (in order to scrape or store each item in that list, for example); and 
* How to create data frames using `pandas` (to store the scraped data).

Now we're going to bring those together into a final multi-page scraper by creating our own **functions**.

We've used functions already such as `range()` and `len()`. These are **built-in functions** that come with Python. We've also used functions from libraries, like `scraperwiki.scrape()` and `lxml.html.fromstring()`.

You can create your own function - a **user-defined function** - with the `def` command like so:

In [None]:
def sayhello():
  print("hello")

The `def` command is followed by:

* The name of the function
* Parentheses (which can contain any ingredients that you want to use but in this example don't)
* A colon, and
* Indented lines of code underneath which will run when the function is used

The name of the function is entirely up to you, but try to make it distinctive and meaningful. 

We will explain the other parts as we begin to create a function below, but note for now that when you create the function nothing appears to happen.

Of course something *has* happened when you run the code above: a function has been created and can now be used.

## 'Calling' a function

Using a function is referred to as 'calling' it.

This is how you **call** a function:

In [None]:
sayhello()

hello


Basically it's like any other function: you type the name of the function, followed by parentheses containing any ingredients it needs. Even if the function doesn't need any ingredients, you still use (empty) parentheses.
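The role of the parentheses can be seen in a small sketch. It assumes the same `sayhello` function as above; the name `same_function` is made up for illustration. Without parentheses Python only refers to the function, and with them the function actually runs.

```python
def sayhello():
  print("hello")

# Without parentheses nothing is printed: this just gives the function a second name
same_function = sayhello

# With parentheses the function is called and its code runs
sayhello()
same_function()
```

Both calls print `hello`, because both names point at the same function.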

## Creating a function with ingredients

Our example so far didn't have any ingredients, so let's create one that does.


In [None]:
def print_this_word(thisword):
  print(thisword)

This time we've put a word inside the parentheses: `thisword`

In a way, we've created a variable to store whatever ingredient is used when someone calls this function. (This is called a **parameter**.)

That variable is then used in the code below: `print(thisword)`

To see what happens, let's use that function:

In [None]:
print_this_word("pumpkin")

pumpkin


Now let's break down what happens when that line of code is run:

1. First, it **calls** the function `print_this_word`
2. Then it gives it an ingredient: the string "pumpkin". This is called **passing** an **argument**
3. When the function was written, it called that ingredient `thisword`, so the string "pumpkin" is stored in a variable called `thisword`
4. As the function code runs, it accesses that variable inside a `print()` command, so the contents of that variable ("pumpkin" this time) are printed
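As a rough sketch of the steps above, the same function can be called several times with different arguments. The variable `vegetable` and the extra words are made up for illustration.

```python
def print_this_word(thisword):
  print(thisword)

# Each call passes a different argument, which is stored in 'thisword' for that call
print_this_word("pumpkin")
print_this_word("squash")

# The argument can be a variable rather than a string typed directly
vegetable = "turnip"
print_this_word(vegetable)

# The parameter can also be named when calling the function
print_this_word(thisword="parsnip")
```

Each call prints whatever was passed to it, one word per line.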

## Creating a function that **returns** something

Our function above only prints something, but often you want a function to do something (such as perform a calculation, or scrape a page) and then **return** the results. 

Here's an example:

In [None]:
#define a new function called 'addtwonumbers' which has 2 parameters (ingredients)
def addtwonumbers(numone, numtwo):
  #the two parameters are added together and stored in 'total'
  total = numone+numtwo
  #the function returns that value
  return(total)

You can see that this function has *two* ingredients (parameters): `numone` and `numtwo`, separated by a comma. The function adds those two together, and stores the result in a new variable called `total`. Finally it uses `return` to send the contents of that variable back to whatever called the function.

Here's that function being used:

In [None]:
whatisit = addtwonumbers(3,8)
print(whatisit)

11


You can see that the first line runs the `addtwonumbers()` function and **passes** it two ingredients: two numbers, 3 and 8. 

The function runs, adds those two numbers together, and *returns* the result, which is then stored in the variable `whatisit`.

We then print that.

Returning information from a function can be incredibly powerful: in the example above it was just a number that was returned, but you can return lists, dictionaries, multiple items, and, among other things, data frames - which is what happens next...
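First, a quick sketch of returning more than a single number. The function names `describe_numbers` and `first_squares` are made up for illustration and are not part of the scraper.

```python
def describe_numbers(numone, numtwo):
  total = numone + numtwo
  difference = numone - numtwo
  # Two values separated by a comma are returned together (as a tuple)
  return total, difference

# The two returned values can be stored in two variables at once
thetotal, thedifference = describe_numbers(8, 3)
print(thetotal, thedifference)

def first_squares(howmany):
  # Build up a list inside the function, then return the whole list
  squares = []
  for i in range(1, howmany + 1):
    squares.append(i * i)
  return squares

print(first_squares(4))
```

The first `print` shows `11 5` and the second shows `[1, 4, 9, 16]`. The second function also has a loop inside it, so its code is indented at two levels.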

## Creating a scraper function

Now let's create a function to contain the code that we wrote for our scraper in a separate notebook.

In [1]:
#define a function
def scrapepage(theurl):
  print("scraping", theurl)
  #scrape the webpage at that url and store in 'html'
  html = scraperwiki.scrape(theurl, user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36")
  #convert 'html' into an lxml object so we can drill into it
  root = lxml.html.fromstring(html)
  #grab the contents of every <th> tag
  servicenames = root.cssselect('th')
  #count how many items there are - subtracting 3 for the 3 extra results we don't want
  items = len(servicenames)-3
  print(items)
  #set a limit to those items, normally the last 100, but less on the last page
  servicenames = servicenames[-items:]
  #grab the contents of each <div class="fcdetailsleft"> tag
  tels = root.cssselect('div.fcdetailsleft')
  #Create a dataframe to store the data we are about to scrape
  #It has two columns called 'servicename' and 'tel'
  #We call this dataframe 'df'
  df = pandas.DataFrame(columns=["servicename","tel"])
  #Because we need to loop through two lists of the same length, we can instead 
  #loop through a range of indices, generated using the range function
  for i in range(0,items):
    #extract the text from that index in servicenames
    servicename = servicenames[i].text_content()
    #repeat for the item at that index in tels
    tel = tels[i].text_content()
    #then add them as a new row at the end of the df
    #(df.append() was removed in pandas 2.0, so we use .loc instead)
    df.loc[len(df)] = [servicename, tel]
  #return the data frame to whatever called the function
  return(df)



To create this function we've essentially taken all the important code from that notebook and indented it under the line `def scrapepage(theurl):`

That first line transforms our previous code into something **reusable** by doing two things: giving a name to it (`scrapepage`); and giving a name to the URL we want to scrape (`theurl`).

There's one other extra line too, right at the end: `return(df)` ensures that the results of the scraper are passed back to whatever calls this function.

Something else to highlight: the function contains a `for` loop as well, which means there are two levels of indents in the code: all the code inside the function is indented, and then the `for` loop code inside *that* is indented one more time.

## Handling the last page

We've also added some extra code to handle the fact that the last page of results will not have exactly 100 results:

```
  items = len(servicenames)-3
  #set a limit to those items, normally the last 100, but less on the last page
  servicenames = servicenames[-items:]
```

On a normal page, 'servicenames' is 103 items long: the 100 items plus the 3 headings. But on the last page it might be 8 items long: 5 items plus 3 headings. 

So we subtract 3 from the length of the list (measured with `len`) to get the number of results on that page.

Then we use it as a negative index in `[-items:]`

For example, if the variable `items` contains the number 5 then that code will run as `[-5:]` meaning 'from the fifth-to-last item onwards'.

The variable is also used in this line:

`for i in range(0,items):`

Here it sets where the range of numbers stops, so the loop runs once for each result on the page.
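The slicing can be tried on an ordinary list. This is only a sketch: the list of strings is made up to stand in for the `<th>` elements on a short last page, with 3 headings followed by 5 results.

```python
# A stand-in for the scraped list: 3 headings we don't want, then 5 results
servicenames = ["heading1", "heading2", "heading3", "a", "b", "c", "d", "e"]

# 8 items minus the 3 headings leaves 5
items = len(servicenames) - 3
print(items)

# Keep only the last 5 items
servicenames = servicenames[-items:]
print(servicenames)

# range(0, items) generates the indices 0 to 4, one for each remaining item
for i in range(0, items):
  print(i, servicenames[i])
```

After the slice the list is `['a', 'b', 'c', 'd', 'e']`, and the loop prints each index next to its item.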

## Calling the scraper function

Now let's call that function on a bunch of pages. First we need to make sure the libraries are loaded.

In [None]:
#install the libraries 
#scraperwiki is a library for scraping webpages
!pip install scraperwiki
import scraperwiki
#lxml.html is used to convert the scraped HTML into a structured object we can search
import lxml.html
#cssselect is used to drill down into that and find data in tags
!pip install cssselect
import cssselect
#the pandas library which is used to work with data 
import pandas 

In [4]:
#Create a dataframe to store the data we are about to scrape
#It has two columns called 'servicename' and 'tel'
#We call this dataframe 'dfhere'
dfhere = pandas.DataFrame(columns=["servicename","tel"])

#first, store the URL up to the page number
firsturlpart = "https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage="
#next create a list of page numbers from 1 to 9
pagelist = range(1,10)
#then loop through them and add to the URL
for i in pagelist:
  #convert number to string so it can be combined with URL
  pagenumberasstring = str(i)
  #combine that with URL
  pageurl = firsturlpart+pagenumberasstring
  #scrape the page and store results in df
  df = scrapepage(pageurl)
  print(df)
  #add the new data frame to the existing data frame
  #(pandas.concat() replaces dfhere.append(), which was removed in pandas 2.0)
  dfhere = pandas.concat([dfhere, df])
  print(dfhere)

scraping https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage=1
100
                                          servicename                                                tel
0   \r\n                    Addictive Eaters Anony...  \r\n        Tel: 03301333615\r\n        \r\nSt...
1   \r\n                    Nottinghamshire Adult ...  \r\n        Tel: 0115 876 0162\r\n        \r\n...
2   \r\n                    Nottinghamshire - Pare...  \r\n        Tel: 0115 956 0866\r\n        \r\n...
3   \r\n                    Nottinghamshire Camhs ...  \r\n        Tel: 0115 841 5812\r\n        \r\n...
4   \r\n                    Child And Adolescent M...  \r\n        Tel: 0115 844 0524\r\n        \r\n...
..                                                ...                                                ...
95  \r\n                    Peterborough Eating Di...  \r\n 

In [None]:
print(dfhere)

                                          servicename                                                tel
0             Addictive Eaters Anonymous - Nottingham  Tel: 03301333615\r\n        \r\nStation Street...
1          Nottinghamshire Adult Eating Disorder Team  Tel: 0115 876 0162\r\n        \r\nMandala Cent...
2   Nottinghamshire Camhs Eating Disorder Team - N...  Tel: 0115 841 5812\r\n        \r\nThorneywood\...
3             Nottinghamshire - Parents Support Group  Tel: 0115 956 0866\r\n        \r\nThorneywood\...
4   Child And Adolescent Mental Health Services (C...  Tel: 0115 844 0524\r\n        \r\nPebble Bridg...
..                                                ...                                                ...
0                          Cornwall Partnership Trust  Cornwall Eating Disorders Service\r\n    Truro...
1                    Cornwall Eating Disorder Service  Tel: 01872 322277\r\n        \r\nCornwall Coun...
2               Addictive Eaters Anonymous - Falmouth  

In [None]:
#And we can export it
dfhere.to_csv("scrapeddata.csv")

Note that the extra 'new line' characters will make it look like the cells are empty in Excel until you double-click in one to see the whole thing.
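Going back to the loop above, the URL-building steps can be tried on their own without scraping anything. This is a sketch only: the address is a placeholder rather than the real NHS URL, and the list `pageurls` is made up for illustration.

```python
# A placeholder address that ends just before the page number
firsturlpart = "https://example.com/results?currentPage="

pageurls = []
# range(1, 10) generates the numbers 1 to 9: it stops before 10
for i in range(1, 10):
  # Convert the number to a string so it can be joined to the URL
  pageurls.append(firsturlpart + str(i))

print(len(pageurls))
print(pageurls[0])
print(pageurls[-1])
```

This prints `9`, then the URLs for page 1 and page 9.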

## Improvement 1: adding a delay (throttling)

We can change the scraper so that it pauses between pages, which avoids sending requests to the website too quickly. To do this we need the `time` library.

In [None]:
#Import the time library to use its sleep() function
import time

We can then use the `sleep()` function from that library, which [stops the code running for a specified number of seconds](https://www.programiz.com/python-programming/time/sleep). So to pause for three seconds it might be written like so:

`time.sleep(3)`

That can be inserted into the loop that calls the scraping function (or, if the scraping function scrapes more than one page, you can insert it there to pause between each page):

In [None]:
#Create a dataframe to store the data we are about to scrape
#It has two columns called 'servicename' and 'tel'
#We call this dataframe 'dfhere'
dfhere = pandas.DataFrame(columns=["servicename","tel"])

#first, store the URL up to the page number
firsturlpart = "https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage="
#next create a list of page numbers from 1 to 9
pagelist = range(1,10)
#then loop through them and add to the URL
for i in pagelist:
  #convert number to string so it can be combined with URL
  pagenumberasstring = str(i)
  #combine that with URL
  pageurl = firsturlpart+pagenumberasstring
  #scrape the page and store results in df
  df = scrapepage(pageurl)
  print(df)
  #add the new data frame to the existing data frame
  #(pandas.concat() replaces dfhere.append(), which was removed in pandas 2.0)
  dfhere = pandas.concat([dfhere, df])
  print(dfhere)
  print("waiting 3 seconds before next scrape")
  #Sleep for 3 seconds before looping again
  time.sleep(3)

scraping https://www.nhs.uk/service-search/other-services/Eating-disorders/Nottingham/Results/102/-1.158/52.955/1797/15942?distance=500&ResultsOnPageValue=100&isNational=0&totalItems=805&currentPage=1
100
                                          servicename                                                tel
0             Addictive Eaters Anonymous - Nottingham  Tel: 03301333615        Station Street        ...
1          Nottinghamshire Adult Eating Disorder Team  Tel: 0115 876 0162        Mandala Centre    Gr...
2             Nottinghamshire - Parents Support Group  Tel: 0115 956 0866        Thorneywood    Child...
3   Nottinghamshire Camhs Eating Disorder Team - N...  Tel: 0115 841 5812        Thorneywood    Child...
4   Child And Adolescent Mental Health Services (C...  Tel: 0115 844 0524        Pebble Bridge    Hop...
..                                                ...                                                ...
95              Peterborough Eating Disorders Charity  Tel: 
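The pausing pattern can also be tested without touching the network. In this sketch `fake_scrape` is a hypothetical stand-in for `scrapepage()`, and the delay is shortened to a fraction of a second so the example finishes quickly.

```python
import time

def fake_scrape(pagenumber):
  # Hypothetical stand-in for scrapepage(): no web request is made
  return "results for page " + str(pagenumber)

# A short delay for testing; a real scraper might wait 3 seconds as above
delay = 0.2
results = []

# Note the time before the loop starts so the total can be measured
started = time.monotonic()
for i in range(1, 4):
  results.append(fake_scrape(i))
  print("waiting", delay, "seconds before next scrape")
  # Pause before looping again
  time.sleep(delay)
elapsed = time.monotonic() - started

print(results)
print("seconds taken:", round(elapsed, 1))
```

Three pages with a pause after each should take at least three times the delay, which is a quick way to check that the pause is really happening.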

## Improvement 2: stripping text

To stop cells looking empty in Excel we can add `.strip()` to remove the white space at the start and end of each piece of text. These are the lines where we add it to the function below:

```
    #extract the text from that index in servicenames
    servicename = servicenames[i].text_content().strip()
    #repeat for the item at that index in tels
    tel = tels[i].text_content().strip()
```

In [None]:
#define a function
def scrapepage(theurl):
  print("scraping", theurl)
  #scrape the webpage at that url and store in 'html'
  html = scraperwiki.scrape(theurl, user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36")
  #convert 'html' into an lxml object so we can drill into it
  root = lxml.html.fromstring(html)
  #grab the contents of every <th> tag
  servicenames = root.cssselect('th')
  #count how many items there are - subtracting 3 for the 3 extra results we don't want
  items = len(servicenames)-3
  print(items)
  #set a limit to those items, normally the last 100, but less on the last page
  servicenames = servicenames[-items:]
  #grab the contents of each <div class="fcdetailsleft"> tag
  tels = root.cssselect('div.fcdetailsleft')
  #Create a dataframe to store the data we are about to scrape
  #It has two columns called 'servicename' and 'tel'
  #We call this dataframe 'df'
  df = pandas.DataFrame(columns=["servicename","tel"])
  #Because we need to loop through two lists of the same length, we can instead 
  #loop through a range of indices, generated using the range function
  for i in range(0,items):
    #extract the text from that index in servicenames
    servicename = servicenames[i].text_content().strip()
    #repeat for the item at that index in tels
    tel = tels[i].text_content().strip()
    #then add them as a new row at the end of the df
    #(df.append() was removed in pandas 2.0, so we use .loc instead)
    df.loc[len(df)] = [servicename, tel]
  #return the data frame to whatever called the function
  return(df)



We could also replace the new lines with `.replace()` like so:

In [None]:
df['tel'][1].replace("\r\n","")

'Tel: 01872 322277        Cornwall Council    New County Hall    Truro             TR1 3AY'

And we could replace the multiple spaces by using **regex**. This involves importing the `re` library.

In [None]:
import re

Once imported we can use the `sub()` function from that library to substitute any run of two or more whitespace characters (indicated by the regular expression `\s\s+`) with a comma and space. Note that `\s` matches tabs and new lines as well as spaces:

In [None]:
clean1 = df['tel'][1].replace("\r\n","")
print(clean1)
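#the r before the quotes makes this a raw string, so the backslashes are passed to regex unchanged
clean2 = re.sub(r"\s\s+",", ",clean1)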
print(clean2)

Tel: 01872 322277        Cornwall Council    New County Hall    Truro             TR1 3AY
Tel: 01872 322277, Cornwall Council, New County Hall, Truro, TR1 3AY


These can be incorporated into the function like this:

In [None]:
#define a function
def scrapepage(theurl):
  print("scraping", theurl)
  #scrape the webpage at that url and store in 'html'
  html = scraperwiki.scrape(theurl, user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36")
  #convert 'html' into an lxml object so we can drill into it
  root = lxml.html.fromstring(html)
  #grab the contents of every <th> tag
  servicenames = root.cssselect('th')
  #count how many items there are - subtracting 3 for the 3 extra results we don't want
  items = len(servicenames)-3
  print(items)
  #set a limit to those items, normally the last 100, but less on the last page
  servicenames = servicenames[-items:]
  #grab the contents of each <div class="fcdetailsleft"> tag
  tels = root.cssselect('div.fcdetailsleft')
  #Create a dataframe to store the data we are about to scrape
  #It has two columns called 'servicename' and 'tel'
  #We call this dataframe 'df'
  df = pandas.DataFrame(columns=["servicename","tel"])
  #Because we need to loop through two lists of the same length, we can instead 
  #loop through a range of indices, generated using the range function
  for i in range(0,items):
    #extract the text from that index in servicenames
    servicename = servicenames[i].text_content().strip()
    #remove new lines
    servicename = servicename.replace("\r\n","")
    #replace runs of two or more whitespace characters with a comma and space
    servicename = re.sub(r"\s\s+",", ",servicename)
    #repeat for the item at that index in tels
    tel = tels[i].text_content().strip().replace("\r\n","")
    tel = re.sub(r"\s\s+",", ",tel)
    #then add them as a new row at the end of the df
    #(df.append() was removed in pandas 2.0, so we use .loc instead)
    df.loc[len(df)] = [servicename, tel]
  #return the data frame to whatever called the function
  return(df)
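Because the same cleaning steps are needed for more than one piece of text, one option is to put them in a small function of their own. This is a sketch rather than part of the original scraper: the name `clean_text` and the sample string are made up for illustration.

```python
import re

def clean_text(rawtext):
  # Remove white space from the start and end
  cleaned = rawtext.strip()
  # Remove the new line characters
  cleaned = cleaned.replace("\r\n", "")
  # Swap any run of 2 or more whitespace characters for a comma and space
  cleaned = re.sub(r"\s\s+", ", ", cleaned)
  return cleaned

# Made-up sample text in the same shape as the scraped contact details
sample = "\r\n        Tel: 01872 322277\r\n        \r\nCornwall Council\r\n    Truro\r\n    "
print(clean_text(sample))
```

This prints `Tel: 01872 322277, Cornwall Council, Truro`. Inside the scraper's loop the function could then be called on each item, for example `tel = clean_text(tels[i].text_content())`.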

