# Creating functions in Python (for scraping)

In previous notebooks we covered:

* How to create variables in Python (to store things like URLs for scraping, and the data from pages that we scrape); 
* How to loop through lists (in order to scrape or store each item in that list, for example); and 
* How to create data frames using `pandas` (to store the scraped data).

Now we're going to bring those together into a final multi-page scraper by creating our own **functions**.

We've used functions already such as `sum()` and `len()`. These are **built-in functions** that come with Python. 

We've also used functions from libraries, like `pd.read_csv()` and `BeautifulSoup()`.

You can create your own function - a **user-defined function** - with the `def` command like so:

In [None]:
def sayhello():
  print("hello")

The `def` command is followed by:

* The name of the function
* Parentheses (which can contain any ingredients the function needs, although in this example there are none)
* A colon, and
* Indented lines of code underneath which will run when the function is used

The name of the function is entirely up to you, but try to make it distinctive and meaningful. 

We will explain the other parts as we begin to create a function below, but note for now that when you create the function nothing appears to happen.

Of course something *has* happened when you run the code above: a function has been created and can now be used.

## 'Calling' a function

Using a function is referred to as 'calling' it.

This is how you **call** a function:

In [None]:
sayhello()

hello


Basically it's like any other function: you type the name of the function, followed by parentheses containing any ingredients it needs. Even if the function doesn't need any ingredients, you still use (empty) parentheses.
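To see why the parentheses matter, here is a small sketch (it redefines `sayhello` so that it runs on its own, and `justthename` is a made-up variable name). Without parentheses, Python only refers to the function; it doesn't run it.

```python
# redefine the example function so this cell runs on its own
def sayhello():
  print("hello")

# without parentheses the function is only referred to, not run
justthename = sayhello
# callable() is a built-in function that checks whether something can be called
print(callable(justthename))

# with parentheses the function actually runs and prints "hello"
sayhello()
```

Nothing is printed by the line `justthename = sayhello`, because the function is never called there.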

## Creating a function with ingredients

Our example so far didn't have any ingredients, so let's create one that does.


In [None]:
def print_this_word(thisword):
  print(thisword)

This time we've put a word inside the parentheses: `thisword`

In a way, we've created a variable to store whatever ingredient is used when someone calls this function. (This is called a **parameter**.)

That variable is then used in the code below: `print(thisword)`

To see what happens, let's use that function:

In [None]:
print_this_word("pumpkin")

pumpkin


Now let's break down what happens when that line of code is run:

1. First, it **calls** the function `print_this_word`
2. Then it gives it an ingredient: the string "pumpkin". This is called **passing** an **argument**
3. When the function was written, it called that ingredient `thisword`, so the string "pumpkin" is stored in a variable called `thisword`
4. As the function code runs, it accesses that variable inside a `print()` command, so the contents of that variable ("pumpkin" this time) are printed
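The steps above can be repeated with any argument. In this short sketch the same function is called three times; `"squash"`, `"courgette"` and the variable name `myword` are made up for the example.

```python
# the parameter 'thisword' is a placeholder name chosen when the function is written
def print_this_word(thisword):
  print(thisword)

# each call passes a different argument, which is stored in 'thisword' for that call
print_this_word("pumpkin")
print_this_word("squash")

# the argument can also be a variable rather than a string typed directly
myword = "courgette"
print_this_word(myword)
```

The function itself never changes: only the argument passed to it does.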

## Creating a function that **returns** something

Our function above only prints something, but often you want a function to do something (such as perform a calculation, or scrape a page) and then **return** the results. 

Here's an example:

In [None]:
#define a new function called 'addtwonumbers' which has 2 parameters (ingredients)
def addtwonumbers(numone, numtwo):
  #the two parameters are added together and stored in 'total'
  total = numone+numtwo
  #the function returns that value
  return(total)

You can see that this function has *two* ingredients (parameters): `numone` and `numtwo`, separated by a comma. The function adds those two together and stores the result in a new variable called `total`. Finally, it uses `return` to pass the contents of that variable back to whatever called the function.

Here's that function being used:

In [None]:
whatisit = addtwonumbers(3,8)
print(whatisit)

11


You can see that the first line runs the `addtwonumbers()` function and **passes** it two ingredients: two numbers, 3 and 8. 

The function runs, adds those two numbers together, and *returns* the result, which is then stored in the variable `whatisit`.

We then print that.

Returning information from a function can be incredibly powerful: in the example above it was just a number that was returned, but you can return lists, dictionaries, multiple items, and, among other things, data frames - which is what happens next...
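As a rough illustration of returning more than one item, here is a sketch using two made-up functions (`addandmultiply` and `describenumbers` are hypothetical names, not part of the scraper):

```python
# a hypothetical function that returns two values at once
def addandmultiply(numone, numtwo):
  total = numone + numtwo
  product = numone * numtwo
  # separating the values with a comma returns them together (as a tuple)
  return total, product

# the two returned values can be stored in two variables in one line
thesum, theproduct = addandmultiply(3, 8)
print(thesum)
print(theproduct)

# a function can also return a dictionary
def describenumbers(numone, numtwo):
  return {"total": numone + numtwo, "biggest": max(numone, numtwo)}

print(describenumbers(3, 8))
```

Whatever is returned can be stored in a variable and used later, in the same way as `whatisit` above.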

## Creating a scraper function

Now let's create a function to contain the code that we wrote for our scraper in a separate notebook.

In [3]:
#define a function
def scrapepage(theurl):
  #fetch the page from the URL
  page = requests.get(theurl)
  #parse the page into a 'soup' object
  soup = BeautifulSoup(page.content, 'html.parser')
  #grab all the headlines - we've identified a class attribute they all have
  headlines = soup.select('div a[class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor"]')
  #create two empty lists for the two pieces of information we want to extract
  headlinetext = []
  links = []
  #loop through them
  for i in headlines:
    #extract the text
    headtext = i.get_text()
    #extract the link
    headlink = i['href']
    #add the text to the previously empty list 
    headlinetext.append(headtext)
    #add the link to a second empty list
    links.append(headlink)
  #check that both are the same length
  print(len(headlinetext))
  print(len(links))
  #create a dataframe to store them
  df = pd.DataFrame({"headline" : headlinetext, "link" : links})
  return(df)



To create this function we've essentially taken all the important code from that notebook and indented it under the line `def scrapepage(theurl):`

That first line transforms our previous code into something **reusable** by doing two things: giving a name to it (`scrapepage`); and giving a name to the URL we want to scrape (`theurl`).

There's one other extra line too, right at the end: `return(df)` ensures that the results of the scraper are passed back to whatever calls this function.

Something else to highlight: the function contains a `for` loop as well, which means there are two levels of indents in the code: all the code inside the function is indented, and then the `for` loop code inside *that* is indented one more time.
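The two levels of indentation can be seen more easily in a stripped-down sketch that doesn't fetch any pages. Here `fakeheadlines` is made-up data standing in for the links a scraper might find, and `collectlinks` is a hypothetical function name.

```python
# made-up data standing in for the links a scraper might find on a page
fakeheadlines = [
  {"text": "First headline", "href": "/news/one"},
  {"text": "Second headline", "href": "/news/two"},
]

def collectlinks(headlines):
  # first level of indentation: everything inside the function
  headlinetext = []
  links = []
  for i in headlines:
    # second level of indentation: everything inside the loop
    headlinetext.append(i["text"])
    links.append(i["href"])
  # back to the first level: this runs once, after the loop has finished
  return headlinetext, links

texts, hrefs = collectlinks(fakeheadlines)
print(texts)
print(hrefs)
```

If the `return` line were indented to the second level, it would run during the first pass through the loop and the function would stop after one item.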

## Calling the scraper function

Now let's call that function on a bunch of pages. First we need to make sure the libraries are loaded.

In [5]:
#import the libraries for fetching and parsing web pages
import requests
from bs4 import BeautifulSoup
#the pandas library which is used to work with data 
import pandas as pd

### Testing the scraper function on one URL

Next we test it on just one URL.

In [6]:
#store the URL of another topic page (environment)
testurl = "https://www.bbc.co.uk/news/science-environment-56837908"
#run the function on that URL and store the results in 'testdf'
testdf = scrapepage(testurl)
#check what we have
testdf

39
39


| | headline | link |
|---|---|---|
| 0 | Biggest coal plant in Australia to close early | /news/business-60411622 |
| 1 | More than eight million trees lost in UK winter | /news/science-environment-60348947 |
| 2 | Flooding is the new reality in Wales - watchdog | /news/uk-wales-60386125 |
| 3 | Big banks fund new oil despite net zero pledges | /news/business-60366054 |
| 4 | Record high deforestation of Amazon in January | /news/science-environment-60333422 |
| 5 | UK's only shale gas wells to be abandoned | /news/uk-england-lancashire-60341226 |
| 6 | Biggest coal plant in Australia to close early | /news/business-60411622 |
| 7 | More than eight million trees lost in UK winter | /news/science-environment-60348947 |
| 8 | Flooding is the new reality in Wales - watchdog | /news/uk-wales-60386125 |
| 9 | Big banks fund new oil despite net zero pledges | /news/business-60366054 |


## Loop through multiple URLs and scrape them all

Now we can try it on a list of URLs. 

This time we need to create an empty dataframe to store the results of each page.

We also need to create a list of URLs that we want to loop through and scrape.

Each time that we loop through the list and grab one URL, we use `pd.concat()` to add the dataframe of *one page's results* to the ongoing dataframe of *all results*. (Older versions of `pandas` used `.append()` for this, but that method was removed in `pandas` 2.0.)

In [12]:
#Create an empty dataframe to store the data we are about to scrape
allresults = pd.DataFrame()

#create a list of URLs to scrape
urllist = ["https://www.bbc.co.uk/news/world","https://www.bbc.co.uk/news/health", "https://www.bbc.co.uk/news/entertainment_and_arts"]

#then loop through them and scrape each URL
for i in urllist:
  #scrape that url
  df = scrapepage(i)
  print(df)
  #add the new data frame to the existing data frame
  allresults = pd.concat([allresults, df])

#print final dataframe
print(allresults)

39
39
                                             headline                                               link
0   Video 1 minute 7 secondsUS warns of consequenc...                               /news/world-60445560
1       Ottawa police close in on vaccine protest hub                     /news/world-us-canada-60420469
2   Search for 12 missing after ferry blaze continues                        /news/world-europe-60443517
3   Malian soldiers killed after France announces ...                        /news/world-africa-60444777
4        Australia says China shone laser at warplane                     /news/world-australia-60446928
5      Mexican army moves in on drug lord's home town                 /news/world-latin-america-60443514
6   Video 1 minute 7 secondsUS warns of consequenc...                               /news/world-60445560
7       Ottawa police close in on vaccine protest hub                     /news/world-us-canada-60420469
8   Search for 12 missing after ferry blaze conti

## Export the results

Finally, we can export the combined results as a CSV file.

In [8]:
#save the combined dataframe as a CSV file
allresults.to_csv("scrapeddata.csv")

Note that the extra 'new line' characters will make it look like the cells are empty in Excel until you double-click in one to see the whole thing.
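One possible way to avoid that problem is to tidy the text before storing it. The sketch below is an assumption on our part rather than part of the original scraper: `cleantext` is a made-up helper name, and it relies only on the standard string methods `split()` and `join()`.

```python
def cleantext(rawtext):
  # split() with no arguments breaks the text on any whitespace, including new lines
  words = rawtext.split()
  # join() puts the words back together with a single space between each
  return " ".join(words)

# a made-up example of text with extra new lines and spaces
messy = "\n\n   Biggest coal plant in Australia\n to close early \n"
print(cleantext(messy))
```

Inside the scraper function this might be used on the line that extracts the text, for example `headtext = cleantext(i.get_text())`.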

## Improvement 1: adding a delay (throttling)

We can change the scraper so that it pauses between each page. To do this we need the `time` library.

In [10]:
#Import the time library to use its sleep() function
import time

We can then use the `sleep()` function from that library, which [stops the code running for a specified number of seconds](https://www.programiz.com/python-programming/time/sleep). So to pause for three seconds it might be written like so:

`time.sleep(3)`

That can be inserted into the loop that calls the scraping function (or, if the scraping function scrapes more than one page, you can insert it there to pause between each page):

In [11]:
#Create an empty dataframe to store the data we are about to scrape
allresults = pd.DataFrame()

urllist = ["https://www.bbc.co.uk/news/world","https://www.bbc.co.uk/news/health", "https://www.bbc.co.uk/news/entertainment_and_arts"]

#then loop through them and scrape each URL
for i in urllist:
  #scrape that url
  df = scrapepage(i)
  print(df)
  #add the new data frame to the existing data frame
  #(pd.concat() is used because .append() was removed in pandas 2.0)
  allresults = pd.concat([allresults, df])
  print("waiting 3 seconds before next scrape")
  #Sleep for 3 seconds before looping again
  time.sleep(3)

print(allresults)

39
39
                                             headline                                               link
0   Video 1 minute 7 secondsUS warns of consequenc...                               /news/world-60445560
1       Ottawa police close in on vaccine protest hub                     /news/world-us-canada-60420469
2   Search for 12 missing after ferry blaze continues                        /news/world-europe-60443517
3   Malian soldiers killed after France announces ...                        /news/world-africa-60444777
4        Australia says China shone laser at warplane                     /news/world-australia-60446928
5      Mexican army moves in on drug lord's home town                 /news/world-latin-america-60443514
6   Video 1 minute 7 secondsUS warns of consequenc...                               /news/world-60445560
7       Ottawa police close in on vaccine protest hub                     /news/world-us-canada-60420469
8   Search for 12 missing after ferry blaze conti
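
The loop above also pauses after the final page, when there is nothing left to scrape. As a variation, this sketch only pauses *between* pages. It is an illustration rather than the notebook's own code: `fakescrape` is a made-up stand-in for `scrapepage()` so that it runs without fetching anything, the URLs are placeholders, and the delay is kept very short.

```python
import time

# a stand-in for scrapepage() so this sketch runs without fetching any pages
def fakescrape(theurl):
  return "results for " + theurl

# made-up URLs for the example
urllist = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
# a short delay so the example runs quickly; a real scraper might wait longer
delay = 0.1
# a plain list is used here to collect the results
allresults = []

# enumerate() gives us a counter as well as each URL
for count, i in enumerate(urllist):
  # only pause if we have already scraped at least one page
  if count > 0:
    print("waiting before next scrape")
    time.sleep(delay)
  allresults.append(fakescrape(i))

print(allresults)
```

With three URLs this pauses twice rather than three times.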