 # Creating a function of your very own

This notebook walks through some examples of creating your own functions (often called **user-defined functions**) and running them.

 Below we create a very simple function called `insultme`. 

To create a function you start with `def` followed by the name you want to give the function, and some parentheses. If you need the function to work with any ingredients (which you normally do), then you name those ingredients inside the parentheses.

Essentially the word(s) inside brackets are variables which will store those ingredients while the function runs. 

Variables created inside a function, including those ingredients, are called **local variables** because they only exist while the function runs and cannot be used outside it (unlike **global variables**, which are created in normal code outside any function).

At the end of the `def` line there is a colon. After the colon any lines of code you want to run as part of the function should be indented. 

At the end of the function you will often see a `return` statement: this specifies what will be *returned* to whatever called the function (often a variable is created to store the result of the function running, as you'll see below).

In [None]:
#define a function called 'insultme' - it has one ingredient, stored in 'thename'
def insultme(thename):
  #return the string to whatever called the function
  return("You idiot "+thename)
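
The point about local variables can be checked directly. The sketch below is only an illustration: the function `greet` and the variables `greeting`, `message` and `result` are made-up names, not part of the scraper that follows.

```python
# a global variable: created in normal code, outside any function
greeting = "Hello"

def greet(thename):
  # 'message' is a local variable: it only exists while greet() runs
  message = greeting + " " + thename
  return message

# store what the function returns in a global variable
result = greet("Paul")
print(result)

# trying to use the local variable outside the function raises a NameError
try:
  print(message)
except NameError:
  print("message does not exist outside the function")
```

The function can read the global `greeting`, but `message` disappears once the function has finished, which is why its value has to be passed back with `return`.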

## Running the function and storing the results

Now that the function has been defined above, it can be 'called'. 

In the code below a variable is created to store the results of the function (anything that it `return`s).

The function is also given a string - `"Paul"` - as its one ingredient. When the function runs it stores this in the variable named in brackets when it was defined (in the example above that ingredient is stored in something called `thename`).

In [None]:
#call the insultme function and pass it the string "Paul"
#store the results in a variable called mynewstring
mynewstring = insultme("Paul")

In [None]:
#print that variable
print(mynewstring)

You idiot Paul
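
To see why `return` matters, it can help to compare a function that returns its result with one that only prints it. This is a minimal sketch; `insult_and_return` and `insult_and_print` are hypothetical names used just for the comparison.

```python
def insult_and_return(thename):
  # hands the string back to whatever called the function
  return "You idiot " + thename

def insult_and_print(thename):
  # shows the string on screen but hands nothing back
  print("You idiot " + thename)

returned = insult_and_return("Paul")
printed = insult_and_print("Paul")

# the first variable holds the string; the second holds None
print(returned)
print(printed)
```

Only the version with `return` gives you something you can store and reuse later, for example in a list or a dataframe.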


## Writing a scraper function

Now we use the principles above to create a function that can be used to scrape pages.

First, import the libraries we will need.

In [None]:
#import the 3 libraries we need for the scraper
import requests
import pandas as pd
from bs4 import BeautifulSoup

## Write the code for one page first

A function is often based on code that you ran once, but now want to run multiple times. 

Below, for example, we write some code to scrape one page.

Later we might want to use the same code to scrape multiple pages, which is a good reason to store that code in a function.

In [None]:
#store a url we can test this on
myurl = "https://www.bbc.co.uk/news/science-environment-56837908"
#fetch the page from the URL
page = requests.get(myurl)
#parse the page into a 'soup' object
soup = BeautifulSoup(page.content, 'html.parser')
#grab all the headlines - we've identified a class attribute they all have
headlines = soup.select('div a[class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor"]')
#create two empty lists for the two pieces of information we want to extract
headlinetext = []
links = []
#loop through them
for i in headlines:
  #extract the text
  headtext = i.get_text()
  #extract the link
  headlink = i['href']
  #add the text to the previously empty list 
  headlinetext.append(headtext)
  #add the link to a second empty list
  links.append(headlink)
#check that both are the same length
print(len(headlinetext))
print(len(links))
#create a dataframe to store them
df = pd.DataFrame({"headline" : headlinetext, "link" : links})
#show it
df



31
31


| | headline | link |
|---|---|---|
| 0 | Why many experts aren't impressed with UK ener... | /news/business-61022678 |
| 1 | PM defends energy plan amid cost of living crisis | /news/business-61027313 |
| 2 | Minister orders review of fracking impact | /news/uk-politics-60999026 |
| 3 | UN scientists: It's 'now or never' to fix clim... | /news/science-environment-60984663 |
| 4 | Wind and solar now supply 10% of world electri... | /news/science-environment-60917445 |
| 5 | Why many experts aren't impressed with UK ener... | /news/business-61022678 |
| 6 | PM defends energy plan amid cost of living crisis | /news/business-61027313 |
| 7 | Minister orders review of fracking impact | /news/uk-politics-60999026 |
| 8 | UN scientists: It's 'now or never' to fix clim... | /news/science-environment-60984663 |
| 9 | Wind and solar now supply 10% of world electri... | /news/science-environment-60917445 |
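
The selector used above matches on the *whole* class attribute, so it can be worth seeing how that differs from matching on a single class. The sketch below runs on a tiny made-up HTML fragment rather than a real page; the class names `promo-heading` and `bold` are invented for illustration and are not the BBC's actual markup.

```python
from bs4 import BeautifulSoup

# a tiny made-up HTML fragment standing in for a real page
html = """
<div>
  <a class="promo-heading bold" href="/news/one">First headline</a>
  <a class="bold promo-heading" href="/news/two">Second headline</a>
  <a class="other" href="/news/three">Not a headline</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# attribute selector: matches only when the class attribute is exactly this string
exact = soup.select('div a[class="promo-heading bold"]')
# class selector: matches any <a> tag that has this class, whatever else it has
byclass = soup.select("div a.promo-heading")

print([a.get_text() for a in exact])
print([a["href"] for a in byclass])
```

The exact-match form is stricter: it finds one link here, while the class selector finds two. If a site changes the order or number of classes on its links, the exact-match selector will stop finding them.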


## Applying that code within a loop

You don't have to store that code in a function. Instead you could put the code inside a loop (remembering to indent it all) which goes through multiple URLs and repeats the code with each.

That's what the code block below does.

In [None]:
#store a list of URLs
urllist = ["https://www.bbc.co.uk/news/world","https://www.bbc.co.uk/news/health", "https://www.bbc.co.uk/news/entertainment_and_arts"]
#create an empty dataframe to store the results of each loop below
thisdfwasempty = pd.DataFrame()

#loop through the urls in the list
for eachurl in urllist:
  #fetch the HTML document from that url
  page = requests.get(eachurl)
  #parse the page into a 'soup' object
  soup = BeautifulSoup(page.content, 'html.parser')
  #grab all the headlines - we've identified a class attribute they all have
  headlines = soup.select('div a[class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor"]')
  #create two empty lists for the two pieces of information we want to extract
  headlinetext = []
  links = []
  #loop through them
  for i in headlines:
    #extract the text
    headtext = i.get_text()
    #extract the link
    headlink = i['href']
    #add the text to the previously empty list 
    headlinetext.append(headtext)
    #add the link to a second empty list
    links.append(headlink)
  #check that both are the same length
  print(len(headlinetext))
  print(len(links))
  #create a dataframe to store them
  df = pd.DataFrame({"headline" : headlinetext, "link" : links})
  #add to the previously empty dataframe (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
  thisdfwasempty = pd.concat([thisdfwasempty, df])

39
39
34
34
34
34


In [None]:
#show the dataframe now it's been appended to 3 times (once for each loop)
thisdfwasempty

| | headline | link |
|---|---|---|
| 0 | Ukrainian marines warn of 'last battle' in Mar... | /news/world-europe-61068650 |
| 1 | Macron targets Le Pen as run-off campaign begins | /news/world-europe-61067426 |
| 2 | This time it won’t be a walkover for Macron | /news/world-europe-61061359 |
| 3 | German outlet hires Russian TV protest journalist | /news/world-asia-61071163 |
| 4 | Pakistan gets new PM after week-long uncertainty | /news/world-asia-61063386 |
| ... | ... | ... |
| 29 | Britain's dark, humid 'Lost World' | https://www.bbc.com/travel/article/20220410-un... |
| 30 | Who is the greatest First Lady? | https://www.bbc.com/culture/article/20220408-t... |
| 31 | How to spot the next pandemic | https://www.bbc.com/future/article/20220406-ho... |
| 32 | The entry-level workers making $100,000 | https://www.bbc.com/worklife/article/20220401-... |


In [None]:
#export as a CSV file
thisdfwasempty.to_csv("3pages.csv")

## Doing it with a function instead

Alternatively you can store that code inside a function and call *that* within the loop.

That approach is shown in the code block below.

In [None]:
#define a function, call it 'scrapebbcpage' and call the one ingredient 'theurl'
def scrapebbcpage(theurl):
  #fetch the page from the URL
  page = requests.get(theurl)
  #parse the page into a 'soup' object
  soup = BeautifulSoup(page.content, 'html.parser')
  #grab all the headlines - we've identified a class attribute they all have
  headlines = soup.select('div a[class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor"]')
  #create two empty lists for the two pieces of information we want to extract
  headlinetext = []
  links = []
  #loop through them
  for i in headlines:
    #extract the text
    headtext = i.get_text()
    #extract the link
    headlink = i['href']
    #add the text to the previously empty list 
    headlinetext.append(headtext)
    #add the link to a second empty list
    links.append(headlink)
  #check that both are the same length
  print(len(headlinetext))
  print(len(links))
  #create a dataframe to store them
  df = pd.DataFrame({"headline" : headlinetext, "link" : links})
  #return that dataframe to whatever called the function
  return(df)

### Test the function on one URL

Before using it within a loop, it's worth testing on just one URL.

In [None]:
#test the function on one url - store the dataframe returned in a variable called 'testscrape'
testscrape = scrapebbcpage("https://www.bbc.co.uk/news/science-environment-56837908")

31
31


In [None]:
#show the dataframe that was returned and stored
testscrape

| | headline | link |
|---|---|---|
| 0 | Why many experts aren't impressed with UK ener... | /news/business-61022678 |
| 1 | PM defends energy plan amid cost of living crisis | /news/business-61027313 |
| 2 | Minister orders review of fracking impact | /news/uk-politics-60999026 |
| 3 | UN scientists: It's 'now or never' to fix clim... | /news/science-environment-60984663 |
| 4 | Wind and solar now supply 10% of world electri... | /news/science-environment-60917445 |
| 5 | Why many experts aren't impressed with UK ener... | /news/business-61022678 |
| 6 | PM defends energy plan amid cost of living crisis | /news/business-61027313 |
| 7 | Minister orders review of fracking impact | /news/uk-politics-60999026 |
| 8 | UN scientists: It's 'now or never' to fix clim... | /news/science-environment-60984663 |
| 9 | Wind and solar now supply 10% of world electri... | /news/science-environment-60917445 |
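
A test on a single URL cannot show whether a function really uses its ingredient: if the body accidentally refers to a global variable left over from an earlier cell, it will still appear to work when tested on that same URL. The sketch below illustrates this without any network access; `fakepages`, `testurl` and both `countheadlines` functions are hypothetical stand-ins, not part of the scraper.

```python
# a made-up lookup standing in for fetching a page: no network needed
fakepages = {"url-a": 3, "url-b": 5}

# a global variable left over from earlier testing
testurl = "url-a"

def countheadlines_buggy(theurl):
  # bug: uses the global 'testurl' instead of the ingredient 'theurl'
  return fakepages[testurl]

def countheadlines(theurl):
  # correct: uses the ingredient that was passed in
  return fakepages[theurl]

# with the first URL the two functions look identical
print(countheadlines_buggy("url-a"), countheadlines("url-a"))
# a second, different URL exposes the difference
print(countheadlines_buggy("url-b"), countheadlines("url-b"))
```

Testing on at least two different URLs, and checking that the results differ, is a quick way to catch this kind of mistake before putting the function in a loop.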


### Putting the function in a loop

Now it's been tested on one URL, we can put it in a loop.

This works like the earlier example of applying the code within a loop. Each time the loop reaches the line that calls the function, the function's code runs with that URL as its ingredient, and the dataframe it returns is stored.

In [None]:
#store a list of URLs
urllist = ["https://www.bbc.co.uk/news/world","https://www.bbc.co.uk/news/health", "https://www.bbc.co.uk/news/entertainment_and_arts"]
#create an empty dataframe to store the results of each loop
thisdfwasempty = pd.DataFrame()

#loop through the urls
for eachurl in urllist:
  #run the function on each url, and store the returned dataframe in 'scrapeddata'
  scrapeddata = scrapebbcpage(eachurl)
  #add that dataframe to the previously empty one (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
  thisdfwasempty = pd.concat([thisdfwasempty, scrapeddata])

31
31
31
31
31
31


In [None]:
#check the dataframe once the loop has finished
thisdfwasempty

| | headline | link |
|---|---|---|
| 0 | Why many experts aren't impressed with UK ener... | /news/business-61022678 |
| 1 | PM defends energy plan amid cost of living crisis | /news/business-61027313 |
| 2 | Minister orders review of fracking impact | /news/uk-politics-60999026 |
| 3 | UN scientists: It's 'now or never' to fix clim... | /news/science-environment-60984663 |
| 4 | Wind and solar now supply 10% of world electri... | /news/science-environment-60917445 |
| ... | ... | ... |
| 26 | Britain's dark, humid 'Lost World' | https://www.bbc.com/travel/article/20220410-un... |
| 27 | Who is the greatest First Lady? | https://www.bbc.com/culture/article/20220408-t... |
| 28 | How to spot the next pandemic | https://www.bbc.com/future/article/20220406-ho... |
| 29 | The entry-level workers making $100,000 | https://www.bbc.com/worklife/article/20220401-... |
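
A common alternative to growing a dataframe inside the loop is to collect each returned dataframe in a list and join them once at the end. The sketch below assumes a hypothetical stand-in function, `fakescrape`, that returns a one-row dataframe without fetching anything, so it can be run offline; `frames` and `combined` are also made-up names.

```python
import pandas as pd

def fakescrape(theurl):
  # hypothetical stand-in for a scraper function: returns a one-row dataframe
  return pd.DataFrame({"headline": ["Headline from " + theurl], "link": [theurl]})

urllist = ["/news/world", "/news/health", "/news/entertainment_and_arts"]

# collect each returned dataframe in a list
frames = []
for eachurl in urllist:
  frames.append(fakescrape(eachurl))

# join them in one step; ignore_index=True renumbers the rows from 0
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

Using `ignore_index=True` gives every row a unique index number, rather than restarting at 0 for each page as in the output above.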
