# Functions

Functions are repeatable bits of code. You don’t have to use them but they make code cleaner and reduce errors and inconsistencies. Once you have a function that works you can use it in different scripts too.

When you read other people’s code you will almost certainly see some functions. Functions work a bit like a formula in Excel:
`=sum(A1:A2)`

Here `sum` is a function and the cells it works on, `A1` and `A2`, are its *arguments*. (Strictly speaking, Excel treats the range `A1:A2` as a single argument and will happily sum one cell, but summing only becomes useful with at least two values.) Other Excel formulas are designed to work on just one cell. Equally in Python the number of arguments to a function will depend on what it does.

In Python we can write our own functions so let’s rewrite the Excel sum formula in Python.

- we have to start the function definition with `def`
- we have to choose a name (people normally use verbs for functions)
- we have to have round brackets followed by a colon (even if they're empty)
- we should either return or print something
- no code in the function runs after a `return` statement!

In [1]:
def replicate_excel(num1, num2):
    return num1 + num2

In [2]:
replicate_excel(232, 907)

1139

We can run this as often as we like, although in this case there is already a built-in Python function called `sum`.
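As a quick illustration of the built-in mentioned above: unlike `replicate_excel`, Python's `sum` takes one collection of numbers (such as a list) rather than separate numbers. This is a minimal sketch; the variable name `total` is only for illustration.

```python
# The built-in sum() adds up the items of a single iterable, e.g. a list
total = sum([232, 907])
print(total)  # 1139
```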

It's important to notice that we have to pass *exactly* two arguments to `replicate_excel` or we'll get an error:

In [7]:
replicate_excel(23432, 78907, 51092)

TypeError: replicate_excel() takes 2 positional arguments but 3 were given

Notice that the error message tells us that our function can only take two *positional* arguments. This is the default type of argument. The other type is a *keyword* argument. We'll look at the latter later in the notebook.

Let's write a new function to tell a user they're not allowed to do something. First we'll just print a message, so no arguments are required, but we still need the round brackets:

In [4]:
def say_no():
    print("I'm sorry, I can't let you do that.")

In [5]:
say_no()

I'm sorry, I can't let you do that.


Now we'll add an argument, so that we can put the username in the message.

In [8]:
def say_no(username):
    print(f"I'm sorry, I can't let you do that, {username}.")

Now we *must* provide a username.

In [10]:
say_no("Magaret")

I'm sorry, I can't let you do that, Magaret.


In [11]:
say_no(666)

I'm sorry, I can't let you do that, 666.


In [9]:
say_no("Dave")

I'm sorry, I can't let you do that, Dave.
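The difference between printing and returning, and the point that nothing after a `return` runs, can be seen in a small sketch. `build_no` is a hypothetical name introduced here for comparison; it is not part of the notebook.

```python
def say_no(username):
    # print() shows the message, but the function itself returns None
    print(f"I'm sorry, I can't let you do that, {username}.")


def build_no(username):  # hypothetical name, for illustration only
    # return hands the message back to the caller so it can be stored or reused
    return f"I'm sorry, I can't let you do that, {username}."
    print("This line never runs, because it comes after the return")


printed = say_no("Dave")     # the message appears on screen
returned = build_no("Dave")  # nothing appears on screen yet

print(printed)   # None
print(returned)  # I'm sorry, I can't let you do that, Dave.
```

A function that returns its result is usually easier to reuse, because the caller decides whether to print it, save it or pass it to another function.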


You can put logic inside the body of a function, as complicated as you like, but if a function gets really hard to follow you should probably think about splitting it into several functions.

Let's revisit the `if` logic from last week and either allow a user to do something or not, according to their status.

In [12]:
def say_yes_or_no(username, status):
    if status == 1:
        print(f"Your status is lowly, {username}. Go away!")
    elif status < 5:
        print(f"I'm sorry, {username}: you're not important enough to do that.")
    else:
        print(f"Right you are, {username}. Happy to be of service.")

Now we need to call it with two arguments, in the correct order.

In [17]:
say_yes_or_no("Yuzhi", 2)

I'm sorry, Yuzhi: you're not important enough to do that.


In [16]:
say_yes_or_no("Caroline Bassett", 5)

Right you are, Caroline Bassett. Happy to be of service.
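To see why the order of positional arguments matters, here is a sketch of what happens if the two arguments are swapped by mistake. Python cannot compare a string with a number using `<`, so the call fails with a `TypeError`. The variable `swapped_error` is only there to hold the error for inspection.

```python
def say_yes_or_no(username, status):
    if status == 1:
        print(f"Your status is lowly, {username}. Go away!")
    elif status < 5:
        print(f"I'm sorry, {username}: you're not important enough to do that.")
    else:
        print(f"Right you are, {username}. Happy to be of service.")


swapped_error = None
try:
    # Arguments in the wrong order: the name ends up in `status`
    say_yes_or_no(2, "Yuzhi")
except TypeError as error:
    swapped_error = error
    print(f"TypeError: {error}")
```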


Let's do something a bit more Digital Humanities. The Dice similarity score compares two *sets* of values. If two sets are identical the score is 1; if two sets have nothing in common the score is 0.

The formula looks like this:

\begin{equation}
\text{Dice}(A, B) = \frac{2 |A \cap B|}{|A| + |B|}
\end{equation}

How can we implement this in Python?

The top line of the equation finds the intersection of the two sets and multiplies its size by two. Last week we found the difference of sets (which values are in set 1 but not in set 2) with `-`. We can get the intersection of `set1` and `set2` (the items common to both sets) with `&`.

In [18]:
set1 = set([345, 2213, 8573, 121, 5402])
set2 = set([941, 6663, 121, 6020, 12, 2213])
print(set1 & set2)

{121, 2213}
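For comparison, here are the difference and intersection operators side by side, along with the equivalent `intersection` method. This is just a sketch using the same values as above; the variable names are chosen for illustration.

```python
set1 = {345, 2213, 8573, 121, 5402}
set2 = {941, 6663, 121, 6020, 12, 2213}

only_in_set1 = set1 - set2            # difference: in set1 but not in set2
in_both = set1 & set2                 # intersection: in both sets
same_again = set1.intersection(set2)  # method form of the & operator

print(only_in_set1)
print(in_both)
```

Sets are unordered, so the items may be printed in a different order from the one you typed.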


For the Dice calculation we need the length of the intersection (which we can see in this case is going to be two items), multiplied by two.

In [19]:
print(len(set1 & set2) * 2)

4


For the lower half, we need to add the length of `set1` to the length of `set2`:

In [20]:
print(len(set1) + len(set2))

11


Breaking it down like this, we can see that the result of the whole equation in this case should be 4/11.

In [21]:
print(4 / 11)

0.36363636363636365


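The whole equation in Python could be done like this. We'll also use an f-string to give us only three decimal places (the `f` in `.3f` means a fixed number of decimal places; without it, `.3` would mean three significant figures).

In [22]:
dice = (len(set1 & set2) * 2) / (len(set1) + len(set2))
print(f"{dice:.3f}")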

0.364
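One way to check the claims about identical and disjoint sets is to wrap the calculation in a small function and try it on sets where the answer is known in advance. This is a sketch only: `dice_score` is a hypothetical name, and treating two empty sets as identical is an assumption made here to avoid dividing by zero.

```python
def dice_score(set_a, set_b):  # hypothetical name, for illustration only
    """Return the Dice similarity of two sets as a float between 0 and 1."""
    total_size = len(set_a) + len(set_b)
    if total_size == 0:
        # Assumption: two empty sets count as identical
        return 1.0
    return 2 * len(set_a & set_b) / total_size


set1 = {345, 2213, 8573, 121, 5402}
set2 = {941, 6663, 121, 6020, 12, 2213}

identical = dice_score(set1, set1)     # same set twice
disjoint = dice_score({1, 2}, {3, 4})  # nothing in common
partial = dice_score(set1, set2)       # 4 / 11, as worked out above

print(identical, disjoint, f"{partial:.3f}")  # 1.0 0.0 0.364
```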


Let's write a function that will tell us how similar the vocabulary in Jane Austen's novels is.

In [23]:
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week1-plain-text/emma.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week1-plain-text/persuasion.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/mansfield_park.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/northanger_abbey.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/pride_and_prejudice.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/sense_and_sensibility.txt


In [25]:
with open('emma.txt', 'r') as f:
    emma = f.read()
with open('pride_and_prejudice.txt', 'r') as f:
    pride = f.read()
with open('sense_and_sensibility.txt', 'r') as f:
    sense = f.read()
with open('mansfield_park.txt', 'r') as f:
    mansfield = f.read()
with open('northanger_abbey.txt', 'r') as f:
    northanger = f.read()
with open('persuasion.txt', 'r') as f:
    persuasion = f.read()

It's always tempting to rush into coding the whole thing, but let's break it down. First we want a function that just takes one novel and makes a set from its words.

In [24]:
def settify_novel(novel):
    unique_words = set(novel.split())
    return unique_words

In [26]:
mansfield_words = settify_novel(mansfield)
northanger_words = settify_novel(northanger)

In [27]:
print(len(mansfield_words))

16717


This seems to work. So now we can write a Dice function to compare any two Austen novels (provided we have them opened). Here we're slightly extending the `settify_novel` function to allow for two novels, and putting the Dice calculation into the function.

In [28]:
def Austen_Dice(novel1, novel2):
    novel1_words = set(novel1.split())
    novel2_words = set(novel2.split())
    dice = (len(novel1_words & novel2_words) * 2) / (len(novel1_words) + len(novel2_words))
    return f"{dice:.3f}"

In [30]:
Austen_Dice(pride, persuasion)

'0.492'

In [29]:
Austen_Dice(northanger, mansfield)

'0.478'

In other words, nearly half of the distinct words are the same.

This is a surprisingly low score. Punctuation and capitalisation are probably skewing the results a fair bit. Let's now write a function to clean up a text. (We could add these steps to `Austen_Dice` but this is going to be a general function that we might want to use for other things. If functions do one thing, they are more reusable and flexible.)
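A tiny made-up sentence shows how punctuation and capitalisation inflate the number of "unique" words when we simply split on whitespace. The sample text and variable names are invented for illustration.

```python
sample = "The house was large. Was the garden large?"

# Splitting on whitespace keeps punctuation attached and keeps capital letters
tokens = set(sample.split())

print(tokens)
print(len(tokens))  # 8, although only five distinct words are used
```

Here `The` and `the`, `was` and `Was`, and `large.` and `large?` are all counted separately.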

#### Normalising text

In [32]:
import string

def normalise(text):
    """A general function to normalise text strings ahead of analysis.
    It will lowercase the text; remove punctuation; remove line breaks.
    """
    text = text.lower()
    text = text.replace("\n", " ")
    text = text.replace("  ", " ")  # double spaces to single (just one pass)
    # remove every character listed in string.punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

In [33]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [34]:
normalise?

In [35]:
Austen_Dice(normalise(northanger), normalise(mansfield))

'0.566'
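The comment in `normalise` notes that the double-space replacement is only a single pass. The sketch below, using an invented test string, shows what that means, along with one common alternative (splitting and re-joining) that collapses any run of whitespace. For `Austen_Dice` the leftover spaces don't matter, because `split()` with no arguments already treats a run of whitespace as one separator.

```python
text = "one   two\n\nthree"  # three spaces, then two line breaks

# The approach used in normalise(): one pass can leave double spaces behind
single_pass = text.replace("\n", " ").replace("  ", " ")

# An alternative: split on any whitespace, then join with single spaces
collapsed = " ".join(text.split())

print(repr(single_pass))  # 'one  two three'
print(repr(collapsed))    # 'one two three'
```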

#### Complexities with arguments

In [36]:
replicate_excel(12, 44, 88)

TypeError: replicate_excel() takes 2 positional arguments but 3 were given

But in a function which adds numbers we don't know in advance how many numbers we'll want to use. We need the function to allow for *any number of arguments*.

In [37]:
def addup(num1, *remaining_nums):
    return num1 + sum(remaining_nums)

In [38]:
addup(94, 11, 722, 4)

831
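To see what the `*` is doing, this sketch adds a `print` to `addup` so we can look inside `remaining_nums`: it is a tuple of every argument after the first. The same `*` can also be used when *calling* a function, to unpack a list into separate arguments. The variable names below are for illustration only.

```python
def addup(num1, *remaining_nums):
    # remaining_nums is a tuple holding every argument after the first
    print(remaining_nums)
    return num1 + sum(remaining_nums)


total = addup(94, 11, 722, 4)  # prints (11, 722, 4)
single = addup(94)             # prints (), an empty tuple

numbers = [94, 11, 722, 4]
from_list = addup(*numbers)    # * unpacks the list into four arguments

print(total, single, from_list)  # 831 94 831
```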

#### Functions with lots of arguments

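When a function takes lots of arguments it is easy to lose track of which value is which. Calling the function with *keyword* arguments makes the meaning of each value explicit, and means the order no longer matters.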

In [39]:
def calculate_interest(principal, interest_rate, years, deposit):
    """Print the total owed on a loan after compound interest.
    Deposits over 99 qualify for an interest rate 0.5 percentage points lower.
    """
    if deposit > 99:
        interest_rate = interest_rate - 0.5
    print(f"{principal*((1 + interest_rate/100)**years):.2f}")

In [40]:
calculate_interest(10000, 3.875, 25, 25)

25868.78


In [41]:
calculate_interest(principal=10000,
                   interest_rate=3.875,
                   years=25,
                   deposit=25)

25868.78


In [42]:
calculate_interest(principal=10000,
                   years=25,
                   deposit=25,
                   interest_rate=3.875,
                   )

25868.78
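Positional and keyword arguments can also be mixed in one call, as long as the positional ones come first. The sketch below uses a variant of the function that returns the amount instead of printing it; `total_owed` is a hypothetical name, not part of the notebook.

```python
def total_owed(principal, interest_rate, years, deposit):  # hypothetical name
    """Return (rather than print) the amount owed after compound interest."""
    if deposit > 99:
        interest_rate = interest_rate - 0.5
    return principal * (1 + interest_rate / 100) ** years


# Positional arguments first, then keywords for the values that are easy to mix up
amount = total_owed(10000, 3.875, years=25, deposit=25)
print(f"{amount:.2f}")  # 25868.78, matching the cells above
```

Putting a positional argument after a keyword argument is not allowed and Python will report a `SyntaxError`.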


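You can force some arguments to be positional only and/or some arguments to be keyword only. The rule is that keyword-only arguments must come after all positional arguments. If you ever need to do this (and you might not), look up the syntax.

A function can also accept extra arguments it wasn't expecting: in the next cell, `*args` collects any additional positional arguments and `**kwargs` any additional keyword arguments. Here they are simply ignored.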

In [43]:
def calculate_interest(principal, interest_rate, years, deposit, *args, **kwargs):
    if deposit > 99:
        interest_rate = interest_rate - 0.5
    print(f"{principal*((1 + interest_rate/100)**years):.2f}")

In [44]:
calculate_interest(10000, 3.875, 25, 25, 19, 23432)

25868.78
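For reference, here is a sketch of the positional-only and keyword-only syntax mentioned earlier. In a function definition, parameters before a bare `/` are positional only and parameters after a bare `*` are keyword only (the `/` marker needs Python 3.8 or later). `strict_interest` is a hypothetical name, and which parameters to restrict is an arbitrary choice made for illustration.

```python
def strict_interest(principal, /, interest_rate, *, years, deposit):
    """principal: positional only; years and deposit: keyword only."""
    if deposit > 99:
        interest_rate = interest_rate - 0.5
    return principal * (1 + interest_rate / 100) ** years


accepted = strict_interest(10000, 3.875, years=25, deposit=25)
print(f"{accepted:.2f}")

rejected = None
try:
    # years and deposit are passed positionally, which is no longer allowed
    strict_interest(10000, 3.875, 25, 25)
except TypeError as error:
    rejected = error
    print(f"TypeError: {error}")
```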


#### Lambdas

This is a fancy name (not specific to Python) for a single-use function, sometimes called an 'in-line' or 'anonymous' function. Because it has no name, you can't call it again elsewhere in your code. Because lambdas are an abbreviated form, they can be hard to read and should be used sparingly and for simple things!

"The syntactic restrictions tend to make lambdas either unreadable or unworkable." Luciano Ramalho, *Fluent Python*, 2nd edition, O'Reilly.

In [45]:
austen_novels = ["Sense and Sensibility", "Pride and Prejudice", "Mansfield Park", "Emma", "Northanger Abbey", "Persuasion"]

In [46]:
sorted(austen_novels)

['Emma',
 'Mansfield Park',
 'Northanger Abbey',
 'Persuasion',
 'Pride and Prejudice',
 'Sense and Sensibility']

In [47]:
sorted(austen_novels, reverse=True)

['Sense and Sensibility',
 'Pride and Prejudice',
 'Persuasion',
 'Northanger Abbey',
 'Mansfield Park',
 'Emma']

In [48]:
sorted(austen_novels, key=len)

['Emma',
 'Persuasion',
 'Mansfield Park',
 'Northanger Abbey',
 'Pride and Prejudice',
 'Sense and Sensibility']

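The word `lambda` doesn't describe what the function does; it is simply the keyword that introduces an anonymous function. Here `x` stands for each item in the list, and `x[-1]` is its last character.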

In [49]:
sorted(austen_novels, key=lambda x: x[-1])

['Emma',
 'Pride and Prejudice',
 'Mansfield Park',
 'Persuasion',
 'Sense and Sensibility',
 'Northanger Abbey']

In [50]:
def sort_last_letter(item):
    """Return the last character of a string, for use as a sort key."""
    return item[-1]

In [51]:
sorted(austen_novels, key=sort_last_letter)

['Emma',
 'Pride and Prejudice',
 'Mansfield Park',
 'Persuasion',
 'Sense and Sensibility',
 'Northanger Abbey']
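In the results above, the two titles ending in `y` stay in the order they had in the original list, because `sorted` does not reorder items whose keys are equal. If we wanted ties broken alphabetically instead, one option is a key that returns a tuple. This is a sketch; the variable name `by_last_then_title` is for illustration only.

```python
austen_novels = ["Sense and Sensibility", "Pride and Prejudice", "Mansfield Park",
                 "Emma", "Northanger Abbey", "Persuasion"]

# Sort by last character first, then by the whole title to break ties
by_last_then_title = sorted(austen_novels, key=lambda title: (title[-1], title))

print(by_last_then_title)
```

With this key, Northanger Abbey now comes before Sense and Sensibility. This is about as complicated as a lambda should get; anything longer is probably clearer as a named function.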

#### Exercises

- Write a function to open any text file and read it into memory, then give the result a name (suggestion: only send one argument, the file name, to your function; create the variable name (eg `northanger`) when you *call* the function)
- Create a new function that just removes the punctuation from a string (i.e. only one part of our `normalise` function) and test it on a few strings to make sure it's working
- Now that we have a normalising function, we can do a more realistic comparison of texts: look again at which words only appear in Northanger Abbey
- Write a function to calculate the area of a circle (you'll need to use `math.pi` from `math`, so import that).