## 4.2 Strings

Many applications process textual data or present text messages to the user.
Text is formed from characters that represent letters, digits, punctuation,
mathematical symbols, etc.
A sequence of zero or more characters is a **string**, e.g. (H, i, !).

The usual, and more compact, notation is to write strings within
single or double quotation marks, also known simply as quotes, e.g.
'Hi!' or "Hi!". The quotes signal the start and end of the string.
They're not part of the string itself, in the same way that the parentheses
and commas in (H, i, !) are not items of the sequence.

In M269, the string ADT is a restricted form of the sequence ADT,
where the values are immutable strings.
Python's `str` type implements the string ADT.

### 4.2.1 Literals

Python follows the same notation as English.
For example, we write `'Hi!'` or `"Hi!"`.
The double quote is the single character `"`, not two single-quotes.
Two single-quotes `''` (or two double quotes `""`) denote the empty string,
a sequence of no characters.

If a string includes a single-quote or apostrophe (they're the same character),
then it must be enclosed in double quotes, e.g. `"It's OK"`. Vice versa,
strings that include double quotes must be enclosed in single-quotes, e.g.
`'Kane whispered "Rosebud" as he died.'` If a string has both kinds of quotes,
or spans multiple lines like docstrings,
enclose it in three single or double quotes, e.g.
`'''Rhett said "Frankly, my dear, I don't give a damn." and left.'''`
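As a small sketch of the quoting rules above, the following cell checks that the kind of quote used doesn't affect the value of the string. The variable names `single`, `double`, `triple` and `quote` are only for illustration.

```python
single = 'Hi!'
double = "Hi!"
triple = '''Hi!'''
print(single == double)     # True: the quotes aren't part of the string
print(double == triple)     # True

# A string with both kinds of quotes, enclosed in three single-quotes.
quote = '''Rhett said "Frankly, my dear, I don't give a damn." and left.'''
print(quote)
```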

A string literal represents a value, hence it's an expression, and
therefore gets evaluated.

In [1]:
"""Rick: And remember, this gun's pointed right at your heart.
Louis: That is my least vulnerable spot."""

"Rick: And remember, this gun's pointed right at your heart.\nLouis: That is my least vulnerable spot."

The interpreter shows an alternative literal for the multi-line string,
with double quotes and newline characters (`\n`).
Not very readable, I must say.
The `print` function displays the string itself, without the enclosing quotes.

In [2]:
print("""Rick: And remember, this gun's pointed right at your heart.
Louis: That is my least vulnerable spot.""")

Rick: And remember, this gun's pointed right at your heart.
Louis: That is my least vulnerable spot.


String literals can have accented characters like ñ, é, ö and
any character listed in the Unicode standard,
which covers Chinese, Arabic and most other written languages.

<div class="alert alert-info">
<strong>Info:</strong> TM112 Block&nbsp;1 Section&nbsp;1.1.5 introduces the Unicode standard.
</div>

#### Mistakes

Typeset quotes are often curved, but in Python they're straight.
If you copy curved quotes from a PDF, HTML or Word document into Python code,
you get a syntax error.

In [3]:
‘Game over ...’

SyntaxError: invalid character '‘' (U+2018) (1225574369.py, line 1)

The starting quote isn't the `'` character,
so the interpreter doesn't recognise the start of a string literal,
assuming instead it's an identifier with a strange character.

Starting a literal with one kind of quote and ending with another
is also a syntax error.

In [4]:
'holy guacamole!"

SyntaxError: unterminated string literal (detected at line 1) (1971414413.py, line 1)

Strings within single and double quotes can't span multiple lines.

In [5]:
"Rick: And remember, this gun's pointed right at your heart.
Louis: That is my least vulnerable spot."

SyntaxError: unterminated string literal (detected at line 1) (502687848.py, line 1)

In the last two examples the interpreter complains that the string literal
is unterminated: it reached the end of the line (EOL) before it reached
the end of the string.

Jupyter notebooks and other code editors use syntax colouring to distinguish
strings (in red), keywords (in bold green), operators (in bold purple), etc.
That's useful to detect errors before running the code.
When the whole string or part of it isn't in red, as in two of the examples,
you know there's some mistake.

### 4.2.2 Inspecting strings

Python's `str` type supports many operations, in particular
the length, indexing, comparison and membership operations.

#### Length

The function `len` returns the size of a string.

In [6]:
len('')     # length of the empty string

0

In [7]:
len("""
Hello!
""")        # string includes 2 newline characters, but no quotes

8

#### Indexing

Python is more flexible than the indexing operation I defined:
it allows negative as well as non-negative integer indices.

In [8]:
'hello'[0]          # retrieve the first character

'h'

In [9]:
'hello'[5 - 1]      # the index can be an integer expression

'o'

In [10]:
'hello'[-1]

'o'

Negative indices count from the end of the string:
the character at index -1 is the first character from the end,
the character at index -2 is the second character from the end, etc.
In M269, we shall use mainly index -1, because it's a convenient
shorthand for `len(s) - 1` to access the last item of sequence `s`.

In [11]:
text = 'hello'
text[len(text) - 1]     # the same as text[-1]

'o'
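The same correspondence holds for the other negative indices: index `-k` refers to the same character as index `len(text) - k`. Here is a brief check, using the same illustrative string.

```python
text = 'hello'
print(text[-2] == text[len(text) - 2])      # True: both are the 4th character, 'l'
print(text[-len(text)] == text[0])          # True: both are the first character, 'h'
```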

#### Comparisons

All six comparison operators apply to strings.

In [12]:
'Tweedledee' == 'Tweedledum'

False

In [13]:
'hello' < 'high'        # e comes before i, so 'hello' < 'high'

True

In [14]:
'underpass' > 'under'

True

Python's lexicographic comparison uses the character ordering of the Unicode
standard, which leads to results you may not expect.

In [15]:
'aardvark' < 'Zeus'     # A-Z comes before a-z in Unicode

False

In [16]:
'exposé' < 'exposed'    # accented letters come after non-accented

False

As long as you compare two strings *left* and *right* that only have
the unaccented letters a to z and follow the same lower/uppercase pattern, then
*left* < *right* is true if and only if dictionaries list *left* before *right*.

In [17]:
'Aardvark' < 'Zeus'     # capital first letter, rest lowercase

True

The 26 lowercase letters are listed consecutively, in alphabetical order,
in the Unicode standard. We can thus use comparisons to check
if a character is a lowercase letter.

In [18]:
character = '!'
'a' <= character <= 'z'     # is the character a lowercase letter?

False

We can write similar expressions to check if a character is a digit or
an uppercase letter.
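For instance, assuming the digits 0 to 9 and the uppercase letters A to Z are also listed consecutively and in order in the Unicode standard (which they are), such expressions could look as follows. The names `is_digit` and `is_uppercase` are only illustrative.

```python
character = '7'
is_digit = '0' <= character <= '9'          # is the character a decimal digit?
is_uppercase = 'A' <= character <= 'Z'      # is the character an uppercase letter?
print(is_digit)         # True
print(is_uppercase)     # False
```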

We can apply the minimum and maximum operations to obtain the character that
appears first or last in the Unicode table.

In [19]:
min('Zeus')         # in Unicode, A-Z comes before a-z

'Z'

In [20]:
max('By Jove!')     # in Unicode, space and ! come before A-Z

'y'
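To see the ordering that these comparisons rely on, we can ask for the position of a character in the Unicode table. This sketch uses Python's built-in function `ord`, which isn't otherwise needed in this section.

```python
print(ord(' '), ord('!'))   # 32 33: space and ! come first
print(ord('A'), ord('Z'))   # 65 90: then the uppercase letters
print(ord('a'), ord('z'))   # 97 122: then the lowercase letters
```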

#### Membership

Python's `in` operator checks if the left operand
is a substring of the right operand.

In [21]:
'pose' in 'exposed'

True

In [22]:
'hello' in 'Hello, world!'      # h and H are different characters

False

When the left operand is a single character,
this becomes the membership operation.

In [23]:
',' in 'Hello, world!'  # does the string contain a comma?

True

A Boolean expression of the form `not (substring in string)`
can also be written as `substring not in string`, which is easier to read.

In [24]:
'hello' not in 'Hello, world!'

True
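As a quick check that the two forms are interchangeable, this cell evaluates both on the same operands. The variable names are just for illustration.

```python
substring = 'hello'
string = 'Hello, world!'
print(not (substring in string))    # True
print(substring not in string)      # True: same value, easier to read
```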

#### Exercise 4.2.1

Assume `character` is a string of length&nbsp;1. Write a Boolean expression that
is true if and only if `character` is a decimal digit. Don't use comparisons.

In [25]:
character = '6'
# replace this by your expression

After running your code (you should get `True`),
change the character to a letter and rerun the code (you should get `False`).

[Hint](../31_Hints/Hints_04_2_01.ipynb)
[Answer](../32_Answers/Answers_04_2_01.ipynb)

#### Mistakes

A frequent mistake is to forget that indices start from zero.
This leads to 'off by one' errors where you intend to access the *n*-th item
of a sequence but are instead accessing the next item.
If the index is 'out of bounds', the interpreter raises an **index error**,
but if it's not, the 'off by one' error may only become apparent far later
in the execution of the algorithm.

In [26]:
'hello'[5]  # there's no 6th character, counting from the start

IndexError: string index out of range

In [27]:
'hello'[-6] # there's no 6th character, counting from the end

IndexError: string index out of range
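Here's a small illustration of an 'off by one' error that does *not* raise an index error and so may go unnoticed. The string and the variable names are only an example.

```python
word = 'python'
n = 3                   # we want the 3rd character, 't'
print(word[n])          # h: off by one, this is the 4th character
print(word[n - 1])      # t: the n-th character is at index n - 1
```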

If we apply an operation to operands of the wrong type, we get a **type error**.

In [28]:
'five' < 5

TypeError: '<' not supported between instances of 'str' and 'int'

#### Exercise 4.2.2

Explain whether these expressions are valid or
lead to a syntax, type or index error:

1. `"hello"['e']`
1. `''[0]`
1. `len('goodman'[-1])`
1. `42 in 'Everyone wears jersey 42 on Jackie Robinson Day'`

[Hint](../31_Hints/Hints_04_2_02.ipynb)
[Answer](../32_Answers/Answers_04_2_02.ipynb)

### 4.2.3 Creating strings

Python's `str` type also supports the concatenation and slicing operations,
again with some 'extensions'.

#### Concatenation

Python overloads the addition operator to also represent concatenation:

In [29]:
'Hello,' + 'world!'     # concatenation doesn't add spaces

'Hello,world!'

In [30]:
'Hello' + ', ' + 'world' + '!'

'Hello, world!'

Multiplication is just repeated addition, e.g. 3 × 4 = 4 + 4 + 4.
By analogy, Python reuses the multiplication operator for repeated
concatenation.

In [31]:
3 * 'Ho'    # thus spoke Father Christmas

'HoHoHo'

In [32]:
'Ho' * 0    # repeating zero times produces the empty string

''

We'll use repeated concatenation mainly for creating long test strings.
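For example, a test string of 2000 characters could be created like this. The name `test_string` and the repeated text are arbitrary choices.

```python
test_string = 1000 * 'ab'
print(len(test_string))                     # 2000
print(test_string[0], test_string[-1])      # a b
```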

#### Exercise 4.2.3

What are the best- and worst-case scenarios of repeated concatenation?
What are the corresponding complexities?

_Write your answer here._

[Hint](../31_Hints/Hints_04_2_03.ipynb)
[Answer](../32_Answers/Answers_04_2_03.ipynb)

#### Slicing

Python allows arbitrary integers as the start and end indices.
The character at the end index is excluded from the slice.

In [33]:
'hello'[0:1]        # indices 0 to 0, i.e. 'hello'[0]

'h'

In [34]:
'hello'[1:4]        # indices 1 to 3, i.e. 2nd to 4th character

'ell'

In [35]:
'hello'[3:3]        # if start = end, the slice is empty

''

In [36]:
'hello'[2:1]        # if start > end, the slice is empty

''

In [37]:
'hello'[2:-1]       # 3rd to penultimate character (-1 not included)

'll'

The following examples show why the last index isn't included.

In [38]:
'hello'[1:4]        # len(text[start:end]) = end - start

'ell'

In [39]:
'hello'[0:1] + 'hello'[1:4]     # text[a:b] + text[b:c] = text[a:c]

'hell'
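The two properties in the comments above can be checked directly. This is only a sketch for one choice of indices `a`, `b` and `c` with 0 ≤ a ≤ b ≤ c ≤ length of the text; the length property doesn't hold for, say, negative indices.

```python
text = 'hello'
a, b, c = 0, 1, 4
print(len(text[b:c]) == c - b)                  # True: the slice has end - start characters
print(text[a:b] + text[b:c] == text[a:c])       # True: adjacent slices join without overlap
```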

As an example of slicing, here's how to swap the two halves of a string.

In [40]:
text = 'Alice and Bob'
middle = len(text) // 2
text[middle:len(text)] + text[0:middle]

'and BobAlice '

In Python, the start index of a slice can be omitted (it defaults to zero)
and so can the end index (it defaults to the length of the string).
In other words, `s[:i]` is the same as `s[0:i]` and
`s[i:]` is the same as `s[i:len(s)]`.
The previous expression to swap both halves can be more succinctly written as:

In [41]:
text[middle:] + text[:middle]

'and BobAlice '

I think this shows more clearly that the result is
the text from the middle onwards, followed by the text up to the middle.
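As a brief check of the default indices, the following cell compares the short and long forms of each slice. The index `i = 5` is an arbitrary choice.

```python
s = 'Alice and Bob'
i = 5
print(s[:i] == s[0:i])          # True: omitted start defaults to 0
print(s[i:] == s[i:len(s)])     # True: omitted end defaults to len(s)
print(s[:i] + s[i:] == s)       # True: the two slices together form the whole string
```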

#### Exercise 4.2.4

Explain why the swapping code works for strings of length&nbsp;0 and 1,
the edge cases, without raising an index error.

_Write your answer here._

[Hint](../31_Hints/Hints_04_2_04.ipynb)
[Answer](../32_Answers/Answers_04_2_04.ipynb)

#### Conversion

Each Python data type has a **constructor**: a function with the same name as the type, which creates a value of that type.
For example, the constructor `str` creates a string from a number:

In [42]:
str(2020)

'2020'
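Conversion is handy when a number has to be concatenated with text. A minimal sketch, with illustrative names `year` and `message`:

```python
year = 2020
message = 'Year: ' + str(year)      # convert the integer before concatenating
print(message)                      # Year: 2020
print(len(str(year)))               # 4: the string has one character per digit
```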

#### Exercise 4.2.5

What is the complexity of converting an integer to a string?

_Write your answer here._

[Hint](../31_Hints/Hints_04_2_05.ipynb)
[Answer](../32_Answers/Answers_04_2_05.ipynb)

#### Mistakes

The `+` operator means addition if both operands are numbers,
and concatenation if both are strings.
The `*` operator means multiplication if both operands are numbers,
and repeated concatenation if one is an integer and the other is a string.
Any other combination leads to a type error. The exact message may vary.

In [43]:
3 + '3'

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [44]:
'high' * 5.0

TypeError: can't multiply sequence by non-int of type 'float'
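Such errors go away once both operands have suitable types. One possible way, sketched here, is to convert an operand with a constructor: `str`, shown earlier, or `int`, which I assume you know creates an integer from a string of digits.

```python
print(str(3) + '3')     # 33: concatenation of two strings
print(3 + int('3'))     # 6: addition of two integers
print('high' * 5)       # highhighhighhighhigh: the repetition count must be an integer
```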

You may think repeated concatenation doesn't make sense for negative integers,
but the Python designers decided to produce the empty string instead of raising
an error.

In [45]:
'Ho' * -5

''

#### Exercise 4.2.6

Assume that you have a string variable `text` and
an integer variable `times` with a positive value.
Write an expression that repeats the text the given number of times,
separated by ellipses (`...`).

In [46]:
text = 'hello'
times = 3
# replace this by your expression

Running the previous cell should create the string `'hello...hello...hello'`.
If you run the cell with `times = 1`, you should obtain just `'hello'`.

[Hint](../31_Hints/Hints_04_2_06.ipynb)
[Answer](../32_Answers/Answers_04_2_06.ipynb)
