# Basic text processing

## What is a _string_?

A string is the data type that is used for dealing with text.

A string is a sequence of characters, and is always enclosed by quotation marks.

An example of a string is:

In [None]:
example_of_string = "A text is a sequence of characters."

A string does not need to make any sense. Another example of a string is:

In [None]:
example_of_string = "xt is a sequence of ch"

We can use double quotation marks, single quotation marks, or triple double quotation marks:

In [None]:
example_of_string = "A text is a sequence of characters."

In [None]:
example_of_string = 'A text is a sequence of characters.'

In [None]:
example_of_string = """A text is a sequence of characters, enclosed by quotation marks!"""
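One practical reason for having several kinds of quotation marks is that a string can then contain the other kind. A minimal sketch (the variable names and the second sentence are made up for illustration):

```python
# Double quotes outside let us use an apostrophe inside the string.
with_apostrophe = "When the hurlyburly's done"

# Single quotes outside let us use double quotes inside the string.
with_double_quotes = 'She said "hello" and left.'

print(with_apostrophe)
print(with_double_quotes)
```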

If the string spans more than one line, we should use the triple double quotation marks:

In [None]:
beginning = """When shall we three meet again
In thunder, lightning, or in rain?
When the hurlyburly's done,
When the battle's lost and won."""

In [None]:
print(beginning)

### Type and length of a string

We can check the data type of variable `beginning` using the `type()` built-in function:

In [None]:
print(type(beginning))

We can check the length of the string using the `len()` built-in function:

In [None]:
print(len(beginning))
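Note that `len()` counts every character, including spaces and line breaks. A small sketch with two made-up strings illustrates this:

```python
short_text = "a b"
print(len(short_text))  # 3: two letters and one space

two_lines = """a
b"""
print(len(two_lines))  # 3: two letters and one line break
```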

## Strings and numbers

Python differentiates between numbers (mathematical objects) and strings (sequences of characters).

⚠️ **Warning**: A number in quotation marks is actually a string, and can't be used in mathematical operations. This is quite important! It's an easy mistake that even experienced programmers make, and which can have huge consequences.

Compare the following two cells. Do the variables hold numbers or strings?

In [None]:
a = 3
b = 4

In [None]:
c = "3"
d = "4"

✏️ **Exercises:**

In [None]:
# Write the code to print the data type of variable `a`



In [None]:
# Write the code to print the data type of variable `c`:



In [None]:
# What happens if we sum variables `a` and `b`?
# Write your code here:



In [None]:
# What happens if we sum variables `c` and `d`?
# Write your code here:



In [None]:
# What happens if we sum variables `a` and `c`?
# Write your code here:



You can convert a number to a string using the `str()` function:

In [None]:
a = 4
print(a)
print(type(a))

In [None]:
a = str(4)
print(a)
print(type(a))

We can use the `int()` and `float()` functions to convert a string into a number:

In [None]:
a = "3"
print(a)
print(type(a))

In [None]:
a = int(a)
print(a)
print(type(a))

In [None]:
a = "3.4"
print(a)
print(type(a))

In [None]:
a = float(a)
print(a)
print(type(a))
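Once converted, the values can be used in mathematical operations. The sketch below uses hypothetical variables `price` and `quantity`. It also shows that `int()` fails on a string that does not look like a whole number. The `try`/`except` is there only so that the cell keeps running after the error.

```python
price = "3"
quantity = "4"

# Convert both strings to integers before doing arithmetic.
total = int(price) * int(quantity)
print(total)  # 12

# A string that does not look like a whole number cannot be converted with int().
try:
    int("3.4")
except ValueError as error:
    print("Conversion failed:", error)
```

If the string may contain a decimal point, `float()` is the safer choice.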

## Five common operations on strings

In this section we will learn to manipulate strings. We will look into the following (very useful) operations:
1. Converting to upper/lower case
2. Removing white spaces (including linebreaks) at the beginning and end of the string
3. Replacing one part of the string by another
4. Concatenating strings
5. Splitting strings

If you want to know more about how to use them, you can read about these and other operations in this [Python reference](https://python-reference.readthedocs.io/en/latest/docs/str/).

### 1. Converting to upper or lower case

The `.lower()` and `.upper()` methods convert a string to lowercase or uppercase.

To use one of them, append it to the variable you want to convert:

In [None]:
a = "I am shouting."

print(a.upper())
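A common use of these methods is comparing two strings while ignoring their case. A minimal sketch, with two made-up variables:

```python
first_word = "Python"
second_word = "PYTHON"

print(first_word == second_word)                  # False: the case differs
print(first_word.lower() == second_word.lower())  # True: both become "python"
```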

### 2. Removing white spaces

The `strip()` method removes white spaces (spaces, tabs and linebreaks) at the beginning and end of the string. White spaces inside the string are left untouched.

In [None]:
text_to_trim = "   Too many whitespaces!   "

print(text_to_trim.strip())
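The related methods `lstrip()` and `rstrip()` trim only the left or only the right side. In the sketch below, the string `messy_line` is a made-up example, and `repr()` is used only to make the invisible characters visible when printing:

```python
messy_line = "\t  A line read from a file \n"

print(repr(messy_line.strip()))   # 'A line read from a file'
print(repr(messy_line.lstrip()))  # 'A line read from a file \n'
print(repr(messy_line.rstrip()))  # '\t  A line read from a file'
```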

### 3. Replacing one part of the string by another

The `replace()` method can be used to replace one part of the string by another. It's appended to the variable containing the string you want to change, and it requires at least two arguments: _arg1_ is the substring to be replaced, and _arg2_ is the substring that replaces it. For example:

In [None]:
main_string = "It was the best of times."

print(main_string.replace("best", "worst"))

Now, if we print the value of the variable `main_string`, what happens? Try it.

In [None]:
print(main_string)

But... why isn't "best" replaced by "worst" in `main_string`?

⚠️ **Warning:** In Python, a string is immutable, which means that you cannot change its content in place. Methods such as `replace()` return a new string and leave the original untouched. However, you can assign the changed content to a new variable or reassign the variable with the changed content, as shown below:

In [None]:
# Change the content of a string variable and assign it to a new variable:
main_string = "It was the best of times."
changed_string = main_string.replace("best", "worst")

# Print the new variable:
print(changed_string)

In [None]:
# Change the content of a string variable and reassign it to that variable:
main_string = "It was the best of times."
main_string = main_string.replace("best", "worst")

# Print the main_string variable:
print(main_string)
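By default, `replace()` changes every occurrence of the first argument. An optional third argument limits how many occurrences are replaced. A short sketch, using a made-up variable `opening`:

```python
opening = "it was the best of times, it was the worst of times"

# By default, every occurrence is replaced.
print(opening.replace("it was", "it is"))
# it is the best of times, it is the worst of times

# An optional third argument limits how many occurrences are replaced.
print(opening.replace("it was", "it is", 1))
# it is the best of times, it was the worst of times
```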

✏️ **Exercises:**

In [None]:
# Create a variable with the text "I AM NOT SHOUTING" and convert it to lower case:



In [None]:
# Using the methods we've learnt, transform the contents of `original_string` so
# that you obtain the text "IT WAS THE WORST OF TIMES."

original_string = "It was the best of times."

# Write your code here:



### 4. Concatenating strings

Concatenating is the process of joining or combining strings together.

For example, let's concatenate the strings in `title1` and `title2`:

In [None]:
title1 = "Alice's Adventures in Wonderland"
title2 = "Through the Looking Mirror"
print(title1 + title2)

⚠️ **Warning:** As you can see, Python won't add the white spaces for you!
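Separators are ordinary strings that we concatenate ourselves, and a number has to be converted with `str()` before it can be joined to a string. A minimal sketch, with illustrative variables `author` and `year`:

```python
author = "Charles Dickens"
year = 1859

# The separator ", " is an ordinary string that we add ourselves.
# The number must be converted with str() before it can be concatenated.
citation = author + ", " + str(year)
print(citation)  # Charles Dickens, 1859
```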

✏️ **Exercises:**

In [None]:
# Concatenate the two titles plus the name of the author in parentheses (which
# should also be stored in a variable), so that the result looks exactly like this:
# "Alice's Adventures in Wonderland and Through the Looking Mirror (Lewis Carroll)"
# 
# Write your code here:



### 5. Splitting strings

Splitting strings is a very common operation in Python.

Using the `.split()` method (see the [documentation](https://docs.python.org/3/library/stdtypes.html#str.split)), we can split a string into a list of chunks (substrings).

By default, `.split()` splits the string on white space, but you can specify another delimiter inside the parentheses, in quotation marks.

In [None]:
mirror_title = "Through the Looking Mirror"
print(mirror_title.split())

In [None]:
# Converting a string that looks like a list into an actual list in Python:
string_to_split = "poetry, drama, non-fiction, novel"
list_of_genres = string_to_split.split(", ") # The delimiter is a string and can have more than one character.
print(list_of_genres)
print(type(list_of_genres))
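Calling `.split()` with no argument is not quite the same as calling `.split(" ")`. The sketch below uses a made-up title with extra spaces to show the difference:

```python
spaced_title = "Through   the Looking   Mirror"

# With no argument, any run of white space counts as one separator.
print(spaced_title.split())
# ['Through', 'the', 'Looking', 'Mirror']

# With an explicit " " delimiter, every single space splits, leaving empty strings.
print(spaced_title.split(" "))
# ['Through', '', '', 'the', 'Looking', '', '', 'Mirror']
```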

### 6. Indexing and accessing characters

A string is a sequence of characters in a specific order, and can be indexed:

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7  | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|----|---|---|----|----|----|----|----|----|----|----|
| A | l | i | c | e | ' | s |  | A | d | v  | e  | n  | t  | u  | r  | e  | s  |


In [None]:
strAlice = "Alice's Adventures"

Individual characters in a string can be accessed by indicating their index in square brackets.

For example, this is how we can access the first character of the string:

In [None]:
print(strAlice[0])

We can access the last character of the string in the same way, but there is a cleverer way: a negative index counts from the end of the string, so `-1` always refers to the last character.

In [None]:
# Accessing the last character of the string:
print(strAlice[17])
print(strAlice[-1])
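An index that falls outside the string raises an `IndexError`. The sketch below uses a shorter, made-up variable `word`. The `try`/`except` is there only so that the cell keeps running after the error.

```python
word = "Alice"  # valid indices are 0 to 4 (or -5 to -1)

print(word[4])   # e
print(word[-5])  # A

try:
    print(word[5])
except IndexError as error:
    print("No such position:", error)
```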

✏️ **Exercises:**

In [None]:
# 1. Create a variable that contains your name.
# 2. Print the last-but-one character of this variable.
# 3. Print the length of this variable.
# 4. Can you think of a way of printing the last character of the string using the ```len()``` function?
# 
# Type your code here:



### 7. Slicing

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7  | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|----|---|---|----|----|----|----|----|----|----|----|
| A | l | i | c | e | ' | s |  | A | d | v  | e  | n  | t  | u  | r  | e  | s  |

Slicing allows us to create a substring from a string, by specifying the start and end indices in square brackets, separated by a colon, like this:


In [None]:
strAlice = "Alice's Adventures"
print(strAlice[2:12])

Notice that the start index is included in the resulting string, whereas the end index is not.

If the start index is left empty, it means the beginning of the string:

In [None]:
substring = strAlice[:5]
print(substring)

If the end index is left empty, it means the end of the string:

In [None]:
substring = strAlice[8:]
print(substring)
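Because the start index is included and the end index is not, the length of a slice is simply the end index minus the start index (as long as both fall inside the string). A short sketch, where `start`, `end` and `piece` are illustrative names:

```python
strAlice = "Alice's Adventures"

start = 2
end = 12
piece = strAlice[start:end]

print(piece)                       # ice's Adve
print(len(piece) == end - start)   # True

# Slicing at the same position from both sides gives back the whole string.
print(strAlice[:5] + strAlice[5:] == strAlice)  # True
```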

✏️ **Exercises:**

In [None]:
# Given a string that contains your full name, print only your last name by slicing the string.
# 
# Type your code here:



### 8. Finding substrings

The Python ```in``` operator is wonderful: it is a very easy way to check whether an element is inside another element.

Suppose we want to find out if a string contains a certain substring in it. This is how we use it:

In [None]:
substring = "age"

taleBeginning = """It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,"""

if substring in taleBeginning:
    print('Substring "' + substring + '" has been found.')
else:
    print("Substring '" + substring + "' has NOT been found.")
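Keep in mind that ```in``` is case-sensitive. A minimal sketch with a made-up variable `line`, showing how lowercasing the text first gives a case-insensitive check:

```python
line = "It was the Age of Wisdom"

print("age" in line)          # False: the text contains "Age", with a capital "A"
print("Age" in line)          # True
print("age" in line.lower())  # True: the text is lowercased before the check
```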

### 9. Does the string start or end with...

The ```.startswith()``` method checks whether a string starts with a certain string, specified inside the parentheses:

In [None]:
taleBeginning = """It was the best of times,"""

if taleBeginning.startswith("It was"):
    print("Yes, these are the first characters of the string!")
else:
    print("No, that's not how the string starts...")

Similarly, the ```.endswith()``` method checks whether a string ends in a certain way:

In [None]:
taleBeginning = """It was the best of times,"""
if taleBeginning.endswith("times"):
    print("Yes, these are the last characters of the string!")
else:
    print("No, that's not how the string ends...")
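Punctuation counts too: the string above ends with a comma, so the check for `"times"` fails. The sketch below compares a few endings on the same string:

```python
taleBeginning = """It was the best of times,"""

print(taleBeginning.endswith("times"))   # False: the last character is a comma
print(taleBeginning.endswith("times,"))  # True
print(taleBeginning.endswith(","))       # True
```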

✏️ **Exercise:**

Do you remember **functions**?

Create a function called `does_it_end_with` that takes two arguments: `string1` and `string2`. The function should return `True` if `string1` ends with `string2`. Otherwise, it should return `False`. Test that your function is correct by trying it with different inputs.

In [None]:
# Write your code here:

