<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/010_string-manipulation.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# String Manipulation
___
In this notebook, we will look at typical string manipulation in Python. Unfortunately, data does not always come in a nice and handy format such as numbers and dates. More often than not, you will be faced with a problem where some variable must be extracted in some way or another from an existing string. For instance, consider the following example:



In [None]:
import pandas as pd

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

In [None]:
towns = pd.read_csv(f"{DATA_PATH}/swiss_towns.csv")
towns.head(5)

`towns` is a data frame containing a list of several Swiss towns. The canton is not indicated in a separate column but rather as part of the town name, in parentheses. We would like to separate the information, such that given a string such as `"Aarberg (BE)"` we can extract two variables: `Aarberg` for the town, and `BE` for the canton.

To do this, we will use the fact that the strings follow a specific pattern, i.e., `"Town (CANTON)"`. If the pattern differed from town to town, the task would be much more daunting and we would have to either group observations by matching patterns or generally be very careful in our string manipulation.

#### 🙀 🤯 Regular expression
While most languages provide tools for string manipulation, there are also [regular expressions](https://en.wikipedia.org/wiki/Regular_expression), or regex for short. Regex is incredibly powerful: it lets you search a text for specific patterns very quickly. Furthermore, it is not Python-specific and can be used in almost all programming languages. Unfortunately, **regex is hard**. The syntax is totally different from typical programming languages and fairly unintuitive.

To keep it simple, we will work with *pythonic* string manipulation, i.e., we will only use Python's built-in string methods and not dabble with regex. You should, however, know that it exists, as you will surely encounter it at some point in your data science career. In most cases, you won't need regex, and in the few cases where you do need it, a simple search on Stack Overflow will generally get you going, so don't stress about it too much!

## Strings are lists!?
When doing string manipulation in Python, it's good to view strings as lists of characters. While this is not exactly the case, it's a useful approximation. In fact, you can access the elements in a string the same way you would access them in a list:

In [None]:
# Example string
example_string = "Town (CANTON)"

In [None]:
# Get only first 4 characters of the string
example_string[:4]

In [None]:
# Get only last 8 characters of the string
example_string[-8:]
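The caveat that strings are not exactly lists can be made concrete with a small sketch. The variable names other than `example_string` are made up for illustration. Indexing, slicing, `len`, and iteration behave as they do for lists, but a string cannot be modified in place.

```python
# Strings support indexing, slicing, len() and iteration, just like lists
example_string = "Town (CANTON)"
first_char = example_string[0]
length = len(example_string)
as_list = list(example_string[:4])

# Unlike lists, strings are immutable: item assignment raises a TypeError
try:
    example_string[0] = "D"
    assignment_worked = True
except TypeError:
    assignment_worked = False

# To "change" a string, build a new one instead
renamed = "D" + example_string[1:]

print(first_char, length, as_list, assignment_worked, renamed)
```

This is why every string method discussed in this notebook returns a new string rather than altering the original.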

#### ➡️ ✏️ Task 1
Using indexing as above, extract the word `"CANTON"`, i.e., everything from the seventh-to-last character up to (but excluding) the last one.

In [None]:
# Enter code below
example_string[-7:-1]

## String methods in Python

While subsetting strings using indices is quite useful and will be enough in many cases, it doesn't always cut it. We could easily extract the canton codes in the `towns` data frame using this method because all cantons have a two-letter abbreviation. Suppose, however, that we have a list of stocks with their ticker in parentheses, e.g.,

```python
stocks = ["Nokia (NOK)", "Tesla (TSLA)", "BlackBerry (BB)"]
```

In such a case, we can't simply index every string using the same logic as above. Luckily for us, Python provides some useful string manipulation methods! Let's go over some of the most useful ones.
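To see why a fixed slice breaks down here, the following sketch applies a slice that assumes three-letter tickers to the whole list. The slice `[-4:-1]` and the name `fixed_slice` are chosen purely for illustration.

```python
stocks = ["Nokia (NOK)", "Tesla (TSLA)", "BlackBerry (BB)"]

# A fixed slice that only works when the ticker has exactly three letters
fixed_slice = [s[-4:-1] for s in stocks]

print(fixed_slice)
```

Only the first ticker comes out right. The four-letter ticker loses its first letter, and the two-letter ticker picks up the opening parenthesis.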

### Upper and lower case
This is mostly useful when dealing with data that has been entered manually by somebody (because human mistakes happen a lot!) or when working on natural language processing tasks. Python provides simple methods to transform strings or parts thereof to upper and lower case. Observe the following examples:

In [None]:
my_text = "According to all known laws of aviation, there is no way a bee should be able to fly. Its wings are too small to get its fat little body off the ground. The bee, of course, flies anyway because bees don't care what humans think is impossible."
my_text

In [None]:
# Convert the string to upper case
my_text.upper()

In [None]:
# Convert the string to lower case
my_text.lower()

In [None]:
# Everything lower case except the very first letter which is upper case
my_text.capitalize()

In [None]:
# Change upper case to lower case and vice-versa
my_text.swapcase()

In [None]:
# The first letter of every word is upper case, the rest is lower case
my_text.title()
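As a rough illustration of the manual-entry use case mentioned above, the sketch below cleans a hypothetical list of canton codes typed with inconsistent casing. The data and variable names are invented for this example.

```python
# Hypothetical manually entered canton codes with inconsistent casing
raw_entries = ["be", "BE", "Be", "zh", "ZH"]

# Without normalisation, every spelling counts as a different value
distinct_raw = sorted(set(raw_entries))

# Converting everything to upper case collapses the variants
distinct_clean = sorted({entry.upper() for entry in raw_entries})

print(distinct_raw)
print(distinct_clean)
```

Normalising the case before comparing or grouping values is usually enough to merge such variants.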

### Split, strip, and replace
While changing letter case is useful, it is probably not the kind of string manipulation you will be using most often. Splitting and stripping, however, are without doubt the two string methods you will be using most in Python, so pay close attention!

**Splitting** a string means separating it into a list of substrings at every occurrence of a specified character. For instance, we could split the above text into sentences, using a split on the dot character (`.`) or we could split the above text into words, using a split on the whitespace character (` `).

In [None]:
# Split the text into a list of sentences
sentence_list = my_text.split(".")
# Display the result
sentence_list

In [None]:
# Split the text into a list of words
word_list = my_text.split(" ")
# Display the result
word_list
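A side note on the separator argument: the sketch below, which uses a made-up string `messy`, contrasts splitting on a single space with calling `.split()` without any argument, and shows the optional second argument that limits the number of splits.

```python
# Hypothetical text with a double space and a newline in it
messy = "no  way a bee\nshould fly"

# Splitting on a single space keeps empty strings and ignores the newline
on_space = messy.split(" ")

# Without an argument, split treats any run of whitespace as one separator
on_whitespace = messy.split()

# The optional second argument caps the number of splits
first_split_only = "a.b.c".split(".", 1)

print(on_space)
print(on_whitespace)
print(first_split_only)
```

When the goal is simply to get words out of a text, calling `.split()` without an argument tends to be the more forgiving choice.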

Did you notice how the character we split on actually disappears? Furthermore, did you notice that the lists are not very clean? In the list of sentences, some sentences start with a whitespace (and the final dot leaves an empty string at the end of the list), and in the list of words, the dots and commas are still attached to some words (e.g. `"aviation,"`, `"fly."`, `"impossible."`, etc.). Can we do a bit better?

**Stripping** a string means dropping unwanted characters from its beginning and end. In general, we do this when the data is not especially clean, such as in the lists of sentences and words above. Observe the following.


In [None]:
# Remove leading and trailing whitespaces in each sentence
# (We can also use .lstrip for leading whitespaces and .rstrip for trailing whitespaces)
[s.strip() for s in sentence_list]

We can also pass an argument to the `.strip` method to strip characters other than whitespace, for instance, dots and commas!

In [None]:
# Remove dots and commas from each word
[w.strip(".,") for w in word_list]
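Two details of `.strip` are easy to miss, and the short sketch below tries to make them visible on a few made-up strings: the argument is treated as a set of characters rather than as a substring, and only the two ends of the string are affected.

```python
# The argument is a set of characters, not a substring: order does not matter
canton = "(BE)".strip("()")
same_result = "(BE)".strip(")(")

# Only the two ends are affected; characters in the middle stay
inner_kept = "..a.b..".strip(".")

# lstrip and rstrip work the same way, but on one side only
left_only = "(BE)".lstrip("(")
right_only = "(BE)".rstrip(")")

print(canton, same_result, inner_kept, left_only, right_only)
```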

Finally, sometimes we might want to replace a character or word occurrence instead of stripping it completely. For instance, we might want to replace dots with commas or vice versa, but we could also replace a specific word with another. Consider the following example.

In [None]:
# Replace the word bee with the word bumblebee in the full text
my_text.replace("bee", "bumblebee")
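Note that `.replace` works on substrings, not on whole words, and that it is case-sensitive. The sketch below uses a shortened, hypothetical `sentence` to show both points, together with the optional third argument that limits the number of replacements.

```python
sentence = "The bee, of course, flies anyway because bees don't care."

# Every occurrence of the substring is replaced, including the one in "bees"
replaced_all = sentence.replace("bee", "bumblebee")

# The optional third argument limits how many occurrences are replaced
replaced_first = sentence.replace("bee", "bumblebee", 1)

# The match is case-sensitive, so "Bee" is not found and nothing changes
unchanged = sentence.replace("Bee", "Bumblebee")

print(replaced_all)
print(replaced_first)
print(unchanged == sentence)
```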

In [None]:
# Check whether a substring occurs in the text (case-sensitive)
"impossible" in my_text

## Escape characters
Lastly, it is also important to understand what [escape characters](https://en.wikipedia.org/wiki/Escape_character) are, as you will surely encounter them in the wild. The *backslash* character `\` in a string is a so-called escape character, and it can be used to represent special characters that you cannot simply type into your string.

For instance, what if we wanted to format our little text and add a new line after each sentence? In such a case we can use `\n`, which produces a new line.

In [None]:
# A string displayed on two lines
my_string = "First line\nSecond line"
print(my_string)

The `\n` was directly interpreted as a new line when displaying the string. When you obtain a text, e.g., through web-scraping, you will often have many newlines and other various special characters in your text, so it's important to know about their existence, even though we will not go over them in detail in this class.

In [None]:
# Print the Bee Movie introduction on separate lines
print(my_text.replace(". ", ".\n"))
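Beyond `\n`, a few other escape sequences come up regularly. The sketch below is a brief, non-exhaustive illustration with invented example strings. It also shows raw strings (prefix `r`), which switch escape processing off.

```python
tabbed = "Town\tCanton"        # \t is a tab character
quoted = "She said \"hi\""     # \" is a literal double quote
backslash = "C:\\data"         # \\ is a single backslash

# len counts the resulting characters, not the characters typed in the source
lengths = (len(tabbed), len(quoted), len(backslash))

# In a raw string, the backslash and the n stay two ordinary characters
raw = r"First line\nSecond line"

print(tabbed)
print(repr(tabbed))  # repr shows the escape sequence instead of rendering it
print(raw)
print(lengths)
```

Displaying a string with `repr` is a handy way to spot hidden tabs and newlines in scraped text.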

#### ➡️ ✏️ Task 2
Modify the following function so that it extracts the town and the canton separately from a given string of the form `"Town (CANTON)"`, then run the cell below to check whether your code works.

In [None]:
def extract_town_canton(input_string):
    # Use .split to separate the town and canton
    town, canton = input_string.split() # ➡️ ✏️ Split on the right character!
    # ➡️ ✏️ Add a .strip method to get rid of parentheses on the canton
    
    return town, canton # Output results
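For reference, one possible way to complete the task is sketched below. It is not necessarily the intended solution. The function name `extract_town_canton_sketch` is hypothetical, and the code assumes that every input contains exactly one `" ("` sequence.

```python
def extract_town_canton_sketch(input_string):
    # Split on the space plus opening parenthesis, so that town names
    # containing spaces stay in one piece
    town, canton = input_string.split(" (")
    # Remove the remaining closing parenthesis from the canton
    canton = canton.strip("()")
    return town, canton


print(extract_town_canton_sketch("Aarberg (BE)"))
print(extract_town_canton_sketch("La Chaux-de-Fonds (NE)"))
```

Splitting on whitespace alone would fail for the second call, because the town name itself contains a space and the unpacking into two variables would raise an error.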