# What are strings and why should we care about them?

## Strings are everywhere

- Email addresses (e.g., __`gciampag@umd.edu`__)
- Webpage URLs (e.g., __`http://www.umd.edu/`__)
- Names
- Documents, words
- Sales records
- Etc.

We need to learn to work with strings because a lot of the data we want to work with lives in the world as strings, often mixed with other kinds of data.

<br/>
<br/>
<br/>
<br/>

---

Strings are the ultimate &ldquo;lingua franca&rdquo; between systems:

- Data is often converted to strings (e.g., in JSON, or **J**ava**S**cript **O**bject **N**otation)
- We assume strings are coming in, and we parse them appropriately.
  + This can include data (numbers / records), as we see in one of the Projects for this module!
- This also includes the &ldquo;human system&rdquo; (i.e., the user)!
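
As a minimal sketch of this round trip, the cell below uses Python's standard `json` module to turn a record into a string and to parse it back. The `record` dictionary and its values are made up for illustration.

```python
import json

# A hypothetical record (made-up values, for illustration only)
record = {"name": "Sarah", "course": "INST126", "credits": 3}

# Convert the record to a JSON string, e.g. to send it to another system
as_text = json.dumps(record)
print(type(as_text), as_text)

# Parse the string back into Python data
parsed = json.loads(as_text)
print(parsed["credits"] + 1)  # the number is an int again, so we can do math
```

Between the two steps the data exists only as a string, which is why parsing strings correctly matters.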

<br/>
<br/>
<br/>
<br/>

---

## Strings are sequences of characters

But what *is* a string? It's fundamentally a sequence of characters.

And that's exactly what a string is _in Python too_: it's a sequence of characters, much like (though not exactly like) a `list`.

In [None]:
# Strings are iterable sequences of characters
s = "banana"
for char in s:
    print(char)

In [None]:
# Strings are indexable sequences of characters
for i in range(len(s)):
    print(i, s[i])

In [None]:
# Strings can contain multiple words!
sentence = "I scream, you scream, we all scream for ice cream"
for char in sentence:
    print(char)

In [None]:
# Strings include special characters and can be empty
a = ""
b = "  \t\n"

print("a is ", a)
print("b is ", b)

print("a == b ?", a == b)

print("Length of a:", len(a))
print("Length of b:", len(b))

<br/>
<br/>
<br/>
<br/>

---

### Characters don't have to be visible / or be letters!

- Notice that even the "blank space" is a character!

- Notice that different strings may print the same
   + A string that includes an empty space character is **NOT** the same as an empty string 

- This distinction is very important to remember as you work with real world data.

In [None]:
a = ""  # a blank / empty string
b = " "  # a non-empty string with one blank space *character*

print("Printing out the value of a")
print(a)
print("Printing out the value of b")
print(b)

print("Length of a:", len(a), "Length of b:", len(b))
print("a == b?", a == b)

In [None]:
# Same name, different strings!
a = "James"
b = " James"
c = "James "

print("a == b?", a == b)
print("b == c?", b == c)
print("a == c?", a == c)

<br/>
<br/>
<br/>
<br/>

---

Other kinds of characters that don't look like &ldquo;letters&rdquo;: tabs and newlines. These are written as _escape sequences_, since the literal symbols typed in the code do not match what is printed.

In [None]:
# tab is \t
s = "a\ttab😂"
print(s)

In [None]:
for i in range(len(s)):
    print(i, s[i])

In [None]:
# new line is \n
s = "a\ntab"
print(s)

In [None]:
for i in range(len(s)):
    print(i, s[i])

The full list of escape sequences recognized by Python is here: https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences
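
One way to see that an escape sequence is a single character is to compare `len()` with what `repr()` shows. This is a small sketch using a made-up string; `repr()` is a built-in function that returns the string the way you would type it in code.

```python
s = "a\ttab"

# Two symbols are typed (\ and t), but the tab is stored as ONE character
length = len(s)
print(length)

# repr() shows the escape sequence instead of the invisible whitespace
shown = repr(s)
print(shown)
```

Printing with `repr()` is a handy trick when a string from real-world data "looks" right but contains hidden tabs or newlines.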

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

## The ASCII character set
<img src="https://terpconnect.umd.edu/~gciampag/INST126/images/ascii-table-1.1.svg" width="50%" style="margin: auto;"/>
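
As a rough illustration of the table, Python's built-in `ord()` gives the numeric code of a character and `chr()` goes the other way. The characters below are chosen only as examples.

```python
# From character to numeric code
code_upper_a = ord("A")
code_lower_a = ord("a")
code_space = ord(" ")
print(code_upper_a, code_lower_a, code_space)

# From numeric code back to character
letter = chr(66)
print(letter)

# Comparisons between strings use these codes, so upper case comes before lower case
print("A" < "a")
```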

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

## Strings are immutable sequences

- In Python both lists and strings are _sequences_

- Because of this most of the properties and functions that apply to lists also apply to strings 
  + They can be indexed, 
  + They have length,
  + We can check if something is &ldquo;in&rdquo; it, etc.; 

- With one important exception: **strings are immutable** 
  + You can never change a string directly;
  + You can only create a *new*, changed string; 

A consequence of this is that if you want to preserve a change, _you must (re)assign the result to a variable_.

More on this when we talk about working with strings.

In [None]:
# Strings are immutable
s = "banana"  # a string with 6 characters
s[0] = 1  # raises TypeError: 'str' object does not support item assignment
print(s)

In [None]:
# Lists instead are mutable
l = ['b', 'a', 'n', 'a', 'n', 'a']  # a list with 6 individual characters (length-1 strings)
l[0] = 1
print(l)

In [None]:
# Strings have a length
a_string = "hello world Hi"
print(len(a_string))

In [None]:
# Strings can be indexed
a_string[3]

In [None]:
# Strings can be tested for membership of a substring
course_code = "INST126"
"2" in course_code

In [None]:
sentence = "I scream, you scream, we all scream for ice cream"
"scream" in sentence

<br/>
<br/>
<br/>
<br/>

---

- One notable difference with lists is that strings are not directly sortable
- The `sorted()` function will take a string as input (since it is a sequence)
  + But it will return it &ldquo;exploded&rdquo; as a list
- There is no `.sort()` method for strings
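
If you do want a sorted *string* back, one possible approach is to glue the sorted list together again with the `.join()` string method (not covered above; shown here only as a sketch).

```python
a_string = "banana"

# sorted() returns a list of single characters
letters = sorted(a_string)
print(letters)

# "".join(...) concatenates the characters, with nothing in between
sorted_string = "".join(letters)
print(sorted_string)

# The original string is unchanged, because strings are immutable
print(a_string)
```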

In [None]:
# The sorted() function "explodes" a string into a list
a_string = "Hello world, hi!"
print(sorted(a_string))

In [None]:
# There is no .sort() method for strings
a_string.sort()  # raises AttributeError

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

## Advanced Indexing: Slicing a string

In addition to indexing individual characters, Python also allows you to index substrings. This is done with a special type of index called a _slice_.

In [None]:
first10letters = "abcdefghij"
first10letters[3:8]

The index before the `:` indicates where you want to *start*, and the index after the `:` indicates where you want to stop *before*. 

So `[3:8]` will go from the char at index `3` to index `7` (before index `8`). 

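- In a slice index, both the _start_ and _stop_ parts are optional:
  + `:stop` implicitly means `0:stop`
  + `start:` implicitly means `start:len(s)`, i.e. up to and including the last character (not `start:-1`, which stops *before* the last character)
  + So `:` means `0:len(s)`, the whole string!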
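
A quick way to check what an omitted _start_ or _stop_ stands for is to compare slices directly. The sketch below reuses the `first10letters` string from above.

```python
first10letters = "abcdefghij"
n = len(first10letters)

# Omitting the stop goes all the way to the end of the string
to_end = first10letters[3:]
print(to_end, to_end == first10letters[3:n])

# A stop of -1 means "stop before the last character"
before_last = first10letters[3:-1]
print(before_last)

# Negative indexes also work as a start: the last three characters
last_three = first10letters[-3:]
print(last_three)
```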

In [None]:
# equivalent to 0:8
first10letters[:8]

In [None]:
# equivalent to 3:len(first10letters)
first10letters[3:]

In [None]:
# equivalent to 0:len(first10letters)
first10letters[:]

In [None]:
# equivalent to 0:-3
first10letters[:-3]

<br/>
<br/>
<br/>
<br/>

---

Slicing is super useful for _truncation_, or for parsing strings according to a pattern 

__Ex.__ Given a class code like `INST126`, extract the subject area and the level number.

(The first four characters of a course code are always the subject area and the last three the number.)

In [None]:
code = "INST126"

area = code[0:4]
number = code[4:]

print("Area =", area)
print("Number =", number)

<br/>
<br/>
<br/>
<br/>

---

### Practice: string indexing

Using the `code` string below, with a partner answer the following questions. Refer to this table for the indexes:

|||||||||
|-:|---|---|---|---|---|---|---|
| Characters | I | N | S | T | 1 | 2 | 6 |
| INDEX | 0 | 1 | 2 | 3 | 4 | 5 | 6 |

In [None]:
code = "INST126"

<br/>
<br/>
<br/>
<br/>

---

__Q1.__ How would you get the level of the course?

(The level of a course is the first number after the four-letter code.)

In [None]:
# TASK: Get the level of the course (first number after the four-letter code)
...

<br/>
<br/>
<br/>
<br/>

---

__Q2.__ How do you get the last three characters of the code?

In [None]:
# TASK: get last three characters of the code
...

<br/>
<br/>
<br/>
<br/>

---

__Q3.__ Given the code for a 1XX-level course, print the corresponding 3XX-level course.

In [None]:
# TASK: given the code of a 1XX-level course, print the corresponding 3XX-level course
code = "CHEM131"
...

<br/>
<br/>
<br/>
<br/>

---

__Q4.__ How would you get the first three letters of each name?

In [None]:
names = ["Joel", "Sarah", "John", "Michael", "Patrick", "Kacie"]
...

<br/>
<br/>
<br/>
<br/>

---

### Practice: Check if character(s) is in string (membership test)

We have already used the `in` operator to check if some character is in a string.

This is also called _membership test_.

__Ex.__ Given the words below, print only the words that are valid email addresses.

In [None]:
# print only *valid* email addresses
words = ["Hi", "good", "morning", "INST126", "our", "emails", "are", "gciampag@umd.edu", "@umd.edu", "yogi@umd"]

...

<br/>
<br/>
<br/>
<br/>

---

The `in` operator also accepts substrings.

__Ex.__ Given the list of courses below, print only the Chemistry courses.

In [None]:
# print only chemistry courses
courses = ["INST126", "CHEM131", "INST326", "CHEM331"]

...

<br/>
<br/>
<br/>
<br/>

---

However, remember that Python strings are case-sensitive, and so is the membership test:

In [None]:
message = "Hello, my name is Inigo Montoya"

# Let's check if the message mentions my name!
print("inigo" in message)

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

## Working with strings: advanced

Similar to lists, there is a collection of built-in **[methods](https://docs.python.org/3/library/stdtypes.html#string-methods)** available to strings. 

For example, the case-sensitivity problem above can be solved by using the `.lower()` method, which returns a copy of the string in lower case.

In [None]:
message = "Hello, my name is Inigo Montoya"

# Let's check if the message mentions my name!
print("inigo" in message.lower())

<br/>
<br/>
<br/>
<br/>

---

I'm not going to show you all of them, but I will talk through them and discuss some fairly common ones.

There is no need to memorize them – just know:
- There are __many methods__ that allow you to do things with strings;
- If you want to do something, __first search__ the documentation! 
- Often way __more efficient__ than trial and error;

In [None]:
help(str.lower)

<br/>
<br/>
<br/>
<br/>

---

Knowing how / when to read the documentation of built-in functions is a soft skill of good programmers.

It gives you a sense of how to use code that other people have written that you can reuse. 

Some questions to ask when reading the docs:
- What are the parameters? 
- What are the return values? 
- What can you learn from examples? 
- How do you learn how to use it appropriately in your own code?

In [None]:
# The full list of string builtin methods: 
help(str)

<br/>
<br/>
<br/>
<br/>

---

### Checking a string's contents

Multiple criteria can be checked:
* __format (numeric/alphabetic)__: `isalnum()`, `isascii()`, `isalpha()`, `isdigit()`, `isdecimal()`, `isnumeric()`
* __special chars__: `isprintable()`, `isspace()`
* __case__: `islower()`, `isupper()`, `istitle()`
* __syntax__: `isidentifier()` &ndash; is the string a valid Python identifier?
* __prefix/suffix__: `startswith()`, `endswith()`
* etc.
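
To get a feel for a few of these checks, the sketch below runs them on a handful of made-up sample strings. Each method returns `True` or `False`.

```python
samples = ["INST126", "126", "hello", "   ", "Hello World"]

for text in samples:
    # repr() makes the blank-only string visible in the output
    print(repr(text),
          "alpha:", text.isalpha(),
          "digit:", text.isdigit(),
          "alnum:", text.isalnum(),
          "space:", text.isspace(),
          "title:", text.istitle())

# Prefix / suffix checks
is_inst_course = "INST126".startswith("INST")
is_edu_address = "gciampag@umd.edu".endswith(".edu")
print(is_inst_course, is_edu_address)
```

Note that `"INST126"` is alphanumeric but neither purely alphabetic nor purely digits, which is why picking the right check matters.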

In [None]:
# Simple calculator
a = input("Enter first operand: ")
b = input("Enter second operand: ")

# but first we want to make sure the strings are all numbers before we convert them
if a.isdigit() and b.isdigit():
    a = int(a)
    b = int(b)
    print("a times b =", a * b)
else:
    if not a.isdigit():
        print("Error: not a digit:", a)
    if not b.isdigit():
        print("Error: not a digit:", b)

<br/>
<br/>
<br/>
<br/>

---

### Changing a string

#### "Cleaning" / normalizing a string

Often we get data in string form, and we need to make sure it conforms to our expectations.

__Ex.__ I need to turn a sales record into a number so I can do math with it:

In [None]:
sales_record = "$1,000,000"

# with iteration
cleaned = "" # initialize clean string as a blank/empty string

# for each character in the sales record string
for char in sales_record:
    if char.isnumeric(): # if the character is numeric
        cleaned += char # grab it
print(cleaned)

<br/>
<br/>
<br/>
<br/>

---

You can use `.replace()` if you know in advance which characters you want to strip out

In [None]:
sales_record = "$1,000,000"
dollars = "$"
commas = ","
blank = ""
cleaned = sales_record.replace(dollars, blank).replace(commas, blank)
print(cleaned)
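
The cleaned value is still a string. To actually do math with it, one more step is needed: converting it with `int()`. A minimal sketch, reusing the same sales record:

```python
sales_record = "$1,000,000"

# Strip out the characters that are not part of the number
cleaned = sales_record.replace("$", "").replace(",", "")

# Convert the cleaned string to an integer
amount = int(cleaned)
print(amount + 1)

# Without the conversion, + would mean string concatenation instead
glued = cleaned + "1"
print(glued)
```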

<br/>
<br/>
<br/>
<br/>

---

__Ex.__ I need to convert user input into a “canonical” format for easier comparison.

In [None]:
def normalize_string(s):
    """ Convert the string s to upper case and remove leading and trailing blank spaces"""
    return s.upper().strip()

canonical_name = "Josh Lyman".upper()
    
# need to make sure it's normalized and we remove all weird stuff
weird_inputs = [" Josh Lyman", "JOSH LYMAN", "jOSH lYMAN", "josh lyman "]

print("Canonical: ", canonical_name)
print()
for name in weird_inputs:
    print(name)
    print("Equal to canonical?", name == canonical_name)
    norm_name = normalize_string(name)
    print("Normalized:", norm_name)
    print("Equal to canonical?", norm_name == canonical_name)
    print()

<br/>
<br/>
<br/>
<br/>

---

#### "Parsing" a string (getting specific bits we want)

You can do this if you know there is some *separator* that you can rely on to divide the string into the "bits" you want.

Examples:
- Parse an email
- Parse a URL
- Parse a sentence into words!
- Parse a time stamp

These all use the `.split()` method.
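
As a rough sketch of two of these cases, the cell below splits a sentence into words and a URL into its parts. Calling `.split()` with no argument splits on whitespace; the `"://"` separator is simply the one that fits this particular URL.

```python
# Parse a sentence into words: no argument means "split on whitespace"
sentence = "I scream, you scream, we all scream for ice cream"
words = sentence.split()
print(words)
print(len(words))

# Parse a URL into the part before and after "://"
url = "http://www.umd.edu/"
parts = url.split("://")
print(parts[0])
print(parts[1])
```

Notice that punctuation stays attached to the words (`"scream,"`), so real text usually needs some cleaning as well.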

In [None]:
row = "giovanni,INFOSCI,junior"
fields = row.split(",")
fields

<br/>
<br/>
<br/>
<br/>

---

### Practice: Splitting strings

__Q.__ parse an e-mail address into its username and domain name components.

E.g. `gciampag@umd.edu` &rarr; `gciampag` (username) and `umd.edu` (domain name)

In [None]:
# parse email address into username and domain
email = "gciampag@umd.edu"

...

<br/>
<br/>
<br/>
<br/>

---

__Ex.__ extract the top-level domain (e.g. `edu`) from an email address

In [None]:
# if we only want the TOP-LEVEL domain (.edu), we can do a multiple split
email = "gciampag@umd.edu"

elements = email.split("@")  # split the email by the @ separator
domain = elements[1] # grab the second item

elements = domain.split(".") # split that second item by the . separator
tldomain = elements[1] # get the second item from that one
print("Top-level domain:", tldomain)
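
Grabbing the second item works for `umd.edu`, but a domain can have more than two parts. A slightly more robust sketch uses the index `-1` to always take the *last* item; the address below is hypothetical.

```python
email = "someone@cs.umd.edu"  # hypothetical address with a longer domain

domain = email.split("@")[1]  # everything after the @
labels = domain.split(".")    # split the domain by the . separator
print(labels)

second_item = labels[1]   # not the top-level domain here!
last_item = labels[-1]    # always the last part of the domain
print(second_item, last_item)
```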

<br/>
<br/>
<br/>
<br/>

---

#### Multiple assignment

You can assign the results of a split to multiple variables at once.

In [None]:
# Parse a timestamp into three variables: hours, minutes, seconds
timestamp = "13:30:31"
hours, minutes, seconds = timestamp.split(":")
print(hours)

<br/>
<br/>
<br/>
<br/>

---

However, you need to be careful that the number of variables matches the number of split elements.

<div class="alert alert-warning">This cell will raise an error!</div>

In [None]:
firstname, lastname = "Giovanni Luca Ciampaglia".split(" ")
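
One possible way around this, sketched below, is to keep the result of the split in a list and pick items by index, or to limit the number of splits with the optional second argument of `.split()`.

```python
full_name = "Giovanni Luca Ciampaglia"

# Keep the list and index into it: first and last item
parts = full_name.split(" ")
firstname = parts[0]
lastname = parts[-1]
print(firstname, lastname)

# Or split at most once, so there are always exactly two elements
first, rest = full_name.split(" ", 1)
print(first)
print(rest)
```

Which option is appropriate depends on what the rest of the program expects to do with middle names.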

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

## REMEMBER: Strings are immutable

Unlike lists, since strings are immutable, string methods _return a new object_ (without modifying the original string).

This means if you don't assign the return value of the string method to a new variable, the change will be **lost**.

In [None]:
# Python strings are case-sensitive
a = "hello"
b = "Hello"
print("a == b?", a == b)

In [None]:
# String methods return new string objects!
a.lower()
b.lower()
print("a == b?", a == b)

In [None]:
a = a.lower()
b = b.lower()
print("a == b?", a == b)

<br/>
<br/>
<br/>
<br/>

---

In [None]:
message = "Hello, my name is Inigo Montoya"
print(message)
# let's hide my name in the message!
message = message.lower() # change to lower case
message = message.replace("inigo", "MYSTERY")
print(message)

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

---

## String formatting

So far we've taken strings as given, and we often specify a string directly. But frequently it is useful to compose a string programmatically, from variables.

Sometimes this is done for debugging (to read the state of your program at various steps), and often it is used to produce the outputs of your program, intermediate or final.

Here's an example

In [None]:
salutation = "Hello"
name = "Sarah"
output = salutation + ", " + name + "!"
print(output)

<br/>
<br/>
<br/>
<br/>

---

### Formatted strings (f-strings)

Python offers a special type of strings called **[f-strings](https://docs.python.org/3/tutorial/inputoutput.html#tut-f-strings)**.

A string literal that starts with `f` (like `f"hello!"`) is an f-string.

When Python evaluates an f-string (that is, at the moment the string is created, not when it is printed), it automatically replaces any variable surrounded by curly braces `{ ... }` with its value.

In [None]:
first_name = "Giovanni"  # the variable must exist before the f-string that uses it
sticker = f"Hello! My name is {first_name}"
print(sticker)

<br/>
<br/>
<br/>
<br/>

---

The `f` is important. Without it, Python will not apply the substitution:

In [None]:
sticker2 = "Hi {name}"
name = "Yogesh"
print(sticker2)

<br/>
<br/>
<br/>
<br/>

---

All variables must be defined when the f-string is created, or a `NameError` will be raised.

In [None]:
name = "Giovanni"
msg = f"Hi {name}, this is my friend {friend}"  # NameError: friend is not defined
print(msg)

<br/>
<br/>
<br/>
<br/>

---

f-strings are very useful when debugging. Adding the `=` character in the curly braces after the name of a variable will print its name along with its value.

In [None]:
sales = ["$100", "$250", "$500"]

for i in range(len(sales)):
    sale = sales[i]
    print(f"{i=}, {sale=}") # This is an example of a debugging/tracing statement

<br/>
<br/>
<br/>
<br/>

---

Last but not least, f-strings allow you to format numbers appropriately.

This code computes the total value of a check, but it does not print the result with the two decimal digits we expect for a dollar amount.

In [None]:
tip = 0.18
check = 25.00
total_value = check + check * tip
print(total_value)

<br/>
<br/>
<br/>
<br/>

---

An f-string allows you to specify the _format_ of what is being printed. In the example above, we want a decimal number rounded to two decimal digits.

In [None]:
tip = 0.18
check = 25.00
total_value = check + check * tip
print(f'Please charge my card for ${total_value:.2f}')

<br/>
<br/>
<br/>
<br/>

---

You can even specify operations inside an f-string:

In [None]:
birth_year = 1980
this_year = 2022
name = "Giovanni"
message = f"Happy birthday, {name}! You are {this_year - birth_year} this year!"
print(message)

<br/>
<br/>
<br/>
<br/>

---

### Controlling the way it looks
You can also control how the string looks, for example how many decimal places are printed out (very useful when doing math), or how wide or indented the string is.

In [None]:
# Without f-string ... too many decimal digits
x = 2
y = 3
result = x / y
print(x, "divided by", y, "is", result)

In [None]:
# With f-string specifying result should be printed rounded to two decimal digits
message = f"{x} divided by {y} is {result:.2f}" 
print(message)
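
Width and alignment are controlled in the same place as the number of decimals, after the `:` inside the curly braces. The sketch below shows a few common format specifications; the `item` and `price` values are made up, and the `|` is only there to make the padding visible.

```python
item = "coffee"
price = 3.5

left = f"{item:<10}|"      # left-aligned in a field 10 characters wide
right = f"{item:>10}|"     # right-aligned
centered = f"{item:^10}|"  # centered
amount = f"{price:8.2f}|"  # 8 characters wide, 2 decimal digits
big = f"{1000000:,}"       # comma as thousands separator

print(left)
print(right)
print(centered)
print(amount)
print(big)
```

This is handy for printing values in neat columns, such as a receipt or a simple table.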

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>


---

# Solutions

## Practice: string indexing

Using the `code` string below, with a partner answer the following questions. Refer to this table for the indexes:

|||||||||
|-:|---|---|---|---|---|---|---|
| Characters | I | N | S | T | 1 | 2 | 6 |
| INDEX | 0 | 1 | 2 | 3 | 4 | 5 | 6 |

In [None]:
code = "INST126"

__Q1.__ How would you get the level of the course?

(The level of a course is the first number after the four-letter code.)

In [None]:
# TASK: Get the level of the course (first number after the four-letter code)
### BEGIN SOLUTION
code[4]
### END SOLUTION

<br/>
<br/>
<br/>
<br/>

---

__Q2.__ How do you get the last three characters of the code?

In [None]:
# TASK: get last three characters of the code
### BEGIN SOLUTION
code[4:]
### END SOLUTION

<br/>
<br/>
<br/>
<br/>

---

__Q3.__ Given the code for a 1XX-level course, print the corresponding 3XX-level course.

In [None]:
# TASK: given the code of a 1XX-level course, print the corresponding 3XX-level course
code = "CHEM131"
### BEGIN SOLUTION
area = code[:4]
level = code[4:]
newcode = area + '3' + level[1:]
print(newcode)
### END SOLUTION

<br/>
<br/>
<br/>
<br/>

---

__Q4.__ How would you get the first three letters of each name?

In [None]:
names = ["Joel", "Sarah", "John", "Michael", "Patrick", "Kacie"]
### BEGIN SOLUTION
for name in names:
    # get the first three letters
    first_three = name[:3]
    print(first_three)
### END SOLUTION

<br/>
<br/>
<br/>
<br/>

---

## Practice: Check if character(s) is in string (membership test)

We have already used the `in` operator to check if some character is in a string.

This is also called _membership test_.

__Ex.__ Given the words below, print only the words that are valid email addresses.

In [None]:
# print only *valid* email addresses
words = ["Hi", "good", "morning", "INST126", "our", "emails", "are", "gciampag@umd.edu", "@umd.edu", "yogi@umd"]

### BEGIN SOLUTION
for item in words:
    if ("@" in item[1:]) and ("." in item):
        print(item)
### END SOLUTION

<br/>
<br/>
<br/>
<br/>

---

The `in` operator also accepts substrings.

__Ex.__ Given the list of courses below, print only the Chemistry courses.

In [None]:
# print only chemistry courses
courses = ["INST126", "CHEM131", "INST326", "CHEM331"]

### BEGIN SOLUTION
for course in courses:
    if "CHEM" in course:
        print(course)
### END SOLUTION

<br/>
<br/>
<br/>
<br/>

---

## Practice: Splitting strings

__Q.__ parse an e-mail address into its username and domain name components.

E.g. `gciampag@umd.edu` &rarr; `gciampag` (username) and `umd.edu` (domain name)

In [None]:
# parse email address into username and domain
email = "gciampag@umd.edu"

### BEGIN SOLUTION
elements = email.split("@")
username = elements[0]
print("Username:", username)
domain = elements[1]
print("Domain:", domain)
### END SOLUTION