# <center><b>Python for Data Science</b></center>
# <center><b>Lesson 08</b></center>
# <center><b>Strings</b></center>

<center><i>Adapted from:</i></center>
<center>*****************</center>

<center>How to Think Like a Computer Scientist: Interactive Edition</center>

<b>Resources:<br></b>

- [How to Think Like a Computer Scientist: Interactive Edition](https://runestone.academy/ns/books/published/thinkcspy/index.html)
***
- [Advanced String Slicing](https://drive.google.com/file/d/153mM7pEgLHzNdJtip_bxOottivtZGx20/view?usp=sharing)
- [Strings and Character Data in Python (Real Python)](https://realpython.com/python-strings/)
- [Python Strings -- Stanford Computer Science](https://cs.stanford.edu/people/nick/py/python-string.html)
- [Mastering Python Strings (Towards Data Science)](https://towardsdatascience.com/mastering-python-strings-3c933686962a)
- [Python Strings (tutorialspoint)](https://www.tutorialspoint.com/python/python_strings.htm)
- [Python Strings -- With Examples (Programiz)](https://www.programiz.com/python-programming/string)

##  <span style="color:green">TABLE OF CONTENTS</span>

1. [Strings Revisited](#1.1)<br>
2. [Strings as a Collection Data Type](#1.2)<br>
3. [Operations on Strings](#1.3)<br>
a. [Concatenation ... (The + Operator with Strings)](#1.4)<br>
b. [Repetition ... (The * Operator with Strings)](#1.5)
4. [Index Operator: Working with the Characters of a String](#1.6)<br>
5. [The Length Function Applied to a String](#1.7)<br>
6. [The Slice Operator](#1.8)
7. [String Comparison](#1.9)
8. [Strings Are Immutable](#1.10)
9. [The In and Not In Operators](#1.11)
10. [Character Classification](#1.12)
11. [String Methods](#1.13)
12. [Summary](#1.14)
13. [Glossary](#1.15)

In [15]:
# set up notebook to display multiple output in one cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

print("This notebook is now set up to display multiple output in one cell.")

This notebook is now set up to display multiple output in one cell.


<a class="anchor" id="1.1"></a>
<a class="anchor" id="T1"></a>
<div class="alert alert-block alert-info">
<b><font size="4">1. Strings Revisited</font></b></div>

- In the early lessons of this course, strings were used to represent words or phrases that we wanted to print out. 
- The definition of a string was simple: a string is simply some characters inside quotes. 
- In this unit strings will be explored in much more detail.

<a class="anchor" id="1.2"></a>
<a class="anchor" id="T1"></a>
<div class="alert alert-block alert-info">
<b><font size="4">2. Strings as a Collection Data Type</font></b></div>

- So far we have seen built-in data types like: int, float, bool, and str. 
- int, float, and bool are considered to be simple or primitive data types because their values are not composed of any smaller parts. They cannot be broken down. 
- On the other hand, strings and lists are different from the others because they are made up of smaller pieces. In the case of strings, they are made up of smaller strings each containing one character.
- Types that are comprised of smaller pieces are called **collection data types**. (See [<b>Collection Data Types in Python</b>](https://medium.com/analytics-vidhya/collection-data-types-in-python-3a3f9c0b554).)
- Depending on what we are doing, we may want to treat a collection data type as a single entity (the whole), or we may want to access its parts. This ambiguity is useful.
- Strings can be defined as <u>sequential</u> collections of characters. This means that the individual characters that make up the string are assumed to be in a particular order from left to right.
- A string that contains no characters, often referred to as the empty string, is still considered to be a string. It is simply a sequence of zero characters and is represented by '' or "" (two single or two double quotes with nothing in between, not even a space).
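
The idea of treating a string as a whole or as its parts can be sketched as follows. This is a minimal illustration, and the variable names `word` and `empty` are made up for the example.

```python
word = "Pandas"          # hypothetical example value

print(word)              # the string as a single entity
print(list(word))        # the same string broken into one-character strings, in order

empty = ""               # the empty string: nothing between the quotes
print(len(empty))        # 0
print(len("  "))         # 2, because spaces are characters too
print(type(empty))       # <class 'str'>
```

Note that a string holding only spaces is not empty; each space counts as a character.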

<a class="anchor" id="1.3"></a>
<a class="anchor" id="T1"></a>
<div class="alert alert-block alert-info">
<b><font size="4">3. Operations on Strings</font></b></div>

- In general, you cannot perform mathematical operations on strings, even if the strings look like numbers.

In [16]:
# Try to add a string and an integer

print(type('Python'))
print(type(10))

'Python' + 10   # 'Python' is a str and 10 is an int, so this fails

<class 'str'>
<class 'int'>


TypeError: can only concatenate str (not "int") to str

In [17]:
# Try to divide a string by an integer

"dog" / 16


TypeError: unsupported operand type(s) for /: 'str' and 'int'

In [89]:
# Try to multiply an unquoted name by a string (Python is not in quotes, so it is read as a variable name)

Python * "dog"

NameError: name 'Python' is not defined

In [18]:
# Adding two integers

12 + 8

20

In [19]:
# Run the code in this cell

"12" + 8   # Why doesn't this work?

TypeError: can only concatenate str (not "int") to str
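
One way to make the last example work is to convert one operand so that both have the same type. The sketch below uses the built-in `int()` and `str()` functions; the variable names `text` and `number` are made up for the example.

```python
text = "12"      # a string that looks like a number
number = 8       # an integer

print(int(text) + number)    # 20: both operands are integers, so + adds
print(text + str(number))    # 128: both operands are strings, so + joins them
```

Which conversion to use depends on whether you want arithmetic or a longer string.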

<a class="anchor" id="1.4"></a>
### <span style="color:green">a. Concatenation ... (The + Operator with Strings)</span>

- The + operator does work with strings, but for strings, the + operator represents **concatenation**, not addition. 
- Concatenation means joining the two operands by linking them end-to-end.

In [23]:
# Concatenation examples

school_1 = "Brookfield Central "   # notice the blank space at the end of the string
school_2 = "Brookfield East"
nickname_1 = "Lancers"
nickname_2 = " Spartans"           # notice the blank space at the beginning of the string

print(school_1 + nickname_1)
print(school_2 + nickname_2)

Brookfield Central Lancers
Brookfield East Spartans


- The space after the word Central and the space before the word Spartans are parts of their respective strings and are necessary to produce the space between the concatenated strings. 

In [24]:
# Concatenation example ... but this time without the spaces

school_3 = "Wauwatosa East"
nickname_3 = "Red Raiders"

print(school_3 + nickname_3)      # What do you notice?

Wauwatosa EastRed Raiders


<a class="anchor" id="1.5"></a>
### <span style="color:green">b. Repetition ... (The * Operator with Strings)</span>

- The * operator also works on strings.  
- It performs <b>repetition</b>.  
- For example, 'Pandas'*4 is 'PandasPandasPandasPandas'.  
- One of the operands has to be a string and the other has to be an integer.

In [25]:
# Repetition examples

print("defense" * 4)

name = "Bucks "
number = "six "

print((name + "in " + number + "! ") * 3)
print(name + "in " + number + "! " * 3)

defensedefensedefensedefense
Bucks in six ! Bucks in six ! Bucks in six ! 
Bucks in six ! ! ! 


- Note in the last example that the order of operations for * and + is the same as it was for arithmetic. 
- The repetition is done before the concatenation. 
- If you want the concatenation to be done first, you will need to use parentheses.

In [26]:
print(3 * "Blue")
print("3" * "Blue")     # What conclusion can you draw?

BlueBlueBlue


TypeError: can't multiply sequence by non-int of type 'str'

In [27]:
print("3" + "Blue")
print(3 + "Blue")      # What conclusion can you draw?

3Blue


TypeError: unsupported operand type(s) for +: 'int' and 'str'

<a class="anchor" id="1.6"></a>
<a class="anchor" id="T1"></a>
<div class="alert alert-block alert-info">
<b><font size="4">4. Index Operator: Working with the Characters of a String</font></b></div>

###### The indexing operator (Python uses square brackets to enclose the index) selects a single character from a string. 
<p>&nbsp;</p>

- The characters are accessed by their position or index value. 
- For example, in the string shown below, the 17 characters are indexed left to right from position 0 to position 16.

![UW.PNG](attachment:UW.PNG)
<p>&nbsp;</p>

- It is also the case that the positions are named from right to left using negative numbers where -1 is the rightmost index and so on. Note that the character at index 9 (or -8) is the blank character.

In [29]:
# Run the code in this cell

college_team = 'Wisconsin Badgers'
x = college_team[5]
y = college_team[9]
z = college_team[0]
p = college_team[4 * 3 - 12]

print(x)
print(y)
print(z)
print(p)

print()

print(f"{z} is the zero-eth letter of the string.")
print(f"{p} is the zero-eth letter of the string.")

n
 
W
W

W is the zero-eth letter of the string.
W is the zero-eth letter of the string.


- The expression **college_team[5]** selects the character at index 5 from college_team, and creates a new string containing just this one character. The variable x refers to the result.
- Computer scientists often start counting from zero. The letter at index zero of "Wisconsin Badgers" is W. 
- If you want the zero-eth letter of a string, you just put 0, or any expression with the value 0, in the brackets. 
- The expression in brackets is called an **index**. 
- An index specifies a member of an ordered collection, in this case the collection of characters in the string. The index indicates which character you want. It can be any integer expression so long as it evaluates to a valid index value.
- Note that indexing returns a string — Python has no special type for a single character. It is just a string of length 1.
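
The points above can be checked directly. This is a small sketch that reuses the `college_team` string; the names `first` and `middle` are made up for the example.

```python
college_team = 'Wisconsin Badgers'

first = college_team[0]
print(type(first))        # <class 'str'>: a single character is still a string
print(len(first))         # 1
print(first[0] == first)  # True: indexing a one-character string gives it back

# Any integer expression can serve as an index.
middle = college_team[len(college_team) // 2]   # 17 // 2 is 8
print(middle)             # n
```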

In [30]:
# Use some negative indexes

last_character = college_team[-1]
print(last_character)

character = college_team[-9]
print(character)

s
n


<a class="anchor" id="1.7"></a>
<a class="anchor" id="T1"></a>
<div class="alert alert-block alert-info">
<b><font size="4">5. The Length Function Applied to a String</font></b></div>

- The **len** function, when applied to a string, returns the number of characters in a string.

In [102]:
# Example of len() function applied to a string

school = "Marquette University"
print(len(school))


20


In [32]:
# common mistake when finding the last character of a string 

school = "Marquette University"
n = len(school)
last_character = school[n]     # this will cause an error!
print(last_character)

IndexError: string index out of range

- The code in the cell above causes the runtime error IndexError: string index out of range. 
- The reason is that there is no letter at index position 20 in "Marquette University". 
- Since Python starts counting at zero, the 20 indexes are numbered 0 to 19. 
- To get the last character, we have to subtract 1 from the length (see below).

In [33]:
# correct way to find the last character of a string 

school = "Marquette University"
n = len(school)
last_character = school[n - 1]     # this will work!
print(last_character)

y


- You can also use negative indices, which count backward from the end of the string. 
- The expression school[-1] yields the last letter, school[-2] yields the second to last, and so on. 

In [34]:
# Using negative indices

school = "Marquette University"
last_character = school[-1]
print(last_character)

second_last_character = school[-2]
print(second_last_character)


y
t


<a class="anchor" id="1.8"></a>
<a class="anchor" id="T1"></a>
<div class="alert alert-block alert-info">
<b><font size="4">6. The Slice Operator</font></b></div>

- A substring of a string is called a slice. 
- Selecting a slice is similar to selecting a character ...
- The **slice operator** [n:m] returns the part of the string from the n’th character to the m’th character, including the first but excluding the last. 
- In other words, start with the character at index n and go up to but do not include the character at index m. 
- If you omit the first index (before the colon), the slice starts at the beginning of the string. 
- If you omit the second index, the slice goes to the end of the string.
- Slicing never raises an IndexError. A slice is forgiving: any out-of-range index is shifted to the nearest legal position.
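
Because the slice includes the start index but excludes the end index, two slices that share a boundary fit together with no overlap and no gap. A minimal sketch, where the split position `k` is an arbitrary value chosen for illustration:

```python
music = "Country Western"
k = 7                        # hypothetical split position

left = music[:k]             # characters at indexes 0 through k - 1
right = music[k:]            # characters from index k to the end

print(left + right == music) # True: nothing is lost or repeated
print(len(music[2:6]))       # 4, which is 6 - 2
```

The same convention explains why the length of `music[n:m]` is `m - n` whenever both indexes are in range and `n` is not greater than `m`.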

![slice-2.PNG](attachment:slice-2.PNG)

In [41]:
# Using the slice operator

music = "Country Western"
print(music[0:7])
print(music[7:15])

Country
 Western


In [39]:
# Using the slice operator

print(music[12:15])
print(music[12:300])

ern
ern


In [42]:
# Using the slice operator

print(music[-7:-3])
print(music[-3:-7])
print(music[4:-15])

West




In [43]:
# Using the slice operator

print(music[5:])
print(music[:11])

ry Western
Country Wes


<a class="anchor" id="1.9"></a>
<a class="anchor" id="T1"></a>
<div class="alert alert-block alert-info">
<b><font size="4">7. String Comparison</font></b></div>

- The comparison operators also work on strings. 
- To see if two strings are equal you simply write a boolean expression using the equality operator.
- Other comparison operations are useful for putting words in lexicographical order. This is similar to the alphabetical order you would find in a dictionary, except that **all the uppercase letters come before all the lowercase letters**.
- It is probably clear to you that the word apple would be less than (come before) the word banana. After all, a is before b in the alphabet. But what if we consider the words apple and Apple? Are they the same?
- It turns out, as you recall from our discussion of variable names, that uppercase and lowercase letters are considered to be different from one another. The way the computer knows they are different is that each character is assigned a unique integer value. “A” is 65, “B” is 66, and “5” is 53. You can find this so-called ordinal value for a given character with the built-in function ord.
- When you compare characters or strings to one another, Python converts the characters into their equivalent ordinal values and compares the integers from left to right. Since the ordinal value of “a” is 97 and the ordinal value of “A” is 65, “a” is greater than “A”, so “apple” is greater than “Apple”.
- Humans commonly ignore capitalization when comparing two words. However, computers do not. A common way to address this issue is to convert strings to a standard format, such as all lowercase, before performing the comparison.
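
The standard-format idea in the last bullet can be sketched with the string method `lower()`, which returns a lowercase copy of a string. The example words are made up for illustration.

```python
word_1 = "apple"
word_2 = "Apple"

print(word_1 == word_2)                      # False: 'a' and 'A' are different characters
print(word_1.lower() == word_2.lower())      # True: both become "apple"

print("banana" < "Cherry")                   # False: 'b' (98) is greater than 'C' (67)
print("banana".lower() < "Cherry".lower())   # True: 'b' comes before 'c'
```

Converting both strings before comparing leaves the original strings unchanged.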

In [61]:
# String comparison example

print("apple" < "banana", "\n")
print("The ordinal value of a = ", ord('a'), ".") 
print("The ordinal value of b = ", ord('b'), ".")

True 

The ordinal value of a =  97 .
The ordinal value of b =  98 .


In [62]:
# More string comparison examples 

print("apple" == "apple")
print("apple" == "Apple")
print("apple" < "Apple")

True
False
False


In [108]:
# More string comparison examples 

print(ord('c'))
print(ord('C'))
print('cat' > 'Cat')


99
67
True


<a class="anchor" id="1.10"></a>
<a class="anchor" id="T1"></a>
<div class="alert alert-block alert-info">
<b><font size="4">8. Strings Are Immutable</font></b></div>

- One thing that makes strings different from some other Python collection types is that you are not allowed to modify the individual characters in the collection. 
- It is tempting to use the [ ] operator on the left side of an assignment, with the intention of changing a character in a string. For example, in the following code, we would like to change the second letter of school_abbreviation.

In [109]:
# Example of strings being immutable

school_abbreviation = 'BEHS'
school_abbreviation[1] = 'C'    # using the [] operator to change a character in a string isn't allowed
print(school_abbreviation)


TypeError: 'str' object does not support item assignment

- Instead of producing the output BCHS, this code produces the runtime error TypeError: 'str' object does not support item assignment.
- Strings are **immutable**, which means you cannot change an existing string. The best you can do is create a new string that is a variation on the original.
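
One possible way to create such a variation is to combine indexing, slicing, and concatenation. This is a sketch rather than the only approach, and the name `new_abbreviation` is made up for the example.

```python
school_abbreviation = 'BEHS'

# Build a new string: the first character, then 'C', then everything from index 2 on.
new_abbreviation = school_abbreviation[0] + 'C' + school_abbreviation[2:]

print(new_abbreviation)      # BCHS
print(school_abbreviation)   # BEHS: the original string is unchanged
```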

<a class="anchor" id="1.11"></a>
<div class="alert alert-block alert-info">
<b><font size="4">9. The In and Not In Operators</font></b></div>

- The <b>in</b> operator tests if one string is a substring of another.

In [110]:
# Examples involving the in operator

print('currency' in 'concurrency')
print('conc' in 'concurrency')
print('red' in 'concurrency')
print('concurrency' in 'concurrency')
print('' in 'concurrency')
print('red' in 'Red')


True
True
False
True
True
False


- A string is a substring of itself, and the empty string is a substring of any other string. 

- The <b>not in</b> operator returns the logical opposite result of <b>in</b>.

In [3]:
# Examples involving the not in operator

print('gram' not in 'Instagram')
print('Python' not in 'Instagram') 

False
True


<a class="anchor" id="1.12"></a>
<div class="alert alert-block alert-info">
<b><font size="4">10. Character Classification</font></b></div>

- It is often helpful to examine a character and test whether it is upper- or lowercase, or whether it is a letter or a digit. 
- The string module provides several constants that are useful for these purposes. 
- One of these, string.digits is equivalent to “0123456789”. It can be used to check if a character is a digit using the in operator.
- The string string.ascii_lowercase contains the 26 lowercase ASCII letters. 
- Similarly, string.ascii_uppercase contains all of the uppercase letters. 
- string.punctuation comprises all the characters considered to be punctuation.
- For more information on string operations, refer to the [String Module Documentation](https://docs.python.org/3/library/string.html#module-string).
<p>&nbsp;</p>
- [ASCII](https://en.wikipedia.org/wiki/ASCII)
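
The classification checks described above can be sketched with the in operator. The test characters are made up for illustration, and the check is meant for a single character at a time.

```python
import string

ch = "7"                                # hypothetical character to classify

print(ch in string.digits)              # True
print(ch in string.ascii_lowercase)     # False
print("Q" in string.ascii_uppercase)    # True
print("!" in string.punctuation)        # True

# Caution: in tests for a substring, so longer strings can give surprising results.
print("12" in string.digits)            # True, because "12" appears inside "0123456789"
print("13" in string.digits)            # False, even though both characters are digits
```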

In [5]:
# String module constants

import string                     # we need to import the string module

print(string.ascii_lowercase)
print(string.ascii_uppercase)
print(string.digits)
print(string.punctuation)


abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


<a class="anchor" id="1.13"></a>
<div class="alert alert-block alert-info">
<b><font size="4">11. String Methods</font></b></div>

- This topic will be covered separately.

<a class="anchor" id="1.14"></a>
<div class="alert alert-block alert-info">
<b><font size="4">12. Summary</font></b></div>

[Strings: Summary of Key Concepts](https://docs.google.com/document/d/1A2kcmaNS1UcJcoz2iTM63zSEw0RQ634RXlzKjI1GveM/edit?usp=sharing)

<a class="anchor" id="1.15"></a>
<div class="alert alert-block alert-info">
<b><font size="4">13. Glossary</font></b></div>

[Strings: Glossary](https://docs.google.com/document/d/1LIGgBTYSk8JPd8tUmJrh1-owuw844d5m3SxyF6YKtpQ/edit?usp=sharing)