<img src="https://github.com/christopherhuntley/BUAN5405-docs/blob/master/Slides/img/Dolan.png?raw=true" width="180px" align="right">

# Lesson 6: Strings
_The first sequential data type_

# Learning Objectives

## Theory / Be able to explain ...
- The Python Type Hierarchy
- The three kinds of string literals
- The concept of immutability
- Basics of sequential data types
- Indexing, slicing, and traversal
- String objects and methods
- Iterators and generators

## Skills / Know how to  ...
- Create string literals from quoted text or type conversion
- Use indexing and slicing to retrieve substrings
- Use the `in` and `%` operators
- Work with `bytearray` data for mutable text
- Use an iterator or generator to traverse a sequence of any length

**What follows is adapted from Chapter 6 of the _Python For Everybody_ book. If you have not read it, then please do so before continuing on.**

---

In [None]:
#@title Lesson 6 Introduction
%%html
<div style="max-width: 1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/T98tX5W51nU" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>


## The Type Hierarchy
> "There are 10 kinds of people in this world, those who know binary and those who don't." -- anonymous nerd lore

In this world of Big Data where computational power and data storage capacity have become commodities measured out like sugar or produce, it is easy for non-programmers to forget that it hasn't always been this way, that actual people had to design and build the technology up over many decades. The modern world is by design, not accident, or so we like to believe. 

In the very beginning, all data and programs were **encoded** as **binary**: strings of zeroes and ones called **bits** where every few bits represented something else. We still use binary (at least in digital computers), of course, but not many people speak it natively anymore. It's just too computer-specific and hard for us humans to process. Instead, we use an ever evolving repertoire of **languages** and **data structures** to express ourselves and keep our data safe and useful. 

After the **bit** the next standard data type was the **byte**, composed of 8 bits with 256 possible values, which is just enough to encode each key on a computer keyboard. Thus, [**ASCII**](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) was born. ASCII used seven of the bits (i.e., 128 possible values) for the keyboard codes, with the last bit available for error checking. Each character was assigned a number (e.g., "A" = binary `1000001` = decimal 65). Upper case letters were separate from lower case letters ("a" = `1100001` = 97) and every "control" character (e.g. tab, line ending, end of file, etc.) has an ASCII encoding as well. 
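As a quick illustration of the encoding described above, Python's built-in `ord()` and `chr()` functions convert between a character and its numeric code. This is only a sketch; the variable names are ours and are not part of the lesson.

```python
# Look up the numeric code for a character, then convert back again
code_upper = ord("A")                      # decimal code for "A"
code_lower = ord("a")                      # decimal code for "a"
bits_upper = format(code_upper, "07b")     # the same code as seven binary digits
bits_lower = format(code_lower, "07b")
back_again = chr(65)                       # from a code back to a character

print(code_upper, bits_upper)
print(code_lower, bits_lower)
print(back_again)
```

The upper and lower case codes differ by exactly 32, which is a single bit in the binary form.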

When strung together, **bytes** could represent lots of things as **strings**, **integers**, and **floats**, which you will recognize as native Python data types. From there it was natural to think of more elaborate data structures like complex numbers or lists. These in turn led to decimal and fractional numbers, and file, array, list, tuple, and dictionary collections. Today the Python standard library includes dozens of data types, each with its own unique properties and uses. And, if none of these quite do what we want, we can always create our own. 

This lesson starts our explorations into the world of data structures with a review of the **string data type**, which in practice is the most commonly used data structure of all. While we analytics geeks would likely prefer to have quantitative data all the time, if a human is involved in its collection or communication then it almost certainly will start as text strings. 

---
## String: The (Second-Most) Universal Data Type
Many people are surprised to learn that in its original form, the entirety of the world wide web was text. Even the images were encoded into text! The web browser's job was to take all the strings of text coming over the wire -- there was no wifi back then -- and display web pages that people could view and interact with. Why text? Because just about anything can be encoded as text. It also allowed people to hand-craft web pages with HTML. While the web has progressed a lot since then, most content is still ... text. 

In Python all text has the `str` (string) data type. Most of what follows is a somewhat terse review of things from the Py4E book. Please start there and then read through the notes. 

---
## String Literals
A **string literal** is a specific sequence of characters. They come about in one of two ways:
- As quoted text like "Every good boy does fine"
- Conversion from a literal of another type

### Quoted Text
Quoted text in Python comes in three varieties.

**Single quotes** are used for short strings like names: 

In [None]:
'Apple'

'Apple'

**Double quotes** are used when the quoted text might include single quotes/apostrophes:

In [None]:
"Apple's market cap is $1.5 Trillion"

"Apple's market cap is $1.5 Trillion"

**Triple quotes** are used when the text spans multiple lines (and we want to keep it that way):

In [None]:
'''
Apple's market cap is ...

1.5 TRILLION DOLLARS!
'''

"\nApple's market cap is ...\n\n1.5 TRILLION DOLLARS!\n"

The `\n`s in there represent line endings (a.k.a., "newlines").

### Conversions from other data types
Just about anything can be converted to a string. 

In [None]:
str(15)

'15'

In [None]:
str(type(15))

"<class 'int'>"

In [None]:
b'This is a bytes literal'.decode()

'This is a bytes literal'

The last example is a special case covered in the _Pro Tips_ section at the end of the lesson.  

### Special Characters
Some character codes from the old days of teletypes still exist in Python and other languages. Most represented either whitespace or unprintable behaviors (e.g., ringing a bell with `\a` or going backwards one space with `\b`). They were encoded with a backslash `\` character plus a letter or number. Of these only a few so-called "escape-codes" still have any meaning:
- `\t`: advance to the next tabstop ("tab")
- `\n`: ASCII line feed (line end)
- `\\`: backslash
- `\'`: single quote character
- `\"`: double quote character

There are other codes (also with backslashes) for things like unicode characters. You will find them explained in the Python docs. 
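To see the escape codes listed above in action, here is a small sketch (the sample strings are made up for illustration). Note that each escape code is typed as two characters but stored as one.

```python
tabbed = "name\tvalue"        # \t is a single tab character
two_lines = "first\nsecond"   # \n is a single newline character
path = "C:\\temp"             # \\ is a single backslash
quote = 'It\'s fine'          # \' puts a single quote inside single quotes

print(tabbed)
print(two_lines)
print(path)
print(quote)
print(len(tabbed))            # the tab counts as one character
```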

### Immutability: When everything is taken literally
Strings and numbers are meant to be taken literally. **They cannot be changed after they are created.** We can't change the number 1 into the number 2, no matter how much we try. Similarly we can't change "Apple" into "Google" no matter how much Eric Schmidt may have wanted it to happen. 

So how are we able to do things like this?

In [None]:
x = "Apple is king of Silicon Valley" 
x = "Google is king of Silicon Valley"   # Reassignment
x += " ... but Apple was there first"    # Update operator
print(x)

Google is king of Silicon Valley ... but Apple was there first


Despite appearances, **each string is immutable**. We can operate on the strings to make new strings, but the **original strings are unchanged**. Again, we can't modify 1 to make it 2, so why would we expect to be able to do that with strings? 
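One way to convince yourself is to keep a second name pointing at the original string before "updating" it. The sketch below uses made-up variable names `x` and `y`.

```python
x = "Apple"
y = x                  # y refers to the same string object as x
x += " Inc."           # builds a NEW string and rebinds the name x to it

print(x)               # the new, longer string
print(y)               # the original string, untouched
print(x is y)          # x and y now refer to different objects
```

The `+=` operator changed which string `x` refers to. It did not change the string itself, which is why `y` still sees the original text.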

### **Pulse Check ...**

**1. Why do we need three different quoting mechanisms in Python?**

1. We need single quotes to represent a string, which is a sequence of characters.
2. We need double quotes to represent a string when there is an apostrophe `'` in the string of characters.
3. We need triple quotes when our string spans multiple lines. When a triple-quoted string is the first statement in a function, method, or module, it serves as a docstring that explains how that code works.

**2. Why does the following code fail?**
```python
x = "ABD"
x[2]="C"

```

**The code fails because strings are immutable: they cannot be changed once they are created. Item assignment such as `x[2] = "C"` raises a `TypeError`. We can only build a new string from the original, which leaves the original unchanged.**

In [None]:
#@title <--- Check your work
%%html
<div style="max-width: 1000px">
   <div style="position: relative;padding-bottom: 56.25%;height: 0;">
     <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  
     src="https://www.youtube.com/embed/uGE-_amS2Eg"
     frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>
</div>

---
## Strings as Sequences
One of the reasons why string immutability is surprising is that **a string represents a sequence of characters**, kind of like a list. And, as anybody who has ever made a shopping list can tell you, lists are anything but immutable. We can add, delete, and reorder list items to our hearts' content. But not so with strings. (For mutable sequences of characters we suggest using a `bytearray`, covered at the end of the lesson.)

Despite this one huge difference, in just about every other way a string works pretty much like a list. Both are sequential data types, with a set of features shared by all sequential data types.  

### Indexing and Slicing
Sequences inherently have **order**. When referring to an item in the sequence, we refer to its **position** in the sequence. There is always a first item, a second item, etc.

In Python we use the `[]` operator to **index** the string (or list, tuple, etc.):

In [None]:
'Google'[3] # the 3rd index or 4th character of the string 

'g'

Those of you who are counting will notice that 'g' is the fourth character in 'Google.' Python starts counting from 0, not 1 when indexing a sequence. 

In [None]:
'Google'[0] # indexing in Python always starts at zero 

'G'

We can even use negative numbers as indexes:    

In [None]:
'Google'[-2] # negative indexing we work in reverse

'l'

Negative indexes count backwards from the end, with `[-1]` representing the last character.  To find out how many characters are in a string, use the `len()` function.

In [None]:
len('Google')

6

Sometimes we want more than one character at a time. For that use a **slice** instead of an index:

In [None]:
'Google'[2:4] # from index 2 up to, but not including, index 4

'og'

A slice `[a:b]` returns the substring starting with position `a` up to position `b-1`. So, `[2:4]` is asking for whatever is in positions 2 and 3 returned as a new string literal, which in this case is 'og'.
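Either end of a slice can be left out, in which case Python fills in the beginning or the end of the string. A brief sketch, with variable names chosen just for this example:

```python
word = "Google"

first_two = word[:2]     # omitted start means "from the beginning"
from_two = word[2:]      # omitted stop means "through the end"
last_three = word[-3:]   # negative positions count from the end
whole = word[:]          # both omitted: the entire string

print(first_two, from_two, last_three, whole)
print(len(word[2:4]) == 4 - 2)   # a slice [a:b] within range holds b - a characters
```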

### Traversal
**Traversal** is what we call iteration over the items in a sequence. It is exactly what you might imagine: 

In [None]:
for c in "Google":
    print(c.upper())

G
O
O
G
L
E


We can do whatever we want inside the loop body except modify the string. 

Traversal is actually a more generally applicable task that we will return to in Lesson 11 when we consider tree structured data. 

### **Pulse Check ...**
**1. What does the following code do?** You will likely want to consult the [Python docs](https://docs.python.org/3/library/functions.html#slice). Be sure to explain what the third slice argument does. 

In [None]:
"Go Stags!"[::-1] 

'!sgatS oG'

**START, STOP, and STEP are the three slice parameters. Here both START and STOP are omitted, so the slice covers the whole string. The STEP of -1 means we move through the string backwards, so the omitted START defaults to the last character and the slice runs back to the first. The result is the string reversed.**

**2. Rewrite the code from question 1 so that it prints the (reversed) characters one per line.**

In [None]:
my_text = "Go Stags!"

for c in my_text[::-1]:
    print(c)

!
s
g
a
t
S
 
o
G


In [None]:
#@title <--- Check your work
%%html
<div style="max-width: 1000px">
   <div style="position: relative;padding-bottom: 56.25%;height: 0;">
     <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  
     src="https://www.youtube.com/embed/bA2Kt_UiBss"
     frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>
</div>

---
## String Operators
We have already seen two different **string operations**:
- **concatenating** with the `+` operator
- **appending** with the `+=` operator

The first returns the merger of two strings, one after the other. The second changes the value of the variable. 

We can also use comparison operators just like with numbers:

In [None]:
'Google' < 'Apple'

False

The comparison is based on the numeric codes that correspond to each character. Since `A` comes before `G` in the alphabet, its numeric code is smaller. This also applies to lowercase and capital letters and numbers:

In [None]:
"A" < "a"

True

In [None]:
"21" > "100"

True

The logic is the same as lexicographic ordering (a.k.a. "alphabetizing"), with the comparison examining successive characters in each string until one character is smaller than the other or one of the character sequences has been exhausted. So, since "2" is greater than "1", "21" is greater than "100". 
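The difference between comparing text and comparing numbers is easy to check. In this sketch (variable names are ours) the same digits are compared twice, first as strings and then as integers.

```python
# Character codes drive string comparison
print(ord("A"), ord("G"), ord("a"))

text_compare = "21" > "100"               # compares "2" with "1" and stops there
number_compare = int("21") > int("100")   # compares the values 21 and 100

print(text_compare)
print(number_compare)
```

If the data are really numbers, convert them before comparing or sorting.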

### The `in` Operator

The `in` operator is used to ask the question: is x in the sequence? We can use it to determine if a string is a substring of another string:

In [None]:
x = "Google is king of Silicon Valley ... but Apple was there first"
print("Microsoft" in x)
print("Apple" in x)

False
True


Note that the `in` keyword is not new to us. It's in every `for` loop. However, in a `for` loop it is _assigning_ the loop variable (`c`) to each item in the sequence (`x`), one item at a time. It's not asking. It's doing.    

### The `%` Operator
One of the more powerful string operators is `%`, which allows us to insert values into the middle of a string. (Technically, it generates a new string but conceptually it inserts into it.)

In [None]:
"Google is king of Silicon Valley ... but %s was there first" % 'Microsoft' 

'Google is king of Silicon Valley ... but Microsoft was there first'

The `%s` is a placeholder for where to insert the string 'Microsoft'. Other placeholders like `%d` and `%i` (both for integers, displayed in decimal) or `%g` (for floating point numbers) exist for various other types of data. It is possible to use multiple placeholders in the same string if we supply a **tuple** on the right hand side. We will cover tuples in Lesson 10, but you can get the general idea from this example.    

In [None]:
"%s is king of Silicon Valley ... but %s was there first" % ('Google', 'Microsoft')

'Google is king of Silicon Valley ... but Microsoft was there first'
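The numeric placeholders work the same way. The sketch below mixes `%s`, `%d`, and `%g`, and also shows `%.2f`, which fixes the number of decimal places. The sample values and variable names are only for illustration.

```python
name = "Apple"
year = 1976
cap = 1.5

line1 = "%s was founded in %d" % (name, year)   # two placeholders, so a tuple
line2 = "Market cap: $%g trillion" % cap        # %g picks a compact float format
line3 = "Market cap: $%.2f trillion" % cap      # .2f forces two decimal places

print(line1)
print(line2)
print(line3)
```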

### **Pulse Check ...**
**1. Why doesn't this expression return `True`?**
```python
"a" in "Apple"
```

**The character "a" does not appear in the right-hand operand "Apple". Lowercase "a" is not equal to uppercase "A" because Python string comparisons are case sensitive.**

**2. Explain the output of the following code. How did Python choose the sequence?**

In [None]:
for c in sorted("Go Stags!"):
    print(c)

 
!
G
S
a
g
o
s
t


**We are iterating over the string "Go Stags!" and printing each character after the `sorted()` function has put the characters in order. Python sorts by character code, so the space and "!" come first, then the uppercase letters, then the lowercase letters. This explains why "G" and "S" print before "a".**

**3. Study the two operations below. The first one fails (with an error) while the second one executes cleanly. Why do you suppose that is true?** Hint: The answer says a lot about the nature of strings versus numbers for encoding data.
```python
"The Red Sox have %d World Series Championships" % "9"
"The Red Sox have %s World Series Championships" % 9
```

In [None]:
"The Red Sox have %d World Series Championships" % "9" # test the following

TypeError: ignored

In [None]:
"The Red Sox have %s World Series Championships" % "9" # correct it by specifying string format operator

'The Red Sox have 9 World Series Championships'

In [None]:
"The Red Sox have %s World Series Championships" % 9 # test the following 

'The Red Sox have 9 World Series Championships'

1. The first line fails because the right operand is a string, and `%d` requires a number. Python will not guess that the text "9" is meant to be the integer 9.

2. The second line works because the right operand is an integer, and `%s` converts any value to its string form.

In summary, almost any value can be converted to a string, but a string can be converted to an integer or float only when its text looks like a number, and even then we must ask for the conversion explicitly (e.g., `int("9")`).

In [None]:
#@title <--- Check your work
%%html
<div style="max-width: 1000px">
   <div style="position: relative;padding-bottom: 56.25%;height: 0;">
     <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  
     src="https://www.youtube.com/embed/RgSiMv0VMWQ"
     frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>
</div>

---
## String Methods

Remember the discussion of essentialism at the start of Lesson 2? There we said that every entity has both form and function. In Python every value, function, module, or other fundamental element is an [object](https://docs.python.org/3/reference/datamodel.html#data-model). Objects have two sides: state and behavior. An object's state (data) is implemented with **instance variables** that act like properties or features. An object's behavior (functionality) is implemented through its **methods**, a set of data type-specific functions that always take the object itself as a parameter. 

Let's take, for example, the number 2.5. Clearly the number has state (i.e., 2.5) but it also has behavior. We can ask it, for example to add another number to itself (which is how the `+` operator really works) or we can ask it to provide the simplest possible equivalent fraction (a.k.a., "integer ratio"):

In [None]:
2.5.as_integer_ratio()

(5, 2)

In human terms this is saying that 2.5 = 5/2. How did the `as_integer_ratio()` method know what number to convert? The one it is attached to, which in this case is 2.5. We tell Python that the method is attached via "dot notation":
```python
value.method( arguments )
```

The actual method is defined much like a function, only it always has one extra `self` parameter that never appears in the call:
```python
def method( self, parameters):
    ...
```
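To make the hidden `self` parameter a little more concrete, the sketch below calls the same method two ways: with dot notation, and through the `float` type with the value passed in explicitly. The variable names are ours.

```python
value = 2.5

ratio_dot = value.as_integer_ratio()             # dot notation: self is filled in for us
ratio_explicit = float.as_integer_ratio(value)   # same method, self passed by hand

sum_operator = value + 1.0
sum_method = value.__add__(1.0)                  # the method the + operator relies on

print(ratio_dot, ratio_explicit)
print(sum_operator, sum_method)
```

Nobody writes code the second way in practice. It is shown only to make clear which object ends up in `self`.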

So what does this have to do with strings? The string data type has a large number of built-in [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods):
- `upper()`, `lower()`, `capitalize()`
- `strip()`, `lstrip()`, `rstrip()`
- `center()`, `ljust()`, `rjust()`
- `count()`, `find()`, `rfind()`, `index()`, `rindex()`
- `replace()`, `format()`

and many more. Most are similar to their equivalents in MS Excel. Then there are a few "magic" ones like `__add__()`,  which says what the `+` operator is supposed to do. 
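A few of these methods in action, as a small sketch with a made-up sample string. Because strings are immutable, every method returns a new string (or a number) and leaves the original alone.

```python
raw = "  every good boy does fine  "

cleaned = raw.strip()                      # remove leading/trailing whitespace
title = cleaned.capitalize()               # uppercase the first character only
n_o = cleaned.count("o")                   # how many times "o" appears
swapped = cleaned.replace("boy", "girl")   # substitute one substring for another
centered = "Hi".center(6, "*")             # pad to width 6 with asterisks

print(title)
print(n_o)
print(swapped)
print(centered)
print(repr(raw))                           # the original still has its spaces
```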


### The `find` Method
One of the more commonly-used string methods is `find()`, which returns the position of the first instance of a substring within a longer string:

In [None]:
"Every Good Boy Does Fine".find("Fine")

20

If we want `find()` to begin looking somewhere other than the beginning of the string then we can pass the starting position as a second argument. We can then slice a word out of the string, like this ...

In [None]:
# pick out a word that fits the pattern 'but '+word+" "
x = "Google is king of Silicon Valley ... but Apple was there first"
start_slice = x.find('but ') + len('but ') # finds the next position after the 'but '
end_slice = x.find(" ",start_slice)        # finds the next space after start_slice 
x[start_slice:end_slice]                   # the actual slice

'Apple'

While `find()` is certainly useful, it is not the only way to parse strings. You can, of course, traverse the string using a `for` loop (not recommended) or use the **regular expressions** module to search for arbitrarily complex text patterns. Regular expressions are covered in chapter 11 of the Py4E book.   
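Two details of `find()` are worth trying out: the optional starting position, and the value it returns when nothing matches. A minimal sketch, with variable names of our own choosing:

```python
sentence = "Every Good Boy Does Fine"

first_o = sentence.find("o")                 # position of the first "o"
second_o = sentence.find("o", first_o + 1)   # start looking just past the first hit
missing = sentence.find("Zebra")             # no match at all

print(first_o, second_o, missing)
```

A result of `-1` means "not found", so it is a good idea to check for it before using the result in a slice.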

### **Pulse Check ...**
**Write an expression to count the number of sentences in the Gettysburg address.** Assume that each sentence ends with a period. It is possible to use just one expression (without a statement), though it's okay to use an assignment statement and an expression. Any more than that is just verbose!

> Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.
>
>But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.

In [None]:
# YOUR CODE HERE

gettysburg_address = '''
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.
'''
# test that the text was added correctly by checking the first 11 characters
gettysburg_address[0:11]


'\nFour score'

In [None]:
# check how many periods there are to count the number of sentences
# assumption is that every sentence ends with a period
# https://stackoverflow.com/questions/22724695/how-to-count-characters-in-a-string-python
sum(1 for c in gettysburg_address if c == ".")

10

In [None]:
# Huntley's solution 
gettysburg_address.count(".")

10

In [None]:
# From the lecture 
gettysburg_address.replace('.\n', '. ').replace('\n','').split('. ')

['Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal',
 'Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure',
 'We are met on a great battle-field of that war',
 'We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live',
 'It is altogether fitting and proper that we should do this',
 'But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground',
 'The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract',
 'The world will little note, nor long remember what we say here, but it can never forget what they did here',
 'It is for us the living, rather, to be dedicated here to the unfinished work which they who fou

In [None]:
#@title <--- Check your work
%%html
<div style="max-width: 1000px">
   <div style="position: relative;padding-bottom: 56.25%;height: 0;">
     <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  
     src="https://www.youtube.com/embed/ePE4uSuGkfU"
     frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>
</div>

---
## Pro Tips

### `bytes` and `bytearray`
While there are other ways to encode characters besides ASCII (e.g., UTF-8 and other Unicode encodings), many network protocols and file formats are defined in terms of raw bytes, often restricted to ASCII characters. For that we use the `bytes` data type. To the novice eye `bytes` literals look a lot like normal strings:  

In [None]:
b'My Name is Earl'

b'My Name is Earl'

We can tell it is a `bytes` literal because of the leading `b`. Otherwise it looks like a string, with many of the same methods. It's even immutable like a string. 

Okay, so what's the big deal? Nothing really, until you convert the `bytes` into a `bytearray`. A `bytearray` is a mutable version of `bytes`. You can alter the characters in place (without creating new strings) and even add new characters to the end. 

In [None]:
x_str = 'My Name is Earl'       # the original string
print("str: \t\t", x_str)

x_bytes = x_str.encode()        # encode it to bytes
print("bytes: \t\t",x_bytes)

x_bytearray = bytearray(x_bytes)  # convert to a mutable bytearray
print("bytearray: \t",x_bytearray)

del x_bytearray[11:]                # cut off the last few characters, in place
print("truncated: \t", x_bytearray)

# add new text to the end of the bytearray
x_bytearray += b'Inigo Montoya, you killed my father, prepare to die.'
print("extended: \t",x_bytearray)

x_str = x_bytearray.decode()    # decode back to a string
print("str: \t\t",x_str)

str: 		 My Name is Earl
bytes: 		 b'My Name is Earl'
bytearray: 	 bytearray(b'My Name is Earl')
truncated: 	 bytearray(b'My Name is ')
extended: 	 bytearray(b'My Name is Inigo Montoya, you killed my father, prepare to die.')
str: 		 My Name is Inigo Montoya, you killed my father, prepare to die.


Note that by modifying the `bytearray` in place we minimize how much data we need to keep in memory at a given time. If the string is book length, that could be a big deal, both faster and more memory efficient.  
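The sketch below shows what "in place" means. A second name, `same_buffer` (made up for this example), is bound to the bytearray before any changes, and it still refers to the very same object afterwards.

```python
buffer = bytearray(b"My Name is Earl")
same_buffer = buffer                 # a second name for the same object

buffer[0:2] = b"my"                  # overwrite a slice in place
buffer[3] = ord("n")                 # a single position holds an integer code
del buffer[11:]                      # drop "Earl" in place
buffer.extend(b"Inigo")              # append in place

print(buffer)
print(same_buffer is buffer)         # still the very same object
print(buffer.decode())               # back to an ordinary string
```

Contrast this with strings, where every one of these steps would have produced a brand new object.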

### Iterators, Generators, and `yield`

When discussing software design programmers can sometimes sound very philosophical or very silly, depending on your perspective. What is a _sequence_ really? A sequence is an **iterable** collection of items. The items can be traversed one at a time, in some order determined by the sequence. We encapsulate the process with an **iterator**. 

An **iterator** traverses the items in a sequence one at a time, in order. In Python, an _iterable_ collection (sequence) always has a magic method `__iter__()` that returns an _iterator_. The iterator object then has another magic method called `__next__()` that either returns the next item in the sequence or raises a `StopIteration` exception if the sequence is exhausted. (An exception is like an error, catchable with a `try ... except` statement, but it doesn't necessarily mean anything is wrong, just that something notable happened.)

Sometimes an example is easier than an explanation:

In [None]:
go_stags = "Go Stags!".__iter__()  # create an iterator for the string
print(go_stags)     
print(go_stags.__next__())         # 0-th item
print(go_stags.__next__())         # 1-st item
print(go_stags.__next__())         # 2-nd item
print(go_stags.__next__())         # 3-rd item
print(go_stags.__next__())         # 4-th item
print(go_stags.__next__())         # 5-th item
print(go_stags.__next__())         # 6-th item
print(go_stags.__next__())         # 7-th item
print(go_stags.__next__())         # 8-th item
# print(go_stags.__next__())         # a 10th call raises StopIteration: the string has only 9 characters

<str_iterator object at 0x7f94f0503040>
G
o
 
S
t
a
g
s
!


Behind the scenes that is exactly how a `for` loop works. Each pass through the loop it sets the `item` to the next in sequence until none are left. The `in` operator does the same thing. It's really a specialized (and short-circuited) while loop for finding items in a collection. 
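A `for` loop can be approximated by hand with a `while` loop, using the built-in `iter()` and `next()` functions (which call `__iter__()` and `__next__()` for us). This is a sketch of the idea rather than Python's actual internals, and the names `text` and `collected` are ours.

```python
text = "Go!"
collected = []

iterator = iter(text)             # same as text.__iter__()
while True:
    try:
        c = next(iterator)        # same as iterator.__next__()
    except StopIteration:         # raised once the sequence is exhausted
        break
    collected.append(c)           # the "loop body"
    print(c)
```

Writing `for c in text:` does all of this in a single line, which is why we rarely call `next()` ourselves.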

Every data type we will consider in the next few lessons is iterable: strings, files, lists, tuples, dictionaries, sets, Series, and DataFrames. All of them are used to contain collections of data. In some cases the sequences are somewhat artificial (e.g., sets are non-sequential by nature) but nonetheless they have iterators. 

So, what if we want to create iterators for our own custom data structures? We have two choices:

- Create a new data type (called a **class**) that implements the `__iter__()` and `__next__()` methods.
- Use a **generator function**.

We won't get into creating a new class, which is well beyond the scope of this course, but we can at least illustrate the second. The code below uses a generator function to create a `digits` iterator to use in the `for` loop. 

In [None]:
# Set up the problem
divisor = 7
dividend = 479

# A digits generator
def digits(dividend):            # a typical function definition
    digit_str = str(dividend)    # convert the integer dividend into a string
    for d in digit_str:          # start iterating over the string
        yield d                  # `yield` one digit and freeze execution until the next generator call

# Initialize variables
remainder = quotient = 0

for d in digits(dividend):
    remainder = remainder*10 + int(d)     # pull down the next digit into the remainder
    q = remainder // divisor              # determine how many times the divisor fits into the remainder 
    product = q * divisor                 # calculate the product, and then ...
    remainder -= product                  # subtract it from the remainder
    
    quotient = quotient*10 + q            # add the next digit to the quotient
    
print(str(quotient)+"r"+str(remainder))

68r3


The main difference between a _generator_ and a normal function is that it uses `yield` instead of `return`. After a `return` from a function Python forgets everything that happened, starting fresh with the next function call. With `yield`, however, the function instead puts all of its state (variables and current statement pointer) on ice before it leaves. Then with the next call of the generator, it thaws out the state and continues execution right where it was, yielding the next item in the sequence.

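You have already encountered something very similar in Lesson 5. The `range()` function returns a lazy `range` object rather than a list. Strictly speaking it is not a generator, but its iterator works the same way: each time the iterator's `__next__()` method is called it produces the next number in the sequence. 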

So why do we need generators if we can use lists? Because generators work for **infinite** (or at least impossibly huge) sequences that are too big to fit into memory all at once. The generator can yield items one at a time without having to remember the previous items or predict the remaining ones. To create all the digits of π (a sequence that never ends), one could use a generator function that calculates the digits one at a time.   
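Calculating the digits of π is more than we want to take on here, so as a simpler stand-in the sketch below defines a hypothetical `evens()` generator that never ends on its own. We use `itertools.islice` from the standard library to take just the first few items.

```python
from itertools import islice

def evens():
    """Yield 0, 2, 4, ... forever (a made-up example of an infinite sequence)."""
    n = 0
    while True:          # never ends on its own
        yield n          # hand back one item, then pause here
        n += 2

first_five = list(islice(evens(), 5))   # take only the first 5 items
print(first_five)
```

Only five numbers are ever computed. Trying to build the full sequence as a list first would never finish.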

---
## **Exercises**

**1. Rewrite your `waist2hip_ratio()` function from Lesson 4 to use the formatting operator `%` for the output string.** You will need to use a tuple for the 4 placeholder insertions. 
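
As a reminder of how the `%` operator takes a tuple, here is a small sketch with made-up values unrelated to the exercise. Each placeholder (`%s` for a string, `%d` for an integer, `%g` for a compact float) is filled from the tuple in order.

```python
# Hypothetical values, for illustration only
name = "Ada"
age = 36
height_m = 1.68

# Three placeholders, so the tuple on the right must have three items
line = "%s is %d years old and %g m tall." % (name, age, height_m)
print(line)   # Ada is 36 years old and 1.68 m tall.
```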

In [None]:
# Work from lesson 4 copied in this code block

# Turn exercise waist to hip ratio from lesson 3 into a function named 'w2h_ratio'
# the function 'w2h_ratio' has three parameters 'waist_inches', 'hip_inches','gender'
def w2h_ratio(waist_inches, hip_inches, gender):

    try:
        waist = float(waist_inches)
        hip = float(hip_inches) 
        gender = str(gender)
        
    except (TypeError, ValueError): # float() raises these when the input is not a number
        return("waist2hip_ratio: Invalid measurement(s)") # waist & hip must be in inches

    if waist <= 0 or hip <= 0:  # waist and hip measurements must be greater than 0
        return("waist2hip_ratio: Invalid measurement(s)")

    if gender not in ("F", "M"): # Gender must be "M" OR "F"
        return ("waist2hip_ratio: Unknown gender")

    waist2hip_ratio = round(waist/hip, 2) # round our waist2hip_ratio to two decimal spaces

    # if gender == "M" and waist2hip_ratio > 0.90:
    #     return "Apple"
    # elif gender == "M" and waist2hip_ratio <= 0.90:
    #     return "Pear"
    # elif gender == "F" and waist2hip_ratio > 0.80:
    #     return "Apple"
    # elif gender == "F" and waist2hip_ratio <= 0.80:
    #     return "Pear"    
    
    
    # Determine if the individual is an "Apple" OR "Pear" shape based on their waist to hip ratio
    shape = "Apple" if (waist2hip_ratio > 0.90 and gender == "M") or (waist2hip_ratio > 0.80 and gender == "F") else "Pear"
    
    # convert the waist and hip measurements to strings so they can be printed
    shape_result = "For a " + gender + " with waist " + str(waist) + " and hip " + str(hip) + ", \n" + " the w2h ratio is " 
    shape_result += str(waist2hip_ratio) + "," + " with a shape " + shape
    
    print(shape_result) # print the summary of the individuals shape and waist to hip ratio

In [None]:
# Call the function to test its logic using
# the same inputs from lesson 3/4 exercise
w2h_ratio(25, 33, "F")

For a F with waist 25.0 and hip 33.0, 
 the w2h ratio is 0.76, with a shape Pear


In [None]:
# Re-write the shape result output using the string format operator %

# Work from lesson 4 copied in this code block

# Turn exercise waist to hip ratio from lesson 3 into a function named 'w2h_ratio'
# the function 'w2h_ratio' has three parameters 'waist_inches', 'hip_inches','gender'
def w2h_ratio(waist_inches, hip_inches, gender):

    try:
        waist = float(waist_inches)
        hip = float(hip_inches) 
        gender = str(gender)
        
    except (TypeError, ValueError): # float() raises these when the input is not a number
        return("waist2hip_ratio: Invalid measurement(s)") # waist & hip must be in inches

    if waist <= 0 or hip <= 0:  # waist and hip measurements must be greater than 0
        return("waist2hip_ratio: Invalid measurement(s)")

    if gender not in ("F", "M"): # Gender must be "M" OR "F"
        return ("waist2hip_ratio: Unknown gender")

    waist2hip_ratio = round(waist/hip, 2) # round our waist2hip_ratio to two decimal spaces

    # if gender == "M" and waist2hip_ratio > 0.90:
    #     return "Apple"
    # elif gender == "M" and waist2hip_ratio <= 0.90:
    #     return "Pear"
    # elif gender == "F" and waist2hip_ratio > 0.80:
    #     return "Apple"
    # elif gender == "F" and waist2hip_ratio <= 0.80:
    #     return "Pear"    
    
    
    # Determine if the individual is an "Apple" OR "Pear" shape based on their waist to hip ratio
    shape = "Apple" if (waist2hip_ratio > 0.90 and gender == "M") or (waist2hip_ratio > 0.80 and gender == "F") else "Pear"
    
    # convert the waist and hip measurements to strings so they can be printed
    # shape_result = "For a " + gender + " with waist " + str(waist) + " and hip " + str(hip) + ", \n" + " the w2h ratio is " 
    # shape_result += str(waist2hip_ratio) + "," + " with a shape " + shape

    # Re-write the shape result output using the string format operator %
    # CHANGE CODE HERE
    
    shape_result_new = "For a %s with waist %g and hip %g, the w2h ratio is %g." % (gender, waist, hip, waist2hip_ratio)

    print(shape_result_new) # print the summary of the individuals shape and waist to hip ratio

# Call the function 
w2h_ratio(25, 33, "F")

For a F with waist 25 and hip 33, the w2h ratio is 0.76.


In [None]:
#@title <--- Check your work
%%html
<div style="max-width: 1000px">
   <div style="position: relative;padding-bottom: 56.25%;height: 0;">
     <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  
     src="https://www.youtube.com/embed/YfrN_4Pq_ko"
     frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>
</div>

**2. Write a function called `char_rotate()` that moves the first character of a string to the end of the string.** The function has one parameter (the string) and returns a new string with the first character rotated to the end.

In [None]:
def char_rotate(my_string):
    # slice from the second character (index 1) through the end of the string,
    # then concatenate the first character onto the end;
    # my_string[:1] (rather than my_string[0]) also works for an empty string
    string_rotate = my_string[1:] + my_string[:1]
    return string_rotate  # return the value (a fruitful function) so Ex 3 can use it

# call the function 
char_rotate("Monica Willson")

'onica WillsonM'

**3. Write a function called `igpay()` that takes in a word and returns the pig latin equivalent.**
- Convert every letter to upper case. Latin does not have lowercase letters. 
- If the word starts with a vowel, then add "YAY" to the end of the word.
- If the word starts with one or more consonants in a row, then move the consonants to the end (so the word starts with a vowel) and add "AY".
- If the word does not contain at least one vowel then just return the word
- Look out for infinite loops! Interrupt/restart the runtime if needed.

Hints: 
- Use your `char_rotate()` function from exercise 2.
- There may be a string method or two that can simplify your code.
- `igpay("pig")` is "IGPAY"
- `igpay("art")` is "ARTAY"
- `igpay("thx")` is "thx"
- `igpay("")`  is ""

In [None]:
vowels = ["A","E","I","O","U"]

def igpay(my_string):
    # convert the input text to all uppercase letters
    my_string = my_string.upper()
    # if the string is empty or contains no vowels, just return it
    has_vowel = False
    for c in my_string:
        if c in vowels:
            has_vowel = True
    if not has_vowel:
        return my_string
    # if it starts with a vowel, add "YAY"
    if my_string[0] in vowels:
        return my_string + "YAY"
    # otherwise rotate the leading consonants to the end one at a time;
    # the loop must end because the string contains at least one vowel
    while my_string[0] not in vowels:
        my_string = char_rotate(my_string)
    return my_string + "AY"  # return only after the loop has finished

# call the function
igpay('pig') # test a word that starts with a consonant

'IGPAY'

In [None]:
igpay('art') # test a word that starts with a vowel

'ARTYAY'

In [None]:
igpay('thx') # test a word with no vowels

'THX'

In [None]:
igpay("") # test the empty string

''

In [None]:
#@title <--- Check your work
%%html
<div style="max-width: 1000px">
   <div style="position: relative;padding-bottom: 56.25%;height: 0;">
     <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  
     src="https://www.youtube.com/embed/ciXaBYgNHeY"
     frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>
</div>

**4. Write a function called `roman2int()` that converts a roman numeral string to an integer.** So, `roman2int("XXIX")` returns 29 and `roman2int("MCMXCIX")` returns 1999. 

Requirements:
- The function has a single parameter called `roman_numeral` that must be a string. 
- Iterate over the string with a `while` loop with two loop variables and a stopping condition:
  - `int_value` is an accumulator for the integer value of the roman characters processed so far.
  - `remaining` is used to keep track of the roman characters remaining. It is like the `digits` variable we used with long division. 
  - The loop terminates when `remaining` is an empty string.
  - Don't forget to initialize the variables before entering the loop.
- Each pass through, use a slice or string method to compare the head of `remaining` with one of the following patterns (with integer equivalents provided):
  - "IV" = 4
  - "IX" = 9
  - "XL" = 40
  - "XC" = 90
  - "CD" = 400
  - "CM" = 900
  - "I" = 1
  - "V" = 5
  - "X" = 10
  - "L" = 50
  - "C" = 100
  - "D" = 500
  - "M" = 1000
  - any other character = 0
- Compare the patterns in the order given above using an `if ... elif` statement with 14 clauses. If a pattern matches then 
  - Update `int_value` by adding the appropriate integer.
  - Update `remaining` by removing the matched pattern from the head of the string. Note: you can do this with a slice to the right of the `=`.  
  

In [None]:
def roman2int(roman_numeral):

    # initialize the accumulator and the loop variable
    int_value = 0 # accumulator: total value of the numerals matched so far
    remaining = roman_numeral # the characters not yet processed
    while remaining:  # loop until no roman numeral characters remain
        if remaining.startswith("IV"):
            remaining = remaining[2:]  # slice the characters after "IV"
            int_value += 4  # add the value of 4 to the accumulator
        
        elif remaining.startswith("IX"):
            remaining = remaining[2:]
            int_value += 9
        
        elif remaining.startswith("XL"):
            remaining = remaining[2:]
            int_value += 40
     
        elif remaining.startswith("XC"):
            remaining = remaining[2:]
            int_value += 90

        elif remaining.startswith("CD"):
            remaining = remaining[2:]
            int_value += 400

        elif remaining.startswith("CM"):
            remaining = remaining[2:]
            int_value += 900
        
        elif remaining.startswith("I"):
            remaining = remaining[1:]
            int_value += 1
        
        elif remaining.startswith("V"):
            remaining = remaining[1:]
            int_value += 5
        
        elif remaining.startswith("X"):
            remaining = remaining[1:]
            int_value += 10
        
        elif remaining.startswith("L"):
            remaining = remaining[1:]
            int_value += 50
            
        elif remaining.startswith("C"):
            remaining = remaining[1:]
            int_value += 100
        
        elif remaining.startswith("D"):
            remaining = remaining[1:]
            int_value += 500
          
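        elif remaining.startswith("M"):
            remaining = remaining[1:]
            int_value += 1000

        else:  # any other character counts as 0; skip it to avoid an infinite loop
            remaining = remaining[1:]

    return int_value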

roman2int("XXIX")

29
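
As an alternative to the long `if ... elif` chain, the same matching order can be kept in a list of pattern/value pairs. This is a sketch only: `PATTERNS` and `roman2int_table()` are hypothetical names, and the version below does not meet the exercise's requirement of an explicit 14-clause statement.

```python
# Two-character patterns come first so that "IX" is matched before "I"
PATTERNS = [
    ("IV", 4), ("IX", 9), ("XL", 40), ("XC", 90), ("CD", 400), ("CM", 900),
    ("I", 1), ("V", 5), ("X", 10), ("L", 50), ("C", 100), ("D", 500), ("M", 1000),
]

def roman2int_table(roman_numeral):
    int_value = 0
    remaining = roman_numeral
    while remaining:
        for pattern, value in PATTERNS:
            if remaining.startswith(pattern):
                int_value += value
                remaining = remaining[len(pattern):]  # drop the matched head
                break
        else:
            # the for loop finished without a match: unknown character, worth 0
            remaining = remaining[1:]
    return int_value

print(roman2int_table("XXIX"), roman2int_table("MCMXCIX"))   # 29 1999
```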

In [None]:
#@title <--- Check your work
%%html
<div style="max-width: 1000px">
   <div style="position: relative;padding-bottom: 56.25%;height: 0;">
     <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  
     src="https://www.youtube.com/embed/K7OEyfj6bL4"
     frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>
</div>

---
## **Before you go ... Submit your work on Google Classroom**
- Save your notebook to be sure it is up to date. 
- Turn in your notebook. Your notebook will become read-only. 
- Once it has been reviewed, it will be returned and will no longer be read-only. 

---
> ## Every Tee Shirt Has a Story
> ABOUT THE 1989 GENETIC ALGORITHMS CONFERENCE   
> This shirt has some significance for me. It marked when I started to think a little differently about the world around me. I had just finished my masters studies in an area now called evolutionary computation (i.e., solving math problems with simulated sexual reproduction) but then simply called Genetic Algorithms. At the conference I got a prime speaking spot, with several kinda famous names in the audience and my overhead slides projected 15 feet tall behind me. I was going all out. I'd even paid to get the slides printed in color! Up on stage I felt like [Patton](https://www.hollywood.com/general/patton-movie-stills-57256250/#/ms-1915/1) lecturing to the troops. 
>
> I got about halfway through my slides, just about the point where I was about to quote John Holland, the founder of the field, when I spotted him sitting in the first row, watching intently. I didn't expect that. I froze for about 20 seconds, or at least that's what it felt like; I have no idea. I guess he figured out what happened because he motioned for me to continue with my quote. Afterwards he bought me lunch. He turned out to be a very nice and patient man, a true educator as well as a world famous scientist.
>
> That night, as I sat in the lobby watching TV with the other geeks from around the world, I was again frozen, this time with what we were seeing live. It was the [Tank Man video](https://lens.blogs.nytimes.com/2009/06/03/behind-the-scenes-tank-man-of-tiananmen/) from the Tiananmen Square massacre. I could not believe what I was seeing. Could that be real? I was told by several people who seemed to know that yes, that's how it happens there. Could it happen here, I wondered? It already has of course but that's another story. 
>
> The conference tee shirt -- we all wore it in the group photo -- features binary code evolving into the words "Genetic Algorithms." It's held up pretty well for being over 30 years old.        

![L6 Tee Front](https://github.com/christopherhuntley/BUAN5405-docs/raw/master/Photos/L06_TeeFront.jpeg)

## Copyright &copy; 2020 Christopher Huntley. All rights reserved. 