[Table of Contents](../../index.ipynb)

# FRC Analytics with Python - Session 12
# Working with Text Part I
**Last Updated: 23 April 2021**

## I. Introduction
Modern programming languages have many tools and techniques for working with textual data. That's a good thing, because textual data is everywhere. We'll cover string literals, concatenation, formatting, text encoding, and string methods in this session. In the next session (Part II), we'll cover regular expressions and tools for manipulating text in Pandas dataframes.

## II. String Literals

### A. This is Literally a Review of Literal Strings
By now, you should be proficient with using single or double quotation marks to create a string literal.

In [1]:
print("This string uses double quotes.")
print('And this string uses single quotes.')

This string uses double quotes.
And this string uses single quotes.


Here is what the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html#s3.10-strings) says about using single or double quotes for strings:
> Be consistent with your choice of string quote character within a file. Pick ' or " and stick with it. It is okay to use the other quote character on a string to avoid the need to \\ escape within the string.

Furthermore, if you are working on a group programming project, choose which style you will use as a group, and everyone should stick to it.

### B. Multi-Line Strings
There is a third type of string literal called a multi-line string, which is delimited by triple quotes. See the example below:

In [2]:
# Multi-line String cell #1
print("""
  My brother built a robot 
that does not exactly work, 
as soon as it was finished, 
it began to go berserk, 
its eyes grew incandescent 
and its nose appeared to gleam, 
it bellowed unbenignly 
and its ears emitted steam. 
""")
# First verse of the poem My Brother Built a Robot by Jack Prelutsky


  My brother built a robot 
that does not exactly work, 
as soon as it was finished, 
it began to go berserk, 
its eyes grew incandescent 
and its nose appeared to gleam, 
it bellowed unbenignly 
and its ears emitted steam. 



The multi-line string preserves all spaces and line breaks. Multi-line strings can also be created with single quotes, but that practice is discouraged.
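One way to see exactly what a multi-line string preserves is to look at its `repr()`. The following is a minimal sketch; the variable name `verse` and its text are made up for illustration.

```python
# Hypothetical two-line string; the second line starts with two spaces
verse = """line one
  line two"""

# repr() shows the line break as \n and keeps the leading spaces
print(repr(verse))
```

The line break typed between the triple quotes is stored as a newline character, and the two leading spaces on the second line are part of the string.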

In [3]:
# Multi-line String cell #2
def single_quoted_multiline():
    print('''
      Please dont use
          single quotes to
              create a multi-line
                  string.''')
    
single_quoted_multiline()


      Please dont use
          single quotes to
              create a multi-line
                  string.


Look carefully at the horizontal alignment of the output for *Multi-line String cell #1* and *Multi-line String cell #2*. See how the output of cell #2 starts farther to the right than cell #1? This is because the multi-line string includes the spaces used for indentation within the function. If you don't want the lines to be indented, there are two approaches. The first one below works but is discouraged; the second uses `textwrap.dedent()`:

In [4]:
def wrong_way():
    print("""This technique
for avoiding leading spaces at
the beginning of multi-line strings
works, but it looks ugly because
it does not comply with Python rules
for indentation. Try not to do this.""")
    
import textwrap    

def using_dedent():
    long_string = textwrap.dedent("""\
        The textwrap module contains
        several useful functions for manipulating strings. Check
        it out in the documentation for the Python Standard
        Library!""")
    return long_string
    
print(using_dedent())

The textwrap module contains
several useful functions for manipulating strings. Check
it out in the documentation for the Python Standard
Library!


The `dedent()` function in the [Standard Library's *textwrap* module](https://docs.python.org/3/library/textwrap.html) removes common leading whitespace from the beginning of each line. By *common*, we mean the whitespace that every line shares: if some lines are indented further than others, only the shared portion is removed and the extra indentation is kept. The backslash immediately after the opening triple quotes prevents the string from starting with a blank line.
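To illustrate what "common" leading whitespace means, here is a small sketch in which one line is indented further than the others. The variable names and text are hypothetical.

```python
import textwrap

# Hypothetical text: every line has 4 leading spaces, the middle line has 8
nested = """\
    first line
        second line (indented further)
    third line"""

dedented = textwrap.dedent(nested)
print(dedented)
```

Only the four spaces shared by all three lines are removed, so the middle line should still be indented by four spaces relative to the others.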

## III. There's no Escaping Escape Sequences
Strings can contain sequences of characters with special meaning, called escape sequences. We'll cover the following escape sequences:
* Quotation Marks `\'` and `\"`
* Newline character: `\n`
* Horizontal tab character: `\t`
* Backslash: `\\`

Escape sequences occur in several programming languages, including Python, Java, and C++.

### A. Escaping Quotation Marks
We use single quotes (') and double quotes (") to delimit literal strings. There are a couple techniques we can use if we want our string to include quotation marks.

In [5]:
# Option 1: Use double quotes for strings with single quotes and vice versa
scott_adams_quote = [
    "Normal people believe that if it ain't broke, don't fix it.",
    "Engineers believe that if it ain't broke, it doesn't have enough features yet."]

radio_quote = 'Reginald went on to ask "Is it snowing where you are, Mr. Thiessen?"'

If your text includes both single and double quotes, use the corresponding escape sequence.

In [6]:
# Example of an escape sequence for a single quote.
print('Chief Brody said "You\'re going to need a bigger boat."')

Chief Brody said "You're going to need a bigger boat."
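The same sentence can also be written with double-quote delimiters, in which case it is the double quotes inside the string that need escaping. A short sketch (the variable names are illustrative):

```python
# Single-quote delimiters: escape the apostrophe with \'
single_delimited = 'Chief Brody said "You\'re going to need a bigger boat."'

# Double-quote delimiters: escape the inner double quotes with \"
double_delimited = "Chief Brody said \"You're going to need a bigger boat.\""

# Both literals produce exactly the same string
print(single_delimited == double_delimited)
```

Which form to use is mostly a matter of readability: choose the delimiter that requires the fewest escape sequences.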


### B. Newline Sequence
We can cause a string to print on more than one line by inserting a newline character. We insert a newline character with a newline escape sequence. 

In [7]:
multi_line_string = "This is line 1.\nThis is line 2."
multi_line_string

'This is line 1.\nThis is line 2.'

So far, the results are not very impressive. The notebook is just displaying the newline character in the string. The outcome is better if we pass the string to the `print()` function.

In [8]:
print(multi_line_string)

This is line 1.
This is line 2.


Here is another example of newline sequences. Do note that escape sequences use the *back* slash, which should be above your keyboard's ENTER key. The *forward* slash (below the question mark) won't work.

In [9]:
robot_poem_verse2 = (
    "My brother built that robot\n"
    "to help us clean our room,\n"
    "instead, it ate the dust pan\n"
    "and attacked us with the broom,\n"
    "it pulled apart our pillows,\n"
    "it disheveled both our beds,\n"
    "it took a box of crayons\n"
    "and it doodled on our heads.\n"
)
print(robot_poem_verse2)

My brother built that robot
to help us clean our room,
instead, it ate the dust pan
and attacked us with the broom,
it pulled apart our pillows,
it disheveled both our beds,
it took a box of crayons
and it doodled on our heads.



The example above demonstrates a couple other useful techniques:
* Parentheses allow us to split long Python statements onto several lines. The result looks much nicer than long lines that require horizontal scrolling!
* We can join string literals in Python into a longer string by just typing them next to each other, with only whitespace in between. This only works with string *literals*. If you try to join a string literal and a variable containing a string using this syntax, you'll get a `SyntaxError`.

In [10]:
# Causes an error

print(multi_line_string "\nThis is line 3.")

SyntaxError: invalid syntax (Temp/ipykernel_15364/3301733843.py, line 3)
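The error can be avoided by joining the variable and the literal with an explicit operator. Here is a minimal sketch using the `+` operator, which is reviewed later in this session:

```python
multi_line_string = "This is line 1.\nThis is line 2."

# A variable and a literal must be joined explicitly, for example with +
three_lines = multi_line_string + "\nThis is line 3."
print(three_lines)
```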

### C. Horizontal Tab
The horizontal tab sequence, `\t`, helps us indent or align text. 

In [11]:
print("Team Number\t", "Team Name\t\t", "Rookie Year")
print("360\t\t", "The Revolution\t\t", "2000")
print("488\t\t", "Team XBot\t\t", "2000")
print("492\t\t", "Titan Robotics Club\t", "2001")
print("568\t\t", "Nerds of the North\t", "2001")

Team Number	 Team Name		 Rookie Year
360		 The Revolution		 2000
488		 Team XBot		 2000
492		 Titan Robotics Club	 2001
568		 Nerds of the North	 2001


The first text to be displayed after a horizontal tab escape sequence (`\t`) will be placed at the next tab stop. On most systems the tab stops are about eight characters apart.
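To see how tab stops work without relying on a particular terminal, we can ask Python to replace each tab with the spaces needed to reach the next stop. The sketch below uses the string method `.expandtabs()` and assumes tab stops every eight characters; the variable names are illustrative.

```python
# "360" is 3 characters long, so 5 spaces are needed to reach column 8
short_row = "360\tThe Revolution".expandtabs(8)

# "Team Number" is 11 characters long, so the next stop is column 16
long_row = "Team Number\tTeam Name".expandtabs(8)

print(repr(short_row))
print(repr(long_row))
```

This also explains why the table above needs two tabs after short entries but only one after long entries: the number of tab stops already passed depends on the length of the preceding text.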

The concept of a tab stop comes from manual typewriters. If you wanted to type a document with data in columns, you would place metal clips onto a tab bar within the typewriter. The position of the clip corresponded to a horizontal position on the page. When you hit the tab key on the typewriter, the typewriter carriage would slide to the left until it hit the mechanical tab stop. The tab stops are visible in the image below - they are the dark grey rectangular clips.

![Tab Stops2](images/tab_stops2.jpg)

### D. Backslash
If you want your string to include a backslash, you need to escape it with another backslash.

In [12]:
print("This string contains a backslash: \\")

This string contains a backslash: \


### E. Raw Strings
If you have a string that contains a lot of backslashes, it's often easier to use a raw string instead of escaping all of the backslashes. Raw strings are created by prefixing a string literal with the letter `r`. The catch is that escape sequences don't work in a raw string.

In [18]:
# Raw Strings
print(r"This is a raw string. None of these escape sequences work: \n \t \\")

This is a raw string. None of these escape sequences work: \n \t \\
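Raw strings are handy for text such as Windows file paths. The sketch below compares a raw string to the equivalent escaped string; the path is hypothetical and exists only for illustration.

```python
# Hypothetical Windows-style path written two ways
escaped_path = "C:\\robotics\\new_data\\teams.csv"
raw_path = r"C:\robotics\new_data\teams.csv"

print(raw_path)
print(escaped_path == raw_path)

# In a normal string \n is one character; in a raw string it is two
print(len("\n"), len(r"\n"))
```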


### F. Exercises

**Ex. III.1** Use escape sequences to create a string that looks like this when it's printed:
```
/\
\/  /\
	\/
```
Don't use a multi-line string. Apart from the two spaces in the second line, the string should contain only escape sequences and the forward slash ("/").

In [23]:
# Exercise III.1
# The third line starts with a horizontal tab escape sequence
ex1_string = "/\\\n\\/  /\\\n\t\\/"
print(ex1_string)

/\
\/  /\
	\/


## IV. Concatenating Strings and Variables
Concatenating is a fancy word for sticking two things together, and it is often used in programming to refer to the act of making a larger string out of two or more smaller strings. There are several ways to assemble larger strings from smaller strings, including the `print()` function, the plus (`+`) operator, F-strings, and the `.join()` method.

### A. Concatenating with the `print()` Function
The print function can be used to convert all of its arguments to strings and concatenate them. By default, each argument is separated by a space.

In [24]:
# Joining strings with print()
name = "George Washington Carver"
year = "1864"

print(name, "was born in", year, ".")

George Washington Carver was born in 1864 .


The `print()` function accepts the keyword arguments `sep` and `end`. We can use these arguments to change the `print()` function's behavior.

The `sep` parameter specifies the string that is used to separate the arguments that are passed to the `print()` function. For example, the following code separates items in the print statement with a hyphen:

In [25]:
# Specifying a different separator
print(1, 2, 3, 4, 5, sep="-")

1-2-3-4-5


Or we could print each element on a separate line.

In [27]:
# Printing each argument on a different line
print(1, 2, 3, 4, 5, sep="\n")

1
2
3
4
5


The `end` parameter specifies the string that is placed at the end of the `print()` function's output. The default is a newline character.

In [28]:
print(1, 2, 3, 4, 5, end=": And that's all there is!")

1 2 3 4 5: And that's all there is!

Setting `end` to a space allows successive print statements to all print on the same line.

In [29]:
print(1, end = " ")
print(2, end = " ")
print(3, end = " ")
print(4, end = " ")
print(5, end = " ")

1 2 3 4 5 

Using the `print()` function to join strings has a significant shortcoming. The strings can only be sent to standard output. They cannot be saved to a variable. Check out the code below.

In [30]:
string_var = print(name, "was born in", year, ".")
print(string_var)

George Washington Carver was born in 1864 .
None


When we try to print `string_var`, all we see is the value `None`. This is because the `print()` function does not return the string that it creates. It returns the `None` value instead. We'll need to use other techniques if we want to save a concatenated string to a variable.

### B. The `+` Operator
First, we'll review how to use the plus (`+`) operator.

In [31]:
# Joining strings with `+`
name = "Gladys West"
year = "1930"

statement = name + " was born in " + year + "."
print(statement)

Gladys West was born in 1930.


Note that the `+` operator only works if both of its operands (i.e., the items on either side of the `+`) are strings. The following statement results in an error -- can you see why?

In [32]:
# Joining strings with `+`
name = "Satyendra Nath Bose"
year = 1894

statement = name + " was born in " + year + "."
print(statement)

TypeError: can only concatenate str (not "int") to str

The `+` operator behaves differently depending on the data type of its operands. If both operands are numeric, it adds both operands and returns the sum. If both operands are strings, it creates a longer string by joining both strings. But if one operand is numeric and one is a string, the `+` operator doesn't know what you want to do, so it raises a `TypeError`.
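One way to fix the failing statement is to convert the integer to a string explicitly with the built-in `str()` function before concatenating. A minimal sketch:

```python
name = "Satyendra Nath Bose"
year = 1894

# str() converts the integer to a string, so both operands of + are strings
statement = name + " was born in " + str(year) + "."
print(statement)
```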

### C. F-Strings and the `.format()` Method
Assembling strings from multiple substrings with the `+` operator can get tedious. In Python 3.6 or later, F-strings can be used to insert multiple values into a string with less hassle.

Python 3.6 was released in December 2016. First, let's verify that our Python version is at least 3.6.

In [33]:
import sys

# Verify Python version 
print(sys.version)

3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]


Now let's check out our first F-string.

In [34]:
import math

# F-String example
num = 2
f"The square root of {num} is {math.sqrt(num)}."

'The square root of 2 is 1.4142135623730951.'

F-strings are created by prefixing a string literal with the letter `f`. The curly braces and everything inside them are called replacement fields. The text inside each replacement field is interpreted as a Python expression. In the previous example, the value of the variable `num` was converted to a string and inserted into the string. In the second replacement field, we calculated the square root of 2.

F-strings work with double quotes, single quotes, and multi-line strings.
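Here is a short sketch of the same F-string written with each kind of delimiter. The variable names are illustrative.

```python
num = 2

double_quoted = f"{num} squared is {num ** 2}"
single_quoted = f'{num} squared is {num ** 2}'
multi_line = f"""{num} squared
is {num ** 2}"""

print(double_quoted)
print(single_quoted)
print(multi_line)
```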

But what if you are using an earlier version of Python? Or you just don't like adding an `f` to the front of your strings? In that case, use the string object's `.format()` method.

In [35]:
# .format() method
"The square root of {} is {}".format(num, math.sqrt(num))

'The square root of 2 is 1.4142135623730951'

When using the `.format()` method, you create replacement fields by entering pairs of curly braces in the string, and then call the `.format()` method on the string literal. The arguments that are passed to the `.format()` method are inserted into the replacement fields, with the first argument being placed in the first replacement field, the second argument in the second field, etc.

The `.format()` method also supports named replacement fields, where the name within the curly braces corresponds to a named argument passed to the `.format()` method. When using named replacement fields, the order of arguments to the `.format()` method does not matter.

In [36]:
# Using .format() with named arguments
"The square root of {num} is {sqrt}".format(sqrt=math.sqrt(3), num=3)

'The square root of 3 is 1.7320508075688772'
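Replacement fields can also be numbered, which is not covered above but may appear in other people's code. The number is the position of the argument passed to `.format()`, starting at zero, and the same argument can be used more than once. A brief sketch with made-up values:

```python
# {0} refers to the first argument and {1} to the second
numbered = "{0} times {0} is {1}".format(3, 9)
print(numbered)
```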

F-strings provide numerous formatting options. For example, suppose we want only three digits after the decimal point in the square root value, and we want the number 2 to be padded with zeros so it always fills three spaces:

In [37]:
f"The square root of {num:03} is {math.sqrt(2):.3f}."

'The square root of 002 is 1.414.'

In the replacement fields, everything after the colon is part of the format specification, or format_spec for short. In the first field, the format_spec `03` specifies that the number should take up at least three characters and be padded with zeros if the number is less than three characters. The second format_spec `.3f` indicates that there should be three digits to the right of the decimal and that the number should be formatted as a fixed point value.
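The same format_specs are not limited to F-strings. As a sketch, the built-in `format()` function and the `.format()` method accept them too; the variable names below are illustrative.

```python
import math

# The built-in format() function takes a value and a format_spec
padded = format(2, "03")
rounded = format(math.sqrt(2), ".3f")

# The .format() method puts the format_spec after a colon, just like an F-string
combined = "{:03} and {:.3f}".format(2, math.sqrt(2))

print(padded, rounded)
print(combined)
```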

There are numerous options for format_specs. The official reference is here: https://docs.python.org/3/library/string.html#formatspec. Additional guidance is available at https://www.w3schools.com/python/ref_string_format.asp and https://realpython.com/python-formatted-output/#f-string-formatting.

Here are some additional examples of format specifications.

In [38]:
# Scientific Notation: 'E' is for scientific notation
team_num = 1318
print(f"{team_num} in scientific notation is {team_num:.3E}")

# Use a comma as a digits separator
print()
print(f"Our team number squared is: {1318**2:,}")

# Display a number as binary, octal, or hexadecimal
print(f"""
Team number: \t\t\t 1318
Binary team number: \t\t {1318:b}
Octal team number: \t\t {1318:o}
Hexadecimal team number: \t {1318:x}
""")

# Aligning and Padding Strings
# '>' means right-align, '^' means center
#   (and '<' means left align, but it's not used in this example)
# The number to the right of '>' or '^' specifies the width of the field.
# The character to the left of '^' specifies what character to use to pad
#   the field. The default is a blank space.
print()
teams = [("Team #", "Team Name"), (492, "Nerds of the North"), (1318, "The Issaquah Robotics Society"),
         (2046, "Bear Metal"), (4131, "Iron Patriots")]
for team in teams:
    print(f"{team[0]:>6}\t\t{team[1]:_^32}")

1318 in scientific notation is 1.318E+03

Our team number squared is: 1,737,124

Team number: 			 1318
Binary team number: 		 10100100110
Octal team number: 		 2446
Hexadecimal team number: 	 526


Team #		___________Team Name____________
   492		_______Nerds of the North_______
  1318		_The Issaquah Robotics Society__
  2046		___________Bear Metal___________
  4131		_________Iron Patriots__________


### D. The `.join()` Method

Suppose you need to incrementally build a string in steps. You might be tempted to write code that looks like this:

In [39]:
the_Count = ""
for idx in range(1, 4):
    the_Count = the_Count + str(idx) + " cookies, ah ah ahhhhh!\n"

print(the_Count)

1 cookies, ah ah ahhhhh!
2 cookies, ah ah ahhhhh!
3 cookies, ah ah ahhhhh!



Using a for loop to build a string like this works fine, as long as our strings are short and there are not a huge number of iterations in our loop. The problem with this approach is that strings are immutable. What's actually happening is that every time we go through the loop, Python has to throw away the string currently stored in `the_Count` and allocate memory for a completely new string. It's as if you are assigned to write a term paper. You won't be able to write it all in one sitting because it's a long paper. After you write the first few paragraphs, instead of saving the document, you print it out and then delete the document file. The next time you sit down to work on the term paper, you use the hardcopy to re-type everything you've written so far, and only after that do you start composing new paragraphs. As you can see, this is very inefficient.

Even though a for loop will work fine for most of our robotics projects, knowledgeable programmers who review your code will notice, and we all want to make a good impression!

The `.join()` method can help us out in this situation.

In [40]:
# Create list of strings
str_list = [str(idx) + " cookies, ah ah ahhhhh!" for idx in range(1, 4)]
print("List of Strings")
print(str_list)

# Use .join() to create one string from entire list
print()
print("Joined Strings")
print("\n".join(str_list))

List of Strings
['1 cookies, ah ah ahhhhh!', '2 cookies, ah ah ahhhhh!', '3 cookies, ah ah ahhhhh!']

Joined Strings
1 cookies, ah ah ahhhhh!
2 cookies, ah ah ahhhhh!
3 cookies, ah ah ahhhhh!


The `.join()` method takes an iterable of strings, such as a list, as its argument. You call the `.join()` method on the string that will separate the individual substrings. In the previous example, we called the `.join()` method on a string containing the newline escape sequence, which caused each list element to be printed on a new line. This does seem a bit backwards - calling `.join()` on the list and passing the separator string seems more natural, but would be problematic since lists can contain any type of object, not just strings.
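Because `.join()` only accepts strings, passing it a list of integers raises a `TypeError`. The sketch below shows the error and one common fix, converting each element with `str()` first. The variable names are illustrative.

```python
numbers = [1, 2, 3]

try:
    "-".join(numbers)          # the elements are ints, not strings
except TypeError as err:
    print("TypeError:", err)

# Convert each element to a string before joining
joined = "-".join(str(n) for n in numbers)
print(joined)
```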

The `.join()` method is more efficient because it only creates a new string once. It's as if, for your term paper, you wrote and saved each section in its own file, and then copied the contents of each file into the final document just once, after all sections were written.
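When the pieces cannot be produced by a single list comprehension, a common pattern is to append each piece to a list inside the loop and call `.join()` once at the end. This is a sketch of that pattern, with illustrative variable names:

```python
lines = []
for idx in range(1, 4):
    # Appending to a list does not copy the text collected so far
    lines.append(f"{idx} cookies, ah ah ahhhhh!")

# The final string is created only once
the_count = "\n".join(lines)
print(the_count)
```

Unlike the earlier loop that used `+`, this version does not leave a newline character at the end of the string, because `.join()` only places the separator between elements.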

Here are examples of using `.join()` with different separators.

In [41]:
# No separation between list elements
print("".join([str(i) for i in range(5)]))

# Separate list elements with a space
print(" ".join([str(i) for i in range(5)]))

# Separate list elements with the string " and a "
print(" and a ".join([str(i) for i in range(1, 5)]))

01234
0 1 2 3 4
1 and a 2 and a 3 and a 4


### E. The `%` Operator
Python provides an older technique for string formatting that uses the `%` operator and is similar to what's used in the C language. The mentor thinks it's better to use F-strings or the `.format()` method to format strings, but you might see the `%` operator technique if you are studying other people's code (which is a great way to learn!). We'll discuss it briefly so you'll know what it is if you see it.

In [42]:
ai_researcher = "Cynthia Breazeal"
birth_year = 1967

print("%s was born in %d." % (ai_researcher, birth_year))

Cynthia Breazeal was born in 1967.


In this type of string formatting, replacement fields are denoted by a percent sign followed by the letter s, d, f, or x.
* `%s` is for string values
* `%d` is for integers
* `%f` is for floating point numbers
* `%.<number of digits>f` is for floating point numbers with a fixed number of digits
* `%x` and `%X` are for integers in hexadecimal representation

The format string is followed by the `%` operator, which is followed by a tuple of values. The length, order, and datatypes of the tuple must match the replacement fields in the format string.
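To help with reading older code, here is a sketch that shows a few `%`-style format strings next to an equivalent F-string. The numeric values are made up for illustration.

```python
ai_researcher = "Cynthia Breazeal"
birth_year = 1967

# The same sentence in the old style and as an F-string
old_style = "%s was born in %d." % (ai_researcher, birth_year)
new_style = f"{ai_researcher} was born in {birth_year}."
print(old_style == new_style)

# Fixed number of digits after the decimal point
print("%.2f" % 3.14159)

# Lowercase and uppercase hexadecimal
print("%x and %X" % (255, 255))
```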

### F. Exercises

**Ex IV.1:** Use a for-loop, F-strings, and a print statement to create a table of the integers from 1 to 10, their square roots, and their natural logarithms.
* Use the format string and escape sequences to create the columns. Each call to `print()` should have only one argument.
* The first column will contain the integers.
* The second column will contain the square roots.
* The third column will contain the natural logarithms.
* Each square root and logarithm should have 3 digits to the right of the decimal place, even if it's an integer.
* There is a function to calculate logarithms in the standard library's *math* module.

In [27]:
# Ex IV.1
import math

print("Integers\tSquare Roots\tNatural Log of Integers")
for i in range(1, 11):
    print(f"{i:.3f}\t\t{math.sqrt(i):.3f}\t\t{math.log(i):.3f}")

Integers	Square Roots	Natural Log of Integers
1.000		1.000		0.000
2.000		1.414		0.693
3.000		1.732		1.099
4.000		2.000		1.386
5.000		2.236		1.609
6.000		2.449		1.792
7.000		2.646		1.946
8.000		2.828		2.079
9.000		3.000		2.197
10.000		3.162		2.303


**Ex IV.2:** Repeat exercise IV.1, but use the `.format()` method instead of an F-string.

In [30]:
# Ex IV.2
import math

print("Integers\tSquare Roots\tNatural Log of Integers")
for i in range(1, 11):
    print("{:.3f}\t\t{:.3f}\t\t{:.3f}".format(i, math.sqrt(i), math.log(i)))


Integers	Square Roots	Natural Log of Integers
1.000		1.000		0.000
2.000		1.414		0.693
3.000		1.732		1.099
4.000		2.000		1.386
5.000		2.236		1.609
6.000		2.449		1.792
7.000		2.646		1.946
8.000		2.828		2.079
9.000		3.000		2.197
10.000		3.162		2.303


**Ex IV.3:** Use the `.join()` method and list comprehensions to convert the `teams` data to a comma separated value string.
* First, call `.join()` within a list comprehension to create a list of strings, where each string consists of the items in the tuple joined with a comma.
* Pass the list created with the list comprehension to `.join()` again to join the strings with newline characters.

The result should look like this:
`'Team #,Team Name\n492,Nerds of the North\n1318,The Issaquah Robotics Society\n2046,Bear Metal\n4131,Iron Patriots'`

In [84]:
# Ex IV.3
teams = [("Team #", "Team Name"), ("492", "Nerds of the North"), ("1318", "The Issaquah Robotics Society"),
         ("2046", "Bear Metal"), ("4131", "Iron Patriots")]

# The inner .join() builds one comma-separated row from each tuple;
# the outer .join() connects the rows with newline characters.
csv_string = "\n".join([",".join(team) for team in teams])
print(csv_string)

Team #,Team Name
492,Nerds of the North
1318,The Issaquah Robotics Society
2046,Bear Metal
4131,Iron Patriots


**Ex IV.4:** Use a for loop and F-strings to print the first 5 rows of star data in tabular format.
* Pad each column with spaces so it's the same width.
* Right justify all numeric columns.
* Center all categorical columns.
* Remember what the `<`, `^`, and `>` characters do in the format_spec. Here is a link to the documentation: https://docs.python.org/3/library/string.html#format-specification-mini-language

In [92]:
# Ex IV.4, run cell to get star classification data
import csv
with open("data/stars.csv", "rt") as sfile:
    stars = list(csv.DictReader(sfile))

print("Number of stars:", len(stars))
stars[:2]

Number of stars: 240


[{'Temperature': '3068',
  'L': '0.0024',
  'R': '0.17',
  'A_M': '16.12',
  'Color': 'Red',
  'Spectral_Class': 'M',
  'Type': '0'},
 {'Temperature': '3042',
  'L': '0.0005',
  'R': '0.1542',
  'A_M': '16.6',
  'Color': 'Red',
  'Spectral_Class': 'M',
  'Type': '0'}]

In [118]:
# Ex IV.4 Place your code in this cell
# '>' right-justifies the numeric columns; '^' centers the categorical columns
print(f"{'Temp':>6} {'L':>9} {'R':>7} {'A_M':>6} {'Type':^6} {'Color':^7} {'Spectral_Class':^14}")
for star in stars[:5]:
    print(f"{star['Temperature']:>6} {float(star['L']):>9.6f} {float(star['R']):>7.4f} "
          f"{float(star['A_M']):>6.2f} {star['Type']:^6} {star['Color']:^7} {star['Spectral_Class']:^14}")


  Temp         L       R    A_M  Type   Color  Spectral_Class
  3068  0.002400  0.1700  16.12   0      Red         M       
  3042  0.000500  0.1542  16.60   0      Red         M       
  2600  0.000300  0.1020  18.70   0      Red         M       
  2800  0.000200  0.1600  16.65   0      Red         M       
  1939  0.000138  0.1030  20.06   0      Red         M       

## V. Text Encoding
In this next section, we'll take a break from Python (mostly) and discuss how textual information is stored on files and in computer memory. The act of converting text to a binary format that can be stored by a computer is called encoding, and decoding happens when you convert the binary information back into text. You don't have to think about encoding or decoding for most programming tasks. But occasionally, a problem with encoding or decoding will pop up and your program won't work until you figure it out.
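As a small preview of what encoding and decoding look like in Python, here is a sketch that uses the string method `.encode()` and the bytes method `.decode()`. The choice of UTF-8 and the sample text are assumptions made for illustration.

```python
text = "Robot"

# Encoding converts text to bytes
encoded = text.encode("utf-8")
print(encoded)
print(list(encoded))   # the numeric value of each byte

# Decoding converts the bytes back to text
decoded = encoded.decode("utf-8")
print(decoded)
```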

### A. Integer Representations: Binary to Hexadecimal
The most common methods for expressing integers when programming are decimal, binary, octal, and hexadecimal.

#### Decimal
Decimal is simply the base 10 numbering system that we've all been using since kindergarten. 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... you get the idea.

#### Binary
Understanding binary notation is useful because internally, all data in computer memory, and in the processor, and on the hard drive, etc., is represented in a binary format. A single digit in binary, either a 0 or a 1, is physically manifested in RAM with a transistor and a capacitor. When the capacitor is charged up, that means the memory cell is storing a 1. If the capacitor is discharged, that means the memory cell is storing a zero. This is an over-simplification, but it's still a useful mental model.

If we were to count from zero to seven in binary, it would look like this: 000, 001, 010, 011, 100, 101, 110, 111. We don't generally write out binary numbers in our code because long binary numbers are difficult for us humans to interpret. Binary numbers are worthy of attention because they help us understand what's happening under the hood.

If you must convert a decimal number to binary, Python can do that for you. You can use the built-in `bin()` function, or a format_spec ending in `b`.

In [119]:
# Printing Binary Number with the bin() function
print("Decimal 5 in binary:\t", bin(5))

# Using a format specification
print(f"{7:b}")

# Binary int literals start with `0b`
print(0b1001)

# Convert binary string to decimal integer
int('1001', 2)

Decimal 5 in binary:	 0b101
111
9


9

#### Octal
Octal refers to a numbering system that uses only the digits 0 through 7 (base 8). Counting to ten in octal looks like this: 0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12 (if you see an 8 or a 9 digit, you know it's not octal). Octal was commonly used in the 1960s because 6-bit, 12-bit, 24-bit, and 36-bit architectures were common. When we say a computer uses an n-bit architecture, we're saying that the computer's processor is able to store and manipulate n-bit numbers. So if we say a computer has a 12-bit architecture, for example, we're saying that the processor can store, add, divide, multiply, save to memory, and do other operations on 12-bit chunks of data. If the 12-bit chunk of data were to represent an integer, then the processor would be able to handle unsigned integers up to $2^{12} - 1$, or up to 4095. Octal numbers require three bits per digit, so octal representations worked well for these architectures, which all use multiples of three bits.
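The arithmetic for a 12-bit architecture can be checked with a few lines of Python. This sketch also shows why octal fits such architectures: twelve bits correspond to exactly four octal digits. The variable names are illustrative.

```python
bits = 12

# Largest unsigned integer that fits in 12 bits
largest = 2 ** bits - 1
print(largest)

# In binary it is twelve 1s; in octal it is four 7s (three bits per digit)
print(f"{largest:b}")
print(f"{largest:o}")
```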

Since the 1960s, architectures that use multiples of eight bits have become commonplace. Consequently, octal numbers are not frequently used. Nevertheless, Python can work with octal numbers.

In [120]:
# Printing an octal number with the oct() function
print(oct(9))

# Using a format specification
print(f"{10:o}")

# Octal int literals start with `0o` (zero-Oh)
print(0o20)

# Convert octal string to decimal integer
int('20', 8)

0o11
12
16


16

#### Hexadecimal
We discussed binary numbers because understanding binary representations of data is necessary to understand how computers work. We discussed octal numbers because references to octal numbers appear in documentation for programming languages. We've finally progressed to discussing hexadecimal numbers. Hexadecimal numbers are used frequently in computer programming, so being able to use and interpret hexadecimal numbers is useful.

Hexadecimal numbers are base 16, meaning sixteen different symbols are used to represent an integer. Since we only have ten digits in our traditional number system, hexadecimal numbers use the letters A through F in addition to the digits zero through nine. Counting to twenty in hexadecimal looks like this: 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, 10, 11, 12, 13, 14. In hexadecimal, A is decimal 10, B is decimal 11, F is decimal 15, 10 is decimal 16, and 100 is $16^2$, or decimal 256.

Python supports hexadecimal similarly to how it supports octal and binary.

In [121]:
# Printing a hexadecimal number with the hex() function
print(hex(255))

# Using a format specification
print(f"{255:x}")

# Hexadecimal int literals start with `0x` (zero-x)
print(0xff)

# Convert hex string to decimal integer
int('ff', 16)

0xff
ff
255


255

There are good reasons why computer programmers use a wacky number system like hexadecimal. Modern computer systems process data in multiples of eight bits, or one byte. One byte can represent the numbers zero through $2^8 - 1 = 255$. The number 255 in hexadecimal is FF. Two hexadecimal digits can perfectly represent the contents of a byte in memory. Typical computers use a 64-bit architecture, which is eight bytes. The contents of an eight-byte value can be perfectly represented by sixteen hexadecimal digits.
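The relationship between bytes and hexadecimal digits can be verified directly. This is a sketch with illustrative variable names:

```python
# The largest value in one byte needs exactly two hexadecimal digits
largest_byte = 2 ** 8 - 1
print(f"{largest_byte:02X}")

# The largest 64-bit value needs exactly sixteen hexadecimal digits
largest_64_bit = 2 ** 64 - 1
hex_digits = f"{largest_64_bit:X}"
print(hex_digits, len(hex_digits))
```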

It's pretty easy to convert from binary to hexadecimal and back. Review the following example and see if you can guess how to convert between binary and hexadecimal.

In [122]:
# Converting between hexadecimal and binary
print(f"Decimal 100 in hexadecimal notation:\t\t\t{100:x}")
print()
print(f"Hexadecimal 6 in binary, padded to 4 digits:\t\t{0x6:04b}")
print(f"Hexadecimal 4 in binary, padded to 4 digits:\t\t{0x4:04b}")
print()
print(f"Concatenated binary representations for 0x6 and 0x4:\t{0x6:04b}{0x4:04b}")
print(f"Decimal 100 in binary, padded to 8 digits:\t\t{100:08b}")

Decimal 100 in hexadecimal notation:			64

Hexadecimal 6 in binary, padded to 4 digits:		0110
Hexadecimal 4 in binary, padded to 4 digits:		0100

Concatenated binary representations for 0x6 and 0x4:	01100100
Decimal 100 in binary, padded to 8 digits:		01100100


As you may have guessed by reviewing the preceding example, to convert from binary to hexadecimal:
1. Split the binary number into chunks of four bits, starting from the least significant bit on the right. Add 0s to the left side of the number as needed to get to a multiple of four bits.
2. For each chunk of four bits, determine the numeric value of the 4-bit chunk, ignoring its position within the overall binary number and ignoring all other 4-bit chunks. For example, the numeric value of 0001 is 1, 1000 is 8, 1110 is E, and 1111 is F, regardless of where the 4-bit chunk occurs within the larger binary number.
3. Replace the chunk with the single hexadecimal digit that corresponds to the value of the 4-bit chunk.

To convert from hexadecimal to binary, just replace each hexadecimal digit with its four-digit binary equivalent. For example, to convert hex 8A to binary, first consider that 8 in hex is 1000 in binary and A is 1010 in binary. Therefore 8A is 10001010 in binary.

In the preceding example, we split the binary number 01100100 into two chunks of 0110 and 0100. The chunk 0110 is 6 in hexadecimal and 0100 is 4. Therefore 64 in hex = 01100100 in binary.
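
The chunking procedure can be sketched in a few lines of Python. The helper names `binary_to_hex` and `hex_to_binary` are made up for this illustration. They work on strings of digits rather than integers so that the four-bit chunks stay visible.

```python
def binary_to_hex(bits):
    """Convert a string of binary digits to hexadecimal, four bits at a time."""
    # Pad on the left with zeros to reach a multiple of four bits
    padded_len = -(-len(bits) // 4) * 4
    bits = bits.zfill(padded_len)
    # Split into 4-bit chunks and replace each chunk with one hex digit
    chunks = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(f"{int(chunk, 2):X}" for chunk in chunks)


def hex_to_binary(hex_digits):
    """Replace each hexadecimal digit with its four-digit binary equivalent."""
    return "".join(f"{int(digit, 16):04b}" for digit in hex_digits)


print(binary_to_hex("1100100"))   # seven bits, padded to 01100100 first
print(hex_to_binary("8A"))
```

In practice the built-in `int()`, `hex()`, and format specifications do the same job. The sketch only spells out the steps.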

To convert hexadecimal to decimal, determine the position $p$ of each digit, with the rightmost digit starting at position 0. Multiply each digit by $16^p$ and sum the results. For example, to convert 64 in hex to decimal, multiply 6 times 16 and add 4: $6 \times 16^1 + 4 \times 16^0 = 96 + 4 = 100$
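
As a rough sketch, the positional sum can be written as a loop. The function name `hex_to_decimal` is hypothetical. In real code, `int(text, 16)` does the same thing.

```python
def hex_to_decimal(hex_digits):
    """Sum digit * 16**position, with position 0 at the rightmost digit."""
    total = 0
    for position, digit in enumerate(reversed(hex_digits)):
        # int(digit, 16) turns a single hex digit such as 'F' into its value, 15
        total += int(digit, 16) * 16 ** position
    return total


print(hex_to_decimal("64"))   # 6 * 16 + 4
print(hex_to_decimal("FF"))   # 15 * 16 + 15
```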

There are many situations where hexadecimal is useful. For example:
* Hexadecimal codes are used to show extended characters and symbols that are not available on a typical keyboard. We'll cover this in more detail in the Unicode section of this notebook.
* Colors are frequently specified using hexadecimal notation. <span style="color:#b7a57a;background-color:#4b2e83">For example, the six-digit hex numbers b7a57a and 4b2e83 both represent colors that can be used on the web.</span>
* If you were to inspect the raw contents of a binary file, like a JPEG image file, the contents would be displayed in a mixture of hexadecimal and printable ASCII characters. The example below displays the first 6 bytes of an image file.

In [1]:
with open("images/tab_stops1.jpg", "rb") as binary_file:
    jpg_bytes = binary_file.read(6)
print(jpg_bytes)

b'\xff\xd8\xff\xdb\x00C'


Most of the bytes are displayed with two hexadecimal characters preceded by '\x'. The sixth byte just happens to correspond to the ASCII code for the capital letter 'C'. In such cases Python will display the printable ASCII character instead of the hexadecimal number. This can be useful when a binary file contains segments of text data. Use the `.hex()` method to see all content in hexadecimal.

In [2]:
jpg_bytes.hex()

'ffd8ffdb0043'

By the way, JPEG files always start with the bytes FF and D8.
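
Here is a small sketch of how that fact could be checked in code. To stay self-contained it rebuilds the six bytes shown above from their hexadecimal string instead of reading the image file. The variable name `header` is just for illustration.

```python
# The six bytes shown above, rebuilt from their hexadecimal representation
header = bytes.fromhex("ffd8ffdb0043")

# JPEG data begins with the two bytes FF D8
print(header.startswith(b"\xff\xd8"))

# Indexing a bytes object returns an integer, shown here as two hex digits
print(f"{header[0]:02x} {header[1]:02x}")
```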

### B. ASCII

#### History and Overview
Some form of text encoding has been around since the middle of the 19th century, due to Samuel Morse's invention of the telegraph. One of the first applications of electrical engineering was to build machines that could automate transmission of messages by Morse code, and these machines needed to convert text to and from binary formats. These machines evolved into teletypes (also called teleprinters), which are machines that can send textual information via a phone line or other electrical connections. An early teletype from 1855 is pictured below.

![Hughes Teleprinter from 1855](images/1024px-Printing_Telegraph.jpg)

Early computer scientists chose to adapt the encoding systems used for teletypes for use in early computer systems. ASCII, or *American Standard Code for Information Interchange*, is an encoding system that was developed in the 1960s, gained widespread use, and was incorporated into many subsequent encoding schemes. ASCII encodes text into 7-bit binary integers and is capable of encoding $2^7$ or 128 different characters.

Here is an ASCII chart:
```
Dec: Encoded character as a decimal (base 10) number
Char: The encoded character.

Dec  Char                           Dec  Char     Dec  Char     Dec  Char
---------                           ---------     ---------     ----------
  0  NUL (null)                      32  SPACE     64  @         96  `
  1  SOH (start of heading)          33  !         65  A         97  a
  2  STX (start of text)             34  "         66  B         98  b
  3  ETX (end of text)               35  #         67  C         99  c
  4  EOT (end of transmission)       36  $         68  D        100  d
  5  ENQ (enquiry)                   37  %         69  E        101  e
  6  ACK (acknowledge)               38  &         70  F        102  f
  7  BEL (bell)                      39  '         71  G        103  g
  8  BS  (backspace)                 40  (         72  H        104  h
  9  TAB (horizontal tab)            41  )         73  I        105  i
 10  LF  (NL line feed, new line)    42  *         74  J        106  j
 11  VT  (vertical tab)              43  +         75  K        107  k
 12  FF  (NP form feed, new page)    44  ,         76  L        108  l
 13  CR  (carriage return)           45  -         77  M        109  m
 14  SO  (shift out)                 46  .         78  N        110  n
 15  SI  (shift in)                  47  /         79  O        111  o
 16  DLE (data link escape)          48  0         80  P        112  p
 17  DC1 (device control 1)          49  1         81  Q        113  q
 18  DC2 (device control 2)          50  2         82  R        114  r
 19  DC3 (device control 3)          51  3         83  S        115  s
 20  DC4 (device control 4)          52  4         84  T        116  t
 21  NAK (negative acknowledge)      53  5         85  U        117  u
 22  SYN (synchronous idle)          54  6         86  V        118  v
 23  ETB (end of trans. block)       55  7         87  W        119  w
 24  CAN (cancel)                    56  8         88  X        120  x
 25  EM  (end of medium)             57  9         89  Y        121  y
 26  SUB (substitute)                58  :         90  Z        122  z
 27  ESC (escape)                    59  ;         91  [        123  {
 28  FS  (file separator)            60  <         92  \        124  |
 29  GS  (group separator)           61  =         93  ]        125  }
 30  RS  (record separator)          62  >         94  ^        126  ~
 31  US  (unit separator)            63  ?         95  _        127  DEL
```

ASCII was originally designed not for computers but for teletypes, which is why the first 32 characters (codes 0 through 31) are control codes, most of which are no longer in common use, or are no longer used for their original purpose. Some of the escape sequences we used in section III correspond to these control codes, including "\t" for the horizontal tab (9) and "\n" for a new line (10). Uppercase letters are encoded as the numbers 65 - 90, lowercase as 97 - 122, and digits as 48 - 57, with numerous punctuation characters filling in the other spots. A numerical value that is used to represent a character is often called a *code point*.

Python provides two built-in functions for converting between a character and its ASCII code: `chr()` and `ord()`.

In [3]:
# Using chr() to convert from ASCII codes to characters
for num in [73, 82, 83, 33]:
    print(chr(num), sep="", end="")

IRS!

In [4]:
# Use ord() to convert from characters to ASCII codes
for char in "!\"#\n123\tABC abc{}":
    print(char, ord(char))

! 33
" 34
# 35

 10
1 49
2 50
3 51
	 9
A 65
B 66
C 67
  32
a 97
b 98
c 99
{ 123
} 125


From the preceding printout, can you see which numbers correspond to the newline, tab, and space characters?

#### Why Only 7 Bits?
If you were reading closely, you might have noticed that ASCII encodes characters into 7-bit integers. But modern-day computers manage and address memory using bytes, which are eight bits. So why didn't the creators of ASCII just use eight bits for their encoding scheme? There are a couple likely reasons:
* Back in the 1960s, communications bandwidth was a minuscule fraction of what it is now. Encoding characters in 8-bit integers would have required more time to send a message between two locations.
* It was not obvious in the 1960s that computer systems would all converge on the 8-bit byte as the standard memory unit.

Here are a few ASCII codes displayed in binary. Count the number of digits -- you'll see that there are only seven.

In [5]:
for char in "ABC":
    print(f"{char}:\t{ord(char):b}")

A:	1000001
B:	1000010
C:	1000011


### C. 8-Bit Encodings
By the 1980s, even though ASCII requires only 7 bits to encode a character, most computers used a full byte to store each character. Computer and software companies could have just set the first, most significant bit to 0, but instead they added another 128 characters and created an 8-bit encoding scheme. The first 128 characters were identical to ASCII, with a zero in the most significant bit, and the new characters all had a one in the most significant bit. The most common 8-bit encoding scheme is CP1252. Microsoft created the CP1252 encoding scheme for the first version of Microsoft Windows, which was released in 1985. CP1252 added characters for other Western European languages, like æ and ü. CP1252 can be used for Danish, Finnish, French, Norwegian, Portuguese, and many other languages. You might also come across an encoding called ISO 8859-1, which is almost identical to CP1252.
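
The short sketch below illustrates both points using Python's `.encode()` string method, which is covered in more detail later in this section. The non-ASCII characters each fit in a single byte whose most significant bit is 1, and the euro sign is one example of a character that CP1252 has but ISO 8859-1 lacks.

```python
# Characters outside ASCII fit in one byte, and that byte is 128 or greater
for char in "æü":
    encoded = char.encode("cp1252")
    print(char, encoded, f"{encoded[0]:08b}")

# One place where the two encodings differ: the euro sign
print("€".encode("cp1252"))
try:
    "€".encode("iso-8859-1")
except UnicodeEncodeError:
    print("ISO 8859-1 has no euro sign")
```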

If you are using a Windows computer, it's likely that CP1252 is the default text encoding. Run the following code cell to see your system's preferred encoding.

In [6]:
# Get the default encoding for your system
import locale
print("Default Text Encoding:", locale.getpreferredencoding())

Default Text Encoding: cp1252


If you are running this notebook on a Linux or macOS operating system, then the default encoding should be UTF-8, which we will discuss momentarily.

### D. Unicode
#### The Unicode Standard
There are a whole lot of people in the world who speak languages that cannot be represented with ASCII, CP1252, or ISO 8859-1. Wouldn't it be great if there were one text encoding scheme that could handle all of the world's languages? We're in luck! The Unicode standard, which was released in the early 1990s, has room for 1,114,112 different code points (the integers 0 through 1,114,111). That's enough to cover all of the European languages including Greek and Russian, as well as Chinese, Japanese, Korean, Devanagari, and much more. Currently, only 143,696 of the possible characters and symbols have been defined. Each character has been mapped to a different code point (i.e., integer). Conveniently, the characters corresponding to the code points 0 through 127 are the same for ASCII and Unicode, so ASCII text can be read as Unicode.

Python 3 (but not Python 2) is fully compatible with Unicode. Use the `"\uxxxx"` escape sequence in a literal string, where `xxxx` is the Unicode code point in hexadecimal notation that corresponds to the desired character. For code points with more than four hexadecimal digits, use the `\Uxxxxxxxx` escape sequence. All four or all eight digits must be specified in the escape sequence, so add zeros to the front of the number as needed. All Unicode characters have string names, so you can also print the character with the escape sequence `\N{character name}`. Here are some examples:

In [7]:
# Unicode examples
print("Hello!:", "\u3053\u3093\u306B\u3061\u306f")
print("Helpful for card games:", "\u2666 \u2663 \u2665 \u2660")
print("Chess anyone?", "\u2654 \u265F")
print("Unicode has emojis? Are you kidding me?", "\N{smirking face} \N{grinning face} \U0001F44D")
print("Math Symbols: \N{integral} \N{partial differential}x\\\N{partial differential}y, \U0001D6BA")
print("STEM Stuff: \N{rocket} \N{microscope} \N{robot face} ")

Hello!: こんにちは
Helpful for card games: ♦ ♣ ♥ ♠
Chess anyone? ♔ ♟
Unicode has emojis? Are you kidding me? 😏 😀 👍
Math Symbols: ∫ ∂x\∂y, 𝚺
STEM Stuff: 🚀 🔬 🤖 


There are several websites where you can look up the code for any Unicode character. http://unicode-table.com is one such site. Or you can download PDF Unicode charts from https://unicode.org/charts/.
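
Python's standard library can also do the lookup without leaving the notebook. The sketch below uses the `unicodedata` module. The three sample characters are arbitrary choices for illustration.

```python
import unicodedata

for char in "Σ♞🤖":
    # ord() gives the code point; unicodedata.name() gives the official name
    print(f"U+{ord(char):04X}", unicodedata.name(char))

# unicodedata.lookup() goes the other way, from a name to a character
print(unicodedata.lookup("ROCKET"))
```

The code points are printed in hexadecimal, which is the form that the `\u` and `\U` escape sequences expect.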

#### Unicode Encodings: UTF-8, UTF-16, and UTF-32
Technically, Unicode is not an encoding; it's an abstract standard that maps characters to code points but does not specify how those integers should be represented as bytes. UTF-8, UTF-16, and UTF-32 are encoding standards that specify a binary encoding for each Unicode code point. UTF-8 is by far the most frequently used and is a good encoding standard to use by default. It's a variable-length encoding that uses one, two, three, or four bytes to represent a code point. UTF-16 is also variable-length, but it uses either two or four bytes per code point. UTF-32 always uses four bytes for each code point.

So if UTF-8 can use different numbers of bytes to represent a single character, how does Python or any other software know where one character stops and the next character begins within a stream of bytes? It's actually quite simple. The first few bits of the first byte specify how many bytes are used. The following table shows the number of bytes used for specific ranges of code points. The table also shows the bit-by-bit structure of each byte. Bits represented with a 1 or a 0 are always a 1 or a 0 respectively. Bits represented with an x can be either a 1 or a 0.

| First code point | Last code point |   Byte 1 |   Byte 2 |   Byte 3 |   Byte 4 |
|------------------|-----------------|----------|----------|----------|----------|
|           U+0000 |          U+007F | 0xxxxxxx |    n/a   |    n/a   |   n/a    |
|           U+0080 |          U+07FF | 110xxxxx | 10xxxxxx |    n/a   |   n/a    |
|           U+0800 |          U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx |   n/a    |
|          U+10000 |        U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

* A byte with a 0 for the most significant bit always specifies a single-byte character, with the code point contained in the following seven bits.
* A byte that starts with 10 is never the first byte of a character -- it's always a continuation byte.
* A byte that starts with 110 is always the first byte of a two-byte character, which leaves 11 bits to encode the code point (five in the first byte, and six in the second).
* A byte that starts with 1110 is always the first byte of a three-byte character, which leaves 16 bits for the code point.
* A byte that starts with 11110 is always the first byte of a four-byte character, which leaves 21 bits for the code point.
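
To make the bit layout concrete, here is a minimal sketch that builds the UTF-8 bytes for a code point by hand, following the table above, and compares the result with Python's own encoder. The function name `utf8_encode` is hypothetical, and the sketch does no validation (for example, it does not reject code points above U+10FFFF).

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 by following the table above."""
    if code_point <= 0x7F:
        # One byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


for char in "Aü♥🚀":
    manual = utf8_encode(ord(char))
    bits = " ".join(f"{byte:08b}" for byte in manual)
    # The last column shows whether the hand-built bytes match Python's encoder
    print(char, bits, manual == char.encode("utf-8"))
```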

A string object's `.encode()` method can be used to convert a string to a sequence of bytes. Different encodings for the letter 'A' are displayed below.

In [8]:
# Single Byte Encodings
print("A in ASCII: ", "A".encode("ASCII"))
print("A in UTF-8: ", "A".encode("UTF8"))
print("A in CP1252:", "A".encode("CP1252"))

# Multi Byte Encodings
print("A in UTF-16:", "A".encode("UTF-16-LE"))
print("A in UTF-32:", "A".encode("UTF-32-LE"))

A in ASCII:  b'A'
A in UTF-8:  b'A'
A in CP1252: b'A'
A in UTF-16: b'A\x00'
A in UTF-32: b'A\x00\x00\x00'
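
The letter 'A' is the easy case. The sketch below, which uses a few arbitrarily chosen characters, counts how many bytes each encoding needs as the code points get larger, and shows that `.decode()` reverses `.encode()`.

```python
for char in "Aü♥🚀":
    # Number of bytes needed for this character in each encoding
    sizes = {enc: len(char.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(char, sizes)

# .decode() reverses .encode(), as long as the same encoding is used
print(b"\xe2\x99\xa5".decode("utf-8"))
```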


### E. Exercises

**Ex V.1.** Convert the numbers as specified. Do the conversions manually - don't use Python.

In [None]:
# Ex V.1
# Convert the following hexadecimal digits to four-digit binary numbers, i.e., 1 would be converted to 0001. 
  # 3: 0011
  # 4: 0100
  # 9: 1001
  # B: 1011
  # E: 1110
  # F: 1111

# Convert the following decimal numbers to hexadecimal:
  # 12: C
  # 15: F
  # 16: 10
  # 32: 20
  # 99: 63
  # 186: BA
  # 254: FE
    
# Convert the following hexadecimal numbers to decimal
  # 80: 128
  # AA: 170
  # DF: 223
  # BAD: 2989

**Ex V.2.** 
1. Use a list comprehension, the built-in `chr()` function, and the `.join()` method to convert the list of integers into a readable string.
    * Hint: You can call `.join()` on an empty string, i.e., `""`.
2. Compare the first few characters of the output string to the ASCII chart above. Do the integers in the list match up with the ASCII chart?
3. There is one non-ASCII character in the text. Can you find it? What is it?
4. Convert the decimal code point from part 3 to hexadecimal. Look up the code point online. What is the official Unicode name of the non-ASCII character?
    * Hint: It's easy to look up Unicode code points online. But remember, the code points below are all decimal. If you type a decimal code point into a search tool that expects hexadecimal numbers, you might get the wrong answer.

In [37]:
# Ex V.2
int_list = [34, 65, 110, 121, 32, 102, 111, 111, 108, 32, 99, 97, 110, 32, 119, 114, 105, 116, 101, 32, 99, 111, 100,
            101, 32, 116, 104, 97, 116, 32, 97, 32, 99, 111, 109, 112, 117, 116, 101, 114, 32, 99, 97, 110, 32, 117,
            110, 100, 101, 114, 115, 116, 97, 110, 100, 46, 32, 71, 111, 111, 100, 32, 112, 114, 111, 103, 114, 97,
            109, 109, 101, 114, 115, 32, 119, 114, 105, 116, 101, 32, 99, 111, 100, 101, 32, 116, 104, 97, 116, 32,
            104, 117, 109, 97, 110, 115, 32, 99, 97, 110, 32, 117, 110, 100, 101, 114, 115, 116, 97, 110, 100, 46, 34,
            32, 8211, 32, 77, 97, 114, 116, 105, 110, 32, 70, 111, 119, 108, 101, 114]
"".join([str(chr(num)) for num in int_list])

'"Any fool can write code that a computer can understand. Good programmers write code that humans can understand." – Martin Fowler'

**Ex V.3.** Use an online Unicode code point lookup tool to display the Unicode characters named below. Use the escape sequences for code points, either `"\Uxxxxxxxx"` or `"\uxxxx"`.

In [59]:
# Ex V.3
  # A. Display symbols for DNA, a wrench, a roll of toilet paper, and a water pistol
print("\U0001F9EC\U0001F527\U0001F9FB\U0001F52B")

  # B. Display the following uppercase letters: the Greek letter Sigma, the Russian letter Ya (the backwards R), and A in Devanagari script.
print("\u03A3\u042F\u0905")


🧬🔧🧻🔫
ΣЯअ


**Ex V.4.** Display the UTF-8 byte sequence in binary and hexadecimal notation for the symbol named *MUSICAL SYMBOL QUARTER NOTE*. Refer to the UTF-8 encoding table above.

In [64]:
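# Ex V.4
# "binary" is not a text encoding, so encode to UTF-8 and format the bytes
note_bytes = "\N{MUSICAL SYMBOL QUARTER NOTE}".encode("utf-8")
print("Hexadecimal:", note_bytes.hex())
print("Binary:     ", " ".join(f"{byte:08b}" for byte in note_bytes))

Hexadecimal: f09d859f
Binary:      11110000 10011101 10000101 10011111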

## VI. String Methods
Python string objects provide several methods that are useful for manipulating text. The official documentation for Python's string methods is here: https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str
 
This section provides examples of many of Python's string methods, but not all. Check the documentation to see all available string methods.

### A. String Slicing
OK, string slicing doesn't use methods, but we'll review it in this section anyway. Other programming languages, such as R, Java, JavaScript, Swift, and C#, typically use functions or methods to extract substrings. As we discussed way back in session 1, Python extracts substrings using integer indexes, similar to how elements are extracted from lists. In Python, a set of start, stop, and step indices that are joined with colons (e.g., `27:20:-1`) is called a slice.

In [60]:
phrase = "The good thing about Science is that it's true whether or not you believe in it."

print("First character:\t", phrase[0])
print("Last character:\t\t", phrase[-1])
print("First 14 characters:\t", phrase[:14])
print("Last 14 characters:\t", phrase[-14:])
print("Word from the middle:\t", phrase[21:28])
print("Backwards word:\t\t", phrase[27:20:-1])
print("Every third character:", phrase[::3])
print()
print("Quotation by Neil deGrasse Tyson")

First character:	 T
Last character:		 .
First 14 characters:	 The good thing
Last 14 characters:	 believe in it.
Word from the middle:	 Science
Backwards word:		 ecneicS
Every third character: T otnauSeesh 'tehh  tobient

Quotation by Neil deGrasse Tyson


### B. Case Methods
There are several methods for converting between upper and lower case. The `.casefold()` method is similar to `.lower()` and is used to compare strings in such a way that two strings will be considered equal if they differ only by case. Some languages other than English have case distinctions that are not handled correctly by the `.lower()` method.

In [65]:
print("Original phrase:\t\t", phrase)
print("Swapped case:\t\t\t", phrase.swapcase())
print("Upper case:\t\t\t", phrase.upper())
print("Lower case:\t\t\t", phrase.lower())
print("Casefold:\t\t\t", phrase.casefold())
print("Title case:\t\t\t", phrase.title())
print("Capitalize first letter:\t", "capitalized phrase.".capitalize())

Original phrase:		 The good thing about Science is that it's true whether or not you believe in it.
Swapped case:			 tHE GOOD THING ABOUT sCIENCE IS THAT IT'S TRUE WHETHER OR NOT YOU BELIEVE IN IT.
Upper case:			 THE GOOD THING ABOUT SCIENCE IS THAT IT'S TRUE WHETHER OR NOT YOU BELIEVE IN IT.
Lower case:			 the good thing about science is that it's true whether or not you believe in it.
Casefold:			 the good thing about science is that it's true whether or not you believe in it.
Title case:			 The Good Thing About Science Is That It'S True Whether Or Not You Believe In It.
Capitalize first letter:	 Capitalized phrase.
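
For English text `.lower()` and `.casefold()` give the same result, as the output above shows. The German letter ß is a standard example where they differ. The sketch below uses made-up variable names to compare the two methods.

```python
german = "Straße"       # "street", spelled with the letter ß
shouted = "STRASSE"     # the same word in capitals, where ß is written SS

# .lower() leaves ß alone, so the two spellings do not match
print(german.lower() == shouted.lower())

# .casefold() converts ß to ss, so the comparison succeeds
print(german.casefold() == shouted.casefold())
```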


### C. Alignment Methods
The string methods `.ljust()`, `.center()`, and `.rjust()` can be used to pad strings with spaces on the left or right side, or both. The first parameter for these methods is the desired length of the padded string. We'll use these methods to format some data into a table with left justified, centered, and right justified columns.

In [67]:
colors = [("lavender", 0xE6E6FA, (230, 230, 250)),  ("thistle", 0xD8BFD8, (216, 191, 216)), ("plum", 0xDDA0DD, (221, 160, 221)),
          ("violet", 0xEE82EE, (238, 130, 238)), ("orchid", 0xDA70D6, (218, 112, 214)), ("fuchsia", 0xFF00FF, (255, 0, 255)),
          ("magenta", 0xFF00FF, (255, 0, 255)), ("mediumorchid", 0xBA55D3, (186, 85, 211)), ("mediumpurple", 0x9370DB, (147, 112, 219)),
          ("blueviolet", 0x8A2BE2, (138, 43, 226)), ("darkviolet", 0x9400D3, (148, 0, 211)), ("darkorchid", 0x9932CC, (153, 50, 204)),
          ("darkmagenta", 0x8B008B, (139, 0, 139)), ("purple", 0x800080, (128, 0, 128)), ("indigo", 0x4B0082, (75, 0, 130))]

In [102]:
# Use .ljust(), .center(), and .rjust() string methods to display a table

# Print Column Headers
print("HTML / CSS Name".rjust(15),
      "RGB Values".center(20),
      "Hex Code".ljust(12),
      sep = " | ")
print("-"*15, "-"*20, "-"*12, sep = "-|-")

# Print Each Row
for color in colors:
    print(color[0].rjust(15),
          str(color[2]).center(20),
          f"{color[1]:x}".ljust(12),
          sep = " | ")

HTML / CSS Name |      RGB Values      | Hex Code    
----------------|----------------------|-------------
       lavender |   (230, 230, 250)    | e6e6fa      
        thistle |   (216, 191, 216)    | d8bfd8      
           plum |   (221, 160, 221)    | dda0dd      
         violet |   (238, 130, 238)    | ee82ee      
         orchid |   (218, 112, 214)    | da70d6      
        fuchsia |    (255, 0, 255)     | ff00ff      
        magenta |    (255, 0, 255)     | ff00ff      
   mediumorchid |    (186, 85, 211)    | ba55d3      
   mediumpurple |   (147, 112, 219)    | 9370db      
     blueviolet |    (138, 43, 226)    | 8a2be2      
     darkviolet |    (148, 0, 211)     | 9400d3      
     darkorchid |    (153, 50, 204)    | 9932cc      
    darkmagenta |    (139, 0, 139)     | 8b008b      
         purple |    (128, 0, 128)     | 800080      
         indigo |     (75, 0, 130)     | 4b0082      
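
Two related features are worth a quick sketch. The alignment methods accept an optional second argument that sets the fill character, and the `<`, `^`, and `>` format specifications do the same kind of padding inside an f-string. The variable `name` is just a sample value.

```python
name = "plum"

# A second argument replaces the space as the fill character
print(name.ljust(12, "."))
print(name.center(12, "*"))
print(name.rjust(12, "-"))

# The <, ^, and > format specifications align text inside an f-string
print(f"[{name:<12}] [{name:^12}] [{name:>12}]")
```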


### D. Finding and Replacing Sub-strings
The string methods `.find()`, `.rfind()`, `.startswith()`, `.endswith()`, and `.count()` can be used to check for occurrences of a substring within a larger string.

In [75]:
verse3 = """
That robot seemed relentless 
as it tied our socks in knots, 
then clunked into the kitchen 
and dismantled pans and pots, 
the thing was not behaving 
in the fashion we had planned. 
it clanked into the bathroom 
and it filled the tub with sand."""

# Return the position of the first occurrence of a substring
#   within a larger string
print("Position of 'robot':", verse3.find("robot"))
print("First position of 'it':", verse3.find("it"))
print("Last position of 'it':", verse3.rfind("it"))
print()

# Check start and end of string
print("Does string start with \\n?:", verse3.startswith("\n"))
print("Does string end with 'Sand.'?:", verse3.endswith("Sand."))
print()

# Count occurrences of a substring
print("Number of occurrences of 'in':", verse3.count("in"))
print("Number of occurrences of 'Kraken':", verse3.count("Kraken"))

Position of 'robot': 6
First position of 'it': 34
Last position of 'it': 238

Does string start with \n?: True
Does string end with 'Sand.'?: False

Number of occurrences of 'in': 6
Number of occurrences of 'Kraken': 0
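
One detail the example above does not show is what `.find()` does when the substring is missing. The sketch below uses a single line of the verse, stored in a variable named `line` for illustration.

```python
line = "it clanked into the bathroom"

# .find() signals "not found" with -1, not with an error
print(line.find("Kraken"))

# The in operator is the usual way to ask a yes-or-no question
print("clanked" in line)

# .index() behaves like .find() but raises ValueError when nothing is found
try:
    line.index("Kraken")
except ValueError as error:
    print("ValueError:", error)
```

Because -1 is also a valid index (the last character), it's worth checking the result of `.find()` before using it in a slice.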


The `.replace()` method replaces all occurrences of a substring with whatever string is specified.

In [76]:
# Replacing substrings
print(verse3.replace(" it ", " Ultron "))



That robot seemed relentless 
as Ultron tied our socks in knots, 
then clunked into the kitchen 
and dismantled pans and pots, 
the thing was not behaving 
in the fashion we had planned. 
it clanked into the bathroom 
and Ultron filled the tub with sand.
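
`.replace()` also accepts an optional third argument that limits the number of replacements. A short sketch, using a made-up sentence:

```python
line = "and it filled the tub with sand, and it left"

# A third argument limits how many occurrences are replaced
print(line.replace(" it ", " Ultron ", 1))

# Strings are immutable, so the original is unchanged
print(line)
```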


### E. Splitting and Joining Strings
The `.split()` method splits a string into a list of substrings. In natural language processing, splitting a string into words is called tokenization. The `.join()` method joins all of the strings in a list into a single string, using a separator string specified by the user.

In [77]:
# Split string into a list of individual words
print("Words in verse 3 of 'My Brother Built a Robot'")
v3_words = verse3.split()
print(v3_words)
print()

print("Number of words in verse 3:", len(v3_words))
print("Number of unique words:", len(set(v3_words)))

# Rejoin words with different punctuation
"...".join(v3_words)

Words in verse 3 of 'My Brother Built a Robot'
['That', 'robot', 'seemed', 'relentless', 'as', 'it', 'tied', 'our', 'socks', 'in', 'knots,', 'then', 'clunked', 'into', 'the', 'kitchen', 'and', 'dismantled', 'pans', 'and', 'pots,', 'the', 'thing', 'was', 'not', 'behaving', 'in', 'the', 'fashion', 'we', 'had', 'planned.', 'it', 'clanked', 'into', 'the', 'bathroom', 'and', 'it', 'filled', 'the', 'tub', 'with', 'sand.']

Number of words in verse 3: 44
Number of unique words: 34


'That...robot...seemed...relentless...as...it...tied...our...socks...in...knots,...then...clunked...into...the...kitchen...and...dismantled...pans...and...pots,...the...thing...was...not...behaving...in...the...fashion...we...had...planned....it...clanked...into...the...bathroom...and...it...filled...the...tub...with...sand.'

The `.splitlines()` method splits a string into separate lines, breaking the string at the newline character, carriage return, and a few other characters. Check the documentation for the complete list.

In [78]:
# Splitting a string into lines
verse3.splitlines()

['',
 'That robot seemed relentless ',
 'as it tied our socks in knots, ',
 'then clunked into the kitchen ',
 'and dismantled pans and pots, ',
 'the thing was not behaving ',
 'in the fashion we had planned. ',
 'it clanked into the bathroom ',
 'and it filled the tub with sand.']
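
Both methods take optional arguments. The sketch below shows `.split()` with an explicit separator and a `maxsplit` limit, and `.splitlines()` with `keepends=True`. The sample strings are invented for illustration.

```python
record = "1318,4089,8059"

# Passing a separator splits on that exact string instead of on whitespace
print(record.split(","))

# maxsplit limits the number of splits, counted from the left
print(record.split(",", maxsplit=1))

# keepends=True preserves the line-break characters
print("first\nsecond\r\nthird".splitlines(keepends=True))
```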

### F. Dealing with White Space
The `.lstrip()` method removes whitespace or other characters from the beginning of a string. The `.rstrip()` method removes whitespace from the end, and `.strip()` removes whitespace from both ends.

In [79]:
# Stripping Whitespace
"\t\t  Stripping Whitespace! \n  \n".strip()

'Stripping Whitespace!'
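
The "other characters" are passed as an optional argument. Here is a small sketch. Note that the argument is treated as a set of characters to remove, not as a prefix or suffix.

```python
# Remove asterisks from both ends
print("***[plum]***".strip("*"))

# Remove any mix of asterisks and square brackets from both ends
print("***[plum]***".strip("*[]"))

# .lstrip() and .rstrip() only touch one end
print("  padded  ".lstrip() + "|")
print("  padded  ".rstrip() + "|")
```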

### G. Testing Strings
Python provides several methods that will test if a string meets certain conditions, like if it contains only digits, alphanumeric characters, etc. The methods return either `True` or `False`.

In [80]:
# Check if a string consists only of letters and numbers
print("Is verse 3 completely alphanumeric?".ljust(40), verse3.isalnum())
print("Is 'verse3' alphanumeric:".ljust(40), "verse3".isalnum())
print()
print("Does '3.14159' contain only digits?".ljust(40), "3.14159".isdigit())
print("Is '1318' numeric?".ljust(40), "1318".isnumeric())
print()
print("IS THIS UPPERCASE?".ljust(40), "IS THIS UPPERCASE?".isupper())

Is verse 3 completely alphanumeric?      False
Is 'verse3' alphanumeric:                True

Does '3.14159' contain only digits?      False
Is '1318' numeric?                       True

IS THIS UPPERCASE?                       True
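
The numeric tests are easy to mix up. The sketch below compares `.isdecimal()`, `.isdigit()`, and `.isnumeric()` on a few sample strings. The helper name `digit_tests` is hypothetical. Note that none of these methods accept a decimal point or a minus sign.

```python
def digit_tests(text):
    """Return the results of the three numeric string tests as a tuple."""
    return text.isdecimal(), text.isdigit(), text.isnumeric()


print("text".ljust(12), "(isdecimal, isdigit, isnumeric)")
for text in ["1318", "²", "½", "3.14159", "-5"]:
    print(repr(text).ljust(12), digit_tests(text))
```

To check whether a string holds a valid number, a common approach is to call `int()` or `float()` inside a `try` block.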


### H. String Constants
As documented in https://docs.python.org/3/library/string.html, the `string` module provides several constants that can come in handy. A few examples are shown below.

In [81]:
import string
print("ASCII Letters:", string.ascii_letters)
print("ASCII Lowercase:", string.ascii_lowercase)
print("Hex Digits:", string.hexdigits)
print("Punctuation Symbols", string.punctuation)

ASCII Letters: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
ASCII Lowercase: abcdefghijklmnopqrstuvwxyz
Hex Digits: 0123456789abcdefABCDEF
Punctuation Symbols !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
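
As a rough sketch of how these constants might be used, the hypothetical helper below checks whether a string is a valid hexadecimal number, and the last line strips punctuation from the ends of a word.

```python
import string


def is_hex_string(text):
    """Return True if text is non-empty and contains only hexadecimal digits."""
    return len(text) > 0 and all(char in string.hexdigits for char in text)


print(is_hex_string("4b2e83"), is_hex_string("b7a57z"))

# Strip punctuation from both ends of a word
print("knots,".strip(string.punctuation))
```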


### I. Exercises

**Ex. VI.1** For security reasons, companies usually set minimum length and complexity requirements for passwords used to log into accounts. Write a function that uses string slicing and string methods, and checks whether a string meets the following password requirements:
* At least two characters must be upper case
* At least two characters must be lower case
* At least two characters must be punctuation symbols
* At least two characters must be digits
* The password must be between 8 and 20 characters long

The function should return True or False.

In [91]:
# Ex VI.1
import string

def check_pwd(pwd):
    # Count the characters in each required category
    n_upper = sum(char.isupper() for char in pwd)
    n_lower = sum(char.islower() for char in pwd)
    n_punct = sum(char in string.punctuation for char in pwd)
    n_digit = sum(char.isdigit() for char in pwd)

    # Every category needs at least two characters,
    #   and the length must be between 8 and 20
    return (min(n_upper, n_lower, n_punct, n_digit) >= 2
            and 8 <= len(pwd) <= 20)

# Test the function by running this statement.
# Results should be True True True False False False
print(check_pwd("A\u042Fbb99{}"), check_pwd("ABcd56()"), check_pwd("ABc\u03C356!'"),
     check_pwd("ABcd56%&...................."), check_pwd("ABc56()"), check_pwd("ABcd56*"))

True True True False False False


**Ex. VI.2**
There is a json file that contains a match schedule for a robotics competition at `data/wasno2020.json`.
1. Use the `json` module to read the data from the file.
2. Print a table with 3 columns, one for the match number, one for the red alliance teams, and one for the blue alliance teams.
3. The table should have column labels.
4. The data in each column should be padded to the same width and have consistent alignment (left, right, or center).

In [278]:
# Ex VI.2
import json

# The with statement closes the file automatically after reading.
with open("data/wasno2020.json") as file:
    comp = json.load(file)
print("Match Number".rjust(20),
      "Red Alliance".center(20),
      "Blue Alliance".ljust(20),
      sep = " | ")
print("-"*20, "-"*20, "-"*20, sep = "-|-")
for mtch in comp["Schedule"]:
    print((str(mtch["matchNumber"])).rjust(20), "|" ,end = " ")
    for team in mtch["teams"]:
        if team["station"].startswith("Red"):
            print((str(team["teamNumber"])).center(6), end = " ")
    print(end = "| ")
    for team in mtch["teams"]:
        if team["station"].startswith("Blue"):
            print((str(team["teamNumber"])).ljust(6), end = " ")
    print()
comp

        Match Number |     Red Alliance     | Blue Alliance       
---------------------|----------------------|---------------------
                   1 |  4131   4683   2412  | 1318   4089   8059   
                   2 |  4205   4180   2910  | 3826   2928   7118   
                   3 |  4918   4173   4513  | 4512   1294   4309   
                   4 |  2522   5588   8032  | 1778   2903   7627   
                   5 |  4915   492    949   | 4682   3070   8248   
                   6 |  2930   4681   4911  | 1899   3268   7461   
                   7 |  2976   4131   4089  | 4309   1778   2910   
                   8 |  8248   1294   4918  | 4180   5588   4682   
                   9 |  2930   3268   3826  | 8032   1318   4512   
                  10 |  1899   4915   8059  | 7461   2412   4513   
                  11 |  2928   4681   4205  | 2522   2976   3070   
                  12 |  492    7118   7627  | 4911   2903   4173   
                  13 |  4683   4512   7461  | 949 

{'Schedule': [{'description': 'Qualification 1',
   'field': 'Primary',
   'tournamentLevel': 'Qualification',
   'startTime': '2020-02-29T11:00:00',
   'matchNumber': 1,
   'teams': [{'teamNumber': 4131, 'station': 'Red1', 'surrogate': False},
    {'teamNumber': 4683, 'station': 'Red2', 'surrogate': False},
    {'teamNumber': 2412, 'station': 'Red3', 'surrogate': False},
    {'teamNumber': 1318, 'station': 'Blue1', 'surrogate': False},
    {'teamNumber': 4089, 'station': 'Blue2', 'surrogate': False},
    {'teamNumber': 8059, 'station': 'Blue3', 'surrogate': False}]},
  {'description': 'Qualification 2',
   'field': 'Primary',
   'tournamentLevel': 'Qualification',
   'startTime': '2020-02-29T11:09:00',
   'matchNumber': 2,
   'teams': [{'teamNumber': 4205, 'station': 'Red1', 'surrogate': False},
    {'teamNumber': 4180, 'station': 'Red2', 'surrogate': False},
    {'teamNumber': 2910, 'station': 'Red3', 'surrogate': False},
    {'teamNumber': 3826, 'station': 'Blue1', 'surrogate': Fals
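
The same kind of table can also be built with f-string format specifications instead of the `.rjust()`, `.center()`, and `.ljust()` methods. The sketch below is only an illustration under a few assumptions: `sample` is a hand-copied excerpt of match 1 from the data shown above (the exercise itself reads the full file), and `alliance()` and `schedule_rows()` are hypothetical helper names that are not part of the exercise.

```python
# Hand-copied excerpt of the schedule data (match 1 only). The real
# exercise reads the full schedule from data/wasno2020.json.
sample = {"Schedule": [
    {"matchNumber": 1,
     "teams": [{"teamNumber": 4131, "station": "Red1"},
               {"teamNumber": 4683, "station": "Red2"},
               {"teamNumber": 2412, "station": "Red3"},
               {"teamNumber": 1318, "station": "Blue1"},
               {"teamNumber": 4089, "station": "Blue2"},
               {"teamNumber": 8059, "station": "Blue3"}]}]}

def alliance(match, color):
    """Return one alliance's team numbers as a single padded string."""
    return " ".join(f"{team['teamNumber']:<5}" for team in match["teams"]
                    if team["station"].startswith(color))

def schedule_rows(schedule):
    """Build the table as a list of lines, one line per row."""
    rows = [f"{'Match':>5} | {'Red Alliance':<17} | {'Blue Alliance':<17}",
            f"{'-' * 5}-|-{'-' * 17}-|-{'-' * 17}"]
    for match in schedule["Schedule"]:
        rows.append(f"{match['matchNumber']:>5} | "
                    f"{alliance(match, 'Red'):<17} | "
                    f"{alliance(match, 'Blue'):<17}")
    return rows

print("\n".join(schedule_rows(sample)))
```

Because every cell is padded to a fixed width (`>5` or `<17`), the header, the separator, and each data row come out the same length.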

**Ex. VI.3** Capitalize the first letter of each line in the verse below.
* Consider using the `.splitlines()` and `.join()` methods.

In [179]:
# Ex VI.3
v4 = """
We tried to disconnect it, 
but it was to no avail, 
it picked us up and dropped us 
in an empty garbage pail, 
we cannot stop that robot, 
for we’re stymied by one hitch…. 
my brother didn’t bother 
to equip it with a switch.
"""
lines = v4.splitlines()
vList = []
for line in lines:
    vList.append(line.capitalize())
v5 = "\n".join(vList)
print(v5)


We tried to disconnect it, 
But it was to no avail, 
It picked us up and dropped us 
In an empty garbage pail, 
We cannot stop that robot, 
For we’re stymied by one hitch…. 
My brother didn’t bother 
To equip it with a switch.
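
One caveat with `.capitalize()`: besides upper-casing the first character, it also lower-cases the rest of the string. That makes no difference for this verse, but it would for a line that contains a name or an acronym. Here is a sketch of an alternative that uses slicing to change only the first character. The sample text is made up for illustration.

```python
text = "my brother joined FRC,\nthen he built a robot."

# .capitalize() changes the rest of each line to lower case.
lowered = "\n".join(line.capitalize() for line in text.splitlines())

# Slicing leaves everything after the first character untouched.
# line[:1] is safe even when the line is empty.
kept = "\n".join(line[:1].upper() + line[1:] for line in text.splitlines())

print(lowered)
print(kept)
```

The first result turns "FRC" into "frc", while the second keeps it as written.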


**Ex. VI.4** Replace the word "robot" in the verse from exercise VI.3 with "machine".

In [182]:
# Ex VI.4
v6 = v5.replace("robot", "machine")
print(v6)



We tried to disconnect it, 
But it was to no avail, 
It picked us up and dropped us 
In an empty garbage pail, 
We cannot stop that machine, 
For we’re stymied by one hitch…. 
My brother didn’t bother 
To equip it with a switch.


## VII. Quiz
Answer the following questions by typing the answers as comments in the code block below each question.

**#1.** Why does the line below throw an error? How could it be fixed?  
`print("The cube of 5 is " + 5**3)`

In [187]:
# Q.1
# The + operator can only concatenate a str with another str, not a str with an int.
# One fix is to replace the "+" with a "," and remove the space after "is".
# Another fix is to convert the number first: str(5**3).

The cube of 5 is 125
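
For comparison, here is a sketch of three ways the statement could be repaired using techniques from this session. The variable names are chosen only for this illustration.

```python
# Option 1: convert the integer to a string before concatenating.
fixed_concat = "The cube of 5 is " + str(5**3)

# Option 2: an f-string converts the value automatically.
fixed_fstring = f"The cube of 5 is {5**3}"

print(fixed_concat)
print(fixed_fstring)

# Option 3: pass two arguments; print() separates them with a space.
print("The cube of 5 is", 5**3)
```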


**#2.** Most of the world's languages can be represented with two-byte Unicode sequences. What is the maximum number of code points that could be encoded with a two byte sequence?

In [188]:
# Q.2
# Two bytes hold 16 bits, so at most 2**16 = 65,536 code points.
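
The count follows from the number of bits: two bytes hold 16 bits, and each bit can be either 0 or 1. A quick check of the arithmetic:

```python
bits = 2 * 8             # two bytes, eight bits each
code_points = 2 ** bits  # every bit doubles the number of combinations

print(f"{code_points:,}")        # 65,536
print(f"{code_points - 1:#x}")   # 0xffff, the largest two-byte value
```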


**#3.** What are some common whitespace characters? How can you remove whitespace characters from both ends of a string?

In [None]:
# Q.3
# Common whitespace characters are the space, \n (newline), and \t (tab).
# The .strip() method removes whitespace from both ends of a string.
# The .lstrip() and .rstrip() methods remove it from only the left or right end.
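
A short sketch of the three stripping methods on a made-up string. `repr()` is used so the remaining whitespace is visible in the output.

```python
raw = "\t  Team 1318 \n"

print(repr(raw.strip()))    # both ends:  'Team 1318'
print(repr(raw.lstrip()))   # left only:  'Team 1318 \n'
print(repr(raw.rstrip()))   # right only: '\t  Team 1318'
```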


**#4.** What do the following format strings print, exactly? Try to figure it out without running it.

  A. `f"{1/3:.3%}"`  
  B. `f"{1000**2:.3E}"`  
  C. `f"{1318:x}"`
  D. `f"|{5**3:^7}|"`

In [194]:
# Q.4
#A: 33.333%
#B: 1.000E+06
#C: 526
#D: |  125  |
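
After working the answers out by hand, they can be checked by running the format strings. Each specification does one job: `.3%` multiplies by 100 and shows three decimal places, `.3E` uses scientific notation with three decimal places, `x` converts to lower-case hexadecimal, and `^7` centers the value in a field seven characters wide.

```python
results = [f"{1/3:.3%}", f"{1000**2:.3E}", f"{1318:x}", f"|{5**3:^7}|"]

# Label each result with the letter of the matching question part.
for label, text in zip("ABCD", results):
    print(f"{label}: {text}")
```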

**#5.** The `.center()`, `.ljust()`, and `.rjust()` string methods all take a second, optional parameter. What is this parameter for? Refer to the [Python Standard Library documentation](https://docs.python.org/3/library/index.html) to answer this question. Look in *Built-in Types->Text Sequence Type*.

In [None]:
# Q.5
# The second, optional parameter is fillchar. It is the character used to pad the string (a space by default).
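
A brief sketch of `fillchar` in use. The team number and the fill characters are arbitrary choices for illustration.

```python
name = "1318"

print(name.center(10, "*"))   # ***1318***
print(name.ljust(10, "."))    # 1318......
print(name.rjust(10, "0"))    # 0000001318
```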

**#6.** The `.strip()` string method has one optional argument. What is it named and what does it do? Refer to the [Python Standard Library documentation](https://docs.python.org/3/library/index.html) to answer this question.

In [None]:
# Q.6
# The optional argument is chars, a string listing the set of characters to remove from both ends of the string.
# If chars is omitted or None, whitespace is removed.
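
Note that `chars` is treated as a set of individual characters, not as a prefix or suffix. A minimal sketch with made-up strings:

```python
stars = "***robot***"
mixed = "xyxRobotyyx"

print(stars.strip("*"))    # robot
# Any leading or trailing "x" or "y" is removed, in any order.
print(mixed.strip("xy"))   # Robot
```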


**#7.** The `.find()` string method has two optional, named arguments. What are they named and what do they do? Refer to the Python Standard Library documentation to answer this question.

In [None]:
# Q.7
# The optional arguments are start and end. They limit the search to the slice [start:end] of the string.
# The first argument, sub, is the substring to search for, and it is required.
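
A short sketch of how `start` and `end` change the result of `.find()`. The sentence is made up for illustration.

```python
verse = "my brother built a robot; the robot went berserk"

first = verse.find("robot")           # search the whole string
second = verse.find("robot", 20)      # start searching at index 20
missing = verse.find("robot", 0, 10)  # search only the slice [0:10]

print(first, second, missing)         # 19 30 -1
```

When the substring does not occur in the searched range, `.find()` returns -1 instead of raising an error.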


## VIII. Save Your Work
Once you have completed the exercises, save a copy of the notebook outside of the git repository (outside of the *pyclass_frc* folder). Include your name in the file name. Send the notebook file to another student to check your answers.

## IX. Concept and Terminology Review
You should be able to define the following terms or describe the concept.
* Escape sequences
* Multi-line strings
* Raw strings
* Newline character
* Tab character
* String concatenation
* F-strings
* `.format()` method
* Format specifications
* `.join()` method
* Binary numbers
* Octal numbers
* Hexadecimal numbers
* ASCII
* Unicode
* UTF-8, UTF-16, and UTF-32
* String slicing
* String methods
* String constants

[Table of Contents](../../index.ipynb)