## Lesson 3 - String Processing

- 3.1 - String Sequence
- 3.2 - Basic Operations of String Object
- 3.3 - Special Characters and Escape Sequence
- 3.4 - Methods and Functions of String Object
- 3.5 - Transforming Object to a String: repr() and str()
- 3.6 - Formatting String Object: format(), %, and f-string
- 3.7 - Using bytes Object
- 3.8 - Application of print()

In everyday business applications, **strings** are one of the most important and most challenging types of data to handle, from user names and file names to general text processing. Python's powerful built-in text processing and formatting tools let us complete these tasks efficiently. In this lesson we discuss some of these tools for processing **string** objects in Python.

String literals can be enclosed in either double or single quotes, although single quotes are more commonly used.  A double-quoted string literal can contain single quotes without any fuss, and likewise a single-quoted string can contain double quotes.  

Reference: https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str

## 3.1 - String Sequence

In the previous lesson, we mentioned that **list** and **tuple** are ordered collections of arbitrary objects, in which the order of the elements is defined when the collection is created. This kind of ordered data type is called a "**sequence**" in Python. The Python **string** object is an immutable sequence of Unicode characters.  

Since a Python **string** is a sequence data type, we can also apply indexing [n] to access a single character or slicing [n:m] to access a sub-part of the string. Python strings are **immutable**, which means they cannot be changed after they are created. Therefore, any extracted or modified characters need to be assigned to a new variable.

e.g. Python strings
``` Python
x = "Hello"
x[0]  # return "H"
x[-1]  # return "o"
x[1:]  # return "ello"
```

In [None]:
# Create the Python string
x = "Hello"
x[0]  # return "H"
x[-1]  # return "o"
x[1:]  # return "ello"

One application of slicing in text processing is removing an unwanted **escape** character from the end of a string, such as "\n" (the newline character) or "\t" (the tab character). This is a useful step when cleaning raw text data.

e.g. 
``` Python
x = "Goodbye\n"
x = x[:-1]
x  # return 'Goodbye'
```

In [None]:
# Take out the escape character
x = "Goodbye\n"
x = x[:-1]
x  # return 'Goodbye'

The previous example is just one way to remove unnecessary characters from a string.  Python has many built-in functions and methods that handle this task.

We can also use the Python built-in len( ) function to count the characters in a string.  For example,
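One such built-in method is `str.rstrip()`. The following is a minimal sketch (the variable names are illustrative only) comparing it with the slice: the slice always drops the last character, whereas `rstrip("\n")` removes trailing newline characters only when they are present.

```python
# Slicing removes the last character whether or not it is a newline.
with_newline = "Goodbye\n"
without_newline = "Goodbye"

sliced_ok = with_newline[:-1]       # 'Goodbye'
sliced_bad = without_newline[:-1]   # 'Goodby' (a real character was lost)

# rstrip("\n") removes trailing newline characters only, if there are any.
stripped_ok = with_newline.rstrip("\n")        # 'Goodbye'
stripped_same = without_newline.rstrip("\n")   # 'Goodbye'

print(repr(sliced_bad), repr(stripped_same))
```

When it is not certain that the text ends with a newline, the method is usually the safer choice.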

``` Python
len("Goodbye")  # return 7
```



In [None]:
# Using the len( ) function on string
len("Goodbye")  # return 7

Note: Do not confuse a list with a string. The most obvious difference between the two is that a string is **immutable**, mainly for performance reasons. Any attempt to modify a string in place raises an error. For instance,

``` Python
string = "Hello"
string.append('c')  # raises AttributeError: str has no append( ) method
string[0] = "A"  # raises TypeError: str does not support item assignment
```

In the earlier example, when we removed the escape character '\n', we were actually copying from the original string, slicing it, and creating a new string. In the following sections, we will discover some of the functions and methods of string objects, which apply exactly the same logic.

In [None]:
# Error message for modifying the string object
string = "Hello"
string.append('c')  # raises AttributeError: str has no append( ) method
string[0] = "A"  # raises TypeError: str does not support item assignment
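As a small sketch of the "copy, slice, and create a new string" logic described above, the example below catches the error from item assignment and then builds a changed copy instead. The names `greeting` and `changed` are made up for this illustration.

```python
greeting = "Hello"

# Item assignment is not supported on str, so this raises TypeError.
try:
    greeting[0] = "J"
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# Build a new string from the pieces instead; the original is untouched.
changed = "J" + greeting[1:]
print(changed)    # Jello
print(greeting)   # Hello
```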

## 3.2 - Basic Operations of String Object

When we need to join multiple strings, the easiest (and most popular) way is to concatenate them with the addition operator (+):

``` Python
x = "Hello " + "World"
x  # return 'Hello World'
```

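Python also joins adjacent string literals automatically, even when only whitespace separates them. No space is inserted between the literals; the space in the result below comes from the end of the first literal: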

``` Python
x = "Hello "   "World"
x  # return 'Hello World'
```

The multiplication operator (*) also works with the string data type (even though it is not often used).

``` Python
8 * "x"  # return 'xxxxxxxx'
```

In [None]:
# Using the addition operator to join two string objects
x = "Hello " + "World"
x  # return 'Hello World'

In [None]:
# Placing two string literals next to each other also joins them
x = "Hello "   "World"
x  # return 'Hello World'

In [None]:
# Using multiplication operator on string object
8 * "x"  # return 'xxxxxxxx'
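Two common uses of these operators are sketched below, with made-up names and text: repeating a character to draw an underline, and splitting a long literal across lines with adjacent literals inside parentheses. Note that the automatic joining applies to literals only, not to variables.

```python
title = "Report"

# Repeat a character to draw a rule as wide as the title.
underline = "-" * len(title)

# Adjacent literals inside parentheses are joined into one string,
# which is handy for splitting a long literal across lines.
sentence = (
    "Adjacent literals "
    "become one string."
)

print(title + "\n" + underline)
print(sentence)
```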

## 3.3 - Special Characters and Escape Sequence

In the previous section, we saw examples of **escape sequences**: \n means newline and \t means tab. In Python strings, the backslash "\" is a special character, also called the **escape character**. When it is combined with other characters it forms an **escape sequence**, which is used to represent certain whitespace characters and other characters that are hard to type directly. In this section, we introduce some of the most common **escape sequences** and their applications.

##### Basic Escape Sequence

The table below presents the most commonly used **escape sequences**.  These **escape sequences** also apply to the **bytes** object covered at the end of this lesson.

|  Escape Sequence  |  Meaning  |
| :---:  |  :---:  |
|  \newline  |  Ignored  |
|  \'  |  single quote (')  |
|  \"  |  double quote (")  |
|  \\  |  Backslash (\)  |
|  \a  |  ASCII Bell (Bell)  |
|  \b  |  ASCII Backspace (Backspace)  |
|  \f  |  ASCII Formfeed (start a new page)  |
|  \n  |  ASCII Linefeed (start a new line)  |
|  \r  |  ASCII Carriage Return (Carriage Return)  |
|  \t  |  ASCII Horizontal Tab (Tab)  |
|  \v  |  ASCII Vertical Tab (Vertical Tab)  |

Besides the escape sequences listed above, any other ASCII character can be written using its numeric code.  

#####  Octal, Hexadecimal, and Unicode Escape Sequences

In Python, we can use an octal or hexadecimal escape sequence to represent any ASCII (American Standard Code for Information Interchange) character.  An octal escape sequence is defined by a backslash (\) followed by up to three octal digits: \nnn, where nnn is an octal (base-8) value.  A hexadecimal escape sequence is defined by a backslash and x (\x) followed by exactly two hexadecimal digits: \xnn, where nn is a hexadecimal (base-16) value.

e.g. Octal and Hexadecimal Escape Sequences
``` Python
'm'  # return 'm'
'\155'  # return 'm'
'\x6D'  # return 'm'
'\x6d'  # return 'm', hexadecimal digits can be written in either lowercase or uppercase
```

Applying the same logic, we can write other escape sequence in the following way:

``` Python
'\n'  # return '\n'
'\012'  # return '\n'
'\x0A'  # return '\n'
```
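To see why these spellings are equivalent, the sketch below uses the built-in `ord()` and `chr()` functions, which convert between a character and its numeric code. The octal and hexadecimal integer literals (`0o155`, `0x6D`) are used here only to mirror the digits in the escape sequences.

```python
# ord() gives the code point of a character; chr() goes the other way.
code_point = ord("m")          # 109

# 0o155 is an octal integer literal and 0x6D a hexadecimal one.
same_value = code_point == 0o155 == 0x6D

# All of these spellings produce the same one-character string.
same_char = "m" == "\155" == "\x6D" == chr(109)

print(code_point, oct(code_point), hex(code_point))  # 109 0o155 0x6d
print(same_value, same_char)                         # True True
```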

Starting with Python 3, every string object is a Unicode string, which allows us to represent characters from many different languages. Here are some simple examples that demonstrate the use of a Unicode name and of a four-digit hexadecimal code.

``` Python
# Using the Unicode name to represent a Unicode character
unicode_a = '\N{LATIN SMALL LETTER A}'
unicode_a  # return 'a'

unicode_a_with_acute = '\N{LATIN SMALL LETTER A WITH ACUTE}'
unicode_a_with_acute  # return 'á'

# Using \u followed by four hexadecimal digits to represent a Unicode character
'\u00E1'  # return 'á' by Unicode character
```

In [1]:
# Using the Unicode name to represent a Unicode character
unicode_a = '\N{LATIN SMALL LETTER A}'
unicode_a  # return 'a'

'a'

In [None]:
# Using the Unicode name to represent a Unicode character
unicode_a_with_acute = '\N{LATIN SMALL LETTER A WITH ACUTE}'
unicode_a_with_acute  # return 'á'

In [None]:
# Using \u followed by four hexadecimal digits to represent a Unicode character
'\u00E1'  # return 'á' by Unicode character
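If the Unicode name of a character is not known, the standard `unicodedata` module can look it up. The sketch below is one possible way to move between a character, its name, and its code; the variable names are illustrative.

```python
import unicodedata

char = "\u00E1"

# name() returns the official Unicode name of the character.
char_name = unicodedata.name(char)

# lookup() is the reverse: it finds the character for a given name.
same_char = unicodedata.lookup(char_name)

print(char_name)               # LATIN SMALL LETTER A WITH ACUTE
print(hex(ord(char)))          # 0xe1
print(same_char == char)       # True
```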

##### Using the print( ) function to show escape sequences

To see what an escape sequence actually produces, we can pass the string to the print( ) function.  Here are some examples,

e.g. the string 'a\n\tb' versus print('a\n\tb')

``` Python
# The interactive prompt echoes the string with its escape sequences
'a\n\tb'  # return 'a\n\tb' 

# The print( ) function outputs the characters the escape sequences stand for
print('a\n\tb')  # print a and b separated by a newline and a tab
```

In this example, the first case shows the string in its original, escaped form.  In the second case, the print( ) function outputs the characters that the escape sequences represent, so a and b appear separated by a newline and a tab.

In [None]:
# The interactive prompt echoes the string with its escape sequences
'a\n\tb'  # return 'a\n\tb' 

In [None]:
# The print( ) function outputs the characters the escape sequences stand for
print('a\n\tb')  # print a and b separated by a newline and a tab
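The difference between the two displays can be checked with `len()` and `repr()`. This is a small sketch: the string holds only four characters, and `repr()` produces the escaped form that the interactive prompt shows.

```python
text = "a\n\tb"

# The string holds four characters: 'a', newline, tab, 'b'.
length = len(text)

# repr() gives the escaped form that the interactive prompt displays.
escaped = repr(text)

print(length)    # 4
print(escaped)   # shows the escaped form, quotes included
print(text)      # shows a, then b on a new line after a tab
```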

By default, the print( ) function appends a newline after the string.  When a string already ends with a newline escape sequence and we want to avoid doubling the newline, we can pass the argument end="" to print( ), which replaces the appended newline with an empty string.

e.g.
``` Python
print('Hello World\n')  # print 'Hello World' followed by two newlines

print('Hello World\n', end="")  # print 'Hello World' followed by one newline
```

In [2]:
# With additional newline from print( ) function
print('Hello World\n')

Hello World



In [3]:
# Without additional newline from print( ) function
print('Hello World\n', end="")

Hello World
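The effect is easier to verify when the output is captured rather than read from the screen. The sketch below assumes a hypothetical list of lines that already end with a newline, and uses the `file` argument of print( ) together with `io.StringIO` purely to collect the output for comparison.

```python
import io

# Hypothetical lines, each already ending with its own newline.
lines = ["first line\n", "second line\n"]

# Capture the output in memory so the two variants can be compared.
default_out = io.StringIO()
no_extra_out = io.StringIO()

for line in lines:
    print(line, file=default_out)            # print adds a second newline
    print(line, end="", file=no_extra_out)   # print adds nothing

print(repr(default_out.getvalue()))
print(repr(no_extra_out.getvalue()))
```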


## 3.4 - Methods and Functions of String Object

Python has a set of built-in methods that we can apply directly to strings for text processing.  Furthermore, the string module contains a number of useful constants and classes.  In this section, we focus only on the Python built-in methods and their applications.  To call a built-in method on a string, we only need to add a period (.) after the string, followed by the method name:

Reference: https://docs.python.org/2.5/lib/string-methods.html

e.g.  stringName.method()
``` Python
# returns lowercased string
x = 'HELLO WORLD'
x.lower()  # return new string 'hello world'

# returns uppercased string
x.upper()  # return new string 'HELLO WORLD'

# converts first character to capital letter and the rest to lowercase
x.capitalize()  # return new string 'Hello world'
```

Note: Keep in mind that a string is an immutable sequence, so a string method actually returns a new string and does not modify the original string.

In [4]:
# returns lowercased string
x = 'HELLO WORLD'
x.lower()  # return new string 'hello world'

'hello world'

In [5]:
# returns uppercased string
x.upper()  # return new string 'HELLO WORLD'

'HELLO WORLD'

In [6]:
# converts first character to capital letter and the rest to lowercase
x.capitalize()  # return new string 'Hello world'

'Hello world'
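The note above can be checked directly. In this short sketch (the names `lowered` and `chained` are illustrative), the original string is unchanged after a method call, the result has to be assigned to a name to be kept, and method calls can be chained because each one returns a string.

```python
x = "HELLO WORLD"

lowered = x.lower()

# The original string is unchanged; lower() produced a new object.
print(x)         # HELLO WORLD
print(lowered)   # hello world

# Because each method returns a string, calls can be chained.
chained = x.lower().capitalize()
print(chained)   # Hello world

# To keep a result, assign it to a name (possibly the same name).
x = x.lower()
```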

##### Using split( ) and join( ) 

Both the split( ) and join( ) methods are very powerful tools for text processing.  They perform exactly opposite tasks: the split( ) method breaks a string object into a list of strings, and the join( ) method connects a list of strings to form a new string object.  

Using (+) to join strings is convenient.  However, when a large number of strings need to be joined, the (+) operator becomes inefficient.  For instance, when we join the two strings "Hello" and "world", three objects are involved: "Hello", "world", and the new string holding the result.  Because strings are immutable, every (+) creates a new string, so a long chain of concatenations can create many intermediate string objects that are discarded almost immediately.  

The join( ) method is a more efficient way for the same task.

Syntax: str.join(iterable) 

e.g. join a list of strings with a space between each element
``` Python
# join a list of strings with a space between them
" ".join(["join", "puts", "spaces", "between", "elements"])
# return 'join puts spaces between elements'
```

To use a different separator, we only need to change the string in front of the join( ) method.

e.g. join a list of strings with ::
``` Python
# join a list of strings with double colon
"::".join(["Separated", "with", "colons"])
# return 'Separated::with::colons'
```

We can also join a list of strings with an empty string, so that nothing separates the elements.

``` Python
# join a list of strings with an empty string (no separator)
"".join(["Separated", "by", "nothing"])
# return 'Separatedbynothing'
```
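The efficiency point made earlier can be sketched as follows, using hypothetical data. Repeated concatenation may build a new intermediate string on every pass, while join( ) builds the result in a single call. The items passed to join( ) must already be strings, which is why `str()` is applied to each number.

```python
# Hypothetical data: build "0, 1, 2, 3, 4" from numbers.
numbers = range(5)

# Repeated concatenation may create a new intermediate string on each pass.
slow = ""
for n in numbers:
    if slow:
        slow += ", "
    slow += str(n)

# join() accepts any iterable of strings and builds the result in one call.
# The items must already be strings, hence the str() conversion.
fast = ", ".join(str(n) for n in numbers)

print(slow == fast, fast)   # True 0, 1, 2, 3, 4
```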

In [None]:
# join a list of strings with a space between them
" ".join(["join", "puts", "spaces", "between", "elements"])

In [None]:
# join a list of strings with double colon
"::".join(["Separated", "with", "colons"])

In [None]:
# join a list of strings with an empty string (no separator)
"".join(["Separated", "by", "nothing"])

The split( ) method splits a string into a list.  The default separator is any run of whitespace, but we can specify a different separator.  Here are some examples:

Syntax: str.split(sep=None, maxsplit=-1)

e.g.
``` Python
x = "You\t\t can have tabs\t\n \t and newlines \n\n " "mixed in"
x.split()  # return ['You', 'can', 'have', 'tabs', 'and', 'newlines', 'mixed', 'in']

x = "Mississippi"
x.split("ss")  # return ['Mi', 'i', 'ippi']
```

We can also use the second argument to limit the number of splits.  Given a value of n, split( ) performs at most n splits, so it creates a list of at most n+1 elements, or fewer if the string cannot be split that many times.  Here are some examples:

``` Python
x = 'a b c d e'
x.split(' ', 1)  # return ['a', 'b c d e']
x.split(' ', 2)  # return ['a', 'b', 'c d e']
x.split(' ', 9)  # return ['a', 'b', 'c', 'd', 'e']
```

If we want to split a string on any whitespace and also set the second argument, we can pass None as the first argument.  

``` Python
x = 'a\nb c d'
x.split(' ', 2)  # return ['a\nb', 'c', 'd']
x.split(None, 2)  # return ['a', 'b', 'c d']
```

split( ) and join( ) are popular tools for working with textual data. For structured formats such as CSV and JSON, however, it is recommended to use the Python standard csv and json modules instead. 

In [None]:
x = "You\t\t can have tabs\t\n \t and newlines \n\n " "mixed in"
x.split()

In [None]:
x = "Mississippi"
x.split("ss")

In [None]:
x = 'a b c d e'
x.split(' ', 1)

In [None]:
x.split(' ', 2)

In [None]:
x.split(' ', 9)

In [None]:
x = 'a\nb c d'
x.split(' ', 2)

In [None]:
x.split(None, 2)
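A typical use of the second argument is separating a key from a value that may itself contain the separator. The line below is a made-up example; limiting the split to one keeps the value intact.

```python
# Hypothetical configuration line; the value itself contains '='.
line = "query=a=1&b=2"

# A limit of 1 splits at the first '=' only, leaving the rest intact.
key, value = line.split("=", 1)

# Without the limit, the value would be broken apart as well.
all_parts = line.split("=")

print(key, value)   # query a=1&b=2
print(all_parts)    # ['query', 'a', '1&b', '2']
```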

In [None]:
# Try it yourself!

# How to use split( ) and join( ) to change all the whitespaces in a string
# to '-'?
# e.g.
# "this is a test"  =>  "this-is-a-test"

##### Using int( ) and float( )

Integers and floats are data types that deal with numbers.  Python has the built-in functions int( ) and float( ) for converting strings to integers or floats. If the string cannot be interpreted as a number, the function raises a ValueError. 

The int( ) function takes two arguments. The second argument is optional and specifies the base (radix) in which the string is written.  The default value is 10.  Here are some examples:

e.g. 
``` Python
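float('123.456')  # return 123.456

float('xxyy')  # raises ValueError: not a valid number

int('3333')  # return 3333

int('123.456')  # raises ValueError: int( ) does not accept a decimal point

int('10000', 8)  # return 4096 (base 8)

int('101', 2)  # return 5 (base 2)

int('ff', 16)  # return 255 (base 16)

int('123456', 6)  # raises ValueError: 6 is not a valid digit in base 6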
```
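Since a failed conversion raises ValueError, the call is often wrapped in try/except. The helper below, `to_number`, is a hypothetical function written for this example and not part of Python; it is one possible sketch that tries int( ) first and falls back to float( ).

```python
def to_number(text):
    """Return text as an int or float, or None if it is not numeric.

    to_number is a hypothetical helper written for this example.
    """
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue
    return None

print(to_number("3333"))      # 3333
print(to_number("123.456"))   # 123.456
print(to_number("xxyy"))      # None
```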

