# 8. Strings: A Deeper Look 

# Objectives
In this chapter, you’ll:
* Understand text processing.
* Use string methods.
* Format string content.
* Concatenate and repeat strings.
* Strip whitespace from the ends of strings.
* Change characters from lowercase to uppercase and vice versa.

# Objectives (cont.)
* Compare strings with the comparison operators.
* Search strings for substrings and replace substrings.
* Split strings into tokens.
* Concatenate strings into a single string with a specified separator between items.
* Create and use regular expressions to match patterns in strings, replace substrings and validate data.

# Objectives (cont.)
* Use regular expression metacharacters, quantifiers, character classes and grouping.
* Understand how critical string manipulations are to natural language processing.
* Understand the data science terms data munging, data wrangling and data cleaning, and use regular expressions to munge data into preferred formats. 

## 8.2.1 Presentation Types


In [None]:
f'{17.489:.2f}'

* Can also use the **`e` presentation type** for exponential (scientific) notation.
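
As a small illustration of the `e` presentation type next to `f`, the following sketch formats one value both ways. The value `12345.678` and the variable names are chosen here for illustration only.

```python
# Format the same value with the f (fixed-point) and e (exponential) presentation types.
value = 12345.678  # illustrative value, not from the chapter

fixed = f'{value:.2f}'       # two digits to the right of the decimal point
scientific = f'{value:.3e}'  # three digits of precision in scientific notation

print(fixed)       # 12345.68
print(scientific)  # 1.235e+04
```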

# 8.4 Stripping Whitespace from Strings
* Methods for removing whitespace from the ends of a string each return a new string

### Removing Leading and Trailing Whitespace
* **`strip`** removes leading and trailing whitespace

In [None]:
sentence = '\t  \n  This is a test string. \t\t \n'

In [None]:
sentence.strip()

### Removing Leading Whitespace
* **`lstrip`** removes only leading whitespace

In [None]:
sentence.lstrip()

### Removing Trailing Whitespace
* **`rstrip`** removes only trailing whitespace 

In [None]:
sentence.rstrip()

# 8.7 Searching for Substrings
* Can search a string for a **substring** to 
    * count number of occurrences
    * determine whether a string contains a substring
    * determine the index at which a substring resides in a string
* Each method shown in this section compares characters lexicographically using their underlying numeric values

### Counting Occurrences
* **`count`** returns the number of times its argument occurs in a string

In [None]:
sentence = 'to be or not to be that is the question'

In [None]:
sentence.count('to')

* If you specify as the second argument a **start index**, `count` searches only the slice **_string_`[`*start_index*`:]`**

In [None]:
sentence.count('to', 12)

* If you specify as the second and third arguments the **start index** and **end index**, `count` searches only the slice **_string_`[`*start_index*`:`*end_index*`]`**

In [None]:
sentence.count('that', 12, 25)

* Like `count`, the other string methods presented in this section each have **start index** and **end index** arguments 

### Locating a Substring in a String
* **`index`** searches for a substring within a string and returns the first index at which the substring is found; otherwise, a `ValueError` occurs:

In [None]:
sentence.index('be')

* **`rindex`** performs the same operation as `index`, but searches from the end of the string

In [None]:
sentence.rindex('be')

* **`find`** and **`rfind`** perform the same tasks as `index` and `rindex` but return `-1` if the substring is not found
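
A brief sketch of how `find` and `rfind` differ from `index` when the substring is absent, reusing this section's `sentence`. The search string `'maybe'` and the variable names are illustrative.

```python
sentence = 'to be or not to be that is the question'

first_be = sentence.find('be')     # index of the first 'be'
last_be = sentence.rfind('be')     # index of the last 'be'
missing = sentence.find('maybe')   # -1 because 'maybe' does not occur

# index raises ValueError for the same missing substring
try:
    sentence.index('maybe')
except ValueError:
    index_raised = True
else:
    index_raised = False

# Like count, find accepts an optional start index
later_be = sentence.find('be', 5)

print(first_be, last_be, missing, index_raised, later_be)  # 3 16 -1 True 16
```

Returning `-1` lets you test the result with an `if` statement rather than wrapping the call in `try`/`except`.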

### Determining Whether a String Contains a Substring 
* To check whether a string contains a substring, use operator `in` or `not in`

In [None]:
'that' in sentence

In [None]:
'THAT' in sentence

In [None]:
'THAT' not in sentence

### Locating a Substring at the Beginning or End of a String
* **`startswith`** and **`endswith`** return `True` if the string starts with or ends with a specified substring

In [None]:
sentence.startswith('to')

In [None]:
sentence.startswith('be')

In [None]:
sentence.endswith('question')

In [None]:
sentence.endswith('quest')

# 8.8 Replacing Substrings
* A common text manipulation is to locate a substring and replace its value
* **`replace`** searches a string for the substring in its first argument and replaces _each_ occurrence with the substring in its second argument
* Can receive an optional third argument specifying the maximum number of replacements 

In [None]:
values = '1\t2\t3\t4\t5'

In [None]:
values.replace('\t', ',')
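
The optional third argument mentioned above can be sketched as follows; the limit of `2` is an arbitrary choice for illustration.

```python
values = '1\t2\t3\t4\t5'

# Replace only the first two tabs; the remaining tabs are left as is
limited = values.replace('\t', ',', 2)

print(repr(limited))  # '1,2,3\t4\t5'
print(repr(values))   # the original string is unchanged: '1\t2\t3\t4\t5'
```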

# 8.9 Splitting and Joining Strings
* Tokens typically are separated by whitespace characters such as blank, tab and newline, though other characters may be used—the separators are known as **delimiters**

### Splitting Strings
* To tokenize a string at a custom delimiter, specify the delimiter string that `split` uses to tokenize the string

In [None]:
letters = 'A, B, C, D'

In [None]:
letters.split(', ')

* Specify the maximum number of splits with an integer as the second argument
* Last token is the remainder of the string

In [None]:
letters.split(', ', 2)

* **`rsplit`** performs the same task as `split` but processes the maximum number of splits from the end of the string toward the beginning
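
To see the difference between `split` and `rsplit` when the number of splits is limited, here is a short sketch using the same `letters` string:

```python
letters = 'A, B, C, D'

from_left = letters.split(', ', 2)    # splits counted from the beginning
from_right = letters.rsplit(', ', 2)  # splits counted from the end

print(from_left)   # ['A', 'B', 'C, D']
print(from_right)  # ['A, B', 'C', 'D']
```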

### Joining Strings
* **`join`** concatenates the strings in its argument, which must be an iterable containing only string values
* The separator between the concatenated items is the string on which you call `join`

In [None]:
letters_list = ['A', 'B', 'C', 'D']

In [None]:
','.join(letters_list)

* Join the results of a list comprehension that creates a list of strings

In [None]:
','.join([str(i) for i in range(10)])

### String Methods `partition` and `rpartition` 
String method **`partition`** splits a string into a tuple of three strings based on the method’s _separator_ argument:
* the part of the original string before the separator
* the separator itself
* the part of the string after the separator

In [None]:
'Amanda: 89, 97, 92'.partition(': ')

* To search for the separator from the end of the string, use method **`rpartition`** 

In [None]:
url = 'http://www.deitel.com/books/PyCDS/table_of_contents.html'

In [None]:
rest_of_url, separator, document = url.rpartition('/')

In [None]:
document

In [None]:
rest_of_url

### String Method `splitlines` 
* **`splitlines`** returns a list of new strings representing lines of text split at each newline character in the original string

In [None]:
lines = """This is line 1
This is line2
This is line3"""

In [None]:
lines

In [None]:
lines.splitlines()

* Passing `True` to `splitlines` keeps the newlines

In [None]:
lines.splitlines(True)

# 8.10 Characters and Character-Testing Methods
* In Python, a character is simply a one-character string
* Python provides string methods for testing whether a string matches certain characteristics
* **`isdigit`** returns `True` if the string on which you call the method contains only the digit characters (`0`–`9`)
    * Useful for validating data

In [None]:
'-27'.isdigit()


In [None]:
'27'.isdigit()

# 8.10 Characters and Character-Testing Methods (cont.)
* `isalnum` returns `True` if the string on which you call the method is alphanumeric (only digits and letters)

In [None]:
'A9876'.isalnum()

In [None]:
'123 Main Street'.isalnum()

# 8.10 Characters and Character-Testing Methods (cont.)
* Table of many character-testing methods

| String Method | Description
| -------- | --------
| `isalnum()` | Returns `True` if the string contains only _alphanumeric_ characters (i.e., digits and letters).
| `isalpha()`  | Returns `True` if the string contains only _alphabetic_ characters (i.e., letters).
| `isdecimal()`  | Returns `True` if the string contains only _decimal integer_ characters (that is, base 10 integers) and does not contain a + or - sign.
| `isdigit()`  | Returns `True` if the string contains only digits (e.g., '0', '1', '2').
| `isidentifier()`  | Returns `True` if the string represents a valid _identifier_. 
| `islower()`  | Returns `True` if all alphabetic characters in the string are _lowercase_ characters (e.g., `'a'`, `'b'`, `'c'`).
| `isnumeric()`  | Returns `True` if the characters in the string represent a _numeric value_ without a + or - sign and without a decimal point.
| `isspace()`  | Returns `True` if the string contains only _whitespace_ characters.
| `istitle()`  | Returns `True` if the first character of each word in the string is the only _uppercase_ character in the word. 
| `isupper()`  | Returns `True` if all alphabetic characters in the string are _uppercase_ characters (e.g., `'A'`, `'B'`, `'C'`).
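
The three numeric tests in the table are easy to confuse. The sketch below compares them on a few sample strings chosen for illustration (a superscript two and the fraction one-half are written as Unicode escapes).

```python
# Sample strings: plain digits, a signed number, superscript two, vulgar fraction one-half
samples = ['42', '-42', '\u00b2', '\u00bd']

# Map each sample to (isdecimal, isdigit, isnumeric)
results = {s: (s.isdecimal(), s.isdigit(), s.isnumeric()) for s in samples}

for text, flags in results.items():
    print(repr(text), flags)

# A few of the other tests from the table
print('my_var'.isidentifier())   # True
print('Hello World'.istitle())   # True
print(' \t\n'.isspace())         # True
```

For validating plain base-10 input such as `'42'`, all three numeric tests agree; they differ only for less common Unicode characters.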

------
&copy;1992&ndash;2020 by Pearson Education, Inc. All Rights Reserved. This content is based on Chapter 8 of the book [**Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud**](https://amzn.to/2VvdnxE).

DISCLAIMER: The authors and publisher of this book have used their 
best efforts in preparing the book. These efforts include the 
development, research, and testing of the theories and programs 
to determine their effectiveness. The authors and publisher make 
no warranty of any kind, expressed or implied, with regard to these 
programs or to the documentation contained in these books. The authors 
and publisher shall not be liable in any event for incidental or 
consequential damages in connection with, or arising out of, the 
furnishing, performance, or use of these programs.                  

# 8.11 Raw Strings
* Backslash characters in strings introduce _escape sequences_—like `\n` for newline and `\t` for tab
* To include a backslash in a string, use two backslash characters `\\`
* Makes some strings difficult to read
* Consider a Microsoft Windows file location:

In [None]:
file_path = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'

In [None]:
file_path

# 8.11 Raw Strings (cont.)
* **raw strings**—preceded by the character `r`—are more convenient
* They treat each backslash as a regular character, rather than the beginning of an escape sequence
* We will see this when we create regular expression strings that make use of backslashes (`\`)

In [None]:
file_path = r'C:\MyFolder\MySubFolder\MyFile.txt'

In [None]:
file_path
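
A quick check, sketched here with shortened illustrative paths, confirms that the escaped and raw forms produce the same string and shows how a raw string treats `\n`:

```python
escaped = 'C:\\MyFolder\\MyFile.txt'  # each \\ is one backslash character
raw = r'C:\MyFolder\MyFile.txt'       # backslashes are taken literally

same_string = escaped == raw

newline_length = len('\n')       # one character: a newline
raw_length = len(r'\n')          # two characters: a backslash and the letter n

print(same_string, newline_length, raw_length)  # True 1 2
```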

## 8.12.1 re Module and Function fullmatch 
* **`fullmatch`** checks whether the **entire** string in its second argument matches the pattern in its first argument

In [None]:
import re

### Matching Literal Characters

In [None]:
pattern = '02215'

In [None]:
'Match' if re.fullmatch(pattern, '02215') else 'No match'

In [None]:
'Match' if re.fullmatch(pattern, '51220') else 'No match'

* First argument is the regular expression pattern to match
    * Any string can be a regular expression
    * **literal characters** match **themselves** in the specified order
* Second argument is the string that should entirely match the pattern. 
* If the second argument matches the pattern in the first argument, `fullmatch` returns an object containing the matching text, which evaluates to `True`
* For no match, `fullmatch` returns `None`, which evaluates to `False`
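
The return values described above can be inspected directly. In this sketch, the second test string `'022159'` is an illustrative value that has one character too many.

```python
import re

hit = re.fullmatch('02215', '02215')    # the entire string matches
miss = re.fullmatch('02215', '022159')  # extra trailing character, so no match

print(hit.group())   # 02215
print(hit.span())    # (0, 5)
print(miss)          # None
```

Because a match object is truthy and `None` is falsy, either result can be used directly as the condition in an `if` statement or a conditional expression.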

### Metacharacters, Character Classes and Quantifiers
* Regular expressions typically contain various special symbols called **metacharacters**:

| Regular expression metacharacters
| --------
| `[]  `  `{}  `  `()  `  `\  `  `*  `  `+  `  `^  `  `$  `  `?  `  `.  `  `\|`

* **`\` metacharacter** begins each predefined **character class**
* Each matches a specific set of characters

In [None]:
'Valid' if re.fullmatch(r'\d{5}', '02215') else 'Invalid'

In [None]:
'Valid' if re.fullmatch(r'\d{5}', '9876') else 'Invalid'

* In `\d{5}`, **`\d`** is a character class representing a digit (0–9)
* A character class is a **regular expression escape sequence** that matches **one** character
* To match more than one, follow the character class with a **quantifier**
* `{5}` repeats `\d` five times to match five consecutive digits

### Other Predefined Character Classes
* To match any metacharacter as its **literal** value, precede it by a backslash (`\`)
    * For example, `\\` matches a backslash (`\`) and `\$` matches a dollar sign (`$`)

| Character class	| Matches
| ------------	| ------------
| `\d`	| Any digit (0–9).
| `\D`	| Any character that is _not_ a digit.
| `\s`	| Any whitespace character (such as spaces, tabs and newlines).
| `\S`	| Any character that is _not_ a whitespace character.
| `\w`	| Any **word character** (also called an **alphanumeric character**): any uppercase or lowercase letter, any digit or an underscore.
| `\W`	| Any character that is _not_ a word character.
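
The following sketch combines several of the predefined character classes with escaped metacharacters. The test strings and variable names are illustrative.

```python
import re

# One or more word characters, one whitespace character, one or more word characters
two_words = re.fullmatch(r'\w+\s\w+', 'data munging')

# One or more characters, none of which is a digit
no_digits = re.fullmatch(r'\D+', 'ZIP code')

# \$ and \. match a literal dollar sign and a literal period
price = re.fullmatch(r'\$\d+\.\d{2}', '$19.99')
not_price = re.fullmatch(r'\$\d+\.\d{2}', '$19x99')

print(bool(two_words), bool(no_digits), bool(price), bool(not_price))
# True True True False
```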

### Custom Character Classes
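* Square brackets, `[]`, define a **custom character class** that matches a **single** character
* `[aeiou]` matches a lowercase vowel
* `[A-Z]` matches an uppercase letter
* `[a-z]` matches a lowercase letter 
* `[a-zA-Z]` matches any lowercase or uppercase letter
* A `^` as the first character inside the brackets negates the class: `[^a-z]` matches any single character that is _not_ a lowercase letter
* Most metacharacters lose their special meaning inside a custom character class: `[*+$]` matches a single `*`, `+` or `$`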

In [None]:
'Valid' if re.fullmatch('[A-Z][a-z]*', 'Wally') else 'Invalid'

In [None]:
'Valid' if re.fullmatch('[A-Z][a-z]*', 'eva') else 'Invalid'

In [None]:
'Match' if re.fullmatch('[^a-z]', 'A') else 'No match'

In [None]:
'Match' if re.fullmatch('[^a-z]', 'a') else 'No match'

In [None]:
'Match' if re.fullmatch('[*+$]', '*') else 'No match'

In [None]:
'Match' if re.fullmatch('[*+$]', '!') else 'No match'

### `*` vs. `+` Quantifier
* `*` matches **zero or more occurrences** of a subexpression
* `+` matches **at least one occurrence** of a subexpression
* `*` and `+` are **greedy**: they match as many characters as possible

In [None]:
'Valid' if re.fullmatch('[A-Z][a-z]+', 'Wally') else 'Invalid'

In [None]:
'Valid' if re.fullmatch('[A-Z][a-z]+', 'E') else 'Invalid'
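
To contrast the two quantifiers on the same input, and to illustrate greedy matching, consider this sketch. The string `'Order 12345 shipped'` is made up for illustration.

```python
import re

# * allows zero lowercase letters after the capital; + requires at least one
star = re.fullmatch('[A-Z][a-z]*', 'E')
plus = re.fullmatch('[A-Z][a-z]+', 'E')

# Greedy matching: \d+ consumes all consecutive digits, not just the first one
greedy = re.search(r'\d+', 'Order 12345 shipped').group()

print(bool(star), bool(plus), greedy)  # True False 12345
```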

### Other Quantifiers
* **`?` quantifier** matches **zero or one occurrences** of a subexpression

In [None]:
'Match' if re.fullmatch('labell?ed', 'labelled') else 'No match'

In [None]:
'Match' if re.fullmatch('labell?ed', 'labeled') else 'No match'

In [None]:
'Match' if re.fullmatch('labell?ed', 'labellled') else 'No match'

* `labell?ed` matches `labelled` (the U.K. English spelling) and `labeled` (the U.S. English spelling), but not the misspelled word `labellled`
* `l?` indicates that there can be **zero or one more** `l` characters before the remaining literal `ed` characters

### Other Quantifiers (cont.)
* Can match **at least _n_ occurrences** of a subexpression with the **`{n,}` quantifier**

In [None]:
'Match' if re.fullmatch(r'\d{3,}', '123') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,}', '1234567890') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,}', '12') else 'No match'

### Other Quantifiers (cont.)
* Can match **between _n_ and _m_ (inclusive) occurrences** of a subexpression with the **`{n,m}` quantifier**

In [None]:
'Match' if re.fullmatch(r'\d{3,6}', '123') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,6}', '123456') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,6}', '1234567') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,6}', '12') else 'No match'

## 8.12.2 Replacing Substrings and Splitting Strings
* `sub` function replaces patterns in a string
* `split` function breaks a string into pieces, based on patterns

### Function `sub`—Replacing Patterns 
* **`sub` function** replaces **all** occurrences of a pattern with the replacement text you specify

In [None]:
import re

In [None]:
re.sub(r'\t', ', ', '1\t2\t3\t4')

* Three required arguments:
    * the _pattern to match_ (the tab character `'\t'`)
    * the _replacement text_ (`', '`) and 
    * the _string to be searched_ (`'1\t2\t3\t4'`)
* Keyword argument `count` can be used to specify the maximum number of replacements

In [None]:
re.sub(r'\t', ', ', '1\t2\t3\t4', count=2)

### Function `split` 
* **`split` function** _tokenizes_ a string, using a regular expression to specify the _delimiter_
* Returns a list of strings

In [None]:
re.split(r',\s*', '1,  2,  3,4,    5,6,7,8')

* Keyword argument `maxsplit` specifies maximum number of splits

In [None]:
re.split(r',\s*', '1,  2,  3,4,    5,6,7,8', maxsplit=3)

## 8.12.3 Other Search Functions; Accessing Matches

### Function `search`—Finding the First Match Anywhere in a String
* **`search`** looks in a string for the _first_ occurrence of a substring that matches a regular expression and returns a **match object** (of type **`re.Match`** in Python 3.7 and later; **`SRE_Match`** in earlier versions) that contains the matching substring
* Match object’s **`group`** method returns that substring

In [None]:
import re

In [None]:
result = re.search('Python', 'Python is fun')

In [None]:
result.group() if result else 'not found'

* `search` returns `None` if the string does _not_ contain the pattern

In [None]:
result2 = re.search('fun!', 'Python is fun')

In [None]:
result2.group() if result2 else 'not found'

* Can search for a match only at the _beginning_ of a string with function **`match`**
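
A minimal sketch of the difference between `match` and `search`, using the same string as above:

```python
import re

at_start = re.match('Python', 'Python is fun')   # pattern occurs at the beginning
not_at_start = re.match('fun', 'Python is fun')  # 'fun' is not at the beginning
anywhere = re.search('fun', 'Python is fun')     # search looks through the whole string

print(at_start.group())   # Python
print(not_at_start)       # None
print(anywhere.start())   # 10
```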

### Ignoring Case with the Optional `flags` Keyword Argument
* Many `re` module functions receive an optional `flags` keyword argument
* Changes how regular expressions are matched

In [None]:
result3 = re.search('Sam', 'SAM WHITE', flags=re.IGNORECASE)

In [None]:
result3.group() if result3 else 'not found'

### Metacharacters That Restrict Matches to the Beginning or End of a String
* **`^` metacharacter** at the beginning of a regular expression (and not inside square brackets) is an anchor
* Indicates that the expression matches only the _beginning_ of a string 

In [None]:
result = re.search('^Python', 'Python is fun')

In [None]:
result.group() if result else 'not found'

In [None]:
result = re.search('^fun', 'Python is fun')

In [None]:
result.group() if result else 'not found'

* **`$` metacharacter** at the end of a regular expression is an anchor indicating that the expression matches only the _end_ of a string

In [None]:
result = re.search('Python$', 'Python is fun')

In [None]:
result.group() if result else 'not found'

In [None]:
result = re.search('fun$', 'Python is fun')

In [None]:
result.group() if result else 'not found'

### Functions `findall` and `finditer`—Finding All Matches in a String
* **`findall`** finds _every_ matching substring in a string
* Returns a list of the matching substrings

In [None]:
contact = 'Wally White, Home: 555-555-1234, Work: 555-555-4321'

In [None]:
re.findall(r'\d{3}-\d{3}-\d{4}', contact)

* **`finditer`** works like `findall`, but returns a lazy _iterable_ of match objects

In [None]:
for phone in re.finditer(r'\d{3}-\d{3}-\d{4}', contact):
    print(phone.group())

### Capturing Substrings in a Match
* Use **parentheses metacharacters**—**`(`** and **`)`**—to capture substrings in a match

In [None]:
text = 'Charlie Cyan, e-mail: demo1@deitel.com'

In [None]:
pattern = r'([A-Z][a-z]+ [A-Z][a-z]+), e-mail: (\w+@\w+\.\w{3})'

In [None]:
result = re.search(pattern, text)

* The regular expression specifies two substrings to capture, each denoted by the metacharacters `(` and `)`
* `(` and `)` do _not_ affect whether the `pattern` is found in the string `text`
* `search` function returns a match object _only_ if the _entire_ `pattern` is found in the string `text`
* Match object’s **`groups`** method returns a tuple of the captured substrings

In [None]:
result.groups()

* Match object’s `group` method returns the _entire_ match as a single string

In [None]:
result.group()

* Access each captured substring by passing an integer to the `group` method

In [None]:
result.group(1)

In [None]:
result.group(2) 

# 8.13 Intro to Data Science: Pandas, Regular Expressions and Data Munging 
* Data does not always come in forms ready for analysis
* Data could be
    * in the wrong format
    * incorrect 
    * missing
* Data scientists can spend as much as 75% of their time preparing data before they begin their studies
* Called **data munging** or **data wrangling**

# 8.13 Intro to Data Science: Pandas, Regular Expressions and Data Munging (cont.)
* Two of the most important steps in data munging are _data cleaning_ and _transforming data_ into optimal formats for database systems and analytics software
* Common data cleaning examples include: 
    * deleting observations with missing values,
    * substituting reasonable values for missing values,
    * deleting observations with bad values,
    * substituting reasonable values for bad values,
    * tossing outliers (although sometimes you’ll want to keep them),
    * duplicate elimination (although sometimes duplicates are valid),
    * dealing with inconsistent data,
    * and more. 

# 8.13 Intro to Data Science: Pandas, Regular Expressions and Data Munging (cont.)
* Data cleaning is a difficult and messy process where you could easily make bad decisions that would negatively impact your results
* The actions data scientists take can vary per project, be based on the quality and nature of the data and be affected by evolving organization and professional standards
* Common data transformations include:
    * removing unnecessary data and _features_ (we’ll say more about features in the data science case studies),
    * combining related features,
    * sampling data to obtain a representative subset (we’ll see in the data science case studies that _random sampling_ is particularly effective for this and we’ll say why),
    * standardizing data formats,
    * grouping data,
    * and more.

### Cleaning Your Data 
* Bad data values and missing values can significantly impact data analysis
* Some data scientists advise against any attempts to insert “reasonable values” 
    * Instead, they advocate clearly marking missing data and leaving it up to the data analytics package to handle the issue

### Cleaning Your Data (cont.)
* Consider a hospital that records patients’ temperatures (and probably other vital signs) four times per day
* Assume that the data consists of a name and four `float` values, such as
```python
['Brown, Sue', 98.6, 98.4, 98.7, 0.0]
```
* Patient’s first three recorded temperatures are 98.6, 98.4 and 98.7
* Last temperature was missing and recorded as 0.0, perhaps because the sensor malfunctioned
* Average of the first three values is 98.57, which is close to normal
* If you calculate the average temperature _including_ the missing value for which 0.0 was substituted, the average is only 73.93, clearly a questionable result
* Crucial to “get the data right.” 
* One way to clean the data is to substitute a _reasonable_ value for the missing temperature, such as the average of the patient’s other readings
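
The arithmetic above, and the substitution strategy, can be sketched in plain Python. This assumes, as in the example, that a reading of `0.0` marks a missing value; the variable names are illustrative.

```python
record = ['Brown, Sue', 98.6, 98.4, 98.7, 0.0]
name, *temps = record

# Assumption: 0.0 is the placeholder for a missing reading
valid = [t for t in temps if t != 0.0]

mean_valid = sum(valid) / len(valid)  # about 98.57
mean_all = sum(temps) / len(temps)    # about 73.9, skewed by the placeholder

# Substitute the mean of the valid readings (rounded) for the missing value
cleaned = [t if t != 0.0 else round(mean_valid, 2) for t in temps]

print(name, cleaned)  # Brown, Sue [98.6, 98.4, 98.7, 98.57]
```

Whether substituting a value is appropriate depends on the analysis; as noted earlier, some data scientists prefer to mark the value as missing instead.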

### Data Validation
* Create a `Series` of five-digit ZIP Codes from a dictionary of city-name/five-digit-ZIP-Code key–value pairs
* We intentionally entered an invalid ZIP Code for Miami

In [1]:
import pandas as pd

In [2]:
zips = pd.Series({'Boston': '02215', 'Miami': '3310'})

In [3]:
zips

Boston    02215
Miami      3310
dtype: object

### Data Validation (cont.)
* The “second column” represents the `Series`’ ZIP Code _values_ (from the dictionary’s values)
* The “first column” represents their _indices_ (from the dictionary’s keys)
* Can use regular expressions with Pandas to validate data
* The **`str` attribute** of a `Series` provides string-processing and various regular expression methods
* Use the `str` attribute’s **`match` method** to check whether each ZIP Code is valid: 

In [4]:
zips.str.match(r'\d{5}')

Boston     True
Miami     False
dtype: bool

* `match` applies the regular expression `\d{5}` to _each_ `Series` element
* Returns a new `Series` containing `True` for each valid element
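
One subtlety worth a sketch: `str.match` anchors the pattern only at the start of each value, so a value with extra trailing digits still passes. The example below adds a hypothetical six-digit `'Dallas'` entry to show this, and uses `str.fullmatch` (assumed available, pandas 1.1 or later) to require that the whole value match.

```python
import pandas as pd

# 'Dallas' is a hypothetical entry with one digit too many
zips = pd.Series({'Boston': '02215', 'Miami': '3310', 'Dallas': '752011'})

starts_with_five = zips.str.match(r'\d{5}')   # anchored at the start only
exactly_five = zips.str.fullmatch(r'\d{5}')   # the whole value must match

# Select the entries that are not exactly five digits
invalid = zips[~exactly_five]

print(starts_with_five)
print(invalid)
```

The boolean `Series` returned by these methods can be used directly as a mask, which is how `invalid` is selected here.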

### Data Validation (cont.)
* Several ways to deal with invalid data
* One is to catch it at its source and interact with the source to correct the value
    * Not always possible
* In the case of the bad Miami ZIP Code of `3310`, we might look for Miami ZIP Codes beginning with 3310
    * There are two—`33101` and `33109`
    * We could pick one of those

### Data Validation (cont.)
* Sometimes, rather than matching a pattern at the _beginning_ of a value (which is how the `str` attribute’s `match` method works), you’ll want to know whether a value contains a _substring_ that matches the pattern
* Use method **`contains`** instead of `match`

In [5]:
cities = pd.Series(['Boston, MA 02215', 'Miami, FL 33101'])

In [6]:
cities

0    Boston, MA 02215
1     Miami, FL 33101
dtype: object

In [7]:
cities.str.contains(r' [A-Z]{2} ')

0    True
1    True
dtype: bool

In [8]:
cities.str.match(r' [A-Z]{2} ')

0    False
1    False
dtype: bool

### Reformatting Your Data
* Consider munging data into a different format
* Assume that an application requires U.S. phone numbers in the format ###-###-####
* The phone numbers have been provided to us as 10-digit strings without hyphens

In [9]:
contacts = [['Mike Green', 'demo1@deitel.com', '5555555555'],
            ['Sue Brown', 'demo2@deitel.com', '5555551234']]   

In [10]:
contactsdf = pd.DataFrame(contacts, 
                          columns=['Name', 'Email', 'Phone'])

In [11]:
contactsdf

         Name             Email       Phone
0  Mike Green  demo1@deitel.com  5555555555
1   Sue Brown  demo2@deitel.com  5555551234


### Reformatting Your Data (cont.)
* Munge the data with functional-style programming
* Can _map_ the phone numbers to the proper format by calling the `Series` method **`map`** on the `DataFrame`’s `'Phone'` column
* `map`’s argument is a _function_ that receives a value and returns the _mapped_ value
* Our function `get_formatted_phone` maps 10 consecutive digits into the format ###-###-####

In [12]:
import re

In [13]:
def get_formatted_phone(value):
    result = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', value)
    return '-'.join(result.groups()) if result else value

### Reformatting Your Data (cont.)
* Regular expression in the block’s first statement matches _only_ 10 consecutive digits
* Captures substrings containing the first three digits, the next three digits and the last four digits
* `return` statement:
    * If `result` is `None`, returns `value` unmodified
    * Otherwise, calls `result.groups()` to get a tuple containing the captured substrings and passes that tuple to string method `join`, which concatenates the elements, separating each from the next with `'-'` to form the mapped phone number
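
Both branches of the `return` statement can be exercised without pandas. This sketch repeats the function so that it runs on its own; the second and third test values are illustrative.

```python
import re

def get_formatted_phone(value):
    """Return value as ###-###-#### if it is exactly 10 digits; otherwise return it unchanged."""
    result = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', value)
    return '-'.join(result.groups()) if result else value

formatted = get_formatted_phone('5555551234')            # 10 digits: reformatted
already_formatted = get_formatted_phone('555-555-1234')  # no match: returned as is
too_short = get_formatted_phone('5551234')               # no match: returned as is

print(formatted, already_formatted, too_short)
# 555-555-1234 555-555-1234 5551234
```

Returning the value unmodified means that invalid entries pass through silently, so a separate validation step is still needed to detect them.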

### Reformatting Your Data (cont.)
* `Series` method `map` returns a new `Series` containing the results of calling its function argument for each value in the column

In [14]:
formatted_phone = contactsdf['Phone'].map(get_formatted_phone)

In [15]:
formatted_phone

0    555-555-5555
1    555-555-1234
Name: Phone, dtype: object

* Once you’ve confirmed that the data is in the correct format, you can update it in the original `DataFrame` 

In [16]:
contactsdf['Phone'] = formatted_phone

In [17]:
contactsdf

         Name             Email         Phone
0  Mike Green  demo1@deitel.com  555-555-5555
1   Sue Brown  demo2@deitel.com  555-555-1234
