# 8. Strings: A Deeper Look 

# Objectives
In this chapter, you’ll:
* Understand text processing.
* Use string methods.
* Format string content.
* Concatenate and repeat strings.
* Strip whitespace from the ends of strings.
* Change characters from lowercase to uppercase and vice versa.

# Objectives (cont.)
* Compare strings with the comparison operators.
* Search strings for substrings and replace substrings.
* Split strings into tokens.
* Concatenate strings into a single string with a specified separator between items.
* Create and use regular expressions to match patterns in strings, replace substrings and validate data.

# Objectives (cont.)
* Use regular expression metacharacters, quantifiers, character classes and grouping.
* Understand how critical string manipulations are to natural language processing.
* Understand the data science terms data munging, data wrangling and data cleaning, and use regular expressions to munge data into preferred formats. 

## 8.2.1 Presentation Types


In [1]:
f'{17.489:.2f}'

'17.49'

* Can also use the **`e` presentation type** for exponential (scientific) notation.
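
* A minimal sketch of the `e` presentation type; the sample value and the variable name `formatted` are chosen only for illustration:

```python
# Format a float in exponential (scientific) notation
# with two digits to the right of the decimal point
formatted = f'{12345.678:.2e}'
print(formatted)  # 1.23e+04
```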

# 8.4 Stripping Whitespace from Strings
* Methods for removing whitespace from the ends of a string each return a new string; the original string is not modified

### Removing Leading and Trailing Whitespace
* **`strip`** removes leading and trailing whitespace

In [2]:
sentence = '\t  \n  This is a test string. \t\t \n'

In [3]:
sentence.strip()

'This is a test string.'

### Removing Leading Whitespace
* **`lstrip`** removes only leading whitespace

In [4]:
sentence.lstrip()

'This is a test string. \t\t \n'

### Removing Trailing Whitespace
* **`rstrip`** removes only trailing whitespace 

In [5]:
sentence.rstrip()

'\t  \n  This is a test string.'
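
* Because strings are immutable, the stripping methods leave the original string untouched. A short sketch, reusing the `sentence` value from above (the name `stripped` is introduced here for illustration):

```python
sentence = '\t  \n  This is a test string. \t\t \n'

# strip returns a new string; sentence itself is not changed
stripped = sentence.strip()

print(repr(stripped))        # 'This is a test string.'
print(sentence == stripped)  # False: the original still has its whitespace
```

* To keep the stripped text, assign the result to a variable, as `stripped` does here.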

# 8.7 Searching for Substrings
* Can search a string for a **substring** to 
    * count number of occurrences
    * determine whether a string contains a substring
    * determine the index at which a substring resides in a string
* Each method shown in this section compares characters lexicographically using their underlying numeric values, so the searches are case-sensitive

### Counting Occurrences
* **`count`** returns the number of times its argument occurs in a string

In [6]:
sentence = 'to be or not to be that is the question'

In [7]:
sentence.count('to')

2

* If you specify as the second argument a **start index**, `count` searches only the slice **_string_`[`*start_index*`:]`**

In [8]:
sentence.count('to', 12)

1

* If you specify as the second and third arguments the **start index** and **end index**, `count` searches only the slice **_string_`[`*start_index*`:`*end_index*`]`**

In [9]:
sentence.count('that', 12, 25)

1

* Like `count`, the other string methods presented in this section each have **start index** and **end index** arguments 

### Locating a Substring in a String
* **`index`** searches for a substring within a string and returns the first index at which the substring is found; otherwise, a `ValueError` occurs:

In [10]:
sentence.index('be')

3

* **`rindex`** performs the same operation as `index`, but searches from the end of the string and returns the _last_ index at which the substring is found

In [11]:
sentence.rindex('be')

16

* **`find`** and **`rfind`** perform the same tasks as `index` and `rindex` but return `-1` if the substring is not found
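
* The difference between the two families can be sketched as follows; the search word `'whether'` and the names `position` and `found` are illustrative choices, not part of the original examples:

```python
sentence = 'to be or not to be that is the question'

# find reports a missing substring by returning -1
position = sentence.find('whether')

# index reports a missing substring by raising ValueError
try:
    sentence.index('whether')
    found = True
except ValueError:
    found = False

print(position, found)  # -1 False
```

* `find` is convenient when a missing substring is an expected case; `index` suits code in which a missing substring should be treated as an error.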

### Determining Whether a String Contains a Substring 
* To check whether a string contains a substring, use operator `in` or `not in`

In [12]:
'that' in sentence

True

In [13]:
'THAT' in sentence

False

In [14]:
'THAT' not in sentence

True
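
* The `in` test is case-sensitive, which is why `'THAT'` is not found. One common approach to a case-insensitive check, assuming plain text for which `lower` is adequate, is to convert both operands to the same case first:

```python
sentence = 'to be or not to be that is the question'

# Compare lowercase copies; neither original string is modified
contains_that = 'THAT'.lower() in sentence.lower()

print(contains_that)  # True
```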

### Locating a Substring at the Beginning or End of a String
* **`startswith`** and **`endswith`** return `True` if the string starts with or ends with a specified substring

In [15]:
sentence.startswith('to')

True

In [16]:
sentence.startswith('be')

False

In [17]:
sentence.endswith('question')

True

In [18]:
sentence.endswith('quest')

False
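
* As noted earlier in this section, the search methods accept optional start and end indices. A brief sketch using `index`, `startswith` and `find` (the variable names are illustrative):

```python
sentence = 'to be or not to be that is the question'

# Start at index 4 to skip the first 'be', which is at index 3
second_be = sentence.index('be', 4)

# Test for 'to' beginning at index 13 rather than at index 0
to_at_13 = sentence.startswith('to', 13)

# Search only the slice sentence[0:12], which does not contain 'that'
early_that = sentence.find('that', 0, 12)

print(second_be, to_at_13, early_that)  # 16 True -1
```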

# 8.8 Replacing Substrings
* A common text manipulation is to locate a substring and replace its value
* **`replace`** searches a string for the substring in its first argument and replaces _each_ occurrence with the substring in its second argument
* Can receive an optional third argument specifying the maximum number of replacements 

In [19]:
values = '1\t2\t3\t4\t5'

In [20]:
values.replace('\t', ',')

'1,2,3,4,5'
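
* A sketch of the optional third argument, here limiting `replace` to the first two tab characters (the name `partly_replaced` is illustrative):

```python
values = '1\t2\t3\t4\t5'

# Replace at most two occurrences, scanning from the left
partly_replaced = values.replace('\t', ',', 2)

print(repr(partly_replaced))  # '1,2,3\t4\t5'
print(repr(values))           # '1\t2\t3\t4\t5' (original unchanged)
```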

# 8.9 Splitting and Joining Strings
* Tokens typically are separated by whitespace characters such as blank, tab and newline, though other characters may be used—the separators are known as **delimiters**

### Splitting Strings
* **`split`** with no arguments tokenizes a string at whitespace; to tokenize at a custom delimiter, pass the delimiter string as the argument

In [21]:
letters = 'A, B, C, D'

In [22]:
letters.split(', ')

['A', 'B', 'C', 'D']
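
* Two further behaviors of `split`, sketched with illustrative values: calling it with no arguments splits at runs of whitespace, and an integer second argument limits the number of splits.

```python
# With no arguments, split breaks the string at runs of whitespace
words = 'A \t B\nC   D'.split()
print(words)  # ['A', 'B', 'C', 'D']

# A second argument sets the maximum number of splits;
# the last token is the remainder of the string
letters = 'A, B, C, D'
first_two = letters.split(', ', 2)
print(first_two)  # ['A', 'B', 'C, D']
```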

### Joining Strings
* **`join`** concatenates the strings in its argument, which must be an iterable containing only string values
* The separator between the concatenated items is the string on which you call `join`

In [23]:
letters_list = ['A', 'B', 'C', 'D']

In [24]:
','.join(letters_list)

'A,B,C,D'

* Join the results of a list comprehension that creates a list of strings

In [25]:
','.join([str(i) for i in range(10)])

'0,1,2,3,4,5,6,7,8,9'
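
* Because `join` requires string items, passing numbers directly fails. A small sketch of the failure and one way around it (the names `numbers`, `joined_ok` and `joined` are illustrative):

```python
numbers = [1, 2, 3]

# join raises TypeError when the iterable contains non-string items
try:
    ','.join(numbers)
    joined_ok = True
except TypeError:
    joined_ok = False

# Convert each item with str first; a generator expression also works
joined = ','.join(str(n) for n in numbers)

print(joined_ok, joined)  # False 1,2,3
```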

# 8.10 Characters and Character-Testing Methods
* In Python, a character is simply a one-character string
* Python provides string methods for testing whether a string matches certain characteristics
* **`isdigit`** returns `True` if the string on which you call the method contains only digit characters (e.g., `0`–`9`)
    * Useful for validating data

In [26]:
'-27'.isdigit()


False

In [27]:
'27'.isdigit()

True
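
* Since `isdigit` rejects a leading sign, validating signed input needs an extra step. A minimal sketch, assuming a hypothetical helper named `is_signed_integer` that is not part of Python's string methods:

```python
def is_signed_integer(text):
    """Return True if text is an optional + or - followed only by digits.

    Hypothetical helper for illustration; it relies on str.isdigit.
    """
    # Drop one leading sign character, if present
    if text[:1] in ('+', '-'):
        text = text[1:]
    # An empty remainder gives False because ''.isdigit() is False
    return text.isdigit()


print(is_signed_integer('-27'))  # True
print(is_signed_integer('27'))   # True
print(is_signed_integer('2.7'))  # False
print(is_signed_integer('-'))    # False
```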

# 8.10 Characters and Character-Testing Methods (cont.)
* `isalnum` returns `True` if the string on which you call the method is alphanumeric (only digits and letters)

In [28]:
'A9876'.isalnum()

True

In [29]:
'123 Main Street'.isalnum()

False

# 8.10 Characters and Character-Testing Methods (cont.)
* Table of many character-testing methods

| String Method | Description
| -------- | --------
| `isalnum()` | Returns `True` if the string contains only _alphanumeric_ characters (i.e., digits and letters).
| `isalpha()`  | Returns `True` if the string contains only _alphabetic_ characters (i.e., letters).
| `isdecimal()`  | Returns `True` if the string contains only _decimal integer_ characters (that is, base 10 integers) and does not contain a + or - sign.
| `isdigit()`  | Returns `True` if the string contains only digits (e.g., '0', '1', '2').
| `isidentifier()`  | Returns `True` if the string represents a valid _identifier_. 
| `islower()`  | Returns `True` if all alphabetic characters in the string are _lowercase_ characters (e.g., `'a'`, `'b'`, `'c'`).
| `isnumeric()`  | Returns `True` if the characters in the string represent a _numeric value_ without a + or - sign and without a decimal point.
| `isspace()`  | Returns `True` if the string contains only _whitespace_ characters.
| `istitle()`  | Returns `True` if the first character of each word in the string is the only _uppercase_ character in the word. 
| `isupper()`  | Returns `True` if all alphabetic characters in the string are _uppercase_ characters (e.g., `'A'`, `'B'`, `'C'`).
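
* The three numeric tests in the table differ mainly on non-ASCII characters. A short comparison, using the superscript two (`'\u00b2'`) and the fraction one half (`'\u00bd'`) as illustrative inputs:

```python
samples = ['27', '\u00b2', '\u00bd', '-27']

# Collect (isdecimal, isdigit, isnumeric) for each sample string
rows = [(s.isdecimal(), s.isdigit(), s.isnumeric()) for s in samples]

for sample, row in zip(samples, rows):
    print(repr(sample), row)

# '27'  (True, True, True)
# '²'   (False, True, True)
# '½'   (False, False, True)
# '-27' (False, False, False)
```

* When the goal is to accept only text that `int` can convert, `isdecimal` is the strictest of the three.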

# 8.11 Raw Strings
* Backslash characters in strings introduce _escape sequences_—like `\n` for newline and `\t` for tab
* To include a backslash in a string, use two backslash characters `\\`
* Makes some strings difficult to read
* Consider a Microsoft Windows file location:

In [30]:
file_path = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'

In [31]:
file_path

'C:\\MyFolder\\MySubFolder\\MyFile.txt'

# 8.11 Raw Strings (cont.)
* **raw strings**—preceded by the character `r`—are more convenient
* They treat each backslash as a regular character, rather than the beginning of an escape sequence
* We will see this when we create regular expression strings, which often contain backslash characters (`\`)

In [32]:
file_path = r'C:\MyFolder\MySubFolder\MyFile.txt'

In [33]:
file_path

'C:\\MyFolder\\MySubFolder\\MyFile.txt'
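
* The raw string and the escaped string are the same value; only the way they are written in source code differs. A brief sketch (the names `escaped` and `raw` are illustrative):

```python
escaped = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'
raw = r'C:\MyFolder\MySubFolder\MyFile.txt'

# Both literals produce the same string
print(escaped == raw)  # True

# print shows the actual characters, with single backslashes
print(raw)  # C:\MyFolder\MySubFolder\MyFile.txt

# '\n' is one newline character; r'\n' is a backslash followed by n
newline_length = len('\n')
raw_length = len(r'\n')
print(newline_length, raw_length)  # 1 2
```

* The doubled backslashes in the `In [33]` output above come from how IPython displays a string's representation, not from extra characters in the string.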