# Last Time
* We looked at the basics of loading a text file and reading its "words" into a list.
* We worked through some exercises related to lists, sets, and dictionaries.
* We took a second look at formatting strings.

# This Time
* We'll wrap up our discussion of principal string operations.
* We _may_ start our discussion of regular expression matching and manipulation.

# String Introduction
* Strings support many of the same sequence operations as lists and tuples
* Strings, like tuples, are immutable
* Here, we take a deeper look at general string functionality 
* String manipulation is vital to understanding regular expressions and the `re` module for matching patterns in text, which is particularly important in today’s data-rich applications.

# String Introduction (cont.)
* The table below shows many string-processing and NLP-related applications

| String and NLP applications
| ------------------------
| Anagrams
| Automated grading of written homework
| Automated teaching systems
| Categorizing articles
| Chatbots
| Compilers and interpreters
| Creative writing
| Cryptography
| Document classification
| Document similarity
| Document summarization
| Electronic book readers
| Fraud detection
| Grammar checkers
| Inter-language translation
| Legal document preparation
| Monitoring social media posts
| Natural language understanding
| Opinion analysis
| Page-composition software
| Palindromes
| Parts-of-speech tagging
| Project Gutenberg free books
| Reading books, articles, documentation and absorbing knowledge
| Search engines
| Sentiment analysis
| Spam classification
| Speech-to-text engines
| Spell checkers
| Steganography
| Text editors
| Text-to-speech engines
| Web scraping
| Who authored Shakespeare’s works?
| Word clouds
| Word games
| Writing medical diagnoses from x-rays, scans, blood tests
| and many more…

## String’s `format` Method
* f-strings were added to Python in version 3.6
* Before that, formatting was performed with the string method **`format`**
* f-string formatting is based on the `format` method’s capabilities
* The principal concept was to call `format` as a _method_ on a **format string** containing curly brace (`{}`) **placeholders**, possibly with format specifiers
* The arguments to this method call are then the values to be formatted
* Format specifiers are preceded by a colon (`:`)

In [3]:
print('{:.2f}'.format(17.489))

17.49


### Multiple Placeholders
* The format string may contain multiple placeholders
* `format` method’s arguments correspond to the placeholders from left to right

In [4]:
print('{} was {}.'.format('The mysterious figure', 'here'))

The mysterious figure was here.


### Referencing Arguments By Position Number or Keyword
* A format string can reference specific arguments by their position in the `format` method’s argument list
* This can be used to repeat one or more arguments.
* Likewise, values can be passed as keyword arguments and referenced by keyword name within the format string.
* It's advised to stick to one style (positional or keyword) per format string.

In [5]:
print('{0}, {0} {1}'.format('Long', 'Ago'))

Long, Long Ago


In [6]:
print('{first} {last}'.format(first='Wanda', last='Frank'))

Wanda Frank


In [7]:
print('{last}, {first}'.format(first='Wanda', last='Frank'))

Frank, Wanda
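
Because f-strings build on the same format specifiers, the two styles can be compared side by side. This is a minimal sketch; the names `item` and `price` are made up for illustration.

```python
item = 'Widget'   # hypothetical values, not from the lecture
price = 17.489

# The same format specifier (.2f) works in both styles.
with_format = '{}: {:.2f}'.format(item, price)
with_fstring = f'{item}: {price:.2f}'

print(with_format)                   # Widget: 17.49
print(with_fstring)                  # Widget: 17.49
print(with_format == with_fstring)   # True
```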


# Stripping Whitespace from Strings
* It's often necessary to remove whitespace from one or both ends of a string for purposes of formatting and/or comparison to other strings.
    * **`strip`** removes leading and trailing whitespace
    * **`lstrip`** removes only leading whitespace
    * **`rstrip`** removes only trailing whitespace 
* Note that whitespace includes tab (`\t`) and newline (`\n`) characters.

In [8]:
sentence = '\t  \n  This is a test string. \t\t \n'
print('|'+sentence+'|')

|	  
  This is a test string. 		 
|


In [9]:
print('|'+sentence.strip()+'|')

|This is a test string.|


In [10]:
print('|'+sentence.lstrip()+'|')

|This is a test string. 		 
|


In [11]:
print('|'+sentence.rstrip()+'|')

|	  
  This is a test string.|
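
As a small illustration of why stripping matters for comparisons, the sketch below compares a string that carries stray whitespace against an expected value. The names `expected` and `raw_answer` are assumptions made for the example.

```python
expected = 'yes'
raw_answer = '  yes \n'   # hypothetical text with stray whitespace at both ends

print(raw_answer == expected)           # False: the whitespace counts
print(raw_answer.strip() == expected)   # True: compared after stripping

# strip returns a new string; the original is unchanged (strings are immutable).
print(repr(raw_answer))                 # '  yes \n'
```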


## Changing Character Cases for a String
* The method **`capitalize`** returns a new string with only the first letter capitalized (sometimes called **sentence capitalization**)
* The method **`title`** returns a new string with only the first character of each word capitalized (sometimes called **book-title capitalization**)

In [12]:
print('the road was long and fraught with DANGER.'.capitalize())

The road was long and fraught with danger.


In [14]:
print('happy birThday home'.title())

Happy Birthday Home
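
The two methods can be applied to the same string to see how they differ. In this sketch the variable names are made up; note that both methods lowercase the remaining letters and leave the original string untouched.

```python
heading = 'the road was LONG'   # hypothetical example text

sentence_case = heading.capitalize()
title_case = heading.title()

print(sentence_case)   # The road was long
print(title_case)      # The Road Was Long
print(heading)         # the road was LONG  (original is unchanged)
```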


# Comparison Operators for Strings
* Strings may be compared with the comparison operators
* Strings are compared based on their underlying integer numeric values, character by character (much like elements of a list!)
* You can check a character's integer code with `ord`

In [15]:
print(f'O: {ord("O")}; o: {ord("o")}')

O: 79; o: 111


In [16]:
print('Orange' == 'orange')
print('Orange' != 'orange')
print('Orange' < 'orange')
print('Orange' <= 'orange')
print('Orange' > 'orange')
print('Orange' >= 'orange')


False
True
True
True
False
False
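
A few more comparisons may help show how the character-by-character rule plays out. This is a sketch with made-up example words; sorting a list of strings uses the same ordering.

```python
# Uppercase letters have smaller codes than lowercase ones.
print('Zebra' < 'apple')      # True: ord('Z') is 90, ord('a') is 97

# The first differing character decides: 'p' (112) vs 'r' (114) at index 2.
print('apple' < 'apricot')    # True

# If one string is a prefix of the other, the shorter one is smaller.
print('Orange' < 'Oranges')   # True

words = ['orange', 'Orange', 'apple', 'Banana']   # hypothetical data
sorted_words = sorted(words)
print(sorted_words)           # ['Banana', 'Orange', 'apple', 'orange']
```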


# Searching for Substrings
* Can search a string for a **substring** to accomplish many tasks:
    * determine whether a string contains a substring
    * count number of occurrences
    * determine the index at which a substring resides in a string
* Each method we look at matches characters exactly, using their underlying numeric values, so searches are case-sensitive (capitalization counts!)

### Counting Occurrences
* **`count`** returns the number of times its argument occurs in a string
* If you specify as the second argument a **start index**, `count` searches only the slice **_string_`[`*start_index*`:]`**
* If you specify as the second and third arguments the **start index** and **end index**, `count` searches only the slice **_string_`[`*start_index*`:`*end_index*`]`**

In [17]:
sentence = 'that that creature was there after moonfall was that which perplexed me'
print(sentence.count('that'))

3


In [18]:
print(sentence.count('that',10))

1


In [19]:
print(sentence.count('that',10,30))

0
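
The slice equivalence described above can be checked directly. The sketch below reuses the lecture's sentence and also shows that `count` counts non-overlapping occurrences.

```python
sentence = 'that that creature was there after moonfall was that which perplexed me'

# Counting with start/end indices matches counting within the corresponding slice.
same_from_start = sentence.count('that', 10) == sentence[10:].count('that')
same_in_range = sentence.count('that', 10, 30) == sentence[10:30].count('that')
print(same_from_start)   # True
print(same_in_range)     # True

# Occurrences are counted without overlap.
overlap_count = 'aaaa'.count('aa')
print(overlap_count)     # 2
```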


### Locating a Substring in a String
* **`index`** searches for a substring within a string and returns the first index at which the substring is found
* If it is **not** found, a `ValueError` is raised.
* Like `count`, `index` accepts optional **start index** and **end index** arguments 

In [20]:
print(sentence)
print(sentence.index('was'))

that that creature was there after moonfall was that which perplexed me
19


In [23]:
print(sentence.index('was',20))

44


In [24]:
print(sentence.index('was',20,40)) #raises a ValueError, since the substring is not present within this range

ValueError: substring not found
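
Since `index` raises an exception when the substring is missing, one way to keep a program running is to catch the `ValueError`. This is a minimal sketch; using `None` as the "not found" marker is an assumption made for the example.

```python
sentence = 'that that creature was there after moonfall was that which perplexed me'

try:
    position = sentence.index('was', 20, 40)
except ValueError:
    # 'was' does not occur between indices 20 and 40
    position = None

print(position)   # None
```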

### Index Variants
* **`rindex`** performs the same operation as `index`, but searches from the end of the string backwards.
* **`find`** performs the same task as `index` but returns `-1` if the substring isn't found.
* **`rfind`** performs the same operation as `find`, but searches from the end of the string backwards.

In [25]:
print(sentence)
print(sentence.index('that')) #first time substring is encountered
print(sentence.rindex('that')) #last time substring is encountered


that that creature was there after moonfall was that which perplexed me
0
48


In [26]:
print(sentence)
print(sentence.find('that',20,40))

that that creature was there after moonfall was that which perplexed me
-1
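
Because `find` returns `-1` instead of raising an exception, it fits naturally in a loop. The sketch below collects every starting index of `'that'` in the lecture's sentence; the loop structure is just one possible approach.

```python
sentence = 'that that creature was there after moonfall was that which perplexed me'

positions = []
start = sentence.find('that')
while start != -1:
    positions.append(start)
    # Continue searching just past the match we found.
    start = sentence.find('that', start + 1)

print(positions)   # [0, 5, 48]
```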


### Determining if a String Contains a Substring Or Begins/Ends With One 
* To check whether a string contains a substring, use operator `in`
* The _method_ **`startswith`** returns `True` if the string starts with the specified substring
* The _method_ **`endswith`** returns `True` if the string ends with the specified substring


In [30]:
print(sentence)
print('that' in sentence)

that that creature was there after moonfall was that which perplexed me
True


In [31]:
print('koala' not in sentence)

True


In [32]:
print(sentence.startswith('that'))

True


In [33]:
print(sentence.endswith('that'))

False
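
A common use of these checks is filtering a list of strings. In the sketch below the filenames are invented for illustration; note that the checks are case-sensitive.

```python
sentence = 'that that creature was there after moonfall was that which perplexed me'

print(sentence.endswith('me'))       # True
print(sentence.startswith('That'))   # False: capitalization counts

filenames = ['notes.txt', 'data.csv', 'report.txt']   # hypothetical filenames
text_files = [name for name in filenames if name.endswith('.txt')]
print(text_files)                    # ['notes.txt', 'report.txt']
```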


# Replacing Substrings
* A common text manipulation is to locate a substring and replace its value
* **`replace`** searches a string for the substring in its first argument and replaces _each_ occurrence with the substring in its second argument
* Can receive an optional third argument specifying the maximum number of replacements
* Use an empty string (`''`) for the second value to "delete" any substring

In [34]:
values = '1\t2\t3\t4\t5' # \t is a tab character
print(values)

1	2	3	4	5


In [36]:
print(values.replace('\t', ','))
print(values.replace('\t', ',',2))
x=values.replace('1','10')
print(values)
print(x)

1,2,3,4,5
1,2,3	4	5
1	2	3	4	5
10	2	3	4	5
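
The "delete" use of `replace` mentioned above can be sketched as follows. The `phone` value is a made-up example.

```python
values = '1\t2\t3\t4\t5'

# Replacing with an empty string removes every occurrence of the substring.
no_tabs = values.replace('\t', '')
print(no_tabs)    # 12345

phone = '555-123-4567'   # hypothetical example value
digits_only = phone.replace('-', '')
print(digits_only)   # 5551234567
```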


# Tokens and Tokenization
* The word **token** has several meanings in the context of computer science -- in the case of strings a single token generally represents an individual element of interest (e.g. a word or even a combination of symbols with a particular meaning).
* **Tokenization** is the process of splitting apart individual tokens from a larger collective (e.g. a string or file).
* Tokens typically are separated by whitespace characters such as blank, tab and newline, though other characters may be used—these separators are known as **delimiters**

### Splitting Strings
* Recall that `split` separates (i.e. tokenizes) parts of a string using a delimiter 
    * The default is to follow the approach above (any whitespace characters, including ' ', '\n', '\t') 
    * To tokenize a string at a custom delimiter, specify the delimiter string that `split` uses to tokenize the string
* `split` can have a specified maximum number of splits with an integer as the second argument
    * In such cases, the remaining portion of the string will remain a contiguous substring of the original.
    * `rsplit` can be used in lieu of `split` to process tokens from right to left.

In [42]:
lettersV1 = 'A B\t C\n D '
print(lettersV1.split(None)) #By default, any whitespace or combination of whitespace characters separate one token from the next


['A', 'B', 'C', 'D']


In [38]:
lettersV2 = 'A,B,C,D'
print(lettersV2.split()) #The default delimiter here won't work
print(lettersV2.split(',')) #Instead we can specify ',' as a custom separator.


['A,B,C,D']
['A', 'B', 'C', 'D']


In [39]:
print(lettersV2.split(',',2)) #tokenize for the first two occurrences of ',' -- leave the rest leftover
print(lettersV2.rsplit(',',2)) #tokenize for the last two occurrences of ',' -- leave the rest leftover

['A', 'B', 'C,D']
['A,B', 'C', 'D']
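
One difference worth seeing is how the default whitespace splitting and an explicit delimiter treat repeated separators. The sketch below also uses a maximum split count to keep the tail of a line intact; `log_line` is a made-up example.

```python
# Default splitting treats a run of whitespace as one separator.
print('A  B'.split())       # ['A', 'B']

# An explicit delimiter splits at every occurrence, producing empty strings.
print('A  B'.split(' '))    # ['A', '', 'B']
print('A,,B'.split(','))    # ['A', '', 'B']

log_line = 'ERROR 42 disk is full'   # hypothetical log entry
level, code, message = log_line.split(' ', 2)
print(level)     # ERROR
print(code)      # 42
print(message)   # disk is full
```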


### Joining Strings
* **`join`** concatenates the strings in its argument, which must be an iterable containing only string values or expressions
* The separator between the concatenated items is the _string_ on which you call `join`
    * Use an empty string to simply join the iterable elements as they are
* List comprehensions can be joined in this way with ease.

In [43]:
letters_list = ['A', 'B', 'C', 'D']

In [45]:
print('||'.join(letters_list)) #join with '||' separating elements of the original iterable
print(''.join(letters_list)) #join with nothing separating elements of the original iterable

A||B||C||D
ABCD


In [48]:
print(','.join([str(i*i) for i in range(10)])) #Join the results of a list comprehension that creates a list of strings

0,1,4,9,16,25,36,49,64,81
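
`split` and `join` are often used together, for example to normalize whitespace. The sketch below also shows what happens when the iterable contains non-string values; the variable names are made up.

```python
text = 'A B\t C\n D '

# Split on any whitespace, then rejoin with single spaces.
tokens = text.split()
normalized = ' '.join(tokens)
print(normalized)   # A B C D

numbers = [1, 2, 3]
try:
    ','.join(numbers)        # join requires strings, so this fails
    join_failed = False
except TypeError:
    join_failed = True
print(join_failed)           # True

# Converting each element to a string first makes the join work.
joined_numbers = ','.join(str(n) for n in numbers)
print(joined_numbers)        # 1,2,3
```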


### String Methods `partition` and `rpartition` 
* String method **`partition`** splits a string into a tuple of three strings based on the method’s _separator_ argument
    * the part of the original string before the separator
    * the separator itself
    * the part of the string after the separator
* To search for the separator from the end of the string, use method **`rpartition`** 

In [49]:
name, separator, grades = 'Amanda: 89, 97, 92'.partition(': ')
print(name)
print(separator)
print(grades)

Amanda
: 
89, 97, 92


In [50]:
url = 'http://www.deitel.com/books/PyCDS/table_of_contents.html'
rest_of_url, separator, document = url.rpartition('/')
print(rest_of_url)
print(separator)
print(document)

http://www.deitel.com/books/PyCDS
/
table_of_contents.html
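
It may help to see what the two methods return when the separator occurs more than once, or not at all. This is a short sketch with made-up strings.

```python
# With several separators, partition splits at the first one, rpartition at the last.
first_split = 'a=b=c'.partition('=')
last_split = 'a=b=c'.rpartition('=')
print(first_split)   # ('a', '=', 'b=c')
print(last_split)    # ('a=b', '=', 'c')

# If the separator is missing, the whole string lands in one slot
# and the other two are empty strings.
missing = 'Amanda'.partition(': ')
missing_from_right = 'Amanda'.rpartition(': ')
print(missing)              # ('Amanda', '', '')
print(missing_from_right)   # ('', '', 'Amanda')
```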


### String Method `splitlines` 
* **`splitlines`** returns a list of new strings representing lines of text split at each newline character in the original string
* Passing `True` to `splitlines` keeps the newlines

In [51]:
lines = """This is line 1
This is line2
This is line3"""

In [52]:
print(lines)
print(lines.splitlines())
print(lines.splitlines(True))

This is line 1
This is line2
This is line3
['This is line 1', 'This is line2', 'This is line3']
['This is line 1\n', 'This is line2\n', 'This is line3']
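
For comparison, the sketch below contrasts `splitlines` with `split('\n')` on text that mixes line endings and ends with a newline. The `mixed` string is a made-up example.

```python
mixed = 'line 1\nline 2\r\nline 3\n'   # hypothetical text with mixed line endings

# splitlines recognizes both \n and \r\n and adds no empty string at the end.
by_splitlines = mixed.splitlines()
print(by_splitlines)   # ['line 1', 'line 2', 'line 3']

# split('\n') leaves the \r in place and yields a trailing empty string.
by_split = mixed.split('\n')
print(by_split)        # ['line 1', 'line 2\r', 'line 3', '']
```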


# Characters and Character-Testing Methods
* In Python, a character is simply a one-character string
* Python provides string methods for testing whether a string matches certain characteristics
    * **`isdigit`** returns `True` if the string on which you call the method contains only digit characters (e.g., `0`–`9`)
    * **`isalnum`** returns `True` if the string on which you call the method is alphanumeric (only digits and letters)
* Character-testing is critical for validating input or searching for specific data-types.
   

In [53]:
print('27'.isdigit())
print('-27'.isdigit()) #EVERY element in the string must be a digit for this to return true

True
False


In [54]:
print('A9876'.isalnum())
print('123 Main Street'.isalnum()) # Whitespace characters are NOT alphanumeric

True
False


* Table of many character-testing methods

| String Method | Description
| -------- | --------
| `isalnum()` | Returns `True` if the string contains only _alphanumeric_ characters (i.e., digits and letters).
| `isalpha()`  | Returns `True` if the string contains only _alphabetic_ characters (i.e., letters).
| `isdecimal()`  | Returns `True` if the string contains only _decimal integer_ characters (that is, base 10 integers) and does not contain a + or - sign.
| `isdigit()`  | Returns `True` if the string contains only digits (e.g., '0', '1', '2').
| `isidentifier()`  | Returns `True` if the string represents a valid _identifier_. 
| `islower()`  | Returns `True` if all alphabetic characters in the string are _lowercase_ characters (e.g., `'a'`, `'b'`, `'c'`).
| `isnumeric()`  | Returns `True` if the characters in the string represent a _numeric value_ without a + or - sign and without a decimal point.
| `isspace()`  | Returns `True` if the string contains only _whitespace_ characters.
| `istitle()`  | Returns `True` if the first character of each word in the string is the only _uppercase_ character in the word. 
| `isupper()`  | Returns `True` if all alphabetic characters in the string are _uppercase_ characters (e.g., `'A'`, `'B'`, `'C'`).
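
As a sketch of input validation with these methods, the hypothetical helper `is_valid_quantity` below accepts only non-negative whole numbers. A few other methods from the table are exercised as well.

```python
def is_valid_quantity(text):
    """Hypothetical validator: True for strings such as '27' (surrounding whitespace allowed)."""
    return text.strip().isdigit()

print(is_valid_quantity(' 27 '))   # True
print(is_valid_quantity('-27'))    # False: '-' is not a digit
print(is_valid_quantity('2.5'))    # False: '.' is not a digit
print(is_valid_quantity(''))       # False: an empty string fails every test

print('my_var'.isidentifier())     # True
print('2nd'.isidentifier())        # False: identifiers cannot start with a digit
print('Hello World'.istitle())     # True
print('hello world'.islower())     # True: the space is ignored, the letters are lowercase
```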

# Raw Strings
* Backslash characters in strings introduce _escape sequences_ — like `\n` for newline and `\t` for tab
* To include a backslash in a string, use two backslash characters `\\`
* These can occasionally make some strings relevant to filepaths difficult to read or interpret
* Consider a Microsoft Windows file location, _C:\MyFolder\MySubFolder\MyFile.txt_: written as an ordinary string literal, it must be `'C:\\MyFolder\\MySubFolder\\MyFile.txt'`
* **Raw strings**, which are preceded by the character `r`, are often more convenient
* They treat each backslash as a regular character, rather than the beginning of an escape sequence

In [55]:
file_path = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'
print(file_path)

C:\MyFolder\MySubFolder\MyFile.txt


In [56]:
file_path = r'C:\MyFolder\MySubFolder\MyFile.txt'
print(file_path)

C:\MyFolder\MySubFolder\MyFile.txt
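
The two spellings produce the same string; only the literal differs. The sketch below checks this and shows how a raw string changes what `\n` means.

```python
escaped = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'
raw = r'C:\MyFolder\MySubFolder\MyFile.txt'

# Both literals describe exactly the same characters.
print(escaped == raw)   # True

# In an ordinary string '\n' is one newline character;
# in a raw string it is a backslash followed by the letter n.
print(len('\n'))    # 1
print(len(r'\n'))   # 2
```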
