# Formatting strings

Proper text formatting makes data easier to read and understand.

##  Presentation Types

When you specify a placeholder for a value in an f-string, Python assumes the
value should be displayed as a string unless you specify another type. In some
cases, the type is required.

In [1]:
f'{17.489:.2f}'

'17.49'

The f presentation type, and the precision that accompanies it, applies to numeric
values such as floating-point and Decimal values. Formatting is type dependent: if
you try to use .2f to format a string like 'hello', a `ValueError` occurs. So the
presentation type f in the format specifier .2f is required. It indicates what type
is being formatted so Python can determine whether the other formatting information
is allowed for that type.
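
As a minimal sketch of that type check, the helper below (`try_format` is a hypothetical name, not part of Python) applies a format specifier with the built-in `format` function and reports the `ValueError` instead of letting it propagate:

```python
def try_format(value, spec):
    """Return the formatted text, or a short message if the spec is rejected."""
    try:
        return format(value, spec)
    except ValueError as error:
        # Raised when the presentation type does not apply to the value's type.
        return f'ValueError: {error}'

print(try_format(17.489, '.2f'))   # 17.49
print(try_format('hello', '.2f'))  # prints a ValueError message
```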

### Integers

In [2]:
f'{10:d}'

'10'

### Characters

In [3]:
f'{65:c} {97:c}'

'A a'

### Strings

In [4]:
f'{"hello":s} {7}'

'hello 7'

### Floating-Point and Decimal Values

You’ve used the f presentation type to format floating-point and Decimal values.
For extremely large and small values of these types, exponential (scientific)
notation can be used to format the values more compactly. Let’s show the difference
between f and e for a large value, each with three digits of precision to the right
of the decimal point.

In [5]:
from decimal import Decimal

In [6]:
f'{Decimal("10000000000000000000000000.0"):.3f}'

'10000000000000000000000000.000'

In [7]:
f'{Decimal("10000000000000000000000000.0"):.3e}'

'1.000e+25'

#  Field Widths and Alignment

In [10]:
f'[{27:10d}]'

'[        27]'

In [11]:
f'[{3.5:10f}]'

'[  3.500000]'

In [12]:
f'[{"hello":10}]'

'[hello     ]'

### Explicitly Specifying Left and Right Alignment in a Field 

In [13]:
f'[{27:<15d}]'

'[27             ]'

In [14]:
f'[{3.5:<15f}]'

'[3.500000       ]'

In [15]:
f'[{"hello":>15}]'

'[          hello]'

### Centering a Value in a Field 

In [16]:
f'[{27:^7d}]'

'[  27   ]'

In [17]:
f'[{3.5:^7f}]'

'[3.500000]'

In [18]:
f'[{"hello":^7}]'

'[ hello ]'

#  Numeric Formatting

### Formatting Positive Numbers with Signs

In [23]:
f'[{-27:+10d}]'

'[       -27]'

In [20]:
f'[{27:+010d}]'

'[+000000027]'

### Using a Space Where a + Sign Would Appear in a Positive Value

In [24]:
print(f'{27:d}\n{27: d}\n{-27: d}')

27
 27
-27


### Grouping Digits

In [25]:
f'{12345678:,d}'

'12,345,678'

In [26]:
f'{123456.78:,.2f}'

'123,456.78'

#  String’s format Method

Python’s f-strings were added to the language in version 3.6. Before that,
formatting was performed with the string method `format`. In fact, f-string
formatting is based on the `format` method’s capabilities. You’ll often see the
`format` method in the Python documentation and in the many Python books and
articles written before f-strings were introduced. However, we recommend using the
newer f-string formatting.

You call method format on a format string containing curly brace ({}) placeholders,
possibly with format specifiers.

In [27]:
'{:.2f}'.format(17.489)

'17.49'
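
Because f-strings and the `format` method share the same format-specifier syntax, the same specifier should give the same text either way. The short sketch below compares them, along with the built-in `format` function; the variable names are only illustrative:

```python
value = 17.489

with_fstring = f'{value:.2f}'          # f-string placeholder
with_method = '{:.2f}'.format(value)   # string method format
with_builtin = format(value, '.2f')    # built-in function format

print(with_fstring, with_method, with_builtin)  # 17.49 17.49 17.49
```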

### Multiple Placeholders


A format string may contain multiple placeholders, in which case the format method’s
arguments correspond to the placeholders from left to right.

In [28]:
'{} {}'.format('Amanda', 'Cyan')

'Amanda Cyan'

### Referencing Arguments By Position Number

In [29]:
'{0} {0} {1}'.format('Happy', 'Birthday')

'Happy Happy Birthday'

### Referencing Keyword Arguments

In [30]:
'{first} {last}'.format(first='Amanda', last='Gray')

'Amanda Gray'

In [31]:
'{last} {first}'.format(first='Amanda', last='Gray')

'Gray Amanda'

# Concatenating and Repeating Strings 

In [32]:
s1 = 'happy'

In [33]:
s2 = 'birthday'

In [34]:
s1 += ' ' + s2

In [35]:
s1

'happy birthday'

In [1]:
symbol = '>'

In [2]:
symbol *= 5

In [3]:
symbol

'>>>>>'

# Stripping White Space from Strings

### Removing Leading and Trailing Whitespace

In [8]:
sentence = '\t  \n wassaaaaup      This is a test string. \t\t \n'
print(sentence)

	  
 wassaaaaup      This is a test string. 		 



In [9]:
sentence.strip()

'wassaaaaup      This is a test string.'

### Removing Leading Whitespace

In [7]:
sentence.lstrip()

'wassaaaaup      This is a test string. \t\t \n'

### Removing Trailing Whitespace

In [43]:
sentence.rstrip()

'\t  \n wassaaaaup      This is a test string.'

#  Comparison Operators for Strings

In [44]:
print(f'A: {ord("A")}; a: {ord("a")}')

A: 65; a: 97


In [45]:
'Orange' == 'orange'

False

In [46]:
'Orange' != 'orange'

True

In [47]:
'Orange' < 'orange'

True

In [48]:
'Orange' <= 'orange'

True

In [49]:
'Orange' > 'orange'

False

In [50]:
'Orange' >= 'orange'

False

#  Searching for Substrings

### Counting Occurrences

In [51]:
sentence = 'to be or not to be that is the question'

In [10]:
sentence.count('to')

2

In [53]:
sentence.count('to', 12)

1

In [54]:
sentence.count('that', 12, 25)

1

### Locating a Substring in a String

In [55]:
sentence.index('be')

3

In [56]:
sentence.rindex('be')

16

### Determining Whether a String Contains a Substring 

In [57]:
'that' in sentence

True

In [58]:
'THAT' in sentence

False

In [59]:
'THAT' not in sentence

True

### Locating a Substring at the Beginning or End of a String

In [11]:
sentence.startswith('to')

True

In [61]:
sentence.startswith('be')

False

In [65]:
sentence.endswith('question')

True

In [63]:
sentence.endswith('quest')

False

#  Replacing Substrings

In [68]:
values = '1\t2\t3\t4\t5'
print(values)

1	2	3	4	5


In [69]:
values.replace('\t', ',')

'1,2,3,4,5'

# Splitting and Joining Strings

### Splitting Strings

In [70]:
letters = 'A, B, C, D'

In [71]:
letters.split(', ')

['A', 'B', 'C', 'D']

In [72]:
letters.split(', ', 2)

['A', 'B', 'C, D']

### Joining Strings

In [12]:
letters_list = ['A', 'B', 'C', 'D']

In [13]:
','.join(letters_list)

'A,B,C,D'

In [75]:
','.join([str(i) for i in range(10)])

'0,1,2,3,4,5,6,7,8,9'

### String Methods partition and rpartition 

String method `partition` splits a string into a tuple of three strings based on the
method’s separator argument. The three strings are
* the part of the original string before the separator,
* the separator itself, and
* the part of the string after the separator.

This might be useful for splitting more complex strings.

In [76]:
'Amanda: 89, 97, 92'.partition(': ')

('Amanda', ': ', '89, 97, 92')

To search for the separator from the end of the string instead, use method
`rpartition` to split.

In [77]:
url = 'http://www.deitel.com/books/PyCDS/table_of_contents.html'

In [78]:
rest_of_url, separator, document = url.rpartition('/')

In [79]:
document

'table_of_contents.html'

In [80]:
rest_of_url

'http://www.deitel.com/books/PyCDS'
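
One detail worth checking before unpacking the result is what happens when the separator is missing. The sketch below uses a hypothetical record that lacks the `': '` separator; both methods still return a three-element tuple, but the original string lands at opposite ends:

```python
record = 'Amanda 89 97 92'  # hypothetical record with no ': ' separator

# partition puts the whole string first, followed by two empty strings.
print(record.partition(': '))   # ('Amanda 89 97 92', '', '')

# rpartition puts the two empty strings first and the whole string last.
print(record.rpartition(': '))  # ('', '', 'Amanda 89 97 92')
```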

### String Method splitlines 

If you read large amounts of text into a string, you might want to split the string
into a list of lines based on newline characters. Method `splitlines` returns a list
of new strings representing the lines of text split at each newline character in the
original string.

In [14]:
lines = """This is line 1
This is line2
This is line3"""

In [82]:
lines

'This is line 1\nThis is line2\nThis is line3'

In [83]:
lines.splitlines()

['This is line 1', 'This is line2', 'This is line3']

In [84]:
lines.splitlines(True)

['This is line 1\n', 'This is line2\n', 'This is line3']

#  Characters and Character-Testing Methods

String method `isdigit` returns True if the string on
which you call the method contains only the digit characters (0–9).

In [85]:
'-27'.isdigit()

False

In [86]:
'27'.isdigit()

True

String method `isalnum` returns True if the string on which you call the
method is alphanumeric—that is, it contains only digits and letters.

In [87]:
'A9876'.isalnum()

True

In [88]:
'123 Main Street'.isalnum()

False

#  Raw Strings

Recall that backslash characters in strings introduce escape sequences, like \n for
newline and \t for tab. So, if you wish to include a backslash in a string, you must
use two backslash characters \\. This makes some strings difficult to read.

In [89]:
file_path = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'

In [90]:
file_path

'C:\\MyFolder\\MySubFolder\\MyFile.txt'

For such cases, raw strings—preceded by the character r—are more convenient. They
treat each backslash as a regular character, rather than the beginning of an escape
sequence.

In [93]:
file_path = r'C:\MyFolder\MySubFolder\MyFile.txt'
print(file_path)

C:\MyFolder\MySubFolder\MyFile.txt


In [92]:
file_path

'C:\\MyFolder\\MySubFolder\\MyFile.txt'
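
The two spellings produce the same string object contents; only the source-code notation differs. A quick check, with illustrative variable names:

```python
escaped = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'  # doubled backslashes
raw = r'C:\MyFolder\MySubFolder\MyFile.txt'        # raw string

print(escaped == raw)   # True
print(raw.count('\\'))  # 3 (each backslash is stored once)
```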

# Regular Expressions

Sometimes you’ll need to recognize patterns in text, like phone numbers, email
addresses, web page addresses, Social Security numbers and more. A
**regular expression** string describes a search pattern for matching characters in
other strings.

##  re Module and Function fullmatch 

To use regular expressions, import the Python Standard Library’s re module.

In [18]:
import re

### Matching Literal Characters

One of the simplest regular expression functions is fullmatch, which checks whether
the entire string in its second argument matches the pattern in its first argument.

In [17]:
pattern = '02215'

In [19]:
'Match' if re.fullmatch(pattern, '02215') else 'No match'

'Match'

In [97]:
'Match' if re.fullmatch(pattern, '51220') else 'No match'

'No match'

### Metacharacters, Character Classes and Quantifiers

Regular expressions typically contain various special symbols called metacharacters: [] {} () \ * + ^ $ ? . |

The \ metacharacter begins each of the predefined character classes, each
matching a specific set of characters.

In [23]:
'Valid' if re.fullmatch(r'\d{5}', '02215') else 'Invalid'

'Valid'

In [22]:
'Valid' if re.fullmatch(r'\d{5}', '9876') else 'Invalid'

'Invalid'

In the regular expression \d{5}, \d is a character class representing a digit (0–9). A
character class is a regular expression escape sequence that matches one character. To
match more than one, follow the character class with a quantifier. The quantifier {5}
repeats \d five times, as if we had written \d\d\d\d\d, to match five consecutive
digits.
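
To see that the quantifier is only shorthand, the sketch below runs both spellings of the pattern against a few hypothetical test values and collects the results side by side:

```python
import re

# Hypothetical test values: one valid five-digit string and three invalid ones.
candidates = ['02215', '9876', '123456', '0221a']

results = []
for text in candidates:
    with_quantifier = bool(re.fullmatch(r'\d{5}', text))
    written_out = bool(re.fullmatch(r'\d\d\d\d\d', text))
    results.append((text, with_quantifier, written_out))
    print(f'{text!r:>10}: {with_quantifier} {written_out}')
```

Both columns agree for every value, and only '02215' matches.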

### Other Predefined Character Classes


The list below shows some common predefined character classes and the groups of
characters they match. To match any metacharacter as its literal value, precede it by a
backslash (\). For example, \\ matches a backslash (\) and \$ matches a dollar sign
($).

* \d Any digit (0–9).
* \D Any character that is not a digit.
* \s Any whitespace character (such as spaces, tabs and newlines).
* \S Any character that is not a whitespace character.
* \w Any word character (also called an alphanumeric character), that is, any uppercase or lowercase letter, any digit or an underscore.
* \W Any character that is not a word character.
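
The classes in the list above can be explored with `re.findall`, which returns every character that a one-character pattern matches. The sample string here is made up for illustration:

```python
import re

sample = 'Room 42_b\t!'  # hypothetical text mixing letters, digits and whitespace

# Show which characters of sample each predefined class matches.
for pattern in (r'\d', r'\D', r'\s', r'\S', r'\w', r'\W'):
    print(pattern, re.findall(pattern, sample))
```

Each uppercase class matches exactly the characters its lowercase counterpart rejects.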


### Custom Character Classes

Square brackets, [], define a custom character class that matches a single
character. For example, [aeiou] matches a lowercase vowel, [A-Z] matches an
uppercase letter, [a-z] matches a lowercase letter and [a-zA-Z] matches any
lowercase or uppercase letter.

The * quantifier matches zero or more occurrences of the subexpression to its left
(in this case, [a-z]). So [A-Z][a-z]* matches an uppercase letter followed by zero
or more lowercase letters, such as 'Amanda', 'Bo' or even 'E'.

In [24]:
'Valid' if re.fullmatch('[A-Z][a-z]*', 'Wally') else 'Invalid'

'Valid'

In [25]:
'Valid' if re.fullmatch('[A-Z][a-z]*', 'eva') else 'Invalid'

'Invalid'

When a custom character class starts with a caret (^), the class matches any character
that’s not specified. So [^a-z] matches any character that’s not a lowercase letter.

In [None]:
'Match' if re.fullmatch('[^a-z]', 'A') else 'No match'

In [None]:
'Match' if re.fullmatch('[^a-z]', 'a') else 'No match'

Metacharacters in a custom character class are treated as literal characters—that is, the
characters themselves. So `[*+$]` matches a single *, + or $ character.

In [None]:
'Match' if re.fullmatch('[*+$]', '*') else 'No match'

In [None]:
'Match' if re.fullmatch('[*+$]', '!') else 'No match'

### * vs. + Quantifier

If you want to require at least one lowercase letter in a first name, you can replace
the * quantifier with +, which matches at least one occurrence of a subexpression.

In [None]:
'Valid' if re.fullmatch('[A-Z][a-z]+', 'Wally') else 'Invalid'

In [None]:
'Valid' if re.fullmatch('[A-Z][a-z]+', 'E') else 'Invalid'

Both * and + are greedy: they match as many characters as possible. So the regular
expression [A-Z][a-z]+ matches 'Al', 'Eva', 'Samantha', 'Benjamin' and any other
words that begin with a capital letter followed by at least one lowercase letter.
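
Greediness is easier to see with `re.match`, which matches at the start of a string but does not have to consume all of it. The sketch below uses a hypothetical input and, for contrast only, the lazy form `+?`, which the text above does not otherwise cover:

```python
import re

text = 'Samantha123'  # hypothetical input: a name followed by digits

# Greedy +: consumes every lowercase letter it can.
greedy = re.match(r'[A-Z][a-z]+', text)
print(greedy.group())  # Samantha

# Lazy +?: stops after the minimum of one lowercase letter.
lazy = re.match(r'[A-Z][a-z]+?', text)
print(lazy.group())    # Sa
```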

### Other Quantifiers

The ? quantifier matches zero or one occurrences of a subexpression.

In [26]:
'Match' if re.fullmatch('labell?ed', 'labelled') else 'No match'

'Match'

In [31]:
'Match' if re.fullmatch('labell?ed', 'labeled') else 'No match'

'Match'

In [28]:
'Match' if re.fullmatch('labell?ed', 'labellled') else 'No match'

'No match'

You can match at least n occurrences of a subexpression with the {n,} quantifier.

In [32]:
'Match' if re.fullmatch(r'\d{3,}', '123') else 'No match'

'Match'

In [33]:
'Match' if re.fullmatch(r'\d{3,}', '1234567890') else 'No match'

'Match'

In [34]:
'Match' if re.fullmatch(r'\d{3,}', '12') else 'No match'

'No match'

You can match between n and m (inclusive) occurrences of a subexpression with the
{n,m} quantifier.

In [None]:
'Match' if re.fullmatch(r'\d{3,6}', '123') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,6}', '123456') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,6}', '1234567') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,6}', '12') else 'No match'

#  Replacing Substrings and Splitting Strings

### Function sub—Replacing Patterns 

By default, the re module’s `sub` function replaces all occurrences of a pattern with
the replacement text you specify.

In [35]:
import re

In [36]:
re.sub(r'\t', ', ', '1\t2\t3\t4')

'1, 2, 3, 4'

In [None]:
re.sub(r'\t', ', ', '1\t2\t3\t4', count=2)

The sub function receives three required arguments:
* the pattern to match (the tab character '\t')
* the replacement text (', ') and
* the string to be searched ('1\t2\t3\t4')

and returns a new string. The keyword argument count can be used to specify the
maximum number of replacements.
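
A small sketch of those points, with illustrative variable names: the first call replaces every tab, the second stops after two replacements, and the original string is left unchanged in both cases.

```python
import re

original = '1\t2\t3\t4'

all_replaced = re.sub(r'\t', ', ', original)
first_two = re.sub(r'\t', ', ', original, count=2)

print(repr(all_replaced))  # '1, 2, 3, 4'
print(repr(first_two))     # '1, 2, 3\t4'
print(repr(original))      # '1\t2\t3\t4'
```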

### Function split 

The `split` function tokenizes a string, using a regular expression to specify the
delimiter, and returns a list of strings. Let’s tokenize a string by splitting it at any
comma that’s followed by 0 or more whitespace characters—\s is the whitespace
character class and * indicates zero or more occurrences of the preceding
subexpression:

In [37]:
re.split(r',\s*', '1,  2,  3,4,    5,6,7,8')

['1', '2', '3', '4', '5', '6', '7', '8']

Use the keyword argument maxsplit to specify the maximum number of splits.

In [None]:
re.split(r',\s*', '1,  2,  3,4,    5,6,7,8', maxsplit=3)

#  Other Search Functions; Accessing Matches

### Function search—Finding the First Match Anywhere in a String

Function `search` looks in a string for the first occurrence of a substring that
matches a regular expression and returns a match object (of type re.Match; Python 3.6
and earlier displayed this type as SRE_Match) that contains the matching substring,
or None if there is no match. The match object’s group method returns that substring.

In [None]:
import re

In [38]:
result = re.search('Python', 'Python is fun')

In [39]:
result.group() if result else 'not found'

'Python'

In [40]:
result2 = re.search('fun!', 'Python is fun')

In [41]:
result2.group() if result2 else 'not found'

'not found'

### Ignoring Case with the Optional flags Keyword Argument

In [42]:
result3 = re.search('Sam', 'SAM WHITE', flags=re.IGNORECASE)

In [43]:
result3.group() if result3 else 'not found'

'SAM'

### Metacharacters that Restrict Matches to the Beginning or End of a String

The ^ metacharacter at the beginning of a regular expression (and not inside square
brackets) is an anchor indicating that the expression matches only the beginning of a
string.

In [44]:
result = re.search('^Python', 'Python is fun')

In [45]:
result.group() if result else 'not found'

'Python'

In [None]:
result = re.search('^fun', 'Python is fun')

In [None]:
result.group() if result else 'not found'

Similarly, the $ metacharacter at the end of a regular expression is an anchor
indicating that the expression matches only the end of a string.

In [None]:
result = re.search('Python$', 'Python is fun')

In [None]:
result.group() if result else 'not found'

In [None]:
result = re.search('fun$', 'Python is fun')

In [None]:
result.group() if result else 'not found'

### Function findall and finditer—Finding All Matches in a String

In [None]:
contact = 'Wally White, Home: 555-555-1234, Work: 555-555-4321'

Function `findall` finds every matching substring in a string and returns a list of the
matching substrings. 

In [None]:
re.findall(r'\d{3}-\d{3}-\d{4}', contact)

Function `finditer` works like findall, but returns a lazy iterable of match objects.
For large numbers of matches, using finditer can save memory because it returns
one match at a time, whereas findall returns all the matches at once.

In [None]:
for phone in re.finditer(r'\d{3}-\d{3}-\d{4}', contact):
    print(phone.group())

### Capturing Substrings in a Match

You can use parentheses metacharacters—( and )—to capture substrings in a
match.

In [None]:
text = 'Charlie Cyan, e-mail: demo1@deitel.com'

In [None]:
pattern = r'([A-Z][a-z]+ [A-Z][a-z]+), e-mail: (\w+@\w+\.\w{3})'

In [None]:
result = re.search(pattern, text)

The match object’s groups method returns a tuple of the captured substrings.

In [None]:
result.groups()

The match object’s group method returns the entire match as a single string.

In [None]:
result.group()

You can access each captured substring by passing an integer to the group method.
The captured substrings are numbered from 1 (unlike list indices, which start at 0).

In [None]:
result.group(1)

In [None]:
result.group(2)
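
The capture-group steps above can be run end to end as a single block. This sketch reuses the same text and pattern and guards against a failed search, since `re.search` returns None when nothing matches:

```python
import re

text = 'Charlie Cyan, e-mail: demo1@deitel.com'
pattern = r'([A-Z][a-z]+ [A-Z][a-z]+), e-mail: (\w+@\w+\.\w{3})'

result = re.search(pattern, text)
if result:
    name, email = result.groups()  # tuple of the two captured substrings
    print(result.group())          # the entire match
    print(result.group(1))         # Charlie Cyan
    print(result.group(2))         # demo1@deitel.com
```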

# Data Science: Pandas, Regular Expressions and Data Munging 


Data does not always come in forms ready for analysis. It could, for example, be in the
wrong format, incorrect or even missing. Preparing data for analysis is called **data munging** or **data wrangling**.

### Data Validation

Let’s begin by creating a Series of five-digit file numbers from a dictionary of
student-name/file-number key–value pairs. We intentionally entered an invalid file
number for Mariam.

In [2]:
import pandas as pd

In [8]:
students1 = pd.Series({'Bilal': '92215', 'Mariam': '3310'})

In [9]:
students1

Bilal     92215
Mariam     3310
dtype: object

We can use regular expressions with Pandas to validate data. The `str` attribute of a
Series provides string-processing and various regular expression methods.

In [10]:
students1.str.match(r'\d{5}')

Bilal      True
Mariam    False
dtype: bool

Sometimes, rather than matching an entire value to a pattern, you’ll want to know
whether a value contains a substring that matches the pattern. In this case, use method
`contains` instead of `match`.

Let’s create a Series of strings, each containing a student name, major and file
number, then determine whether each string contains a substring matching the pattern
' [A-Z]{2} ' (a space, followed by two uppercase letters, followed by a space).

In [11]:
students2 = pd.Series(['Bilal, CS 92215', 'Mariam, ST 33101'])

In [12]:
students2

0     Bilal, CS 92215
1    Mariam, ST 33101
dtype: object

In [13]:
students2.str.contains(r' [A-Z]{2} ')

0    True
1    True
dtype: bool

In [14]:
students2.str.match(r' [A-Z]{2} ')

0    False
1    False
dtype: bool
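
The difference between the two results follows from the `re` functions these methods build on: the pandas documentation describes `str.contains` as relying on `re.search` and `str.match` as relying on `re.match`. The sketch below reproduces the behavior with plain `re` calls on one value; the six-digit string '922150' is a hypothetical input added to show that `re.match` anchors only the start of the string, so `re.fullmatch` is the stricter check when the whole value must match:

```python
import re

value = 'Bilal, CS 92215'
pattern = r' [A-Z]{2} '

found_anywhere = bool(re.search(pattern, value))  # like str.contains
found_at_start = bool(re.match(pattern, value))   # like str.match
print(found_anywhere, found_at_start)             # True False

# re.match accepts extra trailing characters; re.fullmatch does not.
prefix_only = bool(re.match(r'\d{5}', '922150'))
whole_value = bool(re.fullmatch(r'\d{5}', '922150'))
print(prefix_only, whole_value)                   # True False
```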

### Reformatting Your Data

We’ve discussed data cleaning. Now let’s consider **munging data** into a different format.
As a simple example, assume that an application requires Lebanese phone numbers in the
format ##-###-###, with hyphens separating the groups of digits. The phone numbers have
been provided to us as 8-digit strings.

In [29]:
contacts = [['Malak', 'malak@demo.com', '55555555'],
            ['Samir', 'samir@demo.com', '55551234']]

In [30]:
contactsdf = pd.DataFrame(contacts, 
                          columns=['Name', 'Email', 'Phone'])

In [31]:
contactsdf

    Name           Email     Phone
0  Malak  malak@demo.com  55555555
1  Samir  samir@demo.com  55551234


In [18]:
import re

We can map the phone numbers to the proper format by calling the Series method `map`
on the DataFrame’s 'Phone' column. Method map’s argument is a function that receives
a value and returns the mapped value. The function `get_formatted_phone` maps 8
consecutive digits into the format ##-###-###:

In [35]:
def get_formatted_phone(value):
    result = re.fullmatch(r'(\d{2})(\d{3})(\d{3})', value)
    return '-'.join(result.groups()) if result else value

Series method map returns a new Series containing the results of calling its
function argument for each value in the column.

In [36]:
formatted_phone = contactsdf['Phone'].map(get_formatted_phone)

In [37]:
formatted_phone

0    55-555-555
1    55-551-234
Name: Phone, dtype: object
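
Note that `get_formatted_phone` returns any value that is not exactly 8 digits unchanged, so a bad entry passes through silently. The sketch below shows one possible way to confirm the format afterward, using a plain list in place of the Series; the second and third phone strings are hypothetical invalid entries:

```python
import re

def get_formatted_phone(value):
    result = re.fullmatch(r'(\d{2})(\d{3})(\d{3})', value)
    return '-'.join(result.groups()) if result else value

# The last two entries are hypothetical values that do not contain 8 digits.
phones = ['55555555', '5555-1234', '123']
formatted = [get_formatted_phone(phone) for phone in phones]
print(formatted)  # ['55-555-555', '5555-1234', '123']

# Check each result against the target ##-###-### format.
is_valid = [bool(re.fullmatch(r'\d{2}-\d{3}-\d{3}', phone)) for phone in formatted]
print(is_valid)   # [True, False, False]
```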

Once you’ve confirmed that the data is in the correct format, you can update it in the
original DataFrame by assigning the new Series to the 'Phone' column.

In [38]:
contactsdf['Phone'] = formatted_phone

In [39]:
contactsdf

    Name           Email       Phone
0  Malak  malak@demo.com  55-555-555
1  Samir  samir@demo.com  55-551-234
