# LAB 2 Chapter 2 Strings

Almost every useful program involves some kind of text processing, whether it is parsing
data or generating output. This chapter focuses on common problems involving text
manipulation, such as pulling apart strings, searching, substitution, lexing, and parsing.
Many of these tasks can be easily solved using built-in methods of strings. However,
more complicated operations might require the use of regular expressions or the creation
of a full-fledged parser. All of these topics are covered. In addition, a few tricky
aspects of working with Unicode are addressed.

#### 2.2 Problem
You need to check the start or end of a string for specific text patterns, such as filename extensions, URL schemes, and so on.

- str.startswith()
- str.endswith()
- to test several choices at once, pass them as a tuple; a list raises a TypeError

In [3]:
name = 'Ebuka'
print(name.startswith('Eb'), name.endswith('a'),name.startswith('z'))

True True False
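
As a small sketch of the tuple rule noted above (the filenames and URL are made up for illustration), passing a tuple checks several choices at once, while a list is rejected with a `TypeError`:

```python
# Hypothetical filenames, used only for illustration.
filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

# A tuple of suffixes matches if any one of them matches.
c_files = [name for name in filenames if name.endswith(('.c', '.h'))]
print(c_files)  # ['foo.c', 'spam.c', 'spam.h']

# A list is rejected; convert it with tuple() first.
choices = ['http:', 'ftp:']
url = 'http://www.python.org'
try:
    url.startswith(choices)
    list_accepted = True
except TypeError as exc:
    list_accepted = False
    print('TypeError:', exc)

print(url.startswith(tuple(choices)))  # True
```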


#### 2.3 Problem
You want to match text using the same wildcard patterns as are commonly used when
working in Unix shells (e.g., *.py, Dat[0-9]*.csv, etc.).

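- from fnmatch import fnmatch, fnmatchcase
    - fnmatch() follows the case-sensitivity rules of the underlying operating system
    - fnmatchcase() is always case-sensitive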


In [4]:
from fnmatch import fnmatch, fnmatchcase
# True on Windows, where fnmatch() ignores case; False on Linux
print(fnmatch('foo.txt', '*.TXT'))
print(fnmatchcase('foo.txt', '*.TXT'))

True
False
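
One possible use, sketched here with hypothetical filenames, is filtering a list of names with a shell-style pattern. `fnmatchcase()` is used so the result does not depend on the operating system:

```python
from fnmatch import fnmatchcase

# Hypothetical filenames, used only for illustration.
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py', 'dat3.CSV']

# [0-9] matches one digit and * matches any run of characters.
csv_files = [name for name in names if fnmatchcase(name, 'Dat[0-9]*.csv')]
print(csv_files)  # ['Dat1.csv', 'Dat2.csv']
```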


#### 2.4 Problem
You want to match or search text for a specific pattern.
- for simple literals, use string methods
    - str.find(), str.endswith(), str.startswith()
- use the re (regular expression) module for more complicated patterns
    - \d+ matches one or more digits
    - + means one or more of the preceding item
    - end the pattern with $ if you want an exact match
    
- precompile the pattern with re.compile() if it will be matched many times
    - match() only checks the beginning of the string
    - findall() returns every match in the string
- capture groups (parentheses) let match() separate the parts that matched
- write patterns as raw strings so backslashes are left uninterpreted
    

In [5]:
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
m = datepat.match('11/27/2012')
print(type(m))
print(m)
print('group(): ',m.group())
print('group(0): ',m.group(0))
print('group(1): ',m.group(1))
print('group(2): ',m.group(2))
print('group(3): ',m.group(3))
print('groups: ',m.groups())

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
datepat.findall(text)

<class 're.Match'>
<re.Match object; span=(0, 10), match='11/27/2012'>
group():  11/27/2012
group(0):  11/27/2012
group(1):  11
group(2):  27
group(3):  2012
groups:  ('11', '27', '2012')


[('11', '27', '2012'), ('3', '13', '2013')]
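
The note about `$` can be illustrated with a short sketch. The string with trailing characters is an invented example. `match()` anchors only at the start, so an exact match needs either `$` at the end of the pattern or `fullmatch()`:

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
trailing = '11/27/2012abcdef'   # invented input with extra characters

# match() only anchors at the start, so the trailing text is ignored.
loose = datepat.match(trailing)
print(loose.group())            # 11/27/2012

# Adding $ requires the pattern to reach the end of the string.
exact = re.compile(r'(\d+)/(\d+)/(\d+)$')
print(exact.match(trailing))    # None
print(exact.match('11/27/2012').groups())  # ('11', '27', '2012')

# fullmatch() gives the same effect without editing the pattern.
print(datepat.fullmatch(trailing))  # None
```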

#### 2.5 Problem
You want to search for and replace a text pattern in a string.

- str.replace() for simple literal text
- for more complicated patterns, use sub() from the re module
- backslashed digits such as \3 in the replacement refer to capture group numbers in the pattern

In [6]:
text = 'yeah, but no, but yeah, but no, but yeah'
print(text.replace('no', 'na'))
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
print('text',text)
print('text re sub:',re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text))


yeah, but na, but yeah, but na, but yeah
text Today is 11/27/2012. PyCon starts 3/13/2013.
text re sub: Today is 2012-11-27. PyCon starts 2013-3-13.
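
When the replacement needs more logic than a template string, `re.sub()` also accepts a callback function. The sketch below is one way to do that. The `MONTHS` tuple and the `change_date` helper are names invented for this illustration, and `subn()` is used to also report how many substitutions were made:

```python
import re

# Month abbreviations, defined here only for this illustration.
MONTHS = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

def change_date(m):
    """Turn a month/day/year match into 'day Mon year'."""
    month_name = MONTHS[int(m.group(1)) - 1]
    return f'{m.group(2)} {month_name} {m.group(3)}'

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
new_text, count = datepat.subn(change_date, text)
print(new_text)  # Today is 27 Nov 2012. PyCon starts 13 Mar 2013.
print(count)     # 2
```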


#### 2.6 Problem
You need to search for and possibly replace text in a case-insensitive manner.

- re.IGNORECASE flag
    - re.sub('python', 'snake', text, flags=re.IGNORECASE)
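
A minimal sketch of the flag in use, with a sample sentence made up for illustration. Note that the replacement text is inserted as written and does not follow the case of the text it replaces:

```python
import re

# Sample text, invented for this illustration.
text = 'UPPER PYTHON, lower python, Mixed Python'

# The flag makes the pattern match regardless of case.
found = re.findall('python', text, flags=re.IGNORECASE)
print(found)     # ['PYTHON', 'python', 'Python']

# The replacement is always the literal 'snake'.
replaced = re.sub('python', 'snake', text, flags=re.IGNORECASE)
print(replaced)  # UPPER snake, lower snake, Mixed snake
```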

#### 2.7 Problem
You’re trying to match a text pattern using regular expressions, but it is identifying the longest possible matches of a pattern. Instead, you would like to change it to find the shortest possible match.

- ? on its own matches 0 or 1 occurrence; placed after * or +, it makes the match nongreedy (shortest possible)
- . matches any character except \n
- (?:...) is a noncapture group (i.e., it defines a group for the purposes of matching, but that group is not captured separately or numbered)
- re.DOTALL makes . match all characters, including newlines

In [12]:
import re
str_pat = re.compile(r'\"(.*)\"')
text = 'Computer says "NO."'
print(str_pat.findall(text))
text = 'Computer says "No." Phone says "Yes."'
print(str_pat.findall(text))

['NO.']
['No." Phone says "Yes.']


In [13]:
str_pat = re.compile(r'\"(.*?)\"')
text = 'Computer says "NO."'
print(str_pat.findall(text))
text = 'Computer says "No." Phone says "Yes."'
print(str_pat.findall(text))

['NO.']
['No.', 'Yes.']


#### 2.8 Problem
You’re trying to match a block of text using a regular expression, but you need the match
to span multiple lines.

In [19]:
comment_pat = re.compile(r'/\*(.*?)\*/')
text1 = '/* this is a comment */'
text2 = '''/* this is a
multiline comment */
'''

print(comment_pat.findall(text1))
print(comment_pat.findall(text2))
comment = re.compile(r'/\*((?:.|\n)*?)\*/')
# (?:...) is a noncapture group: it groups . and \n for matching without capturing them separately
print(comment.findall(text2))

[' this is a comment ']
[]
[' this is a\nmultiline comment ']
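
An alternative to the `(?:.|\n)` group is the `re.DOTALL` flag, which makes `.` match newlines as well. A minimal sketch, where the variable names are chosen only for illustration:

```python
import re

# With re.DOTALL the plain .*? pattern can span lines.
dotall_comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)

sample = '/* this is a\nmultiline comment */'
spanned = dotall_comment.findall(sample)
print(spanned)  # [' this is a\nmultiline comment ']
```

The flag is convenient for simple cases. For patterns that are combined with others, spelling out the newline in the pattern itself keeps the behavior independent of any flags.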


#### 2.9 Problem
You’re working with Unicode strings, but need to make sure that all of the strings have
the same underlying representation.

- first normalize the text into a standard representation using the unicodedata module
- the first argument to normalize() specifies how you want the string normalized
- NFC means that characters should be fully composed (i.e., use a single code point if possible)
- NFD means that characters should be fully decomposed with the use of combining characters
- NFKC and NFKD work like NFC and NFD, but also replace compatibility characters (such as ligatures) with their plain equivalents
- normalization can also be an important part of sanitizing and filtering text, for example when removing all diacritical marks from some text for the purposes of searching or matching

In [27]:
import unicodedata 

s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'
print(s1)
print(s2)
print(s1==s2)

t1 = unicodedata.normalize('NFC',s1)
t2 = unicodedata.normalize('NFC',s2)
print(t1 == t2)
print(f'ascii t1: {ascii(t1)}')

t1 = unicodedata.normalize('NFD',s1)
t2 = unicodedata.normalize('NFD',s2)
print(t1 == t2)
print(f'ascii t1: {ascii(t1)}')

Spicy Jalapeño
Spicy Jalapeño
False
True
ascii t1: 'Spicy Jalape\xf1o'
True
ascii t1: 'Spicy Jalapen\u0303o'
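
The difference between the plain and the compatibility forms shows up with characters such as ligatures. This is a small sketch using the single code point U+FB01 (LATIN SMALL LIGATURE FI), picked only as an example:

```python
import unicodedata

s = '\ufb01'  # one code point: the "fi" ligature

# NFD keeps the ligature, since it has no canonical decomposition.
nfd = unicodedata.normalize('NFD', s)
print(ascii(nfd))   # '\ufb01'

# NFKD applies the compatibility mapping and yields two plain letters.
nfkd = unicodedata.normalize('NFKD', s)
print(ascii(nfkd))  # 'fi'
```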


#### 2.11 Problem
You want to strip unwanted characters, such as whitespace, from the beginning, end, or
middle of a text string.

- strip() removes characters from both ends; lstrip() and rstrip() work on the left or right side only
- use str.replace() or re.sub() for the middle of the text

In [39]:
s = '      hello      world   \n'
print(s)
s = s.strip()
print('***str.strip()***\n')
print(s)
print('***get rid of middle***')
print(s.replace('  ',''))
re.sub(r'\s+',' ',s)

      hello      world   

***str.strip()***

hello      world
***get rid of middle***
helloworld


'hello world'
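
The one-sided variants and the optional character argument can be sketched as follows. The sample strings are made up for illustration:

```python
s = '   hello world \n'

# lstrip() and rstrip() each work on one side only.
left = s.lstrip()
right = s.rstrip()
print(repr(left))   # 'hello world \n'
print(repr(right))  # '   hello world'

# A string argument lists the characters to strip instead of whitespace.
t = '-----hello====='
print(t.lstrip('-'))        # hello=====
both = t.strip('-=')
print(both)                 # hello
```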

#### 2.12 Problem
Some bored script kiddie has entered the text “pýtĥöñ” into a form on your web page
and you’d like to clean it up somehow.

- str.translate()
    - takes a translation table: a dict that maps code points (from ord()) to replacement strings, or to None to delete the character


In [43]:
s = 'pýtĥöñ\fis\tawesome\r\n'
print('s: ', s)
remap = {
    ord('\t') : ' ',
    ord('\f'): ' ',
    ord('\r'): ' ',
}
a = s.translate(remap)
print('a: ', a)


s:  pýtĥöñis	awesome

a:  pýtĥöñ is awesome 
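
The cell above only remaps whitespace, so the accents remain. One way to remove them, sketched here under the assumption that dropping all combining marks is acceptable for the text at hand, is to decompose the string with NFD and then delete the combining characters with `translate()`:

```python
import unicodedata

s = 'p\u00fdt\u0125\u00f6\u00f1'  # the accented word from the problem statement

# NFD splits each accented letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize('NFD', s)

# Map every combining character found in the text to None, which deletes it.
drop_marks = {ord(c): None for c in decomposed if unicodedata.combining(c)}

cleaned = decomposed.translate(drop_marks)
print(cleaned)  # python
```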



#### 2.13 Problem
You need to format text with some sort of alignment applied

- str.ljust()
- str.rjust()
- str.center()
- format() with <, >, or ^ and a width

In [59]:
text = "HELLO WORLD"
print(text.rjust(20))
print(text.ljust(20))
print(text.center(20))
print(text.rjust(20,'*'))
print(text.ljust(20,'*'))
print(text.center(20,'*'))
print()
print('format'.center(20,'*'))
print(format(text,'>20'))
print(format(text,'<20'))
print(format(text,'^20'))
print('format'.center(20,'*'))
print(format(text,'=>20'))
print(format(text,'=<20'))
print(format(text,'=^20'))

         HELLO WORLD
HELLO WORLD         
    HELLO WORLD     
*********HELLO WORLD
HELLO WORLD*********
****HELLO WORLD*****

*******format*******
         HELLO WORLD
HELLO WORLD         
    HELLO WORLD     
*******format*******
=========HELLO WORLD
HELLO WORLD=========
====HELLO WORLD=====
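
The same alignment codes also work inside `str.format()` fields and with non-string values. A short sketch, with values chosen only for illustration:

```python
# Alignment codes inside str.format() replacement fields.
row = '{:>10s} {:>10s}'.format('Hello', 'World')
print(repr(row))       # '     Hello      World'

# format() is not limited to strings; numbers can be aligned too.
x = 1.2345
plain = format(x, '>10')
rounded = format(x, '^10.2f')
print(repr(plain))     # '    1.2345'
print(repr(rounded))   # '   1.23   '
```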


#### 2.14 Problem
You want to combine many small strings together into a larger string.

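- use the join() method of the separator string
- joining many strings with + is slow because each + creates a new string object
- when printing, the sep argument of print() avoids building the combined string at all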

In [9]:
parts = 'Morbius is the _ Movie Ever!!!'.split()
print('parts: ',parts)
print(''.join(parts))
print(' '.join(parts))
print(', '.join(parts))

parts:  ['Morbius', 'is', 'the', '_', 'Movie', 'Ever!!!']
Morbiusisthe_MovieEver!!!
Morbius is the _ Movie Ever!!!
Morbius, is, the, _, Movie, Ever!!!


In [7]:
def something():
    yield 'Hot'
    yield 'Freezes?'
    yield 'cold'
    yield 'burns?'
    yield 'You are confused'
    
something()
' '.join(something())

'Hot Freezes? cold burns? You are confused'
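
`join()` accepts only strings, so mixed data has to be converted first. The sketch below uses invented sample values, converts them with a generator expression, and shows the `sep` argument of `print()` as an alternative:

```python
# Sample values of mixed types, invented for this illustration.
data = ['ACME', 50, 91.1]

# Convert each item to str on the fly; no temporary list is needed.
joined = ','.join(str(d) for d in data)
print(joined)  # ACME,50,91.1

# print() can do the separating itself.
print(*data, sep=':')  # ACME:50:91.1
```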

#### 2.15 Problem
You want to create a string in which embedded variable names are substituted with a
string representation of a variable’s value.

In [5]:
s = '{name} has {n} messages'
print(s.format(name='ebuka',n=10))
name = 'Ebuka'
n = 54
print(s.format_map(vars()))

ebuka has 10 messages
Ebuka has 54 messages
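
`format()` and `format_map()` raise a `KeyError` when a name is missing. One way around that, sketched here with a hypothetical `SafeSub` class, is a dict subclass whose `__missing__()` method puts the placeholder back. An f-string is shown as well, for the case where the template is written inline:

```python
class SafeSub(dict):
    """Hypothetical helper: leave unknown placeholders untouched."""
    def __missing__(self, key):
        return '{' + key + '}'

s = '{name} has {n} messages'

# n is not supplied, so its placeholder survives instead of raising KeyError.
partial = s.format_map(SafeSub(name='Ebuka'))
print(partial)  # Ebuka has {n} messages

# With the values in scope, an f-string substitutes them directly.
name = 'Ebuka'
n = 54
inline = f'{name} has {n} messages'
print(inline)   # Ebuka has 54 messages
```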
