# Part 2. String operations, formatted output, basic file I/O

A string is a sequence of characters enclosed in single quotes `'` or double quotes `"`. This data type lets Python manipulate textual information, produce output, and parse text and data files.

In [1]:
"This is a string"

'This is a string'

In [2]:
'Another one'

'Another one'

There is no difference between using single or double quotes, except that a string in single quotes cannot contain an unescaped single quote, while a string in double quotes cannot contain an unescaped double quote.

In [3]:
'String with a "double quote" '

'String with a "double quote" '

In [4]:
"string with a 'single quote' "

"string with a 'single quote' "

If you need them anyway, use the special symbols `\'` or `\"` (a backslash introduces a special symbol in a string).

In [5]:
'String with a \'single quote\' and a "double quote"'

'String with a \'single quote\' and a "double quote"'

When you display the string as above, you don't see how the special symbols work. But when you use the print function, all the special symbols are rendered appropriately.

In [6]:
print('String with a \'single quote\' and a "double quote"')

String with a 'single quote' and a "double quote"


In [7]:
# print can also be used without parentheses: the "native" Python 2 statement syntax. NOTE: this won't work in Python 3!
print 'String with a \'single quote\' and a "double quote"'

String with a 'single quote' and a "double quote"


Alternatively, you can avoid special symbols altogether by concatenating two strings that use different quotes:

In [8]:
print("String with a 'single quote'"+' and a "double quote"')

String with a 'single quote' and a "double quote"


But wait... If a backslash introduces special symbols, how do you put a backslash itself inside the string?

In [9]:
# then you use a special symbol \\
print("String with a backslash \\")

String with a backslash \


In [10]:
#you can also span the string definition through multiple lines using backslash `\`
'Multiple \
line \
string'

'Multiple line string'

In [11]:
#Alternatively, you can use triple quotes.
#This way an 'end of line' special symbol is added at the end of each line
#It's a bit inconvenient though, as you cannot see whether there are any spaces at the end of each line
"""
Multiple
line 
string
"""

'\nMultiple\nline \nstring\n'

Also notice that the end of line is denoted by `\n`, another special symbol. Below is a list of the most commonly used special symbols.

|Syntax | Special character | 
|---|---|
|\\\\	| Backslash (\)|
|\'	|Single quote (')|
|\"	|Double quote (")|
|\a	|Bell (BEL)|
|\b	|Backspace (BS)|
|\f	|Formfeed (FF)|
|\n	|End of line (EOL)|
|\r	|Carriage Return (CR)|
|\t	|Horizontal Tab (TAB)|
|\v	|Vertical Tab (VT)|
|\ooo |ASCII character with octal value ooo (up to three octal digits)|
|\xhh | ASCII character with hex value hh (exactly two hex digits)|
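
As a quick check of the table above, the following sketch confirms that each escape sequence produces a single character and that the octal and hex forms select a character by its numeric code. The variable names are illustrative only.

```python
# Each escape sequence stands for exactly one character in the resulting string.
newline = '\n'
tab = '\t'
print(len(newline))  # 1
print(len(tab))      # 1

# Octal (\ooo) and hexadecimal (\xhh) escapes select a character by its code.
letter_from_octal = '\101'  # octal 101 is decimal 65
letter_from_hex = '\x41'    # hex 41 is decimal 65
print(letter_from_octal + ' ' + letter_from_hex + ' ' + chr(65))  # A A A

# ord() gives the numeric code of a character, here of a single backslash.
backslash_code = ord('\\')
print(backslash_code)  # 92
```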

But what if we want to have a directory path in a string (a pretty common use case) and do not want its backslashes to introduce special symbols, as happens below?

In [12]:
print('ac\b')

ac


In [13]:
print('C:\name of folder\subfolder\newfile.txt')

C:
ame of folder\subfolder
ewfile.txt


In [14]:
#prefix 'r' before the string directs python to use 'raw' string without special symbols
print(r'C:\name of folder\subfolder\newfile.txt')

C:\name of folder\subfolder\newfile.txt
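
To make the effect of the `r` prefix concrete, here is a small sketch comparing a raw string with the same path written using doubled backslashes. The names `escaped` and `raw` are illustrative.

```python
# The same path written twice: once with escaped backslashes, once as a raw string.
escaped = 'C:\\name of folder\\subfolder\\newfile.txt'
raw = r'C:\name of folder\subfolder\newfile.txt'
print(escaped == raw)  # True

# In a raw string, backslash followed by n is two characters, not an end of line.
print(len(r'\n'))  # 2
print(len('\n'))   # 1

# Caveat: a raw string literal cannot end with a single backslash,
# so a trailing backslash has to be added separately.
folder = r'C:\name of folder' + '\\'
print(folder)
```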


Another useful prefix is 'u', which directs Python to treat the string as Unicode (this is the default in Python 3).

In [15]:
print(u'A unicode \u018e string \xf1')

A unicode Ǝ string ñ


As with any other data type, strings can be assigned to variables.

In [16]:
A='string data'

In [17]:
B=A; B

'string data'

## String operations

A string is a sequence of characters and is similar to a list (in fact, strings and lists are both kinds of a more general type, the sequence), except that all of its elements are characters. More precisely, strings are closer to tuples, because they are immutable. String operations therefore inherit a lot from sequences, including indexing, slicing and concatenation.

In [18]:
A='String is a sequence of characters'

In [19]:
A[0]

'S'

In [20]:
A[:6]

'String'

In [21]:
A[-10:]

'characters'

In [22]:
A[:6]+A[-10:]

'Stringcharacters'

In [23]:
'a'*3

'aaa'

In [24]:
A[::-1]

'sretcarahc fo ecneuqes a si gnirtS'

In [25]:
#Just like with tuples modification through assignment is not allowed
#A[0]='A' 
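
The sketch below illustrates what immutability means in practice: item assignment raises a `TypeError`, and the usual workaround is to build a new string from slices. The name `modified` is illustrative.

```python
A = 'String is a sequence of characters'

# Assigning to an index of a string raises TypeError, because strings are immutable.
try:
    A[0] = 'A'
except TypeError as err:
    print('TypeError: ' + str(err))

# To "change" a character, build a new string from the pieces you want to keep.
modified = 'A' + A[1:]
print(modified)  # Atring is a sequence of characters
print(A)         # the original string is unchanged
```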

Below is a list of useful built-in string methods. They are called as `<string>.method(parameters)`; methods that transform the string return a modified copy, since the original string is immutable.

| Syntax | Key parameters | Function | 
|---|---|---|
| count | substr, rb, re (default: whole string) | returns the number of non-overlapping occurrences of *substr* within the range string[rb:re] |
| find | substr, rb, re (default: whole string) | returns the index of the first occurrence of *substr* within the range string[rb:re], or -1 if there is none |
| rfind | substr, rb, re (default: whole string) | returns the index of the last occurrence of *substr* within the range string[rb:re], or -1 if there is none |
| index | substr, rb, re (default: whole string) | same as find, but raises an exception (ValueError) if there is no occurrence |
| split | s (default: any whitespace) | splits a string into a list of its parts (words) separated by the separator s |
| strip | chars (default: whitespace) | removes leading and trailing characters that appear in *chars* |
| rstrip | chars (default: whitespace) | removes trailing characters that appear in *chars* (from the right) |
| lstrip | chars (default: whitespace) | removes leading characters that appear in *chars* (from the left) |
| replace | substr1, substr2, count | replaces occurrences of *substr1* with *substr2*, at most *count* times (all occurrences if *count* is omitted) |
| join | L | joins the strings from a list L, putting the given string between them |
| min | - | smallest character of the string by character code; this is the built-in function, called as min(string) |
| max | - | largest character of the string by character code; this is the built-in function, called as max(string) |
| upper | - | turns all string characters to upper case |
| lower | - | turns all string characters to lower case |
| capitalize | - | capitalizes the first character and turns the rest to lower case |
| center | n, c (default=' ') | centers the string within a larger one of length n filled with c |
| rjust | n, c (default=' ') | right-justifies the string within a larger one of length n filled with c |
| ljust | n, c (default=' ') | left-justifies the string within a larger one of length n filled with c |
| isdigit | - | checks whether the string is non-empty and consists only of digits |

Some examples are shown below.

In [26]:
'AAbAAAAcAAd'.count('AA')

4

In [27]:
'12345'.isdigit()

True

In [28]:
'-:-'.join(['A','B','C'])

'A-:-B-:-C'

In [29]:
'abcd'.upper()

'ABCD'

In [30]:
'ACBC'.rfind('C')

3

In [31]:
'ACBC'.count('C')

2

In [32]:
'acbABBAaba'.strip('abc')

'ABBA'

In [33]:
'Matlab is a useful tool for data science. I love Matlab!'.replace('Matlab','Python')

'Python is a useful tool for data science. I love Python!'

In [34]:
'Sentence contains multiple words'.split()

['Sentence', 'contains', 'multiple', 'words']

In [35]:
'Python'.center(16,'*')

'*****Python*****'
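
A few of the methods from the table are not covered by the cells above. The following sketch exercises them; the sample strings are made up for illustration.

```python
text = 'ACBC'

# find returns -1 when the substring is absent, while index raises ValueError.
print(text.find('X'))  # -1
try:
    text.index('X')
except ValueError:
    print('index raised ValueError')

# Without arguments, the strip family removes whitespace, including end of line.
padded = '  some text \n'
print(repr(padded.lstrip()))  # 'some text \n'
print(repr(padded.rstrip()))  # '  some text'
print(repr(padded.strip()))   # 'some text'

# The optional third argument of replace limits the number of replacements.
print('a-b-c-d'.replace('-', '+', 2))  # a+b+c-d

# capitalize also lowercases everything after the first character.
print('hello WORLD'.capitalize())  # Hello world

# min and max are built-in functions that compare characters by their codes,
# so upper case letters sort before lower case ones.
print(min('Python'))  # P
print(max('Python'))  # y
```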

### Exercise 1. 
Implement a function taking a list of strings and producing a set of words they contain all together

In [2]:
def GetWords(L):
    return set(' '.join(L).split())

In [3]:
GetWords(['First phrase','Second phrase'])

{'First', 'Second', 'phrase'}

### Exercise 2. 
Implement a function taking a string, turning it to lower case and removing all spaces and vowel letters

In [4]:
def remVowels(L):
    vowels=['a','e','o','i','u',' ']
    L=L.lower()
    for v in vowels:
        L=L.replace(v,'')
    return L

In [5]:
remVowels('This is a sentence')

'thsssntnc'

## Console input

Python can read a line of text from the console with `raw_input` (in Python 3 this function is called `input`).

In [26]:
s = raw_input("Enter your input: ")
s

Enter your input: abc


'abc'

In [29]:
#parse input as a list of float numbers
L = raw_input("Enter your input: ").split(',') #split the input string into a list of comma-separated inputs
L = map(float,L) # convert all inputs of a list to float using map
L

Enter your input: 1,2,3,4,5


[1.0, 2.0, 3.0, 4.0, 5.0]
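
Real console input often contains stray spaces or a trailing comma, which would make a plain `float` conversion fail on an empty piece. The sketch below separates the parsing step into a hypothetical helper, `parse_floats`, so it can be tried without typing at the console; it is one possible approach, not the only one.

```python
def parse_floats(text, sep=','):
    """Convert a separator-delimited string into a list of floats."""
    values = []
    for item in text.split(sep):
        item = item.strip()  # drop spaces around each piece
        if item:             # skip empty pieces, e.g. after a trailing comma
            values.append(float(item))
    return values

# A fixed string stands in here for what the user would type at the prompt.
print(parse_floats('1, 2,3 ,4,5,'))  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Building the list explicitly also avoids a Python 3 difference: there `map` returns an iterator rather than a list.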

## Basic file I/O

In [5]:
#output elements of the string list to a file
L=['aa','bb','cc']
fo = open("output.txt", "w") # Open a file for output ('w') in text mode
fo.write('Elements of the list:\n'); #write a line to a file
for i in range(0,len(L)): #for all elements of the list
    fo.write('Element '+str(i+1)+':'+L[i]+'\n') #write them to a file
fo.close() #close a file

In [11]:
#input elements of the list from the file above
fo = open("output.txt", "r") # Open a file for input ('r') in text mode
fo.readline(); #read and skip the title line
L=[]
line=fo.readline() #read another line
while line: #while input is non-empty (i.e. there was smth to read)
    L+=[line.split(':')[1][:-1]] #add new input to the list, excluding EOL symbol
    line=fo.readline() #read another line
fo.close() #close a file
L

['aa', 'bb', 'cc']
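
The same round trip can be written with the `with` statement, which closes the file automatically even if an error occurs in between. This is a sketch under the assumption that a temporary directory is an acceptable place for the file; the names `path` and `restored` are illustrative.

```python
import os
import tempfile

L = ['aa', 'bb', 'cc']
# Write into a fresh temporary directory so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# The file is closed automatically when the with block ends.
with open(path, 'w') as fo:
    fo.write('Elements of the list:\n')
    for i, item in enumerate(L):
        fo.write('Element %d:%s\n' % (i + 1, item))

restored = []
with open(path, 'r') as fo:
    next(fo)         # skip the title line
    for line in fo:  # a file object yields its remaining lines one by one
        restored.append(line.split(':')[1].rstrip('\n'))

print(restored)
```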

## Formatted output

One of the common uses of string data is generating output. Apart from simply printing a string, Python provides powerful ways of customizing the output through string formatting. Besides preparing the output string yourself with the string processing tools above, you can use the `format` method, passing the quantities to substitute as parameters.

Python 3 (version 3.6 and later) provides an alternative, the formatted string literal: put `f` in front of the string, then simply use variable names in curly braces to substitute their values into the string. This is not supported in Python 2; however, there is a workaround, as shown below.

In [40]:
#use of format method
'Value of A={}, while B={}'.format(10,20)

'Value of A=10, while B=20'

In [41]:
#can also reference the values to substitute by name providing those as arguments in the format
B=20
'Value of A={A}, while B={B}'.format(A=10,B=B)

'Value of A=10, while B=20'

In [42]:
#example of a Python 2 workaround that mimics Python 3 formatted strings
A=10; B=20
#Python 3 will allow you to simply write
#f'Value of A={A}, while B={B}'
#in Python 2 you can:
'Value of A={A}, while B={B}'.format(**locals())
#can also use **globals() to access global variables

'Value of A=10, while B=20'

In [43]:
#one can reorder the values to substitute
'Value of C={2}, while B={1} and A={0}'.format(10,2.1522,'abc')

'Value of C=abc, while B=2.1522 and A=10'

In [44]:
#or provide specific formatting instructions (like using only 2 decimal digits)
'Value of A={:d}, while B={:.2f} and C={:*^10}'.format(10,2.1522,'abc')

'Value of A=10, while B=2.15 and C=***abc****'

In [45]:
#another way of passing values to the string to format, specifying their type and giving additional formatting instructions
'Value of A=%d, B=%.2f, while S=%s' % (10,2.1522,'abc')

'Value of A=10, B=2.15, while S=abc'
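
Format specifications can also control width and alignment, which is handy for lining up columns. The sketch below uses made-up sample rows and assumes Python 2.7 or later, where the `,` thousands separator is available.

```python
# Sample data, invented for illustration: (name, value, count).
rows = [('alpha', 3.14159, 42), ('beta', 2.71828, 1234567)]

lines = []
for name, value, count in rows:
    # <8    : left-align in a field of width 8
    # >8.3f : right-align in width 8 with 3 decimal digits
    # >12,d : right-align an integer in width 12 with thousands separators
    lines.append('{:<8}|{:>8.3f}|{:>12,d}'.format(name, value, count))

for line in lines:
    print(line)

# The % presentation type multiplies by 100 and appends a percent sign.
share = '{:.1%}'.format(0.256)
print(share)  # 25.6%
```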

More details on string formatting and other useful methods can be found at 
https://docs.python.org/2/library/string.html

### Example 3. Output a list of random variable distributions 

You are given a list of tuples L characterizing random variables. Each tuple consists of three elements: the first is the distribution type (0 = Uniform, 1 = Normal, 2 = Lognormal) and the other two are the parameters of the distribution (mu and sigma^2 for Normal/Lognormal, a and b for Uniform). You need to create a function that takes this list as input and prints a series of strings like:

`Variable 1: Normal(1.00,0.10)`

`Variable 2: Uniform(-2.00,2.00)`

presenting distribution parameters with 2 decimal digits; variables are numbered as 1,2,3,...

*Hint*: use a dictionary to convert the distribution number to its string identifier

In [8]:
#sample input
L=[(0,0,1),(2,1,0.521),(1,0,0.101),(0,0,1),(0,1,0.2011),(1,3.1415926,1.1010)]

In [9]:
def PrintDistributions(L):
    d={0:'Uniform', 1:'Normal', 2:'Lognormal'} #distribution code -> name
    for i,v in enumerate(L,1): #number the variables starting from 1
        print('Variable %d: %s(%.2f,%.2f)' % (i,d[v[0]],v[1],v[2]))

In [10]:
PrintDistributions(L)

Variable 1: Uniform(0.00,1.00)
Variable 2: Lognormal(1.00,0.52)
Variable 3: Normal(0.00,0.10)
Variable 4: Uniform(0.00,1.00)
Variable 5: Uniform(1.00,0.20)
Variable 6: Normal(3.14,1.10)


While strings are often used for formatted output, they can also serve as input for further parsing.
Python does not have a built-in direct "inverse" of format (like scanf in C) but offers the more powerful (although also more verbose)
`regular expressions` instead.

`re.search(<pattern>,<input string>).groups()`

can be used as shown below.

Please see 
https://docs.python.org/3/library/re.html#simulating-scanf  
for a more comprehensive description.

In [49]:
#import regular expressions
import re

In [50]:
#parse a string using a regular expression
re.search(r'(\S+) - (\d+) errors, (\d+) warnings',r'/usr/sbin/sendmail - 0 errors, 4 warnings').groups()

('/usr/sbin/sendmail', '0', '4')

In [51]:
#parse a string for values using a regular expression format specs 
re.search(r'Variable (\S+) = (\d+)',r'Variable A = 1').groups()

('A', '1')
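
One caveat with the pattern above: `re.search` returns `None` when nothing matches, so calling `.groups()` directly would raise an `AttributeError`. The sketch below guards against that and uses named groups for readability; the helper name `parse_assignment` is hypothetical.

```python
import re

# Named groups (?P<name>...) let the pieces be retrieved by name instead of position.
pattern = re.compile(r'Variable (?P<name>\S+) = (?P<value>\d+)')

def parse_assignment(text):
    """Return (name, value) parsed from text, or None if the text does not match."""
    match = pattern.search(text)
    if match is None:  # no match: avoid calling methods on None
        return None
    return match.group('name'), int(match.group('value'))

print(parse_assignment('Variable A = 1'))      # ('A', 1)
print(parse_assignment('no assignment here'))  # None
```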

Often one can apply specific parsing rules manually using split. For example, consider an output like the one in Example 3

`Normal(-1.1,0.1)`

and try to parse it into the distribution name and the parameters


In [52]:
#First replace the delimiter symbols '(' and ')' with commas, so that the comma is the only delimiter
S='Normal(-1.1,0.1)'
S=S.replace('(',',').replace(')',','); S

'Normal,-1.1,0.1,'

In [53]:
#then use split to extract the elements of interest from the string
E=S.split(','); E

['Normal', '-1.1', '0.1', '']

In [54]:
#Finally convert string representation of parameters into numeric format
Distribution=E[0]
A=float(E[1])
B=float(E[2])
Distribution,A,B

('Normal', -1.1, 0.1)

### Example 4. Parse a list of random variable distributions

Implement the inverse of Example 3: given a list of output strings from Example 3, parse it into a list of tuples encoding the random variables according to the rules above.

In [13]:
#Example of input:
S=[
'Variable 1: Uniform(0.00,1.00)',
'Variable 2: Lognormal(1.00,0.52)',
'Variable 3: Normal(0.00,0.10)',
'Variable 4: Uniform(0.00,1.00)',
'Variable 5: Uniform(1.00,0.20)',
'Variable 6: Normal(3.14,1.10)'
]    

In [14]:
def ParseDistributions(S):
    d={'Uniform':0, 'Normal':1, 'Lognormal':2}
    L=[]
    for s in S:
        v=s.replace(' ',',').replace('(',',').replace(')',',').split(',')
        L+=[(d[v[2]],float(v[3]),float(v[4]))]
    return L 

In [15]:
ParseDistributions(S)

[(0, 0.0, 1.0),
 (2, 1.0, 0.52),
 (1, 0.0, 0.1),
 (0, 0.0, 1.0),
 (0, 1.0, 0.2),
 (1, 3.14, 1.1)]
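
The same parsing can also be done with a regular expression, which has the side benefit of rejecting lines that do not follow the expected layout. This is a sketch only: the names `LINE`, `CODES` and `parse_distributions_re` are illustrative, and the number pattern is deliberately simple (optional sign, digits, optional decimal part, no exponent notation).

```python
import re

# Groups: distribution name, first parameter, second parameter.
NUMBER = r'[-+]?\d*\.?\d+'
LINE = re.compile(r'Variable \d+: (\w+)\((' + NUMBER + r'),(' + NUMBER + r')\)')
CODES = {'Uniform': 0, 'Normal': 1, 'Lognormal': 2}

def parse_distributions_re(lines):
    """Parse strings like 'Variable 1: Normal(1.00,0.10)' into (code, p1, p2) tuples."""
    result = []
    for line in lines:
        match = LINE.match(line)
        if match is None:
            raise ValueError('Unrecognized line: %r' % line)
        name, p1, p2 = match.groups()
        result.append((CODES[name], float(p1), float(p2)))
    return result

sample = ['Variable 1: Normal(1.00,0.10)', 'Variable 2: Uniform(-2.00,2.00)']
print(parse_distributions_re(sample))  # [(1, 1.0, 0.1), (0, -2.0, 2.0)]
```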