# Strings

String manipulation is an important part of data cleaning. Raw data often contains string fields that do not quite follow an expected format. For example, proper nouns could be incorrectly capitalized, or dates could have been entered under different conventions. Fortunately, Python offers many tools that make string manipulation rather painless. In this notebook, we will look at some commonly performed operations on strings.

## Defining strings

Strings can be defined between single or double quotes.

Note that Python strings support Unicode.

In [1]:
# The following are strings

a = 'First string'

b = "Second string"

cjk = '您は신'

print( type(a), type(b), cjk)

<class 'str'> <class 'str'> 您は신


## Replication

We can use multiplication syntax to define a string made up of identical copies of another string as illustrated below:

In [2]:
r1 = '*'*10
r2 = 'Python'*3

print(r1)
print(r2)

**********
PythonPythonPython


## Concatenation

If `a` and `b` are strings, they can be concatenated via `a+b`.

In [3]:
c = a + b

print(c)

First stringSecond string


## Exercise

Complete the definition of the function `myRep` with arguments `x`, `y`, and `n`, where `x` and `y` can be assumed to be strings and `n` can be assumed to be a nonnegative integer, that returns the string `x+y` repeated `n` times.

In [4]:
def myRep(x, y, n):
    res = ''
    # Your code here
    
    return res

# Uncomment the following lines to test if myRep passes the assert statements
# assert(myRep('a','b',3) == 'ababab')
# assert(myRep('Python','C',0) == '')

## Indexing

The character in position `i` of the string `a` can be accessed via `a[i]`.  Note that the first character has index 0.

Negative indices can also be used. For example, `a[-4]` returns the fourth character from the end.

In [5]:
print(a[0], a[6], a[-1], a[-4])  # Print the first, seventh, last, and fourth-last characters of a

F s g r


## Exercise

Complete the definition of the function `posOfi`, which takes a string `s` as argument and returns a list of the indices at which `s` contains the letter 'i'.  (Hint: use the [`enumerate` function](https://docs.python.org/3.5/library/functions.html#enumerate).)

In [6]:
def posOfi(s):
    # Your code here
    return None

print(posOfi('Missisippi'))  # Should print [1,4,6,9]

None


## Substring

We can obtain a substring of a string `a` using the syntax `a[i:j]` where `i` specifies the starting index and `j-1` the ending index.  Note that `a[:j]` is equivalent to `a[0:j]` and `a[i:]` is the substring starting from index `i` to the end.

In [7]:
print(a[2:4])
print(a[:3])
print(a[6:])

rs
Fir
string
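
Negative indices, described in the indexing section, can also be used as slice bounds. The short sketch below redefines `a` so that it runs on its own; it is only meant to illustrate how the two ideas combine.

```python
a = 'First string'

# A negative start counts from the end: the last six characters
tail = a[-6:]

# A negative stop leaves off characters at the end: everything except the last seven
head = a[:-7]

print(tail)
print(head)
```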


## Splitting

For a string `a`, `a.split()` splits the string into a list of words, using whitespace as the separator by default.  Note that a contiguous sequence of whitespace characters, including newline (`\n`), carriage return (`\r`), and tab (`\t`), is treated as a single separator.

We can also specify what separating characters to use for the splitting.  For example, `a.split(',')` splits on the comma and `a.split('--')` splits on '--'.

In [8]:
print('This is  a  \n\n   long   sentence with  \r \t weird spaces separating the words.'.split())

print('One,two, three ,four'.split(',')) # Note that ` three ` is one of the words after separation.

print('Five--six--ninety-four'.split('--'))

['This', 'is', 'a', 'long', 'sentence', 'with', 'weird', 'spaces', 'separating', 'the', 'words.']
['One', 'two', ' three ', 'four']
['Five', 'six', 'ninety-four']
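
One point that is easy to miss: the merging of consecutive whitespace applies only when `split` is called without an argument. The sketch below uses a made-up string `raw` to contrast the two behaviours.

```python
# Hypothetical sample string with two consecutive spaces between the words
raw = 'a  b'

# With no argument, a run of whitespace acts as a single separator
default_split = raw.split()

# With an explicit separator, every occurrence splits, so empty strings can appear
explicit_split = raw.split(' ')

print(default_split)
print(explicit_split)
```

With an explicit separator, the empty string between the two spaces is kept as an element of the list.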


## Whitespace stripping

In some cases, it is helpful to remove leading and trailing whitespace characters.

In [9]:
s = '  time   '
print(s)
print(s.strip())

  time   
time


It is common to apply `strip` to each piece after splitting on the comma.

In [10]:
cs = 'One   , two,  three  '

print( [ s.strip() for s in cs.split(',') ])

['One', 'two', 'three']


## Stripping a combination of characters

The `strip` function can also accept a string listing the characters to be stripped. Any leading or trailing run made up of those characters, in any order and combination, is removed.

In [11]:
tostrip = '&#-.!'

t = '###.Hel#lo!?!&-'

print(t.strip(tostrip)) # Strips leading and trailing characters that are listed in tostrip

Hel#lo!?
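
Because the argument is treated as a set of characters rather than as a prefix or suffix, the result can be surprising. The following sketch, which uses a made-up file name, shows what can happen when `strip` is used in an attempt to remove an extension.

```python
# Hypothetical file name; the intent is to drop the '.txt' extension
filename = 'report.txt'

# strip removes any trailing '.', 't' or 'x', so the final 't' of 'report' is lost too
stripped = filename.strip('.txt')

print(stripped)
```

Here the call removes one character more than intended, which is why `strip` suits trimming unwanted characters better than removing a fixed suffix.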


## Exercise

Complete the following function, which takes a string consisting of a paragraph of sentences, each ending with a period, and returns a list of all the sentences with leading and trailing spaces stripped.  The last line should evaluate to `True`.  You may assume that every period ends a proper sentence and that there are no sentences not ending in a period.

In [12]:
def sentences(p):
    # Your code here
    return None

p = 'The essence of Python.  One can sense. But not learn.'

sentences(p) == ['The essence of Python.', 'One can sense.', 'But not learn.']

False

## Altering cases

The functions `upper`, `lower`, and `title` are useful for altering cases.  The following examples illustrate what they do.

In [13]:
x = "gArbagE collECtion"

print( x.upper() )
print( x.lower() )
print( x.title() )

GARBAGE COLLECTION
garbage collection
Garbage Collection
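
One caveat worth knowing when using `title` for data cleaning: it treats any run of letters as a word, so a letter that follows an apostrophe is capitalized as well. The sample phrase below is only for illustration.

```python
phrase = "they're bill's friends"

# title() starts a new "word" after the apostrophe, so 're' and 's' get capitalized
titled = phrase.title()

print(titled)
```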


The following example illustrates a function that takes a phrase and turns it into an acronym by concatenating the first letters of the words and capitalizing all the letters.

In [14]:
def acronymize(phrase):
    a = ''
    for w in phrase.split():
        a += w[0]
    return a.upper()

acronymize("Be right back"), acronymize("Your mileage might vary")

('BRB', 'YMMV')

## Conversion between strings and numbers

It is often useful to convert a string representing a number to a number type and vice versa.  The following examples illustrate how these tasks can be achieved.

In [15]:
number = 12.345

s = str(number)

print( s, type(s))

f = float(s)

print(f, type(f))

i = int('345')
print(i, type(i))

12.345 <class 'str'>
12.345 <class 'float'>
345 <class 'int'>
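
In raw data, some strings will not represent valid numbers, in which case `float` and `int` raise a `ValueError`. A minimal sketch of one way to handle this is given below; the helper name `to_float` and its `default` parameter are assumptions made for this illustration, not part of the standard library.

```python
def to_float(text, default=None):
    """Return text converted to a float, or default if it cannot be parsed."""
    try:
        # float() tolerates surrounding whitespace, but not stray characters
        return float(text)
    except ValueError:
        return default

print(to_float(' 12.5 '))
print(to_float('12,5'))
```

Returning a default value keeps a cleaning loop running over bad entries; whether to skip, flag, or repair such entries depends on the data at hand.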


## Exercise

Complete the following function, which takes a list of full names as argument and returns a list of the names that are not properly capitalized.  For example, for the argument `['John Doe', 'JANE Kelly', 'nicole dunn', 'David Huang']`, the function returns `['JANE Kelly', 'nicole dunn']`.

In [16]:
def badNames(names):
    # Your code here
    return None

## Simple pattern matching

We can check if a string `t` is a substring of another string `s` via `t in s`.

In [17]:
t1 = "is"
t2 = "has"

s = "This is my car."

print( t1 in s )
print( t2 in s )

True
False


If we want to obtain the index at which a substring begins, we can use the `find` function.  If the substring is not found, `-1` is returned.

In [18]:
print( s.find(t1) )
print( s.find(t2) )

2
-1
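
Since `find` returns `0` for a match at the very start of the string and `-1` for no match, its result should not be used directly as a truth value. The sketch below redefines `s` so that it is self-contained and compares against `-1` explicitly.

```python
s = "This is my car."

# A match at the very start gives index 0, which is falsy
starts_at = s.find("This")

# No match gives -1, which is truthy
missing = s.find("has")

# Compare against -1 explicitly (or use `in`) instead of relying on truthiness
found_this = starts_at != -1
found_has = missing != -1

print(starts_at, found_this)
print(missing, found_has)
```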


## Exercise

Complete the following function which takes a list `l` of strings as argument and returns a list consisting of the strings in `l` not containing the symbol `-`.  For example, given the argument `['Hi', 'Good-bye', 'Ciao', 'Twenty-one']`, the function should return `['Hi', 'Ciao']`.

In [19]:
def filterList(l):
    # Your code here
    return None