# Strings and Python
Textual data in Python is handled with str objects, or strings. Strings are immutable sequences of Unicode code points. String literals are written in a variety of ways:

* Single quotes:

In [None]:
single_quotes = 'allows embedded "double" quotes'

* Double quotes: 

In [None]:
double_quotes = "allows embedded 'single' quotes"

* Triple quotes:

In [None]:
triple_quotes_1 = '''Three single quotes'''
triple_quotes_2 = """Three double quotes"""

Python strings are "immutable" which means they cannot be changed after they are created. Since strings can't be changed, we construct *new* strings as we go to represent computed values. So for example the expression ('hello' + 'there') takes in the 2 strings 'hello' and 'there' and builds a new string 'hellothere'.
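As a minimal sketch of what immutability means in practice (the variable names here are purely illustrative), trying to assign to a character raises a TypeError, while building a new string works fine:

```python
# Strings are immutable: assigning to an index raises TypeError.
greeting = 'hello'
try:
    greeting[0] = 'J'
except TypeError as err:
    error_message = str(err)
    print('Cannot modify in place:', error_message)

# Build a new string instead; the original is left untouched.
new_greeting = 'J' + greeting[1:]
print(new_greeting)  # Jello
print(greeting)      # hello
```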


Characters in a string can be accessed using the standard [ ] syntax:

In [None]:
  s = 'hi'
  print(s[1])
  print(len(s))
  print(s + ' there')

i
2
hi there


The str() function converts values to a string form so they can be combined with other strings.

In [None]:
pi = 3.14
text = 'The value of pi is '  + str(pi)
print(text)

The value of pi is 3.14
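The conversion matters because Python does not convert a number to text implicitly when concatenating. The sketch below shows the error you get without str(), and, as an alternative not covered above, an f-string that performs the conversion for you:

```python
pi = 3.14

# Concatenating a str and a float directly raises TypeError.
try:
    text = 'The value of pi is ' + pi
except TypeError:
    text = None

# str() converts explicitly; an f-string converts inside the braces.
with_str = 'The value of pi is ' + str(pi)
with_fstring = f'The value of pi is {pi}'
print(with_str)
print(with_fstring)
```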


The print() function prints one or more Python items followed by a newline (pass end='' to suppress the newline). A "raw" string literal is prefixed by an 'r' and passes all the chars through without special treatment of backslashes, so r'x\nx' evaluates to the length-4 string 'x\nx'. In Python 3 every str is already a Unicode string, so the 'u' prefix is accepted only for backward compatibility and has no effect (Python has lots of other Unicode support features -- see the References below).

In [None]:
raw = r'this\t\n and that'
print(raw)

multi = """It was the best of times.
  It was the worst of times."""
print(multi)

this\t\n and that
It was the best of times.
  It was the worst of times.
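In Python 3, print is a function whose end and sep keyword arguments control what is written after and between the items. A short sketch, which also checks the raw-string length mentioned above:

```python
# end='' suppresses the trailing newline that print adds by default.
print('no newline here', end='')
print(' ... continued on the same line')

# sep controls the text placed between multiple items.
print('a', 'b', 'c', sep='-')  # a-b-c

# A raw literal keeps the backslash; a normal literal turns \n into a newline.
raw_len = len(r'x\nx')    # 4 characters: x, backslash, n, x
normal_len = len('x\nx')  # 3 characters: x, newline, x
print(raw_len, normal_len)
```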


## String Methods
Here are some of the most common string methods. 

* s.lower(), s.upper() -- returns the lowercase or uppercase version of the string
* s.strip() -- returns a string with whitespace removed from the start and end
* s.isalpha()/s.isdigit()/s.isspace()... -- tests if all the string chars are in the various character classes
* s.startswith('other'), s.endswith('other') -- tests if the string starts or ends with the given other string
* s.find('other') -- searches for the given other string (not a regular expression) within s, and returns the first index where it begins or -1 if not found
* s.replace('old', 'new') -- returns a string where all occurrences of 'old' have been replaced by 'new'
* s.split('delim') -- returns a list of substrings separated by the given delimiter. The delimiter is not a regular expression, it's just text. 'aaa,bbb,ccc'.split(',') -> ['aaa', 'bbb', 'ccc']. As a convenient special case s.split() (with no arguments) splits on all whitespace chars.
* s.join(list) -- opposite of split(), joins the elements in the given list together using the string as the delimiter. e.g. '---'.join(['aaa', 'bbb', 'ccc']) -> aaa---bbb---ccc
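The methods listed above can be tried out on a small sample string. This is only an illustrative sketch; the sample text and variable names are made up for the example:

```python
s = '  Hello, World  '

stripped = s.strip()                 # 'Hello, World'
print(stripped.lower())              # hello, world
print(stripped.upper())              # HELLO, WORLD
print(stripped.startswith('Hello'))  # True
print(stripped.find('World'))        # 7
print(stripped.find('xyz'))          # -1
print(stripped.replace('l', 'L'))    # HeLLo, WorLd
print('abc'.isalpha(), '123'.isdigit(), ' \t'.isspace())  # True True True

# split() and join() are opposites of each other.
parts = 'aaa,bbb,ccc'.split(',')
rejoined = '---'.join(parts)
print(parts)     # ['aaa', 'bbb', 'ccc']
print(rejoined)  # aaa---bbb---ccc
```

None of these calls changes s itself; each one returns a new string (or list).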

## String Slices

The "slice" syntax is a handy way to refer to sub-parts of sequences -- typically strings and lists. The slice s[start:end] is the elements beginning at start and extending up to but not including end. Suppose we have s = "Hello"

The string 'Hello' has the letter indexes 0 1 2 3 4 (H=0, e=1, l=2, l=3, o=4).

* s[1:4] is 'ell' -- chars starting at index 1 and extending up to but not including index 4
* s[1:] is 'ello' -- omitting either index defaults to the start or end of the string
* s[:] is 'Hello' -- omitting both always gives us a copy of the whole thing (this is the pythonic way to copy a sequence like a string or list)
* s[1:100] is 'ello' -- an index that is too big is truncated down to the string length

The standard zero-based index numbers give easy access to chars near the start of the string. As an alternative, Python uses negative index numbers, which count back from the end of the string, to give easy access to the chars at the end: s[-1] is the last char 'o', s[-2] is 'l' the next-to-last char, and so on:

* s[-1] is 'o' -- last char (1st from the end)
* s[-4] is 'e' -- 4th from the end
* s[:-3] is 'He' -- going up to but not including the last 3 chars.
* s[-3:] is 'llo' -- starting with the 3rd char from the end and extending to the end of the string.


In [None]:
s = 'hello'

print('s[1:4]\t:', s[1:4])
print('s[1:]\t:', s[1:])
print('s[:]\t:', s[:])
print('s[1:100]:', s[1:100])


print('s[-1]\t:', s[-1])
print('s[-4]\t:', s[-4])
print('s[:-3]\t:', s[:-3])
print('s[-3:]:', s[-3:])

s[1:4]	: ell
s[1:]	: ello
s[:]	: hello
s[1:100]: ello
s[-1]	: o
s[-4]	: e
s[:-3]	: he
s[-3:]: llo


## Exercise 1


### 1. both_ends

Given a string s, return a string made of the first 2 and the last 2 chars of the original string, so 'spring' yields 'spng'. However, if the string length
is less than 2, return instead the empty string.


In [None]:
def both_ends(s):
  string = s[:2] + s[-2:] if len(s) >= 2 else ""
  return string

In [None]:
print(both_ends('spring')) # 'spng'
print(both_ends('Hello'))  # 'Helo'
print(both_ends('a'))      # ''
print(both_ends('xyz'))    #'xyyz'

spng
Helo

xyyz


### 2. fix_start

Given a string s, return a string where all occurrences of its first char have been changed to '*', except do not change the first char itself.

e.g. 'babble' yields 'ba**le'
Assume that the string is length 1 or more.

Hint: s.replace(stra, strb) returns a version of string s
where all instances of stra have been replaced by strb.

In [None]:
def fix_start(s):
  return s[0] + s.replace(s[0],'*')[1:]

In [None]:
print(fix_start('babble'))  # 'ba**le'
print(fix_start('aardvark')) # 'a*rdv*rk'
print(fix_start('google')) # 'goo*le'
print(fix_start('donut')) # 'donut'

ba**le
a*rdv*rk
goo*le
donut


### 3. mix_up
Given strings a and b, return a single string with a and b separated by a space \<a\> \<b\>, except swap the first 2 chars of each string.

e.g.
```
  'mix', 'pod' -> 'pox mid'
  'dog', 'dinner' -> 'dig donner'
```
Assume a and b are length 2 or more.

In [None]:
def mix_up(a, b):
  # Swap the first 2 chars and keep the rest of each string
  return b[:2] + a[2:] + ' ' + a[:2] + b[2:]

In [None]:
print(mix_up('mix', 'pod')) # 'pox mid'
print(mix_up('dog', 'dinner'))  # 'dig donner'
print(mix_up('gnash', 'sport')) # 'spash gnort'
print(mix_up('pezzy', 'firm'))  # 'fizzy perm'

pox mid
dig donner
spash gnort
fizzy perm


### 4. verbing
Given a string, if its length is at least 3, add 'ing' to its end, unless it already ends in 'ing', in which case add 'ly' instead. If the string length is less than 3, leave it unchanged. Return the resulting string.

In [None]:
def verbing(s):
  string = s
  if len(s) >= 3:
    if s.endswith("ing"):
      string += "ly"
    else:
      string += "ing"
  return string

In [None]:
print(verbing('hail')) # 'hailing'
print(verbing('swiming')) # 'swimingly'
print(verbing('do'))  # 'do'

hailing
swimingly
do


### 5. not_bad

Given a string, find the first appearance of the substrings 'not' and 'bad'. If the 'bad' follows the 'not', replace the whole 'not'...'bad' substring with 'good'. Return the resulting string. So 'This dinner is not that bad!' yields 'This dinner is good!'.

In [11]:
def not_bad(s):
  t1 = s.find("not")
  t2 = s.find("bad")
  string = s
  # find() returns -1 when the substring is missing, so require that 'not' was found
  if t1 != -1 and t2 > t1:
    string = s[:t1] + "good" + s[t2+len("bad"):]
  return string

In [12]:
print(not_bad('This movie is not so bad'))  # 'This movie is good'
print(not_bad('This dinner is not that bad!')) # 'This dinner is good!'
print(not_bad('This tea is not hot')) # 'This tea is not hot'
print(not_bad("It's bad yet not"))  # "It's bad yet not"

This movie is good
This dinner is good!
This tea is not hot
It's bad yet not


Hint: ```'this movie is bad'.find('bad')``` -> 14

### 6. front_back

Consider dividing a string into two halves. If the length is even, the front and back halves are the same length. If the length is odd, we'll say that the extra char goes in the front half. 

e.g. 'abcde', the front half is 'abc', the back half 'de'.

Given 2 strings, a and b, return a string of the form
```
a-front + b-front + a-back + b-back
```

In [13]:
def front_back(a, b):
  sp_a = len(a)//2 + len(a)%2
  sp_b = len(b)//2 + len(b)%2
  return a[:sp_a] + b[:sp_b] + a[sp_a:] + b[sp_b:]

In [14]:
print(front_back('abcd', 'xy')) # 'abxcdy'
print(front_back('abcde', 'xyz')) #'abcxydez'
print(front_back('Kitten', 'Donut'))  #'KitDontenut'

abxcdy
abcxydez
KitDontenut


Hint:
``` a//b ```
is floor division

and 
``` a%b ```
is the remainder of the division

### 7. words_freqs 
Write a function that counts how often each word appears in the text and returns a dictionary with words as keys and counts as values:
```{'word1':count1,
'word2':count2}```

In [19]:
# option 1
def words_freqs(text_string):
  vocab_dict = dict()
  for w in text_string.split():
    if w in vocab_dict:
      vocab_dict[w] += 1
    else: 
      vocab_dict[w] = 1
  return vocab_dict

In [38]:
# option 2 - using defaultdict
from collections import defaultdict
def words_freqs(text_string):
  vocab_dict = defaultdict(int)
  for w in text_string.split():
      vocab_dict[w] += 1
  return dict(vocab_dict)

In [57]:
# option 3
from collections import Counter
def words_freqs(text_string):
  return Counter(text_string.split()) 

In [58]:
print(words_freqs('This movie is not so bad'))  
# {'This': 1, 'movie': 1, 'is': 1, 'not': 1, 'so': 1, 'bad': 1}

print(words_freqs('this is true, that is false'))
# {'this': 1, 'is': 2, 'true,': 1, 'that': 1, 'false': 1}

Counter({'This': 1, 'movie': 1, 'is': 1, 'not': 1, 'so': 1, 'bad': 1})
Counter({'is': 2, 'this': 1, 'true,': 1, 'that': 1, 'false': 1})


To get the result in a sorted fashion from the most frequent to the least frequent:

In [59]:
vocab = words_freqs('this is true, that is false')
sorted(vocab.items(), key=lambda x: x[1], reverse=True)

[('is', 2), ('this', 1), ('true,', 1), ('that', 1), ('false', 1)]

The expected result:

```[('is', 2), ('this', 1), ('true,', 1), ('that', 1), ('false', 1)] ```
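When the counts are held in a Counter (as in option 3), the same ordering is available from its most_common() method. A small sketch, assuming Python 3.7 or later, where words with equal counts keep the order in which they were first seen:

```python
from collections import Counter

counts = Counter('this is true, that is false'.split())

# All (word, count) pairs, most frequent first.
top = counts.most_common()
print(top)

# Only the two most frequent words.
top_two = counts.most_common(2)
print(top_two)
```

This only works on a Counter; the plain dictionaries returned by options 1 and 2 still need the sorted() call shown above.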

## References 
* https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str

# Regular expressions

Regular expressions are a powerful language for matching text patterns. This page gives a basic introduction to regular expressions themselves sufficient for our Python exercises and shows how regular expressions work in Python. The Python "re" module provides regular expression support.
In Python a regular expression search is typically written as:

In [None]:
import re

pat = ''
test_str = '' 
match = re.search(pat, test_str)

The re.search() function takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. The following example searches for the pattern 'word:' followed by one or more word characters:

In [None]:
test_str = 'an example word:cat!!'
match = re.search(r'word:\w+', test_str)
match.group()

'word:cat'

The code ```match = re.search(pat, str)``` stores the search result in a variable named "match".
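Because search() can return None, calling match.group() without a check fails with an AttributeError when nothing matched. A minimal sketch of the usual guard (find_word is a hypothetical helper name used only for this example):

```python
import re

def find_word(text):
    """Return the text matched by 'word:' plus word chars, or None if absent."""
    match = re.search(r'word:\w+', text)
    if match:
        return match.group()
    return None

print(find_word('an example word:cat!!'))  # word:cat
print(find_word('nothing to see here'))    # None
```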

The 'r' at the start of the pattern string designates a python "raw" string which passes through backslashes without change which is very handy for regular expressions.
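To see why the raw prefix matters, compare how Python reads '\b' with and without it. This sketch uses a made-up sample sentence:

```python
import re

# In a normal literal '\b' is one backspace character;
# in a raw literal it stays as backslash + 'b'.
print(len('\b'), len(r'\b'))  # 1 2

text = 'a cat sat'
no_raw = re.search('\bcat\b', text)     # looks for backspace, 'cat', backspace
with_raw = re.search(r'\bcat\b', text)  # looks for 'cat' between word boundaries

print(no_raw)            # None
print(with_raw.group())  # cat
```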

## Basic Patterns

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

* a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)

* . (a period) -- matches any single character except newline '\n'
* \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
* \b -- boundary between word and non-word
* \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
* \t, \n, \r -- tab, newline, return
* \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
* ^ = start, $ = end -- match the start or end of the string
* \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.

For more details please refer to the week 2 slides and the chapter provided for further reading.
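A few of the basic patterns above can be tried out on one sample string. The text and the first_match helper are hypothetical, introduced only for this sketch:

```python
import re

def first_match(pattern, text):
    """Return the first matching text, or None when there is no match."""
    match = re.search(pattern, text)
    return match.group() if match else None

text = 'Total: 42 items (3.5 kg)'

print(first_match(r'\d\d', text))       # 42
print(first_match(r'\s\w', text))       # ' 4' (a whitespace char, then a word char)
print(first_match(r'^Total', text))     # Total
print(first_match(r'^items', text))     # None ('items' is not at the start)
print(first_match(r'\)$', text))        # ) (escaped parenthesis at the end)
print(first_match(r'\d\.\d', text))     # 3.5 (escaped period)
print(first_match(r'\bitems\b', text))  # items
```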

## Basic Examples

The basic rules of regular expression search for a pattern within a string are:

* The search proceeds through the string from start to end, stopping at the first match found
* All of the pattern must be matched, but not all of the string
* If ```match = re.search(pat, str)``` is successful, match is not None and in particular match.group() is the matching text

In [None]:
match = re.search(r'iii', 'piiig') 
print(match)

<re.Match object; span=(1, 4), match='iii'>


In [None]:
match = re.search(r'igs', 'piiig') 
print(match)

None


In [None]:
match = re.search(r'..g', 'piiig')
print(match)

<re.Match object; span=(2, 5), match='iig'>


In [None]:
match = re.search(r'\d\d\d', 'p123g') 
print(match)

<re.Match object; span=(1, 4), match='123'>


In [None]:
match = re.search(r'\w\w\w', '@@abcd!!') 
print(match)

<re.Match object; span=(2, 5), match='abc'>


## Repetition

Things get more interesting when you use + and * to specify repetition in the pattern

* \+ -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* \* -- 0 or more occurrences of the pattern to its left
* ? -- match 0 or 1 occurrences of the pattern to its left

### Leftmost & Largest
First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").
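The greedy behaviour is easiest to see on text with repeated delimiters. The sketch below uses a made-up HTML snippet; the non-greedy form `.*?` is a standard feature of the re module that is not covered above and is shown only for contrast:

```python
import re

html = '<b>foo</b> and <i>bar</i>'

# Greedy: .* runs as far as it can, so the match ends at the LAST '>'.
greedy = re.search(r'<.*>', html).group()
print(greedy)      # <b>foo</b> and <i>bar</i>

# Non-greedy: .*? stops as soon as the rest of the pattern can match.
non_greedy = re.search(r'<.*?>', html).group()
print(non_greedy)  # <b>

# ? makes the char to its left optional.
optional = re.search(r'colou?r', 'The color red').group()
print(optional)    # color
```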

## Repetition Examples

In [None]:
match = re.search(r'pi+', 'piiig')
print(match)

<re.Match object; span=(0, 4), match='piii'>


In [None]:
match = re.search(r'i+', 'piigiiii')
print(match)

<re.Match object; span=(1, 3), match='ii'>


In [None]:
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')
print(match)

<re.Match object; span=(2, 9), match='1 2   3'>


In [None]:
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx')
print(match)

<re.Match object; span=(2, 7), match='12  3'>


In [None]:
match = re.search(r'\d\s*\d\s*\d', 'xx123xx')
print(match)

<re.Match object; span=(2, 5), match='123'>


In [None]:
match = re.search(r'^b\w+', 'foobar')
print(match)

None


In [None]:
match = re.search(r'b\w+', 'foobar')
print(match)

<re.Match object; span=(3, 6), match='bar'>


## Emails Example

Suppose you want to find the email address inside the string '1112-2233@students.latrobe.edu.au monkey dishwasher'. We'll use this as a running example to demonstrate more regular expression features. Here's an attempt using the pattern r'\w+@\w+':

In [None]:
test_email = '1112-2233@students.latrobe.edu.au monkey dishwasher'
match = re.search(r'\w+@\w+', test_email)
print(match.group())

2233@students


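The search does not get the whole email address in this case because the \w does not match the '-' or '.' in the address. Square brackets fix this: they indicate a set of chars, so [\w.-] matches a word char, a period, or a dash (inside square brackets the period is literal, and the dash is literal when it comes last).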

In [None]:
match = re.search(r'[\w.-]+@[\w.-]+', test_email)
print(match.group() )

1112-2233@students.latrobe.edu.au


## Group Extraction

The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match, instead they establish logical "groups" inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.

In [None]:
test_str = 'purple 1112-2233@students.latrobe.edu.au monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', test_str)
print(match.group())
print(match.groups())  
print(match.group(1)) 
print(match.group(2)) 

1112-2233@students.latrobe.edu.au
('1112-2233', 'students.latrobe.edu.au')
1112-2233
students.latrobe.edu.au


## findall

findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.

In [None]:
test_str = 'purple 1112-2233@students.latrobe.edu.au, blah monkey bob@abc.com blah dishwasher'

emails = re.findall(r'[\w\.-]+@[\w\.-]+', test_str) 
for email in emails:
  # do something with each found email string
  print(email)

1112-2233@students.latrobe.edu.au
bob@abc.com
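findall() can also be combined with groups. When the pattern contains two or more parenthesis groups, findall() returns a list of tuples, one tuple per match, instead of a list of strings. A short sketch reusing the grouped email pattern from the previous section:

```python
import re

test_str = 'purple 1112-2233@students.latrobe.edu.au, blah monkey bob@abc.com blah dishwasher'

# Each tuple holds (group 1, group 2) for one match.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', test_str)
print(pairs)

for user, host in pairs:
    print(user, host)
```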


## Exercise 2

### 1. cleanup_address
Given 
* 50 Fifth Ave. New York, NY 10012
* 100 Ninth Ave. Brooklyn, NY 11416
* 9 Houston St. Juneau, AK 99999
* 2800 Springfield Rd. Omaha, NE 55555

Change to:
* 50,Fifth Ave.,New York,NY,10012
* 100,Ninth Ave.,Brooklyn,NY,11416
* 9,Houston St.,Juneau,AK,99999
* 2800,Springfield Rd.,Omaha,NE,55555

In [76]:
# Part 1
import re
def cleanup_address(address):
  # Groups: street number, street name, city, state, ZIP code
  match = re.search(r'(\d+)\s(\w+\s\w+\.)\s([\w\s]+),\s(\w{2})\s(\d{5})', address)
  # Join the five captured groups with commas
  return ','.join(match.groups())

addresses = ['50 Fifth Ave. New York, NY 10012',
             '100 Ninth Ave. Brooklyn, NY 11416',
             '9 Houston St. Juneau, AK 99999',
             '2800 Springfield Rd. Omaha, NE 55555']
             
print(*(cleanup_address(address) for address in addresses), sep="\n")

50,Fifth Ave.,New York,NY,10012
100,Ninth Ave.,Brooklyn,NY,11416
9,Houston St.,Juneau,AK,99999
2800,Springfield Rd.,Omaha,NE,55555
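An alternative sketch of the same clean-up uses re.sub() with backreferences (\1 to \5 refer to the five groups). The name cleanup_address_sub is hypothetical, and the pattern assumes addresses in exactly the format listed above; text that does not match is returned unchanged:

```python
import re

# Groups: street number, street name, city, state, ZIP code
ADDRESS_PATTERN = r'(\d+)\s(\w+\s\w+\.)\s([\w\s]+),\s(\w{2})\s(\d{5})'

def cleanup_address_sub(address):
    """Rewrite a matching address as comma-separated fields."""
    return re.sub(ADDRESS_PATTERN, r'\1,\2,\3,\4,\5', address)

print(cleanup_address_sub('50 Fifth Ave. New York, NY 10012'))
print(cleanup_address_sub('9 Houston St. Juneau, AK 99999'))
print(cleanup_address_sub('not an address'))
```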


### 2. extract_names

The Social Security administration has this neat data by year of what names are most popular for babies born that year in the USA (see [social security baby names](http://www.socialsecurity.gov/OACT/babynames/)).

Implement the extract_names(filename) function which takes the filename of a baby name file such as baby1990.html and returns the data from the file as a single list -- the year string at the start of the list followed by the name-rank strings in alphabetical order, e.g. ```['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]```. Note that for parsing webpages in general, regular expressions don't do a good job, but these webpages have a simple and consistent format.

Rather than treat the boy and girl names separately, we'll just lump them all together. In some years, a name appears more than once in the html, but we'll just use one number per name. Optional: make the algorithm smart about this case and choose whichever number is smaller.

Build the program as a series of small milestones, getting each step to run/print something before trying the next step. This is the pattern used by experienced programmers -- build a series of incremental milestones, each with some output to check, rather than building the whole program in one huge step.

Printing the data you have at the end of one milestone helps you think about how to re-structure that data for the next milestone. Here are some suggested milestones:

* Extract all the text from the file and print it
* Find and extract the year and print it
* Extract the names and rank numbers and print them
* Get the names data into a dict and print it
* Build the [year, 'name rank', ... ] list and print it

The function should return the list.
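The dict milestone and the optional "keep the smaller number" rule could be sketched as follows. The rows below are made-up sample data, not taken from the real file, and build_name_list is a hypothetical helper; it assumes the (rank, boy name, girl name) tuples have already been extracted:

```python
# Made-up (rank, boy_name, girl_name) tuples standing in for the parsed table rows.
rows = [('1', 'Jacob', 'Emily'), ('2', 'Michael', 'Emma'), ('3', 'Emily', 'Madison')]

def build_name_list(year, rows):
    """Return [year, 'name rank', ...] keeping the smallest rank per name."""
    best_rank = {}
    for rank, boy, girl in rows:
        rank_num = int(rank)
        for name in (boy, girl):
            # Keep the smaller rank when a name appears more than once.
            if name not in best_rank or rank_num < best_rank[name]:
                best_rank[name] = rank_num
    return [year] + [f'{name} {best_rank[name]}' for name in sorted(best_rank)]

result = build_name_list('2006', rows)
print('\n'.join(result) + '\n')
```

Storing the ranks as integers keeps the comparison numeric, so '10' is not treated as smaller than '9'.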

To make the list into a reasonable looking summary text for printing, here's a clever use of join: text = '\n'.join(mylist) + '\n'

In [None]:
import re
def extract_names(filename):
  """
  Given a file name for baby.html, returns a list starting with the year string
  followed by the name-rank strings in alphabetical order.
  ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
  """
  # open up the file provided
  with open(filename) as f:
    content = f.read()

    # find the year
    year = re.search(r'(?:<h1>Popular Names by Birth Year<\/h1>.*)(\d{4})',content).group(1)
    # find the names and ranks
    matches = re.findall(r'(?:<tr align="right"><td>)(\d+)(?:</td><td>)(\w+)(?:</td><td>)(\w+)', content)

  # initialize the list to be returned
  name_list = [year]
  
  # loop over the table of names
  for match in matches:
    # unpack the rank, male name, female name
    rank, male, female = match
    name_list.append(male + " " + rank)
    name_list.append(female + " " + rank)

  name_list = sorted(name_list)
  return name_list


filename = 'baby2006.html'
extract_names(filename)

# Pandas

Pandas is a ubiquitous Python library for efficiently handling tabular data. It is often used for cleaning up data.

Pandas offers two data types, Series and DataFrames.
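As a rough sketch of the difference between the two (the sample names and ages are invented for illustration), a Series is a single labelled column, while a DataFrame is a table whose columns are Series:

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
names = pd.Series(['alice', 'bob', 'carol'], name='name')

# A DataFrame is a two-dimensional table; each column is a Series.
df = pd.DataFrame({'name': ['alice', 'bob', 'carol'], 'age': [30, 25, 41]})

print(names)
print(df)
print(type(df['name']))  # selecting one column gives back a Series
```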

In [None]:
import pandas as pd




In [None]:
# data is stored as object internally
print(pd.Series(["a", "b", "c"]))

# data is stored as string internally
print(pd.Series(["a", "b", "c"], dtype="string"))

0    a
1    b
2    c
dtype: object


0    a
1    b
2    c
dtype: string

In [None]:
s = pd.Series(["a", None, "b"], dtype="string")
print(s)

0       a
1    <NA>
2       b
dtype: string


In [None]:
print(s.str.count("a"))

0       1
1    <NA>
2       0
dtype: Int64


In [None]:
print(s.dropna().str.count("a"))

0    1
2    0
dtype: Int64


In [None]:
s.str.isdigit()

0    False
1     <NA>
2    False
dtype: boolean

In [None]:
s.str.match("a")

0     True
1     <NA>
2    False
dtype: boolean

## String operations


Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. These methods also skip missing/NA values automatically.

These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:

In [None]:
import numpy as np
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string")
print(s)

0       A
1       B
2       C
3    Aaba
4    Baca
5    <NA>
6    CABA
7     dog
8     cat
dtype: string


In [None]:
print(s.str.lower())

print(s.str.upper())

0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string


In [None]:
s.str.len()

0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64

## Splitting and replacing strings


In [None]:
s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")
print(s2)

0    a_b_c
1    c_d_e
2     <NA>
3    f_g_h
dtype: string


In [None]:
s2.str.split("_")

0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

Elements in the split lists can be accessed using get or [] notation:

In [None]:
s2.str.split("_").str.get(1)

0       b
1       d
2    <NA>
3       g
dtype: object

It is easy to expand this to return a DataFrame using expand:


In [None]:
print(s2.str.split("_", expand=True))

      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h


It is also possible to limit the number of splits:

In [None]:
print(s2.str.split("_", expand=True, n=1))

      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h
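Replacing works along the same lines through str.replace. The sketch below rebuilds a Series like s2 above and assumes a pandas version that supports the "string" dtype; the regex argument is passed explicitly because its default has differed between pandas versions:

```python
import pandas as pd

s2 = pd.Series(["a_b_c", "c_d_e", None, "f_g_h"], dtype="string")

# Literal replacement: every "_" becomes "-"; missing values stay missing.
replaced = s2.str.replace("_", "-", regex=False)
print(replaced)

# Regular expression replacement: drop the final "_x" part of each string.
regex_replaced = s2.str.replace(r"_\w$", "", regex=True)
print(regex_replaced)
```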
