# Regular expressions (regex)
A regular expression (regex) is a sequence of characters that defines a search pattern. Regexes allow us to do fancy data sciency things like searching for an email address with a particular pattern, e.g. one that starts with an "s", followed by 3 digits and ending with "@yahoo.com".
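
As a minimal sketch of the email example just described, the pattern below is one possible way to write it. The candidate addresses are invented sample data, and `fullmatch` is used on the assumption that the whole string must fit the pattern.

```python
import re

# One way to express: an "s", then exactly 3 digits, then "@yahoo.com".
# \d matches a digit, {3} repeats it three times, \. is a literal dot.
pattern = re.compile(r"s\d{3}@yahoo\.com")

# Invented sample addresses, for illustration only
candidates = ["s123@yahoo.com", "s12@yahoo.com", "t123@yahoo.com", "s123@gmail.com"]

# fullmatch succeeds only if the entire string fits the pattern
matches = [c for c in candidates if pattern.fullmatch(c)]
print(matches)
```

Only the first address satisfies all three conditions; the others fail on the digit count, the first letter or the domain.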

In this notebook we will briefly touch upon string manipulation and using regex with pandas.

# String manipulation <a name="strings"></a>
Python has long been popular for raw data manipulation, in part due to its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed.

In [1]:
import numpy as np
import pandas as pd

### Basics
Let's refresh what normal `str` (string) objects are capable of in Python.

In [2]:
# complex strings can be broken into small bits
val = "Edinburgh is great"
val.split(" ")

['Edinburgh', 'is', 'great']

In [3]:
# substrings can be concatenated together with +
first, second, last = val.split(" ")
first + "::" + second + "::" + last

'Edinburgh::is::great'

Remember that strings are sequences of individual characters, so you can iterate over them just like a list

In [4]:
val = "Edinburgh"
for each in val:
    print(each)

E
d
i
n
b
u
r
g
h
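
Because strings are sequences, the usual sequence operations apply to them as well. The sketch below shows a few of them and also where strings differ from lists: they cannot be modified in place.

```python
val = "Edinburgh"

first_char = val[0]          # indexing
prefix = val[:4]             # slicing
has_burgh = "burgh" in val   # membership test
length = len(val)            # number of characters

# Unlike lists, strings are immutable: item assignment raises TypeError
try:
    val[0] = "e"
    mutated = True
except TypeError:
    mutated = False

print(first_char, prefix, has_burgh, length, mutated)
```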


Strings also have their own search methods, such as `find`, which returns the index of the first occurrence

In [5]:
val.find("n")

3

In [6]:
val.find("x")  # -1 means that there is no such element

-1

In [7]:
# and of course remember about upper() and lower()
val.upper()

'EDINBURGH'

If you want to learn more about strings you can always refer to the [Python manual](https://docs.python.org/2/library/string.html)

### Regular expressions
Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a *regex*, is a string formed according to the regular expression language. Python's built-in `re` module is responsible for applying regular expressions to strings.

In [8]:
import re
text = "foo    bar\t baz   \tqux"
text

'foo    bar\t baz   \tqux'

In [9]:
re.split(r"\s+", text)  # raw string, so the backslash reaches the regex engine unchanged

['foo', 'bar', 'baz', 'qux']

This expression split the text wherever whitespace occurs, which effectively removed all spaces and tab characters (`\t`). The `\s` regex matches any single whitespace character, and the `+` after it means "one or more in a row", so each run of whitespace is treated as a single separator.
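
To see why `\s+` is useful here, the following sketch compares it with a plain `str.split(" ")` on the same text, and shows that the same pattern can be reused with `re.sub` to normalise whitespace rather than split on it.

```python
import re

text = "foo    bar\t baz   \tqux"

# Splitting on a single space leaves empty strings and keeps the tabs
naive = text.split(" ")

# \s+ treats each run of spaces and tabs as one separator
tokens = re.split(r"\s+", text)

# The same pattern can collapse each run into a single space instead
normalised = re.sub(r"\s+", " ", text)

print(naive)
print(tokens)
print(normalised)
```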

Let's have a look at a more complex example - identifying email addresses in a text file:

In [10]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

# pattern to be used for searching
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [11]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

Let's dissect the regex part by part:
```
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
```

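- the `r` prefix marks a raw string: Python leaves backslashes untouched, so an escape such as `\.` reaches the regex engine exactly as written. Without it, Python would first try to interpret sequences like `\n` itself (as a newline)
- `A-Z` means all uppercase letters from A to Z; lowercase letters are matched here only because we pass `re.IGNORECASE`
- `0-9` similarly means all digits from 0 to 9
- the concatenation `._%+-` means just include those characters (a `-` placed last inside the brackets is a literal hyphen, not a range)
- the square brackets `[ ]` define a character class, which matches any single character listed inside. For example `[A-Z0-9._%+-]` matches one letter A to Z, one digit 0 to 9, or one of the characters `._%+-`
- `+` means one or more repetitions of the preceding element, so `[A-Z0-9._%+-]+` matches a run of such characters
- `@` matches itself, and `\.` matches a literal dot (an unescaped `.` would match any character)
- `{2,4}` means the preceding element is repeated 2 to 4 times, so `[A-Z]{2,4}` matches 2 to 4 letters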

To summarise, the pattern above searches for one or more letters, digits or the characters `._%+-`, followed by an `@`, then one or more letters, digits, dots or hyphens, followed by a `.` and 2 to 4 letters.
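
The individual building blocks can be tried out in isolation. The sketch below uses short made-up strings to show how a character class, the `+` quantifier, `re.IGNORECASE` and `{2,4}` behave; the `$` anchor in the second pattern is an addition for this illustration and is not part of the email pattern above.

```python
import re

sample = "AB12 cd34"

# Character class + quantifier: runs of uppercase letters and digits
upper_only = re.findall(r"[A-Z0-9]+", sample)

# With IGNORECASE the class also matches lowercase letters
any_case = re.findall(r"[A-Z0-9]+", sample, flags=re.IGNORECASE)

# {2,4}: a literal dot followed by 2 to 4 letters at the end of the string
tld = re.compile(r"\.[A-Z]{2,4}$", flags=re.IGNORECASE)
ok_tld = bool(tld.search("example.com"))   # 3 letters after the dot
bad_tld = bool(tld.search("example.c"))    # only 1 letter after the dot

print(upper_only)
print(any_case)
print(ok_tld, bad_tld)
```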

### Regular expressions and pandas
Let's see how the two can be combined by replicating the example above.

In [12]:
data = pd.Series({'Dave': 'Daves email dave@google.com', 'Steve': 'Steves email steve@gmail.com',
        'Rob': 'Robs rob@gmail.com', 'Wes': np.nan})
data

Dave      Daves email dave@google.com
Steve    Steves email steve@gmail.com
Rob                Robs rob@gmail.com
Wes                               NaN
dtype: object

We can reuse the same `pattern` variable from above

In [13]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [dave@google.com]
Steve    [steve@gmail.com]
Rob        [rob@gmail.com]
Wes                    NaN
dtype: object

pandas also offers more standard string operations. For example, we can check if a string is contained within a data row:

In [14]:
data.str.contains("gmail")

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
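
Note that the missing value stays `NaN` in the result, which gets in the way if the output is meant to be used as a boolean mask. One option, sketched below on the same data, is the `na` parameter of `str.contains`, which sets the value to use for missing entries.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'Daves email dave@google.com',
                  'Steve': 'Steves email steve@gmail.com',
                  'Rob': 'Robs rob@gmail.com',
                  'Wes': np.nan})

# na=False treats missing values as "no match", giving a usable mask
mask = data.str.contains("gmail", na=False)

# The mask can now be used to filter the Series
gmail_users = data[mask]
print(gmail_users.index.tolist())
```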

Many more of these methods exist:
    
    
| Methods | Description |
| -- | -- |
| cat | Concatenate strings element-wise with optional delimiter |
| contains | Return boolean array if each string contains pattern/regex |
| count | Count occurrences of a pattern |
| extract | Use a regex with groups to extract one or more strings from a Series |
| findall | Compute list of all occurrences of pattern/regex for each string |
| get | Index into each element |
| isdecimal | Checks if all characters in the string are decimal characters |
| isdigit | Checks if all characters in the string are digits |
| islower | Checks if the string is in lower case |
| isupper | Checks if the string is in upper case |
| join | Join strings in each element of the Series with passed separator |
| len | Compute the length of each string |
| lower, upper | Convert cases |
| match | Checks if each string starts with a match of the regex (returns booleans in current pandas) |
| pad | Adds whitespace to left, right or both sides of strings |
| repeat | Duplicate string values |
| slice | Slice each string in the Series |
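
As an illustration of `extract` from the table, the sketch below wraps the two halves of the email pattern in capture groups so that each half lands in its own column. The group names `user` and `domain` are illustrative choices, not part of the original pattern.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'Daves email dave@google.com',
                  'Steve': 'Steves email steve@gmail.com',
                  'Rob': 'Robs rob@gmail.com',
                  'Wes': np.nan})

# Named capture groups (hypothetical names) become the column labels
grouped = r'(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+\.[A-Z]{2,4})'

# extract returns a DataFrame with one column per group;
# rows without a match (or with missing data) are filled with NaN
parts = data.str.extract(grouped, flags=re.IGNORECASE)
print(parts)
```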

### Exercise
There is a dataset `data/yob2012.txt` which lists the number of newborns registered in 2012 with their names and sex. Using regular expressions, extract all names from the dataset which start with letters A to C. How many names did you find?

Note: `^` is the "starting with" operator in regular expressions. It anchors the pattern to the start of the string.
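
A small sketch of what the `^` anchor changes, using a handful of made-up names rather than the dataset:

```python
import re

# Invented sample names, for illustration only
names = ["Ava", "Emma", "Chloe", "Rebecca"]

# With ^ the letter must be the first character of the name
anchored = [n for n in names if re.search(r"^[A-C]", n)]

# Without ^ (and ignoring case) an a, b or c anywhere in the name is enough
unanchored = [n for n in names if re.search(r"[A-C]", n, flags=re.IGNORECASE)]

print(anchored)
print(unanchored)
```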

In [15]:
yob = pd.read_csv("data/yob2012.txt")  # Read the data
yob  # Note: the file has no header row, so its first line (Sophia,F,22267) has been treated as the header!

Unnamed: 0,Sophia,F,22267
0,Emma,F,20902
1,Isabella,F,19058
2,Olivia,F,17277
3,Ava,F,15512
4,Emily,F,13619
5,Abigail,F,12662
6,Mia,F,11998
7,Madison,F,11374
8,Elizabeth,F,9674
9,Chloe,F,9641
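
One possible way to avoid losing the first record to the header is to tell `read_csv` that the file has no header row. The sketch below uses an in-memory stand-in built from the first few records shown above instead of the real file, and the column names `name`, `sex` and `births` are hypothetical labels chosen for this example. The rest of the notebook keeps the original `yob` DataFrame.

```python
import io
import pandas as pd

# In-memory stand-in for data/yob2012.txt (assumed to be plain CSV)
sample = io.StringIO("Sophia,F,22267\nEmma,F,20902\nIsabella,F,19058\n")

# header=None stops pandas from using the first record as column labels;
# names supplies labels instead (hypothetical choices)
yob_fixed = pd.read_csv(sample, header=None, names=["name", "sex", "births"])
print(yob_fixed)
```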


In [16]:
pattern = r'^[A-C][A-Z]*'  # A first letter from A to C, followed by any further letters
# The first record was read as the header, so the name column is labelled "Sophia"
namesearch = yob["Sophia"].str.findall(pattern, flags=re.IGNORECASE)  # Search each name for the regex
namesearch

0                 []
1                 []
2                 []
3              [Ava]
4                 []
5          [Abigail]
6                 []
7                 []
8                 []
9            [Chloe]
10                []
11           [Avery]
12         [Addison]
13          [Aubrey]
14                []
15                []
16                []
17       [Charlotte]
18                []
19                []
20                []
21          [Amelia]
22                []
23                []
24                []
25                []
26                []
27        [Brooklyn]
28                []
29                []
            ...     
33653             []
33654             []
33655             []
33656             []
33657             []
33658             []
33659             []
33660             []
33661             []
33662             []
33663             []
33664             []
33665             []
33666             []
33667             []
33668             []
33669        

In [17]:
namelist = [] # Now filter the above to just a series of strings
for n in namesearch:
    if n != []:
        namelist.append(n[0])
namelist = pd.Series(namelist)
namelist

0              Ava
1          Abigail
2            Chloe
3            Avery
4          Addison
5           Aubrey
6        Charlotte
7           Amelia
8         Brooklyn
9             Anna
10         Aaliyah
11         Allison
12          Alexis
13          Audrey
14          Alyssa
15          Claire
16          Camila
17         Arianna
18          Ashley
19         Brianna
20           Bella
21           Alexa
22          Aubree
23          Autumn
24          Ariana
25       Alexandra
26        Caroline
27          Bailey
28            Aria
29       Annabelle
           ...    
7576        Colbee
7577       Colburn
7578       Coleden
7579       Colesyn
7580       Collynn
7581        Corbon
7582      Cordarro
7583        Corden
7584        Corian
7585       Cormack
7586        Cornel
7587       Corrado
7588        Corrin
7589       Cortlan
7590        Cosmas
7591        Costas
7592      Costello
7593    Crescencio
7594       Crishon
7595       Cristen
7596       Criston
7597      Cr

In [18]:
namelist.drop_duplicates() # Finally, drop duplicate names

0              Ava
1          Abigail
2            Chloe
3            Avery
4          Addison
5           Aubrey
6        Charlotte
7           Amelia
8         Brooklyn
9             Anna
10         Aaliyah
11         Allison
12          Alexis
13          Audrey
14          Alyssa
15          Claire
16          Camila
17         Arianna
18          Ashley
19         Brianna
20           Bella
21           Alexa
22          Aubree
23          Autumn
24          Ariana
25       Alexandra
26        Caroline
27          Bailey
28            Aria
29       Annabelle
           ...    
7573         Cline
7574           Coe
7575         Cohyn
7577       Colburn
7578       Coleden
7579       Colesyn
7580       Collynn
7581        Corbon
7582      Cordarro
7583        Corden
7584        Corian
7585       Cormack
7586        Cornel
7587       Corrado
7589       Cortlan
7590        Cosmas
7591        Costas
7592      Costello
7593    Crescencio
7594       Crishon
7596       Criston
7597      Cr
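
To answer the "how many" part of the exercise, the filtering and counting can also be done without an explicit loop. The sketch below shows one such approach on a small invented Series standing in for the name column; it is not run against the real dataset, so no count for `yob2012.txt` is implied.

```python
import pandas as pd

# Invented stand-in for the name column, with one repeated name
names = pd.Series(["Ava", "Emma", "Chloe", "Ava", "Brooklyn", "Mia"])

# Boolean mask: True where the name starts with A, B or C
starts_a_to_c = names[names.str.contains(r"^[A-C]")]

# Drop repeated names, then count what is left
unique_names = starts_a_to_c.drop_duplicates()
print(len(unique_names))
```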