# SIT742: Modern Data Science 
**(Week 05: Text Analysis)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.
- If you find any issue/bug in this document, please submit an issue at [tulip-lab/sit742](https://github.com/tulip-lab/sit742/issues)

Prepared by **SIT742 Teaching Team**

---

## Session 5C - More on Regular Expressions (Optional)

Table of Contents

* Part 1. Get Started with Regular Expressions
* Part 2. Case Study (1) - Parsing Dates with Regular Expressions 
* Part 3. Case Study (2) - Extract IPs, dates, and email address 
* Part 4. Summary
* Part 5. Exercises

---


Extracting a specific piece of text from a large block of text is a challenging problem in data parsing.
Most programming languages provide built-in string functions for 
searching and replacing, for example the [common string operations](https://docs.python.org/2/library/string.html) built into Python. 
However, these methods are limited to the simplest cases. For example, the `str.find()` method in Python 
returns the lowest index at which a given substring occurs in a string.
In more complex cases, such as data validation, where you need to test a string to see whether a telephone number pattern occurs within it, relying on Python's built-in string functions can result in code containing a stack of 
`if` statements.
To simplify the code and make it more readable, you might need to move up to [regular expressions](https://en.wikipedia.org/wiki/Regular_expression). So what is a regular expression?

> "A regular expression is a pattern which specifies a set of strings of characters; it is said to match certain strings."--- Ken Thompson

If you've ever typed `cp *.ipynb ../module_2/` at the UNIX command prompt, or entered "match?" into a web-based search engine, you've already used a simple regular expression. In the first instance, you've copied all the files which end with file extension ".ipynb" (as opposed to copying them one by one); in the second, you've conducted a search not only for "match," but also for "matches", "matching", "matched", and "matcher" all at once.
Using a well-crafted regular expression, you can easily search through a large number of text files, searching for words ending with "ed", replace the .html suffix with a .xml suffix, and then change all the lower case characters to upper case. 

Regular expressions (Regex) are all about matching and finding patterns in text, from simple patterns to the very complex ones.
For instance, they can be as simple as this:
```python
    \d
```
A character shorthand that matches any digit from 0 to 9.
They can also be something a bit more complicated like 
```python
(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])
```
which is where we will wind up at the end of this chapter: a fairly robust regular expression
that matches a date in yyyy-mm-dd format from between 1900-01-01 and 2099-12-31, with a choice of four separators, which are '-', ' ', '/' and '.'. This regular expression can be visualised as

![regEx](https://github.com/tulip-lab/sit742/raw/master/Jupyter/image/regeximg1.png)


If you are an experienced Regex user, this seems simple and straightforward.
However, if you are new to Regex or aren't experienced enough, it takes a while to understand it. 
Don't worry if you don't get how that all works. 
We will show you how to develop regular expressions step-by-step in this chapter.
If you just follow the examples, writing regular expressions will eventually turn out to be not 
that hard. To quickly review some regular expression syntax:


* <font color="red">[0-9]</font> Matches a single digit
* <font color="red">[a-z0-9]</font> Matches a single character that must be a lower case letter or a digit.
* <font color="red">[A-Za-z]</font> Matches a single character that must be an upper- or lowercase letter.
* <font color="red">\d</font> Matches any decimal digit; equivalent to the set [0-9].
* <font color="red">\D</font> Matches characters that are not digits, which is equivalent to [^0-9] or [^\d].
* <font color="red">\w</font> Matches any alphanumeric character or underscore, which is equivalent to [a-zA-Z0-9_].
* <font color="red">\W</font> Matches any character that is not alphanumeric or an underscore, which is equivalent to [^a-zA-Z0-9_] or [^\w].
* <font color="red">\s</font> Matches any whitespace character, which is equivalent to [ \t\n\r\f\v], where \t indicates tabs, \n line feeds, \r carriage returns, \f form feeds and \v vertical tabs.
* <font color="red">\S</font> Matches any non-whitespace character, which is equivalent to [^ \t\n\r\f\v].
* <font color="red">^</font> Matches the start of the line.
* <font color="red">$</font> Matches the end of the line.
* <font color="red">.</font> Matches any character except a newline (a wildcard).
* <font color="red">*</font> Matches when the preceding character occurs zero or more times
* <font color="red">?</font> Matches when the preceding character occurs zero or one times
* <font color="red">+</font> Matches when the preceding character occurs one or more times

More information can be found here :
https://docs.python.org/3.8/library/re.html
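
The shorthands listed above can be tried out directly with `re.findall()` and `re.search()`. The following is a minimal sketch; the sample string and variable names are made up for illustration only.

```python
import re

# Illustrative sample text (not from the course data)
sample = "Room 42:\tcall 555-0199!"

digits = re.findall(r"\d+", sample)       # runs of one or more digits
words = re.findall(r"\w+", sample)        # runs of word characters
non_space = re.findall(r"\S+", sample)    # runs of non-whitespace characters
starts_with_room = bool(re.search(r"^Room", sample))  # ^ anchors at the start
ends_with_bang = bool(re.search(r"!$", sample))       # $ anchors at the end

print(digits)      # ['42', '555', '0199']
print(words)       # ['Room', '42', 'call', '555', '0199']
print(non_space)   # ['Room', '42:', 'call', '555-0199!']
print(starts_with_room, ends_with_bang)  # True True
```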

---


In [0]:
import sys
print (sys.version_info)

sys.version_info(major=3, minor=6, micro=7, releaselevel='final', serial=0)


Libraries needed are:

In [0]:
# library for regular expression
import re
import pandas as pd
pd.__version__

'0.22.0'

## Part 1. Regular Expressions

As a powerful way of searching, replacing, and parsing text with complex patterns of characters, regular expressions are the most significant tools in data parsing. They figure into all kinds of text-manipulation tasks. Searching and search-and-replace are among the most common uses. Regular expressions tend to be easier to write than they are to read. This is less of a problem if you are the only one who ever needs to maintain them. But if several people need to, the syntax can turn into more of a hindrance than an aid. For example,
```python
    ^(|(([A-Za-z0-9]+_+)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*[A-Za-z0-9]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6})$
```
is a regular expression for validating email addresses.
Please don't try to parse it yourself, 
since even an experienced regular expression user might take a while to work through it.
In this section, we will first go through some good introductory materials on regular expressions,
and then show you some fundamentals of how to use regular expressions to search text.

There are a couple of good online materials that introduce regular expressions in Python. 
We strongly suggest that you study this chapter together with these materials. 
They are 
* [Regular Expression HOWTO](https://docs.python.org/3.8/howto/regex.html) from Python's official website: An introductory tutorial to using regular expressions in Python with the `re` module. 📖
* [Regular Expressions](https://www.deakin.edu.au/library), chapter 5 of "**Dive into Python 3**": A series of examples inspired by real world problems are used to show you how to generate regular expressions for parsing street name, Roman numerals, and phone numbers. 📖

The complete list of meta-characters and their behaviour in the context of regular expressions can be found [here](https://docs.python.org/3.8/library/re.html). There is also an alternative resource you may like to try:

* [RegexOne](http://regexone.com): An interactive tutorial on learning regular expressions with simple exercises.

Before we go through some basics of regular expressions in Python, we would like to point out [RegExr](http://regexr.com) by Grant Skinner. It is an online tool to learn, build, and test regular expressions. RegExr provides syntax highlighting, contextual help, a video tutorial, a reference, and searchable community patterns.
You will find a lot of good information in the six tabs provided on its website. In addition, pop-ups appear when you hover over the regular expression or target text in RegExr, giving you helpful information that links the regular expression to the corresponding matches in the text. 
These resources are among the reasons why RegExr is one of our favourite online Regex checkers.

### 1.0. Backslash

**First, what is '\'?**

'\', backslash or escape-character, is used to indicate special forms or to allow special characters to be used without invoking their special meaning.




**How about r""? When to use it?**

r"" is Python’s raw string literal prefix notation, which has nothing to do with regular expressions as such. With r"" or r'', Python does not handle backslash escapes in any special way; in other words, it treats the contents as a raw string. For example, r"\t" represents
a two-character string containing '\' and 't', whereas "\t" represents a single tab character.
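
A quick way to see the difference is to compare the lengths of the two literals. This is a small illustrative sketch; the variable names are arbitrary.

```python
plain = "\t"   # one character: a tab
raw = r"\t"    # two characters: a backslash followed by 't'

print(len(plain), len(raw))   # 1 2
print(raw == "\\t")           # True: the raw string equals an escaped backslash plus 't'
```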

Sometimes you can use them interchangeably,

In [0]:
str1 = re.findall('\t', "Please find \t")
print (str1)

str2 = re.findall(r'\t', "Please find \t")
print (str2)

['\t']
['\t']


Sometimes not!

In [0]:
str1=re.match(r"\W(.)\1\W", " ff ")
print (str1)

str2=re.match("\W(.)\1\W", " ff ")
print (str2)

str3=re.match("\\W(.)\\1\\W", " ff ")
print (str3)

<_sre.SRE_Match object; span=(0, 4), match=' ff '>
None
<_sre.SRE_Match object; span=(0, 4), match=' ff '>


"\W(.)\1\W" doesn't match ?  What is the difference? 

In [0]:
str4="\W(.)\1\W"
print (str4)
str4

\W(.)\W


'\\W(.)\x01\\W'

In [0]:
str4=r"\W(.)\1\W"
print (str4)
str4

\W(.)\1\W


'\\W(.)\\1\\W'

Now you might be able to guess, what "\W(.)\1\W" will match. 

In [0]:
str2=re.match("\W(.)\1\W", " f\x01 ")
print (str2)

<_sre.SRE_Match object; span=(0, 4), match=' f\x01 '>


It matches a non-word character + any one character + "\x01" + a non-word character.

*Conclusion -- always validate your regular expression first, then test it with Python*

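What is \*?  <br>
\* is a quantifier, like ? and \+  <br>
\* matches the preceding element 0 or more times <br>
? matches the preceding element 0 or 1 times <br>
\+ matches the preceding element 1 or more times <br>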

In [0]:
str1 = re.findall(r'.*', 'Please find all.')
print (str1)

['Please find all.', '']


In [0]:
str1 = re.findall(r'.?', 'Please find all.')
print (str1)

['P', 'l', 'e', 'a', 's', 'e', ' ', 'f', 'i', 'n', 'd', ' ', 'a', 'l', 'l', '.', '']


In [0]:
str1 = re.findall(r'.+', 'Please find all.')
print (str1)

['Please find all.']


In [0]:
str1 = re.findall(r'l+', 'Please find all')
print (str1)

['l', 'll']


### 1.1. Matching String Literals
Matching strings with one or more literal characters, called string literals, is similar to the way you might do a search in a word editor or when submitting a keyword to a search engine. When you search for a string of text, you are searching with a string literal.
Let's start with a very simple scenario. 
If we have a sentence like
```
    Today is 26 jan 2016, not 25 Jan 2016.
```
And want to see if the string contains the word `Jan`  using a Python regular expression,
we'd use the following

In [0]:
import re # The Regular Expressions library
str = "Today is 26 jan 2016, not 25 Jan 2016." 
s = re.search("Jan", str)
print (s)

<_sre.SRE_Match object; span=(29, 32), match='Jan'>


The simple pattern used above is just something like 'J' followed by 'a' followed by 'n' (i.e., 'Jan').
The `search()` method scans through the string, looking for any location where 'Jan' appears. If a match is found, a match object instance corresponding to the first match is returned. Our search was successful, as the code prints out the match object
as 
```
    <_sre.SRE_Match object; span=(29, 32), match='Jan'>
```
A match object is always truthy, so this is Python's way of saying 'True' or 'Yes'. If no match is found, `search()` returns `None`, which is what gets printed. 
For example, try the following 

In [0]:
print (re.search("Feb", str))

None


The returned match object contains information about the match: where it starts and ends, the substring it matched, and more. You can query the match object for information about the matching string. The most important ones are:

In [0]:
print (s.group())
print (s.start()) 
print (s.end())
print (s.span())

Jan
29
32
(29, 32)


The `group()` method returns the string "Jan" matched by the regular expression. 
The `start()` method returns the starting position of "Jan", which is equal to the index of 'J' in the whole string.
Go ahead, count the characters in "Today is 26 jan 2016, not 25 Jan 2016.", starting at "T" with index 0, then try:
```python
   str.index('J')
```
It should give the same integer as that given by `s.start()`. 
The `end()` method returns the ending position of the match, 
and the `span()` method returns a tuple containing the (start, end) positions.
This scenario is so simple that you don't need a regular expression.
Instead, you can use a string function, `find()`, which gives you the start position of the target string.

In [0]:
str.find("Jan")

29

How about finding both "Jan" and "jan"? 
The `find()` method can only find the first occurrence of a given substring, and `search()` likewise returns only the first match of a regular expression. 
There are two pattern methods that return all of the matches for a pattern encoded in a given regular expression. 
They are `findall()` and `finditer()`.
The former returns a list of matching strings, 
and the latter returns a sequence of match object instances as an iterator. 
Let's try

In [0]:
print (re.findall("Jan", str))
for m in re.finditer("Jan", str):
    print (m.group())
    print (m.span())

['Jan']
Jan
(29, 32)


However, using "Jan" can find the one with uppercase "J", but not the one with lowercase 'j'. 
The reason is that string matching is case-sensitive in regular expressions. 
If you want to match both lower- and uppercase, you can: 
1. Convert all the characters in the string into either lower- or uppercase ones, then use either `re.findall("jan", str)` or `re.findall("JAN", str)` respectively to find the two appearances of "Jan",
2. Update our regular expression to account for both 'J' and 'j', and retrieve both "jan" and "Jan" in their original form, like: 
```python
    [Jj]an
```
where '[ ]' indicates a set of characters, and '[Jj]' will match 'J' or 'j', which is also known as a character class.

In [0]:
re.findall(r"[Jj]an", str)

['jan', 'Jan']

Another option is to use grouping in regular expressions. We place the alternatives in parentheses () and separate them with a pipe |. So we could use:

In [0]:
re.findall(r"(Jan|jan)", str)

['jan', 'Jan']
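
A further option, not used in the examples above, is the `re.IGNORECASE` flag of the `re` module, which makes the whole pattern case-insensitive while still returning the matches in their original form. A minimal sketch (the variable name `sentence` is chosen here only to avoid clashing with other names):

```python
import re

sentence = "Today is 26 jan 2016, not 25 Jan 2016."

# One lowercase pattern; the flag lets it match any mix of upper and lower case
matches = re.findall(r"jan", sentence, flags=re.IGNORECASE)
print(matches)   # ['jan', 'Jan']
```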

Let's move one step further and find all the words that contain only alphabetic characters, using only regular expressions. 
It is not feasible to use grouping to enumerate all the possible words. 
Instead, we are going to use '[ ]' together with '+'.
You have seen '[ ]' above. '+' means matching 1 or more repetitions of the preceding regular expression. 
For example,
'an+' will match ‘a’ followed by any non-zero number of ‘n’s; 
it will not match just ‘a’. 
To match non-zero numbers of either lower- or uppercase characters, we derive the following regular expression:

In [0]:
re.findall(r"[a-zA-Z]+", str)

['Today', 'is', 'jan', 'not', 'Jan']

In the example above, we represent a range of characters by giving two characters separated by a '-'. For example [a-z] will match any lowercase ASCII letters, and [A-Z] will match any uppercase ASCII letters. Put the two together, we derive the regular expression that matches any lower- or uppercase letters.

### 1.2. Matching Digits
There are several ways to represent digits in regular expressions:
* [0-9]: A range that matches the range of digits 0 through 9, which is the same as "[0123456789]".
* \d: A character shorthand to match the digits, which is pre-defined in most regular expression engines.
It is equivalent to [0-9].

Note that the character shorthand for digits is shorter and simpler, 
but it doesn’t have the power or flexibility of the range. 
With a range, you can pick the exact digits you want to match. 
For example, if you want to match a sequence of the binary digits, like '0010101011', 
you would use
```python
    [01]+
```
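
As a quick check of the binary-digit pattern, the sketch below runs it over a made-up sentence. The `\b` word-boundary anchors are an addition for this example, so that digits inside a longer number such as "1021" are not reported as binary strings.

```python
import re

# Illustrative text: two binary strings and one number that is not binary
sample = "flags 0010101011 and 1100, but not 1021"

binary_strings = re.findall(r"\b[01]+\b", sample)
print(binary_strings)   # ['0010101011', '1100']
```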

To match numbers that have more than one digit, for example, '12' and '123',
you can repeat either representation as many times as you want, like
* [0-9][0-9] or \d\d matches two-digit numbers from 00 to 99.
* [0-9][0-9][0-9] or \d\d\d matches three-digit numbers from 000 to 999.

However, the above approach gets redundant if you try to match '100000' for example.
In this case, we can specify the number of occurrences of those digits by using 
curly brackets, like:  
* [0-9]{2} or \d{2}, which matches exactly two digits (00 to 99).
* \d{1,3}, which matches one to three digits (for example, 0 to 999).
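
To see how the curly-bracket quantifiers behave, the sketch below tests a few candidate strings. It uses `fullmatch()`, which requires the entire string to match, so that the digit counts are checked exactly; the candidate values are arbitrary.

```python
import re

two_digits = re.compile(r"\d{2}")
one_to_three_digits = re.compile(r"\d{1,3}")

results = {}
for candidate in ["7", "42", "123", "1234"]:
    # fullmatch() succeeds only if the whole candidate fits the pattern
    results[candidate] = (
        bool(two_digits.fullmatch(candidate)),
        bool(one_to_three_digits.fullmatch(candidate)),
    )
    print(candidate, results[candidate])

# 7 (False, True)
# 42 (True, True)
# 123 (False, True)
# 1234 (False, False)
```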

Let's try to extract the year from the given string,

In [0]:
s = re.search(r'\d{4}', str) 
print(s.group()) 

2016


As we discussed before, the `search()` method returns the first match found in the string.
However, if search stops when it finds the first occurrence, what is the point of group?

Regular expressions allow us to not just match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the special parentheses, i.e., ( and ) meta-characters. Any sub-pattern inside a pair of parentheses will be captured as a group.
Let's try to find a pair of words separated by a white space in the following simple
string
```python
    Isaac Newton, Data Scientist
```
The regular expression we are going to use is 
```python
    ([a-zA-Z]+) ([a-zA-Z]+)
```
It uses two groups. One is used to match the first word in the pair and another matches
the second word. Note that there is a white space between the two groups.

In [0]:
m = re.match(r"([a-zA-Z]+) ([a-zA-Z]+)", "Isaac Newton, Data Scientist")
print(m.group(0) + "\n" + m.group(1)  + "\n" + m.group(2))

Isaac Newton
Isaac
Newton


As you can see, `m.group(0)` returns the entire match. `m.group(1)` returns the match of the first parenthesized subgroup. And `m.group(2)` returns the match of the second parenthesized subgroup.
You can also retrieve the two groups by using the `groups()` method as

In [0]:
m.groups()

('Isaac', 'Newton')

In Python regular expressions, you can also name each group in a regular expression using 
```python
    (?P<name>...)
```
The substring matched by the group is accessible via the symbolic group name 'name'.
For example,

In [0]:
m = re.match(r"(?P<first_name>[a-zA-Z]+) (?P<last_name>[a-zA-Z]+)", "Isaac Newton")
m.groupdict()

{'first_name': 'Isaac', 'last_name': 'Newton'}

### 1.3 More on Regular Expression Syntax
We have shown you how to match words and digits in the previous two sections. Here we would like to list some meta-characters that are used very often in regular expressions:
* \D: Matches characters that are not digits, which is equivalent to [^0-9] or [^\d].
* \w: Matches any alphanumeric character or underscore, which is equivalent to [a-zA-Z0-9_].
* \W: Matches any character that is not alphanumeric or an underscore, which is equivalent to [^a-zA-Z0-9_] or [^\w].
* \s: Matches any whitespace character, which is equivalent to [ \t\n\r\f\v], where \t indicates tabs, \n line feeds, \r carriage returns, \f form feeds and \v vertical tabs.
* \S: Matches any non-whitespace character; which is equivalent to  [^ \t\n\r\f\v].
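
The sketch below applies these character classes to a small made-up string. Note that in Python `\w` also matches the underscore, so `user_id` is returned as a single word.

```python
import re

# Illustrative input mixing a tab, a newline and repeated spaces
line = "user_id:\t42\nfirst name:  Ada"

word_runs = re.findall(r"\w+", line)    # letters, digits and underscores
fields = re.split(r"\s+", line)         # split on any run of whitespace
only_digits = re.sub(r"\D", "", line)   # delete every non-digit character

print(word_runs)     # ['user_id', '42', 'first', 'name', 'Ada']
print(fields)        # ['user_id:', '42', 'first', 'name:', 'Ada']
print(only_digits)   # 42
```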

### 1.4. Raw Strings in Python Regular Expressions
We have been using 'r' in our regular expressions, what does it mean?
It is Python's raw string notation for regular expressions.
It has been used to work around the backslash plague.

In regular expressions the backslash character ('\') is often used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. For example, to match a literal backslash, one has to write '\\\\\\\\' as the regular expression string. This is because the regular expression must be '\\\\', and each backslash must be expressed as '\\\\' inside a regular Python string literal. Let's assume that you would like to find all the LaTeX commands in a given LaTeX file. Those commands always start with a backslash, like '\\usepackage',
'\\section', '\\title', etc. The regular expression without raw string notation is:
```python
    \\\\\\w+
```
Refer to the previous section for the meaning of "\w".
In contrast, one can prefix the string literal with the letter 'r' or 'R' to form a raw string, which tells 
Python not to handle backslashes in any special way, so they reach the regular expression engine unchanged. With the raw string notation, the regular expression above can be simplified to 
```python
    r"\\\w+"
```
Let's try them out:

In [0]:
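# Non-raw pattern: every regex backslash is doubled inside the string literal
m1 = re.match("\\\\\\w+", "\\section")
print (m1.group())

# Raw-string pattern: backslashes reach the regex engine unchanged
m2 = re.match(r"\\\w+", r"\section")
print (m2.group())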

\section
\section


The two lines of matching code above are functionally identical, but the regular expression written in raw string notation is easier to read. Therefore, when writing regular expressions in Python, it is recommended that you use raw strings instead of regular Python strings. 
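
Returning to the LaTeX scenario, the raw-string pattern can be applied to a whole document with `findall()`. The snippet below is a sketch on a made-up LaTeX fragment; the fragment itself is also written as a raw string so that its backslashes are kept literally.

```python
import re

# Illustrative LaTeX fragment (raw triple-quoted string keeps the backslashes)
latex_source = r"""\documentclass{article}
\usepackage{amsmath}
\title{Regular Expressions}
\section{Introduction}
"""

# A literal backslash followed by one or more word characters
commands = re.findall(r"\\\w+", latex_source)
print(commands)   # ['\\documentclass', '\\usepackage', '\\title', '\\section']
```

The doubled backslashes in the printed list are only how Python displays a single backslash inside a string; each command contains one backslash.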

- - -

## Part 2. Case Study (1) - Parsing Dates with Regular Expressions


This section will show you how to parse dates in simple date formats, 
e.g., mm/dd/yyyy and dd/mm/yyyy. You might think that something as conceptually trivial as a date should be an easy job for a regular expression. But it isn’t, for reasons like: 
* The problem of leading zeros: humans are very sloppy with writing dates. Sometimes we omit the leading zeros, and write dates like "1/1/2016" and "1/01/2016". Therefore, should the regular expression for dates allow leading zeros to be omitted?
* Different date delimiters: besides forward slashes, we can also use white spaces, or hyphens to separate day, month and year.
* Matching a given range of numbers: regular expressions don't deal directly with numbers and don't understand the numerical meanings that humans assign to strings of digits. They treat numbers, like 123, as strings of characters displayed as digits, 1, 2, and 3. Therefore, we cannot directly tell a regular expression to match a given range of numbers, for instance months in the range 1 to 12 or days in the range 1 to 31.

Therefore, you have to choose how simple or how accurate you want your regular expression to be.
If you already know your text doesn’t contain any invalid dates, you could use a trivial regex such as
```python
    r"\d{2}/\d{2}/\d{4}"
```
The fact that this matches things like 00/00/0000 is irrelevant if those don’t occur in your text.
In most cases, you won't know whether your text has invalid dates or not. 

So given that a basic date is day, month and year, and are all digits, which of the three is easiest to parse with regular expressions?
Give month a try. First define our own method 'month' which accepts a pattern and a month (both text) as arguments and reports if there is a match:

In [0]:
def month(pattern, m):
    if re.match(pattern, m):
        print (m + " is a month")
    else:
        print (m + " is NOT a month")

It seems trivial to write a regular expression to match the 12 months from 1 to 12, with or without 
leading zeros. 
Let's first assume that all months are represented by two digits. 
In other words, we prepend a zero if the month is between January and September. 
The simplest regular expression one can think of is
```python
    r"\d\d"
```
Try it out,

In [0]:
month(r'\d\d', "12")
month(r"\d\d", "03")
month(r"\d\d", "00")
month(r"\d\d", "13")
month(r"\d\d", "3")

12 is a month
03 is a month
00 is a month
13 is a month
3 is NOT a month


The regular expression we used matches exactly two-digit numbers from 00 to 99. 
Although it can match all the months represented by two digits, the problems are that 
* It cannot match months represented by a single digit, e.g., 1 (January), 2 (February), etc.
* It matches numbers that do not represent any month. 
  So one does need to validate the given number to make sure it is in the right range.

Tackling the first problem, we can use curly brackets `{m,n}` (written without a space) to specify the minimum and maximum occurrences of digits. A month has at least one digit and at most two digits. 
So the regular expression should look like
```python
    r"\d{1,2}"
```

In [0]:
month(r"\d{1,2}", "03") 
month(r"\d{1,2}", "3") 
month(r"\d{1,2}", "00") 
month(r"\d{1,2}", "0")
month(r"\d{1,2}", "13")

03 is a month
3 is a month
00 is a month
0 is a month
13 is a month


However, this regular expression still matches invalid months, such as "00", "0" and "13".
The months must be restricted to numbers between 1 and 12.
We use alternation inside a group to match various pairs of digits to form a range of one- or two-digit numbers.

In [0]:
month(r"([1-9]|1[0-2])", "3") 
month(r"([1-9]|1[0-2])", "0") 
month(r"([1-9]|1[0-2])", "01") 

3 is a month
0 is NOT a month
01 is NOT a month


In the above code `[1-9]` matches months that can be represented by a single digit, and `1[0-2]` matches October, November and December. Let's further update the regular expression to allow leading zeros by adding `0?`:

In [0]:
month(r"(0?[1-9]|1[0-2])", "03")

03 is a month


It seems that we have constructed a regular expression that can handle months represented by either one- or two-digit numbers. But sooner or later you will find the following problem: 

In [0]:
month(r"(0?[1-9]|1[0-2])", "13") 
month(r"(0?[1-9]|1[0-2])", "99") 

13 is a month
99 is a month


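Why? `re.match()` only requires the pattern to match at the beginning of the string, not the whole string. The alternative `0?[1-9]` is already satisfied by the first character of "13" and of "99", so both are reported as months.

Some of these patterns seem right but don't always work. 
Regular expressions are quite specific, like mini programs.
You have to get them right and then they will very effectively block everything that doesn't match.
We very specifically say what we want, as opposed to listing all the exceptions we don't want.
Which is easier?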
For example testing all exceptions, case by case:
* is the input empty (and this in itself is trouble, one space or two? ' ', tab, CR, LF etc.)
* is the input the correct type (character, number etc.)
* the correct format
* correct range
* positive, negative
* uppercase, lowercase, etc.

Watch out for the difference between greediness & laziness in regular expressions. 
Greediness means matching the longest possible string.
Laziness means matching the shortest possible string. 
Or, put another way, laziness will stop as soon as the condition is satisfied, 
but greediness means it will stop only once the condition is not satisfied any more - this is quite different.
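
The difference is easiest to see on a small example. In the sketch below, `.+` is greedy and `.+?` is its lazy counterpart (adding `?` after a quantifier makes it lazy in Python's `re` module); the HTML-like sample string is made up.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last '>' in the string
greedy = re.findall(r"<.+>", html)
# Lazy: .+? stops at the first '>' it can
lazy = re.findall(r"<.+?>", html)

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```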

Consider also the start-of-string and end-of-string anchors. The caret ^ matches the position before the first character in the string. Applying "^a" to "abc" matches "a". "^b" does not match "abc" at all, because the b cannot be matched right after the start of the string, matched by ^. Similarly, $ matches right after the last character in the string.
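
The behaviour of the two anchors can be checked with a few one-liners. The sketch below also shows `re.fullmatch()`, an alternative in the `re` module that requires the whole string to match without writing `^` and `$` explicitly; the variable names are illustrative.

```python
import re

start_a = re.search(r"^a", "abc")   # matches 'a' at the start
start_b = re.search(r"^b", "abc")   # no match: 'b' is not at the start
end_c = re.search(r"c$", "abc")     # matches 'c' at the end
print(start_a.group(), start_b, end_c.group())   # a None c

month_pattern = r"0?[1-9]|1[0-2]"
prefix_only = re.match(month_pattern, "13")       # matches just the leading '1'
whole_string = re.fullmatch(month_pattern, "13")  # the whole string must fit
print(prefix_only.group(), whole_string)          # 1 None
```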

So here's what we want:
```python
    r"^(0?[1-9]|1[0-2])$" 
```
Let's test it now:

In [0]:
pattern =  "^(0?[1-9]|1[0-2])$" 

In [0]:
month(pattern,"03") 
month(pattern,"0")
month(pattern,"033")
month(pattern,"003")
month(pattern,"99")
month(pattern,"3")

03 is a month
0 is NOT a month
033 is NOT a month
003 is NOT a month
99 is NOT a month
3 is a month


Similarly, you can write regular expressions to validate days. 
We will leave this for you to do as an exercise.
Next, we are going to show you the regular expressions for handling years in the 20th and 21st centuries. 
These years are between 1900 and 2099.
The first two digits are either 19 or 20, which can be captured by a group alternating between these two numbers
```python
    (19|20)
```
Each of the last two digits is between 0 and 9, which can be easily captured by
```
    \d{2}
```
Put them together and we have
```
    r"(19|20)\d{2}"
```

In [0]:
def year(pattern, m):
    if re.match(pattern, m):
        print (m + " is a year")
    else:
        print (m + " is NOT a year")
        
year(r"(19|20)\d{2}", "1800")
year(r"(19|20)\d{2}", "1900")
year(r"(19|20)\d{2}", "2099")
year(r"(19|20)\d{2}", "2100")

1800 is NOT a year
1900 is a year
2099 is a year
2100 is NOT a year


Dealing with leap years is not trivial. Can one write a regular expression that can distinguish days in February in either leap years or non-leap years? It is easy to write regular expressions to match February 29th regardless of the year. Allowing February 29th only in leap years would require us to spell out all the years that are leap years, and all the years that aren’t. Therefore, it seems that regular expressions are not a good choice here. Handling leap years requires an extra bit of code. Maybe it's better to do it in two stages:
1. Does it look like a date? (use a regex), then
2. Is it a date? (use code, e.g. convert the month to a number, then check that it is > 0 and < 13)

For example, the regular expression we found here:
```python
    r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'
```
matches a date in yyyy-mm-dd format from between 1900-01-01 and 2099-12-31, with a choice of four separators.
However, there are dates that match the regular expression but aren't valid.
For example,

In [0]:
pattern = r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'
year(pattern, "2016-02-31")  # reported as a match, although 31 February does not exist

2016-02-31 is a year

February can never have more than 29 days, in any year. Instead of using regular expressions to validate 
dates, you can also use Python's `datetime` module. If a given date string cannot be converted to a Python `datetime` object, then the date is not valid.

In [0]:
# use python datetime libraries
import datetime as dt
today = "2016-02-31"

# February never has more than 29 days, so this date is invalid
# and strptime() should raise a ValueError
mydt = dt.datetime.strptime(today, '%Y-%m-%d') 

ValueError: ignored
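
The two-stage idea (first check the shape with a regex, then check the calendar with code) can be sketched as follows. The helper name `is_valid_date` and the restriction to the '-' separator are assumptions made for this illustration, not part of the original material.

```python
import re
from datetime import datetime

# Stage 1 pattern: yyyy-mm-dd shape with years 1900-2099 ('-' separator only)
DATE_SHAPE = re.compile(r"(19|20)\d\d-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])")

def is_valid_date(candidate):
    """Hypothetical helper: regex checks the shape, datetime checks the calendar."""
    if not DATE_SHAPE.fullmatch(candidate):
        return False                      # does not even look like a date
    try:
        datetime.strptime(candidate, "%Y-%m-%d")
    except ValueError:
        return False                      # looks like a date but does not exist
    return True

print(is_valid_date("2016-02-29"))   # True  (2016 is a leap year)
print(is_valid_date("2016-02-31"))   # False
print(is_valid_date("2015-02-29"))   # False (2015 is not a leap year)
print(is_valid_date("16-2-1"))       # False (wrong shape)
```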

Therefore, 
if you get regular expressions right, they can be very useful as anything that doesn't match the pattern will get blocked. However, getting them wrong will result in many problems.
- - -

## Part 3. Case Study (2) - Extract IPs, dates, and email address with regular expressions

In the following tasks we will use the mail box data ([mbox-short.txt](http://www.pythonlearn.com/code3/mbox-short.txt)) provided by the book [Python for Informatics: Exploring Information](http://www.pythonlearn.com/book.php#python-for-informatics). 



In [0]:
!pip install wget



In [0]:
import wget

link_to_data = 'https://github.com/tulip-lab/sit742/raw/master/Jupyter/data/mbox-short.txt'

DataSet = wget.download(link_to_data)

!ls

 mbox-short.txt      'reuters_2 (1).txt'  'stopwords_en (1).txt'
'reuters_1 (1).txt'  'reuters_2 (2).txt'   stopwords_en.txt
'reuters_1 (2).txt'   reuters_2.txt
 reuters_1.txt	      sample_data


In [0]:
with open('mbox-short.txt','r') as infile:
    text = infile.read()

### 3.1 Find IP addresses 

In this task we will need to 
1. find all IP addresses in the mbox-short dataset.
2. print unique IP addresses 

Let's have a try first: 

In [0]:
str1 = re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', "This is a IP address 111.23.39.99")
str1

['111.23.39.99']

![](https://github.com/tulip-lab/sit742/raw/master/Jupyter/image/regeximg3.png)

From https://regexper.com/

In [0]:
str1= re.findall(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', text)
if len(str1)>0:
    print(str1)

['141.211.14.90', '141.211.14.79', '194.35.219.184', '127.0.0.1', '194.35.219.182', '134.68.220.122', '127.0.0.1', '8.12.11.200', '8.12.11.200', '141.211.14.97', '141.211.93.149', '194.35.219.184', '127.0.0.1', '194.35.219.182', '134.68.220.122', '127.0.0.1', '8.12.11.200', '8.12.11.200', '141.211.14.25', '141.211.93.144', '194.35.219.184', '127.0.0.1', '194.35.219.182', '134.68.220.122', '127.0.0.1', '8.12.11.200', '8.12.11.200', '141.211.14.25', '141.211.14.43', '194.35.219.184', '127.0.0.1', '194.35.219.182', '134.68.220.122', '127.0.0.1', '8.12.11.200', '8.12.11.200', '141.211.14.46', '141.211.14.83', '194.35.219.184', '127.0.0.1', '194.35.219.182', '134.68.220.122', '127.0.0.1', '8.12.11.200', '8.12.11.200', '141.211.14.93', '141.211.93.142', '194.35.219.184', '127.0.0.1', '194.35.219.182', '134.68.220.122', '127.0.0.1', '8.12.11.200', '8.12.11.200', '141.211.14.46', '141.211.14.72', '194.35.219.184', '127.0.0.1', '194.35.219.182', '134.68.220.122', '127.0.0.1', '8.12.11.200', '8.

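Running the code above prints every IP address in the file, including duplicates.

Next, can we keep only the unique IP addresses? The whole file has already been read into `text`, so we apply `re.findall` to it again and pass the result to `set()`, which keeps one copy of each value. Note that `set()` returns a set rather than a list; wrap it in `list()` or `sorted()` if a list is needed.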

In [0]:
str1=re.findall(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', text)
set(str1)

{'127.0.0.1',
 '134.68.220.122',
 '141.211.14.25',
 '141.211.14.34',
 '141.211.14.36',
 '141.211.14.39',
 '141.211.14.43',
 '141.211.14.46',
 '141.211.14.58',
 '141.211.14.72',
 '141.211.14.76',
 '141.211.14.79',
 '141.211.14.83',
 '141.211.14.84',
 '141.211.14.90',
 '141.211.14.91',
 '141.211.14.92',
 '141.211.14.93',
 '141.211.14.97',
 '141.211.14.98',
 '141.211.93.141',
 '141.211.93.142',
 '141.211.93.143',
 '141.211.93.144',
 '141.211.93.145',
 '141.211.93.149',
 '141.211.93.151',
 '141.211.93.152',
 '141.211.93.153',
 '194.35.219.182',
 '194.35.219.184',
 '8.12.11.200'}
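
The pattern `[0-9]{1,3}` accepts any one to three digits, so a string such as `999.1.1.1` would also be reported as an IP address. One possible way to tighten this, sketched below, is to let the regular expression find candidates and then let the standard-library `ipaddress` module reject impossible ones. The `sample` string and the helper name `is_valid_ipv4` are assumptions made for this illustration.

```python
import re
import ipaddress

# Hypothetical sample text (an assumption, not from mbox-short.txt)
sample = "relay 141.211.14.90, bogus 999.1.1.1, localhost 127.0.0.1, relay 141.211.14.90"

# Step 1: the regular expression finds everything shaped like an IPv4 address
candidates = re.findall(r'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b', sample)

def is_valid_ipv4(candidate):
    """Return True if every octet is in the range 0-255 (hypothetical helper)."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ipaddress.AddressValueError:
        return False

# Step 2: keep unique, valid addresses and sort them numerically
unique_valid = sorted({ip for ip in candidates if is_valid_ipv4(ip)},
                      key=ipaddress.IPv4Address)
print(unique_valid)  # ['127.0.0.1', '141.211.14.90']
```

Splitting the work this way keeps the regular expression short; encoding the 0-255 range directly in the pattern is possible but much harder to read.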

### 3.2 Extract all dates and times


In the next task, we extract every timestamp (date and time) from the file. For now, we assume that every string matching the pattern is a valid date and time.



In [0]:
str1=re.findall(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', text)
set(str1)

{'2007-09-12 16:17:59',
 '2007-12-12 21:40:33',
 '2007-12-17 17:11:08',
 '2007-12-20 15:25:38',
 '2007-12-20 21:26:28',
 '2007-12-28 23:44:24',
 '2008-01-03 16:22:14',
 '2008-01-03 16:27:29',
 '2008-01-03 16:33:02',
 '2008-01-03 17:05:11',
 '2008-01-03 17:16:39',
 '2008-01-03 19:23:46',
 '2008-01-04 04:05:43',
 '2008-01-04 04:31:35',
 '2008-01-04 04:47:16',
 '2008-01-04 06:05:51',
 '2008-01-04 07:00:10',
 '2008-01-04 09:02:54',
 '2008-01-04 10:01:40',
 '2008-01-04 10:15:54',
 '2008-01-04 10:37:04',
 '2008-01-04 11:08:38',
 '2008-01-04 11:09:12',
 '2008-01-04 11:10:04',
 '2008-01-04 11:11:00',
 '2008-01-04 11:33:05',
 '2008-01-04 11:35:25',
 '2008-01-04 13:05:51',
 '2008-01-04 14:48:37',
 '2008-01-04 15:01:37',
 '2008-01-04 15:44:39',
 '2008-01-04 16:09:01',
 '2008-01-04 18:08:50',
 '2008-01-05 09:12:07'}
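
The pattern only checks the shape of a timestamp, so it would also accept something like `2008-13-45 25:61:61`. If validity matters, one option is to pass each match to `datetime.strptime`, which raises `ValueError` for impossible dates or times. The sketch below uses a made-up `sample` string rather than the dataset.

```python
import re
from datetime import datetime

# Hypothetical sample text (an assumption, not from mbox-short.txt)
sample = "Date: 2008-01-05 09:12:07 -0500; bogus stamp 2008-13-45 25:61:61"

# Step 1: find everything shaped like 'yyyy-mm-dd hh:mm:ss'
candidates = re.findall(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', sample)

# Step 2: keep only the candidates that parse as real dates and times
valid = []
for stamp in candidates:
    try:
        valid.append(datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S'))
    except ValueError:
        pass  # shaped like a timestamp, but not a real date or time

print(candidates)  # ['2008-01-05 09:12:07', '2008-13-45 25:61:61']
print(valid)       # [datetime.datetime(2008, 1, 5, 9, 12, 7)]
```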

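From each extracted datetime string, we can also pull out the date and hour by using a nested group. When a pattern contains more than one group, `re.findall` returns one tuple per match: here the outer group holds the full timestamp and the inner group holds the date and hour.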

In [0]:
str2=re.findall(r'((\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2})', text)
set(str2)

{('2007-09-12 16:17:59', '2007-09-12 16'),
 ('2007-12-12 21:40:33', '2007-12-12 21'),
 ('2007-12-17 17:11:08', '2007-12-17 17'),
 ('2007-12-20 15:25:38', '2007-12-20 15'),
 ('2007-12-20 21:26:28', '2007-12-20 21'),
 ('2007-12-28 23:44:24', '2007-12-28 23'),
 ('2008-01-03 16:22:14', '2008-01-03 16'),
 ('2008-01-03 16:27:29', '2008-01-03 16'),
 ('2008-01-03 16:33:02', '2008-01-03 16'),
 ('2008-01-03 17:05:11', '2008-01-03 17'),
 ('2008-01-03 17:16:39', '2008-01-03 17'),
 ('2008-01-03 19:23:46', '2008-01-03 19'),
 ('2008-01-04 04:05:43', '2008-01-04 04'),
 ('2008-01-04 04:31:35', '2008-01-04 04'),
 ('2008-01-04 04:47:16', '2008-01-04 04'),
 ('2008-01-04 06:05:51', '2008-01-04 06'),
 ('2008-01-04 07:00:10', '2008-01-04 07'),
 ('2008-01-04 09:02:54', '2008-01-04 09'),
 ('2008-01-04 10:01:40', '2008-01-04 10'),
 ('2008-01-04 10:15:54', '2008-01-04 10'),
 ('2008-01-04 10:37:04', '2008-01-04 10'),
 ('2008-01-04 11:08:38', '2008-01-04 11'),
 ('2008-01-04 11:09:12', '2008-01-04 11'),
 ('2008-01-
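
As an alternative to counting parentheses in nested groups, the parts of a timestamp can be given names with `(?P<name>...)`. The sketch below, run on a few made-up lines, uses named groups with `re.finditer` and counts how many timestamps fall in each hour. The group names `date` and `hour` and the `sample` text are assumptions for this illustration.

```python
import re
from collections import Counter

# Hypothetical sample text (an assumption, not from mbox-short.txt)
sample = (
    "2008-01-04 11:08:38\n"
    "2008-01-04 11:09:12\n"
    "2008-01-04 16:09:01\n"
)

# Named groups make each part of the match accessible by name
pattern = re.compile(r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2}')

# Count timestamps per (date, hour) pair
per_hour = Counter(
    (m.group('date'), m.group('hour')) for m in pattern.finditer(sample)
)

for (date, hour), count in sorted(per_hour.items()):
    print(date, hour, count)
# 2008-01-04 11 2
# 2008-01-04 16 1
```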

### 3.3 Extract author's email address


There are many email addresses in the file. We would like to extract only the authors' email addresses, which normally appear in this format:

"Author: stephen.marquard@uct.ac.za"

Now let's see whether we can use the following regular expression:
```python
r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
```
which was copied from http://emailregex.com/

Does it work for this task?

In [0]:
str1=re.findall(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+", text)
set(str1)

{'200801032122.m03LMFo4005148@nakamura.uits.iupui.edu',
 '200801032127.m03LRUqH005177@nakamura.uits.iupui.edu',
 '200801032133.m03LX3gG005191@nakamura.uits.iupui.edu',
 '200801032205.m03M5Ea7005273@nakamura.uits.iupui.edu',
 '200801032216.m03MGhDa005292@nakamura.uits.iupui.edu',
 '200801040023.m040NpCc005473@nakamura.uits.iupui.edu',
 '200801040905.m0495rWB006420@nakamura.uits.iupui.edu',
 '200801040932.m049W2i5006493@nakamura.uits.iupui.edu',
 '200801040947.m049lUxo006517@nakamura.uits.iupui.edu',
 '200801041106.m04B6lK3006677@nakamura.uits.iupui.edu',
 '200801041200.m04C0gfK006793@nakamura.uits.iupui.edu',
 '200801041403.m04E3psW006926@nakamura.uits.iupui.edu',
 '200801041502.m04F21Jo007031@nakamura.uits.iupui.edu',
 '200801041515.m04FFv42007050@nakamura.uits.iupui.edu',
 '200801041537.m04Fb6Ci007092@nakamura.uits.iupui.edu',
 '200801041608.m04G8d7w007184@nakamura.uits.iupui.edu',
 '200801041609.m04G9EuX007197@nakamura.uits.iupui.edu',
 '200801041610.m04GA5KP007209@nakamura.uits.iupu

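The generic pattern also matches message IDs, which are not what we want. What if we only want the email address that follows `Author:`? We can add the literal prefix to the pattern and put the address in a capture group, so that `re.findall` returns only the captured part.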

In [0]:
str1=re.findall(r'Author: ([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)', text)
str1

['stephen.marquard@uct.ac.za',
 'louis@media.berkeley.edu',
 'zqian@umich.edu',
 'rjlowe@iupui.edu',
 'zqian@umich.edu',
 'rjlowe@iupui.edu',
 'cwen@iupui.edu',
 'cwen@iupui.edu',
 'gsilver@umich.edu',
 'gsilver@umich.edu',
 'zqian@umich.edu',
 'gsilver@umich.edu',
 'wagnermr@iupui.edu',
 'zqian@umich.edu',
 'antranig@caret.cam.ac.uk',
 'gopal.ramasammycook@gmail.com',
 'david.horwitz@uct.ac.za',
 'david.horwitz@uct.ac.za',
 'david.horwitz@uct.ac.za',
 'david.horwitz@uct.ac.za',
 'stephen.marquard@uct.ac.za',
 'louis@media.berkeley.edu',
 'louis@media.berkeley.edu',
 'ray@media.berkeley.edu',
 'cwen@iupui.edu',
 'cwen@iupui.edu',
 'cwen@iupui.edu']
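
The list above contains repeated authors. A natural follow-up, sketched here on a few hand-written lines that imitate the file format, is to count how often each author appears with `collections.Counter`. The `sample` text and the use of `^`/`$` with `re.MULTILINE` to anchor the match to whole lines are choices made for this illustration, not part of the original solution.

```python
import re
from collections import Counter

# Hypothetical lines imitating the mbox format (an assumption, not the real file)
sample = """From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Author: stephen.marquard@uct.ac.za
From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008
Author: louis@media.berkeley.edu
Author: stephen.marquard@uct.ac.za
"""

# re.MULTILINE lets ^ and $ match at the start and end of every line
author_pattern = re.compile(
    r'^Author: ([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+)$',
    re.MULTILINE,
)

author_counts = Counter(author_pattern.findall(sample))
print(author_counts.most_common())
# [('stephen.marquard@uct.ac.za', 2), ('louis@media.berkeley.edu', 1)]
```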

## Part 4. Summary

This chapter has discussed the fundamentals of regular expressions.
You have learnt how to develop regular expressions for handling street addresses, Roman numerals, phone numbers, and dates. Regular expressions may not be suitable for dealing with complex XML files, but the basic idea that a regular expression can filter out anything that doesn't match is both useful and powerful. You might use regular expressions, for example, as the basis for a short program that filters incoming spam in your email: the program could use a regular expression to check whether the name of a known spammer appears in the "From:" line of a message. Email filtering programs very often use regular expressions for exactly this type of operation.

You will learn more about regular expressions in Module 6 on text pre-processing. It is worth noting that regular expressions are not specific to Python; many other programming languages, such as Perl, Java and Ruby, also provide regular expression capabilities. Finally, do you think it's possible to do data wrangling without regular expressions?
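
The spam-filter idea mentioned in this summary can be sketched in a few lines. This is only an illustration: the addresses in `known_spammers` and the function name `looks_like_spam` are hypothetical, and a real filter would need far more than a single pattern.

```python
import re

# Hypothetical list of known spammer addresses (an assumption for illustration)
known_spammers = ['spammer@example.com', 'bulk-offers@example.net']

# Match any known address, but only on a line that starts with "From:"
spam_pattern = re.compile(
    r'^From:.*\b(?:' + '|'.join(re.escape(s) for s in known_spammers) + r')\b',
    re.MULTILINE | re.IGNORECASE,
)

def looks_like_spam(message):
    """Return True if a known spammer appears in the From: line."""
    return spam_pattern.search(message) is not None

print(looks_like_spam("From: Cheap Deals <Spammer@Example.com>\nSubject: hi\n"))       # True
print(looks_like_spam("From: zqian@umich.edu\nSubject: spammer@example.com\n"))        # False
```

`re.escape` is used so that the dots in the addresses are matched literally rather than as "any character".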

---

Watch the Software Carpentry lecture on regular expressions if you need more help.

https://www.youtube.com/playlist?list=PL7C1EB31127AB8A0B

- - -

## Part 5. Exercises

1.  Write a regular expression to match negative numbers and real-valued numbers, such as -1,023, -10.00, 10.393, and 0.234.
2.  Write and test regular expressions for validating days. 
    1. Just worry about 1 to 31 by assuming all the months can have 31 days.
    2. Deal with 29, 30 or 31 days, where we distinguish months having 30 days from those having 31 days, and assume February always has 29 days by treating every year as a leap year.
3.  Write regular expressions that can handle the date format dd/mm/yyyy, deal with 29, 30, and 31 days (as in Exercise 2), 
    and make the matched day, month and year accessible by group name.
- - -

<details><summary><u><b><font color="Blue">Click here for the solution to Exercise</font></b></u></summary>
```python
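# Question 1
import re

test_string = 'HIHIDJOPJ PO 1.22222 -1023 -1,023 -10.00 10.393 0.234 Hi HiHDIHDIQWHDIQWO #@#$!#!@'
# Optional sign, then either comma-grouped thousands (e.g. 1,023) or plain digits,
# then an optional fractional part.
stringresult = re.findall(r"[-+]?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?", test_string)
stringresult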
```

```python
# Question 2
import re

def validdays(day):
    # Year 1000-9999, month 1-12, day 1-31, separated by '-', ' ', '/' or '.'
    pattern = r"^(1[0-9]|[2-9][0-9])\d\d[- /.](0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])$"
    shortmonth = [4, 6, 9, 11]
    if not re.match(pattern, day):
        print(day + " is not a valid day format")
        return
    # The pattern matched, so the string splits into exactly year, month, day
    daylist = re.split(r'[- /.]', day)
    month, dayofmonth = int(daylist[1]), int(daylist[2])
    if month in shortmonth and dayofmonth > 30:
        print(day + " is not a valid day format")
    elif month == 2 and dayofmonth > 29:
        print(day + " is not a valid day format")
    else:
        print(day + " is a valid day format")

# Test cases
validdays('2012-4-31')
validdays('2012-2-30')
validdays('2012-1-32')
validdays('2012-1-1')
validdays('2012-01-01')
validdays('2012-01-1')
validdays('2012-1-01')
validdays('2999-2-22')
validdays('1000-2-22')
validdays('9999-2-22')
validdays('9999.2.22')
validdays('9999/2/22')
validdays('9999 2 22')
```
    
```python
# Question 3
import re

def validdays(day):
    # Named groups make the day, month and year accessible by name
    pattern = (r"^(?P<day>0?[1-9]|[12][0-9]|3[01])/"
               r"(?P<month>0?[1-9]|1[012])/"
               r"(?P<year>(?:1[0-9]|[2-9][0-9])\d\d)$")
    processedday = re.match(pattern, day)
    shortmonth = [4, 6, 9, 11]

    if processedday is None:
        print(day + " does not match the dd/mm/yyyy format")
        return
    dayofmonth = int(processedday.group('day'))
    month = int(processedday.group('month'))
    if month in shortmonth and dayofmonth > 30:
        print(day + " is not a valid day format")
    elif month == 2 and dayofmonth > 29:
        print(day + " is not a valid day format")
    else:
        print(day + " is a valid day format")

# Test cases
validdays('31/04/2012')
validdays('31/4/2012')
validdays('30/2/2012')
validdays('31-12-2012')
validdays('1/1/1111')
```