In [1]:
import pandas as pd
df = pd.read_csv('Regex_exercise.csv')
df = df.head(6)

In [2]:
df

Unnamed: 0,Text,Telephone,email,html,filename,Trim_phrase,extract info
0,3.14529,415-555-1234,tom@hogwarts.com,<a>This is a link</a>,workspace.doc,The quick brown fox...,W/dalvikvm( 1553): threadid=1: uncaught exception
1,-255.34,650-555-2345,tom.riddle@hogwarts.com,<a href='https://regexone.com'>Link</a>,img0912.jpg,jumps over the lazy dog.,E/( 1553): FATAL EXCEPTION: main
2,128,(416)555-3456,tom.riddle+regexone@hogwarts.com,<div class='test_style'>Test</div>,updated_img0912.png,,E/( 1553): java.lang.StringIndexOutOfBoundsExc...
3,1.90E+10,202 555 4567,tom@hogwarts.eu.com,<div>Hello <span>world</span></div>,documentation.html,,E/( 1553): at widget.List.makeView(ListView.ja...
4,"123,340.00",4035555678,potter@hogwarts.com,,favicon.gif,,E/( 1553): at widget.List.fillDown(ListView.ja...
5,720p,1 416 555 9292,harry@hogwarts.com,,img0912.jpg.tmp,,E/( 1553): at widget.List.fillFrom(ListView.ja...


#### The purpose of this exercise is to become familiar with regular expressions, which are often used in the data wrangling process.

##### Problem 1: Matching decimal numbers
At first glance, writing a regular expression to match a number should be easy, right?

We have the \d special character to match any digit, and all we need to do is match the decimal point, right? For simple numbers, that may be right, but when working with scientific or financial numbers, you often have to deal with positive and negative numbers, significant digits, exponents, and even different representations (like the comma used to separate thousands and millions).

Below are a few different formats of numbers that you might encounter. Notice how you will have to match the decimal point itself and not an arbitrary character using the dot metacharacter. If you are having trouble skipping the last number, notice how that number ends the line compared to the rest.

In [3]:
df['Text']  # 720p is not a decimal number

0       3.14529
1       -255.34
2           128
3      1.90E+10
4    123,340.00
5          720p
Name: Text, dtype: object

In [4]:
df.Text.str.extract(r'(^-?\d+(,\d+)*(\.\d+(E\d+)?)?$)')

Unnamed: 0,0,1,2,3
0,3.14529,,.14529,
1,-255.34,,.34,
2,128,,,
3,,,,
4,"123,340.00",",340",.00,
5,,,,
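
Row 3 comes back empty because the pattern requires digits immediately after `E`, so the signed exponent in `1.90E+10` is rejected. The following is a minimal sketch of one way to accept it, using the standard `re` module and assuming that an optional sign and either case of `e` are acceptable in the exponent. The names `DECIMAL` and `is_decimal` are illustrative and not part of the notebook.

```python
import re

# Assumption: the exponent may carry an optional sign and use 'e' or 'E'.
# Non-capturing groups (?:...) keep the pattern free of unused captures.
DECIMAL = re.compile(r'^-?\d+(?:,\d+)*(?:\.\d+(?:[eE][+-]?\d+)?)?$')

def is_decimal(text):
    """Return True if the whole string looks like one of the number formats."""
    return DECIMAL.match(text) is not None

samples = ['3.14529', '-255.34', '128', '1.90E+10', '123,340.00', '720p']
results = {s: is_decimal(s) for s in samples}
print(results)
```

The same pattern string could be passed to `Series.str.extract` after wrapping it in one capturing group.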


##### Problem 2: Matching phone numbers
Validating phone numbers is another tricky task, depending on the type of input that you get. Out-of-state numbers that require an area code and international numbers that require a prefix add complexity to the regular expression, as do the individual preferences that people have for entering phone numbers (some put dashes or whitespace while others do not, for example).

Below are a few phone numbers that you might encounter when using real data. Write a single regular expression that matches each number and captures the proper area code.

In [5]:
df['Telephone']

0      415-555-1234
1      650-555-2345
2     (416)555-3456
3      202 555 4567
4        4035555678
5    1 416 555 9292
Name: Telephone, dtype: object

In [6]:
df.Telephone.str.extract(r'([0-9]?\(?(\d{3})\)?[\s-]?\d{3}[\s-]?\d{4})')

Unnamed: 0,0,1
0,415-555-1234,415
1,650-555-2345,650
2,(416)555-3456,416
3,202 555 4567,202
4,4035555678,403
5,416 555 9292,416
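
In the last row the leading `1` is dropped from the full match, because nothing in the pattern consumes the space that follows the prefix. A possible variant is sketched below with the standard `re` module, assuming a one-digit country prefix that may be followed by a space or dash. The names `PHONE` and `area_code` are illustrative only.

```python
import re

# Assumption: an optional one-digit prefix, optionally followed by a space or dash.
# Anchoring with ^ and $ forces the whole string to be a phone number.
PHONE = re.compile(
    r'^(?:\d[\s-]?)?\(?(?P<area>\d{3})\)?[\s-]?\d{3}[\s-]?\d{4}$'
)

def area_code(number):
    """Return the three-digit area code, or None if the string does not match."""
    m = PHONE.match(number)
    return m.group('area') if m else None

numbers = ['415-555-1234', '650-555-2345', '(416)555-3456',
           '202 555 4567', '4035555678', '1 416 555 9292']
codes = [area_code(n) for n in numbers]
print(codes)
```

The named group `area` makes the capture self-describing, which helps once a pattern has several groups.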


##### Problem 3: Matching emails
When you are dealing with HTML forms, it's often useful to validate the form input against regular expressions. In particular, emails are difficult to match correctly due to the complexity of the specification and I would recommend using a built-in language or framework function instead of rolling your own. However, you can build a pretty robust regular expression that matches a great deal of common emails pretty easily using what we've learned so far.

One thing to watch out for is that many people use plus addressing for one-time use, such as "name+filter@gmail.com", which is delivered to "name@gmail.com" but can be filtered using the extra information. In addition, some domains have more than one component. For example, you can register a domain at "hellokitty.hk.com" and have an email of the form "ilove@hellokitty.hk.com", so you will have to be careful when matching the domain portion of the email.

Below are a few common emails. In this example, try to capture the name portion of each email, excluding the filter (the + character and everything after it) and the domain (the @ character and everything after it).

In [7]:
df['email']

0                    tom@hogwarts.com
1             tom.riddle@hogwarts.com
2    tom.riddle+regexone@hogwarts.com
3                 tom@hogwarts.eu.com
4                 potter@hogwarts.com
5                  harry@hogwarts.com
Name: email, dtype: object

In [8]:
df.email.str.extract(r'(^([\w\.]*))')

Unnamed: 0,0,1
0,tom,tom
1,tom.riddle,tom.riddle
2,tom.riddle,tom.riddle
3,tom,tom
4,potter,potter
5,harry,harry
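
To also pull out the filter and the multi-component domain discussed above, the pattern can be extended with named groups. This is a minimal sketch with the standard `re` module. It assumes the name and filter contain only word characters and dots, which is far narrower than the email specification allows. `EMAIL` and `split_email` are illustrative names.

```python
import re

# Assumption: name and filter are word characters and dots only;
# the domain is two or more dot-separated labels.
EMAIL = re.compile(
    r'^(?P<name>[\w.]+)'
    r'(?:\+(?P<filter>[\w.]+))?'
    r'@(?P<domain>[\w-]+(?:\.[\w-]+)+)$'
)

def split_email(address):
    """Return a dict with name, filter and domain, or None if no match."""
    m = EMAIL.match(address)
    return m.groupdict() if m else None

parts = split_email('tom.riddle+regexone@hogwarts.com')
multi = split_email('tom@hogwarts.eu.com')
print(parts)
print(multi)
```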


##### Problem 4: Matching HTML
If you are looking for a robust way to parse HTML, regular expressions are usually not the answer due to the fragility of HTML pages on the internet today. Common mistakes like missing end tags, mismatched tags, or forgetting to close an attribute quote would all derail a perfectly good regular expression. Instead, you can use libraries like Beautiful Soup or html5lib (both Python) or phpQuery (PHP), which not only parse the HTML but also allow you to walk the DOM quickly and easily.

That said, there are often times when you want to quickly match tags and tag content in an editor, and if you can vouch for the input, regular expressions are a good tool for this. As you can see in the examples below, some things that you might want to be careful about are odd attributes that have extra escaped quotes, and nested tags.

Go ahead and write regular expressions for the following examples.

In [9]:
df['html']

0                      <a>This is a link</a>
1    <a href='https://regexone.com'>Link</a>
2         <div class='test_style'>Test</div>
3        <div>Hello <span>world</span></div>
4                                        NaN
5                                        NaN
Name: html, dtype: object

In [10]:
df.html.str.extract('<([a-z]+)')

Unnamed: 0,0
0,a
1,a
2,div
3,div
4,
5,
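
For comparison with the parser-based approach recommended above, here is a minimal sketch that uses `html.parser` from the Python standard library (chosen here in place of Beautiful Soup only to keep the example dependency-free). The class `TagCollector` and the helper `collect` are illustrative names, not part of the notebook.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record each opening tag with its attributes, plus all text content."""

    def __init__(self):
        super().__init__()
        self.tags = []   # list of (tag_name, attribute_dict)
        self.text = []   # text fragments in document order

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        self.text.append(data)

def collect(html):
    """Parse an HTML fragment and return (opening tags, joined text)."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.tags, ''.join(parser.text)

nested_tags, nested_text = collect("<div>Hello <span>world</span></div>")
link_tags, link_text = collect("<a href='https://regexone.com'>Link</a>")
print(nested_tags, nested_text)
print(link_tags, link_text)
```

Unlike the regular expression, the parser reports the nested `span` as well as the outer `div`, and it handles quoted attributes without extra pattern work.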


##### Problem 5: Matching specific filenames
If you use Linux or the command line frequently, you are often dealing with lists of files. Most files have a filename component as well as an extension, but in Linux, it is also common to have hidden files that have no filename.

In this simple example, extract the filenames and extension types of only the image files (not including temporary files for images currently being edited). Image files are defined as .jpg, .png, and .gif.

In [11]:
df['filename']

0          workspace.doc
1            img0912.jpg
2    updated_img0912.png
3     documentation.html
4            favicon.gif
5        img0912.jpg.tmp
Name: filename, dtype: object

In [12]:
df.filename.str.extract(r'((\w)+[0-9]?)+\.(jpg|gif|png)$')

Unnamed: 0,0,1,2
0,,,
1,img0912,2,jpg
2,updated_img0912,2,png
3,,,
4,favicon,n,gif
5,,,
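
The middle column above holds only the last character of each name, a side effect of the inner `(\w)+` group. A simpler pattern with two named groups appears to give the same filename and extension without it. This is a sketch with the standard `re` module, assuming filenames consist of word characters only. `IMAGE` and `image_parts` are illustrative names.

```python
import re

# Assumption: the filename part is made of word characters (letters, digits, _).
# The trailing $ rejects temporary files such as 'img0912.jpg.tmp'.
IMAGE = re.compile(r'^(?P<name>\w+)\.(?P<ext>jpg|png|gif)$')

def image_parts(filename):
    """Return (name, extension) for image files, or None otherwise."""
    m = IMAGE.match(filename)
    return (m.group('name'), m.group('ext')) if m else None

files = ['workspace.doc', 'img0912.jpg', 'updated_img0912.png',
         'documentation.html', 'favicon.gif', 'img0912.jpg.tmp']
matches = [image_parts(f) for f in files]
print(matches)
```

Avoiding the nested quantifiers in `((\w)+[0-9]?)+` also removes a source of heavy backtracking on long non-matching names.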


##### Problem 6: Trimming whitespace from start and end of line
Occasionally, you'll find yourself with a log file that has ill-formatted whitespace where lines are indented too much or not enough. One way to fix this is to use an editor's search and replace and a regular expression to extract the content of the lines without the extra whitespace.

We have previously seen how to match a full line of text using the hat ^ and the dollar sign $ respectively. When used in conjunction with the whitespace \s, you can easily skip all preceding and trailing spaces.

Write a simple regular expression to capture the content of each line, without the extra whitespace.

In [13]:
df.Trim_phrase.str.extract(r'^\s*(.*?)\s*$')

Unnamed: 0,0
0,The quick brown fox...
1,jumps over the lazy dog.
2,
3,
4,
5,
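
One subtlety is that a greedy `(.*)` also swallows trailing whitespace, since the final `\s*` is allowed to match nothing. A lazy `(.*?)` leaves the trailing whitespace to `\s*`. The sketch below shows the difference with the standard `re` module on a made-up padded line.

```python
import re

# Made-up sample line with three spaces of padding on each side.
line = '   The quick brown fox...   '

# Greedy capture: .* runs to the end of the line, so \s* matches nothing.
greedy = re.match(r'^\s*(.*)\s*$', line).group(1)

# Lazy capture: .*? stops as early as possible, leaving the padding to \s*.
lazy = re.match(r'^\s*(.*?)\s*$', line).group(1)

print(repr(greedy))
print(repr(lazy))
```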


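##### Problem 7: Extracting information from a log file
In this example, extract the method name, filename, and line number from each stack-trace line of the log.

In [14]: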
df['extract info']

0    W/dalvikvm( 1553): threadid=1: uncaught exception
1                     E/( 1553): FATAL EXCEPTION: main
2    E/( 1553): java.lang.StringIndexOutOfBoundsExc...
3    E/( 1553): at widget.List.makeView(ListView.ja...
4    E/( 1553): at widget.List.fillDown(ListView.ja...
5    E/( 1553): at widget.List.fillFrom(ListView.ja...
Name: extract info, dtype: object

In [15]:
df['extract info'].str.extract(r'(\w+)\(([\w\.]+)\:(\d+)\)')

Unnamed: 0,0,1,2
0,,,
1,,,
2,,,
3,makeView,ListView.java,1727
4,fillDown,ListView.java,652
5,fillFrom,ListView.java,709
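
With three captures, named groups make the result easier to read than positional columns. This is a minimal sketch with the standard `re` module. The sample stack-trace line is hypothetical, assembled to be consistent with the output above because the displayed log lines are truncated. `TRACE` and `parse_trace` are illustrative names.

```python
import re

# Same structure as the pattern above, with a name attached to each capture.
TRACE = re.compile(r'(?P<method>\w+)\((?P<file>[\w.]+):(?P<line>\d+)\)')

def parse_trace(log_line):
    """Return method, file and line number as a dict, or None if absent."""
    m = TRACE.search(log_line)
    return m.groupdict() if m else None

# Hypothetical line, built to agree with the extracted values shown above.
sample = 'E/( 1553): at widget.List.makeView(ListView.java:1727)'
info = parse_trace(sample)

# A line without a "(file:line)" suffix yields no match.
no_info = parse_trace('W/dalvikvm( 1553): threadid=1: uncaught exception')
print(info, no_info)
```

Passing a pattern with named groups to `Series.str.extract` should use the group names as column labels.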


#### Works Cited

Regular Expression: https://regexone.com/