<img src = "images/Right logo Transparent Big.png" width = 300 align = "center">

<h1 align=center><font size = 5>SIMPLE REGULAR EXPRESSIONS and DEBUGGING IN PYTHON</font></h1>

### Table of contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
<p><a href="#ref9001">Loading in Data</a></p>
<p><a href="#funyarinpa">Regular Expressions</a></p>
<p><a href="#imyourpoutine">Regular Expressions in Python</a></p>
<p><a href="#ref1">What is debugging and error handling?</a></p>
<p><a href="#ref2">Error Catching</a></p>
</div>

<a id="ref9001"></a>
<center><h1>Loading in Data</h1></center>

Let's create a small collection of emails to perform some data analysis and take a look at it.

In [1]:
email_dict = {'Name': ['John Doe', 'Jane Doe', 'Susan Quinn', 'Joe Star'],
              'Email': ['doej@example.com', 'jadoe@sample.ca', 'qsus@example.br', 'joes@sample.com']}
email_dict

{'Email': ['doej@example.com',
  'jadoe@sample.ca',
  'qsus@example.br',
  'joes@sample.com'],
 'Name': ['John Doe', 'Jane Doe', 'Susan Quinn', 'Joe Star']}

Our simple dataset contains a list of names and a list of their corresponding emails. Let's say we want to find email addresses and simultaneously segment each address into its 3 components: username, domain name and domain suffix. If we split the string at the '@', we still won't have what we want, since even emails with the same domain might have different regional extensions. So how can we extract the necessary data quickly and easily?
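To see why splitting at the '@' is not enough, here is a minimal sketch using one address from the dataset. The variable names are only illustrative.

```python
# Split one address from the dataset at the '@' symbol.
address = 'jadoe@sample.ca'
username, rest = address.split('@')

print(username)  # prints: jadoe
print(rest)      # prints: sample.ca
```

The second part still glues the domain name and the regional suffix together, so another step would be needed to separate them.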

<a id="funyarinpa"></a>
<center><h1>Regular Expressions</h1></center>

**Regular Expressions** are patterns that describe sets of strings; they are used to search for and match text. To illustrate, we can write a simple expression that matches an email address. But before we write it, let's look at how an email is structured:

<h1 align=center><font size = 3>test@testing.com</font></h1>

So, an email address is composed of a string, followed by an '@' symbol, followed by another string. In Python regular expressions, we can express this as:

<h1 align=center><font size = 3>.+@.+</font></h1>

Where:
* The '.' symbol matches any single character (except a newline, by default).
* The '+' symbol repeats the preceding element one or more times. So, '.+' will match any string with one or more characters.
* The '@' symbol only matches with the '@' character.
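As a quick check of the pattern above, here is a minimal sketch using Python's built-in `re` module, assuming Python 3. The sample strings are made up for illustration and are not part of the dataset.

```python
import re

# The pattern discussed above: one or more characters, an '@',
# then one or more characters.
simple_email = re.compile(r'.+@.+')

# Illustrative sample strings (assumptions, not taken from the dataset).
samples = ['test@testing.com', 'no-at-symbol.com', '@missing-user', 'a@b']

for text in samples:
    # fullmatch() succeeds only if the entire string fits the pattern.
    matched = simple_email.fullmatch(text) is not None
    print(text, '->', matched)
```

Note that the pattern is deliberately loose: it rejects strings with no '@' or with nothing before it, but it also accepts strings such as `a@b` that are not valid email addresses.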

Now, for our problem, which is extracting the domain from an email address while excluding the regional suffix, we need an expression that specifically matches what we want:

<h1 align=center><font size = 3>@.+\\.</font></h1>

Where the '\\.' symbol specifically matches the '.' character: the backslash escapes the dot so that it no longer means "any character".
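The following sketch, assuming Python 3 and the built-in `re` module, shows what this expression actually captures. The second address is a made-up example with a subdomain.

```python
import re

# '@', then one or more characters, then a literal '.' (escaped with a backslash).
domain_pattern = re.compile(r'@.+\.')

# search() scans the string for the first place where the pattern matches.
match = domain_pattern.search('test@testing.com')
print(match.group(0))  # prints: @testing.

# '.+' is greedy, so the match runs up to the LAST '.' in the string.
print(domain_pattern.search('user@mail.example.org').group(0))  # prints: @mail.example.
```

The matched text still includes the '@' and the trailing '.', which is why the Python code in the next section wraps the middle part in a group.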

<a id="imyourpoutine"></a>
<center><h1>Regular Expressions in Python</h1></center>

To practice **Regular Expressions**, we'll use the small table of email addresses created earlier, which covers different domains and regional suffixes. Say we want to analyze the domain in each address. Even addresses with the same domain can end in different suffixes due to regional differences, so we need to extract everything between the '@' and the '.' symbol. But how can we do that when domain names have variable lengths and may contain unusual characters?


In [2]:
import pandas as pd

In [3]:
df = pd.DataFrame(email_dict,columns = ['Name','Email'])
df

|   | Name | Email |
|---|------|-------|
| 0 | John Doe | doej@example.com |
| 1 | Jane Doe | jadoe@sample.ca |
| 2 | Susan Quinn | qsus@example.br |
| 3 | Joe Star | joes@sample.com |


In [4]:
emails = df['Email'].tolist()  # plain Python list of the address strings, no padding
emails

['doej@example.com',
 'jadoe@sample.ca',
 'qsus@example.br',
 'joes@sample.com']

In [5]:
import re

pattern = r'.+@(.+)\.' # the parentheses here specify a group.
                        # we can access the content matched in this group using .group(1) (see below)
regex = re.compile(pattern)

for email in emails:
    result = regex.match(email)
    if result:  # match() returns None when the pattern does not fit
        print(email + '\'s domain is ' + result.group(1))

doej@example.com's domain is example
jadoe@sample.ca's domain is sample
qsus@example.br's domain is example
joes@sample.com's domain is sample
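The original goal was to segment each address into three components: username, domain name and domain suffix. One possible way to do this is sketched below, assuming Python 3. The pattern, the group names and the `split_email` helper are hypothetical and were chosen for this illustration only.

```python
import re

# Hypothetical pattern with three named groups.
# [^@]+  : one or more characters that are not '@'
# [^.@]+ : one or more characters that are neither '.' nor '@'
email_parts = re.compile(
    r'(?P<username>[^@]+)@(?P<domain>[^@]+)\.(?P<suffix>[^.@]+)'
)

addresses = ['doej@example.com', 'jadoe@sample.ca',
             'qsus@example.br', 'joes@sample.com']

def split_email(address):
    """Return a dict with the three components, or None if the address does not fit."""
    match = email_parts.fullmatch(address.strip())  # strip() removes stray whitespace
    return match.groupdict() if match else None

for address in addresses:
    print(split_email(address))
```

For `'doej@example.com'` this yields `{'username': 'doej', 'domain': 'example', 'suffix': 'com'}`. Named groups make the result self-describing, at the cost of a longer pattern.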


<a id='ref1'></a>
<center><h1>What is debugging and error handling?</h1></center>

*What do you get when you try to evaluate **`"a" + 10`**? An error!*

In [7]:
"a" + 10

TypeError: cannot concatenate 'str' and 'int' objects

*And what happens to your code if an error occurs? It halts!*

In [9]:
list1 = (3,5,(3,5))
list1[1] = 'four'

TypeError: 'tuple' object does not support item assignment

These are very simple cases, and the sources of the errors are easy to spot. But when an error is buried in a large chunk of code with many parts, it can be difficult to identify when and where it occurred. The process of identifying the source of a bug and fixing it is called debugging.
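One common aid for locating an error is the traceback, which lists the chain of function calls that led to it. The sketch below, assuming Python 3, uses the standard `traceback` module. The functions `parse_readings` and `average` are hypothetical and exist only to give the error somewhere to hide.

```python
import traceback

def parse_readings(readings):
    """Hypothetical helper: convert each reading to a float."""
    return [float(r) for r in readings]

def average(readings):
    """Hypothetical helper: mean of the parsed readings."""
    values = parse_readings(readings)
    return sum(values) / len(values)

try:
    average(['2.5', '3.5', 'n/a'])  # 'n/a' cannot be converted to a float
except ValueError:
    # format_exc() returns the full traceback of the current exception as a string.
    report = traceback.format_exc()
    print(report)
```

Reading the report from the bottom up shows the error type first and then the function in which it was raised, here `parse_readings`.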


<a id='ref2'></a>
<center><h1>Error Catching</h1></center>

If you know an error may occur, the best way to handle it is to **`catch`** the error when it happens, so that it doesn't halt the script. In Python this is done with a `try`/`except` block.

In [11]:
float('2.6578')

2.6578

In [12]:
float('temperature')

ValueError: could not convert string to float: temperature

In [13]:
def function_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):  # catch only conversion failures
        return x  # fall back to the original value

In [14]:
function_float('3.7667')

3.7667

In [15]:
function_float('atmosphere') 

'atmosphere'
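The same idea can be applied to a whole list of values. The following is a minimal sketch, assuming Python 3. The function name `to_float_or_none` and the sample values are hypothetical. It catches only the two exception types that `float()` raises for bad input and reports each failure instead of hiding it.

```python
def to_float_or_none(value):
    """Hypothetical variant: return a float, or None when conversion fails."""
    try:
        return float(value)
    except (TypeError, ValueError) as error:
        # Report what went wrong instead of silently ignoring it.
        print('skipping', repr(value), '-', type(error).__name__)
        return None

raw = ['3.7667', 'atmosphere', None, '10']
converted = [to_float_or_none(v) for v in raw]
print(converted)  # prints: [3.7667, None, None, 10.0]
```

Naming the exception types keeps unrelated problems, such as a misspelled variable name, from being swallowed by the handler.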

---