## Applications of Regular Expressions

In [1]:
# Import Python's regular expression module
import re

### Parsing URLs

Some of the most informative components of a URL are: 

1. The scheme (like `https`), which indicates whether the connection is secured.
2. The hostname (like `google`).
3. The domain extension (like `.com` or `.edu`).
4. The path (like `/images`).

Many websites insert `www` between the scheme and the hostname; websites hosted as subdomains put the subdomain name (like `ccle` or `sites`) there instead. 

Let's write a regular expression to extract these pieces of information from a URL. Note that there are many ways to parse these data. 

In [2]:
s = """
    Our websites today are 
    https://www.ucla.edu/, 
    http://ccle.ucla.edu/, 
    https://www.google.com/images/, 
    and https://sites.google.com/view/perlmutma/home
    """

In [3]:
# https? matches http or https; the second group captures `www` or a subdomain name
pattern = r"(https?)://(www|[a-z]+)\.([a-z]+)\.([a-z]+)([/A-Za-z0-9]*)"

In [4]:
re.findall(pattern, s)

[('https', 'www', 'ucla', 'edu', '/'),
 ('http', 'ccle', 'ucla', 'edu', '/'),
 ('https', 'www', 'google', 'com', '/images/'),
 ('https', 'sites', 'google', 'com', '/view/perlmutma/home')]

We can now check, for example, whether a given website is hosted as a subdomain by looking at the second element of each resulting tuple: any value other than `www` indicates a subdomain. 
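
As a minimal sketch of that check, the block below filters the parsed tuples by their second element. The helper name `is_subdomain` and the shortened input string `urls` are illustrative assumptions, and the pattern is restated (in a form that yields the same tuples on these URLs) so the block runs on its own.

```python
import re

# Pattern restated so this block is self-contained; on these URLs it
# produces the same five-element tuples as the pattern used above.
pattern = r"(https?)://([a-z]+)\.([a-z]+)\.([a-z]+)([/A-Za-z0-9]*)"

# Hypothetical input: three of the URLs from the example string.
urls = "https://www.ucla.edu/, http://ccle.ucla.edu/, https://sites.google.com/view/perlmutma/home"

def is_subdomain(parts):
    """Return True when the component before the hostname is not `www`."""
    return parts[1] != "www"

# Rebuild 'sub.host.ext' for every match flagged as a subdomain.
subdomain_hosts = [
    ".".join(parts[1:4])
    for parts in re.findall(pattern, urls)
    if is_subdomain(parts)
]
print(subdomain_hosts)
```

Note that this treats every non-`www` prefix as a subdomain, which is a simplification that holds for the URLs in this example.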

### Parsing Unstructured Scientific Data

Sometimes, data doesn't come to us neatly wrapped in CSV files. For example, consider the following: 

In [5]:
data = """
Andrea    5:31
Ben       5:02
Carl      6:21
Didi      5:10
"""
data

'\nAndrea    5:31\nBen       5:02\nCarl      6:21\nDidi      5:10\n'

Since it looks like these data represent times, let's parse the data into names, minutes, and seconds. 

In [6]:
# \d is shorthand for [0-9]; [A-Za-z] avoids the punctuation that [A-z] would also match
pattern = r"([A-Za-z]+)\s+(\d+):(\d+)"


In [7]:
parsed = re.findall(pattern, data)
parsed

[('Andrea', '5', '31'),
 ('Ben', '5', '02'),
 ('Carl', '6', '21'),
 ('Didi', '5', '10')]

Now we can compute each person's total time in seconds: 

In [8]:
{p[0] : 60*int(p[1]) + int(p[2]) for p in parsed}

{'Andrea': 331, 'Ben': 302, 'Carl': 381, 'Didi': 310}
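
The same computation can be written with named groups, so that each piece is referred to by name rather than by position in the tuple. This is only a sketch: the group names `name`, `minutes`, and `seconds`, the variable names, and the shortened two-row input are assumptions made for illustration.

```python
import re

# Hypothetical shortened input in the same layout as the data above.
sample = """
Andrea    5:31
Ben       5:02
"""

# (?P<label>...) attaches a name to a capture group.
named_pattern = r"(?P<name>[A-Za-z]+)\s+(?P<minutes>\d+):(?P<seconds>\d+)"

# re.finditer yields match objects, whose groups can be looked up by name.
total_seconds = {
    m.group("name"): 60 * int(m.group("minutes")) + int(m.group("seconds"))
    for m in re.finditer(named_pattern, sample)
}
print(total_seconds)
```

Named groups cost a little extra typing in the pattern, but they keep the arithmetic readable when a pattern grows beyond two or three groups.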