# Regular Expressions

Look at the file `data/nasa_web.log`. It contains two months' worth of all HTTP requests to the NASA Kennedy Space Center server in Florida.
We are going to go through that file and extract some interesting information.

Let's start by reading the file into your Python session.

In [1]:
with open('data/nasa_web.log', 'r') as f:
    log_file = f.read()

In [6]:
log_file.splitlines()

['199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
 '199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085',
 'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
 '199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179',
 'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0',
 'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0',
 '205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985',
 'd104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
 '129.94.144.152 - - [01/Jul/

If you're not familiar with HTTP logs, the data provided can be broken down like this:

`netspace.net.au - - [01/Jul/1995:03:34:44 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985`



| Host making the request | Timestamp            | Timezone      | HTTP method | URL                 | HTTP Version |HTTP status code | Bytes in the reply |
|-------------------------|----------------------|---------------|-------------|---------------------|-------------|-----------------|--------------------|
| netspace.net.au         | 01/Jul/1995:03:34:44 | -0400         | GET         | /shuttle/countdown/ | HTTP/1.0    | 200              | 3985               |
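The breakdown in the table can also be expressed as a single regular expression with one named group per column. This is only a sketch: the name `LOG_LINE` and the group names are hypothetical, and the pattern assumes every line has exactly the shape of the example above (real log lines may differ, for instance in the byte count field).

```python
import re

# Hypothetical pattern, not part of the original notebook:
# one named group for each column of the table.
LOG_LINE = re.compile(
    r'(?P<host>\S+) - - '
    r'\[(?P<timestamp>[^ \]]+) (?P<timezone>[+-]\d{4})\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) (?P<version>HTTP/\d\.\d)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

line = 'netspace.net.au - - [01/Jul/1995:03:34:44 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985'

# groupdict() maps each group name to the text it matched
fields = LOG_LINE.match(line).groupdict()
print(fields)
```

Every value in `fields` is a string, so numeric columns such as `status` or `size` would still need converting with `int()` before doing arithmetic on them.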

## The vocabulary

Regular expressions are a way to define strings that match a specific pattern. There is a whole language to be learned here and below you will find some of the most useful values.

    \d      Any Digit
    \D      Any Non-digit character
    .       Any Character
    \.      Period
    [abc]   Only a, b, or c
    [^abc]  Not a, b, nor c
    [a-z]   Characters a to z
    [0-9]   Numbers 0 to 9
    \w      Any Alphanumeric character (or underscore)
    \W      Any Non-alphanumeric character (underscore excluded)
    {m}     m Repetitions
    {m,n}   m to n Repetitions
    *       Zero or more repetitions
    +       One or more repetitions
    ?       Optional character
    \s      Any Whitespace
    \S      Any Non-whitespace character
    ^…$     Starts and ends
    (ab|de) Matches ab or de

[source](https://regexone.com/references/python)

A fantastic way of learning/practicing regex is [regex101](https://regex101.com/)
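To get a feel for the vocabulary, here is a small sketch that tries a few of the tokens on a single log line. The variable names are made up for this illustration.

```python
import re

sample = 'd104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985'

# {m}: exactly four digits in a row
four_digits = re.findall(r'\d{4}', sample)

# [A-Z] with +: one or more upper-case letters
upper_runs = re.findall(r'[A-Z]+', sample)

# (ab|de): either alternative; findall returns the text captured by the group
methods = re.findall(r'(GET|POST)', sample)

# ^ and \S: the run of non-whitespace characters at the start of the string
host = re.search(r'^\S+', sample).group()

# \w also matches the underscore, but not the hyphen
words = re.findall(r'\w+', 'sts_73-patch')

print(four_digits)  # ['1995', '0400', '3985']
print(upper_runs)   # ['J', 'GET', 'HTTP']
print(methods)      # ['GET']
print(host)         # d104.aa.net
print(words)        # ['sts_73', 'patch']
```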

Firstly, try to understand the regular expression below:

In [7]:
r'GET /\w+'

'GET /\\w+'

This expression looks for the word `GET` followed by a space character ` `, followed by a forward slash `/` and one or more alphanumeric characters. In that sense it is an expression to search for the first path segment of URLs requested with a GET.

*Although not always required, you can always define pattern strings with the `r''` prefix to make sure they are never interpreted by Python, i.e. problems such as Python using `\` as an escape character are avoided.*
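A short sketch of why the prefix matters. It uses `\b` (a word boundary in regex, which is not in the vocabulary list above) because Python itself reads `'\b'` as a backspace character when the string is not raw.

```python
import re

text = 'a cat sat'

plain = '\bcat'   # Python converts \b to a backspace before re ever sees it
raw = r'\bcat'    # the backslash reaches re untouched and means "word boundary"

print(len(plain), len(raw))     # 4 5
print(re.findall(plain, text))  # []
print(re.findall(raw, text))    # ['cat']
```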

### Exercise - match dates

Another example of a simple regular expression would be to search for a date. Imagine we wanted to capture all date instances that followed the convention 01/Jul/1995. What would an expression that catches this pattern look like?

In [40]:
log_file.splitlines()[0]

'199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

In [63]:
# Add your code below
# write a regular expression pattern that can recognise dates in the log (you will not need to run it - discuss)

import re

# return first 5 dates
# two digits, a three-letter month name, four digits, separated by slashes
search = r'\d{2}/[A-Za-z]{3}/\d{4}'
re.findall(search, log_file)[0:5]

['01/Jul/1995', '01/Jul/1995', '01/Jul/1995', '01/Jul/1995', '01/Jul/1995']

With our log we know exactly the number of letters and numbers expected. With regex, you can use `{n}` after a pattern to specify that you expect **n** instances.
For example `\d{2}` will match exactly 2 digits.
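A quick sketch of how the repetition counts behave, using made-up variable names. `re.fullmatch()` succeeds only if the whole string fits the pattern.

```python
import re

# {2} consumes the digits two at a time
pairs = re.findall(r'\d{2}', '1995')

# fullmatch requires the entire string to match, so three digits fail \d{2}
too_long = re.fullmatch(r'\d{2}', '123')

# {1,3} accepts between one and three digits
octets = re.findall(r'\d{1,3}', '199.72.81.55')

# fixed counts for each part of a date like 01/Jul/1995
dates = re.findall(r'\d{2}/[A-Za-z]{3}/\d{4}', '[01/Jul/1995:00:00:01 -0400]')

print(pairs)     # ['19', '95']
print(too_long)  # None
print(octets)    # ['199', '72', '81', '55']
print(dates)     # ['01/Jul/1995']
```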

## `search()`

Python supports regular expressions through a built-in library called `re`. Start by importing the library and using
the `re.search()` function, which takes a regex pattern and a string to be searched.

Search `log_file` for `GET` requests (the regex shown after the vocabulary list above can be used), assigning the `re.Match` object returned by `re.search()` to the variable `match`:

In [71]:
# Add your code below
import re
pattern = r'GET /\w+'
match = re.search(pattern, log_file)

Note that `re.search()` returns only the first match, so it is mostly useful for checking whether your regular expression has at least one hit. If there is no hit, it returns `None`. If it does find a hit, it returns the hit as a match object (`re.Match` in current Python versions, referred to here as `MatchObject`).

We can use the `.group()` method of the `MatchObject` to extract the actual string that was matched:

In [72]:
match.group()

'GET /history'

We can use the `.start()` method of the `MatchObject` to extract the starting position of the string that was matched:

In [73]:
match.start()

47

We can use the `.end()` method of the `MatchObject` to extract the end position of the string that was matched:

In [74]:
match.end()

59

### Creating a message from `MatchObject` properties

Similarly to the `.group()` method which extracts the actual string which was matched, you can use the `.start()` and `.end()` methods to collect the position of where the regular expression matched.

`print` the `start` and `end` positions formatted nicely with the actual match. Use an f-string to put together the `.group()`, `.start()`, `.end()` into a sentence as a single string, assigned to `message`:

In [75]:
# return a single string containing .group(), .start() and .end() of the match
message = f'Found "{match.group()}" at position: {match.start()}-{match.end()}'
message


'Found "GET /history" at position: 47-59'
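Because `re.search()` returns `None` when nothing matches, calling `.group()` on the result directly would raise an `AttributeError` in that case. A minimal sketch of a guard follows; the helper name `describe_match` is hypothetical.

```python
import re

def describe_match(pattern, text):
    """Build the message for the first match, or report that there is none."""
    m = re.search(pattern, text)
    if m is None:
        return f'No match for {pattern!r}'
    return f'Found "{m.group()}" at position: {m.start()}-{m.end()}'

line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

print(describe_match(r'GET /\w+', line))   # Found "GET /history" at position: 47-59
print(describe_match(r'POST /\w+', line))  # takes the "No match" branch
```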

## `findall()`

To search `log_file` for all matches of our regex `pattern`, we can use the `re.findall()` function. It takes the same parameters as `re.search()`, but returns a list of all matches.

Assign the returned list to the variable `get_lst`, then count how often each distinct match occurs:

In [126]:
# Add your code below
import numpy as np

pattern = r'GET /\w+'

# every match of the pattern in the log
get_lst = re.findall(pattern, log_file)

# distinct matches and how often each occurs
words, counts = np.unique(get_lst, return_counts=True)

# one row per distinct match: [match, count]
np.asarray((words, counts)).T


array([['GET /base', '1'],
       ['GET /biomed', '2'],
       ['GET /cgi', '358'],
       ['GET /de', '1'],
       ['GET /elv', '19'],
       ['GET /facilities', '78'],
       ['GET /facts', '32'],
       ['GET /history', '1214'],
       ['GET /htbin', '175'],
       ['GET /icons', '316'],
       ['GET /images', '2842'],
       ['GET /ksc', '136'],
       ['GET /mdss', '5'],
       ['GET /msfc', '20'],
       ['GET /news', '4'],
       ['GET /persons', '9'],
       ['GET /procurement', '1'],
       ['GET /pub', '12'],
       ['GET /robots', '1'],
       ['GET /shuttle', '4410'],
       ['GET /software', '172'],
       ['GET /statistics', '10'],
       ['GET /whats', '13']], dtype='<U21')
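The same tally can be sketched with only the standard library by using `collections.Counter`, which keeps the counts as integers rather than strings. The `sample_log` below is a shortened stand-in for `log_file`, built from the first lines shown earlier.

```python
import re
from collections import Counter

# shortened stand-in for log_file
sample_log = '\n'.join([
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
    '199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085',
    'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
    'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0',
])

# Counter tallies how many times each matched string occurs
counts = Counter(re.findall(r'GET /\w+', sample_log))

print(counts.most_common(1))  # [('GET /shuttle', 3)]
```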

## Substitutions

Regular expressions are also useful when we want to find and replace text, for instance to change the domain name in your business's email addresses.

Setting up a substitution is very similar to using regular expressions in general, except that this time we also need to specify a `replacement` value.

Replace the word **'company'** in the email addresses found in **'contacts'** with the word **'agency'** using the `re.sub()` function.

In [103]:
# Add your code below
contacts = 'Please write to tom@company.com or margaret@company.com'
pattern = r'company'
replacement = r'agency'

contacts = re.sub(pattern, replacement, contacts)
contacts


'Please write to tom@agency.com or margaret@agency.com'
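A slightly stricter variant, as a sketch: matching `@company\.com` instead of the bare word avoids touching the word "company" elsewhere in the text, and `re.subn()` additionally reports how many replacements were made. The variable names are chosen for this illustration only.

```python
import re

contacts = 'Please write to tom@company.com or margaret@company.com'

# subn returns a tuple: (new string, number of substitutions)
new_contacts, n_replaced = re.subn(r'@company\.com\b', '@agency.com', contacts)

print(new_contacts)  # Please write to tom@agency.com or margaret@agency.com
print(n_replaced)    # 2
```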

### Exercise - Replace IP addresses

Now we want to **find** and **replace** all host **IP addresses** with the string **'unknown domain'** in `log_file`

Save the regular expression pattern string to a variable called `pattern`

In [110]:
# Add your code below
# Step1 - develop the regular expression pattern string to find all IP addresses
log_file.splitlines()[0]

# four groups of one to three digits separated by literal dots
pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
re.findall(pattern, log_file)[:10]

['199.72.81.55',
 '199.120.110.21',
 '199.120.110.21',
 '205.212.115.106',
 '129.94.144.152',
 '129.94.144.152',
 '199.120.110.21',
 '205.189.154.54',
 '205.189.154.54',
 '205.189.154.54']

In [111]:
# If the above pattern is correct, you should see all IP addresses in 'log_file' by running the code below
re.findall(pattern, log_file)

['199.72.81.55',
 '199.120.110.21',
 '199.120.110.21',
 '205.212.115.106',
 '129.94.144.152',
 '129.94.144.152',
 '199.120.110.21',
 '205.189.154.54',
 '205.189.154.54',
 '205.189.154.54',
 '205.189.154.54',
 '199.72.81.55',
 '205.189.154.54',
 '205.189.154.54',
 '205.212.115.106',
 '205.189.154.54',
 '199.72.81.55',
 '199.72.81.55',
 '199.72.81.55',
 '199.72.81.55',
 '199.72.81.55',
 '205.189.154.54',
 '199.33.32.50',
 '199.33.32.50',
 '199.33.32.50',
 '129.188.154.200',
 '129.188.154.200',
 '129.188.154.200',
 '129.188.154.200',
 '129.94.144.152',
 '205.189.154.54',
 '205.189.154.54',
 '129.188.154.200',
 '129.188.154.200',
 '129.188.154.200',
 '129.188.154.200',
 '129.188.154.200',
 '205.212.115.106',
 '205.189.154.54',
 '129.94.144.152',
 '129.188.154.200',
 '129.188.154.200',
 '199.33.32.50',
 '129.188.154.200',
 '199.166.39.14',
 '199.166.39.14',
 '199.166.39.14',
 '199.166.39.14',
 '199.166.39.14',
 '199.166.39.14',
 '129.94.144.152',
 '129.94.144.152',
 '205.212.115.106',
 '199

Finally, save the `log_file` to a new file called `data/log_file_edited.log` where every **IP address** is replaced with `unknown domain`.

In [112]:
# Step2 - write substituted data to a file

replacement = r'unknown domain'

with open('data/log_file_edited.log', 'w') as f:
    f.write(re.sub(pattern, replacement, log_file))


In [121]:
with open('data/log_file_edited.log', 'r') as f:
    edited_file = f.read()
    
edited_file.splitlines()

['unknown domain - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
 'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
 'unknown domain - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085',
 'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
 'unknown domain - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179',
 'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0',
 'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0',
 'unknown domain - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985',
 'd104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
 'unknown domain - - [01/Jul
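The pattern used above matches IP-like text anywhere in a line. If only the host field at the start of each line should be replaced, one option is to anchor the pattern with `^` and pass `re.MULTILINE`. This is a sketch on a two-line stand-in for `log_file`; the names `sample_log` and `host_ip` are hypothetical.

```python
import re

# two-line stand-in for log_file
sample_log = '\n'.join([
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    'unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985',
])

# ^ with re.MULTILINE anchors at the start of every line;
# the trailing space makes sure the whole host field is an IP address
host_ip = re.compile(r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} ', re.MULTILINE)

# the replacement ends with a space to restore the one consumed by the pattern
edited = host_ip.sub('unknown domain ', sample_log)
print(edited)
```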

## Additional examples

###### How many requests for a page from the `shuttle` section (base `/shuttle/`) failed (status code different than 200)?

In [127]:
pattern = r'GET /shuttle/[\w/\.]* HTTP/1.0" (?!200)\d{3}'
# pattern = r'GET /shuttle/[\w/\.]* HTTP/1.0" ([0-13-9]\d{2}|2[1-9]\d|2\d[1-9])'
len(re.findall(pattern, log_file))


130

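Hint: for the "different than 200" part you have two options: either use `pattern1|pattern2` to list all the accepted status codes, or read about the negative lookahead `(?!pattern)`, which lets the match succeed only if `pattern` is not found at that position. Note that the character class `[\w/\.]` does not include `-`, so URLs containing a hyphen are not counted by this pattern.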
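The two options from the hint can be compared on a few made-up request fragments. This is only a sketch: the sample lines are invented, and the alternation lists just the two non-200 codes that occur in them.

```python
import re

# invented request fragments for illustration
sample_log = '\n'.join([
    '"GET /shuttle/countdown/ HTTP/1.0" 200 3985',
    '"GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
    '"GET /shuttle/missing.html HTTP/1.0" 404 -',
])

# option 1: negative lookahead, any three-digit code that is not 200
lookahead = r'GET /shuttle/[\w/.]* HTTP/1\.0" (?!200)\d{3}'

# option 2: alternation, listing the accepted codes explicitly
alternation = r'GET /shuttle/[\w/.]* HTTP/1\.0" (?:304|404)'

by_lookahead = re.findall(lookahead, sample_log)
by_alternation = re.findall(alternation, sample_log)

print(len(by_lookahead), len(by_alternation))  # 2 2
```

The group in the alternation is written as `(?:...)` (non-capturing) so that `re.findall()` returns the whole match instead of just the status code.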

###### Get all distinct second-level path segments of URLs starting with `/shuttle/` (`/shuttle/something`)

In [128]:
pattern = r'GET /shuttle/(\w+)/'
set(re.findall(pattern, log_file))


{'countdown', 'missions', 'movies', 'resources', 'technology'}
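The example above relies on how `re.findall()` treats parentheses: with no group it returns whole matches, with one group it returns only the captured text, and with several groups it returns tuples. A small sketch on a single request string:

```python
import re

line = '"GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085'

whole = re.findall(r'GET /shuttle/\w+/', line)      # no group: the full match
one_group = re.findall(r'GET /shuttle/(\w+)/', line)  # one group: just the captured part
two_groups = re.findall(r'GET /(\w+)/(\w+)/', line)   # two groups: a tuple per match

print(whole)       # ['GET /shuttle/missions/']
print(one_group)   # ['missions']
print(two_groups)  # [('shuttle', 'missions')]
```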

###### How many queries did NASA get from a German domain (`.de`)?

In [129]:
pattern = r'^[\w\.]+\.de'
len(re.findall(pattern, log_file, re.MULTILINE))


63

Hint: To match the beginning of each line, use `^` and pass `re.MULTILINE` as an argument to `re.findall` (otherwise `^` matches only at the very start of the string; it acts as the `not` operator only inside a character class such as `[^abc]`).
For instance, `re.findall(r'^a', file_, re.MULTILINE)` will return one match for every line starting with an `a`.
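A minimal sketch of the three behaviours of `^`, on an invented three-line string:

```python
import re

# invented host lines for illustration
hosts = 'alpha.de - -\nbeta.com - -\napollo.de - -'

# without the flag, ^ matches only at the very start of the string
without_flag = re.findall(r'^a\w+', hosts)

# with re.MULTILINE, ^ also matches right after every newline
with_flag = re.findall(r'^a\w+', hosts, re.MULTILINE)

# inside square brackets, ^ negates the character class instead
not_a = re.findall(r'[^a]', 'banana')

print(without_flag)  # ['alpha']
print(with_flag)     # ['alpha', 'apollo']
print(not_a)         # ['b', 'n', 'n']
```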