In [107]:
# Initialize Otter
import otter
grader = otter.Notebook("ps4.ipynb")

# PS4: Regular expressions
In this problem set you will get some basic practice using regular expressions. In Python, the regular expressions module is called `re`:

In [108]:
import re

## Question 1: Matching words
We'll practice on the file `words.txt.gz` included with the problem set.

**1(a)** (2 pts) Read the file, so that you have a list of strings, one for each word. Save this list as `words`.

**Hint:** you can use the `gzip.open` function to read a `.gz` file.

In [109]:
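import gzip
# "rt" opens the compressed file in text mode, so f.read() returns a str.
with gzip.open("words.txt.gz", "rt") as f:
    # split() with no arguments splits on any whitespace, giving one string per word.
    words = f.read().split()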


In [110]:
grader.check("q1a")
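
As a quick illustration of the hint, the sketch below writes a tiny gzip file and reads it back in text mode. The file name `demo_words.txt.gz` and its contents are made up for this example and are not part of the problem set.

```python
import gzip
import os
import tempfile

# Hypothetical demo file; this is NOT the words.txt.gz shipped with the problem set.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_words.txt.gz")

# "wt" / "rt" open the compressed file in text mode, so we work with str, not bytes.
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("apple\nbanana\ncherry\n")

with gzip.open(demo_path, "rt", encoding="utf-8") as f:
    # split() with no arguments splits on runs of whitespace,
    # so the trailing newline does not produce an empty string.
    demo_words = f.read().split()

print(demo_words)
```

Opening with mode `"rb"` instead would return `bytes`, which would then need decoding before the regular expressions below could be applied to `str` patterns.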

**1(b)** (2 pts) Write a regular expression that matches any three consecutive vowels, and use it to produce a list named `three_vowels` of all such words that occur in this data set.

In [111]:
pat = re.compile(r"[aeiou]{3}")
three_vowels = [w for w in words if re.search(pat, w) is not None]

In [112]:
grader.check("q1b")
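
To see how the three-vowel pattern behaves, here is a small sketch on a hand-picked word list (the list `sample` is an assumption for illustration, not taken from `words.txt.gz`). It uses `search` because the run of vowels may occur anywhere in the word.

```python
import re

# Same character class as above: exactly three consecutive lowercase vowels.
vowel_run = re.compile(r"[aeiou]{3}")

# Hypothetical sample words chosen for the demonstration.
sample = ["beautiful", "queue", "banana", "aeon", "sky"]

# search() succeeds if the pattern matches anywhere inside the word.
matches = [w for w in sample if vowel_run.search(w)]
print(matches)
```

Note that the class only lists lowercase vowels; if the data contained capitalized words, passing `re.IGNORECASE` to `re.compile` would be one way to cover them.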

**1(c)** (2 pts) Write a regular expression that matches a single word containing an even number (greater than `0`) of the letter `e`, and use it to produce a list named `even_e` of all such words in the data set.

In [113]:
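# Each repetition of the group consumes exactly two e's (plus any non-e letters),
# so a full match requires 2, 4, 6, ... occurrences of "e".
pat = re.compile(r"(?:[^e]*e[^e]*e[^e]*)+")
even_e = [w for w in words if re.fullmatch(pat, w) is not None]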

In [114]:
grader.check("q1c")
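
One way to sanity-check the even-`e` pattern is to compare it against a plain count of the letter. The sketch below does this on a few hand-picked words; the list `sample` and the helper names are assumptions made for the example.

```python
import re

# Same pattern as above: one or more blocks, each containing exactly two e's.
even_e_pat = re.compile(r"(?:[^e]*e[^e]*e[^e]*)+")

# Hypothetical sample words with 2, 3, 0, 4, 1 and 3 occurrences of "e".
sample = ["tree", "eleven", "cat", "reference", "the", "eerie"]

# fullmatch() requires the whole word to be covered by the pattern.
by_regex = [w for w in sample if even_e_pat.fullmatch(w)]

# Reference answer computed without regular expressions.
by_count = [w for w in sample if w.count("e") > 0 and w.count("e") % 2 == 0]

print(by_regex, by_count)
```

Using `fullmatch` matters here: with `search`, a word such as `eleven` would match on its first two `e`s even though it contains three.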

**1(d)** (2 pts) Write a regular expression that matches any string that begins and ends with a consonant, and has no consonants in between. Your answer should be a string named `consonants_begin_end`.

In [115]:
consonants_begin_end = r"^[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z][^b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]*[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]$"

In [116]:
grader.check("q1d")
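
The sketch below tries the consonant pattern on a few strings chosen for illustration (the dictionary name `checks` and the test strings are assumptions). It also shows that the middle class `[^...]` accepts any non-consonant character, not only vowels, which is consistent with the wording "no consonants in between".

```python
import re

# Same pattern as above; note that "y" falls in the range v-z and counts as a consonant.
consonants_begin_end = r"^[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z][^b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]*[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]$"

# Hypothetical test strings for the demonstration.
candidates = ["tot", "noon", "Bob", "bee", "street", "x", "b2b"]

# The pattern carries its own anchors, so search() is sufficient.
checks = {s: bool(re.search(consonants_begin_end, s)) for s in candidates}
print(checks)
```

A single consonant such as `x` does not match, because the pattern needs one consonant at the start and a separate one at the end.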

**1(e)** (2 pts) Write a regular expression that matches any word *at least four letters long* whose last two letters are the first two letters in reverse order, and use it to produce a list named `fwd_2_rev` of all such words in the dataset. (An example of such a word is `cardiac`.)

In [117]:
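# \1 and \2 are backreferences to the first and second captured letters,
# so the word must end with those two letters in reverse order.
pat = re.compile(r"(\w)(\w)\w*\2\1")
fwd_2_rev = [w for w in words if re.fullmatch(pat, w) is not None]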

In [118]:
grader.check("q1e")
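
As an illustration of backreferences, the following sketch applies the same pattern to a handful of words (the list `sample` is made up for the example). Because the two captured letters and their two reversed copies cannot overlap, any match is automatically at least four letters long.

```python
import re

# Same pattern as above: two captured letters, anything, then the pair reversed.
rev_pat = re.compile(r"(\w)(\w)\w*\2\1")

# Hypothetical sample words chosen for the demonstration.
sample = ["cardiac", "abba", "aba", "level", "noon", "banana"]

matches = [w for w in sample if rev_pat.fullmatch(w)]
print(matches)

# The captured groups show which letters were reused at the end of the word.
m = rev_pat.fullmatch("cardiac")
print(m.group(1), m.group(2))
```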

## Question 2: Filtering Internet traffic
In this problem, you'll get a taste of a more realistic application of
regular expressions. The file `SkypeIRC.txt.gz` contains a capture of Internet traffic generated
by a laptop while it was running the program Skype. Each line represents a single "packet" of data that was either sent or received by the machine.

The first line of the file is:
```
    1   0.000000  192.168.1.2 → 212.204.214.114 IRC 96 Request (ISON)
```
The first two fields are a counter and timestamp indicating the total number of packets captured so far, and the time (in seconds) that has elapsed since the capture was initiated. Following that, `192.168.1.2 → 212.204.214.114` indicates that the packet was being sent from the [IP address](https://en.wikipedia.org/wiki/IP_address) `192.168.1.2` to `212.204.214.114`. The remainder of the line is a description of the packet contents, which varies from packet to packet.

**2(a)** (2 pts) Construct a regular expression named `zero_to_255` that matches a single line containing a number between 0 and 255, inclusive. Numbers with leading zeros should not match (except for 0 itself), so "40" should match, but "040" should not. We will surround your regular expression with the anchors "^" and "$" when testing; you do not need to include them.

In [119]:
zero_to_255 = r"(0|[1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])"

In [120]:
grader.check("q2a")
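
A range pattern like this is easy to check exhaustively. The sketch below tests every integer from 0 to 999 plus a few strings with leading zeros; `fullmatch` plays a role similar to the `^` and `$` anchors the grader adds. The variable names other than `zero_to_255` are assumptions for the example.

```python
import re

# Same pattern as above, one alternative per "shape" of number:
# 0 | 1-9 | 10-99 | 100-199 | 200-249 | 250-255
zero_to_255 = r"(0|[1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])"
octet = re.compile(zero_to_255)

# Every integer below 1000 that the pattern accepts in full.
accepted = [n for n in range(1000) if octet.fullmatch(str(n))]

# Strings with leading zeros should all be rejected.
leading_zero_hits = [s for s in ["00", "040", "007", "0255"] if octet.fullmatch(s)]

print(accepted[0], accepted[-1], len(accepted), leading_zero_hits)
```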

**2(b)** (2 pts) Using your solution in the previous step, build a regular expression named `ip` that matches IP addresses. For the purpose of this exercise, we will define an IP address to be a string of the form `A.B.C.D`, where $A$, $B$, $C$, and $D$ are numbers between 0 and 255.

In [121]:
ip = rf"\b{zero_to_255}\.{zero_to_255}\.{zero_to_255}\.{zero_to_255}\b"

In [122]:
grader.check("q2b")
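
The sketch below exercises the `ip` pattern in two ways: validating whole strings with `fullmatch`, and pulling addresses out of the first log line quoted earlier with `finditer`. The candidate strings are assumptions chosen for illustration. Since `zero_to_255` contains a capturing group, `m.group(0)` is used to recover the full matched address.

```python
import re

zero_to_255 = r"(0|[1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])"
ip = rf"\b{zero_to_255}\.{zero_to_255}\.{zero_to_255}\.{zero_to_255}\b"

# Hypothetical candidates: some valid, some out of range or malformed.
candidates = ["192.168.1.2", "212.204.214.114", "256.1.1.1",
              "1.2.3", "01.2.3.4", "0.0.0.0"]
valid = [s for s in candidates if re.fullmatch(ip, s)]

# First line of the capture, as quoted in the problem statement.
line = "    1   0.000000  192.168.1.2 \u2192 212.204.214.114 IRC 96 Request (ISON)"

# group(0) is the entire match; groups 1-4 would be the individual octets.
found = [m.group(0) for m in re.finditer(ip, line)]

print(valid)
print(found)
```

The trailing `\b` is what forces the last octet to be matched in full: without it, a search could stop after the first digit of the final number.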

**2(c)** (2 pts) Create a regular expression that extracts the sender and receiver IP addresses from each line of the SkypeIRC log file, and use a `collections.Counter` to count up all the unique sender and receiver addresses in the data. What do you think is the IP address of the computer on which the capture was performed?

In [123]:
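from collections import Counter
senders = Counter()
receivers = Counter()

# Full "sender → receiver" pattern; the loop below instead takes the
# first two addresses found on each line.
ip2ip = re.compile(f"({ip}) \u2192 ({ip})")
with gzip.open("SkypeIRC.txt.gz", "rt") as f:
    for line in f:
        # `ip` has four capturing groups, so findall returns 4-tuples of octets.
        ips = re.findall(ip, line)
        if len(ips) >= 2:
            s, r = '.'.join(ips[0]), '.'.join(ips[1])
            # A Counter returns 0 for missing keys, so += 1 works directly.
            senders[s] += 1
            receivers[r] += 1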

In [124]:
grader.check("q2c")
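
An alternative way to extract the two addresses is to match the whole `sender → receiver` pair at once with named groups. The sketch below is one possible formulation, not the graded solution: the names `octet`, `ip_nc`, `pair`, `src` and `dst` are assumptions, the octet pattern is rewritten with a non-capturing group so that only the two addresses are captured, and the sample lines are copied from the log excerpts quoted in this notebook.

```python
import re
from collections import Counter

# Non-capturing version of the 0-255 pattern, so it adds no groups of its own.
octet = r"(?:0|[1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])"
ip_nc = rf"\b{octet}(?:\.{octet}){{3}}\b"

# Named groups for the sender and receiver on either side of the arrow (U+2192).
pair = re.compile(rf"(?P<src>{ip_nc})\s+\u2192\s+(?P<dst>{ip_nc})")

sample_lines = [
    "    1   0.000000  192.168.1.2 \u2192 212.204.214.114 IRC 96 Request (ISON)",
    "    5   0.235960  192.168.1.2 \u2192 192.168.1.1  DNS 84 Standard query 0x311f PTR 2.1.168.192.in-addr.arpa",
    "    7   0.270252  192.168.1.1 \u2192 192.168.1.2  DNS 84 Standard query response 0x311f PTR 2.1.168.192.in-addr.arpa",
]

demo_senders, demo_receivers = Counter(), Counter()
for line in sample_lines:
    m = pair.search(line)
    if m:  # lines without an "IP → IP" pair are skipped
        demo_senders[m.group("src")] += 1
        demo_receivers[m.group("dst")] += 1

print(demo_senders)
print(demo_receivers)
```

Requiring the arrow between the two addresses means that dotted numbers appearing later in the packet description, such as `2.1.168.192` in the PTR queries, cannot be mistaken for the sender or receiver.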

Since the capture ran on my computer, its IP address should appear most frequently on both the sender and the receiver side.
So we run `senders.most_common(2)` and `receivers.most_common(2)`. Next, if we check some records in this log, we find:
>   5   0.235960  192.168.1.2 → 192.168.1.1  DNS 84 Standard query 0x311f PTR 2.1.168.192.in-addr.arpa     
>   6   0.236116  192.168.1.2 → 192.168.1.1  DNS 88 Standard query 0x3120 PTR 114.214.204.212.in-addr.arpa     
>   7   0.270252  192.168.1.1 → 192.168.1.2  DNS 84 Standard query response 0x311f PTR 2.1.168.192.in-addr.arpa     

This means that `192.168.1.2` sends a DNS query to `192.168.1.1` and `192.168.1.1` responds, so `192.168.1.1` is likely the local DNS resolver. Moreover, there are many
records containing TCP communication between `192.168.1.2` and other IP addresses.    
Hence, we conclude that `192.168.1.2` is the IP address of my computer. Note that this is a local (private) IP address, and my computer may also have a public IP address assigned by the Internet service provider.

In [125]:
print(senders.most_common(2), receivers.most_common(2))

[('192.168.1.2', 1182), ('192.168.1.1', 355)] [('192.168.1.2', 1068), ('192.168.1.1', 359)]
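
The two tallies can also be combined into a single ranking, since `Counter` objects support addition. The sketch below uses only the top-two counts printed above, so the totals cover just these two addresses; the names `senders_top`, `receivers_top` and `total` are assumptions for the example.

```python
from collections import Counter

# Top-two counts copied from the output above (not the full counters).
senders_top = Counter({"192.168.1.2": 1182, "192.168.1.1": 355})
receivers_top = Counter({"192.168.1.2": 1068, "192.168.1.1": 359})

# Adding two Counters sums the counts key by key.
total = senders_top + receivers_top
print(total.most_common())
```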


## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Upload this .zip file to Gradescope for grading.

In [126]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)