In [7]:
# Initialize Otter
import otter
grader = otter.Notebook("ps4.ipynb")

# PS4: Regular expressions
In this problem set you will get some basic practice using regular expressions. In Python, the regular expressions module is called `re`:

In [8]:
import re

## Question 1: Matching words
We'll practice on the file `words.txt.gz` included with the problem set.

**1(a)** (2 pts) Read the file, so that you have a list of strings, one for each word. Save this list as `words`.

**Hint:** you can use the `gzip.open` function to read a `.gz` file.

In [9]:
import gzip
with gzip.open("words.txt.gz", "rt") as f:
    # Read the whole file as text and split on whitespace: one string per word.
    words = f.read().split()


In [10]:
grader.check("q1a")

**1(b)** (2 pts) Write a regular expression that matches any three consecutive vowels, and use it to produce a list named `three_vowels` of all words in this data set that contain such a run.

In [11]:
pat = re.compile(r"[aeiou]{3}")
three_vowels = [w for w in words if re.search(pat, w) is not None]

In [12]:
grader.check("q1b")

**1(c)** (2 pts) Write a regular expression that matches a single word containing an even number (greater than `0`) of occurrences of the letter `e`, and use it to produce a list named `even_e` of all such words in the data set.

In [13]:
pat = re.compile(r"(?:[^e]*e[^e]*e[^e]*)+")
even_e = [w for w in words if re.fullmatch(pat, w) is not None]
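
As a sanity check, the pattern can be compared against a direct count of `e` characters. The sketch below uses a small hypothetical list named `sample` (not the real data set) and the illustrative name `even_e_pat` for the same pattern as above.

```python
import re

# Same pattern as above: one or more blocks, each containing exactly two e's.
even_e_pat = re.compile(r"(?:[^e]*e[^e]*e[^e]*)+")

# Hypothetical sample words, chosen for illustration only.
sample = ["tree", "eye", "bee", "the", "sky", "eleven", "beekeeper", "teepee"]

# Words accepted by the regular expression.
by_regex = [w for w in sample if even_e_pat.fullmatch(w)]
# Words with a positive, even number of e's, counted directly.
by_count = [w for w in sample if w.count("e") > 0 and w.count("e") % 2 == 0]

print(by_regex)              # ['tree', 'eye', 'bee', 'teepee']
print(by_regex == by_count)  # True
```

`re.fullmatch` matters here: with `re.search` the pattern would also accept words with an odd count of three or more, since any two of the `e`s would satisfy it.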

In [14]:
grader.check("q1c")

**1(d)** (2 pts) Write a regular expression that matches any string that begins and ends with a consonant, and has no consonants in between. Your answer should be a string named `consonants_begin_end`.

In [15]:
consonants_begin_end = r"^[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z][^b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]*[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]$"

In [16]:
eg = 'Du\x7fd'
print(eg)
re.match(consonants_begin_end, eg)

Dud


<re.Match object; span=(0, 4), match='Du\x7fd'>
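
The match above succeeds because the negated class `[^...]` accepts any character that is not a consonant, including the control character `\x7f`. Whether that is desirable depends on how the question is read. The sketch below contrasts the submitted pattern with a hypothetical stricter variant, `strict`, that allows only ASCII vowels in the middle; the names `consonant`, `loose`, and `strict` are illustrative and not part of the assignment.

```python
import re

# Consonant ranges used in the answer above (y is treated as a consonant).
consonant = "b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z"

# Original pattern: anything that is not a consonant may appear in the middle.
loose = rf"^[{consonant}][^{consonant}]*[{consonant}]$"
# Hypothetical stricter variant: only ASCII vowels may appear in the middle.
strict = rf"^[{consonant}][aeiouAEIOU]*[{consonant}]$"

for s in ["Dud", "Du\x7fd", "D-d", "Dad", "Dnd"]:
    print(repr(s), bool(re.match(loose, s)), bool(re.match(strict, s)))
```

Both patterns reject `Dnd`, which has a consonant in the middle. Only `loose` accepts `Du\x7fd` and `D-d`.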

In [17]:
grader.check("q1d")

**1(e)** (2 pts) Write a regular expression that matches any word *at least four letters long* whose last two letters are the first two letters in reverse order, and use it to produce a list named `fwd_2_rev` of all such words in the dataset. (An example of such a word is `cardiac`.)

In [18]:
pat = re.compile(r"(\w)(\w)\w*\2\1")
fwd_2_rev = [w for w in words if re.fullmatch(pat, w) is not None]
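
To see how the backreferences work, it can help to inspect the captured groups on the example word from the prompt. This is a small illustration of the same pattern, not part of the graded answer.

```python
import re

pat = re.compile(r"(\w)(\w)\w*\2\1")

# Group 1 and group 2 capture the first two letters; \2\1 then requires
# the word to end with those two letters in reverse order.
m = pat.fullmatch("cardiac")
print(m.group(1), m.group(2))  # c a

# The shortest possible match has four letters: two captured, two backreferenced.
print(bool(pat.fullmatch("abba")))  # True
print(bool(pat.fullmatch("aba")))   # False
```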

In [19]:
grader.check("q1e")

## Question 2: Filtering Internet traffic
In this problem, you'll get a taste of a more realistic application of
regular expressions. The file `SkypeIRC.txt.gz` contains a capture of Internet traffic generated
by a laptop while it was running the program Skype. Each line represents a single "packet" of data that was either sent or received by the machine.

The first line of the file is:
```
    1   0.000000  192.168.1.2 → 212.204.214.114 IRC 96 Request (ISON)
```
The first two fields are a counter and timestamp indicating the total number of packets captured so far, and the time (in seconds) that has elapsed since the capture was initiated. Following that, `192.168.1.2 → 212.204.214.114` indicates that the packet was being sent from the [IP address](https://en.wikipedia.org/wiki/IP_address) `192.168.1.2` to `212.204.214.114`. The remainder of the line is a description of the packet contents, which varies from packet to packet.
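
The field layout described above can be sketched as a single regular expression with one group per field. The pattern name `line_pat` and the choice of `\S+` for the addresses are assumptions made for illustration; later parts of this question build a more precise IP address pattern.

```python
import re

# First line of the capture, as quoted above.
line = "    1   0.000000  192.168.1.2 → 212.204.214.114 IRC 96 Request (ISON)"

# Hypothetical field pattern: counter, timestamp, sender, arrow, receiver, rest.
line_pat = re.compile(r"\s*(\d+)\s+(\d+\.\d+)\s+(\S+)\s+→\s+(\S+)\s+(.*)")

m = line_pat.match(line)
counter, timestamp, sender, receiver, description = m.groups()

print(int(counter), float(timestamp))
print(sender, receiver)
print(description)
```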

**2(a)** (2 pts) Construct a regular expression named `zero_to_255` that matches a single line containing one integer between 0 and 255, inclusive. Numbers with leading zeros should not match (except for 0 itself), so "40" should match, but "040" should not. We will surround your regular expression with the anchors "^" and "$" when testing; you do not need to include them.

In [68]:
zero_to_255 = r"(0|[1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])"
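
A quick local check of this pattern, wrapping it in `^` and `$` as the prompt says the tests do. The test strings below are illustrative choices, not the grader's actual cases.

```python
import re

zero_to_255 = r"(0|[1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])"
# Anchor the pattern the way the prompt describes.
anchored = re.compile(rf"^{zero_to_255}$")

# Hypothetical test strings.
should_match = ["0", "7", "40", "199", "255"]
should_not_match = ["040", "00", "256", "1000", "-1", ""]

print(all(anchored.match(s) for s in should_match))      # True
print(any(anchored.match(s) for s in should_not_match))  # False
```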

In [71]:
grader.check("q2a")

**2(b)** (2 pts) Using your solution in the previous step, build a regular expression named `ip` that matches IP addresses. For the purpose of this exercise, we will define an IP address to be a string of the form `A.B.C.D`, where $A$, $B$, $C$, and $D$ are numbers between 0 and 255.

In [44]:
ip = rf"{zero_to_255}\.{zero_to_255}\.{zero_to_255}\.{zero_to_255}"

In [23]:
grader.check("q2b")

**2(c)** (2 pts) Create a regular expression that extracts the sender and receiver IP addresses from each line of the SkypeIRC log file, and use a `collections.Counter` to count up all the unique sender and receiver addresses in the data. What do you think is the IP address of the computer on which the capture was performed?

_Type your answer here, replacing this text._

In [45]:
ip

'(0|[1-9]|[1-9]\\d|1\\d{2}|2[0-4]\\d|25[0-5])\\.(0|[1-9]|[1-9]\\d|1\\d{2}|2[0-4]\\d|25[0-5])\\.(0|[1-9]|[1-9]\\d|1\\d{2}|2[0-4]\\d|25[0-5])\\.(0|[1-9]|[1-9]\\d|1\\d{2}|2[0-4]\\d|25[0-5])'

In [28]:
from collections import Counter
senders = Counter()
receivers = Counter()

with gzip.open("SkypeIRC.txt.gz", "rt") as f:
    for line in f:
        ips = re.findall(ip, line)
        # Skip lines that do not contain both a sender and a receiver address.
        if len(ips) < 2:
            continue
        # Counter returns 0 for missing keys, so += 1 is sufficient.
        senders[ips[0]] += 1
        receivers[ips[1]] += 1

In [31]:
ips

[('192', '168', '1', '2'), ('212', '204', '214', '1')]
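
The second tuple above ends in `'1'`, whereas the first line of the file gives the receiver as `212.204.214.114`. Two things combine here. First, alternation tries its branches from left to right, so in the last octet `[1-9]` succeeds on the first digit, and nothing after it forces the engine to backtrack to a longer branch. Second, because `zero_to_255` uses a capturing group, `re.findall` returns tuples of groups rather than whole addresses. The sketch below reproduces this on the first line of the file and shows one possible variant; the names `octet` and `ip_full` are hypothetical and not part of the assignment.

```python
import re

# First line of the capture, as quoted in the problem statement.
line = "    1   0.000000  192.168.1.2 → 212.204.214.114 IRC 96 Request (ISON)"

# Pattern from 2(a)/2(b): short alternatives come first, groups are capturing.
zero_to_255 = r"(0|[1-9]|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])"
ip = rf"{zero_to_255}\.{zero_to_255}\.{zero_to_255}\.{zero_to_255}"
print(re.findall(ip, line))
# [('192', '168', '1', '2'), ('212', '204', '214', '1')]

# Hypothetical variant: longest alternatives first, non-capturing groups,
# and lookarounds so a match cannot start or stop in the middle of a number.
octet = r"(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]\d|\d)"
ip_full = rf"(?<!\d){octet}(?:\.{octet}){{3}}(?!\d)"
print(re.findall(ip_full, line))
# ['192.168.1.2', '212.204.214.114']
```

With a variant like this, the `Counter` keys become full address strings, so addresses that differ only in the tail of the last octet are no longer merged.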

In [30]:
ips[0]

('192', '168', '1', '2')

In [29]:
senders

Counter({('192', '168', '1', '2'): 1182,
         ('212', '204', '214', '1'): 141,
         ('192', '168', '1', '1'): 355,
         ('71', '10', '179', '1'): 43,
         ('172', '200', '160', '2'): 41,
         ('86', '128', '100', '2'): 1,
         ('68', '95', '198', '1'): 2,
         ('86', '128', '187', '1'): 2,
         ('24', '177', '122', '7'): 27,
         ('68', '32', '70', '1'): 6,
         ('65', '190', '6', '1'): 3,
         ('165', '124', '253', '2'): 2,
         ('81', '236', '228', '1'): 1,
         ('86', '128', '163', '1'): 1,
         ('86', '134', '79', '6'): 1,
         ('80', '186', '57', '1'): 1,
         ('68', '206', '150', '2'): 18,
         ('212', '50', '132', '2'): 1,
         ('217', '47', '73', '1'): 4,
         ('195', '215', '8', '1'): 9,
         ('83', '147', '171', '2'): 1,
         ('172', '207', '190', '2'): 1,
         ('217', '47', '73', '3'): 4,
         ('82', '253', '163', '2'): 1,
         ('217', '41', '176', '2'): 4,
         ('83', '130', 

In [None]:
grader.check("q2c")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Upload this .zip file to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)