# Lesson 4.8: RegEx
# Activity 8A: What is RegEx?

A regular expression (RegEx) is a special text string for describing a search pattern. You can think of regular expressions as amplified wildcards. You are familiar with wildcard notations such as `*.txt` to find all text files in a file manager. The regex equivalent is `.*\.txt`.


## Why Do We Need Regular Expressions?

In the last few lessons, we spent a lot of time parsing logs. Many of the logs had a fixed structure with each column having a value. This is not always the case! In some logs, each line has a completely different structure. If we are looking for lines in a file that have specific information like IP addresses, domain names, or emails, we can look for the pattern inside those files and only parse the log lines that are of interest to us.

The module we can use is called `re`. If we want to look for any string that ends with a `txt` extension, we can use `match = re.search(pattern, string)`. We are using `search` to look for a pattern anywhere in the string. `re.search` returns:
* a match object (which counts as true in an `if` statement) if the pattern is found
* `None` (which counts as false in an `if` statement) if the pattern is not found

In [0]:
import re 
string = "2.txt"
match = re.search(r".*\.txt", string)  # raw string (r"...") keeps the backslash intact
if match:
    print("pattern found")
else:
    print("pattern is not found")
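
To see what `re.search` actually hands back, here is a minimal sketch. The variable names `found` and `missing` are only for illustration. The match object carries the matched text and its position, while a failed search gives `None`.

```python
import re

found = re.search(r".*\.txt", "2.txt")    # pattern is present
missing = re.search(r".*\.txt", "2.jpg")  # pattern is absent

print(found.group(0))  # 2.txt  (the text that matched)
print(found.span())    # (0, 5) (start and end index of the match)
print(missing)         # None
```

Because a match object counts as true and `None` counts as false, `if match:` works as a found/not-found test.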

## Explanation
In RegEx (in Python and in most other RegEx flavors), a `.` stands for any single character except a newline. It can be a number, a letter, a special character, etc.

* `*` is a quantifier: it says how many times the element to its left (here, the `.`) may repeat for the match to occur.
* `*` means that this element can appear zero or more times, with no upper limit.

For example, `1.txt` will match, `ab.txt` will match, and `14243fsfsaf.txt` will match, too.
The `\` character is there so that the `.` in the file name is treated as a literal dot, since `.` has a special meaning in RegEx.
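
The difference between an escaped and an unescaped dot is easy to miss, so here is a small sketch with made-up test strings. The names `loose`, `strict`, and `empty_prefix` are illustrative only.

```python
import re

# Unescaped dot: "." matches ANY single character, so "2xtxt" is accepted.
loose = re.search(r".*.txt", "2xtxt")

# Escaped dot: "\." only matches a literal ".", so "2xtxt" is rejected.
strict = re.search(r".*\.txt", "2xtxt")

# ".*" may also match zero characters, so a bare ".txt" still matches.
empty_prefix = re.search(r".*\.txt", ".txt")

print(loose is not None)         # True
print(strict is not None)        # False
print(empty_prefix is not None)  # True
```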

Let's try it:

In [0]:
string = "b.txt"
match = re.search(r".*\.txt", string)
if match:
    print("pattern found")
else:
    print("no match")

In [0]:
string = "banana.txt"
match = re.search(r".*\.txt", string)
if match:
    print("pattern found")
else:
    print("no match")

But if we change the ending to `.jpg` what will happen?

In [0]:
string = "banana.jpg"
match = re.search(r".*\.txt", string)
if match:
    print("pattern found")
else:
    print("no match")

Let's try something a bit more complicated, like matching an email pattern:

In [0]:
string = "myemail@email.com"
match = re.search(r".*@.*\.[a-z]{3}", string)
if match:
    print("pattern found")
else:
    print("no match")

## Explanation
In this example, we don't know which type of character to look for in the username, or how many times it appears.
We use `.*`, which matches any run of characters, alphanumeric or not.

We know that every email has a literal `@`.
After the `@` character, we don't know the type or number of characters, so we use `.*` again.

After the domain name, we usually have a literal `.`, written as `\.`.
The top-level domain often consists of 3 letters (not always). The way to represent just letters in RegEx is `[a-z]` for lowercase or `[A-Z]` for uppercase.

If the string has no `.` or `@`, it won't match. The number in the `{}` brackets says how many characters of that type to expect. Instead of an unknown number, we know it is 3.

**Note**: To match every email format we will need to create a more complicated RegEx.
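
To get a feel for why, here is a hedged sketch comparing the lesson's pattern with a somewhat stricter one. `STRICTER` is a common simplification that we are assuming for illustration; it is not a complete email validator. It also uses `+` (one or more) and `(?: )` grouping, which go slightly beyond what we have covered.

```python
import re

LOOSE = r".*@.*\.[a-z]{3}"
# Assumed stricter pattern: limited character sets and a top-level domain of 2+ letters.
STRICTER = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"

samples = ["myemail@email.com", "not an email @ all.really", "user@example.co.uk"]

for s in samples:
    loose_hit = re.search(LOOSE, s) is not None
    # fullmatch requires the WHOLE string to fit the pattern
    strict_hit = re.fullmatch(STRICTER, s) is not None
    print(f"{s!r}: loose={loose_hit} stricter={strict_hit}")
```

The loose pattern accepts the second sample (which is not an email) and rejects the third (which is), because it allows any characters around `@` and insists on exactly three lowercase letters after a dot.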

## Instructor Demo
Write a script that:
* Matches an email with the format `email@xxx.com`, such as `pop@email.com`. The username can be any length. The domain name can be any combination of numbers or letters, followed by a 3-letter top-level domain.
* Prints if a pattern is found or if there is not a match.

In [0]:
string = "myemail@email.com"
match = re.search(r".*@.*\.[a-z]{3}", string)
if match:
    print("pattern found")
else:
    print("no match")

You can create a RegEx for a telephone number, too. A regular telephone number consists of 10 digits separated by dashes (XXX-XXX-XXXX).

**Note**: You can represent a single digit by using `[0-9]` or `\d`.

Write a script that:
* Matches the telephone number format XXX-XXX-XXXX
* Prints if any phone number is a match

In [0]:
string = "909-228-0222"
match = re.search("[0-9]{3}-[0-9]{3}-[0-9]{4}", string)
if match:
    print("pattern found")
else:
    print("no match")

## Student Exercise

### Reference 
```[]``` Specifies a set of characters you wish to match

```[a-zA-Z]``` Matches all letters (lowercase and capital)

```[0-9]``` Matches all numbers 

``` *``` Matches zero or more occurrences of the pattern to its left

```{n}``` Means "n" repetitions of the pattern to its left

```\d ``` Matches any decimal digit. Equivalent to `[0-9]`

```\ ``` Escapes a special character so that it is matched literally (for example, `\.` matches a dot)

```{n,m}``` Matches the pattern from a minimum number of `n` to a maximum number of `m`
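
Here is a quick sketch that tries each reference item on a made-up string. The helper `first_match` is a hypothetical name introduced only for this illustration.

```python
import re

def first_match(pattern, text):
    """Return the first piece of text that matches pattern, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

print(first_match(r"[aeiou]", "sky box"))       # o     ([] : one character from the set)
print(first_match(r"[a-zA-Z]{5}", "42 Apples")) # Apple ({n} : exactly 5 letters)
print(first_match(r"ab*", "abbbc"))             # abbb  (* : zero or more "b")
print(first_match(r"ab*", "ac"))                # a     (* : zero "b" is fine too)
print(first_match(r"\d{2}", "room 101"))        # 10    (\d : digits)
print(first_match(r"\d{2,4}", "id 1234567"))    # 1234  ({n,m} : between 2 and 4)
print(first_match(r"\$", "cost: $5"))           # $     (\ : literal special character)
```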


### Problem 1 

Write a script that:
* Matches the telephone number format `(XXX)XXX-XXXX`
* Prints if any phone number is a match

### Problem 2 

Write a script that:
* Matches the email format `email@domain.gov.us` (the username can be any length; the domain name can be any letters but must end with `gov.us`)
* Prints if any email is a match 

## Instructor Review
### Problem 1

In [0]:
import re
string = "(909)224-1111"
match = re.search(r"\([0-9]{3}\)[0-9]{3}-[0-9]{4}", string)
if match:
    print("pattern found")
else:
    print("no match")

### Problem 2

In [0]:
import re
string = "alpha@mail.gov.us"
match = re.search(r".*@[a-z]*\.gov\.us", string)
if match:
    print("pattern found")
else:
    print("no match")

# Activity 8B: Finding Log Information

Let's build on what we started in the last activity. Try to create a RegEx for an IPv4 address. The pattern of an IPv4 address is four numbers separated by dots (`.`). Each number can be 1 to 3 digits long. We will use `\d` to represent the digits.

## Instructor Demo
Write a script that:
* Matches an IPv4 format, such as 192.168.1.1
* Prints if any string is a match

In [0]:
string = "192.168.1.1"
match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", string)
if match:
    print("pattern found")
else:
    print("no match")

The main difference here is that each number can be 1 to 3 digits long. So, in the curly brackets, we put a minimum of 1 and a maximum of 3.
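
Note that `\d{1,3}` only checks the number of digits, so a string like `999.999.999.999` has the right shape even though it is not a valid address. One possible follow-up, sketched below with the hypothetical helper `looks_like_ipv4`, is to capture each number with `()` and check its range in Python.

```python
import re

# Parentheses capture each number so we can inspect it after matching.
IPV4_SHAPE = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")

def looks_like_ipv4(text):
    """Return True if the whole text is four dot-separated numbers in 0-255."""
    match = IPV4_SHAPE.fullmatch(text)
    if match is None:
        return False
    return all(0 <= int(octet) <= 255 for octet in match.groups())

print(looks_like_ipv4("192.168.1.1"))      # True
print(looks_like_ipv4("999.999.999.999"))  # False
print(looks_like_ipv4("192.168.1"))        # False
```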

## Student Exercise
### Problem 1 
Write a script that:
* Matches a MAC address format `XX:XX:XX:XX:XX:XX` (made of hexadecimal characters)
* Prints if any string is a match or not

### Problem 2
Write a script that:
* Matches an IPv6 address 
* Prints if any string is an IPv6 

Hint: An IPv6 is composed of eight parts, each made of 0 to 4 hexadecimal characters.

## Instructor Review
### Problem 1

In [0]:
import re
string = "00:00:5e:00:53:af"
string = string.lower()
match = re.search("[a-f0-9]{2}:[a-f0-9]{2}:[a-f0-9]{2}:[a-f0-9]{2}:[a-f0-9]{2}:[a-f0-9]{2}", string)
if match:
    print("pattern found")
else:
    print("no match")
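
Writing the same group six times is easy to get wrong. As an alternative sketch, we can group "two hex characters plus a colon" and repeat the group. The `(?: )` grouping and the `re.IGNORECASE` flag are standard `re` features that go beyond our reference list, and `is_mac` is a hypothetical helper name.

```python
import re

# (?: ... ) groups "two hex digits plus a colon" so that {5} can repeat it;
# the sixth pair has no trailing colon. IGNORECASE replaces the .lower() call.
MAC_PATTERN = re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}", re.IGNORECASE)

def is_mac(text):
    """Return True if the whole text has the shape XX:XX:XX:XX:XX:XX."""
    return MAC_PATTERN.fullmatch(text) is not None

print(is_mac("00:00:5e:00:53:af"))  # True
print(is_mac("00:00:5E:00:53:AF"))  # True
print(is_mac("00:00:5e:00:53"))     # False (only five pairs)
```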

### Problem 2

In [0]:
import re
string = "aaa:dddd:eeee:eeee:eee2:2323:2323:ab4f"
match = re.search("[a-f0-9]{0,4}:[a-f0-9]{0,4}:[a-f0-9]{0,4}:[a-f0-9]{0,4}:[a-f0-9]{0,4}:[a-f0-9]{0,4}:[a-f0-9]{0,4}:[a-f0-9]{0,4}", string)
if match:
    print("pattern found")
else:
    print("no match")
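
A RegEx like this only checks the general shape. A pattern that allows empty groups (`[a-f0-9]{0,4}` repeated eight times) also accepts a string such as `:::::::`, and it misses shortened forms such as `2001:db8::1`. When we need a real validity check rather than a shape check, one option is Python's standard `ipaddress` module. The helper name `is_ipv6` below is hypothetical.

```python
import ipaddress

def is_ipv6(text):
    """Return True if text parses as a valid IPv6 address."""
    try:
        return isinstance(ipaddress.ip_address(text), ipaddress.IPv6Address)
    except ValueError:
        # ip_address raises ValueError for anything that is not a valid IP
        return False

print(is_ipv6("aaa:dddd:eeee:eeee:eee2:2323:2323:ab4f"))  # True
print(is_ipv6("2001:db8::1"))                             # True
print(is_ipv6("192.168.1.1"))                             # False (valid, but IPv4)
print(is_ipv6(":::::::"))                                 # False
```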

# Activity 8C: Parsing a Log File
## Introduction
Let's put this knowledge to use and use RegEx on the `auth.log` file.

Open the file:

In [0]:
with open(".voc/public/auth.log") as f:
    for line in f.readlines():
        print(line)

As you can see, some lines contain an IP address. These lines record remote SSH access. They include the source IP and login name. We can use RegEx to find the lines that record a remote login.

In [0]:
with open(".voc/public/auth.log") as f:
    for line in f.readlines():
        #print(line)
        match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)
        if match:
            print("pattern found")
        else:
            print("no match")

Print the lines that have an IP to see the structure.

In [0]:
with open(".voc/public/auth.log") as f:
    for line in f.readlines():
        #print(line)
        match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)
        if match:
            print(line)
        else:
            pass

If you want to print only the matched text and not the whole line, use `print(match.group(0))`.

In [0]:
with open(".voc/public/auth.log") as f:
    for line in f.readlines():
        #print(line)
        match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)
        if match:
            print(match.group(0))
        else:
            pass
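
Keep in mind that `match.group(0)` gives only the first match on a line. If a line might contain more than one IP, `re.findall` returns all of them as a list. The sketch below uses a made-up line for illustration; real `auth.log` lines will look different.

```python
import re

# Compiling once lets us reuse the same pattern for every line.
IP_PATTERN = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

# Made-up line for illustration only.
line = "connection from 10.0.0.5 forwarded to 192.168.1.20 port 22"

first = IP_PATTERN.search(line).group(0)  # first match only
every = IP_PATTERN.findall(line)          # list of all matches

print(first)  # 10.0.0.5
print(every)  # ['10.0.0.5', '192.168.1.20']
```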

We can see failed and accepted attempts.

## Instructor Demo

Write a script that:
* Iterates over the `auth.log` file
* Counts how many successful (accepted) logins were made from each source IP
* Puts the results in a dictionary 

In [0]:
import re
with open(".voc/public/auth.log") as f:
    dict_1 = {}
    for line in f.readlines():
        #print(line)
        match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)
        
        if match:
            #print(line)
            if "Accepted" in line:
                src_ip = match.group(0)
                if src_ip in dict_1:
                    dict_1[src_ip] += 1
                else:
                    dict_1[src_ip] = 1
        else:
            pass
        
    print(dict_1)

We can do the same thing for failed attempts.

Write a script that:
* Iterates over the `auth.log` file.
* Counts how many failed logins come from each source IP
* Puts the results in a dictionary 

In [0]:
with open(".voc/public/auth.log") as f:
    dict_1 = {}
    for line in f.readlines():
        #print(line)
        match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)
        
        if match:
            #print(line)
            if "Failed" in line:
                src_ip = match.group(0)
                if src_ip in dict_1:
                    dict_1[src_ip] += 1
                else:
                    dict_1[src_ip] = 1
        else:
            pass
        
    print(dict_1)
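
The accepted and failed scripts differ only in the keyword they look for, so the counting can be shared. Below is one possible sketch using `collections.Counter` from the standard library. The function name `count_ips` and the sample lines are assumptions made for illustration, not the contents of the real `auth.log`.

```python
import re
from collections import Counter

IP_PATTERN = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def count_ips(lines, keyword):
    """Count the first IP on every line that contains keyword."""
    counts = Counter()  # missing keys start at 0, so no if/else is needed
    for line in lines:
        match = IP_PATTERN.search(line)
        if match and keyword in line:
            counts[match.group(0)] += 1
    return counts

# Made-up lines in the general style of an SSH auth log.
sample = [
    "sshd[101]: Failed password for root from 10.0.0.5 port 4242 ssh2",
    "sshd[102]: Accepted password for alice from 10.0.0.7 port 4243 ssh2",
    "sshd[103]: Failed password for root from 10.0.0.5 port 4244 ssh2",
    "cron[104]: session opened for user root",
]

print(dict(count_ips(sample, "Failed")))    # {'10.0.0.5': 2}
print(dict(count_ips(sample, "Accepted")))  # {'10.0.0.7': 1}
```

To run it on a real file, pass the open file object instead of `sample`, since a file can be iterated line by line.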

## Student Exercise

### Problem 1
Write a script that:
* Iterates over the `ftp.log` file
* Counts how many successful logins came from each IP
* Puts the results in a dictionary

## Instructor Review

In [0]:
import re
with open(".voc/public/ftp.log") as f:
    dict_1 = {}
    for line in f.readlines():
        #print(line)
        match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)
        
        if match:
            #print(line)
            if "OK" in line:
                src_ip = match.group(0)
                if src_ip in dict_1:
                    dict_1[src_ip] += 1
                else:
                    dict_1[src_ip] = 1
        else:
            pass
        
    print(dict_1)

# Activity 8D: Reading a Script

Let's take a look at an existing script. Review each line with the class and add `#` code comments to explain what each line means.

In [0]:
import os
from datetime import datetime


def searchFor(data, ip, outputFile):
    dataLog = []

    for line in data:
        
        if ip in line:
            print(line)
            dataLog.append(line)
            with open(outputFile, 'a') as newFile:
                newFile.write(line)
    return dataLog

def narrowDown(searchQuery,dataLog):
    for line in dataLog:
        if searchQuery in line:
            print(line)



def main():
    inputFile = input('Log file: ')

    ip = input('IP or string to search for: ') 

    now = datetime.now()
    timestamp = now.strftime('%m.%d.%H.%M') 
    outputFile = f'./log-scanner-findings/{ip}--{timestamp}.txt'


    os.makedirs(os.path.dirname(outputFile), exist_ok=True)


    with open(inputFile, 'rt') as f:
        data = f.readlines()
        dataLog = searchFor(data, ip, outputFile)
    
    print(f'{ip} found {len(dataLog)} times')


    searchQuery = input('Narrow it down, next string:')
    narrowDown(searchQuery,dataLog)


main()

## Instructor Review

In [0]:
# Tool to assist in manually reviewing server logs.
# Useful in creating samples for reporting, or just counting occurrences.
# Searches log files for lines containing a given IP (or any other string)
# and appends each line found to a new file with the IP/string as name.

# To Use:
#   1. Run script
#   2. Enter the path to the log file
#   3. Enter the IP/string you want to find
#   4. Script will create ./log-scanner-findings/<IP>--<timestamp>.txt containing all matching lines

import os
from datetime import datetime


def searchFor(data, ip, outputFile):
    dataLog = []  # collects every matching line

    for line in data:
        
        if ip in line:
            print(line)
            dataLog.append(line)
            # append the matching line to the output file
            with open(outputFile, 'a') as newFile:
                newFile.write(line)
    return dataLog

def narrowDown(searchQuery,dataLog):
    # print only the earlier matches that also contain the second string
    for line in dataLog:
        if searchQuery in line:
            print(line)



def main():
    inputFile = input('Log file: ') # user enters the path to the log file

    ip = input('IP or string to search for: ') # user enters the IP/string to search for

    # Using timestamp for output filename, to avoid appending to an existing file.
    # Also I want the timestamps to sort lexicographically
    now = datetime.now()
    timestamp = now.strftime('%m.%d.%H.%M') # format timestamp as MM.DD.HH.MM ex. 10.03.20.40
    outputFile = f'./log-scanner-findings/{ip}--{timestamp}.txt'

    # create the output directory if it doesn't exist
    os.makedirs(os.path.dirname(outputFile), exist_ok=True)

    # read the log file and save the matching lines to the output file
    with open(inputFile, 'rt') as f:
        data = f.readlines()
        dataLog = searchFor(data, ip, outputFile)
    
    print(f'{ip} found {len(dataLog)} times')

# search again for another string
    searchQuery = input('Narrow it down, next string:')
    narrowDown(searchQuery,dataLog)


main()