# Processing Log Files

## 1. What Are Log Files?


So far we've learned a bunch of new concepts that let us interact with the system. We've learned how to read and write files, how to use regular expressions to process them, how to interact with the system shell, and how to execute commands in that environment. That's really impressive. So take a quick moment to think about how far you've come and about all the new cool stuff you'll be able to do by interacting with the system. 

Now we're going to take a look at how we can use these tools to help us with our day-to-day work. In the next few videos, we'll dive into concrete examples centered around processing chunks of data, the kind of data that you might find in a syslog file or a web request log. The different events that happen in programs that are running on a system and aren't connected to a terminal are usually written to log files. Log files contain a lot of useful information, particularly when you're trying to debug a tricky problem that's happening on a computer. On the flip side, it can be overwhelming to try to find something inside a log file that contains a whole lot of lines with a whole lot of things in them. So it's a good idea to learn how we can process these files and get our tools to extract the information we want out of them.

To do this, we'll go back to our knowledge of regular expressions. Using regexes in our scripts gives us a great deal of flexibility when processing log files and other text data sources too. In a script, we can program any kind of behavior we want, so we can manipulate and process text data and get the results we need. We're going to show you how you can do this in the next couple of videos. All right. Let's get started.

## 2. Filtering Log Files with Regular Expressions

When working with log files in scripts, our first step is usually to open them so our code can access their contents. We've discussed various methods of operating on files. The usual technique is to call the open function, which returns a file object, and then iterate through each of its lines using a for-loop. For example, to open a file received as a parameter of our script, we can use code like this.

```python
import sys

logfile = sys.argv[1]
with open(logfile) as file:
    for line in file:
        print(line.strip())
```
Remember that for performance reasons, when files are large, it's generally a good practice to read them line by line instead of loading the entire contents into memory. For our example, let's say the log file contains these messages.

```
Jul 6 14:01:23 computer.name CRON[2422]: USER (good_user)
Jul 6 14:02:11 computer.name jam_tag=pism[4322]: USER (good_user)
Jul 6 14:03:45 computer.name CRON[2422]: USER (good_user)
Jul 6 14:05:33 computer.name CRON[2422]: USER (good_user)
Jul 6 14:06:29 computer.name jam_tag=pism[4322]: USER (naughty_user)
Jul 6 14:07:59 computer.name jam_tag=pism[4322]: USER (good_user)
Jul 6 14:08:08 computer.name CRON[2422]: USER (naughty_user)
```

The server that generates this log file has been acting strangely, and we suspect it's due to a cron job started by one of the system administrators. You may remember that cron jobs are used to schedule scripts on Unix-based operating systems. To find out what's happening with the server, we want to audit the log files and see exactly who's been launching cron jobs. By looking at the sample log, we can see that the lines that'll be most interesting to us are the ones that contain the CRON substring. These lines also show the user who started the cron job wrapped in parentheses. With this info, we can ignore any line without the CRON substring in it. We can check for this using the "in" keyword. This check is case-sensitive, so we look for the uppercase string exactly as it appears in the log.

```python
import sys

logfile = sys.argv[1]
with open(logfile) as file:
    for line in file:
        if 'CRON' not in line:
            # Tells loop to go to next line
            continue
        print(line.strip())
```
Here, we're using the "continue" keyword, which tells our loop to go to the next element. So if the line doesn't contain the string that we're looking for, we'll skip it and go to the next line. Once we know we're processing the right log line, we can use our knowledge of regular expressions to extract the username. We can do this in a bunch of different ways. In this example, we'll use escape characters, capture groups, and the end-of-string anchor. Before we add the expression to our script, we'll construct it and test it out in an interpreter.

```python
pattern = r'USER \((\w+)\)$'
```

Let's take a closer look at this expression. Since the username is found at the end of the log line, we use the dollar sign anchor to only match text that is at the end of the line. To find the username, we look for the word USER followed by a string wrapped in parentheses, as that's how these lines are structured. Because parentheses have a special meaning in regular expressions, we need to escape those literal parentheses with a backslash. Since we want to extract the actual username, we use another pair of parentheses to create a capturing group. For the username itself, we're matching one or more alphanumeric characters or underscores by using backslash w plus.
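As a rough illustration of the pieces just described, the same expression can be written one part per line using the `re.VERBOSE` flag. This is only a sketch: the name `verbose_pattern` is made up for this example, and the literal space after USER is written as `[ ]` because verbose mode ignores unescaped whitespace in the pattern.

```python
import re

# Same expression as above, spelled out piece by piece (illustrative name).
verbose_pattern = re.compile(r"""
    USER[ ]     # the literal word USER followed by one space
    \(          # an escaped, literal opening parenthesis
    (\w+)       # capture group 1: letters, digits, or underscores
    \)          # an escaped, literal closing parenthesis
    $           # anchor: end of the line
""", re.VERBOSE)

sample = 'Jul 6 14:01:23 computer.name CRON[2422]: USER (good_user)'
match = verbose_pattern.search(sample)
print(match[1])  # good_user

# If anything follows the closing parenthesis, the $ anchor prevents a match.
print(verbose_pattern.search(sample + ' extra'))  # None
```

The verbose form behaves the same as the one-line pattern on these lines; it just makes each piece easier to point at.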

With that cleared up, let's test it out with a sample line.

```python
import re
pattern = r'USER \((\w+)\)$'
line = 'Jul 6 14:08:08 computer.name CRON[2422]: USER (naughty_user)'
result = re.search(pattern, line)
print(result[1])
```

```
naughty_user
```


Looks like we've got a naughty user. On the plus side, it seems our regular expression works correctly. We can now use this expression in our code.

```python
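import sys
import re

logfile = sys.argv[1]
# Define the pattern once, outside the loop
pattern = r'USER \((\w+)\)$'
with open(logfile) as file:
    for line in file:
        if 'CRON' not in line:
            # Tells loop to go to next line
            continue
        result = re.search(pattern, line)
        if result is None:
            # CRON line without a trailing "USER (name)"; skip it
            continue
        print(result[1])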
```

```
good_user
good_user
good_user
naughty_user
```
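To try the same logic without a log file on disk, one option is to wrap the steps in a small function that accepts any iterable of lines. The sketch below is just one way to do it: the names `USER_PATTERN` and `cron_users` are made up for this example, and the "session closed" line is a hypothetical entry added only to show what happens when a CRON line has no username at the end.

```python
import re

# Illustrative names; the pattern is compiled once and reused for every line.
USER_PATTERN = re.compile(r'USER \((\w+)\)$')

def cron_users(lines):
    """Yield the username from each CRON line that ends in USER (name)."""
    for line in lines:
        if 'CRON' not in line:
            # Not a CRON entry, go to the next line
            continue
        result = USER_PATTERN.search(line.strip())
        if result is None:
            # CRON entry without a username at the end, skip it
            continue
        yield result[1]

sample_lines = [
    'Jul 6 14:01:23 computer.name CRON[2422]: USER (good_user)\n',
    'Jul 6 14:02:11 computer.name jam_tag=pism[4322]: USER (good_user)\n',
    # Hypothetical line, not from the sample log above
    'Jul 6 14:04:01 computer.name CRON[2422]: session closed\n',
    'Jul 6 14:08:08 computer.name CRON[2422]: USER (naughty_user)\n',
]

for user in cron_users(sample_lines):
    print(user)
```

Because a file object is also an iterable of lines, the same function could be called as `cron_users(file)` inside the `with open(logfile) as file:` block.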