# Processing log files

### What are Log Files

The different events that happen in programs that are running in a system and aren't connected to a terminal are usually sent to log files. Log files contain a lot of useful information, particularly when
you're trying to debug a tricky problem that's happening on a computer.

On the flip side, it can be overwhelming to find something inside a log file that contains a huge number of lines, each packed with details. So it's a good idea to learn how to process these files and have our tools extract the information we want from them.

To do this, we'll go back to our knowledge of regular expressions. Using regexes in our scripts gives us a great deal of flexibility when processing log files and other text data sources too.

In a script, we can program any kind of behavior we want, so we can manipulate and process text data and get the results we need.



For example, to open a file received as a parameter of our script, we can use code like this.


In [None]:
# we have to create a python script to read a log of users running CRON jobs


import sys

logfile = sys.argv[1] # The script needs to accept a command line argument for the path to the log file.
with open(logfile) as f:
    for line in f:
        print(line.strip())

Remember that for performance reasons, when files are large, it's generally a good practice to read them line by line instead of loading the entire contents into memory. 
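
To make the difference concrete, here is a minimal sketch that contrasts the two approaches. The temporary file and the `count_lines` helper are assumptions made up for this illustration, so the snippet can run on its own without a real log.

```python
import os
import tempfile


def count_lines(path):
    """Count lines by iterating over the file object, one line at a time."""
    total = 0
    with open(path) as f:
        for _ in f:  # only the current line is held in memory
            total += 1
    return total


# Hypothetical throwaway file, created only so this sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    sample_path = tmp.name

line_count = count_lines(sample_path)

with open(sample_path) as f:
    whole_file = f.read()  # by contrast, this loads the entire file at once

os.remove(sample_path)

print(line_count)
print(len(whole_file.splitlines()))
```

Both approaches see the same lines. The difference is that the loop only ever holds one line at a time, which matters when the log is very large.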

For our example, let's say the log file contains these messages.



The server that generates this log file has been acting strangely and we suspect it's due to a **Cron job** started by one of the system administrators.

**Cron jobs** are used to schedule scripts on UNIX-based operating systems. To find out what's happening with the server, we want to audit the log files and see exactly who's been launching CRON jobs.

By looking at the sample log, we can see that the lines most interesting to us are the ones that contain the CRON substring.

These lines also show the user who started the Cron job, wrapped in parentheses. With this info, we can ignore any line without the
CRON substring in it.

We can check for this using the **in** keyword. 

In [None]:
import sys

logfile = sys.argv[1]
with open(logfile) as f:
    for line in f:
        if "CRON" not in line:
            continue
        print(line.strip())

Once we've kept only the lines that matter, we can use a regular expression to extract the username. The pattern below matches the word USER followed by a word wrapped in parentheses at the end of the line, and captures that word in a group.

In [6]:
import re

pattern = r"USER \((\w+)\)$"  # raw string, so the backslashes reach the regex engine unchanged
line = "Jul 6 14:03:01 computer.name CRON[29440]: USER (naughty_user)"
result = re.search(pattern, line)
print(result)
print(result[1])

<re.Match object; span=(42, 61), match='USER (naughty_user)'>
naughty_user
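
Reading the pattern piece by piece: `USER ` matches that literal text, `\(` and `\)` match literal parentheses, `(\w+)` captures one or more letters, digits, or underscores, and `$` anchors the match to the end of the line. The sketch below wraps the search in a small helper to show what happens when a line doesn't match. The `extract_user` function and the second log line are assumptions invented for this illustration.

```python
import re

# Raw string so the backslashes are passed to the regex engine as written.
pattern = r"USER \((\w+)\)$"

matching = "Jul 6 14:03:01 computer.name CRON[29440]: USER (naughty_user)"
# Hypothetical line, made up for this example: it has no "USER (...)" at the end.
non_matching = "Jul 6 14:04:01 computer.name CRON[29440]: job finished"


def extract_user(line):
    """Return the captured username, or None when the pattern doesn't match."""
    result = re.search(pattern, line)
    if result is None:
        return None
    return result[1]


print(extract_user(matching))      # naughty_user
print(extract_user(non_matching))  # None
```

Because `re.search` returns `None` when nothing matches, indexing the result directly would raise a `TypeError` on such lines, so it's worth checking for `None` first.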


In [None]:
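# final code

import re
import sys

logfile = sys.argv[1]
pattern = r"USER \((\w+)\)$"  # raw string keeps the regex backslashes intact
with open(logfile) as f:
    for line in f:
        if "CRON" not in line:
            continue
        result = re.search(pattern, line)
        if result is None:  # a CRON line without "USER (name)" at the end
            continue
        print(result[1])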

### Making Sense out of Data

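To make sense of this data, we can count how many times each username appears, using a dictionary. We take the current value from the dictionary with the get method, passing a default value of zero, so that when the key is not present in the dictionary we still get a value to work with. We then add one and set the result as the new
value associated with that key.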


In [8]:
username = {}
name = "good_user"
username[name] = username.get(name, 0) + 1
print(username)
username[name] = username.get(name, 0) + 1
print(username)

{'good_user': 1}
{'good_user': 2}
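
As a side note, the standard library's `collections.Counter` produces the same tallies. This is an alternative to the `get` approach used in this section, not a replacement for it, and the list of names below is made up for the illustration.

```python
from collections import Counter

# Hypothetical usernames, invented for this example.
names = ["good_user", "good_user", "naughty_user"]

# Manual counting with dict.get, as in the cell above.
manual = {}
for name in names:
    manual[name] = manual.get(name, 0) + 1

# Counter does the same bookkeeping for us.
counted = Counter(names)

print(manual)
print(dict(counted))
```

Both dictionaries end up with the same keys and counts. The `get` version makes the logic explicit, which is useful while learning how the counting works.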



We need to initialize an empty dictionary at the beginning of our code, before the loop starts.

In [None]:
import re
import sys

logfile = sys.argv[1]
username = {}  # maps each username to the number of CRON jobs they started
pattern = r"USER \((\w+)\)$"

with open(logfile) as f:
    for line in f:
        if "CRON" not in line:
            continue
        result = re.search(pattern, line)

        if result is None:
            continue
        name = result[1]
        username[name] = username.get(name, 0) + 1

print(username)
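
To try the counting logic without a real log file, one option is to move it into a function that accepts any iterable of lines. The sketch below does that and prints the users from most to least frequent. The function name `count_cron_users` and all of the sample lines are assumptions made up for this illustration.

```python
import re

USER_PATTERN = re.compile(r"USER \((\w+)\)$")


def count_cron_users(lines):
    """Return a dict mapping each username to its number of CRON entries."""
    counts = {}
    for line in lines:
        if "CRON" not in line:
            continue
        result = USER_PATTERN.search(line)
        if result is None:
            continue
        name = result[1]
        counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical log lines, invented for this example.
sample_lines = [
    "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)\n",
    "Jul 6 14:02:08 computer.name sshd[1234]: session opened\n",
    "Jul 6 14:03:01 computer.name CRON[29440]: USER (naughty_user)\n",
    "Jul 6 14:04:01 computer.name CRON[29440]: USER (naughty_user)\n",
]

counts = count_cron_users(sample_lines)

# Sort by count, highest first, so the busiest user is listed at the top.
for name, total in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(name, total)
```

Since an open file is also an iterable of lines, the same function could be called as `count_cron_users(f)` inside the `with open(logfile) as f:` block.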