# Parsing Logs with Python and Regexp

In [1]:
import re
import pandas as pd

### What is a regex (regular expression)

A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.<sub>[<a href='https://en.wikipedia.org/wiki/Regular_expression'>Wikipedia</a>]</sub>

<b><a href='https://pythex.org'>Python Regular Expression Editor</a></b> can be used to build regular expression code. The website provides an easy to understand view of regex code selection.

In [2]:
def log_to_data_frame(file: str, regex: str, columns: list) -> pd.DataFrame:
    df = pd.DataFrame(columns=columns)

    with open(file, mode='r', encoding='utf8', newline='') as log:
        for line in log:
            data = log.readline().replace('\n', ' ').strip()
            
            try:
                row = re.compile(regex).findall(data)
            except:
                print('Unable to parse string.')
            else:
                df = pd.concat(
                    [
                        df,
                        pd.DataFrame(
                            row,
                            columns=['DateTime', 'Caller', 'Task', 'Code', 'Detail']
                        )
                    ], ignore_index=True
                )

    return df

In [3]:
regex = r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}-\d{2})\s([\w-]+)\s([a-zA-Z0-9_ ]*)\[([\d]*)\]:\s(.*)'
columns = ['DateTime', 'Caller', 'Task', 'Code', 'Detail']
file = 'install.log'

df = log_to_data_frame(file, regex, columns)

In [4]:
df.head(10)

Unnamed: 0,DateTime,Caller,Task,Code,Detail
0,2020-12-14 10:54:21-08,localhost,softwareupdate_firstrun_tasks,198,Rebuilding Tag-Cache inside of ProductMetadata...
1,2020-12-14 10:54:22-08,localhost,softwareupdated,230,Initializing SoftwareUpdateMacController (SUMa...
2,2020-12-14 10:54:22-08,localhost,softwareupdated,230,SUOSUAlarmObserver: Setting alarm event stream...
3,2020-12-14 10:54:23-08,localhost,softwareupdated,230,SUOSUServiceDaemon: Error reading /var/folders...
4,2020-12-14 10:54:23-08,localhost,softwareupdated,230,authorizeWithEmptyAuthorizationForRights: Requ...
5,2020-12-14 10:54:23-08,localhost,softwareupdated,230,"Previous System Version : (null), Current Syst..."
6,2020-12-14 10:54:24-08,localhost,softwareupdated,230,SUStatisticsManager: Successfully reported sta...
7,2020-12-14 10:54:24-08,localhost,softwareupdated,230,BackgroundActivity: Scheduling one-time backgr...
8,2020-12-14 10:54:25-08,MacBook-Air,Language Chooser,192,LCA: No networks found in Wifi scan.
9,2020-12-14 10:54:25-08,MacBook-Air,Installer Progress,52,IASGetCurrentInstallPhase: Unable to get the c...


In [5]:
df.to_csv('output.csv',index=False)