# Real data analysis in Python

Konrad Brodzik

20 April 2023

### Workshop format

Live demo, but here's a link to the data and some setup instructions if you'd like to follow along: https://github.com/Kondziowy/data-analysis-workshop

### Problem statement

Sally: "The website is loading slowly, can you fix it?"

Me: "What do you mean?"

Sally: "Umm, it loads slower than last week"

### App - description

- Web application - internal document management system
- 200 daily active users

![System diagram](img/system-diagram.png "Diagram")

### Our goals

1. Parse data that's available to us
2. De-noise the data
3. Have a quantitative measure of what's going wrong
4. Correlate different sources of data to find out the root cause of the problem
5. Aggregate data so we can use it for future failure detection

### Front End

What does it do? Serves static files (JavaScript, images) and forwards all other requests to Gunicorn.

Example log:
```
127.0.0.1 - - [01/Apr/2023:11:27:40 +0000] "GET /index.html HTTP/1.1" 200 1200 0.002617
127.0.0.1 - - [01/Apr/2023:11:27:45 +0000] "GET /images/logo.png HTTP/1.1" 200 15734 0.007353
127.0.0.1 - - [01/Apr/2023:11:28:10 +0000] "POST /login HTTP/1.1" 302 - 0.014992
127.0.0.1 - - [01/Apr/2023:11:29:05 +0000] "GET /dashboard HTTP/1.1" 200 4369 0.100576
127.0.0.1 - - [01/Apr/2023:11:30:20 +0000] "GET /orders HTTP/1.1" 200 3520 0.006817
```

### Possible approaches

- Dedicated library?
- Using find and split
- Regular Expressions
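
For comparison with the regular-expression approach, here is a minimal sketch of the "find and split" option applied to one line of the example log. The function name `parse_with_split` is hypothetical, and the sketch assumes the request path contains neither spaces nor double quotes, which may not hold for real traffic.

```python
def parse_with_split(line: str) -> dict:
    """Parse one access-log line using only str.find and str.split."""
    # The timestamp sits between the first '[' and the first ']'.
    date_start = line.find('[') + 1
    date_end = line.find(']')
    # The request line sits between the only pair of double quotes.
    before_quote, request_line, after_quote = line.split('"')
    method, path, protocol = request_line.split()
    # After the closing quote: status code, response size, request time.
    status, size, request_time = after_quote.split()
    return {
        'IP': before_quote.split()[0],
        'Date': line[date_start:date_end],
        'Method': method,
        'Request': path,
        'StatusCode': status,
        'DataSize': size,
        'RequestTime': request_time,
    }


sample = '127.0.0.1 - - [01/Apr/2023:11:27:40 +0000] "GET /index.html HTTP/1.1" 200 1200 0.002617'
parsed = parse_with_split(sample)
print(parsed)
```

This is short, but every positional assumption is implicit, so a malformed line raises a `ValueError` during unpacking instead of being skipped.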

In [26]:
import re
from collections import defaultdict

# Named groups become the column names. Lines that don't fit the pattern
# (for example a "-" instead of a numeric size, or a method other than GET/POST)
# do not match and are skipped below.
log_pattern = re.compile(
    r'(?P<IP>\d+\.\d+\.\d+\.\d+)\s-\s-\s\[(?P<Date>.*?)\]\s\"(?P<Method>POST|GET) (?P<Request>.*?) HTTP/1.1\"\s(?P<StatusCode>\d+)\s(?P<DataSize>\d+)\s(?P<RequestTime>.*?)$')

data_by_column = defaultdict(list)
# The context manager closes the file even if parsing raises an exception.
with open('data\\access.log.small', 'r') as log_file:  # replace with the path to your log file
    for line in log_file:
        match = log_pattern.match(line)
        if match is None:
            continue  # skip lines that don't match the pattern
        for column, value in match.groupdict().items():
            data_by_column[column].append(value)

In [27]:
print(data_by_column)

defaultdict(<class 'list'>, {'IP': ['127.0.0.1', '127.0.0.1', '127.0.0.1', '127.0.0.1'], 'Date': ['01/Apr/2023:11:27:40 +0000', '01/Apr/2023:11:27:45 +0000', '01/Apr/2023:11:29:05 +0000', '01/Apr/2023:11:30:20 +0000'], 'Method': ['GET', 'GET', 'GET', 'GET'], 'Request': ['/index.html', '/images/logo.png', '/dashboard', '/orders'], 'StatusCode': ['200', '200', '200', '200'], 'DataSize': ['1200', '15734', '4369', '3520'], 'RequestTime': ['0.002617', '0.007353', '0.100576', '0.006817']})
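
The output has four entries, and the `POST /login` line from the example log is not among them. That fits the pattern: `DataSize` requires digits, and that line logs `-` as its size. Assuming `access.log.small` holds the example lines shown earlier, a relaxed variant of the pattern could keep such lines. The name `relaxed_pattern` and the looser method and protocol groups are illustrative choices, not part of the workshop code.

```python
import re

# Variant of the pattern above: accepts "-" as the size, any upper-case
# HTTP method, and any HTTP/x.y protocol version.
relaxed_pattern = re.compile(
    r'(?P<IP>\d+\.\d+\.\d+\.\d+)\s-\s-\s\[(?P<Date>.*?)\]\s'
    r'"(?P<Method>[A-Z]+) (?P<Request>\S+) HTTP/\d\.\d"\s'
    r'(?P<StatusCode>\d+)\s(?P<DataSize>\d+|-)\s(?P<RequestTime>\S+)$'
)

line = '127.0.0.1 - - [01/Apr/2023:11:28:10 +0000] "POST /login HTTP/1.1" 302 - 0.014992'
match = relaxed_pattern.match(line)
fields = match.groupdict()
print(fields)
```

With this variant, `DataSize` can hold the string `-`, so it needs handling before any numeric conversion.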


In [28]:
import pandas as pd

front_end_requests = pd.DataFrame(
  data_by_column
)
front_end_requests.head()

|   | IP | Date | Method | Request | StatusCode | DataSize | RequestTime |
|---|---|---|---|---|---|---|---|
| 0 | 127.0.0.1 | 01/Apr/2023:11:27:40 +0000 | GET | /index.html | 200 | 1200 | 0.002617 |
| 1 | 127.0.0.1 | 01/Apr/2023:11:27:45 +0000 | GET | /images/logo.png | 200 | 15734 | 0.007353 |
| 2 | 127.0.0.1 | 01/Apr/2023:11:29:05 +0000 | GET | /dashboard | 200 | 4369 | 0.100576 |
| 3 | 127.0.0.1 | 01/Apr/2023:11:30:20 +0000 | GET | /orders | 200 | 3520 | 0.006817 |
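
Every column in this frame still holds strings, because regex groups always capture text. Before computing anything quantitative, the columns probably need converting. The following is a minimal sketch on two of the rows above; the variable name `requests_df` is hypothetical, and the timestamp format string is inferred from the example log.

```python
import pandas as pd

# Two rows copied from the parsed output, still as strings.
requests_df = pd.DataFrame({
    'Date': ['01/Apr/2023:11:27:40 +0000', '01/Apr/2023:11:29:05 +0000'],
    'StatusCode': ['200', '200'],
    'DataSize': ['1200', '4369'],
    'RequestTime': ['0.002617', '0.100576'],
})

# Parse the access-log timestamp, including its UTC offset.
requests_df['Date'] = pd.to_datetime(requests_df['Date'], format='%d/%b/%Y:%H:%M:%S %z')
requests_df['StatusCode'] = requests_df['StatusCode'].astype(int)
# errors='coerce' turns non-numeric sizes (such as "-") into NaN instead of raising.
requests_df['DataSize'] = pd.to_numeric(requests_df['DataSize'], errors='coerce')
requests_df['RequestTime'] = requests_df['RequestTime'].astype(float)

print(requests_df.dtypes)
print(requests_df['RequestTime'].describe())
```

Once `RequestTime` is numeric, summary statistics such as `describe()` give a first quantitative view of how slow requests are.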


In [31]:
import seaborn as sns

### Front end logs - summary
- 

### Back End

What does it do? Processes business logic, wraps database requests.

Example log:
```
[2022-04-01 13:37:00 +0000] [12345] [ellen] [api.py:85] [INFO] GET /documents?page=X&per_page=Y&filter=Z 200 3520 0.006817
[2022-04-01 13:37:00 +0000] [12345] [] [api.py:86] [INFO] Recording operation GET from user ellen
[2022-04-01 13:37:00 +0000] [12345] [ellen] [api.py:85] [INFO] GET /users 200 3128 0.006817
```
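
The back-end format can be approached the same way as the front-end one. Below is a minimal sketch using two patterns: one for the bracketed prefix and one for request lines inside the message. The group names (`ProcessId`, `Source`, `LineNo` and so on) are guesses at what each field means, based only on the three example lines.

```python
import re

# Bracketed prefix: timestamp, process id (assumed), user (may be empty),
# source file and line number, log level, then a free-form message.
backend_pattern = re.compile(
    r'\[(?P<Timestamp>[^\]]*)\]\s'
    r'\[(?P<ProcessId>\d+)\]\s'
    r'\[(?P<User>[^\]]*)\]\s'
    r'\[(?P<Source>[^\]:]+):(?P<LineNo>\d+)\]\s'
    r'\[(?P<Level>[A-Z]+)\]\s'
    r'(?P<Message>.*)$'
)
# Only some messages describe an HTTP request; those get extra columns.
request_pattern = re.compile(
    r'(?P<Method>[A-Z]+)\s(?P<Request>\S+)\s(?P<StatusCode>\d{3})\s'
    r'(?P<DataSize>\d+|-)\s(?P<RequestTime>\d+\.\d+)$'
)

lines = [
    '[2022-04-01 13:37:00 +0000] [12345] [ellen] [api.py:85] [INFO] GET /documents?page=X&per_page=Y&filter=Z 200 3520 0.006817',
    '[2022-04-01 13:37:00 +0000] [12345] [] [api.py:86] [INFO] Recording operation GET from user ellen',
    '[2022-04-01 13:37:00 +0000] [12345] [ellen] [api.py:85] [INFO] GET /users 200 3128 0.006817',
]

records = []
for line in lines:
    match = backend_pattern.match(line)
    if match is None:
        continue  # skip lines that don't follow the bracketed prefix
    record = match.groupdict()
    request = request_pattern.match(record['Message'])
    if request is not None:
        record.update(request.groupdict())
    records.append(record)

for record in records:
    print(record)
```

Splitting the work into two patterns keeps non-request messages, such as the "Recording operation" line, in the data set instead of dropping them.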