3. Extract data from logs

Read in network.log and extract source IP, destination IP, protocol and data size.

Expected output:

Source: 10.0.0.1 | Destination: 10.0.0.2 | Protocol: TCP | Bytes: 1024
Source: 10.0.0.2 | Destination: 10.0.0.3 | Protocol: UDP | Bytes: 2048
Source: 10.0.0.3 | Destination: 10.0.0.1 | Protocol: TCP | Bytes: 512

Data Transfer Summary:
TCP: 1536 bytes
UDP: 2048 bytes


Hint: a complex regular expression would work, but it is simpler to inspect how the lines are formatted and build the extraction around that.

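## The find() method

- find() is a string method that returns the position of the first occurrence of a substring in the original string.
- line.find("Source:") returns the start position (index) of "Source:" in line.
- If "Source:" does not occur in the string, find() returns -1.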
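
The behaviour described in the bullets above can be checked on a single string. This is a minimal sketch: the sample line, including its timestamp prefix, is a made-up assumption and may not match the real contents of network.log.

```python
# Hypothetical log line; the prefix before "Source:" is an assumption.
line = "2024-01-01 12:00:00 Source: 10.0.0.1 | Destination: 10.0.0.2 | Protocol: TCP | Bytes: 1024"

start = line.find("Source:")     # index where "Source:" begins
missing = line.find("Target:")   # -1, because "Target:" never occurs

# Slice only when the substring was found; line[-1:] would silently
# return the last character instead of failing.
extracted = line[start:] if start != -1 else None

print(start, missing)
print(extracted)
```

Checking for -1 before slicing matters because a negative index is still a valid slice start in Python.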

a)

In [82]:
# Method 1: str.find()

network = []
with open("data/network.log", "r") as file:
    for line in file:
        line = line.strip()
        source_line = line.find("Source:")   # start index of "Source:"
        if source_line != -1:                # make sure "Source:" is present
            extracted_data = line[source_line:]    # slice from "Source:" to the end of the line
            network.append(extracted_data)
            print(extracted_data)



Source: 10.0.0.1 | Destination: 10.0.0.2 | Protocol: TCP | Bytes: 1024
Source: 10.0.0.2 | Destination: 10.0.0.3 | Protocol: UDP | Bytes: 2048
Source: 10.0.0.3 | Destination: 10.0.0.1 | Protocol: TCP | Bytes: 512


In [83]:
network

['Source: 10.0.0.1 | Destination: 10.0.0.2 | Protocol: TCP | Bytes: 1024',
 'Source: 10.0.0.2 | Destination: 10.0.0.3 | Protocol: UDP | Bytes: 2048',
 'Source: 10.0.0.3 | Destination: 10.0.0.1 | Protocol: TCP | Bytes: 512']

## re

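- .* is the core of this regular expression:
- . matches any single character except a newline.
- * means zero or more repetitions of the preceding element. Together, .* matches everything after "Source:" up to the end of the line.

1. match.group()     
re.search() returns a match object if the pattern is found and None otherwise. Calling group() on the match object with no arguments returns the entire matched string.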
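
As a small illustration of the points above, the sketch below runs re.search() on two strings. Both strings are invented for the example and are only assumed to resemble lines in network.log.

```python
import re

# Hypothetical lines; the "[INFO]" prefix is an assumption.
line = "[INFO] Source: 10.0.0.2 | Destination: 10.0.0.3 | Protocol: UDP | Bytes: 2048"
other = "[INFO] service started"

match = re.search(r"Source:.*", line)      # match object: pattern found
no_match = re.search(r"Source:.*", other)  # None: pattern not found

# group() with no arguments returns the whole matched text.
tail = match.group() if match else None

print(tail)
print(no_match)
```

Because re.search() returns None when nothing matches, the result has to be tested before group() is called.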

In [84]:
# Method 2: regular expression

import re
network = []
with open("data/network.log", "r") as file:
    for line in file:
        line = line.strip()
        match = re.search(r"Source:.*", line)    # match from "Source:" to the end of the line
        if match:
            extracted_data = match.group()       # the matched part of the line
            network.append(extracted_data)
            print(extracted_data)


Source: 10.0.0.1 | Destination: 10.0.0.2 | Protocol: TCP | Bytes: 1024
Source: 10.0.0.2 | Destination: 10.0.0.3 | Protocol: UDP | Bytes: 2048
Source: 10.0.0.3 | Destination: 10.0.0.1 | Protocol: TCP | Bytes: 512


b)

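Convert the three lines above into a list of three dictionaries, with the "Bytes" value converted to int.       
The conversion is needed because everything read from a file or a string is stored as a str by default.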
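
A short sketch of why the int conversion matters: adding the byte counts as strings concatenates them instead of summing them. The two values are the TCP sizes from the expected output; the variable names are only illustrative.

```python
# Byte counts as they come out of the text, still strings.
raw_values = ["1024", "512"]

as_text = raw_values[0] + raw_values[1]               # string concatenation
as_numbers = int(raw_values[0]) + int(raw_values[1])  # numeric addition

print(as_text)     # 1024512
print(as_numbers)  # 1536
```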

In [85]:
# Initialize the result list
network_dicts = []

for line in network:
    pairs = line.split(" | ")
    log_entry = {}
    for pair in pairs:
        key, value = pair.split(": ")
        log_entry[key.strip()] = int(value.strip()) if key.strip() == "Bytes" else value.strip()
    network_dicts.append(log_entry)

network_dicts

[{'Source': '10.0.0.1',
  'Destination': '10.0.0.2',
  'Protocol': 'TCP',
  'Bytes': 1024},
 {'Source': '10.0.0.2',
  'Destination': '10.0.0.3',
  'Protocol': 'UDP',
  'Bytes': 2048},
 {'Source': '10.0.0.3',
  'Destination': '10.0.0.1',
  'Protocol': 'TCP',
  'Bytes': 512}]

In [86]:
# Data Transfer Summary:
# TCP: 1536 bytes
# UDP: 2048 bytes

import pandas as pd

# Convert network_dicts to a DataFrame
df = pd.DataFrame(network_dicts)
protocol_bytes = df.groupby("Protocol")["Bytes"].sum()
# protocol_bytes


print("Data Transfer Summary:")

print(f"TCP: {protocol_bytes['TCP']} bytes")
print(f"UDP: {protocol_bytes['UDP']} bytes")

Data Transfer Summary:
TCP: 1536 bytes
UDP: 2048 bytes


In [87]:
protocol_bytes

Protocol
TCP    1536
UDP    2048
Name: Bytes, dtype: int64
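
For comparison, the same per-protocol totals can probably be computed without pandas. The sketch below is an alternative, not the notebook's method: it uses collections.defaultdict and re-declares network_dicts with the three records shown earlier so that it runs on its own.

```python
from collections import defaultdict

# The three parsed records from the exercise, repeated here for a standalone run.
network_dicts = [
    {"Source": "10.0.0.1", "Destination": "10.0.0.2", "Protocol": "TCP", "Bytes": 1024},
    {"Source": "10.0.0.2", "Destination": "10.0.0.3", "Protocol": "UDP", "Bytes": 2048},
    {"Source": "10.0.0.3", "Destination": "10.0.0.1", "Protocol": "TCP", "Bytes": 512},
]

# Missing keys start at 0, so each protocol accumulates without an explicit check.
totals = defaultdict(int)
for entry in network_dicts:
    totals[entry["Protocol"]] += entry["Bytes"]

print("Data Transfer Summary:")
for protocol in sorted(totals):
    print(f"{protocol}: {totals[protocol]} bytes")
```

Looping over the keys instead of hard-coding 'TCP' and 'UDP' also covers any other protocol that might appear in the log.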