# Exercise 0 - Python repetition

These exercises let you practice fundamental Python programming so that you can follow along with the course.

## 0. User input for ETL Parameters

Ask the user for 2 inputs:

- source file path
- destination file path


For example:

\# source path
/Users/aigineer/Documents/data_platform_course/data/file.csv

\# destination
/Users/aigineer/Documents/data_platform_course/cleaned_data/file.csv


Then the output should be:

source: /Users/aigineer/Documents/data_platform_course/data/file.csv

destination: /Users/aigineer/Documents/data_platform_course/cleaned_data/file.csv

In [1]:
# Use a distinct prompt for each path so the user knows which one is being asked for
source_path = input("Enter source file path: ")
destination_path = input("Enter destination file path: ")

print(f"source: {source_path}")
print(f"destination: {destination_path}")

source: /Users/aigineer/Documents/data_platform_course/data/file.csv
destination: /Users/aigineer/Documents/data_platform_course/cleaned_data/file.csv
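The same step can be wrapped in a small function so the paths can be reused later in an ETL script. This is only a sketch: `read_etl_params` and its `ask` parameter are hypothetical names, and the check that the two paths differ is an extra assumption that the exercise does not require.

```python
from pathlib import Path


def read_etl_params(ask=input):
    """Ask for the two ETL paths and return them as Path objects.

    `ask` defaults to the built-in input(); passing another callable
    makes the function usable without a keyboard, for example in tests.
    """
    source = Path(ask("Enter source file path: ").strip())
    destination = Path(ask("Enter destination file path: ").strip())
    # Assumption, not part of the exercise: reading and writing the same file is a mistake
    if source == destination:
        raise ValueError("source and destination must be different paths")
    return source, destination


# Usage with canned answers instead of real keyboard input
answers = iter([
    "/Users/aigineer/Documents/data_platform_course/data/file.csv",
    "/Users/aigineer/Documents/data_platform_course/cleaned_data/file.csv ",
])
source, destination = read_etl_params(ask=lambda prompt: next(answers))
print(f"source: {source}")
print(f"destination: {destination}")
```

Stripping the answers removes accidental spaces, such as the trailing one in the second canned answer.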


## 1. Schema validation

To maintain data quality, consistency and reliability, a system needs to validate that incoming data conforms to a predefined structure or format. This is called schema validation, and you'll practice it in the following exercises.

### a) Create a dictionary that looks like this:

| Key       | Value  |
|-----------|--------|
| id        | 101    |
| name      | Erika  |
| is_active | True   |
| age       | 45     |

In [63]:
my_dict = {
    "id": 101, 
    "name": "Erika", 
    "is_active": True, 
    "age": 45
    }

my_dict

{'id': 101, 'name': 'Erika', 'is_active': True, 'age': 45}

b) Validate that id is an integer, name is a string, is_active is a boolean and age is an integer. The function should return True if the record is valid and False otherwise.



In [64]:
def validate_dict(record):
    # Return the result instead of printing it, so callers can use it in a condition
    return (
        isinstance(record["id"], int)
        and isinstance(record["name"], str)
        and isinstance(record["is_active"], bool)
        and isinstance(record["age"], int)
    )

validate_dict(my_dict)

True


In [32]:
type(my_dict["id"]), type(my_dict["name"]), type(my_dict["is_active"]), type(my_dict["age"])

(int, str, bool, int)
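One subtlety with type checks like the ones in b): in Python, bool is a subclass of int, so True passes an integer check. A minimal sketch of a stricter check follows; `is_strict_int` is a hypothetical helper name, not something the exercise asks for.

```python
def is_strict_int(value):
    """Return True for integers, but not for booleans."""
    return isinstance(value, int) and not isinstance(value, bool)


print(isinstance(True, int))  # True, because bool is a subclass of int
print(is_strict_int(True))    # False
print(is_strict_int(45))      # True
print(is_strict_int(41.5))    # False
```

Whether booleans should count as integers depends on the data; for fields like id and age, rejecting them is usually the safer choice.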

c) The dictionary you created can be seen as one row. Now let's create more records and store each record (dictionary) in a list.

| id  | name   | is_active | age  |
|-----|--------|-----------|------|
| 102 | Marcus | True      | 34   |
| 103 | David  | False     | 29   |
| 104 | Anna   | True      | 41.5 |
| 106 | Ingrid | NOPE      | 8    |

In [42]:
marcus = {
    "id": 102,
    "name": "Marcus",
    "is_active": True,
    "age": 34
    }

david ={
    "id": 103,
    "name": "David",
    "is_active": False,
    "age": 29
}

anna = {
    "id": 104,
    "name": "Anna",
    "is_active": True,
    "age": 41.5,
}

ingrid = {
    "id": 106,
    "name": "Ingrid",
    "is_active": "NOPE",
    "age": 8
}

my_list = [marcus, david, anna, ingrid]
my_list

[{'id': 102, 'name': 'Marcus', 'is_active': True, 'age': 34},
 {'id': 103, 'name': 'David', 'is_active': False, 'age': 29},
 {'id': 104, 'name': 'Anna', 'is_active': True, 'age': 41.5},
 {'id': 106, 'name': 'Ingrid', 'is_active': 'NOPE', 'age': 8}]

In [59]:
type(my_list[0]), type(my_list), type(my_list[0]["age"])

my_other_list = [
    {"id": 107,
    "name": "Henrik",
    "is_active": True,
    "age": 38},
    {"id": 108,
    "name": "Karin",
    "is_active": True,
    "age": 44},
    {"id": 109,
    "name": "August",
    "is_active": 45,
    "age": 4},
    {"id": 110,
    "name": "Bo",
    "is_active": True,
    "age": 2}
]

d) Do schema validation on each record in the list (JSON array) from c).



In [68]:
# Expected type for each field; mirrors the checks in b)
EXPECTED_TYPES = {"id": int, "name": str, "is_active": bool, "age": int}

def validate_person(list_):
    for idx, person in enumerate(list_):
        # Compute a boolean here, so the report does not depend on what a helper prints
        is_valid = all(
            isinstance(person.get(key), expected_type)
            for key, expected_type in EXPECTED_TYPES.items()
        )
        if is_valid:
            print(f"Record {idx + 1}: Valid")
        else:
            print(f"Record {idx + 1}: Invalid - {person}")

validate_person(my_list)
print("")
validate_person(my_other_list)

Record 1: Valid
Record 2: Valid
Record 3: Invalid - {'id': 104, 'name': 'Anna', 'is_active': True, 'age': 41.5}
Record 4: Invalid - {'id': 106, 'name': 'Ingrid', 'is_active': 'NOPE', 'age': 8}

Record 1: Valid
Record 2: Valid
Record 3: Invalid - {'id': 109, 'name': 'August', 'is_active': 45, 'age': 4}
Record 4: Valid


e) Make a function for schema validation, run it on the two example lists and check that you get the correct answers. Also create other examples and test your function with them.
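One possible shape for such a function is sketched below. It takes the schema as a parameter and returns a list of error messages instead of a single boolean, which makes invalid records easier to debug. The names `PERSON_SCHEMA`, `validate_schema` and `is_valid` are hypothetical, and the record with id 111 is made up to show the missing-key case.

```python
PERSON_SCHEMA = {"id": int, "name": str, "is_active": bool, "age": int}


def validate_schema(record, schema):
    """Return a list of error messages; an empty list means the record is valid."""
    errors = []
    for key, expected_type in schema.items():
        if key not in record:
            errors.append(f"missing key '{key}'")
            continue
        value = record[key]
        # bool is a subclass of int, so reject booleans where a plain int is expected
        wrong_bool = expected_type is int and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected_type):
            errors.append(
                f"'{key}' should be {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors


def is_valid(record, schema=PERSON_SCHEMA):
    """Return True when the record has no schema errors."""
    return not validate_schema(record, schema)


examples = [
    {"id": 104, "name": "Anna", "is_active": True, "age": 41.5},
    {"id": 109, "name": "August", "is_active": 45, "age": 4},
    {"id": 110, "name": "Bo", "is_active": True, "age": 2},
    {"id": 111, "name": "Extra"},  # made-up record with missing keys
]
for record in examples:
    print(record["id"], validate_schema(record, PERSON_SCHEMA) or "valid")
```

Keeping the schema in a dictionary means a new field only requires a new entry, not a longer condition.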

## 2. Data quality check

Create a function that checks that a list contains exactly ten elements and that none of them is None. If the check fails, print an error message that says the list is invalid and describes what a valid format looks like.

In [115]:
numbers_list = list(range(1, 11))
numbers_list2 = list(range(1, 10))
# Without square brackets this is a tuple, not a list
numbers_list3 = 1, 2, 3, 4, 5, None, 6, 7, 8, 9

print(len(numbers_list))
print(len(numbers_list2))
print(len(numbers_list3))

print(numbers_list)
print(numbers_list2)
print(numbers_list3)

10
9
10
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9]
(1, 2, 3, 4, 5, None, 6, 7, 8, 9)


In [116]:
def check_list(list_):
    # Describe the valid format once so every error message can include it
    valid_format = "a valid list has exactly 10 elements, none of which is None"
    if not isinstance(list_, list):
        raise ValueError(f"Invalid input: expected a list; {valid_format}")
    if len(list_) != 10:
        raise ValueError(f"Invalid list: it has {len(list_)} elements; {valid_format}")
    if None in list_:
        raise ValueError(f"Invalid list: it contains None; {valid_format}")
    print("list ok")

try:
    check_list(numbers_list)
except ValueError as err:
    print(err)


list ok
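A check that raises stops at the first problem it finds. As a possible variant, the sketch below collects every problem and reports them together. `find_list_problems`, `REQUIRED_LENGTH` and the sample inputs are hypothetical names and values chosen for illustration.

```python
REQUIRED_LENGTH = 10


def find_list_problems(values):
    """Return a list of problems; an empty list means the input is valid."""
    if not isinstance(values, list):
        return [f"expected a list, got {type(values).__name__}"]
    problems = []
    if len(values) != REQUIRED_LENGTH:
        problems.append(f"expected {REQUIRED_LENGTH} elements, got {len(values)}")
    # Remember where the None values are, which helps when fixing the data
    none_positions = [idx for idx, value in enumerate(values) if value is None]
    if none_positions:
        problems.append(f"None found at index {none_positions}")
    return problems


samples = {
    "ten numbers": list(range(1, 11)),
    "nine numbers": list(range(1, 10)),
    "contains None": [1, 2, 3, 4, 5, None, 6, 7, 8, 9],
    "tuple": (1, 2, 3),
}
for label, sample in samples.items():
    problems = find_list_problems(sample)
    if problems:
        print(f"{label}: invalid ({'; '.join(problems)}). "
              f"A valid input is a list of exactly {REQUIRED_LENGTH} elements, none of them None.")
    else:
        print(f"{label}: ok")
```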


## 3. Extract data from logs

In data engineering, log files and log messages are very common. Sometimes you need to parse them to find valuable information, for example when debugging.

Read in network.log and extract source IP, destination IP, protocol and data size.

Expected output:

| Source      | Destination | Protocol | Bytes |
|-------------|-------------|----------|-------|
| 10.0.0.1    | 10.0.0.2    | TCP      | 1024  |
| 10.0.0.2    | 10.0.0.3    | UDP      | 2048  |
| 10.0.0.3    | 10.0.0.1    | TCP      | 512   |

### Data Transfer Summary:
- **TCP**: 1536 bytes
- **UDP**: 2048 bytes

In [144]:
with open("data/network.log", "r") as file:
    raw_text = file.read()

# Split line by line first, so a field never spills over into the next line's timestamp
fields = [
    field.strip()
    for line in raw_text.splitlines()
    for field in line.split(" | ")
]

# Keep only the fields that hold the transferred data size
bytes_only = [field for field in fields if "Bytes:" in field]

print(bytes_only)
for item in bytes_only:
    print(item)

['Bytes: 1024', 'Bytes: 2048', 'Bytes: 512']
Bytes: 1024
Bytes: 2048
Bytes: 512
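The remaining fields can be extracted in the same way. The sketch below is written under loud assumptions, because the full layout of network.log is not shown here: only the " | " separator, the "Bytes:" field and the last two timestamps are visible in the output above. The labels "Source:", "Destination:" and "Protocol:", the first timestamp, and the names `SAMPLE_LOG` and `parse_line` are hypothetical. The IP addresses, protocols and sizes are taken from the expected output table.

```python
from collections import defaultdict

# Hypothetical sample in an ASSUMED layout; adjust the labels to match the real file
SAMPLE_LOG = """\
2024-06-01 09:00:00 | Source: 10.0.0.1 | Destination: 10.0.0.2 | Protocol: TCP | Bytes: 1024
2024-06-01 09:05:00 | Source: 10.0.0.2 | Destination: 10.0.0.3 | Protocol: UDP | Bytes: 2048
2024-06-01 09:10:00 | Source: 10.0.0.3 | Destination: 10.0.0.1 | Protocol: TCP | Bytes: 512
"""


def parse_line(line):
    """Turn one log line into a dict of its 'Label: value' fields."""
    fields = [field.strip() for field in line.split(" | ")]
    record = {"Timestamp": fields[0]}
    for field in fields[1:]:
        label, _, value = field.partition(": ")
        record[label] = value
    record["Bytes"] = int(record["Bytes"])
    return record


records = [parse_line(line) for line in SAMPLE_LOG.splitlines() if line.strip()]

# Sum the transferred bytes per protocol
totals = defaultdict(int)
for record in records:
    totals[record["Protocol"]] += record["Bytes"]

for record in records:
    print(record["Source"], record["Destination"], record["Protocol"], record["Bytes"])
for protocol, total in totals.items():
    print(f"{protocol}: {total} bytes")
```

With the real file, replace `SAMPLE_LOG.splitlines()` with the lines read from data/network.log and check the labels against the actual content first.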
