The cleaned data needs to be merged into a single table. Here we deal with the data that does not require one-hot encoding.
We load the prepared data.

In [2]:
from datetime import datetime
from json_handler.json_handler import load_jsonl_data, load_json_data

In [3]:
sessions_list = load_jsonl_data('prepared_data/prepared_sessions.jsonl')
products_list = load_json_data('prepared_data/preapred_products.json')
users_list = load_jsonl_data('data/input/users.jsonl')
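The `json_handler` module is project-local and not shown here. As a rough sketch of what the two loaders presumably do, assuming one reads a JSON Lines file (one JSON object per line) and the other reads a single JSON document, the hypothetical helpers `read_jsonl` and `read_json` below could stand in for them. They are not the project's actual code.

```python
import json


def read_jsonl(path):
    """Hypothetical stand-in: parse a JSON Lines file into a list of objects."""
    with open(path, encoding='utf-8') as file:
        # Skip blank lines so a trailing newline does not raise an error.
        return [json.loads(line) for line in file if line.strip()]


def read_json(path):
    """Hypothetical stand-in: parse a regular JSON file in one go."""
    with open(path, encoding='utf-8') as file:
        return json.load(file)
```

Both helpers return plain Python lists and dictionaries, which is the shape the rest of the notebook expects when it indexes entries with keys such as `'product_id'` or `'user_id'`.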

To make the data easier to work with, we represent it as class instances, which will help us build
a single table.
The Product class stores simple data read directly from the file.

In [4]:
class Product:
    def __init__(self, product_id, price, category):
        self.product_id = product_id
        self.price = price
        self.category = category

The User class stores a set of dictionaries keyed by 'product_id', 'discount', or 'category_path' values.

In [5]:
class User:
    def __init__(self):
        self.products_views_number = dict()
        self.category_views_number = dict()
        self.bought_with_discount_number = dict()
        self.bought_products_in_category_number = dict()
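These dictionaries act as per-user counters: the current count is read with `dict.get(key, 0)` and the incremented value is stored back. The sketch below uses plain dictionaries and made-up identifiers and discount values (none are taken from the data set). It also shows how purchases made at any discount up to the offered one can be summed, assuming discount levels are multiples of 5. `count_view` and `purchases_up_to_discount` are hypothetical helper names.

```python
def count_view(views, product_id):
    """Return how often the product was seen before, then record this view."""
    seen_before = views.get(product_id, 0)
    views[product_id] = seen_before + 1
    return seen_before


def purchases_up_to_discount(purchases, offered_discount, step=5):
    """Sum purchases made at every discount level from 0 up to the offered one."""
    return sum(purchases.get(level, 0)
               for level in range(0, offered_discount + 1, step))


# Illustrative values only.
products_views_number = {}
first = count_view(products_views_number, 1001)
second = count_view(products_views_number, 1001)

# Purchases recorded per discount level (assumed multiples of 5).
bought_with_discount_number = {0: 2, 10: 1, 20: 4}

print(first, second)  # 0 1
print(purchases_up_to_discount(bought_with_discount_number, 10))  # 3
```

Returning the count from before the increment means a record describes the user's history prior to the current event, not including it.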

The Session class holds a User object and a Product object, plus the data describing the session event itself. Objects of this class
are the basis from which the merged output records are built.

In [6]:
class Session:
    def __init__(self, user, product, timestamp, event_type, discount, session_id):
        self.user = user
        self.product = product
        self.timestamp = timestamp
        self.event_type = event_type
        self.discount = discount
        self.session_id = session_id

We create the collections of objects.

In [7]:
products = dict()
for product in products_list:
    products[product['product_id']] = Product(product['product_id'], product['price'], product['category_path'])

users = dict()
for user in users_list:
    users[user['user_id']] = User()

sessions = list()
for session in sessions_list:
    user = users[session['user_id']]
    product = products[session['product_id']]
    sessions.append(Session(user, product, session['timestamp'], session['event_type'],
                            session['offered_discount'], session['session_id']))

The data prepared this way is placed in the Record class, which represents a final record in the output file.

In [8]:
class Record:
    def __init__(self, price, products_views_number, category_views_number, bought_with_discount_number,
                 bought_products_in_category_number, discount, elapsed_time):
        self.price = price
        self.products_views_number = products_views_number
        self.category_views_number = category_views_number
        self.bought_with_discount_number = bought_with_discount_number
        self.bought_products_in_category_number = bought_products_in_category_number
        self.discount = discount
        self.elapsed_time = elapsed_time

We create a list of Record objects and fill it in.

In [9]:
records = list()
current_session_id = 0
last_session_timestamp = 0
elapsed_time = 0
timestamp_format = '%Y-%m-%dT%H:%M:%S'
for session in sessions:
    current_timestamp = datetime.strptime(session.timestamp, timestamp_format)
    if current_session_id != session.session_id:
        # First event of a new session: no time has elapsed yet.
        current_session_id = session.session_id
        elapsed_time = 0.0
    else:
        # Seconds since the previous event of the same session.
        elapsed_time = (current_timestamp - last_session_timestamp).total_seconds()
    last_session_timestamp = current_timestamp

    price = session.product.price

    # How many earlier events this user had for this product; the current one is counted afterwards.
    products_views_number = session.user.products_views_number.get(session.product.product_id, 0)
    session.user.products_views_number[session.product.product_id] = products_views_number + 1

    # Same counter, but per product category.
    category_views_number = session.user.category_views_number.get(session.product.category, 0)
    session.user.category_views_number[session.product.category] = category_views_number + 1

    # Earlier purchases made at any discount up to the one offered now.
    # Discount levels are assumed to be integer multiples of 5.
    purchases_with_discount_counter = 0
    for x in range(0, session.discount + 1, 5):
        purchases_with_discount_counter += session.user.bought_with_discount_number.get(x, 0)

    purchases_in_category_number = session.user.bought_products_in_category_number.get(session.product.category, 0)

    # event_type 0 appears to encode a purchase in the prepared data: update the purchase counters.
    if session.event_type == 0:
        session.user.bought_with_discount_number[session.discount] = session.user.bought_with_discount_number.get(session.discount, 0) + 1
        session.user.bought_products_in_category_number[session.product.category] = purchases_in_category_number + 1

    records.append(Record(price, products_views_number, category_views_number,
                          purchases_with_discount_counter, purchases_in_category_number,
                          session.discount, elapsed_time))
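The elapsed-time rule used in the loop above can be isolated as a small function: the first event of a session gets 0 seconds, and every later event gets the time since the previous event of the same session. The sketch below uses made-up session ids and timestamps and assumes, as the loop does, that events arrive grouped by session and ordered by time. `elapsed_seconds` is a hypothetical helper name.

```python
from datetime import datetime

TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%S'


def elapsed_seconds(events):
    """Return seconds since the previous event of the same session, per event.

    `events` is a list of (session_id, timestamp_string) pairs, assumed to be
    grouped by session and sorted by time within each session.
    """
    result = []
    previous_session_id = None
    previous_time = None
    for session_id, timestamp in events:
        current_time = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
        if session_id != previous_session_id:
            # A new session starts: nothing to measure against yet.
            result.append(0.0)
        else:
            result.append((current_time - previous_time).total_seconds())
        previous_session_id = session_id
        previous_time = current_time
    return result


# Illustrative events only.
events = [
    (1, '2021-01-01T10:00:00'),
    (1, '2021-01-01T10:00:30'),
    (1, '2021-01-01T10:02:00'),
    (2, '2021-01-01T12:00:00'),
]
print(elapsed_seconds(events))  # [0.0, 30.0, 90.0, 0.0]
```

If the input were not grouped by session, the same session id could reappear later and be treated as a new session, so sorting beforehand would be needed.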

We save the resulting set of records to a file.

In [10]:
import json

with open('prepared_data/records.json', 'w') as file:
    json.dump([r.__dict__ for r in records], file, indent=2)
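To illustrate what ends up in the file: `r.__dict__` is the dictionary of an object's instance attributes, so each record becomes one JSON object whose keys are the attribute names. Below is a minimal round-trip sketch with a cut-down stand-in class and a temporary file. `MiniRecord` and its values are hypothetical and only mirror two of the Record fields.

```python
import json
import os
import tempfile


class MiniRecord:
    """Cut-down stand-in for Record with only two fields."""
    def __init__(self, price, discount):
        self.price = price
        self.discount = discount


sample_records = [MiniRecord(19.99, 5), MiniRecord(120.0, 0)]

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'records.json')
    # Write the records the same way as the notebook cell does.
    with open(path, 'w') as file:
        json.dump([r.__dict__ for r in sample_records], file, indent=2)
    # Read them back as a list of plain dictionaries.
    with open(path) as file:
        loaded = json.load(file)

print(loaded[0]['price'], loaded[1]['discount'])  # 19.99 0
```

Reading the file back yields plain dictionaries rather than Record objects, which is usually enough for building a table from them.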
