Here we'll clean up the raw data from datasets_dirty/ and save the cleaned data to datasets/.

In [2]:
import pandas as pd

# Prefer the local raw copy; fall back to the /datasets path only if the file is missing.
try:
    v_df = pd.read_csv('datasets_dirty/visits_log_us.csv')
except FileNotFoundError:
    v_df = pd.read_csv('/datasets/visits_log_us.csv')

try:
    o_df = pd.read_csv('datasets_dirty/orders_log_us.csv')
except FileNotFoundError:
    o_df = pd.read_csv('/datasets/orders_log_us.csv')

try:
    c_df = pd.read_csv('datasets_dirty/costs_us.csv')
except FileNotFoundError:
    c_df = pd.read_csv('/datasets/costs_us.csv')
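
The three load blocks above repeat the same try-then-fall-back pattern. As a rough sketch, it could be folded into one helper; `read_first_existing` is a hypothetical name, not something from pandas or from this notebook, and it checks for the file up front instead of catching an exception:

```python
from pathlib import Path

import pandas as pd


def read_first_existing(candidates):
    """Read the first CSV path in `candidates` that exists on disk."""
    candidates = list(candidates)
    for path in map(Path, candidates):
        if path.is_file():
            return pd.read_csv(path)
    # Nothing matched: fail loudly instead of hiding the problem.
    raise FileNotFoundError(f'none of these files exist: {candidates}')


# Hypothetical usage, mirroring the paths tried above:
# v_df = read_first_existing(['datasets_dirty/visits_log_us.csv',
#                             '/datasets/visits_log_us.csv'])
```

Only a missing file triggers the fallback here, so a malformed CSV still raises its own error instead of silently switching to the other copy.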

In [None]:
v_df.info(memory_usage='deep')
print(v_df.describe())
print(v_df.head())
print(v_df.sample(5))

In [None]:
v_df['Device'].value_counts()

In [None]:
v_df['Source Id'].value_counts()

In [None]:
v_df['Device'] = v_df['Device'].astype('category')
v_df['Source Id'] = v_df['Source Id'].astype('category')

v_df.info(memory_usage='deep')

In [None]:
v_df['Start Ts'] = pd.to_datetime(v_df['Start Ts'], format="%Y.%m.%d %H:%M")
v_df['End Ts'] = pd.to_datetime(v_df['End Ts'], format="%Y.%m.%d %H:%M")

In [None]:
v_df = v_df.rename(
    columns={
        'Uid': 'uid',
        'Device': 'device',
        'Start Ts': 'start_time',
        'End Ts': 'end_time',
        'Source Id': 'source_id'
    }
)

v_df.info(memory_usage='deep')
print(v_df.describe())
print(v_df.head())
print(v_df.sample(5))

Going from 79.3 MB to 8.9 MB of memory usage without any loss of data? Nice.

## The visits table (server logs with data on website visits):
- uid — user's unique identifier
    - Change from 'Uid' to 'uid'
- device — user's device
    - Change from 'Device' to 'device'
    - There are only two distinct values, so I changed the type to category
- start_time — session start date and time
    - Change name from 'Start Ts' to 'start_time'
    - Looks like the seconds aren't included here. Converted to datetime type
- end_time — session end date and time
    - Change name from 'End Ts' to 'end_time'
    - Also converted to datetime type
- source_id — identifier of the ad source the user came from
    - Change name from 'Source Id' to 'source_id'
    - There are only 10 unique values, so I changed this to category type. I'll come back and undo this if I need to.
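
To make the list above concrete, here is a minimal sketch that applies the same renames and dtype changes in one function and compares memory before and after. The sample rows are made up to match the format string used in the cells above, and `clean_visits` is a hypothetical helper name; the real numbers come from the full visits file, not from this toy frame.

```python
import pandas as pd

# Made-up sample in the raw column layout (1000 rows); not real log data.
raw = pd.DataFrame({
    'Device': ['desktop', 'touch'] * 500,
    'End Ts': ['2017.12.20 17:38', '2018.02.19 17:21'] * 500,
    'Source Id': [4, 2] * 500,
    'Start Ts': ['2017.12.20 17:20', '2018.02.19 16:53'] * 500,
    'Uid': [104060357, 276050364] * 500,
})


def clean_visits(df):
    """Return a renamed, retyped copy of a raw visits frame."""
    out = df.rename(columns={
        'Uid': 'uid',
        'Device': 'device',
        'Start Ts': 'start_time',
        'End Ts': 'end_time',
        'Source Id': 'source_id',
    })
    # Few distinct values, so category stores small integer codes per row.
    out['device'] = out['device'].astype('category')
    out['source_id'] = out['source_id'].astype('category')
    # Timestamps arrive as text; parse them with the assumed format.
    for col in ('start_time', 'end_time'):
        out[col] = pd.to_datetime(out[col], format='%Y.%m.%d %H:%M')
    return out


visits = clean_visits(raw)
before = raw.memory_usage(deep=True).sum()
after = visits.memory_usage(deep=True).sum()
print(f'{before} bytes -> {after} bytes')
```

The savings come from replacing per-row Python strings with fixed-width values: one small integer code per category entry and 8 bytes per timestamp.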

In [None]:
o_df.info(memory_usage='deep')
print(o_df.describe())
print(o_df.head())
print(o_df.sample(5))

In [None]:
o_df['Buy Ts'] = pd.to_datetime(o_df['Buy Ts'], format="%Y.%m.%d %H:%M")

o_df = o_df.rename(
    columns={
        'Uid': 'uid',
        'Buy Ts': 'purchase_time',
        'Revenue': 'profit'
    }
)

In [None]:
o_df.info(memory_usage='deep')
print(o_df.describe())
print(o_df.head())
print(o_df.sample(5))

Going from 4.4 MB to 1.2 MB of memory usage without any loss of data? Nice.

## The orders table (data on orders):
- uid — unique identifier of the user making an order
    - Change from 'Uid' to 'uid'
- purchase_time — order date and time
    - Change from 'Buy Ts' to 'purchase_time'
    - Convert to datetime type
- profit — Yandex.Afisha's revenue from the order
    - Change from 'Revenue' to 'profit'

In [None]:
c_df.info(memory_usage='deep')
print(c_df.describe())
print(c_df.head())
print(c_df.sample(5))

In [None]:
c_df['source_id'].value_counts()

In [None]:
c_df['source_id'] = c_df['source_id'].astype('category')
c_df['dt'] = pd.to_datetime(c_df['dt'], format="%Y.%m.%d")

c_df = c_df.rename(
    columns={
        'dt': 'date'
    }
)

In [None]:
c_df.info(memory_usage='deep')
print(c_df.describe())
print(c_df.head())
print(c_df.sample(5))

Going from 206.2 KB to 42.7 KB of memory usage without any loss of data? Stellar move by me.

## The costs table (data on marketing expenses):
- source_id — ad source identifier
    - There are only 7 unique values. Convert to category type
- date — date of the expense
    - Change from 'dt' to 'date'
    - It only has dates, and no times. Convert to datetime type accordingly
- costs — expenses on this ad source on this day
    - This looks fine unchanged

Check for duplicate rows real quick:

In [None]:
print(v_df.duplicated().value_counts().get(True, 0), 'duplicate rows out of', v_df.shape[0])
print(o_df.duplicated().value_counts().get(True, 0), 'duplicate rows out of', o_df.shape[0])
print(c_df.duplicated().value_counts().get(True, 0), 'duplicate rows out of', c_df.shape[0])

In [None]:
v_df.to_csv('datasets/visits_log_us.csv', index=False)
o_df.to_csv('datasets/orders_log_us.csv', index=False)
c_df.to_csv('datasets/costs_us.csv', index=False)

The cleaned data has been saved!
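
One thing to keep in mind when reading these files back: CSV is plain text, so the renamed columns survive but the category and datetime dtypes do not, and they have to be requested again at load time. Below is a small sketch of that round trip, using a made-up frame and an in-memory buffer in place of the real files:

```python
import io

import pandas as pd

# Made-up cleaned frame standing in for the visits table.
visits = pd.DataFrame({
    'device': pd.Series(['desktop', 'touch', 'desktop'], dtype='category'),
    'start_time': pd.to_datetime(
        ['2017-12-20 17:20', '2018-02-19 16:53', '2017-07-01 01:54']),
    'source_id': pd.Series([4, 2, 4], dtype='category'),
})

buffer = io.StringIO()  # in-memory stand-in for a CSV file on disk
visits.to_csv(buffer, index=False)

# A plain read infers dtypes from the text: no category, no datetime.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Ask for the dtypes explicitly to get them back.
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    dtype={'device': 'category'},
    parse_dates=['start_time'],
)
# Convert after loading so the categories stay integers.
restored['source_id'] = restored['source_id'].astype('category')

print(plain.dtypes)
print(restored.dtypes)
```

The same `dtype` and `parse_dates` arguments would apply when loading the saved files in a later notebook, with the column names adjusted per table.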