Here we'll clean up the raw data in datasets_dirty/ and save the cleaned data to datasets/.

In [1]:
import pandas as pd

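# Prefer the raw local copy; fall back to the /datasets copy if it is missing.
# Catch only FileNotFoundError so that real parsing errors are not hidden.
try:
    v_df = pd.read_csv('datasets_dirty/visits_log_us.csv')
except FileNotFoundError:
    v_df = pd.read_csv('/datasets/visits_log_us.csv')

try:
    o_df = pd.read_csv('datasets_dirty/orders_log_us.csv')
except FileNotFoundError:
    o_df = pd.read_csv('/datasets/orders_log_us.csv')

try:
    c_df = pd.read_csv('datasets_dirty/costs_us.csv')
except FileNotFoundError:
    c_df = pd.read_csv('/datasets/costs_us.csv')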
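
The three try/except blocks above repeat the same "first location that exists wins" pattern. A minimal sketch of how that could be folded into one helper, assuming a hypothetical function name `read_first_available` (not part of pandas) and a throwaway temporary directory for the demo:

```python
from pathlib import Path
import tempfile

import pandas as pd


def read_first_available(filename, directories):
    """Read filename from the first directory in which it exists (hypothetical helper)."""
    for directory in directories:
        path = Path(directory) / filename
        if path.exists():
            return pd.read_csv(path)
    raise FileNotFoundError(f"{filename} not found in any of: {list(directories)}")


# Demo with a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    fallback_dir = Path(tmp) / "datasets"
    fallback_dir.mkdir()
    pd.DataFrame({"source_id": [1, 2], "costs": [75.20, 62.25]}).to_csv(
        fallback_dir / "costs_us.csv", index=False
    )
    # The first directory does not exist, so the helper falls through to the second.
    demo_df = read_first_available(
        "costs_us.csv", [Path(tmp) / "datasets_dirty", fallback_dir]
    )

print(demo_df.shape)
```

In this notebook the call would look something like `read_first_available('visits_log_us.csv', ['datasets_dirty', '/datasets'])`.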

In [2]:
v_df.info(memory_usage='deep')
print(v_df.describe())
print(v_df.head())
print(v_df.sample(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359400 entries, 0 to 359399
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   Device     359400 non-null  object
 1   End Ts     359400 non-null  object
 2   Source Id  359400 non-null  int64 
 3   Start Ts   359400 non-null  object
 4   Uid        359400 non-null  uint64
dtypes: int64(1), object(3), uint64(1)
memory usage: 79.3 MB
           Source Id           Uid
count  359400.000000  3.594000e+05
mean        3.750515  9.202557e+18
std         1.917116  5.298433e+18
min         1.000000  1.186350e+13
25%         3.000000  4.613407e+18
50%         4.000000  9.227413e+18
75%         5.000000  1.372824e+19
max        10.000000  1.844668e+19
    Device               End Ts  Source Id             Start Ts  \
0    touch  2017-12-20 17:38:00          4  2017-12-20 17:20:00   
1  desktop  2018-02-19 17:21:00          2  2018-02-19 16:53:00   
2    touch  2017-07-01 01:54:00  

In [3]:
v_df['Device'].value_counts()

desktop    262567
touch       96833
Name: Device, dtype: int64

In [4]:
v_df['Source Id'].value_counts()

4     101794
3      85610
5      66905
2      47626
1      34121
9      13277
10     10025
7         36
6          6
Name: Source Id, dtype: int64

In [5]:
v_df['Device'] = v_df['Device'].astype('category')
v_df['Source Id'] = v_df['Source Id'].astype('category')

v_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359400 entries, 0 to 359399
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype   
---  ------     --------------   -----   
 0   Device     359400 non-null  category
 1   End Ts     359400 non-null  object  
 2   Source Id  359400 non-null  category
 3   Start Ts   359400 non-null  object  
 4   Uid        359400 non-null  uint64  
dtypes: category(2), object(2), uint64(1)
memory usage: 55.5 MB
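
To see where a saving like this comes from, the memory of a single column can be compared before and after the conversion. This is a small sketch on synthetic data (a made-up `devices` Series standing in for the Device column), not a measurement of the real file:

```python
import pandas as pd

# Synthetic stand-in for the Device column: two distinct strings, many rows.
devices = pd.Series(["desktop", "touch"] * 50_000, name="Device")

# deep=True counts the Python string objects, not just the pointers to them.
as_object = devices.astype("object").memory_usage(deep=True)
as_category = devices.astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

A category column stores each distinct value once plus a small integer code per row, which is why it pays off when there are few distinct values relative to the number of rows.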


In [6]:
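# The raw strings look like '2017-12-20 17:38:00', so the format uses dashes and includes seconds.
v_df['Start Ts'] = pd.to_datetime(v_df['Start Ts'], format="%Y-%m-%d %H:%M:%S")
v_df['End Ts'] = pd.to_datetime(v_df['End Ts'], format="%Y-%m-%d %H:%M:%S")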
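
A quick way to confirm that a format string matches the data is to parse with `errors="coerce"` and count the failures. The sketch below uses two timestamp strings copied from the output earlier in this notebook; the variable names are just for illustration:

```python
import pandas as pd

# Sample strings copied from the Start Ts / End Ts output earlier in the notebook.
raw = pd.Series(["2017-12-20 17:38:00", "2018-02-19 17:21:00"])

# errors="coerce" turns unparseable values into NaT instead of raising,
# so the number of failures can be counted.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")
n_failed = parsed.isna().sum()

# Check whether the seconds component is zero for every parsed value.
seconds_all_zero = (parsed.dt.second == 0).all()

print(n_failed, seconds_all_zero)
```

Running the same two checks on the full column would show whether every row parses and whether the timestamps really have minute precision.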

In [7]:
v_df = v_df.rename(
    columns={
        'Uid': 'uid',
        'Device': 'device',
        'Start Ts': 'start_time',
        'End Ts': 'end_time',
        'Source Id': 'source_id'
    }
)

v_df.info(memory_usage='deep')
print(v_df.describe())
print(v_df.head())
print(v_df.sample(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359400 entries, 0 to 359399
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   device      359400 non-null  category      
 1   end_time    359400 non-null  datetime64[ns]
 2   source_id   359400 non-null  category      
 3   start_time  359400 non-null  datetime64[ns]
 4   uid         359400 non-null  uint64        
dtypes: category(2), datetime64[ns](2), uint64(1)
memory usage: 8.9 MB
                uid
count  3.594000e+05
mean   9.202557e+18
std    5.298433e+18
min    1.186350e+13
25%    4.613407e+18
50%    9.227413e+18
75%    1.372824e+19
max    1.844668e+19
    device            end_time source_id          start_time  \
0    touch 2017-12-20 17:38:00         4 2017-12-20 17:20:00   
1  desktop 2018-02-19 17:21:00         2 2018-02-19 16:53:00   
2    touch 2017-07-01 01:54:00         5 2017-07-01 01:54:00   
3  desktop 2018-05-20 11:23:00         9

Memory usage went from 79.3 MB to 8.9 MB without any loss of data. Nice.

## The visits table (server logs with data on website visits):
- uid — user's unique identifier
    - Change from 'Uid' to 'uid'
- device — user's device
    - Change from 'Device' to 'device'
    - There are only two distinct values, so I changed the type to category
- start_time — session start date and time
    - Change name from 'Start Ts' to 'start_time'
    - The seconds look like they're always 00 in the rows shown, so the timestamps appear to have minute precision. Converted to datetime
- end_time — session end date and time
    - Change name from 'End Ts' to 'end_time'
    - Change to datetime type also
- source_id — identifier of the ad source the user came from
    - Change name from 'Source Id' to 'source_id'
    - There are only 9 unique values (1 through 7, 9, and 10), so I changed this to category type. I'll come back and undo that if I need to.
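
Undoing the category conversion on source_id, if it turns out to be needed, is a one-line cast. A minimal sketch on a made-up Series (the values are illustrative, not taken from the file):

```python
import pandas as pd

source_id = pd.Series([4, 2, 5, 9, 4], name="source_id").astype("category")

# Arithmetic and ordering comparisons are not defined for an unordered category,
# so cast back to an integer dtype when numeric behavior is needed.
restored = source_id.astype("int64")

print(restored.dtype)
```

Equality filters and `groupby` work on the category column as-is, so the cast is only needed for numeric operations.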

In [8]:
o_df.info(memory_usage='deep')
print(o_df.describe())
print(o_df.head())
print(o_df.sample(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50415 entries, 0 to 50414
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Buy Ts   50415 non-null  object 
 1   Revenue  50415 non-null  float64
 2   Uid      50415 non-null  uint64 
dtypes: float64(1), object(1), uint64(1)
memory usage: 4.4 MB
            Revenue           Uid
count  50415.000000  5.041500e+04
mean       4.999647  9.098161e+18
std       21.818359  5.285742e+18
min        0.000000  3.135781e+14
25%        1.220000  4.533567e+18
50%        2.500000  9.102274e+18
75%        4.890000  1.368290e+19
max     2633.280000  1.844617e+19
                Buy Ts  Revenue                   Uid
0  2017-06-01 00:10:00    17.00  10329302124590727494
1  2017-06-01 00:25:00     0.55  11627257723692907447
2  2017-06-01 00:27:00     0.37  17903680561304213844
3  2017-06-01 00:29:00     0.55  16109239769442553005
4  2017-06-01 07:58:00     0.37  14200605875248379450
          

In [9]:
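# The raw strings look like '2017-06-01 00:10:00', so the format uses dashes and includes seconds.
o_df['Buy Ts'] = pd.to_datetime(o_df['Buy Ts'], format="%Y-%m-%d %H:%M:%S")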

o_df = o_df.rename(
    columns={
        'Uid': 'uid',
        'Buy Ts': 'purchase_time',
        'Revenue': 'profit'
    }
)

In [10]:
o_df.info(memory_usage='deep')
print(o_df.describe())
print(o_df.head())
print(o_df.sample(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50415 entries, 0 to 50414
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   purchase_time  50415 non-null  datetime64[ns]
 1   profit         50415 non-null  float64       
 2   uid            50415 non-null  uint64        
dtypes: datetime64[ns](1), float64(1), uint64(1)
memory usage: 1.2 MB
             profit           uid
count  50415.000000  5.041500e+04
mean       4.999647  9.098161e+18
std       21.818359  5.285742e+18
min        0.000000  3.135781e+14
25%        1.220000  4.533567e+18
50%        2.500000  9.102274e+18
75%        4.890000  1.368290e+19
max     2633.280000  1.844617e+19
        purchase_time  profit                   uid
0 2017-06-01 00:10:00   17.00  10329302124590727494
1 2017-06-01 00:25:00    0.55  11627257723692907447
2 2017-06-01 00:27:00    0.37  17903680561304213844
3 2017-06-01 00:29:00    0.55  16109239769442553005
4

Memory usage went from 4.4 MB to 1.2 MB without any loss of data. Nice.

## The orders table (data on orders):
- uid — unique identifier of the user making an order
    - Change from 'Uid' to 'uid'
- purchase_time — order date and time
    - Change from 'Buy Ts' to 'purchase_time'
    - Convert to datetime type
- profit — Yandex.Afisha's revenue from the order
    - Change from 'Revenue' to 'profit' (the values are still revenue per order, not profit net of costs)

In [11]:
c_df.info(memory_usage='deep')
print(c_df.describe())
print(c_df.head())
print(c_df.sample(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2542 entries, 0 to 2541
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   source_id  2542 non-null   int64  
 1   dt         2542 non-null   object 
 2   costs      2542 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 206.2 KB
         source_id        costs
count  2542.000000  2542.000000
mean      4.857199   129.477427
std       3.181581   156.296628
min       1.000000     0.540000
25%       2.000000    21.945000
50%       4.000000    77.295000
75%       9.000000   170.065000
max      10.000000  1788.280000
   source_id          dt  costs
0          1  2017-06-01  75.20
1          1  2017-06-02  62.25
2          1  2017-06-03  36.53
3          1  2017-06-04  55.00
4          1  2017-06-05  57.08
      source_id          dt   costs
944           3  2018-01-05  498.58
1413          4  2018-04-23   65.52
76            1  2017-08-16   22.07
964    

In [12]:
c_df['source_id'].value_counts()

5     364
1     363
2     363
3     363
4     363
9     363
10    363
Name: source_id, dtype: int64

In [13]:
c_df['source_id'] = c_df['source_id'].astype('category')
# The raw strings look like '2017-06-01', so the format uses dashes.
c_df['dt'] = pd.to_datetime(c_df['dt'], format="%Y-%m-%d")

c_df = c_df.rename(
    columns={
        'dt': 'date'
    }
)

In [14]:
c_df.info(memory_usage='deep')
print(c_df.describe())
print(c_df.head())
print(c_df.sample(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2542 entries, 0 to 2541
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   source_id  2542 non-null   category      
 1   date       2542 non-null   datetime64[ns]
 2   costs      2542 non-null   float64       
dtypes: category(1), datetime64[ns](1), float64(1)
memory usage: 42.7 KB
             costs
count  2542.000000
mean    129.477427
std     156.296628
min       0.540000
25%      21.945000
50%      77.295000
75%     170.065000
max    1788.280000
  source_id       date  costs
0         1 2017-06-01  75.20
1         1 2017-06-02  62.25
2         1 2017-06-03  36.53
3         1 2017-06-04  55.00
4         1 2017-06-05  57.08
     source_id       date   costs
1879         9 2017-08-03    9.30
1829         9 2017-06-14    5.08
570          2 2017-12-25  176.20
1144         4 2017-07-26   60.41
736          3 2017-06-11  148.50


Memory usage went from 206.2 KB to 42.7 KB without any loss of data. Stellar move by me.

## The costs table (data on marketing expenses):
- source_id — ad source identifier
    - There are only 7 unique values, so I converted this to category type
- dt — date
    - Change from 'dt' to 'date'
    - It only has dates, and no times. Convert to datetime type accordingly
- costs — expenses on this ad source on this day
    - This looks fine unchanged

Check for duplicate rows real quick:

In [17]:
# duplicated() flags each fully repeated row; summing the booleans counts them.
print(v_df.duplicated().sum(), 'duplicate rows out of', v_df.shape[0])
print(o_df.duplicated().sum(), 'duplicate rows out of', o_df.shape[0])
print(c_df.duplicated().sum(), 'duplicate rows out of', c_df.shape[0])

0 duplicate rows out of 359400
0 duplicate rows out of 50415
0 duplicate rows out of 2542


In [32]:
v_df.to_csv('datasets/visits_log_us.csv', index=False)
o_df.to_csv('datasets/orders_log_us.csv', index=False)
c_df.to_csv('datasets/costs_us.csv', index=False)

The cleaned data has been saved!
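
One thing to keep in mind when loading these files later: CSV stores only text, so the category and datetime dtypes set above are not kept on disk and have to be reapplied when reading. A minimal sketch of that round trip, using a tiny made-up costs table and an in-memory buffer instead of the real files:

```python
import io

import pandas as pd

# Tiny stand-in for the cleaned costs table.
costs = pd.DataFrame(
    {
        "source_id": pd.Series([1, 1, 2]).astype("category"),
        "date": pd.to_datetime(["2017-06-01", "2017-06-02", "2017-06-01"]),
        "costs": [75.20, 62.25, 36.53],
    }
)

buffer = io.StringIO()
costs.to_csv(buffer, index=False)

# A plain read_csv returns source_id as integers and date as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reapply the cleaned types: parse the date column while reading,
# then convert source_id back to category.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date"])
restored["source_id"] = restored["source_id"].astype("category")

print(plain.dtypes)
print(restored.dtypes)
```

The column names and clean values do survive the round trip, so only the dtype conversions need repeating in the analysis notebook.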