# Processing and cleaning data
The phases of this part are divided in several sections. Coding styles and techniques vary from one section to fit the specific purpose and according to the skill area of the project member who was writing it.

In [2]:
import pandas as pd
import numpy as np
from datetime import timedelta
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
import plotly.graph_objects as go
from tqdm import tqdm

## Base - level cleaning: which users are fit for analysis.
An eligible user must not be a drop out, so there is data about their life for every day of the first few weeks of the experiment. 
For "total count" statistical measures other operations are performed. For example, to count how much of the user's day was spent on doing sports, there need to be null answers in the data to signify the times in which a user was inactive that will be summed to the other answers; then we obtain a complete activity dataset where the number of rows doing "Sports" equate exactly how much time each user has spent on Sports. 

**Purpose**: obtaining a list of user IDs that are fit for analysis, and filling missing timestamp data with null answers. 

**Process**: for each user, an algorithm called "gantt_data" checks if the user's data is consistent or not. Only users who pass this test are kept. For each of those, missing data is found and filled with nulle answers.

In [3]:
# Loading raw data.
td_dataset = pd.read_stata("data/td_ita.dta")  # time diaries dataset
demo_dataset = pd.read_stata("data/data4diarynew_ITA.dta") # demographics dataset
step_dataset = pd.read_csv("data/stepDetector_30min.csv") # step counter dataset


In [166]:
# Users who did sport AT LEAST once will be submitted to this algorithm.



time_accurate = td_dataset[(td_dataset.date_not.dt.month == 11 
                         ) & (td_dataset.date_not.dt.day >= 13
                             ) & (td_dataset.date_not.dt.day <= 30)]

once_active = time_accurate[time_accurate['what'] == 'Sport'].id.unique()

once_active = time_accurate[time_accurate['what'] == 'Sport'].id.unique()
IDs = pd.Series(time_accurate.id.unique()).astype(int) 
IDs = IDs[IDs.isin(once_active)]

user_activity = dict()
for ID in tqdm(IDs):
    id_act = time_accurate[time_accurate.id == ID]
    user_activity[ID] = list(id_act.date_not.dt.day.sort_values().unique())
print(f'At first there were about {time_accurate.id.unique().shape[0]} users. Now: N = ',len(user_activity.keys())) 
print("User dictionary with sorted unique user-specific observation days is done.")

def gantt_data_org(k, v):
    
    ''' 
    gantt_data is named after the Gantt chart.
    S is the Series of days (regardless of when they start).
    '''
    
    tot = 0
    start = np.nan
    end = np.nan
    consecutive = False
    
    if len(v):  #if it's not empty, it will check for continuity
        tot = (pd.to_datetime(v[-1]) - pd.to_datetime(v[0])).days
        start, end = v[0], v[-1]
        
        for i in range(len(v)-1):
            v[i]  #current
            v[i+1] #next
            difference = (pd.to_datetime(v[i+1]) - pd.to_datetime(v[i])).days
            # if there is more than two days of difference it means that there was more than one
            # day in the data with no observations, and it is unlikely it's consistent data.
            if difference >= 2:    
                consecutive = False
                break
            else:
                consecutive = True
                
    return pd.DataFrame([[k, start, end, tot, consecutive]], 
                        columns=['id', 'start', 'finish', 'tot', 'cons'])

gantt_data = pd.DataFrame(columns=["id", "start", "finish", "tot", "cons"])

for ID, days in user_activity.items():
    res = gantt_data_org(ID, days)
    gantt_data = pd.concat([gantt_data, res], ignore_index=True)
print("The test is finished.")
print("We end up with: N = ", gantt_data.id.unique().shape[0])
to_keep = gantt_data[gantt_data['cons'] == True].id

sport_eve = time_accurate[time_accurate.id.isin(to_keep)]

100%|████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:00<00:00, 740.41it/s]


At first there were about 241 users. Now: N =  128
User dictionary with sorted unique user-specific observation days is done.
The test is finished.
We end up with: N =  128


In [138]:
user_activity

{0: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 1: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 2: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 4: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 5: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 6: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 9: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 10: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 12: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 15: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 17: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 19: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 21: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 2

Everyone who has done sport once (128 people) are also people who never dropped out from the experiment. They also all have observations ranging from 13th of November to 30th of November, as shown in the dictionary _user activity_.

In [185]:
print(f'Initial count of rows: {sport_eve.shape[0]:,}')

def today_or_yesterday(t): 

    '''
    This function labels the row with the day it belongs to using circadian time. 
    If an observation falls in the time range from midnight to 5 a.m. it is 
    classified as belonging to the precedent calendar day as we humans would perceive it. 
    5 a.m. is where more than 99% of the users go to sleep or wake up - which means that their day ended or started, 
    and their perception of which day it is has shifted by one.
    Otherwise, it's the same day as indicated the timestamp. 
    '''
    if t.hour < 5:
        return days[t.day - 1]
    else:
        return days[t.day]

days = dict()
for i in range(sport_eve.date_not.dt.day.min(), sport_eve.date_not.dt.day.max() + 1):
    days[i] = i - sport_eve.date_not.dt.day.min()  

min_range = pd.date_range(start="2020-11-13 00:00:00", end="2020-11-30 23:59:59", freq='30T')   #notifications were fired every half hour and such must be the timing between null events
results = pd.DataFrame()
for user in tqdm(sport_eve.id.unique()):
    subset = sport_eve[sport_eve.id == user]
    complete_data = pd.DataFrame({'date_not': min_range})
    merged_df = pd.concat([complete_data, subset])
    merged_df['id'] = merged_df['id'].fillna(user)
    
    merged_df = merged_df.sort_values(by="what", na_position='first', ascending=False).drop_duplicates(keep='last', subset='date_not')
    
    merged_df['what'] = merged_df['what'].astype(str)
    merged_df['what'] = merged_df['what'].replace("nan", "Inactive")
    
    if merged_df.date_not.min() != complete_data.date_not.min() or merged_df.date_not.max() != complete_data.date_not.max():
        print()
        print(f"Some mistake occurred. \nThe amount of observation time is different than expected.\nUser: {user}, dates: {merged_df.date_not.min(), merged_df.date_not.max()}")
        print()
    if merged_df.shape[0] != complete_data.shape[0]: 
        print()
        print(f"Some mistake occurred. \nThe amount rows is different than expected.\nUser: {user}, length: {merged_df.shape[0]}, expected: {complete_data.shape[0]}")
        print()
    
    results = pd.concat([results, merged_df], ignore_index=True)
results['day'] = [today_or_yesterday(time) for time in results.date_not]
results.id = results.id.astype(int)
print(f'Finished. Total count of rows: {results.shape[0]:,}')

Initial count of rows: 108,800


  3%|██▋                                                                                    | 4/128 [00:00<00:03, 33.44it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 0.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 0.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 1.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 1.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 2.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 2.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 4.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 2

  9%|████████                                                                              | 12/128 [00:00<00:03, 33.92it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 12.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 12.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 15.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 15.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 17.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 17.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 19.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-

 16%|█████████████▍                                                                        | 20/128 [00:00<00:03, 33.46it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 24.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 24.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 26.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 26.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 27.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 27.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 31.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-

 19%|████████████████▏                                                                     | 24/128 [00:00<00:03, 33.09it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 41.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 41.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 43.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 43.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 44.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 44.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 45.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-

 22%|██████████████████▊                                                                   | 28/128 [00:00<00:02, 33.58it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 53.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 53.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 57.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 57.0, length: 864, expected: 854



 25%|█████████████████████▌                                                                | 32/128 [00:00<00:02, 33.18it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 58.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 58.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 60.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 60.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 61.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 61.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 63.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-

 28%|████████████████████████▏                                                             | 36/128 [00:01<00:02, 32.80it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 67.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 67.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 68.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 68.0, length: 864, expected: 854



 31%|██████████████████████████▉                                                           | 40/128 [00:01<00:02, 32.80it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 72.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 72.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 73.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 73.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 74.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 74.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 75.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-

 38%|████████████████████████████████▎                                                     | 48/128 [00:01<00:02, 32.57it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 83.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 83.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 84.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 84.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 85.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 85.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 87.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-

 41%|██████████████████████████████████▉                                                   | 52/128 [00:01<00:02, 32.49it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 97.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 97.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 98.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 98.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 99.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 99.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 100.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020

 44%|█████████████████████████████████████▋                                                | 56/128 [00:01<00:02, 32.45it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 105.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 105.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 106.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 106.0, length: 864, expected: 854



 47%|████████████████████████████████████████▎                                             | 60/128 [00:01<00:02, 31.93it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 109.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 109.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 112.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 112.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 113.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 113.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 114.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp

 50%|███████████████████████████████████████████                                           | 64/128 [00:01<00:02, 31.92it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 119.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 119.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 120.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 120.0, length: 864, expected: 854



 53%|█████████████████████████████████████████████▋                                        | 68/128 [00:02<00:01, 31.87it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 123.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 123.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 125.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 125.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 127.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 127.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 128.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp

 56%|████████████████████████████████████████████████▍                                     | 72/128 [00:02<00:01, 31.32it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 136.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 136.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 137.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 137.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 141.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 141.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 146.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp

 59%|███████████████████████████████████████████████████                                   | 76/128 [00:02<00:01, 31.21it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 149.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 149.0, length: 864, expected: 854



 62%|█████████████████████████████████████████████████████▊                                | 80/128 [00:02<00:02, 20.28it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 151.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 151.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 153.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 153.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 155.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 155.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 160.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp

 68%|██████████████████████████████████████████████████████████▍                           | 87/128 [00:02<00:01, 23.75it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 169.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 169.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 172.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 172.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 177.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 177.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 182.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp

 74%|███████████████████████████████████████████████████████████████▊                      | 95/128 [00:03<00:01, 27.22it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 202.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 202.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 204.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 204.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 205.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 205.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 206.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp

 80%|████████████████████████████████████████████████████████████████████▍                | 103/128 [00:03<00:00, 28.78it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 215.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 215.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 216.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 216.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 218.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 218.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 219.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp

 85%|████████████████████████████████████████████████████████████████████████▍            | 109/128 [00:03<00:00, 28.19it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 223.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 223.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 227.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 227.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 228.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 228.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 229.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp

 91%|█████████████████████████████████████████████████████████████████████████████        | 116/128 [00:03<00:00, 29.14it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 233.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 233.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 238.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 238.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 239.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 239.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 240.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp

 96%|█████████████████████████████████████████████████████████████████████████████████▋   | 123/128 [00:04<00:00, 29.09it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 245.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 245.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 250.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 250.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 251.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 251.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 254.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp

100%|█████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:04<00:00, 29.59it/s]


Some mistake occurred. 
The amount of observation time is different than expected.
User: 258.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 258.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 260.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 260.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 262.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp('2020-11-30 23:30:00'))


Some mistake occurred. 
The amount rows is different than expected.
User: 262.0, length: 864, expected: 854


Some mistake occurred. 
The amount of observation time is different than expected.
User: 265.0, dates: (Timestamp('2020-11-13 00:00:00'), Timestamp




KeyError: 12

In [184]:
days = dict()
for i in range(sport_eve.date_not.dt.day.min(), sport_eve.date_not.dt.day.max()):
    days[i] = i - sport_eve.date_not.dt.day.min()  
days
td_dataset.date_not.min()

Timestamp('2020-11-13 00:00:00')

In [164]:
n = results.id.unique().shape[0]
days = results.date_not.max() - results.date_not.min()
days.total_seconds()/60/60/2/24  
n*days.total_seconds()/60/60/2/24  

def today_or_yesterday(t): 
    if t < 0:
        return t + 24
    else:
        return t

experiment = results
experiment['hour'] = experiment['date_not'].dt.hour 
total = []
for hour in experiment['hour'].unique():
    total.append(sum(experiment[experiment['hour']==hour][['what']].value_counts().sort_values()))
print(total)    

[4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608, 4608]


## Sport data - from events to sessions
In order to use perform specific analyses, the data must fit specific machine-readable standards. In this section for later purposes (in other notebooks).


**Purpose**: have knowledge on how users have answered the option "Sport", as it may vary from type of activity, location, and many other factors.

**Process**: to understand the way people have answered "Sport", the events of said activity are grouped with each other when they are adjacent. Other code re-categorization are also performed for simplicity. 