# **User Data Score -- Ongo**


______________________________________________________________________________________________________________________
## In this document we want to assign a score to users, indicating the overall quality of their data. 

### We define a user's data quality score to be a number 1-100 reflecting a combination of recency, consistency, and quantity/correlation strength. It indicates how strong the correlations and conclusions we can draw from a user's data are. The higher a user's score, the more confident we are of creating strong recommendations.


- Recency is defined by how recent the bulk of a user's data is. Ideally users have recent data, leading to more relevant conclusions.
- Consistency is defined by how consistently a user collects their data. An ideal user would upload data on a consistent basis, making patterns clearer.
- Quantity/correlation strength is how many points of data a user owns of each type and how strong the patterns we are currently finding are. This helps us deal with variance in the data and confirm the strength of our variable relationships.

______________________________________________________________________________________________________________________




# Process

### To start we will partition the score into 1/8 recency, 1/8 consistency, and 3/4 quantity/correlation.
### We will focus on the variable types steps, weight, sleep, and HR.

Note: This splitting is arbitrary in nature and will be changed as we test our score. 
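
As a rough illustration of the split above, the three component scores could be combined as a weighted average. This is only a sketch: the names `WEIGHTS` and `overall_score` are hypothetical, and each component is assumed to already be on a 0-100 scale.

```python
# Hypothetical weights taken from the 1/8, 1/8, 3/4 split described above.
WEIGHTS = {"recency": 1 / 8, "consistency": 1 / 8, "quantity_correlation": 3 / 4}


def overall_score(component_scores, weights=WEIGHTS):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * component_scores[name] for name in weights)
    return weighted / total_weight


# Made-up component scores, purely for illustration.
example = overall_score({"recency": 100, "consistency": 40, "quantity_correlation": 60})
print(example)  # 62.5
```

Dividing by the total weight means the weights can be adjusted during testing without having to sum to exactly 1.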

#### *This next cell just initializes our environment.* 

In [1]:
import csv
import datetime
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
sns.set()

# Load the four sample exports and stack them into one frame.
sample_files = [
    '20171016-210106-DataSample.csv',
    '20171016-210304-DataSample.csv',
    '20171016-210529-DataSample.csv',
    '20171016-235959-DataSample.csv',
]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same stacking.
data_sample = pd.concat([pd.read_csv(f, dtype={'value': float}) for f in sample_files])

# Map the first owner ID onto the second.
data_sample['owner'] = data_sample['owner'].replace(
    '00000000-5854-8d6f-b8eb-cf14a0f795df', '00000000-56ff-538b-2223-e1800b5e3ddb')

data_sample['startDate'] = pd.to_datetime(data_sample['startDate'])
data_sample['endDate'] = pd.to_datetime(data_sample['endDate'])

# Vectorized subtraction instead of a row-wise apply.
data_sample['duration'] = data_sample['endDate'] - data_sample['startDate']
data_sample['day_of_week'] = data_sample['startDate'].dt.dayofweek  # 0 - 6 is Monday - Sunday
data_sample.head()



Unnamed: 0,_id,owner,source,sourceId,sourceName,type,startDate,endDate,value,content,originalData,createdAt,updatedAt,duration,day_of_week
0,563257a8-70fc-45d3-bdad-106bd7f66b29,00000000-56ff-538b-2223-e1800b5e3ddb,nokia,step-count:2017-09-28,Nokia Health Mate,health-step-distance,2017-09-28 07:00:00,2017-09-29 06:59:59,6704.862,,,2017-10-13 14:58:59.142236-07,2017-10-13 15:01:20.166114-07,23:59:59,3
1,80ae5a3e-dabc-4235-bd94-c87cd396555e,00000000-56ff-538b-2223-e1800b5e3ddb,nokia,step-count:2016-03-10,Nokia Health Mate,health-step-count,2016-03-10 05:00:00,2016-03-11 04:59:59,10204.0,,"{""steps"": 10204, ""moderate"": 1860, ""date"": ""20...",2017-10-13 14:55:40.604012-07,2017-10-13 15:16:33.227505-07,23:59:59,3
2,c1c5bebc-4c8f-4027-9e08-c08fbf8a7321,00000000-56ff-538b-2223-e1800b5e3ddb,nokia,step-count:2016-03-08,Nokia Health Mate,health-step-count,2016-03-08 05:00:00,2016-03-09 04:59:59,94.0,,"{""steps"": 94, ""moderate"": 0, ""date"": ""2016-03-...",2017-10-13 14:55:40.604012-07,2017-10-13 15:16:33.227505-07,23:59:59,1
3,9bbe6936-21a2-4142-b641-e1bfbdf7280b,00000000-56ff-538b-2223-e1800b5e3ddb,nokia,step-count:2017-01-27,Nokia Health Mate,health-step-distance,2017-01-27 08:00:00,2017-01-28 07:59:59,3179.66,,"{""steps"": 4962, ""moderate"": 1380, ""date"": ""201...",2017-10-13 14:55:40.604012-07,2017-10-13 15:16:33.227505-07,23:59:59,4
4,e7625898-f2a8-4ab3-af93-04f52ad05ccc,00000000-56ff-538b-2223-e1800b5e3ddb,nokia,step-count:2017-07-25,Nokia Health Mate,health-step-distance,2017-07-25 07:00:00,2017-07-26 06:59:59,9816.914,,"{""steps"": 13544, ""moderate"": 1740, ""date"": ""20...",2017-10-13 14:55:40.604012-07,2017-10-13 15:16:33.227505-07,23:59:59,1


## Recency

### Method: We want to see whether a user's data is recent. We will take the total amount of a user's data, see when the data was first tracked, and see how much of that data falls within the last 60 days.

Example: User X has 100 points of data for sleep over 2 years of collected data. However, 99% of this data is older than 60 days. This is bad.

### Aydin, Luqmaan -> Calculate a score out of 100 for each type. A type scores 100 when the share of a user's data that falls in the past 60 days is equal to or greater than 60 days / the total time interval over which data has been collected.

## Formula: min(100, 100 * (x/y) * (z/60)) = type recency score

### x -> amount of data in the last 60 days
### y -> total amount of data
### z -> total number of days over which their data was collected

## After doing this for each type you will have a few scores; average them for a final score.


Example: User has 100 points of data over past 120 days, 50 points of data are in past 60 days. Score of 100. 
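
The formula and the example above can be sketched in plain Python. This is a minimal illustration rather than the final implementation: the function names are hypothetical, the score is scaled to 0-100 and capped at 100, and z is assumed to be the number of days from the first data point to today.

```python
from datetime import date, timedelta

WINDOW_DAYS = 60  # the 60-day recency window from the method above


def type_recency_score(dates, today, window_days=WINDOW_DAYS):
    """Recency score for one data type: min(100, 100 * (x/y) * (z/window_days))."""
    if not dates:
        return 0.0
    cutoff = today - timedelta(days=window_days)
    x = sum(1 for d in dates if d > cutoff)  # points inside the recency window
    y = len(dates)                           # all points
    z = (today - min(dates)).days            # assumption: days since the first point
    return min(100.0, 100.0 * (x / y) * (z / window_days))


def recency_score(dates_by_type, today):
    """Average the per-type scores into one recency score."""
    scores = [type_recency_score(d, today) for d in dates_by_type.values()]
    return sum(scores) / len(scores) if scores else 0.0


# The example above: 100 points over the past 120 days, 50 of them in the past 60 days.
today = date(2017, 10, 16)
recent = [today - timedelta(days=i) for i in range(50)]
old = [today - timedelta(days=i) for i in range(71, 121)]
print(type_recency_score(recent + old, today))  # 100.0
```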

## Consistency

### Method: We want to see how consistent the collection of a user's data is. For variables that we want to compare on certain intervals, we want to know how much data we're "losing". 

Example: User X has 1000 points of HR and 500 points of step count. However, on a very large number of days where they measured their step count, they didn't measure their heart rate, or did not record their amount of sleep. Such data is difficult to use, as the gaps could have profound effects on the correlations we find.

### Matt -> Calculate a score 0-100 grading the consistency of a user's data.
#### Do this for each type, and average the results.
#### Note that different types of variables call for different time intervals. For weight you wouldn't really care whether they measure every day, but a month's worth of user activity with no weight measurements during that time is bad. So is having all 4 of their weight measurements in the last 2 weeks out of 4 months.

#### There's a lot of freedom on this, let me know if you have any questions or anything
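
One possible way to handle the per-type intervals mentioned above is to split the observation window into bins of the expected length and count how many bins contain at least one measurement. This is a sketch in plain Python: `EXPECTED_INTERVAL_DAYS` and `interval_consistency` are hypothetical names, and the interval lengths are placeholder assumptions.

```python
from datetime import date, timedelta

# Hypothetical expected measurement intervals, in days, per type.
EXPECTED_INTERVAL_DAYS = {"health-step-count": 1, "health-weight": 7}


def interval_consistency(dates, start, end, interval_days):
    """Percent of interval_days-long bins in [start, end] holding at least one measurement."""
    n_bins = (end - start).days // interval_days + 1
    covered = {(d - start).days // interval_days for d in dates if start <= d <= end}
    return 100.0 * len(covered) / n_bins


start = date(2017, 6, 1)

# Weight: 4 measurements, all in the last 2 weeks of a 16-week window.
weight_dates = [start + timedelta(days=i) for i in (99, 102, 106, 110)]
weight_end = start + timedelta(days=111)
print(interval_consistency(weight_dates, start, weight_end,
                           EXPECTED_INTERVAL_DAYS["health-weight"]))  # 12.5

# Steps: 8 of 10 days measured.
step_dates = [start + timedelta(days=i) for i in (0, 1, 2, 5, 6, 7, 8, 9)]
step_end = start + timedelta(days=9)
print(interval_consistency(step_dates, start, step_end,
                           EXPECTED_INTERVAL_DAYS["health-step-count"]))  # 80.0
```

Passing the window explicitly, rather than using the first and last measurement, is what lets the clustered weight measurements score low.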


In [3]:
data_sample['type'].unique()

array(['health-step-distance', 'health-step-count',
       'health-fat-mass-weight', 'health-body-fat', 'health-fat-free-mass',
       'health-heart-rate', 'health-height', 'health-weight', 'health-bmi',
       'health-sleep'], dtype=object)

In [4]:
data_sample['owner'].unique()

array(['00000000-56ff-538b-2223-e1800b5e3ddb',
       '00000000-5851-ee08-eb34-e20acc5af74e',
       'd145b032-b7a5-4fa8-9887-b46598f4683a',
       '00000000-5951-4787-2497-ae32dc8d07d4',
       '00000000-584e-1f39-bdee-d4102b989d01',
       '00000000-584d-a4f0-bdee-d4102b989ce5'], dtype=object)

Looking at the startDate and endDate columns, it's clear that the following wouldn't exactly
work and that the index should be set before running this function, setting it intelligently
based on the source of the data

Also, this won't score sleep data properly until the duration gets moved to the value
column

In [28]:
ds_c = data_sample.copy()
ds_c.index = ds_c.startDate

# completeness within data types
for owner in ds_c.owner.unique():
    ds_o = ds_c[ds_c.owner == owner]
    for typ in ds_o.type.unique():
        print('Owner:', owner, 'Data Type:', typ)
        # Daily totals of the value column only. min_count=1 leaves days with
        # no rows as NaN (newer pandas would otherwise return 0 for them).
        daily = ds_o.loc[ds_o.type == typ, 'value'].resample('D').sum(min_count=1)
        missing_days = (daily == 0).sum()  # days whose recorded values sum to 0
        null_days = daily.isnull().sum()   # days with no usable records
        total_days = len(daily)
        # Named `score` so it does not shadow the consistency_score() function.
        score = (total_days - null_days - missing_days) / total_days * 100
        print('Score:', score)

Owner: 00000000-56ff-538b-2223-e1800b5e3ddb Data Type: health-step-distance
Score: 50.9803921569
Owner: 00000000-56ff-538b-2223-e1800b5e3ddb Data Type: health-step-count
Score: 56.8292682927
Owner: 00000000-56ff-538b-2223-e1800b5e3ddb Data Type: health-fat-mass-weight
Score: 29.4220665499
Owner: 00000000-56ff-538b-2223-e1800b5e3ddb Data Type: health-body-fat
Score: 29.5096322242
Owner: 00000000-56ff-538b-2223-e1800b5e3ddb Data Type: health-fat-free-mass
Score: 29.4220665499
Owner: 00000000-56ff-538b-2223-e1800b5e3ddb Data Type: health-heart-rate
Score: 32.8097731239
Owner: 00000000-56ff-538b-2223-e1800b5e3ddb Data Type: health-height
Score: 100.0
Owner: 00000000-56ff-538b-2223-e1800b5e3ddb Data Type: health-weight
Score: 17.3892329681
Owner: 00000000-56ff-538b-2223-e1800b5e3ddb Data Type: health-bmi
Score: 16.9128156265
Owner: 00000000-56ff-538b-2223-e1800b5e3ddb Data Type: health-sleep
Score: 2.14489990467
Owner: 00000000-5851-ee08-eb34-e20acc5af74e Data Type: health-step-count
Score:

In [26]:
# cross type completeness

# freq1 must be the finer (shorter) frequency, and the data must be from one user only
def consistency_score(df, type1, freq1, type2, freq2):
    # Total each type at its own frequency. min_count=1 keeps empty bins as NaN
    # (newer pandas returns 0 for them by default) so they can be filled explicitly.
    df_t1 = df[df.type == type1][['value']].resample(freq1).sum(min_count=1)
    df_t1 = df_t1.rename(columns={'value': 'v1'}).fillna(0)

    df_t2 = df[df.type == type2][['value']].resample(freq2).sum(min_count=1)
    df_t2 = df_t2.rename(columns={'value': 'v2'}).fillna(0)

    # Upsample type2 to freq1 and carry each value forward across the new bins.
    df_t2 = df_t2.resample(freq1).sum(min_count=1).ffill()

    # Align both types on one index; a bin counts as missing if either type is 0.
    df_cr = pd.concat([df_t1, df_t2], axis=1).fillna(0)
    missing = len(df_cr[(df_cr.v1 == 0) | (df_cr.v2 == 0)])
    return (len(df_cr) - missing) / len(df_cr) * 100

In [27]:
df1 = data_sample[data_sample.owner == data_sample.owner.unique()[0]]
df1.index = df1.startDate
consistency_score(df1, 'health-step-count', 'D', 'health-weight', 'W')

17.753450737743933

## Quantity/Correlation Strength 

### Method: We essentially want to see if this user's data is telling us anything. Can we find correlations and patterns in the data, and how confident can we be in those correlations? We look at the different combinations of the variables and see for how many of them we can find strong results.

Example: User X has stellar correlations. We find strong patterns in all the variable combinations, with small p-values for each of them, so the user gets a high score.

### Lucas, Sebastian, John -> We want to look at the chance that the correlations we are finding are "random". We can use p-values across the different combinations, see how many we can be fairly confident in, and create a score from that.
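
A minimal sketch of the p-value idea, using only the standard library. It swaps in a permutation test on the Pearson correlation to estimate each p-value, and scores a user by the percentage of variable pairs with p below a threshold. All names, the 0.05 threshold, and the number of permutations are assumptions, and the series are assumed to be already aligned (one value per shared day).

```python
import itertools
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, n_perm=1000, seed=0):
    """Estimate how often shuffled data shows a correlation at least as strong as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def correlation_score(series_by_type, alpha=0.05):
    """Percent of variable pairs whose permutation p-value is below alpha."""
    pairs = list(itertools.combinations(sorted(series_by_type), 2))
    if not pairs:
        return 0.0
    significant = sum(
        1 for a, b in pairs
        if permutation_p_value(series_by_type[a], series_by_type[b]) < alpha
    )
    return 100.0 * significant / len(pairs)


# Made-up aligned daily values, purely for illustration.
steps = list(range(30))
sleep = [2 * s + 1 for s in steps]  # perfectly correlated with steps
weight = [70.0] * 30                # constant, so no correlation can be found
print(correlation_score({"steps": steps, "sleep": sleep, "weight": weight}))
```

With many variable pairs, some p-values will fall below the threshold by chance alone, so a multiple-comparison correction may be worth considering when this score is tested.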

## Testing 

### After creating our score we will want to be able to test how accurate it is.
### TO BE CONTINUED