# Patterns in Time

## 1. Background
The objective of this collection of Jupyter notebooks is to establish a systematic process for identifying time patterns within a user's data. These time patterns help in recognizing how external factors, such as the calendar, influence our users.

We aim to uncover recurring patterns across various aspects of time, such as: 
- Hours of the day
- Days of the week
- Days of the month
- Parts of the month
- Months

Example questions that users could ask or receive answers to include:

- Is there a tendency for headaches to occur mostly on Mondays?
- Do I generally experience higher levels of happiness later in the month?
- Is there a noticeable difference in my activity levels on certain days?
- How does my blood sugar level fluctuate throughout the day?

These notebooks were developed as a preliminary step towards an analytical extension operating on LLIF (Live Learn Innovate Foundation) and utilized by BestLife.

## 2. Outlier Detection Methods

Initially, we considered correlation methods to identify patterns within the data. However, after initial testing, we found these methods to be less than ideal. Following extensive discussions, we decided to adopt generic statistical methods that determine the presence of outliers in a given data series, which we believe to be a more suitable metric. The following subsections outline the methods we chose to explore.


### Standard Deviation on a Normalized Data Series

As an alternative to correlation coefficients, we explored the concept of using standard deviation on a normalized data series to assess the likelihood of outliers. Our objective was to derive a score ranging from 0 to 1, considering both the presence and magnitude of outliers in the data. Despite extensive exploration, we were unable to identify an approach that met all of our requirements.

### Standard Deviation in Combination with the Mean of the Data Series

Building upon the concept of using standard deviation, we abandoned the normalization process and instead investigated methods involving rules based on both the standard deviation and mean of the data series. While this approach demonstrated effectiveness with many basic testing datasets, we harbored reservations about its suitability for the final product. Nonetheless, it presents an opportunity to enhance the initial solution by incorporating mechanisms to determine the magnitude of detected outliers.

### Z-Score Outlier Detection [SOURCE](https://www.machinelearningplus.com/machine-learning/how-to-detect-outliers-with-z-score/)
We investigated the possibility of using the Z-score method for outlier detection, which has provided us with satisfactory results for the majority of our testing datasets. This method stands as a viable option for our purposes. Additionally, we experimented with developing a process to generate a score ranging from 0 to 1, but unfortunately, this approach did not yield favorable results.
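
As a rough illustration of the Z-score rule, the sketch below standardizes a series and flags values whose absolute Z-score exceeds a threshold. The function name `zscore_outlier_mask`, the threshold values, and the use of the population standard deviation are assumptions made for this example, not the notebook's implementation. The counts are the weekday series from the headache test set shown in the results further down.

```python
import numpy as np


def zscore_outlier_mask(series, threshold=2.0):
    """Flag values whose absolute Z-score exceeds `threshold` (hypothetical helper)."""
    series = np.asarray(series, dtype=float)
    std = series.std()  # population standard deviation (ddof=0), an assumption
    if std == 0:
        # A constant series has no spread, so nothing can be an outlier
        return np.zeros(series.shape, dtype=bool)
    z_scores = (series - series.mean()) / std
    return np.abs(z_scores) > threshold


# Monday..Sunday headache counts
counts = [23, 1, 8, 3, 5, 1, 5]

print(zscore_outlier_mask(counts, threshold=2.0).tolist())
# [True, False, False, False, False, False, False]
print(zscore_outlier_mask(counts, threshold=3.0).any())
# False
```

One caveat worth keeping in mind: for a series of n values, no population Z-score can exceed sqrt(n - 1) in absolute value. A seven-value weekday series therefore tops out at about 2.45, and the commonly quoted threshold of 3 can never fire on it.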

### Interquartile Range Outlier Detection [SOURCE](https://online.stat.psu.edu/stat200/lesson/3/3.2)
This method uses the quartiles of the data series. It computes the interquartile range (IQR), the distance from the first quartile (Q1) to the third quartile (Q3), and flags values below `Q1 - k * IQR` or above `Q3 + k * IQR` as potential outliers, with `k = 1.5` by default. Our experimentation revealed that this approach consistently produced the most favorable results, leading us to select it for implementation. With this method, we can obtain the outliers within the data series along with their corresponding calendar representations.

Additionally, we attempted to generate a score ranging from 0 to 1 to summarize the outlier detection process. However, this approach did not yield satisfactory results.
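
To make the rule concrete, here is a small worked example on the weekday counts from the headache test set shown in the results further down. The helper name `iqr_outlier_mask` is made up for this sketch; it mirrors the idea of the `detect_outliers_iqr` function defined later in the notebook and then pairs each flagged value with its calendar label.

```python
import numpy as np


def iqr_outlier_mask(series, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3 (hypothetical helper)."""
    series = np.asarray(series, dtype=float)
    q1, q3 = np.percentile(series, [25, 75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


week_days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
counts = [23, 1, 8, 3, 5, 1, 5]

# Here Q1 = 2.0 and Q3 = 6.5, so IQR = 4.5 and the bounds are -4.75 and 13.25
mask = iqr_outlier_mask(counts)
outliers = [(day, count) for day, count, flagged in zip(week_days, counts, mask) if flagged]
print(outliers)
# [('Monday', 23)]
```

Because the bounds are built from quartiles rather than from the mean and standard deviation, a single extreme value shifts them far less than it would shift a Z-score.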

### Bayesian Statistics Approach
Utilizing a Bayesian statistics approach can offer insights into whether the distribution of the data series conforms to a normal distribution, which is expected in the absence of temporal patterns. Here are the steps involved:

1. **Choose a Bayesian Model**: Select an appropriate Bayesian model that represents the underlying distribution of the data. In this case, a normal distribution would be suitable.
2. **Define Priors**: Specify prior distributions for the parameters of the chosen model. Priors reflect beliefs about the parameters before observing the data. They can be informed by domain knowledge or chosen to be uninformative if prior knowledge is limited.
3. **Compute Posteriors**: Update the priors using Bayes' theorem to calculate the posterior distributions of the parameters given the observed data. This typically involves techniques such as Markov Chain Monte Carlo (MCMC) or variational inference.
4. **Identify Outliers**: With the posterior distributions of the parameters, outliers can be identified as data points with low probability under the fitted model. Outliers can be defined using thresholds based on credible intervals or by assessing the probability of a data point given the model.
5. **Evaluate Model Fit**: Assess the overall fit of the Bayesian model to the data and validate whether the detected outliers are genuine anomalies or errors in the data.
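
The steps above can be sketched in a few lines under simplifying assumptions. This example swaps the MCMC or variational inference named in step 3 for a closed-form conjugate update (a Normal likelihood with a Normal-Inverse-Gamma prior), then draws from the posterior predictive distribution with NumPy. The function names, the prior hyperparameters, the 95% interval, and the leave-one-out loop are all illustrative choices, not part of the notebooks.

```python
import numpy as np


def posterior_predictive_draws(data, rng, n_draws=20_000,
                               mu0=0.0, kappa0=1e-3, alpha0=1.0, beta0=1.0):
    """Sample new observations from a Normal model with a
    Normal-Inverse-Gamma prior (hyperparameters are assumptions)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    mean = data.mean()
    # Steps 2-3: conjugate prior, updated in closed form
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * mean) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = (beta0 + 0.5 * np.sum((data - mean) ** 2)
              + kappa0 * n * (mean - mu0) ** 2 / (2 * kappa_n))
    # Draw the variance, then the mean, then a new observation
    sigma2 = 1.0 / rng.gamma(shape=alpha_n, scale=1.0 / beta_n, size=n_draws)
    mu = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))
    return rng.normal(mu, np.sqrt(sigma2))


def detect_outliers_bayes(data, credible_mass=0.95, seed=0):
    """Step 4: flag points outside the central predictive interval."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    tail = (1.0 - credible_mass) / 2
    flags = np.zeros(data.size, dtype=bool)
    for i, value in enumerate(data):
        # Leave the point out so that it cannot widen its own interval
        rest = np.delete(data, i)
        draws = posterior_predictive_draws(rest, rng)
        low, high = np.quantile(draws, [tail, 1.0 - tail])
        flags[i] = (value < low) or (value > high)
    return flags


# Monday..Sunday headache counts from the test set used later in the notebook
counts = [23, 1, 8, 3, 5, 1, 5]
flags = detect_outliers_bayes(counts)
print(flags.tolist())
```

Step 5 is not covered here. Even in this reduced form the approach needs priors, sampling, and a per-point refit, which illustrates the extra machinery compared with a quartile rule.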

While the Bayesian approach offers a rigorous method for outlier detection, the Interquartile Range (IQR) method is simpler and should serve our purposes nearly as effectively. Its ease of implementation and robustness make it a suitable choice for our problem.

## 3. Included Files

1. PatternsInTime_Analysis.ipynb: This is the main file for the LLIF extension. It explores the generated domain-like testing datasets created via the PatternsInTime_GenerateTestingSets file. It serves as the primary building block for deploying to the LLIF codebase.
2. PatternsInTime_OutlierDetectionTesting.ipynb: This notebook was primarily used for researching different outlier detection methods. It contains functions used for outlier detection and tests the chosen method on given testing datasets.
3. PatternsInTime_GenerateTestingSets.ipynb: This file is solely used to generate the domain-like datasets required for the main analysis in PatternsInTime_Analysis.


## 4. Not Included

- **Time Between Events** - This is a concept we also want to explore, but we have yet to come to a decisive conclusion on exactly how to use it.
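
Should this idea be picked up later, one possible starting point is the gap between consecutive timestamps. The sketch below uses made-up timestamps and a hypothetical column name `hours_since_previous`. It only shows how the gaps could be computed with pandas, not how they should be scored or interpreted.

```python
import pandas as pd

# Made-up event timestamps, purely for illustration
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 08:00",
        "2024-01-03 09:30",
        "2024-01-03 21:30",
        "2024-01-10 08:00",
    ])
}).sort_values("timestamp").reset_index(drop=True)

# Gap to the previous event in hours; the first event has no predecessor (NaN)
events["hours_since_previous"] = events["timestamp"].diff().dt.total_seconds() / 3600

print(events["hours_since_previous"].tolist())
# [nan, 49.5, 12.0, 154.5]
```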

In [89]:
import warnings
import json
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

from PatternsInTime_HelperFunctions import get_documents_data_into_df, load_file_as_list_of_documents, add_time_columns_to_dataframe, detect_outliers_iqr

warnings.filterwarnings("ignore")

In [90]:
# SETUP MAPPINGS

group_by_categories = ["Hours", "Week_Days", "Days_Of_Month", "Month_Parts", "Months"]

days_in_week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
months_in_year = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
hours_in_day = ["07:00", "08:00", "09:00", "10:00",
                "11:00", "12:00", "13:00", "14:00",
                "15:00", "16:00", "17:00", "18:00",
                "19:00", "20:00", "21:00", "22:00", "23:00"]
                # "00:00", "01:00", "02:00", "03:00", "04:00",
                # "05:00", "06:00", Removed as those are usually not active hours
                # Don't want to pollute the calculation
parts_of_month = ["1/5", "2/5", "3/5", "4/5", "5/5"]
days_in_month = [i+1 for i in range(31)]

CategoryToExpectedRowValuesMapping = {"Hours": hours_in_day,
                                      "Week_Days": days_in_week,
                                      "Days_Of_Month":days_in_month,
                                      "Month_Parts": parts_of_month,
                                      "Months": months_in_year}

# Create a dictionary to map hour integers (7..23) to hour strings ("07:00".."23:00")
hour_mapping = {int(hour.split(":")[0]): hour for hour in hours_in_day}

In [110]:
# HELPER FUNCTIONS

def load_file_as_list_of_documents(filepath):
    """
    Load a JSON file containing a list of documents and return them as a list.

    Args:
    filepath (str): The path to the JSON file.

    Returns:
    list: A list containing the documents loaded from the JSON file.
    """
    # Use a context manager so the file handle is always closed
    with open(filepath) as f:
        documents = json.load(f)

    return list(documents)


def get_documents_data_into_df(
    documents: list,
    fields: list,
) -> pd.DataFrame:
    """
    Filters through nested fields in JSON-like objects and converts them to a Pandas DataFrame with desired fields as columns.

    Args:
    documents (list): A list of dictionaries representing documents with nested fields.
    fields (list): A list of strings representing the fields to extract from the documents.

    Returns:
    pd.DataFrame: A DataFrame containing the specified fields as columns, with data extracted from the documents.

    Example:
    documents = [
        {"name": {"first": "John", "last": "Doe"}, "age": 30},
        {"name": {"first": "Jane", "last": "Smith"}, "age": 25}
    ]
    fields = ["name.first", "name.last", "age"]
    df = get_documents_data_into_df(documents, fields)
    print(df)
    Output:
       first    last   age
    0   John     Doe  30.0
    1   Jane   Smith  25.0
    """
    def get_field_data(input_doc, field):
        """
        Extracts data from nested fields in a document.

        Args:
        input_doc (dict): The document from which to extract data.
        field (str): The field to extract, possibly nested.

        Returns:
        obj: The data corresponding to the specified field.
        """
        for subfield in field.split("."):
            if isinstance(input_doc, list):
                input_doc = input_doc[0]
            input_doc = input_doc.get(subfield)
        return input_doc

    column_names = [field.split(".")[-1] for field in fields]
    return_df = pd.DataFrame(columns=column_names)

    for document in documents:
        temp_df = pd.DataFrame(
            {column: get_field_data(document, field) for column, field in zip(column_names, fields)},
            index=range(1)
        )
        return_df = pd.concat([return_df, temp_df]).reset_index(drop=True)

    return return_df.fillna(value=np.nan)

def add_time_columns_to_dataframe(df):
    """
    Adds additional time-related columns to a DataFrame based on the 'timestamp' column.

    Args:
    df (pd.DataFrame): The DataFrame containing a 'timestamp' column.

    Returns:
    pd.DataFrame: The DataFrame with added time-related columns.

    Example:
    import pandas as pd
    df = pd.DataFrame({
        'timestamp': pd.date_range(start='2022-01-01', end='2022-01-05', freq='D')
    })
    df = add_time_columns_to_dataframe(df)
    print(df)
    Output:
       timestamp  Hours   Months  Days_Of_Month  Week_Days Month_Parts
    0 2022-01-01      0  January              1   Saturday         1/5
    1 2022-01-02      0  January              2     Sunday         1/5
    2 2022-01-03      0  January              3     Monday         1/5
    3 2022-01-04      0  January              4    Tuesday         1/5
    4 2022-01-05      0  January              5  Wednesday         1/5
    """
    # Add Hours column
    df["Hours"] = df["timestamp"].dt.hour

    # Add Months column
    df["Months"] = df["timestamp"].dt.month_name()

    # Add Days_Of_Month column
    df["Days_Of_Month"] = df["timestamp"].dt.day

    # Add Week_Days column
    df["Week_Days"] = df["timestamp"].dt.day_name()

    # Add Month_Parts column
    df["Month_Parts"] = df["Days_Of_Month"] / 6
    # Map Month_Parts to respective values
    conditions = [
        (df["Month_Parts"] <= 1),
        (df["Month_Parts"] > 1) & (df["Month_Parts"] <= 2),
        (df["Month_Parts"] > 2) & (df["Month_Parts"] <= 3),
        (df["Month_Parts"] > 3) & (df["Month_Parts"] <= 4),
        (df["Month_Parts"] > 4)
    ]
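    values = ["1/5", "2/5", "3/5", "4/5", "5/5"]
    # String default (used only when the day is missing) keeps the output dtype consistent
    df["Month_Parts"] = np.select(conditions, values, default="0")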
    
    return df

def add_expected_rows_to_count_analysis(df, category):
    """
    Adds expected rows with zero count to a DataFrame based on a specified category.

    Args:
    df (pd.DataFrame): The DataFrame containing count analysis.
    category (str): The category column to check and add missing rows.

    Returns:
    pd.DataFrame: The DataFrame with added expected rows.
    """
    expected_row_values = CategoryToExpectedRowValuesMapping.get(category)
    missing_rows = []
    for value in expected_row_values:
        if value not in df[category].values:
            missing_rows.append(pd.DataFrame({category: [value], "Count": [0]}))
    
    if missing_rows:
        df = pd.concat([df] + missing_rows, ignore_index=True)

    # Sort the DataFrame based on the category to maintain order
    df[category] = pd.Categorical(df[category], categories=expected_row_values, ordered=True)
    df = df.sort_values(category).reset_index(drop=True)
    
    return df


def detect_outliers_iqr(data, k=1.5):
    """
    Detect outliers in the data using IQR (Interquartile Range) method.

    Parameters:
    data (numpy.ndarray): Input data as a numpy array.
    k (float): Coefficient to scale the IQR. Data points beyond k times the IQR
               from the first and third quartiles are considered outliers. Default is 1.5.

    Returns:
    numpy.ndarray: Boolean array indicating whether each data point is an outlier or not.
    """
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return (data < lower_bound) | (data > upper_bound)

In [92]:
# TESTING DOCUMENTS

blood_pressure_documents = load_file_as_list_of_documents(filepath="data/BloodPressure.json")
blood_pressure_higher_mondays_documents = load_file_as_list_of_documents(filepath="data/blood_pressure_diastolic_higher_on_monday.json")
diary_events_headache_on_monday = load_file_as_list_of_documents(filepath="data/diary_events_monday_heavy.json")

In [105]:
def execute(documents: list, values_aggregators: list[tuple]):
    """
    Execute data analysis on a list of documents based on provided value aggregators.

    Args:
    documents (list): A list of dictionaries representing documents with nested fields.
    values_aggregators (list): A list of tuples containing the value field and its aggregator function.

    Returns:
    dict: A dictionary containing the final result of the data analysis.
    """
    value_fields = [value[0] for value in values_aggregators]
    df = get_documents_data_into_df(documents=documents, fields=["timestamp"] + value_fields)
    df["timestamp"] = pd.to_datetime(df['timestamp'], utc=True).dt.tz_localize(None)
    df = add_time_columns_to_dataframe(df=df)
    
    final_result_dict = {}
    for value_field, aggregator in values_aggregators:
        results_list = []
        for category in group_by_categories:
            result_dict = {}
            temp_df = df[[category]]
            if aggregator == "size":
                temp_df = temp_df.groupby(category).size().fillna(0).reset_index(name=value_field)
                temp_df = add_expected_rows_to_count_analysis(df=temp_df, category=category)
            elif aggregator == "mean":
                temp_df = df[[category, value_field]].groupby(category).mean().reset_index().round(2)
            
            outlier_boolean_array = detect_outliers_iqr(temp_df[value_field].values)
            # Rows flagged as outliers, reused for every outlier field below
            outlier_rows = temp_df[outlier_boolean_array]

            result_dict["category"] = category
            result_dict["series"] = temp_df[value_field].values
            # Summary statistics of the aggregated series (open question: keep these in the result?)
            result_dict["data_series_mean"] = temp_df[value_field].mean()
            result_dict["data_series_min"] = temp_df[value_field].min()
            result_dict["data_series_max"] = temp_df[value_field].max()
            result_dict["outliers_values"] = outlier_rows[value_field].values.tolist()
            result_dict["outliers_column"] = outlier_rows[category].values.tolist()
            # (category, value) pairs, e.g. ("January", 9)
            result_dict["outliers_column_values"] = list(
                outlier_rows[[category, value_field]].itertuples(index=False, name=None)
            )

            results_list.append(result_dict)

        final_result_dict[value_field] = results_list
    
    return final_result_dict

In [106]:
result_headache = execute(documents=diary_events_headache_on_monday, values_aggregators=[("Count", "size")])
result_bp = execute(documents=blood_pressure_higher_mondays_documents, values_aggregators=[("Count", "size"), ("diastolic_blood_pressure", "mean")])

In [107]:
result_headache["Count"]

[{'category': 'Hours',
  'series': array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         46]),
  'data_series_mean': 2.5555555555555554,
  'data_series_min': 0,
  'data_series_max': 46,
  'outliers_values': [46],
  'outliers_column': [nan],
  'outliers_column_values': [(nan, 46)]},
 {'category': 'Week_Days',
  'series': array([23,  1,  8,  3,  5,  1,  5]),
  'data_series_mean': 6.571428571428571,
  'data_series_min': 1,
  'data_series_max': 23,
  'outliers_values': [23],
  'outliers_column': ['Monday'],
  'outliers_column_values': [('Monday', 23)]},
 {'category': 'Days_Of_Month',
  'series': array([1, 1, 0, 2, 2, 3, 1, 3, 2, 2, 2, 3, 2, 0, 1, 1, 0, 2, 0, 2, 1, 1,
         2, 1, 2, 1, 4, 1, 1, 1, 1]),
  'data_series_mean': 1.4838709677419355,
  'data_series_min': 0,
  'data_series_max': 4,
  'outliers_values': [4],
  'outliers_column': [27],
  'outliers_column_values': [(27, 4)]},
 {'category': 'Month_Parts',
  'series': array([ 9, 13,  6,  7, 11]),
  'data

In [108]:
result_bp["Count"]

[{'category': 'Hours',
  'series': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 2, 1,
         2, 1, 3, 2, 1, 1, 3, 1, 1, 1, 2]),
  'data_series_mean': 0.7575757575757576,
  'data_series_min': 0,
  'data_series_max': 3,
  'outliers_values': [3, 3],
  'outliers_column': [nan, nan],
  'outliers_column_values': [(nan, 3), (nan, 3)]},
 {'category': 'Week_Days',
  'series': array([5, 7, 3, 2, 3, 3, 2]),
  'data_series_mean': 3.5714285714285716,
  'data_series_min': 2,
  'data_series_max': 7,
  'outliers_values': [7],
  'outliers_column': ['Tuesday'],
  'outliers_column_values': [('Tuesday', 7)]},
 {'category': 'Days_Of_Month',
  'series': array([1, 1, 0, 1, 0, 1, 1, 0, 0, 2, 2, 3, 2, 1, 1, 0, 1, 1, 2, 2, 0, 0,
         1, 0, 0, 0, 0, 0, 1, 1, 0]),
  'data_series_mean': 0.8064516129032258,
  'data_series_min': 0,
  'data_series_max': 3,
  'outliers_values': [3],
  'outliers_column': [12],
  'outliers_column_values': [(12, 3)]},
 {'category': 'Month_Parts',
  'series': a

In [109]:
result_bp["diastolic_blood_pressure"]

[{'category': 'Hours',
  'series': array([60.  , 54.5 , 54.  , 65.  , 56.  , 55.5 , 54.  , 53.67, 55.5 ,
         56.  , 61.  , 57.  , 58.  , 54.  , 60.  , 58.5 ]),
  'data_series_mean': 57.041875000000005,
  'data_series_min': 53.67,
  'data_series_max': 65.0,
  'outliers_values': [],
  'outliers_column': [],
  'outliers_column_values': []},
 {'category': 'Week_Days',
  'series': array([55.67, 62.2 , 55.33, 58.  , 55.  , 54.57, 57.  ]),
  'data_series_mean': 56.824285714285715,
  'data_series_min': 54.57,
  'data_series_max': 62.2,
  'outliers_values': [62.2],
  'outliers_column': ['Monday'],
  'outliers_column_values': [('Monday', 62.2)]},
 {'category': 'Days_Of_Month',
  'series': array([61.  , 56.  , 56.  , 58.  , 60.  , 55.5 , 55.  , 56.67, 54.  ,
         56.  , 58.  , 57.  , 60.  , 54.5 , 58.  , 53.  , 70.  , 54.  ]),
  'data_series_mean': 57.37055555555556,
  'data_series_min': 53.0,
  'data_series_max': 70.0,
  'outliers_values': [70.0],
  'outliers_column': [29],
  'outliers_