# 2b. Anomalies and Outliers Report

## Overview
This Jupyter notebook generates a report on anomalies and outliers detected in driver performance metrics. The report is built from a dataset of performance indicators for taxi drivers and surfaces potential issues or exceptional cases within that dataset.

## Description
The report analyzes four driver performance metrics: endurance score, profitability score, safety adherence score, and efficiency score. It also lists the 1st and 99th percentile values for each metric, so users can compare individual driver scores with the dataset's overall distribution. Additionally, a "sanity check" column confirms whether each flagged row has at least one metric below its 1st percentile or above its 99th percentile.
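
As a minimal sketch of the percentile rule described above, the following uses a hypothetical toy column (`score`, with evenly spaced values) rather than the real driver data. The names `toy`, `lower`, and `upper` are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data: one metric column holding the integers 0..100.
toy = pd.DataFrame({"score": range(101)})

# 1st and 99th percentile thresholds (pandas interpolates linearly by default).
lower = toy["score"].quantile(0.01)
upper = toy["score"].quantile(0.99)

# A value is flagged when it lies strictly below `lower` or strictly above `upper`.
toy["flagged"] = (toy["score"] < lower) | (toy["score"] > upper)

flagged_values = toy.loc[toy["flagged"], "score"].tolist()
print(lower, upper, flagged_values)
```

Only the extreme values at either end of the toy column fall outside the thresholds; everything in between is treated as normal.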

## Purpose
The purpose of this report is to support data-driven decision-making by making outliers and anomalies in driver performance metrics visible. Flagging potential issues or exceptional cases helps stakeholders identify areas for further investigation or improvement within the taxi service.

### Installs

In [1]:
!pip install datapane



### Imports

In [2]:
import pandas as pd
import datapane as dp
import warnings

### General Configs

In [3]:
# Suppress specific FutureWarning
warnings.filterwarnings("ignore", category=FutureWarning, module='datapane.common.df_processor')

### Functions

In [4]:
def sanity_check(df, metrics, percentile_99, percentile_01):
    """
    Checks for anomalies in the DataFrame based on percentile thresholds.

    Parameters:
        df (DataFrame): The DataFrame containing the data to be checked for anomalies.
        metrics (list): A list of column names representing the metrics to be checked.
        percentile_99 (Series): A Series containing the 99th percentile values for each metric.
        percentile_01 (Series): A Series containing the 1st percentile values for each metric.

    Returns:
        list: A list of boolean values indicating whether each row in the DataFrame is an anomaly.
              True indicates an anomaly, False indicates normal data.
    """
    anomalies_confirmed = []
    for index, row in df.iterrows():
        is_anomaly = False
        for metric in metrics:
            if row[metric] > percentile_99[metric] or row[metric] < percentile_01[metric]:
                is_anomaly = True
                break
        anomalies_confirmed.append(is_anomaly)
    return anomalies_confirmed
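
The row-by-row loop above can also be expressed with column-wise comparisons. The following is a sketch only, assuming the threshold Series are indexed by the metric names; `sanity_check_vectorized` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def sanity_check_vectorized(df, metrics, percentile_99, percentile_01):
    """Hypothetical vectorized counterpart of sanity_check; returns a list of bools."""
    values = df[metrics]
    # Compare every metric column against its own threshold (aligned on column names).
    above = values.gt(percentile_99[metrics], axis="columns")
    below = values.lt(percentile_01[metrics], axis="columns")
    # A row is an anomaly if any metric is outside its thresholds.
    return (above | below).any(axis=1).tolist()

# Toy data with made-up thresholds, for illustration only.
toy = pd.DataFrame({"a": [1.0, 5.0, 9.0], "b": [2.0, 2.0, 2.0]})
upper = pd.Series({"a": 8.0, "b": 3.0})
lower = pd.Series({"a": 2.0, "b": 1.0})

result = sanity_check_vectorized(toy, ["a", "b"], upper, lower)
print(result)  # [True, False, True]
```

On large DataFrames this form avoids the per-row overhead of `iterrows`, while flagging the same rows as the loop on this toy input.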

In [5]:
def highlight_anomalies(s, percentile_99, percentile_01):
    """
    Highlights anomalies in a Series based on percentile thresholds.

    Parameters:
        s (Series): The Series containing the data to be highlighted.
        percentile_99 (Series): A Series containing the 99th percentile values for each metric.
        percentile_01 (Series): A Series containing the 1st percentile values for each metric.

    Returns:
        list: A list of CSS strings specifying the background color for each cell in the Series.
    """
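    # Note: relies on the module-level `new_metrics` list, which is defined in a
    # later cell and must exist by the time this function is called.
    if s.name in new_metrics: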
        return [
            'background-color: yellow' if (v > percentile_99.get(s.name, float('inf'))
                                            or v < percentile_01.get(s.name, float('-inf')))
            else ''
            for v in s
        ]
    return ['' for _ in s]
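
To illustrate what the highlighting function returns for a single column, here is a self-contained sketch. `highlight_outside` is a hypothetical variant that checks the threshold Series directly instead of a global list, and the column values and thresholds are made up.

```python
import pandas as pd

def highlight_outside(s, upper, lower, css="background-color: yellow"):
    """Hypothetical variant of highlight_anomalies that needs no global list."""
    # Columns without thresholds are left unstyled.
    if s.name not in upper.index or s.name not in lower.index:
        return ["" for _ in s]
    hi, lo = upper[s.name], lower[s.name]
    # One CSS string per cell: styled if the value is outside [lo, hi].
    return [css if (v > hi or v < lo) else "" for v in s]

col = pd.Series([0.5, 5.0, 12.0], name="Endurance Score")
upper = pd.Series({"Endurance Score": 10.0})
lower = pd.Series({"Endurance Score": 1.0})

styles = highlight_outside(col, upper, lower)
print(styles)  # ['background-color: yellow', '', 'background-color: yellow']
```

A function of this shape can be passed to `DataFrame.style.apply`, which calls it once per column and forwards extra keyword arguments such as `upper` and `lower`.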

### Data Structure Configurations

In [6]:
metrics = [
    'driver_endurance_score', 'driver_profitabilty_score',
    'driver_safety_adherence_score', 'driving_efficiency_score'
]

In [7]:
rename_dict = {
    'driver_id': 'Driver ID',
    'full_name': 'Full Name',
    'driver_endurance_score': 'Endurance Score',
    'driver_profitabilty_score': 'Profitability Score',
    'driver_safety_adherence_score': 'Safety Adherence Score',
    'driving_efficiency_score': 'Efficiency Score',
    'driver_endurance_score_99th_percentile': 'Endurance Score 99th Percentile',
    'driver_profitabilty_score_99th_percentile': 'Profitability Score 99th Percentile',
    'driver_safety_adherence_score_99th_percentile': 'Safety Adherence Score 99th Percentile',
    'driving_efficiency_score_99th_percentile': 'Efficiency Score 99th Percentile',
    'driver_endurance_score_1st_percentile': 'Endurance Score 1st Percentile',
    'driver_profitabilty_score_1st_percentile': 'Profitability Score 1st Percentile',
    'driver_safety_adherence_score_1st_percentile': 'Safety Adherence Score 1st Percentile',
    'driving_efficiency_score_1st_percentile': 'Efficiency Score 1st Percentile',
    'sanity_check': 'Sanity Check'
}

In [8]:
# Display names of the metric columns after renaming (same order as `metrics`)
new_metrics = [
    'Endurance Score', 'Profitability Score',
    'Safety Adherence Score', 'Efficiency Score'
]

In [9]:
# Markdown description for the report
description = """
# 2b. Anomalies and Outliers Report

## Overview
This report displays anomalies in driver performance metrics.

## Columns Included
- **Driver ID**: Identifier for the driver.
- **Full Name**: Name of the driver.
- **Endurance Score**: Score indicating the driver's endurance.
- **Profitability Score**: Score indicating the driver's profitability.
- **Safety Adherence Score**: Score indicating the driver's safety adherence.
- **Efficiency Score**: Score indicating the driver's efficiency.
- **Endurance Score 99th Percentile**: 99th percentile value for the endurance score.
- **Profitability Score 99th Percentile**: 99th percentile value for the profitability score.
- **Safety Adherence Score 99th Percentile**: 99th percentile value for the safety adherence score.
- **Efficiency Score 99th Percentile**: 99th percentile value for the efficiency score.
- **Endurance Score 1st Percentile**: 1st percentile value for the endurance score.
- **Profitability Score 1st Percentile**: 1st percentile value for the profitability score.
- **Safety Adherence Score 1st Percentile**: 1st percentile value for the safety adherence score.
- **Efficiency Score 1st Percentile**: 1st percentile value for the efficiency score.
- **Sanity Check**: Indicates whether the anomaly is confirmed based on percentile thresholds.

## How to Use
1. **Identify Anomalies**: Look for rows where the "Sanity Check" column is marked as True. These rows represent anomalies in the driver performance metrics.
2. **Analyze Metric Scores**: Focus on metrics with highlighted cells (yellow background). These cells indicate that the metric value falls outside the normal range, that is, below the 1st percentile or above the 99th percentile.
3. **Compare with Percentiles**: Compare the metric values with their respective 1st and 99th percentile values provided in the report. This helps in understanding how far the metric values deviate from the normal range.

By following these steps, users can effectively identify and investigate anomalies and outliers in driver performance metrics.

> **_NOTE:_** Because this report lists only rows already flagged as outliers, the sanity check should ideally be True for every entry. In future iterations of this report, it will be possible to toggle the percentile values used for analyzing the data. Users will be able to customize both the percentile value that defines the normal range and the percentile value that identifies an anomaly. This feature will be particularly useful for examining extreme values that fall within the required percentile range but may still be of interest for further investigation.

"""

### Execute

#### Prepare Data

In [10]:
# Load the data from the pickle file
df = pd.read_pickle('data2.pkl')

# Calculate the 99th and 1st percentiles for each metric column
percentile_99 = df[metrics].quantile(0.99)
percentile_01 = df[metrics].quantile(0.01)

# Filter rows where is_outlier is True
anomalies_df = df[df['is_outlier']].copy()

# Perform sanity check
sanity_check_results = sanity_check(anomalies_df, metrics, percentile_99, percentile_01)
anomalies_df.loc[:, 'sanity_check'] = sanity_check_results

# Add the 99th and 1st percentiles to the anomalies_df
for metric in metrics:
    anomalies_df.loc[:, f'{metric}_99th_percentile'] = percentile_99[metric]
    anomalies_df.loc[:, f'{metric}_1st_percentile'] = percentile_01[metric]

# Rename the columns
anomalies_df.rename(columns=rename_dict, inplace=True)

# Rename the index labels of the percentile Series to the display names
new_percentile_99 = percentile_99.rename(index=rename_dict)
new_percentile_01 = percentile_01.rename(index=rename_dict)

# Select the specified columns along with the percentiles
columns_to_display = [
    'Driver ID', 'Full Name'
] + new_metrics + [
    f'{metric} 99th Percentile' for metric in new_metrics
] + [
    f'{metric} 1st Percentile' for metric in new_metrics
] + ['Sanity Check']

anomalies_df = anomalies_df[columns_to_display]

# Apply the highlight function only to the metric columns
styled_anomalies_df = anomalies_df.style.apply(
    highlight_anomalies,
    subset=new_metrics,
    percentile_99=new_percentile_99,
    percentile_01=new_percentile_01
)

# Display the styled DataFrame
#styled_anomalies_df
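
Since the sanity check is expected to be True for every row flagged by `is_outlier`, it can be useful to list the rows where the two disagree. The following is a minimal sketch on hypothetical toy data with fixed, made-up thresholds; the column `score` and the variable names are assumptions, not part of the original dataset.

```python
import pandas as pd

# Hypothetical toy data: driver 4 is flagged as an outlier but lies inside the thresholds.
toy = pd.DataFrame({
    "driver_id": [1, 2, 3, 4],
    "score": [0.0, 50.0, 100.0, 55.0],
    "is_outlier": [True, False, True, True],
})
lower, upper = 10.0, 90.0  # made-up thresholds standing in for the 1st/99th percentiles

# Percentile-style check: True when the score is outside [lower, upper].
outside = (toy["score"] < lower) | (toy["score"] > upper)

# Rows flagged upstream that the percentile check does not confirm.
unconfirmed = toy[toy["is_outlier"] & ~outside]
unconfirmed_ids = unconfirmed["driver_id"].tolist()
print(unconfirmed_ids)  # [4]
```

Rows surfaced this way are candidates for the further investigation that the report description mentions.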

#### Report Generation

In [11]:
# Render the styled DataFrame to HTML
styled_html = styled_anomalies_df.to_html()

# Create a Datapane report with the description text and the styled table embedded as HTML
report = dp.Report(
    dp.Text(description),
    dp.HTML(styled_html) # Add the styled DataFrame as HTML
)

# Save the report as an HTML file
report.save(path='2b_report_anomalies_and_outliers.html')

App saved to ./2b_report_anomalies_and_outliers.html