# Leaf DataOps Timed Technical Assessment (75 mins)

### Scenario

At a high level, Leaf's business model centers around finding predictable patterns in freight records. Thus, we ingest shipment data from a variety of external sources. As a DataOps Engineer at Leaf, you will regularly monitor that incoming data for anomalies.

You've been asked to test a pipeline that ingests data via a new API integration. The API returns a JSON document containing shipment data for the date range passed in the query parameters.

Each key-value pair in the JSON is equivalent to the name and values, respectively, of a DataFrame column.

The "columns" are:
* the shipment `date`;
* the distance from origin to destination (in `miles`);
* the `price` the shipper paid to move the shipment (in USD); and
* the `weight` of the shipment (in pounds).
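
As a rough illustration of that column-oriented layout, the sketch below parses a tiny hypothetical payload and turns it into one record per shipment. The list-per-column shape is an assumption inferred from how the pipeline indexes `data[key][i]`; it has not been checked against the actual file. The values are borrowed from two rows of the sample output further down, and `sample_json`, `columns`, and `records` are illustrative names only.

```python
from json import loads

# Hypothetical two-row payload in the assumed column-oriented shape:
# each key is a column name, each value is a list with one entry per shipment.
sample_json = """
{
  "date":   [1650412800, 1628640000],
  "miles":  [571.8, 63.0],
  "price":  [1045.0, 671.0],
  "weight": [37601.0, 40000.0]
}
"""

columns = loads(sample_json)

# Turn the columns into one record per shipment, keyed by row index.
n_rows = len(columns["date"])
records = {
    i: {name: values[i] for name, values in columns.items()}
    for i in range(n_rows)
}

print(records[0])
# {'date': 1650412800, 'miles': 571.8, 'price': 1045.0, 'weight': 37601.0}
```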

The cell below contains the pipeline code:

In [1]:
from json import loads
from functools import reduce
from math import isnan


def pipeline(*objects):
    return reduce(lambda x, fn: fn(x), objects)


def mock_api_request(url: str) -> str:
    with open("leaf_dataops.json", "r") as f:
        return f.read()


def parse_json(json: str) -> dict:
    return loads(json)

def validator(data: dict, rules: dict = None) -> tuple:
    """
    Validates a dict of data on specified logic.
    
    Args:
    - data (dict): parsed JSON with keys 'date', 'miles', 'price', and 'weight'
    - rules (dict): dict of rules to apply to the data (not used for now)
    
    Returns:
    A tuple of two dicts: the first holds the valid rows, the second the invalid rows
    
    The checks work as follows:
    - 'date' must be an int (a Unix timestamp)
    - 'miles', 'price', and 'weight' must be greater than the column min and no greater than the column max
    - NaN values are considered invalid
    
    If every check passes for a row, the row is added to the valid data dict.
    If any check fails, the row is placed in the invalid data dict.
    """
    
    # dicts for valid and invalid_data
    
    valid_data = {}
    invalid_data = {}

    
    # Max Values
    max_miles = max(data['miles'])
    max_price = max(data['price'])
    max_weight = max(data['weight'])
    
    # Min Values
    min_miles = min(data['miles'])
    min_price = min(data['price'])
    min_weight = min(data['weight'])
    

    for i, _ in enumerate(data['date']):
        valid = True
        for key in data.keys():
            
            value = data[key][i]
            
            if key == 'date':
                if not isinstance(value, int):
                    valid = False
                    break
            elif key == 'miles' or key == 'price' or key == 'weight':
                if key == 'miles' and (value <= min_miles or value > max_miles or isnan(value)):
                    valid = False
                    break
                if key == 'price' and (value <= min_price or value > max_price or isnan(value)):
                    valid = False
                    break
                if key == 'weight' and (value <= min_weight or value > max_weight or isnan(value)):
                    valid = False
                    break
                

        if valid:
            valid_data[i] = {k: data[k][i] for k in data.keys()}
        else:
            invalid_data[i] = {k: data[k][i] for k in data.keys()}
    
    return valid_data, invalid_data


pipeline(
    mock_api_request("www.shipper.com/api?min_date=20210101&max_date=20221231"),
    parse_json,
    validator,
)

({0: {'date': 1650412800, 'miles': 571.8, 'price': 1045.0, 'weight': 37601.0},
  5: {'date': 1628640000, 'miles': 63.0, 'price': 671.0, 'weight': 40000.0},
  7: {'date': 1647561600,
   'miles': 1220.0,
   'price': 3776.03,
   'weight': 40032.0},
  9: {'date': 1644710400, 'miles': 336.0, 'price': 1996.92, 'weight': 24872.0},
  12: {'date': 1631491200,
   'miles': 1658.0,
   'price': 7123.49,
   'weight': 25368.0},
  13: {'date': 1634688000,
   'miles': 647.0,
   'price': 1038.92,
   'weight': 40000.0},
  14: {'date': 1670889600,
   'miles': 260.0,
   'price': 1031.69,
   'weight': 42676.0},
  16: {'date': 1631577600,
   'miles': 1842.0,
   'price': 5892.06,
   'weight': 18000.0},
  18: {'date': 1647043200,
   'miles': 177.0,
   'price': 1217.63,
   'weight': 13063.0},
  19: {'date': 1635724800,
   'miles': 1128.0,
   'price': 6308.4,
   'weight': 40000.0},
  22: {'date': 1660348800,
   'miles': 475.0,
   'price': 1734.73,
   'weight': 34564.0},
  26: {'date': 1635984000, 'miles': 241.0,

### Assignment

**Your task is to refactor the `validator` function to catch outliers in the API response.**

Your solution should meet the following criteria:
* It should use _only_ the Python standard library (no Pandas or other third-party libraries—though you can use whatever you want for data exploration)
* Running the cell below should return a `tuple` containing two objects: 
    1. a `dict` of valid shipment data (the definition of "valid" is up to you)
    2. a `dict` of invalid shipment data
* It should be structured such that we could reuse `validator` in other pipelines without altering the internal logic of the function itself.

Beyond these requirements, use your best judgment and be prepared to explain any assumptions or decisions you made. Everything you need is in this document. Your time is limited, so prioritize accordingly. 

**You have 75 minutes from receipt of this file to complete as much as possible. When time is up, send us what you have so far. Good luck!**

### Figuring out the solution below

In [2]:
# Need to visualise the data
import pandas as pd

In [3]:
temp_df = pd.read_json("leaf_dataops.json")

In [4]:
temp_df

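|       | date       | miles   | price   | weight  |
|-------|------------|---------|---------|---------|
| 0     | 2022-04-20 | 571.80  | 1045.00 | 37601.0 |
| 1     | 2021-02-18 | 1110.00 | NaN     | NaN     |
| 2     | 2022-03-23 | NaN     | 2475.26 | 40000.0 |
| 3     | 2022-03-16 | 9.00    | 11.11   | 1.0     |
| 4     | 2022-03-25 | NaN     | 1200.00 | 222.0   |
| ...   | ...        | ...     | ...     | ...     |
| 99995 | 2022-05-17 | 36.00   | 196.85  | 1.0     |
| 99996 | 2022-05-10 | 30.00   | 187.86  | 1.0     |
| 99997 | 2021-06-18 | 66.00   | 177.98  | 40000.0 |
| 99998 | 2021-09-13 | 214.04  | NaN     | 40000.0 |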


In [5]:
# pandas has parsed the date into something readable
# There are NaN values scattered across the miles, price, and weight columns

In [6]:
temp_df.shape

(100000, 4)

In [7]:
temp_df.isna().sum()

date          0
miles     17564
price      6626
weight      705
dtype: int64
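
Since the final solution has to stay within the standard library, the same per-column missing-value count can be sketched without pandas. This is a minimal sketch on toy data, not the real file; `count_missing` is a hypothetical helper name. It treats both `None` and NaN as missing, because it is an assumption here which of the two the parsed JSON actually contains.

```python
from math import isnan


def count_missing(columns: dict) -> dict:
    """Count None and NaN entries per column (hypothetical helper)."""

    def is_missing(value) -> bool:
        return value is None or (isinstance(value, float) and isnan(value))

    return {
        name: sum(is_missing(v) for v in values)
        for name, values in columns.items()
    }


# Toy data for illustration only; not taken from leaf_dataops.json.
toy = {
    "date": [1650412800, 1613606400, 1647993600],
    "miles": [571.8, 1110.0, float("nan")],
    "price": [1045.0, None, 2475.26],
}

print(count_missing(toy))
# {'date': 0, 'miles': 1, 'price': 1}
```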

In [8]:
min(temp_df.date)

Timestamp('2021-01-01 00:00:00')

In [9]:
max(temp_df.date)

Timestamp('2032-02-20 05:20:00')
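
The maximum date falls in 2032, well past the `max_date=20221231` in the request URL, so a date-range check looks worth adding. The sketch below is one possible way to do it with the standard library, assuming `date` values are Unix timestamps in seconds and that the requested range should be read as UTC. `date_in_range`, `start`, and `end` are hypothetical names.

```python
from datetime import datetime, timezone


def date_in_range(ts, start: datetime, end: datetime) -> bool:
    """Return True if ts is an int Unix timestamp (seconds) within [start, end]."""
    # bool is a subclass of int, so exclude it explicitly.
    if not isinstance(ts, int) or isinstance(ts, bool):
        return False
    try:
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
    except (OverflowError, OSError, ValueError):
        # Timestamps too large or too small to convert are treated as out of range.
        return False
    return start <= moment <= end


# Range taken from the query string in the mock request URL (assumed to be UTC).
start = datetime(2021, 1, 1, tzinfo=timezone.utc)
end = datetime(2022, 12, 31, 23, 59, 59, tzinfo=timezone.utc)

print(date_in_range(1650412800, start, end))  # 2022-04-20, inside the range
print(date_in_range(1960867200, start, end))  # a timestamp in 2032, outside the range
```

A check like this could be plugged into the validator as the rule for the `date` column instead of the bare `isinstance(value, int)` test.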

In [10]:
from json import loads
from functools import reduce
from math import isnan


def pipeline(*objects):
    return reduce(lambda x, fn: fn(x), objects)


def mock_api_request(url: str) -> str:
    with open("leaf_dataops.json", "r") as f:
        return f.read()


def parse_json(json: str) -> dict:
    return loads(json)

def validator(data: dict, rules: dict = None) -> tuple:
    # dicts for valid and invalid_data
    
    valid_data = {}
    invalid_data = {}

    
    # Max Values
    max_miles = max(data['miles'])
    max_price = max(data['price'])
    max_weight = max(data['weight'])
    
    # Lower bounds are fixed at 0 in the checks below, so no column minimums are needed.
    

    for i, _ in enumerate(data['date']):
        valid = True
        for key in data.keys():
            
            value = data[key][i]
            
            if key == 'date':
                if not isinstance(value, int):
                    valid = False
                    break
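            # Note: each max_* is computed from the same column being checked, so
            # `value > max_*` can never be true; only `<= 0` and NaN reject rows here.
            elif key == 'miles' or key == 'price' or key == 'weight':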
                if key == 'miles' and (value <= 0 or value > max_miles or isnan(value)):
                    valid = False
                    break
                if key == 'price' and (value <= 0 or value > max_price or isnan(value)):
                    valid = False
                    break
                if key == 'weight' and (value <= 0 or value > max_weight or isnan(value)):
                    valid = False
                    break
                

        if valid:
            valid_data[i] = {k: data[k][i] for k in data.keys()}
        else:
            invalid_data[i] = {k: data[k][i] for k in data.keys()}
    
    return valid_data, invalid_data


val, inval = pipeline(
    mock_api_request("www.shipper.com/api?min_date=20210101&max_date=20221231"),
    parse_json,
    validator,
)

len(val), len(inval)

(78473, 21527)
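
The version above still cannot flag high outliers, because each upper bound is the maximum of the column it is compared against. One possible next step, sketched below under stated assumptions, is to pass the checks in through `rules` so the function can be reused in other pipelines, and to derive outlier bounds from the data with Tukey fences (Q1 - 1.5·IQR, Q3 + 1.5·IQR). The choice of Tukey fences, the factor 1.5, and the helper names `is_number`, `iqr_bounds`, and `default_rules` are all assumptions made for illustration; they are not part of the assignment. The toy data is made up.

```python
from math import isnan
from statistics import quantiles


def is_number(value) -> bool:
    """True for real ints/floats; False for None, NaN, bools, and strings."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and not isnan(value)
    )


def iqr_bounds(values, k: float = 1.5) -> tuple:
    """Tukey fences (Q1 - k*IQR, Q3 + k*IQR) over the usable numbers only."""
    clean = [v for v in values if is_number(v)]
    q1, _, q3 = quantiles(clean, n=4)  # needs at least two usable values
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def default_rules(data: dict) -> dict:
    """One predicate per column; the fences are an assumed definition of 'outlier'."""
    rules = {"date": lambda v: isinstance(v, int) and not isinstance(v, bool)}
    for key in ("miles", "price", "weight"):
        low, high = iqr_bounds(data[key])
        # Bind low/high as defaults so each lambda keeps its own column's fences.
        rules[key] = lambda v, low=low, high=high: (
            is_number(v) and v > 0 and low <= v <= high
        )
    return rules


def validator(data: dict, rules: dict = None) -> tuple:
    """Split column-oriented data into (valid, invalid) dicts keyed by row index."""
    rules = default_rules(data) if rules is None else rules
    n_rows = len(next(iter(data.values()), []))
    valid, invalid = {}, {}
    for i in range(n_rows):
        row = {key: values[i] for key, values in data.items()}
        passed = all(check(row[key]) for key, check in rules.items())
        (valid if passed else invalid)[i] = row
    return valid, invalid


# Toy data for illustration only: row 2 has a NaN weight, row 5 an extreme distance.
toy = {
    "date": [1, 2, 3, 4, 5, 6],
    "miles": [100.0, 110.0, 120.0, 130.0, 140.0, 5000.0],
    "price": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    "weight": [1000.0, 1100.0, float("nan"), 1300.0, 1400.0, 1500.0],
}

valid, invalid = validator(toy)
print(sorted(valid), sorted(invalid))
# [0, 1, 3, 4] [2, 5]
```

Because `validator` still accepts a single positional argument, it drops into `pipeline(...)` unchanged. Another pipeline could supply its own `rules` dict of column predicates (for example through `functools.partial`) without touching the function body.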