# Model Evaluation
Noa Flaherty
noaflaherty@gmail.com
10/7/2018

## Table of Contents
### Phase 1: Data Parsing and Transformation
-  __Step 1:__ CSV Parsing and Day-Level Summarization
-  __Step 2:__ Generate Time-Series Dataframe
-  __Step 3:__ Compute Rolling Metrics
-  __Step 4:__ Generate "Target" Column
-  __Step 5:__ Filter Out Ineligible Dates & Drop Unneeded Columns
-  __Result:__ Final Dataframe Output

### Phase 2: Dummy Model Creation

### Phase 3: Model Evaluation & Comparison

### Future Improvements

***

## Phase 0: Model Evaluator Inputs

These are the four key inputs for this model evaluation framework. You must specify your input CSV and your two model file names. The CSV must be located in model-evaluation/notebook/stored_csvs/ and the models in model-evaluation/notebook/stored_models/.

Lastly, you must specify whether you want to generate dummy models to compare or compare your own. If CREATE_DUMMY_MODELS = True, two models will be created and saved under the model file names specified below.

In [1]:
INPUT_CSV_FILENAME = 'TransactionsCompany1.csv'
MODEL_1_FILENAME = 'model_1.sav'
MODEL_2_FILENAME = 'model_2.sav'
CREATE_DUMMY_MODELS = True

***
### Additional Setup

In [2]:
import pandas as pd
import sys
from os import path

# Import relative packages from utils
path_to_project = path.dirname( path.dirname( path.abspath('__file__') ) )
sys.path.append( path_to_project )
from server.utils import model_evaluation, data_processing


# File path to input csv
input_csv = "{PROJECT_PATH}/notebook/stored_csvs/{FILE_NAME}".format(PROJECT_PATH=path_to_project, FILE_NAME=INPUT_CSV_FILENAME)

# After processing the CSV, a pickle file will be saved here of the dataframe for fast retrieval.
pickle_file_of_df = "{PROJECT_PATH}/notebook/stored_dataframes/{FILE_NAME}.pkl".format(PROJECT_PATH=path_to_project, FILE_NAME=INPUT_CSV_FILENAME.split('.')[0])

# File paths for model pickle files
model_1_file = "{PROJECT_PATH}/notebook/stored_models/{FILE_NAME}".format(PROJECT_PATH=path_to_project, FILE_NAME=MODEL_1_FILENAME)
model_2_file = "{PROJECT_PATH}/notebook/stored_models/{FILE_NAME}".format(PROJECT_PATH=path_to_project, FILE_NAME=MODEL_2_FILENAME)



# Set pandas dataframe print options
pd.set_option('display.max_rows', 20)
# pd.set_option('display.max_columns', 4)
# pd.set_option('display.width', 500)  

SyntaxError: invalid syntax (model_evaluation.py, line 82)

## Phase 1: Data Parsing & Transformation
The goal of this phase is to read an input CSV representing an event stream and output a time-series dataframe for use in dummy model training and model evaluation.

The final dataframe produced at the end of this phase has the following properties:
-  One row per customer per day, where:
    -  The first date for a given customer is the first day on which they had an event
    -  The maximum date for ALL customers is the maximum date across all events across all customers (if this were a live feed of data rather than a static CSV, we might consider just making this "today").
    -  We include all days between and including the above min and max dates for each customer
    -  We then filter out dates that are "too young" or "too old" to be useful for our model (described further in Step 5).
-  An index of ['customer_id', 'date']
-  Generated columns that represent computed metrics for use as model factors (e.g. purchase_value_l30d is the rolling 30 day sum of purchase values for a given customer on a given day)
-  The rightmost column is our target value: whether or not the customer made one or more purchases in the 6 months following that day


### Step 1: CSV Parsing and Day-Level Summarization

The first step is to read in the specified CSV, which represents an event stream, and perform the first level of data transformation. Our goal is to have one row per customer per day on which one or more events occurred, with columns that aggregate the total value of the events for that customer on that day as well as the count of events that customer had on that day.

-  Two new columns are generated for the given event type (in this case, purchase events).
    -  The first is a sum of the event values for all events of that event type on that day for that customer (e.g. total_purchase_value_on_day)
    -  The second is a count of all events of that event type on that day for that customer (e.g. purchase_count_on_day)

In [None]:


df_summarized_event_stream = data_processing.read_event_stream_data_set(input_csv, 'purchase')
print(df_summarized_event_stream)
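
The internals of `read_event_stream_data_set` are not shown in this notebook. As a rough, dependency-free sketch of the day-level summarization described above, assume each event is a `(customer_id, timestamp, value)` tuple; the helper name `summarize_events_by_day` and the timestamp format are hypothetical, not taken from `data_processing`:

```python
from collections import defaultdict
from datetime import datetime


def summarize_events_by_day(events):
    """Collapse raw events into one summary row per (customer_id, date).

    Hypothetical helper: `events` is an iterable of
    (customer_id, "YYYY-MM-DD HH:MM:SS", value) tuples.
    """
    summary = defaultdict(
        lambda: {"total_purchase_value_on_day": 0.0, "purchase_count_on_day": 0}
    )
    for customer_id, timestamp, value in events:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
        row = summary[(customer_id, day)]
        row["total_purchase_value_on_day"] += value  # sum of event values that day
        row["purchase_count_on_day"] += 1            # number of events that day
    return dict(summary)


example_events = [
    ("c1", "2018-01-01 09:00:00", 10.0),
    ("c1", "2018-01-01 17:30:00", 5.0),
    ("c1", "2018-01-03 12:00:00", 7.5),
    ("c2", "2018-01-02 08:15:00", 20.0),
]
daily_summary = summarize_events_by_day(example_events)
```

Here the two events for customer `c1` on 2018-01-01 collapse into a single row with a total value of 15.0 and a count of 2.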

### Step 2: Generate Time-Series Dataframe
The next step is to take the dataframe from Step 1, which contains one row per customer per day on which an event occurred, and extend it to one row per customer per day since their first event. This lets us compute rolling metrics for each day and makes it easy to see what a given customer looked like on a given day. We keep the first_event_timestamp as a column for later data transformation.

In [None]:
df_time_series = data_processing.generate_time_series(df_summarized_event_stream, event_names=['purchase'])
print(df_time_series)
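
To make the expansion concrete, here is a minimal plain-Python sketch, assuming the Step 1 output is reduced to a `{(customer_id, date): value}` mapping and that days without events are filled with zero. The function name `generate_daily_rows` is hypothetical and the real `generate_time_series` may differ:

```python
from datetime import date, timedelta


def generate_daily_rows(daily_values):
    """Return {customer_id: [(date, value), ...]} with one entry per day.

    Each customer's series starts on their first event day and ends on the
    latest date seen across ALL customers; missing days get a value of 0.0.
    """
    max_date = max(day for _, day in daily_values)
    first_day = {}
    for customer_id, day in daily_values:
        if customer_id not in first_day or day < first_day[customer_id]:
            first_day[customer_id] = day

    series = {}
    for customer_id, start in first_day.items():
        n_days = (max_date - start).days + 1
        days = [start + timedelta(days=i) for i in range(n_days)]
        series[customer_id] = [
            (day, daily_values.get((customer_id, day), 0.0)) for day in days
        ]
    return series


example = {
    ("c1", date(2018, 1, 1)): 15.0,
    ("c1", date(2018, 1, 3)): 7.5,
    ("c2", date(2018, 1, 2)): 20.0,
}
time_series = generate_daily_rows(example)
```

Customer `c1` ends up with three rows (January 1 through 3, with a zero on January 2), while `c2` starts on January 2 and is extended to the shared maximum date of January 3.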

### Step 3: Compute Rolling Metrics
In this step, we compute rolling metrics for use as input factors for models. For now, we simply compute the rolling sum of event values (e.g. total purchase value) and the rolling count of events (e.g. number of purchase events) over windows of 1, 3, and 6 months (30, 90, and 180 days, assuming 30 days per month).

This could be a good area for future refinement if we wanted to develop more sophisticated models. For example, you could imagine additional columns for use as factors such as: number of previous consecutive months with one or more purchases, etc.

In [None]:
df_time_series_w_rolling_metrics = data_processing.generate_metrics_for_use_as_factors(df_time_series)
print(df_time_series_w_rolling_metrics)
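
As an illustration of what a rolling metric such as purchase_value_l30d represents, the sketch below computes a trailing sum over a daily series. It assumes the window includes the current day, which the notebook does not state, and uses a 3-day window only to keep the example short; `rolling_sum` is a hypothetical name:

```python
def rolling_sum(values, window):
    """Trailing sum over the current day and the previous `window - 1` days.

    Assumption: the window includes the current day. Days before the series
    starts simply contribute nothing.
    """
    sums = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the day that left the window
        sums.append(running)
    return sums


daily_purchase_value = [10.0, 0.0, 5.0, 0.0, 0.0, 20.0]
purchase_value_l3d = rolling_sum(daily_purchase_value, 3)
```

The same function applied to the daily event counts would give the rolling count; calling it with windows of 30, 90, and 180 would mirror the three window sizes used in this step.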

### Step 4: Generate "Target" Column
This creates a column representing what we are trying to predict. If a customer made one or more purchases in the 6 months following a given day-row, that row gets a 1; otherwise, it gets a 0.

In [None]:
rolling_window_into_future_in_days = 6*30 # Number of days into the future to look. Set here to 6 months, assuming 30 days per month
df_time_series_w_target_col = data_processing.determine_if_event_occured_in_furture_x_days(df_time_series_w_rolling_metrics, 'purchase', rolling_window_into_future_in_days)
print(df_time_series_w_target_col)
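
The forward-looking target can be sketched in plain Python as follows. This is an illustration only: it assumes the window covers the days strictly after the current day, and `future_event_flags` is a hypothetical name rather than the `data_processing` implementation:

```python
def future_event_flags(daily_counts, window):
    """Return 1 for each day that is followed by at least one event within
    the next `window` days (excluding the day itself), else 0."""
    flags = []
    for i in range(len(daily_counts)):
        future = daily_counts[i + 1 : i + 1 + window]
        flags.append(1 if sum(future) > 0 else 0)
    return flags


daily_purchase_count = [1, 0, 0, 0, 1, 0, 0]
target = future_event_flags(daily_purchase_count, 3)
```

Note that the last few days of the series only have a truncated look-ahead window, so their 0 values are not trustworthy.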

### Step 5: Filter Out Ineligible Dates & Drop Unneeded Columns
From Step 3, we know that we look some number of days into the past to compute rolling metrics (in this case, the largest window is 180 days), and in Step 4 we look some number of days into the future for our target column (also 180 days in this scenario). Therefore, some dates will be "too young" to generate meaningful rolling metrics and some dates will be "too old" to have a full 180-day window in the future that still falls within the timeframe covered by the dataset.

To account for these sets of dates that are "too young" or "too old," we generate columns and then use them to filter out ineligible rows to create our final dataframe.

In [None]:
max_rolling_window_span_in_days = 6*30 # This should be set to the same number of days as our longest rolling metric (in this case, 180 days)

df_time_series_w_age_viability = data_processing.determine_age_viability_for_model(df_time_series_w_target_col, max_rolling_window_span_in_days, rolling_window_into_future_in_days)
df_final = data_processing.get_eligible_training_set(df_time_series_w_age_viability)

# Save to a pickle file for easy retrieval later.
if pickle_file_of_df:
    df_final.to_pickle(pickle_file_of_df)
    print("Saved dataframe to %s." % pickle_file_of_df)
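
The eligibility rule can be illustrated with simple index arithmetic. The sketch below assumes day 0 is the customer's first event day and that a day needs a full look-back window behind it and a full look-ahead window in front of it; the exact boundary conventions used by `determine_age_viability_for_model` are not shown here, so treat them as assumptions:

```python
def eligible_day_indexes(n_days, lookback_days, lookahead_days):
    """Return the day indexes (0 = first event day) that are neither
    "too young" nor "too old".

    Assumptions: a day is "too young" if fewer than `lookback_days` days
    precede it, and "too old" if fewer than `lookahead_days` days follow it
    before the end of the dataset.
    """
    first_ok = lookback_days
    last_ok = n_days - 1 - lookahead_days
    return list(range(first_ok, last_ok + 1))


# A customer with 400 days of history keeps only days 180 through 219.
eligible = eligible_day_indexes(400, 180, 180)

# A customer with 300 days of history has no eligible days at all.
none_eligible = eligible_day_indexes(300, 180, 180)
```

With both windows at 180 days, a customer needs more than 360 days between their first event and the end of the dataset to contribute any rows.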

### Result: Final Dataframe Output
This is the final output of Phase 1. It is a clean time-series dataframe that can be used for training or evaluating models.

You can either run all cells above, or just this one cell.

In [None]:
df_final = data_processing.load_dataframe(pickle_file_of_df, csv=input_csv, overwrite_pickle=False)
print(df_final)

***

## Phase 2: Dummy Model Creation

For now, we will simply create two very similar Logistic Regression models that differ only in their seed and training/test split values.

In [None]:
model_paths = [model_1_file, model_2_file]
models = model_evaluation.load_models(df_final, model_paths, generate_new=CREATE_DUMMY_MODELS) # generate_new can be set to False to simply load models from the inputs at the top.
print(models)
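
To show how a seed and a split value alone can produce two different training sets, here is a small sketch of a seeded train/test split over row indexes. The seeds and test fractions are made-up values, and `train_test_split_indexes` is a hypothetical helper; the settings actually used by `load_models` are not shown in this notebook:

```python
import random


def train_test_split_indexes(n_rows, test_fraction, seed):
    """Shuffle row indexes deterministically and split them into
    (train_indexes, test_indexes)."""
    indexes = list(range(n_rows))
    random.Random(seed).shuffle(indexes)  # same seed -> same shuffle
    n_test = int(round(n_rows * test_fraction))
    return indexes[n_test:], indexes[:n_test]


# Hypothetical settings for the two dummy models.
train_1, test_1 = train_test_split_indexes(100, 0.25, seed=1)
train_2, test_2 = train_test_split_indexes(100, 0.30, seed=2)
```

Because each split is driven by its own seed, rerunning the cell reproduces exactly the same two partitions, which keeps the dummy models comparable across runs.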

## Phase 3: Model Evaluation & Comparison


In [None]:
model_comparison_results = model_evaluation.compare_models(input_csv, model_paths, df_pickle_file=pickle_file_of_df, generate_new_models=False)

# Print results in easily-readable form
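for idx, model_name in enumerate(model_comparison_results):
    model_results = model_comparison_results[model_name]
    # Assumption: print_model_evaluation_metrics lives in server.utils.model_evaluation.
    # It is not imported by name in the setup cell, so it is called through the module here.
    model_evaluation.print_model_evaluation_metrics(model_results, idx+1)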
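
This notebook does not list which metrics `compare_models` returns. As a sketch of the kind of comparison involved, the snippet below computes accuracy, precision, and recall for binary predictions; the choice of these three metrics and the name `binary_classification_metrics` are assumptions, not the framework's actual output:

```python
def binary_classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": float(tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted/present.
        "precision": float(tp) / (tp + fp) if (tp + fp) else 0.0,
        "recall": float(tp) / (tp + fn) if (tp + fn) else 0.0,
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_classification_metrics(y_true, y_pred)
```

Running the same function on each model's predictions over a shared evaluation set gives directly comparable numbers.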


## Future Improvements
### Data Parsing & Transformation:
Ideally this event stream data would already live in our database rather than be batch-uploaded through a CSV. If this were the case, then we could run a job at the end of each day to incrementally produce our needed time-series data as each day goes by. This would have the following benefits:
 -  The user would not have to wait at all for their event data to process and be transformed
 -  We could optimize how we transform our data to look through a much smaller window of rows, rather than create it from scratch every time
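
A minimal sketch of the incremental idea: instead of recomputing a rolling sum from the full history, an end-of-day job could keep only the last `window` daily values per customer and update the total in constant time. The class name `IncrementalRollingSum` is hypothetical, and a 3-day window is used only to keep the example short:

```python
from collections import deque


class IncrementalRollingSum(object):
    """Maintain a trailing sum that is updated once per day."""

    def __init__(self, window):
        self.window = window
        self.values = deque()
        self.total = 0.0

    def add_day(self, value):
        """Record today's value and return the updated rolling sum."""
        self.values.append(value)
        self.total += value
        if len(self.values) > self.window:
            self.total -= self.values.popleft()  # oldest day leaves the window
        return self.total


tracker = IncrementalRollingSum(window=3)
totals = [tracker.add_day(v) for v in [10.0, 0.0, 5.0, 0.0, 20.0]]
```

Each daily update touches only the newest and oldest values in the window, regardless of how long the customer's history is.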
 
While the code has been generalized to a degree, more could be done to extend this framework to other event streams. Ideally you could:
 -  Specify one or more CSVs and the "event name" for each
 -  Have two CSVs that represent the same event type be unioned and treated as if a single CSV of that event type had been uploaded
 -  Dynamically generate additional rolling metric columns for each event type in our time-series dataframe, for use as additional model inputs (i.e. factors)
 -  Specify which event type you want to predict


### Interacting with the Model Evaluation Framework:
Currently, a user would interact with this model evaluation framework through this Jupyter notebook or through importing these methods into their own code. This could be extended to be its own REST endpoint, or its own whole web-app with a front end.


### Scalability: Supporting 100,000 Companies
Many of the scalability wins would come from the "Data Parsing & Transformation" improvements described above, since much of the computation and waiting goes into transforming the event stream. However, there is also significant wait time when computing the evaluation metrics. To address this, let's imagine we've implemented the data transformation recommendations above and want to create a REST POST endpoint for model evaluation. You post two .sav model files to it and want it to return the evaluation metrics of each.

Once the web server receives the request, I would have it offload the work of computing the evaluation metrics to a worker fed by a message queue (say, RabbitMQ). Ideally, the POST request would also include a callback URL. Once the worker is done computing the model evaluation metrics, it could post the results back to that callback URL.

Even in a world where the event stream does not live in our database and we need to accept CSVs from customers, we could use this pattern of offloading the work to a worker queue and posting the results back to a callback URL.
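
The offload-and-callback pattern can be sketched with the standard library alone. In the sketch below an in-process queue and thread stand in for a real message broker and worker, a plain Python callable stands in for the HTTP POST to the callback URL, and `evaluate_models` is a hypothetical placeholder for the real metric computation:

```python
import threading

try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2


def evaluate_models(model_files):
    """Hypothetical placeholder for the slow evaluation step."""
    return {name: {"status": "evaluated"} for name in model_files}


def worker(jobs):
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        results = evaluate_models(job["model_files"])
        job["callback"](results)  # stands in for POSTing to the callback URL
        jobs.task_done()


jobs = queue.Queue()
received = []  # collects whatever the "callback" is sent

thread = threading.Thread(target=worker, args=(jobs,))
thread.start()

# The web server would enqueue the job and return immediately.
jobs.put({"model_files": ["model_1.sav", "model_2.sav"], "callback": received.append})
jobs.put(None)  # tell the worker to shut down

jobs.join()
thread.join()
```

The request handler only enqueues work, so its response time stays flat no matter how long evaluation takes; capacity is then a matter of adding workers.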
***