# Pricing Models 0, 1, 2, 3 Measurement

Besides the first pricing model (`pricing_model_0`), which trains word2vec on the full corpus, finds the 10 most similar challenges, and takes the average of their prizes as the estimate, I've built 3 more pricing models:

1. Train a word2vec model on a **_corpus with overlapping sections removed_**, then find the 10 most similar challenges and take the average of their prizes as the estimate
2. Train a word2vec model on a **_corpus with phrases (more than one word) detected_**, then find the 10 most similar challenges and take the average of their prizes as the estimate
3. Train a K-Nearest Neighbors model with:
   - X: document vectors calculated from `pricing_model_0`, with the challenge metadata appended
   - y: actual total prize
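
The analogy-based estimate shared by models 0 to 2 can be sketched roughly as follows. This is a minimal illustration rather than the code behind these models: the function name `estimate_prize`, the toy vectors, and the choice of cosine similarity are all assumptions, since the text above only says "10 most similar".

```python
import numpy as np


def estimate_prize(query_vec, doc_vecs, prizes, top_n=10):
    """Average the prizes of the top_n challenges most similar to query_vec.

    Similarity is cosine similarity (an assumption made for this sketch).
    """
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    prizes = np.asarray(prizes, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)

    # Cosine similarity between the query and every document vector.
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    sims = doc_vecs @ query_vec / norms

    # Indices of the top_n most similar challenges, most similar first.
    nearest = np.argsort(sims)[::-1][:top_n]
    return prizes[nearest].mean()


# Toy data (hypothetical): 4 challenges with 2-D document vectors.
toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_prizes = [1000.0, 1200.0, 300.0, 500.0]

# With top_n=2 the two challenges pointing along the first axis are selected.
toy_estimate = estimate_prize([1.0, 0.05], toy_vecs, toy_prizes, top_n=2)
print(toy_estimate)
```

In the real setting `doc_vecs` would hold one vector per historical challenge and `top_n` would stay at 10.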

The results are shown below.

In [None]:
import os
import json
from collections import defaultdict

import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

pd.set_option('display.max_rows', 150)

MODELS = range(3)
TRACKS = ('all', 'develop', 'design')
DIMENSIONS = range(100, 1100, 100)

In [None]:
pm_measure_dfs = defaultdict(lambda: defaultdict(dict))

for model in MODELS:
    for track in TRACKS:
        for dimension in DIMENSIONS:
            with open(os.path.join(os.curdir, f'pricing_model_{model}', f'{track}_track', 'measures', f'measure_{dimension}D.json')) as f:
                pm_measure_dfs[model][track][dimension] = pd.read_json(f, orient='records').set_index('index')


In [None]:
MMRE = []
for model in MODELS:
    for track in TRACKS:
        for dimension in DIMENSIONS:
            MMRE.append(dict(track=track, dimension=dimension, model=model, mmre=pm_measure_dfs[model][track][dimension]['MRE'].mean()))

mmre_df = pd.DataFrame(MMRE)

The mean MRE by track and dimension is shown below.
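
For reference, the sketch below shows how a mean MRE can be computed, assuming the `MRE` column in the measure files holds the usual magnitude of relative error, `|actual - estimate| / actual`. The measure files are not reproduced here, so the function name `mean_mre` and the toy numbers are purely illustrative.

```python
import numpy as np


def mean_mre(actual, estimate):
    """Mean of |actual - estimate| / actual over all challenges."""
    actual = np.asarray(actual, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mre = np.abs(actual - estimate) / actual
    return mre.mean()


# Toy values (hypothetical): the individual MREs are 0.5, 0.25 and 0.0.
toy_mmre = mean_mre([1000, 800, 500], [1500, 600, 500])
print(toy_mmre)
```

Under this definition a mean MRE of 1 means the estimates are off by 100% of the actual prize on average, so lower is better.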

In [None]:
title_text = [
    'Word2Vec from Full corpus',
    'Word2Vec from No Overlap corpus',
    'Word2Vec from phrases detected corpus'
]

with sns.axes_style('darkgrid'):
    fig, axes = plt.subplots(3, 1, figsize=(9, 9), dpi=200)

    for model in MODELS:
        ax = axes[model]

        sns.lineplot(
            data=mmre_df.loc[mmre_df.model == model],
            x='dimension',
            y='mmre',
            hue='track',
            style='track',
            palette='deep',
            linewidth=0.618,
            markers=['o', 'o', 'o'],
            markersize=4,
            ax=ax
        )

        handles, labels = ax.get_legend_handles_labels()
        ax.legend(handles, labels, prop={'size': 8})

        ax.set_xticks(list(range(100, 1100, 100)))
        ax.set_xticklabels(labels=list(range(100, 1100, 100)))
        ax.set_ylim(top=7, bottom=0)
        ax.set_yticks(list(range(7)))
        ax.set_yticklabels(labels=list(range(7)))

        ax.set_xlabel('Dimensionality of document vectors')
        ax.set_ylabel('Mean MRE')
        ax.set_title(f'Pricing model {model} - {title_text[model]}')

    fig.tight_layout()

Unexpectedly, once overlap sections are removed and phrases are detected, **the accuracy of the pricing models decreases.**

This result contradicts my assumption that refining the input corpus would increase accuracy.

> Note: Stop words were removed from the corpus for all three models.

To better show the drop in accuracy, I plot the mean MRE of each model by track below.

In [None]:
with sns.axes_style('darkgrid'):
    fig, axes = plt.subplots(3, 1, figsize=(9, 9), dpi=200)

    for idx, track in enumerate(TRACKS):
        ax = axes[idx]
        sns.lineplot(
            data=mmre_df.loc[mmre_df.track == track],
            x='dimension',
            y='mmre',
            hue='model',
            style='model',
            palette='deep',
            linewidth=0.618,
            markers=['o', 'o', 'o'],
            markersize=4,
            ax=ax
        )

        handles, labels = ax.get_legend_handles_labels()
        ax.legend(handles, labels, prop={'size': 8})

        ax.set_xticks(list(range(100, 1100, 100)))
        ax.set_xticklabels(labels=list(range(100, 1100, 100)))
        # ax.set_ylim(top=7, bottom=0)
        # ax.set_yticks(list(range(7)))
        # ax.set_yticklabels(labels=list(range(7)))

        ax.set_xlabel('Dimensionality of document vectors')
        ax.set_ylabel('Mean MRE')
        ax.set_title(f'Pricing model MMRE - {track.upper()} track')

    fig.tight_layout()

## KNN algorithm result

I trained the KNN model using the concatenation of challenge metadata and document vectors from `pricing_model_0` as input `X` and the actual prize as target `y`, then ran 10-fold cross validation to assess the model. The resulting mean MRE is encouraging.
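
The procedure described above can be sketched with plain NumPy as follows. This is a simplified stand-in, not the actual training code: the data is synthetic, and `k=5`, Euclidean distance, and the helper names `knn_predict` and `cross_validated_mmre` are assumptions made for illustration.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k=5):
    """Predict each test target as the mean of its k nearest training targets."""
    # Pairwise Euclidean distances, shape (n_test, n_train).
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


def cross_validated_mmre(X, y, k=5, n_folds=10, seed=0):
    """Mean MRE of KNN regression under n_folds-fold cross validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    mres = []
    for test_idx in folds:
        # Everything outside the current fold is used for training.
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        pred = knn_predict(X[train_mask], y[train_mask], X[test_idx], k=k)
        mres.append(np.abs(y[test_idx] - pred) / y[test_idx])
    return np.concatenate(mres).mean()


# Synthetic stand-ins (hypothetical): 200 challenges, 100-D document vectors
# plus 3 metadata columns, concatenated into one feature matrix.
rng = np.random.default_rng(42)
doc_vecs = rng.normal(size=(200, 100))
metadata = rng.uniform(size=(200, 3))
X = np.hstack([doc_vecs, metadata])
y = rng.uniform(500, 3000, size=200)  # strictly positive "prizes"

toy_cv_mmre = cross_validated_mmre(X, y, k=5, n_folds=10)
print(toy_cv_mmre)
```

In practice scikit-learn's `KNeighborsRegressor` and `KFold` cover the same steps in library form. Because KNN is distance based, metadata columns on a much larger scale than the document vectors would likely need rescaling first.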

In [None]:
with open(os.path.join(os.curdir, 'pricing_model_3', 'knn_pricing_model_measure.json')) as fread:
    pm3_data = json.load(fread)

# Map track -> {dimension: mean MRE} for the KNN model (pricing model 3).
pm3_measure_dct = {track: {int(dimension): result['Mean_MRE'] for dimension, result in d.items()} for track, d in pm3_data.items()}

pm3_measure_df = pd.DataFrame([dict(track=track, dimension=dimension, model=3, mmre=mmre) for track, d in pm3_measure_dct.items() for dimension, mmre in d.items()])

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
mmre_df = pd.concat([mmre_df, pm3_measure_df], ignore_index=True)

Compared to pricing model 0, which is based on text mining and analogy estimation, the KNN approach shows an obvious performance boost.

In [None]:
sub_mmre_df = mmre_df.loc[mmre_df.model.isin((0, 3))].reset_index(drop=True)

with sns.axes_style('darkgrid'):
    fig, axes = plt.subplots(3, 1, figsize=(9, 9), dpi=200)

    for idx, track in enumerate(TRACKS):
        ax = axes[idx]
        sns.lineplot(
            data=sub_mmre_df.loc[sub_mmre_df.track == track],
            x='dimension',
            y='mmre',
            hue='model',
            style='model',
            palette='deep',
            linewidth=0.618,
            markers=['o', 'o'],
            markersize=4,
            ax=ax
        )

        handles, labels = ax.get_legend_handles_labels()
        ax.legend(handles, labels, prop={'size': 8})

        ax.set_xticks(list(range(100, 1100, 100)))
        ax.set_xticklabels(labels=list(range(100, 1100, 100)))

        ax.set_xlabel('Dimensionality of document vectors')
        ax.set_ylabel('Mean MRE')
        ax.set_title(f'Pricing model MMRE - {track.upper()} track')

    fig.tight_layout()

**What next?**

1. I've been trying paragraph vectors (`gensim.models.Doc2Vec`) as another approach to document vectors, but the computing resources required are large, so it will take some time.

2. Add more metadata dimensions.

3. Explore the relation between subtrack and prize
   - violin plot

4. Explore the relation between _Size_ and _Workload_

Note: the raw dataset could be discrete.