# Data Exploration and Analysis

In [1]:
# The exploration code lives in a separate Python module because it is too large to paste here.
from data_exploration_classes import Data, Statistics
from problem_description_and_state import ProblemDescription

pd = ProblemDescription(inventory_cols=3, inventory_rows=2)
for input_filepath in ['data/3x2/warehousetraining.txt', 'data/3x2/warehouseorder.txt']:
    data = Data(pd, input_filepath)
    data.print_statistics()
    print()

data/3x2/warehousetraining.txt
-----------------------------------------------------------
12108 requests total
6054 store requests, 6054 restore requests
8196 requests (0.68) with the same verb as the one before
min 0 / max 6 / avg 3.01 / final 0
blue  min 0 / max 5 / avg 0.85 / final 0 / request share 0.25
red   min 0 / max 6 / avg 1.31 / final 0 / request share 0.49
white min 0 / max 5 / avg 0.86 / final 0 / request share 0.26
at 0 items: 0.056 store / 0.000 restore
            0.000 store after store / 0.000 restore after restore
            0.056 store after restore / 0.000 restore after store
at 1 items: 0.088 store / 0.056 restore
            0.051 store after store / 0.051 restore after restore
            0.037 store after restore / 0.005 restore after store
at 2 items: 0.105 store / 0.088 restore
            0.076 store after store / 0.076 restore after restore
            0.029 store after restore / 0.012 restore after store
at 3 items: 0.106 store / 0.105 restore
          

### Comparison with Generated Data

A request generator function with statistical properties similar to those of the training data has been developed. Comparing requests in the training data with requests from the generator helps to uncover further properties of the data.
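To illustrate what such a generator could look like, here is a minimal sketch. It is not the `GeneratedData` implementation from `data_exploration_classes`; the function name `generate_requests`, the `(verb, color)` tuple representation, the uniform verb choice, and the stock-weighted choice of the restored color are all assumptions made for illustration. Only the color shares and the capacity of 6 are taken from the statistics above.

```python
import random

# Assumed store shares per color, rounded from the request shares printed above.
COLOR_SHARES = {"blue": 0.25, "red": 0.50, "white": 0.25}


def generate_requests(n, capacity=6, seed=0):
    """Return n feasible (verb, color) requests for a warehouse with `capacity` slots."""
    rng = random.Random(seed)
    inventory = {color: 0 for color in COLOR_SHARES}
    requests = []
    for _ in range(n):
        total = sum(inventory.values())
        if total == 0:
            verb = "store"        # nothing to restore from an empty warehouse
        elif total == capacity:
            verb = "restore"      # no free slot left for a store request
        else:
            verb = rng.choice(["store", "restore"])  # assumption: verbs are independent
        if verb == "store":
            colors = list(COLOR_SHARES)
            weights = list(COLOR_SHARES.values())
            color = rng.choices(colors, weights=weights)[0]
            inventory[color] += 1
        else:
            # Only colors currently in stock can be restored (assumed: weighted by stock).
            held = [c for c in inventory if inventory[c] > 0]
            color = rng.choices(held, weights=[inventory[c] for c in held])[0]
            inventory[color] -= 1
        requests.append((verb, color))
    return requests
```

Because the verb is drawn independently of the previous request, a generator of this kind carries no sequential verb pattern, which is what makes it usable as a reference when looking for patterns in the training data.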

In [2]:
from data_exploration_classes import GeneratedData

data = GeneratedData(pd)
data.print_statistics()

generated data
-----------------------------------------------------------
100000 requests total
50002 store requests, 49998 restore requests
41617 requests (0.42) with the same verb as the one before
min 0 / max 6 / avg 2.99 / final 4
blue  min 0 / max 5 / avg 0.74 / final 1 / request share 0.25
red   min 0 / max 6 / avg 1.50 / final 3 / request share 0.50
white min 0 / max 5 / avg 0.74 / final 0 / request share 0.25
at 0 items: 0.015 store / 0.000 restore
            0.000 store after store / 0.000 restore after restore
            0.015 store after restore / 0.000 restore after store
at 1 items: 0.079 store / 0.015 restore
            0.013 store after store / 0.013 restore after restore
            0.067 store after restore / 0.003 restore after store
at 2 items: 0.158 store / 0.079 restore
            0.053 store after store / 0.053 restore after restore
            0.105 store after restore / 0.026 restore after store
at 3 items: 0.156 store / 0.158 restore
            0.079 stor

Discovering sequential patterns in the data is more complicated:

In [3]:
from data_exploration_classes import PatternAnalysis
data = Data(pd, 'data/3x2/warehousetraining.txt')
ref_data = GeneratedData(pd)
# looking for patterns in the verbs of request sequences
for pattern_length in [2, 3, 4]:
    pa_from_training_data = PatternAnalysis(pd, data, pattern_length, use_verb=True, use_color=False)
    pa_from_generated_data = PatternAnalysis(pd, ref_data, pattern_length, use_verb=True, use_color=False)
    pa_from_training_data.compare(pa_from_generated_data, prob_ratio_threshold=1.25)
    print()

differences in length 2 pattern frequencies (by factor 1.25 or more) for verb only:
s?-s?: 0.3384539 vs 0.2084900 expected (#: 4098 vs 2524 expected)
s?-r?: 0.1615461 vs 0.2915200 expected (#: 1956 vs 3530 expected)
r?-s?: 0.1614635 vs 0.2915100 expected (#: 1955 vs 3530 expected)
r?-r?: 0.3384539 vs 0.2084700 expected (#: 4098 vs 2524 expected)

differences in length 3 pattern frequencies (by factor 1.25 or more) for verb only:
s?-s?-s?: 0.2073835 vs 0.0699100 expected (#: 2511 vs 846 expected)
s?-r?-s?: 0.0322927 vs 0.1518000 expected (#: 391 vs 1838 expected)
r?-s?-r?: 0.0304757 vs 0.1529400 expected (#: 369 vs 1852 expected)
r?-r?-r?: 0.2092005 vs 0.0687500 expected (#: 2533 vs 832 expected)

differences in length 4 pattern frequencies (by factor 1.25 or more) for verb only:
s?-s?-s?-s?: 0.1124050 vs 0.0177800 expected (#: 1361 vs 215 expected)
s?-s?-s?-r?: 0.0949785 vs 0.0521300 expected (#: 1150 vs 631 expected)
s?-s?-r?-s?: 0.0219689 vs 0.0637400 expected (#: 266 vs 772 expected

This suggests that store-store-store and restore-restore-restore patterns occur about three times as often in the training data as in the generated data (roughly 0.21 vs 0.07). Conversely, the alternating store-restore-store and restore-store-restore patterns occur about five times less often (roughly 0.03 vs 0.15). The output for pattern length 4 confirms that the pattern is limited to length 3.
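The comparison above can be sketched as a sliding-window count over the verb sequence. This is only an illustration of the idea, not the `PatternAnalysis` class itself; the function names `pattern_frequencies` and `compare_patterns` and the representation of verbs as a plain list are assumptions.

```python
from collections import Counter


def pattern_frequencies(verbs, length):
    """Relative frequency of every run of `length` consecutive verbs."""
    windows = [tuple(verbs[i:i + length]) for i in range(len(verbs) - length + 1)]
    total = len(windows)
    return {pattern: count / total for pattern, count in Counter(windows).items()}


def compare_patterns(observed, expected, ratio_threshold=1.25):
    """Return patterns whose frequencies differ by at least `ratio_threshold` in either direction."""
    flagged = {}
    for pattern in sorted(set(observed) | set(expected)):
        obs = observed.get(pattern, 0.0)
        exp = expected.get(pattern, 0.0)
        if obs == 0.0 or exp == 0.0:
            # Pattern occurs in only one of the two sequences.
            flagged[pattern] = (obs, exp)
        elif max(obs / exp, exp / obs) >= ratio_threshold:
            flagged[pattern] = (obs, exp)
    return flagged
```

Calling `compare_patterns(pattern_frequencies(training_verbs, 3), pattern_frequencies(generated_verbs, 3))` with two hypothetical verb lists would return the length-3 patterns that are over- or under-represented in the training data relative to the reference.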

Looking for sequential patterns in the colors of request sequences, or for verb-color combinations, did not yield any results.

## Analysis Results (for 3x2)

* input data is well-formed
  * inventory count is always between 0 and 6
  * inventory count starts and ends at 0
  * no restore requests for items that the inventory does not hold
  * consequence: store requests are more likely at low inventory count, restore requests are more likely at high inventory count
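* red is requested about twice as often as blue or white
* there is a sequential pattern in the data where the same verb is repeated three times in a row
* alternating verbs over 3 requests (store-restore-store, restore-store-restore) are much less likely than in the generated data ("negative pattern"?)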
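
The well-formedness properties listed above can be checked by replaying the requests against a simple inventory counter. The following is a minimal sketch under stated assumptions: requests are taken to be `(verb, color)` tuples with the verbs `"store"` and `"restore"`, and the function name `check_well_formed` is hypothetical. The actual file format and the checks in `data_exploration_classes` may differ.

```python
def check_well_formed(requests, capacity=6):
    """Replay (verb, color) requests and return a list of (index, message) violations."""
    inventory = {}
    violations = []
    for index, (verb, color) in enumerate(requests):
        if verb == "store":
            if sum(inventory.values()) >= capacity:
                violations.append((index, "store into full inventory"))
                continue
            inventory[color] = inventory.get(color, 0) + 1
        elif verb == "restore":
            if inventory.get(color, 0) == 0:
                violations.append((index, "restore of item not in inventory"))
                continue
            inventory[color] -= 1
        else:
            violations.append((index, "unknown verb"))
    # The training data ends with an empty inventory, so flag anything left over.
    if sum(inventory.values()) != 0:
        violations.append((len(requests), "inventory not empty at end"))
    return violations
```

An empty result means the sequence never exceeds the capacity, never restores an item that is not held, and ends with an empty inventory.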