
Rebuild cohort processing #72

Merged · 144 commits · Aug 8, 2022

Conversation

@KimballCai (Contributor) commented Jul 25, 2022

Please update the details here when reviewing the code (issues to be addressed).

Create an issue and submit a corresponding PR for each one.

  1. Add logic to check the meta chunk to speed up query processing.
  2. Remove the aggregation logic in ValueSelection.
    • ValueSelection only handles filtering (it can be merged with EventSelection).
    • RetUnit will encapsulate the logic of updating statistics (discussed below).
  3. Add a factory class to handle FieldRS creation (better encapsulation).
  4. Rework ProjectedTuple.
    • The existing implementation mimics the old logic, which introduces additional indirection.
    • Keep it simple: let its producer decide which indexed value to retrieve from it.
    • Make it immutable; avoid loadattr calls that mutate internal data.
    • Handle the special fields, user id and action time, separately.
  5. Let MetaFieldRS create a value converter/translator (extensibility and polymorphism; a sketch of this idea follows the list).
    • Currently we only have two types, but we can expect more field types.
    • The MetaFieldRS translates the token stored in the data chunk (a gid for strings now) into the actual value (a no-op for range fields).
  6. Augment RetUnit.
    • Perhaps we should rename this variable.
    • It will contain additional variables for max, min, etc. (now only counts).
    • One solution is to keep a list of variables and a list of functions (aggregators) that take in a new value and update the corresponding variable.
  7. Add documentation: DataHashFieldRS needs a description of its assumption that every input vector used there supports efficient get-by-index (Zint, BitVector, ZintBitInput).
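
(For illustration only, not the PR's API: a minimal Java sketch of the translator idea in item 5. The names ValueTranslator, HashTranslator, and RangeTranslator are hypothetical.)

```java
public interface ValueTranslator {
  /** Translate a token stored in the data chunk into its actual value. */
  String translate(int token);
}

/** Hash (set) fields: the token is a gid indexing into the meta chunk values. */
class HashTranslator implements ValueTranslator {
  private final String[] gidToValue; // loaded from the MetaFieldRS

  HashTranslator(String[] gidToValue) { this.gidToValue = gidToValue; }

  @Override
  public String translate(int token) { return gidToValue[token]; }
}

/** Range fields: tokens are stored as-is, so translation is a no-op. */
class RangeTranslator implements ValueTranslator {
  @Override
  public String translate(int token) { return String.valueOf(token); }
}
```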


private Object[] tuple;

private HashMap<String, Integer> schema2Index;
Collaborator:

Actually, I just want a unified structure which can be regarded as an input for all processor units. Any suggestions?

*
* @return the layout of this ProjectedTuple
*/
public String[] getSchemaList(){
Collaborator:

Reserved Interface

* @return true if the cube exists
* @throws IOException
*/
public synchronized boolean isCubeExist(String cube) throws IOException {
Collaborator:

Got it

@hugy718 (Collaborator) left a comment:

Adding on to my previous feedback: it will still take some time for me to read the birthSelect and cohortSelect folders.

After reading the filters and aggregators, I think we need to discuss the plan for how ProjectedTuple is being used. It seems that the end consumers, the aggregators, only care about one field. Then there are two specific fields, user id and action time, added to help selection. We could have dedicated variables in ProjectedTuple for these two fields, then keep a list of value fields that will be aggregated, and let CohortProcessor use an aggregator to update the corresponding intermediate result in RetUnit during processing. A rough sketch of that shape follows.
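
(A minimal sketch of the suggested shape, assuming the names below; none of them are the PR's final API.)

```java
public final class ProjectedTuple {
  private final String userId;       // dedicated field for selection
  private final long actionTime;     // dedicated field for selection
  private final Object[] valueFields; // fields consumed by aggregators

  public ProjectedTuple(String userId, long actionTime, Object[] valueFields) {
    this.userId = userId;
    this.actionTime = actionTime;
    this.valueFields = valueFields.clone(); // keep the tuple immutable
  }

  public String getUserId() { return userId; }

  public long getActionTime() { return actionTime; }

  /** The producer decides the index; no schema-to-index lookup per access. */
  public Object getValueField(int i) { return valueFields[i]; }
}
```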

case "RANGE": return Range;
case "SET": return Set;
default:
throw new IllegalArgumentException();
Collaborator:

Unhandled exception? Is @JsonCreator going to handle it? Actually, in the upstream function calls we need to handle errors gracefully instead of stopping the whole system. Just leaving a note here; it can be a target in our next phase of development.
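
(A hedged sketch of the graceful-handling idea, assuming Jackson databind; the enum name FieldType is inferred from the snippet and may differ.)

```java
import com.fasterxml.jackson.annotation.JsonCreator;

public enum FieldType {
  Range, Set;

  @JsonCreator
  public static FieldType fromString(String s) {
    switch (s) {
      case "RANGE": return Range;
      case "SET":   return Set;
      default:
        // A descriptive message; Jackson surfaces this as a
        // JsonMappingException, which upstream callers can catch
        // and turn into a query-level error instead of a crash.
        throw new IllegalArgumentException("Unknown field type: " + s);
    }
  }
}
```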

* Provide get and set interface
*/
@Data
public class RetUnit{
Collaborator:

Is this where we are to support max, min, etc. for the cohort result? If so, we should leave some notes here.

Leaving a note here: if we have separate variables tracking min, max, etc., we need not expose the internal attributes through getters. Instead we can have a set of operations like max(), addToDistinct(), etc.

And Aggregator::calculate simply calls the respective methods.
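
(A sketch of the operation-based RetUnit discussed above: state stays private and aggregators call named update operations. All names are illustrative.)

```java
public class RetUnit {
  private float count;
  private float max = Float.NEGATIVE_INFINITY;
  private float min = Float.POSITIVE_INFINITY;

  public void count() { count += 1; }

  public void max(float value) { max = Math.max(max, value); }

  public void min(float value) { min = Math.min(min, value); }

  // Read-out methods for producing the final cohort result omitted.
}

interface Aggregator {
  void calculate(RetUnit retUnit, float value);
}

/** Aggregator::calculate simply dispatches to the matching operation. */
class MaxAggregator implements Aggregator {
  @Override
  public void calculate(RetUnit retUnit, float value) { retUnit.max(value); }
}
```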

@hugy718 (Collaborator) left a comment:

CohortSelect contains wrappers of filters. If we avoid having ProjectedTuple encapsulate the schema and the indirection from schema to value, we could remove those wrappers.

I have finished my review; this is the last part. We need to increase test coverage, especially unit tests for the selection context. I also noticed that filtering using the meta chunk has not been implemented in CohortProcessor.

Regarding birth sequence support: the old implementation benefited from the assumption that all records of a user are contiguous and guaranteed to be in one data chunk. The birthEvents are checked one by one in a loop, meaning all records of a user are available during processing. That is in conflict with update handling. Currently we support one entry in birthSequence; to support more, the selection context needs considerable modification to track multiple partial matches of the birthSequence (the old implementation did not face this issue because it scans all of a user's records for every entry in birthSequence iteratively). A sketch of what such tracking could look like is below.
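
(An illustrative sketch only, reusing the hypothetical ProjectedTuple from the earlier sketch: each user keeps a cursor into the sequence, advanced as matching events arrive in one pass. Note it tracks a single in-order partial match per user; overlapping partial matches would need more state, which is the "considerable modification" mentioned above.)

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical filter interface; stands in for the PR's event filters. */
interface EventFilter {
  boolean accept(ProjectedTuple tuple);
}

class BirthContext {
  private final List<EventFilter> birthSequence;
  private final Map<String, Integer> matchedUpTo = new HashMap<>();

  BirthContext(List<EventFilter> birthSequence) {
    this.birthSequence = birthSequence;
  }

  /** Advances the user's cursor on a match; true once the whole sequence has matched. */
  boolean accept(String userId, ProjectedTuple tuple) {
    int next = matchedUpTo.getOrDefault(userId, 0);
    if (next < birthSequence.size() && birthSequence.get(next).accept(tuple)) {
      matchedUpTo.put(userId, ++next);
    }
    return next == birthSequence.size();
  }
}
```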

@hugy718 (Collaborator) commented Aug 2, 2022

BTW, please remove commented-out code (including // TODO Auto-generated method stub), improper spacing, and redundant imports.

@NLGithubWP (Collaborator) commented Aug 3, 2022

This PR has made the following changes:

  1. For tests and datasets:
    • Rename datasetSource to CubeRepo.
  2. For schema:
    • Add a Set data type as an indicator.
  3. For readStore:
    • Add a new method to HashMetaFieldRS supporting retrieving all values for each field.
    • Abstract the read-field logic into the FieldRS interface as a static class.
  4. Cohort processing logic. The new processing logic follows these steps (a sketch of the loop follows this list):
    • Traverse the query.json file and get all required fields.
    • Retrieve all values of those fields into memory.
    • Pre-define a type-agnostic list, projectTuple, which stores a record's information.
    • For each row:
      • Update its information in projectTuple. All following logic accepts it as input.
      • Check if the user meets the birthSelection requirements by updating and checking the context.
      • If the user is birthed, calculate this record's age, check which cohort the record belongs to, and finally update the query result according to the function defined in valueSelector.
  5. Besides, the PR also re-defines the aggregator, filter, and selector based on the new input projection, projectTuple:
    • For filters, the PR provides basic filter logic for set, range, and type.
    • For cohort selectors, the PR provides cohort selectors for both set and range, each extending from the filter while implementing the cohort selector interface.
  6. Finally, all intermediate data structures used in the above processing logic are defined in the storage folder, and some time-related functions are defined in utils.
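
(A pseudocode-style sketch in Java of the per-row flow in step 4; every collaborator type and method name here is paraphrased from the description, not the actual API.)

```java
void processChunk(DataChunk chunk) {
  for (int row = 0; row < chunk.rowCount(); row++) {
    // Update the record's information into the projected tuple;
    // all following logic accepts it as input.
    ProjectedTuple tuple = projector.project(chunk, row);

    // Update and check the birth selection context for this user.
    if (!birthSelection.selectEvent(tuple)) {
      continue;
    }

    // The user is birthed: compute the record's age, find its cohort,
    // and update the result per the function defined in valueSelector.
    int age = ageOf(tuple);
    String cohort = cohortSelector.selectCohort(tuple);
    result.update(cohort, age, tuple);
  }
}
```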

Overall, the structure and logic are clear.

Currently, for each query, the system loads all related fields' values into memory, and they are not released after the query finishes.

Although this can speed up subsequent queries, we cannot predict the access frequency of each field, since there is no workload yet.

If the system runs as a service, all fields will eventually be loaded into memory. This may be inefficient.

Is it better to delete all cached data after finishing a query?
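
(One possible shape for this suggestion, illustrative only: every identifier here, such as FieldRS, cubeRepo, and cohortProcessor, is assumed rather than the project's actual API.)

```java
QueryResult runQuery(CohortQuery query) {
  Map<String, FieldRS> fieldCache = new HashMap<>();
  try {
    // Load only the fields this query needs.
    for (String field : query.requiredFields()) {
      fieldCache.put(field, cubeRepo.loadField(field));
    }
    return cohortProcessor.process(query, fieldCache);
  } finally {
    // Release per-query field values once the result is produced,
    // instead of caching them for the lifetime of the service.
    fieldCache.clear();
  }
}
```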

The main modification is to separate query parsing from query-logic processing. It also renames some methods and adds some documentation.
@hugy718 (Collaborator) left a comment:

Thanks for the quick action. The decoupling made it much cleaner. I have no further comments. Please do a rebase against dev; that would exclude those olap-related commits from this branch, making it easier for others to view and for future reference.


if (!(userField.getValueVector() instanceof RLEInputVector)) {
  totalCorruptedUsers++;
  LOG.info("The user record corrupted: " + totalDataChunks);
Collaborator:

Maybe just leave it as "user record corrupted"; totalDataChunks does not give much info.

@Zrealshadow (Collaborator), quoting the review above:

> Thanks for the quick action. The decoupling made it much cleaner. I have no further comments. Please do a rebase against dev; that would exclude those olap-related commits from this branch, making it easier for others to view and for future reference.

My suggestion is to directly merge and resolve these conflicts (which are not part of the main logic of this PR). Since there are so many commits, errors may occur while rebasing all of them.

@hugy718 (Collaborator) left a comment:

@KimballCai @NLGithubWP please compile on your side to make sure it works as well. Let's merge this PR and move on to addressing its problems in separate PRs.

@Zrealshadow self-requested a review August 8, 2022 06:23
@KimballCai (Contributor, Author):

I have checked the code and can run it successfully.

@Zrealshadow (Collaborator):

Moved this follow-up work to issue #83.

@KimballCai deleted the rebuild-cohort-processing branch August 31, 2022 12:09
Labels: structure-adjust (Adjust project structure or internal data structure)
4 participants