
Rebuild cohort processing #72

Merged · 144 commits · Aug 8, 2022

Conversation

@KimballCai (Contributor) commented Jul 25, 2022

Please update the details here when reviewing the code (issues to be addressed).

Create an issue and submit a corresponding PR for each one.

  1. Add logic to check the meta chunk to speed up query processing.
  2. Remove the aggregation logic in ValueSelection.
    • ValueSelection only handles filtering (it can be merged with EventSelection).
    • RetUnit will encapsulate the logic of updating statistics (discussed below).
  3. Add a factory class to handle FieldRS creation (better encapsulation).
  4. Rework ProjectedTuple.
    • The existing implementation mimics the old logic, which introduces additional indirection.
    • Keep it simple: let its producer decide which indexed value to retrieve from it.
    • Make it immutable; avoid loadattr calls that mutate internal data.
    • Handle the special fields, user id and action time, separately.
  5. Let MetaFieldRS create a value converter/translator (extensibility and polymorphism; a sketch of this idea follows the list).
    • Currently we only have two types, but we can expect more field types.
    • The MetaFieldRS translates the token stored in the data chunk (a gid for strings now) into the actual value (a no-op for range fields).
  6. Augment RetUnit.
    • Perhaps we should rename this variable.
    • It will contain additional variables for max, min, etc. (now only counts).
    • One solution is to keep a list of variables and a list of functions (aggregators) that take in a new value and update the corresponding variable.
  7. Add documentation: DataHashFieldRS needs a description of its assumption that every input vector used there supports efficient get-by-index (Zint, BitVector, ZintBitInput).
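
(For illustration only, not the PR's API: a minimal Java sketch of the translator idea in item 5. The names ValueTranslator, HashTranslator, and RangeTranslator are hypothetical.)

```java
public interface ValueTranslator {
  /** Translate a token stored in the data chunk into its actual value. */
  String translate(int token);
}

/** Hash (set) fields: the token is a gid indexing into the meta chunk values. */
class HashTranslator implements ValueTranslator {
  private final String[] gidToValue; // loaded from the MetaFieldRS

  HashTranslator(String[] gidToValue) { this.gidToValue = gidToValue; }

  @Override
  public String translate(int token) { return gidToValue[token]; }
}

/** Range fields: tokens are stored as-is, so translation is a no-op. */
class RangeTranslator implements ValueTranslator {
  @Override
  public String translate(int token) { return String.valueOf(token); }
}
```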


private Object[] tuple;

private HashMap<String, Integer> schema2Index;
Collaborator:

Actually, I just want a unified structure which can be regarded as an input for all processor units. Any suggestions?

*
* @return the layout of this ProjectedTuple
*/
public String[] getSchemaList(){
Collaborator:

Reserved Interface

* @return true if the cube exists
* @throws IOException
*/
public synchronized boolean isCubeExist(String cube) throws IOException {
Collaborator:

Got it

@hugy718 (Collaborator) left a comment:

Adding on to my previous feedback: it will still take some time for me to read the birthSelect and cohortSelect folders.

After reading the filters and aggregators, I think we need to discuss the plan for how ProjectedTuple is being used. It seems that the end consumers, the aggregators, only care about one field. Then there are two specific fields, user id and action time, added to help selection. We could have dedicated variables in ProjectedTuple for these two fields, then keep a list of value fields that will be aggregated, and let CohortProcessor use an aggregator to update the corresponding intermediate result in RetUnit during processing. A rough sketch of that shape follows.
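
(A minimal sketch of the suggested shape, assuming the names below; none of them are the PR's final API.)

```java
public final class ProjectedTuple {
  private final String userId;       // dedicated field for selection
  private final long actionTime;     // dedicated field for selection
  private final Object[] valueFields; // fields consumed by aggregators

  public ProjectedTuple(String userId, long actionTime, Object[] valueFields) {
    this.userId = userId;
    this.actionTime = actionTime;
    this.valueFields = valueFields.clone(); // keep the tuple immutable
  }

  public String getUserId() { return userId; }

  public long getActionTime() { return actionTime; }

  /** The producer decides the index; no schema-to-index lookup per access. */
  public Object getValueField(int i) { return valueFields[i]; }
}
```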

case "RANGE": return Range;
case "SET": return Set;
default:
throw new IllegalArgumentException();
Collaborator:

Unhandled exception? Is @JsonCreator going to handle it? Actually, in the upstream function calls we need to handle errors gracefully instead of stopping the whole system. Just leaving a note here; it can be a target in our next phase of development.
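
(A hedged sketch of the graceful-handling idea, assuming Jackson databind; the enum name FieldType is inferred from the snippet and may differ.)

```java
import com.fasterxml.jackson.annotation.JsonCreator;

public enum FieldType {
  Range, Set;

  @JsonCreator
  public static FieldType fromString(String s) {
    switch (s) {
      case "RANGE": return Range;
      case "SET":   return Set;
      default:
        // A descriptive message; Jackson surfaces this as a
        // JsonMappingException, which upstream callers can catch
        // and turn into a query-level error instead of a crash.
        throw new IllegalArgumentException("Unknown field type: " + s);
    }
  }
}
```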

* Provide get and set interface
*/
@Data
public class RetUnit{
Collaborator:

Is this where we are to support max, min, etc. for the cohort result? If so, we should leave some notes here.

Leaving a note here: if we have separate variables tracking min, max, etc., we need not expose the internal attributes through getters. Instead we can have a set of operations like max(), addToDistinct(), etc.

And Aggregator::calculate simply calls the respective methods.
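
(A sketch of the operation-based RetUnit discussed above: state stays private and aggregators call named update operations. All names are illustrative.)

```java
public class RetUnit {
  private float count;
  private float max = Float.NEGATIVE_INFINITY;
  private float min = Float.POSITIVE_INFINITY;

  public void count() { count += 1; }

  public void max(float value) { max = Math.max(max, value); }

  public void min(float value) { min = Math.min(min, value); }

  // Read-out methods for producing the final cohort result omitted.
}

interface Aggregator {
  void calculate(RetUnit retUnit, float value);
}

/** Aggregator::calculate simply dispatches to the matching operation. */
class MaxAggregator implements Aggregator {
  @Override
  public void calculate(RetUnit retUnit, float value) { retUnit.max(value); }
}
```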

@hugy718 (Collaborator) left a comment:

CohortSelect contains wrappers of filters. If we avoid having ProjectedTuple encapsulate the schema and the indirection from schema to value, we could remove those wrappers.

I have finished my review; this is the last part. We need to increase test coverage, especially unit tests for the selection context. I also noticed that filtering using the meta chunk has not been implemented in CohortProcessor.

Regarding birth sequence support: the old implementation benefited from the assumption that all records of a user are contiguous and guaranteed to be in one data chunk. The birthEvents are checked one by one in a loop, meaning all records of a user are available during processing. That is in conflict with update handling. Currently we support one entry in birthSequence; to support more, the selection context needs considerable modification to track multiple partial matches of the birthSequence (the old implementation did not face this issue because it scans all of a user's records for every entry in birthSequence iteratively). A sketch of what such tracking could look like is below.
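
(An illustrative sketch only, reusing the hypothetical ProjectedTuple from the earlier sketch: each user keeps a cursor into the sequence, advanced as matching events arrive in one pass. Note it tracks a single in-order partial match per user; overlapping partial matches would need more state, which is the "considerable modification" mentioned above.)

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical filter interface; stands in for the PR's event filters. */
interface EventFilter {
  boolean accept(ProjectedTuple tuple);
}

class BirthContext {
  private final List<EventFilter> birthSequence;
  private final Map<String, Integer> matchedUpTo = new HashMap<>();

  BirthContext(List<EventFilter> birthSequence) {
    this.birthSequence = birthSequence;
  }

  /** Advances the user's cursor on a match; true once the whole sequence has matched. */
  boolean accept(String userId, ProjectedTuple tuple) {
    int next = matchedUpTo.getOrDefault(userId, 0);
    if (next < birthSequence.size() && birthSequence.get(next).accept(tuple)) {
      matchedUpTo.put(userId, ++next);
    }
    return next == birthSequence.size();
  }
}
```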

@hugy718 (Collaborator) commented Aug 2, 2022

BTW, please remove commented-out code (including // TODO Auto-generated method stub), improper spacing, and redundant imports.

@NLGithubWP (Collaborator) commented Aug 3, 2022

This PR has made the following changes:

  1. For tests and datasets:
    • Rename datasetSource to CubeRepo.
  2. For schema:
    • Add a Set data type as an indicator.
  3. For readStore:
    • Add a new method to HashMetaFieldRS supporting retrieving all values for each field.
    • Abstract the read-field logic into the FieldRS interface as a static class.
  4. Cohort processing logic. The new processing logic follows these steps (a sketch of the loop follows this list):
    • Traverse the query.json file and get all required fields.
    • Retrieve all values of those fields into memory.
    • Pre-define a type-agnostic list, projectTuple, which stores a record's information.
    • For each row:
      • Update its information in projectTuple. All following logic accepts it as input.
      • Check if the user meets the birthSelection requirements by updating and checking the context.
      • If the user is birthed, calculate this record's age, check which cohort the record belongs to, and finally update the query result according to the function defined in valueSelector.
  5. Besides, the PR also re-defines the aggregator, filter, and selector based on the new input projection, projectTuple:
    • For filters, the PR provides basic filter logic for set, range, and type.
    • For cohort selectors, the PR provides cohort selectors for both set and range, each extending from the filter while implementing the cohort selector interface.
  6. Finally, all intermediate data structures used in the above processing logic are defined in the storage folder, and some time-related functions are defined in utils.
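
(A pseudocode-style sketch in Java of the per-row flow in step 4; every collaborator type and method name here is paraphrased from the description, not the actual API.)

```java
void processChunk(DataChunk chunk) {
  for (int row = 0; row < chunk.rowCount(); row++) {
    // Update the record's information into the projected tuple;
    // all following logic accepts it as input.
    ProjectedTuple tuple = projector.project(chunk, row);

    // Update and check the birth selection context for this user.
    if (!birthSelection.selectEvent(tuple)) {
      continue;
    }

    // The user is birthed: compute the record's age, find its cohort,
    // and update the result per the function defined in valueSelector.
    int age = ageOf(tuple);
    String cohort = cohortSelector.selectCohort(tuple);
    result.update(cohort, age, tuple);
  }
}
```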

Overall, the structure and logic are clear.

Currently, for each query, the system loads all related fields' values into memory, and they are not released after the query finishes.

Although this can speed up subsequent queries, we cannot predict the access frequency of each field, since there is no workload yet.

If the system runs as a service, all fields will eventually be loaded into memory. This may be inefficient.

Is it better to delete all cached data after finishing a query?
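
(One possible shape for this suggestion, illustrative only: every identifier here, such as FieldRS, cubeRepo, and cohortProcessor, is assumed rather than the project's actual API.)

```java
QueryResult runQuery(CohortQuery query) {
  Map<String, FieldRS> fieldCache = new HashMap<>();
  try {
    // Load only the fields this query needs.
    for (String field : query.requiredFields()) {
      fieldCache.put(field, cubeRepo.loadField(field));
    }
    return cohortProcessor.process(query, fieldCache);
  } finally {
    // Release per-query field values once the result is produced,
    // instead of caching them for the lifetime of the service.
    fieldCache.clear();
  }
}
```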

The main modification is to separate query parsing from query-logic processing. It also renames some methods and adds some documentation.
@hugy718 (Collaborator) left a comment:

Thanks for the quick action. The decoupling made it much cleaner. I have no further comments. Please do a rebase against dev; that would exclude those olap-related commits from this branch, making it easier for others to view and for future reference.


if (!(userField.getValueVector() instanceof RLEInputVector)) {
  totalCorruptedUsers++;
  LOG.info("The user record corrupted: " + totalDataChunks);
Collaborator:

Maybe just leave it as "user record corrupted"; totalDataChunks does not give much info.

@Zrealshadow (Collaborator), quoting the review above:

> Thanks for the quick action. The decoupling made it much cleaner. I have no further comments. Please do a rebase against dev; that would exclude those olap-related commits from this branch, making it easier for others to view and for future reference.

My suggestion is to directly merge and resolve these conflicts (which are not part of the main logic of this PR). Since there are so many commits, errors may occur while rebasing all of them.

@hugy718 (Collaborator) left a comment:

@KimballCai @NLGithubWP please compile on your side to make sure it works as well. Let's merge this PR and move on to addressing its problems in separate PRs.

@Zrealshadow self-requested a review August 8, 2022 06:23
@KimballCai (Contributor, Author):

I have checked the code and can run it successfully.

@Zrealshadow (Collaborator):

Moved this follow-up work to issue #83.

@KimballCai deleted the rebuild-cohort-processing branch August 31, 2022 12:09
Labels: structure-adjust (Adjust project structure or internal data structure)
4 participants