This repository has been archived by the owner. It is now read-only.

Rationalize item processing pipelines #22

Merged
merged 5 commits into from Jun 22, 2017

Conversation

@c-w
Member

commented Jun 21, 2017

We currently have a lot of copy/paste code in the pipelines package. This hurts maintainability: every new feature extractor requires changes in multiple places, and we can't guarantee uniform processing across all pipelines.

This pull request removes the copy/paste code and rationalizes the processing pipelines by introducing the TextPipeline class which is used by the Twitter, Facebook, Radio and Tadaweb pipelines.
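The PR diff itself isn't included in this thread, so here is a rough hypothetical sketch of the idea: every source maps its items into one common shape and then runs the same analysis steps. Only the name TextPipeline comes from the PR; the fields and placeholder analyzers below are invented for illustration.

```scala
// Hypothetical sketch of a shared text pipeline: all sources funnel their
// text into a common AnalyzedItem and share one set of analysis steps.
case class AnalyzedItem(body: String, keywords: Seq[String] = Nil, sentiment: Option[Double] = None)

object TextPipeline {
  // Placeholder analyzers; the real project would call its feature extractors here.
  private def extractKeywords(text: String): Seq[String] =
    text.split("\\s+").filter(_.length > 4).toSeq

  private def analyzeSentiment(text: String): Double =
    if (text.contains("good")) 1.0 else 0.0

  // One code path shared by the Twitter, Facebook, Radio and Tadaweb pipelines.
  def apply(body: String): AnalyzedItem =
    AnalyzedItem(body, keywords = extractKeywords(body), sentiment = Some(analyzeSentiment(body)))
}
```

With this shape, each source-specific pipeline only needs to produce a body string; everything after that is uniform.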

@c-w c-w requested review from kevinhartman and erikschlegel Jun 21, 2017

@c-w c-w added the in progress label Jun 21, 2017

@kevinhartman

Contributor

commented Jun 21, 2017

Happy to see the duplication reduced. The issue I have with this approach, however, is that for the broadcasting and shared-state work I'm doing now, it'd be ideal to inject a transform at the very start of every pipeline, and ideally to have the logic within it be the same everywhere.

I've adapted the concepts from this PR a bit into a proof-of-concept configuration that uses inheritance. This will allow me to inject the code that I need for state updates into the transform call in the base class.

I've written the code for the Tadaweb pipeline to illustrate how this would come together. Note that the TransformContext is not used directly by TadawebPipeline on purpose. This is because it will be removed when the state update stuff comes in.

As an aside, I think your initial idea of keeping the original item around in the analyzed item is the right approach. It seems more future-proof to me to defer the actual analysis of a source-specific struct to its own pipeline rather than trying to stuff everything into a generic struct. The approach below abstracts the pipeline steps and their order away from the individual pipelines but lets each pipeline implement the steps however it likes in the context of its own types.
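The linked proof of concept isn't reproduced in this thread; as a minimal hypothetical sketch of the base-class idea (all member names below are invented), the base class could fix the step order and expose an injection point while each concrete pipeline keeps its own item type:

```scala
// Hypothetical sketch of the inheritance-based design: the base class owns
// the step order and offers a hook where shared logic (e.g. state updates)
// could be injected later, without the pipelines knowing about it.
abstract class Pipeline[T] {
  protected def preTransform(item: T): T = item
  protected def analyze(item: T): T
  final def run(item: T): T = analyze(preTransform(item))
}

case class TadawebItem(title: String, sentiment: Option[Double])

// Each source pipeline works with its own type rather than a generic struct.
class TadawebPipeline extends Pipeline[TadawebItem] {
  // Tadaweb already ships a sentiment score, so analysis can keep it as-is.
  override protected def analyze(item: TadawebItem): TadawebItem = item
}
```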

What do you think?
CatalystCode:26202d7...kevinhartman:4ee2cf7

@c-w

Member Author

commented Jun 21, 2017

As per my comment on your last pull request, we don't need originalItem anymore now that we have a better idea of the Cassandra schema. Getting rid of originalItem has a massive benefit: we can treat all text sources the same no matter where the text comes from. The downstream processing really shouldn't have to care whether it's a Tweet or a Facebook post just to extract keywords or sentiment. Using a single structure also reduces code complexity. Note also that at the end of processing all data has to fit into a single structure anyway so that it can be persisted to Cassandra.

If we need to inject an additional transform, that's a separate concern which can be tackled in multiple ways. For example, if the transform truly needs to be applied to every stream, just do it in the StreamProvider's buildStream: buildStream should return a stream that's directly usable downstream, so that transforms don't have to care about any book-keeping. Alternatively, setup code can live in ProjectFortis, giving us sections there for setup, then processing, then sinks.
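As a hypothetical sketch of the buildStream suggestion (the stream is modelled as a plain Seq here rather than a Spark DStream, and all names other than StreamProvider and buildStream are invented):

```scala
// Hypothetical sketch: buildStream applies any globally required transform
// before handing the stream downstream, so individual pipelines never see it.
trait StreamProvider[T] {
  protected def rawStream: Seq[T]
  // Default is a no-op; shared logic would be wired in here once, centrally.
  protected def globalTransform(item: T): T = item
  final def buildStream: Seq[T] = rawStream.map(globalTransform)
}

// Example provider demonstrating that callers of buildStream get the
// already-transformed stream without any extra book-keeping.
class UppercaseProvider(items: Seq[String]) extends StreamProvider[String] {
  protected def rawStream: Seq[String] = items
  override protected def globalTransform(item: String): String = item.toUpperCase
}
```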

@c-w c-w force-pushed the refactor-pipes branch from 686db25 to 73b71c0 Jun 21, 2017

@kevinhartman

Contributor

commented Jun 21, 2017

It's not a question of knowing the DB schema, but rather of knowing the formatting and analysis requirements of the incoming items. If an item has, for example, two separate body fields which both need keyword analysis, we should be able to handle that. If we need to override sentiment analysis because the source item already has a better one, we need to support that. If we want to skip keyword extraction for the title of an item, we should support that too. Imagine a case where we want to run sentiment analysis on the title but not the body.

We lose flexibility with this approach.
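To illustrate the kind of per-field flexibility being described (everything below is a hypothetical sketch, not code from the PR): each source could declare, per field, which analyses apply, instead of running the same steps on one generic struct.

```scala
// Hypothetical per-field analysis configuration: a source can opt each field
// in or out of keyword extraction and sentiment analysis independently.
case class FieldSpec(name: String, text: String, runKeywords: Boolean = true, runSentiment: Boolean = true)

object FieldAnalysisPlanner {
  // Returns which analysis steps would run for each field, e.g. sentiment
  // on the title but no keyword extraction.
  def plannedAnalyses(fields: Seq[FieldSpec]): Map[String, Seq[String]] =
    fields.map { f =>
      val steps =
        (if (f.runKeywords) Seq("keywords") else Nil) ++
        (if (f.runSentiment) Seq("sentiment") else Nil)
      f.name -> steps
    }.toMap
}
```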

@c-w c-w force-pushed the refactor-pipes branch from 3bcae0d to 057cd9e Jun 21, 2017

@c-w

Member Author

commented Jun 21, 2017

We currently don't have these requirements and don't plan to add any other data sources. As such, I'd be hesitant to add complexity for potential future needs instead of focusing on what we need now and refactoring later if necessary. Going from this kind of function-based code to object-based code is an easy refactor if it ever becomes necessary: just turn TextPipeline into an object with methods for each step and instantiate a custom subclass in the caller to do the processing. Note that if you really need to do something very custom, you can simply not use the TextPipeline (see InstagramPipeline for an example).
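The refactor described above could look roughly like this (a hypothetical sketch; the real TextPipeline API isn't shown in this thread, and the step bodies are placeholders):

```scala
// Hypothetical sketch of the easy refactor: the function-based TextPipeline
// becomes a class with one method per step, and a caller with special needs
// instantiates a subclass that overrides only the step it cares about.
class TextPipeline {
  protected def extractKeywords(text: String): Seq[String] =
    text.split("\\s+").toSeq
  protected def analyzeSentiment(text: String): Double = 0.0
  def process(text: String): (Seq[String], Double) =
    (extractKeywords(text), analyzeSentiment(text))
}

// Example subclass overriding a single step; the step order stays shared.
class CustomSentimentPipeline extends TextPipeline {
  override protected def analyzeSentiment(text: String): Double = 1.0
}
```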

@kevinhartman

Contributor

commented Jun 22, 2017

We already have one of these requirements: we should not perform our own sentiment analysis if Tadaweb has already provided it. We're also adding Bing in the next week or so :p

This LGTM to get us going. I'll sign off after reviewing (tomorrow morning).

@c-w

Member Author

commented Jun 22, 2017

Seems odd to me that we'd trust a third party more than our own analysis, but sure, if that's the case, it's trivial to fix, see 14c9e3c.

As for Bing: it'll likely be the same processing as Facebook/Twitter.
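Commit 14c9e3c itself isn't shown here; the shape of such a fix, as a hypothetical sketch, is simply to prefer a source-provided score and only compute our own when none exists:

```scala
// Hypothetical sketch: keep a source-provided sentiment (e.g. from Tadaweb)
// when one exists, and only fall back to our own analyzer otherwise.
object SentimentResolver {
  // computeOwn is by-name, so our analyzer only runs when actually needed.
  def resolve(provided: Option[Double], computeOwn: => Double): Double =
    provided.getOrElse(computeOwn)
}
```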

@kevinhartman
Contributor

left a comment

LGTM

@c-w c-w merged commit 7191f6c into master Jun 22, 2017

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

@c-w c-w deleted the refactor-pipes branch Jun 22, 2017

@c-w c-w removed the in progress label Jun 22, 2017

kevinhartman added a commit to kevinhartman/project-fortis-spark that referenced this pull request Jun 23, 2017

Restructure pipeline code in preparation for broadcasts.
Changes:
* Draws on pull request CatalystCode#22 by reusing common analysis methods introduced there but with added support for override.
* Encapsulates the analysis section of the fortis pipeline within a single method, removing the need for TransformContext.
  Previously, every member of the TransformContext would be serialized to each task regardless of whether or not it was used.
  With the current multi-file analysis pipeline, it'd be necessary to copy each member of the context into local scope to avoid this.
* Adds a transform block which will execute on the driver for each batch. Config updates will occur here, along with broadcasts
  of those updates where applicable.
* Adds a ConfigurationManager interface for DB access of config settings. Currently includes stream config only, but will
  include methods for accessing transform tool settings as well.
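The ConfigurationManager interface mentioned in the commit message isn't shown in this thread; a hypothetical sketch of such an interface (all names besides ConfigurationManager and StreamConfig-style stream settings are invented) could be:

```scala
// Hypothetical sketch of a ConfigurationManager: a small interface hiding
// DB access to stream settings from the rest of the pipeline code.
case class StreamConfig(name: String, enabled: Boolean)

trait ConfigurationManager {
  def fetchStreamConfigs(): Seq[StreamConfig]
}

// In-memory stand-in for tests; a real implementation would read from the DB.
class InMemoryConfigurationManager(configs: Seq[StreamConfig]) extends ConfigurationManager {
  def fetchStreamConfigs(): Seq[StreamConfig] = configs
}
```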

kevinhartman added a commit to kevinhartman/project-fortis-spark that referenced this pull request Jun 25, 2017
