Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accessing other rows than the current row #200

Open
tfmorris opened this issue Oct 15, 2012 · 14 comments
Open

Accessing other rows than the current row #200

tfmorris opened this issue Oct 15, 2012 · 14 comments
Labels
cross/join About the cross function, which makes it possible to retrieve data from other projects by joining expression language Support for scripting languages (GREL, Python…) imported from old code repo Issue imported from Google Code in 2010 logic Changes to the data model, to the way operations, expressions work Priority: Low Indicates less critical issues that can be dealt with at a later stage Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements.

Comments

@tfmorris
Copy link
Member

tfmorris commented Oct 15, 2012

Original author: sanderse...@gmail.com (November 13, 2010 22:18:28)

I've searched through the issues and all the documentation I could find and come up blank, sorry if this has been answered elsewhere.

Is there any way of accessing rows other than the current row? Something like rows[rowIndex-1].cells["Column1"].value would be priceless. Unfortunately it doesn't work.

I'm looking for ways of computing moving averages etc and I need this feature. Thanks, and keep up the good work, this software looks great!

Original issue: http://code.google.com/p/google-refine/issues/detail?id=200

@tfmorris
Copy link
Member Author

From dfhu...@gmail.com on November 14, 2010 03:42:27:
You're right--it's not possible right now. I'd say what you're asking for falls into a new feature area that includes grouping rows and computing sum, average, min, max within each group. That's necessary for visualization, too.

@tfmorris
Copy link
Member Author

From jluci...@gmail.com on January 12, 2011 19:16:30:
I have actually had success referring to another column's cell by using the 'cell.column_name' command. However if you want to perform operation on the cells in another column you are out of luck. I was trying to create a new column by summing two existing column with no avail e.g. sum([value,cells.column_name]).

@tfmorris
Copy link
Member Author

From tfmorris on January 12, 2011 19:22:42:
jlucido2 - That doesn't sound related to enhancement that was requested. Please ask general questions on the mailing list or create a new issue if you've got a problem or would like something enhanced.

It's likely that you just need cells.column_name.value for your example.

@tfmorris
Copy link
Member Author

From TeamP...@gmail.com on June 16, 2011 11:43:17:
This is a very important feature for fixing common types of messy data. For example, I have a 10000 record file where the dates have been input in different orders yy/mm/dd, dd/mm/yy, yyyy/mm/dd and dd/mm/yyyy (and sometimes noise), because different people entered the data. I only have natural row ordering to help me out. In order to tell if 03/12/01 is March 12 or December 1st, I would look at the previous row to see what the 4 digit year was, and make a decision on that.

This could be implemented as a modification of "Fill Down" that takes a GREL expression.

@tfmorris
Copy link
Member Author

From TeamP...@gmail.com on June 16, 2011 12:41:56:
Typing out Comment 5 gave me an idea: If I know the month is always in the middle of a date format, I can use that to my advantage.
By matching out the 2 digit (yy/mm/dd and dd/mm/yy) versions as v[0],v[1], v[2] ==> whenever v[0] == v[2] we know the year, and can convert it to a 4 digit year!

From this I was able to establish a sufficient number of known good 4 digit years. Then I created a new column called "LastGoodDate", based on the existence of 4 digit years, and used the fill down option on that new column.

Now when I use a facet to display the remaining 2 digit versions, I have a reference to the last known good date that preceded it. I can exploit the natural ordering to better predict which part is the day versus the year.

There are still more complex types of data where referencing a previous row is necessary to clean it up, I think I just lucked out this time.

@thadguidry thadguidry added Priority: Low Indicates less critical issues that can be dealt with at a later stage and removed Priority: Medium Represents important issues that need to be addressed but are not urgent labels Apr 18, 2018
@wetneb
Copy link
Sponsor Member

wetneb commented Dec 20, 2019

For use cases like this, where it is not clear how to achieve the desired result via the records mode, I would be in favour of letting users run code (Python, R…) on the entire table, modifying it in arbitrary ways. It would avoid the need to export the project, run it through a script or some other tool, and import it back again.

We would need to expose the entire table in a suitable object in the guest language: pandas dataframe in Python, table in R…

@wetneb wetneb added the expression language Support for scripting languages (GREL, Python…) label Dec 20, 2019
@thadguidry
Copy link
Member

@wetneb or Apache Drill ;-)

@wetneb
Copy link
Sponsor Member

wetneb commented Dec 20, 2019

Drill is a query engine so it's less clear to me what its integration as an operation would look like - would you use it to replace the table by the query results? Or would you try to add support for INSERT statements which modify the table? It'd be good to have concrete workflow examples to make sure we are on the same page.

(But I agree their integration of structured data in a columnar environment is clearly related)

@thadguidry
Copy link
Member

thadguidry commented Dec 20, 2019

Supporting Selections (similar to how we allow users to select subdata various ways with GREL functions now, via parseJson() and parseHtml(), etc.)

I initially thought of "wouldn't it be cool if" somehow exposing a new select() and letting Apache Drill handling the heavylifting? It would be useful to know how others view this issue however like @ettorerizza

select() could be exposed as a Data Model transform function via an inverse insert().
And it doesn't have to be just at a GREL level...Our Expression editor UI could even be enhanced for support of this in a better fashion. Or a "Custom Importer" could use Apache Drill and expose a nice input box where SQL and use of SELECT statements that hand off to Apache Drill (given time/money).

I wonder what @tcbuzor thinks about this as well.
But I do need help breaking down the parts/issues as you see it also, @wetneb where the use case is around "selecting" data....either at Import, or Table/Column/Cell/Record, etc.
I guess we can probably say for now that Apache Drill is useful for "selecting" against a known grid and work on a "Custom Importer" feature later in another issue?

@antoine2711
Copy link
Member

antoine2711 commented Mar 31, 2020

Is there any way of accessing rows other than the current row? Something like rows[rowIndex-1].cells["Column1"].value would be priceless. Unfortunately it doesn't work.

Hi @tfmorris, while reading this, it came to me that now that the cross() function has been enhanced, I think you could get rows[rowIndex-1].cells["Column1"].value this way:

  1. Have a numerical key column where value = row.index + 1;
  2. where you want the value from another row, just put: cross(iID, "PRJ-NAME", "COL-NAME")[0].cells.value. Note that you can even get multiple values from this row in the same call: with(cross("iID", "PRJ-NAME", "COL-NAME")[0].cells), vR, vR["ID"] + vR["Name"] + vR["URL"]) where iID is of course the numerical value, and PRJ-NAME and COL-NAME the name of the joined table and it's key column.

Is there an angle I'm not seeing?

Regards, Antoine

@tfmorris
Copy link
Member Author

@antoine2711 Anything that you see which was "authored" by me on October 14, 2012 is usually just an artifact of the import that I did of all old issues from Google Code on that date.

You can see the original author (with obfuscated email) and date here:

Original author: sanderse...@gmail.com (November 13, 2010 22:18:28)

Your workaround sounds plausible on the face of it, although I haven't tested it. Syntactically it's quite a bit more unwieldy than something like rows[rowIndex-1].

@antoine2711
Copy link
Member

antoine2711 commented Mar 31, 2020

@antoine2711 Anything that you see which was "authored" by me on October 14, 2012 is usually just an artifact of the import that I did of all old issues from Google Code on that date.

Shame, @tfmorris, it would have been a very productive day… 3:-)

Yes, rows[rowIndex-1] is more elegant than cross(rowIndex-1, "PRJ-NAME", "COL-NAME")[0], still, the latter is much more flexible as it has no project boundary, and the point I was making is that it now exists in v3.4 and it would answer the need.

Regards, Antoine

@antoine2711 antoine2711 added the cross/join About the cross function, which makes it possible to retrieve data from other projects by joining label Apr 3, 2020
@antoine2711
Copy link
Member

antoine2711 commented Apr 3, 2020

Syntactically it's quite a bit more unwieldy than something like rows[rowIndex-1].

Yes @tfmorris, and, this made me think, and I created this issue for improvement: « Enhance cross() with empty "project name" calling parameter to represent the current project and empty column name to represent the index #2504 ».

Now, if that gets done, the comparaison would be:

rows[rowIndex-1] vs cross(rowIndex-1, "", "")[0] or even this cross(rowIndex-1)[0]

The difference would be cosmetic.

Regards, Antoine

@antoine2711
Copy link
Member

@tfmorris: With PR #2518, what is asked here will be doable like this:

cross(rowIndex-1)[0]

In essence, it would do EXACTLY this: rows[rowIndex-1], if it existed.

Regards, Antoine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cross/join About the cross function, which makes it possible to retrieve data from other projects by joining expression language Support for scripting languages (GREL, Python…) imported from old code repo Issue imported from Google Code in 2010 logic Changes to the data model, to the way operations, expressions work Priority: Low Indicates less critical issues that can be dealt with at a later stage Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements.
Projects
None yet
Development

No branches or pull requests

4 participants