On AWS - after Diff, Insert columns are all null #64

Closed
leewalter78 opened this issue Sep 27, 2021 · 10 comments

@leewalter78 commented Sep 27, 2021

I found this project when trying to compare dataframes using PySpark, and it appears to work great. However, I am seeing an issue when running it as part of an AWS Glue job with this jar: spark-extension_2.11-1.3.3-2.4.jar. Locally it works fine, and I cannot reproduce the issue.

If I create the dataframes using spark.createDataFrame() and do a diff, I get the results I expect. When I read the dataframes from a Glue context, the columns of the Insert rows are all null after the diff. I tried with and without unique identifiers, and I also reduced the number of rows and columns. I thought sparse mode might be causing this, so I tried it with both true and false.
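
For illustration, a minimal sketch of the kind of Glue-context read path I mean (the database and table names are placeholders, not the actual job's):

from gresearch.spark.diff import *

# Placeholder catalog names; the real job reads its own tables.
left = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="left_table").toDF()
right = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="right_table").toDF()

left.diff(right, "id").show()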

I created a simple case where the left dataframe has 1 row and the right dataframe has 2. One row is the same on both sides; the diff identifies that, and it identifies the Insert as well, but the Insert loses all its values.

The Deletes and No Changes rows are as expected, and currently I only see the issue with Inserts. I have not tested Changes yet; I have been stuck on Inserts.

Any insights on what might be causing this?

Sample output:
+----+---------+-----+----+----+----+--------------+----+----+----+-----+
|diff|     col1| col2|col3|col4|col5|          col6|col7|col8|col9|col10|
+----+---------+-----+----+----+----+--------------+----+----+----+-----+
|   N|123456789|ABCDE|Test| 670|   4|Description...|77.0|  FL|null| null|
|   I|     null| null|null|null|null|          null|null|null|null| null|
+----+---------+-----+----+----+----+--------------+----+----+----+-----+

Note: I also tried 1.3.0 (spark-extension_2.11-1.3.0-2.4) and the same thing happens.

@leewalter78 (Author)

I know that not being able to reproduce this locally makes it nearly impossible to analyze, but I was wondering if anyone has seen similar behavior?

Since the C, D, and N types work, there is a workaround: do a left anti join on the id columns between the right dataframe and the diff results where diff != 'I'. Basically, remove the C, D, and N rows from the right dataframe and you are left with the inserts.
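
A minimal sketch of that workaround, assuming the diff output is in diff_result, the key column is "id", and the right dataframe is right:

# ids already classified as C, D, or N by the diff
matched_ids = diff_result.filter("diff != 'I'").select("id")
# rows of the right dataframe not matched above must be the inserts
inserts = right.join(matched_ids, on="id", how="left_anti")
inserts.show()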

@kasunnirmala

@leewalter78 Were you able to fix this?

@EnricoMi (Contributor)

@leewalter78 do you see the same issue? Can you provide steps for me to reproduce this? "On AWS" does not really help.

EnricoMi reopened this Jul 19, 2022
@leewalter78 (Author)

No, I was not able to resolve it. I ended up using a custom solution with set differences and then joining on the keys.
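
Roughly along these lines (a sketch from memory; "id" and the dataframe names are assumptions, not the original code):

left_only = left.subtract(right)     # rows only in left: deletes or changes
right_only = right.subtract(left)    # rows only in right: inserts or changes
changed_ids = left_only.select("id").intersect(right_only.select("id"))
inserts = right_only.join(changed_ids, on="id", how="left_anti")
deletes = left_only.join(changed_ids, on="id", how="left_anti")
changes = right_only.join(changed_ids, on="id", how="inner")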

At the time, I uploaded the jar (spark-extension_2.11-1.3.3-2.4.jar) to an S3 bucket and referenced it from the Glue job under the Python library path.

Then, I believe this was the sample snippet I had tried. (I forget all the details, but there were cases where this did not work; per my post, it was the Inserts where all the values would be null.)

from gresearch.spark.diff import *

left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"])
right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"])
left.diff(right, "id").show()

@leewalter78 (Author)

I believe the sample code above worked, but when I used my actual dataframes, the values for the Inserts were all null.

@EnricoMi (Contributor)

I just ran that sample snippet on Glue 3.0 Python with spark-extension_2.12-2.1.0-3.1.jar and it worked:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
job.commit()

left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"])
right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"])

from gresearch.spark.diff import *

left.diff(right, "id").show()
+----+---+----------+-----------+
|diff| id|left_value|right_value|
+----+---+----------+-----------+
|   D|  3|     three|       null|
|   N|  1|       one|        one|
|   C|  2|       two|        Two|
|   I|  4|      null|       four|
+----+---+----------+-----------+

@kasunnirmala

> I just ran that sample snippet on Glue 3.0 Python with spark-extension_2.12-2.1.0-3.1.jar and it worked: […]

Can you please specify the exact Job details section, both Basic and Advanced Properties?

@EnricoMi (Contributor)

Here are the job details:

  • IAM Role: a role with CloudWatch (logging) and S3 access
  • Type: Spark
  • Glue version: Glue 3.0 - Supports Spark 3.1, Scala 2, Python 3
  • Language: Python 3
  • Worker Type: G 1X
  • Number of workers: 2
  • Job metrics, Continuous logging, Spark UI and "Use Glue data catalog as the Hive metastore" enabled by default
  • Python library path: s3://XXXXXXXXX/spark-extension_2.12-2.1.0-3.1.jar

@bobhaffner

For what it's worth, I was able to get the same result as Enrico.

@EnricoMi (Contributor) commented Sep 6, 2022

Looks like spark-extension works on AWS Glue and the reported issue is not reproducible. Closing for now.

EnricoMi closed this as completed Sep 6, 2022