[BUG] Plugin does not support CSV Header Checking #2862

Open
Tracked by #2063
revans2 opened this issue Jul 2, 2021 · 0 comments
Labels
bug Something isn't working

revans2 commented Jul 2, 2021

Describe the bug
When the schema supplied for a CSV read does not match the header row in the file, Spark logs a warning from CSVHeaderChecker.

scala> val schema = StructType(Seq(StructField("INPUT", StringType), StructField("INPUT1", StringType), StructField("MORE", StringType)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(INPUT,StringType,true), StructField(INPUT1,StringType,true), StructField(MORE,StringType,true))
scala> val df = spark.read.option("header", true).schema(schema).csv("duplicate.csv")
df: org.apache.spark.sql.DataFrame = [INPUT: string, INPUT1: string ... 1 more field]
scala> df.show
21/07/02 13:02:08 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: INPUT, INPUT, MORE
 Schema: INPUT, INPUT1, MORE
Expected: INPUT1 but found: INPUT
CSV file: file:///home/roberte/src/rapids-plugin-4-spark/duplicate.csv
+-----+------+----+
|INPUT|INPUT1|MORE|
+-----+------+----+
|    1|     2|   3|
|    1|     2|   3|
|    1|     2|   3|
+-----+------+----+

But when the plugin is enabled, there is no warning from CSVHeaderChecker.
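
For anyone picking this up, the check behind the warning can be thought of as a positional comparison of the header names against the schema field names. The sketch below is a simplified illustration in plain Scala, not Spark's actual CSVHeaderChecker code. `HeaderCheckSketch` and `firstMismatch` are hypothetical names, and defaulting to a case-insensitive comparison is an assumption.

```scala
// Simplified illustration only; NOT Spark's CSVHeaderChecker implementation.
// HeaderCheckSketch and firstMismatch are hypothetical names.
object HeaderCheckSketch {
  // Compares header and schema names position by position and returns a
  // description of the first mismatch, or None if they line up.
  def firstMismatch(
      header: Seq[String],
      schema: Seq[String],
      caseSensitive: Boolean = false): Option[String] = {
    // Assumption: names are compared case-insensitively unless asked otherwise.
    def norm(name: String): String = if (caseSensitive) name else name.toLowerCase

    if (header.length != schema.length) {
      Some(s"Header length: ${header.length}, schema size: ${schema.length}")
    } else {
      header.zip(schema).collectFirst {
        case (found, expected) if norm(found) != norm(expected) =>
          s"Expected: $expected but found: $found"
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Header and schema taken from the example in this issue.
    val header = Seq("INPUT", "INPUT", "MORE")
    val schema = Seq("INPUT", "INPUT1", "MORE")
    firstMismatch(header, schema).foreach { msg =>
      println(s"WARN CSV header does not conform to the schema.\n$msg")
    }
  }
}
```

With the header and schema from this issue, the sketch reports `Expected: INPUT1 but found: INPUT`, which is the mismatch the CPU run logs. A GPU read path would need an equivalent comparison against the header row it actually parsed.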

Steps/Code to reproduce bug
Create a CSV file whose header differs from the schema passed in, then read it. Prefer local mode: the warning is logged by the process that reads the file, so on a cluster it does not easily get back to the end user.
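
A minimal sketch for creating the input file follows. The file contents are inferred from the `df.show` output and the logged header in this issue, not copied from the original `duplicate.csv`, and `MakeDuplicateCsv` is a hypothetical helper name.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper that writes a CSV with a duplicated header name.
object MakeDuplicateCsv {
  // Assumption: contents inferred from the output shown in this issue
  // (header INPUT, INPUT, MORE and three identical rows of 1, 2, 3).
  val contents: String =
    """INPUT,INPUT,MORE
      |1,2,3
      |1,2,3
      |1,2,3
      |""".stripMargin

  // Writes the CSV to the given path and returns that path.
  def write(path: Path): Path =
    Files.write(path, contents.getBytes(StandardCharsets.UTF_8))

  def main(args: Array[String]): Unit = {
    val out = write(Paths.get("duplicate.csv"))
    println(s"Wrote ${out.toAbsolutePath}")
  }
}
```

After the file exists, run the `spark.read.option("header", true).schema(schema).csv("duplicate.csv")` read from the description in a local-mode shell, once without and once with the plugin enabled, and compare the logs.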

Expected behavior
The plugin also logs a warning when the CSV header does not match the schema.

Additional context
This was found because the cudf team asked us about requirements for duplicate header names. Apparently Pandas and Spark create different unique header names when there are duplicates. They were in the process of making cudf do the right thing for the Pandas case and wanted to be sure it would not cause issues with Spark. When we implement this feature, we need to test it with duplicate column names to be sure that cudf does the right thing for the warnings.
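
To make the difference concrete, here is a rough sketch of the two renaming schemes as they are commonly described. Both functions are assumptions written for illustration, with hypothetical names, and neither is taken from the pandas, Spark, or cudf sources. They ignore details such as case-insensitive matching, empty names, and collisions with names that already exist, so the exact rules should be checked against each project before relying on them.

```scala
// Illustration only; the rules below are assumptions, not code from
// pandas, Spark, or cudf. All names here are hypothetical.
object DuplicateHeaderSketch {
  // Assumed pandas-style rule: the first occurrence keeps its name and
  // later duplicates get ".1", ".2", ... appended.
  def pandasStyle(names: Seq[String]): Seq[String] = {
    val seen = scala.collection.mutable.Map.empty[String, Int]
    names.toList.map { name =>
      val count = seen.getOrElse(name, 0)
      seen(name) = count + 1
      if (count == 0) name else s"$name.$count"
    }
  }

  // Assumed Spark-style rule (when Spark derives names from the header):
  // every occurrence of a duplicated name gets its column index appended.
  def sparkStyle(names: Seq[String]): Seq[String] = {
    val duplicated = names.diff(names.distinct).toSet
    names.zipWithIndex.map { case (name, index) =>
      if (duplicated(name)) s"$name$index" else name
    }
  }

  def main(args: Array[String]): Unit = {
    val header = Seq("INPUT", "INPUT", "MORE")
    println(pandasStyle(header).mkString(", "))
    println(sparkStyle(header).mkString(", "))
  }
}
```

If cudf returns pandas-style names for the header row, a header check built on top of it would compare the wrong strings against the schema, which is why a duplicate-column test case matters here.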

revans2 added the "bug" and "? - Needs Triage" labels Jul 2, 2021
Salonijain27 removed the "? - Needs Triage" label Jul 6, 2021
revans2 mentioned this issue Oct 27, 2022
38 tasks