Hashing Flattened Fingerprints

In [1]:
# =============================================
# STEP 1: Set up Spark Session
# =============================================
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col, concat_ws

spark = SparkSession.builder \
    .appName("HashFlattenedFingerprints") \
    .getOrCreate()

# =============================================
# STEP 2: Read Flattened Fingerprints from HDFS
# =============================================
input_path = "hdfs://localhost:9000/lakehouse/silver/fingerprints/"
# Note: by default spark.read.json expects one JSON document per line (JSON Lines).
df = spark.read.json(input_path)

25/03/17 19:18:56 WARN Utils: Your hostname, osbdet resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface enp0s3)
25/03/17 19:18:56 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/17 19:19:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In [2]:
df.limit(1).show(truncate=False)

AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
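The exception above means the inferred schema contains only `_corrupt_record`, i.e. Spark could not parse any line of the input as a JSON document. One common cause (an assumption here, since the notebook does not show the files) is pretty-printed JSON that spans several lines, while the default reader treats every line as a separate document. The sketch below imitates that line-by-line behaviour in plain Python on a hypothetical record; the sample data and the helper `parse_line_by_line` are illustrative and not part of the notebook.

```python
import json

# Hypothetical pretty-printed record (not taken from the notebook's data).
pretty = '{\n  "x": 1,\n  "y": 2\n}'
# The same record written on a single line (JSON Lines style).
single_line = '{"x": 1, "y": 2}'


def parse_line_by_line(text):
    """Parse each non-empty line as its own JSON document.

    Returns (parsed_records, corrupt_lines), loosely imitating how a
    line-delimited reader separates good records from corrupt ones.
    """
    parsed, corrupt = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            corrupt.append(line)
    return parsed, corrupt


pretty_parsed, pretty_corrupt = parse_line_by_line(pretty)
single_parsed, single_corrupt = parse_line_by_line(single_line)

# Every line of the pretty-printed record fails on its own;
# the single-line record parses cleanly.
print(len(pretty_parsed), len(pretty_corrupt))
print(len(single_parsed), len(single_corrupt))
```

If the fingerprint files turn out to be multi-line JSON, reading them with `spark.read.option("multiLine", True).json(input_path)` and checking `df.printSchema()` before calling `show()` is one way to confirm that real columns were inferred.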

In [3]:
# =============================================
# STEP 3: Combine All Values Into a String Column
# =============================================
# Assuming each fingerprint record is a flattened list of [x, y] coordinate pairs,
# concatenate all column values of a row into a single string, then hash that string.
# Note: concat_ws skips NULL values, so rows that differ only in the position
# of a NULL produce the same string.

# Create a combined string representation of the entire fingerprint
df = df.withColumn("fingerprint_str", concat_ws(",", *df.columns))

# =============================================
# STEP 4: Hash the Fingerprint String
# =============================================
df_hashed = df.withColumn("fingerprint_hash", sha2(col("fingerprint_str"), 256))

# =============================================
# STEP 5: Optional - Drop Raw Fingerprint Data
# =============================================
# Keep only the hash column; the raw fingerprint columns and the
# intermediate fingerprint_str column are left out of the result.
df_hashed_clean = df_hashed.select("fingerprint_hash")

# =============================================
# STEP 6: Save Hashed Fingerprints to HDFS
# =============================================
output_path = "hdfs://localhost:9000/lakehouse/gold/fingerprint_hashes/"
df_hashed_clean.write.mode("overwrite").json(output_path)

# =============================================
# STEP 7: Preview the Result
# =============================================
df_hashed_clean.show(truncate=False)

25/03/17 19:17:45 ERROR FileFormatWriter: Aborting job f8386739-1f4e-49a3-89fe-0a23c09189f6.
org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
	at org.apache.spark.sql.errors.QueryCompilationErrors$.queryFromRawFilesIncludeCorruptRecordColumnError(QueryCompilationErrors.scala:2897)
	at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.buildReader(JsonFileFormat.scala:113)
	at org.apache.spark.sql.execution.datasources.FileFormat.buildReaderW

AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).csv(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().
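
Once the read succeeds, the string built in STEP 3 and hashed in STEP 4 can be approximated in plain Python to check what the hash actually covers. The sketch below is a minimal illustration, not the notebook's implementation: the helper `fingerprint_hash`, the sample rows, and the `"NULL"` placeholder are all hypothetical. It joins the values with commas, skips `None` the way `concat_ws` skips NULLs, and takes the SHA-256 hex digest. It also shows that skipping missing values lets two different rows share one hash, and that substituting a placeholder avoids this for the example.

```python
import hashlib


def fingerprint_hash(values, null_token=None):
    """Return the SHA-256 hex digest of the comma-joined values.

    With null_token=None, None values are skipped (as concat_ws does
    with NULLs). With a token, each None is replaced by that token.
    """
    if null_token is None:
        parts = [str(v) for v in values if v is not None]
    else:
        parts = [null_token if v is None else str(v) for v in values]
    joined = ",".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Two hypothetical rows that differ only in which position is missing.
row_a = [1, None, 2]
row_b = [1, 2, None]

# Skipping None collapses both rows to "1,2", so the hashes collide.
same_when_skipped = fingerprint_hash(row_a) == fingerprint_hash(row_b)
# With a placeholder the strings are "1,NULL,2" and "1,2,NULL".
same_with_token = fingerprint_hash(row_a, "NULL") == fingerprint_hash(row_b, "NULL")

print(same_when_skipped, same_with_token)
```

In PySpark, a comparable guard would be to wrap each column as `coalesce(col(c).cast("string"), lit("NULL"))` before passing it to `concat_ws`. The placeholder should be a value that cannot occur in the real data, otherwise collisions remain possible.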