pySpark-flatten-dataframe

A PySpark function that flattens any complex nested DataFrame structure loaded from JSON, CSV, SQL, or Parquet.

For example, given nested JSON, the function:

  • Flattens all nested items: { "human": { "name": { "first_name": "Jay Lohokare" } } } is converted to a DataFrame with a column named 'human-name-first_name'. The connector '-' can be changed by setting the connector variable.

  • Explodes arrays: { "array": ["one", "two", "three"] } is converted to a DataFrame with one column, 'array', and 3 rows (one per array element).
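The two behaviours above can be illustrated without Spark. The following is a minimal plain-Python sketch that works on a single dict, not the repository's PySpark function; the names flatten_record and CONNECTOR are hypothetical, and the default '-' is assumed from the column name in the example.

```python
from typing import Any, Dict, List

CONNECTOR = "-"  # assumed default, matching 'human-name-first_name' above


def flatten_record(record: Dict[str, Any], connector: str = CONNECTOR) -> List[Dict[str, Any]]:
    """Return flat rows: nested dicts become joined keys, lists become extra rows."""
    rows: List[Dict[str, Any]] = [{}]
    for key, value in record.items():
        if isinstance(value, dict):
            # Flatten the child first, then prefix its keys with the parent key.
            child_rows = [
                {f"{key}{connector}{k}": v for k, v in child.items()}
                for child in flatten_record(value, connector)
            ]
        elif isinstance(value, list):
            # One output row per array element (arrays of scalars only;
            # an empty list yields no rows in this sketch).
            child_rows = [{key: item} for item in value]
        else:
            child_rows = [{key: value}]
        # Combine every row built so far with every row for this field.
        rows = [{**row, **child} for row in rows for child in child_rows]
    return rows


nested = {"human": {"name": {"first_name": "Jay Lohokare"}}}
print(flatten_record(nested))
print(flatten_record({"array": ["one", "two", "three"]}))
```

The first call yields one row with the key 'human-name-first_name'; the second yields three rows, each with the key 'array'. Passing connector="_" would produce 'human_name_first_name' instead.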

The function can handle any level of nesting.

The function cannot handle arrays within arrays; this restriction keeps the code dynamic and generic. To handle arrays within arrays, modify the "if isinstance" check in the for loop of the flattenSchema function.