pySpark-flatten-dataframe

A PySpark function to flatten any complex nested DataFrame structure loaded from JSON/CSV/SQL/Parquet.

For example, for nested JSON:

Flattens all nested items: { "human":{ "name":{ "first_name":"Jay Lohokare" } } }

is converted to a DataFrame with the column 'human-name-first_name'. The connector '-' can be changed by setting the connector variable.
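
The naming rule can be illustrated without Spark. The following is a minimal pure-Python sketch that works on a plain dict. It is not the repository's function: `flatten_keys` and its `connector` parameter are hypothetical names chosen for this example.

```python
def flatten_keys(record, connector="-", prefix=""):
    """Return a flat dict whose keys join the nested path with `connector`."""
    flat = {}
    for key, value in record.items():
        # Build the full path for this key, e.g. "human-name".
        name = f"{prefix}{connector}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path so far.
            flat.update(flatten_keys(value, connector, name))
        else:
            flat[name] = value
    return flat


record = {"human": {"name": {"first_name": "Jay Lohokare"}}}
print(flatten_keys(record))
print(flatten_keys(record, connector="_"))
```

With the default connector the only key is 'human-name-first_name'. Passing a different connector changes the separator in every generated name.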

Explodes arrays: { "array":["one", "two", "three"] } is converted to a DataFrame with the column 'array' and 3 rows.
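
To make the row count concrete, here is a small pure-Python sketch of what exploding an array column means. It operates on a plain dict rather than a DataFrame, and `explode_rows` is a hypothetical helper written for this example, not part of the repository.

```python
def explode_rows(record, column):
    """Return one row per element of record[column], copying the other fields."""
    return [{**record, column: item} for item in record[column]]


rows = explode_rows({"array": ["one", "two", "three"]}, "array")
for row in rows:
    print(row)
```

Each element of the array becomes its own row and any other fields are repeated on every row. An empty array produces no rows in this sketch, which matches how PySpark's `explode` behaves (`explode_outer` keeps such rows with a null instead).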

The function can handle any level of nesting.

The function cannot handle arrays within arrays. This keeps the code dynamic and generic. To handle arrays within arrays, modify the isinstance check in the for loop of the flattenSchema function.
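
As a rough idea of what supporting arrays within arrays would involve, the sketch below keeps exploding a column until no value in it is a list. It is a pure-Python illustration on plain dicts, and `explode_nested` is a hypothetical name. In PySpark the equivalent change would presumably be to detect an array column whose element type is itself an array and explode it again, but that is an assumption about the modification, not the repository's code.

```python
def explode_nested(record, column):
    """Explode record[column] repeatedly until it no longer holds lists."""
    rows = [record]
    # Each pass removes one level of array nesting.
    while any(isinstance(row[column], list) for row in rows):
        next_rows = []
        for row in rows:
            if isinstance(row[column], list):
                next_rows.extend({**row, column: item} for item in row[column])
            else:
                next_rows.append(row)
        rows = next_rows
    return rows


nested = {"array": [["a", "b"], ["c"]]}
for row in explode_nested(nested, "array"):
    print(row)
```

The loop runs once per nesting level, so a two-level array is exploded in two passes.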
