JayLohokare/pySpark-flatten-dataframe

pySpark-flatten-dataframe

A PySpark function that flattens any complex, nested DataFrame structure loaded from JSON/CSV/SQL/Parquet.

For example, for nested JSON:

  • Flattens all nested items: { "human": { "name": { "first_name": "Jay Lohokare" } } } is converted to a DataFrame with the column 'human-name-first_name'. The connector '-' can be changed through the connector variable.

  • Explodes arrays: { "array": ["one", "two", "three"] } is converted to a DataFrame with the column 'array' and 3 rows, one per array element.
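
To make the two rules above concrete, here is a minimal plain-Python sketch that applies them to a single parsed JSON record. It is not the repository's PySpark code and does not use Spark at all. The names flatten_record and _flatten_value are hypothetical, and the handling of empty arrays (one row holding None) is an assumption made for this sketch only.

```python
from itertools import product


def flatten_record(record, connector="-"):
    """Return a list of flat rows (dicts) for one nested, JSON-like record."""
    # Each top-level key yields a list of partial rows. The rows of the
    # record are the cross product of those lists, merged into single dicts.
    partials = [_flatten_value(key, value, connector) for key, value in record.items()]
    rows = []
    for combo in product(*partials):
        row = {}
        for part in combo:
            row.update(part)
        rows.append(row)
    return rows


def _flatten_value(name, value, connector):
    """Return the partial rows produced by one (column name, value) pair."""
    if isinstance(value, dict):
        # Nested item: prefix every child key with the parent name and recurse.
        prefixed = {f"{name}{connector}{key}": child for key, child in value.items()}
        return flatten_record(prefixed, connector)
    if isinstance(value, list):
        # Array: every element becomes its own row under the same column name.
        rows = []
        for item in value:
            if isinstance(item, list):
                # Arrays within arrays are rejected in this sketch.
                raise ValueError("arrays within arrays are not handled")
            rows.extend(_flatten_value(name, item, connector))
        # Assumption for this sketch: an empty array keeps one row with None.
        return rows or [{name: None}]
    # Scalar: a single row with a single column.
    return [{name: value}]


if __name__ == "__main__":
    print(flatten_record({"human": {"name": {"first_name": "Jay Lohokare"}}}))
    print(flatten_record({"array": ["one", "two", "three"]}))
```

Passing a different connector, for example flatten_record(record, connector="_"), changes only the separator used in the generated column names.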

The function can handle any level of nesting.

The function cannot handle arrays nested within arrays. This limitation keeps the code dynamic and generic. To handle arrays within arrays, modify the if isinstance check in the for loop of the flattenSchema function.
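
As a rough illustration of what handling arrays within arrays involves, the plain-Python sketch below keeps exploding until no array remains. It is only an analogue of the idea, not a patch for flattenSchema. The name explode_all is hypothetical, and how the same change would look inside the repository's isinstance check is an assumption left to the reader.

```python
def explode_all(value):
    """Return the leaf items of a possibly nested list, in order."""
    if not isinstance(value, list):
        # Not an array: nothing left to explode.
        return [value]
    leaves = []
    for item in value:
        # An item may itself be an array, so explode it again.
        leaves.extend(explode_all(item))
    return leaves


if __name__ == "__main__":
    # Each leaf would become one row of the exploded column.
    print(explode_all([["one", "two"], ["three"]]))
```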
