# Data Wrangling with Kotlin

In [1]:
%use dataframe

Before digging into data wrangling techniques that DataFrame offers, there is one Column type that has not been covered yet: `ColumnGroup`.

## `ColumnGroup` and `FrameColumn`
These are special kinds of columns: a `ColumnGroup` contains a set of nested columns, while a `FrameColumn` contains a `DataFrame` in each of its cells.

The power of those structures is the ability to store and organize data in a **hierarchical** way. This is essential when dealing with JSON serialization and deserialization.

*"Nested"* objects also come up very often when using grouping and pivoting operations (discussed in the next chapter), so a basic understanding of them is required before dealing with those operations.
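
To make the JSON connection concrete, here is a minimal sketch that parses a small payload with `DataFrame.readJsonStr` and inspects the kind of each resulting column. The JSON content and the variable names (`json`, `people`, `addressKind`, `petsKind`) are made up for illustration. Under the library's usual JSON mapping, the nested object should come back as a `ColumnGroup` and the array of objects as a `FrameColumn`.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.columns.ColumnKind
import org.jetbrains.kotlinx.dataframe.io.readJsonStr

// Hypothetical payload: "address" is a nested object, "pets" is an array of objects.
val json = """
    [
      {"name": "Ada", "address": {"city": "London", "zip": "N1"}, "pets": [{"kind": "cat"}, {"kind": "dog"}]},
      {"name": "Bob", "address": {"city": "Rome", "zip": "00100"}, "pets": [{"kind": "fish"}]}
    ]
""".trimIndent()

val people = DataFrame.readJsonStr(json)

// kind() reports whether a column holds plain values, a group of columns, or data frames.
val addressKind = people["address"].kind()
val petsKind = people["pets"].kind()

println("address: $addressKind, pets: $petsKind")
```

In a notebook where `%use dataframe` has already been run, the explicit imports are redundant but harmless.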

Let's consider a `DataFrame` of people with the following information:

In [11]:
val name by columnOf(
    "Woody Allen",
    "Bob Dylan",
    "Charlie Chaplin",
    "John Coltrane",
    "Bob Marley",
    "Linus Torvalds",
    "Charlie Parker",
)
val age by columnOf(15, 45, 20, 30, 15, 22, 57)
val city by columnOf(
    "Rome",
    "Moscow",
    "Tirana",
    "Sarajevo",
    "Cesena",
    null,
    "Kyoto",
)

val weight by columnOf(55, 70, null, 80, null, null, 90)
val isDied by columnOf(false, false, true, true, true, false, true)

val df = dataFrameOf(name, age, city, weight, isDied)
df

Creating a group of columns is pretty straightforward:

In [39]:
df.group { age and city }.into("group")
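
Grouping is reversible. The sketch below uses the String-based overloads of `group` and `ungroup` so that it does not rely on the notebook's generated column properties. The tiny data frame `small` is a stand-in for `df`, and the names `grouped` and `restored` are illustrative.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// A reduced stand-in for the data frame used in this chapter.
val small = dataFrameOf("name", "age", "city")(
    "Woody Allen", 15, "Rome",
    "Bob Dylan", 45, "Moscow",
)

// Nest "age" and "city" under a single ColumnGroup called "group".
val grouped = small.group("age", "city").into("group")

// Move the nested columns back to the top level.
val restored = grouped.ungroup("group")

println(grouped.columnNames())
println(restored.columnNames())
```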

We can also create a nested column, for example by splitting the name into a `firstName` and a `lastName` column:

In [19]:
val groupedDf = df.split { name }.by(' ').inward("firstName", "lastName")
groupedDf

The `inward()` method splits the column into the provided column names and nests them inside the original column, creating a `ColumnGroup`.

In [22]:
groupedDf.name.javaClass

class org.jetbrains.kotlinx.dataframe.impl.columns.ColumnGroupImpl

We can always access the fields of the `ColumnGroup` with the `.` notation:

In [27]:
groupedDf.name.firstName
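
The `.` notation relies on the extension properties the notebook generates after each cell. Outside a notebook, or when column names are only known at runtime, nested columns can be reached by name instead. This is a small sketch assuming a hand-built nested frame (`nested`) rather than `groupedDf`. `getColumnGroup` and `pathOf` are the library calls being illustrated, and the other names are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

// Build a frame with a "name" ColumnGroup holding "firstName" and "lastName".
val nested = dataFrameOf("firstName", "lastName", "age")(
    "Woody", "Allen", 15,
    "Bob", "Dylan", 45,
).group("firstName", "lastName").into("name")

// Option 1: fetch the group, then index into it like a DataFrame.
val firstNames = nested.getColumnGroup("name")["firstName"].toList()

// Option 2: address the nested column with a full column path.
val lastNames = nested[pathOf("name", "lastName")].toList()

println(firstNames)
println(lastNames)
```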

As mentioned above, most of the time we will deal with these nested structures when using the `pivot` or `groupBy` methods.
We can, for example, pivot the table to create columns that contain a `DataFrame`, known as `FrameColumn`s:

In [37]:
groupedDf.pivot { name.firstName }

As the message below the dataframe suggests, this is a `Pivot` object: a temporary object that is meant to be followed by an aggregate function or other manipulations. We will cover `pivot` and `groupBy` extensively in the next chapter.
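
A `FrameColumn` can also be produced without pivoting. As a hedged preview of the next chapter, the sketch below groups a small hand-made frame by first name and materializes the groups with `toDataFrame`, which stores each group as a nested `DataFrame`. The data and the names `sample`, `byName`, and `"people"` are assumptions made for this example.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.columns.ColumnKind

// Illustrative data with repeated first names.
val sample = dataFrameOf("firstName", "age")(
    "Bob", 45,
    "Charlie", 20,
    "Bob", 15,
    "Charlie", 57,
)

// One row per distinct first name; the rows of each group go into the "people" column.
val byName = sample.groupBy("firstName").toDataFrame("people")

val peopleKind = byName["people"].kind()

// Each cell of a FrameColumn is itself a DataFrame.
val firstGroupSize = byName.getFrameColumn("people")[0].rowsCount()

println("kind: $peopleKind, rows in first group: $firstGroupSize")
```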

These nested structures resemble a `pandas.MultiIndex`: both express the concept of organizing data in a **hierarchical** way.

DataFrame's multilevel structures differ from pandas in that they have no explicit concept of an `Index`, so operations like `pandas.DataFrame.stack()`/`unstack()` would make no sense. A similar result can sometimes be achieved with some trickery, but DataFrame's `ColumnGroup` and `FrameColumn` are not intended as substitutes for `pandas.MultiIndex`, even if their goal is very similar.

## Working with Multiple DataFrames

DataFrame provides three methods for operating with multiple `DataFrame`s:
- `add`: adds new **columns** to the `DataFrame`.
- `concat`: returns the **union** of the provided `DataFrame`s.
- `join`: SQL-like join of two `DataFrame`s by **key** columns.
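
As a quick illustration of `concat` and `join`, here is a minimal sketch on three small, made-up frames (`first`, `second`, `cities`). It uses the String-based `join` overload and relies on `join` being an inner join by default. `leftJoin` is shown for contrast.

```kotlin
import org.jetbrains.kotlinx.dataframe.api.*

val first = dataFrameOf("name", "age")(
    "Woody Allen", 15,
    "Bob Dylan", 45,
)
val second = dataFrameOf("name", "age")(
    "Bob Marley", 15,
)
val cities = dataFrameOf("name", "city")(
    "Woody Allen", "Rome",
    "Bob Marley", "Cesena",
)

// concat stacks the rows of both frames (union of rows).
val everyone = first.concat(second)

// Inner join on "name": only people that have a matching city are kept.
val withCity = everyone.join(cities, "name")

// Left join on "name": every person is kept, and a missing city becomes null.
val withOptionalCity = everyone.leftJoin(cities, "name")

println(everyone.rowsCount())
println(withCity.rowsCount())
println(withOptionalCity.rowsCount())
```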


We have already seen an application of the `add` method, but it is also possible to add multiple columns at once:

In [71]:
groupedDf
    .convert { weight }.toDouble()
    .dropNA { weight }
    .add {
        "year of birth" from 2023 - age
        age gt 18 into "is adult"
        "details" {
            "weight"<Double>() / 6.35 into "weight (approx. stones)"
            "full name" from { name.firstName + " " + name.lastName }
        }
    }