In [1]:
%useLatestDescriptors
%use dataframe, lets-plot


In [2]:
var df = DataFrame.readCSV(fileOrUrl = "../../idea-examples/titanic/src/main/resources/titanic.csv", delimiter = ';', parserOptions = ParserOptions(locale = java.util.Locale.FRENCH))

df.head()

Our dataset uses a non-default format for decimal numbers, which is why the example parses it with the French locale (in that locale the decimal separator is a comma).
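
To see why the locale matters, here is a minimal sketch that uses only the JDK's `java.text.NumberFormat`, independently of the DataFrame parser. The helper `parseDecimal` and the sample string `"7,25"` are illustrative assumptions, not part of the notebook or the dataset.

```kotlin
import java.text.NumberFormat
import java.util.Locale

// Parse a decimal string according to the conventions of the given locale.
fun parseDecimal(text: String, locale: Locale): Double =
    NumberFormat.getInstance(locale).parse(text).toDouble()

fun main() {
    // Illustrative value: the French locale treats ',' as the decimal separator.
    println(parseDecimal("7,25", Locale.FRENCH)) // 7.25
    // Under the US locale ',' is a grouping separator, so the same text is read as 725.
    println(parseDecimal("7,25", Locale.US))     // 725.0
}
```

Parsing with the wrong locale does not necessarily fail; it can silently produce a different number, which is why the locale is passed explicitly when reading the CSV.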

But before converting the data, we should handle the *null* values.

In [3]:
df.describe()

In [4]:
df

# Imputing null values
Let's make the numeric columns of our dataset non-nullable by imputing null values with the column means, and fill the missing values in `sex` and `embarked` with a constant category.
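
As a rough illustration of what mean imputation does to a single column, here is a plain-Kotlin sketch that does not use the DataFrame API. The function name `imputeWithMean` and the sample values are assumptions made for this example.

```kotlin
// Replace nulls in a numeric column with the mean of its non-null values.
fun imputeWithMean(values: List<Double?>): List<Double> {
    val known = values.filterNotNull()
    require(known.isNotEmpty()) { "Cannot impute a column that has no known values" }
    val mean = known.average()
    return values.map { it ?: mean }
}

fun main() {
    // Illustrative ages, not taken from the dataset.
    val ages = listOf(20.0, null, 40.0)
    println(imputeWithMean(ages)) // [20.0, 30.0, 40.0]
}
```

The result type is `List<Double>` rather than `List<Double?>`, which mirrors the goal of ending up with non-nullable columns.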

In [5]:
val df1 = df
    // imputing
    .fillNulls { sibsp and parch and age and fare }.perCol { mean() }
    .fillNulls { sex }.with { "female" }
    .fillNulls { embarked }.with { "S" }
    .convert { sibsp and parch and age and fare }.toDouble()

df1.head()

In [6]:
df1.schema()

pclass: Int
survived: Int
name: String
sex: String
age: Double
sibsp: Double
parch: Double
ticket: String
fare: Double
cabin: String?
embarked: String
boat: String?
body: Int?
homedest: String?

In [7]:
df1.corr()

In [8]:
val correlations = df1.corr { all() }.with { survived }
    .sortBy { survived }
correlations

Great, at this point we have 5 numerical features available for numerical analysis: **pclass, age, sibsp, parch, fare**.
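
For readers who want to see what a correlation coefficient measures, here is a minimal plain-Kotlin sketch of the Pearson coefficient. It assumes that `corr` reports Pearson correlation, which is our reading of the library and not something this notebook states; the function `pearson` and the sample numbers are illustrative.

```kotlin
import kotlin.math.sqrt

// Pearson correlation coefficient between two equally sized samples.
fun pearson(x: List<Double>, y: List<Double>): Double {
    require(x.size == y.size && x.size > 1) { "Need two samples of equal size with at least 2 points" }
    val meanX = x.average()
    val meanY = y.average()
    var covariance = 0.0
    var varianceX = 0.0
    var varianceY = 0.0
    for (i in x.indices) {
        val dx = x[i] - meanX
        val dy = y[i] - meanY
        covariance += dx * dy
        varianceX += dx * dx
        varianceY += dy * dy
    }
    return covariance / sqrt(varianceX * varianceY)
}

fun main() {
    // Illustrative values: a perfectly linear increasing and decreasing relation.
    println(pearson(listOf(1.0, 2.0, 3.0), listOf(2.0, 4.0, 6.0))) // 1.0
    println(pearson(listOf(1.0, 2.0, 3.0), listOf(6.0, 4.0, 2.0))) // -1.0
}
```

Values close to 1 or -1 indicate a strong linear relation, while values near 0 indicate none; a constant column makes the denominator zero and the result undefined.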

# Analyze by pivoting features
To confirm some of our observations and assumptions, we can quickly analyze feature correlations by pivoting features against each other. At this stage we can only do so for features that have no empty values. It also only makes sense for features that are categorical (Sex), ordinal (Pclass) or discrete (SibSp, Parch).

- **Pclass**: We observe a significant correlation (>0.5) between **Pclass**=1 and **Survived**.

- **Sex**: We confirm the observation made during problem definition that Sex=female had a very high survival rate of 74%.

- **SibSp** and **Parch**: These features have zero correlation for certain values. It may be best to derive a feature or a set of features from these individual features.
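
The pivoting used in the next cells boils down to grouping rows by one feature and averaging `survived` within each group. Here is a plain-Kotlin sketch of that idea; the `Passenger` class and the sample rows are hypothetical and much smaller than the real dataset.

```kotlin
// Hypothetical minimal record; the real dataset has many more columns.
data class Passenger(val sex: String, val survived: Int)

// Mean of `survived` per group, i.e. the survival rate for each value of `sex`.
fun survivalRateBySex(passengers: List<Passenger>): Map<String, Double> =
    passengers
        .groupBy { it.sex }
        .mapValues { (_, group) -> group.map { it.survived }.average() }

fun main() {
    // Illustrative rows, not taken from the dataset.
    val sample = listOf(
        Passenger("female", 1), Passenger("female", 1),
        Passenger("female", 0), Passenger("female", 1),
        Passenger("male", 0), Passenger("male", 1),
    )
    println(survivalRateBySex(sample)) // {female=0.75, male=0.5}
}
```

Because `survived` is encoded as 0 or 1, its mean within a group is exactly the share of survivors in that group.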

In [9]:
df1.groupBy { pclass }.mean { survived }.sortBy { pclass }

In [10]:
df1.groupBy { sex }.mean { survived }.sortBy { survived }

In [11]:
df1.groupBy { sibsp }.mean { survived }.sortBy { sibsp }

In [12]:
df1.groupBy { parch }.mean { survived }.sortBy { parch }

# Analyze the importance of the Age feature

It is interesting to compare the **age** distributions of the passengers who survived and those who did not.

In [13]:
val byAge = df1.valueCounts { age }.sortBy { age }
byAge

In [14]:
// JetBrains color palette
val colors = mapOf("light_orange" to "#ffb59e", "orange" to "#ff6632", "light_grey" to "#a6a6a6", "dark_grey" to "#4c4c4c")

In [15]:
letsPlot(byAge.toMap()) { x = "age"; y = "count" } + 
    geomPoint(size = 5, color = colors["dark_grey"]) +
    ggsize(850, 500)


In [16]:
val age = df.select { age }.dropNulls().sortBy { age }

letsPlot(age.toMap()) { x = "age" } + geomHistogram(binWidth=5, fill = colors["orange"]) + ggsize(850, 500)


In [17]:
df1.groupBy { age }.pivotCounts { survived }.sortBy { age }

In [18]:
val survivedByAge = df1.select { survived and age }.sortBy { age }
survivedByAge

In [19]:
val plot = letsPlot(survivedByAge.convert { survived }.with { if (it == 1) "Survived" else "Died" }.toMap())

plot +
    geomHistogram(binWidth = 5, alpha = 0.7, position = positionDodge()) { x = "age"; fill = "survived" } +
    scaleFillManual(listOf(colors["dark_grey"]!!, colors["orange"]!!)) +
    ggsize(850, 500)


In [20]:
// Density plot
plot +
    geomDensity { x="age"; color="survived" } +
    scaleColorManual(listOf(colors["dark_grey"]!!, colors["orange"]!!)) +
    ggsize(850, 250)


In [21]:
// A basic box plot
plot +
    geomBoxplot { x="survived"; y="age"; fill = "survived" } +
    scaleFillManual(listOf(colors["dark_grey"]!!, colors["orange"]!!)) +
    ggsize(500, 400)


It seems that the age distribution is roughly the same for the passengers who survived and those who did not.
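
The histograms in this section use `binWidth = 5`. As a rough sketch of what such binning does, the plain-Kotlin function below counts values per 5-year bin. It assumes bins that start at multiples of the bin width; the plotting library may align its bins differently, and the name `binCounts` and the sample ages are illustrative.

```kotlin
import kotlin.math.floor

// Count how many values fall into each bin of the given width.
// Keys are the lower bounds of the bins; bins start at multiples of binWidth.
fun binCounts(values: List<Double>, binWidth: Double): Map<Double, Int> {
    require(binWidth > 0.0) { "binWidth must be positive" }
    return values
        .groupingBy { floor(it / binWidth) * binWidth }
        .eachCount()
        .toSortedMap()
}

fun main() {
    // Illustrative ages, not taken from the dataset.
    val ages = listOf(1.0, 4.9, 5.0, 12.0)
    println(binCounts(ages, 5.0)) // {0.0=2, 5.0=1, 10.0=1}
}
```

Running the same function separately on the ages of survivors and non-survivors gives two count tables that can be compared bin by bin.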

# Categorical features with One Hot Encoding

To prepare the data for ML algorithms, we should replace all String values in categorical features with numbers. There are several ways to preprocess categorical features, and One Hot Encoding is one of them. We will use the [`pivotMatches`](https://kotlin.github.io/dataframe/pivot.html#pivotmatches) operation to convert categorical columns into sets of nested `Boolean` columns, one per unique value.
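
To make the idea of One Hot Encoding concrete, here is a plain-Kotlin sketch for a single column. It only illustrates the concept and is not how `pivotMatches` is implemented; the function `oneHot` and the sample values are assumptions.

```kotlin
// One-hot encode a categorical column: one Boolean column per distinct value.
fun oneHot(values: List<String>): Map<String, List<Boolean>> =
    values.distinct().associateWith { category -> values.map { it == category } }

fun main() {
    // Illustrative values for a port-of-embarkation column.
    val embarked = listOf("S", "C", "S", "Q")
    println(oneHot(embarked))
    // {S=[true, false, true, false], C=[false, true, false, false], Q=[false, false, false, true]}
}
```

Each row has exactly one `true` across the generated columns, so converting the Booleans to numbers yields the usual 0/1 indicator features.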

In [22]:
val pivoted = df1.pivotMatches { pclass and sex and embarked }
pivoted.head()

In [23]:
val df2 = pivoted
    // feature extraction
    .select { survived and pclass and sibsp and parch and age and fare and sex and embarked }
    // convert every non-group column, including the nested one-hot columns, to Double
    .convert { colsAtAnyDepth().filter { !it.isColumnGroup() } }.toDouble()

df2.head()

In [24]:
val titanicData = df2.flatten().toMap()

gggrid(
    listOf(
        CorrPlot(titanicData, "Tiles").tiles()
            .paletteGradient(colors["orange"]!!, colors["light_grey"]!!, colors["dark_grey"]!!).build(),
        CorrPlot(titanicData, "Points").points()
            .paletteGradient(colors["orange"]!!, colors["light_grey"]!!, colors["dark_grey"]!!).build(), 
        CorrPlot(titanicData, "Tiles and labels").tiles().labels()
            .paletteGradient(colors["orange"]!!, colors["light_grey"]!!, colors["dark_grey"]!!).build(),
        CorrPlot(titanicData, "Tiles, points and labels").points().labels().tiles()
            .paletteGradient(colors["orange"]!!, colors["light_grey"]!!, colors["dark_grey"]!!).build()
    ), 1, 700, 600)


# Creation of new features

We suggest combining the **sibsp** and **parch** features into a single new feature named **FamilyNumber**, computed as the simple sum of **sibsp** and **parch**.

In [25]:
val familyDF = df1.add("familyNumber") { sibsp + parch }

familyDF.head()

In [26]:
familyDF.corr { familyNumber }.with { survived }

In [27]:
familyDF.corr { familyNumber }.with { age }

It looks like the new feature has no influence on the **survived** column, but it has a strong negative correlation with **age**.

# Titles
Let's try to extract something from the names. Many of the strings in the name column contain titles, such as Don, Mr, Mrs and so on.
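
The extraction relies on names following a `Surname, Title. Given names` pattern, which is an assumption about the data and not something the notebook verifies. Here is a plain-Kotlin sketch of the same string logic on one illustrative name; the function name `extractTitle` is ours.

```kotlin
// Extract the title from a name of the form "Surname, Title. Given names".
// Takes the text before the first '.', then the part after the first ','.
fun extractTitle(name: String): String =
    name.split(".")[0].split(",")[1].trim()

fun main() {
    // Illustrative name in the assumed "Surname, Title. Given names" format.
    println(extractTitle("Braund, Mr. Owen Harris")) // Mr
}
```

A name without a comma would make `split(",")[1]` throw an `IndexOutOfBoundsException`, so this approach only works when every row follows the assumed pattern.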

In [28]:
val titledDF = df.select { survived and name }.add ("title") { name.split(".")[0].split(",")[1].trim() }
titledDF.head(100)

In [29]:
titledDF.valueCounts { title }

The new **Title** column contains some rare titles and some titles with typos. Let's clean the data and merge the rare titles into one category.

In [30]:
val rareTitles = listOf("Dona", "Lady", "the Countess", "Capt", "Col", "Don", 
                "Dr", "Major", "Rev", "Sir", "Jonkheer")

val cleanedTitledDF = titledDF.update { title }.with { 
                            when {
                                it == "Mlle" -> "Miss"
                                it == "Ms" -> "Miss"
                                it == "Mme" -> "Mrs"
                                it in rareTitles -> "Rare Title"
                                else -> it
                            }
                        }

In [31]:
cleanedTitledDF.valueCounts { title }

Now the column looks clean: we have only 5 different titles and can see how they correlate with survival.

In [32]:
val correlations = cleanedTitledDF
                    .pivotMatches { title }
                    .corr { title }.with { survived }
correlations

In [33]:
correlations.update { title }.with { it.substringAfter('_') }.filter { title != "survived" }

Women with the titles **Miss** and **Mrs** have similar chances of survival, but the same is not true for men: if your title is **Mr**, your odds on the Titanic are bad.

The **Rare Title** category is indeed rare and doesn't play a big role.

In [34]:
val groupedCleanedTitledDF = cleanedTitledDF.valueCounts { title and survived }.sortBy { title and survived }
groupedCleanedTitledDF

# Surname analysis
It would be interesting to dig deeper into families and home destinations. We can start this analysis with surnames, which are easy to extract from the **Name** feature.
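
As a small plain-Kotlin sketch of the surname analysis, the code below extracts surnames and counts how often each one occurs. It assumes names of the form `Surname, Title. Given names`; the function `extractSurname` and the sample names are illustrative.

```kotlin
// Extract the surname: the text before the first ',' (and before the first '.').
fun extractSurname(name: String): String =
    name.split(".")[0].split(",")[0].trim()

// Count how many passengers share each surname.
fun surnameCounts(names: List<String>): Map<String, Int> =
    names.map(::extractSurname).groupingBy { it }.eachCount()

fun main() {
    // Illustrative names in the assumed format.
    val names = listOf(
        "Braund, Mr. Owen Harris",
        "Braund, Mr. Lewis Richard",
        "Heikkinen, Miss. Laina",
    )
    val counts = surnameCounts(names)
    println(counts)      // {Braund=2, Heikkinen=1}
    println(counts.size) // 2
}
```

Surnames shared by several passengers are candidates for family groups, although unrelated passengers can share a surname as well.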

In [35]:
val surnameDF = df1.select { survived and name }.add ("surname") { name.split(".")[0].split(",")[0].trim() }
surnameDF.head()

In [36]:
surnameDF.valueCounts { surname }

In [37]:
surnameDF.surname.countDistinct()

875

In [38]:
val firstSymbol by column<String>()

df1
    .add(firstSymbol) { name.split(".")[0].split(",")[0].trim().first().toString() }
    .pivotMatches(firstSymbol)
    .corr { firstSymbol }.with { survived }
