# H&M Personalized Fashion Recommendations exploratory data analysis (EDA)

This notebook contains an exploratory data analysis of the [H&M Personalized Fashion Recommendations dataset](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/). The dataset contains information about users, articles and transactions.

This notebook is written in Kotlin. Refer to the [official documentation](https://kotlinlang.org/docs/data-science-overview.html) to learn more about Kotlin Notebook.

## 0. Import dependencies

This notebook will use the following dependencies:
- [dataframe](https://github.com/Kotlin/dataframe) is used to read the structured CSV data files and to process the loaded data
- [kandy](https://github.com/Kotlin/kandy) is used to visualize the data

In [None]:
%useLatestDescriptors
%use dataframe, kandy

## 1. Data loading

To use this notebook, you need to download the dataset first. As stated above, this notebook inspects the data of the [H&M Personalized Fashion Recommendations dataset](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/). The dataset is hosted on Kaggle; you can download it from the website or with the [Kaggle API](https://github.com/Kaggle/kaggle-api).

Before loading the datasets, set the maximum heap size to a suitable value, so that enough memory is available for the dataframes. If the heap is too small, an `OutOfMemoryError` is thrown with a message about the insufficient heap space.
- To set the heap size in Kotlin Notebook, first open *Kotlin Notebook Settings*. You can open it with the gear icon next to the cell type selector dropdown, or you can find it with the Find Action view. There you can specify the maximum heap size.
- To set the heap size using other clients, refer to [Kotlin Jupyter kernel's documentation](https://github.com/Kotlin/kotlin-jupyter#other-clients).

**When I was running this notebook, I set the maximum heap size to 12 288 MiB (12 GiB).**

If you encounter an `OutOfMemoryError` while running this notebook, I advise you to try the following:
- Increase the maximum heap size (if your machine has enough memory to do so).
- Clear the memory by restarting the kernel. **Please note that this will cause all variable values to be lost (and you will have to re-import the dependencies).**

Run the cell below to read the maximum heap size setting (the value is reported in bytes).

In [None]:
Runtime.getRuntime().maxMemory()
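Because `maxMemory()` returns a raw byte count, it can be handy to convert it before comparing it with the value from the settings dialog. The following is a minimal sketch; the helper name `bytesToMiB` is introduced here for illustration only and is not part of the original notebook.

```kotlin
// Hypothetical helper: converts a byte count to whole mebibytes (1 MiB = 1024 * 1024 bytes).
fun bytesToMiB(bytes: Long): Long = bytes / (1024L * 1024L)

// Maximum heap size of the running JVM, expressed in MiB.
val maxHeapMiB = bytesToMiB(Runtime.getRuntime().maxMemory())
println("Maximum heap size: $maxHeapMiB MiB")
```

The JVM may report a value that is somewhat lower than the configured limit, so a small difference from the setting is not necessarily a problem.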

After the download has finished, change the value of the *pathToDownloadedCsvFiles* variable in the cell below to the path of the directory that contains the downloaded dataset on your local machine.

In [None]:
val pathToDownloadedCsvFiles = "C:\\Sajat\\Egyetem\\MSc\\Onallo\\HM_dataset"

Then the CSV files can be imported into dataframes.

#### 1.1 Loading articles dataframe

Let's load the articles dataframe!

In [None]:
var articlesDf = DataFrame.readCSV(
    fileOrUrl = pathToDownloadedCsvFiles + "\\articles.csv",
)

Let's check out the column names of the articles dataframe!

In [None]:
articlesDf.columnNames()

For many attributes (e.g. graphical_appearance, perceived_colour_value) two columns are present that represent the same information: one column contains the value in a format that is easier to interpret, and the other contains a more concise encoded representation. For these inspections the human-readable values are enough, so the encoded columns can be removed!

I'm also removing the detail_desc column, because I couldn't think of a good way to analyse this free-text attribute.

In [None]:
articlesDf = articlesDf.select { article_id and prod_name and product_type_name and product_group_name and graphical_appearance_name and colour_group_name and perceived_colour_value_name and perceived_colour_master_name and department_name and index_name and index_group_name and section_name and garment_group_name }

#### 1.2 Loading customers dataframe

Let's load the customer dataframe!

In [None]:
var customersDf = DataFrame.readCSV(
    fileOrUrl = pathToDownloadedCsvFiles + "\\customers.csv",
    charset = Charsets.US_ASCII,
)

Let's inspect the columns of the customers dataframe!

In [None]:
customersDf.columnNames()

Of these columns, **customer_id**, **age** and **postal_code** seem useful, because these attributes can be present in the database of any clothing webshop. I also couldn't decipher the meaning of the FN and Active columns. Let's do a projection on the dataframe to keep only the relevant attributes!

In [None]:
customersDf = customersDf.select { customer_id and age and postal_code }

#### 1.3 Loading transactions dataframe

To read transactions_train.csv, the computer needs a significant amount of RAM. When I tried to read the whole file, **I needed 16 GiB of RAM**. Every attempt I made with a lower maximum memory limit failed with an `OutOfMemoryError`.

To work around the high memory requirement, the following code snippet splits the original file into two new files. In the original CSV file the transaction date attribute covers a period of two years, so it is sensible to create two files that each cover a period of one year. Because each new file contains only approximately half of the original data, less memory should be sufficient to load it.

To change where the two new files are created, modify the values of the path variables below!

In [None]:
val firstYearTransactionsTargetPath = "C:\\Sajat\\Egyetem\\MSc\\Onallo\\HM_dataset_transformation\\transactions_train1.csv"
val secondYearTransactionsTargetPath = "C:\\Sajat\\Egyetem\\MSc\\Onallo\\HM_dataset_transformation\\transactions_train2.csv"

**Please note that the code below can run for a relatively long time. When I was running it, it ran for approximately one and a half hours.**

In [None]:
import java.io.File
import java.text.SimpleDateFormat

val dateFormat = SimpleDateFormat("yyyy-MM-dd")
val delimiterDate = dateFormat.parse("2019-09-21")

val firstYearTargetFile = File(firstYearTransactionsTargetPath)
val secondYearTargetFile = File(secondYearTransactionsTargetPath)

var isDelimiterDateReached = false
var areColumnNameLinesAdded = false

var lineSeparator = "\r\n"

File(pathToDownloadedCsvFiles + "\\transactions_train.csv").forEachLine { line ->
    line.split(',').firstOrNull()?.let { rawDate ->
        try {
            when {
                areColumnNameLinesAdded && isDelimiterDateReached -> {
                    secondYearTargetFile.appendText("$lineSeparator$line")
                }

                areColumnNameLinesAdded && isDelimiterDateReached.not() && dateFormat.parse(rawDate) < delimiterDate -> {
                    firstYearTargetFile.appendText("$lineSeparator$line")
                }

                areColumnNameLinesAdded && isDelimiterDateReached.not() -> {
                    isDelimiterDateReached = true
                    secondYearTargetFile.appendText("$lineSeparator$line")
                }
                
                areColumnNameLinesAdded.not() -> {
                    firstYearTargetFile.appendText("$line")
                    secondYearTargetFile.appendText("$line")
                    
                    areColumnNameLinesAdded = true
                }
            }
        } catch (t: Throwable) {
            t.printStackTrace()
        }
    }
}
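A likely reason for the long running time is that `appendText` opens and closes the target file for every single line. Note also that `appendText` adds to files that already exist, so running the cell twice would duplicate rows. The sketch below shows one possible alternative that keeps two buffered writers open and overwrites existing targets. It is an assumption-laden variant, not the code used for the results in this notebook: the function name `splitCsvByDate` is made up for illustration, and unlike the cell above it parses the date of every row with `java.time.LocalDate` instead of relying on the rows being sorted by date.

```kotlin
import java.io.File
import java.time.LocalDate

/**
 * Hypothetical helper: splits a CSV whose first column is an ISO date (yyyy-MM-dd).
 * Rows dated before [delimiterDate] go to [firstTarget], all other rows to [secondTarget].
 * The header line is copied to both files; existing target files are overwritten.
 */
fun splitCsvByDate(source: File, firstTarget: File, secondTarget: File, delimiterDate: LocalDate) {
    firstTarget.bufferedWriter().use { first ->
        secondTarget.bufferedWriter().use { second ->
            source.useLines { lines ->
                lines.forEachIndexed { index, line ->
                    when {
                        // Header row: both outputs need the column names.
                        index == 0 -> {
                            first.appendLine(line)
                            second.appendLine(line)
                        }
                        // Skip empty lines (for example a trailing newline at the end of the file).
                        line.isBlank() -> Unit
                        else -> {
                            val rowDate = LocalDate.parse(line.substringBefore(','))
                            val writer = if (rowDate < delimiterDate) first else second
                            writer.appendLine(line)
                        }
                    }
                }
            }
        }
    }
}
```

With the variables defined earlier, it could be called as `splitCsvByDate(File(pathToDownloadedCsvFiles, "transactions_train.csv"), File(firstYearTransactionsTargetPath), File(secondYearTransactionsTargetPath), LocalDate.parse("2019-09-21"))`.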

If your computer has plenty of RAM, you can skip this CSV splitting step and instead load the whole transactions CSV file into one dataframe. After this, you can split that dataframe into two parts, for example with the filter operation on the *t_dat* attribute.

After running the cells above, the transaction dataframes can be loaded!

In [None]:
var firstYeartransactionsDf = DataFrame.readCSV(
    fileOrUrl = firstYearTransactionsTargetPath,
    charset = Charsets.US_ASCII,
)

In [None]:
var secondYeartransactionsDf = DataFrame.readCSV(
    fileOrUrl = secondYearTransactionsTargetPath,
    charset = Charsets.US_ASCII,
)

After this, let's do a projection on these dataframes to keep only the relevant attributes.

In [None]:
firstYeartransactionsDf = firstYeartransactionsDf.select { t_dat and customer_id and article_id and price }

In [None]:
secondYeartransactionsDf = secondYeartransactionsDf.select { t_dat and customer_id and article_id and price }

## 2. Data analysis

In this part data wrangling and visualisation will be performed on the dataframes.

### 2.1 Examination specifically related to articles

#### 2.1.1 Basic inspections about the articles

First, let's read the first items to see some sample rows and get an idea of the stored data!

In [None]:
articlesDf.head()

Let's check whether there is any duplicate row in the dataframe!

In [None]:
articlesDf.describe { article_id }

We can see from the unique and count attributes that there is no duplicate row in the dataframe!
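The reasoning behind this check is that a column contains no duplicates exactly when its number of distinct values equals its number of values. A minimal sketch of the same idea in plain Kotlin follows; the function `hasDuplicates` is a hypothetical helper, not something provided by the dataframe library.

```kotlin
/** Hypothetical helper: returns true when at least one value occurs more than once in [ids]. */
fun <T> hasDuplicates(ids: List<T>): Boolean = ids.toSet().size != ids.size

// Small illustrative inputs (not taken from the dataset).
println(hasDuplicates(listOf(1, 2, 3)))
println(hasDuplicates(listOf(1, 2, 2)))
```

In the notebook it could presumably be applied to the identifier column, for example `hasDuplicates(articlesDf.article_id.toList())`, assuming the column exposes a `toList()` conversion.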

#### 2.1.2 Inspections related to individual attribute sets

##### 2.1.2.1 Inspections related to the product name attribute

In [None]:
val countOfProdNameAttributes = articlesDf.valueCounts { prod_name }

countOfProdNameAttributes.describe()

As we can see, prod_name is not a unique identifier. For instance, products can exist with the same prod_name but in different colors.

There is at least one prod_name value that belongs to 98 different articles; one such name is "Dragonfly dress"!

##### 2.1.2.2 Inspections related to the graphical appearance name attribute

In [None]:
val countOfArticlesOfGraphicalAppearanceName = articlesDf.valueCounts { graphical_appearance_name }

countOfArticlesOfGraphicalAppearanceName.describe()

In [None]:
countOfArticlesOfGraphicalAppearanceName.plot {
    bars { 
        x(graphical_appearance_name)
        y(count)
     }
}

The simple solid appearance seems to dominate the articles! Let's calculate what fraction of all articles has a solid appearance!

In [None]:
countOfArticlesOfGraphicalAppearanceName.filter { graphical_appearance_name == "Solid" }.count.values.first().toDouble() / countOfArticlesOfGraphicalAppearanceName.sum { count }.toDouble()

##### 2.1.2.3 Inspections related to the color group name attribute

In [None]:
val countOfArticlesOfColorGroupName = articlesDf.valueCounts { colour_group_name }

countOfArticlesOfColorGroupName.describe()

In [None]:
countOfArticlesOfColorGroupName.plot {
    bars { 
        x(colour_group_name)
        y(count)
     }
}

Of course, black is the most common color among the articles!

##### 2.1.2.4 Inspections related to the perceived colour master name attribute

In [None]:
val countOfPerceivedColourMasterName = articlesDf.valueCounts { perceived_colour_master_name }

countOfPerceivedColourMasterName.describe()

In [None]:
countOfPerceivedColourMasterName.plot {
    bars { 
        x(perceived_colour_master_name)
        y(count)
     }
}

##### 2.1.2.5 Inspections related to the perceived colour value name attribute

In [None]:
val countOfPerceivedColourValueName = articlesDf.valueCounts { perceived_colour_value_name }

countOfPerceivedColourValueName.describe()

In [None]:
countOfPerceivedColourValueName.plot {
    bars { 
        x(perceived_colour_value_name)
        y(count)
     }
}

##### 2.1.2.6 Inspections related to the product type name attribute

In [None]:
val countOfProductTypeNames = articlesDf.valueCounts { product_type_name }

countOfProductTypeNames.describe()

At least one of the product type values belongs to 11169 articles! Let's see which product type name this is!

In [None]:
countOfProductTypeNames.filter { count == 11169 }

Interesting! I'd have guessed it would be some kind of T-shirt related value! Let's see the other most frequent values!

In [None]:
countOfProductTypeNames.head()

##### 2.1.2.7 Inspections related to the product group name attribute

In [None]:
val countOfArticlesOfGroupName = articlesDf.valueCounts { product_group_name }

countOfArticlesOfGroupName.describe()

In [None]:
countOfArticlesOfGroupName.plot { 
    bars { 
        y(count)
        x(product_group_name)
     }
 }

##### 2.1.2.8 Inspections related to the department name attribute

In [None]:
val countOfDepartmentName = articlesDf.valueCounts { department_name }

countOfDepartmentName.describe()

In [None]:
countOfDepartmentName.plot {
    bars { 
        x(department_name)
        y(count)
     }
     layout {
         size = 1000 to 500
     }
}

##### 2.1.2.9 Inspections related to the index name attribute

In [None]:
val countOfIndexName = articlesDf.valueCounts { index_name }

countOfIndexName.describe()

In [None]:
countOfIndexName.plot { 
    pie { 
        slice(count)
        fillColor(index_name)
        size = 20.0
     }
 }

##### 2.1.2.10 Inspections related to the index group name attribute

In [None]:
val countOfIndexGroupName = articlesDf.valueCounts { index_group_name }

countOfIndexGroupName.describe()

In [None]:
countOfIndexGroupName.plot {
    pie {
        slice(count)
        fillColor(index_group_name)
        explode(listOf(0, 0, 0, 0, 0.5))
        size = 20.0
    }
}

##### 2.1.2.11 Inspections related to the section name attribute

In [None]:
val countOfSectionName = articlesDf.valueCounts { section_name }

countOfSectionName.describe()

In [None]:
countOfSectionName.plot {
    bars {
        x(section_name)
        y(count)
    }
    layout {
        size = 1000 to 500
    }
}

##### 2.1.2.12 Inspections related to the garment group name attribute

In [None]:
val countOfGarmentGroupName = articlesDf.valueCounts { garment_group_name }

countOfGarmentGroupName.describe()

In [None]:
countOfGarmentGroupName.plot {
    bars {
        x(garment_group_name)
        y(count)
    }
    layout {
        size = 1000 to 500
    }
}

#### 2.1.3 Inspections related to multiple attribute sets

##### 2.1.3.1 Inspections related to the relationship of color related attributes

There are multiple color related attributes in the dataframe: perceived_colour_master_name, perceived_colour_value_name, colour_group_name. Let's inspect the relationship between them!

As can be seen from the previous queries related to these attributes, their meanings can be defined as follows:
- perceived_colour_value_name: shade attribute
- perceived_colour_master_name: color attribute
- colour_group_name: a color attribute with more unique values than perceived_colour_master_name

Let's compare the sizes of these attribute sets!
1. perceived_colour_value_name: 8
2. perceived_colour_master_name: 20
3. colour_group_name: 50

In [None]:
val countOfMasterAndValueColorVariables = articlesDf.valueCounts { perceived_colour_master_name and perceived_colour_value_name }
val countOfMasterAndGroupColorVariables = articlesDf.valueCounts { perceived_colour_master_name and colour_group_name }
val countOfValueAndGroupColorVariables = articlesDf.valueCounts { perceived_colour_value_name and colour_group_name }

In [None]:
countOfMasterAndValueColorVariables.describe()

In [None]:
countOfMasterAndValueColorVariables.plot { 
    points { 
        x(perceived_colour_value_name)
        y(perceived_colour_master_name)
        color(count)
        size = 5.0
     }
     layout {
         size = 1000 to 500
     }
 }

Let's remove the perceived_colour_master_name values with the highest counts to learn more about the attributes with average counts!

In [None]:
countOfMasterAndValueColorVariables.drop { perceived_colour_master_name in setOf("Black", "Blue", "White") }.plot { 
    points { 
        x(perceived_colour_master_name)
        y(perceived_colour_value_name)
        color(count)
        size = 5.0
     }
     layout {
         size = 1000 to 500
     }
 }

In [None]:
countOfMasterAndGroupColorVariables.describe()

In [None]:
countOfMasterAndGroupColorVariables.plot { 
    points { 
        x(perceived_colour_master_name)
        y(colour_group_name)
        color(count)
        size = 5.0
     }
     layout {
         size = 1000 to 1000
     }
 }

Before looking at the data, I would have thought that perceived_colour_master_name values would be paired only with similar values from colour_group_name, for example blue with dark blue. But as we can see, there are some surprising combinations, like blue with grey.

Let's again remove the values with the highest counts to get a better visualization of the average values!

In [None]:
countOfMasterAndGroupColorVariables.drop { perceived_colour_master_name in setOf("Black", "Blue", "White") }.plot {
    points {
        x(perceived_colour_master_name)
        y(colour_group_name)
        color(count)
        size = 5.0
    }
    layout {
        size = 1000 to 1000
    }
}

In [None]:
countOfValueAndGroupColorVariables.describe()

In [None]:
countOfValueAndGroupColorVariables.plot { 
    points { 
        x(perceived_colour_value_name)
        y(colour_group_name)
        color(count)
        size = 5.0
     }
     layout {
         size = 1000 to 1000
     }
 }

Let's once more remove the values with the highest counts to get a better visualization of the average values!

In [None]:
countOfValueAndGroupColorVariables.drop { perceived_colour_value_name in setOf("Dark", "Light") }.plot { 
    points { 
        x(perceived_colour_value_name)
        y(colour_group_name)
        color(count)
        size = 5.0
     }
     layout {
         size = 1000 to 1000
     }
 }

##### 2.1.3.2 Inspections related to the relationship of the grouping related attributes

There are many attributes that characterize the articles by target audience age group, body part, gender and use case. Let's inspect these attributes and try to find the relationships between them!

As can be seen from the previous queries related to these attributes, they can be described as follows:
- index_group_name: classifies the articles by age, gender and use case in a general way
- index_name: classifies the articles by age, gender and use case in a more detailed way than index_group_name
- section_name: classifies the articles by age, gender, use case
- department_name: classifies the articles by age, gender, use case and body part
- garment_group_name: classifies the articles by body part, gender and use case
- product_type_name: classifies the articles by body part and use case in a more specific way
- product_group_name: classifies the articles by body part and use case in a rather general way

Let's compare the sizes of these attribute sets!
- index_group_name: 5
- index_name: 10
- section_name: 56
- department_name: 250
- garment_group_name: 21
- product_type_name: 131
- product_group_name: 19

So the attribute sets' ascending order by size is the following:
1. index_group_name (5)
2. index_name (10)
3. product_group_name (19)
4. garment_group_name (21)
5. section_name (56)
6. product_type_name (131)
7. department_name (250)

In [None]:
val countOfIndexGroupAndNameAttributes = articlesDf.valueCounts { index_group_name and index_name }
val countOfIndexGroupNameAndSectionNameAttributes = articlesDf.valueCounts { index_group_name and section_name }

val departmentNamesWithGarmentGroups = articlesDf.select { department_name and garment_group_name }.distinct()

val countOfProductTypeAndGroupNameAttributes = articlesDf.valueCounts { product_type_name and product_group_name }

val countOfSectionNameAndGarmentGroupAttributes = articlesDf.valueCounts { section_name and garment_group_name }

val countOfProductAndGarmentGroupAttributes = articlesDf.valueCounts { product_group_name and garment_group_name }

val countOfIndexAndProductGroupNameAttributes = articlesDf.valueCounts { index_group_name and product_group_name }

Let's first compare index group and index attributes!

In [None]:
countOfIndexGroupAndNameAttributes.describe { count }

In [None]:
countOfIndexGroupAndNameAttributes.plot {
    points { 
        x(index_name)
        y(index_group_name)
        color(count)
        size = 5.0
     }
     layout {
         size = 1000 to 500
     }
}

As can be seen from the chart, there is a strong connection between the two attributes.

Next, let's check index group and section name attribute sets.

In [None]:
countOfIndexGroupNameAndSectionNameAttributes.describe()

In [None]:
countOfIndexGroupNameAndSectionNameAttributes.plot { 
    points { 
        x(index_group_name)
        y(section_name)
        color(count)
        size = 5.0
     }
     layout {
         size = 1000 to 900
     }
}

Every value of section_name belongs to only one value of index_group_name, except "Ladies Denim", which is paired not only with the "Ladieswear" but also with the "Divided" index_group_name value!
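This kind of "one parent only" observation can also be checked programmatically instead of reading it off the chart. Below is a minimal sketch in plain Kotlin that works on (child, parent) pairs; the function name `childrenWithSeveralParents` and the placeholder values "Section A" and "Group X" are hypothetical, only the "Ladies Denim" pairing comes from the finding above.

```kotlin
/**
 * Hypothetical helper: for (child, parent) pairs, returns the children that occur
 * with more than one distinct parent, together with those parents.
 */
fun childrenWithSeveralParents(pairs: List<Pair<String, String>>): Map<String, Set<String>> =
    pairs.groupBy({ it.first }, { it.second })
        .mapValues { (_, parents) -> parents.toSet() }
        .filterValues { it.size > 1 }

// Illustrative input: one ambiguous section and one placeholder section with a single parent.
val sectionToIndexGroup = listOf(
    "Ladies Denim" to "Ladieswear",
    "Ladies Denim" to "Divided",
    "Section A" to "Group X",
)
println(childrenWithSeveralParents(sectionToIndexGroup))
```

The pairs could presumably be built from the rows of `countOfIndexGroupNameAndSectionNameAttributes`, and the same helper would apply to the department, garment group and product type comparisons that follow.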

Next, let's check out department and garment group attribute sets!

In [None]:
departmentNamesWithGarmentGroups.describe()

In [None]:
departmentNamesWithGarmentGroups.filter { department_name == "OL Extended Sizes" }

Every value of department_name belongs to only one value of garment_group_name, except "OL Extended Sizes", which is paired not only with the "Trousers Denim" but also with the "Jersey Basic" garment_group_name value!

Next let's see product_type_name and product_group_name! They have similar names, so I suspect we will find some kind of clear relationship between the values!

In [None]:
countOfProductTypeAndGroupNameAttributes.describe()

As in the previous cases, every value of product_type_name belongs to only one value of product_group_name, except the "Umbrella" value!

In [None]:
countOfProductTypeAndGroupNameAttributes.filter { product_type_name == "Umbrella" }

In [None]:
countOfProductTypeAndGroupNameAttributes.plot { 
    points { 
        x(product_group_name)
        y(product_type_name)
        color(count)
        size = 5.0
     }
     layout {
         size = 1000 to 900
     }
}

Next let's check section name and garment group attributes! As I understood the meaning of values, I reckon we should see some kind of relationship as well!

In [None]:
countOfSectionNameAndGarmentGroupAttributes.describe()

In [None]:
countOfSectionNameAndGarmentGroupAttributes.plot { 
    points { 
        x(garment_group_name)
        y(section_name)
        color(count)
        size = 5.0
     }
     layout {
         size = 1000 to 900
     }
}

There are some logical pairings between the values, but in this case the relationship is not as simple as the ones we've seen before!

Next let's inspect product_group_name and garment_group_name attribute sets!

In [None]:
countOfProductAndGarmentGroupAttributes.describe()

In [None]:
countOfProductAndGarmentGroupAttributes.plot {
    points { 
        x(product_group_name)
        y(garment_group_name)
        color(count)
        size = 5.0
     }
     layout {
         size = 1000 to 900
     }
}

Again, the relationship between these attribute sets is not as simple as in the first cases!

Lastly let's inspect index_group_name and product_group_name attribute sets!

In [None]:
countOfIndexAndProductGroupNameAttributes.describe()

In [None]:
countOfIndexAndProductGroupNameAttributes.plot { 
    points { 
        x(index_group_name)
        y(product_group_name)
        color(count)
        size = 5.0
     }
     layout {
         size = 1000 to 900
     }
}

Yet again a rather complex relation!

Based on these, the attribute sets can be broken down into three groups. Within each group, for every pair of attribute sets, the elements of one are more specific formulations of the elements of the other:
1. Group 1
    - index_group_name
    - index_name
    - section_name
2. Group 2
    - garment_group_name
    - department_name
3. Group 3
    - product_group_name
    - product_type_name

##### 2.1.3.3 Inspections related to the relationship between graphical appearance and color attributes

In [None]:
val countOfPerceivedColorValueAndGraphicalAppearanceAttributes = articlesDf.valueCounts { graphical_appearance_name and perceived_colour_value_name }

In [None]:
countOfPerceivedColorValueAndGraphicalAppearanceAttributes.describe()

In [None]:
countOfPerceivedColorValueAndGraphicalAppearanceAttributes.plot {
    points { 
        x(perceived_colour_value_name)
        y(graphical_appearance_name)
        color(count)
     }
     layout {
         size = 1000 to 800
     }
}

There are a lot of solid dark articles! We can also see that there are some articles for every meaningful attribute combination!

### 2.2 Examination specifically related to customers

#### 2.2.1 Basic inspections about the customers

Let's read the first items in the customers dataframe!

In [None]:
customersDf.head()

Let's check if there is any duplicate row!

In [None]:
customersDf.describe()

We can see from the unique count and count attribute values of customer_id, that there is no duplicated row in the dataframe!

#### 2.2.2 Inspections related to the age attribute

Let's inspect the age attribute of customers!

In [None]:
customersDf.describe { age }

This is fascinating! The youngest user is 16 years old while the oldest is 99 (assuming the users are honest about their ages)! I thought the user base would be much younger; I would have guessed that the median and mean values are lower than 30! The most common age is 21, which is not surprising in my opinion! The standard deviation also looks sensible!

Let's see how many people are in the age groups! Let's also create a bar chart to visualize this information!

In [None]:
val countOfAge = customersDf.valueCounts { age }.sortBy { age }

countOfAge.describe()

In [None]:
countOfAge.plot { 
    bars { 
        x(age)
        y(count)
     }
 }

The most surprising part of this bar chart is the dip around the 40-year-olds!

#### 2.2.3 Inspections related to the postal code attribute

It could be interesting to see how users are distributed according to their place of residence!

In [None]:
customersDf.describe { postal_code }

It seems that the most populated settlement contains 120303 users! This seems like a lot! I wonder if this information is accurate! Also, the users are spread over 352899 unique postal codes, which is also a huge number!

In [None]:
val countOfPostalCode = customersDf.valueCounts { postal_code }
countOfPostalCode

Based on the huge difference between the most and the second most populated settlements, it seems something is definitely wrong with the settlement that is inhabited by the most people!

Next, let's analyze how many customers usually live in the same settlement!

In [None]:
countOfPostalCode.describe { count }

In [None]:
// valueCounts sorts by count in descending order by default, so the list starts with the largest value.
val customerPopulationValuesBySettlements = countOfPostalCode.values { count }.toList()

plot(
    mapOf(
        "x" to listOf("customers"),
        "min" to listOf(customerPopulationValuesBySettlements.last()),
        "lower" to listOf(customerPopulationValuesBySettlements[customerPopulationValuesBySettlements.size / 4 * 3]),
        "middle" to listOf(customerPopulationValuesBySettlements[customerPopulationValuesBySettlements.size / 2]),
        "upper" to listOf(customerPopulationValuesBySettlements[customerPopulationValuesBySettlements.size / 4]),
        "max" to listOf(customerPopulationValuesBySettlements.first()),
    )
) {
    boxplot { 
        x("x"<String>())
        yMin("min"<Int>())
        lower("lower"<Int>())
        middle("middle"<Int>())
        upper("upper"<Int>())
        yMax("max"<Int>())
     }
     y {
         axis.name = "population"
     }
 }
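The map passed to `plot` above picks the box plot statistics by position from a list that is sorted from the largest to the smallest value, which is why the lower quartile sits three quarters into the list. The same indexing is sketched below as a small plain-Kotlin helper; `FiveNumberSummary` and `fiveNumberSummary` are hypothetical names, and the positions are simple index-based approximations of the quartiles rather than interpolated percentiles.

```kotlin
/** Hypothetical container for the five values a box plot needs. */
data class FiveNumberSummary(val min: Int, val lower: Int, val middle: Int, val upper: Int, val max: Int)

/** Expects [valuesDescending] to be non-empty and sorted from largest to smallest. */
fun fiveNumberSummary(valuesDescending: List<Int>): FiveNumberSummary {
    require(valuesDescending.isNotEmpty()) { "At least one value is required" }
    val n = valuesDescending.size
    return FiveNumberSummary(
        min = valuesDescending.last(),
        // In a descending list the lower quartile is found three quarters of the way in.
        lower = valuesDescending[n / 4 * 3],
        middle = valuesDescending[n / 2],
        upper = valuesDescending[n / 4],
        max = valuesDescending.first(),
    )
}

// Illustrative input, not taken from the dataset.
println(fiveNumberSummary(listOf(8, 7, 6, 5, 4, 3, 2, 1)))
```

Dropping the largest value before computing the summary, for example with `fiveNumberSummary(values.drop(1))`, is one way to exclude a single outlier; note that this shifts the quartile positions by one element compared with indexing into the full list.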

Let's remove the outlier maximum value hoping that way we get a more meaningful chart!

In [None]:
plot(
    mapOf(
        "x" to listOf("customers"),
        // The list is sorted in descending order: the last element is the minimum,
        // and index 1 is the largest value once the outlier at index 0 is skipped.
        "min" to listOf(customerPopulationValuesBySettlements.last()),
        "lower" to listOf(customerPopulationValuesBySettlements[customerPopulationValuesBySettlements.size / 4 * 3]),
        "middle" to listOf(customerPopulationValuesBySettlements[customerPopulationValuesBySettlements.size / 2]),
        "upper" to listOf(customerPopulationValuesBySettlements[customerPopulationValuesBySettlements.size / 4]),
        "max" to listOf(customerPopulationValuesBySettlements[1]),
    )
) {
    boxplot { 
        x("x"<String>())
        yMin("min"<Int>())
        lower("lower"<Int>())
        middle("middle"<Int>())
        upper("upper"<Int>())
        yMax("max"<Int>())
     }
     y {
         axis.name = "population"
     }
 }

It seems that in the majority of cases only a few customers share the same postal code!

### 2.3 Examination specifically related to transactions

First, let's look at summary statistics of the transactions dataframes!

#### 2.3.1 Inspection of transactions made in the first year period

In [None]:
firstYeartransactionsDf.describe()

A lot of interesting information can be read from this; for instance, at least one user made 897 transactions in the inspected time period!

In [None]:
val firstYearTransactionsDailyCount = firstYeartransactionsDf.valueCounts { t_dat }

firstYearTransactionsDailyCount.describe()

In [None]:
firstYearTransactionsDailyCount.plot { 
    line {
        x(t_dat)
        y(count)
    }
 }

Wow, holidays have an enormous impact on the sales!

In [None]:
val firstYearTransactionsDailyAmount = firstYeartransactionsDf.groupBy { t_dat }.sum { price }

firstYearTransactionsDailyAmount.describe()

In [None]:
firstYearTransactionsDailyAmount.plot { 
    line {
        x(t_dat)
        y(price)
    }
 }

This graph looks really similar to the one that measures the count of the transactions! There is probably a strong correlation between them, let's calculate it!

In [None]:
val firstYearTransactionsCountAndAmounts = firstYearTransactionsDailyCount.join(firstYearTransactionsDailyAmount) { t_dat }

firstYearTransactionsCountAndAmounts.head()

In [None]:
firstYearTransactionsCountAndAmounts.corr { count }.with { price }

As expected, the correlation between the count and amount of transactions is indeed very high!
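For readers who want to see what such a coefficient measures, here is a minimal sketch of the Pearson correlation coefficient in plain Kotlin. It assumes that the dataframe `corr` operation computes the Pearson coefficient; the function `pearsonCorrelation` is a hypothetical helper written for illustration, not the library's implementation.

```kotlin
import kotlin.math.sqrt

/**
 * Hypothetical helper: Pearson correlation of two equally sized samples,
 * i.e. the covariance divided by the product of the standard deviations.
 */
fun pearsonCorrelation(xs: List<Double>, ys: List<Double>): Double {
    require(xs.size == ys.size && xs.size > 1) { "Need two equally sized lists with at least two values" }
    val meanX = xs.average()
    val meanY = ys.average()
    var covariance = 0.0
    var varianceX = 0.0
    var varianceY = 0.0
    for (i in xs.indices) {
        val dx = xs[i] - meanX
        val dy = ys[i] - meanY
        covariance += dx * dy
        varianceX += dx * dx
        varianceY += dy * dy
    }
    // The 1/n factors of covariance and variances cancel out, so they are omitted.
    return covariance / sqrt(varianceX * varianceY)
}

// Illustrative input: perfectly proportional series.
println(pearsonCorrelation(listOf(1.0, 2.0, 3.0), listOf(2.0, 4.0, 6.0)))
```

A value close to 1 means that days with many transactions also tend to be days with a high total amount; the coefficient is undefined when one of the series is constant.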

Next, let's inspect how many transactions an ordinary user makes!

In [None]:
val firstYearTransactionsCountOfTransactionsForCustomers = customersDf.select { customer_id }.leftJoin(firstYeartransactionsDf.valueCounts { customer_id }) { customer_id }.fillNulls { colsOf<Int?>() }.withZero()

firstYearTransactionsCountOfTransactionsForCustomers.describe()

Now let's check the same without the customers that made no transactions in the inspected time period!

In [None]:
firstYearTransactionsCountOfTransactionsForCustomers.drop { count == 0 }.describe()

Let's inspect how transactions are distributed between the age groups of users!

In [None]:
val firstYearTransactionsCountOfTransactionsForCustomerAge = customersDf.select { customer_id and age }.join(firstYeartransactionsDf.valueCounts { customer_id }) { customer_id }.valueCounts { age }

firstYearTransactionsCountOfTransactionsForCustomerAge.describe()

In [None]:
firstYearTransactionsCountOfTransactionsForCustomerAge.plot {
    bars { 
        x(age)
        y(count1)
     }
}

This graph looks quite similar to the customer age distribution graph. It seems that regardless of age, customers buy in similar quantities. Let's calculate the correlation to check this statement!

In [None]:
countOfAge.join(firstYearTransactionsCountOfTransactionsForCustomerAge) { age }.corr { count }.with { "count1"<Int>() }

As expected, the correlation is indeed quite high!

In the following cells, let's inspect what kinds of articles are sold and in what quantities!

In [None]:
val firstTransactionsCountOfPurchasedArticles = articlesDf.select { article_id }.leftJoin(firstYeartransactionsDf.valueCounts { article_id }) { article_id }.fillNulls { colsOf<Int?>() }.withZero()

firstTransactionsCountOfPurchasedArticles.describe()

In [None]:
firstTransactionsCountOfPurchasedArticles.sortBy { count.desc() }.head().join(articlesDf) { article_id }

In [None]:
articlesDf.select { article_id and index_group_name }.join(firstYeartransactionsDf) { article_id }.valueCounts { index_group_name }.plot {
    bars { 
        x(index_group_name)
        y("count"<Int?>())
     }
}

In [None]:
articlesDf.select { article_id and section_name }.join(firstYeartransactionsDf) { article_id }.valueCounts { section_name }.plot {
    bars { 
        x(section_name)
        y("count"<Int?>())
     }
     layout {
         size = 800 to 600
     }
}

As I was expecting, the women's sections are dominating!

In [None]:
articlesDf.select { article_id and garment_group_name }.join(firstYeartransactionsDf) { article_id }.valueCounts { garment_group_name }.plot {
    bars { 
        x(garment_group_name)
        y("count"<Int?>())
     }
     layout {
         size = 800 to 600
     }
}

In [None]:
articlesDf.select { article_id and product_group_name }.join(firstYeartransactionsDf) { article_id }.valueCounts { product_group_name }.plot {
    bars { 
        x(product_group_name)
        y("count"<Int?>())
     }
     layout {
         size = 800 to 600
     }
}

As I expected, most purchases are upper body clothes.

#### 2.3.2 Inspection of transactions made in the second year period

Let's do similar inspections on the data of the second year period!

In [None]:
secondYeartransactionsDf.describe()

In [None]:
val secondYearTransactionsDailyCount = secondYeartransactionsDf.valueCounts { t_dat }

secondYearTransactionsDailyCount.describe()

In [None]:
secondYearTransactionsDailyCount.plot { 
    line {
        x(t_dat)
        y(count)
    }
 }

This graph seems quite similar to the previous time period's!

In [None]:
val secondTransactionsCountOfPurchasedArticles = articlesDf.select { article_id }.leftJoin(secondYeartransactionsDf.valueCounts { article_id }) { article_id }.fillNulls { colsOf<Int?>() }.withZero()

secondTransactionsCountOfPurchasedArticles.describe()

In [None]:
secondTransactionsCountOfPurchasedArticles.sortBy { count.desc() }.head().join(articlesDf) { article_id }

In [None]:
articlesDf.select { article_id and index_group_name }.join(secondYeartransactionsDf) { article_id }.valueCounts { index_group_name }.plot {
    bars { 
        x(index_group_name)
        y("count"<Int?>())
     }
}

In [None]:
articlesDf.select { article_id and section_name }.join(secondYeartransactionsDf) { article_id }
    .valueCounts { section_name }.plot {
    bars {
        x(section_name)
        y("count"<Int?>())
    }
    layout {
        size = 800 to 600
    }
}

In [None]:
articlesDf.select { article_id and garment_group_name }.join(secondYeartransactionsDf) { article_id }.valueCounts { garment_group_name }.plot {
    bars { 
        x(garment_group_name)
        y("count"<Int?>())
     }
     layout {
         size = 800 to 600
     }
}

In [None]:
articlesDf.select { article_id and product_group_name }.join(secondYeartransactionsDf) { article_id }.valueCounts { product_group_name }.plot {
    bars { 
        x(product_group_name)
        y("count"<Int?>())
     }
     layout {
         size = 800 to 600
     }
}

These statistics also do not seem to differ much from those of the previous time period!