# H&M Personalized Fashion Recommendations exploratory data analysis (EDA)

The main goal of the Kotlin Clothing Webshop repository (as its name suggests) is to host a webshop implementation written mostly in Kotlin. Nowadays, one of the key features of a webshop is a recommendation system, which helps users find relevant articles.

This notebook contains an exploratory data analysis of the [H&M Personalized Fashion Recommendations dataset](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/). The dataset contains information about users, articles and transactions. It is well suited for developing and inspecting my recommendation system solution, assuming that users behave similarly across different clothing webshop applications.

This notebook is written in Kotlin. Refer to the [official documentation](https://kotlinlang.org/docs/data-science-overview.html) to learn more about Kotlin notebooks.

## 0. Import dependencies

This notebook will use the following dependencies:
- [dataframe](https://github.com/Kotlin/dataframe) is used to read the structured CSV data files and manipulate them
- [lets-plot](https://github.com/JetBrains/lets-plot-kotlin) is used to visualize the data

In [None]:
%use dataframe, lets-plot

## 1. Load data

To use this notebook, you first need to download the dataset. As stated above, this notebook inspects the data of the [H&M Personalized Fashion Recommendations dataset](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/). The dataset is hosted on Kaggle and can be downloaded either from the website or with the [Kaggle API](https://github.com/Kaggle/kaggle-api).

Before the datasets can be loaded, the maximum heap size must be set high enough to hold the dataframes in memory. If there is not enough heap space, an `OutOfMemoryError` is thrown with a message about insufficient heap size.
- To set the heap size in Kotlin Notebook, first open *Kotlin Notebook Settings*. You can open it with the gear icon next to the cell type selector dropdown, or find it with the Find Action search. There you can specify the maximum heap size.
- To set the heap size using other clients, refer to the [Kotlin Jupyter kernel's documentation](https://github.com/Kotlin/kotlin-jupyter#other-clients).

**When I was running this notebook, I used 16 384 MiB (16 GiB).**

Run the cell below to read the current maximum heap size setting (the value is reported in bytes).

In [None]:
Runtime.getRuntime().maxMemory()
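
Because `maxMemory()` returns a raw byte count, it can be easier to compare against the setting above after converting it to MiB. The following is a minimal sketch; `bytesToMiB` is a hypothetical helper that is not part of the original notebook.

```kotlin
// Hypothetical helper (not part of the original notebook): converts a byte count to whole MiB.
fun bytesToMiB(bytes: Long): Long = bytes / (1024L * 1024L)

// Maximum amount of memory the JVM will attempt to use, expressed in MiB.
val maxHeapMiB = bytesToMiB(Runtime.getRuntime().maxMemory())
println("Maximum heap size: $maxHeapMiB MiB")
```

Depending on the JVM and garbage collector, the reported value may differ slightly from the configured maximum.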

Once the download has finished, change the value of the `pathToDownloadedCsvFiles` variable in the cell below to the path of the directory that contains the downloaded dataset on your local machine.

In [None]:
val pathToDownloadedCsvFiles = "C:\\Sajat\\Egyetem\\MSc\\Onallo\\HM_dataset"
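
The loading cells in this notebook join the directory and the file name with a Windows-style `\\` separator, which matches the path above. If the notebook is run on another operating system, the file paths could instead be built with `java.nio.file.Paths`, which uses the platform's own separator. This is only a sketch: `csvPath` and the `"HM_dataset"` directory are hypothetical names used for illustration.

```kotlin
import java.nio.file.Paths

// Hypothetical helper: joins a directory and a file name using the platform's path separator.
fun csvPath(directory: String, fileName: String): String =
    Paths.get(directory, fileName).toString()

// Made-up relative directory; replace it with the location of the downloaded dataset.
val exampleArticlesPath = csvPath("HM_dataset", "articles.csv")
println(exampleArticlesPath)
```

The resulting string could then be passed as the `fileOrUrl` argument of `DataFrame.readCSV`.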

Then the CSV files can be imported into dataframes.

In [None]:
val articlesDf = DataFrame.readCSV(
    fileOrUrl = pathToDownloadedCsvFiles + "\\articles.csv",
)

In [None]:
val customersDf = DataFrame.readCSV(
    fileOrUrl = pathToDownloadedCsvFiles + "\\customers.csv",
)

In [None]:
import java.util.*

var transactionsDf = DataFrame.readCSV(
    fileOrUrl = pathToDownloadedCsvFiles + "\\transactions_train.csv",
    header = listOf("t_dat", "customer_id", "article_id", "price"),
    colTypes = mapOf(
        "t_dat" to ColType.LocalDate,
        "customer_id" to ColType.String,
        "article_id" to ColType.Int,
        "price" to ColType.String,  // Converted to BigDecimal in the following cell
    ),
    charset = Charsets.US_ASCII,
    skipLines = 1,  // First line contains the column names and so should be skipped
)

The conversion of the price column from String to BigDecimal is done in a separate cell because, when I tried to do it inside the `readCSV` call, I got an `OutOfMemoryError`. I could have increased the maximum heap size further, but I did not want to solve it that way, because in my opinion a 16 GiB maximum heap size is already very high.

In [None]:
transactionsDf = transactionsDf.convert { price }.with { it.toBigDecimal() }
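
One common reason to prefer `BigDecimal` over `Double` for prices is that it keeps decimal values exactly as written; whether that was the motivation here is an assumption. The sketch below uses made-up values (not taken from the dataset) and only the Kotlin standard library to show the difference.

```kotlin
import java.math.BigDecimal

// Made-up decimal strings, not taken from the dataset.
val first: BigDecimal = "0.1".toBigDecimal()
val second: BigDecimal = "0.2".toBigDecimal()

// BigDecimal arithmetic keeps the decimal digits exactly, so the sum equals 0.3.
val exactSum: BigDecimal = first + second
println(exactSum == "0.3".toBigDecimal())  // prints "true"

// Double uses binary floating point, so the same sum picks up a small rounding error.
val doubleSumMatches: Boolean = (0.1 + 0.2 == 0.3)
println(doubleSumMatches)                  // prints "false"
```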

## 2. Analyze data

In this part, data wrangling and visualisation are performed on the loaded dataframes.

### 2.1 Basic inspections

First, let's inspect the first few rows and the columns of each dataframe!

In [None]:
articlesDf.head()

In [None]:
customersDf.head()

In [None]:
transactionsDf.head()