# H&M Personalized Fashion Recommendations exploratory data analysis (EDA)

The main goal of the Kotlin Clothing Webshop repository (as its name probably makes clear) is to contain a webshop implementation written mostly in Kotlin. Nowadays, one of the key features of a webshop is a recommendation system, which helps users find relevant articles.

This notebook contains an exploratory data analysis of the [H&M Personalized Fashion Recommendations dataset](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/). The dataset contains information about users, articles and transactions. It is well suited for developing and inspecting my recommendation system solution, assuming that users behave similarly across different clothing webshop applications.

This notebook is written in Kotlin. Refer to the [official documentation](https://kotlinlang.org/docs/data-science-overview.html) to learn more about Kotlin notebooks.

## 0. Import dependencies

This notebook will use the following dependencies:
- [dataframe](https://github.com/Kotlin/dataframe) is used to read the structured CSV data files and to manipulate them
- [kandy](https://github.com/Kotlin/kandy) is used to visualize the data

In [None]:
%useLatestDescriptors
%use dataframe, kandy

## 1. Load data

To use this notebook, the dataset has to be downloaded first. As stated above, this notebook inspects the [H&M Personalized Fashion Recommendations dataset](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/). The dataset is hosted on Kaggle; it can be downloaded from the website or with the [Kaggle API](https://github.com/Kaggle/kaggle-api).

Before loading the datasets, the maximum heap size should be set to a suitable value to ensure that enough memory is available for the dataframes. If the heap is too small, an OutOfMemoryError is thrown with a message about insufficient heap space.
- To set the heap size in Kotlin Notebook, first open *Kotlin Notebook Settings*. You can open it with the gear icon next to the cell type selector dropdown, or you can find it with the Find Action view. There you can specify the maximum heap size.
- To set the heap size using other clients, refer to [Kotlin Jupyter kernel's documentation](https://github.com/Kotlin/kotlin-jupyter#other-clients)

**When I was running this notebook, I set the maximum heap size to 16 384 MiB (16 GiB).**

Run the cell below to read the maximum heap size that is currently in effect. The value is reported in bytes.

In [None]:
Runtime.getRuntime().maxMemory()
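
Since the cell above returns a raw byte count, a small conversion makes it easier to compare with the MiB value entered in the settings. The following is a minimal sketch; `bytesToMiB` and `maxHeapMiB` are hypothetical names that are not used elsewhere in this notebook, and the value reported by the JVM may differ slightly from the configured one.

```kotlin
// Converts a byte count (as returned by Runtime.maxMemory()) to whole mebibytes.
fun bytesToMiB(bytes: Long): Long = bytes / (1024L * 1024L)

// Heap limit of the current JVM, expressed in MiB.
val maxHeapMiB: Long = bytesToMiB(Runtime.getRuntime().maxMemory())
```

Evaluating `maxHeapMiB` in a cell shows the limit in the same unit as the setting.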

After the download has finished successfully, change the value of the "pathToDownloadedCsvFiles" variable in the cell below to the path of the directory that contains the downloaded dataset on your local machine.

In [None]:
val pathToDownloadedCsvFiles = "C:\\Sajat\\Egyetem\\MSc\\Onallo\\HM_dataset"

Then the CSV files can be imported into dataframes.

TODO add quick explanation and inspect unneeded columns

In [None]:
val articlesDf = DataFrame.readCSV(
    fileOrUrl = pathToDownloadedCsvFiles + "\\articles.csv",
)
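
The cells in this section build file paths with a hard-coded Windows separator (`"\\"`). On other operating systems, a separator-independent variant may be more convenient. The following is only a sketch: `csvPath` is a hypothetical helper that is not part of this notebook, and the directory name is a placeholder.

```kotlin
import java.nio.file.Paths

// Hypothetical helper (not part of the notebook): joins the dataset directory
// and a file name using the separator of the current operating system.
fun csvPath(directory: String, fileName: String): String =
    Paths.get(directory, fileName).toString()

// Placeholder directory name, used only for illustration.
val examplePath: String = csvPath("HM_dataset", "articles.csv")
```

The resulting string could then be passed as the `fileOrUrl` argument instead of the concatenated path.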

Let's load the customer dataframe!

In [None]:
var customersDf = DataFrame.readCSV(
    fileOrUrl = pathToDownloadedCsvFiles + "\\customers.csv",
    colTypes = mapOf(
        "customer_id" to ColType.String,
        "age" to ColType.Int,
        "postal_code" to ColType.String,
    ),
    charset = Charsets.US_ASCII,
)

The customers.csv file contains the following columns:
- customer_id
- FN
- Active
- club_member_status
- fashion_news_frequency
- age
- postal_code

Of these columns, **customer_id**, **age** and **postal_code** seem interesting, because these attributes can be present in the database of any clothing webshop. Let's do a projection on the dataframe to keep only the relevant attributes!

In [None]:
customersDf = customersDf.remove { FN and Active and club_member_status and fashion_news_frequency }

TODO add explanation and remove header and skipLines attributes, also check that whether the unneeded sales_channel_id is removed or not

In [None]:
import java.util.*

var transactionsDf = DataFrame.readCSV(
    fileOrUrl = pathToDownloadedCsvFiles + "\\transactions_train.csv",
    header = listOf("t_dat", "customer_id", "article_id", "price"),
    colTypes = mapOf(
        "t_dat" to ColType.LocalDate,
        "customer_id" to ColType.String,
        "article_id" to ColType.Int,
        "price" to ColType.String,  // In the following cell converted to BigDecimal!
    ),
    charset = Charsets.US_ASCII,
    skipLines = 1,  // First line contains the column names and so should be skipped
)

The conversion of the price from String to BigDecimal is placed in a separate cell, because when I tried to do it in the readCSV call, I got an OutOfMemoryError. I could increase the maximum heap size, but I don't want to solve it that way, because in my opinion a 16 GiB maximum heap is already really high.

In [None]:
transactionsDf = transactionsDf.convert { price }.with { it.toBigDecimal() }
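
The notebook does not state why BigDecimal was chosen for the price. One plausible motivation is that decimal fractions stay exact when they are added up, which is not the case for Double. The sketch below illustrates this with made-up price strings; the names `samplePrices`, `doubleSum` and `decimalSum` are assumptions for illustration only.

```kotlin
import java.math.BigDecimal

// Made-up price strings; the real values come from transactions_train.csv.
val samplePrices: List<String> = listOf("0.1", "0.2")

// Summing as Double accumulates binary rounding error.
val doubleSum: Double = samplePrices.sumOf { it.toDouble() }

// Summing as BigDecimal keeps the decimal digits exact.
val decimalSum: BigDecimal = samplePrices.sumOf { it.toBigDecimal() }
```

Here `decimalSum` is exactly 0.3, while `doubleSum` is only close to it.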

## 2. Analyze data

In this part data wrangling and visualisation will be performed on the loaded dataframes.

### 2.1 Examination specifically related to articles

First, let's look at the first rows of the articles dataframe!

In [None]:
articlesDf.head()

### 2.2 Examination specifically related to customers

#### 2.2.1 Basic inspections about the customers

First, let's look at the first rows of the customers dataframe!

In [None]:
customersDf.head()

Let's check whether there are any duplicate rows!

In [None]:
val countOfCustomerRows = customersDf.count()
val countOfDistinctCustomerRows = customersDf.countDistinct()

println("Number of rows in customer dataframe: ${countOfCustomerRows}")
println("Number of distinct rows in customer dataframe: ${countOfDistinctCustomerRows}")
println("Is there any duplicate element in customer dataframe? ${countOfCustomerRows != countOfDistinctCustomerRows}")

There are no duplicates in the customer dataframe! This information can be useful for further investigations!
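
The idea behind the check above (compare the total number of rows with the number of distinct rows) can be sketched in plain Kotlin as well. The `Customer` class and the sample values below are made up for illustration and only stand in for the dataframe rows.

```kotlin
// Simplified stand-in for a row of the customers dataframe (made-up sample values).
data class Customer(val customerId: String, val age: Int?, val postalCode: String)

// True when at least two rows are equal in every field.
fun hasDuplicates(rows: List<Customer>): Boolean = rows.size != rows.distinct().size

val sampleCustomers = listOf(
    Customer("a1", 21, "p1"),
    Customer("b2", 45, "p2"),
    Customer("a1", 21, "p1"), // exact copy of the first row
)
val sampleHasDuplicates: Boolean = hasDuplicates(sampleCustomers)
```

Because data classes compare all their properties in `equals`, `distinct()` removes only rows that match in every column, which mirrors the row-level comparison used above.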

#### 2.2.2 Inspections related to the age attribute of the customers

Let's inspect the age attribute of customers!

In [None]:
customersDf.describe { age }

This is fascinating! The youngest user is 16 years old, while the oldest is 99! (Assuming the users are honest about their ages!) I thought the user base would be much younger; I would have guessed that the median and mean values were below 30! The most common age is 21, which is not surprising in my opinion! The standard deviation also looks sensible!

Let's see how many people are in the age groups! Let's also create a bar chart to visualize this information!

In [None]:
val customersByAge = customersDf.valueCounts() { age }.sortBy { age }
customersByAge
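
For readers who are less familiar with the dataframe API, the counting step can be approximated with the Kotlin standard library. This is a sketch on a made-up list of ages: `countByAge` is a hypothetical helper, and skipping missing ages is a choice made for this sketch.

```kotlin
// Counts how often each age occurs and orders the result by age.
// Missing (null) ages are skipped in this sketch.
fun countByAge(ages: List<Int?>): Map<Int, Int> =
    ages.filterNotNull().groupingBy { it }.eachCount().toSortedMap()

// Made-up sample ages, including one missing value.
val sampleAgeCounts: Map<Int, Int> = countByAge(listOf(21, 35, 21, null, 40, 35, 21))
```

Each key of the resulting map is an age and each value is the number of customers of that age, ordered by age like the dataframe above.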

In [None]:
customersByAge.plot {
    bars {
        x(age)
        y(count)
    }
}

The most surprising part of this bar chart is the dip around the 40-year-olds!

#### 2.2.3 Inspections related to the postal code attribute of the customers

It could be interesting to see how users are distributed according to their place of residence!

In [None]:
customersDf.describe { postal_code }

It seems that the most densely populated settlement contains 120303 users! That seems like a lot! I wonder if this information is accurate! Also, the users live in 352899 unique settlements, which is also a huge number!

In [None]:
val customersByPostalCode = customersDf.valueCounts { postal_code }
customersByPostalCode

Based on the huge difference between the most and the second most populated settlements, it seems something is definitely wrong with the settlement that is inhabited by the most people!

Next, let's analyze how many customers usually live in the same settlement!

In [None]:
customersByPostalCode.describe { count }

In [None]:
val customerPopulationValuesBySettlements = customersByPostalCode.values { count }.toList()

plot(
    mapOf(
        "x" to listOf("customers"),
        "min" to listOf(customerPopulationValuesBySettlements.last()),
        "lower" to listOf(customerPopulationValuesBySettlements[customerPopulationValuesBySettlements.size / 4 * 3]),
        "middle" to listOf(customerPopulationValuesBySettlements[customerPopulationValuesBySettlements.size / 2]),
        "upper" to listOf(customerPopulationValuesBySettlements[customerPopulationValuesBySettlements.size / 4]),
        "max" to listOf(customerPopulationValuesBySettlements.first()),
    )
) {
    boxplot {
        x("x"<String>())
        yMin("min"<Int>())
        lower("lower"<Int>())
        middle("middle"<Int>())
        upper("upper"<Int>())
        yMax("max"<Int>())
    }
    y {
        axis.name = "population"
    }
}

Let's remove the outlier maximum value, hoping that we get a more meaningful chart this way!

In [None]:
plot(
    mapOf(
        "x" to listOf("customers"),
        // The counts are sorted in descending order, so the smallest value is the last one
        "min" to listOf(customerPopulationValuesBySettlements.last()),
        "lower" to listOf(customerPopulationValuesBySettlements[customerPopulationValuesBySettlements.size / 4 * 3]),
        "middle" to listOf(customerPopulationValuesBySettlements[customerPopulationValuesBySettlements.size / 2]),
        "upper" to listOf(customerPopulationValuesBySettlements[customerPopulationValuesBySettlements.size / 4]),
        // Index 1 skips the outlier maximum, which sits at index 0
        "max" to listOf(customerPopulationValuesBySettlements[1]),
    )
) {
    boxplot {
        x("x"<String>())
        yMin("min"<Int>())
        lower("lower"<Int>())
        middle("middle"<Int>())
        upper("upper"<Int>())
        yMax("max"<Int>())
    }
    y {
        axis.name = "population"
    }
}

It seems that in the majority of cases only a few customers share the same postal code!
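
The box plots above pick their five values by index from a list of counts that is assumed to be sorted in descending order. The same idea can be sketched as a small standalone function. `BoxStats`, `boxStats` and the `skipLargest` parameter are hypothetical names introduced for illustration, the sample counts are made up, and the quartiles are simple index-based approximations rather than interpolated values.

```kotlin
// Values needed to draw one box: whisker ends, quartiles and median.
data class BoxStats(val min: Int, val lower: Int, val middle: Int, val upper: Int, val max: Int)

// Index-based approximation of the five-number summary.
// skipLargest (hypothetical parameter) drops that many of the largest values first,
// which can be used to ignore outliers such as one dominant postal code.
fun boxStats(values: List<Int>, skipLargest: Int = 0): BoxStats {
    val descending = values.sortedDescending().drop(skipLargest)
    require(descending.isNotEmpty()) { "No values left to summarize" }
    val n = descending.size
    return BoxStats(
        min = descending.last(),
        lower = descending[n * 3 / 4],  // first quartile: three quarters into a descending list
        middle = descending[n / 2],     // median
        upper = descending[n / 4],      // third quartile: one quarter into a descending list
        max = descending.first(),
    )
}

// Made-up counts with one dominant value, loosely mimicking the postal code counts.
val sampleCounts = listOf(100, 9, 8, 7, 6, 5, 4, 3)
val allStats = boxStats(sampleCounts)
val trimmedStats = boxStats(sampleCounts, skipLargest = 1)
```

Sorting inside the function means the caller does not have to rely on the order of the input list. With the sample counts, only the maximum changes between `allStats` and `trimmedStats`, which is the effect the second chart aims for.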

### 2.3 Examination specifically related to transactions

First, let's look at the first rows of the transactions dataframe!

In [None]:
transactionsDf.head()