Classic machine learning approaches tested on the diamonds dataset from Kaggle. To run the scripts, download the entire folder and place the diamonds dataset inside it.
Data are loaded with loading.m, which also checks for NaNs and outliers, and outputs the dataset together with a table of general statistics. No NaNs were detected; every diamond whose x, y or z dimension (and hence volume) equals 0 was discarded.
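A minimal sketch of this cleaning step, assuming the Kaggle CSV is named diamonds.csv and sits in the working folder (the file name is an assumption, not taken from the scripts):

```matlab
T = readtable('diamonds.csv');           % load the Kaggle dataset
assert(~any(any(ismissing(T))));         % no NaNs were detected
T = T(T.x > 0 & T.y > 0 & T.z > 0, :);   % drop diamonds with a zero dimension
summary(T)                               % general statistics, as in the table below
```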
| | carat | depth (%) | table (%) | price (USD) | x (mm) | y (mm) | z (mm) |
|---|---|---|---|---|---|---|---|
| mean | 0.7979 | 61.7494 | 57.4572 | 3932.7997 | 5.7312 | 5.7345 | 3.5387 |
| max | 5.01 | 79 | 95 | 18823 | 10.74 | 58.9 | 31.8 |
| min | 0.20 | 43 | 43 | 326 | 0 | 0 | 0 |
| std | 0.4740 | 1.4326 | 2.2345 | 3989.4397 | 1.1218 | 1.1421 | 0.7057 |
The first step is to study the feature distributions, then how the features influence the price. The price follows a lognormal distribution, which is common in money-related datasets. Indeed, two types of diamonds can be distinguished: those bought by the general public for big occasions (e.g. weddings), which have lower prices, and those bought by wealthier customers. Given this distribution, it is useful to work with log10(price) in order to linearize the relations.
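A minimal sketch of this check, assuming the cleaned table T from the snippet above:

```matlab
figure;
subplot(1,2,1); histogram(T.price, 50);         % heavily right-skewed, lognormal-like
xlabel('price (USD)'); ylabel('count');
subplot(1,2,2); histogram(log10(T.price), 50);  % two Gaussian-like modes appear
xlabel('log_{10}(price)');
```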
Data_analysis.m also contains boxplots and scatter matrices.
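A sketch of that kind of plot, again assuming the table T (the exact calls in Data_analysis.m may differ):

```matlab
figure; boxplot(T.price, T.cut);                          % price per cut quality
figure; plotmatrix(T{:, {'carat','depth','table','x','y','z'}});  % pairwise scatter
```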
Three different models were trained:
- LR1: plain linear regression on the raw features, used as a baseline.
- LR2: regression with log10(price) as the target, using log10(carat) and volume as predictors. Diamonds are approximated as pyramids, so a new feature, volume, is derived from their physical dimensions. To boost linear-regression performance the relationship should be as linear as possible, hence log10(carat) and log10(price) are used (see the first sketch after this list).
- LR3: regression with clustering. Two distinct Gaussian modes are visible in log10(price), so the diamonds were clustered on their features alone with k-means. Two separate regressions were then trained, one per mode. At test time, each sample was assigned to the cluster whose centroid was closest (L2 norm) and evaluated with the regression trained on that cluster (see the second sketch after this list).
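A minimal sketch of the LR2 setup, assuming the cleaned table T; vol, X, mdl2 and pred are illustrative names, not the ones used in the scripts:

```matlab
vol  = (T.x .* T.y .* T.z) / 3;          % pyramid approximation of the volume
X    = [log10(T.carat), vol];            % linearized predictors
mdl2 = fitlm(X, log10(T.price));         % linear regression on log10(price)
pred = 10 .^ predict(mdl2, X);           % back to the original price scale
```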
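And a sketch of the LR3 pipeline, assuming a numeric training feature matrix F, a test matrix Ftest and targets ylog = log10(price); all names are illustrative:

```matlab
k = 2;
[idx, C] = kmeans(F, k);                        % cluster the training diamonds
mdl = cell(k, 1);
for c = 1:k                                     % one regression per cluster
    mdl{c} = fitlm(F(idx == c, :), ylog(idx == c));
end
[~, cTest] = min(pdist2(Ftest, C), [], 2);      % nearest centroid, L2 norm
yhat = zeros(size(Ftest, 1), 1);
for c = 1:k
    sel = (cTest == c);
    yhat(sel) = predict(mdl{c}, Ftest(sel, :));  % per-cluster prediction
end
```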
Despite all the effort put into the linear-regression models, a small neural network trained on the raw data easily outperforms them. The selected network has two hidden layers with 10 and 5 neurons respectively (a minimal sketch follows).
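A sketch of such a network, assuming the Deep Learning Toolbox; Xtrain, ytrain, Xtest and ytest are illustrative names holding one row per sample:

```matlab
net  = fitnet([10 5]);                   % two hidden layers: 10 and 5 neurons
net  = train(net, Xtrain', ytrain');     % fitnet expects one column per sample
yhat = net(Xtest')';
rmse = sqrt(mean((yhat - ytest).^2));    % same metric as the table below
```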
| stats | LR1 | LR2 | LR3 | NN |
|---|---|---|---|---|
| RMSE (USD) | 1200.7552 | 926.0462 | 839.6201 | 559.1686 |
Since the results are computed on a randomly drawn validation set, each run will give slightly different numbers.