# Data exploration
This notebook aims to study the structure of the given raw data. Data cleaning, feature engineering and domain knowledge considerations are also part of this study.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from diamond import data

## Loading and preliminaries
Data is loaded and quickly inspected.

In [None]:
df = data.load_raw('datasets/diamonds/diamonds.csv')

print('Columns', df.dtypes, sep='\n')
df.head()

Are there missing values?

In [None]:
df.isna().any().any()
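
The single boolean above only says whether any value is missing. A minimal sketch of a per-column breakdown follows. It uses a small hypothetical frame (`toy`) instead of the notebook's `df`, so the column values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the notebook's `df`.
toy = pd.DataFrame({
    'carat': [0.23, np.nan, 0.31],
    'cut': ['Ideal', 'Premium', None],
    'price': [326.0, 327.0, 335.0],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Rows that contain at least one missing entry.
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(len(rows_with_missing), 'rows have at least one missing value')
```

Applied to the real data, the same two expressions would show which columns are affected and which rows to inspect.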

Visual inspection of the data shows that values carry only a few decimal places. All numeric measurements are converted to single precision floating point, which comes without loss of information at this precision. Prices are also converted to floating point values, so that machine learning tools later in the pipeline can handle them more easily.

In [None]:
df = df.astype({
    'carat': np.float32,
    'depth': np.float32,
    'table': np.float32,
    'price': np.float32,
    'x': np.float32,
    'y': np.float32,
    'z': np.float32,
})

In [None]:
df.dtypes
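
The claim that the downcast loses no information can be checked with a round trip. The sketch below is an illustration on hypothetical two-decimal values, not on the notebook's `df`: it casts to `float32`, rounds back to two decimals, and compares with the originals.

```python
import numpy as np
import pandas as pd

# Hypothetical two-decimal measurements, as they would be read from a CSV.
values = pd.Series([0.23, 3.95, 61.5, 55.0, 2.43], dtype=np.float64)

# Downcast to single precision, then round back to two decimals.
as_float32 = values.astype(np.float32)
recovered = as_float32.astype(np.float64).round(2)

# The cast is lossless for our purposes if rounding recovers the originals.
lossless = bool((recovered == values).all())
print('float32 round trip is lossless at two decimals:', lossless)
```

Single precision keeps about seven significant digits, so integer prices are represented exactly up to 2**24 (16,777,216).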

Let's see the total value of Don Francesco's investments. This will come in handy when making scalability considerations.

In [None]:
# Sum in double precision: float32 keeps only about seven significant digits.
print('total diamonds value:', df.price.astype(np.float64).sum())

## Cleanup
Some summary statistics are enough to immediately spot invalid samples. The *cut*, *color* and *clarity* features will be examined later, given that they are encoded as text labels.

In [None]:
df.describe()

From the source of the dataset ([diamonds.csv](datasets/diamonds/README.md)) we know that *depth* and *table* are computed as percentages. Actual measurements (*x*, *y*, *z*) are realistically represented in millimeters, according to [GIA] reports. Measures are reported with two decimals. Negative values and values below 0.01 can be discarded.

[GIA]: https://4cs.gia.edu/en-us/diamond-buying-guide/

In [None]:
df = df[(df.x >= 0.01) & (df.y >= 0.01) & (df.z >= 0.01) & (df.price > 0)]
df.describe()

Minimum measures and prices are now realistic. Indeed, it looks like the zero-sized and negatively priced diamonds were isolated dirty outliers.
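
To see how much each validity rule contributes to the cleanup, the filter can be split into one mask per rule. This is a minimal sketch on a hypothetical toy frame. The rule thresholds mirror the filter used in this section, while the data values are made up.

```python
import pandas as pd

# Hypothetical toy frame with a few invalid rows; not the notebook's data.
toy = pd.DataFrame({
    'x': [3.95, 0.0, 4.05, 4.20],
    'y': [3.98, 3.84, 4.07, 4.23],
    'z': [2.43, 2.31, 0.0, 2.63],
    'price': [326.0, 326.0, 327.0, -1.0],
})

# One boolean mask per validity rule, mirroring the filter used above.
rules = {
    'x >= 0.01': toy.x >= 0.01,
    'y >= 0.01': toy.y >= 0.01,
    'z >= 0.01': toy.z >= 0.01,
    'price > 0': toy.price > 0,
}

# Count how many rows violate each rule.
violations = {name: int((~mask).sum()) for name, mask in rules.items()}

# Keep only the rows that satisfy every rule.
valid = pd.concat(rules, axis=1).all(axis=1)
cleaned = toy[valid]

print(violations)
print('rows kept:', len(cleaned), 'of', len(toy))
```

Reporting the counts per rule makes it easier to confirm that the dropped samples really are isolated outliers.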

## Sequential or categorical?
*Cut*, *color* and *clarity* are three of the 4Cs. According to [GIA], each of them has a scale of desirability.

In particular, the cut grade is classified as:
* Excellent
* Very good
* Good
* Fair
* Poor

Color is classified using a scale from D to Z (D is most desirable). Colors beyond Z are classified as *fancy color* and are graded separately, outside of this desirability scale.

Clarity is classified through eleven specific grades, grouped in six categories:
* Flawless (FL)
* Internally Flawless (IF)
* Very, Very Slightly Included (VVS1 and VVS2)
* Very Slightly Included (VS1 and VS2)
* Slightly Included (SI1 and SI2)
* Included (I1, I2, and I3)

[GIA]: https://4cs.gia.edu/en-us/4cs-diamond-quality/

Visual inspection of the data suggests that these grades were used. Let's take a deeper look.

In [None]:
print('Unique cut grades', *df.cut.unique(), sep=', ')
print('Unique colors', *df.color.unique(), sep=', ')
print('Unique clarities', *df.clarity.unique(), sep=', ')

First of all, the collection of Don Francesco does not include all the variants. For instance, there are no diamonds of color K or worse, nor poorly graded cuts. Moreover, it looks like the cuts were graded using a slightly different scale, using *Ideal* and *Premium* grades instead of *Excellent* (there is [some feedback about it on the web](https://www.loosediamondsreviews.com/diamondcut.html)).

In order to preserve the correct ordering of these grades, it seems reasonable to encode them into sequential (ordinal) values. With scalability in mind, the full range of grades for each of the three Cs is considered.

In [None]:
df.cut = data.cut_grades_encoder.fit_transform(df[['cut']])
df.color = data.color_encoder.fit_transform(df[['color']])
df.clarity = data.clarity_encoder.fit_transform(df[['clarity']])

In [None]:
df[["cut", "color", "clarity"]].describe()
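
The encoders in `diamond.data` are not shown in this notebook. As a rough illustration of the idea, an ordered categorical in pandas maps each label to its position on a scale. The ordering in `CUT_ORDER` below is an assumption made for this sketch, and the real encoder may use a different scale or direction.

```python
import pandas as pd

# Assumed ordering, worst to best; the real encoder in `diamond.data`
# may use a different scale or direction.
CUT_ORDER = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# Hypothetical sample of cut labels.
cuts = pd.Series(['Ideal', 'Premium', 'Good', 'Fair', 'Very Good', 'Ideal'])

# An ordered categorical maps every label to its position on the scale.
ordered = pd.Categorical(cuts, categories=CUT_ORDER, ordered=True)
codes = pd.Series(ordered.codes, index=cuts.index)

print(codes.tolist())
```

Labels that are missing from the category list receive the code -1, which is a convenient way to detect unexpected grades.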

## Feature distributions and relations
Let's take a look at the feature distributions and their correlations. Later on, new features will be extracted from the given ones in an effort to understand the observed correlations.

In [None]:
df.hist(figsize=(10, 10), bins=30)
plt.tight_layout()
plt.show()

Don Francesco prefers diamonds with good cuts and colors. Prices follow a long-tailed distribution, similar to an exponential distribution. *Depth* and *table* are seemingly Gaussian distributed, with some outliers. Absolute measures (*x*, *y*, *z*) seem quite correlated to the *carats*. There is a small peak of diamonds weighing ~1 carat, which is reflected in price and dimensions as well. The 1 carat phenomenon can be attributed to the popularity of 1 carat jewelry, such as engagement rings.

Taking a look at a correlation matrix confirms some of these observations.

In [None]:
plt.figure(figsize=(10, 10))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1)
plt.show()

The plot shows a very strong correlation between absolute dimensions, carat and price. On the other hand, the other three Cs display some form of inverse correlation with the price and carat. We would instead expect these three Cs to have a direct and strong correlation with the price. This is likely connected to the peculiar collection of Don Francesco, which shows larger diamonds of generally lower quality.
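
The sign flip described here is what a confounding variable can produce. The sketch below uses purely synthetic data with coefficients chosen for illustration, not values estimated from the collection. Quality raises the price, but larger stones tend to be of lower quality, so the raw correlation between quality and price comes out negative. Removing the linear effect of size from the price recovers the positive relation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic collection: larger stones tend to have lower quality grades.
log_carat = rng.normal(0.0, 0.5, n)
quality = -2.0 * log_carat + rng.normal(0.0, 1.0, n)

# Price grows with size and, more weakly, with quality.
log_price = 1.7 * log_carat + 0.1 * quality + rng.normal(0.0, 0.05, n)

# Raw correlation: quality appears to lower the price.
raw_corr = np.corrcoef(quality, log_price)[0, 1]

# Remove the linear effect of size from the price, then correlate again.
slope, intercept = np.polyfit(log_carat, log_price, 1)
residual = log_price - (slope * log_carat + intercept)
adjusted_corr = np.corrcoef(quality, residual)[0, 1]

print(f'raw: {raw_corr:.2f}, size-adjusted: {adjusted_corr:.2f}')
```

The same residual approach could be tried on the real data to test whether carat explains the inverse correlations seen in the heatmap.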

### Carats and volume
The visual similarity between the distributions of the absolute measures and the carat distribution suggests that carats are related to the size of the diamond. This is expected, since the density of diamond is mostly constant. Small variations in density can be related to impurities, that is, imperfections [1].

We synthesize a new feature that represents the volume of the minimum enclosing box of the diamond.

[1] Filgueira, Marcello & Pinatti, Daltro. (2001). Production of Diamond Wire by Cu15 v-% Nb "In situ" Process. Proc. of the 15th Int. Plansee Seminar. 1.

In [None]:
df['volume'] = df.x * df.y * df.z
df.volume.describe()

In [None]:
# Combined score: product of carat and the three encoded grades.
df['4c'] = df.cut * df.carat * df.color * df.clarity

In [None]:
plt.figure(figsize=(5, 5))
sns.heatmap(df[['x', 'y', 'z', 'volume', 'carat', 'price']].corr(), annot=True,
            vmin=-1, vmax=1)
plt.show()

Volume is highly correlated to carats and price. As such, it represents a valuable combination of three features in one. This will be taken into consideration when selecting which features to use for the final predictive model.
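
The link between volume and carat can be made concrete with two physical constants: one carat is 0.2 g, and diamond has a density of roughly 3.51 g/cm³. The sketch below uses one illustrative stone whose measurements are assumptions, not values read from the notebook's data. It estimates which fraction of the enclosing box is actually filled by the gem.

```python
# Physical constants (well established).
GRAMS_PER_CARAT = 0.2          # 1 carat = 200 mg
DIAMOND_DENSITY_G_CM3 = 3.51   # approximate density of diamond

# Illustrative stone, not taken from the notebook's data.
carat, x_mm, y_mm, z_mm = 0.23, 3.95, 3.98, 2.43

# Volume of the enclosing box, converted from mm^3 to cm^3.
box_volume_cm3 = (x_mm * y_mm * z_mm) / 1000.0

# Volume actually occupied by diamond, from its mass and density.
gem_volume_cm3 = carat * GRAMS_PER_CARAT / DIAMOND_DENSITY_G_CM3

fill_fraction = gem_volume_cm3 / box_volume_cm3
print(f'fill fraction: {fill_fraction:.2f}')
```

If the fill fraction is roughly constant across stones of the same shape, carat is close to proportional to the box volume, which is consistent with the high correlation above.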

Given the exponential-like distribution of price and carats, we compare them to the volume distribution after applying a log transformation followed by standardization (mean centered, rescaled by standard deviation). The volume distribution is standardized as well, without the log transformation.

In [None]:
# Compute log values
log_price = np.log(df.price)
log_carat = np.log(df.carat)

# Center and scale (volume is standardized without taking the log)
lognorm_price = (log_price - log_price.mean()) / log_price.std()
lognorm_carat = (log_carat - log_carat.mean()) / log_carat.std()
norm_volume = (df.volume - df.volume.mean()) / df.volume.std()

sns.kdeplot(
    pd.DataFrame({'price': lognorm_price, 'carat': lognorm_carat,
                  'volume': norm_volume}))
plt.show()

The three distributions are very similar to each other, and each of them shows the same two peaks.
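
The effect of the log transformation on a long-tailed variable can be quantified with the sample skewness. This is a minimal sketch on synthetic log-normal "prices", with distribution parameters chosen arbitrarily for illustration. The raw values are strongly right-skewed, while their logarithm is close to symmetric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices"; not the notebook's data.
prices = pd.Series(rng.lognormal(mean=7.0, sigma=1.0, size=10_000))

# Sample skewness before and after the log transform.
skew_raw = prices.skew()
skew_log = np.log(prices).skew()

print(f'skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}')
```

Running the same two calls on `df.price` and `df.carat` would give a numeric counterpart to the visual comparison of the density plots.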

### Cut and shape
Surprisingly, the correlation matrix shows no direct correlation between cut and diamond price. Since the cut is generally regarded as the most important of the 4Cs, it is worth exploring in more detail. Cut grade is determined by complex domain rules and considerations, involving the gem's proportions. Relative table width and pavilion depth are provided by Don Francesco's expert, but are they actually relevant? Or are they absorbed by the cut grade in the end?

Another property to take into consideration is the gem's shape. Visual inspection shows that *x* and *y* are very similar to each other, suggesting that the collection may comprise only round and/or square cuts (no ovals or other fancy cuts). We can get a clearer picture by extracting an eccentricity feature.

In [None]:
df['eccentricity'] = np.sqrt(1. - df[['x', 'y']].min(axis=1)
                             / df[['x', 'y']].max(axis=1))
df.eccentricity.hist()
df.eccentricity.describe()

Aside from some isolated samples, eccentricity has a particularly low mean and variance, suggesting that nearly all gems in the collection are indeed not oval cuts. While a large eccentricity may signal the presence of an elongated fancy cut, a minor eccentricity might just be a defect.
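
Note that the feature computed above uses the ratio of the two sides, `sqrt(1 - short / long)`, whereas the geometric eccentricity of an ellipse uses the squared ratio, `sqrt(1 - short**2 / long**2)`. Both are zero for a perfectly round or square outline and grow with elongation, so the qualitative reading is the same. The sketch below compares the two on hypothetical measurements. The function names are introduced here for illustration only.

```python
import numpy as np

def notebook_eccentricity(a, b):
    """Measure used above: sqrt(1 - short / long)."""
    short, long_ = min(a, b), max(a, b)
    return np.sqrt(1.0 - short / long_)

def ellipse_eccentricity(a, b):
    """Geometric eccentricity of an ellipse: sqrt(1 - short**2 / long**2)."""
    short, long_ = min(a, b), max(a, b)
    return np.sqrt(1.0 - (short / long_) ** 2)

# Hypothetical girdle measurements in millimeters.
for x, y in [(4.00, 4.00), (3.95, 3.98), (6.00, 4.00)]:
    print(x, y, notebook_eccentricity(x, y), ellipse_eccentricity(x, y))
```

For any non-degenerate pair of sides, the geometric eccentricity is at least as large as the ratio-based one, so switching definitions would rescale the feature without changing the ordering of the samples.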

In [None]:
plt.figure(figsize=(5, 5))
sns.heatmap(df[['cut', 'table', 'depth', 'eccentricity']].corr(), annot=True,
            vmin=-1, vmax=1)
plt.show()

pd.plotting.scatter_matrix(df[['cut', 'table', 'depth', 'eccentricity']],
                           figsize=(7, 7), hist_kwds={'bins': 30})
plt.tight_layout()
plt.show()
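
The *depth* column used in the plots above can also be cross-checked against the raw measurements. The sketch below assumes the common definition of total depth percentage, `100 * 2 * z / (x + y)`. This definition is an assumption here, since the dataset README is not reproduced in this notebook, and the rows are illustrative values rather than samples from `df`.

```python
import pandas as pd

# Illustrative rows, not taken from the notebook's data.
toy = pd.DataFrame({
    'x': [3.95, 3.89, 4.05],
    'y': [3.98, 3.84, 4.07],
    'z': [2.43, 2.31, 2.31],
    'depth': [61.5, 59.8, 56.9],
})

# Assumed definition: total depth as a percentage of the mean girdle diameter.
computed_depth = 100.0 * 2.0 * toy.z / (toy.x + toy.y)

# Absolute gap between recorded and recomputed depth, in percentage points.
gap = (toy.depth - computed_depth).abs()
print(gap.round(2))
```

Samples with a large gap would be candidates for measurement or data entry errors, and a consistently small gap would suggest that *depth* carries little information beyond *x*, *y* and *z*.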
