# Data exploration
This notebook studies the structure of the raw data. It also covers data cleaning, feature engineering and domain knowledge considerations.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from diamond import data

## Loading and preliminaries
Data is loaded and quickly inspected.

In [None]:
df = data.load_raw('datasets/diamonds/diamonds.csv')

print('Columns', df.dtypes, sep='\n')
df.head()

Are there missing values?

In [None]:
df.isna().any().any()
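
If the check above returns `True`, a per-column count helps locate the gaps. A minimal sketch on a made-up frame (the `sample` data is illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative frame, not the diamonds data: one missing carat, one missing cut.
sample = pd.DataFrame({
    'carat': [0.23, np.nan, 0.31],
    'cut': ['Ideal', 'Premium', None],
    'price': [326, 326, 335],
})

# Number of missing entries in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the real data the same call would be `df.isna().sum()`.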

Visual inspection of the data shows that the values carry only a few decimal places. All numeric columns are therefore converted to single-precision floating point, which preserves every value at the reported precision in this case. Prices are converted to floating point as well, so that machine learning tools later in the pipeline can handle them more easily.

In [None]:
df = df.astype({
    'carat': np.float32,
    'depth': np.float32,
    'table': np.float32,
    'price': np.float32,
    'x': np.float32,
    'y': np.float32,
    'z': np.float32,
})

In [None]:
df.dtypes
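
The claim that single precision loses nothing can be checked rather than assumed. The sketch below is one possible check, with a hypothetical helper `survives_float32` and made-up sample values: it casts to `float32` and back, then compares both versions after rounding to the reported number of decimals.

```python
import numpy as np
import pandas as pd

def survives_float32(series: pd.Series, decimals: int) -> bool:
    """Return True if casting to float32 keeps every value intact
    once rounded to `decimals` decimal places."""
    original = series.to_numpy(dtype=np.float64)
    round_trip = original.astype(np.float32).astype(np.float64)
    return bool(np.array_equal(np.round(round_trip, decimals),
                               np.round(original, decimals)))

# Illustrative values with at most two decimals, similar in scale to the data.
small_values = pd.Series([0.23, 61.5, 3.95, 18823.0])
# A value with too many significant digits for float32.
large_value = pd.Series([1234567.89])

print(survives_float32(small_values, decimals=2))
print(survives_float32(large_value, decimals=2))
```

In the notebook, the helper would have to be applied to each column before the `astype` call, since the check needs the original values.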

Let's see the total value of Don Francesco's investments. This will come in handy when making scalability considerations.

In [None]:
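# Accumulate in float64: float32 cannot represent every integer above 2**24 exactly.
print('total diamonds value:', df.price.astype(np.float64).sum())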

## Cleanup
Some invalid samples stand out immediately in the summary statistics. The *cut*, *color* and *clarity* features will be examined later, given that they are encoded as human-readable labels.

In [None]:
df.describe()

From the source of the dataset ([diamonds.csv](datasets/diamonds/README.md)) we know that *depth* and *table* are computed as percentages. The actual measurements (*x*, *y*, *z*) are presumably in millimeters, in line with [GIA] reports, and are given to two decimal places. Rows with negative measurements or measurements below 0.01 can be discarded, along with rows whose price is not positive.

[GIA]: https://4cs.gia.edu/en-us/diamond-buying-guide/
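
Since *depth* is a percentage derived from the measurements, it can serve as a consistency check on *x*, *y* and *z*. The sketch below assumes the common definition of total depth percentage, `100 * z / mean(x, y)`. That formula is an assumption here, since the README is not reproduced in this notebook. The helper `depth_percentage` and the sample rows are made up for illustration.

```python
import pandas as pd

def depth_percentage(x: pd.Series, y: pd.Series, z: pd.Series) -> pd.Series:
    """Total depth percentage, assuming depth = 100 * z / mean(x, y)."""
    return 100 * 2 * z / (x + y)

# Illustrative rows, not taken from the dataset.
sample = pd.DataFrame({
    'x': [4.0, 5.0],
    'y': [4.0, 5.2],
    'z': [2.4, 3.06],
    'depth': [60.0, 60.0],
})

# Absolute gap between the reported and the recomputed depth.
sample['depth_check'] = depth_percentage(sample.x, sample.y, sample.z)
sample['depth_gap'] = (sample.depth - sample.depth_check).abs()
print(sample[['depth', 'depth_check', 'depth_gap']])
```

Rows with a large gap would be candidates for closer inspection. The check is only meaningful once rows with zero-sized measurements have been removed.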

In [None]:
df = df[(df.x >= 0.01) & (df.y >= 0.01) & (df.z >= 0.01) & (df.price > 0)]
df.describe()

Minimum measurements and prices are now realistic. Indeed, it looks like zero-sized and negative-priced diamonds were isolated dirty outliers.
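
One way to back the claim that the invalid rows are isolated is to count them before dropping them. A minimal sketch on made-up rows, using the same validity rule as the cleanup cell:

```python
import pandas as pd

# Illustrative frame, not the diamonds data: one zero-sized and one negative-priced row.
sample = pd.DataFrame({
    'x': [3.95, 0.0, 4.05, 4.20],
    'y': [3.98, 0.0, 4.07, 4.23],
    'z': [2.43, 0.0, 2.31, 2.63],
    'price': [326.0, 500.0, -1.0, 334.0],
})

# Same validity rule as the cleanup cell.
valid = (sample.x >= 0.01) & (sample.y >= 0.01) & (sample.z >= 0.01) & (sample.price > 0)

# Report how many rows the rule removes before applying it.
dropped = int((~valid).sum())
print(f'dropped {dropped} of {len(sample)} rows ({dropped / len(sample):.1%})')

cleaned = sample[valid]
```

On the real data, a small dropped fraction would support treating these rows as outliers rather than as a systematic problem.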

## Sequential or categorical?
*Cut*, *color* and *clarity* are three of the 4Cs. According to [GIA], each of them has a scale of desirability.

In particular, the cut grade is classified as:
* Excellent
* Very good
* Good
* Fair
* Poor

Color is classified using a scale from D to Z, where D is the most desirable. Colors beyond Z are classified as *fancy color* and fall outside the desirability scale, so they are considered separately.

Clarity is classified through eleven specific grades, grouped in six categories:
* Flawless (FL)
* Internally Flawless (IF)
* Very, Very Slightly Included (VVS1 and VVS2)
* Very Slightly Included (VS1 and VS2)
* Slightly Included (SI1 and SI2)
* Included (I1, I2, and I3)

[GIA]: https://4cs.gia.edu/en-us/4cs-diamond-quality/

Visual inspection of the data suggests that these grades were used. Let's take a deeper look.

In [None]:
# Join the labels separately so the comma separator applies only between grades.
print('Unique cut grades:', ', '.join(df.cut.unique()))
print('Unique colors:', ', '.join(df.color.unique()))
print('Unique clarities:', ', '.join(df.clarity.unique()))

First of all, the collection of Don Francesco does not include all the variants. For instance, there are no diamonds of color K or worse, nor cuts graded *Poor*. Moreover, it looks like the cuts were graded using a slightly different scale, with *Ideal* and *Premium* grades instead of *Excellent* (there is [some feedback about it on the web](https://www.loosediamondsreviews.com/diamondcut.html)).
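
The gap between the full scale and the observed grades can be listed explicitly. The sketch below does so for color. The `observed_colors` list is illustrative and stands in for `df.color.unique()`.

```python
import string

# Full GIA color scale, from D (most desirable) to Z.
COLOR_SCALE = list(string.ascii_uppercase[3:])

# Illustrative observed colors; in the notebook this would be df.color.unique().
observed_colors = ['E', 'I', 'J', 'H', 'F', 'G', 'D']

# Grades of the full scale that never occur in the observed data.
observed = set(observed_colors)
missing_colors = [color for color in COLOR_SCALE if color not in observed]
print('Colors absent from the collection:', ', '.join(missing_colors))
```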

In order to preserve the correct ordering of these grades, it seems reasonable to encode them as sequential values. With scalability in mind, the full range of grades for each of the three Cs is considered.

In [None]:
df.cut = data.cut_grades_encoder.fit_transform(df[['cut']])
df.color = data.color_encoder.fit_transform(df[['color']])
df.clarity = data.clarity_encoder.fit_transform(df[['clarity']])
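
The encoders in `diamond.data` are not shown in this notebook. As a rough illustration of the idea only, the sketch below encodes clarity with a pandas ordered categorical over the full eleven-grade scale. The helper `encode_ordinal`, the direction of the scale (0 = least desirable) and the handling of unknown labels are assumptions of this sketch, and the actual encoders may differ.

```python
import pandas as pd

# Full clarity scale, ordered from least to most desirable.
# The direction (0 = least desirable) is an assumption made for this sketch.
CLARITY_SCALE = ['I3', 'I2', 'I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF', 'FL']

def encode_ordinal(values: pd.Series, scale: list) -> pd.Series:
    """Map each grade to its position in `scale`; unknown grades become -1."""
    categorical = pd.Categorical(values, categories=scale, ordered=True)
    return pd.Series(categorical.codes, index=values.index, name=values.name)

# Illustrative labels, including one that is not part of the scale.
sample = pd.Series(['SI2', 'IF', 'VS1', 'unknown'], name='clarity')
codes = encode_ordinal(sample, CLARITY_SCALE)
print(codes.tolist())
```

Because the codes come from the full scale rather than from the grades present in the data, a grade that first appears in a later batch keeps its place in the ordering.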

In [None]:
df[["cut", "color", "clarity"]].describe()

## Feature distributions and relations

In [None]:
df.hist(figsize=(10, 10), bins=30)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 10))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1)
plt.show()
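
The heatmap above relies on Pearson correlation, the default of `DataFrame.corr`. Since *cut*, *color* and *clarity* are encoded as ranks, a rank-based coefficient such as Spearman's may be a useful complement, because it captures monotonic relations that are not linear. A minimal sketch on made-up data:

```python
import pandas as pd

# Illustrative data: `cube` grows monotonically but not linearly with `grade`.
sample = pd.DataFrame({'grade': [0, 1, 2, 3, 4, 5]})
sample['cube'] = sample.grade ** 3

# Pearson measures linear association, Spearman measures monotonic association.
pearson = sample.corr(method='pearson').loc['grade', 'cube']
spearman = sample.corr(method='spearman').loc['grade', 'cube']
print(f'pearson={pearson:.3f}, spearman={spearman:.3f}')
```

In the notebook, passing `df.corr(method='spearman')` to `sns.heatmap` would give the rank-based version of the same plot.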