# Data Wrangling the Diamonds dataset

## The basics

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("../data/diamonds_raw.csv")

In [3]:
df.tail(3)

Unnamed: 0.1,Unnamed: 0,diamond_id,shape,size,color,fancy_color_dominant_color,fancy_color_secondary_color,fancy_color_overtone,fancy_color_intensity,clarity,...,girdle_min,girdle_max,culet_size,culet_condition,fluor_color,fluor_intensity,lab,total_sales_price,eye_clean,date
219701,219701,135553116,Round,18.07,E,,,,,VS1,...,TN,M,N,,,,GIA,1315496,,2022-02-24
219702,219702,114752541,Princess,0.9,,Red,,,Fancy,SI2,...,XTN,VTK,N,,,Faint,GIA,1350000,,2022-02-24
219703,219703,129630500,Pear,10.03,,Yellow,,,Fancy Vivid,VVS2,...,,,,,,,GIA,1449881,,2022-02-24


## Inspect df; develop process.  
 
 1. What columns can be dropped?
 2. Deal with any NaN values.

## Identify and drop useless columns

In [4]:
for col in df:
    print(f"There are {df[col].isnull().sum()} nans in {col}")

There are 0 nans in Unnamed: 0
There are 0 nans in diamond_id
There are 0 nans in shape
There are 0 nans in size
There are 9162 nans in color
There are 210540 nans in fancy_color_dominant_color
There are 218642 nans in fancy_color_secondary_color
There are 217666 nans in fancy_color_overtone
There are 210542 nans in fancy_color_intensity
There are 0 nans in clarity
There are 60607 nans in cut
There are 0 nans in symmetry
There are 0 nans in polish
There are 0 nans in depth_percent
There are 0 nans in table_percent
There are 0 nans in meas_length
There are 0 nans in meas_width
There are 0 nans in meas_depth
There are 83433 nans in girdle_min
There are 84296 nans in girdle_max
There are 85741 nans in culet_size
There are 204385 nans in culet_condition
There are 203978 nans in fluor_color
There are 128 nans in fluor_intensity
There are 0 nans in lab
There are 0 nans in total_sales_price
There are 156917 nans in eye_clean
There are 0 nans in date
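
The same counts can be produced in one vectorized call instead of a loop. A minimal sketch, assuming a small stand-in frame (`sample` and `nan_report` are hypothetical names, not part of the notebook), that also reports the share of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the real frame has 219,704 rows.
sample = pd.DataFrame({
    "shape": ["Round", "Pear", "Princess", "Round"],
    "color": ["E", np.nan, np.nan, "D"],
    "cut": ["Excellent", np.nan, "Very Good", "Very Good"],
})

# One row per column: NaN count and percentage, worst offenders first.
nan_report = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "pct_missing": (sample.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(nan_report)
```

Sorting by the count puts the columns that need a decision at the top of the report.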


Three of the columns are useless: 
 * `Unnamed: 0` is just the same as the index
 * `df['date'].unique()` shows that the `date` values are all the same
 * `diamond_id` could make a decent index, but the default integer index is more useful in this case.
 
 So, let's ditch them.
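
The three claims above can be checked before dropping anything. A rough sketch on a hypothetical miniature frame (`sample` is made up for illustration, loosely based on the tail output; it is not the real data):

```python
import pandas as pd

# Hypothetical miniature of the raw file.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "diamond_id": [135553116, 114752541, 129630500],
    "date": ["2022-02-24", "2022-02-24", "2022-02-24"],
    "shape": ["Round", "Princess", "Pear"],
})

# True if the column merely repeats the default integer index.
index_duplicate = (sample["Unnamed: 0"] == sample.index).all()

# Columns holding a single distinct value carry no information.
constant_cols = [c for c in sample.columns if sample[c].nunique(dropna=False) == 1]

# An ID column is only a safe index if it never repeats.
ids_unique = sample["diamond_id"].is_unique

print(index_duplicate, constant_cols, ids_unique)
```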

In [5]:
df.drop(['Unnamed: 0', 'date', 'diamond_id'], axis=1, inplace=True)
df.head(2)

Unnamed: 0,shape,size,color,fancy_color_dominant_color,fancy_color_secondary_color,fancy_color_overtone,fancy_color_intensity,clarity,cut,symmetry,...,meas_depth,girdle_min,girdle_max,culet_size,culet_condition,fluor_color,fluor_intensity,lab,total_sales_price,eye_clean
0,Round,0.09,E,,,,,VVS2,Excellent,Very Good,...,1.79,M,M,N,,,,IGI,200,
1,Round,0.09,E,,,,,VVS2,Very Good,Very Good,...,1.78,STK,STK,N,,,,IGI,200,


## Deal with NaN values

#### Before I commit to dropping all the NaNs, let's get an idea of how many that would be.


```
df.shape                  # (219704, 25)
df.dropna(axis=1).shape   # (219704, 12); no inplace, so df keeps all its columns
```
That looks like this:

| diamond_id | shape | size | clarity | symmetry | polish  | depth_percent | table_percent | meas_length | meas_width | meas_depth | lab | total_sales_price|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 131328926 | Round | 0.09 | VVS2 | Very Good | Very Good | 62.7 | 59.0 | 2.85 | 2.87 | 1.79 | IGI | 200 |
| 131704776 | Round | 0.09 | VVS2 | Very Good | Very Good | 61.9 | 59.0 | 2.84 | 2.89 | 1.78 | IGI | 200 |
| 131584417 | Round | 0.09 | VVS2 | Very Good | Very Good | 61.1 | 59.0 | 2.88 | 2.90 | 1.77 | IGI | 200 |


A check of `df.dropna(axis=1).isnull().sum().sum()` returns 0, as expected.
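
To weigh dropping columns against dropping rows, both can be tried on copies. A minimal sketch, assuming a hypothetical four-row frame (`sample`, `by_column`, and `by_row` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.
sample = pd.DataFrame({
    "size": [0.09, 0.9, 10.03, 18.07],
    "color": ["E", np.nan, np.nan, "E"],
    "cut": ["Excellent", np.nan, "Very Good", np.nan],
    "clarity": ["VVS2", "SI2", "VVS2", "VS1"],
})

# dropna returns a new frame by default, so sample is never modified.
by_column = sample.dropna(axis=1)  # drop every column containing a NaN
by_row = sample.dropna(axis=0)     # drop every row containing a NaN

print("original:    ", sample.shape)
print("drop columns:", by_column.shape)
print("drop rows:   ", by_row.shape)
```

Comparing the two shapes shows at a glance which axis loses less data.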


#### Dropping the `NaN` values would keep all the rows (which is good) but I'd lose columns (bad)


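The number of columns lost isn't the real problem; which ones are lost is. Carat, color, cut, and clarity (the "four Cs") are the factors the diamond industry grades on, and both `color` and `cut` contain NaNs, so they would be dropped. So, this needs to be fixed by filling the missing values instead of dropping them.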

In [6]:
df.fillna({'color':'unknown', 'cut':'unknown', 'eye_clean': 'unknown',
           'fancy_color_dominant_color': 'unknown', 'fancy_color_secondary_color': 'unknown',
           'fancy_color_overtone':'unknown', 'fancy_color_intensity':'unknown',  
           'girdle_min':'unknown', 'girdle_max':'unknown', 
           'culet_size':'unknown', 'culet_condition':'unknown',
           'fluor_color':'unknown', 'fluor_intensity':'unknown'   
          }, inplace=True)

In [7]:
df.isnull().sum().sum()

0
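
If typing out every column name gets tedious, the non-numeric columns can be selected programmatically. This is only a sketch on a hypothetical frame (`sample` and `text_cols` are illustrative names), and it assumes every non-numeric column should get the same `'unknown'` placeholder:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df with one numeric and two categorical columns.
sample = pd.DataFrame({
    "size": [0.09, 0.9, 10.03],
    "color": ["E", np.nan, np.nan],
    "cut": ["Excellent", np.nan, "Very Good"],
})

# Every column that is not numeric is treated as categorical text here.
text_cols = sample.select_dtypes(exclude="number").columns

# Fill only those columns, leaving numeric columns untouched.
sample[text_cols] = sample[text_cols].fillna("unknown")

print(sample)
print(sample.isna().sum().sum())
```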

### For the categorical variables, how much variety is there and what does it mean?


#### Should it be needed, here's the code to list the unique values of each column:

```
col_list = ['color', 'clarity', 'cut', 'symmetry','polish','lab','eye_clean', 'culet_size', 'shape',
            'fancy_color_intensity','fancy_color_dominant_color','fancy_color_secondary_color',
            'fancy_color_overtone', 'fluor_color', 'fluor_intensity',]
            
for col in col_list:
    print(f" '{col}' has the following values: \n \t {df[col].unique()} \n")
```

#### The only column of any concern is the numeric `size` column, which looks to have a large number of outliers.
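
One way to put a number on that impression is the 1.5 × IQR rule. The rule is an assumption on my part, not something the notebook uses, and the values below are hypothetical (`size` here is a made-up Series, not the real column):

```python
import pandas as pd

# Hypothetical carat sizes; the last two echo the large stones in the tail output.
size = pd.Series([0.09, 0.3, 0.31, 0.4, 0.5, 0.7, 0.9, 1.0, 10.03, 18.07], name="size")

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = size.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = size[(size < lower_fence) | (size > upper_fence)]

print(f"{len(outliers)} of {len(size)} values fall outside "
      f"[{lower_fence:.2f}, {upper_fence:.2f}]")
```

Whether flagged stones are errors or simply rare, very large diamonds is a judgment call that the rule cannot make.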

## Save this for use in the other notebooks

In [8]:
df.to_pickle("../data/diamonds.pkl")
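
Another notebook would load the file back with `pd.read_pickle`. A small round-trip sketch, using a temporary directory and a hypothetical two-row frame (`cleaned`) in place of the real path and data:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned frame.
cleaned = pd.DataFrame({"shape": ["Round", "Pear"], "color": ["E", "unknown"]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "diamonds.pkl"  # stand-in for ../data/diamonds.pkl
    cleaned.to_pickle(path)
    restored = pd.read_pickle(path)

# Pickle keeps dtypes and the index, so the round trip is lossless.
print(restored.equals(cleaned))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.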