# Data Description

* `Carat`: Carat weight of the cubic zirconia.
* `Cut`: Cut quality of the cubic zirconia. In increasing order of quality: Fair, Good, Very Good, Premium, Ideal.
* `Color`: Colour of the cubic zirconia, with D being the best and J the worst.
* `Clarity`: Clarity of the cubic zirconia, which refers to the absence of inclusions and blemishes. In order from best to worst: FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3 (FL = flawless, I3 = level 3 inclusions).
* `Depth`: The height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
* `Table`: The width of the cubic zirconia's table, expressed as a percentage of its average diameter.
* `X`: Length of the cubic zirconia in mm.
* `Y`: Width of the cubic zirconia in mm.
* `Z`: Height of the cubic zirconia in mm.
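
The `Depth` definition above can be checked against the dimension columns. The following is a minimal sketch, assuming that depth is stored as a percentage and that the average girdle diameter can be approximated by the mean of `X` and `Y`. The helper name `depth_percent` is hypothetical, and the sample values (x = 6.24, y = 6.19, z = 3.74, recorded depth 60.2) come from one row of this dataset.

```python
def depth_percent(x: float, y: float, z: float) -> float:
    """Depth as a percentage: height divided by the mean of length and width.

    Assumption: the mean of x and y stands in for the average girdle diameter.
    """
    mean_diameter = (x + y) / 2
    if mean_diameter <= 0:
        raise ValueError("x + y must be positive")
    return 100 * z / mean_diameter


# Compare the computed value with the recorded depth (60.2) for one sample row.
computed = depth_percent(6.24, 6.19, 3.74)
print(round(computed, 1))
```

If the computed and recorded values disagree strongly for a row, one of its dimensions is probably mis-recorded.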


#### Target Variable:
* `Price`: Price of the cubic zirconia.


Dataset: https://www.kaggle.com/competitions/playground-series-s3e8/data?select=train.csv


In [14]:
# import libraries
import pandas as pd
import numpy as np
import sklearn  

## Data Ingestion

In [15]:
# Read data to DF
df = pd.read_csv("data/train.csv") 

#### Basic Sanity data check

In [16]:
# Get 5 random samples
df.sample(5)

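|        | id     | carat | cut       | color | clarity | depth | table | x    | y    | z    | price |
|--------|--------|-------|-----------|-------|---------|-------|-------|------|------|------|-------|
| 89593  | 89593  | 0.9   | Premium   | F     | VS2     | 60.2  | 60.0  | 6.24 | 6.19 | 3.74 | 4189  |
| 63890  | 63890  | 0.32  | Ideal     | F     | VVS1    | 61.5  | 56.0  | 4.42 | 4.44 | 2.72 | 857   |
| 146208 | 146208 | 1.09  | Ideal     | G     | VS2     | 61.1  | 57.0  | 6.6  | 6.64 | 4.07 | 6480  |
| 70993  | 70993  | 2.03  | Very Good | E     | SI2     | 63.7  | 59.0  | 7.99 | 8.03 | 5.14 | 17759 |
| 41294  | 41294  | 0.92  | Very Good | G     | SI1     | 62.8  | 58.0  | 6.3  | 6.16 | 3.9  | 4095  |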


In [17]:
# Check for null values
df.isnull().sum()

id         0
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
x          0
y          0
z          0
price      0
dtype: int64

In [18]:
# Summarize dtypes and non-null counts (info is a method, so it must be called)
df.info()

In [19]:
# Drop the "id" column: it is only a row identifier and carries no information about the target
df = df.drop('id', axis=1)
df.columns

Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'x', 'y', 'z',
       'price'],
      dtype='object')

In [20]:
# Check for duplicate rows
df.duplicated().sum()

0

In [21]:
# Split the columns into categorical and numerical groups

# Categorical columns
categorical_col = df.columns[df.dtypes == 'object']
print(categorical_col)

# Numerical columns
numerical_col = df.columns[df.dtypes != 'object']
print(numerical_col)

Index(['cut', 'color', 'clarity'], dtype='object')
Index(['carat', 'depth', 'table', 'x', 'y', 'z', 'price'], dtype='object')
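
The same split can also be made with `DataFrame.select_dtypes`, which avoids comparing `dtypes` against a string by hand. The sketch below uses a small stand-in frame (`demo`, a hypothetical name, built from two of the sample rows shown earlier) rather than the full dataset, and treats every non-numeric column as categorical.

```python
import pandas as pd

# Stand-in for the full DataFrame: two rows from the random sample above.
demo = pd.DataFrame({
    "carat": [0.9, 0.32],
    "cut": ["Premium", "Ideal"],
    "color": ["F", "F"],
    "clarity": ["VS2", "VVS1"],
    "price": [4189, 857],
})

# Numeric dtypes (int, float) on one side, everything else on the other.
numerical_col = demo.select_dtypes(include="number").columns
categorical_col = demo.select_dtypes(exclude="number").columns

print(list(categorical_col))
print(list(numerical_col))
```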


In [22]:
# Get description of categorical columns
df[categorical_col].describe()

|        | cut    | color  | clarity |
|--------|--------|--------|---------|
| count  | 193573 | 193573 | 193573  |
| unique | 5      | 7      | 8       |
| top    | Ideal  | G      | SI1     |
| freq   | 92454  | 44391  | 53272   |


Reference link for more details on cut quality: https://www.vrai.com/journal/post/diamond-cut

In [24]:
# Get the count of each value in the 'cut' column
df['cut'].value_counts()

cut
Ideal        92454
Premium      49910
Very Good    37566
Good         11622
Fair          2021
Name: count, dtype: int64

In [25]:
# Value count for color column
df['color'].value_counts()

color
G    44391
E    35869
F    34258
H    30799
D    24286
I    17514
J     6456
Name: count, dtype: int64

In [26]:
df['clarity'].value_counts()

clarity
SI1     53272
VS2     48027
VS1     30669
SI2     30484
VVS2    15762
VVS1    10628
IF       4219
I1        512
Name: count, dtype: int64
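
The data description gives a quality order for each of the three categorical columns, so one option is to encode them as integer ranks. The following is a sketch under stated assumptions: a higher rank means better quality, and the names `CUT_ORDER`, `COLOR_ORDER`, `CLARITY_ORDER`, `rank_map`, and `demo` are hypothetical. Whether ordinal encoding suits the model is a judgment call that this notebook does not settle.

```python
import pandas as pd

# Orders taken from the data description, listed worst -> best.
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I3", "I2", "I1", "SI2", "SI1", "VS2",
                 "VS1", "VVS2", "VVS1", "IF", "FL"]


def rank_map(order):
    """Map each level to its position in the worst -> best list."""
    return {level: rank for rank, level in enumerate(order)}


# Small stand-in frame; the real notebook would use df instead.
demo = pd.DataFrame({
    "cut": ["Premium", "Ideal", "Very Good"],
    "color": ["F", "G", "E"],
    "clarity": ["VS2", "VVS1", "SI2"],
})

encoded = demo.assign(
    cut=demo["cut"].map(rank_map(CUT_ORDER)),
    color=demo["color"].map(rank_map(COLOR_ORDER)),
    clarity=demo["clarity"].map(rank_map(CLARITY_ORDER)),
)
print(encoded)
```

Any value missing from the order lists would map to `NaN`, which makes unexpected categories easy to spot.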

In [23]:
# Describe numerical columns
df[numerical_col].describe()

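|       | carat    | depth     | table     | x        | y        | z        | price       |
|-------|----------|-----------|-----------|----------|----------|----------|-------------|
| count | 193573.0 | 193573.0  | 193573.0  | 193573.0 | 193573.0 | 193573.0 | 193573.0    |
| mean  | 0.790688 | 61.820574 | 57.227675 | 5.715312 | 5.720094 | 3.534246 | 3969.155414 |
| std   | 0.462688 | 1.081704  | 1.918844  | 1.109422 | 1.102333 | 0.688922 | 4034.374138 |
| min   | 0.2      | 52.1      | 49.0      | 0.0      | 0.0      | 0.0      | 326.0       |
| 25%   | 0.4      | 61.3      | 56.0      | 4.7      | 4.71     | 2.9      | 951.0       |
| 50%   | 0.7      | 61.9      | 57.0      | 5.7      | 5.72     | 3.53     | 2401.0      |
| 75%   | 1.03     | 62.4      | 58.0      | 6.51     | 6.51     | 4.03     | 5408.0      |
| max   | 3.5      | 71.6      | 79.0      | 9.65     | 10.01    | 31.3     | 18818.0     |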
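
The summary reports a minimum of 0.0 for `x`, `y`, and `z`, which is not physically possible for a stone, so those rows are likely missing or mis-recorded measurements. The sketch below shows one way to flag them. The `demo` frame and its zero values are made up for illustration, and dropping the flagged rows is an assumption, not a step this notebook takes.

```python
import pandas as pd

# Hypothetical stand-in data: two of the three rows contain a zero dimension.
demo = pd.DataFrame({
    "x": [6.24, 0.0, 4.42],
    "y": [6.19, 6.10, 4.44],
    "z": [3.74, 3.80, 0.0],
})

dims = ["x", "y", "z"]

# True for every row where at least one dimension is exactly zero.
zero_dim_mask = (demo[dims] == 0).any(axis=1)
print(int(zero_dim_mask.sum()))

# One possible handling (an assumption): drop the flagged rows.
cleaned = demo.loc[~zero_dim_mask].reset_index(drop=True)
print(cleaned)
```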


## EDA

In [29]:
import matplotlib.pyplot as plt
