# Diamond Price Prediction

The goal of this competition is to predict the price of diamonds from their characteristics. Let's start by inspecting the dataset: how many entries it has, how many values are null, what data type each feature has, and so on.

In [1]:
import pandas as pd

In [2]:
diamonds = pd.read_csv('data/train.csv')
diamonds.head()

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,price
0,0,0.5,Ideal,D,VS2,62.3,55.0,5.11,5.07,3.17,1845
1,1,1.54,Good,I,VS1,63.6,60.0,7.3,7.33,4.65,10164
2,2,1.32,Very Good,J,SI2,61.7,60.0,6.95,7.01,4.31,5513
3,3,1.2,Ideal,I,SI1,62.1,55.0,6.83,6.79,4.23,5174
4,4,1.73,Premium,I,SI1,61.2,60.0,7.67,7.65,4.69,10957


In [3]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40455 entries, 0 to 40454
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       40455 non-null  int64  
 1   carat    40455 non-null  float64
 2   cut      40455 non-null  object 
 3   color    40455 non-null  object 
 4   clarity  40455 non-null  object 
 5   depth    40455 non-null  float64
 6   table    40455 non-null  float64
 7   x        40455 non-null  float64
 8   y        40455 non-null  float64
 9   z        40455 non-null  float64
 10  price    40455 non-null  int64  
dtypes: float64(6), int64(2), object(3)
memory usage: 3.4+ MB


As .info() shows, there are 40455 entries and a total of 11 columns, one for each of the features:

 - id: only for test & sample submission files, id for prediction sample identification
 - carat: weight of the diamond
 - cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
 - color: diamond colour, from J (worst) to D (best)
 - clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
 - depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
 - table: width of top of diamond relative to widest point (43--95)
 - x: length in mm
 - y: width in mm
 - z: depth in mm
 - price: price in USD
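
As a quick sanity check of the depth formula above, here is a minimal sketch that recomputes depth from x, y and z for the five rows printed by `diamonds.head()`. It assumes the `depth` column stores the ratio as a percentage (hence the factor of 100), which is what those rows suggest; `sample`, `depth_check` and `max_gap` are names introduced only for this illustration.

```python
import pandas as pd

# First five rows of train.csv as printed above (geometry columns only)
sample = pd.DataFrame({
    'depth': [62.3, 63.6, 61.7, 62.1, 61.2],
    'x': [5.11, 7.30, 6.95, 6.83, 7.67],
    'y': [5.07, 7.33, 7.01, 6.79, 7.65],
    'z': [3.17, 4.65, 4.31, 4.23, 4.69],
})

# depth = 2 * z / (x + y), times 100 (assumption: stored as a percentage)
depth_check = 100 * 2 * sample['z'] / (sample['x'] + sample['y'])

# Largest absolute gap between the recomputed and the recorded depth
max_gap = (depth_check - sample['depth']).abs().max()
print(depth_check.round(2).tolist(), max_gap)
```

On these rows the recomputed values agree with the recorded depth to within rounding, so the same check could be run on the full dataframe to flag rows with inconsistent measurements.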
 
As we can see, there are no null values in any of the features. Eight of them are numeric (6 float columns and 2 integer columns), and the remaining 3 are objects. Since our dataset does not need any treatment for nulls, we can start by treating the three object-type columns.
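
The same two facts can also be checked programmatically rather than read off the .info() output. The sketch below is only an illustration on a small frame built from the rows shown by `diamonds.head()`; `sample`, `null_counts` and `non_numeric` are hypothetical names, and on the real data `sample` would simply be `diamonds`.

```python
import pandas as pd

# A few columns of the first five rows printed above
sample = pd.DataFrame({
    'carat': [0.5, 1.54, 1.32, 1.2, 1.73],
    'cut': ['Ideal', 'Good', 'Very Good', 'Ideal', 'Premium'],
    'color': ['D', 'I', 'J', 'I', 'I'],
    'clarity': ['VS2', 'VS1', 'SI2', 'SI1', 'SI1'],
    'price': [1845, 10164, 5513, 5174, 10957],
})

# Number of missing values per column
null_counts = sample.isna().sum()

# Columns that are not numeric and therefore still need encoding
non_numeric = sample.select_dtypes(exclude='number').columns.tolist()

print(null_counts.to_dict())
print(non_numeric)
```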

## Dealing with categorical data

There are 3 categorical variables in our dataset: cut, color, and clarity. Let's start with 'cut':

### Cut feature

In [4]:
diamonds.cut.value_counts()

Ideal        16152
Premium      10321
Very Good     9040
Good          3729
Fair          1213
Name: cut, dtype: int64

There are 5 different categories in this column, but looking more closely, we can see that they follow an ordinal scale that goes from Fair to Ideal. This can be translated into a discrete numeric scale that goes from 1 (lowest quality) to 5 (top quality). Let's perform the operation:

In [20]:
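# Map each cut grade to its rank, from 1 (Fair) to 5 (Ideal).
# A label missing from the mapping becomes NaN, so astype raises
# instead of silently falling back to the top grade.
cut_num = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}

diamonds.cut = diamonds.cut.map(cut_num).astype('int64')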

In [6]:
diamonds.cut.value_counts()

5    16152
4    10321
3     9040
2     3729
1     1213
Name: cut, dtype: int64
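
An alternative way to express the same ranking is an ordered categorical, which keeps the order of the grades explicit in one list. This is only a sketch of that idea, not the approach used in this notebook; `cut_order`, `cut_cat` and `cut_rank` are names introduced for the illustration, and the input is the `cut` values of the first five rows.

```python
import pandas as pd

# Grades from lowest to highest quality, as listed in the feature description
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']

# 'cut' values of the first five rows shown by diamonds.head()
cut = pd.Series(['Ideal', 'Good', 'Very Good', 'Ideal', 'Premium'])

cut_cat = pd.Categorical(cut, categories=cut_order, ordered=True)

# .codes holds the 0-based position of each label in cut_order
# (-1 for a label that is not in the list), so add 1 to get ranks 1..5
cut_rank = pd.Series(cut_cat.codes, index=cut.index).astype('int64') + 1
print(cut_rank.tolist())
```

A code of -1 would show up as rank 0 here, which makes an unexpected label easy to spot.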

Our categorical variable 'cut' can now be measured numerically. Let's move on to 'color'.

### Color feature

In [7]:
diamonds.color.value_counts()

G    8469
E    7282
F    7199
H    6210
D    5098
I    4091
J    2106
Name: color, dtype: int64

According to the feature description, color is the "diamond colour, from J (worst) to D (best)". So this can also be treated as a discrete numeric scale, going from J = 1, the worst color, to D = 7, the best. Let's perform the same operation:

In [19]:
# Map each color grade to its rank, from 1 (J, worst) to 7 (D, best)
color_num = {'D': 7, 'E': 6, 'F': 5, 'G': 4, 'H': 3, 'I': 2, 'J': 1}

diamonds.color = diamonds.color.map(color_num).astype('int64')

In [9]:
diamonds.color.value_counts()

4    8469
6    7282
5    7199
3    6210
7    5098
2    4091
1    2106
Name: color, dtype: int64
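
Because the color grades are single letters in a known order, the rank dictionary can also be generated from that order instead of being typed by hand. A minimal sketch, assuming the worst-to-best order J, I, H, G, F, E, D from the feature description; `color_order`, `color_rank` and `color_encoded` are hypothetical names.

```python
import pandas as pd

# Color grades from worst (J) to best (D)
color_order = 'JIHGFED'

# Rank 1 for the first letter, 2 for the second, and so on
color_rank = {grade: rank for rank, grade in enumerate(color_order, start=1)}

# 'color' values of the first five rows shown by diamonds.head()
color = pd.Series(['D', 'I', 'J', 'I', 'I'])
color_encoded = color.map(color_rank)

print(color_rank)
print(color_encoded.tolist())
```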

Our categorical variable 'color' can now be measured numerically. Let's move on to the last one, 'clarity'.

### Clarity feature

In [10]:
diamonds.clarity.value_counts()

SI1     9758
VS2     9272
SI2     6895
VS1     6151
VVS2    3799
VVS1    2692
IF      1321
I1       567
Name: clarity, dtype: int64

Checking the documentation again, the description for the feature clarity says it is "a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))". All we have to do is perform the same operation as for the other two features.

In [16]:
# Map each clarity grade to its rank, from 1 (I1, worst) to 8 (IF, best)
clarity_num = {'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4,
               'VS1': 5, 'VVS2': 6, 'VVS1': 7, 'IF': 8}

diamonds.clarity = diamonds.clarity.map(clarity_num).astype('int64')

In [12]:
diamonds.clarity.value_counts()

3    9758
4    9272
2    6895
5    6151
6    3799
7    2692
8    1321
1     567
Name: clarity, dtype: int64
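
Since the three encodings follow the same pattern, they could be wrapped in one small helper. The sketch below is a hypothetical `encode_ordinal` function, not part of the original notebook. It assumes two things are desirable: failing loudly on a label that is not in the expected order, and leaving a column untouched if it is already numeric (for example when a cell is executed a second time).

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def encode_ordinal(series, order):
    """Return ranks 1..len(order) for the labels in `series` (hypothetical helper)."""
    if is_numeric_dtype(series):
        # Already encoded, e.g. the cell was run twice: leave it as is
        return series
    unknown = set(series.unique()) - set(order)
    if unknown:
        raise ValueError(f'unexpected labels: {sorted(unknown)}')
    ranks = {label: rank for rank, label in enumerate(order, start=1)}
    return series.map(ranks).astype('int64')


# Clarity grades from worst to best, as in the feature description
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

# 'clarity' values of the first five rows shown by diamonds.head()
clarity = pd.Series(['VS2', 'VS1', 'SI2', 'SI1', 'SI1'])

encoded = encode_ordinal(clarity, clarity_order)
encoded_again = encode_ordinal(encoded, clarity_order)  # second call is a no-op
print(encoded.tolist())
```

The same function could be called with the cut and color orders, so each scale is defined in a single list.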

Now all our features are in a numeric format:

In [21]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40455 entries, 0 to 40454
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   id       40455 non-null  int64  
 1   carat    40455 non-null  float64
 2   cut      40455 non-null  int64  
 3   color    40455 non-null  int64  
 4   clarity  40455 non-null  int64  
 5   depth    40455 non-null  float64
 6   table    40455 non-null  float64
 7   x        40455 non-null  float64
 8   y        40455 non-null  float64
 9   z        40455 non-null  float64
 10  price    40455 non-null  int64  
dtypes: float64(6), int64(5)
memory usage: 3.4 MB


In order to try different models on this 'all-numeric' dataframe, I will save it as a new CSV and start these trials in a new notebook.
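
A minimal sketch of that saving step follows. The file name `train_numeric.csv` and the temporary directory are placeholders chosen for this illustration, and `encoded` is a three-row stand-in for the encoded `diamonds` dataframe. Passing `index=False` keeps pandas from writing the row index as an extra unnamed column, which would otherwise come back as 'Unnamed: 0' when the file is read again.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the encoded dataframe: first three rows, a subset of columns
encoded = pd.DataFrame({
    'id': [0, 1, 2],
    'carat': [0.5, 1.54, 1.32],
    'cut': [5, 2, 3],
    'color': [7, 2, 1],
    'clarity': [4, 5, 2],
    'price': [1845, 10164, 5513],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical output path, used only for this example
    out_path = os.path.join(tmp_dir, 'train_numeric.csv')
    encoded.to_csv(out_path, index=False)  # do not write the row index
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
print(reloaded.dtypes.to_dict())
```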