# Challenge 1: what makes a diamond valuable?

I have used a dataset of 5000 diamonds to understand the main factors contributing to their final price on the market. In the following, I report my analysis step by step, highlighting the main results and figures that help to answer the question: what makes a diamond valuable?

In [2]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
from fits_lib import *
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from sklearn.metrics import r2_score, mean_squared_error
import statsmodels.api as sm

%matplotlib inline

### Let's have a look at the dataset!

In [3]:
# read data into a DataFrame
data = pd.read_csv('diamonds.csv')
data.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1.1,Ideal,H,SI2,62.0,55.0,4733,6.61,6.65,4.11
1,1.29,Ideal,H,SI1,62.6,56.0,6424,6.96,6.93,4.35
2,1.2,Premium,I,SI1,61.1,58.0,5510,6.88,6.8,4.18
3,1.5,Ideal,F,SI1,60.9,56.0,8770,7.43,7.36,4.5
4,0.9,Very Good,F,VS2,61.7,57.0,4493,6.17,6.21,3.82


### Transform the data.

In my analysis, I have looked at various characteristics of diamonds such as size, sparkle, and a few other factors that may contribute to their overall beauty and price. To make sense of these factors, I have used a method to transform some of the information into numbers, so I could see how they relate to each other.

For instance, the "cut" of a diamond is a critical aspect, and it originally had labels like "good", "very good", "premium", and "ideal". The numeric values have been assigned in such a way that higher values correspond to a better cut quality. So, in this transformed representation, a cut labeled as "good" is represented by 0, "very good" by 1, "premium" by 2, and "ideal" by 3.

A similar approach has been used for the "clarity" and "color" attributes of the diamonds. 
For "color" the original labels ranged from 'D' (colorless, the highest quality) to 'J' (light yellow). Here, 'D' was assigned the value 6, and as we move down the color scale, the values decrease, with 'J' being represented by 0.
Similarly, for "clarity" we started with labels like 'IF' (Internally Flawless, the highest quality) down to 'I1' (diamond with visible flaws). In this case, 'IF' was assigned the value 7, and the values decrease as we move down the clarity scale, with 'I1' represented by 0.

So, what does this mean? Higher numeric values for "cut", "color" and "clarity" signify better quality. Therefore, by examining these transformed numeric values, we can easily identify diamonds with superior cut, color and clarity.
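
The encoding described above can be sketched with pandas' ordered categoricals. This is only an illustration on a toy table: `ordinal_to_numeric` in `fits_lib` may well be implemented differently, and `encode_ordinal` is a hypothetical helper name.

```python
import pandas as pd

# Rankings from worst (code 0) to best, as described in the text.
cut_rank = ['Good', 'Very Good', 'Premium', 'Ideal']
color_rank = ['J', 'I', 'H', 'G', 'F', 'E', 'D']


def encode_ordinal(series, rank_list):
    """Map each label to its position in rank_list (hypothetical helper)."""
    cat = pd.Categorical(series, categories=rank_list, ordered=True)
    # Labels that are missing from rank_list receive the code -1.
    return pd.Series(cat.codes, index=series.index, name=series.name)


toy = pd.DataFrame({'cut': ['Ideal', 'Premium', 'Very Good'],
                    'color': ['H', 'I', 'F']})
toy['cut'] = encode_ordinal(toy['cut'], cut_rank)
toy['color'] = encode_ordinal(toy['color'], color_rank)
print(toy)
```

A useful side effect of this approach is that any label not listed in the ranking shows up as -1, which makes unexpected categories easy to spot.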

Let's have a look at the new transformed dataset.

In [4]:
cut_rank=['Good', 'Very Good', 'Premium', 'Ideal']
color_rank = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarity_rank = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
rank = [cut_rank, color_rank, clarity_rank]

ordinal_columns = ['cut', 'color', 'clarity']

d_c = data_conversion(data)

data_converted = data.copy()  # work on a copy so the raw data stay untouched
for col, rank_list in zip(ordinal_columns, rank):
    # convert ordinal data to numeric
    data_converted[col] = d_c.ordinal_to_numeric(col, rank_list)

data_converted.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1.1,3,2,1,62.0,55.0,4733,6.61,6.65,4.11
1,1.29,3,2,2,62.6,56.0,6424,6.96,6.93,4.35
2,1.2,2,1,2,61.1,58.0,5510,6.88,6.8,4.18
3,1.5,3,4,2,60.9,56.0,8770,7.43,7.36,4.5
4,0.9,1,4,3,61.7,57.0,4493,6.17,6.21,3.82


### Contributing factors to the diamond's price.

Here, I have checked which attributes contribute the most to the final price of the diamonds. I used a statistical measure called correlation (specifically, Spearman's rank correlation) to quantify how two variables move in relation to each other. Because I take its absolute value, this measure ranges from 0 to 1 in this case. A correlation larger than 0.5 indicates that the two variables are strongly related: as one variable changes, the other changes in a consistent direction. A correlation of 0 indicates there is no monotonic relationship between the variables.
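
As a small illustration of this measure, the sketch below computes the Spearman correlation on made-up data with pandas. The toy columns `a`, `b` and `c` are assumptions for the example and are not part of the diamond dataset.

```python
import pandas as pd

# Toy data: 'b' grows monotonically (but not linearly) with 'a',
# while 'c' shrinks monotonically as 'a' grows.
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [1, 4, 9, 16, 25],
                    'c': [10, 8, 6, 3, 1]})

signed = toy.corr(method='spearman')  # values between -1 and 1
unsigned = signed.abs()               # values between 0 and 1, direction is lost

print(signed.loc['a', 'b'], signed.loc['a', 'c'])
print(unsigned.loc['a', 'c'])
```

Taking the absolute value keeps the strength of the relationship but hides whether one variable rises or falls with the other.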

In [5]:
corr = data_converted.corr(method='spearman').abs()

# create a mask for the upper-right corner
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# apply the mask
corr[mask] = np.nan
(corr
 .style
 .background_gradient(cmap='Reds', axis=None, vmin=-1, vmax=1)
 .highlight_null(color='#f1f1f1')  # Color NaNs grey
 .format(precision=2))

# data_converted.corr(method='spearman').abs().style.background_gradient(cmap='coolwarm',axis=None)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
carat,,,,,,,,,,
cut,0.13,,,,,,,,,
color,0.24,0.01,,,,,,,,
clarity,0.35,0.18,0.04,,,,,,,
depth,0.0,0.17,0.03,0.07,,,,,,
table,0.2,0.48,0.03,0.15,0.26,,,,,
price,0.96,0.09,0.14,0.18,0.02,0.18,,,,
x,1.0,0.12,0.24,0.34,0.06,0.21,0.96,,,
y,1.0,0.12,0.24,0.34,0.06,0.21,0.96,1.0,,
z,0.99,0.14,0.24,0.35,0.07,0.17,0.95,0.99,0.99,


Now, let me walk you through the visualization of the displayed figure. Each cell of the figure represents the correlation between two variables, which can be read in the respective row and column. The colors help us to quickly identify the strength of the relationship:

* Light-red/orange: no relationship.
* Dark-red/brown: strong relationship.

As we explore this figure, we can see the correlation between the carat and the price of the diamond in the first column. The cell of the first column we should look at is along the seventh row (from the top) labeled as 'price'. The dark colour of this cell suggests that price and carat are highly related to each other; thus, carat is definitely a contributing factor to the final price. Moreover, looking at the seventh column (from the left-hand side of the figure) labeled as 'price', we can see 3 more dark cells. These cells are related to the size of the diamond (x,y,z). We can see there is a strong relationship between the price and the size of the diamond. Therefore, both the size and the carat of the diamond must be taken into account when estimating its final price.

On the other hand, the other attributes do not have a strong relationship with the price of the diamond. Thus, **a diamond's price is mainly related to its carat and its size**.

It is noteworthy that a strong relationship exists between the carat weight and the size (x, y, z) of the diamond, as highlighted by the dark colour of the three lower cells in the first column under 'carat.' Additionally, a strong correlation is evident among the sizes along different axes (x,y,z), illustrated by the colour of the last three cells on the right in the bottom row under 'z.' This is not surprising as larger dimensions typically correspond to greater weight and, subsequently, a higher price.
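
One common way to quantify this kind of redundancy between features is the variance inflation factor (VIF). The sketch below is a plain NumPy version run on synthetic data. The function `vif` and the synthetic columns are assumptions for illustration and are not the `calc_vif` method of `fits_lib`.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of the 2D array X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(0)
carat = rng.uniform(0.3, 2.0, size=500)                      # synthetic weights
x = 6.5 * carat ** (1 / 3) + rng.normal(0, 0.05, size=500)   # size tracks weight
depth = rng.normal(61.8, 1.0, size=500)                      # unrelated feature

print(vif(np.column_stack([carat, x, depth])))
```

Values close to 1 indicate an independent feature, while large values (a common rule of thumb is above 5 or 10) indicate that a feature is almost fully explained by the others.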

In [6]:
# calculate the VIF
# d_c = data_conversion(data_converted[['carat', 'x', 'y', 'z']])
# vif = d_c.calc_vif()
# vif

In [7]:
# plot the correlations
# for column in data_converted.columns:
#     if column != 'price':
#         data_converted.plot.scatter(x=column, y='price')

In [8]:
# data preparation
log_data_converted = np.log10(data_converted[['carat', 'x', 'y', 'z', 'price']])
d_c = data_conversion(log_data_converted)
cleaned_log_data = d_c.data_cleaning() #clean the data 

x_train,x_test,y_train,y_test = train_test_split(cleaned_log_data[['carat', 'x', 'y', 'z']],
                                                 cleaned_log_data[['price']],shuffle=True,
                                                 test_size=0.1,random_state=2)



# Challenge 2: A model to predict the gem's price.

In this second part, I have developed a model that predicts the value of a diamond from the dataset I had available. To achieve this goal, I have built a regression model to quantify the relationship between the price and the features of a diamond.

In the previous task, I noticed that there is a strong relationship between the price and both the carat and the size of a diamond. On the other hand, other features, such as color, cut or clarity, are only weakly correlated with the price (correlations below 0.2), so they have little power in dictating the final price of the diamond. Therefore, I have only used the former variables (carat and sizes) to predict the price of a gem. Moreover, I noticed that the carat and the x, y, and z sizes of a diamond are strongly related to each other. The existence of such a relationship means that we can reduce the number of redundant features by combining them into a single attribute. I will refer to this single attribute derived from the features carat, x, y, and z as 't' throughout the rest of the text.

Next, I have fitted the relationship between the price and the single feature t. I have found that the price of a gem can be estimated with the following relationship:

$ \mathrm{price} = 2409 \times 10^{0.21\,t} $.
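
This form follows from fitting a straight line to log-transformed values: if log10(price) = a + b t, then price = 10^a × 10^(b t). The sketch below checks this with the rounded intercept (3.382) and slope (0.214) printed by the regression cells. `predict_price` is a hypothetical helper and not part of `fits_lib`.

```python
import numpy as np

intercept = 3.382  # fitted intercept in log10(USD), rounded
slope = 0.214      # fitted slope per unit of t, rounded


def predict_price(t):
    """Price in USD for a value (or array) of the combined feature t."""
    return 10 ** (intercept + slope * np.asarray(t, dtype=float))


print(predict_price(0.0))               # prefactor of the relationship, 10**3.382
print(predict_price([-1.0, 0.0, 1.0]))  # each unit of t multiplies the price by 10**0.214
```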

To test whether my model accurately predicts the price of a diamond, I have used the R-squared score. This number tells us how well the selected model explains the relationship in the data. The closer the score is to one, the better the model describes the data. As a rule of thumb, an R-squared score larger than 0.7 means that the data are well explained by the model. I have found an R-squared of 0.93, which is a strong indication that my model explains a large portion (93%) of the variability in the (log-transformed) diamond prices.
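
For reference, the R-squared score is defined as one minus the ratio between the residual sum of squares and the total sum of squares. A minimal sketch on made-up numbers follows. `r_squared` is a hypothetical helper written for this example.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = np.array([3.0, 3.2, 3.5, 3.9, 4.1])    # e.g. log10 prices
perfect = y_true.copy()                          # a model that is always right
baseline = np.full_like(y_true, y_true.mean())   # a model that always predicts the mean

print(r_squared(y_true, perfect))
print(r_squared(y_true, baseline))
```

A perfect model scores one, while a model that is no better than always predicting the mean scores zero.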



Note for Luca: To remove the collinearity between the different features (carat,x,y,z) I have used principal component analysis (PCA). A single component already explains more than 95% of the variance of the data. Nevertheless, the number of components can be adjusted according to the fraction of the variance one wants to retain.
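
A minimal sketch of this step with scikit-learn is shown below, on synthetic data that mimics four strongly related features. It assumes that the scaler and the PCA are fitted on the training set and then reused on the test set. The `principal_component_analysis` function in `fits_lib` may work differently.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in for log10(carat, x, y, z): four noisy copies of one signal.
signal = rng.normal(size=(300, 1))
loadings = np.array([[1.0, 0.33, 0.33, 0.33]])
features = signal @ loadings + rng.normal(scale=0.02, size=(300, 4))
train, test = features[:270], features[270:]

scaler = StandardScaler().fit(train)
# A float n_components keeps the fewest components reaching that variance share.
pca = PCA(n_components=0.95).fit(scaler.transform(train))

t_train = pca.transform(scaler.transform(train))
t_test = pca.transform(scaler.transform(test))  # same projection as for training

print(pca.n_components_, pca.explained_variance_ratio_)
```

Reusing the fitted projection on the test set keeps the meaning of 't' identical for training and prediction.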

In [9]:
# linear regression with sklearn
lr = linear_regression(x_train, y_train)
reg = lr.regression()

In [10]:
# apply principal component analysis to x_test 
# to calculate y_pred
threshold = 0.95
x_test_pca = principal_component_analysis(x_test, threshold)

y_pred = reg.predict(x_test_pca)

print("Coefficient of linear regression (sklearn):", np.round(reg.coef_[0][0], 3))
print("Intercept of linear regression (sklearn):", np.round(reg.intercept_[0], 3))
print("R-squared Score of linear regression (sklearn):", np.round(reg.score(x_test_pca, y_test), 3))


Coefficient of linear regression (sklearn): 0.214
Intercept of linear regression (sklearn): 3.382
R-squared Score of linear regression (sklearn): 0.934


Note for Luca: sklearn is user-friendly, but it is not straightforward to run an inference analysis with it and estimate the errors on the coefficient and intercept of the regression. Therefore, I have also included here a linear regression with statsmodels. I have reported the coefficient, the intercept and the score of the linear regression to compare the results with the sklearn method. On top of this, I have included the p-values of the coefficient and intercept of the linear regression. A p-value < 0.05 is conventionally taken to indicate that the fitted parameter is significantly different from zero.

In [11]:
# linear regression model using sm.ols
x_pca = principal_component_analysis(x_train, 0.95)
model = sm.OLS(y_train, sm.add_constant(x_pca)).fit()

print(f"Coefficient of linear regression (statsmodels): {np.round(model.params[1],3)} +- {np.round(model.bse[1],3)}. P-value={np.round(model.pvalues[1],2)}")
print(f"Intercept of linear regression (statsmodels): {np.round(model.params[0],3)} +- {np.round(model.bse[0],3)}. P-value={np.round(model.pvalues[0],2)}")
print("R-squared Score of linear regression (statsmodels):", np.round(model.rsquared,3))



Coefficient of linear regression (statsmodels): 0.214 +- 0.001. P-value=0.0
Intercept of linear regression (statsmodels): 3.382 +- 0.002. P-value=0.0
R-squared Score of linear regression (statsmodels): 0.934


### What's the discrepancy between real and predicted prices?

To further check the validity of my model, I have compared the predicted prices with the real values reported in the dataset. I calculated an "average difference" between the predicted and real prices. Since the model is fitted on log-transformed prices, this difference is computed on log10(price), and I have found that it is about 3%. A 3% difference in log10(price) corresponds to a multiplicative factor on the price in USD. This result implies:

* for a diamond valued at 2400 USD (which is the average price of a gem in the dataset), the predicted value is roughly in the range 1900-3030 USD (a factor of about 1.26).

* for a diamond valued at 19,000 USD (which is the largest price of a gem in the dataset), the predicted value is roughly in the range 14,100-25,500 USD (a factor of about 1.34).
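
Because the regression is fitted on log10(price), a relative error on log10(price) corresponds to a multiplicative factor on the price itself. The arithmetic can be sketched as follows, using the rounded 3% quoted above. `price_range` is a hypothetical helper.

```python
import numpy as np


def price_range(price, rel_log_error):
    """Price interval implied by a relative error on log10(price)."""
    delta = rel_log_error * np.log10(price)  # absolute error in log10 units
    factor = 10 ** delta                     # multiplicative error on the price
    return price / factor, price * factor


for price in (2400, 19000):
    low, high = price_range(price, 0.03)
    print(f"{price} USD -> {low:.0f} to {high:.0f} USD")
```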



In [12]:
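# Both metrics are computed on the log10 prices used to fit the model.
# np.sqrt is used because the `squared` argument is deprecated in recent scikit-learn releases.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Normalised Root Mean Squared Error of linear regression (sklearn):", np.round(rmse/np.mean(y_test), 3))

print("Normalised mean absolute error", np.round(np.sum( np.abs(y_test -y_pred)/y_test )/len(y_test),3))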


Normalised Root Mean Squared Error of linear regression (sklearn): 0.034
Normalised mean absolute error price    0.027
dtype: float64
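
A complementary check is to measure the error in USD by undoing the log10 transform before comparing predictions with real values. The sketch below uses made-up numbers rather than the notebook's `y_test` and `y_pred`. `mean_abs_pct_error_usd` is a hypothetical helper.

```python
import numpy as np


def mean_abs_pct_error_usd(log_true, log_pred):
    """Mean absolute percentage error after undoing the log10 transform."""
    true_usd = 10 ** np.asarray(log_true, dtype=float)
    pred_usd = 10 ** np.asarray(log_pred, dtype=float)
    return np.mean(np.abs(pred_usd - true_usd) / true_usd)


# Synthetic example: every prediction is off by +0.1 in log10(price).
log_true = np.array([3.0, 3.5, 4.0])
log_pred = log_true + 0.1

print(mean_abs_pct_error_usd(log_true, log_pred))       # error in USD terms
print(np.mean(np.abs(log_pred - log_true) / log_true))  # error in log10 terms
```

The two numbers differ considerably, which is why it matters whether an error is quoted on the log scale or in USD.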


# Summary

Let me summarise the key findings:

* The primary factors influencing a diamond's value are its carat (weight) and size (dimensions: x, y, z).

* Using a combination of carat and sizes, I developed a model to accurately estimate a diamond's price. 

* My model reproduces the data well (R-squared of 0.93), with an average difference of about 3% between the actual and predicted log10 prices.

In [13]:
# predicted vs real prices plot

# x = [np.min([np.min(y_test),np.min(y_pred)]), np.max([np.max(y_test),np.max(y_pred)])]

# plt.scatter(y_test,y_pred, s=15, label='real vs pred.')
# plt.plot(x,x, linestyle='--', color='darkorange', linewidth=3.0, label='Line of equality')
# plt.xlabel('Real price (USD)')
# plt.ylabel('Predicted price (USD)')
# plt.legend()

In [14]:
# t = np.arange(np.min(x_pca), np.max(x_pca), 0.1)
# price = 2409 * 10**(0.21*t)

# plt.scatter(x_pca, 10**y_train)
# plt.plot(t, price, linestyle = "-", color='orange', label = "model")
# plt.xlabel("t")
# plt.ylabel("real prices (USD)")
# plt.legend()