# Diamonds 

![img](https://miro.medium.com/max/4000/0*6WLqebrITTPNHwu7.gif)

Diamonds are among the best-known and most sought-after gemstones and have been used as decorative items since ancient times.
The hardness of diamond and its high dispersion of light, which gives the diamond its characteristic "fire", make it useful for industrial applications and desirable as jewelry. Diamonds are such a highly traded commodity that multiple organizations have been created for grading and certifying them based on the "four Cs": color, cut, clarity, and carat. Other characteristics, such as the presence or lack of fluorescence, also affect the desirability and thus the value of a diamond used for jewelry.

# Exploring Diamonds

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')
pd.options.display.max_columns = 150

In [None]:
data = pd.read_csv('../input/diamonds/diamonds.csv')
print(data.shape)
data.head()

In [None]:
data.drop('Unnamed: 0', axis=1, inplace=True)

## Characteristics of a Diamond

Column | Description
:---|:---
price | Price in US dollars
carat | The mass of a diamond. One carat is defined as 200 milligrams. The price per carat increases with carat weight, since larger diamonds are both rarer and more desirable for use as gemstones, but it does not increase linearly with size. Instead, there are sharp jumps around milestone carat weights.
cut | Quality of the cut. The cut describes how a diamond has been shaped and polished from a rough stone to its final gem proportions, including the quality of workmanship and the angles to which the diamond is cut. Grades: Fair (worst), Good, Very Good, Premium, Ideal (best).
color | Diamond color, from J (worst) to D (best). The finest color grade, D, is completely colorless. The next grades, E and F, have a very slight trace of color. Diamonds that show very small traces of color are graded G or H. Slightly colored diamonds are graded I, J, or K.
clarity | A measure of the internal defects of a diamond, called inclusions. Inclusions may be crystals of a foreign material or another diamond crystal, or structural imperfections such as tiny cracks that can appear whitish or cloudy. The number, size, color, relative location, orientation, and visibility of inclusions can all affect the clarity of a diamond. Grades: I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best).
x | Length of the diamond in mm
y | Width of the diamond in mm
z | Depth of the diamond in mm
depth | Total depth percentage = z / mean(x, y) = 2 * z / (x + y) 
table | Width of top of diamond relative to widest point
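
The `carat` and `depth` definitions in the table can be checked with a little arithmetic. The sketch below is illustrative only: the helper names and the sample measurements are made up, and the result is multiplied by 100 on the assumption that the `depth` column is stored as a percentage.

```python
def carat_to_grams(carat):
    """Convert carat weight to grams (1 carat = 200 mg = 0.2 g)."""
    return carat * 0.2


def depth_percentage(x, y, z):
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100 * 2 * z / (x + y)


# Hypothetical stone: 4.0 mm x 4.0 mm x 2.4 mm, weighing 0.25 carat
print(round(carat_to_grams(0.25), 3))
print(round(depth_percentage(4.0, 4.0, 2.4), 1))
```

A stone that is shallow relative to its width gets a low depth percentage, and a tall, narrow stone gets a high one.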

**Measurements of a Diamond**

![img](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTSvEgoy9XcJ3itQ4B6i0IJTqwO4KR8zZ2dKQ&usqp=CAU)

**Color Grading Scale**

![img](https://i.pinimg.com/originals/23/c4/6a/23c46a456f285489c5893ab719cb611f.jpg)


**Clarity Grading**
![img](https://edipson.com/wp-content/uploads/2019/07/clarity-chart.jpg)

## Exploring the data

In [None]:
data.describe()

**The minimum values of x, y, and z are 0, which is physically impossible, so those rows are removed.**

In [None]:
data = data[(data['x'] > 0) & (data['y'] > 0) & (data['z'] > 0)].reset_index(drop=True)
print(len(data))

**Checking Missing Data**
___

In [None]:
fig = plt.figure(figsize=(20, 6))
sns.heatmap(data.isnull(), yticklabels=False, cbar=False)
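
A heatmap with no highlighted cells suggests there are no missing values, but across tens of thousands of rows a handful of gaps can be hard to see. A numeric count per column is a useful complement. The sketch below uses a small made-up frame (`toy` is not part of this notebook) so that it runs on its own; on the real data the equivalent call would be `data.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only
toy = pd.DataFrame({
    'carat': [0.23, 0.31, np.nan],
    'cut': ['Ideal', 'Good', 'Premium'],
})

# Count missing values per column; all zeros means nothing to impute
missing_counts = toy.isnull().sum()
print(missing_counts)
```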

**Price Distribution**
___

In [None]:
fig = plt.figure(figsize=(20, 6))
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(data['price'])

## Feature Analysis (Univariate and Bivariate)

### Categorical Features

In [None]:
for col in ['cut', 'color', 'clarity']:
    fig, ax = plt.subplots(1, 2, figsize=(20, 6))
    fig.suptitle(col, fontsize=18)
    # Left: share of each category; right: price histogram per category
    data[col].value_counts().plot.pie(ax=ax[0], autopct="%1.1f%%")
    ax[0].legend()
    for val in data[col].unique():
        # histplot replaces the deprecated distplot(kde=False)
        sns.histplot(data[data[col] == val]['price'], ax=ax[1], label=val)
    ax[1].legend()
    plt.show()

### Numerical Features

In [None]:
for col in ['carat', 'depth', 'table', 'x', 'y', 'z']:
    fig, ax = plt.subplots(1, 2, figsize=(20, 6))
    fig.suptitle(col, fontsize=18)
    # histplot replaces the deprecated distplot(kde=False)
    sns.histplot(data[col], ax=ax[0])
    data[[col]+['price']].plot.scatter(x=col, y='price', ax=ax[1])
    plt.show()

### Multivariate Analysis

In [None]:
sns.catplot(data=data, x='clarity', hue='cut', y='price', kind='point', aspect=3)

In [None]:
sns.catplot(data=data, x='color', hue='clarity', y='price', kind='point', aspect=3)

In [None]:
sns.catplot(data=data, x='color', hue='cut', y='price', kind='point', aspect=3)
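
The grades on the x-axis and in the legend of these plots appear in whatever order seaborn encounters or sorts them, which need not match the quality scales described earlier. One option, sketched here on a made-up frame, is to convert the columns to ordered categoricals first. The `grade_orders` dictionary is an assumption built from the worst-to-best orderings listed in the column descriptions.

```python
import pandas as pd

# Worst-to-best orderings taken from the column descriptions
grade_orders = {
    'cut': ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],
    'color': ['J', 'I', 'H', 'G', 'F', 'E', 'D'],
    'clarity': ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'],
}

# Hypothetical sample rows, for illustration only
toy = pd.DataFrame({
    'cut': ['Ideal', 'Fair', 'Premium'],
    'color': ['E', 'J', 'G'],
    'clarity': ['VS1', 'I1', 'IF'],
})

for col, order in grade_orders.items():
    toy[col] = pd.Categorical(toy[col], categories=order, ordered=True)

# Sorting now follows grade order instead of alphabetical order
sorted_clarity = toy.sort_values('clarity')['clarity'].tolist()
print(sorted_clarity)
```

seaborn plots categorical columns in their category order, so applying the same conversion to `data` before calling `catplot` should line the axes up with the grading scales.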

### Mass, Volume and Density

In [None]:
df = pd.DataFrame()
# Bounding-box volume in cubic mm, computed column-wise instead of row by row
df['Volume'] = data['x'] * data['y'] * data['z']
df['Mass'] = data['carat']
df['Density'] = df['Mass'] / df['Volume']
df['Price'] = data['price']

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
for col in ['Volume', 'Density']:
    fig, ax = plt.subplots(1, 2, figsize=(20, 6))
    fig.suptitle(col, fontsize=18)
    # histplot replaces the deprecated distplot(kde=False)
    sns.histplot(df[col], ax=ax[0])
    df[[col]+['Price']].plot.scatter(x=col, y='Price', ax=ax[1])
    plt.show()

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(20, 6))

df.plot.scatter(x='Density', y='Price', ax=ax[0])
df.plot.scatter(x='Mass', y='Price', ax=ax[1])
df.plot.scatter(x='Volume', y='Price', ax=ax[2])

In [None]:
sns.heatmap(df.corr(), annot=True, center=0, cmap='RdYlGn')

**Density is almost constant, and mass and volume are highly correlated**
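
One way to put numbers on this observation is the coefficient of variation of the density (standard deviation divided by mean) together with the Pearson correlation between mass and volume. The measurements below are invented for illustration and are not rows from the dataset. Note that `x * y * z` is the volume of the bounding box rather than of the stone itself, so the ratio is a density only up to a shape factor.

```python
import numpy as np

# Hypothetical stones: carat weight and dimensions in mm
carat = np.array([0.25, 0.45, 0.81, 1.04])
x = np.array([4.0, 5.0, 6.0, 6.5])
y = np.array([4.0, 5.0, 6.0, 6.5])
z = np.array([2.5, 3.0, 3.7, 4.0])

volume = x * y * z          # bounding-box volume in cubic mm
density = carat / volume    # carats per cubic mm of bounding box

# A small coefficient of variation means the density is nearly constant
cv = density.std() / density.mean()
# A correlation close to 1 means mass and volume carry the same information
corr = np.corrcoef(carat, volume)[0, 1]

print(f'coefficient of variation: {cv:.3f}')
print(f'mass-volume correlation: {corr:.4f}')
```

When the two features are this strongly related, a model gains little from having both, which is worth keeping in mind when reading feature importances later.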

## Data Preprocessing

* Missing-value treatment (not required in this case)
    * Numerical data: median imputation
    * Categorical data: constant imputation
* Preprocessing:
    * Numerical data: scaling
    * Categorical data: one-hot encoding
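
To make the two preprocessing steps concrete, here is a minimal sketch using plain pandas on a made-up frame. It is not the pipeline used in this notebook: standard scaling is written out by hand, and `pd.get_dummies` stands in for one-hot encoding.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'carat': [0.2, 0.4, 0.6],
    'cut': ['Ideal', 'Good', 'Ideal'],
})

# Scaling: subtract the mean and divide by the population standard deviation
# (ddof=0), which is the convention StandardScaler follows
scaled = (toy['carat'] - toy['carat'].mean()) / toy['carat'].std(ddof=0)

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(toy['cut'], prefix='cut')

print(scaled.round(3).tolist())
print(encoded.columns.tolist())
```

After scaling, the column has mean 0 and unit variance, so features measured in different units become comparable. After encoding, each category is represented by its own 0/1 column.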

In [None]:
X = data.drop(['price'], axis=1)
y = data['price']

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline

In [None]:
def get_column_names(feature_name, columns):
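    # Encoder names look like 'x0_Ideal'; split once so that category
    # values containing an underscore are not truncated
    prefix, val = feature_name.split('_', 1)
    col_idx = int(prefix[1:])
    return f'{columns[col_idx]}_{val}'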

class Preprocessor():
    
    def __init__(self, return_df=True):
        self.return_df = return_df
        
        self.impute_median = SimpleImputer(strategy='median')
        self.impute_const = SimpleImputer(strategy='constant')
        self.ss = StandardScaler()
        self.ohe = OneHotEncoder(handle_unknown='ignore')
        
        self.num_cols = make_column_selector(dtype_include='number')
        self.cat_cols = make_column_selector(dtype_exclude='number')
        
        self.preprocessor = make_column_transformer(
            (make_pipeline(self.impute_median, self.ss), self.num_cols),
            (make_pipeline(self.impute_const, self.ohe), self.cat_cols),
        )
        
    def fit(self, X):
        return self.preprocessor.fit(X)
        
    def transform(self, X):
        Xtransformed = self.preprocessor.transform(X)
        # The transformer may return a sparse matrix; convert it to a dense array
        if hasattr(Xtransformed, 'toarray'):
            Xtransformed = Xtransformed.toarray()
        if self.return_df:
            return pd.DataFrame(
                Xtransformed,
                columns=self.num_cols(X)+list(map(
                    lambda x: get_column_names(x, self.cat_cols(X)),
                    # get_feature_names() was removed in newer scikit-learn releases
                    self.preprocessor.transformers_[1][1][1].get_feature_names_out()
                ))
            )
        return Xtransformed
        
    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

In [None]:
X = Preprocessor().fit_transform(X)
print(X.shape)
X.head()

In [None]:
features = X.columns
X = X.values
y = y.values

# 'Deciding' the Diamond Prices

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error

In [None]:
kf = KFold(random_state=19, shuffle=True)
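
`KFold` is called here without `n_splits`, so it uses scikit-learn's default of 5 folds; `shuffle=True` with a fixed `random_state` makes the split random but reproducible. The idea can be sketched with numpy alone. The helper `shuffled_folds` below is hypothetical and is not how scikit-learn implements it; it only illustrates that every sample lands in exactly one test fold.

```python
import numpy as np


def shuffled_folds(n_samples, n_splits=5, seed=19):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    # array_split handles sample counts that are not a multiple of n_splits
    folds = np.array_split(indices, n_splits)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx


splits = list(shuffled_folds(n_samples=10))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each model in the sections that follow is trained on four fifths of the data and scored on the remaining fifth, five times over, and the scores are averaged.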

## Linear Regression Baseline

In [None]:
%%time
r2scores = []
rmse = []
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model = LinearRegression().fit(X_train, y_train)
    r2scores.append(model.score(X_test, y_test))
    rmse.append(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    
print('Mean r2 score', np.mean(r2scores))
print('Mean rmse', np.mean(rmse))
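
Both metrics reported by this cell can be computed by hand, which helps when reading the numbers: RMSE is in the same units as the price (US dollars), while R² compares the model's squared error with that of always predicting the mean. The prices below are invented for illustration.

```python
import numpy as np

# Hypothetical true and predicted prices in US dollars
y_true = np.array([300.0, 500.0, 1000.0, 2200.0])
y_pred = np.array([350.0, 450.0, 1100.0, 2100.0])

residuals = y_true - y_pred

# Root mean squared error: typical size of a prediction error
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2 = 1 - SS_res / SS_tot, where SS_tot is the error of predicting the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'RMSE: {rmse:.2f}')
print(f'R^2: {r2:.4f}')
```

An R² of 1 means perfect predictions, 0 means no better than predicting the mean price, and negative values mean worse than that.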

## Decision Tree

In [None]:
%%time
r2scores = []
rmse = []
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model = DecisionTreeRegressor(random_state=19).fit(X_train, y_train)
    r2scores.append(model.score(X_test, y_test))
    rmse.append(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    
print('Mean r2 score', np.mean(r2scores))
print('Mean rmse', np.mean(rmse))

In [None]:
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from xgboost import XGBRFRegressor, XGBRegressor
from lightgbm import LGBMRegressor

trees = [
    ('Random Forest', RandomForestRegressor), ('Extra Trees', ExtraTreesRegressor), ('LightGBM', LGBMRegressor),
    ('Gradient Boosting', GradientBoostingRegressor), ('XGBoost', XGBRegressor), ('XGBoostRF', XGBRFRegressor),
]

In [None]:
%%time
for name, algo in trees:
    r2scores = []
    rmse = []
    for train_index, test_index in kf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        model = algo(random_state=19).fit(X_train, y_train)
        r2scores.append(model.score(X_test, y_test))
        rmse.append(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

    print(name)
    print('Mean r2 score', np.mean(r2scores))
    print('Mean rmse', np.mean(rmse))
    print()

**LightGBM works best on this data, so let's explore it further.**

In [None]:
%%time
r2scores = []
rmse = []
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model = LGBMRegressor(random_state=19).fit(X_train, y_train)
    r2scores.append(model.score(X_test, y_test))
    rmse.append(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    
print('Mean r2 score', np.mean(r2scores))
print('Mean rmse', np.mean(rmse))

In [None]:
fig, ax = plt.subplots(figsize=(20, 6))
fig.suptitle('Feature Importance')
pd.Series(model.feature_importances_, index=features).sort_values(ascending=False).plot.bar(ax=ax)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))

**According to the LightGBM feature importances, the key deciding factors for the price of a diamond are:**

* Its weight (carat)
* Its dimensions (y, z, depth, x, table)
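
Because the categorical columns were one-hot encoded, each of `cut`, `color`, and `clarity` is spread over several bars in the plot, which can understate their combined contribution. A possible follow-up is to sum the importances per original column. The numbers below are invented, and the grouping assumes encoded names of the form `<column>_<category>` with no underscore in the original column names.

```python
import pandas as pd

# Hypothetical importances, for illustration only
importances = pd.Series({
    'carat': 600, 'y': 400, 'z': 300,
    'cut_Ideal': 40, 'cut_Fair': 60,
    'clarity_SI1': 80, 'clarity_IF': 70,
})

# Map each encoded feature back to its source column and sum per column
grouped = (
    importances
    .groupby(lambda name: name.split('_')[0])
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```

On the real model, the same grouping could be applied to `pd.Series(model.feature_importances_, index=features)`.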