We have been hired by Gem Stones Co Ltd, a cubic zirconia manufacturer. We are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia stones (cubic zirconia is an inexpensive diamond alternative with many of the same qualities as a diamond). The company earns different profits in different price slots. Our task is to help the company predict the price of a stone from the details given in the dataset, so that it can distinguish between more profitable and less profitable stones and improve its profit share. We also need to identify the 5 most important attributes.

# Importing libraries

In [25]:
import numpy as np
import pandas as pd



from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

# Reading Data
I'll read the dataset and inspect its columns, data types, and missing values.

In [2]:
data = pd.read_csv('cubic_zirconia.csv')

In [3]:
data.head()

,Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,price
0,1,0.3,Ideal,E,SI1,62.1,58.0,4.27,4.29,2.66,499
1,2,0.33,Premium,G,IF,60.8,58.0,4.42,4.46,2.7,984
2,3,0.9,Very Good,E,VVS2,62.2,60.0,6.04,6.12,3.78,6289
3,4,0.42,Ideal,F,VS1,61.6,56.0,4.82,4.8,2.96,1082
4,5,0.31,Ideal,F,VVS1,60.4,59.0,4.35,4.43,2.65,779


## Exploratory Data Analysis
Let's explore the various columns and draw information about how useful each column is. I'll also modify the test data based on training data.


The first column is the index for each data point and hence we can simply remove it.

In [4]:
data.drop('Unnamed: 0', axis=1, inplace=True)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26967 entries, 0 to 26966
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    26967 non-null  float64
 1   cut      26967 non-null  object 
 2   color    26967 non-null  object 
 3   clarity  26967 non-null  object 
 4   depth    26270 non-null  float64
 5   table    26967 non-null  float64
 6   x        26967 non-null  float64
 7   y        26967 non-null  float64
 8   z        26967 non-null  float64
 9   price    26967 non-null  int64  
dtypes: float64(6), int64(1), object(3)
memory usage: 2.1+ MB


In [6]:
data['cut'].unique()

array(['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'], dtype=object)

In [7]:
data['clarity'].unique()


array(['SI1', 'IF', 'VVS2', 'VS1', 'VVS1', 'VS2', 'SI2', 'I1'],
      dtype=object)

## Filling null values with the mean

In [8]:
# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas.
data["depth"] = data["depth"].fillna(data["depth"].mean())
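
To make the mean-imputation step concrete, here is a minimal sketch on a toy series. The four `depth` values are illustrative and are not taken from the full dataset; the point is only that `mean()` skips missing entries, so the gap is filled with the average of the observed values.

```python
import numpy as np
import pandas as pd

# Toy 'depth' column with one missing value (illustrative numbers only).
depth = pd.Series([62.1, np.nan, 60.8, 61.6], name="depth")

# Series.mean() ignores NaN by default, so this is the mean of the three observed values.
filled = depth.fillna(depth.mean())

print(filled.isna().sum())  # number of missing values left after filling
```

If the test split is meant to stay unseen, one option is to compute the mean on the training rows only and reuse it for the test rows.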

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26967 entries, 0 to 26966
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    26967 non-null  float64
 1   cut      26967 non-null  object 
 2   color    26967 non-null  object 
 3   clarity  26967 non-null  object 
 4   depth    26967 non-null  float64
 5   table    26967 non-null  float64
 6   x        26967 non-null  float64
 7   y        26967 non-null  float64
 8   z        26967 non-null  float64
 9   price    26967 non-null  int64  
dtypes: float64(6), int64(1), object(3)
memory usage: 2.1+ MB


## Creating Dummy Variables

In [10]:
data_new = pd.get_dummies(data,
                          columns=["cut", "color", "clarity"],
                          drop_first=True, dtype=int)
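
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy frame. The rows are made up for the example; only the column name `cut` and its levels come from the dataset. The first level in sorted order is dropped and acts as the baseline, which is why `cut_Fair` does not appear among the dummy columns.

```python
import pandas as pd

# Toy frame reusing the 'cut' levels from the dataset (rows are illustrative).
toy = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.42],
    "cut": ["Ideal", "Premium", "Very Good", "Fair"],
})

encoded = pd.get_dummies(toy, columns=["cut"], drop_first=True, dtype=int)

# 'Fair' sorts first, so it is the dropped baseline: a 'Fair' row has zeros in every cut_* column.
print(list(encoded.columns))
```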

In [11]:
data_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26967 entries, 0 to 26966
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   carat          26967 non-null  float64
 1   depth          26967 non-null  float64
 2   table          26967 non-null  float64
 3   x              26967 non-null  float64
 4   y              26967 non-null  float64
 5   z              26967 non-null  float64
 6   price          26967 non-null  int64  
 7   cut_Good       26967 non-null  int32  
 8   cut_Ideal      26967 non-null  int32  
 9   cut_Premium    26967 non-null  int32  
 10  cut_Very Good  26967 non-null  int32  
 11  color_E        26967 non-null  int32  
 12  color_F        26967 non-null  int32  
 13  color_G        26967 non-null  int32  
 14  color_H        26967 non-null  int32  
 15  color_I        26967 non-null  int32  
 16  color_J        26967 non-null  int32  
 17  clarity_IF     26967 non-null  int32  
 18  clarit

In [12]:
list(data_new.columns)

['carat',
 'depth',
 'table',
 'x',
 'y',
 'z',
 'price',
 'cut_Good',
 'cut_Ideal',
 'cut_Premium',
 'cut_Very Good',
 'color_E',
 'color_F',
 'color_G',
 'color_H',
 'color_I',
 'color_J',
 'clarity_IF',
 'clarity_SI1',
 'clarity_SI2',
 'clarity_VS1',
 'clarity_VS2',
 'clarity_VVS1',
 'clarity_VVS2']

In [13]:
data_n = data_new[['carat',
 'depth',
 'table',
 'x',
 'y',
 'z',
 'cut_Good',
 'cut_Ideal',
 'cut_Premium',
 'cut_Very Good',
 'color_E',
 'color_F',
 'color_G',
 'color_H',
 'color_I',
 'color_J',
 'clarity_IF',
 'clarity_SI1',
 'clarity_SI2',
 'clarity_VS1',
 'clarity_VS2',
 'clarity_VVS1',
 'clarity_VVS2',
 'price']]

In [14]:
data_n.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26967 entries, 0 to 26966
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   carat          26967 non-null  float64
 1   depth          26967 non-null  float64
 2   table          26967 non-null  float64
 3   x              26967 non-null  float64
 4   y              26967 non-null  float64
 5   z              26967 non-null  float64
 6   cut_Good       26967 non-null  int32  
 7   cut_Ideal      26967 non-null  int32  
 8   cut_Premium    26967 non-null  int32  
 9   cut_Very Good  26967 non-null  int32  
 10  color_E        26967 non-null  int32  
 11  color_F        26967 non-null  int32  
 12  color_G        26967 non-null  int32  
 13  color_H        26967 non-null  int32  
 14  color_I        26967 non-null  int32  
 15  color_J        26967 non-null  int32  
 16  clarity_IF     26967 non-null  int32  
 17  clarity_SI1    26967 non-null  int32  
 18  clarit

## Relation of features with the label

In [15]:
data_corr = data_new.corr()

In [16]:
corr = data_corr['price']
r = corr.to_dict()
r

{'carat': 0.9224161094805432,
 'depth': -0.002533517877597092,
 'table': 0.12694223324168055,
 'x': 0.8862471788154094,
 'y': 0.8562425409055257,
 'z': 0.8505361306239215,
 'price': 1.0,
 'cut_Good': -0.0007004694751990921,
 'cut_Ideal': -0.09869354027353494,
 'cut_Premium': 0.08868163619035392,
 'cut_Very Good': 0.012659865710896985,
 'color_E': -0.10155597394852611,
 'color_F': -0.02745673497047337,
 'color_G': 0.00809107961473747,
 'color_H': 0.057585277622442636,
 'color_I': 0.10008388527636619,
 'color_J': 0.08223084880152912,
 'clarity_IF': -0.05545281145748966,
 'clarity_SI1': 0.008269372976095809,
 'clarity_SI2': 0.129768469518776,
 'clarity_VS1': -0.010577566815271926,
 'clarity_VS2': 0.003927809873590731,
 'clarity_VVS1': -0.09656455737457728,
 'clarity_VVS2': -0.05391399474610804}

In [17]:
# Print features whose correlation with price is above 0.5 (price itself, at 1.0, is excluded).
for a in r:
    if 0.5 < r[a] < 1:
        print(a)
    

carat
x
y
z
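
The threshold above returns only four features. One way to get a fifth is to rank features by the absolute value of their correlation with `price`. The sketch below does that on a subset of the coefficients printed earlier, rounded to four decimals. Note that Pearson correlation only captures linear association, so this ranking is a rough guide rather than a definitive importance measure.

```python
import pandas as pd

# Subset of the correlations with price shown above, rounded to four decimals.
corr = pd.Series({
    "carat": 0.9224, "depth": -0.0025, "table": 0.1269,
    "x": 0.8862, "y": 0.8562, "z": 0.8505, "price": 1.0,
    "color_E": -0.1016, "clarity_SI2": 0.1298, "clarity_VVS2": -0.0539,
})

# Drop the label itself, then rank by magnitude so strong negative correlations also count.
top5 = corr.drop("price").abs().sort_values(ascending=False).head(5)
print(top5.index.tolist())
```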


# Important features by absolute correlation with price are 'carat', 'x', 'y', 'z', and 'clarity_SI2'

# Train Test Splitting

In [18]:
x_train, x_test, y_train, y_test = train_test_split(data_n.iloc[:, :-1], 
                                                    data_n.iloc[:, -1], 
                                                    test_size = 0.2, 
                                                    random_state = 42)

In [19]:
standardScaler = StandardScaler()
standardScaler.fit(x_train)
x_train = standardScaler.transform(x_train)
x_test = standardScaler.transform(x_test)
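
The scaler is fitted on the training split only and then applied to both splits, which keeps test-set statistics out of preprocessing. The same pattern can be written more compactly with a scikit-learn `Pipeline`. The sketch below uses synthetic data (the arrays `X` and `y` are assumptions for illustration, not the zirconia dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real features and prices.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from X_train only; score() reuses them on X_test.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # R^2 on the held-out split
print(score)
```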

# Models

## LinearRegression

In [20]:
linearRegression = LinearRegression()
linearRegression.fit(x_train, y_train)
y_pred = linearRegression.predict(x_test)
r2_score(y_test, y_pred)

0.872476216369724

## RandomForestRegressor

In [21]:
rf = RandomForestRegressor(n_estimators = 100)
rf.fit(x_train, y_train)
y_pred2 = rf.predict(x_test)
r2_score(y_test, y_pred2)

0.9717309611905421
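
A fitted random forest also exposes `feature_importances_`, which gives another view of which attributes matter besides correlation. The sketch below shows how to read and rank them. It uses synthetic data with hypothetical column names (`strong`, `weak`, `noise`), so the numbers say nothing about the zirconia dataset itself.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features with hypothetical names; 'strong' drives the target by construction.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "strong": rng.normal(size=300),
    "weak": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = 10.0 * X["strong"] + 1.0 * X["weak"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1; sort them to get a ranking.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Applied to the model above, the same two lines with `rf.feature_importances_` and the column names of `data_n` (without `price`) would rank the real features. Impurity-based importances can be split between strongly correlated features such as `carat`, `x`, `y`, and `z`.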

## Cross Validation

In [22]:
x = data_n.drop('price', axis=1)

In [23]:
y = data['price']

In [26]:
scores = cross_val_score(rf, x, y, cv=10)
scores.mean()

0.971426820719177
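
Passing an integer as `cv` splits a regressor's data into consecutive folds without shuffling, and the forest above has no fixed seed, so the score can vary slightly between runs. A minimal sketch of a reproducible variant, assuming synthetic data and a shuffled `KFold`, also reports the spread across folds:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the real features and prices.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Shuffled folds with a fixed seed, and a seeded forest, make the result repeatable.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
forest = RandomForestRegressor(n_estimators=50, random_state=1)
scores = cross_val_score(forest, X, y, cv=cv, scoring="r2")

print(scores.mean(), scores.std())
```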