## Diamond Price Prediction

### About the Data

**The dataset:** The goal is to predict the `price` of a given diamond (a regression task).

There are 10 independent variables (including `id`):

* `id` : unique identifier of each diamond
* `carat` : Carat (ct.) refers to the unique unit of weight measurement used exclusively to weigh gemstones and diamonds.
* `cut` : Quality of Diamond Cut
* `color` : Color of Diamond
* `clarity` : Diamond clarity is a measure of the purity and rarity of the stone, graded by the visibility of inclusions and blemishes under 10-power magnification.
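* `depth` : The depth of a diamond is its height measured from the culet (bottom tip) to the table (flat, top surface). The values in this dataset (for example 63.5) appear to be that height expressed as a percentage of the average of `x` and `y`, not a length in millimeters.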
* `table` : A diamond's table is the facet which can be seen when the stone is viewed face up.
* `x` : Diamond X dimension
* `y` : Diamond Y dimension
* `z` : Diamond Z dimension

Target variable:
* `price`: Price of the given Diamond.

Dataset source links:
* [https://www.kaggle.com/competitions/playground-series-s3e8/data?select=train.csv](https://www.kaggle.com/competitions/playground-series-s3e8/data?select=train.csv)
* [https://www.kaggle.com/datasets/colearninglounge/gemstone-price-prediction](https://www.kaggle.com/datasets/colearninglounge/gemstone-price-prediction)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import pickle
import json

import warnings
warnings.filterwarnings('ignore')

In [None]:
## Data ingestion step
df=pd.read_csv('gemstone.csv')
df.head()

### EDA and Feature Engineering

In [None]:
df.isnull().sum()

In [None]:
### No missing values present in the data

In [None]:
df.info()

In [None]:
df.head()

In [None]:
## Let's drop the id column
df=df.drop(labels=['id'],axis=1)
df.head()

In [None]:
## check for duplicated records
df.duplicated().sum()

In [None]:
## segregate numerical and categorical columns

numerical_columns=df.columns[df.dtypes!='object']
categorical_columns=df.columns[df.dtypes=='object']

print("Numerical columns:",numerical_columns)
print('Categorical Columns:',categorical_columns)

In [None]:
df[categorical_columns].describe()

In [None]:
df['cut'].value_counts()

In [None]:
df['color'].value_counts()

In [None]:
df['clarity'].value_counts()

In [None]:
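## Distribution of each numerical column (seaborn and matplotlib are imported above)
for col in numerical_columns:
    # Create the figure inside the loop so every plot gets the same size
    plt.figure(figsize=(8,6))
    sns.histplot(data=df,x=col,kde=True)
    plt.title(col)
    plt.show()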

In [None]:
## Assignment: do the same for the categorical columns
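
One possible way to approach the assignment above is a bar chart of category counts for each categorical column. This is only a sketch of what the exercise might look like: `demo_df` is a tiny made-up frame standing in for the notebook's `df`, and the column list is written out by hand instead of being taken from `categorical_columns`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the notebook's df; the values are illustrative only.
demo_df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Ideal", "Good", "Very Good", "Ideal"],
    "color": ["G", "E", "G", "F", "D", "E"],
    "clarity": ["SI1", "VS2", "SI1", "VS1", "SI2", "VS2"],
})
demo_categorical = ["cut", "color", "clarity"]

# One subplot per categorical column, each showing how often every category occurs.
fig, axes = plt.subplots(1, len(demo_categorical), figsize=(12, 4))
for ax, col in zip(axes, demo_categorical):
    counts = demo_df[col].value_counts()
    ax.bar(list(counts.index), list(counts.values))
    ax.set_title(col)
    ax.set_ylabel("count")
fig.tight_layout()
```

With the real data, replacing `demo_df` by `df` and `demo_categorical` by `categorical_columns` should give the equivalent plots.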

In [None]:
## correlation
sns.heatmap(df[numerical_columns].corr(),annot=True)

In [None]:
df.head()

In [None]:
## For domain background on diamond grading, see https://www.americangemsociety.org/ags-diamond-grading-system/
df['cut'].unique()

In [None]:
cut_map={"Fair":1,"Good":2,"Very Good":3,"Premium":4,"Ideal":5}

In [None]:
df['clarity'].unique()

In [None]:
clarity_map = {"I1":1,"SI2":2 ,"SI1":3 ,"VS2":4 , "VS1":5 , "VVS2":6 , "VVS1":7 ,"IF":8}

In [None]:
df['color'].unique()

In [None]:
color_map = {"D":1 ,"E":2 ,"F":3 , "G":4 ,"H":5 , "I":6, "J":7}

In [None]:
df['cut']=df['cut'].map(cut_map)
df['clarity'] = df['clarity'].map(clarity_map)
df['color'] = df['color'].map(color_map)
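
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so an unexpected or misspelled category would silently become a missing value. The sketch below shows one way to check for that; `raw_cut` is a made-up series, and the label `"Excellent"` is an assumption added only to trigger the case.

```python
import pandas as pd

cut_map = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

# Hypothetical raw values; "Excellent" is deliberately absent from cut_map.
raw_cut = pd.Series(["Ideal", "Good", "Excellent", "Fair"])

encoded_cut = raw_cut.map(cut_map)

# Any label that mapped to NaN was not covered by the dictionary.
unmapped = raw_cut[encoded_cut.isna()].unique().tolist()
print("Unmapped categories:", unmapped)
```

Running the same check on `df['cut']`, `df['color']` and `df['clarity']` before overwriting them would confirm that the three dictionaries cover every label in the data.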

In [None]:
df.head()

In [None]:
# Optional: drop the x, y, z columns (left commented out, so they stay in the data)

# df= df.drop(labels=['x','y','z'],axis=1)

### Model Training

In [None]:
## Independent and dependent features

X = df.drop(labels=['price'],axis=1)
y = df[['price']]

In [None]:
y.head()

In [None]:
X.head()

In [None]:
#Train Test Split

X_train, X_test, y_train, y_test= train_test_split(X,y,test_size=0.30,random_state=30)

In [None]:
X_train.shape,X_test.shape

In [None]:
## Feature selection based on correlation
X_train.corr()

In [None]:
## Check for multicollinearity
plt.figure(figsize=(12,10))

corr= X_train.corr()
sns.heatmap(corr,annot=True)
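
The heatmap shows pairwise correlations only. A complementary check, which is not part of the original notebook, is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below uses NumPy and synthetic data; the helper name `vif` and the demo columns are assumptions made for illustration.

```python
import numpy as np

def vif(features):
    """Return the variance inflation factor of each column of a 2-D array."""
    features = np.asarray(features, dtype=float)
    n_rows, n_cols = features.shape
    factors = []
    for j in range(n_cols):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Synthetic data: col_b is nearly a copy of col_a, col_c is independent of both.
rng = np.random.default_rng(0)
col_a = rng.normal(size=500)
col_b = col_a + rng.normal(scale=0.1, size=500)
col_c = rng.normal(size=500)
demo_vif = vif(np.column_stack([col_a, col_b, col_c]))
print(demo_vif)
```

On the notebook's data this could be called as `vif(X_train.values)`. A common rule of thumb treats values well above 5 to 10 as a sign of strong collinearity, which is what highly correlated `carat`, `x`, `y` and `z` columns would be expected to produce.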

In [None]:
### Linear regression Model

In [None]:
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

In [None]:
linear_reg.coef_

In [None]:
linear_reg.intercept_

### Evaluation

In [None]:
## Metrics on the test set
y_pred = linear_reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("Test RMSE :",rmse)

mae = mean_absolute_error(y_test, y_pred)
print("Test MAE :",mae)

r2_value = r2_score(y_test, y_pred)
print("Test R-Squared :",r2_value)
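
For reference, the three metrics printed above can be written out by hand. The sketch below computes them with NumPy on four made-up prices (not taken from the dataset) to show what each number measures: RMSE is the square root of the mean squared error, MAE is the mean absolute error, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical prices, not taken from the dataset.
y_true = np.array([500.0, 1500.0, 3000.0, 7000.0])
y_hat = np.array([600.0, 1400.0, 3300.0, 6500.0])

errors = y_true - y_hat

rmse_manual = np.sqrt(np.mean(errors ** 2))   # penalizes large errors more heavily
mae_manual = np.mean(np.abs(errors))          # average size of the error
ss_res = np.sum(errors ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print("RMSE :", rmse_manual)
print("MAE :", mae_manual)
print("R-Squared :", r2_manual)
```

RMSE and MAE are both in the units of `price`, so they can be read directly as a typical prediction error, while R-squared is unitless.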

In [None]:
## Metrics on the training set (compare with the test metrics to spot overfitting)
y_pred_train = linear_reg.predict(X_train)

mse = mean_squared_error(y_train, y_pred_train)
rmse = np.sqrt(mse)
print("Train RMSE :",rmse)

mae = mean_absolute_error(y_train, y_pred_train)
print("Train MAE :",mae)

r2_value = r2_score(y_train, y_pred_train )
print("Train R-Squared :",r2_value)

In [None]:
plt.scatter(y_test,y_pred,label='test')
plt.scatter(y_train,y_pred_train,label='train')
plt.xlabel('Actual price'); plt.ylabel('Predicted price')
plt.legend()

In [None]:
# single row testing

X_test[20:21]

In [None]:
# prediction

linear_reg.predict(X_test[20:21])[0]

In [None]:
X_test[20:21].T

In [None]:
column_names = X.columns.tolist()
column_names

In [None]:
X.shape[1]

In [None]:
linear_reg.n_features_in_

In [None]:
cut_map={"Fair":1,"Good":2,"Very Good":3,"Premium":4,"Ideal":5}
clarity_map = {"I1":1,"SI2":2 ,"SI1":3 ,"VS2":4 , "VS1":5 , "VVS2":6 , "VVS1":7 ,"IF":8}
color_map = {"D":1 ,"E":2 ,"F":3 , "G":4 ,"H":5 , "I":6, "J":7}

In [None]:
carat=2.00
cut='Very Good'
color='G'
clarity='SI2'
depth=63.50
table=60.00
x=7.98
y=7.93
z=5.04

cut     = cut_map[cut]
color   = color_map[color]
clarity = clarity_map[clarity]

test_array = np.zeros([1,linear_reg.n_features_in_])
test_array[0,0] = carat
test_array[0,1] = cut
test_array[0,2] = color
test_array[0,3] = clarity
test_array[0,4] = depth
test_array[0,5] = table
test_array[0,6] = x
test_array[0,7] = y
test_array[0,8] = z

predicted_price = np.around(linear_reg.predict(test_array)[0],3)
predicted_price
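
Filling the array by position only works while the index of each value matches the column order used in training. A less brittle variant builds the row from feature names. The sketch below is plain Python; the helper `build_feature_row` is hypothetical, and `feature_order` assumes the columns of `X` are `carat, cut, color, clarity, depth, table, x, y, z`, as the positional code above implies.

```python
cut_map = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
clarity_map = {"I1": 1, "SI2": 2, "SI1": 3, "VS2": 4, "VS1": 5, "VVS2": 6, "VVS1": 7, "IF": 8}
color_map = {"D": 1, "E": 2, "F": 3, "G": 4, "H": 5, "I": 6, "J": 7}

# Assumed to match X.columns in the notebook; in practice use the saved column_names.
feature_order = ["carat", "cut", "color", "clarity", "depth", "table", "x", "y", "z"]

def build_feature_row(record, column_order):
    """Encode the categorical fields and return the values in training-column order."""
    encoded = dict(record)
    encoded["cut"] = cut_map[record["cut"]]
    encoded["color"] = color_map[record["color"]]
    encoded["clarity"] = clarity_map[record["clarity"]]
    missing = [name for name in column_order if name not in encoded]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(encoded[name]) for name in column_order]

record = {"carat": 2.00, "cut": "Very Good", "color": "G", "clarity": "SI2",
          "depth": 63.50, "table": 60.00, "x": 7.98, "y": 7.93, "z": 5.04}
row = build_feature_row(record, feature_order)
print(row)
```

The resulting list can be wrapped as `pd.DataFrame([row], columns=feature_order)` and passed to `linear_reg.predict`, which also keeps the feature names the model was fitted with.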

### Saving the Model

In [None]:
with open('linear_regression.pkl','wb') as f:
    pickle.dump(linear_reg, f)

In [None]:
project_data = {"cut": cut_map,
                "color": color_map,
                "clarity": clarity_map,
                "Column Names" : column_names}

with open('proj_data.json','w') as f:
    json.dump(project_data, f)
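
To confirm that the saved artifacts are usable, they can be loaded back and compared with the in-memory objects. The sketch below does this round trip in a temporary directory with a stand-in model fitted on synthetic data (y = 2x + 1); `demo_model` and `demo_meta` are assumptions that take the place of `linear_reg` and `project_data`.

```python
import json
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model on synthetic data; the notebook's linear_reg would take its place.
demo_X = np.arange(10, dtype=float).reshape(-1, 1)
demo_y = 2.0 * demo_X.ravel() + 1.0
demo_model = LinearRegression().fit(demo_X, demo_y)
demo_meta = {"cut": {"Fair": 1, "Good": 2}, "Column Names": ["carat"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "linear_regression.pkl"
    meta_path = Path(tmp_dir) / "proj_data.json"

    # Save both artifacts, mirroring the two cells above.
    with open(model_path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(meta_path, "w") as f:
        json.dump(demo_meta, f)

    # Reload them the way a separate script or web app would.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(meta_path) as f:
        loaded_meta = json.load(f)

# The reloaded model should reproduce the original predictions.
same_prediction = bool(np.allclose(loaded_model.predict(demo_X), demo_model.predict(demo_X)))
print(same_prediction, loaded_meta["Column Names"])
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same scikit-learn version that saved it.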