**Diamond Price Prediction Model Summary:**

The diamond price prediction model performs well at estimating diamond prices on the provided dataset. Four regressors were compared (Ridge, Random Forest, Gradient Boosting, and SVR), and the metrics below are for the best of them, Random Forest:

1. **Training Score: 99%**
   - The best model scored about 0.997 on the training data. For scikit-learn regressors, `score` returns R-squared rather than classification accuracy, so this means the model explains nearly all of the price variance in the data it was trained on.

2. **Test Score: 98%**
   - The test score of about 0.981 is close to the training score, which suggests that the model generalizes to new, unseen data rather than merely memorizing the training set.

3. **R-squared Score: 98%**
   - The R-squared score (coefficient of determination) is the proportion of variance in the target variable that is explained by the model. It is the same quantity as the test score above, computed here with `r2_score` on the test predictions, so the two values match (0.9813).

4. **Mean Squared Error: 294670.65**
   - The mean squared error (MSE) is the average squared difference between the predicted and actual prices on the test set. Lower MSE values indicate better predictions. Because MSE is expressed in squared price units, its square root (the RMSE, roughly 543) is easier to read as a typical error in price units.

In conclusion, the diamond price prediction model performs strongly, with a training score of 99%, a test score of 98%, an R-squared score of 98%, and a mean squared error of 294670.65. These results indicate solid predictive capability on this dataset. Continued validation on more diverse data and real-world scenarios is still recommended to confirm its reliability.
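
To make the two error metrics above concrete, here is a minimal sketch that computes MSE, RMSE, and R-squared by hand. The toy prices and the helper names `mse_by_hand` and `r_squared_by_hand` are made up for illustration and are not part of the notebook; only the value 294670.65 comes from the results reported here.

```python
import math

def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared_by_hand(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy prices, chosen only to keep the arithmetic easy to follow
y_true = [300.0, 500.0, 700.0, 900.0]
y_pred = [320.0, 480.0, 710.0, 880.0]

toy_mse = mse_by_hand(y_true, y_pred)
toy_rmse = math.sqrt(toy_mse)
toy_r2 = r_squared_by_hand(y_true, y_pred)
print(f"toy MSE: {toy_mse}, toy RMSE: {toy_rmse:.2f}, toy R-squared: {toy_r2:.4f}")

# RMSE implied by the MSE reported in the summary
reported_rmse = math.sqrt(294670.65)
print(f"RMSE for the reported MSE: {reported_rmse:.2f}")
```

Reading the error as an RMSE puts it on the same scale as the `price` column, which makes it easier to judge against typical diamond prices in the dataset.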

Import necessary libraries

In [1]:

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

Load the dataset from a CSV file

In [2]:
data=pd.read_csv("diamonds.csv")

Display the dataset

In [3]:
data

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
49995,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
49996,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
49997,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
49998,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


Display the first few rows of the dataset

In [4]:
data.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


Display the last few rows of the dataset

In [5]:
data.tail()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
49995,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.5
49996,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
49997,0.7,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
49998,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74
49999,0.75,Ideal,D,SI2,62.2,55.0,2757,5.83,5.87,3.64


Display the shape of the dataset (rows, columns)


In [6]:
data.shape

(50000, 10)

Display the total number of elements in the dataset

In [7]:
data.size

500000

Check for missing values in the dataset and display the count of missing values per column

In [8]:
data.isnull().sum()

carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

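Display summary statistics of the dataset (mean, std, min, max, etc.)

In [9]:
# describe() must be called with parentheses; the output recorded below came from
# `data.describe` without them, which prints the bound method instead of the statistics
data.describe()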

<bound method NDFrame.describe of        carat        cut color clarity  depth  table  price     x     y     z
0       0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1       0.21    Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2       0.23       Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3       0.29    Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4       0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
...      ...        ...   ...     ...    ...    ...    ...   ...   ...   ...
49995   0.72      Ideal     D     SI1   60.8   57.0   2757  5.75  5.76  3.50
49996   0.72       Good     D     SI1   63.1   55.0   2757  5.69  5.75  3.61
49997   0.70  Very Good     D     SI1   62.8   60.0   2757  5.66  5.68  3.56
49998   0.86    Premium     H     SI2   61.0   58.0   2757  6.15  6.12  3.74
49999   0.75      Ideal     D     SI2   62.2   55.0   2757  5.83  5.87  3.64

[50000 rows x 10 columns]>

Create a copy of the dataset

In [10]:
data_copy=data.copy()

Convert specified categorical columns to numerical categorical codes

In [11]:
array=["cut","color","clarity"]
for i in array:
    data_copy[i]=data_copy[i].astype('category').cat.codes
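
Note that `.astype('category').cat.codes` numbers the categories alphabetically (for `cut`: Fair=0, Good=1, Ideal=2, Premium=3, Very Good=4), so the codes do not follow the quality ranking. Tree-based models can usually cope with this, but a linear model such as Ridge may benefit from codes that respect the order. The sketch below is one possible alternative, not what the notebook does; the worst-to-best orderings are the ones conventionally given for the diamonds dataset and are an assumption here, and the sample rows are the first three rows shown by `data.head()`.

```python
import pandas as pd

# Assumed quality orderings, worst to best (not stated in the notebook itself)
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# Categorical columns of the first three rows from data.head()
sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "E"],
    "clarity": ["SI2", "SI1", "VS1"],
})

orders = {"cut": CUT_ORDER, "color": COLOR_ORDER, "clarity": CLARITY_ORDER}

encoded = sample.copy()
for column, levels in orders.items():
    # An ordered CategoricalDtype makes the integer codes follow the given ranking
    ordered_dtype = pd.CategoricalDtype(categories=levels, ordered=True)
    encoded[column] = encoded[column].astype(ordered_dtype).cat.codes

print(encoded)
```

Any value missing from the listed levels would be encoded as -1, so it is worth checking for that code before training.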

Display the data types of the columns in the original dataset (`data`, not the encoded `data_copy`)

In [12]:
data.dtypes

carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
price        int64
x          float64
y          float64
z          float64
dtype: object

Separate features (x) and target variable (y)

In [13]:
x=data_copy.drop(['price'],axis=1)
y=data_copy['price']

In [14]:
x

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z
0,0.23,2,1,3,61.5,55.0,3.95,3.98,2.43
1,0.21,3,1,2,59.8,61.0,3.89,3.84,2.31
2,0.23,1,1,4,56.9,65.0,4.05,4.07,2.31
3,0.29,3,5,5,62.4,58.0,4.20,4.23,2.63
4,0.31,1,6,3,63.3,58.0,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...
49995,0.72,2,0,2,60.8,57.0,5.75,5.76,3.50
49996,0.72,1,0,2,63.1,55.0,5.69,5.75,3.61
49997,0.70,4,0,2,62.8,60.0,5.66,5.68,3.56
49998,0.86,3,4,3,61.0,58.0,6.15,6.12,3.74


In [15]:
y

0         326
1         326
2         327
3         334
4         335
         ... 
49995    2757
49996    2757
49997    2757
49998    2757
49999    2757
Name: price, Length: 50000, dtype: int64

Split the data into training and testing sets

In [16]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=15)

Create a dictionary of different regression models

In [24]:
models= {
    "Ridge": Ridge(),
    "Random Forest": RandomForestRegressor(n_estimators=100),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1),
    "SVR":SVR()
}

Initialize variables to keep track of the best model and its score

In [25]:
best_model=None
# Start below any possible R-squared value, since test scores can be negative
best_accuracy=float("-inf")


Create a dictionary to store results, then iterate over each model and evaluate its performance

In [33]:
results={}
for name, model in models.items():
    model.fit(x_train, y_train)
    # For regressors, score() returns R-squared, not classification accuracy
    train_score = model.score(x_train, y_train)
    test_score = model.score(x_test, y_test)
    y_pred = model.predict(x_test)
    results[name] = {'Train Score': train_score, 'Test Score': test_score,'y_pred':y_pred}
    # Keep the model with the highest test R-squared
    if test_score>best_accuracy:
        best_accuracy=test_score
        best_model=name

Print performance results for each model

In [34]:
for model_name,scores in results.items():
    print(f"model:{model_name}")
    print(f"train_score:{scores['Train Score']}")
    print(f"test_score:{scores['Test Score']}")
    print(f"y_pred:{scores['y_pred']}\n")

model:Ridge
train_score:0.8838495565911912
test_score:0.8888162955371256
y_pred:[ -504.7069498   4651.32293121  5625.83363461 ...  3815.92531972
 10993.81407149   -43.21370012]

model:Random Forest
train_score:0.9973295557448466
test_score:0.9812689485467699
y_pred:[  476.9    5127.155  6120.51  ...  4176.67  13892.95    528.96 ]

model:Gradient Boosting
train_score:0.9730826212860342
test_score:0.9709514133576882
y_pred:[  607.86049923  4766.10350789  5857.61249971 ...  4093.57578604
 13945.02368765   606.17189011]

model:SVR
train_score:-0.1296988295447048
test_score:-0.12935783194660644
y_pred:[2362.75441826 2431.78052856 2419.94813459 ... 2446.65740128 2459.53926278
 2370.48960411]
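
The negative SVR scores above mean it does worse than predicting the mean price. A likely cause, though the notebook does not confirm it, is that `SVR()` with default settings is sensitive to the scale of both the features and the target, and neither was standardized. The following is a minimal sketch of one common remedy, run on synthetic data that only loosely imitates `carat`, `depth`, and `price`; all names and numbers in it are hypothetical.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): two features on very different scales
rng = np.random.default_rng(15)
n_samples = 300
carat = rng.uniform(0.2, 3.0, n_samples)
depth = rng.uniform(55.0, 70.0, n_samples)
X = np.column_stack([carat, depth])
y = 4000.0 * carat + 20.0 * (depth - 60.0) + rng.normal(0.0, 100.0, n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=15
)

# Baseline: default SVR on unscaled features and target, as in the notebook
raw_svr = SVR().fit(X_train, y_train)
raw_r2 = raw_svr.score(X_test, y_test)

# Standardize the features inside a pipeline and the target via a wrapper
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_svr.fit(X_train, y_train)
scaled_r2 = scaled_svr.score(X_test, y_test)

print(f"unscaled SVR R-squared: {raw_r2:.3f}")
print(f"scaled SVR R-squared:   {scaled_r2:.3f}")
```

Whether the same change would make SVR competitive on the real diamonds data is untested here; the sketch only shows how the scaling can be wired in without leaking test-set statistics into training.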



Print the best model and its test score

In [38]:
print(f"The Best Model is {best_model} with a test R-squared of {best_accuracy*100:.2f}%")

The Best Model is Random Forest with a test R-squared of 98.13%


Calculate and print the R-squared score for the best model

In [36]:
from sklearn.metrics import r2_score,mean_squared_error
r2=r2_score(y_test,results[best_model]['y_pred'])
print(f"r2score:{r2}")

r2score:0.9812689485467699


Calculate and print the mean squared error for the best model

In [37]:
# Store the result under a new name so the imported function is not overwritten
mse=mean_squared_error(y_test,results[best_model]['y_pred'])
print(f"mean_squared_error:{mse}")

mean_squared_error:294670.6516420711
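
All scores above come from a single 70/30 split. One way to follow up on the recommendation for continued validation is k-fold cross-validation, which reports a score per fold instead of a single number. The sketch below uses synthetic data and a Ridge model purely to keep it fast and self-contained; `X_demo` and `y_demo` are hypothetical stand-ins, and in the notebook one would pass `x`, `y`, and the chosen model instead.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): one carat-like feature, price-like target
rng = np.random.default_rng(15)
X_demo = rng.uniform(0.2, 3.0, size=(200, 1))
y_demo = 4000.0 * X_demo[:, 0] + rng.normal(0.0, 100.0, 200)

# Shuffle before splitting; the diamonds file appears to be sorted by price
cv = KFold(n_splits=5, shuffle=True, random_state=15)

r2_scores = cross_val_score(Ridge(), X_demo, y_demo, cv=cv, scoring="r2")

# scikit-learn reports MSE as a negated score so that higher is always better
neg_mse_scores = cross_val_score(
    Ridge(), X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)
rmse_scores = np.sqrt(-neg_mse_scores)

print(f"R-squared per fold: {np.round(r2_scores, 4)}")
print(f"RMSE per fold:      {np.round(rmse_scores, 2)}")
```

A small spread across folds would support the single-split result; a large spread would suggest that the reported 98% depends on which rows happened to land in the test set.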
