## Diamond Price Prediction

#### First, I import the necessary Python libraries and load the dataset, which lists diamond prices alongside their features.

In [249]:
import pandas as pd
import numpy as np
import plotly.express as px

# Importing the dataset using pandas
data = pd.read_csv("diamonds.csv") 

# Displaying the first few rows of the dataset
data.head() 

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


#### The dataset describes each diamond by its carat, cut, color, clarity, depth, and table, together with its price. Here, table refers to the table percentage: the width of the top facet (the table) divided by the average of the diamond's width and length, multiplied by 100. The columns x, y, and z are the length, width, and depth of the diamond in millimeters.
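
As a quick sanity check on how the columns relate, the sketch below recomputes the depth percentage from x, y, and z for the five rows shown above. It assumes the definition given in the ggplot2 documentation for this dataset, depth = 2 * z / (x + y) expressed as a percentage; the column name `depth_check` is introduced here only for illustration.

```python
import pandas as pd

# The five rows displayed above (only the columns needed for the check)
sample = pd.DataFrame({
    "depth": [61.5, 59.8, 56.9, 62.4, 63.3],
    "x": [3.95, 3.89, 4.05, 4.20, 4.34],
    "y": [3.98, 3.84, 4.07, 4.23, 4.35],
    "z": [2.43, 2.31, 2.31, 2.63, 2.75],
})

# Assumed definition: depth % = z / mean(x, y) * 100 = 200 * z / (x + y)
sample["depth_check"] = 200 * sample["z"] / (sample["x"] + sample["y"])

print(sample[["depth", "depth_check"]].round(2))
```

The recomputed values agree with the recorded depth to within a few tenths of a percentage point, which is plausible given that x, y, and z are rounded to two decimals.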

In [250]:
# Dropping the redundant "Unnamed: 0" column, which only repeats the row number
data = data.drop(columns="Unnamed: 0")

data.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
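
An alternative to dropping the column afterwards is to tell pandas at load time that the first column is an index. The sketch below uses a small in-memory CSV as a stand-in for `diamonds.csv` (the two rows and the reduced column set are chosen for illustration only).

```python
import io
import pandas as pd

# Stand-in for diamonds.csv: the first, unnamed column holds a row counter
csv_text = """,carat,cut,price
1,0.23,Ideal,326
2,0.21,Premium,326
"""

# index_col=0 uses the first column as the index instead of a data column
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df.columns.tolist())
```

With this approach the index starts at 1 rather than 0, because it is taken from the file's own row counter.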


#### Next, I use Plotly Express to draw a donut-style pie chart showing how the diamonds are distributed across the categories of a categorical column.

In [251]:
# Choosing a categorical column (for example, 'cut') to visualize
categorical_column = 'cut'

# Counting the values of the chosen column
value_counts = data[categorical_column].value_counts()

# Defining the colors
colors = ['darkgreen', 'forestgreen', 'yellowgreen', 'olive', 'gold']

# Creating a pie chart using Plotly Express with custom colors
# names holds the diamond cut types
# values holds the count of each cut type
# (names and values are passed as arrays, so no data_frame argument is needed)
fig = px.pie(names=value_counts.index, values=value_counts, hole=0.4,
             title=f'Pie Chart of Diamond {categorical_column} Distribution',
             color_discrete_sequence=colors)

# Setting the figure size
fig.update_layout(width=400, height=400)

fig.show()
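
The shares drawn in the pie chart can also be printed as numbers. The sketch below uses a short toy Series in place of `data["cut"]`, so the percentages shown are not those of the real dataset.

```python
import pandas as pd

# Toy stand-in for data["cut"]
cuts = pd.Series(["Ideal", "Ideal", "Premium", "Good",
                  "Ideal", "Premium", "Fair", "Very Good"])

# normalize=True returns fractions instead of counts; convert to percentages
shares = cuts.value_counts(normalize=True).mul(100).round(1)
print(shares)
```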


#### Creating a new column called "size" as the product of the "x", "y", and "z" columns. Since x, y, and z are the length, width, and depth of the diamond in millimeters, "size" approximates its volume in cubic millimeters (strictly, the volume of its bounding box).

In [252]:
data["size"] = data["x"] * data["y"] * data["z"]
data

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,size
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43,38.202030
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31,34.505856
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31,38.076885
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63,46.724580
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75,51.917250
...,...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50,115.920000
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61,118.110175
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56,114.449728
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74,140.766120
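
Because "size" is a product, a single dimension recorded as 0 makes the whole value 0. Whether such rows occur in this file is an assumption worth checking; the sketch below shows one way to filter them, using a toy frame with one deliberately invalid row.

```python
import pandas as pd

# Toy frame: the second row has zero-valued dimensions
toy = pd.DataFrame({
    "x": [3.95, 0.0, 4.05],
    "y": [3.98, 6.62, 4.07],
    "z": [2.43, 0.0, 2.31],
})
toy["size"] = toy["x"] * toy["y"] * toy["z"]

# Keep only rows where all three dimensions are strictly positive
valid = toy[(toy[["x", "y", "z"]] > 0).all(axis=1)]
print(f"Kept {len(valid)} of {len(toy)} rows")
```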


#### Using Plotly Express to create a scatter plot of price against size. Each marker is one diamond: the x-axis shows the "size" column, the y-axis shows the "price" column, the marker area scales with "size", and the color encodes the "cut" column. Setting trendline="ols" adds an ordinary least squares regression line for each cut (this option requires the statsmodels package).

In [253]:
# Scatter plot of price vs. size, colored by cut, with an OLS trendline per cut
figure = px.scatter(data_frame = data, x="size",
                    y="price", size="size", 
                    color= "cut", trendline="ols")
figure.show()
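
The trendlines in the plot are straight-line fits of price on size, one per cut. A minimal sketch of the same idea with `numpy.polyfit` is shown below; the toy values are invented so that the fitted slopes are easy to verify, and the dictionary name `fits` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: each cut lies exactly on its own straight line
toy = pd.DataFrame({
    "cut": ["Ideal"] * 3 + ["Fair"] * 3,
    "size": [40.0, 60.0, 80.0, 40.0, 60.0, 80.0],
    "price": [500.0, 700.0, 900.0, 300.0, 400.0, 500.0],
})

# Fit price = slope * size + intercept separately for each cut
fits = {}
for cut, group in toy.groupby("cut"):
    slope, intercept = np.polyfit(group["size"], group["price"], deg=1)
    fits[cut] = (slope, intercept)

for cut, (slope, intercept) in fits.items():
    print(f"{cut}: price = {slope:.2f} * size + {intercept:.2f}")
```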

**Generating a box plot of "price" grouped by the "cut" column.** 
- The position and height of each box reflect the distribution of "price" values in that group: the box spans the interquartile range and the line inside it marks the median. 
- Within each cut, a separate colored box is drawn for each level of the "clarity" column, a second categorical variable. 
- This plot can be used to compare the distribution of prices across different cuts and clarity levels of diamonds.

In [255]:
fig = px.box(data, 
             x="cut", 
             y="price", 
             color="clarity")
fig.show()
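
The quartiles that each box displays can be tabulated directly. The sketch below computes them per cut and clarity on a toy frame with invented prices; the name `summary` is introduced for illustration.

```python
import pandas as pd

# Toy frame with one clarity level and two cuts
toy = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Ideal", "Fair", "Fair", "Fair"],
    "clarity": ["SI1"] * 6,
    "price": [300, 500, 700, 1000, 2000, 3000],
})

# First quartile, median, and third quartile of price for each (cut, clarity) pair
summary = (
    toy.groupby(["cut", "clarity"])["price"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(summary)
```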

**Calculating correlations is a useful step in understanding how the features relate to the target variable in our dataset.**

**Here I calculate the correlation matrix with the `.corr()` method and sort the correlations with the 'price' column in descending order.**

**Sorting in descending order makes it easy to see which numeric features have the strongest positive correlation with "price" and are therefore likely to be useful predictors. Note that correlation only measures linear association; it does not by itself show that a feature determines the price.**

In [256]:
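# numeric_only=True skips text columns such as "color", which recent pandas versions would otherwise reject
correlation = data.corr(numeric_only=True)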
correlation["price"].sort_values(ascending=False)

price    1.000000
carat    0.921591
size     0.902385
x        0.884435
y        0.865421
z        0.861249
table    0.127134
depth   -0.010647
Name: price, dtype: float64
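
For reference, the default coefficient computed by `.corr()` is the Pearson correlation. The sketch below evaluates it by hand for carat and price on five rows taken from the tables above and compares the result with pandas; with so few rows the value will differ from the full-dataset figure.

```python
import numpy as np
import pandas as pd

# Five (carat, price) pairs taken from the rows displayed earlier
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29, 0.31, 0.72],
    "price": [326, 326, 334, 335, 2757],
})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean
dx = toy["carat"] - toy["carat"].mean()
dy = toy["price"] - toy["price"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = toy["carat"].corr(toy["price"])
print(r_manual, r_pandas)
```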

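**Mapping the "cut" column in my DataFrame to numerical values using a dictionary.** 
- This is a common preprocessing technique for converting categorical variables into a format that machine learning models can use.
- Note that the codes below do not follow the usual quality ranking of the cuts (Very Good is normally ranked above Good), so they should be read as labels rather than as an ordered quality scale.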

In [257]:
data["cut"] = data["cut"].map({"Ideal": 1, 
                               "Premium": 2, 
                               "Good": 3,
                               "Very Good": 4,
                               "Fair": 5})

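If an encoding that follows the quality ranking is preferred, one option is to derive the codes from an ordered list. This is a sketch of an alternative, not the encoding used in this notebook: the ranking Fair < Good < Very Good < Premium < Ideal is an assumption stated in the code, and the names `quality_order` and `cut_codes` are hypothetical. Any model trained on the original codes would need to be refitted after such a change.

```python
import pandas as pd

# Assumed quality ranking, from lowest to highest
quality_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
cut_codes = {name: rank for rank, name in enumerate(quality_order, start=1)}

# The last value is misspelled on purpose to show what .map() does with unknown labels
cuts = pd.Series(["Ideal", "Premium", "Good", "Very Good", "Fair", "ideal"])
encoded = cuts.map(cut_codes)

print(encoded)
print("Unmapped values:", encoded.isna().sum())
```

Labels missing from the dictionary become NaN, so counting the NaN values after `.map()` is a cheap check for typos or unexpected categories.
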
**Preparing my data for modeling by splitting it into training and testing sets using the train_test_split function from Scikit-learn.**

In [258]:
#splitting data
from sklearn.model_selection import train_test_split

#defining matrix x as a NumPy array containing selected columns ("carat", "cut", "size") from our DataFrame.
x = np.array(data[["carat", "cut", "size"]]) 

#creating our target array 'y' containing the "price" column.
y = np.array(data[["price"]])

# using the .ravel() method to convert the target array y from a 2D array to a 1D array.
y = y.ravel()

# using train_test_split to divide our data into training and testing sets. 
# We're allocating 90% of the data to training (xtrain, ytrain) and 10% to testing (xtest, ytest). 
# The random_state parameter ensures reproducibility of the split.
xtrain, xtest, ytrain, ytest = train_test_split(x, y, 
                                                test_size=0.10, 
                                                random_state=10)
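
To see what the split produces, the sketch below applies the same call to a small synthetic array (100 rows, 3 columns, invented values) and checks the resulting shapes and that each feature row stays paired with its target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: row i of X_demo starts with the value 3 * i, and y_demo[i] == i
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.10,
                                          random_state=10)

print(X_tr.shape, X_te.shape)
# Rows and targets are shuffled together, so the pairing is preserved
print((X_te[:, 0] == 3 * y_te).all())
```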

**Fitting a RandomForestRegressor model to my training data.** 
- The RandomForestRegressor is a popular ensemble learning method that builds multiple decision trees and combines their predictions to produce a more accurate and stable prediction.

In [259]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()

# The model will learn to make predictions based on the features in xtrain and their corresponding target values in ytrain.
model.fit(xtrain, ytrain)

RandomForestRegressor()
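
A random forest involves random sampling, so two fits on the same data can give slightly different predictions unless `random_state` is fixed. The sketch below illustrates this on synthetic data where only the first feature drives the target; the data, the value `n_estimators=20`, and the names `rf_a` and `rf_b` are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic data: the target depends on the first feature only, plus noise
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 5000 * X_demo[:, 0] + rng.normal(0, 50, 200)

# Two forests with the same random_state produce identical predictions
rf_a = RandomForestRegressor(n_estimators=20, random_state=10).fit(X_demo, y_demo)
rf_b = RandomForestRegressor(n_estimators=20, random_state=10).fit(X_demo, y_demo)
print(np.allclose(rf_a.predict(X_demo[:5]), rf_b.predict(X_demo[:5])))

# feature_importances_ indicates how much each feature contributed to the splits
print(rf_a.feature_importances_.round(3))
```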

**Creating and fitting a K-nearest neighbors (KNN) regressor model on my training data.**

In [260]:

from sklearn.neighbors import KNeighborsRegressor

# created an instance of the KNeighborsRegressor model by calling its constructor with the parameter n_neighbors=5. 
# This means that the model will consider the 5 nearest neighbors for making predictions.
knn_model = KNeighborsRegressor(n_neighbors=5)

# The model will learn to make predictions based on the features in 'xtrain' and their corresponding target values in 'ytrain'.
knn_model.fit(xtrain,ytrain)

KNeighborsRegressor()
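
KNN predicts from distances between feature vectors, so a feature with a large numeric range (here "size", in the tens to hundreds) can outweigh one with a small range (such as "carat"). A common remedy is to standardize the features first. The sketch below shows this with a scikit-learn pipeline on synthetic data; the data and the name `scaled_knn` are assumptions for illustration, and whether scaling improves results on the diamonds data would need to be tested.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales, standing in for carat, cut, and size
X_demo = np.column_stack([
    rng.uniform(0.2, 3.0, 200),
    rng.integers(1, 6, 200),
    rng.uniform(30, 500, 200),
])
y_demo = 4000 * X_demo[:, 0] + rng.normal(0, 100, 200)

# The scaler is fitted on the training data and applied automatically at predict time
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
scaled_knn.fit(X_demo, y_demo)

pred = scaled_knn.predict(X_demo[:3])
print(pred)
```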

**Creating a script for predicting diamond prices using the trained random forest model and the K-nearest neighbors (KNN) regressor model.** 
- The script takes user inputs for carat, cut type (as the numeric code used above), and size (the product x * y * z in cubic millimeters), and then predicts the diamond price using both models.

In [266]:
print('Diamond Price Prediction')

a = float(input('Carat Size: '))
b = int(input('Cut Type (Ideal: 1, Premium: 2, Good: 3, Very Good: 4, Fair: 5): '))
c = float(input("Size: "))

print(f'Carat Size: {a}')
print(f'Cut Type: {b}')
print(f'Size: {c}')

# creating a feature array using the user inputs. 
# This array (features) is then used to make predictions using the trained models.
features = np.array([[a, b, c]])

# Predicting the price of the diamond using Random Forest
predicted_price_rf = model.predict(features)

print(f"Predicted Diamond's Price using random forest = ${predicted_price_rf}" )

# Predicting the price of the diamond using KNN
predicted_price_knn = knn_model.predict(features) 

print(f"Predicted Price for Diamond using KNN: ${predicted_price_knn}")

Diamond Price Prediction
Carat Size: 1.0
Cut Type: 2
Size: 3.0
Predicted Diamond's Price using random forest = $[3433.77]
Predicted Price for Diamond using KNN: $[3245.]
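
The size entered in the run above (3.0) is far below the sizes in the dataset, where even a 0.23-carat stone has a size of about 38. One way to keep inputs within the range the models were trained on is to compute the size from the stone's dimensions. The helper below is a hypothetical sketch (`build_features` and `CUT_CODES` are names introduced here) that reuses the notebook's cut codes.

```python
import numpy as np

# Same cut codes as in the mapping used earlier in this notebook
CUT_CODES = {"Ideal": 1, "Premium": 2, "Good": 3, "Very Good": 4, "Fair": 5}

def build_features(carat, cut_name, x, y, z):
    """Return a 1 x 3 feature array [carat, cut code, size] for the models."""
    size = x * y * z  # same definition as the "size" column
    return np.array([[carat, CUT_CODES[cut_name], size]])

# Example: the first diamond in the dataset
features_demo = build_features(0.23, "Ideal", 3.95, 3.98, 2.43)
print(features_demo)
```

The resulting array has the shape expected by `model.predict` and `knn_model.predict`.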


**Using cross-validation to assess the performance of our random forest model using the mean squared error (MSE) as the scoring metric.**
- Cross-validation is a standard machine learning technique for estimating how well a model will generalize to unseen data. It helps us detect overfitting and gives a more reliable evaluation than a single train/test split.
- MSE (Mean Squared Error) quantifies the average squared difference between the predicted values and the actual (true) values in a regression task. Lower values of MSE indicate better model performance.

In [267]:
from sklearn.model_selection import cross_val_score
# Perform cross-validation using the Random Forest model
# model: the RandomForestRegressor created earlier (cross_val_score fits a fresh clone on each fold)
# x, y: the feature matrix and the target array
# cv: the number of folds for cross-validation
# scoring: 'neg_mean_squared_error' returns the negative MSE for each fold, so the sign is flipped below
rf_scores = cross_val_score(model, x, y, cv=5, scoring='neg_mean_squared_error')
rf_mse_scores = -rf_scores
rf_mse_scores

array([ 3191029.60353547,  8671607.54773031, 19703603.72864727,
         151595.84536431,   576318.75911135])

**Using cross-validation to assess the performance of our K-nearest neighbors (KNN) model using the mean squared error (MSE) as the scoring metric.**

In [268]:
# Perform cross-validation using KNN model
knn_scores = cross_val_score(knn_model, x, y, cv=5, scoring='neg_mean_squared_error')
knn_mse_scores = -knn_scores
knn_mse_scores

array([ 3071625.72866889,  8161372.46603263, 20559851.01220986,
         147367.72296626,   598447.15598814])
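
The fold scores above differ by about two orders of magnitude. For a regressor, `cv=5` splits the rows into consecutive, unshuffled blocks, so this pattern would be expected if the file is stored in some price-related order (an assumption; the first and last rows shown earlier hint at it but do not prove it). The sketch below reproduces the effect on synthetic data sorted by target and shows how a shuffled `KFold` changes the picture; all names and values are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data sorted by the target value
X_demo = np.linspace(0, 10, 200).reshape(-1, 1)
y_demo = 300 * X_demo.ravel() ** 2

demo_model = KNeighborsRegressor(n_neighbors=5)

# cv=5 uses consecutive blocks: each test fold covers a range the model never saw
mse_ordered = -cross_val_score(demo_model, X_demo, y_demo, cv=5,
                               scoring="neg_mean_squared_error")

# Shuffled folds mix all target ranges into every training set
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=10)
mse_shuffled = -cross_val_score(demo_model, X_demo, y_demo, cv=shuffled_cv,
                                scoring="neg_mean_squared_error")

print("Ordered folds: ", mse_ordered.round(0))
print("Shuffled folds:", mse_shuffled.round(0))
```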

**Calculating the mean of the MSE scores (the sign-flipped neg-MSE values) across the five folds for both the random forest model and the K-nearest neighbors (KNN) regressor model.**

In [269]:
# Calculate the mean MSE scores for both models
mean_rf_mse = rf_mse_scores.mean()
mean_knn_mse = knn_mse_scores.mean()


**Comparing the mean MSE of the random forest model with that of the K-nearest neighbors (KNN) regressor model.**

In [270]:
print("Mean MSE scores (Cross-Validation):")
print(f"Random Forest: {mean_rf_mse}")
print(f"KNN: {mean_knn_mse}")

# Compare which model has lower mean MSE
if mean_rf_mse < mean_knn_mse:
    print(f"\nRandom Forest has lower mean MSE, indicating better performance. \nSo, Price of the diamond will be = $ {predicted_price_rf}")
else:
    print(f"\nKNN has lower mean MSE, indicating better performance. \nSo, Price of the diamond will be = $ {predicted_price_knn}")

Mean MSE scores (Cross-Validation):
Random Forest: 6458831.096877743
KNN: 6507732.817173155

Random Forest has lower mean MSE, indicating better performance. 
So, Price of the diamond will be = $ [3433.77]
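
MSE is expressed in squared price units, which makes it hard to interpret. Taking the square root gives the root mean squared error (RMSE) in the same units as the price. The sketch below applies this to the two mean MSE values printed above; the dictionary names are introduced for illustration.

```python
import math

# Mean cross-validated MSE values reported above
mean_mse = {"Random Forest": 6458831.096877743, "KNN": 6507732.817173155}

# RMSE = sqrt(MSE), expressed in the same units as the price
rmse = {name: math.sqrt(value) for name, value in mean_mse.items()}

for name, value in rmse.items():
    print(f"{name}: RMSE of about {value:.0f}")
```

On this scale the two models differ by only about 10 price units, which is small compared with the spread between folds, so the ranking of the two models should be read with caution.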
