# Predicting the price of diamonds

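You’ll use logistic regression to predict diamond prices from features such as carat, cut, clarity, and depth. Because logistic regression is a classifier, each distinct price is treated as its own class.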

## Process:


1. Prepare the data

2. Split the data into training and testing sets

3. Fit a logistic regression model to the training data

4. Predict the testing labels

5. Calculate the performance metrics



In [42]:
# Import the required modules
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Prepare the Data

### Step 1: Load the `Diamonds Prices2022.csv` file into a Pandas DataFrame.

In [43]:
# Read the Diamonds Prices2022.csv file into a Pandas DataFrame.
diamonds = pd.read_csv(
    Path("Diamonds Prices2022.csv")
)


# Review the DataFrame
diamonds.head()


Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


### Step 2: Answer the following question:

Note that you want to predict the `price` variable. Using `value_counts`, how many distinct prices exist in this dataset?

In [44]:
# The 'price' column is the value you want to predict.
# The 'cut', 'color' and 'clarity' columns hold text categories, so create a
# LabelEncoder to convert them to integers before modeling.
labelencoder = LabelEncoder()


In [45]:
# Encode each categorical column as integers, overwriting the original column
diamonds['cut'] = labelencoder.fit_transform(diamonds['cut'])
diamonds['color'] = labelencoder.fit_transform(diamonds['color'])
diamonds['clarity'] = labelencoder.fit_transform(diamonds['clarity'])

# Count how often each price occurs
diamonds["price"].value_counts()

605      132
802      127
625      126
828      125
776      124
        ... 
8816       1
14704      1
14699      1
14698      1
9793       1
Name: price, Length: 11602, dtype: int64
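The integer codes produced by `LabelEncoder` follow the sorted order of the labels, not their quality ranking. The sketch below illustrates this on a small made-up list of cut grades; the list `cuts` and the names `encoder`, `codes`, and `mapping` are assumptions for illustration and are not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of cut grades; the real column has many more rows.
cuts = ["Ideal", "Premium", "Good", "Very Good", "Fair", "Good"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cuts)

# classes_ holds the sorted unique labels; a label's position is its code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

This agrees with the encoded table, where `Ideal` becomes 2, `Premium` becomes 3, and `Good` becomes 1. Note also that refitting a single encoder on each column replaces its stored `classes_`, so a separate encoder per column would be needed if the codes must be mapped back to labels later.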

In [46]:
diamonds

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,2,1,3,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,3,1,2,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,1,1,4,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,3,5,5,62.4,58.0,334,4.20,4.23,2.63
4,5,0.31,1,6,3,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...,...
53938,53939,0.86,3,4,3,61.0,58.0,2757,6.15,6.12,3.74
53939,53940,0.75,2,0,3,62.2,55.0,2757,5.83,5.87,3.64
53940,53941,0.71,3,1,2,60.5,55.0,2756,5.79,5.74,3.49
53941,53942,0.71,3,2,2,59.8,62.0,2756,5.74,5.73,3.43


# Split the data into training and testing sets

In [47]:
# The target column is `price`.
target = diamonds["price"]


# The features are all remaining columns, excluding the row-number column.
features = diamonds.drop(columns=['price', 'Unnamed: 0'])


**Display the features**

In [48]:
features

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z
0,0.23,2,1,3,61.5,55.0,3.95,3.98,2.43
1,0.21,3,1,2,59.8,61.0,3.89,3.84,2.31
2,0.23,1,1,4,56.9,65.0,4.05,4.07,2.31
3,0.29,3,5,5,62.4,58.0,4.20,4.23,2.63
4,0.31,1,6,3,63.3,58.0,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...
53938,0.86,3,4,3,61.0,58.0,6.15,6.12,3.74
53939,0.75,2,0,3,62.2,55.0,5.83,5.87,3.64
53940,0.71,3,1,2,60.5,55.0,5.79,5.74,3.49
53941,0.71,3,2,2,59.8,62.0,5.74,5.73,3.43


In [49]:
# Split the dataset using the train_test_split function
training_features, testing_features, training_targets, testing_targets = train_test_split(features, target)
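When `train_test_split` is called without a `test_size`, it holds out 25% of the rows for testing, and without a `random_state` the split differs on every run. The following is a minimal sketch on made-up arrays; `X`, `y`, and the seed value 7 are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows with 2 feature columns each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# With no test_size given, 25% of the rows are held out for testing.
# random_state (an assumed value here) makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

print(X_train.shape, X_test.shape)
```

Passing a fixed `random_state` in the notebook's own split would make the reported metrics reproducible between runs.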

# Model and Fit the Data to a Logistic Regression

In [50]:
# Declare a logistic regression model.
# Apply a random_state of 7 to the model
logistic_regression_model = LogisticRegression(random_state=7)

In [None]:
# Fit and save the logistic regression model using the training data
lr_model = logistic_regression_model.fit(training_features, training_targets)
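Since logistic regression is a classifier, a fitted model can only predict labels it saw during training. The toy sketch below, which uses invented carat values and only two distinct prices, shows this behavior; all names and numbers in it are hypothetical and unrelated to the real dataset.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature (carat) and two distinct prices.
carats = [[0.20], [0.25], [0.30], [1.00], [1.10], [1.20]]
prices = [300, 300, 300, 5000, 5000, 5000]

toy_model = LogisticRegression(random_state=7).fit(carats, prices)

# classes_ lists the distinct labels the model is able to output.
print(toy_model.classes_)

# Every prediction is one of those labels, never a value in between.
toy_predictions = toy_model.predict([[0.22], [0.60], [1.15]])
print(toy_predictions)
```

With the full diamonds data, the same logic means the model chooses among thousands of price classes, so fitting may be slow and may need more iterations to converge.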

# Predict the Testing Labels

In [None]:
# Make and save testing predictions with the saved logistic regression model using the test data
testing_predections = lr_model.predict(testing_features)

# Review the predictions
testing_predections

# Calculate the Performance Metrics

In [None]:
# Display the accuracy score for the test dataset.
# Store the result under a new name so the imported accuracy_score function is not overwritten.
test_accuracy = accuracy_score(testing_targets, testing_predections)
display(test_accuracy)
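As a rough illustration of what this metric measures, `accuracy_score` returns the fraction of predictions that match the true label exactly. The values below are invented for the example and do not come from the model above.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted prices for four diamonds.
true_prices = [326, 327, 334, 335]
predicted_prices = [326, 330, 334, 340]

# Two of the four predictions match exactly, so the accuracy is 0.5.
toy_accuracy = accuracy_score(true_prices, predicted_prices)
print(toy_accuracy)
```

A prediction that is off by a few dollars counts the same as one that is off by thousands, so exact-match accuracy is a strict measure when the labels are prices.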