## Machine Learning for Jewelry Price Optimization
This project aims to predict the prices of jewelry pieces accurately. Doing so would allow the jewelry company to reduce its dependence on gemologists and expensive jewelry appraisal experts.

## Methodology
This project follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology, one of the more popular data science methodologies. It consists of six phases:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

## Business Understanding
Gemineye Emporium is a jewelry dealer that has begun a new wave of expansion into the country. While this is good for business, it comes with increased costs and a greater need for operational efficiency.
Hence, the company needs to price its jewelry products accurately. Pricing is currently done by jewelry experts and gemologists.
However, this process is slow and expensive, as jewelry experts are hard to come by and costly to hire. **Gemineye** would like to explore the use of Machine Learning (ML) to predict the optimal prices at which its jewelry should be sold. Using ML for this task would allow the company to:
1. improve the speed and scalability of its pricing process
2. cut down on the cost of hiring gem experts

## Data Understanding
Exploratory Data Analysis (EDA) is performed to understand the data used for this task.

In [5]:
#importing needed libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#set visualization theme
sns.set()

In [6]:
#load the data before inspecting it
data = pd.read_csv('Jewelry_Dataset.csv')

#shape of the data
data.shape

In [7]:
#rename the columns with descriptive names

data.columns = ['Order datetime',
                'Order ID',
                'Purchased product ID',
                'SKU_Quantity',
                'Category_ID',
                'Category',
                'Brand_ID',
                'Price_USD',
                'User_ID',
                'Target_gender',
                'Main_color',
                "Main_metal",
                'Main_gem'  
                ]

data.head()

Unnamed: 0,Order datetime,Order ID,Purchased product ID,SKU_Quantity,Category_ID,Category,Brand_ID,Price_USD,User_ID,Target_gender,Main_color,Main_metal,Main_gem
0,2018-12-01 17:38:31 UTC,1924899396621697920,1806829193678291446,1,1.806829e+18,,,212.14,1.515916e+18,,yellow,gold,
1,2018-12-02 13:53:42 UTC,1925511016616034733,1842214461889315556,1,1.806829e+18,jewelry.pendant,1.0,54.66,1.515916e+18,f,white,gold,sapphire
2,2018-12-02 17:44:02 UTC,1925626951238681511,1835566849434059453,1,1.806829e+18,jewelry.pendant,0.0,88.9,1.515916e+18,f,red,gold,diamond
3,2018-12-02 21:30:19 UTC,1925740842841014667,1873936840742928865,1,1.806829e+18,jewelry.necklace,0.0,417.67,1.515916e+18,,red,gold,amethyst
4,2018-12-02 22:09:34 UTC,1925760595336888995,1835566854827934449,1,1.806829e+18,jewelry.earring,1.0,102.27,1.515916e+18,,red,gold,


In [11]:
#explore missing values
data.isnull().sum()

Order datetime              0
Order ID                    0
Purchased product ID        0
SKU_Quantity                0
Category_ID              5352
Category                 9933
Brand_ID                 4785
Price_USD                5352
User_ID                  5352
Target_gender           48167
Main_color               7660
Main_metal               5462
Main_gem                34058
dtype: int64
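
The raw counts above are easier to compare as a share of all rows. With 95910 rows in total, `Target_gender` is missing in about 50% of records and `Main_gem` in about 35.5%. The following is a minimal sketch of such a summary, assuming a helper named `missing_summary` (hypothetical, not part of the original notebook) and a small toy frame standing in for `data`:

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for the jewelry data (values are illustrative only)
toy = pd.DataFrame({
    "Price_USD": [212.14, 54.66, np.nan, 417.67],
    "Main_gem": [np.nan, "sapphire", "diamond", np.nan],
    "Main_metal": ["gold", "gold", "gold", "gold"],
})
summary = missing_summary(toy)
print(summary)
```

On the full dataset this would be called as `missing_summary(data)`.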

In [15]:
#check feature cardinality
data.nunique()

Order datetime          74504
Order ID                74759
Purchased product ID     9613
SKU_Quantity                1
Category_ID                25
Category                  218
Brand_ID                 2537
Price_USD                3166
User_ID                 31079
Target_gender               2
Main_color                  5
Main_metal                  3
Main_gem                   30
dtype: int64

In [16]:
#data description
data.describe()

Unnamed: 0,Order ID,Purchased product ID,SKU_Quantity,Category_ID,Brand_ID,Price_USD,User_ID
count,95910.0,95910.0,95910.0,90558.0,91125.0,90558.0,90558.0
mean,2.485191e+18,1.81597e+18,1.0,1.805947e+18,8.891036e+16,362.213017,1.512644e+18
std,1.93475e+17,2.136814e+17,0.0,2.083954e+16,3.559651e+17,444.157665,2.374776e+16
min,1.924899e+18,1.313551e+18,1.0,1.313678e+18,0.0,0.99,1.313554e+18
25%,2.379732e+18,1.515966e+18,1.0,1.806829e+18,0.0,145.62,1.515916e+18
50%,2.524282e+18,1.956664e+18,1.0,1.806829e+18,1.0,258.77,1.515916e+18
75%,2.644347e+18,1.956664e+18,1.0,1.806829e+18,1.0,431.37,1.515916e+18
max,2.719022e+18,2.541962e+18,1.0,1.806829e+18,1.550613e+18,34448.6,1.554297e+18


In [17]:
#check duplicate values
#duplicated() flags every repeat of an earlier row; summing the flags counts them
num_duplicated = data.duplicated().sum()
print(f"Number of Duplicated records: {num_duplicated}")

Number of Duplicated records: 2589
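
If these duplicates turn out to be recording errors rather than legitimate repeat orders, they could be dropped during the Data Preparation phase. The sketch below shows one possible way to do that on a toy frame (the values are illustrative, and whether dropping is appropriate for this dataset is an assumption to verify):

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are illustrative only)
toy = pd.DataFrame({
    "Order ID": [1, 2, 2, 3],
    "Price_USD": [54.66, 88.90, 88.90, 102.27],
})

n_before = len(toy)
# keep="first" retains the first occurrence of each duplicated row
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
n_removed = n_before - len(deduped)
print(f"Removed {n_removed} duplicated record(s)")
```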


In [19]:
#features with a cardinality of 1 (invariant features)
invariant_features = data.nunique()[data.nunique() == 1].index.tolist()
invariant_features

['SKU_Quantity']
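
A feature with a single unique value carries no information for a model, so it is a natural candidate for removal. A minimal sketch of that step, using a toy frame in place of `data` (the frame and the name `reduced` are assumptions for illustration):

```python
import pandas as pd

# Toy frame standing in for the jewelry data (values are illustrative only)
toy = pd.DataFrame({
    "SKU_Quantity": [1, 1, 1],
    "Price_USD": [54.66, 88.90, 102.27],
})

# Compute cardinality once, then select columns with exactly one unique value
nunique = toy.nunique()
invariant_features = nunique[nunique == 1].index.tolist()

# Drop the invariant columns; the original frame is left unchanged
reduced = toy.drop(columns=invariant_features)
print(reduced.columns.tolist())
```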

In [20]:
#check label distribution (skewness)
data['Price_USD'].skew()

18.95906072625981
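
A skewness of about 19 indicates a heavily right-skewed price distribution. One common remedy, offered here only as a possibility and not as the method this project necessarily uses, is a log transform of the label. The sketch below illustrates the effect on synthetic log-normal prices (not the jewelry data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed prices (log-normal); NOT the jewelry data
prices = pd.Series(rng.lognormal(mean=5.5, sigma=1.0, size=10_000))

raw_skew = prices.skew()
# log1p computes log(1 + x), which stays finite for prices near zero
log_skew = np.log1p(prices).skew()
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

If a model were trained on the transformed label, its predictions would need to be mapped back with `np.expm1` before being reported as prices.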

In [None]:
#visualize price distribution
sns.set(rc={'figure.figsize': (20, 20)})
sns.histplot(data['Price_USD'])
plt.show()


In [6]:
dic = pd.read_excel('Jewelry_Data_Dictionary.xlsx')
dic

Unnamed: 0,Column name,Column description
0,Order datetime,Date product was ordered
1,Order ID,Identifier for order
2,Purchased product ID,Identifier for product ordered
3,Quantity of SKU in the order,Amount of stock keeping unit ordered
4,Category ID,Jewelry category identifier
5,Category alias,Jewelry category
6,Brand ID,Brand identifier
7,Price in USD,Jewelry price
8,User ID,User identifier
9,Product gender (for male/female),Target gender for product


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95910 entries, 0 to 95909
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Order datetime        95910 non-null  object 
 1   Order ID              95910 non-null  int64  
 2   Purchased product ID  95910 non-null  int64  
 3   SKU_Quantity          95910 non-null  int64  
 4   Category_ID           90558 non-null  float64
 5   Category              85977 non-null  object 
 6   Brand_ID              91125 non-null  float64
 7   Price_USD             90558 non-null  float64
 8   User_ID               90558 non-null  float64
 9   Target_gender         47743 non-null  object 
 10  Main_color            88250 non-null  object 
 11  Main_metal            90448 non-null  object 
 12  Main_gem              61852 non-null  object 
dtypes: float64(4), int64(3), object(6)
memory usage: 9.5+ MB
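
The summary shows `Order datetime` stored as `object` (plain strings). Parsing it into a timezone-aware datetime would make date-based features available later. A minimal sketch, using two timestamps copied from the `data.head()` output; the names `raw`, `parsed`, and `order_month` are assumptions for illustration:

```python
import pandas as pd

# Sample timestamps copied from the data.head() output
raw = pd.Series([
    "2018-12-01 17:38:31 UTC",
    "2018-12-02 13:53:42 UTC",
])

# Strip the literal " UTC" suffix, then parse with an explicit format;
# utc=True marks the resulting timestamps as UTC
parsed = pd.to_datetime(
    raw.str.replace(" UTC", "", regex=False),
    format="%Y-%m-%d %H:%M:%S",
    utc=True,
)

# Example of a derived feature: the month in which the order was placed
order_month = parsed.dt.month
print(parsed.dtype, order_month.tolist())
```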
