# Predicting Diamond Prices
## Phase 1: Data Preparation & Visualisation

#### Group Number: Group 57

#### Name(s) & ID(s) of Group Members: 
- Eddie Ton (s3948609)
- Jabbar Baloghlan (s3890406)
- Tyler Xia (s3945694)

## Table of Contents
* [Introduction](#Introduction)
  + [Dataset Source](#Dataset-Source)
  + [Dataset Details](#Dataset-Details)
  + [Dataset Features](#Dataset-Features)
  + [Target Feature](#Target-Feature)
* [Goals and Objectives](#Goals-and-Objectives)
* [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
* [Data Exploration and Visualisation](#Data-Exploration-and-Visualisation)
* [Summary and Conclusion](#Summary-and-Conclusion)
* [References](#References)

## Introduction

### Dataset Source

The Diamonds dataset used in this study was sourced from Kaggle (Agrawal, 2021). It contains various details and properties of diamonds.

### Dataset Details

The dataset records the carat, size, quality, colour, and price of each diamond. It also includes the depth, the table (the largest flat surface of the diamond, usually on top), and the length, width, and height of each diamond.
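To illustrate how the depth measurement relates to the three dimensions: the Kaggle page describes depth as the total depth percentage, `z / mean(x, y) = 2 * z / (x + y)`. The sketch below applies that formula to the dimensions of one diamond from the dataset. The helper name `depth_percentage` is our own and is not part of the dataset or of any library.

```python
def depth_percentage(x: float, y: float, z: float) -> float:
    """Total depth percentage: height z relative to the mean of length x and width y."""
    if x + y <= 0:
        raise ValueError("x + y must be positive")
    return 100 * 2 * z / (x + y)


# Dimensions (in mm) of a 0.4-carat Premium diamond from the dataset
print(round(depth_percentage(4.73, 4.77, 2.89), 1))  # 60.8
```

The result agrees with the `depth` value of 60.8 recorded for that diamond.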

In [15]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import io
import requests

pd.set_option('display.max_columns', None) 

###
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
sns.set_theme()  # seaborn look; the bare 'seaborn' style name is deprecated since Matplotlib 3.6
###

In [21]:
# name of the dataset to be imported from our GitHub account
df_name = 'diamonds.csv'
df_url = 'https://raw.githubusercontent.com/Jobi060704/math_files/main/' + df_name
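# download over HTTPS, leaving certificate verification on (the default)
response = requests.get(df_url)
response.raise_for_status()  # stop early if the download failed
diamond_df = pd.read_csv(io.StringIO(response.content.decode('utf-8')))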

In [24]:
# drop the first column, which is the unnamed row index stored in the CSV
new_diamond_df = diamond_df.drop(columns=diamond_df.columns[0])
new_diamond_df.sample(10, random_state=99)

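|       | carat | cut       | color | clarity | depth | table | price | x    | y    | z    |
|-------|-------|-----------|-------|---------|-------|-------|-------|------|------|------|
| 42653 | 0.4   | Premium   | G     | IF      | 60.8  | 58.0  | 1333  | 4.73 | 4.77 | 2.89 |
| 4069  | 0.31  | Very Good | D     | SI1     | 63.0  | 55.0  | 571   | 4.32 | 4.34 | 2.73 |
| 27580 | 1.6   | Ideal     | F     | VS2     | 62.0  | 55.0  | 18421 | 7.49 | 7.54 | 4.66 |
| 33605 | 0.31  | Good      | D     | SI2     | 63.1  | 54.0  | 462   | 4.33 | 4.38 | 2.75 |
| 34415 | 0.3   | Ideal     | G     | IF      | 61.2  | 57.0  | 863   | 4.35 | 4.38 | 2.67 |
| 46932 | 0.52  | Premium   | G     | VS1     | 62.4  | 60.0  | 1815  | 5.12 | 5.11 | 3.19 |
| 52243 | 0.8   | Very Good | J     | VS1     | 62.7  | 58.0  | 2487  | 5.91 | 5.95 | 3.72 |
| 38855 | 0.4   | Ideal     | G     | VVS2    | 62.4  | 56.0  | 1050  | 4.74 | 4.72 | 2.95 |
| 38362 | 0.33  | Ideal     | D     | VVS2    | 61.1  | 56.0  | 1021  | 4.46 | 4.44 | 2.72 |
| 20258 | 1.72  | Ideal     | J     | SI1     | 62.0  | 57.0  | 8688  | 7.66 | 7.62 | 4.74 |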



The dataset describes diamonds through the following properties: dimensions (x, y, z), carat, colour, depth, clarity, table, cut grade, and price. These variables appear sufficient for exploring how the properties relate to one another and for modelling their relationship to diamond price.

The edited dataset has a total of 10 features and 5000 observations. Entries with no information have already been removed from the dataset. 

**Dataset Retrieval**

- We read in the dataset from our GitHub repository and load the modules we will use throughout this report.
- We display 10 randomly sampled rows from this dataset.
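After loading, it can help to confirm the size of the data and that no missing values remain. The following is a minimal sketch of such a check. The function name `summarise_frame` and the small `demo_df` frame (three of the sampled rows, three columns) are our own illustrations, used so that the example runs without downloading the file.

```python
import pandas as pd


def summarise_frame(df: pd.DataFrame) -> dict:
    """Return the row count, column count, and total number of missing cells."""
    n_rows, n_cols = df.shape
    return {
        "rows": n_rows,
        "columns": n_cols,
        "missing_cells": int(df.isna().sum().sum()),
    }


# Three of the sampled rows, used here as a stand-in for the full dataset
demo_df = pd.DataFrame(
    {
        "carat": [0.40, 0.31, 1.60],
        "cut": ["Premium", "Very Good", "Ideal"],
        "price": [1333, 571, 18421],
    }
)
print(summarise_frame(demo_df))
```

In the notebook, calling `summarise_frame(new_diamond_df)` would give the same summary for the full dataset.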

### Dataset Features

The features in our dataset are described in the table below. These descriptions are taken from the Kaggle data source.

In [5]:
from tabulate import tabulate

table = [['Name','Data Type','Units','Description'],
        ]

print(tabulate(table, headers='firstrow', tablefmt='fancy_grid'))


### Target Feature

For this project, the target feature in this dataset will be the diamond price, which the Kaggle source records in US dollars. That is, the price of diamonds will be predicted based on the explanatory (descriptive) variables.
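In code, this amounts to separating the `price` column from the remaining columns. The sketch below shows one way to do so. The small `demo_df` frame (three sampled rows, four columns) and the variable names `X`, `y`, and `target_name` are our own illustrations rather than part of the notebook.

```python
import pandas as pd

# Three sampled rows as a stand-in for the full dataset
demo_df = pd.DataFrame(
    {
        "carat": [0.40, 0.31, 1.60],
        "cut": ["Premium", "Very Good", "Ideal"],
        "depth": [60.8, 63.0, 62.0],
        "price": [1333, 571, 18421],
    }
)

target_name = "price"                    # column to be predicted
y = demo_df[target_name]                 # target feature
X = demo_df.drop(columns=[target_name])  # descriptive features

print(list(X.columns))  # ['carat', 'cut', 'depth']
print(y.tolist())       # [1333, 571, 18421]
```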

## Goals and Objectives

Diamond pricing is complex, so a model that could accurately predict or set diamond prices would be valuable. For instance, a jeweller could use such a model to value a diamond from its properties once work on it has been completed. Similarly, a store selling diamonds could use such a model to decide the price at which to sell a diamond.

Thus, the main objective of this project is two-fold: (1) to predict the price of diamonds from their publicly available properties, and (2) to identify which features appear to be the best predictors of diamond sale price. A secondary objective is to explore the data, after some cleaning and preprocessing, using basic descriptive statistics and visualisation plots to gain insight into the patterns and relationships it contains. This exploration is the subject of this Phase 1 report.

At this point, we make the important assumption that the rows in our dataset are not correlated. That is, we assume that diamond price observations are independent of one another in this dataset. This is not a very realistic assumption. However, it allows us to set aside the time series aspects of the underlying dynamics of diamond prices and to use classical predictive models such as multiple linear regression.
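To make the reference to multiple linear regression concrete, the sketch below fits price on carat and depth by ordinary least squares with NumPy, treating each row as an independent observation. It uses only the ten sampled rows, which are far too few for a meaningful model. The choice of these two predictors is our assumption, made for illustration only, and is not the model this project will necessarily use.

```python
import numpy as np

# carat, depth and price for the ten sampled rows (illustration only)
carat = np.array([0.40, 0.31, 1.60, 0.31, 0.30, 0.52, 0.80, 0.40, 0.33, 1.72])
depth = np.array([60.8, 63.0, 62.0, 63.1, 61.2, 62.4, 62.7, 62.4, 61.1, 62.0])
price = np.array([1333, 571, 18421, 462, 863, 1815, 2487, 1050, 1021, 8688], dtype=float)

# Design matrix with an intercept column: price ~ b0 + b1 * carat + b2 * depth
X = np.column_stack([np.ones_like(carat), carat, depth])

# Ordinary least squares estimate of (b0, b1, b2)
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

# Proportion of variance in price explained by the fitted model
fitted = X @ coef
r_squared = 1 - np.sum((price - fitted) ** 2) / np.sum((price - price.mean()) ** 2)

print("coefficients:", np.round(coef, 2))
print("R^2:", round(float(r_squared), 3))
```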

## Data Cleaning and Preprocessing

In this section, we describe the data cleaning and preprocessing steps undertaken for this project.

> TYLERRRRRRRRRRRRRRRRRRR

## Data Exploration and Visualisation

Our dataset is now considered to be clean, and we are ready to start visualising and exploring each of the features.

> LATERRRRRRR

## Summary and Conclusion

> YEEEEEEEEEEEEEEE

## References

- Agrawal, S. (2021). Diamonds (Kaggle). Retrieved September 26, 2022, from https://www.kaggle.com/datasets/shivam2503/diamonds