# 1) Explore Dataset & Examine what Features affect the Price of Diamonds.

## 1.1) Importing Libraries

In [None]:
# Ignore warnings :
import warnings
warnings.filterwarnings('ignore')


# Handle table-like data and matrices :
import numpy as np
import pandas as pd
import math 


In [None]:
# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
#import matplotlib.pylab as pylab
import seaborn as sns
import missingno as msno


In [None]:
# Configure visualisations
%matplotlib inline
mpl.style.use( 'ggplot' )
plt.style.use('fivethirtyeight')
sns.set(context="notebook", palette="dark", style = 'whitegrid' , color_codes=True)
params = { 
    'axes.labelsize': "medium",
    'xtick.labelsize': 'medium',
    'legend.fontsize': 15,
    'figure.dpi': 150,
    'figure.figsize': [25, 15]
}
plt.rcParams.update(params)

In [None]:
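# Center all plots
# display() is called explicitly so the <style> block is actually rendered;
# a trailing semicolon would suppress the cell output.
from IPython.display import HTML, display
display(HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
"""))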

## 1.2) Extract Dataset

* 1) Read a CSV file from a specified source, create a copy, and view the first 5 rows
* 2) Do you see an unwanted column? Drop it.
* 3) View only custom columns for rows 10 to 15.
* 4) Get the number of rows and columns in this dataset (dimensions)
* 5) Understand the different data types of the variables
* 6) Select columns based on data types

1) Read a CSV file from a specified source, create a copy, and view the first 5 rows

In [None]:
df = pd.read_csv('./diamonds.csv')
diamonds = df.copy()

In [None]:
# How the data looks
df.head()

2) Do you see an unwanted column? Drop it.

In [None]:
df.drop(['Unnamed: 0'] , axis=1 , inplace=True)
df.head()

3) View only custom columns for rows 10 to 15.

In [None]:
custom_columns1=["cut","color","clarity"]

In [None]:
#df.head(16).tail(6)
df[custom_columns1].head(16).tail(6)

In [None]:
#df.loc[10:15, ]
# .loc slices by label and includes both endpoints, so 10:15 returns rows 10 to 15
df.loc[10:15, custom_columns1]

In [None]:
#df.iloc[10:16, custom_columns1]
df.iloc[10:16, 1:4]
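
The difference between label-based and position-based slicing in the cells above is easy to get wrong, so here is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration, not the diamonds data). With a default integer index, `.loc[10:15]` and `.iloc[10:16]` select the same six rows.

```python
import pandas as pd

# Toy frame with a default RangeIndex, standing in for the diamonds data.
toy = pd.DataFrame({"carat": [0.1 * i for i in range(20)],
                    "cut": ["Ideal"] * 20})

by_label = toy.loc[10:15, ["cut"]]    # label-based: both endpoints included
by_position = toy.iloc[10:16, [1]]    # position-based: end position excluded

print(len(by_label), len(by_position))            # 6 6
print(by_label.index.equals(by_position.index))   # True
```

Note that `.iloc` needs integer column positions, whereas `.loc` takes column names.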

4) Get the number of rows and columns in this dataset (dimensions)

In [None]:
df.shape

5) Understand the different data types of the variables

In [None]:
df.info()

6) Select columns based on data types

In [None]:
#Eg - All numeric columns
df.select_dtypes(include=['float64','int64']).head()

In [None]:
#Eg - All non numeric columns
df.select_dtypes(exclude=['float64','int64']).head()
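
Listing `'float64'` and `'int64'` explicitly works for this dataset, but it would miss other numeric types such as `int32`. As a hedged alternative, `select_dtypes` also accepts the generic `"number"` selector. The sketch below uses a small made-up frame (`toy` is an assumption, not the real data).

```python
import pandas as pd

# Hypothetical three-column frame: two numeric columns and one string column.
toy = pd.DataFrame({
    "carat": [0.23, 0.31],
    "price": [326, 335],
    "cut": ["Ideal", "Good"],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['carat', 'price']
print(other_cols)    # ['cut']
```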

## 1.3) Features
* **Carat:** Carat weight of the Diamond.
* **Cut:** Describes the cut quality of the Diamond.
> * Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
* **Color:** Color of the Diamond.
> * With D being the best and J the worst.
* **Clarity:** Diamond Clarity refers to the absence of Inclusions and Blemishes.
> * (In order from Best to Worst, FL = flawless, I3 = level 3 inclusions) FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3
* **Depth:** The Height of a Diamond, measured from the Culet to the Table, divided by its average Girdle Diameter.
* **Table:** The Width of the Diamond's Table expressed as a Percentage of its Average Diameter.
* **Price:** The Price of the Diamond.
* **X:** Length of the Diamond in mm.
* **Y:** Width of the Diamond in mm.
* **Z:** Height of the Diamond in mm.

*Qualitative Features (Categorical): Cut, Color, Clarity.*

*Quantitative Features (Numerical): Carat, Depth, Table, Price, X, Y, Z.*
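
Because the categorical features have a natural quality order, it can help to store them as ordered categoricals so that sorting and comparisons respect that order. This is a minimal sketch, assuming the cut order listed above and the letter grades D to J for color; the frame `toy` and its rows are made up for illustration.

```python
import pandas as pd

cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]   # worst to best
color_order = ["J", "I", "H", "G", "F", "E", "D"]               # worst to best

# Hypothetical rows, not taken from the dataset.
toy = pd.DataFrame({"cut": ["Ideal", "Fair", "Premium"],
                    "color": ["E", "J", "D"]})

toy["cut"] = pd.Categorical(toy["cut"], categories=cut_order, ordered=True)
toy["color"] = pd.Categorical(toy["color"], categories=color_order, ordered=True)

sorted_cuts = toy["cut"].sort_values().tolist()   # sorted by quality, not alphabetically
best_color = toy["color"].max()                   # highest grade present

print(sorted_cuts)  # ['Fair', 'Premium', 'Ideal']
print(best_color)   # D
```

The same idea would apply to clarity, using the order given in the list above.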


### Price is the Target Variable.

## 1.4) Examine & Handle Null Values

* 1) Get the count of NULL values for all columns
* 2) Get the count of not NULL values for a particular column
* 3) Get the statistical summary of all the variables. Do you see some peculiarity?
* 4) Handle the peculiarity

1) Get the count of NULL values for all columns

In [None]:
# It seems there are no Null Values.
# Let's Confirm
df.isnull().sum()

2) Get the count of not NULL values for a particular column

In [None]:
# This gives count of not null values
df["carat"].count()

In [None]:
msno.matrix(df) # just to visualize. no missing values.

### Great, So there are no NaN values.
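
To see how `isnull().sum()` and `count()` relate when values really are missing, here is a small sketch on a made-up frame (`toy` is hypothetical; the diamonds data itself has no NaN values, as shown above). For every column, the null count plus the non-null count equals the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
toy = pd.DataFrame({
    "carat": [0.23, np.nan, 0.31],
    "cut": ["Ideal", "Good", None],
})

null_counts = toy.isnull().sum()   # missing values per column
non_null_counts = toy.count()      # present values per column

print(null_counts)
print(non_null_counts)
```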

3) Get the statistical summary of all the variables? Do you see some peculiarity?

In [None]:
df.describe() 

### Wait
* **Look at the minimum values of X, Y and Z: they are zero, which cannot be right.**
* **It doesn't make sense for the Length, Width or Height of a Diamond to be zero, so these entries are most likely missing or faulty measurements.**


What about the categorical features?

In [None]:
df['color'].value_counts()

In [None]:
df['clarity'].value_counts()

In [None]:
df['cut'].value_counts()

4) Handling the peculiarity

In [None]:
# Filter the rows where any of the dimensions is zero
df.loc[(df['x']==0) | (df['y']==0) | (df['z']==0)]

In [None]:
#How many such rows?
len(df[(df['x']==0) | (df['y']==0) | (df['z']==0)])

### We can see there are 20 rows with at least one Dimension equal to 'Zero'.
* **We'll drop them, as that seems a better choice than filling them with the Mean or Median.**
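
An equivalent way to think about these rows is to treat a zero dimension as a missing value: mark it as NaN, count the affected rows, then drop them. The sketch below uses a made-up frame (`toy`, `marked` and `cleaned` are hypothetical names and values), not the actual 20 rows.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; three of the four rows contain a zero dimension.
toy = pd.DataFrame({
    "x": [3.95, 0.0, 4.05, 4.20],
    "y": [3.98, 3.84, 0.0, 4.23],
    "z": [2.43, 2.31, 2.31, 0.0],
    "price": [326, 327, 334, 335],
})
dims = ["x", "y", "z"]

marked = toy.copy()
marked[dims] = marked[dims].replace(0, np.nan)       # zero -> missing
n_bad_rows = marked[dims].isnull().any(axis=1).sum() # rows with any missing dimension
cleaned = marked.dropna(subset=dims)                 # keep fully measured rows only

print(n_bad_rows, len(cleaned))  # 3 1
```

Marking the zeros as NaN first also leaves the door open to imputing them later instead of dropping, if that turns out to be preferable.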

Dropping Rows with Dimensions 'Zero'.

In [None]:
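# Keep only rows where x, y and z are all non-zero; .copy() avoids
# SettingWithCopyWarning when new columns are added to df later
df = df[(df[['x','y','z']] != 0).all(axis=1)].copy()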

In [None]:
# Just to Confirm
df.loc[(df['x']==0) | (df['y']==0) | (df['z']==0)]

In [None]:
# Nice and Clean. :)

## 1.5) Understand the Scale of All Features

In [None]:
# factorplot was renamed to catplot, and its `size` argument to `height`
sns.catplot(data=df , kind='box' , height=7, aspect=2.5)

**Most Features are distributed over a Small Scale compared with Price, so the Features sit on very different Scales.**
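
One way to put numbers on the scale of each feature, alongside the box plot, is to compare the spread (max minus min) of every column. This is only a sketch: `toy` and its values are made up to illustrate columns on different scales.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate features on very different scales.
toy = pd.DataFrame({
    "carat": [0.23, 0.70, 1.50],
    "depth": [61.5, 62.8, 60.2],
    "price": [326, 2757, 12000],
})

value_range = toy.max() - toy.min()   # spread of each column
widest = value_range.idxmax()         # column with the largest spread

print(value_range)
print(widest)  # price
```

Large differences in spread like this are a common reason to scale features before fitting a model.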

# 2) Correlation Between Features

In [None]:
# Correlation Map (numeric columns only; cut, color and clarity are strings)
corr = df.select_dtypes(include='number').corr()
sns.heatmap(data=corr, square=True , annot=True, cbar=True)

## CONCLUSIONS :
**1. Depth is slightly inversely related to Price.**
> * If a Diamond's Depth percentage is too large or too small, the Diamond will appear '__Dark__' because it no longer returns an attractive amount of light.

**2. The Price of the Diamond is highly correlated with Carat and with its Dimensions.**

**3. The Weight (Carat) of a Diamond has the most significant impact on its Price.**
> * Since larger stones are rarer, one 2 Carat Diamond will be more '__Expensive__' than the total cost of two 1 Carat Diamonds of the same Quality.

**4. The Length (x), Width (y) and Height (z) seem to be highly related to Price and to each other.**

**5. The correlation of a feature with itself is 1, as expected.**

**6. What other inferences can you draw? Left as homework.**
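
Reading conclusions off a heatmap can be made more systematic by ranking the features by the absolute value of their correlation with price. The following is a minimal sketch on made-up numbers (`toy` and `price_corr` are hypothetical names), so the ranking it prints says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical values; the string column is excluded before computing correlations.
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5],
    "depth": [62.0, 61.0, 63.0, 61.5, 62.5],
    "price": [500, 1200, 2400, 5000, 10000],
    "cut": ["Good", "Ideal", "Fair", "Premium", "Ideal"],
})

price_corr = (
    toy.select_dtypes(include="number")
       .corr()["price"]                      # Pearson correlation with price
       .drop("price")                        # drop the trivial self-correlation
       .sort_values(key=abs, ascending=False)
)

print(price_corr)
```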

# 3) Visualization of All Features

## 3.1) Carat

* **Carat refers to the Weight of the Stone, not the Size.**
* **The Weight of a Diamond has the most significant Impact on its Price.**
* **Since the larger a Stone is, the Rarer it is, one 2 Carat Diamond will be more Expensive than the Total cost of two 1 Carat Diamonds of the Same Quality.**
* **The Carat of a Diamond is often very important to people when shopping, but it is a mistake to sacrifice too much quality for sheer size.**


[Click Here to Learn More about How Carat Affects the Price of Diamonds.](https://www.diamondlighthouse.com/blog/2014/10/23/how-carat-weight-affects-diamond-price/)

In [None]:
# Visualize via kde plots

In [None]:
sns.__version__

In [None]:
#df2 = sns.load_dataset(df)
#sns.displot(df, x="carat")

In [None]:
#sns.kdeplot(df['carat'], shade=True , color='r')
sns.displot(df,x='carat',kind='kde')

In [None]:
sns.displot(df, x="carat", hue="cut", kind="kde")

In [None]:
sns.displot(df, x="carat", hue="cut", element="step")

### Carat vs Price

In [None]:
sns.jointplot(x='carat' , y='price' , data=df , height=5)

### Price appears to rise steeply, roughly Exponentially, with Carat.
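
If price really does grow roughly exponentially with carat, then log(price) should be close to linear in carat. A quick way to check that idea is to fit a straight line to log(price). The sketch below uses synthetic data generated from an exact exponential (the constants 300 and 1.8 are arbitrary assumptions), so the fit recovers them; real data would only follow the line approximately.

```python
import numpy as np

# Synthetic data: price = 300 * exp(1.8 * carat), chosen purely for illustration.
carat = np.array([0.3, 0.5, 1.0, 1.5, 2.0])
price = 300.0 * np.exp(1.8 * carat)

# Fit log(price) = slope * carat + intercept.
slope, intercept = np.polyfit(carat, np.log(price), deg=1)

print(slope)              # growth rate per carat on the log scale
print(np.exp(intercept))  # implied price at zero carat
```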

## 3.2) Cut

* **Although the Carat Weight of a Diamond has the Strongest Effect on Prices, the Cut can still Drastically Increase or Decrease its value.**
* **With a Higher Cut Quality, the Diamond’s Cost per Carat Increases.**
* **This is because there is a Higher Wastage of the Rough Stone as more Material needs to be Removed in order to achieve better Proportions and Symmetry.**

[Click Here to Learn More about How Cut Affects the Price.](https://www.lumeradiamonds.com/diamond-education/diamond-cut)

In [None]:
sns.catplot(x="cut", data=df, kind="count", order=df.cut.value_counts().index)

### Cut vs Price

In [None]:
sns.catplot(x='cut', y='price', data=df, kind='box' ,aspect=2.5 )

In [None]:
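# Understanding Box Plot (seaborn defaults) :

# The lower whisker ends at the smallest value within 1.5 * IQR below the 25th percentile.
# The upper whisker ends at the largest value within 1.5 * IQR above the 75th percentile.
# Points beyond the whiskers are drawn individually as outliers.
# The middle line of the box is the median, i.e. the 50th percentile.
# The edges of the box are the 25th and 75th percentiles respectively.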
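
The quantities a box plot draws can also be computed by hand. This sketch assumes the default whisker rule used by matplotlib and seaborn (whiskers reach the most extreme points within 1.5 times the interquartile range of the box); the `prices` array is made up.

```python
import numpy as np

# Hypothetical prices with one clearly extreme value.
prices = np.array([300, 500, 700, 900, 1100, 1300, 1500, 9000])

q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1                                   # height of the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

whisker_low = prices[prices >= lower_limit].min()    # end of lower whisker
whisker_high = prices[prices <= upper_limit].max()   # end of upper whisker
outliers = prices[(prices < lower_limit) | (prices > upper_limit)]

print(q1, median, q3)              # 650.0 1000.0 1350.0
print(whisker_low, whisker_high)   # 300 1500
print(outliers)                    # [9000]
```

Here the upper whisker stops at 1500 rather than at the maximum, and 9000 is drawn as a separate outlier point.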

### Premium Cut Diamonds appear to be the most Expensive, followed by Very Good Cut.

## 3.3) Color
* **The Color of a Diamond refers to the Tone and Saturation of Color, or the Depth of Color in a Diamond.**
* **The Color of a Diamond can Range from Colorless to a Yellow or a Faint Brownish Colored hue.**
* **Colorless Diamonds are Rarer and more Valuable because they appear Whiter and Brighter.**

[Click Here to Learn More about How Color Affects the Price](https://enchanteddiamonds.com/education/understanding-diamond-color)

In [None]:
sns.catplot(x='color', data=df , kind='count',aspect=2.5 )

In [None]:
sns.catplot(x="color", data=df, kind="count", order=df.color.value_counts().index)

### Color vs Price

In [None]:
sns.catplot(x='color', y='price' , data=df , kind='violin', aspect=2.5)

## 3.4) Clarity
* **Diamond Clarity refers to the absence of the Inclusions and Blemishes.**
* **An Inclusion is an Imperfection located within a Diamond. Inclusions can be Cracks or even Small Minerals or Crystals that have formed inside the Diamond.**
* **Blemishes result from the cutting and polishing process rather than from the environmental conditions in which the Diamond was formed. They include scratches, extra facets, etc.**

[Click Here to Learn More about How Clarity Affects the Price of Diamonds.](https://www.diamondmansion.com/blog/understanding-how-diamond-clarity-affects-value/)

In [None]:
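# Take labels and sizes from the same value_counts() result so they stay aligned;
# unique() returns categories in order of appearance, not in order of count
clarity_counts = df.clarity.value_counts()
labels = clarity_counts.index.tolist()
sizes = clarity_counts.tolist()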
colors = ['#006400', '#E40E00', '#A00994', '#613205', '#FFED0D', '#16F5A7','#ff9999','#66b3ff']
explode = (0.1, 0.0, 0.1, 0, 0.1, 0, 0.1,0)
plt.pie(sizes, explode=explode, labels=labels, colors=colors,autopct='%1.1f%%', shadow=True, startangle=0)
plt.axis('equal')
plt.title("Percentage of Clarity Categories")
plt.plot()
fig=plt.gcf()
fig.set_size_inches(6,6)
plt.show()

In [None]:
sns.boxplot(x='clarity', y='price', data=df )

### It seems that VS1 and VS2 Diamonds have similar Price distributions, both spanning a wide Price range.
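
A per-category summary table is a useful companion to the box plot when comparing clarity grades. This is a minimal sketch with made-up prices (`toy` and `summary` are hypothetical), so the numbers do not describe the real dataset.

```python
import pandas as pd

# Hypothetical prices for three clarity grades.
toy = pd.DataFrame({
    "clarity": ["VS1", "VS1", "VS2", "VS2", "SI1", "SI1"],
    "price": [1000, 3000, 1100, 2900, 800, 1200],
})

# Median, minimum and maximum price for each clarity grade.
summary = toy.groupby("clarity")["price"].agg(["median", "min", "max"])

print(summary)
```

On the real data, the equivalent call would be on `df`, and similar medians for two grades would support the observation that they are priced alike.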

## 3.5) Depth
* **The Depth of a Diamond is its Height (in millimeters) measured from the Culet to the Table. The `depth` column in this dataset expresses that Height as a Percentage of the average Girdle Diameter.**
* **If a Diamond's Depth Percentage is too large or too small, the Diamond will appear Dark because it no longer returns an Attractive amount of light.**
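
As a worked example of the depth percentage, the sketch below assumes the usual definition for this dataset: height `z` divided by the mean of length `x` and width `y`, expressed as a percentage. The function name `depth_percentage` and the sample measurements are hypothetical.

```python
def depth_percentage(x: float, y: float, z: float) -> float:
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100.0 * 2.0 * z / (x + y)


# Hypothetical stone: 4.0 mm long, 4.0 mm wide, 2.46 mm high.
example = depth_percentage(4.0, 4.0, 2.46)
print(round(example, 1))  # 61.5
```

Recorded depth values may differ slightly from this calculation because the measurements are rounded.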

[Click Here to Learn More about How Depth Affects the Price of Diamonds.](https://beyond4cs.com/grading/depth-and-table-values/)

In [None]:
#plt.hist('depth' , data=df , bins=25)
sns.displot(df, x="depth", bins=25)

In [None]:
sns.jointplot(x='depth', y='price' , data=df , kind='reg', height=5)

### We can Infer from the plot that the Price can vary heavily for the same Depth.
* **The Pearson correlation also shows a slight inverse relation between the two.**

## 3.6) Table
* **Table is the Width of the Diamond's Table expressed as a Percentage of its Average Diameter.**
* **If the Table (Upper Flat Facet) is too Large then light will not play off of any of the Crown's angles or facets and will not create the Sparkly Rainbow Colors.**
* **If it is too Small then the light will get Trapped and that Attention grabbing shaft of light will never come out but will “leak” from other places in the Diamond.**

[Click Here to Learn More about How Table Affects the Price of Diamonds.](https://beyond4cs.com/grading/depth-and-table-values/)

In [None]:
sns.kdeplot(df['table'] ,fill=True , color='orange')

In [None]:
# Plot table (not depth) against price in this section
sns.jointplot(x='table', y='price' , data=df , kind='reg', height=5)

## 3.7) Dimensions

* **As the Dimensions increase, the Price rises, since more natural resources are used.**

In [None]:
sns.kdeplot(df['x'] ,fill=True , color='r' )
sns.kdeplot(df['y'] , fill=True , color='g' )
sns.kdeplot(df['z'] , fill=True , color='b')
plt.xlim(2,10)

**We'll Create a New Feature based on the Dimensions in the Next Section called 'Volume' and Visualize how it affects the Price.**

# 4) Feature Engineering

## 4.1) Create New Feature 'Volume'

In [None]:
df['volume'] = df['x']*df['y']*df['z']
df.head()

In [None]:
plt.figure(figsize=(5,5))
plt.hist( x=df['volume'] , bins=30 ,color='g')
plt.xlabel('Volume in mm^3')
plt.ylabel('Frequency')
plt.title('Distribution of Diamond\'s Volume')
plt.xlim(0,1000)
plt.ylim(0,50000)

In [None]:
sns.jointplot(x='volume', y='price' , data=df, height=5)

### It seems that there is a Linear Relationship between Price and Volume (x \* y \* z).
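
A simple way to test how linear a relationship looks is to fit a straight line and check the Pearson correlation. The sketch below uses synthetic volumes and prices (a line with slope 50 plus small made-up residuals), so it only illustrates the check and says nothing about the real data.

```python
import numpy as np

# Synthetic data: a straight line plus small hand-picked residuals.
volume = np.array([40.0, 60.0, 100.0, 160.0, 240.0])
residual = np.array([100.0, -100.0, 50.0, -50.0, 0.0])
price = 50.0 * volume - 1500.0 + residual

slope, intercept = np.polyfit(volume, price, deg=1)  # least-squares line
r = np.corrcoef(volume, price)[0, 1]                 # Pearson correlation

print(slope, intercept)
print(r)
```

A correlation close to 1 together with residuals that show no pattern would support describing the relationship as linear.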

## 4.2) Drop X, Y, Z

In [None]:
df.drop(['x','y','z'], axis=1, inplace= True)
#df.head()