
# Descriptive Statistics Analysis: Diamond Dataset

### In this notebook, we load the 'Diamond.csv' dataset and compute descriptive statistics to understand the distribution and characteristics of diamond features.



In [52]:
# 1. Import Libraries

import pandas as pd
import numpy as np



In [53]:
# 2. Load the Dataset

df = pd.read_csv("Diamond.csv")

# Display first few rows to inspect structure
df.head(10)

   carat colour clarity certification  price
0   0.30      D     VS2           GIA   1302
1   0.30      E     VS1           GIA   1510
2   0.30      G    VVS1           GIA   1510
3   0.30      G     VS1           GIA   1260
4   0.31      D     VS1           GIA   1641
5   0.31      E     VS1           GIA   1555
6   0.31      F     VS1           GIA   1427
7   0.31      G    VVS2           GIA   1427
8   0.31      H     VS2           GIA   1126
9   0.31      I     VS1           GIA   1126


In [54]:
# 3. Basic Dataset Information

print("Dataset Shape:", df.shape)
print("\nColumn Names:", df.columns.tolist())
print("\nData Types:")
print(df.dtypes)

# Check for missing values
print("\nMissing Values per Column:")
print(df.isnull().sum())

Dataset Shape: (308, 5)

Column Names: ['carat', 'colour', 'clarity', 'certification', 'price']

Data Types:
carat            float64
colour            object
clarity           object
certification     object
price              int64
dtype: object

Missing Values per Column:
carat            0
colour           0
clarity          0
certification    0
price            0
dtype: int64
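
Before computing each statistic separately, a one-call overview can serve as a cross-check. The sketch below is not part of the original analysis. It rebuilds only the ten rows shown by `df.head(10)` into a stand-in frame called `sample` (a hypothetical name), so its numbers describe those ten rows and not the full 308-row dataset.

```python
import pandas as pd

# Stand-in data: the ten rows displayed by df.head(10), not the full dataset.
sample = pd.DataFrame({
    "carat": [0.30, 0.30, 0.30, 0.30, 0.31, 0.31, 0.31, 0.31, 0.31, 0.31],
    "colour": ["D", "E", "G", "G", "D", "E", "F", "G", "H", "I"],
    "clarity": ["VS2", "VS1", "VVS1", "VS1", "VS1", "VS1", "VS1", "VVS2", "VS2", "VS1"],
    "certification": ["GIA"] * 10,
    "price": [1302, 1510, 1510, 1260, 1641, 1555, 1427, 1427, 1126, 1126],
})

# Numerical columns: count, mean, std (ddof=1), min, quartiles, max.
num_summary = sample[["carat", "price"]].describe()
print(num_summary)

# Categorical columns: count, number of unique values, most frequent value and its frequency.
cat_summary = sample[["colour", "clarity", "certification"]].describe()
print(cat_summary)
```

On the real data, the same two calls on `df[numerical_cols]` and `df[categorical_cols]` should reproduce most of the figures computed step by step in the following cells.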


In [55]:
# 4. Identify Variable Types

# Based on the data snippet:
# - Numerical variables: 'carat', 'price'
# - Categorical variables: 'colour', 'clarity', 'certification'

numerical_cols = ['carat', 'price']
categorical_cols = ['colour', 'clarity', 'certification']

## **Descriptive Statistics for Numerical Variables**

In [56]:
# Select numerical columns
num_df = df[numerical_cols]

# 4.1 Sample Size (Number of Observations)
n_obs = len(num_df)
print(f"Sample size (number of observations): {n_obs}")

# Interpretation:
# This is the total number of diamond records in the dataset. 

Sample size (number of observations): 308


In [57]:
# 4.2 Min and Max
min_vals = num_df.min()
max_vals = num_df.max()
print("\nMinimum values:")
print(min_vals)
print("\nMaximum values:")
print(max_vals)

# Interpretation:
# The minimum and maximum show the range of observed values for each numerical variable.
# Here the smallest diamond is 0.18 carats and the largest is 1.1 carats; prices run from 638 to 16008.
# Extremes also help flag potential outliers or data entry errors (e.g., carat = 0 or price < 0).


Minimum values:
carat      0.18
price    638.00
dtype: float64

Maximum values:
carat        1.1
price    16008.0
dtype: float64


In [58]:
# 4.3 Sample Mean (Average)
mean_vals = num_df.mean()
print("\nSample mean (average):")
print(mean_vals)

# Interpretation:
# The mean is the arithmetic average. It represents the "center of mass" of the data.
# However, it can be influenced by extreme values (outliers). 
# For skewed distributions (like price), the median may be more representative.


Sample mean (average):
carat       0.630909
price    5019.483766
dtype: float64


In [59]:
# 4.4 Variance and Standard Deviation
# ddof ("Delta Degrees of Freedom") sets the divisor to N - ddof, where N is the number of observations.
# ddof=1 gives the sample variance (divisor N - 1); it is also the pandas default.
var_vals = num_df.var(ddof=1)
std_vals = num_df.std(ddof=1)
print("\nSample variance:")
print(var_vals)
print("\nSample standard deviation:")
print(std_vals)

# Interpretation:
# Variance is the average squared deviation from the mean (with divisor N - 1 for a sample).
# Standard deviation (square root of variance) is in the same units as the original data.
# A higher standard deviation means the data points are more spread out from the mean.
# Here the price std (about 3403) is large relative to the mean price (about 5019), so diamond prices vary widely.


Sample variance:
carat    7.683044e-02
price    1.158120e+07
dtype: float64

Sample standard deviation:
carat       0.277183
price    3403.115715
dtype: float64
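
To make the role of `ddof` concrete, here is a minimal sketch that computes both variants by hand and compares them with pandas. The values are toy numbers chosen for easy arithmetic; they are not taken from Diamond.csv.

```python
import numpy as np
import pandas as pd

# Toy values (assumption for illustration only); their mean is exactly 5.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

n = len(values)
squared_dev = (values - values.mean()) ** 2   # squared deviations from the mean

sample_var = squared_dev.sum() / (n - 1)      # divisor N - 1 (ddof=1)
population_var = squared_dev.sum() / n        # divisor N     (ddof=0)

print(sample_var, values.var(ddof=1))         # hand-computed vs pandas, ddof=1
print(population_var, values.var(ddof=0))     # hand-computed vs pandas, ddof=0
print(np.sqrt(sample_var), values.std(ddof=1))
```

With 308 observations the two divisors differ by well under one percent, so the choice matters far more for small samples than for this dataset.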


In [60]:
# 4.5 Median
median_vals = num_df.median()
print("\nMedian:")
print(median_vals)

# Interpretation:
# The median is the middle value when data is sorted. 
# It is robust to outliers and better represents the "typical" value in skewed distributions.
# Compare median and mean: a mean above the median usually signals a right-skewed distribution.
# Here the mean price (about 5019) exceeds the median price (4215), consistent with right skew.


Median:
carat       0.62
price    4215.00
dtype: float64
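
The robustness of the median is easy to see on a small example. The sketch below uses toy numbers (not from Diamond.csv) and changes a single value into an extreme one.

```python
import pandas as pd

# Toy data (assumption for illustration): the second series replaces 5 with 100.
typical = pd.Series([1, 2, 3, 4, 5])
with_outlier = pd.Series([1, 2, 3, 4, 100])

print(typical.mean(), typical.median())            # mean and median coincide
print(with_outlier.mean(), with_outlier.median())  # mean jumps, median is unchanged
```

One extreme value moves the mean from 3 to 22 while the median stays at 3, which is why the median is often preferred for skewed variables such as price.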


In [61]:
# 4.6 Quartiles and Quantiles
quartiles = num_df.quantile([0.25, 0.50, 0.75])
print("\nQuartiles (25%, 50%, 75%):")
print(quartiles)

# Also compute 5th and 95th percentiles for broader view
percentiles = num_df.quantile([0.05, 0.95])
print("\n5th and 95th percentiles:")
print(percentiles)

# Interpretation:
# - Q1 (25%): about 25% of diamonds have values at or below this.
# - Q2 (50%): the median.
# - Q3 (75%): about 75% of diamonds have values at or below this.
# The interquartile range (IQR = Q3 - Q1) measures the spread of the middle 50% of the data.
# Percentiles describe the tails of the distribution (e.g., the 95th percentile is the cutoff for the 5% most expensive diamonds).


Quartiles (25%, 50%, 75%):
      carat   price
0.25   0.35  1625.0
0.50   0.62  4215.0
0.75   0.85  7446.0

5th and 95th percentiles:
      carat    price
0.05   0.20    919.0
0.95   1.01  10655.6
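
The IQR is often combined with the minimum and maximum to screen for outliers. The sketch below applies the common 1.5 x IQR rule of thumb (Tukey's fences) to the quartiles and extremes printed above. The rule is an assumption added here for illustration, not a step of the original analysis, and the numbers are copied by hand from the outputs.

```python
import pandas as pd

# Values copied from the quartile and min/max outputs of this notebook.
q1 = pd.Series({"carat": 0.35, "price": 1625.0})
q3 = pd.Series({"carat": 0.85, "price": 7446.0})
observed_min = pd.Series({"carat": 0.18, "price": 638.0})
observed_max = pd.Series({"carat": 1.1, "price": 16008.0})

iqr = q3 - q1                      # spread of the middle 50%
lower = q1 - 1.5 * iqr             # lower fence
upper = q3 + 1.5 * iqr             # upper fence

print(pd.DataFrame({"IQR": iqr, "lower": lower, "upper": upper}))
print("Any value below the lower fence:", bool((observed_min < lower).any()))
print("Any value above the upper fence:", bool((observed_max > upper).any()))
```

Because both the minimum and the maximum of each variable lie inside the fences, this rule flags no carat or price value as an outlier.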


In [62]:
# 4.7 Other Useful Statistics: Skewness and Kurtosis
skew_vals = num_df.skew()
kurt_vals = num_df.kurtosis()
print("\nSkewness:")
print(skew_vals)
print("\nKurtosis:")
print(kurt_vals)

# Interpretation:
# Skewness measures the asymmetry of a distribution.
#   - Skew > 0: right-skewed (long tail to the right, common for price).
#   - Skew < 0: left-skewed.
# Kurtosis measures tail heaviness, not "peakedness". pandas reports excess kurtosis
# (Fisher's definition), so a normal distribution scores 0.
#   - Kurt > 0: heavier tails than a normal distribution (more extreme values).
#   - Kurt < 0: lighter tails than a normal distribution.


Skewness:
carat    0.014802
price    0.657690
dtype: float64

Kurtosis:
carat   -1.24125
price   -0.32484
dtype: float64
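
To help read these signs, here is a small sketch on toy data (not from Diamond.csv) showing a symmetric sample, a right-skewed sample, and an evenly spread sample. The series names are hypothetical.

```python
import pandas as pd

# Toy data (assumption for illustration only).
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 2, 10])
flat = pd.Series(range(1, 101))   # evenly spread values, similar to a uniform distribution

print(symmetric.skew())       # zero: the data mirror around the mean
print(right_skewed.skew())    # positive: long tail to the right
print(flat.kurtosis())        # negative: lighter tails than a normal distribution
```

A uniform distribution has an excess kurtosis of -1.2, so the carat kurtosis of about -1.24 is consistent with carat values being spread fairly evenly over their range, although other shapes can produce a similar value.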


## **Descriptive Statistics for Categorical Variables**

In [63]:
# 5.1 Frequency Tables
for col in categorical_cols:
    freq_table = df[col].value_counts().sort_index()
    print(f"\nFrequency table for '{col}':")
    print(freq_table)
    print(f"Total: {freq_table.sum()}")

# Interpretation:
# A frequency table shows how many observations fall into each category.
# It reveals the most and least common categories (e.g., 'GIA' is the most common certification, with 151 of 308 diamonds).
# This helps understand the composition of the dataset and detect imbalanced categories.


Frequency table for 'colour':
D    16
E    44
F    82
G    65
H    61
I    40
Name: colour, dtype: int64
Total: 308

Frequency table for 'clarity':
IF      44
VS1     81
VS2     53
VVS1    52
VVS2    78
Name: clarity, dtype: int64
Total: 308

Frequency table for 'certification':
GIA    151
HRD     79
IGI     78
Name: certification, dtype: int64
Total: 308
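
Raw counts are easier to compare as shares of the total. The sketch below converts the certification counts printed above into proportions, first by direct division and then with `value_counts(normalize=True)` on labels rebuilt from those counts. The rebuilt `labels` series is a stand-in for `df['certification']`.

```python
import numpy as np
import pandas as pd

# Counts copied from the certification frequency table in this notebook.
cert_counts = pd.Series({"GIA": 151, "HRD": 79, "IGI": 78}, name="certification")

# Direct computation: each count divided by the total.
cert_share = cert_counts / cert_counts.sum()
print((cert_share * 100).round(1))

# Same result from raw labels, as one would do on the real column.
labels = pd.Series(np.repeat(cert_counts.index.to_numpy(), cert_counts.to_numpy()))
share_from_labels = labels.value_counts(normalize=True)
print(share_from_labels)
```

GIA accounts for roughly 49% of the diamonds: the largest group, but just short of a majority.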


## **Descriptive Statistics for Numerical and Categorical Variables**

In [65]:
# 6.1 Compute mode for numerical and categorical variables
mode_numerical = num_df.mode().iloc[0]  # Take first mode if multiple
mode_categorical = df[categorical_cols].mode().iloc[0]

print("\nMode (most frequent value) for numerical variables:")
print(mode_numerical)

print("\nMode for categorical variables:")
print(mode_categorical)

# Interpretation:
# The mode is the value that appears most frequently in a variable.

# For categorical data (e.g., colour, certification), the mode is the most common category
# (here "F" for colour and "GIA" for certification).
# For numerical data, the mode is less commonly used (especially with continuous values like price, where duplicates are rare),
# but it can still highlight common price points or carat sizes (e.g., the modal carat here is exactly 1.0, which may reflect a market preference for round sizes).
# A variable can be unimodal (one most frequent value), bimodal (two), or have no clear mode.
# pandas .mode() returns all modes if multiple exist; here we take the first for simplicity.


Mode (most frequent value) for numerical variables:
carat       1.0
price    5122.0
Name: 0, dtype: float64

Mode for categorical variables:
colour             F
clarity          VS1
certification    GIA
Name: 0, dtype: object
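
Taking `.iloc[0]` silently discards any tied modes. The toy example below (not from Diamond.csv) shows what `.mode()` returns when two values are equally frequent.

```python
import pandas as pd

# Toy data (assumption for illustration): 1 and 2 both appear twice.
tied = pd.Series([1, 1, 2, 2, 3])

all_modes = tied.mode()          # sorted Series containing every mode
first_mode = all_modes.iloc[0]   # the value kept when only the first mode is taken

print(all_modes.tolist())
print(first_mode)
```

On a DataFrame, `.mode()` pads columns that have fewer modes with NaN, so inspecting the full result before taking the first row shows whether ties were present.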


## **Bivariate Analysis: Two Numerical Variables**

In [66]:
# 7.1 Compute covariance matrix for numerical variables
cov_matrix = num_df.cov()
print("\nCovariance matrix:")
print(cov_matrix)

# Extract specific covariance (e.g., between carat and price)
cov_carat_price = cov_matrix.loc['carat', 'price']
print(f"\nCovariance between carat and price: {cov_carat_price:.2f}")


# Interpretation:
# Covariance measures how two numerical variables change together.

# A positive covariance (as for carat and price) means that when one variable is above its mean,
# the other tends to be above its mean too (they increase together).
# A negative covariance would indicate an inverse relationship.
# However, covariance is not standardized: its magnitude depends on the units of the variables
# (here, dollars times carats), so the value 891.15 is hard to interpret in isolation.
# That's why we often prefer correlation, which rescales covariance to the range -1 to +1.
# The diagonal of the covariance matrix holds each variable's variance (the covariance of a variable with itself).


Covariance matrix:
            carat         price
carat    0.076830  8.911474e+02
price  891.147376  1.158120e+07

Covariance between carat and price: 891.15
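
The unit dependence of covariance can be checked directly. This sketch uses toy data with hypothetical column names (`weight_g`, `weight_mg`), not the diamond data, and expresses the same weights in two units.

```python
import pandas as pd

# Toy data (assumption for illustration only).
toy = pd.DataFrame({
    "weight_g": [1.0, 2.0, 3.0, 4.0, 5.0],
    "price": [10.0, 22.0, 29.0, 41.0, 48.0],
})
toy["weight_mg"] = toy["weight_g"] * 1000   # same quantity, different unit

cov_g = toy["weight_g"].cov(toy["price"])
cov_mg = toy["weight_mg"].cov(toy["price"])
corr_g = toy["weight_g"].corr(toy["price"])
corr_mg = toy["weight_mg"].corr(toy["price"])

print(cov_g, cov_mg)     # covariance grows by the factor 1000
print(corr_g, corr_mg)   # correlation is unchanged
```

Changing the unit multiplies the covariance by the conversion factor but leaves the correlation untouched, which is why correlation is the easier number to compare across variables.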


In [67]:
# 7.2 Pairwise Correlation (e.g., carat vs price)
correlation_carat_price = df['carat'].corr(df['price'])
print(f"\nCorrelation between carat and price: {correlation_carat_price:.4f}")

# Interpretation:
# Correlation measures the strength and direction of a linear relationship (-1 to 1).
# A value close to +1 means as carat increases, price tends to increase linearly.
# A value near 0 suggests no linear relationship.
# Note: correlation does not imply causation, and it only captures linear patterns.


Correlation between carat and price: 0.9447
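
As a consistency check, the Pearson correlation can be recovered from quantities already computed: the covariance divided by the product of the two standard deviations. The numbers below are copied by hand from the earlier outputs, so the result is only accurate up to their rounding.

```python
# Values copied from the covariance and standard deviation outputs of this notebook.
cov_carat_price = 891.147376
std_carat = 0.277183
std_price = 3403.115715

# Pearson correlation: covariance divided by the product of the standard deviations.
r = cov_carat_price / (std_carat * std_price)
print(round(r, 4))
```

The result agrees with the value returned by `.corr()`, confirming that correlation is simply covariance rescaled by the spread of each variable.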


In [68]:
# EXTRA 8. Multivariate Analysis: Correlation Matrix (All Numerical Variables)
# Compute full correlation matrix
corr_matrix = num_df.corr()
print("\nCorrelation matrix:")
print(corr_matrix)

# Interpretation:
# The correlation matrix shows pairwise correlations between all numerical variables.
# Diagonal is always 1 (variable correlated with itself).
# High positive/negative off-diagonal values indicate strong linear relationships.
# This helps identify redundant variables or key predictors (e.g., carat and price correlate at about 0.94, so carat is a strong linear predictor of price).


Correlation matrix:
          carat     price
carat  1.000000  0.944727
price  0.944727  1.000000


In [69]:
# EXTRA 9. Bivariate Analysis: Two Categorical Variables

# Contingency Table (e.g., colour vs certification)
contingency_table = pd.crosstab(df['colour'], df['certification'])
print("\nContingency table (colour vs certification):")
print(contingency_table)

# Interpretation:
# A contingency table shows the joint frequency distribution of two categorical variables.
# It reveals associations: e.g., are certain colours more likely to be certified by GIA vs IGI?
# If the distribution of certification is similar across colours, the variables may be independent.
# Large differences suggest a relationship worth exploring further (e.g., with a chi-square test of independence).


Contingency table (colour vs certification):
certification  GIA  HRD  IGI
colour                      
D               10    2    4
E               25   10    9
F               40   19   23
G               26   17   22
H               29   22   10
I               21    9   10
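
To take the comparison one step further, the sketch below rebuilds the contingency table from the counts printed above and computes row percentages, the counts expected under independence, and the Pearson chi-square statistic. It uses only NumPy and pandas and stops short of a p-value; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Counts copied from the contingency table in this notebook.
observed = pd.DataFrame(
    {"GIA": [10, 25, 40, 26, 29, 21],
     "HRD": [2, 10, 19, 17, 22, 9],
     "IGI": [4, 9, 23, 22, 10, 10]},
    index=pd.Index(["D", "E", "F", "G", "H", "I"], name="colour"),
)

# Row percentages: the certification mix within each colour.
row_share = observed.div(observed.sum(axis=1), axis=0)
print((row_share * 100).round(1))

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
total = observed.to_numpy().sum()
expected = np.outer(row_totals, col_totals) / total

# Pearson chi-square statistic and its degrees of freedom.
chi2_stat = ((observed.to_numpy() - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(f"chi-square = {chi2_stat:.2f}, dof = {dof}")
```

Turning the statistic into a p-value requires the chi-square distribution, for example through `scipy.stats.chi2_contingency`, which is outside the libraries imported in this notebook. Some expected counts in the D row fall below 5, so the usual approximation should be read with caution there.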
