# 💎 Diamond Price Prediction - Complete Analysis

This notebook provides a comprehensive analysis of diamond price prediction using both traditional machine learning and deep learning approaches.

## Table of Contents
1. [Setup and Imports](#setup)
2. [Data Loading and Exploration](#data)
3. [Traditional Machine Learning Models](#ml)
4. [Deep Learning Models](#dl)
5. [Model Comparison](#comparison)
6. [Feature Importance Analysis](#features)
7. [Price Prediction Examples](#prediction)
8. Interactive Prediction Widget
9. Model Saving and Loading
10. [Conclusions](#conclusions)

## 1. Setup and Imports {#setup}

In [None]:
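# Install required packages if not already installed
import importlib
import subprocess
import sys

# Map pip package names to import names (they differ for scikit-learn)
packages = {
    'pandas': 'pandas', 'numpy': 'numpy', 'scikit-learn': 'sklearn',
    'tensorflow': 'tensorflow', 'xgboost': 'xgboost', 'matplotlib': 'matplotlib',
    'seaborn': 'seaborn', 'plotly': 'plotly', 'optuna': 'optuna', 'shap': 'shap',
    'ipywidgets': 'ipywidgets',  # used by the interactive prediction widget
}

def install_package(pip_name, import_name):
    # Try the import first; install with pip only if the module is missing
    try:
        importlib.import_module(import_name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])

for pip_name, import_name in packages.items():
    install_package(pip_name, import_name)

print("All required packages are available.")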

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Import our custom modules (the relative path assumes the notebook runs from the project root)
import sys
sys.path.append('src')

from diamond_predictor import DiamondPricePredictor

# Set up plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)

print("Setup completed successfully!")

## 2. Data Loading and Exploration {#data}

In [None]:
# Initialize the predictor
predictor = DiamondPricePredictor()

# Load and prepare data
print("Loading diamond dataset...")
data = predictor.load_and_prepare_data(file_path=None, explore=True, visualize=True)

print(f"\nDataset shape: {data.shape}")
print(f"Columns: {list(data.columns)}")

In [None]:
# Display first few rows
print("First 10 rows of the dataset:")
data.head(10)

In [None]:
# Interactive price distribution
fig = px.histogram(data, x='price', nbins=50, title='Diamond Price Distribution')
fig.update_layout(xaxis_title='Price ($)', yaxis_title='Frequency')
fig.show()

# Price statistics
print("Price Statistics:")
print(f"Mean: ${data['price'].mean():.2f}")
print(f"Median: ${data['price'].median():.2f}")
print(f"Std: ${data['price'].std():.2f}")
print(f"Min: ${data['price'].min():.2f}")
print(f"Max: ${data['price'].max():.2f}")

In [None]:
# Interactive scatter plot: Carat vs Price
fig = px.scatter(data, x='carat', y='price', color='cut', 
                title='Diamond Price vs Carat (colored by Cut)',
                hover_data=['color', 'clarity'])
fig.update_layout(xaxis_title='Carat', yaxis_title='Price ($)')
fig.show()

In [None]:
# Categorical feature distributions
fig = make_subplots(rows=2, cols=2, 
                    subplot_titles=['Cut Distribution', 'Color Distribution', 
                                   'Clarity Distribution', 'Price by Cut'],
                    specs=[[{"type": "bar"}, {"type": "bar"}],
                           [{"type": "bar"}, {"type": "box"}]])

# Cut distribution
cut_counts = data['cut'].value_counts()
fig.add_trace(go.Bar(x=cut_counts.index, y=cut_counts.values, name='Cut'), row=1, col=1)

# Color distribution
color_counts = data['color'].value_counts()
fig.add_trace(go.Bar(x=color_counts.index, y=color_counts.values, name='Color'), row=1, col=2)

# Clarity distribution
clarity_counts = data['clarity'].value_counts()
fig.add_trace(go.Bar(x=clarity_counts.index, y=clarity_counts.values, name='Clarity'), row=2, col=1)

# Price by cut (box plot)
for cut_type in data['cut'].unique():
    cut_data = data[data['cut'] == cut_type]['price']
    fig.add_trace(go.Box(y=cut_data, name=cut_type), row=2, col=2)

fig.update_layout(height=800, showlegend=False, title_text="Diamond Feature Distributions")
fig.show()
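
The three categorical features plotted above are ordered quality grades, so they need a numeric representation before most models can use them. How `DiamondPricePredictor` encodes them is not shown in this notebook. The sketch below is one common option (ordinal encoding); the grade orderings mirror the options used by the prediction widget later in this notebook, and the helper name `encode_grades` is hypothetical.

```python
# Hypothetical ordinal encoding for the graded categorical features.
# Grade lists are ordered worst -> best; they are an assumption for
# illustration, not the predictor's actual preprocessing.
CUT_ORDER = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
COLOR_ORDER = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
CLARITY_ORDER = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF', 'FL']

GRADE_ORDERS = {'cut': CUT_ORDER, 'color': COLOR_ORDER, 'clarity': CLARITY_ORDER}

def encode_grades(diamond):
    """Return a copy of `diamond` with each grade replaced by its rank (0 = worst)."""
    encoded = dict(diamond)
    for feature, order in GRADE_ORDERS.items():
        # list.index raises ValueError for a grade that is not in the ordering
        encoded[feature] = order.index(diamond[feature])
    return encoded

example = {'carat': 1.0, 'cut': 'Good', 'color': 'G', 'clarity': 'SI1'}
encoded_example = encode_grades(example)
print(encoded_example)
```

Ordinal codes keep the ranking information that one-hot encoding discards, which tends to suit tree-based models; linear models may still prefer one-hot columns.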

## 3. Traditional Machine Learning Models {#ml}

In [None]:
# Train traditional ML models
print("Training Traditional Machine Learning Models...")
ml_results = predictor.train_traditional_ml_models(optimize_best=True)

# Display results summary
print(predictor.ml_models.get_model_summary())

In [None]:
# Create ML results DataFrame for visualization
ml_summary = []
for name, results in ml_results.items():
    ml_summary.append({
        'Model': name,
        'R² Score': results['test_r2'],
        'RMSE': results['test_rmse'],
        'MAE': results['test_mae']
    })

ml_df = pd.DataFrame(ml_summary).sort_values('R² Score', ascending=False)
print("Traditional ML Model Performance:")
ml_df
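
The `test_r2`, `test_rmse`, and `test_mae` keys are assumed here to hold the standard regression metrics computed on the held-out test set. As a reminder of what those numbers mean, here is a dependency-free sketch on made-up prices (the function name and sample values are illustrative only):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute R², RMSE and MAE for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        'r2': 1 - ss_res / ss_tot,
        'rmse': math.sqrt(ss_res / n),
        'mae': sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }

# Made-up prices in dollars, for illustration only
actual = [1000, 2000, 3000, 4000]
predicted = [1100, 1900, 3200, 3800]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE and MAE are in the same unit as the target (dollars), while R² is unitless, which is why the comparison charts plot them on separate axes.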

In [None]:
# Interactive ML model comparison
fig = make_subplots(rows=1, cols=2, 
                    subplot_titles=['R² Score Comparison', 'RMSE Comparison'])

# R² Score
fig.add_trace(go.Bar(x=ml_df['Model'], y=ml_df['R² Score'], 
                    name='R² Score', marker_color='lightblue'), row=1, col=1)

# RMSE
fig.add_trace(go.Bar(x=ml_df['Model'], y=ml_df['RMSE'], 
                    name='RMSE', marker_color='lightcoral'), row=1, col=2)

fig.update_layout(height=500, showlegend=False, 
                 title_text="Traditional ML Model Performance")
fig.update_xaxes(tickangle=45)
fig.show()

## 4. Deep Learning Models {#dl}

In [None]:
# Train deep learning models
print("Training Deep Learning Models...")
dl_results = predictor.train_deep_learning_models(epochs=50, batch_size=32)

# Display results summary
print(predictor.dl_models.get_model_summary())

In [None]:
# Create DL results DataFrame for visualization
dl_summary = []
for name, results in dl_results.items():
    dl_summary.append({
        'Model': name,
        'R² Score': results['test_r2'],
        'RMSE': results['test_rmse'],
        'MAE': results['test_mae'],
        'Epochs': results['epochs_trained']
    })

dl_df = pd.DataFrame(dl_summary).sort_values('R² Score', ascending=False)
print("Deep Learning Model Performance:")
dl_df

In [None]:
# Interactive DL model comparison
fig = make_subplots(rows=2, cols=2, 
                    subplot_titles=['R² Score', 'RMSE', 'MAE', 'Training Epochs'])

# R² Score
fig.add_trace(go.Bar(x=dl_df['Model'], y=dl_df['R² Score'], 
                    name='R² Score', marker_color='lightgreen'), row=1, col=1)

# RMSE
fig.add_trace(go.Bar(x=dl_df['Model'], y=dl_df['RMSE'], 
                    name='RMSE', marker_color='lightcoral'), row=1, col=2)

# MAE
fig.add_trace(go.Bar(x=dl_df['Model'], y=dl_df['MAE'], 
                    name='MAE', marker_color='lightyellow'), row=2, col=1)

# Epochs
fig.add_trace(go.Bar(x=dl_df['Model'], y=dl_df['Epochs'], 
                    name='Epochs', marker_color='lightpink'), row=2, col=2)

fig.update_layout(height=800, showlegend=False, 
                 title_text="Deep Learning Model Performance")
fig.update_xaxes(tickangle=45)
fig.show()

## 5. Model Comparison {#comparison}

In [None]:
# Compare all models
print("Comparing All Models...")
comparison_results = predictor.compare_all_models()

print("\nTop 10 Models:")
comparison_results.head(10)

In [None]:
# Interactive comprehensive comparison
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=['R² Score by Model Type', 'RMSE Distribution', 
                                   'Top 10 Models', 'Performance Scatter'],
                    specs=[[{"type": "box"}, {"type": "histogram"}],
                           [{"type": "bar"}, {"type": "scatter"}]])

# Box plot by model type
for model_type in comparison_results['Type'].unique():
    type_data = comparison_results[comparison_results['Type'] == model_type]
    fig.add_trace(go.Box(y=type_data['R² Score'], name=model_type), row=1, col=1)

# RMSE histogram
fig.add_trace(go.Histogram(x=comparison_results['RMSE'], nbinsx=20, name='RMSE'), row=1, col=2)

# Top 10 models
top_10 = comparison_results.head(10)
colors = ['gold' if i == 0 else 'silver' if i == 1 else '#CD7F32' if i == 2 else 'lightblue' 
          for i in range(len(top_10))]
fig.add_trace(go.Bar(x=top_10['R² Score'], y=top_10['Model'], 
                    orientation='h', marker_color=colors, name='Top 10'), row=2, col=1)

# Performance scatter
ml_data = comparison_results[comparison_results['Type'] == 'Traditional ML']
dl_data = comparison_results[comparison_results['Type'] == 'Deep Learning']

fig.add_trace(go.Scatter(x=ml_data['R² Score'], y=ml_data['RMSE'], 
                        mode='markers', name='Traditional ML', 
                        marker=dict(size=10, color='blue')), row=2, col=2)
fig.add_trace(go.Scatter(x=dl_data['R² Score'], y=dl_data['RMSE'], 
                        mode='markers', name='Deep Learning', 
                        marker=dict(size=10, color='red')), row=2, col=2)

fig.update_layout(height=800, title_text="Comprehensive Model Comparison")
fig.show()

In [None]:
# Performance statistics by model type
type_stats = comparison_results.groupby('Type').agg({
    'R² Score': ['mean', 'std', 'min', 'max'],
    'RMSE': ['mean', 'std', 'min', 'max'],
    'MAE': ['mean', 'std', 'min', 'max']
}).round(4)

print("Performance Statistics by Model Type:")
type_stats

## 6. Feature Importance Analysis {#features}

In [None]:
# Feature importance analysis
print("Analyzing Feature Importance...")
importance_df = predictor.get_feature_importance_analysis()

if importance_df is not None:
    print("\nFeature Importance Rankings:")
    # Expressions inside an if block are not auto-displayed, so print explicitly
    print(importance_df['Mean'].sort_values(ascending=False))

In [None]:
# Interactive feature importance visualization
if importance_df is not None:
    fig = px.bar(x=importance_df['Mean'].values, 
                y=importance_df.index,
                orientation='h',
                title='Feature Importance Analysis',
                labels={'x': 'Importance', 'y': 'Features'})
    fig.update_layout(yaxis={'categoryorder': 'total ascending'})
    fig.show()
    
    # Feature importance by model
    importance_models = importance_df.drop('Mean', axis=1)
    fig2 = px.imshow(importance_models.T, 
                    title='Feature Importance Heatmap by Model',
                    labels={'x': 'Features', 'y': 'Models', 'color': 'Importance'})
    fig2.show()
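
The heatmap compares per-model importances, which are only comparable if they share a scale. How `get_feature_importance_analysis` builds its `Mean` column is not shown here. One plausible approach, sketched below with invented numbers and hypothetical names, is to normalize each model's importances to sum to 1 before averaging.

```python
def mean_normalized_importance(importances_by_model):
    """Average feature importances across models after scaling each model to sum to 1."""
    normalized = {}
    for model, scores in importances_by_model.items():
        total = sum(scores.values())
        normalized[model] = {feature: value / total for feature, value in scores.items()}

    # Assumes every model reports the same set of features
    features = next(iter(normalized.values())).keys()
    n_models = len(normalized)
    return {
        feature: sum(model_scores[feature] for model_scores in normalized.values()) / n_models
        for feature in features
    }

# Invented raw importances on different scales (not results from this notebook)
raw = {
    'model_a': {'carat': 60.0, 'clarity': 30.0, 'cut': 10.0},
    'model_b': {'carat': 0.8, 'clarity': 0.1, 'cut': 0.1},
}
mean_importance = mean_normalized_importance(raw)
for feature, score in sorted(mean_importance.items(), key=lambda item: item[1], reverse=True):
    print(f"{feature}: {score:.3f}")
```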

## 7. Price Prediction Examples {#prediction}

In [None]:
# Example diamond predictions
print("Diamond Price Prediction Examples")
print("=" * 50)

# Example 1: High-quality diamond
diamond1 = {
    'carat': 1.5,
    'cut': 'Ideal',
    'color': 'D',
    'clarity': 'VVS1',
    'depth': 61.5,
    'table': 57.0,
    'x': 7.3,
    'y': 7.3,
    'z': 4.5
}

print("\nExample 1: High-Quality Diamond")
predictions1 = predictor.generate_prediction_report(diamond1)

# Example 2: Medium-quality diamond
diamond2 = {
    'carat': 1.0,
    'cut': 'Good',
    'color': 'G',
    'clarity': 'SI1',
    'depth': 62.0,
    'table': 58.0,
    'x': 6.2,
    'y': 6.2,
    'z': 3.8
}

print("\nExample 2: Medium-Quality Diamond")
predictions2 = predictor.generate_prediction_report(diamond2)

# Example 3: Lower-quality diamond
diamond3 = {
    'carat': 0.5,
    'cut': 'Fair',
    'color': 'J',
    'clarity': 'SI2',
    'depth': 64.0,
    'table': 60.0,
    'x': 5.0,
    'y': 5.0,
    'z': 3.2
}

print("\nExample 3: Lower-Quality Diamond")
predictions3 = predictor.generate_prediction_report(diamond3)

In [None]:
# Visualize prediction examples
examples = ['High Quality', 'Medium Quality', 'Lower Quality']
all_predictions = [predictions1, predictions2, predictions3]

# Create comparison chart
fig = go.Figure()

model_types = list(predictions1.keys())
colors = ['gold', 'silver', 'lightblue', 'lightgreen']

for i, model_type in enumerate(model_types):
    values = []
    for preds in all_predictions:
        pred_value = preds[model_type]
        if isinstance(pred_value, np.ndarray):
            pred_value = pred_value[0]
        values.append(pred_value)
    
    fig.add_trace(go.Bar(
        name=model_type,
        x=examples,
        y=values,
        marker_color=colors[i % len(colors)]
    ))

fig.update_layout(
    title='Diamond Price Predictions by Quality Level',
    xaxis_title='Diamond Quality',
    yaxis_title='Predicted Price ($)',
    barmode='group'
)

fig.show()

## 8. Interactive Prediction Widget

In [None]:
# Interactive prediction widget
from ipywidgets import interact, FloatSlider, Dropdown

def predict_diamond_price(carat, cut, color, clarity, depth, table, x, y, z):
    diamond_features = {
        'carat': carat,
        'cut': cut,
        'color': color,
        'clarity': clarity,
        'depth': depth,
        'table': table,
        'x': x,
        'y': y,
        'z': z
    }
    
    try:
        prediction = predictor.predict_price(diamond_features, 'best')
        if isinstance(prediction, np.ndarray):
            prediction = prediction[0]
        
        print(f"\n💎 Diamond Characteristics:")
        for key, value in diamond_features.items():
            print(f"  {key.title()}: {value}")
        
        print(f"\n💰 Predicted Price: ${prediction:,.2f}")
        
        # Price category
        if prediction < 1000:
            category = "Budget-Friendly"
        elif prediction < 5000:
            category = "Mid-Range"
        elif prediction < 15000:
            category = "Premium"
        else:
            category = "Luxury"
        
        print(f"📊 Price Category: {category}")
        
    except Exception as e:
        print(f"Error making prediction: {e}")

# Create interactive widget
interact(predict_diamond_price,
         carat=FloatSlider(min=0.2, max=5.0, step=0.1, value=1.0, description='Carat:'),
         cut=Dropdown(options=['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], 
                     value='Good', description='Cut:'),
         color=Dropdown(options=['D', 'E', 'F', 'G', 'H', 'I', 'J'], 
                       value='G', description='Color:'),
         clarity=Dropdown(options=['FL', 'IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1'], 
                         value='VS1', description='Clarity:'),
         depth=FloatSlider(min=55.0, max=70.0, step=0.1, value=61.5, description='Depth (%):'),
         table=FloatSlider(min=50.0, max=70.0, step=0.1, value=57.0, description='Table (%):'),
         x=FloatSlider(min=3.0, max=12.0, step=0.1, value=6.0, description='Length (mm):'),
         y=FloatSlider(min=3.0, max=12.0, step=0.1, value=6.0, description='Width (mm):'),
         z=FloatSlider(min=1.5, max=8.0, step=0.1, value=3.7, description='Height (mm):'))
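
The sliders above let `depth` vary independently of `x`, `y`, and `z`, although in the ggplot2 `diamonds` dataset total depth percentage is defined as `2 * z / (x + y) * 100`. Assuming the data used here follows the same definition, a small check like the following can warn about physically inconsistent inputs before predicting. The function names and the 1-point tolerance are arbitrary choices for illustration.

```python
def depth_from_dimensions(x, y, z):
    """Total depth percentage implied by the dimensions: z / mean(x, y) * 100."""
    return 2 * z / (x + y) * 100

def is_depth_consistent(depth, x, y, z, tolerance=1.0):
    """True if the stated depth is within `tolerance` percentage points of the implied one."""
    return abs(depth - depth_from_dimensions(x, y, z)) <= tolerance

# Example 3 from the prediction section: depth=64.0, x=y=5.0, z=3.2
print(is_depth_consistent(64.0, 5.0, 5.0, 3.2))
# Widget defaults: depth=61.5, x=y=6.0, z=3.7
print(is_depth_consistent(61.5, 6.0, 6.0, 3.7))
# A deliberately inconsistent combination
print(is_depth_consistent(70.0, 6.0, 6.0, 3.7))
```

Predictions for inconsistent combinations are extrapolations outside anything the model saw in training, so they should be read with extra caution.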

## 9. Model Saving and Loading

In [None]:
# Save all trained models
print("Saving all trained models...")
predictor.save_all_models(directory="saved_models")
print("Models saved successfully!")

In [None]:
# Example of loading models (for future use)
# new_predictor = DiamondPricePredictor()
# new_predictor.load_models(directory="saved_models")
# print("Models loaded successfully!")

## 10. Conclusions {#conclusions}

### Key Findings:

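1. **Model Performance**: 
   - Traditional ML and deep learning models are evaluated on the same test metrics (R², RMSE, MAE); the `comparison_results` table in Section 5 holds the actual scores for a given run
   - Tree-based ensembles such as XGBoost and Random Forest are typically the strongest traditional ML models on tabular data like this
   - Neural networks can be competitive when their architecture and training settings are tuned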

2. **Feature Importance**:
   - Carat weight is typically the most important feature
   - Cut, color, and clarity significantly impact price
   - Diamond dimensions (x, y, z) provide additional predictive power

3. **Model Comparison**:
   - Traditional ML models train faster and are more interpretable
   - Deep learning models can capture complex non-linear relationships
   - Ensemble approaches combining multiple models often yield best results

4. **Practical Applications**:
   - The models can be used for diamond price estimation
   - Useful for jewelry retailers, appraisers, and consumers
   - Can help identify overpriced or underpriced diamonds
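
To make the last point concrete: a listing can be flagged by comparing its asking price with the model's estimate. The sketch below is an illustration under stated assumptions; the 15% band and the function name are arbitrary, and in practice the band should reflect the model's observed error (for example its test MAE).

```python
def classify_listing(asking_price, predicted_price, band=0.15):
    """Label a listing relative to the predicted price, using a symmetric relative band."""
    relative_gap = (asking_price - predicted_price) / predicted_price
    if relative_gap > band:
        return 'overpriced'
    if relative_gap < -band:
        return 'underpriced'
    return 'fair'

# Made-up asking prices against a made-up model estimate of $5,000
labels = {asking: classify_listing(asking, predicted_price=5000) for asking in (4000, 5200, 6500)}
for asking, label in labels.items():
    print(asking, label)
```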

### Recommendations:

1. **For Production Use**:
   - Use ensemble methods for best accuracy
   - Implement model monitoring and retraining
   - Consider market conditions and temporal factors

2. **For Further Improvement**:
   - Collect more diverse and recent diamond data
   - Include additional features (certification, fluorescence, etc.)
   - Implement advanced ensemble techniques (for example stacking or blending)

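3. **Model Selection**:
   - Use XGBoost or Random Forest when interpretability and fast training matter
   - Use neural networks only if they beat the tree-based models in the Section 5 comparison
   - Use an ensemble for production systems where accuracy matters most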
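
The ensemble recommendation can be as simple as averaging the predictions of several trained models. A minimal sketch, assuming each model's prediction for one diamond is already available as a number (the model names, prices, and weights below are invented):

```python
def weighted_ensemble(predictions, weights=None):
    """Combine per-model price predictions into one estimate via a weighted average."""
    if weights is None:
        weights = {name: 1.0 for name in predictions}  # equal weights by default
    total_weight = sum(weights[name] for name in predictions)
    return sum(predictions[name] * weights[name] for name in predictions) / total_weight

# Invented predictions for a single diamond, in dollars
preds = {'xgboost': 5100.0, 'random_forest': 5000.0, 'neural_net': 5500.0}
equal_estimate = weighted_ensemble(preds)
weighted_estimate = weighted_ensemble(
    preds, weights={'xgboost': 2.0, 'random_forest': 1.0, 'neural_net': 1.0}
)
print(equal_estimate, weighted_estimate)
```

Weights could be derived from validation error so that stronger models count more; equal weights are a reasonable starting point when the models perform similarly.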

This analysis compares traditional machine learning and deep learning approaches for diamond price prediction on the same data and metrics, which gives a baseline for choosing a model in real-world applications.