## Module 9 Lab 2 Assignment: Analyzing Diamond Prices with Machine Learning

In this assignment, you will explore a dataset of diamond prices and apply machine learning techniques to predict the prices. The code provided demonstrates various steps of the process, including data loading, visualization, model training, and evaluation. Your task is to understand the code, provide explanations for each step, and answer related questions.

Tip: Use Google Colab to view histograms


Instructions:
1. Code Analysis:
    - a. Carefully analyze the provided code and comments.
    - b. Write a brief explanation for each code block and its purpose.

2. Questions:
   Answer the following questions based on your analysis of the code and the concepts you've learned:

   - a. What is the dataset used in this code, and where is it obtained from?
   - b. What type of machine learning problem is being addressed in this code (regression or classification)?
   - c. How is the dataset visualized in the code, and what insights can be gained from the visualizations?
   - d. What is the purpose of the `setup` function from the `pycaret.regression` module?
   - e. What does the `compare_models` function do, and why is it used?
   - f. What is the meaning of the term "residuals" in the context of machine learning?
   - g. How are the predictions made for the unseen data, and what does the variable `predictions` contain?

3. Extend the Analysis:
   Modify the code to use a different regression model (not necessarily the best model). Evaluate its performance and compare it to the best model.

Research and refer to the PyCaret documentation (https://pycaret.org/) for better understanding and completing this assignment. Feel free to consult any additional resources as needed.

Please submit your analysis, answers to the questions, and the modified code as a .ipynb notebook to Canvas. 

In [79]:
!pip3 install pycaret
!pip3 install mlflow

# Installs PyCaret and MLflow.

Collecting pycaret
  Using cached pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)
Collecting pyod>=1.1.3 (from pycaret)
  Using cached pyod-2.0.5-py3-none-any.whl.metadata (46 kB)
Collecting category-encoders>=2.4.0 (from pycaret)
  Using cached category_encoders-2.8.1-py3-none-any.whl.metadata (7.9 kB)
Collecting lightgbm>=3.0.0 (from pycaret)
  Using cached lightgbm-4.6.0-py3-none-win_amd64.whl.metadata (17 kB)
Collecting matplotlib<3.8.0 (from pycaret)
  Using cached matplotlib-3.7.5-cp312-cp312-win_amd64.whl.metadata (5.8 kB)
Collecting scikit-plot>=0.3.7 (from pycaret)
  Using cached scikit_plot-0.3.7-py3-none-any.whl.metadata (7.1 kB)
Collecting plotly-resampler>=0.8.3.1 (from pycaret)
  Using cached plotly_resampler-0.10.0-py3-none-any.whl.metadata (13 kB)
Collecting sktime==0.26.0 (from pycaret)
  Using cached sktime-0.26.0-py3-none-any.whl.metadata (29 kB)
Collecting tbats>=1.1.3 (from pycaret)
  Using cached tbats-1.1.3-py3-none-any.whl.metadata (3.8 kB)
Collecting pmdarima>=

error: uninstall-no-record-file

Cannot uninstall matplotlib 3.10.3

The package's contents are unknown: no RECORD file was found for matplotlib.

hint: You might be able to recover from this via: pip install --force-reinstall --no-deps matplotlib==3.10.3

In [80]:
import pandas as pd
import plotly.express as px
import numpy as np
import os 

import matplotlib
import matplotlib.pyplot as plt

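from pycaret.regression import *
from pycaret.datasets import get_data
from sklearn.datasets import fetch_openml

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Imports data-handling (pandas, numpy) and visualization (plotly, matplotlib) libraries,
# applies an SSL workaround for dataset downloads (it disables certificate verification,
# so use it only in a trusted environment), and loads PyCaret's regression tools.
# pycaret.classification is not imported: its wildcard import would overwrite the
# regression versions of setup, compare_models, and predict_model.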

ModuleNotFoundError: No module named 'pycaret'

In [None]:
from pycaret.datasets import get_data
data = get_data('diamond')

# Loads the "diamond" dataset with PyCaret's get_data, which downloads it from PyCaret's online dataset repository.

In [None]:
fig = px.scatter(x=data['Carat Weight'], y=data['Price'], facet_col = data['Cut'], opacity = 0.25, trendline='ols', trendline_color_override = 'red')
fig.show()

# Visualizes the relationship between carat and price for each cut category, overlaying an OLS trend line.

In [None]:
fig = px.histogram(data, x=["Price"])
fig.show()

# Shows the distribution of raw diamond prices.

In [None]:
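data_log = data.assign(logged_Price=np.log(data['Price']))
fig = px.histogram(data_log, x=["logged_Price"])
fig.show()

# Applies a log transformation to Price to reduce skewness, producing a more symmetric distribution.
# The logged column is stored in a copy (data_log) so that it cannot leak into the features when `data` is passed to setup.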

In [None]:
from pycaret.regression import *
s = setup(data, target = 'Price', transform_target = True, experiment_name = 'diamond')

# Initializes the PyCaret regression environment:
# it splits the data into train and test sets, encodes categorical features, imputes missing values,
# and applies a power transformation to the target (transform_target = True).


In [None]:
best = compare_models(sort='MAE')

# Trains the available regression models with cross-validation, ranks them by MAE, and returns the top performer.

In [None]:
best.get_params()

# Shows the hyperparameters of the selected model. compare_models does not tune them, so these are the library defaults.

In [None]:
data_unseen = data.copy()
data_unseen.drop('Price', axis = 1, inplace = True)
predictions = predict_model(best, data = data_unseen)

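# Uses the best model to generate price predictions on the full dataset with the target column removed.
# `predictions` holds the input features plus a new column of predicted prices (named prediction_label in PyCaret 3.x).
# Note that most of these rows were also used for training, so this is not a true hold-out evaluation.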


# **Answer the following questions based on your analysis of the code and the concepts you've learned:**

**a. What is the dataset used in this code, and where is it obtained from?**

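*This is the diamond dataset from PyCaret's collection of sample datasets. The call get_data('diamond') downloads it from PyCaret's online dataset repository.*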

**b. What type of machine learning problem is being addressed in this code (regression or classification)?**

*This is a regression task—predicting a continuous target (Price)*

**c. How is the dataset visualized in the code, and what insights can be gained from the visualizations?**

*The dataset is visualized through a scatter plot and two histograms. The scatter plot of carat weight vs. price, faceted by cut, shows a roughly linear upward trend whose slope differs between cuts. The histograms of raw and log-transformed prices show that raw prices are heavily right-skewed and that a log transform makes the distribution more symmetric, which generally helps regression models.*
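
The effect of the log transform on skewness can be checked numerically. The following is a minimal sketch that uses synthetic log-normal "prices" rather than the diamond data; the `skewness` helper and the distribution parameters are assumptions chosen only for illustration.

```python
import math
import random


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


# Synthetic right-skewed "prices" (log-normal), NOT the diamond dataset.
rng = random.Random(0)
prices = [rng.lognormvariate(9.0, 0.7) for _ in range(5000)]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)

print(f"skewness of raw prices: {raw_skew:.2f}")   # clearly positive (long right tail)
print(f"skewness of log prices: {log_skew:.2f}")   # close to zero (roughly symmetric)
```

A skewness near zero indicates a roughly symmetric distribution, which is what the second histogram shows visually.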

**d. What is the purpose of the setup function from the pycaret.regression module?**

*The setup function prepares the dataset and the ML pipeline (train/test split, encoding, imputation, optional feature engineering, and target transformation) so that downstream PyCaret functions such as compare_models can run without further configuration.*

**e. What does the compare_models function do, and why is it used?**

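*The compare_models function trains and cross-validates many regression algorithms with their default hyperparameters, ranks them by the chosen metric (MAE here), and returns the best one. It is used to find a strong candidate model quickly, without training and scoring each algorithm by hand.*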

**f. What is the meaning of the term "residuals" in the context of machine learning?**

*Residuals are the differences between the actual target values and the model's predictions (actual minus predicted). Each residual measures the error on one data point.*
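
As a small worked example, residuals and two common error summaries can be computed as follows. The prices are made-up numbers for illustration, not values from the diamond dataset or from any model in this notebook.

```python
# Hypothetical true prices and model predictions (illustrative values only).
actual = [5200.0, 3100.0, 12750.0, 8400.0]
predicted = [5000.0, 3400.0, 12000.0, 8400.0]

# Residual = actual - predicted, one value per data point.
residuals = [a - p for a, p in zip(actual, predicted)]

# Mean absolute error: the average size of the residuals.
mae = sum(abs(r) for r in residuals) / len(residuals)

# Root mean squared error: penalizes large residuals more heavily.
rmse = (sum(r ** 2 for r in residuals) / len(residuals)) ** 0.5

print(residuals)  # [200.0, -300.0, 750.0, 0.0]
print(mae)        # 312.5
print(round(rmse, 2))
```

A positive residual means the model under-predicted that point, and a negative residual means it over-predicted.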

**g. How are the predictions made for the unseen data, and what does the variable predictions contain?**

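*After the true Price column is dropped, predict_model(best, data = data_unseen) applies the trained pipeline to the remaining features. The resulting predictions DataFrame contains the original features plus a column of predicted prices. Because data_unseen is a copy of the full dataset, most of its rows were already used for training, so it is not truly unseen data.*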



# Extended Analysis

**Modify the code to use a different regression model (not necessarily the best model).**

In [None]:
!pip3 install pycaret
!pip3 install mlflow

import pandas as pd
import plotly.express as px
import numpy as np
import os 
import matplotlib.pyplot as plt
import ssl

# SSL workaround for dataset downloads (disables certificate verification)
ssl._create_default_https_context = ssl._create_unverified_context

from pycaret.regression import *
from pycaret.datasets import get_data

# 1. Load the diamond dataset
data = get_data('diamond')

# 2. Initial exploratory visualizations
#    a) Carat vs. Price by Cut
fig = px.scatter(
    x=data['Carat Weight'],
    y=data['Price'],
    facet_col=data['Cut'],
    opacity=0.25,
    trendline='ols',
    trendline_color_override='red'
)
fig.show()

#    b) Distribution of raw Price
fig = px.histogram(data, x='Price')
fig.show()

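#    c) Distribution of log-transformed Price (computed on a copy so the
#       logged column cannot leak into the model's features)
data_log = data.assign(logged_Price=np.log(data['Price']))
fig = px.histogram(data_log, x='logged_Price')
fig.show()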

# 3. PyCaret setup for regression (transform_target=True applies a power transform to the target internally)
s = setup(
    data,
    target='Price',
    transform_target=True,
    experiment_name='diamond_rf'
)

# 4. Create an alternative regression model: Random Forest
rf = create_model('rf')   # RandomForestRegressor with default hyperparameters

# 5. Review the model's hyperparameters
print("Random Forest Parameters:")
print(rf.get_params())

# 6. Generate predictions on the full dataset (as “unseen” data)
data_unseen = data.copy()
data_unseen.drop('Price', axis=1, inplace=True)
predictions = predict_model(rf, data=data_unseen)

# 7. Display the first few predictions
print(predictions.head())


**Evaluate its performance and compare it to the best model**

In [None]:
from pycaret.regression import pull

# Rerun compare_models to capture the leaderboard of cross-validated results
best = compare_models(sort='MAE')
baseline_results = pull()       # Leaderboard DataFrame: one row per model, sorted by MAE
baseline_mae = baseline_results.iloc[0]['MAE']   # MAE of the top-ranked model

# Create and evaluate Random Forest (as before)
rf = create_model('rf')
rf_results = pull()             # Fold-by-fold CV metrics, plus 'Mean' and 'Std' rows
rf_mae = rf_results.loc['Mean', 'MAE']

# Print comparison
print(f"Baseline best model MAE: {baseline_mae:.4f}")
print(f"Random Forest MAE:        {rf_mae:.4f}")

# Optionally visualize the comparison
import matplotlib.pyplot as plt

models = ['Baseline', 'RandomForest']
mae_vals = [baseline_mae, rf_mae]

plt.figure(figsize=(6,4))
plt.bar(models, mae_vals, color=['skyblue','salmon'])
plt.ylabel('Mean Absolute Error')
plt.title('MAE Comparison: Baseline vs. Random Forest')
for i, v in enumerate(mae_vals):
    plt.text(i, v + 0.01, f"{v:.3f}", ha='center')
plt.tight_layout()
plt.show()
