# Machine Learning with the Diamonds Dataset

This notebook explores the **diamonds dataset** using `pandas` and `seaborn`.
We will conduct **exploratory data analysis (EDA)**, select relevant features, and build a regression model to **predict diamond prices**.

## Importing Required Libraries

We import the necessary libraries:
- `pandas` for handling structured data.
- `seaborn` for visualisation.
- `matplotlib.pyplot` for plotting.
- `sklearn.linear_model` for linear regression.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

plt.style.use('fivethirtyeight')

## Loading and Exploring the Data

We load the diamonds dataset from `seaborn`, which contains information about diamonds, including:
- **Carat**: Weight of the diamond.
- **Cut**: Quality of the cut.
- **Colour**: Diamond colour, from J (worst) to D (best).
- **Clarity**: Measurement of how clear the diamond is.
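- **Depth & Table**: Proportions expressed as percentages. Depth is the total depth relative to the average of length and width; table is the width of the top facet relative to the widest point.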
- **Price**: Price of the diamond in US dollars.
- **Dimensions (x, y, z)**: length, width and depth in mm.

In [None]:
df = sns.load_dataset('diamonds')
df.info()  # info() prints its summary directly and returns None, so no print() is needed
df.head()

In [None]:
# Overview statistics
df.describe()

## Feature Selection

We can check correlations to determine which features (columns) are useful for predicting price.

In practice, we would encode our categorical variables to numerics so we can use them in our model, but for today, let's focus on numerical features only.
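As a rough illustration of the encoding step mentioned above, the sketch below shows two common options on a small hand-made frame. The `toy` data, the `cut_order` list and the `cut_code` column name are assumptions made for this example, not part of the notebook's workflow.

```python
import pandas as pd

# Toy stand-in for a few rows of the diamonds data (values are illustrative only)
toy = pd.DataFrame({
    'carat': [0.23, 0.31, 0.70, 1.01],
    'cut': ['Ideal', 'Good', 'Premium', 'Fair'],
})

# Option 1: one-hot encoding, which creates one 0/1 column per category
one_hot = pd.get_dummies(toy, columns=['cut'], dtype=int)

# Option 2: ordinal encoding, assuming the quality order runs worst -> best
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
toy['cut_code'] = pd.Categorical(toy['cut'], categories=cut_order, ordered=True).codes

print(one_hot)
print(toy)
```

One-hot encoding makes no assumption about ordering, while ordinal codes treat the gap between neighbouring grades as equal, which may or may not suit a linear model.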

### Exercise 1:

Create a subset of the data, considering numerical features only. Store this in a new dataframe and call the `.corr()` method to observe correlations between features.

In [None]:
# Select numeric columns and create correlation matrix
numeric_df = df.select_dtypes(include='number')

correlation_matrix = numeric_df.corr()
correlation_matrix


### Visualising Correlation

Reading correlation coefficients in a table can be difficult. Let's see how we can visualise them with a heatmap.

In [None]:
# Create heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

## Building a Regression Model

Highly correlated features may be good predictors (e.g. price and carat are highly correlated). If we want to include multiple predictors, we should avoid multicollinearity (e.g. x, y, z are highly correlated with carat, so may not be useful additions to the model).
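One common way to put a number on multicollinearity is the variance inflation factor (VIF). The sketch below is a minimal version under stated assumptions: the `vif` helper is hypothetical (written for this example), and the data is synthetic, with `x` deliberately constructed to track `carat` while `depth` is independent.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(features: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R^2), where R^2 comes from regressing it on the others."""
    scores = {}
    for col in features.columns:
        others = features.drop(columns=col)
        r2 = LinearRegression().fit(others, features[col]).score(others, features[col])
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores, name='VIF')

# Synthetic stand-in data (not the real diamonds values)
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, 500)
toy = pd.DataFrame({
    'carat': carat,
    'x': 6.0 * np.cbrt(carat) + rng.normal(0, 0.05, 500),  # built to follow carat
    'depth': rng.normal(61.7, 1.4, 500),                   # unrelated to carat
})

print(vif(toy).round(2))
```

A VIF close to 1 suggests a predictor is nearly independent of the others; values above roughly 5 to 10 are often treated as a warning sign, though that threshold is a rule of thumb rather than a fixed rule.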

Based on our correlation analysis, we select **carat and depth** as predictors for price. Carat has a high correlation, depth has a weak (negative) correlation, and they are weakly correlated themselves.

In [None]:
# Prepare the data
X = df[['carat', 'depth']]  # Predictors
y = df['price']             # Target variable

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Print the results
print("Regression Coefficients:")
print(f"Carat: {model.coef_[0]:.4f}")
print(f"Depth: {model.coef_[1]:.4f}")
print(f"\nR-squared: {model.score(X, y):.4f}")

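## Evaluating the Model
- The R-squared value of about 0.85 indicates that the model explains roughly 85% of the variance in diamond prices.
- This suggests a reasonably strong fit, though about 15% of the variation remains unexplained. Because the score is computed on the same data used for fitting, it may overstate performance on new diamonds.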
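To make the R-squared figure more concrete, here is a minimal sketch of how the score can be computed by hand: one minus the ratio of the residual sum of squares to the total sum of squares. The `r_squared` helper and the example prices are assumptions made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model misses
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Made-up actual and predicted prices
actual = [326, 2757, 4268, 9000]
predicted = [500, 2500, 4600, 8700]

print(f"R-squared: {r_squared(actual, predicted):.4f}")
```

For a fitted `LinearRegression`, this should agree with what `model.score(X, y)` reports on the same data.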

### Understanding the Coefficients
1. **Carat** (7765.14):
  - Strong positive relationship with price
  - For each one-carat increase, the predicted price rises by approximately $7,765, holding depth constant
  - This aligns with our expectation that higher carat diamonds are more valuable

2. **Depth** (-102.17):
  - Slight negative relationship with price
  - For each one-percentage-point increase in depth, the predicted price falls by about $102 (depth here is a percentage, not a measurement in mm)
  - Suggests that, holding carat constant, customers slightly prefer less deep diamonds
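Since a full carat is a large step for most diamonds, it can help to scale the coefficients to smaller changes. The sketch below simply reuses the two coefficients reported above; the `predicted_price_change` helper is a hypothetical name introduced for this example, and the intercept is not needed because only differences are computed.

```python
# Coefficients as reported above for the model with carat and depth as predictors
CARAT_COEF = 7765.14   # price units per carat
DEPTH_COEF = -102.17   # price units per percentage point of depth

def predicted_price_change(delta_carat: float, delta_depth: float) -> float:
    """Change in predicted price for a given change in each predictor."""
    return CARAT_COEF * delta_carat + DEPTH_COEF * delta_depth

print(f"+0.1 carat, same depth: {predicted_price_change(0.1, 0.0):+.2f}")
print(f"Same carat, +1 depth:   {predicted_price_change(0.0, 1.0):+.2f}")
```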

When comparing the coefficients, note that the two predictors are on very different scales:

- Carat typically ranges from about 0.2 to 5
- Depth typically ranges from about 45 to 80

**In practice, we would standardise these values, but we are leaving them in their raw form here because it makes explaining them easier.**
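For reference, standardising could look something like the sketch below, which uses scikit-learn's `StandardScaler` inside a pipeline. The data is synthetic (generated for this example, not the diamonds values), and the names `X_toy`, `y_toy`, `raw` and `scaled` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data on roughly the same scales as carat and depth
rng = np.random.default_rng(1)
X_toy = pd.DataFrame({
    'carat': rng.uniform(0.2, 2.5, 300),
    'depth': rng.normal(61.7, 1.4, 300),
})
y_toy = 7000 * X_toy['carat'] - 100 * X_toy['depth'] + rng.normal(0, 500, 300)

# Same model fitted on raw and on standardised predictors
raw = LinearRegression().fit(X_toy, y_toy)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_toy, y_toy)
scaled_coef = scaled.named_steps['linearregression'].coef_

print("Raw coefficients:         ", raw.coef_.round(1))
print("Standardised coefficients:", scaled_coef.round(1))
print("R-squared unchanged:", np.isclose(raw.score(X_toy, y_toy), scaled.score(X_toy, y_toy)))
```

Standardising does not change the fit of a plain linear regression. It only rescales each coefficient to "price change per one standard deviation of the predictor", which makes the coefficients comparable with each other.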

### Predictions

Let's use this model to predict prices for specific diamonds and see how well it performs in practice. The code below initialises a new diamond with a specific carat and depth, and then uses our model to make a price prediction.

**In practice, we would have _split_ our original data into a training set (which we would have used in the code above) and a test set (which we could use to verify how well our model predicts on new data).**
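As a rough sketch of the split described in the note above, scikit-learn's `train_test_split` can hold back part of the data for evaluation. The example uses synthetic data so that it runs on its own; `X_toy`, `y_toy` and `split_model` are hypothetical names, and the 80/20 split is just one common choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real diamonds values)
rng = np.random.default_rng(2)
X_toy = pd.DataFrame({
    'carat': rng.uniform(0.2, 2.5, 300),
    'depth': rng.normal(61.7, 1.4, 300),
})
y_toy = 7000 * X_toy['carat'] - 100 * X_toy['depth'] + rng.normal(0, 500, 300)

# Hold back 20% of the rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Fit on the training rows only, then score on both sets
split_model = LinearRegression().fit(X_train, y_train)
print(f"Train R-squared: {split_model.score(X_train, y_train):.4f}")
print(f"Test R-squared:  {split_model.score(X_test, y_test):.4f}")
```

A test score well below the training score would hint at overfitting; similar scores suggest the model generalises about as well as it fits.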



In [None]:
# Define new diamond
new_diamond = pd.DataFrame({
    'carat': [1.5],
    'depth': [61.5]
})

# Make prediction
prediction = model.predict(new_diamond)
print(f"Price prediction: ${prediction[0]:,.2f}")  # prices in this dataset are in US dollars

## Visualising Model Predictions

### Exercise 2:

Let's compare predicted prices with actual prices to assess model performance. Again, we would typically do this with a test set, but the approach is the same.

Use the `.predict()` method to predict prices for all of the original values in our dataset. Then use a scatter plot to compare actual prices with our model's predictions.

In [None]:
y_pred = model.predict(X)

plt.scatter(y, y_pred, alpha=0.5)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs Predicted Prices')
plt.show()

## Conclusion

In this notebook, we performed a quick exploration to select relevant features and built a regression model to predict diamond prices.