# Machine Learning with the Diamonds Dataset

This notebook explores the **diamonds dataset** using `pandas`, `seaborn`, and `scikit-learn`.
We will conduct **exploratory data analysis (EDA)**, select relevant features, and build a linear regression model to predict diamond prices.

## Importing Required Libraries

We import the necessary libraries:
- `pandas` for handling structured data.
- `seaborn` for visualisation.
- `matplotlib.pyplot` for plotting.
- `sklearn.model_selection` for splitting the data into training and test sets.
- `sklearn.linear_model` for fitting a linear regression model.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Loading and Exploring the Data

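We load the diamonds dataset from `seaborn`. Each row describes one diamond, with the following columns:
- **Carat** (`carat`): Weight of the diamond.
- **Cut** (`cut`): Quality of the cut, from Fair (worst) to Ideal (best).
- **Colour** (`color`): Diamond colour, from J (worst) to D (best).
- **Clarity** (`clarity`): How clear the diamond is, from I1 (worst) to IF (best).
- **Depth & Table** (`depth`, `table`): Proportions of the diamond. Depth is the total depth as a percentage of the average of length and width; table is the width of the top facet relative to the widest point.
- **Dimensions** (`x`, `y`, `z`): Length, width, and depth in millimetres.
- **Price** (`price`): Price of the diamond in US dollars.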

In [2]:
df = sns.load_dataset('diamonds')
# info() prints its summary itself and returns None, so it is not wrapped in print()
df.info()
df.head()

## Data Cleaning and Visualisation

Before we build a model, we drop any rows with missing values and visualise the pairwise relationships between price and the numerical variables carat, depth, and table.

In [3]:
# Drop rows that contain missing values, if there are any
df = df.dropna()

# Pairplot to check relationships between variables
sns.pairplot(df[['price', 'carat', 'depth', 'table']])
plt.show()
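
The cell above drops incomplete rows without first reporting how many there are. A minimal sketch of that check is shown below on a small made-up frame; the name `toy` and its values are invented for illustration and are not taken from the diamonds data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the diamonds frame (values invented for illustration only)
toy = pd.DataFrame({
    "carat": [0.23, 0.31, np.nan, 0.70],
    "depth": [61.5, 59.8, 62.4, np.nan],
    "price": [326, 335, 2757, 2760],
})

# Count missing values per column before deciding how to handle them
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop incomplete rows and report how many were removed
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)
print(f"Rows dropped: {rows_dropped}")
```

Running `df.isna().sum()` on the notebook's `df` before calling `dropna()` would show in the same way whether the cleaning step removes anything.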

## Feature Selection

We check correlations to determine which numerical features are useful for predicting price.

In [4]:
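# Compute the correlation matrix for the numerical columns only.
# cut, color and clarity are categorical, and recent pandas versions
# raise an error if they are passed to corr() without numeric_only=True.
correlation_matrix = df.corr(numeric_only=True)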
print(correlation_matrix['price'].sort_values(ascending=False))
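
Correlation is only defined for numerical columns, so the categorical ones have to be left out in some way. The sketch below shows an alternative way to do that, selecting the numerical columns explicitly with `select_dtypes` before correlating. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame mixing numerical and categorical columns (invented values)
toy = pd.DataFrame({
    "carat": [0.2, 0.3, 0.5, 0.7, 1.0],
    "cut": pd.Categorical(["Ideal", "Good", "Ideal", "Premium", "Fair"]),
    "price": [300, 450, 1200, 2500, 5000],
})

# Keep only the numerical columns, then correlate them
numeric = toy.select_dtypes(include="number")
price_corr = numeric.corr()["price"].sort_values(ascending=False)
print(price_corr)
```

Selecting the columns first makes it explicit which variables enter the analysis, which can be easier to read than relying on a keyword argument.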

## Building a Regression Model

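**Carat** is by far the strongest numerical correlate of price, so it is our main predictor. We also include **depth** and **table**, although the correlation output above shows that both are only weakly related to price. The dimensions `x`, `y`, and `z` correlate strongly with price too, but they largely duplicate the information in carat, so we leave them out.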
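
Written out, the model fitted in the next cell has the standard multiple linear regression form. The $\beta$ symbols are notation introduced here for illustration, not names used in the code:

$$
\widehat{\text{price}} = \beta_0 + \beta_1\,\text{carat} + \beta_2\,\text{depth} + \beta_3\,\text{table}
$$

`LinearRegression` estimates these values by ordinary least squares. After fitting, $\beta_0$ is available as `model.intercept_` and $\beta_1, \beta_2, \beta_3$ as the entries of `model.coef_`, in the same order as the columns of `X`.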

In [5]:
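# Predictors (features) and target
X = df[['carat', 'depth', 'table']]
y = df['price']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit an ordinary least squares model on the training rows only
model = LinearRegression()
model.fit(X_train, y_train)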

# Display model coefficients
print(f'Coefficients: {model.coef_}')
print(f'Intercept: {model.intercept_}')
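
`model.coef_` is a plain array, so the printed coefficients are not labelled with their feature names. One possible way to make them readable is to wrap them in a `pandas` Series indexed by the column names, as sketched below. The data is synthetic and the names `X_toy`, `y_toy`, `toy_model`, and `coef_table` are hypothetical. The target is built as exactly `3*a - 2*b + 5` so the recovered values can be checked.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship: y = 3*a - 2*b + 5
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
y_toy = 3 * X_toy["a"] - 2 * X_toy["b"] + 5

toy_model = LinearRegression().fit(X_toy, y_toy)

# Pair each coefficient with the name of the feature it belongs to
coef_table = pd.Series(toy_model.coef_, index=X_toy.columns, name="coefficient")
print(coef_table)
print(f"Intercept: {toy_model.intercept_:.2f}")
```

Each coefficient is the change in the predicted target for a one-unit increase in that feature with the others held fixed. In the notebook, carat, depth, and table are measured on different scales, so the raw magnitudes of their coefficients are not directly comparable.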

## Visualising Model Predictions

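We compare predicted and actual prices on the held-out test set to assess model performance. Points close to the diagonal (predicted = actual) indicate accurate predictions.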

In [6]:
y_pred = model.predict(X_test)

plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs Predicted Prices')
plt.show()
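
A scatter plot gives only a visual impression of fit. A minimal sketch of two common numerical summaries, R² and root mean squared error (RMSE), is shown below using `sklearn.metrics`. The arrays `actual` and `predicted` are made-up stand-ins and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted prices, for illustration only
actual = np.array([300.0, 500.0, 1200.0, 2500.0, 5000.0])
predicted = np.array([350.0, 450.0, 1000.0, 2700.0, 4800.0])

# R^2: share of the variance in the actual values explained by the predictions
r2 = r2_score(actual, predicted)

# RMSE: typical size of a prediction error, in the same units as the target
rmse = np.sqrt(mean_squared_error(actual, predicted))

print(f"R^2: {r2:.3f}")
print(f"RMSE: {rmse:.1f}")
```

In the notebook, passing `y_test` and `y_pred` to the same two functions would give the corresponding figures for the fitted model.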

## Conclusion

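We used EDA to select features and built a linear regression model to predict diamond prices from carat, depth, and table. The scatter plot of actual versus predicted prices gives a visual check of the fit: the closer the points lie to the diagonal, the better the predictions. A numerical metric on the test set, such as R² or RMSE, would be needed to quantify how well the model performs.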