# Exploratory Data Analysis (EDA)

This notebook performs exploratory analysis on the fully featurized melting point dataset.
The goal is to validate the target distribution, identify broad structure–property trends,
and assess feature correlations that may impact downstream modelling.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#Load the combined and featurized dataset
from mp.io import get_repo_root

ROOT = get_repo_root()

data_dir = ROOT / "data" / "processed"
fig_dir = ROOT / "reports" / "figures"

data = pd.read_csv(data_dir / "full_dataset.csv", index_col=0)

## Dataset overview

The dataset loaded here corresponds to the output of `01_data_loading.ipynb`,
after cleaning, filtering, and RDKit-based featurization.

It contains:
- a continuous melting point target (`mpC`)
- RDKit physicochemical descriptors
- Morgan fingerprint bits (excluded from all EDA plots for interpretability)

In [None]:
#Explicitly exclude the fingerprint features

rdkit_feats = [c for c in data.columns if not c.startswith("FP")]
feat_df = data[rdkit_feats]

In [None]:
feat_df.shape

The full dataset contains **23060 compounds** with **40 total features** after featurization (excluding the 2048 fingerprint features).

## Target distribution (melting point)

First examine the distribution of the melting point target (`mpC`) to ensure it is
well-behaved for regression modelling.

A roughly unimodal, symmetric distribution with limited extreme outliers is desirable,
as it reduces the need for target transformations or specialised loss functions.

In [None]:
fig, axs = plt.subplots(1,2, figsize = (14, 5))

axs[0].hist(feat_df["mpC"], bins = 40, alpha = 0.7)
axs[0].set_xlabel("mpC")
axs[0].set_ylabel("Count")
axs[0].set_title("mpC Histogram")

axs[1].boxplot(feat_df["mpC"], flierprops = dict(marker = 'o', markersize = 5, alpha = 0.7))
axs[1].set_ylabel("mpC")
axs[1].set_title("mpC Boxplot")

plt.savefig(fig_dir / "target_distribution.png", dpi=300, bbox_inches="tight")
plt.show()

The melting-point distribution is **unimodal** and **approximately Gaussian**.

## Outlier analysis

Outliers are identified using the standard 1.5 × IQR rule.
This provides a simple, model-agnostic view of extreme melting point values
without making distributional assumptions.

In [None]:
mp = feat_df["mpC"]

summary = {
    "mean": mp.mean(),
    "median": mp.median(),
    "std": mp.std(),
    "skew": mp.skew(),
    "kurtosis": mp.kurtosis(),
    "Q1": mp.quantile(0.25),
    "Q3": mp.quantile(0.75),
}

summary["IQR"] = summary["Q3"] - summary["Q1"]

pd.Series(summary).round(3)

In [None]:
Q1 = feat_df["mpC"].quantile(0.25)
Q3 = feat_df["mpC"].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = ((feat_df["mpC"] < lower) | (feat_df["mpC"] > upper)).sum()
print(f"Outliers (1.5 × IQR): {outliers} ({outliers/len(feat_df):.1%})")

In [None]:
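# Work on an explicit copy so that later column assignments do not act on a slice of `data`
feat_df = feat_df.copy()

# Reuse the IQR fences (`lower`, `upper`) computed in the previous cell
low_tail = feat_df[feat_df["mpC"] < lower]
high_tail = feat_df[feat_df["mpC"] > upper]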

In [None]:
print(f"{len(low_tail)} compounds in the low tail, and {len(high_tail)} in the high tail")

In [None]:
plt.hist(feat_df["mpC"], bins=80)
plt.axvline(lower, color="red", linestyle="--", label="1.5 × IQR cutoff")
plt.axvline(upper, color="red", linestyle="--")
plt.legend()
plt.xlabel("Melting point (°C)")
plt.ylabel("Count")
plt.title("mpC Histogram")

plt.savefig(fig_dir / "mp_iqr_outliers.png", dpi=300, bbox_inches="tight")
plt.show()

**The melting-point distribution is highly symmetric** (mean ≈ median and skewness ≈ 0) with near-normal tail behaviour (excess kurtosis ≈ 0). 
**Although the IQR spans a large range (120 °C), only 0.8% of observations are classified as outliers (1.5 × IQR)**. 
This indicates a **well-behaved continuous regression target** without strong multimodality or excessive outliers.
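As a sanity check on the 0.8% figure, the fraction of observations that the 1.5 × IQR rule would flag under an exactly normal distribution can be computed in closed form. The sketch below uses only the standard library (`statistics.NormalDist`); the helper name `expected_iqr_outlier_fraction` is illustrative and not part of this repository.

```python
from statistics import NormalDist


def expected_iqr_outlier_fraction(k: float = 1.5) -> float:
    """Fraction of a normal population lying outside the k × IQR fences."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    q1 = std_normal.inv_cdf(0.25)
    q3 = std_normal.inv_cdf(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Probability mass below the lower fence plus mass above the upper fence
    return std_normal.cdf(lower) + (1.0 - std_normal.cdf(upper))


print(f"Expected outlier fraction under normality: {expected_iqr_outlier_fraction():.2%}")
```

For a normal distribution this comes out at roughly 0.7%, independent of the mean and standard deviation, so an observed 0.8% is in line with near-Gaussian tails.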

## Structure–property relationships

Visually inspect relationships between the melting point and a small number of
chemically interpretable RDKit descriptors.

These plots are intended to highlight broad trends rather than precise functional
relationships, and to confirm that physically meaningful features show signal.

In [None]:
mean_tm = feat_df["mpC"].mean()

low_p  = feat_df["mpC"].quantile(0.01)
high_p = feat_df["mpC"].quantile(0.99)

feat_df["tail"] = "middle"
feat_df.loc[feat_df["mpC"] <= low_p, "tail"] = "low"
feat_df.loc[feat_df["mpC"] >= high_p, "tail"] = "high"

colour_map = {
    "low": "tab:green",
    "middle": "tab:blue",
    "high": "tab:red"
}

colours = feat_df["tail"].map(colour_map)

In [None]:
fig, axs = plt.subplots(3, 2, figsize=(16, 12))


# --- Top Left plot: MolWt vs Tm ---
axs[0,0].scatter(feat_df["MolWt"], feat_df["mpC"], s=10, alpha=0.6, c = colours)


axs[0,0].axhline(mean_tm, color='red', linestyle='--')
axs[0,0].axvline(feat_df["MolWt"].mean(), color='green', linestyle='--')

axs[0,0].set_xlabel("MolWt")
axs[0,0].set_ylabel("MP")
axs[0,0].set_title("MP vs MolWt")


# --- Top Right plot: LogP vs Tm ---
axs[0,1].scatter(feat_df["LogP"], feat_df["mpC"], s=10, alpha=0.6, c = colours)

axs[0,1].axhline(mean_tm, color='red', linestyle='--')
axs[0,1].axvline(feat_df["LogP"].mean(), color='green', linestyle='--')

axs[0,1].set_xlabel("LogP")
axs[0,1].set_ylabel("MP")
axs[0,1].set_title("MP vs LogP")


# --- Middle Left plot: HBA + HBD vs Tm ---
hb_total = feat_df["HBD"] + feat_df["HBA"]  # total hydrogen-bond donors and acceptors
axs[1,0].scatter(hb_total, feat_df["mpC"], s=10, alpha=0.6, c = colours)

axs[1,0].axhline(mean_tm, color='red', linestyle='--')
# Mean of the plotted quantity (HBA + HBD), not the average of the two separate means
axs[1,0].axvline(hb_total.mean(), color='green', linestyle='--')

axs[1,0].set_xlabel("HBA + HBD")
axs[1,0].set_ylabel("MP")
axs[1,0].set_title("MP vs (HBA + HBD)")


# --- Middle Right plot: TPSA vs TM ---
axs[1,1].scatter(feat_df["TPSA"], feat_df["mpC"], s=10, alpha=0.6, c = colours)

axs[1,1].axhline(mean_tm, color='red', linestyle='--')
axs[1,1].axvline(feat_df["TPSA"].mean(), color='green', linestyle='--')

axs[1,1].set_xlabel("TPSA")
axs[1,1].set_ylabel("MP")
axs[1,1].set_title("MP vs TPSA")


# --- Bottom Left plot: Ring count vs Tm ---
axs[2,0].scatter(feat_df["RingCount"],feat_df["mpC"], s=10, alpha=0.6, c = colours)

axs[2,0].axhline(mean_tm, color='red', linestyle='--')
axs[2,0].axvline(feat_df["RingCount"].mean(), color='green', linestyle='--')

axs[2,0].set_xlabel("RingCount")
axs[2,0].set_ylabel("MP")
axs[2,0].set_title("MP vs RingCount")


# --- Bottom Right plot: NumN vs Tm ---
axs[2,1].scatter(feat_df["NumN"],feat_df["mpC"], s=10, alpha=0.6, c = colours)

axs[2,1].axhline(mean_tm, color='red', linestyle='--')
axs[2,1].axvline(feat_df["NumN"].mean(), color='green', linestyle='--')

axs[2,1].set_xlabel("NumN")
axs[2,1].set_ylabel("MP")
axs[2,1].set_title("MP vs NumN")


plt.tight_layout()
plt.savefig(fig_dir / "mp_vs_key_rdkit_features.png", dpi=300, bbox_inches="tight")
plt.show()

The melting point appears to correlate with the **molecular weight**, the **total number of hydrogen-bond donors and acceptors (HBA + HBD)**, the **topological polar surface area (TPSA)**, the **ring count**, and the **number of nitrogen atoms**. No obvious correlation between MP and **LogP** is evident.
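Trends are hard to judge by eye for discrete descriptors such as ring count, where many points overlap in a scatter plot. One way to quantify them is to summarise the target per descriptor value. The sketch below uses a small made-up frame in place of `feat_df`; the helper `median_by_level` and the toy values are illustrative assumptions, not results from this dataset.

```python
import pandas as pd


def median_by_level(df: pd.DataFrame, feature: str, target: str = "mpC") -> pd.DataFrame:
    """Median target and group size for each distinct value of a discrete feature."""
    return (
        df.groupby(feature)[target]
        .agg(median="median", n="count")
        .sort_index()
    )


# Toy data standing in for feat_df (values are invented for illustration)
toy = pd.DataFrame(
    {
        "RingCount": [0, 0, 1, 1, 2],
        "mpC": [10.0, 20.0, 50.0, 70.0, 100.0],
    }
)

trend = median_by_level(toy, "RingCount")
print(trend)
```

Applied to the real data, a call such as `median_by_level(feat_df, "RingCount")` would show whether the median melting point rises steadily with ring count, and how many compounds support each level.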

## Feature correlation with melting point

Linear correlations between RDKit descriptors and the melting point target
provide a first indication of potentially informative features.
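Pearson correlation only measures linear association, so a monotonic but curved relationship can be understated. A rank-based (Spearman) coefficient is a common complement; it equals the Pearson correlation of the ranked values. The sketch below illustrates this on invented toy data; `rank_correlation_with_target` is a hypothetical helper, not a function from this repository.

```python
import pandas as pd


def rank_correlation_with_target(df: pd.DataFrame, target: str = "mpC") -> pd.Series:
    """Spearman correlation of every numeric column with the target."""
    numeric = df.select_dtypes(include="number")
    # Spearman's rho is Pearson's r computed on ranks (average rank for ties)
    ranked = numeric.rank(method="average")
    return ranked.corr(method="pearson")[target].drop(target).sort_values()


# Toy data (invented): the target grows monotonically but non-linearly with x
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "mpC": [1.0, 8.0, 27.0, 64.0, 125.0]})

pearson = toy["x"].corr(toy["mpC"])
spearman = rank_correlation_with_target(toy)["x"]
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

In the toy case the rank correlation is perfect while the Pearson coefficient is lower. A large gap between the two on a real descriptor would suggest a non-linear relationship with the melting point.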

In [None]:
corr = feat_df.corr(numeric_only=True)

In [None]:
target_corr = corr["mpC"].drop("mpC").sort_values()  # drop the trivial self-correlation (r = 1)

plt.figure(figsize=(25, 8))
sns.heatmap(
    target_corr.to_frame().T,
    annot=True,
    cmap="coolwarm",
    fmt=".2f"
)
plt.title("Correlation of Features with MP", fontsize=14)

plt.savefig(fig_dir / "feature_target_correlations.png", dpi=300, bbox_inches="tight")
plt.show()

## Feature–feature correlations

Strong correlations between descriptors indicate potential redundancy in the feature set.
This motivates explicit feature filtering and correlation-based pruning
in later modelling stages.
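A minimal sketch of correlation-based pruning is shown below, assuming a simple greedy rule: scan the upper triangle of the absolute correlation matrix and mark any column that correlates above a threshold with an earlier column. The function name `correlated_columns_to_drop`, the 0.9 threshold, and the toy data are assumptions for illustration; the later modelling notebooks may use a different procedure.

```python
import numpy as np
import pandas as pd


def correlated_columns_to_drop(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Columns whose |r| with any earlier column exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is considered once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data (invented): "b" is an exact multiple of "a", "c" is only loosely related
toy = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": [2, 4, 6, 8, 10, 12],
        "c": [3, 1, 4, 1, 5, 9],
    }
)

to_drop = correlated_columns_to_drop(toy, threshold=0.9)
pruned = toy.drop(columns=to_drop)
print(to_drop, list(pruned.columns))
```

The rule keeps the first member of each correlated pair in column order, so the result depends on how the columns are arranged. Ordering by relevance to the target beforehand is one way to make that choice deliberate.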

In [None]:
feat_df_no_mp = feat_df.drop(columns=["mpC"])
corr_no_mp = feat_df_no_mp.corr(numeric_only=True)

In [None]:
plt.figure(figsize=(25, 20))
sns.heatmap(
    corr_no_mp,
    annot=True,
    cmap="coolwarm",
    linewidths=0.5,
    fmt=".2f"
)
plt.title("Feature Correlation Matrix", fontsize=14)
plt.tight_layout()

plt.savefig(fig_dir / "feature_feature_correlations.png", dpi=300, bbox_inches="tight")
plt.show()