# Module 3: Featurization Basics

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NabKh/ML-for-Materials-Science/blob/main/Tutorial-07-ML-Discovery/notebooks/03_featurization_basics.ipynb)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/NabKh/ML-for-Materials-Science/main?labpath=Tutorial-07-ML-Discovery/notebooks/03_featurization_basics.ipynb)

---

> **Before You Start:** Please check the [INSTALLATION_GUIDE.md](../../INSTALLATION_GUIDE.md) for setup instructions. For Google Colab:
> ```python
> !pip install pymatgen matminer shap -q
> ```
> Then restart the runtime (Runtime → Restart runtime).

---

## 🎯 Learning Objectives

By the end of this module, you will be able to:

1. **Understand** why featurization is crucial for materials ML
2. **Use** matminer to generate composition-based features
3. **Apply** structure-based featurizers when crystal structures are available
4. **Select** relevant features and handle high-dimensional data
5. **Choose** appropriate featurizers for different prediction tasks

---

**⏱️ Estimated time: 75 minutes**

**📚 Difficulty: 🟢🟡 Beginner-Intermediate**

## 📦 Setup

In [1]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Create figures directory
os.makedirs('figures', exist_ok=True)

# Pymatgen
from pymatgen.core import Composition, Structure

# Matminer featurizers
from matminer.featurizers.composition import (
    ElementProperty,
    Stoichiometry,
    ValenceOrbital,
    IonProperty,
    ElementFraction,
    TMetalFraction,
    BandCenter,
)
from matminer.featurizers.conversions import StrToComposition

# Scikit-learn for feature selection
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor

# Interactive widgets
import ipywidgets as widgets
from IPython.display import display, HTML

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
print("✅ Libraries imported successfully!")

✅ Libraries imported successfully!


---

## 1. Why Featurization?

### 📖 Theory

<div style="background: linear-gradient(135deg, #1e293b 0%, #0f172a 100%); padding: 20px; border-radius: 10px; border-left: 4px solid #6366f1;">

**The Problem**: ML algorithms need numerical inputs, but materials are described by:
- Chemical formulas: "Fe₂O₃", "LiCoO₂"
- Crystal structures: 3D atomic positions
- Symbolic representations

**The Solution**: **Featurization** converts materials into fixed-length numerical vectors that encode physical/chemical information.

```
"Fe₂O₃" → [2.35, 1.47, 0.65, 3.44, ...] (132 numbers)
```

</div>

### Mathematical Framework

For a composition $C$ with elements $\{e_1, e_2, ..., e_n\}$ and atomic fractions $\{x_1, x_2, ..., x_n\}$, we compute features by aggregating elemental properties $P_i$:

**Weighted Mean:**
$$\bar{P} = \sum_{i=1}^{n} x_i \cdot P_i$$

**Weighted Standard Deviation:**
$$\sigma_P = \sqrt{\sum_{i=1}^{n} x_i \cdot (P_i - \bar{P})^2}$$

**Range:**
$$\Delta P = \max(P_i) - \min(P_i)$$

For example, for Fe₂O₃ ($x_{\mathrm{Fe}} = 0.4$, $x_{\mathrm{O}} = 0.6$):
$$\bar{\chi} = 0.4 \times 1.83 + 0.6 \times 3.44 = 2.80 \text{ (mean electronegativity)}$$
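The three aggregates above can be checked by hand. The following is a minimal sketch in plain Python that hard-codes the Fe and O electronegativities quoted in this module (1.83 and 3.44) instead of looking them up; the helper name `weighted_stats` is illustrative and is not part of matminer.

```python
import math

def weighted_stats(fractions, values):
    """Return (mean, std, range) of `values`, weighted by atomic `fractions`."""
    mean = sum(x_i * p_i for x_i, p_i in zip(fractions, values))
    variance = sum(x_i * (p_i - mean) ** 2 for x_i, p_i in zip(fractions, values))
    return mean, math.sqrt(variance), max(values) - min(values)

# Fe2O3: atomic fractions and electronegativities (values quoted in the text)
fe2o3_fractions = [0.4, 0.6]   # x_Fe, x_O
electronegativity = [1.83, 3.44]  # chi_Fe, chi_O

mean_chi, std_chi, range_chi = weighted_stats(fe2o3_fractions, electronegativity)
print(f"weighted mean = {mean_chi:.3f}")
print(f"weighted std  = {std_chi:.3f}")
print(f"range         = {range_chi:.2f}")
```

The weighted mean rounds to 2.80, in agreement with the worked example. Note that the range ignores the fractions entirely: it depends only on which elements are present.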

### Featurizer Categories

| Category | Input | Examples | Use Case |
|----------|-------|----------|----------|
| Composition | Formula | Magpie, ElementProperty | Quick screening |
| Structure | Crystal | Voronoi, SOAP, Coulomb Matrix | Accurate predictions |
| Electronic | DOS/Bands | Band center, width | Electronic properties |
| Graph | Atomic graph | CGCNN, MEGNet | Deep learning |

In [None]:
# Professional visualization of the featurization pipeline
fig, ax = plt.subplots(figsize=(16, 5), facecolor='white')

# Color palette - vibrant and professional
colors = {
    'input': '#ec4899',        # Pink - input
    'process': '#8b5cf6',      # Purple - featurizer
    'output': '#0ea5e9',       # Blue - features
    'model': '#10b981',        # Green - model
    'arrow': '#64748b',        # Slate - arrows
    'text': '#1e293b',         # Dark text
    'bg': '#f8fafc'            # Light background
}

from matplotlib.patches import FancyBboxPatch, FancyArrowPatch

# Pipeline stages with better positioning
pipeline_stages = [
    (1, 2, 'Chemical\nFormula', 'Fe2O3', colors['input']),
    (5, 2, 'Featurizer', 'matminer', colors['process']),
    (9, 2, 'Feature\nVector', '[2.35, 1.47, ...]', colors['output']),
    (13, 2, 'ML Model', 'sklearn', colors['model']),
]

# Draw boxes
for x, y, title, subtitle, color in pipeline_stages:
    # Main box with gradient-like effect
    box = FancyBboxPatch((x, y), 2.8, 2.2, boxstyle="round,rounding_size=0.3",
                         facecolor=color, alpha=0.15, edgecolor=color, linewidth=3)
    ax.add_patch(box)
    
    # Inner highlight
    inner_box = FancyBboxPatch((x+0.1, y+0.1), 2.6, 2, boxstyle="round,rounding_size=0.25",
                               facecolor='white', alpha=0.5, edgecolor='none')
    ax.add_patch(inner_box)
    
    # Title text
    ax.text(x+1.4, y+1.5, title, ha='center', va='center', fontsize=13, 
            fontweight='bold', color=colors['text'])
    
    # Subtitle with styling
    if subtitle == 'Fe2O3':
        # Use proper subscript rendering
        ax.text(x+1.4, y+0.6, r'$\mathbf{Fe_2O_3}$', ha='center', va='center', 
                fontsize=12, color=color)
    else:
        ax.text(x+1.4, y+0.6, subtitle, ha='center', va='center', 
                fontsize=10, color=color, style='italic')

# Draw arrows between boxes
arrow_positions = [(3.9, 3.1), (7.9, 3.1), (11.9, 3.1)]
for start_x, y in arrow_positions:
    arrow = FancyArrowPatch((start_x, y), (start_x + 0.9, y),
                           arrowstyle='-|>', mutation_scale=20,
                           color=colors['arrow'], linewidth=3)
    ax.add_patch(arrow)

# Add step labels above arrows
step_labels = [
    (4.35, 3.6, 'Extract'),
    (8.35, 3.6, 'Transform'),
    (12.35, 3.6, 'Predict'),
]
for x, y, label in step_labels:
    ax.text(x, y, label, ha='center', va='center', fontsize=10, 
            color=colors['arrow'], fontweight='bold')

# Add a subtle bottom annotation
ax.text(8.5, 0.7, 'Convert materials to numerical representations for machine learning',
        ha='center', va='center', fontsize=11, color=colors['arrow'], style='italic')

ax.set_xlim(0, 17)
ax.set_ylim(0, 5)
ax.set_title('The Featurization Pipeline', fontsize=18, fontweight='bold', 
             color=colors['text'], pad=25)
ax.axis('off')
ax.set_facecolor(colors['bg'])

plt.tight_layout()
plt.savefig('figures/03_featurization_pipeline.png', dpi=200, bbox_inches='tight',
            facecolor='white', edgecolor='none')
plt.show()

print("Figure saved to figures/03_featurization_pipeline.png")

---

## 2. Composition-Based Featurizers

These featurizers only need the chemical formula - no crystal structure required!

In [3]:
# Create some example compositions
formulas = ["Fe2O3", "SiO2", "LiCoO2", "GaAs", "TiO2", "ZnO", "CaTiO3"]
compositions = [Composition(f) for f in formulas]

print("Example compositions:")
for f, c in zip(formulas, compositions):
    print(f"  {f:10} → Elements: {[str(el) for el in c.elements]}")

Example compositions:
  Fe2O3      → Elements: ['Fe', 'O']
  SiO2       → Elements: ['Si', 'O']
  LiCoO2     → Elements: ['Li', 'Co', 'O']
  GaAs       → Elements: ['Ga', 'As']
  TiO2       → Elements: ['Ti', 'O']
  ZnO        → Elements: ['Zn', 'O']
  CaTiO3     → Elements: ['Ca', 'Ti', 'O']


### 2.1 ElementProperty (Magpie Features)

The most popular composition featurizer! It calculates statistics (minimum, maximum, range, mean, average deviation, mode) over elemental properties.

<div style="background: rgba(99, 102, 241, 0.1); padding: 15px; border-radius: 10px; border-left: 4px solid #6366f1;">

**Magpie (Materials-Agnostic Platform for Informatics and Exploration)** computes 6 statistics over 22 elemental properties:

**Statistics:** minimum, maximum, range, mean, avg_dev (mean absolute deviation), mode

**Properties:** Atomic number, Mendeleev number, atomic weight, melting temperature, row, column, covalent radius, electronegativity, valence electrons (s, p, d, f), unfilled orbitals, and more.

**Total features:** 6 × 22 = 132 features

</div>

**How it works for Fe₂O₃:**

| Property | Fe | O | Weighted Mean | Formula |
|----------|-----|-----|---------------|---------|
| Electronegativity | 1.83 | 3.44 | 2.80 | $0.4 \times 1.83 + 0.6 \times 3.44$ |
| Atomic Weight | 55.85 | 16.00 | 31.94 | $0.4 \times 55.85 + 0.6 \times 16.00$ |
| Melting Point (K) | 1811 | 54 | 757 | $0.4 \times 1811 + 0.6 \times 54$ |
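To make the six statistics concrete, here is a from-scratch sketch (not matminer's implementation) that parses a simple formula into atomic fractions and computes the statistics for atomic number. The regex parser only handles flat formulas with integer subscripts and no parentheses, the atomic numbers are hard-coded, and breaking ties in the mode by taking the smallest value is an assumption.

```python
import math
import re

# Atomic numbers hard-coded for this illustration only
ATOMIC_NUMBER = {"Fe": 26, "O": 8}

def atomic_fractions(formula):
    """Parse a flat formula such as 'Fe2O3' into {element: atomic fraction}."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(amount) if amount else 1)
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}

def magpie_style_stats(fractions, prop):
    """Six fraction-weighted statistics of one elemental property."""
    values = [prop[element] for element in fractions]
    weights = list(fractions.values())
    mean = sum(w * v for w, v in zip(weights, values))
    avg_dev = sum(w * abs(v - mean) for w, v in zip(weights, values))
    # Mode: property of the most abundant element (ties -> smallest value, an assumption)
    top_weight = max(weights)
    mode = min(v for w, v in zip(weights, values) if math.isclose(w, top_weight))
    return {
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "mean": mean,
        "avg_dev": avg_dev,
        "mode": mode,
    }

number_stats = magpie_style_stats(atomic_fractions("Fe2O3"), ATOMIC_NUMBER)
for stat_name, stat_value in number_stats.items():
    print(f"{stat_name:8} = {stat_value:.4f}")
```

For Fe₂O₃ this gives 8, 26, 18, 15.2, 8.64 and 8, the same values matminer reports for the `Number` property in the cell output further down.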

In [4]:
# Create ElementProperty featurizer with Magpie preset
ep = ElementProperty.from_preset("magpie")

# See what features it generates
print(f"Number of features: {len(ep.feature_labels())}")
print(f"\nFirst 20 feature names:")
for i, name in enumerate(ep.feature_labels()[:20]):
    print(f"  {i+1:2}. {name}")
print("  ...")

Number of features: 132

First 20 feature names:
   1. MagpieData minimum Number
   2. MagpieData maximum Number
   3. MagpieData range Number
   4. MagpieData mean Number
   5. MagpieData avg_dev Number
   6. MagpieData mode Number
   7. MagpieData minimum MendeleevNumber
   8. MagpieData maximum MendeleevNumber
   9. MagpieData range MendeleevNumber
  10. MagpieData mean MendeleevNumber
  11. MagpieData avg_dev MendeleevNumber
  12. MagpieData mode MendeleevNumber
  13. MagpieData minimum AtomicWeight
  14. MagpieData maximum AtomicWeight
  15. MagpieData range AtomicWeight
  16. MagpieData mean AtomicWeight
  17. MagpieData avg_dev AtomicWeight
  18. MagpieData mode AtomicWeight
  19. MagpieData minimum MeltingT
  20. MagpieData maximum MeltingT
  ...


In [5]:
# Featurize a single composition
fe2o3 = Composition("Fe2O3")
features = ep.featurize(fe2o3)

print("Fe₂O₃ features (first 10):")
for name, val in zip(ep.feature_labels()[:10], features[:10]):
    print(f"  {name:40} = {val:.4f}")

Fe₂O₃ features (first 10):
  MagpieData minimum Number                = 8.0000
  MagpieData maximum Number                = 26.0000
  MagpieData range Number                  = 18.0000
  MagpieData mean Number                   = 15.2000
  MagpieData avg_dev Number                = 8.6400
  MagpieData mode Number                   = 8.0000
  MagpieData minimum MendeleevNumber       = 55.0000
  MagpieData maximum MendeleevNumber       = 87.0000
  MagpieData range MendeleevNumber         = 32.0000
  MagpieData mean MendeleevNumber          = 74.2000


In [6]:
# Featurize all compositions into a DataFrame
df = pd.DataFrame({'formula': formulas, 'composition': compositions})

# Use featurize_dataframe for batch processing
df = ep.featurize_dataframe(df, 'composition', ignore_errors=True)

print(f"DataFrame shape: {df.shape}")
print(f"\nFeatures for each material:")
df[['formula'] + ep.feature_labels()[:5]].head()

DataFrame shape: (7, 134)

Features for each material:


| | formula | MagpieData minimum Number | MagpieData maximum Number | MagpieData range Number | MagpieData mean Number | MagpieData avg_dev Number |
|---|---------|------|------|------|-----------|----------|
| 0 | Fe2O3 | 8.0 | 26.0 | 18.0 | 15.2 | 8.64 |
| 1 | SiO2 | 8.0 | 14.0 | 6.0 | 10.0 | 2.666667 |
| 2 | LiCoO2 | 3.0 | 27.0 | 24.0 | 11.5 | 7.75 |
| 3 | GaAs | 31.0 | 33.0 | 2.0 | 32.0 | 1.0 |
| 4 | TiO2 | 8.0 | 22.0 | 14.0 | 12.666667 | 6.222222 |


### 🔍 Explore Features for Different Materials

Let's examine how Magpie features vary across different compositions.

In [None]:
# Compare features across different material types
test_materials = {
    'Fe2O3': 'Transition metal oxide',
    'SiO2': 'Main group oxide', 
    'GaAs': 'III-V semiconductor',
    'NaCl': 'Ionic halide',
    'LiCoO2': 'Battery cathode'
}

# Key features to compare
key_features = [
    'MagpieData mean Electronegativity',
    'MagpieData mean AtomicWeight', 
    'MagpieData range Electronegativity',
    'MagpieData mean MeltingT'
]

print("Comparing key Magpie features across material types:")
print("=" * 80)

# Create comparison table
comparison_data = []
for formula, mat_type in test_materials.items():
    comp = Composition(formula)
    feats = ep.featurize(comp)
    feat_dict = dict(zip(ep.feature_labels(), feats))
    
    row = {'Material': formula, 'Type': mat_type}
    for f in key_features:
        short_name = f.replace('MagpieData ', '').replace('mean ', 'μ_').replace('range ', 'Δ_')
        row[short_name] = feat_dict[f]
    comparison_data.append(row)

comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))

print("\n" + "=" * 80)
print("Observations:")
print("  • Ionic compounds (NaCl) have high electronegativity range (Δ)")
print("  • Covalent compounds (SiO2, GaAs) have lower electronegativity range")
print("  • GaAs has the highest mean atomic weight; light oxygen atoms pull down the means of the oxides")

### 2.2 Other Composition Featurizers

#### Stoichiometry Features

The **Stoichiometry** featurizer computes $L^p$ norms of the stoichiometric fractions:

$$||x||_p = \left(\sum_{i=1}^{n} x_i^p\right)^{1/p}$$

| p-norm | Physical Meaning | Example (Fe₂O₃: x=[0.4, 0.6]) |
|--------|------------------|-------------------------------|
| 0-norm | Number of elements (by convention, since the formula is undefined at $p = 0$) | 2 |
| 2-norm | "Concentration" of elements | $\sqrt{0.4^2 + 0.6^2} = 0.72$ |
| ∞-norm | Maximum fraction | 0.6 (approached by high-$p$ norms, e.g. 10-norm = 0.601) |

**Why it matters:** These norms capture how "concentrated" or "spread out" the composition is. An equimolar binary such as NaCl (0.5, 0.5) has different p-norms than an uneven one such as Al₂O₃ (0.4, 0.6).
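The norm formula can be evaluated directly. This is a small sketch in plain Python, not matminer's code; the function name `stoichiometric_p_norm` is made up for illustration, and the list of $p$ values mirrors the feature labels printed for Fe₂O₃ later in this section.

```python
def stoichiometric_p_norm(fractions, p):
    """L^p norm of atomic fractions; p = 0 counts the elements present."""
    if p == 0:
        return float(sum(1 for x_i in fractions if x_i > 0))
    return sum(x_i ** p for x_i in fractions) ** (1.0 / p)

fe2o3_fractions = [0.4, 0.6]
nacl_fractions = [0.5, 0.5]

for p in (0, 2, 3, 5, 7, 10):
    fe_norm = stoichiometric_p_norm(fe2o3_fractions, p)
    na_norm = stoichiometric_p_norm(nacl_fractions, p)
    print(f"{p:>2}-norm   Fe2O3 = {fe_norm:.4f}   NaCl = {na_norm:.4f}")
```

As $p$ grows, the Fe₂O₃ norm falls toward its largest fraction (0.6) and the NaCl norm toward 0.5, which is how the higher norms separate compositions that share the same number of elements.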

#### Valence Orbital Features

The **ValenceOrbital** featurizer captures electronic structure information by counting valence electrons in each orbital type (s, p, d, f):

- **Average valence electrons** per orbital type
- **Fraction** of valence electrons in each orbital

This is particularly useful for predicting electronic properties since band structure depends heavily on orbital character.
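Both quantities follow from per-element valence electron counts. The sketch below assumes hard-coded counts for Fe (4s² 3d⁶) and O (2s² 2p⁴) and uses an illustrative helper, `valence_orbital_features`, that is not part of matminer.

```python
# Valence electrons per orbital type (s, p, d, f); hard-coded assumptions
VALENCE_ELECTRONS = {"Fe": (2, 0, 6, 0), "O": (2, 4, 0, 0)}
ORBITALS = ("s", "p", "d", "f")

def valence_orbital_features(fractions):
    """Return (average count, fraction of total) per orbital type."""
    avg = {
        orbital: sum(x_el * VALENCE_ELECTRONS[el][i] for el, x_el in fractions.items())
        for i, orbital in enumerate(ORBITALS)
    }
    total = sum(avg.values())
    share = {orbital: avg[orbital] / total for orbital in ORBITALS}
    return avg, share

avg_valence, frac_valence = valence_orbital_features({"Fe": 0.4, "O": 0.6})
for orbital in ORBITALS:
    print(f"avg {orbital} = {avg_valence[orbital]:.4f}   frac {orbital} = {frac_valence[orbital]:.4f}")
```

With these counts, Fe₂O₃ averages 2.0 s, 2.4 p and 2.4 d valence electrons per atom, so the p and d fractions are equal (about 0.353 each).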

In [8]:
# Stoichiometry featurizer
stoich = Stoichiometry()
stoich_feats = stoich.featurize(fe2o3)

print("Stoichiometry features for Fe₂O₃:")
for name, val in zip(stoich.feature_labels(), stoich_feats):
    print(f"  {name:30} = {val:.4f}")

Stoichiometry features for Fe₂O₃:
  0-norm                         = 2.0000
  2-norm                         = 0.7211
  3-norm                         = 0.6542
  5-norm                         = 0.6150
  7-norm                         = 0.6049
  10-norm                        = 0.6010


In [9]:
# Valence orbital features
vo = ValenceOrbital()
vo_feats = vo.featurize(fe2o3)

print("\nValence Orbital features for Fe₂O₃:")
for name, val in zip(vo.feature_labels(), vo_feats):
    print(f"  {name:30} = {val:.4f}")


Valence Orbital features for Fe₂O₃:
  avg s valence electrons        = 2.0000
  avg p valence electrons        = 2.4000
  avg d valence electrons        = 2.4000
  avg f valence electrons        = 0.0000
  frac s valence electrons       = 0.2941
  frac p valence electrons       = 0.3529
  frac d valence electrons       = 0.3529
  frac f valence electrons       = 0.0000


---

## 3. Building a Complete Feature Set

Let's combine multiple featurizers for a comprehensive feature set.

In [10]:
# Load our dataset from Module 2 (or create a sample)
try:
    df = pd.read_csv('../data/sample_datasets/materials_bandgap.csv')
    print(f"Loaded dataset: {len(df)} materials")
except FileNotFoundError:
    # Fall back to a small sample dataset if the CSV file doesn't exist
    sample_data = {
        'formula': ['SiO2', 'TiO2', 'ZnO', 'GaN', 'AlN', 'BN', 'GaAs', 'InP', 
                   'CdS', 'ZnS', 'CuO', 'Fe2O3', 'MgO', 'CaO', 'Al2O3'],
        'band_gap': [8.9, 3.2, 3.3, 3.4, 6.0, 5.5, 1.4, 1.3, 2.4, 3.7, 1.2, 2.2, 7.8, 7.1, 8.8]
    }
    df = pd.DataFrame(sample_data)
    print(f"Created sample dataset: {len(df)} materials")

# Convert formulas to compositions (only if composition column doesn't exist)
if 'composition' not in df.columns:
    stc = StrToComposition()
    df = stc.featurize_dataframe(df, 'formula')
else:
    # The CSV stores compositions as strings, so rebuild Composition objects from the formulas
    df['composition'] = df['formula'].apply(Composition)
    
df.head()

Loaded dataset: 1966 materials


| | material_id | formula | composition | band_gap | formation_energy | energy_above_hull | density | volume | nelements | nsites | spacegroup | element_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mp-11107 | Ac2O3 | (Ac, O) | 3.5226 | -3.737668 | 0.0 | 9.10913 | 91.511224 | 2 | 5 | P-3m1 | frozenset({'O', 'Ac'}) |
| 1 | mp-32800 | Ac2S3 | (Ac, S) | 2.2962 | -2.493064 | 0.0 | 6.535149 | 1118.407852 | 2 | 40 | I-42d | frozenset({'S', 'Ac'}) |
| 2 | mp-1183115 | AcAlO3 | (Ac, Al, O) | 4.1024 | -3.690019 | 0.0 | 8.72823 | 57.451413 | 3 | 5 | Pm-3m | frozenset({'Al', 'O', 'Ac'}) |
| 3 | mp-27972 | AcBr3 | (Ac, Br) | 4.1033 | -2.494519 | 0.0 | 5.679086 | 272.928947 | 2 | 8 | P6_3/m | frozenset({'Br', 'Ac'}) |
| 4 | mp-30274 | AcBrO | (Ac, Br, O) | 4.241 | -3.396186 | 0.0 | 7.65229 | 140.13941 | 3 | 6 | P4/nmm | frozenset({'O', 'Br', 'Ac'}) |


In [11]:
# Apply multiple featurizers
from matminer.featurizers.composition import ElementProperty, Stoichiometry, ValenceOrbital

featurizers = [
    ('ElementProperty (Magpie)', ElementProperty.from_preset('magpie')),
    ('Stoichiometry', Stoichiometry()),
    ('ValenceOrbital', ValenceOrbital()),
]

print("Applying featurizers...")
for name, featurizer in featurizers:
    df = featurizer.featurize_dataframe(df, 'composition', ignore_errors=True)
    print(f"  ✓ {name}: {len(featurizer.feature_labels())} features")

# Count only the generated features; df.shape[1] would also count dataset columns such as material_id
n_generated = sum(len(featurizer.feature_labels()) for _, featurizer in featurizers)
print(f"\n✅ Total generated features: {n_generated}")

Applying featurizers...

  ✓ ElementProperty (Magpie): 132 features
  ✓ Stoichiometry: 6 features
  ✓ ValenceOrbital: 8 features

✅ Total generated features: 146


---

## 4. Feature Selection

With 150+ features, we need to select the most relevant ones!

### Why Feature Selection?

<div style="background: rgba(245, 158, 11, 0.1); padding: 15px; border-radius: 10px; border-left: 4px solid #f59e0b;">

**Problems with too many features:**
1. **Curse of dimensionality**: Need exponentially more data as dimensions increase
2. **Overfitting**: Model learns noise instead of signal
3. **Computational cost**: Training and inference become slower
4. **Interpretability**: Hard to understand which features matter

</div>

### Feature Selection Methods

**1. Variance Threshold**

Remove features with variance below threshold $\tau$:
$$\text{Var}(X_j) = \frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2 < \tau \implies \text{remove } X_j$$

Features with near-zero variance provide no discriminative information.
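The rule is simple enough to write out. The sketch below uses only the standard library and invented toy columns; it keeps a column when its population variance is strictly greater than $\tau$, which is an assumption about how the boundary case is treated.

```python
from statistics import pvariance

def variance_filter(columns, tau):
    """Keep columns (name -> list of values) whose population variance exceeds tau."""
    return {name: vals for name, vals in columns.items() if pvariance(vals) > tau}

# Toy feature columns, invented for illustration
toy_columns = {
    "n_elements": [2, 2, 2, 2, 2],                   # constant: variance 0
    "mean_chi": [2.80, 2.93, 2.61, 2.00, 2.30],      # variance ~0.11
    "tiny_noise": [1.00, 1.01, 1.00, 0.99, 1.00],    # variance 4e-5
}

kept_columns = variance_filter(toy_columns, tau=0.01)
print(sorted(kept_columns))
```

Only `mean_chi` survives. Because variance carries the square of the feature's units, a single threshold such as 0.01 is much stricter for electronegativity-like features than for features such as atomic weight or melting temperature.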

**2. Univariate Selection (F-score)**

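The F-statistic measures how well each feature individually predicts the target. In its general (ANOVA) form it is a ratio of explained to unexplained variability:
$$F = \frac{\text{Between-group variability}}{\text{Within-group variability}} = \frac{MSB}{MSW}$$

For regression with a single feature, as in scikit-learn's `f_regression`, this simplifies to testing correlation significance: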
$$F = \frac{r^2 / 1}{(1-r^2)/(n-2)}$$

where $r$ is the Pearson correlation between feature and target.
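As a quick numerical check of the regression form, the following sketch computes $r$ and $F$ for one feature in plain Python. The toy numbers are invented and the helper names are illustrative; this is not scikit-learn's implementation.

```python
import math

def pearson_r(feature_values, target_values):
    """Pearson correlation between one feature and the target."""
    n = len(feature_values)
    mean_f = sum(feature_values) / n
    mean_t = sum(target_values) / n
    s_ft = sum((f - mean_f) * (t - mean_t) for f, t in zip(feature_values, target_values))
    s_ff = sum((f - mean_f) ** 2 for f in feature_values)
    s_tt = sum((t - mean_t) ** 2 for t in target_values)
    return s_ft / math.sqrt(s_ff * s_tt)

def f_score(feature_values, target_values):
    """F = r^2 / ((1 - r^2) / (n - 2)) for a single feature."""
    n = len(feature_values)
    r = pearson_r(feature_values, target_values)
    return r ** 2 / ((1 - r ** 2) / (n - 2))

toy_feature = [1, 2, 3, 4, 5]
toy_target = [2, 4, 5, 4, 5]

print(f"r^2 = {pearson_r(toy_feature, toy_target) ** 2:.3f}")
print(f"F   = {f_score(toy_feature, toy_target):.3f}")
```

Here $r^2 = 0.6$ and $n = 5$, so $F = 0.6 / (0.4 / 3) = 4.5$. Since $F$ grows with both $r^2$ and $n$, the large scores seen on a dataset of about 2000 materials can correspond to fairly modest correlations.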

**3. Model-Based Selection (Random Forest Importance)**

Feature importance from tree ensembles:
$$I_j = \sum_{\text{trees}} \sum_{\text{nodes using } j} \Delta \text{impurity}$$
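The $\Delta \text{impurity}$ term can be illustrated for a single split. This sketch assumes variance as the impurity measure (a common choice for regression trees) and weights the decrease by the share of samples reaching the node; the function name and toy data are invented. A random forest accumulates such decreases over every node and tree, and scikit-learn additionally normalizes the result so that the importances sum to 1.

```python
from statistics import pvariance

def impurity_decrease(feature_values, target_values, threshold, n_total=None):
    """Weighted variance reduction from splitting one node at `threshold`."""
    left = [t for f, t in zip(feature_values, target_values) if f <= threshold]
    right = [t for f, t in zip(feature_values, target_values) if f > threshold]
    if not left or not right:
        return 0.0  # the split does not separate anything
    n_node = len(target_values)
    n_total = n_total or n_node
    child_impurity = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n_node
    return (n_node / n_total) * (pvariance(target_values) - child_impurity)

# Toy node: the first feature separates the targets perfectly, the second not at all
toy_targets = [1.0, 1.0, 5.0, 5.0]
informative = [0, 0, 1, 1]
uninformative = [0, 1, 0, 1]

gain_informative = impurity_decrease(informative, toy_targets, threshold=0.5)
gain_uninformative = impurity_decrease(uninformative, toy_targets, threshold=0.5)
print(gain_informative, gain_uninformative)
```

The informative feature removes all of the node's variance (a decrease of 4.0), while the uninformative one removes none, so only the former would accumulate importance.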

In [12]:
# Prepare feature matrix
# Exclude identifier, target, and non-numeric columns
# Note: the dataset's other numeric columns (e.g. formation_energy, density, volume) stay in X
exclude_cols = ['formula', 'composition', 'band_gap', 'material_id', 'spacegroup', 'element_group']
feature_cols = [c for c in df.columns if c not in exclude_cols]

# Only keep numeric columns
X = df[feature_cols].select_dtypes(include=[np.number]).copy()
y = df['band_gap'].copy()

# Handle any remaining NaN or inf values
X = X.replace([np.inf, -np.inf], np.nan)
X = X.fillna(X.median())

print(f"Feature matrix shape: {X.shape}")

Feature matrix shape: (1966, 152)


In [13]:
# Method 1: Remove low variance features
var_thresh = VarianceThreshold(threshold=0.01)
X_var = var_thresh.fit_transform(X)
selected_var = X.columns[var_thresh.get_support()].tolist()

print(f"After variance threshold: {X_var.shape[1]} features (removed {X.shape[1] - X_var.shape[1]})")

After variance threshold: 137 features (removed 15)


In [14]:
# Method 2: Select K best features (correlation with target)
k_best = SelectKBest(f_regression, k=min(20, len(selected_var)))
X_kbest = k_best.fit_transform(X[selected_var], y)
selected_kbest_mask = k_best.get_support()
selected_kbest = [selected_var[i] for i in range(len(selected_var)) if selected_kbest_mask[i]]

print(f"Top {len(selected_kbest)} features by correlation with band gap:")
scores = pd.DataFrame({
    'feature': selected_kbest,
    'score': k_best.scores_[selected_kbest_mask]
}).sort_values('score', ascending=False)

print(scores.to_string(index=False))

Top 20 features by correlation with band gap:
                             feature      score
            frac p valence electrons 720.514221
                 MagpieData mean Row 710.045582
                    formation_energy 598.617904
            frac d valence electrons 581.131543
      MagpieData mean CovalentRadius 555.936820
              MagpieData mean Number 543.497150
        MagpieData mean AtomicWeight 503.452335
   MagpieData minimum CovalentRadius 442.001710
MagpieData maximum Electronegativity 420.187504
      MagpieData mode CovalentRadius 414.636286
           MagpieData mean NdValence 411.307817
             avg d valence electrons 411.307817
                 MagpieData mode Row 375.504952
              MagpieData minimum Row 370.070094
         MagpieData minimum MeltingT 369.204793
            MagpieData mode MeltingT 363.686276
   MagpieData mean Electronegativity 354.239876
  MagpieData range Electronegativity 309.547892
        MagpieData avg_dev NdValence 304.1

In [None]:
# Visualize top features with professional styling
fig, ax = plt.subplots(figsize=(12, 8), facecolor='white')

# Color palette
colors_palette = {
    'primary': '#6366f1',
    'secondary': '#0ea5e9',
    'tertiary': '#10b981',
    'text': '#1e293b',
    'grid': '#e2e8f0'
}

top_n = min(15, len(scores))

# Create gradient colors
cmap = plt.cm.viridis
bar_colors = [cmap(0.2 + 0.6 * (i / top_n)) for i in range(top_n)]

# Create horizontal bar chart
bars = ax.barh(range(top_n), scores['score'].values[:top_n], 
               color=bar_colors, edgecolor='white', linewidth=0.5, alpha=0.85)

# Customize axes
ax.set_yticks(range(top_n))
ax.set_yticklabels(scores['feature'].values[:top_n], fontsize=11, color=colors_palette['text'])
ax.set_xlabel('F-Score (Correlation with Band Gap)', fontsize=12, color=colors_palette['text'])
ax.set_title('Top Features for Band Gap Prediction', fontsize=16, fontweight='bold', 
             color=colors_palette['text'], pad=15)
ax.invert_yaxis()

# Add value labels at the end of bars
for bar, score in zip(bars, scores['score'].values[:top_n]):
    ax.text(bar.get_width() + 5, bar.get_y() + bar.get_height()/2, 
            f'{score:.0f}', va='center', ha='left', fontsize=10, color=colors_palette['text'])

# Styling
ax.set_facecolor('white')
ax.grid(True, axis='x', alpha=0.3, color=colors_palette['grid'])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_xlim(0, max(scores['score'].values[:top_n]) * 1.12)

# Add annotation box
ax.text(0.98, 0.02, 'Higher F-score = stronger\ncorrelation with band gap', 
        transform=ax.transAxes, fontsize=10, ha='right', va='bottom',
        color=colors_palette['text'], style='italic',
        bbox=dict(boxstyle='round,pad=0.4', facecolor='#f1f5f9', edgecolor=colors_palette['grid']))

plt.tight_layout()
plt.savefig('figures/03_top_features.png', dpi=200, bbox_inches='tight', facecolor='white')
plt.show()

print("Figure saved to figures/03_top_features.png")

---

## 📝 Exercises

### Exercise 1: Compare Featurizer Presets

ElementProperty has multiple presets. Compare them!

In [16]:
# Exercise 1: Compare presets
presets = ['magpie', 'matminer', 'deml']

# TODO: For each preset, create a featurizer and count features
# for preset in presets:
#     ep = ElementProperty.from_preset(preset)
#     print(f"{preset}: {len(ep.feature_labels())} features")

### Exercise 2: Feature Importance with Random Forest

In [17]:
# Exercise 2: Train a Random Forest and extract feature importances
# rf = RandomForestRegressor(n_estimators=100, random_state=42)
# rf.fit(X[selected_kbest], y)

# TODO: Get feature importances and plot them
# importance_df = pd.DataFrame({
#     'feature': selected_kbest,
#     'importance': rf.feature_importances_
# }).sort_values('importance', ascending=False)

---

## ✅ Module Summary

### Key Takeaways

1. **Featurization** converts materials into numerical vectors for ML
2. **matminer** provides 70+ featurizers for compositions and structures
3. **ElementProperty (Magpie)** is the go-to composition featurizer (132 features)
4. **Feature selection** is crucial with many features (variance threshold, SelectKBest)
5. **Combine multiple featurizers** for comprehensive feature sets

### What's Next?

In **Module 4: Classical ML Models**, you'll learn to:
- Train various ML algorithms (Linear, RF, XGBoost)
- Compare model performance
- Understand when to use each model type

---

**📚 Continue to Module 4:** [Classical ML Models](04_classical_ml_models.ipynb)