# Day 1 — Materials Science AI Workshop

Welcome! Today we will:
1. Verify your Python/Jupyter setup.
2. Refresh core Python, NumPy, Pandas, and Matplotlib skills.
3. Do a mini materials exercise using simple composition features.


## 0) Environment check
Run the cell below. If it prints a Python version and no import errors, you're ready.


In [None]:
# Import every package used today; a missing one raises ImportError before "Ready!" prints.
import sys, numpy, pandas, matplotlib, sklearn, ase, pymatgen, matminer
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from pymatgen.core.periodic_table import Element
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
print("Ready! Python:", sys.version.split()[0])


## 1) Python recap
**Task A.** Create a list of the first 5 elements you can think of and a dict mapping element → atomic number.

**Task B.** Write a function `parse_percent_formula(formula)` that converts strings like `'Al50Cu50'` → `{'Al': 0.5, 'Cu': 0.5}`.
Hint: use a regular expression to split element symbols from numbers, then divide each number by the total so the fractions sum to 1.


In [None]:
# Your turn — Task A
elements = ['Al','Cu','Fe','Ni','Mg']  # edit/add
atomic_numbers = {e: Element(e).Z for e in elements}
elements, atomic_numbers


In [None]:
# Your turn — Task B
import re
def parse_percent_formula(formula: str):
    parts = re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)", formula)
    if not parts:
        raise ValueError("Could not parse formula (expected e.g. 'Al50Cu50')")
    total = sum(float(p[1]) for p in parts)
    if total == 0:
        raise ValueError("Total percentage is zero")
    comp = {p[0]: float(p[1])/total for p in parts}
    return comp

# quick check
parse_percent_formula('Al50Cu50'), parse_percent_formula('Fe50Ni30Cr20')


## 2) NumPy warmup
- Make a vector `x = linspace(0, 2π, 200)` and compute `sin(x)`.
- Compute a simple moving average with window size 5 using convolution (`np.convolve`).
- Plot both.


In [None]:
x = np.linspace(0, 2*np.pi, 200)
y = np.sin(x)
window = np.ones(5)/5
y_smooth = np.convolve(y, window, mode='same')
plt.figure()
plt.plot(x, y, label='sin(x)')
plt.plot(x, y_smooth, label='moving avg')
plt.legend()
plt.title('NumPy warmup')
plt.xlabel('x'); plt.ylabel('value');
plt.show()
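
One detail worth checking in the warmup above: with `mode='same'`, `np.convolve` treats values beyond the ends of the signal as zeros, so the first and last couple of smoothed points are pulled toward zero. A minimal sketch, assuming a constant test signal (illustrative only, not part of the exercise) to make the effect visible:

```python
import numpy as np

signal = np.ones(10)         # constant test signal (illustrative assumption)
window = np.ones(5) / 5      # same 5-point averaging window as in the warmup

# 'same' keeps the input length but averages in implicit zeros at the edges.
same = np.convolve(signal, window, mode='same')
# 'valid' keeps only positions where the window fully overlaps the signal.
valid = np.convolve(signal, window, mode='valid')

print(len(same), same[:3])   # edge values fall below 1
print(len(valid), valid[:3]) # shorter output, no edge bias
```

If the edge dip matters for a plot, `mode='valid'` avoids it at the cost of a shorter output, so the x-axis has to be trimmed to match.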


## 3) Pandas basics
We will create a tiny alloys table to practice. Each row is a composition string and a short label.


In [None]:
import io
csv_text = '''composition,label
Al50Cu50,binary
Fe50Ni50,binary
Mg60Zn40,binary
Al70Mg30,binary
Cu50Ni50,binary
Ti50Al50,binary
Fe40Ni40Cr20,ternary
Al33Mg33Zn34,ternary
Fe34Co33Ni33,ternary
Al25Fe25Ni50,ternary
Mg20Al40Zn40,ternary
Cu30Ni50Zn20,ternary
'''
df = pd.read_csv(io.StringIO(csv_text))
df.head()


**Try it:**
- How many rows? How many binary vs. ternary? (`value_counts`) 
- Filter rows that contain `'Fe'`.


In [None]:
len(df), df['label'].value_counts(), df[df['composition'].str.contains('Fe')]
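
Note that `str.contains('Fe')` is a plain substring match. That is fine for `Fe` in this table, but a single-letter symbol can over-match (for example, `'N'` also matches `Ni`). A minimal sketch of an element-aware filter, assuming a hypothetical helper `has_element` and a small stand-in table `toy` rather than the `df` above:

```python
import re
import pandas as pd

# Small stand-in table (illustrative assumption, not the workshop data).
toy = pd.DataFrame({'composition': ['Fe50Ni50', 'Cu50Ni50', 'Ti50Al50']})

def has_element(formula: str, symbol: str) -> bool:
    """Return True if `symbol` appears as a whole element symbol in `formula`."""
    return symbol in re.findall(r"[A-Z][a-z]?", formula)

# Substring match: 'N' is found inside 'Ni', so nickel alloys are returned.
substring_hits = toy[toy['composition'].str.contains('N')]
# Element-aware match: no row actually contains nitrogen.
element_hits = toy[toy['composition'].map(lambda s: has_element(s, 'N'))]

print(len(substring_hits), len(element_hits))
```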


## 4) Materials-flavored features
We'll convert each composition string into element fractions and compute simple features using periodic table data from pymatgen's `Element` class (`pymatgen.core.Element`):
- Weighted mean electronegativity (Pauling) → `mean_X`
- Weighted variance of electronegativity → `var_X`
- Weighted mean atomic mass → `mean_mass`
- Weighted mean Mendeleev number → `mean_M`

Then, we'll fabricate a **toy target** as a function of these features to practice ML (no internet needed).
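
To make the weighted statistics concrete: for fractions `w_i` that sum to 1 and property values `x_i`, the weighted mean is `sum(w_i * x_i)` and the weighted variance is `sum(w_i * (x_i - mean)**2)`. A minimal worked sketch for `Al50Cu50`, assuming hard-coded Pauling electronegativities (Al ≈ 1.61, Cu ≈ 1.90) so it runs without pymatgen:

```python
import numpy as np

fracs = np.array([0.5, 0.5])    # atomic fractions of Al and Cu
chi = np.array([1.61, 1.90])    # Pauling electronegativities (hard-coded assumption)

# Weighted mean: each element contributes in proportion to its fraction.
mean_X = np.average(chi, weights=fracs)
# Weighted variance: fraction-weighted squared deviation from the weighted mean.
var_X = np.average((chi - mean_X) ** 2, weights=fracs)

print(mean_X, var_X)
```

For an equal-fraction binary the mean is simply the midpoint of the two values, and the variance is the squared half-difference between them.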


In [None]:
def comp_to_features(comp_str: str):
    comp = parse_percent_formula(comp_str)
    els, fracs = list(comp.keys()), list(comp.values())
    X = []  # electronegativities
    mass = []
    M = []  # mendeleev numbers
    for e in els:
        el = Element(e)
        X.append(el.X if el.X is not None else np.nan)
        mass.append(el.atomic_mass if el.atomic_mass is not None else np.nan)
        M.append(el.mendeleev_no if el.mendeleev_no is not None else np.nan)
    X = np.array(X, dtype=float)
    mass = np.array(mass, dtype=float)
    M = np.array(M, dtype=float)
    w = np.array(fracs, dtype=float)
    def wmean(arr):
        mask = ~np.isnan(arr)
        if not mask.any():
            return np.nan
        ww = w[mask]
        ww = ww/ww.sum()
        return float((arr[mask]*ww).sum())
    def wvar(arr):
        m = wmean(arr)
        if np.isnan(m):
            return np.nan
        mask = ~np.isnan(arr)
        ww = w[mask]
        ww = ww/ww.sum()
        return float(((arr[mask]-m)**2 * ww).sum())
    return {
        'mean_X': wmean(X),
        'var_X': wvar(X),
        'mean_mass': wmean(mass),
        'mean_M': wmean(M),
    }

feat_df = df['composition'].apply(comp_to_features).apply(pd.Series)
data = pd.concat([df, feat_df], axis=1)
data


## 5) Plotting
Make a scatter of `mean_X` vs `mean_mass`. Color or marker by `label` if you like.


In [None]:
plt.figure()
for grp, g in data.groupby('label'):
    plt.scatter(g['mean_X'], g['mean_mass'], label=grp)
plt.xlabel('Weighted mean electronegativity (Pauling)')
plt.ylabel('Weighted mean atomic mass (amu)')
plt.title('Alloy feature space (toy)')
plt.legend()
plt.show()


## 6) (Challenge) Toy ML model
We'll create a toy target `y` from the features and train a Linear Regression.
This is just to practice the ML workflow (train/test split, fit, score).

In [None]:
rng = np.random.default_rng(0)
y = 2*data['mean_X'] + 0.01*data['mean_mass'] - 0.5*data['var_X'] + rng.normal(0, 0.1, len(data))
X = data[['mean_X','mean_mass','var_X','mean_M']].values
mask = ~np.isnan(X).any(axis=1)
X = X[mask]
y = np.asarray(y)[mask]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print('R^2 train:', model.score(X_train, y_train))
print('R^2 test :', model.score(X_test, y_test))
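
With only a handful of test rows, the scores above can swing a lot between random splits. One way to sanity-check the workflow is to fit on a larger synthetic sample and compare the learned coefficients with the ones used to build the target. A minimal sketch, assuming made-up feature ranges and hypothetical `demo_*` names (none of this comes from the alloy table):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

demo_rng = np.random.default_rng(0)
n = 200                                        # far more rows than the 12-row table

# Synthetic stand-ins for mean_X, mean_mass, var_X (ranges are assumptions).
demo_features = np.column_stack([
    demo_rng.uniform(1.3, 1.9, n),
    demo_rng.uniform(25.0, 60.0, n),
    demo_rng.uniform(0.0, 0.05, n),
])
true_coefs = np.array([2.0, 0.01, -0.5])       # same weights as the toy target above
demo_target = demo_features @ true_coefs + demo_rng.normal(0, 0.1, n)

demo_model = LinearRegression().fit(demo_features, demo_target)
for name, true, est in zip(['mean_X', 'mean_mass', 'var_X'], true_coefs, demo_model.coef_):
    print(f"{name}: true={true}, fitted={est:.3f}")
```

Coefficients on features that vary over a narrow range relative to the noise (such as `var_X` here) are generally estimated less precisely than the others.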


### Stretch: Matminer composition featurizers (optional)
If you finish early, try `matminer.featurizers.composition.ElementProperty` on a `pymatgen.core.Composition` and compare its features with the hand-built ones from section 4.

In [None]:
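import re
from pymatgen.core.composition import Composition
from matminer.featurizers.composition import ElementProperty

def percent_to_fraction(comp_str):
    # Convert e.g. 'Al50Cu50' into a formula string with fractional amounts ('Al0.5 Cu0.5').
    parts = re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)", comp_str)
    total = sum(float(p[1]) for p in parts)
    fracs = [float(p[1])/total for p in parts]
    return ' '.join(f"{el}{frac}" for (el, _), frac in zip(parts, fracs))

comp_series = df['composition'].apply(percent_to_fraction).apply(Composition)
# ElementProperty requires a data_source; 'magpie' is assumed here, with Magpie's feature and stat names.
featurizer = ElementProperty(data_source='magpie', features=['Electronegativity','AtomicWeight','MendeleevNumber'], stats=['mean','std_dev'])
# Pass a plain list of Composition objects to featurize_many.
matminer_feats = featurizer.featurize_many(comp_series.tolist(), pbar=False)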
matminer_df = pd.DataFrame(matminer_feats, columns=featurizer.feature_labels())
pd.concat([df, matminer_df], axis=1).head()


Great work! Save this notebook as your Day 1 deliverable.