### 🧠 Final Model Summary – Predicting Either Primary or Secondary Type

As an exploratory variant of our multiclass typing task, we tested how often a model predicts **either a Pokémon’s primary or secondary type**. A prediction counts as correct if it matches **either label**, so dual-type Pokémon have two acceptable answers.

---

#### 📁 Dataset Overview

- **Kaggle Pokemon Database**: https://www.kaggle.com/datasets/mrdew25/pokemon-database 
- **Total Pokémon used**: `1,382` (including alternate forms)  
- **Evaluation style**: A prediction was considered correct if it matched **either the primary or secondary type** listed in the dataset
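
The evaluation rule above can be sketched on a few rows of toy data. This is a minimal illustration, not the notebook's own code: the function names `either_match_accuracy` and `strict_accuracy` are hypothetical, and the sample types are made up rather than drawn from the dataset.

```python
def either_match_accuracy(predicted, primary, secondary):
    """Fraction of predictions equal to the primary or the secondary type."""
    hits = [
        pred == prim or (sec is not None and pred == sec)
        for pred, prim, sec in zip(predicted, primary, secondary)
    ]
    return sum(hits) / len(hits)


def strict_accuracy(predicted, primary):
    """Fraction of predictions equal to the primary type only."""
    hits = [pred == prim for pred, prim in zip(predicted, primary)]
    return sum(hits) / len(hits)


# Toy data, invented for illustration; None marks a single-type Pokémon.
predicted = ["Fire", "Flying", "Water", "Grass"]
primary = ["Fire", "Normal", "Rock", "Grass"]
secondary = [None, "Flying", "Ground", "Poison"]

strict = strict_accuracy(predicted, primary)
either = either_match_accuracy(predicted, primary, secondary)
print(f"strict={strict:.2f}, either={either:.2f}")
```

The second row is the only one where the two metrics disagree: the prediction misses the primary type but hits the secondary one, so it counts under the either-match rule only.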

---

#### ✅ Model Setup
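- **Model**: `MLPClassifier` with hidden layers `(256, 128)`  
- **Feature set**: Same as in other typing tasks: base stats and their share of the base stat total, categorical traits (primary ability, egg groups, classification), and a pre-evolution flag  
- **Target**: The model is trained on `Primary Type` only; at evaluation time a prediction is accepted if it matches either entry of `[Primary Type, Secondary Type]`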

---

#### 📊 Result
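- **Accuracy**: **`~76.9%`** on the held-out 20% test split
  - Higher than strict primary-only classification, as expected: every primary match also counts as an either-match, so this score cannot fall below strict accuracy for the same predictions
  - No Cohen's Kappa or confusion matrix is reported, because each Pokémon can have two acceptable labels and so there is no single true class to tabulate predictions against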
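
If a confusion matrix or Kappa is still wanted, one possible workaround is to resolve a single "effective" true label per sample: the secondary type when the prediction hits it, the primary type otherwise. The sketch below is an assumption for illustration and is not what this notebook does. The helper names are hypothetical and the data is made up. Because the true label is chosen after seeing the prediction, the resulting Kappa is optimistic compared with a fixed-label evaluation.

```python
from collections import Counter


def resolve_effective_labels(predicted, primary, secondary):
    """Pick the secondary type as the true label when the prediction matches it,
    otherwise fall back to the primary type."""
    return [
        sec if (sec is not None and pred == sec) else prim
        for pred, prim, sec in zip(predicted, primary, secondary)
    ]


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: (observed - expected agreement) / (1 - expected agreement).
    Undefined (division by zero) when expected agreement equals 1."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy data, invented for illustration; None marks a single-type Pokémon.
predicted = ["Fire", "Flying", "Water", "Grass"]
primary = ["Fire", "Normal", "Rock", "Grass"]
secondary = [None, "Flying", "Ground", "Poison"]

effective = resolve_effective_labels(predicted, primary, secondary)
confusion = Counter(zip(effective, predicted))  # (true, predicted) -> count
kappa = cohen_kappa(effective, predicted)
print(effective)
print(f"kappa={kappa:.3f}")
```

The `confusion` counter holds the same information as a confusion matrix in sparse form, keyed by `(true, predicted)` pairs.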

---

#### 🎯 Takeaway
- The **either-or type task is easier** than strict primary typing
- Useful for **simplified matching or fuzzy classification** tasks where partial correctness is acceptable
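
To judge how much easier the either-or task is, it can help to compare against a constant baseline: always predicting the single type that appears most often as a primary or secondary type. The sketch below is a hedged illustration with a hypothetical helper name and made-up rows. No baseline figure was computed for the real dataset here.

```python
from collections import Counter


def best_constant_either_accuracy(primary, secondary):
    """Either-match accuracy of always predicting the one type that covers
    the most rows, counting a row once per distinct type it carries."""
    coverage = Counter()
    for prim, sec in zip(primary, secondary):
        for type_name in {prim, sec} - {None}:
            coverage[type_name] += 1
    best_type, hits = coverage.most_common(1)[0]
    return best_type, hits / len(primary)


# Toy data, invented for illustration; None marks a single-type Pokémon.
primary = ["Water", "Normal", "Rock", "Grass", "Bug"]
secondary = [None, "Flying", "Water", "Water", "Flying"]

best_type, baseline = best_constant_either_accuracy(primary, secondary)
print(f"always predicting {best_type}: either-match accuracy {baseline:.2f}")
```

Under strict scoring the same constant guess would only be right for rows whose primary type matches, so the gap between the two baselines gives a rough sense of how much the relaxed rule inflates scores on its own.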


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
import numpy as np

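# Load and clean
df = pd.read_csv("../../data/pokemon_database.csv")
# Strip stray double quotes from string cells; column-wise Series.map avoids
# the deprecated DataFrame.applymap and its FutureWarning
df = df.apply(lambda col: col.map(lambda x: x.strip('"') if isinstance(x, str) else x))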
# df = df[df['Alternate Form Name'].isnull()]
df['Classification'] = df['Classification'].str.replace("(?i)\\s*pokemon\\s*", "", regex=True).str.strip()

# Features
features = [
    'Primary Ability', 'Primary Egg Group', 'Secondary Egg Group',
    'Classification', 'Health Stat', 'Attack Stat', 'Defense Stat',
    'Special Attack Stat', 'Special Defense Stat', 'Speed Stat', 'Base Stat Total'
]
df['is_pre_evolution'] = df['Pokemon Id'].isin(df['Pre-Evolution Pokemon Id'].dropna().astype(int))
features.append('is_pre_evolution')

# Normalized stats
for stat in ['Health Stat', 'Attack Stat', 'Defense Stat',
             'Special Attack Stat', 'Special Defense Stat', 'Speed Stat']:
    pct = f"{stat}_pct"
    df[pct] = (df[stat] / df['Base Stat Total']) * 100
    features.append(pct)

df = df.dropna(subset=['Primary Type'])

# Inputs and targets
X = df[features]
y_primary = df['Primary Type']
# Single-type Pokémon (NaN or empty string) get an explicit 'None' placeholder
y_secondary = df['Secondary Type'].fillna('').replace('', 'None')

# Fit one encoder on the combined label space so both targets share integer codes
label_encoder = LabelEncoder()
label_encoder.fit(pd.concat([y_primary, y_secondary]))
y_primary_encoded = label_encoder.transform(y_primary)
y_secondary_encoded = label_encoder.transform(y_secondary)

# Train/test split
X_train, X_test, y_train, y_test, y_sec_train, y_sec_test = train_test_split(
    X, y_primary_encoded, y_secondary_encoded,
    test_size=0.2, stratify=y_primary_encoded, random_state=42
)

# Preprocessing
categorical = ['Primary Ability', 'Primary Egg Group', 'Secondary Egg Group', 'Classification']
numerical = [col for col in X.columns if col not in categorical]
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
    ('num', 'passthrough', numerical)
])

model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500, learning_rate_init=0.001, random_state=42))
])

# Fit
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Decode for logical accuracy check
y_pred_labels = label_encoder.inverse_transform(y_pred)
y_true_primary = label_encoder.inverse_transform(y_test)
y_true_secondary = label_encoder.inverse_transform(y_sec_test)

# Accuracy: predicted type matches either primary or secondary
correct = np.array([
    pred == primary or pred == secondary
    for pred, primary, secondary in zip(y_pred_labels, y_true_primary, y_true_secondary)
])
either_accuracy = correct.mean()
print(f"Accuracy (either primary or secondary match): {either_accuracy:.3f}")



Accuracy (either primary or secondary match): 0.769
