# **The Whispering Blight of Eldoria**

**Royal Scrivener:** `[Ardalan Siavashpour]`

**Ledger ID:** `[99109896]`

## 1. The Royal Decree (Project Overview)

**Hark, Royal Scrivener! Your kingdom needs your sharp mind.**

The Grand Pilgrimage to the Sunstone has been struck by a mysterious magical malady, the "Whispering Blight," causing nearly half of the pilgrims to vanish. The King has tasked you with using the recovered caravan manifests to uncover the pattern behind the Blight.

### Your Mission:

Your goal is to predict which pilgrims vanished (`Vanished` column). Your performance will be judged on your ability to skillfully apply techniques such as **data imputation, feature engineering, feature encoding, and normalization/scaling**.

## 2. Assembling the Ledgers (Setup and Data Loading)

First, we must gather our tools and inspect the recovered caravan manifests.

In [1]:
# --- Essential Scribe's Tools ---
import pandas as pd
import numpy as np


# --- Set the Royal Seal for Reproducibility ---
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

In [2]:
# --- Load the Caravan Manifests ---
df_train = pd.read_csv('whispering_blight_of_eldoria_train.csv')
df_test = pd.read_csv('whispering_blight_of_eldoria_test.csv')

print("--- Training Ledger ---")
display(df_train.head())
df_train.info()

--- Training Ledger ---


| | PilgrimId | HomeRealm | MeditativeTrance | Wagon | DestinationSanctuary | Age | NobleBlood | TavernBill | MarketSpend | ArtisanGuilds | HealersHut | DivinationDen | Name | Vanished |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Whispering Coast | False | B/0/P | The Gilded Spire | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Sunstone Dominion | False | F/0/S | The Gilded Spire | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Whispering Coast | False | A/0/S | The Gilded Spire | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Whispering Coast | False | A/0/S | The Gilded Spire | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Sunstone Dominion | False | F/1/S | The Gilded Spire | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   PilgrimId             8693 non-null   object 
 1   HomeRealm             8492 non-null   object 
 2   MeditativeTrance      8476 non-null   object 
 3   Wagon                 8494 non-null   object 
 4   DestinationSanctuary  8511 non-null   object 
 5   Age                   8514 non-null   float64
 6   NobleBlood            8490 non-null   object 
 7   TavernBill            8512 non-null   float64
 8   MarketSpend           8510 non-null   float64
 9   ArtisanGuilds         8485 non-null   float64
 10  HealersHut            8510 non-null   float64
 11  DivinationDen         8505 non-null   float64
 12  Name                  8493 non-null   object 
 13  Vanished              8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


## 3. Deciphering the Records (EDA & Preprocessing)

Before we can find a pattern, we must understand, clean, and enhance the records.

**TODO:** Perform your Exploratory Data Analysis and Preprocessing in the cells below. This should include:
1.  **Analysis:** Investigate the data. Check the target distribution, look for correlations, and understand the data types.
2.  **Missing Values:** Develop a strategy for handling missing data.
3.  **Feature Engineering:** Create new, more informative features from the existing columns.
4.  **Data Cleaning:** Drop any columns you deem unnecessary for prediction.

Remember to apply any transformations consistently across both the training and test sets.
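
As one possible direction for the feature-engineering step, the sketch below splits the composite `PilgrimId` and `Wagon` strings into simpler parts. It assumes, based only on sample values such as `0003_02` and `A/0/S`, that IDs follow a `<group>_<member>` layout and wagons a `<section>/<number>/<side>` layout. The helper `add_structural_features` and the new column names are hypothetical, and the toy rows stand in for the real manifests.

```python
import pandas as pd

# Toy rows mirroring the sample manifest; real data would come from df_train.
toy = pd.DataFrame({
    "PilgrimId": ["0003_01", "0003_02", "0004_01", "0005_01"],
    "Wagon": ["A/0/S", "A/0/S", "F/1/S", None],
})

def add_structural_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: split composite ID and wagon strings into parts."""
    out = df.copy()
    # Assumed layout "<section>/<number>/<side>"; missing wagons stay missing.
    parts = out["Wagon"].str.split("/", expand=True)
    out["WagonSection"] = parts[0]
    out["WagonNumber"] = pd.to_numeric(parts[1], errors="coerce")
    out["WagonSide"] = parts[2]
    # Assumed layout "<group>_<member>"; count the rows that share a group prefix.
    out["TravelGroup"] = out["PilgrimId"].str.split("_").str[0]
    out["GroupSize"] = out.groupby("TravelGroup")["PilgrimId"].transform("count")
    return out

features = add_structural_features(toy)
print(features[["WagonSection", "WagonNumber", "WagonSide", "GroupSize"]])
```

Splitting `Wagon` this way would replace one high-cardinality categorical column with two low-cardinality ones and a number. If the helper is applied to the training and test sets separately, `GroupSize` only counts members within each file, which may undercount groups that span both.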

In [5]:
# --- 3. Deciphering the Records (EDA & Preprocessing) ---

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# =====================
# Basic EDA
# =====================

print("Target distribution (Vanished):")
print(df_train["Vanished"].value_counts(normalize=True))

print("\nMissing values per column:")
print(df_train.isna().sum().sort_values(ascending=False))

print("\nNumeric feature summary:")
print(df_train.describe(include=[np.number]))

print("\nCategorical feature sample:")
print(df_train.select_dtypes(include=["object"]).head())

# =====================
# Feature Engineering
# =====================

spend_cols = ["TavernBill", "MarketSpend", "ArtisanGuilds",
              "HealersHut", "DivinationDen"]

# Total spend across all mystical services (sum skips NaN, so missing bills count as 0)
df_train["TotalSpend"] = df_train[spend_cols].sum(axis=1)
df_test["TotalSpend"] = df_test[spend_cols].sum(axis=1)

# Name-based features (handle cases where 'Name' may be missing)
for df in [df_train, df_test]:
    if "Name" in df.columns:
        df["Name"] = df["Name"].fillna("")
        df["NameLength"] = df["Name"].str.len()
        df["NameWordCount"] = df["Name"].str.split().str.len()
    else:
        df["NameLength"] = 0
        df["NameWordCount"] = 0
    # Flag for missing Age (useful for the model)
    df["AgeMissing"] = df["Age"].isna().astype(int)

# Preserve IDs for later submission
train_ids = df_train["PilgrimId"].copy()
test_ids = df_test["PilgrimId"].copy()

# Drop columns not directly useful as features
cols_to_drop = ["PilgrimId"]
if "Name" in df_train.columns:
    cols_to_drop.append("Name")

df_train = df_train.drop(columns=cols_to_drop)
df_test = df_test.drop(columns=[c for c in cols_to_drop if c in df_test.columns])

# =====================
# Define Features / Target
# =====================

y = df_train["Vanished"].copy()
X = df_train.drop(columns=["Vanished"])
X_test = df_test.copy()

numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(exclude=[np.number]).columns.tolist()

print("\nNumeric features:", numeric_features)
print("Categorical features:", categorical_features)

# =====================
# Preprocessing Pipelines
# =====================

numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Fit the preprocessor on training data and transform both train and test
X_processed = preprocessor.fit_transform(X)
X_test_processed = preprocessor.transform(X_test)

print("\nProcessed training feature matrix shape:", X_processed.shape)
print("Processed test feature matrix shape:", X_test_processed.shape)

Target distribution (Vanished):
Vanished
True     0.503624
False    0.496376
Name: proportion, dtype: float64

Missing values per column:
MeditativeTrance        217
ArtisanGuilds           208
NobleBlood              203
HomeRealm               201
Wagon                   199
DivinationDen           188
MarketSpend             183
HealersHut              183
DestinationSanctuary    182
TavernBill              181
Age                     179
TotalSpend                0
NameWordCount             0
NameLength                0
PilgrimId                 0
Vanished                  0
Name                      0
AgeMissing                0
dtype: int64

Numeric feature summary:
               Age    TavernBill   MarketSpend  ArtisanGuilds    HealersHut  \
count  8514.000000   8512.000000   8510.000000    8485.000000   8510.000000   
mean     28.827930    224.687617    458.077203     173.729169    311.138778   
std      14.489021    666.717663   1611.489240     604.696458   1136.705535   
min

## 4. Forging the Ward (Model Building & Evaluation)

Now, build and test your predictive model.

**TODO:** Use **Stratified K-Fold Cross-Validation** to get a reliable estimate of your model's performance. Your workflow in the cell below should:
1.  Define your features (`X`) and target (`y`).
2.  Set up the cross-validation splitter.
3.  Inside the loop, for each fold:
    *   Preprocess the training and validation data (impute, scale, encode).
    *   Train a model.
    *   Make predictions and evaluate performance (e.g., accuracy).
4.  Calculate and print the average performance across all folds.

Experiment with different models and preprocessing steps to find the best combination.
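
The per-fold workflow above can also be expressed more compactly. The sketch below is an alternative to a hand-written loop, not the required solution: it wraps preprocessing and the model in a single `Pipeline` and passes it to scikit-learn's `cross_val_score`, which clones and refits the whole pipeline on each training fold. The data is synthetic and every `demo_` name is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the manifests (assumption: not the real data).
rng = np.random.default_rng(42)
n = 200
demo_X = pd.DataFrame({
    "Age": rng.normal(30, 10, n),
    "TotalSpend": rng.exponential(300, n),
    "HomeRealm": rng.choice(["Whispering Coast", "Sunstone Dominion"], n),
})
demo_X.loc[::17, "Age"] = np.nan          # inject a few gaps to impute
demo_y = demo_X["TotalSpend"] < 200       # toy boolean target

demo_numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
demo_categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
demo_pre = ColumnTransformer([
    ("num", demo_numeric, ["Age", "TotalSpend"]),
    ("cat", demo_categorical, ["HomeRealm"]),
])
demo_clf = Pipeline([
    ("preprocessor", demo_pre),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Each fold gets its own clone of demo_clf, fitted on that fold's training rows.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(demo_clf, demo_X, demo_y, cv=cv, scoring="accuracy")
print("Fold accuracies:", np.round(scores, 4))
print("Mean accuracy:", scores.mean())
```

Because imputation medians, scaling statistics, and one-hot categories are all learned inside the pipeline, the validation fold never influences them. An explicit loop remains the better choice when per-fold diagnostics are needed.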


In [9]:
# --- 4. Forging the Ward (Model Building & Evaluation) ---

from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

# X and y were already defined in the previous cell

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

fold_accuracies = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    X_train_fold = X.iloc[train_idx]
    y_train_fold = y.iloc[train_idx]
    X_val_fold = X.iloc[val_idx]
    y_val_fold = y.iloc[val_idx]

    # Build a fresh pipeline for each fold. clone() returns an unfitted copy of the
    # preprocessor, so imputation and scaling are learned from this fold's training rows only.
    model = RandomForestClassifier(
        n_estimators=200,
        random_state=RANDOM_SEED,
        n_jobs=-1
    )

    clf = Pipeline(
        steps=[
            ("preprocessor", clone(preprocessor)),  # unfitted copy of the previous cell's preprocessor
            ("model", model),
        ]
    )

    # Fit on training fold
    clf.fit(X_train_fold, y_train_fold)

    # Predict on validation fold
    y_val_pred = clf.predict(X_val_fold)

    # Evaluate
    acc = accuracy_score(y_val_fold, y_val_pred)
    fold_accuracies.append(acc)
    print(f"Fold {fold} accuracy: {acc:.4f}")

print("\nAverage CV accuracy:", np.mean(fold_accuracies))

Fold 1 accuracy: 0.8033
Fold 2 accuracy: 0.7821
Fold 3 accuracy: 0.7987
Fold 4 accuracy: 0.7940
Fold 5 accuracy: 0.7906

Average CV accuracy: 0.7937417573291531


## 5. Casting the Final Spell (Final Training & Prediction)

**TODO:** Train your best model configuration on **all** the training data and use it to predict the `Vanished` status for the pilgrims in the test set.

In [10]:
# --- 5. Casting the Final Spell (Final Training & Prediction) ---

from sklearn.ensemble import RandomForestClassifier

# Define the final model (same config that worked in CV)
final_model = RandomForestClassifier(
    n_estimators=200,
    random_state=RANDOM_SEED,
    n_jobs=-1
)

final_clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),  # from previous cell
        ("model", final_model),
    ]
)

# Train on all available training data
final_clf.fit(X, y)

# Predict on the test set
y_test_pred = final_clf.predict(X_test)

# Build submission DataFrame
if test_ids is not None:
    submission = pd.DataFrame({
        "PilgrimId": test_ids,
        "Vanished": y_test_pred
    })
else:
    # Fallback in case IDs were not preserved
    submission = pd.DataFrame({
        "PilgrimId": np.arange(len(y_test_pred)),
        "Vanished": y_test_pred
    })

# Save to CSV
submission.to_csv("submission.csv", index=False)

print("Submission file 'submission.csv' created.")
display(submission.head())

Submission file 'submission.csv' created.


| | PilgrimId | Vanished |
|---|---|---|
| 0 | 0 | True |
| 1 | 1 | True |
| 2 | 2 | False |
| 3 | 3 | True |
| 4 | 4 | False |


## 7. Submitting the Royal Ledger

Prepare the final ledger for the King.
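
Before handing the ledger over, it can help to confirm that it has the expected shape. The sketch below is a hypothetical sanity check, not part of the assignment template: `check_submission` is an invented helper, and the demo IDs are illustrative values rather than rows from the real test set.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Hypothetical sanity check; raises AssertionError on a malformed ledger."""
    assert list(sub.columns) == ["PilgrimId", "Vanished"], "unexpected columns"
    assert len(sub) == len(expected_ids), "row count differs from the test set"
    assert sub["PilgrimId"].is_unique, "duplicate pilgrim IDs"
    assert sub["Vanished"].notna().all(), "missing predictions"
    assert sub["Vanished"].isin([True, False]).all(), "non-boolean predictions"

# Illustrative IDs and predictions only (assumption: not real test rows).
demo_ids = pd.Series(["0013_01", "0018_01", "0019_01"])
demo_sub = pd.DataFrame({"PilgrimId": demo_ids, "Vanished": [True, False, True]})

check_submission(demo_sub, demo_ids)
print("Ledger format looks consistent.")
```

In the notebook, the same check could be called with the submission DataFrame and the preserved `test_ids` just before writing the CSV.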

In [11]:
# --- Create the submission file ---
# Build a pandas DataFrame with 'PilgrimId' and the 'Vanished' predictions.
submission_df = pd.DataFrame({
    "PilgrimId": test_ids,
    "Vanished": y_test_pred
})
# --- Save the Ledger ---
output_filename = 'submission.csv'
# Write the file so that the confirmation message below is accurate
submission_df.to_csv(output_filename, index=False)

print(f"Final ledger saved as '{output_filename}'. The King awaits your report.")
# display(submission_df.head())

Final ledger saved as 'submission.csv'. The King awaits your report.
