# Pokemon Classification through Supervised Machine Learning

This notebook explores the Pokemon dataset interactively and trains supervised machine learning models that classify whether a Pokemon is legendary.

## Table of Contents
1. [Setup and Imports](#setup)
2. [Data Loading](#loading)
3. [Exploratory Data Analysis](#eda)
4. [Feature Engineering](#features)
5. [Model Training](#training)
6. [Results Analysis](#results)

## 1. Setup and Imports {#setup}

In [1]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import logging
from pathlib import Path

# Suppress warnings for cleaner output
warnings.simplefilter(action="ignore")

# Configure matplotlib
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12
sns.set_style("whitegrid")
sns.set_palette("husl")

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

In [2]:
# Import custom modules
import sys
sys.path.append('../src')

from eda import DataExplorer
from feature_engineering import FeatureEngineer
from model_trainer import ModelTrainer
from main import run_pokemon_analysis

## 2. Data Loading {#loading}

In [3]:
# Load the Pokemon dataset
data_path = "../data/Pokemon.csv"

try:
    df_raw = pd.read_csv(data_path)
except FileNotFoundError:
    print(f"File not found: {data_path}")
    print("Please ensure Pokemon.csv is in the data/ directory")
    raise  # Stop here: every later cell depends on df_raw
else:
    print("Dataset loaded successfully!")
    print(f"Shape: {df_raw.shape}")
    print(f"Columns: {list(df_raw.columns)}")

In [4]:
# Quick data preview
display(df_raw.head())
print("\nData Info:")
df_raw.info()

## 3. Exploratory Data Analysis {#eda}

In [5]:
# Run EDA
explorer = DataExplorer()
explorer.explore_dataframe(df_raw, target='Legendary')

### Custom Analysis

In [6]:
# Analyze legendary distribution by generation
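if 'Generation' in df_raw.columns:
    # Rows: generation; columns: Legendary == False, then True
    legendary_by_gen = (
        df_raw.groupby('Generation')['Legendary']
        .value_counts()
        .unstack(fill_value=0)  # 0 instead of NaN when a class is absent
    )
    # Pass an explicit Axes: DataFrame.plot opens its own figure otherwise,
    # which would leave an empty figure behind
    fig, ax = plt.subplots(figsize=(12, 6))
    legendary_by_gen.plot(kind='bar', stacked=True, ax=ax)
    ax.set_title('Legendary Pokemon Distribution by Generation')
    ax.set_xlabel('Generation')
    ax.set_ylabel('Count')
    ax.legend(['Non-Legendary', 'Legendary'])  # matches the False, True column order
    ax.tick_params(axis='x', rotation=0)
    plt.tight_layout()
    plt.show()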

In [7]:
# Analyze stats distribution for legendary vs non-legendary
stat_columns = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
available_stats = [col for col in stat_columns if col in df_raw.columns]

if available_stats:
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    axes = axes.flatten()

    for i, stat in enumerate(available_stats):
        sns.boxplot(data=df_raw, x='Legendary', y=stat, ax=axes[i])
        axes[i].set_title(f'{stat} Distribution')

    # Hide unused panels when fewer than six stat columns are present
    for ax in axes[len(available_stats):]:
        ax.set_visible(False)

    plt.tight_layout()
    plt.show()
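
The boxplots can be complemented with a numeric summary of how far apart the two classes sit. The sketch below uses a small made-up frame (the values are illustrative, not taken from Pokemon.csv) and assumes `Legendary` is a boolean column; with the real data, `df_raw` and `available_stats` would take the place of `toy` and `stats`.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only, not from Pokemon.csv
toy = pd.DataFrame({
    "HP": [45, 60, 80, 106, 100, 90],
    "Attack": [49, 62, 82, 110, 120, 85],
    "Legendary": [False, False, False, True, True, False],
})
stats = ["HP", "Attack"]

# Median of each stat per class, shown side by side
print(toy.groupby("Legendary")[stats].median())

# Gap between the class medians, computed with boolean masks
legendary_median = toy.loc[toy["Legendary"], stats].median()
regular_median = toy.loc[~toy["Legendary"], stats].median()
median_gap = legendary_median - regular_median

# Stats with the largest gap separate the classes most clearly
print(median_gap.sort_values(ascending=False))
```

Medians are used here rather than means because a few extreme stat values could otherwise dominate the comparison.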

## 4. Feature Engineering {#features}

In [8]:
# Initialize feature engineer
engineer = FeatureEngineer()

# Create enhanced features
df_enhanced = engineer.create_statistical_features(df_raw)
print("Statistical features created!")
print(f"Original columns: {len(df_raw.columns)}")
print(f"Enhanced columns: {len(df_enhanced.columns)}")
# Sort so the output order is stable between runs
print(f"New features: {sorted(set(df_enhanced.columns) - set(df_raw.columns))}")
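
The source of `FeatureEngineer.create_statistical_features` is not shown in this notebook, so the following is only a guess at the kind of features it might add: row-wise aggregates over the six base stats. The helper name `add_stat_features` and the column names `stat_total`, `stat_mean`, and `stat_std` are hypothetical.

```python
import pandas as pd

STAT_COLUMNS = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

def add_stat_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with row-wise aggregates of the base stats.

    Hypothetical helper; the real FeatureEngineer may behave differently.
    """
    present = [col for col in STAT_COLUMNS if col in df.columns]
    out = df.copy()
    out["stat_total"] = out[present].sum(axis=1)
    out["stat_mean"] = out[present].mean(axis=1)
    out["stat_std"] = out[present].std(axis=1)  # sample standard deviation (ddof=1)
    return out

# Illustrative rows, not read from Pokemon.csv
toy = pd.DataFrame({
    "HP": [45, 78], "Attack": [49, 84], "Defense": [49, 78],
    "Sp. Atk": [65, 109], "Sp. Def": [65, 85], "Speed": [45, 100],
})
enhanced = add_stat_features(toy)
new_columns = sorted(set(enhanced.columns) - set(toy.columns))
print(new_columns)
print(enhanced[new_columns])
```

Working on a copy keeps the raw frame unchanged, which is what makes the before/after column comparison in the cell above meaningful.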

In [9]:
# Prepare data for machine learning
X_train, X_test, y_train, y_test = engineer.prepare_pokemon_data(
    df_enhanced, 
    target_col='Legendary', 
    test_size=0.2, 
    random_state=42
)

print("Data prepared for ML!")
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Feature columns: {list(X_train.columns)}")
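
Legendary Pokemon are the minority class, so a purely random split can leave the test set with very few positives. Whether `prepare_pokemon_data` stratifies is not visible from this notebook; the sketch below shows one way a stratified split could be done with pandas alone, on made-up data. It uses a 25% test fraction instead of the notebook's 20% only to keep the toy counts whole.

```python
import pandas as pd

# Toy frame: 20 rows, 4 of them legendary (20%); values are illustrative
toy = pd.DataFrame({
    "stat_total": range(300, 700, 20),
    "Legendary": [False] * 16 + [True] * 4,
})

# Sample the same fraction from each class so both splits keep the class ratio
test = toy.groupby("Legendary").sample(frac=0.25, random_state=42)
train = toy.drop(test.index)

print(f"Train: {len(train)} rows, legendary share {train['Legendary'].mean():.2f}")
print(f"Test:  {len(test)} rows, legendary share {test['Legendary'].mean():.2f}")
```

Comparing the legendary share of `y_train` and `y_test` in the same way is a quick check on the split produced above.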

## 5. Model Training {#training}

In [10]:
# Initialize model trainer
trainer = ModelTrainer()

# Train and evaluate all models
results_df = trainer.train_and_evaluate_models(X_train, X_test, y_train, y_test)

print("Model training completed!")
display(results_df.round(4))

In [11]:
trainer.visualize_results(results_df)

In [12]:
# Optimize KNN hyperparameters
print("Optimizing KNN hyperparameters...")
best_knn_params = trainer.optimize_knn(X_train, y_train)
print(f"Best KNN parameters: {best_knn_params}")
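
How `trainer.optimize_knn` searches is not shown here. As a point of comparison, a cross-validated grid search with scikit-learn could look like the sketch below. It runs on synthetic data from `make_classification`, and the parameter grid, fold count, and F1 scoring are all assumptions rather than the trainer's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the prepared Pokemon features
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)

# KNN is distance-based, so scale features; doing it inside the pipeline
# fits the scaler on the training part of each fold only
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Assumed search space; pipeline step names prefix the parameter names
param_grid = {
    "kneighborsclassifier__n_neighbors": [3, 5, 7, 9],
    "kneighborsclassifier__weights": ["uniform", "distance"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

best_params = search.best_params_
print(f"Best parameters: {best_params}")
print(f"Mean cross-validated F1: {search.best_score_:.3f}")
```

Searching on the training data only, as the cell above does with `X_train` and `y_train`, keeps the test set untouched for the final comparison.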

## 6. Results Analysis {#results}

In [13]:
# Find best performing model
best_model_name = results_df.loc[results_df['F1_Score'].idxmax(), 'Model']
print(f"Best performing model: {best_model_name}")

# Generate detailed classification report
trainer.generate_classification_report(X_test, y_test, best_model_name)
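
The best model is selected by F1 score rather than accuracy. With a rare positive class the two can disagree sharply, which the following self-contained sketch illustrates on made-up labels. The helper `f1_from_labels` is hypothetical and only mirrors the standard definitions of precision, recall, and F1.

```python
def f1_from_labels(y_true, y_pred):
    """Return (precision, recall, f1) for the positive (True) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)          # legendary, predicted legendary
    fp = sum((not t) and p for t, p in pairs)    # ordinary, predicted legendary
    fn = sum(t and (not p) for t, p in pairs)    # legendary, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Ten made-up Pokemon, two of them legendary
y_true = [False] * 8 + [True] * 2

# A model that never predicts "legendary" still reaches 80% accuracy
always_negative = [False] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy, f1_from_labels(y_true, always_negative))

# A model that finds one legendary and raises one false alarm
mixed = [True] + [False] * 7 + [True, False]
print(f1_from_labels(y_true, mixed))
```

The always-negative model scores zero on F1 despite its high accuracy, which is why F1 is the more informative criterion for this task.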

In [14]:
# Feature importance analysis (for tree-based models)
if 'Tree' in best_model_name or 'Forest' in best_model_name:
    feature_importance = trainer.get_feature_importance(best_model_name, X_train.columns.tolist())
    if feature_importance.empty:
        # Report the empty case instead of printing nothing
        print(f"No feature importances returned for {best_model_name}")
    else:
        print(f"\nTop 10 Most Important Features ({best_model_name}):")
        display(feature_importance.head(10))
else:
    print(f"Feature importance not available for {best_model_name}")

In [15]:
# Save results
output_path = Path("../outputs/notebook_results.csv")
# parents=True also creates missing intermediate directories
output_path.parent.mkdir(parents=True, exist_ok=True)
results_df.to_csv(output_path, index=False)
print(f"Results saved to {output_path}")

## Summary

This notebook demonstrated:
- Exploratory data analysis of the Pokemon dataset, including legendary counts per generation and per-stat distributions
- Feature engineering including type encoding and statistical features
- Training and evaluation of multiple machine learning models
- KNN hyperparameter optimization
- Performance comparison and model selection by F1 score
- Detailed analysis of the best-performing model

Because the EDA, feature engineering, and training logic live in separate modules under `../src`, each stage can be modified or extended independently.