# 🎮 Pokémon Data & Analytics Platform

## Objective
Build a modern data engineering pipeline that extracts rich Pokémon data from the [PokeAPI](https://pokeapi.co/), processes and stores it in a scalable **Lakehouse architecture**, and enables complex analytical queries, ML insights, and interactive visualizations.

## Key Features / Workflow

### 1. Data Ingestion (ELT Pipeline)
- Periodically extract data from the PokeAPI.
- Extract hierarchical data:
  - Pokémon → Evolutions → Moves → Stats → Types.
- Store raw JSON data in a **data lake** (e.g., AWS S3, GCS, or local MinIO), or in SQLite for a lightweight local prototype.

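In [1]:
import requests

url = "https://pokeapi.co/api/v2/pokemon/ditto"
# verify=False works around a self-signed SSL certificate error in development only;
# remove it (or pass a CA bundle path via `verify`) before running anywhere else.
response = requests.get(url, verify=False, timeout=10)
response.raise_for_status()  # fail fast on HTTP errors instead of parsing an error body
data = response.json()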
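
To land the raw response in a local stand-in for the data lake, one option is to write each payload unchanged to a dated path. This is a minimal sketch that assumes a local directory in place of S3, GCS, or MinIO. The helper name `save_raw_json` and the `raw/<resource>/dt=<date>/` layout are illustrative choices, not part of PokeAPI or of this project.

```python
import json
from datetime import date
from pathlib import Path


def save_raw_json(payload, root, resource, name, fetched_on=None):
    """Write one API payload, unmodified, to <root>/raw/<resource>/dt=<date>/<name>.json.

    The directory layout is an assumption made for this sketch.
    """
    fetched_on = fetched_on or date.today()
    target_dir = Path(root) / "raw" / resource / f"dt={fetched_on.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{name}.json"
    # ensure_ascii=False keeps non-ASCII text (e.g. localized names) readable on disk.
    target.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    return target
```

In the notebook this could be called as `save_raw_json(data, "lake", "pokemon", data["name"])` right after the request, where `"lake"` is a hypothetical root folder.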



### 2. Data Transformation
- Use **Apache Spark** (PySpark) for transformation:
  - Flatten and normalize nested structures.
  - Create dimensional models:
    - `pokemon` (fact)
    - `types`, `abilities`, `moves` (dimensions)
  - Enrich data with external datasets (e.g., popularity, games, community rankings).

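In [None]:
import sqlite3

import pandas as pd

def extract_ability_names(abilities):
    """Join the ability names of one Pokémon into a comma-separated string."""
    return ", ".join(a['ability']['name'] for a in abilities)

def extract_type_names(types):
    """Join the type names of one Pokémon into a comma-separated string."""
    return ", ".join(t['type']['name'] for t in types)

def extract_move_names(moves):
    """Join the move names of one Pokémon into a comma-separated string."""
    return ", ".join(m['move']['name'] for m in moves)

if data:
    # One row per Pokémon; list-valued fields (abilities, types, moves) stay as lists.
    df = pd.json_normalize(data)
    columns_to_keep = ['name', 'height', 'weight', 'abilities', 'types', 'moves']
    df = df[columns_to_keep].copy()

    # SQLite cannot store Python lists, so collapse each list into a string.
    df['abilities'] = df['abilities'].apply(extract_ability_names)
    df['types'] = df['types'].apply(extract_type_names)
    df['moves'] = df['moves'].apply(extract_move_names)

    conn = sqlite3.connect('pokedex.db')
    try:
        df.to_sql('pokemon', conn, if_exists='replace', index=False)
        df_loaded = pd.read_sql('SELECT * FROM pokemon', conn)
    finally:
        conn.close()  # close the connection even if the write or read fails

    display(df_loaded)  # `display` is provided by the Jupyter/IPython environment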

|   | name | height | weight | abilities | types | moves |
|---|------|--------|--------|-----------|-------|-------|
| 0 | ditto | 3 | 40 | limber, imposter | normal | transform |
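
The cell above collapses abilities, types, and moves into comma-separated strings, which is convenient for display but hard to filter on. The following is a minimal sketch of the dimensional layout listed under Data Transformation, shown for types only and using the standard-library `sqlite3` module rather than Spark. The bridge table `pokemon_types` and the helper `load_pokemon` are hypothetical names, and the sample payload is a hand-trimmed subset of the ditto response.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS pokemon (
    name   TEXT PRIMARY KEY,
    height INTEGER,
    weight INTEGER
);
CREATE TABLE IF NOT EXISTS types (
    name TEXT PRIMARY KEY
);
-- Bridge table: one row per (pokemon, type) pair.
CREATE TABLE IF NOT EXISTS pokemon_types (
    pokemon_name TEXT REFERENCES pokemon(name),
    type_name    TEXT REFERENCES types(name),
    PRIMARY KEY (pokemon_name, type_name)
);
"""


def load_pokemon(conn, payload):
    """Insert one PokeAPI-style pokemon payload into the normalized tables."""
    conn.execute(
        "INSERT OR REPLACE INTO pokemon (name, height, weight) VALUES (?, ?, ?)",
        (payload["name"], payload["height"], payload["weight"]),
    )
    for entry in payload["types"]:
        type_name = entry["type"]["name"]
        conn.execute("INSERT OR IGNORE INTO types (name) VALUES (?)", (type_name,))
        conn.execute(
            "INSERT OR IGNORE INTO pokemon_types (pokemon_name, type_name) VALUES (?, ?)",
            (payload["name"], type_name),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, so nothing is written to disk
conn.executescript(SCHEMA)

# Hand-trimmed subset of the ditto payload (only the fields used here).
sample = {
    "name": "ditto",
    "height": 3,
    "weight": 40,
    "types": [{"type": {"name": "normal"}}],
}
load_pokemon(conn, sample)

rows = conn.execute(
    "SELECT p.name, pt.type_name FROM pokemon p "
    "JOIN pokemon_types pt ON pt.pokemon_name = p.name"
).fetchall()
print(rows)
conn.close()
```

Abilities and moves could follow the same pattern, each with its own dimension table and bridge table.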


### 3. Data Storage (Lakehouse Architecture)
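- Store processed data in the lakehouse. The prototype above writes to a local SQLite database (`pokedex.db`) as a stand-in.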

### 4. Data Serving & Exploration
- Use tools like:
  - **Apache Superset** / **Metabase** / **Streamlit**
- Build visualizations and dashboards:
  - Top 10 strongest Pokémon by base stats.
  - Evolution treemaps.
  - Type effectiveness matrix.
  - Fun stats and comparisons.
- Interactive querying:
  - E.g., "List all Fire-type Pokémon with speed > 100 and special attack > 90".
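
The example question above maps to a single SQL query once stats and types live in queryable tables. This sketch runs against an in-memory SQLite database. The `pokemon_stats` table, its columns, and all three rows are placeholders invented for illustration, not real Pokémon data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pokemon_stats ("
    "name TEXT, type_name TEXT, speed INTEGER, special_attack INTEGER)"
)
# Placeholder rows, invented for this sketch; they are not real base stats.
conn.executemany(
    "INSERT INTO pokemon_stats VALUES (?, ?, ?, ?)",
    [
        ("placeholder_a", "fire", 110, 95),    # passes both thresholds
        ("placeholder_b", "fire", 80, 120),    # too slow
        ("placeholder_c", "water", 120, 100),  # wrong type
    ],
)

QUERY = """
SELECT name, speed, special_attack
FROM pokemon_stats
WHERE type_name = ? AND speed > ? AND special_attack > ?
ORDER BY speed DESC
"""
# Parameters are bound with "?" so user-supplied filters are never spliced into the SQL.
matches = conn.execute(QUERY, ("fire", 100, 90)).fetchall()
print(matches)
conn.close()
```

A Pokémon with two types would need one row per type in this flat layout. With a normalized schema the type filter would become a join instead.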

### 5. Machine Learning & Graph Analytics (Bonus)
- **Clustering**: Use K-Means to group similar Pokémon based on stats.
- **Classification**: Predict battle outcomes using logistic regression or tree-based models.
- **Graph Analysis**:
  - Use **NetworkX** or **Neo4j** to explore evolution chains as directed graphs.
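
Before loading anything into a graph tool, the nested evolution chain has to be flattened into directed edges. The sketch below is standard library only and assumes the nested `species` / `evolves_to` shape of PokeAPI's evolution-chain resource. The helper `evolution_edges` is a hypothetical name, and the sample payload is trimmed by hand to the fields used here.

```python
def evolution_edges(chain_node):
    """Yield (from_species, to_species) pairs from a PokeAPI-style evolution chain node."""
    source = chain_node["species"]["name"]
    for child in chain_node.get("evolves_to", []):
        yield (source, child["species"]["name"])
        # Recurse so branching and multi-stage chains are both covered.
        yield from evolution_edges(child)


# Hand-trimmed payload in the shape of an evolution-chain response
# (only the fields used here are kept).
sample_chain = {
    "chain": {
        "species": {"name": "bulbasaur"},
        "evolves_to": [
            {
                "species": {"name": "ivysaur"},
                "evolves_to": [
                    {"species": {"name": "venusaur"}, "evolves_to": []}
                ],
            }
        ],
    }
}

edges = list(evolution_edges(sample_chain["chain"]))
print(edges)
```

The resulting edge list can be passed to NetworkX, for example with `nx.DiGraph().add_edges_from(edges)`, or written out for import into Neo4j.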

## Bonus Ideas
- **API Gateway**:
  - Build a FastAPI or GraphQL service layer on top of your Lakehouse data.
- **Streaming**:
  - Simulate real-time "wild Pokémon encounters" using **Kafka** + **Spark Structured Streaming**.
- **Leaderboard**:
  - Create a dynamic ranking of strongest Pokémon by an aggregated score (e.g., sum of normalized stats).
- **Data Versioning**:
  - Integrate **DVC** or **lakeFS** to version data and track evolution of your datasets over time.
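
For the leaderboard idea above, "sum of normalized stats" can be read in several ways. One simple reading is min-max normalization per stat followed by a plain sum. This is a sketch under that assumption: the function name `leaderboard` and the sample values are invented for illustration, and the input is assumed to be non-empty with the same stat names on every entry.

```python
def leaderboard(pokemon_stats):
    """Rank entries by the sum of their min-max normalized stats (highest first).

    pokemon_stats maps a name to a dict of stat name -> value; every entry is
    assumed to carry the same stat names.
    """
    stat_names = sorted(next(iter(pokemon_stats.values())))
    scores = {name: 0.0 for name in pokemon_stats}
    for stat in stat_names:
        values = [stats[stat] for stats in pokemon_stats.values()]
        low, high = min(values), max(values)
        span = high - low
        for name, stats in pokemon_stats.items():
            # A stat with no spread carries no ranking information, so it adds 0.
            scores[name] += (stats[stat] - low) / span if span else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder values invented for this sketch; they are not real base stats.
sample = {
    "placeholder_a": {"attack": 100, "speed": 50},
    "placeholder_b": {"attack": 50, "speed": 100},
    "placeholder_c": {"attack": 100, "speed": 100},
}
ranking = leaderboard(sample)
print(ranking)
```

Min-max scaling weights every stat equally regardless of its raw range. A z-score or a weighted sum would be reasonable alternatives if some stats should count for more.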