# Project Markdown (for a report / write-up)

## Title
**Aggregating 2025 MLB Players by Qualified Positions**

## Goal
Given a dataset of 2025 MLB players with their qualified positions, compute how many **unique players** qualify at each position, counting multi-position players once per position.

## Input
A CSV file containing at least these columns:
- `Name`: player name
- `Team`: team(s)
- `Position`: qualified position(s)

Positions may be multi-valued (e.g., `OF/DH`).

## Approach
1. **Parse positions**: Split multi-position strings on the supported delimiters (`/`, `,`, `;`, `|`) and strip surrounding whitespace.
2. **Normalize to long format**: Convert each player row into multiple rows, one per position.
3. **De-duplicate**: Remove duplicate `(player, position)` pairs to prevent double-counting due to trades or repeated rows.
4. **Aggregate**: Group by position and count distinct players.
5. **Sort**: Order positions by descending player counts.
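
As a dependency-free cross-check, the five steps above can be sketched in plain Python on a few made-up rows. The player names, team codes, and the `rows` structure are hypothetical and only illustrate the logic; the notebook code later in this write-up does the same thing with pandas.

```python
import re
from collections import defaultdict

# Hypothetical rows: (name, team, raw position string).
# "Player A" appears twice, as a traded player might.
rows = [
    ("Player A", "NYY", "OF/DH"),
    ("Player A", "BOS", "OF/DH"),
    ("Player B", "LAD", "SP"),
    ("Player C", "SEA", "2B, SS"),
    ("Player D", "SEA", "OF"),
]

def split_positions(raw):
    # Step 1: split on / , ; | and drop empty or whitespace-only pieces
    return [p.strip() for p in re.split(r"[/,;|]+", raw) if p.strip()]

# Steps 2-3: a set of (player, position) pairs is both "long format"
# and de-duplicated, because sets cannot hold repeated pairs.
pairs = {(name, pos) for name, _team, raw in rows for pos in split_positions(raw)}

# Step 4: collect the distinct players seen at each position
players_by_position = defaultdict(set)
for name, pos in pairs:
    players_by_position[pos].add(name)

# Step 5: sort by descending count, then alphabetically by position
ranked = sorted(
    ((pos, len(names)) for pos, names in players_by_position.items()),
    key=lambda item: (-item[1], item[0]),
)
print(ranked)
# [('OF', 2), ('2B', 1), ('DH', 1), ('SP', 1), ('SS', 1)]
```

Note that this sketch, like the approach itself, identifies a player by name alone, so two different players who share a name would be merged.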

## Key Detail
Multi-position players contribute to the count of **every** position they qualify for.
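
One consequence worth checking: the per-position counts add up to the number of distinct (player, position) pairs, which can exceed the number of players. A minimal sketch on hypothetical data (the `qualified` mapping is invented for illustration):

```python
# Hypothetical toy data: player name -> set of qualified positions
qualified = {
    "Player A": {"OF", "DH"},
    "Player B": {"OF"},
    "Player C": {"SS", "2B", "3B"},
}

# Build position -> set of players who qualify there
per_position = {}
for name, positions in qualified.items():
    for pos in positions:
        per_position.setdefault(pos, set()).add(name)

counts = {pos: len(names) for pos, names in per_position.items()}
total_slots = sum(counts.values())

print(counts["OF"])                  # 2 (Player A and Player B)
print(total_slots, len(qualified))   # 6 3
```

Three players fill six position slots here, so the position counts should not be read as a partition of the player pool.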

## Output
A ranked list/table of positions with counts, from most to least common.

## Result Extraction Example
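To find “the third most popular position,” take the third row of the sorted table (index 2 with zero-based indexing) and read its count. Positions with equal counts are ordered alphabetically, so the result is deterministic even when there are ties.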
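
A small helper makes the row lookup explicit. This is only a sketch: the function name `nth_most_common` is hypothetical, and the `counts` dictionary is copied from the output reported in this write-up.

```python
# Counts copied from the position table reported in this write-up
counts = {
    "OF": 294, "SP": 226, "RP": 222, "DH": 134, "3B": 126,
    "2B": 125, "1B": 114, "C": 106, "SS": 85,
}

def nth_most_common(counts, n):
    """Return (position, count) from the n-th row (1-based) of the ranking.

    Rows are sorted by descending count, then alphabetically by position,
    so ties are resolved deterministically.
    """
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if not 1 <= n <= len(ranked):
        raise IndexError(f"n must be between 1 and {len(ranked)}")
    return ranked[n - 1]

print(nth_most_common(counts, 3))  # ('RP', 222)
```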


In [1]:
import pandas as pd
import re

DATA_PATH = "baseball_positions.csv"  # change if needed

df = pd.read_csv(DATA_PATH)

required = {"Name", "Team", "Position"}
missing = required - set(df.columns)
if missing:
    raise ValueError(f"Missing required columns: {sorted(missing)}")

def split_positions(pos: str) -> list[str]:
    # Supported delimiters: / , ; |
    if pd.isna(pos):
        return []
    parts = re.split(r"[/,;|]+", str(pos))
    return [p.strip() for p in parts if p and p.strip()]

# Expand multi-position players to one row per player-position
pos_long = (
    df.assign(Position_list=df["Position"].apply(split_positions))
      .explode("Position_list", ignore_index=True)
      .rename(columns={"Position_list": "Position_clean"})
)

# De-duplicate (player, position) pairs so a player listed under several
# teams, or in repeated rows, counts only once per position
pos_unique = (
    pos_long[["Name", "Position_clean"]]
    .dropna()
    .drop_duplicates(subset=["Name", "Position_clean"])
)

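# Count distinct players per position, then rank by descending count
# (ties broken alphabetically by position)
position_counts = (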
    pos_unique.groupby("Position_clean")["Name"]
    .nunique()
    .reset_index(name="player_count")
    .sort_values(["player_count", "Position_clean"], ascending=[False, True])
    .reset_index(drop=True)
)

print(position_counts.to_string(index=False))


Position_clean  player_count
            OF           294
            SP           226
            RP           222
            DH           134
            3B           126
            2B           125
            1B           114
             C           106
            SS            85


In [2]:
# assumes `position_counts` is already computed as in the previous code

sorted_counts = position_counts.sort_values(
    ["player_count", "Position_clean"],
    ascending=[False, True]
).reset_index(drop=True)

# Row label 2 is the third row, because the index was reset to 0, 1, 2, ...
third_count = int(sorted_counts.loc[2, "player_count"])
third_count


222