# World Cup Squad Builder Pipeline

**Team Members:**
- Person A: Data ingestion, retrieval, prompt engineering
- Person B: Reasoning pipeline, constraint logic, report generation
- Person C: Agent orchestration, tools, UI, notebook assembly, submission

**Track:** World Cup Squad Builder with Reasoning Pipeline (IE5374, Northeastern University)


## Dependencies and Environment Setup

_This cell installs all required libraries with `pip install` (langchain, langchain-openai, langchain-community, langchain-core, langchain-text-splitters, faiss-cpu, openai, pandas, numpy, matplotlib, python-dotenv). The command does not pin versions; to rerun the notebook reproducibly, record the versions that were installed (for example with `pip freeze`) and pin them._

In [None]:
# Install dependencies (run once)
!pip install -q langchain langchain-openai langchain-community langchain-core langchain-text-splitters faiss-cpu openai pandas numpy matplotlib python-dotenv
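
Since the install command above carries no version pins, one way to capture them after a successful run is to read the installed versions back with the standard library. The sketch below is illustrative only: the helper names `installed_versions` and `as_requirements` are hypothetical, and the package list simply mirrors the install command.

```python
from importlib.metadata import PackageNotFoundError, version

# Same packages as the pip install command above.
PACKAGES = [
    "langchain", "langchain-openai", "langchain-community", "langchain-core",
    "langchain-text-splitters", "faiss-cpu", "openai", "pandas", "numpy",
    "matplotlib", "python-dotenv",
]

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

def as_requirements(versions):
    """Format the installed packages as pinned requirement lines (name==version)."""
    return [f"{name}=={ver}" for name, ver in versions.items() if ver is not None]

# Print lines that could be pasted into a requirements.txt file.
for line in as_requirements(installed_versions(PACKAGES)):
    print(line)
```

Saving that output next to the notebook gives later runs a fixed set of versions to install.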

## Imports and API Key Configuration

_This cell imports the necessary modules (pandas, langchain, langchain_openai, FAISS, etc.), loads the OpenAI API key from `.env` using `python-dotenv`, and adds the `backend` directory to the import path so the `src` modules resolve. Any global configuration (e.g., model names, temperature) should also be set here._

In [None]:
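import os
import sys
from dotenv import load_dotenv

# Load environment variables (including OPENAI_API_KEY) from .env
load_dotenv()

# Add backend to path for imports. This assumes the notebook is run from a
# directory that sits next to backend/ (the parent of the working directory).
sys.path.insert(0, os.path.join(os.path.dirname(os.getcwd()), 'backend'))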

import pandas as pd
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

from src.ingestion import load_raw_data, clean_data, cache_processed_data, dataframe_to_documents, load_and_clean_data
from src.retrieval import create_vector_store, get_retriever, retrieve_players
from src.reasoning import build_squad, validate_squad
from src.synthesis import generate_report, format_squad_table
from src.agent import create_agent, run_query

# Fail early if the API key was not loaded
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY in .env"
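
The cell above does not itself define global settings such as model names or temperature. A minimal sketch of how they could be centralised is shown below. Everything in it is an assumption made for illustration: the `PipelineConfig` class, the `SQUAD_*` environment variable names, and the default model names are not taken from the project code.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for notebook-wide settings."""
    chat_model: str = "gpt-4o-mini"                  # assumed default, not from the project
    embedding_model: str = "text-embedding-3-small"  # assumed default, not from the project
    temperature: float = 0.0
    retriever_k: int = 10

def load_config(env=None):
    """Build a config, letting environment variables override the defaults."""
    env = os.environ if env is None else env
    defaults = PipelineConfig()
    return PipelineConfig(
        chat_model=env.get("SQUAD_CHAT_MODEL", defaults.chat_model),
        embedding_model=env.get("SQUAD_EMBEDDING_MODEL", defaults.embedding_model),
        temperature=float(env.get("SQUAD_TEMPERATURE", defaults.temperature)),
        retriever_k=int(env.get("SQUAD_RETRIEVER_K", defaults.retriever_k)),
    )

config = load_config()
print(config)
```

Keeping the settings in one frozen object makes it harder for later cells to drift apart on model or temperature choices.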

## Stage 1 — Data Ingestion

_This cell runs the ingestion stage through `src.ingestion.load_and_clean_data`, which covers the load, clean, cache, and convert-to-`Document` steps (also available separately as `load_raw_data`, `clean_data`, `cache_processed_data`, and `dataframe_to_documents`). It prints the number of player documents and a sample `Document` to illustrate the natural-language description and metadata._


In [None]:
# Stage 1: Load, clean, cache, convert to Documents
documents = load_and_clean_data()
print(f"Loaded {len(documents)} player documents.")
if documents:
    print("Sample document (first player):")
    print(documents[0].page_content[:300], "...")
    print("Metadata keys:", list(documents[0].metadata.keys())[:10])
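
To make the row-to-document step concrete, here is a minimal sketch of turning one player record into a natural-language description plus metadata. It is not the project's `dataframe_to_documents`: the `PlayerDoc` class is a stand-in for `langchain_core.documents.Document`, the column names other than `short_name` are assumptions, and the sample row is made up.

```python
from dataclasses import dataclass, field

@dataclass
class PlayerDoc:
    """Stand-in for langchain_core.documents.Document (same two fields)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def row_to_document(row):
    """Describe one player record in a sentence and keep the raw fields as metadata."""
    text = (
        f"{row['short_name']} is a {row['age']}-year-old {row['position']} "
        f"from {row['nationality']} with an overall rating of {row['overall']} "
        f"and a pace rating of {row['pace']}."
    )
    return PlayerDoc(page_content=text, metadata=dict(row))

# Made-up record for illustration only; not taken from the dataset.
sample_rows = [
    {"short_name": "A. Example", "age": 25, "position": "defender",
     "nationality": "Exampleland", "overall": 82, "pace": 88},
]

docs = [row_to_document(row) for row in sample_rows]
print(docs[0].page_content)
print("Metadata keys:", list(docs[0].metadata.keys()))
```

Writing the attributes out as a sentence is what lets an embedding model match queries such as "fast defenders" against the text, while the metadata keeps the exact values for later filtering.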

## Stage 2 — Retrieval

_This cell builds the FAISS vector store and a retriever (`k=10`) using `src.retrieval.create_vector_store` and `get_retriever`. It then runs example retrievals for the queries "fast defenders", "best free kick takers", and "young high-potential midfielders", printing the top five retrieved players for each._


In [None]:
# Stage 2: FAISS vector store and retrieval
vector_store = create_vector_store(documents)
retriever = get_retriever(vector_store, k=10)

for query in ["fast defenders", "best free kick takers", "young high-potential midfielders"]:
    docs = retrieve_players(query, retriever)
    names = [d.metadata.get("short_name", "?") for d in docs[:5]]
    print(f"Query: '{query}' -> {names}")
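
The idea behind vector retrieval can be sketched without FAISS or an embedding API: embed the query and each document, score them by similarity, and return the top `k`. The example below uses a toy bag-of-words embedding and cosine similarity. It is a simplified stand-in, not what `create_vector_store` does internally, and the function names and corpus are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase bag-of-words counts (not a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, texts, k=2):
    """Return the k texts most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(texts, key=lambda t: cosine(query_vec, embed(t)), reverse=True)
    return ranked[:k]

# Made-up descriptions for illustration only.
corpus = [
    "fast defender with strong tackling",
    "slow defender with good passing",
    "fast forward with clinical finishing",
]

print(top_k("fast defender", corpus, k=2))
```

A real vector store replaces the word counts with dense embeddings and the full sort with an index built for nearest-neighbour search, but the ranking principle is the same.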

## Stage 3 — Reasoning and Constraint Solving

_This cell builds a squad from a retrieved shortlist using `src.reasoning.build_squad`. It shows how constraints (max 23 players, positional minimums, optional budget) are defined and passed in, and prints a summary of the structured squad output, which holds the selected players, excluded players, total wage, and formation notes._


In [None]:
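# Stage 3: Build squad from retrieved players
# The Stage 2 retriever returns only k=10 players, too few to meet the
# positional minimums below (3 + 6 + 6 + 4 = 19), so use a wider shortlist.
# k=60 is an illustrative value.
shortlist_retriever = get_retriever(vector_store, k=60)
shortlist = [doc.metadata for doc in retrieve_players("balanced squad with strong defense and attack", shortlist_retriever)]
constraints = {"max_players": 23, "min_gk": 3, "min_def": 6, "min_mid": 6, "min_fwd": 4, "budget": 500_000}
squad = build_squad(shortlist, constraints, "prioritize experience and balance")
print("Selected:", len(squad["selected"]), "| Total wage:", squad.get("total_wage"))
print("Formation notes:", (squad.get("formation_notes") or "")[:200])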
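
One simple way to honour constraints like these is a two-pass greedy selection: first fill each positional minimum with the best-rated affordable players, then fill the remaining slots by rating. The sketch below is not the logic of `build_squad`, which is not shown in this notebook. The player fields `position`, `overall`, and `wage`, the `unmet_minimums` key, and the scaled-down sample pool are assumptions for illustration.

```python
def greedy_squad(players, constraints):
    """Pick players by rating while respecting size, positional, and budget limits."""
    minimums = {
        "GK": constraints["min_gk"], "DEF": constraints["min_def"],
        "MID": constraints["min_mid"], "FWD": constraints["min_fwd"],
    }
    budget = constraints.get("budget")
    if budget is None:                      # budget is optional
        budget = float("inf")
    ranked = sorted(players, key=lambda p: p["overall"], reverse=True)
    selected, chosen_ids, total_wage = [], set(), 0
    counts = {pos: 0 for pos in minimums}

    def try_add(player):
        """Add the player if there is a free slot and the budget allows it."""
        nonlocal total_wage
        if len(selected) >= constraints["max_players"]:
            return
        if total_wage + player["wage"] > budget:
            return
        selected.append(player)
        chosen_ids.add(id(player))
        total_wage += player["wage"]
        counts[player["position"]] += 1

    # Pass 1: satisfy each positional minimum with the best affordable players.
    for pos, need in minimums.items():
        for player in ranked:
            if counts[pos] >= need:
                break
            if player["position"] == pos and id(player) not in chosen_ids:
                try_add(player)

    # Pass 2: fill any remaining slots purely by rating.
    for player in ranked:
        if id(player) not in chosen_ids:
            try_add(player)

    unmet = {pos: need - counts[pos] for pos, need in minimums.items() if counts[pos] < need}
    excluded = [p for p in ranked if id(p) not in chosen_ids]
    return {"selected": selected, "excluded": excluded,
            "total_wage": total_wage, "unmet_minimums": unmet}

# Made-up pool and scaled-down constraints for illustration only.
pool = [
    {"short_name": "GK1", "position": "GK", "overall": 80, "wage": 20},
    {"short_name": "DEF1", "position": "DEF", "overall": 85, "wage": 30},
    {"short_name": "MID1", "position": "MID", "overall": 88, "wage": 40},
    {"short_name": "FWD1", "position": "FWD", "overall": 90, "wage": 50},
    {"short_name": "FWD2", "position": "FWD", "overall": 70, "wage": 5},
    {"short_name": "MID2", "position": "MID", "overall": 75, "wage": 10},
]
small = {"max_players": 5, "min_gk": 1, "min_def": 1, "min_mid": 1, "min_fwd": 1, "budget": 100}

result = greedy_squad(pool, small)
print([p["short_name"] for p in result["selected"]], result["total_wage"])
```

A greedy pass is easy to explain in a report but is not guaranteed to find the best squad under a tight budget; reporting unmet minimums makes that limitation visible instead of failing silently.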

## Stage 4 — Synthesis and Report Generation

_This cell uses `src.synthesis.generate_report` to produce a formatted natural-language squad report from the structured squad dictionary, then prints it. The report includes the squad table, philosophy summary, budget summary, notable exclusions, limitations disclaimer, and data source citation._


In [None]:
# Stage 4: Generate formatted report
report = generate_report(squad, constraints)
print(report)
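
The squad table part of such a report can be sketched as a plain Markdown table with a wage total. This is an illustration rather than the project's `format_squad_table`, whose signature is not shown here. The function name `render_squad_table`, the `position` and `wage` fields, and the sample players are assumptions.

```python
def render_squad_table(players):
    """Render selected players as a Markdown table with a total wage row."""
    header = "| # | Player | Position | Wage (EUR) |"
    divider = "|---|--------|----------|------------|"
    rows = [
        f"| {i} | {p['short_name']} | {p['position']} | {p['wage']:,} |"
        for i, p in enumerate(players, start=1)
    ]
    total = sum(p["wage"] for p in players)
    footer = f"| | **Total** | | {total:,} |"
    return "\n".join([header, divider, *rows, footer])

# Made-up players for illustration only.
demo_squad = [
    {"short_name": "A. Example", "position": "GK", "wage": 12000},
    {"short_name": "B. Sample", "position": "DEF", "wage": 30500},
]

table = render_squad_table(demo_squad)
print(table)
```

Building the table deterministically in code, and leaving only the narrative sections to the language model, keeps the numbers in the report consistent with the structured squad.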

## Agent Demo with Memory

_This cell showcases the end-to-end LangChain agent created in `src.agent`. It runs a multi-turn conversation in which the user first specifies preferences (e.g., a pace-focused squad) and then adjusts constraints (e.g., "now make it cheaper"). The second turn demonstrates that the agent retains at least two user preferences via `ConversationBufferMemory` while using tools to rebuild and re-explain the squad._


In [None]:
# Agent demo with memory (multi-turn)
agent = create_agent()
print("Turn 1:", run_query(agent, "Build me a World Cup squad focused on pace, budget under 400000 EUR."))
print("\nTurn 2 (memory):", run_query(agent, "Now make it cheaper and keep at least 4 forwards."))
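
The reason the second turn can say "make it cheaper" without restating the pace preference is that buffer-style memory replays earlier turns into the next prompt. The sketch below imitates that behaviour in plain Python. It is a simplified stand-in for `ConversationBufferMemory`, not the agent's actual wiring, and the `BufferMemory` class, `build_prompt` helper, and sample reply are hypothetical.

```python
class BufferMemory:
    """Minimal conversation buffer: stores every turn and replays them verbatim."""

    def __init__(self):
        self.turns = []

    def save(self, user_message, assistant_message):
        """Record one completed exchange."""
        self.turns.append((user_message, assistant_message))

    def history(self):
        """Return all stored turns as a single transcript string."""
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)

def build_prompt(memory, new_message):
    """Prepend the stored transcript so the model sees earlier preferences."""
    history = memory.history()
    prefix = f"{history}\n" if history else ""
    return f"{prefix}Human: {new_message}\nAI:"

memory = BufferMemory()
# The assistant reply here is a placeholder, not real model output.
memory.save(
    "Build me a World Cup squad focused on pace, budget under 400000 EUR.",
    "(squad report)",
)

prompt = build_prompt(memory, "Now make it cheaper and keep at least 4 forwards.")
print(prompt)
```

Because the full transcript is replayed, the pace focus and the original budget both remain visible on the second turn; the trade-off is that the prompt grows with every exchange.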

## Responsible AI Disclaimer

_The outputs of this notebook are educational and are not professional sports analytics advice. They are based on FIFA video game ratings, so real-world performance may differ significantly. The licensing terms of the data source and the limitations of video game data should be taken into account before using these results for any serious decision-making._
