# FantasyTrivia: Graph-RAG for Fantasy Premier League

## Retrieval-Augmented Generation Using Neo4j Knowledge Graph

This notebook demonstrates how to build a Graph-RAG system for answering questions about the Fantasy Premier League (FPL) 2022-23 season.

**What we'll cover:**
1. Connect to the Neo4j knowledge graph
2. Create text embeddings of player statistics
3. Build a FAISS retriever
4. Test similarity-based retrieval
5. Implement a simple RAG pipeline (with a mock generation step)
6. Answer FPL questions with responses grounded in retrieved data

**Dataset:** 51,952 player performances across 2 seasons

## 1. Install Required Libraries

In [None]:
# Install required packages
!pip install neo4j sentence-transformers faiss-cpu langchain langchain-community pandas plotly -q

print("✓ Package installation step finished (check the pip output above for errors)")

In [None]:
# Import libraries
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer
import faiss
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

print("✓ Libraries imported successfully!")

## 2. Connect to Neo4j Knowledge Graph

Our knowledge graph contains:
- **1,513 Players**
- **42 Teams**  
- **760 Fixtures**
- **51,952 Performance records** (PLAYED_IN relationships)

Each performance has 19 properties including goals, assists, minutes, fantasy points, etc.

In [None]:
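# Load Neo4j configuration from a key=value text file
def load_config(path='../config.txt'):
    config = {}
    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comment lines
            if not line or line.startswith('#'):
                continue
            if '=' in line:
                # Split on the first '=' only, so values may contain '='
                key, value = line.split('=', 1)
                config[key.strip()] = value.strip()
    return config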

config = load_config()

# Connect to Neo4j
driver = GraphDatabase.driver(
    config['URI'],
    auth=(config['USERNAME'], config['PASSWORD'])
)

print(f"✓ Connected to Neo4j at {config['URI']}")

## 3. Verify Connection & Explore Graph Schema

In [None]:
# Get database statistics
with driver.session() as session:
    # Count nodes by label
    stats = session.run("""
        MATCH (p:Player) WITH count(p) as players
        MATCH (t:Team) WITH players, count(t) as teams
        MATCH (f:Fixture) WITH players, teams, count(f) as fixtures
        MATCH ()-[r:PLAYED_IN]->() 
        RETURN players, teams, fixtures, count(r) as performances
    """).single()
    
    print("="*60)
    print("KNOWLEDGE GRAPH STATISTICS")
    print("="*60)
    print(f"Players:       {stats['players']:,}")
    print(f"Teams:         {stats['teams']:,}")
    print(f"Fixtures:      {stats['fixtures']:,}")
    print(f"Performances:  {stats['performances']:,}")
    print("="*60)

## 4. Fetch Player Data for Embeddings

We'll aggregate player statistics for the 2022-23 season to create embeddings.
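
For readers less familiar with Cypher, the per-player aggregation performed by the query in the next cell can be sketched in plain Python. The rows here are invented toy data, not values from the graph, and `aggregate_season` is a hypothetical helper used only for illustration.

```python
from collections import defaultdict

# Toy PLAYED_IN rows (invented for illustration, not real FPL data)
rows = [
    {"player": "Player A", "season": "2022-23", "total_points": 6, "goals_scored": 1, "minutes": 90},
    {"player": "Player A", "season": "2022-23", "total_points": 2, "goals_scored": 0, "minutes": 75},
    {"player": "Player B", "season": "2022-23", "total_points": 0, "goals_scored": 0, "minutes": 0},
    {"player": "Player A", "season": "2021-22", "total_points": 9, "goals_scored": 2, "minutes": 90},
]

def aggregate_season(rows, season):
    """Sum per-match stats per player for one season, dropping players with 0 total minutes."""
    totals = defaultdict(
        lambda: {"total_points": 0, "total_goals": 0, "total_minutes": 0, "matches_played": 0}
    )
    for row in rows:
        if row["season"] != season:
            continue  # mirrors the {season: '2022-23'} filter on Fixture
        t = totals[row["player"]]
        t["total_points"] += row["total_points"]
        t["total_goals"] += row["goals_scored"]
        t["total_minutes"] += row["minutes"]
        t["matches_played"] += 1  # mirrors count(r)
    # Mirrors WHERE total_minutes > 0 and ORDER BY total_points DESC
    result = [{"player": name, **t} for name, t in totals.items() if t["total_minutes"] > 0]
    return sorted(result, key=lambda r: r["total_points"], reverse=True)

season_stats = aggregate_season(rows, "2022-23")
print(season_stats)
```

As in the Cypher query, `matches_played` counts every record for the season, including any with zero minutes.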

In [None]:
# Aggregate player stats for 2022-23 season
with driver.session() as session:
    result = session.run("""
        MATCH (p:Player)-[r:PLAYED_IN]->(f:Fixture {season: '2022-23'})
        WITH p,
             sum(r.total_points) as total_points,
             sum(r.goals_scored) as total_goals,
             sum(r.assists) as total_assists,
             sum(r.minutes) as total_minutes,
             sum(r.clean_sheets) as clean_sheets,
             sum(r.saves) as saves,
             sum(r.bonus) as bonus,
             count(r) as matches_played
        WHERE total_minutes > 0
        RETURN p.player_name as player,
               total_points, total_goals, total_assists,
               total_minutes, clean_sheets, saves, bonus,
               matches_played
        ORDER BY total_points DESC
    """)
    
    player_stats = [dict(record) for record in result]

print(f"✓ Retrieved stats for {len(player_stats)} players")
print("\nSample (Top 3 Players):")
df_sample = pd.DataFrame(player_stats[:3])
df_sample

## 5. Create Text Embeddings Using Sentence Transformer

We'll use the `all-MiniLM-L6-v2` model - a compact, fast sentence-embedding model that produces 384-dimensional vectors and is well suited to semantic-similarity search.
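
To make "semantic similarity" concrete, here is a minimal sketch of cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors are invented for illustration (real embeddings from this model have 384 dimensions), and `cosine_similarity` is a hypothetical helper rather than a library call.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (invented values, chosen so the two strikers point the same way)
striker_a = [0.9, 0.1, 0.0]
striker_b = [0.8, 0.2, 0.1]
goalkeeper = [0.0, 0.2, 0.9]

sim_strikers = cosine_similarity(striker_a, striker_b)
sim_mixed = cosine_similarity(striker_a, goalkeeper)
print(f"striker vs striker:    {sim_strikers:.3f}")
print(f"striker vs goalkeeper: {sim_mixed:.3f}")
```

Vectors that point in similar directions score close to 1, while unrelated ones score close to 0.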

In [None]:
# Load embedding model
print("Loading SentenceTransformer model...")
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device='cpu')
print("✓ Model loaded!")

# Create a one-paragraph text description for each player
descriptions = []
for player in player_stats:
    # Adjacent f-strings are concatenated, avoiding stray indentation and newlines
    desc = (
        f"Player {player['player']} played {player['matches_played']} matches "
        f"and scored {player['total_points']} fantasy points. "
        f"They scored {player['total_goals']} goals and provided {player['total_assists']} assists. "
        f"Total minutes played: {player['total_minutes']}. "
        f"Clean sheets: {player['clean_sheets']}, Bonus points: {player['bonus']}."
    )
    descriptions.append(desc)

print(f"\n✓ Created {len(descriptions)} text descriptions")
print(f"\nSample description:\n{descriptions[0]}")

In [None]:
# Encode all descriptions into embeddings
print("Encoding player descriptions... (this may take 10-20 seconds)")
embeddings = embedder.encode(descriptions, convert_to_numpy=True)

print(f"✓ Created embeddings with shape: {embeddings.shape}")
print(f"  - {embeddings.shape[0]} players")
print(f"  - {embeddings.shape[1]} dimensions per embedding")

## 6. Build FAISS Index for Fast Similarity Search

FAISS (Facebook AI Similarity Search) allows us to quickly find the most similar players based on embeddings.
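
As a rough mental model, a flat L2 index compares the query against every stored vector. The pure-Python sketch below mimics that behaviour on toy vectors. `flat_l2_search` is a hypothetical helper, not part of the FAISS API. It returns squared Euclidean distances, which is also what `IndexFlatL2` reports.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbour search; returns (squared distances, indices)."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance: no square root is taken
        dist_sq = sum((v - q) ** 2 for v, q in zip(vec, query))
        scored.append((dist_sq, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-dimensional vectors (invented for illustration)
vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = flat_l2_search(vectors, query=[1.0, 1.0], k=2)
print(distances, indices)
```

Exhaustive search is exact but scales linearly with the number of vectors, which is acceptable for a few hundred players.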

In [None]:
# Build FAISS index
d = embeddings.shape[1]  # embedding dimension
index = faiss.IndexFlatL2(d)  # exact search; reports squared L2 (Euclidean) distances
index.add(np.asarray(embeddings, dtype='float32'))  # FAISS expects float32 vectors

print("✓ FAISS index built successfully!")
print(f"  - Index contains {index.ntotal} vectors")
print(f"  - Each vector has {d} dimensions")

## 7. Test Retrieval - Similarity Search Only

Let's test finding similar players using only FAISS similarity search.

In [None]:
# Function to retrieve similar players
def retrieve_similar(query, k=5):
    """Find the k players whose descriptions are closest to the query.

    'distance' is the squared L2 distance reported by IndexFlatL2
    (smaller means more similar).
    """
    # Embed the query
    query_emb = embedder.encode([query], convert_to_numpy=True)
    
    # Search FAISS index
    distances, indices = index.search(query_emb, k)
    
    # Collect player details for each hit
    results = []
    for rank, (dist, idx) in enumerate(zip(distances[0], indices[0]), start=1):
        if idx == -1:  # FAISS pads with -1 when fewer than k vectors exist
            continue
        results.append({
            'rank': rank,
            'player': player_stats[idx]['player'],
            'distance': float(dist),
            'points': player_stats[idx]['total_points'],
            'goals': player_stats[idx]['total_goals']
        })
    
    return results

# Test query
query = "Who are the best strikers who scored many goals?"
results = retrieve_similar(query, k=5)

print(f"Query: '{query}'\n")
print("Top 5 Similar Players:")
print("="*70)
for r in results:
    print(f"{r['rank']}. {r['player']:<25} | Distance: {r['distance']:.3f} | Goals: {r['goals']:>2} | Points: {r['points']:>3}")
print("="*70)

## 8. Retrieval-Augmented Generation (RAG)

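Now let's combine retrieval with answer generation to provide natural-language answers grounded in our knowledge graph. To keep the notebook self-contained, the generation step in the next cell is a rule-based mock rather than a real LLM call.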
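
One way the retrieved rows could be turned into an LLM prompt is sketched below. `build_prompt` and the prompt wording are assumptions for illustration, not part of this notebook's pipeline, and the player rows are invented. The resulting string would be passed to whichever LLM client is chosen.

```python
def build_prompt(question, retrieved):
    """Format retrieved player rows as context and append the user question."""
    context_lines = [
        f"- {r['player']}: {r['points']} points, {r['goals']} goals"
        for r in retrieved
    ]
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        "Context:\n" + "\n".join(context_lines) + "\n\n"
        f"Question: {question}\nAnswer:"
    )

# Invented rows with the same keys that retrieve_similar() returns
sample_retrieved = [
    {"player": "Player A", "points": 120, "goals": 10},
    {"player": "Player B", "points": 95, "goals": 4},
]
prompt = build_prompt("Who scored the most goals?", sample_retrieved)
print(prompt)
```

Instructing the model to rely only on the supplied context is what ties the generated answer back to the graph data.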

In [None]:
# Simple RAG function (mock LLM for demonstration)
def rag_answer(query, k=5):
    """
    Retrieval-Augmented Generation:
    1. Retrieve relevant players
    2. Build context from retrieved data
    3. Generate answer using context
    """
    # Step 1: Retrieve
    retrieved = retrieve_similar(query, k=k)
    if not retrieved:
        return f"**Query:** {query}\n\n**Answer:** No matching players were found."
    
    # Look up the full stats row for a retrieved player by name
    def stats_for(name):
        return next(p for p in player_stats if p['player'] == name)
    
    # Step 2: Build context
    context = f"Based on FPL 2022-23 season data, here are the {len(retrieved)} most relevant players:\n\n"
    for r in retrieved:
        player_data = stats_for(r['player'])
        context += f"- {r['player']}: {player_data['total_points']} points, "
        context += f"{player_data['total_goals']} goals, {player_data['total_assists']} assists, "
        context += f"{player_data['matches_played']} matches\n"
    
    # Step 3: Generate answer (mock - in production, would use LLM)
    answer = f"**Query:** {query}\n\n"
    answer += f"**Retrieved Context:**\n{context}\n"
    answer += "**Answer:** Based on the data, "
    
    if "goal" in query.lower():
        # Pick the highest scorer among the retrieved players, not just the nearest match
        top = max(retrieved, key=lambda r: r['goals'])
        answer += f"among the retrieved players, **{top['player']}** scored the most goals with **{top['goals']} goals** and **{top['points']} fantasy points**."
    elif "assist" in query.lower():
        top_data = max((stats_for(r['player']) for r in retrieved), key=lambda p: p['total_assists'])
        answer += f"among the retrieved players, **{top_data['player']}** had the most assists with **{top_data['total_assists']} assists** and **{top_data['total_points']} fantasy points**."
    else:
        # Works even when fewer than three players were retrieved
        names = ", ".join(f"**{r['player']}**" for r in retrieved[:3])
        answer += f"the closest matches include {names}."
    
    return answer

# Test RAG
query = "Who scored the most goals in 2022-23?"
answer = rag_answer(query, k=5)
print(answer)

## 9. Visualize Top Players

Let's visualize some insights from our knowledge graph.

In [None]:
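# Top 10 goal scorers (player_stats is ordered by points, so re-rank by goals)
df_top = pd.DataFrame(player_stats).nlargest(10, 'total_goals')

fig = px.bar(df_top, x='total_goals', y='player', orientation='h',
             title='Top 10 Goal Scorers - FPL 2022-23',
             labels={'total_goals': 'Goals', 'player': 'Player'},
             color='total_goals', color_continuous_scale='Reds',
             text='total_goals')

# Put the highest scorer at the top and hide the redundant color bar
fig.update_layout(height=400, coloraxis_showscale=False,
                  yaxis={'categoryorder': 'total ascending'})
fig.show()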

In [None]:
# Compare Goals vs Fantasy Points
fig = px.scatter(df_top, x='total_goals', y='total_points', 
                 text='player', size='matches_played',
                 title='Goals vs Fantasy Points (Top 10)',
                 labels={'total_goals': 'Goals Scored', 'total_points': 'Fantasy Points'})

fig.update_traces(textposition='top center')
fig.update_layout(height=500)
fig.show()

## 10. Summary & Key Takeaways

### What We Built:
1. ✅ **Neo4j Knowledge Graph** with 51,952 performance records
2. ✅ **Text Embeddings** using SentenceTransformer (384 dimensions)
3. ✅ **FAISS Index** for fast similarity search
4. ✅ **Retrieval-Augmented Generation** pipeline combining retrieved context with a mock (template-based) generation step

### Advantages of Graph-RAG:
- **Structured Data**: Neo4j provides accurate, queryable relationships
- **Semantic Search**: Embeddings find contextually similar players
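- **Grounded Answers**: Responses are built from data retrieved from the graph
- **Reduced Hallucination**: Constraining answers to retrieved context lowers, but does not eliminate, the risk of unsupported claims once a real LLM is used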

### Next Steps:
- Integrate with a real LLM (HuggingFace, OpenAI, etc.)
- Add more complex Cypher queries for deeper insights
- Create a Streamlit UI for interactive Q&A
- Compare different embedding models

In [None]:
# Close Neo4j connection
driver.close()
print("✓ Neo4j connection closed")