# Family Activity Ranking Model

This notebook implements a machine learning model that combines each group member's age and power level (overall ability) to rank activities suitable for the entire group.

## Features:
- Age-based activity filtering
- Power/ability-based scoring
- Group compatibility analysis
- Multi-factor ranking algorithm

## Google Colab Compatible
Upload your `dataset/dataset empty -open space-.csv` file when running on Colab.

## Methodology

We build on foundational methods in information retrieval and modern semantic search to create a robust activity ranking system:

### Core Technologies:

1. **BM25 (Best Matching 25)**: Provides strong keyword-based baselines for activity search and ranking. BM25 is a probabilistic ranking function that considers term frequency, inverse document frequency, and document length normalization [Robertson & Zaragoza, 2009].

2. **Sentence-BERT**: Enables dense embeddings for semantic similarity, allowing us to understand the meaning and context of activities beyond simple keyword matching [Reimers & Gurevych, 2019].

3. **FAISS (Facebook AI Similarity Search)**: Supports efficient nearest neighbor search at scale, both exact and approximate, enabling fast similarity computations across large activity databases [Johnson et al., 2017]. This notebook uses an exact flat inner-product index (`IndexFlatIP`).

4. **MMR (Maximal Marginal Relevance)**: Introduces diversity-aware ranking to ensure recommended activities are both relevant and diverse, preventing redundant suggestions [Carbonell & Goldstein, 1998].
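
For reference, the standard textbook forms of the two scoring rules named above are given below. The notation is the common one from the literature rather than anything defined in this notebook: $f(q_i, D)$ is the frequency of query term $q_i$ in document $D$, $|D|$ is the document length, $\text{avgdl}$ is the average document length, $R$ is the candidate set, and $S$ is the set of already selected items. The parameters $k_1$, $b$, and $\lambda$ correspond to `k1`, `b`, and `lambda_param` in the code.

$$
\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i)\,\frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\text{avgdl}}\right)}
$$

$$
\text{MMR} = \arg\max_{d_i \in R \setminus S}\left[\lambda\,\text{Sim}_1(d_i, q) - (1 - \lambda)\max_{d_j \in S}\text{Sim}_2(d_i, d_j)\right]
$$

Libraries differ in how they define $\text{IDF}$, so absolute BM25 values are not comparable across implementations.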

### Data Sources:

For content, we collect activity data from trusted family and education resources:
- **Raising Children Network**: Evidence-based parenting information
- **Active for Life**: Physical literacy and child development activities
- **Oxford Owl**: Educational activities and learning resources
- **Positive Action**: Social-emotional learning activities
- **Escape Room Geeks**: Problem-solving and group activities
- Additional curated educational and recreational resources

### References:
- Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. *Foundations and Trends in Information Retrieval*, 3(4), 333-389.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. *EMNLP 2019*.
- Johnson, J., Douze, M., & Jégou, H. (2017). Billion-scale similarity search with GPUs. *arXiv preprint arXiv:1702.08734*.
- Carbonell, J., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. *SIGIR 1998*.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# Information Retrieval and Semantic Search Libraries
try:
    from rank_bm25 import BM25Okapi
    print("✓ BM25 available")
except ImportError:
    print("⚠ BM25 not installed. Install with: pip install rank-bm25")
    BM25Okapi = None

try:
    from sentence_transformers import SentenceTransformer
    print("✓ Sentence-BERT available")
except ImportError:
    print("⚠ Sentence-BERT not installed. Install with: pip install sentence-transformers")
    SentenceTransformer = None

try:
    import faiss
    print("✓ FAISS available")
except ImportError:
    print("⚠ FAISS not installed. Install with: pip install faiss-cpu")
    faiss = None

# Set style for visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Core libraries imported successfully")

## 2. Define Group Member Structure

Each member has:
- **age**: Member's age in years
- **overall_ability**: Power level (1-10 scale) representing physical and cognitive capabilities

In [None]:
class GroupMember:
    """Represents a family member with age and ability attributes"""
    
    def __init__(self, name: str, age: int, overall_ability: float, 
                 interests: List[str] = None, special_needs: List[str] = None):
        self.name = name
        self.age = age
        self.overall_ability = overall_ability  # 1-10 scale
        self.interests = interests or []
        self.special_needs = special_needs or []
    
    def __repr__(self):
        return f"{self.name} (Age: {self.age}, Ability: {self.overall_ability})"


class FamilyGroup:
    """Manages a group of family members"""
    
    def __init__(self, members: List[GroupMember] = None):
        self.members = members or []
    
    def add_member(self, member: GroupMember):
        self.members.append(member)
    
    def get_age_range(self) -> Tuple[int, int]:
        """Returns (min_age, max_age) of the group"""
        if not self.members:
            return (0, 0)
        ages = [m.age for m in self.members]
        return (min(ages), max(ages))
    
    def get_avg_ability(self) -> float:
        """Returns average overall_ability of the group"""
        if not self.members:
            return 0.0
        return float(np.mean([m.overall_ability for m in self.members]))
    
    def get_ability_range(self) -> Tuple[float, float]:
        """Returns (min_ability, max_ability) of the group"""
        if not self.members:
            return (0.0, 0.0)
        abilities = [m.overall_ability for m in self.members]
        return (min(abilities), max(abilities))
    
    def __repr__(self):
        return f"FamilyGroup({len(self.members)} members)"


print("✓ Group member classes defined")

## 3. Create Sample Family Group

Define your family members here with their ages and ability levels.

In [None]:
# Example family group
sample_group = FamilyGroup([
    GroupMember("Emma", age=10, overall_ability=7.0, 
                interests=["Arts & Crafts", "Reading", "Nature"]),
    GroupMember("Liam", age=6, overall_ability=5.0, 
                interests=["Sports", "Building", "Outdoors"]),
    GroupMember("Sophia", age=13, overall_ability=8.5, 
                interests=["Music", "Dance", "Art"]),
    GroupMember("Dad", age=42, overall_ability=6.0, 
                interests=["Sports", "Hiking", "Cooking"]),
])

print("Family Group Summary:")
print("=" * 50)
for member in sample_group.members:
    print(f"  {member}")
print("=" * 50)
print(f"Age Range: {sample_group.get_age_range()[0]}-{sample_group.get_age_range()[1]} years")
print(f"Average Ability: {sample_group.get_avg_ability():.2f}/10")
print(f"Ability Range: {sample_group.get_ability_range()[0]}-{sample_group.get_ability_range()[1]}")
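
The features list mentions ability-based scoring, while the ranking code in this notebook filters by age only. The sketch below shows one possible way to score how well an activity matches a group's abilities. It assumes each activity carries a hypothetical `difficulty` value on the same 1-10 scale; no such column is confirmed by the dataset, and `ability_fit` is an illustrative name, not part of the notebook.

```python
from typing import Sequence


def ability_fit(abilities: Sequence[float], difficulty: float, scale: float = 9.0) -> float:
    """Return a 0-1 score; 1.0 means every member's ability equals the difficulty.

    `difficulty` is a hypothetical 1-10 activity rating (an assumption).
    `scale` is the widest possible gap on a 1-10 scale (10 - 1 = 9).
    """
    if not abilities:
        return 0.0
    # Mean absolute gap between each member's ability and the activity difficulty
    mean_gap = sum(abs(a - difficulty) for a in abilities) / len(abilities)
    return max(0.0, 1.0 - mean_gap / scale)


# Ability levels of the sample family defined above
family = [7.0, 5.0, 8.5, 6.0]
for difficulty in (3.0, 6.5, 10.0):
    print(f"difficulty {difficulty}: fit = {ability_fit(family, difficulty):.2f}")
```

A mean-gap score treats too-easy and too-hard activities the same way. A stricter variant could penalize only the gap for the weakest member, which may suit groups with young children better.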

## 4. Load Activity Dataset

### For Google Colab:
Upload your CSV file using the code below, or mount Google Drive.

In [None]:
# Easy Installation: Use the requirements.txt file
# !pip install -r requirements.txt

# OR install packages individually:
# !pip install rank-bm25 sentence-transformers faiss-cpu

# Note: Use faiss-cpu for maximum compatibility (works on all platforms)
# Only use faiss-gpu if you have CUDA-enabled GPU and NVIDIA drivers installed
# !pip install faiss-gpu  # Only if you have CUDA support

print("""
Required Packages:
==================

Quick Install (Recommended):
   pip install -r requirements.txt

This will install:

1. rank-bm25: BM25 keyword-based ranking
   - Install: pip install rank-bm25
   - Reference: Robertson & Zaragoza (2009)

2. sentence-transformers: Sentence-BERT embeddings
   - Install: pip install sentence-transformers
   - Reference: Reimers & Gurevych (2019)
   - Models will download automatically on first use (~90MB for MiniLM)

3. faiss-cpu: Efficient similarity search (CPU version - works on all platforms)
   - Install: pip install faiss-cpu
   - Reference: Johnson et al. (2017)
   - For GPU: pip install faiss-gpu (requires CUDA - not available on all systems)

Core libraries (already included):
- pandas, numpy, matplotlib, seaborn

Note: The system will work without these packages but will use fallback methods.
The hybrid ranker will automatically detect and use available components.
""")

## Installation Instructions

To use the hybrid ranking system with all features, install these packages:

In [None]:
# Compare different ranking approaches
test_query = "fun outdoor games for children"

print("=" * 80)
print(f"Comparing Ranking Methods for Query: '{test_query}'")
print("=" * 80)

# BM25 only (keyword-based)
results_bm25 = hybrid_ranker.search(test_query, top_k=10, bm25_weight=1.0, semantic_weight=0.0, use_mmr=False)
print("\n1. BM25 (Keyword-Based) - Top 5:")
print("-" * 80)
print(results_bm25[['rank', 'title', 'bm25_score']].head(5).to_string(index=False))

# Semantic only (meaning-based)
results_semantic = hybrid_ranker.search(test_query, top_k=10, bm25_weight=0.0, semantic_weight=1.0, use_mmr=False)
print("\n2. Semantic (Meaning-Based) - Top 5:")
print("-" * 80)
print(results_semantic[['rank', 'title', 'semantic_score']].head(5).to_string(index=False))

# Hybrid (combined)
results_hybrid = hybrid_ranker.search(test_query, top_k=10, bm25_weight=0.3, semantic_weight=0.7, use_mmr=False)
print("\n3. Hybrid (BM25 30% + Semantic 70%) - Top 5:")
print("-" * 80)
print(results_hybrid[['rank', 'title', 'hybrid_score', 'bm25_score', 'semantic_score']].head(5).to_string(index=False))

# Hybrid + MMR (diversity-aware)
results_mmr = hybrid_ranker.search(test_query, top_k=10, bm25_weight=0.3, semantic_weight=0.7, use_mmr=True)
print("\n4. Hybrid + MMR (Diversity-Aware) - Top 5:")
print("-" * 80)
print(results_mmr[['rank', 'title', 'hybrid_score']].head(5).to_string(index=False))

print("\n" + "=" * 80)
print("Key Observations:")
print("=" * 80)
print("• BM25: Excels at exact keyword matching (e.g., 'outdoor', 'games')")
print("• Semantic: Captures meaning and context (e.g., finds 'play' activities even without 'games')")
print("• Hybrid: Balances keyword precision with semantic understanding")
print("• MMR: Adds diversity to prevent redundant similar activities")
print("=" * 80)

## 7. Comparing Ranking Methods

Compare BM25 (keyword), semantic (meaning-based), hybrid (combined), and hybrid with MMR (diversity-aware) rankings for the same query.
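
The hybrid score used in this notebook is a weighted sum of min-max normalized BM25 and semantic scores. The sketch below reproduces that idea on toy numbers. The function names `min_max` and `fuse` are illustrative, and one detail differs from the notebook's `normalize`: here a constant score array maps to all zeros instead of being returned unchanged.

```python
import numpy as np


def min_max(scores) -> np.ndarray:
    """Scale scores to [0, 1]; a constant array maps to all zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


def fuse(bm25, semantic, bm25_weight=0.3, semantic_weight=0.7) -> np.ndarray:
    """Weighted sum of the two normalized score arrays."""
    return bm25_weight * min_max(bm25) + semantic_weight * min_max(semantic)


# Toy scores for three activities
bm25 = np.array([4.0, 0.0, 2.0])      # activity 0 matches the keywords best
semantic = np.array([0.2, 0.8, 0.5])  # activity 1 is closest in meaning
hybrid = fuse(bm25, semantic)
print(hybrid, np.argsort(hybrid)[::-1])
```

With the default 30/70 weighting, the semantically closest activity ranks first even though it has no keyword match. Setting `bm25_weight=1.0, semantic_weight=0.0` reverses the order.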

In [None]:
# Initialize hybrid ranking system
hybrid_ranker = HybridActivityRanker(activities_df)

# Example 1: Keyword-based search
print("=" * 80)
print("EXAMPLE 1: Keyword Search - 'outdoor sports'")
print("=" * 80)
results1 = hybrid_ranker.search("outdoor sports", top_k=10)
print(results1[['rank', 'title', 'hybrid_score', 'bm25_score', 'semantic_score', 
                'age_min', 'age_max']].to_string(index=False))

print("\n" + "=" * 80)
print("EXAMPLE 2: Semantic Search - 'creative activities for kids'")
print("=" * 80)
results2 = hybrid_ranker.search("creative activities for kids", top_k=10)
print(results2[['rank', 'title', 'hybrid_score', 'bm25_score', 'semantic_score',
                'age_min', 'age_max']].to_string(index=False))

print("\n" + "=" * 80)
print("EXAMPLE 3: Group-based Search (using sample family)")
print("=" * 80)
results3 = hybrid_ranker.search_by_group(sample_group, top_k=10)
print(results3[['rank', 'title', 'hybrid_score', 'age_min', 'age_max', 'duration_mins']].to_string(index=False))

print("\n" + "=" * 80)
print("EXAMPLE 4: Custom Query for Group - 'educational STEM activities'")
print("=" * 80)
results4 = hybrid_ranker.search_by_group(sample_group, query="educational STEM activities", top_k=10)
print(results4[['rank', 'title', 'hybrid_score', 'age_min', 'age_max']].to_string(index=False))

## 6. Hybrid Ranking System Demo

Demonstrate the hybrid ranking system combining BM25, Sentence-BERT, FAISS, and MMR.

In [None]:
class HybridActivityRanker:
    """
    Hybrid ranking system combining BM25, Sentence-BERT, FAISS, and MMR.
    Scores activities by a weighted mix of keyword and semantic relevance, with optional MMR re-ranking for diversity.
    """
    
    def __init__(self, activities_df: pd.DataFrame):
        """
        Initialize hybrid ranker with activity data.
        
        Args:
            activities_df: DataFrame containing activity information
        """
        self.activities_df = activities_df
        
        # Prepare text representations of activities
        self.activity_texts = self._prepare_activity_texts()
        
        # Initialize IR components
        print("\nInitializing hybrid ranking system...")
        print("-" * 50)
        
        self.bm25_ranker = BM25Ranker(self.activity_texts)
        self.semantic_embedder = SemanticEmbedder()
        self.faiss_index = None
        self.mmr_ranker = MMRRanker(lambda_param=0.6)  # Slightly favor relevance
        
        # Build semantic index
        if self.semantic_embedder.model is not None:
            self.semantic_embedder.build_index(self.activity_texts)
            
            # Build FAISS index
            if faiss is not None:
                embedding_dim = self.semantic_embedder.embeddings.shape[1]
                self.faiss_index = FAISSIndex(embedding_dim)
                self.faiss_index.add_embeddings(self.semantic_embedder.embeddings)
        
        print("-" * 50)
        print("✓ Hybrid ranking system ready\n")
    
    def _prepare_activity_texts(self) -> List[str]:
        """
        Prepare text representations of activities for indexing.
        Combines title, description, tags, and metadata.
        """
        texts = []
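        for _, row in self.activities_df.iterrows():
            # Tags may arrive as a list or, when read from CSV, as a single string
            tags = row.get('tags', '')
            if isinstance(tags, (list, tuple, set)):
                tags = ' '.join(str(tag) for tag in tags)
            # Combine relevant fields
            parts = [
                row.get('title', ''),
                row.get('description', ''),
                tags,
                f"age {row.get('age_min', '')} to {row.get('age_max', '')}",
                row.get('indoor_outdoor', ''),
                row.get('category', '')
            ]
            # Skip missing values (None/NaN) and empty strings
            text = ' '.join(str(part) for part in parts if pd.notna(part) and str(part))
            texts.append(text)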
        
        return texts
    
    def search(
        self,
        query: str,
        top_k: int = 20,
        bm25_weight: float = 0.3,
        semantic_weight: float = 0.7,
        use_mmr: bool = True,
        mmr_lambda: float = 0.6
    ) -> pd.DataFrame:
        """
        Search activities using hybrid ranking.
        
        Args:
            query: Search query string
            top_k: Number of results to return
            bm25_weight: Weight for BM25 scores (keyword matching)
            semantic_weight: Weight for semantic similarity scores
            use_mmr: Apply MMR for diversity-aware re-ranking
            mmr_lambda: MMR lambda parameter (relevance vs diversity trade-off)
            
        Returns:
            DataFrame with ranked activities and scores
        """
        n_activities = len(self.activities_df)
        
        # Get BM25 scores
        bm25_scores = self.bm25_ranker.get_scores(query)
        
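        # Get semantic similarity scores (all zeros if no embeddings were built)
        if self.semantic_embedder.embeddings is not None:
            semantic_scores = self.semantic_embedder.similarity(query)
        else:
            semantic_scores = np.zeros(n_activities)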
        
        # Normalize scores to [0, 1]
        def normalize(scores):
            if scores.max() > scores.min():
                return (scores - scores.min()) / (scores.max() - scores.min())
            return scores
        
        bm25_scores_norm = normalize(bm25_scores)
        semantic_scores_norm = normalize(semantic_scores)
        
        # Combine scores
        hybrid_scores = (
            bm25_weight * bm25_scores_norm + 
            semantic_weight * semantic_scores_norm
        )
        
        # Get initial ranking
        initial_ranking = np.argsort(hybrid_scores)[::-1]
        
        # Apply MMR for diversity if requested
        if use_mmr and self.semantic_embedder.embeddings is not None:
            query_embedding = self.semantic_embedder.encode([query])[0]
            
            # Re-rank using MMR
            final_ranking = self.mmr_ranker.rerank(
                query_embedding,
                self.semantic_embedder.embeddings,
                initial_ranking.tolist(),
                hybrid_scores,
                k=top_k,
                lambda_param=mmr_lambda
            )
        else:
            final_ranking = initial_ranking[:top_k].tolist()
        
        # Prepare results
        results = self.activities_df.iloc[final_ranking].copy()
        results['hybrid_score'] = hybrid_scores[final_ranking]
        results['bm25_score'] = bm25_scores_norm[final_ranking]
        results['semantic_score'] = semantic_scores_norm[final_ranking]
        results['rank'] = range(1, len(results) + 1)
        
        return results
    
    def search_by_group(
        self,
        group: 'FamilyGroup',
        query: str = None,
        top_k: int = 20,
        use_mmr: bool = True
    ) -> pd.DataFrame:
        """
        Search activities suitable for a family group with optional query.
        Combines hybrid search with an age-overlap filter; ability levels are not used for filtering.
        
        Args:
            group: FamilyGroup object
            query: Optional search query (e.g., "outdoor sports")
            top_k: Number of results to return
            use_mmr: Apply MMR for diversity
            
        Returns:
            DataFrame with ranked activities
        """
        # Build query from group characteristics if not provided
        if query is None:
            age_min, age_max = group.get_age_range()
            
            query = f"activities for ages {age_min} to {age_max}"
            
            # Add up to three distinct interests, keeping first-seen order
            all_interests = []
            for member in group.members:
                all_interests.extend(member.interests)
            if all_interests:
                unique_interests = list(dict.fromkeys(all_interests))[:3]
                query += f" {' '.join(unique_interests)}"
        
        # Search
        results = self.search(query, top_k=top_k * 2, use_mmr=use_mmr)
        
        # Filter by age appropriateness
        age_min, age_max = group.get_age_range()
        
        # Keep activities where there's some age overlap
        def has_age_overlap(row):
            return not (row['age_max'] < age_min or row['age_min'] > age_max)
        
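        results = results[results.apply(has_age_overlap, axis=1)].head(top_k).copy()
        
        # Renumber ranks so they stay consecutive after filtering
        results['rank'] = range(1, len(results) + 1)
        
        return results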


print("✓ HybridActivityRanker class defined")

In [None]:
class MMRRanker:
    """
    Maximal Marginal Relevance (MMR) for diversity-aware ranking.
    Balances relevance to query with diversity among results to avoid redundancy.
    """
    
    def __init__(self, lambda_param: float = 0.5):
        """
        Initialize MMR ranker.
        
        Args:
            lambda_param: Trade-off between relevance and diversity (0-1)
                         1.0 = pure relevance, 0.0 = pure diversity
                         Default: 0.5 (balanced)
        """
        self.lambda_param = lambda_param
    
    def compute_mmr(
        self,
        query_embedding: np.ndarray,
        doc_embeddings: np.ndarray,
        relevance_scores: np.ndarray = None,
        k: int = 10,
        lambda_param: float = None
    ) -> List[int]:
        """
        Compute MMR ranking for documents.
        
        Args:
            query_embedding: Query embedding vector
            doc_embeddings: Document embedding matrix (shape: [n_docs, embedding_dim])
            relevance_scores: Pre-computed relevance scores (optional)
            k: Number of results to return
            lambda_param: Override default lambda (optional)
            
        Returns:
            List of document indices in MMR order
        """
        if lambda_param is None:
            lambda_param = self.lambda_param
        
        n_docs = len(doc_embeddings)
        k = min(k, n_docs)
        # Nothing to select from an empty candidate set or a non-positive k
        if k <= 0:
            return []
        
        # Compute relevance scores if not provided (cosine similarity with query)
        if relevance_scores is None:
            query_norm = query_embedding / np.linalg.norm(query_embedding)
            doc_norms = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
            relevance_scores = np.dot(doc_norms, query_norm)
        
        # Normalize relevance scores to [0, 1]
        if relevance_scores.max() > relevance_scores.min():
            relevance_scores = (relevance_scores - relevance_scores.min()) / (
                relevance_scores.max() - relevance_scores.min()
            )
        
        # Compute pairwise similarities between documents
        doc_norms = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
        doc_similarities = np.dot(doc_norms, doc_norms.T)
        
        # MMR selection
        selected_indices = []
        remaining_indices = list(range(n_docs))
        
        # Select first document (highest relevance)
        first_idx = np.argmax(relevance_scores)
        selected_indices.append(first_idx)
        remaining_indices.remove(first_idx)
        
        # Iteratively select documents
        for _ in range(k - 1):
            if not remaining_indices:
                break
            
            mmr_scores = []
            for idx in remaining_indices:
                # Relevance component
                relevance = relevance_scores[idx]
                
                # Diversity component (max similarity to already selected docs)
                max_similarity = max(doc_similarities[idx, sel_idx] for sel_idx in selected_indices)
                
                # MMR score
                mmr_score = lambda_param * relevance - (1 - lambda_param) * max_similarity
                mmr_scores.append(mmr_score)
            
            # Select document with highest MMR score
            best_idx = remaining_indices[np.argmax(mmr_scores)]
            selected_indices.append(best_idx)
            remaining_indices.remove(best_idx)
        
        return selected_indices
    
    def rerank(
        self,
        query_embedding: np.ndarray,
        doc_embeddings: np.ndarray,
        initial_ranking: List[int],
        relevance_scores: np.ndarray = None,
        k: int = 10,
        lambda_param: float = None
    ) -> List[int]:
        """
        Re-rank an initial ranking using MMR for diversity.
        
        Args:
            query_embedding: Query embedding
            doc_embeddings: Document embeddings
            initial_ranking: Initial ranking (list of document indices)
            relevance_scores: Pre-computed relevance scores
            k: Number of results to return
            lambda_param: Override default lambda
            
        Returns:
            Re-ranked list of document indices
        """
        # Only apply MMR to top documents from initial ranking
        candidate_indices = initial_ranking[:min(len(initial_ranking), k * 3)]
        candidate_embeddings = doc_embeddings[candidate_indices]
        
        if relevance_scores is not None:
            candidate_scores = relevance_scores[candidate_indices]
        else:
            candidate_scores = None
        
        # Apply MMR
        mmr_indices = self.compute_mmr(
            query_embedding,
            candidate_embeddings,
            candidate_scores,
            k,
            lambda_param
        )
        
        # Map back to original indices
        reranked_indices = [candidate_indices[i] for i in mmr_indices]
        
        return reranked_indices


print("✓ MMRRanker class defined")
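
To make the relevance versus diversity trade-off concrete, here is a compact, standalone sketch of the same greedy MMR selection on three toy documents. The function name `mmr_select` and the toy vectors are illustrative and not part of the notebook.

```python
import numpy as np


def mmr_select(relevance, doc_vecs, k, lambda_param=0.6):
    """Greedy MMR: pick k indices, trading relevance against redundancy."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = unit @ unit.T  # pairwise cosine similarity between documents
    # Start with the most relevant document
    selected = [int(np.argmax(relevance))]
    remaining = [i for i in range(len(relevance)) if i != selected[0]]
    while remaining and len(selected) < k:
        scores = [
            lambda_param * relevance[i] - (1 - lambda_param) * sim[i, selected].max()
            for i in remaining
        ]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 points in another direction
docs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
relevance = np.array([1.0, 0.9, 0.6])
print(mmr_select(relevance, docs, k=2, lambda_param=1.0))
print(mmr_select(relevance, docs, k=2, lambda_param=0.5))
```

With `lambda_param=1.0` the selection is `[0, 1]` (pure relevance, so the near-duplicate is kept). With `lambda_param=0.5` it is `[0, 2]`, because the redundancy penalty outweighs the small relevance advantage of document 1.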

In [None]:
class FAISSIndex:
    """
    FAISS (Facebook AI Similarity Search) for efficient nearest neighbor search.
    Uses an exact inner-product index (IndexFlatIP); vectors are L2-normalized so scores are cosine similarities.
    """
    
    def __init__(self, embedding_dim: int = 384):
        """
        Initialize FAISS index.
        
        Args:
            embedding_dim: Dimension of embeddings (default: 384 for MiniLM)
        """
        self.embedding_dim = embedding_dim
        self.index = None
        self.is_trained = False
        
        if faiss is not None:
            # Use IndexFlatIP (Inner Product) for cosine similarity with normalized vectors
            self.index = faiss.IndexFlatIP(embedding_dim)
            print(f"✓ FAISS index initialized (dim={embedding_dim})")
        else:
            print("⚠ FAISS not available - will use brute force search")
    
    def add_embeddings(self, embeddings: np.ndarray):
        """
        Add embeddings to the FAISS index.
        
        Args:
            embeddings: Array of embeddings to index (shape: [n, embedding_dim])
        """
        if self.index is not None and faiss is not None:
            # Normalize embeddings for cosine similarity
            embeddings = embeddings.astype('float32')
            faiss.normalize_L2(embeddings)
            
            self.index.add(embeddings)
            self.is_trained = True
            print(f"✓ Added {embeddings.shape[0]} embeddings to FAISS index")
        else:
            self.embeddings_fallback = embeddings
            print(f"⚠ Using fallback storage for {len(embeddings)} embeddings")
    
    def search(self, query_embedding: np.ndarray, k: int = 10) -> Tuple[np.ndarray, np.ndarray]:
        """
        Search for k nearest neighbors.
        
        Args:
            query_embedding: Query embedding vector
            k: Number of nearest neighbors to return
            
        Returns:
            Tuple of (distances, indices) arrays
        """
        if self.index is not None and self.is_trained and faiss is not None:
            # Normalize query
            query = query_embedding.reshape(1, -1).astype('float32')
            faiss.normalize_L2(query)
            
            # Search
            distances, indices = self.index.search(query, k)
            return distances[0], indices[0]
        else:
            # Fallback: brute force cosine similarity
            query_norm = query_embedding / np.linalg.norm(query_embedding)
            embeddings_norm = self.embeddings_fallback / np.linalg.norm(
                self.embeddings_fallback, axis=1, keepdims=True
            )
            
            similarities = np.dot(embeddings_norm, query_norm)
            top_k_indices = np.argsort(similarities)[::-1][:k]
            top_k_distances = similarities[top_k_indices]
            
            return top_k_distances, top_k_indices
    
    def batch_search(self, query_embeddings: np.ndarray, k: int = 10) -> Tuple[np.ndarray, np.ndarray]:
        """
        Search for k nearest neighbors for multiple queries.
        
        Args:
            query_embeddings: Array of query embeddings (shape: [n_queries, embedding_dim])
            k: Number of nearest neighbors per query
            
        Returns:
            Tuple of (distances, indices) arrays
        """
        if self.index is not None and self.is_trained and faiss is not None:
            queries = query_embeddings.astype('float32')
            faiss.normalize_L2(queries)
            
            distances, indices = self.index.search(queries, k)
            return distances, indices
        else:
            # Fallback: batch brute force search
            distances_list = []
            indices_list = []
            for query_emb in query_embeddings:
                dist, idx = self.search(query_emb, k)
                distances_list.append(dist)
                indices_list.append(idx)
            
            return np.array(distances_list), np.array(indices_list)


print("✓ FAISSIndex class defined")
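
`IndexFlatIP` performs an exhaustive inner-product search, and on L2-normalized vectors the inner product equals cosine similarity. The NumPy sketch below shows the same computation in miniature. It is meant to illustrate what the index returns, not to replace FAISS; `l2_normalize` and `flat_ip_search` are illustrative names.

```python
import numpy as np


def l2_normalize(x) -> np.ndarray:
    """Scale each vector to unit length, guarding against zero vectors."""
    x = np.asarray(x, dtype="float32")
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def flat_ip_search(index_vecs, query, k):
    """Return (scores, indices) of the k largest inner products, best first."""
    scores = l2_normalize(index_vecs) @ l2_normalize(query)
    top = np.argsort(-scores)[:k]
    return scores[top], top


vectors = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
scores, ids = flat_ip_search(vectors, np.array([1.0, 0.0]), k=2)
print(ids, scores)
```

Because the values are similarities, larger means closer. That is the opposite of an L2-distance index, where smaller means closer.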

In [None]:
class SemanticEmbedder:
    """
    Sentence-BERT semantic embeddings for capturing meaning and context.
    Converts text into dense vector representations for similarity computation.
    """
    
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        """
        Initialize Sentence-BERT model.
        
        Args:
            model_name: Pre-trained model to use (default: all-MiniLM-L6-v2)
                       Options: 'all-MiniLM-L6-v2' (fast, good quality)
                               'all-mpnet-base-v2' (best quality, slower)
        """
        self.model = None
        self.model_name = model_name
        self.embeddings = None
        
        if SentenceTransformer is not None:
            try:
                self.model = SentenceTransformer(model_name)
                print(f"✓ Sentence-BERT model '{model_name}' loaded")
            except Exception as e:
                print(f"⚠ Could not load Sentence-BERT model: {e}")
        else:
            print("⚠ Sentence-BERT not available - semantic search disabled")
    
    def encode(self, texts: List[str], show_progress: bool = False) -> np.ndarray:
        """
        Encode texts into semantic embeddings.
        
        Args:
            texts: List of text strings to encode
            show_progress: Show progress bar during encoding
            
        Returns:
            Array of embeddings (shape: [n_texts, embedding_dim])
        """
        if self.model is not None:
            embeddings = self.model.encode(
                texts, 
                show_progress_bar=show_progress,
                convert_to_numpy=True
            )
            return embeddings
        else:
            # Fallback: return zero vectors
            return np.zeros((len(texts), 384))
    
    def build_index(self, documents: List[str], show_progress: bool = True):
        """
        Build embeddings index for a document collection.
        
        Args:
            documents: List of documents to index
            show_progress: Show progress during encoding
        """
        print(f"Building semantic embeddings for {len(documents)} documents...")
        self.embeddings = self.encode(documents, show_progress=show_progress)
        print(f"✓ Embeddings built: shape {self.embeddings.shape}")
    
    def similarity(self, query: str, documents: List[str] = None) -> np.ndarray:
        """
        Compute cosine similarity between query and documents.
        
        Args:
            query: Query string
            documents: Documents to compare (uses indexed if None)
            
        Returns:
            Array of cosine similarity scores (range -1 to 1)
        """
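        if self.model is None:
            # No model: return zeros sized to the documents or to the built index
            if documents is not None:
                return np.zeros(len(documents))
            n_indexed = 0 if self.embeddings is None else len(self.embeddings)
            return np.zeros(n_indexed)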
        
        query_embedding = self.encode([query])[0]
        
        if documents is not None:
            doc_embeddings = self.encode(documents)
        else:
            doc_embeddings = self.embeddings
        
        # Cosine similarity
        similarities = np.dot(doc_embeddings, query_embedding) / (
            np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
        )
        
        return similarities
    
    def rank(self, query: str, top_k: int = None) -> List[Tuple[int, float]]:
        """
        Rank documents by semantic similarity to query.
        
        Args:
            query: Search query
            top_k: Return only top K results
            
        Returns:
            List of (document_index, similarity_score) tuples
        """
        similarities = self.similarity(query)
        ranked_indices = np.argsort(similarities)[::-1]
        
        if top_k:
            ranked_indices = ranked_indices[:top_k]
        
        return [(idx, similarities[idx]) for idx in ranked_indices]


print("✓ SemanticEmbedder class defined")

In [None]:
class BM25Ranker:
    """
    BM25 (Best Matching 25) keyword-based ranking.
    Provides probabilistic ranking based on term frequency and inverse document frequency.
    """
    
    def __init__(self, documents: List[str], k1: float = 1.5, b: float = 0.75):
        """
        Initialize BM25 ranker.
        
        Args:
            documents: List of text documents to index
            k1: Term frequency saturation parameter (default: 1.5)
            b: Length normalization parameter (default: 0.75)
        """
        self.documents = documents
        self.k1 = k1
        self.b = b
        self.bm25 = None
        
        if BM25Okapi is not None:
            # Tokenize documents (simple whitespace tokenization)
            tokenized_docs = [doc.lower().split() for doc in documents]
            self.bm25 = BM25Okapi(tokenized_docs, k1=k1, b=b)
            print(f"✓ BM25 index built with {len(documents)} documents")
        else:
            print("⚠ BM25 not available - using fallback term-overlap scoring")
    
    def get_scores(self, query: str) -> np.ndarray:
        """
        Get BM25 scores for a query against all indexed documents.
        
        Args:
            query: Search query string
            
        Returns:
            Array of BM25 scores for each document
        """
        if self.bm25 is not None:
            tokenized_query = query.lower().split()
            scores = self.bm25.get_scores(tokenized_query)
            return scores
        else:
            # Fallback: count how many distinct query terms appear in each document
            query_terms = set(query.lower().split())
            scores = []
            for doc in self.documents:
                doc_terms = set(doc.lower().split())  # set gives O(1) membership tests
                score = sum(1 for term in query_terms if term in doc_terms)
                scores.append(score)
            return np.array(scores, dtype=float)
    
    def rank(self, query: str, top_k: int = None) -> List[Tuple[int, float]]:
        """
        Rank documents by relevance to query.
        
        Args:
            query: Search query
            top_k: Return only top K results (None = all)
            
        Returns:
            List of (document_index, score) tuples, sorted by score descending
        """
        scores = self.get_scores(query)
        ranked_indices = np.argsort(scores)[::-1]
        
        if top_k:
            ranked_indices = ranked_indices[:top_k]
        
        return [(idx, scores[idx]) for idx in ranked_indices]


print("✓ BM25Ranker class defined")

## 5. Information Retrieval Components

This section implements the core IR methods: BM25, Sentence-BERT, FAISS, and MMR.
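
As a rough illustration of the diversity-aware ranking idea behind MMR, the sketch below greedily picks items that are relevant to the query but not too similar to items already chosen. It is not the notebook's own implementation: the function name `mmr_select`, the `lambda_param` trade-off value, and the toy vectors are all assumptions made for this example.

```python
import numpy as np

def mmr_select(query_vec, doc_vecs, k=3, lambda_param=0.7):
    """Greedy Maximal Marginal Relevance over cosine similarities (illustrative)."""
    # Normalize so that dot products are cosine similarities
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    relevance = d @ q      # similarity of each document to the query
    pairwise = d @ d.T     # similarity between every pair of documents

    selected = []
    candidates = list(range(len(d)))
    while candidates and len(selected) < k:
        if selected:
            # Redundancy: highest similarity to anything already selected
            redundancy = pairwise[np.ix_(candidates, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        mmr = lambda_param * relevance[candidates] - (1 - lambda_param) * redundancy
        best = candidates[int(np.argmax(mmr))]
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy example: documents 0 and 1 are near-duplicates, document 2 is different
docs = np.array([[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]])
query = np.array([1.0, 0.1])
picked = mmr_select(query, docs, k=2, lambda_param=0.5)
print(picked)  # [1, 2]
```

With `lambda_param=1.0` the selection reduces to plain relevance ranking; lower values trade relevance for variety.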

In [None]:
# Uncomment for Google Colab file upload
# from google.colab import files
# uploaded = files.upload()
# csv_path = list(uploaded.keys())[0]  # Get uploaded filename

# For local execution
csv_path = "dataset/dataset empty -open space-.csv"

# Load activities
def load_activities(file_path: str) -> pd.DataFrame:
    """Load and preprocess activity dataset"""
    df = pd.read_csv(file_path)
    
    # Clean and convert data types
    df['age_min'] = pd.to_numeric(df['age_min'], errors='coerce')
    df['age_max'] = pd.to_numeric(df['age_max'], errors='coerce')
    df['duration_mins'] = pd.to_numeric(df['duration_mins'], errors='coerce')
    
    # Parse tags from a comma-separated string; missing values become an empty list
    df['tags'] = df['tags'].apply(
        lambda x: [tag.strip() for tag in str(x).split(',') if tag.strip()] if pd.notna(x) else []
    )
    
    return df

activities_df = load_activities(csv_path)

print(f"✓ Loaded {len(activities_df)} activities")
print(f"\nSample activities:")
print(activities_df[['title', 'age_min', 'age_max', 'duration_mins', 'tags']].head(10))

## 5. Activity Ranking Model

This model ranks activities based on multiple factors:
1. **Age Fit Score**: Fraction of the group's age range covered by the activity's age range
2. **Ability Score**: How closely the estimated activity difficulty matches the group's average ability
3. **Coverage Score**: Fraction of group members whose age falls within the activity's range
4. **Diversity Score**: Width of the activity's age range, used as a proxy for accommodating different ability levels
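
To make the weighting concrete, here is a small worked example of how four component scores combine into one 0-100 value. The weights are the defaults used by `calculate_composite_score`; the component scores are hypothetical values chosen for illustration.

```python
# Default weights from ActivityRankingModel.calculate_composite_score
weights = {'age_fit': 0.30, 'coverage': 0.30, 'ability': 0.25, 'diversity': 0.15}

# Hypothetical component scores for a single activity (each in the 0-1 range)
scores = {'age_fit': 0.8, 'coverage': 1.0, 'ability': 0.6, 'diversity': 0.5}

# Weighted sum, scaled to 0-100
composite = 100 * sum(weights[k] * scores[k] for k in weights)
print(f"Composite score: {composite:.1f}/100")  # Composite score: 76.5/100
```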

In [None]:
class ActivityRankingModel:
    """Model to rank activities for a family group based on age and ability"""
    
    def __init__(self, group: FamilyGroup, activities: pd.DataFrame):
        self.group = group
        self.activities = activities
        self.ranked_activities = None
    
    def calculate_age_fit_score(self, activity_row) -> float:
        """
        Calculate how well activity age range fits the group.
        Returns score 0-1 (higher is better)
        """
        group_min, group_max = self.group.get_age_range()
        activity_min = activity_row['age_min']
        activity_max = activity_row['age_max']
        
        # Check if activity range overlaps with group range
        if activity_max < group_min or activity_min > group_max:
            return 0.0  # No overlap
        
        # Calculate overlap in whole years (both endpoints inclusive)
        overlap_min = max(activity_min, group_min)
        overlap_max = min(activity_max, group_max)
        overlap_size = overlap_max - overlap_min + 1
        
        group_range = group_max - group_min + 1
        
        # Score based on how much of the group range is covered
        coverage = overlap_size / group_range
        
        return min(coverage, 1.0)
    
    def calculate_member_coverage(self, activity_row) -> float:
        """
        Calculate what percentage of group members can participate.
        Returns score 0-1 (1 = all members can participate)
        """
        if not self.group.members:
            return 0.0
        
        activity_min = activity_row['age_min']
        activity_max = activity_row['age_max']
        
        eligible_count = sum(
            1 for member in self.group.members 
            if activity_min <= member.age <= activity_max
        )
        
        return eligible_count / len(self.group.members)
    
    def estimate_difficulty(self, activity_row) -> float:
        """
        Estimate activity difficulty based on age range and tags.
        Returns difficulty score 1-10 (10 = most difficult)
        """
        # Base difficulty on minimum age requirement
        age_difficulty = activity_row['age_min'] / 2.0  # Scale to roughly 1-10
        
        # Adjust based on tags (keys must be lowercase: tags are lowercased on lookup)
        difficulty_modifiers = {
            'exercise': 1.5,
            'sports': 2.0,
            'coordination': 1.5,
            'stem': 1.5,
            'problem solving': 2.0,
            'balance': 1.0,
            'sensory': -0.5,
            'fun': -0.5,
        }
        
        tags = activity_row['tags']
        modifier = sum(
            difficulty_modifiers.get(tag.strip().lower(), 0) 
            for tag in tags
        )
        
        difficulty = age_difficulty + modifier
        return np.clip(difficulty, 1, 10)
    
    def calculate_ability_score(self, activity_row) -> float:
        """
        Calculate how well activity difficulty matches group ability.
        Returns score 0-1 (higher = better match)
        """
        activity_difficulty = self.estimate_difficulty(activity_row)
        group_avg_ability = self.group.get_avg_ability()
        
        # Score based on how close difficulty is to average ability
        # Activities slightly below average ability are preferred (more accessible)
        difference = abs(activity_difficulty - (group_avg_ability - 1))
        
        # Convert difference to score (smaller difference = higher score)
        score = 1.0 - (difference / 10.0)
        return max(0, score)
    
    def calculate_diversity_score(self, activity_row) -> float:
        """
        Calculate whether activity can accommodate different ability levels.
        Returns score 0-1 (higher = more inclusive)
        """
        activity_age_range = activity_row['age_max'] - activity_row['age_min']
        
        # Wider age ranges typically accommodate more ability levels
        diversity = min(activity_age_range / 10.0, 1.0)
        
        return diversity
    
    def calculate_composite_score(self, activity_row, weights: Dict[str, float] = None) -> float:
        """
        Calculate weighted composite score for an activity.
        
        Args:
            activity_row: Activity data
            weights: Dictionary of weights for each component score
        
        Returns:
            Composite score 0-100 (when the weights sum to 1)
        """
        if weights is None:
            weights = {
                'age_fit': 0.30,       # 30% - Age appropriateness is crucial
                'coverage': 0.30,      # 30% - How many can participate
                'ability': 0.25,       # 25% - Difficulty match
                'diversity': 0.15,     # 15% - Inclusivity
            }
        
        scores = {
            'age_fit': self.calculate_age_fit_score(activity_row),
            'coverage': self.calculate_member_coverage(activity_row),
            'ability': self.calculate_ability_score(activity_row),
            'diversity': self.calculate_diversity_score(activity_row),
        }
        
        composite = sum(scores[key] * weights[key] for key in weights.keys())
        
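        # Attach component scores to this row object for debugging. When called
        # through DataFrame.apply(), these assignments are not written back to the DataFrame.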
        for key, value in scores.items():
            activity_row[f'score_{key}'] = value
        
        return composite * 100  # Scale to 0-100
    
    def rank_activities(self, top_n: int = None, weights: Dict[str, float] = None) -> pd.DataFrame:
        """
        Rank all activities for the group.
        
        Args:
            top_n: Return only top N activities (None = all)
            weights: Custom weights for scoring components
        
        Returns:
            DataFrame of ranked activities with scores
        """
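        ranked = self.activities.copy()
        
        # Store component scores as explicit columns: values assigned to a row
        # inside DataFrame.apply() are not written back to the DataFrame
        component_scorers = {
            'score_age_fit': self.calculate_age_fit_score,
            'score_coverage': self.calculate_member_coverage,
            'score_ability': self.calculate_ability_score,
            'score_diversity': self.calculate_diversity_score,
        }
        for column, scorer in component_scorers.items():
            ranked[column] = ranked.apply(scorer, axis=1)
        
        # Calculate composite score for each activity
        ranked['composite_score'] = ranked.apply(
            lambda row: self.calculate_composite_score(row, weights), 
            axis=1
        )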
        
        # Sort by score (descending)
        ranked = ranked.sort_values('composite_score', ascending=False)
        
        # Filter out activities with zero score (completely inappropriate)
        ranked = ranked[ranked['composite_score'] > 0]
        
        # Add rank column
        ranked['rank'] = range(1, len(ranked) + 1)
        
        self.ranked_activities = ranked
        
        if top_n:
            return ranked.head(top_n)
        return ranked
    
    def get_recommendations(self, n: int = 10) -> pd.DataFrame:
        """
        Get top N recommended activities with detailed scores.
        """
        if self.ranked_activities is None:
            self.rank_activities()
        
        cols = ['rank', 'title', 'composite_score', 'score_age_fit', 
                'score_coverage', 'score_ability', 'score_diversity',
                'age_min', 'age_max', 'duration_mins', 'tags']
        
        return self.ranked_activities[cols].head(n)


print("✓ Activity Ranking Model defined")

## 6. Generate Activity Rankings

Apply the model to rank activities for your family group.

In [None]:
# Create model instance
model = ActivityRankingModel(sample_group, activities_df)

# Rank activities
ranked_activities = model.rank_activities()

print(f"✓ Ranked {len(ranked_activities)} suitable activities")
print(f"\nTop 15 Recommended Activities for Your Group:")
print("=" * 100)

recommendations = model.get_recommendations(n=15)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print(recommendations.to_string(index=False))

## 7. Visualize Results

In [None]:
# Visualization 1: Score Distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Overall score distribution
axes[0, 0].hist(ranked_activities['composite_score'], bins=30, color='steelblue', edgecolor='black')
axes[0, 0].set_title('Distribution of Composite Scores', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Composite Score')
axes[0, 0].set_ylabel('Number of Activities')
axes[0, 0].axvline(ranked_activities['composite_score'].mean(), color='red', 
                   linestyle='--', label=f'Mean: {ranked_activities["composite_score"].mean():.1f}')
axes[0, 0].legend()

# Component scores for top 20 activities
top_20 = ranked_activities.head(20)
score_cols = ['score_age_fit', 'score_coverage', 'score_ability', 'score_diversity']
score_data = top_20[score_cols].values

x = np.arange(len(top_20))
width = 0.2

for i, col in enumerate(score_cols):
    label = col.replace('score_', '').replace('_', ' ').title()
    axes[0, 1].bar(x + i * width, top_20[col], width, label=label)

axes[0, 1].set_title('Component Scores - Top 20 Activities', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Activity Rank')
axes[0, 1].set_ylabel('Score (0-1)')
axes[0, 1].legend()
axes[0, 1].set_xticks(x + width * 1.5)
axes[0, 1].set_xticklabels(range(1, len(top_20) + 1), rotation=0)

# Top 15 activities by composite score
top_15 = ranked_activities.head(15)
activity_names = [title[:20] + '...' if len(title) > 20 else title for title in top_15['title']]

axes[1, 0].barh(activity_names, top_15['composite_score'], color='teal')
axes[1, 0].set_title('Top 15 Activities by Score', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Composite Score')
axes[1, 0].invert_yaxis()

# Member coverage analysis
coverage_dist = ranked_activities['score_coverage'].value_counts().sort_index()
axes[1, 1].bar(coverage_dist.index, coverage_dist.values, width=0.05, color='coral', edgecolor='black')
axes[1, 1].set_title('Member Coverage Distribution', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Coverage Score (fraction of members who can participate)')
axes[1, 1].set_ylabel('Number of Activities')

plt.tight_layout()
plt.show()

print("✓ Visualizations generated")

## 8. Detailed Analysis of Top Activities

In [None]:
def print_activity_details(activity_row):
    """Print detailed information about an activity"""
    print(f"\n{'='*80}")
    print(f"RANK #{int(activity_row['rank'])}: {activity_row['title'].upper()}")
    print(f"{'='*80}")
    print(f"Overall Score: {activity_row['composite_score']:.1f}/100")
    print(f"\nComponent Scores:")
    print(f"  • Age Fit:        {activity_row['score_age_fit']:.2f} (How well ages match)")
    print(f"  • Coverage:       {activity_row['score_coverage']:.2f} (Fraction of members who can participate)")
    print(f"  • Ability Match:  {activity_row['score_ability']:.2f} (Difficulty appropriateness)")
    print(f"  • Diversity:      {activity_row['score_diversity']:.2f} (Inclusivity)")
    print(f"\nActivity Details:")
    print(f"  • Age Range:      {int(activity_row['age_min'])}-{int(activity_row['age_max'])} years")
    print(f"  • Duration:       {int(activity_row['duration_mins'])} minutes")
    print(f"  • Tags:           {', '.join(activity_row['tags'])}")
    print(f"  • Cost:           {activity_row['cost']}")
    print(f"  • Location:       {activity_row['indoor_outdoor']}")
    print(f"  • Players:        {activity_row['players']}")

# Show detailed analysis of top 5
print("\n" + "#" * 80)
print("# DETAILED ANALYSIS - TOP 5 RECOMMENDED ACTIVITIES")
print("#" * 80)

for idx in range(min(5, len(ranked_activities))):
    print_activity_details(ranked_activities.iloc[idx])

## 9. Customize Scoring Weights

You can adjust the importance of different factors by changing weights.
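
Because the composite score is a weighted sum scaled by 100, it only stays within 0-100 when the weights sum to 1. A minimal sketch of a helper that rescales arbitrary non-negative weights follows; `normalize_weights` is a hypothetical name and is not defined elsewhere in this notebook.

```python
def normalize_weights(weights):
    """Rescale non-negative weights so they sum to 1 (hypothetical helper)."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {key: value / total for key, value in weights.items()}

# Relative importances expressed as plain numbers
raw = {'age_fit': 2, 'coverage': 5, 'ability': 2, 'diversity': 1}
normalized = normalize_weights(raw)
print(normalized)  # {'age_fit': 0.2, 'coverage': 0.5, 'ability': 0.2, 'diversity': 0.1}
```

The result can be passed as the `weights` argument of `rank_activities`.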

In [None]:
# Example: Prioritize activities where ALL members can participate
custom_weights = {
    'age_fit': 0.20,       # 20%
    'coverage': 0.50,      # 50% - Prioritize full group participation
    'ability': 0.20,       # 20%
    'diversity': 0.10,     # 10%
}

# Re-rank with custom weights
custom_ranked = model.rank_activities(weights=custom_weights)
custom_recs = custom_ranked[['rank', 'title', 'composite_score', 'score_coverage', 
                              'age_min', 'age_max']].head(10)

print("Top 10 Activities with Custom Weights (Prioritizing Full Participation):")
print("=" * 80)
print(custom_recs.to_string(index=False))

## 10. Export Results

In [None]:
# Export top recommendations to CSV
output_file = 'top_activity_recommendations.csv'
recommendations_export = ranked_activities.head(50)[[
    'rank', 'title', 'composite_score', 'score_age_fit', 'score_coverage',
    'score_ability', 'score_diversity', 'age_min', 'age_max', 
    'duration_mins', 'cost', 'indoor_outdoor', 'tags'
]]

recommendations_export.to_csv(output_file, index=False)
print(f"✓ Exported top {len(recommendations_export)} recommendations to '{output_file}'")

# Summary statistics
print(f"\n{'='*80}")
print("MODEL SUMMARY")
print(f"{'='*80}")
print(f"Total activities evaluated:     {len(activities_df)}")
print(f"Suitable activities found:      {len(ranked_activities)}")
print(f"Average composite score:        {ranked_activities['composite_score'].mean():.2f}")
print(f"Best activity:                  {ranked_activities.iloc[0]['title']}")
print(f"Best activity score:            {ranked_activities.iloc[0]['composite_score']:.2f}/100")
print(f"\nGroup Details:")
print(f"  Members:                      {len(sample_group.members)}")
print(f"  Age range:                    {sample_group.get_age_range()[0]}-{sample_group.get_age_range()[1]} years")
print(f"  Average ability:              {sample_group.get_avg_ability():.2f}/10")
print(f"{'='*80}")

## 11. Create Your Own Family Group

Modify the cell below to create your own family group and get personalized recommendations!

In [None]:
# CREATE YOUR OWN GROUP HERE
my_group = FamilyGroup([
    # Add your family members here
    # GroupMember("Name", age=X, overall_ability=Y),
    # overall_ability scale: 1=very low, 5=average, 10=very high
])

if my_group.members:
    # Create model for your group
    my_model = ActivityRankingModel(my_group, activities_df)
    my_ranked = my_model.rank_activities()
    my_recs = my_model.get_recommendations(n=20)
    
    print(f"\n🎯 TOP 20 ACTIVITIES FOR YOUR GROUP:")
    print("=" * 100)
    print(my_recs.to_string(index=False))
else:
    print("⚠️ Please add members to my_group to get recommendations!")

---

## Summary

This notebook implements a comprehensive activity ranking model that:

1. **Merges** group member ages and ability levels
2. **Analyzes** activity suitability across multiple dimensions
3. **Ranks** activities using a weighted scoring algorithm
4. **Provides** detailed recommendations with explanations

### Scoring Components:
- **Age Fit**: How well activity age range matches group
- **Coverage**: Percentage of members who can participate  
- **Ability Match**: How activity difficulty aligns with group power/ability
- **Diversity**: Whether activity accommodates different levels
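
Age Fit and Coverage can diverge sharply when a group spans a wide age range, for example young children with a parent. The standalone sketch below shows the difference on a hypothetical group. It counts ages inclusively, which is an assumption of this example, and the function names are illustrative rather than the model's methods.

```python
def age_fit(activity_min, activity_max, group_min, group_max):
    """Fraction of the group's age span covered by the activity (inclusive years)."""
    overlap = min(activity_max, group_max) - max(activity_min, group_min) + 1
    if overlap <= 0:
        return 0.0
    return overlap / (group_max - group_min + 1)

def coverage(activity_min, activity_max, member_ages):
    """Share of members whose age falls inside the activity's range."""
    if not member_ages:
        return 0.0
    eligible = sum(activity_min <= age <= activity_max for age in member_ages)
    return eligible / len(member_ages)

ages = [4, 7, 10, 38]  # hypothetical group: three children and one adult
fit = age_fit(5, 12, min(ages), max(ages))
cov = coverage(5, 12, ages)
print(f"age fit = {fit:.2f}, coverage = {cov:.2f}")  # age fit = 0.23, coverage = 0.50
```

Here an activity for ages 5-12 covers only a small slice of the 4-38 span, yet half of the members can take part.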

### Next Steps:
- Customize weights based on your priorities
- Add more group members
- Filter by indoor/outdoor, cost, or season
- Integrate with calendar scheduling
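
As a starting point for the filtering step, the sketch below narrows a DataFrame by location and cost before or after ranking. It assumes the `indoor_outdoor` column holds labels such as `indoor`/`outdoor` and that `cost` is numeric; the actual dataset may encode these differently, and `filter_activities` is a hypothetical helper.

```python
import pandas as pd

def filter_activities(df, location=None, max_cost=None):
    """Keep rows matching an optional location label and cost ceiling (illustrative)."""
    mask = pd.Series(True, index=df.index)
    if location is not None:
        # Case-insensitive match on the location label
        mask &= df['indoor_outdoor'].str.lower() == location.lower()
    if max_cost is not None:
        # Non-numeric or missing costs become NaN and are excluded
        mask &= pd.to_numeric(df['cost'], errors='coerce') <= max_cost
    return df[mask]

# Made-up rows for demonstration only
demo = pd.DataFrame({
    'title': ['Obstacle course', 'Board game night', 'Nature walk'],
    'indoor_outdoor': ['outdoor', 'indoor', 'Outdoor'],
    'cost': [0, 15, 0],
})
cheap_outdoor = filter_activities(demo, location='outdoor', max_cost=5)
print(cheap_outdoor['title'].tolist())  # ['Obstacle course', 'Nature walk']
```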

**Compatible with Google Colab** - Upload your dataset and run!