# Generate NBA QA Dataset

This notebook generates a question-answer dataset from NBA API data for fine-tuning a question-answering model.

## Overview

The dataset generation process:
1. **Fetch NBA Data**: Retrieve player statistics and game logs from the NBA API
2. **Generate Context**: Convert structured data into natural language passages
3. **Create QA Pairs**: Generate questions and extract answer spans
4. **Format for Training**: Save in SQuAD format for model fine-tuning
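
Steps 2 and 3 above can be sketched without any network access. This is a minimal illustration, assuming a single hard-coded statistic; `build_qa_pair` and the example values are hypothetical and are not part of the notebook's generator class.

```python
from typing import Dict


def build_qa_pair(player_name: str, season: str, ppg: float) -> Dict:
    """Turn one statistic into a context passage and a span-style QA pair."""
    answer = f"{ppg:.1f}"
    # Step 2: structured value -> natural language passage
    context = f"In the {season} season, {player_name} averaged {answer} points per game."
    # Step 3: question plus the character offset of the answer inside the passage
    return {
        "context": context,
        "question": f"How many points did {player_name} average in the {season} season?",
        "answers": [{"text": answer, "answer_start": context.find(f"{answer} points per game")}],
    }


# Hypothetical values, not fetched from the NBA API
example = build_qa_pair("Example Player", "2009-10", 27.3)
print(example["context"])
print(example["answers"][0])
```

Searching for the answer together with its unit (`"27.3 points per game"`) rather than the bare number reduces the chance of matching the same digits elsewhere in the passage.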

## Dataset Structure

Each QA pair includes:
- **Context**: Natural language passage containing the answer
- **Question**: Natural language question
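- **Answer**: The answer text and its starting character offset (`answer_start`) within the context
- **Confidence**: A fixed per-template quality score for the QA pair (0.90 or 0.95 in the generators below)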
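
Because `answer_start` is computed with `str.find`, a miss yields `-1`, and not every generator in this notebook checks for that. The following sketch, assuming QA pairs shaped like the structure above, flags records whose span does not line up with the context. `find_misaligned_answers` is a hypothetical helper name.

```python
from typing import Dict, List


def find_misaligned_answers(qa_pairs: List[Dict]) -> List[int]:
    """Return indices of QA pairs whose answer span does not match the context."""
    bad = []
    for i, qa in enumerate(qa_pairs):
        for ans in qa["answers"]:
            start, text = ans["answer_start"], ans["text"]
            # A valid span starts at a non-negative offset and reproduces the answer text
            if start < 0 or qa["context"][start:start + len(text)] != text:
                bad.append(i)
                break
    return bad


# Hypothetical records for illustration
ctx = "Example Player scored 30 points."
sample = [
    {"context": ctx, "question": "How many points did Example Player score?",
     "answers": [{"text": "30", "answer_start": ctx.find("30")}], "confidence": 0.90},
    {"context": ctx, "question": "How many rebounds did Example Player grab?",
     "answers": [{"text": "12", "answer_start": ctx.find("12 rebounds")}], "confidence": 0.90},
]
print(find_misaligned_answers(sample))  # [1]
```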

## Output

The generated dataset is saved as SQuAD-format JSON, which is compatible with Hugging Face transformers.
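
The notebook's own save routine is not shown in this part, so the following is only a sketch of how flat QA pairs could be grouped into the nested SQuAD v1.1 layout (`data` → `paragraphs` → `qas`). The function name `to_squad`, the `nba-<index>` id scheme, and the title are assumptions made for illustration.

```python
import json
from typing import Dict, List


def to_squad(qa_pairs: List[Dict], title: str = "NBA") -> Dict:
    """Group flat QA pairs by context into a SQuAD v1.1-style dictionary."""
    by_context: Dict[str, List[Dict]] = {}
    for i, qa in enumerate(qa_pairs):
        by_context.setdefault(qa["context"], []).append({
            "id": f"nba-{i}",  # hypothetical id scheme
            "question": qa["question"],
            "answers": qa["answers"],  # list of {"text", "answer_start"}
        })
    paragraphs = [{"context": c, "qas": qas} for c, qas in by_context.items()]
    return {"version": "1.1", "data": [{"title": title, "paragraphs": paragraphs}]}


# Hypothetical records for illustration
ctx = "Example Player scored 30 points."
flat = [
    {"context": ctx, "question": "How many points did Example Player score?",
     "answers": [{"text": "30", "answer_start": ctx.find("30")}], "confidence": 0.90},
    {"context": ctx, "question": "Who scored 30 points?",
     "answers": [{"text": "Example Player", "answer_start": 0}], "confidence": 0.90},
]
print(json.dumps(to_squad(flat), indent=2))
```

The `confidence` field is dropped here because it is not part of the SQuAD schema; it could be kept in a separate file if it is needed for filtering.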


## Step 0: Install Required Packages

Run this cell first to install the required packages if they're not already installed.


In [16]:
# Install required packages
# Pin urllib3 to v1.x to avoid OpenSSL compatibility warnings on macOS
%pip install nba-api pandas numpy "urllib3<2.0"

print("✅ Packages installed successfully!")


Note: you may need to restart the kernel to use updated packages.
✅ Packages installed successfully!


In [17]:
from nba_api.stats.endpoints import playergamelog, leaguegamefinder, boxscoretraditionalv2, leaguedashplayerstats
from nba_api.stats.library.parameters import SeasonType
from typing import List, Dict

class NBADatasetGenerator:

    def _extract_season_from_year(self, year: int) -> str:
        """Convert the calendar year in which a season ends to NBA season format (e.g., 2010 -> '2009-10')"""
        return f"{year - 1}-{str(year)[2:]}"
    
    def _generate_historical_season_qa(self, player_id: str, player_name: str, season: str) -> List[Dict]:
        """Generate QA pairs for specific historical season"""
        qa_pairs = []
        try:
            game_log = playergamelog.PlayerGameLog(player_id=player_id, season=season)
            game_log_df = game_log.get_data_frames()[0]
            if game_log_df.empty:
                return qa_pairs
            total_pts = game_log_df['PTS'].sum() if 'PTS' in game_log_df.columns else 0
            total_reb = game_log_df['REB'].sum() if 'REB' in game_log_df.columns else 0
            total_ast = game_log_df['AST'].sum() if 'AST' in game_log_df.columns else 0
            total_games = len(game_log_df)
            if total_games == 0:
                return qa_pairs
            avg_ppg = total_pts / total_games
            avg_rpg = total_reb / total_games
            avg_apg = total_ast / total_games
            # Expand '2009-10' to '2009-2010', deriving the end year from the start year
            season_display = f"{season[:4]}-{int(season[:4]) + 1}" if len(season.split('-')[1]) == 2 else season
            context = (
                f"In the {season_display} season, {player_name} played {total_games} games. "
                f"He averaged {avg_ppg:.1f} points per game, {avg_rpg:.1f} rebounds per game, "
                f"and {avg_apg:.1f} assists per game."
            )
            qa_templates = [
                {"question": f"How many points did {player_name} average in the {season_display} season?",
                 "answer": f"{avg_ppg:.1f}", "answer_start": context.find(f"{avg_ppg:.1f} points per game")},
                {"question": f"How many assists did {player_name} average in the {season_display} season?",
                 "answer": f"{avg_apg:.1f}", "answer_start": context.find(f"{avg_apg:.1f} assists per game")},
                {"question": f"How many rebounds did {player_name} average in the {season_display} season?",
                 "answer": f"{avg_rpg:.1f}", "answer_start": context.find(f"{avg_rpg:.1f} rebounds per game")},
            ]
            for qa in qa_templates:
                if qa["answer_start"] >= 0:
                    qa_pairs.append({
                        "context": context, "question": qa["question"],
                        "answers": [{"text": qa["answer"], "answer_start": qa["answer_start"]}],
                        "confidence": 0.95
                    })
        except Exception as e:
            print(f"  Error generating historical season QA for {player_name} ({season}): {e}")
        return qa_pairs
    
    def _generate_finals_game_qa(self, player_id: str, player_name: str, year: int) -> List[Dict]:
        """Generate QA pairs for Finals games in a specific year"""
        qa_pairs = []
        try:
            season = self._extract_season_from_year(year)
            gamefinder = leaguegamefinder.LeagueGameFinder(
                player_id_nullable=player_id, season_nullable=season,
                season_type_nullable=SeasonType.playoffs
            )
            games_df = gamefinder.get_data_frames()[0]
            if games_df.empty:
                return qa_pairs
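            # Heuristic: treat the player's last (up to) 7 playoff games of the season as the Finals.
            # This only holds if the player's team reached the Finals that year.
            finals_games = games_df.sort_values('GAME_DATE', ascending=False).head(7)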
            for _, game_row in finals_games.iterrows():
                game_id = game_row.get('GAME_ID')
                try:
                    boxscore = boxscoretraditionalv2.BoxScoreTraditionalV2(game_id=game_id)
                    boxscore_df = boxscore.get_data_frames()[0]
                    player_stats = boxscore_df[boxscore_df['PLAYER_ID'] == int(player_id)]
                    if player_stats.empty:
                        continue
                    player_stat = player_stats.iloc[0]
                    pts = player_stat.get('PTS', 0)
                    game_num = len(finals_games) - list(finals_games.index).index(game_row.name)
                    context = f"In the {year} NBA Finals Game {game_num}, {player_name} scored {int(pts)} points."
                    qa_pairs.append({
                        "context": context,
                        "question": f"How many points did {player_name} score in the {year} NBA Finals Game {game_num}?",
                        "answers": [{"text": f"{int(pts)}", "answer_start": context.find(f"{int(pts)} points")}],
                        "confidence": 0.90
                    })
                except Exception:
                    continue
        except Exception as e:
            print(f"  Error generating Finals QA for {player_name} ({year}): {e}")
        return qa_pairs
    
    def _generate_game_range_qa(self, player_id: str, player_name: str, season: str, num_games: int) -> List[Dict]:
        """Generate QA pairs for first N games of a season"""
        qa_pairs = []
        try:
            game_log = playergamelog.PlayerGameLog(player_id=player_id, season=season)
            game_log_df = game_log.get_data_frames()[0]
            if game_log_df.empty:
                return qa_pairs
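            import pandas as pd
            # GAME_DATE may be a month-name string rather than ISO, so parse it before sorting
            game_log_df_sorted = game_log_df.sort_values('GAME_DATE', key=pd.to_datetime)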
            first_n_games = game_log_df_sorted.head(num_games)
            if len(first_n_games) < num_games:
                return qa_pairs
            total_reb = first_n_games['REB'].sum() if 'REB' in first_n_games.columns else 0
            # Expand '2009-10' to '2009-2010', deriving the end year from the start year
            season_display = f"{season[:4]}-{int(season[:4]) + 1}" if len(season.split('-')[1]) == 2 else season
            context = f"In the first {num_games} games of the {season_display} season, {player_name} grabbed {int(total_reb)} rebounds."
            qa_pairs.append({
                "context": context,
                "question": f"How many rebounds did {player_name} grab in the first {num_games} games of the {season_display} season?",
                "answers": [{"text": f"{int(total_reb)}", "answer_start": context.find(f"{int(total_reb)} rebounds")}],
                "confidence": 0.90
            })
        except Exception as e:
            print(f"  Error generating game range QA for {player_name} ({season}): {e}")
        return qa_pairs
    
    def _generate_league_leader_qa(self, season: str, stat_type: str = "turnovers") -> List[Dict]:
        """Generate QA pairs for league leaders/records"""
        qa_pairs = []
        try:
            league_stats = leaguedashplayerstats.LeagueDashPlayerStats(season=season, per_mode_detailed="Totals")
            stats_df = league_stats.get_data_frames()[0]
            if stats_df.empty:
                return qa_pairs
            stat_columns = {"turnovers": "TOV", "points": "PTS", "rebounds": "REB", "assists": "AST"}
            stat_col = stat_columns.get(stat_type.lower(), "TOV")
            if stat_col not in stats_df.columns:
                return qa_pairs
            max_idx = stats_df[stat_col].idxmax()
            # idxmax returns an index label, so look it up with .loc rather than .iloc
            leader = stats_df.loc[max_idx]
            player_name = leader.get('PLAYER_NAME', 'Unknown')
            # Expand '2009-10' to '2009-2010', deriving the end year from the start year
            season_display = f"{season[:4]}-{int(season[:4]) + 1}" if len(season.split('-')[1]) == 2 else season
            context = f"In the {season_display} season, {player_name} had the most {stat_type}."
            qa_pairs.append({
                "context": context,
                "question": f"Who is the player with most {stat_type} in the {season_display} season?",
                "answers": [{"text": player_name, "answer_start": context.find(player_name)}],
                "confidence": 0.95
            })
        except Exception as e:
            print(f"  Error generating league leader QA for {season}: {e}")
        return qa_pairs


## Step 1: Import Required Libraries

**Note**: If you get a `ModuleNotFoundError`, make sure you ran the installation cell above.
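
To see which packages are missing before running the imports, a check along these lines may help. It is a sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module names are the ones installed in Step 0.

```python
import importlib.util
from typing import List


def missing_modules(names: List[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names correspond to the packages installed in Step 0
print(missing_modules(["nba_api", "pandas", "numpy"]))
```

An empty list means all three modules are importable; any name printed here points to a package that still needs to be installed (or a kernel that needs restarting).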


In [18]:
import json
import re
import warnings
from typing import List, Dict, Any, Tuple
from pathlib import Path
from datetime import datetime

# Suppress urllib3 OpenSSL warnings (optional - only if the warning persists)
warnings.filterwarnings('ignore', category=UserWarning, module='urllib3')

from nba_api.stats.endpoints import (
    playercareerstats,
    playergamelog,
    leaguegamefinder,
    boxscoretraditionalv2,
    commonplayerinfo,
    leagueleaders,
    leaguedashplayerstats
)
from nba_api.stats.static import players, teams
from nba_api.stats.library.parameters import Season, SeasonType

print("✅ Libraries imported successfully!")


✅ Libraries imported successfully!


In [None]:
# Override generate_dataset method to include historical support
def generate_dataset_enhanced(self, num_players: int = 50, num_games_per_player: int = 5, 
                         include_historical: bool = True, historical_seasons: List[str] = None,
                         include_finals: bool = True, include_league_leaders: bool = True,
                         include_last_team_games: bool = True) -> List[Dict]:
    """
    Generate QA dataset with historical support
    
    Args:
        num_players: Number of players to include
        num_games_per_player: Number of recent games per player
        include_historical: Whether to include historical season data
        historical_seasons: List of seasons to process (e.g., ['2009-10', '2010-11'])
        include_finals: Whether to include Finals game data
        include_league_leaders: Whether to include league leader questions
        include_last_team_games: Whether to include last team game questions
    """
    print(f"Generating NBA QA dataset for {num_players} players...")
    
    # Default historical seasons if not provided
    if historical_seasons is None:
        historical_seasons = ['2009-10', '2010-11', '2001-02', '2002-03']
    
    # Get popular/active players (filter by recent activity)
    active_players = self._get_active_players(num_players)
    
    dataset = []
    
    for i, player in enumerate(active_players, 1):
        player_id = player['id']
        player_name = player['full_name']
        
        print(f"[{i}/{num_players}] Processing {player_name}...")
        
        try:
            # Generate career stats QA pairs
            career_qa = self._generate_career_stats_qa(player_id, player_name)
            dataset.extend(career_qa)
            
            # Generate game log QA pairs (current season)
            game_qa = self._generate_game_log_qa(player_id, player_name, num_games_per_player)
            dataset.extend(game_qa)
            
            # Generate historical season QA pairs
            if include_historical:
                for season in historical_seasons:
                    try:
                        historical_qa = self._generate_historical_season_qa(player_id, player_name, season)
                        dataset.extend(historical_qa)
                    except Exception as e:
                        continue  # Skip if season data not available
            
            # Generate Finals game QA pairs (tries a fixed list of years, stops at the first with data)
            if include_finals:
                for year in [2010, 2011, 2012, 2001, 2002]:
                    try:
                        finals_qa = self._generate_finals_game_qa(player_id, player_name, year)
                        if finals_qa:  # Only add if we found Finals games
                            dataset.extend(finals_qa)
                            break  # Only need one year with Finals data
                    except Exception as e:
                        continue
            
            # Generate game range QA pairs
            if include_historical and historical_seasons:
                for season in historical_seasons[:2]:  # Limit to first 2 seasons
                    try:
                        range_qa = self._generate_game_range_qa(player_id, player_name, season, 10)
                        dataset.extend(range_qa)
                    except Exception as e:
                        continue
            
        except Exception as e:
            print(f"  Error processing {player_name}: {e}")
            continue
    
    # Generate league leader QA pairs (all-time "top N" questions)
    if include_league_leaders:
        print("\nGenerating league leader QA pairs (all-time top N)...")
        # Generate for different stats and different top N values
        stat_types = ['assists', 'points', 'rebounds', 'blocks', 'turnovers', 'steals', 'three_pointers']
        top_n_values = [3, 5, 10]  # Generate questions for top 3, top 5, and top 10
        
        for stat_type in stat_types:
            for top_n in top_n_values:
                try:
                    leader_qa = self._generate_league_leader_qa(stat_type=stat_type, top_n=top_n)
                    dataset.extend(leader_qa)
                except Exception as e:
                    continue
    
    # Generate last team game QA pairs
    if include_last_team_games:
        print("\nGenerating last team game QA pairs...")
        # Popular teams to include
        popular_teams = [
            {'id': 1610612747, 'name': 'Los Angeles Lakers'},
            {'id': 1610612738, 'name': 'Boston Celtics'},
            {'id': 1610612744, 'name': 'Golden State Warriors'},
            {'id': 1610612751, 'name': 'Brooklyn Nets'},
            {'id': 1610612748, 'name': 'Miami Heat'},
            {'id': 1610612759, 'name': 'San Antonio Spurs'},
            {'id': 1610612752, 'name': 'New York Knicks'},
            {'id': 1610612741, 'name': 'Chicago Bulls'},
            {'id': 1610612742, 'name': 'Dallas Mavericks'},
            {'id': 1610612756, 'name': 'Phoenix Suns'},
        ]
        
        for team in popular_teams:
            try:
                last_game_qa = self._generate_last_team_game_qa(team['id'], team['name'])
                dataset.extend(last_game_qa)
            except Exception as e:
                continue
    
    print(f"\nGenerated {len(dataset)} QA pairs")
    return dataset

# Replace the original method
NBADatasetGenerator.generate_dataset = generate_dataset_enhanced

print("✅ generate_dataset method updated with historical support!")


✅ generate_dataset method updated with historical support!


In [20]:
# Add historical question support methods to NBADatasetGenerator class

# Helper method for season conversion
def _extract_season_from_year(self, year: int) -> str:
    """Convert the calendar year in which a season ends to NBA season format (e.g., 2010 -> '2009-10')"""
    return f"{year - 1}-{str(year)[2:]}"

# Historical season QA generation
def _generate_historical_season_qa(self, player_id: str, player_name: str, season: str) -> List[Dict]:
    """Generate QA pairs for specific historical season"""
    qa_pairs = []
    try:
        game_log = playergamelog.PlayerGameLog(player_id=player_id, season=season)
        game_log_df = game_log.get_data_frames()[0]
        if game_log_df.empty:
            return qa_pairs
        total_pts = game_log_df['PTS'].sum() if 'PTS' in game_log_df.columns else 0
        total_reb = game_log_df['REB'].sum() if 'REB' in game_log_df.columns else 0
        total_ast = game_log_df['AST'].sum() if 'AST' in game_log_df.columns else 0
        total_games = len(game_log_df)
        if total_games == 0:
            return qa_pairs
        avg_ppg = total_pts / total_games
        avg_rpg = total_reb / total_games
        avg_apg = total_ast / total_games
        # Expand '2009-10' to '2009-2010', deriving the end year from the start year
        season_display = f"{season[:4]}-{int(season[:4]) + 1}" if len(season.split('-')[1]) == 2 else season
        context = (
            f"In the {season_display} season, {player_name} played {total_games} games. "
            f"He averaged {avg_ppg:.1f} points per game, {avg_rpg:.1f} rebounds per game, "
            f"and {avg_apg:.1f} assists per game."
        )
        qa_templates = [
            {"question": f"How many points did {player_name} average in the {season_display} season?",
             "answer": f"{avg_ppg:.1f}", "answer_start": context.find(f"{avg_ppg:.1f} points per game")},
            {"question": f"How many assists did {player_name} average in the {season_display} season?",
             "answer": f"{avg_apg:.1f}", "answer_start": context.find(f"{avg_apg:.1f} assists per game")},
            {"question": f"How many rebounds did {player_name} average in the {season_display} season?",
             "answer": f"{avg_rpg:.1f}", "answer_start": context.find(f"{avg_rpg:.1f} rebounds per game")},
        ]
        for qa in qa_templates:
            if qa["answer_start"] >= 0:
                qa_pairs.append({
                    "context": context, "question": qa["question"],
                    "answers": [{"text": qa["answer"], "answer_start": qa["answer_start"]}],
                    "confidence": 0.95
                })
    except Exception as e:
        print(f"  Error generating historical season QA for {player_name} ({season}): {e}")
    return qa_pairs

# Finals game QA generation
def _generate_finals_game_qa(self, player_id: str, player_name: str, year: int) -> List[Dict]:
    """Generate QA pairs for Finals games in a specific year"""
    qa_pairs = []
    try:
        season = self._extract_season_from_year(year)
        gamefinder = leaguegamefinder.LeagueGameFinder(
            player_id_nullable=player_id, season_nullable=season,
            season_type_nullable=SeasonType.playoffs
        )
        games_df = gamefinder.get_data_frames()[0]
        if games_df.empty:
            return qa_pairs
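        # Heuristic: treat the player's last (up to) 7 playoff games of the season as the Finals.
        # This only holds if the player's team reached the Finals that year.
        finals_games = games_df.sort_values('GAME_DATE', ascending=False).head(7)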
        for _, game_row in finals_games.iterrows():
            game_id = game_row.get('GAME_ID')
            try:
                boxscore = boxscoretraditionalv2.BoxScoreTraditionalV2(game_id=game_id)
                boxscore_df = boxscore.get_data_frames()[0]
                player_stats = boxscore_df[boxscore_df['PLAYER_ID'] == int(player_id)]
                if player_stats.empty:
                    continue
                player_stat = player_stats.iloc[0]
                pts = player_stat.get('PTS', 0)
                game_num = len(finals_games) - list(finals_games.index).index(game_row.name)
                context = f"In the {year} NBA Finals Game {game_num}, {player_name} scored {int(pts)} points."
                qa_pairs.append({
                    "context": context,
                    "question": f"How many points did {player_name} score in the {year} NBA Finals Game {game_num}?",
                    "answers": [{"text": f"{int(pts)}", "answer_start": context.find(f"{int(pts)} points")}],
                    "confidence": 0.90
                })
            except Exception:
                continue
    except Exception as e:
        print(f"  Error generating Finals QA for {player_name} ({year}): {e}")
    return qa_pairs

# Game range QA generation
def _generate_game_range_qa(self, player_id: str, player_name: str, season: str, num_games: int) -> List[Dict]:
    """Generate QA pairs for first N games of a season"""
    qa_pairs = []
    try:
        game_log = playergamelog.PlayerGameLog(player_id=player_id, season=season)
        game_log_df = game_log.get_data_frames()[0]
        if game_log_df.empty:
            return qa_pairs
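        import pandas as pd
        # GAME_DATE may be a month-name string rather than ISO, so parse it before sorting
        game_log_df_sorted = game_log_df.sort_values('GAME_DATE', key=pd.to_datetime)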
        first_n_games = game_log_df_sorted.head(num_games)
        if len(first_n_games) < num_games:
            return qa_pairs
        total_reb = first_n_games['REB'].sum() if 'REB' in first_n_games.columns else 0
        # Expand '2009-10' to '2009-2010', deriving the end year from the start year
        season_display = f"{season[:4]}-{int(season[:4]) + 1}" if len(season.split('-')[1]) == 2 else season
        context = f"In the first {num_games} games of the {season_display} season, {player_name} grabbed {int(total_reb)} rebounds."
        qa_pairs.append({
            "context": context,
            "question": f"How many rebounds did {player_name} grab in the first {num_games} games of the {season_display} season?",
            "answers": [{"text": f"{int(total_reb)}", "answer_start": context.find(f"{int(total_reb)} rebounds")}],
            "confidence": 0.90
        })
    except Exception as e:
        print(f"  Error generating game range QA for {player_name} ({season}): {e}")
    return qa_pairs

# League leader QA generation - All-time "top N" questions
def _generate_league_leader_qa(self, stat_type: str = "assists", top_n: int = 5) -> List[Dict]:
    """Generate QA pairs for all-time league leaders (top N players)"""
    qa_pairs = []
    try:
        # Map stat types to NBA API abbreviations
        stat_map = {
            "points": "PTS",
            "rebounds": "REB", 
            "assists": "AST",
            "blocks": "BLK",
            "turnovers": "TOV",
            "steals": "STL",
            "three_pointers": "FG3M"
        }
        
        stat_abbrev = stat_map.get(stat_type.lower(), "PTS")
        stat_display = stat_type.lower()
        
        # Get all-time leaders using LeagueLeaders endpoint
        leaders = leagueleaders.LeagueLeaders(
            league_id='00',
            season='All Time',
            season_type_all_star='Regular Season',
            stat_category_abbreviation=stat_abbrev,
            per_mode48='Totals'
        )
        df = leaders.get_data_frames()[0]
        
        if df.empty:
            return qa_pairs
        
        # Get top N players
        top_players = df.head(top_n)
        
        # Build context with top N players
        context_parts = [f"The top {top_n} players with the most {stat_display} in NBA history are:"]
        player_list = []
        
        for idx, (_, row) in enumerate(top_players.iterrows(), start=1):
            rank = row.get(f"{stat_abbrev}_RANK") or row.get("RANK") or idx
            player_name = row.get("PLAYER_NAME", "Unknown")
            stat_value = int(row.get(stat_abbrev, 0))
            player_list.append((int(rank), player_name, stat_value))
            context_parts.append(f"{int(rank)}. {player_name} with {stat_value:,} {stat_display}.")
        
        context = " ".join(context_parts)
        
        # Generate multiple question variations for each top player
        for rank, player_name, stat_value in player_list:
            # Question variations
            question_templates = [
                f"Which is the top {top_n} of players with most {stat_display} in the history of the league?",
                f"Who are the top {top_n} players with most {stat_display} all-time?",
                f"What are the top {top_n} players with most {stat_display} in NBA history?",
                f"Top {top_n} players with most {stat_display}",
                f"Who has the most {stat_display} in league history?",
            ]
            
            # For "who has the most" questions, answer is the #1 player
            if rank == 1:
                for question in question_templates:
                    answer_start = context.find(player_name)
                    if answer_start >= 0:
                        qa_pairs.append({
                            "context": context,
                            "question": question,
                            "answers": [{"text": player_name, "answer_start": answer_start}],
                            "confidence": 0.95
                        })
            
            # For "top N" questions, answer includes the full list or specific rank
            if rank <= top_n:
                top_n_questions = [
                    f"Which is the top {top_n} of players with most {stat_display} in the history of the league?",
                    f"Who are the top {top_n} players with most {stat_display} all-time?",
                ]
                
                # Answer can be the specific player at that rank, or the full list
                for question in top_n_questions:
                    answer_start = context.find(player_name)
                    if answer_start >= 0:
                        qa_pairs.append({
                            "context": context,
                            "question": question,
                            "answers": [{"text": player_name, "answer_start": answer_start}],
                            "confidence": 0.95
                        })
        
        # Also add questions asking for the full list
        full_list_questions = [
            f"Which is the top {top_n} of players with most {stat_display} in the history of the league?",
            f"Who are the top {top_n} players with most {stat_display} in NBA history?",
        ]
        
        # For full list questions, answer is the first player (can be extended to include all)
        first_player = player_list[0][1] if player_list else "Unknown"
        first_player_start = context.find(first_player)
        
        for question in full_list_questions:
            if first_player_start >= 0:
                qa_pairs.append({
                    "context": context,
                    "question": question,
                    "answers": [{"text": first_player, "answer_start": first_player_start}],
                    "confidence": 0.95
                })
        
    except Exception as e:
        print(f"  Error generating league leader QA for {stat_type}: {e}")
    
    return qa_pairs

# Last team game QA generation
def _generate_last_team_game_qa(self, team_id: int, team_name: str) -> List[Dict]:
    """Generate QA pairs for the last game of a team"""
    qa_pairs = []
    try:
        import pandas as pd
        
        # Fetch all games for this team
        gamefinder = leaguegamefinder.LeagueGameFinder(team_id_nullable=team_id)
        games_df = gamefinder.get_data_frames()[0]
        
        if games_df.empty:
            return qa_pairs
        
        # Sort by game date (descending) to get most recent
        games_df['GAME_DATE'] = pd.to_datetime(games_df['GAME_DATE'])
        games_df = games_df.sort_values('GAME_DATE', ascending=False)
        
        # Get the most recent game
        last_game = games_df.iloc[0]
        game_date = last_game['GAME_DATE'].strftime('%Y-%m-%d')
        matchup = last_game.get('MATCHUP', '')
        team_pts = last_game.get('PTS', 0)
        plus_minus = last_game.get('PLUS_MINUS', 0)
        opp_pts = team_pts - plus_minus
        result = last_game.get('WL', '')
        
        # Extract opponent name from matchup
        # Matchup format: "LAL vs. WAS" (home) or "LAL @ WAS" (away)
        matchup_parts = matchup.split()
        if len(matchup_parts) >= 3:
            opp_abbrev = matchup_parts[-1]
            # Try to find full team name from abbreviation
            opp_team = None
            for team in self.nba_teams:
                if team['abbreviation'] == opp_abbrev:
                    opp_team = team.get('full_name', opp_abbrev)
                    break
            opponent_name = opp_team if opp_team else opp_abbrev
        else:
            opponent_name = "Opponent"
        
        # Generate context
        context = (
            f"On {game_date}, {team_name} played against {opponent_name}. "
            f"{team_name} {'won' if result == 'W' else 'lost'} {int(team_pts)}-{int(opp_pts)}."
        )
        
        # Generate QA pairs with multiple question variations
        qa_templates = [
            {
                "question": f"What was the score of the last {team_name} game?",
                "answer": f"{int(team_pts)}-{int(opp_pts)}",
                "answer_start": context.find(f"{int(team_pts)}-{int(opp_pts)}")
            },
            {
                "question": f"What was the score of the last {team_name.split()[-1]} game?",
                "answer": f"{int(team_pts)}-{int(opp_pts)}",
                "answer_start": context.find(f"{int(team_pts)}-{int(opp_pts)}")
            },
            {
                "question": f"Who did {team_name} play in their last game?",
                "answer": opponent_name,
                "answer_start": context.find(opponent_name)
            },
            {
                "question": f"Who did {team_name.split()[-1]} play in their last game?",
                "answer": opponent_name,
                "answer_start": context.find(opponent_name)
            },
            {
                "question": f"Did {team_name} win their last game?",
                "answer": "won" if result == 'W' else "lost",
                "answer_start": context.find("won" if result == 'W' else "lost")
            },
        ]
        
        for qa in qa_templates:
            if qa["answer_start"] >= 0:
                qa_pairs.append({
                    "context": context,
                    "question": qa["question"],
                    "answers": [{"text": qa["answer"], "answer_start": qa["answer_start"]}],
                    "confidence": 0.90
                })
        
    except Exception as e:
        print(f"  Error generating last team game QA for {team_name}: {e}")
    
    return qa_pairs

# Attach methods to the class
NBADatasetGenerator._extract_season_from_year = _extract_season_from_year
NBADatasetGenerator._generate_historical_season_qa = _generate_historical_season_qa
NBADatasetGenerator._generate_finals_game_qa = _generate_finals_game_qa
NBADatasetGenerator._generate_game_range_qa = _generate_game_range_qa
NBADatasetGenerator._generate_league_leader_qa = _generate_league_leader_qa
NBADatasetGenerator._generate_last_team_game_qa = _generate_last_team_game_qa

print("✅ Historical question support added to NBADatasetGenerator!")


✅ Historical question support added to NBADatasetGenerator!
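
The season-string handling used by the helpers above can be looked at in isolation. The sketch below uses hypothetical standalone functions (`season_from_year`, `season_display`) rather than the class methods, and derives the end year from the start year so that seasons before 2000 are expanded correctly.

```python
def season_from_year(year: int) -> str:
    """Calendar year in which a season ends -> season string, e.g. 2010 -> '2009-10'."""
    return f"{year - 1}-{year % 100:02d}"


def season_display(season: str) -> str:
    """Expand '2009-10' to '2009-2010'; strings not in 'YYYY-YY' form are returned unchanged."""
    start, _, end = season.partition("-")
    if len(end) != 2:
        return season
    return f"{start}-{int(start) + 1}"


print(season_from_year(2010))     # 2009-10
print(season_from_year(2000))     # 1999-00
print(season_display("2009-10"))  # 2009-2010
print(season_display("1998-99"))  # 1998-1999
```

A plain `season.replace('-', '-20')` would turn `'1998-99'` into `'1998-2099'`, which is why the end year is computed from the start year here.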


## Step 2: Define the Dataset Generator Class

This class handles:
- Fetching player data from NBA API
- Generating natural language contexts
- Creating question-answer pairs
- Saving in SQuAD format
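
Several generators in this notebook can emit the same question for the same context more than once (the all-time leader templates, for example, repeat some questions). A de-duplication pass before saving is one possible way to handle that. The sketch below assumes QA pairs shaped like the ones produced above; `dedupe_qa_pairs` is a hypothetical helper, not a method of the class.

```python
from typing import Dict, List


def dedupe_qa_pairs(qa_pairs: List[Dict]) -> List[Dict]:
    """Drop repeated (context, question, answer text) triples, keeping first occurrences."""
    seen = set()
    unique = []
    for qa in qa_pairs:
        key = (qa["context"], qa["question"], qa["answers"][0]["text"])
        if key not in seen:
            seen.add(key)
            unique.append(qa)
    return unique


# Hypothetical records for illustration
ctx = "Example Player scored 30 points."
pair = {"context": ctx, "question": "How many points did Example Player score?",
        "answers": [{"text": "30", "answer_start": ctx.find("30")}], "confidence": 0.90}
other = {"context": ctx, "question": "Who scored 30 points?",
         "answers": [{"text": "Example Player", "answer_start": 0}], "confidence": 0.90}
print(len(dedupe_qa_pairs([pair, dict(pair), other])))  # 2
```

This only removes exact repeats. The same question paired with different answers for one context is left in place and would need a separate decision about which answer to keep.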


In [21]:
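# Running this cell rebinds NBADatasetGenerator, so helper methods attached to an
# earlier definition of the class must be attached again afterwards.
class NBADatasetGenerator:
    """Generates NBA QA dataset from API data"""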
    
    def __init__(self):
        self.nba_players = players.get_players()
        self.nba_teams = teams.get_teams()
        self.dataset = []
    
    def generate_dataset(self, num_players: int = 50, num_games_per_player: int = 5) -> List[Dict]:
        """
        Generate QA dataset
        
        Args:
            num_players: Number of players to include
            num_games_per_player: Number of recent games per player
        """
        print(f"Generating NBA QA dataset for {num_players} players...")
        
        # Get popular/active players (filter by recent activity)
        active_players = self._get_active_players(num_players)
        
        dataset = []
        
        for i, player in enumerate(active_players, 1):
            player_id = player['id']
            player_name = player['full_name']
            
            print(f"[{i}/{num_players}] Processing {player_name}...")
            
            try:
                # Generate career stats QA pairs
                career_qa = self._generate_career_stats_qa(player_id, player_name)
                dataset.extend(career_qa)
                
                # Generate game log QA pairs
                game_qa = self._generate_game_log_qa(player_id, player_name, num_games_per_player)
                dataset.extend(game_qa)
                
            except Exception as e:
                print(f"  Error processing {player_name}: {e}")
                continue
        
        print(f"\nGenerated {len(dataset)} QA pairs")
        return dataset
    
    def _get_active_players(self, num_players: int) -> List[Dict]:
        """Get active players (simplified - just take first N players)"""
        # In production, you'd filter by active status, recent games, etc.
        return self.nba_players[:num_players]
    
    def _generate_career_stats_qa(self, player_id: str, player_name: str) -> List[Dict]:
        """Generate QA pairs from career statistics"""
        qa_pairs = []
        
        try:
            career_stats = playercareerstats.PlayerCareerStats(player_id=player_id)
            career_df = career_stats.get_data_frames()[0]
            
            if career_df.empty:
                return qa_pairs
            
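            # Count the rows in the season totals table.
            # Note: a player traded mid-season has one row per team plus a combined 'TOT' row
            # for that season, so this count (and the sums below) can overstate seasons and games.
            total_games = len(career_df)
            if total_games == 0:
                return qa_pairs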
            
            # Aggregate career totals
            total_pts = career_df['PTS'].sum() if 'PTS' in career_df.columns else 0
            total_reb = career_df['REB'].sum() if 'REB' in career_df.columns else 0
            total_ast = career_df['AST'].sum() if 'AST' in career_df.columns else 0
            total_gp = career_df['GP'].sum() if 'GP' in career_df.columns else total_games
            
            # Calculate averages
            avg_ppg = total_pts / total_gp if total_gp > 0 else 0
            avg_rpg = total_reb / total_gp if total_gp > 0 else 0
            avg_apg = total_ast / total_gp if total_gp > 0 else 0
            
            # Get most recent season stats (the API returns seasons oldest first, so take the last row)
            latest_season = career_df.iloc[-1]
            season_ppg = latest_season.get('PTS', 0) / latest_season.get('GP', 1) if latest_season.get('GP', 0) > 0 else 0
            season_rpg = latest_season.get('REB', 0) / latest_season.get('GP', 1) if latest_season.get('GP', 0) > 0 else 0
            season_apg = latest_season.get('AST', 0) / latest_season.get('GP', 1) if latest_season.get('GP', 0) > 0 else 0
            season_gp = latest_season.get('GP', 0)
            season_id = latest_season.get('SEASON_ID', 'N/A')
            
            # Generate context
            context = (
                f"{player_name} has played {total_games} seasons in the NBA. "
                f"Over his career, he has averaged {avg_ppg:.1f} points per game, "
                f"{avg_rpg:.1f} rebounds per game, and {avg_apg:.1f} assists per game "
                f"across {total_gp} games. "
                f"In the most recent season ({season_id}), {player_name} played {season_gp} games, "
                f"averaging {season_ppg:.1f} points per game, {season_rpg:.1f} rebounds per game, "
                f"and {season_apg:.1f} assists per game."
            )
            
            # Generate QA pairs
            qa_templates = [
                {
                    "question": f"What is {player_name}'s career points per game average?",
                    "answer": f"{avg_ppg:.1f}",
                    "answer_start": context.find(f"{avg_ppg:.1f} points per game")
                },
                {
                    "question": f"What is {player_name} career points per game average?",
                    "answer": f"{avg_ppg:.1f}",
                    "answer_start": context.find(f"{avg_ppg:.1f} points per game")
                },                
                {
                    "question": f"What is {player_name}'s career PPG?",
                    "answer": f"{avg_ppg:.1f}",
                    "answer_start": context.find(f"{avg_ppg:.1f} points per game")
                },
                {
                    "question": f"What is {player_name} career PPG?",
                    "answer": f"{avg_ppg:.1f}",
                    "answer_start": context.find(f"{avg_ppg:.1f} points per game")
                },                
                {
                    "question": f"How many rebounds per game does {player_name} average in his career?",
                    "answer": f"{avg_rpg:.1f}",
                    "answer_start": context.find(f"{avg_rpg:.1f} rebounds per game")
                },
                {
                    "question": f"What is {player_name}'s career RPG?",
                    "answer": f"{avg_rpg:.1f}",
                    "answer_start": context.find(f"{avg_rpg:.1f} rebounds per game")
                },
                {
                    "question": f"What is {player_name} career RPG?",
                    "answer": f"{avg_rpg:.1f}",
                    "answer_start": context.find(f"{avg_rpg:.1f} rebounds per game")
                },                
                {
                    "question": f"How many assists per game does {player_name} average?",
                    "answer": f"{avg_apg:.1f}",
                    "answer_start": context.find(f"{avg_apg:.1f} assists per game")
                },
                {
                    "question": f"What is {player_name}'s career APG?",
                    "answer": f"{avg_apg:.1f}",
                    "answer_start": context.find(f"{avg_apg:.1f} assists per game")
                },
                {
                    "question": f"What is {player_name} career APG?",
                    "answer": f"{avg_apg:.1f}",
                    "answer_start": context.find(f"{avg_apg:.1f} assists per game")
                },
                {
                    "question": f"How many games has {player_name} played in his career?",
                    "answer": f"{int(total_gp)}",
                    # The first "... games" in the context is the career total. Searching without
                    # the "across " prefix keeps answer_start on the number itself.
                    "answer_start": context.find(f"{int(total_gp)} games")
                },
                {
                    "question": f"How many seasons has {player_name} played?",
                    "answer": f"{total_games}",
                    "answer_start": context.find(f"{total_games} seasons")
                },
                {
                    "question": f"What is {player_name}'s points per game in the {season_id} season?",
                    "answer": f"{season_ppg:.1f}",
                    # rfind targets the season sentence (the last "points per game") and
                    # points at the number rather than at the word "averaging".
                    "answer_start": context.rfind(f"{season_ppg:.1f} points per game")
                },
                {
                    "question": f"What is {player_name} points per game in the {season_id} season?",
                    "answer": f"{season_ppg:.1f}",
                    "answer_start": context.rfind(f"{season_ppg:.1f} points per game")
                },
            ]
            
            for qa in qa_templates:
                if qa["answer_start"] >= 0:  # Only add if answer is found in context
                    qa_pairs.append({
                        "context": context,
                        "question": qa["question"],
                        "answers": [{
                            "text": qa["answer"],
                            "answer_start": qa["answer_start"]
                        }],
                        "confidence": 0.95  # High confidence for factual stats
                    })
        
        except Exception as e:
            print(f"  Error generating career stats QA for {player_name}: {e}")
        
        return qa_pairs
    
    def _generate_game_log_qa(self, player_id: str, player_name: str, num_games: int) -> List[Dict]:
        """Generate QA pairs from recent game logs"""
        qa_pairs = []
        
        try:
            game_log = playergamelog.PlayerGameLog(player_id=player_id, season=Season.default)
            game_log_df = game_log.get_data_frames()[0]
            
            if game_log_df.empty:
                return qa_pairs
            
            # Get most recent games
            recent_games = game_log_df.head(num_games)
            
            for _, game in recent_games.iterrows():
                game_date = game.get('GAME_DATE', 'N/A')
                matchup = game.get('MATCHUP', 'N/A')
                pts = game.get('PTS', 0)
                reb = game.get('REB', 0)
                ast = game.get('AST', 0)
                fgm = game.get('FGM', 0)
                fga = game.get('FGA', 0)
                fg_pct = (fgm / fga * 100) if fga > 0 else 0
                
                # Generate context
                context = (
                    f"On {game_date}, {player_name} played in the {matchup} matchup. "
                    f"He scored {int(pts)} points, grabbed {int(reb)} rebounds, "
                    f"and dished out {int(ast)} assists. "
                    f"{player_name} made {int(fgm)} of {int(fga)} field goal attempts, "
                    f"shooting {fg_pct:.1f}% from the field."
                )
                
                # Generate QA pairs
                game_qa = [
                    {
                        "question": f"How many points did {player_name} score on {game_date}?",
                        "answer": f"{int(pts)}",
                        "answer_start": context.find(f"{int(pts)} points")
                    },
                    {
                        "question": f"What was {player_name}'s point total against {matchup.split()[-1]} on {game_date}?",
                        "answer": f"{int(pts)}",
                        "answer_start": context.find(f"{int(pts)} points")
                    },
                    {
                        "question": f"How many rebounds did {player_name} get on {game_date}?",
                        "answer": f"{int(reb)}",
                        "answer_start": context.find(f"{int(reb)} rebounds")
                    },
                    {
                        "question": f"How many assists did {player_name} have on {game_date}?",
                        "answer": f"{int(ast)}",
                        "answer_start": context.find(f"{int(ast)} assists")
                    },
                    {
                        "question": f"What was {player_name}'s field goal percentage on {game_date}?",
                        "answer": f"{fg_pct:.1f}%",
                        "answer_start": context.find(f"{fg_pct:.1f}%")
                    },
                    {
                        "question": f"Who did {player_name} play against on {game_date}?",
                        "answer": matchup.split()[-1] if matchup else "Unknown",
                        "answer_start": context.find(matchup.split()[-1]) if matchup else -1
                    },
                ]
                
                for qa in game_qa:
                    if qa["answer_start"] >= 0:
                        qa_pairs.append({
                            "context": context,
                            "question": qa["question"],
                            "answers": [{
                                "text": qa["answer"],
                                "answer_start": qa["answer_start"]
                            }],
                            "confidence": 0.90  # High confidence for game stats
                        })
        
        except Exception as e:
            print(f"  Error generating game log QA for {player_name}: {e}")
        
        return qa_pairs
    
    def save_dataset(self, dataset: List[Dict], output_path: str = "nba/nba_qa_dataset.json"):
        """Save dataset to JSON file in SQuAD format"""
        # Convert to SQuAD format
        squad_format = {
            "version": "2.0",
            "data": [{
                "title": "NBA Statistics",
                "paragraphs": []
            }]
        }
        
        # Group by context (optional - for SQuAD format)
        context_map = {}
        for idx, item in enumerate(dataset):
            context = item["context"]
            if context not in context_map:
                context_map[context] = []
            
            context_map[context].append({
                "qas": [{
                    # Use the dataset-wide index so IDs stay unique across contexts
                    "id": f"qas_{idx}",
                    "question": item["question"],
                    "answers": item["answers"],
                    "is_impossible": False
                }],
                "context": context
            })
        
        # Flatten paragraphs
        for paragraphs in context_map.values():
            squad_format["data"][0]["paragraphs"].extend(paragraphs)
        
        # Save to file
        output_file = Path(output_path)
        output_file.parent.mkdir(parents=True, exist_ok=True)
        
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(squad_format, f, indent=2, ensure_ascii=False)
        
        print(f"\nDataset saved to {output_path}")
        print(f"Total QA pairs: {len(dataset)}")
        print(f"Unique contexts: {len(context_map)}")
        
        return output_file

print("✅ NBADatasetGenerator class defined!")


✅ NBADatasetGenerator class defined!


## Step 3: Generate the Dataset

Configure the parameters:
- `num_players`: Number of players to process (start with 30 for testing)
- `num_games_per_player`: Number of recent games per player (3 is a good start)

**Note**: This may take several minutes depending on the number of players and API response times.


In [None]:
# Initialize the generator
generator = NBADatasetGenerator()

# Generate the dataset
# Adjust these parameters as needed:
# - num_players: Start with 30 for testing, increase for more data
# - num_games_per_player: Number of recent games to include per player
# The options below belong to the historical extensions and stay commented out in the call,
# because generate_dataset as defined in Step 2 accepts only the two parameters above:
# - include_historical: Enable historical season questions
# - historical_seasons: List of seasons to include (e.g., ['2009-10', '2010-11'])
# - include_finals: Include Finals game questions
# - include_league_leaders: Include league leader questions
# - include_last_team_games: Include last team game questions (e.g., "What was the score of the last Lakers game?")
dataset = generator.generate_dataset(
    num_players=10,  # 10 players for a quick run (30 is the suggested starting size)
    num_games_per_player=3,  # 3 recent games per player
    #historical_seasons=['2009-10', '2010-11', '2001-02', '2002-03'],  # Historical seasons
    #include_finals=True,  # Include Finals games
    #include_league_leaders=True,  # Include league leaders
    #include_last_team_games=True  # Include last team game questions
)


Generating NBA QA dataset for 10 players...
[1/10] Processing Alaa Abdelnaby...


## Step 4: Inspect Sample QA Pairs

Let's look at a few examples from the generated dataset to verify quality.
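
Beyond reading a few examples by eye, one quick automated check is that every `answer_start` really points at its answer text. The sketch below assumes records shaped like those produced in Step 2 (`context`, `question`, `answers`). The function name `find_misaligned_answers` and the two sample records are hypothetical and hand-written for illustration.

```python
from typing import Dict, List


def find_misaligned_answers(dataset: List[Dict]) -> List[int]:
    """Return indices of QA pairs whose answer_start does not point at the answer text."""
    bad = []
    for i, qa in enumerate(dataset):
        context = qa["context"]
        for ans in qa["answers"]:
            start = ans["answer_start"]
            text = ans["text"]
            # The slice beginning at answer_start must reproduce the answer exactly
            if start < 0 or context[start:start + len(text)] != text:
                bad.append(i)
                break
    return bad


# Two hand-written records: the second has a deliberately wrong offset
sample = [
    {"context": "He scored 30 points.", "question": "q1",
     "answers": [{"text": "30", "answer_start": 10}]},
    {"context": "He scored 30 points.", "question": "q2",
     "answers": [{"text": "30", "answer_start": 3}]},
]
print(find_misaligned_answers(sample))  # prints [1]
```

Calling the same function on the generated `dataset` should return an empty list if all spans are aligned.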


In [None]:
# Display sample QA pairs
print(f"Total QA pairs generated: {len(dataset)}\n")
print("=" * 80)

for i, qa_pair in enumerate(dataset[:3], 1):
    print(f"\nExample {i}:")
    print(f"Question: {qa_pair['question']}")
    print(f"Answer: {qa_pair['answers'][0]['text']}")
    print(f"Context (excerpt): {qa_pair['context'][:150]}...")
    print(f"Confidence: {qa_pair['confidence']}")
    print("-" * 80)


Total QA pairs generated: 342


Example 1:
Question: What is Alaa Abdelnaby's career points per game average?
Answer: 6.0
Context (excerpt): Alaa Abdelnaby has played 9 seasons in the NBA. Over his career, he has averaged 6.0 points per game, 3.4 rebounds per game, and 0.3 assists per game ...
Confidence: 0.95
--------------------------------------------------------------------------------

Example 2:
Question: What is Alaa Abdelnaby's career PPG?
Answer: 6.0
Context (excerpt): Alaa Abdelnaby has played 9 seasons in the NBA. Over his career, he has averaged 6.0 points per game, 3.4 rebounds per game, and 0.3 assists per game ...
Confidence: 0.95
--------------------------------------------------------------------------------

Example 3:
Question: How many rebounds per game does Alaa Abdelnaby average in his career?
Answer: 3.4
Context (excerpt): Alaa Abdelnaby has played 9 seasons in the NBA. Over his career, he has averaged 6.0 points per game, 3.4 rebounds per game, and 0.3 assists p

## Step 5: Save Dataset to File

Save the dataset in SQuAD format for use in fine-tuning.
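
For orientation, SQuAD JSON nests questions under paragraphs, and paragraphs under articles in a top-level `data` list. The sketch below writes a tiny hand-made file with that nesting and reads it back into flat records. The helper `flatten_squad`, the file name, and the sample content are assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def flatten_squad(squad: Dict) -> List[Dict]:
    """Turn nested SQuAD JSON into one flat record per question."""
    records = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                records.append({
                    "id": qa["id"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answers": qa["answers"],
                })
    return records


# A tiny hand-written example using the SQuAD nesting
squad = {
    "version": "2.0",
    "data": [{
        "title": "NBA Statistics",
        "paragraphs": [{
            "context": "He scored 30 points.",
            "qas": [{
                "id": "qas_0",
                "question": "How many points did he score?",
                "answers": [{"text": "30", "answer_start": 10}],
                "is_impossible": False,
            }],
        }],
    }],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_squad.json"
    path.write_text(json.dumps(squad, indent=2), encoding="utf-8")
    records = flatten_squad(json.loads(path.read_text(encoding="utf-8")))

print(len(records), records[0]["question"])
```

A flat list like `records` is often easier to inspect or convert than the nested form.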


In [None]:
# Save dataset
output_path = "../utils/nba_qa_dataset.json"
output_file = generator.save_dataset(dataset, output_path)

print("\n✅ Dataset generation complete!")
print(f"📁 Dataset saved to: {output_path}")
print(f"📊 Total QA pairs: {len(dataset)}")
print("\nNext step: Use this dataset to fine-tune the QA model in the next notebook!")



Dataset saved to utils/nba_qa_dataset.json
Total QA pairs: 342
Unique contexts: 42

✅ Dataset generation complete!
📁 Dataset saved to: utils/nba_qa_dataset.json
📊 Total QA pairs: 342

Next step: Use this dataset to fine-tune the QA model in the next notebook!
