# dbApps05 Walkthrough — SQL Aggregations, GROUP BY & Excel

**Course:** Database Applications Development (145085)

**Instructor:** Ryan McMaster

**Medina County Career Center**

---

In this walkthrough, you will learn how to:

- Use aggregate functions (COUNT, SUM, AVG, MIN, MAX)
- Group data with the GROUP BY clause
- Use HAVING to filter grouped data
- Distinguish between WHERE and HAVING
- Export query results to Excel with formatting


## Setup

Import libraries and establish database connection.

```python
# Import required libraries
import pandas as pd
import sqlite3
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment

# Connect to the NBA database
conn = sqlite3.connect('nba_5seasons.db')

print('Database connection established successfully!')
```
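
One thing to watch for: `sqlite3.connect` does not raise an error when the file is missing. It creates a new, empty database instead, so the success message can print even when no data is available. A quick look at `sqlite_master` confirms the tables you expect are present. The sketch below is only an illustration: it uses an in-memory database and a hypothetical one-column stand-in table so it runs anywhere. With the course database you would run the same query against `conn`.

```python
import sqlite3

# In-memory stand-in for nba_5seasons.db (hypothetical demo table)
demoConn = sqlite3.connect(":memory:")
demoConn.execute("CREATE TABLE team_game_stats (pts INTEGER)")

# sqlite_master lists every table stored in the database
rows = demoConn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
tableNames = [name for (name,) in rows]
print(tableNames)

# An empty list usually means the wrong file name or working directory
if not tableNames:
    print("No tables found: check the file name and working directory.")

demoConn.close()
```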

---

## Sub-Lesson 05a — Aggregate Functions

Aggregate functions compute a single result from a set of input values. Common aggregate functions include:

- **COUNT(*)** — counts all rows
- **COUNT(column)** — counts non-NULL values in a column
- **SUM(column)** — adds all values in a column
- **AVG(column)** — calculates average of column values
- **MIN(column)** — finds minimum value
- **MAX(column)** — finds maximum value
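
**ROUND(value, decimals)** is not an aggregate function. It is a scalar function that rounds a single value to the given number of decimal places, and it is often wrapped around AVG to tidy the output.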
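
The difference between `COUNT(*)` and `COUNT(column)` only shows up when a column contains NULLs. Here is a minimal sketch on a hypothetical three-row table (`demo_stats` is made up for this illustration and is not part of the NBA database). It also shows that SUM and AVG skip NULL values.

```python
import sqlite3

demoConn = sqlite3.connect(":memory:")
demoConn.execute("CREATE TABLE demo_stats (player TEXT, reb INTEGER)")
demoConn.executemany(
    "INSERT INTO demo_stats VALUES (?, ?)",
    [("A", 10), ("B", None), ("C", 6)],  # player B has no rebound value
)

countAll, countReb, sumReb, avgReb = demoConn.execute(
    "SELECT COUNT(*), COUNT(reb), SUM(reb), AVG(reb) FROM demo_stats"
).fetchone()

print(countAll)  # all rows, including the one with NULL
print(countReb)  # only rows where reb is not NULL
print(sumReb)    # NULL is skipped
print(avgReb)    # averaged over the two non-NULL values

demoConn.close()
```

Note that the average is taken over two values, not three, because the NULL row is left out of the calculation entirely.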


### Example 1: COUNT(*) — Count All Rows

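Let's start by counting the rows in `team_game_stats`. The table name suggests one row per team per game, so treat the result as a count of team-game records rather than unique games.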

```python
# Count total number of team game statistics records
query = """
SELECT COUNT(*) AS totalGames
FROM team_game_stats;
"""

totalGamesResult = pd.read_sql(query, conn)
print(totalGamesResult)
print(f"\nTotal games in dataset: {totalGamesResult['totalGames'][0]}")
```

### Example 2: COUNT(column) — Count Non-NULL Values

Let's count how many player records we have.

```python
# Count rows in players that have a non-NULL player_id
# (this equals the number of unique players only if each id appears once)
query = """
SELECT COUNT(player_id) AS totalPlayers
FROM players;
"""

totalPlayersResult = pd.read_sql(query, conn)
print(totalPlayersResult)
print(f"\nTotal players in dataset: {totalPlayersResult['totalPlayers'][0]}")
```
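
`COUNT(column)` counts non-NULL values, not unique ones. When an id can repeat, for example in a table with one row per player per season, `COUNT(DISTINCT column)` gives the number of unique values. A small sketch, using a hypothetical `demo_seasons` table invented for this example:

```python
import sqlite3

demoConn = sqlite3.connect(":memory:")
demoConn.execute("CREATE TABLE demo_seasons (player_id INTEGER, season INTEGER)")
demoConn.executemany(
    "INSERT INTO demo_seasons VALUES (?, ?)",
    [(1, 2020), (1, 2021), (2, 2021)],  # player 1 appears in two seasons
)

rowCount, uniquePlayers = demoConn.execute(
    "SELECT COUNT(player_id), COUNT(DISTINCT player_id) FROM demo_seasons"
).fetchone()

print(rowCount)       # one per non-NULL player_id value
print(uniquePlayers)  # each player counted once

demoConn.close()
```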

### Example 3: SUM — Total Points

Calculate the total points scored across all games.

```python
# Sum all points scored across all team games
query = """
SELECT SUM(pts) AS totalPointsScored
FROM team_game_stats;
"""

totalPointsResult = pd.read_sql(query, conn)
print(totalPointsResult)
print(f"\nTotal points across all games: {totalPointsResult['totalPointsScored'][0]:,.0f}")
```

### Example 4: AVG and ROUND — Average Points Per Game

Calculate the average points scored per team per game, rounded to 1 decimal.

```python
# Calculate average points per game, rounded to 1 decimal place
query = """
SELECT ROUND(AVG(pts), 1) AS avgPointsPerGame
FROM team_game_stats;
"""

avgPointsResult = pd.read_sql(query, conn)
print(avgPointsResult)
print(f"\nAverage points per game: {avgPointsResult['avgPointsPerGame'][0]}")
```

### Example 5: MIN and MAX — Find Extremes

Find the lowest and highest player season scoring totals.

```python
# Find minimum and maximum points scored in a season by any player
query = """
SELECT
  MIN(pts) AS minSeasonPoints,
  MAX(pts) AS maxSeasonPoints
FROM player_season_stats;
"""

minMaxResult = pd.read_sql(query, conn)
print(minMaxResult)
print(f"\nLowest season points: {minMaxResult['minSeasonPoints'][0]}")
print(f"Highest season points: {minMaxResult['maxSeasonPoints'][0]}")
```
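
If a table is empty, or a WHERE filter matches nothing, the aggregates still return one row, but MIN, MAX, and SUM come back as NULL while COUNT returns 0. The sketch below shows this with a hypothetical empty table (`demo_stats` is an assumption for illustration). Read through `sqlite3`, NULL arrives in Python as `None`.

```python
import sqlite3

demoConn = sqlite3.connect(":memory:")
demoConn.execute("CREATE TABLE demo_stats (pts INTEGER)")  # no rows inserted

countRows, minPts, maxPts, sumPts = demoConn.execute(
    "SELECT COUNT(*), MIN(pts), MAX(pts), SUM(pts) FROM demo_stats"
).fetchone()

print(countRows)               # COUNT of nothing is 0
print(minPts, maxPts, sumPts)  # the other aggregates are NULL (None)

demoConn.close()
```

This is worth checking before formatting a result with something like `:,.0f`, which fails on a missing value.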

### Try This 1: COUNT Non-NULL Rebounds

Write a query to count how many non-NULL rebound values exist in the `player_season_stats` table.

**Hint:** Use `COUNT(reb)` from the `player_season_stats` table.

### Try This 2: Average Field Goal Percentage

Write a query to calculate the average field goal percentage (`fg_pct`) from the `player_season_stats` table, rounded to 2 decimal places.

**Hint:** Use `ROUND(AVG(fg_pct), 2)`

### Try This 3: Multiple Aggregates

Write a query to show the total and average rebounds from `player_season_stats`, rounded to 2 decimals.

**Hint:** Use both `SUM()` and `AVG()` in the same SELECT statement.

---

## Sub-Lesson 05b — GROUP BY and Aliases

The GROUP BY clause collects rows that share the same value in one or more columns into groups. Aggregate functions then return one result per group instead of one result for the whole table.

- **Single Column GROUP BY:** Group by one attribute
- **Multiple Column GROUP BY:** Group by multiple attributes
- **AS (Aliases):** Rename columns in output
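
To see what grouping does to the shape of a result, here is a minimal sketch on a hypothetical five-row table (`demo_games` and its values are made up for this example). Five input rows collapse into one output row per team.

```python
import sqlite3

demoConn = sqlite3.connect(":memory:")
demoConn.execute("CREATE TABLE demo_games (team_id TEXT, pts INTEGER)")
demoConn.executemany(
    "INSERT INTO demo_games VALUES (?, ?)",
    [("BOS", 110), ("BOS", 120), ("LAL", 100), ("LAL", 104), ("LAL", 96)],
)

# One output row per distinct team_id; aliases name the computed columns
groupedRows = demoConn.execute(
    """
    SELECT
      team_id,
      COUNT(*) AS gamesPlayed,
      SUM(pts) AS totalPoints,
      ROUND(AVG(pts), 1) AS avgPoints
    FROM demo_games
    GROUP BY team_id
    ORDER BY team_id;
    """
).fetchall()

for row in groupedRows:
    print(row)

demoConn.close()
```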


### Example 6: GROUP BY Single Column

Get total points scored by each team across all games.

```python
# Group by team_id and sum their points
query = """
SELECT
  team_id,
  SUM(pts) AS totalPointsByTeam
FROM team_game_stats
GROUP BY team_id
ORDER BY totalPointsByTeam DESC
LIMIT 5;
"""

pointsByTeam = pd.read_sql(query, conn)
print(pointsByTeam)
print("\nTop 5 teams by total points scored")
```

### Example 7: GROUP BY with Meaningful Column Names (Aliases)

Use aliases to make column names more descriptive.

```python
# Group by team and use aliases for better readability
query = """
SELECT
  team_id AS teamID,
  COUNT(*) AS gamesPlayed,
  ROUND(AVG(pts), 1) AS avgPointsPerGame,
  MAX(pts) AS highestScoringGame
FROM team_game_stats
GROUP BY team_id
ORDER BY avgPointsPerGame DESC
LIMIT 5;
"""

teamStats = pd.read_sql(query, conn)
print(teamStats)
print("\nTop 5 teams by average points per game")
```

### Example 8: GROUP BY Multiple Columns

Group by season and team to see performance trends.

```python
# Group by multiple columns: season and team_id
# Sorting by season first means LIMIT 10 keeps rows from the latest season first
query = """
SELECT
  season,
  team_id,
  COUNT(*) AS gamesPlayed,
  ROUND(AVG(pts), 1) AS avgPointsPerGame
FROM team_game_stats
GROUP BY season, team_id
ORDER BY season DESC, avgPointsPerGame DESC
LIMIT 10;
"""

seasonTeamStats = pd.read_sql(query, conn)
print(seasonTeamStats)
print("\nFirst 10 rows: most recent season first, highest-scoring teams first")
```
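
Adding a second column to GROUP BY creates one group per unique combination of values. The sketch below compares grouping by team alone with grouping by season and team, using a hypothetical six-row `demo_games` table made up for this illustration.

```python
import sqlite3

demoConn = sqlite3.connect(":memory:")
demoConn.execute(
    "CREATE TABLE demo_games (season INTEGER, team_id TEXT, pts INTEGER)"
)
demoConn.executemany(
    "INSERT INTO demo_games VALUES (?, ?, ?)",
    [
        (2020, "BOS", 100), (2020, "BOS", 110), (2021, "BOS", 120),
        (2020, "LAL", 90), (2021, "LAL", 105), (2021, "LAL", 115),
    ],
)

# One group per team
byTeam = demoConn.execute(
    "SELECT team_id, COUNT(*) FROM demo_games "
    "GROUP BY team_id ORDER BY team_id"
).fetchall()

# One group per (season, team) combination
bySeasonTeam = demoConn.execute(
    "SELECT season, team_id, COUNT(*), ROUND(AVG(pts), 1) FROM demo_games "
    "GROUP BY season, team_id ORDER BY season, team_id"
).fetchall()

print(len(byTeam), "groups when grouping by team")
print(len(bySeasonTeam), "groups when grouping by season and team")
for row in bySeasonTeam:
    print(row)

demoConn.close()
```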

### Try This 4: GROUP BY Player Season Stats

Write a query to group `player_season_stats` by `player_id` and show:
- player_id (as playerID)
- Number of seasons played (count of rows, as seasonsPlayed)
- Average points per season (rounded to 1 decimal, as avgPointsPerSeason)

**Hint:** Use GROUP BY player_id with COUNT() and AVG() functions.

### Try This 5: GROUP BY Multiple Columns

Write a query to group `player_season_stats` by both `season` and `team_id`. Show:
- season
- team_id (as teamID)
- Count of players on that team for that season (as playerCount)
- Average rebounds per player that season (rounded to 1 decimal, as avgRebounds)

**Hint:** GROUP BY season, team_id

---

## Sub-Lesson 05c — HAVING vs WHERE

Both WHERE and HAVING filter data, but they work differently:

- **WHERE:** Filters rows BEFORE aggregation (applies to individual rows)
- **HAVING:** Filters groups AFTER aggregation (applies to grouped results, so it can test aggregate values such as COUNT(*) or AVG(pts))

**Order the clauses are written in:** SELECT → FROM → WHERE → GROUP BY → HAVING → ORDER BY → LIMIT
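
The two filters can be compared side by side. In this sketch, on a hypothetical six-row `demo_games` table (invented for illustration), WHERE removes losing games before the groups are formed, and HAVING then removes whole teams based on their win count.

```python
import sqlite3

demoConn = sqlite3.connect(":memory:")
demoConn.execute("CREATE TABLE demo_games (team_id TEXT, wl TEXT, pts INTEGER)")
demoConn.executemany(
    "INSERT INTO demo_games VALUES (?, ?, ?)",
    [
        ("BOS", "W", 110), ("BOS", "W", 120), ("BOS", "L", 95),
        ("LAL", "W", 100), ("LAL", "L", 90), ("LAL", "L", 98),
    ],
)

# WHERE only: drop losses, then count what is left per team
whereRows = demoConn.execute(
    "SELECT team_id, COUNT(*) AS wins FROM demo_games "
    "WHERE wl = 'W' GROUP BY team_id ORDER BY team_id"
).fetchall()

# WHERE + HAVING: same groups, but keep only teams with at least 2 wins
havingRows = demoConn.execute(
    "SELECT team_id, COUNT(*) AS wins FROM demo_games "
    "WHERE wl = 'W' GROUP BY team_id HAVING COUNT(*) >= 2 ORDER BY team_id"
).fetchall()

print(whereRows)
print(havingRows)

demoConn.close()
```

A condition on an aggregate such as `COUNT(*)` cannot go in WHERE, because the counts do not exist yet when WHERE runs.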


### Example 9: WHERE — Filter Before Grouping

Find average points for games where the team won (wl = 'W').

```python
# WHERE filters rows BEFORE aggregation
# Here we only look at winning games
query = """
SELECT
  team_id AS teamID,
  COUNT(*) AS winCount,
  ROUND(AVG(pts), 1) AS avgPointsInWins
FROM team_game_stats
WHERE wl = 'W'  -- Filter: only winning games
GROUP BY team_id
ORDER BY avgPointsInWins DESC
LIMIT 5;
"""

winsStats = pd.read_sql(query, conn)
print(winsStats)
print("\nTop 5 teams by average points in WINNING games")
```

### Example 10: HAVING — Filter After Grouping

Show teams with more than 150 total wins.

```python
# HAVING filters groups AFTER aggregation
# We show only teams with more than 150 wins
query = """
SELECT
  team_id AS teamID,
  COUNT(*) AS totalWins
FROM team_game_stats
WHERE wl = 'W'  -- First, filter to wins
GROUP BY team_id
HAVING COUNT(*) > 150  -- Then, filter groups by win count
ORDER BY totalWins DESC;
"""

teamWinsFiltered = pd.read_sql(query, conn)
print(teamWinsFiltered)
print("\nTeams with MORE than 150 wins across 5 seasons")
```

### Example 11: WHERE + GROUP BY + HAVING Together

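Find teams whose average is above 110 points, counting only the games in which they scored more than 100 points. Because WHERE runs first, lower-scoring games never enter the average.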

```python
# Combining WHERE (filter rows) + GROUP BY (group) + HAVING (filter groups)
query = """
SELECT
  team_id AS teamID,
  COUNT(*) AS highScoringGames,
  ROUND(AVG(pts), 1) AS avgPointsHighScoring
FROM team_game_stats
WHERE pts > 100  -- WHERE: filter to games where team scored > 100
GROUP BY team_id
HAVING AVG(pts) > 110  -- HAVING: filter to groups with avg > 110
ORDER BY avgPointsHighScoring DESC;
"""

highScoringTeams = pd.read_sql(query, conn)
print(highScoringTeams)
print("\nTeams with avg > 110 pts in games where they scored > 100")
```
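
A WHERE clause changes which rows the aggregate sees, so the same `HAVING AVG(pts) > 110` test can give a different answer with and without it. A small sketch with one hypothetical team and three made-up scores:

```python
import sqlite3

demoConn = sqlite3.connect(":memory:")
demoConn.execute("CREATE TABLE demo_games (team_id TEXT, pts INTEGER)")
demoConn.executemany(
    "INSERT INTO demo_games VALUES (?, ?)",
    [("BOS", 81), ("BOS", 120), ("BOS", 126)],
)

# Average over every game the team played
(avgAll,) = demoConn.execute(
    "SELECT ROUND(AVG(pts), 1) FROM demo_games"
).fetchone()

# Average over only the games above 100 points
(avgOver100,) = demoConn.execute(
    "SELECT ROUND(AVG(pts), 1) FROM demo_games WHERE pts > 100"
).fetchone()

# Same HAVING test, with and without the WHERE filter
withoutWhere = demoConn.execute(
    "SELECT team_id FROM demo_games GROUP BY team_id HAVING AVG(pts) > 110"
).fetchall()
withWhere = demoConn.execute(
    "SELECT team_id FROM demo_games WHERE pts > 100 "
    "GROUP BY team_id HAVING AVG(pts) > 110"
).fetchall()

print(avgAll, avgOver100)
print(withoutWhere, withWhere)

demoConn.close()
```

The team passes the HAVING test only when the 81-point game has already been filtered out by WHERE.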

### Try This 6: HAVING to Filter Groups

Write a query to find players (from `player_season_stats`) who averaged more than 20 points per season.

Show:
- player_id (as playerID)
- Number of seasons (as seasons)
- Average points per season rounded to 1 decimal (as avgPointsPerSeason)

**Hint:** GROUP BY player_id, then use HAVING with AVG(pts) > 20

### Try This 7: WHERE + GROUP BY + HAVING

Write a query to find teams (from `team_game_stats`) where:
- They played in season 2021 (WHERE condition)
- They won at least 10 games (HAVING condition)

Show:
- team_id (as teamID)
- Win count (as wins)
- Average points in those wins rounded to 1 decimal (as avgPointsPerWin)

**Hint:** WHERE season = 2021 AND wl = 'W', then HAVING COUNT(*) >= 10

---

## Sub-Lesson 05d — SQL to Excel Export

Export your query results to Excel files with various formatting options.

- **Basic Export:** Simple .to_excel() export
- **ExcelWriter:** Multiple sheets in one workbook
- **Formatting:** Colors, fonts, alignment using openpyxl


### Example 12: Basic Excel Export

Export team statistics to a simple Excel file.

```python
# Query team statistics
query = """
SELECT
  team_id AS teamID,
  COUNT(*) AS gamesPlayed,
  SUM(pts) AS totalPoints,
  ROUND(AVG(pts), 1) AS avgPointsPerGame
FROM team_game_stats
GROUP BY team_id
ORDER BY avgPointsPerGame DESC;
"""

# Read data into DataFrame
teamStatsForExcel = pd.read_sql(query, conn)

# Export to Excel (basic)
excelFileName = 'team_statistics_basic.xlsx'
teamStatsForExcel.to_excel(excelFileName, index=False, sheet_name='Team Stats')

print(f'File exported: {excelFileName}')
print(f'Rows exported: {len(teamStatsForExcel)}')
print(teamStatsForExcel.head())
```

### Example 13: Multiple Sheets with ExcelWriter

Create a single Excel file with multiple sheets.

```python
# Query 1: Team statistics
teamQuery = """
SELECT
  team_id AS teamID,
  COUNT(*) AS gamesPlayed,
  ROUND(AVG(pts), 1) AS avgPointsPerGame
FROM team_game_stats
GROUP BY team_id
ORDER BY avgPointsPerGame DESC;
"""

# Query 2: Top scorers
playerQuery = """
SELECT
  player_id AS playerID,
  COUNT(*) AS seasons,
  ROUND(AVG(pts), 1) AS avgPointsPerSeason
FROM player_season_stats
GROUP BY player_id
ORDER BY avgPointsPerSeason DESC
LIMIT 10;
"""

# Read both queries
teamData = pd.read_sql(teamQuery, conn)
playerData = pd.read_sql(playerQuery, conn)

# Create Excel file with multiple sheets
# The with-block saves and closes the workbook when it ends
excelFileName = 'nba_analysis_multisheet.xlsx'
with pd.ExcelWriter(excelFileName, engine='openpyxl') as writer:
    teamData.to_excel(writer, sheet_name='Team Stats', index=False)
    playerData.to_excel(writer, sheet_name='Top Scorers', index=False)

print(f'File exported: {excelFileName}')
print('Sheets created: Team Stats, Top Scorers')
```

### Example 14: Advanced Formatting with openpyxl

Add colors, fonts, and alignment to make the Excel file look professional.

```python
# Query high-scoring games
gameQuery = """
SELECT
  season,
  game_date AS gameDate,
  team_id AS teamID,
  pts AS points,
  ast AS assists,
  reb AS rebounds
FROM team_game_stats
WHERE pts >= 120
ORDER BY season DESC, pts DESC
LIMIT 20;
"""

# Read data
gameData = pd.read_sql(gameQuery, conn)

# Export to Excel
excelFileName = 'high_scoring_games_formatted.xlsx'
gameData.to_excel(excelFileName, sheet_name='Games', index=False)

# Load workbook and apply formatting
wb = load_workbook(excelFileName)
ws = wb.active

# Define styles
headerFill = PatternFill(start_color='4472C4', end_color='4472C4', fill_type='solid')
headerFont = Font(bold=True, color='FFFFFF', size=12)
centerAlignment = Alignment(horizontal='center', vertical='center')

# Format header row
for cell in ws[1]:
    cell.fill = headerFill
    cell.font = headerFont
    cell.alignment = centerAlignment

# Auto-adjust column widths to fit the longest value in each column
for column in ws.columns:
    column_letter = column[0].column_letter
    # Treat empty cells as zero-length instead of measuring the text "None"
    max_length = max(
        len(str(cell.value)) if cell.value is not None else 0
        for cell in column
    )
    # Add a little padding so text does not touch the cell border
    ws.column_dimensions[column_letter].width = max_length + 2

# Save formatted workbook
wb.save(excelFileName)

print(f'Formatted file exported: {excelFileName}')
print(f'Rows: {len(gameData)}, Columns: {len(gameData.columns)}')
print(gameData.head())
```

### Try This 8: Basic Excel Export

Write a query to find the top 10 players by average points per season from `player_season_stats`. Export the results to an Excel file called `top_10_scorers.xlsx`.

Include:
- player_id (as playerID)
- Number of seasons (as seasons)
- Average points per season rounded to 1 decimal (as avgPointsPerSeason)

**Hint:** GROUP BY player_id, ORDER BY avgPointsPerSeason DESC LIMIT 10, then use .to_excel()

### Try This 9: Multiple Sheets Excel Export

Create an Excel file with TWO sheets:

1. **Sheet 1 ("Winning Teams"):** Teams with the most wins in season 2021
   - team_id (as teamID)
   - Win count (as wins)
   - Average points in wins rounded to 1 decimal (as avgPointsPerWin)

2. **Sheet 2 ("Top Rebounders"):** Top 5 players by average rebounds
   - player_id (as playerID)
   - Average rebounds rounded to 1 decimal (as avgRebounds)

Export to `nba_multi_analysis.xlsx`

**Hint:** Use ExcelWriter with two separate queries and two .to_excel() calls

### Try This 10: Formatted Excel Export

Create an Excel file with the top 15 games by points scored, but with professional formatting:

- Blue header row with white text
- Auto-adjusted column widths
- Save as `top_games_formatted.xlsx`

Include columns:
- season
- game_date (as gameDate)
- team_id (as teamID)
- pts (as points)
- ast (as assists)
- reb (as rebounds)

Order by pts DESC and limit to 15 rows.

**Hint:** Query → .to_excel() → load_workbook() → format header row → save

---

## Summary

You've learned:

1. **Aggregate Functions:** COUNT, SUM, AVG, MIN, MAX, plus the scalar ROUND function to tidy results
2. **GROUP BY:** Single and multiple column grouping with aliases
3. **HAVING vs WHERE:** HAVING filters grouped results, WHERE filters individual rows
4. **Excel Export:** Basic export, multiple sheets, and professional formatting

Keep practicing these techniques with different datasets!
