# dbApps06 Walkthrough: Database Design & Normalization

**Course:** Database Applications Development (145085)  
**Medina County Career Center**

---

This lesson explores **why** we use multiple tables in databases, the concepts of **entities, keys, and relationships**, and how **normalization** prevents data anomalies. We'll examine the NBA database structure and understand the design decisions behind it.

## Setup

Import libraries and connect to the nba_5seasons.db database.

In [None]:
import pandas as pd
import sqlite3

# Connect to the NBA database
dbPath = './nba_5seasons.db'
connection = sqlite3.connect(dbPath)
cursor = connection.cursor()

print(f"Connected to database: {dbPath}")

---

## Sub-Lesson 06a — Why Multiple Tables?

### The Problem: Denormalized Data

Imagine storing all NBA data in a **single table**. Each row would contain:
- Team information (name, city, state, year founded)
- Player information (name)
- Game statistics (date, points, rebounds, assists, etc.)

**What happens?**
- Every time the Lakers play, we repeat the team name, city, state, and founding year
- Every time LeBron James has stats, we repeat his full name
- Updating a team's city requires changing THOUSANDS of rows
- Disk space is wasted on redundant data

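A single wide table like this is **denormalized**, and its redundancy causes **data anomalies**: an update that misses even one of those repeated rows leaves the database contradicting itself.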
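
To make the risk of updating repeated values concrete, here is a minimal sketch that uses a throwaway in-memory SQLite database. The `wide_games` table and its rows are made up for illustration and are not part of nba_5seasons.db.

```python
import sqlite3

# Hypothetical single-table design; the table name and rows are illustrative only
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE wide_games (team_name TEXT, team_city TEXT, game_date TEXT, pts INTEGER);"
)
demo.executemany(
    "INSERT INTO wide_games VALUES (?, ?, ?, ?);",
    [
        ("Lakers", "Los Angeles", "2023-01-01", 110),
        ("Lakers", "Los Angeles", "2023-01-03", 102),
        ("Lakers", "Los Angeles", "2023-01-05", 121),
    ],
)

# An update that reaches only some of the rows leaves the team in two cities at once
demo.execute(
    "UPDATE wide_games SET team_city = 'Las Vegas' "
    "WHERE team_name = 'Lakers' AND game_date >= '2023-01-03';"
)

cities = [row[0] for row in demo.execute(
    "SELECT DISTINCT team_city FROM wide_games "
    "WHERE team_name = 'Lakers' ORDER BY team_city;"
)]
print(f"Cities recorded for the Lakers: {cities}")
demo.close()
```

With the city stored once in a separate `teams` table, the same change would be a single-row update and could not end up half applied.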

### Current NBA Database Structure

Let's examine the actual tables we have:

In [None]:
# Get all table names in the database
tableQuery = "SELECT name FROM sqlite_master WHERE type='table';"
tables = cursor.execute(tableQuery).fetchall()
tableList = [t[0] for t in tables]

print("Tables in nba_5seasons.db:")
for i, tableName in enumerate(tableList, 1):
    print(f"  {i}. {tableName}")

### Table 1: teams

In [None]:
# Show the structure of the teams table
print("\nTable: teams")
print("\nColumns:")
teamSchema = cursor.execute("PRAGMA table_info(teams);").fetchall()
for col in teamSchema:
    colId, colName, colType, notNull, defaultVal, pk = col
    print(f"  {colName:20} {colType:10} PK={pk}")

# Show row count
teamCount = cursor.execute("SELECT COUNT(*) FROM teams;").fetchone()[0]
print(f"\nRows: {teamCount}")

# Show first 5 rows
print("\nFirst 5 rows:")
teamsDF = pd.read_sql_query("SELECT * FROM teams LIMIT 5;", connection)
print(teamsDF.to_string())

### Table 2: players

In [None]:
# Show the structure of the players table
print("\nTable: players")
print("\nColumns:")
playersSchema = cursor.execute("PRAGMA table_info(players);").fetchall()
for col in playersSchema:
    colId, colName, colType, notNull, defaultVal, pk = col
    print(f"  {colName:20} {colType:10} PK={pk}")

# Show row count
playersCount = cursor.execute("SELECT COUNT(*) FROM players;").fetchone()[0]
print(f"\nRows: {playersCount}")

# Show first 5 rows
print("\nFirst 5 rows:")
playersDF = pd.read_sql_query("SELECT * FROM players LIMIT 5;", connection)
print(playersDF.to_string())

### Table 3: team_game_stats

In [None]:
# Show the structure of the team_game_stats table
print("\nTable: team_game_stats")
print("\nColumns:")
gameStatsSchema = cursor.execute("PRAGMA table_info(team_game_stats);").fetchall()
for col in gameStatsSchema:
    colId, colName, colType, notNull, defaultVal, pk = col
    print(f"  {colName:20} {colType:10} PK={pk}")

# Show row count
gameStatsCount = cursor.execute("SELECT COUNT(*) FROM team_game_stats;").fetchone()[0]
print(f"\nRows: {gameStatsCount}")

# Show first 3 rows
print("\nFirst 3 rows:")
gameStatsDF = pd.read_sql_query("SELECT * FROM team_game_stats LIMIT 3;", connection)
print(gameStatsDF.to_string())

### Table 4: player_season_stats

In [None]:
# Show the structure of the player_season_stats table
print("\nTable: player_season_stats")
print("\nColumns:")
playerSeasonSchema = cursor.execute("PRAGMA table_info(player_season_stats);").fetchall()
for col in playerSeasonSchema:
    colId, colName, colType, notNull, defaultVal, pk = col
    print(f"  {colName:20} {colType:10} PK={pk}")

# Show row count
playerSeasonCount = cursor.execute("SELECT COUNT(*) FROM player_season_stats;").fetchone()[0]
print(f"\nRows: {playerSeasonCount}")

# Show first 3 rows
print("\nFirst 3 rows:")
playerSeasonDF = pd.read_sql_query("SELECT * FROM player_season_stats LIMIT 3;", connection)
print(playerSeasonDF.to_string())

### Redundancy Example

Notice: **team_id** appears in both `team_game_stats` and `player_season_stats`. Instead of repeating the team name and city in every row, we store those details once in `teams` and reference them from other tables by ID.

In [None]:
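# Example: how many times does the Lakers' team_id appear in team_game_stats?
# Look the team up by name so the ID, city, and state all come from the teams table
lakersTeam = pd.read_sql_query(
    "SELECT team_id, full_name, city, state FROM teams WHERE full_name = ?;",
    connection,
    params=("Los Angeles Lakers",)
)
lakersId = int(lakersTeam['team_id'].values[0])  # plain int so sqlite3 can bind it
lakersName = lakersTeam['full_name'].values[0]
lakersCity = lakersTeam['city'].values[0]
lakersState = lakersTeam['state'].values[0]

lakersGameCount = cursor.execute(
    "SELECT COUNT(*) FROM team_game_stats WHERE team_id = ?;",
    (lakersId,)
).fetchone()[0]

print(f"Team: {lakersName} (team_id {lakersId})")
print(f"Games in database: {lakersGameCount}")
print(f"\nIn a denormalized table, we'd repeat '{lakersName}', '{lakersCity}', and '{lakersState}' {lakersGameCount} times!")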

### Try This

**Task:** How many times does a specific player (player_id) appear in the `player_season_stats` table? First, find a player's ID, then count their appearances. Why is it better to store the player's name in a separate table and reference it by ID?

*(Write your query and answer below)*

In [None]:
# Your code here


---

## Sub-Lesson 06b — Entities, Keys & Relationships

### Entity Types

An **entity** is a "thing" in the real world that we track:
- **Teams** — The 30 NBA franchises
- **Players** — The athletes in the league
- **Games** — Individual team performances in games (stored in `team_game_stats`)
- **Seasons** — Each player's stats for one season with one team (stored in `player_season_stats`)

Each entity gets its own table.

### Primary Keys (PK)

A **primary key** uniquely identifies each row in a table. Let's verify that our keys actually are unique:

In [None]:
# Check that all teams have distinct team_ids
distinctTeamIds = cursor.execute(
    "SELECT COUNT(DISTINCT team_id) FROM teams;"
).fetchone()[0]
totalTeams = cursor.execute(
    "SELECT COUNT(*) FROM teams;"
).fetchone()[0]

print(f"Teams table:")
print(f"  Total rows: {totalTeams}")
print(f"  Distinct team_id values: {distinctTeamIds}")
print(f"  Is team_id unique? {totalTeams == distinctTeamIds}")

# Check players table
distinctPlayerIds = cursor.execute(
    "SELECT COUNT(DISTINCT player_id) FROM players;"
).fetchone()[0]
totalPlayers = cursor.execute(
    "SELECT COUNT(*) FROM players;"
).fetchone()[0]

print(f"\nPlayers table:")
print(f"  Total rows: {totalPlayers}")
print(f"  Distinct player_id values: {distinctPlayerIds}")
print(f"  Is player_id unique? {totalPlayers == distinctPlayerIds}")
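
Counting distinct values checks uniqueness after the fact. When a column is declared as a primary key, SQLite also refuses duplicates at insert time. The sketch below shows this on a hypothetical `teams_demo` table in an in-memory database; whether the real `teams` table declares its key this way is what the `PK=` column printed by `PRAGMA table_info` indicates.

```python
import sqlite3

# Hypothetical table in a throwaway in-memory database
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE teams_demo (team_id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);"
)
demo.execute("INSERT INTO teams_demo VALUES (1, 'Team A');")

# A second row with the same team_id violates the primary key constraint
try:
    demo.execute("INSERT INTO teams_demo VALUES (1, 'Team B');")
    duplicateRejected = False
except sqlite3.IntegrityError as err:
    duplicateRejected = True
    print(f"Rejected: {err}")

rowCount = demo.execute("SELECT COUNT(*) FROM teams_demo;").fetchone()[0]
print(f"Rows after both inserts: {rowCount}")
demo.close()
```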

### Foreign Keys (FK)

A **foreign key** is a column that references the primary key of another table. This creates relationships between tables.

In [None]:
# In team_game_stats, team_id is a FOREIGN KEY that references teams.team_id
print("Foreign Key Relationships:")
print("\n1. team_game_stats.team_id → teams.team_id")

# Verify: Do all team_ids in team_game_stats exist in teams?
orphanedTeams = cursor.execute("""
    SELECT COUNT(DISTINCT team_id) FROM team_game_stats
    WHERE team_id NOT IN (SELECT team_id FROM teams);
""").fetchone()[0]

print(f"   Orphaned team_ids (not in teams table): {orphanedTeams}")

# Check the second relationship
print("\n2. player_season_stats.player_id → players.player_id")
orphanedPlayers = cursor.execute("""
    SELECT COUNT(DISTINCT player_id) FROM player_season_stats
    WHERE player_id NOT IN (SELECT player_id FROM players);
""").fetchone()[0]

print(f"   Orphaned player_ids (not in players table): {orphanedPlayers}")

# Check the third relationship
print("\n3. player_season_stats.team_id → teams.team_id")
orphanedTeamsInPlayerStats = cursor.execute("""
    SELECT COUNT(DISTINCT team_id) FROM player_season_stats
    WHERE team_id NOT IN (SELECT team_id FROM teams);
""").fetchone()[0]

print(f"   Orphaned team_ids (not in teams table): {orphanedTeamsInPlayerStats}")
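
The queries above look for orphans after the data is loaded. SQLite can also block orphans as they are written, but only when a foreign key is declared and enforcement is switched on for the connection with `PRAGMA foreign_keys = ON` (it is off by default). A minimal sketch, using hypothetical `teams_demo` and `games_demo` tables in an in-memory database:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("PRAGMA foreign_keys = ON;")  # enforcement is off unless enabled per connection
demo.execute("CREATE TABLE teams_demo (team_id INTEGER PRIMARY KEY, full_name TEXT);")
demo.execute("""
    CREATE TABLE games_demo (
        game_id INTEGER,
        team_id INTEGER REFERENCES teams_demo(team_id),
        pts     INTEGER
    );
""")
demo.execute("INSERT INTO teams_demo VALUES (1, 'Team A');")
demo.execute("INSERT INTO games_demo VALUES (100, 1, 110);")  # parent row exists: accepted

# team_id 99 has no matching row in teams_demo, so this insert is refused
try:
    demo.execute("INSERT INTO games_demo VALUES (101, 99, 95);")
    orphanRejected = False
except sqlite3.IntegrityError as err:
    orphanRejected = True
    print(f"Rejected: {err}")

gameCount = demo.execute("SELECT COUNT(*) FROM games_demo;").fetchone()[0]
print(f"Rows in games_demo: {gameCount}")
demo.close()
```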

### Relationship Example: Using Foreign Keys to Join Tables

Let's find all games played by the Lakers using the foreign key relationship:

In [None]:
# Find all games for the Lakers by joining on team_id
lakeQuery = """
    SELECT 
        t.full_name,
        g.game_date,
        g.matchup,
        g.wl,
        g.pts
    FROM team_game_stats g
    INNER JOIN teams t ON g.team_id = t.team_id
    WHERE t.full_name = 'Los Angeles Lakers'
    ORDER BY g.game_date
    LIMIT 10;
"""

lakeGames = pd.read_sql_query(lakeQuery, connection)
print(f"First 10 Lakers games:")
print(lakeGames.to_string())

### Try This

**Task:** Write a query that finds all the stats for a specific player across all seasons using the foreign key relationship. Pick any player from the `players` table, then join to `player_season_stats` on `player_id`. What columns would you need in the result?

*(Write your query and answer below)*

In [None]:
# Your code here


---

## Sub-Lesson 06c — Normalization & ER Diagrams

### What is Normalization?

**Normalization** is the process of organizing a database into tables and columns to:
1. **Minimize data redundancy** — Don't repeat the same data
2. **Prevent data anomalies** — Avoid update, insertion, and deletion problems
3. **Maintain data integrity** — Keep relationships consistent

There are several "normal forms" (1NF, 2NF, 3NF, BCNF, 4NF, 5NF). We focus on the first three.

### First Normal Form (1NF)

**Rule:** All columns contain atomic (indivisible) values. No repeating groups or arrays in a single cell.

**Bad (not 1NF):**
```
player_id | player_name        | team_ids
1         | LeBron James       | 1610612737, 1610612752
```

**Good (1NF):**
```
player_id | player_name
1         | LeBron James

player_id | team_id      | season
1         | 1610612737   | 2019
1         | 1610612752   | 2020
```

**NBA Database:** ✓ All values are atomic.
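
As an illustration of the 1NF rule, the sketch below takes rows where one field packs several team IDs into a single string and splits them into one row per (player, team) pair. The rows, IDs, and `_demo` table names are made up.

```python
import sqlite3

# Hypothetical non-1NF rows: the third field holds a comma-separated list of team IDs
rawRows = [
    (1, "Player A", "10, 20"),
    (2, "Player B", "30"),
]

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE players_demo (player_id INTEGER PRIMARY KEY, player_name TEXT);")
demo.execute("CREATE TABLE player_teams_demo (player_id INTEGER, team_id INTEGER);")

for playerId, playerName, teamIds in rawRows:
    demo.execute("INSERT INTO players_demo VALUES (?, ?);", (playerId, playerName))
    # Split the packed cell so each team_id lands in its own row
    for teamId in teamIds.split(","):
        demo.execute(
            "INSERT INTO player_teams_demo VALUES (?, ?);",
            (playerId, int(teamId.strip())),
        )

playerTeams = demo.execute(
    "SELECT player_id, team_id FROM player_teams_demo ORDER BY player_id, team_id;"
).fetchall()
print(playerTeams)
demo.close()
```

Once every cell holds one value, a question like "who played for team 20?" becomes a plain `WHERE team_id = 20` instead of a search inside a string.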

### Second Normal Form (2NF)

**Rule:** Must be in 1NF, AND all non-key columns must depend on the ENTIRE primary key (not just part of it).

**Bad (not 2NF):**
```
season | player_id | team_id | pts | team_city
2019   | 1         | 12345   | 23  | Los Angeles
```
Here, `team_city` depends only on `team_id`, which is just one part of the key (season, player_id, team_id). This is called a partial dependency.

**Good (2NF):**
- `player_season_stats` has key (season, player_id, team_id) → stores stats that depend on ALL three
- `teams` has key (team_id) → stores team info that depends on team_id

**NBA Database:** ✓ Each table's non-key columns depend on the full primary key.
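
A partial dependency often shows up in the data itself: if a column never varies within one value of a single key column, it probably belongs in that column's own table. The sketch below tests this on made-up rows. `distinctValuesPerGroup` is a hypothetical helper, and data can only hint at a dependency or rule it out, not prove it.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
# Hypothetical table with key (season, player_id, team_id); all values are made up
demo.execute("""
    CREATE TABLE stats_demo (
        season INTEGER, player_id INTEGER, team_id INTEGER,
        pts INTEGER, team_city TEXT,
        PRIMARY KEY (season, player_id, team_id)
    );
""")
demo.executemany("INSERT INTO stats_demo VALUES (?, ?, ?, ?, ?);", [
    (2019, 1, 10, 23, "City X"),
    (2019, 2, 10, 11, "City X"),
    (2020, 1, 10, 25, "City X"),
    (2020, 1, 20, 8, "City Y"),
])

def distinctValuesPerGroup(conn, table, groupCol, valueCol):
    """Return the most distinct valueCol values found for any single groupCol value."""
    # Identifiers are interpolated into the SQL, so pass only trusted names
    query = f"""
        SELECT MAX(n) FROM (
            SELECT COUNT(DISTINCT {valueCol}) AS n FROM {table} GROUP BY {groupCol}
        );
    """
    return conn.execute(query).fetchone()[0]

cityPerTeam = distinctValuesPerGroup(demo, "stats_demo", "team_id", "team_city")
ptsPerTeam = distinctValuesPerGroup(demo, "stats_demo", "team_id", "pts")
print(f"Most distinct team_city values for one team_id: {cityPerTeam}")
print(f"Most distinct pts values for one team_id: {ptsPerTeam}")
demo.close()
```

A result of 1 for `team_city` is consistent with it depending on `team_id` alone, while `pts` varies within a team and so needs the rest of the key.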

### Third Normal Form (3NF)

**Rule:** Must be in 2NF, AND no non-key column depends on another non-key column (no transitive dependencies).

**Bad (not 3NF):**
```
team_id | team_name | city | state | country
12345   | Lakers    | Los Angeles | CA | USA
```
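Here, `country` depends on `state`, which is a non-key column, so `country` depends on `team_id` only indirectly (team_id → state → country). The same state-to-country fact is repeated for every team in that state, so correcting it means updating multiple rows.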

**Good (3NF):**
- `teams` has (team_id, full_name, abbreviation, nickname, city, state, year_founded) — each depends on team_id
- We don't store derived or dependent non-key values

**NBA Database:** ✓ No transitive dependencies.
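
The usual fix for a transitive dependency is to move the dependent column into its own lookup table. The sketch below does this for the `country` example on made-up rows and then joins the pieces back together to confirm nothing was lost. The `teams_wide`, `teams_3nf`, and `states_demo` tables are hypothetical.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
# Hypothetical teams table that violates 3NF: country is determined by state
demo.execute(
    "CREATE TABLE teams_wide (team_id INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);"
)
demo.executemany("INSERT INTO teams_wide VALUES (?, ?, ?, ?);", [
    (1, "Los Angeles", "CA", "USA"),
    (2, "San Francisco", "CA", "USA"),
    (3, "Toronto", "ON", "Canada"),
])

# Decompose: each state's country is stored once in a lookup table keyed by state
demo.execute("CREATE TABLE states_demo (state TEXT PRIMARY KEY, country TEXT);")
demo.execute("INSERT INTO states_demo SELECT DISTINCT state, country FROM teams_wide;")
demo.execute("CREATE TABLE teams_3nf AS SELECT team_id, city, state FROM teams_wide;")

# Joining the two tables reproduces the original rows
rebuilt = demo.execute("""
    SELECT t.team_id, t.city, t.state, s.country
    FROM teams_3nf t
    INNER JOIN states_demo s ON t.state = s.state
    ORDER BY t.team_id;
""").fetchall()
original = demo.execute("SELECT * FROM teams_wide ORDER BY team_id;").fetchall()
stateRows = demo.execute("SELECT COUNT(*) FROM states_demo;").fetchone()[0]

print(f"Join reproduces the original table: {rebuilt == original}")
print(f"Rows in states_demo: {stateRows}")
demo.close()
```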

### ER Diagram: NBA Database

```
┌──────────────────────────────────┐    ┌─────────────────────────────────────┐
│ teams                            │    │ players                             │
├──────────────────────────────────┤    ├─────────────────────────────────────┤
│ PK: team_id                      │    │ PK: player_id                       │
│ full_name                        │    │ full_name                           │
│ abbreviation                     │    └─────────────────────────────────────┘
│ nickname                         │                        ↑ 1
│ city                             │                        │
│ state                            │                        │
│ year_founded                     │                        │
└──────────────────────────────────┘                        │
        ↑ 1                 ↑ 1                             │
        │                   └───────────────────┐           │
        │ N                                     │ N         │ N
┌───────┴──────────────────────────┐    ┌───────┴───────────┴─────────────────┐
│ team_game_stats                  │    │ player_season_stats                 │
├──────────────────────────────────┤    ├─────────────────────────────────────┤
│ FK: team_id → teams.team_id      │    │ FK: player_id → players.player_id   │
│ season                           │    │ FK: team_id → teams.team_id         │
│ game_id                          │    │ season                              │
│ game_date                        │    │ gp, min, pts, reb, ast (stats)      │
│ matchup                          │    │ fg_pct, fg3_pct, ft_pct             │
│ wl                               │    └─────────────────────────────────────┘
│ pts, fgm, fga, ... (stats)       │
└──────────────────────────────────┘
```

**Key:** 
- `1` and `N` label the two ends of a one-to-many (1:N) relationship, for example "one team has many games"
- `→` indicates a foreign key relationship
- PK = Primary Key
- FK = Foreign Key
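
One way the player side of the diagram could be written as SQLite DDL is sketched below. The column types, the composite primary key, the trimmed column lists, and the `_demo` table names are assumptions made for illustration; the actual schema in nba_5seasons.db may be declared differently, and running `PRAGMA foreign_key_list` against the real tables would show what it declares.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE teams_demo (
        team_id   INTEGER PRIMARY KEY,
        full_name TEXT,
        city      TEXT
    );
    CREATE TABLE players_demo (
        player_id INTEGER PRIMARY KEY,
        full_name TEXT
    );
    CREATE TABLE player_season_stats_demo (
        season    TEXT,
        player_id INTEGER REFERENCES players_demo(player_id),
        team_id   INTEGER REFERENCES teams_demo(team_id),
        pts       REAL,
        PRIMARY KEY (season, player_id, team_id)
    );
""")

# Each row of PRAGMA foreign_key_list describes one declared foreign key:
# index 2 is the parent table, index 3 the child column, index 4 the parent column
fkRows = demo.execute("PRAGMA foreign_key_list(player_season_stats_demo);").fetchall()
declaredFks = sorted((row[3], row[2], row[4]) for row in fkRows)
for childCol, parentTable, parentCol in declaredFks:
    print(f"player_season_stats_demo.{childCol} → {parentTable}.{parentCol}")
demo.close()
```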

### Verifying Normalization in the NBA Database

In [None]:
# Summarize how the NBA database meets 1NF, 2NF, and 3NF
# (a checklist based on inspecting the schema, not a computed test)

print("NBA Database Normalization Check")
print("=" * 50)

print("\n1NF Check: All values are atomic (not arrays/lists)?")
print("   ✓ Every cell contains a single value")
print("   ✓ No repeating groups within cells")
print("   ✓ teams, players, team_game_stats, player_season_stats all follow 1NF")

print("\n2NF Check: Non-key columns depend on the FULL primary key?")
print("   ✓ teams: all non-key columns depend on team_id")
print("   ✓ players: all non-key columns depend on player_id")
print("   ✓ team_game_stats: pts, fgm, fga, etc. depend on (season, game_id, team_id)")
print("   ✓ player_season_stats: pts, reb, ast, etc. depend on (season, player_id, team_id)")

print("\n3NF Check: No non-key columns depend on other non-key columns?")
print("   ✓ teams: city, state, year_founded all independent")
print("   ✓ players: only stores player_id and full_name (no derived data)")
print("   ✓ game stats and season stats are primary measurements, not derived")

print("\n" + "=" * 50)
print("Result: NBA database is in 3NF ✓")
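
Parts of this checklist can be tested with queries rather than stated. For example, a composite key is only valid if no combination of its columns repeats. The sketch below runs that check on made-up rows; `findDuplicateKeys` is a hypothetical helper, and the key (season, game_id, team_id) is taken from the checklist above.

```python
import sqlite3

def findDuplicateKeys(conn, table, keyCols):
    """Return each key combination that occurs more than once, with its row count."""
    # Identifiers are interpolated into the SQL, so pass only trusted names
    cols = ", ".join(keyCols)
    query = f"SELECT {cols}, COUNT(*) FROM {table} GROUP BY {cols} HAVING COUNT(*) > 1;"
    return conn.execute(query).fetchall()

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE game_stats_demo (season TEXT, game_id INTEGER, team_id INTEGER, pts INTEGER);"
)
demo.executemany("INSERT INTO game_stats_demo VALUES (?, ?, ?, ?);", [
    ("2022-23", 1, 10, 110),
    ("2022-23", 1, 20, 104),  # same game, other team: still a distinct key
    ("2022-23", 2, 10, 99),
])

fullKeyDuplicates = findDuplicateKeys(demo, "game_stats_demo", ["season", "game_id", "team_id"])
partialKeyDuplicates = findDuplicateKeys(demo, "game_stats_demo", ["season", "game_id"])
print(f"Duplicates on the full key: {fullKeyDuplicates}")
print(f"Duplicates on (season, game_id) alone: {partialKeyDuplicates}")
demo.close()
```

The full key has no repeats, while (season, game_id) alone does, because each game has a row for both teams. Against the real database, the same helper could be called with `connection` and `"team_game_stats"`, assuming those column names match the actual table.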

### Try This

**Conceptual Questions:**

1. **Redundancy Problem:** In a denormalized version of the database, if we had to move the Lakers from Los Angeles to another city, what would we need to do and why would it be risky?

2. **Normalization Benefit:** Look at our four-table design. How does having a separate `players` table prevent the insertion anomaly (a problem where we cannot record one fact, such as a new player's name, until unrelated data, such as a season of stats, also exists)?

3. **Foreign Key Integrity:** What would happen if we deleted a team from the `teams` table but that team's records still existed in `team_game_stats`? (This is called a referential integrity violation.)

*(Write your answers below)*

---

## Summary

- **Multiple tables** reduce redundancy and prevent anomalies
- **Primary keys** uniquely identify each row
- **Foreign keys** create relationships between tables
- **Normalization** (1NF, 2NF, 3NF) is a systematic way to design databases
- The **NBA database** demonstrates all these principles in a real-world context

Understanding these concepts is fundamental to becoming a database designer and developer.