# dbApps06a Task: Analyzing a Denormalized Dataset

## Learning Objectives
- Understand the problems caused by data redundancy
- Identify update, insert, and delete anomalies
- Recognize how normalization solves these issues
- Analyze real-world examples from the NBA database

---

## Context
The NBA database contains multiple tables:
- **teams**: One row per team (30 unique teams)
- **team_game_stats**: One row per game per team (heavily denormalized with team_id repeated)
- **players**: Player information
- **player_season_stats**: Player stats across seasons

We'll examine what happens when data is denormalized and learn why normalization matters.

In [None]:
# Import required libraries
import pandas as pd
import sqlite3

# Connect to the NBA database
dbPath = '/sessions/sweet-lucid-archimedes/mnt/databaseApplicationsForGitHub/dbApps06/nba_5seasons.db'
connection = sqlite3.connect(dbPath)
cursor = connection.cursor()

print('Database connection established successfully!')
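
Before querying, it can help to confirm which tables and columns are actually present. The following is a minimal sketch that uses an in-memory stand-in database rather than the course file; the two-table schema and its column names are simplified assumptions, not the real NBA schema. The same two statements (`sqlite_master` and `PRAGMA table_info`) can be run against `connection`.

```python
import sqlite3

# In-memory stand-in so the sketch runs without the course database file.
# The columns below are simplified assumptions, not the real schema.
demoConnection = sqlite3.connect(':memory:')
demoConnection.executescript("""
    CREATE TABLE teams (team_id INTEGER PRIMARY KEY, team_name TEXT);
    CREATE TABLE team_game_stats (game_id INTEGER, team_id INTEGER, pts INTEGER);
""")

# sqlite_master is SQLite's built-in catalog of schema objects.
tableRows = demoConnection.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
tableNames = [row[0] for row in tableRows]
print(tableNames)

# PRAGMA table_info returns one row per column; index 1 holds the column name.
columnNames = [
    row[1] for row in demoConnection.execute("PRAGMA table_info(team_game_stats)")
]
print(columnNames)

demoConnection.close()
```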

---

## TASK 1: Observe Redundancy in team_game_stats

**Task**: Run a query to display the first 5 rows of team_game_stats. Look at the columns and notice which values repeat.

**Hint**: Use `SELECT * FROM team_game_stats LIMIT 5`

In [None]:
# TASK 1: Your code here
# Display first 5 rows of team_game_stats



### Analysis

After viewing the results above, answer:
- Which column do you see repeating multiple times in the results?
- Why might storing this data in every single row be problematic?

*Write your answer below:*

**YOUR ANSWER HERE**


---

## TASK 2: Count Redundancy by Team

**Task**: Write a query that counts how many rows exist for each team_id in team_game_stats using GROUP BY and COUNT.

**Goal**: See how many times each team_id appears (showing massive redundancy).

**Hint**: `SELECT team_id, COUNT(*) AS gameCount FROM team_game_stats GROUP BY team_id ORDER BY gameCount DESC`
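
To see how the GROUP BY and COUNT pattern behaves before running it on the real table, here is a small sketch on an in-memory toy table. The six rows and the three-column layout are made up for illustration; only the query shape matches the hint.

```python
import sqlite3

# Toy stand-in for team_game_stats (hypothetical rows, simplified columns).
demoConnection = sqlite3.connect(':memory:')
demoConnection.execute(
    "CREATE TABLE team_game_stats (game_id INTEGER, team_id INTEGER, pts INTEGER)"
)
demoRows = [
    (1, 10, 101), (1, 20, 99),
    (2, 10, 110), (2, 30, 95),
    (3, 10, 120), (3, 20, 118),
]
demoConnection.executemany("INSERT INTO team_game_stats VALUES (?, ?, ?)", demoRows)

# One output row per team_id, with the number of game rows it appears in.
# team_id is added as a tie-breaker so the ordering is deterministic.
gameCounts = demoConnection.execute("""
    SELECT team_id, COUNT(*) AS gameCount
    FROM team_game_stats
    GROUP BY team_id
    ORDER BY gameCount DESC, team_id
""").fetchall()
print(gameCounts)

demoConnection.close()
```

In the toy data, team 10 appears in three rows, team 20 in two, and team 30 in one, so each `gameCount` is the number of times that `team_id` is repeated.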

In [None]:
# TASK 2: Your code here
# Count how many rows each team_id has



### Analysis

What do these numbers tell you about data redundancy? If every game row stored the team name, city, and state along with team_id, how many copies of each team's information would exist?

*Write your answer below:*

**YOUR ANSWER HERE**


---

## TASK 3: The Update Anomaly Problem

**Scenario**: Imagine that instead of storing just team_id in team_game_stats, we stored the full team name in every game row.

**Question**: If a team changed its name (for example, after moving to a new city or adopting a new nickname), what would happen? How many rows would need to be updated?

*Write your answer below. Include:*
- *Which anomaly type this represents (update, insert, or delete)*
- *Specific numbers from TASK 2 to illustrate the problem*
- *Why this is dangerous*

**YOUR ANSWER HERE**


---

## TASK 4: Calculate Total Redundant Data

**Task**: Write a query that reports:
1. The total number of rows in team_game_stats
2. The number of unique teams in the teams table
3. How many EXTRA copies of team information would exist if we stored it in every game row

**Formula**: Redundant copies = (total game rows) - (number of teams, since each team needs only 1 copy)

**Example**: If 10,842 game rows exist and 30 teams exist:
- We need only 1 copy of each team's info = 30 rows
- Instead we'd have team info in all 10,842 rows
- Redundancy = 10,842 - 30 = 10,812 extra copies
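
The arithmetic in the example can be checked directly, and the same subtraction can be done inside a single SQL statement with scalar subqueries. The sketch below uses an in-memory toy database with made-up rows (2 teams, 6 game rows), so its counts are not the real NBA figures.

```python
import sqlite3

# Arithmetic from the worked example: 10,842 game rows and 30 teams.
totalGameRows = 10842
uniqueTeams = 30
redundantCopies = totalGameRows - uniqueTeams
print(redundantCopies)

# The same calculation in SQL, on a toy database (hypothetical data).
demoConnection = sqlite3.connect(':memory:')
demoConnection.executescript("""
    CREATE TABLE teams (team_id INTEGER PRIMARY KEY, team_name TEXT);
    CREATE TABLE team_game_stats (game_id INTEGER, team_id INTEGER);
    INSERT INTO teams VALUES (10, 'Demo Team A'), (20, 'Demo Team B');
    INSERT INTO team_game_stats VALUES
        (1, 10), (1, 20), (2, 10), (2, 20), (3, 10), (3, 20);
""")

# Each scalar subquery returns one number; the third column subtracts them.
demoTotals = demoConnection.execute("""
    SELECT
        (SELECT COUNT(*) FROM team_game_stats) AS totalGameRows,
        (SELECT COUNT(*) FROM teams) AS uniqueTeams,
        (SELECT COUNT(*) FROM team_game_stats)
            - (SELECT COUNT(*) FROM teams) AS redundantCopies
""").fetchone()
print(demoTotals)

demoConnection.close()
```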

In [None]:
# TASK 4: Your code here
# Calculate redundancy metrics



---

## TASK 5: Define the Three Anomaly Types

**Task**: In the markdown cell below, write a clear definition for each of the three types of data anomalies.

**Instructions**:
- **Update Anomaly**: Define what happens and give a brief example
- **Insert Anomaly**: Define what happens and give a brief example
- **Delete Anomaly**: Define what happens and give a brief example

*Use your own words. These are core database concepts.*

### Update Anomaly
**Definition**: 

**Example**: 

---

### Insert Anomaly
**Definition**: 

**Example**: 

---

### Delete Anomaly
**Definition**: 

**Example**: 


---

## TASK 6: Real-World NBA Examples of Anomalies

**Scenario**: Imagine a denormalized table that combines team_game_stats with team information:

```
team_game_stats_denormalized:
season | game_id | team_id | team_name | city | state | pts | fgm | fga | ...
```

**Task**: For each anomaly type, write a specific example using the NBA database:

1. **UPDATE ANOMALY**: Describe a change that would require multiple row updates
2. **INSERT ANOMALY**: Describe data we couldn't insert without also supplying unrelated or placeholder data
3. **DELETE ANOMALY**: Describe data we might accidentally lose when deleting

*Make sure your examples use team, game, or player data from the NBA context.*
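
To make the denormalized layout concrete, the sketch below builds a tiny in-memory version of `team_game_stats_denormalized` and applies a change to only some of a team's rows. The team name, cities, seasons, and points are all hypothetical, and only a subset of the columns is included. It is meant as an illustration of how inconsistency can arise, not as a model answer.

```python
import sqlite3

demoConnection = sqlite3.connect(':memory:')
demoConnection.execute("""
    CREATE TABLE team_game_stats_denormalized (
        season TEXT, game_id INTEGER, team_id INTEGER,
        team_name TEXT, city TEXT, pts INTEGER
    )
""")
# Hypothetical rows: the same team details are copied into every game row.
demoRows = [
    ('2022-23', 1, 10, 'Demo Team', 'Old City', 101),
    ('2022-23', 2, 10, 'Demo Team', 'Old City', 110),
    ('2023-24', 3, 10, 'Demo Team', 'Old City', 97),
]
demoConnection.executemany(
    "INSERT INTO team_game_stats_denormalized VALUES (?, ?, ?, ?, ?, ?)", demoRows
)

# A city change is applied to one season only, as an incomplete script might do.
updateCursor = demoConnection.execute("""
    UPDATE team_game_stats_denormalized
    SET city = 'New City'
    WHERE team_id = 10 AND season = '2023-24'
""")
rowsTouched = updateCursor.rowcount

# The same team_id now maps to more than one city value.
cityVersions = demoConnection.execute("""
    SELECT COUNT(DISTINCT city)
    FROM team_game_stats_denormalized
    WHERE team_id = 10
""").fetchone()[0]
print(rowsTouched, cityVersions)

demoConnection.close()
```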

### Update Anomaly Example
**Scenario**: 

**Problem**: 

---

### Insert Anomaly Example
**Scenario**: 

**Problem**: 

---

### Delete Anomaly Example
**Scenario**: 

**Problem**: 


---

## TASK 7: Why Normalization Solves These Problems

**Task**: Explain how splitting team information into a separate **teams** table solves each anomaly type.

**Instructions**: For each anomaly, explain:
- How the problem is eliminated by normalization
- How you would perform the operation in the normalized design (teams separate from team_game_stats)

*Think about how you would now handle updates, inserts, and deletes with separate tables.*
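
For comparison, here is a minimal sketch of the same three operations in a two-table design. It uses an in-memory database with hypothetical teams and a reduced set of columns; the real **teams** and **team_game_stats** tables may have different columns and constraints.

```python
import sqlite3

demoConnection = sqlite3.connect(':memory:')
demoConnection.executescript("""
    CREATE TABLE teams (team_id INTEGER PRIMARY KEY, team_name TEXT, city TEXT);
    CREATE TABLE team_game_stats (
        game_id INTEGER,
        team_id INTEGER REFERENCES teams(team_id),
        pts INTEGER
    );
    INSERT INTO teams VALUES (10, 'Demo Team', 'Old City');
    INSERT INTO team_game_stats VALUES (1, 10, 101), (2, 10, 110), (3, 10, 97);
""")

# Update: the city is stored in exactly one row, so one row changes.
updateCursor = demoConnection.execute(
    "UPDATE teams SET city = 'New City' WHERE team_id = 10"
)
rowsTouched = updateCursor.rowcount

# Every game row sees the new value through the join.
joinedCities = demoConnection.execute("""
    SELECT DISTINCT t.city
    FROM team_game_stats AS g
    JOIN teams AS t ON t.team_id = g.team_id
""").fetchall()

# Insert: a team can be recorded before it has played any games.
demoConnection.execute("INSERT INTO teams VALUES (20, 'Expansion Team', 'Another City')")

# Delete: removing a team's game rows leaves its row in teams untouched.
demoConnection.execute("DELETE FROM team_game_stats WHERE team_id = 10")
teamCount = demoConnection.execute("SELECT COUNT(*) FROM teams").fetchone()[0]
print(rowsTouched, joinedCities, teamCount)

demoConnection.close()
```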

### How Normalization Prevents Update Anomalies
**Explanation**: 

---

### How Normalization Prevents Insert Anomalies
**Explanation**: 

---

### How Normalization Prevents Delete Anomalies
**Explanation**: 


---

## TASK 8: Verify the teams Table Has NO Redundancy

**Task**: Write a query that proves the teams table is normalized (no redundancy).

**Approach**:
1. Count total rows in teams
2. Count DISTINCT team_id values in teams
3. Verify these two numbers are equal (proving each team appears exactly once)

**Hint**: `SELECT COUNT(*) AS totalRows, COUNT(DISTINCT team_id) AS uniqueTeams FROM teams`
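
The sketch below runs the hinted query on an in-memory toy `teams` table, then adds a duplicate row to show what a failed check would look like. The table deliberately has no primary key so that the duplicate can be inserted; the team names are hypothetical.

```python
import sqlite3

demoConnection = sqlite3.connect(':memory:')
demoConnection.executescript("""
    CREATE TABLE teams (team_id INTEGER, team_name TEXT);
    INSERT INTO teams VALUES
        (10, 'Demo Team A'), (20, 'Demo Team B'), (30, 'Demo Team C');
""")

countQuery = (
    "SELECT COUNT(*) AS totalRows, COUNT(DISTINCT team_id) AS uniqueTeams FROM teams"
)

# Clean table: both counts match, so each team_id appears exactly once.
cleanCounts = demoConnection.execute(countQuery).fetchone()

# Add a duplicate row so the two counts diverge.
demoConnection.execute("INSERT INTO teams VALUES (10, 'Demo Team A')")
dirtyCounts = demoConnection.execute(countQuery).fetchone()

# Listing the offending team_id values pinpoints the duplicates.
duplicateIds = demoConnection.execute("""
    SELECT team_id, COUNT(*) AS copies
    FROM teams
    GROUP BY team_id
    HAVING COUNT(*) > 1
""").fetchall()
print(cleanCounts, dirtyCounts, duplicateIds)

demoConnection.close()
```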

In [None]:
# TASK 8: Your code here
# Verify that teams table has no redundancy



### Analysis

What do the results prove about the teams table? Why is this the correct design?

*Write your answer below:*

**YOUR ANSWER HERE**


---

## Summary

In this task, you explored:
1. How redundancy appears in denormalized data (team_id repeating thousands of times)
2. The three types of anomalies that redundancy causes
3. Real-world problems that normalization prevents
4. Why splitting tables (normalization) is the solution

**Key Takeaway**: Normalization is not just an abstract concept: it solves real, measurable problems in data storage, consistency, and maintainability.