# Pandas NFL Data Analysis Exercises

Welcome to the final session of your Pandas training! This notebook is designed to help you recap and deepen your knowledge of fundamental Pandas operations through a series of hands-on exercises, all centered around NFL (National Football League) data. Get ready to tackle some data!

The exercises are structured to gradually increase in complexity. Don't worry if you get stuck; hints and solutions are available (hidden, so try to solve each task first!).

Let's start by importing `pandas` and setting up our environment, including generating our synthetic data.

In [None]:
import pandas as pd
import numpy as np

# Set display options for better viewing
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
print("Pandas and necessary libraries imported. Display options set.")

# --- Generate synthetic Player Data ----
np.random.seed(42) # Seed for player data generation
num_players = 100
teams = ['Patriots', 'Chiefs', 'Eagles', '49ers', 'Cowboys', 'Packers', 'Bills', 'Dolphins', 'Saints', 'Steelers']
positions = ['QB', 'RB', 'WR', 'TE', 'OL', 'DL', 'LB', 'CB', 'S', 'K', 'P']
colleges = ['Alabama', 'Ohio State', 'Georgia', 'LSU', 'Clemson', 'Notre Dame', 'Michigan', 'USC', 'Texas A&M', 'Florida']

player_data = {
    'player_id': [f'P{i:03d}' for i in range(1, num_players + 1)],
    'name': [f'Player {i}' for i in range(1, num_players + 1)],
    'team': np.random.choice(teams, num_players),
    'position': np.random.choice(positions, num_players, p=[0.1, 0.1, 0.15, 0.08, 0.15, 0.1, 0.1, 0.07, 0.07, 0.04, 0.04]),
    'draft_year': np.random.randint(2015, 2025, num_players),
    'college': np.random.choice(colleges, num_players)
}
players_df = pd.DataFrame(player_data)
print("players_df created.")

# --- Generate synthetic Game Data ---
np.random.seed(43) # Seed for game data generation
num_games = 200
game_data = {
    'game_id': [f'G{i:04d}' for i in range(1, num_games + 1)],
    'season': np.random.choice([2022, 2023, 2024], num_games),
    'week': np.random.randint(1, 19, num_games),
    'home_team': np.random.choice(teams, num_games),
    'away_team': np.random.choice(teams, num_games),
    'home_score': np.random.randint(10, 45, num_games),
    'away_score': np.random.randint(7, 40, num_games),
    'date': pd.to_datetime(pd.date_range(start='2022-09-01', periods=num_games, freq='W')) # Simplified dates
}
games_df = pd.DataFrame(game_data)

# Ensure home_team != away_team
for i in range(num_games):
    while games_df.loc[i, 'home_team'] == games_df.loc[i, 'away_team']:
        games_df.loc[i, 'away_team'] = np.random.choice(teams)
print("games_df created.")

# --- Generate synthetic Player Stats Data ---
np.random.seed(44) # Seed for player stats data generation
stat_rows = []
for _ in range(num_games * 5): # Roughly 5 players per game with stats
    game_id = np.random.choice(games_df['game_id'])
    player_id = np.random.choice(players_df['player_id'])

    # Only QBs get passing yards and interceptions
    if players_df[players_df['player_id'] == player_id]['position'].iloc[0] == 'QB':
        passing_yards = np.random.randint(50, 400)
        interceptions = np.random.randint(0, 3)
        sacks_allowed = np.random.randint(0, 5) # Not stored below; the draw is kept so the seeded random stream stays the same
    else:
        passing_yards = 0
        interceptions = 0
        sacks_allowed = 0

    # Only RBs get rushing yards
    if players_df[players_df['player_id'] == player_id]['position'].iloc[0] == 'RB':
        rushing_yards = np.random.randint(0, 200)
    else:
        rushing_yards = 0

    # WRs, TEs, and RBs get receiving yards
    if players_df[players_df['player_id'] == player_id]['position'].iloc[0] in ['WR', 'TE', 'RB']:
        receiving_yards = np.random.randint(0, 150)
    else:
        receiving_yards = 0

    # Sacks made (for DL/LB)
    if players_df[players_df['player_id'] == player_id]['position'].iloc[0] in ['DL', 'LB']:
        sacks_made = np.random.randint(0, 3)
    else:
        sacks_made = 0

    touchdowns = np.random.randint(0, 4)

    stat_rows.append({
        'game_id': game_id,
        'player_id': player_id,
        'passing_yards': passing_yards,
        'rushing_yards': rushing_yards,
        'receiving_yards': receiving_yards,
        'sacks_made': sacks_made, # Sacks by defensive players
        'interceptions_thrown': interceptions, # Interceptions by QB
        'touchdowns': touchdowns
    })
player_stats_df = pd.DataFrame(stat_rows)
print("player_stats_df created.")

## Exercises

### Exercise 1: Series and DataFrame Creation

#### Task 1: Create a Pandas Series named `favorite_teams` containing your top 3 NFL teams (as strings).

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `pd.Series()`.
</details>

##### Solution

In [None]:
favorite_teams = pd.Series(['Kansas City Chiefs', 'Green Bay Packers', 'Philadelphia Eagles'], name='Favorite NFL Teams')
print("\nFavorite Teams Series:\n", favorite_teams)

#### Task 2: Create a DataFrame named `example_data_df` with three columns: `Fruit`, `Quantity`, and `Price`. Populate it with at least 4 rows of sample data (e.g., 'Apple', 10, 1.20).

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `pd.DataFrame()` and pass a dictionary where keys are column names and values are lists of data.
</details>

##### Solution

In [None]:
example_data_df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Orange', 'Grape'],
    'Quantity': [10, 15, 8, 20],
    'Price': [1.20, 0.75, 1.50, 2.00]
})
print("\nExample Data DataFrame:\n", example_data_df)

### Exercise 2: Initial Exploration

Using the already created `players_df`:

#### Task 1: Display the first 5 rows of `players_df`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.head()` method.
</details>

##### Solution

In [None]:
print("\nFirst 5 rows of players_df:\n", players_df.head())

#### Task 2: Get a concise summary of the DataFrame, including data types and non-null values.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.info()` method.
</details>

##### Solution

In [None]:
print("\nInfo about players_df:")
players_df.info()

#### Task 3: Display the dimensions (number of rows and columns) of `players_df`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.shape` attribute.
</details>

##### Solution

In [None]:
print("\nShape of players_df:", players_df.shape)

#### Task 4: Show descriptive statistics for numerical columns.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.describe()` method.
</details>

##### Solution

In [None]:
print("\nDescriptive statistics for players_df:\n", players_df.describe())
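
By default `.describe()` summarizes only numeric columns, so for a frame that is mostly text it reports on very few columns. The sketch below, on a made-up toy frame (names and values are assumptions), shows how `include='all'` adds count, unique, top, and freq rows for the text columns.

```python
import pandas as pd

# Toy frame with two text columns and one numeric column (made-up values).
toy_players = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy', 'Dee'],
    'team': ['Chiefs', 'Chiefs', 'Eagles', 'Bills'],
    'draft_year': [2016, 2019, 2019, 2022],
})

numeric_summary = toy_players.describe()             # numeric columns only
full_summary = toy_players.describe(include='all')   # text columns as well

print(numeric_summary)
print(full_summary)
```

Cells that do not apply to a column (for example `mean` for `team`) are shown as `NaN` in the combined summary.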

#### Task 5: Display all column names.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.columns` attribute, optionally converting to a list.
</details>

##### Solution

In [None]:
print("\nColumns in players_df:", players_df.columns.tolist())

#### Task 6: Check for any missing values in each column and show the count of missing values.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `.isnull().sum()`.
</details>

##### Solution

In [None]:
print("\nMissing values in players_df:\n", players_df.isnull().sum())

### Exercise 3: Indexing and Selection Basics

#### Task 1: Select the 'name' and 'position' columns from `players_df` using bracket notation `[]` and display the first 5 rows.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

For multiple columns, pass a list of column names to `[]`.
</details>

##### Solution

In [None]:
name_position = players_df[['name', 'position']]
print("\nName and Position columns (first 5 rows):\n", name_position.head())

#### Task 2: Select the player with `player_id` 'P007' using boolean indexing.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Create a boolean mask inside `[]` (e.g., `df[df['column'] == value]`).
</details>

##### Solution

In [None]:
player_007 = players_df[players_df['player_id'] == 'P007']
print("\nDetails for Player P007:\n", player_007)

#### Task 3: Using `.iloc`, select the 10th row of `players_df`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Remember Python uses 0-indexed positions for `.iloc`.
</details>

##### Solution

In [None]:
tenth_row = players_df.iloc[9] # 9 for the 10th row
print("\n10th row of players_df:\n", tenth_row)
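
A single integer passed to `.iloc` returns the row as a Series, while a one-element list returns a one-row DataFrame. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical three-row frame used only for illustration.
toy = pd.DataFrame({'name': ['A', 'B', 'C'], 'score': [1, 2, 3]})

row_as_series = toy.iloc[1]     # Series indexed by the column names
row_as_frame = toy.iloc[[1]]    # DataFrame with exactly one row

print(type(row_as_series).__name__, type(row_as_frame).__name__)
```

The DataFrame form is convenient when the result feeds into code that expects columns, such as a later merge or concat.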

#### Task 4: Using `.loc`, select players from 'Patriots' and 'Chiefs' teams. Display their 'name', 'team', and 'position'.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `.loc` with boolean indexing for rows and a list for columns. The `.isin()` method is helpful for multiple values.
</details>

##### Solution

In [None]:
selected_teams_players = players_df.loc[players_df['team'].isin(['Patriots', 'Chiefs']), ['name', 'team', 'position']]
print("\nPlayers from Patriots and Chiefs:\n", selected_teams_players.head())

### Exercise 4: Slicing with .loc and .iloc (Trick Question Alert!)

Let's create a small DataFrame with a default `RangeIndex` for this exercise.

In [None]:
example_slice_df = pd.DataFrame({
    'Item': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Value': [10, 20, 30, 40, 50, 60]
})
print("Example Slice DataFrame (with default RangeIndex):\n", example_slice_df)

#### Task 1: Using `.iloc`, slice `example_slice_df` to get rows at **positions 1 through 3 (inclusive)** and all columns.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

`.iloc` uses position-based indexing, similar to standard Python list slicing, where the end is exclusive. To include position 3, you need to slice up to position 4 (`[1:4]`).
</details>

##### Solution

In [None]:
iloc_slice_result = example_slice_df.iloc[1:4, :]
print("\n.iloc slice (positions 1, 2, 3):\n", iloc_slice_result)

#### Task 2: Using `.loc`, slice `example_slice_df` to get rows with **labels 1 through 3 (inclusive)** and the 'Value' column.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

`.loc` uses label-based indexing, where both the start and end labels are **inclusive**. So, `[1:3]` will include labels 1, 2, and 3.
</details>

##### Solution

In [None]:
loc_slice_result = example_slice_df.loc[1:3, 'Value']
print("\n.loc slice (labels 1, 2, 3 for 'Value'):\n", loc_slice_result)

#### Task 3: **Explain the difference**: How does the end boundary behavior of slicing differ between `.iloc` and `.loc` when using numerical indices?

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Focus on whether the end index/label is included or excluded in the slice for each method.
</details>

##### Solution

In [None]:
print("\nExplanation of the difference:")
print("When slicing with numerical indices:")
print("- `.iloc` uses **position-based** indexing and behaves like standard Python list slicing: the start position is inclusive, and the end position is **exclusive**.")
print("- `.loc` uses **label-based** indexing: the start label is inclusive, and the end label is **inclusive** as well. This holds true even if your labels are numbers (as in a default `RangeIndex`).")
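
With a default `RangeIndex`, labels and positions coincide, which can hide the difference. The sketch below uses a hypothetical frame whose integer labels start at 10, so the two slicing styles select visibly different rows.

```python
import pandas as pd

# Hypothetical frame: integer labels 10..14 do not match positions 0..4.
shifted = pd.DataFrame({'Value': [10, 20, 30, 40, 50]}, index=[10, 11, 12, 13, 14])

by_position = shifted.iloc[1:3]   # positions 1 and 2 (end position excluded)
by_label = shifted.loc[11:13]     # labels 11, 12, and 13 (end label included)

print(by_position)
print(by_label)
```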

### Exercise 5: Boolean Indexing and Filtering

Using the already created `players_df`, `games_df`, and `player_stats_df`:

#### Task 1: From `players_df`, find all players who are either 'QB' (Quarterback) or 'RB' (Running Back).

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.isin()` method for selecting rows where a column's value is in a list of possibilities.
</details>

##### Solution

In [None]:
qbs_rbs = players_df[players_df['position'].isin(['QB', 'RB'])]
print("\nQBs and RBs:\n", qbs_rbs.head())

#### Task 2: From `games_df`, select all games from the 2023 season where the home team scored more than 30 points.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Combine multiple conditions using the `&` (AND) operator, enclosing each condition in parentheses.
</details>

##### Solution

In [None]:
high_scoring_2023_games = games_df[(games_df['season'] == 2023) & (games_df['home_score'] > 30)]
print("\n2023 Games with Home Score > 30:\n", high_scoring_2023_games.head())
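
The parentheses around each condition matter because `&`, `|`, and `~` bind more tightly than comparisons such as `==` and `>`. The following sketch, on made-up data, combines masks with AND, OR, and NOT, and shows `DataFrame.query` as an equivalent string form of the AND filter.

```python
import pandas as pd

# Made-up games used only to illustrate mask combination.
toy_games = pd.DataFrame({
    'season': [2022, 2023, 2023, 2024],
    'home_score': [35, 28, 41, 17],
})

is_2023 = toy_games['season'] == 2023
high_home = toy_games['home_score'] > 30

both = toy_games[is_2023 & high_home]          # AND
either = toy_games[is_2023 | high_home]        # OR
neither = toy_games[~(is_2023 | high_home)]    # NOT

# The same AND filter written as a query string.
both_query = toy_games.query('season == 2023 and home_score > 30')
```

Naming the masks first, as above, is one way to keep long filters readable.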

#### Task 3: From `player_stats_df`, find the 5 stat entries (rows) with the most 'passing_yards'.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.nlargest()` method.
</details>

##### Solution

In [None]:
top_5_passers = player_stats_df.nlargest(5, 'passing_yards')
print("\nTop 5 Passers (by passing_yards):\n", top_5_passers)

#### Task 4: From `player_stats_df`, filter for players who had 'rushing_yards' between 50 and 100 (inclusive).

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

The `.between()` method is useful for filtering within a range.
</details>

##### Solution

In [None]:
mid_range_rushers = player_stats_df[player_stats_df['rushing_yards'].between(50, 100)]
print("\nPlayers with Rushing Yards between 50 and 100:\n", mid_range_rushers.head())
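
`.between()` includes both endpoints by default. As a sketch, assuming pandas 1.3 or newer (where `inclusive` accepts the strings `'both'`, `'neither'`, `'left'`, and `'right'`), the boundaries can be adjusted like this:

```python
import pandas as pd

yards = pd.Series([49, 50, 75, 100, 101])

closed = yards.between(50, 100)                         # 50 and 100 both included
open_ends = yards.between(50, 100, inclusive='neither') # 50 and 100 both excluded
left_only = yards.between(50, 100, inclusive='left')    # 50 included, 100 excluded

print(pd.DataFrame({'yards': yards, 'closed': closed,
                    'open_ends': open_ends, 'left_only': left_only}))
```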

### Exercise 6: Cleaning Data - Handling Missing Values

Let's introduce some missing values into a copy of `player_stats_df` to practice cleaning.

In [None]:
stats_with_nans = player_stats_df.copy()
np.random.seed(42) # Fixed seed so the injected NaNs are reproducible
for col in ['passing_yards', 'rushing_yards', 'receiving_yards']:
    # Randomly set 10% of values to NaN
    nan_indices = np.random.choice(stats_with_nans.index, size=int(len(stats_with_nans) * 0.1), replace=False)
    stats_with_nans.loc[nan_indices, col] = np.nan

print("Stats with NaNs created. Missing values:")
print(stats_with_nans.isnull().sum())

#### Task 1: Drop all rows from `stats_with_nans` that contain any missing values. How many rows are left?

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.dropna()` method.
</details>

##### Solution

In [None]:
stats_dropped_nans = stats_with_nans.dropna()
print("\nRows remaining after dropping NaNs:", len(stats_dropped_nans))

#### Task 2: Fill missing 'passing_yards' values with 0.0, 'rushing_yards' with the median, and 'receiving_yards' with the mean of their respective columns. Store this in a new DataFrame called `stats_filled`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `.fillna()`. You can pass a single value or a dictionary for different columns. Remember to calculate median/mean for imputation.
</details>

##### Solution

In [None]:
stats_filled = stats_with_nans.copy()
stats_filled['passing_yards'] = stats_filled['passing_yards'].fillna(0.0)
stats_filled['rushing_yards'] = stats_filled['rushing_yards'].fillna(stats_filled['rushing_yards'].median())
stats_filled['receiving_yards'] = stats_filled['receiving_yards'].fillna(stats_filled['receiving_yards'].mean())
print("\nStats after filling NaNs (first 5 rows):\n", stats_filled.head())
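
The column-by-column approach can also be written as a single `.fillna()` call with a dictionary that maps column names to fill values. A minimal sketch on a made-up three-row frame:

```python
import numpy as np
import pandas as pd

# Made-up stats with one missing value per column.
toy_stats = pd.DataFrame({
    'passing_yards': [250.0, np.nan, 0.0],
    'rushing_yards': [10.0, 30.0, np.nan],
    'receiving_yards': [np.nan, 40.0, 60.0],
})

# One fill value per column; median and mean skip NaN by default.
fill_values = {
    'passing_yards': 0.0,
    'rushing_yards': toy_stats['rushing_yards'].median(),
    'receiving_yards': toy_stats['receiving_yards'].mean(),
}
toy_filled = toy_stats.fillna(fill_values)
print(toy_filled)
```

Because `.fillna()` returns a new DataFrame here, the original `toy_stats` still contains its NaNs.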

#### Task 3: Verify that `stats_filled` has no missing values.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Check the sum of null values for each column using `.isnull().sum()`.
</details>

##### Solution

In [None]:
print("\nMissing values in stats_filled:\n", stats_filled.isnull().sum())

### Exercise 7: Data Manipulation - Adding/Updating Columns, Dropping Duplicates

This exercise uses `player_stats_df` and `players_df` from the setup, plus a small DataFrame that contains duplicate rows.

In [None]:
duplicate_players_data = {
    'player_id': ['P001', 'P002', 'P001', 'P003'],
    'name': ['Player 1', 'Player 2', 'Player 1', 'Player 3'],
    'team': ['Patriots', 'Chiefs', 'Patriots', 'Eagles'],
    'position': ['QB', 'RB', 'QB', 'WR']
}
duplicate_players_df = pd.DataFrame(duplicate_players_data)
print("\nDataFrame with duplicates:\n", duplicate_players_df)

#### Task 1: From `player_stats_df`, create a new column `total_yards` which is the sum of `passing_yards`, `rushing_yards`, and `receiving_yards`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

You can create a new column by simply assigning the result of an operation on existing columns (e.g., `df['new_col'] = df['col1'] + df['col2']`).
</details>

##### Solution

In [None]:
player_stats_df['total_yards'] = player_stats_df['passing_yards'] + player_stats_df['rushing_yards'] + player_stats_df['receiving_yards']
print("\nPlayer Stats with 'total_yards':\n", player_stats_df.head())

#### Task 2: From `players_df`, create a new column `experience_level`. If `draft_year` is before 2018, assign 'Veteran', otherwise 'Rookie/Newcomer'.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `np.where(condition, value_if_true, value_if_false)` for conditional assignment.
</details>

##### Solution

In [None]:
players_df['experience_level'] = np.where(players_df['draft_year'] < 2018, 'Veteran', 'Rookie/Newcomer')
print("\nPlayers with 'experience_level':\n", players_df.sample(5))
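
`np.where` handles a two-way split. For three or more categories, one option is `pd.cut`, sketched below; the bin edges and the labels are illustrative assumptions, not part of the exercise.

```python
import pandas as pd

draft_years = pd.Series([2015, 2017, 2018, 2021, 2024])

# right=False makes each bin include its left edge:
# [2015, 2018), [2018, 2022), [2022, 2025)
experience = pd.cut(
    draft_years,
    bins=[2015, 2018, 2022, 2025],
    labels=['Veteran', 'Established', 'Newcomer'],
    right=False,
)
print(experience)
```

The result is a categorical Series, and values outside every bin would become `NaN`.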

#### Task 3: Remove duplicate rows from `duplicate_players_df` based on `player_id`, keeping the first occurrence.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.drop_duplicates()` method with the `subset` and `keep` parameters.
</details>

##### Solution

In [None]:
unique_players_df = duplicate_players_df.drop_duplicates(subset=['player_id'], keep='first')
print("\nDataFrame after removing duplicates:\n", unique_players_df)
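
Before dropping anything, it can help to look at the rows involved. The sketch below uses `.duplicated()` with `keep=False` to flag every member of a duplicated group, and shows `keep='last'` as the alternative to `keep='first'`. The data is a trimmed copy of the example above.

```python
import pandas as pd

dupes = pd.DataFrame({
    'player_id': ['P001', 'P002', 'P001', 'P003'],
    'team': ['Patriots', 'Chiefs', 'Patriots', 'Eagles'],
})

# keep=False marks all copies, not just the later ones.
all_copies = dupes[dupes.duplicated(subset=['player_id'], keep=False)]

# keep='last' retains the final occurrence of each player_id.
kept_last = dupes.drop_duplicates(subset=['player_id'], keep='last')

print(all_copies)
print(kept_last)
```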

#### Task 4: From `players_df`, change the `team` of all players drafted in 2024 to 'Expansion Team'. Use direct assignment with `.loc`, or `.mask()` (`.where()` also works if you invert the condition).

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `df.loc[condition, 'column'] = new_value` for direct boolean assignment.
</details>

##### Solution

In [None]:
players_df.loc[players_df['draft_year'] == 2024, 'team'] = 'Expansion Team'
print("\nPlayers with 'Expansion Team' (2024 draftees):\n", players_df[players_df['draft_year'] == 2024].head())

#### Task 5: From `player_stats_df`, replace any instance of `0` in `sacks_made` with `NaN` using `.replace()`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

The `.replace()` method can be used on a Series or DataFrame to substitute values.
</details>

##### Solution

In [None]:
player_stats_df['sacks_made'] = player_stats_df['sacks_made'].replace(0, np.nan)
print("\nPlayer stats with 0 sacks_made replaced by NaN (sample):\n", player_stats_df[player_stats_df['sacks_made'].isnull()].head())

### Exercise 8: Data Type Conversion & Concatenation

#### Task 1: Create a Pandas Series `player_ratings_raw` with some mixed string and numeric values (e.g., '85', '90', 'N/A', '72', 'injured'). Convert this Series to a numeric type, coercing errors to NaN, and then fill any NaNs with the mean of the valid numbers.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `pd.to_numeric(..., errors='coerce')` to convert to numeric, and then `.fillna()` to impute missing values.
</details>

##### Solution

In [None]:
player_ratings_raw = pd.Series(['85', '90', 'N/A', '72', 'injured', '95', '60'])
player_ratings_numeric = pd.to_numeric(player_ratings_raw, errors='coerce')
mean_valid_ratings = player_ratings_numeric.mean()
player_ratings_cleaned = player_ratings_numeric.fillna(mean_valid_ratings)
print("\nCleaned Player Ratings Series:\n", player_ratings_cleaned)
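
Coercion silently turns every unparseable entry into `NaN`, so it is worth checking which raw values were lost. A minimal sketch that compares the raw and converted Series:

```python
import pandas as pd

raw = pd.Series(['85', '90', 'N/A', '72', 'injured'])
numeric = pd.to_numeric(raw, errors='coerce')

# Entries that were present in the raw data but failed to convert.
failed = raw[numeric.isna() & raw.notna()]

print(failed)
print("Mean of valid ratings:", numeric.mean())
```

`.mean()` skips `NaN` by default, which is why it can be used directly as the fill value.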

#### Task 2: Create two small DataFrames, `df_part1` and `df_part2`, each with the columns `player_id` and `games_played` (for example, one DataFrame per season). Vertically concatenate them into a single `all_games_played_df`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `pd.concat()` for vertical concatenation. Remember to set `ignore_index=True` to reset the index of the combined DataFrame.
</details>

##### Solution

In [None]:
df_part1 = pd.DataFrame({
    'player_id': ['P001', 'P002', 'P003'],
    'games_played': [17, 15, 12]
})
df_part2 = pd.DataFrame({
    'player_id': ['P004', 'P001', 'P005'],
    'games_played': [16, 17, 10]
})
all_games_played_df = pd.concat([df_part1, df_part2], ignore_index=True)
print("\nConcatenated Games Played DataFrame:\n", all_games_played_df)
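
If the two parts carried season-specific column names (such as `games_played_2023` and `games_played_2024`) rather than the shared `games_played` column used in the solution above, `pd.concat` would still stack the rows but fill the non-matching cells with `NaN`. The sketch below shows that effect and one possible tidy alternative; the `season` column is an assumption added for illustration.

```python
import pandas as pd

part_2023 = pd.DataFrame({'player_id': ['P001', 'P002'], 'games_played_2023': [17, 15]})
part_2024 = pd.DataFrame({'player_id': ['P001', 'P003'], 'games_played_2024': [16, 10]})

# Different column names: rows are stacked and the gaps become NaN.
stacked = pd.concat([part_2023, part_2024], ignore_index=True)

# Shared column name plus a season column keeps every cell populated.
tidy = pd.concat(
    [
        part_2023.rename(columns={'games_played_2023': 'games_played'}).assign(season=2023),
        part_2024.rename(columns={'games_played_2024': 'games_played'}).assign(season=2024),
    ],
    ignore_index=True,
)

print(stacked)
print(tidy)
```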

### Exercise 9: String Methods

Using `players_df`:

#### Task 1: Convert all 'college' names to lowercase.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.str.lower()` accessor method.
</details>

##### Solution

In [None]:
players_df['college_lower'] = players_df['college'].str.lower()
print("\nColleges in lowercase (sample):\n", players_df[['college', 'college_lower']].sample(5))

#### Task 2: Find players whose name starts with 'P'.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.str.startswith()` method for filtering strings.
</details>

##### Solution

In [None]:
players_starting_with_p = players_df[players_df['name'].str.startswith('P')]
print("\nPlayers whose name starts with 'P':\n", players_starting_with_p.head())

#### Task 3: Count how many players have 'State' in their `college` name (case-insensitive).

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `.str.contains()` with `case=False` and `na=False` (to handle potential NaN values in the string column) and then `.sum()` on the resulting boolean Series.
</details>

##### Solution

In [None]:
state_college_count = players_df['college'].str.contains('State', case=False, na=False).sum()
print("\nNumber of players from a 'State' college:", state_college_count)
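
The generated data has no missing colleges, so the effect of `na=False` is not visible there. The sketch below uses a made-up Series with a missing entry, and also shows `regex=False`, which treats the pattern as literal text (useful for characters such as `(` that are special in regular expressions).

```python
import pandas as pd

# Made-up college names, including one missing value.
colleges = pd.Series(['Ohio State', 'LSU', None, 'penn state', 'Miami (FL)'])

# na=False makes the missing entry count as "no match" instead of NaN.
has_state = colleges.str.contains('state', case=False, na=False)
state_count = int(has_state.sum())

# regex=False searches for the literal character '('.
has_paren = colleges.str.contains('(', regex=False, na=False)

print(state_count)
print(has_paren)
```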

#### Task 4: Extract the first three characters of each `player_id` to a new column `id_prefix`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

You can use `.str.slice(start, end)` or direct string slicing like `.str[:end]`.
</details>

##### Solution

In [None]:
players_df['id_prefix'] = players_df['player_id'].str.slice(0, 3)
print("\nPlayers with 'id_prefix' (sample):\n", players_df[['player_id', 'id_prefix']].sample(5))

#### Task 5: Clean up any leading/trailing whitespace from the 'team' column (though our generated data is clean, practice the method!).

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.str.strip()` method.
</details>

##### Solution

In [None]:
players_df['team_stripped'] = players_df['team'].str.strip()
print("\nTeam names after stripping whitespace (sample):\n", players_df[['team', 'team_stripped']].sample(5))

### Exercise 10: Grouping and Aggregation

Using `players_df` and `player_stats_df`.

#### Task 1: Count the number of players per `position`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `groupby('column').size()` or `groupby('column').count()`.
</details>

##### Solution

In [None]:
players_by_position = players_df.groupby('position').size().sort_values(ascending=False)
print("\nNumber of players per position:\n", players_by_position)
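
The two options in the hint differ when data is missing: `.size()` counts rows per group, while `.count()` counts non-null values per column. A small sketch with made-up values, plus `value_counts()` as a shortcut for the row count:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'position': ['QB', 'QB', 'RB', 'RB', 'RB'],
    'draft_year': [2018, np.nan, 2020, 2021, np.nan],
})

rows_per_position = toy.groupby('position').size()             # all rows
known_years = toy.groupby('position')['draft_year'].count()    # non-null only
shortcut = toy['position'].value_counts()                      # sorted by frequency

print(rows_per_position)
print(known_years)
print(shortcut)
```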

#### Task 2: Calculate the average `passing_yards`, `rushing_yards`, and `receiving_yards` for each `position`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

You'll need to merge `player_stats_df` with `players_df` first to get the 'position' for each stat entry. Then use `groupby()` and `.mean()`.
</details>

##### Solution

In [None]:
# First, merge player_stats_df with players_df to get position
stats_with_position = pd.merge(player_stats_df, players_df[['player_id', 'position']], on='player_id', how='left')
avg_yards_per_position = stats_with_position.groupby('position')[['passing_yards', 'rushing_yards', 'receiving_yards']].mean()
print("\nAverage yards per position:\n", avg_yards_per_position)

#### Task 3: Find the total `touchdowns` scored by each `team` (you'll need to combine `players_df` and `player_stats_df`).

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Merge the dataframes to get 'team' and 'touchdowns' together. Then `groupby('team')` and `.sum()` on 'touchdowns'.
</details>

##### Solution

In [None]:
stats_with_team = pd.merge(player_stats_df, players_df[['player_id', 'team']], on='player_id', how='left')
team_total_touchdowns = stats_with_team.groupby('team')['touchdowns'].sum().sort_values(ascending=False)
print("\nTotal touchdowns per team:\n", team_total_touchdowns)

#### Task 4: For each `team` and `position` combination, find the minimum and maximum `draft_year`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Group by multiple columns by passing a list to `groupby()`. Then use `.agg()` with a list of aggregation functions like `['min', 'max']`.
</details>

##### Solution

In [None]:
min_max_draft_year = players_df.groupby(['team', 'position'])['draft_year'].agg(['min', 'max'])
print("\nMin/Max draft year per team and position:\n", min_max_draft_year.head())

#### Task 5: Using `.agg()`, calculate the total sum of `total_yards` and the count of unique `player_id`s for each `game_id`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `groupby('game_id').agg()` with named aggregation: keyword arguments of the form `new_column=('source_column', 'function')` (`'sum'` for total yards, `'nunique'` for unique player IDs).
</details>

##### Solution

In [None]:
game_performance = player_stats_df.groupby('game_id').agg(
    total_yards_sum=('total_yards', 'sum'),
    unique_players=('player_id', 'nunique')
)
print("\nTotal yards and unique players per game (using .agg()):\n", game_performance.head())
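
In named aggregation, each keyword argument becomes an output column, and its value is a `(source_column, function)` tuple. The sketch below repeats the pattern on a made-up frame small enough to verify by hand; the extra `stat_rows` column is an illustrative addition.

```python
import pandas as pd

toy_stats = pd.DataFrame({
    'game_id': ['G1', 'G1', 'G1', 'G2'],
    'player_id': ['P1', 'P2', 'P1', 'P3'],
    'total_yards': [100, 50, 25, 80],
})

per_game = toy_stats.groupby('game_id').agg(
    total_yards_sum=('total_yards', 'sum'),     # 100 + 50 + 25 for G1
    unique_players=('player_id', 'nunique'),    # P1 and P2 for G1
    stat_rows=('player_id', 'size'),            # number of rows in the group
)
print(per_game)
```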

### Exercise 11: Merging DataFrames

Using `players_df`, `games_df`, and `player_stats_df`.

#### Task 1: Merge `player_stats_df` with `players_df` to include player `name`, `team`, and `position` in the stats DataFrame. Store it in `full_stats_df`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `pd.merge()` with `on='player_id'` and `how='left'` (or `how='inner'` if you only want players with stats).
</details>

##### Solution

In [None]:
full_stats_df = pd.merge(player_stats_df, players_df[['player_id', 'name', 'team', 'position']],
                         on='player_id', how='left')
print("\nFull Stats DataFrame (first 5 rows):\n", full_stats_df.head())

#### Task 2: Merge `full_stats_df` with `games_df` to include `home_team`, `away_team`, `home_score`, `away_score`, and `season` information for each player's statistical entry. Store it in `combined_nfl_data`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Merge `full_stats_df` and `games_df` on `game_id`.
</details>

##### Solution

In [None]:
combined_nfl_data = pd.merge(full_stats_df, games_df[['game_id', 'home_team', 'away_team', 'home_score', 'away_score', 'season']],
                             on='game_id', how='left')
print("\nCombined NFL Data (first 5 rows):\n", combined_nfl_data.head())

#### Task 3: Perform a **left merge** of `players_df` with `player_stats_df` to see which players have *no* entries in `player_stats_df` (i.e., their `player_id` appears in `players_df` but not in `player_stats_df`). How many such players are there?

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use a left merge and then check for `NaN` values in a column that was only present in the right DataFrame (e.g., `game_id` from `player_stats_df`).
</details>

##### Solution

In [None]:
players_no_stats_merge = pd.merge(players_df, player_stats_df, on='player_id', how='left')
players_with_null_stats = players_no_stats_merge[players_no_stats_merge['game_id'].isnull()]
num_players_no_stats = len(players_with_null_stats)
print(f"\nNumber of players with no stats entries: {num_players_no_stats}")
print("Sample of players with no stats:\n", players_with_null_stats[['player_id', 'name', 'team']].head())
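
Two alternative ways to find unmatched rows are sketched below on made-up frames: `indicator=True` adds a `_merge` column that states where each row came from, and `.isin()` answers the same question without merging at all.

```python
import pandas as pd

toy_players = pd.DataFrame({'player_id': ['P1', 'P2', 'P3'], 'name': ['Ann', 'Bob', 'Cy']})
toy_stats = pd.DataFrame({'player_id': ['P1', 'P1', 'P3'], 'touchdowns': [1, 0, 2]})

# A left merge keeps every player; players with several stat rows repeat.
merged = pd.merge(toy_players, toy_stats, on='player_id', how='left', indicator=True)
no_stats = merged[merged['_merge'] == 'left_only']

# Same answer without a merge: IDs that never appear in the stats table.
no_stats_isin = toy_players[~toy_players['player_id'].isin(toy_stats['player_id'])]

print(merged)
print(no_stats_isin)
```

Players without stats appear exactly once in the merged result, so counting the `left_only` rows gives the number of such players.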

### Exercise 12: Advanced Grouping & Aggregation

Using `players_df` and `full_stats_df` (which you created by merging in the previous exercise):

#### Task 1: Using `players_df`, for each `college`, count the number of players and calculate the average `draft_year`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `groupby('college').agg()` with named aggregation, one keyword argument per output column (`'size'` for the count, `'mean'` for the average).
</details>

##### Solution

In [None]:
college_summary = players_df.groupby('college').agg(
    num_players=('player_id', 'size'),
    avg_draft_year=('draft_year', 'mean')
).sort_values(by='num_players', ascending=False)
print("\nCollege Summary (Number of Players and Average Draft Year):\n", college_summary.head())

#### Task 2: Using `full_stats_df`, calculate the total `passing_yards` and average `interceptions_thrown` for `QB` players.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

First, filter `full_stats_df` to include only 'QB' positions. Then, use `.agg()` to calculate the sum of 'passing_yards' and the mean of 'interceptions_thrown'.
</details>

##### Solution

In [None]:
# Ensure full_stats_df is available (from Exercise 11)
if 'full_stats_df' not in locals():
    full_stats_df = pd.merge(player_stats_df, players_df[['player_id', 'name', 'team', 'position']],
                             on='player_id', how='left')

qb_stats = full_stats_df[full_stats_df['position'] == 'QB']
qb_performance = qb_stats.agg(
    total_passing_yards=('passing_yards', 'sum'),
    avg_interceptions=('interceptions_thrown', 'mean')
)
print("\nQB Performance (Total Passing Yards, Avg Interceptions):\n", qb_performance)

### Exercise 13: Basic Plotting (using Pandas' built-in methods)

Using `combined_nfl_data`.

#### Task 1: Create a histogram of `total_yards` using Pandas' built-in plotting method.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `series.hist()` or `dataframe['column'].plot.hist()`.
</details>

##### Solution

In [None]:
combined_nfl_data['total_yards'].hist(bins=20, edgecolor='black', figsize=(10, 6))

#### Task 2: Create a bar plot showing the total `touchdowns` scored by each `team` using Pandas' built-in plotting method.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

First, aggregate the data (e.g., `groupby('team')['touchdowns'].sum()`), then use `.plot.bar()` on the resulting Series.
</details>

##### Solution

In [None]:
team_tds = combined_nfl_data.groupby('team')['touchdowns'].sum().sort_values(ascending=False)
team_tds.plot.bar(figsize=(12, 7), color='skyblue')

#### Task 3: Create a scatter plot showing `passing_yards` vs `receiving_yards` for all players using Pandas' built-in plotting method.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `dataframe.plot.scatter(x='col1', y='col2')`.
</details>

##### Solution

In [None]:
combined_nfl_data.plot.scatter(x='passing_yards', y='receiving_yards', alpha=0.6, s=50, c='red', figsize=(10, 6))

### Exercise 14: Advanced Data Manipulation with Functions

Using `games_df` and `full_stats_df`.

#### Task 1: Define a function `get_game_result(row)` that takes a row from `games_df` and returns 'Home Win', 'Away Win', or 'Draw' based on `home_score` and `away_score`. Apply this function to create a new `game_result` column in `games_df`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use the `.apply()` method with `axis=1` when your function needs to access multiple columns from a row.
</details>

##### Solution

In [None]:
def get_game_result(row):
    if row['home_score'] > row['away_score']:
        return 'Home Win'
    elif row['away_score'] > row['home_score']:
        return 'Away Win'
    else:
        return 'Draw'

games_df['game_result'] = games_df.apply(get_game_result, axis=1)
print("\nGames DataFrame with 'game_result' (sample):\n", games_df[['home_team', 'away_team', 'home_score', 'away_score', 'game_result']].sample(5))
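
Row-wise `.apply()` is easy to read but calls the Python function once per row. As an alternative, the same three-way rule can be expressed with `np.select`, which evaluates whole columns at once. This is a sketch on a hypothetical `toy_games` frame, not a replacement for the exercise solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the two score columns from games_df
toy_games = pd.DataFrame({
    'home_score': [24, 17, 20],
    'away_score': [21, 30, 20],
})

# One boolean condition per outcome; anything left over falls to the default
conditions = [
    toy_games['home_score'] > toy_games['away_score'],
    toy_games['away_score'] > toy_games['home_score'],
]
choices = ['Home Win', 'Away Win']

toy_games['game_result'] = np.select(conditions, choices, default='Draw')
print(toy_games)
```

On 200 games the speed difference is negligible, so the choice here is mostly about readability.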

#### Task 2: Define a function `assign_player_grade(row)` that assigns a grade ('A', 'B', 'C', or 'D') based on the row's `total_yards` and `touchdowns`. Apply this function to create a `player_grade` column in `full_stats_df`.

The rules are checked in order, so the first matching grade wins:

* Grade A: Total yards > 300 AND TDs >= 2
* Grade B: Total yards > 150 OR TDs >= 1
* Grade C: Total yards > 50
* Grade D: Otherwise

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Define a function that takes a row as input and use conditional logic (`if/elif/else`). Apply this function to the DataFrame with `axis=1`.
</details>

##### Solution

In [None]:
def assign_player_grade(row):
    yards = row['total_yards']
    tds = row['touchdowns']

    if yards > 300 and tds >= 2:
        return 'A'
    elif yards > 150 or tds >= 1:
        return 'B'
    elif yards > 50:
        return 'C'
    else:
        return 'D'

full_stats_df['player_grade'] = full_stats_df.apply(assign_player_grade, axis=1)
print("\nFull Stats DataFrame with 'player_grade' (sample):\n", full_stats_df[['name', 'total_yards', 'touchdowns', 'player_grade']].sample(5))
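
Because the grade rules overlap, their order matters: a player with 350 yards and 1 touchdown fails the Grade A test but still matches Grade B. The sketch below checks a few such boundary cases using `np.select`, which picks the first condition that is true for each row. The `toy_stats` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows chosen to hit each grade, including overlap cases
toy_stats = pd.DataFrame({
    'total_yards': [350, 350, 200, 40, 80, 20],
    'touchdowns':  [2,   1,   0,   1,  0,  0],
})

yards = toy_stats['total_yards']
tds = toy_stats['touchdowns']

# Conditions are listed from most to least demanding; the first match wins
conditions = [
    (yards > 300) & (tds >= 2),   # Grade A
    (yards > 150) | (tds >= 1),   # Grade B
    yards > 50,                   # Grade C
]
toy_stats['player_grade'] = np.select(conditions, ['A', 'B', 'C'], default='D')
print(toy_stats)
```

Note that the fourth row (40 yards, 1 touchdown) earns a B through the OR rule even though its yardage alone would be a D.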

### Exercise 15: Challenge - Beyond the Basics (Finding New Methods)

Using `combined_nfl_data` and `games_df`.

#### Task 1: For each game, identify the player with the highest `total_yards`. Create a DataFrame showing `game_id`, `player_id`, `name`, and `total_yards` for these top performers.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Consider using `groupby('game_id')['total_yards'].idxmax()` to get the index of the row with the maximum total yards within each game. Then use `.loc` to select those specific rows.
</details>

##### Solution

In [None]:
# One way: groupby and idxmax
idx_max_yards_per_game = combined_nfl_data.groupby('game_id')['total_yards'].idxmax()
top_performers_per_game = combined_nfl_data.loc[idx_max_yards_per_game, ['game_id', 'player_id', 'name', 'total_yards']]
print("\nTop Performer in Total Yards per Game:\n", top_performers_per_game.head())
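
One detail worth knowing: `idxmax()` returns only the first row that reaches the maximum, so a game with two tied leaders yields a single row. If ties should be kept, comparing each row against its group maximum is one option. The sketch below shows both behaviours on a hypothetical `toy_data` frame.

```python
import pandas as pd

# Hypothetical toy data; game G0002 has two players tied at 80 yards
toy_data = pd.DataFrame({
    'game_id': ['G0001', 'G0001', 'G0002', 'G0002'],
    'player_id': ['P001', 'P002', 'P003', 'P004'],
    'total_yards': [120, 95, 80, 80],
})

# idxmax keeps the first maximum per game
top_idx = toy_data.groupby('game_id')['total_yards'].idxmax()
top_first = toy_data.loc[top_idx]

# transform('max') broadcasts each game's maximum back to every row,
# so the comparison keeps all tied leaders
game_max = toy_data.groupby('game_id')['total_yards'].transform('max')
top_with_ties = toy_data[toy_data['total_yards'] == game_max]

print(top_first)
print(top_with_ties)
```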

#### Task 2: Calculate the absolute difference between `home_score` and `away_score` for each game and store it in a `score_difference` column. Then, categorize games as 'Close Game' (difference <= 7 points) or 'Blowout' (difference > 7 points). Add this as a `game_type` column to `games_df`.

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

Use `abs()` for the absolute difference. For conditional assignment, `np.where()` is powerful: `np.where(condition, value_if_true, value_if_false)`.
</details>

##### Solution

In [None]:
games_df['score_difference'] = (games_df['home_score'] - games_df['away_score']).abs()
games_df['game_type'] = np.where(games_df['score_difference'] <= 7, 'Close Game', 'Blowout')
print("\nGames DataFrame with 'score_difference' and 'game_type' (sample):\n", games_df[['home_team', 'away_team', 'home_score', 'away_score', 'score_difference', 'game_type']].sample(5))
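
`np.where` handles a two-way split well. When the categories are numeric ranges, `pd.cut` is another option and extends naturally to more than two bins. The sketch below reproduces the same 7-point threshold on a hypothetical Series of score differences; with the default `right=True`, each bin includes its upper edge, so a 7-point game counts as close.

```python
import numpy as np
import pandas as pd

# Hypothetical absolute score differences, including the boundary value 7
score_difference = pd.Series([0, 3, 7, 8, 21])

# Bins are (-inf, 7] and (7, inf)
game_type = pd.cut(
    score_difference,
    bins=[-np.inf, 7, np.inf],
    labels=['Close Game', 'Blowout'],
)
print(game_type)
```

The result is a categorical Series rather than plain strings, which can be convenient for ordered grouping later on.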

### Exercise 16: Debugging Scenario

#### Task 1: You are trying to find the average passing yards for 'QB' position players, but you made a mistake. Debug the following code block and fix it.

```python
# Original faulty code
avg_qb_passing_yards = players_df[players_df['position'] == 'QB']['passing_yards'].mean()
print(avg_qb_passing_yards)
```
**Problem**: The above code raises a `KeyError`, because `players_df` has no `passing_yards` column. Where does that column live, and how do you fix the code?

In [None]:
#Place your code here

##### Hint
<details>
<summary>Click to reveal hint</summary>

The 'passing_yards' column is in `player_stats_df`, not `players_df`. You need to merge the necessary DataFrames to have both the 'position' and 'passing_yards' in one place to perform the calculation.
</details>

##### Solution

In [None]:
# The problem is that 'passing_yards' is in 'player_stats_df', not 'players_df'.
# We need to merge the two DataFrames first to have both 'position' and 'passing_yards' in one place.

# First, ensure we have the combined stats DataFrame from previous exercises (or re-create it if needed)
if 'full_stats_df' not in locals():
    full_stats_df = pd.merge(player_stats_df, players_df[['player_id', 'position']], on='player_id', how='left')

# Corrected code:
avg_qb_passing_yards = full_stats_df[full_stats_df['position'] == 'QB']['passing_yards'].mean()
print(f"Corrected: Average QB Passing Yards: {avg_qb_passing_yards:.2f}")

print("\nExplanation: The original code tried to access 'passing_yards' from `players_df`, which only contains player metadata like name, team, and position. 'passing_yards' is actually in `player_stats_df`. To fix this, we first need to merge `players_df` and `player_stats_df` (or use `full_stats_df` if already merged) to have both position and statistical data together before filtering and calculating the mean.")
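
A quick way to catch this kind of mistake early is to check which columns a DataFrame actually has before indexing into it. The sketch below walks through that check and the merge on two hypothetical toy frames; the `validate='many_to_one'` argument is an optional safeguard that raises an error if `player_id` is unexpectedly duplicated on the players side.

```python
import pandas as pd

# Hypothetical toy versions of players_df and player_stats_df
toy_players = pd.DataFrame({
    'player_id': ['P001', 'P002', 'P003'],
    'position': ['QB', 'WR', 'QB'],
})
toy_player_stats = pd.DataFrame({
    'player_id': ['P001', 'P001', 'P002', 'P003'],
    'passing_yards': [250, 300, 0, 200],
})

# Step 1: confirm where the column lives
print('passing_yards' in toy_players.columns)       # False
print('passing_yards' in toy_player_stats.columns)  # True

# Step 2: bring position and stats together, then filter and average
toy_merged = pd.merge(
    toy_player_stats, toy_players,
    on='player_id', how='left', validate='many_to_one',
)
avg_qb_yards = toy_merged.loc[toy_merged['position'] == 'QB', 'passing_yards'].mean()
print(avg_qb_yards)
```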

## Congratulations!
You've completed the Pandas NFL Data Analysis Exercises. You've demonstrated a strong understanding of data loading, exploration, cleaning, manipulation, grouping, merging, and even some basic plotting. Keep practicing to become a Pandas pro!