Welcome to Day 5! If Pandas were a superpower, groupby would be the ability to see through walls. It allows you to summarize thousands of rows into a few meaningful numbers.

In [1]:
import numpy as np
import pandas as pd

The Mental Model: Split-Apply-Combine
Think of your data as a deck of cards.

Split: You sort the cards into piles based on their suit (e.g., all Hearts in one pile, all Diamonds in another).

Apply: You do some math on each pile (e.g., you count how many cards are in the "Hearts" pile).

Combine: You put the counts for each suit into a new, smaller list.
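
To make the card analogy concrete, here is a minimal sketch that performs the three steps by hand and then lets groupby do the same work in one line. The `cards` DataFrame and its values are made up for illustration; they are not part of the games dataset used later.

```python
import pandas as pd

# Hypothetical toy deck: six cards, three suits (illustrative values only).
cards = pd.DataFrame({
    "suit": ["Hearts", "Diamonds", "Hearts", "Spades", "Diamonds", "Hearts"],
    "value": [2, 10, 7, 4, 3, 9],
})

# Split: one pile (sub-DataFrame) per suit.
piles = {suit: pile for suit, pile in cards.groupby("suit")}

# Apply: count the cards in each pile.
counts = {suit: len(pile) for suit, pile in piles.items()}

# Combine: gather the per-pile results into one small Series.
manual = pd.Series(counts, name="value").sort_index()

# groupby performs all three steps in a single expression.
automatic = cards.groupby("suit")["value"].count()

print(manual)
print(automatic)
```

Both printouts should show the same counts per suit, which is the point: groupby is the split, apply, and combine steps rolled into one call.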

The Syntax Anatomy
A typical GroupBy looks like this: df.groupby('Genre')['Sales'].mean()

'Genre': The column you want to group by (the "piles").

['Sales']: The column you want to perform math on.

.mean(): The actual math (could be .sum(), .count(), .min(), or .max()).
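
The three parts of the syntax can be tried on a tiny table before touching real data. This is a sketch only: the `sales` DataFrame below is hypothetical and simply reuses the generic `Genre` and `Sales` column names from the example above.

```python
import pandas as pd

# Hypothetical toy data using the generic column names from the text.
sales = pd.DataFrame({
    "Genre": ["RPG", "RPG", "Action", "Action", "Action"],
    "Sales": [10.0, 20.0, 3.0, 6.0, 9.0],
})

# groupby('Genre') makes the piles; ['Sales'] picks the column to do math on.
by_genre = sales.groupby("Genre")["Sales"]

# The final method call is the math applied to each pile.
genre_mean = by_genre.mean()
genre_sum = by_genre.sum()
genre_count = by_genre.count()

print(genre_mean)
print(genre_sum)
print(genre_count)
```

Swapping the last method is all it takes to change the summary; the grouping itself stays the same.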

In [2]:
df = pd.read_csv("games3.csv")
df

Unnamed: 0,game_id,title,genre,release_year,units_sold_m,rating
0,1,The Legend of Data,Adventure,2022,15.5,9.5
1,2,Python Quest,RPG,2021,8.2,8.2
2,3,SQL Arena,Strategy,2023,5.4,7.5
3,4,Pandas Paradise,Simulation,2022,12.1,8.45
4,5,Matrix Reloaded,Action,2020,3.3,6.5
...,...,...,...,...,...,...
65,66,Action Matrix,Indie,2021,8.7,9.77
66,67,Queries Action,Action,2024,1.8,6.33
67,68,Strategy Sim,Action,2024,7.0,7.04
68,69,Strategy Query,RPG,2022,8.5,8.69


In [7]:
df.isnull().sum()

game_id         0
title           0
genre           0
release_year    0
units_sold_m    0
rating          0
dtype: int64

In [10]:
print(df.columns)

Index(['game_id', 'title', 'genre', 'release_year', 'units_sold_m', 'rating'], dtype='object')


In [11]:
df

Unnamed: 0,game_id,title,genre,release_year,units_sold_m,rating
0,1,The Legend of Data,Adventure,2022,15.5,9.5
1,2,Python Quest,RPG,2021,8.2,8.2
2,3,SQL Arena,Strategy,2023,5.4,7.5
3,4,Pandas Paradise,Simulation,2022,12.1,8.45
4,5,Matrix Reloaded,Action,2020,3.3,6.5
...,...,...,...,...,...,...
65,66,Action Matrix,Indie,2021,8.7,9.77
66,67,Queries Action,Action,2024,1.8,6.33
67,68,Strategy Sim,Action,2024,7.0,7.04
68,69,Strategy Query,RPG,2022,8.5,8.69


In [12]:
(
    df.groupby(['genre', 'release_year'])['rating']
    .mean()
    .reset_index()
)

TypeError: agg function failed [how->mean,dtype->object]

In [13]:
df['rating'].dtype

dtype('O')

Perfect.
dtype('O') means Object → in practice → string / mixed values.

So Pandas cannot compute the mean because math needs numbers.

You diagnosed it correctly 👍
Now we fix it like a data scientist.

✅ Convert rating → numeric
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
🧠 Why errors='coerce'?

If something cannot convert (like "N/A"), it becomes:

NaN


and mean() will automatically ignore NaN.

Very useful in real datasets.

After this run:

df['rating'].dtype


You should see something like:

float64


Now math operations will work.
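
Before coercing a real column, it can help to see which values would be lost. The sketch below uses a hypothetical `raw` Series (not the actual games3.csv data) to show the conversion, how to list the entries that failed to convert, and that mean() skips the resulting NaN.

```python
import pandas as pd

# Hypothetical ratings stored as text, with one unparseable entry.
raw = pd.Series(["9.5", "8.2", "N/A", "7.5"], name="rating")

# errors='coerce' turns anything that cannot be parsed into NaN.
converted = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw data but became NaN after conversion.
failed = raw[converted.isna() & raw.notna()]

print(converted.dtype)   # float64
print(failed.tolist())   # the entries that could not be converted
print(converted.mean())  # NaN is skipped, so only the three valid ratings count
```

Inspecting `failed` first is a cheap safeguard: if it contains many rows, the column may need cleaning rather than silent coercion.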

In [14]:
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

In [15]:
df['rating'].dtype

dtype('float64')

In [None]:
# 🎯 What is the goal?

# 👉 For each genre and each year, find the average rating.

(
    df.groupby(['genre','release_year'])['rating'].mean()
    .reset_index()
)


# What if you want only ONE row per genre?

# Then group by genre alone:

# df.groupby('genre')['rating'].mean().reset_index()

Unnamed: 0,genre,release_year,rating
0,Action,2020,7.225
1,Action,2021,6.64
2,Action,2022,8.37
3,Action,2024,6.685
4,Adventure,2020,8.7
5,Adventure,2021,9.1
6,Adventure,2022,7.67
7,Adventure,2023,8.973333
8,Adventure,2024,8.61
9,Indie,2020,8.53


When you write
df.groupby(['genre', 'release_year'])


you are telling pandas:

👉 “Do NOT mix Action games from different years.”
👉 “Treat Action-2020 and Action-2021 as different groups.”

So pandas returns one row per (genre, release_year) pair.

When you write
df.groupby('genre')


you are telling pandas:

👉 “I don’t care about the year.”
👉 “Put ALL Action games together.”

So it gives only one row per genre.
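
The difference in row counts is easy to check on a small table. This is a sketch with a hypothetical five-row `games` DataFrame; the column names mirror the dataset above, but the values are invented.

```python
import pandas as pd

# Hypothetical mini version of the games table (illustrative values only).
games = pd.DataFrame({
    "genre": ["Action", "Action", "Action", "RPG", "RPG"],
    "release_year": [2020, 2020, 2021, 2021, 2021],
    "rating": [6.0, 8.0, 7.0, 9.0, 8.0],
})

# Two keys: one row per (genre, release_year) combination that exists.
by_genre_year = (
    games.groupby(["genre", "release_year"])["rating"]
    .mean()
    .reset_index()
)

# One key: years are pooled, so there is one row per genre.
by_genre = games.groupby("genre")["rating"].mean().reset_index()

print(by_genre_year)
print(len(by_genre_year), "rows with two keys")
print(by_genre)
print(len(by_genre), "rows with one key")
```

Note that only combinations present in the data appear: there is no RPG-2020 row because no such game exists in the toy table.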

In [18]:
print(df.columns)

Index(['game_id', 'title', 'genre', 'release_year', 'units_sold_m', 'rating'], dtype='object')


In [19]:
df.groupby(['genre'])['units_sold_m'].sum()

genre
Action         80.7
Adventure     109.0
Indie          89.1
RPG           135.5
Simulation    163.3
Strategy       43.2
Name: units_sold_m, dtype: float64

| transform           | mean                |
| ------------------- | ------------------- |
| keeps original rows | makes summary table |
| same length         | smaller             |
| used for filling    | used for reporting  |
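
The contrast in the table can be seen side by side in a short sketch. The `games` DataFrame here is hypothetical, with one missing rating added on purpose, and `rating_filled` is an illustrative column name rather than something from the original notebook.

```python
import pandas as pd

# Hypothetical data with one missing rating (illustrative values only).
games = pd.DataFrame({
    "genre": ["Action", "Action", "RPG", "RPG"],
    "rating": [6.0, None, 9.0, 7.0],
})

# mean(): a smaller summary table, one value per genre.
summary = games.groupby("genre")["rating"].mean()

# transform('mean'): same length as the original, each row gets its group's mean.
per_row = games.groupby("genre")["rating"].transform("mean")

# Because the lengths match, per_row can fill gaps row by row.
games["rating_filled"] = games["rating"].fillna(per_row)

print(summary)
print(per_row)
print(games)
```

Here `summary` has two entries while `per_row` has four, and the missing Action rating is replaced by the Action average.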
