In [None]:
# Simpson's Paradox Exercise

# A baseball batter, Tim, has a better overall batting average than his teammate Frank. However,
# someone notices that Frank has a better batting average than Tim against both right-handed and
# left-handed pitchers. How can this happen?

In [10]:
# 1. Dataset Creation

import pandas as pd

data = {
    'pitcher_side': ['L', 'L', 'R', 'R'], # L (left-handed) or R (right-handed).
    'batter_name': ['Tim', 'Frank', 'Tim', 'Frank'], # "Tim" or "Frank".
    'hits': [10, 8, 90, 31], # number of successful hits.
    'at_bats': [55, 40, 300, 100], # total attempts (plate appearances that count toward batting average).
    'success_rates': [10/55, 8/40, 90/300, 31/100]
}

df = pd.DataFrame(data)
df

  pitcher_side batter_name  hits  at_bats  success_rates
0            L         Tim    10       55       0.181818
1            L       Frank     8       40       0.200000
2            R         Tim    90      300       0.300000
3            R       Frank    31      100       0.310000


In [13]:
grouped = df.groupby(['pitcher_side', 'batter_name'])['success_rates'].mean()
print("\nSubgroup (L/R) success rate:")
print(grouped)


Subgroup (L/R) success rate:
pitcher_side  batter_name
L             Frank          0.200000
              Tim            0.181818
R             Frank          0.310000
              Tim            0.300000
Name: success_rates, dtype: float64


In [None]:
# In each subgroup, Frank has a better batting average.

In [11]:
# Overall totals per batter. We'll ignore pitcher side and look at each batter's overall hits and at_bats combined.

overall = df.groupby('batter_name')[['hits','at_bats']].sum()
overall['overall_batting_avg'] = overall['hits']/overall['at_bats']
print("\nOverall batting averages, aggregated across L and R:")
print(overall)


Overall batting averages, aggregated across L and R:
             hits  at_bats  overall_batting_avg
batter_name                                    
Frank          39      140             0.278571
Tim           100      355             0.281690


In [None]:
# Observe the Paradox
# Within Each Pitcher Side: Frank has the higher average (0.20 > 0.18 vs lefties, 0.31 > 0.30 vs righties).
# Overall: Tim has a higher average (0.2817 vs 0.2786).
# This reversal, where the batter who leads in every subgroup trails once the subgroups are combined, is Simpson’s Paradox.
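
# The reversal can also be checked programmatically rather than by reading the two tables.
# The cell below is a minimal sketch: it rebuilds the same counts so that it runs on its own,
# and the names by_side, frank_wins_each_side and tim_wins_overall are illustrative, not part of the original exercise.

```python
import pandas as pd

# Same counts as the dataset above, rebuilt so this cell is self-contained.
df = pd.DataFrame({
    'pitcher_side': ['L', 'L', 'R', 'R'],
    'batter_name': ['Tim', 'Frank', 'Tim', 'Frank'],
    'hits': [10, 8, 90, 31],
    'at_bats': [55, 40, 300, 100],
})
df['avg'] = df['hits'] / df['at_bats']

# One row per pitcher side, one column per batter.
by_side = df.pivot(index='pitcher_side', columns='batter_name', values='avg')
frank_wins_each_side = bool((by_side['Frank'] > by_side['Tim']).all())

# Pool hits and at-bats before dividing; averaging the two rates would be wrong.
totals = df.groupby('batter_name')[['hits', 'at_bats']].sum()
overall = totals['hits'] / totals['at_bats']
tim_wins_overall = bool(overall['Tim'] > overall['Frank'])

print("Frank ahead against every pitcher side:", frank_wins_each_side)
print("Tim ahead overall:", tim_wins_overall)
```

# Both flags being true at the same time is what makes this dataset an instance of the paradox.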

# Why Does It Happen Here?
# Both batters hit better against right-handed pitchers (about 0.30) than against left-handed pitchers (about 0.20).
# Tim faced right-handers in 300 of his 355 at-bats (about 85%), while Frank did so in only 100 of his 140 (about 71%).
# An overall average is a weighted average of the subgroup averages, weighted by at-bats, so Tim’s overall figure sits close to his 0.300 against right-handers, while Frank’s is pulled further down by his 0.200 against left-handers.
# Frank’s edge within each group (about 0.018 against left-handers, 0.010 against right-handers) is too small to offset this difference in mix, so Tim ends up slightly ahead overall.
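
# The role of at-bat volume described above can be made explicit by writing each overall average
# as a weighted sum of the per-side averages. The cell below is a sketch under stated assumptions:
# it rebuilds the same counts, and the "pooled mix" comparison (giving both batters the same
# left/right split, taken from the combined at-bats) is an illustrative choice, not part of the original exercise.

```python
import pandas as pd

# Same counts as the dataset above, rebuilt so this cell is self-contained.
df = pd.DataFrame({
    'pitcher_side': ['L', 'L', 'R', 'R'],
    'batter_name': ['Tim', 'Frank', 'Tim', 'Frank'],
    'hits': [10, 8, 90, 31],
    'at_bats': [55, 40, 300, 100],
})
df['avg'] = df['hits'] / df['at_bats']

# Share of each batter's own at-bats that came against each pitcher side.
df['share'] = df['at_bats'] / df.groupby('batter_name')['at_bats'].transform('sum')

# Overall average = sum over sides of (share * per-side average).
df['contribution'] = df['share'] * df['avg']
reconstructed = df.groupby('batter_name')['contribution'].sum()

# Hypothetical comparison: give both batters the same left/right mix (the pooled one).
pooled = df.groupby('pitcher_side')['at_bats'].sum() / df['at_bats'].sum()
df['pooled_share'] = df['pitcher_side'].map(pooled)
standardized = (df['avg'] * df['pooled_share']).groupby(df['batter_name']).sum()

print(df[['batter_name', 'pitcher_side', 'avg', 'share']].round(3))
print("\nOverall average rebuilt from weights:")
print(reconstructed.round(4))
print("\nAverage under a common left/right mix:")
print(standardized.round(4))
```

# With each batter's own mix, the weighted sums reproduce the overall averages from the earlier cell.
# With a common mix, Frank comes out ahead, which suggests the reversal is driven by the different shares of at-bats rather than by hitting ability.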