<a href="https://colab.research.google.com/github/Bassel20/Simpson-s_Paradox/blob/main/simpsons_paradox.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Simpson's Paradox**
By: Bassel Sherif <br>
The purpose of this notebook is to demonstrate one of the most interesting phenomena in statistics. <br>
**Simpson's paradox** refers to a phenomenon where a trend appears in several different groups of data but disappears or reverses when these groups are combined. It results from confounding: the relationship between two variables is obscured by a third variable that is related to both. The paradox can lead to incorrect conclusions, which is why all relevant variables need to be considered when analyzing data. <br>
This notebook demonstrates the phenomenon using a college admissions dataset, `admission_data.csv`. <br>

In [None]:
# Load and view first few lines of dataset
import numpy as np
import pandas as pd

df = pd.read_csv("admission_data.csv")
df.head(8)


Unnamed: 0,student_id,gender,major,admitted
0,35377,female,Chemistry,False
1,56105,male,Physics,True
2,31441,female,Chemistry,False
3,51765,male,Physics,True
4,53714,female,Physics,True
5,50693,female,Chemistry,False
6,25946,male,Physics,True
7,27648,female,Chemistry,True


### Proportion and admission rate for each gender

In [None]:
# Proportion of students that are female
# (the mean of a boolean mask is the fraction of True values)
female_proportion = df['gender'].eq('female').mean()
female_proportion

0.51400000000000001

In [None]:
# Proportion of students that are male
# (the mean of a boolean mask is the fraction of True values)
male_proportion = df['gender'].eq('male').mean()
male_proportion

0.48599999999999999
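
Both proportions can also be read off in a single call with `value_counts(normalize=True)`. The sketch below is a minimal illustration on a made-up `gender` Series (a stand-in I am assuming here, not the real column from `admission_data.csv`); applying the same call to `df['gender']` should give the two values above.

```python
import pandas as pd

# Hypothetical toy column standing in for df['gender'].
gender = pd.Series(["female", "female", "female", "male", "male"], name="gender")

# normalize=True returns fractions of the total instead of raw counts.
proportions = gender.value_counts(normalize=True)
print(proportions)
```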

In [None]:
# Admission rate for females
df.loc[df['gender'].eq('female'), 'admitted'].mean()

0.28793774319066145

In [None]:
# Admission rate for males
df.loc[df['gender'].eq('male'), 'admitted'].mean()

0.48559670781893005
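
The two admission rates above can also be computed together with `groupby`. This is a minimal sketch on a small made-up DataFrame (`toy` is hypothetical and only reuses the notebook's column names), so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's df.
toy = pd.DataFrame({
    "gender":   ["female", "female", "female", "female",
                 "male", "male", "male", "male"],
    "admitted": [True, False, False, False,
                 True, True, False, False],
})

# The mean of a boolean column is the fraction of True values,
# so this yields one admission rate per gender.
rate_by_gender = toy.groupby("gender")["admitted"].mean()
print(rate_by_gender)
```

On the real data, `df.groupby('gender')['admitted'].mean()` would be the equivalent call.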

### Proportion and admission rate for physics majors of each gender

In [None]:
# What proportion of female students are majoring in physics?
df2 = df.loc[df['gender'].eq('female')]
# Build the mask from df2 itself so it matches the rows being filtered
x = df2.loc[df2['major'].eq('Physics')].count()
x / df2.count()

student_id    0.120623
gender        0.120623
major         0.120623
admitted      0.120623
dtype: float64

In [None]:
# What proportion of male students are majoring in physics?
df2 = df.loc[df['gender'].eq('male')]
x = df2.loc[df2['major'].eq('Physics')].count()
x / df2.count()

student_id    0.925926
gender        0.925926
major         0.925926
admitted      0.925926
dtype: float64

In [None]:
# Admission rate for female physics majors
df2 = df.loc[df['gender'].eq('female')]
df2.loc[df2['major'].eq('Physics'), 'admitted'].mean()

0.7419354838709677

In [None]:
# Admission rate for male physics majors
df2 = df.loc[df['gender'].eq('male')]
df2.loc[df2['major'].eq('Physics'), 'admitted'].mean()

0.51555555555555554

### Proportion and admission rate for chemistry majors of each gender

In [None]:
# What proportion of female students are majoring in chemistry?
df2 = df.loc[df['gender'].eq('female')]
x = df2.loc[df2['major'].eq('Chemistry')].count()
x / df2.count()

student_id    0.879377
gender        0.879377
major         0.879377
admitted      0.879377
dtype: float64

In [None]:
# What proportion of male students are majoring in chemistry?
df2 = df.loc[df['gender'].eq('male')]
x = df2.loc[df2['major'].eq('Chemistry')].count()
x / df2.count()

student_id    0.074074
gender        0.074074
major         0.074074
admitted      0.074074
dtype: float64

In [None]:
# Admission rate for female chemistry majors
df2 = df.loc[df['gender'].eq('female')]
df2.loc[df2['major'].eq('Chemistry'), 'admitted'].mean()

0.22566371681415928

In [None]:
# Admission rate for male chemistry majors
df2 = df.loc[df['gender'].eq('male')]
df2.loc[df2['major'].eq('Chemistry'), 'admitted'].mean()

0.1111111111111111
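
The per-major, per-gender rates computed cell by cell above can also be laid out as one table, which makes the comparison easier to read. The sketch below assumes made-up counts (the `counts` dictionary is hypothetical and not taken from `admission_data.csv`) and expands them into a toy DataFrame with the notebook's column names.

```python
import pandas as pd

# Hypothetical counts, NOT from admission_data.csv:
# (gender, major) -> (number of applicants, number admitted)
counts = {
    ("female", "Physics"):   (10, 8),
    ("female", "Chemistry"): (90, 27),
    ("male", "Physics"):     (90, 63),
    ("male", "Chemistry"):   (10, 2),
}

# Expand the counts into one row per applicant.
rows = []
for (gender, major), (n_applicants, n_admitted) in counts.items():
    for i in range(n_applicants):
        rows.append({"gender": gender, "major": major, "admitted": i < n_admitted})
toy = pd.DataFrame(rows)

# Admission rate for every (major, gender) pair, majors as rows, genders as columns.
by_major = toy.groupby(["major", "gender"])["admitted"].mean().unstack("gender")
print(by_major)
```

With the real data, replacing `toy` with `df` in the `groupby` line should reproduce the four rates from the cells above in a single table.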

### Admission rate for each major

In [None]:
# Admission rate for physics majors
df.loc[df['major'].eq('Physics'), 'admitted'].mean()

0.54296875

In [None]:
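# Admission rate for chemistry majors
# numeric_only=True skips the string columns (gender, major), which recent pandas versions refuse to average
df.query('major == "Chemistry"').mean(numeric_only=True)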

student_id    42189.500000
admitted          0.217213
dtype: float64
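
The results above show why the overall rates can reverse: most female applicants chose chemistry, the major with the lower admission rate, while most male applicants chose physics. Each gender's overall rate is a weighted average of its per-major rates, with the weights being the share of that gender applying to each major. The sketch below illustrates this on made-up counts (the `counts` table is hypothetical, not the notebook's data) in which women have the higher rate in every major but the lower rate overall.

```python
import pandas as pd

# Hypothetical counts per (gender, major); not taken from admission_data.csv.
counts = pd.DataFrame({
    "gender":     ["female", "female", "male", "male"],
    "major":      ["Physics", "Chemistry", "Physics", "Chemistry"],
    "applicants": [10, 90, 90, 10],
    "admitted":   [8, 27, 63, 2],
})

# Admission rate within each (gender, major) cell.
counts["rate"] = counts["admitted"] / counts["applicants"]
# Share of each gender's applicants that chose the major.
counts["share"] = counts["applicants"] / counts.groupby("gender")["applicants"].transform("sum")

# Overall rate per gender as a weighted average: sum over majors of share * rate.
weighted = (counts["share"] * counts["rate"]).groupby(counts["gender"]).sum()

# Direct computation for comparison: total admitted / total applicants.
totals = counts.groupby("gender")[["admitted", "applicants"]].sum()
direct = totals["admitted"] / totals["applicants"]

print(counts)
print(weighted)
print(direct)
```

In this toy table the female rate is higher within each major, yet the weighted average comes out lower for women because 90% of them applied to the major with the lower rate.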