# Simpson's Paradox
Use `admission_data.csv` for this exercise.

In [1]:
import pandas as pd
# Load and view first few lines of dataset
df = pd.read_csv("admission_data.csv")
df.head()

Unnamed: 0,student_id,gender,major,admitted
0,35377,female,Chemistry,False
1,56105,male,Physics,True
2,31441,female,Chemistry,False
3,51765,male,Physics,True
4,53714,female,Physics,True


### Proportion and admission rate for each gender

In [2]:
# Proportion of students that are female
female = df[df["gender"] == "female"]
print("Number of female applicants is", len(female))
prop_female = len(female)/len(df)
prop_female

Number of female applicants is 257


0.514

In [3]:
# Proportion of students that are male
male = df[df["gender"] == "male"]
print("Number of male applicants is", len(male))
prop_male = len(male)/len(df)
prop_male

Number of male applicants is 243


0.486

In [6]:
# Admission rate for females
admitted_female = female["admitted"].sum()  # True counts as 1, so the sum is the number admitted
admission_rate_female = admitted_female/len(female)
admission_rate_female

0.28793774319066145

In [7]:
# Admission rate for males
admitted_male = male["admitted"].sum()  # True counts as 1, so the sum is the number admitted
admission_rate_male = admitted_male/len(male)
admission_rate_male

0.48559670781893005
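
Because `admitted` is a boolean column, its mean is the admission rate, so the per-gender rates can also be computed in one step with `groupby`. The sketch below uses a small hypothetical DataFrame (`toy`) instead of `admission_data.csv`, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with the same column names as admission_data.csv
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male", "male", "male"],
    "admitted": [True, False, False, True, True, False, False],
})

# The mean of a boolean column is the fraction of True values
rate_by_gender = toy.groupby("gender")["admitted"].mean()
print(rate_by_gender)
```

Applied to the real data, the same pattern would be `df.groupby("gender")["admitted"].mean()`.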

### Proportion and admission rate for physics majors of each gender

In [8]:
# What proportion of female students are majoring in physics?
female_physics = female[female["major"] == "Physics"]
prop_female_physics = len(female_physics)/len(female)
prop_female_physics

0.12062256809338522

In [9]:
# What proportion of male students are majoring in physics?
male_physics = male[male["major"] == "Physics"]
prop_male_physics = len(male_physics)/len(male)
prop_male_physics

0.9259259259259259

In [10]:
# Admission rate for female physics majors (normalize=True returns proportions)
admitted_female_physics = female_physics.admitted.value_counts(normalize=True)
admitted_female_physics

True     0.741935
False    0.258065
Name: admitted, dtype: float64

In [11]:
# Admission rate for male physics majors
admitted_male_physics = male_physics.admitted.value_counts(normalize=True)
admitted_male_physics

True     0.515556
False    0.484444
Name: admitted, dtype: float64

### Proportion and admission rate for chemistry majors of each gender

In [12]:
# What proportion of female students are majoring in chemistry?
female_chemistry = female[female["major"] == "Chemistry"]
prop_female_chemistry = len(female_chemistry)/len(female)
prop_female_chemistry

0.8793774319066148

In [13]:
# What proportion of male students are majoring in chemistry?
male_chemistry = male[male["major"] == "Chemistry"]
prop_male_chemistry = len(male_chemistry)/len(male)
prop_male_chemistry

0.07407407407407407

In [14]:
# Admission rate for female chemistry majors
admitted_female_chemistry = female_chemistry.admitted.value_counts(normalize=True)
admitted_female_chemistry

False    0.774336
True     0.225664
Name: admitted, dtype: float64

In [15]:
# Admission rate for male chemistry majors
admitted_male_chemistry = male_chemistry.admitted.value_counts(normalize=True)
admitted_male_chemistry

False    0.888889
True     0.111111
Name: admitted, dtype: float64

### Admission rate for each major

In [16]:
# Admission rate for physics majors
total_physics = df[df["major"] == "Physics"]
# Count admitted applicants (True counts as 1)
admitted_total_physics = total_physics["admitted"].sum()
admit_rate_physics = admitted_total_physics/len(total_physics)
admit_rate_physics

0.54296875

In [17]:
# Admission rate for chemistry majors
total_chemistry = df[df["major"] == "Chemistry"]
admitted_total_chemistry = total_chemistry["admitted"].sum()  # True counts as 1
admit_rate_chemistry = admitted_total_chemistry/len(total_chemistry)
admit_rate_chemistry

0.21721311475409835

At first glance, males appear to have a higher admission rate than females (48.6% versus 28.8%), even though slightly more females applied (257 versus 243). Broken down by major, however, females have the higher admission rate in both. In Physics, 74.2% of female applicants were admitted against 51.6% of male applicants, even though far more males applied to Physics. In Chemistry, 22.6% of female applicants were admitted against 11.1% of male applicants, and here far more females applied. The reversal comes from the mix of majors: about 88% of female applicants chose Chemistry, where only about 22% of all applicants were admitted, while about 93% of male applicants chose Physics, where admission rates were much higher for both genders.
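
The pooled rate for each gender is a weighted average of its per-major rates, weighted by the share of that gender's applicants in each major. A minimal sketch of that arithmetic follows, using applicant and admit counts reconstructed from the proportions printed in the cells above. The dictionary `counts` and the helper `overall_rate` are illustrative names, not part of the original notebook.

```python
# (applicants, admitted) per gender and major, reconstructed from the
# proportions printed in the notebook outputs (an assumption, not raw data)
counts = {
    "female": {"Physics": (31, 23), "Chemistry": (226, 51)},
    "male": {"Physics": (225, 116), "Chemistry": (18, 2)},
}

def overall_rate(by_major):
    """Pooled rate as a weighted average of the per-major rates."""
    total = sum(n for n, _ in by_major.values())
    # (n / total) is the share of applicants in the major, (k / n) its rate
    return sum((n / total) * (k / n) for n, k in by_major.values())

for gender, by_major in counts.items():
    print(gender, round(overall_rate(by_major), 3))
# female 0.288
# male 0.486
```

The female average is dominated by the low Chemistry rate and the male average by the higher Physics rate, which is how the pooled comparison ends up pointing the opposite way from both per-major comparisons.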

This is Simpson's paradox at play: a trend that appears in the pooled data reverses once the data are split by a third variable. Here, major is a confounding variable, and data analysts should always be wary of confounders when relating explanatory variables to a response variable.
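
One way to check for this kind of confounding is to put the stratified and pooled rates side by side. The sketch below rebuilds a 500-row DataFrame from the counts implied by the outputs above, since it does not read `admission_data.csv`. The names `cells`, `rebuilt`, and `table` are hypothetical.

```python
import pandas as pd

# (gender, major, applicants, admitted), reconstructed from the notebook
# outputs; an assumption standing in for the real CSV
cells = [
    ("female", "Physics", 31, 23),
    ("female", "Chemistry", 226, 51),
    ("male", "Physics", 225, 116),
    ("male", "Chemistry", 18, 2),
]
rows = []
for gender, major, n, k in cells:
    # k admitted rows followed by (n - k) rejected rows for this cell
    rows += [(gender, major, True)] * k + [(gender, major, False)] * (n - k)
rebuilt = pd.DataFrame(rows, columns=["gender", "major", "admitted"])

# Admission rate by gender within each major (one column per major)
table = rebuilt.groupby(["gender", "major"])["admitted"].mean().unstack("major")
# Pooled rate per gender, ignoring major
table["pooled"] = rebuilt.groupby("gender")["admitted"].mean()
print(table.round(3))
```

If the ordering of the two genders in the `pooled` column differs from their ordering in every per-major column, the pooled comparison is being driven by the mix of majors rather than by the rates within them.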