# Exploring Biased Data via Penguins

Many popular machine learning algorithms are affected by bias. For example, facial recognition algorithms often work better on people with lighter skin than on people with darker skin. The potential harm is massive: someone could be wrongly arrested because of faulty facial recognition.

There are many possible sources of these issues, but one notable source is bias in the training data. For example, many computer vision algorithms are trained on majority-white datasets, which leads to them working well for white people but poorly for people of color.

Below, we will illustrate this issue through a somewhat silly example. We will consider a dataset containing biological measurements of penguins belonging to three different species and will train a model to predict which species a given penguin is. When the training data is evenly divided between male and female penguins, the model should perform about equally well for both sexes. However, this changes if we make the training data predominantly one sex: the model then works less well when we test it on the other sex.

## Introduction

To start with, we are going to do a condensed version of a discussion exercise that I use in my Intro to Python with Applications class.

Let's begin by importing all the libraries we'll need, and by downloading the penguins dataset:

*If you experience `ConnectionRefused` errors when doing this, instead copy/paste the url into your browser. Save the data in the same directory as this notebook in a file called `penguins.csv`, and then replace `url` with `"penguins.csv"` in the block below.* 

In [1]:
#import needed libraries and read in data
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import tree, preprocessing
import numpy as np
url = "https://philchodrow.github.io/PIC16A/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
penguins

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,11/11/07,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.
1,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,11/11/07,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
2,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,11/16/07,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
3,PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,11/16/07,,,,,,,,Adult not sampled.
4,PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,11/16/07,36.7,19.3,193.0,3450.0,FEMALE,8.76651,-25.32426,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
339,PAL0910,120,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N38A2,No,12/1/09,,,,,,,,
340,PAL0910,121,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N39A1,Yes,11/22/09,46.8,14.3,215.0,4850.0,FEMALE,8.41151,-26.13832,
341,PAL0910,122,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N39A2,Yes,11/22/09,50.4,15.7,222.0,5750.0,MALE,8.30166,-26.04117,
342,PAL0910,123,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N43A1,Yes,11/22/09,45.2,14.8,212.0,5200.0,FEMALE,8.24246,-26.11969,


In [2]:
#keep only the columns we need, then drop nans and one row where the sex was not recorded
penguins=penguins[["Species","Sex","Flipper Length (mm)", "Body Mass (g)"]]
penguins=penguins.dropna()
penguins=penguins[penguins["Sex"]!="."]
penguins

Unnamed: 0,Species,Sex,Flipper Length (mm),Body Mass (g)
0,Adelie Penguin (Pygoscelis adeliae),MALE,181.0,3750.0
1,Adelie Penguin (Pygoscelis adeliae),FEMALE,186.0,3800.0
2,Adelie Penguin (Pygoscelis adeliae),FEMALE,195.0,3250.0
4,Adelie Penguin (Pygoscelis adeliae),FEMALE,193.0,3450.0
5,Adelie Penguin (Pygoscelis adeliae),MALE,190.0,3650.0
...,...,...,...,...
338,Gentoo penguin (Pygoscelis papua),FEMALE,214.0,4925.0
340,Gentoo penguin (Pygoscelis papua),FEMALE,215.0,4850.0
341,Gentoo penguin (Pygoscelis papua),MALE,222.0,5750.0
342,Gentoo penguin (Pygoscelis papua),FEMALE,212.0,5200.0


# Training a model to predict the species

Run the next cell. It seeds NumPy's random number generator so that the random values your code generates are the same every time you run it.

In [3]:
np.random.seed(3354354524)
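As a quick check of what seeding does, here is a minimal sketch (the array and variable names are illustrative, not part of the notebook). It re-seeds NumPy's global generator and confirms that the same draws come back. It also shows an alternative: `train_test_split` accepts a `random_state` argument, which pins the split without relying on the global seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same seed, same sequence of random numbers
np.random.seed(3354354524)
first_draw = np.random.rand(3)
np.random.seed(3354354524)
second_draw = np.random.rand(3)
same_draws = np.array_equal(first_draw, second_draw)

# Alternative: fix the split itself via random_state
toy = np.arange(10)  # stand-in data for illustration only
a_train, a_test = train_test_split(toy, test_size=0.5, random_state=0)
b_train, b_test = train_test_split(toy, test_size=0.5, random_state=0)
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)

print(same_draws, same_split)
```

Note that the global seed only helps if the cells are run in the same order each time, since every random call advances the generator.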

### Preprocessing

In [5]:
from sklearn import preprocessing

le=preprocessing.LabelEncoder()

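def prep_data(df):
    """
    Prepares a dataframe for ML by
    converting text columns to ints
    """
    
    #features are everything except the target column
    X=df.drop(["Species"],axis=1)
    y=df["Species"]
    
    #note: the encoder is re-fit on every call, so the integer codes
    #depend on which categories are present in df
    X["Sex"]=le.fit_transform(X["Sex"])
    y=le.fit_transform(y)
    return X,y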

X,y=prep_data(penguins)

To make sure that you know what is going on, look at your `X` and `y` variables by running the next cells.

In [6]:
X

Unnamed: 0,Sex,Flipper Length (mm),Body Mass (g)
0,1,181.0,3750.0
1,0,186.0,3800.0
2,0,195.0,3250.0
4,0,193.0,3450.0
5,1,190.0,3650.0
...,...,...,...
338,0,214.0,4925.0
340,0,215.0,4850.0
341,1,222.0,5750.0
342,0,212.0,5200.0


In [7]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
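
To read the integers in `y`, it helps to know that `LabelEncoder` assigns codes in sorted order of the class names. The sketch below uses a separate encoder, `demo_le`, on a tiny hand-written list. The Chinstrap label string is an assumption about how the dataset spells it, since that species does not appear in the rows displayed above.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written labels for illustration; the Chinstrap string is assumed
species = [
    "Gentoo penguin (Pygoscelis papua)",
    "Adelie Penguin (Pygoscelis adeliae)",
    "Chinstrap penguin (Pygoscelis antarctica)",
    "Adelie Penguin (Pygoscelis adeliae)",
]

demo_le = LabelEncoder()
codes = demo_le.fit_transform(species)

# classes_ holds the sorted class names; position i is the label for code i
for code, name in enumerate(demo_le.classes_):
    print(code, name)

# inverse_transform maps integer codes back to the original strings
decoded = demo_le.inverse_transform([2, 0])
```

Under these assumptions, 0 is Adelie, 1 is Chinstrap, and 2 is Gentoo, which matches the run of 0s, then 1s, then 2s in the output above.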

### Train Model 

In [8]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.5)

In [9]:
T=tree.DecisionTreeClassifier(max_depth=2)
T.fit(X_train,y_train)
T.score(X_train,y_train), T.score(X_test,y_test)

(0.8313253012048193, 0.7784431137724551)
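
To see which features a depth-2 tree actually splits on (for instance, whether `Sex` is used at all), one option is sklearn's `tree.export_text`. The sketch below runs on synthetic stand-in data rather than the penguins, so the thresholds it prints say nothing about the real dataset. In the notebook you would pass the fitted `T` and `list(X_train.columns)` instead.

```python
import numpy as np
from sklearn import tree

# Synthetic stand-in data (NOT the penguins): two well-separated groups
rng = np.random.default_rng(0)
flipper = np.concatenate([rng.normal(190, 5, 50), rng.normal(217, 5, 50)])
mass = np.concatenate([rng.normal(3700, 300, 50), rng.normal(5000, 300, 50)])
X_demo = np.column_stack([flipper, mass])
y_demo = np.array([0] * 50 + [1] * 50)

clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_demo, y_demo)

# Text rendering of the fitted tree's decision rules
rules = tree.export_text(
    clf, feature_names=["Flipper Length (mm)", "Body Mass (g)"]
)
print(rules)
```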

# Does the model work equally well for male and female penguins?

### Split data into MALE and FEMALE

In [10]:
#split data
just_F=penguins[penguins["Sex"]=="FEMALE"]
just_M=penguins[penguins["Sex"]=="MALE"]

#shuffle data
just_F=just_F.sample(frac=1,random_state=1)
just_M=just_M.sample(frac=1,random_state=1)


In [12]:
just_M

Unnamed: 0,Species,Sex,Flipper Length (mm),Body Mass (g)
86,Adelie Penguin (Pygoscelis adeliae),MALE,190.0,3800.0
271,Gentoo penguin (Pygoscelis papua),MALE,220.0,5300.0
14,Adelie Penguin (Pygoscelis adeliae),MALE,198.0,4400.0
303,Gentoo penguin (Pygoscelis papua),MALE,224.0,5350.0
239,Gentoo penguin (Pygoscelis papua),MALE,222.0,5350.0
...,...,...,...,...
273,Gentoo penguin (Pygoscelis papua),MALE,225.0,5000.0
281,Gentoo penguin (Pygoscelis papua),MALE,221.0,5300.0
151,Adelie Penguin (Pygoscelis adeliae),MALE,201.0,4000.0
287,Gentoo penguin (Pygoscelis papua),MALE,229.0,5800.0


In [32]:
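#Number of MALES and FEMALES in training data
fraction_female=.9      #fraction of the training set that is female
training_percentage=.5  #fraction (not percent) of all penguins used for training

F_samples=int(len(penguins)*fraction_female*training_percentage)
M_samples=int(len(penguins)*training_percentage)-F_samples

#the data was shuffled above, so taking the first rows is a random sample
F_train=just_F.iloc[0:F_samples]
M_train=just_M.iloc[0:M_samples]

#everything not used for training goes into the test set
F_test=just_F.iloc[F_samples:]
M_test=just_M.iloc[M_samples:]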



In [33]:
len(F_train),len(M_train),len(F_test),len(M_test)

(149, 17, 16, 151)
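
The counts above follow directly from the arithmetic in the previous cell. Here is a small sketch that reproduces it for a few values of `fraction_female`. The helper `split_counts` and the constants are illustrative. The totals (333 penguins, 165 female, 168 male) are inferred from the lengths printed above.

```python
def split_counts(n_total, fraction_female, training_percentage):
    """Return (F_samples, M_samples) using the same formula as the cell above."""
    n_female = int(n_total * fraction_female * training_percentage)
    n_male = int(n_total * training_percentage) - n_female
    return n_female, n_male

# Inferred from the output above: 149 + 16 females, 17 + 151 males
N_TOTAL, N_FEMALE, N_MALE = 333, 165, 168

for frac in (0.1, 0.5, 0.9, 1.0):
    f, m = split_counts(N_TOTAL, frac, 0.5)
    fits = f <= N_FEMALE and m <= N_MALE
    print(frac, f, m, "ok" if fits else "request exceeds available rows")
```

The last case is worth noticing when experimenting: if the requested count is larger than the number of penguins of that sex, the `iloc` slice silently returns fewer rows, so the training set ends up smaller than intended and the test set for that sex is empty.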

In [34]:
#build train and test
train=pd.concat([F_train,M_train])
test=pd.concat([F_test,M_test])

In [35]:
X_train,y_train=prep_data(train)
X_test,y_test=prep_data(test)


In [36]:
y_test

array([2, 2, 2, 1, 0, 0, 2, 0, 2, 2, 1, 2, 2, 0, 2, 0, 2, 2, 2, 0, 2, 0,
       0, 0, 2, 0, 0, 1, 0, 0, 1, 1, 0, 2, 1, 0, 2, 0, 1, 0, 1, 1, 0, 2,
       0, 2, 2, 2, 0, 1, 0, 0, 0, 2, 0, 0, 2, 1, 0, 1, 0, 2, 0, 2, 0, 1,
       0, 2, 2, 2, 2, 0, 0, 0, 2, 0, 0, 2, 1, 0, 0, 1, 1, 0, 0, 1, 2, 2,
       2, 2, 1, 0, 0, 0, 2, 0, 1, 2, 2, 0, 1, 0, 1, 0, 0, 2, 1, 2, 2, 2,
       1, 1, 2, 2, 2, 0, 0, 1, 0, 2, 2, 1, 0, 0, 1, 0, 2, 0, 0, 2, 0, 2,
       2, 0, 2, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 0, 0, 2, 2, 2,
       1, 0, 0, 2, 0, 2, 2, 1, 2, 2, 0, 2, 0])

In [37]:
#train model
T.fit(X_train,y_train)
T.score(X_train,y_train), T.score(X_test,y_test)

(0.8072289156626506, 0.6826347305389222)

In [38]:
### Scores on subsets
def score_on_subset(subset,C):
    """
    Computes the accuracy of a fitted machine learning model, C,
    on a subset of the data, subset
    """
    subset_X,subset_y=prep_data(subset)
    return C.score(subset_X,subset_y)

In [39]:
print("Percent Female: "+str(fraction_female*100)+"%")
print("Female Score: "+str(score_on_subset(F_test,T)))
print("Male Score: " + str(score_on_subset(M_test,T)))

Percent Female: 90.0%
Female Score: 0.8125
Male Score: 0.6688741721854304
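
One thing to keep in mind when reading these scores: `prep_data` re-fits the `LabelEncoder` on whatever dataframe it receives. On a single-sex subset such as `M_test`, the `Sex` column contains only one category, and that category is encoded as 0. In a mixed dataframe, `MALE` is encoded as 1. Whether this changes the scores above depends on whether the tree splits on `Sex`, which is not shown here. The sketch below demonstrates the behavior and one possible alternative: a fixed mapping, here called `SEX_CODES`, which is a hypothetical name and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

mixed = pd.Series(["MALE", "FEMALE", "MALE"])
males_only = pd.Series(["MALE", "MALE"])

# Fit on both categories: sorted order gives FEMALE=0, MALE=1
enc_mixed = LabelEncoder().fit_transform(mixed)

# Fit on a single category: MALE is the only class, so it becomes 0
enc_males = LabelEncoder().fit_transform(males_only)

# Hypothetical alternative: a fixed mapping that does not depend on the subset
SEX_CODES = {"FEMALE": 0, "MALE": 1}
fixed_males = males_only.map(SEX_CODES)

print(list(enc_mixed), list(enc_males), list(fixed_males))
```

The same caution applies to the species labels if a subset happens to be missing one of the three species.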


# Questions for self-experiment


1. How unbalanced does the training data need to be in order to create problems?
2. The model performs better on males even with the even training split. Is this random chance? If not, what else could cause these issues?
3. What `fraction_female` leads to the most equitable performance in both sexes?