# Exploring Biased Data via Penguins

Many popular machine learning algorithms are affected by bias. For example, facial recognition algorithms often work better on people with lighter skin than on people with darker skin. The potential harm is massive: someone could be wrongly arrested because of faulty facial recognition.

There are many possible sources of these issues, but one notable source is bias in the training data. For example, many computer vision algorithms are trained on majority-white datasets, which leads to them working well for white people but poorly for people of color.

Below, we will illustrate this issue through a somewhat silly example. We will consider a dataset containing biological measurements of penguins belonging to three different species and train a model to predict which species a given penguin belongs to. When the training data is evenly divided between male and female penguins, the model will perform about equally well for both sexes. However, if we make the training data predominantly male, the model will no longer work well when we test it on female penguins.

## Introduction

To start, we are going to do a condensed variation of a discussion exercise that I use in my Intro to Python with Applications class.

Let's begin by importing all the libraries we'll need, and by downloading the penguins dataset:

*If you experience `ConnectionRefused` errors when doing this, instead copy/paste the url into your browser. Save the data in the same directory as this notebook in a file called `penguins.csv`, and then replace `url` with `"penguins.csv"` in the block below.* 

In [1]:
#import needed libraries and read in data
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import tree, preprocessing
import numpy as np
url = "https://philchodrow.github.io/PIC16A/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
penguins

In [2]:
#keep only the columns we need
penguins = penguins[['Species', 'Flipper Length (mm)', 'Body Mass (g)', "Sex"]]
#drop nans and the one row where the sex was recorded as "."
penguins = penguins.dropna()
penguins = penguins[penguins["Sex"] != "."]
penguins
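Because the rest of this notebook depends on how the data is divided between the sexes, it can help to count the rows per sex once the cleaning is done. Here is a minimal sketch on a few made-up rows; the `toy` frame and its values are hypothetical stand-ins for `penguins`, not real measurements.

```python
import pandas as pd

# Hypothetical stand-in for the penguins table; the values are made up.
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "Sex": ["MALE", "FEMALE", "FEMALE", ".", None],
})

# Same cleaning steps as in the cell above: drop missing values,
# then drop the row whose sex was recorded as ".".
toy = toy.dropna()
toy = toy[toy["Sex"] != "."]

# Count the remaining rows per sex and per (species, sex) pair.
sex_counts = toy["Sex"].value_counts()
pair_counts = toy.groupby(["Species", "Sex"]).size()
print(sex_counts)
print(pair_counts)
```

Running the same two counting lines on `penguins` should show how balanced the real data is before any custom splitting.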

# Training a model to predict the species

Run the next cell. Seeding NumPy's random number generator ensures that your code generates the same random values every time you run it.

In [3]:
np.random.seed(3354354524)
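To see what the seed buys you, here is a small sketch: resetting the seed to the same value and repeating a random operation reproduces the same result. The use of `np.random.permutation` is only an illustration; any function that draws from NumPy's global generator would behave the same way.

```python
import numpy as np

# Seed, then draw a random ordering of the numbers 0..9.
np.random.seed(3354354524)
first = np.random.permutation(10)

# Re-seed with the same value and draw again.
np.random.seed(3354354524)
second = np.random.permutation(10)

# The two orderings are identical because the generator was reset.
print((first == second).all())
```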

### Preprocessing 

In [4]:
from sklearn import preprocessing

le=preprocessing.LabelEncoder()

def prep_data(df):
    """ 
    Prepares dataframe, df, for ML algorithms
    by converting strings to ints
    """

    pass
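The body of `prep_data` is left as an exercise. One possible way to fill it in is sketched below, assuming the function should return a feature table `X` (everything except `Species`) and a target `y` (the encoded `Species`). The name `prep_data_sketch`, the return convention, and the toy rows are assumptions made for illustration, not the original solution.

```python
import pandas as pd
from sklearn import preprocessing

def prep_data_sketch(df):
    """
    Hypothetical version of prep_data: converts the string columns
    to ints and returns the features X and the target y.
    """
    df = df.copy()  # avoid modifying the caller's dataframe
    le = preprocessing.LabelEncoder()
    # LabelEncoder assigns codes in sorted order of the labels it sees.
    df["Sex"] = le.fit_transform(df["Sex"])
    df["Species"] = le.fit_transform(df["Species"])
    X = df.drop(columns=["Species"])
    y = df["Species"]
    return X, y

# A few made-up rows standing in for the penguins table.
toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "Flipper Length (mm)": [181.0, 214.0, 195.0, 186.0],
    "Body Mass (g)": [3750.0, 4925.0, 3650.0, 3800.0],
    "Sex": ["MALE", "FEMALE", "MALE", "FEMALE"],
})
X, y = prep_data_sketch(toy)
print(X)
print(y)
```

One caveat with this approach: because the encoder is refit on every call, a subset that is missing a species would receive different integer codes than the full dataset. Fitting the encoder once on the full data and reusing it would avoid that.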

To make sure that you know what is going on, look at your `X` and `y` variables by running the next cells.

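| | Flipper Length (mm) | Body Mass (g) | Sex |
|---|---|---|---|
| 0 | 181.0 | 3750.0 | 1 |
| 1 | 186.0 | 3800.0 | 0 |
| 2 | 195.0 | 3250.0 | 0 |
| 4 | 193.0 | 3450.0 | 0 |
| 5 | 190.0 | 3650.0 | 1 |
| ... | ... | ... | ... |
| 338 | 214.0 | 4925.0 | 0 |
| 340 | 215.0 | 4850.0 | 0 |
| 341 | 222.0 | 5750.0 | 1 |
| 342 | 212.0 | 5200.0 | 0 |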


### Train model

# Does the model work equally well for male and female penguins?

### Custom version of train_test_split

In [5]:
def custom_split(fraction_female,training_percentage):

    #split data into male and female
    
    #shuffle data
    
    #Number of male and female samples
    
    #MALE and FEMALE training sets
    
    #MALE and FEMALE test sets

    #train and test set

    #prep data
    
    pass
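The comments in the stub above outline the steps; the sketch below is one hedged way to carry them out, not the original solution. Several choices here are assumptions: the function takes the dataframe as an explicit first argument, the total training size is `training_percentage` times the size of the smaller sex group (so any `fraction_female` between 0 and 1 is feasible), every row not used for training goes to the test set, and `F_test`/`M_test` are returned as unprepped dataframes. The helper `prep_sketch` and the synthetic `toy` data are hypothetical as well.

```python
import numpy as np
import pandas as pd

def prep_sketch(df):
    """Hypothetical stand-in for prep_data: Sex becomes 0/1, Species is the target."""
    X = df.drop(columns=["Species"]).assign(Sex=(df["Sex"] == "MALE").astype(int))
    y = df["Species"]
    return X, y

def custom_split_sketch(df, fraction_female, training_percentage):
    # split data into male and female, shuffling each group
    # (sample without random_state draws from NumPy's global generator)
    female = df[df["Sex"] == "FEMALE"].sample(frac=1)
    male = df[df["Sex"] == "MALE"].sample(frac=1)

    # number of male and female training samples (sizing rule is an assumption)
    n_train = int(training_percentage * min(len(female), len(male)))
    n_female = int(round(fraction_female * n_train))
    n_male = n_train - n_female

    # MALE and FEMALE training sets and test sets
    F_train, F_test = female.iloc[:n_female], female.iloc[n_female:]
    M_train, M_test = male.iloc[:n_male], male.iloc[n_male:]

    # combined train and test sets, prepped for the model
    X_train, y_train = prep_sketch(pd.concat([F_train, M_train]))
    X_test, y_test = prep_sketch(pd.concat([F_test, M_test]))
    return X_train, X_test, y_train, y_test, F_train, F_test, M_train, M_test

# Made-up data: 20 female and 20 male rows cycling through three species.
np.random.seed(0)
n = 40
toy = pd.DataFrame({
    "Species": [["Adelie", "Chinstrap", "Gentoo"][i % 3] for i in range(n)],
    "Flipper Length (mm)": np.random.normal(200, 10, n),
    "Body Mass (g)": np.random.normal(4200, 500, n),
    "Sex": ["FEMALE", "MALE"] * (n // 2),
})

# An 80% female training set drawn from the toy data.
X_train, X_test, y_train, y_test, F_train, F_test, M_train, M_test = \
    custom_split_sketch(toy, fraction_female=0.8, training_percentage=0.5)
print(len(F_train), len(M_train), len(F_test), len(M_test))
```

Sizing the training set from the smaller group keeps the total number of training rows fixed as `fraction_female` changes, so differences in score reflect the mix of the training data rather than its size.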

In [6]:
def score_on_subset(subset,C):
    """
    Returns the score (accuracy) of a fitted machine learning
    model, C, on a subset of the data, subset
    """
    pass
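A possible body for `score_on_subset` is sketched below, assuming `subset` is an unprepped dataframe (for example, only the female test rows) and `C` is an already fitted scikit-learn classifier. The helper `prep_sketch`, the function name `score_on_subset_sketch`, and the toy rows are hypothetical and only serve to make the example self-contained.

```python
import pandas as pd
from sklearn import tree

def prep_sketch(df):
    """Hypothetical stand-in for prep_data: Sex becomes 0/1, Species is the target."""
    X = df.drop(columns=["Species"]).assign(Sex=(df["Sex"] == "MALE").astype(int))
    y = df["Species"]
    return X, y

def score_on_subset_sketch(subset, C):
    """Returns the accuracy of the fitted model C on the rows in subset."""
    X, y = prep_sketch(subset)
    return C.score(X, y)

# A few made-up, easily separable rows.
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "Flipper Length (mm)": [181.0, 186.0, 214.0, 222.0],
    "Body Mass (g)": [3750.0, 3800.0, 4925.0, 5750.0],
    "Sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# Fit a shallow tree on all rows, then score each sex separately.
X, y = prep_sketch(toy)
T = tree.DecisionTreeClassifier(max_depth=2).fit(X, y)
female_score = score_on_subset_sketch(toy[toy["Sex"] == "FEMALE"], T)
male_score = score_on_subset_sketch(toy[toy["Sex"] == "MALE"], T)
print(female_score, male_score)
```

Here the scores are computed on the same rows the tree was trained on, which is only for illustration; in the notebook the subsets passed in are the held-out `F_test` and `M_test`.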




In [8]:
fraction_female=.5
X_train, X_test, y_train, y_test, F_train, F_test, M_train, M_test=custom_split(fraction_female,.5)
T = tree.DecisionTreeClassifier(max_depth=2)
T.fit(X_train, y_train)
T.score(X_train, y_train), T.score(X_test, y_test)




In [9]:
print("Percent Female: "+str(fraction_female*100)+"%")
print("Female Score: "+str(score_on_subset(F_test,T)))
print("Male Score: " + str(score_on_subset(M_test,T)))

# Questions for self-experiment


1. How unbalanced does the training data need to be in order to create problems?
2. The model performs better on males even when the training data is split evenly. Is this random chance? If not, what else could cause these issues?
3. What value of `fraction_female` leads to the most equitable performance for both sexes?