# Exploring Biased Data via Penguins

Many popular machine learning algorithms are affected by bias. For example, facial recognition algorithms often work better on people with lighter skin than on people with darker skin. The potential harm is massive: someone could be wrongly arrested because of faulty facial recognition.

There are many possible sources of these issues, but one well-understood source is that machine learning models are typically trained on majority-white datasets.

In today's lecture, we will illustrate the potential harms of such unbalanced training data using penguins.

## Introduction

To start with, we are going to do a modified version of Discussion 14. The difference is that we will be predicting the `Sex` of the penguin rather than the species.

Let's begin by importing all the libraries we'll need, and by downloading the penguins dataset:

*If you experience `ConnectionRefused` errors when doing this, copy and paste the URL into your browser instead. Save the data in the same directory as this notebook in a file called `penguins.csv`, and then replace `url` with `"penguins.csv"` in the block below.*

In [68]:
#import needed libraries and read in data
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import tree, preprocessing
import numpy as np
url = "https://philchodrow.github.io/PIC16A/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

### §1. Preparing your data

For this activity, we will use only the following columns: `"Species"`, `"Flipper Length (mm)"`, `"Body Mass (g)"`, `"Sex"`. (Use the square brackets operator on the list of these strings, and **assign the result back to `penguins`.**)

In [69]:
# keep only the columns we need
penguins = penguins[["Species", "Flipper Length (mm)", "Body Mass (g)", "Sex"]]
# drop rows with missing values
penguins = penguins.dropna()
# drop the one row where the sex was recorded as "."
penguins = penguins[penguins["Sex"] != "."]
penguins

| | Species | Flipper Length (mm) | Body Mass (g) | Sex |
|---|---|---|---|---|
| 0 | Adelie Penguin (Pygoscelis adeliae) | 181.0 | 3750.0 | MALE |
| 1 | Adelie Penguin (Pygoscelis adeliae) | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie Penguin (Pygoscelis adeliae) | 195.0 | 3250.0 | FEMALE |
| 4 | Adelie Penguin (Pygoscelis adeliae) | 193.0 | 3450.0 | FEMALE |
| 5 | Adelie Penguin (Pygoscelis adeliae) | 190.0 | 3650.0 | MALE |
| ... | ... | ... | ... | ... |
| 338 | Gentoo penguin (Pygoscelis papua) | 214.0 | 4925.0 | FEMALE |
| 340 | Gentoo penguin (Pygoscelis papua) | 215.0 | 4850.0 | FEMALE |
| 341 | Gentoo penguin (Pygoscelis papua) | 222.0 | 5750.0 | MALE |
| 342 | Gentoo penguin (Pygoscelis papua) | 212.0 | 5200.0 | FEMALE |


# Training a model the same way as in Discussion

Run the next cell. Setting the seed makes sure that the random values your code generates are the same every time you run it.

In [70]:
np.random.seed(3354354524)

In [71]:
from sklearn import preprocessing

def prep_data(df):
    # predictors: every column except the target
    X = df.drop(['Sex'], axis = 1)
    # target: the sex of each penguin
    y = df['Sex']

    # encode the text columns as integers
    le = preprocessing.LabelEncoder()
    X['Species'] = le.fit_transform(X['Species'])
    y = le.fit_transform(y)
    return X, y

X, y = prep_data(penguins)

To make sure that you know what is going on, look at your `X` and `y` variables by running the next cells.

In [72]:
#X

In [73]:
#y

Now split `X` and `y` into training and test data (80% and 20% of the rows, respectively).

**Note**: *You should conduct all splits using a single call to the function `train_test_split` from `sklearn.model_selection`.* You can achieve this by supplying two arrays to this function, as illustrated in the [second example here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [74]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [75]:
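# fit a decision tree of depth at most 3 on the training data
T = tree.DecisionTreeClassifier(max_depth=3)
T.fit(X_train, y_train)
# report (training accuracy, test accuracy)
T.score(X_train, y_train), T.score(X_test, y_test)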

(0.8834586466165414, 0.835820895522388)

# Biased Data

## What if the training data was nearly all Adelie penguins?

To help answer this, let's split up the data into three smaller data frames, one for each species.
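The cell that performs this split is not shown here, so the following is only a minimal sketch of one way to do it. The helper name `split_by_species` is hypothetical; it groups on the `Species` column rather than hard-coding the species names.

```python
import pandas as pd


def split_by_species(df: pd.DataFrame, column: str = "Species") -> dict:
    """Return a dict mapping each species name to its own data frame.

    Hypothetical helper: groups the rows by `column` and copies each group,
    so that later edits to one frame do not touch the original data.
    """
    return {name: group.copy() for name, group in df.groupby(column)}
```

In this notebook it could be called as `frames = split_by_species(penguins)`, after which `frames` would hold one data frame per species, keyed by the full species name.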

Now, let's grab test data from each of those data frames. I am being a bit lazy and not using a random split here. (Don't imitate this behavior on your project!)
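A "lazy" non-random split could look like the sketch below, which simply holds out the last rows of a data frame. The helper name `ordered_split` and the default of 20 test rows are assumptions made for illustration, not the values used in lecture.

```python
import pandas as pd


def ordered_split(df: pd.DataFrame, n_test: int = 20) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Non-random split: the last `n_test` rows are test data, the rest training data.

    Hypothetical helper; `n_test=20` is an arbitrary illustrative default.
    """
    if not 0 < n_test < len(df):
        raise ValueError("n_test must be between 1 and len(df) - 1")
    train = df.iloc[:-n_test]
    test = df.iloc[-n_test:]
    return train, test
```

Because the rows are taken in file order, any ordering in the data (by island, date, or sex) leaks into the split, which is exactly why a random split is preferable in real work.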

Now, let's build the total training set and testing set.
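One possible way to assemble a training set that is nearly all one species is sketched below. It assumes the per-species training and test frames are stored in dicts keyed by species name; the helper name `build_biased_sets` and the default of 5 minority rows per species are hypothetical choices, not the lecture's actual numbers.

```python
import pandas as pd


def build_biased_sets(train_frames: dict, test_frames: dict,
                      majority: str, n_minority: int = 5):
    """Build a training set dominated by `majority` plus a combined test set.

    Hypothetical helper: keeps every training row of the majority species,
    only the first `n_minority` training rows of each other species,
    and all of the test rows.
    """
    train_parts = []
    for name, frame in train_frames.items():
        if name == majority:
            train_parts.append(frame)
        else:
            train_parts.append(frame.iloc[:n_minority])
    train = pd.concat(train_parts, ignore_index=True)
    test = pd.concat(list(test_frames.values()), ignore_index=True)
    return train, test
```

Here `majority` would be the full species string from the data, for example `"Adelie Penguin (Pygoscelis adeliae)"`. The resulting frames can then be passed through `prep_data` before fitting a model.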

# Interpreting results 

With "balanced" data, our model performs worse on Gentoo penguins. Let's try to explore why.
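One way to check a claim like this is to score the model separately on each species. The sketch below assumes a fitted scikit-learn classifier and a test set; the helper name `accuracy_by_group` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score


def accuracy_by_group(model, X, y, groups) -> pd.Series:
    """Return the model's accuracy computed separately within each group.

    Hypothetical helper: `groups` holds one label per row of `X`
    (for example the species), aligned with `y`.
    """
    y = np.asarray(y)
    groups = np.asarray(groups)
    predictions = np.asarray(model.predict(X))
    scores = {}
    for g in np.unique(groups):
        mask = groups == g  # rows belonging to this group
        scores[g] = accuracy_score(y[mask], predictions[mask])
    return pd.Series(scores)
```

With the variables from earlier in this notebook, a call might look like `accuracy_by_group(T, X_test, y_test, X_test["Species"])`. Note that after `prep_data` the `Species` column holds integer codes rather than names.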

We can see that Gentoos are bigger than the other two species. Perhaps our classifier sees that a Gentoo is big and assumes that it is male.
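A quick way to look at the size difference is to average the two measurements within each species and sex. This is a minimal sketch using the column names from this notebook; the helper name `mean_size_by_species_and_sex` is hypothetical.

```python
import pandas as pd


def mean_size_by_species_and_sex(df: pd.DataFrame) -> pd.DataFrame:
    """Mean flipper length and body mass for every (species, sex) pair.

    Hypothetical helper: expects the un-encoded data frame, with the
    original text values in the `Species` and `Sex` columns.
    """
    size_columns = ["Flipper Length (mm)", "Body Mass (g)"]
    return df.groupby(["Species", "Sex"])[size_columns].mean()
```

Calling `mean_size_by_species_and_sex(penguins)` would give one row per species and sex. If the average female Gentoo turned out to be about as heavy as the males of the other species, that would be consistent with the guess above, though it would not prove it.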