<a target="_blank" href="https://colab.research.google.com/github/ArtificialIntelligenceToolkit/aitk/blob/master/notebooks/NeuralNetworks/DataManipulation.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Manipulation
In this notebook, we explore how the composition of the training set for neural networks can lead to biased outcomes. We explore this in a categorization task with two classes by manipulating the balance of training examples across these classes.

Here we will install the relevant library in order for this notebook to run.

In [1]:
%pip install aitk --quiet

Running this next code block will allow us to import the additional libraries for this notebook.

In [2]:
from aitk.utils import gallery, array_to_image
from aitk.networks import SimpleNetwork
import tensorflow
from tensorflow.keras.datasets import mnist
import numpy as np
from random import shuffle

## Manipulating the Data
In this section, we will perform some manipulations on a dataset in order to show how the composition of a dataset can affect the efficacy of a neural network.

To demonstrate how dataset composition affects network efficacy, we will use part of the MNIST dataset, which is composed of handwritten digits. We will focus on two digits from this dataset in order to show how overrepresenting one of them affects the network's ability to accurately distinguish between them. Initially we will try categorizing 3's vs 5's. We chose 3 and 5 because they have some similarities, such as a similar curve in their bottom halves, so distinguishing a 3 from a 5 is more difficult than, say, a 3 from a 1.

We will start with an equal number of each digit, and then we will change the percentages of them in the dataset to explore how this imbalance impacts a network.

This code block loads in the MNIST dataset.

In [3]:
(train_x, train_y), (test_x, test_y) = mnist.load_data()

### Set the two digits to explore

We will begin by focusing on categorizing 3's vs 5's. But later you may change the two digits you wish to categorize to something else, such as 3 vs 8 or 1 vs 7.


In [4]:
digit1 = 3
digit2 = 5

We begin by extracting just the two digits we want from the MNIST **training** data. We want to ensure that we start with an equal number of samples. We will gather 5000 samples of each.

In [5]:
digit1_train_x = []
digit1_train_y = []
digit2_train_x = []
digit2_train_y = []
num_train_digit1 = 0
num_train_digit2 = 0

for i in range(len(train_x)):
  if train_y[i] == digit1 and num_train_digit1 < 5000:
    digit1_train_x.append(train_x[i])
    digit1_train_y.append([1,0])
    num_train_digit1 += 1
  elif train_y[i] == digit2 and num_train_digit2 < 5000:
    digit2_train_x.append(train_x[i])
    digit2_train_y.append([0,1])
    num_train_digit2 += 1

print("number of %d's: %d" % (digit1, len(digit1_train_x)))
print("number of %d's: %d" % (digit2, len(digit2_train_x)))

number of 3's: 5000
number of 5's: 5000
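The labels `[1,0]` and `[0,1]` used above are one-hot vectors: the position of the 1 says which class the example belongs to. The following is a minimal sketch of that encoding in plain Python; the helper names `one_hot` and `decode` are illustrative and are not part of aitk or this notebook.

```python
def one_hot(label, classes):
    """Return a list with a 1 at the position of `label` in `classes`, 0 elsewhere."""
    return [1 if label == c else 0 for c in classes]

def decode(vector, classes):
    """Return the class whose position holds the largest value."""
    best_index = max(range(len(vector)), key=lambda i: vector[i])
    return classes[best_index]

classes = [3, 5]                          # digit1, digit2
encoded_three = one_hot(3, classes)       # target used for every 3
encoded_five = one_hot(5, classes)        # target used for every 5
decoded = decode([0.01, 0.99], classes)   # a network-style output, read back as a digit
```

Decoding by the largest value is the same idea the error summary later in this notebook uses when it calls `argmax`.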


We also need to extract just the two digits of interest from the **testing** data. We will gather 750 samples of each.

In [6]:
new_test_x = []
new_test_y = []
num_digit1 = 0
num_digit2 = 0

for i in range(len(test_x)):
  if test_y[i] == digit1 and num_digit1 < 750:
    new_test_x.append(test_x[i])
    new_test_y.append([1,0])
    num_digit1 += 1
  elif test_y[i] == digit2 and num_digit2 < 750:
    new_test_x.append(test_x[i])
    new_test_y.append([0,1])
    num_digit2 += 1

new_test_x = np.array(new_test_x)
new_test_y = np.array(new_test_y)

new_test_x_normalized = new_test_x/255

print("number of %d's: %d" % (digit1, num_digit1))
print("number of %d's: %d" % (digit2, num_digit2))

number of 3's: 750
number of 5's: 750
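The division by 255 in the cell above rescales the pixel values, which are stored as integers from 0 to 255, into the range 0 to 1. Here is a small sketch of the same step on a made-up 2x3 "image" using plain Python lists; the `normalize` helper is illustrative only, since the notebook does this directly with NumPy.

```python
def normalize(image, max_value=255):
    """Scale every pixel of a 2D list of integers into the range [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]

# A tiny made-up image: 0 is black, 255 is white
tiny_image = [[0, 128, 255],
              [64, 192, 32]]
scaled = normalize(tiny_image)

smallest = min(min(row) for row in scaled)
largest = max(max(row) for row in scaled)
```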


Now that we have all of the data, we can begin to manipulate the balance within it. The next function allows us to specify how to split up the data.

In [7]:
def split_data(pct_digit1, pct_digit2):
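  # Compare with a small tolerance: decimal fractions are not always exact as floats
  assert abs(pct_digit1+pct_digit2 - 1) < 1e-9, "percentages must sum to 1"
  # round() avoids losing a sample when the product lands just below a whole number
  num_digit1 = round(pct_digit1*(len(digit1_train_x)))
  num_digit2 = round(pct_digit2*(len(digit2_train_x)))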
  shuffle(digit1_train_x)
  shuffle(digit2_train_x)
  print("%d train length: %d" % (digit1, num_digit1))
  print("%d train length: %d" % (digit2, num_digit2))
  inputs = digit1_train_x[:num_digit1] + digit2_train_x[:num_digit2]
  targets = ([[1,0]]*num_digit1) + ([[0,1]]*num_digit2)
  mix = list(zip(inputs, targets))
  shuffle(mix)
  inputs, targets = zip(*mix)

  inputs = np.array(inputs)/255
  targets = np.array(targets)

  return inputs, targets


### Set percentages of each digit in the training set

Now, **enter the percentages for both digits** that you want in the dataset as a decimal. To begin, we will look at a balanced dataset that is split 50/50 (which should be entered as 0.5 for both digits below).

After seeing how that works, change the percentages and rerun all the code blocks below to see how the network changes. For example you might want to try a 70/30 split (which should be entered as .7 and .3 below).

In [8]:
pct_digit1 = 0.5
pct_digit2 = 0.5

Run the next code block to split up the data as you specified. Additionally, the quantities of each digit will be printed; verify that the numbers look correct given the percentages that you entered.

In [9]:
inputs, targets = split_data(pct_digit1, pct_digit2)

3 train length: 2500
5 train length: 2500
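To check the printed counts by hand, it helps to see how the percentages translate into numbers of examples. The sketch below assumes a pool of 5000 examples per digit, as gathered earlier; `split_counts` is a hypothetical helper that mirrors only the counting arithmetic of `split_data`.

```python
def split_counts(pct_digit1, pct_digit2, pool_size=5000):
    """Return how many examples of each digit a given split would draw."""
    if abs(pct_digit1 + pct_digit2 - 1) > 1e-9:
        raise ValueError("percentages must sum to 1")
    return round(pct_digit1 * pool_size), round(pct_digit2 * pool_size)

# Counts for a few splits, keyed by the share given to digit1
counts = {pct1: split_counts(pct1, 1 - pct1) for pct1 in (0.5, 0.7, 0.9)}
balanced = counts[0.5]
skewed = counts[0.7]
```

Because both pools hold the same number of examples, the training set has 5000 examples in total for every split; only its composition changes.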


Now, we can see what a sample of 20 images from the new dataset looks like. The number of examples of each digit will vary based on the percentages you entered.

In [10]:
images = [array_to_image(inputs[i]) for i in range(20)]
gallery(images)

[Output: gallery of the first 20 training images, indexed 0 to 19]


Here we create the neural network. We will utilize a simple network which first flattens the two-dimensional input, passes it through two hidden layers, and then on to the output layer of size 2.

In [11]:
net = SimpleNetwork((28, 28), "Flatten", 25, 10, (2, "softmax"))

Summarizing the network lets us check that it looks as expected. The total number of parameters gives you a good sense of the size of the network. Ours has fewer than 20 thousand, which is quite small by modern standards.

In [12]:
net.summary()

Model: "SimpleNetwork"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input (InputLayer)          [(None, 28, 28)]          0         
                                                                 
 flatten (Flatten)           (None, 784)               0         
                                                                 
 hidden_2 (Dense)            (None, 25)                19625     
                                                                 
 hidden_3 (Dense)            (None, 10)                260       
                                                                 
 output (Dense)              (None, 2)                 22        
                                                                 
Total params: 19907 (77.76 KB)
Trainable params: 19907 (77.76 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
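The parameter counts in the summary can be reproduced by hand. A fully connected (Dense) layer has one weight per input-unit pair plus one bias per unit. The sketch below applies that formula to the layer sizes shown above; `dense_params` is an illustrative helper, not an aitk function.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Flattened 28x28 input, two hidden layers, and the 2-unit output layer
layer_sizes = [28 * 28, 25, 10, 2]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
```

Nearly all of the parameters sit in the first hidden layer, because it connects to all 784 input pixels.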


This is a fairly simple task for the network so it only needs 10 epochs of training to achieve relatively high accuracy (each **epoch** is one pass through the training data).

In [13]:
history = net.fit(inputs,                                   # new training examples
                  targets,                                  # new training labels
                  verbose=1,                                # verbose output
                  validation_data=(new_test_x_normalized,   # validation examples
                                   new_test_y),             # validation labels
                  epochs=10)                                 # number of times to loop through the training set

Epoch 10/10 loss: 0.016801241785287857 - tolerance_accuracy: 0.909832775592804 - val_loss: 0.021938364952802658 - val_tolerance_accuracy: 0.9067248702049255
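To make the idea of an epoch concrete, here is a rough sketch of how many weight updates happen during training. It assumes mini-batches of 32 examples. That value is an assumption (32 is the Keras default for `fit`), and this notebook does not state what batch size `SimpleNetwork` actually uses.

```python
import math

def updates_per_epoch(n_examples, batch_size):
    """Number of mini-batches needed to pass through the data once."""
    return math.ceil(n_examples / batch_size)

n_examples = 5000   # size of the training set built by split_data
batch_size = 32     # assumed, not confirmed by the notebook
epochs = 10

steps = updates_per_epoch(n_examples, batch_size)
total_updates = steps * epochs
```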


After training the network, **take note of the tolerance_accuracy and val_tolerance_accuracy for every time you retrain the network with manipulated percentages**.

The tolerance accuracy represents the accuracy of the network on the **training** data.

The validation tolerance accuracy represents the accuracy of the network on the **testing** data.

**tolerance_accuracy**: *enter here*

**val_tolerance_accuracy**: *enter here*
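This notebook does not define how tolerance accuracy is computed. One plausible reading, offered here only as an assumption, is that an output counts as correct when every value is within a small tolerance of its target, which is stricter than checking which output is largest. The sketch below contrasts the two on made-up outputs; the 0.1 tolerance and both helper functions are hypothetical.

```python
def argmax_correct(output, target):
    """Correct if the largest output is in the same position as the target's 1."""
    return output.index(max(output)) == target.index(max(target))

def within_tolerance(output, target, tolerance=0.1):
    """Correct only if every output value is within `tolerance` of its target."""
    return all(abs(o - t) <= tolerance for o, t in zip(output, target))

# Made-up network outputs for three examples whose true class is digit1
outputs = [[0.97, 0.03], [0.80, 0.20], [0.40, 0.60]]
targets = [[1, 0], [1, 0], [1, 0]]

argmax_hits = sum(argmax_correct(o, t) for o, t in zip(outputs, targets))
tolerance_hits = sum(within_tolerance(o, t) for o, t in zip(outputs, targets))
```

Under this assumption the second example is right by the largest-output rule but not confident enough to count under the tolerance rule, so the two accuracy figures can differ for the same network.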

### Testing the Network

Let's create a function to allow us to easily visualize how the trained network is doing on some sample inputs from the testing data.

In [14]:
from time import sleep
def test(net, n):
  for i in range(n):
    net.display(new_test_x_normalized[i])
    outputs = net.propagate(new_test_x_normalized[i])
    print(", ".join([str(round(v,2)) for v in outputs]))
    sleep(2)

In the visualization below, focus your attention on the two boxes in the final output layer.

* When the network recognizes digit1 (which is initially 3), there should be a white block on the left and a black block on the right for the output.

* When the network recognizes digit2 (which is initially 5), there should be a black block on the left and a white block on the right for the output.

* For inputs that the network is having trouble recognizing, their output will not be clearly black or white in either of the two output blocks.

Additionally, below the visualization of the network, when the test function is run, you can see percentages of certainty. The first number is the certainty that the digit is a digit1, and the second number is the certainty that the digit is a digit2.

In [15]:
test(net, 10)

0.01, 0.99
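The two numbers always sum to 1 because the output layer uses a softmax activation, which turns the raw scores of the two output units into shares of a total. Below is a minimal sketch of softmax in plain Python; the input scores are made up for illustration and are not taken from the trained network.

```python
import math

def softmax(scores):
    """Convert raw scores into positive values that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps exp() from overflowing; it does not change the result
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

confident = softmax([-2.0, 2.6])   # strongly favors the second class
unsure = softmax([0.1, -0.1])      # close to an even split
```

A large gap between the raw scores gives an output near 0 and 1, while nearly equal scores give values near 0.5, which is the "not clearly black or white" case described above.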


As you can see from these examples, when the network's training set is split evenly between the two digits, it typically performs quite well. It has a tolerance accuracy of around 91% on the test dataset (which it was not trained on). This shows that this network is effective at predicting whether a handwritten digit is a 3 or a 5.

NOTE: We could build a more complex network that would perform better, but it would take longer to train.

Now, let's see how many total errors this network made and which specific digits it classified incorrectly.

Run this next code block to see a summary of the errors the network made. **Note how many errors the network made and which digit it classified incorrectly most often.**

In [16]:
from numpy import argmax
outputs = net.predict(new_test_x_normalized)
answers = [argmax(output) for output in outputs]
newtargets = [argmax(target) for target in new_test_y]
incorrect = [i for i in range(len(answers)) if answers[i] != newtargets[i]]
print("number of digits classified incorrectly:", len(incorrect))
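# Use the test labels (newtargets), not the training targets, to find the true class of each error
missed_target = [newtargets[i] for i in incorrect]
wrong_answer = [answers[i] for i in incorrect]
per_digit1 = 0
per_digit2 = 0
for i in range(len(incorrect)):
  if missed_target[i] == 0:   # the true digit was digit1
    per_digit1 += 1
  else:                       # the true digit was digit2
    per_digit2 += 1
num_errors = max(len(incorrect), 1)   # avoid dividing by zero when there are no errors
print("percentage of errors on %d's: %.2f" % (digit1, per_digit1/num_errors))
print("percentage of errors on %d's: %.2f" % (digit2, per_digit2/num_errors))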

number of digits classified incorrectly: 44
percentage of errors on 3's: 0.27
percentage of errors on 5's: 0.73
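The summary above reports each digit's share of the total errors. A related view is the error rate within each class: the errors on a digit divided by the number of test examples of that digit. The sketch below computes both on a small made-up set of labels (0 stands for digit1, 1 for digit2); the data and the helper name are illustrative only.

```python
from collections import Counter

def errors_by_true_class(predicted, actual):
    """Count the misclassified examples, grouped by their true class."""
    return Counter(a for p, a in zip(predicted, actual) if p != a)

# Made-up labels: four examples of each class
actual = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 0, 0, 1, 1]

errors = errors_by_true_class(predicted, actual)
class_sizes = Counter(actual)
error_rate = {c: errors[c] / class_sizes[c] for c in class_sizes}
```

When the test set is balanced, as it is here with 750 of each digit, the two views rank the digits the same way. With an imbalanced test set they can tell different stories.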


*write observations here*

We can see a gallery of all of the digits that were classified incorrectly.

In [17]:
images = [array_to_image(new_test_x[index]) for index in incorrect]
gallery(images)

[Output: gallery of the 44 misclassified test digits, indexed 0 to 43]


After seeing which digits were classified incorrectly, consider **why** this may have occurred and **how** the percentages that you inputted would have this effect.

*write observations here*

### Changing Dataset Composition

Now that we have shown how this network performs with an equal number of each digit in the dataset, we want to show how the performance and accuracy of the network changes when we manipulate the percentages in the dataset.

**Return to where you entered the percentages of each digit in the training dataset and change the percentages from 50/50 to 70/30.**

**Rerun all the code blocks from there (including the testing summary).**

Describe the differences you saw in the results between the 50/50 split and the 70/30 split.

*write observations here*

After you finish training your network with the first manipulated percentages, experiment with the percentages some more.

How imbalanced does the training set have to be before the network is unable to distinguish between the two digits?

*write answer here*

Also feel free to explore other digits besides 3 vs 5. To do this go back to the section where you set the digits to explore, and then rerun all of the code cells below that.

*write any additional observations here*

After finishing testing the different percentages that you can use to show varying levels of efficacy in the network, consider the potential broader implications of dataset composition before moving on to the next section.

* How could the underrepresentation of a specific subset of the data that an AI is trained on have real-world consequences?
* What possible issues and biases might arise with the human decision making that goes into the creation of datasets that are used to train these networks?

*write answer here*

## Implications of Dataset Composition
We have explored how manipulating a dataset can change a network's efficacy in a categorization task. Datasets, and thus their composition, are an essential component of neural networks. Below, we will explore how bias in a dataset's composition can lead to negative impacts on marginalized communities.

Within the past couple of years, bias within algorithms and AI has begun to receive attention. Many computer scientists and researchers have begun to recognize inherent bias, also known as "algorithmic prejudices," present in algorithms, software, machine learning, artificial intelligence, and nearly every facet of computer science (see reference [3]). Within the context of datasets that are used for the training of neural networks, bias is pervasive. This bias becomes particularly concerning as algorithms begin to take over human responsibilities (see reference [2]). For example, algorithms are now being used by US law enforcement for "predictive justice" (see reference [3]). These tools "calculate the probability that a person will not show up for trial as scheduled or commit future crimes" (see reference [3]). As these algorithms become increasingly present in our society, we must evaluate and consider their inherent biases.

A major contributor to the current movement exploring and combatting biases in algorithms is Joy Buolamwini. As a graduate student at MIT, Buolamwini co-wrote the paper "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification" with Timnit Gebru, which explores the ways in which machine learning algorithms can discriminate based on classes like race and gender (see reference [1]). As Buolamwini describes in her TED Talk, she was inspired to address bias in machine learning algorithms when, as an undergraduate student at Georgia Tech, a robot that was supposed to recognize faces could not detect hers, as a black woman. Among other findings, this paper revealed that while lighter-skinned males had an extremely low error rate of 0.8%, darker-skinned females had a significantly higher error rate of up to 34.7% (see reference [1]).

### Buolamwini's Work

In 2017, Buolamwini gave a TED Talk demonstrating the discriminatory tendencies of widely used and accepted training sets and algorithms (see reference [2]). She uses the term "coded gaze" to refer to algorithmic bias in the field of computer science. In her talk, she dives deeper into the harms and discriminatory practices perpetuated by these training sets, which often severely lack diversity. These practices include predictive policing. A study from the Georgetown Law Center showed that 1 in 2 adults in the US is included in a law enforcement facial recognition network (see reference [4]). These networks have not been audited for accuracy and can result in people being misidentified as criminals, with potentially serious consequences for the person who is misidentified. With such serious stakes, it is essential to consider and address the biases of these algorithms and networks.

To see the Georgetown Law Center's full report on law enforcement's use of facial recognition and recommendations, please access this link: [Perpetual Line Up](https://www.perpetuallineup.org/).

Click the image below to watch Buolamwini's TED Talk!

[![Joy Buolamwini: How I'm fighting bias in algorithms](http://img.youtube.com/vi/UG_X_7g63rY/0.jpg)](https://www.youtube.com/watch?v=UG_X_7g63rY "How I'm fighting bias in algorithms")

#### Gender Shades

"Gender Shades" tested 3 commercial gender classification systems (Microsoft, IBM, Face++) using a dataset specifically designed to determine the potential biases present in these systems (see reference [1]). The dataset (Pilot Parliaments Benchmark), specifically created for this study, was composed of faces of 1270 individuals from three African countries and three European countries. The individuals were each given skin type labels per the Fitzpatrick six-point labeling system and given gender labels, either female or male given the binary nature of the evaluation systems.

The evaluation of these classifiers produced several main takeaways. Firstly, "male subjects were more accurately classified than female subjects" (reference [1] pg. 8). Additionally, lighter-skinned subjects were more accurately classified than those with darker skin. Further, all classifiers performed worst on darker female subjects (reference [1] pg. 8). Here are the key findings as summarized in the study:

* All classifiers perform better on male faces than female faces (8.1% − 20.6% difference in error rate)

* All classifiers perform better on lighter faces than darker faces (11.8% − 19.2% difference in error rate)

* All classifiers perform worst on darker female faces (20.8% − 34.7% error rate)

* Microsoft and IBM classifiers perform best on lighter male faces (error rates of 0.0% and 0.3% respectively)

* Face++ classifiers perform best on darker male faces (0.7% error rate)

* The maximum difference in error rate between the best and worst classified groups is 34.4%

(reference [1] pg. 8).

Further, this paper emphasizes the complete inability of these commercial systems to recognize gender minorities, as they are completely excluded from datasets and classification options. Buolamwini notes, "The companies provide no documentation to clarify if their gender classification systems which provide sex labels are classifying gender identity or biological sex" (reference [1] pg. 6). As she emphasizes, "This reductionist view of gender does not adequately capture the complexities of gender or address transgender identities" (reference [1] pg. 6). When using these systems, it is important to consider how they erase people with nonbinary gender identities.

Buolamwini and Gebru's study "Gender Shades" brought to the forefront the inherent biases present in well-established commercial classifiers and the marginalization in these algorithms of those with intersectional identities, particularly darker-skinned women. These prejudices have the potential to further harm people of intersectional minority identities who are already marginalized in our society. As companies continue to develop these tools, Buolamwini calls for "inclusive benchmark datasets and subgroup accuracy reports" which will be "necessary to increase transparency and accountability in artificial intelligence" (reference [1] pg. 12). As these tools continue to be developed, there will need to be increased "demographic and phenotypic transparency and accountability in artificial intelligence" (reference [1] pg. 12).
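To illustrate what a subgroup accuracy report involves, here is a minimal sketch that breaks an overall error rate down by subgroup. The records are entirely made up for illustration and are not data from "Gender Shades"; the group names and the `subgroup_error_rates` helper are hypothetical.

```python
from collections import defaultdict

def subgroup_error_rates(records):
    """Return the error rate per subgroup from (subgroup, predicted, actual) records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for subgroup, predicted, actual in records:
        totals[subgroup] += 1
        if predicted != actual:
            errors[subgroup] += 1
    return {group: errors[group] / totals[group] for group in totals}

# Made-up classifier results for two hypothetical subgroups
records = [
    ("group_a", "F", "F"), ("group_a", "M", "M"),
    ("group_a", "M", "M"), ("group_a", "F", "F"),
    ("group_b", "M", "F"), ("group_b", "M", "M"),
    ("group_b", "M", "F"), ("group_b", "F", "F"),
]

overall_error = sum(p != a for _, p, a in records) / len(records)
by_group = subgroup_error_rates(records)
```

In this toy example the overall error rate of 25% hides the fact that every error falls on one subgroup, which is the kind of disparity a single aggregate number cannot show.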

To gain a more comprehensive understanding of the "Gender Shades" research by Buolamwini and her collaborator Timnit Gebru, you can read the full paper here: [Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification](https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf).

### Reflect

After reading more about the biases present in machine learning algorithms, **how do you see your role as a member of a modern society in which the presence of these algorithms is only increasing? What are ways in which we can combat these biases?**

*write answer here*

Go back to the data manipulation section of this notebook and take note of the accuracy percentage you recorded when either digit1s or digit2s were overrepresented (particularly when the minority digit represented 7% or less of the dataset). **Why is accuracy alone not always a reliable metric? What are the fallacies underlying reporting the "accuracy" of a system? How could this number impact systems' usage and our trust in them?** Think about Buolamwini's findings. Contextualize your answer accordingly.

*write answer here*
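As one way to make this question concrete, consider a "classifier" that ignores its input and always predicts the majority class. The sketch below scores it on a made-up evaluation set that is 93% one class and 7% the other. Both the data and the choice of balanced accuracy (the average of the per-class accuracies) as a comparison metric are illustrative assumptions, not something this notebook computes.

```python
def accuracy(predicted, actual):
    """Fraction of all examples that were classified correctly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def balanced_accuracy(predicted, actual):
    """Average of the accuracies measured separately within each true class."""
    per_class = []
    for c in sorted(set(actual)):
        indices = [i for i, a in enumerate(actual) if a == c]
        per_class.append(sum(predicted[i] == c for i in indices) / len(indices))
    return sum(per_class) / len(per_class)

# Made-up evaluation set: 93 examples of class 0 and 7 of class 1
actual = [0] * 93 + [1] * 7
always_majority = [0] * 100   # never predicts the minority class

plain = accuracy(always_majority, actual)
balanced = balanced_accuracy(always_majority, actual)
```

The plain accuracy looks high even though the minority class is never recognized, while the balanced figure is no better than a coin flip.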

## Navigating Biases
Considering ways in which we can work towards a more inclusive computing community.

As we work towards a more inclusive and less prejudiced computer science sphere, we must consider these issues, recognize them in our processes, and change our practices. A major component of reducing these biases and their impact is focusing on inclusive coding practices. As Buolamwini outlines in her TED Talk, we must consider who codes, how we code, and why we code (see reference [2]). Having a more diverse community of coders who consider and prioritize the needs and experiences of marginalized communities is an important step in creating a more inclusive field and more inclusive algorithms.

Within the Georgetown Law Center report, the writers emphasize a need for significant legislative and regulatory change (see reference [4]). Law enforcement's use of facial recognition has the potential to do real damage, if it has not already impacted countless individuals. The report suggests that legislation should be passed to regulate these technologies, including requiring reasonable suspicion to use facial recognition, using only mug shot databases, requiring court approval to use ID photos and license photos, requiring probable cause to use surveillance footage, completely banning the tracking of individuals over free speech issues, and increasing accuracy testing. Further, they suggest a complete reform of the FBI's facial recognition systems. They argue that these systems must be transparent and held publicly accountable, releasing statistics relating to arrest numbers. Importantly, they call for testing of racial bias within these systems and for datasets that reflect the diversity of the American population. All of these reforms are important to implement if we want to mitigate the potential harm that these law enforcement agencies can perpetuate against already vulnerable communities.

As Buolamwini emphasizes at the end of her TED Talk, we must create "a world where technology works for all of us, not just some of us, a world where we value inclusion and center social change." To finish her talk, she poses a question: "Will you join me in the fight?" **After reading through this computational essay, consider why it is important to join this "fight." What are your personal motivations behind creating a more inclusive computing space, and why is it important?**

*write answer here*

## References


[1] J. Buolamwini and T. Gebru, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification,” MIT Media Lab. Accessed: Jul. 23, 2024. [Online]. Available: https://www.media.mit.edu/publications/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classification/

[2] J. Buolamwini, “How I’m fighting bias in algorithms.” Accessed: Jul. 23, 2024. [Online Video]. Available: https://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms

[3] M. S. Cataleta, “Humane Artificial Intelligence: The Fragility of Human Rights Facing AI,” East-West Center, 2020. Accessed: Jul. 23, 2024. [Online]. Available: https://www.jstor.org/stable/resrep25514

[4] “The Perpetual Line-Up,” Perpetual Line Up. Accessed: Jul. 23, 2024. [Online]. Available: https://www.perpetuallineup.org/