# About the Dataset and Prediction Task
In this exercise, you'll work with the Adult Census Income dataset, which is commonly used in machine learning literature. This data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker.

Each example in the dataset contains the following demographic data for a set of individuals who took part in the 1994 Census:

Numeric Features
age: The age of the individual in years.
fnlwgt: The number of individuals that the census organization believes this set of observations represents.
education_num: An enumeration of the categorical representation of education. The higher the number, the higher the education that individual achieved. For example, an education_num of 11 represents Assoc_voc (associate degree at a vocational school), an education_num of 13 represents Bachelors, and an education_num of 9 represents HS-grad (high school graduate).
capital_gain: Capital gain made by the individual, represented in US Dollars.
capital_loss: Capital loss made by the individual, represented in US Dollars.
hours_per_week: Hours worked per week.
Categorical Features
workclass: The individual's type of employer. Examples include: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, and Never-worked.
education: The highest level of education achieved for that individual.
marital_status: Marital status of the individual. Examples include: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, and Married-AF-spouse.
occupation: The occupation of the individual. Examples include: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, and more.
relationship: The relationship of each individual in a household. Examples include: Wife, Own-child, Husband, Not-in-family, Other-relative, and Unmarried.
gender: Gender of the individual, available only as a binary choice: Female or Male.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Black, and Other.
native_country: Country of origin of the individual. Examples include: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, and more.
Prediction Task
The prediction task is to determine whether a person makes more than $50,000 US Dollars a year.

Label
income_bracket: Whether the person makes more than $50,000 US Dollars annually.
Notes on Data Collection
All the examples extracted for this dataset meet the following conditions:

age is 16 years or older.
The adjusted gross income (used to calculate income_bracket) is greater than $100 USD annually.
fnlwgt is greater than 0.
hours_per_week is greater than 0.
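
The conditions above can be expressed as a row filter. The following is a minimal sketch on a small hypothetical DataFrame (the rows are made up for illustration). The adjusted gross income condition is left out because that value is not one of the dataset's columns.

```python
import pandas as pd

# Hypothetical rows; only the columns needed for the checks are included.
sample = pd.DataFrame({
    "age": [39, 15, 50],
    "fnlwgt": [77516, 83311, 0],
    "hours_per_week": [40, 20, 13],
})

# One boolean per row: True when the row satisfies all three conditions.
meets_conditions = (
    (sample["age"] >= 16)
    & (sample["fnlwgt"] > 0)
    & (sample["hours_per_week"] > 0)
)

# Rows that break at least one condition.
violations = sample[~meets_conditions]
print(violations)
```

Running the same filter on the loaded data would be one way to confirm that no rows violate the stated conditions.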

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.keras.layers
import matplotlib.pyplot as plt

#from google.colab import widgets
# For facets
from IPython.display import display, HTML
import base64
!pip install facets-overview==1.0.0
from facets_overview.feature_statistics_generator import FeatureStatisticsGenerator

Collecting facets-overview==1.0.0
  Downloading facets_overview-1.0.0-py2.py3-none-any.whl (24 kB)
Installing collected packages: facets-overview
Successfully installed facets-overview-1.0.0


# Load the Adult Dataset
With the modules now imported, we can load the Adult dataset into a pandas DataFrame data structure.

In [16]:
Columns = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country",
           "income_bracket"]

# Load the training and test datasets.
train_dataset = pd.read_csv("adult_census_train.csv", names=Columns,
                            sep=r'\s*,\s*', engine='python', na_values="?")
test_dataset = pd.read_csv("adult_census_test.csv", names=Columns,
                           sep=r'\s*,\s*', engine='python', na_values="?")
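
To see what the `sep` and `na_values` arguments do, here is a small sketch that parses an in-memory excerpt instead of the CSV files. The two rows are hypothetical and assume a comma-plus-space layout in which missing values are written as `?`.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt in a comma-plus-space layout.
raw = io.StringIO(
    "39, State-gov, 77516\n"
    "54, ?, 180211\n"
)

demo = pd.read_csv(
    raw,
    names=["age", "workclass", "fnlwgt"],
    sep=r"\s*,\s*",   # regex separator: a comma with optional whitespace around it
    engine="python",  # regex separators require the Python parsing engine
    na_values="?",    # treat a bare "?" as a missing value
)
print(demo)
```

The regex separator strips the padding around each field, so `?` is matched exactly and becomes `NaN` rather than the string `" ?"`.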

In [15]:
train_dataset.head(30)

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


# Analyzing the Adult Dataset with Facets
As mentioned in Machine Learning Crash Course (MLCC), it is important to understand your dataset before diving straight into the prediction task.

Some important questions to investigate when auditing a dataset for fairness:

Are there missing feature values for a large number of observations?
Are there features that are missing that might affect other features?
Are there any unexpected feature values?
What signs of data skew do you see?
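
Some of these questions can also be checked numerically. Here is a minimal pandas sketch on a small hypothetical DataFrame (the rows are made up, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training data, with a few gaps.
df = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov", "Private"],
    "occupation": ["Sales", np.nan, "Adm-clerical", "Craft-repair"],
    "hours_per_week": [40, 20, 99, 1],
})

# Fraction of missing values per column.
missing_fraction = df.isna().mean()
print(missing_fraction)

# Range of a numeric feature, to spot unexpected values.
print(df["hours_per_week"].agg(["min", "max"]))

# Category frequencies, counting missing entries as their own group.
print(df["workclass"].value_counts(dropna=False))
```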

To start, we can use Facets Overview, an interactive visualization tool that can help us explore the dataset. With Facets Overview, we can quickly analyze the distribution of values across the Adult dataset.

In [20]:
fsg = FeatureStatisticsGenerator()
dataframes = [{"table": train_dataset, "name": "train_Data"}]
censusPhoto = fsg.ProtoFromDataFrames(dataframes)
protostr = base64.b64encode(censusPhoto.SerializeToString()).decode("utf-8")

HTML_TEMPLATE = """<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))

# FairAware Task 1
Review the descriptive statistics and histograms for each numerical and continuous feature. Click the Show Raw Data button above the histograms for categorical features to see the distribution of values per category.

Then, try to answer the following questions from earlier:

Are there missing feature values for a large number of observations?
Are there features that are missing that might affect other features?
Are there any unexpected feature values?
What signs of data skew do you see?


# Solution
Click below for some insights we uncovered.

We can see from reviewing the missing column that the following categorical features contain missing values:

workclass
occupation
Because only a small percentage of samples are missing a workclass or occupation value, we can safely drop those rows from the dataset. If that percentage were much higher, we would have to consider using a different, more complete dataset.

Luckily, pandas provides a convenient way to drop any row containing a missing value:

In [21]:
train_dataset.dropna(how="any", axis=0, inplace=True)

We will use this method prior to training the model, when we convert a pandas DataFrame to a NumPy array.
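
The conversion step itself is not shown in this section. Below is a minimal sketch of what such a helper could look like. The name `pandas_to_numpy` matches the call made later in the notebook, but the body is an assumption: it treats `income_bracket` as the label with `>50K` as the positive class, and returns the features as a dict of arrays keyed by column name.

```python
import numpy as np
import pandas as pd

def pandas_to_numpy(data):
    """Drop incomplete rows, then split a DataFrame into features and labels."""
    # Work on a copy without rows that contain any missing value.
    data = data.dropna(how="any", axis=0)

    # Boolean label: True when the income bracket is ">50K" (assumed encoding).
    labels = np.array(data["income_bracket"] == ">50K")

    # One NumPy array per feature column, keyed by column name.
    features = data.drop("income_bracket", axis=1)
    features = {name: np.array(values) for name, values in features.items()}
    return features, labels

# Hypothetical three-row frame to exercise the helper.
example = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", None, "Private"],
    "income_bracket": ["<=50K", ">50K", ">50K"],
})
features, labels = pandas_to_numpy(example)
print(features["age"], labels)
```

If the label strings in a given file differ (for example, a trailing period), the comparison in the sketch would need to be adjusted accordingly.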

As for the remaining data that does not contain any missing values: if we look at the min/max values and histograms for each numeric feature, then we can pinpoint any extreme outliers in our data set.

For hours_per_week, we can see that the minimum is 1, which might be a bit surprising, given that most jobs typically require multiple hours of work per week. For capital_gain and capital_loss, we can see that over 90% of values are 0. Given that capital gains/losses are only registered by individuals who make investments, it's certainly plausible that fewer than 10% of examples would have nonzero values for these features, but we may want to take a closer look to verify that the values for these features are valid.

In looking at the histogram for gender, we see that over two-thirds (approximately 67%) of examples represent males. This strongly suggests data skew, as we would expect the breakdown between genders to be closer to 50/50.
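
The proportions read off the histograms can be cross-checked in pandas. This is a minimal sketch on a hypothetical six-row DataFrame, so the printed shares are illustrative and are not the dataset's actual figures:

```python
import pandas as pd

# Hypothetical stand-in for the training data.
df = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "capital_gain": [0, 0, 2174, 0, 0, 0],
})

# Share of each gender value.
gender_share = df["gender"].value_counts(normalize=True)
print(gender_share)

# Share of rows where capital_gain is exactly zero.
zero_gain_share = (df["capital_gain"] == 0).mean()
print(zero_gain_share)
```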



# A Deeper Dive
To further explore the dataset, we can use Facets Dive, a tool that provides an interactive interface where each individual item in the visualization represents a data point. But to use Facets Dive, we need to convert the data to a JSON array. Thankfully, the DataFrame method to_json() takes care of this for us.
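
As a small illustration of the conversion, the sketch below applies `to_json(orient='records')` to a hypothetical two-row DataFrame. With this orientation, each row becomes one JSON object in an array.

```python
import json
import pandas as pd

# Hypothetical two-row frame.
df = pd.DataFrame({
    "age": [39, 50],
    "gender": ["Male", "Female"],
})

# Serialize row by row, then parse the string back to inspect its structure.
json_str = df.to_json(orient="records")
records = json.loads(json_str)
print(records)
```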

Run the cell below to perform the data transform to JSON and also load Facets Dive.

In [30]:
#@title Set the Number of Data Points to Visualize in Facets Dive

SAMPLE_SIZE = 5000  #@param

train_dive = train_dataset.sample(SAMPLE_SIZE).to_json(orient='records')

HTML_TEMPLATE = """<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=train_dive)
display(HTML(html))

# FairAware Task #2
Use the menus on the left panel of the visualization to change how the data is organized:

In the Binning | X-Axis menu, select education, and in the Color By and Label By menus, select income_bracket. How would you describe the relationship between education level and income bracket?

Next, in the Binning | X-Axis menu, select marital_status, and in the Color By and Label By menus, select gender. What noteworthy observations can you make about the gender distributions for each marital-status category?

As you perform the above tasks, keep the following fairness-related questions in mind:

What's missing?
What's being overgeneralized?
What's being underrepresented?
How do the variables, and their values, reflect the real world?
What might we be leaving out?

Now that you've explored the dataset using Facets, see if you can identify some of the problems that may arise with regard to fairness based on what you've learned about its features.

Which of the following features might pose a problem with regard to fairness?

Choose a feature from the drop-down options in the cell below, and then run the cell to check your answer. Then explore the rest of the options to get more insight about how each influences the model's predictions.

# Predicting income using the Keras API
Now that we have a better sense of the Adult dataset, we can begin creating a neural network to predict income. In this section, as in previous exercises, we will use TensorFlow's Keras API (specifically, tf.keras.Sequential) to construct our neural network model.

In [31]:
#@title Define Function to Visualize Binary Confusion Matrix
# Imports needed by the plotting code below.
import seaborn as sns
from matplotlib import rcParams

def plot_confusion_matrix(
    confusion_matrix, class_names, subgroup, figsize = (8,6)):
  # We're taking our calculated binary confusion matrix that's already in the 
  # form of an array and turning it into a pandas DataFrame because it's a lot 
  # easier to work with a pandas DataFrame when visualizing a heat map in 
  # Seaborn.
  df_cm = pd.DataFrame(
      confusion_matrix, index=class_names, columns=class_names, 
  )

  rcParams.update({
  'font.family':'sans-serif',
  'font.sans-serif':['Liberation Sans'],
  })
  
  sns.set_context("notebook", font_scale=1.25)

  fig = plt.figure(figsize=figsize)

  plt.title('Confusion Matrix for Performance Across ' + subgroup)

  # Combine each count (numerical value) with its description.
  strings = np.asarray([['True Positives', 'False Negatives'],
                        ['False Positives', 'True Negatives']])
  labels = (np.asarray(
      ["{0:g}\n{1}".format(value, string) for string, value in zip(
          strings.flatten(), confusion_matrix.flatten())])).reshape(2, 2)

  heatmap = sns.heatmap(df_cm, annot=labels, fmt="",
                        linewidths=2.0, cmap=sns.color_palette("GnBu_d"))
  heatmap.yaxis.set_ticklabels(
      heatmap.yaxis.get_ticklabels(), rotation=0, ha='right')
  heatmap.xaxis.set_ticklabels(
      heatmap.xaxis.get_ticklabels(), rotation=45, ha='right')
  plt.ylabel('References')
  plt.xlabel('Predictions')
  return fig

In [35]:
#@title Visualize Binary Confusion Matrix and Compute Evaluation Metrics Per Subgroup
CATEGORY = "gender"  #@param {type:"string"}
SUBGROUP = "Male"  #@param {type:"string"}

# Labels for annotating axes in plot.
classes = ['Over $50K', 'Less than $50K']

# For the defined subgroup, generate predictions and obtain the corresponding
# ground truth.
subgroup_filter  = test_dataset.loc[test_dataset[CATEGORY] == SUBGROUP]
features, labels = pandas_to_numpy(subgroup_filter)
subgroup_results = model.evaluate(x=features, y=labels, verbose=0)
confusion_matrix = np.array([[subgroup_results[1], subgroup_results[4]], 
                             [subgroup_results[2], subgroup_results[3]]])

subgroup_performance_metrics = {
    'ACCURACY': subgroup_results[5],
    'PRECISION': subgroup_results[6],
    'RECALL': subgroup_results[7],
    'AUC': subgroup_results[8],
}
performance_df = pd.DataFrame(subgroup_performance_metrics, index=[SUBGROUP])
pd.options.display.float_format = '{:,.4f}'.format

plot_confusion_matrix(confusion_matrix, classes, SUBGROUP);
performance_df

NameError: name 'pandas_to_numpy' is not defined
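
The per-subgroup accuracy, precision, and recall reported by the cell above all derive from the four confusion-matrix counts. The following is a minimal NumPy sketch of that arithmetic, using made-up labels and thresholded predictions rather than output from the model:

```python
import numpy as np

# Hypothetical ground truth and thresholded predictions for one subgroup.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

# Same layout as the plot: rows are references, columns are predictions.
confusion = np.array([[tp, fn],
                      [fp, tn]])

accuracy = (tp + tn) / confusion.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(confusion, accuracy, precision, recall)
```

Comparing these values across subgroups (for example, Male versus Female) is what surfaces differences in how the model treats each group.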