# **Homework 15: AI Ethics**
---

### **Description**
In this notebook, we will use AWS Clarify to analyze UCI's Parkinson's Dataset.

<br>

### **Structure**
**Part 1**: [UCI's Parkinson's Dataset](#p1)



<br>




### **Cheat Sheets**
[AWS Clarify](https://docs.google.com/document/d/1eGmQBEzCt4YBgPDxWB7j6UPb1HPucIISz3k1FLGunr4/edit?usp=sharing)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
#!pip install scikit-learn
!pip install --quiet smclarify

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from smclarify.bias.report import FacetColumn, LabelColumn, StageType, bias_report

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

<a name="p1"></a>

---
## **Part 1: UCI's Parkinson's Dataset**
---

In this section, we are going to use Clarify to identify bias. This method is *much faster* than creating bar graphs for every column.

Here is an AWS Clarify [cheat sheet](https://docs.google.com/document/d/1PY06KZU97J-HU9nRArMH6biuoaCjPA_zm-EhIlTId2I/edit?usp=sharing) for interpreting results.



### **Problem #1.1: AWS Clarify & UCI's Parkinson's Dataset**

This dataset is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patients' homes.

Columns in the table contain subject number, subject age, subject gender, time interval from baseline recruitment date, motor UPDRS, total UPDRS, and 16 biomedical voice measures. Each row corresponds to one of 5,875 voice recordings from these individuals. The main aim of the data is to predict the motor and total UPDRS scores (`motor_UPDRS` and `total_UPDRS`) from the 16 voice measures.


#### **Steps #1-2: Load the data and import packages.**


In [None]:
url = "https://raw.githubusercontent.com/the-codingschool/TRAIN-datasets/main/parkinsons/parkinsons.csv"
df = pd.read_csv(url)
df.head()

#### **Step #3: Denote the facet column, the label column, and the group variable.**

Set the:
* Facet column as `age`.
* Label column with `sex` as the target column and `'0'` as the positive label.
* `age` as the group variable.

In [None]:
facet_column = FacetColumn(# COMPLETE THIS LINE
label_column = LabelColumn(# COMPLETE THIS LINE
group_variable = # COMPLETE THIS LINE

#### **Step #4: Generate bias report.**


Use the cheat sheet to interpret the pre-training metrics in the bias report generated below.
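
To make one of the pre-training metrics concrete, here is a minimal sketch that computes the Difference in Proportions of Labels (DPL) by hand, using the definition DPL = q_a - q_d, where q_a and q_d are the fractions of positive labels in facet *a* and facet *d*. The rows, the age threshold of 73, and the helper name `difference_in_proportions` are all made up for illustration; they are not taken from the Parkinson's data or from the `smclarify` source.

```python
def difference_in_proportions(rows, in_facet_d, is_positive):
    """DPL = q_a - q_d: positive-label rate in facet a minus the rate in facet d."""
    facet_d = [r for r in rows if in_facet_d(r)]
    facet_a = [r for r in rows if not in_facet_d(r)]
    q_a = sum(is_positive(r) for r in facet_a) / len(facet_a)
    q_d = sum(is_positive(r) for r in facet_d) / len(facet_d)
    return q_a - q_d

# Toy (age, sex) rows, invented for this example.
toy_rows = [(58, 0), (61, 1), (63, 0), (66, 1), (70, 0),
            (72, 1), (74, 1), (75, 0), (78, 1), (85, 1)]

# Assumptions: facet d is "age above 73" and the positive label is sex == 0.
dpl_value = difference_in_proportions(
    toy_rows,
    in_facet_d=lambda row: row[0] > 73,
    is_positive=lambda row: row[1] == 0,
)
print(dpl_value)  # 0.25
```

Here 3 of the 6 rows in facet *a* carry the positive label (q_a = 0.5) against 1 of the 4 rows in facet *d* (q_d = 0.25). A DPL near 0 means both facets receive the positive label at similar rates. A positive value means facet *a* receives it more often.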

<br>

**Run the cell below to generate your bias report.**

In [None]:
report = bias_report(df, facet_column, label_column, stage_type=StageType.PRE_TRAINING, group_variable=group_variable)

# Print every metric in the report; this loop expects the variable to be named "report".
for cl in report:
    print("\n\n","-"*35)
    print("-"*15, cl["value_or_threshold"], "-"*15)
    for metric in cl['metrics']:
        print(f"{metric['description']}: {metric['value']}")

#### **Step #5: Look at the imbalance.**

Use the cheat sheet provided to interpret what the bias report is telling you.
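
As a rough illustration of what the imbalance number measures, the sketch below computes Class Imbalance (CI) by hand from the definition CI = (n_a - n_d) / (n_a + n_d), where n_a and n_d are the numbers of rows in facet *a* and facet *d*. The ages, the threshold of 73, and the helper name `class_imbalance` are assumptions made for this example only; the report chooses its own value or threshold for the facet.

```python
def class_imbalance(facet_values, in_facet_d):
    """CI = (n_a - n_d) / (n_a + n_d); the result lies between -1 and 1."""
    n_d = sum(1 for value in facet_values if in_facet_d(value))
    n_a = len(facet_values) - n_d
    return (n_a - n_d) / (n_a + n_d)

# Toy ages, invented for this example (not from the Parkinson's data).
toy_ages = [58, 61, 63, 66, 70, 72, 74, 75, 78, 85]

# Assumption: facet d is "age above 73".
ci = class_imbalance(toy_ages, in_facet_d=lambda age: age > 73)
print(ci)  # 0.2
```

With 6 rows in facet *a* and 4 in facet *d*, CI is 0.2. A value of 0 means the two facets are equally represented. Values closer to 1 mean facet *d* is under-represented, and values closer to -1 mean it is over-represented.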

# End of notebook
---
© 2024 The Coding School, All rights reserved