# Naive ML

Author:       Riley Hunter

Last Updated: 2019.07.28

## Overview

This notebook provides the code for the research paper *Risks of using naïve approaches to Artificial Intelligence: A case study* (Hunter & Prabhu, 2019).

Dataset used: Kaggle student academic performance data, https://www.kaggle.com/aljarah/xAPI-Edu-Data

For the purposes of this research, we consider the following attributes of the dataset to be "undesirable", meaning 

## Citations

Amrieh, E. A., Hamtini, T., & Aljarah, I. (2016). Mining Educational Data to Predict Student’s academic Performance using Ensemble Methods. International Journal of Database Theory and Application, 9(8), 119-136.

Amrieh, E. A., Hamtini, T., & Aljarah, I. (2015, November). Preprocessing and analyzing educational data set using X-API for improving student's performance. In Applied Electrical Engineering and Computing Technologies (AEECT), 2015 IEEE Jordan Conference on (pp. 1-5). IEEE.

Ruizendaalr. (2017). Visualizations and Classification Using Tree-based Models: Decision Tree, Random Forest & XGBoost. https://gist.github.com/ruizendaalr/2ecc1951b1415f842feea8bd3d9b1c5d

Kernfeld, P. (2016). How to extract the decision rules from scikit-learn decision-tree? Stack Overflow. https://stackoverflow.com/questions/20224526/how-to-extract-the-decision-rules-from-scikit-learn-decision-tree

In [8]:
# Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import _tree
from sklearn.tree import plot_tree
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import cross_val_score

In [12]:
# Data import, preprocessing and cleaning
edm = pd.read_csv('xAPI-Edu-Data.csv')
edm.rename(index=str, columns={'gender':'Gender', 'NationalITy':'Nationality',
                               'raisedhands':'RaisedHands', 'VisITedResources':'VisitedResources',
                              'PlaceofBirth':'PlaceOfBirth'},
                               inplace=True)

X = edm.drop('Class', axis=1)
y = edm['Class']
labelEncoder = LabelEncoder()
cat_columns = X.dtypes.pipe(lambda x: x[x == 'object']).index
label_mappings = {}

for col in cat_columns:
    X[col], label_mappings[col] = pd.factorize(X[col])

### Decision Tree training and exploration

In this section, we will fit a Decision Tree across all features bar the class. You might imagine this step as an automated tool simply scraping a database for all columns - without proper preparation, we will include many undesirable features such as gender and nationality.

A Decision Tree was chosen for its ease of interpretation. Other models are not necessarily less prone to discriminating on undesirable features, but it is generally harder to detect when they do so.
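
One way to picture the "proper preparation" mentioned above is an explicit exclusion list applied before training. The following is a minimal sketch, not the method used in this notebook: the helper name `split_features` and the column subset are illustrative assumptions, and only the three undesirable column names come from this notebook.

```python
# Hypothetical helper: separate undesirable columns from the rest before training.
UNDESIRABLE = {"Gender", "Nationality", "PlaceOfBirth"}

def split_features(columns, undesirable=UNDESIRABLE):
    """Return (kept, dropped) column names, preserving the input order."""
    kept = [c for c in columns if c not in undesirable]
    dropped = [c for c in columns if c in undesirable]
    return kept, dropped

# Illustrative subset of the dataset's columns
columns = ["Gender", "Nationality", "PlaceOfBirth", "RaisedHands", "VisitedResources"]
kept, dropped = split_features(columns)
print("train on:", kept)
print("excluded:", dropped)
```

With a pandas DataFrame, `X[kept]` would then select the remaining columns.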

#### Train Decision Tree

In [13]:
keys = []
scores = []

# Hold out 30% of the rows for testing; the seed is fixed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=52)

k = 'Decision Tree'  # label for this model in the results table
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
pred = model.predict(X_test)
print('Results for: ' + str(k) + '\n')
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
acc = accuracy_score(y_test, pred)
print("accuracy is " + str(acc))
print('\n' + '\n')
keys.append(k)
scores.append(acc)
table = pd.DataFrame({'model': keys, 'accuracy score': scores})

print(table)

#### Explore Decision Tree

In this section, we convert the trained Decision Tree back to Python code so that we can inspect it and note every point where the gender, nationality, or place of birth features are used as discriminators.

Adapted from (Kernfeld 2016).
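
Because `pd.factorize` replaces each category with an integer code in order of first appearance, a split such as `Gender <= 0.5` is really a partition of category labels. The sketch below illustrates that decoding step in plain Python. `decode_split` is a hypothetical helper and the category list is illustrative, not taken from the fitted tree.

```python
def decode_split(categories, threshold):
    """Split category labels by comparing their integer codes to a tree threshold.

    `categories` lists labels in code order (code 0 first), in the same way as
    the `uniques` returned by pandas.factorize.
    """
    left = [label for code, label in enumerate(categories) if code <= threshold]
    right = [label for code, label in enumerate(categories) if code > threshold]
    return left, right

# Illustrative codes: 'M' -> 0, 'F' -> 1
left, right = decode_split(["M", "F"], 0.5)
print(f"if Gender in {left}:")
print(f"else:  # if Gender in {right}")
```

The integer codes carry no natural order, so a threshold over them groups categories by their order of first appearance in the data rather than by any meaningful ranking.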

In [11]:


def recurse(node, depth):
    indent = "  " * depth
    if tree_.feature[node] != _tree.TREE_UNDEFINED:
        name = feature_name[node]
        if node in undesirable_nodes:
            label_mapping = label_mappings[name].to_list()
        threshold = tree_.threshold[node]
        
        if node in undesirable_nodes:
            print(f"{indent}if {name} in {[val for val in label_mapping if label_mapping.index(val) < threshold]}:")
        else:
            print(f"{indent}if {name} <= {threshold}:")
            
        recurse(tree_.children_left[node], depth + 1)
        
        if node in undesirable_nodes:
            print(f"{indent}else:  # if {name} in {[val for val in label_mapping if label_mapping.index(val) > threshold]}")
        else:
            print(f"{indent}else:  # if {name} > {threshold}:")
            
        recurse(tree_.children_right[node], depth + 1)
    else:
        print(f"{indent}return {str(tree_.value[node][0]).replace('.', '.,', 2)}")

tree_ = model.tree_
# Index into X's columns (the features the model was trained on) rather than edm's
feature_name = [X.columns[i] if i != _tree.TREE_UNDEFINED else "undefined!" for i in tree_.feature]
undesirable_nodes = [node for node in range(len(feature_name)) if feature_name[node] in ("Gender", "Nationality", "PlaceOfBirth")]
feature_names = ", ".join(list(edm))
# print(f"def tree({feature_names}):")

recurse(0, 1)

  if VisitedResources <= 27.0:
    if StudentAbsenceDays <= 0.5:
      if RaisedHands <= 13.5:
        if Gender in ['M']:
          if VisitedResources <= 16.0:
            return [0., 7., 0.]
          else:  # if VisitedResources > 16.0:
            if AnnouncementsView <= 31.0:
              return [1., 0., 0.]
            else:  # if AnnouncementsView > 31.0:
              return [0., 0., 1.]
        else:  # if Gender in ['F']
          if AnnouncementsView <= 5.5:
            if GradeID <= 5.0:
              return [0., 0., 1.]
            else:  # if GradeID > 5.0:
              return [0., 1., 0.]
          else:  # if AnnouncementsView > 5.5:
            return [0., 0., 5.]
      else:  # if RaisedHands > 13.5:
        if RaisedHands <= 69.5:
          if GradeID <= 0.5:
            if AnnouncementsView <= 17.0:
              return [0., 0., 1.]
            else:  # if AnnouncementsView > 17.0:
              return [1., 0., 0.]
          else:  # if GradeID > 0.5:
           

## Analysis

The code above is quite verbose. It does, however, reveal a few important features:

In [95]:
print(f"Found {len(undesirable_nodes)} nodes discriminating on place of birth, gender, or nationality")

Found 11 nodes discriminating on place of birth, gender, or nationality


This means that at eleven points in the final classifier, the result may be directly affected by discrimination on one of these features.
For example, in the following section we adapt a short branch of the decision tree, mirroring the internal logic of the classifier, and show that a fictitious student receives a different prediction based purely on their gender.
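
The test described here, changing a single attribute and comparing the predictions, can be sketched generically. This is an illustration under assumptions: `differs_on` and `toy_predict` are hypothetical names, and the toy rule stands in for a trained classifier.

```python
def differs_on(predict, record, attribute, values):
    """Return {value: prediction} when only `attribute` is varied in `record`."""
    results = {}
    for value in values:
        variant = dict(record)  # copy so the original record is left untouched
        variant[attribute] = value
        results[value] = predict(variant)
    return results

def toy_predict(record):
    # Stand-in for a classifier that (undesirably) splits on Gender
    if record["Gender"] == "M":
        return "A" if record["Topic"] <= 9.5 else "B"
    return "A"

record = {"Gender": "M", "Topic": 10}
outcomes = differs_on(toy_predict, record, "Gender", ["M", "F"])
print(outcomes)
print("discriminates on Gender:", len(set(outcomes.values())) > 1)
```

If more than one distinct prediction comes back, the attribute alone was enough to change the outcome for that record.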

### Proof of gender discrimination in naive decision tree

In [121]:
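# Lines 167 - 189 of output above, adapted as a function
# AnnouncementsView is read on the last branch; its default of 0 is a placeholder
# (an assumption), since the example student below never reaches that branch
def get_branch_output(Gender, ParentAnsweringSurvey, Nationality, Topic, PlaceOfBirth, AnnouncementsView=0):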
    if Gender in ['M']:
        if ParentAnsweringSurvey <= 0.5:
            if Nationality in ['KW', 'lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan', 'venzuela', 'Iran', 'Tunis', 'Morocco']:
                if Topic <= 9.5:
                    return [14.,  0.,  0.]
                else:  # if Topic > 9.5:
                    return [0., 0., 1.]
            else:  # if Nationality in ['Syria', 'Palestine', 'Iraq', 'Lybia']
                return [0., 0., 2.]
        else:  # if ParentAnsweringSurvey > 0.5:
            if Nationality in ['KW', 'lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan', 'venzuela', 'Iran', 'Tunis', 'Morocco', 'Syria']:
                return [0., 0., 3.]
            else:  # if Nationality in ['Palestine', 'Iraq', 'Lybia']
                return [1., 0., 0.]
    else:  # if Gender in ['F']
        if PlaceOfBirth in ['KuwaIT', 'lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan', 'venzuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Iraq']:
            return [24.,  0.,  0.]
        else:  # if PlaceOfBirth in ['Palestine', 'Lybia']
            if AnnouncementsView <= 67.5:
                return [0., 0., 1.]
            else:  # if AnnouncementsView > 67.5:
                return [2., 0., 0.]
        
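def get_prediction(student):
    # Leaf vectors hold per-class sample counts, ordered as in model.classes_.
    # The label order below is assumed; verify it against model.classes_.
    predictions = ["Low", "Mid", "High"]
    output = get_branch_output(student.Gender, student.ParentAnsweringSurvey, student.Nationality, student.Topic, student.PlaceOfBirth)
    # Report the first class with a non-zero count in the leaf
    return predictions[[output.index(val) for val in output if val > 0][0]]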

class student:
    ParentAnsweringSurvey = 0
    Nationality = 'Egypt'
    Topic = 10
    PlaceOfBirth = 'Egypt'
    
    # Allow only changes to gender
    def __init__(self, gender):
        self.Gender = gender

student_m = student('M')
student_f = student('F')

# Use a loop variable that does not shadow the `student` class
for s in [student_m, student_f]:
    print(f'Prediction for student with gender {s.Gender} is {get_prediction(s)}')

Prediction for student with gender M is High
Prediction for student with gender F is Low
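
Each leaf vector in the branch above holds per-class sample counts, and scikit-learn orders those counts by the fitted model's `classes_` attribute, which is stored in sorted order. The following is a minimal sketch of decoding a leaf under that ordering. It assumes the class labels are the strings `'L'`, `'M'` and `'H'` (an assumption about the `Class` column, to be checked against `model.classes_` on the fitted tree), and `decode_leaf` is a hypothetical helper.

```python
# Assumed raw labels; scikit-learn keeps classes_ in sorted order
labels = ["L", "M", "H"]
classes = sorted(set(labels))

def decode_leaf(counts, classes):
    """Return the class label with the largest sample count in a leaf."""
    best = max(range(len(counts)), key=lambda i: counts[i])
    return classes[best]

print(classes)
print(decode_leaf([0.0, 0.0, 1.0], classes))
```

If this order differs from the hard-coded list in `get_prediction`, the printed labels would need to be remapped accordingly. The finding that the prediction changes with gender alone does not depend on the label names.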
