# Placements

This data set contains placement data for students at the XYZ campus. It includes secondary and higher secondary school
percentages and specializations, degree specialization and type, work experience, and the salaries offered to the
placed students.

https://www.kaggle.com/benroshan/factors-affecting-campus-placement


In [1]:
import dalex as dx
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("Placement_Data_Full_Class.csv")

In [2]:
# 'status' is the target; drop the row id and 'salary', which is only recorded for placed students
data.pop('sl_no')
data.pop('salary')

X = data.drop(columns='status')

# encode the target as integers, then remove it from `data`
enc = LabelEncoder()
y = enc.fit_transform(data['status'])
data.pop('status')

0          Placed
1          Placed
2          Placed
3      Not Placed
4          Placed
          ...    
210        Placed
211        Placed
212        Placed
213        Placed
214    Not Placed
Name: status, Length: 215, dtype: object
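
For reference, `LabelEncoder` assigns integer codes in sorted order of the class labels, so `Not Placed` becomes 0 and
`Placed` becomes 1. A minimal sketch on a few hand-written status values (toy data, not rows from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# toy labels for illustration only, not taken from the CSV
status = ["Placed", "Not Placed", "Placed"]

enc = LabelEncoder()
y = enc.fit_transform(status)

print(list(enc.classes_))  # classes are stored in sorted order
print(y.tolist())          # each code is an index into enc.classes_
```

`enc.inverse_transform(y)` maps the codes back to the original labels if they are needed later.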

In [3]:
categorical_features = ['gender', 'ssc_b', 'hsc_b', 'hsc_s', 'degree_t', 'workex', 'specialisation']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features)
])

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(max_depth=7, random_state=123))
])
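
Note that the `ColumnTransformer` above lists only the categorical columns, and its `remainder` parameter defaults to
`'drop'`, so any column not listed is not passed to the classifier. The sketch below illustrates the difference on a
toy frame; the numeric column name `ssc_p` and all values are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# toy frame: two categorical columns and one numeric column (assumed name, made-up values)
toy = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "workex": ["Yes", "No", "No", "Yes"],
    "ssc_p": [67.0, 79.3, 65.0, 56.0],
})
cat_cols = ["gender", "workex"]

# default behaviour: columns that are not listed are dropped
drop_rest = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
])

# alternative: pass the unlisted (numeric) columns through unchanged
keep_rest = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
], remainder="passthrough")

shape_drop = drop_rest.fit_transform(toy).shape
shape_keep = keep_rest.fit_transform(toy).shape
print(shape_drop, shape_keep)
```

Switching the notebook's preprocessor to `remainder='passthrough'` would change the fitted tree, so the outputs shown
in the following cells would no longer apply as printed.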

In [4]:
clf.fit(X, y)

exp = dx.Explainer(clf, X, y)

Preparation of a new explainer is initiated

  -> data              : 215 rows 12 cols
  -> target variable   : 215 values
  -> model_class       : sklearn.tree._classes.DecisionTreeClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x000001EAE6C148B0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0, mean = 0.688, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.818, mean = 4.65e-18, max = 0.75
  -> model_info        : package sklearn

A new explainer has been created!


In [5]:
# protected attribute: every combination of gender and work experience; 'M_Yes' is the reference group
protected = data.gender + '_' + data.workex
privileged = 'M_Yes'
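
The protected vector is a plain string concatenation of the two columns, giving one label per row. A small sketch on a
toy frame (made-up rows) shows the resulting labels and how subgroup sizes could be checked before comparing ratios,
since metrics computed on very small subgroups tend to be noisy:

```python
import pandas as pd

# toy rows for illustration only
toy = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M"],
    "workex": ["Yes", "No", "No", "Yes", "Yes"],
})

# element-wise string concatenation: one subgroup label per row
protected = toy.gender + "_" + toy.workex

# number of rows in each subgroup
counts = protected.value_counts()
print(protected.tolist())
print(counts.to_dict())
```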

In [6]:
fobject = exp.model_fairness(protected=protected, privileged=privileged)

In [7]:
fobject.fairness_check(epsilon=0.8)  # default epsilon


Bias detected in 3 metrics: ACC, PPV, STP

Conclusion: your model is not fair because 2 or more criteria exceeded acceptable limits set by epsilon.

Ratios of metrics, based on 'M_Yes'. Parameter 'epsilon' was set to 0.8 and therefore metrics should be within (0.8, 1.25)
         TPR       ACC       PPV    FPR       STP
F_No   0.867  0.786624  0.768903  0.834  0.708068
F_Yes  0.944  1.013800  1.064963    NaN  0.820594
M_No   0.926  0.792994  0.782748  1.090  0.830149

Take into consideration that NaN's are present, consider checking 'metric_scores' plot to see the difference


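These results indicate that the model is biased relative to the `M_Yes` reference group. Both groups without work
experience (`F_No`, `M_No`) fall below the 0.8 threshold on ACC and PPV, and `F_No` also falls below it on STP (0.708).
Within each work-experience level, women have a lower STP ratio than men (0.708 vs. 0.830 without experience, 0.821 vs.
the reference value of 1 with experience). The model therefore favors men over women, and candidates with work
experience over those without it.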
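
The acceptance rule quoted in the output, that every ratio should lie within (epsilon, 1/epsilon), can be re-applied by
hand to the printed table. The sketch below copies the ratios from the output above and uses a hypothetical helper,
`flag_violations`, which is not part of dalex; skipping NaN entries is an assumption of this sketch.

```python
import math

epsilon = 0.8

# ratios copied from the fairness_check output (reference group: 'M_Yes')
ratios = {
    "F_No":  {"TPR": 0.867, "ACC": 0.786624, "PPV": 0.768903, "FPR": 0.834, "STP": 0.708068},
    "F_Yes": {"TPR": 0.944, "ACC": 1.013800, "PPV": 1.064963, "FPR": math.nan, "STP": 0.820594},
    "M_No":  {"TPR": 0.926, "ACC": 0.792994, "PPV": 0.782748, "FPR": 1.090, "STP": 0.830149},
}


def flag_violations(ratios, epsilon):
    """Map each metric to the subgroups whose ratio lies outside (epsilon, 1/epsilon)."""
    flagged = {}
    for group, metrics in ratios.items():
        for metric, ratio in metrics.items():
            if math.isnan(ratio):
                continue  # assumption: undefined ratios are skipped, not counted as violations
            if not (epsilon < ratio < 1 / epsilon):
                flagged.setdefault(metric, []).append(group)
    return flagged


flagged = flag_violations(ratios, epsilon)
print(flagged)
```

On this table the helper flags ACC, PPV and STP, the same three metrics that `fairness_check` reports.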