# Doing a multivariable logistic regression on employee data

In [17]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

In [2]:
employee_data = pd.read_csv("https://tinyurl.com/y6r7qjrp")
employee_data.head()

,SEX,AGE,PROMOTIONS,YEARS_EMPLOYED,DID_QUIT
0,0,25,2,3,0
1,0,30,2,3,0
2,0,26,2,3,0
3,0,25,1,2,0
4,0,28,1,2,0


In [6]:
# grab independent variable columns
inputs = employee_data.iloc[:, :-1]
inputs.head()

,SEX,AGE,PROMOTIONS,YEARS_EMPLOYED
0,0,25,2,3
1,0,30,2,3
2,0,26,2,3
3,0,25,1,2
4,0,28,1,2


In [7]:
# grab dependent "did_quit" variable column
output = employee_data.iloc[:, -1]
output.head()

0    0
1    0
2    0
3    0
4    0
Name: DID_QUIT, dtype: int64

In [8]:
# build logistic regression
fit = LogisticRegression(penalty=None).fit(inputs, output)

In [9]:
# Print coefficients:
print("COEFFICIENTS: {0}".format(fit.coef_.flatten()))
print("INTERCEPT: {0}".format(fit.intercept_.flatten()))

COEFFICIENTS: [ 0.03213405  0.03682453 -2.50410028  0.9742266 ]
INTERCEPT: [-2.73485302]


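Judging by the size of the coefficients, sex and age contribute very little per unit change (both weights are near 0). Promotions and years employed carry much larger weights of -2.504 and 0.974: each additional promotion pushes the prediction toward staying, while each additional year employed pushes it toward quitting. Keep in mind that the variables are on different scales, so a small weight on a wide-ranging variable like age can still shift the result.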
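One way to read these weights is as odds ratios: exponentiating a coefficient gives the factor by which the odds of quitting are multiplied when that variable rises by one unit and the others are held fixed. The sketch below uses only the coefficients printed above. The `coefficients` and `odds_ratios` names are illustrative and not part of the original notebook.

```python
import math

# Coefficients copied from the printed output above, keyed by column name
coefficients = {
    "SEX": 0.03213405,
    "AGE": 0.03682453,
    "PROMOTIONS": -2.50410028,
    "YEARS_EMPLOYED": 0.9742266,
}

# exp(coefficient) is the multiplicative change in the odds of quitting
# for a one-unit increase in that variable, all else held equal
odds_ratios = {name: math.exp(weight) for name, weight in coefficients.items()}

for name, ratio in odds_ratios.items():
    print("{0}: {1:.3f}".format(name, ratio))
```

Read this way, each additional promotion multiplies the odds of quitting by roughly 0.08, while each additional year employed multiplies them by roughly 2.6.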

In [10]:
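# Interact and test with new employee data
def predict_employee_will_stay(sex, age, promotions, years_employed):
    # Wrap the values in a one-row DataFrame so the column names match the training data
    employee = pd.DataFrame([[sex, age, promotions, years_employed]],
                            columns=inputs.columns)
    prediction = fit.predict(employee)
    # predict_proba returns [[P(stay), P(leave)]] for the single row
    probabilities = fit.predict_proba(employee)
    # predict returns a 1-D array; class 1 means the employee quit
    if prediction[0] == 1:
        return "WILL LEAVE: {0}".format(probabilities)
    else:
        return "WILL STAY: {0}".format(probabilities)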

In [16]:
predict_employee_will_stay(0,35,2,6)



'WILL STAY: [[0.64767509 0.35232491]]'
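The probabilities in this output can be reproduced by hand, which shows what the model is doing internally: it adds the intercept to the weighted sum of the inputs to get the log-odds, then passes that through the logistic function. This is a minimal sketch using the rounded coefficients printed earlier. The variable names are illustrative, and the result agrees with `predict_proba` only up to rounding.

```python
import math

# Coefficients and intercept copied from the printed output earlier
coefficients = [0.03213405, 0.03682453, -2.50410028, 0.9742266]
intercept = -2.73485302

# Same employee as the call above: sex=0, age=35, promotions=2, years_employed=6
employee = [0, 35, 2, 6]

# Linear part of the model: intercept plus the weighted sum of the inputs
log_odds = intercept + sum(w * x for w, x in zip(coefficients, employee))

# The logistic function maps the log-odds to the probability of class 1 (quit)
p_leave = 1.0 / (1.0 + math.exp(-log_odds))

print("P(leave) = {0:.4f}, P(stay) = {1:.4f}".format(p_leave, 1 - p_leave))
```

Because the probability of leaving is below 0.5, the model classifies this employee as staying.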

In [11]:
# Test a prediction interactively; press Enter on an empty line to stop
while True:
    n = input("Predict employee will stay or leave {sex},{age},{promotions},{years employed}: ")
    if not n.strip():
        break
    sex, age, promotions, years_employed = n.split(",")
    print(predict_employee_will_stay(int(sex), int(age), int(promotions),
          int(years_employed)))

WILL LEAVE: [[0.28570264 0.71429736]]

**BE CAREFUL MAKING CLASSIFICATIONS ON PEOPLE!**

A quick and surefire way to shoot yourself in the foot is to collect data on people and use it to make predictions haphazardly. Beyond the data privacy concerns, legal and PR problems can emerge if the model is found to be discriminatory. During training, input variables like race and gender can pick up weight, and the model's decisions then inflict undesirable outcomes on those demographics, such as not being hired or being denied loans. More extreme cases include being falsely flagged by surveillance systems or being denied criminal parole. Note too that seemingly benign variables like commute time have been found to correlate with discriminatory variables.

At the time of writing, a number of articles have been citing machine learning discrimination as an issue:

Katyanna Quach, “Teen turned away from roller rink after AI wrongly identifies her as banned troublemaker”, The Register, July 16, 2021. https://www.theregister.com/2021/07/16/facial_recognition_failure

Kashmir Hill, “Wrongfully Accused by an Algorithm”, New York Times, June 24, 2020. https://www.nytimes.com/2020/06/24/technology/facial-recognition-arrest.html

As data privacy laws continue to evolve, it is advisable to err on the side of caution and handle personal data carefully. Think about which automated decisions will be propagated and how they can cause harm. Sometimes it is better to just leave a “problem” alone and keep doing it manually.

## Train/Test Splits

In [19]:
# Load the data
df = pd.read_csv("https://tinyurl.com/y6r7qjrp", delimiter=",")

X = df.values[:, :-1]
Y = df.values[:, -1]

# "random_state" is the random seed, which we fix to 7
kfold = KFold(n_splits=3, random_state=7, shuffle=True)
model = LogisticRegression(penalty=None)
results = cross_val_score(model, X, Y, cv=kfold)

print("Accuracy Mean: %.3f (stdev=%.3f)" % (results.mean(), results.std()))

Accuracy Mean: 0.611 (stdev=0.000)
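To see what `KFold(n_splits=3, shuffle=True)` does with the rows, the sketch below runs the same splitter over nine placeholder row indices and prints each fold. The `rows` array is hypothetical stand-in data, not the employee dataset. The point is only that each row lands in the test set of exactly one fold, while the model is trained on the other two folds each time.

```python
import numpy as np
from sklearn.model_selection import KFold

# Nine placeholder row indices standing in for employee records (hypothetical data)
rows = np.arange(9)

# Same splitter settings as the cell above
kfold = KFold(n_splits=3, random_state=7, shuffle=True)

# Count how many times each row is used for testing across the folds
test_counts = np.zeros(len(rows), dtype=int)
fold_sizes = []

for fold, (train_idx, test_idx) in enumerate(kfold.split(rows), start=1):
    print("Fold {0}: train={1} test={2}".format(fold, train_idx, test_idx))
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))
```

`cross_val_score` fits a fresh copy of the model on each training portion and scores it on the matching test portion, which is why it returns one accuracy per fold.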
