# Preparing employee data for safe release

The Titanic did not have a happy ending, but it left us with a great dataset to practice on. If we were to work with similar data about real people, we would need to make sure that the personal information of our customers, clients, or anyone else cannot be traced or exposed.

In this exercise we will practice some basic suppression and generalization techniques using Pandas and SciPy. We will anonymize categorical columns by encoding them as numbers: `Gender` becomes 0 or 1, and `Department` gets one integer label for each department in the company.
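
As a minimal sketch of the encoding idea, the snippet below applies scikit-learn's `LabelEncoder` to a small toy frame. The `toy` data is invented for illustration and is not taken from the HR dataset; only the column names mirror it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data invented for illustration (assumption: not the real HR records).
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Department": ["Sales", "Research & Development", "Human Resources", "Sales"],
})

encoded = toy.copy()
for column in ["Gender", "Department"]:
    # LabelEncoder sorts the distinct values and maps them to 0..n_classes-1.
    encoded[column] = LabelEncoder().fit_transform(toy[column])

print(encoded)
```

Because the labels follow the sorted order of the original values, the encoding hides the text but not the number of categories or their frequencies.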

A copy of the **IBM HR Analytics Employee Attrition & Performance** dataset has already been loaded as a Pandas dataframe called `df`. Feel free to use the interactive console to explore it further, for example by checking the column information and data types with `df.info()`.



1. Drop the columns whose values are unique or almost unique for every row, and drop the rows that contain NaN values.
2. Anonymize the categorical columns by replacing their values with encoded numeric labels.
3. Knowing that the best distribution fit for the `Age` column is `fisk`, fit it to obtain the necessary parameters, then use them to generate sample data close to the original values.
4. For even more privacy, and to avoid leaking information about the dataset, replace the column names with numbers.
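
For step 1, one possible way to spot candidate columns is to look at the share of distinct values per column. The helper `near_unique_columns` and its `threshold` of 0.5 are hypothetical choices made for this sketch, and the `toy` frame is invented data that only borrows column names from the dataset.

```python
import pandas as pd


def near_unique_columns(frame, threshold=0.5):
    """Return columns whose share of distinct values exceeds `threshold`.

    Hypothetical helper; the 0.5 default is an arbitrary assumption.
    """
    ratios = frame.nunique() / len(frame)
    return ratios[ratios > threshold].index.tolist()


# Toy data invented for illustration (assumption: not the real HR records).
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 4, 5, 7, 8],
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468, 3068],
    "Department": ["Sales", "R&D", "R&D", "R&D", "R&D", "Sales"],
    "StandardHours": [80, 80, 80, 80, 80, 80],
})

candidates = near_unique_columns(toy)
print(candidates)

# Suppress the flagged columns, then drop rows with missing values.
cleaned = toy.drop(columns=candidates).dropna()
print(cleaned.columns.tolist())
```

A ratio like this is only a starting point: whether a column is identifying enough to suppress still needs a judgment call about the data.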

In [70]:
######## Pre code
import pandas as pd
import numpy as np
import scipy.stats
import statsmodels as sm
from math import floor
import matplotlib.pyplot as plt
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")
# Multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Instantiate the label encoder (LabelEncoder is already imported above)
label_encoder = LabelEncoder()

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.shape
df.head()

(1470, 35)

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [71]:
# Drop columns that are unique or almost unique for every row, then drop rows with NaN
df.drop(columns=["EmployeeNumber", "MonthlyIncome", "MonthlyRate", "DailyRate"], inplace=True)
df.dropna(inplace=True)

# Encode labels of categorical variables
for c in ["Gender", "Attrition", "Department", "EducationField", "BusinessTravel"]:
    df[c] = label_encoder.fit_transform(df[c])

# Fit the fisk distribution to Age, then replace Age with rounded samples drawn from the fit
params = scipy.stats.fisk.fit(df["Age"])
df["Age"] = scipy.stats.fisk.rvs(*params, size=len(df.index)).round()

# Replace the column names with numbers and see the new generated dataset
df.columns = list(range(df.shape[1]))
df.head()

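| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35.0 | 1 | 2 | 2 | 1 | 2 | 1 | 1 | 2 | 0 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 41.0 | 0 | 1 | 1 | 8 | 1 | 1 | 1 | 3 | 1 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 34.0 | 1 | 2 | 1 | 2 | 2 | 4 | 1 | 4 | 1 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 28.0 | 0 | 1 | 1 | 3 | 4 | 1 | 1 | 4 | 0 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 22.0 | 0 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 1 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |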
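
To get a feel for how close the sampled `Age` values are to the original ones, the fit-and-sample step can be tried on a stand-in column. This is only a sketch: the `ages` array is synthetic (drawn from a fisk distribution with arbitrarily chosen parameters), the seed is arbitrary, and the two-sample Kolmogorov-Smirnov test is one possible check among many, not something the exercise prescribes.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Stand-in for df["Age"]: synthetic values with assumed parameters, not real data.
ages = scipy.stats.fisk.rvs(6, loc=0, scale=36, size=1000, random_state=rng).round()

# Fit the fisk distribution; the result is the tuple (c, loc, scale).
params = scipy.stats.fisk.fit(ages)

# Draw as many rounded samples as there are original values.
synthetic = scipy.stats.fisk.rvs(*params, size=len(ages), random_state=rng).round()

# Compare the two samples: a small statistic means similar distributions.
result = scipy.stats.ks_2samp(ages, synthetic)
print("median original :", np.median(ages))
print("median synthetic:", np.median(synthetic))
print("KS statistic    :", result.statistic)
```

Note that the sampled values are drawn independently of the other columns, so any relationship between age and the remaining attributes of a row is lost. That helps privacy but reduces the analytical value of the column.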

