# Preparing employee data for safe release

When working with real data, we must make sure that the personal information of our customers, clients, or anyone else cannot be traced back to them or exposed.

In this exercise we will use the **IBM HR Analytics Employee Attrition & Performance** dataset to practice suppression and generalization techniques using pandas and SciPy.

We will anonymize categorical columns by encoding their values as numbers. For example, `Gender` becomes 0 or 1, and `Department` gets one integer label for each department in the company.
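
As a minimal sketch of what label encoding does, the example below runs scikit-learn's `LabelEncoder` on a small, made-up frame. The `toy` data is hypothetical and only mimics two columns of the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Department": ["Sales", "Research & Development", "Human Resources", "Sales"],
})

encoded = toy.copy()
for col in ["Gender", "Department"]:
    # fit_transform sorts the distinct labels and maps them to 0..n-1
    encoded[col] = LabelEncoder().fit_transform(toy[col])

print(encoded)
```

Because the labels are assigned in sorted order, the integers carry no meaning beyond identity. Anyone who knows the original categories could still guess the mapping, so encoding alone is a weak form of protection.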

Remember that for sampled data to resemble the original as closely as possible, we have to sample from the distribution that fits it best; in this exercise we will use the `fisk` distribution for the `Age` column. For additional privacy, and to avoid leaking information about the dataset, we will also replace the column names with numbers.
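
The fit-then-sample idea can be sketched on synthetic numbers as follows. The `ages` array is a hypothetical stand-in for the real `Age` column, and the parameters used to generate it are arbitrary.

```python
import numpy as np
import scipy.stats

# Hypothetical stand-in for the Age column (arbitrary generating parameters)
ages = scipy.stats.fisk.rvs(c=7, scale=36, size=500, random_state=0)

# fit returns the shape parameter c followed by loc and scale
params = scipy.stats.fisk.fit(ages)

# Draw as many new values as there are originals and round them to whole years
synthetic = scipy.stats.fisk.rvs(*params, size=len(ages), random_state=1).round()

print(params)
print(synthetic[:10])
```

The sampled values follow the fitted distribution but are not tied to any individual row, which is what makes this a useful replacement for the original column.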

A copy of the dataset has already been loaded as `df`. Feel free to use the interactive console to explore it further: check the general information and data types with `df.info()` and the number of unique values in each column with `df.nunique()`.
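
One possible way to use `nunique()` during exploration is to flag columns that look like identifiers (one distinct value per row) or constants (a single value). The sketch below uses a hypothetical `toy` frame, not the loaded `df`.

```python
import pandas as pd

# Hypothetical toy data: an identifier, a categorical column, and a constant
toy = pd.DataFrame({
    "EmployeeNumber": [101, 102, 103, 104],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Over18": ["Y", "Y", "Y", "Y"],
})

unique_counts = toy.nunique()

# Columns with as many distinct values as rows behave like identifiers
identifier_like = unique_counts[unique_counts == len(toy)].index.tolist()
# Columns with a single distinct value carry no information
constant = unique_counts[unique_counts == 1].index.tolist()

print(identifier_like)
print(constant)
```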




1. Drop the columns whose values are unique or nearly unique for each row, as well as any rows containing NaN values.
2. Anonymize the categorical columns by replacing their values with encoded numeric labels.
3. Obtain the parameters of the `fisk` distribution fitted to `Age`, sample new values with them, and round the results.
4. Replace the column names with numbers.
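
The suppression and renaming steps can be illustrated on a small, made-up frame. This is only a sketch: the `toy` data and the `released` name are hypothetical, and the real exercise works on `df` in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with an identifier column and one missing value
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3, 4],
    "Age": [41, 49, np.nan, 33],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# Suppression: remove the per-row identifier, then rows with missing values
released = toy.drop(columns=["EmployeeNumber"]).dropna()

# Replace the column names with their positions so they reveal nothing
released.columns = list(range(released.shape[1]))

print(released)
```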

In [None]:
######## Pre code
import pandas as pd
import numpy as np
import scipy.stats
import statsmodels as sm
from math import floor
import matplotlib.pyplot as plt
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")
# Multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Create the label encoder (LabelEncoder is imported above)
label_encoder = LabelEncoder()

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.shape
df.head()

In [None]:
# Drop columns that are unique or nearly unique per row, then drop rows with NaN
df.____(columns=["EmployeeNumber", "MonthlyIncome", "MonthlyRate", "DailyRate"], inplace=True) 
df.____(inplace=True)

# Encode labels of categorical variables
for c in ["Gender", "Attrition", "Department", "EducationField", "BusinessTravel"]:
    df[c] = ____(df[c])

# Fit the fisk distribution to Age, sample new values from it, and round them
params = scipy.stats.fisk.____(df['Age'])
df['Age'] = scipy.stats.fisk.____(size=len(df.index), *____).____()

# Replace the column names with numbers and see the new generated dataset
df.columns = list(____(df.shape[1]))
df.head()