## Introduction

In this notebook I use the preprocessed absenteeism data set provided by 365 Careers on udemy.com. This end-of-course assignment was given to showcase everything taught during the Data Science Bootcamp. I standardize the data with a custom scaler, split the data, and fit a logistic regression using sklearn. I then test the model's accuracy and pickle the model and the scaler.


## Table of Contents

   1. [Import relevant libraries](#cell1)
   2. [Load the data](#cell2)
   3. [Creating the target](#cell3)
   4. [Standardize the data](#cell4)
   5. [Splitting](#cell5)
   6. [Logistic Regression with sklearn](#cell6)
   7. [Testing Accuracy](#cell7)
       - Coefficient and Intercept Summary Table
       - Hyperparameter Tuning
   8. [Saving Model](#cell8)
        






## Import Relevant libraries <a id="cell1"></a>

In [1]:
import pandas as pd
import numpy as np

## Load the data  <a id="cell2"></a>

In [2]:
data_preprocessed = pd.read_csv('Absenteeism_preprocessed_Dataset.csv')

In [3]:
data_preprocessed

Unnamed: 0,Reason Category 1,Reason Category 2,Reason Category 3,Reason Category 4,Month value,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,1,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,2,2,0,8
696,1,0,0,0,5,2,225,26,28,237.656,24,1,1,2,3
697,1,0,0,0,5,3,330,16,28,237.656,25,2,0,0,8
698,0,0,0,1,5,3,235,16,32,237.656,25,3,0,0,2


In [4]:
# Recode each reason dummy so that a 1 becomes the category's number (1 to 4) and a 0 stays 0
data_preprocessed['Reason Category 1'] = data_preprocessed['Reason Category 1'].map({1: 1, 0: 0})
data_preprocessed['Reason Category 2'] = data_preprocessed['Reason Category 2'].map({1: 2, 0: 0})
data_preprocessed['Reason Category 3'] = data_preprocessed['Reason Category 3'].map({1: 3, 0: 0})
data_preprocessed['Reason Category 4'] = data_preprocessed['Reason Category 4'].map({1: 4, 0: 0})

In [5]:
data_preprocessed['Reason Category 1']=data_preprocessed['Reason Category 1'].astype(int)
data_preprocessed['Reason Category 2']=data_preprocessed['Reason Category 2'].astype(int)
data_preprocessed['Reason Category 3']=data_preprocessed['Reason Category 3'].astype(int)
data_preprocessed['Reason Category 4']=data_preprocessed['Reason Category 4'].astype(int)

In [6]:
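# Collapse the reason columns into one. The existing 'Reason Category 1' values are not part of
# this sum, so rows with reason 1 end up as 0 (compare row 3 before and after).
data_preprocessed['Reason Category 1'] = data_preprocessed['Reason Category 2'] + data_preprocessed['Reason Category 3'] + data_preprocessed['Reason Category 4']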

In [7]:
data_preprocessed = data_preprocessed.drop(['Reason Category 4','Reason Category 3','Reason Category 2'],axis=1)
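The recoding in the cells above can be illustrated on a few hand-written rows. This is a minimal sketch in plain Python rather than the notebook's pandas code; the helper name `combine_reason_dummies` is hypothetical, and it assumes at most one of the four dummies is set per row. Unlike cell 6, this version also counts reason 1.

```python
def combine_reason_dummies(dummies):
    """Collapse four 0/1 reason dummies into a single code from 0 to 4.

    `dummies` is a sequence (r1, r2, r3, r4); 0 means no reason was recorded.
    """
    # Multiply each dummy by its category number and add the results.
    return sum(number * flag for number, flag in enumerate(dummies, start=1))


rows = [(0, 0, 0, 1), (0, 0, 0, 0), (1, 0, 0, 0)]
codes = [combine_reason_dummies(row) for row in rows]
print(codes)  # [4, 0, 1]
```

Keep in mind that a single integer code implies an ordering between the reasons, which a linear model will treat as meaningful.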

In [8]:
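# Fill any missing values with the column median (assign back instead of using inplace on a column)
median = data_preprocessed['Reason Category 1'].median()
data_preprocessed['Reason Category 1'] = data_preprocessed['Reason Category 1'].fillna(median)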

In [9]:
data_preprocessed

Unnamed: 0,Reason Category 1,Month value,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,4,7,1,289,36,33,239.554,30,1,2,1,4
1,0,7,1,118,13,50,239.554,31,1,1,0,0
2,4,7,2,179,51,38,239.554,31,1,0,0,2
3,0,7,3,279,5,39,239.554,24,1,2,0,4
4,4,7,3,289,36,33,239.554,30,1,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,5,2,179,22,40,237.656,22,2,2,0,8
696,0,5,2,225,26,28,237.656,24,1,1,2,3
697,0,5,3,330,16,28,237.656,25,2,0,0,8
698,4,5,3,235,16,32,237.656,25,3,0,0,2


## Creating the target <a id="cell3"></a>

In [10]:
# Use the median number of absence hours as the cutoff, then classify each row as above the median (1) or not (0)

In [11]:
data_preprocessed['Absenteeism Time in Hours'].median()

3.0

In [12]:
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > 
                   data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
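The same median cutoff can be sketched in plain Python on the first five rows shown earlier. The helper name `make_targets` is hypothetical. Because the comparison is strict, rows equal to the median fall into class 0, so the two classes are not guaranteed to be exactly balanced. The median of this small sample is 2, not the 3.0 of the full data set.

```python
from statistics import median


def make_targets(hours, threshold=None):
    """Return 1 for values strictly above the threshold, else 0."""
    if threshold is None:
        threshold = median(hours)  # default cutoff: the sample median
    return [1 if h > threshold else 0 for h in hours]


# Absence hours of the first five rows shown earlier in this notebook.
sample_hours = [4, 0, 2, 4, 2]
sample_targets = make_targets(sample_hours)
share_positive = sum(sample_targets) / len(sample_targets)
print(sample_targets, share_positive)  # [1, 0, 0, 1, 0] 0.4
```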

In [13]:
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [14]:
data_preprocessed['Immoderately late'] = targets
data_preprocessed.head()

Unnamed: 0,Reason Category 1,Month value,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Immoderately late
0,4,7,1,289,36,33,239.554,30,1,2,1,4,1
1,0,7,1,118,13,50,239.554,31,1,1,0,0,0
2,4,7,2,179,51,38,239.554,31,1,0,0,2,0
3,0,7,3,279,5,39,239.554,24,1,2,0,4,1
4,4,7,3,289,36,33,239.554,30,1,2,1,2,0


In [15]:
# Build the data with targets. Backward elimination gives a simpler model: the features dropped here
# were chosen using the coefficient table created later in the project.

In [16]:
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours','Day of the week','Daily Work Load Average',
                                            'Distance to Work','Month value','Body Mass Index'],axis=1)

In [17]:
data_with_targets.head()

Unnamed: 0,Reason Category 1,Transportation Expense,Age,Education,Children,Pets,Immoderately late
0,4,289,33,1,2,1,1
1,0,118,50,1,1,0,0
2,4,179,38,1,0,0,0
3,0,279,39,1,2,0,1
4,4,289,33,1,2,1,0


In [18]:
unscaled_inputs = data_with_targets.iloc[:,0:5]

## Standardize the data <a id="cell4"></a>

Creating a custom scaler to scale everything but the dummy features

In [19]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator, TransformerMixin):
    
    def __init__(self,columns):
        self.scaler = StandardScaler()
        self.columns = columns
        self.mean_ = None
        self.var_ = None
        
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self
    
    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
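        # Keep the original index so the concat below aligns rows even when X has a non-default index
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)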
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
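What the custom scaler does to each column can be sketched without pandas or sklearn. This is an illustration under stated assumptions, not the class above: the helper name `scale_selected` is hypothetical, rows are plain dicts, and every scaled column is assumed to be non-constant (otherwise the division fails). It uses the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import mean, pstdev


def scale_selected(rows, columns_to_scale):
    """Standardize only the named columns of a list of dict rows."""
    stats = {}
    for col in columns_to_scale:
        values = [row[col] for row in rows]
        # Population standard deviation (ddof=0), as StandardScaler uses.
        stats[col] = (mean(values), pstdev(values))
    scaled = []
    for row in rows:
        new_row = dict(row)  # copy, so the other columns pass through unchanged
        for col, (mu, sigma) in stats.items():
            new_row[col] = (row[col] - mu) / sigma
        scaled.append(new_row)
    return scaled


rows = [
    {"Age": 33, "Education": 1},
    {"Age": 50, "Education": 1},
    {"Age": 38, "Education": 1},
    {"Age": 39, "Education": 1},
]
scaled_rows = scale_selected(rows, ["Age"])
print([round(r["Age"], 3) for r in scaled_rows])
```

After scaling, `Age` has mean 0 and standard deviation 1, while `Education` is left exactly as it was.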

In [20]:
unscaled_inputs.columns.values

array(['Reason Category 1', 'Transportation Expense', 'Age', 'Education',
       'Children'], dtype=object)

In [21]:
# Columns left unscaled: the combined reason code and education
# (the separate 'Reason Category 2' to '4' columns were dropped earlier, so they need not be listed)
columns_to_omit = ['Reason Category 1', 'Education']

In [22]:
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

In [23]:
absenteeism_scaler = CustomScaler(columns_to_scale)

In [24]:
absenteeism_scaler.fit(unscaled_inputs)

CustomScaler(columns=['Transportation Expense', 'Age', 'Children'])

In [25]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

In [26]:
scaled_inputs

Unnamed: 0,Reason Category 1,Transportation Expense,Age,Education,Children
0,4,1.005844,-0.536062,1,0.880469
1,0,-1.574681,2.130803,1,-0.019280
2,4,-0.654143,0.248310,1,-0.919030
3,0,0.854936,0.405184,1,0.880469
4,4,1.005844,-0.536062,1,0.880469
...,...,...,...,...,...
695,0,-0.654143,0.562059,2,0.880469
696,0,0.040034,-1.320435,1,-0.019280
697,0,1.624567,-1.320435,2,-0.919030
698,4,0.190942,-0.692937,3,-0.919030


## Splitting <a id="cell5"></a>

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size = 0.8, random_state = 9)
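The idea behind a seeded 80/20 split can be sketched in plain Python. The helper name `split_train_test` is hypothetical, and the sketch will not reproduce the exact partition that `train_test_split` produces with `random_state = 9`; it only shows that a fixed seed makes the shuffle repeatable.

```python
import random


def split_train_test(items, train_size=0.8, seed=9):
    """Shuffle indices with a fixed seed and cut them into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle
    cut = int(len(items) * train_size)
    train = [items[i] for i in indices[:cut]]
    test = [items[i] for i in indices[cut:]]
    return train, test


observations = list(range(10))
train, test = split_train_test(observations)
print(len(train), len(test))  # 8 2
```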

## Logistic Regression with sklearn <a id="cell6"></a>
I used the hyperparameter tuning described later in the notebook to choose the settings below.

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [30]:
Log_reg = LogisticRegression(penalty='none',  # no regularization, so C is ignored (see the warning below)
                             solver='lbfgs',
                             multi_class='auto',
                             class_weight='balanced',
                             C=10)

In [31]:
Log_reg.fit(x_train,y_train)

  "Setting penalty='none' will ignore the C and l1_ratio "


LogisticRegression(C=10, class_weight='balanced', penalty='none')
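The effect of `class_weight = 'balanced'` can be sketched with the formula scikit-learn documents for it, `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


toy_labels = [0, 0, 0, 1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

With three samples of class 0 and one of class 1, class 1 gets weight 2.0 and class 0 about 0.67, so errors on the rarer class cost more. When the classes are nearly equal in size, as with a median-based target, the weights are close to 1 and the option changes little.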

## Testing Accuracy <a id="cell7"></a>

In [32]:
y_pred_lgr = Log_reg.predict(x_train)

In [33]:
y_pred_lgr


array([0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,

In [34]:
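from sklearn.metrics import confusion_matrix

# Confusion matrix of true vs. predicted labels (y_pred_lgr holds the predictions for x_train)
sc_lgr = confusion_matrix(y_train, y_pred_lgr)
# Accuracy = correct predictions (the diagonal) / all predictions; this is the training accuracy
accuracy_score_lgr = (sc_lgr[0,0] + sc_lgr[1,1]) / sc_lgr.sum()
accuracy_score_lgr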

0.7125
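The accuracy calculation above reads the correct predictions off the diagonal of the confusion matrix. A minimal plain-Python sketch of the same arithmetic follows; the helper names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return a 2x2 table where table[t][p] counts true label t predicted as p."""
    table = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        table[t][p] += 1
    return table


def accuracy_from_table(table):
    """Correct predictions sit on the diagonal of the confusion table."""
    correct = table[0][0] + table[1][1]
    total = sum(sum(row) for row in table)
    return correct / total


y_true = [1, 0, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
table = confusion_counts(y_true, y_pred)
print(table, accuracy_from_table(table))  # [[3, 1], [1, 3]] 0.75
```

Scoring predictions for `x_test` against `y_test` in the same way would give an estimate of out-of-sample accuracy.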

### Coefficient and Intercept Summary Table 
Backward elimination was also used to remove variables with coefficients close to zero.

In [35]:
feature_name = unscaled_inputs.columns.values

In [36]:
summary_table = pd.DataFrame (columns=['Feature name'], data=feature_name)

summary_table['Coefficient'] = np.transpose(Log_reg.coef_)

summary_table

Unnamed: 0,Feature name,Coefficient
0,Reason Category 1,-0.316579
1,Transportation Expense,0.423425
2,Age,-0.102224
3,Education,-0.001275
4,Children,0.313627
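One way to read these coefficients is to convert them to odds ratios with `exp(coefficient)`. The sketch below copies the values from the summary table; the ranking by absolute size is an illustration of how near-zero features (candidates for backward elimination) can be spotted, not necessarily the exact procedure used in this notebook.

```python
import math

# Coefficients copied from the summary table above.
coefficients = {
    "Reason Category 1": -0.316579,
    "Transportation Expense": 0.423425,
    "Age": -0.102224,
    "Education": -0.001275,
    "Children": 0.313627,
}

# exp(coefficient) is the multiplicative change in the odds per one-unit
# increase of the feature (one standard deviation for the scaled features).
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

# Rank features by how far their coefficient is from zero.
ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)
for name in ranked:
    print(f"{name:<24}{odds_ratios[name]:.3f}")
```

An odds ratio close to 1, as for `Education`, means the feature barely moves the prediction.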


### Hyperparameter Tuning
I converted the tuning cell below to markdown so that the rest of the notebook runs efficiently.

In [37]:
from sklearn.model_selection import RandomizedSearchCV

random_grid_par = {
    'penalty': ['none', 'l1', 'l2', 'elasticnet'],
    'C': [100, 10, 1.0, 0.1, 0.01],
    # 'None' is a string here; scikit-learn documents the Python value None for "no class weighting"
    'class_weight': ['balanced', 'None'],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'multi_class': ['auto', 'ovr', 'multinomial'],
}


In [38]:
lgr_rng = RandomizedSearchCV(estimator = Log_reg, param_distributions = random_grid_par, cv = 10, verbose=2, n_jobs = 6)

In [39]:
lgr_rng.fit(x_train,y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits


 0.69285714        nan        nan        nan]


RandomizedSearchCV(cv=10,
                   estimator=LogisticRegression(C=10, class_weight='balanced',
                                                penalty='none'),
                   n_jobs=6,
                   param_distributions={'C': [100, 10, 1.0, 0.1, 0.01],
                                        'class_weight': ['balanced', 'None'],
                                        'multi_class': ['auto', 'ovr',
                                                        'multinomial'],
                                        'penalty': ['none', 'l1', 'l2',
                                                    'elasticnet'],
                                        'solver': ['newton-cg', 'lbfgs',
                                                   'liblinear', 'sag',
                                                   'saga']},
                   verbose=2)

In [40]:
lgr_rng.best_params_

{'solver': 'saga',
 'penalty': 'l2',
 'multi_class': 'ovr',
 'class_weight': 'None',
 'C': 0.1}

In [41]:
lgr_rng.best_score_

0.6928571428571428
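The `nan` scores in the search output are likely caused by sampled combinations that a solver cannot fit. The sketch below is an assumption-laden illustration: the compatibility table is summarized from the scikit-learn `LogisticRegression` documentation for releases that accept `penalty='none'`, the helper names are hypothetical, and the sampling only mimics how a random search draws candidates. Note also that `elasticnet` additionally requires an `l1_ratio`, which the grid above does not set.

```python
import random

# Penalties each solver accepts, summarized from the scikit-learn
# LogisticRegression documentation (an assumption of this sketch).
SUPPORTED_PENALTIES = {
    "newton-cg": {"l2", "none"},
    "lbfgs": {"l2", "none"},
    "liblinear": {"l1", "l2"},
    "sag": {"l2", "none"},
    "saga": {"l1", "l2", "elasticnet", "none"},
}


def is_supported(solver, penalty):
    """True when the solver can fit a model with the given penalty."""
    return penalty in SUPPORTED_PENALTIES[solver]


def sample_candidates(n_iter=10, seed=0):
    """Draw random (solver, penalty) pairs, as a random search would."""
    rng = random.Random(seed)
    solvers = sorted(SUPPORTED_PENALTIES)
    penalties = ["none", "l1", "l2", "elasticnet"]
    return [(rng.choice(solvers), rng.choice(penalties)) for _ in range(n_iter)]


candidates = sample_candidates()
failing = [pair for pair in candidates if not is_supported(*pair)]
print(f"{len(failing)} of {len(candidates)} sampled pairs would fail to fit")
```

Restricting the grid to compatible pairs, for example by passing a list of separate dictionaries as `param_distributions`, avoids spending search iterations on candidates that cannot be scored.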

## Saving Model <a id="cell8"></a>

In [42]:
import pickle

In [43]:
with open('model', 'wb') as file:
    pickle.dump(Log_reg, file)

In [44]:
with open('scaler', 'wb') as file:
    pickle.dump(absenteeism_scaler, file)
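Loading the saved files back works the same way in reverse. The sketch below uses a plain dict as a stand-in for the fitted model and a temporary folder, both assumptions made so that it runs on its own. When loading the real `scaler` file, the `CustomScaler` class must be defined or importable in the loading session, and pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the fitted model; any picklable object works.
stand_in_model = {"coef": [-0.316579, 0.423425], "penalty": "none"}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model")
    with open(path, "wb") as file:
        pickle.dump(stand_in_model, file)  # save, as in the cells above
    with open(path, "rb") as file:
        restored = pickle.load(file)  # load it back, e.g. in a later session

print(restored == stand_in_model)  # True
```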