# Creating a logistic regression model to predict absenteeism

## Import the relevant libraries

In [1]:
import pandas as pd
import numpy as np

## Load the preprocessed data

In [2]:
file_path = '../1 - Data Preprocessing/Absenteeism_preprocessed.csv'
data_preprocessed = pd.read_csv(file_path)

In [3]:
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


## Creating Targets for the Logistic Regression

To build the logistic regression model, we need to create target categories to determine if someone is "being absent too much" or not. Here's how I approached this:

1. **Using the Median as a Cut-off Line**  
   - I decided to use the **median of the dataset** as the cut-off line.  
   - This approach ensures a **balanced dataset**, with a roughly equal number of 0s and 1s for the logistic regression.  
   - Balancing is a common challenge in machine learning, and this method addresses it effectively.  

2. **Alternative Approaches**  
   - With more data, we could have explored other options, such as:  
     - Choosing a **fixed value** as the cut-off line instead of using the median.  

3. **What the Cut-off Line Represents**  
   - The median is **3 hours**, so the cut-off line assigns a value of **1** to anyone who has been absent for more than 3 hours, i.e. **4 hours or more**.  
   - This threshold corresponds to taking roughly **half a day off**.  


In [4]:
# find the median of 'Absenteeism Time in Hours'
median_absenteeism = data_preprocessed["Absenteeism Time in Hours"].median()
median_absenteeism

np.float64(3.0)

In [5]:
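# assign 1 to observations above the median and 0 otherwise
# the cut-off is parameterized by the median rather than hard-coded as 3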
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > median_absenteeism, 1, 0)

In [6]:
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [7]:
# create a Series in the original data frame that will contain the targets for the regression
data_preprocessed['Excessive Absenteeism'] = targets

In [8]:
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


## Checking if the Dataset is Balanced

To ensure the dataset is balanced for our logistic regression model, we need to verify the proportion of targets that are 1s. Here's how we can do this:

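1. **Calculating the Percentage of 1s**  
   - Use `targets.sum()` to find the **total number of 1s** in the dataset.  
   - Use `targets.shape[0]` to get the **total length** of the `targets` array.
   - Divide the first by the second to get the share of 1s.

2. **Checking the Balance**  
   - A share close to 50% indicates that the two classes are roughly balanced.
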

In [9]:
# Calculate the proportion of 1s in the dataset
percentage_of_ones = (targets.sum() / targets.shape[0]) * 100

# Print the result in a user-friendly format
print(f"The dataset has {percentage_of_ones:.2f}% of targets classified as 1.")


The dataset has 45.57% of targets classified as 1.


#### Evaluating Target Balance

Based on our calculation, **45.57% of the targets** are classified as 1. The rest are classified as 0.

With this proportion being close to an equal split, we can consider the targets to be **balanced**. This matters for the logistic regression model: with heavily imbalanced targets, a model can reach high accuracy simply by favoring the majority class.

**Note:** 

 A perfect 50/50 split isn’t necessary. For logistic regression, a 60/40 or 45/55 split is usually sufficient. This balance prevents the model from favoring one class too much. 
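
As a quick cross-check on the proportion above, both class shares can be read at once. This is a minimal sketch that assumes a small made-up `toy_targets` array rather than the notebook's `targets`:

```python
import numpy as np

# Hypothetical 0/1 targets, for illustration only (not the notebook's data)
toy_targets = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# np.bincount counts how many times each non-negative integer occurs
class_counts = np.bincount(toy_targets, minlength=2)
class_shares = class_counts / toy_targets.shape[0]

print(f"Share of 0s: {class_shares[0]:.2%}, share of 1s: {class_shares[1]:.2%}")
```

Replacing `toy_targets` with `targets` should reproduce the share of 1s reported above.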


## Create a checkpoint

In [10]:
# data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours'],axis=1)

In [11]:
# create a checkpoint by dropping the variables we no longer need
# this also drops the features removed by backward elimination (see the section further down)
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours','Day of the Week', 'Daily Work Load Average','Distance to Work'],axis=1)

In [12]:
# check if the line above is a checkpoint :)

# if data_with_targets is data_preprocessed = True, then the two are pointing to the same object
# if it is False, then the two variables are completely different and this is in fact a checkpoint
data_with_targets is data_preprocessed

False

In [13]:
data_with_targets.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0


## Select the inputs for the Logistic Regression

In [14]:
data_with_targets.shape

(700, 12)

In [15]:
# Create a variable that will contain the inputs (everything without the targets)
unscaled_inputs = data_with_targets.iloc[:,:-1]

In [16]:
unscaled_inputs

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,289,33,30,0,2,1
1,0,0,0,0,7,118,50,31,0,1,0
2,0,0,0,1,7,179,38,31,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0
4,0,0,0,1,7,289,33,30,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,179,40,22,1,2,0
696,1,0,0,0,5,225,28,24,0,1,2
697,1,0,0,0,5,330,28,25,1,0,0
698,0,0,0,1,5,235,32,25,1,0,0


## Standardizing the Inputs

### Why Standardization is Important
Standardization is one of the most common preprocessing tools in machine learning. Here's why:

- **Bias from Data of Different Magnitudes:**  
  Features on very different scales can lead the model to give more weight to those with larger magnitudes.
- **Ensuring Similar Magnitudes:**  
  Standardization brings all inputs to a similar magnitude, reducing this effect. 
- **Machine Learning Peculiarity:**  
  Many (though not all) machine learning algorithms perform worse with unscaled data.

### Using `StandardScaler`
The `StandardScaler` class from `sklearn.preprocessing` standardizes each feature by subtracting its mean and dividing by its standard deviation. It stores these parameters after fitting, so the same transformation can be reapplied to new data.  
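
To make the calculation concrete, here is a minimal sketch comparing a manual z-score with `StandardScaler`. The `toy_inputs` values are made up for illustration and are not the notebook's data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs: one feature on a large scale, one on a small scale
toy_inputs = np.array([[289.0, 2.0],
                       [118.0, 1.0],
                       [179.0, 0.0],
                       [279.0, 2.0]])

# Manual z-score: subtract the column mean, divide by the column standard deviation
manual_scaled = (toy_inputs - toy_inputs.mean(axis=0)) / toy_inputs.std(axis=0)

# StandardScaler computes the same quantities (population standard deviation, ddof=0)
sklearn_scaled = StandardScaler().fit_transform(toy_inputs)

print(np.allclose(manual_scaled, sklearn_scaled))
```

After scaling, both columns have mean 0 and standard deviation 1, regardless of their original magnitudes.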

---


## Standardizing Only the Numerical Variables: Creating a Custom Scaler

Standardizing dummy variables (e.g., 0s and 1s) is considered bad practice because it destroys their interpretability. Address it by creating a custom scaler that standardizes only numerical variables, leaving dummy variables untouched.


### Why Standardizing Dummies Is a Problem

#### Key Issues:
- When dummy variables are standardized:
  - They lose their interpretability.
  - A unit change no longer represents a switch between 0 (ignored) and 1 (included).

#### Consequences:
- While the model's predictive power remains intact, the ability to compare the importance of different reasons for absence is lost.
- This is particularly problematic because reasons for absence are some of the most significant features in the model.
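
A small sketch of what happens to a dummy once it is standardized. The `toy_dummy` column is hypothetical and only serves to show the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical dummy column: 1 marks the rows where a given reason applies
toy_dummy = np.array([[0.0], [0.0], [0.0], [1.0]])

scaled_dummy = StandardScaler().fit_transform(toy_dummy)

# The column still has only two distinct values, but they are no longer 0 and 1
print(np.unique(scaled_dummy))
```

A coefficient on the scaled column then describes a change of one standard deviation, not a switch from "reason absent" to "reason present".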

### The Solution: Custom Scaler

By implementing a custom scaler that standardizes only numerical variables, we address this issue and achieve two goals:
1. Preserve the interpretability of dummy variables.
2. Maintain the model's predictive power.


- **Key Takeaways:**
  - Avoid standardizing dummy variables to retain their interpretability.
  - Use custom preprocessing techniques to handle mixed data types effectively.
- **Advantages of a Custom Scaler:**
  - Preserves the meaning of critical features, such as reasons for absence.
  - Ensures the model remains both accurate and interpretable.

By resolving this issue, we ensured that the model provides actionable insights while retaining its performance. 


In [17]:
# import the libraries needed to create the Custom Scaler

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# create the Custom Scaler class

class CustomScaler(BaseEstimator,TransformerMixin): 

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns], axis=0)  
        self.var_ = np.var(X[self.columns], axis=0)    
        return self

    # the transform method which does the actual scaling

    def transform(self, X, y=None, copy=None):
        if copy is None:
            copy = self.copy
        
        # record the initial order of the columns
        init_col_order = X.columns

        # scale all features that you chose when creating the instance of the class
        # keep the original index so the concat below aligns rows correctly
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)

        # declare a variable containing all information that was not scaled
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]

        # return a data frame which contains all scaled features and all 'not scaled' features
        # use the original order (that you recorded in the beginning)
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

In [18]:
# check which columns we have
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [19]:
# select the columns to omit
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4','Education']

In [20]:
# create the columns to scale, based on the columns to omit
# use list comprehension to iterate over the list
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

In [21]:
# declare a scaler object, specifying the columns you want to scale
absenteeism_scaler = CustomScaler(columns_to_scale)

### Fit the Scaler 

Fit the scaler to the unscaled inputs. This calculates the mean and standard deviation for each feature and stores the information in the scaler: 

In [22]:
absenteeism_scaler.fit(unscaled_inputs)

### Using the `transform` Method
The `transform` method is used to standardize the data based on the scaling parameters calculated earlier. Here's what happens:

1. **Fitting the Scaler:**  
   In the previous step, we fitted the scaler to the data. This means we calculated the **internal parameters** (mean and standard deviation of each selected column) that the scaler will use for the transformation.

2. **Transforming the Data:**  
   Applying these parameters to the dataset standardizes the selected columns, bringing them onto a similar scale.

3. **Handling New Data:**  
   When you receive new data, you can use the same scaler to transform it in the same way by calling the `transform` method again.

This process ensures consistency in preprocessing and allows the model to handle future data effectively.
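
The fit-once, transform-many pattern described above can be sketched as follows. For brevity this uses a plain `StandardScaler` on made-up data frames (`train_df`, `new_df`), not the `CustomScaler` or the notebook's data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data the scaler is fitted on
train_df = pd.DataFrame({'Age': [30, 40, 50], 'Children': [0, 1, 2]})

# Hypothetical new observations that arrive later
new_df = pd.DataFrame({'Age': [40, 60], 'Children': [1, 3]})

toy_scaler = StandardScaler()
toy_scaler.fit(train_df)  # learns the mean and standard deviation from train_df only

# Reuse the stored parameters; fit is not called again on the new data
new_scaled = toy_scaler.transform(new_df)

print(toy_scaler.mean_)  # means learned from train_df
print(new_scaled)
```

The first new row equals the training means, so it maps to zeros. The second row lies above the training means and maps to positive values.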


In [23]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

In [24]:
# scaled_inputs is still a pandas DataFrame, because CustomScaler.transform rebuilds one from the scaled values
scaled_inputs

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-1.574681,2.130803,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,0.182726,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690
3,1,0,0,0,0.182726,0.854936,0.405184,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.388293,-0.654143,0.562059,-1.114186,1,0.880469,-0.589690
696,1,0,0,0,-0.388293,0.040034,-1.320435,-0.643782,0,-0.019280,1.126663
697,1,0,0,0,-0.388293,1.624567,-1.320435,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,-0.388293,0.190942,-0.692937,-0.408580,1,-0.919030,-0.589690


In [25]:
# check the shape of the inputs
scaled_inputs.shape

(700, 11)

## Split the data into train & test and shuffle

### Import the relevant module

In [26]:
# import train_test_split so we can split our data into train and test
from sklearn.model_selection import train_test_split

### Split

In [27]:
# declare 4 variables for the split
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size = 0.8, random_state = 20)

In [28]:
# check the shape of the train inputs and targets
print (x_train.shape, y_train.shape)

(560, 11) (560,)


In [29]:
# check the shape of the test inputs and targets
print (x_test.shape, y_test.shape)

(140, 11) (140,)


## Fitting the Model and Assessing its Accuracy

### Logistic regression with sklearn

In [30]:
# import the LogReg model from sklearn
from sklearn.linear_model import LogisticRegression

# import the 'metrics' module, which includes important metrics we may want to use
from sklearn import metrics

### Training the model

In [31]:
# create a logistic regression object
reg = LogisticRegression()

In [32]:
# fit the model to the training inputs and targets
reg.fit(x_train,y_train)

In [57]:
# Assess the train accuracy of the model
accuracy = reg.score(x_train, y_train) * 100  # Convert to percentage
print(f"Model Train Accuracy: {accuracy:.2f}%")

Model Train Accuracy: 77.32%


This means the model correctly classifies 77.32% of the training observations. 

### Manually check the accuracy

In [34]:
# find the model outputs according to our model
model_outputs = reg.predict(x_train)
model_outputs

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

In [35]:
# compare them with the targets
y_train

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,

In [36]:
# ACTUALLY compare the two variables
model_outputs == y_train

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [37]:
# find out in how many instances we predicted correctly
np.sum((model_outputs==y_train))

np.int64(433)

In [38]:
# get the total number of instances
model_outputs.shape[0]

560

In [39]:
# calculate the accuracy of the model
manual_accuracy = np.sum((model_outputs==y_train)) / model_outputs.shape[0] * 100  # Convert to percentage
print(f"Manual calculated Accuracy: {manual_accuracy:.2f}%")

Manual calculated Accuracy: 77.32%
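
The `metrics` module imported earlier offers ready-made versions of this check. A minimal sketch, assuming small made-up label arrays (`toy_true`, `toy_pred`) in place of `y_train` and `model_outputs`:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions, for illustration only
toy_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
toy_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Same quantity as the manual calculation: share of matching entries
toy_accuracy = metrics.accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes
toy_confusion = metrics.confusion_matrix(toy_true, toy_pred)

print(f"Accuracy: {toy_accuracy:.2%}")
print(toy_confusion)
```

The confusion matrix additionally shows which kind of mistake the model makes, which a single accuracy number hides.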


## Creating a Summary Table with the Coefficients and Intercept

In [40]:
# get the intercept (bias) of our model
reg.intercept_

array([-1.6469898])

In [41]:
# get the coefficients (weights) of our model
reg.coef_

array([[ 2.80000644,  0.95174778,  3.1140605 ,  0.83835931,  0.15897713,
         0.60513709, -0.16990589,  0.27998236, -0.21017416,  0.34842434,
        -0.27721907]])

In [42]:
# check what were the names of our columns
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [43]:
# save the names of the columns in an ad-hoc variable
feature_name = unscaled_inputs.columns.values

### Summary Table with the Coefficients

To analyze and visualize the impact of each feature on the logistic regression model, we create a summary table containing the feature names and their corresponding coefficients. These coefficients will later be used in Tableau for further analysis and visualization.

#### Steps:
1. **Create a DataFrame for the Summary Table:**  
   - The table will include a column for feature names.
2. **Add Coefficients:**  
   - The coefficients from the logistic regression model are transposed and added as a column to the summary table.
3. **Display the Summary Table:**  
   - The table is displayed for verification and further use.


In [44]:
# Create a DataFrame with feature names
summary_table = pd.DataFrame(columns=['Feature name'], data=feature_name)

# Add coefficients to the summary table
summary_table['Coefficient'] = np.transpose(reg.coef_)

# Display the summary table
summary_table

Unnamed: 0,Feature name,Coefficient
0,Reason_1,2.800006
1,Reason_2,0.951748
2,Reason_3,3.114061
3,Reason_4,0.838359
4,Month Value,0.158977
5,Transportation Expense,0.605137
6,Age,-0.169906
7,Body Mass Index,0.279982
8,Education,-0.210174
9,Children,0.348424


### Adding the Intercept to the Summary Table

In logistic regression, the intercept is a key part of the model. To ensure it is included in our summary table for analysis, we add it at the top of the table. Here's what happens:

#### Steps:
1. **Adjust Indices:**  
   - Shift all existing indices by 1 to make room for the intercept at the top.
2. **Add the Intercept:**  
   - Insert the intercept value at index 0 with the label 'Intercept.'
3. **Sort the Table:**  
   - Sort the table by index to ensure the intercept is at the top and the rest of the rows follow in order.


In [45]:
# Shift indices to make room for the intercept
summary_table.index = summary_table.index + 1

# Add the intercept at index 0
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]

# Sort the table by index
summary_table = summary_table.sort_index()

# Display the updated summary table
summary_table

Unnamed: 0,Feature name,Coefficient
0,Intercept,-1.64699
1,Reason_1,2.800006
2,Reason_2,0.951748
3,Reason_3,3.114061
4,Reason_4,0.838359
5,Month Value,0.158977
6,Transportation Expense,0.605137
7,Age,-0.169906
8,Body Mass Index,0.279982
9,Education,-0.210174


## Interpreting the Coefficients of this Logistic Regression

Now that we have obtained the logistic regression coefficients, let's break down their meaning and significance:

#### Coefficients (Weights)
- Represent the relationship between each feature and the **log odds** of the target variable.

#### Intercept (Bias)
- Serves as the base value for the model, representing the log odds when all features are zero.

#### Odds Ratios
- The **odds ratio** is calculated by taking the exponential of each coefficient. 
- This transformation makes the coefficients more interpretable, as it converts the log odds into a multiplicative effect on the odds.

---

### Calculating the Odds Ratios

To calculate the odds ratio:
1. **Create a New Column:**  
   - Add a column named `Odds_ratio` to the summary table to store the odds ratio for each feature.  
   - This is computed as the exponential of the respective coefficients.

---

### Interpreting the Odds Ratios

- **Odds Ratio > 1:**  
  The feature increases the odds of excessive absenteeism.  
- **Odds Ratio < 1:**  
  The feature decreases the odds of excessive absenteeism.  
- **Odds Ratio ≈ 1:**  
  The feature has little to no effect on the odds.  
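
As a worked example of these rules, the sketch below exponentiates two coefficients copied from the `reg.coef_` output earlier in the notebook. The variable names are illustrative only:

```python
import numpy as np

# Two coefficients copied from the reg.coef_ output earlier in the notebook
coef_reason_3 = 3.1140605
coef_age = -0.16990589

# The odds ratio is the exponential of the coefficient
odds_ratio_reason_3 = np.exp(coef_reason_3)
odds_ratio_age = np.exp(coef_age)

# Percentage change in the odds for a one-unit increase in the feature
pct_change_age = (odds_ratio_age - 1) * 100

print(f"Reason_3 odds ratio: {odds_ratio_reason_3:.2f}")
print(f"Age odds ratio: {odds_ratio_age:.3f} ({pct_change_age:.1f}% change in odds)")
```

A positive coefficient gives an odds ratio above 1, and a negative coefficient gives one below 1.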


In [46]:
# create a new column called 'Odds_ratio' which will show the odds ratio of each feature
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)

In [47]:
summary_table

Unnamed: 0,Feature name,Coefficient,Odds_ratio
0,Intercept,-1.64699,0.192629
1,Reason_1,2.800006,16.444753
2,Reason_2,0.951748,2.590233
3,Reason_3,3.114061,22.51227
4,Reason_4,0.838359,2.31257
5,Month Value,0.158977,1.172311
6,Transportation Expense,0.605137,1.831503
7,Age,-0.169906,0.843744
8,Body Mass Index,0.279982,1.323106
9,Education,-0.210174,0.810443


In [48]:
# sort the table according to odds ratio
# sort_values sorts in ascending order by default, so pass ascending=False to list the largest odds ratios first
summary_table.sort_values('Odds_ratio', ascending=False)

Unnamed: 0,Feature name,Coefficient,Odds_ratio
3,Reason_3,3.114061,22.51227
1,Reason_1,2.800006,16.444753
2,Reason_2,0.951748,2.590233
4,Reason_4,0.838359,2.31257
6,Transportation Expense,0.605137,1.831503
10,Children,0.348424,1.416833
8,Body Mass Index,0.279982,1.323106
5,Month Value,0.158977,1.172311
7,Age,-0.169906,0.843744
9,Education,-0.210174,0.810443


## Summary Table with Odds Ratios (prior to backward elimination of features with minimal impact)

| Feature Name                | Coefficient | Odds Ratio |
|-----------------------------|-------------|------------|
| Reason_3                   | 3.096739    | 22.125672  |
| Reason_1                   | 2.801363    | 16.467081  |
| Reason_2                   | 0.933541    | 2.543499   |
| Reason_4                   | 0.857183    | 2.356513   |
| Transportation Expense     | 0.613216    | 1.846359   |
| Children                   | 0.361898    | 1.436052   |
| Body Mass Index            | 0.271155    | 1.311478   |
| Month Value                | 0.166403    | 1.181049   |
| Daily Work Load Average    | -0.000077   | 0.999923   |
| Distance to Work           | -0.007779   | 0.992251   |
| Day of the Week            | -0.084316   | 0.919141   |
| Age                        | -0.165545   | 0.847431   |
| Education                  | -0.206027   | 0.813811   |
| Pets                       | -0.285729   | 0.751466   |
| Intercept                  | -1.656628   | 0.190781   |

---

### Key Concepts

### 1. Importance of Features
- The further a coefficient is from zero, the greater its impact on the model.
- **Most important features:**
  - Reasons for Absence (Reasons 1, 2, 3, and 4)
  - Transportation Expense
  - Children, Pets, and Education

---

### 2. Interpreting the Reasons for Absence

#### **Reason Categories**
- **Baseline (Reason 0):** No specific reason for absence.
- **Reason 1:** Various diseases.
- **Reason 2:** Pregnancy and childbirth.
- **Reason 3:** Poisoning and other uncategorized issues.
- **Reason 4:** Light diseases (e.g., dental appointments).

#### **Key Insights**
- **Reason 3 (Poisoning):**
  - Coefficient: 3.096739
  - Odds Ratio: 22.13
  - Interpretation: The odds of excessive absence are 22 times higher than the baseline.
- **Reason 1 (Various Diseases):**
  - Coefficient: 2.801363
  - Odds Ratio: 16.47
  - Interpretation: The odds of excessive absence are 16 times higher than the baseline.
- **Reason 2 (Pregnancy and Childbirth):**
  - Coefficient: 0.933541
  - Odds Ratio: 2.54
  - Interpretation: The odds of excessive absence are 2.5 times higher than the baseline.
- **Reason 4 (Light Diseases):**
  - Coefficient: 0.857183
  - Odds Ratio: 2.36
  - Interpretation: The odds of excessive absence are 2.36 times higher than the baseline.

---

### 3. Non-Dummy Features

- **Transportation Expense:**
  - Coefficient: 0.613216
  - Odds Ratio: 1.85
  - Interpretation: For each one-standard-deviation increase in transportation expense, the odds of excessive absence increase by 85%. Because the feature is standardized, the coefficient does not translate directly into an effect per unit of expense.

- **Children:**
  - Coefficient: 0.361898
  - Odds Ratio: 1.44
  - Interpretation: Each one-standard-deviation increase in the number of children increases the odds of excessive absence by 44%. This feature is a standardized count, not a dummy.

- **Pets:**
  - Coefficient: -0.285729
  - Odds Ratio: 0.75
  - Interpretation: Each one-standard-deviation increase in the number of pets reduces the odds of excessive absence by 25%. One possible explanation is that individuals with multiple pets have others helping to care for them.

- **Education:**
  - Coefficient: -0.206027
  - Odds Ratio: 0.81
  - Interpretation: Education was left unscaled, so a one-unit increase in the education variable decreases the odds of excessive absence by 19%.
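
One way to recover a per-unit reading for a standardized feature is to divide its coefficient by the feature's standard deviation. The sketch below does this for Transportation Expense. The value `assumed_std` is a hypothetical placeholder, not taken from the notebook; in practice it would come from the fitted scaler.

```python
import numpy as np

# Coefficient for 'Transportation Expense' from the summary table above
coef_per_std = 0.613216

# ASSUMPTION: hypothetical standard deviation of the unscaled feature
assumed_std = 66.0

# One original unit is 1/std standardized units, so the coefficient is divided by the std
coef_per_unit = coef_per_std / assumed_std

odds_ratio_per_std = np.exp(coef_per_std)
odds_ratio_per_unit = np.exp(coef_per_unit)

print(f"Odds ratio per standard deviation: {odds_ratio_per_std:.2f}")
print(f"Odds ratio per original unit:      {odds_ratio_per_unit:.4f}")
```

The per-unit odds ratio is much closer to 1 because a single unit of expense is a small step compared with one standard deviation.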

---

### 4. Features with Minimal Impact
- **Daily Work Load Average, Distance to Work, and Day of the Week:**
  - Coefficients close to zero.
  - Minimal impact on the model.

---

### 5. Intercept
- Coefficient: -1.656628
- Odds Ratio: 0.19
- Purpose:
  - Calibrates the model for accurate predictions.
  - In machine learning, this is referred to as the **bias term**.
  - No direct interpretation.
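
Although the intercept is rarely interpreted on its own, it can be converted into a baseline probability. The sketch below assumes that "all inputs at zero" means dummies equal to 0 and standardized features at their mean:

```python
import numpy as np

# Intercept from the summary table above
intercept = -1.656628

# Odds when every input is zero (assumption: dummies at 0, scaled features at their mean)
baseline_odds = np.exp(intercept)

# Convert odds to a probability: p = odds / (1 + odds), i.e. the logistic (sigmoid) function
baseline_probability = baseline_odds / (1 + baseline_odds)

print(f"Baseline odds: {baseline_odds:.3f}")
print(f"Baseline probability of excessive absenteeism: {baseline_probability:.3f}")
```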

---

### Standardization and Interpretation

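- **Benefits of Standardization:**
  - Puts features on a comparable scale, which helps many models train well and makes coefficient magnitudes comparable.
- **Drawbacks:**
  - Reduces direct interpretability of coefficients, since they refer to standard deviations rather than original units.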
- **Solution:**
  - Create two models:
    - One with standardized features (for accuracy).
    - One without standardization (for interpretability).

---

### Summary of Findings
- **Top Features:** Reasons for Absence, Transportation Expense, Children, Pets, and Education.
- **Interpreting Coefficients:** Use odds ratios for better interpretability.
- **Standardization:** Helps model training but makes coefficients harder to read in original units.
- **Intercept:** Ensures the model’s predictions are correctly calibrated.
- **Daily Work Load Average**, **Day of the Week**, and **Distance to Work** have minimal influence and could potentially be dropped (Backward Elimination).  
- The odds ratios provide an intuitive way to understand the impact of each feature on excessive absenteeism.

---


## Backward Elimination: Simplifying the Model

### Identifying Features to Remove

From the previous analysis, the following features showed minimal impact (coefficients close to zero) and can be removed:

- **Day of the Week**
- **Daily Work Load Average**
- **Distance to Work**

Removing these features is unlikely to significantly affect the model's performance.

### Steps to Perform Backward Elimination

#### 1. Return to the Checkpoint
- Go back to the cell that creates the `data_with_targets` checkpoint, which holds the dataset's last state before standardization.

#### 2. Drop the Low-Impact Features
- Remove the following columns from the dataset:
  - **Day of the Week**
  - **Daily Work Load Average**
  - **Distance to Work**

By following these steps, the model is streamlined to the more impactful features, with little expected change in overall performance.


## Testing the model

In [56]:
# Assess the test accuracy of the model
test_accuracy = reg.score(x_test, y_test)

# Print the result in a user-friendly format
print(f"The model's test accuracy is: {test_accuracy:.2%}")


The model's test accuracy is: 75.00%


## Model Interpretation: Accuracy Analysis

### Train Accuracy: **77.32%**  
### Test Accuracy: **75.00%**

---

### Key Insights:
- **Small Difference Between Train and Test Accuracy (about 2.3 percentage points):**
  - Suggests that the model generalizes well to unseen data.
  - There is no clear sign of overfitting.

### Understanding Overfitting:
- If the test accuracy were significantly lower (e.g., 10-20 percentage points lower than train accuracy), it would indicate overfitting. 
- Overfitting occurs when the model learns the training data too well, capturing noise and patterns that do not generalize to new data.

---

### Conclusion:
The model's similar train and test accuracies suggest a good balance between learning the training data and generalizing to new data.

---

## Predicting Outcomes with Logistic Regression

### Binary Predictions with `predict`
- Use the `predict` method to obtain predictions for the test set.
- Output: Binary values (**0** or **1**) for each test observation.
  - **0:** Not excessively absent.
  - **1:** Excessively absent.

---

### Predicting Probabilities with `predict_proba`
- The `predict_proba` method provides the **probabilities** for each class (0 or 1).
- Output: A \(140 \times 2\) array (assuming 140 test observations).
  - **First Column:** Probability of the outcome being **0** (not excessively absent).
  - **Second Column:** Probability of the outcome being **1** (excessively absent).

---

### Extracting Probabilities for Excessive Absenteeism
- To get the probabilities of being excessively absent, use the **second column** of the array returned by `predict_proba`.

---

### Understanding `predict_proba`
- Behind the scenes, the logistic regression model calculates probabilities for each class.
- The `predict` method then applies a threshold of **0.5** to the probability of class 1:
  - **Probability > 0.5:** Predict **1** (excessively absent).
  - **Probability ≤ 0.5:** Predict **0** (not excessively absent).

Using these methods, we can either predict binary outcomes directly or analyze the probabilities to gain deeper insights into the predictions.
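
The link between the two methods can be checked directly. This is a minimal sketch on synthetic data from `make_classification` (not the absenteeism data), with illustrative variable names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only
toy_x, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_model = LogisticRegression()
toy_model.fit(toy_x, toy_y)

# Column 1 holds the probability of class 1
proba_of_one = toy_model.predict_proba(toy_x)[:, 1]

# Applying the 0.5 threshold by hand
manual_predictions = (proba_of_one > 0.5).astype(int)

# Should agree with the labels returned by predict
print(np.array_equal(manual_predictions, toy_model.predict(toy_x)))
```

Working with the probabilities also makes it possible to try a different threshold than 0.5 if one type of error is more costly than the other.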


In [50]:
# find the predicted probabilities of each class
# the first column shows the probability of a particular observation to be 0, while the second one - to be 1
predicted_proba = reg.predict_proba(x_test)

# let's check that out
predicted_proba

array([[0.71342516, 0.28657484],
       [0.5873216 , 0.4126784 ],
       [0.44016153, 0.55983847],
       [0.78163061, 0.21836939],
       [0.08407928, 0.91592072],
       [0.3348226 , 0.6651774 ],
       [0.29971206, 0.70028794],
       [0.13112385, 0.86887615],
       [0.78627908, 0.21372092],
       [0.74906578, 0.25093422],
       [0.49395555, 0.50604445],
       [0.22492002, 0.77507998],
       [0.07135527, 0.92864473],
       [0.73173354, 0.26826646],
       [0.30957854, 0.69042146],
       [0.54726422, 0.45273578],
       [0.55051921, 0.44948079],
       [0.53926379, 0.46073621],
       [0.40197149, 0.59802851],
       [0.05365482, 0.94634518],
       [0.70030387, 0.29969613],
       [0.78163061, 0.21836939],
       [0.42028246, 0.57971754],
       [0.42028246, 0.57971754],
       [0.24801464, 0.75198536],
       [0.74567806, 0.25432194],
       [0.51026557, 0.48973443],
       [0.8569309 , 0.1430691 ],
       [0.20365204, 0.79634796],
       [0.78163061, 0.21836939],
       [0.

In [51]:
predicted_proba.shape

(140, 2)

In [58]:
# select ONLY the probabilities referring to 1s
predicted_proba[:,1]

array([0.28657484, 0.4126784 , 0.55983847, 0.21836939, 0.91592072,
       0.6651774 , 0.70028794, 0.86887615, 0.21372092, 0.25093422,
       0.50604445, 0.77507998, 0.92864473, 0.26826646, 0.69042146,
       0.45273578, 0.44948079, 0.46073621, 0.59802851, 0.94634518,
       0.29969613, 0.21836939, 0.57971754, 0.57971754, 0.75198536,
       0.25432194, 0.48973443, 0.1430691 , 0.79634796, 0.21836939,
       0.36947677, 0.67913195, 0.68508325, 0.52870791, 0.21836939,
       0.53505228, 0.22144744, 0.73673169, 0.40500758, 0.60504297,
       0.21072119, 0.45227108, 0.23749326, 0.39847178, 0.82763577,
       0.56771922, 0.69120847, 0.28657484, 0.2192347 , 0.2032712 ,
       0.57634482, 0.32954238, 0.6651774 , 0.26937528, 0.83323682,
       0.43484145, 0.88365871, 0.23125087, 0.33433749, 0.34451397,
       0.69915101, 0.6549938 , 0.29244583, 0.79186052, 0.20752232,
       0.26838009, 0.08710411, 0.22144744, 0.73215219, 0.30536526,
       0.22144744, 0.2900789 , 0.90443841, 0.46065771, 0.60175

## Saving the Model and Preparing for Deployment

### Why Save a Model?

Saving a trained model offers several key benefits:

1. **Reusability:**  
   - Apply the model to new data without retraining.
   
2. **Portability:**  
   - Share the model easily with colleagues or deploy it in different environments.

3. **Efficiency:**  
   - Save time and computational resources by avoiding repeated training.

---

### What to Save

When deploying a logistic regression model, the following components are essential:

1. **The Model (`reg`):**  
   - Contains the logistic regression model, including:
     - Coefficients.
     - Intercept.
     - Other metadata.

2. **The Scaler (`absenteeism_scaler`):**  
   - Contains the scaling parameters (mean and standard deviation) used to standardize the data during preprocessing.
   - Ensures that new data can be preprocessed in the same way as the training data.

Saving these components ensures that the model and preprocessing steps can be seamlessly integrated into a deployment pipeline.


In [59]:
# import the relevant module
import pickle

In [60]:
# save the trained model to a file named 'model'
with open('model', 'wb') as file:
    pickle.dump(reg, file)

In [61]:
# save the fitted scaler to a file named 'scaler'
with open('scaler','wb') as file:
    pickle.dump(absenteeism_scaler, file)
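
For completeness, here is a minimal sketch of the reverse step. It uses a toy model and an in-memory round trip with `pickle.dumps` / `pickle.loads`, so it runs without touching the `model` and `scaler` files. With the real files, `pickle.load` on a file opened in `'rb'` mode plays the same role. Note that unpickling `absenteeism_scaler` requires the `CustomScaler` class definition to be importable in the loading environment.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model standing in for `reg` (illustration only)
toy_x = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(toy_x, toy_y)

# Serialize to bytes and restore; pickle.dump / pickle.load do the same with files
model_bytes = pickle.dumps(toy_model)
restored_model = pickle.loads(model_bytes)

# The restored model carries the same coefficients and gives the same predictions
print(np.array_equal(restored_model.coef_, toy_model.coef_))
print(np.array_equal(restored_model.predict(toy_x), toy_model.predict(toy_x)))
```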