## Creating a logistic regression to make predictions

### Import relevant libraries

In [1]:
import pandas as pd
import numpy as np

## Load the data

In [2]:
data_preprocessed = pd.read_csv('Absenteeism_preprocessed.csv')

In [3]:
data_preprocessed.head()

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value,Day of Week
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,0,2,1,4,7,1
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,0,1,0,0,7,1
2,0,0,0,1,2015-07-15,179,51,38,239.554,31,0,0,0,2,7,2
3,1,1,0,0,2015-07-16,279,5,39,239.554,24,0,2,0,4,7,3
4,0,0,0,1,2015-07-23,289,36,33,239.554,30,0,2,1,2,7,3


### Create the targets
#### Logistic regression is a classification method, so we will divide the data into classes
#### Which classes? The remaining preprocessing depends on that choice
#### For us: excessively absent and moderately absent
#### Naive approach: take the median of the absenteeism hours; anything at or below it counts as moderate, anything above it as excessive

In [4]:
data_preprocessed['Absenteeism Time in Hours'].median()

3.0

### Anything up to 3 hours is moderately absent and anything above 3 hours is excessively absent. For the model, we assign 0 to the first group and 1 to the second

### We use np.where(condition, value_if_true, value_if_false) to create these targets
### np.where is similar to the IF function in Excel
### This is another method for mapping
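
As a minimal sketch of how np.where builds 0/1 targets from a median cut-off, here is the same idea on a tiny array. The `hours` values are made up for illustration and are not the course data.

```python
import numpy as np

# Hypothetical absence durations in hours (illustrative values, not the course data)
hours = np.array([4, 0, 2, 4, 2, 8, 3])

# Cut-off computed from the data instead of hard-coding 3
cutoff = np.median(hours)

# 1 where hours are strictly above the cut-off, 0 otherwise
toy_targets = np.where(hours > cutoff, 1, 0)
print(cutoff, toy_targets)
```

Note that an observation exactly at the median gets a 0, because the comparison is strict.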

In [5]:
# targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > 3, 1, 0)

In [6]:
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > 
                   data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)

# parameterizing makes understanding code easier

In [7]:
data_preprocessed['Excessive absenteeism'] = targets

In [8]:
data_preprocessed

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value,Day of Week,Excessive absenteeism
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,0,2,1,4,7,1,1
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,0,1,0,0,7,1,0
2,0,0,0,1,2015-07-15,179,51,38,239.554,31,0,0,0,2,7,2,0
3,1,1,0,0,2015-07-16,279,5,39,239.554,24,0,2,0,4,7,3,1
4,0,0,0,1,2015-07-23,289,36,33,239.554,30,0,2,1,2,7,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,1,0,0,2018-05-23,179,22,40,237.656,22,1,2,0,8,5,2,1
696,1,1,0,0,2018-05-23,225,26,28,237.656,24,0,1,2,3,5,2,0
697,1,1,0,0,2018-05-24,330,16,28,237.656,25,1,0,0,8,5,3,1
698,0,0,0,1,2018-05-24,235,16,32,237.656,25,1,0,0,2,5,3,0


#### A comment on the targets

In [9]:
targets.sum() / targets.shape[0]  # share of 1s: about 46% of the observations are 1s, the remaining 54% are 0s

0.45571428571428574

### When balancing a dataset, an exact 50-50 split is not mandatory
### A 60-40 split usually works well for logistic regression
### A balance of 45-55 is almost always sufficient

#### Removing 'Absenteeism Time in Hours' from the DataFrame, since the targets were derived from it

In [10]:
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours'], axis = 1)

## Similar to creating a checkpoint

In [11]:
data_with_targets.head()

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Month Value,Day of Week,Excessive absenteeism
0,0,0,0,1,2015-07-07,289,36,33,239.554,30,0,2,1,7,1,1
1,0,0,0,0,2015-07-14,118,13,50,239.554,31,0,1,0,7,1,0
2,0,0,0,1,2015-07-15,179,51,38,239.554,31,0,0,0,7,2,0
3,1,1,0,0,2015-07-16,279,5,39,239.554,24,0,2,0,7,3,1
4,0,0,0,1,2015-07-23,289,36,33,239.554,30,0,2,1,7,3,0


### Selecting the inputs for regression
#### df.iloc[row indices, column indices]

In [12]:
data_with_targets.shape

(700, 16)

In [13]:
data_with_targets = data_with_targets.drop(['Date', 'Distance to Work',
                                            'Daily Work Load Average','Day of Week'],axis = 1)

In [14]:
data_with_targets.columns.values

array(['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Month Value', 'Excessive absenteeism'],
      dtype=object)

In [15]:
cols = ['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Month Value', 'Excessive absenteeism']

In [16]:
data_with_targets = data_with_targets[cols]

In [17]:
# data_with_targets

In [18]:
# data_with_targets.iloc[:,0:14] OR

In [19]:
data_with_targets.iloc[:,:14]

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Month Value,Excessive absenteeism
0,0,0,0,1,289,33,30,0,2,1,7,1
1,0,0,0,0,118,50,31,0,1,0,7,0
2,0,0,0,1,179,38,31,0,0,0,7,0
3,1,1,0,0,279,39,24,0,2,0,7,1
4,0,0,0,1,289,33,30,0,2,1,7,0
...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,1,0,0,179,40,22,1,2,0,5,1
696,1,1,0,0,225,28,24,0,1,2,5,0
697,1,1,0,0,330,28,25,1,0,0,5,1
698,0,0,0,1,235,32,25,1,0,0,5,0


### With a large number of columns we cannot reasonably count the indices, so we slice relative to the end instead

In [20]:
data_with_targets.iloc[:,:-1] # a negative stop skips that many columns from the end (here: the last one)

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Month Value
0,0,0,0,1,289,33,30,0,2,1,7
1,0,0,0,0,118,50,31,0,1,0,7
2,0,0,0,1,179,38,31,0,0,0,7
3,1,1,0,0,279,39,24,0,2,0,7
4,0,0,0,1,289,33,30,0,2,1,7
...,...,...,...,...,...,...,...,...,...,...,...
695,1,1,0,0,179,40,22,1,2,0,5
696,1,1,0,0,225,28,24,0,1,2,5
697,1,1,0,0,330,28,25,1,0,0,5
698,0,0,0,1,235,32,25,1,0,0,5
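
To make the df.iloc[row indices, column indices] pattern concrete, here is a small sketch on a hypothetical three-column frame (the column names `a`, `b` and `target` are invented for the example).

```python
import pandas as pd

# Hypothetical frame; the last column plays the role of the target
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "target": [0, 1]})

inputs = df.iloc[:, :-1]     # every row, every column except the last
last_two = df.iloc[:, -2:]   # every row, only the last two columns

print(list(inputs.columns))
print(list(last_two.columns))
```

Negative positions count from the end, so the same slice keeps working if columns are added at the front.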


In [21]:
unscaled_inputs = data_with_targets.iloc[:,:-1]
unscaled_inputs

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Month Value
0,0,0,0,1,289,33,30,0,2,1,7
1,0,0,0,0,118,50,31,0,1,0,7
2,0,0,0,1,179,38,31,0,0,0,7
3,1,1,0,0,279,39,24,0,2,0,7
4,0,0,0,1,289,33,30,0,2,1,7
...,...,...,...,...,...,...,...,...,...,...,...
695,1,1,0,0,179,40,22,1,2,0,5
696,1,1,0,0,225,28,24,0,1,2,5
697,1,1,0,0,330,28,25,1,0,0,5
698,0,0,0,1,235,32,25,1,0,0,5


## Custom scaler

In [22]:
# import the libraries needed to create the Custom Scaler
# note that all of them are part of the sklearn package
# moreover, one of them is the StandardScaler class,
# so you can imagine that the Custom Scaler is built on it

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# create the Custom Scaler class

class CustomScaler(BaseEstimator,TransformerMixin): 
    
    # init or what information we need to declare a CustomScaler object
    # and what is calculated/declared as we do
    
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        
        # scaler is nothing but a Standard Scaler object
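        # pass the options by keyword: recent scikit-learn versions reject positional arguments here
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)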
        # with some columns 'twist'
        self.columns = columns
        self.mean_ = None
        self.var_ = None
        
    
    # the fit method, which is again based on StandardScaler
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # axis=0 keeps one mean/variance per column regardless of the pandas version
        self.mean_ = np.mean(X[self.columns], axis=0)
        self.var_ = np.var(X[self.columns], axis=0)
        return self
    
    # the transform method which does the actual scaling

    def transform(self, X, y=None, copy=None):
        
        # record the initial order of the columns
        init_col_order = X.columns
        
        # scale all features that you chose when creating the instance of the class
        # keep the original index so that the concat below aligns rows correctly
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                                columns=self.columns, index=X.index)
        
        # declare a variable containing all information that was not scaled
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        
        # return a data frame which contains all scaled features and all 'not scaled' features
        # use the original order (that you recorded in the beginning)
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

In [23]:
unscaled_inputs.columns.values

array(['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Month Value'], dtype=object)

### Done during backward elimination

In [24]:
columns_to_omit = ['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4', 'Education']

In [25]:
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit ]

In [26]:
columns_to_scale

['Transportation Expense',
 'Age',
 'Body Mass Index',
 'Children',
 'Pets',
 'Month Value']

In [27]:
absenteeism_scaler = CustomScaler(columns_to_scale)

In [28]:
absenteeism_scaler.fit(unscaled_inputs)

CustomScaler(columns=['Transportation Expense', 'Age', 'Body Mass Index',
                      'Children', 'Pets', 'Month Value'],
             copy=None, with_mean=None, with_std=None)

In [29]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

In [30]:
scaled_inputs

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Month Value
0,0,0,0,1,1.005844,-0.536062,0.767431,0,0.880469,0.268487,0.182726
1,0,0,0,0,-1.574681,2.130803,1.002633,0,-0.019280,-0.589690,0.182726
2,0,0,0,1,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690,0.182726
3,1,1,0,0,0.854936,0.405184,-0.643782,0,0.880469,-0.589690,0.182726
4,0,0,0,1,1.005844,-0.536062,0.767431,0,0.880469,0.268487,0.182726
...,...,...,...,...,...,...,...,...,...,...,...
695,1,1,0,0,-0.654143,0.562059,-1.114186,1,0.880469,-0.589690,-0.388293
696,1,1,0,0,0.040034,-1.320435,-0.643782,0,-0.019280,1.126663,-0.388293
697,1,1,0,0,1.624567,-1.320435,-0.408580,1,-0.919030,-0.589690,-0.388293
698,0,0,0,1,0.190942,-0.692937,-0.408580,1,-0.919030,-0.589690,-0.388293


### Standardizing (scaling) the data

### We created the absenteeism_scaler object to scale our data. In other words, it subtracts the mean and divides by the standard deviation, variable by variable (feature-wise)

#### NOTE: CHECK THE STANDARDIZATION LESSONS

### Whenever we get new data, the standardization information is stored in absenteeism_scaler, so we can standardize the new data in exactly the same way. IMPORTANT STEP

### fit() only creates the scaling mechanism; to apply it, we need another method: transform()
### It transforms the unscaled inputs using the information stored in absenteeism_scaler
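
The fit/transform idea can be sketched by hand for a single feature. The numbers below are hypothetical; the point is only that new data is scaled with the statistics learnt from the training data, not with its own.

```python
import numpy as np

# Hypothetical training values for one feature (e.g. age)
train = np.array([30.0, 40.0, 50.0])

# "fit": learn the statistics from the training data only
mean = train.mean()
std = train.std()  # population standard deviation (ddof=0), as StandardScaler uses

# "transform": apply the stored statistics
train_scaled = (train - mean) / std

# New data is scaled with the SAME mean and std, not its own
new = np.array([40.0, 60.0])
new_scaled = (new - mean) / std

print(train_scaled, new_scaled)
```

After scaling, the training feature has mean 0 and standard deviation 1, while the new values are simply expressed in the same units.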

### Underfitting and overfitting
### Overfitting: the model learns to predict the data we gave it accurately but fails badly on new data. To guard against it:
### 1. Hide a small part of the data from the algorithm: train the model on the majority of the data and use the held-out test data to check whether the model will work in real life
### 2. Randomization: shuffle the data before splitting

## Split in train and test
### Import the appropriate module

In [31]:
from sklearn.model_selection import train_test_split

In [32]:
# x_train, x_test, y_train, y_test = train_test_split(scaled_inputs,targets)   

# splits the inputs into train and test and the targets into train and test, giving us 4 arrays
# the function returns them in this order, and we have to name them ourselves

In [33]:
# print( x_train.shape, y_train.shape)

In [34]:
# print( x_test.shape, y_test.shape)

### usually 80-20 / 90-10 splits are preferred

In [1]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs,targets, train_size = 0.8, random_state = 20)  

# train_test_split(inputs, targets, train_size=..., random_state=...)


### train_test_split also shuffles by default (shuffle=True), so the split changes every time we run the code. We may get lucky and achieve a higher accuracy because of this, or the opposite. Setting random_state = 20 fixes the seed of the pseudorandom number generator, so the data is always shuffled in the same way
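
The effect of a fixed seed can be illustrated without sklearn. The helper `shuffled_split` below is a hypothetical stand-in written for this note; it is not how train_test_split is implemented internally, it only shows that the same seed reproduces the same shuffle.

```python
import numpy as np

def shuffled_split(n_samples, train_size, seed):
    """Return (train_idx, test_idx) after a seeded shuffle of range(n_samples)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(n_samples * train_size)
    return order[:n_train], order[n_train:]

# Same seed twice: identical split both times
train_a, test_a = shuffled_split(10, 0.8, seed=20)
train_b, test_b = shuffled_split(10, 0.8, seed=20)

print(train_a, test_a)
```

Every observation ends up in exactly one of the two parts, and rerunning with seed=20 always gives the same assignment.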

## LR with sklearn

In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [37]:
reg = LogisticRegression()

In [38]:
reg.fit(x_train,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

#### All of these are default parameters, since we did not pass any when creating the model. Tuning them can help improve the model further

In [39]:
reg.score(x_train,y_train)  

0.7696428571428572

### The logistic regression is trained on the train inputs and finds outputs that are as close to the targets as possible
### Here our model learnt to classify about 77% of the training observations correctly
### An accuracy of x% means that x% of the outputs match the targets

### Manually checking the accuracy: we need to find the model outputs and compare them with the targets, USING PREDICT

In [40]:
model_outputs = reg.predict(x_train)  #should match the target - y_train

In [41]:
# model_outputs == y_train
np.sum(model_outputs == y_train)   #gives total no. of trues i.e correct predictions

431

In [42]:
model_outputs.shape[0]

560

In [43]:
accuracy = np.sum(model_outputs == y_train) / model_outputs.shape[0]
accuracy

0.7696428571428572

## Finding intercepts and coefficients

In [44]:
reg.intercept_

array([-1.50477804])

In [45]:
reg.coef_

array([[ 1.36092638,  1.36092638,  2.9264957 ,  0.69905475,  0.60371772,
        -0.16882678,  0.27480791, -0.2354129 ,  0.3554835 , -0.27564339,
         0.15292338]])

### sklearn returns its results as NumPy ndarrays, so the coefficients carry no feature names. This is different in statsmodels, where results fitted on pandas data come back as pandas objects

In [46]:
# unscaled_inputs.columns.values

In [47]:
feature_name = unscaled_inputs.columns.values

In [48]:
summary_table = pd.DataFrame(columns = ['feature_name'], data = feature_name) # one row per feature, holding its name

summary_table['Coefficient'] = np.transpose(reg.coef_)
summary_table

# reg.coef_ has shape (1, n_features), so we transpose it into a column

Unnamed: 0,feature_name,Coefficient
0,Reason 1,1.360926
1,Reason 2,1.360926
2,Reason 3,2.926496
3,Reason 4,0.699055
4,Transportation Expense,0.603718
5,Age,-0.168827
6,Body Mass Index,0.274808
7,Education,-0.235413
8,Children,0.355483
9,Pets,-0.275643


In [49]:
# add intercept to the table

In [50]:
summary_table.index = summary_table.index+1

In [51]:
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
summary_table

Unnamed: 0,feature_name,Coefficient
1,Reason 1,1.360926
2,Reason 2,1.360926
3,Reason 3,2.926496
4,Reason 4,0.699055
5,Transportation Expense,0.603718
6,Age,-0.168827
7,Body Mass Index,0.274808
8,Education,-0.235413
9,Children,0.355483
10,Pets,-0.275643


### Intercept = bias and coefficient = weight
### A larger absolute weight means the feature is more important. This comparison only holds when the features are on the same scale, as the standardized ones are here
### Standardized coefficients = the regression coefficients obtained when all variables have been standardized

### The model predicts 'log(odds)': logistic regression is a linear model for the log(odds), which are then transformed into probabilities and finally into 0s and 1s
### i.e. log(odds) = intercept + b1x1 + b2x2 + ...
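
A small sketch of how a log(odds) value turns into a probability and then into a class. The function name `predict_probability` and all numbers are hypothetical and chosen only for illustration.

```python
import math

def predict_probability(intercept, coefficients, features):
    """Probability of class 1 for a single observation."""
    # linear part: log(odds) = intercept + b1*x1 + b2*x2 + ...
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    # logistic (sigmoid) function maps log(odds) to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical two-feature model
p_zero = predict_probability(0.0, [1.0, -1.0], [2.0, 2.0])   # log(odds) = 0
p_high = predict_probability(-1.5, [3.0, 0.5], [1.0, 0.0])   # log(odds) = 1.5

# Final class: 1 if the probability reaches 0.5, else 0
label = 1 if p_high >= 0.5 else 0
print(p_zero, p_high, label)
```

A log(odds) of 0 corresponds to a probability of exactly 0.5, which is why 0.5 is the natural cut-off.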

### Taking the exponential of a coefficient gives its 'odds ratio'

In [52]:
summary_table['Odds Ratio'] = np.exp(summary_table.Coefficient)

In [53]:
# summary_table

In [54]:
summary_table.sort_values('Odds Ratio', ascending = False)  # descending order

Unnamed: 0,feature_name,Coefficient,Odds Ratio
3,Reason 3,2.926496,18.662118
1,Reason 1,1.360926,3.899804
2,Reason 2,1.360926,3.899804
4,Reason 4,0.699055,2.01185
5,Transportation Expense,0.603718,1.828906
9,Children,0.355483,1.42687
7,Body Mass Index,0.274808,1.316278
11,Month Value,0.152923,1.165236
6,Age,-0.168827,0.844655
8,Education,-0.235413,0.790244


## ----------------------------------- INTERPRETATION -----------------------------------------
### 1. The 4 reason groups, age and education seem to be the most important features
### 2. Distance to work, daily work load average and day of week have little to no impact
### 3. Our base model is Reason 0 = no reason given
### 4. For Reason 3 (odds ratio about 18.7), the odds of being 'excessively absent' are almost 20 times the odds for someone who gave no reason
### 5. Standardized inputs such as transportation expense have no direct interpretation the way the dummies do. A one standard deviation increase in transportation expense multiplies the odds of excessive absenteeism by about 1.83, i.e. almost doubles them
### 6. For each additional standardized unit of pets, the odds are 1 - 0.759 = 24% lower. The negative sign suggests that the odds fall as the number of pets increases; one possible explanation is that such people have someone to help with the pets and so are absent less
### 7. The intercept is the BIAS: it calibrates the model

### Standardization often improves accuracy and is therefore preferred in ML. Statisticians and econometricians care less about accuracy and more about interpretability, since they want to understand the underlying reasons behind different phenomena. In data science it is a mix and depends on the task

## ----------------------------------------------------------------------------------------------------




## Backward elimination
### - We simplify the model by eliminating the less important features
### - When we have p-values, we remove all features with p > 0.05, since they are not significant
### - With sklearn we do not get p-values, and we do not need them: if a weight is small enough, the feature has almost no impact anyway, so removing it should not change the remaining coefficients much
### - We eliminate these features at the checkpoint where we set the targets
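
One way to sketch the selection step is to flag features whose standardized weight is close to 0. The coefficient values, the placeholder names 'Feature X' and 'Feature Y', and the 0.1 threshold below are all assumptions made for the example, not results from this notebook.

```python
# Hypothetical standardized coefficients (illustrative values only)
coefficients = {
    "Reason 3": 2.93,
    "Transportation Expense": 0.60,
    "Feature X": -0.01,
    "Feature Y": 0.05,
}

threshold = 0.1  # assumed cut-off for "small enough"

# Features whose absolute weight is below the threshold are candidates for removal
to_drop = [name for name, b in coefficients.items() if abs(b) < threshold]
to_keep = [name for name in coefficients if name not in to_drop]

print(to_drop, to_keep)
```

After dropping the candidates, the model should be refitted and the remaining coefficients compared with the previous ones.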

#### columns_to_scale: this list was originally defined by us (hard-coded)
#### Instead, we create 'columns_to_omit' (the columns excluded from scaling) and derive columns_to_scale from it
## ----------------------------------------------------------------------------------------------

## Interpretation:
### 1. If the coefficient is around 0, or equivalently the odds ratio is close to 1, the feature is NOT THAT IMPORTANT
### 2. For a one-unit change in a standardized feature, the odds are multiplied by the 'odds ratio'
#### Based on this, daily work load average is less important than the other features, and the same holds for distance to work and day of week. We can keep these or drop them later on
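
The multiplicative reading of the odds ratio can be checked numerically. The coefficient below is the 'Transportation Expense' value printed in this notebook's summary table; the baseline log(odds) of -1.0 is an arbitrary assumption, and any other value gives the same ratio.

```python
import math

# 'Transportation Expense' coefficient as printed in the summary table
coef = 0.603718
odds_ratio = math.exp(coef)

def odds(log_odds):
    """Convert log(odds) to odds."""
    return math.exp(log_odds)

# Arbitrary baseline log(odds); raising the feature by one unit adds coef to it
base = -1.0
ratio = odds(base + coef * 1.0) / odds(base)

print(round(odds_ratio, 4), round(ratio, 4))
```

Both numbers agree, which is why a coefficient of 0 (odds ratio 1) means the feature leaves the odds unchanged.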

### While creating the dummies, we dropped reason 0, which meant that the reason was not listed or no reason was given
## Base model is the case when there is no reason, aka reason 0
### We see that whenever people listed a reason for absence, the chance of excessive absenteeism was higher. By how much?

### When we standardized the variables, the dummies were standardized too, which is bad practice since we lose their interpretability. If we had kept them as 0s and 1s, we could simply have said that, for a unit change, the person is 3.456 times more likely to be excessively absent compared to no reason given. Thus we do not want the dummies to be standardized. We therefore go back to the checkpoint created while scaling the inputs and use the 'Custom scaler', which is based on the standard scaler but only standardizes the specified columns


# custom scaler code - from course

# After this, we go back to splitting the data into train and test
# --------------------------------------------------------------------------------------

# Testing the model 
### - Done just once, at the end of the ML part. Some people look at the test accuracy and tweak the model to improve it; doing that once may be acceptable, but doing it iteratively is not good
### - If we keep tweaking, we are no longer testing: we are effectively training the model on the test data, just MANUALLY
### - Conceptually, we are not supposed to touch the model after testing it on the test data

In [56]:
reg.score(x_test,y_test)   
# test accuracy is usually somewhat lower than train accuracy because of overfitting; here it is 0.75 versus about 0.77

0.75

### Another method to calc. accuracy - manually 
### Instead of getting 0s and 1s in o/p, we can get the probability of an o/p being 0 or 1

In [62]:
predicted_proba = reg.predict_proba(x_test)
# predicted_proba

### The probabilities of 0 and 1 are in the respective columns, so each row sums to 1. We are interested in the probability of getting 1, so we slice out the second column
### Logistic regression calculates these probabilities in the background: if the probability of 1 is less than 0.5, it outputs 0, otherwise 1
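
The thresholding step can be sketched by hand. The three probabilities are the first values of the class-1 column printed in this notebook; the two-column array is rebuilt here only to mimic the layout that predict_proba returns.

```python
import numpy as np

# First three class-1 probabilities from the predict_proba output in this notebook
proba_class_1 = np.array([0.28320907, 0.41264964, 0.55514098])

# Two-column layout: P(class 0) in column 0, P(class 1) in column 1
proba = np.column_stack([1 - proba_class_1, proba_class_1])

# Label is 1 when P(class 1) exceeds 0.5, else 0
labels = (proba[:, 1] > 0.5).astype(int)

print(proba.sum(axis=1), labels)
```

Only the third observation crosses the 0.5 cut-off, so it is the only one labelled 1.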

In [60]:
predicted_proba.shape

(140, 2)

In [61]:
predicted_proba[:,1]

array([0.28320907, 0.41264964, 0.55514098, 0.21790996, 0.92221156,
       0.67806862, 0.71542133, 0.86427493, 0.2163643 , 0.24913268,
       0.5055378 , 0.77797445, 0.92733408, 0.26452142, 0.68460675,
       0.4459844 , 0.45111823, 0.46195059, 0.61846997, 0.94351282,
       0.29935541, 0.21790996, 0.59170838, 0.59170838, 0.74487137,
       0.25565515, 0.49214419, 0.14002501, 0.88057609, 0.21790996,
       0.37106115, 0.69490239, 0.69682674, 0.5273413 , 0.21790996,
       0.53781361, 0.2238587 , 0.73133277, 0.40704825, 0.61063539,
       0.21056068, 0.46466566, 0.23939537, 0.4380522 , 0.83361216,
       0.55635603, 0.70766581, 0.28320907, 0.22028326, 0.20339481,
       0.58387451, 0.35198264, 0.67806862, 0.26853864, 0.84154588,
       0.43396297, 0.87949178, 0.23462367, 0.36241159, 0.3725597 ,
       0.70642977, 0.66846461, 0.2938503 , 0.78410477, 0.21094446,
       0.26582268, 0.09840225, 0.2238587 , 0.72380693, 0.2998621 ,
       0.2238587 , 0.31956474, 0.90787194, 0.46061288, 0.60192

# Saving the model for future use
### - Saving a model means saving all the relevant information about it
### - For this model: the type (logistic regression), the coefficients, the intercept, the random state, etc.
## The sklearn object 'reg' holds all this information, so saving the model means saving the 'reg' object

### 1. The Python module 'pickle' converts a Python object into a byte stream (the simplest option)
### 2. This byte stream contains all the information we need; when we want to use the model in another notebook, we unpickle it. The reg variable is saved in a file (< 1 KB in size) and loaded in another file
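
A minimal round trip shows what pickling and unpickling do. The dictionary `model_info` is a hypothetical stand-in for a fitted model, and the file is written to a temporary directory so the sketch does not touch the notebook's own files.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: any picklable Python object works
model_info = {"intercept": -1.5, "coefficients": [1.36, 2.93, 0.70]}

path = os.path.join(tempfile.mkdtemp(), "toy_model")

with open(path, "wb") as f:   # 'wb' = write bytes
    pickle.dump(model_info, f)

with open(path, "rb") as f:   # 'rb' = read bytes
    restored = pickle.load(f)

print(restored == model_info)
```

The restored object is an equal copy of the original, not the same object in memory.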

In [63]:
import pickle

In [64]:
with open('LRmodel', 'wb') as file:
    pickle.dump(reg, file)

### More about 'pickling'
 
#### There are several popular ways to save (and finalize) a model. To name some, you can use Joblib (a part of the SciPy ecosystem), and JSON. Certainly, each of those choices has its pros and cons. Pickle is probably the most intuitive and definitely our preferred choice.

#### Once again, ‘pickle’ is the standard Python tool for serialization and deserialization. In simple words, pickling means: converting a Python object (no matter what) into a string of characters. Logically, unpickling is about converting a string of characters (that has been pickled) into a Python object.

#### There are some potential issues you should be aware of, though!

#### Pickle and Python version.

#### Pickling is strictly related to Python version. It is not recommended to (de)serialize objects across different Python versions. Logically, if you’re working on your own this will never be an issue (unless you upgrade/downgrade your Python version). 

#### Pickle is slow.

#### Well, you will barely notice that but for complex structures it may take loads of time to pickle and unpickle.

#### Pickle is not secure.

#### This is evident from the documentation of pickle, quote: “Never unpickle data received from an untrusted or unauthenticated source.” The reason is that just about anything can be pickled, so you can easily unpickle malicious code.

#### Now, if you are unpickling your own code, you are more or less safe.

#### If, however, you receive pickled objects from someone you don’t fully trust, you should be very cautious. That’s how viruses affect your operating system.

#### Finally, even your own file may be changed by an attacker. Thus, the next time you unpickle, you can unpickle just about anything (that this unethical person put there).

#### Certainly, all these cases are very rare, but you must be aware of them. Generally, we recommend using JSON, but that’s a topic for another time

In [66]:
with open('LRscaler','wb') as scaler_file:
    pickle.dump(absenteeism_scaler, scaler_file)

### The 2nd step of deploying the model for future use is creating a mechanism to load the saved model and make predictions
### - We could simply load the new data file and rerun this notebook, but that can lead to errors and is not recommended, OR
### - We can store the code in a module for future use and import it like numpy, pandas, etc.