In [1]:
%reload_ext nb_black

## Day 32 Lecture 1 Assignment

In this assignment, we will learn about K-nearest-neighbors (KNN) regression. We will use the Absenteeism at Work dataset loaded below and analyze the model fitted to it.

The meaning of the different columns can be found here: https://www.kaggle.com/tonypriyanka2913/employee-absenteeism

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [3]:
absent = pd.read_csv(
    "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Absenteeism_at_work.csv",
    sep=";",
)

In [4]:
absent.shape

(740, 21)

In [5]:
absent.head()

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,26,7,3,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,4
1,36,0,7,3,1,118,13,18,50,239.554,...,1,1,1,1,0,0,98,178,31,0
2,3,23,7,4,1,179,51,18,38,239.554,...,0,1,0,1,0,0,89,170,31,2
3,7,7,7,5,1,279,5,14,39,239.554,...,0,1,2,1,1,0,68,168,24,4
4,11,23,7,5,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,2


Find which pairs of variables have the highest pairwise correlation and remove one variable from each such pair. Additionally, look at the column names for variables that are likely to be correlated with each other and remove the redundant columns as well.

Note: When choosing between two categorical variables that are correlated, you often want to keep the one with fewer unique values. Why might that be? (Think about the cons of KNN.)

In [6]:
# answer below:
# plt.figure(figsize=(15, 15))
# sns.heatmap(absent.drop(columns=["ID", "Height", "Weight", "Month of absence"]).corr(), vmin=-1, vmax=1, annot=True)
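A large annotated heatmap can be hard to scan, so a ranked list of correlated pairs may help. The following is a minimal sketch on synthetic data; the `demo` frame and its values are made up for illustration and only borrow column names from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical); swap in `absent` to use the real columns
rng = np.random.default_rng(0)
height = rng.normal(170, 8, 200)
weight = rng.normal(75, 10, 200)
demo = pd.DataFrame(
    {
        "Height": height,
        "Weight": weight,
        "Body mass index": weight / (height / 100) ** 2,
        "Age": rng.integers(25, 60, 200),
    }
)

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_pairs = upper.stack().dropna().sort_values(ascending=False)
print(top_pairs.head())
```

Pairs near the top of `top_pairs` are candidates for dropping one of the two columns. Which one to keep is still a judgment call.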


Figure out which columns actually contain sneaky categorical variables and turn those into dummy variables.
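One rough way to spot integer-coded categoricals is to count distinct values per column. The sketch below uses a small hypothetical frame, and the threshold `max_levels` is an assumed cutoff, not something the dataset prescribes.

```python
import pandas as pd

# Small hypothetical frame mimicking integer-coded columns
demo = pd.DataFrame(
    {
        "Seasons": [1, 2, 3, 4, 1, 2, 3, 4],
        "Social smoker": [0, 0, 1, 0, 0, 1, 0, 0],
        "Age": [33, 50, 38, 39, 28, 41, 47, 36],
        "Transportation expense": [289, 118, 179, 279, 225, 361, 260, 155],
    }
)

# Columns with only a handful of distinct values are candidates for
# categorical treatment; the cutoff is a judgment call (assumed 5 here)
max_levels = 5
n_unique = demo.nunique()
candidates = n_unique[n_unique <= max_levels].index.tolist()
print(candidates)

# One-hot encode the candidates, dropping the first level of each column
dummies = pd.get_dummies(demo, columns=candidates, drop_first=True)
print(dummies.columns.tolist())
```

A low number of distinct values is only a hint. Count-like columns such as `Pet` or `Son` also have few values but are arguably numeric, so the column descriptions should have the final say.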

In [7]:
absent = absent.drop(columns=["ID", "Height", "Weight", "Month of absence"])

In [8]:
absent["Pet"].value_counts()

0    460
1    138
2     96
4     32
8      8
5      6
Name: Pet, dtype: int64

In [9]:
absent.isna().mean()

Reason for absence                 0.0
Day of the week                    0.0
Seasons                            0.0
Transportation expense             0.0
Distance from Residence to Work    0.0
Service time                       0.0
Age                                0.0
Work load Average/day              0.0
Hit target                         0.0
Disciplinary failure               0.0
Education                          0.0
Son                                0.0
Social drinker                     0.0
Social smoker                      0.0
Pet                                0.0
Body mass index                    0.0
Absenteeism time in hours          0.0
dtype: float64

In [10]:
absent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 740 entries, 0 to 739
Data columns (total 17 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Reason for absence               740 non-null    int64  
 1   Day of the week                  740 non-null    int64  
 2   Seasons                          740 non-null    int64  
 3   Transportation expense           740 non-null    int64  
 4   Distance from Residence to Work  740 non-null    int64  
 5   Service time                     740 non-null    int64  
 6   Age                              740 non-null    int64  
 7   Work load Average/day            740 non-null    float64
 8   Hit target                       740 non-null    int64  
 9   Disciplinary failure             740 non-null    int64  
 10  Education                        740 non-null    int64  
 11  Son                              740 non-null    int64  
 12  Social drinker        

In [11]:
reason_counts = absent["Reason for absence"].value_counts()
above_thresh_diseases = reason_counts[reason_counts > 8]
keep_reasons = above_thresh_diseases.index

In [12]:
reason_filter = absent["Reason for absence"].isin(keep_reasons)
# Lump rare reasons (8 or fewer occurrences) into a single -1 category
absent.loc[~reason_filter, "Reason for absence"] = -1
absent["Reason for absence"].value_counts()

 23    149
 28    112
 27     69
 13     55
-1      48
 0      43
 19     40
 22     38
 26     33
 25     31
 11     26
 10     25
 18     21
 14     19
 1      16
 7      15
Name: Reason for absence, dtype: int64

In [13]:
X = absent.drop(columns="Absenteeism time in hours")
y = absent["Absenteeism time in hours"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [14]:
# Categorical columns to one-hot encode
cat_cols = ["Reason for absence", "Seasons", "Disciplinary failure",
            "Education", "Social drinker", "Social smoker"]
# Category to drop from each column above, listed in the same order
drop_cats = [-1, 4, 0, 1, 0, 0]

# The rest are numeric
num_cols = [c for c in X if c not in cat_cols]



In [15]:
for col in cat_cols:
    print(absent[col].value_counts())

 23    149
 28    112
 27     69
 13     55
-1      48
 0      43
 19     40
 22     38
 26     33
 25     31
 11     26
 10     25
 18     21
 14     19
 1      16
 7      15
Name: Reason for absence, dtype: int64
4    195
2    192
3    183
1    170
Name: Seasons, dtype: int64
0    700
1     40
Name: Disciplinary failure, dtype: int64
1    611
3     79
2     46
4      4
Name: Education, dtype: int64
1    420
0    320
Name: Social drinker, dtype: int64
0    686
1     54
Name: Social smoker, dtype: int64


In [16]:
preprocessing = ColumnTransformer(
    [
        ("scale", StandardScaler(), num_cols),
        ("one_hot_encode", OneHotEncoder(drop=drop_cats), cat_cols),
    ]
)

In [17]:
pipeline = Pipeline(
    [
        # ("name of step", sklearn object with a fit method)
        ("preprocessing", preprocessing),
        ("knn", KNeighborsRegressor()),
    ]
)

In [18]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('preprocessing',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('scale',
                                                  StandardScaler(copy=True,
                                                                 with_mean=True,
                                                                 with_std=True),
                                                  ['Day of the week',
                                                   'Transportation expense',
                                                   'Distance from Residence to '
                                                   'Work',
                                                   'Service time', 'Age',
                                                   'Work load Average/day ',
                         

In [19]:
pipeline.score(X_train, y_train)

0.31932655154008666

In [20]:
pipeline.score(X_test, y_test)

-0.32713892147333845

Split the data into train and test with test containing 20% of the data, then scale your features.

In [21]:
# answer below:
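As a minimal sketch of the split-then-scale step, the example below uses synthetic arrays as hypothetical stand-ins for `X` and `y`. The point it illustrates is that the scaler is fitted on the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features and target (hypothetical stand-ins for X and y)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))
y_demo = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Wrapping the scaler and the model in a `Pipeline` achieves the same thing and keeps the test rows out of the scaling statistics automatically.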



Train a series of KNN regression models over a range of K values. For each K, run cross-validation on the training set and compute the average RMSE across folds. Plot K against the average RMSE. What is the best value of K?

In [22]:
# answer below
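One possible approach is a loop over K with `cross_val_score`, sketched below on synthetic data from `make_regression` (a hypothetical stand-in for `X_train` and `y_train`). The K range of 1 to 30 and the 5 folds are assumed values chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (hypothetical stand-in for X_train, y_train)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

k_values = list(range(1, 31))
mean_rmse = []
for k in k_values:
    model = Pipeline(
        [("scale", StandardScaler()), ("knn", KNeighborsRegressor(n_neighbors=k))]
    )
    # scikit-learn maximizes scores, so RMSE comes back as a negative number
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    mean_rmse.append(-scores.mean())

# K with the lowest average RMSE across folds
best_k = k_values[int(np.argmin(mean_rmse))]
print("best K:", best_k)

plt.plot(k_values, mean_rmse, marker="o")
plt.xlabel("K (n_neighbors)")
plt.ylabel("Mean cross-validated RMSE")
plt.show()
```

With the notebook's own pipeline, the same search could likely be expressed through `GridSearchCV` with a parameter grid on `knn__n_neighbors`, since the KNN step is named `"knn"`.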

Using your best K, fit a model to all your training data and show the RMSE for the training and testing sets.

In [23]:
# answer below
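The sketch below shows one way to report RMSE on both splits. It runs on synthetic data, and `best_k = 7` is an assumed placeholder rather than a value found for the absenteeism data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical stand-in for the absenteeism features and target)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

best_k = 7  # assumed placeholder; use the K chosen by cross-validation
model = Pipeline(
    [("scale", StandardScaler()), ("knn", KNeighborsRegressor(n_neighbors=best_k))]
)
model.fit(X_tr, y_tr)

# RMSE is the square root of the mean squared error
rmse_train = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
rmse_test = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
print(f"train RMSE: {rmse_train:.2f}, test RMSE: {rmse_test:.2f}")
```

A test RMSE well above the training RMSE would suggest overfitting, which a larger K sometimes reduces.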

Create a homoscedasticity plot (also called a residual plot). How is your model performing? What ideas do you have to improve the model?

In [24]:
# answer below
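A residual plot puts predicted values on the x-axis and residuals on the y-axis. The sketch below builds one on synthetic data with an assumed `n_neighbors=7`; both the data and the K value are placeholders for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical stand-in for the absenteeism features and target)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = Pipeline(
    [("scale", StandardScaler()), ("knn", KNeighborsRegressor(n_neighbors=7))]
)
model.fit(X_tr, y_tr)

# Residual = actual minus predicted, computed on the held-out rows
pred = model.predict(X_te)
residuals = y_te - pred

plt.scatter(pred, residuals, alpha=0.6)
plt.axhline(0, color="black", linewidth=1)
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.show()
```

Residuals scattered evenly around zero across the range of predictions suggest roughly constant variance. A funnel shape or a few very large residuals point to heteroscedasticity or outliers in the target.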
