### COMMENTS

**Comments regarding data** \
In my data, it looks like there are none missing in the test data

The following features are not included:

- Date of birth: lack of time
- Full name: no explanatory power
- Zip code: I did not really understand it - where do I find it?

The feature **On a shift** is not included due to lack of time \
However, my approach would be:
Insert a column on X_test indicating if **on shift last day seen** and use this as regressor


**Which features have most explanatory power** \
I did not find this, but my approach would be:

1) Use X_train and y_train to set up an econometric model using the regressors and dummy variables \
2) Estimate the model with a probit model for this binary classification \
3) Evaluate the regressors model coeffficients, the *Betas* (and if they are significant at some level) \
4) List the most significant coefficients (which have positive and which have negative effect)

In [1]:
import os
from os.path import join as pjoin
import pandas as pd
import numpy as np
from datetime import datetime, date
import matplotlib.pyplot as plt

In [2]:
DATA_FOLDER = pjoin(os.getcwd(), "Data")
files = os.listdir(DATA_FOLDER)
assert len(files) == 4, "Too many files in here"

### Investigating, cleaning and merging data

In [3]:
data_train_explanatory = pd.read_csv(pjoin(DATA_FOLDER, "data_train_fin.csv"))
data_test_explanatory = pd.read_csv(pjoin(DATA_FOLDER, "data_test_fin.csv"))
missing_report = pd.read_csv(pjoin(DATA_FOLDER, "missing_report.csv"))
shift_report = pd.read_csv(pjoin(DATA_FOLDER, "shift_report.csv"))

print(data_test_explanatory.shape)
print(data_train_explanatory.shape)
print(missing_report.shape)                                               
print(shift_report.shape)                                               

(4680, 13)
(10920, 13)
(276, 3)
(13, 6)


In [4]:
data_test_explanatory

Unnamed: 0,full_name,sex,profession,hourly_salary,date_of_birth,weight,height,city_of_origin,last_seen,satisfaction_score,sleep_1,sleep_2,sleep_3
0,Ryan Weasley,m,bartender,11.9,2/15/83,78.0,185.0,Melbourne,2208-02-23,,8.3,7.2,7.0
1,Jason Bradbury,m,developer,14.3,,83.0,162.0,Seattle,2208-02-18,0.672414,6.6,7.7,7.2
2,,f,developer,14.0,5/25/64,84.0,171.0,Brisbane,2208-02-22,0.597701,8.4,8.1,
3,Jose Huxley,m,clerk,16.3,8/6/62,,182.0,Sydney,2208-02-24,0.718391,8.2,5.6,9.8
4,Willie Woolf,m,bookkeeper,16.0,2/17/75,94.0,187.0,Vancouver,2208-02-24,0.718391,8.7,7.9,8.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4675,,f,server,10.3,10/3/87,72.0,195.0,Brisbane,2208-02-24,0.574713,8.1,8.4,8.2
4676,Ann Martin,f,cashier,9.9,9/13/92,68.0,176.0,Melbourne,2208-02-16,0.465517,8.4,7.9,6.2
4677,Raymond Laurie,m,trader,15.0,9/12/96,80.0,190.0,Toronto,2208-02-16,0.781609,8.9,8.6,8.3
4678,Donald Jackson,m,plant worker,10.1,10/13/84,87.0,177.0,Brisbane,2208-02-15,0.545977,6.7,7.2,7.7


In [5]:
data_train_explanatory

Unnamed: 0,full_name,sex,profession,hourly_salary,date_of_birth,weight,height,city_of_origin,last_seen,satisfaction_score,sleep_1,sleep_2,sleep_3
0,Emily Holmes,f,laborer,14.0,10/2/73,84.0,175.0,Manchester,2208-02-22,0.706897,7.0,8.8,11.3
1,Vincent Harris,m,psychologist,,1/11/72,72.0,171.0,Seattle,2208-02-24,0.701149,7.9,7.6,9.0
2,Matthew Wilson,m,electrician,14.0,7/6/71,66.0,179.0,Toronto,2208-02-22,0.580460,8.5,6.2,8.0
3,,m,police officer,14.0,11/2/66,82.0,177.0,New York,2208-02-22,0.666667,9.4,8.1,8.5
4,Marilyn Rodriguez,f,clerk,16.2,1/12/65,94.0,185.0,,2208-02-24,0.729885,8.9,,8.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10915,Larry White,m,cook,12.0,5/18/96,75.0,170.0,London,2208-02-15,0.724138,7.2,7.9,8.9
10916,Christopher Wilson,m,police officer,13.7,1/9/80,65.0,175.0,Manchester,2208-02-23,0.614943,7.1,8.2,7.2
10917,Scott Laurie,m,server,10.3,4/4/62,66.0,164.0,Vancouver,2208-02-20,0.316092,8.3,6.3,9.3
10918,Jason Morrison,m,cook,11.5,,70.0,177.0,Seattle,2208-02-24,0.574713,7.6,7.2,7.9


In [6]:
# Make new column matching other data set's column
missing_report['full_name'] = missing_report.apply(lambda row: row.first_name + " " + row.last_name, axis=1)
missing_report = missing_report.loc[:, ["missing", "full_name"]]

In [7]:
missing_report

Unnamed: 0,missing,full_name
0,1,Willie Gallager
1,1,Melissa Morrison
2,1,Gregory Miller
3,1,Richard White
4,1,Julia Gilmour
...,...,...
271,1,Kevin White
272,1,Jeremy Gallager
273,1,Abigail Forster
274,1,Roy Burgess


In [8]:
# Merge Explanatory variables with dependent variable "missing"
train_set = data_train_explanatory.merge(right=missing_report, how='left', left_on="full_name", right_on='full_name')
print("No. of depedent variables inserted:", train_set.missing.count())
print("No. of depedent variables in total:", len(missing_report.dropna()))

No. of depedent variables inserted: 260
No. of depedent variables in total: 276


In [9]:
# Merge Explanatory variables with dependent variable "missing"
test_set = data_test_explanatory.merge(right=missing_report, how='left', left_on="full_name", right_on='full_name')
print("No. of depedent variables inserted:", test_set.missing.count())
print("No. of depedent variables in total:", len(missing_report.dropna()))

No. of depedent variables inserted: 0
No. of depedent variables in total: 276


In [10]:
# Make missing a binary classication column
train_set.missing = train_set.missing.replace(np.nan, 0)
test_set.missing = test_set.missing.replace(np.nan, 0)
test_set

Unnamed: 0,full_name,sex,profession,hourly_salary,date_of_birth,weight,height,city_of_origin,last_seen,satisfaction_score,sleep_1,sleep_2,sleep_3,missing
0,Ryan Weasley,m,bartender,11.9,2/15/83,78.0,185.0,Melbourne,2208-02-23,,8.3,7.2,7.0,0.0
1,Jason Bradbury,m,developer,14.3,,83.0,162.0,Seattle,2208-02-18,0.672414,6.6,7.7,7.2,0.0
2,,f,developer,14.0,5/25/64,84.0,171.0,Brisbane,2208-02-22,0.597701,8.4,8.1,,0.0
3,Jose Huxley,m,clerk,16.3,8/6/62,,182.0,Sydney,2208-02-24,0.718391,8.2,5.6,9.8,0.0
4,Willie Woolf,m,bookkeeper,16.0,2/17/75,94.0,187.0,Vancouver,2208-02-24,0.718391,8.7,7.9,8.1,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4675,,f,server,10.3,10/3/87,72.0,195.0,Brisbane,2208-02-24,0.574713,8.1,8.4,8.2,0.0
4676,Ann Martin,f,cashier,9.9,9/13/92,68.0,176.0,Melbourne,2208-02-16,0.465517,8.4,7.9,6.2,0.0
4677,Raymond Laurie,m,trader,15.0,9/12/96,80.0,190.0,Toronto,2208-02-16,0.781609,8.9,8.6,8.3,0.0
4678,Donald Jackson,m,plant worker,10.1,10/13/84,87.0,177.0,Brisbane,2208-02-15,0.545977,6.7,7.2,7.7,0.0


In [11]:
shift_report = shift_report.set_index("date")
shift_report

Unnamed: 0_level_0,hairdresser,counselor,psychologist,doctor,dentist
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2208-02-23,Christian Lee,Brandon O'Hara,Zachary Dickens,Patrick Lennon,Katherine Heller
2208-02-22,Isabella Huxley,Paul Waters,Benjamin Roth,Anthony Vonnegut,Laura Mikkelsen
2208-02-24,Jason Hernandez,Joseph Watson,Zachary Dickens,Anthony Vonnegut,Philip Gonzalez
2208-02-18,Isabella McConaughey,Jordan Garcia,Janet Davis,Anthony Vonnegut,Steven Lewis
2208-02-19,Beverly Hernandez,Michael Lee,Frank Jagger,William Walker,Steven Lewis
2208-02-17,Beverly Hernandez,Kathryn Gallager,Zachary Dickens,William Walker,Joan Gilmour
2208-02-12,Frances Gallager,Jordan Garcia,Kelly Orwell,William Walker,Gerald Roth
2208-02-13,Sophia Hall,Kelly Granger,Donna Smith,William Walker,Laura Mikkelsen
2208-02-14,Christian Lee,Abigail Martinez,John Morrison,Anthony Vonnegut,Gerald Roth
2208-02-20,Isabella McConaughey,Brandon O'Hara,Brandon Lawrence,Anthony Vonnegut,Jennifer Thompson


In [12]:
def on_shift(name, date):
    return 1 if name in shift_report.loc[date, :].values else 0

In [13]:
on_shift("Christian Lee", "2208-02-23")

1

In [14]:
train_set.shape[1] == test_set.shape[1]

True

### Missing Data:
An approach: 

- remove all records having a missing data point
- Another approch: replace all missing data points with mean of that column

Since first approach removes half the data, I *would* go for the other \
Due to lack of time, I went for the second

In [15]:
train_set.shape[1] == test_set.shape[1]

True

In [16]:
train_set_no_na = train_set.dropna()
test_set_no_na = test_set.dropna()

In [17]:
train_set_no_na['on_shift_last_day_seen'] = train_set_no_na.apply(lambda row: on_shift(row['full_name'], row['last_seen']) , axis=1)
test_set_no_na['on_shift_last_day_seen'] = test_set_no_na.apply(lambda row: on_shift(row['full_name'], row['last_seen']) , axis=1)
train_set_no_na['on_shift_last_day_seen'].sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_set_no_na['on_shift_last_day_seen'] = train_set_no_na.apply(lambda row: on_shift(row['full_name'], row['last_seen']) , axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_set_no_na['on_shift_last_day_seen'] = test_set_no_na.apply(lambda row: on_shift(row['full_name'], row['last_seen']) , axis=1)


5

In [18]:
train_set_no_na.last_seen = train_set_no_na.last_seen.apply(lambda x: datetime.strptime(str(x), "%Y-%m-%d")) 
test_set_no_na.last_seen = test_set_no_na.last_seen.apply(lambda x: datetime.strptime(str(x), "%Y-%m-%d")) 
# train_set_no_na.last_seen
test_set_no_na.last_seen

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


4      2208-02-24
6      2208-02-17
8      2208-02-20
11     2208-02-16
17     2208-02-24
          ...    
4674   2208-02-19
4676   2208-02-16
4677   2208-02-16
4678   2208-02-15
4679   2208-02-24
Name: last_seen, Length: 2372, dtype: datetime64[ns]

In [19]:
today = datetime(2208, 2, 24)

In [20]:
train_set_no_na.shape[1] == test_set_no_na.shape[1]

True

In [21]:
train_set_no_na['last_seen_in_days'] = train_set_no_na.apply(lambda row: (today - row.last_seen).days, axis=1)
test_set_no_na['last_seen_in_days'] = test_set_no_na.apply(lambda row: (today - row.last_seen).days, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_set_no_na['last_seen_in_days'] = train_set_no_na.apply(lambda row: (today - row.last_seen).days, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_set_no_na['last_seen_in_days'] = test_set_no_na.apply(lambda row: (today - row.last_seen).days, axis=1)


In [22]:
train_set_no_na.shape[1] == test_set_no_na.shape[1]

True

In [23]:
train_set_no_na['male'] = train_set_no_na.apply(lambda row: 1 if row.sex == "m" else 0, axis=1)
test_set_no_na['male'] = test_set_no_na.apply(lambda row: 1 if row.sex == "m" else 0, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_set_no_na['male'] = train_set_no_na.apply(lambda row: 1 if row.sex == "m" else 0, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_set_no_na['male'] = test_set_no_na.apply(lambda row: 1 if row.sex == "m" else 0, axis=1)


In [24]:
train_set_no_na.shape[1] == test_set_no_na.shape[1]

True

In [25]:
# train_set_no_na.profession.nunique()
test_set_no_na.profession.nunique()

20

In [26]:
# train_set_no_na.city_of_origin.nunique()
# test_set_no_na

length = len(train_set)

length = len(train_set_no_na)

cols = list(train_set.columns)
values = []

for column in list(train_set.columns):
    values.append(length - train_set_no_na[column].count())
    print(f"Missing values for column {column}: " + str(length - train_set_no_na[column].count()))
    
plt.bar(cols, values)

### One Hot Encoding
To ensure the model to consume the data set I apply OHC formatting the categorical columns.

In [27]:
profession_dummies_train = pd.get_dummies(train_set_no_na.profession)
profession_dummies_test = pd.get_dummies(test_set_no_na.profession)
profession_dummies_train

Unnamed: 0,bartender,bookkeeper,carpenter,cashier,clerk,cook,counselor,dentist,developer,doctor,...,gardener,hairdresser,janitor,laborer,manager,plant worker,police officer,psychologist,server,trader
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
10,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10913,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
10914,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10915,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10916,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [28]:
list(profession_dummies_train.columns) == list(profession_dummies_test.columns)

False

In [29]:
"doctor" in list(profession_dummies_train.columns)

True

In [30]:
"doctor" in list(profession_dummies_test.columns)

False

In [31]:
profession_dummies_test['doctor'] = 0
profession_dummies_test

Unnamed: 0,bartender,bookkeeper,carpenter,cashier,clerk,cook,counselor,dentist,developer,electrician,...,hairdresser,janitor,laborer,manager,plant worker,police officer,psychologist,server,trader,doctor
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
11,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
17,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4674,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4676,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4677,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4678,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [32]:
train_set_no_na = pd.concat([train_set_no_na, profession_dummies_train], axis=1)
test_set_no_na = pd.concat([test_set_no_na, profession_dummies_test], axis=1)

In [33]:
train_set_no_na.shape[1] == test_set_no_na.shape[1]

True

In [34]:
origin_dummies_train = pd.get_dummies(train_set_no_na.city_of_origin)
origin_dummies_test = pd.get_dummies(test_set_no_na.city_of_origin)

In [35]:
list(origin_dummies_train.columns) == list(origin_dummies_test.columns)

True

In [36]:
train_set_no_na = pd.concat([train_set_no_na, origin_dummies_train], axis=1)
test_set_no_na = pd.concat([test_set_no_na, origin_dummies_test], axis=1)
train_set_no_na

Unnamed: 0,full_name,sex,profession,hourly_salary,date_of_birth,weight,height,city_of_origin,last_seen,satisfaction_score,...,London,Los Angeles,Manchester,Melbourne,Montreal,New York,Seattle,Sydney,Toronto,Vancouver
0,Emily Holmes,f,laborer,14.0,10/2/73,84.0,175.0,Manchester,2208-02-22,0.706897,...,0,0,1,0,0,0,0,0,0,0
2,Matthew Wilson,m,electrician,14.0,7/6/71,66.0,179.0,Toronto,2208-02-22,0.580460,...,0,0,0,0,0,0,0,0,1,0
6,Ashley White,f,developer,13.5,12/20/90,93.0,195.0,Vancouver,2208-02-19,0.718391,...,0,0,0,0,0,0,0,0,0,1
7,Carl Griffin,m,plant worker,9.5,10/8/69,78.0,195.0,Birmingham,2208-02-22,0.425287,...,0,0,0,0,0,0,0,0,0,0
10,Sean Perez,m,bookkeeper,16.3,12/15/76,88.0,163.0,Seattle,2208-02-17,0.798851,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10913,Keith Smith,m,laborer,14.0,12/3/82,70.0,190.0,Brisbane,2208-02-12,0.701149,...,0,0,0,0,0,0,0,0,0,0
10914,Sarah Perez,f,carpenter,14.3,1/27/81,67.0,177.0,Manchester,2208-02-12,0.770115,...,0,0,1,0,0,0,0,0,0,0
10915,Larry White,m,cook,12.0,5/18/96,75.0,170.0,London,2208-02-15,0.724138,...,1,0,0,0,0,0,0,0,0,0
10916,Christopher Wilson,m,police officer,13.7,1/9/80,65.0,175.0,Manchester,2208-02-23,0.614943,...,0,0,1,0,0,0,0,0,0,0


In [37]:
train_set_no_na.shape[1] == test_set_no_na.shape[1]

True

In [38]:
columns_not_to_use_anymore = ['full_name', 'date_of_birth', 'sex', 'profession', 'city_of_origin', 'last_seen']
estimation_set_train = train_set_no_na.drop(columns_not_to_use_anymore, axis=1)
estimation_set_test = test_set_no_na.drop(columns_not_to_use_anymore, axis=1)

In [39]:
estimation_set_train.shape[1]

44

In [40]:
estimation_set_test.shape[1]

44

In [41]:
estimation_set_train

Unnamed: 0,hourly_salary,weight,height,satisfaction_score,sleep_1,sleep_2,sleep_3,missing,on_shift_last_day_seen,last_seen_in_days,...,London,Los Angeles,Manchester,Melbourne,Montreal,New York,Seattle,Sydney,Toronto,Vancouver
0,14.0,84.0,175.0,0.706897,7.0,8.8,11.3,0.0,0,2,...,0,0,1,0,0,0,0,0,0,0
2,14.0,66.0,179.0,0.580460,8.5,6.2,8.0,0.0,0,2,...,0,0,0,0,0,0,0,0,1,0
6,13.5,93.0,195.0,0.718391,7.6,8.2,7.3,0.0,0,5,...,0,0,0,0,0,0,0,0,0,1
7,9.5,78.0,195.0,0.425287,9.3,6.8,9.2,0.0,0,2,...,0,0,0,0,0,0,0,0,0,0
10,16.3,88.0,163.0,0.798851,8.2,8.0,5.5,0.0,0,7,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10913,14.0,70.0,190.0,0.701149,9.0,7.7,10.5,0.0,0,12,...,0,0,0,0,0,0,0,0,0,0
10914,14.3,67.0,177.0,0.770115,7.4,9.6,7.3,0.0,0,12,...,0,0,1,0,0,0,0,0,0,0
10915,12.0,75.0,170.0,0.724138,7.2,7.9,8.9,0.0,0,9,...,1,0,0,0,0,0,0,0,0,0
10916,13.7,65.0,175.0,0.614943,7.1,8.2,7.2,0.0,0,1,...,0,0,1,0,0,0,0,0,0,0


In [42]:
estimation_set_test

Unnamed: 0,hourly_salary,weight,height,satisfaction_score,sleep_1,sleep_2,sleep_3,missing,on_shift_last_day_seen,last_seen_in_days,...,London,Los Angeles,Manchester,Melbourne,Montreal,New York,Seattle,Sydney,Toronto,Vancouver
4,16.0,94.0,187.0,0.718391,8.7,7.9,8.1,0.0,0,0,...,0,0,0,0,0,0,0,0,0,1
6,12.5,70.0,169.0,0.620690,8.8,8.3,7.9,0.0,0,7,...,0,0,0,0,0,0,0,1,0,0
8,13.9,79.0,181.0,0.614943,8.1,8.7,9.6,0.0,0,4,...,0,0,0,0,0,0,0,0,0,1
11,16.0,90.0,178.0,0.850575,8.2,7.8,9.4,0.0,0,8,...,0,0,1,0,0,0,0,0,0,0
17,15.3,81.0,174.0,0.660920,8.5,8.5,9.0,0.0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4674,14.2,75.0,173.0,0.597701,6.2,7.5,8.2,0.0,0,5,...,0,0,0,0,0,0,0,0,0,0
4676,9.9,68.0,176.0,0.465517,8.4,7.9,6.2,0.0,0,8,...,0,0,0,1,0,0,0,0,0,0
4677,15.0,80.0,190.0,0.781609,8.9,8.6,8.3,0.0,0,8,...,0,0,0,0,0,0,0,0,1,0
4678,10.1,87.0,177.0,0.545977,6.7,7.2,7.7,0.0,0,9,...,0,0,0,0,0,0,0,0,0,0


In [43]:
# NOTE TO SELF:
# Date of birth should be included
# If you were at shift the day you were last seen

In [44]:
y_train = estimation_set_train.missing
y_train = y_train.reset_index().drop("index", axis=1)
y_train

Unnamed: 0,missing
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0
...,...
5566,0.0
5567,0.0
5568,0.0
5569,0.0


In [45]:
X_train = estimation_set_train.drop("missing", axis=1)
X_train = X_train.reset_index().drop("index", axis=1)
X_train

Unnamed: 0,hourly_salary,weight,height,satisfaction_score,sleep_1,sleep_2,sleep_3,on_shift_last_day_seen,last_seen_in_days,male,...,London,Los Angeles,Manchester,Melbourne,Montreal,New York,Seattle,Sydney,Toronto,Vancouver
0,14.0,84.0,175.0,0.706897,7.0,8.8,11.3,0,2,0,...,0,0,1,0,0,0,0,0,0,0
1,14.0,66.0,179.0,0.580460,8.5,6.2,8.0,0,2,1,...,0,0,0,0,0,0,0,0,1,0
2,13.5,93.0,195.0,0.718391,7.6,8.2,7.3,0,5,0,...,0,0,0,0,0,0,0,0,0,1
3,9.5,78.0,195.0,0.425287,9.3,6.8,9.2,0,2,1,...,0,0,0,0,0,0,0,0,0,0
4,16.3,88.0,163.0,0.798851,8.2,8.0,5.5,0,7,1,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5566,14.0,70.0,190.0,0.701149,9.0,7.7,10.5,0,12,1,...,0,0,0,0,0,0,0,0,0,0
5567,14.3,67.0,177.0,0.770115,7.4,9.6,7.3,0,12,0,...,0,0,1,0,0,0,0,0,0,0
5568,12.0,75.0,170.0,0.724138,7.2,7.9,8.9,0,9,1,...,1,0,0,0,0,0,0,0,0,0
5569,13.7,65.0,175.0,0.614943,7.1,8.2,7.2,0,1,1,...,0,0,1,0,0,0,0,0,0,0


In [46]:
pd.DataFrame(X_train).to_excel("X_train.xlsx")

In [47]:
y_test = estimation_set_test.missing
y_test = y_test.reset_index().drop("index", axis=1)
y_test

Unnamed: 0,missing
0,0.0
1,0.0
2,0.0
3,0.0
4,0.0
...,...
2367,0.0
2368,0.0
2369,0.0
2370,0.0


In [48]:
X_test = estimation_set_test.drop("missing", axis=1)
X_test = X_test.reset_index().drop("index", axis=1)
X_test

Unnamed: 0,hourly_salary,weight,height,satisfaction_score,sleep_1,sleep_2,sleep_3,on_shift_last_day_seen,last_seen_in_days,male,...,London,Los Angeles,Manchester,Melbourne,Montreal,New York,Seattle,Sydney,Toronto,Vancouver
0,16.0,94.0,187.0,0.718391,8.7,7.9,8.1,0,0,1,...,0,0,0,0,0,0,0,0,0,1
1,12.5,70.0,169.0,0.620690,8.8,8.3,7.9,0,7,0,...,0,0,0,0,0,0,0,1,0,0
2,13.9,79.0,181.0,0.614943,8.1,8.7,9.6,0,4,0,...,0,0,0,0,0,0,0,0,0,1
3,16.0,90.0,178.0,0.850575,8.2,7.8,9.4,0,8,0,...,0,0,1,0,0,0,0,0,0,0
4,15.3,81.0,174.0,0.660920,8.5,8.5,9.0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2367,14.2,75.0,173.0,0.597701,6.2,7.5,8.2,0,5,0,...,0,0,0,0,0,0,0,0,0,0
2368,9.9,68.0,176.0,0.465517,8.4,7.9,6.2,0,8,0,...,0,0,0,1,0,0,0,0,0,0
2369,15.0,80.0,190.0,0.781609,8.9,8.6,8.3,0,8,1,...,0,0,0,0,0,0,0,0,1,0
2370,10.1,87.0,177.0,0.545977,6.7,7.2,7.7,0,9,1,...,0,0,0,0,0,0,0,0,0,0


In [49]:
from pip._internal import main

def import_or_install(package):
    try:
        __import__(package)
    except ImportError:
        main(['install', package])
        __import__(package)

In [50]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler

In [51]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

In [52]:
import_or_install("tensorflow")
import_or_install("keras")
from keras.models import Sequential
from keras.layers import Dense

In [53]:
input_dim = X_train.shape[1]
input_dim

43

### Setting up ANN

**Why is used Keras**

I have used Keras due to its easy high-level API. \
But I would have implemented **my own ANN** if I had the time. Here, I would have done the chain rule derivations myself, specifying the loss and activation functions explicitly and done the batch Gradient Descent myself. 

**The implementation**

Here I am using Binary Cross Entropy Loss equivalent to Log Loss. \
This is the loss function for binary response using Bernoulli Distribution.
I apply Stochastic Gradient Descent Algorithm to tune the weights (batches of 10)

This ANN is as follows:

- Input Layer 
- 6 Neurons w. ReLU acitivation
- 6 Neurons w. ReLU activation
- 1 Neuron w. Sigmoid (Logistic) activation

**Note**: The number of layers, neurons and epochs are more or less chosen arbritrarily. \
ReLU is chosen in hidden layers as this emprically converges (to optimal weight) faster than Sigmoid and Tanh

*Improvements* \
Adding normalization/regularization constant to weight matrices (\lambda) to ensure right-size weights. \
E.g. this could be done by using LASSO for penalizing high weight coefficients. \
Or deleting weights equal to 0 (memory efficiency)

**Other ML approaches**

Other tools for classification:
- Gaussian Discriminant Analysis (GDA)
- EM (Expectation Maximization) estimation
- Kernel functions (for higher dimensionality regression)
- Maybe: Naive Bayes

In [54]:
classifier = Sequential()
classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu', input_dim=input_dim))
classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))
classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
classifier.fit(x=X_train, y=y_train, batch_size=10, epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x1e6b6e493a0>

In [55]:
y_pred = classifier.predict(X_test)
# y_pred = (y_pred > 0.5)     # Like rounding: getting True / False

In [56]:
probabilities = np.sort(y_pred, axis=0)[::-1]
probabilities

array([[0.6112688],
       [0.6112688],
       [0.6112688],
       ...,
       [0.       ],
       [0.       ],
       [0.       ]], dtype=float32)

In [57]:
### Making the Confusion Matrix
# from sklearn.metrics import confusion_matrix
# cm = confusion_matrix(y_test, y_pred)

In [58]:
y_pred = (y_pred > 0.5)     # Like rounding: getting True / False
prediction_correct = y_pred == y_test
accuracy = prediction_correct.mean().values[0]
print("ACCURACY:", accuracy)

ACCURACY: 0.8975548060708263


In [59]:
# POPULATE THIS (with 43 parameters)
# characteristic_vector = []
# new_prediction = classifier.predict(sc.transform(np.array([characteristic_vector])))

### COMMENTS

**Comments regarding data** \
In my data, it looks like there are none missing in the test data

The following features are not included:

- Date of birth: lack of time
- Full name: no explanatory power
- Zip code: I did not really understand it - where do I find it?

The feature **On a shift** is not included due to lack of time \
However, my approach would be:
Insert a column on X_test indicating if **on shift last day seen** and use this as regressor


**Which features have most explanatory power** \
I did not find this, but my approach would be:

1) Use X_train and y_train to set up an econometric model using the regressors and dummy variables \
2) Estimate the model with a probit model for this binary classification \
3) Evaluate the regressors model coeffficients, the *Betas* (and if they are significant at some level) \
4) List the most significant coefficients (which have positive and which have negative effect)