# <center>  Employee Access Prediction </center>

# ECorp
## Company Introduction
Your client for this project is a multinational technology company.

ECorp is an American multinational technology company that focuses on e-commerce, cloud computing, digital streaming, and artificial intelligence.
They are spread across the globe with hundreds of thousands of employees in these domains.
When an employee at any company starts working, they first need to obtain the resource access necessary to fulfill their role.
Employees passing certain criteria regarding their primary and secondary roles are granted access to the requested resources.
This efficient system has helped the company maintain background checks of its employees and usage of allocated resources.

## Current Scenario
A group of employees has been invited to test the automated system. An auto-access model seeks to minimize the human involvement required to grant or revoke employee access.




# Problem Statement
The current process suffers from the following problems:

- If an employee discovers that they need access to certain resources, they are supposed to contact a knowledgeable supervisor.
- The supervisor takes time to manually grant the needed access to the requesting employee.
- As employees move throughout a company, this access discovery/recovery cycle wastes a non-trivial amount of time and money.

The company has hired you as a data science consultant. They want to automate the process of approving or revoking an employee's access to a resource according to that employee's role in the company.

## Your Role
You are given a dataset containing the ACTION (ground truth), RESOURCE, and information about the employee's role at the time of approval.
The model will take an employee's role information and the requested resource in the form of a resource code and will determine if an employee should be given access or not.
Your task is to build a binary classification model using the dataset.
Because the company had no machine learning model for this problem, you don't have a quantifiable win condition. You need to build the best possible model.

## Project Deliverables
Deliverable: Employee Access Classification.<br>
Machine Learning Task: Classification<br>
Target Variable: <b>ACTION</b>

## Evaluation Metric
The model evaluation will be based on the <b>Accuracy</b> Score.

# Data Description

<table>
<tr>	<th>	Column Name	</th>	<th>	Description	</th>	</tr>
<tr>	<td>	RESOURCE	</td>	<td>	An ID for each resource.	</td>	</tr>
<tr>	<td>	MGR_ID	</td>	<td>	The EMPLOYEE ID of the manager of the current EMPLOYEE ID record; an employee may have only one manager at a time.	</td>	</tr>
<tr>	<td>	ROLE_ROLLUP_1	</td>	<td>	Company role grouping category id 1 (e.g. US Engineering).	</td>	</tr>
<tr>	<td>	ROLE_ROLLUP_2	</td>	<td>	Company role grouping category id 2 (e.g. US Retail).	</td>	</tr>
<tr>	<td>	ROLE_DEPTNAME	</td>	<td>	Company role department description (e.g. Retail).	</td>	</tr>
<tr>	<td>	ROLE_TITLE	</td>	<td>	Company role business title description (e.g. Senior Engineering Retail Manager).	</td>	</tr>
<tr>	<td>	ROLE_FAMILY_DESC	</td>	<td>	Company role family extended description (e.g. Retail Manager, Software Engineering).	</td>	</tr>
<tr>	<td>	ROLE_FAMILY	</td>	<td>	Company role family description (e.g. Retail Manager).	</td>	</tr>
<tr>	<td>	ROLE_CODE	</td>	<td>	Company role code; this code is unique to each role (e.g. Manager).	</td>	</tr>
<tr>	<td>	ID	</td>	<td>	ID of the employee.	</td>	</tr>
<tr>	<td>	ACTION	</td>	<td>	ACTION is 1 if access to the resource was approved, 0 if it was not.	</td>	</tr>
</table>


In [1]:
#Important Libraries Import

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import dtale
from pandas_profiling import ProfileReport

import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV,cross_val_score
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost





In [5]:
train_data = pd.read_csv('ea_train.csv')
train_data.shape

(24576, 11)

In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24576 entries, 0 to 24575
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   RESOURCE          24576 non-null  int64
 1   MGR_ID            24576 non-null  int64
 2   ROLE_ROLLUP_1     24576 non-null  int64
 3   ROLE_ROLLUP_2     24576 non-null  int64
 4   ROLE_DEPTNAME     24576 non-null  int64
 5   ROLE_TITLE        24576 non-null  int64
 6   ROLE_FAMILY_DESC  24576 non-null  int64
 7   ROLE_FAMILY       24576 non-null  int64
 8   ROLE_CODE         24576 non-null  int64
 9   ID                24576 non-null  int64
 10  ACTION            24576 non-null  int64
dtypes: int64(11)
memory usage: 2.1 MB


In [9]:
test_data = pd.read_csv('ea_test.csv')
test_data.shape

(8193, 10)

In [10]:
train_data.isnull().sum().any()

False

In [12]:
d = dtale.show(train_data)
d.open_browser()

In [14]:
sns.pairplot(train_data)

<seaborn.axisgrid.PairGrid at 0x27008482310>

In [17]:
sns.heatmap(train_data.corr(),annot=True,cmap='viridis')

<AxesSubplot:>

In [24]:
train_data['ACTION'].value_counts(normalize=True)

1    0.941895
0    0.058105
Name: ACTION, dtype: float64
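
With about 94% of the rows labelled 1, a model that always predicts 1 already reaches an accuracy of about 0.94, so any accuracy figure needs to be compared against that baseline. The sketch below illustrates the idea with scikit-learn's `DummyClassifier` on synthetic labels; `X_demo` and `y_demo` are made-up stand-ins with a similar class ratio, not the real `train_data`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for ACTION: roughly 94% ones and 6% zeros (assumed, not the real data).
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.94).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
majority_share = max(y_demo.mean(), 1 - y_demo.mean())
print(f"baseline accuracy: {baseline_acc:.4f}")
print(f"majority share:    {majority_share:.4f}")
```

On the real data the same check would use `train_data['ACTION']` in place of `y_demo`.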

In [28]:
train_data.nunique()

RESOURCE             6469
MGR_ID               3996
ROLE_ROLLUP_1         123
ROLE_ROLLUP_2         168
ROLE_DEPTNAME         440
ROLE_TITLE            331
ROLE_FAMILY_DESC     2183
ROLE_FAMILY            64
ROLE_CODE             331
ID                  24576
ACTION                  2
dtype: int64
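
The feature columns counted above are integer codes for categories (resources, managers, roles), not measured quantities, so their numeric order carries no obvious meaning. One common alternative to feeding such codes to a model as numbers is one-hot encoding. The following is only a sketch on a hypothetical toy frame; the code values are invented for illustration and the approach is not what the rest of this notebook does.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical integer-coded categories mimicking two of the columns (values are made up).
toy_train = pd.DataFrame({
    "ROLE_FAMILY": [290919, 308574, 290919],
    "ROLE_CODE": [117908, 118539, 117908],
})
toy_new = pd.DataFrame({
    "ROLE_FAMILY": [290919, 999999],  # 999999 was never seen during fit
    "ROLE_CODE": [118539, 117908],
})

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(toy_train)  # sparse matrix
new_encoded = encoder.transform(toy_new)

print(train_encoded.shape, new_encoded.shape)
print(new_encoded.toarray())
```

With thousands of distinct values per column, the sparse output matters: the encoded matrix is wide but mostly zeros.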

In [126]:
train_data1=train_data.copy()

In [127]:
X=train_data1.drop(['ID','ACTION','RESOURCE'],axis=1)
y=train_data1['ACTION']
print(X.shape,y.shape)

(24576, 8) (24576,)


In [128]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)
print(X_train.shape,y_train.shape,X_test.shape,y_test.shape)

(19660, 8) (19660,) (4916, 8) (4916,)
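
Passing `stratify=y` asks `train_test_split` to keep the class proportions the same in both parts, which matters when one class is as rare as it is here. A small sketch on synthetic labels (the `*_demo` arrays are assumptions, not the notebook's data) shows how this can be checked:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels and random features, for illustration only.
rng = np.random.default_rng(0)
y_demo = (rng.random(5_000) < 0.94).astype(int)
X_demo = rng.normal(size=(5_000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The share of class 1 should be almost identical in all three.
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```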


In [129]:
SS = StandardScaler()

In [130]:
X_train[X_train.columns] = SS.fit_transform(X_train)
X_train.head()

          MGR_ID  ROLE_ROLLUP_1  ROLE_ROLLUP_2  ROLE_DEPTNAME  ROLE_TITLE  ROLE_FAMILY_DESC  ROLE_FAMILY  ROLE_CODE
10948  -0.017882       0.093435        0.01773       0.148835    -0.24328         -0.747518     1.067351  -0.256185
 1352  -0.321744       0.112446      -0.018656      -0.018206   -0.255491         -0.366488     1.086011  -0.322304
11701  -0.580922       0.093435        0.01773       0.033928   -0.226576          -0.51903    -0.648361  -0.165315
14343   0.128293      -2.346923      -0.048943       0.029062   -0.069823         -0.672209    -0.627533   0.685752
 4659  -0.610424       0.093435        0.01773       0.163005    -0.24328          1.034586     1.067351  -0.256185


In [131]:
X_test[X_test.columns] = SS.transform(X_test)
X_test.head()

          MGR_ID  ROLE_ROLLUP_1  ROLE_ROLLUP_2  ROLE_DEPTNAME  ROLE_TITLE  ROLE_FAMILY_DESC  ROLE_FAMILY  ROLE_CODE
24173   0.175785       0.093435       0.014365       -0.00163    -0.24328          1.034586     1.067351  -0.256185
13381  -0.583591       0.123505       0.006793       0.028046   -0.244192         -0.724596    -0.649644  -0.260781
 9712  -0.691089       0.093435       0.014365       0.140387   -0.232861          1.981782    -0.646183  -0.199258
17230  -0.693063       0.093435       0.008686       0.072106   -0.171807         -0.672209    -0.627533   0.132223
14350   0.064673        0.12579        0.04297       0.220592    -0.24328         -0.569231     1.067351  -0.256185
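
The scaler above is fitted once on `X_train` and then reused on `X_test`, which keeps test information out of the fit. When cross-validating, the same separation can be kept per fold by wrapping the scaler and the model in a scikit-learn `Pipeline`, so the scaler is refitted on each training fold. The sketch below assumes synthetic data (`X_demo`, `y_demo`) and is one possible arrangement, not the notebook's own procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled features; the label depends noisily on the first column.
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(600, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=10.0, size=600) > 100.0).astype(int)

# The pipeline refits StandardScaler inside every cross-validation fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores, scores.mean())
```

The fitted pipeline can also be applied to new data with a single `predict` call, which removes the need to remember a separate `transform` step.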


In [132]:
# Model1 - Logistic Regression
LR = LogisticRegression()
LR.fit(X_train,y_train)
LR_Train = LR.predict(X_train)
LR_Test = LR.predict(X_test)
print(accuracy_score(y_train,LR_Train))
print(accuracy_score(y_test,LR_Test))
cv = cross_val_score(estimator=LogisticRegression(),cv=10,X=X_train,y=y_train)
print(cv)
print(np.mean(cv))

0.9419125127161749
0.9418226200162734
[0.9415056  0.9415056  0.94201424 0.94201424 0.94201424 0.94201424
 0.94201424 0.94201424 0.94201424 0.94201424]
0.941912512716175
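
The train, test, and per-fold accuracies above all sit at the share of class 1 in the data, which is consistent with a model that predicts 1 for nearly every row. Accuracy alone cannot reveal that; a confusion matrix or balanced accuracy can. A minimal sketch with hypothetical labels (94 ones, 6 zeros) and an always-1 prediction:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Hypothetical ground truth and a prediction vector that is always 1.
y_true = np.array([1] * 94 + [0] * 6)
y_pred = np.ones_like(y_true)

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(cm)
print(acc, bal_acc)
```

Here accuracy is 0.94 while balanced accuracy is 0.5, because none of the class-0 rows are recognised. Running the same calls on `y_test` and `LR_Test` would show whether the logistic regression behaves this way.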


In [142]:
# Model 2 - RandomForestClassifier
# Grid search over the number of trees
GSV1 = GridSearchCV(estimator=RandomForestClassifier(class_weight='balanced',random_state=42),cv=10,param_grid=dict(n_estimators=np.arange(1,50,5)))
GSV1.fit(X_train,y_train)
print(GSV1.best_params_)
print(GSV1.best_score_)

# Fit the forest that is later used for the final predictions
RF = RandomForestClassifier(random_state=42,class_weight='balanced',n_estimators=23)
RF.fit(X_train,y_train)
RF_Train = RF.predict(X_train)
RF_Test = RF.predict(X_test)
print(accuracy_score(y_train,RF_Train))
print(accuracy_score(y_test,RF_Test))
#cv1 = cross_val_score(estimator=RandomForestClassifier(),cv=10,X=X_train,y=y_train)
#print(cv1)
#print(np.mean(cv1))

{'n_estimators': 21}
0.924618514750763

In [71]:
from xgboost import XGBClassifier,XGBRFClassifier

In [72]:
# Model 3 - XGBRFClassifier (XGBoost random forest)
# Finer random-forest grid search, kept for reference:
# GSV1 = GridSearchCV(estimator=RandomForestClassifier(class_weight='balanced'),cv=10,param_grid=dict(n_estimators=np.arange(20,30,1)))
# GSV1.fit(X_train,y_train)
# print(GSV1.best_params_)
# print(GSV1.best_score_)

# XGBoost has no class_weight parameter, so none is passed
XG = XGBRFClassifier(random_state=42,n_estimators=23)
XG.fit(X_train,y_train)
XG_Train = XG.predict(X_train)
XG_Test = XG.predict(X_test)
print(accuracy_score(y_train,XG_Train))
print(accuracy_score(y_test,XG_Test))
#cv1 = cross_val_score(estimator=RandomForestClassifier(),cv=10,X=X_train,y=y_train)
#print(cv1)
#print(np.mean(cv1))


0.943031536113937
0.9420260374288039
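
`class_weight='balanced'` is a scikit-learn option; XGBoost's binary classifiers expose `scale_pos_weight` for class imbalance instead, and a common starting value is the ratio of negative to positive rows. The sketch below is an assumption about how that could be wired up here, not something this notebook ran; `neg_pos_ratio` and `build_weighted_xgbrf` are hypothetical helper names.

```python
import numpy as np

def neg_pos_ratio(y):
    """Return count(y == 0) / count(y == 1), a usual starting point for scale_pos_weight."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    return n_neg / n_pos

def build_weighted_xgbrf(y, n_estimators=23, random_state=42):
    """Hypothetical helper: an XGBRFClassifier whose positive class is reweighted."""
    # Imported lazily so neg_pos_ratio can be used without xgboost installed.
    from xgboost import XGBRFClassifier
    return XGBRFClassifier(
        n_estimators=n_estimators,
        random_state=random_state,
        scale_pos_weight=neg_pos_ratio(y),
    )

# Toy labels with the same kind of imbalance: 94 ones, 6 zeros.
y_demo = np.array([1] * 94 + [0] * 6)
ratio = neg_pos_ratio(y_demo)
print(round(ratio, 4))
```

Because class 1 is the majority here, the ratio is below 1, which down-weights the frequent approvals relative to the rare denials. A model built with `build_weighted_xgbrf(y_train)` would then be fitted and scored like the others.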


In [134]:
test_data1=test_data.copy()

In [135]:
X=test_data1.drop(['ID','RESOURCE'],axis=1)

In [136]:
X[X.columns] = SS.transform(X)

In [137]:
prediction = RF.predict(X)

In [138]:
prediction

array([1, 1, 1, ..., 1, 0, 1], dtype=int64)

In [139]:
FinalOP = pd.DataFrame(test_data['ID'])

In [140]:
FinalOP['Pred'] = prediction

In [141]:
FinalOP.to_csv('submission.csv',header=False,index=False)
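
Because the file is written with `header=False`, it contains no column names, so anything that reads it back has to supply them. The following sketch round-trips a made-up three-row frame through an in-memory buffer to illustrate this; `final_demo` and its values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for FinalOP.
final_demo = pd.DataFrame({"ID": [101, 102, 103], "Pred": [1, 0, 1]})

# Write without header or index, as in the cell above, but to memory instead of disk.
buffer = io.StringIO()
final_demo.to_csv(buffer, header=False, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading back requires header=None plus explicit column names.
round_trip = pd.read_csv(io.StringIO(csv_text), header=None, names=["ID", "Pred"])
print(round_trip.equals(final_demo))
```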