# Predicting whether or not an employee leaves a company after training

## Table of contents
#### Business Understanding
#### Data Understanding
#### Data Preparation
#### Modelling
#### Evaluation
#### Deployment

## Business Understanding

A company active in Big Data and Data Science offers training to data scientists and wishes to hire the trainees who pass its courses. In addition to demographic, education, and experience data, the company collects information on whether each data scientist wishes to leave or stay in their current role.

With this data, the company can direct resources toward candidates who genuinely want to work for it, which reduces the cost and duration of the program. It can also improve the quality of the training and categorize candidates more accurately, which helps in planning the courses.

## Data Understanding

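This is a binary classification problem. In the `target` variable, 0 represents a candidate who is not looking for a job change and 1 represents a candidate who is looking for a job change. The dataset has the following columns: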

`enrollee_id`   
`city`  
`city_development_index`   
`gender`  
`relevent_experience`   
`enrolled_university`  
`education_level`  
`major_discipline`  
`experience`  
`company_size`  
`company_type`  
`last_new_job`  
`training_hours`  
`target`
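
Because the target is binary, it is usually worth checking how the two classes are distributed before modelling. The sketch below uses a small made-up frame (the `toy` name and its values are assumptions for illustration, not the real data) to show how `value_counts` reports both counts and shares.

```python
import pandas as pd

# Toy stand-in for the real `target` column; the values are invented.
toy = pd.DataFrame({"target": [0.0, 0.0, 0.0, 1.0]})

# Absolute number of rows in each class.
class_counts = toy["target"].value_counts()

# Share of each class; normalize=True divides the counts by the row count.
class_share = toy["target"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```

On the real data the same two calls would be made on `data["target"]`.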

In [2]:
# importing libraries and frameworks
import pandas as pd
import numpy as np


In [3]:
# Loading datasets into pandas dataframes
#test_data = pd.read_csv(r"C:\Users\Administrator\Documents\Predicting\aug_test.csv")
train_data = pd.read_csv("aug_train.csv")

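# Work on an independent copy so that train_data stays unchanged.
# Plain assignment (data = train_data) would only create a second name for the same object.
data = train_data.copy()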
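
Plain assignment in pandas binds a second name to the same DataFrame, whereas `DataFrame.copy()` creates an independent one. This matters when one frame is meant to be kept as an untouched backup. A minimal sketch with made-up values (the names `train_toy`, `alias`, and `snapshot` are hypothetical):

```python
import pandas as pd

# Toy frame standing in for the training data; the values are invented.
train_toy = pd.DataFrame({"training_hours": [36, 47, 83]})

alias = train_toy            # same object under a second name
snapshot = train_toy.copy()  # independent copy (deep by default)

# Modify the original frame in place.
train_toy.loc[0, "training_hours"] = 999

print(alias is train_toy)                  # the alias is the very same object
print(alias.loc[0, "training_hours"])      # the alias sees the change
print(snapshot.loc[0, "training_hours"])   # the copy keeps the old value
```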

In [4]:
# The first five rows of the dataset
data.head(5)

| | enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8949 | city_103 | 0.92 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | NaN | NaN | 1 | 36 | 1.0 |
| 1 | 29725 | city_40 | 0.776 | Male | No relevent experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0.0 |
| 2 | 11561 | city_21 | 0.624 | NaN | No relevent experience | Full time course | Graduate | STEM | 5 | NaN | NaN | never | 83 | 0.0 |
| 3 | 33241 | city_115 | 0.789 | NaN | No relevent experience | NaN | Graduate | Business Degree | <1 | NaN | Pvt Ltd | never | 52 | 1.0 |
| 4 | 666 | city_162 | 0.767 | Male | Has relevent experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0.0 |


In [5]:
# Dimensions of the dataframe, i.e. (rows, columns)
data.shape

(19158, 14)

In [None]:
# Number of non-null values and the data type of each column of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

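As shown above, eight of the fourteen columns have missing values: `gender`, `enrolled_university`, `education_level`, `major_discipline`, `experience`, `company_size`, `company_type`, and `last_new_job`. Ten columns are stored as `object`, two as `int64`, and two as `float64`.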
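
The same two facts can also be extracted programmatically instead of being read off the `info()` output. The sketch below is a minimal illustration on a made-up frame (`toy` and its values are assumptions); it lists the columns that contain at least one missing value and the columns that are not numeric.

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the real column names; the values are invented.
toy = pd.DataFrame({
    "enrollee_id": [1, 2, 3],
    "gender": ["Male", np.nan, "Female"],
    "training_hours": [36, 47, 83],
    "company_size": [np.nan, "50-99", np.nan],
})

# Columns that contain at least one missing value.
cols_with_missing = toy.columns[toy.isnull().any()].tolist()

# Columns that are not stored as numbers (text or categorical data).
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(cols_with_missing)
print(text_cols)
```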

## Data Preparation

This section checks the data for duplicate values, missing values, and outliers.

### Duplicates
These are rows that have the same entries across all columns.

In [None]:
# Flag rows that are duplicated across all columns, keeping the last occurrence
#data.duplicated(subset=None, keep='last')

# Sum of duplicates
data.duplicated().sum()

0

There are no duplicates in the dataframe.
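
A full-row check includes the `enrollee_id` column, so two rows with identical features but different IDs are not counted as duplicates. Whether such rows should be treated as duplicates is a judgment call; the sketch below only shows how the `subset` argument of `duplicated` would be used for that check, on a made-up frame (`toy`, `feature_cols`, and the values are assumptions).

```python
import pandas as pd

# Toy frame: rows 0 and 1 differ only in their ID; the values are invented.
toy = pd.DataFrame({
    "enrollee_id": [1, 2, 3],
    "city": ["city_103", "city_103", "city_40"],
    "training_hours": [36, 36, 47],
})

# Duplicates across all columns, ID included.
full_row_dupes = int(toy.duplicated().sum())

# Duplicates when the ID column is ignored.
feature_cols = [c for c in toy.columns if c != "enrollee_id"]
feature_dupes = int(toy.duplicated(subset=feature_cols).sum())

print(full_row_dupes, feature_dupes)
```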

### Missing Values

In [None]:
# Percentage of missing values in each column of the dataframe, highest first
(data.isnull().sum() / len(data) * 100).sort_values(ascending=False)

company_type              32.049274
company_size              30.994885
gender                    23.530640
major_discipline          14.683161
education_level            2.401086
last_new_job               2.207955
enrolled_university        2.014824
experience                 0.339284
enrollee_id                0.000000
city                       0.000000
city_development_index     0.000000
relevent_experience        0.000000
training_hours             0.000000
target                     0.000000
dtype: float64

The `company_type` column has the highest share of missing values at about 32%, followed by `company_size` at about 31% and `gender` at about 23.5%.
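
With shares this large, dropping every incomplete row would discard a lot of data. One possible treatment, shown here only as a sketch and not as the approach taken in this analysis, is to keep the rows and mark gaps in categorical columns with an explicit label. The frame `toy`, the label `"Unknown"`, and the choice of columns are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the real column names; the values are invented.
toy = pd.DataFrame({
    "company_type": ["Pvt Ltd", np.nan, np.nan, "Funded Startup"],
    "gender": ["Male", np.nan, "Female", "Male"],
    "training_hours": [36, 47, 83, 52],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)

# Assumed treatment: fill gaps in categorical columns with a placeholder label.
filled = toy.copy()
for col in ["company_type", "gender"]:
    filled[col] = filled[col].fillna("Unknown")

print(missing_pct)
print(filled)
```

Working on a copy keeps the original frame available in case a different treatment, such as imputing the most frequent value, turns out to suit the model better.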