# Purpose

The goal of this project is to build a dashboard that displays data from a dataset. The data has to be cleaned first, and then it can be displayed. The dataset can be found [here](https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists). I'll include its metadata below for ease of use. Please note that I will only be using the training dataset, since my ultimate goal is not to predict data but to display it in a meaningful way.

# Data Description

A company that is active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. Many people sign up for its training. The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because this helps reduce cost and time, improve the quality of training, and plan the courses and the categorization of candidates. Information on demographics, education, and experience is available from each candidate's signup and enrollment.

This dataset is also designed to help HR research understand the factors that lead a person to leave their current job. Using model(s) built on the current credentials, demographics, and experience data, you will predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision.

The data is divided into train and test sets. The target isn't included in the test set, but a file with the test target values is available for related tasks. A sample submission corresponding to the enrollee_id values of the test set is also provided, with the columns enrollee_id and target.

Note:

* The dataset is imbalanced.
* Most features are categorical (nominal, ordinal, binary), some with high cardinality.
* Imputing missing values can be part of your pipeline as well.

Features:

* enrollee_id: Unique ID for candidate
* city: City code
* city_development_index: Development index of the city (scaled)
* gender: Gender of candidate
* relevent_experience: Relevant experience of candidate
* enrolled_university: Type of university course enrolled in, if any
* education_level: Education level of candidate
* major_discipline: Major discipline of candidate's education
* experience: Candidate's total experience in years
* company_size: Number of employees in current employer's company
* company_type: Type of current employer
* last_new_job: Difference in years between previous job and current job
* training_hours: Training hours completed
* target: 0 – Not looking for job change, 1 – Looking for a job change

# Data Cleaning

In [41]:
import pandas as pd

In [42]:
jobdata = pd.read_csv('aug_train.csv')
jobdata

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


In [43]:
# Checking nans for all columns
jobdata.isna().sum()

enrollee_id                  0
city                         0
city_development_index       0
gender                    4508
relevent_experience          0
enrolled_university        386
education_level            460
major_discipline          2813
experience                  65
company_size              5938
company_type              6140
last_new_job               423
training_hours               0
target                       0
dtype: int64

It looks like enrollee_id, city, city_development_index, relevent_experience, training_hours, and target are complete. The first thing we need to do is deal with the columns that contain NaNs, or they will cause issues later on.
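Raw counts are easier to compare when they sit next to the share of rows they represent. The following is a minimal sketch of that idea, assuming a small made-up frame named `toy` in place of `jobdata`; the values and the `missing` summary table are illustrative only and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for jobdata; the values are made up for illustration.
toy = pd.DataFrame({
    "enrollee_id": [1, 2, 3, 4],
    "gender": ["Male", np.nan, "Female", np.nan],
    "company_size": [np.nan, "50-99", np.nan, np.nan],
})

# Count and share of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share": toy.isna().mean(),
})

# Keep only the columns that actually have NaNs, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("share", ascending=False)
print(missing)
```

Sorting by share makes it easier to decide which columns to handle first.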

In [44]:
# Exploring the gender column
jobdata['gender'].value_counts(dropna=False)

Male      13221
NaN        4508
Female     1238
Other       191
Name: gender, dtype: int64

I'm a little hesitant to drop all the NaN rows for gender, so we'll hold off on that and come back to it in a moment.

In [45]:
# Exploring the enrolled_university column
jobdata['enrolled_university'].value_counts(dropna=False)

no_enrollment       13817
Full time course     3757
Part time course     1198
NaN                   386
Name: enrolled_university, dtype: int64

Most individuals are not currently enrolled in a university course, so we will replace the NaN values with no_enrollment. Enrollment is optional, unlike gender, which every individual has, so it makes sense to replace the NaN values with the most common value.

In [46]:
jobdata['enrolled_university'] = jobdata['enrolled_university'].fillna('no_enrollment')
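The fill above hard-codes the most common value. As a hedged alternative, the value can be looked up from the column itself, so the fill still matches the data if the counts change. This sketch assumes a small made-up Series named `enrolled`; `most_common` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for jobdata['enrolled_university']; values are made up.
enrolled = pd.Series(
    ["no_enrollment", "Full time course", np.nan, "no_enrollment", np.nan],
    name="enrolled_university",
)

# mode() ignores NaN by default and returns a Series, so take its first entry.
most_common = enrolled.mode().iloc[0]

# Replace the missing entries with the most common category.
filled = enrolled.fillna(most_common)
print(filled.value_counts(dropna=False))
```

If two categories tie for most common, `mode()` returns both in sorted order and `.iloc[0]` picks the first, so it is worth looking at the counts before relying on it.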

In [47]:
# Exploring the education_level column
jobdata['education_level'].value_counts(dropna=False)

Graduate          11598
Masters            4361
High School        2017
NaN                 460
Phd                 414
Primary School      308
Name: education_level, dtype: int64

My hunch was that a missing education_level (a NaN value) means no education, and that those rows would have no major_discipline either.

In [48]:
# Fill NaNs with the same placeholder in both columns so that missing values compare as equal
test1 = jobdata['education_level'].fillna(1)

In [49]:
test2 = jobdata['major_discipline'].fillna(1)

In [50]:
# True only where both columns were NaN, since the two columns share no category values
test1.eq(test2).value_counts()

False    18698
True       460
dtype: int64

After some testing it looks like that assumption was correct. The 460 True values match the 460 NaN values in education_level, so every row with a missing education_level is also missing major_discipline. The reverse does not hold, since major_discipline has 2813 NaNs in total. We can therefore replace the NaN values in education_level with "No Education". This makes sense, since if you have no education you cannot have a major_discipline. We also set the corresponding major_discipline values to 'No Major'.
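The placeholder comparison can also be written directly with boolean masks, which avoids relying on the two columns having no category names in common. This is a minimal sketch on a made-up frame named `toy`; the variable names are illustrative and not from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two jobdata columns; values are made up.
toy = pd.DataFrame({
    "education_level": ["Graduate", np.nan, "High School", np.nan, "Masters"],
    "major_discipline": ["STEM", np.nan, np.nan, np.nan, "Arts"],
})

edu_missing = toy["education_level"].isna()
major_missing = toy["major_discipline"].isna()

# Rows where both columns are missing.
both_missing = int((edu_missing & major_missing).sum())

# Does a missing education_level always come with a missing major_discipline?
edu_implies_major = bool(major_missing[edu_missing].all())

# Rows where only major_discipline is missing (the reverse direction).
major_only = int((major_missing & ~edu_missing).sum())

print(both_missing, edu_implies_major, major_only)
```

A nonzero `major_only` is what the later steps deal with: rows that have an education level recorded but no major.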

In [53]:
jobdata['education_level'] = jobdata['education_level'].fillna('No Education')

In [56]:
# find where education_level == 'No Education', and change the corresponding rows in major_discipline to 'No Major'
jobdata.loc[jobdata['education_level'] == 'No Education', 'major_discipline'] = 'No Major'

In [57]:
# Exploring major_discipline
jobdata['major_discipline'].value_counts(dropna=False)

STEM               14492
NaN                 2353
No Major             683
Humanities           669
Other                381
Business Degree      327
Arts                 253
Name: major_discipline, dtype: int64

Using the same logic, we change the major_discipline of rows with a 'High School' or 'Primary School' education_level to 'No Major' as well, since individuals with these education levels have not chosen a major and would have NaNs for major_discipline.
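Before overwriting values, the claim that these education levels lack a major can be checked by computing the share of missing major_discipline values per education level. This is a hedged sketch on a made-up frame named `toy`; `missing_share` is an illustrative name, and the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two jobdata columns; values are made up.
toy = pd.DataFrame({
    "education_level": [
        "High School", "High School", "Primary School",
        "Graduate", "Graduate", "Masters",
    ],
    "major_discipline": [np.nan, np.nan, np.nan, "STEM", np.nan, "Arts"],
})

# Share of rows with a missing major, grouped by education level.
missing_share = (
    toy["major_discipline"].isna().groupby(toy["education_level"]).mean()
)
print(missing_share)
```

A share of 1.0 for an education level means every row at that level lacks a major, which is the situation the 'No Major' fill assumes.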

In [58]:
# find where education_level is 'High School' or 'Primary School', and change the corresponding rows in major_discipline to 'No Major'
no_major_levels = ['High School', 'Primary School']
jobdata.loc[jobdata['education_level'].isin(no_major_levels), 'major_discipline'] = 'No Major'

In [59]:
jobdata['major_discipline'].value_counts(dropna=False)

STEM               14492
No Major            3008
Humanities           669
Other                381
Business Degree      327
Arts                 253
NaN                   28
Name: major_discipline, dtype: int64

We have now cut our NaN values down to just 28. Now let's take a look at the remaining rows where major_discipline is NaN.

In [61]:
jobdata.loc[jobdata['major_discipline'].isna()]

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
391,12038,city_90,0.698,Male,Has relevent experience,Full time course,Masters,,,,,,44,1.0
1771,1485,city_103,0.92,,Has relevent experience,no_enrollment,Graduate,,15,50-99,Pvt Ltd,>4,42,0.0
3796,2946,city_21,0.624,,No relevent experience,Full time course,Graduate,,2,,Pvt Ltd,1,50,1.0
3923,22935,city_136,0.897,,No relevent experience,no_enrollment,Graduate,,3,,,,92,0.0
4859,23075,city_16,0.91,Male,Has relevent experience,no_enrollment,Graduate,,>20,1000-4999,Public Sector,>4,7,0.0
5190,16615,city_162,0.767,,Has relevent experience,no_enrollment,Graduate,,16,,,,43,0.0
6405,2874,city_41,0.827,,Has relevent experience,Part time course,Masters,,18,10000+,Pvt Ltd,>4,49,1.0
7816,28855,city_16,0.91,Male,Has relevent experience,no_enrollment,Graduate,,10,10000+,Pvt Ltd,3,14,0.0
8272,18836,city_136,0.897,,No relevent experience,Full time course,Graduate,,3,,,never,37,0.0
9038,17738,city_99,0.915,,No relevent experience,Full time course,Graduate,,5,,,1,18,1.0
