# <h1><center>Project Title: Data Scientist Job Change Prediction</center></h1>

# Contents

### Part 1
- Dataset
- Description about Dataset
- Objectives
- Key Questions
- Approach
- Importing Libraries/ Datasets
- Segregating Dataset
- Preliminary Investigation of Data

### Part 2
#### Exploratory Data Analysis
- Description about Target Variable
- Categorical Variables
- Numerical Variables
- Missing Values
- Duplicate Values
- Distribution, Skewness, Kurtosis
- Variance/ Standard Deviation
- Outliers & Anomalies
- Bi-Variate Analysis
- Multi-Variate Analysis

### Part 3
#### Data Pre-Processing
- Handling Missing Values
- Handling Duplicated Values
- Handling outliers
- Capping Outliers using IQR Ranges
- Multicollinearity (VIF)
- Encoding Categorical Variables 
- Drop Irrelevant Columns

### Part 4
#### Modeling
- Model Performance
- Models
- Data Scaling
- Cross Validation using K-Fold

### Part 5
#### Hyper-Parameter Tuning
- Conclusions

# Description about Dataset

A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. Many people sign up for the training. The company wants to know which of these candidates really want to work for it after training and which are looking for new employment, because this helps reduce cost and time and improves the quality of training, course planning, and the categorization of candidates. Information on demographics, education, and experience is available from candidates' signup and enrollment.

This dataset is also designed for HR research into the factors that lead a person to leave their current job. Using model(s) built on the current credentials, demographics, and experience data, you will predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision.

## Features

- enrollee_id: Unique ID for candidate
- city: City code
- city_development_index: Development index of the city (scaled)
- gender: Gender of candidate
- relevent_experience: Relevant experience of candidate (in Data Science)
- enrolled_university: Type of university course enrolled in, if any
- education_level: Education level of candidate
- major_discipline: Major discipline of candidate's education
- experience: Candidate's total experience in years
- company_size: Number of employees in current employer's company
- company_type: Type of current employer
- last_new_job: Difference in years between previous job and current job
- training_hours: Training hours completed
- target: 0 – Not looking for job change, 1 – Looking for a job change
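Although `experience` and `last_new_job` are described in years, the `df.info()` output later in this notebook reports both as `object`, because they contain labels such as `>20`, `<1`, `>4` and `never`. The following is a minimal sketch of one way to turn them into numbers. The helper `to_years` and the chosen encodings (`<1` as 0, `>20` as 21, `never` as 0, `>4` as 5) are assumptions made for illustration, not the encoding used by this project.

```python
import numpy as np
import pandas as pd


def to_years(value, special):
    """Convert a label such as '15', '<1' or '>20' to a float; missing stays NaN."""
    if pd.isna(value):
        return np.nan
    if value in special:
        return float(special[value])
    return float(value)


# Illustrative values only, taken from the labels visible in the dataset preview
experience = pd.Series([">20", "15", "5", "<1", None])
last_new_job = pd.Series(["1", ">4", "never", "4", None])

# Assumed encodings for the open-ended labels (hypothetical choices)
experience_years = experience.map(lambda v: to_years(v, {"<1": 0, ">20": 21}))
last_new_job_years = last_new_job.map(lambda v: to_years(v, {"never": 0, ">4": 5}))

print(experience_years)
print(last_new_job_years)
```

Missing entries are left as `NaN` so that they can be handled together with the other missing values during pre-processing.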

# Objectives

- To identify the effective predictors of the target variable.
- To analyse job change prediction and determine which variables have the most impact.

# Key Questions

The dataset aims to answer the following key questions:

- Do the predictors chosen initially really affect the target variable?
- Which predictors actually affect the job change prediction?
- Do education level and relevant experience have an impact?
- How do company type and experience affect job change?
- Does the target variable have a positive or negative correlation with the other variables?

# Approach

##### Target Variable: target: 0 – Not looking for job change, 1 – Looking for a job change
##### Nature of Target Variable: Categorical
##### Machine Learning Category: Supervised
##### Machine Learning Task: Classification
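Because the target is a binary category, a quick look at how often each class occurs is a natural first check for a classification task. The sketch below uses only the five `target` values visible in the first rows of the dataset preview, so the resulting shares are illustrative and say nothing about the balance of the full dataset.

```python
import pandas as pd

# Illustrative sample only: target values of the first five rows in the preview
sample = pd.DataFrame({"target": [1.0, 0.0, 0.0, 1.0, 0.0]})

# The column is stored as float64; casting to int makes the two classes explicit
target = sample["target"].astype(int)

# Absolute counts and relative shares per class
class_counts = target.value_counts().sort_index()
class_share = target.value_counts(normalize=True).sort_index()

print(class_counts)
print(class_share)
```

On the full DataFrame the same two `value_counts` calls can be applied to `df['target']` once the data has been loaded.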

##  Importing Libraries/ Datasets

In [10]:
# Data Manipulation Libraries
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Other Libraries
import warnings
warnings.filterwarnings('ignore')

In [6]:
# Import Dataset
df=pd.read_csv('aug_train.csv')
df


Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


In [7]:
df.columns

Index(['enrollee_id', 'city', 'city_development_index', 'gender',
       'relevent_experience', 'enrolled_university', 'education_level',
       'major_discipline', 'experience', 'company_size', 'company_type',
       'last_new_job', 'training_hours', 'target'],
      dtype='object')

##  Segregating Dataset

- Segregating Target Variable from Predictors
- Segregating Numerical Variables from Categorical Variables

In [8]:
# Segregating Target Variable from Predictors

df_x=df.drop(['target'],axis=1)
df_y=df[['target']]


In [13]:
# Segregating Numerical Variables from Categorical Variables

df_num=df.select_dtypes(include='number')
df_cat=df.select_dtypes(include=['object','category'])
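Note that the cell above calls `select_dtypes` on `df`, which still contains `enrollee_id` and `target`, so both end up in the numerical subset. The sketch below shows this on a small frame built from a few values in the dataset preview. The variable names are hypothetical, and `exclude="number"` is used here as a variation on the cell's `include=['object','category']`.

```python
import pandas as pd

# Small illustrative frame built from the first three rows of the preview
sample = pd.DataFrame({
    "enrollee_id": [8949, 29725, 11561],
    "city": ["city_103", "city_40", "city_21"],
    "city_development_index": [0.920, 0.776, 0.624],
    "experience": [">20", "15", "5"],
    "training_hours": [36, 47, 83],
    "target": [1.0, 0.0, 0.0],
})

# Numerical columns: includes the identifier and the target
num_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything that is not numerical is treated as categorical
cat_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Numerical predictors only: drop the identifier and the target
num_predictors = [c for c in num_cols if c not in ("enrollee_id", "target")]

print(num_cols)
print(cat_cols)
print(num_predictors)
```

Whether `enrollee_id` and `target` should stay in the numerical subset depends on the analysis; for distribution or outlier checks on predictors they are usually left out.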

##  Preliminary Investigation of Data

In [16]:
df.sample(5)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
8252,21997,city_16,0.91,Male,No relevent experience,no_enrollment,Masters,STEM,>20,500-999,Pvt Ltd,never,72,0.0
17183,17572,city_65,0.802,Male,No relevent experience,Full time course,Graduate,STEM,7,,,never,6,1.0
12392,12250,city_93,0.865,Male,Has relevent experience,Full time course,High School,,6,50-99,Funded Startup,1,50,0.0
12313,26758,city_21,0.624,Male,Has relevent experience,no_enrollment,Graduate,STEM,13,10000+,Pvt Ltd,3,21,1.0
4559,28889,city_114,0.926,Male,Has relevent experience,no_enrollment,,,7,500-999,Other,3,80,1.0


In [17]:
df.shape

(19158, 14)

In [18]:
df.size

268212

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)
me

In [20]:
df.dtypes

enrollee_id                 int64
city                       object
city_development_index    float64
gender                     object
relevent_experience        object
enrolled_university        object
education_level            object
major_discipline           object
experience                 object
company_size               object
company_type               object
last_new_job               object
training_hours              int64
target                    float64
dtype: object

In [21]:
df.dtypes.value_counts()

object     10
int64       2
float64     2
dtype: int64

In [22]:
df.columns

Index(['enrollee_id', 'city', 'city_development_index', 'gender',
       'relevent_experience', 'enrolled_university', 'education_level',
       'major_discipline', 'experience', 'company_size', 'company_type',
       'last_new_job', 'training_hours', 'target'],
      dtype='object')

In [23]:
df.nunique()

enrollee_id               19158
city                        123
city_development_index       93
gender                        3
relevent_experience           2
enrolled_university           3
education_level               5
major_discipline              6
experience                   22
company_size                  8
company_type                  6
last_new_job                  6
training_hours              241
target                        2
dtype: int64

In [24]:
df.count()

enrollee_id               19158
city                      19158
city_development_index    19158
gender                    14650
relevent_experience       19158
enrolled_university       18772
education_level           18698
major_discipline          16345
experience                19093
company_size              13220
company_type              13018
last_new_job              18735
training_hours            19158
target                    19158
dtype: int64
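The non-null counts above already show which columns have gaps. As a rough sketch, the number and percentage of missing values per column can be derived directly from those counts and the 19158 rows reported by `df.shape`. The names `non_null`, `missing` and `summary` are chosen here for illustration.

```python
import pandas as pd

# Row count from df.shape and non-null counts from df.count() above
total_rows = 19158
non_null = pd.Series({
    "gender": 14650,
    "enrolled_university": 18772,
    "education_level": 18698,
    "major_discipline": 16345,
    "experience": 19093,
    "company_size": 13220,
    "company_type": 13018,
    "last_new_job": 18735,
})

# Missing values per column, absolute and as a percentage of all rows
missing = total_rows - non_null
missing_pct = (missing / total_rows * 100).round(2)

summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
summary = summary.sort_values("missing", ascending=False)
print(summary)
```

On the loaded DataFrame, `df.isna().sum()` gives the same absolute counts without retyping the numbers.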