<a href="https://colab.research.google.com/github/Adizcool/Job_Promotion_Prediction/blob/main/Job_Promotion_Prediction_Data_Preprocessing_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Classification Project using Loan Status Data**

##**Main Goals of this Program:**
**I) Check and decide the ML Learning Type and sub-type as applicable**

**II) Check and remove the duplicate records, if any**

**III) Check the class balance**

**IV) Check for Missing Values and handle them as required**

**V) Check for the necessity of creating new column(s) and create the columns as required**

**VI) Check the unique Values of each column and observe the following and take actions as required:**
* **1) Wrong Data in the columns, if any** 
* **2) Wrong format of the data in the columns, if any**
* **3) Identify the columns which need to be categorically converted to numeric values by using Nominal method/ Ordinal Method**

**VII) Check the Test accuracy using appropriate algorithm and Holdout Method.**

**VIII) Implement the Scaling as required**

**IX) Write out the transformed Input file for further usage**

**1) Install/ Import the required Python Packages/ Libraries**

In [1]:
#Import required python packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn import preprocessing
%matplotlib inline

In [2]:
pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.3.0-py2.py3-none-any.whl (82 kB)
[?25l[K     |████                            | 10 kB 13.2 MB/s eta 0:00:01[K     |████████                        | 20 kB 17.0 MB/s eta 0:00:01[K     |████████████                    | 30 kB 21.6 MB/s eta 0:00:01[K     |████████████████                | 40 kB 12.8 MB/s eta 0:00:01[K     |████████████████████            | 51 kB 5.8 MB/s eta 0:00:01[K     |████████████████████████        | 61 kB 6.0 MB/s eta 0:00:01[K     |████████████████████████████    | 71 kB 5.2 MB/s eta 0:00:01[K     |███████████████████████████████▉| 81 kB 5.6 MB/s eta 0:00:01[K     |████████████████████████████████| 82 kB 323 kB/s 
Installing collected packages: category-encoders
Successfully installed category-encoders-2.3.0


**2) Mounting the Google Drive**

In [3]:
# Mount the Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


**3) Read the Data file and check**

In [6]:
# Read the Diabetes Data from .csv file and check the data shape (number of Rows and Columns)
df = pd.read_csv('gdrive/My Drive/Datasets/HR Analysis/train_HR_Analytics.csv')
print(df.shape)
df.head()

(5000, 14)


Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


##**I) Check and decide the ML Learning Type and sub-type as applicable**

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           5000 non-null   int64  
 1   department            5000 non-null   object 
 2   region                5000 non-null   object 
 3   education             4788 non-null   object 
 4   gender                5000 non-null   object 
 5   recruitment_channel   5000 non-null   object 
 6   no_of_trainings       5000 non-null   int64  
 7   age                   5000 non-null   int64  
 8   previous_year_rating  4624 non-null   float64
 9   length_of_service     5000 non-null   int64  
 10  KPIs_met >80%         5000 non-null   int64  
 11  awards_won?           5000 non-null   int64  
 12  avg_training_score    5000 non-null   int64  
 13  is_promoted           5000 non-null   int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 547.0+ KB


In [8]:
df.isnull().sum()

employee_id               0
department                0
region                    0
education               212
gender                    0
recruitment_channel       0
no_of_trainings           0
age                       0
previous_year_rating    376
length_of_service         0
KPIs_met >80%             0
awards_won?               0
avg_training_score        0
is_promoted               0
dtype: int64

**Observations on the given Dataset:**
* a) Number of Independet Variables: 12 (Identified)
* b) Number of Dependent Variable : 1 (Loan_Status) (Identified)
* c) There is no Missing Value in the Dependent Variable column "Loan_Status"


**Conclusions:**
###**a) The given dataset probably belongs to the"Supervised Learning" main-type**
###**b) Since the Dependent variable values are categorical in nature, the given dataset is of "Classification" sub-type.**

##**II) Check and remove the duplicate records, if any**

In [9]:
df.shape

(5000, 14)

In [10]:
# Returns True for every row that is a duplicate, othwerwise False:
print(df.duplicated())

0       False
1       False
2       False
3       False
4       False
        ...  
4995    False
4996    False
4997    False
4998    False
4999    False
Length: 5000, dtype: bool


In [11]:
# Remove all duplicates:
df.drop_duplicates(inplace = True)

In [12]:
df.shape

(5000, 14)

###**Conclusion: No Duplicate Records**

##**III) Check the Class balance**

In [13]:
df["is_promoted"].value_counts()

0    2733
1    2267
Name: is_promoted, dtype: int64

###**Conclusion: It is a Binary Classification with imbalanced Classes**

##**IV) Check for Missing Values and handle them as required**

**a) Check the Missing Values, if any**

In [14]:
df.isnull().sum()

employee_id               0
department                0
region                    0
education               212
gender                    0
recruitment_channel       0
no_of_trainings           0
age                       0
previous_year_rating    376
length_of_service         0
KPIs_met >80%             0
awards_won?               0
avg_training_score        0
is_promoted               0
dtype: int64

**b) Checking the total number of rows having the missing Values**

In [15]:
df[df.isnull().any(axis=1)]

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
10,29934,Technology,region_23,,m,sourcing,1,30,,1,0,0,77,0
21,33332,Operations,region_15,,m,sourcing,1,41,4.0,11,0,0,57,0
23,71177,Procurement,region_5,Bachelor's,m,other,1,27,,1,0,0,70,0
29,74759,Sales & Marketing,region_4,Bachelor's,m,sourcing,1,26,,1,0,0,44,0
32,35465,Sales & Marketing,region_7,,f,sourcing,1,24,1.0,2,0,0,48,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4955,32626,Technology,region_17,Bachelor's,m,other,1,27,,1,1,0,80,1
4963,54606,Sales & Marketing,region_6,,m,other,1,32,4.0,4,1,1,94,1
4969,18185,Technology,region_2,,f,other,1,49,3.0,11,1,0,83,1
4995,61657,Technology,region_28,Bachelor's,f,other,1,30,,1,1,0,77,1


**c) Observations, Decisions and Actions**

**Observations:**
* a) Here, the data values of 7 columns are missing
* b) The total number rows having missing values is 134 against the total number of rows (614) in the dataset. 
###**So, we cannot use the option of dropping the rows having missing values.**

**Decision and Actions:**

###**Fill the missing values of the columns with that of the most_frequent values of the respective columns.**

**d) Imputation of Missing Values using the "fillna" command and checking**

In [16]:
df.fillna(df.mode().iloc[0], inplace=True)

In [17]:
df.isnull().sum()

employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64

##**V) Check for necessity of creating new column(s) and create the columns as required**

###**Decision: As of now, there is no necessity to create new column(s).**

##**VI) Check the unique Values of each column and observe the following and take actions as required:**
* **a) Wrong Data in the columns, if any** 
* **b) Wrong format of the data in the columns, if any**
* **c) Identify the columns which need to be categorically converted to numeric values by using Nominal method/ Ordinal Method**


###**Column-1: employee_id**

In [18]:
df['employee_id'].value_counts()

26623    1
4747     1
47800    1
72318    1
60032    1
        ..
35501    1
44331    1
61449    1
11567    1
14764    1
Name: employee_id, Length: 5000, dtype: int64

**Observations:**
* a) Data in this column will not be contributing to the prediction of the Depenedent variable

**Decsion:**

**We will be dropping this column**

**Action:**

In [19]:
df.drop(['employee_id'], axis = 1, inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   department            5000 non-null   object 
 1   region                5000 non-null   object 
 2   education             5000 non-null   object 
 3   gender                5000 non-null   object 
 4   recruitment_channel   5000 non-null   object 
 5   no_of_trainings       5000 non-null   int64  
 6   age                   5000 non-null   int64  
 7   previous_year_rating  5000 non-null   float64
 8   length_of_service     5000 non-null   int64  
 9   KPIs_met >80%         5000 non-null   int64  
 10  awards_won?           5000 non-null   int64  
 11  avg_training_score    5000 non-null   int64  
 12  is_promoted           5000 non-null   int64  
dtypes: float64(1), int64(7), object(5)
memory usage: 546.9+ KB


###**Column-2: department**

In [20]:
df['department'].value_counts()

Sales & Marketing    1409
Operations           1043
Technology            747
Procurement           729
Analytics             513
Finance               202
HR                    197
R&D                    85
Legal                  75
Name: department, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Nominal" Type.

**Decsion:**

**We will be converting the data in this column into Numerical values using Nominal Type method "pd.get_dummies".**

**Action:**

In [21]:
#encode the data
dept = pd.DataFrame(df['department'])
dept_encoded=pd.get_dummies(data= dept, drop_first=True)
dept_encoded

Unnamed: 0,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology
0,0,0,0,0,0,0,1,0
1,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...
4995,0,0,0,0,0,0,0,1
4996,0,0,0,0,0,0,0,0
4997,0,0,0,0,0,0,0,0
4998,0,0,0,0,0,0,0,0


###**Column-3: Region**

In [22]:
df['region'].value_counts()

region_2     1095
region_22     660
region_7      507
region_13     263
region_15     243
region_4      209
region_26     183
region_31     147
region_27     147
region_28     130
region_23     116
region_16     115
region_11     103
region_17      95
region_25      90
region_29      86
region_19      75
region_14      69
region_30      69
region_32      63
region_20      62
region_1       61
region_5       56
region_8       55
region_6       49
region_12      45
region_10      44
region_24      37
region_21      28
region_3       28
region_9       25
region_34      24
region_33      19
region_18       2
Name: region, dtype: int64

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   department            5000 non-null   object 
 1   region                5000 non-null   object 
 2   education             5000 non-null   object 
 3   gender                5000 non-null   object 
 4   recruitment_channel   5000 non-null   object 
 5   no_of_trainings       5000 non-null   int64  
 6   age                   5000 non-null   int64  
 7   previous_year_rating  5000 non-null   float64
 8   length_of_service     5000 non-null   int64  
 9   KPIs_met >80%         5000 non-null   int64  
 10  awards_won?           5000 non-null   int64  
 11  avg_training_score    5000 non-null   int64  
 12  is_promoted           5000 non-null   int64  
dtypes: float64(1), int64(7), object(5)
memory usage: 546.9+ KB


**Observations:**
* a) Data in this column will not be contributing to the prediction of the Depenedent variable

**Decision and Actions to be taken:**

* We will be dropping this column


**Action:**

In [24]:
df.drop(['region'], axis = 1, inplace = True)

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   department            5000 non-null   object 
 1   education             5000 non-null   object 
 2   gender                5000 non-null   object 
 3   recruitment_channel   5000 non-null   object 
 4   no_of_trainings       5000 non-null   int64  
 5   age                   5000 non-null   int64  
 6   previous_year_rating  5000 non-null   float64
 7   length_of_service     5000 non-null   int64  
 8   KPIs_met >80%         5000 non-null   int64  
 9   awards_won?           5000 non-null   int64  
 10  avg_training_score    5000 non-null   int64  
 11  is_promoted           5000 non-null   int64  
dtypes: float64(1), int64(7), object(4)
memory usage: 507.8+ KB


###**Column-4: Education**

In [26]:
df['education'].value_counts()

Bachelor's          3452
Master's & above    1477
Below Secondary       71
Name: education, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Ordinal" Type.

**Decsion:**

**We will be converting the data in this column into Numerical values using Ordnial Type method "preprocessing.LabelEncoder()**

**Action:**

In [27]:
le = preprocessing.LabelEncoder()
df['education'] = le.fit_transform(df.education.values)
df['education'].value_counts()

0    3452
2    1477
1      71
Name: education, dtype: int64

###**Column-5: Gender**

In [28]:
df['gender'].value_counts()

m    3477
f    1523
Name: gender, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Nominal" Type.

**Decsion:**

**We will be converting the data in this column into Numerical values using Nominal Type method "pd.get_dummies".**

**Action:**

In [29]:
#encode the data
gender = pd.DataFrame(df['gender'])
gender_encoded=pd.get_dummies(data= gender, drop_first=True)
gender_encoded

Unnamed: 0,gender_m
0,0
1,1
2,1
3,1
4,1
...,...
4995,0
4996,1
4997,1
4998,1


###**Column-6: recruitment channel**

In [30]:
df['recruitment_channel'].value_counts()

other       2786
sourcing    2084
referred     130
Name: recruitment_channel, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Nominal" Type.

**Decsion:**

**We will be converting the data in this column into Numerical values using Nominal Type method "pd.get_dummies".**

**Action:**

In [31]:
#encode the data
recruitment_channel = pd.DataFrame(df['recruitment_channel'])
recruitment_channel_encoded=pd.get_dummies(data= recruitment_channel, drop_first=True)
recruitment_channel_encoded

Unnamed: 0,recruitment_channel_referred,recruitment_channel_sourcing
0,0,1
1,0,0
2,0,1
3,0,0
4,0,0
...,...,...
4995,0,0
4996,0,1
4997,0,0
4998,0,1


###**Column-7 to 14 : no_of_trainings, age, previous_year_rating, length_of_service, KPIs_met, awards_won, avg_trainging_score, is_promoted**

In [32]:
df.describe()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,0.605,1.2224,34.6706,3.5742,5.8674,0.4932,0.0578,66.755,0.4534
std,0.911011,0.547538,7.49921,1.177613,4.282883,0.500004,0.233388,14.48307,0.497873
min,0.0,1.0,20.0,1.0,1.0,0.0,0.0,39.0,0.0
25%,0.0,1.0,29.0,3.0,3.0,0.0,0.0,54.0,0.0
50%,0.0,1.0,33.0,3.0,5.0,0.0,0.0,64.0,0.0
75%,2.0,1.0,38.0,5.0,8.0,1.0,0.0,80.0,1.0
max,2.0,7.0,60.0,5.0,34.0,1.0,1.0,99.0,1.0


**Observations:**


* a) Here, all the Integer and float Column values are described.
* b) Each column has got a Standard Deviation, Min and Max Values.
* c) We can assume that there is no wrong data and wrong data format.
* **d) Scaling is not required since all values are between 0 ans 100**

###**Drop the columns which are to be categorically converted and include the their respective coverted Numeric Values**

In [33]:
df.drop(['department', 'gender', 'recruitment_channel'], axis = 1, inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   education             5000 non-null   int64  
 1   no_of_trainings       5000 non-null   int64  
 2   age                   5000 non-null   int64  
 3   previous_year_rating  5000 non-null   float64
 4   length_of_service     5000 non-null   int64  
 5   KPIs_met >80%         5000 non-null   int64  
 6   awards_won?           5000 non-null   int64  
 7   avg_training_score    5000 non-null   int64  
 8   is_promoted           5000 non-null   int64  
dtypes: float64(1), int64(8)
memory usage: 390.6 KB


In [34]:
df = pd.concat([df,dept_encoded, gender_encoded, recruitment_channel_encoded], axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   education                     5000 non-null   int64  
 1   no_of_trainings               5000 non-null   int64  
 2   age                           5000 non-null   int64  
 3   previous_year_rating          5000 non-null   float64
 4   length_of_service             5000 non-null   int64  
 5   KPIs_met >80%                 5000 non-null   int64  
 6   awards_won?                   5000 non-null   int64  
 7   avg_training_score            5000 non-null   int64  
 8   is_promoted                   5000 non-null   int64  
 9   department_Finance            5000 non-null   uint8  
 10  department_HR                 5000 non-null   uint8  
 11  department_Legal              5000 non-null   uint8  
 12  department_Operations         5000 non-null   uint8  
 13  dep

In [35]:
df.corr()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
education,1.0,-0.051637,0.383352,0.035436,0.266965,0.032528,0.018963,0.034312,0.050483,-0.024766,0.043806,-0.036803,0.036197,0.109469,0.034945,-0.062686,-0.007967,-0.069929,-0.03401,0.001416
no_of_trainings,-0.051637,1.0,-0.091283,-0.05352,-0.0543,-0.029548,-0.023909,0.082625,-0.049298,0.005706,-0.072881,-0.038108,-0.035031,0.009182,0.048312,-0.038458,0.014211,0.079148,-0.013571,0.001866
age,0.383352,-0.091283,1.0,0.037414,0.677262,-0.03242,-0.001006,-0.060699,-0.037035,-0.087843,-0.013862,-0.023983,0.125813,0.070903,-0.034457,-0.007937,-0.015857,-0.032609,-0.032378,-0.023345
previous_year_rating,0.035436,-0.05352,0.037414,1.0,0.025292,0.294898,0.079375,0.112824,0.287039,0.012087,0.019105,-0.014064,0.090342,-0.001247,0.02259,-0.101202,-0.018548,-0.020111,0.05588,-0.00573
length_of_service,0.266965,-0.0543,0.677262,0.025292,1.0,-0.045773,-0.053769,-0.035127,-0.01955,-0.072869,0.007231,-0.056116,0.087046,0.061889,-0.042532,-0.013409,-0.004054,-0.017956,-0.033093,-0.010861
KPIs_met >80%,0.032528,-0.029548,-0.03242,0.294898,-0.045773,1.0,0.052224,0.032766,0.388057,-0.005336,-0.00033,-0.003258,0.049812,0.017522,0.01262,-0.075511,0.011872,-0.017259,0.034902,-0.021769
awards_won?,0.018963,-0.023909,-0.001006,0.079375,-0.053769,0.052224,1.0,0.172203,0.216859,-0.002941,-0.006108,0.011739,0.0226,0.002098,-0.012683,-0.025604,0.018808,-0.009256,0.013388,-0.006006
avg_training_score,0.034312,0.082625,-0.060699,0.112824,-0.035127,0.032766,0.172203,1.0,0.271689,-0.040648,-0.203224,-0.047564,-0.131729,0.168261,0.165575,-0.568318,0.427395,-0.039254,0.02151,-0.012785
is_promoted,0.050483,-0.049298,-0.037035,0.287039,-0.01955,0.388057,0.216859,0.271689,1.0,-0.001197,-0.031636,-0.023153,0.020867,0.000537,-0.02343,-0.040937,0.056699,-0.014378,0.038015,-0.012945
department_Finance,-0.024766,0.005706,-0.087843,0.012087,-0.072869,-0.005336,-0.002941,-0.040648,-0.001197,1.0,-0.041555,-0.025321,-0.105343,-0.08477,-0.026983,-0.128527,-0.085992,0.018824,-0.033524,0.022264


##**VII) Check the Test accuracy using appropriate algorithm and Holdout Method.**

##**Step-5: Slice X and y Values**

In [36]:
X = df.drop(['is_promoted'], axis = 1)
Y = df['is_promoted']
X.head()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,2,1,35,5.0,8,1,0,49,0,0,0,0,0,0,1,0,0,0,1
1,0,1,30,5.0,4,0,0,60,0,0,0,1,0,0,0,0,1,0,0
2,0,1,34,3.0,7,0,0,50,0,0,0,0,0,0,1,0,1,0,1
3,0,2,39,1.0,10,0,0,50,0,0,0,0,0,0,1,0,1,0,0
4,0,1,45,3.0,2,0,0,73,0,0,0,0,0,0,0,1,1,0,0


In [37]:
Y.head()

0    0
1    0
2    0
3    0
4    0
Name: is_promoted, dtype: int64

##**Step-6: Execute Train-Test-Split Command and Verify**

In [38]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 66)

In [39]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(4000, 19)
(4000,)
(1000, 19)
(1000,)


##**Step-7: Learn the Data and Predict the dependent Variable values for the "X_test"data using "LogisticRegression()" algorithm**

In [40]:
from sklearn.linear_model import LogisticRegression
#create an instance and fit the model 
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [41]:
y_pred = logmodel.predict(X_test)
y_pred

array([0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1,

##**Step-8: Calculate the Accuracy of the Model**

In [42]:
accuracy_lr = logmodel.score(X_test, y_test)
print("Accuracy of Logistic Regression on test set:",accuracy_lr)

Accuracy of Logistic Regression on test set: 0.722


##**Step-9: Display the Confusion Matrix and Classification Report of the Model**

In [43]:
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  

[[418 132]
 [146 304]]
              precision    recall  f1-score   support

           0       0.74      0.76      0.75       550
           1       0.70      0.68      0.69       450

    accuracy                           0.72      1000
   macro avg       0.72      0.72      0.72      1000
weighted avg       0.72      0.72      0.72      1000



##**VIII) Implement the Scaling as required**

###**Use Normalization**

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 0 to 4999
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   education                     5000 non-null   int64  
 1   no_of_trainings               5000 non-null   int64  
 2   age                           5000 non-null   int64  
 3   previous_year_rating          5000 non-null   float64
 4   length_of_service             5000 non-null   int64  
 5   KPIs_met >80%                 5000 non-null   int64  
 6   awards_won?                   5000 non-null   int64  
 7   avg_training_score            5000 non-null   int64  
 8   is_promoted                   5000 non-null   int64  
 9   department_Finance            5000 non-null   uint8  
 10  department_HR                 5000 non-null   uint8  
 11  department_Legal              5000 non-null   uint8  
 12  department_Operations         5000 non-null   uint8  
 13  dep

In [45]:
columnNames = ['education','no_of_trainings', 'age', 'previous_year_rating', 'length_of_service', 'KPIs_met>80%', 'awards_won?', 'avg_training_score', 'department_Finance',
               'department_HR', 'department_Legal', 'department_Operations', 'department_Procurement', 'department_R&D', 'department_Sales & Marketing', 'deartment_Technology',
               'gender_m', 'recruitment_channel_referred', 'recruitment_channel_sourcing']

In [46]:
min_max_scaler_object = preprocessing.MinMaxScaler()
X_train1 = min_max_scaler_object.fit_transform(X_train)
X_train1 = pd.DataFrame(X_train1 , columns = columnNames)
X_train1.head()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met>80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,deartment_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,0.0,0.0,0.425,1.0,0.212121,1.0,0.0,0.833333,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,0.0,0.0,0.475,0.5,0.30303,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
2,0.0,0.166667,0.275,0.5,0.151515,0.0,1.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.425,0.5,0.060606,0.0,0.0,0.416667,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
4,0.0,0.0,0.2,0.5,0.030303,0.0,0.0,0.45,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0


In [47]:
min_max_scaler_object = preprocessing.MinMaxScaler()
X_test1 = min_max_scaler_object.fit_transform(X_test)
X_test1 = pd.DataFrame(X_test1 , columns = columnNames)
X_test1.head()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met>80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,deartment_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,0.0,0.0,0.3,0.5,0.032258,0.0,0.0,0.491525,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.225,0.0,0.129032,0.0,0.0,0.135593,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
2,0.0,0.0,0.3,0.75,0.096774,1.0,1.0,0.288136,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.35,0.75,0.129032,1.0,0.0,0.966102,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.3,0.75,0.064516,1.0,0.0,0.830508,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


In [48]:
from sklearn.linear_model import LogisticRegression
#create an instance and fit the model 
logmodel1 = LogisticRegression()
logmodel1.fit(X_train1, y_train)

LogisticRegression()

In [49]:
#predictions
predictions1 = logmodel1.predict(X_test1)

In [50]:
print(confusion_matrix(y_test, predictions1))
print(classification_report(y_test,predictions1))

[[422 128]
 [122 328]]
              precision    recall  f1-score   support

           0       0.78      0.77      0.77       550
           1       0.72      0.73      0.72       450

    accuracy                           0.75      1000
   macro avg       0.75      0.75      0.75      1000
weighted avg       0.75      0.75      0.75      1000



###**Use Standardization**

In [51]:
from sklearn.preprocessing import StandardScaler
std_scaler_object = preprocessing.StandardScaler()
X_train2 = std_scaler_object.fit_transform(X_train)
X_train2 = pd.DataFrame(X_train2 , columns = columnNames)
X_train2.head()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met>80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,deartment_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,-0.663988,-0.405767,0.319692,1.213018,0.51454,1.011567,-0.249269,1.53771,-0.204788,-0.202122,-0.124443,1.945428,-0.416371,-0.129525,-0.620904,-0.422141,0.665193,-0.156813,1.173934
1,-0.663988,-0.405767,0.588538,-0.493955,1.220841,-0.988565,-0.249269,-0.884534,-0.204788,-0.202122,-0.124443,-0.514026,-0.416371,-0.129525,1.610556,-0.422141,0.665193,-0.156813,1.173934
2,-0.663988,1.477142,-0.486847,-0.493955,0.043673,-0.988565,4.011735,1.191675,-0.204788,-0.202122,-0.124443,-0.514026,-0.416371,-0.129525,-0.620904,-0.422141,0.665193,-0.156813,-0.851837
3,-0.663988,-0.405767,0.319692,-0.493955,-0.662628,-0.988565,-0.249269,-0.192465,-0.204788,-0.202122,-0.124443,-0.514026,-0.416371,-0.129525,1.610556,-0.422141,0.665193,-0.156813,1.173934
4,-0.663988,-0.405767,-0.890116,-0.493955,-0.898062,-0.988565,-0.249269,-0.054051,-0.204788,-0.202122,-0.124443,-0.514026,2.401706,-0.129525,-0.620904,-0.422141,-1.503324,-0.156813,1.173934


In [52]:
from sklearn.preprocessing import StandardScaler
std_scaler_object = preprocessing.StandardScaler()
X_test2 = std_scaler_object.fit_transform(X_test)
X_test2 = pd.DataFrame(X_test2 , columns = columnNames)
X_test2.head()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met>80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,deartment_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,-0.664865,-0.411275,-0.370865,-0.463194,-0.924263,-0.978237,-0.241249,0.160788,-0.206768,-0.204124,-0.119159,-0.510915,2.499275,-0.139169,-0.648425,-0.406818,-1.542199,-0.187608,-0.819903
1,-0.664865,-0.411275,-0.75907,-2.129361,-0.244491,-0.978237,-0.241249,-1.276654,-0.206768,-0.204124,-0.119159,-0.510915,-0.400116,-0.139169,1.542199,-0.406818,0.648425,-0.187608,1.219657
2,-0.664865,-0.411275,-0.370865,0.369889,-0.471082,1.022247,4.145096,-0.660607,-0.206768,-0.204124,-0.119159,1.957273,-0.400116,-0.139169,-0.648425,-0.406818,0.648425,-0.187608,-0.819903
3,-0.664865,-0.411275,-0.112062,0.369889,-0.244491,1.022247,-0.241249,2.077378,-0.206768,-0.204124,-0.119159,1.957273,-0.400116,-0.139169,-0.648425,-0.406818,0.648425,-0.187608,-0.819903
4,-0.664865,-0.411275,-0.370865,0.369889,-0.697672,1.022247,-0.241249,1.529781,-0.206768,-0.204124,-0.119159,-0.510915,-0.400116,-0.139169,1.542199,-0.406818,0.648425,-0.187608,-0.819903


In [53]:
from sklearn.linear_model import LogisticRegression
#create an instance and fit the model 
logmodel2 = LogisticRegression()
logmodel2.fit(X_train2, y_train)

LogisticRegression()

In [54]:
#predictions
predictions2 = logmodel2.predict(X_test2)

In [55]:
print(confusion_matrix(y_test, predictions2))
print(classification_report(y_test,predictions2))

[[412 138]
 [ 97 353]]
              precision    recall  f1-score   support

           0       0.81      0.75      0.78       550
           1       0.72      0.78      0.75       450

    accuracy                           0.77      1000
   macro avg       0.76      0.77      0.76      1000
weighted avg       0.77      0.77      0.77      1000



**Observation: Standardization method gives the accuracy of 77%**

**Decision: We will use the "Standardization" method for our model.**


##**IX) Write out the transformed Input file for further usage**

In [56]:
X1 = df.drop(['is_promoted'], axis = 1)
Y1 = df['is_promoted']
X.head()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,2,1,35,5.0,8,1,0,49,0,0,0,0,0,0,1,0,0,0,1
1,0,1,30,5.0,4,0,0,60,0,0,0,1,0,0,0,0,1,0,0
2,0,1,34,3.0,7,0,0,50,0,0,0,0,0,0,1,0,1,0,1
3,0,2,39,1.0,10,0,0,50,0,0,0,0,0,0,1,0,1,0,0
4,0,1,45,3.0,2,0,0,73,0,0,0,0,0,0,0,1,1,0,0


In [57]:
std_scaler_object = preprocessing.StandardScaler()
X2 = std_scaler_object.fit_transform(X1)
X2 = pd.DataFrame(X2 , columns = columnNames)
print(X2.shape)
X2.head()

(5000, 19)


Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met>80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,deartment_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,1.531419,-0.406222,0.043929,1.210876,0.497985,1.013694,-0.247681,-1.226037,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,1.596439,-0.419095,-1.510958,-0.163383,1.182891
1,-0.664164,-0.406222,-0.622875,1.210876,-0.436058,-0.986491,-0.247681,-0.466453,-0.205185,-0.202524,-0.123404,1.947784,-0.413142,-0.131507,-0.626394,-0.419095,0.661832,-0.163383,-0.845386
2,-0.664164,-0.406222,-0.089432,-0.487645,0.264474,-0.986491,-0.247681,-1.156984,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,1.596439,-0.419095,0.661832,-0.163383,1.182891
3,-0.664164,1.420317,0.577372,-2.186166,0.965007,-0.986491,-0.247681,-1.156984,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,1.596439,-0.419095,0.661832,-0.163383,-0.845386
4,-0.664164,-0.406222,1.377536,-0.487645,-0.90308,-0.986491,-0.247681,0.431236,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,-0.626394,2.386093,0.661832,-0.163383,-0.845386


In [58]:
df1 = pd.DataFrame(data=X2)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 19 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   education                     5000 non-null   float64
 1   no_of_trainings               5000 non-null   float64
 2   age                           5000 non-null   float64
 3   previous_year_rating          5000 non-null   float64
 4   length_of_service             5000 non-null   float64
 5   KPIs_met>80%                  5000 non-null   float64
 6   awards_won?                   5000 non-null   float64
 7   avg_training_score            5000 non-null   float64
 8   department_Finance            5000 non-null   float64
 9   department_HR                 5000 non-null   float64
 10  department_Legal              5000 non-null   float64
 11  department_Operations         5000 non-null   float64
 12  department_Procurement        5000 non-null   float64
 13  dep

In [59]:
df1.head()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met>80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,deartment_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,1.531419,-0.406222,0.043929,1.210876,0.497985,1.013694,-0.247681,-1.226037,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,1.596439,-0.419095,-1.510958,-0.163383,1.182891
1,-0.664164,-0.406222,-0.622875,1.210876,-0.436058,-0.986491,-0.247681,-0.466453,-0.205185,-0.202524,-0.123404,1.947784,-0.413142,-0.131507,-0.626394,-0.419095,0.661832,-0.163383,-0.845386
2,-0.664164,-0.406222,-0.089432,-0.487645,0.264474,-0.986491,-0.247681,-1.156984,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,1.596439,-0.419095,0.661832,-0.163383,1.182891
3,-0.664164,1.420317,0.577372,-2.186166,0.965007,-0.986491,-0.247681,-1.156984,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,1.596439,-0.419095,0.661832,-0.163383,-0.845386
4,-0.664164,-0.406222,1.377536,-0.487645,-0.90308,-0.986491,-0.247681,0.431236,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,-0.626394,2.386093,0.661832,-0.163383,-0.845386


In [60]:
df1 = pd.concat([df1,Y1], axis=1)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   education                     5000 non-null   float64
 1   no_of_trainings               5000 non-null   float64
 2   age                           5000 non-null   float64
 3   previous_year_rating          5000 non-null   float64
 4   length_of_service             5000 non-null   float64
 5   KPIs_met>80%                  5000 non-null   float64
 6   awards_won?                   5000 non-null   float64
 7   avg_training_score            5000 non-null   float64
 8   department_Finance            5000 non-null   float64
 9   department_HR                 5000 non-null   float64
 10  department_Legal              5000 non-null   float64
 11  department_Operations         5000 non-null   float64
 12  department_Procurement        5000 non-null   float64
 13  dep

In [61]:
df1.head()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met>80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,deartment_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing,is_promoted
0,1.531419,-0.406222,0.043929,1.210876,0.497985,1.013694,-0.247681,-1.226037,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,1.596439,-0.419095,-1.510958,-0.163383,1.182891,0
1,-0.664164,-0.406222,-0.622875,1.210876,-0.436058,-0.986491,-0.247681,-0.466453,-0.205185,-0.202524,-0.123404,1.947784,-0.413142,-0.131507,-0.626394,-0.419095,0.661832,-0.163383,-0.845386,0
2,-0.664164,-0.406222,-0.089432,-0.487645,0.264474,-0.986491,-0.247681,-1.156984,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,1.596439,-0.419095,0.661832,-0.163383,1.182891,0
3,-0.664164,1.420317,0.577372,-2.186166,0.965007,-0.986491,-0.247681,-1.156984,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,1.596439,-0.419095,0.661832,-0.163383,-0.845386,0
4,-0.664164,-0.406222,1.377536,-0.487645,-0.90308,-0.986491,-0.247681,0.431236,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,-0.626394,2.386093,0.661832,-0.163383,-0.845386,0


In [62]:
from google.colab import files
df1.to_csv("gdrive/My Drive/Datasets/train_HR_Analytics-Preprocessed.csv",index = False)