<a href="https://colab.research.google.com/github/Adizcool/Job_Promotion_Prediction/blob/main/Job_Promotion_Prediction_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Step-1: Install/ Import the required Python Packages/ Libraries, Mount the Google Drive and read and check the Data and Customer files**

**1) Install/ Import the required Python Packages/ Libraries**

In [1]:
#Import required python packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
from sklearn import svm
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [2]:
pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.3.0-py2.py3-none-any.whl (82 kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.3.0


**2) Mounting the Google Drive**

In [3]:
# Mount the Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


**3) Read the Data file and Customer file and check**

In [4]:
# Read the HR Analytics train and customer data from .csv files and check the data shape (number of Rows and Columns)
train_df = pd.read_csv('gdrive/My Drive/Datasets/HR Analysis/train_HR_Analytics.csv')
test_df = pd.read_csv('gdrive/My Drive/Datasets/HR Analysis/customer_HR_Analytics.csv')
print(train_df.shape)
print(test_df.shape)

(5000, 14)
(1000, 13)


##**Step-2: Combine the Train and Test File**

In [5]:
train_df['train']=1
test_df['test'] = 0

In [6]:
print(train_df.shape)
print(test_df.shape)

(5000, 15)
(1000, 14)


In [7]:
train_df.info()
print()
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           5000 non-null   int64  
 1   department            5000 non-null   object 
 2   region                5000 non-null   object 
 3   education             4788 non-null   object 
 4   gender                5000 non-null   object 
 5   recruitment_channel   5000 non-null   object 
 6   no_of_trainings       5000 non-null   int64  
 7   age                   5000 non-null   int64  
 8   previous_year_rating  4624 non-null   float64
 9   length_of_service     5000 non-null   int64  
 10  KPIs_met >80%         5000 non-null   int64  
 11  awards_won?           5000 non-null   int64  
 12  avg_training_score    5000 non-null   int64  
 13  is_promoted           5000 non-null   int64  
 14  train                 5000 non-null   int64  
dtypes: float64(1), int64(

In [8]:
combined_df  = pd.concat([train_df, test_df])
combined_df.shape

(6000, 16)
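The two separate flag columns (`train` and `test`) leave a null in every row once the files are concatenated. As a minimal sketch of an alternative, assuming toy frames and a hypothetical column name `is_train` (neither is part of the notebook), a single shared indicator keeps the flag free of nulls:

```python
import pandas as pd

# Toy stand-ins for the train and customer files (values are illustrative only)
train_toy = pd.DataFrame({"age": [35, 30], "is_promoted": [0, 1]})
cust_toy = pd.DataFrame({"age": [24, 31]})

# One shared indicator column (hypothetical name) instead of two separate flags
train_toy["is_train"] = 1
cust_toy["is_train"] = 0

combined_toy = pd.concat([train_toy, cust_toy], ignore_index=True)

# Only the target is missing for the customer rows; the indicator has no nulls
print(combined_toy.isnull().sum())
```

With this layout, the two parts can later be separated with `combined_toy["is_train"] == 1` and `== 0`.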

In [9]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           6000 non-null   int64  
 1   department            6000 non-null   object 
 2   region                6000 non-null   object 
 3   education             5752 non-null   object 
 4   gender                6000 non-null   object 
 5   recruitment_channel   6000 non-null   object 
 6   no_of_trainings       6000 non-null   int64  
 7   age                   6000 non-null   int64  
 8   previous_year_rating  5541 non-null   float64
 9   length_of_service     6000 non-null   int64  
 10  KPIs_met >80%         6000 non-null   int64  
 11  awards_won?           6000 non-null   int64  
 12  avg_training_score    6000 non-null   int64  
 13  is_promoted           5000 non-null   float64
 14  train                 5000 non-null   float64
 15  test                  

##**Step-3: Check the Data Types of the Columns as well as Missing Data**

**1) Execute the "info()" command and check datatypes of the Columns and Missing Data**

In [10]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           6000 non-null   int64  
 1   department            6000 non-null   object 
 2   region                6000 non-null   object 
 3   education             5752 non-null   object 
 4   gender                6000 non-null   object 
 5   recruitment_channel   6000 non-null   object 
 6   no_of_trainings       6000 non-null   int64  
 7   age                   6000 non-null   int64  
 8   previous_year_rating  5541 non-null   float64
 9   length_of_service     6000 non-null   int64  
 10  KPIs_met >80%         6000 non-null   int64  
 11  awards_won?           6000 non-null   int64  
 12  avg_training_score    6000 non-null   int64  
 13  is_promoted           5000 non-null   float64
 14  train                 5000 non-null   float64
 15  test                  

**2) Summarize the columnwise Missing Data**

In [11]:
combined_df.isnull().sum()

employee_id                0
department                 0
region                     0
education                248
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating     459
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted             1000
train                   1000
test                    5000
dtype: int64

**Observations:**
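* **a) The education and previous_year_rating columns have missing values, which we need to handle. The nulls in is_promoted, train and test are a by-product of combining the two files.**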

##**Step-4: Check on Data Preprocessing applicability (Initial)**


###**1) Checking the Missing Values and its Handling**

**a) Check the Missing Values, if any**

In [12]:
combined_df.isnull().sum()

employee_id                0
department                 0
region                     0
education                248
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating     459
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted             1000
train                   1000
test                    5000
dtype: int64

**b) Checking the total number of rows having the missing Values**

In [13]:
combined_df[combined_df.isnull().any(axis=1)]

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,train,test
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0.0,1.0,
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0.0,1.0,
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0.0,1.0,
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0.0,1.0,
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0.0,1.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,68388,Operations,region_22,Bachelor's,f,other,1,36,4.0,9,0,0,74,,,0.0
996,32344,Sales & Marketing,region_19,Bachelor's,m,other,1,35,3.0,5,1,0,49,,,0.0
997,71931,Legal,region_31,Bachelor's,m,other,1,35,3.0,6,0,0,61,,,0.0
998,55130,Technology,region_11,Master's & above,m,other,1,41,2.0,2,0,0,81,,,0.0


**c) Observations, Decisions and Actions**

**Observations:**
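* a) Two feature columns, education and previous_year_rating, have missing values. The nulls in is_promoted, train and test come from combining the two files, not from the source data.
* b) The filter above returns all 6000 rows of the dataset, because every row has a null in either the train or the test flag column. It therefore does not show how many rows lack feature values.
###**So, we will not drop the rows having missing values; the customer rows in particular each need a prediction.**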
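To count only the rows that lack a feature value, the helper and target columns can be left out of the check. The sketch below uses a small made-up frame with the same column roles as the combined data; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame: two train rows and two customer rows after concatenation
toy = pd.DataFrame({
    "education": ["Bachelor's", np.nan, "Master's & above", "Bachelor's"],
    "previous_year_rating": [5.0, 3.0, np.nan, 4.0],
    "is_promoted": [0.0, 1.0, np.nan, np.nan],
    "train": [1.0, 1.0, np.nan, np.nan],
    "test": [np.nan, np.nan, 0.0, 0.0],
})

# Every row has a null somewhere because of the flag columns
rows_any_null = toy.isnull().any(axis=1).sum()

# Restrict the check to the feature columns only
feature_cols = toy.columns.difference(["is_promoted", "train", "test"])
rows_feature_null = toy[feature_cols].isnull().any(axis=1).sum()

print(rows_any_null, rows_feature_null)
```

Here the first count covers all four rows, while the second counts only the two rows with a missing feature.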

**Decision and Actions:**

###**Fill the missing values of each column with the most frequent value (mode) of that column.**

**d) Imputation of Missing Values using the "fillna" command and checking**

In [14]:
# Fill each column's missing values with its most frequent value (mode).
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
# is_promoted is only missing for the customer rows, where the column is dropped again later.
for col in ['education', 'previous_year_rating', 'is_promoted']:
    combined_df[col] = combined_df[col].fillna(combined_df[col].mode().iloc[0])

In [15]:
combined_df.isnull().sum()

employee_id                0
department                 0
region                     0
education                  0
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating       0
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
train                   1000
test                    5000
dtype: int64
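Step-1 imports `SimpleImputer` but the notebook fills the gaps with `fillna`. As a hedged sketch on made-up data, the same most-frequent strategy can also be expressed with the imputer; this is an alternative illustration, not what the notebook runs.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (illustrative values only)
toy = pd.DataFrame({
    "education": ["Bachelor's", np.nan, "Bachelor's", "Master's & above"],
    "previous_year_rating": [3.0, 5.0, np.nan, 3.0],
})

# strategy="most_frequent" works for both string and numeric columns
imputer = SimpleImputer(strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

An imputer fitted on the training rows can be reused on the customer rows with `imputer.transform`, which keeps the fill values tied to the training data.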

###**2) Check the unique Values of each column and observe the following:**
* **a) Wrong Data in the columns, if any** 
* **b) Wrong format of the data in the columns, if any**
* **c) Identify the columns which need to be categorically converted to numeric values by using Nominal method/ Ordinal Method**


###**Column-1: employee_id**

In [16]:
combined_df['employee_id'].value_counts()

26623    1
11930    1
41625    1
12955    1
43676    1
        ..
71000    1
7513     1
60763    1
62812    1
24099    1
Name: employee_id, Length: 6000, dtype: int64

**Observations:**
* a) Each value in this column is a unique identifier (6000 distinct values in 6000 rows), so it will not contribute to the prediction of the dependent variable.

**Decision:**

**We will be dropping this column**

**Action:**

In [17]:
combined_df.drop(['employee_id'], axis = 1, inplace = True)
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6000 entries, 0 to 999
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   department            6000 non-null   object 
 1   region                6000 non-null   object 
 2   education             6000 non-null   object 
 3   gender                6000 non-null   object 
 4   recruitment_channel   6000 non-null   object 
 5   no_of_trainings       6000 non-null   int64  
 6   age                   6000 non-null   int64  
 7   previous_year_rating  6000 non-null   float64
 8   length_of_service     6000 non-null   int64  
 9   KPIs_met >80%         6000 non-null   int64  
 10  awards_won?           6000 non-null   int64  
 11  avg_training_score    6000 non-null   int64  
 12  is_promoted           6000 non-null   float64
 13  train                 5000 non-null   float64
 14  test                  1000 non-null   float64
dtypes: float64(4), int64(6

###**Column-2: department**

In [18]:
combined_df['department'].value_counts()

Sales & Marketing    1720
Operations           1235
Technology            883
Procurement           851
Analytics             620
Finance               249
HR                    245
R&D                   104
Legal                  93
Name: department, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Nominal" Type.

**Decision:**

**We will be converting the data in this column into Numerical values using Nominal Type method "pd.get_dummies".**

**Action:**

In [19]:
#encode the data
dept = pd.DataFrame(combined_df['department'])
dept_encoded=pd.get_dummies(data= dept, drop_first=True)
dept_encoded

Unnamed: 0,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology
0,0,0,0,0,0,0,1,0
1,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...
995,0,0,0,1,0,0,0,0
996,0,0,0,0,0,0,1,0
997,0,0,1,0,0,0,0,0
998,0,0,0,0,0,0,0,1
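With `drop_first=True`, the alphabetically first level gets no column of its own and is represented by zeros in all remaining dummy columns. That is why there is no `department_Analytics` column above. A small sketch on made-up rows:

```python
import pandas as pd

# Toy department column (illustrative values only)
dept_toy = pd.DataFrame({"department": ["Analytics", "Finance", "HR", "Analytics"]})

# drop_first=True drops the first level ("Analytics"), which becomes the baseline
encoded_toy = pd.get_dummies(dept_toy, drop_first=True)

print(encoded_toy.columns.tolist())
print(encoded_toy)
```

The "Analytics" rows come out as all zeros, so no information is lost and one redundant column is avoided.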


###**Column-3: region**

In [20]:
combined_df['region'].value_counts()

region_2     1313
region_22     793
region_7      575
region_13     318
region_15     288
region_4      250
region_26     231
region_31     181
region_27     174
region_28     155
region_16     141
region_23     130
region_11     126
region_17     112
region_29     107
region_25     106
region_19      92
region_14      89
region_30      84
region_20      79
region_32      76
region_1       73
region_5       73
region_8       65
region_6       59
region_10      54
region_12      51
region_24      45
region_21      39
region_3       31
region_34      30
region_9       30
region_33      27
region_18       3
Name: region, dtype: int64

**Observations:**
* a) We assume the data in this column will not contribute to the prediction of the dependent variable. With 34 distinct regions, one-hot encoding it would also add many sparse columns.

**Decision and Actions to be taken:**

* We will be dropping this column

**Action:**

In [21]:
combined_df.drop(['region'], axis = 1, inplace = True)

###**Column-4: education**

In [22]:
combined_df['education'].value_counts()

Bachelor's          4157
Master's & above    1754
Below Secondary       89
Name: education, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Ordinal" Type.

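**Decision:**

**We will be converting the data in this column into Numerical values using "preprocessing.LabelEncoder()". Note that LabelEncoder assigns codes in alphabetical order (Bachelor's = 0, Below Secondary = 1, Master's & above = 2), which does not follow the ordinal ranking of the education levels.**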

**Action:**

In [23]:
le = preprocessing.LabelEncoder()
combined_df['education'] = le.fit_transform(combined_df.education.values)
combined_df['education'].value_counts()

0    4157
2    1754
1      89
Name: education, dtype: int64
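The counts above show that `LabelEncoder` coded "Below Secondary" as 1, between the two higher levels. If the codes should follow the ranking of the levels, an explicit mapping is one option. The ranking in this sketch is an assumption made for illustration.

```python
import pandas as pd

# Toy education column (illustrative values only)
education_toy = pd.Series(["Bachelor's", "Master's & above", "Below Secondary", "Bachelor's"])

# Assumed ranking, lowest to highest
education_order = {"Below Secondary": 0, "Bachelor's": 1, "Master's & above": 2}

education_encoded = education_toy.map(education_order)
print(education_encoded.tolist())  # → [1, 2, 0, 1]
```

Any level missing from the mapping would become NaN, which makes unexpected values easy to spot.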

###**Column-5: Gender**

In [24]:
combined_df['gender'].value_counts()

m    4203
f    1797
Name: gender, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Nominal" Type.

**Decision:**

**We will be converting the data in this column into Numerical values using Nominal Type method "pd.get_dummies".**

**Action:**

In [25]:
#encode the data
gender = pd.DataFrame(combined_df['gender'])
gender_encoded=pd.get_dummies(data= gender, drop_first=True)
gender_encoded

Unnamed: 0,gender_m
0,0
1,1
2,1
3,1
4,1
...,...
995,0
996,1
997,1
998,1


###**Column-6: recruitment channel**

In [26]:
combined_df['recruitment_channel'].value_counts()

other       3365
sourcing    2488
referred     147
Name: recruitment_channel, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Nominal" Type.

**Decision:**

**We will be converting the data in this column into Numerical values using Nominal Type method "pd.get_dummies".**

**Action:**

In [27]:
#encode the data
recruitment_channel = pd.DataFrame(combined_df['recruitment_channel'])
recruitment_channel_encoded=pd.get_dummies(data= recruitment_channel, drop_first=True)
recruitment_channel_encoded

Unnamed: 0,recruitment_channel_referred,recruitment_channel_sourcing
0,0,1
1,0,0
2,0,1
3,0,0
4,0,0
...,...,...
995,0,0
996,0,0
997,0,0
998,0,0


###**Column-7 to 14 : no_of_trainings, age, previous_year_rating, length_of_service, KPIs_met, awards_won, avg_training_score, is_promoted**

In [28]:
combined_df.describe()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,train,test
count,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,5000.0,1000.0
mean,0.5995,1.2275,34.720167,3.534,5.876333,0.47,0.052,66.205167,0.377833,1.0,0.0
std,0.908242,0.55178,7.532152,1.185219,4.253123,0.499141,0.222046,14.395437,0.484886,0.0,0.0
min,0.0,1.0,20.0,1.0,1.0,0.0,0.0,39.0,0.0,1.0,0.0
25%,0.0,1.0,29.0,3.0,3.0,0.0,0.0,53.0,0.0,1.0,0.0
50%,0.0,1.0,33.0,3.0,5.0,0.0,0.0,63.0,0.0,1.0,0.0
75%,2.0,1.0,38.0,5.0,8.0,1.0,0.0,80.0,1.0,1.0,0.0
max,2.0,7.0,60.0,5.0,34.0,1.0,1.0,99.0,1.0,1.0,0.0


**Observations:**
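* a) Here, all the integer and float columns are described.
* b) The count, mean, standard deviation, min and max of each column fall within plausible ranges.
* c) We can assume that there is no wrong data and no wrong data format.
* **d) The columns are on very different scales (for example, age runs from 20 to 60 while the flag columns are 0 or 1), so we need to do Scaling.**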

##**Step-6: Drop the columns which are to be categorically converted and include their respective converted Numeric Values**

In [29]:
combined_df.drop(['department', 'gender', 'recruitment_channel'], axis = 1, inplace = True)
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   education             6000 non-null   int64  
 1   no_of_trainings       6000 non-null   int64  
 2   age                   6000 non-null   int64  
 3   previous_year_rating  6000 non-null   float64
 4   length_of_service     6000 non-null   int64  
 5   KPIs_met >80%         6000 non-null   int64  
 6   awards_won?           6000 non-null   int64  
 7   avg_training_score    6000 non-null   int64  
 8   is_promoted           6000 non-null   float64
 9   train                 5000 non-null   float64
 10  test                  1000 non-null   float64
dtypes: float64(4), int64(7)
memory usage: 562.5 KB


In [30]:
combined_df = pd.concat([combined_df,dept_encoded, gender_encoded, recruitment_channel_encoded], axis=1)
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6000 entries, 0 to 999
Data columns (total 22 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   education                     6000 non-null   int64  
 1   no_of_trainings               6000 non-null   int64  
 2   age                           6000 non-null   int64  
 3   previous_year_rating          6000 non-null   float64
 4   length_of_service             6000 non-null   int64  
 5   KPIs_met >80%                 6000 non-null   int64  
 6   awards_won?                   6000 non-null   int64  
 7   avg_training_score            6000 non-null   int64  
 8   is_promoted                   6000 non-null   float64
 9   train                         5000 non-null   float64
 10  test                          1000 non-null   float64
 11  department_Finance            6000 non-null   uint8  
 12  department_HR                 6000 non-null   uint8  
 13  depa

##**Step-5: Segregate the Train and Test Data**

In [31]:
# Filter by flag, then drop the helper columns.
# A non-inplace drop returns a new frame and avoids SettingWithCopyWarning on the filtered slice.
train_df1 = combined_df[combined_df["train"] == 1].drop(["train", "test"], axis=1)
test_df1 = combined_df[combined_df["test"] == 0].drop(["test", "train", "is_promoted"], axis=1)

In [32]:
train_df1.shape

(5000, 20)

In [33]:
test_df1.shape

(1000, 19)

##**Step-5: Slice X and y Values**

In [34]:
X = train_df1.drop(['is_promoted'], axis = 1)
y = train_df1['is_promoted']
X.head()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,2,1,35,5.0,8,1,0,49,0,0,0,0,0,0,1,0,0,0,1
1,0,1,30,5.0,4,0,0,60,0,0,0,1,0,0,0,0,1,0,0
2,0,1,34,3.0,7,0,0,50,0,0,0,0,0,0,1,0,1,0,1
3,0,2,39,1.0,10,0,0,50,0,0,0,0,0,0,1,0,1,0,0
4,0,1,45,3.0,2,0,0,73,0,0,0,0,0,0,0,1,1,0,0


In [35]:
y.head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: is_promoted, dtype: float64

In [36]:
columnNames = X.columns.tolist()  # reuse the exact column names of X instead of retyping them

In [37]:
std_scaler_object = preprocessing.StandardScaler()
X1 = std_scaler_object.fit_transform(X)
X1 = pd.DataFrame(X1 , columns = columnNames)
X1.head()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,1.531419,-0.406222,0.043929,1.210876,0.497985,1.013694,-0.247681,-1.226037,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,1.596439,-0.419095,-1.510958,-0.163383,1.182891
1,-0.664164,-0.406222,-0.622875,1.210876,-0.436058,-0.986491,-0.247681,-0.466453,-0.205185,-0.202524,-0.123404,1.947784,-0.413142,-0.131507,-0.626394,-0.419095,0.661832,-0.163383,-0.845386
2,-0.664164,-0.406222,-0.089432,-0.487645,0.264474,-0.986491,-0.247681,-1.156984,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,1.596439,-0.419095,0.661832,-0.163383,1.182891
3,-0.664164,1.420317,0.577372,-2.186166,0.965007,-0.986491,-0.247681,-1.156984,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,1.596439,-0.419095,0.661832,-0.163383,-0.845386
4,-0.664164,-0.406222,1.377536,-0.487645,-0.90308,-0.986491,-0.247681,0.431236,-0.205185,-0.202524,-0.123404,-0.513404,-0.413142,-0.131507,-0.626394,2.386093,0.661832,-0.163383,-0.845386
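`StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below checks this against a manual computation on a few made-up age values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-column feature (illustrative values only)
ages = np.array([[35.0], [30.0], [34.0], [39.0], [45.0]])

scaled = StandardScaler().fit_transform(ages)

# Manual z-score: (x - mean) / std, with the population std (ddof=0)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, each column has mean 0 and standard deviation 1, which keeps wide-ranged features from dominating the distance computations inside the SVC.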


##**Step-6: Execute Train-Test-Split Command and Verify**

In [38]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size = 0.2, random_state = 66)

In [39]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(4000, 19)
(4000,)
(1000, 19)
(1000,)
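The split above is purely random. If the class proportions should be the same in both parts, `train_test_split` accepts a `stratify` argument. This is an optional variation shown on made-up data, not what the notebook runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 rows, 30 of class 0 and 20 of class 1
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([0] * 30 + [1] * 20)

# stratify=y_toy keeps the 60/40 class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=66, stratify=y_toy
)

print(X_tr.shape, X_te.shape, y_te.sum())
```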


##**Step-7: Learn the Data and Predict the dependent Variable values for the "X_test" data using "SVC()" algorithm**

In [40]:
from sklearn.svm import SVC
svc_clf = SVC(kernel = 'rbf', random_state = 0)
svc_clf.fit(X_train, y_train)

SVC(random_state=0)

In [41]:
#predictions
y_pred = svc_clf.predict(X_test)

In [42]:
svc_Train_acc=svc_clf.score(X_train,y_train)
svc_Test_acc=svc_clf.score(X_test,y_test)

##**Step-8: Calculate the Accuracy of the Model**

In [43]:
print('Accuracy on training set:',svc_Train_acc)
print('Accuracy on test set:',svc_Test_acc)

Accuracy on training set: 0.80725
Accuracy on test set: 0.769


##**Step-9: Display the Confusion Matrix and Classification Report of the Model**

In [44]:
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  

[[374 176]
 [ 55 395]]
              precision    recall  f1-score   support

         0.0       0.87      0.68      0.76       550
         1.0       0.69      0.88      0.77       450

    accuracy                           0.77      1000
   macro avg       0.78      0.78      0.77      1000
weighted avg       0.79      0.77      0.77      1000
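The figures in the classification report can be recomputed directly from the confusion matrix printed above, where rows are the true classes and columns are the predicted classes:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[374, 176],
               [55, 395]])

precision = np.diag(cm) / cm.sum(axis=0)  # correct predictions / all predictions of that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct predictions / all true members of that class
accuracy = np.trace(cm) / cm.sum()

print(precision.round(2), recall.round(2), round(accuracy, 3))
```

The results match the report: precision 0.87 and 0.69, recall 0.68 and 0.88, and accuracy 0.769.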



##**Step-10: SVC Algorithm Parameters Fine Tuning using GridSearch CV Method**

In [45]:
model_params = {
     'svc': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,5,10,15,20],
            'kernel': ['rbf','linear','sigmoid']
        }  
    },
 }

In [46]:
from sklearn.model_selection import GridSearchCV
import pandas as pd
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(X1, y)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df

Unnamed: 0,model,best_score,best_params
0,svc,0.7886,"{'C': 5, 'kernel': 'linear'}"
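The grid search above runs on data that was scaled once, before cross-validation. As a hedged sketch on synthetic data, putting the scaler and the classifier into a `Pipeline` lets each fold fit its own scaler on its training part only. The data and the smaller grid here are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the HR features (not the notebook's data)
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# The scaler is refit inside every cross-validation fold
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(gamma="auto"))])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {"svc__C": [1, 5, 10], "svc__kernel": ["rbf", "linear"]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the whole pipeline refit on all the data, so it can be applied to unscaled new rows directly.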


In [47]:
#SV Classifier
svc_grid_acc = cross_val_score(SVC(C=5, kernel='linear', gamma = 'auto'),X1, y, cv=5)
print("svc_grid_acc (CV_based) :", svc_grid_acc)
svc_grid_acc_avg=np.average(svc_grid_acc)
print()
print("svc_grid_acc_avg : ", svc_grid_acc_avg)

svc_grid_acc (CV_based) : [0.778 0.792 0.781 0.798 0.794]

svc_grid_acc_avg :  0.7886


In [48]:
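# Note: this fits a second scaler on the customer data. Reusing the scaler fitted on X
# (transform only) would put both datasets on the same scale.
std_scaler_object = preprocessing.StandardScaler()
test_df2 = std_scaler_object.fit_transform(test_df1)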
test_df3 = pd.DataFrame(test_df2 , columns = columnNames)
test_df3.head()

Unnamed: 0,education,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,department_Finance,department_HR,department_Legal,department_Operations,department_Procurement,department_R&D,department_Sales & Marketing,department_Technology,gender_m,recruitment_channel_referred,recruitment_channel_sourcing
0,-0.639989,-0.442438,-1.426281,-0.276913,-1.200003,1.350873,-0.153432,0.99441,-0.222076,-0.224544,-0.135388,-0.487467,-0.372763,-0.139169,-0.671847,2.520504,0.614337,-0.131507,1.214598
1,-0.639989,-0.442438,-0.516,-0.276913,-0.224589,-0.740262,-0.153432,-0.914528,-0.222076,4.453463,-0.135388,-0.487467,-0.372763,-0.139169,-0.671847,-0.396746,-1.62777,-0.131507,-0.823318
2,-0.639989,-0.442438,-0.516,-1.940054,-0.468442,-0.740262,-0.153432,-1.208211,-0.222076,-0.224544,-0.135388,-0.487467,-0.372763,-0.139169,1.488433,-0.396746,0.614337,-0.131507,-0.823318
3,-0.639989,3.055097,-0.516,-1.108484,0.750825,-0.740262,-0.153432,0.113362,-0.222076,-0.224544,-0.135388,-0.487467,2.682671,-0.139169,-0.671847,-0.396746,-1.62777,-0.131507,-0.823318
4,-0.639989,-0.442438,-0.64604,0.554658,0.263118,-0.740262,-0.153432,-0.180321,4.502954,-0.224544,-0.135388,-0.487467,-0.372763,-0.139169,-0.671847,-0.396746,0.614337,-0.131507,1.214598
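Fitting a separate scaler on the customer data maps it with its own mean and standard deviation, not with those the model was trained on. The sketch below, on made-up numbers, contrasts reusing the training scaler with refitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative values only)
train_toy = np.array([[20.0], [30.0], [40.0]])
cust_toy = np.array([[30.0], [50.0]])

# Fit once on the training data, then only transform the customer data
scaler = StandardScaler().fit(train_toy)
cust_scaled = scaler.transform(cust_toy)

# Refitting on the customer data gives different values for the same rows
cust_refit = StandardScaler().fit_transform(cust_toy)

print(cust_scaled.ravel(), cust_refit.ravel())
```

A raw value of 30 maps to 0 with the training scaler (the training mean is 30) but to -1 when the scaler is refit on the customer rows, so the model would see the same employee differently.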


In [49]:
#predictions for Customer Data
cust_data_pred = svc_clf.predict(test_df3)

In [50]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           1000 non-null   int64  
 1   department            1000 non-null   object 
 2   region                1000 non-null   object 
 3   education             964 non-null    object 
 4   gender                1000 non-null   object 
 5   recruitment_channel   1000 non-null   object 
 6   no_of_trainings       1000 non-null   int64  
 7   age                   1000 non-null   int64  
 8   previous_year_rating  917 non-null    float64
 9   length_of_service     1000 non-null   int64  
 10  KPIs_met >80%         1000 non-null   int64  
 11  awards_won?           1000 non-null   int64  
 12  avg_training_score    1000 non-null   int64  
 13  test                  1000 non-null   int64  
dtypes: float64(1), int64(8), object(5)
memory usage: 109.5+ KB


In [51]:
test_df.drop(["test"], axis=1, inplace=True)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           1000 non-null   int64  
 1   department            1000 non-null   object 
 2   region                1000 non-null   object 
 3   education             964 non-null    object 
 4   gender                1000 non-null   object 
 5   recruitment_channel   1000 non-null   object 
 6   no_of_trainings       1000 non-null   int64  
 7   age                   1000 non-null   int64  
 8   previous_year_rating  917 non-null    float64
 9   length_of_service     1000 non-null   int64  
 10  KPIs_met >80%         1000 non-null   int64  
 11  awards_won?           1000 non-null   int64  
 12  avg_training_score    1000 non-null   int64  
dtypes: float64(1), int64(7), object(5)
memory usage: 101.7+ KB


In [52]:
test_df["is_promoted"]=cust_data_pred
print(test_df.shape)
test_df.head()

(1000, 14)


Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,8724,Technology,region_26,Bachelor's,m,sourcing,1,24,,1,1,0,77,1.0
1,74430,HR,region_4,Bachelor's,f,other,1,31,3.0,5,0,0,51,0.0
2,72255,Sales & Marketing,region_13,Bachelor's,m,other,1,31,1.0,4,0,0,47,0.0
3,38562,Procurement,region_2,Bachelor's,f,other,3,31,2.0,9,0,0,65,0.0
4,64486,Finance,region_29,Bachelor's,m,sourcing,1,30,4.0,7,0,0,61,0.0


In [53]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           1000 non-null   int64  
 1   department            1000 non-null   object 
 2   region                1000 non-null   object 
 3   education             964 non-null    object 
 4   gender                1000 non-null   object 
 5   recruitment_channel   1000 non-null   object 
 6   no_of_trainings       1000 non-null   int64  
 7   age                   1000 non-null   int64  
 8   previous_year_rating  917 non-null    float64
 9   length_of_service     1000 non-null   int64  
 10  KPIs_met >80%         1000 non-null   int64  
 11  awards_won?           1000 non-null   int64  
 12  avg_training_score    1000 non-null   int64  
 13  is_promoted           1000 non-null   float64
dtypes: float64(2), int64(7), object(5)
memory usage: 109.5+ KB


In [54]:
test_df['is_promoted']=test_df['is_promoted'].astype(str)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           1000 non-null   int64  
 1   department            1000 non-null   object 
 2   region                1000 non-null   object 
 3   education             964 non-null    object 
 4   gender                1000 non-null   object 
 5   recruitment_channel   1000 non-null   object 
 6   no_of_trainings       1000 non-null   int64  
 7   age                   1000 non-null   int64  
 8   previous_year_rating  917 non-null    float64
 9   length_of_service     1000 non-null   int64  
 10  KPIs_met >80%         1000 non-null   int64  
 11  awards_won?           1000 non-null   int64  
 12  avg_training_score    1000 non-null   int64  
 13  is_promoted           1000 non-null   object 
dtypes: float64(1), int64(7), object(6)
memory usage: 109.5+ KB


In [55]:
# Map the stringified predictions to Y/N labels; assigning the result back avoids inplace replace on a column selection
test_df['is_promoted'] = test_df['is_promoted'].replace({"1.0": "Y", "0.0": "N"})

In [56]:
test_df.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,8724,Technology,region_26,Bachelor's,m,sourcing,1,24,,1,1,0,77,Y
1,74430,HR,region_4,Bachelor's,f,other,1,31,3.0,5,0,0,51,N
2,72255,Sales & Marketing,region_13,Bachelor's,m,other,1,31,1.0,4,0,0,47,N
3,38562,Procurement,region_2,Bachelor's,f,other,3,31,2.0,9,0,0,65,N
4,64486,Finance,region_29,Bachelor's,m,sourcing,1,30,4.0,7,0,0,61,N


In [57]:
from google.colab import files
test_df.to_csv("gdrive/My Drive/Datasets/Customer_HR_Analysis_with_Predicted_Status_Values.csv", index = False)