### Problem statement
In recruitment, HR teams often need to judge whether a candidate is overstating their current salary. For example, a candidate claims 5 years of experience and a monthly income of 70,000 as a regional manager, and expects an offer above that previous CTC (cost to company). We need a way to check such claims: is 70,000 a month plausible for a regional manager with 5 years of experience, or does the candidate likely earn less? Build Decision Tree and Random Forest models with monthly income as the target variable.

### Business Objective
The objective is to develop a predictive model to estimate the expected monthly income based on factors like position and years of experience, allowing HR teams to assess whether a candidate’s claimed income aligns with the model's estimates.

In [3]:
import pandas as pd
import numpy as np

# Load the HR dataset: position, years of experience, monthly income
df = pd.read_csv("HR_DT.csv")
df.head()

| | Position of the employee | no of Years of Experience of employee | monthly income of employee |
|---|---|---|---|
| 0 | Business Analyst | 1.1 | 39343 |
| 1 | Junior Consultant | 1.3 | 46205 |
| 2 | Senior Consultant | 1.5 | 37731 |
| 3 | Manager | 2.0 | 43525 |
| 4 | Country Manager | 2.2 | 39891 |


In [4]:
df.shape

(196, 3)

In [5]:
df.describe()

| | no of Years of Experience of employee | monthly income of employee |
|---|---|---|
| count | 196.0 | 196.0 |
| mean | 5.112245 | 74194.923469 |
| std | 2.783993 | 26731.578387 |
| min | 1.0 | 37731.0 |
| 25% | 3.0 | 56430.0 |
| 50% | 4.1 | 63831.5 |
| 75% | 7.1 | 98273.0 |
| max | 10.5 | 122391.0 |


In [9]:
df.isnull().sum()

Position of the employee                 0
no of Years of Experience of employee    0
 monthly income of employee              0
dtype: int64

### Label Encoding

In [12]:
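from sklearn.preprocessing import LabelEncoder

# Replace each position name with an integer code.
# LabelEncoder numbers the distinct names in sorted (alphabetical) order.
le = LabelEncoder()
df['Position of the employee'] = le.fit_transform(df['Position of the employee'])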

In [13]:
df.head()

| | Position of the employee | no of Years of Experience of employee | monthly income of employee |
|---|---|---|---|
| 0 | 0 | 1.1 | 39343 |
| 1 | 4 | 1.3 | 46205 |
| 2 | 8 | 1.5 | 37731 |
| 3 | 5 | 2.0 | 43525 |
| 4 | 3 | 2.2 | 39891 |
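
To make the encoding step concrete, here is a minimal sketch of how `LabelEncoder` assigns codes, using only the five position names visible in the first `df.head()` output. Because this demo encoder (`le_demo`, a name introduced here for illustration) is fitted on just those five names, its codes differ from the notebook's, where the encoder is fitted on the full column. The encoder sorts the distinct values and numbers them from 0, so the integers reflect alphabetical order, not seniority.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative subset: the five position names shown in df.head()
positions = ["Business Analyst", "Junior Consultant", "Senior Consultant",
             "Manager", "Country Manager"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(positions)

# A name's code is its index in the sorted list of distinct names (le_demo.classes_)
mapping = {name: int(code) for name, code in zip(positions, codes)}
print(mapping)

# inverse_transform recovers the original names from the integer codes
decoded = list(le_demo.inverse_transform(codes))
print(decoded)
```

Keeping the fitted encoder around (`le` in the notebook) makes it possible to encode a new candidate's position with `le.transform([...])` before calling `predict`.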


### Split the data

In [17]:
# Strip stray whitespace from column names (the income column has a leading space)
df.columns = df.columns.str.strip()
df.columns

Index(['Position of the employee', 'no of Years of Experience of employee',
       'monthly income of employee'],
      dtype='object')

In [19]:
# Features: encoded position and years of experience; target: monthly income
X = df[['Position of the employee', 'no of Years of Experience of employee']]
y = df['monthly income of employee']

In [21]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
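
As a quick check on what this split produces, the sketch below applies the same `test_size` and `random_state` to 196 placeholder row indices (standing in for the real rows, which are not needed to count sizes). With a fractional `test_size`, scikit-learn rounds the test count up, so 196 rows give 40 test rows and 156 training rows, which matches the 40 predictions shown later in the notebook.

```python
from sklearn.model_selection import train_test_split

# Placeholder row indices standing in for the 196 rows of the dataset
rows = list(range(196))

train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_rows), len(test_rows))  # 156 40
```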

### Train the models

In [23]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Single decision tree with default settings (no depth limit)
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# Random forest: averages the predictions of 100 trees
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

### Predict on the test set

In [26]:
# Predicted monthly income for each held-out row, one array per model
y_pred_dt = dt_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)

In [28]:
y_pred_dt

array([ 93940. , 113812. ,  66029. ,  67938. ,  39891. ,  39891. ,
        63218. , 105582. ,  81363. ,  93940. ,  60150. ,  67938. ,
        56437.5,  39343. , 105582. , 109431. , 105582. ,  63218. ,
        57081. ,  56642. , 101302. ,  43525. ,  81363. ,  67938. ,
       105582. ,  59445. ,  57189. , 113812. ,  39343. ,  64445. ,
       101302. , 121872. ,  37731. , 109431. ,  83088. ,  60150. ,
       116969. ,  93940. ,  37731. ,  66029. ])

In [30]:
y_pred_rf

array([ 93053.26      , 113906.71      ,  65034.4       ,  66371.49      ,
        39770.66      ,  40248.92      ,  61908.73333333, 106807.28      ,
        81592.52      ,  93163.36      ,  59667.04      ,  67428.54      ,
        56500.28766667,  39795.24      , 107636.46      , 109457.6       ,
       106807.28      ,  62918.53333333,  57124.54      ,  58226.80333333,
       101491.62      ,  43343.3       ,  81674.        ,  67466.72      ,
       107219.42      ,  59315.47      ,  57596.98866667, 113989.18      ,
        39464.12      ,  62860.36666667, 101491.62      , 121882.38      ,
        38931.76      , 109457.6       ,  82984.5       ,  58708.12      ,
       116227.21      ,  93053.26      ,  38078.64      ,  65652.53      ])

### Evaluate the models

In [32]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Decision tree: MAE, MSE, and RMSE (square root of MSE) on the test set
mae_dt = mean_absolute_error(y_test, y_pred_dt)
mse_dt = mean_squared_error(y_test, y_pred_dt)
rmse_dt = np.sqrt(mse_dt)

# Random forest: same metrics
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
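
As a quick illustration of what the two metrics measure, the sketch below computes MAE and RMSE by hand on a small set of made-up incomes (the numbers are hypothetical, not taken from `HR_DT.csv`). A single large miss raises RMSE much more than MAE, which is why the two metrics can rank models differently.

```python
import math

# Hypothetical actual and predicted monthly incomes (illustration only)
actual    = [40000, 60000, 80000, 100000]
predicted = [40000, 60000, 80000,  92000]   # three exact hits, one miss of 8000

errors = [a - p for a, p in zip(actual, predicted)]

# MAE: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(mae)   # 2000.0
print(rmse)  # 4000.0
```

Here one miss of 8,000 among four predictions gives an MAE of 2,000 but an RMSE of 4,000, so an RMSE well above the MAE suggests that a few predictions are far off while most are close.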


In [34]:
print("\nDecision Tree Performance:")
print(f"Mean Absolute Error (MAE): {mae_dt}")
print(f"Root Mean Squared Error (RMSE): {rmse_dt}")

print("\nRandom Forest Performance:")
print(f"Mean Absolute Error (MAE): {mae_rf}")
print(f"Root Mean Squared Error (RMSE): {rmse_rf}")


Decision Tree Performance:
Mean Absolute Error (MAE): 1330.1875
Root Mean Squared Error (RMSE): 3471.5292748657616

Random Forest Performance:
Mean Absolute Error (MAE): 1693.846308333334
Root Mean Squared Error (RMSE): 3371.667484542051
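
On this test set the decision tree has the lower MAE, while the random forest has a slightly lower RMSE. One way to connect either model back to the business objective is to compare a claimed income with the model's estimate and flag claims that exceed it by more than some tolerance. The sketch below is an assumption about how such a check could look, not part of the original notebook: the function name `assess_claim` and the 10% default tolerance are hypothetical, and the estimate used in the example is a made-up number standing in for a model prediction.

```python
def assess_claim(claimed_income, estimated_income, tolerance=0.10):
    """Compare a claimed monthly income with a model estimate.

    tolerance is a hypothetical relative margin (10% by default); a claim
    more than that fraction above the estimate is flagged for review.
    """
    if estimated_income <= 0:
        raise ValueError("estimated_income must be positive")
    # Relative gap: positive when the claim is above the estimate
    gap = (claimed_income - estimated_income) / estimated_income
    return {"gap": gap, "flag_for_review": gap > tolerance}


# Made-up estimate standing in for a model prediction
result = assess_claim(claimed_income=70000, estimated_income=60000)
print(result)
```

In the notebook, the estimate could come from `rf_model.predict` on a one-row DataFrame with the same two feature columns as `X`, with the position encoded through `le.transform`. Since the test-set MAE is roughly 1,300 to 1,700, any tolerance should be chosen with that error in mind.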
