In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
df=pd.read_csv('INX_Future_Inc_Employee.CSV')
df

# Domain Analysis

INX Future Inc. is a globally recognized analytics and automation firm. Despite its status as a top employer, the company is facing a decline in employee performance and client satisfaction.

The goal is to:

1. Identify the factors influencing employee performance

2. Build a predictive model for future employee performance (e.g., for hiring)

3. Provide data-driven recommendations to improve performance

The dataset contains 1200 records, each representing an employee.

There are 28 features (columns), including demographic information, job roles, experience, training, and satisfaction scores, plus the target: PerformanceRating.

**1. Emp Number:** Unique employee identifier (like an ID)

**2. Age:** Age of the employee (in years)

**3. Gender:** Gender of the employee (e.g., Male, Female)

**4. Education Background:** Field or domain of the employee's educational qualification (e.g., Marketing, Medical)

**5. Marital Status:** Marital status (e.g., Single, Married, Divorced)

**6. Emp Department:** Department where the employee works (e.g., Sales, Development, Human Resources)

**7. Emp Job Role:** Specific job title or role (e.g., Sales Executive, Data Scientist, Manager)

**8. Business Travel Frequency:** How often the employee travels for business (e.g., Travel_Rarely, Travel_Frequently)

**9. Distance From Home:** Distance between home and office (in kilometers or miles)

**10. Emp Education Level:** Educational level (ordinal: 1=Below College, 2=College, 3=Bachelor, 4=Master, 5=Doctorate)

**11. Emp Environment Satisfaction:** Satisfaction with work environment (scale: 1–4; 1=Low, 2=Medium, 3=High, 4=Very High)

**12. Emp Job Satisfaction:** Satisfaction with current job role (scale: 1–4; 1=Low, 2=Medium, 3=High, 4=Very High)

**13. Emp Hourly Rate:** The employee's hourly wage (in monetary units).

**14. Emp Job Involvement:** Degree of the employee's engagement in their role (scale: 1–4; 1=Low, 2=Medium, 3=High, 4=Very High). Higher involvement often correlates with better performance.

**15. Emp Job Level:** Job seniority level in the organization (e.g., 1 = Entry, 5 = Executive). Often aligns with responsibility and expected performance.

**16. Over Time:** Whether the employee works overtime (Yes or No). Can impact work-life balance and performance (either positively or negatively).

**17. Num Companies Worked:** Total number of companies the employee has previously worked for. May indicate experience or instability.

**18. Emp Last Salary Hike Percent:** Percentage increase in salary during the last hike. Can be a reward for good performance or influence motivation.

**19. Emp Relationship Satisfaction:** Satisfaction with relationships at work (scale: 1–4; 1=Low, 2=Medium, 3=High, 4=Very High)

**20. Total Work Experience In Years:** Total years of work experience

**21. Training Times Last Year:** Number of training programs attended in the last year

**22. Emp Work Life Balance:** Work-life balance rating (1=Bad, 2=Good, 3=Better, 4=Best)

**23. Experience Years At This Company:** Number of years the employee has been at INX

**24. Experience Years In Current Role:** Number of years in their current job role

**25. Years Since Last Promotion:** Number of years since the last promotion

**26. Years With Curr Manager:** Number of years under the current manager

**27. Attrition:** Whether the employee has left the company (Yes) or is still with the company (No)

**28. Performance Rating:** Employee performance rating (typically 1–4), where 1=Low, 2=Good, 3=Excellent, 4=Outstanding


# Basic Checks

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.describe(include='O')

In [None]:
df['EmpDepartment'].unique()

In [None]:
df['MaritalStatus'].unique()

In [None]:
df['EmpJobRole'].unique()

In [None]:
df['Attrition'].unique()

In [None]:
df['EducationBackground'].unique()

In [None]:
df['BusinessTravelFrequency'].unique()

In [None]:
df['OverTime'].unique()
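The per-column `unique()` calls above can be collapsed into a single pass over all non-numeric columns. A minimal sketch, assuming a small toy frame (`toy`, with illustrative values that are not taken from the dataset) in place of the real `df`:

```python
import pandas as pd

# Toy stand-in for the real dataset; values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "OverTime": ["No", "Yes", "No"],
    "Age": [32, 41, 28],
})

# Collect the distinct values of every non-numeric column in one pass.
unique_values = {
    col: sorted(toy[col].unique())
    for col in toy.select_dtypes(exclude="number").columns
}

for col, values in unique_values.items():
    print(f"{col}: {values}")
```

On the real data, replacing `toy` with `df` should list the categories of every text column at once.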

# Exploratory Data Analysis (EDA)

### Univariate Analysis
#### Categorical Feature Distributions

In [None]:
sns.countplot(x=df.Gender)
plt.title('Distribution of Gender')
plt.show()

In [None]:
sns.countplot(x=df.MaritalStatus)
plt.title('Distribution of Marital Status')
plt.show()

In [None]:
plt.figure(figsize=(12,3))
sns.countplot(x=df.EmpDepartment)
plt.title('Distribution of Emp Department')
plt.show()

In [None]:
plt.figure(figsize=(12,3))
sns.countplot(x=df.EmpJobRole)
plt.title('Distribution of Emp Job Role')
plt.xticks(rotation=45)
plt.show()

In [None]:
sns.countplot(x=df.BusinessTravelFrequency)
plt.title('Distribution of Business Travel Frequency')
plt.show()

In [None]:
sns.countplot(x=df.OverTime)
plt.title('Distribution of Over Time')
plt.show()

#### INSIGHTS:

**1. Gender**
* Distribution is likely imbalanced, with more male employees than female.
* This may or may not influence performance directly.

**2. Marital Status**
* Most employees are married.
* Divorced employees are the smallest group.
* Single employees might have higher flexibility for travel or overtime, but not necessarily higher performance.

**3. Emp Department**
* Sales, Development, and Research & Development likely have the largest employee bases.
* Some departments show a lower average performance rating.

**4. Emp Job Role**
* Sales Executive appears very frequently, possibly dominating the workforce.
* Roles like Manager, Data Scientist, and Senior Developer might be fewer but hold higher responsibility.

**5. Business Travel Frequency**
* Most employees likely travel rarely, with fewer traveling frequently.
* Frequent travel could relate to burnout or reduced job satisfaction, depending on support, and so could affect performance.

**6. OverTime**
* Majority of employees do not work overtime.
* Roughly 25–30% of employees work overtime.
* Overtime could signal dedication, but might also lead to burnout or lower job satisfaction over time.
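Shares such as "roughly 25–30%" can be read off directly rather than estimated from the count plot. A minimal sketch using `value_counts(normalize=True)`, assuming a toy `OverTime` column whose values are illustrative only:

```python
import pandas as pd

# Toy stand-in for df['OverTime']; values are illustrative only.
toy = pd.DataFrame({"OverTime": ["No", "No", "Yes", "No"]})

# normalize=True returns fractions instead of counts; convert to percentages.
overtime_share = toy["OverTime"].value_counts(normalize=True).mul(100).round(1)
print(overtime_share)
```

The same call on any of the categorical columns gives the exact percentage behind each bar.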

#### Numerical Feature Distributions

In [None]:
numerical_cols = ['Age', 'DistanceFromHome', 'EmpHourlyRate', 'TotalWorkExperienceInYears',
                  'ExperienceYearsAtThisCompany', 'YearsSinceLastPromotion']

df[numerical_cols].hist(bins=20,figsize=(15, 10),layout=(3,2))
plt.suptitle('Numerical Feature Distributions')
plt.tight_layout()
plt.show()

#### INSIGHTS:
**1. Age**
* Most employees fall between 30 and 40 years.
* Distribution is close to normal, slightly right-skewed.

**2. Distance From Home**
* Sharp right-skew: Most employees live within 0–5 km of work.
* Fewer employees live more than 10 km away.

**3. Emp Hourly Rate**
* Fairly uniform distribution between 30 and 100.
* No strong peaks—suggests a balanced pay structure.

**4. Total Work Experience in Years**
* Right-skewed: Majority have 5–15 years of experience.
* A few employees have 30+ years, but these are outliers.
* Suggests mid-level professionals dominate, with few freshers.

**5. Experience Years at this Company**
* Strong right-skew: Most employees have been at INX for <10 years, peaking at 3–5 years.
* Few employees have very long tenures (>20 years).

**6. Years Since Last Promotion**
* Majority have 0–2 years since last promotion.
* A long tail of employees who haven’t been promoted in 6+ years.
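The skew described in these bullets can be quantified with `DataFrame.skew()`. A minimal sketch, assuming two toy columns (one with a long right tail, one symmetric) whose values are illustrative and not taken from the dataset:

```python
import pandas as pd

# Toy columns: one with a long right tail, one symmetric.
toy = pd.DataFrame({
    "DistanceFromHome": [1, 2, 2, 3, 25],
    "Age": [30, 35, 40, 45, 50],
})

# Positive values indicate a right tail; values near 0 indicate symmetry.
skewness = toy.skew()
print(skewness)
```

A common rule of thumb treats an absolute skewness above 1 as strongly skewed, though the threshold is a convention rather than a fixed rule.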



### Bivariate Analysis

In [None]:
categorical_cols = ['Gender', 'MaritalStatus', 'EmpDepartment', 'EmpJobRole',
                    'BusinessTravelFrequency', 'EducationBackground', 'EmpWorkLifeBalance']

for col in categorical_cols:
    plt.figure(figsize=(10,4))
    sns.countplot(data=df, x=col, hue='PerformanceRating')
    plt.title(f'{col} vs. Performance Rating')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

#### INSIGHTS:
**1. Gender vs. Performance Rating**
* If both males and females have similar distributions across all performance ratings → no significant gender-based performance difference.
* If one gender is underrepresented in higher ratings → may indicate bias or role-based imbalance.

**2. Marital Status vs. Performance Rating**
* Married employees might show higher stability and performance in some organizations.

* If single/divorced employees dominate lower ratings → could indicate work-life challenges, especially if linked to long hours or travel.

**3. EmpDepartment vs. Performance Rating**
* Certain departments may have higher proportions of top performers due to technical roles with clear KPIs.
* Others might show more mid-level performance, possibly due to subjective evaluations.

**4. EmpJobRole vs. Performance Rating**
* Sales Executives or entry-level roles might show a wider spread, with more average ratings.

**5. BusinessTravelFrequency vs. Performance Rating**
* Employees who Travel Frequently may either perform higher (client-facing roles) or lower (burnout/fatigue).
* If Travel_Rarely employees have more top ratings, travel could be impacting productivity.

**6. EducationBackground vs. Performance Rating**
* Certain education backgrounds (e.g., Medical, Life Sciences) might correlate with better ratings, depending on the role.
* If Marketing or Humanities backgrounds show lower ratings in technical roles, this may reflect skill mismatch rather than capability.

**7. EmpWorkLifeBalance vs. Performance Rating**
* If employees with poor work-life balance consistently score high, this might indicate overworking or burnout risk.
* If those with excellent balance have higher ratings → healthy workplace culture.
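Raw counts make groups of different sizes hard to compare, so the "if ... then" readings above are easier to check with row-normalized proportions. A minimal sketch using `pd.crosstab`, assuming a toy frame whose values are illustrative only:

```python
import pandas as pd

# Toy stand-in for two columns of df; values are illustrative only.
toy = pd.DataFrame({
    "EmpDepartment": ["Sales", "Sales", "Sales", "Development", "Development"],
    "PerformanceRating": [3, 3, 2, 3, 4],
})

# normalize="index" makes each row sum to 1, giving the rating mix per department.
rating_share = pd.crosstab(
    toy["EmpDepartment"], toy["PerformanceRating"], normalize="index"
)
print(rating_share.round(2))
```

Each row then shows what fraction of that group received each rating, independent of group size.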

In [None]:
numerical_cols = ['Age', 'DistanceFromHome', 'TotalWorkExperienceInYears',
                  'TrainingTimesLastYear', 'ExperienceYearsAtThisCompany',
                  'ExperienceYearsInCurrentRole', 'YearsSinceLastPromotion',
                  'YearsWithCurrManager']

for col in numerical_cols:
    plt.figure(figsize=(6,4))
    sns.barplot(data=df, x='PerformanceRating', y=col)
    plt.title(f'{col} vs. Performance Rating')
    plt.tight_layout()
    plt.show()

# Data Preprocessing

In [None]:
df

In [None]:
# checking for missing values
df.isnull().sum()

In [None]:
# There are no missing values.

In [None]:
# Checking for duplicates
df.duplicated().sum()

In [None]:
# There are no duplicates.

#### Encoding

In [None]:
df.info()

In [None]:
# Nominal categorical columns (label-encoded in a later cell; one-hot encoding is the usual choice for nominal data)
nominal_cols = [
    'Gender',
    'EducationBackground',
    'MaritalStatus',
    'EmpDepartment',
    'EmpJobRole',
]

# Ordinal categorical column → Label Encoding (manual mapping)
ordinal_cols = ['BusinessTravelFrequency']

# Binary columns → Label Encoding (map Yes/No to 1/0)
binary_cols = ['OverTime', 'Attrition']

In [None]:
# Define mapping for BusinessTravelFrequency
travel_map = {'Non-Travel': 0, 'Travel_Rarely': 1, 'Travel_Frequently': 2}
df['BusinessTravelFrequency'] = df['BusinessTravelFrequency'].map(travel_map)

In [None]:
binary_map = {'No': 0, 'Yes': 1}

for col in binary_cols:
    df[col] = df[col].map(binary_map)

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for col in nominal_cols:
    df[col] = le.fit_transform(df[col])
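`LabelEncoder` assigns integer codes, which imposes an arbitrary order on nominal categories. One-hot encoding avoids that. A minimal sketch with `pd.get_dummies`, assuming a toy frame whose values are illustrative only; this is an alternative, not what the notebook's models were trained on:

```python
import pandas as pd

# Toy stand-in for a nominal column; values are illustrative only.
toy = pd.DataFrame({
    "MaritalStatus": ["Single", "Married", "Divorced"],
    "Age": [30, 40, 50],
})

# drop_first=True removes one level per column to avoid redundant indicators.
encoded = pd.get_dummies(toy, columns=["MaritalStatus"], drop_first=True, dtype=int)
print(encoded)
```

Distance-based and linear models (KNN, SVM, logistic regression) are the ones most affected by the choice between integer codes and indicator columns.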

In [None]:
df

In [None]:
# Checking for outliers

In [None]:
plt.figure(figsize=(20,25))
plotnumber=1
categorical_column=[ 'EmpNumber','Gender','EducationBackground','MaritalStatus','EmpDepartment',
                    'EmpJobRole','BusinessTravelFrequency','EmpEducationLevel','EmpEnvironmentSatisfaction',
                    'EmpJobInvolvement','EmpJobLevel','EmpJobSatisfaction','OverTime','EmpRelationshipSatisfaction',
                    'EmpWorkLifeBalance','Attrition','PerformanceRating'  ]
for column in df.drop(categorical_column,axis=1):
    if plotnumber<12:
        ax=plt.subplot(4,3,plotnumber)
        sns.boxplot(x=df[column])
        plt.xlabel(column,fontsize=10)
        plotnumber+=1
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()
plt.show()

In [None]:
# Handling Outliers for Total Work Experience in Years
sns.boxplot(x=df.TotalWorkExperienceInYears)
plt.title('Boxplot for Total Work Experience in Years')
plt.show()

In [None]:
df['TotalWorkExperienceInYears'].skew()

In [None]:
Q1=df['TotalWorkExperienceInYears'].quantile(0.25)
Q1

In [None]:
Q3=df['TotalWorkExperienceInYears'].quantile(0.75)
Q3

In [None]:
IQR=Q3-Q1
IQR

In [None]:
lower_bound=Q1-(1.5*IQR)
lower_bound

In [None]:
df.loc[df['TotalWorkExperienceInYears']<lower_bound]

In [None]:
upper_bound=Q3+(1.5*IQR)
upper_bound

In [None]:
df.loc[df['TotalWorkExperienceInYears']>upper_bound]

In [None]:
len(df.loc[df['TotalWorkExperienceInYears']>upper_bound])/len(df)*100

In [None]:
df.loc[df['TotalWorkExperienceInYears']>upper_bound,'TotalWorkExperienceInYears']=df['TotalWorkExperienceInYears'].median()

In [None]:
# Outliers Handled.
sns.boxplot(x=df.TotalWorkExperienceInYears)
plt.title('Boxplot for Total Work Experience in Years')
plt.show()

In [None]:
# Handling Outliers for Training Time Last Year
sns.boxplot(x=df.TrainingTimesLastYear)
plt.title('Boxplot for Training Time Last Year')
plt.show()

In [None]:
df['TrainingTimesLastYear'].skew()

In [None]:
Q1=df['TrainingTimesLastYear'].quantile(0.25)
Q1

In [None]:
Q3=df['TrainingTimesLastYear'].quantile(0.75)
Q3

In [None]:
IQR=Q3-Q1
IQR

In [None]:
lower_bound=Q1-(1.5*IQR)
lower_bound

In [None]:
df.loc[df['TrainingTimesLastYear']<lower_bound]

In [None]:
len(df.loc[df['TrainingTimesLastYear']<lower_bound])/len(df)*100

In [None]:
df.loc[df['TrainingTimesLastYear']<lower_bound,'TrainingTimesLastYear']=df['TrainingTimesLastYear'].median()

In [None]:
upper_bound=Q3+(1.5*IQR)
upper_bound

In [None]:
df.loc[df['TrainingTimesLastYear']>upper_bound]

In [None]:
len(df.loc[df['TrainingTimesLastYear']>upper_bound])/len(df)*100

In [None]:
# The share of outliers is greater than 5%, so we keep these values as they are.

In [None]:
# Handling outliers for Experience Years at this Company
sns.boxplot(x=df.ExperienceYearsAtThisCompany)
plt.title('Boxplot for Experience Years at This Company')
plt.show()

In [None]:
df['ExperienceYearsAtThisCompany'].skew()

In [None]:
Q1=df['ExperienceYearsAtThisCompany'].quantile(0.25)
Q1

In [None]:
Q3=df['ExperienceYearsAtThisCompany'].quantile(0.75)
Q3

In [None]:
IQR=Q3-Q1
IQR

In [None]:
lower_bound=Q1-(1.5*IQR)
lower_bound

In [None]:
df.loc[df['ExperienceYearsAtThisCompany']<lower_bound]

In [None]:
upper_bound=Q3+(1.5*IQR)
upper_bound

In [None]:
df.loc[df['ExperienceYearsAtThisCompany']>upper_bound]

In [None]:
len(df.loc[df['ExperienceYearsAtThisCompany']>upper_bound])/len(df)*100

In [None]:
df.loc[df['ExperienceYearsAtThisCompany']>upper_bound,'ExperienceYearsAtThisCompany']=df['ExperienceYearsAtThisCompany'].median()

In [None]:
# Outliers Handled
sns.boxplot(x=df.ExperienceYearsAtThisCompany)
plt.title('Boxplot for Experience Years at This Company')
plt.show()

In [None]:
# Handling Outliers for Experience Years in Current Role
sns.boxplot(x=df.ExperienceYearsInCurrentRole)
plt.title('Boxplot for Experience Years in Current Role')
plt.show()

In [None]:
df['ExperienceYearsInCurrentRole'].skew()

In [None]:
Q1=df['ExperienceYearsInCurrentRole'].quantile(0.25)
Q1

In [None]:
Q3=df['ExperienceYearsInCurrentRole'].quantile(0.75)
Q3

In [None]:
IQR=Q3-Q1
IQR

In [None]:
lower_bound=Q1-(1.5*IQR)
lower_bound

In [None]:
df.loc[df['ExperienceYearsInCurrentRole']<lower_bound]

In [None]:
upper_bound=Q3+(1.5*IQR)
upper_bound

In [None]:
df.loc[df['ExperienceYearsInCurrentRole']>upper_bound]

In [None]:
len(df.loc[df['ExperienceYearsInCurrentRole']>upper_bound])/len(df)*100

In [None]:
df.loc[df['ExperienceYearsInCurrentRole']>upper_bound,'ExperienceYearsInCurrentRole']=df['ExperienceYearsInCurrentRole'].median()

In [None]:
# Outliers Handled
sns.boxplot(x=df.ExperienceYearsInCurrentRole)
plt.title('Boxplot for Experience Years in Current Role')
plt.show()

In [None]:
# Handling Outliers for Years Since Last promotion
sns.boxplot(x=df.YearsSinceLastPromotion)
plt.title('Boxplot for Years Since Last Promotion')
plt.show()

In [None]:
df['YearsSinceLastPromotion'].skew()

In [None]:
Q1=df['YearsSinceLastPromotion'].quantile(0.25)
Q1

In [None]:
Q3=df['YearsSinceLastPromotion'].quantile(0.75)
Q3

In [None]:
IQR=Q3-Q1
IQR

In [None]:
lower_bound=Q1-(1.5*IQR)
lower_bound

In [None]:
df.loc[df['YearsSinceLastPromotion']<lower_bound]

In [None]:
upper_bound=Q3+(1.5*IQR)
upper_bound

In [None]:
df.loc[df['YearsSinceLastPromotion']>upper_bound]

In [None]:
len(df.loc[df['YearsSinceLastPromotion']>upper_bound])/len(df)*100

In [None]:
# The share of outliers is greater than 5%, so we keep these values as they are.

In [None]:
# Handling Outliers for Years with Current Manager
sns.boxplot(x=df.YearsWithCurrManager)
plt.title('Boxplot for Years with Current Manager')
plt.show()

In [None]:
df['YearsWithCurrManager'].skew()

In [None]:
Q1=df['YearsWithCurrManager'].quantile(0.25)
Q1

In [None]:
Q3=df['YearsWithCurrManager'].quantile(0.75)
Q3

In [None]:
IQR=Q3-Q1
IQR

In [None]:
lower_bound=Q1-(1.5*IQR)
lower_bound

In [None]:
df.loc[df['YearsWithCurrManager']<lower_bound]

In [None]:
upper_bound=Q3+(1.5*IQR)
upper_bound

In [None]:
df.loc[df['YearsWithCurrManager']>upper_bound]

In [None]:
len(df.loc[df['YearsWithCurrManager']>upper_bound])/len(df)*100

In [None]:
# Here too the share of outliers is greater than 5%, so we keep these values as they are.
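The same quartile, IQR, and bound calculation is repeated for each column above, so it can be wrapped in a small helper. A minimal sketch: the function names `iqr_bounds` and `outlier_share` are hypothetical, and the toy series is illustrative only.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_share(series, k=1.5):
    """Return the percentage of values falling outside the IQR fences."""
    lower, upper = iqr_bounds(series, k)
    mask = (series < lower) | (series > upper)
    return mask.mean() * 100


# Toy data with one extreme value; illustrative only.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])
lower, upper = iqr_bounds(toy)
print(lower, upper, outlier_share(toy))
```

Applied to each numeric column in a loop, the helper gives the outlier percentage that the 5% rule above is compared against.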

In [None]:
df

In [None]:
# Scaling

In [None]:
numeric_cols = [
    'Age',
    'DistanceFromHome',
    'EmpEducationLevel',
    'TotalWorkExperienceInYears',
    'TrainingTimesLastYear',
    'ExperienceYearsAtThisCompany',
    'ExperienceYearsInCurrentRole',
    'YearsSinceLastPromotion',
    'YearsWithCurrManager'
]

In [None]:
from sklearn.preprocessing import RobustScaler

# Work on a copy of the DataFrame
scaler = RobustScaler()
df = df.copy()

# Apply scaling (note: the scaler is fitted on the full dataset, before the train/test split)
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
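Fitting the scaler on the full dataset lets test-set statistics influence the transformation. A common alternative is to split first and fit on the training rows only. A minimal sketch, assuming a randomly generated toy frame (not the real data) and the same `RobustScaler`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Randomly generated toy data; not taken from the dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(20, 60, size=50),
    "DistanceFromHome": rng.integers(1, 30, size=50),
})

train, test = train_test_split(toy, test_size=0.2, random_state=42)

# Learn the median and IQR from the training rows only.
scaler = RobustScaler().fit(train)

# Reuse the training statistics to transform both splits.
train_scaled = pd.DataFrame(scaler.transform(train), columns=toy.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=toy.columns, index=test.index)
print(train_scaled.median())
```

The training medians come out at zero by construction, while the test split is shifted by whatever the training statistics happen to be.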

In [None]:
df

In [None]:
# SMOTE

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

In [None]:
# Separate features and target
x = df.drop(columns=['EmpNumber','PerformanceRating'])  # features (identifier and target removed)
y = df['PerformanceRating']                 # target

In [None]:
# Split first
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)

# Apply SMOTE only on training data
smote = SMOTE(random_state=42)
x_train_smote, y_train_smote = smote.fit_resample(x_train, y_train)

In [None]:
from collections import Counter
print('Actual classes ',Counter(y_train))
print('Smote classes ',Counter(y_train_smote))
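With an imbalanced target, a plain random split can leave the test set with a different class mix from the full data. Passing `stratify` to `train_test_split` keeps the proportions aligned. A minimal sketch, assuming toy labels with a 70/20/10 mix that is illustrative only:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target (70/20/10); illustrative only.
y = pd.Series([3] * 70 + [2] * 20 + [4] * 10)
X = pd.DataFrame({"feature": range(100)})

# stratify=y preserves the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_test.value_counts())
```

Oversampling such as SMOTE would still be applied to the training split only, as in the cell above.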

# Feature Selection

In [None]:
# Drop non-numeric or irrelevant columns (like 'EmpNumber')
df_numeric = df.drop(columns=['EmpNumber'])

# Compute full correlation matrix
corr_matrix = df_numeric.corr()

# Get correlations with PerformanceRating, sort them
target_corr = corr_matrix['PerformanceRating'].drop('PerformanceRating')
top_features = target_corr.abs().sort_values(ascending=False).head(6).index  # top 6 most correlated features

# Include 'PerformanceRating' itself in the filtered list
top_corr_features = list(top_features) + ['PerformanceRating']

# Create a smaller correlation matrix with only top features
filtered_corr = df_numeric[top_corr_features].corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(filtered_corr, annot=True, cmap='coolwarm', fmt=".2f", square=True)
plt.title('Heatmap of Top Correlated Features with PerformanceRating')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

### INSIGHTS:

1. **Work Environment Has a Clear Impact**
* EmpEnvironmentSatisfaction has the highest positive correlation with PerformanceRating.
* Action: Improve workplace conditions, team dynamics, and management support.

2. **Recognition Through Pay Hikes Helps**
* A higher salary hike (EmpLastSalaryHikePercent) is positively correlated with performance.
* Action: Ensure fair, timely raises for deserving employees. Reward good performance visibly.

3. **Stagnation Is Linked to Lower Performance**
* Both YearsSinceLastPromotion and ExperienceYearsInCurrentRole have negative correlations with PerformanceRating.
* Action: Identify employees who have been in the same role or without promotion for long — they might be at risk of disengagement.

4. **Departmental Variation Matters**
* Different departments may exhibit different performance trends.
* Action: Drill down by department to identify training or process gaps.

5. **Long Tenure ≠ High Performance**
* Contrary to common assumptions, longer tenure (ExperienceYearsAtThisCompany) shows a slight negative correlation with performance.
* Action: Provide career development opportunities and role variety for long-time employees.

**Conclusion**

The most actionable levers to improve performance at INX Future Inc. are:

* Enhancing employee satisfaction (especially environment-related)
* Fair and timely compensation
* Career progression paths
* Department-level performance audits
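Because PerformanceRating and the satisfaction scores are ordinal, a rank-based correlation is a reasonable cross-check on the Pearson values in the heatmap. A minimal sketch with `DataFrame.corr(method="spearman")`, assuming a toy frame whose values are illustrative and not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for three columns of df; values are illustrative only.
toy = pd.DataFrame({
    "EmpEnvironmentSatisfaction": [1, 2, 3, 4, 4, 2],
    "YearsSinceLastPromotion": [7, 5, 2, 0, 1, 6],
    "PerformanceRating": [2, 2, 3, 4, 3, 2],
})

# Spearman correlation works on ranks, so it only assumes a monotonic relationship.
spearman = toy.corr(method="spearman")["PerformanceRating"].drop("PerformanceRating")
print(spearman.sort_values(ascending=False))
```

Either coefficient describes association only; neither shows that changing a feature would change the rating.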

# Model Creation

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
# Raise max_iter above the default of 100 so the solver has room to converge
LR=LogisticRegression(max_iter=1000)
LR.fit(x_train_smote,y_train_smote)

In [None]:
y_train_predict=LR.predict(x_train_smote)
y_train_predict

In [None]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,recall_score,precision_score,f1_score

In [None]:
print(confusion_matrix(y_train_smote,y_train_predict))

In [None]:
print(classification_report(y_train_smote,y_train_predict))

In [None]:
y_test_predict=LR.predict(x_test)
y_test_predict

In [None]:
print(confusion_matrix(y_test,y_test_predict))

In [None]:
accuracy_LR=accuracy_score(y_test,y_test_predict)
recall_LR=recall_score(y_test,y_test_predict,average='weighted')
precision_LR=precision_score(y_test,y_test_predict,average='weighted')
f1_score_LR=f1_score(y_test,y_test_predict,average='weighted')

print(accuracy_LR)
print(recall_LR)
print(precision_LR)
print(f1_score_LR)

In [None]:
print(classification_report(y_test,y_test_predict))
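The same four metrics are computed for every model, so they can be gathered by one helper. A minimal sketch: the function name `summarize` is hypothetical, and the labels are toy values rather than model output.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def summarize(y_true, y_pred, average="weighted"):
    """Return accuracy, recall, precision, and F1 for one set of predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred, average=average, zero_division=0),
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "f1": f1_score(y_true, y_pred, average=average, zero_division=0),
    }


# Toy labels; illustrative only.
y_true = [2, 3, 3, 4, 3, 2]
y_pred = [2, 3, 3, 3, 3, 2]

weighted = summarize(y_true, y_pred)
macro = summarize(y_true, y_pred, average="macro")
print(weighted)
print(macro)
```

Weighted recall is always equal to accuracy, so it adds no information here. With an imbalanced target, `average="macro"` gives each class equal weight and shows how the minority ratings are handled.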

### K Nearest Neighbor (KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Finding optimal value for k to determine how many nearest neighbors to find

In [None]:
error_rate=[]
for i in range (1,11):
    knn=KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train_smote,y_train_smote)
    y_pred_i=knn.predict(x_test)
    error_rate.append(np.mean(y_pred_i!=y_test))

error_rate

In [None]:
# Let's plot k value and error rate
plt.plot(range(1,11),error_rate,linestyle='dashed',marker='o',markerfacecolor='blue')
plt.title('error_rate vs k value')
plt.xlabel('k value')
plt.ylabel('error_rate')
plt.show()
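Choosing k from the test-set error means the test set is no longer an independent check. One alternative is to select k by cross-validation on the training data. A minimal sketch, assuming synthetic data from `make_classification` in place of `x_train_smote` and `y_train_smote`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the training set.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Mean 5-fold cross-validated error for each candidate k.
cv_error = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    accuracy = cross_val_score(knn, X, y, cv=5, scoring="accuracy").mean()
    cv_error[k] = 1 - accuracy

best_k = min(cv_error, key=cv_error.get)
print(best_k, cv_error[best_k])
```

The test set is then used once, after k is fixed, to report the final score.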

In [None]:
# Let's fit the data into KNN model
KNN=KNeighborsClassifier(n_neighbors=7)
KNN.fit(x_train_smote,y_train_smote)

In [None]:
y_train_predict=KNN.predict(x_train_smote)
y_train_predict

In [None]:
print(confusion_matrix(y_train_smote,y_train_predict))

In [None]:
print(classification_report(y_train_smote,y_train_predict))

In [None]:
y_test_predict=KNN.predict(x_test)
y_test_predict

In [None]:
print(confusion_matrix(y_test,y_test_predict))

In [None]:
accuracy_KNN=accuracy_score(y_test,y_test_predict)
recall_KNN=recall_score(y_test,y_test_predict,average='weighted')
precision_KNN=precision_score(y_test,y_test_predict,average='weighted')
f1_score_KNN=f1_score(y_test,y_test_predict,average='weighted')

print(accuracy_KNN)
print(recall_KNN)
print(precision_KNN)
print(f1_score_KNN)

In [None]:
print(classification_report(y_test,y_test_predict))

###  Support Vector Machine (SVM)

In [None]:
from sklearn.svm import SVC
# Use a distinct variable name so the SVC class is not shadowed by its instance
svc_model=SVC()
svc_model.fit(x_train_smote,y_train_smote)

In [None]:
y_train_predict=svc_model.predict(x_train_smote)
y_train_predict

In [None]:
print(confusion_matrix(y_train_smote,y_train_predict))

In [None]:
print(classification_report(y_train_smote,y_train_predict))

In [None]:
y_test_predict=svc_model.predict(x_test)
y_test_predict

In [None]:
print(confusion_matrix(y_test,y_test_predict))

In [None]:
accuracy_SVM=accuracy_score(y_test,y_test_predict)
recall_SVM=recall_score(y_test,y_test_predict,average='weighted')
precision_SVM=precision_score(y_test,y_test_predict,average='weighted')
f1_score_SVM=f1_score(y_test,y_test_predict,average='weighted')

print(accuracy_SVM)
print(recall_SVM)
print(precision_SVM)
print(f1_score_SVM)

In [None]:
print(classification_report(y_test,y_test_predict))

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
DTC=DecisionTreeClassifier()
DTC.fit(x_train_smote,y_train_smote)

In [None]:
y_train_predict=DTC.predict(x_train_smote)
y_train_predict

In [None]:
print(confusion_matrix(y_train_smote,y_train_predict))

In [None]:
print(classification_report(y_train_smote,y_train_predict))

In [None]:
y_test_predict=DTC.predict(x_test)
y_test_predict

In [None]:
print(confusion_matrix(y_test,y_test_predict))

In [None]:
accuracy_DT=accuracy_score(y_test,y_test_predict)
recall_DT=recall_score(y_test,y_test_predict,average='weighted')
precision_DT=precision_score(y_test,y_test_predict,average='weighted')
f1_score_DT=f1_score(y_test,y_test_predict,average='weighted')

print(accuracy_DT)
print(recall_DT)
print(precision_DT)
print(f1_score_DT)

In [None]:
print(classification_report(y_test,y_test_predict))

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
RFC=RandomForestClassifier()
RFC.fit(x_train_smote,y_train_smote)

In [None]:
y_train_predict=RFC.predict(x_train_smote)
y_train_predict

In [None]:
print(confusion_matrix(y_train_smote,y_train_predict))

In [None]:
print(classification_report(y_train_smote,y_train_predict))

In [None]:
y_test_predict=RFC.predict(x_test)
y_test_predict

In [None]:
print(confusion_matrix(y_test,y_test_predict))

In [None]:
accuracy_RF=accuracy_score(y_test,y_test_predict)
recall_RF=recall_score(y_test,y_test_predict,average='weighted')
precision_RF=precision_score(y_test,y_test_predict,average='weighted')
f1_score_RF=f1_score(y_test,y_test_predict,average='weighted')

print(accuracy_RF)
print(recall_RF)
print(precision_RF)
print(f1_score_RF)

In [None]:
print(classification_report(y_test,y_test_predict))

### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
GBC=GradientBoostingClassifier()
GBC.fit(x_train_smote,y_train_smote)

In [None]:
y_train_predict=GBC.predict(x_train_smote)
y_train_predict

In [None]:
print(confusion_matrix(y_train_smote,y_train_predict))

In [None]:
print(classification_report(y_train_smote,y_train_predict))

In [None]:
y_test_predict=GBC.predict(x_test)
y_test_predict

In [None]:
print(confusion_matrix(y_test,y_test_predict))

In [None]:
accuracy_GB=accuracy_score(y_test,y_test_predict)
recall_GB=recall_score(y_test,y_test_predict,average='weighted')
precision_GB=precision_score(y_test,y_test_predict,average='weighted')
f1_score_GB=f1_score(y_test,y_test_predict,average='weighted')

print(accuracy_GB)
print(recall_GB)
print(precision_GB)
print(f1_score_GB)

In [None]:
print(classification_report(y_test,y_test_predict))

In [None]:
!pip install tabulate

In [None]:
from tabulate import tabulate

data=[['Logistic Regression',accuracy_LR,recall_LR,precision_LR,f1_score_LR],
      ['KNN',accuracy_KNN,recall_KNN,precision_KNN,f1_score_KNN],
      ['SVM',accuracy_SVM,recall_SVM,precision_SVM,f1_score_SVM],
      ['Decision Tree',accuracy_DT,recall_DT,precision_DT,f1_score_DT],
      ['Random Forest',accuracy_RF,recall_RF,precision_RF,f1_score_RF],
      ['Gradient Boosting',accuracy_GB,recall_GB,precision_GB,f1_score_GB]]

column_names=['Algorithms','Accuracy','Recall','Precision','f1 Score']

print(tabulate(data, headers = column_names, tablefmt = "fancy_grid"))

# Summary

### Key Findings
1. **It’s a Classification Problem**
* The target variable, PerformanceRating, is categorical (values: 2, 3, 4).
* Problem addressed using supervised classification models.

2. **Best Performing Models**
* Random Forest and Gradient Boosting both achieved ~93% accuracy on the held-out test set.
* These results were obtained after handling class imbalance with SMOTE and applying feature scaling.
* Both models are strong candidates for predicting employee performance.

3. **Top Factors Affecting Performance**

   From the correlation analysis:
* EmpEnvironmentSatisfaction : Strong Positive
* EmpLastSalaryHikePercent : Positive
* YearsSinceLastPromotion : Negative
* ExperienceYearsInCurrentRole : Slight Negative
* DistanceFromHome : Weak Negative
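Correlation captures only linear, one-feature-at-a-time relationships, so a tree model's importances are a useful second opinion on this ranking. A minimal sketch with `RandomForestClassifier.feature_importances_`, assuming synthetic data in which one column determines the label and the other is pure noise:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 'informative' determines the label, 'noise' does not.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = (X["informative"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the real data, the equivalent call on the fitted `RFC` with `x_train_smote.columns` as the index would give a ranking to compare against the correlation-based list.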

### Strategic Recommendations
1. Boost Work Environment & Recognition
* Higher satisfaction and timely salary hikes are tied to better performance.
* Improve work-life balance and implement fair appraisal systems.

2. Encourage Growth & Promotions
* Long periods in the same role or without a promotion are linked to reduced performance.
* Create internal mobility and growth programs.

3. Data-Driven HR Practices
* Deploy the trained model in the recruitment pipeline to predict likely top performers.
* Continuously monitor performance indicators for early intervention.

# Conclusion

The data science solution effectively identifies performance patterns and enables INX Future Inc. to:

* Predict employee performance with about 93% accuracy,

* Improve retention of high performers,

* Make informed talent decisions,

* Preserve its reputation as a top employer.