# **Project 4: Attrition Rate Calculator**

***Pandas*** : *a Python library for manipulating and analyzing structured data. It provides data structures such as the DataFrame, along with functions for loading, cleaning, and transforming tabular data, and is widely used in data science and data analysis work.*

***Warnings*** : *a module in Python's standard library that alerts developers to potential issues in the code without stopping program execution. It can flag deprecated features, runtime concerns, or bad practices, and it provides control over how warnings are shown, filtered, or suppressed at runtime.*

# **Data Loading and Inspection**

**This Python script uses Pandas to read a CSV file named "Attrition-Rate.csv" and store it as a DataFrame. It then displays the first 5 rows of the DataFrame for quick data preview.**

**The warning line tells Python to ignore all warning messages for the rest of the session. This keeps the output clean, but it also hides deprecation notices that may matter when the libraries are upgraded.**

In [38]:
import pandas as pd
import warnings

warnings.simplefilter("ignore")

df = pd.read_csv("/content/Attrition-Rate.csv")
df.head(5)

Unnamed: 0.1,Unnamed: 0,Location,Emp. Group,Function,Gender,Tenure,Experience (YY.MM),Marital Status,Age in YY.,Hiring Source,Promoted/Non Promoted,Job Role Match,Stay/Left
0,1,Pune,B2,Operation,Male,0.0,6.08,Single,27.12,Direct,Non Promoted,Yes,Left
1,2,Noida,B7,Support,Male,0.0,13.0,Marr.,38.08,Direct,Promoted,No,Stay
2,3,Bangalore,B3,Operation,Male,0.01,16.05,Marr.,36.04,Direct,Promoted,Yes,Stay
3,4,Noida,B2,Operation,Male,0.01,6.06,Marr.,32.07,Direct,Promoted,Yes,Stay
4,5,Lucknow,B2,Operation,Male,0.0,7.0,Marr.,32.05,Direct,Non Promoted,Yes,Stay
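**As a hedged alternative to silencing warnings globally, the sketch below suppresses them only inside a "with" block using warnings.catch_warnings(). The noisy() helper is hypothetical and simply stands in for any library call that emits a warning.**

```python
import warnings


def noisy():
    # Hypothetical helper: stands in for a library call that emits a warning.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Warnings are ignored only inside this block; the previous filters are
# restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    quiet_result = noisy()

# Outside the block, warnings can be captured (or displayed) again.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()

print(quiet_result, len(caught))
```

**Scoping the filter this way keeps noisy cells tidy while leaving warnings visible everywhere else in the notebook.**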


**This code calculates and returns the total count of missing values for each column in the DataFrame 'df'.**

In [39]:
df.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
Location,0
Emp. Group,0
Function,0
Gender,0
Tenure,0
Experience (YY.MM),4
Marital Status,0
Age in YY.,0
Hiring Source,0


**These lines compute the median of 'Experience (YY.MM)' column and fill the missing values with it. Then, they find the mode of 'Job Role Match' column and replace missing values with it. It's a common practice in data preprocessing to handle missing values using statistical measures.**

In [40]:
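# Assign the filled column back instead of calling fillna(..., inplace=True) on a
# column selection, which may not update df in newer pandas versions.
median_experience = df['Experience (YY.MM)'].median()
df['Experience (YY.MM)'] = df['Experience (YY.MM)'].fillna(median_experience)

# mode() returns a Series (there can be ties), so take the first value.
mode_job_role = df['Job Role Match'].mode()[0]
df['Job Role Match'] = df['Job Role Match'].fillna(mode_job_role)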
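**The same median/mode imputation can be tried on a tiny DataFrame to see exactly which values get filled in. This is a minimal sketch: the "toy" DataFrame and its column names are made up for illustration and are not part of the attrition dataset.**

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): one numeric and one categorical column,
# each with a single missing value.
toy = pd.DataFrame({
    "experience": [2.0, np.nan, 6.0, 10.0],
    "role_match": ["Yes", "Yes", None, "No"],
})

# Numeric column: fill with the median of the observed values (2, 6, 10).
toy["experience"] = toy["experience"].fillna(toy["experience"].median())

# Categorical column: fill with the most frequent observed value.
toy["role_match"] = toy["role_match"].fillna(toy["role_match"].mode()[0])

print(toy)
print(toy.isnull().sum())
```

**The median is less sensitive to outliers than the mean, which is one common reason to prefer it for skewed numeric columns.**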

**Checking if the null values are filled or not.**

In [None]:
df.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
Location,0
Emp. Group,0
Function,0
Gender,0
Tenure,0
Experience (YY.MM),0
Marital Status,0
Age in YY.,0
Hiring Source,0


**This line removes the column labeled "Unnamed: 0" from the DataFrame 'df' in place, effectively dropping it from the dataset. It's often used to eliminate unnecessary or redundant columns during data preprocessing.**

In [None]:
df.drop(columns=["Unnamed: 0"], inplace=True)

**Returns all columns in the DataFrame "df".**

In [None]:
df.columns

Index(['Location', 'Emp. Group', 'Function', 'Gender ', 'Tenure',
       'Experience (YY.MM)', 'Marital Status', 'Age in YY.', 'Hiring Source',
       'Promoted/Non Promoted', 'Job Role Match', 'Stay/Left'],
      dtype='object')

**This line displays the first 5 rows of the DataFrame 'df', providing a quick preview of the data after the specified operations, such as dropping a column and handling missing values.**

In [None]:
df.head(5)

Unnamed: 0,Location,Emp. Group,Function,Gender,Tenure,Experience (YY.MM),Marital Status,Age in YY.,Hiring Source,Promoted/Non Promoted,Job Role Match,Stay/Left
0,Pune,B2,Operation,Male,0.0,6.08,Single,27.12,Direct,Non Promoted,Yes,Left
1,Noida,B7,Support,Male,0.0,13.0,Marr.,38.08,Direct,Promoted,No,Stay
2,Bangalore,B3,Operation,Male,0.01,16.05,Marr.,36.04,Direct,Promoted,Yes,Stay
3,Noida,B2,Operation,Male,0.01,6.06,Marr.,32.07,Direct,Promoted,Yes,Stay
4,Lucknow,B2,Operation,Male,0.0,7.0,Marr.,32.05,Direct,Non Promoted,Yes,Stay


# **Data Cleaning**

**This code utilizes Pandas' "get_dummies()" function to convert the categorical 'Location' column into dummy/indicator variables, dropping the first category to prevent multicollinearity issues. It then displays the first 5 rows of the modified DataFrame, showcasing the transformed data.**

In [None]:
df = pd.get_dummies(df, columns=["Location"], drop_first=True)
df.head(5)

Unnamed: 0,Emp. Group,Function,Gender,Tenure,Experience (YY.MM),Marital Status,Age in YY.,Hiring Source,Promoted/Non Promoted,Job Role Match,...,Location_Gurgaon,Location_Hyderabad,Location_Kolkata,Location_Lucknow,Location_Madurai,Location_Mumbai,Location_Nagpur,Location_Noida,Location_Pune,Location_Vijayawada
0,B2,Operation,Male,0.0,6.08,Single,27.12,Direct,Non Promoted,Yes,...,False,False,False,False,False,False,False,False,True,False
1,B7,Support,Male,0.0,13.0,Marr.,38.08,Direct,Promoted,No,...,False,False,False,False,False,False,False,True,False,False
2,B3,Operation,Male,0.01,16.05,Marr.,36.04,Direct,Promoted,Yes,...,False,False,False,False,False,False,False,False,False,False
3,B2,Operation,Male,0.01,6.06,Marr.,32.07,Direct,Promoted,Yes,...,False,False,False,False,False,False,False,True,False,False
4,B2,Operation,Male,0.0,7.0,Marr.,32.05,Direct,Non Promoted,Yes,...,False,False,False,True,False,False,False,False,False,False
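**To make the effect of drop_first=True concrete, here is a small sketch on a made-up "city" column (the data is illustrative, not from the attrition file). The first category in sorted order gets no column of its own and is represented by all zeros. The Bangalore row in the preview above shows False in every visible Location_* column, which is consistent with 'Bangalore' being the dropped baseline category.**

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({"city": ["Pune", "Noida", "Agra", "Noida"]})

# drop_first=True removes the first category in sorted order ("Agra"),
# and dtype=int yields 0/1 instead of True/False.
encoded = pd.get_dummies(toy, columns=["city"], drop_first=True, dtype=int)

print(encoded)
```

**Passing dtype=int is optional; it is one way to get integer indicators directly rather than converting them in a later step.**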


**This command returns the columns in the DataFrame to check whether the original "Location" column has been replaced by the encoded "Location_*" columns.**

In [None]:
df.columns

Index(['Emp. Group', 'Function', 'Gender ', 'Tenure', 'Experience (YY.MM)',
       'Marital Status', 'Age in YY.', 'Hiring Source',
       'Promoted/Non Promoted', 'Job Role Match', 'Stay/Left',
       'Location_Chennai', 'Location_Gurgaon', 'Location_Hyderabad',
       'Location_Kolkata', 'Location_Lucknow', 'Location_Madurai',
       'Location_Mumbai', 'Location_Nagpur', 'Location_Noida', 'Location_Pune',
       'Location_Vijayawada'],
      dtype='object')

**This code converts the columns listed in "columns_to_encode" from boolean (True/False) indicators, as produced by one-hot encoding, to integers (1/0). It then displays the first 5 rows of the DataFrame with these columns converted. This step is often done before feeding data into machine learning models that require numerical inputs.**

In [None]:
columns_to_encode = ['Location_Chennai', 'Location_Gurgaon', 'Location_Hyderabad',
                     'Location_Kolkata', 'Location_Lucknow', 'Location_Madurai',
                     'Location_Mumbai', 'Location_Nagpur', 'Location_Noida', 'Location_Pune',
                     'Location_Vijayawada']

df[columns_to_encode] = df[columns_to_encode].astype(int)

df.head(5)

Unnamed: 0,Emp. Group,Function,Gender,Tenure,Experience (YY.MM),Marital Status,Age in YY.,Hiring Source,Promoted/Non Promoted,Job Role Match,...,Location_Gurgaon,Location_Hyderabad,Location_Kolkata,Location_Lucknow,Location_Madurai,Location_Mumbai,Location_Nagpur,Location_Noida,Location_Pune,Location_Vijayawada
0,B2,Operation,Male,0.0,6.08,Single,27.12,Direct,Non Promoted,Yes,...,0,0,0,0,0,0,0,0,1,0
1,B7,Support,Male,0.0,13.0,Marr.,38.08,Direct,Promoted,No,...,0,0,0,0,0,0,0,1,0,0
2,B3,Operation,Male,0.01,16.05,Marr.,36.04,Direct,Promoted,Yes,...,0,0,0,0,0,0,0,0,0,0
3,B2,Operation,Male,0.01,6.06,Marr.,32.07,Direct,Promoted,Yes,...,0,0,0,0,0,0,0,1,0,0
4,B2,Operation,Male,0.0,7.0,Marr.,32.05,Direct,Non Promoted,Yes,...,0,0,0,1,0,0,0,0,0,0


**These lines of code utilize Pandas' "get_dummies()" function to one-hot encode the 'Function' and 'Gender' columns, expanding categorical variables into binary indicators. The column 'Gender ' is renamed to 'Gender' to remove the trailing space.**

**Finally, it displays the first 5 rows of the DataFrame with the newly encoded columns. This process is common in preparing categorical data for machine learning algorithms.**

In [None]:
df = pd.get_dummies(df, columns=["Function"])

df.rename(columns={'Gender ': 'Gender'}, inplace=True)

df = pd.get_dummies(df, columns=["Gender"])

df.head(5)

Unnamed: 0,Emp. Group,Tenure,Experience (YY.MM),Marital Status,Age in YY.,Hiring Source,Promoted/Non Promoted,Job Role Match,Stay/Left,Location_Chennai,...,Location_Nagpur,Location_Noida,Location_Pune,Location_Vijayawada,Function_Operation,Function_Sales,Function_Support,Gender_Female,Gender_Male,Gender_other
0,B2,0.0,6.08,Single,27.12,Direct,Non Promoted,Yes,Left,0,...,0,0,1,0,True,False,False,False,True,False
1,B7,0.0,13.0,Marr.,38.08,Direct,Promoted,No,Stay,0,...,0,1,0,0,False,False,True,False,True,False
2,B3,0.01,16.05,Marr.,36.04,Direct,Promoted,Yes,Stay,0,...,0,0,0,0,True,False,False,False,True,False
3,B2,0.01,6.06,Marr.,32.07,Direct,Promoted,Yes,Stay,0,...,0,1,0,0,True,False,False,False,True,False
4,B2,0.0,7.0,Marr.,32.05,Direct,Non Promoted,Yes,Stay,0,...,0,0,0,0,True,False,False,False,True,False


**These lines convert the one-hot indicator columns for 'Function' and 'Gender' from boolean (True/False) to integer (1/0). Afterwards, the first 5 rows of the DataFrame are displayed to confirm the conversion. This type conversion keeps the feature matrix consistently numeric for machine learning algorithms.**

In [None]:
indicator_columns = ["Function_Operation", "Function_Sales", "Function_Support",
                     "Gender_Female", "Gender_Male", "Gender_other"]

df[indicator_columns] = df[indicator_columns].astype(int)

df.head(5)

Unnamed: 0,Emp. Group,Tenure,Experience (YY.MM),Marital Status,Age in YY.,Hiring Source,Promoted/Non Promoted,Job Role Match,Stay/Left,Location_Chennai,...,Location_Nagpur,Location_Noida,Location_Pune,Location_Vijayawada,Function_Operation,Function_Sales,Function_Support,Gender_Female,Gender_Male,Gender_other
0,B2,0.0,6.08,Single,27.12,Direct,Non Promoted,Yes,Left,0,...,0,0,1,0,1,0,0,0,1,0
1,B7,0.0,13.0,Marr.,38.08,Direct,Promoted,No,Stay,0,...,0,1,0,0,0,0,1,0,1,0
2,B3,0.01,16.05,Marr.,36.04,Direct,Promoted,Yes,Stay,0,...,0,0,0,0,1,0,0,0,1,0
3,B2,0.01,6.06,Marr.,32.07,Direct,Promoted,Yes,Stay,0,...,0,1,0,0,1,0,0,0,1,0
4,B2,0.0,7.0,Marr.,32.05,Direct,Non Promoted,Yes,Stay,0,...,0,0,0,0,1,0,0,0,1,0
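**Instead of listing every indicator column by name, the boolean columns can be selected by dtype and converted together. The sketch below assumes a made-up two-row DataFrame; select_dtypes() and astype() are standard pandas calls.**

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({"age": [27.12, 38.08], "grp": ["B2", "B7"]})
toy = pd.get_dummies(toy, columns=["grp"])

# Pick up every boolean column, whatever it is called, and cast to 0/1.
bool_cols = toy.select_dtypes(include="bool").columns
toy[bool_cols] = toy[bool_cols].astype(int)

print(toy.dtypes)
```

**This avoids having to update a hand-written column list whenever a new category appears in the data.**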


**This command displays all unique categories in the "Marital Status" column.**

In [None]:
df["Marital Status"].unique()

array(['Single', 'Marr.', 'Div.', 'NTBD', 'Sep.'], dtype=object)

**These lines perform one-hot encoding on the 'Marital Status' column using Pandas' get_dummies() function, expanding categorical variables into binary indicators. Then, it displays the first 5 rows of the DataFrame, showcasing the transformation.**

In [None]:
df = pd.get_dummies(df, columns=["Marital Status"])
df.head(5)

Unnamed: 0,Emp. Group,Tenure,Experience (YY.MM),Age in YY.,Hiring Source,Promoted/Non Promoted,Job Role Match,Stay/Left,Location_Chennai,Location_Gurgaon,...,Function_Sales,Function_Support,Gender_Female,Gender_Male,Gender_other,Marital Status_Div.,Marital Status_Marr.,Marital Status_NTBD,Marital Status_Sep.,Marital Status_Single
0,B2,0.0,6.08,27.12,Direct,Non Promoted,Yes,Left,0,0,...,0,0,0,1,0,False,False,False,False,True
1,B7,0.0,13.0,38.08,Direct,Promoted,No,Stay,0,0,...,0,1,0,1,0,False,True,False,False,False
2,B3,0.01,16.05,36.04,Direct,Promoted,Yes,Stay,0,0,...,0,0,0,1,0,False,True,False,False,False
3,B2,0.01,6.06,32.07,Direct,Promoted,Yes,Stay,0,0,...,0,0,0,1,0,False,True,False,False,False
4,B2,0.0,7.0,32.05,Direct,Non Promoted,Yes,Stay,0,0,...,0,0,0,1,0,False,True,False,False,False


**In these lines, the one-hot indicator columns for 'Marital Status' are converted from boolean (True/False) to integer (1/0). Then, the first 5 rows of the DataFrame are displayed to confirm the conversion. This type conversion keeps the feature matrix consistently numeric for machine learning algorithms.**

In [None]:
df["Marital Status_Div."] = df["Marital Status_Div."].astype(int)
df["Marital Status_Marr."] = df["Marital Status_Marr."].astype(int)
df["Marital Status_NTBD"] = df["Marital Status_NTBD"].astype(int)
df["Marital Status_Sep."] = df["Marital Status_Sep."].astype(int)
df["Marital Status_Single"] = df["Marital Status_Single"].astype(int)

df.head(5)

Unnamed: 0,Emp. Group,Tenure,Experience (YY.MM),Age in YY.,Hiring Source,Promoted/Non Promoted,Job Role Match,Stay/Left,Location_Chennai,Location_Gurgaon,...,Function_Sales,Function_Support,Gender_Female,Gender_Male,Gender_other,Marital Status_Div.,Marital Status_Marr.,Marital Status_NTBD,Marital Status_Sep.,Marital Status_Single
0,B2,0.0,6.08,27.12,Direct,Non Promoted,Yes,Left,0,0,...,0,0,0,1,0,0,0,0,0,1
1,B7,0.0,13.0,38.08,Direct,Promoted,No,Stay,0,0,...,0,1,0,1,0,0,1,0,0,0
2,B3,0.01,16.05,36.04,Direct,Promoted,Yes,Stay,0,0,...,0,0,0,1,0,0,1,0,0,0
3,B2,0.01,6.06,32.07,Direct,Promoted,Yes,Stay,0,0,...,0,0,0,1,0,0,1,0,0,0
4,B2,0.0,7.0,32.05,Direct,Non Promoted,Yes,Stay,0,0,...,0,0,0,1,0,0,1,0,0,0
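**Several categorical columns can also be encoded in a single get_dummies() call. The following is a minimal sketch on made-up rows that reuse two of the column names from this notebook; it is not a replacement for the step-by-step cells, only an illustration of the same technique in compact form.**

```python
import pandas as pd

# Toy rows (illustrative only) using two column names from the notebook.
toy = pd.DataFrame({
    "Marital Status": ["Single", "Marr.", "Div."],
    "Hiring Source": ["Direct", "Agency", "Direct"],
    "Tenure": [0.0, 0.01, 0.02],
})

# Encode both columns at once and request integer indicators directly.
encoded = pd.get_dummies(toy, columns=["Marital Status", "Hiring Source"], dtype=int)

print(encoded.columns.tolist())
```

**Untouched columns come first in the result, followed by the new indicator columns in the order the source columns were listed.**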


**These lines perform one-hot encoding on the 'Promoted/Non Promoted' column using Pandas' "get_dummies()" function, expanding categorical variables into binary indicators. Then, it displays the first 5 rows of the DataFrame, showcasing the transformation.**

In [None]:
df = pd.get_dummies(df, columns=["Promoted/Non Promoted"])
df.head(5)

Unnamed: 0,Emp. Group,Tenure,Experience (YY.MM),Age in YY.,Hiring Source,Job Role Match,Stay/Left,Location_Chennai,Location_Gurgaon,Location_Hyderabad,...,Gender_Female,Gender_Male,Gender_other,Marital Status_Div.,Marital Status_Marr.,Marital Status_NTBD,Marital Status_Sep.,Marital Status_Single,Promoted/Non Promoted_Non Promoted,Promoted/Non Promoted_Promoted
0,B2,0.0,6.08,27.12,Direct,Yes,Left,0,0,0,...,0,1,0,0,0,0,0,1,True,False
1,B7,0.0,13.0,38.08,Direct,No,Stay,0,0,0,...,0,1,0,0,1,0,0,0,False,True
2,B3,0.01,16.05,36.04,Direct,Yes,Stay,0,0,0,...,0,1,0,0,1,0,0,0,False,True
3,B2,0.01,6.06,32.07,Direct,Yes,Stay,0,0,0,...,0,1,0,0,1,0,0,0,False,True
4,B2,0.0,7.0,32.05,Direct,Yes,Stay,0,0,0,...,0,1,0,0,1,0,0,0,True,False


**In these lines, the column 'Promoted/Non Promoted_Non Promoted' is renamed to 'Not Promoted', and 'Promoted/Non Promoted_Promoted' is renamed to 'Promoted', facilitating clearer interpretation of the data. Then, it displays the first 5 rows of the DataFrame with the renamed columns.**

In [None]:
df.rename(columns={'Promoted/Non Promoted_Non Promoted': 'Not Promoted'}, inplace=True)
df.rename(columns={'Promoted/Non Promoted_Promoted': 'Promoted'}, inplace=True)

df.head(5)

Unnamed: 0,Emp. Group,Tenure,Experience (YY.MM),Age in YY.,Hiring Source,Job Role Match,Stay/Left,Location_Chennai,Location_Gurgaon,Location_Hyderabad,...,Gender_Female,Gender_Male,Gender_other,Marital Status_Div.,Marital Status_Marr.,Marital Status_NTBD,Marital Status_Sep.,Marital Status_Single,Not Promoted,Promoted
0,B2,0.0,6.08,27.12,Direct,Yes,Left,0,0,0,...,0,1,0,0,0,0,0,1,True,False
1,B7,0.0,13.0,38.08,Direct,No,Stay,0,0,0,...,0,1,0,0,1,0,0,0,False,True
2,B3,0.01,16.05,36.04,Direct,Yes,Stay,0,0,0,...,0,1,0,0,1,0,0,0,False,True
3,B2,0.01,6.06,32.07,Direct,Yes,Stay,0,0,0,...,0,1,0,0,1,0,0,0,False,True
4,B2,0.0,7.0,32.05,Direct,Yes,Stay,0,0,0,...,0,1,0,0,1,0,0,0,True,False


**These lines convert the 'Not Promoted' and 'Promoted' indicator columns from boolean (True/False) to integer (1/0). Then, the first 5 rows of the DataFrame are displayed to confirm the conversion, which keeps the feature matrix consistently numeric for machine learning algorithms.**

In [None]:
df["Not Promoted"] = df["Not Promoted"].astype(int)
df["Promoted"] = df["Promoted"].astype(int)

df.head(5)

Unnamed: 0,Emp. Group,Tenure,Experience (YY.MM),Age in YY.,Hiring Source,Job Role Match,Stay/Left,Location_Chennai,Location_Gurgaon,Location_Hyderabad,...,Gender_Female,Gender_Male,Gender_other,Marital Status_Div.,Marital Status_Marr.,Marital Status_NTBD,Marital Status_Sep.,Marital Status_Single,Not Promoted,Promoted
0,B2,0.0,6.08,27.12,Direct,Yes,Left,0,0,0,...,0,1,0,0,0,0,0,1,1,0
1,B7,0.0,13.0,38.08,Direct,No,Stay,0,0,0,...,0,1,0,0,1,0,0,0,0,1
2,B3,0.01,16.05,36.04,Direct,Yes,Stay,0,0,0,...,0,1,0,0,1,0,0,0,0,1
3,B2,0.01,6.06,32.07,Direct,Yes,Stay,0,0,0,...,0,1,0,0,1,0,0,0,0,1
4,B2,0.0,7.0,32.05,Direct,Yes,Stay,0,0,0,...,0,1,0,0,1,0,0,0,1,0


**This command displays all unique categories in the "Hiring Source" column.**

In [None]:
df["Hiring Source"].unique()

array(['Direct', 'Agency', 'Employee Referral'], dtype=object)

**These lines of code utilize Pandas get_dummies() function to one-hot encode the 'Hiring Source' column, expanding categorical variables into binary indicators. Then, it displays the first 5 rows of the DataFrame with the newly encoded columns, enabling further analysis or model building.**

In [None]:
df = pd.get_dummies(df, columns=["Hiring Source"])
df.head(5)

Unnamed: 0,Emp. Group,Tenure,Experience (YY.MM),Age in YY.,Job Role Match,Stay/Left,Location_Chennai,Location_Gurgaon,Location_Hyderabad,Location_Kolkata,...,Marital Status_Div.,Marital Status_Marr.,Marital Status_NTBD,Marital Status_Sep.,Marital Status_Single,Not Promoted,Promoted,Hiring Source_Agency,Hiring Source_Direct,Hiring Source_Employee Referral
0,B2,0.0,6.08,27.12,Yes,Left,0,0,0,0,...,0,0,0,0,1,1,0,False,True,False
1,B7,0.0,13.0,38.08,No,Stay,0,0,0,0,...,0,1,0,0,0,0,1,False,True,False
2,B3,0.01,16.05,36.04,Yes,Stay,0,0,0,0,...,0,1,0,0,0,0,1,False,True,False
3,B2,0.01,6.06,32.07,Yes,Stay,0,0,0,0,...,0,1,0,0,0,0,1,False,True,False
4,B2,0.0,7.0,32.05,Yes,Stay,0,0,0,0,...,0,1,0,0,0,1,0,False,True,False


**In these lines, the one-hot indicator columns for 'Hiring Source' are converted from boolean (True/False) to integer (1/0). Then, the first 5 rows of the DataFrame are displayed to confirm the conversion, which keeps the feature matrix consistently numeric for machine learning algorithms.**

In [None]:
df["Hiring Source_Agency"] = df["Hiring Source_Agency"].astype(int)
df["Hiring Source_Direct"] = df["Hiring Source_Direct"].astype(int)
df["Hiring Source_Employee Referral"] = df["Hiring Source_Employee Referral"].astype(int)

df.head(5)

Unnamed: 0,Emp. Group,Tenure,Experience (YY.MM),Age in YY.,Job Role Match,Stay/Left,Location_Chennai,Location_Gurgaon,Location_Hyderabad,Location_Kolkata,...,Marital Status_Div.,Marital Status_Marr.,Marital Status_NTBD,Marital Status_Sep.,Marital Status_Single,Not Promoted,Promoted,Hiring Source_Agency,Hiring Source_Direct,Hiring Source_Employee Referral
0,B2,0.0,6.08,27.12,Yes,Left,0,0,0,0,...,0,0,0,0,1,1,0,0,1,0
1,B7,0.0,13.0,38.08,No,Stay,0,0,0,0,...,0,1,0,0,0,0,1,0,1,0
2,B3,0.01,16.05,36.04,Yes,Stay,0,0,0,0,...,0,1,0,0,0,0,1,0,1,0
3,B2,0.01,6.06,32.07,Yes,Stay,0,0,0,0,...,0,1,0,0,0,0,1,0,1,0
4,B2,0.0,7.0,32.05,Yes,Stay,0,0,0,0,...,0,1,0,0,0,1,0,0,1,0


**This command displays all unique categories in the "Emp. Group" column.**

In [None]:
df["Emp. Group"].unique()

array(['B2', 'B7', 'B3', 'B1', 'B5', 'B0', 'B4', 'B6', 'C3', 'D2'],
      dtype=object)

**These lines of code utilize Pandas' "get_dummies()" function to one-hot encode the 'Emp. Group' column, expanding categorical variables into binary indicators. Then, it displays the first 5 rows of the DataFrame with the newly encoded columns, facilitating further analysis or model building.**

In [None]:
df = pd.get_dummies(df, columns=["Emp. Group"])
df.head(5)

Unnamed: 0,Tenure,Experience (YY.MM),Age in YY.,Job Role Match,Stay/Left,Location_Chennai,Location_Gurgaon,Location_Hyderabad,Location_Kolkata,Location_Lucknow,...,Emp. Group_B0,Emp. Group_B1,Emp. Group_B2,Emp. Group_B3,Emp. Group_B4,Emp. Group_B5,Emp. Group_B6,Emp. Group_B7,Emp. Group_C3,Emp. Group_D2
0,0.0,6.08,27.12,Yes,Left,0,0,0,0,0,...,False,False,True,False,False,False,False,False,False,False
1,0.0,13.0,38.08,No,Stay,0,0,0,0,0,...,False,False,False,False,False,False,False,True,False,False
2,0.01,16.05,36.04,Yes,Stay,0,0,0,0,0,...,False,False,False,True,False,False,False,False,False,False
3,0.01,6.06,32.07,Yes,Stay,0,0,0,0,0,...,False,False,True,False,False,False,False,False,False,False
4,0.0,7.0,32.05,Yes,Stay,0,0,0,0,1,...,False,False,True,False,False,False,False,False,False,False


**In these lines, the one-hot indicator columns for 'Emp. Group' are converted from boolean (True/False) to integer (1/0). Then, the first 5 rows of the DataFrame are displayed to confirm the conversion, which keeps the feature matrix consistently numeric for machine learning algorithms.**

In [None]:
emp_group_columns = ["Emp. Group_B0", "Emp. Group_B1", "Emp. Group_B2", "Emp. Group_B3",
                     "Emp. Group_B4", "Emp. Group_B5", "Emp. Group_B6", "Emp. Group_B7",
                     "Emp. Group_C3", "Emp. Group_D2"]

df[emp_group_columns] = df[emp_group_columns].astype(int)

df.head(5)

Unnamed: 0,Tenure,Experience (YY.MM),Age in YY.,Job Role Match,Stay/Left,Location_Chennai,Location_Gurgaon,Location_Hyderabad,Location_Kolkata,Location_Lucknow,...,Emp. Group_B0,Emp. Group_B1,Emp. Group_B2,Emp. Group_B3,Emp. Group_B4,Emp. Group_B5,Emp. Group_B6,Emp. Group_B7,Emp. Group_C3,Emp. Group_D2
0,0.0,6.08,27.12,Yes,Left,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0.0,13.0,38.08,No,Stay,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0.01,16.05,36.04,Yes,Stay,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,0.01,6.06,32.07,Yes,Stay,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0.0,7.0,32.05,Yes,Stay,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0


**This command displays all unique categories in the "Job Role Match" column.**

In [None]:
df["Job Role Match"].unique()

array(['Yes', 'No'], dtype=object)

**This code one-hot encodes the 'Job Role Match' column, converting categorical values into binary indicators. It then displays the first 5 rows of the DataFrame with the newly encoded columns.**

In [None]:
df = pd.get_dummies(df, columns=["Job Role Match"])
df.head(5)

Unnamed: 0,Tenure,Experience (YY.MM),Age in YY.,Stay/Left,Location_Chennai,Location_Gurgaon,Location_Hyderabad,Location_Kolkata,Location_Lucknow,Location_Madurai,...,Emp. Group_B2,Emp. Group_B3,Emp. Group_B4,Emp. Group_B5,Emp. Group_B6,Emp. Group_B7,Emp. Group_C3,Emp. Group_D2,Job Role Match_No,Job Role Match_Yes
0,0.0,6.08,27.12,Left,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,False,True
1,0.0,13.0,38.08,Stay,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,True,False
2,0.01,16.05,36.04,Stay,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,False,True
3,0.01,6.06,32.07,Stay,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,False,True
4,0.0,7.0,32.05,Stay,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,False,True


**In these lines, the column 'Job Role Match_No' is renamed to 'Job Role UnMatched', and 'Job Role Match_Yes' is renamed to 'Job Role Matched', providing clearer interpretation of the data. Then, it displays the first 5 rows of the DataFrame with the renamed columns.**

In [None]:
df.rename(columns={'Job Role Match_No': 'Job Role UnMatched'}, inplace=True)
df.rename(columns={'Job Role Match_Yes': 'Job Role Matched'}, inplace=True)

df.head(5)

Unnamed: 0,Tenure,Experience (YY.MM),Age in YY.,Stay/Left,Location_Chennai,Location_Gurgaon,Location_Hyderabad,Location_Kolkata,Location_Lucknow,Location_Madurai,...,Emp. Group_B2,Emp. Group_B3,Emp. Group_B4,Emp. Group_B5,Emp. Group_B6,Emp. Group_B7,Emp. Group_C3,Emp. Group_D2,Job Role UnMatched,Job Role Matched
0,0.0,6.08,27.12,Left,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,False,True
1,0.0,13.0,38.08,Stay,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,True,False
2,0.01,16.05,36.04,Stay,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,False,True
3,0.01,6.06,32.07,Stay,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,False,True
4,0.0,7.0,32.05,Stay,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,False,True


**These lines convert the 'Job Role Matched' and 'Job Role UnMatched' indicator columns from boolean (True/False) to integer (1/0). Then, the first 5 rows of the DataFrame are displayed to confirm the conversion, which keeps the feature matrix consistently numeric for machine learning algorithms.**

In [None]:
df["Job Role Matched"] = df["Job Role Matched"].astype(int)
df["Job Role UnMatched"] = df["Job Role UnMatched"].astype(int)

df.head(5)

Unnamed: 0,Tenure,Experience (YY.MM),Age in YY.,Stay/Left,Location_Chennai,Location_Gurgaon,Location_Hyderabad,Location_Kolkata,Location_Lucknow,Location_Madurai,...,Emp. Group_B2,Emp. Group_B3,Emp. Group_B4,Emp. Group_B5,Emp. Group_B6,Emp. Group_B7,Emp. Group_C3,Emp. Group_D2,Job Role UnMatched,Job Role Matched
0,0.0,6.08,27.12,Left,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
1,0.0,13.0,38.08,Stay,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
2,0.01,16.05,36.04,Stay,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
3,0.01,6.06,32.07,Stay,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,0.0,7.0,32.05,Stay,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,1


**These lines map the values in the 'Stay/Left' column according to the specified dictionary, where 'Left' is mapped to 0 and 'Stay' is mapped to 1. Then, it displays the first 5 rows of the DataFrame with the updated 'Stay/Left' column.**

In [None]:
mapping = {'Left': 0, 'Stay': 1}

df['Stay/Left'] = df['Stay/Left'].map(mapping)

df.head(5)

Unnamed: 0,Tenure,Experience (YY.MM),Age in YY.,Stay/Left,Location_Chennai,Location_Gurgaon,Location_Hyderabad,Location_Kolkata,Location_Lucknow,Location_Madurai,...,Emp. Group_B2,Emp. Group_B3,Emp. Group_B4,Emp. Group_B5,Emp. Group_B6,Emp. Group_B7,Emp. Group_C3,Emp. Group_D2,Job Role UnMatched,Job Role Matched
0,0.0,6.08,27.12,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
1,0.0,13.0,38.08,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
2,0.01,16.05,36.04,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
3,0.01,6.06,32.07,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,0.0,7.0,32.05,1,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,1
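**One thing worth checking after a dictionary-based map() is whether any value was missing from the dictionary, because such values silently become NaN. The sketch below uses a made-up Series with a deliberately misspelled label ('left') to show how to spot them.**

```python
import pandas as pd

# Toy labels (illustrative only); the last one does not match the mapping keys.
labels = pd.Series(["Left", "Stay", "Stay", "left"])
mapping = {"Left": 0, "Stay": 1}

encoded = labels.map(mapping)

# Values that were not found in the mapping come back as NaN.
unmapped = labels[encoded.isna()]

print(encoded)
print("Unmapped values:", unmapped.tolist())
```

**If the list of unmapped values is empty, the target column is safe to use as a 0/1 label.**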


**This command displays all columns currently in the DataFrame.**

In [None]:
df.columns

Index(['Tenure', 'Experience (YY.MM)', 'Age in YY.', 'Stay/Left',
       'Location_Chennai', 'Location_Gurgaon', 'Location_Hyderabad',
       'Location_Kolkata', 'Location_Lucknow', 'Location_Madurai',
       'Location_Mumbai', 'Location_Nagpur', 'Location_Noida', 'Location_Pune',
       'Location_Vijayawada', 'Function_Operation', 'Function_Sales',
       'Function_Support', 'Gender_Female', 'Gender_Male', 'Gender_other',
       'Marital Status_Div.', 'Marital Status_Marr.', 'Marital Status_NTBD',
       'Marital Status_Sep.', 'Marital Status_Single', 'Not Promoted',
       'Promoted', 'Hiring Source_Agency', 'Hiring Source_Direct',
       'Hiring Source_Employee Referral', 'Emp. Group_B0', 'Emp. Group_B1',
       'Emp. Group_B2', 'Emp. Group_B3', 'Emp. Group_B4', 'Emp. Group_B5',
       'Emp. Group_B6', 'Emp. Group_B7', 'Emp. Group_C3', 'Emp. Group_D2',
       'Job Role UnMatched', 'Job Role Matched'],
      dtype='object')

# **Model Building**

**These commands separate the features (X) from the target variable (y) for model building.**

In [None]:
X = df[['Tenure', 'Experience (YY.MM)', 'Age in YY.',
       'Location_Chennai', 'Location_Gurgaon', 'Location_Hyderabad',
       'Location_Kolkata', 'Location_Lucknow', 'Location_Madurai',
       'Location_Mumbai', 'Location_Nagpur', 'Location_Noida', 'Location_Pune',
       'Location_Vijayawada', 'Function_Operation', 'Function_Sales',
       'Function_Support', 'Gender_Female', 'Gender_Male', 'Gender_other',
       'Marital Status_Div.', 'Marital Status_Marr.', 'Marital Status_NTBD',
       'Marital Status_Sep.', 'Marital Status_Single', 'Not Promoted',
       'Promoted', 'Hiring Source_Agency', 'Hiring Source_Direct',
       'Hiring Source_Employee Referral', 'Emp. Group_B0', 'Emp. Group_B1',
       'Emp. Group_B2', 'Emp. Group_B3', 'Emp. Group_B4', 'Emp. Group_B5',
       'Emp. Group_B6', 'Emp. Group_B7', 'Emp. Group_C3', 'Emp. Group_D2',
       'Job Role UnMatched', 'Job Role Matched']]

y = df['Stay/Left']
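**A shorter way to build the same kind of feature matrix is to drop the target column rather than list every feature. This is a sketch on a made-up three-column DataFrame; it assumes, as in this notebook, that every column other than the target is a feature.**

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({
    "Tenure": [0.0, 0.01],
    "Stay/Left": [0, 1],
    "Promoted": [0, 1],
})

target = "Stay/Left"

# Everything except the target becomes a feature; column order is preserved.
X_toy = toy.drop(columns=[target])
y_toy = toy[target]

print(X_toy.columns.tolist())
```

**The explicit list used above has the advantage of documenting exactly which features the model sees, so either approach can be reasonable.**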

# **Decision Tree Classifier**

**1: DecisionTreeClassifier(max_depth=10): Initializes a DecisionTreeClassifier with a maximum depth of 10 to limit overfitting. No random_state is set, so the fitted tree (and the reported scores) may vary slightly between runs.**


**2: train_test_split(X, y, train_size=0.8, random_state=42): Splits the dataset into training and testing sets with 80% for training and a random state of 42 for reproducibility.**


**3: dt_classifier.fit(X_train, y_train): Trains the decision tree classifier on the training data.**

**4: accuracy_score(y_test, dt_y_pred): Calculates the accuracy of the model.**

**5: classification_report(y_test, dt_y_pred): Generates a classification report with precision, recall, F1-score, and support.**

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

dt_classifier = DecisionTreeClassifier(max_depth=10)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

dt_classifier.fit(X_train, y_train)

dt_y_pred = dt_classifier.predict(X_test)

dt_accuracy = accuracy_score(y_test, dt_y_pred)
print("Decision Tree Accuracy:", dt_accuracy)

print("Decision Tree Classification Report:")
print(classification_report(y_test, dt_y_pred))

Decision Tree Accuracy: 0.8232044198895028
Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.69      0.73        62
           1       0.85      0.89      0.87       119

    accuracy                           0.82       181
   macro avg       0.81      0.79      0.80       181
weighted avg       0.82      0.82      0.82       181
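**The steps above can be sketched on synthetic data to show two optional refinements: fixing random_state on the tree for repeatable results, and passing stratify to keep the class ratio similar in both splits. The data comes from scikit-learn's make_classification() and is not the attrition dataset; all variable names here are illustrative.**

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# stratify keeps the class proportions similar in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42, stratify=y_demo
)

# random_state makes the fitted tree repeatable from run to run.
tree = DecisionTreeClassifier(max_depth=10, random_state=42)
tree.fit(X_tr, y_tr)

demo_accuracy = accuracy_score(y_te, tree.predict(X_te))
print("Depth:", tree.get_depth(), "Accuracy:", demo_accuracy)
```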



# **Random Forest Classifier**

**1: "RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)" : Initializes a RandomForestClassifier with 100 decision trees (n_estimators), each having a maximum depth of 10 to prevent overfitting, and a random state of 42 for reproducibility.**

**2: "rf_classifier.fit(X_train, y_train)": Trains the random forest classifier on the training data.**

**3: "accuracy_score(y_test, rf_y_pred)": Calculates the accuracy of the model.**

**4: "classification_report(y_test, rf_y_pred)": Generates a classification report with precision, recall, F1-score, and support.**

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

rf_classifier.fit(X_train, y_train)

rf_y_pred = rf_classifier.predict(X_test)

rf_accuracy = accuracy_score(y_test, rf_y_pred)
print("Random Forest Accuracy:", rf_accuracy)

print("Random Forest Classification Report:")
print(classification_report(y_test, rf_y_pred))

Random Forest Accuracy: 0.8839779005524862
Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.79      0.82        62
           1       0.90      0.93      0.91       119

    accuracy                           0.88       181
   macro avg       0.88      0.86      0.87       181
weighted avg       0.88      0.88      0.88       181
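**A fitted random forest also exposes feature_importances_, which can help explain which inputs drive its predictions. The sketch below uses synthetic data and placeholder feature names ("feature_0", "feature_1", ...), so the ranking it prints says nothing about the attrition features themselves.**

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustrative only) with placeholder feature names.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances: one non-negative score per feature, summing to 1.
importances = forest.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

**With one-hot encoded inputs, importance is spread across the indicator columns of each original variable, so the scores are best read per group of related columns.**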



# **Logistic Regression**

**1: "LogisticRegression(C=1.0, max_iter=500)" : Initializes a LogisticRegression model with C set to 1.0 (C is the inverse of the regularization strength, so smaller values mean stronger regularization) and a maximum of 500 solver iterations.**

**2: "log_reg.fit(X_train, y_train)" : Trains the logistic regression model on the training data.**

**3: "accuracy_score(y_test, y_pred)" : Calculates the accuracy of the model.**

**4: "classification_report(y_test, y_pred)" : Generates a classification report with precision, recall, F1-score, and support.**

In [37]:
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

log_reg = LogisticRegression(C=1.0, max_iter=500)

log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.8729281767955801
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.74      0.80        62
           1       0.88      0.94      0.91       119

    accuracy                           0.87       181
   macro avg       0.87      0.84      0.85       181
weighted avg       0.87      0.87      0.87       181
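**Logistic regression solvers tend to converge more easily when features are on a similar scale. As a hedged illustration, the sketch below wraps StandardScaler and LogisticRegression in a scikit-learn pipeline and fits it on synthetic data; it is not the configuration used for the results above.**

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

# The scaler is fitted on the training data only, then applied to the test data.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=500))
pipeline.fit(X_tr, y_tr)

probabilities = pipeline.predict_proba(X_te)  # one row per sample, one column per class
demo_accuracy = pipeline.score(X_te, y_te)
print("Accuracy:", demo_accuracy)
```

**predict_proba() is useful when a probability of leaving is more informative than a hard Stay/Left label.**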



# **Conclusion:**

After testing all three models — **Decision Tree**, **Random Forest**, and **Logistic Regression** — the following results were obtained:

* **Decision Tree:**
  Accuracy: **82.32%**

  * Class 0: Precision **77%**, Recall **69%**
  * Class 1: Precision **85%**, Recall **89%**

* **Random Forest:**
  Accuracy: **88.40%**

  * Class 0: Precision **86%**, Recall **79%**
  * Class 1: Precision **90%**, Recall **93%**

* **Logistic Regression:**
  Accuracy: **87.29%**

  * Class 0: Precision **87%**, Recall **74%**
  * Class 1: Precision **88%**, Recall **94%**

📊 **Final Remark:**
Based on these results, the **Random Forest model** delivered the highest accuracy and the best recall on class 0 (employees who left, 79%), while Logistic Regression was close behind in accuracy with slightly higher class 0 precision (87% vs. 86%). Because these figures come from a single 80/20 split with 181 test samples, the gap between the two models is small and should be read as indicative rather than conclusive. On that basis, **Random Forest** was chosen for the final saved model (**Attrition_Rate_Model.joblib**).
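To put the accuracy figures in perspective, they can be converted back into counts of correctly classified test samples. The short calculation below uses only the accuracies printed above and the test-set size of 181 from the classification reports.

```python
# Accuracies as printed by the three model cells above.
n_test = 181
reported = {
    "Decision Tree": 0.8232044198895028,
    "Random Forest": 0.8839779005524862,
    "Logistic Regression": 0.8729281767955801,
}

# accuracy = correct / n_test, so correct = accuracy * n_test (rounded to a whole count).
correct = {name: round(acc * n_test) for name, acc in reported.items()}

for name, count in correct.items():
    print(f"{name}: {count} of {n_test} correct")
```

This works out to 149, 160, and 158 correct predictions respectively, so Random Forest and Logistic Regression differ by only two test samples on this split.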

# **Saving The Model**

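**dump(rf_classifier, "Attrition_Rate_Model.joblib"): Saves a trained RandomForestClassifier model as a joblib file named:**

**"Attrition_Rate_Model.joblib".**

**Note that this cell fits a new RandomForestClassifier with default hyperparameters on the training labels cast to strings, so the saved model predicts '0' and '1' as strings and is not the exact model (max_depth=10, random_state=42) evaluated above. The joblib library then writes it to disk for future use or deployment.**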

In [None]:
from joblib import dump

y_train_str = y_train.astype(str)

rf_classifier = RandomForestClassifier()

rf_classifier.fit(X_train, y_train_str)

dump(rf_classifier, "Attrition_Rate_Model.joblib")

['Attrition_Rate_Model.joblib']
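**A saved model can be loaded back with joblib's load() function. The round trip below is a sketch on synthetic data written to a temporary directory ("demo_model.joblib" is a made-up file name); it also shows that a classifier trained on string labels, as in the cell above, returns string predictions.**

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustrative only); labels are cast to strings as in the cell above.
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)
y_demo_str = y_demo.astype(str)

model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=42)
model.fit(X_demo, y_demo_str)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.joblib")
    dump(model, path)
    restored = load(path)

same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(restored.classes_, same_predictions)
```

**Any code that consumes the saved model needs to compare its predictions against the strings '0' and '1' rather than the integers 0 and 1. Joblib files should also only be loaded from trusted sources, since loading can execute arbitrary code.**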