In [40]:
import pandas as pd
#Importing pandas, which is used here for loading, cleaning, and transforming the data.

The "pandas" library is a powerful and widely-used Python library for data manipulation and analysis. It provides data structures and functions that simplify working with structured data, such as tabular data (like spreadsheets or SQL tables). The name "pandas" is derived from "panel data," which refers to multidimensional structured data sets commonly used in statistics and econometrics.

In [41]:
file_path = "HR Data.csv" 
#Defining the file path 

In [42]:
df = pd.read_csv(file_path)
#Reading the csv file

"pd.read_csv()" is a function provided by the pandas library in Python, and it is used to read data from a CSV (Comma Separated Values) file and create a DataFrame.

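The output of df.head() in this notebook begins with a column named "Unnamed: 0". pandas assigns that name when the first header cell of a CSV file is empty, which typically happens when a file was saved together with its row index. The sketch below uses a small inline sample (hypothetical, not the real "HR Data.csv") to compare the default behaviour of pd.read_csv() with its index_col=0 option, which is one possible way to treat that column as the index instead of as data.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for "HR Data.csv";
# the first header cell is empty, as in a file saved with its index.
csv_text = ",Age,Attrition\n0,41,Yes\n1,49,No\n2,37,Yes\n"

# Default read: the unnamed first column is kept as a regular column.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.columns.tolist())

# index_col=0: the first column becomes the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.columns.tolist())
```

Whether the extra column should be kept, dropped, or used as the index depends on how the source file was produced, so this is an option to consider and not a required step.
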
# Uploaded Data

In [43]:
df.head()
#The df.head() method in pandas displays the first rows of a DataFrame (five by default)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


Since the dataset has 35 columns in total, we remove the columns that are not used in the further analysis.
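
One possible way to shortlist columns for removal is to check the DataFrame's shape and look for columns that hold a single repeated value. The sketch below runs on a small hypothetical sample whose values mirror the first rows printed above; it is an illustration only and not the criterion this notebook used to build columns_to_delete.

```python
import pandas as pd

# Hypothetical sample; the real DataFrame comes from "HR Data.csv".
sample = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# Columns with only one distinct value carry no information for comparisons.
constant_cols = [col for col in sample.columns if sample[col].nunique() == 1]
print(constant_cols)
```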

In [44]:
columns_to_delete = ['TrainingTimesLastYear','EnvironmentSatisfaction',
                     'JobSatisfaction','StockOptionLevel','Education', 
                     'EmployeeCount', 'EmployeeNumber', 'Gender', 'HourlyRate', 
                     'JobInvolvement' ,'Over18', 'RelationshipSatisfaction', 'StandardHours', 
                     'WorkLifeBalance', 'YearsWithCurrManager']
#Listing the columns that will be dropped from the DataFrame before further cleaning.

In [45]:
df.drop(columns=columns_to_delete, inplace=True)
#Using df.drop to delete the listed columns from the DataFrame in place

The "df.drop()" method in pandas is used to remove rows or columns from a DataFrame based on specified labels. Rows are selected with the index= argument and columns with the columns= argument (or with labels= together with axis=).
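
As a small illustration, the sketch below drops columns from a hypothetical sample DataFrame without inplace=True, so the original is left untouched and a new DataFrame is returned. It also passes errors="ignore", which lets the call succeed when a listed label (here the made-up name "NotAColumn") is absent; by default df.drop() raises a KeyError in that case.

```python
import pandas as pd

# Hypothetical sample; column names follow the HR dataset, values are illustrative.
sample = pd.DataFrame({
    "Age": [41, 49],
    "Gender": ["Female", "Male"],
    "Over18": ["Y", "Y"],
})

# Returns a new DataFrame; labels that do not exist are skipped silently.
trimmed = sample.drop(columns=["Gender", "Over18", "NotAColumn"], errors="ignore")
print(trimmed.columns.tolist())

# The original DataFrame still has all of its columns.
print(sample.columns.tolist())

# Rows are dropped by index label instead of column label.
without_first_row = sample.drop(index=[0])
print(len(without_first_row))
```

Assigning the result to a new name keeps the unmodified data available, which can make it easier to re-run a notebook cell without reloading the file.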

# Cleaned Data

In [46]:
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,EducationField,JobLevel,JobRole,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,PercentSalaryHike,PerformanceRating,TotalWorkingYears,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion
0,41,Yes,Travel_Rarely,1102,Sales,1,Life Sciences,2,Sales Executive,Single,5993,19479,8,Yes,11,3,8,6,4,0
1,49,No,Travel_Frequently,279,Research & Development,8,Life Sciences,2,Research Scientist,Married,5130,24907,1,No,23,4,10,10,7,1
2,37,Yes,Travel_Rarely,1373,Research & Development,2,Other,1,Laboratory Technician,Single,2090,2396,6,Yes,15,3,7,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,Life Sciences,1,Research Scientist,Married,2909,23159,1,Yes,11,3,8,8,7,3
4,27,No,Travel_Rarely,591,Research & Development,2,Medical,1,Laboratory Technician,Married,3468,16632,9,No,12,3,6,2,2,2


In [47]:
df.to_csv('cleaned_data.csv', index=False)
#Saving the cleaned DataFrame to a CSV file

The "df.to_csv()" method in pandas is used to write the contents of a DataFrame to a CSV (Comma Separated Values) file. It allows you to save the data in the DataFrame as a CSV file on your local machine or any other desired location.
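
To show what the index=False argument changes, the sketch below calls to_csv() on a hypothetical two-row sample without a path, in which case pandas returns the CSV text as a string instead of writing a file. Only the header lines are compared.

```python
import pandas as pd

# Hypothetical sample standing in for the cleaned DataFrame.
sample = pd.DataFrame({"Age": [41, 49], "Attrition": ["Yes", "No"]})

# With no path argument, to_csv() returns the CSV text.
with_index = sample.to_csv()
without_index = sample.to_csv(index=False)

# Header with the index: starts with an empty cell for the index column.
print(with_index.splitlines()[0])

# Header with index=False: only the column names.
print(without_index.splitlines()[0])
```

Leaving the index out avoids adding an extra unnamed column each time the file is saved and read back.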

Renaming the cleaned dataset's columns for easier analysis.

In [48]:
column_mapping = {
    'Age': 'EmployeeAge',
    'Attrition': 'EmploymentStatus',
    'BusinessTravel': 'EmployeeTravelFrequency',
    'DailyRate': 'EmployeeDailySalary',
    'Department': 'EmployeeDepartmentName',
    'DistanceFromHome': 'EmployeeCommuteDistance',
    'EducationField': 'EmployeeFieldOfStudy',
    'JobLevel': 'EmploymentLevel',
    'JobRole': 'EmployeePositionTitle',
    'MaritalStatus': 'EmployeeMaritalStatusType',
    'MonthlyIncome': 'EmployeeMonthlySalary',
    'MonthlyRate': 'EmployeeMonthlyRateOfPay',
    'NumCompaniesWorked': 'EmployeePreviousEmployersCount',
    'OverTime': 'EmployeeOvertimeStatus',
    'PercentSalaryHike': 'EmployeeSalaryIncreasePercentage',
    'PerformanceRating': 'EmployeeJobPerformanceRating',
    'TotalWorkingYears': 'EmployeeTotalYearsOfWorkExperience',
    'YearsAtCompany': 'EmployeeYearsWithCompany',
    'YearsInCurrentRole': 'EmployeeYearsInCurrentPosition',
    'YearsSinceLastPromotion': 'EmployeeYearsSinceLastPromotionReceived'
}
#Mapping each existing column name to a more descriptive name for easier analysis.

In [49]:
df = df.rename(columns=column_mapping)
#Using df.rename we'll rename all the columns.

The "df.rename()" method in pandas is used to rename the index or column labels of a DataFrame. It allows you to change the names of rows or columns according to your specific requirements.
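
One detail worth knowing: by default, keys in the mapping that do not match any column are ignored silently, so a misspelled key leaves its column unrenamed without any warning. The sketch below, on a hypothetical sample with a deliberately wrong key ("Attritoin"), shows that behaviour and the errors="raise" option, which turns the mismatch into a KeyError.

```python
import pandas as pd

# Hypothetical sample; the mapping contains one deliberately misspelled key.
sample = pd.DataFrame({"Age": [41], "Attrition": ["Yes"]})
mapping = {"Age": "EmployeeAge", "Attritoin": "EmploymentStatus"}

# Default behaviour: the unmatched key is ignored, "Attrition" keeps its name.
renamed = sample.rename(columns=mapping)
print(renamed.columns.tolist())

# errors="raise" reports keys that do not match any column.
raised = False
try:
    sample.rename(columns=mapping, errors="raise")
except KeyError as exc:
    raised = True
    print("KeyError:", exc)
```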

# Renamed Columns

In [50]:
df.head()

Unnamed: 0,EmployeeAge,EmploymentStatus,EmployeeTravelFrequency,EmployeeDailySalary,EmployeeDepartmentName,EmployeeCommuteDistance,EmployeeFieldOfStudy,EmploymentLevel,EmployeePositionTitle,EmployeeMaritalStatusType,EmployeeMonthlySalary,EmployeeMonthlyRateOfPay,EmployeePreviousEmployersCount,EmployeeOvertimeStatus,EmployeeSalaryIncreasePercentage,EmployeeJobPerformanceRating,EmployeeTotalYearsOfWorkExperience,EmployeeYearsWithCompany,EmployeeYearsInCurrentPosition,EmployeeYearsSinceLastPromotionReceived
0,41,Yes,Travel_Rarely,1102,Sales,1,Life Sciences,2,Sales Executive,Single,5993,19479,8,Yes,11,3,8,6,4,0
1,49,No,Travel_Frequently,279,Research & Development,8,Life Sciences,2,Research Scientist,Married,5130,24907,1,No,23,4,10,10,7,1
2,37,Yes,Travel_Rarely,1373,Research & Development,2,Other,1,Laboratory Technician,Single,2090,2396,6,Yes,15,3,7,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,Life Sciences,1,Research Scientist,Married,2909,23159,1,Yes,11,3,8,8,7,3
4,27,No,Travel_Rarely,591,Research & Development,2,Medical,1,Laboratory Technician,Married,3468,16632,9,No,12,3,6,2,2,2


In [51]:
df.to_csv('renamed_file.csv', index=False)
#Saving the DataFrame with the renamed columns to a CSV file.

### Since our dataset is now cleaned and sanitized, it is ready for further analysis.
