# Project: Human Resources Dataset Analysis

## Week 2: (1) Exploratory Data Analysis (EDA) and Determining Data Analysis Questions

**This table includes the column names and their descriptions:**

| **#** | **Column Name**                    | **Description**                                                |
|-------|-------------------------------------|---------------------------------------------------------------|
| 1     | EmployeeID                         | Unique identifier for each employee                           |
| 2     | FirstName                          | The first name of the employee                                |
| 3     | LastName                           | The last name of the employee                                 |
| 4     | Gender                             | Gender of the employee (Male, Female, Non-Binary, Prefer Not To Say) |
| 5     | Age                                | Age of the employee (in years)                               |
| 6     | BusinessTravel                     | Frequency of business travel (No Travel, Some Travel, Frequent Traveller) |
| 7     | Department                         | The department where the employee works                       |
| 8     | DistanceFromHome                   | Distance from the employee's home to work (in kilometers)    |
| 9     | State                              | The state where the employee resides                          |
| 10    | Ethnicity                          | Ethnicity of the employee                                     |
| 11    | EducationField                     | Field of education of the employee (e.g., IT, Marketing)    |
| 12    | JobRole                            | Job title of the employee (e.g., Software Engineer, Sales Executive) |
| 13    | MaritalStatus                      | Marital status of the employee (e.g., Married, Single)       |
| 14    | Salary                             | Salary of the employee (in local currency)                   |
| 15    | StockOptionLevel                   | Level of stock options granted to the employee (0 to 3 in this dataset) |
| 16    | OverTime                           | Whether the employee works overtime (Yes or No)              |
| 17    | HireDate                           | Date of hiring the employee (in date format)                 |
| 18    | Attrition                          | Whether the employee left the company (Yes or No)            |
| 19    | YearsAtCompany                     | Number of years the employee has been with the company       |
| 20    | YearsInMostRecentRole              | Number of years the employee has been in the most recent role |
| 21    | YearsSinceLastPromotion             | Number of years since the last promotion of the employee      |
| 22    | YearsWithCurrManager               | Number of years the employee has worked with the current manager |
| 23    | EducationLevel                     | Level of education (e.g., Bachelor's, Master's)              |
| 24    | PerformanceID                      | Performance evaluation identifier for the employee            |
| 25    | ReviewDate                         | Date of the last performance review for the employee          |
| 26    | TrainingOpportunitiesWithinYear    | Number of training opportunities available within the year    |
| 27    | TrainingOpportunitiesTaken         | Number of training opportunities taken                         |
| 28    | EnvironmentSatisfactionLevel       | Level of satisfaction with the work environment (Scale 1 to 5) |
| 29    | JobSatisfactionLevel               | Level of satisfaction with the job (Scale 1 to 5)            |
| 30    | RelationshipSatisfactionLevel      | Level of satisfaction with workplace relationships (Scale 1 to 5) |
| 31    | WorkLifeBalanceLevel               | Level of work-life balance (Scale 1 to 5)                     |
| 32    | SelfRatingLevel                    | Self-rating of the employee (Scale 1 to 5)                    |
| 33    | ManagerRatingLevel                 | Manager's rating of the employee (Scale 1 to 5)               |


In [18]:
# import Libraries
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import pyodbc

pd.options.display.max_rows = None

pd.options.display.max_columns = None
sns.set()


In [59]:
import pyodbc
import pandas as pd


def read_sql_query(query):
    """Run a SQL query against the HR_system database; return a DataFrame, or None on error."""
    server = 'DESKTOP-0CQ5N9B'
    database = 'HR_system'

    # Windows Authentication (Trusted_Connection=yes): no username or password is sent
    connection_string = (
        f"Driver={{ODBC Driver 17 for SQL Server}};"
        f"Server={server};"
        f"Database={database};"
        "Trusted_Connection=yes;"
    )

    # Defined up front so the finally block is safe even if connect() fails
    connection = None
    try:
        # Create the connection
        connection = pyodbc.connect(connection_string)
        print("Connection successful!")

        # Use a cursor to execute the query
        cursor = connection.cursor()
        cursor.execute(query)

        # Fetch the rows and read the column names from the cursor metadata
        rows = cursor.fetchall()
        columns = [column[0] for column in cursor.description]

        # Build the DataFrame
        return pd.DataFrame.from_records(rows, columns=columns)

    except Exception as e:
        print(f"Error: {e}")
        return None

    finally:
        # Close the connection exactly once, and only if it was opened
        if connection is not None:
            connection.close()
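
The same open, query, close pattern can also be written with context managers, so the connection is released on every code path. The sketch below is only an illustration: it uses an in-memory `sqlite3` database as a stand-in for the SQL Server connection, and `read_query`, `demo_connection`, and the `Employee` demo table are hypothetical names that do not exist in the project.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def read_query(connection_factory, query):
    """Run `query` on a fresh connection and return the result as a DataFrame."""
    # closing() calls .close() when the block exits, including on errors.
    with closing(connection_factory()) as connection:
        with closing(connection.cursor()) as cursor:
            cursor.execute(query)
            columns = [column[0] for column in cursor.description]
            rows = cursor.fetchall()
    return pd.DataFrame.from_records(rows, columns=columns)


def demo_connection():
    # Hypothetical stand-in for pyodbc.connect(connection_string).
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE Employee (EmployeeID TEXT, Age INTEGER)")
    connection.execute("INSERT INTO Employee VALUES ('001A-8F88', 22)")
    return connection


df_demo = read_query(demo_connection, "SELECT * FROM Employee")
print(df_demo)
```

Passing the connection factory as an argument keeps the helper independent of the driver, so swapping in `pyodbc.connect` would not change the function body.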



# Employees

In [99]:
pth = "../00-Dataset_Data_Model/"
# df_employees = pd.read_csv(f"{pth}06-All_Data_Employees.csv")

# Load the All_Data_Employees dataset from the view "FullEmployeePerformanceView"
query = """SELECT * FROM FullEmployeePerformanceView;"""

df_employees = read_sql_query(query)

df_employees.head()

Connection successful!


Unnamed: 0,EmployeeID,FirstName,LastName,Gender,Age,BusinessTravel,Department,DistanceFromHome,State,Ethnicity,EducationField,JobRole,MaritalStatus,Salary,StockOptionLevel,OverTime,HireDate,Attrition,YearsAtCompany,YearsInMostRecentRole,YearsSinceLastPromotion,YearsWithCurrManager,EducationLevel,PerformanceID,ReviewDate,TrainingOpportunitiesWithinYear,TrainingOpportunitiesTaken,EnvironmentSatisfactionLevel,JobSatisfactionLevel,RelationshipSatisfactionLevel,WorkLifeBalanceLevel,SelfRatingLevel,ManagerRatingLevel
0,001A-8F88,Christy,Jumel,Male,22,Some Travel,Technology,40,CA,White,Information Systems,Software Engineer,Married,27763.0,0,No,2021-09-05,No,1,0,1,0,Masters,,NaT,,,,,,,,
1,005C-E0FB,Fin,O'Halleghane,Non-Binary,24,Frequent Traveller,Sales,17,CA,White,Marketing,Sales Executive,Married,56155.0,1,No,2017-08-26,No,5,2,2,0,Masters,PR4067,2020-06-17,1.0,2.0,Neutral,Neutral,Dissatisfied,Dissatisfied,Exceeds Expectation,Meets Expectation
2,005C-E0FB,Fin,O'Halleghane,Non-Binary,24,Frequent Traveller,Sales,17,CA,White,Marketing,Sales Executive,Married,56155.0,1,No,2017-08-26,No,5,2,2,0,Masters,PR5070,2021-06-17,1.0,1.0,Satisfied,Satisfied,Very Satisfied,Very Satisfied,Meets Expectation,Meets Expectation
3,005C-E0FB,Fin,O'Halleghane,Non-Binary,24,Frequent Traveller,Sales,17,CA,White,Marketing,Sales Executive,Married,56155.0,1,No,2017-08-26,No,5,2,2,0,Masters,PR6165,2022-06-17,3.0,0.0,Neutral,Satisfied,Very Satisfied,Satisfied,Exceeds Expectation,Exceeds Expectation
4,00A3-2445,Wyatt,Ziehm,Male,30,Some Travel,Technology,6,CA,Black or African American,Computer Science,Machine Learning Engineer,Married,126238.0,0,No,2012-03-08,No,10,3,6,6,High School,PR1165,2016-06-19,2.0,2.0,Satisfied,Very Satisfied,Satisfied,Very Satisfied,Exceeds Expectation,Meets Expectation
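
The preview shows that the view holds one row per performance review, so an employee with several reviews (for example `005C-E0FB`) appears several times. A minimal sketch of how row-level and employee-level shares can differ, using a small hand-built frame that mirrors the five rows above (`df_demo` is an illustrative name, not part of the notebook):

```python
import pandas as pd

# Hand-built sample mirroring the five preview rows (one row per review).
df_demo = pd.DataFrame(
    {
        "EmployeeID": ["001A-8F88", "005C-E0FB", "005C-E0FB", "005C-E0FB", "00A3-2445"],
        "Gender": ["Male", "Non-Binary", "Non-Binary", "Non-Binary", "Male"],
        "PerformanceID": [None, "PR4067", "PR5070", "PR6165", "PR1165"],
    }
)

# Keep the first row per employee to get an employee-level table.
employee_level = df_demo.drop_duplicates(subset="EmployeeID")

# Shares computed per row are weighted by the number of reviews.
row_share = df_demo["Gender"].value_counts(normalize=True)
employee_share = employee_level["Gender"].value_counts(normalize=True)

print(row_share)
print(employee_share)
```

In this sample, `Non-Binary` is 60% of rows but one third of employees, which is why demographic percentages are usually safer to compute on the deduplicated frame.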


In [101]:
# df_employees["HireDate"]=pd.to_datetime(df_employees["HireDate"], format='%Y-%m-%d', errors='coerce')
# Display df_employees data information
df_employees.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6899 entries, 0 to 6898
Data columns (total 33 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   EmployeeID                       6899 non-null   object        
 1   FirstName                        6899 non-null   object        
 2   LastName                         6899 non-null   object        
 3   Gender                           6899 non-null   object        
 4   Age                              6899 non-null   int64         
 5   BusinessTravel                   6899 non-null   object        
 6   Department                       6899 non-null   object        
 7   DistanceFromHome                 6899 non-null   int64         
 8   State                            6899 non-null   object        
 9   Ethnicity                        6899 non-null   object        
 10  EducationField                   6899 non-null   object     
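
Because the `info()` listing is cut off here, the non-null counts of the later columns are not visible. One way to get the same information compactly is to count nulls per column. The sketch below runs on a small hand-built frame (`df_demo` is illustrative) and assumes, as the preview suggests, that the review fields are empty for employees without a review.

```python
import pandas as pd

# Illustrative frame: the first employee has no performance review.
df_demo = pd.DataFrame(
    {
        "EmployeeID": ["001A-8F88", "005C-E0FB", "005C-E0FB"],
        "PerformanceID": [None, "PR4067", "PR5070"],
        "ReviewDate": pd.to_datetime([None, "2020-06-17", "2021-06-17"]),
    }
)

# Number of missing values in each column.
missing_counts = df_demo.isna().sum()

# Rows where the review date is missing.
missing_rows = df_demo[df_demo["ReviewDate"].isna()]

print(missing_counts[missing_counts > 0])
print(missing_rows["EmployeeID"].tolist())
```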


## Summary statistics

In [103]:
numeric_df_employees = [
    "Age",
    "DistanceFromHome",
    "Salary",
    "StockOptionLevel",
    "HireDate",
    "YearsAtCompany",
    "YearsInMostRecentRole",
    "YearsSinceLastPromotion",
    "YearsWithCurrManager",
    "ReviewDate", 
    "TrainingOpportunitiesWithinYear",
    "TrainingOpportunitiesTaken"
]

categorical_df_employees =[
    "Gender",
    "BusinessTravel",
    "Department",
    "State",
    "Ethnicity",
    "EducationField",
    "JobRole",
    "MaritalStatus",
    "OverTime",
    "Attrition",
    "EducationLevel",
    "EnvironmentSatisfactionLevel", 
    "JobSatisfactionLevel", 
    "RelationshipSatisfactionLevel",
    "WorkLifeBalanceLevel", 
    "SelfRatingLevel",
    "ManagerRatingLevel"
]
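
The two lists above are maintained by hand. As a cross-check, column groups can also be derived from the dtypes. This is a sketch on a hand-built frame (`df_demo` and the three list names are illustrative); on the real data, identifier and name columns would land in the last group and would still need to be filtered out manually.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "Age": [22, 24],
        "Salary": [27763.0, 56155.0],
        "HireDate": pd.to_datetime(["2021-09-05", "2017-08-26"]),
        "Gender": ["Male", "Non-Binary"],
    }
)

# Integer and float columns.
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()

# Datetime columns, kept separate because mean/std are less meaningful for dates.
datetime_cols = df_demo.select_dtypes(include="datetime").columns.tolist()

# Everything else (text labels, identifiers).
other_cols = df_demo.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print(numeric_cols, datetime_cols, other_cols)
```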




#### **Numeric and Date Columns Table:**

| **#** | **Column Name**                    | **Description**                                          |
|-------|-------------------------------------|---------------------------------------------------------|
| 1     | Age                                | The age of the employee (in years)                     |
| 2     | DistanceFromHome                   | The distance from the employee's home to work (in km)  |
| 3     | Salary                             | The employee's salary (in local currency)              |
| 4     | StockOptionLevel                   | The level of stock options granted to the employee (0 to 3 in this dataset) |
| 5     | HireDate                           | The date of hiring the employee (in date format)       |
| 6     | YearsAtCompany                     | The number of years the employee has been with the company |
| 7     | YearsInMostRecentRole              | The number of years the employee has spent in their most recent role |
| 8     | YearsSinceLastPromotion             | The number of years since the employee's last promotion |
| 9     | YearsWithCurrManager               | The number of years the employee has worked with the current manager |
| 10    | ReviewDate                         | The date of the employee's last performance review      |
| 11    | TrainingOpportunitiesWithinYear    | The number of training opportunities available within the year |
| 12    | TrainingOpportunitiesTaken         | The number of training opportunities that have been taken |


In [104]:
# Summary statistics of numeric
df_employees[numeric_df_employees].describe()


Unnamed: 0,Age,DistanceFromHome,Salary,StockOptionLevel,HireDate,YearsAtCompany,YearsInMostRecentRole,YearsSinceLastPromotion,YearsWithCurrManager,ReviewDate,TrainingOpportunitiesWithinYear,TrainingOpportunitiesTaken
count,6899.0,6899.0,6899.0,6899.0,6899,6899.0,6899.0,6899.0,6899.0,6709,6709.0,6709.0
mean,30.604146,22.327874,110898.374112,0.725467,2016-01-12 15:39:53.564284416,5.578055,2.778953,4.143934,2.741412,2019-04-14 05:12:04.936652288,2.012968,1.01729
min,18.0,1.0,20387.0,0.0,2012-01-03 00:00:00,0.0,0.0,0.0,0.0,2013-01-02 00:00:00,1.0,0.0
25%,25.0,12.0,44646.0,0.0,2013-07-02 00:00:00,3.0,0.0,1.0,0.0,2017-05-21 00:00:00,1.0,0.0
50%,28.0,22.0,74458.0,1.0,2015-05-19 00:00:00,6.0,2.0,4.0,2.0,2019-09-15 00:00:00,2.0,1.0
75%,36.0,33.0,137219.5,1.0,2018-06-11 00:00:00,9.0,5.0,7.0,5.0,2021-06-01 00:00:00,3.0,2.0
max,51.0,45.0,547204.0,3.0,2022-12-31 00:00:00,10.0,10.0,10.0,10.0,2022-12-31 00:00:00,3.0,3.0
std,7.986542,12.899799,98427.862382,0.839724,,3.410087,2.81017,3.20377,2.792284,,0.82031,0.950316
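
The `Salary` quartiles in this output can be used for a quick outlier screen with the common 1.5 × IQR rule. The rule is a convention chosen here for illustration, not a step taken from the notebook; the three input values are copied from the table above.

```python
# Salary quartiles and maximum, copied from the describe() output above.
q1 = 44646.0
q3 = 137219.5
maximum = 547204.0

# Interquartile range and the conventional upper fence.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(upper_fence, maximum > upper_fence)  # 276079.75 True
```

The maximum salary lies well above the fence, so the upper tail of `Salary` deserves a closer look (for example with a box plot) before modelling.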


_________________________________________________________________

#### **Categorical Columns Table:**

| **#** | **Column Name**                    | **Description**                                          |
|-------|-------------------------------------|---------------------------------------------------------|
| 1     | Gender                             | The gender of the employee (Male, Female, Non-Binary, Prefer Not To Say) |
| 2     | BusinessTravel                     | Frequency of business travel (No Travel, Some Travel, Frequent Traveller) |
| 3     | Department                         | The department the employee works in                    |
| 4     | State                              | The state where the employee resides                     |
| 5     | Ethnicity                          | The ethnicity of the employee                           |
| 6     | EducationField                     | The field of education of the employee (e.g., Information Technology, Marketing) |
| 7     | JobRole                            | The job role of the employee (e.g., Software Engineer, Sales Representative) |
| 8     | MaritalStatus                      | The marital status of the employee (e.g., Married, Single) |
| 9     | OverTime                           | Whether the employee works overtime (Yes or No)        |
| 10    | Attrition                          | Whether the employee has left the company (Yes or No)  |
| 11    | EducationLevel                     | The level of education (e.g., Bachelor's, Master's)    |
| 12    | EnvironmentSatisfactionLevel       | The level of satisfaction with the work environment (Scale of 1 to 5) |
| 13    | JobSatisfactionLevel               | The level of satisfaction with the job (Scale of 1 to 5) |
| 14    | RelationshipSatisfactionLevel      | The level of satisfaction with workplace relationships (Scale of 1 to 5) |
| 15    | WorkLifeBalanceLevel               | The level of work-life balance (Scale of 1 to 5)       |
| 16    | SelfRatingLevel                    | The employee's self-rating (Scale of 1 to 5)           |
| 17    | ManagerRatingLevel                 | The manager's rating of the employee (Scale of 1 to 5)  |


In [70]:
# Summary statistics of categorical
df_employees[categorical_df_employees].describe(include=[object]).T


Unnamed: 0,count,unique,top,freq
Gender,6899,4,Female,3171
BusinessTravel,6899,3,Some Travel,4871
Department,6899,3,Technology,4375
State,6899,3,CA,4162
Ethnicity,6899,7,White,3497
EducationField,6899,9,Computer Science,2039
JobRole,6899,13,Sales Executive,1567
MaritalStatus,6899,3,Married,2922
Attrition,6899,2,No,4638
EducationLevel,6899,5,Bachelors,2706


In [52]:
# Percentage distribution of each categorical column
for cat_df in categorical_df_employees:
    print(f"Summary statistics of: {cat_df}")
    print((df_employees[cat_df].value_counts(normalize=True)*100).to_string(index=True, header=True,float_format='%.2f%%'))
    print("\n")  


Summary statistics of: Gender
Gender
Female              45.96%
Male                44.33%
Non-Binary           8.75%
Prefer Not To Say    0.96%


Summary statistics of: BusinessTravel
BusinessTravel
Some Travel          70.60%
Frequent Traveller   20.21%
No Travel             9.19%


Summary statistics of: Department
Department
Technology        63.41%
Sales             32.05%
Human Resources    4.54%


Summary statistics of: State
State
CA   60.33%
NY   27.60%
IL   12.07%


Summary statistics of: Ethnicity
Ethnicity
White                              50.69%
Mixed or multiple ethnic groups    15.93%
Black or African American          15.83%
Asian or Asian American             9.90%
American Indian or Alaska Native    4.12%
Native Hawaiian                     2.28%
Other                               1.26%


Summary statistics of: EducationField
EducationField
Computer Science      29.56%
Information Systems   23.13%
Marketing             11.62%
Marketing             11.42%
Business St
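
`Marketing` is listed twice in the `EducationField` counts above, which usually means two labels that look the same but differ in some invisible way. The sketch below assumes the difference is trailing whitespace; that is a guess, since the output does not show the raw strings, and `field` is a hand-built illustrative series.

```python
import pandas as pd

# Illustrative series: one label carries a trailing space (assumed cause).
field = pd.Series(["Marketing", "Marketing ", "Computer Science", "Marketing"])

# Raw counts treat "Marketing" and "Marketing " as different categories.
raw = field.value_counts()

# Stripping surrounding whitespace merges them.
clean = field.str.strip().value_counts()

print(raw)
print(clean)
```

Printing `field.unique()` with `repr` on the real column would confirm whether whitespace, letter case, or something else causes the split.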

__________________________________

## Summary 

## PerformanceRating

# Summary statistics

_____________________________________

## (2) Data Cleaning:
After building the data model, we proceeded with data cleaning and preprocessing. Here’s a summary of the key observations:

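- The source tables had no missing values or unusual entries. Each field, such as age, salary, and years of experience, showed values within expected ranges. In the joined view `FullEmployeePerformanceView`, however, 190 of the 6,899 rows (6,899 - 6,709) have empty performance-review fields, which appears to reflect employees with no review on record rather than a data-quality defect.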
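- Data entries for categorical variables like gender, marital status, and job roles were consistent, without spelling or formatting issues. One point to recheck: the `EducationField` value counts list `Marketing` twice (11.62% and 11.42%), which suggests two variants of the same label.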
- Date fields, such as the employee `HireDate` and `ReviewDate` in the performance review table, were in the correct format and adhered to the expected chronological order.
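- Numeric fields, including `Salary`, `YearsAtCompany`, and `DistanceFromHome`, were confirmed to contain only valid numbers without inconsistent values. `Salary` is strongly right-skewed (median 74,458, maximum 547,204), so its largest values are extreme relative to the rest of the distribution even though they are valid entries.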
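
The checks summarized in this list can be expressed as a handful of counts. The sketch below is one possible way to do it, not the procedure actually used in the project; `df_demo` is a hand-built frame with deliberate problems, and the check names are illustrative.

```python
import pandas as pd

# Illustrative frame with one duplicate ID, one review before hire, one bad salary.
df_demo = pd.DataFrame(
    {
        "PerformanceID": ["PR4067", "PR5070", "PR5070"],
        "HireDate": pd.to_datetime(["2017-08-26", "2017-08-26", "2021-09-05"]),
        "ReviewDate": pd.to_datetime(["2020-06-17", "2021-06-17", "2021-01-01"]),
        "Salary": [56155.0, 56155.0, -1.0],
    }
)

checks = {
    # Total number of null cells in the frame.
    "missing_values": int(df_demo.isna().sum().sum()),
    # Review identifiers that occur more than once.
    "duplicate_review_ids": int(df_demo["PerformanceID"].duplicated().sum()),
    # Reviews dated before the employee was hired.
    "review_before_hire": int((df_demo["ReviewDate"] < df_demo["HireDate"]).sum()),
    # Salaries that are zero or negative.
    "non_positive_salary": int((df_demo["Salary"] <= 0).sum()),
}

print(checks)
```

A clean dataset would report zero for every entry; any non-zero count points to the rows that need inspection.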

## Conclusion:
The dataset was thoroughly examined and found to be consistent and aligned with the designed data model. The source tables contained no missing values, illogical entries, or duplicate records; the only nulls appear in the joined view, in the performance-review fields of rows that have no review.


- **Python (pandas, Matplotlib)**: For detailed data cleaning and visual inspection.

# END