<a href="https://colab.research.google.com/github/Pinlinzz/Analisis-Dataset-Employee-Salary-Analysis-/blob/main/Analisis_Dataset_Employee_Salary_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INISIASI DAN IMPORTING DATASET


Let's check the structure and summary of the R data frame, similar to `df.info()` and `df.describe()` in Python.

In [7]:
# Load the dataset into an R data frame
salary_df <- read.csv('/content/dataset/Salary_Dataset.csv')

# Display the first 5 rows of the data frame
summary(salary_df)

  Employee_ID         Name                Age           Gender         
 Min.   :   1.0   Length:1200        Min.   :20.00   Length:1200       
 1st Qu.: 300.8   Class :character   1st Qu.:30.00   Class :character  
 Median : 600.5   Mode  :character   Median :41.00   Mode  :character  
 Mean   : 600.5                      Mean   :39.99                     
 3rd Qu.: 900.2                      3rd Qu.:50.00                     
 Max.   :1200.0                      Max.   :59.00                     
   Country              City            Education          Job_Title        
 Length:1200        Length:1200        Length:1200        Length:1200       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                  

# Data Cleaning & Preprocessing


First, let's check for missing values in the `salary_df` data frame.

In [8]:
# Check for missing values in the dataset
missing_vals <- colSums(is.na(salary_df))
cat('Missing values in each column:\n')
print(missing_vals)

Missing values in each column:
        Employee_ID                Name                 Age              Gender 
                  0                   0                   0                   0 
            Country                City           Education           Job_Title 
                  0                   0                   0                   0 
         Department    Experience_Years          Salary_USD           Bonus_USD 
                  0                   0                   0                   0 
Work_Hours_Per_Week         Remote_Work   Performance_Score        Joining_Year 
                  0                   0                   0                   0 
      Contract_Type 
                  0 


Now, we'll apply an imputation strategy: median for numeric missing values and mode for categorical missing values. We'll define a simple function to calculate the mode.

In [12]:
# Function to get the mode (most frequent value) of a vector
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Impute missing values
for (col_name in names(salary_df)) {
  if (any(is.na(salary_df[[col_name]]))) {
    if (is.numeric(salary_df[[col_name]])) {
      median_val <- median(salary_df[[col_name]], na.rm = TRUE)
      salary_df[[col_name]][is.na(salary_df[[col_name]])] <- median_val
    } else { # Assume it's categorical or needs mode imputation
      mode_val <- get_mode(salary_df[[col_name]][!is.na(salary_df[[col_name]])])
      salary_df[[col_name]][is.na(salary_df[[col_name]])] <- mode_val
    }
  }
}

cat('\nMissing values after imputation:\n')
print(colSums(is.na(salary_df)))


Missing values after imputation:
        Employee_ID                Name                 Age              Gender 
                  0                   0                   0                   0 
            Country                City           Education           Job_Title 
                  0                   0                   0                   0 
         Department    Experience_Years          Salary_USD           Bonus_USD 
                  0                   0                   0                   0 
Work_Hours_Per_Week         Remote_Work   Performance_Score        Joining_Year 
                  0                   0                   0                   0 
      Contract_Type 
                  0 

Data types after conversion:
        Employee_ID                Name                 Age              Gender 
          "numeric"         "character"           "numeric"         "character" 
            Country                City           Education           Job_Title 
   

#Exploratory Data Analysis


In [13]:
# Summary counts for categorical variables
categorical_cols <- c('Gender', 'Country', 'City', 'Education', 'Job_Title', 'Department', 'Remote_Work', 'Contract_Type')

for (col_name in categorical_cols) {
  cat(paste0('\nValue counts for ', col_name, ':\n'))
  print(table(salary_df[[col_name]]))
}


Value counts for Gender:

Female   Male 
   596    604 

Value counts for Country:

  Canada    India Pakistan       UK      USA 
     234      266      247      216      237 

Value counts for City:

  Delhi Karachi      LA  London      NY Toronto 
    200     193     192     202     226     187 

Value counts for Education:

   Bachelor High School      Master         PhD 
        293         277         309         321 

Value counts for Job_Title:

  Analyst  Designer Developer   Manager 
      322       303       276       299 

Value counts for Department:

  Finance        HR        IT Marketing 
      283       328       303       286 

Value counts for Remote_Work:

 No Yes 
575 625 

Value counts for Contract_Type:

 Contract Full-Time Part-Time 
      406       377       417 


In [33]:
# Correlation analysis on numeric columns
# Select only numeric columns

numeric_df <- salary_df[sapply(salary_df, is.numeric)]

# Check if there are at least two numeric features for correlation (R's cor() requires at least 2)
if (ncol(numeric_df) >= 2) {
  corr_matrix <- cor(numeric_df)
  cat('\nCorrelation Matrix:\n')
  print(corr_matrix)
} else {
  cat('\nNot enough numeric columns for correlation analysis (need at least 2).\n')
}

ggcorrplot(corr, hc.order = TRUE, type = "lower",
   lab = TRUE)


Correlation Matrix:
                     Employee_ID          Age Experience_Years   Salary_USD
Employee_ID          1.000000000 -0.008069908     -0.021545193  0.021925719
Age                 -0.008069908  1.000000000     -0.031257252  0.003948856
Experience_Years    -0.021545193 -0.031257252      1.000000000 -0.005786815
Salary_USD           0.021925719  0.003948856     -0.005786815  1.000000000
Bonus_USD           -0.029867220 -0.054355095      0.009791218 -0.011814010
Work_Hours_Per_Week  0.029871635  0.007763162     -0.026546002  0.013021541
Performance_Score   -0.031549700  0.049837086     -0.053388829  0.016017413
Joining_Year        -0.002531153  0.038053205     -0.008723624  0.007420720
                       Bonus_USD Work_Hours_Per_Week Performance_Score
Employee_ID         -0.029867220         0.029871635       -0.03154970
Age                 -0.054355095         0.007763162        0.04983709
Experience_Years     0.009791218        -0.026546002       -0.05338883
Salary_USD 

ERROR: Error in ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE): could not find function "ggcorrplot"


# Data Visualization
