# **Project Name**    - Glassdoor Jobs Salary Prediction




##### **Project Type**    - Supervised Regression
##### **Contribution**    - Individual - Gyanvir Singh


# **Project Summary -**

This project focuses on understanding and predicting salaries based on job listings and company-related information. The primary aim was to analyze patterns in salary distribution and build a model capable of predicting average salaries using features like job type, company size, location, industry, and rating. The dataset included job postings from various companies, with details such as estimated salary ranges, employer-provided salary flags, and categorical information like sector, ownership type, and state.

The first step involved cleaning the dataset by removing or transforming unclear values, encoding categorical variables, and scaling numerical features. We also examined the skewness of the data and corrected it selectively, only where doing so improved the model's performance. Feature selection was guided by correlation and tree-based importance scores. Initial visual analysis revealed that average salaries varied significantly with company size, job type, and location. For example, both very large and very small companies offered higher salaries than mid-sized firms, and California-based jobs generally had higher average salaries.
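
The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the column names (`rating`, `company_age`, `sector`, `job_state`) and the toy rows are assumptions, and the log transform is shown on a single column only to mirror the "selective" handling of skewness.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows with hypothetical column names; the real dataset's schema may differ.
jobs = pd.DataFrame({
    "rating": [3.8, 4.1, 2.9, 4.5],
    "company_age": [12, 45, 3, 150],   # right-skewed numeric feature
    "sector": ["Tech", "Finance", "Tech", "Health"],
    "job_state": ["CA", "NY", "CA", "TX"],
})

# Compress the long right tail of the skewed feature before scaling.
jobs["company_age"] = np.log1p(jobs["company_age"])

numeric_cols = ["rating", "company_age"]
categorical_cols = ["sector", "job_state"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),          # zero mean, unit variance
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(jobs)
print(X.shape)  # 2 scaled numeric columns + 3 sector dummies + 3 state dummies
```

Fitting the transformer on the training split only, and reusing it on the test split, would avoid leaking test-set statistics into the scaler.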

We trained two models—Linear Regression and Random Forest—to predict salaries. While Linear Regression gave us a baseline with low predictive power (R² ≈ 0.15), Random Forest significantly outperformed it with an R² of around 0.67. Feature importance scores from Random Forest indicated that hourly jobs, job state, company rating, and industry were among the most influential variables. Hyperparameter tuning was also performed, although the tuned model slightly underperformed compared to the default in terms of R², indicating that the original settings may have generalized better for this dataset.
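
A comparison of this kind might look like the sketch below. It uses synthetic data with a deliberately non-linear target, so the scores it prints are not the project's results; the feature matrix, the split ratio, and the `random_state` values are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the encoded feature matrix (NOT the Glassdoor data).
X = rng.uniform(-1, 1, size=(600, 3))
# The target depends non-linearly on feature 0 only; features 1 and 2 are noise.
y = 100 * np.abs(X[:, 0]) + rng.normal(0, 5, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# Impurity-based importances; they sum to 1 across features.
importances = models["forest"].feature_importances_

print(scores)
print(importances)
```

On data like this the linear model scores close to zero while the forest recovers most of the signal, which is the same qualitative gap reported above.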

We addressed the core objectives of the project by building a salary prediction model, visualizing how average salary varies by company size, identifying the top predictive features, and evaluating different models to recommend the best-performing one. This approach enables reasonably accurate salary prediction and also helps in understanding what drives compensation differences in job markets.

Future improvements could involve incorporating textual data from job descriptions, using deep learning or NLP-based models, and deploying the solution as an interactive web application. This would make it easier for job seekers or HR departments to use the model for salary benchmarking and decision-making.


**Key impacts:**
1. Informed Salary Benchmarking:
    - Companies and job seekers can better understand how different factors—like company size, location, and industry—impact salary expectations, allowing for more data-driven negotiations and job decisions.
1. Accurate Salary Prediction Model:
    - With Random Forest achieving an R² of about 0.67, the model provides a reasonable first estimate of salaries for new job listings, helping employers set competitive offers and job seekers evaluate offers realistically.
1. Feature Transparency for HR Teams:
    - The model highlights key drivers of salary (e.g., job type, state, and company rating), giving HR and compensation teams clear direction on what influences pay the most across different job roles.

# **GitHub Link -**

https://github.com/Gyanvir/GlassDoor_Jobs_Salary_Prediction

# **Problem Statement**


The goal of this project is to develop a data-driven solution that can help both job seekers and employers make more informed decisions regarding salary expectations. Using job listing data, we aim to:
1. Understand how different job-related factors—such as company size, industry, location, and employer ratings—affect salary offerings.
1. Analyze and visualize the relationship between company size and average salary to uncover any meaningful patterns.
1. Identify the most influential features in determining salary, using statistical and model-based techniques.
1. Build and evaluate a machine learning model that can accurately predict a job’s average salary based on available features.

This analysis is intended to support salary transparency, help with fair compensation planning, and enable smarter decision-making in the hiring ecosystem.

# **Chart Descriptions**


**Chart 1 – Distribution of Average Salary**

Why this chart?
To understand overall salary spread across job listings.

Insights:
Helps detect outliers, skewness, and median salary.

Business Impact:
Gives stakeholders a benchmark for market salary expectations.
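
The quantities this chart is meant to surface (median, skewness, outliers) can also be checked numerically. The sketch below uses only the standard library; the salary values are made up for illustration, and the 1.5 × IQR fence is one common outlier rule, not necessarily the one used in the project.

```python
import statistics

# Hypothetical average salaries in thousands of dollars (not the project's data).
salaries = [52, 61, 68, 70, 74, 79, 85, 90, 96, 104, 118, 250]

mean = statistics.fmean(salaries)
median = statistics.median(salaries)
stdev = statistics.pstdev(salaries)

# Population skewness: average cubed z-score; a value > 0 means a long right tail.
skewness = sum(((s - mean) / stdev) ** 3 for s in salaries) / len(salaries)

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1
outliers = [s for s in salaries if s < q1 - 1.5 * iqr or s > q3 + 1.5 * iqr]

print(median, round(skewness, 2), outliers)
```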

**Chart 2 – Job Title Frequency**

Why this chart?
To see which roles are most in-demand.

Insights:
Reveals hiring patterns and skill trends.

Business Impact:
Useful for job seekers or companies expanding their hiring.

**Chart 3 – Company Frequency**

Why this chart?
To identify top recruiters.

Insights:
Highlights market leaders in hiring.

Business Impact:
Job seekers can prioritize these companies.

**Chart 4 – Revenue Categories**

Why this chart?
To explore company size as measured by revenue.

Insights:
Shows how many listings come from high vs. low revenue companies.

Business Impact:
Smaller firms may offer different roles or pay compared to larger ones.

**Chart 5 – Job States Frequency**

Why this chart?
To see geographic job distribution.

Insights:
Reveals which states have the highest job demand.

Business Impact:
Useful for relocation decisions or regional hiring strategies.

**Chart 6 – Average Salary by Job State**

Why this chart?
To identify high-paying states.

Insights:
Helps assess geographic salary trends.

Business Impact:
Useful for strategic hiring or relocation planning.

**Chart 7 – Salary vs. Company Rating**

Why this chart?
To see if better-rated companies pay more.

Insights:
A mild trend may emerge, and the chart can also expose outliers such as high-paying but low-rated companies.

Business Impact:
Can guide job seekers on trade-offs between rating and salary.

**Chart 8 – Boxplot: Salary by Company Ownership**

Why this chart?
To compare salary ranges across private, public, non-profits, etc.

Insights:
Reveals variability in salaries per ownership category.

Business Impact:
Useful for understanding compensation philosophy across org types.

**Chart 9 – Countplot: Same-State Job vs HQ**

Why this chart?
To analyze co-location of job listings with company HQs.

Insights:
Shows how often jobs are located outside the headquarters state, for example at branch offices or in remote roles.

Business Impact:
Useful for remote work policy planning or facility expansion.

**Chart 10 – Salary by Competitor Info**

Why this chart?
To test whether companies that reveal competitors pay differently.

Insights:
Can be a proxy for company transparency or maturity.

Business Impact:
Could imply openness vs. secrecy in salary benchmarking.

**Chart 11 – Salary vs. Rating Across Job States**

Why this chart?
To observe how salary-rating relationships differ across locations.

Insights:
Can reveal regional differences in pay fairness or compensation strategy.

Business Impact:
Useful for salary benchmarking across states and employer types.

**Chart 12 – Count of Jobs by Revenue and Ownership**

Why this chart?
To explore the intersection between company size and ownership.

Insights:
Shows where most jobs are concentrated, for example in large private firms or small public ones.

Business Impact:
Helps forecast which types of companies are scaling hiring efforts.

**Chart 13 – Boxplot of Salary vs. State and Revenue**

Why this chart?
To see how salary varies by both location and company revenue.

Insights:
California salaries are high, even at lower-revenue firms.

Business Impact:
Aids decision-makers in setting pay structures based on geography + scale.

**Chart 14 – Heatmap: Correlation Between All Numeric Features**

Why this chart?
To find strong pairwise correlations between numeric features.

Insights:
Shows how salary relates to rating, competitor info, etc.

Business Impact:
Can guide feature selection in modeling.
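
The matrix behind such a heatmap can be computed directly with pandas. In this sketch the column names and values are hypothetical; `DataFrame.corr()` returns Pearson correlations by default, and the resulting frame is what a heatmap function would plot.

```python
import pandas as pd

# Hypothetical numeric features (illustrative values only).
df = pd.DataFrame({
    "avg_salary": [60, 85, 72, 120, 95, 110],
    "rating": [3.1, 3.9, 3.5, 4.4, 3.8, 4.0],
    "has_competitors": [0, 1, 0, 1, 1, 0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

# Rank the other features by the strength of their correlation with salary.
salary_corr = corr["avg_salary"].drop("avg_salary")
ranked = salary_corr.abs().sort_values(ascending=False)

print(corr.round(2))
print(ranked.index.tolist())
```

Pearson correlation only measures linear association, so a feature with a weak coefficient here can still matter to a tree-based model.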

**Chart 15 – Pairplot of Key Features**

Why this chart?
To visualize bivariate relationships in compact form.

Insights:
Detects potential clustering or correlations visually.

Business Impact:
Good sanity check before modeling.

# **Model Selections**


### 1. Linear Regression

**Model Explanation:**  
Linear Regression is a simple, interpretable algorithm that assumes a linear relationship between the input features and the target variable (average salary). It's often used as a baseline model in regression problems.

**Performance:**  
- **MSE (Mean Squared Error):** ~1087.94  
- **R² Score:** ~0.15  

These metrics suggest that the model performs poorly, explaining only about 15% of the variance in average salary. The assumption of linearity likely doesn't hold well for this dataset, so the model underfits.
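
For reference, both metrics can be computed by hand. The sketch below uses the standard definitions (MSE as the mean of squared errors, R² as one minus the ratio of residual to total sum of squares) on made-up numbers; the `actual` and `predicted` lists are illustrative and unrelated to the project's data.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [60, 80, 100, 120]        # hypothetical salaries
predicted = [70, 75, 105, 110]     # hypothetical model output
baseline = [90] * len(actual)      # always predicting the mean scores R² = 0

print(mse(actual, predicted), r2(actual, predicted), r2(actual, baseline))

# RMSE puts the reported error back in the salary's own units.
rmse_linear = math.sqrt(1087.94)
print(round(rmse_linear, 2))
```

An R² near 0.15 therefore means the model is only slightly better than predicting the mean salary for every listing, and the reported MSE corresponds to a typical error of roughly 33 in whatever unit the salary column uses.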

**Business Interpretation:**  
While Linear Regression is easy to understand and implement, its poor performance indicates that the salary structure is not linearly related to the features. Hence, relying on this model for salary prediction in a real-world business scenario could lead to inaccurate estimates and misinformed decisions.


### 2. Random Forest Regressor

**Model Explanation:**  
Random Forest is an ensemble learning method that builds multiple decision trees and averages their results for regression tasks. It can capture complex, non-linear relationships and automatically handles feature interactions.
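
The averaging step can be verified directly in scikit-learn, where a fitted forest exposes its individual trees through `estimators_`. The data below is synthetic and the forest size is arbitrary; the point is only that the forest's prediction matches the mean of its trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic non-linear target (not the project's data).
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) * 20 + X[:, 1] ** 2

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predict with every individual tree, then average across trees.
sample = X[:5]
per_tree = np.stack([tree.predict(sample) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

matches = np.allclose(manual_average, forest.predict(sample))
print(matches)
```

Because each tree is trained on a different bootstrap sample, averaging them reduces the variance of any single tree's predictions.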

**Performance (Before Tuning):**  
- **MSE:** ~425.03  
- **R² Score:** ~0.67  

**Performance (After Hyperparameter Tuning):**  
- **MSE:** ~460-480  
- **R² Score:** ~0.62-0.63  

The untuned model surprisingly performed slightly better, showing that even default hyperparameters provide strong results.
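
One way to run such a comparison is sketched below, assuming a grid search with cross-validation; the source does not state which search method or grid was used, so `param_grid`, the fold count, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(7)

# Synthetic features and target (not the project's data).
X = rng.normal(size=(300, 4))
y = 30 * np.abs(X[:, 0]) + 10 * X[:, 1] + rng.normal(0, 5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline forest with no tuned depth or leaf-size limits.
baseline_rf = RandomForestRegressor(n_estimators=50, random_state=0)
baseline_rf.fit(X_train, y_train)

# Hypothetical grid; the winner is chosen by cross-validated R² on the training split.
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_train, y_train)


def report(model):
    """Score a fitted model on the held-out test split."""
    pred = model.predict(X_test)
    return {"mse": mean_squared_error(y_test, pred), "r2": r2_score(y_test, pred)}


results = {"baseline": report(baseline_rf), "tuned": report(search.best_estimator_)}
print(search.best_params_)
print(results)
```

The search optimizes the cross-validated score on the training data, so the selected configuration is not guaranteed to beat the baseline on a separate test split. A small gap in either direction, like the one reported above, can come from the choice of grid or from sampling noise.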

**Business Interpretation:**  
Random Forest provides significantly better predictive performance, suggesting it captures non-linear salary trends influenced by factors such as job state, rating, or industry. From a business standpoint, this model can offer more reliable salary predictions, supporting HR teams in decision-making, job seekers in setting expectations, and platforms in automating compensation analytics.

---

**Final Selection:**  
Given its superior performance and robustness, the **Random Forest Regressor** (untuned) is selected as the final model for salary prediction in this project.
