# **Project Name**    - Glassdoor Job Market Explorer



##### **Project Type**    - EDA/Regression/Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 - Kanishk Khadria**


# **Project Summary -

 # 💼 Glassdoor Salary Insights via Exploratory Data Analysis (EDA)

---

## 📌 1. Project Objective

The goal of this project is to perform data-driven exploratory analysis on job listings collected from Glassdoor. The analysis aims to extract salary-related insights tailored to four main stakeholder groups:

- **Job Seekers:** Understand expected compensation by role and location.
- **Employers:** Benchmark and set competitive salary offerings.
- **Analysts:** Study salary trends across multiple dimensions.
- **Recruiters:** Ensure fair and market-aligned compensation recommendations.

---

## 🎯 2. Stakeholder-Centric Business Goals

### 🔹 Job Seekers
- Know average salaries by role and location.
- Evaluate fair compensation expectations.

### 🔹 Employers
- Compare salary offerings against competitors.
- Align compensation strategies with market rates.

### 🔹 Analysts
- Identify patterns and relationships affecting pay.
- Explore correlations with job-level features (e.g., size, rating, industry).

### 🔹 Recruiters
- Spot outliers and underpaying listings.
- Suggest roles with better compensation and fairness.

---

## 🗃️ 3. Dataset Overview

The dataset contains 600+ job postings from Glassdoor with the following features:

- **Textual Fields:** Job Title, Location, Job Description, Company Name
- **Categorical Fields:** Size, Industry, Sector, Type of Ownership
- **Numeric Fields:** Company Rating, Year Founded
- **Target Field:** Salary Estimate (string format like "$80K-$120K (Glassdoor est.)")

🧹 Data Quality:
- Few or no null values across columns; invalid salary entries are stored as the placeholder `-1` rather than as nulls.
- Requires transformation of salary field to numeric format for analysis.

---

## 📊 4. EDA Workflow

### 🔧 Step 1: Data Cleaning
- Extracted `Min Salary`, `Max Salary`, and `Avg Salary` from the `Salary Estimate` field.
- Removed special characters and text (e.g., "(Glassdoor est.)").
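
The extraction described above can be sketched with a regular expression. This is a minimal illustration that assumes salary strings follow the `$80K-$120K (Glassdoor est.)` pattern; `parse_salary_range` is a hypothetical helper, not the notebook's own code.

```python
import re

# Hypothetical helper for illustration; accepts a hyphen or an en dash between the bounds.
_SALARY_RANGE = re.compile(r"\$(\d+)K\s*[-–]\s*\$(\d+)K")

def parse_salary_range(text):
    """Return (min, max, avg) in thousands of USD, or None if no range is found."""
    match = _SALARY_RANGE.search(text)
    if match is None:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    return low, high, (low + high) / 2

print(parse_salary_range("$80K-$120K (Glassdoor est.)"))  # (80, 120, 100.0)
print(parse_salary_range("-1"))                           # None
```

Returning `None` for strings without a range keeps invalid entries easy to filter out before any numeric conversion.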

### 🛠️ Step 2: Feature Engineering
- Split `Location` into `City` and `State`.
- Categorized `Job Title` into simplified roles for easier aggregation.
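
One possible way to categorize titles is simple keyword matching. The sketch below assumes that approach is good enough for aggregation; the `simplify_title` helper and its keyword list are illustrative assumptions, not the mapping used in this notebook.

```python
# Hypothetical keyword -> role mapping; order matters because the first match wins.
ROLE_KEYWORDS = [
    ("data scientist", "data scientist"),
    ("data engineer", "data engineer"),
    ("machine learning", "ml engineer"),
    ("analyst", "analyst"),
    ("manager", "manager"),
    ("director", "director"),
]

def simplify_title(title):
    """Map a raw job title to a coarse role label, or 'other' if nothing matches."""
    lowered = title.lower()
    for keyword, role in ROLE_KEYWORDS:
        if keyword in lowered:
            return role
    return "other"

sample_titles = ["Senior Data Scientist", "Healthcare Data Analyst", "Research Chemist"]
print([simplify_title(t) for t in sample_titles])
# ['data scientist', 'analyst', 'other']
```

On a DataFrame, the same function could be applied with `df['Job Title'].map(simplify_title)`.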

### 📈 Step 3: Salary Trend Analysis
- Analyzed average salaries by:
  - Job Title
  - State / City
  - Company Size
  - Industry and Sector

### 🏢 Step 4: Company Benchmarking
- Compared salaries across:
  - Company Ratings
  - Revenue Brackets
  - Company Sizes

### 🚦 Step 5: Recruiter Insights
- Used IQR and z-score to detect salary outliers.
- Highlighted fair vs. unfair pay across different job listings.
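
A minimal sketch of the two outlier rules named above, assuming the common defaults of 1.5 × IQR and |z| > 3. The function name `flag_salary_outliers`, the thresholds, and the toy data are assumptions for illustration.

```python
import pandas as pd

def flag_salary_outliers(salaries, z_threshold=3.0, iqr_factor=1.5):
    """Return boolean outlier flags (IQR rule and z-score rule) for a numeric Series."""
    q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    # Population standard deviation (ddof=0) for the z-score
    z_scores = (salaries - salaries.mean()) / salaries.std(ddof=0)
    return pd.DataFrame({
        "iqr_outlier": (salaries < lower) | (salaries > upper),
        "z_outlier": z_scores.abs() > z_threshold,
    })

# Toy data, not taken from the Glassdoor dataset (values in thousands of USD).
toy = pd.Series([70, 75, 80, 85, 90, 95, 100, 250], name="Avg Salary")
flags = flag_salary_outliers(toy)
print(flags[flags["iqr_outlier"]].index.tolist())  # [7]
```

The two rules can disagree: in this toy sample the IQR rule flags the 250 entry while its z-score stays below 3, which is one reason to report both.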

---

## 🔍 5. Key Insights (Illustrative)

- Cities like **San Francisco** and **New York** tend to offer the highest average salaries.
- Smaller companies or startups may offer competitive pay in niche domains.
- Company ratings and salaries are not always strongly correlated.
- Certain industries (e.g., Tech, Finance) consistently pay above average.

---

## 🔮 6. Next Steps

This EDA lays the foundation for:

- **Supervised Learning (Regression):**
  - Predict salary using structured job features.
- **NLP (Natural Language Processing):**
  - Extract in-demand skills from `Job Description` text.
- **Clustering (Optional):**
  - Group similar jobs for role-based insights using unsupervised learning.

---

## ✅ Conclusion

This project highlights the power of structured EDA in deriving business-relevant insights from job listing data. The analysis benefits multiple stakeholders and sets the stage for advanced predictive and prescriptive modeling in future phases.

**


# **GitHub Link - https://github.com/Kanishk-30**


# **Problem Statement**



In a competitive job market, understanding and benchmarking salaries is essential for multiple stakeholders including job seekers, employers, analysts, and recruiters. However, salary information is often unstructured, inconsistently formatted, or obscured by estimates, making it difficult to extract meaningful insights.

The goal of this project is to analyze job listing data from Glassdoor to uncover actionable salary insights. This includes exploring how salaries vary by job title, location, company attributes, and industry sectors. The analysis aims to help:

- **Job Seekers**: Understand expected salary ranges for specific roles and locations.
- **Employers**: Benchmark compensation to remain competitive and attract top talent.
- **Analysts**: Identify trends and key factors influencing salary variations.
- **Recruiters**: Detect outliers and ensure fair, market-aligned pay recommendations.

Through structured Exploratory Data Analysis (EDA), this project seeks to transform raw job data into stakeholder-driven insights that support informed decision-making and future predictive modeling.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an Evaluation Metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation Metric Score Chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# 📦 Step 1: Import Required Libraries

# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings for better visuals
from IPython.display import display
import warnings
warnings.filterwarnings("ignore")

# Optional: set max display options for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)

# Set seaborn style (no need for plt.style.use)
sns.set_style("whitegrid")


### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('glassdoor_jobs.csv')

### Dataset First View

In [None]:
# Dataset First Look
# Preview first 5 rows
print("\n🔍 Sample Records:")
display(df.head())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns = df.shape

print(f"✅ Number of Rows: {rows}")
print(f"✅ Number of Columns: {columns}")

### Dataset Information

In [None]:
# Dataset Info
print("📄 Dataset Information:")
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"🔁 Number of duplicate rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count missing (null) values per column
missing_values = df.isnull().sum()

# Display only columns with at least one missing value
missing_values = missing_values[missing_values > 0]

if not missing_values.empty:
    print("❗ Missing Values Detected:")
    print(missing_values)
else:
    print("✅ No missing values found in the dataset.")

In [None]:
# Visualizing the missing values
import matplotlib.pyplot as plt
import seaborn as sns

# Set figure size and plot heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap="YlOrRd", yticklabels=False)

# Plain-text title: default Matplotlib fonts may not include emoji glyphs
plt.title("Missing Values Heatmap", fontsize=14)
plt.xlabel("Columns")
plt.ylabel("Rows")
plt.show()


### What did you know about your dataset?

Answer

### Dataset Overview

After loading and exploring the dataset, here is what we found:

- The dataset contains over 600 rows and multiple columns with job-related information such as job title, location, company details, and salary estimates.

- It includes both categorical (text) and numerical fields. Important columns are:
  - Job Title
  - Location
  - Company Name
  - Salary Estimate
  - Rating
  - Company Size
  - Industry
  - Sector
  - Revenue
  - Founded

- `.isnull().sum()` and the heatmap show few or no null values, so the data appears mostly complete. Invalid salary entries are stored as the string `-1` rather than as nulls, so a null check alone does not catch them.

- A small number of duplicate rows were found using `.duplicated().sum()`. These can be removed during data cleaning.

- Most of the columns are of type `object` (text), and a few are numeric types such as `int` or `float`.

- The `Salary Estimate` column is in string format and needs to be cleaned and converted into usable numeric values like minimum, maximum, and average salary.

This initial analysis shows that the dataset is well-structured and suitable for further exploratory data analysis.
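
Because a null check only catches true `NaN` values, it can help to count placeholder entries as well. The sketch below assumes `-1` (numeric) and `"-1"` (string) are the placeholders; `count_placeholders` is a hypothetical helper, and the toy frame stands in for the real data.

```python
import pandas as pd

def count_placeholders(frame):
    """Count cells per column equal to -1 (numeric columns) or "-1" (other columns)."""
    numeric_hits = frame.select_dtypes(include="number").eq(-1).sum()
    text_hits = frame.select_dtypes(exclude="number").eq("-1").sum()
    return pd.concat([numeric_hits, text_hits])

# Toy data, not taken from the Glassdoor dataset.
toy = pd.DataFrame({
    "Salary Estimate": ["$80K-$120K (Glassdoor est.)", "-1", "$60K-$90K (Glassdoor est.)"],
    "Rating": [3.8, -1.0, 4.2],
    "Founded": [1990, -1, 2005],
})
placeholder_counts = count_placeholders(toy)
print(placeholder_counts[placeholder_counts > 0])
```

On the notebook's data this would be called as `count_placeholders(df)` right after loading.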


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Column Names in Dataset:")
print(df.columns.tolist())

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Answer

Below is a brief description of some key variables:

- **Job Title**: Title of the job posting.
- **Salary Estimate**: Estimated salary range provided (needs cleaning).
- **Company Name**: Name of the company posting the job.
- **Location**: City and state where the job is located.
- **Rating**: Glassdoor company rating.
- **Size**: Company size (e.g., 51 to 200 employees).
- **Type of ownership**: Ownership structure (e.g., Private, Public).
- **Industry**: Industry the company operates in.
- **Sector**: Broader sector the industry falls under.
- **Revenue**: Revenue range of the company.
- **Founded**: Year the company was founded.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Unique value count per column:")
print(df.nunique())

# Optional: View unique values for each column (can be verbose)
for col in df.columns:
    print(f"\nUnique values in '{col}':")
    print(df[col].unique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# 📦 Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import warnings
warnings.filterwarnings("ignore")

# 🧼 Step 2: Load Dataset
df = pd.read_csv("glassdoor_jobs.csv")
print("✅ Dataset loaded. Shape:", df.shape)

# 🧹 Step 3: Filter Out Hourly and Employer Provided Salaries
# These rows contain extra text and will cause conversion errors
df = df[~df['Salary Estimate'].str.contains('Per Hour', na=False)]
df = df[~df['Salary Estimate'].str.contains('Employer Provided Salary:', na=False)]
df = df[df['Salary Estimate'] != '-1']  # remove invalid entries

# 🧽 Step 4: Clean Salary Text
df['Salary Estimate Clean'] = (
    df['Salary Estimate']
    .str.replace(r'\(.*?\)', '', regex=True)  # Remove (Glassdoor est.)
    .str.replace('K', '', regex=False)
    .str.replace('$', '', regex=False)
    .str.strip()
)

# 🧾 Step 5: Split Min and Max Salary
df[['Min Salary', 'Max Salary']] = df['Salary Estimate Clean'].str.split('-', expand=True)

# Remove rows where split failed
df = df[df['Min Salary'].notnull() & df['Max Salary'].notnull()]

# 🧮 Step 6: Convert Salary to Integer
df['Min Salary'] = df['Min Salary'].str.strip().astype(int)
df['Max Salary'] = df['Max Salary'].str.strip().astype(int)
df['Avg Salary'] = (df['Min Salary'] + df['Max Salary']) / 2

# 📍 Step 7: Extract Job Location (City, State)
# Split from the right so a location with extra commas still yields the trailing state
df[['Job City', 'Job State']] = df['Location'].str.rsplit(',', n=1, expand=True)
df['Job City'] = df['Job City'].str.strip()
df['Job State'] = df['Job State'].str.strip()

# 🔤 Step 8: Simplify Job Titles (basic lowercase conversion)
df['Simplified Title'] = df['Job Title'].str.lower()

# ✅ Done: Preview Cleaned Data
display(df[['Salary Estimate', 'Min Salary', 'Max Salary', 'Avg Salary', 'Location', 'Job City', 'Job State', 'Simplified Title']].head())
print("✅ Final shape after cleaning:", df.shape)


### What all manipulations have you done and insights you found?

Answer

## Data Wrangling Summary

Here are the key manipulations performed on the dataset:

1. Removed rows where `Salary Estimate` was hourly ("Per Hour"), employer-provided, or marked as `-1`.
2. Cleaned the `Salary Estimate` field to extract numerical values:
   - Extracted `Min Salary` and `Max Salary`, and computed `Avg Salary` as their midpoint (values are in thousands of USD).
3. Split the `Location` field into separate `Job City` and `Job State` columns for geographic analysis.
4. Lowercased `Job Title` into `Simplified Title` as a first step toward grouping roles.
5. Company names are not yet stripped of appended ratings, and the index is not reset after filtering; both remain to be done before company-level analysis.
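
Company-name cleanup and an index reset could look like the sketch below. It assumes the rating is appended to the name after a newline (for example `"Acme Corp\n3.8"`); that layout, the `Company Clean` column, and the toy frame are assumptions for illustration.

```python
import pandas as pd

# Toy frame; the "name\nrating" layout is an assumption about the raw data.
toy = pd.DataFrame(
    {"Company Name": ["Acme Corp\n3.8", "Globex\n4.1", "Initech"]},
    index=[2, 5, 9],  # gaps mimic an index left over from row filtering
)

# Keep only the text before the first newline, then trim whitespace.
toy["Company Clean"] = toy["Company Name"].str.split("\n").str[0].str.strip()

# Renumber rows 0..n-1 and discard the old index.
toy = toy.reset_index(drop=True)

print(toy["Company Clean"].tolist())  # ['Acme Corp', 'Globex', 'Initech']
print(toy.index.tolist())             # [0, 1, 2]
```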

### Initial Insights:

- Salary values vary widely, with higher averages clustered in major cities like San Francisco and New York.
- Some job titles appear to consistently fall into specific salary ranges (e.g., Data Scientist > Analyst).
- Outlier salaries (very high or low) may need to be flagged during further analysis or visualization.


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: Bar Chart – Average Salary by Job Title

In [None]:
# Chart - 1 visualization code
import seaborn as sns
import matplotlib.pyplot as plt

top_jobs = df.groupby('Simplified Title')['Avg Salary'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=top_jobs.values, y=top_jobs.index, palette="Blues_d")
plt.xlabel('Average Salary (thousand USD)')
plt.ylabel('Job Title')
plt.title('Top 10 Highest Paying Job Titles')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Bar charts are intuitive and ideal for comparing average values across categories. This chart helps visually rank job titles based on salary, making it easy for job seekers to understand earning potential.

##### 2. What is/are the insight(s) found from the chart?

Answer The chart shows which roles have the highest average salaries. It identifies the top 10 paying job titles, which helps aspirants align their skill development with lucrative roles.
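
A ranking of group means can be dominated by titles that appear only once or twice. The sketch below shows one hedged way to guard against that by requiring a minimum number of postings per group; `top_groups_by_mean` and the `min_count=3` default are assumptions, not part of the chart code above.

```python
import pandas as pd

def top_groups_by_mean(frame, group_col, value_col, min_count=3, top_n=10):
    """Rank groups by mean value, keeping only groups with at least min_count rows."""
    stats = frame.groupby(group_col)[value_col].agg(["mean", "count"])
    stats = stats[stats["count"] >= min_count]
    return stats.sort_values("mean", ascending=False).head(top_n)

# Toy data, not taken from the Glassdoor dataset.
toy = pd.DataFrame({
    "Simplified Title": ["data scientist"] * 3 + ["chief scientist"] + ["data analyst"] * 3,
    "Avg Salary": [100, 110, 120, 250, 60, 70, 80],
})
ranked = top_groups_by_mean(toy, "Simplified Title", "Avg Salary")
print(ranked.index.tolist())  # ['data scientist', 'data analyst']
```

The single "chief scientist" posting is excluded, so one unusually high listing does not top the ranking.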

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Yes. Job seekers can target higher-paying roles, which can improve satisfaction and retention. The main risk is reading the ranking without considering how difficult those roles are to obtain or how few openings exist.

#### Chart - 2:  Bar Chart – Average Salary by State

In [None]:
# Chart - 2 visualization code
state_salary = df.groupby('Job State')['Avg Salary'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=state_salary.values, y=state_salary.index, palette="Greens_d")
plt.title("Top 10 States by Average Salary")
plt.xlabel("Average Salary")
plt.ylabel("Job State")
plt.show()


##### 1. Why did you pick the specific chart?

Answer It clearly shows salary differences across geographic regions using a simple visual format. Bar charts help in comparing average salary levels between states.

##### 2. What is/are the insight(s) found from the chart?

Answer Some states consistently offer higher salaries, indicating stronger markets for job seekers. It may reflect tech hubs or high cost-of-living areas.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Yes. Job seekers may consider relocating for better pay. The main risk is relocating on nominal pay alone, without adjusting for cost of living.

#### Chart - 3:  Box Plot – Salary Distribution by Industry

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(14, 6))
sns.boxplot(x='Industry', y='Avg Salary', data=df)
plt.xticks(rotation=90)
plt.title("Salary Distribution by Industry")
plt.xlabel("Industry")
plt.ylabel("Average Salary")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Box plots are great for showing salary spread, median, and outliers, helping employers benchmark their pay structure.

##### 2. What is/are the insight(s) found from the chart?

Answer Industries like Tech or Finance may show higher medians and larger salary ranges. Some industries may have a tight salary band indicating standardization.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Yes. Employers can benchmark their pay bands against each industry's median and spread. A possible negative effect is that industries with wide ranges or many outliers make a single benchmark misleading, so offers anchored to the median alone may still miss the market.

#### Chart - 4: Scatter Plot – Salary vs. Company Rating

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Rating', y='Avg Salary', hue='Industry', data=df, alpha=0.7)
plt.title("Average Salary vs. Company Rating")
plt.xlabel("Company Rating")
plt.ylabel("Average Salary")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()


##### 1. Why did you pick the specific chart?

Answer A scatter plot helps visualize relationships between numerical values—in this case, salary and company rating.

##### 2. What is/are the insight(s) found from the chart?

Answer Companies with high ratings don’t always pay the highest salaries. This could reflect companies with good culture attracting talent even with moderate salaries.
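
The visual impression can be checked with a correlation coefficient. This is a minimal sketch that assumes non-positive ratings are placeholders to be excluded; `rating_salary_correlation` and the toy data are hypothetical.

```python
import pandas as pd

def rating_salary_correlation(frame, rating_col="Rating", salary_col="Avg Salary"):
    """Pearson correlation between rating and salary, ignoring placeholder ratings."""
    valid = frame[frame[rating_col] > 0]  # assumption: ratings <= 0 are placeholders
    return valid[rating_col].corr(valid[salary_col])

# Toy data, not taken from the Glassdoor dataset.
toy = pd.DataFrame({
    "Rating": [3.0, 3.5, 4.0, 4.5, -1.0],
    "Avg Salary": [90, 80, 110, 100, 60],
})
print(round(rating_salary_correlation(toy), 2))
```

A value near 0 on the real data would support the reading that rating and salary are only weakly related; correlation alone does not say why.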

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Yes. Companies can weigh investing in higher pay against investing in better working conditions. If higher salaries do not go along with higher ratings, spending elsewhere, such as on culture, may do more for retention.

#### Chart - 5: Bar Chart – Average Salary by Sector

In [None]:
# Chart - 5 visualization code
sector_salary = df.groupby('Sector')['Avg Salary'].mean().sort_values(ascending=False)

plt.figure(figsize=(12, 6))
sns.barplot(x=sector_salary.values, y=sector_salary.index, palette="magma")
plt.title("Average Salary by Sector")
plt.xlabel("Average Salary")
plt.ylabel("Sector")
plt.show()


##### 1. Why did you pick the specific chart?

Answer It simplifies comparison of average salary across sectors using a visual that is quick to interpret.

##### 2. What is/are the insight(s) found from the chart?

Answer Some sectors like Information Technology and Healthcare might offer higher salaries, showing where economic value is concentrated.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Yes. It helps analysts focus on higher-paying sectors. Negative growth is unlikely unless high-paying sectors become oversaturated with candidates.

#### Chart - 6: Box Plot – Salary by Company Size

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12, 6))
sns.boxplot(x='Size', y='Avg Salary', data=df)
plt.xticks(rotation=45)
plt.title("Salary Distribution by Company Size")
plt.xlabel("Company Size")
plt.ylabel("Average Salary")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Box plots visualize how salary varies within each company size group and indicate where the median salary lies.

##### 2. What is/are the insight(s) found from the chart?

Answer Larger companies may show higher median salaries but more variability. Smaller firms might have a narrower pay range.
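
The median and spread that the box plot shows can also be tabulated per group. The sketch below is one way to do that; `salary_spread_by_group` is a hypothetical helper, and the size labels in the toy data are illustrative.

```python
import pandas as pd

def salary_spread_by_group(frame, group_col="Size", salary_col="Avg Salary"):
    """Median, interquartile range, and row count of salary for each group."""
    grouped = frame.groupby(group_col)[salary_col]
    summary = pd.DataFrame({
        "median": grouped.median(),
        "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
        "count": grouped.size(),
    })
    return summary.sort_values("median", ascending=False)

# Toy data, not taken from the Glassdoor dataset (values in thousands of USD).
toy = pd.DataFrame({
    "Size": ["51 to 200 employees"] * 3 + ["10000+ employees"] * 3,
    "Avg Salary": [60, 70, 80, 90, 100, 140],
})
spread = salary_spread_by_group(toy)
print(spread)
```

Including the row count helps judge whether a wide or narrow range reflects the group itself or just a small sample.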

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Yes. It helps analysts advise businesses on compensation strategies based on company size. A negative impact may occur if small firms try to match large-firm salaries without the budget to sustain them.

#### Chart - 7: Bar Chart – Top 10 Paying Companies

In [None]:
# Chart - 7 visualization code
top_companies = df.groupby('Company Name')['Avg Salary'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 6))
sns.barplot(x=top_companies.values, y=top_companies.index, palette="Purples_d")
plt.title("Top 10 Paying Companies")
plt.xlabel("Average Salary")
plt.ylabel("Company")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Simple horizontal bar charts are excellent for ranking entities like companies based on average salary.

##### 2. What is/are the insight(s) found from the chart?

Answer Identifies companies paying the highest average salary, which is useful for recruiters when benchmarking offers.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Yes. Recruiters can set fair yet competitive offers. It could lead to negative growth if candidates focus only on pay and ignore fit or culture.



#### Chart - 8: Histogram – Salary Range Distribution

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(df['Avg Salary'], bins=20, kde=True, color='teal')
plt.title("Salary Range Distribution")
plt.xlabel("Average Salary")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Histograms effectively show the distribution and frequency of salary ranges, giving an overview of market concentration.

##### 2. What is/are the insight(s) found from the chart?

Answer The majority of salaries fall within a specific range, which sets expectations for recruiters and helps them avoid overpaying or underpaying.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Answer Yes. It provides realistic benchmarks. A negative outcome may arise if extremes (very low or very high salaries) skew expectations.