<a href="https://colab.research.google.com/github/Bharatgaur/Almabetter_Projects/blob/main/Glassdoor%20Jobs%20Salary%20Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Glassdoor Jobs Salary Prediction**

##### **Project Type**    - Regression
##### **Contribution**    - Individual

# **Glassdoor Jobs Salary Prediction Project Summary-**

#### **Introduction**  
In today’s fast-paced tech industry, understanding salary trends is essential for job seekers, employers, and policymakers. Compensation varies based on multiple factors such as job role, company size, experience, and location. Analyzing these trends can help professionals make informed decisions regarding their careers while enabling companies to establish competitive salary structures.  

This project focuses on predicting salaries for various tech job positions using historical job postings data from Glassdoor (2017-2018). The dataset contains crucial attributes such as job title, company size, location, industry, revenue, and estimated salary, allowing us to build a machine learning model for salary prediction.  

#### **Business Objectives**  
This project serves multiple stakeholders:  
1. **Job Seekers:** Provides insights into expected salary ranges for different roles, helping professionals negotiate better compensation.  
2. **Employers:** Assists companies in benchmarking salaries to attract and retain top talent.  
3. **Recruiters:** Helps recruitment agencies analyze salary trends across industries and geographies.  
4. **Researchers & Analysts:** Offers a data-driven approach to understanding salary distribution and influencing factors.  

#### **Problem Statement**  
The project aims to answer the following key questions:  
- How do salaries vary across different job roles, such as Data Scientist, Software Engineer, and DevOps Engineer?  
- What is the impact of company size on salary levels?  
- How do salaries differ across locations such as San Francisco, Austin, and New York?  
- Can we develop a predictive model to estimate salaries based on job attributes?  

By addressing these questions, the project will offer valuable insights into salary distribution patterns and market trends in the tech industry.  

#### **Dataset Overview**  
The **Glassdoor Jobs Dataset** comprises job postings from 2017-2018, containing attributes such as:  
- **Job Title:** The position offered (e.g., Data Scientist, Software Engineer).  
- **Salary Estimate:** Predicted salary range for the role.  
- **Company Details:** Name, size, headquarters, industry, type of ownership, year founded.  
- **Job Description:** Role-specific details and required skill sets.  
- **Location & Revenue:** The job's location and the company's revenue range.  
- **Competitors:** Rival companies listed for each employer.  

#### **Machine Learning Approach**  
This project follows a supervised machine learning approach using **Linear Regression**, a widely used technique for predicting continuous numerical values like salary.  

##### **Steps Involved:**  
1. **Data Preprocessing:**  
   - Handling missing values.  
   - Converting categorical variables into numerical representations.  
   - Removing duplicate or irrelevant features.  

2. **Exploratory Data Analysis (EDA):**  
   - Identifying correlations between salary and other variables.  
   - Creating visualizations (e.g., salary distribution by role, location, and experience).  
   - Understanding outliers and anomalies in salary estimates.  

3. **Feature Engineering:**  
   - Extracting meaningful insights from job descriptions (e.g., required skills, certifications).  
   - Creating new variables that might influence salary predictions.  

4. **Model Building & Evaluation:**  
   - Implementing **Linear Regression** to predict salaries.  
   - Splitting data into training and testing sets.  
   - Evaluating the model using metrics like **Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² score**.  

5. **Hyperparameter Tuning:**  
   - Comparing regularized regression models (e.g., Ridge, Lasso) against the Linear Regression baseline.  
   - Tuning their parameters, such as the regularization strength, for better accuracy.  
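The modelling steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the Glassdoor dataset: the column names (`rating`, `company_age`, `size`, `avg_salary`) and the coefficients used to generate the target are assumptions made only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical columns, salary in thousands of USD)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "rating": rng.uniform(1.0, 5.0, n),
    "company_age": rng.integers(1, 100, n),
    "size": rng.choice(["small", "medium", "large"], n),
})
toy["avg_salary"] = (
    60 + 8 * toy["rating"] + 0.1 * toy["company_age"] + rng.normal(0, 5, n)
)

# Convert the categorical column into numeric dummy variables
X = pd.get_dummies(toy.drop(columns="avg_salary"), columns=["size"],
                   drop_first=True, dtype=float)
y = toy["avg_salary"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the baseline model and predict on unseen rows
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate with MAE, RMSE, and R²
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  R2: {r2:.3f}")
```

MAE and RMSE are in the same unit as the target (thousands of USD here), which makes them easier to explain to stakeholders than R² alone.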

#### **Libraries & Tools Used**  
- **Pandas & NumPy:** Data manipulation and computational efficiency.  
- **Matplotlib & Seaborn:** Data visualization (e.g., salary trends, correlation heatmaps).  
- **Scikit-learn:** Model training and evaluation.  
- **Flask/FastAPI:** Deployment of the predictive model.  
- **Google Cloud Platform (GCP):** Hosting the model using **Cloud Functions or AI Platform**.  

#### **Deployment Strategy on GCP**  
To ensure an end-to-end solution, the trained model will be deployed on **Google Cloud Platform (GCP)** using:  
1. **Cloud Storage:** Storing the dataset and trained model.  
2. **Cloud Functions/API:** Creating an API endpoint for real-time salary predictions.  
3. **Streamlit/Web App:** Providing a user-friendly interface for job seekers to input job details and get salary estimates.  
4. **Monitoring & Optimization:** Tracking model performance and refining predictions over time.  

#### **Expected Outcomes**  
- A fully functional salary prediction model deployed on GCP.  
- Insights into salary variations across job roles, locations, and company sizes.  
- An interactive dashboard for job seekers and employers to explore salary trends.  
- A scalable and automated ML pipeline for continuous improvement.  

#### **Conclusion**  
This project aims to bridge the gap between job seekers and salary expectations by providing a reliable salary prediction model. By leveraging machine learning techniques and cloud deployment, we can offer a scalable solution that benefits professionals and organizations alike.

# **GitHub Link -**

https://github.com/Bharatgaur/Almabetter_Projects/blob/main/Glassdoor%20Jobs%20Salary%20Prediction.ipynb

# **Problem Statement**


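**Predict salaries for tech job positions from job attributes such as job title, company size, location, industry, and revenue, using Glassdoor job postings data from 2017-2018.**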

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an Evaluation Metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation Metric Score Chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
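!pip install fastapi uvicorn
!pip install flask
# The GCP client libraries are imported in the next cell, so install them as well
!pip install google-cloud-storage google-cloud-aiplatform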

In [None]:
# Import Libraries
# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# API Development
from flask import Flask, request, jsonify
from fastapi import FastAPI

# Google Cloud Platform (GCP)
from google.cloud import storage
from google.cloud import aiplatform


### Dataset Loading

In [None]:
# Load Dataset
# Raw URL from GitHub
url = "https://raw.githubusercontent.com/Bharatgaur/Almabetter_Projects/main/glassdoor_jobs.csv"

# Load dataset
df = pd.read_csv(url)


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# Extract rows and columns count separately
num_rows, num_columns = df.shape
print("Number of Rows:", num_rows)
print("Number of Columns:", num_columns)

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print("Number of Duplicate Rows:", duplicate_count)



#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count missing values in each column
missing_values = df.isnull().sum()

# Display missing values count
print("Missing Values Count:\n", missing_values)


In [None]:
# Total missing values in dataset
total_missing = df.isnull().sum().sum()
print("Total Missing Values:", total_missing)


In [None]:
# Visualizing the missing values
# There are no missing (NaN) values in the dataset, so there is nothing to plot

### What did you know about your dataset?

The dataset consists of **956 rows** and **15 columns**, containing job postings data collected from Glassdoor. It provides detailed information about various job positions, including salary estimates, company details, and job descriptions.  

#### **Key Insights from the Dataset:**  
1. **Data Size & Structure:**  
   - The dataset has **956 entries (rows)** and **15 features (columns)**.  
   - The data types include **1 float (`Rating`), 2 integers (`Unnamed: 0`, `Founded`), and 12 object (string) columns**.  

2. **Column Overview:**  
   - **Job Title:** Specifies the role (e.g., Data Scientist, Software Engineer).  
   - **Salary Estimate:** Contains estimated salary ranges.  
   - **Job Description:** Provides details about the job role and responsibilities.  
   - **Rating:** Represents the company rating (float values).  
   - **Company Name, Location, Headquarters:** Provide details about the company and its location.  
   - **Size, Founded, Type of Ownership:** Give insights into company size and ownership structure.  
   - **Industry, Sector, Revenue:** Define the company's domain and financial scale.  
   - **Competitors:** Lists rival companies in the same industry.  

3. **Missing Values & Data Quality:**  
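   - **No missing values** (NaN) are present in any column, as all columns have **956 non-null values**. Some columns, such as `Rating` and `Founded`, instead use `-1` as a placeholder, which likely marks unavailable data.  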

This dataset is valuable for **analyzing salary trends, job market patterns, and company attributes**, making it useful for job seekers, recruiters, and data analysts. 🚀

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Below is a detailed description of each variable in the dataset:  

| **Column Name**       | **Description** |
|----------------------|---------------|
| **Unnamed: 0**       | An index column (most likely an auto-generated serial number). Not useful for analysis. |
| **Job Title**        | The title of the job position (e.g., Data Scientist, Software Engineer, DevOps Engineer). |
| **Salary Estimate**  | The estimated salary range for the job role. This is the target variable for salary prediction. |
| **Job Description**  | A textual description of the job role, responsibilities, and required skills. |
| **Rating**          | The company’s rating given by employees (ranging from -1 to 5). A rating of -1 may indicate missing or unavailable data. |
| **Company Name**    | The name of the company offering the job. |
| **Location**        | The city or region where the job is located. |
| **Headquarters**    | The location of the company's headquarters. |
| **Size**           | The size of the company (e.g., 51-200 employees, 10,000+ employees). |
| **Founded**        | The year the company was established. A value of **-1** may indicate missing data. |
| **Type of Ownership** | The ownership structure of the company (e.g., Private, Public, Government). |
| **Industry**       | The specific industry in which the company operates (e.g., Technology, Healthcare, Finance). |
| **Sector**         | The broader sector of the company (e.g., IT Services, Financial Services). |
| **Revenue**        | The estimated revenue range of the company (e.g., $10M-$50M, $1B+). |
| **Competitors**    | The names of competing companies within the same industry. |

---

### **Statistical Summary of Numeric Columns:**  

| **Column**  | **Mean** | **Std Dev** | **Min** | **25%** | **50% (Median)** | **75%** | **Max** |
|------------|---------|-------------|--------|--------|--------|--------|--------|
| **Unnamed: 0** | 477.50 | 276.12 | 0 | 238.75 | 477.50 | 716.25 | 955 |
| **Rating** | 3.60 | 1.07 | -1.00 | 3.30 | 3.80 | 4.20 | 5.00 |
| **Founded** | 1774.60 | 598.94 | -1 | 1937 | 1992 | 2008 | 2019 |

**Key Observations:**  
- **Rating** ranges from **-1 to 5**, where **-1 likely represents missing values**.  
- **Founded** has **-1 values**, which may indicate missing or unknown data.  
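- The `-1` placeholders in **Rating & Founded** act as **outliers** and distort the statistics (the mean of `Founded` is 1774.60 while its median is 1992), so they need cleaning.  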

This dataset provides a rich set of features for **analyzing salary trends, job market patterns, and company attributes**. 🚀
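To see how much the `-1` placeholders can distort the summary, they can be converted to `NaN` before computing statistics. The sketch below uses a small made-up frame rather than the real data; it assumes, as the observations above suggest, that `-1` marks unavailable values.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up rows, not taken from the dataset)
toy = pd.DataFrame({
    "Rating": [3.8, -1.0, 4.2, 3.3],
    "Founded": [1992, -1, 2008, 1937],
})

# Count how many placeholders each column contains
sentinel_counts = (toy == -1).sum()
print(sentinel_counts)

# Treat -1 as missing so it no longer drags the statistics down
cleaned = toy.replace(-1, np.nan)

print("Mean of Founded with placeholders:   ", toy["Founded"].mean())
print("Mean of Founded without placeholders:", cleaned["Founded"].mean())
```

Applied to the full dataset, the same replacement would make `df.isnull().sum()` report these entries as missing, which gives a more honest picture of data quality.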

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    print(f"\nUnique values in '{column}':\n", df[column].unique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.


In [None]:
# Drop unnecessary columns
df.drop(columns=["Unnamed: 0"], inplace=True)

In [None]:
# Handle missing values
df.fillna(df.select_dtypes(include=['number']).median(), inplace=True)

In [None]:
# Process Salary Estimate column

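# Remove rows where salary is "-1"; copy so later column assignments do not trigger SettingWithCopyWarning
df = df[df["Salary Estimate"] != "-1"].copy()

# Remove "K" and "$" from salary column
# regex=False treats "$" as a literal character; as a regex, "$" is an end-of-string anchor and removes nothing
df["Salary Estimate"] = df["Salary Estimate"].str.replace("K", "", regex=False).str.replace("$", "", regex=False)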

# Remove non-numeric text like "Employer Provided Salary:"
df["Salary Estimate"] = df["Salary Estimate"].str.extract(r"(\d+-\d+|\d+)")[0]

# Extract Min Salary
df["Min Salary"] = df["Salary Estimate"].apply(lambda x: int(x.split('-')[0]) if pd.notna(x) else None)

# Extract Max Salary
df["Max Salary"] = df["Salary Estimate"].apply(lambda x: int(x.split('-')[1]) if '-' in str(x) else int(x) if pd.notna(x) else None)

# Compute Average Salary
df["Avg Salary"] = (df["Min Salary"] + df["Max Salary"]) / 2


In [None]:
# Convert 'Founded' into 'Company Age'
df["Company Age"] = df["Founded"].apply(lambda x: 2025 - x if x > 0 else np.nan)

In [None]:
# Drop duplicate values
df.drop_duplicates(inplace=True)

In [None]:
# Reset index
df.reset_index(drop=True, inplace=True)

In [None]:
# Print dataset info after cleaning
print(df.info())

In [None]:
# Display first 5 rows
print(df.head())


### What all manipulations have you done and insights you found?

Initially, the `Salary Estimate` column had inconsistent formats, including salary ranges like `"100K-150K"`, single-value salaries like `"120K"`, and text-based values like `"Employer Provided Salary:150K"`, with some entries as `"-1"`, indicating missing data. After cleaning, all salary values were converted into a consistent numeric format by removing `"K"` and `"$"`, extracting `Min Salary` and `Max Salary`, and computing `Avg Salary` as their mean. Rows with `"-1"` salary were removed, leaving 467 valid job listings. Now, the dataset has structured and numeric salary data, making it ready for analysis, with additional insights available for salary distributions and company attributes.
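The cleaning logic described above can also be expressed as a single helper, which makes the individual cases easy to check. The function name `parse_salary_range` is hypothetical and the sample strings only mirror the formats mentioned in this section; this is a sketch, not the code used in the notebook.

```python
import re
from typing import Optional, Tuple

# Either a "low-high" range or a single number
_SALARY_PATTERN = re.compile(r"(\d+)\s*-\s*(\d+)|(\d+)")


def parse_salary_range(text: str) -> Optional[Tuple[int, int, float]]:
    """Return (min, max, average) in thousands, or None for missing values."""
    if text.strip() == "-1":
        return None  # "-1" marks a missing salary
    # Strip the currency symbol and the thousands suffix as literal characters
    cleaned = text.replace("$", "").replace("K", "")
    match = _SALARY_PATTERN.search(cleaned)
    if match is None:
        return None
    if match.group(1) is not None:
        low, high = int(match.group(1)), int(match.group(2))
    else:
        # A single value serves as both minimum and maximum
        low = high = int(match.group(3))
    return low, high, (low + high) / 2


for raw in ["$100K-$150K", "120K", "Employer Provided Salary:$150K", "-1"]:
    print(raw, "->", parse_salary_range(raw))
```

Keeping the parsing in one function also makes it straightforward to unit-test edge cases before applying it to the whole column with `df["Salary Estimate"].apply(...)`.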

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## **Chart - 1: Job Openings by Industry (Bar Chart)**

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.countplot(y=df['Industry'], order=df['Industry'].value_counts().index, palette="viridis")
plt.xlabel("Number of Job Openings")
plt.ylabel("Industry")
plt.title("Job Openings by Industry")
plt.show()


### **1. Why did you pick the specific chart?**  
A bar chart is ideal for comparing the number of job openings across different industries. It provides a clear visual representation of which industries are hiring the most and which have fewer opportunities.  

### **2. What is/are the insight(s) found from the chart?**  
- Certain industries, such as **Technology, Finance, and Healthcare**, have the highest number of job postings.  
- Industries like **Retail and Manufacturing** have significantly fewer job postings, indicating lower demand for professionals in these fields.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:** This insight helps job seekers target industries with more opportunities and assists companies in understanding hiring trends.  
- **Negative Impact:** If certain industries show consistently low job postings, it may indicate **stagnation or reduced growth**, which could discourage job seekers from pursuing careers in those fields.  
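The counts behind a chart like this can be tabulated so that the insights are backed by numbers. The sketch below uses made-up industry labels; filtering out a `"-1"` placeholder in the `Industry` column is an assumption, by analogy with the placeholders seen in other columns.

```python
import pandas as pd

# Made-up industry labels for illustration only
industry = pd.Series(
    ["Tech", "Tech", "Finance", "Tech", "Finance", "Retail", "-1", "Tech"],
    name="Industry",
)

# Drop the placeholder (assumed to mark an unknown industry)
known = industry[industry != "-1"]

# Absolute counts and percentage share per industry
counts = known.value_counts()
shares = known.value_counts(normalize=True).mul(100).round(1)

table = pd.DataFrame({"postings": counts, "share_pct": shares})
print(table)
```

Reporting shares alongside raw counts makes it clearer how dominant the top industries are relative to the rest.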

## **Chart - 2: Job Openings by Sector (Bar Chart)**

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(12, 6))
sns.countplot(y=df['Sector'], order=df['Sector'].value_counts().index, palette="coolwarm")
plt.xlabel("Number of Job Openings")
plt.ylabel("Sector")
plt.title("Job Openings by Sector")
plt.show()

### **1. Why did you pick the specific chart?**  
A bar chart effectively represents the distribution of job openings across different sectors. It helps identify which sectors are leading in job creation and which are lagging.  

### **2. What is/are the insight(s) found from the chart?**  
- **Technology, Financial Services, and Healthcare** sectors have the highest number of job openings.  
- **Retail, Consumer Goods, and Education** sectors have significantly fewer job postings.  
- The demand for **Data Science and AI-related roles** is higher in **Tech and Finance sectors** compared to other industries.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:** Job seekers can focus on high-growth sectors for better employment opportunities. Businesses can also allocate recruitment resources accordingly.  
- **Negative Impact:** Low job openings in certain sectors may indicate **economic slowdown, automation impact, or shifting industry trends**, which could negatively affect employment opportunities in those areas.  

## **Chart - 3: Average Salary by Sector (Bar Chart)**

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12, 6))
df.groupby("Sector")["Avg Salary"].mean().sort_values().plot(kind="barh", colormap="coolwarm")
plt.xlabel("Average Salary (in thousands)")
plt.ylabel("Sector")
plt.title("Average Salary by Sector")
plt.show()

### **1. Why did you pick the specific chart?**  
A bar chart helps visualize the **average salary** offered across different sectors. This provides insights into which industries pay the most and which offer lower salaries, helping job seekers and companies make informed decisions.  

### **2. What is/are the insight(s) found from the chart?**  
- **Technology, Finance, and Healthcare** sectors offer the highest average salaries.  
- **Education and Retail** sectors tend to have the lowest average salaries.  
- There is a noticeable **salary gap between sectors**, suggesting that certain industries value specialized skills more than others.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - **Job seekers** can focus on high-paying industries.  
  - **Companies** in low-paying sectors might need to **increase salaries** to attract top talent.  
- **Negative Impact:**  
  - Industries with low salaries may struggle to retain skilled professionals.  
  - High-salary sectors may experience **talent saturation**, leading to increased competition for jobs.
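A sector average can be misleading when it rests on only a handful of postings, so it may help to look at the group size and median next to the mean. The following sketch uses invented values, and the `MIN_POSTINGS` threshold is an arbitrary assumption for illustration.

```python
import pandas as pd

# Invented values for illustration (salary in thousands)
toy = pd.DataFrame({
    "Sector": ["A", "A", "A", "B", "B", "C"],
    "Avg Salary": [100.0, 110.0, 120.0, 80.0, 90.0, 200.0],
})

# Mean, median, and number of postings per sector
summary = (
    toy.groupby("Sector")["Avg Salary"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(summary)

# Keep only sectors with enough postings to trust the average
MIN_POSTINGS = 2  # hypothetical threshold
reliable = summary[summary["count"] >= MIN_POSTINGS]
print(reliable)
```

In this toy example, sector `C` has the highest mean but only one posting, so it is excluded from the reliable ranking.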

## **Chart - 4: Job Openings by Location (Bar Chart)**

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(12, 6))
df["Location"].value_counts().head(10).plot(kind="bar", colormap="viridis")
plt.xlabel("Location")
plt.ylabel("Number of Job Openings")
plt.title("Top 10 Locations with Most Job Openings")
plt.xticks(rotation=45)
plt.show()

### **1. Why did you pick the specific chart?**  
A bar chart is ideal for visualizing the **number of job openings in different locations**. It helps identify cities with the most opportunities, which is valuable for job seekers and businesses expanding their workforce.  

### **2. What is/are the insight(s) found from the chart?**  
- Certain cities, such as **New York, San Francisco, and Seattle**, have the highest number of job postings.  
- Smaller cities or less tech-centric locations have **fewer job opportunities**.  
- Job availability is **clustered around major metropolitan areas**, indicating a concentration of businesses in those regions.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - Job seekers can **target high-opportunity locations** for better employment chances.  
  - Companies can **understand the competition** in major job markets and adjust hiring strategies.  
- **Negative Impact:**  
  - Cities with fewer jobs may struggle with **talent drain** as professionals migrate to better locations.  
  - Over-saturation of job seekers in high-demand locations could **increase competition** and lower job security.

## **Chart - 5: Average Salary by Job Title (Bar Chart)**

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(14, 6))
df.groupby("Job Title")["Avg Salary"].mean().sort_values(ascending=False).head(10).plot(kind="bar", colormap="coolwarm")
plt.xlabel("Job Title")
plt.ylabel("Average Salary (in thousands)")
plt.title("Top 10 Job Titles with Highest Average Salary")
plt.xticks(rotation=45)
plt.show()

### **1. Why did you pick the specific chart?**  
A bar chart effectively compares the **average salary across different job titles**, making it easy to identify which roles offer higher compensation. This helps job seekers, employers, and market analysts understand salary trends in the industry.  

### **2. What is/are the insight(s) found from the chart?**  
- **Data Scientist, Machine Learning Engineer, and AI Researcher** roles have the highest average salaries.  
- **Data Analyst and Business Analyst** roles generally have lower average salaries.  
- Senior-level positions tend to have **significantly higher salaries** compared to entry-level positions.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - Job seekers can **align their career paths** based on salary expectations.  
  - Companies can **benchmark salaries** to stay competitive in hiring top talent.  
- **Negative Impact:**  
  - Companies offering below-market salaries may **struggle to attract skilled professionals**.  
  - Disparity in pay across roles might **lead to employee dissatisfaction and turnover**.
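The claim about senior-level roles can be checked directly by deriving a seniority flag from the job title. This is a rough sketch on invented rows; the keyword list (`senior`, `sr`, `lead`, `principal`) is an assumption and would likely need tuning against the real titles.

```python
import pandas as pd

# Invented rows for illustration (salary in thousands)
toy = pd.DataFrame({
    "Job Title": [
        "Senior Data Scientist",
        "Data Scientist",
        "Sr. Data Analyst",
        "Data Analyst",
        "Lead ML Engineer",
    ],
    "Avg Salary": [150.0, 110.0, 95.0, 70.0, 160.0],
})

# Flag titles containing a seniority keyword (case-insensitive, whole words only)
senior_pattern = r"\b(?:senior|sr|lead|principal)\b"
is_senior = toy["Job Title"].str.contains(senior_pattern, case=False, regex=True)
toy["Level"] = is_senior.map({True: "Senior", False: "Other"})

# Compare average salary between the two groups
by_level = toy.groupby("Level")["Avg Salary"].mean()
print(by_level)
```

The resulting `Level` column could also serve as an engineered feature for the salary model.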

## **Chart - 6: Number of Job Openings by Company (Bar Chart)**

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(14, 6))
df["Company Name"].value_counts().head(10).plot(kind="bar", colormap="viridis")
plt.xlabel("Company Name")
plt.ylabel("Number of Job Openings")
plt.title("Top 10 Companies Hiring the Most")
plt.xticks(rotation=45)
plt.show()

### **1. Why did you pick the specific chart?**  
A bar chart is ideal for visualizing the **number of job postings by different companies**, helping to identify which companies are hiring the most. This is useful for job seekers and businesses analyzing market demand.  

### **2. What is/are the insight(s) found from the chart?**  
- A few companies dominate hiring, while many others have fewer job postings.  
- **Tech and finance companies** are among the top recruiters, suggesting high demand for data professionals in these industries.  
- Some well-known companies offer **fewer job openings**, possibly indicating lower turnover or selective hiring.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - Job seekers can **target applications** to companies with more job openings.  
  - Companies can use the data to **benchmark their hiring efforts** against competitors.  
- **Negative Impact:**  
  - If only a few companies dominate hiring, it could **limit job opportunities for candidates in niche roles**.  
  - Companies with fewer postings may struggle to attract talent due to a **perceived lack of opportunities**.

## **Chart - 7: Average Salary by Job Title (Bar Chart)**

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(14, 6))
df.groupby("Job Title")["Avg Salary"].mean().nlargest(10).plot(kind="bar", colormap="plasma")
plt.xlabel("Job Title")
plt.ylabel("Average Salary (in thousands)")
plt.title("Top 10 Highest Paying Job Titles")
plt.xticks(rotation=45)
plt.show()

### **1. Why did you pick the specific chart?**  
A **bar chart** effectively compares the **average salary** for different job titles. This helps in understanding salary trends across roles, which is valuable for job seekers and employers.  

### **2. What is/are the insight(s) found from the chart?**  
- Some job titles offer significantly **higher salaries** than others.  
- **Senior and specialized roles** (e.g., "Senior Data Scientist" or "Machine Learning Engineer") generally have **higher average salaries** than entry-level roles.  
- There is **variation in salaries** even for similar job titles, indicating differences in industries, companies, or locations.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - **Job seekers** can make **informed career decisions** by choosing roles with **higher earning potential**.  
  - **Companies** can **adjust salary structures** to stay competitive and attract top talent.  
- **Negative Impact:**  
  - If salary discrepancies are too high, it may **discourage candidates** from applying for lower-paying roles.  
  - **Companies with lower salaries** may struggle to **retain employees**, leading to high turnover.

## **Chart - 8: Number of Job Listings by Industry (Bar Chart)**

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(14, 6))
df["Industry"].value_counts().nlargest(10).plot(kind="bar", colormap="viridis")
plt.xlabel("Industry")
plt.ylabel("Number of Job Listings")
plt.title("Top 10 Industries with Most Job Listings")
plt.xticks(rotation=45)
plt.show()

### **1. Why did you pick the specific chart?**  
A **bar chart** is useful to visualize the number of job listings across different **industries**. This helps identify industries with the highest demand for jobs and potential job opportunities for candidates.  

### **2. What is/are the insight(s) found from the chart?**  
- Certain **industries** (e.g., **Technology, Finance, and Healthcare**) have significantly more job listings than others.  
- Some industries have **fewer job postings**, indicating either **low demand** or **less hiring activity**.  
- High job postings in an industry may indicate **rapid growth** or **high attrition rates**.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - **Job seekers** can **focus on industries with high demand** to increase their job prospects.  
  - **Companies in competitive industries** can analyze hiring trends and optimize recruitment strategies.  
- **Negative Impact:**  
  - Industries with **low job postings** may struggle to attract talent, indicating **stagnation** or **low growth**.  
  - High job listings in some industries could indicate **high employee turnover**, requiring further analysis.

## **Chart - 9: Average Salary by Industry (Bar Chart)**

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(14, 6))
df.groupby("Industry")["Avg Salary"].mean().nlargest(10).plot(kind="bar", colormap="plasma")
plt.xlabel("Industry")
plt.ylabel("Average Salary (in thousands)")
plt.title("Top 10 Industries with Highest Average Salary")
plt.xticks(rotation=45)
plt.show()

### **1. Why did you pick the specific chart?**  
A **bar chart** is ideal for comparing the **average salary across different industries**. This helps in understanding which industries offer the highest and lowest salaries, assisting job seekers and companies in making informed decisions.  

### **2. What is/are the insight(s) found from the chart?**  
- Certain industries, such as **Technology, Finance, and Healthcare**, offer **higher salaries** compared to others.  
- Industries like **Retail and Customer Service** may have **lower average salaries**.  
- The salary difference across industries suggests **variations in skill demand, profitability, and competition**.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - **Job seekers** can focus on industries that offer better compensation.  
  - **Companies** can benchmark their salaries against industry standards to attract top talent.  
- **Negative Impact:**  
  - **Industries with low salaries** may struggle to attract skilled professionals, leading to a talent shortage.  
  - **High-paying industries** might face increased competition for talent, driving up hiring costs.

## **Chart - 10: Job Openings by Location (Bar Chart)**

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(14, 6))
df["Location"].value_counts().nlargest(10).plot(kind="bar", colormap="viridis")
plt.xlabel("Location")
plt.ylabel("Number of Job Openings")
plt.title("Top 10 Locations with Most Job Openings")
plt.xticks(rotation=45)
plt.show()

### **1. Why did you pick the specific chart?**  
A **bar chart** is useful for visualizing the **number of job openings across different locations**. It helps in identifying cities with the highest demand for jobs, allowing job seekers to **target the right locations** and businesses to **understand job market trends**.  

### **2. What is/are the insight(s) found from the chart?**  
- **Major tech hubs** like **San Francisco, New York, and Seattle** have the highest number of job openings.  
- Some cities might have **fewer job postings**, indicating either **lower demand for hiring** or **a smaller job market**.  
- The distribution of job openings varies significantly by region, suggesting **local economic influences**.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - Job seekers can **relocate or apply** to cities with high job availability, increasing their chances of employment.  
  - Companies can **target recruitment campaigns** in high-demand cities to attract more applicants.  
- **Negative Impact:**  
  - Cities with **fewer job postings** may see **slower local economic growth**, since fewer openings mean fewer hiring opportunities.  
  - **Talent migration** to high-demand cities could create **skill shortages** in other locations.

## **Chart - 11: Average Salary by Industry (Bar Chart)**  

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(14, 6))
df.groupby("Industry")["Avg Salary"].mean().nlargest(10).sort_values().plot(kind="barh", colormap="plasma")
plt.xlabel("Average Salary (in thousands)")
plt.ylabel("Industry")
plt.title("Top 10 Industries with Highest Average Salary")
plt.show()

### **1. Why did you pick the specific chart?**  
A **bar chart** effectively compares the **average salary across different industries**, helping to identify which industries offer higher pay. This is useful for job seekers, recruiters, and businesses to understand **salary trends** and make informed decisions.  

### **2. What is/are the insight(s) found from the chart?**  
- **Tech and Finance industries** generally offer **higher average salaries** compared to others.  
- **Healthcare and Consulting** also provide **competitive salaries**.  
- Industries such as **Retail and Customer Service** tend to have **lower average salaries**.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - Job seekers can **target high-paying industries** for better career opportunities.  
  - Companies can **adjust salary structures** to remain competitive in the job market.  
  - Businesses can **strategize recruitment** based on industry salary trends.  
- **Negative Impact:**  
  - **Lower-paying industries** may struggle to **attract top talent**, leading to **skill shortages**.  
  - **High salary expectations** in certain industries may make hiring **more challenging** for startups or smaller firms.

## **Chart - 12: Average Salary by Company Size (Bar Chart)**

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(10, 6))
df.groupby("Size")["Avg Salary"].mean().sort_values().plot(kind="bar", color="skyblue")
plt.xlabel("Company Size")
plt.ylabel("Average Salary (in thousands)")
plt.title("Average Salary by Company Size")
plt.xticks(rotation=45)
plt.show()

### **1. Why did you pick the specific chart?**  
A **bar chart** is ideal for comparing **average salary based on company size** (e.g., small, medium, and large enterprises). This helps job seekers and businesses understand how company size affects salary structures.  

### **2. What is/are the insight(s) found from the chart?**  
- **Larger companies** (5000+ employees) tend to **offer higher salaries**, likely due to **greater financial resources and stability**.  
- **Medium-sized companies** (500-5000 employees) have a **moderate salary range**.  
- **Small companies** (1-200 employees) generally offer **lower salaries** but might provide **equity or faster career growth opportunities**.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - Job seekers can **align salary expectations** based on company size.  
  - Startups can focus on offering **non-monetary benefits** to attract talent.  
  - Companies can benchmark **their salary offerings** against competitors.  
- **Negative Impact:**  
  - Small companies might **struggle to retain talent** due to lower salary offers.  
  - Large companies may have **high salary expectations from candidates**, making hiring more selective.

## **Chart - 13: Distribution of Company Ratings (Histogram)**

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(df["Rating"], bins=20, kde=True, color="skyblue")
plt.xlabel("Company Rating")
plt.ylabel("Number of Companies")
plt.title("Distribution of Company Ratings")
plt.show()

### **1. Why did you pick the specific chart?**  
A **histogram** is the best way to visualize the **distribution of company ratings**, showing how many companies fall into different rating categories. This helps understand the general perception of companies based on employee reviews.  

### **2. What is/are the insight(s) found from the chart?**  
- Most companies have ratings between **3.0 and 4.5**, indicating a **generally positive perception**.  
- Very few companies have ratings **below 2.5**, meaning **most companies maintain a decent reputation**.  
- Only a **small percentage** of companies have a **near-perfect rating (4.8 - 5.0)**, suggesting **exceptional employee satisfaction is rare**.  

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - Job seekers can **identify well-rated companies** to target for job applications.  
  - Companies can **benchmark their ratings** against competitors and **improve workplace satisfaction**.  
  - Businesses with **low ratings** can focus on **improving employee experiences**.  
- **Negative Impact:**  
  - Companies with **poor ratings** might struggle to **attract top talent**, leading to **higher hiring costs**.  
  - If many companies have **average ratings (3.0 - 4.0)**, it might be difficult for job seekers to differentiate between them.

## **Chart - 14: Correlation Heatmap**  

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10, 6))
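# numeric_only=True restricts the matrix to numeric columns; text columns raise an error in pandas >= 2.0
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)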
plt.title("Correlation Heatmap")
plt.show()

### **1. Why did you pick the specific chart?**  
A **correlation heatmap** is used to visualize the **relationships between numerical features** in the dataset. It helps to identify **strong positive or negative correlations** that could be useful for decision-making.  

### **2. What is/are the insight(s) found from the chart?**  
- **Min Salary, Max Salary, and Avg Salary** are **highly correlated** with each other (**expected behavior**).  
- **Company Rating and Salary** have a **weak correlation**, suggesting that **higher-rated companies do not always offer higher salaries**.  
- **Company Age and Salary** have **no strong correlation**, indicating that **older companies do not necessarily pay better salaries**.  
- **Multicollinearity is confined to the salary columns**: Min, Max, and Avg Salary are near-duplicates of one another, while the remaining correlations are **moderate or low**.  
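
The multicollinearity point in the last bullet can be checked numerically instead of being read off the heatmap. The sketch below is a minimal illustration, assuming a small made-up frame whose column names mirror the notebook's; the helper `high_corr_pairs` and the 0.8 threshold are hypothetical choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Made-up illustrative values, not taken from the Glassdoor dataset
demo = pd.DataFrame({
    "Min Salary": [50, 60, 70, 80, 90],
    "Max Salary": [90, 105, 118, 140, 150],
    "Rating": [3.9, 3.1, 4.4, 3.0, 3.7],
    "Company Name": ["A", "B", "C", "D", "E"],  # non-numeric, ignored
})
demo["Avg Salary"] = (demo["Min Salary"] + demo["Max Salary"]) / 2

print(high_corr_pairs(demo))
```

On this toy data only the three salary pairs cross the threshold, which is the pattern the heatmap bullet describes. In a model, one would typically keep a single salary column as the target and drop the other two from the features.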

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - Helps **focus on key factors** that impact salaries.  
  - Businesses can **prioritize salary improvements** based on trends.  
  - Helps in **predictive modeling** by showing which numerical variables move together with salary.  
- **Negative Impact:**  
  - If **higher ratings do not correlate with higher salaries**, **companies might lose talent** if employees prioritize pay over workplace culture.

## **Chart - 15: Pair Plot**

In [None]:
# Pair Plot visualization code
sns.pairplot(df[['Min Salary', 'Max Salary', 'Avg Salary', 'Rating', 'Company Age']], diag_kind='kde')
plt.suptitle("Pair Plot of Numerical Features", y=1.02)
plt.show()

### **1. Why did you pick the specific chart?**  
A **pair plot** helps visualize **relationships between multiple numerical variables** through pairwise scatter plots, with KDE density curves on the diagonal. It allows us to **observe distributions, correlations, and potential patterns** in the dataset.

### **2. What is/are the insight(s) found from the chart?**  
- **Min Salary, Max Salary, and Avg Salary** show a **strong linear relationship**, confirming their correlation from the heatmap.  
- **Company Rating and Salary** do not have a clear trend, reinforcing the earlier finding that **higher-rated companies do not always pay higher salaries**.  
- **Company Age distribution** shows a concentration of **younger companies**, meaning **many companies in the dataset are startups or relatively new**.  
- **Salary distributions appear right-skewed**, meaning that **higher salaries are less common**, and **most jobs fall into mid-range salary levels**.  
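
The right-skew observation can be quantified with a skewness statistic. The following is a minimal sketch on made-up salary values (not the Glassdoor data); it also shows how a `log1p` transform, one common remedy, reduces the skew.

```python
import numpy as np
import pandas as pd

# Made-up salaries (in thousands) with a long right tail; not real Glassdoor data
salary = pd.Series([55, 60, 62, 65, 70, 72, 75, 80, 95, 140, 210], name="Avg Salary")

raw_skew = salary.skew()            # > 0 indicates a right-skewed distribution
log_skew = np.log1p(salary).skew()  # log1p compresses the upper tail

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

Whether a transform is worth applying depends on the model: linear models tend to benefit from a less skewed target, while tree-based models are largely insensitive to it.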

### **3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.**  
- **Positive Impact:**  
  - Helps in **salary prediction models**, as we now understand variable relationships better.  
  - Businesses can use insights to **attract talent by offering competitive salaries**.  
  - Job seekers can **identify trends** and make informed career choices.  
- **Negative Impact:**  
  - If **higher-rated companies do not offer higher salaries**, it may discourage potential employees from applying.  
  - **Younger companies dominating the market** may indicate **instability**, as startups tend to have **higher failure rates**.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing through code and statistical tests to reach a final conclusion about each statement.

Answer Here.
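
As one possible way to test a statement of the form "mean salary differs between two groups", the sketch below runs a two-sided permutation test on the difference in means (H0: both groups share the same mean salary; H1: the means differ). The data, the group labels, and the helper `permutation_p_value` are hypothetical. The notebook leaves the choice of test open, and a Welch t-test or ANOVA would be equally reasonable.

```python
import numpy as np


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[: a.size].mean() - shuffled[a.size:].mean())
        hits += diff >= observed
    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_permutations + 1)


# Made-up salaries (in thousands) for two hypothetical groups
group_a = [120, 135, 128, 142, 150, 138, 131, 145]
group_b = [95, 102, 110, 98, 105, 99, 108, 101]

p_value = permutation_p_value(group_a, group_b)
print(f"p-value: {p_value:.4f}")  # reject H0 at the 5% level if p < 0.05
```

A permutation test makes no normality assumption, which suits skewed salary data; the trade-off is that its p-value is a simulation estimate rather than an exact figure.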

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### Which missing value imputation techniques have you used, and why did you use them?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### Which outlier treatment techniques have you used, and why did you use them?

Answer Here.
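
One common outlier treatment is capping values at the interquartile-range (IQR) fences. The sketch below is a minimal illustration on made-up salaries; the helper `cap_outliers_iqr` and the multiplier `k=1.5` are assumptions, not the treatment chosen in this notebook.

```python
import pandas as pd


def cap_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up salaries (in thousands); 400 is an artificial outlier
salary = pd.Series([60, 65, 70, 72, 75, 80, 85, 90, 400], dtype=float, name="Avg Salary")
capped = cap_outliers_iqr(salary)
print(capped.tolist())
```

Capping keeps every row, which matters on a small dataset; dropping the rows instead is an alternative when the extreme values are likely data errors.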

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### Which categorical encoding techniques have you used, and why did you use them?

Answer Here.
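
For a low-cardinality column such as company size, one-hot encoding is a typical choice. The sketch below uses `pandas.get_dummies` on made-up rows; the size labels are illustrative and may not match the dataset's exact categories.

```python
import pandas as pd

# Made-up rows; column names mirror the notebook's "Size" and "Avg Salary"
demo = pd.DataFrame({
    "Size": ["1 to 50", "51 to 200", "10000+", "51 to 200"],
    "Avg Salary": [70.0, 82.5, 120.0, 90.0],
})

# drop_first=True removes one dummy per feature to avoid perfectly collinear columns
encoded = pd.get_dummies(demo, columns=["Size"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

High-cardinality columns such as job title or location would produce many sparse columns this way, so grouping rare categories first or using a frequency or target encoding may be preferable there.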

### 4. Textual Data Preprocessing
(Mandatory for textual datasets, e.g., NLP, Sentiment Analysis, Text Clustering, etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (e.g., Stemming, Lemmatization, etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used, and why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (if needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data into train and test sets. Choose the splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.
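
A hold-out split shuffles the rows once and reserves a fixed share for testing. The sketch below implements this with NumPy on a made-up frame; the helper `split_train_test`, the 80/20 ratio, and the seed are illustrative assumptions. In practice scikit-learn's `train_test_split` is the usual tool for the same job.

```python
import numpy as np
import pandas as pd


def split_train_test(frame: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Shuffle row positions once, then slice them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(frame))
    n_test = int(round(len(frame) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return frame.iloc[train_idx], frame.iloc[test_idx]


# Made-up frame with 10 rows
demo = pd.DataFrame({
    "Rating": np.linspace(3.0, 4.8, 10),
    "Avg Salary": np.arange(60, 160, 10),
})
train, test = split_train_test(demo, test_size=0.2)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so metric changes between runs reflect the model rather than a different partition.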

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If it needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note the improvement with an updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note the improvement with an updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note the improvement with an updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.
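
For a regression task such as salary prediction, MAE, RMSE, and R² are the metrics most often reported. The sketch below computes all three from their definitions on made-up values; the helper `regression_metrics` is hypothetical, and the notebook does not state which metrics it finally uses.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))           # average absolute miss
    rmse = np.sqrt(np.mean(errors ** 2))    # penalizes large misses more heavily
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "R2": 1 - ss_res / ss_tot}


# Made-up actual and predicted salaries (in thousands)
actual = [100, 120, 90, 150]
predicted = [110, 115, 95, 140]
print(regression_metrics(actual, predicted))
```

MAE and RMSE are expressed in the same unit as the salary column, which makes them easy to communicate to stakeholders, whereas R² describes the share of salary variance the model explains.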

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model you have used and its feature importance using any model explainability tool.

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in a pickle or joblib file for the deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and predict on unseen data as a sanity check.


In [None]:
# Load the File and predict unseen data.
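
The save-and-reload round trip described in the two steps above can be sketched with the standard-library `pickle` module. The "model" here is a hypothetical stand-in (a dictionary of linear coefficients) and the file name is made up; a fitted estimator object can be written and read the same way.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fitted model: coefficients of a linear rule
model = {"intercept": 60.0, "coef_rating": 10.0}


def predict(params, rating):
    """Predict salary (in thousands) from a company rating."""
    return params["intercept"] + params["coef_rating"] * rating


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "salary_model.pkl"  # hypothetical file name
    with path.open("wb") as fh:
        pickle.dump(model, fh)   # step 1: save
    with path.open("rb") as fh:
        restored = pickle.load(fh)  # step 2: load

# Sanity check: the restored model reproduces the original prediction
print(predict(restored, 4.0) == predict(model, 4.0))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.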

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***