# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name** **-** Shreyansh Saxena

# **Project Summary -**

This EDA project aimed to explore and analyze a dataset containing information about job postings. The dataset included various features such as job title, salary estimate, job description, rating, company name, location, headquarters, size, founded, type of ownership, industry, sector, and revenue.

Data Cleaning and Preprocessing

The project began with data cleaning and preprocessing. The dataset was loaded into a Pandas DataFrame, and initial observations were made. The data was found to be mostly clean, with some missing values in the 'Salary Estimate' column. These missing values were handled by converting the column to numeric values using the pd.to_numeric() function with the errors='coerce' parameter.
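As a minimal sketch of the conversion described above, the snippet below shows how `pd.to_numeric()` with `errors='coerce'` behaves. The sample values are hypothetical, not taken from the dataset. Coercion does not fill missing values; it turns unparseable entries into `NaN` so they can be counted or imputed afterwards.

```python
import pandas as pd

# Hypothetical sample values; the real column contents may differ
raw = pd.Series(["52000", "-1", "not available", "87000.5", None])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# "-1" parses as a valid number, so a sentinel like that has to be
# masked explicitly if it stands for "missing" (an assumption here)
numeric = numeric.mask(numeric < 0)

print(numeric)
print("Missing after coercion:", numeric.isna().sum())
```

After this step the missing entries still need a decision: drop them, or impute them with a statistic such as the median.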

Data Visualization

Several data visualizations were created to understand the distribution of the data and relationships between variables. These visualizations included:

1. Bar Chart: A bar chart was created to display the average salary by job title. However, due to the large number of job titles, the chart was not informative. To address this, the data was filtered to show only the top 10 job titles with the highest average salaries.
2. Scatter Plot: A scatter plot was created to display the relationship between the 'Salary Estimate' and 'Rating' columns. However, the plot did not reveal any significant correlations.
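The top-10 filter mentioned for the bar chart could look roughly like the sketch below. The helper name `top_titles_by_salary` and the numeric column `avg_salary` are assumptions made for illustration, and the sample rows are made up.

```python
import pandas as pd

def top_titles_by_salary(df, n=10, title_col="Job Title", salary_col="avg_salary"):
    """Return the n job titles with the highest mean salary."""
    return df.groupby(title_col)[salary_col].mean().nlargest(n)

# Hypothetical rows, only to show the shape of the result
sample = pd.DataFrame({
    "Job Title": ["Data Scientist", "Data Scientist", "Data Analyst", "ML Engineer"],
    "avg_salary": [100.0, 120.0, 70.0, 130.0],
})

top = top_titles_by_salary(sample, n=2)
print(top)
```

The returned Series can be passed straight to `.plot(kind='bar')`. Titles that appear only once can dominate such a ranking, so a minimum posting count per title may be worth adding.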

Insights and Findings

Despite the challenges faced in creating informative visualizations, some insights were gained from the analysis:

1. Job Titles: The dataset contained a large number of unique job titles, indicating a diverse range of job postings.
2. Salary Estimates: The 'Salary Estimate' column contained some missing values, which were handled during data preprocessing.
3. Rating: The 'Rating' column did not reveal any significant correlations with the 'Salary Estimate' column.

Limitations and Future Work

This EDA project had some limitations, including:

1. Data Quality: The dataset contained some missing values, which were handled during data preprocessing.
2. Data Visualization: Some visualizations did not reveal informative insights due to the large number of unique job titles.

Future work could involve:

1. Data Aggregation: Aggregating the data by job category or industry to reveal more informative insights.
2. Additional Visualizations: Creating additional visualizations, such as heatmaps or box plots, to further explore the relationships between variables.
3. Machine Learning: Using machine learning algorithms to predict salary estimates based on job title, rating, and other features.
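One possible way to approach the aggregation idea is to map free-text job titles to a handful of coarse categories before grouping. The keyword rules and the function name `categorize_title` below are hypothetical and would need tuning against the titles actually present in the data.

```python
import pandas as pd

# Hypothetical keyword-to-category rules, checked in order
CATEGORY_KEYWORDS = [
    ("data scientist", "Data Scientist"),
    ("analyst", "Analyst"),
    ("engineer", "Engineer"),
    ("manager", "Manager"),
]

def categorize_title(title):
    """Map a free-text job title to a coarse category by first matching keyword."""
    lowered = str(title).lower()
    for keyword, category in CATEGORY_KEYWORDS:
        if keyword in lowered:
            return category
    return "Other"

titles = pd.Series(["Senior Data Scientist", "Data Analyst II", "ML Engineer", "Chemist"])
categories = titles.map(categorize_title)
print(categories.value_counts())
```

Grouping on the resulting category column instead of the raw title keeps charts readable when there are many unique titles.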

# **GitHub Link -**

# **Problem Statement**


The goal of this project is to perform exploratory data analysis (EDA) on a dataset containing information about company salaries, locations, revenue, and other factors. The dataset is based on data from Glassdoor for the year 2017-18.


#### **Define Your Business Objective?**

1. To understand the distribution of salaries, locations, and revenue across different companies.
2. To identify correlations and relationships between various factors.
3. To visualize the data to gain insights and answer specific questions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below and answers the listed questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
import pandas as pd
data = pd.read_csv('/glassdoor_jobs.csv')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
data.head()

### Dataset Rows & Columns count

In [None]:
print("Number of rows:",data.shape[0])
print("Number of columns:",data.shape[1])

### Dataset Information

In [None]:
data.info()

#### Duplicate Values

In [None]:
duplicate_count = data.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)

#### Missing Values/Null Values

In [None]:
missing_count = data.isnull().sum()
print("Missing Values:")
print(missing_count)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Assuming 'data' is your DataFrame
missing_values = data.isnull().sum()
missing_values = missing_values[missing_values > 0]  # Only show columns with missing values

if missing_values.empty:
    print("No missing values to display.")
else:
    plt.figure(figsize=(10, 5))
    missing_values.plot(kind='bar', color='red', edgecolor='black')
    plt.title("Missing Values in Each Column")
    plt.xlabel("Columns")
    plt.ylabel("Count of Missing Values")
    plt.xticks(rotation=45)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()

### What did you know about your dataset?

1. File format: The dataset is in CSV format.
2. File name: The dataset file is named "glassdoor_jobs.csv".
3. Location: The dataset file is located on a local machine, possibly in a directory such as "C:\Users\acer".
4. Number of rows: Reported by data.shape[0] in the "Dataset Rows & Columns count" cell above; the exact figure is not recorded here.
5. Number of columns: Reported by data.shape[1] in the same cell; the exact figure is not recorded here.
6. Column names: We can use the data.columns command to get a list of column names.
7. Data types: We can use the data.info() command to get information about the data types of each column.
8. Missing values: We used the data.isnull().sum() command to identify missing values in the dataset.
9. Duplicate values: We used the data.duplicated().sum() command to identify duplicate values in the dataset.
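The checks listed above can be gathered into one helper so the numbers are reported together. This is only a sketch; the function name `dataset_overview` and the demo frame are hypothetical.

```python
import pandas as pd

def dataset_overview(df):
    """Collect the basic facts about a DataFrame in one dictionary."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_cells": int(df.isnull().sum().sum()),
    }

# Small made-up frame: one duplicated row and one missing cell
demo = pd.DataFrame({"a": [1, 1, None], "b": ["x", "x", "y"]})
overview = dataset_overview(demo)
print(overview)
```

Calling `dataset_overview(data)` on the loaded dataset would give the actual row, column, duplicate, and missing-value counts to quote in this section.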

## ***2. Understanding Your Variables***

In [None]:
print(data.columns)

In [None]:
print(data.describe())

### Variables Description

1. Job Title
- Description: The title of the job posting.
- Data Type: Object (string)
- Values: Various job titles (e.g., "Software Engineer", "Data Scientist", "Marketing Manager")

2. Salary Estimate
- Description: The estimated salary range for the job posting.
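- Data Type: Object (string) in the raw file; the wrangling code later extracts a numeric value from it
- Values: Salary estimates stored as text, with '-1' treated as a missing-value marker in later cells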

3. Job Description
- Description: A brief summary of the job responsibilities and requirements.
- Data Type: Object (string)
- Values: Various job descriptions (e.g., "Develop software applications", "Analyze data and create reports", "Manage marketing campaigns")

4. Rating
- Description: The rating of the company based on employee reviews.
- Data Type: Float
- Values: Rating values (e.g., 4.5, 4.2, 4.8)

5. Company Name
- Description: The name of the company posting the job.
- Data Type: Object (string)
- Values: Various company names (e.g., "Google", "Amazon", "Microsoft")

6. Location
- Description: The location where the job is based.
- Data Type: Object (string)
- Values: Various city and state names (e.g., "New York, NY", "San Francisco, CA", "Chicago, IL")

7. Headquarters
- Description: The location of the company's headquarters.
- Data Type: Object (string)
- Values: Various city and state names (e.g., "Mountain View, CA", "Seattle, WA", "New York, NY")

8. Size
- Description: The number of employees working at the company.
- Data Type: Integer
- Values: Employee count ranges (e.g., 1000, 5000, 10000)

9. Founded
- Description: The year the company was founded.
- Data Type: Integer
- Values: Year values (e.g., 1995, 2005, 2010)

10. Type of Ownership
- Description: The type of ownership structure of the company (e.g., public, private, non-profit).
- Data Type: Object (string)
- Values: Various ownership types (e.g., "Public", "Private", "Non-Profit")

11. Industry
- Description: The industry or sector the company operates in.
- Data Type: Object (string)
- Values: Various industry names (e.g., "Technology", "Finance", "Healthcare")

12. Sector
- Description: The specific sector within the industry the company operates in.
- Data Type: Object (string)
- Values: Various sector names (e.g., "Software", "Banking", "Biotechnology")

13. Revenue
- Description: The company's annual revenue.
- Data Type: Object (string)
- Values: Revenue brackets (e.g., "Less than $1 Million", "$1 Million to $10 Million")

14. Competitors
- Description: A list of competing companies in the same industry or sector.
- Data Type: Object (string)
- Values: Various competitor names (e.g., "Google", "Amazon", "Facebook")

### Check Unique Values for each variable.

In [None]:
nunique_values = data.nunique()
print(nunique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ---------------------------
# 1. Load the Dataset
# ---------------------------
data = pd.read_csv("glassdoor_jobs.csv")
print("Initial shape:", data.shape)
print(data.head())
print(data.info())

# ---------------------------
# 2. Basic Data Cleaning
# ---------------------------
# Remove duplicate rows
data.drop_duplicates(inplace=True)
print("Shape after removing duplicates:", data.shape)

# Check missing values
missing_counts = data.isnull().sum()
print("Missing values per column:\n", missing_counts)

# ---------------------------
# 3. Handling Missing Values
# ---------------------------
# Option A: Drop columns with too many missing values (e.g., >50% missing)
threshold = len(data) * 0.5
cols_to_drop = missing_counts[missing_counts > threshold].index
data.drop(columns=cols_to_drop, inplace=True)
print("Dropped columns due to high missingness:", list(cols_to_drop))

# Option B: Fill missing values for remaining columns
for col in data.columns:
    if data[col].dtype in ['float64', 'int64']:
        # Replace missing numerical values with the median
        # (assign back instead of inplace=True on a column, which newer pandas warns about)
        data[col] = data[col].fillna(data[col].median())
    else:
        # Replace missing categorical values with the mode
        data[col] = data[col].fillna(data[col].mode()[0])

# Verify missing values have been handled
print("Missing values after imputation:\n", data.isnull().sum())

# ---------------------------
# 4. Data Type Conversions & Feature Engineering
# ---------------------------
# Example: Extract a numeric value from a 'Salary Estimate' string column
if 'Salary Estimate' in data.columns:
    # This extracts the first group of digits found in the string
    data['Salary Estimate'] = data['Salary Estimate'].str.extract(r'(\d+)').astype(float)

# If there are separate columns for minimum and maximum salary, compute the average salary
if 'min_salary' in data.columns and 'max_salary' in data.columns:
    data['avg_salary'] = (data['min_salary'] + data['max_salary']) / 2

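# Convert date columns to proper datetime format if applicable
if 'Founded' in data.columns:
    # 'Founded' stores a year as a number. Without an explicit format, pandas reads
    # integers as nanoseconds since 1970, so the year is parsed as text with format='%Y'.
    founded_year = pd.to_numeric(data['Founded'], errors='coerce').round().astype('Int64')
    # Coerce errors to NaT (Not a Time), e.g. for non-year values such as -1
    data['Founded'] = pd.to_datetime(founded_year.astype(str), format='%Y', errors='coerce')
    # Example: Calculate company age if Founded date is available
    current_year = pd.to_datetime("today").year
    data['company_age'] = current_year - data['Founded'].dt.year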

# ---------------------------
# 5. Outlier Handling (Optional)
# ---------------------------
# For numeric columns, you might want to clip extreme values (using the 1st and 99th percentiles)
num_cols = data.select_dtypes(include=['float64', 'int64']).columns
for col in num_cols:
    lower_bound = data[col].quantile(0.01)
    upper_bound = data[col].quantile(0.99)
    data[col] = data[col].clip(lower_bound, upper_bound)

# ---------------------------
# 6. Final Touches
# ---------------------------
# Reset the index after cleaning
data.reset_index(drop=True, inplace=True)

print("Data wrangling complete. Final shape:", data.shape)
print(data.head())


### What all manipulations have you done and insights you found?

##### Manipulations
1. Removing Duplicates: I removed duplicate rows so that each job posting is counted once.
2. Dropping Sparse Columns: I dropped any column in which more than 50% of the values were missing.
3. Handling Missing Values: I filled the remaining missing values with the median for numerical columns and the mode for categorical columns.
4. Salary Extraction: I extracted the first group of digits from the Salary Estimate text and stored it as a float.
5. Date Conversion: I converted the Founded column to datetime format and derived a company_age column from the founding year.
6. Outlier Handling: I clipped numerical columns to their 1st and 99th percentiles, which limits extreme values without removing rows.

##### Insights
1. Data Quality: The dataset needed deduplication and missing-value handling before it could be analyzed.
2. Categorical Variables: The dataset has multiple categorical variables (Job Title, Company Name, Location, Headquarters, Industry, Sector), which are kept as text for grouping and counting.
3. Salary Estimate Distribution: The Salary Estimate column is stored as text and has to be parsed into a number before any numerical analysis; extreme values are clipped to prevent skewed analysis.
4. Founded Year Distribution: The Founded column gives a distribution of company founding years and supports a company age feature.
5. Industry and Sector Diversity: The dataset covered various industries and sectors, providing a diverse range of companies.
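Outliers can be handled either by clipping values to percentile bounds or by dropping rows with the IQR rule, and the two behave differently. The sketch below compares them on made-up numbers; it is an illustration, not the exact treatment applied to this dataset.

```python
import pandas as pd

# Hypothetical salary-like values with one extreme entry
values = pd.Series([40, 45, 50, 55, 60, 65, 70, 500], dtype=float)

# Percentile clipping: extreme values are pulled in to the bounds, no rows are lost
clipped = values.clip(values.quantile(0.01), values.quantile(0.99))

# IQR rule: rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are dropped
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
kept = values[values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print("Rows after clipping:", len(clipped))
print("Rows after IQR filter:", len(kept))
```

Clipping keeps the sample size intact but still leaves a large maximum when the 99th percentile itself is extreme, while the IQR filter removes the row entirely.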

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Select numerical columns for correlation analysis
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns

# Compute the correlation matrix
correlation_matrix = data[numerical_cols].corr()

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix,
            annot=True,       # show correlation values on the heatmap
            cmap='coolwarm',  # professional color palette
            center=0,         # center the colormap at 0
            fmt=".2f")        # format the correlation values to 2 decimals

plt.title("Correlation Heatmap of Numerical Variables", fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

1. Comprehensive Overview: The correlation heatmap provides a comprehensive view of the relationships among all numerical variables in the dataset at once. This is crucial when you want to quickly understand which features are related to each other.

2. Intuitive Visual Communication: Using a diverging color palette (e.g., "coolwarm"), the chart clearly distinguishes between positive and negative correlations. The intensity of the color visually emphasizes the strength of each relationship, making the chart very intuitive.

3. Data-Driven Insights: By annotating the heatmap with precise correlation values, we gain not only a visual representation but also quantitative measures of the relationships. This helps in identifying which features may be redundant, highly influential, or potentially problematic (like multicollinearity) in further analysis or modeling.

4. Effective Storytelling: The heatmap serves as a powerful storytelling tool. It succinctly communicates key insights about the data structure, enabling stakeholders to quickly grasp complex relationships. This can guide subsequent steps in the analysis, such as feature engineering, model selection, and hypothesis formulation.

5. Facilitates Experimentation: Visualizing the relationships encourages experimentation by highlighting unexpected correlations or patterns that may not be immediately apparent from raw data or summary statistics.

##### 2. What is/are the insight(s) found from the chart?
Ans - Based on a typical correlation heatmap generated from a dataset like this (which includes variables such as average salary, company age, and technical skill indicators), you might uncover several key insights:

Expected Strong Relationships:
• The derived salary variables (e.g., minimum, maximum, and average salary) usually exhibit very high positive correlations, which confirms that the data processing (such as averaging the salary estimates) was performed correctly.

Interrelated Skill Requirements:
• Technical skill indicators (like Python, SQL, and others) may show moderate positive correlations with one another. This could suggest that job postings requiring one technical skill are likely to list several related skills, hinting at bundled skill requirements.

Company Age vs. Salary Trends:
• The correlation between company age (or company longevity) and salary might be weak or slightly negative/positive. For instance, younger companies might offer different salary structures compared to more established firms. This insight could prompt further investigation into whether company maturity influences compensation strategies.

Feature Redundancy:
• If two variables show near-perfect correlation, it might indicate redundancy. This can help in deciding which variables to keep for further analysis or modeling to avoid issues like multicollinearity.
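To make the redundancy check concrete, highly correlated pairs can be listed directly from the correlation matrix instead of being read off the heatmap. The helper name `highly_correlated_pairs`, the 0.9 threshold, and the demo columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up columns: max_salary is min_salary shifted by a constant
demo = pd.DataFrame({
    "min_salary": [50, 60, 70, 80],
    "max_salary": [90, 100, 110, 120],
    "rating": [3.5, 4.1, 3.2, 3.9],
})

redundant = highly_correlated_pairs(demo)
print(redundant)
```

Each pair returned is a candidate for dropping one of the two columns before modeling.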

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the correlation heatmap can lead to positive business impact by identifying key relationships, such as factors influencing salary offers and required skills. For example, understanding salary trends across different company ages can help optimize compensation strategies. Regarding negative growth, redundancy between features (like highly correlated salary variables) could lead to less efficient models and skewed insights in predictive analytics, potentially impacting decision-making negatively if not addressed.

Yes, the insights can drive a positive business impact by highlighting key relationships—such as strong correlations between salary components and technical skills—which can inform targeted hiring, training, and compensation strategies. Conversely, if the chart shows a negative or weak correlation between company age and salary, it might suggest that older companies are less competitive in attracting talent, potentially leading to negative growth. This would signal a need for established firms to re-evaluate their compensation packages to remain competitive.

#### Chart - 2

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Check if the 'Revenue' column exists
if 'Revenue' in data.columns:
    # Calculate the frequency of each revenue category
    revenue_counts = data['Revenue'].value_counts()

    # Create a pie chart
    plt.figure(figsize=(8, 8))
    plt.pie(revenue_counts.values,
            labels=revenue_counts.index,
            autopct='%1.1f%%',
            startangle=140,
            colors=plt.cm.Paired.colors)

    plt.title("Revenue Distribution", fontsize=16, fontweight='bold')
    plt.axis('equal')  # Ensures the pie chart is circular
    plt.show()
else:
    print("The 'Revenue' column is not found in the dataset.")

##### 1. Why did you pick the specific chart?

1. *Representation of Market Segmentation:* The pie chart is ideal for showing the proportions of companies within different revenue brackets. It divides the total market into segments, providing a clear view of how companies are distributed across various categories such as "Less than $1 Million," "$1 Million to $10 Million," and so on.
2. *Visual Simplicity:* Pie charts are easy to interpret, especially when you are visualizing a limited number of categories. The use of a pie chart allows your audience to quickly grasp the relative size of each revenue bracket at a glance.
3. *Effective for Proportional Data:* Since we are interested in understanding the share of each revenue category within the dataset, the pie chart is a powerful tool for communicating proportions. The percentage labels make it clear how each revenue group contributes to the overall distribution.
4. *Storytelling with Distribution:* Using a pie chart allows you to tell a simple yet compelling story about the composition of the companies in your dataset based on their revenue. It can highlight dominant categories, or perhaps unusual distribution patterns that may prompt deeper analysis.
5. *Ease of Communication:* Pie charts are often used in business presentations because they quickly convey information without requiring the viewer to interpret complex data points. This balance of simplicity and informativeness makes the pie chart a great choice for conveying the revenue distribution.

In conclusion, the pie chart was selected because it is a visually appealing and efficient method to narrate the story of company revenue distribution in your dataset.

I chose the revenue distribution pie chart because it provides an immediate visual snapshot of how companies are segmented by revenue. Here’s why this chart works well:

Clear Proportional Representation:
A pie chart clearly shows the proportion of companies in each revenue category, making it easy to see which segments dominate the dataset.

Intuitive and Accessible:
Pie charts are widely understood, so stakeholders can quickly grasp the market segmentation without needing to interpret complex data.

Effective Storytelling:
This chart succinctly communicates key insights about business size distribution, supporting strategic decisions and further analysis on revenue trends.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Revenue Distribution Pie Chart:

1. Dominant Revenue Category: The chart highlights which revenue range has the highest number of companies. If a majority of companies fall into a lower revenue bracket, it may indicate that many companies in the dataset are startups or mid-sized firms.
2. Market Segmentation: If the chart shows an even distribution across revenue categories, it suggests a balanced mix of small, medium, and large companies in the dataset. A skewed distribution could indicate that the industry is dominated by either high-revenue corporations or smaller firms.
3. Potential Business Impact: If most job listings come from high-revenue companies, job seekers may expect better salary offers and benefits. Conversely, a dominance of lower-revenue companies could imply more job opportunities but potentially lower salary expectations.
4. Hiring Trends Indication: If a significant portion of companies fall into lower revenue brackets, it could indicate emerging businesses actively hiring to scale up.

These insights help businesses, job seekers, and policymakers understand the industry landscape and make informed decisions.
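The shares shown on a pie chart can also be computed as a table, which makes the dominant bracket explicit. In the sketch below the revenue labels are illustrative and the use of "-1" as an unknown-value marker is an assumption; both should be checked against the actual column.

```python
import pandas as pd

# Hypothetical revenue labels; "-1" is assumed to mark an unknown value
revenue = pd.Series([
    "$1 Million to $10 Million", "$1 Million to $10 Million", "Less than $1 Million",
    "-1", "Less than $1 Million", "$1 Million to $10 Million",
])

# Exclude unknown entries so they do not appear as a revenue bracket
known = revenue[revenue != "-1"]

# normalize=True returns fractions; convert to percentages with one decimal
shares = known.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Printing this table next to the chart lets the written insight quote the actual percentage of the largest bracket.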

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the revenue distribution chart can drive positive business impact in several ways:

1. Targeted Job Applications: Candidates can focus on companies that align with their salary expectations and career goals.
2. Strategic Hiring Decisions: Businesses can analyze industry trends and adjust their hiring strategies accordingly.
3. Investment Opportunities: If a significant number of companies belong to a particular revenue bracket, investors can assess market stability and growth potential.

Yes, certain insights might indicate potential challenges:

1. Market Saturation Risk: If most companies in the dataset belong to a lower revenue bracket, it might suggest a competitive job market with fewer high-paying opportunities.
2. Limited Salary Growth: A dominance of small to mid-sized firms could indicate lower salary offers, which may lead to high employee turnover.
3. Industry Instability: If the dataset shows a lack of large, well-established firms, it may suggest volatility in the industry, discouraging long-term investments.

Justification: If an industry is heavily populated by low-revenue firms, it might struggle to attract top talent, secure investments, or offer competitive salaries, leading to slower economic growth. Addressing these insights proactively can help businesses and job seekers make informed decisions.

#### Chart - 3

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Check if the 'Industry' column exists
if 'Industry' in data.columns:
    # Count the number of job postings per industry
    industry_counts = data['Industry'].value_counts().head(10)  # Top 10 industries

    # Create a pie chart
    plt.figure(figsize=(10, 6))
    plt.pie(industry_counts.values, labels=industry_counts.index, autopct='%1.1f%%',
            startangle=140, colors=plt.cm.Paired.colors)

    # Title and layout
    plt.title("Job Distribution by Industry", fontsize=14, fontweight='bold')
    plt.axis('equal')  # Ensures a circular pie chart
    plt.show()
else:
    print("The 'Industry' column is not found in the dataset.")

##### 1. Why did you pick the specific chart?

I chose a pie chart for this visualization because it effectively represents the proportional distribution of job postings across different industries. Here’s why:

1. Easily Understandable: A pie chart helps quickly identify which industries dominate the dataset.
2. Good for Categorical Data: Since industry type is a categorical variable, a pie chart visually segments the job postings into distinct groups.
3. Shows Market Trends: It highlights which industries are hiring the most, helping job seekers and businesses understand industry demand.

However, for future visualizations, I’ll introduce bar charts, scatter plots, and other chart types to better explore different relationships in the data.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Job Distribution by Industry Chart:

1. Dominant Industries: The chart reveals which industries have the highest job postings. If a few industries dominate, it suggests a high demand for professionals in those sectors.
2. Market Demand Trends: If industries like Tech, Finance, or Healthcare have the largest share, it indicates a strong job market in those fields. If the distribution is balanced, job opportunities are spread across multiple industries.
3. Career & Business Strategy: For job seekers, candidates can focus on industries with more opportunities. For businesses, companies can analyze hiring trends to understand industry competition and workforce demand.

Overall, this chart provides valuable insights into which industries are leading in job postings and helps professionals make informed career decisions.
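Because the chart keeps only the ten most frequent industries, its percentages are shares of those ten, not of all postings. One way to keep the totals honest is to fold the remaining categories into an "Other" slice. The helper name `top_n_with_other` and the sample values are hypothetical.

```python
import pandas as pd

def top_n_with_other(series, n=10, other_label="Other"):
    """Count categories, keeping the n most frequent and summing the rest."""
    counts = series.value_counts()
    top = counts.head(n)
    remainder = counts.iloc[n:].sum()
    if remainder > 0:
        # Append the combined count of all less frequent categories
        top = pd.concat([top, pd.Series({other_label: remainder})])
    return top

# Made-up industry labels
industries = pd.Series(["Tech", "Tech", "Tech", "Finance", "Finance", "Health", "Retail"])
result = top_n_with_other(industries, n=2)
print(result)
```

Passing `result.values` and `result.index` to `plt.pie` then gives percentages relative to the whole dataset.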

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help in multiple ways:

1. Strategic Job Search: Candidates can focus on industries with the most job openings, improving their chances of securing employment.
2. Industry Growth Analysis: Businesses can identify sectors with high hiring activity and align their investments or expansion plans accordingly.
3. Workforce Planning: Companies can anticipate competition for talent in high-demand industries and adjust hiring strategies.

Yes, potential risks include:

1. Industry Over-Saturation: If most jobs are concentrated in just a few industries, it may indicate a lack of diversification, making the job market vulnerable to economic downturns in those sectors.
2. Limited Opportunities in Other Fields: If certain industries show very low job openings, it may signal stagnation or declining demand in those areas.
3. Job Market Imbalance: A significant skew toward specific industries could lead to skill shortages in other sectors, affecting long-term economic stability.

Justification: If job opportunities are highly concentrated in a few industries, professionals in other fields might struggle to find employment, leading to higher competition and lower salaries. A balanced job distribution across industries is healthier for sustained economic growth.

#### Chart - 4

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Check if the 'Job Title' column exists
if 'Job Title' in data.columns:
    # Count the number of occurrences of each job title
    job_counts = data['Job Title'].value_counts().head(10)  # Top 10 job titles

    # Create a bar chart
    plt.figure(figsize=(12, 6))
    job_counts.plot(kind='bar', color='skyblue', edgecolor='black')

    # Chart formatting
    plt.title("Top 10 Most Common Job Titles", fontsize=14, fontweight='bold')
    plt.xlabel("Job Titles", fontsize=12)
    plt.ylabel("Number of Job Listings", fontsize=12)
    plt.xticks(rotation=45, ha="right")  # Rotate x-axis labels for readability
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # Show the plot
    plt.show()
else:
    print("The 'Job Title' column is not found in the dataset.")

##### 1. Why did you pick the specific chart?

I chose a bar chart for this visualization because:

1. Clear Comparison of Job Demand: A bar chart makes it easy to compare the number of job postings for different job titles. Unlike a pie chart, which shows proportions, a bar chart directly displays the exact number of listings per role.
2. Better Readability for Categorical Data: Job titles are categorical variables, and a bar chart effectively represents them with clear labels and count values.
3. Scalability & Detail: A bar chart allows us to include exact values for each job title, which is useful when analyzing job market trends. If needed, more job titles can be added without making the visualization cluttered.
4. Actionable Insights for Job Seekers & Recruiters: It helps job seekers identify high-demand roles and tailor their applications, and it assists businesses in understanding hiring trends and workforce demand.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Top Job Titles Distribution Chart:

1. Most In-Demand Job Roles: The chart highlights the top 10 most frequently listed job titles, indicating which roles are in the highest demand.
2. Industry Hiring Trends: If job titles related to software development, data analysis, or AI dominate, it suggests tech industry growth. If roles from finance, healthcare, or marketing appear frequently, it reflects hiring trends in those sectors.
3. Job Market Competition: If certain job titles have significantly higher counts, it means higher demand but also higher competition for those roles. If the job distribution is evenly spread, it suggests a balanced job market with diverse opportunities.
4. Career Decision-Making: It helps job seekers focus on roles with more openings, improving their chances of employment, and it assists students or professionals in selecting skills and certifications based on market needs.

Overall, this chart provides valuable job market insights for both candidates and employers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Will the gained insights help create a positive business impact?
Yes, these insights can be highly beneficial for businesses and job seekers:

Talent Acquisition Strategy: Companies can focus on recruiting for in-demand roles, ensuring they stay competitive in the market.
Workforce Planning: Businesses can anticipate hiring needs and align recruitment strategies accordingly.
Career Guidance: Job seekers can identify the most promising career paths based on job market trends.

Are there any insights that lead to negative growth?
Yes, potential risks include:

Job Market Saturation: If a few job roles dominate the market, many candidates may crowd into them, leading to higher competition and lower salaries for those positions.

Skill Gaps in Other Sectors: If industries outside of tech (e.g., manufacturing, healthcare) have fewer postings, it could signal weaker hiring activity in those fields, potentially slowing their growth.

Over-Reliance on Certain Roles: If companies focus only on hiring for trending roles (e.g., data scientists, software engineers), they might overlook supporting functions like HR, operations, and customer support, which are essential for business sustainability.

Justification: While identifying in-demand jobs helps businesses grow, an imbalanced job market could lead to high competition, skill shortages in other fields, and potential layoffs if industries over-hire beyond their actual needs.

#### Chart - 5

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Check if 'Salary Estimate' column exists
if 'Salary Estimate' in data.columns:
    # Filter out rows with '-1' salary values (assuming these indicate missing data)
    salary_data = data[data['Salary Estimate'] != '-1'].copy()

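    # Extract the lower and upper bounds of the salary range with a single regex.
    # Assumes a "min-max" format such as "$53K-$91K (Glassdoor est.)"; a pattern
    # anchored at the end of the string would miss the upper bound when text follows it.
    bounds = salary_data['Salary Estimate'].astype(str).str.extract(r'(\d+)[^\d]+(\d+)').astype(float)
    salary_data['Min Salary'] = bounds[0]
    salary_data['Max Salary'] = bounds[1]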

    # Compute the average salary
    salary_data['Average Salary'] = (salary_data['Min Salary'] + salary_data['Max Salary']) / 2

    # Drop NaN values that may have resulted from extraction errors
    salary_data = salary_data.dropna(subset=['Average Salary'])

    # Create the box plot
    plt.figure(figsize=(8, 6))
    sns.boxplot(y=salary_data['Average Salary'], color='lightblue')

    # Chart formatting
    plt.title("Salary Distribution of Job Listings", fontsize=14, fontweight='bold')
    plt.ylabel("Salary ($)", fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # Show the plot
    plt.show()

else:
    print("The 'Salary Estimate' column is not found in the dataset.")

##### 1. Why did you pick the specific chart?

1. Shows Salary Range Effectively

* A box plot clearly visualizes the minimum, median, and maximum salaries, along with the interquartile range (IQR).
* It highlights salary differences across job listings, making it easy to compare variations.

2. Identifies Outliers

* It helps detect extremely high or low salary offers, which might be due to senior-level positions, misleading job postings, or industry-specific differences.

3. Better Than a Bar or Pie Chart for This Data

* A bar chart would only show the number of job listings per salary range and would not capture the shape of the salary distribution.
* A pie chart would not be useful because salary is a continuous variable, not a categorical one.

4. Useful for Job Seekers & Businesses

* Job seekers can see realistic salary expectations across the listings.
* Businesses can compare their salary offerings with the industry standards.

Thus, a box plot is well suited to visualizing salary distributions and variations in a clean, professional way.
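The quantities a box plot draws can also be computed directly, which helps when reading the chart. The sketch below uses a small made-up salary series (not values from `glassdoor_jobs.csv`) and the common 1.5 × IQR rule that seaborn and matplotlib apply by default to flag outliers.

```python
import pandas as pd

# Hypothetical average salaries in thousands of dollars (illustrative only,
# not taken from glassdoor_jobs.csv).
avg_salary = pd.Series([55, 60, 62, 70, 72, 75, 80, 85, 90, 200], name="Average Salary")

# Quartiles and interquartile range: the edges and height of the box.
q1 = avg_salary.quantile(0.25)
median = avg_salary.quantile(0.50)
q3 = avg_salary.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = avg_salary[(avg_salary < lower_fence) | (avg_salary > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers.tolist())
```

With real data, the same calculation can be run on the `Average Salary` column to list the specific postings that appear as outlier points.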

##### 2. What is/are the insight(s) found from the chart?

Insights from the Salary Distribution Chart (Box Plot):

Salary Range & Market Trends: The median salary gives a good indication of typical earnings across the listings. The spread of salaries shows variation across different positions and experience levels.

Outliers in Salary Data: Extremely high salaries could indicate executive roles or misleading job postings. Very low salaries may indicate entry-level positions or companies offering below-market pay.

Industry Compensation Standards: If the salary distribution is wide, it suggests diverse job roles and experience levels. If salaries are clustered in a narrow range, it may indicate standardized pay across companies in the dataset.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Will the gained insights help create a positive business impact?
✅ Yes! The insights from the salary distribution can help in:

Strategic Hiring Decisions: Companies can adjust their salary offerings to remain competitive.
Better Salary Negotiations: Job seekers can set realistic salary expectations based on industry trends.
Identifying Market Gaps: Businesses can use this data to attract top talent by offering competitive salaries.



Are there any insights that could lead to negative growth?
⚠️ Yes, potential risks include:

Salary Disparities: If some companies offer much lower salaries, they may struggle to attract skilled professionals, leading to high employee turnover.

Misleading Salary Listings: Some job postings may inflate salary estimates to attract applicants, leading to distrust in job platforms.

Industry-Wide Pay Compression: If salaries are too tightly clustered, it may indicate a lack of career growth, making it difficult for professionals to advance in earnings.

Justification: While these insights support positive business impact, companies need to ensure fair, competitive salaries to attract and retain talent while avoiding misleading salary postings.

#### Chart - 6


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Check if 'Company Name' column exists
if 'Company Name' in data.columns:
    # Count job listings per company
    top_companies = data['Company Name'].value_counts().nlargest(10)  # Top 10 companies

    # Create bar chart
    plt.figure(figsize=(10, 6))
    sns.barplot(x=top_companies.values, y=top_companies.index, palette="Blues_r")

    # Chart formatting
    plt.title("Top 10 Companies with Most Job Listings", fontsize=14, fontweight='bold')
    plt.xlabel("Number of Job Listings", fontsize=12)
    plt.ylabel("Company", fontsize=12)
    plt.grid(axis='x', linestyle='--', alpha=0.7)

    # Show the plot
    plt.show()

else:
    print("The 'Company Name' column is not found in the dataset.")

##### 1. Why did you pick the specific chart?

Bar charts are great for comparing counts of job listings across different companies.
It clearly shows which companies hire the most, helping job seekers target top employers.
Unlike a pie chart, a bar chart handles many categories well, making it more readable.

##### 2. What is/are the insight(s) found from the chart?

Top hiring companies: The chart reveals which companies have the highest job openings.
Market demand trends: If a few companies dominate hiring, it may indicate high industry demand in certain areas.
Company reputation: Well-known companies may appear at the top, while smaller firms may have fewer listings.
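Whether "a few companies dominate hiring" can be checked with a simple concentration measure: the share of all listings that come from the top N companies. The sketch below uses made-up company names, and `top_n` is an illustrative parameter rather than something defined in the notebook.

```python
import pandas as pd

# Hypothetical company names (illustrative only, not from the dataset).
companies = pd.Series(
    ["Acme", "Acme", "Acme", "Globex", "Globex", "Initech", "Umbrella", "Hooli"],
    name="Company Name",
)

# Listings per company, then the fraction contributed by the top_n companies.
counts = companies.value_counts()
top_n = 2
top_share = counts.nlargest(top_n).sum() / counts.sum()

print(f"Top {top_n} companies account for {top_share:.1%} of listings")
```

A share close to 100% would support the "few companies dominate" reading, while a small share suggests hiring is spread across many employers.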

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes! These insights help in:

Job Seekers: Targeting high-hiring companies for better chances of employment.
Businesses: Understanding competitors' hiring trends and adjusting recruitment strategies.
Recruiters: Focusing on in-demand job roles to attract top talent efficiently.
⚠️ Negative Growth Risks:

Market Saturation: If a few companies dominate hiring, smaller businesses may struggle to compete.
Job Stability Concerns: A company with many listings could indicate high employee turnover, raising red flags for job seekers.

Justification: Why do these insights matter?
Business Strategy: Companies can adjust hiring plans based on competitors' job postings.
Economic Trends: Hiring trends reflect the growth or decline of industries.
Job Market Health: A well-balanced hiring distribution is better for economic stability.

#### Chart - 7

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Check if 'Job Title' column exists
if 'Job Title' in data.columns:
    # Count job postings per title
    top_jobs = data['Job Title'].value_counts().nlargest(10)  # Top 10 job roles

    # Create horizontal bar chart
    plt.figure(figsize=(10, 6))
    sns.barplot(x=top_jobs.values, y=top_jobs.index, palette="coolwarm")

    # Chart formatting
    plt.title("Top 10 Most In-Demand Job Roles", fontsize=14, fontweight='bold')
    plt.xlabel("Number of Job Listings", fontsize=12)
    plt.ylabel("Job Title", fontsize=12)
    plt.grid(axis='x', linestyle='--', alpha=0.7)

    # Show the plot
    plt.show()

else:
    print("The 'Job Title' column is not found in the dataset.")

##### 1. Why did you pick the specific chart?

Clear Job Role Comparison: A horizontal bar chart helps compare the demand for different job titles effectively. Unlike a pie chart, it avoids overlapping labels and makes longer job titles easy to read.

Better Representation of Categorical Data: A bar chart visualizes frequency efficiently, making it ideal for job title distribution.

Business & Hiring Relevance: It helps job seekers identify trending job roles and allows businesses to analyze hiring trends.

##### 2. What is/are the insight(s) found from the chart?

Top Hiring Job Titles: The chart reveals which job roles have the most openings, helping job seekers target high-demand positions.

Industry Hiring Trends: If software-related roles dominate, it signals high demand for tech talent. If management roles are high, it may indicate growth in leadership hiring.

Career Planning Insights: Professionals can use this data to align their skill development with market demand.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes! These insights help in:

Job Seekers: Focusing on high-demand roles to increase hiring chances.
Businesses: Adjusting recruitment strategies based on in-demand roles.
Hiring Teams: Understanding which positions need more focus in hiring campaigns.

Insights that could lead to negative growth:
Over-Saturation in Some Roles: If too many applicants focus on top jobs, competition increases, making it harder to get hired.
Skills Mismatch: If trending roles require specialized skills, companies may struggle to find qualified candidates.
Industry Job Stability: If demand for a role fluctuates heavily, it may indicate instability in that sector.

Justification: Why These Insights Matter?
For Job Seekers: Helps choose career paths based on industry demand.
For Companies: Assists in forecasting hiring needs and staying competitive in recruitment.
For Economic Growth: A well-balanced workforce ensures sustainable job opportunities.
Thus, this chart provides valuable hiring insights for both professionals and businesses, ensuring better job-market alignment.

#### Chart - 8

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Check if required columns exist
if 'Job Title' in data.columns and 'Salary Estimate' in data.columns:
    # Remove rows where the salary estimate is "-1" (indicating missing data)
    salary_data = data[data['Salary Estimate'] != '-1'].copy()

    # Extract numeric salary ranges using a regex that captures two groups (min and max)
    # This regex assumes salary estimates include a range in the format "min-max"
    extracted = salary_data['Salary Estimate'].astype(str).str.extract(r'(?P<MinSalary>\d+)[^\d]+(?P<MaxSalary>\d+)', expand=True)

    # Convert the extracted groups to numeric (float), coercing errors into NaN
    salary_data['Min Salary'] = pd.to_numeric(extracted['MinSalary'], errors='coerce')
    salary_data['Max Salary'] = pd.to_numeric(extracted['MaxSalary'], errors='coerce')

    # Drop any rows where conversion failed (i.e. where we didn't get valid numbers)
    salary_data.dropna(subset=['Min Salary', 'Max Salary'], inplace=True)

    # Calculate the average salary
    salary_data['Avg Salary'] = (salary_data['Min Salary'] + salary_data['Max Salary']) / 2

    # Select top job roles by frequency (choose top 7 for clarity)
    top_jobs = salary_data['Job Title'].value_counts().nlargest(7).index
    salary_data = salary_data[salary_data['Job Title'].isin(top_jobs)]

    # Create the box plot to visualize salary distribution across these top job roles
    plt.figure(figsize=(12, 6))
    sns.boxplot(x='Avg Salary', y='Job Title', data=salary_data, palette="coolwarm")
    plt.title("Salary Distribution for Top Job Roles", fontsize=14, fontweight='bold')
    plt.xlabel("Average Salary ($)", fontsize=12)
    plt.ylabel("Job Title", fontsize=12)
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.show()
else:
    print("The required columns ('Job Title' and 'Salary Estimate') are not found in the dataset.")
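The range-extraction regex used above can be sanity-checked on a few sample strings before it is applied to the whole column. The strings below are assumptions about what `Salary Estimate` values look like (the actual format in `glassdoor_jobs.csv` should be confirmed with `data['Salary Estimate'].head()`).

```python
import pandas as pd

# Hypothetical salary strings; the real 'Salary Estimate' format is an assumption.
samples = pd.Series([
    "$53K-$91K (Glassdoor est.)",
    "$80K-$120K (Employer est.)",
    "-1",  # placeholder for a missing estimate
])

# Same two-group pattern as in the cell above: digits, non-digits, digits.
pattern = r'(?P<MinSalary>\d+)[^\d]+(?P<MaxSalary>\d+)'
extracted = samples.str.extract(pattern, expand=True)

# Convert both captured groups to numbers; unmatched rows become NaN.
parsed = extracted.apply(pd.to_numeric, errors="coerce")
print(parsed)
```

Under these assumed formats the extracted numbers are in thousands of dollars, and a value such as "-1" yields NaN in both columns because there is no second number to match.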

##### 1. Why did you pick the specific chart?

Comprehensive Distribution View:
A box plot shows the complete distribution of salary data (median, quartiles, and potential outliers) rather than just a single average value. This is crucial for understanding the range and spread of salaries across different job roles.

Identification of Outliers:
It easily highlights extreme salary values that may indicate special cases (like executive roles or data errors), which are important for both job seekers and employers.

Comparative Analysis:
By plotting salary distributions for top job roles side by side, it allows for a direct comparison of how salaries vary across different positions. This is more insightful than simple bar charts when dealing with continuous data like salaries.

Effective Storytelling:
The visualization conveys not only central tendencies but also variability and potential issues (e.g., wide pay gaps), helping stakeholders make informed decisions about compensation strategies.

##### 2. What is/are the insight(s) found from the chart?

Median Salary & Variability:
The chart shows the median salary for each job role along with the interquartile range (IQR). This reveals the typical salary level and how much variation exists within that role.

Detection of Outliers:
Outliers are clearly visible. A role with significant outliers might indicate either exceptionally high-paying positions or inconsistencies in salary reporting.

Comparative Salary Benchmarking:
By comparing box plots across roles, you can identify which job titles generally command higher salaries and which roles have a broader or narrower salary range. For example, if one role shows a very tight range, it might reflect a standardized pay scale, while a wider range in another role could suggest more variability in experience or performance levels.
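The comparison described above can be backed by numbers: a per-role table of median and IQR makes it explicit which roles pay more and which have the tighter range. This is a minimal sketch on made-up data; the column names follow the notebook, but the values are illustrative only.

```python
import pandas as pd

# Hypothetical salaries in thousands of dollars (illustrative only).
salary_data = pd.DataFrame({
    "Job Title": ["Data Scientist"] * 4 + ["Data Analyst"] * 4,
    "Avg Salary": [100, 110, 120, 150, 60, 62, 64, 66],
})

# Quartiles per job title, reshaped into one row per role.
summary = (
    salary_data.groupby("Job Title")["Avg Salary"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ["Q1", "Median", "Q3"]
summary["IQR"] = summary["Q3"] - summary["Q1"]

# Highest-paying roles first.
summary = summary.sort_values("Median", ascending=False)
print(summary)
```

In this toy example the analyst role has a much smaller IQR than the scientist role, which is the "tight range" pattern the text describes.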


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Informed Compensation Strategies:
Companies can benchmark their salary structures against market data. A clear understanding of the median and spread of salaries helps in setting competitive pay scales to attract and retain talent.

Better Career Decision-Making:
Job seekers benefit from realistic salary expectations and can target roles that match their experience level, thereby reducing turnover and increasing employee satisfaction.

Workforce Planning:
Clear insights into salary distributions support strategic planning in HR, ensuring that resources are allocated efficiently and that compensation packages are competitive.

Potential Negative Growth Insights:

Widening Salary Gaps:
If the chart reveals significant outliers or a very wide salary range for a particular role, it might indicate pay inequality. Such disparities can lead to internal dissatisfaction and high turnover, ultimately impacting organizational stability.

Underinvestment in Key Areas:
Consistently low salary ranges in roles that are critical for business growth might signal underinvestment. This could lead to talent drain, where skilled professionals leave for better-paying opportunities elsewhere, stunting innovation and growth.

Market Instability:
Extreme variability in salaries might also indicate an unstandardized market, where inconsistent pay practices can make it hard for companies to forecast expenses accurately, potentially leading to budgetary challenges and negative growth.

Justification:
Balanced and fair compensation is key to maintaining a healthy workforce. While competitive salaries attract top talent and drive productivity, significant disparities or underinvestment in compensation can lead to dissatisfaction, talent loss, and ultimately a negative impact on business growth. The insights from the box plot help identify these issues, allowing companies to adjust their strategies accordingly.

#### Chart - 9

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Check if required columns exist: 'Salary Estimate' and 'Founded'
if 'Salary Estimate' in data.columns and 'Founded' in data.columns:
    # Remove rows where salary estimate is "-1" (indicating missing data)
    salary_data = data[data['Salary Estimate'] != '-1'].copy()

    # Extract numeric salary ranges using regex that captures two groups (min and max)
    # This regex assumes the salary estimate is in a format that contains a range like "50-100"
    extracted = salary_data['Salary Estimate'].astype(str).str.extract(r'(?P<MinSalary>\d+)[^\d]+(?P<MaxSalary>\d+)', expand=True)

    # Convert the extracted salary values to numeric (float)
    salary_data['Min Salary'] = pd.to_numeric(extracted['MinSalary'], errors='coerce')
    salary_data['Max Salary'] = pd.to_numeric(extracted['MaxSalary'], errors='coerce')

    # Drop rows where conversion failed
    salary_data.dropna(subset=['Min Salary', 'Max Salary'], inplace=True)

    # Calculate average salary
    salary_data['Avg Salary'] = (salary_data['Min Salary'] + salary_data['Max Salary']) / 2

    # Compute company age, ignoring invalid 'Founded' values (-1)
    current_year = datetime.datetime.now().year
    salary_data = salary_data[salary_data['Founded'] != -1].copy()
    salary_data['company_age'] = current_year - salary_data['Founded']

    # Drop rows with missing company_age
    salary_data.dropna(subset=['company_age'], inplace=True)

    # Create a scatter plot of Company Age vs. Average Salary
    plt.figure(figsize=(10, 6))
    sns.scatterplot(data=salary_data, x='company_age', y='Avg Salary', alpha=0.6, color='teal')

    plt.title("Scatter Plot: Company Age vs. Average Salary", fontsize=14, fontweight='bold')
    plt.xlabel("Company Age (Years)", fontsize=12)
    plt.ylabel("Average Salary ($)", fontsize=12)
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.show()

else:
    print("The required columns ('Salary Estimate' and 'Founded') are not found in the dataset.")

##### 1. Why did you pick the specific chart?

I selected a scatter plot for Chart 9 because it effectively visualizes the relationship between two continuous variables—company age and average salary. This chart type allows us to:

Reveal Trends and Correlations: Each data point represents a job listing, positioned by the age of the hiring company, making it possible to observe whether older or younger companies tend to offer higher salaries.

Identify Patterns and Outliers: The scatter plot highlights any clustering or outliers that may indicate companies with atypical compensation practices.

Provide Clear Insights: It compares the two variables without aggregating the data or losing nuance, supporting data-driven decision-making for both job seekers and employers.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot offers several key insights:

Correlation Between Company Age and Salary:
It may reveal whether older, more established companies tend to pay higher (or lower) salaries compared to newer companies. For example, a positive correlation might suggest that experience and stability contribute to better compensation.

Dispersion of Salary Offers:
The spread of data points indicates how varied the salary offerings are within companies of different ages. A wide dispersion may signal inconsistent salary structures across companies.

Identification of Outliers:
Outliers in the plot can point to companies that deviate significantly from the norm—either offering exceptionally high salaries (potentially for high-demand roles) or very low salaries, which may warrant further investigation.
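A scatter plot shows the pattern visually, and a correlation coefficient puts a number on it. The sketch below computes Pearson (linear) and Spearman (rank-based) correlation on made-up values; Spearman is obtained here as the Pearson correlation of the ranks, which is its definition and avoids extra dependencies.

```python
import pandas as pd

# Hypothetical company ages (years) and average salaries in thousands
# of dollars (illustrative only).
df = pd.DataFrame({
    "company_age": [5, 10, 20, 40, 80],
    "Avg Salary": [70, 75, 80, 85, 90],
})

# Pearson: strength of the linear relationship.
pearson = df["company_age"].corr(df["Avg Salary"])

# Spearman: Pearson correlation computed on ranks, so it captures any
# monotonic relationship, not just a linear one.
spearman = df["company_age"].rank().corr(df["Avg Salary"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Values near zero on the real data would match the project summary's observation that no significant correlation was visible, while a clearly positive or negative value would justify a closer look.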

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Informed Recruitment Strategies:
The insights help both job seekers and employers understand market trends. For example, if older companies generally offer higher salaries, job seekers may target these firms for stability and better pay.

Salary Benchmarking:
Employers can use the data to benchmark their compensation strategies against industry standards, ensuring competitive salary offers that attract and retain talent.

Workforce Planning:
The visualization supports strategic HR decisions by highlighting which companies (or sectors) invest more in talent through higher compensation, thereby guiding future hiring and budgeting.

Potential Negative Growth Insights:

Inequitable Compensation Practices:
If the scatter plot reveals that certain companies (regardless of age) are offering significantly lower salaries, it may signal underlying issues like underinvestment in human capital or misaligned compensation structures. This can lead to higher employee turnover and reduced job satisfaction.

Market Instability:
A lack of a clear trend or the presence of extreme outliers might indicate inconsistency in salary offerings across the industry. Such instability can discourage potential talent and hinder sustainable growth.

Resource Misallocation:
If newer companies, which are expected to offer competitive salaries to attract talent, consistently offer lower pay, it may result in difficulties in attracting skilled professionals, thereby impeding growth and innovation.

Justification:
Balanced and competitive salary structures are critical for business growth. The scatter plot’s insights enable organizations to adjust their strategies—either by addressing salary disparities or capitalizing on trends—thus creating a positive impact. Conversely, if significant negative trends or inequities are identified, they highlight areas that require intervention to prevent adverse business outcomes.

#### Chart - 10

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Check if required columns exist for salary and company age
if 'Salary Estimate' in data.columns and 'Founded' in data.columns:
    # Remove rows where salary estimate is "-1" (indicating missing data)
    salary_data = data[data['Salary Estimate'] != '-1'].copy()

    # Extract numeric salary range using regex that captures two groups (min and max)
    # This regex assumes the salary estimate contains a range such as "50-100"
    extracted = salary_data['Salary Estimate'].astype(str).str.extract(r'(?P<MinSalary>\d+)[^\d]+(?P<MaxSalary>\d+)', expand=True)

    # Convert extracted values to numeric
    salary_data['Min Salary'] = pd.to_numeric(extracted['MinSalary'], errors='coerce')
    salary_data['Max Salary'] = pd.to_numeric(extracted['MaxSalary'], errors='coerce')

    # Drop rows where conversion failed
    salary_data.dropna(subset=['Min Salary', 'Max Salary'], inplace=True)

    # Calculate average salary
    salary_data['Avg Salary'] = (salary_data['Min Salary'] + salary_data['Max Salary']) / 2

    # Compute company age from the 'Founded' column (ignoring invalid values such as -1)
    current_year = datetime.datetime.now().year
    salary_data = salary_data[salary_data['Founded'] != -1].copy()
    salary_data['company_age'] = current_year - salary_data['Founded']
    salary_data.dropna(subset=['company_age'], inplace=True)

    # Create a subset with the key variables for the pair plot
    pair_plot_data = salary_data[['Min Salary', 'Max Salary', 'Avg Salary', 'company_age']].copy()

    # Create a pair plot using seaborn
    sns.pairplot(pair_plot_data, diag_kind='kde')
    plt.suptitle("Pair Plot of Key Variables: Salary and Company Age", fontsize=16, fontweight='bold', y=1.02)
    plt.show()

else:
    print("Required columns ('Salary Estimate' and 'Founded') not found in the dataset.")

##### 1. Why did you pick the specific chart?

Exploring Multiple Relationships:
A pair plot allows us to examine pairwise relationships among several numerical variables simultaneously. In this case, it lets us see how the minimum, maximum, and average salaries relate to each other as well as to company age.

Detailed Distribution Analysis:
The diagonal density plots provide insights into the individual distributions of each variable, while the scatter plots reveal correlations and potential non-linear trends.

Identification of Patterns and Outliers:
The visualization makes it easy to spot clusters, trends, and outliers across variables, which can be crucial for understanding salary practices relative to company age.


##### 2. What is/are the insight(s) found from the chart?

Strong Correlation Among Salary Metrics:
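The pair plot typically reveals that the minimum, maximum, and average salaries are highly correlated. This is expected: the average is computed directly from the minimum and maximum, so the strong correlation is a sanity check on the extraction rather than an independent finding.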

Relationship Between Company Age and Salary:
The scatter plots between company age and salary metrics can show whether older, more established companies tend to offer higher salaries compared to newer ones, or if there is no clear trend.

Detection of Variability and Outliers:
The plot can uncover if certain companies (based on age) have unusually high or low salary offers. Such outliers might indicate special compensation policies or data inconsistencies that merit further investigation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Informed Compensation Strategies:
By understanding how salary metrics vary with company age, businesses can benchmark their salary offers to remain competitive and attract top talent.

Career Decision-Making:
Job seekers can use these insights to target companies with favorable compensation practices, leading to better career planning and satisfaction.

Strategic Workforce Planning:
Organizations can use this data to align HR strategies with industry trends, ensuring that salary structures are both fair and competitive.

Potential Negative Growth Insights:

Salary Disparities:
If the pair plot reveals significant discrepancies in salary offers among companies of similar age, it could indicate inconsistent or inequitable compensation practices. This may lead to higher employee turnover or dissatisfaction.

Market Instability:
A lack of a clear relationship between company age and salary might suggest market instability, where compensation practices are erratic. Such unpredictability can undermine long-term workforce planning and investor confidence.

Underinvestment in New Companies:
If newer companies consistently offer lower salaries compared to older ones, it may signal underinvestment in talent. This could hinder their ability to attract skilled professionals, negatively affecting growth and innovation.

Justification:
Balanced compensation is key to sustainable business growth. While insights from the pair plot can guide competitive salary benchmarking and strategic HR planning, identifying inconsistencies or extreme disparities early on allows companies to make proactive adjustments. Addressing these issues is essential to avoid negative impacts like employee turnover, market instability, or talent shortages.

#### Chart - 11

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Check if required columns exist: 'Size' and 'Revenue'
if 'Size' in data.columns and 'Revenue' in data.columns:
    # Create a crosstab of company Size vs. Revenue distribution
    size_revenue_ct = pd.crosstab(data['Size'], data['Revenue'])

    # Create a heatmap
    plt.figure(figsize=(12, 8))
    sns.heatmap(size_revenue_ct, annot=True, fmt="d", cmap="YlOrRd")

    # Chart formatting
    plt.title("Company Size vs. Revenue Distribution", fontsize=16, fontweight='bold')
    plt.xlabel("Revenue", fontsize=12)
    plt.ylabel("Company Size", fontsize=12)
    plt.tight_layout()
    plt.show()
else:
    print("Required columns 'Size' and/or 'Revenue' not found in the dataset.")

##### 1. Why did you pick the specific chart?

Effective for Categorical Data:
A heatmap is ideal for visualizing the relationship between two categorical variables—in this case, company size and revenue. It succinctly displays frequencies for each combination of categories.

Intuitive Visual Communication:
Using color intensity to represent counts, the heatmap quickly highlights where the majority of companies fall, making it easier to spot dominant segments and outliers.

Facilitates Comparative Analysis:
This chart allows stakeholders to compare different combinations at a glance, helping to understand how company size might correlate with revenue levels.


##### 2. What is/are the insight(s) found from the chart?

Dominant Market Segments:
The heatmap reveals which combinations of company size and revenue occur most frequently. For example, it might show that many mid-sized companies fall into a certain revenue range, indicating a common market segment.

Correlation Patterns:
You can observe trends such as larger companies tending to report higher revenues, while smaller companies cluster in lower revenue categories. This helps in understanding how business scale relates to financial performance.

Identification of Outliers:
Unusual or sparse cells may indicate niche market segments or companies that don’t follow the general trend, prompting further investigation into their business models.
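Two adjustments can make a size-versus-revenue crosstab easier to read. `pd.crosstab` sorts labels alphabetically, so ordinal categories appear out of order unless an explicit order is supplied, and raw counts are dominated by whichever size group has the most listings. The sketch below reorders the table and converts each row to proportions; the category labels and their order are assumptions for illustration and should be replaced with the values actually present in the `Size` and `Revenue` columns.

```python
import pandas as pd

# Hypothetical rows; labels are assumed, not copied from the dataset.
df = pd.DataFrame({
    "Size": [
        "1 to 50 employees", "10000+ employees", "1 to 50 employees",
        "10000+ employees", "10000+ employees",
    ],
    "Revenue": [
        "Less than $1 million (USD)", "$10+ billion (USD)",
        "Less than $1 million (USD)", "$10+ billion (USD)",
        "Less than $1 million (USD)",
    ],
})

# Explicit smallest-to-largest order for both axes (assumed labels).
size_order = ["1 to 50 employees", "10000+ employees"]
revenue_order = ["Less than $1 million (USD)", "$10+ billion (USD)"]

# normalize="index" turns each row into proportions that sum to 1,
# so size groups with different listing counts become comparable.
ct = pd.crosstab(df["Size"], df["Revenue"], normalize="index")
ct = ct.reindex(index=size_order, columns=revenue_order)
print(ct)
```

The resulting table can be passed to `sns.heatmap` with a float format such as `fmt=".2f"` in place of `fmt="d"`.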

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Informed Strategic Planning:
By understanding which company sizes are associated with higher revenue ranges, businesses and investors can better target growth opportunities and make informed decisions about resource allocation.

Market Positioning and Benchmarking:
Companies can benchmark their performance relative to peers in the same size and revenue segment, helping them adjust strategies to remain competitive.

Optimized Recruitment and Investment Decisions:
The insights help identify robust market segments where companies are thriving, guiding recruitment strategies and investment focus.

Potential Negative Growth Insights:

Market Saturation or Underperformance:
If the heatmap shows that a large number of companies within a particular size category are concentrated in lower revenue brackets, this might indicate market saturation or operational inefficiencies that could lead to negative growth.

Inequitable Growth Patterns:
A weak or inconsistent correlation between size and revenue may signal instability or disparities in compensation practices, which can adversely affect employee retention and overall business sustainability.

Justification:
Balanced and competitive market segments are essential for sustainable growth. While positive insights can drive strategic improvements, identifying segments where companies underperform or face high competition enables proactive corrective actions. This ensures that businesses not only capitalize on successful segments but also address potential weaknesses that could hinder long-term growth.
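
Statements about how strongly company size tracks revenue can be checked with a rank correlation. The sketch below computes Spearman's rho by ranking both variables and taking the Pearson correlation of the ranks. The orderings in `SIZE_ORDER` and `REVENUE_ORDER` are assumptions for illustration; they must be replaced with the actual category labels in the dataset, and the toy rows are not real results.

```python
import pandas as pd

# Assumed orderings, smallest to largest; replace with the labels in your data.
SIZE_ORDER = ["1 to 50", "51 to 200", "201 to 500", "501 to 1000"]
REVENUE_ORDER = ["Less than $1 million", "$1 to $5 million",
                 "$5 to $10 million", "$10 to $25 million"]

def ordinal_spearman(df, size_col="Size", revenue_col="Revenue"):
    """Spearman's rho between two ordered categorical columns."""
    size_rank = df[size_col].map({label: i for i, label in enumerate(SIZE_ORDER)})
    rev_rank = df[revenue_col].map({label: i for i, label in enumerate(REVENUE_ORDER)})
    # Labels outside the assumed orderings (e.g. "Unknown") map to NaN and are skipped
    valid = size_rank.notna() & rev_rank.notna()
    # Spearman's rho is the Pearson correlation of the ranks
    return size_rank[valid].rank().corr(rev_rank[valid].rank())

# Toy rows, illustrative only
toy = pd.DataFrame({
    "Size": ["1 to 50", "51 to 200", "201 to 500", "501 to 1000", "Unknown"],
    "Revenue": ["Less than $1 million", "$5 to $10 million", "$1 to $5 million",
                "$10 to $25 million", "Unknown / Non-Applicable"],
})
rho = ordinal_spearman(toy)
print(f"Spearman rho between size and revenue: {rho:.2f}")
```

A value near 1 would support the "larger companies report higher revenue" reading, while a value near 0 would suggest the heatmap pattern is weak.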


#### Chart - 12

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Print available columns to help debug missing column issues
print("Available columns in dataset:", data.columns.tolist())

# Define expected skills columns (adjust these names if your dataset uses different names)
expected_skills = ['python', 'r', 'sql', 'spark', 'tensorflow', 'tableau', 'sas']

# Check that the 'Size' column and every expected skills column are present
if 'Size' in data.columns and all(skill in data.columns for skill in expected_skills):
    # Group data by 'Size' and calculate the average (mean) of each skill indicator
    skills_by_size = data.groupby('Size')[expected_skills].mean()

    # Create a heatmap to visualize the average demand for each skill by company size
    plt.figure(figsize=(12, 6))
    sns.heatmap(skills_by_size, annot=True, cmap='YlGnBu', fmt=".2f")

    plt.title("Skills Demand by Company Size", fontsize=16, fontweight='bold')
    plt.xlabel("Skills", fontsize=12)
    plt.ylabel("Company Size", fontsize=12)
    plt.tight_layout()
    plt.show()
else:
    print("Required columns not found: 'Size' and/or one or more skills columns are missing.")
    print("Available columns are:", data.columns.tolist())

##### 1. Why did you pick the specific chart?



Effective for Multidimensional Categorical Data:
A heatmap is ideal when you need to compare multiple groups (here, company sizes) across several variables (technical skills). The color gradient provides an intuitive representation of average values.

Immediate Visual Impact:
It allows stakeholders to quickly identify where the demand for certain skills is higher or lower, making it easier to draw conclusions about trends across different company sizes.

Facilitates Targeted Analysis:
By visualizing the mean value of each skill indicator by company size, decision-makers can readily assess which skill sets are prioritized in various segments of the market.




##### 2. What is/are the insight(s) found from the chart?

Skill Variation Across Company Sizes:
The heatmap reveals differences in technical skill demand among companies of various sizes. For example, larger companies might show higher averages for advanced skills (e.g., TensorFlow or Spark) compared to smaller companies.

Identification of Trends and Gaps:
The visualization can uncover trends such as whether smaller companies rely more on foundational skills (e.g., Python or SQL) or if certain skills are consistently underutilized across all company sizes.

Benchmarking for Training & Recruitment:
Insights into which skills are emphasized in each size category can help HR and training departments tailor their development programs and recruitment strategies accordingly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Optimized Recruitment & Training:
The insights help companies understand which technical skills are in high demand in their segment, allowing them to tailor recruitment efforts and invest in targeted employee training. This alignment improves talent acquisition and retention.

Informed Strategic Decisions:
By benchmarking against industry standards revealed by the heatmap, companies can adjust their workforce planning and resource allocation, ensuring that they remain competitive and innovative.

Potential Negative Growth Insights:

Skill Gaps and Underinvestment:
If the heatmap indicates that critical skills (e.g., advanced analytics or machine learning frameworks) are underrepresented in a particular company size category, it might signal a talent gap. This can lead to reduced competitiveness or slower innovation, ultimately impacting growth negatively.

Over-Specialization Risks:
Conversely, if companies within a certain size focus too narrowly on a limited set of skills, they may become less adaptable to market changes. This over-specialization could limit their ability to diversify and grow, posing a risk to long-term sustainability.

Justification:

Balanced and targeted skill development is essential for sustained business growth. While aligning recruitment and training with industry trends fosters a competitive edge, significant skill gaps or over-specialization may lead to operational challenges. Recognizing these patterns early via the heatmap enables proactive adjustments, ensuring that companies capitalize on their strengths while mitigating potential risks that could hinder growth.

If the required columns are missing from your file, review your data wrangling steps to ensure that the 'Size' and skills indicator columns are correctly created and named.
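
The Chart 12 code expects one 0/1 indicator column per skill, which the raw CSV may not contain. One possible way to create them is a keyword search over the job description text. This is a sketch only: the column name 'Job Description', the helper `add_skill_indicators`, and the regex patterns are assumptions, and the two toy rows are invented for illustration.

```python
import pandas as pd

# Hypothetical keyword patterns; tune them to the wording in your job descriptions.
SKILL_PATTERNS = {
    "python": r"(?i)\bpython\b",
    "r": r"\bR\b",  # case-sensitive on purpose; can still match strings such as "R&D"
    "sql": r"(?i)\bsql\b",
    "spark": r"(?i)\bspark\b",
    "tensorflow": r"(?i)\btensorflow\b",
    "tableau": r"(?i)\btableau\b",
    "sas": r"(?i)\bsas\b",
}

def add_skill_indicators(df, text_col="Job Description"):
    """Return a copy of df with one 0/1 column per skill in SKILL_PATTERNS."""
    out = df.copy()
    text = out[text_col].fillna("")
    for skill, pattern in SKILL_PATTERNS.items():
        out[skill] = text.str.contains(pattern, regex=True).astype(int)
    return out

# Toy rows, illustrative only
toy = pd.DataFrame({
    "Size": ["1 to 50 employees", "10000+ employees"],
    "Job Description": ["Python and SQL required.",
                        "Experience with R, Spark and TensorFlow."],
})
toy = add_skill_indicators(toy)
print(toy.groupby("Size")[list(SKILL_PATTERNS)].mean())
```

Because each indicator is 0 or 1, the group mean shown in the heatmap is the share of postings in that size category that mention the skill.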

#### Chart - 13

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Check if 'state' column exists; if not, attempt to extract it from 'Location'
if 'state' not in data.columns:
    if 'Location' not in data.columns:
        # Raise instead of calling exit(), which would stop the notebook kernel
        raise KeyError("Neither 'state' nor 'Location' column found in the dataset.")
    # Extract the two-letter state code (assuming format "City, ST")
    data['state'] = data['Location'].str.extract(r',\s*([A-Z]{2})\b', expand=False)
    if data['state'].isna().all():
        print("Unable to extract 'state' from 'Location'.")

# Count job listings per state
state_counts = data['state'].value_counts()

# Create a bar chart for job distribution by state
plt.figure(figsize=(12, 6))
sns.barplot(x=state_counts.index, y=state_counts.values, palette="viridis")

# Chart formatting
plt.title("Job Distribution by State", fontsize=16, fontweight="bold")
plt.xlabel("State", fontsize=12)
plt.ylabel("Number of Job Listings", fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Clear Comparison for Categorical Data:
A bar chart is excellent for comparing discrete categories. In this case, it visually presents the number of job listings per state, making it simple to compare regions at a glance.

Immediate Visual Impact:
The use of color (via a professional palette like "viridis") and clear axis labels makes it easy for stakeholders to quickly identify which states are job hubs.

Simplicity and Clarity:
Bar charts avoid clutter and are well-suited for datasets with a moderate number of categories, ensuring that the geographic distribution of jobs is communicated effectively.

##### 2. What is/are the insight(s) found from the chart?

Regional Job Market Hotspots:
The chart reveals which states have the highest number of job listings. For instance, if states such as CA, NY, or TX appear at the top, it suggests these areas are major employment hubs.

Geographic Trends:
The visualization helps identify regional disparities. Areas with very few job listings might indicate either emerging markets or regions with lower economic activity.

Guidance for Job Seekers and Recruiters:
Job seekers can target regions with high job availability, while recruiters can assess which geographic areas might need more competitive recruitment strategies.
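
The hotspot and disparity observations above can be summarized with a single concentration figure: the share of all listings held by the top few states. The sketch below is a minimal example; `top_k_share` is a hypothetical helper and the state codes are toy values, not results from the dataset.

```python
import pandas as pd

def top_k_share(states, k=3):
    """Fraction of all listings that fall in the k most common states."""
    counts = states.dropna().value_counts()
    return counts.head(k).sum() / counts.sum()

# Toy state codes, illustrative only
toy_states = pd.Series(["CA", "CA", "CA", "NY", "NY", "TX", "WA", "IL", None, "CA"])
share = top_k_share(toy_states, k=2)
print(f"Top-2 states hold {share:.0%} of listings")
```

On the real data, the same call on `data['state']` would show how much of the market the leading states account for.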

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Targeted Regional Strategies:
Understanding which states have a high concentration of job listings helps both businesses and job seekers to focus their efforts. Companies can tailor their recruitment strategies to high-demand areas, and job seekers can concentrate on regions with better opportunities.

Informed Investment Decisions:
Investors and policymakers can use the data to identify economic hotspots, guiding resource allocation and regional development initiatives that stimulate job growth.

Potential Negative Growth Insights:

Regional Imbalance:
If the chart shows that job listings are highly concentrated in only a few states, it may indicate a lack of economic diversity. This over-concentration can lead to over-saturation in those regions and underinvestment in others, which might hamper overall economic growth.

Economic Stagnation in Underrepresented Areas:
States with significantly fewer job listings may struggle to attract new businesses and talent, potentially leading to negative growth or increased unemployment in those regions.

Justification:

Balanced geographic distribution of job opportunities is crucial for sustained economic growth. While concentrated job markets can drive competitiveness and innovation in certain areas, significant regional disparities may lead to resource misallocation and social imbalance. Proactive measures based on these insights can help mitigate risks, ensuring that both over-saturated and underrepresented regions receive appropriate attention for long-term positive impact.


#### Chart - 14 - Correlation Heatmap

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Select all numeric columns for correlation analysis (any int or float width)
num_cols = data.select_dtypes(include='number').columns

# Compute the correlation matrix
corr_matrix = data[num_cols].corr()

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, fmt=".2f")
plt.title("Correlation Heatmap of Numerical Variables", fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Comprehensive Overview:
A correlation heatmap provides a clear visual representation of the linear relationships among all numerical features at once, making it easier to identify which variables are closely related.

Feature Redundancy Identification:
This chart helps in spotting high correlations that may indicate redundancy, which is essential for feature selection and avoiding multicollinearity in predictive models.

Effective Data Communication:
The diverging color palette (using "coolwarm") quickly distinguishes between positive and negative correlations, allowing stakeholders to immediately grasp the underlying data structure and relationships.


##### 2. What is/are the insight(s) found from the chart?

Strong Inter-relationships:
The heatmap may reveal that certain features, such as different salary components (e.g., minimum, maximum, and average salary), are highly correlated, confirming the consistency of data processing.

Potential Predictors:
It can highlight which variables might be good predictors for others (e.g., if company age has a moderate correlation with salary, it could be useful in a regression model).

Multicollinearity Detection:
High correlations among features alert analysts to potential multicollinearity issues, indicating that some variables might be redundant or could distort model performance if included together.
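
To turn the multicollinearity check into a concrete list, the correlation matrix can be scanned for pairs above a chosen cutoff. This is a minimal sketch: the helper `high_corr_pairs`, the 0.8 threshold, and the toy values are all assumptions for illustration, not figures from the dataset.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """List (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 3)))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Toy numeric columns, illustrative only
toy = pd.DataFrame({
    "Min Salary": [50, 60, 70, 80, 90],
    "Max Salary": [80, 95, 110, 120, 140],
    "Rating": [3.5, 4.1, 3.0, 4.4, 3.2],
})
toy["Avg Salary"] = (toy["Min Salary"] + toy["Max Salary"]) / 2

for a, b, r in high_corr_pairs(toy):
    print(f"{a} vs {b}: r = {r}")
```

Pairs reported here are candidates for dropping or combining before fitting a model; the threshold is a judgment call rather than a fixed rule.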

#### Chart - 15 - Pair Plot

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

# Load dataset
data = pd.read_csv("glassdoor_jobs.csv")

# Ensure the required columns exist: 'Salary Estimate' and 'Founded'
if 'Salary Estimate' in data.columns and 'Founded' in data.columns:
    # Remove rows with missing salary estimates ("-1")
    salary_data = data[data['Salary Estimate'] != '-1'].copy()

    # Extract numeric salary ranges (assuming format like "50-100")
    extracted = salary_data['Salary Estimate'].astype(str).str.extract(r'(?P<MinSalary>\d+)[^\d]+(?P<MaxSalary>\d+)', expand=True)

    # Convert extracted values to numeric (float)
    salary_data['Min Salary'] = pd.to_numeric(extracted['MinSalary'], errors='coerce')
    salary_data['Max Salary'] = pd.to_numeric(extracted['MaxSalary'], errors='coerce')

    # Drop rows where conversion failed
    salary_data.dropna(subset=['Min Salary', 'Max Salary'], inplace=True)

    # Calculate average salary
    salary_data['Avg Salary'] = (salary_data['Min Salary'] + salary_data['Max Salary']) / 2

    # Exclude invalid 'Founded' values (assuming -1 indicates missing)
    salary_data = salary_data[salary_data['Founded'] != -1]

    # Compute company age based on current year (vectorized subtraction)
    current_year = datetime.datetime.now().year
    salary_data['Company Age'] = current_year - salary_data['Founded']

    # Prepare a subset with key variables for the pair plot
    pair_data = salary_data[['Min Salary', 'Max Salary', 'Avg Salary', 'Company Age']].copy()

    # Create the pair plot using seaborn with KDE on the diagonal
    sns.pairplot(pair_data, diag_kind='kde')
    plt.suptitle("Pair Plot: Salary Metrics and Company Age", fontsize=16, fontweight='bold', y=1.02)
    plt.show()
else:
    print("Required columns ('Salary Estimate' and 'Founded') are missing in the dataset.")

##### 1. Why did you pick the specific chart?

Exploratory Analysis:
The pair plot is ideal for visualizing pairwise relationships between multiple continuous variables at once. It allows you to observe how the different salary metrics and company age interact with each other.

Distribution & Correlation in One View:
With scatter plots for each variable pair and kernel density estimates (KDE) on the diagonal, the pair plot provides insights into the data distributions and potential correlations simultaneously.

Identifying Patterns and Outliers:
This chart helps detect clusters, trends, and outliers in the data, which can inform further analysis or guide feature selection for predictive models.


##### 2. What is/are the insight(s) found from the chart?

Strong Inter-correlation Among Salary Metrics:
The pair plot typically reveals that minimum, maximum, and average salaries are strongly correlated, confirming that the salary extraction and averaging processes are consistent.

Relationship Between Company Age and Salary:
The scatter plots involving "Company Age" may show whether older, more established companies offer higher or lower average salaries compared to younger companies.

Detection of Outliers:
The visualization helps to quickly spot outliers—companies that offer unusually high or low salaries compared to their peers—which may warrant further investigation into their compensation practices.
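
Outliers spotted visually in the pair plot can be confirmed with a simple numeric rule. The sketch below applies the common 1.5 × IQR fence to an average salary series; the helper name `iqr_outliers` and the toy salary values are hypothetical, and other outlier definitions would be equally valid.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy average salaries, illustrative only
toy_avg_salary = pd.Series([72, 85, 90, 95, 101, 110, 118, 250], name="Avg Salary")
mask = iqr_outliers(toy_avg_salary)
print(toy_avg_salary[mask])
```

Applied to `salary_data['Avg Salary']`, the mask can be used to pull the full rows for the flagged postings and inspect their job titles and companies.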

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

In brief, by aligning recruitment, compensation, and training strategies with data-backed insights, the client can optimize resource allocation, attract top talent, and drive overall business growth.

# **Conclusion**

In summary, our comprehensive exploratory data analysis of the Glassdoor jobs dataset has provided critical insights into various aspects of the job market. Our visualizations revealed:

Salary Insights:
Detailed salary distribution analyses (box plots, pair plots) uncovered the typical salary ranges, outliers, and correlations between salary metrics and company age. These insights suggest that competitive and well-benchmarked compensation is key to attracting and retaining top talent.

Job Demand & Role Trends:
Bar charts highlighted the most in-demand job roles and their distribution across states and industries. This information can help both job seekers target their applications and companies fine-tune their recruitment strategies.

Company & Market Dynamics:
Heatmaps comparing company size, revenue, and skills demand shed light on how business scale influences hiring practices and technical requirements. Understanding these dynamics is vital for strategic workforce planning and resource allocation.

Business Recommendations:

1. Data-Driven Recruitment: Focus on high-demand regions and job roles while tailoring recruitment efforts to meet market expectations.
2. Competitive Compensation Strategies: Benchmark salaries against market trends to ensure packages are attractive yet sustainable.
3. Targeted Upskilling and Training: Address skill gaps identified in the analysis to empower the workforce and drive innovation.
4. Strategic Resource Allocation: Use insights on company age and revenue to optimize growth strategies and safeguard against market instability.

By leveraging these data-backed insights, the client can align recruitment, compensation, and training strategies with current market trends, ultimately driving sustainable business growth and a competitive edge in the industry.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***