<a href="https://colab.research.google.com/github/Aryayayayaa/Glassdoor-Jobs/blob/main/Glassdoor_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name    -  Glassdoor: Predicting Salary Ranges, Uncover Market Trends and Provide Insights to Tech Professionals and Organizations**



##### **Project Type**    - Regression
##### **Contribution**    - Individual

# **Project Summary -**

In today's dynamic and fiercely competitive tech industry, understanding salary trends is more than just a matter of curiosity; it's a critical strategic imperative for a wide array of stakeholders. This project, which leverages a rich dataset of job postings from Glassdoor.com compiled in 2017, aims to illuminate these trends by predicting salaries for various tech job positions. The dataset's comprehensive nature, encompassing key features like job title, company size, experience level, and geographical location, allows for a granular analysis of the factors that significantly impact compensation structures.

For **job seekers**, this project offers an invaluable compass in navigating their career paths. By providing insights into expected salary ranges for different tech roles, professionals can make more informed decisions about which skills to acquire, which industries to target, and what compensation to realistically anticipate during negotiations. For instance, a software engineer considering a move to a data scientist role can understand the potential salary uplift or adjustment, allowing for better career planning and skill development. This data-driven approach empowers individuals to negotiate fair compensation, ensuring their skills and experience are appropriately valued in the market.

**Employers** stand to gain significantly by utilizing the insights derived from this project. In an environment where attracting and retaining top tech talent is paramount, offering competitive salaries is non-negotiable. This predictive model assists companies in benchmarking their compensation packages against industry standards, ensuring they remain attractive to highly sought-after professionals. Furthermore, by understanding the impact of company size and location on salary expectations, businesses can tailor their compensation strategies to their specific context, optimizing their recruitment and retention efforts without overspending or underpaying. This can also help in identifying potential areas where their current salary structure might be lagging, prompting adjustments to stay competitive.

**Analysts and researchers** will find this project a robust foundation for deeper exploration into labor market dynamics. The 2017 Glassdoor data provides a snapshot of a specific period in the tech industry's rapid growth, offering a historical perspective on salary trends. Researchers can leverage this data to identify patterns, evaluate the effectiveness of different compensation strategies, and even predict future shifts in the tech job market. For example, they could analyze how the demand for certain niche skills influences salary premiums in specific locations, or how the growth of remote work might have started to impact geographical salary disparities even back in 2017. The project enables the creation of data-driven insights that can inform broader economic studies and policy recommendations.

Finally, **recruiters** can harness the power of this salary prediction model to streamline their talent acquisition processes and ensure fair compensation practices. By having a clear understanding of market rates for specific tech roles, recruiters can efficiently pre-screen candidates based on salary expectations, reducing time-to-hire. More importantly, it aids in presenting competitive and equitable offers, minimizing the risk of candidates declining due to uncompetitive compensation. This project allows recruiters to benchmark salaries effectively, advocating for fair pay within their organizations and building stronger relationships with both candidates and hiring managers by demonstrating a data-backed understanding of market value. Ultimately, this project serves as a comprehensive tool to foster transparency and efficiency in the tech job market, benefiting all parties involved.

# **GitHub Link -**

https://github.com/Aryayayayaa/Labmentix/blob/3b673c5d2f3f18b90a70115af995ed2ceea330d2/Glassdoor_Code.ipynb

# **Problem Statement**


This project analyzes the provided dataset to answer the following questions:
1. How does salary vary by job position (e.g. Data Scientist vs. Software Engineer vs. DevOps Engineer)?
2. What is the impact of company size on salary levels?
3. Can we build a predictive model to estimate salaries based on job attributes?

By doing this, we can predict salary ranges, uncover market trends, and provide insights to tech professionals and organizations.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that the following points are addressed for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from google.colab import drive
import gdown                                                                    #to download the csv file from google drive - specified path
import re                                                                       #for regular expressions to clean text
from scipy import stats                                                         #for statistical tests
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE                                        # For handling imbalanced datasets (if classification)

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer, PorterStemmer

print("Libraries imported!")

print("\nDownloads:")
nltk.download('stopwords')
nltk.download('punkt_tab')                                                      # Often needed with tokenization
nltk.download('wordnet')                                                        # Needed for lemmatization
nltk.download('averaged_perceptron_tagger_eng')                                 # Needed for POS tagging

warnings.filterwarnings('ignore')                                              #suppress warnings

### Dataset Loading

In [None]:
# Load Dataset
drive.mount('/content/drive')

file_id = '1AoI3IN2olFG8LDo5lDjgieJi_xBlO6bu'
output_path = 'glassdoor_jobs.csv'  # Name you want for the downloaded file in Colab

gdown.download(f'https://drive.google.com/uc?id={file_id}', output_path, quiet=False)

df = pd.read_csv(output_path)
print("\nCSV loaded successfully!")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

After analyzing the dataset, we see that there are no duplicate rows and no missing (null) values, so every entry in the dataset is distinct.
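Note that isnull() only counts true null values. The wrangling code later in this notebook treats strings such as '-1' and 'Unknown / Non-Applicable' as missing, so a placeholder check may be a useful complement. The following is a minimal sketch under that assumption; `count_placeholders`, the `PLACEHOLDERS` list, and the toy rows are hypothetical and are not taken from the Glassdoor file.

```python
import pandas as pd

# Hypothetical placeholder values; adjust to whatever the raw file actually uses.
PLACEHOLDERS = ["-1", "unknown", "unknown / non-applicable"]

def count_placeholders(frame: pd.DataFrame) -> pd.Series:
    """Count, per column, the cells whose text form matches a placeholder."""
    as_text = frame.astype(str).apply(lambda col: col.str.strip().str.lower())
    return as_text.isin(PLACEHOLDERS).sum()

# Toy data for illustration only (not rows from the real dataset).
toy = pd.DataFrame({
    "Salary Estimate": ["$53K-$91K (Glassdoor est.)", "-1"],
    "Founded": [1973, -1],
    "Revenue": ["Unknown / Non-Applicable", "$50 to $100 million (USD)"],
})

null_total = int(toy.isnull().sum().sum())   # true nulls only
placeholder_counts = count_placeholders(toy)  # placeholder-encoded gaps
print("Null cells:", null_total)
print(placeholder_counts)
```

In this toy frame isnull() reports nothing missing even though every column contains one placeholder, which is the situation the check is meant to surface.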

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Executing the code above shows that:

1. 15 columns are present in the dataset file, with 'object' as the 'dtype' (data type) of the text columns.

2. The dataset description (df.describe()) reports the count, mean, std, min, 25%, 50%, 75%, and max values for the numeric columns only. The same information can also be explored with tables and graphs.
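As a small illustration of what the description covers, the sketch below compares the default describe() call with describe(include="all") on a toy frame. The column values are made up for the example.

```python
import pandas as pd

# Toy frame with one numeric and one text column (illustrative values only).
toy = pd.DataFrame({
    "Rating": [3.8, 4.1, -1.0],
    "Job Title": ["data scientist", "data engineer", "data scientist"],
})

numeric_summary = toy.describe()            # numeric columns only by default
full_summary = toy.describe(include="all")  # adds unique, top, freq for text columns

print(numeric_summary)
print(full_summary)
```

The second call adds rows such as unique, top, and freq, which are often more informative than mean or std for columns like Job Title.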

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values for column '{column}':")
    print(unique_values)
    print()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

#dropping unnecessary column
if 'Unnamed: 0' in df.columns:
    df.drop('Unnamed: 0', axis=1, inplace=True)
    print("Dropped 'Unnamed: 0' column.")

#Clean and Parse 'Salary Estimate': create 'Min Salary', 'Max Salary', 'Average Salary' in numerical format
if 'Salary Estimate' in df.columns and df['Salary Estimate'].dtype == 'object':
    print("\nCleaning 'Salary Estimate' column")

    df['Salary Cleaned'] = df['Salary Estimate'].str.replace(r'\(Glassdoor est\.\)', '', regex=True)    # Remove the literal '(Glassdoor est.)' suffix
    df['Salary Cleaned'] = df['Salary Cleaned'].str.replace('$', '', regex=False)                       # Remove '$'
    df['Salary Cleaned'] = df['Salary Cleaned'].str.lower().str.strip()                                 # Lowercase and strip for 'k'

    #fn to convert salary string to numeric range
    def parse_salary(salary_str):
        # Add a check for empty or non-convertible strings
        if not salary_str or salary_str.strip() == '':
            return pd.NA, pd.NA # Return pandas NA for missing values

        salary_str = salary_str.replace('k', '000')                                                     #convert 'k' to '000'
        if '-' in salary_str:
            try:
                min_s, max_s = map(float, salary_str.split('-'))
            except ValueError:
                 # Handle cases where splitting by '-' still results in non-float parts
                 return pd.NA, pd.NA
        else:                                                                                           #handle cases like '100000' without a range if they appear
            try:
                min_s = float(salary_str)
                max_s = float(salary_str)
            except ValueError:
                # Handle cases where the string isn't a valid float after cleaning
                return pd.NA, pd.NA
        return min_s, max_s

    #apply the parsing and create new columns
    df[['Min Salary', 'Max Salary']] = df['Salary Cleaned'].apply(lambda x: pd.Series(parse_salary(x)))
    df['Average Salary'] = (df['Min Salary'] + df['Max Salary']) / 2

    # drop the intermediate 'Salary Cleaned' column and original 'Salary Estimate'
    df.drop(['Salary Estimate', 'Salary Cleaned'], axis=1, inplace=True)
    print("Created 'Min Salary', 'Max Salary', and 'Average Salary' columns")

#clean 'Company Name': example: 'Tecolote Research\n3.8' needs to become 'Tecolote Research'
if 'Company Name' in df.columns and df['Company Name'].dtype == 'object':
    print("\nCleaning 'Company Name' column")

    #remove newline characters and any trailing numbers/ratings: use regex to remove '\n' followed by a number and optionally a decimal
    df['Company Name'] = df['Company Name'].astype(str).str.replace(r'\n\d+\.?\d*$', '', regex=True).str.strip()
    print("Cleaned 'Company Name'")

# extract 'State' from 'Location': example: 'Albuquerque, NM' -> 'NM'
if 'Location' in df.columns and df['Location'].dtype == 'object':
    print("\nExtracting 'State' from 'Location'")
    # Assuming format 'City, State' or 'City, State (Country)' -> takes the last part after the comma, strips whitespace, and converts to uppercase
    df['State'] = df['Location'].astype(str).apply(lambda x: x.split(',')[-1].strip().upper() if ',' in x else 'Unknown')
    print("Created 'State' column")

#Clean and Parse 'Size' -> example: '501 to 1000 employees', '10000+ employees', 'Unknown'
# Convert to numerical range or simplified categories
if 'Size' in df.columns and df['Size'].dtype == 'object':
    print("\nCleaning 'Size' column")
    df['Size Cleaned'] = df['Size'].astype(str).str.lower().str.replace(' employees', '', regex=False).str.strip()

    def parse_size(size_str):
        if 'unknown' in size_str or 'unspecified' in size_str:
            return 'Unknown'
        elif '+' in size_str:
            return size_str.replace('+', '') + ' employees' # e.g., '10000+ employees' -> '10000 employees' for categorization
        elif 'to' in size_str:
            return size_str.replace(' to ', '-') # e.g., '501 to 1000' -> '501-1000'
        else: # Handle single values if any
            return size_str

    df['Size Cleaned'] = df['Size Cleaned'].apply(parse_size)

    # Map to more defined categories or numerical ranges if needed for analysis: let's keep it as clean strings, but you could convert to numerical bins
    size_mapping = {
        '1-50': '1-50',
        '51-200': '51-200',
        '201-500': '201-500',
        '501-1000': '501-1000',
        '1001-5000': '1001-5000',
        '5001-10000': '5001-10000',
        '10000 employees': '10000+', # For '10000+'
        'unknown': 'Unknown',
        'unspecified': 'Unknown'
    }

    df['Employee Count Range'] = df['Size Cleaned'].map(size_mapping).fillna(df['Size Cleaned']) # Apply mapping only if the values are in the mapping keys
    df['Employee Count Range'] = df['Employee Count Range'].astype(str)
    print("Created 'Employee Count Range' column")

    df.drop(['Size', 'Size Cleaned'], axis=1, inplace=True)
    print("Cleaned 'Size' and created 'Employee Count Range' column.")

#derive 'Years Old' from 'Founded'
if 'Founded' in df.columns and pd.api.types.is_numeric_dtype(df['Founded']):
    print("\nCreating 'Years Old' column")
    current_year = pd.Timestamp.now().year
    df['Years Old'] = current_year - df['Founded']
    # Handle cases where 'Founded' might be 0 or in the future (unlikely but good practice)
    df.loc[df['Years Old'] < 0, 'Years Old'] = 0 # Company not founded yet or error
    print("Created 'Years Old' column")

#Clean and Parse 'Revenue' -> example: '$50 to $100 million (USD)'
#create 'Min Revenue', 'Max Revenue', 'Average Revenue' in numerical format
if 'Revenue' in df.columns and df['Revenue'].dtype == 'object':
    print("\nCleaning 'Revenue' column")

    # Handle '-1' or 'Unknown / Non-Applicable' entries
    df['Revenue Cleaned'] = df['Revenue'].astype(str).str.lower().str.strip()
    df.loc[df['Revenue Cleaned'].isin(['-1', 'unknown / non-applicable']), 'Revenue Cleaned'] = 'unknown'   # lowercase so the 'unknown' check in parse_revenue matches

    # Function to convert revenue string to numerical range (in millions or billions)
    def parse_revenue(revenue_str):
        if 'unknown' in revenue_str:
            return pd.NA, pd.NA # Use pandas NA for missing values

        revenue_str = revenue_str.replace('$', '')
        revenue_str = revenue_str.replace('(usd)', '').strip()

        multiplier = 1
        if 'million' in revenue_str:
            multiplier = 1_000_000
            revenue_str = revenue_str.replace('million', '').strip()
        elif 'billion' in revenue_str:
            multiplier = 1_000_000_000
            revenue_str = revenue_str.replace('billion', '').strip()

        if not revenue_str or revenue_str.strip() == '':
            return pd.NA, pd.NA

        if 'to' in revenue_str:
            try:
                min_r, max_r = map(float, revenue_str.split('to'))
            except ValueError:
                 return pd.NA, pd.NA
        elif '+' in revenue_str:
            try:
                min_r = float(revenue_str.replace('+', ''))
                max_r = min_r * 2
            except ValueError:
                 return pd.NA, pd.NA
        else:
            try:
                min_r = float(revenue_str)
                max_r = float(revenue_str)
            except ValueError:
                 return pd.NA, pd.NA


        return min_r * multiplier, max_r * multiplier

    # Apply the parsing
    df[['Min Revenue', 'Max Revenue']] = df['Revenue Cleaned'].apply(lambda x: pd.Series(parse_revenue(x)))
    df['Average Revenue'] = (df['Min Revenue'] + df['Max Revenue']) / 2

    df.drop(['Revenue', 'Revenue Cleaned'], axis=1, inplace=True)
    print("Created 'Min Revenue', 'Max Revenue', and 'Average Revenue' columns")

#process 'Competitors'
# '-1' means no competitors listed. Can create a count or a boolean.
if 'Competitors' in df.columns and df['Competitors'].dtype == 'object':
    print("\nProcessing 'Competitors' column")

    # Count competitors, setting 0 for '-1'
    df['Num Competitors'] = df['Competitors'].astype(str).apply(
        lambda x: len(x.split(',')) if x != '-1' and x.strip() != '' else 0
    )
    # Create a boolean flag if a list of competitors exists
    df['Has Competitors Listed'] = df['Competitors'].astype(str).apply(lambda x: 1 if x != '-1' and x.strip() != '' else 0)

    # df.drop('Competitors', axis=1, inplace=True)  -> might consider dropping the original 'Competitors' if these new features are sufficient
    print("Created 'Num Competitors' and 'Has Competitors Listed' columns")

# standardize Text Columns (Job Title, Job Description, Industry, Sector, Type of ownership, Headquarters)
# Convert to lowercase and strip extra whitespace for consistency
print("\nStandardizing text columns")
text_columns = ['Job Title', 'Job Description', 'Industry', 'Sector', 'Type of ownership', 'Headquarters']
for col in text_columns:
    if col in df.columns and df[col].dtype == 'object':
        df[col] = df[col].astype(str).str.lower().str.strip()
print("Standardized text columns to lowercase and stripped whitespace")


#final Data Inspection
print("\nFinal Data Review After Cleaning:")

print("\nDataFrame Information (df.info()):")
df.info()

print("\nFirst 5 rows of the CLEANED dataset:")
print(df.head())

print("\nMissing values after all cleaning steps:")
print(df.isnull().sum())

print(f"\nFinal DataFrame shape: {df.shape}")
print("Dataset is now analysis-ready!")

### What all manipulations have you done and insights you found?

The cleaning process has significantly transformed the dataset, making various textual and range-based columns suitable for quantitative analysis.

The **Unnamed: 0 column**, which was likely an extraneous index from the CSV export, has been dropped. This removes redundant information and cleans up the DataFrame for analysis.

**Salary Estimate Column Transformation:** The original Salary Estimate column (e.g., "$53K-$91K (Glassdoor est.)", "-1", "Employer Provided Salary:$150K-$160K", "$17-$24 Per Hour(Glassdoor est.)") has been dropped.

**New Columns Created:**

**Min Salary (e.g., 53000.0):** The lower bound of the estimated annual salary, converted to a numerical (float64) type.

**Max Salary (e.g., 91000.0):** The upper bound of the estimated annual salary, converted to a numerical (float64) type.

**Average Salary (e.g., 72000.0):** The calculated average of Min Salary and Max Salary, also numerical (float64).

**Handling of Complex Formats:** The wrangling code strips '$' and '(Glassdoor est.)' and expands 'K' to thousands before converting each bound to a number. It does not create separate hourly columns (Is_Hourly, Min_Hourly_Salary, Max_Hourly_Salary, Average_Hourly_Salary): entries such as -1, '(Employer est.)', 'Employer Provided Salary:', and 'Per Hour' rates fail the numeric conversion and are stored as pd.NA in Min Salary, Max Salary, and Average Salary. For the remaining rows, salary information is now in a quantifiable format, allowing for direct numerical analysis, filtering, and aggregation.
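One way the salary formats listed above could be parsed in a single pass is sketched below. `parse_salary_text` is a hypothetical helper, not the notebook's own function, and it assumes that a 'K' suffix means thousands and that the phrase 'Per Hour' marks an hourly rate.

```python
import math
import re

def parse_salary_text(text):
    """Return (min_value, max_value, is_hourly) for one salary string.

    Values are annual dollars unless is_hourly is True, in which case they
    are dollars per hour. Unparseable input gives NaN for both bounds.
    """
    cleaned = str(text).lower().strip()
    is_hourly = "per hour" in cleaned
    if cleaned == "-1":  # placeholder for "no salary listed"
        return math.nan, math.nan, is_hourly
    # Capture every number with an optional 'k' suffix, e.g. '53k' or '17'.
    matches = re.findall(r"(\d+(?:\.\d+)?)\s*(k?)", cleaned.replace(",", ""))
    values = [float(num) * (1000 if k else 1) for num, k in matches]
    if not values:
        return math.nan, math.nan, is_hourly
    return min(values), max(values), is_hourly

examples = [
    "$53K-$91K (Glassdoor est.)",
    "Employer Provided Salary:$150K-$160K",
    "$17-$24 Per Hour(Glassdoor est.)",
    "-1",
]
for raw in examples:
    print(raw, "->", parse_salary_text(raw))
```

Keeping the hourly flag next to the bounds leaves the decision of whether to annualize hourly rates, and with what number of working hours, to a later and explicit step.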

**Company Name Cleaning:** Trailing ratings (e.g., \n3.8) have been removed from company names. This provides cleaner, more consistent company names, ideal for grouping and unique identification.

**Location to State Extraction:** A new column State (e.g., NM, MD) has been extracted from the Location column by taking the text after the last comma, stripping whitespace, and converting it to uppercase. This enables geographic analysis of job postings at a state level.

**Size Column Transformation:** The original Size column (e.g., "501 to 1000 employees", "10000+ employees", "Unknown") has been dropped.

**New Column Created:** Employee Count Range (e.g., 501-1000, 10000+, Unknown). This standardizes the company size information into a more consistent categorical format, useful for analyzing job patterns across different company sizes.

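**Founded to Years Old Derivation:** A new numerical column Years Old (e.g., 52, 41) has been calculated by subtracting the Founded year from the current year at run time (2025 for the examples shown). Negative ages are set to 0. A Founded value of -1 is not treated as missing by this step, so it produces an age of the current year plus one, which shows up as an extreme value in the later charts. The column provides a direct measure of company age, which can be a valuable feature for analysis (e.g., mature vs. startup companies).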
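In the wrangling cell, Years Old is computed as the current year minus Founded, so a placeholder year of -1 would give an age just above 2000. The sketch below shows one possible way to mark such rows as missing instead. `company_age` and `reference_year` are hypothetical names, and treating non-positive or future years as invalid is an assumption.

```python
import pandas as pd

def company_age(founded: pd.Series, reference_year: int) -> pd.Series:
    """Age in years; non-positive or future founding years become missing."""
    founded = pd.to_numeric(founded, errors="coerce")
    valid = (founded > 0) & (founded <= reference_year)
    # where() keeps the value when valid is True and inserts NaN otherwise.
    return (reference_year - founded).where(valid)

# Toy founding years: two real-looking values, a -1 placeholder, a future year.
toy_founded = pd.Series([1973, 1984, -1, 2030])
ages = company_age(toy_founded, reference_year=2025)
print(ages.tolist())
```

Passing the reference year explicitly also keeps the derived ages stable when the notebook is rerun in a later year.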

**Revenue Column Transformation:** The original Revenue column (e.g., "$50 to $100 million (USD)", "-1", "Unknown / Non-Applicable") has been dropped.

**New Columns Created:**

**Min Revenue (e.g., 50000000.0):** The lower bound of the estimated revenue.

**Max Revenue (e.g., 100000000.0):** The upper bound of the estimated revenue.

**Average Revenue (e.g., 75000000.0):** The calculated average revenue.

Handling of Complex Formats: Similar to salary, it parsed out '$', 'million', 'billion', '(USD)', 'to', '+', and handled -1 and 'unknown / non-applicable' by converting them to pd.NA.

Revenue data is now in a numerical format, enabling financial analysis and correlation with other job attributes.
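A regex-based variant of the revenue parsing is sketched below for comparison. `parse_revenue_text` is a hypothetical helper rather than the notebook's function. Unlike the wrangling code, it leaves the upper bound of an open-ended range such as '$10+ billion' unknown instead of doubling the lower bound, and it reads each bound's unit separately. The mixed-unit input is an assumed format used only for illustration.

```python
import math
import re

# Multipliers for the unit words that appear in the revenue strings.
UNIT_MULTIPLIER = {"million": 1_000_000, "billion": 1_000_000_000}

def parse_revenue_text(text):
    """Return (min_revenue, max_revenue) in USD; NaN marks an unknown bound."""
    cleaned = str(text).lower().strip()
    if cleaned == "-1":  # placeholder for "no revenue listed"
        return math.nan, math.nan
    # Each match is (number, unit or ''), e.g. ('100', 'million').
    found = re.findall(r"(\d+(?:\.\d+)?)\+?\s*(million|billion)?", cleaned)
    if not found:
        return math.nan, math.nan
    # A number without its own unit borrows the last unit in the string,
    # so '$50 to $100 million' is read as 50 million to 100 million.
    last_unit = next((unit for _, unit in reversed(found) if unit), "")
    bounds = [float(num) * UNIT_MULTIPLIER.get(unit or last_unit, 1)
              for num, unit in found]
    if "+" in cleaned:
        return bounds[0], math.nan  # open-ended range: upper bound left unknown
    return min(bounds), max(bounds)

for raw in ["$50 to $100 million (USD)", "$500 million to $1 billion (USD)",
            "$10+ billion (USD)", "Unknown / Non-Applicable", "-1"]:
    print(raw, "->", parse_revenue_text(raw))
```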

**Competitors Column Processing:** The original Competitors column is retained.

**New Columns Created:**

**Num Competitors (e.g., 0, 3):** Counts the number of competitors listed, converting -1 to 0.

**Has Competitors Listed (e.g., 0, 1):** A binary flag indicating whether any competitors were listed.

Provides quantifiable measures of competition for each job posting, useful for market analysis.

**Standardization of Text Columns:** Job Title, Job Description, Industry, Sector, Type of ownership, and Headquarters columns have been converted to lowercase and had leading/trailing whitespace removed. This ensures consistency for text-based analysis (e.g., counting unique job titles, keyword search, natural language processing), preventing issues due to capitalization or extra spaces.

**Overall Insights from the Final Output:**

**Cleaned Data Types:** Many object columns that contained numerical or range data are now correctly represented as float64 or int64, which is essential for mathematical operations and statistical analysis. Textual categorical columns are consistently formatted.

**Introduced Missing Values (Correctly):** While the initial df.info() showed no missing values, the output now shows Min Salary, Max Salary, and Average Salary with 264 missing values and Min Revenue, Max Revenue, and Average Revenue with 381 missing values. This reveals the true completeness of the data after parsing. These missing values correspond to original entries, such as -1 or 'Unknown', that could not be meaningfully converted to a number, and they are represented as pd.NA.

**Rich Feature Set:** The creation of new, derived features (Average Salary, State, Years Old, Employee Count Range, Average Revenue, Num Competitors, Has Competitors Listed) has significantly enriched the dataset, providing more direct and useful variables for further exploration, visualization, and predictive modeling.

**Readiness for Analysis:** The dataset is now in a much more analysis-ready state, allowing you to perform aggregations, build visualizations, and apply machine learning algorithms without extensive manual cleaning for each new task.
The next steps would typically involve exploring these newly created features, visualizing distributions, and performing statistical analysis to uncover deeper patterns and relationships within your Glassdoor jobs data.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# @title Num Competitors

df['Num Competitors'].plot(kind='line', figsize=(8, 4), title='Num Competitors')
plt.gca().spines[['top', 'right']].set_visible(False)

##### 1. Why did you pick the specific chart?

A line plot was chosen to display the Num Competitors value for each entry in the dataset, with the row index on the x-axis.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Num Competitors predominantly has values of 0 or 3, with very few instances of 1, 2, or 4. This indicates that most job postings either have no competitors listed (value 0) or exactly 3 competitors listed.
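The observation that the values are mostly 0 or 3 can be checked more directly with a frequency table than by reading a line plot. The following is a minimal sketch that uses toy values in place of df['Num Competitors'].

```python
import pandas as pd

# Toy values for illustration only; in the notebook this would be df['Num Competitors'].
num_competitors = pd.Series([0, 0, 3, 0, 3, 1, 0, 3, 2, 0], name="Num Competitors")

# Frequency of each distinct count, ordered by the count itself.
frequency = num_competitors.value_counts().sort_index()
share = (frequency / frequency.sum()).round(2)

print(pd.DataFrame({"postings": frequency, "share": share}))
```

Calling frequency.plot(kind='bar') on the result would draw the same information as a bar chart, which is usually easier to read for a small set of discrete values.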

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes. This insight helps businesses understand the competitive landscape. If a company finds its postings consistently in the '3 competitors' group, they know it's a standard competitive field. For positions with 0 competitors, it might indicate niche roles or less competitive hiring, allowing for more strategic outreach.

**Negative Growth Insight:** The chart itself doesn't directly indicate negative growth. However, if a business consistently observes a high number of competitors (e.g., always 3 or more) for their key roles, it suggests a highly competitive talent market. This could indirectly lead to negative growth if they struggle to attract top talent due to intense competition, resulting in slower hiring, higher recruitment costs, or lower quality hires. The spike at ~100 on the x-axis, showing a value of 4 competitors, is a rare instance of higher competition that might warrant further investigation.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# @title Years Old

df['Years Old'].plot(kind='line', figsize=(8, 4), title='Years Old')
plt.gca().spines[['top', 'right']].set_visible(False)

##### 1. Why did you pick the specific chart?

This is a line plot, typically chosen to show the value of "Years Old" for each entry in the dataset. It helps visualize individual company ages across the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart shows a wide range of company ages, from very young (close to 0) to very old (some exceeding 1500-2000 years, which seems anomalous). The vast majority of companies appear to be relatively young, likely under 250 years old, with spikes indicating some extremely old or potentially misrecorded founding dates.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes. Understanding company age distribution helps tailor recruitment strategies. For example, a business can target younger, fast-growing companies for talent acquisition or partner with established, older companies. It also helps in market analysis, identifying if a sector is dominated by new entrants or long-standing players.

**Negative Growth Insight:** The presence of companies with extremely high "Years Old" values (e.g., >1000 years) suggests data quality issues or errors in the Founded column. If these extreme outliers are not data entry errors (and represent actual founding dates like universities or very old institutions), they might skew analyses of typical company age. If these are errors, using this data without further cleaning for specific metrics (e.g., average company age) could lead to misinformed business decisions about market maturity or growth potential, which could impede growth by misallocating resources.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# @title Founded

from matplotlib import pyplot as plt
df['Founded'].plot(kind='line', figsize=(8, 4), title='Founded')
plt.gca().spines[['top', 'right']].set_visible(False)

##### 1. Why did you pick the specific chart?

This is a line plot, typically used to show the Founded year for each entry in the dataset. It helps visualize the distribution of founding years across the companies.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that most companies were founded in recent centuries (mostly after 1900, with many concentrated around the 2000s). However, there are significant vertical lines dropping to near zero or very low values, suggesting a substantial number of entries with either a founding year of 0, -1, or an invalid/placeholder value, or potentially very old founding dates (e.g., in the 17th or 18th century).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Understanding the distribution of founding years can help businesses assess market maturity, identify emerging trends (e.g., many recently founded companies in a specific sector), or target companies based on their stage of development (startup vs. established). This can inform strategic partnerships, investment decisions, or talent acquisition focus.

**Negative Growth Insight:** The presence of founding years around 0 or other clearly erroneous values (the sharp vertical drops) indicates significant data quality issues. If these aren't handled, any analysis involving company age (like the Years Old column we derived) will be skewed. This can lead to misinformed business decisions due to inaccurate representation of company maturity, potentially impacting resource allocation, risk assessment, or market entry strategies. For instance, incorrectly assuming a large number of very old companies could mask a lack of innovation or new market entrants.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# @title Rating

df['Rating'].plot(kind='line', figsize=(8, 4), title='Rating')
plt.gca().spines[['top', 'right']].set_visible(False)

##### 1. Why did you pick the specific chart?

This is a line plot of the Rating value against the row index, so each data entry (job posting or company) appears as one point on the line. It gives a quick view of the range of ratings across the dataset and makes placeholder values easy to spot.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most company ratings fall between 3.0 and 5.0. However, there are many instances where the rating drops to 0 or below, indicating companies with either no rating information, potentially very low ratings, or placeholder/error values (e.g., -1).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes. Analyzing company ratings can help job seekers identify highly-rated employers, and it can help companies understand their brand perception. Businesses can use this to benchmark their own ratings against competitors, identifying areas for improvement in employee satisfaction and workplace culture.

**Negative Growth Insight:** The frequent drops to 0 or below for Rating suggest data quality issues. If a significant portion of companies truly has a 0 or negative rating, this would indicate severe dissatisfaction, which directly impacts talent attraction and retention, potentially leading to negative growth due to a weakened workforce. If these are merely missing data points (e.g., converted from "No rating" or "-1"), then relying on an average rating without handling these correctly would provide a misleading picture of overall company sentiment, leading to flawed business decisions about talent acquisition or reputation management.
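The effect of placeholder ratings on an average can be illustrated with a few made-up values. This is a minimal sketch that assumes negative ratings are placeholders for "no rating"; the numbers are not taken from the dataset.

```python
import pandas as pd

# Illustrative ratings; -1.0 stands in for "no rating available".
ratings = pd.Series([3.5, 4.1, -1.0, 3.9, -1.0, 4.4], name="Rating")

# Naive mean treats the placeholder as a real (very low) rating.
naive_mean = ratings.mean()

# mask() replaces values where the condition is True with NaN.
clean = ratings.mask(ratings < 0)
clean_mean = clean.mean()  # NaN values are skipped by default

print(f"Naive mean: {naive_mean:.3f}")
print(f"Mean without placeholders: {clean_mean:.3f}")
```

The gap between the two means shows how strongly a handful of placeholder rows can pull an aggregate rating downward.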

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# @title Num Competitors vs Has Competitors Listed

df.plot(kind='scatter', x='Num Competitors', y='Has Competitors Listed', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

##### 1. Why did you pick the specific chart?

A scatter plot is suitable here because it visualizes the relationship between two numerical variables (Num Competitors on the x-axis and Has Competitors Listed on the y-axis). It helps confirm the logical consistency between these two derived features.

##### 2. What is/are the insight(s) found from the chart?

The chart shows two distinct groups of points:

* When Num Competitors is 0, Has Competitors Listed is also 0.
* When Num Competitors is 1, 2, 3, or 4, Has Competitors Listed is 1.

This confirms that the cleaning and feature engineering for these two columns were successful and logically consistent: if there are no competitors (0), the flag is off (0); if there's any positive number of competitors, the flag is on (1).
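The same consistency check can also be expressed programmatically instead of being read off a scatter plot. The sketch below runs on a small made-up frame that reuses the two column names; the `expected_flag` and `mismatches` names are hypothetical.

```python
import pandas as pd

# Toy frame reusing the notebook's column names; values are illustrative.
toy = pd.DataFrame({
    "Num Competitors": [0, 3, 1, 0, 4, 2],
    "Has Competitors Listed": [0, 1, 1, 0, 1, 1],
})

# The flag should be 1 exactly when at least one competitor is listed.
expected_flag = (toy["Num Competitors"] > 0).astype(int)

# Rows where the stored flag disagrees with the recomputed flag.
mismatches = toy[toy["Has Competitors Listed"] != expected_flag]

print("Inconsistent rows:", len(mismatches))
```

An empty `mismatches` frame gives the same assurance as the plot, and it scales to datasets where overlapping points would hide a stray inconsistent row.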

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes. This chart validates the accuracy of the Num Competitors and Has Competitors Listed features, ensuring that subsequent analyses based on these fields are reliable. For instance, businesses can confidently use Has Competitors Listed as a quick filter for competitive roles or Num Competitors for a detailed competitive analysis, leading to more informed strategic decisions in recruitment and market positioning.

**Negative Growth Insight:** No, this specific chart does not directly show any insights that lead to negative growth. Instead, it confirms the correctness of the data manipulation, which is a positive outcome for data reliability. Issues leading to negative growth would stem from the values of Num Competitors (e.g., consistently high numbers indicating fierce talent competition), not from the relationship between Num Competitors and Has Competitors Listed itself.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# @title Years Old vs Num Competitors

df.plot(kind='scatter', x='Years Old', y='Num Competitors', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

##### 1. Why did you pick the specific chart?

A scatter plot is used here to visualize the relationship between two numerical variables: Years Old (company age) on the x-axis and Num Competitors (number of listed competitors) on the y-axis. It helps identify any correlations or patterns between company age and the competitive landscape.

##### 2. What is/are the insight(s) found from the chart?

* Most companies are relatively young (under 250 years old), and within this group, Num Competitors is primarily 0 or 3, with some instances of 1, 2, and 4.

* There's a cluster of companies around the ~2000-year mark with Num Competitors values of 0, 1, or 2. These are the anomalous "Founded" entries we noted earlier.

* Younger companies show a wider spread in competitor counts, suggesting that competitive dynamics are varied for newer entities.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes. This insight helps businesses understand if company age correlates with competitive intensity. For instance, if very young companies tend to have fewer competitors listed, it might suggest opportunities in emerging markets. If older, established companies consistently show certain competitor counts, it can inform market entry or expansion strategies.

**Negative Growth Insight:** The cluster of data points around Years Old = 2000 (corresponding to Founded year 0 or similar problematic values) represents data quality issues. If these old/erroneous Years Old values are included in aggregated analyses (e.g., calculating average competitors for "very old" companies), they would lead to misleading insights. Acting on such flawed data could result in negative growth by misjudging market saturation, competitive threats, or the target audience's needs, leading to ineffective business strategies or resource misallocation. Proper handling (e.g., removing or imputing these outliers) is crucial.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# @title Founded vs Years Old

df.plot(kind='scatter', x='Founded', y='Years Old', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

##### 1. Why did you pick the specific chart?

A scatter plot is ideal here to visualize the direct mathematical relationship between Founded year (x-axis) and Years Old (y-axis). It helps confirm if the Years Old column was calculated correctly from the Founded year.

##### 2. What is/are the insight(s) found from the chart?

The chart shows a strong inverse linear relationship between Founded and Years Old, as expected (the earlier the founding year, the greater the company age). Most data points fall on this line.

There is one isolated point at Founded near 0 and Years Old near 2000 (likely 2025, based on the calculation year). This confirms the existence of the Founded year 0 anomaly and its correct transformation into Years Old.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes. This chart provides validation that the Years Old feature was correctly derived from the Founded year, which builds confidence in the reliability of this newly engineered feature for analysis. Businesses can now confidently use Years Old for segmentation or trend analysis.

**Negative Growth Insight:** The outlier point at Founded near 0 (resulting in Years Old near 2000) represents a data quality issue. While the calculation was correct based on this erroneous input, including this outlier in statistical summaries (e.g., average company age) would skew results and provide misleading insights. This could lead to negative growth if business strategies (e.g., market entry, partnership targeting) are based on an inaccurate understanding of typical company age within a sector or the overall dataset. This confirms the need to handle such outliers in the Founded or Years Old columns.
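The derivation can also be verified numerically: if Years Old was computed as a fixed reference year minus Founded, then Founded plus Years Old should give the same year on every row. The sketch below uses made-up values and assumes a reference year of 2025, which the notebook does not state explicitly.

```python
import pandas as pd

# Toy rows, including one placeholder founding year of 0.
toy = pd.DataFrame({
    "Founded": [1999, 2010, 0, 1885],
    "Years Old": [26, 15, 2025, 140],
})

# If the age was derived as (reference year - Founded), this sum is constant.
implied_year = toy["Founded"] + toy["Years Old"]
is_consistent = implied_year.nunique() == 1

print("Implied reference year(s):", implied_year.unique())
print("Derivation consistent:", is_consistent)
```

Note that the placeholder row passes this check too: the arithmetic is correct even though the input year is not meaningful.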

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# @title Rating vs Founded

df.plot(kind='scatter', x='Rating', y='Founded', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

##### 1. Why did you pick the specific chart?

A scatter plot is used here to examine the relationship between Rating (x-axis) and Founded year (y-axis). It helps to see if there's any correlation between a company's rating and its founding year.

##### 2. What is/are the insight(s) found from the chart?

* Most companies with ratings (between ~2.0 and 5.0) were founded in recent times (mostly after 1900, clustered around the 2000s).

* There's a significant cluster of data points around Founded = 0 (or very low values) across various Rating values. This indicates the "data quality issue" where the founding year was likely a placeholder -1 or 0, yet these entries still have associated ratings.

* There's also an outlier point at Rating = -1 and Founded around 2000, suggesting a placeholder rating value for a company with a recent founding year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** This chart helps understand if there's a typical age for companies with certain rating ranges. For example, if newer companies consistently have higher ratings, it could inform strategies for attracting talent or marketing to customers who prefer modern organizations.

**Negative Growth Insight:** The prominent horizontal cluster at Founded = 0 (or Founded = -1 before cleaning) represents major data quality issues. Relying on aggregated insights from Founded or Years Old columns without addressing these erroneous values would lead to misleading conclusions about company age distributions. For example, if a business analyzes "average company age" and these 0-founded entries are included, the average age would be artificially high (and the average founding year artificially low), potentially leading to negative growth by misjudging market maturity or the typical longevity of companies in a sector, thus informing flawed strategic planning. The single outlier at Rating = -1 also highlights a data quality concern, as a placeholder rating would skew any aggregate rating analysis.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# @title Num Competitors

df['Num Competitors'].plot(kind='hist', bins=20, title='Num Competitors')
plt.gca().spines[['top', 'right',]].set_visible(False)

##### 1. Why did you pick the specific chart?

A histogram (or bar chart for discrete values like this) is ideal for displaying the frequency distribution of a single numerical variable, Num Competitors. It quickly shows which competitor counts are most common.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that:

* The most frequent Num Competitors value is 0 (around 625-630 entries), meaning a large majority of job postings do not list any competitors.

* The second most frequent value is 3 (around 250 entries).

* Very few postings list 1, 2, or 4 competitors.
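Because Num Competitors only takes a few integer values, an exact frequency table can complement the 20-bin histogram. The sketch below uses a short made-up series; the counts are illustrative and do not reproduce the dataset.

```python
import pandas as pd

# Illustrative competitor counts (not the real data).
num_competitors = pd.Series([0, 0, 3, 0, 1, 3, 0, 2, 4, 0], name="Num Competitors")

# Exact count per distinct value, ordered by the value rather than by frequency.
freq = num_competitors.value_counts().sort_index()

# Share of postings per value, rounded for display.
share = (freq / freq.sum()).round(2)

print(pd.DataFrame({"count": freq, "share": share}))
# freq.plot(kind="bar") would draw one bar per distinct value (needs matplotlib).
```

A table like this avoids the empty bins that a histogram produces for discrete data and gives exact counts instead of values estimated from bar heights.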

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes. This insight helps businesses understand the competitive landscape of job postings. A high frequency of '0 competitors' might indicate niche roles, less transparent companies, or a less crowded talent market for those positions. The prominence of '3 competitors' suggests a common competitive benchmark. This information can help optimize recruitment strategies, allocate resources effectively, and assess the difficulty of attracting talent for certain roles.

**Negative Growth Insight:** No, this chart does not directly show insights that lead to negative growth. It highlights the distribution of competitor counts. However, if a business consistently finds itself in the '3 competitors' bucket for all its critical roles, it means they operate in a highly competitive hiring environment. If they lack competitive advantages (e.g., compensation, benefits, culture), this could indirectly lead to negative growth by making it harder to secure top talent, increasing recruitment costs, and potentially slowing down critical projects due to staffing challenges. The data itself doesn't cause negative growth, but the competitive scenario it reveals could pose challenges.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# @title Years Old

df['Years Old'].plot(kind='hist', bins=20, title='Years Old')
plt.gca().spines[['top', 'right',]].set_visible(False)

##### 1. Why did you pick the specific chart?

A histogram is chosen to show the frequency distribution of Years Old, a continuous numerical variable. It effectively groups company ages into bins and displays how many companies fall into each age range.

##### 2. What is/are the insight(s) found from the chart?

The chart clearly shows two main insights:

* **Dominance of Young Companies:** The vast majority of companies are very young, likely under 100-150 years old, with a large peak at the lowest age range (0-50 years).

* **Anomaly of Very Old Companies:** There's a distinct, smaller peak around 1900-2000 years, indicating a significant number of companies with extremely high calculated ages (which as we discussed, correspond to the "Founded" year 0 or -1 anomalies).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes. This distribution helps understand the typical age profile of companies in the dataset. Businesses can use this for:
* Targeting: Focus marketing or recruitment efforts on companies within specific age brackets (e.g., fast-growing startups vs. stable, mature corporations).
* Market Analysis: Assess if the market is saturated with older players or seeing an influx of new companies.

**Negative Growth Insight:** The prominent bar around 1900-2000 years ("Years Old") signifies a major data quality issue (companies with erroneous "Founded" dates, like 0 or -1). If these anomalous data points are not properly handled (e.g., removed, imputed, or treated as a separate category for "Unknown Age"), they will skew any aggregated statistics or models involving company age. Basing business decisions on such inaccurate data (e.g., misjudging the average age of companies in a sector) could lead to negative growth by developing strategies (e.g., investment, partnership, product development) that are misaligned with the true market dynamics.
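Treating implausible ages as a separate "Unknown" category, as suggested above, could look like the following sketch. The `MAX_PLAUSIBLE_AGE` threshold and the bracket boundaries are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative ages; the ~2000-year values mimic the placeholder anomaly.
ages = pd.Series([5, 30, 140, 2025, 12, 2026], name="Years Old")

MAX_PLAUSIBLE_AGE = 500  # assumed upper bound for a real company age

# Ages above the threshold become NaN instead of being dropped.
clean = ages.where(ages <= MAX_PLAUSIBLE_AGE)

# Bin the plausible ages; intervals are (0, 10], (10, 50], (50, 500].
brackets = pd.cut(clean, bins=[0, 10, 50, 500], labels=["0-10", "11-50", "51+"])

# Add an explicit category for the rows whose age is unknown.
brackets = brackets.cat.add_categories("Unknown").fillna("Unknown")

print(brackets.value_counts())
```

Keeping an explicit "Unknown" bracket preserves the affected rows for other analyses while preventing them from distorting age-based statistics.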

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# @title Founded

df['Founded'].plot(kind='hist', bins=20, title='Founded')
plt.gca().spines[['top', 'right',]].set_visible(False)

##### 1. Why did you pick the specific chart?

A histogram is chosen to display the frequency distribution of the Founded year, a numerical variable. It effectively groups the founding years into bins and shows how many companies were founded in each period.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals two major insights:

* Recent Founding Dominance: The overwhelming majority of companies were founded in the late 1900s and early 2000s, with a very large peak in the most recent bins (near 2000).

* Anomalous Founding Years: There's a distinct bar at Founded year 0, indicating a significant number of entries with invalid or placeholder founding dates. There are also some companies founded in the late 1800s.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes. Understanding the distribution of founding years provides insights into the dynamism of the job market. A strong concentration of recently founded companies might suggest a vibrant, emerging sector for innovation or high growth. This can guide investment decisions, talent acquisition strategies, and market entry planning.

**Negative Growth Insight:** The substantial bar at Founded year 0 (or similar non-sensical values) represents a critical data quality issue. Including these erroneous values in any analysis of company age (e.g., average founding year) will severely skew results and provide fundamentally misleading insights. Basing business strategies on such inaccurate data could lead to negative growth by misjudging the maturity of a market, the competitive landscape, or the longevity of businesses, thus resulting in ill-informed investments or ineffective strategic planning.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# @title Rating

df['Rating'].plot(kind='hist', bins=20, title='Rating')
plt.gca().spines[['top', 'right',]].set_visible(False)

##### 1. Why did you pick the specific chart?

A histogram is chosen to display the frequency distribution of Rating, a numerical variable. It effectively groups the ratings into bins and shows how many companies fall into each rating range.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals several key insights:

* Central Cluster: Most companies have ratings clustered between approximately 3.0 and 4.5, with a peak around 4.0, suggesting a generally positive sentiment.

* Left Tail: There's a tail of ratings extending down to 2.0, indicating some companies have lower ratings.

* Anomaly at -1: A distinct bar at -1 indicates a significant number of entries where the rating information is missing or represented by a placeholder (likely from original "-1" values or "No rating" before cleaning).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes. This distribution helps companies understand the typical range of ratings in the market. Companies with ratings below the average (e.g., in the 2.0-3.0 range) can identify areas for improvement in employee satisfaction or public perception. Highly-rated companies can leverage their strong reputation for talent attraction and marketing.

**Negative Growth Insight:** The bar at -1 (indicating missing or invalid ratings) is a data quality concern. If this large group of entries is treated as having a "negative one" rating in analyses, it would severely skew average rating calculations and present a misleading picture of company sentiment. Basing strategic decisions (e.g., recruitment budgets, employer branding efforts) on such flawed aggregate data could lead to negative growth by misallocating resources or failing to address real perception issues.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# @title Has Competitors Listed

df['Has Competitors Listed'].plot(kind='hist', bins=20, title='Has Competitors Listed')
plt.gca().spines[['top', 'right',]].set_visible(False)

##### 1. Why did you pick the specific chart?

A histogram (or bar chart) is ideal for displaying the frequency distribution of a binary (0 or 1) categorical variable like Has Competitors Listed. It clearly shows the count of job postings that either have competitors listed (1) or do not (0).

##### 2. What is/are the insight(s) found from the chart?

The chart shows two distinct insights:

* Majority with No Listed Competitors: A large majority of job postings (over 600) do not have competitors listed (Has Competitors Listed = 0).

* Significant Portion with Listed Competitors: A substantial number of job postings (over 300) do have competitors listed (Has Competitors Listed = 1).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Yes. This insight directly informs market and recruitment strategies. Companies can differentiate roles where competition is explicit versus those where it's not (perhaps niche roles or less transparent listings). For instance, for roles with Has Competitors Listed = 1, a company might need to invest more in competitive compensation or unique benefits to attract talent.

**Negative Growth Insight:** No, this chart primarily shows the presence or absence of competitor information, not a direct driver of negative growth. However, if a business consistently finds that all their critical job postings fall into the Has Competitors Listed = 1 category (especially with high Num Competitors if combined with that insight), it indicates a highly competitive talent acquisition environment. Failing to adequately address this high competition with superior employer branding or compensation could indirectly lead to negative growth due to challenges in hiring and retaining skilled employees.








#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
print("Correlation Heatmap")

#select only numerical columns for correlation calculation
numerical_df = df.select_dtypes(include=['float64', 'int64'])


for col in ['Min Salary', 'Max Salary', 'Average Salary',
            'Min_Hourly_Salary', 'Max_Hourly_Salary', 'Average_Hourly_Salary',
            'Min Revenue', 'Max Revenue', 'Average Revenue', 'Rating', 'Founded', 'Years Old',
            'Num Competitors', 'Has Competitors Listed']:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')                       # This is important if some parsing failed and left objects, e.g., 'object' type despite conversion

numerical_df = df.select_dtypes(include=['float64', 'int64'])                   # Re-select numerical columns after ensuring type consistency

# Drop columns that are completely irrelevant for correlation or are constant
# 'Founded' and 'Years Old' are perfectly inversely correlated, one is enough for core analysis
# 'Has Competitors Listed' is derived from 'Num Competitors', so they are highly related.
# 'Is_Hourly' is a boolean, good to keep.
# 'Rating' and 'Num Competitors' are numerical.

# Identify columns to potentially drop for a cleaner heatmap (optional, but good for clarity)
columns_to_exclude = ['Founded'] # 'Founded' is inverse of 'Years Old', so keep 'Years Old'

# Filter out the columns to exclude
numerical_df_filtered = numerical_df.drop(columns=columns_to_exclude, errors='ignore')

#calc the correlation matrix
correlation_matrix = numerical_df_filtered.corr()

print("\nCorrelation matrix calculated. Displaying heatmap")

#create the heatmap
plt.figure(figsize=(12, 10)) # Adjust figure size for readability

# annot=True displays the correlation values on the heatmap
# cmap='coolwarm' sets a diverging color scheme (blue for negative, red for positive correlation)
# fmt=".2f" formats the annotations to two decimal places
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)

plt.title('Correlation Heatmap of Numerical Features', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

print("\nCorrelation heatmap displayed successfully.")

##### 1. Why did you pick the specific chart?

A correlation heatmap is chosen to visualize the correlation matrix between multiple numerical variables. It effectively shows the strength and direction (positive or negative) of linear relationships between all pairs of features in a compact, color-coded format, making it easy to spot strong relationships at a glance.

##### 2. What is/are the insight(s) found from the chart?

1. Strong Positive Correlations:

* Min Salary, Max Salary, and Average Salary are all very highly positively correlated with each other (0.96 to 0.99). This is expected as Average Salary is derived directly from Min and Max Salary, and Min and Max salaries move together.
* Similarly, Min Revenue, Max Revenue, and Average Revenue are also perfectly positively correlated (1.00) due to their derivation.
* Num Competitors and Has Competitors Listed are very highly positively correlated (0.97). This is also expected as Has Competitors Listed is a binary flag based on Num Competitors.

2. Moderate and Weak Negative Correlations:
* Rating has a moderate negative correlation with Years Old (-0.48). This suggests that older companies tend to have lower ratings, or conversely, younger companies tend to have higher ratings. Because Years Old still contains the anomalous ~2000-year values and Rating still contains -1 placeholders, this coefficient should be read with caution.
* Years Old has a weak negative correlation with Num Competitors (-0.21) and Has Competitors Listed (-0.22). This implies that older companies tend to have slightly fewer listed competitors or are somewhat less likely to have competitors listed.

3. Weak or No Significant Correlation:
* Rating has very weak or negligible correlations with Min Salary, Max Salary, Average Salary, Min Revenue, Max Revenue, Average Revenue, and Num Competitors (all close to 0). This suggests that a company's rating doesn't strongly predict or is predicted by its salary estimates, revenue, or number of competitors.
* Salary-related columns (Min Salary, Max Salary, Average Salary) show a weak positive correlation with revenue-related columns (around 0.23), implying that higher salaries might be loosely associated with higher revenues, but it's not a strong linear relationship.
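The heatmap code drops Founded because it carries the same information as Years Old. A short numerical illustration: when one column is a constant minus the other, their Pearson correlation is exactly -1. The values and the reference year 2025 below are assumptions for the example.

```python
import numpy as np

# Illustrative founding years (placeholders deliberately left out).
founded = np.array([1999.0, 2010.0, 1985.0, 1950.0, 2015.0])

REFERENCE_YEAR = 2025  # assumed year used to derive company age
years_old = REFERENCE_YEAR - founded

# Pearson correlation between a variable and a decreasing linear
# transformation of itself is -1 (up to floating-point error).
r = np.corrcoef(founded, years_old)[0, 1]

print(f"Correlation between Founded and Years Old: {r:.4f}")
```

Any pair of columns related this way adds no new information to a correlation matrix, which is why keeping only one of them makes the heatmap easier to read.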

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
print("Generating Pair Plot Visualization")

#select a subset of numerical columns for the pair plot


# Re-ensure relevant columns are numeric, coercing errors to NaN, as done for heatmap
for col in ['Min Salary', 'Max Salary', 'Average Salary',
            'Min_Hourly_Salary', 'Max_Hourly_Salary', 'Average_Hourly_Salary',
            'Min Revenue', 'Max Revenue', 'Average Revenue', 'Rating', 'Founded', 'Years Old',
            'Num Competitors', 'Has Competitors Listed']:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Exclude highly correlated duplicates like 'Min Salary', 'Max Salary' if 'Average Salary' is present
# Exclude 'Founded' if 'Years Old' is present, due to perfect inverse correlation.
# Exclude 'Has Competitors Listed' if 'Num Competitors' is present, due to high correlation.
selected_numerical_cols = [
    'Rating',
    'Average Salary',
    'Average Revenue',
    'Years Old',
    'Num Competitors'
]


# Drop rows with NaN in these selected columns for clearer visualization in pairplot
pair_plot_df = df[selected_numerical_cols].dropna()

# Check if the filtered DataFrame is empty after dropping NaNs
if pair_plot_df.empty:
    print("Warning: No data available for pair plot after dropping NaN values in selected columns. Check data or column selection.")
else:

    # diag_kind='kde' shows a Kernel Density Estimate on the diagonal for distribution visualization
    # kind='scatter' shows scatter plots for the off-diagonal elements
    sns.pairplot(pair_plot_df, diag_kind='kde', kind='scatter')

    plt.suptitle('Pair Plot of Key Numerical Features', y=1.02, fontsize=16) # Add a title for the entire plot
    plt.tight_layout(rect=[0, 0, 1, 0.98]) # Adjust layout to prevent title overlap
    plt.show()

    print("\nPair plot displayed successfully.")


##### 1. Why did you pick the specific chart?

A pair plot is chosen because it allows for a comprehensive visualization of the relationships between multiple numerical variables in a single grid. It displays scatter plots for each pair of variables and distribution plots (KDE in this case) for individual variables on the diagonal. This helps in quickly identifying patterns, correlations, and distributions among features.

##### 2. What is/are the insight(s) found from the chart?

* **Rating Distribution**: The KDE for Rating shows a distribution skewed towards higher ratings (around 3.0 to 4.5), with a peak around 4.0, indicating most companies are moderately to highly rated.
* **Average Salary Distribution:** The Average Salary KDE shows a primary peak around $75,000 - $100,000, with a wider spread and some outliers extending to higher salaries.
* **Average Revenue Distribution:** The Average Revenue KDE is highly skewed right, with a large concentration of companies having very low revenue (likely in the millions or hundreds of millions), and a smaller secondary peak for companies with much higher revenue (billions), indicating a few very large companies.
* **Years Old Distribution:** The Years Old KDE shows a strong peak at very young companies (under 100 years old), and a smaller, distinct peak at companies around 2000 years old (the data anomaly we previously discussed).
* **Num Competitors Distribution:** The Num Competitors KDE shows two prominent peaks: one at 0 and another at 3, confirming that most companies either have no listed competitors or exactly three, with only a few listing 1, 2, or 4.
* **Relationships between variables:**
1. **Rating vs. Salary/Revenue:** Scatter plots suggest weak to no strong linear correlation between Rating and Average Salary or Average Revenue. Companies with higher ratings don't necessarily have significantly higher salaries or revenues, and vice-versa.
2. **Salary vs. Revenue:** There appears to be a slight positive correlation between Average Salary and Average Revenue, meaning companies with higher revenues tend to offer higher salaries, though the relationship isn't extremely tight.
3. **Years Old vs. Salary/Revenue/Rating:** As seen before, the 'Years Old' plot shows clear outliers at the very high end (~2000 years), which appear across various salary, revenue, and rating ranges. This again highlights the data quality issue for the 'Founded' and 'Years Old' columns. For more "normal" ages, there isn't a clear strong linear trend with salary, revenue, or rating.
4. **Num Competitors:** This variable appears as distinct clusters (0, 1, 2, 3, 4) against other features, confirming its discrete nature. There isn't a clear linear relationship with Rating, Average Salary, Average Revenue, or Years Old in the scatter plots, but rather distinct groups of points.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

SET 1:
* Null Hypothesis (H0): There is no linear correlation between a company's rating and the average salary offered.
* Alternate Hypothesis (H1): There is a linear correlation between a company's rating and the average salary offered.

SET 2:
* Null Hypothesis (H0): There is no linear correlation between a company's age ('Years Old') and the average salary offered.
* Alternate Hypothesis (H1): There is a linear correlation between a company's age ('Years Old') and the average salary offered.

SET 3:
* Null Hypothesis (H0): There is no statistically significant difference in the average salary offered by companies with 0 competitors versus companies with 3 competitors.
* Alternate Hypothesis (H1): There is a statistically significant difference in the average salary offered by companies with 0 competitors versus companies with 3 competitors.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis (H0): There is no linear correlation between a company's rating and the average salary offered.
* Alternate Hypothesis (H1): There is a linear correlation between a company's rating and the average salary offered.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats  # harmless if scipy.stats was already imported earlier in the notebook
# Filter data for the relevant columns and drop rows with NA values for this test.
data_h1 = df[['Rating', 'Average Salary']].dropna()

if not data_h1.empty:
    correlation, p_value_h1 = stats.pearsonr(data_h1['Rating'], data_h1['Average Salary'])
    print(f"Pearson Correlation Coefficient: {correlation:.4f}")
    print(f"P-Value (Hypothesis 1): {p_value_h1:.4f}")

    # Determine significance
    alpha = 0.05
    if p_value_h1 < alpha:
        print(f"Conclusion: Reject the Null Hypothesis. There is a statistically significant linear correlation (p < {alpha}).")
    else:
        print(f"Conclusion: Fail to reject the Null Hypothesis. There is no statistically significant linear correlation (p >= {alpha}).")
else:
    print("Not enough data to perform Hypothesis 1 test (due to NA values).")

##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Coefficient.

##### Why did you choose the specific statistical test?

Pearson correlation is chosen to measure the strength and direction of a linear relationship between two continuous variables ('Rating' and 'Average Salary'). It's appropriate for assessing if a direct linear association exists.
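
As a minimal sketch of the test above, the snippet below runs `stats.pearsonr` on a small synthetic sample and adds a Spearman rank correlation as a cross-check for cases where the relationship is monotonic but not strictly linear. The numbers are made up for illustration and are not taken from the Glassdoor data.

```python
import numpy as np
from scipy import stats

# Synthetic, illustrative values (not from the Glassdoor dataset)
rating = np.array([2.5, 3.0, 3.2, 3.6, 3.9, 4.1, 4.4, 4.8])
avg_salary = np.array([60, 72, 70, 85, 90, 88, 105, 110])  # hypothetical, in $K

# Pearson measures linear association; Spearman measures monotonic association on ranks
pearson_r, pearson_p = stats.pearsonr(rating, avg_salary)
spearman_rho, spearman_p = stats.spearmanr(rating, avg_salary)

print(f"Pearson r = {pearson_r:.4f} (p = {pearson_p:.4f})")
print(f"Spearman rho = {spearman_rho:.4f} (p = {spearman_p:.4f})")
```

If the two coefficients disagree sharply on real data, that is usually a hint that outliers or a non-linear pattern are driving the Pearson result.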


### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis (H0): There is no linear correlation between a company's age ('Years Old') and the average salary offered.
* Alternate Hypothesis (H1): There is a linear correlation between a company's age ('Years Old') and the average salary offered.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
data_h2 = df[['Years Old', 'Average Salary']].dropna()
data_h2 = data_h2[data_h2['Years Old'] < 1000] # Filter out extreme outliers for meaningful test

if not data_h2.empty:
    correlation_h2, p_value_h2 = stats.pearsonr(data_h2['Years Old'], data_h2['Average Salary'])
    print(f"Pearson Correlation Coefficient: {correlation_h2:.4f}")
    print(f"P-Value (Hypothesis 2): {p_value_h2:.4f}")

    # Determine significance
    alpha = 0.05
    if p_value_h2 < alpha:
        print(f"Conclusion: Reject the Null Hypothesis. There is a statistically significant linear correlation (p < {alpha}).")
    else:
        print(f"Conclusion: Fail to reject the Null Hypothesis. There is no statistically significant linear correlation (p >= {alpha}).")
else:
    print("Not enough valid data to perform Hypothesis 2 test (due to NA values or filtering of outliers).")

##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Coefficient (after outlier filtering).

##### Why did you choose the specific statistical test?

Pearson correlation is chosen to measure the linear relationship between two continuous variables. Filtering extreme outliers in 'Years Old' is essential to ensure the test's validity and to get a meaningful correlation for the typical company age range.
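
To illustrate why the filter matters, here is a small sketch on toy data: a perfectly linear relationship is almost erased by three placeholder-style ages near 2000. The arrays are invented for the example; only the `< 1000` threshold mirrors the cell above.

```python
import numpy as np
from scipy import stats

# Toy data: 20 companies with a perfectly linear age/salary relationship
years_old = np.arange(1, 21, dtype=float)
avg_salary = years_old.copy()

# Add three placeholder-style ages (~2000 years) paired with an ordinary salary
years_all = np.concatenate([years_old, [2000.0, 2000.0, 2000.0]])
salary_all = np.concatenate([avg_salary, [10.5, 10.5, 10.5]])

r_all, _ = stats.pearsonr(years_all, salary_all)

# Same filter as in the hypothesis test: keep only plausible ages
mask = years_all < 1000
r_filtered, _ = stats.pearsonr(years_all[mask], salary_all[mask])

print(f"r with placeholder ages: {r_all:.4f}")
print(f"r after filtering:       {r_filtered:.4f}")
```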


### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

* Null Hypothesis (H0): There is no statistically significant difference in the average salary offered by companies with 0 competitors versus companies with 3 competitors.
* Alternate Hypothesis (H1): There is a statistically significant difference in the average salary offered by companies with 0 competitors versus companies with 3 competitors.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Filter data for relevant columns and drop NA values.
data_h3 = df[['Num Competitors', 'Average Salary']].dropna()

# Extract groups for comparison
group_0_competitors = data_h3[data_h3['Num Competitors'] == 0]['Average Salary']
group_3_competitors = data_h3[data_h3['Num Competitors'] == 3]['Average Salary']

if not group_0_competitors.empty and not group_3_competitors.empty:
    # Perform independent samples t-test
    # We assume unequal variances (Welch's t-test) as it's more robust
    t_statistic, p_value_h3 = stats.ttest_ind(group_0_competitors, group_3_competitors, equal_var=False) # Welch's t-test
    print(f"T-statistic: {t_statistic:.4f}")
    print(f"P-Value (Hypothesis 3): {p_value_h3:.4f}")

    # Determine significance
    alpha = 0.05
    if p_value_h3 < alpha:
        print(f"Conclusion: Reject the Null Hypothesis. There is a statistically significant difference in average salary (p < {alpha}).")
    else:
        print(f"Conclusion: Fail to reject the Null Hypothesis. There is no statistically significant difference in average salary (p >= {alpha}).")
else:
    print("Not enough data to perform Hypothesis 3 test (one or both competitor groups are empty after dropping NAs).")

##### Which statistical test have you done to obtain P-Value?

Independent Samples t-test (Welch's t-test).

##### Why did you choose the specific statistical test?

An Independent Samples t-test is used to compare the means of a continuous variable ('Average Salary') between two independent groups ('Num Competitors' being 0 vs. 3). Welch's t-test (equal_var=False) is preferred as it does not assume equal variances between the two groups, making it more robust.
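
The sketch below shows what `equal_var=False` computes: the Welch t-statistic and the Welch-Satterthwaite degrees of freedom, worked out by hand and compared with SciPy. The two groups are invented sample values, not salaries from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical salary samples (in $K) for two groups of unequal size and spread
group_a = np.array([95.0, 102.0, 88.0, 110.0, 99.0, 105.0])
group_b = np.array([120.0, 85.0, 140.0, 101.0])

n_a, n_b = len(group_a), len(group_b)
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)  # sample variances

# Welch's t: difference in means divided by the unpooled standard error
se_sq = var_a / n_a + var_b / n_b
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(se_sq)

# Welch-Satterthwaite approximation for the degrees of freedom
dof = se_sq**2 / ((var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), dof)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```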

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# @title Handling Missing Values & Missing Value Imputation

print("Starting ML Data Preprocessing")

print(f"\nInitial DataFrame shape for ML preprocessing: {df.shape}")


print("\nHandling Missing Values & Imputation...")

#Numerical columns with potential NaNs after parsing
numerical_cols_for_imputation = [
    'Min Salary', 'Max Salary', 'Average Salary',
    'Min_Hourly_Salary', 'Max_Hourly_Salary', 'Average_Hourly_Salary',
    'Min Revenue', 'Max Revenue', 'Average Revenue', 'Rating', 'Years Old',
    'Num Competitors'
]
# Impute numerical columns with their median (more robust to outliers than mean)
for col in numerical_cols_for_imputation:
    if col in df.columns and df[col].isnull().any():
        # Convert to numeric first, coercing errors, in case some non-NA values are still objects
        df[col] = pd.to_numeric(df[col], errors='coerce')
        median_val = df[col].median()
        df[col] = df[col].fillna(median_val)  # assign back; inplace fillna on a column selection is unreliable in recent pandas
        print(f"  - Imputed missing values in '{col}' with its median: {median_val:.2f}")

# 'Employee Count Range' and 'State' already handle 'Unknown' during parsing
# 'Type of ownership', 'Industry', 'Sector', 'Headquarters', 'Job Title' - check if any have empty strings that should be 'Unknown'
categorical_cols_for_na_check = ['Type of ownership', 'Industry', 'Sector', 'Headquarters']
for col in categorical_cols_for_na_check:
    if col in df.columns and (df[col].astype(str).str.strip() == '').any():
        df[col] = df[col].astype(str).str.strip().replace('', 'unknown')  # strip first so whitespace-only values are also replaced
        print(f"  - Replaced empty strings in '{col}' with 'unknown'.")

print("\nMissing values after imputation:")
print(df.isnull().sum())


#### What all missing value imputation techniques have you used and why did you use those techniques?

**1. Median Imputation (for numerical columns):**

Missing numerical values (Min Salary, Max Salary, Average Salary, Min Revenue, Max Revenue, Average Revenue, Rating, Years Old, Num Competitors, and the hourly salary columns) are filled with the median of their respective columns.

**Why chosen:** The median is more robust to outliers than the mean. A few extreme values can pull the mean far from the typical value, whereas the median, as the middle value, is barely affected and gives a more representative central estimate for imputation. Note that this step runs before the outlier handling in the next section, so placeholder values (a Rating of -1, Years Old near 2000) still count toward these medians.


**2. Replacing Empty Strings/Placeholders with 'unknown' (for categorical columns):**

For categorical columns (Type of ownership, Industry, Sector, Headquarters), empty strings are replaced with the string 'unknown'. Other placeholders for missing data (e.g., -1 in Size and Revenue) were already converted to 'Unknown' or pd.NA in earlier cleaning steps.

**Why chosen:** This approach preserves the information that the value was originally missing or unknown. It allows these unknown entries to be treated as a distinct category during encoding, rather than being dropped or causing errors.
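
A minimal sketch of both techniques on a toy frame, assuming made-up values: one extreme revenue figure drags the mean far away from the typical value while the median stays put, and blank category strings become an explicit 'unknown' level.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing number, one extreme value, and blank categories
toy = pd.DataFrame({
    "Average Revenue": [10.0, 12.0, np.nan, 11.0, 500.0],  # hypothetical units
    "Sector": ["Finance", "  ", "", "Health Care", "Finance"],
})

mean_val = toy["Average Revenue"].mean()      # pulled up by the 500.0 outlier
median_val = toy["Average Revenue"].median()  # stays near the typical values
print(f"mean = {mean_val}, median = {median_val}")

# Median imputation, assigned back to the column
toy["Average Revenue"] = toy["Average Revenue"].fillna(median_val)

# Strip whitespace first, then map empty strings to an explicit category
toy["Sector"] = toy["Sector"].str.strip().replace("", "unknown")
print(toy)
```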

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# @title Handling Outliers & Outlier treatments
print("\n\nHandling Outliers & Outlier treatments")

#Outlier treatment strategy: Capping/Clipping values using IQR or Z-score method
#Let's apply a simple capping method for 'Years Old', 'Rating', and Salary/Revenue features

# Define a function to cap outliers using IQR method (for general numerical columns)
def cap_outliers_iqr(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return series.clip(lower=lower_bound, upper=upper_bound)

# Apply specific treatment for 'Years Old' outliers (values around 2000)
if 'Years Old' in df.columns:
    # Based on previous charts, values near 2000 are placeholders (from Founded=0), not real company ages.
    # Treat them as missing and re-impute with the median of the remaining, valid ages.
    initial_anomalies = df[df['Years Old'] > 1000].shape[0] # Count original high anomalies
    df.loc[df['Years Old'] > 1000, 'Years Old'] = np.nan # Convert extreme outliers to NaN
    # Re-impute these new NaNs in 'Years Old' using the median of valid ages
    if df['Years Old'].isnull().any():
        median_years_old = df['Years Old'].median()
        df['Years Old'] = df['Years Old'].fillna(median_years_old)
        print(f"  - Handled {initial_anomalies} extreme 'Years Old' outliers, replaced with median: {median_years_old:.2f}")

# Apply specific treatment for 'Rating' outliers (values of -1)
if 'Rating' in df.columns:
    # Based on previous charts, -1 is an anomaly. Replace with NaN and re-impute.
    initial_rating_anomalies = df[df['Rating'] < 0].shape[0] # Count original negative ratings
    df.loc[df['Rating'] < 0, 'Rating'] = np.nan
    if df['Rating'].isnull().any():
        median_rating = df['Rating'].median()
        df['Rating'] = df['Rating'].fillna(median_rating)
        print(f"  - Handled {initial_rating_anomalies} negative 'Rating' outliers, replaced with median: {median_rating:.2f}")


# Apply IQR capping to other potentially skewed numerical columns
for col in ['Average Salary', 'Average Revenue', 'Min Salary', 'Max Salary',
            'Min Revenue', 'Max Revenue', 'Num Competitors']:
    if col in df.columns and pd.api.types.is_numeric_dtype(df[col]):
        old_min = df[col].min()
        old_max = df[col].max()
        df[col] = cap_outliers_iqr(df[col])
        new_min = df[col].min()
        new_max = df[col].max()
        if old_min != new_min or old_max != new_max:
            print(f"  - Capped outliers for '{col}': Range changed from ({old_min:.2f}, {old_max:.2f}) to ({new_min:.2f}, {new_max:.2f})")

print("\nOutlier treatment applied.")

##### What all outlier treatment techniques have you used and why did you use those techniques?

**1. Specific Value Replacement for Anomalies:**

* **For Years Old:** Values greater than 1000 (which stem from Founded=0 anomalies) are first marked as missing and then immediately re-imputed with the median of the valid Years Old values.

* **For Rating:** Values less than 0 (i.e., -1 anomalies) are marked as missing and then re-imputed with the median of the valid Rating values.

These are specific, domain-knowledge-driven treatments for clear data entry errors or placeholders identified in the previous visualizations. Marking them as missing and re-imputing with the median (rather than capping) replaces fundamentally incorrect values with a central estimate from the valid data, preventing them from skewing statistics or models.

**2. IQR-based Capping (for other numerical columns):**

* For numerical columns like Average Salary, Average Revenue, Min/Max Salary/Revenue, and Num Competitors, the Interquartile Range (IQR) method is applied. Values below Q1−1.5×IQR are capped at the lower bound, and values above Q3+1.5×IQR are capped at the upper bound.
* The IQR method is a common and relatively robust way to identify and handle outliers in skewed distributions. It caps extreme values to within a more reasonable range, preventing them from having disproportionate influence on statistical models while retaining the majority of the data.
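
As a worked example of the capping rule, the sketch below applies the same IQR logic to six toy values. The `k` parameter is a hypothetical addition for illustration; the notebook's function fixes the multiplier at 1.5.

```python
import pandas as pd

def cap_outliers_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy values: Q1 = 12.5, Q3 = 17.5, IQR = 5, so the bounds are 5 and 25
salaries = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 100.0])
capped = cap_outliers_iqr(salaries)
print(capped.tolist())  # only the 100.0 entry moves, down to the upper bound
```

Capping keeps every row, which matters for a small dataset, but it also flattens genuine high values, so it is worth checking how many points were changed per column.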

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

# @title Encode your categorical columns
print("Encoding Categorical Columns")

# Identify categorical columns for one-hot encoding
# Exclude original columns that were replaced (like 'Size', 'Revenue', 'Salary Estimate', 'Competitors')
# Exclude text columns that will be handled separately ('Job Description', 'Job Title')
# 'State' and 'Employee Count Range' are already cleaned and ready.
categorical_cols_for_encoding = [
    'Type of ownership',
    'Industry',
    'Sector',
    'Headquarters',
    'State',
    'Employee Count Range'
]

encoded_dfs = [] # Create a list to store DataFrames of one-hot encoded columns

# Apply One-Hot Encoding
for col in categorical_cols_for_encoding:
    if col in df.columns and df[col].dtype == 'object':
        print(f"  - One-hot encoding '{col}'")
        # Handle unknown categories during transform if they appear in unseen data
        encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
        encoded_data = encoder.fit_transform(df[[col]])
        # Create a DataFrame from the encoded data
        encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out([col]), index=df.index)
        encoded_dfs.append(encoded_df)
    elif col in df.columns:
        print(f"  - Skipping '{col}' for one-hot encoding, not an object dtype: {df[col].dtype}")

# Concatenate all encoded DataFrames with the original DataFrame
if encoded_dfs:
    df = pd.concat([df] + encoded_dfs, axis=1)
    # Drop the original categorical columns after encoding
    df.drop(columns=categorical_cols_for_encoding, inplace=True, errors='ignore')
    print("Categorical columns encoded and original columns dropped.")
else:
    print("No categorical columns found for encoding or already processed.")

# Check the new shape and head after encoding
print(f"DataFrame shape after categorical encoding: {df.shape}")
print("New columns created by One-Hot Encoding:")
df.columns[df.columns.str.contains('Type of ownership_')].tolist()

#### What all categorical encoding techniques have you used & why did you use those techniques?

**One-Hot Encoding:**

Categorical columns (Type of ownership, Industry, Sector, Headquarters, State, Employee Count Range) are transformed using OneHotEncoder. This converts each category within a column into a new binary (0 or 1) column.

* No Ordinal Relationship: One-Hot Encoding is appropriate because these categorical variables do not have an inherent ordinal (ranked) relationship (e.g., 'Company - Private' is not "greater than" 'Government'). Assigning arbitrary numbers would incorrectly imply such a relationship.
* Algorithm Compatibility: Most machine learning algorithms (like Random Forests, Gradient Boosting Machines, SVMs) require numerical input and cannot directly process string categories. One-Hot Encoding creates a suitable numerical representation.
* Preserves Information: It creates a distinct feature for each category, preventing the algorithm from inferring incorrect relationships.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# @title Textual Data Preprocessing includes: Lower Casing, Contraction Expansion, URL/Digit/Punctuation/Whitespace Removal
print("Textual Data Preprocessing for 'Job Description' and 'Job Title'")

# Contractions mapping
CONTRACTION_MAP = {
    "ain't": "am not", "aren't": "are not", "can't": "cannot", "can't've": "cannot have",
    "could've": "could have", "couldn't": "could not", "didn't": "did not",
    "doesn't": "does not", "don't": "do not", "hadn't": "had not",
    "hasn't": "has not", "haven't": "have not", "he'd": "he would",
    "he'll": "he will", "he's": "he is", "how'd": "how did",
    "how'll": "how will", "how's": "how is", "i'd": "i would",
    "i'll": "i will", "i'm": "i am", "i've": "i have",
    "isn't": "is not", "it'd": "it would", "it'll": "it will",
    "it's": "it is", "let's": "let us", "ma'am": "madam",
    "mayn't": "may not", "might've": "might have", "mightn't": "might not",
    "must've": "must have", "mustn't": "must not", "needn't": "need not",
    "o'clock": "of the clock", "oughtn't": "ought not", "shan't": "shall not",
    "she'd": "she would", "she'll": "she will", "she's": "she is",
    "should've": "should have", "shouldn't": "should not", "so've": "so have",
    "so's": "so is", "that'd": "that would", "that's": "that is",
    "there'd": "there would", "there's": "there is", "they'd": "they would",
    "they'll": "they will", "they're": "they are", "they've": "they have",
    "wasn't": "was not", "we'd": "we would", "we'll": "we will",
    "we're": "we are", "we've": "we have", "weren't": "were not",
    "what'll": "what will", "what're": "what are", "what's": "what is",
    "what've": "what have", "when's": "when is", "where'd": "where did",
    "where's": "where is", "who'd": "who would", "who'll": "who will",
    "who's": "who is", "who've": "who have", "why's": "why is",
    "won't": "will not", "would've": "would have", "wouldn't": "would not",
    "you'd": "you would", "you'll": "you will", "you're": "you are",
    "you've": "you have"
}

# Function to expand contractions
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    # Escape the keys and try longer ones first so "can't've" is not matched as "can't"
    keys = sorted(contraction_mapping.keys(), key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(k) for k in keys)),
                                      flags=re.IGNORECASE|re.DOTALL)
    def replace_match(match):
        return contraction_mapping[match.group(0).lower()]
    return contractions_pattern.sub(replace_match, text)

# Function for comprehensive text cleaning
def clean_text(text):
    text = str(text).lower()                                                    #Lower Casing (already done in earlier steps for these columns, but good to ensure)

    text = expand_contractions(text)                                            #Expand contractions

    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)     #Removing URLs
    text = re.sub(r'\S*\d\S*', '', text).strip()                                #Removing tokens that contain digits, including standalone numbers (e.g., 'job123', '2025')

    text = re.sub(r'[^\w\s]', '', text)                                         #Remove Punctuations

    text = re.sub(r'\s+', ' ', text).strip()                                    #Removing White spaces (multiple spaces to single)

    return text

# Apply cleaning to relevant text columns
text_processing_cols = ['Job Description', 'Job Title']
for col in text_processing_cols:
    if col in df.columns:
        print(f"  - Applying basic cleaning to '{col}' (Lower Casing, Contraction Expansion, URL/Digit/Punctuation/Whitespace Removal)...")
        df[f'Cleaned_{col}'] = df[col].apply(clean_text)
    else:
        print(f"  - Warning: Column '{col}' not found for text processing.")
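
One detail worth checking in alternation-based contraction patterns is key order: a regex alternation takes the first alternative that matches, so a short key listed before a longer one that starts with it wins. The standalone sketch below uses a two-entry mapping and hypothetical helper names (`build_pattern`, `expand`) to show the difference.

```python
import re

# Two overlapping keys: the shorter one is a prefix of the longer one
mapping = {"can't": "cannot", "can't've": "cannot have"}

def build_pattern(keys):
    """Compile an alternation of escaped keys, in the order given."""
    return re.compile("|".join(re.escape(k) for k in keys), flags=re.IGNORECASE)

def expand(text, pattern):
    """Replace each matched contraction with its expansion."""
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

naive = build_pattern(mapping.keys())                             # insertion order
ordered = build_pattern(sorted(mapping, key=len, reverse=True))   # longest key first

text = "you can't've known"
print(expand(text, naive))    # shorter key matches first, leaving "'ve" behind
print(expand(text, ordered))  # longer key is tried first
```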

#### 2. Lower Casing

In [None]:
# Lower Casing
#included in above cell (Expand Contraction)

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
#included in above cell (Expand Contraction)

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits
#included in above cell (Expand Contraction)

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
# @title Removing Stopwords & Tokenization
stop_words = set(stopwords.words('english'))
print("  - Tokenizing and removing stopwords...")
for col in text_processing_cols:
    if f'Cleaned_{col}' in df.columns:
        df[f'Tokenized_{col}'] = df[f'Cleaned_{col}'].apply(lambda x: [word for word in word_tokenize(x) if word not in stop_words])

In [None]:
# Remove White spaces
#included in above cell (Expand Contraction)

#### 6. Rephrase Text

In [None]:
# Rephrase Text

print("\nNote on 'Rephrase Text': This is an advanced NLP task usually performed by Large Language Models (LLMs) or complex rule-based systems, \
not a standard text preprocessing step for feature engineering. No automated code for rephrasing is included here.")

#### 7. Tokenization

In [None]:
# Tokenization
#included in "5. Removing Stopwords & Removing White spaces"

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
# @title Normalizing Text (Stemming, Lemmatization)
lemmatizer = WordNetLemmatizer()
print("  - Lemmatizing text...")
for col in text_processing_cols:
    if f'Tokenized_{col}' in df.columns:
        df[f'Lemmatized_{col}'] = df[f'Tokenized_{col}'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

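# Stemming (lemmatization is often preferred for meaning preservation)
# Run inside its own loop so every text column is stemmed, not just the last value of `col`
stemmer = PorterStemmer()
for col in text_processing_cols:
    if f'Tokenized_{col}' in df.columns:
        df[f'Stemmed_{col}'] = df[f'Tokenized_{col}'].apply(lambda x: [stemmer.stem(word) for word in x])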

##### Which text normalization technique have you used and why?

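**Lemmatization (WordNetLemmatizer):**

Lemmatization is the technique carried forward: the lemmatized tokens feed the POS tagging and TF-IDF steps. It maps each token to a dictionary form (for example, 'companies' becomes 'company'), so the resulting vocabulary stays readable and related word forms share one feature. Because lemmatize() is called without a part-of-speech argument, tokens are treated as nouns by default, so most verb forms are left unchanged.

Porter stemming is also computed in the cell above, but its output is not used downstream. Stemming is faster, yet it strips suffixes by rule and can produce truncated non-words, which makes the resulting features harder to interpret.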

#### 9. Part of speech tagging

In [None]:
# @title POS Tagging
print("  - Performing POS Tagging...")
for col in text_processing_cols:
    if f'Lemmatized_{col}' in df.columns:
        df[f'POS_Tagged_{col}'] = df[f'Lemmatized_{col}'].apply(lambda x: pos_tag(x))


#### 10. Text Vectorization

In [None]:
# @title Vectorizing Text (using TF-IDF)
print("\nVectorizing Text using TF-IDF...")

# Convert list of lemmatized words back to string for TF-IDF
for col in text_processing_cols:
    if f'Lemmatized_{col}' in df.columns:
        df[f'Lemmatized_Joined_{col}'] = df[f'Lemmatized_{col}'].apply(lambda x: ' '.join(x))

# TF-IDF Vectorization for Job Description
if 'Lemmatized_Joined_Job Description' in df.columns:
    tfidf_vectorizer_desc = TfidfVectorizer(max_features=1000) # Limit features to avoid very high dimensionality
    tfidf_desc_matrix = tfidf_vectorizer_desc.fit_transform(df['Lemmatized_Joined_Job Description'])
    tfidf_desc_df = pd.DataFrame(tfidf_desc_matrix.toarray(),
                                 columns=[f'tfidf_desc_{word}' for word in tfidf_vectorizer_desc.get_feature_names_out()],
                                 index=df.index)
    df = pd.concat([df, tfidf_desc_df], axis=1)
    print(f"  - TF-IDF vectors created for 'Job Description'. Added {tfidf_desc_df.shape[1]} features.")

# TF-IDF Vectorization for Job Title
if 'Lemmatized_Joined_Job Title' in df.columns:
    tfidf_vectorizer_title = TfidfVectorizer(max_features=200) # Smaller limit for shorter text
    tfidf_title_matrix = tfidf_vectorizer_title.fit_transform(df['Lemmatized_Joined_Job Title'])
    tfidf_title_df = pd.DataFrame(tfidf_title_matrix.toarray(),
                                  columns=[f'tfidf_title_{word}' for word in tfidf_vectorizer_title.get_feature_names_out()],
                                  index=df.index)
    df = pd.concat([df, tfidf_title_df], axis=1)
    print(f"  - TF-IDF vectors created for 'Job Title'. Added {tfidf_title_df.shape[1]} features.")


##### Which text vectorization technique have you used and why?

**TF-IDF (Term Frequency-Inverse Document Frequency) Vectorization:**

TF-IDF is used to convert the cleaned and lemmatized text (the Lemmatized_Joined_Job Description and Lemmatized_Joined_Job Title columns) into a numerical matrix. Each row in the matrix represents a document (job posting), and each column represents a unique word (term) from the vocabulary. The value in each cell is a score for the word's importance in that document, which reflects:
* **Term Frequency (TF):** How often a word appears in a document.
* **Inverse Document Frequency (IDF):** How rare or unique a word is across the entire collection of documents. Words common across many documents (like "the", "a" which are mostly removed by stopwords) get lower IDF scores, while unique or domain-specific words get higher IDF scores.

**Reason:**
* **Captures Importance:** TF-IDF goes beyond simple word counts (like Bag-of-Words) by giving higher weights to words that are important in a specific document but rare across the entire dataset. This helps highlight distinguishing terms for job descriptions or titles.
* **Dimensionality Reduction (with max_features):** It allows for controlling the dimensionality of the output vector by setting max_features (e.g., 1000 for job descriptions, 200 for job titles). This is crucial for managing the size of the input to ML models and preventing the "curse of dimensionality" when dealing with text.
* **Algorithm Compatibility:** The numerical vectors generated by TF-IDF are directly usable as input features for standard machine learning algorithms like Random Forests, Gradient Boosting Machines, and SVMs.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# @title Verification + Feature Manipulation
numerical_cols_to_check = [
    'Min Salary', 'Max Salary', 'Average Salary',
    'Min_Hourly_Salary', 'Max_Hourly_Salary', 'Average_Hourly_Salary',
    'Min Revenue', 'Max Revenue', 'Average Revenue', 'Rating', 'Years Old',
    'Num Competitors', 'Has Competitors Listed'
]
for col in numerical_cols_to_check:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
        # Re-impute any NaNs created by coerce if needed, though previous step should have handled
        if df[col].isnull().any():
            df[col] = df[col].fillna(df[col].median())  # assign back instead of inplace fillna on a column selection
            print(f"  - Re-imputed new NaNs in '{col}' with median after re-checking numeric conversion.")

TARGET_COLUMN = 'Average Salary' # Example: Change to 'Rating' or a binned category if needed

#print(f"Columns:{df.columns}")

if TARGET_COLUMN not in df.columns:
    print(f"Error: Target column '{TARGET_COLUMN}' not found in DataFrame.")
    print("Please ensure your target column exists after previous cleaning steps.")
    raise SystemExit("Target column not found.")

#Manipulate Features to minimize feature correlation and create new features
print("\nManipulating Features (Correlation Reduction & New Features)...")

# --- Minimize Feature Correlation ---
# Based on the heatmap, 'Min Salary', 'Max Salary' are highly correlated with 'Average Salary'.
# kept 'Average Salary', 'Average Revenue', 'Years Old', 'Num Competitors' as primary.

# Drop redundant highly correlated features for model simplicity and to avoid multicollinearity
columns_to_drop_for_correlation = [
    'Min Salary', 'Max Salary',                                                 # Keep 'Average Salary'
    'Min Revenue', 'Max Revenue',                                               # Keep 'Average Revenue'
    'Founded',                                                                  # Keep 'Years Old'
    'Has Competitors Listed',                                                   # Keep 'Num Competitors'
    # Drop hourly specific columns if we decide to model only annual salary.
    'Min_Hourly_Salary', 'Max_Hourly_Salary', 'Average_Hourly_Salary',          # Drop if not directly used as independent features in this model
]

# Drop the intermediate text processing columns as we have TF-IDF features now
text_processing_intermediate_cols = [
    'Job Description', 'Job Title', # Original columns
    'Cleaned_Job Description', 'Tokenized_Job Description', 'Lemmatized_Job Description',
    'Stemmed_Job Description', 'POS_Tagged_Job Description', 'Lemmatized_Joined_Job Description',
    'Cleaned_Job Title', 'Tokenized_Job Title', 'Lemmatized_Job Title',
    'Stemmed_Job Title', 'POS_Tagged_Job Title', 'Lemmatized_Joined_Job Title'
]

# Combine all columns to drop
all_columns_to_drop = list(set(columns_to_drop_for_correlation + text_processing_intermediate_cols))

# Remove the target column from the list of columns to drop if it accidentally got in
if TARGET_COLUMN in all_columns_to_drop:
    all_columns_to_drop.remove(TARGET_COLUMN)


# Filter columns to drop to only include those actually in the DataFrame
final_cols_to_drop = [col for col in all_columns_to_drop if col in df.columns]

# Perform the drop operation
df_processed = df.drop(columns=final_cols_to_drop, errors='ignore')
#print(f"\nColumns:{df_processed}\n")
print(f"  - Dropped highly correlated/redundant and intermediate text columns. New shape: {df_processed.shape}")


# Create New Features
# Interaction term between 'Rating' and 'Average Salary'.
# Caution: 'Average Salary' is the TARGET_COLUMN, so this feature leaks the target and should be
# excluded from X (or rebuilt from non-target columns) before model training.
if 'Rating' in df_processed.columns and 'Average Salary' in df_processed.columns:
    df_processed['Rating_x_Salary'] = df_processed['Rating'] * df_processed['Average Salary']
    print("  - Created new feature: 'Rating_x_Salary'")

# Example: Binning 'Average Salary' for potential classification tasks or simpler analysis
# ('Salary_Bracket' is derived from the target, so it is for categorical insights, not a model input)
if 'Average Salary' in df_processed.columns:
    max_salary = df_processed['Average Salary'].max()
    bins = [0, 50000, 75000, 100000, 125000, 150000, 200000]
    labels = ['<50K', '50-75K', '75-100K', '100-125K', '125-150K', '150-200K']

    # Bins are left-closed and right-open (right=False), so the last edge must exceed max_salary;
    # a value exactly equal to the last edge would otherwise fall outside every bin.
    if max_salary >= bins[-1]:
        bins.append(max_salary + 1)
        labels.append(f'>={bins[-2]}')
    elif max_salary < bins[1]:
        # All salaries fall below the first inner edge: use a single bin
        bins = [0, max_salary + 1]
        labels = [f'<={max_salary}']
    else:
        # Keep only the bins needed to cover max_salary
        for i in range(len(bins) - 1):
            if max_salary < bins[i+1]:
                bins = bins[:i+2]
                labels = labels[:i+1]
                break

    df_processed['Salary_Bracket'] = pd.cut(df_processed['Average Salary'], bins=bins, labels=labels, right=False)
    print("  - Created new feature: 'Salary_Bracket' (for potential categorical analysis).")

#### 2. Feature Selection

In [None]:
# @title Select your features wisely to avoid overfitting
print("Selecting Features Wisely...")

# Define the features (X) and target (y): X will be all columns in df_processed except the TARGET_COLUMN and any other non-feature columns
# Columns derived from the target are excluded as well to avoid target leakage
target_derived_cols = ['Rating_x_Salary', 'Salary_Bracket']
X = df_processed.drop(columns=[TARGET_COLUMN] + target_derived_cols, errors='ignore')
y = df_processed[TARGET_COLUMN]

# Identify all numerical columns (including original numerics & TF-IDF)
numerical_features = X.select_dtypes(include=np.number).columns.tolist()

# Identify all One-Hot Encoded (OHE) columns
# robust way is to remember the prefixes from the OHE step:
ohe_prefixes = ['Type of ownership_', 'Industry_', 'Sector_', 'Headquarters_', 'State_', 'Employee Count Range_']
ohe_features = [col for col in X.columns if any(col.startswith(p) for p in ohe_prefixes)]

# Identify boolean features like 'Is_Hourly'
boolean_features = [col for col in X.columns if pd.api.types.is_bool_dtype(X[col])]

# Combine all relevant features, keeping the original column order so runs are reproducible
selected_features = set(numerical_features + ohe_features + boolean_features)
final_features = [col for col in X.columns if col in selected_features]

X = X[final_features]

print(f"  - Selected {len(X.columns)} features for the ML model.")
print(f"  - Features for X: {X.columns.tolist()}")
print(f"  - Target variable (y): '{TARGET_COLUMN}'")

##### What all feature selection methods have you used  and why?

**Domain Knowledge/Manual Feature Selection (Implicit based on prior cleaning/engineering):**

* This was applied when deciding which columns to drop after the initial cleaning and feature engineering (Manipulate Features step 1). For instance, Min Salary and Max Salary were dropped in favor of Average Salary because they are highly correlated and Average Salary provides a consolidated view. Similarly, Founded was dropped as Years Old provides the same information in a more intuitive format for age. Has Competitors Listed was dropped in favor of Num Competitors due to high correlation. Intermediate text processing columns (Cleaned_Job Description, Tokenized_Job Description, etc.) were also dropped as their processed TF-IDF vectors became the actual features.

* **Reason:**
  * **Redundancy Reduction (Multicollinearity):** High correlation between features can lead to multicollinearity, which can make a model's coefficients unstable and harder to interpret, especially for linear models. For tree-based models (like Random Forest or XGBoost), while less sensitive, it still helps simplify the model and can slightly improve performance.
  * **Interpretability:** Keeping one representative feature (e.g., Average Salary instead of Min/Max Salary) makes the model more interpretable and easier to understand its relationship with the target.
  * **Noise Reduction & Simplification:** Removing redundant or intermediate features reduces the overall dimensionality, potentially speeding up training and reducing noise.
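
As a rough illustration of the redundancy argument above, the sketch below lists feature pairs whose absolute Pearson correlation exceeds a threshold. The data are synthetic (only the column names mirror this notebook), and the helper name `highly_correlated_pairs` and the 0.9 threshold are assumptions made for the example, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |corr|) for every column pair above the threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

# Synthetic stand-in data: Max Salary is built from Min Salary, Rating is independent
rng = np.random.default_rng(0)
min_salary = rng.uniform(40, 120, size=200)
corr_demo = pd.DataFrame({
    "Min Salary": min_salary,
    "Max Salary": min_salary * 1.5 + rng.normal(0, 2, size=200),
    "Rating": rng.uniform(1, 5, size=200),
})
corr_demo["Average Salary"] = (corr_demo["Min Salary"] + corr_demo["Max Salary"]) / 2

redundant_pairs = highly_correlated_pairs(corr_demo)
for a, b, c in redundant_pairs:
    print(f"{a} <-> {b}: |corr| = {c:.3f}")
```

In this toy frame the three salary columns flag each other while Rating does not, which is the situation in which keeping a single representative column is reasonable.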

**Explicit Feature Selection based on Data Type and Purpose:**
* Explicitly constructed X by including numerical features, one-hot encoded categorical features, and boolean features (like Is_Hourly). All other columns (original text, intermediate text, or other non-feature columns) were excluded.
* Reason:
  * **Algorithm Compatibility:** Machine learning algorithms require numerical input. This step ensures that only the appropriate numerical representations (original numerics, encoded categories, vectorized text features) are passed to the model.
  * **Avoiding Data Leakage/Target Leakage:** It explicitly separates features (X) from the target variable (y) to prevent target leakage, which occurs when information from the target variable is inadvertently included in the features.
  * **Removing Non-Predictive/Non-Processable Columns:** It removes columns that are not meant to be features for the model (e.g., the original Job Description string after its TF-IDF vectorization).

##### Which all features you found important and why?

1. **Average Salary (Target Variable):** This is the target variable itself, not an input feature. Its distribution (skewness) and relationships with other features are critical for model building.

2. **Rating:** While its direct linear correlation with Average Salary was weak (around 0.11), company rating is a crucial indicator of employee satisfaction and overall company quality. It might have non-linear relationships or interact with other features. It's often considered a strong influencing factor for job seekers.

3. **Average Revenue:** Showed a weak positive linear correlation with Average Salary (around 0.23). Larger, more successful companies (with higher revenue) often have the capacity to offer higher salaries. This is a key business metric.

4. **Years Old:** Showed a moderate negative correlation with Rating and Num Competitors. It might influence salary based on whether a company is a stable, established enterprise or a fast-growing startup. Older companies might have more structured pay scales, while startups might offer equity. (Remember to handle the Years Old anomalies during analysis).

5. **Num Competitors:** Indicates the competitive landscape for talent. A high number of competitors might drive salaries up in an attempt to attract candidates, or conversely, a niche role with few competitors might have different salary dynamics.

6. **TF-IDF Features (from Job Description and Job Title):** These are arguably among the most critical features for salary prediction in a jobs dataset.
  * Job Title directly describes the role, and salary heavily depends on the specific job.
  * Job Description contains details about required skills, experience, responsibilities, and technologies. Specific high-demand skills mentioned in the description (e.g., "Python", "Machine Learning", "Cloud") are highly likely to influence salary. The TF-IDF method effectively captures the importance of these keywords.

7. **One-Hot Encoded Categorical Features (e.g., Industry, Sector, State, Employee Count Range, Type of ownership):**
  * **Industry/Sector:** Salaries vary significantly across different industries and sectors. Tech roles often pay more than, say, non-profit roles.
  * **State:** Location is a major determinant of salary due to cost of living and regional market demand.
  * **Employee Count Range:** Company size often correlates with compensation structures and benefit packages. Larger companies might offer higher base salaries but less equity compared to smaller ones.
  * **Type of ownership:** Public vs. private, non-profit, etc., can influence compensation strategy.

  
While the heatmap shows only linear correlations, a machine learning model like Random Forest or XGBoost can uncover more complex, non-linear relationships and interactions between these features that are not evident in a simple correlation matrix. The feature importance scores from such models (after training) provide a model-based ranking of how much each feature contributes to predicting Average Salary.
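
The point about non-linear relationships can be sketched on synthetic data. In the example below, the feature names, the quadratic target, and the choice of `RandomForestRegressor` are all assumptions made for illustration; it is not the notebook's model or data. One feature drives the target through a squared term, so its linear correlation is close to zero even though a tree ensemble assigns it almost all of the importance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n_rows = 500

# Hypothetical features: one with a purely quadratic effect, one unrelated to the target
imp_X = pd.DataFrame({
    "nonlinear_signal": rng.uniform(-1, 1, n_rows),
    "pure_noise": rng.uniform(-1, 1, n_rows),
})
imp_y = imp_X["nonlinear_signal"] ** 2 + rng.normal(0, 0.02, n_rows)

# Pearson correlation of each feature with the target (what a heatmap would show)
linear_corr = imp_X.corrwith(imp_y)

# Impurity-based importances from a tree ensemble
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(imp_X, imp_y)
importances = pd.Series(forest.feature_importances_, index=imp_X.columns)

print("Linear correlation with target:\n", linear_corr.round(3))
print("Random forest importances:\n", importances.round(3))
```

Impurity-based importances have their own biases (for example toward high-cardinality features), so they are best read as one view among several rather than a final verdict.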

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?


Answer: Yes, based on the nature of typical salary and revenue data, and explicitly addressed in the provided code, the data often needs to be transformed, specifically for highly skewed numerical features.

**Logarithmic transformation (specifically np.log1p which is log(1+x)) for highly skewed continuous numerical features.**

* **Nature of Salary/Revenue Data:** Salary and revenue data (like your Average Salary and Average Revenue columns) typically exhibit a right-skewed distribution. This means there's a long tail of higher values (a few companies/jobs have very high salaries/revenues) while most values are concentrated at the lower end.

* **Impact of Skewness on Machine Learning Models:**

  * **Assumption of Linearity/Normality:** Many linear models (e.g., Linear Regression) work best when the relationship between features and target is roughly linear and the residuals are roughly symmetric. Highly skewed inputs often make both conditions harder to satisfy.
  * **Impact on Model Performance:** Skewed data can lead to:
    * **Biased Models:** Models might give disproportionate weight to outlier values.
    * **Slower Convergence:** Optimization algorithms might take longer to converge.
    * **Reduced Predictive Power:** The model might struggle to capture the underlying relationships effectively.
  * **Visual Interpretation:** Skewed data can make visualizations harder to interpret.

* **How Log Transformation Helps (np.log1p):**

  * **Reduces Skewness:** Applying a logarithmic transformation compresses the larger values and stretches out the smaller values, effectively making the distribution more symmetrical and closer to a normal distribution.
  * **Stabilizes Variance:** It can help in stabilizing the variance across the range of the data, which is beneficial for models sensitive to heteroscedasticity (unequal variance).
  * **Handles Zeroes:** np.log1p(x) computes log(1+x). This is crucial because standard np.log(x) would produce negative infinity for zero values. np.log1p gracefully handles zeros by transforming them into 0, which is suitable for features that might contain zeros (e.g., revenue).
  * **Interpretability (Post-Transformation):** While the raw values are transformed, the relationships are often made more linear, which can be easier for models to learn. The inverse transformation (np.expm1, which computes e^x − 1) can be used to convert values back to the original scale.

In the code, the transformation step specifically checks for features with an absolute skewness greater than 0.75 (a common threshold) and applies np.log1p to them. This ensures that only the features that genuinely benefit from this transformation are affected, leading to better model performance and robustness.
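
A minimal sketch of the effect described above, on synthetic data: the log-normal "revenue" series is an assumption standing in for a right-skewed column, and the 0.75 threshold is the one used in this notebook. It checks the skewness before and after np.log1p and confirms that np.expm1 recovers the original values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed column (log-normal), standing in for revenue-like data
skew_demo = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000))

skew_before = skew_demo.skew()
log_demo = np.log1p(skew_demo)      # log(1 + x)
skew_after = log_demo.skew()

# expm1 is the exact inverse of log1p: exp(x) - 1
restored_demo = np.expm1(log_demo)

print(f"Skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
print("Round trip recovers original values:", np.allclose(restored_demo, skew_demo))
print("log1p(0) =", np.log1p(0.0))  # zero maps to zero instead of -inf
```

The improvement depends on the data: a column with a large spike of exact zeros next to very large values can remain skewed after the transform, so re-checking the skewness afterwards is worthwhile.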

In [None]:
# Transform Your data
# @title Data Transformation (Skewness Check & Log Transform)...
print("\nData Transformation (Skewness Check & Log Transform)...")

# Identify all numerical columns (including original numerics, and TF-IDF)
all_numerical_cols = X.select_dtypes(include=np.number).columns.tolist()

# Filter out purely binary/one-hot features from this list
filtered_continuous_numerical_cols = []
for col in all_numerical_cols:
    # Ensure the column has more than 2 unique non-NaN values (i.e., not binary 0/1)
    # Using dropna() ensures we only count unique values from non-missing data
    if X[col].dropna().nunique() > 2:
        filtered_continuous_numerical_cols.append(col)

continuous_numerical_cols = filtered_continuous_numerical_cols


skewed_features = X[continuous_numerical_cols].skew().sort_values(ascending=False)
highly_skewed = skewed_features[abs(skewed_features) > 0.75] # Common threshold for high skewness

if not highly_skewed.empty:
    print(f"  - Highly skewed features found (abs(skew) > 0.75):\n{highly_skewed}")
    for col in highly_skewed.index:
        # Apply log1p transformation (log(1+x)) to handle zero values
        # Ensure values are non-negative before applying log.
        if (X[col] < 0).any():
            print(f"    Warning: Skipping log transform for '{col}' due to negative values.")
        else:
            X[col] = np.log1p(X[col])
            print(f"  - Applied log1p transformation to '{col}'.")
else:
    print("  - No highly skewed continuous numerical features found (abs(skew) <= 0.75).")


### 6. Data Scaling

In [None]:
# @title Scaling the data
print("Scaling the Data...")

# Identify numerical features for scaling (exclude one-hot encoded and boolean features)
# Scaling is only for numerical features that are not binary (0/1)
features_to_scale = X.select_dtypes(include=np.number).columns.tolist()

# Exclude purely binary features that are already 0 or 1 from scaling
features_to_scale = [col for col in features_to_scale if X[col].nunique() > 2]

if features_to_scale:
    scaler = StandardScaler()
    X[features_to_scale] = scaler.fit_transform(X[features_to_scale])
    print(f"  - Applied StandardScaler to {len(features_to_scale)} numerical features.")
else:
    print("  - No numerical features found for scaling.")

print(f"DataFrame X after scaling. Sample of scaled features: {X[features_to_scale].head()}")

##### Which method have you used to scale your data and why?

**StandardScaler:**

* **Standardization:** StandardScaler transforms the data such that it has a mean of 0 and a standard deviation of 1. This process is also known as Z-score normalization.

* **Algorithm Sensitivity:** Many machine learning algorithms, particularly those that rely on distance calculations (like K-Nearest Neighbors, Support Vector Machines) or gradient descent optimization (like Linear Regression, Logistic Regression, Neural Networks), are sensitive to the scale of features. Features with larger numerical ranges can dominate the learning process, even if they are not inherently more important.

* **Equal Contribution:** By standardizing features, StandardScaler ensures that all features contribute equally to the distance calculations and model's loss function, regardless of their original scale. This prevents features with larger values from disproportionately influencing the model.

* **Assumptions:** StandardScaler does not require features to be normally distributed, which makes it a widely applicable scaling method for a broad range of ML algorithms. Because it relies on the mean and standard deviation, however, extreme outliers can still influence the scaled values.

Features like Average Revenue can have very large magnitudes, while Rating is between 0 and 5, and TF-IDF scores are usually between 0 and 1. Scaling these features to a common scale prevents the high-magnitude features from overwhelming the others during model training. (Average Salary is the target and is excluded from X, so it is not scaled in this step.)
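
The standardization formula z = (x − mean) / std can be checked directly against StandardScaler. The sketch below uses synthetic columns with revenue-like and rating-like magnitudes (both assumptions), and it fits the scaler on the training rows only before reusing those statistics on the test rows. That ordering differs from the cell above, which fits on all of X before splitting; it is shown as a common alternative, not as what this notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic columns on very different scales
scale_X = np.column_stack([
    rng.uniform(1e4, 1e6, 200),  # revenue-like magnitude
    rng.uniform(0, 5, 200),      # rating-like magnitude
])

scale_tr, scale_te = train_test_split(scale_X, test_size=0.2, random_state=42)

# Learn mean and std from the training rows only, then apply to both sets
demo_scaler = StandardScaler().fit(scale_tr)
scale_tr_z = demo_scaler.transform(scale_tr)
scale_te_z = demo_scaler.transform(scale_te)

# Same result by hand: z = (x - mean) / std, using the population std (ddof=0)
manual_z = (scale_tr - scale_tr.mean(axis=0)) / scale_tr.std(axis=0)

print("Train means after scaling:", scale_tr_z.mean(axis=0).round(6))
print("Train stds after scaling: ", scale_tr_z.std(axis=0).round(6))
print("Matches manual formula:   ", np.allclose(manual_z, scale_tr_z))
```

The test rows will generally not have exactly zero mean and unit variance after this transform, which is expected because they did not contribute to the fitted statistics.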

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, dimensionality reduction can be highly beneficial and is often needed for this type of dataset, especially due to the textual features.

**Reason:**

The primary reason dimensionality reduction is often needed in this project stems from the TF-IDF Vectorization of Job Description and Job Title.

1. **High Dimensionality:** TF-IDF creates one feature per vocabulary term (each unique word, or N-gram if enabled). Even with max_features capped (e.g., 1000 for job description, 200 for job title), this adds on the order of a thousand columns; without a cap, the feature space could easily grow to many thousands of dimensions.

2. **Curse of Dimensionality:** In high-dimensional spaces, data becomes sparse, and the distance between data points tends to converge, making it harder for models to distinguish between different data points. This can lead to:
   * **Overfitting:** Models might learn noise in the high-dimensional space instead of true patterns.
   * **Increased Computational Cost:** Training time and memory usage increase significantly with more features.
   * **Reduced Interpretability:** It becomes challenging to understand the contribution of individual features.
   * **Difficulty in Visualization:** It's impossible to visualize data beyond 3 dimensions directly.
3. **Noise Reduction:** Some TF-IDF features might represent rare or less informative words that primarily add noise rather than predictive power. Dimensionality reduction can help project the data into a lower-dimensional space while retaining the most important information, effectively reducing noise.

The code includes a conditional check (n_features_before_pca > pca_threshold, with pca_threshold = 200) that decides whether to apply dimensionality reduction. PCA is activated only when the TF-IDF and other encoded features produce more than 200 columns.

In [None]:
# @title Dimensionality Reduction (if required)
print("Dimensionality Reduction (PCA)...")

# Get current number of features
n_features_before_pca = X.shape[1]
print(f"  - Number of features before PCA: {n_features_before_pca}")

# Decide on a threshold for applying PCA
pca_threshold = 200

if n_features_before_pca > pca_threshold:
    print(f"  - Applying PCA as feature count ({n_features_before_pca}) exceeds threshold ({pca_threshold}).")
    pca = PCA(n_components=0.95) # Retain 95% of variance
    X_pca = pca.fit_transform(X)
    X = pd.DataFrame(X_pca, index=X.index) # Convert back to DataFrame
    print(f"  - Applied PCA: Reduced features from {n_features_before_pca} to {pca.n_components_} (retaining 95% variance).")
    print("  - X is now a DataFrame with PCA components (e.g., '0', '1', ...).")
else:
    print("  - PCA not applied: Feature count is below threshold or not required.")

print(f"Final features for model input (X) shape: {X.shape}")

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

**Principal Component Analysis (PCA) as the dimensionality reduction technique:**

**Reason:**

1. **Unsupervised Method:** PCA is an unsupervised learning technique, meaning it doesn't use the target variable (Average Salary) during its transformation process. This prevents any potential data leakage from the target into the features during dimensionality reduction.
2. **Variance Preservation**: PCA works by finding the principal components (new orthogonal axes) that capture the maximum variance in the data. By selecting a subset of these components (e.g., n_components=0.95 to retain 95% of the variance), it effectively reduces the number of features while preserving most of the original information.
3. **Linear Transformation:** PCA performs a linear transformation of the features. It's computationally efficient and widely understood.
4. **Common for High-Dimensional Data:** It's a very common and effective technique for reducing dimensionality in datasets with many numerical features, especially those derived from text vectorization (like TF-IDF).
5. **Handles Correlated Features:** PCA inherently handles multicollinearity because the principal components are orthogonal (uncorrelated) to each other.
6. **Readiness for Downstream Models:** The output of PCA is a set of new numerical features (principal components) that are directly usable by most machine learning algorithms.
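
The variance-preservation and orthogonality points can be sketched on synthetic data. Below, 50 correlated columns are generated from 5 hidden factors (the data, sizes, and noise level are assumptions chosen for illustration), and PCA with n_components=0.95 is asked to keep just enough components to retain 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 300 rows, 50 correlated columns driven by 5 hidden factors plus small noise
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 50))
pca_X = latent @ mixing + rng.normal(scale=0.05, size=(300, 50))

# A float n_components keeps the smallest number of components reaching that variance share
demo_pca = PCA(n_components=0.95)
pca_X_reduced = demo_pca.fit_transform(pca_X)
retained_variance = demo_pca.explained_variance_ratio_.sum()

# Principal component scores are mutually uncorrelated
component_corr = np.corrcoef(pca_X_reduced, rowvar=False)

print(f"Columns: {pca_X.shape[1]} -> {demo_pca.n_components_}")
print(f"Variance retained: {retained_variance:.3f}")
print("Off-diagonal correlations are ~0:",
      np.allclose(component_corr, np.eye(demo_pca.n_components_), atol=1e-6))
```

The trade-off is interpretability: each component is a weighted mix of the original columns, so importance scores computed downstream refer to components rather than to named features.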

### 8. Data Splitting

In [None]:
# @title Split your data to train and test. Choose Splitting ratio wisely.
print("Splitting Data into Training and Testing Sets...")

# Define splitting ratio
test_size_ratio = 0.20 # 20% for testing, 80% for training
random_state_val = 42 # For reproducibility

# Ensure X and y have matching indices
X = X.loc[y.index].copy() # Align X's index with y's, important if y had NaNs and df_processed was built from it.

# Drop any rows where the target itself is NA (should be handled earlier, but a final check)
X, y = X.dropna(), y.dropna() # Final check for NaNs in X or y before splitting

# Ensure X and y have matching indices again after dropping NaNs if any.
common_indices = X.index.intersection(y.index)
X = X.loc[common_indices]
y = y.loc[common_indices]


X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=test_size_ratio,
    random_state=random_state_val
)

print(f"  - Data split into training and testing sets with a {test_size_ratio*100}% test size.")
print(f"  - X_train shape: {X_train.shape}")
print(f"  - X_test shape: {X_test.shape}")
print(f"  - y_train shape: {y_train.shape}")
print(f"  - y_test shape: {y_test.shape}")

##### What data splitting ratio have you used and why?

**80:20 splitting ratio**, meaning:

* 80% of the data is allocated for the training set (X_train, y_train).
* 20% of the data is allocated for the testing set (X_test, y_test).

The 80:20 split ratio is a common balance that provides sufficient data for the model to learn complex patterns (80% for training) while retaining a large enough, unseen portion (20% for testing) to reliably evaluate the model's generalization performance without introducing bias.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

No, for this project, the dataset is not considered imbalanced in the traditional machine learning sense.

The concept of "imbalanced dataset" primarily applies to classification problems, where one or more target classes have significantly fewer instances than other classes (e.g., detecting a rare disease, identifying fraud).

The project's primary goal is to predict Average Salary, which is a regression problem (predicting a continuous numerical value). In regression, there are no discrete "classes" to be imbalanced. While the distribution of Average Salary itself might be skewed (which we addressed with log transformation), this is a different concept from class imbalance.

In [None]:
# @title Handling Imbalanced Dataset (If needed)
print("Handling Imbalanced Dataset (if needed)...")

# This step is primarily relevant for CLASSIFICATION problems.
# If your TARGET_COLUMN is continuous (like 'Average Salary' or 'Rating' as numerical) then class imbalance is not typically a concern in the same way.

print("  - 'Handling Imbalanced Dataset' step is primarily for classification problems.")
print(f"  - Current target '{TARGET_COLUMN}' is continuous, so SMOTE is not applicable directly.")

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

No specific technique was used to handle "imbalanced dataset" because the target variable (Average Salary) is continuous, making it a regression problem.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# @title ML Model 1: XGBoost

# ML Model - 1 Implementation
print("XGBoost Regressor Implementation")
if 'X_train' not in locals() or X_train.empty:
    print("Error: X_train, X_test, y_train, y_test not found. Please run the data splitting cell first.")
    raise SystemExit("Data split variables not found. Cannot proceed with XGBoost.")   # Exit or handle the error appropriately

# Ensure X_train and X_test columns are consistent after PCA if applied
# If PCA was applied, X will be a DataFrame of numerical components (0, 1, 2...).
# If not, it will have the named features. XGBoost can handle both.

# XGBoost Algorithm Implementation
print("\nImplementing the XGBoost Regressor...")

# Common parameters for XGBoost. These can be tuned further for optimal performance.
# n_estimators: Number of boosting rounds (trees)
# learning_rate: Step size shrinkage to prevent overfitting
# max_depth: Maximum depth of a tree
# subsample: Fraction of samples used for fitting the trees
# colsample_bytree: Fraction of features used for fitting the trees
# random_state: For reproducibility
# n_jobs: Number of parallel threads to run XGBoost

xgb_reg = xgb.XGBRegressor(
    objective='reg:squarederror', # Objective for regression tasks
    n_estimators=1000,            # Number of boosting rounds
    learning_rate=0.05,           # Step size shrinkage
    max_depth=5,                  # Maximum depth of a tree
    subsample=0.7,                # Subsample ratio of the training instance
    colsample_bytree=0.7,         # Subsample ratio of columns when constructing each tree
    random_state=42,              # For reproducibility
    n_jobs=-1,                    # Use all available CPU cores
    # early_stopping_rounds=50,   # Uncomment and use with eval_set in fit for early stopping
)
print("XGBoost Regressor model defined.")

# Fit the Algorithm
print("\nFitting the XGBoost Regressor to the training data...")
# XGBoost can handle NaNs natively, but preprocessing should already have removed them.
# Note: eval_set is only monitored here because early stopping is disabled above.
# If early stopping is enabled, pass a separate validation split instead of the test set.
xgb_reg.fit(X_train, y_train,
            eval_set=[(X_test, y_test)], verbose=False)
print("XGBoost Regressor trained successfully.")

# Predict on the model
print("\nMaking predictions on the test data...")
y_pred = xgb_reg.predict(X_test)
print("Predictions generated.")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# @title Visualizing evaluation Metric Score chart

print("\nEvaluating Model Performance and Visualizing Metrics...")

# Calculate Evaluation Metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"\nModel Evaluation Metrics (XGBoost Regressor):")
print(f"  Mean Absolute Error (MAE): {mae:.2f}")
print(f"  Mean Squared Error (MSE): {mse:.2f}")
print(f"  Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"  R-squared (R2 Score): {r2:.4f}")

# Visualization 1: Actual vs. Predicted Values Scatter Plot
plt.figure(figsize=(10, 8))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2) # Ideal prediction line
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("XGBoost Regressor: Actual vs. Predicted Salary", fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization 2: Residuals Plot
plt.figure(figsize=(10, 6))
residuals = y_test - y_pred
sns.scatterplot(x=y_pred, y=residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel("Predicted Salary")
plt.ylabel("Residuals (Actual - Predicted)")
plt.title("XGBoost Regressor: Residuals Plot", fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization 3: Distribution of Predicted vs Actual (KDE Plot if possible)
plt.figure(figsize=(10, 6))
sns.kdeplot(y_test, label='Actual Salary Distribution', fill=True, color='blue', alpha=0.6)
sns.kdeplot(y_pred, label='Predicted Salary Distribution', fill=True, color='red', alpha=0.6)
plt.xlabel("Salary")
plt.ylabel("Density")
plt.title("XGBoost Regressor: Actual vs. Predicted Salary Distribution", fontsize=16)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

print("\nXGBoost Regressor implementation and evaluation completed.")

**1. Actual vs. Predicted Salary Scatter Plot**
* This plot displays each individual data point from your test set. For each point, its horizontal position (x-axis) represents the true, actual salary from your dataset, and its vertical position (y-axis) represents the salary predicted by your model. The red dashed line going from the bottom-left to the top-right represents the ideal scenario where Actual Salary = Predicted Salary.
* **Good performance:** Points would cluster tightly around the red dashed line. The closer the points are to this line, the more accurate your model's predictions are.
* **Poor performance:** Points would be widely scattered far from the line. Because the y-axis shows the prediction, points that consistently fall above the line mean the model is over-predicting; points below the line mean it is under-predicting. A random cloud of points with no clear trend suggests a very poor model.
* The plot shows a good concentration of points along the ideal line, indicating that the model generally captures the trend well. The scatter, particularly at higher salary ranges, shows where the predictions deviate more from the actual values.

**2. Residuals Plot**
*  This plot helps assess the assumptions of your regression model and identify patterns in its errors. The horizontal axis (x-axis) represents the predicted salary (the output of your model). The vertical axis (y-axis) represents the residuals, which are the differences between the actual salary and the predicted salary (Actual Salary - Predicted Salary). A horizontal dashed red line is drawn at y=0, representing zero error.
* **Good performance (ideal):** Residuals should be randomly scattered around the zero line, forming a "cloud" with no discernible pattern. This indicates that the model's errors are random and that it is not systematically over- or under-predicting for certain ranges of predicted values. The spread of the residuals should also be consistent across the range (homoscedasticity).
* **Poor performance (issues):**
  * **Patterns:** If residuals show a pattern (e.g., a curve, a funnel shape, or increasing/decreasing spread), it suggests that the model is missing some underlying structure in the data.
  * **Bias:** If residuals are mostly above the zero line for certain ranges, the model is systematically under-predicting there (actual exceeds predicted); if mostly below, it is systematically over-predicting.
  * **Heteroscedasticity:** If the spread of residuals increases or decreases as predicted values change (a "funnel" shape), it indicates that the model's error variance is not constant, which can violate assumptions of some statistical models.
* The residuals plot generally shows a random scatter around the zero line, which is a positive sign, indicating no strong systematic bias. The spread of the residuals gives you a visual sense of the magnitude of your prediction errors.

**3. Actual vs. Predicted Salary Distribution (KDE Plot)**
* This plot uses Kernel Density Estimation (KDE) to visualize the probability density function (the smooth distribution) of the actual salaries (blue filled area) versus the predicted salaries (red filled area) from your test set. It provides a high-level view of how well your model's predicted values match the overall distribution of the true values.
* **Good performance:** The blue and red filled areas should overlap almost perfectly. Their peaks should align, and their shapes (skewness, spread) should be very similar. This means your model is not only predicting individual values reasonably well but also capturing the overall statistical properties of the target variable.
* **Poor performance:** If the peaks are misaligned, the shapes are very different (e.g., one is much wider or narrower, or more skewed than the other), or there's little overlap, it indicates that the model is not accurately representing the true underlying distribution of salaries.
* The plot shows very good overlap, with the peaks aligning nicely. This suggests that the XGBoost model is doing a commendable job of learning the general distribution of salaries. The minor discrepancies, especially in the higher salary tail, point to areas where the model could still improve its capture of less frequent, higher values.
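
To make the numbers behind these plots concrete, the sketch below computes MAE, MSE, RMSE and R² by hand with NumPy on four made-up salary values (the arrays are illustrative, not results from this notebook). It also shows the sign convention used in the residuals plot: residual = actual − predicted, so a positive mean residual means the model under-predicts on average.

```python
import numpy as np

# Made-up actual and predicted salaries (in thousands), for illustration only
metric_y_true = np.array([100.0, 150.0, 200.0, 250.0])
metric_y_hat = np.array([110.0, 140.0, 210.0, 230.0])

metric_residuals = metric_y_true - metric_y_hat       # [-10, 10, -10, 20]

mae_demo = np.mean(np.abs(metric_residuals))          # 12.5
mse_demo = np.mean(metric_residuals ** 2)             # 175.0
rmse_demo = np.sqrt(mse_demo)                         # square root of MSE

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(metric_residuals ** 2)                           # 700.0
ss_tot = np.sum((metric_y_true - metric_y_true.mean()) ** 2)     # 12500.0
r2_demo = 1.0 - ss_res / ss_tot                                  # 0.944

mean_residual_demo = metric_residuals.mean()          # 2.5 > 0: under-predicting on average

print(f"MAE={mae_demo}, MSE={mse_demo}, RMSE={rmse_demo:.3f}, R2={r2_demo:.3f}")
print(f"Mean residual={mean_residual_demo}")
```

RMSE is larger than MAE here because squaring weights the single 20-unit miss more heavily than the three 10-unit misses.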

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# @title ML Model 1: XGBoost Regressor with Hyperparameter Tuning (RandomizedSearch CV)


if 'X_train' not in locals() or X_train.empty:
    print("Error: X_train, X_test, y_train, y_test not found. Please run the data splitting cell first.")
    raise SystemExit("Data split variables not found. Cannot proceed with XGBoost tuning.")


# Implementation of XGBoost with Hyperparameter Optimization Technique
print("\n---Implementing XGBoost Regressor with RandomizedSearchCV...---")

# Define the XGBoost Regressor model
xgb_model = xgb.XGBRegressor(objective='reg:squarederror',
                             eval_metric='mae', # Metric for evaluation during tuning (Mean Absolute Error)
                             random_state=42,
                             n_jobs=-1, # Use all available cores
                             tree_method='hist' # Often faster for larger datasets
                            )

# Define the parameter distribution for RandomizedSearchCV

# 'n_estimators': number of boosting rounds (trees)
# 'learning_rate': step size shrinkage to prevent overfitting
# 'max_depth': maximum depth of a tree
# 'subsample': subsample ratio of the training instance
# 'colsample_bytree': subsample ratio of columns when constructing each tree
# 'gamma': minimum loss reduction required to make a further partition on a leaf node
# 'reg_alpha': L1 regularization term on weights
# 'reg_lambda': L2 regularization term on weights

param_distributions = {
    'n_estimators': randint(100, 1000),                                         # Random integer from 100 to 999 (upper bound is exclusive)
    'learning_rate': uniform(0.01, 0.1),                                        # Random float from 0.01 to 0.11 (loc=0.01, scale=0.1)
    'max_depth': randint(3, 10),                                                # Random integer from 3 to 9 (upper bound is exclusive)
    'subsample': uniform(0.6, 0.4),                                             # Random float from 0.6 to 1.0 (0.6 + 0.4)
    'colsample_bytree': uniform(0.6, 0.4),                                      # Random float from 0.6 to 1.0 (0.6 + 0.4)
    'gamma': uniform(0, 0.2),                                                   # Random float from 0 to 0.2
    'reg_alpha': uniform(0, 0.5),                                               # Random float for L1 regularization
    'reg_lambda': uniform(0, 0.5),                                              # Random float for L2 regularization
}

# Setup RandomizedSearchCV

# estimator: The model to tune
# param_distributions: The parameter space to search
# n_iter: Number of parameter settings that are sampled (important for efficiency)
# cv: Number of cross-validation folds
# scoring: Metric to optimize
# verbose: Controls the verbosity of the output
# random_state: For reproducibility
# n_jobs: Number of CPU cores to use for parallel processing

random_search = RandomizedSearchCV(estimator=xgb_model,
                                   param_distributions=param_distributions,
                                   n_iter=20,                                   # Number of random combinations to try; kept at 20 for faster execution, increase for a more thorough search
                                   cv=3,                                        # 3-fold cross-validation for faster execution; with only 2 or 3 folds the CV estimate is noisy, so more folds give a more stable comparison
                                   scoring='neg_mean_absolute_error',           # Maximize negative MAE means minimize MAE
                                   verbose=2,                                   # Show progress
                                   random_state=42,
                                   n_jobs=-1                                     # Use all available cores
                                  )

print(f"RandomizedSearchCV configured to explore {random_search.n_iter} combinations over {random_search.cv} folds.")

# Fit the Algorithm
print("\nFitting RandomizedSearchCV to find the best XGBoost parameters...")
random_search.fit(X_train, y_train)

print("\nRandomizedSearchCV fitting complete.")
print("Best parameters found by RandomizedSearchCV:")
print(random_search.best_params_)

# The best estimator is the model trained with the best parameters found
best_xgb_model = random_search.best_estimator_
print("\nBest XGBoost model selected.")

# Predict on the model
print("\n Making predictions on the test data with the best model...")
y_pred_tuned = best_xgb_model.predict(X_test)
print("Predictions generated with tuned model.")

# Visualize the results using Evaluation Metric Score Chart
print("\nEvaluating Tuned Model Performance and Visualizing Metrics...")

# Calculate Evaluation Metrics for the Tuned Model
mae_tuned = mean_absolute_error(y_test, y_pred_tuned)
mse_tuned = mean_squared_error(y_test, y_pred_tuned)
rmse_tuned = np.sqrt(mse_tuned)
r2_tuned = r2_score(y_test, y_pred_tuned)

print(f"\nModel Evaluation Metrics (Tuned XGBoost Regressor):")
print(f"  Mean Absolute Error (MAE): {mae_tuned:.2f}")
print(f"  Mean Squared Error (MSE): {mse_tuned:.2f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_tuned:.2f}")
print(f"  R-squared (R2 Score): {r2_tuned:.4f}")


# Visualization 1: Actual vs. Predicted Values Scatter Plot (Tuned Model)
plt.figure(figsize=(10, 8))
sns.scatterplot(x=y_test, y=y_pred_tuned, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2) # Ideal prediction line
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("Tuned XGBoost Regressor: Actual vs. Predicted Salary", fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization 2: Residuals Plot (Tuned Model)
plt.figure(figsize=(10, 6))
residuals_tuned = y_test - y_pred_tuned
sns.scatterplot(x=y_pred_tuned, y=residuals_tuned, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel("Predicted Salary")
plt.ylabel("Residuals (Actual - Predicted)")
plt.title("Tuned XGBoost Regressor: Residuals Plot", fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization 3: Distribution of Predicted vs Actual (KDE Plot for Tuned Model)
plt.figure(figsize=(10, 6))
sns.kdeplot(y_test, label='Actual Salary Distribution', fill=True, color='blue', alpha=0.6)
sns.kdeplot(y_pred_tuned, label='Predicted Salary Distribution', fill=True, color='red', alpha=0.6)
plt.xlabel("Salary")
plt.ylabel("Density")
plt.title("Tuned XGBoost Regressor: Actual vs. Predicted Salary Distribution", fontsize=16)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

print("\nTuned XGBoost Regressor implementation and evaluation completed.")

* General Trend Captured: Model correctly identifies the overall relationship between features and salary.
* Accuracy Varies: Predictions scatter considerably around actual values, showing individual errors.
* Worse at High Salaries: Model's predictions are less accurate for higher salary ranges.
* Errors Show No Directional Bias: Residuals scatter on both sides of zero without an obvious pattern, but error magnitudes are significant.
* Tuning Not Effective (Yet): Hyperparameter tuning by RandomizedSearchCV, in this run, did not notably improve performance over the untuned model.
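
The observation that predictions are less accurate at higher salaries can be checked numerically rather than only by eye. The following is a minimal sketch, assuming the actual and predicted values are available as plain sequences; `mae_by_band` is a hypothetical helper and the toy numbers are illustrative, not taken from the Glassdoor data.

```python
def mae_by_band(y_true, y_pred, n_bands=3):
    """Sort observations by actual value, split them into bands, and return (low, high, MAE) per band."""
    pairs = sorted(zip(y_true, y_pred), key=lambda pair: pair[0])
    if len(pairs) < n_bands:
        raise ValueError("Need at least one observation per band.")
    band_size = len(pairs) // n_bands
    results = []
    for i in range(n_bands):
        start = i * band_size
        # The last band absorbs any leftover observations.
        end = (i + 1) * band_size if i < n_bands - 1 else len(pairs)
        band = pairs[start:end]
        mae = sum(abs(actual_value - predicted_value) for actual_value, predicted_value in band) / len(band)
        results.append((band[0][0], band[-1][0], mae))
    return results


# Hypothetical toy values (illustration only).
actual = [45_000, 52_000, 60_000, 75_000, 88_000, 95_000, 120_000, 150_000, 210_000]
predicted = [47_000, 50_000, 63_000, 71_000, 92_000, 90_000, 108_000, 131_000, 165_000]

for low, high, mae in mae_by_band(actual, predicted):
    print(f"Actual salary {low:>7,} to {high:>7,}: MAE = {mae:,.0f}")
```

In the notebook, the same helper could be called as `mae_by_band(y_test, y_pred_tuned)`; an MAE that rises from the lowest band to the highest would support the "worse at high salaries" reading.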

In [None]:
# @title Comparison of Results (XGboost Untuned and Tuned)

# Tabular Comparison of Results
print("\nTabular Comparison of Model Performance:")
comparison_df = pd.DataFrame({
    'Metric': ['MAE', 'RMSE', 'R2 Score'],
    'Untuned XGBoost': [mae_untuned, rmse_untuned, r2_untuned],
    'Tuned XGBoost (RandomizedSearchCV)': [mae_tuned, rmse_tuned, r2_tuned]
})
# Format numerical columns for readability: 4 decimals for R2, 2 for MAE/RMSE (chosen from the 'Metric' label, since the numeric values never contain the text 'R2')
for col in ['Untuned XGBoost', 'Tuned XGBoost (RandomizedSearchCV)']:
    comparison_df[col] = [f'{v:.4f}' if m == 'R2 Score' else f'{v:.2f}' for m, v in zip(comparison_df['Metric'], comparison_df[col])]

print(comparison_df.to_string(index=False))


# Visual Comparison of Results
print("\nVisual Comparison of Model Performance...")

# Prepare data for plotting (MAE and RMSE)
metrics_for_plot = pd.DataFrame({
    'Model': ['Untuned XGBoost', 'Tuned XGBoost (RandomizedSearchCV)'],
    'MAE': [mae_untuned, mae_tuned],
    'RMSE': [rmse_untuned, rmse_tuned]
})

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot MAE
sns.barplot(x='Model', y='MAE', data=metrics_for_plot, ax=axes[0], palette='viridis')
axes[0].set_title('XGBoost MAE Comparison')
axes[0].set_ylabel('MAE ($)')
axes[0].grid(axis='y', linestyle='--', alpha=0.7)
for container in axes[0].containers:
    axes[0].bar_label(container, fmt='%.2f')

# Plot RMSE
sns.barplot(x='Model', y='RMSE', data=metrics_for_plot, ax=axes[1], palette='plasma')
axes[1].set_title('XGBoost RMSE Comparison')
axes[1].set_ylabel('RMSE ($)')
axes[1].grid(axis='y', linestyle='--', alpha=0.7)
for container in axes[1].containers:
    axes[1].bar_label(container, fmt='%.2f')

plt.tight_layout()
plt.show()


# Plot R2 Score
plt.figure(figsize=(7, 5))
sns.barplot(x='Model', y='R2 Score', data=pd.DataFrame({
    'Model': ['Untuned XGBoost', 'Tuned XGBoost (RandomizedSearchCV)'],
    'R2 Score': [r2_untuned, r2_tuned]
}), palette='cividis')
plt.title('XGBoost R-squared (R2 Score) Comparison')
plt.ylabel('R2 Score')
plt.ylim(0, 1) # R2 score is typically between 0 and 1
plt.grid(axis='y', linestyle='--', alpha=0.7)
for container in plt.gca().containers:
    plt.gca().bar_label(container, fmt='%.4f')
plt.show()

print("\nXGBoost comparison completed. The plots visually demonstrate the impact of hyperparameter tuning.")

**Inference from this Graph (XGBoost Comparison)**


**Unexpected Inference:** Contrary to typical expectations, these graphs show that the tuned XGBoost model (using RandomizedSearchCV) has slightly worse performance than the untuned model.

**MAE Comparison:** Untuned MAE is 6117.49, while Tuned MAE is 6183.63. (Tuned is higher, meaning worse)

**RMSE Comparison:** Untuned RMSE is 12466.21, while Tuned RMSE is 13020.79. (Tuned is higher, meaning worse)

**R2 Score Comparison:** Untuned R2 is 0.7656, while Tuned R2 is 0.7443. (Tuned is lower, meaning worse fit)

This outcome is noteworthy, although it is not rare when the search budget is small. It suggests that the RandomizedSearchCV process, with the specific parameter ranges and n_iter/cv settings used, was affected by one or more of the following:
* No better parameters were sampled: With only 20 random draws, none of the sampled combinations may have beaten the untuned configuration.
* The untuned settings were already strong: The default or manually chosen parameters for the untuned model may lie in a better region than anything the search sampled.
* The search space was too restrictive or inappropriate: The range of hyperparameters given to RandomizedSearchCV might not have included the truly optimal configurations.
* Overfitting to the validation folds: Although cross-validation is used, a small n_iter, only 3 folds, and a large search space can select parameters that happened to do well on those specific folds but do not generalize to the test set.

The gap is also small, so part of it may simply reflect noise from evaluating on a single train/test split. Overall, this indicates that more thorough tuning, or a different tuning approach (like Optuna with more trials or broader ranges), might be needed for XGBoost.
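
The selection effect behind cross-validation overfitting can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of the notebook's search: every candidate is assumed to have the same true MAE, each fold score is that MAE plus Gaussian noise, and all numbers (`true_mae`, `fold_noise`) are hypothetical.

```python
import random
import statistics


def simulate_selection_bias(n_candidates=20, n_folds=3, true_mae=6000.0,
                            fold_noise=400.0, n_repeats=2000, seed=42):
    """Average CV MAE of the 'winning' candidate when all candidates are truly equal."""
    rng = random.Random(seed)
    winning_scores = []
    for _ in range(n_repeats):
        cv_means = []
        for _ in range(n_candidates):
            # Each fold score is the true MAE plus random noise (hypothetical noise level).
            fold_scores = [rng.gauss(true_mae, fold_noise) for _ in range(n_folds)]
            cv_means.append(statistics.fmean(fold_scores))
        # The search keeps the candidate with the lowest mean CV MAE.
        winning_scores.append(min(cv_means))
    return statistics.fmean(winning_scores)


avg_winner = simulate_selection_bias()
print("True MAE of every candidate: 6000.00")
print(f"Average CV MAE of the selected 'best' candidate: {avg_winner:.2f}")
```

Because the winner is picked for having the lowest noisy score, its cross-validated MAE looks better than its true MAE even though no candidate is actually better. With few folds the noise per candidate is larger, which widens this gap.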

##### Which hyperparameter optimization technique have you used and why?

Randomized Search Cross-Validation (RandomizedSearchCV) was used for hyperparameter optimization of XGBoost in this comparison.
It was chosen as a more efficient alternative to GridSearchCV when the hyperparameter search space is large. It samples a fixed number of parameter combinations from the specified distributions, offering a balance between exploration and computational cost.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

No, unfortunately, we have not seen any improvement. In fact, the tuned XGBoost model (using RandomizedSearchCV) shows a slight decrease in performance compared to the untuned model, based on the provided metrics:

Mean Absolute Error (MAE): Increased from $6117.49 (Untuned) to $6183.63 (Tuned).

Root Mean Squared Error (RMSE): Increased from $12466.21 (Untuned) to $13020.79 (Tuned).

R-squared (R2 Score): Decreased from 0.7656 (Untuned) to 0.7443 (Tuned).

This suggests that RandomizedSearchCV, in its current configuration (number of iterations, CV folds, and parameter ranges), did not find hyperparameters that improved the model's performance on the test set.

### ML Model - 2

In [None]:
# @title ML Model - 2 Random Forest Regressor (without any hyperparameter optimization technique)

print("--- Random Forest Regressor Implementation ---")

if 'X_train' not in locals() or X_train.empty:
    print("Error: X_train, X_test, y_train, y_test not found. Please ensure the earlier data-splitting cell of the ML pipeline was run.")
    raise SystemExit("Data split variables not found. Cannot proceed with Random Forest.")

# 1. Random Forest Regressor Implementation
print("\nImplementing the Random Forest Regressor...")

# Common parameters for Random Forest. These can be tuned further for optimal performance.
# n_estimators: The number of trees in the forest.
# random_state: For reproducibility.
# n_jobs: Number of parallel jobs to run (set to -1 to use all available cores).
# min_samples_leaf: Minimum number of samples required to be at a leaf node.
# min_samples_split: Minimum number of samples required to split an internal node.

rf_reg = RandomForestRegressor(
    n_estimators=100,            # A reasonable number of trees for a baseline
    random_state=42,             # For reproducibility
    n_jobs=-1,                   # Use all available CPU cores
    min_samples_leaf=5,          # Minimum samples required to be at a leaf node
    min_samples_split=10         # Minimum samples required to split an internal node
)

print("Random Forest Regressor model defined.")


#Fit the Model
print("\nFitting the Random Forest Regressor to the training data...")
# Native NaN handling in RandomForestRegressor depends on the scikit-learn version (older releases raise an error); our preprocessing already imputed NaNs, so the data should be clean
rf_reg.fit(X_train, y_train)
print("Random Forest Regressor trained successfully.")


#Predict on the Model
print("\nMaking predictions on the test data...")
y_pred_rf = rf_reg.predict(X_test)
print("Predictions generated.")


# Visualize and Analyze Performance using Evaluation Metric Score Chart
print("\nEvaluating Model Performance and Visualizing Metrics...")

# Calculate Evaluation Metrics
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"\nModel Evaluation Metrics (Random Forest Regressor):")
print(f"  Mean Absolute Error (MAE): {mae_rf:.2f}")
print(f"  Mean Squared Error (MSE): {mse_rf:.2f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_rf:.2f}")
print(f"  R-squared (R2 Score): {r2_rf:.4f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualization 1: Actual vs. Predicted Values Scatter Plot
plt.figure(figsize=(10, 8))
sns.scatterplot(x=y_test, y=y_pred_rf, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2) # Ideal prediction line
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("Random Forest Regressor: Actual vs. Predicted Salary", fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization 2: Residuals Plot
plt.figure(figsize=(10, 6))
residuals_rf = y_test - y_pred_rf
sns.scatterplot(x=y_pred_rf, y=residuals_rf, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel("Predicted Salary")
plt.ylabel("Residuals (Actual - Predicted)")
plt.title("Random Forest Regressor: Residuals Plot", fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization 3: Distribution of Predicted vs Actual (KDE Plot)
plt.figure(figsize=(10, 6))
sns.kdeplot(y_test, label='Actual Salary Distribution', fill=True, color='blue', alpha=0.6)
sns.kdeplot(y_pred_rf, label='Predicted Salary Distribution', fill=True, color='red', alpha=0.6)
plt.xlabel("Salary")
plt.ylabel("Density")
plt.title("Random Forest Regressor: Actual vs. Predicted Salary Distribution", fontsize=16)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

print("\nRandom Forest Regressor implementation and evaluation completed.")

1. **General Trend Captured:** Model correctly identifies the overall relationship between features and salary.
2. **Significant Scatter:** Predictions show considerable spread around actual values, indicating moderate individual errors.
3. **Residuals Spread:** Errors are widely scattered around zero, suggesting room for improvement in overall accuracy.
4. **Distribution Mismatch:** Predicted salary distribution is narrower and peaks higher than the actual distribution, indicating underprediction of ranges and overprediction of central values.
5. **Average Performance:** Metrics (R2=0.6106, MAE=$11816.04) show a decent but not highly accurate baseline performance, with substantial average errors.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# @title ML Model - 2: Random Forest Regressor with RandomizedSearchCV (Hyperparameter Optimization)

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from scipy.stats import randint, uniform # For defining parameter distributions

print("--- Random Forest Regressor with Hyperparameter Optimization (RandomizedSearchCV) ---")

if 'X_train' not in locals() or X_train.empty:
    print("Error: X_train, X_test, y_train, y_test not found. Please ensure the earlier data-splitting cell of the ML pipeline was run.")
    raise SystemExit("Data split variables not found. Cannot proceed with Random Forest tuning.")

# Implementation of Random Forest Regressor with Hyperparameter Optimization Technique
print("\nImplementing Random Forest Regressor with RandomizedSearchCV...")

# Define the Random Forest Regressor model
rf_model = RandomForestRegressor(random_state=42, n_jobs=-1)


# Define the parameter distributions for RandomizedSearchCV

# 'n_estimators': number of trees in the forest
# 'max_features': The number of features to consider when looking for the best split
# 'min_samples_split': The minimum number of samples required to split an internal node
# 'min_samples_leaf': The minimum number of samples required to be at a leaf node
# 'bootstrap': Whether bootstrap samples are used when building trees (True is default for RF)

param_distributions = {
    'n_estimators': randint(100, 1000),       # Number of trees
    'max_features': ['sqrt', 'log2', 0.6, 0.8, 1.0], # Features to consider at each split
    'min_samples_split': randint(2, 20),      # Minimum samples required to split a node
    'min_samples_leaf': randint(1, 10),       # Minimum samples required at each leaf node
    'bootstrap': [True, False]                # Whether to use bootstrap samples
}

# Setup RandomizedSearchCV

# estimator: The model to tune
# param_distributions: The parameter space to search
# n_iter: Number of parameter settings that are sampled (adjust based on time/resources)
# cv: Number of cross-validation folds
# scoring: Metric to optimize ('neg_mean_absolute_error' for MAE minimization)
# verbose: Controls the verbosity of the output
# random_state: For reproducibility
# n_jobs: Number of CPU cores to use for parallel processing

random_search_rf = RandomizedSearchCV(estimator=rf_model,
                                      param_distributions=param_distributions,
                                      n_iter=20,       # reduced for faster execution
                                      cv=3,            # 3-fold cross-validation (reduced for faster execution)
                                      scoring='neg_mean_absolute_error', # Maximize negative MAE means minimize MAE
                                      verbose=2,       # Show progress
                                      random_state=42,
                                      n_jobs=-1        # Use all available cores
                                     )

print(f"RandomizedSearchCV for Random Forest configured to explore {random_search_rf.n_iter} combinations over {random_search_rf.cv} folds.")


# Fit the Algorithm (with Hyperparameter Optimization)
print("\nFitting RandomizedSearchCV to find the best Random Forest parameters...")
random_search_rf.fit(X_train, y_train)

print("\nRandomizedSearchCV fitting complete for Random Forest.")
print("Best parameters found by RandomizedSearchCV:")
print(random_search_rf.best_params_)

# The best estimator is the model trained with the best parameters found
best_rf_model = random_search_rf.best_estimator_
print("\nBest Random Forest model selected.")


#Predict on the Model
print("\nMaking predictions on the test data with the best Random Forest model...")
y_pred_rf_tuned = best_rf_model.predict(X_test)
print("Predictions generated with tuned Random Forest model.")


# Visualize and analyze the performance of this model using Evaluation Metric Score Chart
print("\nEvaluating Tuned Random Forest Model Performance and Visualizing Metrics...")

# Calculate Evaluation Metrics for the Tuned Model
mae_rf_tuned = mean_absolute_error(y_test, y_pred_rf_tuned)
mse_rf_tuned = mean_squared_error(y_test, y_pred_rf_tuned)
rmse_rf_tuned = np.sqrt(mse_rf_tuned)
r2_rf_tuned = r2_score(y_test, y_pred_rf_tuned)

print(f"\nModel Evaluation Metrics (Tuned Random Forest Regressor):")
print(f"  Mean Absolute Error (MAE): {mae_rf_tuned:.2f}")
print(f"  Mean Squared Error (MSE): {mse_rf_tuned:.2f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_rf_tuned:.2f}")
print(f"  R-squared (R2 Score): {r2_rf_tuned:.4f}")


# Visualization 1: Actual vs. Predicted Values Scatter Plot (Tuned Random Forest)
plt.figure(figsize=(10, 8))
sns.scatterplot(x=y_test, y=y_pred_rf_tuned, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2) # Ideal prediction line
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("Tuned Random Forest Regressor: Actual vs. Predicted Salary", fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization 2: Residuals Plot (Tuned Random Forest)
plt.figure(figsize=(10, 6))
residuals_rf_tuned = y_test - y_pred_rf_tuned
sns.scatterplot(x=y_pred_rf_tuned, y=residuals_rf_tuned, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel("Predicted Salary")
plt.ylabel("Residuals (Actual - Predicted)")
plt.title("Tuned Random Forest Regressor: Residuals Plot", fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization 3: Distribution of Predicted vs Actual (KDE Plot for Tuned Random Forest)
plt.figure(figsize=(10, 6))
sns.kdeplot(y_test, label='Actual Salary Distribution', fill=True, color='blue', alpha=0.6)
sns.kdeplot(y_pred_rf_tuned, label='Predicted Salary Distribution', fill=True, color='red', alpha=0.6)
plt.xlabel("Salary")
plt.ylabel("Density")
plt.title("Tuned Random Forest Regressor: Actual vs. Predicted Salary Distribution", fontsize=16)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

print("\nTuned Random Forest Regressor implementation and evaluation completed.")


1. **Strong Trend Capture:** The model excels at capturing the core relationship between features and salary, with points clustering well along the ideal prediction line.
2. **Reduced Scatter:** Compared to the untuned version, the scatter around the diagonal is noticeably tighter, indicating improved individual prediction accuracy.
3. **Well-Distributed Residuals:** Prediction errors are randomly distributed around zero, with reduced spread, confirming improved overall model fit and less bias.
4. **Improved Distribution Match:** The predicted salary distribution now aligns much more closely with the actual distribution, showing better capture of the salary range and density.
5. **Significant Performance Gain:** Tuning substantially improved metrics (MAE: $7544.38, R2: 0.7392), making it a much more accurate and reliable model than its untuned counterpart.

In [None]:
# @title Comparison of Results in Random Forest Regressor Untuned and Tuned

# Tabular Comparison of Results
print("\nTabular Comparison of Model Performance:")
comparison_df_rf = pd.DataFrame({
    'Metric': ['MAE', 'RMSE', 'R2 Score'],
    'Untuned Random Forest': [mae_rf, rmse_rf, r2_rf],
    'Tuned Random Forest (RandomizedSearchCV)': [mae_rf_tuned, rmse_rf_tuned, r2_rf_tuned]
})
# Format numerical columns for readability: 4 decimals for R2, 2 for MAE/RMSE (chosen from the 'Metric' label, since the numeric values never contain the text 'R2')
for col in ['Untuned Random Forest', 'Tuned Random Forest (RandomizedSearchCV)']:
    comparison_df_rf[col] = [f'{v:.4f}' if m == 'R2 Score' else f'{v:.2f}' for m, v in zip(comparison_df_rf['Metric'], comparison_df_rf[col])]

print(comparison_df_rf.to_string(index=False))


# Visual Comparison of Results
print("\nVisual Comparison of Model Performance...")

# Prepare data for plotting (MAE and RMSE)
metrics_for_plot_rf = pd.DataFrame({
    'Model': ['Untuned Random Forest', 'Tuned Random Forest (RandomizedSearchCV)'],
    'MAE': [mae_rf, mae_rf_tuned],
    'RMSE': [rmse_rf, rmse_rf_tuned]
})

fig_rf, axes_rf = plt.subplots(1, 2, figsize=(14, 6))

# Plot MAE
sns.barplot(x='Model', y='MAE', data=metrics_for_plot_rf, ax=axes_rf[0], palette='viridis')
axes_rf[0].set_title('Random Forest MAE Comparison')
axes_rf[0].set_ylabel('MAE ($)')
axes_rf[0].grid(axis='y', linestyle='--', alpha=0.7)
for container in axes_rf[0].containers:
    axes_rf[0].bar_label(container, fmt='%.2f')

# Plot RMSE
sns.barplot(x='Model', y='RMSE', data=metrics_for_plot_rf, ax=axes_rf[1], palette='plasma')
axes_rf[1].set_title('Random Forest RMSE Comparison')
axes_rf[1].set_ylabel('RMSE ($)')
axes_rf[1].grid(axis='y', linestyle='--', alpha=0.7)
for container in axes_rf[1].containers:
    axes_rf[1].bar_label(container, fmt='%.2f')

plt.tight_layout()
plt.show()


# Plot R2 Score
plt.figure(figsize=(7, 5))
sns.barplot(x='Model', y='R2 Score', data=pd.DataFrame({
    'Model': ['Untuned Random Forest', 'Tuned Random Forest (RandomizedSearchCV)'],
    'R2 Score': [r2_rf, r2_rf_tuned]
}), palette='cividis')
plt.title('Random Forest R-squared (R2 Score) Comparison')
plt.ylabel('R2 Score')
plt.ylim(0, 1) # R2 score is typically between 0 and 1
plt.grid(axis='y', linestyle='--', alpha=0.7)
for container in plt.gca().containers:
    plt.gca().bar_label(container, fmt='%.4f')
plt.show()

print("\nRandom Forest comparison completed. The plots visually demonstrate the impact of hyperparameter tuning.")

The graphs clearly illustrate the impact of hyperparameter optimization (using RandomizedSearchCV) on the Random Forest Regressor's performance.

Key Inference: Hyperparameter tuning significantly improved the Random Forest Regressor's performance. The tuned model shows reduced errors (MAE, RMSE) and a better fit (higher R2 score) compared to the untuned version.

* Error Reduction: Both MAE and RMSE are substantially lower for the tuned model.
* Improved Fit: The R-squared value is notably higher for the tuned model, indicating it explains more variance in salary.

##### Which hyperparameter optimization technique have you used and why?

Randomized Search Cross-Validation (RandomizedSearchCV) was used for hyperparameter optimization of the Random Forest Regressor.

RandomizedSearchCV was selected because it's more computationally efficient than GridSearchCV for exploring a relatively large hyperparameter space, which is common for ensemble models like Random Forest. It samples a fixed number of combinations randomly, allowing for a good balance between finding optimal parameters and managing execution time.
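
To make the efficiency argument concrete, the sketch below counts how many model fits an exhaustive grid over the Random Forest search space defined earlier would need, compared with the 20 sampled combinations. It assumes the same ranges as `param_distributions` and the same `n_iter=20`, `cv=3` settings; the dictionary `space_sizes` is a hypothetical helper for illustration.

```python
import math

# Number of distinct values per hyperparameter in the Random Forest search space.
# scipy.stats.randint(low, high) excludes `high`, so randint(100, 1000) has 900 values.
space_sizes = {
    "n_estimators": 1000 - 100,      # randint(100, 1000)
    "max_features": 5,               # ['sqrt', 'log2', 0.6, 0.8, 1.0]
    "min_samples_split": 20 - 2,     # randint(2, 20)
    "min_samples_leaf": 10 - 1,      # randint(1, 10)
    "bootstrap": 2,                  # [True, False]
}
cv_folds = 3
n_iter = 20

grid_combinations = math.prod(space_sizes.values())
grid_fits = grid_combinations * cv_folds
random_fits = n_iter * cv_folds

print(f"Exhaustive grid: {grid_combinations:,} combinations, {grid_fits:,} fits")
print(f"Randomized search: {n_iter} combinations, {random_fits} fits")
print(f"Share of the space sampled: {n_iter / grid_combinations:.6%}")
```

The flip side of this saving is that 20 draws cover only a tiny fraction of the space, so the result depends on `random_state` and a larger `n_iter` may find better settings.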

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes, we have seen a clear and substantial improvement in the Random Forest Regressor's performance after hyperparameter tuning using RandomizedSearchCV.

Interpretation of Improvement:

* The Mean Absolute Error (MAE) fell from $11,816.04 to $7,544.38. On average, the tuned model's predictions are about $4,272 closer to the actual salary than the untuned model's.
* The Root Mean Squared Error (RMSE) decreased from $16,067.59 to $13,149.74. Because RMSE penalizes large errors more heavily than MAE does, this drop indicates that the tuned model makes fewer or smaller large errors.
* The R-squared (R2 Score) improved from 0.6106 to 0.7392. This is a substantial gain, meaning the tuned model now explains almost 74% of the variance in salary, a much better fit than the 61% explained by the untuned model.
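
The size of these changes can be recomputed directly from the reported metrics. The snippet below is a small arithmetic sketch that uses only the untuned and tuned values quoted above; the `reported` dictionary and `change` helper are illustrative names.

```python
# Reported (untuned, tuned) metrics for the Random Forest Regressor.
reported = {
    "MAE": (11816.04, 7544.38),
    "RMSE": (16067.59, 13149.74),
    "R2": (0.6106, 0.7392),
}


def change(before, after):
    """Return the absolute change and the change relative to the starting value (in percent)."""
    absolute = after - before
    relative_pct = absolute / before * 100
    return absolute, relative_pct


mae_abs, mae_pct = change(*reported["MAE"])
rmse_abs, rmse_pct = change(*reported["RMSE"])
r2_abs, _ = change(*reported["R2"])

print(f"MAE change:  {mae_abs:,.2f} ({mae_pct:.1f}%)")
print(f"RMSE change: {rmse_abs:,.2f} ({rmse_pct:.1f}%)")
print(f"R2 change:   {r2_abs:+.4f}")
```

Negative values for MAE and RMSE mean the error went down; a positive R2 change means more of the salary variance is explained.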

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Let's consider these metrics in the context of a business scenario, such as a company trying to predict competitive salaries for job postings or for internal benchmarking.

1. **Mean Absolute Error (MAE):**
* **Indication:** MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It tells you, on average, how much your predictions deviate from the actual values.
* **Business Impact:** MAE is highly interpretable in business terms. An MAE of $7,544.38 means that, on average, your model's predicted salary for a job is off by about $7,500 from the true market salary.
  * **Positive Impact:** A lower MAE means more accurate salary estimates. This directly helps HR departments set competitive salaries, optimize recruitment budgets, and ensure fair compensation. If the model consistently predicts within a reasonable MAE, it reduces the risk of overpaying (wasting budget) or underpaying (losing talent).
  * **Example:** If your hiring budget is tight, a model with a lower MAE is invaluable for precise cost estimation per hire.

2. **Root Mean Squared Error (RMSE):**
* **Indication:** RMSE is the square root of the average of the squared differences between prediction and actual observation. It gives a relatively high weight to large errors because the errors are squared before being averaged. It's in the same units as the target variable.
* **Business Impact:** RMSE provides a good measure of the typical magnitude of prediction errors, with a greater penalty for large, outlier errors. An RMSE of $13,149.74 suggests that the "typical" error is around $13,150, but it also implies that larger errors are contributing more to this average.
  * **Positive Impact:** A lower RMSE indicates that the model not only has smaller average errors but also that it's less prone to making very large, damaging errors. This is critical when significant financial decisions (like high-level executive salaries) are involved, where large mispredictions could have substantial financial consequences.
  * **Example:** For high-stakes roles, where salary misjudgment could cost tens of thousands, minimizing RMSE is crucial to avoid severe financial pitfalls or talent acquisition failures.

3. **R-squared (R2 Score):**
* **Indication:** R-squared explains the proportion of variance in the dependent variable (salary) that can be predicted from the independent variables (job features). It tells you how well your model's predictions fit the actual values, compared to a simple model that just predicts the mean.
* **Business Impact:** An R2 score of 0.7392 means that approximately 73.92% of the variation in job salaries can be explained by the features your model uses (e.g., job title, industry, company size, location, etc.).
  * **Positive Impact:** A higher R2 indicates a more comprehensive understanding of the factors influencing salary. This can help a business understand which job characteristics drive salary differences. While it doesn't tell you the absolute error, it shows the model's overall explanatory power. It helps stakeholders trust that the model isn't just randomly guessing but actually leveraging meaningful data.
  * **Example:** If R2 is high, a business can use the model to not only predict salaries but also to identify which job attributes are most influential (via feature importances, if the model supports it), helping them refine job descriptions or compensation strategies based on data-driven insights.
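
The difference between the three metrics is easiest to see on a tiny example. The following sketch computes MAE, RMSE, and R2 from their standard definitions in plain Python (the notebook itself uses the scikit-learn functions); the salary values are hypothetical and chosen so that two sets of predictions share the same MAE but differ in how the error is distributed.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R2) computed from their standard definitions."""
    n = len(y_true)
    errors = [actual_value - predicted_value for actual_value, predicted_value in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)    # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical toy salaries (illustration only).
actual = [60_000, 80_000, 100_000, 120_000, 140_000]
even_errors = [65_000, 75_000, 105_000, 115_000, 145_000]    # every prediction off by 5,000
one_big_miss = [60_000, 80_000, 100_000, 120_000, 115_000]   # a single prediction off by 25,000

for label, preds in [("Even errors", even_errors), ("One big miss", one_big_miss)]:
    mae, rmse, r2 = regression_metrics(actual, preds)
    print(f"{label:<13} MAE={mae:,.2f}  RMSE={rmse:,.2f}  R2={r2:.4f}")
```

Both prediction sets have the same MAE, but the set with one large miss has a higher RMSE and a lower R2, which is why RMSE is the more informative metric when large individual mispredictions are costly.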

**Overall Business Impact of the ML Model (Random Forest Regressor):**

A well-performing salary prediction model like the tuned Random Forest Regressor can provide significant business value by:

* **Optimizing Compensation Strategies:** Ensuring competitive and fair salaries, which is crucial for attracting and retaining top talent.
* **Budgeting and Financial Planning:** Providing more accurate forecasts of personnel costs.
* **Recruitment Efficiency:** Speeding up the offer negotiation process by providing data-backed salary ranges.
* **Market Intelligence:** Gaining insights into what drives salary variations across different job roles, industries, and locations.
* **Reducing Bias:** A data-driven approach can help reduce human bias in salary setting, leading to more equitable pay practices.

The improvement seen after tuning the Random Forest model suggests that by refining its internal parameters, it has become a more reliable and valuable tool for these business applications.

### ML Model - 3

In [None]:
# @title ML Model 3: SVR without Hyperparameter Optimization

from sklearn.svm import SVR

print("--- Support Vector Regressor (SVR) Implementation ---")

if 'X_train' not in locals() or X_train.empty:
    print("Error: X_train, X_test, y_train, y_test not found. Please ensure the earlier data-splitting cell of the ML pipeline was run.")
    raise SystemExit("Data split variables not found. Cannot proceed with SVR.")

# SVR can be sensitive to the scale of features. Since we have already applied StandardScaler, X_train and X_test should be appropriately scaled

# SVR Algorithm Implementation
print("\nImplementing the Support Vector Regressor (SVR)...")

# Common parameters for SVR:
# kernel: Specifies the kernel type to be used in the algorithm ('rbf' is common for non-linear data)
# C: Regularization parameter. The strength of the regularization is inversely proportional to C.
# epsilon: Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function.
# gamma: Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. If 'scale' (default), it uses 1 / (n_features * X.var()).

svr_reg = SVR(
    kernel='rbf',       # Radial Basis Function kernel for non-linear relationships
    C=1.0,              # Regularization parameter (adjust if model is overfitting/underfitting)
    epsilon=0.1,        # Epsilon-tube within which errors are ignored
    gamma='scale'       # Kernel coefficient, uses 1 / (n_features * X.var())
)
print("Support Vector Regressor model defined.")

# Fit the Algorithm
print("\nFitting the SVR to the training data...")
# SVR can be computationally intensive with many samples or features; if PCA was applied during preprocessing, the reduced dimensionality should help here.
svr_reg.fit(X_train, y_train)
print("SVR trained successfully.")

# Predict on the Model
print("\nMaking predictions on the test data...")
y_pred_svr = svr_reg.predict(X_test)
print("Predictions generated.")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualize its Corresponding Evaluation Metric Chart (for its performance)
print("\nEvaluating Model Performance and Visualizing Metrics...")

# Calculate Evaluation Metrics
mae_svr_untuned = mean_absolute_error(y_test, y_pred_svr)
mse_svr_untuned = mean_squared_error(y_test, y_pred_svr)
rmse_svr_untuned = np.sqrt(mse_svr_untuned)
r2_svr_untuned = r2_score(y_test, y_pred_svr)

print("\nModel Evaluation Metrics (Support Vector Regressor):")
print(f"  Mean Absolute Error (MAE): {mae_svr_untuned:.2f}")
print(f"  Mean Squared Error (MSE): {mse_svr_untuned:.2f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_svr_untuned:.2f}")
print(f"  R-squared (R2 Score): {r2_svr_untuned:.4f}")

# Visualization 1: Actual vs. Predicted Values Scatter Plot
plt.figure(figsize=(10, 8))
sns.scatterplot(x=y_test, y=y_pred_svr, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2) # Ideal prediction line
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("SVR: Actual vs. Predicted Salary", fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization 2: Residuals Plot
plt.figure(figsize=(10, 6))
residuals_svr = y_test - y_pred_svr
sns.scatterplot(x=y_pred_svr, y=residuals_svr, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel("Predicted Salary")
plt.ylabel("Residuals (Actual - Predicted)")
plt.title("SVR: Residuals Plot", fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization 3: Distribution of Predicted vs Actual (KDE Plot)
plt.figure(figsize=(10, 6))
sns.kdeplot(y_test, label='Actual Salary Distribution', fill=True, color='blue', alpha=0.6)
sns.kdeplot(y_pred_svr, label='Predicted Salary Distribution', fill=True, color='red', alpha=0.6)
plt.xlabel("Salary")
plt.ylabel("Density")
plt.title("SVR: Actual vs. Predicted Salary Distribution", fontsize=16)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

print("\nSupport Vector Regressor implementation and evaluation completed.")

1. **Constant Prediction:** The model is predicting nearly a single, constant salary value for almost all inputs, indicating severe underfitting.
2. **No Trend Capture:** The "Actual vs. Predicted" scatter plot shows predictions as a flat line, meaning the model fails to capture any underlying salary trends.
3. **High Errors:** MAE ($25,845.53) and RMSE ($25,845.53) are extremely high, showing very poor prediction accuracy.
4. **Zero Explanatory Power:** The R-squared score is effectively zero (0.0000), meaning the model explains none of the variance in actual salaries.
5. **Distribution Collapse:** The predicted salary distribution is a single sharp spike, completely failing to reflect the actual salary distribution.
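
A quick way to check whether a model has collapsed to a constant is to compare it against a mean-only baseline. The sketch below uses scikit-learn's `DummyRegressor` on synthetic salaries; the data and the `*_demo` names are illustrative assumptions, not the notebook's variables. In the notebook, `X_train`, `y_train`, `X_test` and `y_test` would take their place.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-in for the salary data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 100_000 + 25_000 * rng.normal(size=200)

X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

# Baseline that ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
y_base = baseline.predict(X_te)

baseline_mae = mean_absolute_error(y_te, y_base)
baseline_r2 = r2_score(y_te, y_base)
print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Baseline R2:  {baseline_r2:.4f}")
```

If a trained model's MAE and R2 are close to these baseline values, it is adding little beyond predicting the mean.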

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
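# @title ML Model 3: SVR with Hyperparameter Optimization (RandomizedSearchCV)

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR
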
print("--- Support Vector Regressor (SVR) with Hyperparameter Optimization ---")

if 'X_train' not in locals() or X_train.empty:
    print("Error: X_train, X_test, y_train, y_test not found. Please run the earlier train/test split cell first.")
    raise SystemExit("Data split variables not found. Cannot proceed with SVR.")

#Implementation
print("\nImplementing the Support Vector Regressor (SVR) with RandomizedSearchCV...")

# Define the SVR model (this will be the base estimator for RandomizedSearchCV)
svr_base = SVR()

# Define the parameter distributions for RandomizedSearchCV
# 'kernel': Type of kernel ('rbf' is versatile, 'linear' for linear relationships)
# 'C': Regularization parameter (sampled uniformly here; a log-uniform distribution is a common alternative)
# 'epsilon': Epsilon in the epsilon-SVR model (uniform distribution)
# 'gamma': Kernel coefficient for the 'rbf' kernel ('scale', 'auto', or an explicit float)
# Note: a list is sampled element by element, so a scipy distribution must not be nested inside one;
# SVR would receive the distribution object itself as gamma and those fits would fail.

param_distributions = {
    'kernel': ['rbf'], # Start with 'rbf' as it's common for non-linear data
    'C': uniform(0.1, 100), # Explore C from 0.1 to 100.1 (0.1 + 100)
    'epsilon': uniform(0.01, 0.5), # Explore epsilon from 0.01 to 0.51 (0.01 + 0.5)
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1], # 'scale', 'auto', or explicit float values
}

# Setup RandomizedSearchCV for SVR
# n_iter: Number of parameter settings that are sampled. Reduced for faster execution.
# cv: Number of cross-validation folds. Reduced for faster execution.
# scoring: Use 'neg_mean_absolute_error' to minimize MAE.
# random_state: For reproducibility.
# verbose: Controls the verbosity of the output.
# n_jobs: Number of CPU cores used to run the cross-validation fits in parallel (SVR itself is not parallelized).

random_search_svr = RandomizedSearchCV(estimator=svr_base,
                                       param_distributions=param_distributions,
                                       n_iter=20,       # Number of random combinations to try (start small for SVR)
                                       cv=3,            # 3-fold cross-validation
                                       scoring='neg_mean_absolute_error', # Maximize negative MAE means minimize MAE
                                       verbose=2,       # Show progress
                                       random_state=42,
                                       n_jobs=-1        # Use all available CPU cores
                                      )

print(f"RandomizedSearchCV for SVR configured to explore {random_search_svr.n_iter} combinations over {random_search_svr.cv} folds.")
#print("Note: SVR tuning can be computationally intensive. Adjust n_iter and cv if needed.")


# Fit the Algorithm (with Hyperparameter Optimization)
print("\nFitting RandomizedSearchCV to find the best SVR parameters...")
random_search_svr.fit(X_train, y_train)

print("\nRandomizedSearchCV fitting complete for SVR.")
print("Best parameters found by RandomizedSearchCV:")
print(random_search_svr.best_params_)

# The best estimator is the model trained with the best parameters found
best_svr_model = random_search_svr.best_estimator_
print("\nBest SVR model selected.")

#Predict on the Model
print("\nMaking predictions on the test data with the best SVR model...")
y_pred_svr_tuned = best_svr_model.predict(X_test)
print("Predictions generated with tuned SVR model.")

# Visualize and analyze the performance of this model using Evaluation Metric Score Chart
print("\nEvaluating Tuned SVR Model Performance and Visualizing Metrics...")

# Calculate Evaluation Metrics for the Tuned Model
mae_svr_tuned = mean_absolute_error(y_test, y_pred_svr_tuned)
mse_svr_tuned = mean_squared_error(y_test, y_pred_svr_tuned)
rmse_svr_tuned = np.sqrt(mse_svr_tuned)
r2_svr_tuned = r2_score(y_test, y_pred_svr_tuned)

print(f"\nModel Evaluation Metrics (Tuned Support Vector Regressor):")
print(f"  Mean Absolute Error (MAE): {mae_svr_tuned:.2f}")
print(f"  Mean Squared Error (MSE): {mse_svr_tuned:.2f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_svr_tuned:.2f}")
print(f"  R-squared (R2 Score): {r2_svr_tuned:.4f}")


# Visualization 1: Actual vs. Predicted Values Scatter Plot (Tuned SVR)
plt.figure(figsize=(10, 8))
sns.scatterplot(x=y_test, y=y_pred_svr_tuned, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2) # Ideal prediction line
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("Tuned SVR: Actual vs. Predicted Salary", fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization 2: Residuals Plot (Tuned SVR)
plt.figure(figsize=(10, 6))
residuals_svr_tuned = y_test - y_pred_svr_tuned
sns.scatterplot(x=y_pred_svr_tuned, y=residuals_svr_tuned, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel("Predicted Salary")
plt.ylabel("Residuals (Actual - Predicted)")
plt.title("Tuned SVR: Residuals Plot", fontsize=16)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

# Visualization 3: Distribution of Predicted vs Actual (KDE Plot for Tuned SVR)
plt.figure(figsize=(10, 6))
sns.kdeplot(y_test, label='Actual Salary Distribution', fill=True, color='blue', alpha=0.6)
sns.kdeplot(y_pred_svr_tuned, label='Predicted Salary Distribution', fill=True, color='red', alpha=0.6)
plt.xlabel("Salary")
plt.ylabel("Density")
plt.title("Tuned SVR: Actual vs. Predicted Salary Distribution", fontsize=16)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

print("\nTuned Support Vector Regressor implementation and evaluation completed.")

1. **Still a Constant Predictor:** Despite tuning, the SVR model continues to predict almost a single constant salary value (around $97,500), indicating that the optimization did not resolve its fundamental underfitting issue.

2. **Failed Trend Capture:** The "Actual vs. Predicted Salary" plot shows predictions as a flat line, confirming the model's inability to capture any actual salary trends or variations.

3. **High Errors Persist:** MAE ($18,421.08) and RMSE ($25,756.35) remain very high, signifying consistently large prediction errors.

4. **Negative R-squared:** The R-squared score is -0.0005, which means the model performs even worse than simply predicting the mean of the actual salaries, indicating a complete failure to model the data.

5. **Distribution Collapse Unchanged:** The predicted salary distribution is still a sharp vertical spike, demonstrating that the tuning did not help the model learn the true distribution of salaries.

In [None]:
# @title Comparison of Results of SVR with and without Hyperparameter Optimization Technique

print("\n--- Comparison of SVR Regressor: Untuned vs. Tuned ---")

#Tabular Comparison of Results
print("\nTabular Comparison of Model Performance:")
comparison_df_svr = pd.DataFrame({
    'Metric': ['MAE', 'RMSE', 'R2 Score'],
    'Untuned SVR': [mae_svr_untuned, rmse_svr_untuned, r2_svr_untuned],
    'Tuned SVR': [mae_svr_tuned, rmse_svr_tuned, r2_svr_tuned]
})
# Format numerical columns for better readability: 4 decimals for R2, 2 decimals for MAE/RMSE.
# The format is chosen from the 'Metric' column, since the numeric value itself never contains 'R2'.
is_r2_row = comparison_df_svr['Metric'] == 'R2 Score'
for col in ['Untuned SVR', 'Tuned SVR']:
    comparison_df_svr[col] = [f'{v:.4f}' if is_r2 else f'{v:.2f}' for v, is_r2 in zip(comparison_df_svr[col], is_r2_row)]

print(comparison_df_svr.to_string(index=False))


#Visual Comparison of Results
print("\nVisual Comparison of Model Performance...")

# Prepare data for plotting (MAE and RMSE)
metrics_for_plot_svr = pd.DataFrame({
    'Model': ['Untuned SVR', 'Tuned SVR'],
    'MAE': [mae_svr_untuned, mae_svr_tuned],
    'RMSE': [rmse_svr_untuned, rmse_svr_tuned]
})

fig_svr, axes_svr = plt.subplots(1, 2, figsize=(14, 6))

# Plot MAE
sns.barplot(x='Model', y='MAE', data=metrics_for_plot_svr, ax=axes_svr[0], palette='viridis')
axes_svr[0].set_title('SVR MAE Comparison')
axes_svr[0].set_ylabel('MAE ($)')
axes_svr[0].grid(axis='y', linestyle='--', alpha=0.7)
for container in axes_svr[0].containers:
    axes_svr[0].bar_label(container, fmt='%.2f')

# Plot RMSE
sns.barplot(x='Model', y='RMSE', data=metrics_for_plot_svr, ax=axes_svr[1], palette='plasma')
axes_svr[1].set_title('SVR RMSE Comparison')
axes_svr[1].set_ylabel('RMSE ($)')
axes_svr[1].grid(axis='y', linestyle='--', alpha=0.7)
for container in axes_svr[1].containers:
    axes_svr[1].bar_label(container, fmt='%.2f')

plt.tight_layout()
plt.show()


# Plot R2 Score
plt.figure(figsize=(7, 5))
sns.barplot(x='Model', y='R2 Score', data=pd.DataFrame({
    'Model': ['Untuned SVR', 'Tuned SVR'],
    'R2 Score': [r2_svr_untuned, r2_svr_tuned]
}), palette='cividis')
plt.title('SVR R-squared (R2 Score) Comparison')
plt.ylabel('R2 Score')
plt.ylim(min(0, r2_svr_untuned, r2_svr_tuned) - 0.05, 1) # R2 is at most 1 and can be negative, so leave room below zero
plt.grid(axis='y', linestyle='--', alpha=0.7)
for container in plt.gca().containers:
    plt.gca().bar_label(container, fmt='%.4f')
plt.show()

print("\nSVR comparison completed. The plots visually demonstrate the impact of hyperparameter tuning.")


* MAE fell from $25,845.53 to $18,421.08, but an average error of roughly $18.4K is still far too large for the model to be practically useful for salary prediction.

* RMSE barely changed, from $25,845.53 to $25,756.35.

* R-squared (R2 Score) moved from 0.0000 to -0.0005. An R2 of zero means the model explains none of the variance in the target, and a negative R2 means it performs worse than simply predicting the mean of the actual salaries. Either way, the model has failed at this task.
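
For reference, the standard definition of R-squared makes the zero and negative values easier to read. With actual salaries $y_i$, predictions $\hat{y}_i$, and the mean of the actual test salaries $\bar{y}$:

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

If every prediction is the same constant $c$, the numerator becomes $\sum_i (y_i - c)^2$, which is smallest when $c = \bar{y}$. A constant predictor therefore gives $R^2 = 0$ when it predicts exactly the test mean and a slightly negative value otherwise, which is consistent with the scores reported for both SVR runs.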

##### Which hyperparameter optimization technique have you used and why?

**Randomized Search Cross-Validation (RandomizedSearchCV) was used for hyperparameter optimization of the Support Vector Regressor (SVR).**

RandomizedSearchCV was chosen as a practical approach for hyperparameter tuning. It randomly samples a fixed number of parameter combinations from defined distributions, which is more efficient than GridSearchCV for potentially large search spaces, while still allowing exploration of key SVR parameters like C, epsilon, and gamma.
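
To make the sampling step concrete, here is a minimal sketch using scikit-learn's `ParameterSampler`, which draws parameter combinations from distributions in the same style that `RandomizedSearchCV` accepts. The search space below (`demo_space`) is an assumed example, not the notebook's exact configuration; it uses `loguniform` for C to show the log-scale alternative.

```python
from scipy.stats import loguniform, uniform
from sklearn.model_selection import ParameterSampler

# Illustrative search space (assumed values, not the notebook's exact space)
demo_space = {
    "kernel": ["rbf"],
    "C": loguniform(1e-1, 1e3),      # sampled on a log scale between 0.1 and 1000
    "epsilon": uniform(0.01, 0.5),   # uniform on [0.01, 0.51]
    "gamma": ["scale", "auto"],      # list entries are picked uniformly at random
}

# Draw 5 random combinations, one per search iteration
sampled = list(ParameterSampler(demo_space, n_iter=5, random_state=42))
for params in sampled:
    print(params)
```

Each printed dictionary is one candidate configuration; a grid search over the same ranges would need to enumerate every combination instead.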

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

No. Hyperparameter tuning with RandomizedSearchCV did not produce a meaningful improvement in the SVR model. MAE fell, but RMSE and R-squared were essentially unchanged, and the tuned model still predicts a near-constant salary. The search did not find parameters that make SVR effective on this dataset.
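
One possible contributor, which was not tested in this notebook, is the scale of the target. SVR's `C` and `epsilon` are expressed in the units of the target, so values such as `C=1.0` and `epsilon=0.1` are tiny relative to salaries in the tens of thousands of dollars. The sketch below illustrates the effect on synthetic data only: it wraps the same SVR in scikit-learn's `TransformedTargetRegressor` so the target is standardized for fitting and converted back for prediction. The data and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data with a dollar-scale target (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (100_000 + 20_000 * X_demo[:, 0] + 10_000 * X_demo[:, 1]
          + rng.normal(scale=2_000, size=300))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# SVR fitted directly on raw dollar values
svr_raw = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X_tr, y_tr)
r2_raw = r2_score(y_te, svr_raw.predict(X_te))

# Same SVR settings, but the target is standardized during fitting
svr_scaled = TransformedTargetRegressor(
    regressor=SVR(kernel="rbf", C=1.0, epsilon=0.1),
    transformer=StandardScaler(),
)
svr_scaled.fit(X_tr, y_tr)
r2_scaled = r2_score(y_te, svr_scaled.predict(X_te))

print(f"R2 with raw target:          {r2_raw:.4f}")
print(f"R2 with standardized target: {r2_scaled:.4f}")
```

On this toy data the raw-target SVR stays close to a constant prediction while the scaled-target version tracks the signal. Whether the same holds for the Glassdoor features would need to be checked by rerunning the search with a scaled target.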

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

* **Mean Absolute Error (MAE):** MAE is highly interpretable and directly translates to an average financial error. For a business, knowing that their salary predictions are, on average, off by, say, $7,500 is very concrete. It directly impacts budgeting for new hires, setting competitive offers, and understanding the financial risk of mispredictions. A lower MAE means more precise financial planning and reduced potential for overspending or underpaying talent.

* **R-squared (R2 Score):** R-squared indicates the proportion of variance in salaries that the model can explain. A higher R2 (e.g., 70-80%) gives stakeholders confidence that the model is genuinely understanding and capturing the underlying factors that drive salary differences, rather than just making random guesses. This builds trust in the model's insights and its ability to reflect market realities, which is crucial for strategic compensation planning and talent management.

* **Root Mean Squared Error (RMSE):** While similar to MAE, RMSE penalizes larger errors more heavily. In a business context, this is critical when some mispredictions (e.g., for very senior roles with high salaries) could have a disproportionately large negative impact. Minimizing RMSE helps ensure that the model avoids these large, costly mistakes, contributing to more robust financial risk management and better decision-making for high-value roles.

All three metrics combined provide a comprehensive view: MAE for average precision, RMSE for guarding against significant errors, and R2 for overall explanatory power and model trustworthiness.
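
A small worked example shows how the three metrics react to a single large miss. The salaries below are made-up values in thousands of dollars, chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy salaries in thousands of dollars (made-up values for illustration)
y_true_demo = np.array([50.0, 60.0, 70.0, 80.0])
y_pred_demo = np.array([52.0, 58.0, 71.0, 100.0])  # the last prediction is a large miss

mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)
rmse_demo = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
r2_demo = r2_score(y_true_demo, y_pred_demo)

print(f"MAE:  {mae_demo:.2f}")   # 6.25
print(f"RMSE: {rmse_demo:.2f}")  # 10.11
print(f"R2:   {r2_demo:.3f}")    # 0.182
```

Three of the four errors are 2 or less, yet the single 20-point miss pushes RMSE well above MAE and drags R2 down to about 0.18.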

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

**Random Forest Regressor would be chosen as the final prediction model.**

1. Significant Improvement with Tuning: Unlike the XGBoost RandomizedSearchCV run (which surprisingly showed a slight regression) and the SVR (which fundamentally failed), the Random Forest Regressor demonstrated a clear and substantial improvement after hyperparameter optimization using RandomizedSearchCV. Its MAE, RMSE, and R2 all improved notably.
2. Robustness and Reliability: Random Forest models are generally very robust, handle various data types well (numerical, categorical, and even high-dimensional text features), are less prone to overfitting than single decision trees, and tend to perform well out-of-the-box, with tuning further enhancing them.
3. Interpretability (Feature Importance): As an ensemble of decision trees, Random Forest naturally provides feature importances, which is valuable for business insights (explained below).
4. Computational Feasibility: While tuning took time, it was feasible within reasonable limits, unlike the SVR which struggled immensely with computation and effectiveness.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

**Random Forest Regressor:**

The Random Forest Regressor is an ensemble learning method that operates by constructing a "forest" of multiple decision trees during training and outputting the average prediction of the individual trees.

* **Ensemble of Decision Trees:** Instead of training a single decision tree (which can be prone to overfitting), Random Forest builds many decision trees (e.g., 100 or 500, defined by n_estimators).
* **Bagging (Bootstrap Aggregating):**
  * Bootstrapping: Each individual tree in the forest is trained on a random subset of the training data, sampled with replacement (bootstrapping). This introduces diversity among the trees.
  * Aggregating: For regression, the final prediction is the average of the predictions made by all the individual trees. This averaging helps to reduce variance and improve the overall accuracy and generalization ability of the model.
* **Feature Randomness:** During the construction of each tree, at each split point, only a random subset of features (e.g., max_features='sqrt' or max_features=0.8) is considered. This further decorrelates the trees, making the ensemble more robust and less prone to overfitting.
* **Handling Diverse Data:** It can handle a mix of numerical and categorical features naturally (after one-hot encoding), and it is quite robust to outliers. It also works well with high-dimensional feature spaces, such as those generated by TF-IDF for text.
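
The aggregation step can be checked directly: in scikit-learn, the fitted trees are exposed through `estimators_`, and averaging their predictions by hand reproduces the forest's output. This is a minimal sketch on synthetic data; the `*_demo` names and hyperparameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Each tree sees a bootstrap sample and a random feature subset at every split
forest = RandomForestRegressor(n_estimators=25, max_features="sqrt", random_state=1)
forest.fit(X_demo, y_demo)

# Collect every tree's prediction for the first five rows, then average by hand
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_demo[:5])

print(np.allclose(manual_average, forest_prediction))
```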

**Feature Importance using `model.feature_importances_` (Built-in Explainability):**

For tree-based models like Random Forest, the most straightforward and powerful built-in model explainability tool is the feature_importances_ attribute.

* During the training of a Random Forest, the model implicitly keeps track of how much each feature contributes to reducing impurity (e.g., Mean Squared Error for regression) across all the decision trees. When a feature is used to make a split in a tree, and that split significantly reduces the impurity of the data, that feature's importance score increases. The feature_importances_ value is then the average of these importance scores across all trees in the forest. Features that are consistently important for making accurate splits will have higher importance scores.

* **Example Top Features (Hypothetical for a Salary Prediction Model):**
Based on typical salary drivers, the top features in this model would likely include:
  * Job Title: Highly influential, as salaries vary enormously by role (e.g., "Software Engineer" vs. "Data Scientist" vs. "Marketing Manager").
  * Years of Experience: Directly correlates with compensation.
  * Location (e.g., City/State/Country): Cost of living and market demand vary significantly by geography.
  * Industry: Different industries have different pay scales (e.g., tech vs. non-profit).
  * Company Size: Larger companies often have more structured and sometimes higher pay scales.
  * Skills (from text features like 'Job Description'): Specific, in-demand skills mentioned in the job description (e.g., "Python," "Machine Learning," "Cloud") would likely have high importance.

* **Business Impact of Feature Importance:** Knowing feature importance has significant business value:

* **Strategic Compensation Planning:** Helps HR and compensation teams understand which attributes are truly driving salaries in the market. They can then align their internal compensation structures with these key drivers.
* **Targeted Recruitment:** Recruiters can focus on highlighting job aspects (e.g., specific skills, industry, company size) that are highly valued and contribute most to competitive salaries, attracting the right talent.
* **Talent Development:** Identifies which skills (from the text features) are most valuable in the market, guiding training and development programs for existing employees to increase their market value.
* **Job Description Optimization:** Provides insights into which keywords or requirements in job descriptions correlate most strongly with higher salaries, helping to optimize job postings.
* **Negotiation Insights:** Allows hiring managers to understand the key factors influencing a candidate's salary expectations, aiding in more effective negotiation.
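
Reading the built-in importances takes only a few lines. The sketch below fits a small forest on synthetic data in which one feature dominates the target, then ranks the features. The feature names are hypothetical and do not correspond to the notebook's columns; in the notebook, the tuned Random Forest and its training columns would be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: "seniority" drives the target far more than the other columns
rng = np.random.default_rng(7)
feature_names = ["seniority", "company_size", "noise_a", "noise_b"]  # hypothetical names
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=feature_names)
y_demo = (5.0 * X_demo["seniority"] + 1.0 * X_demo["company_size"]
          + rng.normal(scale=0.3, size=300))

forest = RandomForestRegressor(n_estimators=50, random_state=7).fit(X_demo, y_demo)

# feature_importances_ holds one non-negative score per column; the scores sum to 1
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so a cross-check with `sklearn.inspection.permutation_importance` on held-out data is worth considering before drawing business conclusions.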

In summary, the Tuned Random Forest Regressor provides a robust and interpretable model for salary prediction, with its feature_importances_ providing actionable business insights.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
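
The two cells above are left as placeholders. A minimal sketch of the save-and-reload round trip with `joblib` could look like the following. It trains a small stand-in model on synthetic data so that it runs on its own; the file name `salary_model.joblib` and the `*_demo` data are assumptions, and in the notebook the tuned Random Forest would be saved instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in model trained on synthetic data (illustrative only)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(100, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)
model = RandomForestRegressor(n_estimators=10, random_state=3).fit(X_demo, y_demo)

# Save the fitted model to disk (placeholder file name in a temporary folder)
model_path = os.path.join(tempfile.mkdtemp(), "salary_model.joblib")
joblib.dump(model, model_path)

# Reload it and confirm that predictions on unseen rows are unchanged
loaded_model = joblib.load(model_path)
X_unseen = rng.normal(size=(5, 3))
preds_original = model.predict(X_unseen)
preds_loaded = loaded_model.predict(X_unseen)
print(np.allclose(preds_original, preds_loaded))
```

For deployment, any preprocessing fitted on the training data (scalers, encoders, TF-IDF vectorizers) needs to be saved alongside the model, for example by wrapping everything in a single scikit-learn `Pipeline` before calling `joblib.dump`.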

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The primary objective of predicting Average Salary has been successfully addressed. The models demonstrate the capability to learn complex relationships from job attributes and text features, providing data-driven salary estimations.

**Hyperparameter Optimization is Crucial, but Not Universally Effective (Per Run):**
The experiments clearly illustrate that hyperparameter optimization is a critical step for maximizing model performance. While it significantly boosted the **Random Forest Regressor's accuracy**, its impact varied:

* **Random Forest Regressor:** Hyperparameter tuning dramatically improved its performance, leading to a substantial reduction in MAE and RMSE, and a significant increase in R-squared. This model transitioned from a decent baseline to a strong predictor.
* **XGBoost Regressor:** Interestingly, the RandomizedSearchCV run for XGBoost, in this specific instance, did not yield an improvement over the untuned model, and in some metrics, showed a slight regression. This highlights that tuning is an iterative process; even with efficient methods, the search space or n_iter/cv might need further refinement for specific models to find their true optimum.
* **Support Vector Regressor (SVR):** The SVR model proved unsuitable for this dataset's characteristics, particularly with the high-dimensional text features. Even after attempting hyperparameter optimization with an expanded search space, it consistently acted as a constant predictor, resulting in extremely poor (near-zero or negative R-squared) performance. This indicates that SVR might not be the right algorithm for this problem without significant architectural changes or very specialized feature engineering.

**The Chosen Best Model: Tuned Random Forest Regressor:**
Among the models evaluated, the Tuned Random Forest Regressor stands out as the most effective and reliable solution for salary prediction in this project. Its strong performance (evidenced by reduced MAE/RMSE and improved R-squared) after optimization makes it the ideal candidate. Its inherent robustness, ability to handle diverse feature types (numerical, categorical, text-derived), and natural interpretability through feature importances further solidify its choice.

**Business Impact and Value:**
A model like the Tuned Random Forest Regressor, with its average prediction error (MAE) reduced to approximately $7,500 and explaining nearly 74% of salary variance, offers substantial business value. It empowers organizations to:
* Set more accurate and competitive salary ranges for job postings.
* Improve budgeting and financial forecasting related to human capital.
* Attract and retain talent by offering data-backed compensation.
* Gain insights into the key drivers of salary variation (through feature importance), informing compensation strategies and talent development programs.

**Future Considerations:**
While the Tuned Random Forest Regressor provides excellent results, further improvements could be explored:
* **More Extensive Tuning:** Further refining the hyperparameter search for both XGBoost (perhaps with more Optuna trials or a broader initial search) and Random Forest could yield marginal gains.
* **Ensemble Modeling:** Combining the predictions of the best XGBoost and Random Forest models through stacking or blending could potentially lead to even more robust and accurate results.
* **Deeper Feature Engineering:** Exploring more advanced text embeddings (e.g., Word2Vec, BERT) or creating more complex interaction features could further enhance model performance.
* **Deployment and Monitoring:** The next logical step would be to prepare this model for deployment, allowing for real-time salary predictions, and to establish monitoring systems to track its performance over time.
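
As a rough sketch of the ensemble idea listed above, scikit-learn's `StackingRegressor` can combine two tree-based models through a simple meta-learner. To keep the example free of extra dependencies, `GradientBoostingRegressor` stands in for XGBoost here; that substitution, the synthetic data, and all hyperparameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(200, 4))
y_demo = 4.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Base models produce out-of-fold predictions; Ridge learns how to weight them
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=20, random_state=5)),
        ("gbr", GradientBoostingRegressor(n_estimators=20, random_state=5)),
    ],
    final_estimator=Ridge(),
    cv=3,
)
stack.fit(X_demo[:150], y_demo[:150])
stack_preds = stack.predict(X_demo[150:])
print(stack_preds[:5])
```

Whether stacking beats the tuned Random Forest on the Glassdoor data would have to be measured on the same held-out test set.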

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***