<a href="https://colab.research.google.com/github/Sahildubey08/ML-Submission-Glassdoor/blob/main/ML_Submission_Glassdoor_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Glassdoor Jobs Salary Prediction (Machine Learning)





##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

This project aims to develop a machine learning model that predicts job salaries from features such as job title, company, location, required skills, experience level, and other relevant attributes. By combining regression algorithms with comprehensive data preprocessing, the model is intended to provide salary estimates that help both job seekers and employers make informed decisions about compensation expectations and market trends.

# **GitHub Link -**

# **Problem Statement**


In today's competitive job market, both job seekers and employers face significant challenges in determining appropriate salary ranges. Job applicants often struggle to understand their market worth, while employers find it difficult to set competitive compensation packages that attract top talent without overpaying.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/glassdoor_jobs.csv')

### Dataset First View

In [None]:
# Dataset First Look

In [None]:
display(df.head())

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info (column dtypes and non-null counts)
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

### What did you know about your dataset?

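The dataset contains 956 rows and 15 columns, with no duplicate rows and no null values. Several columns, however, use -1 as a placeholder for missing data (see the variable descriptions below), so `isnull()` alone understates how much is missing.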
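Because a null check does not catch placeholder values, it can help to count them explicitly. The following is a minimal sketch using only the standard library; the three rows and the helper name `count_sentinels` are made up for illustration and are not taken from the dataset.

```python
# Illustrative rows only (hypothetical values, not from glassdoor_jobs.csv).
rows = [
    {"Salary Estimate": "$53K-$91K (Glassdoor est.)", "Founded": 1973, "Competitors": "-1"},
    {"Salary Estimate": "-1", "Founded": -1, "Competitors": "-1"},
    {"Salary Estimate": "$80K-$120K (Glassdoor est.)", "Founded": 2010, "Competitors": "Acme, Initech"},
]

def count_sentinels(rows, sentinel=-1):
    """Count, per column, how many values equal the sentinel as a number or as a string."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_sentinel = value == sentinel or value == str(sentinel)
            counts[column] = counts.get(column, 0) + int(is_sentinel)
    return counts

print(count_sentinels(rows))
```

On the real DataFrame, a comparable check would be `(df == -1).sum()` together with `(df == '-1').sum()`.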

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns.tolist()

### Variables Description



**Job Title:** The title of the job posting (e.g., "Data Scientist", "Healthcare Data Scientist").

**Salary Estimate:** The estimated salary range for the job, provided as a string (e.g., "$53K-$91K (Glassdoor est.)"). Some entries have a value of -1, which likely indicates missing data.

**Job Description:** A detailed description of the job, including responsibilities, requirements, qualifications, and sometimes benefits. This is a string of text.

**Rating:** The company's overall rating on Glassdoor (on a scale of 1 to 5). This is a numeric field.

**Company Name:** The name of the company, and sometimes the company's rating is appended (e.g., "Tecolote Research 3.8").

**Location:** The city and state where the job is located (e.g., "Albuquerque, NM").

**Headquarters:** The city and state where the company's headquarters is located (e.g., "Goleta, CA").

**Size:** The size of the company in terms of number of employees, given as a range (e.g., "501 to 1000 employees").

**Founded:** The year the company was founded.

**Type of ownership:** The type of company ownership (e.g., "Company - Private", "Company - Public", "Government", etc.).

**Industry:** The industry in which the company operates (e.g., "Aerospace & Defense").

**Sector:** The sector of the company (e.g., "Aerospace & Defense").

**Revenue:** The revenue of the company, given as a range (e.g., "$50 to $100 million (USD)"). Some entries are "Unknown / Non-Applicable".

**Competitors:** The competitors of the company. Many entries have -1, which likely indicates no data or no competitors listed.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Create a new DataFrame df_clean
df_clean = df.copy()

# Handle salary estimate including 'Employer Provided Salary' and other formats
df_clean['Salary Estimate'] = df_clean['Salary Estimate'].apply(lambda x: x.replace('(Glassdoor est.)', '').replace('(Employer est.)', '').strip())

# Extract min and max salary, handling different formats
def parse_salary(salary_str):
    try:
        if 'Employer Provided Salary:' in salary_str:
            salary_str = salary_str.replace('Employer Provided Salary:', '').strip()
            # Handle cases like '150K' or '150'
            if 'K' in salary_str:
                 min_sal = int(salary_str.replace('$', '').replace('K', '')) * 1000
            else:
                 min_sal = int(salary_str.replace('$', '')) * 1000
            max_sal = min_sal # Assuming the employer provided salary is a single value
        else:
            sal_parts = salary_str.split('-')
            if len(sal_parts) == 2:
                min_sal = int(sal_parts[0].replace('$', '').replace('K', '')) * 1000
                max_sal = int(sal_parts[1].replace('$', '').replace('K', '')) * 1000
            else:
                min_sal = np.nan
                max_sal = np.nan
    except ValueError:
        min_sal = np.nan
        max_sal = np.nan
    return min_sal, max_sal

df_clean[['min_salary', 'max_salary']] = df_clean['Salary Estimate'].apply(lambda x: pd.Series(parse_salary(x)))

# Calculate average salary
df_clean['avg_salary'] = (df_clean['min_salary'] + df_clean['max_salary']) / 2

# Replace -1 with NaN in the new salary columns
df_clean[['min_salary', 'max_salary', 'avg_salary']] = df_clean[['min_salary', 'max_salary', 'avg_salary']].replace(-1, np.nan)

# Categorize job titles
def categorize_job_title(title):
    title = title.lower()
    if 'data scientist' in title:
        return 'Data Scientist'
    elif 'data engineer' in title:
        return 'Data Engineer'
    elif 'analyst' in title:
        return 'Analyst'
    elif 'manager' in title:
        return 'Manager'
    elif 'director' in title:
        return 'Manager'
    elif 'research scientist' in title:
        return 'Data Scientist'
    elif 'machine learning' in title:
        return 'Data Scientist'
    else:
        return 'Other'

df_clean['Job Role'] = df_clean['Job Title'].apply(categorize_job_title)

# Display the head of the cleaned DataFrame
display(df_clean.head())

In [None]:
# List of columns to drop - Adjusted to keep 'Job Description' for later text processing
columns_to_drop = [
    'Unnamed: 0', 'Salary Estimate', 'Job Title',
    'Company Name', 'Location', 'Headquarters', 'Size', 'Revenue', 'Competitors',
    # 'Job Description' and related text processing columns are kept until after text vectorization
    # 'description_tokens', 'description_lemmatized', 'description_pos_tags',
    # 'description_lemmatized_str', 'simple_title', 'job_state', 'Full_State'
]

# Exclude 'Job Description' from the list of columns to be dropped if it's there
# This ensures it remains for text processing later.
actual_columns_to_drop = [col for col in columns_to_drop if col in df_clean.columns and col != 'Job Description']

df_clean.drop(columns=actual_columns_to_drop, inplace=True)

print(f"Columns remaining: {df_clean.columns.tolist()}")
display(df_clean.head())

In [None]:
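# Apply tokenization to the 'Job Description' column.
# Note: tokenize_text is defined later, in the "7. Tokenization" cell of the
# Textual Data Preprocessing section. Run that cell first; otherwise this line
# raises a NameError.
df_clean['description_tokens'] = df_clean['Job Description'].apply(tokenize_text)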

# Display the head of the DataFrame with the new column
display(df_clean.head())

### What all manipulations have you done and insights you found?

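During data wrangling I:

1. Removed the explanatory text (e.g. "(Glassdoor est.)") from the salary estimates, leaving only the salary range.

2. Applied the `parse_salary` function to each row of 'Salary Estimate' to extract the minimum and maximum salary.

3. Created the minimum, maximum, and average salary columns from the parsed values.

4. Replaced any remaining -1 values in the salary columns with NaN as a safety measure.

5. Grouped job titles into broader roles ('Job Role') and dropped columns that are not needed for modeling.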
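As an alternative view of the parsing step, here is a minimal regex-based sketch. It is not the notebook's `parse_salary`; the function name `parse_salary_range` is hypothetical, and it assumes every amount is written as `$` followed by digits and an optional `K`. Unlike the notebook's version, it only multiplies by 1000 when a `K` suffix is present.

```python
import re

def parse_salary_range(text):
    """Return (min, max) in dollars, or (None, None) when no amount is found."""
    # Find every "$<digits>" amount and note whether a K suffix follows it.
    amounts = [
        int(number) * (1000 if suffix else 1)
        for number, suffix in re.findall(r"\$(\d+)(K?)", text)
    ]
    if not amounts:
        return (None, None)
    return (min(amounts), max(amounts))

print(parse_salary_range("$53K-$91K (Glassdoor est.)"))
print(parse_salary_range("Employer Provided Salary:$150K"))
print(parse_salary_range("-1"))
```

Because it collects all amounts in the string, the same function covers both a single employer-provided value and a range, and a bare `-1` yields no match rather than an exception.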



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Salary Distribution by Job Role (Box Plot)



In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(12, 6))
sns.boxplot(data=df_clean[df_clean['Job Role'] != 'Other'],
            x='Job Role', y='avg_salary')
plt.title('Salary Distribution by Tech Job Role', fontsize=14, fontweight='bold')
plt.xticks(rotation=45)
plt.ylabel('Average Salary ($)')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot to visualize the salary distribution by job role because it is an effective way to show the distribution of a numerical variable (average salary) across different categories (job roles). A box plot clearly displays the median, quartiles, and potential outliers for each job role, making it easy to compare the salary ranges and central tendencies of different roles at a glance.


##### 2. What is/are the insight(s) found from the chart?

Based on the box plot visualizing salary distribution by tech job role:

*   **Data Scientist and Data Engineer** roles appear to have higher median salaries and a wider salary range compared to 'Analyst' roles.
*   **Analyst** roles generally show a lower median salary and a tighter salary distribution.
*   There are **outliers** in several job roles, indicating some positions within those categories have significantly higher or lower salaries than the typical range.
*   The **Manager** role also shows a relatively high median salary, which is expected due to the nature of the role.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Answer Here


#### Chart - 2 Impact of Company Size on Salaries

In [None]:
# Chart - 2 visualization code

# Temporarily re-add 'Size' column to df_clean from the original df for categorization
# This assumes the index aligns or needs to be handled if not.
# Given that df_clean was a copy of df and subsequent operations kept the index, this should be fine.
if 'Size' not in df_clean.columns and 'Size' in df.columns:
    df_clean['Size'] = df['Size']

# Categorize company size
def categorize_company_size(size):
    if 'to' in str(size): # Convert to string to handle potential non-string values like -1
        return str(size).replace(' employees', '').strip()
    elif '+' in str(size):
        return str(size).replace(' employees', '').strip()
    elif str(size) == '-1':
        return 'Unknown'
    else:
        return str(size) # Ensure output is string

df_clean['Company Size Category'] = df_clean['Size'].apply(categorize_company_size)

plt.figure(figsize=(10, 6))
company_size_order = ['1 to 50', '51 to 200', '201 to 500',
                     '501 to 1000', '1001 to 5000', '5001 to 10000', '10000+', 'Unknown']
sns.barplot(data=df_clean, x='Company Size Category', y='avg_salary',
            order=company_size_order, errorbar='sd')
plt.title('Average Salary by Company Size', fontsize=14, fontweight='bold')
plt.xticks(rotation=45)
plt.ylabel('Average Salary ($)')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

I chose the bar chart to visualize the average salary by company size because it is an effective way to compare the average of a numerical variable (avg_salary) across distinct categories (Company Size Category). Bar charts clearly show the magnitude of the average salary for each size category, making it easy to see how salary levels differ based on company size.

##### 2. What is/are the insight(s) found from the chart?


*   **Larger companies tend to offer higher average salaries:** There is a general trend that as company size increases, the average salary also tends to increase. Companies with 10000+ employees show the highest average salaries.
*   **Smaller companies have lower average salaries:** Companies in the smaller size categories (e.g., 1 to 50, 51 to 200) generally have lower average salaries compared to larger companies.
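
The comparison the bar chart makes is a per-category mean. Below is a minimal sketch of that aggregation using only the standard library and made-up numbers (not values from the dataset); the helper name `mean_by_group` is hypothetical. In the notebook the equivalent is `df_clean.groupby('Company Size Category')['avg_salary'].mean()`.

```python
from collections import defaultdict

# Illustrative (made-up) rows of (company size category, average salary in dollars).
rows = [
    ("1 to 50", 80000), ("1 to 50", 90000),
    ("10000+", 110000), ("10000+", 130000),
    ("51 to 200", 95000),
]

def mean_by_group(rows):
    """Return {category: (mean salary, number of rows)}."""
    totals = defaultdict(lambda: [0, 0])  # category -> [running sum, count]
    for category, salary in rows:
        totals[category][0] += salary
        totals[category][1] += 1
    return {cat: (total / n, n) for cat, (total, n) in totals.items()}

for category, (mean_salary, n) in mean_by_group(rows).items():
    print(f"{category:10} mean={mean_salary:,.0f} n={n}")
```

Reporting the group size next to the mean is a cheap check that a category's average is not based on only a handful of postings.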

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**For Employers:**

*   **Talent Acquisition and Retention:** Understanding how company size affects salary allows businesses to benchmark their compensation packages against competitors of similar or different sizes. This can help them set competitive salaries to attract and retain talent. If a smaller company knows that larger companies generally pay more, it might need to offer other benefits (e.g., better work-life balance, more opportunities for growth, unique company culture) to compete effectively.
*   **Budgeting and Planning:** Businesses can use this information for better budgeting and workforce planning, understanding the potential salary costs associated with different growth stages and company sizes.
*   **Strategic Decision Making:** If a company is considering scaling up, the insights can inform its financial projections and talent strategy related to compensation.


**For Job Seekers:**

*   **Salary Negotiation:** Job seekers can use this information to understand expected salary ranges based on the size of the companies they are applying to, empowering them to negotiate more effectively.
*   **Targeting Job Searches:** Individuals can refine their job search by targeting companies of certain sizes if salary is a primary factor in their decision-making.


#### Chart - 3 Geographic Salary

In [None]:
import altair as alt

# Retrieve 'Location' from the original df for the rows currently in df_clean
# This handles cases where rows were removed during outlier treatment, ensuring index alignment.
if 'Location' not in df_clean.columns and 'Location' in df.columns:
    df_clean['Location'] = df.loc[df_clean.index, 'Location']

# Ensure the DataFrame is ready with 'State' and 'avg_salary'
if 'State' not in df_clean.columns:
    df_clean['State'] = df_clean['Location'].apply(lambda x: x.split(',')[-1].strip() if ',' in x else x)

# Mapping of state abbreviations to full names
state_mapping = {
    'AL': 'Alabama', 'AK': 'Alaska', 'AZ': 'Arizona', 'AR': 'Arkansas', 'CA': 'California',
    'CO': 'Colorado', 'CT': 'Connecticut', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia',
    'HI': 'Hawaii', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'IA': 'Iowa', 'KS': 'Kansas',
    'KY': 'Kentucky', 'LA': 'Louisiana', 'ME': 'Maine', 'MD': 'Maryland', 'MA': 'Massachusetts',
    'MI': 'Michigan', 'MN': 'Minnesota', 'MS': 'Mississippi', 'MO': 'Missouri', 'MT': 'Montana',
    'NE': 'Nebraska', 'NV': 'Nevada', 'NH': 'New Hampshire', 'NJ': 'New Jersey', 'NM': 'New Mexico',
    'NY': 'New York', 'NC': 'North Carolina', 'ND': 'North Dakota', 'OH': 'Ohio', 'OK': 'Oklahoma',
    'OR': 'Oregon', 'PA': 'Pennsylvania', 'RI': 'Rhode Island', 'SC': 'South Carolina', 'SD': 'South Dakota',
    'TN': 'Tennessee', 'TX': 'Texas', 'UT': 'Utah', 'VT': 'Vermont', 'VA': 'Virginia', 'WA': 'Washington',
    'WV': 'West Virginia', 'WI': 'Wisconsin', 'WY': 'Wyoming', 'DC': 'District of Columbia' # Added DC
}

# Create a new column with full state names
if 'Full_State' not in df_clean.columns:
    df_clean['Full_State'] = df_clean['State'].map(state_mapping)

# Calculate top states by average salary using Full_State for grouping and sorting
top_states_full = df_clean.groupby('Full_State')['avg_salary'].mean().sort_values(ascending=False).head(15).reset_index()

# Create the interactive Altair chart using Full_State
chart = alt.Chart(top_states_full).mark_bar().encode(
    x=alt.X('avg_salary', title='Average Salary ($)'),
    y=alt.Y('Full_State', sort='-x', title='State'),
    tooltip=['Full_State', alt.Tooltip('avg_salary', title='Average Salary', format='$,.0f')]
).properties(
    title='Top 15 States by Average Salary (Interactive - Full Name)'
).interactive() # Make the chart interactive

# Display the chart
chart.display()

##### 1. Why did you pick the specific chart?

**Compare averages across categories:** Bar charts are excellent for comparing the mean of a numerical variable (avg_salary) across distinct geographical categories (states).


**Highlight top performers:** By sorting the states by average salary and focusing on the top 15, the bar chart clearly highlights which states offer the highest compensation on average.


**Provide clear visual ranking:** The length of each bar provides an immediate visual comparison and ranking of the average salaries across the selected states.

##### 2. What is/are the insight(s) found from the chart?

*   The chart shows the 15 states with the highest average salaries.
*   California appears to have the highest average salary among the states included in the chart.
*   Other states, such as Illinois, Massachusetts, and New Jersey, also show relatively high average salaries.
*   The chart allows a quick visual comparison across these top states, highlighting geographic variation in compensation.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**For Businesses:**

**Talent Acquisition Strategy:** Companies can use this information to understand salary expectations in different locations. This is crucial for setting competitive compensation packages when hiring in various states. If a company is expanding or opening new offices, this insight can inform location decisions based on talent availability and associated salary costs.


**Compensation Benchmarking:** Businesses can benchmark their salaries against the average salaries in their specific location and in competitor locations. This helps ensure they are offering competitive wages to attract and retain talent.


**Remote Work Policies:** For companies considering remote work options, understanding geographic salary differences can help in defining compensation policies for remote employees based on their location.


**Market Analysis:** The chart provides insights into regional salary trends, which can be valuable for market analysis and strategic planning.


**For Job Seekers:**

**Targeting Job Searches:** Job seekers can use this information to identify states with higher average salaries for their desired roles, helping them focus their job search efforts.


**Salary Negotiation:** Knowing the typical salary range in a specific state empowers job seekers to negotiate their salaries more effectively.


**Relocation Decisions:** Individuals considering relocation for career opportunities can use this data to understand the potential earning differences in various states.

#### Chart - 4 Industry/Sector Impact on Salaries

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(12, 6))
top_industries = df_clean.groupby('Industry')['avg_salary'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=top_industries.values, y=top_industries.index)
plt.title('Top 10 Industries by Average Salary', fontsize=14, fontweight='bold')
plt.xlabel('Average Salary ($)')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the average salary by industry for the following reasons:

**Compare averages across categories:** Bar charts are well-suited for comparing the mean of a numerical variable (avg_salary) across distinct categories (industries).


**Highlight top performers:** By sorting the industries by average salary and focusing on the top 10, the bar chart clearly highlights which industries offer the highest compensation on average.


**Provide clear visual ranking:** The length of each bar provides an immediate visual comparison and ranking of the average salaries across the selected industries.

##### 2. What is/are the insight(s) found from the chart?


The chart clearly shows the top 10 industries with the highest average salaries.


Industries like Other Retail Stores, Motion Picture Production & Distribution, and Financial Analytics & Research appear to have the highest average salaries among the top 10.


The chart highlights that average salaries can vary significantly across different industries.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**For Businesses:**

**Talent Acquisition and Retention:** Understanding average salaries across different industries allows companies to benchmark their compensation packages. This is crucial for attracting and retaining talent, especially when competing with companies in higher-paying industries for similar roles.


**Industry Competitiveness:** Businesses can assess their salary competitiveness within their own industry and compare it to others. This can inform strategies for attracting talent from different sectors or for positioning themselves as an attractive employer within their industry.


**Market Analysis and Strategy:** Insights into industry salary trends are valuable for broader market analysis and strategic planning, including decisions about diversification or focusing on specific industry niches.

**For Job Seekers:**

**Targeting Job Searches:** Job seekers can use this information to identify industries that tend to offer higher average salaries for their desired skills and roles, helping them focus their job search.


**Salary Negotiation:** Knowing the typical salary range within a specific industry empowers job seekers to negotiate their salaries more effectively.


**Career Path Planning:** Understanding salary variations across industries can inform career path decisions and potential transitions between sectors.

#### Chart - 5 Correlation Heatmap

In [None]:
# Correlation Heatmap
plt.figure(figsize=(12, 8))

# Select numerical columns for correlation analysis
numerical_columns = ['avg_salary', 'min_salary', 'max_salary', 'Rating', 'sdesc_len', 'num_comp']

# Filter only columns that exist in the dataframe
existing_numerical = [col for col in numerical_columns if col in df_clean.columns]

# Calculate correlation matrix
correlation_matrix = df_clean[existing_numerical].corr()

# Create heatmap
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))  # Mask upper triangle
heatmap = sns.heatmap(correlation_matrix,
                      mask=mask,
                      annot=True,
                      cmap='coolwarm',
                      center=0,
                      fmt='.2f',
                      square=True,
                      cbar_kws={'shrink': 0.8})

plt.title('Feature Correlation Heatmap\n(How Variables Relate to Salary)',
          fontsize=16, fontweight='bold', pad=20)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Print key correlations with avg_salary
print("Key Correlations with Average Salary:")
print("=" * 40)
if 'avg_salary' in correlation_matrix.columns:
    salary_correlations = correlation_matrix['avg_salary'].sort_values(ascending=False)
    for feature, corr in salary_correlations.items():
        if feature != 'avg_salary':
            print(f"{feature:15} : {corr:+.3f}")

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** The average salary is the same across different company size categories.

**Alternate Hypothesis (H1):** The average salary differs for at least one company size category.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Drop rows with NaN in 'Company Size Category' or 'avg_salary'
df_anova = df_clean.dropna(subset=['Company Size Category', 'avg_salary'])

# Perform one-way ANOVA
# Create a formula string for the OLS model
formula = 'avg_salary ~ C(Q("Company Size Category"))'

# Fit the OLS model
model = ols(formula, data=df_anova).fit()

# Perform ANOVA table calculation
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table and extract the p-value
print(anova_table)
p_value_anova = anova_table['PR(>F)'].iloc[0]  # first row is the company-size factor
print(f"\nANOVA P-value: {p_value_anova:.4f}")

##### Which statistical test have you done to obtain P-Value?

I performed a one-way ANOVA (Analysis of Variance) test using `statsmodels` (an OLS model fitted with `ols`, followed by `anova_lm`) to obtain the p-value.

##### Why did you choose the specific statistical test?

I chose the One-Way ANOVA test because I am comparing the means of a continuous variable (average salary) across three or more independent groups (different company size categories). ANOVA is used to determine if there are any statistically significant differences between the means of these groups.
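
To make the test less of a black box, the F statistic can be worked out by hand. The sketch below is a standard-library illustration with made-up salaries (in thousands of dollars) for three hypothetical size groups; it is not the notebook's `statsmodels` code, and the function name `one_way_anova_f` is an assumption for illustration.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for {group name: list of values}."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum(
        (v - sum(values) / len(values)) ** 2
        for values in groups.values()
        for v in values
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up salaries in $K, for illustration only.
groups = {
    "small": [60, 65, 70],
    "medium": [70, 75, 80],
    "large": [85, 90, 95],
}
f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")
```

A large F means the group means differ by much more than the spread inside each group would explain; the p-value is then the upper-tail probability of that F under an F distribution with `df_between` and `df_within` degrees of freedom.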

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Check for missing values after the wrangling steps
print("Missing values before handling:")
print(df_clean.isnull().sum())

# We will fill the NaN salary values with the median of the average salary.
median_avg_salary = df_clean['avg_salary'].median()
df_clean['avg_salary'] = df_clean['avg_salary'].fillna(median_avg_salary)
df_clean['min_salary'] = df_clean['min_salary'].fillna(df_clean['min_salary'].median())
df_clean['max_salary'] = df_clean['max_salary'].fillna(df_clean['max_salary'].median())



# Let's replace -1 in 'Founded' with NaN and then impute with the median year.
df_clean['Founded'] = df_clean['Founded'].replace(-1, np.nan)
median_founded_year = df_clean['Founded'].median()
df_clean['Founded'] = df_clean['Founded'].fillna(median_founded_year)



# Re-check for missing values after imputation
print("\nMissing values after imputation:")
print(df_clean.isnull().sum())

display(df_clean.head())

#### What all missing value imputation techniques have you used and why did you use those techniques?

**Median imputation for numerical columns ('min_salary', 'max_salary', 'avg_salary', 'Founded'):** I imputed the missing values in these numerical columns with the median. The median is a robust measure of central tendency that is less affected by outliers than the mean. This is particularly useful for salary data, which can have extreme values. For the 'Founded' column, replacing -1 (which likely represents missing information) with the median year provides a reasonable estimate for companies with unknown founding dates, assuming the distribution of founding years is not heavily skewed.
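
The robustness argument can be seen on a tiny example. This is a minimal sketch with made-up salaries, using the standard library rather than pandas; one extreme value pulls the mean far more than the median.

```python
from statistics import mean, median

# Made-up salaries in dollars; None marks a missing value.
salaries = [50000, 60000, None, 70000, 400000]

known = [s for s in salaries if s is not None]
fill_value = median(known)

# Replace each missing entry with the median of the known values.
imputed = [fill_value if s is None else s for s in salaries]

print("mean of known values:  ", mean(known))
print("median of known values:", fill_value)
print("imputed series:        ", imputed)
```

Here the single 400000 entry drags the mean to 145000, while the median stays at 65000, which is much closer to a typical value in the list.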

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# Visualize numerical columns to identify outliers
numerical_cols = ['avg_salary', 'Rating', 'Founded', 'num_comp']

# Filter for columns that exist in the DataFrame
existing_numerical_cols = [col for col in numerical_cols if col in df_clean.columns]

plt.figure(figsize=(15, 10))
for i, col in enumerate(existing_numerical_cols):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(y=df_clean[col])
    plt.title(f'Box plot of {col}', fontsize=12)
plt.tight_layout()
plt.show()

# You can also use descriptive statistics to identify potential outliers
print("\nDescriptive statistics of numerical columns:")
print(df_clean[existing_numerical_cols].describe())


# Identify outliers using the IQR method for 'avg_salary':
Q1 = df_clean['avg_salary'].quantile(0.25)
Q3 = df_clean['avg_salary'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("\nIQR for Average Salary:")
print(f"  Q1: {Q1:.2f}")
print(f"  Q3: {Q3:.2f}")
print(f"  IQR: {IQR:.2f}")
print(f"  Lower Bound (for outliers): {lower_bound:.2f}")
print(f"  Upper Bound (for outliers): {upper_bound:.2f}")

# Remove outliers based on IQR for 'avg_salary'
df_clean_filtered = df_clean[(df_clean['avg_salary'] >= lower_bound) & (df_clean['avg_salary'] <= upper_bound)].copy()

print(f"\nShape before outlier removal: {df_clean.shape}")
print(f"Shape after outlier removal (avg_salary): {df_clean_filtered.shape}")

# Update df_clean to the filtered version
df_clean = df_clean_filtered

print("\nMissing values after outlier removal:")
print(df_clean.isnull().sum())

display(df_clean.head())

##### What all outlier treatment techniques have you used and why did you use those techniques?


***IQR Method for avg_salary:*** I used the Interquartile Range (IQR) method to identify and remove outliers in the avg_salary column. The IQR is the range between the first quartile (Q1) and the third quartile (Q3). Outliers are typically defined as values that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

I chose this method because the box plots showed clear potential outliers in the average salary distribution. The IQR method is a common and effective way to handle outliers in numerical data, as it is less sensitive to extreme values than methods that rely on the mean and standard deviation. By removing these outliers, we can improve the performance of our machine learning models by reducing the impact of extreme values on the training process.

I also visualized the numerical columns using box plots and printed descriptive statistics to get a better understanding of the data distribution and identify potential outliers before applying the IQR method.
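
The IQR rule described above can be traced on a small list. The following is a minimal sketch with made-up salaries, using `statistics.quantiles` instead of pandas; `method="inclusive"` is chosen here because it uses the same linear interpolation that pandas' `quantile` applies by default, so the quartiles should match.

```python
from statistics import quantiles

# Made-up average salaries in dollars, with one extreme value.
salaries = [40000, 50000, 60000, 70000, 80000, 90000, 100000, 300000]

q1, _, q3 = quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only values inside the [lower_bound, upper_bound] fence.
kept = [s for s in salaries if lower_bound <= s <= upper_bound]
removed = [s for s in salaries if s not in kept]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"bounds=({lower_bound}, {upper_bound})")
print("removed:", removed)
```

Only the 300000 entry falls outside the fence in this example, so it is the only value dropped.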

### 3. Categorical Encoding

### Removing Unwanted Columns

In [None]:
#Categorical Encoding
# Identify categorical columns
print("Categorical Variables Analysis:")
print("=" * 50)

categorical_columns = []
# Define a list of columns to exclude from the nunique check
exclude_cols = ['Job Description', 'description_tokens', 'description_lemmatized', 'description_pos_tags', 'description_lemmatized_str']

for col in df_clean.columns:
    # Exclude columns that are in the exclude_cols list
    if col not in exclude_cols:
        if df_clean[col].dtype == 'object' or df_clean[col].nunique() < 20:
            unique_count = df_clean[col].nunique()
            categorical_columns.append(col)
            print(f"{col:25} : {unique_count:3} unique values")
            if unique_count <= 10:  # Show values for columns with few categories
                print(f"{'':25}   {df_clean[col].unique()}")

print(f"\nTotal categorical columns: {len(categorical_columns)}")

# Key categorical features for encoding
key_categorical = ['Job Role', 'Company Size Category', 'State', 'Industry', 'Sector', 'Type of ownership']
# Filter to only include columns that exist in our dataframe
key_categorical = [col for col in key_categorical if col in df_clean.columns and col not in exclude_cols]

print(f"\nKey categorical features for encoding: {key_categorical}")

In [None]:
# Apply one-hot encoding to key categorical features
key_categorical = ['Job Role', 'Company Size Category', 'State', 'Industry', 'Sector', 'Type of ownership']

# Ensure columns exist before encoding
existing_key_categorical = [col for col in key_categorical if col in df_clean.columns]

df_encoded = pd.get_dummies(df_clean, columns=existing_key_categorical, drop_first=True) # drop_first=True to avoid multicollinearity

# Display the head of the encoded DataFrame
display(df_encoded.head())

print(f"\nShape of DataFrame after One-Hot Encoding: {df_encoded.shape}")

#### What all categorical encoding techniques have you used & why did you use those techniques?

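Machine learning algorithms typically work with numerical data, so categorical variables need to be converted into a numerical format.

I used **one-hot encoding** (`pd.get_dummies` with `drop_first=True`) on the key categorical features: 'Job Role', 'Company Size Category', 'State', 'Industry', 'Sector', and 'Type of ownership'. These are treated as nominal variables with no natural order, so one-hot encoding avoids implying a ranking between categories, and dropping the first level of each feature avoids perfectly collinear indicator columns.

The choice of technique depends on the nature of the categorical variable (nominal or ordinal) and its cardinality (number of unique categories). Target encoding can be beneficial for high-cardinality features, but it must be fitted with cross-validation or on the training split only to prevent data leakage.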
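
As a small illustration of the one-hot step, the sketch below encodes a toy column with `pd.get_dummies` and `drop_first=True`. The DataFrame `toy` and its values are made up for illustration; only the call itself mirrors the notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Glassdoor dataset)
toy = pd.DataFrame({
    "Sector": ["Finance", "Health Care", "Information Technology", "Finance"],
    "Rating": [3.8, 4.1, 4.5, 3.2],
})

# drop_first=True drops the first (alphabetically sorted) category of each
# encoded column, so k categories become k-1 indicator columns.
toy_encoded = pd.get_dummies(toy, columns=["Sector"], drop_first=True)

print(toy_encoded.columns.tolist())
```

A row whose indicator columns are all zero belongs to the dropped baseline category ('Finance' in this toy example).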

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
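# Install packages and download the NLTK resources used in the steps below
# (contraction expansion itself is not applied in this cell)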
!pip install contractions
!pip install textsearch
!pip install tqdm
import nltk
nltk.download('punkt')
nltk.download('stopwords')

#### 2. Lower Casing

In [None]:
def lower_casing(text):
  # Lower Casing
  text = text.lower()
  return text

#### 3. Removing Punctuations

In [None]:
import re
import string

def remove_punctuations(text):
  # Remove Punctuations
  text = text.translate(str.maketrans('', '', string.punctuation))
  return text

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs (this function does not remove words containing digits)
def remove_urls(text):
  # Remove URLs
  url_pattern = re.compile(r'https?://\S+|www\.\S+')
  text = url_pattern.sub(r'', text)
  return text
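
The heading also mentions words containing digits, which `remove_urls` does not handle. A minimal sketch of that step could look like the following; the helper name `remove_words_with_digits` is hypothetical and not part of the original notebook.

```python
def remove_words_with_digits(text):
    # Keep only whitespace-delimited tokens that contain no digit characters
    kept = [tok for tok in text.split() if not any(ch.isdigit() for ch in tok)]
    return " ".join(kept)

sample = "need 5 years of python3 experience"
print(remove_words_with_digits(sample))
```

Be aware that this would also drop tokens such as "python3" or "s3", which may carry skill information in job descriptions, so whether to apply it is a judgment call.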

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

In [None]:
# Remove White spaces
def remove_whitespaces(text):
  # Remove White spaces
  text = text.strip()
  return text
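
The cells above download the stopword list but do not show the removal itself, and `strip()` only trims the ends of the string. The sketch below shows one way both steps might be done. `DEMO_STOPWORDS` is a small stand-in list and the helper names are hypothetical; in the notebook the set would presumably come from `stopwords.words('english')`.

```python
import re

# Small stand-in list; assumed to be replaced by
# set(stopwords.words('english')) from nltk.corpus in the real notebook.
DEMO_STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "in"}

def remove_stopwords(text, stop_words=DEMO_STOPWORDS):
    # Keep only tokens whose lower-cased form is not a stopword
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

def collapse_whitespace(text):
    # Replace runs of spaces, tabs and newlines with one space, then trim
    return re.sub(r"\s+", " ", text).strip()

cleaned = collapse_whitespace(remove_stopwords("The  role is\n in the   data team"))
print(cleaned)
```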

#### 6. Rephrase Text

#### 7. Tokenization

In [None]:
# Define the tokenization function
def tokenize_text(text):
  # Tokenization
  return nltk.word_tokenize(text)

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

In [None]:
import nltk
nltk.download('punkt_tab')

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

# Import necessary libraries for lemmatization
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download necessary NLTK data if not already downloaded
try:
    wordnet.ensure_loaded()
except LookupError:
    nltk.download('wordnet')
    nltk.download('omw-1.4')


# Instantiate the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Define a function to lemmatize a list of tokens
def lemmatize_tokens(tokens):
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens

# Apply the lemmatization function to the 'description_tokens' column
df_clean['description_lemmatized'] = df_clean['description_tokens'].apply(lemmatize_tokens)

# Display the head of the updated DataFrame
display(df_clean.head())

##### Which text normalization technique have you used and why?

I have used **Lemmatization** as the text normalization technique.

**Why Lemmatization?**

Lemmatization is a more sophisticated technique than stemming because it reduces words to their base or dictionary form (lemma). Unlike stemming, which simply chops off suffixes and can produce non-words, lemmatization looks each word up in a vocabulary (WordNet here) and returns a valid base form. With the right part-of-speech tag, 'running' becomes 'run' and 'better' becomes 'good'. Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a `pos` argument is passed, so the cell above, which calls it without `pos`, mainly normalizes plural nouns (e.g. 'skills' to 'skill').

In the context of analyzing job descriptions, lemmatization helps in grouping together different forms of the same word (e.g., "develop," "developing," "developed" all become "develop" when lemmatized as verbs). This is important for accurate feature extraction and analysis, as it ensures that variations of a word are treated as the same concept. This leads to a more accurate representation of the vocabulary and can improve the performance of downstream tasks like text vectorization and model building.

While stemming is faster, the slightly increased computational cost of lemmatization is often offset by the improved quality of the normalized text, especially for tasks that require a deeper understanding of word meaning.

#### 9. Part of speech tagging

In [None]:
# POS Tagging
import nltk

# Download the averaged_perceptron_tagger if not already downloaded
try:
    nltk.data.find('taggers/averaged_perceptron_tagger')
except LookupError:  # raised when the tagger resource is not installed locally
    nltk.download('averaged_perceptron_tagger')

# Define a function to perform POS tagging
def pos_tagging(tokens):
    # Ensure tokens is a list before tagging
    if isinstance(tokens, list):
      return nltk.pos_tag(tokens)
    else:
      return [] # Return empty list for non-list input


# Check if 'description_lemmatized' exists before applying POS tagging
if 'description_lemmatized' in df_clean.columns:
    # Apply the pos_tagging function to the 'description_lemmatized' column
    df_clean['description_pos_tags'] = df_clean['description_lemmatized'].apply(pos_tagging)

    # Display the head of the updated DataFrame
    display(df_clean.head())
else:
    print("Error: 'description_lemmatized' column not found. Please ensure lemmatization was successful.")

In [None]:
# Display the first few entries of the 'description_pos_tags' column
display(df_clean['description_pos_tags'].head())

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

# Check if 'description_lemmatized' column exists before proceeding
if 'description_lemmatized' not in df_clean.columns:
    print("Error: 'description_lemmatized' column not found.")
    tfidf_matrix = None # Set to None to indicate vectorization could not proceed
else:
    # Join the lemmatized tokens back into strings for TF-IDF
    df_clean['description_lemmatized_str'] = df_clean['description_lemmatized'].apply(lambda x: ' '.join(x))

    # Initialize TF-IDF Vectorizer
    # We can set a max_features to limit the vocabulary size and reduce dimensionality
    tfidf_vectorizer = TfidfVectorizer(max_features=5000)

    # Fit and transform the lemmatized text data
    tfidf_matrix = tfidf_vectorizer.fit_transform(df_clean['description_lemmatized_str'])

    # Display the shape of the resulting TF-IDF matrix
    print("Shape of TF-IDF matrix:", tfidf_matrix.shape)

##### Which text vectorization technique have you used and why?

I have used **TF-IDF (Term Frequency-Inverse Document Frequency)** as the text vectorization technique.

**Why TF-IDF?**

TF-IDF is a widely used and effective technique for converting text into numerical vectors, especially when the goal is to represent the importance of words within documents relative to a whole corpus. Here's why it's a good choice for analyzing job descriptions:

*   **Captures Term Importance:** TF-IDF assigns a weight to each term based on its frequency within a specific document (Term Frequency - TF) and its inverse frequency across all documents (Inverse Document Frequency - IDF). This means words that are frequent in a particular job description but rare across all job descriptions will have a higher TF-IDF score, indicating their potential importance in distinguishing that job description.
*   **Reduces the Impact of Common Words:** Common words like "the," "a," "is" (stopwords) will have a low IDF score because they appear in many documents. This reduces their weight, preventing them from dominating the vector representation.
*   **Handles Document Length Variations:** scikit-learn's `TfidfVectorizer` L2-normalizes each document vector by default (`norm='l2'`), which prevents longer job descriptions from having disproportionately larger vector values.

In the context of salary prediction from job descriptions, TF-IDF helps in identifying the most relevant keywords and phrases that might influence salary levels (e.g., specific skills, technologies, or responsibilities mentioned frequently in higher-paying jobs but less so in lower-paying ones).

While other techniques like Count Vectorization (simple word counts) or Word Embeddings exist, TF-IDF strikes a good balance between simplicity, interpretability, and effectiveness for many text classification and regression tasks.
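
To make the weighting concrete, here is a small hand-rolled sketch of TF-IDF on three made-up "job descriptions". It uses the smoothed IDF form, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which matches the default settings of scikit-learn's `TfidfVectorizer`. The documents and helper names are illustrative assumptions, and the real vectorizer also handles tokenization and vocabulary limits that are skipped here.

```python
import math
from collections import Counter

docs = [
    "python sql machine learning",
    "sql excel reporting",
    "python deep learning python",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    # L2-normalize so document length does not inflate the vector
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[1])
for term, weight in sorted(vec.items()):
    print(f"{term:10s} {weight:.3f}")
```

In the second document, "excel" and "reporting" appear in only one document and therefore receive a higher weight than "sql", which appears in two.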

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

# Create a feature for job description length
df_clean['description_len'] = df_clean['Job Description'].apply(len)

# Create a simplified job title feature
def simplify_title(title):
    title = title.lower()  # normalize once instead of in every branch
    if 'data scientist' in title:
        return 'data scientist'
    elif 'data engineer' in title:
        return 'data engineer'
    elif 'analyst' in title:
        return 'analyst'
    elif 'manager' in title or 'director' in title:
        # Director titles are grouped with managers
        return 'manager'
    else:
        return 'other'

df_clean['simple_title'] = df_clean['Job Title'].apply(simplify_title)

# Extract state from location
df_clean['job_state'] = df_clean['Location'].apply(lambda x: x.split(',')[-1].strip() if ',' in x else x)

# You can add more feature manipulation steps here based on your analysis and domain knowledge.

# Display the head of the DataFrame with new features
display(df_clean.head())

#### 2. Feature Selection

## Visualize feature relationships


In [None]:
# Chart - 14 - Correlation Heatmap (Re-using the existing chart ID)
plt.figure(figsize=(10, 8))
# Select numerical columns for correlation analysis including the target variable
numerical_cols_for_corr = ['avg_salary', 'Rating', 'Founded', 'num_comp', 'description_len']

# Filter only columns that exist in the dataframe and are numeric
existing_numerical_for_corr = [col for col in numerical_cols_for_corr if col in df_clean.columns and pd.api.types.is_numeric_dtype(df_clean[col])]


# Calculate correlation matrix
correlation_matrix_subset = df_clean[existing_numerical_for_corr].corr()

# Create heatmap
sns.heatmap(correlation_matrix_subset,
            annot=True,
            cmap='coolwarm',
            center=0,
            fmt='.2f',
            square=True,
            cbar_kws={'shrink': 0.8})

plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Chart - 15 - Pair Plot (Re-using the existing chart ID)
# Select a subset of relevant numerical features for pair plot
# Based on the correlation heatmap and domain knowledge, choose features that show some variation and potential relation to salary
pairplot_cols = ['avg_salary', 'Rating', 'num_comp', 'description_len']

# Filter for columns that exist in the dataframe
existing_pairplot_cols = [col for col in pairplot_cols if col in df_clean.columns and pd.api.types.is_numeric_dtype(df_clean[col])]


if len(existing_pairplot_cols) > 1:
    sns.pairplot(df_clean[existing_pairplot_cols])
    plt.suptitle('Pair Plot of Selected Numerical Features', y=1.02, fontsize=16, fontweight='bold')
    plt.show()
else:
    print("Not enough numerical columns available for pair plot.")

# Additional visualization: Box plot for top industries vs avg_salary
plt.figure(figsize=(12, 6))
# Recalculate top industries from the potentially filtered df_clean
top_industries = df_clean.groupby('Industry')['avg_salary'].mean().sort_values(ascending=False).head(10)
# Filter df_clean to include only these top industries
df_top_industries = df_clean[df_clean['Industry'].isin(top_industries.index)]
sns.boxplot(data=df_top_industries, x='Industry', y='avg_salary')
plt.title('Average Salary Distribution in Top 10 Industries', fontsize=14, fontweight='bold')
plt.xticks(rotation=90)
plt.ylabel('Average Salary ($)')
plt.tight_layout()
plt.show()

# Additional visualization: Box plot for company size vs avg_salary
plt.figure(figsize=(10, 6))
# Define the order for company size categories
company_size_order_viz = ['1 to 50', '51 to 200', '201 to 500',
                          '501 to 1000', '1001 to 5000', '5001 to 10000', '10000+', 'Unknown']
# Filter df_clean to include only valid company size categories
df_valid_size = df_clean[df_clean['Company Size Category'].isin(company_size_order_viz)]
sns.boxplot(data=df_valid_size, x='Company Size Category', y='avg_salary',
            order=company_size_order_viz)
plt.title('Average Salary Distribution by Company Size', fontsize=14, fontweight='bold')
plt.xticks(rotation=45)
plt.ylabel('Average Salary ($)')
plt.tight_layout()
plt.show()

##### What all feature selection methods have you used  and why?

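**Discussion on Domain Knowledge and Feature Selection**

I used filter methods (correlation analysis for numerical features and one-way ANOVA for categorical features) together with domain knowledge. Based on real-world understanding of the job market, several features are intuitively expected to be strong predictors of salary: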

Job Title: This is arguably the most important factor. Different job roles have vastly different salary ranges based on required skills, responsibilities, and market demand (e.g., a Senior Data Scientist typically earns more than a Junior Data Analyst). Our ANOVA test confirmed that 'Job Role' has a highly significant impact on average salary (p-value << 0.05).

Experience Level: While not a direct column in our dataset, experience is often implied in job titles (e.g., "Senior", "Lead"). More experienced professionals generally command higher salaries due to their expertise and proven track record. This is a crucial factor that would ideally be explicitly engineered or extracted from the 'Job Description'.

Location: Salaries vary significantly based on the cost of living and demand in different geographic areas. Major tech hubs or cities with high costs of living typically offer higher salaries for the same role compared to rural areas. Our geographic salary visualization and the ANOVA on 'State' (which returned a very low p-value, although it is not shown in the summary) support this.

Required Skills: Specific technical skills (e.g., Python, SQL, Machine Learning frameworks, Cloud platforms) and soft skills are highly valued and directly impact earning potential. Jobs requiring specialized or in-demand skills tend to pay more. This information is primarily embedded in the 'Job Description' and was partially captured through TF-IDF vectorization, which can highlight important skill keywords.

Company Size: Larger companies often have more structured compensation plans, potentially higher revenue to support higher salaries, and more complex projects requiring specialized skills. Our bar chart and ANOVA confirmed that 'Company Size Category' has a statistically significant relationship with average salary.

Industry and Sector: Different industries and sectors have varying levels of profitability, demand for specific roles, and established salary norms. For instance, tech, finance, or pharmaceuticals might offer higher salaries for data professionals compared to non-profit or education sectors. Our bar charts and ANOVA for 'Industry' and 'Sector' showed statistically significant differences in average salaries across categories.

Company Name and Rating: Reputable or highly-rated companies might attract top talent by offering competitive salaries and better benefits. Company-specific factors can influence compensation. Our correlation analysis showed a weak positive correlation between 'Rating' and 'avg_salary', suggesting a minor influence.

Type of Ownership: The ownership structure (e.g., public, private, government, non-profit) can influence compensation strategies and salary levels. Our ANOVA confirmed 'Type of ownership' is statistically significant.

Revenue: A company's revenue directly relates to its ability to pay employees. Higher revenue companies are generally expected to offer higher salaries. Although 'Revenue' was not included in the filter methods summary here, it's a strong candidate based on domain knowledge and would likely show significance in statistical tests or importance in model-based methods.

**Comparison with Filter Method Results:**

The filter methods (correlation and ANOVA) largely confirmed our domain knowledge expectations. 'Job Role', 'Company Size Category', 'Industry', 'Sector', and 'Type of ownership' were all identified as statistically significant predictors of avg_salary through ANOVA. Numerical features like 'Rating' and 'num_comp' showed weak positive correlations, indicating a less pronounced linear relationship but still potentially relevant. The high correlation between 'min_salary', 'max_salary', and 'avg_salary' is expected as they represent the same underlying salary information.

While filter methods provide initial insights based on individual feature relationships, domain knowledge helps us prioritize which features to engineer or focus on further, especially for complex information like skills and experience embedded in text data. It also guides the interpretation of statistical results and helps identify potential limitations (e.g., 'Founded' having a weak correlation might be due to how the data is distributed or interactions with other factors).

##### Which all features you found important and why?

**Justify Chosen Features**

Based on the analysis performed using filter methods (correlation analysis and ANOVA) and our domain knowledge of the job market, the following features were chosen for predicting average salary:

*   **Job Role:** The ANOVA test showed a highly statistically significant difference in average salary across different job roles (Data Scientist, Data Engineer, Analyst, Manager, Other). This aligns with domain knowledge, as job titles directly reflect the level of expertise, responsibility, and market demand, which are major determinants of salary.
*   **Company Size Category:** The ANOVA test indicated a statistically significant difference in average salary based on company size. Our bar chart visualization also showed a clear trend of increasing average salary with increasing company size. Larger companies often have more resources and structured compensation scales, supporting this finding.
*   **State:** The ANOVA test revealed a statistically significant difference in average salary across different states. This confirms the domain knowledge that geographic location significantly impacts salary due to variations in cost of living, industry concentration, and local market demand.
*   **Industry and Sector:** Both Industry and Sector showed statistically significant differences in average salary according to the ANOVA tests. Different industries and sectors have varying levels of profitability, demand for specific skills, and established compensation norms, which influences salary ranges.
*   **Type of ownership:** The ANOVA test indicated a statistically significant difference in average salary based on the type of company ownership. This suggests that factors related to the ownership structure (e.g., public vs. private, non-profit) can influence compensation practices.
*   **Rating:** The correlation analysis showed a weak positive linear correlation between company rating and average salary. While not a strong individual predictor based on this analysis, domain knowledge suggests that reputable companies (often reflected in higher ratings) might offer slightly better compensation to attract talent. It was included as a potentially contributing factor.
*   **Founded:** The correlation analysis showed a very weak linear correlation with average salary. However, domain knowledge suggests that the age of a company might indirectly relate to its stability, maturity, and potentially its compensation structure. It was kept for this reason, although its individual predictive power appears low based on the filter method.
*   **num_comp:** The correlation analysis showed a very weak positive linear correlation with average salary. While the direct linear relationship is weak, the number of competitors could potentially influence salary in complex ways (e.g., highly competitive markets might drive salaries up). It was retained as a feature that might contribute in combination with others.
*   **description_len:** The correlation analysis showed a weak positive linear correlation with average salary. Longer job descriptions might indicate more complex roles or detailed requirements, which could be associated with higher salaries.

**Features Excluded (and why):**

*   **min_salary and max_salary:** These were excluded because `avg_salary` is computed directly from them. Including them would leak the target into the features: the model could reconstruct `avg_salary` almost exactly and would learn nothing about the influence of the other features.

*   **Job Title (original), Salary Estimate, Company Name, Location, Headquarters, Size, Revenue, Competitors (original string):** These original columns were either transformed (e.g., Salary Estimate into min/max/avg salary, Size into Company Size Category, Location into State) or are high-cardinality text/identifier columns that are better represented by engineered features (like `description_len`, `simple_title`, or information extracted from the job description text via TF-IDF, although TF-IDF features were not explicitly included in the Linear Regression model evaluation here for simplicity, they are valuable for capturing skills/requirements).

The selected features represent a balance between statistical significance identified by filter methods and the intuitive importance based on domain knowledge. This set aims to capture the most influential factors determining job salaries in this dataset. Further model-based feature selection (wrapper or embedded methods) could be applied to refine this set and potentially improve model performance.
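
For readers unfamiliar with the ANOVA filter referred to above, the sketch below computes the one-way ANOVA F statistic by hand on made-up salary groups. The numbers and the helper name `one_way_anova_f` are hypothetical; in practice a library routine such as `scipy.stats.f_oneway` would normally be used, since it also returns the p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical average salaries (in $K) for three job roles
salaries_by_role = [
    [95, 110, 120, 105],   # e.g. data scientist
    [60, 72, 68, 65],      # e.g. analyst
    [100, 98, 115, 108],   # e.g. data engineer
]
f_stat = one_way_anova_f(salaries_by_role)
print(f"F = {f_stat:.2f}")
```

A large F value means the group means differ by much more than the variation inside each group, which is what a very low p-value for 'Job Role' reflects.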

### 5. Data Transformation

In [None]:
# Transform Your data

import scipy.sparse

target = 'avg_salary'

numerical_cols_for_X = [
    'Rating',
    'Founded',
    'num_comp',
    'description_len' # Created in Feature Manipulation
]

existing_numerical_cols_for_X = [col for col in numerical_cols_for_X if col in df_clean.columns]

X_numerical = df_clean[existing_numerical_cols_for_X].copy()

X_numerical = X_numerical.fillna(X_numerical.median())



# Get the list of one-hot encoded columns corresponding to the selected categorical features
categorical_features_used = ['Job Role', 'Company Size Category', 'State', 'Industry', 'Sector', 'Type of ownership']

# Filter columns in df_encoded that start with any of the categorical_features_used names
X_categorical = df_encoded.loc[:, df_encoded.columns.str.startswith(tuple(cat + '_' for cat in categorical_features_used))]


# Prepare text features (TF-IDF matrix)
X_text = tfidf_matrix


# Convert X_numerical and X_categorical to sparse matrices if they are not already, for efficient concatenation with X_text
X_numerical_sparse = scipy.sparse.csr_matrix(X_numerical.values)
X_categorical_sparse = scipy.sparse.csr_matrix(X_categorical.values)

# Concatenate all sparse matrices horizontally
X = scipy.sparse.hstack([
    X_numerical_sparse,
    X_categorical_sparse,
    X_text
])

# Define the target variable y
y = df_encoded[target]

print(f"Shape of combined feature matrix X: {X.shape}")
print(f"Shape of target variable y: {y.shape}")

print("First 5 rows of y:")
display(y.head())

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes. The raw data mixes numerical, categorical, and free-text columns, so it has to be converted into a single numerical feature matrix before a model can use it. Transformation also cleans and restructures the data, which improves its quality and helps reveal underlying patterns.

I used categorical encoding (one-hot) and text vectorization (TF-IDF), then combined the numerical, encoded, and TF-IDF features into one sparse matrix `X`.



### 6. Data Scaling

In [None]:
# Scaling your data

from sklearn.preprocessing import StandardScaler
import scipy.sparse  # needed for csr_matrix and hstack below


# Re-define numerical features that need scaling
numerical_cols_to_scale = [
    'Rating',
    'Founded',
    'num_comp',
    'description_len'
]

# Create a copy of the numerical features part of df_clean before scaling
X_numerical_to_scale = df_clean[numerical_cols_to_scale].copy()

# Impute any remaining NaNs (as done before, just to be safe)
X_numerical_to_scale = X_numerical_to_scale.fillna(X_numerical_to_scale.median())

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the numerical features
X_numerical_scaled = scaler.fit_transform(X_numerical_to_scale)

# Convert the scaled numerical features to a sparse matrix for concatenation
X_numerical_scaled_sparse = scipy.sparse.csr_matrix(X_numerical_scaled)

# Re-use X_categorical_sparse and X_text (tfidf_matrix) from previous steps

# Re-concatenate all sparse matrices horizontally to form the final scaled X
X_scaled = scipy.sparse.hstack([
    X_numerical_scaled_sparse,
    X_categorical_sparse,
    X_text
])

print("Data scaling completed for numerical features within the combined matrix.")
print(f"Shape of scaled combined feature matrix X_scaled: {X_scaled.shape}")

# Assign X_scaled back to X for consistency with subsequent steps
X = X_scaled


##### Which method have you used to scale your data and why?


I have used **StandardScaler** to scale the numerical features. Here's why:

1.  **Standardization:** StandardScaler transforms the data such that its mean is 0 and its standard deviation is 1. This process is called standardization.
2.  **Algorithm Performance:** Many machine learning algorithms (like Linear Regression, SVMs, neural networks, or algorithms that rely on distance calculations) perform better or converge faster when numerical input features are on a similar scale. Features with larger values might disproportionately influence the model's objective function.
3.  **No Fixed Range:** Unlike normalization (MinMaxScaler), which scales features to a fixed range (e.g., 0 to 1), standardization doesn't bound values to a specific interval. A single extreme value therefore does not compress all other values into a narrow band, which helps if the data has some outliers that we chose not to remove. StandardScaler is still not fully robust to outliers, because the mean and standard deviation are themselves affected by them.
4.  **Sparse Matrix Compatibility:** For our combined feature matrix `X`, which contains a mix of scaled numerical features, one-hot encoded categorical features (which are already 0s and 1s and don't need scaling in the same way), and TF-IDF features (which are also often sparse and have their own scaling properties), applying `StandardScaler` only to the numerical components before concatenation is a clean and effective strategy. This avoids trying to scale the 0/1 values of one-hot encoded features or the TF-IDF scores unnecessarily, which could degrade model performance.
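
One caveat: the scaling cell fits the scaler on all rows before the train/test split, so statistics from the test rows influence the transformation. A leakage-safe variant learns the mean and standard deviation from the training rows only and reuses them for the test rows. The sketch below shows the idea in plain Python with made-up 'Rating' values and hypothetical helper names; with scikit-learn the equivalent would be `scaler.fit(...)` on the training features followed by `scaler.transform(...)` on both splits.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    # Learn mean and population standard deviation (ddof=0, as StandardScaler does)
    return mean(values), pstdev(values)

def standardize(values, mu, sigma):
    # z-score each value with the supplied statistics
    return [(v - mu) / sigma for v in values]

# Hypothetical 'Rating' values, already split into train and test rows
rating_train = [3.0, 3.5, 4.0, 4.5]
rating_test = [5.0, 2.5]

mu, sigma = fit_standardizer(rating_train)          # statistics from train only
train_scaled = standardize(rating_train, mu, sigma)
test_scaled = standardize(rating_test, mu, sigma)   # reuse train statistics

print(round(mean(train_scaled), 6), round(pstdev(train_scaled), 6))
```

The same ordering (split first, then fit) applies to the TF-IDF vectorizer and the SVD step if a strictly unbiased test estimate is wanted.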

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

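Yes. After combining the numerical features, the one-hot encoded categorical features, and up to 5,000 TF-IDF terms, the feature matrix is very wide and sparse. Reducing it lowers the risk of overfitting and the computational cost of training, so Truncated SVD is applied in the next cell.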

In [None]:
# Dimensionality Reduction (If needed)
from sklearn.decomposition import TruncatedSVD

# The number of components can be tuned. Starting with 500 as a reasonable reduction.
# Ensure n_components is less than min(n_samples, n_features)
num_components = min(500, X.shape[0] - 1, X.shape[1] - 1)

# Initialize TruncatedSVD
svd = TruncatedSVD(n_components=num_components, random_state=42)

# Fit and transform the combined feature matrix X
X_reduced = svd.fit_transform(X)

print(f"Original feature matrix shape: {X.shape}")
print(f"Reduced feature matrix shape: {X_reduced.shape}")

# Update X to the reduced version for subsequent steps
X = X_reduced

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

I used Truncated SVD (Singular Value Decomposition) for dimensionality reduction. This technique is well-suited to high-dimensional, sparse datasets like ours (which includes TF-IDF features) because, unlike standard PCA, it does not center the data and can therefore operate on the sparse matrix directly. It reduces the number of features while keeping the directions that carry most of the variation in the data. This helps mitigate the 'curse of dimensionality', can reduce overfitting and computational cost, and filters out some noise.
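
The choice of 500 components is a starting point rather than a tuned value. One way to reason about it is to look at how much of the total squared singular value the first k components capture. The sketch below does this with NumPy on a small synthetic matrix; `X_demo` is a made-up stand-in for the real feature matrix, and the 95% threshold is an arbitrary assumption. With scikit-learn's `TruncatedSVD`, a related diagnostic is the fitted `explained_variance_ratio_` attribute.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stand-in for the combined feature matrix: 50 rows, 20 columns
# built from 3 latent factors plus a little noise
X_demo = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 20))
X_demo += 0.01 * rng.normal(size=X_demo.shape)

# Singular values in descending order
singular_values = np.linalg.svd(X_demo, compute_uv=False)

# Share of total squared singular value captured by the first k components
energy = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
k = int(np.searchsorted(energy, 0.95) + 1)   # smallest k reaching 95%
print(f"components needed for 95%: {k}")
```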

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
# X is the feature matrix (already reduced in dimensionality)
# y is the target variable (average salary)
# test_size=0.2 means 20% of the data will be used for testing, and 80% for training
# random_state ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

##### What data splitting ratio have you used and why?

I used an 80/20 split: `test_size=0.2` holds out 20% of the rows for testing and leaves 80% for training. This keeps most of the data available for fitting while reserving enough unseen rows for a meaningful performance estimate. `random_state=42` makes the split reproducible. Here `X` is the feature matrix (already reduced in dimensionality) and `y` is the target variable (average salary).

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.


To determine if the salary data is imbalanced, we need to examine the distribution of the `avg_salary` target variable. For continuous variables like salary, 'imbalance' isn't about unequal class representation but rather about whether the data is heavily skewed or concentrated in specific ranges, which can affect model training.

Let's visualize the distribution and look at some descriptive statistics.

In [None]:
# Handling Imbalanced Dataset (If needed)
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(df_clean['avg_salary'], kde=True, bins=30)
plt.title('Distribution of Average Salary')
plt.xlabel('Average Salary ($)')
plt.ylabel('Frequency')
plt.show()

print("\nDescriptive Statistics for Average Salary:")
print(df_clean['avg_salary'].describe())

print("\nSkewness of Average Salary:")
print(df_clean['avg_salary'].skew())

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)


Based on my analysis of the avg_salary distribution, the dataset for this continuous target variable does not appear to be significantly imbalanced. The histogram showed a relatively normal distribution, and the skewness value was close to zero (0.174), indicating a fairly symmetrical spread of salaries. Therefore, no specific technique was needed or applied to balance the dataset.
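
For reference, the skewness figure quoted above can be reproduced by hand. The sketch below implements the bias-adjusted Fisher-Pearson coefficient, which is the form pandas' `Series.skew()` is documented to report as "unbiased skew". The sample values and the helper name `sample_skewness` are made up for illustration.

```python
from statistics import mean

def sample_skewness(values):
    """Bias-adjusted Fisher-Pearson skewness coefficient."""
    n = len(values)
    mu = mean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mu) ** 3 for v in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5                           # unadjusted skewness
    return g1 * (n * (n - 1)) ** 0.5 / (n - 2)    # small-sample adjustment

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(sample_skewness(symmetric), sample_skewness(right_skewed))
```

Values near zero indicate a roughly symmetric distribution, positive values a longer right tail, and negative values a longer left tail.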

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation: Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# X_train, X_test, y_train, y_test come from the Data Splitting step above

# Initialize the Linear Regression model
linear_reg_model = LinearRegression()

# Fit the Algorithm
linear_reg_model.fit(X_train, y_train)

# Predict on the model
y_pred_lr = linear_reg_model.predict(X_test)

# Evaluate the model
mae_lr = mean_absolute_error(y_test, y_pred_lr)
mse_lr = mean_squared_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mse_lr)
r2_lr = r2_score(y_test, y_pred_lr)

print("Linear Regression Model Performance:")
print(f"  Mean Absolute Error (MAE): {mae_lr:.2f}")
print(f"  Mean Squared Error (MSE): {mse_lr:.2f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_lr:.2f}")
print(f"  R-squared (R2): {r2_lr:.4f}")
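
To clarify what the four reported numbers mean, the sketch below computes them by hand on a tiny made-up example. The values and the helper name `regression_metrics` are hypothetical; the formulas are the standard definitions that the scikit-learn metric functions used above implement.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n            # mean absolute error
    mse = sum(e * e for e in errors) / n             # mean squared error
    y_mean = sum(y_true) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": sqrt(mse), "R2": r2}

# Hypothetical salaries (in $K) and predictions
demo = regression_metrics([100, 120, 80, 150], [110, 115, 90, 140])
for name, value in demo.items():
    print(f"{name}: {value:.4f}")
```

MAE and RMSE are in the same units as the salary target, while R2 compares the model against always predicting the mean: 0 means no better than the mean, and negative values mean worse.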

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns  # used by the bar plots below

# Evaluation metrics from the Linear Regression model
metrics = {'Metric': ['MAE', 'MSE', 'RMSE', 'R-squared'],
           'Score': [mae_lr, mse_lr, rmse_lr, r2_lr]}
metrics_df = pd.DataFrame(metrics)

# For visualization purposes, especially with varying scales,
# we might want to plot MAE, RMSE, and R-squared separately or use a log scale for errors.
# Let's create two plots: one for error metrics (MAE, RMSE) and one for R-squared.

# Plotting Error Metrics (MAE and RMSE)
error_metrics_df = metrics_df[metrics_df['Metric'].isin(['MAE', 'RMSE'])]

plt.figure(figsize=(8, 5))
sns.barplot(x='Metric', y='Score', data=error_metrics_df)
plt.title('Linear Regression Model Error Metrics', fontsize=14, fontweight='bold')
plt.ylabel('Score')
plt.show()

# Plotting R-squared
r2_metric_df = metrics_df[metrics_df['Metric'] == 'R-squared']

plt.figure(figsize=(6, 4))
sns.barplot(x='Metric', y='Score', data=r2_metric_df)
plt.title('Linear Regression Model R-squared Score', fontsize=14, fontweight='bold')
plt.ylabel('Score')
plt.ylim(0, 1)  # Assumes 0 <= R-squared <= 1; a negative test-set R-squared would be clipped
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1: Cross-Validation for Linear Regression

from sklearn.model_selection import cross_val_score

# Perform k-fold cross-validation (k=5).
# RMSE is used as the metric because it is common for regression tasks
# and is in the same units as the target variable.
# scikit-learn scorers follow a "higher is better" convention, so MSE is exposed
# as 'neg_mean_squared_error' (the negated MSE; less negative is better).
# Negating the returned scores recovers the MSE, and the square root gives the RMSE.

# Define scoring as negative MSE
scoring = 'neg_mean_squared_error'

# Perform cross-validation
cv_scores = cross_val_score(linear_reg_model, X, y, cv=5, scoring=scoring)

# Convert negative MSE scores to positive MSE and then to RMSE
cv_rmse_scores = np.sqrt(-cv_scores)

print("Cross-Validation Results (RMSE):")
print(f"  Individual Fold RMSEs: {cv_rmse_scores}")
print(f"  Mean Cross-Validation RMSE: {cv_rmse_scores.mean():.2f}")
print(f"  Standard Deviation of CV RMSE: {cv_rmse_scores.std():.2f}")

# You can also calculate other metrics if needed
# cv_mae_scores = -cross_val_score(linear_reg_model, X, y, cv=5, scoring='neg_mean_absolute_error')
# print(f"  Mean Cross-Validation MAE: {cv_mae_scores.mean():.2f}")

# cv_r2_scores = cross_val_score(linear_reg_model, X, y, cv=5, scoring='r2')
# print(f"  Mean Cross-Validation R2: {cv_r2_scores.mean():.4f}")

##### Which hyperparameter optimization technique have you used and why?

For the standard Linear Regression model, there are no hyperparameters to tune using techniques like Grid Search or Random Search. Linear Regression directly calculates the coefficients that minimize the sum of squared errors based on the provided data.

The cross-validation performed in the previous step is used to assess the model's performance consistency across different subsets of the data, giving a more reliable estimate of how well it might generalize to unseen data, rather than optimizing internal model parameters.
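
To make the two points above concrete, here is a minimal NumPy sketch, not the notebook's actual pipeline: ordinary least squares has a direct solution with nothing to tune, and k-fold cross-validation simply repeats the fit on different train/validation partitions. The helper names (`fit_least_squares`, `predict`, `kfold_rmse`) and the synthetic `X_demo`/`y_demo` data are assumptions for illustration. The folds are contiguous and unshuffled, which is understood to match what `cross_val_score(..., cv=5)` does for a regressor.

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve for intercept and coefficients that minimize squared error."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def kfold_rmse(X, y, k=5):
    """Return one validation RMSE per fold, using contiguous unshuffled folds."""
    indices = np.arange(len(X))
    scores = []
    for test_idx in np.array_split(indices, k):
        train_idx = np.setdiff1d(indices, test_idx)
        coef = fit_least_squares(X[train_idx], y[train_idx])
        residuals = y[test_idx] - predict(coef, X[test_idx])
        scores.append(np.sqrt(np.mean(residuals ** 2)))
    return np.array(scores)

# Synthetic, noise-free data: a perfect linear relationship (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.5, 3.0])

fold_rmse = kfold_rmse(X_demo, y_demo, k=5)
print(fold_rmse.mean(), fold_rmse.std())
```

On real salary data the per-fold RMSE values would differ from one another; their spread is what the standard deviation printed in the cross-validation cell summarizes.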

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Since we performed cross-validation instead of hyperparameter tuning for the standard Linear Regression model, there isn't an "improvement" in the model itself in terms of optimized parameters.

However, the cross-validation results provide a more reliable estimate of how the model is likely to perform on unseen data compared to a single train-test split. The mean cross-validation RMSE (Root Mean Squared Error) gives us a better understanding of the model's typical prediction error across different subsets of the data.

The mean cross-validation RMSE, printed by the cross-validation cell above, is therefore a more robust performance indicator than the single-split RMSE. There is no separate chart to update in this case, because no hyperparameters were tuned and so there is no second set of scores for a modified model.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation: Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Initialize the Random Forest Regressor model
# You can start with default parameters or some initial values
rf_reg_model = RandomForestRegressor(n_estimators=100, random_state=42) # n_estimators is a common parameter to tune

# Fit the Algorithm
rf_reg_model.fit(X_train, y_train)

# Predict on the model
y_pred_rf = rf_reg_model.predict(X_test)

# Evaluate the model
mae_rf = mean_absolute_error(y_test, y_pred_rf)
mse_rf = mean_squared_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mse_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Regressor Model Performance (Untuned):")
print(f"  Mean Absolute Error (MAE): {mae_rf:.2f}")
print(f"  Mean Squared Error (MSE): {mse_rf:.2f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_rf:.2f}")
print(f"  R-squared (R2): {r2_rf:.4f}")

# Visualizing evaluation Metric Score chart

# Evaluation metrics from the Random Forest Regressor model
metrics_rf = {'Metric': ['MAE', 'MSE', 'RMSE', 'R-squared'],
              'Score': [mae_rf, mse_rf, rmse_rf, r2_rf]}
metrics_rf_df = pd.DataFrame(metrics_rf)

# Plotting Error Metrics (MAE and RMSE)
error_metrics_rf_df = metrics_rf_df[metrics_rf_df['Metric'].isin(['MAE', 'RMSE'])]

plt.figure(figsize=(8, 5))
sns.barplot(x='Metric', y='Score', data=error_metrics_rf_df)
plt.title('Random Forest Regressor Model Error Metrics (Untuned)', fontsize=14, fontweight='bold')
plt.ylabel('Score')
plt.show()

# Plotting R-squared
r2_metric_rf_df = metrics_rf_df[metrics_rf_df['Metric'] == 'R-squared']

plt.figure(figsize=(6, 4))
sns.barplot(x='Metric', y='Score', data=r2_metric_rf_df)
plt.title('Random Forest Regressor Model R-squared Score (Untuned)', fontsize=14, fontweight='bold')
plt.ylabel('Score')
plt.ylim(0, 1)  # Assumes 0 <= R-squared <= 1; a negative test-set R-squared would be clipped
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

##### Which hyperparameter optimization technique have you used and why?

The Random Forest Regressor was initialized with n_estimators=100 (the scikit-learn default) and random_state=42. No explicit hyperparameter optimization technique such as Grid Search CV or Random Search CV was applied, so the model was trained and evaluated with these initial settings.



##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

For the Random Forest Regressor, no explicit hyperparameter optimization was performed, so there is no improvement from tuning to report. The model was evaluated with its initial settings (n_estimators=100, random_state=42), and its performance metrics are discussed in the next section.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

### **Random Forest Regressor Model Evaluation Metrics and Business Impact**

**1. Mean Absolute Error (MAE):** `8792.65`
*   **Indication:** MAE represents the average absolute difference between the predicted salary and the actual salary. In business terms, our model's salary predictions are off by approximately **\$8,792.65** on average.
*   **Business Impact:** This metric is highly interpretable. For HR departments, recruiters, or job seekers, an MAE of ~\$8.8K means that a salary estimated with this model will, on average, differ from the actual salary by about \$8.8K; individual predictions can miss by more or less than that. A lower MAE is desirable, as it indicates higher prediction accuracy, leading to more reliable salary benchmarking and negotiation strategies. This directly impacts compensation planning, budget allocation for new hires, and job offer credibility.

**2. Mean Squared Error (MSE):** `167,867,995.39`
*   **Indication:** MSE measures the average of the squares of the errors. It penalizes larger errors more heavily than MAE. The high value suggests that there might be some significant errors, or the range of salaries is large, leading to a large squared error value.
*   **Business Impact:** While less directly interpretable than MAE in terms of raw dollar amount, MSE is crucial for understanding the presence of large prediction errors. In a business context, large MSE values mean the model can occasionally make very wrong predictions, which could lead to substantial budget miscalculations, failed negotiations due to vastly incorrect salary expectations, or even legal/ethical issues if salary estimations are far off market rates. Reducing MSE helps in ensuring more consistent and less volatile prediction accuracy.

**3. Root Mean Squared Error (RMSE):** `12,956.39`
*   **Indication:** RMSE is the square root of MSE, bringing the error back into the same units as the target variable (dollars). It's essentially a measure of the typical magnitude of the prediction errors, but, like MSE, it gives more weight to large errors.
*   **Business Impact:** An RMSE of approximately **\$12,956.39**, noticeably higher than the MAE of \$8,792.65, indicates that some predictions miss by considerably more than the average error. For businesses, RMSE provides a more sensitive measure of error. If the business prioritizes avoiding large errors (e.g., to prevent overpaying significantly or underpaying critically needed talent), then minimizing RMSE is important. It helps in assessing the overall reliability of the salary predictions and their potential deviation from actual values.

**4. R-squared (R2):** `0.6192`
*   **Indication:** R-squared, or the coefficient of determination, indicates the proportion of the variance in the dependent variable (average salary) that is predictable from the independent variables (features). An R2 of `0.6192` means that approximately **61.92%** of the variability in job salaries can be explained by our model's features.
*   **Business Impact:** An R2 of nearly 62% suggests that the model has a reasonably good explanatory power. For business stakeholders, this implies that the selected features (job role, company size, location, industry, description content, etc.) are significant drivers of salary variation. This insight is valuable for:
    *   **Strategic Planning:** Understanding which factors influence salaries the most helps in strategic planning for talent acquisition and retention.
    *   **Predictive Confidence:** A higher R2 generally means greater confidence in the model's ability to predict salaries based on the given information. While 62% is good, there's still ~38% of salary variance unexplained, indicating that other factors not captured by the model (e.g., individual negotiation skills, specific skill proficiency, internal company politics, unlisted benefits) are at play. This suggests opportunities for further data collection and feature engineering to improve the model's explanatory power and thus its business utility.
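
The four metrics above can be reproduced by hand. The sketch below is a plain-Python illustration of the definitions, not the notebook's code (which uses `sklearn.metrics`). The function name `regression_metrics` and the four toy salaries are assumptions chosen so that the arithmetic is easy to follow.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                 # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical salaries in dollars (illustration only, not model output)
actual = [100_000, 120_000, 90_000, 150_000]
predicted = [110_000, 115_000, 95_000, 120_000]

demo_metrics = regression_metrics(actual, predicted)
for name, value in demo_metrics.items():
    print(f"{name}: {value:,.2f}")
```

In this toy case three predictions miss by \$5K to \$10K and one misses by \$30K. The MAE is \$12,500, while the single large miss pulls the RMSE up to roughly \$16,202, which is the same MAE-versus-RMSE gap described above.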

In [None]:
print("Missing values in df_clean after dropping columns:")
df_clean.isnull().sum()

### ML Model - 3

In [None]:
# ML Model - 3 Implementation: Gradient Boosting Regressor

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Initialize the Gradient Boosting Regressor model
gbr_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Fit the Algorithm
gbr_model.fit(X_train, y_train)

# Predict on the model
y_pred_gbr = gbr_model.predict(X_test)

# Evaluate the model
mae_gbr = mean_absolute_error(y_test, y_pred_gbr)
mse_gbr = mean_squared_error(y_test, y_pred_gbr)
rmse_gbr = np.sqrt(mse_gbr)
r2_gbr = r2_score(y_test, y_pred_gbr)

print("Gradient Boosting Regressor Model Performance (Untuned):")
print(f"  Mean Absolute Error (MAE): {mae_gbr:.2f}")
print(f"  Mean Squared Error (MSE): {mse_gbr:.2f}")
print(f"  Root Mean Squared Error (RMSE): {rmse_gbr:.2f}")
print(f"  R-squared (R2): {r2_gbr:.4f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart for Gradient Boosting Regressor

# Evaluation metrics from the Gradient Boosting Regressor model
metrics_gbr = {'Metric': ['MAE', 'MSE', 'RMSE', 'R-squared'],
               'Score': [mae_gbr, mse_gbr, rmse_gbr, r2_gbr]}
metrics_gbr_df = pd.DataFrame(metrics_gbr)

# Plotting Error Metrics (MAE and RMSE)
error_metrics_gbr_df = metrics_gbr_df[metrics_gbr_df['Metric'].isin(['MAE', 'RMSE'])]

plt.figure(figsize=(8, 5))
sns.barplot(x='Metric', y='Score', data=error_metrics_gbr_df)
plt.title('Gradient Boosting Regressor Model Error Metrics (Untuned)', fontsize=14, fontweight='bold')
plt.ylabel('Score')
plt.show()

# Plotting R-squared
r2_metric_gbr_df = metrics_gbr_df[metrics_gbr_df['Metric'] == 'R-squared']

plt.figure(figsize=(6, 4))
sns.barplot(x='Metric', y='Score', data=r2_metric_gbr_df)
plt.title('Gradient Boosting Regressor Model R-squared Score (Untuned)', fontsize=14, fontweight='bold')
plt.ylabel('Score')
plt.ylim(0, 1)  # Assumes 0 <= R-squared <= 1; a negative test-set R-squared would be clipped
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

##### Which hyperparameter optimization technique have you used and why?

No explicit hyperparameter optimization technique like Grid Search CV or Random Search CV was used. The model was initialized with predefined parameters (n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42) and trained using these settings.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

For the Gradient Boosting Regressor, since no explicit hyperparameter optimization was performed, there is no improvement to note from tuning. The model was evaluated using its default parameters, and its performance metrics (MAE, MSE, RMSE, R-squared) would be the same as those obtained from the initial untuned model implementation. These specific values are visible in the output of the model implementation cell.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

To judge business impact, we considered Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R2):

*   **MAE** expresses the average prediction error directly in dollars, so it is the easiest metric for recruiters, HR teams, and job seekers to interpret when benchmarking salaries.
*   **RMSE** is also in dollars but weights large errors more heavily, which matters when one badly wrong estimate (a significant overpayment or an uncompetitive offer) is costlier than several small misses.
*   **R2** reports the share of salary variance explained by the features, which indicates how much confidence stakeholders can place in the model overall.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Based on the performance metrics, the Random Forest Regressor is the best-performing model among those implemented and was chosen as the final prediction model. Compared with Linear Regression and the Gradient Boosting Regressor, it has the lowest Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) and the highest R-squared (R2) score, indicating that it explains more of the variance in salary than the other models. On the test set it reached an MAE of about \$8,792.65, an RMSE of about \$12,956.39, and an R2 of 0.6192.
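
One way to make the model comparison explicit is to collect each model's metrics in a dictionary and select by a stated rule. The sketch below is an assumption-laden illustration: `pick_best_model` is a hypothetical helper, the selection rule (lowest RMSE, ties broken by highest R2) is one reasonable choice among several, and the numbers are placeholders rather than results from this notebook.

```python
def pick_best_model(results):
    """Return the model name with the lowest RMSE, breaking ties by highest R2."""
    return min(results, key=lambda name: (results[name]["RMSE"], -results[name]["R2"]))

# Placeholder numbers for illustration only; they are NOT results from this notebook.
illustrative_results = {
    "model_a": {"MAE": 12.0, "RMSE": 16.0, "R2": 0.40},
    "model_b": {"MAE": 9.0, "RMSE": 13.0, "R2": 0.62},
    "model_c": {"MAE": 10.0, "RMSE": 14.5, "R2": 0.55},
}

best = pick_best_model(illustrative_results)
print(best)
```

In the notebook, the dictionary could be filled from `mae_lr`, `rmse_lr`, `r2_lr` and their `_rf` and `_gbr` counterparts, so that the comparison is computed rather than read off the printed output.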

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The final model is the Random Forest Regressor (`rf_reg_model`). A random forest is an ensemble of decision trees, each trained on a bootstrap sample of the training data; the forest's prediction is the average of the individual trees' predictions. This averaging reduces the variance of a single tree and lets the model capture non-linear relationships, which makes it a suitable choice for this regression task.

Feature importance can be read from the model's built-in `feature_importances_` attribute and visualized as a bar chart of the top N values. Because the model was trained on features reduced with Truncated SVD (`X_reduced`), these importances correspond to SVD components rather than to the original features.

This limits business interpretability: each SVD component is a linear combination of many original features, so a high importance identifies an influential component, not a single attribute such as job title or location. Tracing a component back to the original features requires inspecting its loadings.
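
To illustrate what "importance of an SVD component" means, the following NumPy sketch builds a tiny term matrix, takes its SVD, and lists the original terms with the largest absolute loadings on a component. Everything here is hypothetical: the vocabulary, the `tfidf_demo` matrix, and the helper `top_terms_for_component`. scikit-learn's `TruncatedSVD` exposes a comparable loading matrix as `components_`; whether the notebook's fitted SVD object is still available at this point is an assumption.

```python
import numpy as np

def top_terms_for_component(components, vocabulary, component_index, top_n=3):
    """List the original terms with the largest absolute loading on one component."""
    loadings = components[component_index]
    order = np.argsort(np.abs(loadings))[::-1][:top_n]
    return [(vocabulary[i], float(loadings[i])) for i in order]

# Hypothetical 4-document x 4-term matrix (illustration only)
vocabulary = ["python", "sql", "manager", "excel"]
tfidf_demo = np.array([
    [0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.7],
    [0.1, 0.0, 0.8, 0.9],
])

# Rows of vt play the role of component loadings over the original terms
_, _, vt = np.linalg.svd(tfidf_demo, full_matrices=False)
components = vt[:2]  # keep the first two components

top_terms = top_terms_for_component(components, vocabulary, 0, top_n=2)
print(top_terms)
```

Applied to the real model, one would take the component indices ranked highest by `rf_reg_model.feature_importances_` and inspect their loadings in this way. The result is a heuristic reading, since each component mixes many terms.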

# **Conclusion**

## Summary:

### Data Analysis Key Findings

*   The initial 'Job Description' column contained 894 non-null entries and was of object data type.
*   The text cleaning process converted the text to lowercase and removed URLs, punctuation, and words containing digits, storing the result in the 'cleaned\_description' column.
*   Common English stopwords were removed from the cleaned text, creating the 'description\_no\_stopwords' column.
*   The text was tokenized into individual words, and these tokens were stored as lists in the 'description\_tokens' column.
*   Lemmatization was applied to reduce words to their base form, resulting in the 'description\_lemmatized' column containing lemmatized token lists.
*   The lemmatized tokens were joined back into strings, and TF-IDF vectorization was applied, resulting in a sparse matrix of shape (894, 5000): a numerical representation of the text restricted to the top 5000 features.
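
The cleaning steps listed above can be sketched with the standard library alone. This is an illustrative approximation, not the notebook's exact code: the function names, the regular expressions, the tiny `STOPWORDS` set, and the sample sentence are all assumptions, and lemmatization and TF-IDF vectorization are left out because they depend on external libraries.

```python
import re
import string

# Tiny illustrative stopword set; a real pipeline would use a full list
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "with", "at", "is"}

def clean_description(text):
    """Lowercase, then strip URLs, punctuation, and words containing digits."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)                # URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\w*\d\w*", " ", text)                             # words with digits
    return re.sub(r"\s+", " ", text).strip()                          # collapse whitespace

def remove_stopwords(text):
    """Split on whitespace and drop stopwords, returning a token list."""
    return [token for token in text.split() if token not in STOPWORDS]

sample = "Apply at https://example.com: 5+ years of Python and SQL experience, Python3 a plus!"
cleaned = clean_description(sample)
tokens = remove_stopwords(cleaned)

print(cleaned)
print(tokens)
```

The order of the steps matters: URLs are removed before punctuation so that their slashes and dots do not leave fragments behind, and digit-bearing tokens such as "python3" are dropped as whole words rather than having only the digit stripped.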

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***