<a href="https://colab.research.google.com/github/Rajat-Yd/Internship_First_Project.repo/blob/main/ML_Submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Glassdoor Job, Regression problem ML model



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual

# **Project Summary -**


This project uses Machine Learning (ML) and Natural Language Processing (NLP) to analyze employee reviews from Glassdoor.

Our objective is to develop an intelligent system capable of processing unstructured textual data to predict company ratings, classify sentiments, and identify key themes within employee feedback.

Utilizing NLP and data science methodologies, the project aims to provide actionable insights for organizations, HR professionals, and job seekers to facilitate data-driven decision-making.

The development pipeline follows a systematic approach covering data acquisition, preprocessing, exploratory data analysis (EDA), model development, evaluation, and final deployment. I will use this Colab notebook to showcase the project.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



Glassdoor provides a vast collection of employee reviews that offer critical insights into workplace environments, managerial effectiveness, and job satisfaction. Still, manually parsing thousands of reviews to identify trends, detect concerns, and predict company ratings is highly inefficient and impractical.

**This project proposes an AI-driven system designed to:**

- Conduct sentiment analysis on textual reviews.

- Predict company ratings based on review content.

- Extract and categorize key themes from employee feedback.

- Detect biases in ratings influenced by factors such as job role, department, and location.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each and every chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Data Handling
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Natural Language Processing (NLP)
import re  # Regular expressions for text processing
import nltk  # NLP toolkit
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Warnings
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv("/content/drive/MyDrive/Rajat_AI ML_Project/glassdoor_jobs.csv")



### Dataset First View

In [None]:
# Dataset First Look
print("Dataset Loaded Successfully!\n")
print("First 5 Rows:\n", df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Get the number of rows and columns
num_rows, num_columns = df.shape

print(f"Number of Rows: {num_rows}")
print(f"Number of Columns: {num_columns}")

### Dataset Information

In [None]:
# Dataset Info
print("\nDataset Info:\n")
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Check for duplicate rows
duplicate_count = df.duplicated().sum()
print(f"Number of Duplicate Rows: {duplicate_count}")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Check for missing values
missing_values = df.isnull().sum()
print("\nMissing Values per Column:\n", missing_values)

In [None]:
# isna() is an alias of isnull(), so this repeats the check above and should give identical counts
null_values = df.isna().sum()
print("\nNull Values per Column:\n", null_values)

### What did you know about your dataset?

**Dataset Summary (Glassdoor Jobs Data)**  

**Basic Information**
- **Total Rows:** 956 (Job postings)  
- **Total Columns:** 15 (Job-related attributes)  
- **Dataset Type:** Structured CSV  

**Data Quality Analysis**
-  **No missing values** detected at a basic level.  
-  **Salary Estimate & Revenue** need cleaning (parsing text to numerical values).  
-  **Competitors Column** has `-1` values, meaning missing data.  
-  **No duplicate rows** (if found, they should be removed).  

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Dataset Columns:\n")
print(df.columns)

In [None]:
# Dataset Describe
# Summary statistics of numerical columns
print("Numerical Column Summary:\n")
print(df.describe())

# Summary statistics of categorical columns
print("\nCategorical Column Summary:\n")
print(df.describe(include="object"))


### Variables Description

**Variable Description**  

- **Job Title** → Role of the job (e.g., "Data Scientist").  
- **Job Description** → Detailed job responsibilities (useful for NLP).  
- **Rating** → Glassdoor company rating (1-5 scale).  
- **Company Name** → Name of the employer.  
- **Location** → Job location (City, State format).  
- **Headquarters** → Company's headquarters location.  
- **Size** → Number of employees (e.g., "1001-5000 employees").  
- **Founded** → Year company was established.  
- **Type of Ownership** → Public, Private, Government, etc.  
- **Industry** → Specific industry sector (e.g., "Tech", "Healthcare").  
- **Sector** → Broader industry category.  
- **Revenue** → Revenue range (text format, needs parsing).  
- **Competitors** → List of competing companies (may have missing values).  
- **Min Salary** → Extracted minimum salary (numeric).  
- **Max Salary** → Extracted maximum salary (numeric).  
- **Avg Salary** → Average of Min & Max Salary.  


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Unique Values Count per Column:\n")
print(df.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Check if 'Salary Estimate' exists
if 'Salary Estimate' in df.columns:
    # Remove rows where Salary Estimate is '-1' (invalid salary values)
    df = df[df['Salary Estimate'] != '-1'].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below

    # Remove text in parentheses (e.g., "Glassdoor est.")
    df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: x.split('(')[0].strip())

    # Handle special cases where salary is "Employer Provided Salary: X-Y"
    df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: x.replace("Employer Provided Salary:", "").strip())

    # Identify hourly salary rows (contain "Per Hour") and yearly salary rows
    df['Hourly'] = df['Salary Estimate'].apply(lambda x: 1 if 'Per Hour' in x else 0)

    # Remove "Per Hour" and "Per Year" text
    df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: x.replace('Per Hour', '').replace('Per Year', '').strip())

    # Extract Min & Max Salary as raw numbers (e.g. "$53K-$91K" -> 53 and 91)
    df['Min Salary'] = df['Salary Estimate'].apply(lambda x: int(x.split('-')[0].replace('K', '').replace('$', '').strip()))
    df['Max Salary'] = df['Salary Estimate'].apply(lambda x: int(x.split('-')[1].replace('K', '').replace('$', '').strip()))

    # Yearly figures are quoted in thousands, so scale them by 1000.
    # Hourly figures are assumed to be plain dollars per hour, so annualize them instead
    # (40 hours/week x 52 weeks/year = 2080 hours) rather than applying both factors.
    multiplier = df['Hourly'].map({0: 1000, 1: 2080})
    df['Min Salary'] = df['Min Salary'] * multiplier
    df['Max Salary'] = df['Max Salary'] * multiplier

    # Calculate Average Salary
    df['Avg Salary'] = (df['Min Salary'] + df['Max Salary']) / 2

    # Drop the original Salary Estimate column
    df.drop(['Salary Estimate', 'Hourly'], axis=1, inplace=True)

    print("✅ Salary column cleaned successfully!")
else:
    print("⚠️ Warning: 'Salary Estimate' column not found. Check column names above.")


### What all manipulations have you done and insights you found?

### **1. Removed Duplicate Rows**
- **Before:** The dataset may have had duplicate job postings.  
- **After:** Removed duplicates to ensure unique job listings.  
- **Impact:** Avoids redundancy and improves model performance.  

---

### **2. Handled Missing Values**
- **Fixed Competitors Column:**  
  - Replaced `"-1"` with `"No Competitors"` (indicating missing data).  
- **Dropped rows with missing values in critical columns:**  
  - Ensured **Job Title & Job Description** are always present.
---

### **3. Split `"Location"` into `"City"` & `"State"`**
- **Before:** `"Location"` was a single column with `"City, State"` format.  
- **After:**  
  - Split `"Location"` into separate `"City"` & `"State"` columns.  
  - Filled missing values with `"Unknown"`.

---

### **4. Converted `"Founded"` Year into `"Company Age"`**
- **Before:** `"Founded"` column contained raw years (e.g., `1990`).  
- **After:**  
  - Converted `"Founded"` into `"Company Age"` (`2025 - Founded`).  
  - Replaced `"-1"` (unknown values) with `NaN`.  
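The cells above only clean the salary column, so the four steps listed here can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's executed code: `toy`, `REFERENCE_YEAR`, and the sample rows are assumptions made for the example, while the column names come from the dataset.

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2025  # assumed reference year, matching "2025 - Founded" above

# Hypothetical rows that mimic the Glassdoor columns used in this section
toy = pd.DataFrame({
    "Job Title": ["Data Scientist", "Data Scientist", "Data Analyst"],
    "Job Description": ["Build models", "Build models", "Create reports"],
    "Location": ["Austin, TX", "Austin, TX", "Remote"],
    "Founded": [1990, 1990, -1],
    "Competitors": ["-1", "-1", "Acme, Globex"],
})

# 1. Remove exact duplicate postings
toy = toy.drop_duplicates().reset_index(drop=True)

# 2. Mark missing competitors and require the key text columns
toy["Competitors"] = toy["Competitors"].replace("-1", "No Competitors")
toy = toy.dropna(subset=["Job Title", "Job Description"])

# 3. Split "City, State"; rows without a comma get "Unknown" as the state
parts = toy["Location"].str.split(",", n=1, expand=True)
toy["City"] = parts[0].str.strip()
toy["State"] = parts[1].str.strip().fillna("Unknown")

# 4. Turn the founding year into an age; -1 (unknown) becomes NaN
toy["Company Age"] = REFERENCE_YEAR - toy["Founded"].replace(-1, np.nan)

print(toy[["City", "State", "Competitors", "Company Age"]])
```

Applied to the real `df`, the same four steps would need to run before the encoding cells, since those cells expect the original text columns.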

---

## **Key Insights from the Dataset**  

### **1. Company Ratings & Job Satisfaction**
- Glassdoor ratings range from **1.0 to 5.0**.    

---

### **2. Salary Trends**
- **Hourly Wages** were converted to yearly salaries (assuming **2080 working hours/year**).  
- **Min, Max, and Avg Salaries were extracted** from the salary column.  
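As a worked check of the conversion described above, the sketch below parses one salary string at a time. `parse_salary_estimate` is a hypothetical helper, and the sample strings are assumed to follow the formats the wrangling cell handles (a `(Glassdoor est.)` suffix, an `Employer Provided Salary:` prefix, and a `Per Hour` marker). It also assumes hourly figures are plain dollars per hour rather than thousands.

```python
import re

HOURS_PER_YEAR = 40 * 52  # 2080 working hours, as assumed in this notebook

def parse_salary_estimate(raw):
    """Return (min_salary, max_salary, avg_salary) in USD per year."""
    hourly = "per hour" in raw.lower()
    # Drop the "(Glassdoor est.)" style suffix, then pull out the two numbers
    numbers = re.findall(r"\d+", raw.split("(")[0])
    if len(numbers) != 2:
        raise ValueError(f"Unexpected salary format: {raw!r}")
    low, high = (int(n) for n in numbers)
    # Yearly figures are quoted in thousands; hourly figures are annualized
    factor = HOURS_PER_YEAR if hourly else 1000
    low, high = low * factor, high * factor
    return low, high, (low + high) / 2

print(parse_salary_estimate("$53K-$91K (Glassdoor est.)"))
print(parse_salary_estimate("Employer Provided Salary:$17-$24 Per Hour"))
```

Raising on unexpected formats (including the `-1` placeholder) makes malformed rows visible instead of silently producing a wrong salary.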

---

### **3. Job Location Analysis**
- The dataset has jobs from **multiple states in the U.S.**  
- Certain **states/cities dominate hiring.**

---

### **4. Industry & Sector Breakdown**
- The dataset contains jobs from **Tech, Healthcare, Finance, and Consulting** sectors.  
- Some industries have **more competitive salaries and ratings** than others.  

---

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1
Salary distribution

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better visuals
sns.set(style="whitegrid")

# Salary distribution
plt.figure(figsize=(8,5))
sns.histplot(df['Avg Salary'], bins=30, kde=True, color='blue')
plt.title("Distribution of Average Salaries", fontsize=14)
plt.xlabel("Average Salary (USD)", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram with a KDE (Kernel Density Estimation) curve because it is ideal for visualizing distributions.

The KDE curve provides a smooth representation of salary density, making trends easier to analyze.

##### 2. What is/are the insight(s) found from the chart?

Most job salaries fall within a specific range, likely between $50K-$150K.

There might be outliers with extremely high salaries, which could indicate executive roles or highly specialized positions.

If there are multiple peaks, it indicates the presence of different salary groups based on seniority (entry-level vs. senior roles) or industries.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes!
For companies:

- Understanding salary distribution helps in structuring competitive salary packages to attract top talent.
- Identifies salary gaps, ensuring fair pay across different job levels.

For job seekers:
- Helps candidates negotiate salaries better, ensuring they align with market trends.
- Guides professionals on whether they should upskill or switch industries for better pay.

#### Chart - 2
Salary v/s Company Ratings

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(8,5))
sns.boxplot(x=df['Rating'], y=df['Avg Salary'], palette="coolwarm")
plt.title("Company Rating vs. Average Salary", fontsize=14)
plt.xlabel("Glassdoor Rating", fontsize=12)
plt.ylabel("Average Salary (USD)", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot is best for showing salary distribution across company ratings, highlighting median salaries, variations, and outliers.

##### 2. What is/are the insight(s) found from the chart?

- Higher-rated companies (4.0+) tend to offer better salaries.

- Low-rated companies (below 3.0) show wider salary variation, indicating inconsistent pay structures.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
- Companies with higher salaries & ratings attract top talent.
- Helps businesses balance salary & work culture improvements.

Possible Negative Impact:
- Some high-rated companies may offer lower salaries, relying on brand reputation instead of pay—leading to attrition.

#### Chart - 3
Top Hiring Cities

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10,5))
top_cities = df['Location'].value_counts().head(10)
sns.barplot(x=top_cities.index, y=top_cities.values, palette="viridis")
plt.xticks(rotation=45)
plt.title("Top 10 Cities with Most Job Postings", fontsize=14)
plt.xlabel("City", fontsize=12)
plt.ylabel("Number of Job Postings", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart effectively shows which cities have the highest job postings, helping identify key hiring locations.

##### 2. What is/are the insight(s) found from the chart?

- Major tech hubs (e.g., San Francisco, New York) dominate hiring.

- Some unexpected cities may have high job demand, indicating emerging job markets.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
- Helps companies focus hiring efforts in high-demand cities.
- Job seekers can target cities with better employment opportunities.

Possible Negative Impact:

- High job concentration in a few cities may cause talent shortages elsewhere.

#### Chart - 4
Industry vs salary

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(12,6))
top_industries = df.groupby("Industry")["Avg Salary"].mean().sort_values(ascending=False).head(10)
sns.barplot(x=top_industries.index, y=top_industries.values, palette="magma")
plt.xticks(rotation=45)
plt.title("Top Paying Industries (Average Salary)", fontsize=14)
plt.xlabel("Industry", fontsize=12)
plt.ylabel("Average Salary (USD)", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing average salaries across industries, making it easy to spot high-paying vs. low-paying sectors.


##### 2. What is/are the insight(s) found from the chart?

Tech, Finance, and Healthcare industries tend to offer the highest salaries.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
- Helps companies benchmark salaries to stay competitive in their industry.


Possible Negative Impact:
- Salary gaps across industries may cause talent migration from low-paying fields, leading to shortages in critical sectors like education & social work.


#### Chart - 5
Company age vs salary

In [None]:
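# Chart - 5 visualization code
# Plot company age, not the raw founding year; Founded == -1 marks an unknown year, so skip those rows
known = df[df['Founded'] > 0]
company_age = 2025 - known['Founded']  # same reference year as the wrangling notes
plt.figure(figsize=(8,5))
sns.scatterplot(x=company_age, y=known['Avg Salary'], alpha=0.6, color='green')
plt.title("Company Age vs. Average Salary", fontsize=14)
plt.xlabel("Company Age (Years)", fontsize=12)
plt.ylabel("Average Salary (USD)", fontsize=12)
plt.show()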


##### 1. Why did you pick the specific chart?

A scatter plot is ideal for visualizing the relationship between company age and salary, showing how pay varies between young and long-established companies.

##### 2. What is/are the insight(s) found from the chart?

- Younger startups (0-10 years) may offer higher salaries to attract talent.
- Older companies (50+ years) tend to have stable but moderate salaries.
- Mid-aged companies (10-50 years) show a balanced salary structure.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
- Helps job seekers choose between startups vs. established firms based on salary potential.

Possible Negative Impact:
- Younger companies may struggle with retention if high salaries aren’t sustainable.

#### Chart - 6 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Select only numerical columns
num_df = df.select_dtypes(include=['int64', 'float64'])  # Keep only numeric data

# Compute correlation matrix
corr_matrix = num_df.corr()

# Plot the heatmap
plt.figure(figsize=(10,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)

# Title
plt.title("Correlation Heatmap of Numerical Features", fontsize=14)
plt.show()



##### 1. Why did you pick the specific chart?

A heatmap is the best way to visualize correlations between numerical variables, helping identify strong positive or negative relationships.



##### 2. What is/are the insight(s) found from the chart?

- Salary vs. Rating - If positively correlated, higher-rated companies tend to pay better.

- Salary vs. Company Age - If negatively correlated, younger companies may offer higher salaries to attract talent.

- Company Age vs. Rating - If positively correlated, older companies tend to have better reputations.
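The three relationships above are stated conditionally; their sign and strength can be read directly from the correlation matrix. A minimal sketch, assuming a small made-up frame `toy` in place of the cleaned `df` (the values are illustrative only, not results from the dataset) and a hypothetical helper `rank_correlations`:

```python
import numpy as np
import pandas as pd

def rank_correlations(frame):
    """List each pair of numeric columns once, strongest |r| first."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the upper triangle so every pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Made-up numbers purely for illustration
toy = pd.DataFrame({
    "Avg Salary": [60000, 80000, 100000, 120000],
    "Rating": [3.0, 3.5, 4.0, 4.5],
    "Company Age": [40, 10, 25, 5],
})
print(rank_correlations(toy))
```

Sorting by absolute value keeps strong negative relationships near the top instead of burying them below weak positive ones.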

#### Chart - 7 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Select numerical columns for pair plot
num_df = df[['Avg Salary', 'Min Salary', 'Max Salary', 'Rating', 'Founded']]

# Create pair plot
sns.pairplot(num_df, diag_kind='kde', plot_kws={'alpha':0.6})

# Show the plot
plt.suptitle("Pair Plot of Key Numerical Features", fontsize=14, y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot helps visualize relationships between multiple numerical variables, showing scatter plots for pairwise comparisons and KDE curves on the diagonal for each variable's distribution.

##### 2. What is/are the insight(s) found from the chart?

- Salary vs. Rating → If positively correlated, higher-rated companies pay more.

- Min Salary vs. Max Salary → Should show a strong positive correlation since they are linked.


## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.drop(['Competitors'], axis=1, inplace=True)  # Drop columns with excessive missing values
print("✅ Columns with high missing values removed!")

#### What all missing value imputation techniques have you used and why did you use those techniques?

- Dropped columns with excessive missing values (e.g., Competitors) → If too many values are missing, they don’t contribute to analysis.

- Filled numerical values with the median → Median is less affected by outliers than the mean, making it more reliable for skewed salary/rating data.

- Filled categorical values with the mode → Mode (most frequent value) is best for categorical data, ensuring consistency without distorting distributions.
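The median and mode fills described above are not shown in the cell, so here is a minimal sketch of both. `impute_missing` is a hypothetical helper and the toy rows are invented; in the notebook it would be applied to `df` before the encoding step.

```python
import numpy as np
import pandas as pd

def impute_missing(frame):
    """Fill numeric gaps with the median and text gaps with the mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Invented rows with one numeric gap and one categorical gap
toy = pd.DataFrame({
    "Rating": [3.5, np.nan, 4.1, 3.9],
    "Sector": ["Finance", "Tech", None, "Tech"],
})
print(impute_missing(toy))
```

The helper returns a copy, so the original frame stays available for comparing distributions before and after imputation.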

### 2. Handling Outliers

Detecting outliers

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Visualizing outliers in numerical columns
plt.figure(figsize=(10,5))
sns.boxplot(data=df[['Min Salary', 'Max Salary', 'Avg Salary', 'Rating', 'Founded']], palette="Set2")
plt.title("Boxplot for Outlier Detection")
plt.show()

In [None]:
# Handling Outliers & Outlier treatments
# Function to remove outliers using IQR
def remove_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

# Apply to numerical columns
num_cols = ['Min Salary', 'Max Salary', 'Avg Salary', 'Rating', 'Founded']
for col in num_cols:
    df = remove_outliers(df, col)

print("✅ Outliers removed using IQR method!")

Checking the removal of outliers

In [None]:
# Check boxplot again after removing outliers
plt.figure(figsize=(10,5))
sns.boxplot(data=df[num_cols], palette="Set2")
plt.title("Boxplot After Outlier Removal")
plt.show()


##### What all outlier treatment techniques have you used and why did you use those techniques?

- Boxplot Visualization → To visually detect extreme values in salary, rating, and company age.

- IQR (Interquartile Range) Method → Removed outliers beyond 1.5x IQR since it's effective for skewed salary distributions without removing too much data.
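To make the 1.5 x IQR rule concrete, the sketch below computes the fences for a small invented salary list. `iqr_bounds` is a hypothetical helper that mirrors the bound calculation in `remove_outliers`; the numbers are illustrative only.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences used by the IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Five ordinary salaries and one extreme value (made-up numbers)
salaries = pd.Series([50000, 60000, 70000, 80000, 90000, 400000])
lower, upper = iqr_bounds(salaries)
kept = salaries[salaries.between(lower, upper)]
print(lower, upper, kept.tolist())
```

Because the notebook applies the filter to five columns in sequence, each pass recomputes the quartiles on already-trimmed data, so it is worth comparing `len(df)` before and after to see how many rows are lost in total.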

### 3. Categorical Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder
# Identify categorical columns ('Job Description' is free text and is handled in the text preprocessing steps below)
categorical_cols = ['Job Title', 'Company Name', 'Industry', 'Sector', 'Type of ownership', 'Location', 'Headquarters', 'Size', 'Revenue']

# Apply Label Encoding into new "<column> Encoded" columns so the original text
# stays available for lower-casing and TF-IDF later in the notebook
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col + ' Encoded'] = le.fit_transform(df[col])
    label_encoders[col] = le  # Save encoders for future reference

print("✅ Categorical features encoded successfully!")

In [None]:
# Encode your categorical columns
# Fill missing categorical values with the most frequent value (mode)
cat_cols = ['Job Title', 'Company Name', 'Industry', 'Sector', 'Location']
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])  # assign back; chained inplace fillna is deprecated in recent pandas

print("✅ Missing categorical values filled with mode!")


In [None]:
# Recheck for missing values
print("Remaining Missing Values After Handling:\n")
print(df.isnull().sum())

Verify whether the missing values have been filled.

#### What all categorical encoding techniques have you used & why did you use those techniques?

- Label Encoding → Used for categorical features like "Job Title", "Company Name", and "Industry" because ML models need numerical values.

- Mode Imputation for Missing Categorical Values → Keeps categorical columns complete using a value that already exists in the data, although it can over-represent the most common category when many values are missing.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
!pip install contractions

In [None]:
# Expand Contraction
import re
import contractions

# Function to expand contractions in text columns
def expand_contractions(text):
    return contractions.fix(text) if isinstance(text, str) else text

# Apply to text-based columns
text_cols = ['Job Description']
for col in text_cols:
    df[col] = df[col].apply(expand_contractions)

print("✅ Contractions expanded successfully!")


#### 2. Lower Casing

In [None]:
# Lower Casing
text_like_cols = ['Job Title', 'Company Name', 'Industry', 'Sector', 'Location', 'Job Description']
for col in text_like_cols:
    # The .str accessor only works on string columns, so skip any column that is already numeric
    if df[col].dtype == object:
        df[col] = df[col].str.lower()

print("✅ Text converted to lowercase!")

#### 3. Removing Punctuations

In [None]:
#Remove Punctuations
import string

# Function to remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation)) if isinstance(text, str) else text

# Apply to text-based columns
text_cols = ['Job Description']
for col in text_cols:
    df[col] = df[col].apply(remove_punctuation)

print("✅ Punctuation removed successfully!")

#### 4. Removing URLs & Removing Digits

In [None]:
import re

# Function to remove URLs
def remove_urls(text):
    return re.sub(r'http\S+|www\S+', '', text) if isinstance(text, str) else text

# Function to remove digits
def remove_digits(text):
    return re.sub(r'\d+', '', text) if isinstance(text, str) else text

# Apply to text-based columns
text_cols = ['Job Description']
for col in text_cols:
    df[col] = df[col].apply(remove_urls).apply(remove_digits)

print("✅ URLs and digits removed successfully!")

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
# Remove White spaces
import nltk
from nltk.corpus import stopwords

# Download stopwords if not already present
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word.lower() not in stop_words]) if isinstance(text, str) else text

# Function to remove extra whitespaces
def remove_whitespace(text):
    return " ".join(text.split()) if isinstance(text, str) else text

# Apply to text-based columns
text_cols = ['Job Description']
for col in text_cols:
    df[col] = df[col].apply(remove_stopwords).apply(remove_whitespace)

print("✅ Stopwords and extra whitespaces removed successfully!")

#### 6. Rephrase Text

In [None]:
# Rephrase Text
from textblob import TextBlob

# Function to spell-correct text using TextBlob
def rephrase_text(text):
    if isinstance(text, str):
        # correct() fixes spelling only; it does not paraphrase, and it can be slow on long texts
        return str(TextBlob(text).correct())
    return text  # Leave non-string values untouched

# Apply to text-based columns
text_cols = ['Job Description']
for col in text_cols:
    df[col] = df[col].apply(rephrase_text)

print("✅ Text rephrased successfully!")



#### 7. Tokenization

In [None]:
# Tokenization
from nltk.tokenize import word_tokenize, sent_tokenize

# Download tokenizer if not available
nltk.download('punkt_tab')

# Function to tokenize words
def tokenize_words(text):
    return word_tokenize(text) if isinstance(text, str) else text

# Function to tokenize sentences
def tokenize_sentences(text):
    return sent_tokenize(text) if isinstance(text, str) else text

# Apply word tokenization
df['Job Description Tokens'] = df['Job Description'].apply(tokenize_words)

# Apply sentence tokenization (optional)
# df['Job Description Sentences'] = df['Job Description'].apply(tokenize_sentences)

print("✅ Text tokenized successfully!")


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import WordNetLemmatizer

# Download WordNet data
nltk.download('wordnet')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to apply lemmatization
def apply_lemmatization(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()]) if isinstance(text, str) else text

# Apply lemmatization to job descriptions
df['Job Description Lemmatized'] = df['Job Description'].apply(apply_lemmatization)

print("✅ Lemmatization applied successfully!")

In [None]:
import nltk
from nltk.stem import PorterStemmer

# Initialize stemmer
stemmer = PorterStemmer()

# Function to apply stemming
def apply_stemming(text):
    return " ".join([stemmer.stem(word) for word in text.split()]) if isinstance(text, str) else text

# Apply stemming to job descriptions
df['Job Description Stemmed'] = df['Job Description'].apply(apply_stemming)

print("✅ Stemming applied successfully!")

##### Which text normalization technique have you used and why?

- Stemming is faster but less accurate, useful for reducing word variations.
- Lemmatization is slower but more precise, useful for meaning-based NLP tasks.
- Both techniques improve text consistency for feature extraction & machine learning models.

#### 9. Part of speech tagging

In [None]:
import nltk

# Download necessary datasets
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')  # Optional, improves lemmatization accuracy

print("✅ NLTK resources downloaded successfully!")

In [None]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Function to apply POS tagging
def pos_tagging(text):
    # Check the type first so non-string values pass through instead of crashing the tokenizer
    if not isinstance(text, str):
        return text
    return pos_tag(word_tokenize(text))

# Apply POS tagging to job descriptions
df['Job Description POS'] = df['Job Description'].apply(pos_tagging)

print("POS tagging applied successfully")

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
import gensim
from gensim.models import Word2Vec

# Tokenize text for Word2Vec
df['Job Description Tokenized'] = df['Job Description'].apply(word_tokenize)

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=df['Job Description Tokenized'], vector_size=100, window=5, min_count=2, workers=4)

print("✅ Word2Vec model trained successfully!")

##### Which text vectorization technique have you used and why?

✅ Word2Vec → Captures semantic meaning & word relationships for deep NLP models.
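Word2Vec yields one vector per word, so a posting still needs to be reduced to a single fixed-length vector before a regressor can use it. One common choice, sketched here as an assumption rather than something this notebook does, is to average the vectors of the words in each posting. `document_vector` is a hypothetical helper, and the small dictionary stands in for the trained model's word vectors (gensim exposes them as `word2vec_model.wv`, which supports the same `in` and `[]` lookups).

```python
import numpy as np

def document_vector(tokens, vectors, size):
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

# Stand-in for word2vec_model.wv with 3-dimensional vectors (made-up values)
toy_vectors = {
    "python": np.array([1.0, 0.0, 2.0]),
    "sql": np.array([3.0, 2.0, 0.0]),
}

# "excel" is not in the vocabulary, so only "python" and "sql" are averaged
print(document_vector(["python", "sql", "excel"], toy_vectors, size=3))
```

Words dropped by `min_count=2` during training are simply skipped, which is why the helper needs a fallback for postings with no known words.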

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Create a new feature: Salary Range
df['Salary Range'] = df['Max Salary'] - df['Min Salary']

print("✅ New feature 'Salary Range' created successfully!")

# Define a threshold (e.g., $100,000) to classify high-paying jobs
df['High Paying'] = df['Avg Salary'].apply(lambda x: 1 if x >= 100000 else 0)

print("✅ New feature 'High Paying' (1 = Yes, 0 = No) created successfully!")

#### 2. Feature Selection

 Convert Text Columns Using TF-IDF



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(max_features=500)  # Keep top 500 words

# Transform Job Descriptions into TF-IDF vectors
tfidf_matrix = tfidf.fit_transform(df['Job Description'])

# Convert to DataFrame and merge with the original dataset
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
df = df.reset_index(drop=True)  # Reset index for merging
df = pd.concat([df, tfidf_df], axis=1)

# Identify non-numeric columns
non_numeric_cols = df.select_dtypes(include=['object']).columns

# Remove text-based columns from feature selection
X = df.drop(columns=['Avg Salary'] + list(non_numeric_cols))  # Keep only numerical features
y = df['Avg Salary']



print("✅ Removed non-numeric columns for feature selection!")
print("✅ Text converted to numerical features using TF-IDF!")

In [None]:
from sklearn.feature_selection import mutual_info_regression

# Compute feature importance using mutual information
feature_importance = mutual_info_regression(X, y)
important_features = pd.Series(feature_importance, index=X.columns).sort_values(ascending=False)

print("✅ Feature importance calculated successfully!")
print(important_features)

##### What all feature selection methods have you used  and why?

- Variance Thresholding → Removes low-variance features that don’t contribute much to model learning.
- Mutual Information Regression → Identifies how much each feature influences salary (target variable), ensuring only informative predictors are used.
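
The two methods listed above can be chained: first drop features with no variance, then rank the remaining ones by mutual information with the target. The sketch below shows this on synthetic data; the column names (`informative`, `noise`, `constant`) and the threshold of `0.0` are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, mutual_info_regression

rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "constant": np.ones(n),  # zero variance, should be removed
})
target = 3.0 * demo["informative"] + rng.normal(scale=0.1, size=n)

# Step 1: drop features whose variance does not exceed the threshold
selector = VarianceThreshold(threshold=0.0)
selector.fit(demo)
kept = demo.columns[selector.get_support()]

# Step 2: rank the surviving features by mutual information with the target
mi = pd.Series(
    mutual_info_regression(demo[kept], target, random_state=0),
    index=kept,
).sort_values(ascending=False)
print(mi)
```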

##### Which all features you found important and why?

- Min Salary & Max Salary → Strong predictors for Avg Salary, as they directly define salary range.
- Company Rating → Higher-rated companies tend to offer better salaries.
- Company Age → Younger companies may offer competitive salaries to attract talent.
- Industry & Sector → Some industries (e.g., tech, finance) offer higher average salaries.
- Location (State) → Salaries differ significantly across states due to cost of living & job demand.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
import numpy as np

# Apply log transformation to skewed numerical features
skewed_cols = ['Min Salary', 'Max Salary', 'Avg Salary', 'Founded']

for col in skewed_cols:
    df[col] = np.log1p(df[col])  # log1p(x) = log(1 + x) to avoid log(0) issues

print("✅ Log transformation applied to skewed features!")
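
Because the target is log-transformed as well, predictions come out on the log scale and need `np.expm1` to return to salary units. Note also that `np.log1p` returns `-inf` for `-1` and `NaN` for anything smaller, so placeholder values in a column should be handled before the transform. A minimal sketch on made-up salary figures:

```python
import numpy as np
import pandas as pd

# Made-up salaries with one extreme value to create right skew
salaries = pd.Series([40_000, 55_000, 60_000, 75_000, 400_000], dtype=float)

logged = np.log1p(salaries)   # forward transform: log(1 + x)
restored = np.expm1(logged)   # inverse transform: exp(x) - 1

print(f"skew before: {salaries.skew():.2f}, after: {logged.skew():.2f}")
```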

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

# Select numerical columns for scaling
num_cols = ['Min Salary', 'Max Salary', 'Avg Salary', 'Rating', 'Founded']

# Apply Standard Scaling
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

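# Apply Min-Max Scaling
# Note: this overwrites the standardized values above; the final result is the
# same as applying Min-Max scaling directly to the original columns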
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
df[num_cols] = minmax_scaler.fit_transform(df[num_cols])

print("✅ Standardization applied to numerical features!")
print("✅ Min-Max Scaling applied to numerical features!")


##### Which method have you used to scale your data and why?

- Log Transformation → Reduces skewness & outliers for better model stability.
- Standardization (Z-score) → Works best for models like SVM, Logistic Regression.
- Min-Max Scaling → Works well for Deep Learning models & Distance-based ML models (KNN, Clustering).
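
A scaler fitted on the full dataset sees the test rows, which can make evaluation look better than it should. One way to avoid this is to fit the scaler on the training split only and reuse its statistics on the test split. The sketch below uses synthetic data; the array `features` is an assumption standing in for the numeric columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
features = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

train, test = train_test_split(features, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics learned from train only
test_scaled = scaler.transform(test)        # the same statistics are reused

print(train_scaled.std(axis=0).round(3))
```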

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

- ✔️ Removes redundant features to reduce complexity.
- ✔️ Speeds up training & improves generalization.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

No dimensionality reduction was performed.
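
Should the TF-IDF columns ever become too numerous, truncated SVD is one option that works directly on sparse matrices. The sketch below is illustrative only and is not part of this notebook's pipeline; the toy corpus and `n_components=2` are assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "python sql machine learning",
    "sql reporting dashboards",
    "python deep learning research",
    "statistics reporting analysis",
]

tfidf_sparse = TfidfVectorizer().fit_transform(docs)  # sparse TF-IDF matrix

# Project the TF-IDF features onto 2 components
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf_sparse)

print(reduced.shape)
print(svd.explained_variance_ratio_.sum())
```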

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# Define Features (X) and Target (y)
X = df.drop(columns=['Avg Salary'])  # Independent variables
y = df['Avg Salary']  # Target variable

# Split data into 80% train & 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"✅ Data split successfully: Train Size: {X_train.shape[0]}, Test Size: {X_test.shape[0]}")

##### What data splitting ratio have you used and why?

**Reason for the taken ratio**
- ✔️ 80% Training Data → Provides enough data for the model to learn patterns properly.
- ✔️ 20% Testing Data → Holds out enough unseen samples to estimate how well the model generalizes.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

**Let's check if the data is imbalanced or not**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of salaries
plt.figure(figsize=(8,5))
sns.histplot(df['Avg Salary'], bins=30, kde=True, color='blue')
plt.title("Distribution of Average Salaries", fontsize=14)
plt.xlabel("Average Salary", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
plt.show()

# Check salary percentiles
print(df['Avg Salary'].describe())

In [None]:
# Handling Imbalanced Dataset (If needed)
# No immediate need to handle data imbalance right now.

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

- From the given stats, the data does NOT appear to be highly imbalanced.
- No immediate need for SMOTE or oversampling.
- However, if the histogram shows extreme skewness, I can apply log transformation or binning to improve distribution.
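
The skewness check and the binning option mentioned above can be sketched as follows. The salary values are made up, and the choice of five quantile bins is an assumption for illustration.

```python
import pandas as pd

# Made-up average salaries with one extreme value
avg_salary = pd.Series([45, 52, 58, 61, 67, 70, 74, 80, 95, 250], dtype=float)

# A skewness well above 0 indicates a long right tail
skewness = avg_salary.skew()

# Quantile binning: each band receives roughly the same number of rows
salary_band = pd.qcut(avg_salary, q=5, labels=False)
band_counts = salary_band.value_counts().sort_index()

print(f"skewness: {skewness:.2f}")
print(band_counts)
```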

## ***7. ML Model Implementation***

### ML Model - 1

**Removing list-based columns with data type = object**

In [None]:
list_cols = ['Job Description POS', 'Job Description Tokens', 'Job Description Tokenized', 'Job Description Lemmatized', 'Job Description Stemmed']  # problematic columns
X = X.drop(columns=[col for col in list_cols if col in X.columns])

print("✅ Removed list-based columns for model training!")

print("✅ Data Types After Fix:")
print(X.dtypes.value_counts())  # Ensure only numerical features remain

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Linear Regression Model
model_1 = LinearRegression()
model_1.fit(X_train, y_train)

print("✅ Model 1 trained successfully!")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Note: mae, mse, rmse and r2 are computed in the metrics cell further below; run that cell first
import matplotlib.pyplot as plt
import seaborn as sns

# Create a bar chart for evaluation metrics
metrics = ['MAE', 'MSE', 'RMSE', 'R² Score']
values = [mae, mse, rmse, r2]

plt.figure(figsize=(8, 5))
sns.barplot(x=metrics, y=values, palette='coolwarm')

# Annotate bars with actual values
for i, v in enumerate(values):
    plt.text(i, v + 0.01, f"{v:.4f}", ha='center', fontsize=12)

plt.title("📊 Model 1 Evaluation Metric Scores", fontsize=14)
plt.xlabel("Evaluation Metrics", fontsize=12)
plt.ylabel("Score", fontsize=12)
plt.show()

In [None]:
# Checking for the Model Overfitting

train_r2 = model_1.score(X_train, y_train)
test_r2 = model_1.score(X_test, y_test)

print(f"📊 R² Score on Training Data: {train_r2:.4f}")
print(f"📊 R² Score on Test Data: {test_r2:.4f}")


In [None]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predictions on test set
y_pred_1 = model_1.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred_1)
mse = mean_squared_error(y_test, y_pred_1)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred_1)

# Print metrics
print(f"📊 Model 1 - Linear Regression Performance:")
print(f"✅ Mean Absolute Error (MAE): {mae:.4f}")
print(f"✅ Mean Squared Error (MSE): {mse:.4f}")
print(f"✅ Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"✅ R² Score: {r2:.4f}")
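
To make the four metrics concrete, the sketch below works them out on three made-up points, where the arithmetic is easy to check by hand: the errors are -1, 0 and 2, the mean of the true values is 5, and the total sum of squares is 8.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

mae_demo = mean_absolute_error(y_true, y_hat)  # (1 + 0 + 2) / 3 = 1.0
mse_demo = mean_squared_error(y_true, y_hat)   # (1 + 0 + 4) / 3 = 1.6667
rmse_demo = np.sqrt(mse_demo)                  # sqrt(1.6667) = 1.2910
r2_demo = r2_score(y_true, y_hat)              # 1 - 5/8 = 0.375

print(f"MAE={mae_demo:.4f} MSE={mse_demo:.4f} RMSE={rmse_demo:.4f} R2={r2_demo:.4f}")
# MAE=1.0000 MSE=1.6667 RMSE=1.2910 R2=0.3750
```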

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Import required libraries
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Initialize Ridge Regression model
ridge_model = Ridge()

# Apply GridSearchCV to find the best alpha
grid_search = GridSearchCV(ridge_model, param_grid, cv=5, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best hyperparameter
best_alpha = grid_search.best_params_['alpha']
print(f"✅ Best Alpha Found: {best_alpha}")

# Fit the Algorithm
# Train the optimized Ridge model
optimized_model_1 = Ridge(alpha=best_alpha)
optimized_model_1.fit(X_train, y_train)

print("✅ Model 1 (Ridge Regression with Optimized Alpha) trained successfully!")

# Predict on the model
# Make predictions
y_pred_optimized = optimized_model_1.predict(X_test)

print("✅ Predictions made successfully!")
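
A possible variation on the grid search above is to wrap scaling and Ridge in a single `Pipeline`, so that the scaler is refitted inside each cross-validation fold. With the default `refit=True`, `best_estimator_` is already trained on the full training data and can be used directly. The sketch below runs on synthetic data; `X_demo`, `y_demo` and the step names `scale` and `ridge` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
alphas = [0.001, 0.01, 0.1, 1, 10, 100]

search = GridSearchCV(pipe, {"ridge__alpha": alphas}, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

best_model = search.best_estimator_  # already refit on all of X_demo
print(search.best_params_, round(search.best_score_, 4))
```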

##### Which hyperparameter optimization technique have you used and why?

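GridSearchCV with 5-fold cross-validation, searching six values of the Ridge `alpha` from 0.001 to 100 and scoring by R². An exhaustive grid is practical here because only one hyperparameter is tuned.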

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
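
The save and load steps could look like the sketch below, which uses `joblib` and a temporary directory so that it leaves no files behind. The model here is a small Ridge regressor fitted on synthetic data, and the file name `ridge_model.joblib` is hypothetical; in this notebook the object to save would be whichever model is finally selected.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([1.5, -2.0])

model = Ridge(alpha=1.0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "ridge_model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)          # save the fitted model
    loaded_model = joblib.load(model_path)  # load it back

# Sanity check: the reloaded model reproduces the original predictions
same_predictions = np.allclose(model.predict(X_demo), loaded_model.predict(X_demo))
print(same_predictions)  # True
```

Any scaler or vectorizer fitted during preprocessing would need to be saved alongside the model, since unseen data must pass through the same transformations before prediction.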

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***