# **Project Name**    - Glassdoor Job, Regression problem ML model



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual

# **Project Summary -**


This project uses Machine Learning (ML) and Natural Language Processing (NLP) to analyze employee reviews from Glassdoor.

Our objective is to develop an intelligent system capable of processing unstructured textual data to predict company ratings, classify sentiments, and identify key themes within employee feedback.

Utilizing NLP and data science methodologies, the project aims to provide actionable insights for organizations, HR professionals, and job seekers to facilitate data-driven decision-making.

The development pipeline follows a systematic approach covering data acquisition, preprocessing, exploratory data analysis (EDA), model development, evaluation, and final deployment. I will use this Colab notebook to showcase the project.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



Glassdoor hosts a vast collection of employee reviews that offer critical insights into workplace environments, managerial effectiveness, and job satisfaction. Still, manually parsing thousands of reviews to identify trends, detect concerns, and predict company ratings is highly inefficient and impractical.

**This project proposes an AI-driven system designed to:**

- Conduct sentiment analysis on textual reviews.

- Predict company ratings based on review content.

- Extract and categorize key themes from employee feedback.

- Detect biases in ratings influenced by factors such as job role, department, and location.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Data Handling
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Natural Language Processing (NLP)
import re  # Regular expressions for text processing
import nltk  # NLP toolkit
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Warnings
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv("/content/drive/MyDrive/Rajat_AI ML_Project/glassdoor_jobs.csv")

### Dataset First View

In [None]:
# Dataset First Look
print("Dataset Loaded Successfully!\n")
print("First 5 Rows:\n", df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Get the number of rows and columns
num_rows, num_columns = df.shape

print(f"Number of Rows: {num_rows}")
print(f"Number of Columns: {num_columns}")

### Dataset Information

In [None]:
# Dataset Info
print("\nDataset Info:\n")
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Check for duplicate rows
duplicate_count = df.duplicated().sum()
print(f"Number of Duplicate Rows: {duplicate_count}")


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Check for missing values
missing_values = df.isnull().sum()
print("\nMissing Values per Column:\n", missing_values)

In [None]:
# isna() is an alias of isnull(), so this repeats the counts above as a cross-check
null_values = df.isna().sum()
print("\nNull Values per Column:\n", null_values)

# Visualizing the missing values

### What did you know about your dataset?

**Dataset Summary (Glassdoor Jobs Data)**  

**Basic Information**
- **Total Rows:** 956 (Job postings)  
- **Total Columns:** 15 (Job-related attributes)  
- **Dataset Type:** Structured CSV  

**Data Quality Analysis**
-  **No missing values** detected at a basic level.  
-  **Salary Estimate & Revenue** need cleaning (parsing text to numerical values).  
-  **Competitors Column** has `-1` values, meaning missing data.  
-  **No duplicate rows** (if found, they should be removed).  
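
Because `-1` stands in for missing data, `isnull()` alone can report a clean dataset while placeholders remain. The sketch below is one way to count them per column. The helper name `count_sentinels` and the toy rows are illustrative assumptions, not part of the original notebook, and the check assumes the placeholder appears as `-1` or `-1.0` once values are rendered as text.

```python
import pandas as pd

def count_sentinels(frame: pd.DataFrame) -> pd.Series:
    """Count, per column, the cells whose text form is '-1' or '-1.0'."""
    # Comparing the string form catches the placeholder in int, float and text columns
    as_text = frame.astype(str)
    return as_text.isin(["-1", "-1.0"]).sum()

# Toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "Founded": [1990, -1, 2005],
    "Competitors": ["-1", "Acme, Globex", "-1"],
    "Rating": [3.8, -1.0, 4.1],
})

sentinel_counts = count_sentinels(toy)
print(sentinel_counts)
```

Running `count_sentinels(df)` on the loaded data would show which columns need placeholder handling before modelling.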

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Dataset Columns:\n")
print(df.columns)

In [None]:
# Dataset Describe
# Summary statistics of numerical columns
print("Numerical Column Summary:\n")
print(df.describe())

# Summary statistics of categorical columns
print("\nCategorical Column Summary:\n")
print(df.describe(include="object"))


### Variables Description

**Variable Description**  

- **Job Title** → Role of the job (e.g., "Data Scientist").  
- **Job Description** → Detailed job responsibilities (useful for NLP).  
- **Rating** → Glassdoor company rating (1-5 scale).  
- **Company Name** → Name of the employer.  
- **Location** → Job location (City, State format).  
- **Headquarters** → Company's headquarters location.  
- **Size** → Number of employees (e.g., "1001-5000 employees").  
- **Founded** → Year company was established.  
- **Type of Ownership** → Public, Private, Government, etc.  
- **Industry** → Specific industry sector (e.g., "Tech", "Healthcare").  
- **Sector** → Broader industry category.  
- **Revenue** → Revenue range (text format, needs parsing).  
- **Competitors** → List of competing companies (may have missing values).  
- **Min Salary** → Minimum salary extracted from the `Salary Estimate` text (numeric).  
- **Max Salary** → Maximum salary extracted from the `Salary Estimate` text (numeric).  
- **Avg Salary** → Average of Min & Max Salary.  


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Unique Values Count per Column:\n")
print(df.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Step 1: Import Libraries
import pandas as pd
import numpy as np

# Step 2: Remove Duplicate Rows
df = df.drop_duplicates()
print(f" Duplicate rows removed. New dataset shape: {df.shape}")

# Step 3: Handle Missing & Null Values
# Replace '-1' in Competitors column with 'No Competitors'
df['Competitors'] = df['Competitors'].replace('-1', 'No Competitors')

# Drop rows where critical data is missing
df = df.dropna(subset=['Job Title', 'Job Description'])
print(" Missing values handled.")

# Step 4: Split Location into City & State
if 'Location' in df.columns:
    # Placeholder keeps the "City, State" shape for missing locations
    df['Location'] = df['Location'].fillna('Unknown, Unknown')

    # Split on the first comma only; rows without a comma get a missing State
    df[['City', 'State']] = df['Location'].str.split(',', n=1, expand=True)

    # Trim stray whitespace and label missing states
    df['City'] = df['City'].str.strip()
    df['State'] = df['State'].fillna('Unknown').str.strip()

    # Drop the original Location column
    df = df.drop(columns='Location')
    print("Location column successfully split into City & State!")
else:
    print("Warning: 'Location' column not found.")

# Step 5: Convert Founded Year into Company Age
if 'Founded' in df.columns:
    # -1 marks an unknown founding year; where() leaves those rows as NaN
    df['Company Age'] = (2025 - df['Founded']).where(df['Founded'] != -1)
    print("Company Age column created.")
else:
    print("Warning: 'Founded' column not found.")

# Step 6: Final Check on Data Quality
print("\nFinal Dataset Info:")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()

# Display first 5 rows of the cleaned dataset
print("\n First 5 Rows of Cleaned Dataset:")
print(df.head())


### What all manipulations have you done and insights you found?

### **1. Removed Duplicate Rows**
- **Before:** The dataset may have had duplicate job postings.  
- **After:** Removed duplicates to ensure unique job listings.  
- **Impact:** Avoids redundancy and improves model performance.  

---

### **2. Handled Missing Values**
- **Fixed Competitors Column:**  
  - Replaced `"-1"` with `"No Competitors"` (indicating missing data).  
- **Dropped rows with missing values in critical columns:**  
  - Ensured **Job Title & Job Description** are always present.

---

### **3. Split `"Location"` into `"City"` & `"State"`**
- **Before:** `"Location"` was a single column with `"City, State"` format.  
- **After:**  
  - Split `"Location"` into separate `"City"` & `"State"` columns.  
  - Filled missing values with `"Unknown"`.
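
A minimal sketch of the same split, written as a standalone helper so it can be tried on a few values first. The function name `split_location` and the sample values are assumptions for illustration. It uses `str.partition`, which always returns three parts, so it still works when no value contains a comma.

```python
import pandas as pd

def split_location(location: pd.Series) -> pd.DataFrame:
    """Split 'City, State' strings on the first comma into two columns."""
    # partition() yields columns 0 (before), 1 (separator), 2 (after)
    parts = location.fillna("").str.partition(",")
    city = parts[0].str.strip().replace("", "Unknown")
    state = parts[2].str.strip().replace("", "Unknown")
    return pd.DataFrame({"City": city, "State": state})

# Toy values for illustration only
toy_locations = pd.Series(["San Francisco, CA", "Remote", None])
location_parts = split_location(toy_locations)
print(location_parts)
```

Values without a comma keep their text as the city and receive `"Unknown"` as the state, which mirrors the placeholder used above.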

---

### **4. Converted `"Founded"` Year into `"Company Age"`**
- **Before:** `"Founded"` column contained raw years (e.g., `1990`).  
- **After:**  
  - Converted `"Founded"` into `"Company Age"` (`2025 - Founded`).  
  - Replaced `"-1"` (unknown values) with `NaN`.  

---

## **Key Insights from the Dataset**  

### **1. Company Ratings & Job Satisfaction**
- Glassdoor ratings range from **1.0 to 5.0**.    

---

### **2. Salary Trends**
- **Hourly Wages** were converted to yearly salaries (assuming **2080 working hours/year**).  
- **Min, Max, and Avg Salaries were extracted** from the salary column.  
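
The wrangling cell above does not show the salary extraction itself, so here is a hedged sketch of how it could look. It assumes the raw column is named `Salary Estimate` and holds text such as `$53K-$91K (Glassdoor est.)` or `$17-$24 Per Hour`. The helper `parse_salary`, the constant `HOURS_PER_YEAR`, and the sample strings are illustrative assumptions. Results are expressed in full USD per year.

```python
import re

import numpy as np
import pandas as pd

HOURS_PER_YEAR = 2080  # working hours per year, as stated above

def parse_salary(text):
    """Return (min, max, avg) yearly salary in USD, or NaNs if unparseable."""
    if not isinstance(text, str):
        return (np.nan, np.nan, np.nan)
    # Collect the numbers that follow a dollar sign, e.g. "$53K-$91K" -> 53, 91
    numbers = [float(n) for n in re.findall(r"\$(\d+(?:\.\d+)?)", text)]
    if len(numbers) < 2:
        return (np.nan, np.nan, np.nan)
    low, high = numbers[0], numbers[1]
    if "per hour" in text.lower():
        # Hourly wage -> yearly salary
        low, high = low * HOURS_PER_YEAR, high * HOURS_PER_YEAR
    else:
        # Assumption: yearly figures are written in thousands ("K")
        low, high = low * 1000, high * 1000
    return (low, high, (low + high) / 2)

# Toy rows for illustration only
toy = pd.DataFrame({"Salary Estimate": [
    "$53K-$91K (Glassdoor est.)",
    "$17-$24 Per Hour (Glassdoor est.)",
    "-1",
]})

salary_cols = pd.DataFrame(
    toy["Salary Estimate"].apply(parse_salary).tolist(),
    index=toy.index,
    columns=["Min Salary", "Max Salary", "Avg Salary"],
)
toy = toy.join(salary_cols)
print(toy)
```

Rows that cannot be parsed (such as the `-1` placeholder) end up as `NaN`, so they can be dropped or imputed before plotting `Avg Salary`.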

---

### **3. Job Location Analysis**
- The dataset has jobs from **multiple states in the U.S.**  
- Certain **states/cities dominate hiring.**

---

### **4. Industry & Sector Breakdown**
- The dataset contains jobs from **Tech, Healthcare, Finance, and Consulting** sectors.  
- Some industries have **more competitive salaries and ratings** than others.  

---

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1
Salary distribution

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better visuals
sns.set(style="whitegrid")

# Salary distribution
plt.figure(figsize=(8,5))
sns.histplot(df['Avg Salary'], bins=30, kde=True, color='blue')
plt.title("Distribution of Average Salaries", fontsize=14)
plt.xlabel("Average Salary (USD)", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram with a KDE (Kernel Density Estimation) curve because it is ideal for visualizing distributions.

The KDE curve provides a smooth representation of salary density, making trends easier to analyze.

##### 2. What is/are the insight(s) found from the chart?

Most job salaries fall within a specific range, likely between $50K-$150K.

There might be outliers with extremely high salaries, which could indicate executive roles or highly specialized positions.

If there are multiple peaks, it indicates the presence of different salary groups based on seniority (entry-level vs. senior roles) or industries.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes!

For companies:

- Understanding salary distribution helps in structuring competitive salary packages to attract top talent.
- Identifies salary gaps, ensuring fair pay across different job levels.

For job seekers:

- Helps candidates negotiate salaries better, ensuring they align with market trends.
- Guides professionals on whether they should upskill or switch industries for better pay.

#### Chart - 2
Salary vs. Company Ratings

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(8,5))
sns.boxplot(x=df['Rating'], y=df['Avg Salary'], palette="coolwarm")
plt.title("Company Rating vs. Average Salary", fontsize=14)
plt.xlabel("Glassdoor Rating", fontsize=12)
plt.ylabel("Average Salary (USD)", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot is best for showing salary distribution across company ratings, highlighting median salaries, variations, and outliers.

##### 2. What is/are the insight(s) found from the chart?

- Higher-rated companies (4.0+) tend to offer better salaries.

- Low-rated companies (below 3.0) show wider salary variation, indicating inconsistent pay structures.
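
These readings can be checked numerically rather than by eye. The sketch below groups salaries into rating bands and reports the count and median per band. The function name `salary_by_rating_band`, the band edges, and the toy values are assumptions chosen to match the 3.0 and 4.0 thresholds mentioned above.

```python
import pandas as pd

def salary_by_rating_band(frame, rating_col="Rating", salary_col="Avg Salary"):
    """Summarise salary per rating band: below 3.0, 3.0 to 3.9, 4.0 and above."""
    bands = pd.cut(
        frame[rating_col],
        bins=[0, 3.0, 4.0, float("inf")],
        right=False,  # left-closed bins, so a rating of 4.0 counts as "4.0 and above"
        labels=["below 3.0", "3.0 to 3.9", "4.0 and above"],
    )
    # Ratings outside the bins (such as a -1 placeholder) become NaN and are dropped
    return frame.groupby(bands, observed=True)[salary_col].agg(["count", "median"])

# Toy rows for illustration only (salaries in thousands of USD)
toy = pd.DataFrame({
    "Rating": [2.5, 2.8, 3.5, 3.9, 4.0, 4.6, -1.0],
    "Avg Salary": [60, 100, 80, 90, 110, 120, 70],
})

band_summary = salary_by_rating_band(toy)
print(band_summary)
```

Adding `"std"` to the aggregation list would also quantify the spread within each band.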

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
- Companies with higher salaries & ratings attract top talent.
- Helps businesses balance salary & work culture improvements.

Possible Negative Impact:
- Some high-rated companies may offer lower salaries, relying on brand reputation instead of pay—leading to attrition.

#### Chart - 3
Top Hiring Cities

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10,5))
top_cities = df['City'].value_counts().head(10)
sns.barplot(x=top_cities.index, y=top_cities.values, palette="viridis")
plt.xticks(rotation=45)
plt.title("Top 10 Cities with Most Job Postings", fontsize=14)
plt.xlabel("City", fontsize=12)
plt.ylabel("Number of Job Postings", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart effectively shows which cities have the highest job postings, helping identify key hiring locations.

##### 2. What is/are the insight(s) found from the chart?

- Major tech hubs (e.g., San Francisco, New York) dominate hiring.

- Some unexpected cities may have high job demand, indicating emerging job markets.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
- Helps companies focus hiring efforts in high-demand cities.
- Job seekers can target cities with better employment opportunities.

Possible Negative Impact:

- High job concentration in a few cities may cause talent shortages elsewhere.

#### Chart - 4
Industry vs salary

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(12,6))
top_industries = df.groupby("Industry")["Avg Salary"].mean().sort_values(ascending=False).head(10)
sns.barplot(x=top_industries.index, y=top_industries.values, palette="magma")
plt.xticks(rotation=45)
plt.title("Top Paying Industries (Average Salary)", fontsize=14)
plt.xlabel("Industry", fontsize=12)
plt.ylabel("Average Salary (USD)", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing average salaries across industries, making it easy to spot high-paying vs. low-paying sectors.


##### 2. What is/are the insight(s) found from the chart?

Tech, Finance, and Healthcare industries tend to offer the highest salaries.
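
A ranking by mean salary can be dominated by industries with only one or two postings. As a hedged sketch, the helper below reports the posting count next to the mean and filters out thinly represented industries. The name `top_paying_industries`, the `min_postings` threshold, and the toy rows are illustrative assumptions.

```python
import pandas as pd

def top_paying_industries(frame, min_postings=5, top_n=10):
    """Rank industries by mean salary, keeping only those with enough postings."""
    stats = frame.groupby("Industry")["Avg Salary"].agg(["mean", "count"])
    stats = stats[stats["count"] >= min_postings]
    return stats.sort_values("mean", ascending=False).head(top_n)

# Toy rows for illustration only (salaries in thousands of USD)
toy = pd.DataFrame({
    "Industry": ["A", "A", "A", "B", "C", "C"],
    "Avg Salary": [100, 110, 120, 300, 80, 90],
})

industry_ranking = top_paying_industries(toy, min_postings=2)
print(industry_ranking)
```

In the toy data, industry `B` has the highest mean but is excluded because it has a single posting.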

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
- Helps companies benchmark salaries to stay competitive in their industry.


Possible Negative Impact:
- Salary gaps across industries may cause talent migration from low-paying fields, leading to shortages in critical sectors like education & social work.


#### Chart - 5
Company age vs salary

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(8,5))
sns.scatterplot(x=df['Company Age'], y=df['Avg Salary'], alpha=0.6, color='green')
plt.title("Company Age vs. Average Salary", fontsize=14)
plt.xlabel("Company Age (Years)", fontsize=12)
plt.ylabel("Average Salary (USD)", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is ideal for visualizing the relationship between two numerical variables, here company age and average salary, and for spotting patterns across younger and older companies.

##### 2. What is/are the insight(s) found from the chart?

- Younger startups (0-10 years) may offer higher salaries to attract talent.
- Older companies (50+ years) tend to have stable but moderate salaries.
- Mid-aged companies (10-50 years) show a balanced salary structure.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
- Helps job seekers choose between startups vs. established firms based on salary potential.

Possible Negative Impact:
- Younger companies may struggle with retention if high salaries aren’t sustainable.

#### Chart - 6 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Select only numerical columns
num_df = df.select_dtypes(include=['int64', 'float64'])  # Keep only numeric data

# Compute correlation matrix
corr_matrix = num_df.corr()

# Plot the heatmap
plt.figure(figsize=(10,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)

# Title
plt.title("Correlation Heatmap of Numerical Features", fontsize=14)
plt.show()



##### 1. Why did you pick the specific chart?

A heatmap is the best way to visualize correlations between numerical variables, helping identify strong positive or negative relationships.

##### 2. What is/are the insight(s) found from the chart?

- Salary vs. Rating - If positively correlated, higher-rated companies tend to pay better.

- Salary vs. Company Age - If negatively correlated, younger companies may offer higher salaries to attract talent.

- Company Age vs. Rating - If positively correlated, older companies tend to have better reputations.
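
To replace the "if correlated" wording with actual values, the correlation matrix can be flattened into a ranked list of pairs. This is a minimal sketch. The function name `ranked_correlations` and the toy values are assumptions, and the coefficient is the Pearson correlation that `DataFrame.corr()` computes by default.

```python
import numpy as np
import pandas as pd

def ranked_correlations(frame: pd.DataFrame) -> pd.Series:
    """Return each pair of numeric columns once, sorted by absolute correlation."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy rows for illustration only
toy = pd.DataFrame({
    "Rating": [3.0, 3.5, 4.0, 4.5],
    "Avg Salary": [60, 70, 80, 90],
    "Company Age": [40, 10, 30, 20],
})

corr_pairs = ranked_correlations(toy)
print(corr_pairs)
```

The sign of each value gives the direction of the relationship, and the magnitude indicates how strong the linear association is.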

#### Chart - 7 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Select numerical columns for pair plot
num_df = df[['Avg Salary', 'Min Salary', 'Max Salary', 'Rating', 'Company Age']]

# Create pair plot
sns.pairplot(num_df, diag_kind='kde', plot_kws={'alpha':0.6})

# Show the plot
plt.suptitle("Pair Plot of Key Numerical Features", fontsize=14, y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot helps visualize relationships between multiple numerical variables, showing scatter plots for pairwise comparisons and KDE curves on the diagonal for each variable's distribution (`diag_kind='kde'`).

##### 2. What is/are the insight(s) found from the chart?

- Salary vs. Rating → If positively correlated, higher-rated companies pay more.

- Min Salary vs. Max Salary → Should show a strong positive correlation since they are linked.


## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***