**Libraries**

In [71]:
import numpy as np
import pandas as pd

**Loading up the dataset**

In [72]:
# A forward slash avoids the invalid "\g" escape sequence and works on every OS
data2 = pd.read_csv("data/glassdoor_jobs.csv")
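
A backslash inside an ordinary string literal starts an escape sequence, so a path such as `"data\new.csv"` would silently contain a newline. A minimal sketch of a portable alternative using `pathlib` follows. The `csv_path` name is only illustrative, and the `read_csv` call is commented out because it needs the actual file:

```python
from pathlib import Path

# Build the path from its parts; pathlib inserts the right separator per OS.
csv_path = Path("data") / "glassdoor_jobs.csv"
print(csv_path.as_posix())

# pandas accepts Path objects directly (requires the file to exist):
# data2 = pd.read_csv(csv_path)
```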

---

# **Data Preparation**

## **Features & Data Engineering**

In [73]:
# Count missing (NaN) values per column in data2
data2.isnull().sum()

Unnamed: 0           0
Job Title            0
Salary Estimate      0
Job Description      0
Rating               0
Company Name         0
Location             0
Headquarters         0
Size                 0
Founded              0
Type of ownership    0
Industry             0
Sector               0
Revenue              0
Competitors          0
dtype: int64
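
A zero count here only means there are no `NaN` cells. Other cells in this notebook show `-1` in `Rating`, `Founded`, and `Competitors`, which suggests the dataset uses placeholder values instead. Below is a rough sketch of counting such placeholders, using a small made-up frame in place of `data2`. The placeholder values (`-1`, `"-1"`, `"Unknown"`) and the names `toy` and `sentinel_counts` are assumptions for illustration:

```python
import pandas as pd

# Made-up rows standing in for data2 (not taken from the dataset).
toy = pd.DataFrame({
    "Rating": [3.8, -1.0, 4.2],
    "Founded": [1973, -1, -1],
    "Competitors": ["-1", "Acme, Initech", "-1"],
})

# Assumed placeholders: numeric -1, and the strings "-1" / "Unknown".
numeric_hits = toy.select_dtypes(include="number").eq(-1).sum()
text_hits = toy.select_dtypes(exclude="number").isin(["-1", "Unknown"]).sum()

# One count per column, numeric columns first.
sentinel_counts = pd.concat([numeric_hits, text_hits])
print(sentinel_counts)
```

Numeric and text columns are checked separately so that the integer `-1` is never compared against strings.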

**Initial Exploratory Analysis**

In [74]:
data2.head()

,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
0,0,Data Scientist,$53K-$91K (Glassdoor est.),"Data Scientist\nLocation: Albuquerque, NM\nEdu...",3.8,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 to 1000 employees,1973,Company - Private,Aerospace & Defense,Aerospace & Defense,$50 to $100 million (USD),-1
1,1,Healthcare Data Scientist,$63K-$112K (Glassdoor est.),What You Will Do:\n\nI. General Summary\n\nThe...,3.4,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+ employees,1984,Other Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1
2,2,Data Scientist,$80K-$90K (Glassdoor est.),"KnowBe4, Inc. is a high growth information sec...",4.8,KnowBe4\n4.8,"Clearwater, FL","Clearwater, FL",501 to 1000 employees,2010,Company - Private,Security Services,Business Services,$100 to $500 million (USD),-1
3,3,Data Scientist,$56K-$97K (Glassdoor est.),*Organization and Job ID**\nJob ID: 310709\n\n...,3.8,PNNL\n3.8,"Richland, WA","Richland, WA",1001 to 5000 employees,1965,Government,Energy,"Oil, Gas, Energy & Utilities",$500 million to $1 billion (USD),"Oak Ridge National Laboratory, National Renewa..."
4,4,Data Scientist,$86K-$143K (Glassdoor est.),Data Scientist\nAffinity Solutions / Marketing...,2.9,Affinity Solutions\n2.9,"New York, NY","New York, NY",51 to 200 employees,1998,Company - Private,Advertising & Marketing,Business Services,Unknown / Non-Applicable,"Commerce Signals, Cardlytics, Yodlee"


In [75]:
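def get_shape_info(df):
    """Return the (rows, columns) shape and a table of column names and dtypes."""
    shape_info = df.shape
    # reset_index() turns the dtypes Series into a two-column DataFrame
    column_info = df.dtypes.reset_index()
    column_info.columns = ['Column Name', 'Data Type']
    return shape_info, column_info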


data2_shape, data2_column_info = get_shape_info(data2)

print("\nShape of data2:", data2_shape)
print("Column information for data2:")
print(data2_column_info)


Shape of data2: (939, 15)
Column information for data2:
          Column Name Data Type
0          Unnamed: 0     int64
1           Job Title    object
2     Salary Estimate    object
3     Job Description    object
4              Rating   float64
5        Company Name    object
6            Location    object
7        Headquarters    object
8                Size    object
9             Founded     int64
10  Type of ownership    object
11           Industry    object
12             Sector    object
13            Revenue    object
14        Competitors    object


In [76]:
data2.drop('Unnamed: 0', axis=1, inplace=True)

In [77]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 939 entries, 0 to 938
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Job Title          939 non-null    object 
 1   Salary Estimate    939 non-null    object 
 2   Job Description    939 non-null    object 
 3   Rating             939 non-null    float64
 4   Company Name       939 non-null    object 
 5   Location           939 non-null    object 
 6   Headquarters       939 non-null    object 
 7   Size               939 non-null    object 
 8   Founded            939 non-null    int64  
 9   Type of ownership  939 non-null    object 
 10  Industry           939 non-null    object 
 11  Sector             939 non-null    object 
 12  Revenue            939 non-null    object 
 13  Competitors        939 non-null    object 
dtypes: float64(1), int64(1), object(12)
memory usage: 102.8+ KB


---

### **Job Title**

- What are the unique job titles in the dataset?



In [78]:
data2['Job Title'].unique()

array(['Data Scientist', 'Healthcare Data Scientist',
       'Research Scientist', 'Staff Data Scientist - Technology',
       'Data Analyst', 'Data Engineer I', 'Scientist I/II, Biology',
       'Customer Data Scientist',
       'Data Scientist - Health Data Analytics',
       'Senior Data Scientist / Machine Learning',
       'Data Scientist - Quantitative', 'Digital Health Data Scientist',
       'Associate Data Analyst', 'Clinical Data Scientist',
       'Data Scientist / Machine Learning Expert', 'Web Data Analyst',
       'Senior Data Scientist', 'Data Engineer',
       'Data Scientist - Algorithms & Inference', 'Scientist',
       'Data Science Analyst', 'Lead Data Scientist',
       'Spectral Scientist/Engineer',
       'College Hire - Data Scientist - Open to December 2019 Graduates',
       'Data Scientist, Office of Data Science',
       'Business Intelligence Analyst',
       'Data Scientist in Artificial Intelligence Early Career',
       'Data Scientist - Research', 'R&D 

Grouping similar `Job Titles` together:

In [79]:
job_title_mapping = {
    'Data Scientist': [
        'Data Scientist', 'Healthcare Data Scientist', 'Staff Data Scientist - Technology',
        'Customer Data Scientist', 'Data Scientist - Health Data Analytics', 'Senior Data Scientist / Machine Learning',
        'Data Scientist - Quantitative', 'Digital Health Data Scientist', 'Clinical Data Scientist',
        'Data Scientist / Machine Learning Expert', 'Senior Data Scientist', 'Data Scientist - Algorithms & Inference',
        'Lead Data Scientist', 'Data Scientist in Artificial Intelligence Early Career', 'Data Scientist - Research',
        'R&D Data Analysis Scientist', 'Data Scientist SR', 'R&D Sr Data Scientist', 'Data Scientist II',
        'Senior Data Scientist - AI Forecasting, Finance team', 'Data Scientist (Active TS SCI with Polygraph)',
        'Principal Data Scientist', 'Chief Data Scientist', 'Staff Data Scientist', 'Data Scientist - Bioinformatics',
        'Data Scientist (Actuary, FSA or ASA)', 'Customer Data Scientist/Sales Engineer', 'Data Scientist Analyst',
        'Data Scientist/ML Engineer', 'Jr. Data Scientist', 'Senior Data Scientist 4 Artificial Intelligence',
        'Data Scientist - Alpha Insights', 'Data Scientist - Sales', 'Data Scientist Manager',
        'Senior Data Scientist Statistics', 'Sr. Data Scientist', 'Sr. Data Scientist II', 'Data Scientist (Warehouse Automation)',
        'Principal Data Scientist (Computational Chemistry)', 'Data Scientist, Office of Data Science', 'Data Scientist in Translational Medicine',
        'Senior Data Scientist - R&D Oncology', 'Health Data Analyst/Developer', 'Senior Data Scientist: Causal & Predictive analytics AI Innovation Lab',
        'Senior Data Scientist Oncology', 'Data Scientist - Systems Engineering', 'Data Scientist - Consultant - National',
        'Data Scientist - Alpha Insights', 'Senior Data & Machine Learning Scientist', 'Data Science Project Manager',
        'Head Data Scientist – Image Analytics lead, Novartis AI Innovation Lab', 'Senior Insurance Data Scientist',
        'Senior Data Scientist – Visualization, Novartis AI Innovation Lab', 'Data Science Engineer - Mobile',
        'Principal, Data Science - Advanced Analytics', 'IT Associate Data Analyst', 'Sr Expert Data Science, Advanced Visual Analytics (Associate level)',
        'Data Science Intern', 'Senior Quantitative Analyst', 'Data Scientist Manager', 'Data Scientists', 'Ag Data Scientist', 'Data Scientist, Rice University',
        'Senior Data Scientist Artificial Intelligence'
    ],
    'Data Analyst': [
        'Data Analyst', 'Associate Data Analyst', 'Web Data Analyst', 'Data Science Analyst', 'Jr. Business Data Analyst',
        'Financial Data Analyst', 'Senior Data Analyst', 'Data Analyst - Asset Management', 'Data Analyst II',
        'E-Commerce Data Analyst', 'Excel / VBA / SQL Data Analyst', 'Insurance Financial Data Analyst',
        'Data Analyst / Scientist', 'Marketing Data Analyst', 'Data Analyst Chemist - Quality System Contractor',
        'Survey Data Analyst', 'Junior Data Analyst', 'Data Analyst Senior', 'Lead Data Analyst',
        'Business Data Analyst', 'Data Analyst, Performance Partnership', 'SQL Data Engineer',
        'Business Intelligence Analyst / Developer', 'System and Data Analyst', 'Data & Analytics Consultant (NYC)',
        'Real World Evidence (RWE) Scientist', 'Data Scientist - Consultant - National', 'Lead Health Data Analyst - Front End',
        'Insurance Financial Data Analyst', 'Data Engineer I - Azure', 'Information Security Data Analyst',
        'Data Scientist, Senior', 'Data Modeler - Data Solutions Engineer', 'Consultant - Data Analytics Group',
        'Sr Data Analyst', 'Program/Data Analyst', 'Big Data Engineer - Chicago - Future Opportunity',
        'Supply Chain Data Analyst', 'Data Modeler (Analytical Systems)', 'Senior Data Analyst/Scientist',
        'Data Analyst, May 2020 Undergrad', 'Marketing Data Analyst, May 2020 Undergrad', 'Survey Data Analyst',
        'Foundational Community Supports Data Analyst', 'Senior Health Data Analyst, Star Ratings',
        'Corporate Risk Data Analyst (SQL Based) - Milwaukee or', 'Data Analytics Project Manager',
        'Marketing Data Analyst', 'Data Analyst Chemist - Quality System Contractor', 'Survey Data Analyst',
        'Diversity and Inclusion Data Analyst', 'Managing Data Scientist/ML Engineer', 'Advanced Analytics Manager',
        'Consultant - Analytics Consulting', 'Analytics Manager - Data Mart', 'Analytics - Business Assurance Data Analyst',
        'Senior Data Analyst', 'Data Analytics Project Manager', 'Data Analyst Senior', 'Associate Data Analyst- Graduate Development Program',
        'Survey Data Analyst', 'Data Analyst Chemist - Quality System Contractor', 'Data Analyst, May 2020 Undergrad',
        'Digital Marketing & ECommerce Data Analyst', 'Marketing Data Analyst, May 2020 Undergrad', 'Market Data Analyst',
        'Revenue Analytics Manager', 'Sr. Data Analyst', 'Managing Data Scientist/ML Engineer', 'Foundational Community Supports Data Analyst',
        'Diversity and Inclusion Data Analyst', 'Technology-Minded, Data Professional Opportunities', 'Salesforce Analytics Consultant',
        'Senior Operations Data Analyst, Call Center Operations', 'Senior Health Data Analyst, Star Ratings', 'Corporate Risk Data Analyst (SQL Based) - Milwaukee or',
        'Sr Data Analyst - IT', 'Business Data Analyst, SQL'
    ],

    'Data Engineer': [
        'Data Engineer I', 'Data Engineer', 'Senior Data Engineer', 'Data Engineer Intern', 'MongoDB Data Engineer II',
        'AWS Data Engineer', 'Data Engineer with R', 'Sr. Data Engineer', 'Lead Data Engineer', 'Data Engineer I - Azure',
        'Associate Data Engineer', 'Staff Data Engineer', 'Sr. Data Engineer - Contract-to-Hire (Java)',
        'Lead Big Data Engineer', 'Sr Data Engineer (Sr BI Developer)', 'IT - Data Engineer II',
        'Data Engineer, Data Engineering and Artificial Intelligence', 'Big Data Engineer', 'Data Engineer - Consultant (Charlotte Based)',
        'Sr. Microsoft Data Engineer', 'Data Engineer - ETL', 'Software Data Engineer - College',
        'Staff BI and Data Engineer', 'Sr. Data Engineer | Big Data SaaS Pipeline', 'Principal Data Engineer, Data Platform & Insights',
        'Data Engineering Analyst', 'Staff Data Engineer', 'Data & Analytics Consultant (NYC)', 'Associate Data Engineer',
        'Lead Data Engineer', 'Data Engineer - Consultant', 'Principal Data Engineer', 'Data Engineering Manager',
        'Data Infrastructure Engineer', 'Cloud Data Engineer', 'Senior Data Infrastructure Engineer',
        'Machine Learning Data Engineer', 'Data Integration Engineer', 'Data Systems Engineer',
        'Platform Data Engineer', 'Data Engineer - Advanced Analytics', 'Data Solutions Engineer',
        'Data Pipeline Engineer', 'Data Warehousing Engineer', 'Sr. Engineer - Data Engineering',
        'Data Network Engineer', 'Data Architecture Engineer', 'Data Strategy Engineer',
        'Data Center Engineer', 'Senior Data Solutions Engineer', 'Data Processing Engineer',
        'Data Optimization Engineer', 'Data Engineer - Machine Learning', 'Data Quality Engineer',
        'Enterprise Data Engineer', 'Full Stack Data Engineer', 'Data Engineer - Business Intelligence',
        'Data Development Engineer', 'Senior Cloud Data Engineer', 'Data Engineer - Data Science',
        'Data Engineer - Analytics', 'Data Engineer - Data Warehousing', 'Data Application Engineer',
        'Data Systems Development Engineer', 'Data Operations Engineer', 'Data Engineering Specialist',
        'Advanced Data Engineer', 'Data Technology Engineer', 'Data Engineer - Big Data',
        'Data Security Engineer', 'Sr. Data Systems Engineer', 'Data Analytics Engineer',
        'Data Platform Engineer', 'Data Engineer - Infrastructure', 'Data Science Engineer',
        'Data Engineer - R&D', 'Data Engineer - SQL', 'Data Engineer - Python', 'Data Engineer - Cloud Technologies',
        'Data Engineer, Data Engineering and Artificial Intelligence', 'Data Engineer - Analytics', 'Data Engineer - Data Science',
        'Data Engineer - Business Intelligence', 'Data Engineer - Data Warehousing', 'Data Engineer - Cloud Technologies',
        'Data Engineer - SQL', 'Data Engineer - Python', 'Data Engineer - R&D', 'Data Engineer - Big Data',
        'Data Engineer - Infrastructure', 'Data Engineer - ETL', 'Data Science Engineer', 'Data Platform Engineer',
        'Data Technology Engineer', 'Data Analytics Engineer', 'Data Systems Engineer', 'Data Network Engineer',
        'Data Security Engineer', 'Data Architecture Engineer', 'Data Strategy Engineer', 'Data Optimization Engineer',
        'Data Processing Engineer', 'Data Quality Engineer', 'Data Solutions Engineer', 'Data Pipeline Engineer',
        'Data Integration Engineer', 'Data Application Engineer', 'Data Operations Engineer', 'Data Systems Development Engineer',
        'Data Warehousing Engineer', 'Sr. Data Systems Engineer', 'Principal Data Engineer', 'Senior Cloud Data Engineer',
        'Senior Data Infrastructure Engineer', 'Advanced Data Engineer', 'Enterprise Data Engineer', 'Full Stack Data Engineer',
        'Lead Data Engineer', 'Lead Big Data Engineer', 'Senior Data Solutions Engineer', 'Data Center Engineer',
        'Cloud Data Engineer', 'Data Infrastructure Engineer', 'Data Engineering Manager', 'Machine Learning Data Engineer',
        'Data Engineer Intern', 'Data Development Engineer', 'Data Engineering Specialist', 'Data Engineer with R',
        'Sr. Engineer - Data Engineering', 'Staff Data Engineer', 'Data Engineer - Advanced Analytics', 'Platform Data Engineer',
        'MongoDB Data Engineer II', 'Associate Data Engineer', 'AWS Data Engineer', 'Sr. Microsoft Data Engineer',
        'Sr. Data Engineer | Big Data SaaS Pipeline', 'Staff BI and Data Engineer', 'Sr Data Engineer (Sr BI Developer)',
        'IT - Data Engineer II', 'Software Data Engineer - College', 'Data Systems Specialist 2', 'Radar Data Analyst',
        'Principal Data Engineer, Data Platform & Insights', 'Data Engineering Analyst', 'VP, Data Science',
        'Project Scientist - Auton Lab, Robotics Institute', 'Data & Analytics Consultant (NYC)', 'ATL - Data & Analytics (DA)',
        'MSP - Data & Analytics (DA)', 'Associate Data Engineer','Data Engineer, Data Engineering and Artificial Intelligence',
        'Data Engineer, Data Engineering and Artifical Intelligence'
    ],

    'Machine Learning Engineer': [
        'Machine Learning Engineer', 'Senior Machine Learning (ML) Engineer / Data Scientist - Cyber Security Analytics',
        'Staff Machine Learning Engineer', 'Principal Machine Learning Scientist', 'Machine Learning Research Scientist',
        'Senior LiDAR Data Scientist', 'Machine Learning Engineer - Regulatory', 'Senior Machine Learning Engineer',
        'Associate Machine Learning Engineer / Data Scientist May 2020 Undergrad', 'Machine Learning Scientist ',
        'Principal Machine Learning Scientist', 'Staff Machine Learning Scientist, AI Foundation', 'Machine Learning Engineer (NLP)',
        'Senior Data Scientist Artificial Intelligence', 'Machine Learning Scientist'

    ],

    'Computer Vision Specialist': [
        'Computer Vision Engineer', 'Computer Vision Scientist', 'Research Scientist or Senior Research Scientist - Computer Vision',
        'Deep Learning/Computer Vision Scientist', 'Senior Computer Vision Engineer'
    ],

    'Business Intelligence Analyst': [
        'Business Intelligence Analyst', 'BI Analyst', 'BI & Platform Analytics Manager', 'Business Intelligence Analyst / Developer',
        'Senior Business Intelligence Analyst', 'Director - Data, Privacy and AI Governance'
    ],

    'Analytics Manager/Consultant': [
        'Analytics Manager', 'Analytics Consultant', 'Data Analytics Manager', 'Advanced Analytics Manager',
        'Analytics - Business Assurance Data Analyst', 'Consultant - Analytics Consulting', 'Analytics Manager - Data Mart',
        'Managing Data Scientist/ML Engineer', 'Director of Analytics',
        'Senior Manager, Epidemiologic Data Scientist', 'Senior Risk Data Scientist', 'Risk and Analytics IT, Data Scientist'
    ],

    'Software Engineer (Data Focus)': [
        'Software Engineer - Data', 'Data-focused Software Engineer', 'Front-End, Back-End, Fullstack Developers & Data Scientist / Researchers - Cleared OR CLEARABLE (Up to 25% Profit Sharing Benefit!)',
        'Software Engineer (Data Scientist/Software Engineer) - SISW - MG', 'Software Data Engineer - College', 'Sr Software Engineer (Data Scientist)',
        'Software Engineer - Data Visualization', 'Senior Spark Engineer (Data Science)', 'Software Engineer Staff Scientist: Human Language Technologies'
    ],

    'Specialized Scientific Roles': [
        'Environmental Scientist', 'Medical Lab Scientist', 'Clinical Laboratory Scientist', 'Medical Laboratory Scientist',
        'Food Scientist - Developer', 'Scientist Manufacturing - Kentucky BioProcessing', 'Clinical Scientist, Clinical Development',
        'Quality Control Scientist', 'Scientist, Stem Cells and Genomics', 'MED TECH/LAB SCIENTIST - LABORATORY',
        'Medical Technologist / Clinical Laboratory Scientist', 'MED TECH/LAB SCIENTIST- SOUTH COASTAL LAB',
        'Senior Clinical Lab Scientist, Clinical Lab Svcs - FT/Nights (8hr)', 'Scientist, Analytical Development', 'Scientist Manufacturing Pharma - Kentucky BioProcessing',
        'Product Engineer – Data Science', 'Associate Principal Scientist, Pharmacogenomics', 'PL Actuarial-Lead Data Scientist',
        'Sr. Scientist, Toxicology', 'Associate Scientist', 'Sr Data Scientist', 'Senior Engineer, Data Management Engineering'

    ],

    'Manager/Director in Data Field': [
        'Director of Data Science', 'Data Science Manager', 'Senior Manager, Data Science', 'Director II, Data Science',
        'Head Data Scientist', 'Senior Director Biometrics and Clinical Data Management', 'Director - Data, Privacy and AI Governance',
        'Director II, Data Science - GRM Actuarial', 'Director Data Science', 'Assistant Director/Director, Office of Data Science',
        'Principal Data Scientist', 'Chief Data Scientist', 'Vice President of Data Science', 'Director of Analytics',
        'Managing Data Scientist/ML Engineer', 'Senior Data Science Systems Engineer', 'Data Science Manager',
        'Director II, Data Science - GRS Predictive Analytics', 'Head Data Scientist, Predictive Analytics Lead AI Innovation Lab',
        'Director, Data Science', 'Manager of Data Science', 'Manager, Safety Scientist, Medical Safety & Risk Management',
        'Director, Precision Medicine Clinical Biomarker Scientist', 'Tech Manager, Software Engineering - Data',
        'Business Development - Data Supply Partnerships (Veraset)', 'Sr. Manager, Data Science - Marketing Mix Media',
        'Sr. BI Data Engineer III', 'Process Development Scientist', 'Senior Analytical Scientist',
        'Enterprise Architect, Data', 'Sr. Enterprise Account Exec- Data Science / ML - NYC', 'College Hire - Data Scientist - Open to December 2019 Graduates',
        'Sr Data Scientist - IT', 'Staff Machine Learning Scientist, AI Foundation', 'Senior Imagery Scientist - SAR TO 11 #78 (TS/SCI)',
        'Associate Director/Director, Safety Scientist', 'Lead Data Engineer (Python)', 'Associate, Data Science, Internal Audit',
        'Senior Data Scientist - Algorithms'
    ],

    'Clinical Scientist': [
        'Clinical Data Scientist', 'Clinical Research Scientist', 'Clinical Scientist, Clinical Development',
        'Clinical Laboratory Scientist', 'Clinical Scientist', 'Medical Laboratory Scientist',
        'Senior Clinical Scientist', 'Clinical Document Review Scientist', 'Medical Lab Scientist - MLT'
    ],

    'Statistician': [
        'Statistician', 'Senior Statistician', 'Biostatistician', 'Data Scientist - Statistical', 'Senior Data Scientist Statistics',
        'Senior Scientist - Biostatistician', 'Data Statistician'
    ],

    'AI Specialist': [
        'AI Researcher', 'Artificial Intelligence Specialist', 'AI Engineer', 'Senior AI Engineer',
        'AI Scientist', 'Senior AI Scientist', 'AI Machine Learning Engineer'
    ],

    'Other Technical Roles': [
        'Technical Specialist', 'Tech Lead', 'Technical Engineer', 'Senior Technical Specialist',
        'Technical Analyst', 'Lead Technical Specialist', 'IT Technical Lead', 'Data Technical Lead'
    ],


    'NLP Specialist': [
        'NLP Engineer', 'Natural Language Processing Scientist', 'NLP Data Scientist', 'NLP Research Scientist',
        'NLP Specialist', 'Senior NLP Engineer', 'NLP Scientist', 'Head Data Scientist – NLP lead, Novartis AI Innovation Lab'
    ],
    'Computer Vision Specialist': [
        'Computer Vision Engineer', 'Computer Vision Scientist', 'Computer Vision Specialist',
        'Senior Computer Vision Engineer', 'Computer Vision Research Scientist', 'Deep Learning/Computer Vision Scientist'
    ],
    'Business Intelligence Analyst': [
        'Business Intelligence Analyst', 'BI Analyst', 'BI & Platform Analytics Manager', 'Business Intelligence Developer',
        'Senior BI Analyst', 'BI Data Analyst', 'Business Intelligence Data Analyst'
    ],
    'Analytics Manager/Consultant': [
        'Analytics Manager', 'Analytics Consultant', 'Data Analytics Manager', 'Senior Analytics Manager',
        'Analytics Project Manager', 'Consultant - Analytics Consulting', 'Analytics - Business Assurance Data Analyst',
        'Advanced Analytics Manager', 'Managing Data Scientist/ML Engineer', 'Business Analytics Consultant',
        'Director of Analytics', 'Analytics Manager - Data Mart'
    ],

    'Specialized Scientific Roles': [
        'Environmental Scientist', 'Medical Lab Scientist', 'Clinical Scientist', 'Food Scientist', 'Agricultural Scientist',
        'Pharmaceutical Scientist', 'Biomedical Scientist', 'Geo Scientist', 'Physical Scientist', 'Medical Scientist',
        'Clinical Laboratory Scientist', 'MED TECH/LAB SCIENTIST - LABORATORY', 'MED TECH/LAB SCIENTIST- SOUTH COASTAL LAB',
        'Scientist Manufacturing - Kentucky BioProcessing', 'Scientist Manufacturing Pharma - Kentucky BioProcessing',
        'Senior Scientist - Neuroscience', 'Scientist, Analytical Development', 'Sr. Scientist - Digital & Image Analysis/Computational Pathology',
        'Principal Scientist - Immunologist', 'Sr. Scientist, Quantitative Translational Sciences', 'Principal Scientist, Chemistry & Immunology',
        'Scientist/Senior Scientist, Autoimmune', 'Principal Scientist, Hematology', 'Pharmacovigilance Scientist (Senior Pharmacovigilance)',
        'Medical Laboratory Scientist', 'Clinical Laboratory Scientist', 'Scientist, Product Development',
        'Senior Scientist - Regulatory Submissions', 'Scientist - Biomarker and Flow Cytometry', 'Associate Scientist, LC/MS Biologics',
        'Sr. Scientist Method Development', 'Scientist - CVRM Metabolism - in vivo pharmacology', 'Scientist, Pharmacometrics',
        'Staff Scientist- Upstream PD', 'Senior Clinical Lab Scientist, Clinical Lab Svcs - FT/Nights (8hr)',
        'Product Engineer – Spatial Data Science and Statistical Analysis', 'Lab Head, Principle Scientist, Dupixent/Type 2 Inflammation & Fibrosis - Cambridge, MA',
        'Senior Scientist (Neuroscience)', 'Associate Scientist / Sr. Associate Scientist, Antibody Discovery',
        'Scientist, Molecular/Cellular Biologist', 'Scientist - Analytical Services', 'Senior Scientist - Toxicologist - Product Integrity (Stewardship)',
        'Senior Scientist Protein/Oligonucleotides', 'Quality Control Scientist', 'Quality Control Scientist III- Analytical Development',
        'Senior Formulations Scientist II', 'Scientist 2, QC Viral Vector', 'Senior Scientist - Bioanalytical',
        'Scientist, Biomarker Science', 'Scientist, Stem Cells and Genomics', 'Scientist, Immuno-Oncology',
        'Principal Scientist, Hematology', 'Principal Scientist, Chemistry & Immunology', 'Associate Scientist/Scientist',
        'ENVIRONMENTAL ENGINEER/SCIENTIST', 'Food Scientist - Developer', 'Scientist', 'Clinical Data Analyst',
        'Associate Environmental Scientist - Wildlife Biologist', 'Pricipal Scientist Molecular and cellular biologist',
        'Computational Chemist/Data Scientist', 'RESEARCH COMPUTER SCIENTIST - RESEARCH ENGINEER - SR. COMPUTER SCIENTIST - SOFTWARE DEVELOPMENT',
        'Chief Scientist - Emerging Technology Center', 'Systems Engineer II - Data Analyst', 'Products Data Analyst II',
        'Sr Scientist, Immuno-Oncology - Oncology', 'Sr. Scientist II', 'Insurance Data Scientist', 'Data Modeler',
        'Clinical Data Manager', 'Chief Data Officer', 'Associate Director, Platform and DevOps- Data Engineering and Aritifical Intelligence',
        'Project Scientist', 'Sr. Data Scientist - Analytics, Personalized Healthcare (PHC)', 'RESEARCH SCIENTIST - BIOLOGICAL SAFETY',
        'Customer Data Scientist/Sales Engineer (Bay', 'Data Architect / Data Modeler', 'Associate Scientist/Scientist, Process Analytical Technology - Small Molecule Analytical Chemistry',
        'Staff Scientist', 'Data Management Specialist', 'Medical Technologist / Clinical Laboratory Scientist',
        'Associate Data Scientist/Computer Scientist', 'CONSULTANT– DATA ANALYTICS GROUP', 'Data Engineer, Data Engineering and Artificial Intelligence'

    ],

    'Research Scientist': [
        'Research Scientist', 'Scientist I/II, Biology', 'Senior Research Scientist', 'Spectral Scientist/Engineer',
        'R&D Sr Data Scientist', 'Research Scientist - Bioinformatics', 'Senior Research Analytical Scientist-Non-Targeted Analysis',
        'Research Scientist / Principal Research Scientist - Multiphysical Systems', 'Research Scientist, Machine Learning Department',
        'Research Scientist – Security and Privacy', 'Senior Research Scientist - Embedded System Development for DevOps',
        'Weapons and Sensors Engineer/Scientist – Entry Level', 'Senior Scientist - Bioanalytical', 'Scientist, Biomarker Science',
        'Senior Research Scientist - Neuroscience', 'Senior Research Statistician- Data Scientist', 'Senior Scientist - Neuroscience',
        'Senior Research Scientist-Machine Learning', 'Scientist, Product Development', 'Senior Scientist Protein/Oligonucleotides',
        'Scientist, Molecular/Cellular Biologist', 'Scientist - Analytical Services', 'Senior Scientist - Regulatory Submissions',
        'Senior Scientist - Biostatistician', 'Senior Formulations Scientist II', 'Scientist, Pharmacometrics',
        'Scientist Manufacturing Pharma - Kentucky BioProcessing', 'Lab Head, Principle Scientist, Dupixent/Type 2 Inflammation & Fibrosis - Cambridge, MA',
        'Senior Scientist, Logic Gated CAR T Cell Therapy', 'Principal Scientist, Hematology', 'Scientist - CVRM Metabolism - in vivo pharmacology',
        'Research Scientist or Senior Research Scientist - Computer Vision', 'Scientist/Senior Scientist, Autoimmune',
        'Principal Scientist, Chemistry & Immunology', 'Scientist - Biomarker and Flow Cytometry', 'Associate Research Scientist I (Protein Expression and Production)',
        'Staff Scientist-Downstream Process Development', 'Scientist, Immuno-Oncology', 'Associate Scientist / Sr. Associate Scientist, Antibody Discovery'
    ],

    'Machine Learning Engineer': [
        'Machine Learning Engineer', 'Senior Machine Learning Engineer', 'Staff Machine Learning Engineer',
        'Machine Learning Research Scientist', 'Machine Learning Engineer - Regulatory', 'Senior LiDAR Data Scientist',
        'Principal Machine Learning Scientist', 'Machine Learning Engineer (NLP)', 'Associate Machine Learning Engineer / Data Scientist May 2020 Undergrad',
        'Lead Machine Learning Engineer', 'Senior Machine Learning (ML) Engineer / Data Scientist - Cyber Security Analytics'
    ],

    'NLP Specialist': [
        'NLP Engineer', 'Natural Language Processing Scientist', 'Head Data Scientist – NLP lead, Novartis AI Innovation Lab',
        'UX Data Scientist (Python)', 'Senior Data Scientist - NLP lead, Novartis AI Innovation Lab', 'Data Scientist - NLP'
    ]
}

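# Replace job titles in the DataFrame.
# Caveats: a group key repeated in the dict literal above keeps only its last list,
# and a title listed under several groups is assigned to the first such group in
# iteration order, because each replacement happens in place.
for general_title, specific_titles in job_title_mapping.items():
    data2.loc[data2['Job Title'].isin(specific_titles), 'Job Title'] = general_title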
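
Several titles appear under more than one group in the dictionary, so it can help to invert the mapping and list the conflicts before applying it. The sketch below uses a hypothetical miniature mapping (`toy_mapping`) rather than the full dictionary. The names `groups_by_title`, `conflicts`, and `title_to_group` are illustrative, and the "first listed group wins" rule is an assumption chosen to mirror the in-place loop:

```python
from collections import defaultdict

import pandas as pd

# Hypothetical miniature mapping, not the full dictionary from the notebook.
toy_mapping = {
    "Data Scientist": ["Data Scientist", "Senior Data Scientist", "Principal Data Scientist"],
    "Manager/Director in Data Field": ["Director of Data Science", "Principal Data Scientist"],
}

# Invert the mapping: specific title -> every group that lists it.
groups_by_title = defaultdict(list)
for group, titles in toy_mapping.items():
    for title in dict.fromkeys(titles):  # drop repeats within one group, keep order
        groups_by_title[title].append(group)

# Titles claimed by more than one group.
conflicts = {t: g for t, g in groups_by_title.items() if len(g) > 1}
print(conflicts)

# Flat lookup in which the first listed group wins (an assumption).
title_to_group = {t: g[0] for t, g in groups_by_title.items()}

# Apply in one pass; titles missing from the lookup keep their original value.
titles = pd.Series(["Senior Data Scientist", "Principal Data Scientist", "Astronaut"])
grouped = titles.map(title_to_group).fillna(titles)
print(grouped.tolist())
```

A single `map` call does not chain replacements the way the loop can, so results may differ when a group name also appears as a specific title in another group. Keys repeated in a dict literal are collapsed before any code runs, so this check cannot detect them.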

In [124]:
with pd.option_context('display.max_rows', None):
    print(data2['Job Title'].value_counts())

Job Title
Data Scientist                    245
Specialized Scientific Roles      119
Data Analyst                      112
Data Engineer                     108
Manager/Director in Data Field     29
Research Scientist                 24
Machine Learning Engineer          20
Software Engineer (Data Focus)      8
Analytics Manager/Consultant        4
Clinical Scientist                  3
Business Intelligence Analyst       2
Statistician                        2
Name: count, dtype: int64


---

### **Company Rating**

- What is the average rating of companies in the dataset?



In [81]:
average_rating = data2['Rating'].mean()
print("Average Rating:", average_rating)

Average Rating: 3.601171458998935


In [82]:
data2['Rating'].describe()

count    939.000000
mean       3.601171
std        1.074927
min       -1.000000
25%        3.300000
50%        3.800000
75%        4.200000
max        5.000000
Name: Rating, dtype: float64

In [83]:
data2.shape

(939, 14)

In [84]:
data2 = data2[data2['Rating'] >= 0]

In [85]:
data2.shape

(905, 14)

I dropped the 34 rows with a negative `Rating` (-1), which reduced the dataset from 939 to 905 rows.
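
The average rating printed earlier was computed before these rows were removed, so the `-1` values pull it down. A small sketch of the effect on made-up numbers follows (the `toy` frame and variable names are illustrative, not the dataset):

```python
import pandas as pd

# Made-up ratings; -1.0 plays the role of the placeholder value.
toy = pd.DataFrame({"Rating": [3.8, -1.0, 4.2, -1.0]})

# Mean taken over every row, placeholders included.
mean_with_placeholders = toy["Rating"].mean()

# Keep only non-negative ratings, then recompute.
valid = toy["Rating"] >= 0
filtered = toy[valid].copy()
rows_dropped = len(toy) - len(filtered)
mean_without_placeholders = filtered["Rating"].mean()

print(rows_dropped, mean_with_placeholders, mean_without_placeholders)
```

Recomputing the mean after the filter gives a figure that reflects only the companies with a rating.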

---

### **Company Size**

- How many different company sizes are there, and what are they?



In [86]:
company_sizes = data2['Size'].unique()
company_sizes

array(['501 to 1000 employees', '10000+ employees',
       '1001 to 5000 employees', '51 to 200 employees',
       '201 to 500 employees', '5001 to 10000 employees',
       '1 to 50 employees', 'Unknown'], dtype=object)

In [87]:
# 'Size' holds strings, so the placeholder is compared as '-1' rather than the integer -1
data2 = data2[(data2['Size'] != 'Unknown') & (data2['Size'] != '-1')]

I dropped the rows whose `Size` is `Unknown`.

---

### **Company's Age**

- How many companies were founded in each year?



In [88]:
companies_founded_per_year = data2['Founded'].value_counts()
print("Companies Founded per Year:")
print(companies_founded_per_year)

Companies Founded per Year:
Founded
-1       61
 2008    39
 2010    38
 1996    38
 2013    33
         ..
 1945     1
 1878     1
 1860     1
 1942     1
 1889     1
Name: count, Length: 108, dtype: int64


In [89]:
data2 = data2[data2['Founded'] >= 0]
(data2['Founded']==-1).sum()

0

I dropped the 61 rows with a negative value (-1) in `Founded`.

---

### **Company's Type of Ownership**

- What are the different types of ownership and their counts?



In [90]:
data2['Type of ownership'].value_counts()

Type of ownership
Company - Private                 482
Company - Public                  219
Nonprofit Organization             49
Subsidiary or Business Segment     37
Government                         15
Hospital                           15
College / University               13
Other Organization                  4
Contract                            3
Unknown                             1
School / School District            1
Name: count, dtype: int64

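I will replace these values with `Other Organization`: `'-1'`, `'Unknown'`, `'Contract'`, `'School / School District'`, `'Private Practice / Firm'`. (`'-1'` and `'Private Practice / Firm'` no longer appear in the counts above after the earlier filtering, so including them is harmless.)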

In [91]:
values_to_replace = ['-1', 'Unknown', 'Contract', 'School / School District', 'Private Practice / Firm']

data2['Type of ownership'] = data2['Type of ownership'].replace(values_to_replace, 'Other Organization')

In [92]:
data2['Type of ownership'].value_counts()

Type of ownership
Company - Private                 482
Company - Public                  219
Nonprofit Organization             49
Subsidiary or Business Segment     37
Government                         15
Hospital                           15
College / University               13
Other Organization                  9
Name: count, dtype: int64

---

### **Industry**

In [93]:
data2['Industry'].value_counts()

Industry
Biotech & Pharmaceuticals                   136
IT Services                                  70
Computer Hardware & Software                 64
Insurance Carriers                           60
Enterprise Software & Network Solutions      54
Health Care Services & Hospitals             51
Internet                                     35
Consulting                                   33
Aerospace & Defense                          31
Staffing & Outsourcing                       26
Advertising & Marketing                      25
Consumer Products Manufacturing              22
Research & Development                       21
Banks & Credit Unions                        19
Colleges & Universities                      16
Lending                                      14
Energy                                       14
Federal Agencies                             13
Travel Agencies                               8
Financial Analytics & Research                8
Real Estate                    

Grouping similar industries into broader categories

In [94]:
# Mapping of specific industries to broader categories
industry_mapping = {
    'Biotech & Pharmaceuticals': ['Biotech & Pharmaceuticals'],
    'Technology': ['IT Services', 'Computer Hardware & Software', 'Enterprise Software & Network Solutions', 'Internet', 'Video Games', 'Telecommunications Services', 'Telecommunications Manufacturing'],
    'Insurance': ['Insurance Carriers', 'Insurance Agencies & Brokerages'],
    'Health Care': ['Health Care Services & Hospitals', 'Health Care Products Manufacturing'],
    'Consulting': ['Consulting'],
    'Aerospace & Defense': ['Aerospace & Defense'],
    'Professional Services': ['Staffing & Outsourcing', 'Advertising & Marketing', 'Research & Development', 'Architectural & Engineering Services'],
    'Manufacturing': ['Consumer Products Manufacturing', 'Food & Beverage Manufacturing', 'Industrial Manufacturing', 'Transportation Equipment Manufacturing'],
    'Financial Services': ['Banks & Credit Unions', 'Lending', 'Federal Agencies', 'Financial Analytics & Research', 'Financial Transaction Processing', 'Investment Banking & Asset Management', 'Brokerage Services', 'Metals Brokers', 'Stock Exchanges'],
    'Education': ['Colleges & Universities', 'K-12 Education', 'Education Training Services', 'Preschool & Child Care'],
    'Energy & Utilities': ['Energy', 'Gas Stations'],
    'Retail & Consumer Services': ['Department, Clothing, & Shoe Stores', 'Consumer Product Rental', 'Other Retail Stores', 'Sporting Goods Stores', 'Beauty & Personal Accessories Stores'],
    'Real Estate & Construction': ['Real Estate', 'Construction'],
    'Media & Entertainment': ['Publishing', 'TV Broadcast & Cable Networks', 'Motion Picture Production & Distribution'],
    'Logistics & Transportation': ['Logistics & Supply Chain', 'Transportation Management', 'Trucking'],
    'Others': ['Security Services', 'Gambling', 'Wholesale', 'Social Assistance', 'Auctions & Galleries', 'Mining', 'Farm Support Services']
}

for broader_category, industries in industry_mapping.items():
    data2.loc[data2['Industry'].isin(industries), 'Industry'] = broader_category

In [95]:
data2['Industry'].value_counts()

Industry
Technology                    237
Biotech & Pharmaceuticals     136
Professional Services          76
Financial Services             67
Insurance                      66
Health Care                    52
Manufacturing                  34
Consulting                     33
Aerospace & Defense            31
Education                      23
Others                         19
Energy & Utilities             18
Retail & Consumer Services     14
Real Estate & Construction     12
Logistics & Transportation      8
Travel Agencies                 8
Media & Entertainment           5
Name: count, dtype: int64
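
`Travel Agencies` still appears in the output above because it is not listed in `industry_mapping`. Below is a sketch, on a small made-up sample, of inverting a category-to-members mapping into a flat lookup so that unmapped values are easy to spot. The names `toy_mapping` and `lookup` are illustrative only and do not come from the notebook:

```python
import pandas as pd

# Small stand-in for industry_mapping: broad category -> specific industries
toy_mapping = {
    'Technology': ['IT Services', 'Internet'],
    'Insurance': ['Insurance Carriers'],
}

# Invert to: specific industry -> broad category
lookup = {specific: broad
          for broad, members in toy_mapping.items()
          for specific in members}

industries = pd.Series(['IT Services', 'Insurance Carriers', 'Travel Agencies', 'Internet'])

# map() yields NaN for anything missing from the lookup
mapped = industries.map(lookup)
unmapped = sorted(industries[mapped.isna()].unique())
print("Unmapped:", unmapped)

# Keep the original label where no broad category exists
grouped = mapped.fillna(industries)
print(grouped.tolist())
```

Printing the unmapped labels before grouping makes it a deliberate choice whether a value such as `Travel Agencies` stays as its own category or joins `Others`.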

---

### **Sector**

I will drop the `Sector` column, since it is similar to `Industry`.

In [96]:
data2['Sector'].value_counts()

Sector
Information Technology                223
Biotech & Pharmaceuticals             136
Business Services                     119
Insurance                              66
Finance                                52
Health Care                            51
Manufacturing                          35
Aerospace & Defense                    31
Education                              23
Retail                                 16
Oil, Gas, Energy & Utilities           14
Government                             13
Media                                  13
Travel & Tourism                        8
Transportation & Logistics              8
Real Estate                             8
Telecommunications                      6
Arts, Entertainment & Recreation        4
Construction, Repair & Maintenance      4
Mining & Metals                         3
Consumer Services                       3
Non-Profit                              2
Agriculture & Forestry                  1
Name: count, dtype: int64

In [97]:
data2.drop(columns="Sector", inplace=True)

---

### **Company HQ**

In [98]:
data2['Headquarters'].value_counts()

Headquarters
New York, NY         64
San Francisco, CA    49
Chicago, IL          33
Cambridge, MA        22
Boston, MA           17
                     ..
Chattanooga, TN       1
Tempe, AZ             1
Aurora, CO            1
Fort Worth, TX        1
Novi, MI              1
Name: count, Length: 207, dtype: int64

In [99]:
data2.drop(columns="Headquarters", inplace=True)

---

### **Company's Revenue**

In [100]:
data2['Revenue'].value_counts()

Revenue
Unknown / Non-Applicable            245
$10+ billion (USD)                  135
$100 to $500 million (USD)           99
$1 to $2 billion (USD)               65
$500 million to $1 billion (USD)     61
$25 to $50 million (USD)             54
$50 to $100 million (USD)            47
$2 to $5 billion (USD)               43
$10 to $25 million (USD)             39
$5 to $10 billion (USD)              18
$5 to $10 million (USD)              16
$1 to $5 million (USD)               15
Less than $1 million (USD)            2
Name: count, dtype: int64

I will drop `Revenue` since its most frequent value is `Unknown / Non-Applicable` (245 of 839 rows, about 29%).

In [101]:
data2.drop(columns="Revenue", inplace=True)
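
The share of each `Revenue` value can be checked with `value_counts(normalize=True)`. The sketch below rebuilds a Series from the counts printed above rather than from the CSV, and lumps the smaller bands into one made-up label, so it only illustrates the call:

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = {
    'Unknown / Non-Applicable': 245,
    '$10+ billion (USD)': 135,
    '$100 to $500 million (USD)': 99,
    'all other bands combined': 360,  # illustrative label: 839 - 245 - 135 - 99
}

# Expand the counts back into one row per listing
revenue = pd.Series(
    [label for label, n in counts.items() for _ in range(n)],
    name='Revenue',
)

# normalize=True returns fractions instead of raw counts
shares = revenue.value_counts(normalize=True)
print(shares.round(3))
```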

---

### **Competitors**

I will drop `Competitors` since entering every competitor's company name in the UI is not practical.

In [102]:
data2.drop(columns="Competitors", inplace=True)

---

### **Company's Location**

- What are the top 5 locations with the most job listings?


In [103]:
with pd.option_context('display.max_rows', None):
    print(data2['Location'].value_counts())

Location
New York, NY                         64
San Francisco, CA                    62
Cambridge, MA                        50
Chicago, IL                          36
Boston, MA                           23
South San Francisco, CA              18
Pittsburgh, PA                       16
San Jose, CA                         13
Chantilly, VA                        13
Rockville, MD                        11
Austin, TX                           11
Washington, DC                       10
Herndon, VA                          10
Richland, WA                         10
Winston-Salem, NC                    10
San Diego, CA                         9
Indianapolis, IN                      9
Mountain View, CA                     9
Nashville, TN                         8
Houston, TX                           8
Rochester, NY                         7
Denver, CO                            7
Atlanta, GA                           7
Huntsville, AL                        7
Dallas, TX                     

I will drop the `Location` column since it covers only US locations and would be hard to *encode*.

In [104]:
data2.drop(columns="Location", inplace=True)

---

### **Company's Name**

I will drop `Company Name`: with this little data, there are not enough rows per company name for the model to learn from, and training on the column would likely lower the evaluation scores.

In [105]:
data2.drop(columns="Company Name", inplace=True)

---

### **Job Description**

I will drop `Job Description` since there is not enough text in a dataset of this size to train on, so it would not help the model.

In [106]:
data2.drop(columns="Job Description", inplace=True)

---

### **Estimated Salary**

In [107]:
with pd.option_context('display.max_rows', None):
    print(data2['Salary Estimate'].value_counts())

Salary Estimate
-1                                           163
$21-$34 Per Hour(Glassdoor est.)               6
$86K-$143K (Glassdoor est.)                    6
$54K-$115K (Glassdoor est.)                    6
$49K-$113K (Glassdoor est.)                    6
$76K-$142K (Glassdoor est.)                    5
$74K-$124K (Glassdoor est.)                    5
$107K-$173K (Glassdoor est.)                   5
$81K-$167K (Glassdoor est.)                    5
$42K-$86K (Glassdoor est.)                     4
$40K-$68K (Glassdoor est.)                     4
$63K-$105K (Glassdoor est.)                    4
$61K-$109K (Glassdoor est.)                    4
$56K-$95K (Glassdoor est.)                     4
$69K-$127K (Glassdoor est.)                    4
$110K-$175K (Glassdoor est.)                   4
$68K-$139K (Glassdoor est.)                    4
$18-$25 Per Hour(Glassdoor est.)               4
$49K-$97K (Glassdoor est.)                     4
$35K-$62K (Glassdoor est.)                     4
$64K

I will create a new DataFrame called `unknown_salaries` that will contain only the rows with the value '**-1**' in the column `Salary Estimate`, and then drop them from the main DataFrame.

This `unknown_salaries` DataFrame may be used in other steps if needed.

In [108]:
# create new DataFrame for 'Salary Estimate' => '-1'
unknown_salaries = data2[data2['Salary Estimate'] == '-1']

# removing these rows from the main DataFrame
data2 = data2[data2['Salary Estimate'] != '-1']

I will parse and convert each salary range in the `Salary Estimate` column of `data2` to its average value.

In [109]:
def average_salary(salary_str):
    salary_str = salary_str.replace('Employer Provided Salary:', '').replace('(Glassdoor est.)', '').replace('(Employer est.)', '')

    if 'Per Hour' in salary_str:
        salary_str = salary_str.replace('Per Hour', '').replace('$', '')
        if '-' in salary_str:
            # Handle hourly range
            lower, upper = salary_str.split('-')
            lower_rate = float(lower) if lower.strip() else 0
            upper_rate = float(upper) if upper.strip() else 0
            hourly_rate = (lower_rate + upper_rate) / 2
        else:
            hourly_rate = float(salary_str)
        return hourly_rate * 40 * 52  # Convert to annual salary
    elif '-' in salary_str:
        # Handle range salaries
        salary_str = salary_str.replace('K', '000').replace('$', '')
        lower, upper = salary_str.split('-')
        lower_salary = float(lower) if lower.strip() else 0
        upper_salary = float(upper) if upper.strip() else 0
        return (lower_salary + upper_salary) / 2
    else:
        # Handle single value salary
        salary_str = salary_str.replace('K', '000').replace('$', '')
        return float(salary_str) if salary_str.strip() else 0



data2['Salary Estimate'] = data2['Salary Estimate'].apply(average_salary)
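
For comparison, the same midpoint idea can be sketched with a regular expression. This is an alternative under stated assumptions, not the function used above: `parse_salary` and `HOURS_PER_YEAR` are hypothetical names, and unlike `average_salary` it averages only the amounts it finds instead of treating a missing bound as 0.

```python
import re

HOURS_PER_YEAR = 40 * 52  # same full-time assumption as average_salary above

def parse_salary(text):
    """Return the midpoint of a salary string as an annual figure, or None."""
    hourly = 'Per Hour' in text
    # Capture each dollar amount and whether it carries a K suffix
    amounts = [float(num) * (1000 if k else 1)
               for num, k in re.findall(r'\$(\d+(?:\.\d+)?)(K?)', text)]
    if not amounts:
        return None  # e.g. the '-1' placeholder
    midpoint = sum(amounts) / len(amounts)
    return midpoint * HOURS_PER_YEAR if hourly else midpoint

for s in ['$53K-$91K (Glassdoor est.)', '$21-$34 Per Hour(Glassdoor est.)', '-1']:
    print(s, '->', parse_salary(s))
```

Returning `None` for strings without a dollar amount keeps placeholder rows visible as missing values rather than as a salary of 0.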

In [110]:
data2['Salary Estimate'].describe()

count       676.000000
mean     100677.514793
std       37522.096598
min       15500.000000
25%       72875.000000
50%       96000.000000
75%      122500.000000
max      254000.000000
Name: Salary Estimate, dtype: float64

---

## **Encoding Categorical Features**

In [111]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

### **OneHot Encoding (Dummies)**

Applied to all the categorical features except `Size`.

In [112]:
one_hot_cols = ['Job Title', 'Type of ownership', 'Industry']
one_hot_encoder = OneHotEncoder()
one_hot_encoded = one_hot_encoder.fit_transform(data2[one_hot_cols]).toarray()

one_hot_encoded_df = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out(one_hot_cols))

### **Ordinal Encoding**

Used only for the `Size` column: salaries are usually higher at companies with more employees, so the size bands have a natural order.

In [113]:
size_mapping = {
    '1 to 50 employees': 0,
    '51 to 200 employees': 1,
    '201 to 500 employees': 2,
    '501 to 1000 employees': 3,
    '1001 to 5000 employees': 4,
    '5001 to 10000 employees': 5,
    '10000+ employees': 6
}
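# The keys of size_mapping are already ordered from the smallest to the largest band,
# so they can be passed directly as the category order (yields 0.0 to 6.0, as before)
ordinal_encoder = OrdinalEncoder(categories=[list(size_mapping)])
data2['Size'] = ordinal_encoder.fit_transform(data2[['Size']]).ravel()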

Now combine the numerical columns, the encoded `Size`, and the one-hot columns into a single `df` DataFrame.

In [114]:
numerical_cols = ['Salary Estimate', 'Rating', 'Founded']

data2_reset = data2.reset_index(drop=True)
one_hot_encoded_df_reset = one_hot_encoded_df.reset_index(drop=True)

df = pd.concat([data2_reset[numerical_cols], data2_reset[['Size']], one_hot_encoded_df_reset], axis=1)
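
The `reset_index(drop=True)` calls matter because `pd.concat(axis=1)` aligns rows by index label, not by position: `data2` keeps the labels of the rows that survived filtering, while `one_hot_encoded_df` is numbered from 0. A sketch with toy frames:

```python
import pandas as pd

# Filtered frame keeps its original labels (rows 1 and 3 were dropped)
left = pd.DataFrame({'Rating': [3.8, 4.8, 4.1]}, index=[0, 2, 4])
# Freshly built frame is labelled 0..n-1
right = pd.DataFrame({'Size': [3.0, 3.0, 6.0]})

misaligned = pd.concat([left, right], axis=1)
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned)  # 4 rows, with NaN where the labels do not match
print(aligned)     # 3 complete rows
```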

---

# **Machine Learning Model**

## **Splitting the Data**

In [115]:
from sklearn.model_selection import train_test_split

X = df.drop('Salary Estimate', axis=1)
y = df['Salary Estimate']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## **Training Machine Learning Models**

### **Linear Regression**

In [116]:
unknown_salaries.shape

(163, 7)

In [117]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

lr_predictions = lr_model.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_predictions)
lr_rmse = np.sqrt(lr_mse)
lr_r2 = r2_score(y_test, lr_predictions)

print("Linear Regression:")
print(f"MSE: {lr_mse}, RMSE: {lr_rmse}, R²: {lr_r2}")


Linear Regression:
MSE: 910710504.2466048, RMSE: 30177.98045341346, R²: 0.34974324582903404


### **SVM**

In [118]:
from sklearn.svm import SVR

svm_model = SVR()
svm_model.fit(X_train, y_train)

svm_predictions = svm_model.predict(X_test)
svm_mse = mean_squared_error(y_test, svm_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_r2 = r2_score(y_test, svm_predictions)

print("SVM Regressor:")
print(f"MSE: {svm_mse}, RMSE: {svm_rmse}, R²: {svm_r2}")


SVM Regressor:
MSE: 1473204600.8144517, RMSE: 38382.347515680325, R²: -0.05188337840445012


### **XGBoost Regressor**

In [119]:
from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=400, max_depth=5)
xgb_model.fit(X_train, y_train)

xgb_predictions = xgb_model.predict(X_test)
xgb_mse = mean_squared_error(y_test, xgb_predictions)
xgb_rmse = np.sqrt(xgb_mse)
xgb_r2 = r2_score(y_test, xgb_predictions)

print("XGBoost Regressor:")
print(f"MSE: {xgb_mse}, RMSE: {xgb_rmse}, R²: {xgb_r2}")


XGBoost Regressor:
MSE: 539502313.740047, RMSE: 23227.189105443795, R²: 0.6147897473846042


### **Random Forest**

In [120]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=400)
rf_model.fit(X_train, y_train)

rf_predictions = rf_model.predict(X_test)
rf_mse = mean_squared_error(y_test, rf_predictions)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_predictions)

print("Random Forest:")
print(f"MSE: {rf_mse}, RMSE: {rf_rmse}, R²: {rf_r2}")


Random Forest:
MSE: 566970318.4832095, RMSE: 23811.138538154984, R²: 0.595177306851026
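
These scores come from a single split with roughly 136 test rows, and `RandomForestRegressor` is created without a `random_state`, so the gap between the two tree models (R² of about 0.61 vs 0.60) could change from run to run. Below is a sketch of k-fold cross-validation as a steadier comparison. It uses synthetic data and a plain linear model as stand-ins, not the notebook's `X` and `y`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: 200 rows, 3 features, linear signal plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# Five shuffled folds; each row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring='r2')

print("R² per fold:", np.round(scores, 3))
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The spread across folds gives a sense of how much of a difference between two models is noise from the split.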


---

## **Evaluation**

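`XGBoost Regressor` is the best model by the evaluation metrics above (lowest RMSE, highest R²). I still want to evaluate it in another way: the share of test predictions that fall within a fixed dollar amount of the actual salary. The code below calls this amount a ***`Confidence Interval`***, although it is a tolerance band rather than a confidence interval in the statistical sense.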

In [121]:
def calculate_accuracy_with_confidence_interval(predictions, actual_values, confidence_interval):
    correct_count = 0
    for pred, actual in zip(predictions, actual_values):
        lower_bound = actual - confidence_interval
        upper_bound = actual + confidence_interval
        if lower_bound <= pred <= upper_bound:
            correct_count += 1
    accuracy = correct_count / len(predictions)
    return accuracy

accuracy_7500 = calculate_accuracy_with_confidence_interval(xgb_predictions, y_test, 7500)
accuracy_8000 = calculate_accuracy_with_confidence_interval(xgb_predictions, y_test, 8000)
accuracy_10000 = calculate_accuracy_with_confidence_interval(xgb_predictions, y_test, 10000)


print(f"Accuracy with 7500 confidence interval: {accuracy_7500:.2f}")
print(f"Accuracy with 8000 confidence interval: {accuracy_8000:.2f}")
print(f"Accuracy with 10000 confidence interval: {accuracy_10000:.2f}")

Accuracy with 7500 confidence interval: 0.60
Accuracy with 8000 confidence interval: 0.63
Accuracy with 10000 confidence interval: 0.65


Not the best accuracy, but it is acceptable given the lack of data.
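
The same within-tolerance rate can be written without an explicit loop. Here is a minimal NumPy sketch on made-up numbers; `within_tolerance_rate` is a hypothetical name, not a function from the notebook:

```python
import numpy as np

def within_tolerance_rate(predictions, actual_values, tolerance):
    """Fraction of predictions within +/- tolerance of the actual value."""
    predictions = np.asarray(predictions, dtype=float)
    actual_values = np.asarray(actual_values, dtype=float)
    # Boolean array of hits; its mean is the hit rate
    return float(np.mean(np.abs(predictions - actual_values) <= tolerance))

# Toy numbers, not model output
preds = [100_000, 82_000, 61_000, 130_000]
actual = [95_000, 90_000, 60_000, 150_000]

print(within_tolerance_rate(preds, actual, 7500))
print(within_tolerance_rate(preds, actual, 8000))
```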

---

## **Save the Model with Pickle**

In [122]:
import pickle

with open('xgboost_regressor_model.pkl', 'wb') as file:
    pickle.dump(xgb_model, file)

print("Model saved successfully!")

Model saved successfully!
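
Loading the file back works the same way in reverse. The sketch below shows the round trip with a small scikit-learn model as a stand-in for `xgb_model` and a hypothetical file name in a temporary directory. Note that the pickle holds only the model, not the encoders, so inputs at prediction time would need the same encoded columns in the same order as `X`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model fitted on y = 2x + 1; the notebook pickles xgb_model instead
model = LinearRegression().fit(np.array([[0.0], [1.0], [2.0]]),
                               np.array([1.0, 3.0, 5.0]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_model.pkl')  # hypothetical file name
    with open(path, 'wb') as file:
        pickle.dump(model, file)
    with open(path, 'rb') as file:
        restored = pickle.load(file)

# The restored object predicts like the original
prediction = restored.predict(np.array([[3.0]]))
print(prediction)
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.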


---