### Data Scientist Salary Prediction

- The aim of this project is to predict the expected salary of a data scientist from their skill set, industry, experience, and other contributing features.

### About Dataset

This dataset contains job postings related to Data Science roles in 2025, collected from publicly available sources. It includes essential details such as job titles, seniority levels, company information, locations, salaries, industries, company size, and required skills. The dataset has been cleaned and structured to ensure accuracy and consistency, with duplicates and irrelevant entries removed.

It is designed to help researchers, students, and professionals analyze hiring trends, salary ranges, and in-demand skills in the Data Science job market. This dataset can also support projects in machine learning, career prediction, salary forecasting, and workforce analytics.

#### Key Columns:

- job_title: The job title (e.g., Data Scientist)
- seniority_level: The seniority of the role (senior, lead, etc.)
- status: Work type (on-site, hybrid, remote)
- company: Company name (anonymized, e.g., company_003)
- location: Job posting location (city, region, or multiple sites)
- post_date: Date or time since posting
- headquarter: Company headquarters
- industry: Sector (Retail, Technology, Finance, etc.)
- ownership: Company type (Public, Private)
- company_size: Number of employees in the company
- revenue: Company’s annual revenue
- salary: Salary details (fixed or range)
- skills: Required skills (e.g., Python, SQL, Machine Learning)

#### Dataset Characteristics:
- Most job roles are senior and lead level.
- Work arrangements are mainly on-site and hybrid.
- Companies belong mostly to Technology, Finance, Retail, and Manufacturing sectors.
- Salaries are provided as both ranges and fixed amounts.

### Dependencies

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [3]:
# load the data
data = pd.read_csv("data_science_job_posts_2025.csv")
data.head()

Unnamed: 0,job_title,seniority_level,status,company,location,post_date,headquarter,industry,ownership,company_size,revenue,salary,skills
0,data scientist,senior,hybrid,company_003,"Grapevine, TX . Hybrid",17 days ago,"Bentonville, AR, US",Retail,Public,€352.44B,Public,"€100,472 - €200,938","['spark', 'r', 'python', 'scala', 'machine lea..."
1,data scientist,lead,hybrid,company_005,"Fort Worth, TX . Hybrid",15 days ago,"Detroit, MI, US",Manufacturing,Public,155030,€51.10B,"€118,733","['spark', 'r', 'python', 'sql', 'machine learn..."
2,data scientist,senior,on-site,company_007,"Austin, TX . Toronto, Ontario, Canada . Kirkla...",a month ago,"Redwood City, CA, US",Technology,Public,25930,€33.80B,"€94,987 - €159,559","['aws', 'git', 'python', 'docker', 'sql', 'mac..."
3,data scientist,senior,hybrid,company_008,"Chicago, IL . Scottsdale, AZ . Austin, TX . Hy...",8 days ago,"San Jose, CA, US",Technology,Public,34690,€81.71B,"€112,797 - €194,402","['sql', 'r', 'python']"
4,data scientist,,on-site,company_009,On-site,3 days ago,"Stamford, CT, US",Finance,Private,1800,Private,"€114,172 - €228,337",[]
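The preview shows the skills column as the text form of a Python list rather than an actual list. A minimal sketch of how such strings could be parsed is given below; the helper name `parse_skills` is hypothetical, and it assumes every entry is either a valid list literal or missing.

```python
import ast

def parse_skills(raw):
    """Turn a string such as "['sql', 'r']" into a list of skills.

    Returns an empty list for missing values or text that is not a
    valid Python list literal (an assumption about how to treat them).
    """
    if not isinstance(raw, str):
        return []
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, (list, tuple)):
        return []
    return [str(skill).strip().lower() for skill in parsed]

# Strings in the same format as the preview above (illustrative values only)
examples = ["['sql', 'r', 'python']", "[]", None]
parsed_examples = [parse_skills(value) for value in examples]
skill_counts = [len(skills) for skills in parsed_examples]
```

On the full frame this could be applied with `data["skills"].apply(parse_skills)`, which would make it possible to count skills per posting or build one indicator column per skill.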


### Data Preprocessing

In [4]:
#checking the size of our data
data.shape

(944, 13)

In [5]:
#checking general data information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 944 entries, 0 to 943
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   job_title        941 non-null    object
 1   seniority_level  884 non-null    object
 2   status           688 non-null    object
 3   company          944 non-null    object
 4   location         942 non-null    object
 5   post_date        944 non-null    object
 6   headquarter      944 non-null    object
 7   industry         944 non-null    object
 8   ownership        897 non-null    object
 9   company_size     944 non-null    object
 10  revenue          929 non-null    object
 11  salary           944 non-null    object
 12  skills           944 non-null    object
dtypes: object(13)
memory usage: 96.0+ KB


- Several columns have missing values (for example, status has only 688 non-null entries out of 944).
- All columns are stored as object; some will be converted to numeric.
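As a rough sketch of that conversion, the two helpers below turn company_size and revenue strings into numbers. The function names are hypothetical, and the suffix table assumes that "M" and "B" mean millions and billions of euros; "K" and "T" are included only as a guess, since they do not appear in the preview. Values that do not look numeric, such as the "Public" revenue entry in the first preview row, become None instead of raising an error.

```python
import re

# Multipliers for magnitude suffixes such as "€51.10B".
# "K" and "T" are assumptions; only "M" and "B" are assumed from the data.
SUFFIX_MULTIPLIER = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}

REVENUE_PATTERN = re.compile(r"€\s*([0-9][0-9,]*(?:\.[0-9]+)?)\s*([KMBT])")

def parse_revenue(raw):
    """Convert a string like '€51.10B' to a float amount in euros.

    Returns None for missing values and for labels such as 'Private'.
    """
    if not isinstance(raw, str):
        return None
    match = REVENUE_PATTERN.fullmatch(raw.strip())
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    return number * SUFFIX_MULTIPLIER[match.group(2)]

def parse_company_size(raw):
    """Convert '24,820' or '155030' to an int; return None otherwise."""
    if not isinstance(raw, str):
        return None
    cleaned = raw.strip().replace(",", "")
    if re.fullmatch(r"[0-9]+", cleaned) is None:
        return None
    return int(cleaned)
```

Returning None rather than failing keeps the misplaced entries visible as missing values after something like `data["revenue"].apply(parse_revenue)`, so they can be counted and handled in a later step.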

Because the dataset contains many inconsistencies, we clean it before continuing with any analysis.

#### Handling Data Inconsistencies

In [6]:
data.columns

Index(['job_title', 'seniority_level', 'status', 'company', 'location',
       'post_date', 'headquarter', 'industry', 'ownership', 'company_size',
       'revenue', 'salary', 'skills'],
      dtype='object')

In [9]:
# inspect the value counts of each categorical column to spot inconsistent entries
categorical_columns = ["seniority_level", "status", "company", "location", "industry", "ownership"]

for column in categorical_columns:
    print(data[column].value_counts())
    print("=="*50)

seniority_level
senior      630
lead        116
midlevel    113
junior       25
Name: count, dtype: int64
status
on-site    363
hybrid     207
remote     118
Name: count, dtype: int64
company
company_134    30
company_003    24
company_941    21
company_395    21
company_244    20
               ..
company_044     1
company_045     1
company_046     1
company_048     1
company_050     1
Name: count, Length: 420, dtype: int64
location
Bengaluru, Karnataka, India                                             52
Fully Remote                                                            35
United States                                                           22
New York, NY                                                            22
On-site                                                                 18
                                                                        ..
Huntsville, AL . Merritt Island, FL . Seattle, WA                        1
Laurel, MD                           
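The location counts mix place names with work-arrangement labels such as "Fully Remote" and "On-site", and the preview shows several places joined by " . ". One possible way to separate the two is sketched below. The helper `split_location` and the label mapping are hypothetical, and treating " . " as the separator is an assumption based on the rows shown above.

```python
# Work-arrangement labels seen inside the location column, mapped to the
# vocabulary used by the status column (the mapping itself is an assumption).
WORK_MODES = {"hybrid": "hybrid", "on-site": "on-site", "fully remote": "remote"}

def split_location(raw):
    """Split a location string into (places, work_mode).

    places is a list of place names; work_mode is None when the string
    carries no work-arrangement label.
    """
    if not isinstance(raw, str):
        return [], None
    places, mode = [], None
    for part in raw.split(" . "):
        token = part.strip()
        if not token:
            continue
        key = token.lower()
        if key in WORK_MODES:
            mode = WORK_MODES[key]
        else:
            places.append(token)
    return places, mode

# Example using a value from the preview
places, mode = split_location("Grapevine, TX . Hybrid")
```

The recovered work mode could also serve as a fallback for rows where status is missing, although that should be checked against rows where both values are present.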

In [11]:
# these columns should be numeric but are stored as strings; inspect their raw values
numeric_columns = ["company_size", "revenue", "salary"]
for column in numeric_columns:
    print(data[column].value_counts())
    print("=="*50)


company_size
900         18
Private     15
180         15
45          15
€354.99B    14
            ..
24,820       1
88,990       1
118,910      1
68,400       1
93,580       1
Name: count, Length: 510, dtype: int64
revenue
Private      247
Public       227
Education     20
Nonprofit     12
€913.33M       6
            ... 
€51.10B        1
€29.17B        1
€11.43B        1
€4.93B         1
€36.29B        1
Name: count, Length: 312, dtype: int64
salary
€25,214                3
€137,000               3
€10,524 - €26,185      3
€114,169               3
€137,006 - €200,939    2
                      ..
€33,288 - €53,080      1
€64,290                1
€145,904 - €166,510    1
€159,149 - €181,595    1
€195,486 - €201,926    1
Name: count, Length: 896, dtype: int64
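Since salary appears both as a single amount ("€118,733") and as a range ("€100,472 - €200,938"), it needs to be reduced to numbers before modelling. The sketch below extracts the lower and upper bound and, as one possible choice of target, their midpoint. The function names are hypothetical, and using the midpoint is an assumption rather than something the data prescribes.

```python
import re

# Matches one euro amount such as "€100,472" or "€64,290.50"
AMOUNT_PATTERN = re.compile(r"€\s*([0-9][0-9,]*(?:\.[0-9]+)?)")

def parse_salary(raw):
    """Return (low, high) in euros; a fixed salary gives low == high.

    Returns None for missing values or strings without one or two amounts.
    """
    if not isinstance(raw, str):
        return None
    amounts = [float(a.replace(",", "")) for a in AMOUNT_PATTERN.findall(raw)]
    if len(amounts) == 1:
        return amounts[0], amounts[0]
    if len(amounts) == 2:
        return min(amounts), max(amounts)
    return None

def salary_midpoint(raw):
    """Midpoint of the parsed salary bounds, or None if parsing fails."""
    bounds = parse_salary(raw)
    if bounds is None:
        return None
    return (bounds[0] + bounds[1]) / 2
```

Applied as `data["salary"].apply(salary_midpoint)`, this would give a single numeric column. Keeping the low and high bounds as separate columns is an alternative if the width of the range is itself of interest.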
