# Data Science, AI & ML Job Salaries (2025) – EDA & Insights

In this notebook, I explore the **Data Science, AI & ML Job Salaries (2025)** dataset. The goal is to understand how salaries vary across roles, experience levels, locations, company size and remote work, and to generate clear insights for aspiring data professionals.

**Key questions:**
- How do salaries differ by job title (Data Scientist, ML Engineer, AI Engineer, etc.)?
- What is the impact of experience level and company size on salary?
- Which countries and work setups (remote vs on‑site) pay the most?


## 1. Imports & data loading 

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("seaborn-v0_8")
sns.set_palette("viridis")
plt.rcParams["figure.figsize"] = (10, 5)


## load CSV

In [None]:
import pandas as pd

df = pd.read_csv("/kaggle/input/data-science-salaries/salaries.csv")
df.head()

## 2. Data overview

First, let’s look at the shape, columns and basic statistics to understand what information we have in this dataset.


In [None]:
df.shape


In [None]:
df.columns


In [None]:
df.info()


In [None]:
df.describe(include="all").T


In [None]:
df.isna().sum()


## 3. Clean column names

In [None]:
# optional: make column names lower_snake_case
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df.head()


## 4. Key columns

Important columns in this dataset include:

- `work_year`: year in which the salary was paid
- `experience_level`: EN (Entry), MI (Mid), SE (Senior), EX (Executive)
- `employment_type`: FT (Full‑time), PT (Part‑time), CT (Contract), FL (Freelance)
- `job_title`: specific role (Data Scientist, ML Engineer, etc.)
- `salary_in_usd`: salary converted to USD
- `employee_residence`: country of the employee
- `remote_ratio`: 0 (on‑site), 50 (hybrid), 100 (fully remote)
- `company_location`: country of the company
- `company_size`: S, M, L


In [None]:
df[["work_year",
    "experience_level",
    "employment_type",
    "job_title",
    "salary_in_usd",
    "employee_residence",
    "remote_ratio",
    "company_location",
    "company_size"]].head()


## 5. Salary distribution & outliers

Let’s inspect how salaries are distributed overall and check for extreme outliers.


In [None]:
sns.histplot(df["salary_in_usd"], bins=40, kde=True)
plt.title("Salary in USD – Distribution")
plt.xlabel("Salary (USD)")
plt.show()


In [None]:
sns.boxplot(x=df["salary_in_usd"])
plt.title("Salary in USD – Boxplot")
plt.xlabel("Salary (USD)")
plt.show()
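
The boxplot shows outliers visually. To put a number on them, one option is the usual 1.5 × IQR rule, sketched below. The helper name `iqr_outlier_mask` and the toy values are illustrative assumptions, not part of the dataset. In the notebook you would pass `df["salary_in_usd"]` instead.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy values for illustration only (not taken from the dataset)
toy = pd.Series([50_000, 60_000, 70_000, 80_000, 90_000, 400_000])
mask = iqr_outlier_mask(toy)
print(f"{mask.sum()} of {len(toy)} values flagged as outliers")
```

Flagged rows are worth inspecting before dropping them, since very high salaries may be genuine.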


## 6. Salaries by experience level

How do salaries change from entry‑level to senior and executive roles?


In [None]:
exp_order = ["EN", "MI", "SE", "EX"]  # adjust if dataset uses these codes

exp_salary = (
    df.groupby("experience_level")["salary_in_usd"]
    .median()
    .reindex(exp_order)
)

exp_salary.plot(kind="bar")
plt.title("Median Salary by Experience Level")
plt.ylabel("Median Salary (USD)")
plt.xlabel("Experience Level")
plt.xticks(rotation=0)
plt.show()


## 7. Salaries by job title (top 10)

Now we compare median salaries for the most common job titles in Data Science, AI and ML.


In [None]:
# top 10 most frequent job titles
top_titles = (
    df["job_title"]
    .value_counts()
    .head(10)
    .index
)

salary_by_title = (
    df[df["job_title"].isin(top_titles)]
    .groupby("job_title")["salary_in_usd"]
    .median()
    .sort_values()
)

salary_by_title.plot(kind="barh")
plt.title("Median Salary by Job Title (Top 10)")
plt.xlabel("Median Salary (USD)")
plt.ylabel("Job Title")
plt.show()
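
A median computed from only a few rows can be misleading, so it may help to report the group size next to each median. The sketch below does that with a named aggregation. `median_with_counts` is a hypothetical helper and the toy frame is made up for illustration. In the notebook you would call it as `median_with_counts(df, "job_title")`.

```python
import pandas as pd


def median_with_counts(df, group_col, value_col="salary_in_usd", top_n=10):
    """Median and row count per group, limited to the top_n most frequent groups."""
    summary = (
        df.groupby(group_col)[value_col]
        .agg(median="median", count="count")
        .sort_values("count", ascending=False)
        .head(top_n)
    )
    # Sort by median so the table matches the order of the bar chart
    return summary.sort_values("median")


# Toy frame for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "job_title": ["A", "A", "A", "B", "B", "C"],
    "salary_in_usd": [100, 120, 140, 200, 220, 90],
})
summary = median_with_counts(toy, "job_title", top_n=2)
print(summary)
```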


## 8. Salaries by country (top locations)

Among the 10 countries with the most records (by employee residence), which have the highest median salaries for data/AI roles?


In [None]:
top_countries = (
    df["employee_residence"]
    .value_counts()
    .head(10)
    .index
)

salary_by_country = (
    df[df["employee_residence"].isin(top_countries)]
    .groupby("employee_residence")["salary_in_usd"]
    .median()
    .sort_values()
)

salary_by_country.plot(kind="barh")
plt.title("Median Salary by Employee Residence (Top 10)")
plt.xlabel("Median Salary (USD)")
plt.ylabel("Country")
plt.show()


## 9. Remote work and salary

The `remote_ratio` column shows whether a job is on‑site, hybrid or fully remote. Let’s see how this relates to salary.


In [None]:
def map_remote(r):
    if r == 0:
        return "On-site"
    elif r == 50:
        return "Hybrid"
    elif r == 100:
        return "Fully remote"
    return "Other"

df["remote_type"] = df["remote_ratio"].apply(map_remote)

remote_salary = (
    df.groupby("remote_type")["salary_in_usd"]
    .median()
    .sort_values()
)

remote_salary.plot(kind="bar")
plt.title("Median Salary by Remote Type")
plt.ylabel("Median Salary (USD)")
plt.xlabel("Work Setup")
plt.xticks(rotation=0)
plt.show()
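
The same labelling as `map_remote` can also be written as a dictionary lookup with `Series.map`, which avoids a Python-level function call per row. This is an alternative sketch, not a required change. `REMOTE_LABELS` and `label_remote` are hypothetical names, and the toy values are made up.

```python
import pandas as pd

# Same mapping as map_remote: any unexpected value falls back to "Other"
REMOTE_LABELS = {0: "On-site", 50: "Hybrid", 100: "Fully remote"}


def label_remote(remote_ratio: pd.Series) -> pd.Series:
    """Map remote_ratio codes to readable labels."""
    return remote_ratio.map(REMOTE_LABELS).fillna("Other")


# Toy values for illustration only
toy = pd.Series([0, 50, 100, 25])
labels = label_remote(toy)
print(labels.tolist())
```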


## 10. Company size and salary

Pay scales can differ with company size, so we compare median salaries for small (S), medium (M) and large (L) companies.


In [None]:
size_order = ["S", "M", "L"]
size_salary = (
    df.groupby("company_size")["salary_in_usd"]
    .median()
    .reindex(size_order)
)

size_salary.plot(kind="bar")
plt.title("Median Salary by Company Size")
plt.ylabel("Median Salary (USD)")
plt.xlabel("Company Size")
plt.xticks(rotation=0)
plt.show()


## 11. Simple salary band classification (optional)

As a small ML exercise, we can split salaries into three bands (Low / Medium / High) at the 33rd and 66th percentiles and train a simple classifier to predict the band. This is optional but shows how to go from EDA to a basic model.


In [None]:
# create salary bands based on quantiles
q1, q2 = df["salary_in_usd"].quantile([0.33, 0.66])

def salary_band(s):
    if s <= q1:
        return "Low"
    elif s <= q2:
        return "Medium"
    else:
        return "High"

df["salary_band"] = df["salary_in_usd"].apply(salary_band)
df["salary_band"].value_counts()
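
The same quantile bands can be produced in one call with `pd.qcut`, which returns an ordered categorical so that Low < Medium < High is preserved in plots and tables. The sketch below assumes the same 33% / 66% cut points as the `salary_band` function. `salary_bands` is a hypothetical helper and the toy values are made up. Note that `pd.qcut` raises an error if the cut points coincide, which can happen with heavily repeated salaries.

```python
import pandas as pd


def salary_bands(salary: pd.Series, labels=("Low", "Medium", "High")) -> pd.Series:
    """Cut salaries into three bands at the 33rd and 66th percentiles."""
    return pd.qcut(salary, q=[0, 0.33, 0.66, 1.0], labels=list(labels))


# Toy values for illustration only (not taken from the dataset)
toy = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90])
bands = salary_bands(toy)
print(bands.value_counts().sort_index())
```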


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

# choose a few useful features
features = ["experience_level", "employment_type", "job_title",
            "employee_residence", "remote_type", "company_size"]

X = df[features]
y = df["salary_band"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

cat_features = features

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features)
    ]
)

model = RandomForestClassifier(
    n_estimators=200, random_state=42, n_jobs=-1
)

clf = Pipeline(steps=[("preprocess", preprocess),
                     ("model", model)])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
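
To judge whether the scores in the classification report are good, it helps to compare them with a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier`, which always predicts the most frequent class in the training labels. `baseline_accuracy` is a hypothetical helper and the toy labels are made up. In the notebook you would call `baseline_accuracy(y_train, y_test)`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score


def baseline_accuracy(y_train, y_test) -> float:
    """Accuracy of always predicting the most frequent training class."""
    dummy = DummyClassifier(strategy="most_frequent")
    # DummyClassifier ignores the features, so a placeholder column is enough
    dummy.fit(np.zeros((len(y_train), 1)), y_train)
    y_pred = dummy.predict(np.zeros((len(y_test), 1)))
    return accuracy_score(y_test, y_pred)


# Toy labels for illustration only
toy_train = ["Low", "Low", "Medium", "High"]
toy_test = ["Low", "Medium", "High", "Low"]
score = baseline_accuracy(toy_train, toy_test)
print(f"Baseline accuracy: {score:.2f}")
```

With three bands of roughly equal size, this baseline should sit near one third, so the Random Forest is only useful to the extent that it beats that figure.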


In [None]:
cm = confusion_matrix(y_test, y_pred, labels=["Low", "Medium", "High"])
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Low", "Medium", "High"],
            yticklabels=["Low", "Medium", "High"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix – Salary Band Classifier")
plt.show()


## 12. Key insights & conclusions

- Senior and executive roles have significantly higher median salaries than entry‑level and mid‑level positions.
- Among the top 10 job titles, roles such as `<replace_with_top_role>` show the highest median salaries.
- Certain countries (e.g., `<country_1>`, `<country_2>`) offer higher pay for data and AI roles compared to others.
- Fully remote or hybrid roles may offer competitive or higher salaries than strictly on‑site positions.
- A simple Random Forest model can reasonably classify salary bands (Low/Medium/High) based on features like experience level, job title, location and company size.

**Next steps:**
- Perform more detailed feature importance analysis.
- Try other models (Gradient Boosting, XGBoost).
- Build an interactive dashboard or deploy a small app that lets users explore salary ranges by role, level and country.
