
# Data Science Job Salaries Analysis



### Column Descriptions

- **work_year**: The year in which the data was recorded.
- **experience_level**: The level of experience required for the job. Categories include:
  - 'EN' - Entry-level
  - 'MI' - Mid-level
  - 'SE' - Senior-level
  - 'EX' - Executive-level
- **employment_type**: The nature of employment. Types include:
  - 'FT' - Full-time
  - 'PT' - Part-time
  - 'CT' - Contract
  - 'FL' - Freelance
- **job_title**: The title of the job, e.g., 'Data Scientist', 'ML Engineer'.
- **salary**: The salary amount in the specified currency.
- **salary_currency**: The currency in which the salary is paid, e.g., USD, EUR.
- **salary_in_usd**: The salary converted into USD for standardization purposes.
- **employee_residence**: The country or region where the employee resides.
- **remote_ratio**: Indicates the extent to which a job is remote, with values like 0 (non-remote), 50 (partially remote), and 100 (fully remote).
- **company_location**: The location of the company offering the job.
- **company_size**: The size of the company, categorized as:
  - 'S' - Small (1-50 employees)
  - 'M' - Medium (51-250 employees)
  - 'L' - Large (251+ employees)
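
The coded columns above can be turned into readable labels with plain dictionary lookups. The following is a minimal sketch, not part of the original notebook: the helper name `add_readable_labels`, the `*_label` column names, and the three sample rows are assumptions made for illustration, while the code-to-label mappings are copied from the list above.

```python
import pandas as pd

# Label maps copied from the column descriptions above.
EXPERIENCE_LEVELS = {"EN": "Entry-level", "MI": "Mid-level",
                     "SE": "Senior-level", "EX": "Executive-level"}
EMPLOYMENT_TYPES = {"FT": "Full-time", "PT": "Part-time",
                    "CT": "Contract", "FL": "Freelance"}
COMPANY_SIZES = {"S": "Small", "M": "Medium", "L": "Large"}
REMOTE_RATIOS = {0: "Non-remote", 50: "Partially remote", 100: "Fully remote"}


def add_readable_labels(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with a *_label column next to each coded column."""
    out = frame.copy()
    out["experience_label"] = out["experience_level"].map(EXPERIENCE_LEVELS)
    out["employment_label"] = out["employment_type"].map(EMPLOYMENT_TYPES)
    out["company_size_label"] = out["company_size"].map(COMPANY_SIZES)
    out["remote_label"] = out["remote_ratio"].map(REMOTE_RATIOS)
    return out


# Hand-made sample rows (hypothetical, not read from ds_salaries.csv).
sample = pd.DataFrame({
    "experience_level": ["SE", "MI", "EN"],
    "employment_type": ["FT", "CT", "PT"],
    "remote_ratio": [100, 0, 50],
    "company_size": ["L", "S", "M"],
})

labeled = add_readable_labels(sample)
print(labeled[["experience_label", "employment_label",
               "company_size_label", "remote_label"]])
```

Codes that are missing from a dictionary map to `NaN`, which makes unexpected category values easy to spot.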


In [7]:
from IPython.display import clear_output

def clean_notebook():
    """Clear the cell output (here, the pip install logs)."""
    clear_output(wait=True)
    print("Notebook cleaned.")

# Install the packages used in this notebook
!pip install openai
!pip install python-dotenv
!pip install -U langchain-openai
!pip install langchain-experimental
!pip install tabulate

# Clear the installation logs from the cell output
clean_notebook()

Notebook cleaned.


| Question                                                                 | Pandas Code                                           |
|--------------------------------------------------------------------------|-------------------------------------------------------|
| How many unique job titles are present in the dataset?                   | `df['job_title'].nunique()`                           |
| What is the average salary in USD for all job titles?                    | `df['salary_in_usd'].mean()`                          |
| Which job title has the highest average salary?                          | `df.groupby('job_title')['salary_in_usd'].mean().idxmax()` |
| What is the median salary in USD for data scientists?                    | `df[df['job_title'] == 'Data Scientist']['salary_in_usd'].median()` |
| How many different employment types are represented in the dataset?      | `df['employment_type'].nunique()`                     |
| What is the average salary difference between remote and non-remote jobs?| `df[df['remote_ratio'] == 100]['salary_in_usd'].mean() - df[df['remote_ratio'] != 100]['salary_in_usd'].mean()` |
| How many employees are located in 'CA' (Canada)?                         | `df[df['employee_residence'] == 'CA'].shape[0]`       |
| What is the most common company size?                                    | `df['company_size'].mode()[0]`                        |
| How many employees work in large-sized companies?                        | `df[df['company_size'] == 'L'].shape[0]`              |
| Find the maximum salary in USD for 'Data Engineer' positions.            | `df[df['job_title'] == 'Data Engineer']['salary_in_usd'].max()` |
| What is the average salary in EUR?                                       | `df[df['salary_currency'] == 'EUR']['salary'].mean()` |
| Which country has the most remote jobs?                                  | `df[df['remote_ratio'] == 100]['employee_residence'].value_counts().idxmax()` |
| How many job titles have an average salary above $100,000?               | `(df.groupby('job_title')['salary_in_usd'].mean() > 100000).sum()` |
| What is the most common employee residence country?                      | `df['employee_residence'].mode()[0]`                  |
| Find the average salary for each company size category.                  | `df.groupby('company_size')['salary_in_usd'].mean()`  |
| Which job title has the most employees?                                  | `df['job_title'].value_counts().idxmax()`             |
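
The reference expression for the remote vs. non-remote question compares `remote_ratio == 100` against every other row, so partially remote jobs (`remote_ratio == 50`) are counted as non-remote. The sketch below shows both readings side by side. The function name `remote_salary_gap`, its `strict` flag, and the toy rows are assumptions for illustration only; whether the strict reading is the intended one is a judgment call that the table does not settle.

```python
import pandas as pd


def remote_salary_gap(frame: pd.DataFrame, strict: bool = True) -> float:
    """Mean salary_in_usd of fully remote rows minus that of non-remote rows.

    strict=True  compares remote_ratio == 100 against remote_ratio == 0.
    strict=False reproduces the table's expression (100 vs. everything else).
    """
    remote = frame.loc[frame["remote_ratio"] == 100, "salary_in_usd"]
    if strict:
        other = frame.loc[frame["remote_ratio"] == 0, "salary_in_usd"]
    else:
        other = frame.loc[frame["remote_ratio"] != 100, "salary_in_usd"]
    return float(remote.mean() - other.mean())


# Toy rows (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    "remote_ratio": [100, 100, 50, 0],
    "salary_in_usd": [120_000, 100_000, 60_000, 90_000],
})

print(remote_salary_gap(toy, strict=True))   # 110000 - 90000 = 20000.0
print(remote_salary_gap(toy, strict=False))  # 110000 - 75000 = 35000.0
```

On the toy data the two readings differ by 15,000 USD, which is why it helps to state the definition of "non-remote" in the question itself.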


In [27]:
import pandas as pd

# Creating a DataFrame with the questions and corresponding Pandas code
questions_and_code_df = pd.DataFrame({
    "Question": [
        "How many unique job titles are present in the dataset?",
        "What is the average salary in USD for all job titles?",
        "Which job title has the highest average salary?",
        "What is the median salary in USD for data scientists?",
        "How many different employment types are represented in the dataset?",
        "What is the average salary difference between remote and non-remote jobs?",
        "How many employees are located in 'CA' (Canada)?",
        "What is the most common company size?",
        "How many employees work in large-sized companies?",
        "Find the maximum salary in USD for 'Data Engineer' positions.",
        "What is the average salary in EUR?",
        "Which country has the most remote jobs?",
        "How many job titles have an average salary above $100,000?",
        "What is the most common employee residence country?",
        "Find the average salary for each company size category.",
        "Which job title has the most employees?"
    ],
    "Pandas Code": [
        "df['job_title'].nunique()",
        "df['salary_in_usd'].mean()",
        "df.groupby('job_title')['salary_in_usd'].mean().idxmax()",
        "df[df['job_title'] == 'Data Scientist']['salary_in_usd'].median()",
        "df['employment_type'].nunique()",
        "df[df['remote_ratio'] == 100]['salary_in_usd'].mean() - df[df['remote_ratio'] != 100]['salary_in_usd'].mean()",
        "df[df['employee_residence'] == 'CA'].shape[0]",
        "df['company_size'].mode()[0]",
        "df[df['company_size'] == 'L'].shape[0]",
        "df[df['job_title'] == 'Data Engineer']['salary_in_usd'].max()",
        "df[df['salary_currency'] == 'EUR']['salary'].mean()",
        "df[df['remote_ratio'] == 100]['employee_residence'].value_counts().idxmax()",
        "(df.groupby('job_title')['salary_in_usd'].mean() > 100000).sum()",
        "df['employee_residence'].mode()[0]",
        "df.groupby('company_size')['salary_in_usd'].mean()",
        "df['job_title'].value_counts().idxmax()"
    ]
})

questions_and_code_df

|    | Question | Pandas Code |
|----|----------|-------------|
| 0 | How many unique job titles are present in the ... | `df['job_title'].nunique()` |
| 1 | What is the average salary in USD for all job ... | `df['salary_in_usd'].mean()` |
| 2 | Which job title has the highest average salary? | `df.groupby('job_title')['salary_in_usd'].mean(...` |
| 3 | What is the median salary in USD for data scie... | `df[df['job_title'] == 'Data Scientist']['salar...` |
| 4 | How many different employment types are repres... | `df['employment_type'].nunique()` |
| 5 | What is the average salary difference between ... | `df[df['remote_ratio'] == 100]['salary_in_usd']...` |
| 6 | How many employees are located in 'CA' (Canada)? | `df[df['employee_residence'] == 'CA'].shape[0]` |
| 7 | What is the most common company size? | `df['company_size'].mode()[0]` |
| 8 | How many employees work in large-sized companies? | `df[df['company_size'] == 'L'].shape[0]` |
| 9 | Find the maximum salary in USD for 'Data Engin... | `df[df['job_title'] == 'Data Engineer']['salary...` |


In [28]:
from dotenv import load_dotenv

# Load environment variables (e.g. OPENAI_API_KEY) from a local .env file
load_dotenv()

True

In [29]:
import pandas as pd

df = pd.read_csv("./ds_salaries.csv")
df.head()

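|    | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|----|-----------|------------------|-----------------|-----------|--------|-----------------|---------------|--------------------|--------------|------------------|--------------|
| 0 | 2023 | SE | FT | Principal Data Scientist | 80000 | EUR | 85847 | ES | 100 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30000 | USD | 30000 | US | 100 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25500 | USD | 25500 | US | 100 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |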


In [30]:
import pandas as pd
from langchain_openai import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Initialize the language model
llm = OpenAI(temperature=0)

template_string = """Given the following question about a pandas DataFrame and the DataFrame information provided, generate only  pandas code without any description to answer it. 
Do not execute the code, just provide it.

Question: {question}

DataFrame Info:
{df_info}

Pandas code:
"""


# Create a prompt template
prompt_template = PromptTemplate(
    input_variables=["question", "df_info"],
    template=template_string
)

In [31]:

def get_dataframe_info(df):
    """Get basic information about the DataFrame."""
    info = f"Columns: {', '.join(df.columns)}\n"
    info += f"Shape: {df.shape}\n"
    info += "Data types:\n"
    for col, dtype in df.dtypes.items():
        info += f"  {col}: {dtype}\n"
    info += f"First few rows:\n{df.head().to_string()}"
    return info

print(get_dataframe_info(df))

Columns: work_year, experience_level, employment_type, job_title, salary, salary_currency, salary_in_usd, employee_residence, remote_ratio, company_location, company_size
Shape: (3755, 11)
Data types:
  work_year: int64
  experience_level: object
  employment_type: object
  job_title: object
  salary: int64
  salary_currency: object
  salary_in_usd: int64
  employee_residence: object
  remote_ratio: int64
  company_location: object
  company_size: object
First few rows:
   work_year experience_level employment_type                 job_title  salary salary_currency  salary_in_usd employee_residence  remote_ratio company_location company_size
0       2023               SE              FT  Principal Data Scientist   80000             EUR          85847                 ES           100               ES            L
1       2023               MI              CT               ML Engineer   30000             USD          30000                 US           100               US            S
2  

In [24]:
results = prompt_template.format(question="How many unique job titles are present in the dataset?", df_info=get_dataframe_info(df))

print(results)

Given the following question about a pandas DataFrame and the DataFrame information provided, generate only  pandas code without any description to answer it. 
Do not execute the code, just provide it.

Question: How many unique job titles are present in the dataset?

DataFrame Info:
Columns: work_year, experience_level, employment_type, job_title, salary, salary_currency, salary_in_usd, employee_residence, remote_ratio, company_location, company_size
Shape: (3755, 11)
Data types:
  work_year: int64
  experience_level: object
  employment_type: object
  job_title: object
  salary: int64
  salary_currency: object
  salary_in_usd: int64
  employee_residence: object
  remote_ratio: int64
  company_location: object
  company_size: object
First few rows:
   work_year experience_level employment_type                 job_title  salary salary_currency  salary_in_usd employee_residence  remote_ratio company_location company_size
0       2023               SE              FT  Principal Data Scie

In [32]:


# Create an LLMChain that fills the prompt template and sends it to the model
chain = LLMChain(llm=llm, prompt=prompt_template)

def query_dataframe(question):
    """Ask the model for pandas code that answers `question` about `df`."""
    # Describe the DataFrame (uses get_dataframe_info defined in the earlier cell)
    df_info = get_dataframe_info(df)

    # Generate pandas code using LangChain
    response = chain.invoke({"question": question, "df_info": df_info})

    # The chain returns a dict; the generated code is stored under 'text'
    return response['text']

# Example usage
question = "What is the median salary in USD for data scientists?"
pandas_code = query_dataframe(question)

pandas_code

"df[df['job_title'] == 'Data Scientist']['salary_in_usd'].median()"
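
The chain returns the generated code as a plain string, so nothing has been executed yet. One way to run such a string is Python's built-in `eval`, sketched below under stated assumptions: the helper `run_generated_code` and the toy DataFrame are hypothetical and not part of the notebook, and removing the builtins only narrows what the expression can reach. It is not a real sandbox, so this should only be used on output you are willing to execute.

```python
import pandas as pd


def run_generated_code(code: str, frame: pd.DataFrame):
    """Evaluate one generated pandas expression against `frame`.

    Returns (result, None) on success or (None, error_message) on failure.
    """
    expression = code.strip()  # model output may start with a newline
    # Only `df` and `pd` are visible to the expression; builtins are removed.
    # This limits accidental damage but is NOT a security boundary.
    namespace = {"__builtins__": {}, "df": frame, "pd": pd}
    try:
        return eval(expression, namespace), None
    except Exception as exc:  # also catches SyntaxError raised for statements
        return None, f"{type(exc).__name__}: {exc}"


# Toy rows (hypothetical values, not taken from ds_salaries.csv).
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Data Engineer"],
    "salary_in_usd": [100_000, 140_000, 90_000],
})

generated = "\ndf[df['job_title'] == 'Data Scientist']['salary_in_usd'].median()"
result, error = run_generated_code(generated, toy)
print(result, error)  # 120000.0 None

# An assignment is a statement, not an expression, so eval rejects it.
result2, error2 = run_generated_code("df['x'] = 1", toy)
print(result2, error2)
```

Returning the error message instead of raising keeps a batch run going when one generated snippet is not a valid expression.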

In [36]:
from tqdm import tqdm

output_code = []

for question in tqdm(questions_and_code_df["Question"]):
    pandas_code = query_dataframe(question)
    output_code.append(pandas_code)
    print(f"Question: {question}\nPandas code: {pandas_code}\n")



Question: How many unique job titles are present in the dataset?
Pandas code: df['job_title'].nunique()

Question: What is the average salary in USD for all job titles?
Pandas code: df['salary_in_usd'].mean()

Question: Which job title has the highest average salary?
Pandas code: df.groupby('job_title')['salary'].mean().idxmax()

Question: What is the median salary in USD for data scientists?
Pandas code: df[df['job_title'] == 'Data Scientist']['salary_in_usd'].median()

Question: How many different employment types are represented in the dataset?
Pandas code: 
df['employment_type'].nunique()

Question: What is the average salary difference between remote and non-remote jobs?
Pandas code: df['salary_difference'] = df[df['remote_ratio'] == 100]['salary_in_usd'].mean() - df[df['remote_ratio'] != 100]['salary_in_usd'].mean()

Question: How many employees are located in 'CA' (Canada)?
Pandas code: df[df['employee_residence'] == 'CA'].shape[0]

Question: What is the most common company size?
Pandas code: df['company_size'].value_counts().idxmax()

Question: How many employees work in large-sized companies?
Pandas code: df[df['company_size'] == 'L'].shape[0]

Question: Find the maximum salary in USD for 'Data Engineer' positions.
Pandas code: 
df[df['job_title'] == 'Data Engineer']['salary_in_usd'].max()

Question: What is the average salary in EUR?
Pandas code: df['salary'].mean()

Question: Which country has the most remote jobs?
Pandas code: df.groupby('employee_residence')['remote_ratio'].sum().idxmax()

Question: How many job titles have an average salary above $100,000?
Pandas code: df[df['salary_in_usd'] > 100000]['job_title'].nunique()

Question: What is the most common employee residence country?
Pandas code: df['employee_residence'].value_counts().idxmax()

Question: Find the average salary for each company size category.
Pandas code: df.groupby('company_size')['salary_in_usd'].mean()

Question: Which job title has the most employees?
Pandas code: df['job_title'].value_counts().idxmax()

100%|██████████| 16/16 [00:10<00:00,  1.56it/s]






In [37]:
questions_and_code_df["estimated Code"] = output_code

In [38]:
questions_and_code_df

|    | Question | Pandas Code | estimated Code |
|----|----------|-------------|----------------|
| 0 | How many unique job titles are present in the ... | `df['job_title'].nunique()` | `df['job_title'].nunique()` |
| 1 | What is the average salary in USD for all job ... | `df['salary_in_usd'].mean()` | `df['salary_in_usd'].mean()` |
| 2 | Which job title has the highest average salary? | `df.groupby('job_title')['salary_in_usd'].mean(...` | `df.groupby('job_title')['salary'].mean().idxmax()` |
| 3 | What is the median salary in USD for data scie... | `df[df['job_title'] == 'Data Scientist']['salar...` | `df[df['job_title'] == 'Data Scientist']['salar...` |
| 4 | How many different employment types are repres... | `df['employment_type'].nunique()` | `\ndf['employment_type'].nunique()` |
| 5 | What is the average salary difference between ... | `df[df['remote_ratio'] == 100]['salary_in_usd']...` | `df['salary_difference'] = df[df['remote_ratio'...` |
| 6 | How many employees are located in 'CA' (Canada)? | `df[df['employee_residence'] == 'CA'].shape[0]` | `df[df['employee_residence'] == 'CA'].shape[0]` |
| 7 | What is the most common company size? | `df['company_size'].mode()[0]` | `df['company_size'].value_counts().idxmax()` |
| 8 | How many employees work in large-sized companies? | `df[df['company_size'] == 'L'].shape[0]` | `df[df['company_size'] == 'L'].shape[0]` |
| 9 | Find the maximum salary in USD for 'Data Engin... | `df[df['job_title'] == 'Data Engineer']['salary...` | `\ndf[df['job_title'] == 'Data Engineer']['sala...` |
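
Comparing the two code columns as strings can be misleading in both directions. Row 7 differs textually (`mode()[0]` vs. `value_counts().idxmax()`) although both expressions typically give the same answer, rows 4 and 9 differ only by a leading `\n`, and row 2 looks similar but reads `salary` instead of `salary_in_usd`. A possible alternative is to evaluate both expressions and compare the results. The sketch below assumes hypothetical helpers `evaluate` and `same_result` and uses made-up toy rows; it illustrates the idea and is not the notebook's evaluation method.

```python
import pandas as pd


def evaluate(code: str, frame: pd.DataFrame):
    """Evaluate a pandas expression; return the exception instead of raising."""
    try:
        # Builtins are removed to limit reach; this is not a real sandbox.
        return eval(code.strip(), {"__builtins__": {}, "df": frame, "pd": pd})
    except Exception as exc:
        return exc


def same_result(expected, actual) -> bool:
    """True when both evaluations succeeded and produced equal results."""
    if isinstance(expected, Exception) or isinstance(actual, Exception):
        return False
    if isinstance(expected, (pd.Series, pd.DataFrame)):
        return type(expected) is type(actual) and expected.equals(actual)
    if isinstance(actual, (pd.Series, pd.DataFrame)):
        return False
    return bool(expected == actual)


# Toy rows (hypothetical): the first salary is in a local currency, so the
# `salary` and `salary_in_usd` columns rank the job titles differently.
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "ML Engineer"],
    "salary": [9_000_000, 100_000, 200_000],
    "salary_in_usd": [110_000, 100_000, 200_000],
    "company_size": ["M", "M", "L"],
})

# (reference, generated) pairs modeled on rows 7, 2 and 4 of the table above.
pairs = [
    ("df['company_size'].mode()[0]",
     "df['company_size'].value_counts().idxmax()"),
    ("df.groupby('job_title')['salary_in_usd'].mean().idxmax()",
     "df.groupby('job_title')['salary'].mean().idxmax()"),
    ("df['job_title'].nunique()",
     "\ndf['job_title'].nunique()"),
]

report = []
for reference, generated in pairs:
    exact_match = reference == generated
    equivalent = same_result(evaluate(reference, toy), evaluate(generated, toy))
    report.append((exact_match, equivalent))
    print(f"exact={exact_match} equivalent={equivalent} :: {generated.strip()}")
```

On the toy data none of the three pairs match exactly, yet the first and third are equivalent while the second is not. Equal results on one dataset do not prove that two expressions are equivalent in general, so this check is best read as a lower bar than a full review.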
