
# Data Science Job Salaries Analysis



### Column Descriptions

- **work_year**: The year in which the salary was paid.
- **experience_level**: The level of experience required for the job. Categories include:
  - 'EN' - Entry-level
  - 'MI' - Mid-level
  - 'SE' - Senior-level
  - 'EX' - Executive-level
- **employment_type**: The nature of employment. Types include:
  - 'FT' - Full-time
  - 'PT' - Part-time
  - 'CT' - Contract
  - 'FL' - Freelance
- **job_title**: The title of the job, e.g., 'Data Scientist', 'ML Engineer'.
- **salary**: The salary amount in the specified currency.
- **salary_currency**: The currency in which the salary is paid, e.g., USD, EUR.
- **salary_in_usd**: The salary converted into USD for standardization purposes.
- **employee_residence**: The country where the employee resides, given as a two-letter country code, e.g., 'US', 'CA'.
- **remote_ratio**: Indicates the extent to which a job is remote, with values like 0 (non-remote), 50 (partially remote), and 100 (fully remote).
- **company_location**: The country where the company offering the job is located, given as a two-letter country code, e.g., 'US', 'ES'.
- **company_size**: The size of the company, categorized as:
  - 'S' - Small (1-50 employees)
  - 'M' - Medium (51-250 employees)
  - 'L' - Large (251+ employees)
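
The coded columns are easier to read in reports once the codes are translated back to their labels. The sketch below is one possible way to do that with plain dictionaries built from the descriptions above; `CODE_MAPS` and `decode_record` are hypothetical names introduced for illustration and are not part of the dataset or the notebook.

```python
# Code-to-label mappings copied from the column descriptions above.
EXPERIENCE_LEVELS = {"EN": "Entry-level", "MI": "Mid-level",
                     "SE": "Senior-level", "EX": "Executive-level"}
EMPLOYMENT_TYPES = {"FT": "Full-time", "PT": "Part-time",
                    "CT": "Contract", "FL": "Freelance"}
COMPANY_SIZES = {"S": "Small", "M": "Medium", "L": "Large"}
REMOTE_RATIOS = {0: "Non-remote", 50: "Partially remote", 100: "Fully remote"}

# Hypothetical lookup: column name -> mapping for that column.
CODE_MAPS = {
    "experience_level": EXPERIENCE_LEVELS,
    "employment_type": EMPLOYMENT_TYPES,
    "company_size": COMPANY_SIZES,
    "remote_ratio": REMOTE_RATIOS,
}


def decode_record(record):
    """Return a copy of `record` with known codes replaced by readable labels."""
    decoded = dict(record)
    for column, mapping in CODE_MAPS.items():
        if column in decoded:
            # Unknown codes are kept as-is rather than dropped.
            decoded[column] = mapping.get(decoded[column], decoded[column])
    return decoded


row = {"experience_level": "SE", "employment_type": "FT",
       "remote_ratio": 100, "company_size": "L"}
print(decode_record(row))
```

With a DataFrame, the same dictionaries can be passed to `Series.map`, for example `df['experience_level'].map(EXPERIENCE_LEVELS)`; note that `map` yields `NaN` for codes missing from the dictionary.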


In [7]:
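from IPython.display import clear_output

def clean_notebook():
    """Clear the cell output, e.g. the pip install logs."""
    clear_output(wait=True)
    print("Notebook cleaned.")

# Install the packages used in this notebook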
!pip install openai
!pip install python-dotenv
!pip install -U langchain-openai
!pip install langchain-experimental
!pip install tabulate


# Clean up the notebook
clean_notebook()

Notebook cleaned.


| Question                                                                 | Pandas Code                                           |
|--------------------------------------------------------------------------|-------------------------------------------------------|
| How many unique job titles are present in the dataset?                   | `df['job_title'].nunique()`                           |
| What is the average salary in USD for all job titles?                    | `df['salary_in_usd'].mean()`                          |
| Which job title has the highest average salary?                          | `df.groupby('job_title')['salary_in_usd'].mean().idxmax()` |
| What is the median salary in USD for data scientists?                    | `df[df['job_title'] == 'Data Scientist']['salary_in_usd'].median()` |
| How many different employment types are represented in the dataset?      | `df['employment_type'].nunique()`                     |
| What is the average salary difference between remote and non-remote jobs?| `df[df['remote_ratio'] == 100]['salary_in_usd'].mean() - df[df['remote_ratio'] != 100]['salary_in_usd'].mean()` |
| How many employees are located in 'CA' (Canada)?                         | `df[df['employee_residence'] == 'CA'].shape[0]`       |
| What is the most common company size?                                    | `df['company_size'].mode()[0]`                        |
| How many employees work in large-sized companies?                        | `df[df['company_size'] == 'L'].shape[0]`              |
| Find the maximum salary in USD for 'Data Engineer' positions.            | `df[df['job_title'] == 'Data Engineer']['salary_in_usd'].max()` |
| What is the average salary in EUR?                                       | `df[df['salary_currency'] == 'EUR']['salary'].mean()` |
| Which country has the most remote jobs?                                  | `df[df['remote_ratio'] == 100]['employee_residence'].value_counts().idxmax()` |
| How many job titles have an average salary above $100,000?               | `(df.groupby('job_title')['salary_in_usd'].mean() > 100000).sum()` |
| What is the most common employee residence country?                      | `df['employee_residence'].mode()[0]`                  |
| Find the average salary for each company size category.                  | `df.groupby('company_size')['salary_in_usd'].mean()`  |
| Which job title has the most employees?                                  | `df['job_title'].value_counts().idxmax()`             |
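
One row in the table deserves a closer look: the "remote vs. non-remote" snippet compares `remote_ratio == 100` against `remote_ratio != 100`, so partially remote jobs (50) are counted on the non-remote side. The sketch below uses a small made-up DataFrame (not the real dataset) to show how that choice changes the result compared with a strict 100-versus-0 comparison.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "remote_ratio": [0, 0, 50, 100, 100],
    "salary_in_usd": [100_000, 120_000, 60_000, 130_000, 150_000],
})

fully_remote = toy.loc[toy["remote_ratio"] == 100, "salary_in_usd"].mean()
not_fully_remote = toy.loc[toy["remote_ratio"] != 100, "salary_in_usd"].mean()
on_site = toy.loc[toy["remote_ratio"] == 0, "salary_in_usd"].mean()

# Same comparison as the table row: hybrid jobs fall into the second group.
diff_as_in_table = fully_remote - not_fully_remote
# Stricter reading of "non-remote": only remote_ratio == 0.
diff_strict = fully_remote - on_site

print(diff_as_in_table, diff_strict)

# Per-level means make the three groups visible at once.
print(toy.groupby("remote_ratio")["salary_in_usd"].mean())
```

Which definition is appropriate depends on the question being asked; the point is only that the two readings can give different numbers.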


In [1]:
import os
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

True
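
The `True` above only says that `load_dotenv()` set at least one variable; it does not confirm that the key the OpenAI client looks for (`OPENAI_API_KEY`) is present. A minimal sketch of an explicit check follows. `require_env` and `mask_secret` are hypothetical helpers, and `DEMO_API_KEY` is a stand-in variable so the example runs without a real key.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value


def mask_secret(value, visible=4):
    """Hide all but the last `visible` characters so the value can be printed safely."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Stand-in variable for demonstration; a real notebook would check OPENAI_API_KEY.
os.environ["DEMO_API_KEY"] = "sk-demo-1234"
print(mask_secret(require_env("DEMO_API_KEY")))
```

In the notebook, calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` would surface a missing key before any model call is made.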

In [2]:
import pandas as pd

df = pd.read_csv("./ds_salaries.csv")
df.head()

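|   | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|-----------|------------------|-----------------|-----------|--------|-----------------|---------------|--------------------|--------------|------------------|--------------|
| 0 | 2023 | SE | FT | Principal Data Scientist | 80000 | EUR | 85847 | ES | 100 | ES | L |
| 1 | 2023 | MI | CT | ML Engineer | 30000 | USD | 30000 | US | 100 | US | S |
| 2 | 2023 | MI | CT | ML Engineer | 25500 | USD | 25500 | US | 100 | US | S |
| 3 | 2023 | SE | FT | Data Scientist | 175000 | USD | 175000 | CA | 100 | CA | M |
| 4 | 2023 | SE | FT | Data Scientist | 120000 | USD | 120000 | CA | 100 | CA | M |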


In [17]:
import pandas as pd
from langchain_openai import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Initialize the language model
llm = OpenAI(temperature=0)

# Create a prompt template
prompt_template = PromptTemplate(
    input_variables=["question", "df_info"],
    template="""Given the following question about a pandas DataFrame and the DataFrame information provided, generate only pandas code without any description to answer it.
Do not execute the code, just provide it.

Question: {question}

DataFrame Info:
{df_info}

Pandas code:
"""
)

In [21]:

def get_dataframe_info(df):
    """Get basic information about the DataFrame."""
    info = f"Columns: {', '.join(df.columns)}\n"
    info += f"Shape: {df.shape}\n"
    info += "Data types:\n"
    for col, dtype in df.dtypes.items():
        info += f"  {col}: {dtype}\n"
    info += f"First few rows:\n{df.head().to_string()}"
    return info

print(get_dataframe_info(df))

Columns: work_year, experience_level, employment_type, job_title, salary, salary_currency, salary_in_usd, employee_residence, remote_ratio, company_location, company_size, age
Shape: (3755, 12)
Data types:
  work_year: int64
  experience_level: object
  employment_type: object
  job_title: object
  salary: int64
  salary_currency: object
  salary_in_usd: int64
  employee_residence: object
  remote_ratio: int64
  company_location: object
  company_size: object
  age: int64
First few rows:
   work_year experience_level employment_type                 job_title  salary salary_currency  salary_in_usd employee_residence  remote_ratio company_location company_size  age
0       2023               SE              FT  Principal Data Scientist   80000             EUR          85847                 ES           100               ES            L    0
1       2023               MI              CT               ML Engineer   30000             USD          30000                 US           100      

In [19]:


# Create an LLMChain
chain = LLMChain(llm=llm, prompt=prompt_template)

def get_dataframe_info(df):
    """Get basic information about the DataFrame."""
    info = f"Columns: {', '.join(df.columns)}\n"
    info += f"Shape: {df.shape}\n"
    info += "Data types:\n"
    for col, dtype in df.dtypes.items():
        info += f"  {col}: {dtype}\n"
    info += f"First few rows:\n{df.head().to_string()}"
    return info

def query_dataframe(question):
    # Get DataFrame info
    df_info = get_dataframe_info(df)
    
    # Generate pandas code using LangChain
    response = chain.run(question=question, df_info=df_info)
    print(f"Generated code:\n{response}")
    
    return response

# Example usage
question = "What is the median salary in USD for data scientists?"
pandas_code = query_dataframe(question)

# print("\nTo execute this code and get the result, you can use:")
# print("exec(pandas_code)")
# print("This will create a 'result' variable with the output.")
# print("\nMake sure to review the code before executing it.")

pandas_code

Generated code:
df[df['job_title'] == 'Data Scientist']['salary_in_usd'].median()


"df[df['job_title'] == 'Data Scientist']['salary_in_usd'].median()"

In [9]:
pandas_code 

"df['age'] = 2023 - df['work_year']\ndf['age'].mean()"

In [10]:
exec(pandas_code)
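
Two things are worth noting about this cell. `exec` always returns `None`, so the value of the final expression (`df['age'].mean()`) is discarded, and the first generated statement adds an `age` column to `df` as a side effect. The sketch below shows one possible way to run generated code and still get the last expression's value back, using only the standard library. `run_generated` is a hypothetical helper, the example uses a plain list instead of a DataFrame so it runs anywhere, and it is not a sandbox: the code still executes with full privileges and should be reviewed first.

```python
import ast


def run_generated(code, namespace):
    """Execute `code` in `namespace`; return the value of its final expression, if any.

    Not a sandbox: the code runs with full privileges, so review it before calling.
    """
    tree = ast.parse(code, mode="exec")
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        # Split off the trailing expression so its value can be returned.
        last = ast.Expression(body=tree.body.pop().value)
        exec(compile(tree, "<generated>", "exec"), namespace)
        return eval(compile(last, "<generated>", "eval"), namespace)
    # No trailing expression: just run the statements.
    exec(compile(tree, "<generated>", "exec"), namespace)
    return None


# Stand-in for LLM output, shaped like the two-line snippet above.
generated = "ages = [2023 - y for y in work_years]\nsum(ages) / len(ages)"
namespace = {"work_years": [2020, 2021, 2023]}

result = run_generated(generated, namespace)
print(result)             # value of the last expression
print(namespace["ages"])  # side effects stay inside the namespace dict
```

In the notebook this could be called as `run_generated(pandas_code, {"df": df.copy()})`, which keeps generated column assignments away from the original DataFrame.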

In [5]:
import pandas as pd
from langchain_openai import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent

# Load your CSV file
df = pd.read_csv("./ds_salaries.csv")

# Initialize the language model
llm = OpenAI(temperature=0)

# Create a prompt template
prompt_template = PromptTemplate(
    input_variables=["question"],
    template="Given the following question about a pandas DataFrame, generate the appropriate pandas code to answer it: {question}"
)


In [6]:
agent = create_pandas_dataframe_agent(llm, df, verbose=True, allow_dangerous_code=True)

In [16]:

# Create an LLMChain
chain = LLMChain(llm=llm, prompt=prompt_template)

# Create a pandas DataFrame agent
agent = create_pandas_dataframe_agent(llm, df, verbose=True, allow_dangerous_code=True)

def query_dataframe(question):
    # Generate pandas code using LangChain
    response = chain.run(question=question)
    print(f"Generated code: {response}")
    
    # Use the pandas DataFrame agent to execute the query
    result = agent.run(response)
    return result

# Example usage
question = "What is the median salary in USD for data scientists?"
result = query_dataframe(question)
print(f"Result: {result}")

ValueError: Missing some input keys: {'df_info'}
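
The error means the chain's prompt declares a `df_info` variable that `chain.run(question=question)` never supplied. Since the execution counts show the cells were run out of order, a likely explanation (an assumption, not something the output confirms) is that the chain was built from the earlier two-variable template rather than the one-variable template in this section. The sketch below mimics that kind of check using only the standard library; it is not LangChain's implementation, and `template_fields` and `missing_inputs` are hypothetical helpers.

```python
from string import Formatter


def template_fields(template):
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_inputs(template, **inputs):
    """Return the placeholders in `template` that `inputs` does not cover."""
    return template_fields(template) - set(inputs)


# Shortened versions of the two templates defined in this notebook.
two_var_template = "Question: {question}\n\nDataFrame Info:\n{df_info}\n\nPandas code:\n"
one_var_template = "Generate the appropriate pandas code to answer: {question}"

print(missing_inputs(two_var_template, question="median salary?"))  # {'df_info'}
print(missing_inputs(one_var_template, question="median salary?"))  # set()
```

Two ways out, depending on intent: pass `df_info=get_dataframe_info(df)` to `chain.run` along with the question, or skip the chain and send the natural-language question straight to the agent with `agent.run(question)`.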

## Question: How many unique job titles are present in the dataset?

In [5]:
# Solution:
df['job_title'].nunique()

93

## Question: What is the average salary in USD for all job titles?

In [6]:
# Solution:
df['salary_in_usd'].mean()

137570.38988015978

## Question: Which job title has the highest average salary?

In [7]:
# Solution:
df.groupby('job_title')['salary_in_usd'].mean().idxmax()

'Data Science Tech Lead'

## Question: What is the median salary in USD for data scientists?

In [8]:
# Solution:
df[df['job_title'] == 'Data Scientist']['salary_in_usd'].median()

141525.0

## Question: How many jobs are fully remote (remote_ratio = 100)?

In [9]:
# Solution:
df[df['remote_ratio'] == 100].shape[0]

1643

## Question: Which company location has the highest number of employees?

In [10]:
# Solution:
df['company_location'].value_counts().idxmax()

'US'

## Question: What percentage of employees work in small-sized companies?

In [11]:
# Solution:
(df['company_size'] == 'S').mean() * 100

3.9414114513981358

## Question: What is the most common employment type?

In [12]:
# Solution:
df['employment_type'].mode()[0]

'FT'

## Question: Find the range of salaries in USD for machine learning engineers.

In [13]:
# Solution:
df[df['job_title'] == 'ML Engineer']['salary_in_usd'].agg(['min', 'max'])

min     15966
max    289076
Name: salary_in_usd, dtype: int64
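
The question asks about "machine learning engineers", but the filter matches only the exact title `'ML Engineer'`. If the dataset also contains a spelled-out variant such as `'Machine Learning Engineer'` (an assumption worth checking with `df['job_title'].unique()`), those rows are left out of the range. A small sketch on made-up data shows the difference between an exact match and `isin` with a list of accepted spellings.

```python
import pandas as pd

# Toy data for illustration only; titles and salaries are invented.
toy = pd.DataFrame({
    "job_title": ["ML Engineer", "Machine Learning Engineer",
                  "Data Engineer", "ML Engineer"],
    "salary_in_usd": [100_000, 200_000, 150_000, 50_000],
})

# Assumed spellings of the same role; adjust after inspecting the real titles.
ml_titles = ["ML Engineer", "Machine Learning Engineer"]

exact = toy.loc[toy["job_title"] == "ML Engineer", "salary_in_usd"].agg(["min", "max"])
combined = toy.loc[toy["job_title"].isin(ml_titles), "salary_in_usd"].agg(["min", "max"])

print(exact)
print(combined)
```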

## Question: How many employees work in medium-sized companies?

In [14]:
# Solution:
df[df['company_size'] == 'M'].shape[0]

3153

## Question: What is the average salary for 'Data Analyst' positions in the US?

In [15]:
# Solution:
df[(df['job_title'] == 'Data Analyst') & (df['employee_residence'] == 'US')]['salary_in_usd'].mean()

117505.387283237

## Question: Which experience level has the highest average salary?

In [16]:
# Solution:
df.groupby('experience_level')['salary_in_usd'].mean().idxmax()

'EX'

## Question: How many employees have 'EN' (Entry-level) experience?

In [17]:
# Solution:
df[df['experience_level'] == 'EN'].shape[0]

320

## Question: What is the total number of employees in the dataset?

In [18]:
# Solution:
df.shape[0]

3755

## Question: How many different salary currencies are used in the dataset?

In [19]:
# Solution:
df['salary_currency'].nunique()

20

## Question: What is the most frequent job title in the dataset?

In [20]:
# Solution:
df['job_title'].mode()[0]

'Data Engineer'

## Question: Find the average remote ratio for 'Senior-level' positions.

In [21]:
# Solution:
df[df['experience_level'] == 'SE']['remote_ratio'].mean()

45.07154213036566

## Question: Which job title has the lowest average salary?

In [22]:
# Solution:
df.groupby('job_title')['salary_in_usd'].mean().idxmin()

'Power BI Developer'

## Question: How many different employment types are represented in the dataset?

In [23]:
# Solution:
df['employment_type'].nunique()

4

## Question: What is the average salary difference between remote and non-remote jobs?

In [24]:
# Solution:
df[df['remote_ratio'] == 100]['salary_in_usd'].mean() - df[df['remote_ratio'] != 100]['salary_in_usd'].mean()

-1936.0599539022369

## Question: How many employees are located in 'CA' (Canada)?

In [25]:
# Solution:
df[df['employee_residence'] == 'CA'].shape[0]

85

## Question: What is the most common company size?

In [26]:
# Solution:
df['company_size'].mode()[0]

'M'

## Question: How many employees work in large-sized companies?

In [27]:
# Solution:
df[df['company_size'] == 'L'].shape[0]

454

## Question: Find the maximum salary in USD for 'Data Engineer' positions.

In [28]:
# Solution:
df[df['job_title'] == 'Data Engineer']['salary_in_usd'].max()

324000

## Question: What is the average salary in EUR?

In [29]:
# Solution:
df[df['salary_currency'] == 'EUR']['salary'].mean()

57174.063559322036

## Question: Which country has the most remote jobs?

In [30]:
# Solution:
df[df['remote_ratio'] == 100]['employee_residence'].value_counts().idxmax()

'US'

## Question: How many job titles have an average salary above $100,000?

In [31]:
# Solution:
(df.groupby('job_title')['salary_in_usd'].mean() > 100000).sum()

55

## Question: What is the most common employee residence country?

In [32]:
# Solution:
df['employee_residence'].mode()[0]

'US'

## Question: Find the average salary for each company size category.

In [33]:
# Solution:
df.groupby('company_size')['salary_in_usd'].mean()

company_size
L    118300.982379
M    143130.548367
S     78226.682432
Name: salary_in_usd, dtype: float64
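
These averages rest on very different group sizes: earlier cells show 3153 of the 3755 rows belong to medium-sized companies and 454 to large ones, which leaves comparatively few rows for small companies. Reporting the count (and the median) next to the mean makes that visible. The sketch below uses invented numbers purely to show the `agg` call.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "company_size": ["S", "M", "M", "M", "L", "L"],
    "salary_in_usd": [60_000, 140_000, 150_000, 130_000, 110_000, 130_000],
})

# One row per company size, with mean, median and number of rows side by side.
summary = toy.groupby("company_size")["salary_in_usd"].agg(["mean", "median", "count"])
print(summary)
```

On the real data, the same call would be `df.groupby('company_size')['salary_in_usd'].agg(['mean', 'median', 'count'])`.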

## Question: Which job title has the most employees?

In [34]:
# Solution:
df['job_title'].value_counts().idxmax()

'Data Engineer'