# What is the most optimal skill to learn for a Data Engineer?

#### Methodology
1. Continue from the previous notebook to find the percentage of postings that list each skill
2. Visualize median salary vs. percent skill demand


## Import Libraries and Data

In [1]:
import ast
import pandas as pd
import seaborn as sns
from datasets import load_dataset
import matplotlib.pyplot as plt  

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Data Cleanup
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])
df['job_skills'] = df['job_skills'].apply(lambda x: ast.literal_eval(x) if pd.notna(x) else x)

## Clean Data

Filter the original dataset to keep only rows where the job title is 'Data Engineer' and the country is 'Germany', creating a new DataFrame `df_DE_GR`. Drop rows with NaN values in the `salary_year_avg` column. Then use the `explode` method on the `job_skills` column to create a new DataFrame (`df_DE_GR_exploded`) with one row per skill associated with each job.
Finally, display the first 5 entries of the `salary_year_avg` and `job_skills` columns.

In [4]:
df_DE_GR = df[(df['job_title_short'] == 'Data Engineer') & (df['job_country'] == 'Germany')].copy()

df_DE_GR = df_DE_GR.dropna(subset=['salary_year_avg'])

df_DE_GR_exploded = df_DE_GR.explode('job_skills')

df_DE_GR_exploded[['salary_year_avg','job_skills']].head(5)

| | salary_year_avg | job_skills |
|---|---|---|
| 7772 | 199675.0 | spark |
| 45652 | 147500.0 | sql |
| 45652 | 147500.0 | scala |
| 45652 | 147500.0 | spark |
| 100515 | 89100.0 | assembly |
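
The repeated index values in the output (for example `45652`) come from `explode`, which copies a row once per list element and keeps the original index. A minimal sketch on a toy DataFrame illustrates this; the `toy` frame and its values are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not from the job postings dataset)
toy = pd.DataFrame({
    'salary_year_avg': [150000.0, 90000.0],
    'job_skills': [['sql', 'spark'], ['python']],
})

# Each list element becomes its own row; the other columns and the
# original index label are repeated for every element of the list.
toy_exploded = toy.explode('job_skills')
print(toy_exploded)
```

Because the index is preserved, rows that share an index label after exploding belong to the same job posting.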


## Calculate Percent of Job Postings That Have Skills

Group the data by job skill and calculate the count and median salary for each skill, sorting the results in descending order by count. Then rename the columns and calculate the percentage that each skill count represents out of the total number of Data Engineer jobs. Finally, filter out any skills that don't have any jobs associated with them.
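
The steps described above could be sketched roughly as follows. This is an illustration on toy data rather than the notebook's actual cell: the frame `df_jobs`, its values, and the names `skill_stats`, `skill_count`, `median_salary`, and `skill_percent` are all assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-in for the filtered job postings (hypothetical values)
df_jobs = pd.DataFrame({
    'salary_year_avg': [150000.0, 90000.0, 120000.0, 100000.0],
    'job_skills': [['sql', 'spark'], ['sql', 'python'], ['python'], None],
})
df_exploded = df_jobs.explode('job_skills')

# Count and median salary per skill, most frequent first, with clearer names.
# groupby drops missing keys by default, so the posting without skills is skipped.
skill_stats = (
    df_exploded.groupby('job_skills')['salary_year_avg']
    .agg(['count', 'median'])
    .sort_values(by='count', ascending=False)
    .rename(columns={'count': 'skill_count', 'median': 'median_salary'})
)

# Share of all postings (not of exploded rows) that mention each skill
job_count = len(df_jobs)
skill_stats['skill_percent'] = skill_stats['skill_count'] / job_count * 100

# Keep only skills that appear in at least one posting
skill_stats = skill_stats[skill_stats['skill_count'] > 0]
print(skill_stats)
```

Note that the denominator is the number of postings before exploding, so the percentages across skills can sum to more than 100 when postings list several skills.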