# 📌**DA jobs on the US job market**

# 1. Introduction

The data analytics field is on the rise worldwide. In this notebook, we will take a closer look at the US job market and figure out some facts about DA jobs! 🔎

* ❓What are the 20 most listed occupations in the DA field?
* ❓What are the 20 jobs with the highest salary in top-rated companies?
* ❓Which range of salaries can various sectors offer? (min/max salaries)
* ❓Which sectors are hungry for data analysts?
* ❓Which cities list the most DA jobs?
* ❓Is there a correlation between rating and salary offered by the company?

**Import libraries**

First of all, let's import the Python 🐍 data libraries we need to load, transform and visualise our dataset.

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import pandasql as ps 
import matplotlib.pyplot as plt
%matplotlib inline

# 🛠 2. Data Preparation
🔎Let's load the data and see what we have...

In [None]:
data = pd.read_csv("../input/data-analyst-jobs/DataAnalyst.csv") 
data.head(10)

In [None]:
data.info()

In [None]:
data = data.drop(["Unnamed: 0", "Competitors", "Easy Apply"], axis=1) #delete useless columns

In [None]:
data

In [None]:
print(data.isnull().any()) # looking for null values, if there are any in given column, the result will be "True"

In [None]:
data = data.dropna() #drop the missing values in Company Name column
print(data.isin([-1, "-1"]).sum()) #count the "-1" placeholders in each column
data

As we can observe, there are some unclear values ("-1") which probably represent missing values as well. For clarity, we should replace them with "NA":

In [None]:
data = data.replace(to_replace = [-1, "-1"], value = "NA") #replace both strings and integers with NA
data.columns = data.columns.str.replace(' ', '_') #adjustment of columns for performing sql queries
data = data[data.Sector != "NA"] #exclude all NA rows within "Sector" column
data
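
A possible alternative, sketched on a toy frame rather than on the real dataset: replacing the placeholders with `np.nan` instead of the string "NA" keeps numeric columns such as Rating numeric, so no later conversion is needed. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking the "-1" placeholders
toy = pd.DataFrame({
    "Rating": [3.5, -1.0, 4.1],
    "Sector": ["Finance", "-1", "Health Care"],
})

# Replace numeric and string placeholders with a real missing value
cleaned = toy.replace([-1, "-1"], np.nan)

# Rows without a sector can then be dropped with dropna
cleaned = cleaned.dropna(subset=["Sector"])

print(cleaned.dtypes)
print(cleaned)
```

The trade-off is that SQL queries and filters then have to test for missing values (for example with `IS NOT NULL` or `notna()`) instead of comparing against "NA".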



In [None]:
salary_estimate_adjustment = data['Salary_Estimate'].str.split('-', expand=True) #split first and second numeric value
data['MinSalary'] = pd.to_numeric(salary_estimate_adjustment[0].str.extract(r'(\d+)', expand=False)) #create column (MinSalary) from the first half (numeric values) of the Salary_Estimate column
data['MaxSalary'] = pd.to_numeric(salary_estimate_adjustment[1].str.extract(r'(\d+)', expand=False)) #create column (MaxSalary) from the second half (numeric values) of the Salary_Estimate column
data.head(10)
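
To make the extraction step easier to follow, here is a minimal sketch on two made-up strings. It assumes the estimates look like "$37K-$66K (Glassdoor est.)", which is an assumption about the format, and it pulls both bounds out with a single regular expression instead of splitting first. The names `sample` and `bounds` are hypothetical.

```python
import pandas as pd

# Hypothetical sample strings in the assumed "$37K-$66K (...)" format
sample = pd.Series(["$37K-$66K (Glassdoor est.)", "$110K-$190K (Glassdoor est.)"])

# Named groups give the resulting columns their names directly:
# first run of digits, then any non-digits, then the second run of digits
bounds = sample.str.extract(r"(?P<MinSalary>\d+)\D+(?P<MaxSalary>\d+)").astype(int)

print(bounds)
```

On the real column, `pd.to_numeric` would be the safer conversion, because rows that do not match the pattern come back as missing values and `astype(int)` would fail on them.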

# 3. It's question time!

**❓1. What are the 20 most listed occupations in DA field?**

First of all, I've decided to print all the unique values in the Job_Title column.

In [None]:
for i, value in enumerate(data.Job_Title.unique()): 
    print(i+1,". value is ",value)

Wow! Look at that! After adjusting our dataframe, we have 1900 rows in total, spread across 1113 unique job titles. 👩‍💻 👨‍💻 Now let's answer the first question:

In [None]:
data['Job_Title'].value_counts()[:20] #list first 20 most listed occupations

In [None]:
plt.figure(figsize=(15,7.5)) #figsize must be applied before plotting the graph

#always make sure to use a color blind palette accessible to all viewers!
color_blind_palette = ["#DC267F", "#785EF0", "#648FFF", "#FE6100", "#FFB000", "#E80E8D", "#2E8766", "#95ABAC", "#65B925", "#906A42", "#C4D537", "#344A52", "#6C6E06", "#1C4DD2", "#216E00", "#2E03E5", "#A94424", "#7F6EA9", "#9B8453", "#380721"]

most_listed_occupations = sns.barplot(x=data['Job_Title'].value_counts().head(20).index,
                                      y=data['Job_Title'].value_counts().head(20).values, palette=color_blind_palette)

plt.xticks(rotation='vertical') #text rotation on x axis for readability
plt.xlabel('Job_Title', fontsize=12.5)
plt.ylabel('Count', rotation=0, ha='right', fontsize=12.5)
plt.figtext(.5,.9,"20 most listed occupations in DA field", fontsize=20, fontweight='bold', fontname='helvetica', ha='center') #formatted title

for patch in most_listed_occupations.patches:
    most_listed_occupations.annotate("%.0f" % patch.get_height(), (patch.get_x() + patch.get_width() / 2., patch.get_height()),
                 ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                 textcoords='offset points')
plt.show()
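
As a side note, the annotation loop above could likely be shortened with `Axes.bar_label`, which is available in Matplotlib 3.4 and later. The sketch below uses made-up labels and counts and plain `ax.bar` rather than seaborn, so it only illustrates the labelling call itself.

```python
import matplotlib.pyplot as plt

# Hypothetical job titles and counts, for illustration only
labels = ["Data Analyst", "Senior Data Analyst", "Business Data Analyst"]
counts = [12, 7, 3]

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.bar(labels, counts)

# Write each bar's height just above it (requires Matplotlib >= 3.4)
texts = ax.bar_label(bars, fmt="%.0f", padding=3)

# Rotate the x tick labels for readability
ax.tick_params(axis="x", rotation=90)
```

With a seaborn bar plot, the bar containers should be reachable through `ax.containers` on the returned axes.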

**❓2. What are the 20 jobs with the highest salary in top-rated companies?**

In [None]:
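#Rating still holds "NA" strings at this point, so exclude them and cast the rest to numbers
df_task_highest_salary_top_comp = """SELECT Job_Title, Company_Name, Sector,
                    MAX(MaxSalary) AS Highest_MaxSalary, CAST(Rating AS REAL) AS Company_Rating
             FROM data
             WHERE Rating != 'NA'
             GROUP BY Job_Title
             ORDER BY Highest_MaxSalary DESC, Company_Rating DESC
             LIMIT 20"""
ps.sqldf(df_task_highest_salary_top_comp) #with a single MAX(), SQLite takes the other columns from the row holding the maximum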
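
For readers who prefer plain pandas, the same idea (one row per job title, namely the listing with the highest maximum salary) can be sketched with `groupby` and `idxmax`. The toy data below is invented; only the column names follow this notebook.

```python
import pandas as pd

# Toy listings (hypothetical values) with the column names used in this notebook
toy = pd.DataFrame({
    "Job_Title": ["Data Analyst", "Data Analyst", "Senior Data Analyst"],
    "Company_Name": ["A", "B", "C"],
    "MaxSalary": [66, 90, 120],
    "Rating": [3.9, 4.5, 4.2],
})

# Index label of the best-paid listing within each job title
best_idx = toy.groupby("Job_Title")["MaxSalary"].idxmax()

# Keep those rows, highest salary first, rating as tie-breaker
top_jobs = (toy.loc[best_idx]
               .sort_values(["MaxSalary", "Rating"], ascending=False)
               .head(20)
               .reset_index(drop=True))

print(top_jobs)
```

Because whole rows are selected, the company, sector and rating shown always belong to the listing that actually offers the maximum salary.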

**❓3. What range of minimum salaries can various sectors offer?**

In [None]:
color_blind_palette = ["#DC267F", "#785EF0", "#648FFF", "#FE6100", "#FFB000", "#E80E8D", "#2E8766", "#95ABAC", "#65B925", "#906A42", "#C4D537", "#344A52", "#6C6E06", "#1C4DD2", "#216E00", "#2E03E5", "#A94424", "#7F6EA9", "#9B8453", "#380721", "#F38700", "#B4BA2A", "#BD028B", "#1E2F66"]

sns.set(style='whitegrid')
plt.figure(figsize=(15,7.5)) #figsize must be applied before plotting the graph
ax = sns.boxplot(x=data['MinSalary'], y=data['Sector'], palette=color_blind_palette)
plt.xlim(20, 120) #limit on x axis

plt.figtext(.5,.9,"Range of minimum salaries in sectors", fontsize=25, fontweight='bold', fontname='helvetica', ha='center') #title formatting


**❓4. What range of maximum salaries can various sectors offer?**

In [None]:
color_blind_palette = ["#DC267F", "#785EF0", "#648FFF", "#FE6100", "#FFB000", "#E80E8D", "#2E8766", "#95ABAC", "#65B925", "#906A42", "#C4D537", "#344A52", "#6C6E06", "#1C4DD2", "#216E00", "#2E03E5", "#A94424", "#7F6EA9", "#9B8453", "#380721", "#F38700", "#B4BA2A", "#BD028B", "#1E2F66"]

sns.set(style='whitegrid')

plt.figure(figsize=(15,7.5)) #figsize must be applied before plotting the graph

ax = sns.boxplot(x=data['MaxSalary'], y=data['Sector'], palette=color_blind_palette)
plt.xlim(30, 200) #limit on x axis

plt.figtext(.5,.9,"Range of maximum salaries in sectors", fontsize=25, fontweight='bold', fontname='helvetica', ha='center') #title formatting
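
The box plots can be complemented with a small numeric summary per sector. The following is a sketch on invented numbers, using named aggregation. The output column names (`lowest_min`, `median_min`, `median_max`, `highest_max`) are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values) with the salary columns created earlier
toy = pd.DataFrame({
    "Sector": ["Finance", "Finance", "Health Care", "Health Care"],
    "MinSalary": [40, 60, 35, 45],
    "MaxSalary": [70, 110, 60, 80],
})

# One row per sector: extremes and medians of the salary bounds
summary = toy.groupby("Sector").agg(
    lowest_min=("MinSalary", "min"),
    median_min=("MinSalary", "median"),
    median_max=("MaxSalary", "median"),
    highest_max=("MaxSalary", "max"),
)

print(summary)
```

Sorting the result with `summary.sort_values("median_max", ascending=False)` would rank the sectors by their typical upper salary bound.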

**❓5. Which sectors are hungry for data analysts?**

In [None]:
df_task_hungry_sectors = """SELECT Sector, COUNT(*) AS count
                    FROM data
                    GROUP BY Sector
                    ORDER BY count DESC
                    LIMIT 20"""
ps.sqldf(df_task_hungry_sectors)

In [None]:
plt.figure(figsize=(15,7.5)) #figsize must be applied before plotting the graph
plt.xticks(rotation='vertical') #text rotation for readability

color_blind_palette = ["#DC267F", "#785EF0", "#648FFF", "#FE6100", "#FFB000", "#E80E8D", "#2E8766", "#95ABAC", "#65B925", "#906A42", "#C4D537", "#344A52", "#6C6E06", "#1C4DD2", "#216E00", "#2E03E5", "#A94424", "#7F6EA9", "#9B8453", "#380721"]

plot_hungry_sectors = sns.barplot(x=data['Sector'].value_counts().head(20).index,
                                  y=data['Sector'].value_counts().head(20).values, palette=color_blind_palette)

plt.xlabel('Sector', fontsize=12.5)
plt.ylabel('Count', rotation=0, ha='right', fontsize=12.5)
plt.figtext(.5,.9,"Sectors hungry for data analysts", fontsize=20, fontweight='bold', fontname='helvetica', ha='center') #formatted title

for patch in plot_hungry_sectors.patches:
    plot_hungry_sectors.annotate("%.0f" % patch.get_height(), (patch.get_x() + patch.get_width() / 2., patch.get_height()),
                 ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                 textcoords='offset points')

plt.show()

**❓6. Which cities list the most DA jobs?**

In [None]:
df_task_city = """SELECT Location, COUNT(*) AS Job_count
                    FROM data
                    GROUP BY Location
                    ORDER BY Job_count DESC
                    LIMIT 20"""
ps.sqldf(df_task_city)

In [None]:
plt.figure(figsize=(15,7.5))
plt.xticks(rotation='vertical') #text rotation for readability

color_blind_palette = ["#DC267F", "#785EF0", "#648FFF", "#FE6100", "#FFB000", "#E80E8D", "#2E8766", "#95ABAC", "#65B925", "#906A42", "#C4D537", "#344A52", "#6C6E06", "#1C4DD2", "#216E00", "#2E03E5", "#A94424", "#7F6EA9", "#9B8453", "#380721"]

plot_city = sns.barplot(x=data['Location'].value_counts().head(20).index,
                        y=data['Location'].value_counts().head(20).values, palette=color_blind_palette)
plt.xlabel('Location', fontsize=12.5)
plt.ylabel('Count', rotation=0, ha='right', fontsize=12.5)
plt.figtext(.5,.9,'US cities with the most DA job listings', fontsize=20, fontweight='bold', fontname='helvetica', ha='center') #formatted title

for patch in plot_city.patches:
    plot_city.annotate("%.0f" % patch.get_height(), (patch.get_x() + patch.get_width() / 2., patch.get_height()),
                 ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                 textcoords='offset points')

plt.show()

**❓7. Is there a correlation between rating and salary offered by the company?**

In [None]:
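corr_data = data.copy() #work on a copy so the original dataframe stays untouched
corr_data['AvgSalary'] = corr_data[['MinSalary', 'MaxSalary']].mean(axis=1) #create a new column "AvgSalary" calculated as a mean from MinSalary & MaxSalary
corr_data["Rating"] = pd.to_numeric(corr_data.Rating, errors='coerce') #convert "Rating" column datatype into numeric ("NA" becomes NaN)
corr_data.select_dtypes('number').corr() #correlate the numeric columns only, text columns cannot be correlated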


In [None]:
plt.figure(figsize=(20,15))
plt.figtext(.5, .9, "Salary vs. Rating", fontsize=20, fontweight='bold', fontname='Helvetica', ha='center') #formatted title
sns.heatmap(corr_data.select_dtypes('number').corr(), annot=True, cmap="PiYG") #numeric columns only

**From the plotted heatmap, it looks like there is barely any correlation (values are very close to 0) between rating and the minimum, maximum or average salary.**
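
To look at a single pair of columns instead of the whole matrix, `Series.corr` can be used. The sketch below runs on invented numbers and also computes a rank-based (Spearman) coefficient as the Pearson correlation of the ranks, which avoids any extra dependency. The names `toy`, `pearson` and `spearman` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values), not taken from the dataset
toy = pd.DataFrame({
    "Rating": [3.0, 3.5, 4.0, 4.5, 5.0],
    "AvgSalary": [70, 65, 72, 68, 71],
})

# Pearson correlation measures the linear relationship
pearson = toy["Rating"].corr(toy["AvgSalary"])

# Spearman correlation is the Pearson correlation of the ranks
spearman = toy["Rating"].rank().corr(toy["AvgSalary"].rank())

print(round(pearson, 3), round(spearman, 3))
```

If both coefficients stay close to 0 on the real data, that would support the reading of the heatmap.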

In [None]:
color_blind_palette = ["#DC267F", "#785EF0", "#648FFF", "#FE6100", "#FFB000", "#E80E8D", "#2E8766", "#95ABAC", "#65B925", "#906A42", "#C4D537", "#344A52", "#6C6E06", "#1C4DD2", "#216E00", "#2E03E5", "#A94424", "#7F6EA9", "#9B8453", "#380721", "#F38700", "#B4BA2A", "#BD028B", "#1E2F66", "#51AAE8", "#D44564", "#E792E4", "#3CA291", "#F5C5FA", "#BBCBA8", "#734C6C", "#67DE7C", "#9FDA4A", "#883350", "#DEC284", "#82C9C6"]

plt.figure(figsize=(15,7))
plt.figtext(.5, .9, "Average salary vs. Rating", fontsize=15, fontweight='bold', fontname='Helvetica', ha='center') #formatted title

sns.boxplot(data=corr_data, x="Rating",y="AvgSalary", palette=color_blind_palette) #add boxplots with max and min salary

In [None]:
color_blind_palette = ["#DC267F", "#785EF0", "#648FFF", "#FE6100", "#FFB000", "#E80E8D", "#2E8766", "#95ABAC", "#65B925", "#906A42", "#C4D537", "#344A52", "#6C6E06", "#1C4DD2", "#216E00", "#2E03E5", "#A94424", "#7F6EA9", "#9B8453", "#380721", "#F38700", "#B4BA2A", "#BD028B", "#1E2F66", "#51AAE8", "#D44564", "#E792E4", "#3CA291", "#F5C5FA", "#BBCBA8", "#734C6C", "#67DE7C", "#9FDA4A", "#883350", "#DEC284", "#82C9C6"]

plt.figure(figsize=(15,7))
plt.figtext(.5, .9, "Minimum salary vs. Rating", fontsize=15, fontweight='bold', fontname='Helvetica', ha='center') #formatted title

sns.boxplot(data=corr_data, x="Rating",y="MinSalary", palette=color_blind_palette)

In [None]:
color_blind_palette = ["#DC267F", "#785EF0", "#648FFF", "#FE6100", "#FFB000", "#E80E8D", "#2E8766", "#95ABAC", "#65B925", "#906A42", "#C4D537", "#344A52", "#6C6E06", "#1C4DD2", "#216E00", "#2E03E5", "#A94424", "#7F6EA9", "#9B8453", "#380721", "#F38700", "#B4BA2A", "#BD028B", "#1E2F66", "#51AAE8", "#D44564", "#E792E4", "#3CA291", "#F5C5FA", "#BBCBA8", "#734C6C", "#67DE7C", "#9FDA4A", "#883350", "#DEC284", "#82C9C6"]

plt.figure(figsize=(15,7))
plt.figtext(.5, .9, "Maximum salary vs. Rating", fontsize=15, fontweight='bold', fontname='Helvetica', ha='center') #formatted title
sns.boxplot(data=corr_data, x="Rating",y="MaxSalary", palette=color_blind_palette)

# 4. Conclusion

If I were searching for a data job in the US, I would definitely invest a lot of time in preparing for a Data Analyst career. Although job offers are ubiquitous and this trend will probably continue for a long time 📊, I believe there is huge competition among applicants. I would even consider moving to NY, which lists twice as many job offers as the second city, Chicago (see question 6).

This is my 2nd dataset published here on Kaggle. If you've appreciated it, please UPVOTE!  
If you have any suggestions or questions about my analysis, I will be happy to answer them!  
Happy coding to everyone! 💻 🎉