# Analysis

In [1]:
# Data modules
import pandas as pd

In [2]:
df = pd.read_csv('Data/cleaned_data_science_jobs_dataset.csv')

## Feature analysis univariate

### Work Year
<div>
<img src="Images/Work_year.png" width="400"/>
</div>
Every year there is a noticeable increase in the number of records in our dataset. Growth in data-related jobs worldwide could explain a small part of the difference, but the main reason for an increase this large is that the data is entered voluntarily, and the site that collects it has been gaining popularity over the last 2 years.

### Experience level
<div>
<img src="Images/Experience_level.png" width="700"/>
</div>
More than half of the positions are Senior, approximately one out of three are intermediate, one out of seven are junior and only one out of 25 are executive-level positions.
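
Proportions like these can be read directly from the data rather than from the chart. The following is a minimal sketch on made-up toy rows; the column name `experience_level` and the codes `EN`, `MI`, `SE` and `EX` are assumptions, since the notebook does not show how the cleaned dataset names them.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from the real dataset.
# The column name and the level codes are assumptions.
toy = pd.DataFrame(
    {"experience_level": ["SE", "SE", "SE", "SE", "MI", "MI", "EN", "EX"]}
)

# normalize=True returns fractions instead of raw counts.
shares = toy["experience_level"].value_counts(normalize=True).mul(100).round(1)
print(shares)
```

On the real data, the same call on `df` would give the percentage behind each bar.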

### Employment type
<div>
<img src="Images/Employment_type.png" width="400"/>
</div>
Almost every single job listed is a full-time job.

### Job title

In [3]:
len(df["job_title"].unique()) #How many job titles are in the dataset

52

<div>
<img src="Images/Job_title.png" width="900"/>
</div>
There are 52 unique job titles. Some of them are very niche, so we will focus on the 10 most common. The vast majority of data-related employees are either Data Engineers, Data Scientists, Data Analysts or Machine Learning Engineers; over 80% of all job titles fall in these 4 categories.
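
The share covered by the most common titles can be checked with a small helper. This is a sketch on toy data; `top_n_share` is a hypothetical helper name and the titles below are made up for the example.

```python
import pandas as pd


def top_n_share(titles: pd.Series, n: int) -> float:
    """Return the percentage of rows covered by the n most frequent values."""
    counts = titles.value_counts()  # sorted from most to least frequent
    return float(100 * counts.head(n).sum() / counts.sum())


# Toy data for illustration only; not taken from the real dataset.
toy_titles = pd.Series(
    ["Data Engineer"] * 4
    + ["Data Scientist"] * 3
    + ["Data Analyst"] * 2
    + ["Some Niche Title"]
)
print(top_n_share(toy_titles, 2))  # share covered by the two most common titles
```

Calling `top_n_share(df["job_title"], 4)` on the real data would return the figure quoted above for the four main categories.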

### Salary (USD)
<div>
<img src="Images/Salary.png" width="500"/>
</div>
<div>
<img src="Images/Salary_2.png" width="1000"/>
</div>
The first graph shows that there is a large number of outliers in our dataset. The maximum salary is 900k\$ per year! 

The second graph is the same boxplot as the first with the outliers filtered out, to make it easier to read. There we can see that half of all the salaries are between 57k\$ and 143k\$ (first and third quartiles), with the median at around 95k\$.

The third graph, a bar plot of binned salaries (each bin having a range of 30k\$), shows that the salaries follow a distribution close to normal, with some outliers and with an obvious lower bound of 0\$. The mean of this distribution should be in the 90k-120k\$ bin, which will be checked in the next block of code:

In [4]:
mean = round(df["salary_in_usd"].mean(),2)
print(f"The mean of all salaries is {mean}$")

The mean of all salaries is 118856.37$


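As expected, the mean falls in that bin. Let's check its value if we keep only the salaries below the third quartile, which drops the top 25% of salaries (a rough way of excluding the outliers):

In [5]:
# Keep only the salaries below the third quartile (75th percentile)
q3 = df["salary_in_usd"].quantile(0.75)
mean = round(df[df["salary_in_usd"] < q3]["salary_in_usd"].mean(), 2)
print(f"The mean of salaries (below the third quartile) is {mean}$")

The mean of salaries (below the third quartile) is 86997.91$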
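
The filter above keeps every salary below the third quartile, so it removes the top 25% of rows rather than only the outliers. A common alternative is the 1.5×IQR rule that boxplots conventionally use to mark outliers. The sketch below applies it to toy values; `drop_iqr_outliers` is a hypothetical helper and the factor `k = 1.5` is the conventional default, not something taken from this notebook.

```python
import pandas as pd


def drop_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return values[values.between(q1 - k * iqr, q3 + k * iqr)]


# Toy salaries for illustration only; the last value plays the outlier.
toy_salaries = pd.Series([50_000, 60_000, 70_000, 80_000, 90_000, 900_000])
kept = drop_iqr_outliers(toy_salaries)

print(round(toy_salaries.mean(), 2))  # mean with the outlier
print(round(kept.mean(), 2))          # mean after the IQR filter
```

With this rule only the extreme value is removed, so the remaining mean still reflects the upper quartile of ordinary salaries.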


### Employee residence

In [6]:
len(df["employee_residence"].unique()) #How many countries are in the dataset

62

<div>
<img src="Images/Employee_residence.png" width="900"/>
</div>
As with job titles, there are too many employee countries of residence to show in a graph (62), so only the 10 most common ones were plotted. The vast majority of our data (68.5%) comes from the United States.

### Remote Ratio
<div>
<img src="Images/Remote_work.png" width="500"/>
</div>
Interestingly, more than half of all the jobs are fully remote. Later we will see how this percentage has evolved over time, to check whether COVID-19 has had any impact on it.

### Company Location
<div>
<img src="Images/Company_location.png" width="900"/>
</div>
Similarly to employee residence, the country with by far the largest number of jobs is the United States, with over 70% of the total. The difference in percentage between company location and employee residence (1.5 percentage points in the case of the USA) is explained by people working remotely from other countries. This phenomenon will be explored in more detail later.
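
The gap between the two shares can be computed per country. This is a sketch on toy rows; `employee_residence` appears earlier in the notebook, while the column name `company_location` and the country codes are assumptions.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame(
    {
        "employee_residence": ["US", "US", "ES", "DE", "IN"],
        "company_location": ["US", "US", "US", "DE", "US"],
    }
)

residence_share = toy["employee_residence"].value_counts(normalize=True) * 100
company_share = toy["company_location"].value_counts(normalize=True) * 100

# Align on country code; a country missing from one side counts as 0%.
gap = company_share.sub(residence_share, fill_value=0).sort_values(ascending=False)
print(gap)
```

A positive gap means the country hosts more companies than resident employees in the data, and a negative gap means the opposite.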

### Company Size
<div>
<img src="Images/Company_size.png" width="700"/>
</div>
Medium companies (companies with more than 50 but fewer than 250 employees) make up more than half of all companies in the data.

## Feature analysis multivariate

### Experience level by job title
<div>
<img src="Images/Experience_level_by_job_title.png" width="900"/>
</div>
This bar plot is the same as the one shown before in the job titles section, but this time only the 8 most common job titles are shown, and each job is split by experience level (junior, mid, senior and executive). Inside each job title the percentage of each experience level (regarding that specific job title) is shown as text. 

The job titles with the largest share of juniors are Research Scientist, Machine Learning Engineer, Data Scientist and Data Analyst.

The largest shares of intermediate-level employees are found in Research Scientist, Data Scientist and Data Engineer jobs.

Seniors make up their largest shares in Data Architect and Data Science Manager jobs.

Executives have their largest shares among Analytics Engineers and Data Science Managers.
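
The per-title percentages printed inside the bars can be reproduced with a row-normalised cross-tabulation. This is a minimal sketch on toy rows; the column name `experience_level` and its codes are assumptions.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame(
    {
        "job_title": ["Data Scientist"] * 4 + ["Data Engineer"] * 2,
        "experience_level": ["EN", "MI", "SE", "SE", "SE", "MI"],
    }
)

# normalize="index" makes each row (one job title) sum to 1.
pct = pd.crosstab(toy["job_title"], toy["experience_level"], normalize="index") * 100
print(pct.round(1))
```

Each row then reads as the experience-level breakdown of a single job title, which is what the text labels in the plot show.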

### Company size by job title
<div>
<img src="Images/Company_size_by_job_title.png" width="1000">
</div>
Without taking job title into account, company size percentages are 30% large, 58% medium and 12% small. Using these percentages as a baseline, the most noticeable insights that follow from this graph are:

- Small companies don't seem to need Data Engineers, Data Managers or Data Architects.
- Medium companies have a large percentage of Data Analysts, Data Architects and Analytics Engineers.
- Large companies require an unusually large number of Research Scientists (compared to small and medium companies).
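
Comparisons against the overall baseline, like the ones in the list above, can be made explicit by subtracting the baseline shares from each title's shares. This is a sketch on toy rows; the column name `company_size` and the codes `S`, `M` and `L` are assumptions.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame(
    {
        "job_title": ["Data Analyst"] * 4 + ["Research Scientist"] * 4,
        "company_size": ["M", "M", "M", "L", "L", "L", "L", "S"],
    }
)

# Overall share of each company size, ignoring job title.
baseline = toy["company_size"].value_counts(normalize=True) * 100

# Share of each company size within each job title.
by_title = pd.crosstab(toy["job_title"], toy["company_size"], normalize="index") * 100

# Positive values: that company size is over-represented for the title.
deviation = by_title.sub(baseline, axis="columns")
print(deviation.round(1))
```

The values are percentage-point differences, so a title that matches the overall mix would show zeros across its row.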

### Remote ratio by work year
<div>
<img src="Images/Remote_ratio_by_work_year.png" width="900"/>
</div>
<div>
<img src="Images/Remote_ratio_by_work_year_2.png" width="800" style="float: left; margin-right: 150px;"/>
</div>
During the last 2 years, the percentage of people working either fully remote or not remote at all has increased considerably at the expense of people working partially remote. This could be attributed to a large number of companies trying some degree of remote work for their employees for the first time during the most severe stage of the COVID-19 pandemic (2020-2021). By 2022, each company had already decided whether it wanted to focus on remote work or not.

### Remote Ratio by experience level
<div>
<img src="Images/Remote_ratio_by_experience_level.png" width="800">
</div>
There is no noticeable difference in remote ratio between experience levels.

### Remote Ratio by company size
<div>
<img src="Images/Remote_ratio_by_company_size.png" width="1000">
</div>
There is no noticeable difference in remote ratio between company sizes.
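
Whether two categorical features are related can be checked more formally with a chi-square test of independence on their contingency table. This notebook does not run such a test; the sketch below only computes the statistic and its degrees of freedom on toy rows, using hypothetical names (`chi_square_statistic`, `company_size`, `remote_ratio`). It assumes every row and column of the table has at least one observation.

```python
import numpy as np
import pandas as pd


def chi_square_statistic(table: pd.DataFrame) -> tuple[float, int]:
    """Return Pearson's chi-square statistic and its degrees of freedom."""
    observed = table.to_numpy(dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts if the two variables were independent.
    expected = row_totals @ col_totals / observed.sum()
    statistic = float(((observed - expected) ** 2 / expected).sum())
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return statistic, dof


# Toy data for illustration only: remote ratio is spread identically
# across both company sizes, so the statistic should be zero.
toy = pd.DataFrame(
    {
        "company_size": ["S"] * 4 + ["M"] * 4,
        "remote_ratio": [0, 0, 100, 100, 0, 0, 100, 100],
    }
)
table = pd.crosstab(toy["company_size"], toy["remote_ratio"])
statistic, dof = chi_square_statistic(table)
print(statistic, dof)
```

The statistic is then compared against a chi-square distribution with `dof` degrees of freedom. If SciPy is available, `scipy.stats.chi2_contingency` also returns the p-value.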

### Of all employees working abroad, how many are in each country (in % of total)
<div>
<img src="Images/Percent_of_abroad.png" width="900">
</div>
Not to be confused with the percentage of people working from abroad for each country, this graph shows how many of all the people working from abroad are in each country. Almost half of the people working abroad work in the USA, around 10% in the UK and around 10% in Germany. Therefore, someone looking to work abroad should start by applying for jobs in those countries.
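
One way to build a breakdown like this is sketched below on toy rows. It assumes that "working abroad" means the employee's country of residence differs from the company's country, and that countries are grouped by company location; the column name `company_location` is also an assumption.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame(
    {
        "employee_residence": ["ES", "IN", "FR", "US", "DE", "PT"],
        "company_location": ["US", "US", "GB", "US", "DE", "DE"],
    }
)

# Assumption: an employee works abroad when the two countries differ.
abroad = toy[toy["employee_residence"] != toy["company_location"]]

# Share of all abroad workers employed by companies in each country.
share_by_company_country = (
    abroad["company_location"].value_counts(normalize=True) * 100
)
print(share_by_company_country.round(1))
```

The percentages sum to 100 across countries because the denominator is the number of abroad workers, not the total number of rows.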

### Experience level by company size
<div>
<img src="Images/Experience_level_by_company_size.png" width="900">
</div>
Small companies have the largest percentage of junior employees, while medium companies have almost no junior employees at all. For someone wanting to start a career in data, it would be advisable to focus on applying to either small or large companies.

### Salary by work year
<div>
<img src="Images/Salary_by_work_year.png" width="450">
</div>
Last year (2022) salaries improved significantly, with the middle 50% of salaries ranging from around 80k\$ to 160k\$, compared to 2021's interquartile range of around 50k\$ to 130k\$.
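
The interquartile ranges read off the boxplots can also be computed per year. This is a minimal sketch on toy values; `salary_in_usd` appears earlier in the notebook, while the column name `work_year` is an assumption.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame(
    {
        "work_year": [2021] * 5 + [2022] * 5,
        "salary_in_usd": [
            40_000, 60_000, 80_000, 100_000, 120_000,
            70_000, 90_000, 110_000, 130_000, 150_000,
        ],
    }
)

# One row per year, one column per quantile (0.25, 0.5, 0.75).
quartiles = (
    toy.groupby("work_year")["salary_in_usd"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

The 0.25 and 0.75 columns give the lower and upper edges of each year's box.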

### Salary by experience level
<div>
<img src="Images/Salary_by_experience_level.png" width="1300">
</div>
All salary-by-experience-level histograms are shaped roughly like normal distributions (except for junior, which is clearly affected by the lower bound of 0). The mean clearly increases with experience level, but the difference between senior and executive-level salaries does not appear to be large. This is partly because the outliers were dropped before drawing the plot: most of these outliers are executives.

### Salary by job title
<div>
<img src="Images/Salary_by_job_title.png" width="1300">
</div>
Jobs in this list can be separated into 3 groups:

- Lowest-paying jobs: Research Scientist is by far the lowest-paying job. Its interquartile range is the lowest and also the narrowest (meaning there is not much variation within those salaries). Its median is about 60k\$.
- Mid-paying jobs: Data Engineers, Data Scientists and Data Analysts are not only the most common titles in data-related jobs, but their salaries are also the most average. There is a small difference between them, with Engineers earning more than Scientists, who in turn earn more than Analysts.
- Top-paying jobs: Data Science Manager, Data Architect and Analytics Engineer are managing positions. As we've seen before, most of them are seniors or executives and as such have higher salaries.

### Salary by company size
<div>
<img src="Images/Salary_by_company_size.png" width="500">
</div>
There is not a big difference in salaries between large and medium companies, but small companies have significantly lower salaries.

### Salary by company location
<div>
<img src="Images/Salary_by_company_location.png" width="1300">
</div>
Only countries with a significant amount of data were plotted. Overall, North American and Japanese companies pay the most, followed by Western European ones. Within Western Europe, United Kingdom companies have the highest salaries, followed by Germany, the Netherlands, Austria and France. Southern European companies have the lowest salaries (which is consistent with their lower standard of living).
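
Restricting a comparison to countries with enough data can be done by counting rows per country before aggregating. The sketch below uses toy rows; `median_salary_by_country` and its `min_rows` threshold are hypothetical, since the notebook does not state which cutoff was used for the plot, and the column name `company_location` is an assumption.

```python
import pandas as pd


def median_salary_by_country(frame: pd.DataFrame, min_rows: int) -> pd.Series:
    """Median salary per company country, for countries with >= min_rows rows."""
    stats = frame.groupby("company_location")["salary_in_usd"].agg(
        ["count", "median"]
    )
    kept = stats[stats["count"] >= min_rows]
    return kept["median"].sort_values(ascending=False)


# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame(
    {
        "company_location": ["US", "US", "US", "GB", "GB", "GB", "ES"],
        "salary_in_usd": [120_000, 140_000, 160_000, 70_000, 80_000, 90_000, 45_000],
    }
)
medians = median_salary_by_country(toy, min_rows=3)
print(medians)
```

Countries below the threshold are dropped entirely, so a single unusual salary cannot place a sparsely represented country at the top or bottom of the ranking.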