![iut](https://github.com/Hexanol777/STEM-Salaries-Case-Study/tree/main/Phase%201/stock_image/IUT200.png)
<hr style="margin-bottom: 40px;">

<img src="https://github.com/Hexanol777/STEM-Salaries-Case-Study/tree/main/Phase%201/stock_image/Header.jpg"
    style="width:400px; float: right; margin: 0 40px 40px 40px;"></img>

# STEM Jobs Salaries

## Data Visualization

#### Data visualization is an essential part of any data analysis project because it lets us explore and communicate our data effectively. Visual representations reveal patterns, trends, and outliers that are not immediately apparent in raw data, and they show how the data is distributed and how different variables relate to one another. Ultimately, the goal of data visualization is to provide a clear and concise representation of the data that is easy to understand and interpret, even for those without a strong background in data analysis.

[Link to the Data used in this Notebook](https://drive.google.com/file/d/1IhXv0qcq7YFfBxc0BQB1-z74wF40ZnZn/view?usp=share_link)

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Importing Modules

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Loading The Data:

In [None]:
!head data/jobs_with_country_codes.csv
# Note: if you run this line locally, you may be met with the error below,
# as this notebook is meant to be executed on Google Colab
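
For readers running the notebook outside Colab, the same preview can be produced without a shell command. The following is a minimal sketch, not part of the original notebook; the helper name `preview_lines` is hypothetical, and it assumes the file is UTF-8 encoded text.

```python
from itertools import islice


def preview_lines(path, n=10, encoding="utf-8"):
    """Return the first n lines of a text file, similar to the shell's `head`."""
    with open(path, encoding=encoding) as f:
        # islice stops after n lines, so large files are not read in full
        return [line.rstrip("\n") for line in islice(f, n)]


# Hypothetical usage, assuming the same relative path as the cell above:
# for line in preview_lines('data/jobs_with_country_codes.csv'):
#     print(line)
```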

In [None]:
Data = pd.read_csv(
    'data/jobs_with_country_codes.csv',
    parse_dates=['Timestamp'])

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## The Data at a Glance:

In [None]:
Data.head()

In [None]:
Data.shape

In [None]:
Data.info()

In [None]:
Data.describe()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Numerical analysis and visualization

We'll analyze the `TotalYearlyCompensation` column:

In [None]:
Data['TotalYearlyCompensation'].describe()

In [None]:
Data['TotalYearlyCompensation'].mean()

In [None]:
Data['TotalYearlyCompensation'].median()

In [None]:
# Display floats in pandas output as full numbers (no decimals) instead of scientific notation
pd.set_option('display.float_format', '{:.0f}'.format)

In [None]:
Data['TotalYearlyCompensation'].plot(kind='box', vert=False, showfliers=False, figsize=(14,6))
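
The box plot above is drawn with `showfliers=False`, which hides the outlier points. By default, Matplotlib's whiskers extend to the most extreme data points that lie within 1.5 times the interquartile range (IQR) of the quartiles, and anything beyond is treated as a flier. The sketch below shows how those bounds could be computed by hand; the numbers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up values for illustration; 100 is an obvious outlier
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

# Points outside these bounds would be drawn as fliers by default
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```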

In [None]:
Data['TotalYearlyCompensation'].plot(kind='density', figsize=(14,6))

In [None]:
ax = Data['TotalYearlyCompensation'].plot(kind='density', figsize=(14,6)) # kde
ax.axvline(Data['TotalYearlyCompensation'].mean(), color='red')
ax.axvline(Data['TotalYearlyCompensation'].median(), color='green')
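
In the plot above, the red line marks the mean and the green line marks the median. When a distribution has a long right tail, a few very large values pull the mean above the median. The sketch below illustrates this effect on made-up numbers; it makes no claim about the actual shape of the dataset.

```python
import pandas as pd

# Made-up compensation values (in thousands); one very large value
salaries = pd.Series([100, 110, 120, 130, 900])

mean_value = salaries.mean()      # sensitive to the extreme value
median_value = salaries.median()  # middle value, robust to extremes

print(mean_value, median_value)
```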

In [None]:
ax = Data['TotalYearlyCompensation'].plot(kind='hist', figsize=(14,6))
ax.set_ylabel('Frequency')
ax.set_xlabel('Dollars')

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Categorical analysis and visualization

We'll start with the `Country` column, then look at `Gender`, `Education`, `Title`, and `Company`:

Later on, the `.nlargest()` method is used to limit the graphs to the 8 most frequent countries.
[Refer to the ``Pandas Documentation`` for more info](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nlargest.html#pandas-dataframe-nlargest)
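
As a small illustration of this pattern, the sketch below chains `value_counts()` and `.nlargest()` on a toy Series. The country codes are made up and are not taken from the dataset.

```python
import pandas as pd

# Toy data for illustration only
countries = pd.Series(['US', 'US', 'US', 'IN', 'IN', 'CA', 'DE'])

counts = countries.value_counts()  # occurrences per category, sorted descending
top_2 = counts.nlargest(2)         # keep only the 2 largest counts

print(top_2)
```

Because `value_counts()` already sorts in descending order, `.head(2)` would select the same rows here; `.nlargest()` simply makes the intent explicit.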

The `autopct='%1.1f%%'` argument is also used to display percentages rounded to one decimal place on the pie charts.
[Link to documentation for further info](https://matplotlib.org/3.1.1/_modules/matplotlib/pyplot.html#pie)
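
When `autopct` is a format string, Matplotlib labels each wedge with `fmt % pct`, where `pct` is the wedge's share in percent. The sketch below applies the same format string to a few made-up percentages in plain Python, so the effect can be checked without drawing a chart.

```python
# Format string used for the pie charts in this notebook
fmt = '%1.1f%%'   # one decimal place, followed by a literal percent sign

# Made-up wedge percentages for illustration
labels = [fmt % pct for pct in (62.5, 33.333, 4.167)]

print(labels)
```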

### Which countries host the most STEM job positions?

In [None]:
Data['Country'].value_counts()

In [None]:
Data['Country'].value_counts().nlargest(8).plot(kind='pie', figsize=(8,8), autopct='%1.1f%%')

In [None]:
plot = Data['Country'].value_counts().nlargest(8).plot(kind='bar', figsize=(14,6))
plot.set_ylabel('Number of STEM Positions')

### Gender distribution among STEM workers?

In [None]:
# pd.read_csv parses 'NA' as NaN by default, so drop missing values explicitly
filtered_data = Data.dropna(subset=['Gender'])
filtered_data['Gender'].value_counts().nlargest(2).plot(kind='pie', figsize=(7,7), autopct='%1.1f%%')

### Distribution of educational attainment among individuals employed in STEM fields

In [None]:
# pd.read_csv parses 'NA' as NaN by default, so drop missing values explicitly
filtered_data = Data.dropna(subset=['Education'])
filtered_data['Education'].value_counts().nlargest(3).plot(kind='pie', figsize=(7,7), autopct='%1.1f%%')

### Most common titles?

In [None]:
# Count rows per job title and keep the 10 most common
Title_counts = Data['Title'].value_counts()
top_titles = Title_counts.head(10)
plt.barh(top_titles.index, top_titles.values)
plt.gca().invert_yaxis()  # put the most common title at the top

plt.xlabel('Count')
plt.ylabel('Job Title')
plt.show()
top_titles

### Most common companies?

In [None]:
# Count rows per company and keep the 10 most common
company_counts = Data['Company'].value_counts()
top_companies = company_counts.head(10)
plt.barh(top_companies.index, top_companies.values)
plt.gca().invert_yaxis()  # put the most common company at the top

plt.xlabel('Count')
plt.ylabel('Company')
plt.show()
top_companies

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Relationship between the columns?

Can we find any significant relationship?

In [None]:
corr = Data.corr(numeric_only=True)

corr
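
A full correlation matrix can be hard to scan. One way to read it is to take a single column of interest and rank the other columns by the absolute value of their correlation with it. The sketch below does this on made-up data; the column names `experience`, `salary`, and `noise` are hypothetical and stand in for the notebook's numeric columns.

```python
import pandas as pd

# Made-up numeric data for illustration; column names are hypothetical
df = pd.DataFrame({
    'experience': [1, 2, 3, 4, 5],
    'salary':     [50, 60, 70, 80, 90],  # perfectly linear in experience
    'noise':      [3, 1, 4, 1, 5],
})

toy_corr = df.corr(numeric_only=True)

# Correlations with the target column, excluding its correlation with itself
target = 'salary'
ranked = toy_corr[target].drop(target).abs().sort_values(ascending=False)

print(ranked)
```

Ranking by absolute value treats strong negative correlations as equally interesting as strong positive ones. A high value indicates a linear association only, not a causal relationship.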

In [None]:
fig = plt.figure(figsize=(8,8))
plt.matshow(corr, cmap='RdBu', fignum=fig.number)
plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical');
plt.yticks(range(len(corr.columns)), corr.columns);

In [None]:
Data.plot(kind='scatter', x='YearsOfExperience', y='TotalYearlyCompensation', figsize=(7,7))

In [None]:
Data.plot(kind='scatter', x='BaseSalary', y='TotalYearlyCompensation', figsize=(6,6))

In [None]:
bxplt = Data[['TotalYearlyCompensation', 'Country']].boxplot(by='Country', figsize=(25,15))
bxplt.set_ylabel('TotalYearlyCompensation')

In [None]:
degree_means = Data.groupby('Education')['BaseSalary'].mean()
top_3_degrees = degree_means.nlargest(3)
fig, ax = plt.subplots()
ax.bar(top_3_degrees.index, top_3_degrees.values)
ax.set_xlabel('Educational Degree')
ax.set_ylabel('Average BaseSalary')
ax.set_title('Average BaseSalary for Top 3 Educational Degrees')
plt.show()
top_3_degrees.head()

In [None]:
excluded_data = Data[Data['Gender'] != 'Other']
gender_salary = excluded_data.groupby('Gender')['BaseSalary'].mean().nlargest(2)
gender_salary.plot(kind='bar', color=['blue', 'pink'])
plt.title('Mean Base Salary by Gender')
plt.xlabel('Gender')
plt.ylabel('Mean Base Salary')
plt.xticks(rotation=0)
plt.show()
gender_salary.head()

In [None]:
boxplot_cols = ['TotalYearlyCompensation', 'YearsOfExperience', 'YearsAtCompany', 'BaseSalary', 'StockGrantValue', 'Bonus']

Data[boxplot_cols].plot(kind='box', subplots=True, layout=(2,3), figsize=(14,8))

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
