# Analysing Earnings and Women's Participation Among Recent Graduates

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]

Petroleum Engineering is the major with the highest median earnings.

In [None]:
recent_grads.head()

In [None]:
recent_grads.describe()

Averaged across all majors, the median salary is about $40,000 and the unemployment rate is about 0.07. The average share of women in the majors is around 52%.
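The averages quoted above can also be read off directly instead of from the full `describe()` table. A minimal sketch, assuming the column names `Median`, `Unemployment_rate` and `ShareWomen` used elsewhere in this notebook; the helper `column_means` and the toy frame are hypothetical and only there for illustration:

```python
import pandas as pd

def column_means(df, columns):
    """Return the mean of each requested column (missing values are skipped)."""
    return df[columns].mean()

# Toy data standing in for recent_grads; the values are made up.
toy = pd.DataFrame({
    'Median': [30000, 40000, 50000],
    'Unemployment_rate': [0.05, 0.07, 0.09],
    'ShareWomen': [0.4, 0.5, 0.6],
})
means = column_means(toy, ['Median', 'Unemployment_rate', 'ShareWomen'])
print(means)
```

In the notebook itself, the call would be `column_means(recent_grads, ['Median', 'Unemployment_rate', 'ShareWomen'])`.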

Delete any rows that contain null data.

In [None]:
raw_data_count = recent_grads.shape[0]
recent_grads = recent_grads.dropna()
cleaned_data_count = recent_grads.shape[0]
print(raw_data_count)
print(cleaned_data_count)

One row was deleted.
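To see which row is affected and which columns hold the missing values, the frame can be inspected before `dropna()` is called. A sketch under that assumption; `null_report` and the toy frame are hypothetical names used only for illustration:

```python
import numpy as np
import pandas as pd

def null_report(df):
    """Return (rows containing at least one null, null counts for affected columns)."""
    rows_with_nulls = df[df.isna().any(axis=1)]
    nulls_per_column = df.isna().sum()
    return rows_with_nulls, nulls_per_column[nulls_per_column > 0]

# Toy data standing in for the raw recent_grads frame; the values are made up.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C'],
    'Total': [100.0, np.nan, 300.0],
    'ShareWomen': [0.4, np.nan, 0.6],
})
bad_rows, null_counts = null_report(toy)
print(bad_rows)
print(null_counts)
```

This has to run on the frame as loaded from the CSV, since after `dropna()` there is nothing left to report.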

### Data Visualization
Unemployment rate for the ten majors with the highest and the ten majors with the lowest median salaries:

In [None]:
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate')
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate')

Among the highest median salaries, the Nuclear Engineering major has the highest unemployment rate, followed by the Mining and Mineral Engineering major. Among the lowest salaries, the Clinical Psychology major has the highest unemployment rate. Both groups contain majors with high and with low unemployment rates.
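The slices `[:10]` and `[-10:]` only pick the highest and lowest earners because the rows are already ordered by median earnings. A sketch that makes the ordering explicit, so the selection would still work on a shuffled frame; `top_and_bottom` and the toy data are hypothetical:

```python
import pandas as pd

def top_and_bottom(df, by='Median', n=10):
    """Return the n rows with the largest and the n rows with the smallest `by` values."""
    ordered = df.sort_values(by, ascending=False)
    return ordered.head(n), ordered.tail(n)

# Toy data standing in for recent_grads; the values are made up.
toy = pd.DataFrame({
    'Major': ['A', 'B', 'C', 'D'],
    'Median': [30000, 90000, 50000, 20000],
    'Unemployment_rate': [0.08, 0.02, 0.05, 0.10],
})
top, bottom = top_and_bottom(toy, n=2)
print(top['Major'].tolist())
print(bottom['Major'].tolist())
```

Either result can then be plotted as before, for example `top.plot.bar(x='Major', y='Unemployment_rate')`.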

Boxplots are used to see the variation in the 'Unemployment_rate' and in the 'Median' columns.

In [None]:
recent_grads[['Unemployment_rate']].boxplot()

The median is about 0.07. The top quartile spans roughly 0.085 to 0.125, while the bottom quartile ends at about 0.05.

In [None]:
recent_grads[['Median']].boxplot()

The median of the 'Median' column is just below 40,000, while the top quartile spans roughly 45 to 60 thousand.
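Values read off a box plot are approximate; the same quantities can be computed. A minimal sketch, assuming the matplotlib default of whiskers at 1.5 times the interquartile range; `box_stats` and the toy series are hypothetical:

```python
import pandas as pd

def box_stats(series):
    """Return the quartiles and whisker ends that a default box plot draws."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Whiskers reach the most extreme data points still inside the 1.5 * IQR fences.
    lower = series[series >= q1 - 1.5 * iqr].min()
    upper = series[series <= q3 + 1.5 * iqr].max()
    return {'q1': q1, 'median': median, 'q3': q3,
            'lower_whisker': lower, 'upper_whisker': upper}

# Toy data; 100 lies outside the upper fence and would be drawn as an outlier.
toy = pd.Series([1, 2, 3, 4, 5, 100])
stats = box_stats(toy)
print(stats)
```

In the notebook, `box_stats(recent_grads['Unemployment_rate'])` and `box_stats(recent_grads['Median'])` would give the numbers behind the two plots.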

Scatter plots are used to visualize any correlation between two variables. Where a plot is too dense in some areas, we'll also use hexagonal bin plots so we can take the density into consideration.

In [None]:
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(x=recent_grads['Total'], y=recent_grads['Median'])
ax.set_title('Total Vs Median')
ax.set_xlabel('Total')
ax.set_ylabel('Median')

In [None]:
recent_grads.plot.hexbin(x = 'Total', y='Median', gridsize=15);

In [None]:
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(x=recent_grads['Total'], y=recent_grads['Unemployment_rate'])
ax.set_title('Total Vs Unemployment Rate')
ax.set_xlabel('Total')
ax.set_ylabel('Unemployment Rate')

We can see that the total number of students in a major shows no correlation with either the median salary or the unemployment rate.
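A scatter plot gives only a visual impression; a correlation coefficient puts a number on it. A sketch using the Pearson coefficient from `Series.corr`, where values near 0 would support the reading of no linear relationship; `pearson` and the toy frame are hypothetical:

```python
import pandas as pd

def pearson(df, x, y):
    """Pearson correlation coefficient between two columns of df."""
    return df[x].corr(df[y])

# Toy data standing in for recent_grads; the values are made up.
toy = pd.DataFrame({
    'Total': [100, 200, 300, 400],
    'Median': [40000, 38000, 42000, 40000],
    'Unemployment_rate': [0.02, 0.04, 0.06, 0.08],
})
r_median = pearson(toy, 'Total', 'Median')
r_unemp = pearson(toy, 'Total', 'Unemployment_rate')
print(r_median, r_unemp)
```

Pearson only measures linear association, so it complements the plots rather than replacing them.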

In [None]:
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(x=recent_grads['Full_time'], y=recent_grads['Median'])
ax.set_title('Full Time Vs Median')
ax.set_xlabel('Full Time')
ax.set_ylabel('Median')

In [None]:
recent_grads.plot.hexbin(x = 'Full_time', y='Median', gridsize=15);

In [None]:
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(x=recent_grads['Full_time'], y=recent_grads['Unemployment_rate'])
ax.set_title('Full Time Vs Unemployment_rate')
ax.set_xlabel('Full Time')
ax.set_ylabel('Unemployment_rate')

We also can't see any correlation between the number of graduates with a full-time job and the median salary or the unemployment rate.

In [None]:
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(x=recent_grads['ShareWomen'], y=recent_grads['Unemployment_rate'])
ax.set_title('Share of Women Vs Unemployment Rate')
ax.set_xlabel('Share of Women')
ax.set_ylabel('Unemployment Rate')

Now let's see how gender relates to the median salary and the unemployment rate.

In [None]:
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(x=recent_grads['ShareWomen'], y=recent_grads['Median'])
ax.set_title('Share of Women Vs Median')
ax.set_xlabel('Share of Women')
ax.set_ylabel('Median')

In [None]:
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(x=recent_grads['Men'], y=recent_grads['Median'])
ax.set_title('Men Vs Median')
ax.set_xlabel('Men')
ax.set_ylabel('Median')

fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(x=recent_grads['Men'], y=recent_grads['Unemployment_rate'])
ax.set_title('Men Vs Unemployment_rate')
ax.set_xlabel('Men')
ax.set_ylabel('Unemployment_rate')

In [None]:
recent_grads.plot.hexbin(x = 'Men', y='Median', gridsize=15);
recent_grads.plot.hexbin(x = 'Men', y='Unemployment_rate', gridsize=15);

In [None]:
fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(x=recent_grads['Women'], y=recent_grads['Median'])
ax.set_title('Women Vs Median')
ax.set_xlabel('Women')
ax.set_ylabel('Median')

fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(x=recent_grads['Women'], y=recent_grads['Unemployment_rate'])
ax.set_title('Women Vs Unemployment_rate')
ax.set_xlabel('Women')
ax.set_ylabel('Unemployment_rate')

In [None]:
recent_grads.plot.hexbin(x = 'Women', y='Median', gridsize=15);
recent_grads.plot.hexbin(x = 'Women', y='Unemployment_rate', gridsize=15);

All the plots look much the same and show no correlation, with the exception of the 'Share of Women Vs Median' plot, which shows a small negative correlation between the share of women and the median salary.
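To compare the gender-related columns side by side, their correlations with a single target column can be put in one table. A sketch using `DataFrame.corrwith`; the helper name and the toy values are hypothetical and chosen only so that the signs are easy to check:

```python
import pandas as pd

def correlations_with(df, predictors, target):
    """Pearson correlation of each predictor column with the target column."""
    return df[predictors].corrwith(df[target])

# Toy data standing in for recent_grads; the values are made up.
toy = pd.DataFrame({
    'ShareWomen': [0.1, 0.3, 0.5, 0.7, 0.9],
    'Men': [900, 700, 500, 300, 100],
    'Women': [100, 300, 500, 700, 900],
    'Median': [60000, 50000, 45000, 35000, 30000],
})
table = correlations_with(toy, ['ShareWomen', 'Men', 'Women'], 'Median')
print(table)
```

In the notebook, the same call with `target='Unemployment_rate'` would cover the second set of plots.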

Let's visualize the proportion of men and women in each category of majors.

In [None]:
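# Select several columns with a list; tuple-style selection on a groupby is not supported in recent pandas.
recent_grads.groupby('Major_category')[['Men', 'Women']].sum().plot(kind='barh', stacked=True)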

How many categories have an average share of women per major greater than 0.5?

In [None]:
recent_grads.groupby('Major_category').ShareWomen.mean().plot.barh()

In 10 of the 16 categories, women make up on average more than half of the graduates.
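The count can be computed instead of being read off the bar chart. A minimal sketch, assuming the `Major_category` and `ShareWomen` columns used above; `categories_majority_women` and the toy frame are hypothetical:

```python
import pandas as pd

def categories_majority_women(df):
    """Return the categories whose mean ShareWomen across majors exceeds 0.5."""
    mean_share = df.groupby('Major_category')['ShareWomen'].mean()
    return mean_share[mean_share > 0.5].sort_values(ascending=False)

# Toy data standing in for recent_grads; the values are made up.
toy = pd.DataFrame({
    'Major_category': ['Engineering', 'Engineering', 'Education', 'Health', 'Health'],
    'ShareWomen': [0.2, 0.3, 0.8, 0.7, 0.4],
})
majority = categories_majority_women(toy)
print(len(majority), 'of', toy['Major_category'].nunique())
```

Note that this averages the share per major, so a small major weighs as much as a large one within its category.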

We'll now look at the share of women in each of the majors with the ten highest and the ten lowest median earnings. Remember that the dataset is ordered by median earnings.

In [None]:
recent_grads[:10].plot.bar(x='Major', y='ShareWomen')
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen')

We can see that among the highest median salaries, the Astronomy and Astrophysics major has the biggest share of women, followed by the Actuarial Sciences major. For the lowest salaries all the majors have a big share of women, which supports the 'Share of Women Vs Median' plot that was shown above.

### Histograms

In [None]:
recent_grads['Median'].hist(bins=20, range=(0,100000), figsize=(7,7))

Here we can see that the most common salary range is between 30 and 40 thousand dollars.

In [None]:
recent_grads['Employed'].hist(bins=20, range=(0,400000), figsize=(7,7))

The most common number of employed graduates falls in the first bin, between 0 and 20 thousand (20 bins over a range of 400,000).

In [None]:
recent_grads['ShareWomen'].hist(bins=10, range=(0,1), figsize=(7,7))

The most common shares of women in the majors in this dataset fall between 0.6 and 0.8.

In [None]:
recent_grads['Unemployment_rate'].hist(bins=20, range=(0,0.2), figsize=(7,7))

And the most common unemployment rates range from 5% to 6.25%.
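The fullest bin of each histogram can be found numerically with `numpy.histogram`, which takes the same `bins` and `range` arguments as the calls above. A sketch; `modal_bin` and the toy series are hypothetical:

```python
import numpy as np
import pandas as pd

def modal_bin(series, bins, value_range):
    """Return (left edge, right edge, count) of the fullest histogram bin."""
    counts, edges = np.histogram(series.dropna(), bins=bins, range=value_range)
    i = counts.argmax()  # on ties, the first (leftmost) fullest bin is returned
    return edges[i], edges[i + 1], counts[i]

# Toy data standing in for recent_grads['Median']; the values are made up.
toy = pd.Series([31000, 33000, 34000, 38000, 45000, 62000])
left, right, count = modal_bin(toy, bins=20, value_range=(0, 100000))
print(left, right, count)
```

In the notebook, `modal_bin(recent_grads['Median'], 20, (0, 100000))` would return the edges of the most common salary bin, and the other three columns work the same way with their own `bins` and `range`.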

We'll now use scatter matrices to see potential relationships and distributions between two columns simultaneously. First, we'll use the 'Sample_size' and 'Median' columns. Then we'll plot a scatter matrix for the 'Sample_size', 'Median' and 'Unemployment_rate' columns together.

In [None]:
from pandas.plotting import scatter_matrix

scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10));

In [None]:
scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']], figsize=(10,10));

The histograms and scatter plots show no correlation between those columns.

# Conclusions

* There's no correlation between the total number of graduates in a major and the median salary of its graduates;
* There's a small negative correlation between the share of women and the median salary;
* There's no link between the number of full-time employees and median salary;
* Business is the major category with the most graduates;
* There are more categories of majors with a majority of women than with a majority of men.
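The claim about the largest category can be checked by summing graduates per category. A minimal sketch, assuming the `Total` column holds the number of graduates per major; `largest_category` and the toy frame are hypothetical:

```python
import pandas as pd

def largest_category(df):
    """Return (category, graduate count) for the category with the most graduates."""
    totals = df.groupby('Major_category')['Total'].sum()
    return totals.idxmax(), totals.max()

# Toy data standing in for recent_grads; the values are made up.
toy = pd.DataFrame({
    'Major_category': ['Business', 'Business', 'Engineering', 'Arts'],
    'Total': [5000, 7000, 6000, 3000],
})
name, total = largest_category(toy)
print(name, total)
```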
