## Visualizing Earnings Based on College Majors

In this project, we'll explore the job outcomes of students who graduated from college between 2010 and 2012 in the United States. 
The dataset we work with was obtained from the American Community Survey and cleaned by FiveThirtyEight.

Using visualizations, we will answer the following questions:
- Do students in more popular majors make more money? (scatter plots)
- How many majors are predominantly male? Predominantly female? (histograms)
- Which category of majors has the most students? (bar plots)


Before we start creating data visualizations, let's import the libraries we need and remove rows containing null values.

In [None]:
# Import pandas and matplotlib into the environment
import pandas as pd
import matplotlib.pyplot as plt

# Run the Jupyter magic %matplotlib inline so that the plots are displayed inline
%matplotlib inline

# Read the dataset into a DataFrame and start exploring the data
recent_grads = pd.read_csv('recent-grads.csv')

In [None]:
# Use DataFrame.iloc[] to return the first row formatted as a table
recent_grads.iloc[0]

In [None]:
# Use DataFrame.head() to become familiar with how the data is structured
recent_grads.head()

In [None]:
# Use DataFrame.tail() to become familiar with how the data is structured
recent_grads.tail()

In [None]:
# Use DataFrame.describe() to generate summary statistics for all of the numeric columns
recent_grads.describe()

In [None]:
# Drop rows with missing values. Matplotlib expects the columns of values we pass in to have matching lengths, and missing values will cause matplotlib to throw errors.
# Look up the number of rows in recent_grads and assign the value to raw_data_count (173 rows)
raw_data_count = len(recent_grads)

In [None]:
# Use DataFrame.dropna() to drop rows containing missing values and assign the resulting DataFrame back to recent_grads
recent_grads = recent_grads.dropna()

In [None]:
# Look up the number of rows in recent_grads now and assign the value to cleaned_data_count.
# If you compare cleaned_data_count and raw_data_count, you'll notice that only one row contained missing values and was dropped.
cleaned_data_count = len(recent_grads)  # 172 rows after dropna()
cleaned_data_count
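
To see which rows `dropna()` removes, rather than only how many, a boolean mask can be built before dropping. The following is a minimal sketch on a made-up three-row DataFrame (`toy` and its values are assumptions for illustration, not part of the dataset); the same calls should work on `recent_grads` before it is cleaned.

```python
import numpy as np
import pandas as pd

# Toy stand-in for recent_grads: three majors, one with a missing value
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Median": [50000, 40000, 30000],
})

raw_count = len(toy)

# Boolean mask: True for rows that have at least one missing value
has_missing = toy.isna().any(axis=1)
rows_with_missing = toy[has_missing]

# dropna() removes exactly the rows flagged by the mask
cleaned = toy.dropna()
cleaned_count = len(cleaned)

print(rows_with_missing)
print(raw_count - cleaned_count)  # number of dropped rows
```

Inspecting the flagged rows first makes it easier to judge whether dropping them is acceptable for the analysis.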

In [None]:
# Generate scatter plots in separate Jupyter notebook cells to explore the following relations:

# Sample_size and Median
plt.scatter(recent_grads['Sample_size'], recent_grads['Median'])
plt.show()

In [None]:
# Sample_size and Unemployment_rate
plt.scatter(recent_grads['Sample_size'], recent_grads['Unemployment_rate'])
plt.show()

In [None]:
# Full_time and Median 
plt.scatter(recent_grads['Full_time'], recent_grads['Median'])
plt.show()

In [None]:
# ShareWomen and Median
plt.scatter(recent_grads['ShareWomen'], recent_grads['Median'])
plt.show()

In [None]:
# Men and Median 
plt.scatter(recent_grads['Men'], recent_grads['Median'])
plt.show()

In [None]:
# Women and Median
plt.scatter(recent_grads['Women'], recent_grads['Median'])
plt.show()
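
The scatter cells above differ only in the pair of columns they plot, and none of them labels its axes. One possible way to reduce the repetition is a small helper; `scatter_with_labels` is a hypothetical name introduced here, and the toy values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

def scatter_with_labels(df, x, y):
    """Hypothetical helper: scatter plot of df[x] against df[y] with labeled axes."""
    fig, ax = plt.subplots()
    ax.scatter(df[x], df[y])
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(f"{x} vs. {y}")
    return ax

# Toy data standing in for recent_grads (values are made up)
toy = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16],
    "Median": [110000, 75000, 73000, 70000],
})

ax = scatter_with_labels(toy, "Sample_size", "Median")
```

In the notebook, the same helper could be called once per column pair, for example `scatter_with_labels(recent_grads, 'Men', 'Median')`.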

Use the plots to explore the following questions:
- Do students in more popular majors make more money?
- Do students that majored in subjects that were majority female make more money?
- Is there any link between the number of full-time employees and median salary?

To answer the first question accurately (whether students in more popular majors make more money), Total has to be compared with Median, as shown below.

In [None]:
# Do students in more popular majors make more money?
plt.scatter(recent_grads['Total'], recent_grads['Median'])
plt.show()

There seems to be no significant linear correlation between the popularity of a major and the amount of money its graduates earn.
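
A visual impression like this can be backed up with a number. The sketch below computes the Pearson correlation coefficient with `Series.corr` on made-up values (the `toy` DataFrame is an assumption for illustration); on the real data the call would be `recent_grads['Total'].corr(recent_grads['Median'])`.

```python
import pandas as pd

# Toy data standing in for recent_grads (values are made up)
toy = pd.DataFrame({
    "Total": [1000, 2000, 3000, 4000, 5000],
    "Median": [40000, 35000, 45000, 38000, 42000],
})

# Pearson correlation coefficient between the two columns; values near 0
# indicate little linear relationship, values near -1 or 1 a strong one
r = toy["Total"].corr(toy["Median"])
print(round(r, 3))
```

Pearson's coefficient only measures linear association, so a value close to zero does not rule out a non-linear pattern in the scatter plot.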

In addition, the scatter plot of the share of women in each major against the median salary of full-time, year-round graduates suggests that graduates of majority-female majors earn slightly less than graduates of majority-male majors.

Moreover, there seems to be no clear linear correlation between the number of full-time employees of different majors and the median salary.

In [None]:
# Generate histograms to explore the distributions of the following columns: 
# Sample_size
recent_grads["Sample_size"].hist(bins=10, range=(0,5000))
plt.xlabel('Sample Size')
plt.ylabel('Number of Majors')

The histogram above shows that the majority of majors have a sample size of at most 500 graduates who work full-time.

In [None]:
# Median
recent_grads['Median'].hist(bins=30, range=(0,150000))
plt.xlabel('Median salary of full-time year-round graduates, per year')
plt.ylabel('Number of majors')

Most majors have graduates whose median salary falls between $30,000 and $40,000 a year.
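
Reading the fullest bin off a histogram by eye is approximate. As a rough sketch, `numpy.histogram` can return the bin counts and edges for the same binning (30 bins over 0 to 150000, so each bin is 5000 wide). The salaries below are made up for illustration; on the real data, `recent_grads['Median']` would take the place of `medians`.

```python
import numpy as np
import pandas as pd

# Toy median salaries standing in for recent_grads['Median'] (values are made up)
medians = pd.Series([32000, 35000, 36000, 38000, 45000, 52000, 61000, 110000])

# Same binning as the histogram above: 30 bins over 0..150000
counts, edges = np.histogram(medians, bins=30, range=(0, 150000))

# Index of the fullest bin, then its lower edge, upper edge, and count
top = counts.argmax()
print(edges[top], edges[top + 1], counts[top])
```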

In [None]:
# Employed
recent_grads["Employed"].hist(bins=40, range=(0,310000))
plt.xlabel('Number of employed graduates')
plt.ylabel('Number of Majors')

Most majors have up to 20,000 employed graduates, whereas a small number of majors have more than 100,000 employed graduates.

In [None]:
# Full_time
recent_grads["Full_time"].hist(bins=40, range=(0,300000))
plt.xlabel('Number of Full-time graduates')
plt.ylabel('Number of Majors')

Most majors have up to 20,000 graduates who work full-time, whereas a small number of majors have more than 100,000 full-time graduates.
The Employed and Full_time distributions look very similar, but the full-time counts appear slightly lower than the employed counts.

In [None]:
# ShareWomen
recent_grads["ShareWomen"].hist(bins=5, range=(0,1))
plt.xlabel('Share of Women')
plt.ylabel('Number of Majors')

About 50% of the majors appear to be predominantly female, and about 50% predominantly male.
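
With only five bins, the histogram gives a coarse answer to the question of how many majors are predominantly male or female. A direct count is one alternative; the sketch below assumes "predominantly female" means `ShareWomen` above 0.5 and uses made-up values in a `toy` DataFrame for illustration.

```python
import pandas as pd

# Toy data standing in for recent_grads (values are made up)
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "ShareWomen": [0.12, 0.45, 0.68, 0.91],
})

# Assumption: a major is predominantly female when more than half
# of its graduates are women
mostly_female = toy["ShareWomen"] > 0.5

n_female = int(mostly_female.sum())
n_male = int((toy["ShareWomen"] < 0.5).sum())
share_female = mostly_female.mean()  # fraction of majors, not of graduates

print(n_female, n_male, share_female)
```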

In [None]:
# Unemployment_rate
recent_grads["Unemployment_rate"].hist(bins=25, range=(0,0.2))
plt.xlabel('Unemployment Rate')
plt.ylabel('Number of Majors')

Most majors have an unemployment rate of 4 to 12 percent. 

In [None]:
# Men
recent_grads["Men"].hist(bins=10, range=(0,175000))
plt.xlabel('Number of Male Graduates')
plt.ylabel('Number of Majors')

Most majors have around 15,000 male graduates.

In [None]:
# Women
recent_grads["Women"].hist(bins=10, range=(0,310000))
plt.xlabel('Number of Female Graduates')
plt.ylabel('Number of Majors')

Most majors have around 25,000 female graduates.

To conclude, approximately 50% of the majors are predominantly male and approximately 50% are predominantly female. In total, however, there are many more female graduates than male graduates.

In addition, the most common median salary range is between $30,000 and $40,000 a year.

In [None]:
# Create a 2 by 2 scatter matrix plot using the Sample_size and Median columns
pd.plotting.scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))

The scatter matrix above shows that there is no significant linear correlation between the sample size and median salary. 

In [None]:
# Create a 3 by 3 scatter matrix plot using the Sample_size, Median, and Unemployment_rate
pd.plotting.scatter_matrix(recent_grads[['Sample_size', 'Median', 'Unemployment_rate']])

There seems to be a weak linear correlation between the median salary and the unemployment rate.
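
A scatter matrix shows every pairwise relationship at once, and `DataFrame.corr()` gives the numeric counterpart as a table of pairwise Pearson coefficients. This is a minimal sketch on made-up values (the `toy` DataFrame is an assumption for illustration); on the real data the same three columns of `recent_grads` could be selected.

```python
import pandas as pd

# Toy data standing in for recent_grads (values are made up)
toy = pd.DataFrame({
    "Sample_size": [36, 7, 3, 16, 289],
    "Median": [110000, 75000, 73000, 70000, 65000],
    "Unemployment_rate": [0.018, 0.117, 0.024, 0.050, 0.061],
})

# Pairwise Pearson correlations; the diagonal is 1 and the table is symmetric
corr = toy.corr()
print(corr.round(2))
```

Each off-diagonal cell corresponds to one off-diagonal panel of the scatter matrix.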

In [None]:
# Explore the questions from the last step using these scatter matrix plots
pd.plotting.scatter_matrix(recent_grads[['Women', 'Men','Total']], figsize=(10,10))

There is a clear linear correlation between the total number of graduates and the numbers of men and women, which is to be expected. More interesting is that the correlation between the total number of graduates and the number of women graduates seems noticeably stronger than the correlation between the total and the number of male graduates.
From the histograms above, we can estimate that about 50% of the majors are predominantly female and about 50% are predominantly male. The describe() output at the top of the page, however, shows a median share of 52.2% women among graduates across all majors, as opposed to 47.8% men.
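
Two different quantities are easy to mix up here: the typical share of women per major, where every major counts equally, and the share of women among all graduates, where large majors weigh more. The sketch below illustrates the difference on made-up numbers; the `toy` DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data standing in for recent_grads (values are made up)
toy = pd.DataFrame({
    "Men": [900, 400, 1000],
    "Women": [100, 600, 4000],
})
toy["Total"] = toy["Men"] + toy["Women"]
toy["ShareWomen"] = toy["Women"] / toy["Total"]

# Average of the per-major shares: every major counts equally
mean_share = toy["ShareWomen"].mean()

# Share of women among all graduates: large majors weigh more
overall_share = toy["Women"].sum() / toy["Total"].sum()

print(round(mean_share, 3), round(overall_share, 3))
```

When the largest majors are majority-female, the overall share can be well above the per-major average, which is consistent with seeing roughly balanced majors alongside more female graduates in total.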

In [None]:
# Use bar plots to compare the percentages of women (ShareWomen) from the first and last ten rows of the recent_grads dataframe
# First ten rows
recent_grads[:10].plot.bar(x='Major', y='ShareWomen')

In [None]:
# Last ten rows
recent_grads[-10:].plot.bar(x='Major', y='ShareWomen')

In the two bar plots, the following majors stand out with a high percentage of women graduates:
- Astronomy and Astrophysics
- Communication disorders sciences and services
- Early Childhood Education
- Other Foreign Languages
- Drama and Theater Arts
- Composition and Rhetoric
- Zoology
- Educational Psychology
- Clinical Psychology
- Counseling Psychology
- Library Science

In [None]:
# Use bar plots to compare the unemployment rate (Unemployment_rate) from the first ten rows and last ten rows of the recent_grads dataframe
# First ten rows
recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate')

In [None]:
# The last ten rows
recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate')

The unemployment rate is highest in the following majors:
- Clinical Psychology
- Library Science
- Other Foreign Languages
- Nuclear Engineering
- Mining and Mineral Engineering
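
Instead of reading the tallest bars off the plots, the rows can be ranked directly. A minimal sketch using `DataFrame.nlargest` on made-up values follows (the `toy` DataFrame is an assumption for illustration); applied to the first and last ten rows of `recent_grads`, it would list the majors with the highest unemployment rate among those twenty.

```python
import pandas as pd

# Toy data standing in for a slice of recent_grads (values are made up)
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Unemployment_rate": [0.02, 0.15, 0.10, 0.05],
})

# The two rows with the highest unemployment rate, highest first
top2 = toy.nlargest(2, "Unemployment_rate")
print(top2[["Major", "Unemployment_rate"]])
```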

In conclusion, both predominantly female and predominantly male majors can have high unemployment rates.
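
The introduction also asks which category of majors has the most students, which the bar plots above do not address directly. One possible approach is to sum `Total` per category and sort. The sketch below assumes the dataset has a `Major_category` column (it is not used elsewhere in this notebook) and runs on made-up values.

```python
import pandas as pd

# Toy data; the Major_category column name is an assumption about the dataset
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Business", "Education"],
    "Total": [2000, 3000, 9000, 4000],
})

# Sum the number of students per category, largest category first
totals = (
    toy.groupby("Major_category")["Total"]
    .sum()
    .sort_values(ascending=False)
)
print(totals)

# totals.plot.bar() would draw the corresponding bar plot
```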