# Analysis of the Post-University Salaries of Graduates by Major
**College degrees are very expensive, but do they pay you back?** PayScale Inc. ran a year-long survey of 1.2 million Americans with only a bachelor's degree. We'll dig into this data and use pandas to answer these questions:

- Which degrees have the highest starting salaries? 

- Which majors have the lowest earnings after college?

- Which degrees have the highest earning potential?

- What are the lowest risk college majors from an earnings standpoint?

- Do business, STEM (Science, Technology, Engineering, Mathematics) or HASS (Humanities, Arts, Social Science) degrees earn more on average?

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('salaries_by_college_major.csv')

## Preliminary Data Exploration and Data Cleaning

In [3]:
df.head()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,46000.0,77100.0,42200.0,152000.0,Business
1,Aerospace Engineering,57700.0,101000.0,64300.0,161000.0,STEM
2,Agriculture,42600.0,71900.0,36300.0,150000.0,Business
3,Anthropology,36800.0,61500.0,33800.0,138000.0,HASS
4,Architecture,41600.0,76800.0,50600.0,136000.0,Business


In [4]:
df.tail()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
46,Psychology,35900.0,60400.0,31600.0,127000.0,HASS
47,Religion,34100.0,52000.0,29700.0,96400.0,HASS
48,Sociology,36500.0,58200.0,30700.0,118000.0,HASS
49,Spanish,34000.0,53100.0,31000.0,96400.0,HASS
50,Source: PayScale Inc.,,,,,


In [5]:
# shape is an attribute, not a method: (rows, columns)
df.shape

(51, 6)

In [6]:
df.columns

Index(['Undergraduate Major', 'Starting Median Salary',
       'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary',
       'Mid-Career 90th Percentile Salary', 'Group'],
      dtype='object')

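### Missing Values and Junk Data

The last row (index 50) contains only the source attribution, with `NaN` in every other column. `dropna()` removes it, leaving the 50 majors.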

In [7]:
#df.isna()

In [8]:
clean_df = df.dropna()
clean_df.tail()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
45,Political Science,40800.0,78200.0,41200.0,168000.0,HASS
46,Psychology,35900.0,60400.0,31600.0,127000.0,HASS
47,Religion,34100.0,52000.0,29700.0,96400.0,HASS
48,Sociology,36500.0,58200.0,30700.0,118000.0,HASS
49,Spanish,34000.0,53100.0,31000.0,96400.0,HASS
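Before dropping rows, it can help to see where the missing values actually are. The following is a minimal sketch on a hypothetical four-row excerpt of the CSV (the names `csv_text`, `sample_df` and `sample_clean` are illustrative and not part of the notebook), assuming the file has the same layout as the tables above.

```python
import io
import pandas as pd

# Hypothetical excerpt of the CSV, including the footer row with the source note.
csv_text = """Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
Accounting,46000.0,77100.0,42200.0,152000.0,Business
Sociology,36500.0,58200.0,30700.0,118000.0,HASS
Spanish,34000.0,53100.0,31000.0,96400.0,HASS
Source: PayScale Inc.,,,,,
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column before deciding what to drop.
missing_per_column = sample_df.isna().sum()

# Select the rows that contain at least one missing value.
rows_with_nan = sample_df[sample_df.isna().any(axis=1)]

# dropna() returns a new DataFrame; sample_df itself is left unchanged.
sample_clean = sample_df.dropna()

print(missing_per_column)
print(rows_with_nan['Undergraduate Major'].tolist())
print(sample_clean.shape)
```

Checking `rows_with_nan` first confirms that only the footer row is affected, so `dropna()` does not silently discard a real major.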


## Find the College Major with the Highest Starting Salary
The `.idxmax()` method returns the index label of the row with the largest value.

In [9]:
clean_df['Starting Median Salary'].max()

74300.0

In [10]:
clean_df['Starting Median Salary'].idxmax()

43

In [11]:
clean_df['Undergraduate Major'].loc[43]
#clean_df['Undergraduate Major'][43]

'Physician Assistant'

In [12]:
# clean_df['Undergraduate Major'] would give a single column;
# .loc[43] on the DataFrame returns the whole row whose index label is 43
clean_df.loc[43]

Undergraduate Major                  Physician Assistant
Starting Median Salary                           74300.0
Mid-Career Median Salary                         91700.0
Mid-Career 10th Percentile Salary                66400.0
Mid-Career 90th Percentile Salary               124000.0
Group                                               STEM
Name: 43, dtype: object
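One detail worth spelling out is that `.idxmax()` returns an index *label*, not a row position, which is why it pairs with `.loc` rather than `.iloc`. The sketch below uses four values taken from the tables above; the variable names are illustrative, and the gap in the index is deliberate to make the difference visible.

```python
import pandas as pd

# Four starting salaries from the tables above, keeping their original labels.
salaries = pd.Series(
    [46000.0, 57700.0, 74300.0, 34000.0],
    index=[0, 1, 43, 49],
    name='Starting Median Salary',
)
majors = pd.Series(
    ['Accounting', 'Aerospace Engineering', 'Physician Assistant', 'Spanish'],
    index=[0, 1, 43, 49],
    name='Undergraduate Major',
)

top_label = salaries.idxmax()      # index label of the largest value
top_position = salaries.argmax()   # integer position of the largest value

print(top_label, top_position)
print(majors.loc[top_label])       # label-based lookup
print(majors.iloc[top_position])   # position-based lookup
```

In the full notebook the labels and positions happen to coincide, but after filtering or sorting they no longer do, so sticking to `.idxmax()` with `.loc` is the safer habit.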

### What college major has the highest mid-career salary? How much do graduates with this major earn? (Mid-career is defined as having 10+ years of experience).

In [13]:
clean_df.head()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,46000.0,77100.0,42200.0,152000.0,Business
1,Aerospace Engineering,57700.0,101000.0,64300.0,161000.0,STEM
2,Agriculture,42600.0,71900.0,36300.0,150000.0,Business
3,Anthropology,36800.0,61500.0,33800.0,138000.0,HASS
4,Architecture,41600.0,76800.0,50600.0,136000.0,Business


In [14]:
clean_df['Undergraduate Major'].loc[clean_df['Mid-Career Median Salary'].idxmax()]

'Chemical Engineering'

In [15]:
print(f"The max salary for mid-career: {clean_df['Mid-Career Median Salary'].max()}")
clean_df['Undergraduate Major'].loc[8]

The max salary for mid-career: 107000.0


'Chemical Engineering'

### Which college major has the lowest starting salary and how much do graduates earn after university?

In [16]:
clean_df['Undergraduate Major'].loc[clean_df['Starting Median Salary'].idxmin()]

'Spanish'
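The cell above returns only the name of the major. To answer the "how much" part of the question as well, both columns can be read in one `.loc` call. This is a minimal sketch on the last four rows shown in the `tail()` output; `sample`, `lowest_label` and `lowest_row` are illustrative names.

```python
import pandas as pd

# Last four majors from the tail() output, with their original index labels.
sample = pd.DataFrame(
    {
        'Undergraduate Major': ['Psychology', 'Religion', 'Sociology', 'Spanish'],
        'Starting Median Salary': [35900.0, 34100.0, 36500.0, 34000.0],
    },
    index=[46, 47, 48, 49],
)

# Label of the row with the smallest starting salary.
lowest_label = sample['Starting Median Salary'].idxmin()

# Pull the major and its salary from that row in a single lookup.
lowest_row = sample.loc[lowest_label, ['Undergraduate Major', 'Starting Median Salary']]

print(f"{lowest_row['Undergraduate Major']}: {lowest_row['Starting Median Salary']}")
```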

### Which college major has the lowest mid-career salary and how much can people expect to earn with this degree? 

In [17]:
print(f"The lowest mid-career salary is : {clean_df['Mid-Career Median Salary'].min()}")

The lowest mid-career salary is : 52000.0


In [18]:
clean_df['Undergraduate Major'].loc[clean_df['Mid-Career Median Salary'].idxmin()]

'Education'
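Note that Religion also shows a mid-career median of 52000.0 in the `tail()` output, so Education and Religion appear to be tied; `.idxmin()` reports only the first occurrence. The sketch below shows one way to list every tied row with a boolean mask. Education's value is inferred from the output above, and the row labels here are illustrative rather than the dataset's own.

```python
import pandas as pd

# Education's salary is inferred from the minimum printed above;
# the other rows come from the tail() output. Row labels are illustrative.
sample = pd.DataFrame(
    {
        'Undergraduate Major': ['Education', 'Religion', 'Sociology', 'Spanish'],
        'Mid-Career Median Salary': [52000.0, 52000.0, 58200.0, 53100.0],
    }
)

min_salary = sample['Mid-Career Median Salary'].min()

# idxmin() returns only the first label at which the minimum occurs.
first_label = sample['Mid-Career Median Salary'].idxmin()

# A boolean mask keeps every row that matches the minimum.
tied_majors = sample.loc[
    sample['Mid-Career Median Salary'] == min_salary, 'Undergraduate Major'
].tolist()

print(sample.loc[first_label, 'Undergraduate Major'])
print(tied_majors)
```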

## Sorting Values & Adding Columns: Majors with the Most Potential vs Lowest Risk


A low-risk major is a degree where there is a small difference between the lowest and highest salaries. In other words, if the difference between the 10th percentile and the 90th percentile earnings of your major is small, then we can be more certain about your salary after you graduate.

In [19]:
spread_col = clean_df['Mid-Career 90th Percentile Salary'] - clean_df['Mid-Career 10th Percentile Salary']

In [20]:
spread_col

0     109800.0
1      96700.0
2     113700.0
3     104200.0
4      85400.0
5      96200.0
6      98100.0
7     108200.0
8     122100.0
9     102700.0
10     84600.0
11    105500.0
12     95900.0
13     98000.0
14    114700.0
15     74800.0
16    116300.0
17    159400.0
18     72700.0
19     98700.0
20     99600.0
21    102100.0
22    147800.0
23     70000.0
24     92000.0
25    111000.0
26     76000.0
27     66400.0
28    112000.0
29     88500.0
30    115900.0
31     84500.0
32     71300.0
33    118800.0
34    106600.0
35    100700.0
36    132900.0
37    137800.0
38     99300.0
39    107300.0
40     50700.0
41     65300.0
42    132500.0
43     57600.0
44    122000.0
45    126800.0
46     95400.0
47     66700.0
48     87300.0
49     65400.0
dtype: float64

In [21]:
clean_df.insert(1, 'Spread', spread_col)

In [22]:
clean_df.head(1)

Unnamed: 0,Undergraduate Major,Spread,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,109800.0,46000.0,77100.0,42200.0,152000.0,Business
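One caveat with `.insert()` is that it modifies the DataFrame in place and raises a `ValueError` if the column already exists, so re-running the cell fails. A possible alternative, sketched below on the first two rows of the data, is `.assign()`, which returns a new DataFrame and can be repeated safely. The helper `add_spread` is hypothetical and not part of the notebook.

```python
import pandas as pd

# First two rows of the dataset, limited to the columns needed for the spread.
sample = pd.DataFrame(
    {
        'Undergraduate Major': ['Accounting', 'Aerospace Engineering'],
        'Mid-Career 10th Percentile Salary': [42200.0, 64300.0],
        'Mid-Career 90th Percentile Salary': [152000.0, 161000.0],
    }
)


def add_spread(frame):
    """Return a copy of frame with a 'Spread' column (90th minus 10th percentile)."""
    spread = (
        frame['Mid-Career 90th Percentile Salary']
        - frame['Mid-Career 10th Percentile Salary']
    )
    return frame.assign(Spread=spread)


with_spread = add_spread(sample)
with_spread_again = add_spread(with_spread)  # repeating overwrites the column

# insert() refuses to add a column name that is already present.
mutable = sample.copy()
mutable.insert(1, 'Spread', 0.0)
insert_failed = False
try:
    mutable.insert(1, 'Spread', 0.0)
except ValueError:
    insert_failed = True

print(with_spread['Spread'].tolist())
print(insert_failed)
```

The trade-off is that `.assign()` appends the new column at the end, whereas `.insert()` lets you choose its position.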


In [23]:
low_risk_df = clean_df.sort_values('Spread')

In [24]:
low_risk_df[['Undergraduate Major', 'Spread']].head()

Unnamed: 0,Undergraduate Major,Spread
40,Nursing,50700.0
43,Physician Assistant,57600.0
41,Nutrition,65300.0
49,Spanish,65400.0
27,Health Care Administration,66400.0


In [27]:
low_risk_df[['Group', 'Spread']].head()

Unnamed: 0,Group,Spread
40,Business,50700.0
43,STEM,57600.0
41,HASS,65300.0
49,HASS,65400.0
27,Business,66400.0
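When only the top or bottom few rows are needed, `nsmallest()` and `nlargest()` can stand in for `sort_values(...).head()`. The sketch below uses six majors and spreads taken from the outputs above; `sample`, `lowest_risk` and `highest_risk` are illustrative names.

```python
import pandas as pd

# Six majors with the spreads computed earlier, keeping their index labels.
sample = pd.DataFrame(
    {
        'Undergraduate Major': [
            'Accounting', 'Economics', 'Nursing',
            'Nutrition', 'Physician Assistant', 'Spanish',
        ],
        'Spread': [109800.0, 159400.0, 50700.0, 65300.0, 57600.0, 65400.0],
    },
    index=[0, 17, 40, 41, 43, 49],
)

# Three majors with the narrowest gap between 10th and 90th percentile pay.
lowest_risk = sample.nsmallest(3, 'Spread')

# Two majors with the widest gap.
highest_risk = sample.nlargest(2, 'Spread')

print(lowest_risk['Undergraduate Major'].tolist())
print(highest_risk['Undergraduate Major'].tolist())
```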


### Majors with highest potential

In [28]:
highest_potential = clean_df.sort_values(['Mid-Career 90th Percentile Salary'], ascending=False)

In [29]:
highest_potential[['Undergraduate Major', 'Mid-Career 90th Percentile Salary']].head()

Unnamed: 0,Undergraduate Major,Mid-Career 90th Percentile Salary
17,Economics,210000.0
22,Finance,195000.0
8,Chemical Engineering,194000.0
37,Math,183000.0
44,Physics,178000.0


### Greatest Spread: Highest-Risk Majors

In [30]:
highest_spread = clean_df.sort_values('Spread', ascending=False)
highest_spread[['Undergraduate Major', 'Spread', 'Mid-Career 90th Percentile Salary']].head()

Unnamed: 0,Undergraduate Major,Spread,Mid-Career 90th Percentile Salary
17,Economics,159400.0,210000.0
22,Finance,147800.0,195000.0
37,Math,137800.0,183000.0
36,Marketing,132900.0,175000.0
42,Philosophy,132500.0,168000.0


## Which Category of Degree Has the Highest Average Salary?

In [34]:
# numeric_only=True skips the text column 'Undergraduate Major'
clean_df.groupby('Group').mean(numeric_only=True)


Group,Spread,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary
Business,103958.333333,44633.333333,75083.333333,43566.666667,147525.0
HASS,95218.181818,37186.363636,62968.181818,34145.454545,129363.636364
STEM,101600.0,53862.5,90812.5,56025.0,157625.0


In [33]:
clean_df.groupby(['Group']).count()

Group,Undergraduate Major,Spread,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary
Business,12,12,12,12,12,12
HASS,22,22,22,22,22,22
STEM,16,16,16,16,16,16
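The mean and the count can also be computed in a single pass with named aggregation. The sketch below runs on six majors from the tables above, so the averages it prints describe this small sample only, not the full dataset; the output column names `majors` and `mean_starting` are illustrative.

```python
import pandas as pd

# Six majors from the head() and tail() outputs above.
sample = pd.DataFrame(
    {
        'Undergraduate Major': [
            'Accounting', 'Aerospace Engineering', 'Agriculture',
            'Anthropology', 'Architecture', 'Psychology',
        ],
        'Starting Median Salary': [
            46000.0, 57700.0, 42600.0, 36800.0, 41600.0, 35900.0,
        ],
        'Group': ['Business', 'STEM', 'Business', 'HASS', 'Business', 'HASS'],
    }
)

# Named aggregation: one output column per (input column, function) pair.
summary = (
    sample.groupby('Group')
    .agg(
        majors=('Undergraduate Major', 'count'),
        mean_starting=('Starting Median Salary', 'mean'),
    )
    .sort_values('mean_starting', ascending=False)
)

print(summary)
```

Showing the count next to the mean makes it easier to judge how many majors each group average rests on.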


In [35]:
clean_df.nunique()

Undergraduate Major                  50
Spread                               50
Starting Median Salary               43
Mid-Career Median Salary             49
Mid-Career 10th Percentile Salary    45
Mid-Career 90th Percentile Salary    43
Group                                 3
dtype: int64