<a href="https://colab.research.google.com/github/PranayPrasanth/100DaysOfCode-DataScience-Projects/blob/master/College_Majors_vs_Salaries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing College Major Salaries

## Introduction

This notebook explores a dataset containing information about college majors and their corresponding salaries at different stages of a career. By analyzing this data, we aim to identify trends, patterns, and insights related to the earning potential associated with various fields of study.

Understanding the financial outcomes of different college majors can be a valuable resource for students making decisions about their education and future career paths. This analysis will provide a data-driven perspective on which majors tend to lead to higher starting salaries, higher mid-career salaries, and the overall salary spread within each field.

## Objective

The primary objective of this notebook is to perform an exploratory data analysis on the provided dataset to answer the following questions:

- Which college majors have the highest starting median salaries?
- Which college majors have the highest mid-career median salaries?
- Which college majors have the lowest starting median salaries?
- Which college majors have the largest and smallest salary spreads (difference between the 90th and 10th percentile mid-career salaries)?
- How do the different major groups (STEM, HASS, Business) compare in terms of salary potential?

Through this analysis, we hope to provide useful insights for students weighing their choice of college major and its potential financial implications.

## Upgrade Plotly



In [None]:
# Set the plotting backend to inline for displaying plots in the notebook
%matplotlib inline

In [None]:
# Upgrade the plotly library to the latest version
%pip install --upgrade plotly




## Import Statements

In [None]:
# Import necessary libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

## Notebook Presentation

In [None]:
# Set pandas display options to format floating-point numbers with two decimal places and comma separators
pd.options.display.float_format = '{:,.2f}'.format
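
To see what this display option does, here is a minimal sketch. The `demo` frame and its values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Same display option as in the cell above
pd.options.display.float_format = '{:,.2f}'.format

# Hypothetical values, used only to illustrate the formatting
demo = pd.DataFrame({'Salary': [46000.0, 101000.5]})

# The stored formatter is an ordinary callable, so it can be checked directly
formatted = pd.options.display.float_format(46000.0)
print(formatted)  # 46,000.00

# Printing the frame now shows comma separators and two decimal places
print(demo)
```

The option only changes how floats are displayed; the underlying values in the DataFrame are unchanged.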

## Load the Data

In [None]:
# Load the dataset from a CSV file into a pandas DataFrame
df = pd.read_csv('salaries_by_college_major.csv')
# Display the first few rows of the DataFrame to get a glimpse of the data
df.head()

,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,46000.0,77100.0,42200.0,152000.0,Business
1,Aerospace Engineering,57700.0,101000.0,64300.0,161000.0,STEM
2,Agriculture,42600.0,71900.0,36300.0,150000.0,Business
3,Anthropology,36800.0,61500.0,33800.0,138000.0,HASS
4,Architecture,41600.0,76800.0,50600.0,136000.0,Business


## Preliminary Data Exploration

In [None]:
# Get the shape of the DataFrame (number of rows and columns)
shape = df.shape
# Get the column names of the DataFrame
columns = df.columns

# Print the shape and column names
print(f"Shape: {shape}\nColumns: {columns}")
# Check if there are any missing values in the DataFrame
df.isna().any().any() # Check for missing values
# Drop rows with missing values in-place
df.dropna(inplace=True)
# Check for duplicate rows
duplicates = df[df.duplicated()]
# Print the number of duplicate rows
print(f"Number of Duplicate Rows: {duplicates.shape[0]}")

Shape: (51, 6)
Columns: Index(['Undergraduate Major', 'Starting Median Salary',
       'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary',
       'Mid-Career 90th Percentile Salary', 'Group'],
      dtype='object')
Number of Duplicate Rows: 0
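
Before dropping rows, it can help to look at which rows actually contain missing values. The sketch below assumes a small stand-in frame (`demo`, not the real dataset) with one empty trailing row, and uses `dropna()` without `inplace=True` so the original stays available for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the salary data, with one all-empty trailing row
demo = pd.DataFrame({
    'Undergraduate Major': ['Accounting', 'Anthropology', np.nan],
    'Starting Median Salary': [46000.0, 36800.0, np.nan],
})

# Rows that contain at least one missing value
rows_with_na = demo[demo.isna().any(axis=1)]
print(f"Rows with missing values: {len(rows_with_na)}")

# dropna() returns a new frame and leaves the original untouched
clean = demo.dropna()
print(f"Shape before: {demo.shape}, after: {clean.shape}")
```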


## Exploratory Data Analysis

In [None]:
# Get the number of unique values in the 'Undergraduate Major' column
df['Undergraduate Major'].nunique()

# Sort the DataFrame by 'Starting Median Salary' in descending order and select the top 10 rows
top10_majors_by_salary = df.sort_values(by='Starting Median Salary', ascending=False)[:10]
# Display the top 10 majors by starting median salary
top10_majors_by_salary
# Create a string of the top 10 majors for printing
top_10_majors_str = '\n'.join(top10_majors_by_salary['Undergraduate Major'])
# Find the major with the highest starting median salary
highest_paid_major = df.loc[df['Starting Median Salary'].idxmax(), 'Undergraduate Major']

# Print the highest paid undergraduate major and the top 10 majors by starting median salary
print(f"Highest paid Undergraduate Major: {highest_paid_major}", end='\n\n')
print(f"Top 10 Undergraduate Majors by Starting Median Salary: \n{top_10_majors_str}", end='\n\n')

# Create a bar chart using Plotly Express to visualize the starting median salary by major
fig = px.bar(top10_majors_by_salary, x='Undergraduate Major', y='Starting Median Salary',
             title='Starting Median Salary by Major',
             color='Undergraduate Major',
             labels={'Undergraduate Major': 'Major', 'Starting Median Salary': 'Starting Salary'})
# Update the layout of the chart to rotate x-axis labels for better readability and hide the legend
fig.update_layout(xaxis_tickangle=-45, showlegend=False)  # Rotate x-axis labels for readability
# Display the chart
fig.show()

Highest paid Undergraduate Major: Physician Assistant

Top 10 Undergraduate Majors by Starting Median Salary: 
Physician Assistant
Chemical Engineering
Computer Engineering
Electrical Engineering
Mechanical Engineering
Aerospace Engineering
Industrial Engineering
Computer Science
Nursing
Civil Engineering
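
As an alternative to sorting and slicing, pandas provides `nlargest` and `nsmallest`. The sketch below uses a small stand-in frame (`demo`) built from the five rows shown in the `df.head()` output, so the ranking it prints applies to those five rows only. When there are no ties at the cutoff, the result matches sorting and taking the first rows.

```python
import pandas as pd

# Small stand-in frame; values copied from the df.head() output
demo = pd.DataFrame({
    'Undergraduate Major': ['Accounting', 'Aerospace Engineering', 'Agriculture',
                            'Anthropology', 'Architecture'],
    'Starting Median Salary': [46000.0, 57700.0, 42600.0, 36800.0, 41600.0],
})

# Three highest and three lowest starting median salaries
top3 = demo.nlargest(3, 'Starting Median Salary')
bottom3 = demo.nsmallest(3, 'Starting Median Salary')

print(top3['Undergraduate Major'].tolist())
print(bottom3['Undergraduate Major'].tolist())
```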



In [None]:
# Sort the DataFrame by 'Mid-Career Median Salary' in descending order and select the top 10 rows
top_10_majors_by_midcareer_salaries = df.sort_values(by='Mid-Career Median Salary', ascending=False)[:10]
# Find the major with the highest mid-career median salary
highest_paid_major_by_midcareer_salary = df.loc[df['Mid-Career Median Salary'].idxmax(), 'Undergraduate Major']

# Create a string of the top 10 majors by mid-career median salary for printing
top_10_majors_by_midcareersalaries_str = "\n".join(top_10_majors_by_midcareer_salaries['Undergraduate Major'])

# Print the highest paid undergraduate major by mid-career and the top 10 majors by mid-career median salary
print(f"Highest paid Undergraduate Major by Mid-Career: {highest_paid_major_by_midcareer_salary}", end='\n')
print(f"Top 10 Undergraduate Majors by Mid-Career Median Salary: \n{top_10_majors_by_midcareersalaries_str}", end='\n\n')

# Create a bar chart using Plotly Express to visualize the mid-career median salary of the top 10 mid-career majors
fig = px.bar(top_10_majors_by_midcareer_salaries, x='Undergraduate Major', y='Mid-Career Median Salary',
             title='Mid-Career Median Salary by Major',
             color='Undergraduate Major',
             labels={'Undergraduate Major': 'Major', 'Mid-Career Median Salary': 'Mid-Career Salary'})
# Update the layout of the chart to rotate x-axis labels for better readability and hide the legend
fig.update_layout(xaxis_tickangle=-45, showlegend=False)  # Rotate x-axis labels for readability
# Display the chart
fig.show()

Highest paid Undergraduate Major by Mid-Career: Chemical Engineering
Top 10 Undergraduate Majors by Mid-Career Median Salary: 
Chemical Engineering
Computer Engineering
Electrical Engineering
Aerospace Engineering
Economics
Physics
Computer Science
Industrial Engineering
Mechanical Engineering
Math



In [None]:
# Sort the DataFrame by 'Starting Median Salary' in ascending order and select the top 10 rows
top_10_lowest_paid_undergraduate_majors = df.sort_values(by='Starting Median Salary', ascending=True)[:10]
# Create a string of the top 10 lowest paid undergraduate majors for printing
top_10_lowest_paid_undergraduate_majors_str = "\n".join(top_10_lowest_paid_undergraduate_majors['Undergraduate Major'])
# Find the lowest paid undergraduate major
lowest_paid_undergraduate_major = df.loc[df['Starting Median Salary'].idxmin(), 'Undergraduate Major']

# Print the lowest paid undergraduate major and the top 10 lowest paid undergraduate majors
print(f"Lowest Paid Undergraduate Major: {lowest_paid_undergraduate_major}\n\nTop 10 Lowest Paid Undergraduate Majors:\n{top_10_lowest_paid_undergraduate_majors_str} ")

# Create a bar chart using Plotly Express to visualize the starting median salary by major (using the top 10 lowest paid majors)
fig = px.bar(top_10_lowest_paid_undergraduate_majors, x='Undergraduate Major', y='Starting Median Salary',
             title='Starting Median Salary by Major (Lowest 10)',
             color='Undergraduate Major',
             labels={'Undergraduate Major': 'Major', 'Starting Median Salary': 'Starting Salary'})
# Update the layout of the chart to rotate x-axis labels for better readability and hide the legend
fig.update_layout(xaxis_tickangle=-45, showlegend=False)  # Rotate x-axis labels for readability
# Display the chart
fig.show()

Lowest Paid Undergraduate Major: Spanish

Top 10 Lowest Paid Undergraduate Majors:
Spanish
Religion
Education
Criminal Justice
Journalism
Graphic Design
Art History
Drama
Psychology
Music 


In [None]:
# Calculate the salary spread by subtracting the 10th percentile mid-career salary from the 90th percentile mid-career salary
high_low_spread = df['Mid-Career 90th Percentile Salary'] - df['Mid-Career 10th Percentile Salary']

# Insert the calculated 'Spread' column into the DataFrame at index 6
df.insert(6, 'Spread', high_low_spread)

# Display the first few rows of the DataFrame with the new 'Spread' column
df.head()

# Sort the DataFrame by 'Spread' in ascending order and select the top 10 rows (low risk majors)
top_10_low_risk_majors = df.sort_values(by='Spread')[:10]
# Create a string of the top 10 low risk majors for printing
top_10_low_risk_majors_str = "\n".join(top_10_low_risk_majors['Undergraduate Major'])
# Display the 'Undergraduate Major' and 'Spread' columns for the top 10 low risk majors
top_10_low_risk_majors[['Undergraduate Major', 'Spread']]

,Undergraduate Major,Spread
40,Nursing,50700.0
43,Physician Assistant,57600.0
41,Nutrition,65300.0
49,Spanish,65400.0
27,Health Care Administration,66400.0
47,Religion,66700.0
23,Forestry,70000.0
32,Interior Design,71300.0
18,Education,72700.0
15,Criminal Justice,74800.0
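
One caveat with `df.insert()` is that running the cell a second time raises a `ValueError`, because the `Spread` column already exists. A possible alternative is `assign()`, which overwrites the column if it is already there. The sketch below uses a stand-in frame (`demo`) with three rows taken from the `df.head()` output.

```python
import pandas as pd

# Stand-in frame; values copied from the df.head() output
demo = pd.DataFrame({
    'Undergraduate Major': ['Accounting', 'Aerospace Engineering', 'Anthropology'],
    'Mid-Career 10th Percentile Salary': [42200.0, 64300.0, 33800.0],
    'Mid-Career 90th Percentile Salary': [152000.0, 161000.0, 138000.0],
})

def add_spread(frame):
    """Return a copy of frame with a 'Spread' column (90th minus 10th percentile)."""
    spread = frame['Mid-Career 90th Percentile Salary'] - frame['Mid-Career 10th Percentile Salary']
    return frame.assign(Spread=spread)

# Calling this twice is safe: the second call just recomputes the column
demo = add_spread(demo)
demo = add_spread(demo)

print(demo[['Undergraduate Major', 'Spread']])
```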


In [None]:
# Sort the DataFrame by 'Spread' in descending order and select the top 10 rows (high risk majors)
top_10_high_risk_majors = df.sort_values(by='Spread', ascending=False)[:10]
# Create a string of the top 10 high risk majors for printing
top_10_high_risk_majors_str = "\n".join(top_10_high_risk_majors['Undergraduate Major'])

# Display the 'Undergraduate Major' and 'Spread' columns for the top 10 high risk majors
top_10_high_risk_majors[['Undergraduate Major', 'Spread']]

,Undergraduate Major,Spread
17,Economics,159400.0
22,Finance,147800.0
37,Math,137800.0
36,Marketing,132900.0
42,Philosophy,132500.0
45,Political Science,126800.0
8,Chemical Engineering,122100.0
44,Physics,122000.0
33,International Relations,118800.0
16,Drama,116300.0


In [None]:
# Sort the DataFrame by 'Mid-Career 90th Percentile Salary' in descending order to find majors with the highest earning potential
highest_potential_majors = df.sort_values(by='Mid-Career 90th Percentile Salary', ascending=False)
# Display the 'Undergraduate Major' and 'Mid-Career 90th Percentile Salary' columns for the top 10 majors with highest potential
highest_potential_majors[['Undergraduate Major', 'Mid-Career 90th Percentile Salary']].head(10)

,Undergraduate Major,Mid-Career 90th Percentile Salary
17,Economics,210000.0
22,Finance,195000.0
8,Chemical Engineering,194000.0
37,Math,183000.0
44,Physics,178000.0
36,Marketing,175000.0
30,Industrial Engineering,173000.0
14,Construction,171000.0
42,Philosophy,168000.0
19,Electrical Engineering,168000.0


In [None]:
# Group the DataFrame by the 'Group' column and calculate the mean for various salary columns
df.groupby('Group')[['Starting Median Salary', 'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary', 'Mid-Career 90th Percentile Salary']].mean()

Group,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary
Business,44633.33,75083.33,43566.67,147525.0
HASS,37186.36,62968.18,34145.45,129363.64
STEM,53862.5,90812.5,56025.0,157625.0
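
Group means are easier to interpret alongside the number of majors in each group. The sketch below shows one way to get both with named aggregation. It uses a stand-in frame (`demo`) built from the five `df.head()` rows, and the output column names (`majors`, `mean_starting`, `median_starting`) are chosen here for illustration.

```python
import pandas as pd

# Stand-in frame; values copied from the df.head() output
demo = pd.DataFrame({
    'Undergraduate Major': ['Accounting', 'Aerospace Engineering', 'Agriculture',
                            'Anthropology', 'Architecture'],
    'Starting Median Salary': [46000.0, 57700.0, 42600.0, 36800.0, 41600.0],
    'Group': ['Business', 'STEM', 'Business', 'HASS', 'Business'],
})

# Named aggregation: each keyword becomes an output column
summary = demo.groupby('Group').agg(
    majors=('Undergraduate Major', 'count'),
    mean_starting=('Starting Median Salary', 'mean'),
    median_starting=('Starting Median Salary', 'median'),
)

print(summary)
```

With only five rows the STEM and HASS groups contain a single major each, so the figures here illustrate the call rather than the dataset's actual group averages.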


## Conclusion

In this notebook, we explored a dataset containing salary information for various college majors. Our analysis revealed several key insights:

- Highest Paying Majors: By starting median salary, Physician Assistant, Chemical Engineering, and Computer Engineering ranked highest. By mid-career median salary, Chemical Engineering, Computer Engineering, and Electrical Engineering ranked highest.

- Lowest Paying Majors: Spanish, Religion, and Education had the lowest starting median salaries.

- Risk and Potential: Economics (159,400) and Finance (147,800) had the widest gaps between the 90th and 10th percentile mid-career salaries, indicating higher risk but also higher earning potential. Conversely, Nursing (50,700) and Physician Assistant (57,600) had the narrowest spreads, suggesting lower risk but also less potential for extremely high earnings.

- Major Groups: STEM majors had the highest average starting median salary (about 53,900), ahead of Business (about 44,600) and HASS (Humanities, Arts, and Social Sciences; about 37,200). The same ordering holds for the average mid-career median salary (about 90,800, 75,100, and 63,000, respectively).


Overall, the choice of a college major can significantly influence earning potential. While individual career paths and personal factors play a crucial role in financial success, understanding salary trends associated with different majors can assist students in making informed decisions about their education and future careers.

Further research could involve:

- Investigating the impact of experience and location on salaries within specific majors.


- Analyzing the job market demand and growth prospects for different majors.


- Exploring the relationship between job satisfaction and salary across various majors.


By incorporating these factors, students can gain a more comprehensive understanding of what contributes to a fulfilling and financially rewarding career path.