## EDA

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("../data/clean_data.csv")

In [3]:
df.head()

| | Rank | Major | Degree Type | Early Career Pay | Mid-Career Pay | % High Meaning |
|---|---|---|---|---|---|---|
| 0 | 1 | Petroleum Engineering | Bachelors | 98100 | 212100 | 60.0 |
| 1 | 2 | Operations Research & Industrial Engineering | Bachelors | 101200 | 202600 | 21.0 |
| 2 | 3 | Electrical Engineering & Computer Science (EECS) | Bachelors | 128500 | 192300 | 45.0 |
| 3 | 4 | Interaction Design | Bachelors | 77400 | 178800 | 55.0 |
| 4 | 5 | Building Science | Bachelors | 71100 | 172400 | 46.0 |


### About the data 

The data was scraped from https://www.payscale.com/college-salary-report/majors-that-pay-you-back/bachelors. Each row describes one major and the pay reported by its bachelor's degree holders. The pay columns give annual pay in dollars ($) early in a career and at mid-career. "% High Meaning" is the percentage of alumni who say their job makes the world a better place.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 763 entries, 0 to 762
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Rank              763 non-null    int64  
 1   Major             763 non-null    object 
 2   Degree Type       763 non-null    object 
 3   Early Career Pay  763 non-null    int64  
 4   Mid-Career Pay    763 non-null    int64  
 5   % High Meaning    708 non-null    float64
dtypes: float64(1), int64(3), object(2)
memory usage: 35.9+ KB
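
The `info()` output shows 708 non-null values out of 763 for "% High Meaning", which implies 55 missing entries. A minimal sketch of how to count missing values directly is shown below. The `toy` frame and its values are made up purely to demonstrate the calls; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Toy frame with made-up values, used only to demonstrate the calls.
# In the notebook, replace `toy` with `df`.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "% High Meaning": [60.0, None, 45.0, None],
})

missing_counts = toy.isna().sum()        # number of NaN entries per column
missing_share = toy.isna().mean() * 100  # percentage of NaN entries per column

print(missing_counts)
print(missing_share.round(1))
```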


### Descriptive Statistical Analysis 

In [5]:
df[['Early Career Pay' ,'Mid-Career Pay', '% High Meaning']].describe()

| | Early Career Pay | Mid-Career Pay | % High Meaning |
|---|---|---|---|
| count | 763.0 | 763.0 | 708.0 |
| mean | 61681.520315 | 101476.408912 | 53.764124 |
| std | 11627.792157 | 24459.70749 | 13.502978 |
| min | 39600.0 | 46600.0 | 18.0 |
| 25% | 53050.0 | 83750.0 | 44.0 |
| 50% | 58900.0 | 97400.0 | 52.0 |
| 75% | 69000.0 | 115400.0 | 63.0 |
| max | 128500.0 | 212100.0 | 95.0 |


- The count for "% High Meaning" is lower (708 versus 763), so 55 entries are NaN.
- The minimum early-career pay is \\$39,600 and the maximum is \\$128,500. For mid-career pay, the minimum is \\$46,600 and the maximum is \\$212,100.
- The large standard deviation (about \\$24,460) and the maximum of \\$212,100 suggest a wide salary spread among mid-career individuals.
- The pay columns may contain high-value outliers. In both columns the mean exceeds the median (\\$61,682 vs. \\$58,900 for early-career pay, \\$101,476 vs. \\$97,400 for mid-career pay), which indicates a slightly right-skewed distribution.
- For "% High Meaning", the mean (53.8) and median (52.0) are close, which suggests a roughly symmetric distribution. This alone does not establish normality.
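
The skewness reading above can be checked numerically by comparing mean and median and by computing the sample skewness. The sketch below is one possible way to do it. The helper name `skew_summary` is hypothetical, and `sample` holds only the five rows shown by `df.head()`, so its numbers will not match the full dataset. In the notebook, pass the three numeric columns of `df` instead.

```python
import pandas as pd

# Only the five rows shown by df.head(); use the full `df` in the notebook.
sample = pd.DataFrame({
    "Early Career Pay": [98100, 101200, 128500, 77400, 71100],
    "Mid-Career Pay": [212100, 202600, 192300, 178800, 172400],
    "% High Meaning": [60.0, 21.0, 45.0, 55.0, 46.0],
})

def skew_summary(frame):
    """Return mean, median, sample skewness, and mean minus median per column."""
    summary = pd.DataFrame({
        "mean": frame.mean(),
        "median": frame.median(),
        "skew": frame.skew(),
    })
    # A positive difference (mean above median) points to right skew.
    summary["mean_minus_median"] = summary["mean"] - summary["median"]
    return summary

print(skew_summary(sample))
```

A positive `skew` together with a positive `mean_minus_median` is consistent with a right-skewed column. A histogram would still be a useful visual check.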

### Questions

#### 1. Which major has the highest early-career and mid-career pay?

In [6]:
max_index = df['Early Career Pay'].idxmax()
print("Early Career Pay")
print(df.loc[max_index])

Early Career Pay
Rank                                                               3
Major               Electrical Engineering & Computer Science (EECS)
Degree Type                                                Bachelors
Early Career Pay                                              128500
Mid-Career Pay                                                192300
% High Meaning                                                  45.0
Name: 2, dtype: object


- EECS has the highest early-career pay in the dataset, at \\$128,500 a year.
- Its mid-career pay of \\$192,300 is also well above the 75th percentile (\\$115,400).
- Despite the high pay, its "% High Meaning" score is relatively low: 45% of alumni report high meaning, which is close to the 25th percentile of the dataset (44%).
- As with many high-paying careers, stress may be a byproduct of the money earned, although this dataset does not measure it.
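
Statements that compare one major against the rest of the dataset can be checked with a percentile-style calculation. The sketch below assumes a hypothetical helper, `percent_below`, and uses only the five rows shown by `df.head()`, so the printed percentages describe that small sample and not the full dataset. In the notebook, pass columns of `df` instead.

```python
import pandas as pd

def percent_below(series, value):
    """Percentage of non-missing entries that are strictly below `value`."""
    valid = series.dropna()
    return (valid < value).mean() * 100

# Only the five rows shown by df.head(), indexed by major.
sample = pd.DataFrame(
    {
        "Early Career Pay": [98100, 101200, 128500, 77400, 71100],
        "% High Meaning": [60.0, 21.0, 45.0, 55.0, 46.0],
    },
    index=[
        "Petroleum Engineering",
        "Operations Research & Industrial Engineering",
        "Electrical Engineering & Computer Science (EECS)",
        "Interaction Design",
        "Building Science",
    ],
)

eecs = sample.loc["Electrical Engineering & Computer Science (EECS)"]
pay_pct = percent_below(sample["Early Career Pay"], eecs["Early Career Pay"])
meaning_pct = percent_below(sample["% High Meaning"], eecs["% High Meaning"])

print(f"Early pay is above {pay_pct:.0f}% of the sample")
print(f"Meaning score is above {meaning_pct:.0f}% of the sample")
```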

In [7]:
max_index = df['Mid-Career Pay'].idxmax()
print("Mid-Career Pay")
print(df.loc[max_index])

Mid-Career Pay
Rank                                    1
Major               Petroleum Engineering
Degree Type                     Bachelors
Early Career Pay                    98100
Mid-Career Pay                     212100
% High Meaning                       60.0
Name: 0, dtype: object


- Petroleum Engineering does not have the highest early-career pay, but its starting salary of \\$98,100 is still well above the 75th percentile (\\$69,000).
- It reaches the highest mid-career pay in the dataset, at \\$212,100 per year.
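
Looking at a single maximum hides how close the runners-up are. As a possible extension, `DataFrame.nlargest` returns the top rows for a column in one call. The sketch below uses only the five rows shown by `df.head()`. In the notebook, call `nlargest` on `df` directly.

```python
import pandas as pd

# Only the five rows shown by df.head(); use the full `df` in the notebook.
sample = pd.DataFrame({
    "Major": [
        "Petroleum Engineering",
        "Operations Research & Industrial Engineering",
        "Electrical Engineering & Computer Science (EECS)",
        "Interaction Design",
        "Building Science",
    ],
    "Early Career Pay": [98100, 101200, 128500, 77400, 71100],
    "Mid-Career Pay": [212100, 202600, 192300, 178800, 172400],
})

# Top three majors by each pay column, highest first.
top_early = sample.nlargest(3, "Early Career Pay")
top_mid = sample.nlargest(3, "Mid-Career Pay")

print(top_early[["Major", "Early Career Pay"]])
print(top_mid[["Major", "Mid-Career Pay"]])
```

`DataFrame.nsmallest` works the same way for the lowest-paying majors.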

#### 2. Which major has the lowest early-career and mid-career pay?

In [8]:
min_index = df['Early Career Pay'].idxmin()
print("Early Career Pay")
print(df.loc[min_index])

Early Career Pay
Rank                          735
Major               Voice & Opera
Degree Type             Bachelors
Early Career Pay            39600
Mid-Career Pay              65100
% High Meaning               52.0
Name: 735, dtype: object


- If you're aiming to get rich, Voice & Opera might not be the right choice. It has the lowest early-career pay (\\$39,600), and even after years of hard work its mid-career pay of \\$65,100 still falls below the 25th percentile (\\$83,750).
- On the bright side, its "% High Meaning" score of 52% matches the dataset median and is higher than that of EECS (45%), though lower than that of Petroleum Engineering (60%). (It might be worth plotting the relationship between meaning and salary.)
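
Before plotting, the relationship between pay and meaning could be summarized with a correlation matrix. The sketch below computes Pearson correlations (the pandas default) on the five rows shown by `df.head()` only, so the values it prints say nothing reliable about the full dataset. In the notebook, run the same call on `df`. Rows with a missing "% High Meaning" are excluded pairwise by `corr()`.

```python
import pandas as pd

# Only the five rows shown by df.head(); use the full `df` in the notebook.
sample = pd.DataFrame({
    "Early Career Pay": [98100, 101200, 128500, 77400, 71100],
    "Mid-Career Pay": [212100, 202600, 192300, 178800, 172400],
    "% High Meaning": [60.0, 21.0, 45.0, 55.0, 46.0],
})

cols = ["Early Career Pay", "Mid-Career Pay", "% High Meaning"]

# Pairwise Pearson correlation between the three numeric columns.
corr = sample[cols].corr()

print(corr.round(2))
```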

In [12]:
min_index = df['Mid-Career Pay'].idxmin()
print("Mid-Career Pay")
print(df.loc[min_index])

Mid-Career Pay
Rank                          763
Major               Metalsmithing
Degree Type             Bachelors
Early Career Pay            45900
Mid-Career Pay              46600
% High Meaning               36.0
Name: 762, dtype: object


- After all the physical toll of working the forge, mid-career pay rises by only \\$700, from \\$45,900 to \\$46,600.
- After accounting for inflation, that likely amounts to a decline in real earnings over time.
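
The gap between early-career and mid-career pay can be made explicit as a derived column. The sketch below uses the four majors printed in the outputs above. The column names "Pay Growth" and "Growth %" are hypothetical additions, not columns of the original data.

```python
import pandas as pd

# The four majors printed in the outputs above, indexed by major.
rows = pd.DataFrame({
    "Major": [
        "Petroleum Engineering",
        "Electrical Engineering & Computer Science (EECS)",
        "Voice & Opera",
        "Metalsmithing",
    ],
    "Early Career Pay": [98100, 128500, 39600, 45900],
    "Mid-Career Pay": [212100, 192300, 65100, 46600],
}).set_index("Major")

# Absolute and relative growth from early-career to mid-career pay.
rows["Pay Growth"] = rows["Mid-Career Pay"] - rows["Early Career Pay"]
rows["Growth %"] = rows["Pay Growth"] / rows["Early Career Pay"] * 100

print(rows.sort_values("Pay Growth"))
```

Applied to the full `df`, the same two lines would allow ranking every major by growth, not just by pay level.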

#### 3. Group By the Major

In [13]:
df['Major'].nunique()

763

- "Major" has 763 unique values, the same as the number of rows, so every major appears exactly once.
- It is therefore not a categorical field, and grouping by it would produce one row per group.
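
One way to confirm this, and to see whether any other column repeats values at all, is to compare the number of unique values per column with the number of rows. The sketch below uses only the five rows shown by `df.head()`, and the variable name `repeating` is hypothetical. In the notebook, run the same calls on `df`.

```python
import pandas as pd

# Only the five rows shown by df.head(); use the full `df` in the notebook.
sample = pd.DataFrame({
    "Major": [
        "Petroleum Engineering",
        "Operations Research & Industrial Engineering",
        "Electrical Engineering & Computer Science (EECS)",
        "Interaction Design",
        "Building Science",
    ],
    "Degree Type": ["Bachelors"] * 5,
    "Early Career Pay": [98100, 101200, 128500, 77400, 71100],
})

# True when every value in the column occurs exactly once.
print(sample["Major"].is_unique)

# Columns with fewer unique values than rows contain repeated values.
unique_counts = sample.nunique()
repeating = unique_counts[unique_counts < len(sample)].index.tolist()

print(unique_counts)
print(repeating)
```

A column with a single unique value repeats but still gives only one group, so `value_counts()` on each candidate is worth checking before any `groupby`.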