# In Class Exercise 7.4: Picking Your Major
Using the tools we have discussed, we can start looking at associations by group in the dataset `'recent-grads.csv'`, which contains data about employment and salaries for recent college graduates. The data comes from [here](https://github.com/fivethirtyeight/data/tree/master/college-majors) and was used for the story [The Economic Guide to Picking Your Major](https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/), published by [fivethirtyeight](https://fivethirtyeight.com/).

## Imports and Dataframe

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# For slightly nicer charts
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 150
sns.set(style="ticks")

In [3]:
# open the full dataset and assign it to a variable
df_grad_full = pd.read_csv("recent-grads.csv")

# use label based indexing to create a NEW dataset that only contains a few columns (using .copy())
df = df_grad_full[['Major_category', 'Major', 'ShareWomen', 'Unemployment_rate', 'Median']].copy()

# create a list to use to rename the columns
rename_list = ['Major_Cat', 'Major', 'Percent_Female', 'Percent_Unemploy', 'Income_Median'] 

# rename columns by assigning the list we made to the dataframe's columns property
df.columns = rename_list 

# check the result
df.head()

,Major_Cat,Major,Percent_Female,Percent_Unemploy,Income_Median
0,Engineering,PETROLEUM ENGINEERING,0.120564,0.018381,110000
1,Engineering,MINING AND MINERAL ENGINEERING,0.101852,0.117241,75000
2,Engineering,METALLURGICAL ENGINEERING,0.153037,0.024096,73000
3,Engineering,NAVAL ARCHITECTURE AND MARINE ENGINEERING,0.107313,0.050125,70000
4,Engineering,CHEMICAL ENGINEERING,0.341631,0.061098,65000


Variables represent the following: 
* Major_Cat: Majors grouped into categories
* Major: Title of Major
* Percent_Female: Share of graduates classified as female, stored as a proportion between 0 and 1
* Percent_Unemploy: Rate of unemployment, stored as a proportion between 0 and 1
* Income_Median: Median earnings of full-time, year-round workers

---
We are going to be doing a lot of work with the 'Major' column. To make this work easier, we can set the dataframe index to 'Major' using the following code.

In [4]:
df = df.set_index('Major')
df.head(10)

Major,Major_Cat,Percent_Female,Percent_Unemploy,Income_Median
PETROLEUM ENGINEERING,Engineering,0.120564,0.018381,110000
MINING AND MINERAL ENGINEERING,Engineering,0.101852,0.117241,75000
METALLURGICAL ENGINEERING,Engineering,0.153037,0.024096,73000
NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,0.107313,0.050125,70000
CHEMICAL ENGINEERING,Engineering,0.341631,0.061098,65000
NUCLEAR ENGINEERING,Engineering,0.144967,0.177226,65000
ACTUARIAL SCIENCE,Business,0.441356,0.095652,62000
ASTRONOMY AND ASTROPHYSICS,Physical Sciences,0.535714,0.021167,62000
MECHANICAL ENGINEERING,Engineering,0.119559,0.057342,60000
ELECTRICAL ENGINEERING,Engineering,0.196450,0.059174,60000
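
With 'Major' as the index, rows can be looked up by name using `.loc`. The sketch below illustrates this on a small toy dataframe; the major names and values are made up, and only the column names mirror the exercise.

```python
import pandas as pd

# Toy data with made-up values; only the column names mirror the exercise.
toy = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Major_Cat": ["Cat 1", "Cat 1", "Cat 2"],
    "Income_Median": [50000, 42000, 61000],
}).set_index("Major")

# Look up a whole row by its index label.
row = toy.loc["MAJOR B"]

# Look up a single cell: row label first, then column label.
income_c = toy.loc["MAJOR C", "Income_Median"]

print(row)
print(income_c)
```

The same pattern should work on `df` with a real major name in place of the toy labels.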


---
## 1. Median Income - Top Majors
Produce a list of the top ten majors by median income.

In [5]:
# enter and test code here

# rank by value instead of relying on the row order of the CSV file;
# nlargest() returns the ten highest medians, largest first
df['Income_Median'].nlargest(10)

Major
PETROLEUM ENGINEERING                        110000
MINING AND MINERAL ENGINEERING                75000
METALLURGICAL ENGINEERING                     73000
NAVAL ARCHITECTURE AND MARINE ENGINEERING     70000
CHEMICAL ENGINEERING                          65000
NUCLEAR ENGINEERING                           65000
ACTUARIAL SCIENCE                             62000
ASTRONOMY AND ASTROPHYSICS                    62000
MECHANICAL ENGINEERING                        60000
ELECTRICAL ENGINEERING                        60000
Name: Income_Median, dtype: int64

---
## 2. Median Income - Bottom Majors
Produce a list of the ten majors with the lowest median income.

In [6]:
# enter and test code here
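
One possible pattern for a "lowest n" list, shown on a toy series rather than the real data (the labels and incomes below are made up): either sort ascending and take the head, or call `nsmallest`.

```python
import pandas as pd

# Toy series with made-up incomes, indexed by hypothetical major names.
toy = pd.Series(
    [50000, 42000, 61000, 38000, 45000],
    index=["MAJOR A", "MAJOR B", "MAJOR C", "MAJOR D", "MAJOR E"],
    name="Income_Median",
)

# Option 1: sort ascending, then keep the first n rows.
bottom_sorted = toy.sort_values(ascending=True).head(3)

# Option 2: nsmallest() sorts and slices in one call.
bottom_direct = toy.nsmallest(3)

print(bottom_sorted)
print(bottom_direct)
```

Both options return the same rows here; on the exercise data, replace `toy` with `df['Income_Median']` and `3` with `10`.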



---
## 3. Percent Female - Top Majors
Produce a list of the top ten majors by percentage of women graduating.

In [22]:
# enter and test code here

df['Percent_Female'].sort_values().tail(10)

Major
NURSING                                          0.896019
SOCIAL WORK                                      0.904075
HUMAN SERVICES AND COMMUNITY ORGANIZATION        0.905590
SPECIAL NEEDS EDUCATION                          0.906677
FAMILY AND CONSUMER SCIENCES                     0.910933
ELEMENTARY EDUCATION                             0.923745
MEDICAL ASSISTING SERVICES                       0.927807
COMMUNICATION DISORDERS SCIENCES AND SERVICES    0.967998
EARLY CHILDHOOD EDUCATION                        0.968954
FOOD SCIENCE                                          NaN
Name: Percent_Female, dtype: float64
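
Note that the output above ends with a `NaN` row: `sort_values()` places missing values last by default, so `tail(10)` picks one up and only nine ranked majors remain. The sketch below reproduces the effect on a toy series (made-up values and labels) and shows two ways that should avoid it.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value; labels and numbers are made up.
toy = pd.Series(
    [0.90, np.nan, 0.97, 0.35, 0.92],
    index=["MAJOR A", "MAJOR B", "MAJOR C", "MAJOR D", "MAJOR E"],
    name="Percent_Female",
)

# sort_values() puts NaN last by default, so tail() includes it.
with_nan = toy.sort_values().tail(3)

# nlargest() ignores NaN and returns rows from highest to lowest.
top3 = toy.nlargest(3)

# Equivalent: drop missing values first, then sort descending.
top3_alt = toy.dropna().sort_values(ascending=False).head(3)

print(with_nan)
print(top3)
```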

---
## 4. Percent Female - Bottom Majors
Produce a list of the ten majors with the lowest percentage of women graduating.

In [8]:
# enter and test code here



---
## 5. Single Correlation
Produce a single Pearson's r statistic for the correlation between median income and percent female.

In [9]:
# enter and test code here
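
A minimal sketch of computing a single correlation coefficient with `Series.corr`, which uses Pearson's r by default. The numbers are made up for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Toy data with made-up values; only the column names mirror the exercise.
toy = pd.DataFrame({
    "Percent_Female": [0.10, 0.30, 0.50, 0.70, 0.90],
    "Income_Median": [70000, 60000, 52000, 45000, 40000],
})

# Series.corr() defaults to method="pearson".
r = toy["Income_Median"].corr(toy["Percent_Female"])

print(round(r, 3))
```

The order of the two series does not matter, since Pearson's r is symmetric.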



---
## 6. Correlation Table
Produce a correlation table from the dataframe.

In [10]:
# enter and test code here
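
A correlation table is every pairwise Pearson's r at once. The sketch below assumes a toy dataframe with made-up values and selects the numeric columns first, so the text column (`Major_Cat`) is left out of the calculation.

```python
import pandas as pd

# Toy data with made-up values; only the column names mirror the exercise.
toy = pd.DataFrame({
    "Major_Cat": ["Cat 1", "Cat 1", "Cat 2", "Cat 2"],
    "Percent_Female": [0.2, 0.4, 0.6, 0.8],
    "Percent_Unemploy": [0.05, 0.07, 0.06, 0.09],
    "Income_Median": [65000, 55000, 48000, 40000],
})

# Keep only numeric columns, then compute all pairwise correlations.
corr_table = toy.select_dtypes(include="number").corr()

print(corr_table.round(2))
```

The diagonal is always 1, because each column is perfectly correlated with itself.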



---
## 7. Heatmap
Produce a heatmap from the dataframe.

In [11]:
# enter and test code here
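
One way to draw a heatmap is to pass a correlation table to seaborn's `heatmap`. This is a sketch on made-up toy values; the color map, the fixed -1 to 1 color range, and the title are illustrative choices, not requirements of the exercise.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy numeric data with made-up values.
toy = pd.DataFrame({
    "Percent_Female": [0.2, 0.4, 0.6, 0.8],
    "Percent_Unemploy": [0.05, 0.07, 0.06, 0.09],
    "Income_Median": [65000, 55000, 48000, 40000],
})
corr_table = toy.corr()

fig, ax = plt.subplots()
# annot=True prints each coefficient in its cell;
# vmin/vmax pin the color scale to the full range of Pearson's r.
sns.heatmap(corr_table, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
ax.set_title("Correlation heatmap (toy data)")
```

In a notebook the figure renders at the end of the cell; in a plain script, add `plt.show()`.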



---
## 8. Scatter Matrix
* Select any three "major categories" that weren't used in the assigned reading. (Remember you can use `unique()` to get a list of unique values from a series.) 
* Create a new dataframe that only contains entries from the major categories that you selected
* Create a scatter matrix examining percent female and median income by major category
* Interpret every cell in the scatter matrix. What do these visualizations tell you about the data?

In [12]:
# enter and test code here
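
The filtering step can be sketched with `isin`, which keeps rows whose category appears in a list. The category names and values below are placeholders, not categories from the real dataset.

```python
import pandas as pd

# Toy data; category names and values are made up.
toy = pd.DataFrame({
    "Major_Cat": ["Cat 1", "Cat 2", "Cat 3", "Cat 4", "Cat 2", "Cat 4"],
    "Percent_Female": [0.25, 0.60, 0.40, 0.75, 0.55, 0.80],
    "Income_Median": [62000, 45000, 50000, 38000, 47000, 36000],
})

# List the available categories.
print(toy["Major_Cat"].unique())

# Hypothetical picks; substitute the three categories you chose.
chosen = ["Cat 1", "Cat 2", "Cat 4"]

# Keep only rows in the chosen categories; .copy() makes an independent dataframe.
subset = toy[toy["Major_Cat"].isin(chosen)].copy()

print(subset)
```

The resulting `subset` can then be passed to whichever scatter-matrix function the reading used, for example seaborn's `pairplot` with its `hue` argument set to the category column, if that matches the reading.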



---
## 9. Some More Data Exploration: 
What are the top five majors with more than 50% female graduates with the highest median income?

In [13]:
# enter and test code here
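
Questions of this shape usually break into two steps: filter with a boolean condition, then rank what is left. A sketch on toy data (made-up majors and values):

```python
import pandas as pd

# Toy data with made-up values, indexed by hypothetical major names.
toy = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C", "MAJOR D", "MAJOR E", "MAJOR F"],
    "Percent_Female": [0.30, 0.55, 0.80, 0.51, 0.45, 0.62],
    "Income_Median": [70000, 48000, 35000, 52000, 60000, 41000],
}).set_index("Major")

# Step 1: keep rows where more than half of graduates are female.
majority_female = toy[toy["Percent_Female"] > 0.5]

# Step 2: rank the remaining rows by median income.
top2 = majority_female.nlargest(2, "Income_Median")

print(top2)
```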



---
## 10. Some More Data Exploration: 
Make a horizontal bar chart visualizing the result of the previous question.

In [14]:
# enter and test code here
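
A horizontal bar chart can be drawn straight from a series with `.plot.barh()`. The series below is typed in by hand with made-up values; in the exercise it would come from your answer to the previous question.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up result series, indexed by hypothetical major names.
top = pd.Series(
    [52000, 48000, 41000],
    index=["MAJOR D", "MAJOR B", "MAJOR F"],
    name="Income_Median",
)

fig, ax = plt.subplots()
# barh draws the first row at the bottom, so sort ascending
# to place the largest value at the top of the chart.
top.sort_values().plot.barh(ax=ax)
ax.set_xlabel("Median income")
ax.set_title("Top majors by median income (toy data)")
```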



---
## 11. Some More Data Exploration: 
What major has the highest unemployment rate?

In [15]:
# enter and test code here
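
When the index holds the major names, `idxmax()` returns the label of the largest value and `max()` returns the value itself. A sketch with made-up rates:

```python
import pandas as pd

# Toy series with made-up unemployment rates.
toy = pd.Series(
    [0.05, 0.12, 0.08],
    index=["MAJOR A", "MAJOR B", "MAJOR C"],
    name="Percent_Unemploy",
)

worst_major = toy.idxmax()  # index label of the highest rate
worst_rate = toy.max()      # the highest rate itself

print(worst_major, worst_rate)
```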



---
## 12. Some More Data Exploration: 
Plot a histogram of the unemployment rate.

In [16]:
# enter and test code here
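
A histogram of one column can be sketched with `.plot.hist()`. The rates below are made up, and `bins=5` is an arbitrary choice for such a small sample; the real column will likely want more bins.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy series with made-up unemployment rates.
rates = pd.Series(
    [0.02, 0.05, 0.06, 0.06, 0.07, 0.08, 0.09, 0.12],
    name="Percent_Unemploy",
)

fig, ax = plt.subplots()
# Each bar counts how many values fall inside that bin.
rates.plot.hist(bins=5, ax=ax, edgecolor="white")
ax.set_xlabel("Unemployment rate")
```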



---
## 13. Some More Data Exploration: 
Excluding the "Engineering" major category, what are the top five median income majors?

In [17]:
# enter and test code here
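
Excluding a group is a filter with `!=` followed by a ranking. The sketch below uses made-up majors and incomes; only the column names and the "Engineering" label come from the exercise.

```python
import pandas as pd

# Toy data with made-up values, indexed by hypothetical major names.
toy = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C", "MAJOR D"],
    "Major_Cat": ["Engineering", "Business", "Engineering", "Health"],
    "Income_Median": [80000, 55000, 70000, 48000],
}).set_index("Major")

# Keep every row whose category is not "Engineering".
non_eng = toy[toy["Major_Cat"] != "Engineering"]

# Rank the remaining rows by median income.
top2 = non_eng.nlargest(2, "Income_Median")

print(top2)
```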



---
## 14. Your Ideal Major: 
Create standard scores for median income, unemployment, and percent female by:
* subtracting the mean value for each column from the value
* then dividing the result by the standard deviation (to do this use the `.std()` method on a series)
* then assigning the result to new columns in the data frame 

Use the standard scores to build your own personal index score: a weighted combination of median income, unemployment, and percent female that identifies your ideal major. 

In [18]:
# enter and test code here
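
The steps above can be sketched as follows on toy data. All values are made up, the `_z` suffix and the `Index_Score` column name are naming choices for this example, and the weights are hypothetical: they encode one person's preferences (high income good, high unemployment bad) and are meant to be replaced with your own.

```python
import pandas as pd

# Toy data with made-up values, indexed by hypothetical major names.
toy = pd.DataFrame(
    {
        "Income_Median": [70000, 48000, 35000, 52000],
        "Percent_Unemploy": [0.05, 0.09, 0.07, 0.04],
        "Percent_Female": [0.30, 0.55, 0.80, 0.51],
    },
    index=["MAJOR A", "MAJOR B", "MAJOR C", "MAJOR D"],
)

# Standard score: subtract the column mean, divide by the column std.
for col in ["Income_Median", "Percent_Unemploy", "Percent_Female"]:
    toy[col + "_z"] = (toy[col] - toy[col].mean()) / toy[col].std()

# Hypothetical weights; a negative weight penalizes high values.
weights = {
    "Income_Median_z": 0.5,
    "Percent_Unemploy_z": -0.3,
    "Percent_Female_z": 0.2,
}

# Weighted combination of the standard scores.
toy["Index_Score"] = sum(toy[col] * w for col, w in weights.items())

# Highest index score first.
ranked = toy.sort_values("Index_Score", ascending=False)
print(ranked[["Index_Score"]])
```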

---
## 15. Ask and Answer Your Own Question: 
You have the tools. Show us what you can do with them. 

In [19]:
# enter and test code here