# In Class Exercise 7.4: Picking Your Major
Using the tools we have discussed, we can start looking at associations by group in the dataset `'recent-grads.csv'`, which contains data about employment and salaries for recent college graduates. The data comes from [here](https://github.com/fivethirtyeight/data/tree/master/college-majors) and was used for the story [The Economic Guide to Picking Your Major](https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/), published by [fivethirtyeight](https://fivethirtyeight.com/).

## Imports and Dataframe

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# For slightly nicer charts
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['figure.dpi'] = 150
sns.set(style="ticks")

In [None]:
# load the full dataset and assign it to a variable
df_grad_full = pd.read_csv("recent-grads.csv")

# use label based indexing to create a NEW dataset that only contains a few columns (using .copy())
df = df_grad_full[['Major_category', 'Major', 'ShareWomen', 'Unemployment_rate', 'Median']].copy()

# create a list to use to rename the columns
rename_list = ['Major_Cat', 'Major', 'Prcnt_Female', 'Prcnt_Unemploy', 'Income_Median'] 

# rename columns by assigning the list we made to the dataframe's columns property
df.columns = rename_list 

# check the result
df.head()

Variables represent the following: 
* Major_Cat: Majors grouped into categories
* Major: Title of Major
* Prcnt_Female: Percentage of graduates classified as female
* Prcnt_Unemploy: Rate of unemployment
* Income_Median: Median earnings of full-time, year-round workers

---
We are going to be doing a lot of work with the 'Major' column. To make this work easier, we can set the dataframe index to 'Major' using the following code, so that every result is labeled by major name.

In [None]:
df = df.set_index('Major')
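
To see why this helps, here is a minimal sketch on a made-up three-row frame. The major names and incomes in `toy` are hypothetical, not values from `recent-grads.csv`. Once 'Major' is the index, any column selected as a series carries the major names as its labels, so sorted output is readable without extra work.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Major': ['Major A', 'Major B', 'Major C'],
    'Income_Median': [40000, 65000, 52000],
})

# move the 'Major' column into the index
toy = toy.set_index('Major')

# the series is now labeled by major name instead of 0, 1, 2
ranked = toy['Income_Median'].sort_values()
print(ranked)

# rows can also be looked up directly by major name
print(toy.loc['Major B', 'Income_Median'])
```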

---
## 1. Median Income - Top Majors
Produce a list of top ten majors for median income.

In [None]:
# enter and test code here

df['Income_Median'].sort_values().tail(10)

---
## 2. Median Income - Bottom Majors
Produce a list of the ten majors with the lowest median income.

In [None]:
# enter and test code here

df['Income_Median'].sort_values().head(10)
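
The sort-then-slice pattern used in questions 1 and 2 is one option. As an alternative sketch, pandas also provides `Series.nlargest()` and `Series.nsmallest()`, which return the extreme values directly. The series below is hypothetical toy data, not values from the dataset.

```python
import pandas as pd

# hypothetical toy data, for illustration only
income = pd.Series(
    [40000, 65000, 52000, 31000],
    index=['Major A', 'Major B', 'Major C', 'Major D'],
    name='Income_Median',
)

# two highest values, largest first
top2 = income.nlargest(2)

# two lowest values, smallest first
bottom2 = income.nsmallest(2)

# same rows as sort-then-tail, but tail() lists the largest last
top2_sorted = income.sort_values().tail(2)

print(top2)
print(bottom2)
```

Either approach selects the same rows; they differ only in the order in which the rows are displayed.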

---
## 3. Percent Female - Top Majors
Produce a list of top ten majors for percentage of women graduating.

In [None]:
# enter and test code here

# drop any missing values first so they do not end up in the tail of the sorted series
df['Prcnt_Female'].dropna().sort_values().tail(10)

---
## 4. Percent Female - Bottom Majors
Produce a list of the ten majors with the lowest percentage of women graduating.

In [None]:
# enter and test code here

df['Prcnt_Female'].sort_values().head(10)

---
## 5. Single Correlation
Produce a single Pearson's r statistic for the correlation between median income and percent female.

In [None]:
# enter and test code here

df['Income_Median'].corr(df['Prcnt_Female'])
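
As a rough illustration of what `.corr()` computes, the sketch below works out Pearson's r by hand: the summed product of the deviations from each mean, divided by the square root of the product of the summed squared deviations. The two series are hypothetical toy data chosen for illustration, not values from the dataset.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, for illustration only
x = pd.Series([0.2, 0.4, 0.6, 0.8, 0.9])
y = pd.Series([60000, 52000, 45000, 38000, 36000])

# deviations of each value from its own mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r: summed co-deviation scaled by the spread of each variable
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity
r_pandas = x.corr(y)

print(r_manual, r_pandas)
```

The result always falls between -1 and 1. A negative value, as in this toy example, means one variable tends to fall as the other rises.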

---
## 6. Correlation Table
Produce a correlation table from the dataframe.

In [None]:
# enter and test code here

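# Major_Cat holds text, so restrict the table to numeric columns
# (pandas 2.0+ raises an error otherwise; numeric_only needs pandas 1.5 or later)
df.corr(numeric_only=True)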

---
## 7. Heatmap
Produce a heatmap from the dataframe.

In [None]:
# enter and test code here

# Major_Cat holds text, so restrict the table to numeric columns
correlation_table = df.corr(numeric_only=True)
sns.heatmap(correlation_table, cmap="Blues")

---
## 8. Scatter Matrix
* Select any three "major categories" that weren't used in the assigned reading. (Remember you can use `unique()` to get a list of unique values from a series.)
* Create a new dataframe that only contains entries from the major categories that you selected.
* Create a scatter matrix examining percent female and median income by major category.
* Interpret every cell in the scatter matrix. What do these visualizations tell you about the data?

In [None]:
# enter and test code here

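# print the categories so the output shows even though this is not the cell's last line
print(df['Major_Cat'].unique())
# keep only the rows from the three chosen categories
df3majors = df.query('Major_Cat == "Engineering" or Major_Cat == "Business" or Major_Cat == "Psychology & Social Work"')
# scatter matrix of percent female and median income, colored by major category
sns.pairplot(df3majors, vars=['Prcnt_Female', 'Income_Median'], hue='Major_Cat')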
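
Chaining several `or` conditions gets long as the list of categories grows. One possible alternative, sketched below on hypothetical toy data, is to keep the chosen categories in a list and filter with `isin()`. The `keep` list and the `toy` frame are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Major_Cat': ['Engineering', 'Business', 'Arts', 'Education', 'Business'],
    'Income_Median': [70000, 50000, 30000, 34000, 45000],
})

# categories to keep; swap in any values returned by unique()
keep = ['Engineering', 'Business']

# boolean mask: True where the row's category is in the list
subset_isin = toy[toy['Major_Cat'].isin(keep)]

# equivalent filter written with query(); @keep refers to the Python variable
subset_query = toy.query('Major_Cat in @keep')

print(subset_isin)
```

Both filters return the same rows, so the choice is mostly a matter of readability.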

---
## 9. Some More Data Exploration: 
What are the top five majors with more than 50% female graduates with the highest median income?

In [None]:
# enter and test code here

dffemale = df.query("Prcnt_Female > .50")
dffemale['Income_Median'].sort_values().tail(5)
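
The filter-then-rank steps can also be written as a single chain. The sketch below uses hypothetical toy data and also shows that a row with a missing `Prcnt_Female` value does not pass the `> 0.5` comparison, so it is excluded from the result.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Prcnt_Female': [0.30, 0.75, 0.60, np.nan],
    'Income_Median': [70000, 40000, 48000, 55000],
}, index=['Major A', 'Major B', 'Major C', 'Major D'])

# keep majority-female majors only
majority = toy.query('Prcnt_Female > 0.5')

# rank the filtered rows by income and take the highest one
top = majority.sort_values('Income_Median', ascending=False).head(1)

print(top)
```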

---
## 10. Some More Data Exploration: 
Make a horizontal bar chart visualizing the result to the previous question.

In [None]:
# enter and test code here

femalebar = dffemale['Income_Median'].sort_values().tail(5)

plt.barh(y=femalebar.index, width=femalebar.values)
plt.xlabel("Median Income")
plt.ylabel("Major")
plt.title("Median Income in Majors with Female Majority")
plt.show()

---
## 11. Some More Data Exploration: 
What major has the highest unemployment rate?

In [None]:
# enter and test code here

df['Prcnt_Unemploy'].sort_values().tail(1)
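
When only the single highest entry is needed, sorting the whole series is optional. As a small sketch on hypothetical toy data, `idxmax()` returns the index label of the largest value and `max()` returns the value itself.

```python
import pandas as pd

# hypothetical toy data, for illustration only
unemploy = pd.Series(
    [0.05, 0.12, 0.08],
    index=['Major A', 'Major B', 'Major C'],
    name='Prcnt_Unemploy',
)

# label (major name) of the largest value
worst_major = unemploy.idxmax()

# the largest value itself
worst_rate = unemploy.max()

print(worst_major, worst_rate)
```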

---
## 12. Some More Data Exploration: 
Plot a histogram of the unemployment rate.

In [None]:
# enter and test code here

plt.hist(df['Prcnt_Unemploy'])

---
## 13. Some More Data Exploration: 
Excluding the "Engineering" major category, what are the top five median income majors?

In [None]:
# enter and test code here

dfnoengin = df.query("Major_Cat != 'Engineering'")
dfnoengin['Income_Median'].sort_values().tail(5)

---
## 14. Your Ideal Major: 
Create standard scores for median income, unemployment, and percent female by:
* subtracting the mean value for each column from the value
* then dividing the result by the standard deviation (to do this use the '.std()' method on a series)
* then assigning the result to new columns in the data frame 

Use the standard scores to create your own personal index score: a weighted combination of median income, unemployment, and percent female that helps you find your ideal major.

In [None]:
# calculate standard scores
df['Prcnt_Female_stan'] = (df['Prcnt_Female'] - df['Prcnt_Female'].mean()) / df['Prcnt_Female'].std()
df['Prcnt_Unemploy_stan'] = (df['Prcnt_Unemploy'] - df['Prcnt_Unemploy'].mean()) / df['Prcnt_Unemploy'].std()
df['Income_Median_stan'] = (df['Income_Median'] - df['Income_Median'].mean()) / df['Income_Median'].std()

# example weighted index
df['major_weighted'] = (2 * df['Prcnt_Female_stan']) + (1.5 * df['Income_Median_stan']) - (.5 * df['Prcnt_Unemploy_stan'])

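# drop majors whose score is missing so they do not end up in the tail of the sorted series
df['major_weighted'].dropna().sort_values().tail(5)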
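
Because the same formula is repeated for every column, it can be wrapped in a small helper. The sketch below is one way to do that; `standardize` is a hypothetical helper name, and the frame and weights are toy values for illustration. The key detail is that each column is centered and scaled with its own mean and standard deviation.

```python
import pandas as pd

def standardize(series):
    """Return z-scores: (value - mean) / standard deviation."""
    return (series - series.mean()) / series.std()

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Prcnt_Unemploy': [0.04, 0.06, 0.08, 0.10],
    'Income_Median': [70000, 50000, 40000, 30000],
}, index=['Major A', 'Major B', 'Major C', 'Major D'])

# standardize every column with the same function
toy_stan = toy.apply(standardize)

# each standardized column has mean 0 and standard deviation 1
print(toy_stan.mean().round(10))
print(toy_stan.std().round(10))

# example index with arbitrary weights: reward income, penalize unemployment
score = toy_stan['Income_Median'] - 0.5 * toy_stan['Prcnt_Unemploy']
print(score.sort_values().tail(1))
```

Standardizing first puts income (tens of thousands) and unemployment (fractions) on a common scale, so the weights reflect your priorities rather than the units of each column.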

---
## 15. Ask and Answer Your Own Question: 
You have the tools. Show us what you can do with them. 

In [None]:
# enter and test code here