# Table of Contents
 <p><div class="lev1"><a href="#Challenge:-Summarizing-Data"><span class="toc-item-num">1&nbsp;&nbsp;</span>Challenge: Summarizing Data</a></div><div class="lev1"><a href="#Introduction-To-The-Data"><span class="toc-item-num">2&nbsp;&nbsp;</span>Introduction To The Data</a></div><div class="lev1"><a href="#Summarizing-Major-Categories"><span class="toc-item-num">3&nbsp;&nbsp;</span>Summarizing Major Categories</a></div><div class="lev1"><a href="#Low-Wage-Job-Rates"><span class="toc-item-num">4&nbsp;&nbsp;</span>Low-Wage Job Rates</a></div><div class="lev1"><a href="#Comparing-Data-Sets"><span class="toc-item-num">5&nbsp;&nbsp;</span>Comparing Data Sets</a></div>

Challenge: Summarizing Data
===========================

The American Community Survey is a U.S. Census Bureau survey that collects data on everything from housing affordability to industry employment rates. For this challenge, we'll be using data that the team at FiveThirtyEight derived from the 2010-2012 American Community Surveys. FiveThirtyEight cleaned the data set and made it available in a GitHub repository.

Here's a quick overview of the files we'll be working with:

- **all-ages.csv** - Employment data by major for all ages
- **recent-grads.csv** - Employment data by major for recent college graduates only

Here are descriptions for a few of the columns (out of 21 total columns):

- **Rank** - The major's numerical rank, by post-graduation median earnings
- **Major_code** - The major's numerical code
- **Major** - The major's description
- **Major_category** - The major's category
- **Total** - The total number of people who studied the major
- **Men** - The number of men who studied the major
- **Women** - The number of women who studied the major
- **ShareWomen** - The share of women (from 0 to 1) who studied the major
- **Employed** - The number of people who studied the major and obtained a job after graduating


Here are the first few rows and columns in recent-grads.csv. The data set all-ages.csv has the same structure, but with different values for some of the columns:

In [2]:
from IPython.display import display, HTML
display(HTML('<table class="table table-bordered"> <thead><tr> <th>Rank</th> <th>Major_code</th> <th>Major</th> <th>Major_category</th> <th>Total</th> <th>Sample_size</th> <th>Men</th> <th>Women</th> <th>ShareWomen</th> <th>Employed</th> </tr> </thead> <tbody> <tr> <td>1</td> <td>2419</td> <td>PETROLEUM ENGINEERING</td> <td>Engineering</td> <td>2339</td> <td>36</td> <td>2057</td> <td>282</td> <td>0.120564</td> <td>1976</td> </tr> <tr> <td>2</td> <td>2416</td> <td>MINING AND MINERAL ENGINEERING</td> <td>Engineering</td> <td>756</td> <td>7</td> <td>679</td> <td>77</td> <td>0.101852</td> <td>640</td> </tr> <tr> <td>3</td> <td>2415</td> <td>METALLURGICAL ENGINEERING</td> <td>Engineering</td> <td>856</td> <td>3</td> <td>725</td> <td>131</td> <td>0.153037</td> <td>648</td> </tr> <tr> <td>4</td> <td>2417</td> <td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td> <td>Engineering</td> <td>1258</td> <td>16</td> <td>1123</td> <td>135</td> <td>0.107313</td> <td>758</td> </tr> <tr> <td>5</td> <td>2405</td> <td>CHEMICAL ENGINEERING</td> <td>Engineering</td> <td>32260</td> <td>289</td> <td>21239</td> <td>11021</td> <td>0.341631</td> <td>25694 </td> </tr> </tbody> </table>'))

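| Rank | Major_code | Major | Major_category | Total | Sample_size | Men | Women | ShareWomen | Employed |
|------|------------|-------|----------------|-------|-------------|-----|-------|------------|----------|
| 1 | 2419 | PETROLEUM ENGINEERING | Engineering | 2339 | 36 | 2057 | 282 | 0.120564 | 1976 |
| 2 | 2416 | MINING AND MINERAL ENGINEERING | Engineering | 756 | 7 | 679 | 77 | 0.101852 | 640 |
| 3 | 2415 | METALLURGICAL ENGINEERING | Engineering | 856 | 3 | 725 | 131 | 0.153037 | 648 |
| 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | Engineering | 1258 | 16 | 1123 | 135 | 0.107313 | 758 |
| 5 | 2405 | CHEMICAL ENGINEERING | Engineering | 32260 | 289 | 21239 | 11021 | 0.341631 | 25694 |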
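
The column descriptions suggest that ShareWomen is simply Women divided by Total. As a quick sanity check, here is a minimal sketch that recomputes the share for the five rows shown above. The rows are retyped by hand, and the names `sample_rows`, `recomputed`, and `max_diff` are made up for this illustration.

```python
import pandas as pd

# The five preview rows shown above, retyped by hand (illustration only).
sample_rows = pd.DataFrame({
    "Total": [2339, 756, 856, 1258, 32260],
    "Men": [2057, 679, 725, 1123, 21239],
    "Women": [282, 77, 131, 135, 11021],
    "ShareWomen": [0.120564, 0.101852, 0.153037, 0.107313, 0.341631],
})

# Recompute the share of women from the raw counts.
recomputed = sample_rows["Women"] / sample_rows["Total"]

# Largest absolute gap between the recomputed and the published share.
max_diff = (recomputed - sample_rows["ShareWomen"]).abs().max()
print(max_diff)

# Men and Women should also add up to Total in these rows.
print((sample_rows["Men"] + sample_rows["Women"] == sample_rows["Total"]).all())
```

For these five rows the gap is within the rounding of the published six-decimal values. Whether that holds for every row of the full file is something to check rather than assume.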


# Introduction To The Data

<div class="alert alert-info">
    <ul>
        <li>Read all-ages.csv into a DataFrame object, and assign it to all_ages.</li>
        <li>Read recent-grads.csv into a DataFrame object, and assign it to recent_grads.</li>
        <li>Display the first five rows of all_ages and recent_grads.</li>
    </ul>
</div>

In [1]:
import pandas as pd

In [3]:
all_ages = pd.read_csv("all-ages.csv")

In [6]:
print(all_ages.head(5))

   Major_code                                  Major  \
0        1100                    GENERAL AGRICULTURE   
1        1101  AGRICULTURE PRODUCTION AND MANAGEMENT   
2        1102                 AGRICULTURAL ECONOMICS   
3        1103                        ANIMAL SCIENCES   
4        1104                           FOOD SCIENCE   

                    Major_category   Total  Employed  \
0  Agriculture & Natural Resources  128148     90245   
1  Agriculture & Natural Resources   95326     76865   
2  Agriculture & Natural Resources   33955     26321   
3  Agriculture & Natural Resources  103549     81177   
4  Agriculture & Natural Resources   24280     17281   

   Employed_full_time_year_round  Unemployed  Unemployment_rate  Median  \
0                          74078        2423           0.026147   50000   
1                          64240        2266           0.028636   54000   
2                          22810         821           0.030248   63000   
3                         

In [5]:
recent_grads = pd.read_csv("recent-grads.csv")

In [7]:
print(recent_grads.head(5))

   Rank  Major_code                                      Major Major_category  \
0     1        2419                      PETROLEUM ENGINEERING    Engineering   
1     2        2416             MINING AND MINERAL ENGINEERING    Engineering   
2     3        2415                  METALLURGICAL ENGINEERING    Engineering   
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING    Engineering   
4     5        2405                       CHEMICAL ENGINEERING    Engineering   

   Total  Sample_size    Men  Women  ShareWomen  Employed      ...        \
0   2339           36   2057    282    0.120564      1976      ...         
1    756            7    679     77    0.101852       640      ...         
2    856            3    725    131    0.153037       648      ...         
3   1258           16   1123    135    0.107313       758      ...         
4  32260          289  21239  11021    0.341631     25694      ...         

   Part_time  Full_time_year_round  Unemployed  Unemploy
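
To try the loading step without the CSV files on disk, here is a minimal sketch that feeds the five preview rows shown earlier to `pd.read_csv` through an in-memory buffer. The names `csv_text` and `preview` are made up for this illustration, and the text covers only the 10 columns shown in the preview, not the full file.

```python
import io

import pandas as pd

# Preview rows copied from the table earlier in this notebook (not the full file).
csv_text = """Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed
1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564,1976
2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.101852,640
3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037,648
4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,16,1123,135,0.107313,758
5,2405,CHEMICAL ENGINEERING,Engineering,32260,289,21239,11021,0.341631,25694
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
preview = pd.read_csv(io.StringIO(csv_text))

# shape is (rows, columns); dtypes shows how pandas parsed each column.
print(preview.shape)
print(preview.dtypes)
print(preview.head(2))
```

Checking `shape` and `dtypes` right after loading is a cheap way to confirm that numeric columns were parsed as numbers rather than strings.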

# Summarizing Major Categories

Both of these data sets group the various majors into categories in the Major_category column. Let's start by understanding the number of people in each Major_category for both data sets.

<div class="alert alert-info">
    <ul>
        <li>
            Use the Total column to calculate the number of people who fall under each Major_category in each data set. Store the result as a separate dictionary for each data set.
            <ul>
                <li>The key for the dictionary should be the Major_category, and the value should be the total count.</li>
                <li>For the counts from all_ages, store the results as a dictionary named aa_cat_counts.</li>
                <li>For the counts from recent_grads, store the results as a dictionary named rg_cat_counts.</li>
            </ul>
        </li>
    </ul>
</div>

In [53]:
# Let's initialize the dicts
aa_cat_counts = dict()
rg_cat_counts = dict()

In [54]:
def total_count(df):
    """Return a dict mapping each Major_category in df to the sum of its Total column."""
    cats = df['Major_category'].unique()
    return_dict = dict()
    
    # for each unique category
    for cat in cats:
        # we subset the rows matching this category as a major category
        major_df = df[df["Major_category"] == cat]
        # we sum its number of members
        total = major_df["Total"].sum()
        # we store it in the return dictionary
        return_dict[cat] = total
    return return_dict

In [55]:
aa_cat_counts = total_count(all_ages)
rg_cat_counts = total_count(recent_grads)

In [56]:
for cat,tot in aa_cat_counts.items():
    print(cat, ":", tot)

Physical Sciences : 1025318
Business : 9858741
Biology & Life Science : 1338186
Education : 4700118
Arts : 1805865
Computers & Mathematics : 1781378
Health : 2950859
Engineering : 3576013
Humanities & Liberal Arts : 3738335
Psychology & Social Work : 1987278
Social Science : 2654125
Law & Public Policy : 902926
Agriculture & Natural Resources : 632437
Interdisciplinary : 45199
Industrial Arts & Consumer Services : 1033798
Communications & Journalism : 1803822


In [57]:
for cat,tot in rg_cat_counts.items():
    print(cat, ":", tot)

Physical Sciences : 185479
Biology & Life Science : 453862
Education : 559129
Agriculture & Natural Resources : 79981
Arts : 357130
Computers & Mathematics : 299008
Health : 463230
Engineering : 537583
Interdisciplinary : 12296
Humanities & Liberal Arts : 713468
Social Science : 529966
Law & Public Policy : 179107
Business : 1302376
Psychology & Social Work : 481007
Industrial Arts & Consumer Services : 229792
Communications & Journalism : 392601


# Low-Wage Job Rates

The [press likes to talk](http://bit.ly/1fNLmaT) about the number of college graduates working low-pay, unskilled jobs because they can't find better ones. As a data person, you should be skeptical of any broad claims, and analyze relevant data to obtain a more nuanced view.

Let's run some basic calculations to explore that idea further.

<div class="alert alert-info">
    <ul>
        <li>
            Use the Low_wage_jobs and Total columns to calculate the proportion of recent college graduates who worked low-wage jobs.
            <ul>
                <li>Recall that you can use the Series.sum() method to return the sum of the values in a column.</li>
            </ul>
        </li>
        <li>Store the resulting float as low_wage_percent, and display the value with the print() function.</li>
    </ul>
</div>

In [63]:
low_wage_percent = 0.0

In [64]:
low_wage_jobs_sum = recent_grads['Low_wage_jobs'].sum()
recent_grads_sum = recent_grads['Total'].sum()

In [65]:
low_wage_percent = low_wage_jobs_sum / recent_grads_sum

In [66]:
print(low_wage_percent)

0.09852546076122913
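
The figure above is a pooled proportion: all low-wage jobs divided by all graduates. That is not the same as averaging each major's own low-wage rate, because the average gives a small major as much weight as a large one. Here is a minimal sketch of the difference on made-up numbers. The toy data and the names `toy`, `pooled`, and `per_major_mean` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one large major and one small major.
toy = pd.DataFrame({
    "Major": ["A", "B"],
    "Total": [1000, 100],
    "Low_wage_jobs": [50, 30],
})

# Pooled proportion: sum the counts first, then divide (80 / 1100).
pooled = toy["Low_wage_jobs"].sum() / toy["Total"].sum()

# Unweighted mean of per-major rates: (0.05 + 0.30) / 2.
per_major_mean = (toy["Low_wage_jobs"] / toy["Total"]).mean()

print(pooled)
print(per_major_mean)
```

Note that `Series.sum()` skips missing values by default, so a row with a missing Total but a recorded Low_wage_jobs count would contribute to the numerator only.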


# Comparing Data Sets

It looks like only about 9.85% of recent graduates held a low-wage job.

Both the all_ages and recent_grads data sets have 173 rows, corresponding to the 173 college major codes. This enables us to do some comparisons between the two data sets, and perform some initial calculations to see how the statistics for recent college graduates compare with those for the entire population.

Next, let's count the majors in which recent graduates had a lower unemployment rate than the overall population.

In [69]:
len(all_ages) == len(recent_grads)

True

<div class="alert alert-info">
    <ul>
        <li>
            Use a for loop to iterate over majors.
            <ul>
                <li>For each major, use Boolean filtering to find the corresponding row in both DataFrames.</li>
                <li>Compare the values for Unemployment_rate to see which DataFrame has a lower value.</li>
                <li>Increment rg_lower_count if the value for Unemployment_rate is lower for recent_grads than it is for all_ages.</li>
            </ul>
        </li>
        <li>Display rg_lower_count with the print() function.</li>
    </ul>
</div>

In [76]:
# All majors, common to both DataFrames
majors = recent_grads['Major'].unique()
rg_lower_count = 0

for maj in majors:
    rg_row = recent_grads[recent_grads["Major"] == maj]
    aa_row = all_ages[all_ages["Major"] == maj]
    
    rg_unemp_rate = rg_row.iloc[0]['Unemployment_rate']
    aa_unemp_rate = aa_row.iloc[0]['Unemployment_rate']
    
    if rg_unemp_rate < aa_unemp_rate:
        rg_lower_count += 1    

In [77]:
print(rg_lower_count)

43
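
The loop above filters both DataFrames once per major. As an alternative, here is a minimal sketch that lines the two tables up with a single merge and compares the columns directly. It joins on Major_code, which appears in both `head()` outputs, and uses made-up rates. The names `toy_rg`, `toy_aa`, `merged`, and `toy_lower_count` are hypothetical.

```python
import pandas as pd

# Hypothetical toy stand-ins for recent_grads and all_ages.
toy_rg = pd.DataFrame({
    "Major_code": [1, 2, 3],
    "Unemployment_rate": [0.05, 0.02, 0.08],
})
toy_aa = pd.DataFrame({
    "Major_code": [1, 2, 3],
    "Unemployment_rate": [0.04, 0.03, 0.08],
})

# Inner join on the major code; suffixes tell the two rate columns apart.
merged = toy_rg.merge(toy_aa, on="Major_code", suffixes=("_rg", "_aa"))

# Count the rows where the recent-graduate rate is strictly lower.
is_lower = merged["Unemployment_rate_rg"] < merged["Unemployment_rate_aa"]
toy_lower_count = int(is_lower.sum())
print(toy_lower_count)
```

An inner merge silently drops any major that is missing from either table, so it is worth comparing `len(merged)` with the length of the inputs before trusting the count.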


<div class="alert alert-warning">
It appears that recent graduates had a lower unemployment rate than the overall population in only 43 of the 173 majors.

In the next few missions, we'll dive further into the two key data structures in pandas: Series and DataFrame objects.
</div>