# Python: Data Analysis

**Goal**: analyze real data samples for trends!

**Table of Contents**

1. [Introduction to dataset](#Introduction-to-dataset)
2. [Number of students by major category](#Number-of-students-by-major-category)
3. [Rate of low wage jobs](#Rate-of-low-wage-jobs)
4. [Comparing datasets](#Comparing-datasets)

## Introduction to dataset

For this challenge, we will use two datasets, ``all-ages.csv`` and ``recent-grads.csv``, which collect data on purchasing power and employability. The ``all-ages.csv`` dataset contains employment data by field of study for graduates of all ages, while the ``recent-grads.csv`` dataset contains employment data by field of study for recent university graduates only. We will analyze both datasets and compute a few summary statistics.

In [1]:
import pandas as pd

In [2]:
all_ages = pd.read_csv("all-ages.csv")
all_ages.head()

| | Major_code | Major | Major_category | Total | Employed | Employed_full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1100 | GENERAL AGRICULTURE | Agriculture & Natural Resources | 128148.0 | 90245.0 | 74078.0 | 2423.0 | 0.026147 | 50000.0 | 34000.0 | 80000.0 |
| 1 | 1101 | AGRICULTURE PRODUCTION AND MANAGEMENT | Agriculture & Natural Resources | 95326.0 | 76865.0 | 64240.0 | 2266.0 | 0.028636 | 54000.0 | 36000.0 | 80000.0 |
| 2 | 1102 | AGRICULTURAL ECONOMICS | Agriculture & Natural Resources | 33955.0 | 26321.0 | 22810.0 | 821.0 | 0.030248 | 63000.0 | 40000.0 | 98000.0 |
| 3 | 1103 | ANIMAL SCIENCES | Agriculture & Natural Resources | 103549.0 | 81177.0 | 64937.0 | 3619.0 | 0.042679 | 46000.0 | 30000.0 | 72000.0 |
| 4 | 1104 | FOOD SCIENCE | Agriculture & Natural Resources | 24280.0 | 17281.0 | 12722.0 | 894.0 | 0.049188 | 62000.0 | 38500.0 | 90000.0 |


In [3]:
all_ages.shape

(173, 11)

In [4]:
recent_grads = pd.read_csv("recent-grads.csv")
recent_grads.head()

| | Rank | Major_code | Major | Major_category | Total | Sample_size | Men | Women | ShareWomen | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2419.0 | PETROLEUM ENGINEERING | Engineering | 2339.0 | 36.0 | 2057.0 | 282.0 | 0.120564 | 1976.0 | ... | 270.0 | 1207.0 | 37.0 | 0.018381 | 110000.0 | 95000.0 | 125000.0 | 1534.0 | 364.0 | 193.0 |
| 1 | 2 | 2416.0 | MINING AND MINERAL ENGINEERING | Engineering | 756.0 | 7.0 | 679.0 | 77.0 | 0.101852 | 640.0 | ... | 170.0 | 388.0 | 85.0 | 0.117241 | 75000.0 | 55000.0 | 90000.0 | 350.0 | 257.0 | 50.0 |
| 2 | 3 | 2415.0 | METALLURGICAL ENGINEERING | Engineering | 856.0 | 3.0 | 725.0 | 131.0 | 0.153037 | 648.0 | ... | 133.0 | 340.0 | 16.0 | 0.024096 | 73000.0 | 50000.0 | 105000.0 | 456.0 | 176.0 | 0.0 |
| 3 | 4 | 2417.0 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | Engineering | 1258.0 | 16.0 | 1123.0 | 135.0 | 0.107313 | 758.0 | ... | 150.0 | 692.0 | 40.0 | 0.050125 | 70000.0 | 43000.0 | 80000.0 | 529.0 | 102.0 | 0.0 |
| 4 | 5 | 2405.0 | CHEMICAL ENGINEERING | Engineering | 32260.0 | 289.0 | 21239.0 | 11021.0 | 0.341631 | 25694.0 | ... | 5180.0 | 16697.0 | 1672.0 | 0.061098 | 65000.0 | 50000.0 | 75000.0 | 18314.0 | 4440.0 | 972.0 |


In [5]:
recent_grads.shape

(173, 21)

Here is a brief description of some of these columns. The ``Rank`` column ranks the fields of study by the median salary of their graduates. ``Major_code`` is the numerical code of the major. ``Total`` is the total number of people who studied the major. ``Sample_size`` is the size of the (unweighted) sample of full-time, year-round workers used for the earnings figures. ``ShareWomen`` is the proportion of women among those who took the major, and so on.
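As a quick sanity check on these column descriptions, the derived columns can be recomputed by hand from the first row of ``recent_grads.head()``. The sketch below assumes that ``ShareWomen`` is ``Women / Total`` and that ``Unemployment_rate`` is ``Unemployed / (Employed + Unemployed)``; both formulas are assumptions that happen to agree with the displayed values, not definitions taken from the dataset's documentation.

```python
# Values copied from the first row (PETROLEUM ENGINEERING) shown above
men, women, total = 2057.0, 282.0, 2339.0
employed, unemployed = 1976.0, 37.0

# Assumed definition: share of women among everyone who took the major
share_women = women / total

# Assumed definition: unemployed people as a share of the labor force
unemployment_rate = unemployed / (employed + unemployed)

print(round(share_women, 6))        # compare with the ShareWomen column
print(round(unemployment_rate, 6))  # compare with the Unemployment_rate column
```

Both results round to the values displayed in the table (0.120564 and 0.018381), which supports reading the columns this way.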

## Number of students by major category

In this section, we will carry out the following tasks:

* return the unique values of Major_category:
    * use the Series.unique() method to return the unique values of a series
* for each unique value (use a for loop):
    * return all rows where Major_category is that unique value
    * calculate the total number of students in this major category (sum of the Total column)
    * keep this result in memory in a dictionary with the Major_category as key and the number of students as value
* create a function that uses the Total column to calculate the number of students for each Major_category in each dataset:
    * store the results in 2 separate dictionaries
    * the key for each dictionary will be Major_category and the value will be the total number of students
    * for the all_ages dataset, store the result in a dictionary named aa_cat_counts
    * for the recent_grads dataset, store the result in a dictionary named rg_cat_counts

In [6]:
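def compute_major_cat_totals(df):
    """Return a dict mapping each Major_category to the sum of its Total column."""
    cats = df['Major_category'].unique()
    counts_dictionary = dict()

    for c in cats:
        # Keep only the rows belonging to the current category
        major_df = df[df["Major_category"] == c]
        # Sum the number of students across those rows
        total = major_df["Total"].sum()
        counts_dictionary[c] = total

    return counts_dictionary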

In [7]:
aa_cat_counts = compute_major_cat_totals(all_ages) 
aa_cat_counts

{'Agriculture & Natural Resources': 632437.0,
 'Biology & Life Science': 1338186.0,
 'Engineering': 3576013.0,
 'Humanities & Liberal Arts': 3738335.0,
 'Communications & Journalism': 1803822.0,
 'Computers & Mathematics': 1781378.0,
 'Industrial Arts & Consumer Services': 1018072.0,
 'Education': 4700118.0,
 'Law & Public Policy': 902926.0,
 'Interdisciplinary': 45199.0,
 'Health': 2950859.0,
 'Social Science': 2654125.0,
 'Physical Sciences': 1013152.0,
 nan: 0.0,
 'Psychology & Social Work': 1987278.0,
 'Arts': 1805865.0,
 'Business': 9858741.0}

In [8]:
rg_cat_counts = compute_major_cat_totals(recent_grads)                
rg_cat_counts             

{'Engineering': 537583.0,
 'Business': 1302376.0,
 'Physical Sciences': 183363.0,
 'Law & Public Policy': 179107.0,
 'Computers & Mathematics': 299008.0,
 'Agriculture & Natural Resources': 79981.0,
 'Industrial Arts & Consumer Services': 227357.0,
 'Arts': 357130.0,
 'Health': 463230.0,
 'Social Science': 529966.0,
 nan: 0.0,
 'Biology & Life Science': 453862.0,
 'Education': 559129.0,
 'Humanities & Liberal Arts': 713468.0,
 'Psychology & Social Work': 481007.0,
 'Communications & Journalism': 392601.0,
 'Interdisciplinary': 12296.0}
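
Both dictionaries contain a ``nan: 0.0`` entry. A likely explanation is that ``Series.unique()` ` includes NaN among the values it returns, while the comparison ``df["Major_category"] == c`` is never true for NaN, so the filtered frame is empty and its sum is 0. The sketch below reproduces this on made-up data; the toy frame, the ``totals_by_category`` helper and its ``skip_missing`` flag are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), with one row whose category and total are missing
toy = pd.DataFrame({
    "Major_category": ["Arts", "Arts", "Health", np.nan],
    "Total": [10.0, 5.0, 7.0, np.nan],
})

def totals_by_category(df, skip_missing=False):
    """Sum Total per Major_category, optionally ignoring the NaN category."""
    cats = df["Major_category"]
    if skip_missing:
        cats = cats.dropna()
    totals = {}
    for c in cats.unique():
        # NaN == NaN is False, so a NaN category selects no rows at all
        totals[c] = df.loc[df["Major_category"] == c, "Total"].sum()
    return totals

with_nan = totals_by_category(toy)
without_nan = totals_by_category(toy, skip_missing=True)
print(with_nan)     # contains a NaN key whose total is 0.0
print(without_nan)  # only the real categories remain
```

Dropping missing categories before the loop is one way to keep the spurious entry out of the result.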

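The same totals can be computed directly with the ``pivot_table()`` method. Note that calling ``dict()`` on the resulting DataFrame yields a dictionary with a single key, ``'Total'``, whose value is a Series indexed by ``Major_category``, and that the ``nan`` category no longer appears.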

In [9]:
import numpy as np

In [10]:
aa_cat_counts = dict(all_ages.pivot_table(
    index="Major_category", values="Total", aggfunc=np.sum))
aa_cat_counts

{'Total': Major_category
 Agriculture & Natural Resources         632437.0
 Arts                                   1805865.0
 Biology & Life Science                 1338186.0
 Business                               9858741.0
 Communications & Journalism            1803822.0
 Computers & Mathematics                1781378.0
 Education                              4700118.0
 Engineering                            3576013.0
 Health                                 2950859.0
 Humanities & Liberal Arts              3738335.0
 Industrial Arts & Consumer Services    1018072.0
 Interdisciplinary                        45199.0
 Law & Public Policy                     902926.0
 Physical Sciences                      1013152.0
 Psychology & Social Work               1987278.0
 Social Science                         2654125.0
 Name: Total, dtype: float64}

In [11]:
rg_cat_counts = dict(recent_grads.pivot_table(
    index="Major_category", values="Total", aggfunc=np.sum))
rg_cat_counts

{'Total': Major_category
 Agriculture & Natural Resources          79981.0
 Arts                                    357130.0
 Biology & Life Science                  453862.0
 Business                               1302376.0
 Communications & Journalism             392601.0
 Computers & Mathematics                 299008.0
 Education                               559129.0
 Engineering                             537583.0
 Health                                  463230.0
 Humanities & Liberal Arts               713468.0
 Industrial Arts & Consumer Services     227357.0
 Interdisciplinary                        12296.0
 Law & Public Policy                     179107.0
 Physical Sciences                       183363.0
 Psychology & Social Work                481007.0
 Social Science                          529966.0
 Name: Total, dtype: float64}
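
The pivot-table results above are wrapped under a single ``'Total'`` key. If a flat ``{Major_category: total}`` dictionary like the one returned by ``compute_major_cat_totals`` is preferred, one option is to select the ``Total`` column and call ``Series.to_dict()``, or to use ``groupby`` directly. This is a minimal sketch on made-up data, not the notebook's own method; the toy frame and variable names are hypothetical. It passes ``aggfunc="sum"`` as a string, since recent pandas versions warn when ``np.sum`` is passed as the aggregation function.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), including one row with a missing category
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Business", "Engineering", np.nan],
    "Total": [100.0, 200.0, 50.0, np.nan],
})

# Group by category and sum; rows with a missing category are dropped by default
flat_counts = toy.groupby("Major_category")["Total"].sum().to_dict()

# Same result via pivot_table: pick the Total column before converting
pivot_counts = toy.pivot_table(
    index="Major_category", values="Total", aggfunc="sum"
)["Total"].to_dict()

print(flat_counts)
print(pivot_counts)
```

Either form gives one entry per category, with no ``nan`` key.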

## Rate of low wage jobs

In this section, we will carry out the following tasks:

* use the "Low_wage_jobs" and "Total" columns to compute the proportion of recent graduates (recent_grads) who ended up in low-wage jobs:
    * remember that you can use the Series.sum() method to return the sum of a column's values
* store the result in the variable low_wage_proportion and display it

In [12]:
low_wage_jobs_sum = recent_grads['Low_wage_jobs'].sum()
recent_grads_sum = recent_grads['Total'].sum()

In [13]:
low_wage_proportion = low_wage_jobs_sum / recent_grads_sum
low_wage_proportion

0.09856140415130317
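
The proportion above is a ratio of two column sums, so every graduate counts equally, and ``Series.sum()`` skips NaN values by default. The sketch below uses made-up numbers (the toy frame and variable names are hypothetical) to show how this differs from averaging the per-major ratios, where every major counts equally regardless of its size.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a large major, a small major, and a missing row
toy = pd.DataFrame({
    "Total": [1000.0, 200.0, np.nan],
    "Low_wage_jobs": [50.0, 40.0, np.nan],
})

# Ratio of sums: each graduate has the same weight (NaN rows are skipped)
overall = toy["Low_wage_jobs"].sum() / toy["Total"].sum()

# Mean of per-major ratios: each major has the same weight
per_major_mean = (toy["Low_wage_jobs"] / toy["Total"]).mean()

print(overall, per_major_mean)
```

Here the ratio of sums is 90 / 1200, while the per-major ratios 0.05 and 0.2 average to a noticeably larger value.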

## Comparing datasets

In this section, we will carry out the following tasks:

* use a for loop to go through all the majors:
    * for each Major and each dataset, keep only the rows of the dataset corresponding to that Major
    * compare the values of the "Unemployment_rate" column to see which of the 2 datasets has the lower value
    * increment (i.e. add 1 to) the rg_lower_count variable if the value of Unemployment_rate is smaller in the recent_grads dataset than in the all_ages dataset
* display the result rg_lower_count

In [14]:
majors = recent_grads['Major'].unique()
rg_lower_count = 0

for m in majors:
    # Skip a missing (NaN) major, which cannot be matched by equality
    if pd.isna(m):
        continue

    recent_grads_row = recent_grads[recent_grads['Major'] == m]
    all_ages_row = all_ages[all_ages['Major'] == m]

    rg_unemp_rate = recent_grads_row['Unemployment_rate'].values
    aa_unemp_rate = all_ages_row['Unemployment_rate'].values

    # Each major is expected to appear exactly once in both datasets
    assert rg_unemp_rate.size == 1
    assert aa_unemp_rate.size == 1

    if rg_unemp_rate[0] < aa_unemp_rate[0]:
        rg_lower_count += 1

In [15]:
rg_lower_count

42

So for 42 majors (out of the 173 rows in the dataset), recent graduates have a lower unemployment rate than the population of all ages.
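
The loop above filters both datasets once per major. As an alternative, the two frames can be joined on ``Major`` and compared in one vectorized step. This is a minimal sketch on made-up miniature datasets, not the notebook's own method; ``rg_toy``, ``aa_toy`` and the column suffixes are hypothetical names.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets
rg_toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.02, 0.08, 0.05],
})
aa_toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.03, 0.05, 0.05],
})

# Inner join on Major; the suffixes tell the two rate columns apart
merged = rg_toy.merge(aa_toy, on="Major", suffixes=("_rg", "_aa"))

# Strict comparison, as in the loop: ties do not count
lower = merged["Unemployment_rate_rg"] < merged["Unemployment_rate_aa"]
toy_lower_count = int(lower.sum())
print(toy_lower_count)
```

Only major A has a strictly lower rate among recent graduates in this toy data. Unlike the loop, ``merge`` matches rows whose key is NaN in both frames, but a comparison involving a NaN rate evaluates to False, so such rows would not be counted.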