In [53]:
# Packages installed for this project (uncomment a line to install)

#install.packages("readr")
#install.packages("tibble")
#install.packages("dplyr")
#install.packages("tidyverse")

In [1]:
# Load readr, which provides read_csv() for reading CSV files; tidyverse also attaches dplyr, which is used below
library(readr)
library(tidyverse)
library(tidyselect)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.0     ✔ dplyr   0.8.5
✔ tibble  3.0.1     ✔ stringr 1.4.0
✔ tidyr   1.0.3     ✔ forcats 0.4.0
✔ purrr   0.3.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


### Let us summarize what we are interested in

1. You're interested in the sciences.

The Major_category variable contains information about the field of study. We can use this information to identify majors in the physical and life sciences.

2. Recent graduates must have a median salary of at least 40,000 USD.

The Median variable provides the median salary for each major. We can use this information to identify majors with a median salary of 40,000 USD or more.

3. More than 40 percent of graduates must be women.

There is no variable that gives the percentage of graduates in each major who are women. However, we do have the total number of graduates (Total), the number of graduates who are men (Men), and the number of graduates who are women (Women), so we can compute the percentage as Women / Total * 100.
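
As a quick check of that calculation, here is a minimal sketch on a two-row toy table. The counts are copied from the PETROLEUM ENGINEERING and ASTRONOMY AND ASTROPHYSICS rows of the dataset; the name `toy` is only for illustration and is not used elsewhere in the notebook.

```r
library(dplyr)
library(tibble)

# Two majors with counts copied from the dataset; `toy` is an illustrative name
toy <- tibble(
  Major = c("PETROLEUM ENGINEERING", "ASTRONOMY AND ASTROPHYSICS"),
  Total = c(2339, 1792),
  Women = c(282, 960)
)

# Percentage of graduates who are women
toy <- toy %>% mutate(Women_percent = Women / Total * 100)
toy$Women_percent  # about 12.06 and 53.57
```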

In [5]:
# Read the CSV file with the read_csv() function from the readr package
recent_grads <- read_csv("recent_grads.csv")

In [3]:
# We only need a subset of the columns, so we use the select() function from dplyr
recent_grads_select <- recent_grads %>% select(Major, Major_category, Total, Men, Women, Median, Unemployment_rate)

In [14]:
# Create a new column, Women_percent, and add it to the data frame
recent_grads_select <- recent_grads_select %>% mutate(Women_percent = (Women / Total) * 100)

In [15]:
# Here are the first ten rows of the data
recent_grads_select[1:10, ]

Major,Major_category,Total,Men,Women,Median,Unemployment_rate,Women_percent
PETROLEUM ENGINEERING,Engineering,2339,2057,282,110000,0.01838053,12.05643
MINING AND MINERAL ENGINEERING,Engineering,756,679,77,75000,0.11724138,10.18519
METALLURGICAL ENGINEERING,Engineering,856,725,131,73000,0.02409639,15.30374
NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,1123,135,70000,0.05012531,10.73132
CHEMICAL ENGINEERING,Engineering,32260,21239,11021,65000,0.06109771,34.16305
NUCLEAR ENGINEERING,Engineering,2573,2200,373,65000,0.17722641,14.4967
ACTUARIAL SCIENCE,Business,3777,2110,1667,62000,0.09565217,44.13556
ASTRONOMY AND ASTROPHYSICS,Physical Sciences,1792,832,960,62000,0.02116741,53.57143
MECHANICAL ENGINEERING,Engineering,91227,80320,10907,60000,0.05734228,11.95589
ELECTRICAL ENGINEERING,Engineering,81527,65511,16016,60000,0.05917385,19.64503


### 3. More than 40 percent of graduates must be women.

The rows above show that in many majors women make up less than 40 percent of graduates. We remove those majors with the filter() function from the dplyr package.
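
filter() keeps only the rows for which the condition is TRUE; rows where the condition evaluates to NA are dropped as well. The following is a minimal sketch of the strict "more than 40 percent" criterion, using made-up values (the majors A to D and the name `toy` are hypothetical):

```r
library(dplyr)
library(tibble)

# Made-up values: one below 40, one exactly 40, one above 40, one missing
toy <- tibble(
  Major = c("A", "B", "C", "D"),
  Women_percent = c(12.1, 40, 53.6, NA)
)

# Strictly greater than 40: "B" (exactly 40) and "D" (NA) are both dropped
above_40 <- toy %>% filter(Women_percent > 40)
above_40$Major
```

With `>=` instead of `>`, major "B" would be kept as well.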

In [16]:
# Keep only majors where more than 40 percent of graduates are women
recent_grads_select <- recent_grads_select %>% filter(Women_percent > 40)

In [17]:
# Health majors among those that pass the Women_percent filter
recent_grads_health <- recent_grads_select %>% filter(Major_category == "Health")


In [18]:
recent_grads_health

Major,Major_category,Total,Men,Women,Median,Unemployment_rate,Women_percent
NURSING,Health,209394,21773,187621,48000,0.04486272,89.6019
MEDICAL TECHNOLOGIES TECHNICIANS,Health,15914,3916,11998,45000,0.03698279,75.39274
MEDICAL ASSISTING SERVICES,Health,11123,803,10320,42000,0.04250653,92.78072
PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION,Health,23551,8697,14854,40000,0.05552083,63.07163
MISCELLANEOUS HEALTH MEDICAL PROFESSIONS,Health,13386,1589,11797,36000,0.08141125,88.12939
NUTRITION SCIENCES,Health,18909,2563,16346,35000,0.06870068,86.44561
HEALTH AND MEDICAL ADMINISTRATIVE SERVICES,Health,18109,4266,13843,35000,0.08962626,76.44265
COMMUNITY AND PUBLIC HEALTH,Health,19735,4103,15632,34000,0.11214439,79.20953
HEALTH AND MEDICAL PREPARATORY PROGRAMS,Health,12740,5521,7219,33500,0.06977971,56.66405
TREATMENT THERAPY PROFESSIONS,Health,48491,13487,35004,33000,0.05982121,72.18659


### Science data frame

In [20]:
recent_grads_science <- recent_grads_select %>% filter(Major_category %in% c("Biology & Life Science", "Physical Sciences"))


In [21]:
# Create a new data frame, potential_majors, from recent_grads_science, keeping rows with Median >= 40000 and Women_percent > 40
potential_majors <- recent_grads_science %>% filter(Median >= 40000 & Women_percent > 40)

In [22]:
potential_majors

Major,Major_category,Total,Men,Women,Median,Unemployment_rate,Women_percent
ASTRONOMY AND ASTROPHYSICS,Physical Sciences,1792,832,960,62000,0.02116741,53.57143
"NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES",Physical Sciences,2116,528,1588,46000,0.07154047,75.04726
PHARMACOLOGY,Biology & Life Science,1762,515,1247,45000,0.08553157,70.77185
OCEANOGRAPHY,Physical Sciences,2418,752,1666,44700,0.05699482,68.89992
COGNITIVE SCIENCE AND BIOPSYCHOLOGY,Biology & Life Science,3831,1667,2164,41000,0.07523617,56.48656
MOLECULAR BIOLOGY,Biology & Life Science,18300,7426,10874,40000,0.08436116,59.42077
GENETICS,Biology & Life Science,3635,1761,1874,40000,0.03411765,51.55433
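
The separate steps used so far can also be written as one chain. The sketch below is only an illustration: it runs on four rows copied from the outputs above rather than on the full CSV file, and the names `sample_grads` and `shortlist` are not part of the notebook.

```r
library(dplyr)
library(tibble)

# Four rows copied from the outputs above; `sample_grads` is an illustrative name
sample_grads <- tibble(
  Major = c("PETROLEUM ENGINEERING", "ACTUARIAL SCIENCE",
            "ASTRONOMY AND ASTROPHYSICS", "NUTRITION SCIENCES"),
  Major_category = c("Engineering", "Business", "Physical Sciences", "Health"),
  Total = c(2339, 3777, 1792, 18909),
  Women = c(282, 1667, 960, 16346),
  Median = c(110000, 62000, 62000, 35000)
)

# Derive the percentage, then apply all three criteria in one filter() call;
# conditions separated by commas are combined with a logical AND
shortlist <- sample_grads %>%
  mutate(Women_percent = Women / Total * 100) %>%
  filter(
    Major_category %in% c("Biology & Life Science", "Physical Sciences"),
    Median >= 40000,
    Women_percent > 40
  )
shortlist$Major
```

Of these four rows, only ASTRONOMY AND ASTROPHYSICS meets all three criteria.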


In [33]:
# Create a new data frame, my_majors, with the rows of potential_majors sorted by increasing
# unemployment rate, with ties broken by decreasing Median
my_majors <- potential_majors %>% arrange(Unemployment_rate, desc(Median))

In [34]:
my_majors

Major,Major_category,Total,Men,Women,Median,Unemployment_rate,Women_percent
ASTRONOMY AND ASTROPHYSICS,Physical Sciences,1792,832,960,62000,0.02116741,53.57143
GENETICS,Biology & Life Science,3635,1761,1874,40000,0.03411765,51.55433
OCEANOGRAPHY,Physical Sciences,2418,752,1666,44700,0.05699482,68.89992
"NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES",Physical Sciences,2116,528,1588,46000,0.07154047,75.04726
COGNITIVE SCIENCE AND BIOPSYCHOLOGY,Biology & Life Science,3831,1667,2164,41000,0.07523617,56.48656
MOLECULAR BIOLOGY,Biology & Life Science,18300,7426,10874,40000,0.08436116,59.42077
PHARMACOLOGY,Biology & Life Science,1762,515,1247,45000,0.08553157,70.77185
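
In the output above no two majors share the same unemployment rate, so the second sort key, desc(Median), never comes into play. A small sketch with made-up values (the majors X, Y, Z and the names `toy` and `sorted` are hypothetical) shows how it breaks ties:

```r
library(dplyr)
library(tibble)

# Hypothetical values chosen so that X and Y tie on Unemployment_rate
toy <- tibble(
  Major = c("X", "Y", "Z"),
  Median = c(40000, 45000, 62000),
  Unemployment_rate = c(0.05, 0.05, 0.02)
)

# Sort by increasing unemployment rate; within ties, the higher Median comes first
sorted <- toy %>% arrange(Unemployment_rate, desc(Median))
sorted$Major
```

Z comes first because it has the lowest unemployment rate; Y precedes X because both have a rate of 0.05 and Y has the higher median salary.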
