# R Workbook 1: What are the characteristics in the number of jobs by census block?

In these workbooks, we start with a motivating question and then walk through the steps needed to answer it. Along the way, we introduce various R commands and build skills as we work toward an answer.

As you work, there will be headers that are in **<span style="color:green">GREEN</span>**. These indicate locations where there is an accompanying video. This video may walk through the steps or expand on the topics discussed in that section. Though it isn't absolutely necessary to watch the video while working through this notebook, we highly recommend watching them at least once on your first time through.

**<span style = "color:green">If you have not yet watched the "Introduction to Jupyter Notebooks" video, watch it before you proceed!</span>**

You will also run into some headers that are in **<span style="color:red">RED</span>**. These headers indicate a checkpoint to practice writing the code yourself. You should stop at these checkpoints and try doing the exercises and answering the questions posed in these sections.

**NOTE: When you open a notebook, make sure you run each cell containing code from the beginning. Since the code we're writing builds on everything written before, if you don't make sure to run everything from the beginning, some things may not work.**

In each of the workbooks, we will start out with a motivating question. Here, we'll introduce the data that we'll work with, which will lead into our motivating question for this workbook.

## Longitudinal Employer-Household Dynamics (LEHD) Data

In these workbooks, we will be using LEHD data. These are public-use data sets containing information about employers and employees. Information about the LEHD Data can be found at [https://lehd.ces.census.gov/](https://lehd.ces.census.gov/). 

We will be using the LEHD Origin-Destination Employment Statistics (LODES) datasets in this workbook. Each state has three main types of files: Origin-Destination data, in which job totals are associated with a home and work Census block pair; Residence Area Characteristic data, in which job totals are by home Census block; and Workplace Area Characteristic data, in which job totals are by workplace Census block. In addition to these three, there is a "geographic crosswalk" file with descriptions of the Census blocks as they appear in the LODES datasets.

You can find more information about the LODES datasets [here](https://lehd.ces.census.gov/data/lodes/LODES7/LODESTechDoc7.3.pdf).

## <span style = "color:green">Motivating Question (VIDEO)</span>

The dataset we'll be working with in this workbook is the Workplace Area Characteristics data, which aggregates job totals by workplace census block. We want to explore this dataset and get a better idea of the distribution of jobs. That is, we want to answer the following questions:

**How can we characterize the number of jobs in the state? What can we say about the distribution of the number of jobs by census block? What are distributions of jobs by different categories, such as age group or industry?**

As you work through this notebook, we'll work towards answering these questions, so try to keep in mind what we're working toward.

## Starting Out: Introduction to R

To answer these questions, we'll write R code to read in datasets, manipulate them, and summarize them. R is a popular programming language that has seen a rise in use for data analysis. As of 2017, R was at or near the top in popularity among programming languages for data analysis (see [the kdnuggets post](https://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html) or [a look at multiple surveys](http://makemeanalyst.com/most-popular-languages-for-data-science-and-analytics-2017/)).

In addition, unlike other tools like Excel or Stata, R can be used in a more general-purpose fashion. This means that we aren't limited to only doing certain statistical analyses and gives us much more flexibility in what we can do.

## Loading Libraries

We first load a few libraries. Libraries are essentially bundles of useful tools that help us do specific tasks. In this case, we load a suite of packages called `tidyverse`, which is particularly useful for data analysis. Don't worry too much about the specifics of libraries for now; just remember that we need to run the line of code below to use many of the tools described in this workbook.

In [3]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.5
✔ tidyr   1.0.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## <span style = "color:green">Reading in the Data Set</span>
We'll start by reading in a data set from a CSV (comma-separated values) file. For our examples, we'll use the Workplace Area Characteristic (WAC) data from California.

We use the `read_csv` function to read in the csv file.

In [33]:
data_file = 'ca_wac_S000_JT00_2015.csv'
df <- read_csv(data_file)
# df <- read_csv('ca_wac_S000_JT00_2015.csv')

Parsed with column specification:
cols(
  .default = col_double(),
  w_geocode = col_character()
)

See spec(...) for full column specifications.



Let's break the code down. In the first line, we assign `'ca_wac_S000_JT00_2015.csv'` to the variable `data_file`. Any text inside quotation marks, such as `'ca_wac_S000_JT00_2015.csv'`, is a string, which makes `data_file` a string variable. This line by itself doesn't do anything fancy: we are only storing the text `'ca_wac_S000_JT00_2015.csv'` in a variable, not yet telling R to look for a file with that name.

To look at what we've stored inside `data_file`, try using the `print` function with `data_file` as the argument. What do you think the output will be?

In [5]:
print(data_file)

[1] "ca_wac_S000_JT00_2015.csv"


In the second line, we use the `read_csv` function. It outputs a data frame, which is then assigned to the variable `df`. Notice the less-than sign followed by a dash, which together look like an arrow (`<-`). This operator assigns the output on the right-hand side to the name on the left-hand side. (The `=` sign in the first line performs the same kind of assignment.) Our data is now in a data frame called `df`.
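As a small illustration of assignment, here is a minimal sketch. The names `x` and `y` are made up for this example and do not appear elsewhere in the workbook.

```r
# Assign with the arrow operator
x <- 5

# Assign with the equals sign (behaves the same way at the top level)
y = 5

# Both names now hold the same value
print(identical(x, y))

# Assigning to an existing name replaces its old value
x <- x + 1
print(x)
```

Most R code uses `<-` for assignment by convention, which is why you will see it throughout these workbooks.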

> #### Side Note: File Location
> We only used the file name for `data_file`. This is because we included the CSV file in the same folder as this notebook. If it were somewhere else, we'd have to include the file path (e.g. `"/Documents/R/ca_wac_S000_JT00_2015.csv"`). If you don't know much about how file paths work, don't worry: you just need to make sure that the file is in the same folder as the notebook.
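If the file did live in another folder, one way to build and check the path might look like the sketch below. The folder names are hypothetical, so `file.exists` will most likely print `FALSE` unless you happen to have that exact folder.

```r
# Hypothetical folder; replace it with wherever the file actually lives
data_dir <- file.path("Documents", "R")

# file.path() joins the pieces with "/" so we don't have to type separators
data_path <- file.path(data_dir, "ca_wac_S000_JT00_2015.csv")
print(data_path)

# The folder R treats as "here" when it sees a bare file name
print(getwd())

# TRUE only if a file really exists at that path
print(file.exists(data_path))
```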

Lastly, we've included a commented-out line that loads the CSV file into `df` in a single step. It reads the same file as the first two lines, but without first assigning `'ca_wac_S000_JT00_2015.csv'` to `data_file`: all we did was replace `data_file` with the string itself. The line is **commented out** with a `#` symbol, which means R ignores everything after it on that line. Feel free to try running it by itself (commenting out the first two lines) to compare the result.

## A Note on Data Types

We've mentioned that `data_file` is a string variable and that `df` is a Data Frame. These are different variable types, and it's important to keep this in mind because the type of variable dictates what you can do with it. 

In [6]:
class(data_file)

Since `data_file` is a string, the `class` function returns `character`, which essentially denotes that it is text. Let's look at `df`. What do you think the output will be?

In [35]:
class(df)

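The output tells us that `df` is a data frame. Because we read the file with `read_csv`, it is more specifically a tibble, the tidyverse's version of a data frame, so the output may list classes such as `tbl_df` alongside `data.frame`.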
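To see why the type matters, here is a minimal sketch using two made-up variables (`file_name` and `n_jobs` are illustrative names only). Arithmetic works on a number but fails on a string, while text functions work on the string.

```r
file_name <- "ca_wac_S000_JT00_2015.csv"  # a string (character)
n_jobs <- 30                              # a number (numeric)

print(class(file_name))
print(class(n_jobs))

# Arithmetic is fine for numbers
print(n_jobs * 2)

# Text functions are fine for strings
print(toupper(file_name))

# Arithmetic on text is an error; tryCatch captures the message
# so the rest of the code keeps running
result <- tryCatch(file_name * 2, error = function(e) conditionMessage(e))
print(result)
```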

For the LODES data, we can actually also read it in using a URL. If we want to do this, we'll have to account for the file being compressed, but that's ok because the `read_csv` function also allows for reading compressed files. 

In [34]:
data_file = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df <- read_csv(data_file)


Parsed with column specification:
cols(
  .default = col_double(),
  w_geocode = col_character()
)

See spec(...) for full column specifications.



In [37]:
# Can check the column specifications
spec(df)

cols(
  w_geocode = col_character(),
  C000 = col_double(),
  CA01 = col_double(),
  CA02 = col_double(),
  CA03 = col_double(),
  CE01 = col_double(),
  CE02 = col_double(),
  CE03 = col_double(),
  CNS01 = col_double(),
  CNS02 = col_double(),
  CNS03 = col_double(),
  CNS04 = col_double(),
  CNS05 = col_double(),
  CNS06 = col_double(),
  CNS07 = col_double(),
  CNS08 = col_double(),
  CNS09 = col_double(),
  CNS10 = col_double(),
  CNS11 = col_double(),
  CNS12 = col_double(),
  CNS13 = col_double(),
  CNS14 = col_double(),
  CNS15 = col_double(),
  CNS16 = col_double(),
  CNS17 = col_double(),
  CNS18 = col_double(),
  CNS19 = col_double(),
  CNS20 = col_double(),
  CR01 = col_double(),
  CR02 = 
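The column specification above shows `w_geocode` being read as text rather than as a number. If you want to state column types yourself instead of relying on guessing, `read_csv` accepts a `col_types` argument. The sketch below assumes a tiny temporary file with made-up geocodes and job counts, not the real LODES file.

```r
library(readr)

# Write a tiny CSV to a temporary file (values are invented for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c("w_geocode,C000",
             "060014001001007,30",
             "060014001001008,4"), tmp)

# State the column types explicitly instead of letting readr guess
small <- read_csv(tmp, col_types = cols(
  w_geocode = col_character(),
  C000 = col_double()
))

print(class(small$w_geocode))

# Kept as text, the identifier retains all 15 characters, including the leading zero
print(nchar(small$w_geocode[1]))

# Converted to a number, the leading zero is lost
print(format(as.numeric(small$w_geocode[1]), scientific = FALSE))
```

Keeping identifiers such as block codes as text is a common choice, since they are labels rather than quantities to do arithmetic on.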

## <span style="color:red">Checkpoint 1: Read in Other Data</span>

You can access LODES data from other states by using the link below: 

[LODES Data](https://lehd.ces.census.gov/data/lodes/LODES7)

and navigating to the state you want. Check out the [LODES documentation](https://lehd.ces.census.gov/data/lodes/LODES7/LODESTechDoc7.3.pdf) for more information.

If you download any data, it will go to your local computer. However, since this notebook is running from a server in the cloud, you'll have to upload it to this server to access it. You can do so by navigating to the **Home** folder by clicking on the **Jupyter** icon at the top left of this page, and clicking the **Upload** button in the right-hand corner.

If you have problems with that, don't worry. We've included LODES data files for several other states in this environment too. They are named:
- Illinois: `il_wac_S000_JT00_2015.csv`
- Indiana: `in_wac_S000_JT00_2015.csv`
- Maryland: `md_wac_S000_JT00_2015.csv`

See if you can load one similarly to how you loaded the California data above.

Make sure you assign it to a variable other than `df` so that you don't overwrite the data we loaded earlier (for example, if you choose Illinois, you might use `df_il`). Play around with the few functions you've learned. 

## <span style="color:green">Exploring the Data Frame (VIDEO)</span>

Now that we've loaded in the data set as a Data Frame, let's check the number of rows and columns. We can do this by using the `dim` function with a data frame.

In [8]:
dim(df)

The first number refers to the number of rows in the data frame, while the second number refers to the number of columns. You can also get a more comprehensive look at the details of the data frame by using the base R function `str`, which stands for structure, or `glimpse` from the tidyverse suite of packages. The `glimpse` function will generally have the same information, but tries to be smarter about how it displays the information to make it easier to read. In this case, it's essentially the same. 

In [9]:
str(df)

'data.frame':	243462 obs. of  53 variables:
 $ w_geocode : num  6e+13 6e+13 6e+13 6e+13 6e+13 ...
 $ C000      : int  30 4 3 11 10 3 13 13 2 1 ...
 $ CA01      : int  2 0 2 3 3 0 3 2 0 0 ...
 $ CA02      : int  16 1 1 3 3 2 3 4 0 0 ...
 $ CA03      : int  12 3 0 5 4 1 7 7 2 1 ...
 $ CE01      : int  4 0 0 2 7 0 4 3 2 1 ...
 $ CE02      : int  2 0 3 2 1 2 5 2 0 0 ...
 $ CE03      : int  24 4 0 7 2 1 4 8 0 0 ...
 $ CNS01     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CNS02     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CNS03     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CNS04     : int  0 0 0 0 0 0 0 0 0 1 ...
 $ CNS05     : int  0 0 0 0 2 0 0 0 0 0 ...
 $ CNS06     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CNS07     : int  0 0 0 0 0 0 0 6 0 0 ...
 $ CNS08     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CNS09     : int  0 4 0 0 0 0 0 0 0 0 ...
 $ CNS10     : int  1 0 0 0 0 0 0 0 0 0 ...
 $ CNS11     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CNS12     : int  25 0 0 11 0 2 3 6 0 0 ...
 $ CNS13     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CNS14    

In [10]:
glimpse(df)

Observations: 243,462
Variables: 53
$ w_geocode  <dbl> 6.0014e+13, 6.0014e+13, 6.0014e+13, 6.0014e+13, 6.0014e+13…
$ C000       <int> 30, 4, 3, 11, 10, 3, 13, 13, 2, 1, 14, 3, 1, 9, 3, 1, 2, 1…
$ CA01       <int> 2, 0, 2, 3, 3, 0, 3, 2, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ CA02       <int> 16, 1, 1, 3, 3, 2, 3, 4, 0, 0, 9, 0, 0, 9, 1, 0, 2, 1, 1, …
$ CA03       <int> 12, 3, 0, 5, 4, 1, 7, 7, 2, 1, 3, 3, 1, 0, 1, 1, 0, 0, 0, …
$ CE01       <int> 4, 0, 0, 2, 7, 0, 4, 3, 2, 1, 3, 0, 1, 0, 0, 0, 0, 1, 0, 1…
$ CE02       <int> 2, 0, 3, 2, 1, 2, 5, 2, 0, 0, 5, 0, 0, 0, 2, 0, 0, 0, 0, 2…
$ CE03       <int> 24, 4, 0, 7, 2, 1, 4, 8, 0, 0, 6, 3, 0, 9, 1, 1, 2, 0, 1, …
$ CNS01      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ CNS02      

To look at the first few rows of a data frame, we can use the `head` function.

In [12]:
head(df)

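| | w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ⋯ | CFA02 | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | ⋯ | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 60014000000000.0 | 30 | 2 | 16 | 12 | 4 | 2 | 24 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 |
| 2 | 60014000000000.0 | 4 | 0 | 1 | 3 | 0 | 0 | 4 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 |
| 3 | 60014000000000.0 | 3 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 |
| 4 | 60014000000000.0 | 11 | 3 | 3 | 5 | 2 | 2 | 7 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 |
| 5 | 60014000000000.0 | 10 | 3 | 3 | 4 | 7 | 1 | 2 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 |
| 6 | 60014000000000.0 | 3 | 0 | 2 | 1 | 0 | 2 | 1 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 |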
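The functions covered so far (`dim`, `head`) have a few close relatives that are handy when exploring. The sketch below uses a small made-up data frame called `toy`, not the LODES data, so the numbers are for illustration only.

```r
# A tiny made-up data frame standing in for the real data
toy <- data.frame(
  w_geocode = c("b1", "b2", "b3", "b4"),
  C000 = c(30, 4, 3, 11),
  stringsAsFactors = FALSE
)

print(dim(toy))      # rows, then columns
print(nrow(toy))     # just the number of rows
print(ncol(toy))     # just the number of columns
print(names(toy))    # the column names

print(head(toy, 2))  # first 2 rows instead of the default 6
print(tail(toy, 2))  # last 2 rows
```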


### Accessing the Data Frame
What if we want to only look at certain cells, or certain columns? We can use a variety of commands to do just that.

To access individual columns, we can use square brackets or we can simply use a `$` sign.

In [13]:
df$w_geocode
# df[['w_geocode']]
# These do the same thing
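For looking at specific cells or several columns at once, square brackets take a row position and a column position, separated by a comma. This is a minimal sketch on a made-up data frame named `toy`.

```r
toy <- data.frame(
  w_geocode = c("b1", "b2", "b3"),
  C000 = c(30, 4, 3),
  CA01 = c(2, 0, 2),
  stringsAsFactors = FALSE
)

# One column, returned as a plain vector (two equivalent ways)
print(toy$C000)
print(toy[["C000"]])

# One cell: row 2 of the C000 column
print(toy[2, "C000"])

# The first two rows of two chosen columns
print(toy[1:2, c("w_geocode", "C000")])
```

Leaving the row position empty (as in `toy[, "C000"]`) means "all rows".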

### Using `dplyr` for Manipulating and Transforming Data

In the `R` programming world, there is a suite of packages called `tidyverse` that is widely used. Though it isn't the default way to work with data, and it is possible to do many of the same tasks without the `tidyverse`, it is extremely popular because of the syntax and logic behind how it works.

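Let's take a look at some example code. Below, we subset the data by keeping only the census blocks that have more than 100 jobs, then look at the top few rows. The `%>%` operator, called the pipe, passes the result on its left to the function on its right as that function's first argument.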

In [20]:
df %>% filter(C000 > 100) %>% head()

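| | w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ⋯ | CFA02 | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | ⋯ | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 60014000000000.0 | 595 | 87 | 326 | 182 | 95 | 224 | 276 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 |
| 2 | 60014000000000.0 | 172 | 41 | 93 | 38 | 25 | 69 | 78 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 |
| 3 | 60014000000000.0 | 285 | 59 | 156 | 70 | 57 | 80 | 148 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 |
| 4 | 60014000000000.0 | 103 | 45 | 48 | 10 | 33 | 43 | 27 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 |
| 5 | 60014000000000.0 | 256 | 28 | 184 | 44 | 31 | 36 | 189 | 2 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 |
| 6 | 60014000000000.0 | 130 | 36 | 69 | 25 | 32 | 60 | 38 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 |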
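One way to read a chain of `%>%` steps is as a rewritten version of nested function calls. The sketch below, on a made-up data frame `toy`, compares the piped form used above with the equivalent nested form.

```r
library(dplyr)

toy <- data.frame(
  w_geocode = c("b1", "b2", "b3", "b4"),
  C000 = c(595, 4, 172, 11),
  stringsAsFactors = FALSE
)

# Piped form: read left to right, one step at a time
piped <- toy %>% filter(C000 > 100) %>% head()

# Nested form: the same steps, read from the inside out
nested <- head(filter(toy, C000 > 100))

# Both approaches give the same result
print(identical(piped, nested))
print(piped)
```

The piped form tends to be easier to read as the number of steps grows, which is a large part of its appeal.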


In [21]:
df %>% filter(C000 > 100) %>% select(w_geocode, C000) %>% head()

| | w_geocode | C000 |
|---|---|---|
| | `<dbl>` | `<int>` |
| 1 | 60014000000000.0 | 595 |
| 2 | 60014000000000.0 | 172 |
| 3 | 60014000000000.0 | 285 |
| 4 | 60014000000000.0 | 103 |
| 5 | 60014000000000.0 | 256 |
| 6 | 60014000000000.0 | 130 |
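The second example adds `select`, which keeps only the named columns. As a further sketch (again on a made-up data frame, with invented values), `filter` can take several conditions at once, and `select` can pick columns by a name pattern using the helper `starts_with`.

```r
library(dplyr)

toy <- data.frame(
  w_geocode = c("b1", "b2", "b3", "b4"),
  C000 = c(595, 4, 172, 11),
  CA01 = c(87, 0, 41, 3),
  CA02 = c(326, 1, 93, 3),
  CA03 = c(182, 3, 38, 5),
  CE01 = c(95, 0, 25, 2),
  stringsAsFactors = FALSE
)

result <- toy %>%
  # Conditions separated by commas must all be true for a row to be kept
  filter(C000 > 100, CA01 > 50) %>%
  # Keep the block ID plus every column whose name begins with "CA"
  select(w_geocode, starts_with("CA"))

print(result)
```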


## <span style="color:red">Checkpoint 2: Explore Your Data</span>

Now look at the data frame you loaded earlier using the tools we've just covered. Does the number of rows and columns make sense? Try subsetting the data set. How do the results compare to those from California? Do they make sense?

## Working with Data



In [7]:
summary(df)

   w_geocode              C000               CA01              CA02         
 Min.   :6.001e+13   Min.   :    1.00   Min.   :   0.00   Min.   :    0.00  
 1st Qu.:6.037e+13   1st Qu.:    2.00   1st Qu.:   0.00   1st Qu.:    1.00  
 Median :6.059e+13   Median :    7.00   Median :   1.00   Median :    4.00  
 Mean   :6.055e+13   Mean   :   65.92   Mean   :  14.18   Mean   :   37.15  
 3rd Qu.:6.073e+13   3rd Qu.:   32.00   3rd Qu.:   7.00   3rd Qu.:   17.00  
 Max.   :6.115e+13   Max.   :72275.00   Max.   :9319.00   Max.   :47140.00  
      CA03               CE01               CE02               CE03        
 Min.   :    0.00   Min.   :    0.00   Min.   :    0.00   Min.   :    0.0  
 1st Qu.:    0.00   1st Qu.:    1.00   1st Qu.:    0.00   1st Qu.:    0.0  
 Median :    2.00   Median :    3.00   Median :    2.00   Median :    1.0  
 Mean   :   14.59   Mean   :   15.36   Mean   :   21.36   Mean   :   29.2  
 3rd Qu.:    7.00   3rd Qu.:    9.00   3rd Qu.:   12.00   3rd Qu.:    8.0  
 Max.

In [15]:
df %>% summarize(mean = mean(C000), n = n())

| mean | n |
|---|---|
| `<dbl>` | `<int>` |
| 65.9189 | 243462 |
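`summarize` can compute several statistics in one call. The sketch below uses a made-up column with one missing value to show the `na.rm = TRUE` argument; whether the real data contains any missing values is something to check rather than assume.

```r
library(dplyr)

# Made-up job counts, including one missing value (NA)
toy <- data.frame(C000 = c(30, 4, 3, 11, NA))

stats <- toy %>%
  summarize(
    mean_jobs   = mean(C000, na.rm = TRUE),    # na.rm = TRUE skips missing values
    median_jobs = median(C000, na.rm = TRUE),
    max_jobs    = max(C000, na.rm = TRUE),
    n_missing   = sum(is.na(C000)),            # how many values are missing
    n           = n()                          # number of rows, missing or not
  )

print(stats)
```

Without `na.rm = TRUE`, a single missing value would make the mean, median, and maximum come back as `NA`.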


### Summaries with `group_by`

In [16]:
ca_xwalk <- read.csv('ca_xwalk.csv')

In [17]:
ca_xwalk %>% head()

| | tabblk2010 | st | stusps | stname | cty | ctyname | trct | trctname | bgrp | bgrpname | ⋯ | stanrcname | necta | nectaname | mil | milname | stwib | stwibname | blklatdd | blklondd | createdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<fct>` | `<fct>` | `<int>` | `<fct>` | `<dbl>` | `<fct>` | `<dbl>` | `<fct>` | ⋯ | `<lgl>` | `<int>` | `<lgl>` | `<dbl>` | `<fct>` | `<int>` | `<fct>` | `<dbl>` | `<dbl>` | `<int>` |
| 1 | 60971500000000.0 | 6 | CA | California | 6097 | Sonoma County, CA | 6097150203 | 1502.03 (Sonoma, CA) | 60971502032 | 2 (Tract 1502.03, Sonoma, CA) | ⋯ | | 99999 | | | | 6000056 | 56 Sonoma County WIB | 38.2764 | -122.4507 | 20190826 |
| 2 | 60971500000000.0 | 6 | CA | California | 6097 | Sonoma County, CA | 6097150202 | 1502.02 (Sonoma, CA) | 60971502024 | 4 (Tract 1502.02, Sonoma, CA) | ⋯ | | 99999 | | | | 6000056 | 56 Sonoma County WIB | 38.3028 | -122.4652 | 20190826 |
| 3 | 60971500000000.0 | 6 | CA | California | 6097 | Sonoma County, CA | 6097150202 | 1502.02 (Sonoma, CA) | 60971502021 | 1 (Tract 1502.02, Sonoma, CA) | ⋯ | | 99999 | | | | 6000056 | 56 Sonoma County WIB | 38.3089 | -122.4466 | 20190826 |
| 4 | 60971500000000.0 | 6 | CA | California | 6097 | Sonoma County, CA | 6097150202 | 1502.02 (Sonoma, CA) | 60971502023 | 3 (Tract 1502.02, Sonoma, CA) | ⋯ | | 99999 | | | | 6000056 | 56 Sonoma County WIB | 38.29343 | -122.4417 | 20190826 |
| 5 | 60971500000000.0 | 6 | CA | California | 6097 | Sonoma County, CA | 6097150204 | 1502.04 (Sonoma, CA) | 60971502041 | 1 (Tract 1502.04, Sonoma, CA) | ⋯ | | 99999 | | | | 6000056 | 56 Sonoma County WIB | 38.28489 | -122.4513 | 20190826 |
| 6 | 60971500000000.0 | 6 | CA | California | 6097 | Sonoma County, CA | 6097150100 | 1501 (Sonoma, CA) | 60971501003 | 3 (Tract 1501, Sonoma, CA) | ⋯ | | 99999 | | | | 6000056 | 56 Sonoma County WIB | 38.18174 | -122.43 | 20190826 |


In [18]:
ca_xwalk %>% group_by(ctyname) %>% summarize(n = n())

| ctyname | n |
|---|---|
| `<fct>` | `<int>` |
| Alameda County, CA | 23955 |
| Alpine County, CA | 456 |
| Amador County, CA | 1396 |
| Butte County, CA | 6517 |
| Calaveras County, CA | 2818 |
| Colusa County, CA | 2204 |
| Contra Costa County, CA | 18310 |
| Del Norte County, CA | 2062 |
| El Dorado County, CA | 5810 |
| Fresno County, CA | 22096 |


In [26]:
ca_xwalk %>% group_by(ctyname) %>% summarize(n = n()) %>% arrange(desc(n)) %>% head(10)

| ctyname | n |
|---|---|
| `<fct>` | `<int>` |
| Los Angeles County, CA | 109588 |
| San Bernardino County, CA | 48176 |
| San Diego County, CA | 43415 |
| Orange County, CA | 36875 |
| Riverside County, CA | 35717 |
| Kern County, CA | 35280 |
| Alameda County, CA | 23955 |
| Santa Clara County, CA | 22369 |
| Fresno County, CA | 22096 |
| Sacramento County, CA | 19938 |

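In the cells above, `group_by` splits the rows into groups (here, one group per county), `summarize` computes one row per group, and `arrange(desc(n))` sorts from largest to smallest. The sketch below repeats that pattern on a made-up crosswalk with invented county names, and compares it with `count`, a dplyr shortcut for the same tally.

```r
library(dplyr)

# Made-up crosswalk: one row per block, labelled with its county
toy_xwalk <- data.frame(
  ctyname = c("County A", "County A", "County B",
              "County C", "County A", "County B"),
  stringsAsFactors = FALSE
)

# Step by step: group, count rows per group, sort by the count
by_hand <- toy_xwalk %>%
  group_by(ctyname) %>%
  summarize(n = n()) %>%
  arrange(desc(n))

# Shortcut: count() groups and tallies in one call; sort = TRUE orders by n
shortcut <- toy_xwalk %>% count(ctyname, sort = TRUE)

print(by_hand)
print(shortcut)
```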

### Creating a new column with `mutate`

The `mutate` function allows us to create a new column or alter existing columns.

In [23]:
df %>% mutate(CA_sum = CA01 + CA02 + CA03) %>% head()

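| | w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ⋯ | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | CA_sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | ⋯ | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 60014000000000.0 | 30 | 2 | 16 | 12 | 4 | 2 | 24 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 30 |
| 2 | 60014000000000.0 | 4 | 0 | 1 | 3 | 0 | 0 | 4 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 4 |
| 3 | 60014000000000.0 | 3 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 3 |
| 4 | 60014000000000.0 | 11 | 3 | 3 | 5 | 2 | 2 | 7 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 11 |
| 5 | 60014000000000.0 | 10 | 3 | 3 | 4 | 7 | 1 | 2 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 10 |
| 6 | 60014000000000.0 | 3 | 0 | 2 | 1 | 0 | 2 | 1 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20170919 | 3 |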
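In the rows shown, `CA_sum` matches `C000`, which suggests the three age-group columns add up to the total. Here is a minimal sketch on a small data frame that copies the first three rows of values above. The column `share_ca01` is a made-up name for illustration. Note that `mutate` returns a new data frame, so the result has to be assigned to a name if you want to keep it.

```r
library(dplyr)

# First three rows of the values shown above
toy <- data.frame(
  C000 = c(30, 4, 3),
  CA01 = c(2, 0, 2),
  CA02 = c(16, 1, 1),
  CA03 = c(12, 3, 0)
)

toy_new <- toy %>%
  mutate(
    CA_sum = CA01 + CA02 + CA03,   # new column: sum of the three age groups
    share_ca01 = CA01 / C000       # new column: share of jobs in the first age group
  )

# Do the age groups add up to the total in every row?
print(all(toy_new$CA_sum == toy_new$C000))

# The original data frame is unchanged: it has no CA_sum column
print("CA_sum" %in% names(toy))
```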


## <span style="color:red">Checkpoint 3: Descriptive Statistics on Your Data</span>

Using the tools described above, look at the data you loaded in earlier. Make sure you know the answers to each of the following questions:
- Are there any missing values?
- What is the mean of each variable?
- Are there any inconsistencies in the data? 
- Are there missing values that may not have been coded as missing?
- Are there any interesting outliers?

In addition, try to think about the distribution of jobs by different characteristics like age group and industry. Which age group had the most jobs in the state? Which industry?
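As one possible starting point for the missing-value questions, the sketch below counts `NA` values per column on a made-up data frame. It is only a hint under the assumption that missing values are coded as `NA`; values coded some other way (for example, as a placeholder number) would not be caught by this check.

```r
# Made-up data with one missing value in CA01
toy <- data.frame(
  C000 = c(30, 4, 3),
  CA01 = c(2, NA, 2)
)

# TRUE if there is at least one NA anywhere in the data frame
print(anyNA(toy))

# Number of NA values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```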

## Error Messages and Documentation

If you forget what a function does, you can always check the documentation by using a question mark in front of it.

In [2]:
?mean

`mean {base}` (R Documentation)

| Argument | Description |
|---|---|
| `x` | An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only. |
| `trim` | the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint. |
| `na.rm` | a logical value indicating whether NA values should be stripped before the computation proceeds. |
| `...` | further arguments passed to or from other methods. |
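To connect the documentation to practice, here is a small sketch that tries out the `na.rm` and `trim` arguments described above on made-up numbers. Writing `help("mean")` is another way to open the same documentation page as `?mean`.

```r
# args() prints a function's argument list without opening the full help page
print(args(mean))

values <- c(1, 2, NA)

# With a missing value present, the default result is NA
plain <- mean(values)
print(plain)

# na.rm = TRUE removes missing values before averaging
cleaned <- mean(values, na.rm = TRUE)
print(cleaned)

# trim = 0.25 drops the lowest and highest quarter of the values first,
# which reduces the influence of the outlier 100
trimmed <- mean(c(1, 2, 3, 100), trim = 0.25)
print(trimmed)
```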


### Useful Cheatsheets

- [Using `dplyr` for data transformation](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)