<center><h1>Dataframes, Tidy data, Tidyverse</h1></center>
<center><h3>Ellen Duong</h3></center>
<center><h3>August Guang</h3></center>
<center><h3>Paul Stey</h3></center>
<center><h3>2024-09-18</h3></center>

# 1. Review: What is a `data.frame`?

  - Tabular data structure (i.e., like an Excel spreadsheet)
  - Canonical data structure for data analysis
  - Capable of storing heterogeneous data

## 1.1 What does it look like?

In [46]:
idx <- 1:4
score <- rnorm(4)
vocals <- c(TRUE, TRUE, TRUE, FALSE)
firstname <- c("john", "george", "paul", "ringo")

dat <- data.frame(idx, firstname, score, vocals)

dat 

idx,firstname,score,vocals
<int>,<chr>,<dbl>,<lgl>
1,john,0.2674111,True
2,george,0.4379386,True
3,paul,0.78634455,True
4,ringo,-0.09074287,False
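To see what "heterogeneous" means in practice, one can check the class of each column. The sketch below rebuilds a small data frame like `dat`; the name `dat_demo` and the fixed scores (used in place of `rnorm()` so the result is reproducible) are assumptions made for illustration. It also contrasts the data frame with a matrix, which can only hold a single type.

```r
# Illustrative data frame; scores are fixed rather than random
dat_demo <- data.frame(
    idx = 1:4,
    firstname = c("john", "george", "paul", "ringo"),
    score = c(0.27, 0.44, 0.79, -0.09),
    vocals = c(TRUE, TRUE, TRUE, FALSE),
    stringsAsFactors = FALSE
)

col_classes <- sapply(dat_demo, class)    # one class per column
col_classes

# A matrix holds only one type, so every value is coerced to character
mat_demo <- as.matrix(dat_demo)
class(mat_demo[1, 1])
```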


## 1.2 Indexing and Slicing a `data.frame`

  - Similar to `vector`, `matrix`, and `array` objects

In [47]:
dat[1, 2]           # get element in first row, second column

In [48]:
dat[, 2]            # get all of second column

In [49]:
dat[3, ]            # get third row

Unnamed: 0_level_0,idx,firstname,score,vocals
Unnamed: 0_level_1,<int>,<chr>,<dbl>,<lgl>
3,3,paul,0.7863446,True


### 1.2.1 Indexing using Column Names

In [50]:
dat[3, "score"]          # element from row 3 and "score" column

In [51]:
dat[2:4, "firstname"]    # get elements 2, 3, and 4 from "firstname" column
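One detail worth knowing when selecting a single column this way: by default the result is simplified to a plain vector. A minimal sketch, using a hypothetical `dat_demo` data frame, of how `drop = FALSE` keeps the result as a `data.frame`:

```r
# Illustrative data frame (not the `dat` object from above)
dat_demo <- data.frame(
    idx = 1:4,
    firstname = c("john", "george", "paul", "ringo"),
    stringsAsFactors = FALSE
)

names_vec <- dat_demo[, "firstname"]                 # simplified to a character vector
names_df  <- dat_demo[, "firstname", drop = FALSE]   # remains a one-column data.frame

is.data.frame(names_vec)
is.data.frame(names_df)
```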

## 1.3 The `$` Operator and `data.frame` Objects 

In [52]:
dat$firstname            # get the "firstname" column
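The `$` operator needs the column name typed out literally. When the name is stored in a variable, double brackets `[[ ]]` do the same job. The sketch below uses a hypothetical `dat_demo` data frame and an assumed variable `col_name`:

```r
# Illustrative data frame with fixed scores
dat_demo <- data.frame(
    firstname = c("john", "george", "paul", "ringo"),
    score = c(0.27, 0.44, 0.79, -0.09),
    stringsAsFactors = FALSE
)

col_name <- "score"                     # column name held in a variable

by_dollar   <- dat_demo$score           # name written literally
by_brackets <- dat_demo[[col_name]]     # name looked up from the variable

identical(by_dollar, by_brackets)
```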

# 2. Filter `data.frame` using Logical Indexing

In [53]:
dat

idx,firstname,score,vocals
<int>,<chr>,<dbl>,<lgl>
1,john,0.2674111,True
2,george,0.4379386,True
3,paul,0.78634455,True
4,ringo,-0.09074287,False


In [54]:
idx_keep <- c(TRUE, TRUE, TRUE, FALSE)

dat[idx_keep, ]

Unnamed: 0_level_0,idx,firstname,score,vocals
Unnamed: 0_level_1,<int>,<chr>,<dbl>,<lgl>
1,1,john,0.2674111,True
2,2,george,0.4379386,True
3,3,paul,0.7863446,True
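In practice the logical vector is rarely typed by hand; it usually comes from a comparison on one of the columns. A minimal sketch, assuming a hypothetical `dat_demo` data frame with fixed scores:

```r
# Illustrative data frame with fixed scores
dat_demo <- data.frame(
    firstname = c("john", "george", "paul", "ringo"),
    score = c(0.27, 0.44, 0.79, -0.09),
    stringsAsFactors = FALSE
)

positive_score <- dat_demo$score > 0        # logical vector, one element per row
positive_score

dat_positive <- dat_demo[positive_score, ]  # keep only rows where the test is TRUE
dat_positive
```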


## 2.1 Create New `data.frame` from Another

In [55]:
dat2 <- dat[idx_keep, ]       # create new dataframe, from subset of original

tail(dat2, n=2)

Unnamed: 0_level_0,idx,firstname,score,vocals
Unnamed: 0_level_1,<int>,<chr>,<dbl>,<lgl>
2,2,george,0.4379386,True
3,3,paul,0.7863446,True


### 2.1.1 Take Subset of `data.frame` Columns

In [56]:
cols <- c("firstname", "score")    # columns we care about

dat_namescore <- dat2[, cols]      # create new dataframe

dat_namescore

Unnamed: 0_level_0,firstname,score
Unnamed: 0_level_1,<chr>,<dbl>
1,john,0.2674111
2,george,0.4379386
3,paul,0.7863446


# 3. Adding Columns to a `data.frame`

In [57]:
dat

idx,firstname,score,vocals
<int>,<chr>,<dbl>,<lgl>
1,john,0.2674111,True
2,george,0.4379386,True
3,paul,0.78634455,True
4,ringo,-0.09074287,False


In [58]:
dat$food <- c("steak", "chicken", "potato", "rice")

dat

idx,firstname,score,vocals,food
<int>,<chr>,<dbl>,<lgl>,<chr>
1,john,0.2674111,True,steak
2,george,0.4379386,True,chicken
3,paul,0.78634455,True,potato
4,ringo,-0.09074287,False,rice


## 3.1 Adding Columns (cont.)

In [59]:
dat[, "drink"] <- c("water", "milk", "beer", "scotch")

dat

idx,firstname,score,vocals,food,drink
<int>,<chr>,<dbl>,<lgl>,<chr>,<chr>
1,john,0.2674111,True,steak,water
2,george,0.4379386,True,chicken,milk
3,paul,0.78634455,True,potato,beer
4,ringo,-0.09074287,False,rice,scotch
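The new column must fit the number of rows. A single value is recycled to fill every row, while a vector whose length does not divide the row count evenly raises an error. The sketch below uses a hypothetical `dat_demo` data frame and wraps the failing assignment in `tryCatch()` so the example runs to completion:

```r
# Illustrative data frame with four rows
dat_demo <- data.frame(
    firstname = c("john", "george", "paul", "ringo"),
    stringsAsFactors = FALSE
)

dat_demo$band <- "beatles"     # a length-1 value is recycled to all four rows

# Three values cannot fill four rows, so this assignment fails
bad_length <- tryCatch({
    dat_demo$instrument <- c("guitar", "bass", "drums")
    "no error"
}, error = function(e) "error")

bad_length
dat_demo
```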


<center><h1>Challenge Questions</h1></center>

### Question 1.
Create a `data.frame` object called `state_df` with two columns, one called `state` and one called `population`. Each column should have five elements. For the `state` column, select the abbreviations for five US states (e.g., "OH", "RI", "NY", "MA", "CT"). For the `population` column, use the `sample()` function to create "populations" at random from the range `1` to `1000000`.

In [60]:
# Your code here

### Question 2.
Add a third column to the `state_df` dataframe called `size`. In particular, use boolean indexing to set the elements of this column to `"large"` if that row's `population` value is greater than or equal to `500000`, and to `"small"` if the row's `population` is less than `500000`.

In [61]:
# Your code here

# 4. Reading Data from CSV File

  - CSV stands for "comma-separated values"
  - The `,` separator is conventional, but not mandatory
  - The `|` character is also common
  - `read.csv()` is a form of `read.table()` with specific arguments for comma-delimited files
  - The `read.csv()` function has many optional arguments
  - Critically, we can tell R the strings that ought to be considered missing
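The points above can be tried without a file on disk by passing the data inline through the `text` argument. The sketch below is only an illustration: the data, the name `people_df`, and the choice of missing-value strings are all made up for the example.

```r
# Hypothetical pipe-delimited data supplied inline instead of from a file
raw_text <- "name|race|age
A|White|30
B|NULL|41
C||25
D|Unknown|NA"

people_df <- read.csv(
    text = raw_text,
    sep = "|",                                    # use | instead of , as the separator
    na.strings = c("NA", "", "NULL", "Unknown")   # strings to treat as missing
)

people_df
sum(is.na(people_df$race))    # number of missing values in `race`
```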

## 4.1 Providence Police Dept. Data

  - We will be looking at public data regarding arrests and cases

In [62]:
# The line below reads the CSV file and creates a dataframe 

arrests_df <- read.csv("data/pvd_arrests_2021-10-03.csv")     

## 4.2 Exploring the Data

In [63]:
head(arrests_df)         # show first few lines of the dataframe

Unnamed: 0_level_0,arrest_date,year,month,gender,race,ethnicity,year_of_birth,age,from_address,from_city,from_state,statute_type,statute_code,statute_desc,counts,case_number,arresting_officers,id
Unnamed: 0_level_1,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>
1,2019-08-24T02:23:00.0,2019,8,Male,White,NonHispanic,1981,37,No Permanent Address,providence,Rhode Island,,,,,2019-00084142,"YGonzalez, LTaveras",pvd2218242150382148273
2,2019-08-24T02:02:00.0,2019,8,,,,1994,25,SUMMER AVE,Cranston,Rhode Island,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1.0,2019-00084127,NManfredi,pvd15166785558364246202
3,2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,Rhode Island,RI Statute Violation,12-7-10,RESISTING LEGAL OR ILLEGAL ARREST,1.0,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905
4,2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,Rhode Island,RI Statute Violation,11-45-1,DISORDERLY CONDUCT,1.0,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905
5,2019-08-24T02:02:00.0,2019,8,Female,Black,Unknown,2001,18,TRASH ST,,,RI Statute Violation,12-7-10,RESISTING LEGAL OR ILLEGAL ARREST,1.0,2019-00084126,"MPlace, JPerez, ASantos",pvd460449304532374599
6,2019-08-24T02:02:00.0,2019,8,Female,Black,Unknown,2001,18,TRASH ST,,,RI Statute Violation,11-45-1,DISORDERLY CONDUCT,1.0,2019-00084126,"MPlace, JPerez, ASantos",pvd460449304532374599


### 4.2.1 More Data Exploring 

In [64]:
dim(arrests_df)             # get dimensions of the dataframe

In [65]:
nrow(arrests_df)            # get number of rows

In [66]:
ncol(arrests_df)            # get the number of columns

In [67]:
colnames(arrests_df)        # get the column names

### 4.2.2 Summaries from `data.frame`

In [68]:
str(arrests_df)              # the str() function shows the structure of dataframe

'data.frame':	13012 obs. of  18 variables:
 $ arrest_date       : chr  "2019-08-24T02:23:00.0" "2019-08-24T02:02:00.0" "2019-08-24T02:02:00.0" "2019-08-24T02:02:00.0" ...
 $ year              : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
 $ month             : int  8 8 8 8 8 8 8 8 8 8 ...
 $ gender            : chr  "Male" "" "Female" "Female" ...
 $ race              : chr  "White" "" "Black" "Black" ...
 $ ethnicity         : chr  "NonHispanic" "" "NonHispanic" "NonHispanic" ...
 $ year_of_birth     : int  1981 1994 1984 1984 2001 2001 2001 1991 1991 1991 ...
 $ age               : int  37 25 34 34 18 18 18 28 28 28 ...
 $ from_address      : chr  "No Permanent Address" "SUMMER AVE" "DOUGLAS AVE" "DOUGLAS AVE" ...
 $ from_city         : chr  "providence" "Cranston" "Providence" "Providence" ...
 $ from_state        : chr  "Rhode Island" "Rhode Island" "Rhode Island" "Rhode Island" ...
 $ statute_type      : chr  "" "RI Statute Violation" "RI Statute Violation" "RI Stat

### 4.2.3 Summarizing Numeric Data

In [69]:
summary(arrests_df)

 arrest_date             year          month           gender         
 Length:13012       Min.   :2019   Min.   : 1.000   Length:13012      
 Class :character   1st Qu.:2019   1st Qu.: 3.000   Class :character  
 Mode  :character   Median :2020   Median : 7.000   Mode  :character  
                    Mean   :2020   Mean   : 6.508                     
                    3rd Qu.:2021   3rd Qu.: 9.000                     
                    Max.   :2021   Max.   :12.000                     
                                                                      
     race            ethnicity         year_of_birth       age       
 Length:13012       Length:13012       Min.   :1938   Min.   :18.00  
 Class :character   Class :character   1st Qu.:1980   1st Qu.:24.00  
 Mode  :character   Mode  :character   Median :1989   Median :31.00  
                                       Mean   :1986   Mean   :33.07  
                                       3rd Qu.:1995   3rd Qu.:39.00  
            

### 4.2.4 Summarizing Numeric Data (cont.)

In [70]:
numeric_vars <- c("month", "year", "age", "year_of_birth", "counts")

In [71]:
summary(arrests_df[, numeric_vars])

     month             year           age        year_of_birth 
 Min.   : 1.000   Min.   :2019   Min.   :18.00   Min.   :1938  
 1st Qu.: 3.000   1st Qu.:2019   1st Qu.:24.00   1st Qu.:1980  
 Median : 7.000   Median :2020   Median :31.00   Median :1989  
 Mean   : 6.508   Mean   :2020   Mean   :33.07   Mean   :1986  
 3rd Qu.: 9.000   3rd Qu.:2021   3rd Qu.:39.00   3rd Qu.:1995  
 Max.   :12.000   Max.   :2021   Max.   :83.00   Max.   :2003  
                                                               
     counts      
 Min.   : 1.000  
 1st Qu.: 1.000  
 Median : 1.000  
 Mean   : 1.087  
 3rd Qu.: 1.000  
 Max.   :15.000  
 NA's   :2983    
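Rather than typing the numeric column names by hand, they can also be found programmatically. The sketch below uses a small made-up data frame (`toy_df` is an assumption, not the arrests data) and also shows how the `NA` values reported by `summary()` affect a mean.

```r
# Illustrative data frame with one missing value in `counts`
toy_df <- data.frame(
    year = c(2019L, 2020L, 2021L),
    gender = c("Male", "Female", "Male"),
    counts = c(1L, NA, 2L),
    stringsAsFactors = FALSE
)

is_num <- sapply(toy_df, is.numeric)      # TRUE for each numeric column
numeric_cols <- names(toy_df)[is_num]
numeric_cols

summary(toy_df[, numeric_cols])

mean(toy_df$counts)                       # NA, because one value is missing
mean(toy_df$counts, na.rm = TRUE)         # ignore missing values
```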

### 4.2.5 Summarizing String Variables

What's wrong with the data in the summary here?

In [72]:
table(arrests_df$race)           # show summary of "race" column in `arrests_df`


                               American Indian/Alaskan Native 
                            24                             25 
        Asian/Pacific Islander                          Black 
                           121                           5721 
                          NULL                        Unknown 
                            41                            588 
                         White            ZHispanic (FD only) 
                          6483                              9 

## 4.3 Utilizing `read.csv` options

What option in `read.csv` would help with the issue above?

In [73]:
help(read.csv)

read.table {utils}, R Documentation

file,"the name of the file which the data are to be read from. Each row of the table appears as one line of the file. If it does not contain an absolute path, the file name is relative to the current working directory, getwd(). Tilde-expansion is performed where supported. This can be a compressed file (see file). Alternatively, file can be a readable text-mode connection (which will be opened for reading if necessary, and if so closed (and hence destroyed) at the end of the function call). (If stdin() is used, the prompts for lines may be somewhat confusing. Terminate input with a blank line or an EOF signal, Ctrl-D on Unix and Ctrl-Z on Windows. Any pushback on stdin() will be cleared before return.) file can also be a complete URL. (For the supported URL schemes, see the ‘URLs’ section of the help for url.)"
header,"a logical value indicating whether the file contains the names of the variables as its first line. If missing, the value is determined from the file format: header is set to TRUE if and only if the first row contains one fewer field than the number of columns."
sep,"the field separator character. Values on each line of the file are separated by this character. If sep = """" (the default for read.table) the separator is ‘white space’, that is one or more spaces, tabs, newlines or carriage returns."
quote,"the set of quoting characters. To disable quoting altogether, use quote = """". See scan for the behaviour on quotes embedded in quotes. Quoting is only considered for columns read as character, which is all of them unless colClasses is specified."
dec,the character used in the file for decimal points.
numerals,"string indicating how to convert numbers whose conversion to double precision would lose accuracy, see type.convert. Can be abbreviated. (Applies also to complex-number inputs.)"
row.names,"a vector of row names. This can be a vector giving the actual row names, or a single number giving the column of the table which contains the row names, or character string giving the name of the table column containing the row names. If there is a header and the first row contains one fewer field than the number of columns, the first column in the input is used for the row names. Otherwise if row.names is missing, the rows are numbered. Using row.names = NULL forces row numbering. Missing or NULL row.names generate row names that are considered to be ‘automatic’ (and not preserved by as.matrix)."
col.names,"a vector of optional names for the variables. The default is to use ""V"" followed by the column number."
as.is,"controls conversion of character variables (insofar as they are not converted to logical, numeric or complex) to factors, if not otherwise specified by colClasses. Its value is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors. Note: to suppress all conversions including those of numeric columns, set colClasses = ""character"". Note that as.is is specified per column (not per variable) and so includes the column of row names (if any) and any columns to be skipped."
na.strings,"a character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields. Note that the test happens after white space is stripped from the input, so na.strings values may need their own white space stripped in advance."


In [74]:
arrests_df2 <- read.csv("data/pvd_arrests_2021-10-03.csv", 
                        na.strings = c("NA", "", " ", "NULL", "Unknown"))

### 4.3.1 Effects of  `na.strings`

In [75]:
table(arrests_df$race)             # explore `race` in original dataframe


                               American Indian/Alaskan Native 
                            24                             25 
        Asian/Pacific Islander                          Black 
                           121                           5721 
                          NULL                        Unknown 
                            41                            588 
                         White            ZHispanic (FD only) 
                          6483                              9 

In [76]:
table(arrests_df2$race)           # dataframe after setting `na.strings`


American Indian/Alaskan Native         Asian/Pacific Islander 
                            25                            121 
                         Black                          White 
                          5721                           6483 
           ZHispanic (FD only) 
                             9 
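Note that `table()` silently leaves `NA` values out of its counts, which is why the blank, `NULL`, and `Unknown` categories no longer appear. To keep track of how many values are missing, the `useNA` argument can be used. A minimal sketch on a made-up vector (`race_raw` is hypothetical):

```r
# Hypothetical raw values, including three kinds of "missing"
race_raw <- c("White", "Black", "NULL", "", "Unknown", "White")

# Recode the missing-value strings to NA, mimicking the effect of na.strings
race_clean <- race_raw
race_clean[race_clean %in% c("", "NULL", "Unknown")] <- NA

table(race_clean)                                    # NA values are not counted
race_counts <- table(race_clean, useNA = "ifany")    # adds a count for NA
race_counts

n_missing <- sum(is.na(race_clean))
n_missing
```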

In [77]:
arrests_df[1, ]

Unnamed: 0_level_0,arrest_date,year,month,gender,race,ethnicity,year_of_birth,age,from_address,from_city,from_state,statute_type,statute_code,statute_desc,counts,case_number,arresting_officers,id
Unnamed: 0_level_1,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>
1,2019-08-24T02:23:00.0,2019,8,Male,White,NonHispanic,1981,37,No Permanent Address,providence,Rhode Island,,,,,2019-00084142,"YGonzalez, LTaveras",pvd2218242150382148273


# 5. Data frame merging
- Data is often spread across more than one file, and reading each file into R results in more than one data frame.
- If the data frames share a common identifying column, we can use that ID to combine them.

For example, in addition to the arrests data, the Providence Police Department also makes data concerning criminal cases available. This is in the CSV file called `"pvd_cases_2021-10-03.csv"` in the `data/` directory of this repository. 

  1. Let's read the cases data in from a CSV and call it `cases_df`. 
  2. What columns do `cases_df` and `arrests_df2` share in common?

In [78]:
cases_df <- read.csv("data/pvd_cases_2021-10-03.csv") 

In [79]:
colnames(cases_df)

In [80]:
colnames(arrests_df2)

We can use the `merge()` function to combine by case number:

In [83]:
merge_df <- merge(arrests_df2, cases_df, by.x = "case_number", by.y = "casenumber", all=TRUE)

head(merge_df)

Unnamed: 0_level_0,case_number,arrest_date,year.x,month.x,gender,race,ethnicity,year_of_birth,age,from_address,⋯,id,location,reported_date,month.y,year.y,offense_desc,statute_code.y,statute_desc.y,counts.y,reporting_officer
Unnamed: 0_level_1,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,⋯,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<chr>
1,2000-00176967,2020-12-20T17:25:00.0,2020,12,Male,Black,NonHispanic,1960,60,,⋯,pvd5279420904765058610,,,,,,,,,
2,2002-00017928,2020-02-10T20:50:00.0,2020,2,Male,Black,,1986,33,,⋯,pvd14287653191053962274,,,,,,,,,
3,2002-00084415,2020-06-20T15:00:00.0,2020,6,Male,White,Hispanic,1987,32,MAPLEHURST ST,⋯,pvd3425910849102764874,,,,,,,,,
4,2002-00156155,2020-10-19T21:39:00.0,2020,10,Male,Black,NonHispanic,1983,36,WILSON ST,⋯,pvd15553225720158958855,,,,,,,,,
5,2009-00012907,2019-07-11T08:00:00.0,2019,7,Male,Black,NonHispanic,1979,40,DENISON ST,⋯,pvd9436840405638425866,,,,,,,,,
6,2010-00132217,2021-03-17T00:00:00.0,2021,3,Male,Black,Hispanic,1987,33,RODNEY DR,⋯,pvd2708776692340882517,,,,,,,,,


- Using `all = TRUE` keeps every row from both data frames. When a case number appears in `arrests_df2` but not in `cases_df` (or vice versa), the columns from the other data frame are filled with `NA`.
- Using the `all.x = TRUE` argument returns all rows from the first data frame (`arrests_df2`), along with any matching rows from `cases_df`.

In [27]:
merge(arrests_df2, cases_df, by.x = "case_number", by.y = "casenumber", all.x=TRUE)

case_number,arrest_date,year.x,month.x,gender,race,ethnicity,year_of_birth,age,from_address,⋯,id,location,reported_date,month.y,year.y,offense_desc,statute_code.y,statute_desc.y,counts.y,reporting_officer
<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,⋯,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<chr>
2000-00176967,2020-12-20T17:25:00.0,2020,12,Male,Black,NonHispanic,1960,60,,⋯,pvd5279420904765058610,,,,,,,,,
2002-00017928,2020-02-10T20:50:00.0,2020,2,Male,Black,,1986,33,,⋯,pvd14287653191053962274,,,,,,,,,
2002-00084415,2020-06-20T15:00:00.0,2020,6,Male,White,Hispanic,1987,32,MAPLEHURST ST,⋯,pvd3425910849102764874,,,,,,,,,
2002-00156155,2020-10-19T21:39:00.0,2020,10,Male,Black,NonHispanic,1983,36,WILSON ST,⋯,pvd15553225720158958855,,,,,,,,,
2009-00012907,2019-07-11T08:00:00.0,2019,7,Male,Black,NonHispanic,1979,40,DENISON ST,⋯,pvd9436840405638425866,,,,,,,,,
2010-00132217,2021-03-17T00:00:00.0,2021,3,Male,Black,Hispanic,1987,33,RODNEY DR,⋯,pvd2708776692340882517,,,,,,,,,
2012-00080374,2021-02-27T03:19:00.0,2021,2,Male,Black,NonHispanic,1972,48,GAGE ST,⋯,pvd5466869003817378838,,,,,,,,,
2012-00080374,2021-02-27T03:16:00.0,2021,2,Male,Black,NonHispanic,1972,48,GAGE ST,⋯,pvd5466869003817378838,,,,,,,,,
2012-00080374,2021-02-27T03:16:00.0,2021,2,Male,Black,NonHispanic,1972,48,GAGE ST,⋯,pvd5466869003817378838,,,,,,,,,
2014-00105285,2020-02-09T17:00:00.0,2020,2,Male,White,Hispanic,1989,30,BROCK AVE,⋯,pvd16199867161131954743,,,,,,,,,


- Using the `all.y = TRUE` argument returns all rows from the `cases_df` data frame, along with any matching rows from `arrests_df2`.

In [28]:
merge(arrests_df2, cases_df, by.x = "case_number", by.y = "casenumber", all.y=TRUE)

case_number,arrest_date,year.x,month.x,gender,race,ethnicity,year_of_birth,age,from_address,⋯,id,location,reported_date,month.y,year.y,offense_desc,statute_code.y,statute_desc.y,counts.y,reporting_officer
<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,⋯,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<chr>
2019-00078922,,,,,,,,,,⋯,,939 DOUGLAS AVE,2019-08-10T23:20:00.0,8,2019,Missing Persons,Not Used,No violations,0,Central Station
2019-00078938,,,,,,,,,,⋯,,57 EDDY ST,2019-08-10T23:36:40.0,8,2019,Request for Assistance,Not Used,No violations,0,ADaCruz
2019-00078961,,,,,,,,,,⋯,,380 HOPE ST,2019-08-11T00:13:00.0,8,2019,Missing Persons,Not Used,No violations,0,Central Station
2019-00078976,,,,,,,,,,⋯,,90 ACADEMY AVE,2019-08-10T23:49:00.0,8,2019,"Assault, Aggravated",11-5-2,FELONY ASSAULT/ DANG. WEAPON OR SUBSTANCE,1,DAnderson
2019-00078982,,,,,,,,,,⋯,,BROAD ST & THURBERS AVE,2019-08-11T00:37:00.0,8,2019,Receiving Stolen Property,31-9-2,Possession of Stolen Vehicle or Parts,1,CBrown
2019-00079011,2019-08-11T01:53:00.0,2019,8,Female,Asian/Pacific Islander,,1993,26,BARSTOW ST,⋯,pvd9052534029220900662,51 BARSTOW ST,2019-08-11T01:53:00.0,8,2019,"Assault, Simple",11-5-3,SIMPLE ASSAULT/BATTERY,1,NO'Malley
2019-00079015,,,,,,,,,,⋯,,CLIFFORD ST & CLAVERICK ST,2019-08-11T02:01:00.0,8,2019,"Larceny, Other",11-41-1,LARCENY/U $1500 - ALL OTH LARCENY,1,JDennis
2019-00079020,,,,,,,,,,⋯,,380 HOPE ST,2019-08-11T02:16:00.0,8,2019,Missing Persons,Not Used,No violations,0,Central Station
2019-00079028,,,,,,,,,,⋯,,772 HOPE ST,2019-08-11T02:26:00.0,8,2019,Municipal Code Violation,Sec. 16-3.A,Disorderly and indecent conduct A - Theatening,1,DGonzalez
2019-00079032,2019-08-11T02:42:00.0,2019,8,Male,White,,1997,21,PERRY HILL ROAD,⋯,pvd4002343389973320073,172 PINE ST,2019-08-11T02:42:00.0,8,2019,Disorderly Conduct,11-45-1,DISORDERLY CONDUCT,1,JLanier


What do you notice about this data above? The columns? What's in the rows?
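The effect of the `all`, `all.x`, and `all.y` arguments is easier to see on a tiny example. The two data frames below are made up for illustration (`arrests_toy` and `cases_toy` are not the real data); they share case numbers `A2` and `A3` only.

```r
# Two small, hypothetical data frames with differently named key columns
arrests_toy <- data.frame(
    case_number = c("A1", "A2", "A3"),
    age = c(30, 41, 25),
    stringsAsFactors = FALSE
)
cases_toy <- data.frame(
    casenumber = c("A2", "A3", "A4"),
    offense = c("Larceny", "Assault", "Fraud"),
    stringsAsFactors = FALSE
)

m_inner <- merge(arrests_toy, cases_toy, by.x = "case_number", by.y = "casenumber")
m_all   <- merge(arrests_toy, cases_toy, by.x = "case_number", by.y = "casenumber", all = TRUE)
m_left  <- merge(arrests_toy, cases_toy, by.x = "case_number", by.y = "casenumber", all.x = TRUE)
m_right <- merge(arrests_toy, cases_toy, by.x = "case_number", by.y = "casenumber", all.y = TRUE)

# Number of rows kept by each variant
sapply(list(default = m_inner, all = m_all, all.x = m_left, all.y = m_right), nrow)

m_all    # A1 has no offense and A4 has no age, so those cells are NA
```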

## 5.1 Joining instead of merging

There are usually multiple ways to do things in R. An alternative to `merge()` is the family of join functions in the `tidyverse` (provided by the `dplyr` package):

 * `inner_join(x,y)`: only keeps observations from x that have a matching key in y
 * `left_join(x,y)`: keeps all observations in x
 * `right_join(x,y)`: keeps all observations in y
 * `full_join(x,y)`: keeps all observations in x and y
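The four joins listed above can be compared on a pair of tiny, made-up data frames. In this sketch `x_toy` and `y_toy` are hypothetical and share only the keys `A2` and `A3`; it loads `dplyr` directly so that it is self-contained.

```r
library(dplyr)

# Two small, hypothetical data frames sharing a `key` column
x_toy <- data.frame(
    key = c("A1", "A2", "A3"),
    age = c(30, 41, 25),
    stringsAsFactors = FALSE
)
y_toy <- data.frame(
    key = c("A2", "A3", "A4"),
    offense = c("Larceny", "Assault", "Fraud"),
    stringsAsFactors = FALSE
)

# Number of rows returned by each kind of join
n_rows <- c(
    inner = nrow(inner_join(x_toy, y_toy, by = "key")),
    left  = nrow(left_join(x_toy, y_toy, by = "key")),
    right = nrow(right_join(x_toy, y_toy, by = "key")),
    full  = nrow(full_join(x_toy, y_toy, by = "key"))
)
n_rows
```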

This family of functions is nice because (1) it allows us to decide which data frame we want to focus on when merging, and (2) it gives more control over how we merge. For example, we can merge on multiple columns, specifying which columns should match which columns in each data frame. When we run `full_join()` on our data, we also get some additional information in the form of a warning.

<div class="alert alert-block alert-info">
<b>Welcome to the tidyverse!</b>

<a href="https://www.tidyverse.org/"><b>Tidyverse</b></a> is a set of R packages designed for data science. All of its packages share an underlying design philosophy, grammar, and data structures.

Notably, in this class we'll be touching on `dplyr`, `ggplot2`, and `readr`.

</div>


<div class="alert alert-block alert-info">
<b>INFO!</b>

We already loaded our CSV files with base R's `read.csv()`. Moving forward, we recommend using the following instead:

`read_csv()` from the `readr` package, which is included when loading the `tidyverse` package.

It plays nicer than the base R version, and it interprets empty strings as NA values by default.

<a href="https://readr.tidyverse.org/reference/read_delim.html">See documentation here</a>

</div>


In [29]:
library(tidyverse)

“running command 'timedatectl' had status 1”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [30]:
full_join(arrests_df2, cases_df, by=join_by(case_number==casenumber,month,year))

“Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.
ℹ Row 6 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =


arrest_date,year,month,gender,race,ethnicity,year_of_birth,age,from_address,from_city,⋯,case_number,arresting_officers,id,location,reported_date,offense_desc,statute_code.y,statute_desc.y,counts.y,reporting_officer
<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>
2019-08-24T02:23:00.0,2019,8,Male,White,NonHispanic,1981,37,No Permanent Address,providence,⋯,2019-00084142,"YGonzalez, LTaveras",pvd2218242150382148273,46 GESLER ST,2019-08-24T02:23:00.0,RI Statute Violation,12-9-16,WARRANT OF ARREST ON AFFIDAVIT - ALL OTH OFFENSE,1,LTaveras
2019-08-24T02:02:00.0,2019,8,,,,1994,25,SUMMER AVE,Cranston,⋯,2019-00084127,NManfredi,pvd15166785558364246202,1095 EDDY ST,2019-08-24T02:02:00.0,Traffic Violation,31-3-32,Driving with Expired Registration,1,NManfredi
2019-08-24T02:02:00.0,2019,8,,,,1994,25,SUMMER AVE,Cranston,⋯,2019-00084127,NManfredi,pvd15166785558364246202,1095 EDDY ST,2019-08-24T02:02:00.0,Traffic Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,NManfredi
2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,⋯,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905,60 HARTFORD AVE,2019-08-24T02:02:00.0,Disorderly Conduct,11-45-1,DISORDERLY CONDUCT,3,MPlace
2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,⋯,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905,60 HARTFORD AVE,2019-08-24T02:02:00.0,RI Statute Violation,11-32-1,OBSTRUCTING OFFICER IN EXECUTION OF DUTY,1,MPlace
2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,⋯,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905,60 HARTFORD AVE,2019-08-24T02:02:00.0,Warrant\Capias,BWARRANT-6D,BENCH WARRANT ISSUED FROM 6TH DISTRICT COURT,1,MPlace
2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,⋯,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905,60 HARTFORD AVE,2019-08-24T02:02:00.0,RI Statute Violation,12-7-10,RESISTING LEGAL OR ILLEGAL ARREST,3,MPlace
2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,⋯,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905,60 HARTFORD AVE,2019-08-24T02:02:00.0,RI Statute Violation,12-9-16,WARRANT OF ARREST ON AFFIDAVIT - ALL OTH OFFENSE,4,MPlace
2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,⋯,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905,60 HARTFORD AVE,2019-08-24T02:02:00.0,Disorderly Conduct,11-45-1,DISORDERLY CONDUCT,3,MPlace
2019-08-24T02:02:00.0,2019,8,Female,Black,NonHispanic,1984,34,DOUGLAS AVE,Providence,⋯,2019-00084126,"MPlace, JPerez, ASantos",pvd3142917706201385905,60 HARTFORD AVE,2019-08-24T02:02:00.0,RI Statute Violation,11-32-1,OBSTRUCTING OFFICER IN EXECUTION OF DUTY,1,MPlace


In [31]:
arrests_df2[2,]

Unnamed: 0_level_0,arrest_date,year,month,gender,race,ethnicity,year_of_birth,age,from_address,from_city,from_state,statute_type,statute_code,statute_desc,counts,case_number,arresting_officers,id
Unnamed: 0_level_1,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>
2,2019-08-24T02:02:00.0,2019,8,,,,1994,25,SUMMER AVE,Cranston,Rhode Island,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,2019-00084127,NManfredi,pvd15166785558364246202
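The many-to-many warning above means that a key value appears several times in both data frames, so every matching row on one side is paired with every matching row on the other. A minimal sketch on made-up data (`arrests_toy` and `cases_toy` are hypothetical) shows how the row count multiplies. It assumes dplyr 1.1.0 or later, where `join_by()` and the `relationship` argument are available; setting `relationship = "many-to-many"` states that the duplication is expected.

```r
library(dplyr)

# Hypothetical data: case "C1" has 2 rows on one side and 3 on the other
arrests_toy <- data.frame(
    case_number = c("C1", "C1"),
    charge = c("resisting", "disorderly"),
    stringsAsFactors = FALSE
)
cases_toy <- data.frame(
    casenumber = c("C1", "C1", "C1"),
    offense = c("x", "y", "z"),
    stringsAsFactors = FALSE
)

joined <- full_join(
    arrests_toy, cases_toy,
    by = join_by(case_number == casenumber),
    relationship = "many-to-many"     # declare that multiple matches are expected
)

nrow(joined)    # every arrest row is paired with every case row: 2 * 3
```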


# 6. The _dplyr_ Package

  - "dplyr" is short for "data plyer"
  - R package for aggregating, summarizing, reshaping, and generally wrangling data
  - Extremely popular in the R community
  - Authored by Hadley Wickham
  - Part of the "tidyverse" set of packages
    + We have already loaded "tidyverse", so we don't need to load any more packages to use it

## 6.1 The _dplyr_ Verbs

  - The _dplyr_ package is organized around a set of "verbs", which are functions that operate on data
    + `filter()` - subsets a data frame, retaining all rows that satisfy your conditions
    + `summarise()` - creates a new data frame with one row for each combination of grouping variables
    + `select()` - selects variables in a data frame
    + `mutate()` - creates new columns that are functions of existing variables
    + `arrange()` - orders the rows of a data frame by the values of selected columns
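As a quick preview, each verb can be called as an ordinary function whose first argument is a data frame. The sketch below applies the five verbs to a small made-up data frame (`toy_df` and its values are assumptions for illustration) and loads `dplyr` directly so that it is self-contained.

```r
library(dplyr)

# Small, hypothetical data frame
toy_df <- data.frame(
    gender = c("Male", "Female", "Male", "Female", "Male"),
    age = c(19, 34, 28, 18, 40),
    year_of_birth = c(2000, 1984, 1991, 2001, 1979),
    stringsAsFactors = FALSE
)

adults   <- filter(toy_df, age >= 20)              # keep rows meeting a condition
slim     <- select(adults, gender, age)            # keep only some columns
extended <- mutate(slim, age_in_10 = age + 10)     # add a column computed from another
sorted   <- arrange(extended, age)                 # order rows by age
sorted

age_summary <- summarise(toy_df, mean_age = mean(age), n = n())   # collapse to one row
age_summary
```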

## 6.2 The Pipe Operator

  - Can be used to pipe some object into a function call
  - `%>%`
    + `x %>% f(y)` is the same as `f(x, y)`
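A small sketch of that equivalence: the piped version reads left to right, while the nested version reads inside out. The vector `x` is made up for the example, and `%>%` is made available here by loading `dplyr`.

```r
library(dplyr)    # makes the %>% operator available

x <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The same computation with the pipe, read left to right
piped <- x %>%
    sqrt() %>%
    mean() %>%
    round(1)

identical(nested, piped)
```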
    

# 7. `filter()` Examples with _dplyr_

In [32]:
cases_df %>% 
    filter(casenumber=='2019-00084127') 

casenumber,location,reported_date,month,year,offense_desc,statute_code,statute_desc,counts,reporting_officer
<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<chr>
2019-00084127,1095 EDDY ST,2019-08-24T02:02:00.0,8,2019,Traffic Violation,31-3-32,Driving with Expired Registration,1,NManfredi
2019-00084127,1095 EDDY ST,2019-08-24T02:02:00.0,8,2019,Traffic Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,NManfredi


## 7.1 Comparing `filter()` with Logical Indexing

In [33]:
# dplyr approach
cases_df %>% 
    filter(casenumber=='2019-00084127')


# "base" R approach
weird_case <- cases_df$casenumber == "2019-00084127"      # create vector of bools

cases_df[weird_case, ]                       # get cases with that casenumber

casenumber,location,reported_date,month,year,offense_desc,statute_code,statute_desc,counts,reporting_officer
<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<chr>
2019-00084127,1095 EDDY ST,2019-08-24T02:02:00.0,8,2019,Traffic Violation,31-3-32,Driving with Expired Registration,1,NManfredi
2019-00084127,1095 EDDY ST,2019-08-24T02:02:00.0,8,2019,Traffic Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,NManfredi


Unnamed: 0_level_0,casenumber,location,reported_date,month,year,offense_desc,statute_code,statute_desc,counts,reporting_officer
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<chr>
5,2019-00084127,1095 EDDY ST,2019-08-24T02:02:00.0,8,2019,Traffic Violation,31-3-32,Driving with Expired Registration,1,NManfredi
9,2019-00084127,1095 EDDY ST,2019-08-24T02:02:00.0,8,2019,Traffic Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,NManfredi
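The two approaches agree here, but they treat missing values differently: `filter()` drops rows where the condition evaluates to `NA`, whereas logical indexing returns a row of `NA`s for each of them. A minimal sketch on made-up data (`toy_df` is hypothetical); wrapping the condition in `which()` is one way to make base R indexing skip those rows as well.

```r
library(dplyr)

# Hypothetical data with one missing gender
toy_df <- data.frame(
    id = 1:4,
    gender = c("Male", NA, "Female", "Male"),
    stringsAsFactors = FALSE
)

by_filter <- toy_df %>% filter(gender == "Male")        # NA comparison is dropped
by_index  <- toy_df[toy_df$gender == "Male", ]          # NA comparison yields an all-NA row
by_which  <- toy_df[which(toy_df$gender == "Male"), ]   # which() keeps only TRUE positions

nrow(by_filter)
nrow(by_index)
nrow(by_which)
```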


## 7.2 `filter()` Examples (cont.)

In [34]:
# Here we create a new data.frame from result of filter()

arrests_males <- arrests_df %>% 
    filter(gender == "Male")                

In [35]:
head(arrests_males)

Unnamed: 0_level_0,arrest_date,year,month,gender,race,ethnicity,year_of_birth,age,from_address,from_city,from_state,statute_type,statute_code,statute_desc,counts,case_number,arresting_officers,id
Unnamed: 0_level_1,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>
1,2019-08-24T02:23:00.0,2019,8,Male,White,NonHispanic,1981,37,No Permanent Address,providence,Rhode Island,,,,,2019-00084142,"YGonzalez, LTaveras",pvd2218242150382148273
2,2019-08-23T23:43:00.0,2019,8,Male,Black,NonHispanic,1991,28,PUBLIC ST,Providence,,RI Statute Violation,31-27-2.1,Chemical Test Refusal,1.0,2019-00084056,"CVingi, SCooney",pvd6431558757894418021
3,2019-08-23T23:43:00.0,2019,8,Male,Black,NonHispanic,1991,28,PUBLIC ST,Providence,,RI Statute Violation,31-27-2,Driving Under the Influence of Liqour or Drugs (=>.08<.1),1.0,2019-00084056,"CVingi, SCooney",pvd6431558757894418021
4,2019-08-23T23:43:00.0,2019,8,Male,Black,NonHispanic,1991,28,PUBLIC ST,Providence,,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1.0,2019-00084056,"CVingi, SCooney",pvd6431558757894418021
5,2019-08-23T21:38:00.0,2019,8,Male,White,Hispanic,1996,22,DOUGLAS,Providence,,RI Statute Violation,11-44-1,DOMESTIC-VANDALISM/MALICIOUS INJURY TO PROP,1.0,2019-00084031,"RCarlin, SKennedy",pvd15614289459563584867
6,2019-08-23T19:50:00.0,2019,8,Male,White,Hispanic,2000,19,MOWRY ST,Providence,,RI Statute Violation,31-27-4,"Reckless Driving, Drag Racing - Attempting to Elude",1.0,2019-00083996,"SCampbell, RMalloy",pvd900460037611487829


## 7.2 Using `filter()` with Multiple Conditions

In [36]:
arrests_teen_male <- arrests_df %>%
    filter(
        gender == "Male",
        age < 20
    )

arrests_teen_male

arrest_date,year,month,gender,race,ethnicity,year_of_birth,age,from_address,from_city,from_state,statute_type,statute_code,statute_desc,counts,case_number,arresting_officers,id
<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>
2019-08-23T19:50:00.0,2019,8,Male,White,Hispanic,2000,19,MOWRY ST,Providence,,RI Statute Violation,31-27-4,"Reckless Driving, Drag Racing - Attempting to Elude",1,2019-00083996,"SCampbell, RMalloy",pvd900460037611487829
2019-08-23T19:50:00.0,2019,8,Male,White,Hispanic,2000,19,MOWRY ST,Providence,,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,2019-00083996,"SCampbell, RMalloy",pvd900460037611487829
2019-08-21T13:09:00.0,2019,8,Male,White,Hispanic,1999,19,MELISSA AVE,Providence,,RI Statute Violation,12-7-10,RESISTING LEGAL OR ILLEGAL ARREST,1,2019-00083170,"ITavarez, IYousif, CBrown, EDelgado",pvd5047836359365815220
2019-08-21T13:09:00.0,2019,8,Male,White,Hispanic,1999,19,MELISSA AVE,Providence,,RI Statute Violation,11-45-1,DISORDERLY CONDUCT,1,2019-00083170,"ITavarez, IYousif, CBrown, EDelgado",pvd5047836359365815220
2019-08-21T13:09:00.0,2019,8,Male,White,Hispanic,1999,19,MELISSA AVE,Providence,,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,2019-00083170,"ITavarez, IYousif, CBrown, EDelgado",pvd5047836359365815220
2019-08-20T02:00:00.0,2019,8,Male,White,Hispanic,1999,19,MINK RD,Providence,Rhode Island,RI Statute Violation,31-27-4,"Reckless Driving, Drag Racing - Attempting to Elude",1,2019-00078616,"JGagnon, RMalloy",pvd1076862233562848683
2019-08-20T00:00:00.0,2019,8,Male,Black,NonHispanic,2000,19,SMITH ST,Providence,,,,,,2019-00082826,"RPapa, MCamardo",pvd12708633210022966227
2019-08-17T00:00:00.0,2019,8,Male,White,Hispanic,2001,18,BENEDICT ST,Providence,,RI Statute Violation,11-37-2,SEXUAL ASSAULT -1ST DEGREE - FRC RAPE,1,2019-00081517,"RMendez, JNajarian",pvd9938776757456909177
2019-08-15T18:12:00.0,2019,8,Male,Black,NonHispanic,2000,18,CROSS ST,Providence,,RI Statute Violation,11-17-1,FORGERY AND COUNTERFEITING IN GENERAL,1,2019-00080859,"JBenros, JStanzione, NOC Officer, ACalle, EDelgado, JManown",pvd17954097329236445270
2019-08-15T00:00:00.0,2019,8,Male,Black,Hispanic,2000,19,SILVER LAKE AVE,Providence,,,,,,2019-00076472,,pvd14598067460260984586
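
The comma-separated conditions passed to `filter()` above are combined with a logical AND: a row is kept only if it satisfies every condition. The sketch below illustrates this on a small made-up data frame (`people` is hypothetical, not part of the arrests data) by comparing the comma form with an explicit `&`.

```r
library(dplyr)

# Toy data, made up for illustration only
people <- data.frame(
    gender = c("Male", "Male", "Female", "Male"),
    age    = c(19, 34, 18, 17)
)

# Comma-separated conditions: a row must satisfy all of them
with_commas <- people %>%
    filter(gender == "Male", age < 20)

# The same filter written with the elementwise AND operator
with_and <- people %>%
    filter(gender == "Male" & age < 20)

with_commas
```

Both versions keep the same two rows (the males aged 19 and 17).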


### 7.2.1 Using `filter()` with Logical OR

  - Recall that the `||` operator is the logical OR for single (length-one) conditions
  - The `|` operator performs the same role, but elementwise across whole columns (or vectors), which is what `filter()` needs
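
A small base-R sketch of the difference, using a made-up `age` vector (not taken from the arrests data): `|` returns one result per element, while `||` is meant for a single comparison.

```r
# Hypothetical ages, for illustration only
age <- c(19, 40, 70)

# `|` works elementwise: one TRUE/FALSE per element of `age`
young_or_old <- age < 25 | age > 65
print(young_or_old)

# `||` is intended for single (length-one) conditions, e.g. inside if()
single_check <- (age[1] < 25) || (age[1] > 65)
print(single_check)
```

In recent versions of R, passing a vector longer than one to `||` raises an error, so `|` is the operator to use when testing a whole column.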

In [37]:
young_old_male <- arrests_df %>%
    filter(
        gender == "Male",
        age < 25 | age > 65  
    )
   
young_old_male

arrest_date,year,month,gender,race,ethnicity,year_of_birth,age,from_address,from_city,from_state,statute_type,statute_code,statute_desc,counts,case_number,arresting_officers,id
<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>
2019-08-23T21:38:00.0,2019,8,Male,White,Hispanic,1996,22,DOUGLAS,Providence,,RI Statute Violation,11-44-1,DOMESTIC-VANDALISM/MALICIOUS INJURY TO PROP,1,2019-00084031,"RCarlin, SKennedy",pvd15614289459563584867
2019-08-23T19:50:00.0,2019,8,Male,White,Hispanic,2000,19,MOWRY ST,Providence,,RI Statute Violation,31-27-4,"Reckless Driving, Drag Racing - Attempting to Elude",1,2019-00083996,"SCampbell, RMalloy",pvd900460037611487829
2019-08-23T19:50:00.0,2019,8,Male,White,Hispanic,2000,19,MOWRY ST,Providence,,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,2019-00083996,"SCampbell, RMalloy",pvd900460037611487829
2019-08-23T18:26:00.0,2019,8,Male,White,Hispanic,1996,23,CUMERFORD ST,Providence,,RI Statute Violation,12-7-10,RESISTING LEGAL OR ILLEGAL ARREST,1,2019-00083963,JHanley,pvd1675234703933765967
2019-08-23T18:26:00.0,2019,8,Male,White,Hispanic,1996,23,CUMERFORD ST,Providence,,RI Statute Violation,11-32-1,OBSTRUCTING OFFICER IN EXECUTION OF DUTY,1,2019-00083963,JHanley,pvd1675234703933765967
2019-08-23T14:42:00.0,2019,8,Male,White,Hispanic,1998,20,LAURA ST,Providence,,RI Statute Violation,11-44-1,DOMESTIC-VANDALISM/MALICIOUS INJURY TO PROP,1,2019-00083892,"JCotugno, ALevesque, JButen, JJohnson",pvd17953747948212880432
2019-08-23T00:57:00.0,2019,8,Male,White,,1998,21,AUSTIN ST,newbrdford,,RI Statute Violation,11-5-3,SIMPLE ASSAULT OR BATTERY,1,2019-00083725,PSalmons,pvd3024232238010666153
2019-08-22T12:05:00.0,2019,8,Male,White,Hispanic,1999,20,ROCKINGHAM ST,Providence,Rhode Island,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,2019-00083486,"JBenros, MClary",pvd8008374038901187780
2019-08-22T01:38:00.0,2019,8,Male,Black,Hispanic,1998,21,HOLLIS ST,Providence,,RI Statute Violation,11-47-5.2,POSSESSION OF A STOLEN FIREARM,1,2019-00083396,"RFedo, KRosado, ALugo",pvd6386847572324309475
2019-08-21T18:03:00.0,2019,8,Male,Unknown,Hispanic,1998,21,CUMERFORD ST,Providence,,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,2019-00083274,JMadeira,pvd14028206778351997883


### 7.2.2 Using `filter()` with Logical OR (cont.)

In [38]:
ptk_young_old_male <- arrests_df %>%
    filter(
        gender == "Male",
        age < 25 | age > 65 | from_city == "Pawtucket"
    )

ptk_young_old_male

arrest_date,year,month,gender,race,ethnicity,year_of_birth,age,from_address,from_city,from_state,statute_type,statute_code,statute_desc,counts,case_number,arresting_officers,id
<chr>,<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>
2019-08-23T21:38:00.0,2019,8,Male,White,Hispanic,1996,22,DOUGLAS,Providence,,RI Statute Violation,11-44-1,DOMESTIC-VANDALISM/MALICIOUS INJURY TO PROP,1,2019-00084031,"RCarlin, SKennedy",pvd15614289459563584867
2019-08-23T19:50:00.0,2019,8,Male,White,Hispanic,2000,19,MOWRY ST,Providence,,RI Statute Violation,31-27-4,"Reckless Driving, Drag Racing - Attempting to Elude",1,2019-00083996,"SCampbell, RMalloy",pvd900460037611487829
2019-08-23T19:50:00.0,2019,8,Male,White,Hispanic,2000,19,MOWRY ST,Providence,,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,2019-00083996,"SCampbell, RMalloy",pvd900460037611487829
2019-08-23T18:26:00.0,2019,8,Male,White,Hispanic,1996,23,CUMERFORD ST,Providence,,RI Statute Violation,12-7-10,RESISTING LEGAL OR ILLEGAL ARREST,1,2019-00083963,JHanley,pvd1675234703933765967
2019-08-23T18:26:00.0,2019,8,Male,White,Hispanic,1996,23,CUMERFORD ST,Providence,,RI Statute Violation,11-32-1,OBSTRUCTING OFFICER IN EXECUTION OF DUTY,1,2019-00083963,JHanley,pvd1675234703933765967
2019-08-23T14:42:00.0,2019,8,Male,White,Hispanic,1998,20,LAURA ST,Providence,,RI Statute Violation,11-44-1,DOMESTIC-VANDALISM/MALICIOUS INJURY TO PROP,1,2019-00083892,"JCotugno, ALevesque, JButen, JJohnson",pvd17953747948212880432
2019-08-23T00:57:00.0,2019,8,Male,White,,1998,21,AUSTIN ST,newbrdford,,RI Statute Violation,11-5-3,SIMPLE ASSAULT OR BATTERY,1,2019-00083725,PSalmons,pvd3024232238010666153
2019-08-22T12:05:00.0,2019,8,Male,White,Hispanic,1999,20,ROCKINGHAM ST,Providence,Rhode Island,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,2019-00083486,"JBenros, MClary",pvd8008374038901187780
2019-08-22T01:38:00.0,2019,8,Male,Black,Hispanic,1998,21,HOLLIS ST,Providence,,RI Statute Violation,11-47-5.2,POSSESSION OF A STOLEN FIREARM,1,2019-00083396,"RFedo, KRosado, ALugo",pvd6386847572324309475
2019-08-21T18:03:00.0,2019,8,Male,Unknown,Hispanic,1998,21,CUMERFORD ST,Providence,,RI Statute Violation,31-11-18,"Driving after Denial, Suspension or Revocation of License",1,2019-00083274,JMadeira,pvd14028206778351997883
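
It can help to spell out how the comma and `|` interact in the cell above. The whole `... | ... | ...` expression counts as one condition, and it is then ANDed with `gender == "Male"`. The sketch below uses a hypothetical four-row data frame to show which rows survive.

```r
library(dplyr)

# Toy data, made up for illustration only
people <- data.frame(
    gender    = c("Male", "Female", "Male", "Male"),
    age       = c(22, 70, 40, 40),
    from_city = c("Providence", "Providence", "Pawtucket", "Providence")
)

kept <- people %>%
    filter(
        gender == "Male",                               # must be male AND ...
        age < 25 | age > 65 | from_city == "Pawtucket"  # ... at least one of these
    )

kept
```

Here the 70-year-old female is dropped even though she satisfies the age condition, and the 40-year-old male from Pawtucket is kept because of the city condition alone.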


<center><h1>Using <code>select()</code> Function in dplyr</h1></center>

# 8. Using `select()` to Extract Columns
  - Recall that `filter()` can be used to filter rows
  - Similarly, `select()` is used to select columns
  - These functions can be "chained"

## 8.1 Example of `select()`

In [39]:
arrests_subset <- arrests_df %>% 
    select(id, age, gender, statute_desc)

head(arrests_subset)

Unnamed: 0_level_0,id,age,gender,statute_desc
Unnamed: 0_level_1,<chr>,<int>,<chr>,<chr>
1,pvd2218242150382148273,37,Male,
2,pvd15166785558364246202,25,,"Driving after Denial, Suspension or Revocation of License"
3,pvd3142917706201385905,34,Female,RESISTING LEGAL OR ILLEGAL ARREST
4,pvd3142917706201385905,34,Female,DISORDERLY CONDUCT
5,pvd460449304532374599,18,Female,RESISTING LEGAL OR ILLEGAL ARREST
6,pvd460449304532374599,18,Female,DISORDERLY CONDUCT


### 8.1.1 Comparing `select()` to `[, ]` notation

In [40]:
# dplyr example
arrests_df %>% 
    select(id, age, gender, statute_desc)


# equivalent in "base" R example
cols <- c("id", "age", "gender", "statute_desc")

arrest_sub <- arrests_df[, cols]

head(arrest_sub)

id,age,gender,statute_desc
<chr>,<int>,<chr>,<chr>
pvd2218242150382148273,37,Male,
pvd15166785558364246202,25,,"Driving after Denial, Suspension or Revocation of License"
pvd3142917706201385905,34,Female,RESISTING LEGAL OR ILLEGAL ARREST
pvd3142917706201385905,34,Female,DISORDERLY CONDUCT
pvd460449304532374599,18,Female,RESISTING LEGAL OR ILLEGAL ARREST
pvd460449304532374599,18,Female,DISORDERLY CONDUCT
pvd460449304532374599,18,Female,OBSTRUCTING OFFICER IN EXECUTION OF DUTY
pvd6431558757894418021,28,Male,Chemical Test Refusal
pvd6431558757894418021,28,Male,Driving Under the Influence of Liqour or Drugs (=>.08<.1)
pvd6431558757894418021,28,Male,"Driving after Denial, Suspension or Revocation of License"


Unnamed: 0_level_0,id,age,gender,statute_desc
Unnamed: 0_level_1,<chr>,<int>,<chr>,<chr>
1,pvd2218242150382148273,37,Male,
2,pvd15166785558364246202,25,,"Driving after Denial, Suspension or Revocation of License"
3,pvd3142917706201385905,34,Female,RESISTING LEGAL OR ILLEGAL ARREST
4,pvd3142917706201385905,34,Female,DISORDERLY CONDUCT
5,pvd460449304532374599,18,Female,RESISTING LEGAL OR ILLEGAL ARREST
6,pvd460449304532374599,18,Female,DISORDERLY CONDUCT
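
One difference between the two notations shows up when only a single column is requested. For a plain `data.frame` (such as one returned by `read.csv()`), `[, ]` simplifies a single column to a vector by default, whereas `select()` returns a data frame. A minimal sketch with a made-up data frame `df`:

```r
library(dplyr)

# Toy data, made up for illustration only
df <- data.frame(
    id  = c("a", "b", "c"),
    age = c(37, 25, 34)
)

sel_one  <- df %>% select(age)          # still a data frame with one column
base_one <- df[, "age"]                 # simplified to a numeric vector
base_df  <- df[, "age", drop = FALSE]   # drop = FALSE keeps the data frame

class(sel_one)
class(base_one)
```

Passing `drop = FALSE` makes the base R version behave like `select()` in this case.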


## 8.2 Example of `select()` (cont.)

In [41]:
arrests_vio <- arrests_df %>%
    select(
        id,
        age,
        gender,
        statute_desc
    )

In [42]:
head(arrests_vio)           # see first few lines of new dataframe

Unnamed: 0_level_0,id,age,gender,statute_desc
Unnamed: 0_level_1,<chr>,<int>,<chr>,<chr>
1,pvd2218242150382148273,37,Male,
2,pvd15166785558364246202,25,,"Driving after Denial, Suspension or Revocation of License"
3,pvd3142917706201385905,34,Female,RESISTING LEGAL OR ILLEGAL ARREST
4,pvd3142917706201385905,34,Female,DISORDERLY CONDUCT
5,pvd460449304532374599,18,Female,RESISTING LEGAL OR ILLEGAL ARREST
6,pvd460449304532374599,18,Female,DISORDERLY CONDUCT


# 9. Chaining _dplyr_ Operators
  - One key reason for _dplyr_'s popularity
  - _dplyr_ verbs/functions are "composable"
    + $(f \circ g)(x) = f(g(x))$
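
To make the composition idea concrete, the sketch below (on a hypothetical data frame `df`) writes the same two-step operation once with the pipe and once as nested function calls. With the pipe, `x %>% f() %>% g()` corresponds to `g(f(x))`, so the steps read left to right in the order they are applied.

```r
library(dplyr)

# Toy data, made up for illustration only
df <- data.frame(
    id     = c("a", "b", "c", "d"),
    gender = c("Female", "Male", "Female", "Male"),
    age    = c(34, 28, 18, 45)
)

# Piped: filter first, then select
piped <- df %>%
    filter(gender == "Female") %>%
    select(id, age)

# Nested: the innermost call (filter) runs first
nested <- select(filter(df, gender == "Female"), id, age)

piped
```

Both forms give the same result; the piped version tends to be easier to read as the chain grows.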

In [43]:
female_vio <- arrests_df %>%
    filter(gender == "Female") %>%
    select(id, age, gender, statute_desc)

head(female_vio)

Unnamed: 0_level_0,id,age,gender,statute_desc
Unnamed: 0_level_1,<chr>,<int>,<chr>,<chr>
1,pvd3142917706201385905,34,Female,RESISTING LEGAL OR ILLEGAL ARREST
2,pvd3142917706201385905,34,Female,DISORDERLY CONDUCT
3,pvd460449304532374599,18,Female,RESISTING LEGAL OR ILLEGAL ARREST
4,pvd460449304532374599,18,Female,DISORDERLY CONDUCT
5,pvd460449304532374599,18,Female,OBSTRUCTING OFFICER IN EXECUTION OF DUTY
6,pvd8555094992612905738,45,Female,VANDALISM/MALICIOUS INJURY TO PROPERTY


## 9.1 More Chaining

In [44]:
female_midage <- arrests_df %>%
    filter(
        gender == "Female",
        age > 45,
        statute_desc != ""
    ) %>%
    select(
        id, 
        age, 
        gender,
        statute_desc
    ) %>%
    arrange(
        age
    )

head(female_midage)

Unnamed: 0_level_0,id,age,gender,statute_desc
Unnamed: 0_level_1,<chr>,<int>,<chr>,<chr>
1,pvd5910286289754155205,46,Female,LOITERING FOR INDECENT PURPOSES PROSTITUTION - PROSTITUTION
2,pvd14925567736676696725,46,Female,SHOPLIFTING-MISD - SHOPLIFTING
3,pvd17492545928832438170,46,Female,"Driving after Denial, Suspension or Revocation of License"
4,pvd6439318455139528590,46,Female,BENCH WARRANT ISSUED FROM 6TH DISTRICT COURT
5,pvd5910286289754155205,46,Female,LOITERING FOR INDECENT PURPOSES PROSTITUTION - PROSTITUTION
6,pvd13975960782588463013,46,Female,SIMPLE ASSAULT OR BATTERY
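
The `arrange()` step at the end of the chain sorts rows in ascending order by default, which is why the youngest ages appear first. A minimal sketch on made-up data, also showing `desc()` for the reverse order:

```r
library(dplyr)

# Toy data, made up for illustration only
df <- data.frame(
    id  = c("a", "b", "c"),
    age = c(52, 46, 61)
)

youngest_first <- df %>% arrange(age)         # ascending (default)
oldest_first   <- df %>% arrange(desc(age))   # descending

youngest_first
oldest_first
```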


<center><h1>Challenge Problem</h1></center>

In addition to the arrests data, the Providence Police Department also makes data concerning criminal cases available. This is in the CSV file called `"pvd_cases_2021-10-03.csv"` in the `data/` directory of this repository. 

  1. Let's read the cases data in from a CSV and call it `cases_df`. 
  2. Then, let's create a new dataframe that is a subset of `cases_df`. In particular, let's create a dataframe called `cases_summer_df` that contains only those cases that were heard in June, July, or August. 
  

**Note**: The `month` column is coded numerically in the data set, so keep that in mind.

In [45]:
cases_df <- read.csv("data/pvd_cases_2021-10-03.csv") 

cases_summer_df <- cases_df %>%
    filter(
        month >= 6,
        month < 9
    ) %>%
    group_by(month) %>%
    summarize(
        n_rows = n()
    )

cases_summer_df


In [None]:
cases_summer_df_2 <- cases_df %>%
    filter(
        month == 6 | month == 7 | month == 8
    ) %>%
    group_by(month) %>%
    summarize(
        n_rows = n()
    )

cases_summer_df_2
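
The two solutions above express the same row filter in different ways. A third option is the `%in%` operator, which tests membership in a set of values. The sketch below checks that all three forms keep the same rows. It uses a small made-up data frame (`cases_toy`) rather than the real cases file, so the column values are assumptions for illustration.

```r
library(dplyr)

# Toy stand-in for the cases data, made up for illustration only
cases_toy <- data.frame(
    casenumber = c("c1", "c2", "c3", "c4", "c5"),
    month      = c(5L, 6L, 7L, 8L, 9L)
)

by_range <- cases_toy %>% filter(month >= 6, month < 9)
by_or    <- cases_toy %>% filter(month == 6 | month == 7 | month == 8)
by_in    <- cases_toy %>% filter(month %in% c(6, 7, 8))

by_in
```

The `%in%` form stays short even when the set of values is long or not a contiguous range.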

<center><h1>Using <code>group_by()</code> and <code>summarise()</code> in dplyr</h1></center>

# 10. Why use `group_by()` and `summarise()` from _dplyr_?
  - Aggregating and summarizing data by group is an extremely common task
  - _split-apply-combine_ pattern
  - These operations can be "chained" with other _dplyr_ functions
  - Often makes for concise, intuitive, and readable code
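
As a rough illustration of the split-apply-combine pattern, the sketch below computes a mean age per group twice on a made-up data frame: first step by step in base R, then with `group_by()` and `summarise()`. The data and variable names are hypothetical.

```r
library(dplyr)

# Toy data, made up for illustration only
df <- data.frame(
    gender = c("Female", "Male", "Female", "Male", "Male"),
    age    = c(30, 20, 40, 30, 40)
)

# Base R: split the ages by group, then apply mean() and combine the results
ages_by_gender <- split(df$age, df$gender)       # split
mean_by_gender <- sapply(ages_by_gender, mean)   # apply + combine

# dplyr: the same idea in one chain
gender_means <- df %>%
    group_by(gender) %>%
    summarise(mean_age = mean(age))

mean_by_gender
gender_means
```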

## 10.1 Example of `group_by()` and `summarise()`

In [None]:
gender_tbl <- arrests_df %>%
    group_by(gender) %>%
    summarise(
        n_rows = n(),
        mean_age = mean(age)
    ) 

gender_tbl

gender,n_rows,mean_age
<chr>,<int>,<dbl>
,21,29.47619
Female,2777,32.10839
Male,10170,33.3589
,36,26.61111
Unknown,8,35.125


# 11. Chaining `filter()` with `group_by()` and `summarise()`

In [None]:
gender_tbl <- arrests_df %>%
    filter(
        from_city == "Providence",
        year == 2019
    ) %>%
    group_by(gender) %>%
    summarise(
        n_rows = n(),
        mean_age = mean(age),
        mean_cnts = mean(counts, na.rm = TRUE)
    )

head(gender_tbl)

gender,n_rows,mean_age,mean_cnts
<chr>,<int>,<dbl>,<dbl>
,9,23.88889,1.0
Female,515,33.46602,1.064039
Male,2039,33.38941,1.098027
Unknown,1,49.0,1.0
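
The `na.rm = TRUE` argument in the `mean(counts, ...)` call above matters because `mean()` returns `NA` as soon as any value is missing. A small base-R sketch with a made-up vector:

```r
# Hypothetical counts with one missing value, for illustration only
counts <- c(1, 2, NA, 1)

with_na    <- mean(counts)                 # NA: the missing value propagates
without_na <- mean(counts, na.rm = TRUE)   # mean of 1, 2, 1 only

with_na
without_na
```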


## 11.1 More Interesting Example of Chaining

In [None]:
is_summer <- function(month_num) {
    
    chk <- month_num %in% c(6, 7, 8)
    return(chk)
}

In [None]:
is_summer(6)   # TRUE
is_summer(2)   # FALSE
is_summer(8)   # TRUE
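
Because `%in%` is vectorized, `is_summer()` also accepts a whole vector of months and returns one logical value per element. That property is what allows it to be used on a column inside `filter()` or `summarise()`. A minimal sketch with a made-up `months` vector:

```r
# Same definition as above, repeated so the example is self-contained
is_summer <- function(month_num) {
    month_num %in% c(6, 7, 8)
}

# Hypothetical month numbers, for illustration only
months <- c(1, 6, 7, 12)

flags <- is_summer(months)       # one TRUE/FALSE per month
summer_months <- months[flags]   # logical indexing keeps only summer months

flags
summer_months
```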


### 11.1.1 More Interesting Example (cont.)

In [None]:
vio_tbl <- arrests_df %>%
    filter(
        statute_desc != "",
        statute_desc != "NULL", 
        year == 2021
    ) %>%
    group_by(statute_desc) %>%
    summarise(
        n_vios = n(),
        prop_male = mean(gender == "Male"),
        mean_age = mean(age),
        prop_summer = mean(is_summer(month))
    ) %>%
    arrange(desc(n_vios))

head(vio_tbl)

statute_desc,n_vios,prop_male,mean_age,prop_summer
<chr>,<int>,<dbl>,<dbl>,<dbl>
DOMESTIC-SIMPLE ASSAULT/BATTERY,290,0.7482759,33.13448,0.3068966
"Driving after Denial, Suspension or Revocation of License",237,0.7552743,31.91561,0.3459916
DISORDERLY CONDUCT,132,0.8106061,30.73485,0.3409091
SIMPLE ASSAULT OR BATTERY,115,0.7304348,35.33913,0.3043478
BENCH WARRANT ISSUED FROM SUPERIOR COURT,106,0.8301887,37.04717,0.2735849
LICENSE OR PERMIT REQUIRED FOR CARRYING PISTOL,94,0.9787234,26.47872,0.2978723
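
The `prop_male` and `prop_summer` columns rely on the fact that R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic, so the mean of a logical vector is the proportion of `TRUE` values. A small base-R sketch with made-up data:

```r
# Hypothetical gender values, for illustration only
gender <- c("Male", "Female", "Male", "Male")

is_male   <- gender == "Male"   # logical vector
n_male    <- sum(is_male)       # number of TRUE values
prop_male <- mean(is_male)      # proportion of TRUE values: 3 out of 4

prop_male
```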


<center><h1>Challenge Problem</h1></center>

Suppose we are interested in the distribution of states of origin (i.e., `from_state`) for the males arrested in the summer months (i.e., June, July, August). Let's use dplyr to create a table with the counts of arrests from the different states in our arrests data. 

To accomplish this, we will use the `filter()`, `group_by()`, and `summarise()` functions. The table should end up having two columns: `from_state` and `num_arrests`.



In [None]:
male_summer_tbl <- arrests_df %>%
    filter(
        gender == "Male",
        is_summer(month)
    ) %>% 
    group_by(from_state) %>%
    summarize(
        num_arrests = n()
    ) %>%
    arrange(desc(num_arrests))

male_summer_tbl

from_state,num_arrests
<chr>,<int>
,1342
,682
Rhode Island,641
Massachusetts,20
Connecticut,6
Georgia,6
New York,5
New Mexico,4
Missouri,2
North Dakota,2
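
As an aside, counting rows per group and sorting by the count is common enough that dplyr provides `count()` as a shortcut. The sketch below compares it with the `group_by()` / `summarise()` / `arrange()` chain on a made-up data frame (`toy` is hypothetical, not the arrests data).

```r
library(dplyr)

# Toy data, made up for illustration only
toy <- data.frame(
    from_state = c("Rhode Island", "Rhode Island", "Massachusetts",
                   "Rhode Island", "Connecticut", "Massachusetts")
)

# The long form used in the solution above
long_form <- toy %>%
    group_by(from_state) %>%
    summarise(num_arrests = n()) %>%
    arrange(desc(num_arrests))

# Shortcut: count rows per group, sorted from largest to smallest
short_form <- toy %>%
    count(from_state, sort = TRUE, name = "num_arrests")

short_form
```

The explicit chain is still the better choice when the summary needs more than a row count, such as a mean or a proportion.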
