
In [2]:
## Importing packages
library(tidyverse) # metapackage with lots of helpful functions
library(magrittr)

# Exploring the police dataset

Alright, let's get started. First, load the data (a gzipped CSV) from the SDS server.

In [3]:
data <- read_csv("https://sds-aau.github.io/SDS-master/M1/data/RI-clean.csv.gz")

Parsed with column specification:
cols(
  .default = col_character(),
  stop_date = col_date(format = ""),
  stop_time = col_time(format = ""),
  county_name = col_logical(),
  county_fips = col_logical(),
  fine_grained_location = col_logical(),
  driver_age_raw = col_double(),
  driver_age = col_double(),
  search_conducted = col_logical(),
  contraband_found = col_logical(),
  is_arrested = col_logical(),
  out_of_state = col_logical(),
  drugs_related_stop = col_logical()
)

See spec(...) for full column specifications.



Let's do a first inspection of the dataset.

In [4]:
data %>% head()

id,state,stop_date,stop_time,location_raw,county_name,county_fips,fine_grained_location,police_department,driver_gender,⋯,search_conducted,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district
<chr>,<chr>,<date>,<time>,<chr>,<lgl>,<lgl>,<lgl>,<chr>,<chr>,⋯,<lgl>,<chr>,<chr>,<lgl>,<chr>,<lgl>,<chr>,<lgl>,<lgl>,<chr>
RI-2005-00001,RI,2005-01-02,01:55:00,Zone K1,,,,600,M,⋯,False,,,False,Citation,False,0-15 Min,False,False,Zone K1
RI-2005-00002,RI,2005-01-02,20:30:00,Zone X4,,,,500,M,⋯,False,,,False,Citation,False,16-30 Min,False,False,Zone X4
RI-2005-00003,RI,2005-01-04,11:30:00,Zone X1,,,,0,,⋯,False,,,False,,,,,False,Zone X1
RI-2005-00004,RI,2005-01-04,12:55:00,Zone X4,,,,500,M,⋯,False,,,False,Citation,False,0-15 Min,False,False,Zone X4
RI-2005-00005,RI,2005-01-06,01:30:00,Zone X4,,,,500,M,⋯,False,,,False,Citation,False,0-15 Min,False,False,Zone X4
RI-2005-00006,RI,2005-01-12,08:05:00,Zone X1,,,,0,M,⋯,False,,,False,Citation,False,30+ Min,True,False,Zone X1


In [5]:
data %>% glimpse()

Rows: 509,681
Columns: 26
$ id                    <chr> "RI-2005-00001", "RI-2005-00002", "RI-2005-0000…
$ state                 <chr> "RI", "RI", "RI", "RI", "RI", "RI", "RI", "RI",…
$ stop_date             <date> 2005-01-02, 2005-01-02, 2005-01-04, 2005-01-04…
$ stop_time             <time> 01:55:00, 20:30:00, 11:30:00, 12:55:00, 01:30:…
$ location_raw          <chr> "Zone K1", "Zone X4", "Zone X1", "Zone X4", "Zo…
$ county_name           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ county_fips           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fine_grained_location <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ police_department     <chr> "600", "500", "000", "500", "500", "000", "300"…
$ driver_gender         <chr> "M", "M", NA, "M", "M", "M", "M", "M", "M",

- Each row is one traffic stop
- NA are missing values

In [10]:
data %>%
filter(driver_gender %>% is.na()) %>% 
head()

id,state,stop_date,stop_time,location_raw,county_name,county_fips,fine_grained_location,police_department,driver_gender,⋯,search_conducted,search_type_raw,search_type,contraband_found,stop_outcome,is_arrested,stop_duration,out_of_state,drugs_related_stop,district
<chr>,<chr>,<date>,<time>,<chr>,<lgl>,<lgl>,<lgl>,<chr>,<chr>,⋯,<lgl>,<chr>,<chr>,<lgl>,<chr>,<lgl>,<chr>,<lgl>,<lgl>,<chr>
RI-2005-00003,RI,2005-01-04,11:30:00,Zone X1,,,,0,,⋯,False,,,False,,,,,False,Zone X1
RI-2005-00018,RI,2005-03-05,00:20:00,Zone X4,,,,500,,⋯,False,,,False,,,,,False,Zone X4
RI-2005-00036,RI,2005-06-08,08:40:00,Zone X4,,,,500,,⋯,False,,,False,,,,,False,Zone X4
RI-2005-00075,RI,2005-08-17,13:20:00,Zone X1,,,,0,,⋯,False,,,False,,,,,False,Zone X1
RI-2005-00076,RI,2005-08-17,15:59:00,Zone X1,,,,0,,⋯,False,,,False,,,,,False,Zone X1
RI-2005-00081,RI,2005-08-24,18:00:00,Zone X1,,,,0,,⋯,False,,,False,,,,,False,Zone X1


Let's use `summarise_all()` and `pivot_longer()` to get an overview of missing values across the dataset.

In [15]:
data %>%
  select(-id) %>%
  summarise_all(function(x) sum(is.na(x))) %>%
  pivot_longer(everything())

name,value
<chr>,<int>
state,0
stop_date,10
stop_time,10
location_raw,0
county_name,509681
county_fips,509681
fine_grained_location,509681
police_department,10
driver_gender,29097
driver_age_raw,29049


Clean the data a bit

In [16]:
# Drop the county_name column, which contains only missing values.
data %<>%
  select(-county_name)

In [17]:
# Drop all rows where stop_date, stop_time, or driver_gender is missing
data %<>%
  drop_na(stop_date, stop_time, driver_gender)

# Adjusting dates, times and index

In [20]:
# We start by concatenating the date and time columns into one string column that we call datetime_stop, using paste()
data %<>%
  mutate(datetime_stop = paste(stop_date, stop_time, sep = " "))

In [21]:
data %>% 
  select(datetime_stop) %>%
  head()

datetime_stop
<chr>
2005-01-02 01:55:00
2005-01-02 20:30:00
2005-01-04 12:55:00
2005-01-06 01:30:00
2005-01-12 08:05:00
2005-01-18 08:15:00


In [22]:
# The lubridate library helps us to transform strings into dates or datetimes (timestamps)
library(lubridate)

In [24]:
data %<>%
  mutate(datetime_stop = datetime_stop %>% as_datetime()) 

In [25]:
data %>% 
  select(datetime_stop) %>%
  head()

datetime_stop
<dttm>
2005-01-02 01:55:00
2005-01-02 20:30:00
2005-01-04 12:55:00
2005-01-06 01:30:00
2005-01-12 08:05:00
2005-01-18 08:15:00
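
To see what the string-to-datetime conversion does in isolation, here is a small sketch on two hand-typed timestamps. It uses `ymd_hms()`, a `lubridate` parser for this exact "year-month-day hour:minute:second" layout, instead of `as_datetime()`; the vector names `stamps` and `parsed` are made up for the illustration.

```r
library(lubridate)

# Two timestamps in the same "date time" layout that paste() produced above
stamps <- c("2005-01-02 01:55:00", "2005-01-04 12:55:00")

# ymd_hms() parses the strings into POSIXct datetimes (UTC unless tz is given)
parsed <- ymd_hms(stamps)

# Once parsed, components can be extracted with accessor functions
year(parsed)
hour(parsed)
```

Note that neither the notebook nor this sketch sets a time zone, so the stops are treated as UTC; whether that matters depends on the analysis.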


# Explore the data step-by-step

In [26]:
# Count number of stop outcomes
data %>%
group_by(stop_outcome) %>%
summarize(n = n())

## Note: Could be done faster with
# data %>% count(stop_outcome)

`summarise()` ungrouping output (override with `.groups` argument)



stop_outcome,n
<chr>,<int>
Arrest Driver,14630
Arrest Passenger,1973
Citation,428378
N/D,3431
No Action,3332
Warning,28840


In [None]:
# there are many ways to do the same thing. Meet count(), the short form of group_by(x) %>% summarize(n = n())
data %>%
count(stop_outcome, sort = TRUE)

In [None]:
# Relative proportions
data %>%
count(driver_race, sort = TRUE) %>%
mutate(pct = (n / sum(n)) %>% round(2) )

### Let's try out some hypotheses

One hypothesis could be that the stop_outcome is different for different races. Discrimination?

In [None]:
data %>%
count(driver_race, stop_outcome) %>%
group_by(driver_race) %>%
mutate(pct = (n / sum(n)) %>% round(2) ) 


Nice, can we also have that as an easy-to-inspect matrix? Sure, let's use `spread()`. Note that there are many other ways to do that, such as base R's `prop.table()`, but here I stay in the `tidyverse` framework.

In [None]:
data %>%
count(driver_race, stop_outcome) %>%
group_by(driver_race) %>%
mutate(pct = (n / sum(n, na.rm = TRUE)) %>% round(2) ) %>%
select(-n) %>%
spread(stop_outcome, pct)
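
In current `tidyr` versions, `spread()` is superseded by `pivot_wider()`. As a hedged sketch of the same reshaping step, here it is on a tiny made-up table (the race labels "A" and "B" and all counts are placeholders, not real data):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per stop
toy <- tibble(
  driver_race  = c("A", "A", "A", "B", "B"),
  stop_outcome = c("Citation", "Citation", "Warning", "Citation", "Warning")
)

wide <- toy %>%
  count(driver_race, stop_outcome) %>%        # counts per race and outcome
  group_by(driver_race) %>%
  mutate(pct = round(n / sum(n), 2)) %>%      # share of each outcome within a race
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = stop_outcome, values_from = pct)

wide
```

Each row of `wide` is one race, each column one outcome, and the cells hold within-race shares, so every row sums to roughly 1.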

We might also have a look at a nice data summary. The `summarytools` package has some useful functions for that.

In [None]:
# install.packages("summarytools")
# library(summarytools)

In [None]:
# data %>% dfSummary()

### Filtering by multiple conditions

Again, we can use `dplyr` `filter()` with multiple conditions, or `group_by()` + `summarize()` for summaries in nested groups.

In [None]:
# Filter for drivers who are Hispanic and were arrested
data %>%
filter(driver_race == "Hispanic" & is_arrested == TRUE) %>%
head(10)

##### Rules for filtering

- & AND
- | OR
- Conditions can be wrapped in () for readability (R does not require this for simple comparisons), and many can be combined
- == Equality
- != Inequality
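
The rules above can be tried out on a small toy tibble. This is only an illustrative sketch; the rows are made up, and apart from "Hispanic" the category labels are assumptions about what the column contains.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  driver_race = c("Hispanic", "White", "Hispanic", "Black"),
  is_arrested = c(TRUE, FALSE, FALSE, TRUE)
)

# AND: both conditions must hold
both <- toy %>% filter(driver_race == "Hispanic" & is_arrested)

# OR: at least one condition must hold
either <- toy %>% filter(driver_race == "Hispanic" | is_arrested)

# Inequality: everything except the given value
not_hispanic <- toy %>% filter(driver_race != "Hispanic")

nrow(both)
nrow(either)
nrow(not_hispanic)
```

Since `is_arrested` is already logical, `is_arrested` and `is_arrested == TRUE` select the same rows inside `filter()`.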

##### Remember, that we are not making any statement about causation. This is purely a correlation exercise (so far!)

#### A bit on logical vectors

`TRUE` counts as 1 and `FALSE` as 0,
which means that you can perform calculations on logical columns:

In [None]:
# Share of each stop outcome by gender, for drivers stopped for speeding
# We can take a shortcut using `count()` and `group_by()`
data %>%
filter(violation == "Speeding") %>%
count(driver_gender, stop_outcome) %>%
group_by(driver_gender) %>%
mutate(pct = (n / sum(n, na.rm = TRUE)) %>% round(2) )
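
The `TRUE` = 1 and `FALSE` = 0 arithmetic is easiest to see on a small logical vector. A minimal sketch with made-up values (the vector `arrested` is hypothetical):

```r
# Hypothetical logical vector with one missing value
arrested <- c(TRUE, FALSE, FALSE, TRUE, NA)

# sum() counts the TRUE values, because TRUE is treated as 1
n_arrests <- sum(arrested, na.rm = TRUE)

# mean() gives the share of TRUE among the non-missing values
share_arrested <- mean(arrested, na.rm = TRUE)

n_arrests
share_arrested
```

Without `na.rm = TRUE`, both results would be `NA` as soon as a single value is missing.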


### "protective frisk"
Sometimes, when a search is conducted during a stop, the officer also checks whether the driver is carrying a weapon. This is called a "protective frisk".
Let's try to figure out if men are frisked more often than women.

In [None]:
# Look at the different search types performed
data %>%
count(search_type, sort = TRUE)


#### Extracting a string
As you can see, `search_type` is a multiple-choice string column. *Incident to Arrest* and *Probable Cause* are the most common values, but combinations are possible. To work with text strings, the `tidyverse` package `stringr` has many useful functions. We can use `str_detect()` to flag the cases of interest. It returns a logical vector, which we can assign to a new variable `frisk` in our dataframe.

**NOTE:** For missing search type values, `str_detect()` returns `NA`, so we first replace them with an empty string using `replace_na("")`.

In [None]:
data %<>%
mutate(frisk = search_type %>% replace_na("") %>% str_detect("Protective Frisk") )
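
To make the effect of `replace_na("")` visible, here is a small sketch on a hand-made vector. The example strings are assumptions about how combined search types are written in the data; only "Protective Frisk" is taken from the code above.

```r
library(stringr)
library(tidyr)

# Hypothetical search types, including a missing value
search_type <- c(
  "Protective Frisk",
  "Incident to Arrest,Protective Frisk",
  NA,
  "Probable Cause"
)

# Without replacement, the missing value stays NA
raw <- str_detect(search_type, "Protective Frisk")

# With replacement, the missing value becomes FALSE
clean <- str_detect(replace_na(search_type, ""), "Protective Frisk")

raw
clean
sum(clean)
```

Treating a missing search type as "no frisk" is a modelling choice: it keeps `sum()` and `mean()` free of `NA`, at the cost of counting unknown cases as `FALSE`.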

In [None]:
# How many frisks have been performed? (1598)
# Notice that pull() extracts a vector from a dataframe
data %>% pull(frisk) %>% sum()

## Using the datetime index to select data

What if you assume that things got better or worse over the years? Remember the `lubridate` package for handling date-times?

In [None]:
# Are things getting better or worse over the years?
library(lubridate)

In [None]:
data %<>%
mutate(year = year(stop_date), month = month(stop_date), day = day(stop_date)) %>%
drop_na(year)

In [None]:
data %>%
group_by(year) %>%
summarize(frisk_mean = frisk %>% mean())

We can obviously also plot that with `ggplot`.

Note: We use `gather()` to create tidy data where multiple variables can be jointly plotted.

In [None]:
data %>%
group_by(year) %>%
summarize(frisk = frisk %>% mean(na.rm = TRUE), is_arrested = is_arrested %>% mean(na.rm = TRUE))  %>%
ungroup() %>% 
gather(key = "key", value = "mean", frisk, is_arrested) %>%
ggplot(aes(x = year, y = mean, color = key)) +
geom_line()

We can do grouping on different levels. For instance, let's see in which months we have the most arrests.

In [None]:
data %>% 
filter(is_arrested == TRUE) %>%
ggplot(aes(x = month)) + geom_bar()

We see that more arrests occur in the winter than in the summer. Note that the bars show arrest counts, not arrest rates.

### Transforming categorical into numerical data

You may have noted the `stop_duration` column in our dataset and that it is a character (string) variable. That means we can use it as a grouping dimension but not to perform any calculations. What we can do is map the categories to reasonable numerical values using the `recode()` function.

In [None]:
data %>%
count(stop_duration)

In [None]:
data %<>%
  mutate(stop_duration_num = recode(stop_duration, "0-15 Min" = 7.5, "16-30 Min" = 23, "30+ Min" = 45))

In [None]:
data %>%
count(stop_duration_num)
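
The idea of a "mapping dictionary" can also be sketched in base R with a named vector used as a lookup table. This is an alternative illustration, not what the notebook does; `duration` and `lookup` are made-up names, and the numeric values are the same rough midpoints chosen above.

```r
# Hypothetical durations, including a missing value
duration <- c("0-15 Min", "16-30 Min", "30+ Min", "0-15 Min", NA)

# Named vector: names are the categories, values the numbers they map to
lookup <- c("0-15 Min" = 7.5, "16-30 Min" = 23, "30+ Min" = 45)

# Indexing by name performs the mapping; missing input stays NA
duration_num <- unname(lookup[duration])

duration_num
mean(duration_num, na.rm = TRUE)
```

Any category that is not listed in `lookup` would also come out as `NA`, so it is worth checking the result with a count before averaging.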

#### A quick intro to loops

Loops are covered in my R intro.

In [None]:
# Let's do everything in one pipeline
data %>%
group_by(violation_raw) %>%
summarize(stop_duration_num = mean(stop_duration_num, na.rm = TRUE))  %>%
ggplot(aes(x = violation_raw, y = stop_duration_num)) +
geom_col() +
coord_flip() # To have horizontal bars

#### Cutting intervals

`driver_age` contains (duh) the age of the driver. It's a continuous numerical value and thus good for more advanced analysis, but perhaps a bit too detailed for exploration.

More useful in that context would be to slice the variable into ordered categories corresponding to age groups of interest, say "teen", "20s", "30s" etc.

In [None]:
data %>% select(driver_age) %>% summary()

In [None]:
# Create categories with cut()
data %>% 
mutate(driver_age_cat = driver_age %>% cut(5)) %>%
head()

In [None]:
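# We can also provide labels (one label per interval)
labs <- c("teen", "20s", "30s", "40s", "50+")

# Create categories with cut(): 5 labels need 6 break points
# (the outer bounds 0 and Inf are assumptions added so that every age falls into a category)
data %>%
  mutate(driver_age_cat = driver_age %>% cut(breaks = c(0, 20, 30, 40, 50, Inf), labels = labs, right = FALSE)) %>%
  head()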
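
To see how `cut()` assigns values to intervals, here is a minimal sketch on a handful of made-up ages. The break points, including the outer bounds `0` and `Inf`, are assumptions chosen to match the five labels, not values prescribed by the dataset.

```r
# Hypothetical ages
ages <- c(16, 19, 20, 29, 30, 45, 50, 72)

# A single number asks for that many equal-width intervals
equal_width <- cut(ages, 5)

# Explicit breaks with labels: 6 break points define 5 intervals.
# right = FALSE makes intervals closed on the left, e.g. [20, 30),
# so a 20-year-old lands in "20s" rather than "teen".
labs <- c("teen", "20s", "30s", "40s", "50+")
by_decade <- cut(ages, breaks = c(0, 20, 30, 40, 50, Inf),
                 labels = labs, right = FALSE)

table(by_decade)
```

With the default `right = TRUE`, the intervals would instead be closed on the right, and ages 20, 30 and 50 would each fall into the lower category.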