# Practice Association Rules - Minneapolis Crime Data

The dataset used in this notebook (courtesy of Open Data Minneapolis) includes information about police calls and crimes committed between 2010 and 2016.

We will mine association rules from the data to find frequent patterns.

Read the data from `/dsa/data/all_datasets/minneapolis_crimedata/crimes.csv`

 - **Hint:** If you aren't sure of the R function to use for some of the activities, try an Internet search. If necessary, ask on Mutual Aid. (Finding the correct function to use in a situation is a common thing we need to do.)

In [8]:
crimes_data = read.csv('/dsa/data/all_datasets/minneapolis_crimedata/crimes.csv')

In [117]:
head(crimes_data,2)

publicaddress,controlnbr,CCN,Precinct,ReportedDate,BeginDate,Time,Offense,Description,UCRCode,EnteredDate,Long,Lat,x,y,Neighborhood,lastchanged,LastUpdateDate,OBJECTID,ESRI_OID
0029XX Chicago AV S,2001,MP 2010 049740,3,2010-02-22T13:40:00.000Z,2010-02-22T13:22:00.000Z,13:22:00,SHOPLF,Shoplifting,7,2010-02-22T13:37:48.000Z,-93.26305,44.94899,531143.9,157587.0,PHILLIPS WEST,2010-02-22T18:51:39.000Z,2015-09-21T14:16:59.000Z,,
0049XX Queen AV N,2002,MP 2010 999144,4,2010-02-22T13:38:15.000Z,2010-02-21T08:00:00.000Z,08:00:00,TFMV,Theft From Motr Vehc,7,2010-02-22T13:38:15.000Z,-93.31088,45.04451,518728.3,192399.9,SHINGLE CREEK,2010-02-22T13:38:22.000Z,2015-09-21T14:16:59.000Z,,


In [114]:
dim(crimes_data)

The columns 
- `controlnbr`
- `CCN`
- `Time`
- `ReportedDate`
- `Offense`
- `UCRCode`
- `EnteredDate`
- `Long`
- `Lat`
- `x`
- `y`
- `lastchanged`
- `LastUpdateDate`
- `OBJECTID`
- `ESRI_OID` 

are not helpful or interpretable, so let's delete them from the dataset.

**Activity 1:** Remove the columns listed above from the data frame.

In [160]:
# Your code for activity 1 goes here.
# ------------------------------------

#column_names = c(controlnbr,CCN,Time,ReportedDate,Offense,UCRCode,EnteredDate,Long,Lat,x,y,lastchanged,LastUpdateDate,OBJECTID,ESRI_OID)
crimes_data1 = subset(crimes_data,select = -c(controlnbr,CCN,Time,ReportedDate,Offense,UCRCode,EnteredDate,Long,Lat,x,y,lastchanged,LastUpdateDate,OBJECTID,ESRI_OID))
head(crimes_data1,2)



publicaddress,Precinct,BeginDate,Description,Neighborhood
0029XX Chicago AV S,3,2010-02-22T13:22:00.000Z,Shoplifting,PHILLIPS WEST
0049XX Queen AV N,4,2010-02-21T08:00:00.000Z,Theft From Motr Vehc,SHINGLE CREEK
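
The commented-out `column_names` line in the cell above would fail if run, because the column names are not quoted. As a minimal sketch of an alternative, the names can be stored as strings and used to keep every column that is not in the list. The data frame `toy` and the vector `drop_cols` are hypothetical names used only for illustration, with a few values copied from the `head()` output.

```r
# Small stand-in for crimes_data (hypothetical name; values copied from head() above)
toy <- data.frame(
  publicaddress = c("0029XX Chicago AV S", "0049XX Queen AV N"),
  controlnbr    = c(2001, 2002),
  Precinct      = c(3, 4),
  Lat           = c(44.94899, 45.04451),
  stringsAsFactors = FALSE
)

# Column names must be quoted when stored in a character vector
drop_cols <- c("controlnbr", "Lat")

# Keep every column whose name is not in drop_cols
toy_small <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
print(names(toy_small))
```

Unlike `subset(..., select = -c(...))`, this form accepts a plain character vector, so the list of columns can be built or reused programmatically.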


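The first 6 characters in publicaddress (for example, `0029XX`) don't make any sense, and they are followed by a space, so the street name starts at character 8.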

**Activity 2:** Strip the first 7 characters or extract the rest of the characters from the publicaddress column.

In [161]:
# Convert the factor column to plain character strings before editing it
crimes_data1$publicaddress <- as.character(crimes_data1$publicaddress)

In [162]:
total <- length(crimes_data1$publicaddress)
# create progress bar
pb <- txtProgressBar(min = 0, max = total, style = 3)

for (i in 1:total){
    #Sys.sleep(0.1)
    # update progress bar
    setTxtProgressBar(pb, i)
    
    text = crimes_data1$publicaddress[i]
    text = as.character(text)
    if (grepl('X', text)){
        address = unlist(strsplit(text, "X "))
        crimes_data1$publicaddress[i] = address[2]
    } else if (grepl('/', text)){
        address = unlist(strsplit(text, "/ "))
        crimes_data1$publicaddress[i] = address[2]
    }
}
close(pb)



In [163]:
head(crimes_data1,3)

publicaddress,Precinct,BeginDate,Description,Neighborhood
Chicago AV S,3,2010-02-22T13:22:00.000Z,Shoplifting,PHILLIPS WEST
Queen AV N,4,2010-02-21T08:00:00.000Z,Theft From Motr Vehc,SHINGLE CREEK
16 AV SE,2,2010-02-19T23:20:00.000Z,Other Theft,UNIVERSITY OF MINNESOTA
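
The loop above handles one row at a time. A vectorized sketch of the same idea is shown here on two sample addresses from the `head()` output. It assumes every value starts with a 6-character prefix followed by a single space, and it does not reproduce the loop's separate handling of values that contain `/`. The variable names are hypothetical.

```r
# Two sample values copied from the head() output above
addresses <- c("0029XX Chicago AV S", "0049XX Queen AV N")

# Option 1: drop a fixed-width prefix by keeping everything from character 8 on
stripped_fixed <- substring(addresses, 8)

# Option 2: drop everything up to and including the first run of spaces
stripped_regex <- sub("^[^ ]+ +", "", addresses)

print(stripped_fixed)
print(stripped_regex)
```

Both options work on the whole column at once, which avoids the explicit loop and the progress bar.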


The BeginDate column is of type `factor`. What we want to have are columns for the date, weekday, and hour. 

**Reference:**

- [lubridate](https://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/)
- [Handling date-times in R](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/ColeBeck/datestimes.pdf)

You can work with date and time data types in R using either the built-in POSIXt classes or external packages such as lubridate or chron.

There are two POSIX time types, POSIXct and POSIXlt. "ct" stands for calendar time; it stores the number of seconds since the origin (the beginning of Jan 1, 1970, UTC). "lt", or local time, keeps the date as a list of time attributes (such as "hour" and "mon"). See [DateTimeClasses](https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/DateTimeClasses) for more details.
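
The difference between the two classes can be seen in a short base R sketch. The timestamps are illustrative, and `tz = "UTC"` is set explicitly here so the result does not depend on the local time zone.

```r
# POSIXct: a single number, seconds since 1970-01-01 00:00:00 UTC
ct <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
print(as.numeric(ct))   # one day after the origin, in seconds

# POSIXlt: a list of named components
lt <- as.POSIXlt("2010-02-22 13:22:00", tz = "UTC")
print(lt$hour)          # hour of the day
print(lt$mon)           # month, counted from 0 (January is 0)
print(lt$year)          # years since 1900
```

Because POSIXlt exposes components such as `$hour` directly, it is convenient for extracting parts of a timestamp, while POSIXct is the more compact choice for storing a column of timestamps.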

The lubridate package assists greatly in working with dates and times. R commands for date-times are generally unintuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight saving time, and other time-related quirks, and R lacks these capabilities in some situations. lubridate makes it easier to do the things R does with date-times and possible to do the things R does not.

We will first convert BeginDate to a new date column in two steps. In Activity 3, we will convert it to POSIXlt type using the R [strptime() function](https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/strptime). Then, in Activity 4, we will convert it to a Date using a lubridate package function. Finally, in Activity 5, we will extract the weekday and hour from the BeginDate column into new columns using lubridate functions.

**Activity 3:** The BeginDate column is of type `factor`. Convert its type to `POSIXlt` using the strptime() function. To use strptime(), either replace the character "T" in the column with a white space " " or include the literal "T" in the format string.

In [164]:
# Your code for activity 3 goes here.
# ------------------------------------
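# The literal "T" in the format string matches the separator in the data;
# the trailing ".000Z" is not covered by the format and is ignored.
crimes_data1$BeginDate = strptime(as.character(crimes_data1$BeginDate),"%Y-%m-%dT%H:%M:%S")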
head(crimes_data1,2)

publicaddress,Precinct,BeginDate,Description,Neighborhood
Chicago AV S,3,2010-02-22 13:22:00,Shoplifting,PHILLIPS WEST
Queen AV N,4,2010-02-21 08:00:00,Theft From Motr Vehc,SHINGLE CREEK
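
The activity text also describes replacing the "T" with a space before parsing. A minimal sketch of that variant on a single sample value follows. The names `raw`, `with_space`, and `parsed` are hypothetical, and `tz = "UTC"` is an assumption made here for reproducibility. The cell above uses the default time zone.

```r
# One sample value, stored as a factor like the original column
raw <- factor("2010-02-22T13:22:00.000Z")

# Replace the first literal "T" with a space
with_space <- sub("T", " ", as.character(raw), fixed = TRUE)

# Parse into POSIXlt; characters after the seconds are not part of the format
parsed <- strptime(with_space, "%Y-%m-%d %H:%M:%S", tz = "UTC")

print(class(parsed))
print(format(parsed, "%Y-%m-%d %H:%M:%S"))
```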


**Activity 4:** Extract the date from the BeginDate column and store it as a new column called date.

In [168]:
# Your code for activity 4 goes here.
# ------------------------------------
library(lubridate)
# as.Date() on a POSIXlt value keeps the calendar date; no format string is needed
crimes_data1$date = as.Date(crimes_data1$BeginDate)

In [169]:
# Check your work...
class(crimes_data1$date)

**Activity 5:** Extract the weekday from date and the hour from the BeginDate column, and store them as new columns called weekday and hour, respectively.

In [177]:
# Your code for activity 5 goes here.
# ------------------------------------
crimes_data1$weekday = weekdays(as.Date(crimes_data1$date))
crimes_data1$hour = crimes_data1$BeginDate$hour
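
The cells above use base R (`as.Date()`, `weekdays()`, and the `$hour` component of POSIXlt). As a sketch of how the same three values could be obtained with lubridate functions, here is a single sample timestamp parsed and taken apart. The variable names are hypothetical, and the weekday label depends on the session locale.

```r
library(lubridate)

# ymd_hms() parses the ISO 8601 string directly, including the "T" and "Z"
x <- ymd_hms("2010-02-22T13:22:00.000Z")

d  <- as_date(x)                           # calendar date
h  <- hour(x)                              # hour of the day (0 to 23)
wd <- wday(x, label = TRUE, abbr = FALSE)  # weekday name as an ordered factor

print(d)
print(h)
print(wd)
```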

We don't need the BeginDate column any more, so let's delete it from the data frame.

In [179]:
# Your code goes here.
# ------------------------------------
crimes_data1$BeginDate <- NULL


**Activity 6:** Convert the hour variable into an ordered factor with levels "mid night", "morning", "noon","night" for different hours of the day. 

In [181]:
# Your code for activity 6 goes here.
# ------------------------------------
crimes_data1[["hour"]] <- ordered(cut(crimes_data1[["hour"]], c(-1,0,11,12,24)),
                                 labels = c("mid night", "morning", "noon","night"))


In [189]:
head(crimes_data1,3)

publicaddress,Precinct,Description,Neighborhood,date,weekday,hour
Chicago AV S,3,Shoplifting,PHILLIPS WEST,2010-02-22,Monday,night
Queen AV N,4,Theft From Motr Vehc,SHINGLE CREEK,2010-02-21,Sunday,morning
16 AV SE,2,Other Theft,UNIVERSITY OF MINNESOTA,2010-02-19,Friday,night
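
To see how the breaks `c(-1, 0, 11, 12, 24)` map hours to labels, here is a small sketch on a few hand-picked hours. `cut()` builds right-closed intervals by default, so the bins are (-1, 0], (0, 11], (11, 12], and (12, 24]. The vector `hours` is a made-up example.

```r
# A few illustrative hours of the day
hours <- c(0, 8, 12, 13, 23)

# Same breaks and labels as in Activity 6, passed directly to cut()
bins <- cut(hours,
            breaks = c(-1, 0, 11, 12, 24),
            labels = c("mid night", "morning", "noon", "night"),
            ordered_result = TRUE)

print(data.frame(hour = hours, bin = bins))
```

With these breaks only hour 0 is labeled "mid night" and only hour 12 is labeled "noon", which is worth keeping in mind when interpreting rules that involve the hour item.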


In [191]:
str(crimes_data1)

'data.frame':	136121 obs. of  7 variables:
 $ publicaddress: Factor w/ 1038 levels "1 AV N","1 AV NE",..: 397 817 46 952 311 985 123 248 373 641 ...
 $ Precinct     : Factor w/ 6 levels "1","2","3","4",..: 3 4 2 2 1 2 3 2 4 3 ...
 $ Description  : Factor w/ 36 levels "1st Deg Domes Asslt",..: 29 33 22 32 22 22 22 13 12 34 ...
 $ Neighborhood : Factor w/ 88 levels "","ARMATAGE",..: 65 71 78 78 21 78 83 52 43 24 ...
 $ date         : Factor w/ 2120 levels "2010-01-01","2010-01-02",..: 53 52 50 51 52 53 41 52 53 53 ...
 $ weekday      : Factor w/ 7 levels "Friday","Monday",..: 2 4 1 3 4 2 7 4 2 2 ...
 $ hour         : Ord.factor w/ 4 levels "mid night"<"morning"<..: 4 2 4 4 4 2 3 2 2 2 ...


**Activity 7:** Convert the columns "publicaddress", "Precinct", "weekday", "date" into factor type. 

In [184]:
# Your code for activity 7 goes here..
crimes_data1["publicaddress"] = as.factor(crimes_data1[["publicaddress"]])
crimes_data1["Precinct"] = as.factor(crimes_data1[["Precinct"]])
crimes_data1["weekday"] = as.factor(crimes_data1[["weekday"]])
crimes_data1["date"] = as.factor(crimes_data1[["date"]])

**Activity 8:** Now, coerce the data set into transactions. Save the transactions to the `crimes_trans` variable.

In [185]:
# Your code for activity 8 goes here.
# ------------------------------------
library("arules")
crimes_trans <- as(crimes_data1, "transactions")

Loading required package: Matrix

Attaching package: ‘arules’

The following object is masked from ‘package:dplyr’:

    recode

The following objects are masked from ‘package:base’:

    abbreviate, write



In [186]:
crimes_trans

transactions in sparse format with
 136121 transactions (rows) and
 3299 items (columns)
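
When a data frame of factors is coerced to transactions, each column and level pair (such as `Precinct=4`) becomes one item. As a quick check, the level counts reported by `str()` earlier add up to the 3299 items shown above. The vector name below is hypothetical, and the numbers are copied from that output.

```r
# Number of factor levels per column, copied from the str() output
levels_per_column <- c(
  publicaddress = 1038,
  Precinct      = 6,
  Description   = 36,
  Neighborhood  = 88,
  date          = 2120,
  weekday       = 7,
  hour          = 4
)

total_items <- sum(levels_per_column)
print(total_items)
```

Most of the items come from `date` and `publicaddress`, so few of them are likely to be frequent enough to appear in rules at the support threshold used in the next activity.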

**Activity 9:** Generate association rules for the transactions in crimes_trans with a support of 0.01 and a confidence of 0.6.

In [187]:
# Your code for activity 9 goes here.
# ------------------------------------
rules <- apriori(crimes_trans, parameter = list(support = 0.01, confidence = 0.6))


Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.6    0.1    1 none FALSE            TRUE       5    0.01      1
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 1361 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[3299 item(s), 136121 transaction(s)] done [0.39s].
sorting and recoding items ... [78 item(s)] done [0.02s].
creating transaction tree ... done [0.12s].
checking subsets of size 1 2 3 4 done [0.01s].
writing ... [109 rule(s)] done [0.00s].
creating S4 object  ... done [0.01s].
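
The "Absolute minimum support count" line in the output can be related to the parameters with a little arithmetic. The sketch below uses only numbers from the output. That the fractional part is dropped is an observation from the reported value of 1361, not a statement about how arules is implemented.

```r
n_transactions <- 136121  # from the transactions summary
min_support    <- 0.01    # support parameter passed to apriori()

# Relative support converted to a number of transactions
raw_count <- min_support * n_transactions
print(raw_count)

# Dropping the fractional part gives the count reported in the output
min_count <- floor(raw_count)
print(min_count)
```

In other words, an itemset has to appear in roughly 1361 or more of the 136121 records before it can take part in a rule.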


**Activity 10:** Display the generated rules using inspect(). 

In [188]:
# Your code for activity 10 goes here.
# ------------------------------------
inspect(rules)

      lhs                                               rhs                             support confidence      lift count
[1]   {publicaddress=Penn AV N}                      => {Precinct=4}                 0.01050536  0.9986034  4.793895  1430
[2]   {Neighborhood=KING FIELD}                      => {Precinct=5}                 0.01110042  0.9980185  5.135766  1511
[3]   {Neighborhood=HOWE}                            => {Precinct=3}                 0.01096084  0.9986613  3.755215  1492
[4]   {Neighborhood=MCKINLEY}                        => {Precinct=4}                 0.01114450  0.9960604  4.781687  1517
[5]   {Neighborhood=STANDISH}                        => {Precinct=3}                 0.01140897  0.9980720  3.752999  1553
[6]   {Neighborhood=STEVENS SQUARE - LORING HEIGHTS} => {Precinct=5}                 0.01124735  0.9915803  5.102635  1531
[7]   {Neighborhood=CORCORAN}                        => {Precinct=3}                 0.01156324  0.9987310  3.755477  1574
[8]   {publicadd
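
The columns of the `inspect()` output are related by the standard definitions: support is the rule's count divided by the number of transactions, confidence is the rule's support divided by the support of the left-hand side, and lift is confidence divided by the support of the right-hand side. A sketch using the numbers printed for rule [1] shows how they fit together. The variable names are hypothetical, and the derived values are approximate because the printed figures are rounded.

```r
# Values printed for rule [1]: {publicaddress=Penn AV N} => {Precinct=4}
n_transactions <- 136121
rule_count     <- 1430
confidence     <- 0.9986034
lift           <- 4.793895

# support = count / number of transactions
support <- rule_count / n_transactions
print(support)

# confidence = count(lhs and rhs) / count(lhs), so count(lhs) = count / confidence
lhs_count <- rule_count / confidence
print(round(lhs_count))

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
rhs_support <- confidence / lift
print(rhs_support)
```

Read this way, the rule says that nearly all records on Penn AV N fall in Precinct 4, which is a geographic fact rather than a crime pattern. Many of the high-confidence rules shown here pair a street or neighborhood with its precinct in the same way.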

# Save your notebook, then `File > Close and Halt`