TITLE

INTRODUCTION

To answer this question, we chose the "Communities and Crime Unnormalized" dataset from the UCI Machine Learning Repository, which combines socio-economic, law enforcement, and crime statistics from the 1990 US Census, the 1990 US LEMAS survey, and the 1995 FBI UCR (UCI). The data are unnormalized, and each observation is a community, such as a city, township, or borough.

PRELIMINARY EXPLORATORY DATA ANALYSIS

In [1]:
library(tidyverse)
library(dbplyr)
library(repr)
library(tidymodels)
library(stringr)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: ‘dbplyr’

The following objects are masked from ‘package:dplyr’:

    ident, sql

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune

In [2]:
# Vector for column names
colnames <- c("communityname",
                "state",
                "countyCode",
                "communityCode",
                "fold",
                "population",
                "householdsize",
                "racepctblack",
                "racePctWhite",
                "racePctAsian",
                "racePctHisp",
                "agePct12t21",
                "agePct12t29",
                "agePct16t24",
                "agePct65up",
                "numbUrban",
                "pctUrban",
                "medIncome",
                "pctWWage",
                "pctWFarmSelf",
                "pctWInvInc",
                "pctWSocSec",
                "pctWPubAsst",
                "pctWRetire",
                "medFamInc",
                "perCapInc",
                "whitePerCap",
                "blackPerCap",
                "indianPerCap",
                "AsianPerCap",
                "OtherPerCap",
                "HispPerCap",
                "NumUnderPov",
                "PctPopUnderPov",
                "PctLess9thGrade",
                "PctNotHSGrad",
                "PctBSorMore",
                "PctUnemployed",
                "PctEmploy",
                "PctEmplManu",
                "PctEmplProfServ",
                "PctOccupManu",
                "PctOccupMgmtProf",
                "MalePctDivorce",
                "MalePctNevMarr",
                "FemalePctDiv",
                "TotalPctDiv",
                "PersPerFam",
                "PctFam2Par",
                "PctKids2Par",
                "PctYoungKids2Par",
                "PctTeen2Par",
                "PctWorkMomYoungKids",
                "PctWorkMom",
                "NumKidsBornNeverMar",
                "PctKidsBornNeverMar",
                "NumImmig",
                "PctImmigRecent",
                "PctImmigRec5",
                "PctImmigRec8",
                "PctImmigRec10",
                "PctRecentImmig",
                "PctRecImmig5",
                "PctRecImmig8",
                "PctRecImmig10",
                "PctSpeakEnglOnly",
                "PctNotSpeakEnglWell",
                "PctLargHouseFam",
                "PctLargHouseOccup",
                "PersPerOccupHous",
                "PersPerOwnOccHous",
                "PersPerRentOccHous",
                "PctPersOwnOccup",
                "PctPersDenseHous",
                "PctHousLess3BR",
                "MedNumBR",
                "HousVacant",
                "PctHousOccup",
                "PctHousOwnOcc",
                "PctVacantBoarded",
                "PctVacMore6Mos",
                "MedYrHousBuilt",
                "PctHousNoPhone",
                "PctWOFullPlumb",
                "OwnOccLowQuart",
                "OwnOccMedVal",
                "OwnOccHiQuart",
                "OwnOccQrange",
                "RentLowQ",
                "RentMedian",
                "RentHighQ",
                "RentQrange",
                "MedRent",
                "MedRentPctHousInc",
                "MedOwnCostPctInc",
                "MedOwnCostPctIncNoMtg",
                "NumInShelters",
                "NumStreet",
                "PctForeignBorn",
                "PctBornSameState",
                "PctSameHouse85",
                "PctSameCity85",
                "PctSameState85",
                "LemasSwornFT",
                "LemasSwFTPerPop",
                "LemasSwFTFieldOps",
                "LemasSwFTFieldPerPop",
                "LemasTotalReq",
                "LemasTotReqPerPop",
                "PolicReqPerOffic",
                "PolicPerPop",
                "RacialMatchCommPol",
                "PctPolicWhite",
                "PctPolicBlack",
                "PctPolicHisp",
                "PctPolicAsian",
                "PctPolicMinor",
                "OfficAssgnDrugUnits",
                "NumKindsDrugsSeiz",
                "PolicAveOTWorked",
                "LandArea",
                "PopDens",
                "PctUsePubTrans",
                "PolicCars",
                "PolicOperBudg",
                "LemasPctPolicOnPatr",
                "LemasGangUnitDeploy",
                "LemasPctOfficDrugUn",
                "PolicBudgPerPop",
                "murders",
                "murdPerPop",
                "rapes",
                "rapesPerPop",
                "robberies",
                "robbbPerPop",
                "assaults",
                "assaultPerPop",
                "burglaries",
                "burglPerPop",
                "larcenies",
                "larcPerPop",
                "autoTheft",
                "autoTheftPerPop",
                "arsons",
                "arsonsPerPop",
                "ViolentCrimesPerPop",
                "nonViolPerPop")

In [3]:
# Reads data in and specifies column names
crime <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00211/CommViolPredUnnormalizedData.txt",
                  col_names = colnames)
                                
crime

“One or more parsing issues, see `problems()` for details”
Rows: 2215 Columns: 147
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (42): communityname, state, countyCode, communityCode, LemasSwornFT, Le...
dbl (105): fold, population, householdsize, racepctblack, racePctWhite, race...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


communityname,state,countyCode,communityCode,fold,population,householdsize,racepctblack,racePctWhite,racePctAsian,⋯,burglaries,burglPerPop,larcenies,larcPerPop,autoTheft,autoTheftPerPop,arsons,arsonsPerPop,ViolentCrimesPerPop,nonViolPerPop
<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
BerkeleyHeightstownship,NJ,39,5320,1,11980,3.10,1.37,91.78,6.50,⋯,14,114.85,138,1132.08,16,131.26,2,16.41,41.02,1394.59
Marpletownship,PA,45,47616,1,23123,2.82,0.80,95.57,3.44,⋯,57,242.37,376,1598.78,26,110.55,1,4.25,127.56,1955.95
Tigardcity,OR,?,?,1,29344,2.43,0.74,94.33,3.43,⋯,274,758.14,1797,4972.19,136,376.3,22,60.87,218.59,6167.51
Gloversvillecity,NY,35,29443,1,16656,2.40,1.70,97.35,0.50,⋯,225,1301.78,716,4142.56,47,271.93,?,?,306.64,?
Bemidjicity,MN,7,5068,1,11245,2.76,0.53,89.16,1.17,⋯,91,728.93,1060,8490.87,91,728.93,5,40.05,?,9988.79
Springfieldcity,MO,?,?,1,140494,2.45,2.51,95.65,0.90,⋯,2094,1386.46,7690,5091.64,454,300.6,134,88.72,442.95,6867.42
Norwoodtown,MA,21,50250,1,28700,2.60,1.60,96.57,1.47,⋯,110,372.09,288,974.19,144,487.1,17,57.5,226.63,1890.88
Andersoncity,IN,?,?,1,59459,2.45,14.20,84.87,0.40,⋯,608,997.6,2250,3691.79,125,205.1,9,14.77,439.73,4909.26
Fargocity,ND,17,25700,1,74111,2.46,0.35,97.11,1.25,⋯,425,532.66,3149,3946.71,206,258.18,8,10.03,115.31,4747.58
Wacocity,TX,?,?,1,103590,2.62,23.14,67.60,0.92,⋯,2397,2221.81,6121,5673.63,1070,991.8,18,16.68,1544.24,8903.93
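Many columns above are read as character because the file records missing values as "?". One possible alternative, sketched here on a small inline excerpt rather than the full file, is to declare "?" as a missing-value marker through the `na` argument of `read_csv()`. The object names `sample_text` and `sample_crime` and the four-column layout are assumptions made for illustration only.

```r
library(readr)

# Hypothetical three-row excerpt with only four of the columns
sample_text <- I("Tigardcity,OR,218.59,6167.51
Gloversvillecity,NY,306.64,?
Bemidjicity,MN,?,9988.79
")

# Treat "?" (as well as the defaults "" and "NA") as missing
sample_crime <- read_csv(sample_text,
                         col_names = c("communityname", "state",
                                       "ViolentCrimesPerPop", "nonViolPerPop"),
                         na = c("", "NA", "?"),
                         show_col_types = FALSE)

sample_crime
```

With this setting the two rate columns should be parsed as doubles with `NA` in place of "?", so no later conversion would be needed. The rest of this notebook keeps the original read, in which those columns are character.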


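Because our analysis uses only the ratio of the violent crime rate to the non-violent crime rate, the columns for individual crimes can be dropped. We keep the community name, state, population, the four racial composition percentages, and the two aggregate crime rates.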

In [4]:
crime_selected <- crime |>
                    select(communityname, state, population, racepctblack:racePctHisp, ViolentCrimesPerPop, nonViolPerPop)
crime_selected

communityname,state,population,racepctblack,racePctWhite,racePctAsian,racePctHisp,ViolentCrimesPerPop,nonViolPerPop
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
BerkeleyHeightstownship,NJ,11980,1.37,91.78,6.50,1.88,41.02,1394.59
Marpletownship,PA,23123,0.80,95.57,3.44,0.85,127.56,1955.95
Tigardcity,OR,29344,0.74,94.33,3.43,2.35,218.59,6167.51
Gloversvillecity,NY,16656,1.70,97.35,0.50,0.70,306.64,?
Bemidjicity,MN,11245,0.53,89.16,1.17,0.52,?,9988.79
Springfieldcity,MO,140494,2.51,95.65,0.90,0.95,442.95,6867.42
Norwoodtown,MA,28700,1.60,96.57,1.47,1.10,226.63,1890.88
Andersoncity,IN,59459,14.20,84.87,0.40,0.63,439.73,4909.26
Fargocity,ND,74111,0.35,97.11,1.25,0.73,115.31,4747.58
Wacocity,TX,103590,23.14,67.60,0.92,16.35,1544.24,8903.93


The crime ratio is the violent crime rate divided by the non-violent crime rate. However, both rate columns were read as character because missing values are recorded as "?", so they must first be converted to the double type. During this conversion "?" becomes NA, which produces the coercion warnings below. After the conversion, the ratio can be calculated.

In [5]:
crime_w_ratio <- crime_selected |>
                mutate(ViolentCrimesPerPop = as.numeric(ViolentCrimesPerPop),
                       nonViolPerPop = as.numeric(nonViolPerPop)) |>
                mutate(crime_ratio = ViolentCrimesPerPop / nonViolPerPop)

crime_w_ratio

“NAs introduced by coercion”
“NAs introduced by coercion”


communityname,state,population,racepctblack,racePctWhite,racePctAsian,racePctHisp,ViolentCrimesPerPop,nonViolPerPop,crime_ratio
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
BerkeleyHeightstownship,NJ,11980,1.37,91.78,6.50,1.88,41.02,1394.59,0.02941366
Marpletownship,PA,23123,0.80,95.57,3.44,0.85,127.56,1955.95,0.06521639
Tigardcity,OR,29344,0.74,94.33,3.43,2.35,218.59,6167.51,0.03544218
Gloversvillecity,NY,16656,1.70,97.35,0.50,0.70,306.64,,
Bemidjicity,MN,11245,0.53,89.16,1.17,0.52,,9988.79,
Springfieldcity,MO,140494,2.51,95.65,0.90,0.95,442.95,6867.42,0.06450021
Norwoodtown,MA,28700,1.60,96.57,1.47,1.10,226.63,1890.88,0.11985425
Andersoncity,IN,59459,14.20,84.87,0.40,0.63,439.73,4909.26,0.08957154
Fargocity,ND,74111,0.35,97.11,1.25,0.73,115.31,4747.58,0.02428816
Wacocity,TX,103590,23.14,67.60,0.92,16.35,1544.24,8903.93,0.17343353
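The coercion warnings above come from `as.numeric()` encountering "?". As a sketch of one way to make that step explicit, the placeholder can be replaced with `NA` using `dplyr::na_if()` before converting. The tibble `rates` below is a hypothetical three-row excerpt of the table above, not the full dataset.

```r
library(dplyr)

# Hypothetical excerpt: three rows copied from the table above
rates <- tibble(
  communityname = c("Tigardcity", "Gloversvillecity", "Bemidjicity"),
  ViolentCrimesPerPop = c("218.59", "306.64", "?"),
  nonViolPerPop = c("6167.51", "?", "9988.79")
)

rates_w_ratio <- rates |>
  # Replace the "?" placeholder with NA, then convert to numeric
  mutate(across(c(ViolentCrimesPerPop, nonViolPerPop),
                ~ as.numeric(na_if(.x, "?")))) |>
  # A ratio involving NA is itself NA
  mutate(crime_ratio = ViolentCrimesPerPop / nonViolPerPop)

rates_w_ratio
```

For Tigardcity this gives 218.59 / 6167.51, which matches the ratio shown in the table above. The other two rows have a missing ratio because one of their rates is missing.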


To perform regression, we split the dataset into a training set (75% of observations) and a testing set (25%), stratified by the crime ratio.

In [6]:
crime_split <- crime_w_ratio |>
                initial_split(prop = 0.75, strata = crime_ratio)

crime_training <- training(crime_split)
crime_testing <- testing(crime_split)
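Because `initial_split()` samples at random, the split above will differ from run to run unless a seed is set first. The following sketch illustrates this on a made-up data frame. The data frame `toy` and the seed value 2022 are arbitrary assumptions, not part of the analysis.

```r
library(rsample)

# Hypothetical data standing in for crime_w_ratio
toy <- data.frame(id = 1:200,
                  crime_ratio = seq(0.01, 0.20, length.out = 200))

# Setting the same seed before each split reproduces the same partition
set.seed(2022)
split_a <- initial_split(toy, prop = 0.75, strata = crime_ratio)

set.seed(2022)
split_b <- initial_split(toy, prop = 0.75, strata = crime_ratio)

# TRUE when both splits selected the same training rows
identical(training(split_a)$id, training(split_b)$id)
```

Calling `set.seed()` once before the split in the cell above would make the state counts reported below repeatable.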

Summary statistics give a general overview of the training dataset and help in building the model. We first count the observations for each state. Because observations with missing values will be excluded, we also count how many there are, to confirm that enough complete observations remain to train the model.

In [7]:
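# Number of training observations per state
crime_count <- crime_training |>
                group_by(state) |>
                summarize(count = n())
crime_count

# Number of training observations with a missing crime ratio
# (assumed completion of the unfinished assignment)
crime_w_missing_values <- crime_training |>
                filter(is.na(crime_ratio)) |>
                nrow()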

state,count
<chr>,<int>
AL,28
AR,16
AZ,15
CA,204
CO,17
CT,53
DC,1
DE,1
FL,68
GA,34
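The missing-value check described above can be sketched as follows: count the `NA` values in each column, then drop the incomplete rows and see how many observations remain. The tibble `toy_training` is a hypothetical stand-in for `crime_training`, with made-up values.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for crime_training
toy_training <- tibble(
  state = c("NJ", "PA", "NY", "MN"),
  ViolentCrimesPerPop = c(41.02, 127.56, 306.64, NA),
  nonViolPerPop = c(1394.59, 1955.95, NA, 9988.79)
) |>
  mutate(crime_ratio = ViolentCrimesPerPop / nonViolPerPop)

# Number of missing values in each column
na_counts <- toy_training |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Keep only rows with no missing values
toy_complete <- toy_training |>
  drop_na()

na_counts
nrow(toy_complete)
```

Comparing `nrow(toy_complete)` with the original row count shows how much data would be lost by excluding incomplete observations.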


METHODS

EXPECTED OUTCOMES AND SIGNIFICANCE

REFERENCES

UCI Machine Learning Repository. Communities and Crime Unnormalized. https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized

U.S. Department of Commerce, Bureau of the Census. (1992). Census of Population and Housing 1990 United States: Summary Tape File 1a & 3a (Computer Files). U.S. Department of Commerce, Bureau of the Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research, Ann Arbor, Michigan.

U.S. Department of Justice, Bureau of Justice Statistics. (1992). Law Enforcement Management and Administrative Statistics (Computer File). U.S. Department of Commerce, Bureau of the Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research, Ann Arbor, Michigan.

U.S. Department of Justice, Federal Bureau of Investigation. (1995). Crime in the United States (Computer File).