## CIS 9: Final Project
## Data Analysis of San Jose Police Department Incident Reports (2018-2022) 
## Written by: Tiffany Overbo and Cherry Withers

### Project Summary

This project analyzes incident reports from police calls for service documented by the San Jose Police Department. The goal is to analyze trends in incident type, address area, time of day, frequency, and whether there was a resolution, over 5 years from 2018 - 2023 (3/18/22)*. The data covers pre-pandemic, pandemic, and post-pandemic years, which lets us look for trends and answer the following questions:
* What are the 10 top incidents being reported each year by frequency? 
* How did the pandemic affect the incident counts of the following crimes/categories: Assault, Burglary, Disturbing Peace, Drugs/Alcohol, DUI, Fraud, Motor Vehicle Theft, Robbery, Sex Crime, Larceny, Vandalism, Vehicle Break-in/Theft, Others? 
* What are the arrest rates for each category mentioned above (per year/per month)?  
* Where are most of these incidents occurring (street names)? (NOT FEASIBLE)
* What part of the day are the incidents taking place, i.e., am, pm, after midnight?
* Which months have higher incidents of crimes being reported year after year? 


### Data Information
We will be working with 6 .csv files, each containing the total incidents for one year from 2018-2023 (Jan-Mar 18). The dataset's shape, header data, sample, and description are listed below. [Source](https://data.sanjoseca.gov/dataset/police-calls-for-service)



__Import Modules and Load Dataset__

In [None]:
# Import modules here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("FinalProject/PoliceCalls_2018_2022.csv")

### Data Analysis

__Predictive Classification Model for Priority Codes__

In [None]:
# Import the Classifiers

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn import metrics

In [20]:
df['priority'].value_counts()

3    462886
2    456131
4    192846
6    184008
5    121178
1     43527
Name: priority, dtype: int64
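
The counts above show that the priority codes are imbalanced, which matters when judging any accuracy score. As a rough reference point, the sketch below computes each class's share and the accuracy of a trivial classifier that always predicts the most frequent priority. The counts are copied by hand from the output above; the variable names are ours.

```python
# Class counts copied from the value_counts() output above.
priority_counts = {3: 462886, 2: 456131, 4: 192846, 6: 184008, 5: 121178, 1: 43527}

total = sum(priority_counts.values())

# Share of each priority code in the full dataset, listed in priority order.
shares = {p: n / total for p, n in sorted(priority_counts.items())}
for p, share in shares.items():
    print(f"priority {p}: {share:.1%}")

# A classifier that always predicts the most frequent class scores this accuracy.
majority_class = max(priority_counts, key=priority_counts.get)
baseline_accuracy = priority_counts[majority_class] / total
print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should score clearly above this baseline.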

In [None]:
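# Prepare the data
# Copy the feature column so the encoding below does not modify a view of df
X = df[['type']].copy()
y = df['priority']

# Encode each call type as an integer code, keeping the categorical for lookups
type_cat = X['type'].astype('category')
X['type'] = type_cat.cat.codes

print(X['type'])

# look up the original categories from the integer codes
#categories = type_cat.cat.categories
#for code in X['type'].unique():
#    print(f"Category code {code} corresponds to category '{categories[code]}'")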
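
Once the `type` column is replaced by integer codes, the original labels can only be recovered if the categorical is kept around. The following is a minimal sketch of that lookup on a toy frame; the call-type strings and the `toy`, `type_cat`, and `code_to_label` names are hypothetical and only stand in for the real `df['type']` values.

```python
import pandas as pd

# Hypothetical call types for illustration; the real values come from df['type'].
toy = pd.DataFrame({"type": ["DISTURBANCE", "PARKING", "DISTURBANCE", "ALARM"]})

# Keep the categorical Series so the code-to-label mapping is not lost.
type_cat = toy["type"].astype("category")
toy["type_code"] = type_cat.cat.codes

# categories[i] is the label that was encoded as integer i.
code_to_label = dict(enumerate(type_cat.cat.categories))
print(toy)
print(code_to_label)
```

pandas assigns codes in sorted category order, so the mapping is stable for a given set of labels.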

In [None]:
# Create the training and testing X and y datasets.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
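
Because the priority classes are imbalanced, one option worth considering is a stratified, seeded split, so that train and test sets keep the same class proportions and the run is repeatable. This is a sketch on toy data, not the split used in this notebook; `X_toy`, `y_toy`, and the seed values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-ins for X and y: 1,000 rows with an imbalanced label (90% / 10%).
X_toy = rng.integers(0, 50, size=(1000, 1))
y_toy = np.array([0] * 900 + [1] * 100)

# stratify keeps the class ratio in both parts; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
print("minority share in train:", y_tr.mean(), "in test:", y_te.mean())
```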

In [16]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)  
y_pred = model.predict(X_test)
score_accuracy = metrics.accuracy_score(y_test, y_pred)
score_f1 = f1_score(y_test, y_pred, average='weighted')
print(model,"\naccuracy_score: ", round(score_accuracy, 3))
print("f1_score: ", round(score_f1,3))

DecisionTreeClassifier() 
accuracy_score:  0.634
f1_score:  0.612


In [17]:
metrics.confusion_matrix(y_test, y_pred)

array([[ 5020,  1598,  3610,    21,    42,   466],
       [ 3543, 97190,  7262,   364,   917,  5029],
       [  721, 29454, 56050,  3222,  2126, 24079],
       [   30, 18561,  2430, 22283,  2592,  2248],
       [  471,  3454,  3163,  7945,  5375,  9902],
       [   32,   148,     0,   224,    69, 45503]], dtype=int64)

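The confusion matrix above (rows are true priorities 1-6, columns are predicted priorities) illustrates our model's shortcomings. For priorities 3 and 4, it predicted correctly only about half of the time (roughly 48% and 46% of cases). It performed worst on priority 5, where only about 18% of cases were classified correctly and most were mislabeled as priority 4 or 6.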
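
The per-class behavior can be read off the matrix directly. The sketch below recomputes recall and precision for each priority from the numbers printed above, assuming scikit-learn's convention that rows are true labels and columns are predictions, both in sorted order (1-6). The matrix is copied by hand from the output; `cm`, `recall`, and `precision` are our own names.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true priority 1-6. Columns: predicted priority 1-6.
cm = np.array([
    [ 5020,  1598,  3610,    21,    42,   466],
    [ 3543, 97190,  7262,   364,   917,  5029],
    [  721, 29454, 56050,  3222,  2126, 24079],
    [   30, 18561,  2430, 22283,  2592,  2248],
    [  471,  3454,  3163,  7945,  5375,  9902],
    [   32,   148,     0,   224,    69, 45503],
])
priorities = [1, 2, 3, 4, 5, 6]

# Recall: correct predictions divided by all true cases of that priority (row sums).
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by all predictions of that priority (column sums).
precision = cm.diagonal() / cm.sum(axis=0)

for p, r, pr in zip(priorities, recall, precision):
    print(f"priority {p}: recall={r:.3f}  precision={pr:.3f}")

# Overall accuracy is the diagonal total over all test cases.
accuracy = cm.diagonal().sum() / cm.sum()
print(f"accuracy={accuracy:.3f}")
```

The same per-class figures are available from `metrics.classification_report(y_test, y_pred)` when the fitted model and test split are still in memory.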