<a href="https://colab.research.google.com/github/JuneWayne/DS3021-EDA/blob/main/assignment/Ethan_Cao_Crime_Data_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supplementary Homicide Report (Murder Accountability Project) Thoughts and Plans

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('SHR65_23.csv')
df.head()

Unnamed: 0,ID,CNTYFIPS,Ori,State,Agency,Agentype,Source,Solved,Year,Month,...,OffRace,OffEthnic,Weapon,Relationship,Circumstance,Subcircum,VicCount,OffCount,FileDate,MSA
0,197603001AK00101,"Anchorage, AK",AK00101,Alaska,Anchorage,Municipal police,FBI,Yes,1976,March,...,Black,Unknown or not reported,"Handgun - pistol, revolver, etc",Relationship not determined,Other arguments,,0,0,30180.0,"Anchorage, AK"
1,197604001AK00101,"Anchorage, AK",AK00101,Alaska,Anchorage,Municipal police,FBI,Yes,1976,April,...,White,Unknown or not reported,"Handgun - pistol, revolver, etc",Girlfriend,Other arguments,,0,0,30180.0,"Anchorage, AK"
2,197606001AK00101,"Anchorage, AK",AK00101,Alaska,Anchorage,Municipal police,FBI,Yes,1976,June,...,Black,Unknown or not reported,"Handgun - pistol, revolver, etc",Stranger,Other,,0,0,30180.0,"Anchorage, AK"
3,197606002AK00101,"Anchorage, AK",AK00101,Alaska,Anchorage,Municipal police,FBI,Yes,1976,June,...,White,Unknown or not reported,"Handgun - pistol, revolver, etc",Other - known to victim,Other arguments,,0,0,30180.0,"Anchorage, AK"
4,197607001AK00101,"Anchorage, AK",AK00101,Alaska,Anchorage,Municipal police,FBI,Yes,1976,July,...,American Indian or Alaskan Native,Unknown or not reported,Knife or cutting instrument,Brother,Other arguments,,0,0,30180.0,"Anchorage, AK"


In [5]:
df.columns

Index(['ID', 'CNTYFIPS', 'Ori', 'State', 'Agency', 'Agentype', 'Source',
       'Solved', 'Year', 'Month', 'Incident', 'ActionType', 'Homicide',
       'Situation', 'VicAge', 'VicSex', 'VicRace', 'VicEthnic', 'OffAge',
       'OffSex', 'OffRace', 'OffEthnic', 'Weapon', 'Relationship',
       'Circumstance', 'Subcircum', 'VicCount', 'OffCount', 'FileDate', 'MSA'],
      dtype='object')

# Data Cleaning Plan

## About My Data:

The dataset I am working with comes from Murderdata.org, the website of the Murder Accountability Project, which tracks America's unsolved homicides through the Freedom of Information Act. The dataset I downloaded from this organization is called the Supplementary Homicide Report. It documents about 39,000 homicide cases, with detailed information about the victim and offender in each case as well as whether the case is solved or unsolved.

The dataset contains a wide range of variables. Aside from basic variables such as time (year, month, day), location, reporting agency, and agency type, the more informative variables include type of homicide, situation of victim, victim age, sex, race, and ethnicity, as well as offender age, sex, race, and ethnicity. Further details for each case include the weapon used, the relationship between offender and victim, and the circumstance of the homicide (such as GTA, robbery, rape, arson, sniper attack, or children playing with a gun). There are also variables for the number of victims and the number of offenders.

Excluding agency identifiers, the type of agency that reported the crime, whether the case was reported by the FBI, and standardized reporting codes, all the variables mentioned above seem valuable and insightful to work with.

## What I plan to do about it

From an exploratory data analysis perspective, one could visualize the trends and relationships between individual variables, or between grouped variables, to answer questions such as: How does the offender-victim relationship correlate with the homicide rate? What types of cases (specifically, what types of victims, homicide types, or victim counts) end up solved or unsolved? Is there a correlation between the circumstance of a homicide and the number of victims?
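
One of these questions, which cases end up solved, can be sketched with a simple cross-tabulation. This is only a rough illustration on a toy frame: the `Relationship` values are taken from the preview above, while the `"No"` value for `Solved` and the row counts are assumptions made up for the example.

```python
import pandas as pd

# Toy stand-in for df; the rows are invented for illustration only.
toy = pd.DataFrame({
    "Relationship": ["Stranger", "Stranger", "Girlfriend",
                     "Brother", "Stranger", "Girlfriend"],
    "Solved": ["No", "Yes", "Yes", "Yes", "No", "Yes"],
})

# Share of solved / unsolved cases within each relationship category.
# normalize="index" makes every row sum to 1.
solved_share = pd.crosstab(toy["Relationship"], toy["Solved"], normalize="index")
print(solved_share)
```

On the real data, the same call with `df["Relationship"]` and `df["Solved"]` would give a table that can be passed straight to a bar plot.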

Beyond exploratory visualizations, I think this dataset becomes most meaningful when machine learning is applied to identify clusters or classes in the data or to predict homicide trends.

I have three directions for this:
<li>Applying K-means clustering to find unusual clusters of data points to uncover potential serial killers/serial killing cases.

<li>Applying Naive Bayes to predict the likelihood of a homicide occurring based on past homicide features.

<li>Applying linear regression to predict the future growth in homicide cases based on past homicide trends.

I feel the first approach is the most realistic and impactful way to approach this dataset. Because the dataset describes the features of each homicide case in detail, it is feasible to use a k-means clustering algorithm to uncover hidden trends or clusters in the data and to check whether a single serial killer could be linked to the victims in a cluster. (It would also be interesting if the data points in one of the clusters turned out to be all unsolved cases.)
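
A minimal sketch of this clustering idea, assuming scikit-learn is available and that one-hot encoding the categorical columns is an acceptable first approximation (k-means is designed for numeric data, so this is a rough choice rather than a settled method). The toy rows, the `VicSex` values, the `"No"` value for `Solved`, and the choice of two clusters are all assumptions made for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for df; two deliberately distinct groups of invented cases.
toy = pd.DataFrame({
    "Weapon": ["Handgun - pistol, revolver, etc"] * 3
              + ["Knife or cutting instrument"] * 3,
    "VicSex": ["Male"] * 3 + ["Female"] * 3,
    "Solved": ["Yes"] * 3 + ["No"] * 3,
})

# One-hot encode the categorical features so k-means can measure distances.
features = pd.get_dummies(toy[["Weapon", "VicSex"]]).astype(float)

# n_clusters=2 is a guess for the toy data, not a recommendation.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["cluster"] = model.fit_predict(features)

# Share of unsolved cases in each cluster; a cluster that is mostly
# unsolved would be the kind of group worth a closer look.
unsolved_share = (toy["Solved"] == "No").groupby(toy["cluster"]).mean()
print(unsolved_share)
```

`Solved` is kept out of the features on purpose, so the unsolved share describes the clusters instead of driving them.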

Naive Bayes could also be applied to this dataset, but it may require me to manually calculate the probabilities for each homicide incident from each feature before assembling them into a model to predict future homicide likelihood. In addition, the variables in this dataset describe victim and offender information only at a surface level; they do not capture socio-economic background, general economic conditions, occupation, and so on. It is therefore hard to predict future occurrences of homicide from surface-level descriptive variables such as which weapon was used or how the homicide happened. These variables all describe post-homicide information, that is, what was recorded after a murder had already happened.
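
To give a sense of what the manual probability calculation might involve, here is a small hand-rolled categorical Naive Bayes with Laplace smoothing. Because every row in the dataset is already a homicide, the sketch uses `Solved` as a stand-in target, which is my assumption and not the original prediction goal. The toy rows, the `"No"` label, and the helper names `likelihood` and `posterior` are likewise made up for the example.

```python
import pandas as pd

# Toy stand-in for df; the rows are invented for illustration only.
toy = pd.DataFrame({
    "Weapon": ["Handgun - pistol, revolver, etc"] * 2
              + ["Knife or cutting instrument"] * 2
              + ["Handgun - pistol, revolver, etc", "Knife or cutting instrument"],
    "Relationship": ["Stranger", "Stranger", "Brother",
                     "Girlfriend", "Girlfriend", "Stranger"],
    "Solved": ["No", "No", "Yes", "Yes", "Yes", "No"],
})
FEATURES = ["Weapon", "Relationship"]
TARGET = "Solved"   # stand-in target (assumption)
ALPHA = 1.0         # Laplace smoothing strength

# Prior P(class): share of each class in the data.
priors = toy[TARGET].value_counts(normalize=True)

def likelihood(feature, value, cls):
    """Smoothed P(feature == value | class)."""
    in_class = toy[toy[TARGET] == cls]
    n_levels = toy[feature].nunique()
    hits = (in_class[feature] == value).sum()
    return (hits + ALPHA) / (len(in_class) + ALPHA * n_levels)

def posterior(case):
    """Normalized P(class | case) under the naive independence assumption."""
    scores = {}
    for cls in priors.index:
        score = priors[cls]
        for feature in FEATURES:
            score *= likelihood(feature, case[feature], cls)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

result = posterior({"Weapon": "Knife or cutting instrument",
                    "Relationship": "Brother"})
print(result)
```

In practice the counting need not be done by hand: scikit-learn's `CategoricalNB` fits this kind of model on integer-encoded categories.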

This also brings me to linear regression. The dataset doesn't seem to provide pre-homicide details about either the victim or the offender, apart from their ethnicity, race, age, and sex. It would be unethical to predict homicide trends from these variables alone, since doing so would amount to racial, ethnic, age, or gender discrimination or prejudice. Nevertheless, I hope to explore how other variables could be used to make this effort work.
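
One version of the trend idea that avoids demographic variables entirely is to regress yearly case counts on the year alone. The sketch below does this on invented counts with `np.polyfit`; the numbers and the 1979 forecast are purely illustrative and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df["Year"]: 2, 3, and 4 invented cases in three years.
toy = pd.DataFrame({"Year": [1976] * 2 + [1977] * 3 + [1978] * 4})

# Number of cases per year.
yearly = toy.groupby("Year").size()

# Center the years on the first one to keep the fit well conditioned.
first_year = yearly.index.min()
x = (yearly.index - first_year).to_numpy(dtype=float)
y = yearly.to_numpy(dtype=float)

# Degree-1 polynomial fit = ordinary least-squares straight line.
slope, intercept = np.polyfit(x, y, deg=1)

# Extrapolate one year past the toy data.
forecast_1979 = slope * (1979 - first_year) + intercept
print(slope, intercept, forecast_1979)
```

A straight line is a strong assumption for a series spanning decades, so a plot of the yearly counts should come before any forecast.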

## Biggest challenges expected

Data cleaning is the biggest challenge in this project. Below is a list of issues that I think make the dataset hard to work with:
<li> unclear column headers: thankfully the organization has provided documentation detailing what each column header represents, but for the sake of clarity while working with the dataset, I'll have to manually rename each header to a more descriptive name

<li> unclear datetime structure: the date is documented in an unconventional way, as a number in mm/dd/yy order without separators; I was only able to deduce this by reading the documentation. Moreover, the value is stored as a float, which drops a leading '0' (for example, 30180.0 in the preview above stands for 030180) and makes the data hard to read. I'll have to convert it to a proper datetime type or find a way to restore the dropped leading zeros.

<li> numerical digits are used to represent observations in certain variables: since the data is obtained from official department records, there are designated codes to represent homicide type and circumstances. Either I'll have to become very familiar with these codes, or I'll have to rename each coded value to make sense of the data.

<li> large volume of null values: there is a large quantity of null values in this dataset. I'll have to handle these missing values before applying K-means and Naive Bayes to the dataset.
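
The header and datetime items in the list above could be handled roughly as follows. This is a sketch under assumptions: it treats `FileDate` as the float-encoded date column, the new column names are hypothetical, and the helper `pad_mmddyy` is made up for the example. Only the value 30180.0 comes from the preview; the other rows are invented.

```python
import pandas as pd

# Toy stand-in for df with one missing date.
toy = pd.DataFrame({
    "Agentype": ["Municipal police"] * 3,
    "FileDate": [30180.0, 101385.0, float("nan")],
})

# Hypothetical readable names; the real mapping should follow the documentation.
toy = toy.rename(columns={"Agentype": "agency_type",
                          "FileDate": "file_date_raw"})

def pad_mmddyy(value):
    """Restore the dropped leading zero, e.g. 30180.0 -> '030180'."""
    if pd.isna(value):
        return None
    return f"{int(value):06d}"

padded = toy["file_date_raw"].map(pad_mmddyy)

# Parse mmddyy text into real datetimes; unparseable values become NaT.
toy["file_date"] = pd.to_datetime(padded, format="%m%d%y", errors="coerce")
print(toy)
```

Two-digit years are ambiguous: `%y` normally maps 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068, so any parsed date that lands in the future should be checked and shifted back a century.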

By cleaning up the data, I hope to make better sense of the dataset and understand which variables I can use in machine learning processes.
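
As a first step toward deciding which variables are usable, the missing values can be audited column by column. The sketch below also treats the placeholder string "Unknown or not reported" (visible in the preview above) as missing. The toy rows, the 50% threshold, and the decision to drop sparse columns are assumptions for illustration, not a fixed plan.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; "toy value" marks invented entries.
toy = pd.DataFrame({
    "OffEthnic": ["Unknown or not reported"] * 3 + ["toy value"],
    "Subcircum": [np.nan, np.nan, np.nan, "toy value"],
    "Weapon": ["Knife or cutting instrument", None,
               "Knife or cutting instrument", "Knife or cutting instrument"],
    "VicCount": [0, 0, 1, 0],
})

# Strings that really mean "missing"; extend after checking the documentation.
PLACEHOLDERS = ["Unknown or not reported"]
cleaned = toy.replace(to_replace=PLACEHOLDERS, value=np.nan)

# Share of missing values per column, largest first.
missing_share = cleaned.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical rule: keep columns that are at most half empty,
# then drop the remaining incomplete rows before modeling.
MAX_MISSING = 0.5
keep_cols = missing_share[missing_share <= MAX_MISSING].index
model_ready = cleaned[keep_cols].dropna()
print(model_ready)
```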



