#### Task 1.2: Data Preparation

Improve the quality of your data and prepare it by extracting new features that are useful for describing the incidents. Some examples of indicators to compute are:

- How many males are involved in incidents relative to the total number of males for the same city and in the same period?
- How many injured and killed people have been involved relative to the total injured and killed people in the same congressional district in a given period of time?
- Ratio of the number of killed people in the incidents relative to the number of participants in the incident
- Ratio of unharmed people in the incidents relative to the average of unharmed people in the same period

Note that these examples are not mandatory, and teams can define their own indicators. Each indicator must be accompanied by a description and, when necessary, its mathematical formulation. The extracted variables will be useful for the clustering analysis in the second task of the project. Once the set of indicators is computed, the team should explore the new features through a statistical analysis, including distributions, outliers, visualizations, and correlations.
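As a minimal sketch of how such indicators could be computed with pandas, the snippet below derives three of the example ratios on a toy frame. It is an illustration under stated assumptions, not the method used later in this notebook: the function name `add_indicators`, the new column names (`killed_ratio`, `males_ratio_city_year`, `unharmed_ratio_year`), the toy values, and the choice of the calendar year as the "period" are all hypothetical. Only the input column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows that reuse the dataset's column names; the values are made up.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2013-01-07", "2013-04-07", "2013-05-02", "2014-05-14"]
        ),
        "state": ["Maryland"] * 4,
        "city_or_county": ["Baltimore"] * 4,
        "n_males": [2.0, 4.0, 2.0, 2.0],
        "n_killed": [2, 1, 1, 0],
        "n_unharmed": [0.0, 1.0, 1.0, 0.0],
        "n_participants": [4.0, 4.0, 2.0, 0.0],
    }
).set_index("date")


def add_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with three illustrative ratio indicators added."""
    out = df.copy()
    # Assumption: the "period" is the calendar year of the incident.
    out["year"] = out.index.year

    # Killed people relative to participants in the same incident.
    # A zero denominator becomes NaN instead of producing inf.
    out["killed_ratio"] = out["n_killed"] / out["n_participants"].replace(0, np.nan)

    # Males in the incident relative to all males involved in incidents
    # in the same city and year.
    males_total = out.groupby(["state", "city_or_county", "year"])["n_males"].transform("sum")
    out["males_ratio_city_year"] = out["n_males"] / males_total.replace(0, np.nan)

    # Unharmed people in the incident relative to the yearly average.
    unharmed_mean = out.groupby("year")["n_unharmed"].transform("mean")
    out["unharmed_ratio_year"] = out["n_unharmed"] / unharmed_mean.replace(0, np.nan)
    return out


result = add_indicators(toy)
print(result[["killed_ratio", "males_ratio_city_year", "unharmed_ratio_year"]])
```

Using `groupby(...).transform(...)` keeps one value per incident, so each ratio can be stored as a new column next to the original features. Replacing zero denominators with `NaN` is one possible convention; a team could equally decide to drop or impute those rows.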

For Task 1.1, see the corresponding notebook: [Task 1.1 - Data Understanding](Task1_Data_Understanding.ipynb).

For this task we followed this checklist:
1. Data aggregation
2. Reduction of dimensionality
3. Data cleaning
4. Discretization
5. Data transformation
6. Principal Component Analysis via Covariance Matrix
7. Data Similarity via Entropy and proximity coefficients


In [1]:
# This will take a while
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

import plotly.offline as py


# Set a seed for reproducibility
np.random.seed(42)

# Load dataset from data understanding
df_incident_du = pd.read_csv('../source/ds/cleaned_incidents_taskDU.csv', index_col=0)
df_incident = df_incident_du.copy()



In [3]:
# Sanity check: confirm the dataset loaded correctly
df_incident.head()


date,state,city_or_county,latitude,longitude,congressional_district,participant_age1,participant_age_group1,participant_gender1,min_age_participants,max_age_participants,n_participants_child,n_participants_teen,n_males,n_females,n_killed,n_injured,n_arrested,n_unharmed,n_participants,incident_characteristics1
2013-01-07,North Carolina,Greensboro,36.114,-79.9569,6,18.0,Adult 18+,Female,14,47,0,1,2.0,2.0,2,2,0.0,0.0,4.0,Shot - Wounded/Injured
2013-04-07,California,Long Beach,33.8479,-118.19,47,23.0,Adult 18+,Male,23,23,0,0,4.0,0.0,1,3,0.0,0.0,4.0,Shot - Wounded/Injured
2013-05-02,Maryland,Baltimore,39.3167,-76.6085,7,22.0,Adult 18+,Male,22,25,0,0,2.0,0.0,1,0,1.0,0.0,2.0,"Shot - Dead (murder, accidental, suicide)"
2013-05-14,New Jersey,Delanco,40.0521,-74.9578,3,22.0,Adult 18+,Male,22,22,0,0,2.0,0.0,0,2,0.0,0.0,2.0,Shot - Wounded/Injured
2013-05-18,New York,Jamaica,40.673,-73.7881,5,14.0,Teen 12-17,Female,14,21,0,2,2.0,1.0,1,0,2.0,0.0,3.0,"Shot - Dead (murder, accidental, suicide)"



# Idea: Count the number of incidents every 3 months
## This may belong in data preparation
Since 2018 is the year with the fewest records, we can check which periods contain the most incidents by counting them in 3-month windows: