# Data Used

This project relies on a few public-domain datasets. This section describes each of them in more detail.

In [1]:
import pandas as pd
import re

### Gun Violence

DESCRIPTION

Source: [Gun Violence Data - James Ko](https://www.kaggle.com/jameslko/gun-violence-data)

In [2]:
# Load gun violence dataset
gun_violence = pd.read_csv('../databases/gun_violence.zip', compression='zip')

### State Population Totals and Components of Change: 2010-2018

DESCRIPTION

Source: [United States Census Bureau](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html)

In [3]:
# Load population dataset
population = pd.read_csv('../databases/nst-est2018-alldata.zip', compression='zip')

# Data Cleaning

The selected datasets contain some data we are not interested in, which we can remove to improve performance and make the work easier.
In addition, some of them have missing information or are not formatted very conveniently.

### Gun Violence

In [4]:
# Drop unnecessary columns
gun_violence = gun_violence[[
    'incident_id',
    'date',
    'state',
    'n_killed',
    'n_injured',
    'gun_stolen',
    'gun_type',
    'n_guns_involved',
    'participant_age',
    'participant_age_group',
    'participant_gender',
    'participant_type'
]]

In [5]:
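# Drop every row that has at least one missing value, then renumber the index.
# Reassigning (instead of inplace=True) avoids modifying a slice of the original frame.
gun_violence = gun_violence.dropna().reset_index(drop=True)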
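
Since `dropna` discards a row when any of its columns is empty, it can help to count the missing values per column before dropping. The sketch below uses a small made-up frame (`toy`, with invented values) rather than the real dataset, just to show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the values are invented for illustration only
toy = pd.DataFrame({
    'state': ['Ohio', 'Texas', 'Utah'],
    'gun_type': ['0::Handgun', np.nan, '0::Unknown'],
    'n_guns_involved': [1.0, np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows kept by dropna(): only those with no missing value at all
rows_before = len(toy)
rows_after = len(toy.dropna())
```

Here a single sparse column (`n_guns_involved`) is enough to remove two of the three rows, which is worth keeping in mind when interpreting the totals computed later.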

In [6]:
# Split the date string (YYYY-MM-DD) into separate year, month and day columns
gun_violence['year'] = gun_violence['date'].map(lambda x: x[0:4])
gun_violence['month'] = gun_violence['date'].map(lambda x: x[5:7])
gun_violence['day'] = gun_violence['date'].map(lambda x: x[8:10])
gun_violence.drop(columns='date', inplace=True)
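
As an alternative to slicing the string, the date could be parsed once with `pd.to_datetime` and split through the `.dt` accessor. This is a minimal sketch on a made-up frame (`toy`), assuming dates in `YYYY-MM-DD` format; unlike the slicing above, it yields integers rather than zero-padded strings.

```python
import pandas as pd

# Hypothetical sample in the same YYYY-MM-DD format as the 'date' column
toy = pd.DataFrame({'date': ['2013-01-01', '2013-01-07', '2014-12-25']})

# Parse once, then extract each component as an integer
parsed = pd.to_datetime(toy['date'], format='%Y-%m-%d')
toy['year'] = parsed.dt.year
toy['month'] = parsed.dt.month
toy['day'] = parsed.dt.day
toy = toy.drop(columns='date')
```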

In [7]:
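# Turn the encoded string fields (gun_stolen, gun_type, participant_*) into lists.
# The letters-only pattern keeps alphabetic runs, so hyphenated or multi-word
# values are split (e.g. 'Subject-Suspect' becomes ['Subject', 'Suspect']).
# For participant_age, findall returns index/age pairs; [1::2] keeps the ages.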
gun_violence['gun_stolen'] = gun_violence['gun_stolen'].map(lambda x: re.findall(r'[a-zA-Z]+', x))
gun_violence['gun_type'] = gun_violence['gun_type'].map(lambda x: re.findall(r'[a-zA-Z]+', x))
gun_violence['participant_age'] = gun_violence['participant_age'].map(lambda x: re.findall(r'[0-9]+', x)[1::2])
gun_violence['participant_age_group'] = gun_violence['participant_age_group'].map(lambda x: re.findall(r'[a-zA-Z]+', x))
gun_violence['participant_gender'] = gun_violence['participant_gender'].map(lambda x: re.findall(r'[a-zA-Z]+', x))
gun_violence['participant_type'] = gun_violence['participant_type'].map(lambda x: re.findall(r'[a-zA-Z]+', x))

gun_violence.head()

Unnamed: 0,incident_id,state,n_killed,n_injured,gun_stolen,gun_type,n_guns_involved,participant_age,participant_age_group,participant_gender,participant_type,year,month,day
0,478855,Ohio,1,3,"[Unknown, Unknown]","[Unknown, Unknown]",2.0,"[25, 31, 33, 34, 33]","[Adult, Adult, Adult, Adult, Adult]","[Male, Male, Male, Male, Male]","[Subject, Suspect, Subject, Suspect, Victim, V...",2013,1,1
1,478959,North Carolina,2,2,"[Unknown, Unknown]","[Handgun, Handgun]",2.0,"[18, 46, 14, 47]","[Adult, Adult, Teen, Adult]","[Female, Male, Male, Female]","[Victim, Victim, Victim, Subject, Suspect]",2013,1,7
2,479363,New Mexico,5,0,"[Unknown, Unknown]","[LR, Rem, AR]",2.0,"[51, 40, 9, 5, 2, 15]","[Adult, Adult, Child, Child, Child, Teen]","[Male, Female, Male, Female, Female, Male]","[Victim, Victim, Victim, Victim, Victim, Subje...",2013,1,19
3,491674,Tennessee,1,3,[Unknown],[Unknown],1.0,[19],[Adult],"[Male, Male, Male, Male]","[Victim, Victim, Victim, Victim, Subject, Susp...",2013,1,23
4,479413,Missouri,1,3,[Unknown],[Unknown],1.0,[28],[Adult],[Male],"[Victim, Victim, Victim, Victim, Subject, Susp...",2013,1,25
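
The output above shows that the letters-only pattern breaks values such as `Subject-Suspect` into two entries. A possible alternative is to split on the field delimiters instead. The sketch below assumes the raw fields look like `index::value||index::value`, which is an assumption about the source file; `parse_field` is a hypothetical helper, not part of the notebook.

```python
def parse_field(raw):
    """Map each participant/gun index to its value.

    Assumes entries are separated by '||' and that each entry is
    'index::value' (an assumption about the raw format).
    """
    parsed = {}
    for item in raw.split('||'):
        # partition splits on the first '::' only, so the value keeps
        # any hyphens, spaces or brackets it contains
        index, _, value = item.partition('::')
        parsed[int(index)] = value
    return parsed


types = parse_field('0::Subject-Suspect||1::Victim')
guns = parse_field('0::22 LR||1::223 Rem [AR-15]')
```

Keeping the index also makes it possible to match the age, gender and type of the same participant even when one of the fields is missing for some of them.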


### Population

In [8]:
# Drop unnecessary columns
population = population[[
    'NAME',
    'POPESTIMATE2014',
    'POPESTIMATE2015',
    'POPESTIMATE2016',
    'POPESTIMATE2017'
]]

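# Rename columns
population.columns = ['state', '2014', '2015', '2016', '2017']

# Keep rows 5-55, the state-level rows starting at Alabama (earlier rows are aggregate totals)
population = population[5:56].reset_index(drop=True)
# Average the four yearly estimates for each state
population['mean_population'] = population.drop(['state'], axis=1).mean(axis=1)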

population.head()

Unnamed: 0,state,2014,2015,2016,2017,mean_population
0,Alabama,4842481,4853160,4864745,4875120,4858876.5
1,Alaska,736307,737547,741504,739786,738786.0
2,Arizona,6733840,6833596,6945452,7048876,6890441.0
3,Arkansas,2967726,2978407,2990410,3002997,2984885.0
4,California,38625139,38953142,39209127,39399349,39046689.25
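
Selecting rows by position depends on the exact layout of the file. A name-based filter is one possible alternative. The sketch below uses a made-up frame (`toy`) with invented numbers, and the set of excluded names is an assumption about how the aggregate rows are labeled in the `NAME` column.

```python
import pandas as pd

# Hypothetical sample; population figures are invented round numbers
toy = pd.DataFrame({
    'state': ['United States', 'South Region', 'Alabama', 'Alaska', 'Puerto Rico'],
    '2014': [1000, 400, 100, 10, 50],
    '2015': [1100, 420, 200, 30, 40],
})

# Assumed labels of the rows that are not states or the District of Columbia
excluded = {
    'United States', 'Northeast Region', 'Midwest Region',
    'South Region', 'West Region', 'Puerto Rico',
}

# Keep only the rows whose name is not in the excluded set
states_only = toy[~toy['state'].isin(excluded)].reset_index(drop=True)

# Mean of the yearly columns, row by row
states_only['mean_population'] = states_only[['2014', '2015']].mean(axis=1)
```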


### Participants

We create a new data frame with one row per incident participant, which is easier to work with than the list-valued columns.

In [9]:
# Build one row per participant. zip stops at the shortest list,
# so extra entries in the longer lists of an incident are ignored.
participants = []
for index, row in gun_violence.iterrows():
    for age, group, gender, p_type in zip(row['participant_age'], row['participant_age_group'], row['participant_gender'], row['participant_type']):
        participants.append([row['incident_id'], age, group, gender, p_type])

participants = pd.DataFrame(participants, columns=['incident_id', 'participant_age', 'participant_age_group', 'participant_gender', 'participant_type'])

participants.head()

Unnamed: 0,incident_id,participant_age,participant_age_group,participant_gender,participant_type
0,478855,25,Adult,Male,Subject
1,478855,31,Adult,Male,Suspect
2,478855,33,Adult,Male,Subject
3,478855,34,Adult,Male,Suspect
4,478855,33,Adult,Male,Victim
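
Because the per-incident lists can have different lengths, it is worth seeing how `zip` behaves in that case. The sketch below uses invented lists and contrasts `zip` with `itertools.zip_longest`, which pads the shorter lists instead of truncating.

```python
from itertools import zip_longest

# Hypothetical lists of different lengths for a single incident
ages = ['19']
genders = ['Male', 'Male', 'Female']
types = ['Victim', 'Victim', 'Victim', 'Subject']

# zip stops as soon as the shortest list is exhausted
truncated = list(zip(ages, genders, types))

# zip_longest keeps every position and fills the gaps with None
padded = list(zip_longest(ages, genders, types, fillvalue=None))
```

With `zip`, only the first participant survives; with `zip_longest`, all four positions are kept and the missing attributes show up as `None`. Neither option guarantees that the values at a given position belong to the same person, so the choice depends on how the lists were built.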


### Statistics by State

We create a new data frame with incident statistics aggregated by state.

In [10]:
gun_violence = gun_violence[['incident_id', 'state', 'n_killed', 'n_injured', 'year', 'month', 'day']]

incidents_state = gun_violence.groupby('state')\
                  .agg({'n_injured':'sum', 'incident_id':'count', 'n_killed':'sum'})\
                  .rename(columns={'incident_id':'number_of_incidents'})\
                  .sort_values('state')\
                  .reset_index()

# Attach the mean population by row position (assumes both frames list the same states in the same order)
states = pd.concat([incidents_state, population['mean_population']], axis=1, sort=False, join='inner')
# Rates per 100,000 inhabitants
states['injured_per_capita'] = states['n_injured']/states['mean_population'] * 100000
states['incidents_per_capita'] = states['number_of_incidents']/states['mean_population'] * 100000
states['killed_per_capita'] = states['n_killed']/states['mean_population'] * 100000

states.head()

Unnamed: 0,state,n_injured,number_of_incidents,n_killed,mean_population,injured_per_capita,incidents_per_capita,killed_per_capita
0,Alabama,718,1618,925,4858876.5,14.777079,33.299879,19.037323
1,Alaska,112,598,163,738786.0,15.160006,80.943602,22.063223
2,Arizona,357,869,628,6890441.0,5.181091,12.611675,9.114076
3,Arkansas,570,1118,417,2984885.0,19.096213,37.455379,13.970387
4,California,2134,5899,2548,39046689.25,5.465252,15.107555,6.525521
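
The positional `pd.concat` above only gives correct rates if both frames contain exactly the same states in the same order. A key-based `merge` on `state` does not depend on that. The sketch below uses two made-up frames (`incidents_toy`, `population_toy`) with invented numbers to show the difference when one state is absent from the incident data.

```python
import pandas as pd

# Hypothetical frames; Alaska is missing from the incident data
incidents_toy = pd.DataFrame({
    'state': ['Alabama', 'Arizona'],
    'n_killed': [10, 20],
})
population_toy = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona'],
    'mean_population': [5_000_000.0, 700_000.0, 7_000_000.0],
})

# Positional join: row 1 (Arizona) is paired with Alaska's population
by_position = pd.concat(
    [incidents_toy, population_toy['mean_population']], axis=1, join='inner'
)

# Key-based join: each state is paired with its own population
by_key = incidents_toy.merge(population_toy, on='state', how='inner')
by_key['killed_per_100k'] = by_key['n_killed'] / by_key['mean_population'] * 100000
```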


### Save Clean Data

In [11]:
# Save clean datasets
gun_violence.to_csv('../databases/gun_violence_clean.csv', index=False)
population.to_csv('../databases/population.csv', index=False)
participants.to_csv('../databases/participants.csv', index=False)
states.to_csv('../databases/states.csv', index=False)