# Project: Investigate Gun Data in US

## Introduction

Following the implementation of the Brady Act in 1994, the Federal Bureau of Investigation (FBI) developed a system to conduct background checks on individuals wanting to obtain a firearm. The system, known as the National Instant Criminal Background Check System (NICS), was created in collaboration with the Bureau of Alcohol, Tobacco and Firearms and local law enforcement agencies. Since the system's inception in November 1998, the FBI has released monthly data for each state and U.S. territory. The FBI claims that over 300 million requests have been approved, and 1.5 million have been denied.

### Questions to Investigate:
* How has `permit` changed across the years?
* In the gun shooting data, to what extent do the numbers of people killed vs. injured change?
* What is the overall trend of gun purchases?
* Which states have the highest growth in gun registrations?

In [1]:
# Imports
import pandas as pd
import numpy as np
import helpers as hp
import sweetviz as sv

## Data Wrangling
* I searched the internet for related data that could help dig deeper into the provided data and support a more insightful analysis. The dataset I found is the `Gun violence database`, available here: https://www.kaggle.com/gunviolencearchive/gun-violence-database?select=mass_shootings_all.csv

In [2]:
# Reading the dataset
df_gun = pd.read_csv("gun_data.csv")
df_violence = pd.read_csv("mass_shootings_all.csv")

In [3]:
# Explore the data types of df_gun
df_gun.dtypes

month                         object
state                         object
permit                       float64
permit_recheck               float64
handgun                      float64
long_gun                     float64
other                        float64
multiple                       int64
admin                        float64
prepawn_handgun              float64
prepawn_long_gun             float64
prepawn_other                float64
redemption_handgun           float64
redemption_long_gun          float64
redemption_other             float64
returned_handgun             float64
returned_long_gun            float64
returned_other               float64
rentals_handgun              float64
rentals_long_gun             float64
private_sale_handgun         float64
private_sale_long_gun        float64
private_sale_other           float64
return_to_seller_handgun     float64
return_to_seller_long_gun    float64
return_to_seller_other       float64
totals                         int64
dtype: object

In [4]:
# Explore the data types of df_violence
df_violence.dtypes

Incident Date      object
State              object
City Or County     object
Address            object
# Killed            int64
# Injured           int64
Operations        float64
dtype: object

## Data Cleaning
* Clean column names (only needed for `df_violence`)
* Data type conversion: convert `df_gun[['month', 'multiple', 'totals']]` and `df_violence['Incident Date']` to their correct dtypes
* Deal with missing values

In [5]:
# First: clean `df_violence`
df_violence.columns = df_violence.columns.to_series().apply(lambda x: x.strip('# ').lower().replace(' ', '_'))
df_violence.columns.tolist()

['incident_date',
 'state',
 'city_or_county',
 'address',
 'killed',
 'injured',
 'operations']

#### Data types conversion
As shown above, some columns need data type conversion before analysis, especially the `month` and `Incident Date` columns, which must be datetimes for time series analysis. The next cell does this conversion.

In [6]:
# Converting data types
df_gun = df_gun.astype({'month':'datetime64[ns]'})
df_violence = df_violence.astype({'incident_date':'datetime64[ns]'})
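An alternative to `astype` is `pd.to_datetime`, which makes the parsing explicit and can turn unparseable entries into `NaT` instead of raising an error. The sketch below runs on a small made-up frame rather than the real data; the `YYYY-MM` format of `month` is an assumption about the file, not something shown above.

```python
import pandas as pd

# Toy stand-in for df_gun; the real file's date format is assumed, not confirmed.
toy = pd.DataFrame({"month": ["2017-09", "2017-08", "not a date"]})

# errors="coerce" converts strings that do not match the format into NaT.
toy["month"] = pd.to_datetime(toy["month"], format="%Y-%m", errors="coerce")

print(toy.dtypes)
print(toy["month"].isna().sum())  # number of entries that failed to parse
```

Counting the `NaT` values afterwards is a quick way to see whether any rows were silently dropped from the time series.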

In [7]:
# Verify the conversion for df_gun
df_gun.dtypes

month                        datetime64[ns]
state                                object
permit                              float64
permit_recheck                      float64
handgun                             float64
long_gun                            float64
other                               float64
multiple                              int64
admin                               float64
prepawn_handgun                     float64
prepawn_long_gun                    float64
prepawn_other                       float64
redemption_handgun                  float64
redemption_long_gun                 float64
redemption_other                    float64
returned_handgun                    float64
returned_long_gun                   float64
returned_other                      float64
rentals_handgun                     float64
rentals_long_gun                    float64
private_sale_handgun                float64
private_sale_long_gun               float64
private_sale_other              

In [8]:
# Verify the conversion for df_violence
df_violence.dtypes

incident_date     datetime64[ns]
state                     object
city_or_county            object
address                   object
killed                     int64
injured                    int64
operations               float64
dtype: object

In [9]:
hp.explore_nans(df_gun, 'Exploring Null Values in Gun Data')

In [10]:
hp.explore_nans(df_violence, 'Exploring Null Values in Gun Violence')
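`hp.explore_nans` comes from a local `helpers` module whose source is not shown here. As a rough idea of the kind of summary such a helper could produce, here is a minimal sketch that computes the percentage of missing values per column. The function name `summarize_nans` and the toy data are hypothetical and are not the author's implementation.

```python
import numpy as np
import pandas as pd

def summarize_nans(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first.

    Hypothetical stand-in for the plotting helper used in the notebook.
    """
    return (df.isna().mean() * 100).sort_values(ascending=False)

# Small made-up frame to illustrate the output.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],  # 3 of 4 values missing
    "b": [1.0, 2.0, np.nan, 4.0],        # 1 of 4 values missing
    "c": [1, 2, 3, 4],                   # complete
})

nan_pct = summarize_nans(toy)
print(nan_pct)
```

The resulting Series can be passed to `nan_pct.plot.barh()` to get a bar chart of missingness per column.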

#### Dealing with missing values
As shown in the chart above, there are many columns with missing values that require a decision. Missing values are handled in two steps. First, drop columns that have 60% or more missing data, since no meaningful statistical operation can be done on them. Second, fill the remaining missing values with the mode of each column, assuming they take the most frequent value. I will not use the mean because outliers would skew it and affect the distribution of the data.
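The two steps described above can be sketched as follows. The source of `hp.drop_60_missings` is not shown, so `drop_mostly_missing` and `fill_with_mode` below are hypothetical stand-ins written under the assumption that "60% or more" means a per-column missing fraction of at least 0.6.

```python
import numpy as np
import pandas as pd

def drop_mostly_missing(df: pd.DataFrame, threshold: float = 0.6) -> pd.DataFrame:
    """Drop columns whose fraction of missing values is >= threshold."""
    keep = df.columns[df.isna().mean() < threshold]
    return df[keep].copy()

def fill_with_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Fill remaining NaNs in each column with that column's most frequent value."""
    out = df.copy()
    for col in out.columns:
        modes = out[col].mode(dropna=True)
        if not modes.empty:  # a column that is entirely NaN has no mode
            out[col] = out[col].fillna(modes.iloc[0])
    return out

# Small made-up frame to illustrate both steps.
toy = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 80% missing, dropped
    "some_missing": [0.0, 0.0, np.nan, 5.0, 0.0],           # 20% missing, filled
    "complete": [1, 2, 3, 4, 5],
})

cleaned = fill_with_mode(drop_mostly_missing(toy))
print(cleaned)
```

When a column has several equally frequent values, `mode()` returns them sorted and this sketch takes the first one, which is a design choice rather than a rule.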

In [11]:
# Step 1: drop columns that have 60% or more missing data
df_gun_v1 = hp.drop_60_missings(df_gun)
# recheck NaNs
hp.explore_nans(df_gun_v1, 'Exploring Null Values in Gun Data')

In [12]:
# Step 1: drop columns that have 60% or more missing data
df_violence_v1 = hp.drop_60_missings(df_violence)
# The `address` column has two missing values. No action is taken: the only applicable option is a forward fill (`ffill`),
# which would run without error but is not valid for this data, because it would assign the previous incident's address to a different incident.
# Recheck NaNs
hp.explore_nans(df_violence_v1, 'Gun Violence')