### Title: <span style='color:red;'>US_GUN_VIOLENCE</span>

### Introduction:
##### Project Description
Human societies are plagued with several social issues that can be life-threatening at times. A notable case is gun violence in the United States, which results in tens of thousands of deaths and injuries annually. In this project I perform a deep exploration of gun violence incidents reported in the US.

##### Dataset Description:
The dataset used in this project was obtained from [source](https://www.kaggle.com/jameslko/gun-violence-data)
##### Overall Goal:
<span style='color:red;'>My goal is to explore this data and draw useful insights about gun violence in the US.</span>

##### Specific goals
At the end of this exploratory analysis, I hope to answer the following questions:
1. How much has gun violence increased over the years?
2. How much have deaths per year due to gun violence changed over the years?
3. Where in the US is gun violence most prevalent?





### Load Modules and Data

For this analysis, I will use the following modules. Next, I will load my data.

In [1]:
#imports
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


In [37]:
import os
os.chdir("F:/micro masters/PYTHON/US Gun Violence")

In [38]:
# load data
guns = pd.read_csv("guns.csv")
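As an aside, the working-directory change above ties the notebook to one machine. A minimal sketch of an alternative, assuming the CSV sits in a known folder: build the path with `pathlib` and pass it straight to `pd.read_csv`. The temporary folder and the three-column file written below are hypothetical stand-ins so the example runs anywhere; they are not the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the project folder; replace with the real data directory.
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "guns.csv"

# Write a tiny made-up file so the sketch is self-contained.
csv_path.write_text("date,state,n_killed\n1/1/2013,Ohio,1\n1/5/2013,Colorado,4\n")

# Read by explicit path instead of changing the working directory.
guns_demo = pd.read_csv(csv_path)
print(guns_demo.shape)
```

Reading by explicit path keeps the rest of the notebook independent of where it is launched from.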


### Preview of Data

In [12]:
guns.head(5)

Unnamed: 0,incident_id,date,state,city_or_county,n_killed,n_injured,incident_url,source_url,incident_url_fields_missing,congressional_district,...,longitude,n_guns_involved,notes,participant_age,participant_age_group,participant_gender,participant_relationship,participant_status,participant_type,sources
0,461105,1/1/2013,Pennsylvania,Mckeesport,0,4,http://www.gunviolencearchive.org/incident/461105,http://www.post-gazette.com/local/south/2013/0...,False,14.0,...,-79.8559,,Julian Sims under investigation: Four Shot and...,0::20,0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...,0::Male||1::Male||3::Male||4::Female,,0::Arrested||1::Injured||2::Injured||3::Injure...,0::Victim||1::Victim||2::Victim||3::Victim||4:...,http://pittsburgh.cbslocal.com/2013/01/01/4-pe...
1,460726,1/1/2013,California,Hawthorne,1,3,http://www.gunviolencearchive.org/incident/460726,http://www.dailybulletin.com/article/zz/201301...,False,43.0,...,-118.333,,Four Shot; One Killed; Unidentified shooter in...,0::20,0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...,0::Male,,0::Killed||1::Injured||2::Injured||3::Injured,0::Victim||1::Victim||2::Victim||3::Victim||4:...,http://losangeles.cbslocal.com/2013/01/01/man-...
2,478855,1/1/2013,Ohio,Lorain,1,3,http://www.gunviolencearchive.org/incident/478855,http://chronicle.northcoastnow.com/2013/02/14/...,False,9.0,...,-82.1377,2.0,,0::25||1::31||2::33||3::34||4::33,0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...,0::Male||1::Male||2::Male||3::Male||4::Male,,"0::Injured, Unharmed, Arrested||1::Unharmed, A...",0::Subject-Suspect||1::Subject-Suspect||2::Vic...,http://www.morningjournal.com/general-news/201...
3,478925,1/5/2013,Colorado,Aurora,4,0,http://www.gunviolencearchive.org/incident/478925,http://www.dailydemocrat.com/20130106/aurora-s...,False,6.0,...,-104.802,,,0::29||1::33||2::56||3::33,0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...,0::Female||1::Male||2::Male||3::Male,,0::Killed||1::Killed||2::Killed||3::Killed,0::Victim||1::Victim||2::Victim||3::Subject-Su...,http://denver.cbslocal.com/2013/01/06/officer-...
4,478959,1/7/2013,North Carolina,Greensboro,2,2,http://www.gunviolencearchive.org/incident/478959,http://www.journalnow.com/news/local/article_d...,False,6.0,...,-79.9569,2.0,Two firearms recovered. (Attempted) murder sui...,0::18||1::46||2::14||3::47,0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::...,0::Female||1::Male||2::Male||3::Female,3::Family,0::Injured||1::Injured||2::Killed||3::Killed,0::Victim||1::Victim||2::Victim||3::Subject-Su...,http://myfox8.com/2013/01/08/update-mother-sho...


In [5]:
guns.shape

(239677, 24)

### 1. Data quality Assessment
**Goal**:
To perform an initial assessment of the data quality before moving on to the next steps of my analysis.

First, I will extract from the initial data the subset of columns relevant to my exploration.

Second, I will examine the dataset for the following data quality issues:
- Accuracy of field values
- Missing values
- Duplicate entries

#### IDENTIFYING RELEVANT COLUMNS

To simplify the data, I first look at the variable names to see which ones contain the data needed for the analysis I am interested in.

In [13]:
guns.columns

Index(['incident_id', 'date', 'state', 'city_or_county', 'n_killed',
       'n_injured', 'incident_url', 'source_url',
       'incident_url_fields_missing', 'congressional_district', 'gun_stolen',
       'gun_type', 'latitude', 'location_description', 'longitude',
       'n_guns_involved', 'notes', 'participant_age', 'participant_age_group',
       'participant_gender', 'participant_relationship', 'participant_status',
       'participant_type', 'sources'],
      dtype='object')

**comment:** Looking at the variable names, I will remove the following columns: ('incident_id', 'incident_url', 'source_url', 'incident_url_fields_missing', 'congressional_district', 'gun_stolen', 'gun_type', 'n_guns_involved', 'notes', 'sources', 'location_description', 'participant_status', 'participant_type', 'participant_gender', 'participant_relationship')

In [39]:
new_guns=guns[['date', 'state', 'city_or_county', 'n_killed','n_injured', 'latitude', 'longitude', 'participant_age', 'participant_age_group']]
new_guns.shape

(239677, 9)
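The cell above keeps the wanted columns by name rather than removing the unwanted ones. As a small sketch on a made-up frame (the rows are hypothetical and only a handful of the real column names are used), the two approaches give the same result, so either can be used depending on which list is shorter:

```python
import pandas as pd

# Hypothetical miniature of the raw data with a few of its column names.
demo = pd.DataFrame({
    "incident_id": [461105, 460726],
    "date": ["1/1/2013", "1/1/2013"],
    "state": ["Pennsylvania", "California"],
    "n_killed": [0, 1],
    "notes": [None, None],
})

# Option 1: drop the columns that are not needed.
dropped = demo.drop(columns=["incident_id", "notes"])

# Option 2: select the columns that are needed.
selected = demo[["date", "state", "n_killed"]]

print(dropped.equals(selected))
```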

#### PROFILING THE DATA

From the output above, 9 variables and 239677 observations from the dataset are of interest to me.

The first step in any data analysis is to get an overview of the data. To do this, I will use the pandas-profiling package.

In [7]:
import pandas_profiling as pp

In [8]:
profile=pp.ProfileReport(new_guns)
profile

0,1
Number of variables,9
Number of observations,239677
Total Missing (%),7.0%
Total size in memory,16.5 MiB
Average record size in memory,72.0 B

0,1
Numeric,4
Categorical,5
Boolean,0
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

0,1
Distinct count,12898
Unique (%),5.4%
Missing (%),0.0%
Missing (n),0

0,1
Chicago,10814
Baltimore,3943
Washington,3279
Other values (12895),221641

Value,Count,Frequency (%),Unnamed: 3
Chicago,10814,4.5%,
Baltimore,3943,1.6%,
Washington,3279,1.4%,
New Orleans,3071,1.3%,
Philadelphia,2963,1.2%,
Saint Louis,2501,1.0%,
Houston,2501,1.0%,
Milwaukee,2487,1.0%,
Jacksonville,2448,1.0%,
Memphis,2386,1.0%,

0,1
Distinct count,1725
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0

0,1
1/1/2017,342
7/4/2017,248
1/1/2018,242
Other values (1722),238845

Value,Count,Frequency (%),Unnamed: 3
1/1/2017,342,0.1%,
7/4/2017,248,0.1%,
1/1/2018,242,0.1%,
5/28/2017,242,0.1%,
8/28/2016,230,0.1%,
4/16/2017,229,0.1%,
8/21/2016,227,0.1%,
7/4/2016,224,0.1%,
4/22/2017,222,0.1%,
1/1/2016,221,0.1%,

0,1
Distinct count,101241
Unique (%),42.2%
Missing (%),3.3%
Missing (n),7923
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,37.547
Minimum,19.111
Maximum,71.337
Zeros (%),0.0%

0,1
Minimum,19.111
5-th percentile,29.221
Q1,33.903
Median,38.571
Q3,41.437
95-th percentile,44.054
Maximum,71.337
Range,52.225
Interquartile range,7.534

0,1
Standard deviation,5.1308
Coef of variation,0.13665
Kurtosis,1.8789
Mean,37.547
MAD,4.1492
Skewness,0.20723
Sum,8701600
Variance,26.325
Memory size,1.8 MiB

Value,Count,Frequency (%),Unnamed: 3
33.6356,253,0.1%,
39.294000000000004,244,0.1%,
29.9872,170,0.1%,
33.4347,161,0.1%,
32.8982,160,0.1%,
38.9075,142,0.1%,
36.1334,112,0.0%,
29.9551,109,0.0%,
28.436,109,0.0%,
39.8496,108,0.0%,

Value,Count,Frequency (%),Unnamed: 3
19.1114,1,0.0%,
19.1127,1,0.0%,
19.2,1,0.0%,
19.2017,1,0.0%,
19.4243,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
71.2997,1,0.0%,
71.3,1,0.0%,
71.3001,1,0.0%,
71.3005,1,0.0%,
71.3368,1,0.0%,

0,1
Distinct count,112348
Unique (%),46.9%
Missing (%),3.3%
Missing (n),7923
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,-89.338
Minimum,-171.43
Maximum,97.433
Zeros (%),0.0%

0,1
Minimum,-171.43
5-th percentile,-121.42
Q1,-94.159
Median,-86.25
Q3,-80.049
95-th percentile,-73.229
Maximum,97.433
Range,268.86
Interquartile range,14.11

0,1
Standard deviation,14.36
Coef of variation,-0.16073
Kurtosis,2.5309
Mean,-89.338
MAD,10.595
Skewness,-1.3548
Sum,-20705000
Variance,206.2
Memory size,1.8 MiB

Value,Count,Frequency (%),Unnamed: 3
-84.4333,254,0.1%,
-76.62,234,0.1%,
-112.006,173,0.1%,
-95.3477,169,0.1%,
-97.0404,160,0.1%,
-77.0176,135,0.1%,
-122.296,119,0.0%,
-104.67399999999999,113,0.0%,
-90.0747,112,0.0%,
-81.3065,106,0.0%,

Value,Count,Frequency (%),Unnamed: 3
-171.429,1,0.0%,
-166.541,1,0.0%,
-166.097,1,0.0%,
-165.71099999999998,2,0.0%,
-165.58599999999998,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
-67.2711,1,0.0%,
80.9491,1,0.0%,
90.37,1,0.0%,
96.7591,1,0.0%,
97.4331,2,0.0%,

0,1
Distinct count,23
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.49401
Minimum,0
Maximum,53
Zeros (%),59.4%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,2
Maximum,53
Range,53
Interquartile range,1

0,1
Standard deviation,0.72995
Coef of variation,1.4776
Kurtosis,142.83
Mean,0.49401
MAD,0.58737
Skewness,4.4434
Sum,118402
Variance,0.53283
Memory size,1.8 MiB

Value,Count,Frequency (%),Unnamed: 3
0,142487,59.4%,
1,81986,34.2%,
2,11484,4.8%,
3,2513,1.0%,
4,760,0.3%,
5,241,0.1%,
6,91,0.0%,
7,51,0.0%,
8,19,0.0%,
9,12,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,142487,59.4%,
1,81986,34.2%,
2,11484,4.8%,
3,2513,1.0%,
4,760,0.3%,

Value,Count,Frequency (%),Unnamed: 3
18,1,0.0%,
19,3,0.0%,
20,1,0.0%,
25,1,0.0%,
53,1,0.0%,

0,1
Distinct count,16
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.25229
Minimum,0
Maximum,50
Zeros (%),77.5%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,1
Maximum,50
Range,50
Interquartile range,0

0,1
Standard deviation,0.52178
Coef of variation,2.0682
Kurtosis,390.59
Mean,0.25229
MAD,0.39123
Skewness,6.6364
Sum,60468
Variance,0.27225
Memory size,1.8 MiB

Value,Count,Frequency (%),Unnamed: 3
0,185835,77.5%,
1,48436,20.2%,
2,4604,1.9%,
3,595,0.2%,
4,139,0.1%,
5,41,0.0%,
6,11,0.0%,
8,5,0.0%,
9,3,0.0%,
7,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,185835,77.5%,
1,48436,20.2%,
2,4604,1.9%,
3,595,0.2%,
4,139,0.1%,

Value,Count,Frequency (%),Unnamed: 3
11,1,0.0%,
16,1,0.0%,
17,1,0.0%,
27,1,0.0%,
50,1,0.0%,

0,1
Distinct count,18952
Unique (%),7.9%
Missing (%),38.5%
Missing (n),92298

0,1
0::24,3814
0::23,3735
0::22,3733
Other values (18948),136097
(Missing),92298

Value,Count,Frequency (%),Unnamed: 3
0::24,3814,1.6%,
0::23,3735,1.6%,
0::22,3733,1.6%,
0::19,3719,1.6%,
0::21,3612,1.5%,
0::18,3536,1.5%,
0::20,3535,1.5%,
0::25,3500,1.5%,
0::26,3277,1.4%,
0::27,3110,1.3%,

0,1
Distinct count,899
Unique (%),0.4%
Missing (%),17.6%
Missing (n),42119

0,1
0::Adult 18+,94671
0::Adult 18+||1::Adult 18+,49273
0::Adult 18+||1::Adult 18+||2::Adult 18+,13893
Other values (895),39721
(Missing),42119

Value,Count,Frequency (%),Unnamed: 3
0::Adult 18+,94671,39.5%,
0::Adult 18+||1::Adult 18+,49273,20.6%,
0::Adult 18+||1::Adult 18+||2::Adult 18+,13893,5.8%,
0::Teen 12-17,7392,3.1%,
0::Adult 18+||1::Adult 18+||2::Adult 18+||3::Adult 18+,4975,2.1%,
1::Adult 18+,3916,1.6%,
0::Adult 18+||1::Teen 12-17,1962,0.8%,
0::Teen 12-17||1::Adult 18+,1914,0.8%,
0::Adult 18+||1::Adult 18+||2::Adult 18+||3::Adult 18+||4::Adult 18+,1736,0.7%,
0::Teen 12-17||1::Teen 12-17,1673,0.7%,

0,1
Distinct count,51
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Illinois,17556
California,16306
Florida,15029
Other values (48),190786

Value,Count,Frequency (%),Unnamed: 3
Illinois,17556,7.3%,
California,16306,6.8%,
Florida,15029,6.3%,
Texas,13577,5.7%,
Ohio,10244,4.3%,
New York,9712,4.1%,
Pennsylvania,8929,3.7%,
Georgia,8925,3.7%,
North Carolina,8739,3.6%,
Louisiana,8103,3.4%,

Unnamed: 0,date,state,city_or_county,n_killed,n_injured,latitude,longitude,participant_age,participant_age_group
0,1/1/2013,Pennsylvania,Mckeesport,0,4,40.3467,-79.8559,0::20,0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...
1,1/1/2013,California,Hawthorne,1,3,33.909,-118.333,0::20,0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...
2,1/1/2013,Ohio,Lorain,1,3,41.4455,-82.1377,0::25||1::31||2::33||3::34||4::33,0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...
3,1/5/2013,Colorado,Aurora,4,0,39.6518,-104.802,0::29||1::33||2::56||3::33,0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...
4,1/7/2013,North Carolina,Greensboro,2,2,36.114,-79.9569,0::18||1::46||2::14||3::47,0::Adult 18+||1::Adult 18+||2::Teen 12-17||3::...


In [9]:
new_guns.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 239677 entries, 0 to 239676
Data columns (total 9 columns):
date                     239677 non-null object
state                    239677 non-null object
city_or_county           239677 non-null object
n_killed                 239677 non-null int64
n_injured                239677 non-null int64
latitude                 231754 non-null float64
longitude                231754 non-null float64
participant_age          147379 non-null object
participant_age_group    197558 non-null object
dtypes: float64(2), int64(2), object(5)
memory usage: 16.5+ MB


**ASSESSMENT OF FIELD VALUES**

In [10]:
import numpy as np
new_guns.describe(include=np.float64)

Unnamed: 0,latitude,longitude
count,231754.0,231754.0
mean,37.546598,-89.338348
std,5.130763,14.359546
min,19.1114,-171.429
25%,33.9034,-94.158725
50%,38.5706,-86.2496
75%,41.437375,-80.048625
max,71.3368,97.4331


In [40]:
new_guns['date']=pd.to_datetime(new_guns['date'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

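The warning above is pandas reporting that `new_guns` was created by selecting columns from `guns`, so it cannot tell whether the assignment should affect the original frame. A minimal sketch of one common way to avoid it, assuming an explicit `.copy()` at the point where the subset is taken (the frame and its values below are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the raw frame.
raw = pd.DataFrame({
    "date": ["1/1/2013", "1/5/2013", "1/7/2013"],
    "state": ["Ohio", "Colorado", "North Carolina"],
    "notes": ["a", "b", "c"],
})

# .copy() makes the subset an independent DataFrame,
# so later column assignments are unambiguous.
subset = raw[["date", "state"]].copy()

# The dates in the preview look like month/day/year; the format string is an assumption.
subset["date"] = pd.to_datetime(subset["date"], format="%m/%d/%Y")

print(subset["date"].dt.year.tolist())
```

The original frame is left untouched, which makes it easy to re-derive the subset if a later cleaning step goes wrong.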

In [36]:

for column in new_guns:
    Categories=[]
    new_guns[column]=pd.Categorical(new_guns[column])
    Categories.append([column, new_guns[column].cat.categories])
    
    print('\n')
    print(Categories)
    print(' ')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.




[['date', DatetimeIndex(['2013-01-01', '2013-01-05', '2013-01-07', '2013-01-19',
               '2013-01-21', '2013-01-23', '2013-01-25', '2013-01-26',
               '2013-02-02', '2013-02-03',
               ...
               '2018-03-22', '2018-03-23', '2018-03-24', '2018-03-25',
               '2018-03-26', '2018-03-27', '2018-03-28', '2018-03-29',
               '2018-03-30', '2018-03-31'],
              dtype='datetime64[ns]', length=1725, freq=None)]]
 


[['state', Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia',
       'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky',
       'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
       'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon'

#### ANALYZING THE OVERVIEW OF THE DATA

SHAPE

From the above output, the 9 variables in my data comprise 4 numerical variables and 5 categorical variables.

MISSING VALUE

Overall, 7% of the data is missing. participant_age has the highest share, with 38.5% of its values missing. For this reason, and because participant_age_group is a better variable for capturing age, I will remove this variable: with so many missing values, any analysis based on it could be highly misleading.

After this, I will remove any remaining rows with missing values.
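The missing-value shares quoted from the profile report can also be computed directly with pandas. A minimal sketch on a made-up frame (the values are hypothetical; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with some gaps.
demo = pd.DataFrame({
    "latitude": [40.3, np.nan, 41.4, 39.6],
    "participant_age": ["0::20", None, None, "0::29"],
    "state": ["Pennsylvania", "California", "Ohio", "Colorado"],
})

# Share of missing values per column, in percent, largest first.
missing_pct = demo.isna().mean().mul(100).sort_values(ascending=False)

# Share of missing cells over the whole frame, in percent.
overall_pct = demo.isna().to_numpy().mean() * 100

print(missing_pct)
print(overall_pct)
```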

FIELD VALUES

The participant_age variable appears to have some unreasonable values. For instance: ['0.041666667', '0.042361111', '0.043055556', '0.04375', '0.044444444', '0.045138889', '0.045833333', '0.046527778', '0.047916667', '0.048611111']. I will need to remove such inputs if they still exist after dealing with missing values and duplicates.

To deal with these issues for the remaining object-typed variables, I will need to split their values into sub-categories, both to find such unreasonable inputs and to facilitate my analysis.
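The participant columns in the preview pack several values into one string, such as `0::25||1::31`. A minimal sketch of how such a string could be split and screened, assuming every entry follows that `index::value` pattern joined by `||`. The helper names and the 0 to 120 plausibility range are hypothetical choices for illustration, not part of the dataset:

```python
import pandas as pd


def parse_participants(raw):
    """Split '0::25||1::31' into {0: '25', 1: '31'}; missing input gives {}."""
    if pd.isna(raw):
        return {}
    result = {}
    for item in str(raw).split("||"):
        index, _, value = item.partition("::")
        result[int(index)] = value
    return result


def has_implausible_age(raw, low=0, high=120):
    """Flag entries with a non-numeric, fractional, or out-of-range age."""
    for value in parse_participants(raw).values():
        try:
            age = float(value)
        except ValueError:
            return True
        if not (low <= age <= high) or age != int(age):
            return True
    return False


ages = pd.Series(["0::25||1::31||2::33", "0::0.041666667", None])
parsed = ages.apply(parse_participants)
flags = ages.apply(has_implausible_age)
print(flags.tolist())
```

Rows flagged this way could then be inspected or dropped; rows whose strings do not match the assumed pattern would need separate handling.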

DUPLICATES

The data currently has 864 duplicate rows. Removing the missing values may reduce this number; I will then remove any duplicates that remain.
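A duplicate count like the one above can be reproduced with `DataFrame.duplicated`. A minimal sketch on a made-up frame (the rows are hypothetical):

```python
import pandas as pd

# Hypothetical frame in which the first two rows are identical.
demo = pd.DataFrame({
    "date": ["1/1/2013", "1/1/2013", "1/5/2013"],
    "state": ["Ohio", "Ohio", "Colorado"],
    "n_killed": [1, 1, 4],
})

# duplicated() marks every repeat of an earlier row; the sum is the number of extra copies.
n_duplicates = int(demo.duplicated().sum())

# keep=False marks all members of each duplicate group, which is handy for inspection.
duplicate_rows = demo[demo.duplicated(keep=False)]

print(n_duplicates)
print(len(duplicate_rows))
```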


### 2. Data Cleaning
Given the issues raised above, I will carry out the data cleaning needed to smooth my subsequent analysis:

1. First, I will deal with missing values.
2. Next, unreasonable column values.
3. Lastly, duplicates.

I may also subset certain columns where necessary.

DEALING WITH MISSING VALUES

In [24]:
new_guns=new_guns.drop(['participant_age'], axis=1)

In [25]:
new_guns=new_guns.dropna()

In [26]:
new_guns.shape

(190571, 8)

DEALING WITH DUPLICATES

In [27]:
new_guns.drop_duplicates

<bound method DataFrame.drop_duplicates of              date                 state            city_or_county  n_killed  \
0      2013-01-01          Pennsylvania                Mckeesport         0   
1      2013-01-01            California                 Hawthorne         1   
2      2013-01-01                  Ohio                    Lorain         1   
3      2013-01-05              Colorado                    Aurora         4   
4      2013-01-07        North Carolina                Greensboro         2   
5      2013-01-07              Oklahoma                     Tulsa         4   
6      2013-01-19            New Mexico               Albuquerque         5   
8      2013-01-21            California                 Brentwood         0   
9      2013-01-23              Maryland                 Baltimore         1   
10     2013-01-23             Tennessee               Chattanooga         1   
11     2013-01-25              Missouri               Saint Louis         1   
12     20

In [28]:
new_guns.shape

(190571, 8)

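From the output above, the shape is unchanged at (190571, 8). However, `drop_duplicates` was referenced without parentheses, so the method was never called (the output shown is the bound method itself) and no rows were removed. The unchanged shape therefore does not show that the duplicates are gone; calling `new_guns = new_guns.drop_duplicates()` is needed to actually remove any that remain.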
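As a caution when reading the two cells above: `drop_duplicates` only removes rows when it is called with parentheses, and it returns a new frame unless the result is assigned. A minimal sketch of the difference on a made-up frame (the rows are hypothetical):

```python
import pandas as pd

# Hypothetical frame with one repeated row.
demo = pd.DataFrame({
    "date": ["1/1/2013", "1/1/2013", "1/5/2013"],
    "state": ["Ohio", "Ohio", "Colorado"],
    "n_killed": [1, 1, 4],
})

# Referencing the method without parentheses does nothing to the data.
method_only = demo.drop_duplicates
shape_before = demo.shape

# Calling it and assigning the result removes the repeated row.
deduplicated = demo.drop_duplicates()
shape_after = deduplicated.shape

print(shape_before, shape_after)
```

Comparing the shape before and after the call is then a meaningful check of how many rows were removed.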

**Let's have an overview of new data**

In [29]:
new_guns.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 190571 entries, 0 to 239676
Data columns (total 8 columns):
date                     190571 non-null datetime64[ns]
state                    190571 non-null object
city_or_county           190571 non-null object
n_killed                 190571 non-null int64
n_injured                190571 non-null int64
latitude                 190571 non-null float64
longitude                190571 non-null float64
participant_age_group    190571 non-null object
dtypes: datetime64[ns](1), float64(2), int64(2), object(3)
memory usage: 13.1+ MB


**Next, I want to do any needed subsetting of my data for easy visualization and analysis**

In [30]:
new_guns["month-year"]=pd.to_datetime(new_guns['date']).dt.to_period('M')
new_guns.head(2)

Unnamed: 0,date,state,city_or_county,n_killed,n_injured,latitude,longitude,participant_age_group,month-year
0,2013-01-01,Pennsylvania,Mckeesport,0,4,40.3467,-79.8559,0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...,2013-01
1,2013-01-01,California,Hawthorne,1,3,33.909,-118.333,0::Adult 18+||1::Adult 18+||2::Adult 18+||3::A...,2013-01
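The `month-year` column created above makes time-based grouping straightforward. A minimal sketch of how monthly and yearly totals could be derived from it, using a made-up frame (the rows and the `month_year` column name are hypothetical):

```python
import pandas as pd

# Hypothetical cleaned rows.
demo = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-03-01", "2014-07-04", "2015-01-01", "2015-05-28", "2015-08-21"]
    ),
    "state": ["Illinois", "Ohio", "Illinois", "Texas", "Illinois"],
    "n_killed": [1, 0, 2, 1, 0],
})

# Monthly period, as in the cell above.
demo["month_year"] = demo["date"].dt.to_period("M")
monthly_deaths = demo.groupby("month_year")["n_killed"].sum()

# Yearly number of incidents and deaths.
yearly = demo.groupby(demo["date"].dt.year)["n_killed"].agg(["count", "sum"])
yearly.columns = ["incidents", "deaths"]

# Incidents per state, most frequent first.
by_state = demo["state"].value_counts()

print(yearly)
print(by_state.head(3))
```

Tables of this kind map directly onto the three questions listed in the introduction: incidents per year, deaths per year, and incidents per state.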


In [31]:
new_guns.shape

(190571, 9)

At the end of cleaning the data, my new data contains 9 variables and 190571 observations.


**END OF DATA CLEANING**

Reference

Retrieved from [Source code](https://www.dataquest.io/blog/python-datetime-tutorial/)

Retrieved from [Source code](https://github.com/pandas-profiling/pandas-profiling)

cmdline (2018). Python and R Tips. Retrieved from [Source code](https://cmdlinetips.com/2018/04/how-to-drop-one-or-more-columns-in-pandas-dataframe/amp/)