# Analyzing Gun Deaths in the US: 2012-2014


### Data Schema
- **year**: the year in which the fatality occurred.
- **month**: the month in which the fatality occurred.
- **intent**: the intent behind the shooting. One of Suicide, Accidental, NA, Homicide, or Undetermined.
- **police**: whether a police officer was involved with the shooting. Either 0 (false) or 1 (true).
- **sex**: the gender of the victim. Either M or F.
- **age**: the age of the victim.
- **race**: the race of the victim. Either Asian/Pacific Islander, Native American/Native Alaskan, Black, Hispanic, or White.
- **hispanic**: a code indicating the Hispanic origin of the victim.
- **place**: where the shooting occurred. Has several categories, which you're encouraged to explore on your own.
- **education**: educational status of the victim. Can be one of the following:
    * 1: Less than High School
    * 2: Graduated from High School or equivalent
    * 3: Some College
    * 4: At least graduated from College
    * 5: Not available

### First, let's import libraries and load the dataset.

In [None]:
import pandas as pd
pd.set_option('display.max_columns', 100)

import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline

import seaborn as sns

In [None]:
df = pd.read_csv('project_files/guns.csv')

# 1. Basic information

First, always look at basic information about the dataset. 

<br>
Display the dimensions of the dataset.

In [None]:
# Print shape of dataframe

In [1]:
# Print the data types

In [None]:
# Filter and display only df.dtypes that are 'object'
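
The three inspection steps above can be sketched as follows. This is a minimal example on a small made-up frame (`toy`, a stand-in for `df`; its values are invented). It assumes text columns are stored with the `object` dtype, which is forced explicitly here because newer pandas releases may report a dedicated string dtype instead.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    'year': [2012, 2013, 2014],
    'month': [1, 6, 12],
    'intent': pd.Series(['Suicide', 'Homicide', 'Accidental'], dtype='object'),
    'sex': pd.Series(['M', 'F', 'M'], dtype='object'),
    'age': [34.0, 21.0, 57.0],
})

# Dimensions as (rows, columns)
print(toy.shape)

# Data type of every column
print(toy.dtypes)

# Keep only the entries whose dtype is 'object' (the text columns)
object_dtypes = toy.dtypes[toy.dtypes == 'object']
print(object_dtypes)
```

On the real data the same calls would be applied to `df` directly.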

In [2]:
# Sort the data by column values: first by year, then by month
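
One way to sort by year and then by month is `sort_values` with a list of column names. The sketch below uses a small made-up frame (`toy`) in place of `df`; resetting the index afterwards is optional and only an assumption about what reads best.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    'year': [2013, 2012, 2013, 2012],
    'month': [5, 11, 2, 3],
    'intent': ['Suicide', 'Homicide', 'Suicide', 'Accidental'],
})

# Sort by year first, then by month within each year
toy_sorted = toy.sort_values(by=['year', 'month']).reset_index(drop=True)
print(toy_sorted)
```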

In [3]:
# Display head

# 2. Distributions of numeric features

One of the most enlightening data exploration tasks is plotting the distributions of your features.

In [None]:
# Plot histogram grid
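
A histogram grid can be drawn with the `hist` method of a DataFrame, which produces one panel per numeric column. The sketch below uses a made-up frame (`toy`) in place of `df`. The `figsize` and `xrot` values are arbitrary choices, and the `Agg` backend line is only there so the example also runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import pandas as pd
from matplotlib import pyplot as plt

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    'year': [2012, 2012, 2013, 2014, 2014, 2014],
    'age': [19, 25, 33, 41, 58, 72],
    'education': [1, 2, 2, 3, 4, 2],
    'sex': ['M', 'M', 'F', 'M', 'F', 'M'],
})

# One histogram per numeric column; the text column 'sex' is skipped
axes = toy.hist(figsize=(10, 8), xrot=-45)
plt.tight_layout()
```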



In [4]:
# Summarize numerical features
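
For a numeric summary, `describe` reports the count, mean, standard deviation, and quartiles of each numeric column. A minimal sketch on a made-up frame (`toy`, standing in for `df`):

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    'age': [20.0, 30.0, 40.0, 50.0],
    'police': [0, 0, 1, 0],
    'intent': ['Suicide', 'Suicide', 'Homicide', 'Accidental'],
})

# By default, describe() summarizes only the numeric columns of a mixed frame
summary = toy.describe()
print(summary)
```

A `count` lower than the number of rows is an early hint that a column has missing values.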


# 3. Distributions of categorical features

Next, let's take a look at the distributions of our categorical features.
<br>

Display summary statistics for categorical features.

In [5]:
# Summarize categorical features
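
For categorical features, `describe(include=['object'])` reports the number of non-null values, the number of unique classes, the most frequent class, and its frequency. The sketch below uses a made-up frame (`toy`) and forces the `object` dtype explicitly, since it assumes text columns are stored that way.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    'intent': pd.Series(['Suicide', 'Suicide', 'Homicide', 'Accidental'], dtype='object'),
    'sex': pd.Series(['M', 'M', 'F', 'M'], dtype='object'),
    'age': [20, 30, 40, 50],
})

# Rows of the result: count, unique, top (most frequent class), freq
cat_summary = toy.describe(include=['object'])
print(cat_summary)
```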


In [6]:
# Plot bar plot for each categorical feature


In [7]:
# Calculate correlations between numeric features


In [8]:
# Make the figsize 10 x 8


# Generate a mask for the upper triangle


# Plot heatmap of annotated correlations


# Plot heatmap of correlations
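
The correlation and mask steps can be sketched as below, using a made-up numeric frame (`toy`) in place of `df`. The mask is True on and above the diagonal, so a heatmap that honors it shows each pair of features only once.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    'age': [20, 30, 40, 50, 60],
    'education': [1, 2, 2, 3, 4],
    'police': [0, 1, 0, 0, 1],
})

# Pairwise Pearson correlations between the numeric columns
correlations = toy.corr()

# Boolean mask that is True for the upper triangle (diagonal included)
mask = np.triu(np.ones_like(correlations, dtype=bool))
print(mask)
```

With seaborn imported as `sns`, the result can be drawn on a figure created by `plt.figure(figsize=(10, 8))` using `sns.heatmap(correlations, mask=mask, annot=True)`; omitting `annot=True` gives the plain, unannotated version.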


# 4. Data Cleaning
### Drop unwanted observations

In [9]:
# Drop duplicates
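
A minimal sketch of dropping duplicates, on a made-up frame (`toy`) standing in for `df`. Only rows that repeat every column value exactly are removed; comparing the shape before and after shows how many rows were dropped.

```python
import pandas as pd

# Toy stand-in for the real data; the second and third rows are identical.
toy = pd.DataFrame({
    'year': [2012, 2013, 2013, 2014],
    'intent': ['Suicide', 'Homicide', 'Homicide', 'Accidental'],
    'age': [34.0, 21.0, 21.0, 57.0],
})

print(toy.shape)            # shape before
toy = toy.drop_duplicates()
print(toy.shape)            # shape after
```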


### Fix structural errors

The next bucket under data cleaning involves fixing structural errors. 

<br>

In [10]:
# Display unique values of 'police'


Next, to check for typos, mislabeled classes or inconsistent capitalization, display all the class distributions for the <code style="color:steelblue">'intent'</code> feature.

In [11]:
# Class distributions for 'intent'


In [12]:
# Murder should be Homicide

# accident should be Accidental

# suicide should be Suicide
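
The three corrections listed above can be applied in one call by passing a mapping to `replace`. This sketch uses a made-up column; the mislabeled classes are the ones named in the comments above.

```python
import pandas as pd

# Toy stand-in for the 'intent' column; the values are made up for illustration.
toy = pd.DataFrame({
    'intent': ['Suicide', 'suicide', 'Murder', 'Homicide', 'accident', 'Accidental'],
})

# Map each mislabeled class to its canonical label
toy['intent'] = toy['intent'].replace({
    'Murder': 'Homicide',
    'accident': 'Accidental',
    'suicide': 'Suicide',
})

# Class distribution after the fix
counts = toy['intent'].value_counts()
print(counts)
```

The same pattern would cover the `'race'` fix further down, with a mapping from `'Caucasian'` to `'White'`.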


### Plot the class distributions for 'intent' for comparison

In [13]:
# Class distributions for 'intent'


Looks much better!!

Now do the same for 'race'

In [14]:
# Class distributions for 'race'


In [15]:
# Caucasian should be White


In [16]:
# Class distributions for 'race'



### Label missing categorical data

It's finally time to address missing data.

<br>
First, find and count the missing categorical data.

In [17]:
# Display number of missing values by feature (categorical)


In [18]:
# Fill missing categorical values
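
One common approach is to label missing categorical values with their own class rather than dropping the rows. The sketch below assumes the label `'Missing'` (an arbitrary choice, not something the dataset prescribes) and uses a made-up frame (`toy`) with text columns forced to the `object` dtype.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    'intent': pd.Series(['Suicide', None, 'Homicide'], dtype='object'),
    'place': pd.Series(['Home', 'Street', None], dtype='object'),
    'age': [30.0, None, 45.0],
})

# Names of the categorical (object) columns
cat_cols = toy.select_dtypes(include=['object']).columns

# Missing values per categorical column, before filling
print(toy[cat_cols].isnull().sum())

# Label the gaps with an explicit class; numeric columns are left untouched
toy[cat_cols] = toy[cat_cols].fillna('Missing')

# Missing values per categorical column, after filling
print(toy[cat_cols].isnull().sum())
```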


In [19]:
# Display number of missing values by feature (categorical)


### Flag and fill missing numeric data

Finally, let's flag and fill missing numeric data.

<br>
First, let's find and count the missing values in the numerical features.

In [20]:
# Display number of missing values by feature (numeric)


Let's take a look at the unique values for education to see if we should replace null values or drop the observations. Reference the schema above for education definitions.

In [21]:
# View unique values for education


In [22]:
# Fill missing values for education
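
Because the schema already defines code 5 as "Not available", one option is to map null education values to 5 instead of dropping those rows. This is a sketch of that choice on a made-up column, not necessarily the intended solution.

```python
import pandas as pd

# Toy stand-in for the 'education' column; the values are made up for illustration.
toy = pd.DataFrame({'education': [1.0, 2.0, None, 4.0, None]})

# Inspect the codes that actually occur, including NaN
print(toy['education'].unique())

# Treat missing education as code 5 ("Not available" in the schema)
toy['education'] = toy['education'].fillna(5)

# No missing values should remain
print(toy['education'].isnull().sum())
```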


In [23]:
# Display number of missing values by feature (numeric)


Great, looks like you've taken care of education. Now handle the missing values for age.

In [None]:
# Handle missing values for age
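
For age, a flag-and-fill approach might look like the sketch below. The indicator column name `age_missing` and the fill value 0 are both assumptions made for illustration; the frame (`toy`) is made up.

```python
import pandas as pd

# Toy stand-in for the 'age' column; the values are made up for illustration.
toy = pd.DataFrame({'age': [22.0, None, 47.0, None]})

# Flag: 1 where age was missing, 0 otherwise (hypothetical column name)
toy['age_missing'] = toy['age'].isnull().astype(int)

# Fill: replace the missing ages with 0 so the column has no nulls
toy['age'] = toy['age'].fillna(0)

print(toy)
```

If only a handful of rows lack an age, dropping them with `dropna(subset=['age'])` is a simpler alternative; checking the shape afterwards shows how many rows were removed.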


In [None]:
# Display number of missing values by feature (numeric)


In [None]:
# Print shape of dataframe

### For readability and consistency, capitalize the column names and name the index
The code is provided because we haven't covered this in previous lessons.

In [None]:
df.index.name = 'Index'
# Capitalize the first letter of each column name
df.columns = df.columns.str.capitalize()

In [None]:
# Print head

In [None]:
# Save our cleaned data for later use
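
Saving can be done with `to_csv`. In the sketch below the file name `cleaned_guns.csv` is a hypothetical choice and a temporary directory is used so the example runs anywhere; the frame (`toy`) is made up.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned data; the values are made up for illustration.
toy = pd.DataFrame({'Year': [2012, 2013], 'Intent': ['Suicide', 'Homicide']})
toy.index.name = 'Index'

# Hypothetical output path; in the notebook a relative path would work as well
out_path = os.path.join(tempfile.mkdtemp(), 'cleaned_guns.csv')

# The named index is written as the first column by default
toy.to_csv(out_path)

# Reading the file back with index_col restores the same index
reloaded = pd.read_csv(out_path, index_col='Index')
print(reloaded)
```

Passing `index=False` to `to_csv` instead would leave the index out of the file.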
