# 2. Data wrangling   
### 2.1 Data overviews  
### 2.2 Importing related libs & modules  
### 2.3 Loading data  
- Gun Possession: Total of gun licensees per <span style="color:red">year</span>  
- Gun Deaths: Total gun deaths (homicide, suicide, ...) per <span style="color:red">year</span>  
- [Firearm Deaths by Age](https://webappa.cdc.gov/sasweb/ncipc/mortrate.html): Number of deaths grouped by age, listed by <span style="color:red">year</span>  
- Mass Shooting: Number of mass shooting cases with shooter age, gender, mental health, employment by <span style="color:red">year, states</span>  
- Unemployment rate: Unemployment rate by <span style="color:red">years, states</span>  
### 2.4 Explore the data  
- Data distribution & missing values  
- Numeric features  
- Category features   
### 2.5 Target variables  
### 2.6 Save data  
### 2.7 Summary  


## 2.1 Data Overviews  
### Input data  
index: Year 2013 - 2018  
Dependent variable (y): Gun deaths per year and state  
Independent variables (Xi): Population, Gun possession, Employment rate, Age group, Shooter age, Shooter gender, Unemployment rate
  
### What to do  
Load, transform & visualize data.  
Q: Add more features for the prediction?  

## 2.2 Import libs & modules  
### geoplot for geodetic display

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import seaborn as sns
import os

from library.sb_utils import save_file

## 2.3 Loading data  
### DF: GunDeaths_2009_2018; GunPossession_1986_2018; Mass_Shooting; Employment;   
Additional data sorting by year.

In [2]:
GunDeaths_2009_2018  = pd.read_csv('../data/USA_Crimes/GunDeaths_2009_2018.csv').set_index('Year')
# FirearmDeaths_2000_2018  = pd.read_csv('../data/USA_Crimes/FirearmDeaths2000_2018.csv').set_index('Year')
GunPossession_1986_2018 = pd.read_csv('../data/USA_Crimes/GunPossession_1986_2018.csv', index_col=0)

In [3]:
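# Keep only the licensee counts; .copy() avoids SettingWithCopyWarning on the assignments below.
GunPossession_1986_2018 = GunPossession_1986_2018[['Total Licensees ','Licensed Business Entities']].copy()
# Rename to snake_case and drop the trailing space carried by the source header 'Total Licensees '.
GunPossession_1986_2018.columns = ['total_licensees','licensed_business_entities']
GunPossession_1986_2018['year'] = GunPossession_1986_2018.index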
# GunPossession_1986_2018.info()

In [None]:
# FirearmDeaths_2000_2018.drop(['State','Ethnicity','First Year','Last Year','Cause of Death'], axis=1, inplace=True)
# FirearmDeaths_2000_2018.columns

In [4]:
# Keep the national totals; .copy() avoids SettingWithCopyWarning on the assignments below.
GunDeaths_2009_2018 = GunDeaths_2009_2018[['Population','Total gun deaths','Total children and teen gun deaths']].copy()
GunDeaths_2009_2018.columns = ['population','total_gun_deaths','total_children_teen_gun_deaths']
GunDeaths_2009_2018['year'] = GunDeaths_2009_2018.index
# GunDeaths_2009_2018.columns 

### **Combine data**
#### Data: $\color{cyan}{\text{Year, Population, Total licenses, Gun deaths, age group, sex.}}$
#### DF = GunDeaths_2009_2018 + GunPossession_1986_2018 + Employment by year & states

Yearly employment data by state

In [5]:
Employment_header = ['FIPS Code', 'State and area', 'Year', 'Civilian non-institutional population', 
  'Civilian labor force/Total', 'Civilian labor force/Percent of population', 
  'Civilian labor force/Employment/Total', 'Civilian labor force/Employment/Percent of population',
  'Civilian labor force/Unemployment/Total','Civilian labor force/Unemployment/Rate']

In [6]:
# Employment = pd.read_excel('../data/USA_Crimes/staadata.xlsx', header=0)
Employment = pd.read_excel(
    os.path.join('../data/USA_Crimes', 'staadata.xlsx'),
    engine='openpyxl', header=None, names=Employment_header, skiprows=8
).set_index('Year')

In [7]:
# Keep the state-level columns needed later; .copy() avoids SettingWithCopyWarning.
Employment = Employment[['State and area', 'Civilian non-institutional population', 
  'Civilian labor force/Total','Civilian labor force/Unemployment/Rate']].copy()
Employment.columns = ['state','state_population','state_labor_force','unemployment_rate']
Employment['year'] = Employment.index

In [16]:
Employment.head(2)

| Year | state | state_population | state_labor_force | unemployment_rate | year |
|---|---|---|---|---|---|
| 1976 | Alabama | 2632667 | 1501284 | 6.8 | 1976 |
| 1976 | Alaska | 239917 | 163570 | 7.6 | 1976 |


Mass shooting data (shooter employment, mental health, gender), which will be merged with the three other dataframes

In [8]:
Mass_Shooting = pd.read_csv('../data/USA_Crimes/USMassShooting19662019(cleaned).csv', parse_dates=True, index_col='Date')

In [9]:
MS_column_drop = ['S#','Title','Area','Incident Area','Open/Close Location','Target','Cause','Summary',
  'Shooter status','No. of shooter/suspect','Race','Latitude','Longitude']
Mass_Shooting.sort_index(inplace=True)
Mass_Shooting.drop(MS_column_drop, axis=1, inplace=True)

In [10]:
Mass_Shooting['date'] = pd.to_datetime(Mass_Shooting.index)
Mass_Shooting['year'] = Mass_Shooting['date'].dt.year
# Mass_Shooting['month'] = Mass_Shooting['date'].dt.month
# Mass_Shooting['monthday'] = Mass_Shooting['date'].dt.day
# Mass_Shooting['weekday'] = Mass_Shooting['date'].dt.weekday

In [11]:
# Get the state from a location string: the text after the last ", "
def get_state(txt):
    return txt.split(", ")[-1]

Mass_Shooting['state'] = Mass_Shooting['Location'].apply(get_state)

In [12]:
Mass_Shooting.drop('Location',axis=1,inplace=True)
Mass_Shooting.columns = ['fatalities', 'injured', 'total_victims',
       'policeman_killed', 'age', 'employeed(Y/N)', 'employed_at',
       'mental_health_issues', 'gender', 'date', 'year', 'state']

In [187]:
print(len(Employment.state.unique()),np.sort(Employment.state.unique()),'\n',
  len(Mass_Shooting.state.unique()),np.sort(Mass_Shooting.state.unique()))

53 ['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Los Angeles County' 'Louisiana' 'Maine' 'Maryland' 'Massachusetts'
 'Michigan' 'Minnesota' 'Mississippi' 'Missouri' 'Montana' 'Nebraska'
 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York'
 'New York city' 'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma'
 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina' 'South Dakota'
 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia' 'Washington'
 'West Virginia' 'Wisconsin' 'Wyoming'] 
 53 [' Virginia' 'Alabama' 'Alaska' 'Arizona' 'Arkansas' 'CA' 'California'
 'Colorado' 'Connecticut' 'Delaware' 'Florida' 'Georgia' 'Hawaii' 'Idaho'
 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky' 'Louisiana' 'Lousiana'
 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota' 'Mississippi'
 'Missouri' 'Montana' 'NV' 'Nebraska' 'Nevada' 'New Jersey'
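
The printed values show that the mass-shooting labels are not consistent: `' Virginia'` has a leading space, `'CA'` and `'NV'` are abbreviations, and `'Lousiana'` is misspelled. Such rows cannot match the `Employment` table in a merge on `state`. The sketch below is one possible cleanup, not part of the original notebook: `STATE_FIXES` and `normalize_state` are hypothetical names, and the lookup table covers only the variants visible in the truncated output.

```python
# Hypothetical lookup table: only the variants visible in the printed output
# are listed; extend it after inspecting the full list of unique values.
STATE_FIXES = {
    "CA": "California",
    "NV": "Nevada",
    "Lousiana": "Louisiana",
}


def normalize_state(name: str) -> str:
    """Strip surrounding whitespace, then map known variants to full names."""
    cleaned = name.strip()
    return STATE_FIXES.get(cleaned, cleaned)


# Quick check on the problem values seen in the output.
samples = [" Virginia", "CA", "Lousiana", "NV", "Texas"]
print([normalize_state(s) for s in samples])
```

With pandas this could be applied as `Mass_Shooting['state'] = Mass_Shooting['state'].apply(normalize_state)` before merging on `['year', 'state']`. The `Employment` side also lists `'Los Angeles County'` and `'New York city'`, which are not states and may need separate handling.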

In [19]:
df = pd.merge(Mass_Shooting,GunPossession_1986_2018,on=['year'],how='left')

In [20]:
df = pd.merge(df,GunDeaths_2009_2018,on=['year'],how='left')

In [21]:
df = pd.merge(df,Employment,on=['year','state'],how='left')
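
A left merge keeps every incident row even when no `(year, state)` pair matches, and silently leaves the employment columns as NaN. The sketch below shows one way to surface those rows with the `indicator` and `validate` options of `pd.merge`. The two small frames are made-up stand-ins for `Mass_Shooting` and `Employment`, not real data.

```python
import pandas as pd

# Toy frames with made-up values, standing in for Mass_Shooting and Employment.
incidents = pd.DataFrame({
    "year": [2009, 2009, 2010],
    "state": ["Alabama", "CA", "Texas"],
    "total_victims": [16, 10, 7],
})
employment = pd.DataFrame({
    "year": [2009, 2009, 2010],
    "state": ["Alabama", "California", "Texas"],
    "unemployment_rate": [11.0, 11.2, 8.1],
})

merged = pd.merge(
    incidents, employment,
    on=["year", "state"], how="left",
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
    validate="many_to_one",  # raises if (year, state) is duplicated on the right
)

# Rows that found no employment record for their (year, state) pair.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["year", "state"]])
```

Here the `"CA"` row is reported as unmatched because the employment frame only knows `"California"`.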

In [22]:
df.set_index('date',inplace=True)

In [23]:
# Keep only incidents from 2009 through 2019.
df = df[df.year.between(2009, 2019)].copy()

In [24]:
print(Mass_Shooting.shape,GunPossession_1986_2018.shape,GunDeaths_2009_2018.shape,Employment.shape,df.shape)

(339, 12) (33, 3) (10, 4) (2332, 5) (224, 19)


In [25]:
df.head(2)

| date | fatalities | injured | total_victims | policeman_killed | age | employeed(Y/N) | employed_at | mental_health_issues | gender | year | state | total_licensees | licensed_business_entities | population | total_gun_deaths | total_children_teen_gun_deaths | state_population | state_labor_force | unemployment_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009-03-10 | 10 | 6 | 16 | 0 | 28 | -999 | -999 | No | Male | 2009 | Alabama | 115395 | 60349 | 307006550 | 31347 | 2811 | 3621410.0 | 2162999.0 | 11.0 |
| 2009-03-29 | 8 | 2 | 10 | 0 | 45 | -999 | -999 | Yes | Male | 2009 | North Carolina | 115395 | 60349 | 307006550 | 31347 | 2811 | 7117828.0 | 4570789.0 | 10.6 |


In [26]:
df.tail(2)

| date | fatalities | injured | total_victims | policeman_killed | age | employeed(Y/N) | employed_at | mental_health_issues | gender | year | state | total_licensees | licensed_business_entities | population | total_gun_deaths | total_children_teen_gun_deaths | state_population | state_labor_force | unemployment_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-12-06 | 3 | 8 | 11 | -999 | - | -999 | -999 | -999 | Male | 2019 | Florida | NaN | NaN | NaN | NaN | NaN | 17410114.0 | 10336749.0 | 3.1 |
| 2019-12-10 | 4 | 3 | 7 | -999 | 47 | -999 | -999 | -999 | Male | 2019 | New Jersey | NaN | NaN | NaN | NaN | NaN | 7070716.0 | 4493125.0 | 3.6 |
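
The head and tail rows contain placeholder values (`-999` and `-`) where information is unknown, and `age` is stored as text. Left as they are, these placeholders would distort sums and means. The sketch below shows one possible way to convert them to NaN. It runs on a made-up sample frame, and it assumes the placeholder may appear either as a number or as a string, which the output alone does not confirm.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up) mirroring the placeholder patterns seen in the output.
sample = pd.DataFrame({
    "policeman_killed": [0, -999],
    "age": ["28", "-"],
    "mental_health_issues": ["No", "-999"],
})

# Treat -999 (numeric or string) and "-" as missing.
cleaned = sample.replace([-999, "-999", "-"], np.nan)

# `age` is stored as text, so coerce it; unparseable entries become NaN.
cleaned["age"] = pd.to_numeric(cleaned["age"], errors="coerce")

# Count missing values per column after the cleanup.
print(cleaned.isna().sum())
```

The same two steps could be applied to `df` before computing statistics, so that missing entries are excluded from aggregates instead of being counted as -999.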


In [27]:
df.loc[df.year == 2013, 'total_victims'].sum()

104

In [31]:
df.loc[(df.year == 2013) & (df.state == 'Texas'), 'total_victims'].sum()

8
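
The two spot checks above sum `total_victims` for one year and for one year and state. If totals for every year and state are needed at once, a `groupby` is one option. The sketch below uses a made-up frame with the same column names as `df`; the numbers are illustrative only.

```python
import pandas as pd

# Toy incident rows (made-up values) with the same column names as `df`.
incidents = pd.DataFrame({
    "year": [2013, 2013, 2013, 2014],
    "state": ["Texas", "Texas", "Florida", "Texas"],
    "total_victims": [5, 3, 7, 4],
})

# One row per (year, state) with the summed victim count.
victims_by_year_state = (
    incidents.groupby(["year", "state"], as_index=False)["total_victims"].sum()
)
print(victims_by_year_state)
```

Passing `as_index=False` keeps `year` and `state` as ordinary columns, which makes the result easy to merge with other per-state tables.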

## 2.4 Explore data  
### a. Data transformation  

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 224 entries, 2009-03-10 to 2019-12-10
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   fatalities                      224 non-null    int64  
 1   injured                         224 non-null    int64  
 2   total_victims                   224 non-null    int64  
 3   policeman_killed                224 non-null    int64  
 4   age                             224 non-null    object 
 5   employeed(Y/N)                  224 non-null    int64  
 6   employed_at                     224 non-null    object 
 7   mental_health_issues            224 non-null    object 
 8   gender                          224 non-null    object 
 9   year                            224 non-null    int64  
 10  state                           224 non-null    object 
 11  total_licensees                 214 non-null    object 
 12  licensed_business

## 2.5 Target variables  
index: Year 2009 - 2019  
Dependent variable (y): total_victims  
Independent variables (Xi): population, state_population, age (shooter age), total_licensees, unemployment_rate

## 2.6 Save data  

In [284]:
df.to_csv('../data/Clean_data/MassShooting2009_2019_final.csv')

## 2.7 Summary  