# 2. Data wrangling   
   ### 2.1 Data overviews  
   ### 2.2 Importing related libs & modules  
   ### 2.3 Loading data  
        a. Gun Possession: number of guns by $\color{red}{\text{years}}$  
        b. Gun Deaths: number of deaths (homicide, suicide...) by $\color{red}{\text{years}}$  
        c. [Firearm Deaths by Age](https://webappa.cdc.gov/sasweb/ncipc/mortrate.html): number of deaths grouped by age, by $\color{red}{\text{years}}$  
        d. Mass Shooting: number of shooting cases with shooter age and employment status, by $\color{red}{\text{years, states}}$  
        e. Unemployment rate: unemployment rate by $\color{red}{\text{years, states}}$  
   ### 2.4 Explore the data  
        a. Data distribution & missing values  
        b. Numeric features  
        c. Category features   
   ### 2.5 Target variables  
   ### 2.6 Save data  
   ### 2.7 Summary  


## 2.1 Data Overviews  
### input data  
index: Year 2013 - 2018  
Dependent variable (y): Gun deaths per year and state  
Independent variables (Xi): Population, Gun possession, Employment rate, Age group, Shooter age, Shooter gender, Unemployment rate  
  
### What to do  
Load, transform & visualize data.  
Q: Add more features for the prediction?  

## 2.2 Import libs & modules  
### geoplot for geodetic display

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import seaborn as sns
import os

from library.sb_utils import save_file

## 2.3 Loading data  
### DF: GunDeaths_2009_2018; FirearmDeaths_2000_2018; GunPossession_1986_2018; Mass_Shooting; Employment;   
Additional data sorting by year.

In [87]:
GunDeaths_2009_2018  = pd.read_csv('../data/USA_Crimes/GunDeaths_2009_2018.csv').set_index('Year')
# FirearmDeaths_2000_2018  = pd.read_csv('../data/USA_Crimes/FirearmDeaths2000_2018.csv').set_index('Year')
GunPossession_1986_2018 = pd.read_csv('../data/USA_Crimes/GunPossession_1986_2018.csv', index_col=0)

In [88]:
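# Keep only the licence counts; .copy() avoids SettingWithCopyWarning on the assignment below.
GunPossession_1986_2018 = GunPossession_1986_2018[['Total Licensees ','Licensed Business Entities']].copy()
GunPossession_1986_2018['year'] = GunPossession_1986_2018.index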
# GunPossession_1986_2018.info()

In [None]:
# FirearmDeaths_2000_2018.drop(['State','Ethnicity','First Year','Last Year','Cause of Death'], axis=1, inplace=True)
# FirearmDeaths_2000_2018.columns

In [89]:
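# Keep population and death counts; .copy() avoids SettingWithCopyWarning on the assignment below.
GunDeaths_2009_2018 = GunDeaths_2009_2018[['Population','Total gun deaths','Total children and teen gun deaths']].copy()
GunDeaths_2009_2018['year'] = GunDeaths_2009_2018.index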
# GunDeaths_2009_2018.columns 

### **Combine data**
#### Data1: $\color{cyan}{\text{Year, Population, Total firearms, Total licenses, Gun deaths, age group, sex, race.}}$
#### DF1 = GunDeaths_2009_2018 + FirearmDeaths_2000_2018 + GunPossession_1986_2018

Yearly employment information by state

In [138]:
Employment_header = ['FIPS Code', 'State and area', 'Year', 'Civilian non-institutional population', 
  'Civilian labor force/Total', 'Civilian labor force/Percent of population', 
  'Civilian labor force/Employment/Total', 'Civilian labor force/Employment/Percent of population',
  'Civilian labor force/Unemployment/Total','Civilian labor force/Unemployment/Rate']

In [139]:
# Employment = pd.read_excel('../data/USA_Crimes/staadata.xlsx', header=0)
Employment=pd.read_excel(
     os.path.join("../data/USA_Crimes/", "staadata.xlsx"),
     engine='openpyxl', header=None, names= Employment_header,skiprows=8
).set_index('Year')

In [140]:
# Keep the columns needed for the merge; .copy() avoids SettingWithCopyWarning on the assignments below.
Employment = Employment[['State and area', 'Civilian non-institutional population', 
  'Civilian labor force/Total','Civilian labor force/Unemployment/Rate']].copy()
Employment.columns = ['state','State population','State labor force','Unemployment rate']
Employment['year'] = Employment.index

In [111]:
Employment.head(2)

| Year | FIPS Code | state | State population | State labor force | Unemployment rate | year |
|---|---|---|---|---|---|---|
| 1976 | 1 | Alabama | 2632667 | 1501284 | 6.8 | 1976 |
| 1976 | 2 | Alaska | 239917 | 163570 | 7.6 | 1976 |


Mass shooting data (shooter employment, mental health, gender, latitude, longitude), which will be joined with the 2013-2018 gun violence data.

In [25]:
Mass_Shooting = pd.read_csv('../data/USA_Crimes/US Mass Shooting 1966-2019 (cleaned).csv', parse_dates=True, index_col='Date')

In [28]:
MS_column_drop = ['S#','Title','Area','Incident Area','Open/Close Location','Target','Cause','Summary',
  'Shooter status','No. of shooter/suspect']
Mass_Shooting.sort_index(inplace=True)
Mass_Shooting.drop(MS_column_drop, axis=1, inplace=True)

In [43]:
Mass_Shooting['date'] = pd.to_datetime(Mass_Shooting.index)
Mass_Shooting['year'] = Mass_Shooting['date'].dt.year
# Mass_Shooting['month'] = Mass_Shooting['date'].dt.month
# Mass_Shooting['monthday'] = Mass_Shooting['date'].dt.day
# Mass_Shooting['weekday'] = Mass_Shooting['date'].dt.weekday

In [30]:
def get_state(txt):
    # The state is the last comma-separated part, e.g. "Spokane, Washington" -> "Washington".
    return txt.split(", ")[-1]

Mass_Shooting['state'] = Mass_Shooting['Location'].apply(get_state)
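
The same extraction can be written with pandas string methods instead of a row-wise `apply`. This is a minimal sketch on a toy frame, not the notebook's method; `toy` is an illustrative name and its two `Location` values are ones that occur in this dataset. Unlike `get_state`, the `.str` accessor passes missing values through as `NaN` rather than raising.

```python
import pandas as pd

# Toy stand-in for Mass_Shooting['Location'].
toy = pd.DataFrame({"Location": ["Spokane, Washington", "New Orleans, Louisiana"]})

# Split once from the right on ", " and keep the last piece, as get_state does.
toy["state"] = toy["Location"].str.rsplit(", ", n=1).str[-1]

print(toy["state"].tolist())
```

A location without a `", "` separator comes back unchanged in both variants, so such rows would still need a manual check.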

In [31]:
Mass_Shooting.head(2)

| Date | Location | Fatalities | Injured | Total victims | Policeman Killed | Age | Employeed (Y/N) | Employed at | Mental Health Issues | Race | Gender | Latitude | Longitude | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1971-11-11 | Spokane, Washington | 1 | 4 | 5 | 0 | -999 | -999 | -999 | Yes | White, White American or European American | Male | 47.673674 | -117.415984 | Washington |
| 1972-12-31 | New Orleans, Louisiana | 9 | 13 | 22 | 4 | 23 | -999 | -999 | Yes | Black, Black American or African American | Male | 30.068724 | -89.931474 | Louisiana |


### **Combine data**
#### Data2: $\color{cyan}{\text{Year, State, Population, Unemployment rate, Gun violence cases, Shooter gender, Shooter age, Employed.}}$  
#### DF2 = Gun_violence_2013_2018  + Mass_shooting + Employment

In [82]:
Gun_Violence_2013_2018 = pd.read_csv('../data/USA_Crimes/gun-violence-data_01-2013_03-2018.csv', parse_dates=True, index_col='date')

In [83]:
# Columns not used in this analysis ('gun_stolen' was listed twice in the original list).
GV_column_drop = ['city_or_county','address','gun_stolen','incident_url','source_url',
  'incident_url_fields_missing','congressional_district','gun_type','incident_characteristics',
  'latitude','longitude','location_description','notes','participant_name','participant_age_group','participant_status','participant_relationship','sources',
  'state_house_district','state_senate_district']
Gun_Violence_2013_2018.sort_index(inplace=True)
Gun_Violence_2013_2018.drop(GV_column_drop, axis=1, inplace=True)

In [49]:
Gun_Violence_2013_2018.head(2)

| date | incident_id | state | n_killed | n_injured | n_guns_involved | participant_age | participant_gender | participant_type |
|---|---|---|---|---|---|---|---|---|
| 2013-01-01 | 461105 | Pennsylvania | 0 | 4 |  | 0::20 | 0::Male\|\|1::Male\|\|3::Male\|\|4::Female | 0::Victim\|\|1::Victim\|\|2::Victim\|\|3::Victim\|\|4:... |
| 2013-01-01 | 460726 | California | 1 | 3 |  | 0::20 | 0::Male | 0::Victim\|\|1::Victim\|\|2::Victim\|\|3::Victim\|\|4:... |


In [84]:
Gun_Violence_2013_2018['date'] = pd.to_datetime(Gun_Violence_2013_2018.index)
Gun_Violence_2013_2018['year'] = Gun_Violence_2013_2018['date'].dt.year
# Gun_Violence_2013_2018['month'] = Gun_Violence_2013_2018['date'].dt.month
# Gun_Violence_2013_2018['monthday'] = Gun_Violence_2013_2018['date'].dt.day
# Gun_Violence_2013_2018['weekday'] = Gun_Violence_2013_2018['date'].dt.weekday
Gun_Violence_2013_2018['Total victims'] = Gun_Violence_2013_2018['n_killed'] + Gun_Violence_2013_2018['n_injured']

## 2.4 Explore data  
### a. Data transforming  

In [85]:
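def get_user_mapping(txt):
    """Parse 'idx::value||idx::value' into {idx: value}; 'NA' gives an empty dict."""
    if txt == "NA":
        return {}
    mapping = {}
    for d in txt.split("||"):
        parts = d.split("::")
        if len(parts) < 2:
            continue  # skip malformed entries that have no '::' separator
        key, val = parts[0], parts[1]
        if key not in mapping:  # keep the first value seen for each participant index
            mapping[key] = val
    return mapping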

Gun_Violence_2013_2018['participant_type'] = Gun_Violence_2013_2018['participant_type'].fillna("NA")
Gun_Violence_2013_2018['participant_type_map'] = Gun_Violence_2013_2018['participant_type'].apply(lambda x : get_user_mapping(x))
Gun_Violence_2013_2018['participant_age'] = Gun_Violence_2013_2018['participant_age'].fillna("NA")
Gun_Violence_2013_2018['participant_age_map'] = Gun_Violence_2013_2018['participant_age'].apply(lambda x : get_user_mapping(x))
Gun_Violence_2013_2018['participant_gender'] = Gun_Violence_2013_2018['participant_gender'].fillna("NA")
Gun_Violence_2013_2018['participant_gender_map'] = Gun_Violence_2013_2018['participant_gender'].apply(lambda x : get_user_mapping(x))

## Finding the Suspect Age Groups
suspect_age_groups = {}
for i, row in Gun_Violence_2013_2018.iterrows():
    suspects = []
    for k,v in row['participant_type_map'].items():
        if "suspect" in v.lower():
            suspects.append(k)
    for suspect in suspects:
        if suspect in row['participant_age_map']:
            ag = row['participant_age_map'][suspect]
            # Count every occurrence, including the first (starting at 0 undercounts each age by one).
            suspect_age_groups[ag] = suspect_age_groups.get(ag, 0) + 1

# suspect_age_groups = dict(sorted(suspect_age_groups.items()))
trace1 = go.Bar(x=list(map(int,suspect_age_groups.keys())), y=list(suspect_age_groups.values()), opacity=0.75, name="suspect age", marker=dict(color='rgba(200, 20, 160, 0.6)'))
layout = dict(height=400, title='Suspects Age - Distribution', xaxis=dict(range=[0, 100]), legend=dict(orientation="h"))
fig = go.Figure(data=[trace1], layout=layout)
iplot(fig)
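
A `collections.Counter` is an alternative way to tally suspect ages, since it needs no manual initialisation of missing keys. The sketch below is an illustration under assumptions, not the notebook's code: `parse_pairs` is a hypothetical stand-in for `get_user_mapping`, and the two incidents (including the `Subject-Suspect` label) are made-up strings in the same `idx::value||idx::value` format.

```python
from collections import Counter

def parse_pairs(txt):
    """Parse 'idx::value||idx::value' into {idx: value}, skipping malformed parts."""
    mapping = {}
    for part in txt.split("||"):
        pieces = part.split("::")
        if len(pieces) >= 2:
            mapping.setdefault(pieces[0], pieces[1])  # first value wins
    return mapping

# Hypothetical incidents: (participant_type, participant_age) strings.
incidents = [
    ("0::Victim||1::Subject-Suspect", "0::20||1::25"),
    ("0::Subject-Suspect||1::Subject-Suspect", "0::25||1::31"),
]

suspect_age_counts = Counter()
for type_txt, age_txt in incidents:
    types, ages = parse_pairs(type_txt), parse_pairs(age_txt)
    for idx, role in types.items():
        # Only participants flagged as suspects with a known age are counted.
        if "suspect" in role.lower() and idx in ages:
            suspect_age_counts[int(ages[idx])] += 1

print(sorted(suspect_age_counts.items()))
```

Converting the ages to `int` while counting also means the keys can be passed to the bar chart without a separate `map(int, ...)` step.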

In [94]:
# %%debug 
ag = []
for i, row in Gun_Violence_2013_2018.iterrows():
    suspects = []
    for k,v in row['participant_type_map'].items():
        if "suspect" in v.lower():
            suspects.append(k)    
    b=[]
    for suspect in suspects:        
        if suspect in row['participant_age_map']:
            b.append(row['participant_age_map'][suspect])
    ag.append(b)    

Gun_Violence_2013_2018['suspect_age'] = ag
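
`suspect_age` now holds a list of age strings per incident, which is not directly usable as a numeric feature. One possible reduction, shown here only as a sketch, is the mean suspect age per incident; `mean_age` is a hypothetical helper and the choice of the mean (rather than the minimum, maximum, or first suspect) is an assumption.

```python
import math

def mean_age(ages):
    """Average a list of age strings; return NaN when no usable age is present."""
    numeric = [int(a) for a in ages if str(a).isdigit()]
    return sum(numeric) / len(numeric) if numeric else math.nan

print(mean_age(["25", "31"]))
print(mean_age([]))
```

It could be applied with `Gun_Violence_2013_2018['suspect_age'].apply(mean_age)`, storing the result in a new column (for example `suspect_age_mean`, a name not used elsewhere in this notebook).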

In [95]:
Gun_Violence_2013_2018.drop(['participant_type','participant_age','participant_gender'],axis=1,inplace=True)
Gun_Violence_2013_2018.head(2)

| date | incident_id | state | n_killed | n_injured | n_guns_involved | date | year | Total victims | participant_type_map | participant_age_map | participant_gender_map | suspect_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-01 | 461105 | Pennsylvania | 0 | 4 |  | 2013-01-01 | 2013 | 4 | {'0': 'Victim', '1': 'Victim', '2': 'Victim', ... | {'0': '20'} | {'0': 'Male', '1': 'Male', '3': 'Male', '4': '... | [] |
| 2013-01-01 | 460726 | California | 1 | 3 |  | 2013-01-01 | 2013 | 4 | {'0': 'Victim', '1': 'Victim', '2': 'Victim', ... | {'0': '20'} | {'0': 'Male'} | [] |


In [96]:
GunDeaths_2009_2018.head(2)

| Year | Population | Total gun deaths | Total children and teen gun deaths | year |
|---|---|---|---|---|
| 2009 | 307006550 | 31347 | 2811 | 2009 |
| 2010 | 309330219 | 31672 | 2711 | 2010 |


In [97]:
GunPossession_1986_2018.head(2)

| Year | Total Licensees | Licensed Business Entities | year |
|---|---|---|---|
| 1986 | 267166 | 256527 | 1986 |
| 1987 | 262022 | 250928 | 1987 |


In [141]:
Employment.head(2)

| Year | state | State population | State labor force | Unemployment rate | year |
|---|---|---|---|---|---|
| 1976 | Alabama | 2632667 | 1501284 | 6.8 | 1976 |
| 1976 | Alaska | 239917 | 163570 | 7.6 | 1976 |


In [129]:
print(len(Employment.state.unique()),np.sort(Employment.state.unique()),'\n',
  len(df.state.unique()),np.sort(df.state.unique()))

53 ['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Los Angeles County' 'Louisiana' 'Maine' 'Maryland' 'Massachusetts'
 'Michigan' 'Minnesota' 'Mississippi' 'Missouri' 'Montana' 'Nebraska'
 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York'
 'New York city' 'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma'
 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina' 'South Dakota'
 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia' 'Washington'
 'West Virginia' 'Wisconsin' 'Wyoming'] 
 51 ['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota'
 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire'
 'New 
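
The listing above shows 53 distinct areas in `Employment` against 51 states in `df`; the two extras are 'Los Angeles County' and 'New York city'. A left merge on `state` ignores them, but filtering them out explicitly documents the intent. The following is a sketch on toy Series, with `employment_areas` and `incident_states` as illustrative names that stand in for `Employment.state` and `df.state`.

```python
import pandas as pd

# Toy stand-ins: a few area names from Employment and the states seen in the incidents.
employment_areas = pd.Series(
    ["California", "Los Angeles County", "New York", "New York city"]
)
incident_states = pd.Series(["California", "New York"])

# Areas that appear only in the employment data.
extra = sorted(set(employment_areas) - set(incident_states))
print(extra)

# Keep only rows whose area is a state present in the incident data.
kept = employment_areas[employment_areas.isin(incident_states)]
print(kept.tolist())
```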

In [56]:
def add_columns(row):
  # Look up the national population for this incident's year (NaN if the year is not covered).
  return GunDeaths_2009_2018['Population'].get(row['year'], np.nan)

In [142]:
df = pd.merge(Gun_Violence_2013_2018,GunPossession_1986_2018,on=['year'],how='left')

In [143]:
df = pd.merge(df,GunDeaths_2009_2018,on=['year'],how='left')

In [144]:
df = pd.merge(df,Employment,on=['year','state'],how='left')

In [145]:
print(Gun_Violence_2013_2018.shape,GunPossession_1986_2018.shape,GunDeaths_2009_2018.shape,Employment.shape,df.shape)

(239677, 12) (33, 3) (10, 4) (2332, 5) (239677, 20)
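
The row count stays at 239677 after the three left merges, which is what one expects when each lookup table has at most one row per key. `pd.merge` can enforce that with `validate` and flag unmatched rows with `indicator`. This is a sketch on toy frames rather than the notebook's data: `incidents` and `yearly` are illustrative names, and the 2019 row is made up to show an unmatched key.

```python
import pandas as pd

incidents = pd.DataFrame({
    "incident_id": [461105, 460726, 1],
    "year": [2013, 2013, 2019],  # the last row is made up and has no match
})
yearly = pd.DataFrame({"year": [2013], "Population": [316497531]})

merged = pd.merge(
    incidents, yearly, on="year", how="left",
    validate="many_to_one",  # raises MergeError if 'year' repeats in yearly
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

print(len(merged) == len(incidents))
print(merged["_merge"].value_counts())
```

Rows marked `left_only` are incidents whose key was not found in the lookup table, so their merged columns are `NaN`.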


In [150]:
df.set_index('date',inplace=True)
df.head(2)

| date | incident_id | state | n_killed | n_injured | n_guns_involved | year | Total victims | participant_type_map | participant_age_map | participant_gender_map | suspect_age | Total Licensees | Licensed Business Entities | Population | Total gun deaths | Total children and teen gun deaths | State population | State labor force | Unemployment rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-01 | 461105 | Pennsylvania | 0 | 4 |  | 2013 | 4 | {'0': 'Victim', '1': 'Victim', '2': 'Victim', ... | {'0': '20'} | {'0': 'Male', '1': 'Male', '3': 'Male', '4': '... | [] | 139244 | 74795 | 316497531 | 33636 | 2465 | 10178255 | 6442411 | 7.4 |
| 2013-01-01 | 460726 | California | 1 | 3 |  | 2013 | 4 | {'0': 'Victim', '1': 'Victim', '2': 'Victim', ... | {'0': '20'} | {'0': 'Male'} | [] | 139244 | 74795 | 316497531 | 33636 | 2465 | 29637113 | 18624992 | 8.9 |


In [162]:
df['Total victims'][df.year==2013].sum()

1296

In [168]:
df['Total victims'][(df.year==2013) & (df.state=='Texas')].sum()

65
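
The two sums above are single cells of a year-by-state table. A `groupby` produces every cell at once, which may be a convenient shape for a per-year, per-state target. This is a sketch on a toy frame: the first two rows mirror the preview of `df`, the third is made up, and `victims` is an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2013, 2013, 2013],
    "state": ["Pennsylvania", "California", "California"],
    "Total victims": [4, 4, 2],  # the third row is made up
})

# One row per (year, state) with the summed number of victims.
victims = toy.groupby(["year", "state"], as_index=False)["Total victims"].sum()
print(victims)
```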

##  2.5 Target variables  
index: Year 2013 - 2018  
Dependent variable (y): Total victims  
Independent variables (Xi): Population, State population, suspect_age, Total Licensees, Unemployment rate

## 2.6 Save data  

In [284]:
df.to_csv('../data/Clean_data/GunViolence2013_2018_final.csv')
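
CSV has no native type for the dict columns (`participant_*_map`) or the list column (`suspect_age`), so they are written as their string representation and come back as plain strings on reload. A sketch of restoring them with `ast.literal_eval` through the `converters` argument of `pd.read_csv` follows; the toy frame, its values, and the file name `toy.csv` are made up for illustration.

```python
import ast
import os
import tempfile

import pandas as pd

# Toy frame with a list-valued column, like 'suspect_age' (values are made up).
toy = pd.DataFrame({"incident_id": [1], "suspect_age": [["25", "31"]]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    toy.to_csv(path, index=False)

    # Default reload: the list comes back as the string "['25', '31']".
    as_text = pd.read_csv(path)

    # Parse the column back into Python lists while reading.
    restored = pd.read_csv(path, converters={"suspect_age": ast.literal_eval})

print(type(as_text.loc[0, "suspect_age"]))
print(type(restored.loc[0, "suspect_age"]))
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip.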

## 2.7 Summary  