# 2. Data wrangling   
   ### * 2.1 Data overviews  
   ### * 2.2 Importing related libs & modules  
   ### * 2.3 Loading data  
        * a. Gun Possession: number of guns by $\color{red}{\text{years}}$  
        * b. Gun Deaths: number of deaths (homicide, suicide...) by $\color{red}{\text{years}}$  
        * c. [Firearm Deaths by Age](https://webappa.cdc.gov/sasweb/ncipc/mortrate.html): number of deaths grouped by age, listed by $\color{red}{\text{years}}$  
        * d. Mass Shooting: number of shooting cases with shooter age and employment status by $\color{red}{\text{years, states}}$  
        * e. Unemployment rate: Unemployment rate by $\color{red}{\text{years, states}}$  
   ### * 2.4 Explore the data  
        * a. Data distribution & missing values  
        * b. Numeric features  
        * c. Category features   
   ### * 2.5 Target variables  
   ### * 2.6 Save data  
   ### * 2.7 Summary  


## 2.1 Data Overviews  
### Targeted data  
index: Year 2009 - 2018  
Dependent variable (y): Gun deaths; Age group; Shooter age; Employed;    
Independent variables (Xi): Population, Gun possession, Employment rate, Election year  
  
### What to do  
Load, transform & visualize data.  
Q: Add more features for the prediction?  

## 2.2 Import libs & modules  
### geoplot for geodetic display

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import seaborn as sns
import os

from library.sb_utils import save_file

## 2.3 Loading data  
### DF: GunDeaths_2009_2018; FirearmDeaths_2000_2018; GunPossession_1986_2018; Mass_Shooting; Employment;  

In [2]:
GunDeaths_2009_2018  = pd.read_csv('../data/USA_Crimes/GunDeaths_2009_2018.csv').set_index('Year')
FirearmDeaths_2000_2018  = pd.read_csv('../data/USA_Crimes/FirearmDeaths2000_2018.csv').set_index('Year')
GunPossession_1986_2018 = pd.read_csv('../data/USA_Crimes/GunPossession_1986_2018.csv', index_col=0)

In [3]:
GunPossession_1986_2018 = GunPossession_1986_2018[['Total Manufactured Firearms','Total Imports','Total Licensees ']]
# GunPossession_1986_2018.info()

In [4]:
GunDeaths_2009_2018 = GunDeaths_2009_2018[['Population','Total gun deaths','Total children and teen gun deaths']]
GunDeaths_2009_2018.columns

Index(['Population', 'Total gun deaths', 'Total children and teen gun deaths'], dtype='object')

In [5]:
FirearmDeaths_2000_2018.drop(['State','Ethnicity','First Year','Last Year','Cause of Death'], axis=1, inplace=True)
FirearmDeaths_2000_2018.columns

Index(['Sex', 'Race', 'Age Group', 'Deaths', 'Population', 'Crude Rate'], dtype='object')

### Combine data
##### Data1: $\color{cyan}{\text{Year, Population, Total firearms, Total licenses, Gun deaths, age group, sex, race.}}$
##### DF1 = GunDeaths_2009_2018 + FirearmDeaths_2000_2018 + GunPossession_1986_2018
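
The combination described above is not coded in this notebook. A minimal sketch of one way to do it, assuming all three tables are indexed by `Year` as in the loading cells; the toy values and the names `yearly`, `df1_sketch` and the `_total` suffix are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the three yearly tables (all values are invented).
gun_deaths = pd.DataFrame(
    {"Population": [100, 110, 120], "Total gun deaths": [10, 11, 12]},
    index=pd.Index([2009, 2010, 2011], name="Year"),
)
gun_possession = pd.DataFrame(
    {"Total Manufactured Firearms": [5, 6, 7, 8]},
    index=pd.Index([2008, 2009, 2010, 2011], name="Year"),
)
firearm_deaths = pd.DataFrame(
    {"Age Group": ["0-17", "18-24", "0-17"], "Deaths": [1, 2, 3], "Population": [20, 15, 21]},
    index=pd.Index([2009, 2009, 2010], name="Year"),
)

# One row per year: the inner join keeps only years present in both tables.
yearly = gun_deaths.join(gun_possession, how="inner")

# firearm_deaths has several rows per year, so the yearly columns repeat for
# each (year, age group) row. Both sides carry a Population column, so the
# yearly one gets a suffix to avoid a name clash.
df1_sketch = firearm_deaths.join(yearly, how="inner", rsuffix="_total")
print(df1_sketch)
```

Joining on the shared `Year` index restricts the result to the years all sources cover, which matches the 2009 - 2018 target range stated in 2.1.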

In [6]:
Employment_header = ['FIPS Code', 'State and area', 'Year', 'Civilian non-institutional population', 'Civilian labor force/Total', 'Civilian labor force', 'Civilian labor force/Percent of population', 'Civilian labor force/Employment/Total', 'Civilian labor force/Employment/Percent of population','Civilian labor force/Unemployment/Total','Civilian labor force/Unemployment/Rate']

In [7]:
# Employment = pd.read_excel('../data/USA_Crimes/staadata.xlsx', header=0)
Employment=pd.read_excel(
     os.path.join("../data/USA_Crimes/", "staadata.xlsx"),
     engine='openpyxl', header=None, names= Employment_header
).set_index('Year')

In [8]:
Employment = Employment.iloc[8:]
# Employment.head(10)

In [9]:
Mass_Shooting = pd.read_csv('../data/USA_Crimes/US Mass Shooting 1966-2019 (cleaned).csv', parse_dates=True, index_col='Date')

In [10]:
MS_column_drop = ['S#','Open/Close Location','Latitude','Longitude']
Mass_Shooting.sort_index(inplace=True)
Mass_Shooting.drop(MS_column_drop, axis=1, inplace=True)

In [11]:
Mass_Shooting['date'] = pd.to_datetime(Mass_Shooting.index)
Mass_Shooting['year'] = Mass_Shooting['date'].dt.year
Mass_Shooting['month'] = Mass_Shooting['date'].dt.month
Mass_Shooting['monthday'] = Mass_Shooting['date'].dt.day
Mass_Shooting['weekday'] = Mass_Shooting['date'].dt.weekday
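
The same calendar columns are derived again for `Gun_Violence_2013_2018` later on, so the step could be wrapped in a helper. A sketch, assuming the frame has a date-like index; `add_date_parts` and the demo frame are hypothetical names, not part of the notebook:

```python
import pandas as pd

def add_date_parts(df):
    """Return a copy of df with calendar columns derived from its index."""
    out = df.copy()
    idx = pd.to_datetime(out.index)
    out["year"] = idx.year
    out["month"] = idx.month
    out["monthday"] = idx.day
    out["weekday"] = idx.weekday  # Monday=0 ... Sunday=6
    return out

# Two invented rows indexed by date.
demo = pd.DataFrame(
    {"Fatalities": [3, 5]},
    index=pd.to_datetime(["2017-10-01", "2018-02-14"]),
)
demo = add_date_parts(demo)
print(demo[["year", "month", "monthday", "weekday"]])
```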

In [12]:
def get_state(txt):
    """Return the text after the last ", " in a location string (the state)."""
    return txt.split(", ")[-1]

Mass_Shooting['state'] = Mass_Shooting['Location'].apply(get_state)
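
An alternative to the row-wise `apply` is pandas' vectorized string accessor, which also passes missing values through as `NaN` instead of raising. A sketch on invented location strings; whether the real `Location` column contains missing or comma-free entries is an assumption to check:

```python
import pandas as pd

# Invented examples of "City, State" strings, plus one without a comma.
locations = pd.Series(["Las Vegas, Nevada", "Parkland, Florida", "Texas"])

# Split once from the right, keep the last piece, and strip stray spaces.
states = locations.str.rsplit(",", n=1).str[-1].str.strip()
print(states.tolist())
```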

### Combine data 
##### Data2: $\color{cyan}{\text{Year, State, Population, Unemployment rate, Gun violence cases, Shooter gender, Shooter age, Employed.}}$  
##### DF2 = Gun_Violence_2013_2018 + Mass_Shooting + Employment
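
One possible shape for this combination is to aggregate incidents to one row per year and state, then attach the unemployment rate. The sketch below uses invented rows; it assumes the state names in both tables match exactly, and `per_state_year` and `df2_sketch` are hypothetical names:

```python
import pandas as pd

# Invented incident rows standing in for Gun_Violence_2013_2018.
incidents = pd.DataFrame({
    "year": [2014, 2014, 2015],
    "state": ["Texas", "Texas", "Ohio"],
    "n_killed": [1, 0, 2],
    "n_injured": [2, 1, 0],
})
# Invented rates standing in for the Employment table.
employment = pd.DataFrame({
    "Year": [2014, 2015],
    "State and area": ["Texas", "Ohio"],
    "Civilian labor force/Unemployment/Rate": [5.1, 4.9],
})

# One row per (year, state): number of cases and total killed/injured.
per_state_year = incidents.groupby(["year", "state"], as_index=False).agg(
    cases=("n_killed", "size"),
    killed=("n_killed", "sum"),
    injured=("n_injured", "sum"),
)

# validate="many_to_one" raises if a (Year, State) pair repeats in employment.
df2_sketch = per_state_year.merge(
    employment,
    left_on=["year", "state"],
    right_on=["Year", "State and area"],
    how="left",
    validate="many_to_one",
)
print(df2_sketch)
```

Aggregating before merging keeps the result at one row per year and state, so the unemployment rate is not duplicated across individual incidents.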

In [13]:
Gun_Violence_2013_2018 = pd.read_csv('../data/USA_Crimes/gun-violence-data_01-2013_03-2018.csv', parse_dates=True, index_col='date')
# GV_column_drop = ['incident_id','city_or_county','address','gun_stolen','incident_url','source_url','location_description','incident_characteristics','participant_name','incident_url_fields_missing', 'participant_status','congressional_district','latitude','longitude','sources','state_house_district','state_senate_district']
# Gun_Violence_2013_2018.sort_index(inplace=True)
# Gun_Violence_2013_2018.drop(GV_column_drop, axis=1, inplace=True)

In [14]:
Gun_Violence_2013_2018['date'] = pd.to_datetime(Gun_Violence_2013_2018.index)
Gun_Violence_2013_2018['year'] = Gun_Violence_2013_2018['date'].dt.year
Gun_Violence_2013_2018['month'] = Gun_Violence_2013_2018['date'].dt.month
Gun_Violence_2013_2018['monthday'] = Gun_Violence_2013_2018['date'].dt.day
Gun_Violence_2013_2018['weekday'] = Gun_Violence_2013_2018['date'].dt.weekday
Gun_Violence_2013_2018['Total victims'] = Gun_Violence_2013_2018['n_killed'] + Gun_Violence_2013_2018['n_injured']

## 2.4 Explore data  
### a. Data transforming  

In [15]:
def get_user_mapping(txt):
    if txt == "NA":
        return {}
    mapping = {}
    for d in txt.split("||"):
        try:
            key = d.split("::")[0]
            val = d.split("::")[1]
            if key not in mapping:
                mapping[key] = val
        except IndexError:  # entry has no "::" separator, so skip it
            pass
    return mapping

Gun_Violence_2013_2018['participant_type'] = Gun_Violence_2013_2018['participant_type'].fillna("NA")
Gun_Violence_2013_2018['participant_type_map'] = Gun_Violence_2013_2018['participant_type'].apply(lambda x : get_user_mapping(x))
Gun_Violence_2013_2018['participant_age'] = Gun_Violence_2013_2018['participant_age'].fillna("NA")
Gun_Violence_2013_2018['participant_age_map'] = Gun_Violence_2013_2018['participant_age'].apply(lambda x : get_user_mapping(x))
Gun_Violence_2013_2018['participant_gender'] = Gun_Violence_2013_2018['participant_gender'].fillna("NA")
Gun_Violence_2013_2018['participant_gender_map'] = Gun_Violence_2013_2018['participant_gender'].apply(lambda x : get_user_mapping(x))

## Finding the Suspect Age Groups
suspect_age_groups = {}
for i, row in Gun_Violence_2013_2018.iterrows():
    suspects = []
    for k,v in row['participant_type_map'].items():
        if "suspect" in v.lower():
            suspects.append(k)
    for suspect in suspects:
        if suspect in row['participant_age_map']:
            ag = row['participant_age_map'][suspect]
            if ag not in suspect_age_groups:
                suspect_age_groups[ag] = 1  # the first occurrence counts as one suspect
            else:
                suspect_age_groups[ag] += 1

# suspect_age_groups = dict(sorted(suspect_age_groups.items()))
trace1 = go.Bar(x=list(map(int,suspect_age_groups.keys())), y=list(suspect_age_groups.values()), opacity=0.75, name="suspects", marker=dict(color='rgba(200, 20, 160, 0.6)'))
layout = dict(height=400, title='Suspects Age - Distribution', xaxis=dict(range=[0, 100]), legend=dict(orientation="h"))
fig = go.Figure(data=[trace1], layout=layout)
iplot(fig)
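
The parsing and counting logic above can be checked on a single record without loading the dataset. A standard-library sketch; the sample strings are invented to match the `key::value||key::value` layout the parser expects, and `parse_participants` is a hypothetical stand-in for `get_user_mapping`:

```python
from collections import Counter

def parse_participants(txt):
    """Parse 'k0::v0||k1::v1' into a dict, keeping the first value per key."""
    mapping = {}
    for item in txt.split("||"):
        key, sep, val = item.partition("::")
        if sep and key not in mapping:  # skip entries without "::"
            mapping[key] = val
    return mapping

# Invented record: participant index -> type, and participant index -> age.
types = parse_participants("0::Victim||1::Subject-Suspect||2::Subject-Suspect")
ages = parse_participants("0::30||1::22||2::22")

# Count ages only for participants whose type mentions "suspect".
suspect_ages = Counter(
    int(ages[k]) for k, v in types.items() if "suspect" in v.lower() and k in ages
)
print(suspect_ages)
```

`Counter` starts each new key at one, which avoids having to initialise the count by hand.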

In [16]:
# %%debug 
ag = []
for i, row in Gun_Violence_2013_2018.iterrows():
    suspects = []
    for k,v in row['participant_type_map'].items():
        if "suspect" in v.lower():
            suspects.append(k)    
    b=[]
    for suspect in suspects:        
        if suspect in row['participant_age_map']:
            b.append(row['participant_age_map'][suspect])
    ag.append(b)    

Gun_Violence_2013_2018['suspect_age'] = ag

In [17]:
idx = Gun_Violence_2013_2018.index.intersection(Mass_Shooting.index)
len(idx)

20009

In [282]:
idx = pd.merge(Mass_Shooting, Gun_Violence_2013_2018, how ='inner', on =['Total victims'], left_index=True)
# idx = pd.concat([Gun_Violence_2013_2018, Mass_Shooting], axis=1, join="inner")

In [283]:
idx.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 366148 entries, 2013-01-19 to 2018-02-14
Data columns (total 64 columns):
 #   Column                       Non-Null Count   Dtype         
---  ------                       --------------   -----         
 0   Title                        366148 non-null  object        
 1   Location                     366148 non-null  object        
 2   Area                         366148 non-null  object        
 3   Incident Area                366148 non-null  object        
 4   Target                       366148 non-null  object        
 5   Cause                        366148 non-null  object        
 6   Summary                      366148 non-null  object        
 7   Shooter status               366148 non-null  object        
 8   No. of shooter/suspect       366148 non-null  object        
 9   Fatalities_x                 366148 non-null  int64         
 10  Injured                      366148 non-null  int64         
 11  Total vict

In [287]:
len(Gun_Violence_2013_2018)

239677
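
The merged frame has 366148 rows while `Gun_Violence_2013_2018` has 239677, which suggests that `Total victims` values repeat and each match is paired with every other row sharing the same value. A toy sketch of that effect and of the `validate` argument of `merge`, which can flag it; the frames and values are invented:

```python
import pandas as pd

left = pd.DataFrame({"Total victims": [4, 4, 7], "Title": ["A", "B", "C"]})
right = pd.DataFrame({"Total victims": [4, 4, 4, 7], "incident_id": [1, 2, 3, 4]})

merged = left.merge(right, on="Total victims", how="inner")
# Key 4 pairs 2 left rows with 3 right rows; key 7 pairs 1 with 1.
n_rows = len(merged)

# validate raises MergeError when the key is not unique on the stated side.
try:
    left.merge(right, on="Total victims", validate="one_to_one")
    is_one_to_one = True
except pd.errors.MergeError:
    is_one_to_one = False

print(n_rows, is_one_to_one)
```

A key that identifies an incident more specifically than the victim count, such as the date plus the state, would likely be needed to match records one to one; that choice is left open here.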

## 2.5 Target variables  
index: Year 2009 - 2018  
Dependent variable (y): Gun deaths; Age group; Shooter age; Employed;    
Independent variables (Xi): Population, Gun possession, Employment rate, Election year 

## 2.6 Save data  

In [18]:
Gun_Violence_2013_2018.to_csv('../data/Clean_data/GunViolence2013_2018.csv')

In [284]:
idx.to_csv('../data/Clean_data/GunViolence2013_2018_final.csv')

## 2.7 Summary  