# Mother Jones Mass Shooting Data Cleaning
### Authors: Joe Acosta
Imports the initial data collected by Mother Jones and cleans it. The data is currently imported directly from the website and cleaned so that future research can expand on its findings.

Mother Jones Initial Data: https://www.motherjones.com/politics/2012/12/mass-shootings-mother-jones-full-data/

In [1]:
import pandas as pd
import numpy as np
# import re

## Initial Data Import
Creates a Pandas DataFrame given the link to the initial data.
Verifies the data types and corrects any discrepancies.

In [2]:
mj_url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQBEbQoWMn_P81DuwmlQC0_jr2sJDzkkC0mvF6WLcM53ZYXi8RMfUlunvP1B5W0jRrJvH-wc-WGjDB1/pub?gid=0&single=true&output=csv'
mjms_df = pd.read_csv(mj_url)

In [3]:
mjms_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151 entries, 0 to 150
Data columns (total 24 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   case                              151 non-null    object
 1   location                          151 non-null    object
 2   date                              151 non-null    object
 3   summary                           151 non-null    object
 4   fatalities                        151 non-null    int64 
 5   injured                           151 non-null    int64 
 6   total_victims                     151 non-null    int64 
 7   location.1                        151 non-null    object
 8   age_of_shooter                    151 non-null    object
 9   prior_signs_mental_health_issues  151 non-null    object
 10  mental_health_details             151 non-null    object
 11  weapons_obtained_legally          151 non-null    object
 12  where_obtained        

## Data Cleaning

### Date Data Cleaning
The date column uses various formats for the year, and there is a separate year column. Updates the date so that it is in mm/dd/yyyy format, then drops the year column. Also converts the date to the datetime64[ns] type, allowing date-related operations such as sorting or grouping by year, month, day, decade, etc.

In [4]:
def update_year(date, year):
    '''
    Updates the date ensuring it's in mm/dd/yyyy format

    Parameters:
     - date (String): date in mm/dd/yy or mm/dd/yyyy
     - year (String): year in yyyy format

    Return:
     - _date (String): date in the format mm/dd/yyyy
    '''
    _month_day = '/'.join(date.split('/')[:2])
    _date = _month_day + '/' + year
    return _date
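
A quick sanity check of the helper, as a minimal sketch. The function is repeated here so the snippet runs on its own, and the input dates are made up for illustration rather than taken from the data.

```python
def update_year(date, year):
    """Swap the year part of an m/d/y date string for the given four-digit year."""
    month_day = '/'.join(date.split('/')[:2])
    return month_day + '/' + year

# Made-up inputs: one with a two-digit year, one already using four digits
short_year = update_year('4/20/99', '1999')
long_year = update_year('12/14/2012', '2012')
print(short_year, long_year)
```

Both inputs end up with a four-digit year, which is what lets the later datetime conversion treat every row the same way.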

In [5]:
# Rebuilds date as mm/dd/yyyy using the year column
mjms_df.date = mjms_df.apply(lambda row: update_year(row.date, str(row.year)), axis=1)

# Converts date series to pd date type
mjms_df.date = pd.to_datetime(mjms_df.date)

# Drops the year column
mjms_df = mjms_df.drop('year', axis=1)

In [6]:
# Verifies year can get extracted from the date col
mjms_df.date.sample(3).dt.year

36    2019
94    2012
97    2011
Name: date, dtype: int32
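
Once the column is a datetime type, grouping by decade is a short step. The sketch below assumes a toy Series of hypothetical dates, not the real data, and shows one possible way to derive a decade from the `.dt.year` accessor.

```python
import pandas as pd

# Hypothetical dates, already in mm/dd/yyyy form
toy_dates = pd.Series(
    pd.to_datetime(['04/20/1999', '12/14/2012', '08/03/2019'], format='%m/%d/%Y')
)

# Integer-divide the year by 10 to bucket each date into its decade
decades = toy_dates.dt.year // 10 * 10

# Number of toy cases per decade, in chronological order
counts = decades.value_counts().sort_index()
print(counts)
```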

### Fatalities Data Cleaning
Recalculates the total_victims field from the fatalities and injured fields. In the original data, total_victims isn't a calculated field. Recomputing it prevents human error in total_victims so long as the fatalities and injured fields are accurate.

In [7]:
mjms_df.total_victims = mjms_df.fatalities + mjms_df.injured
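
Before overwriting the column, it may be useful to see which rows disagree with the computed sum. This is a sketch on a toy frame with made-up numbers; the column names mirror the data, but the values and the deliberate mismatch are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'fatalities': [3, 5, 10],
    'injured': [2, 0, 4],
    'total_victims': [5, 6, 14],  # second row is deliberately inconsistent
})

# Compare the reported total with the sum of its parts
computed = toy['fatalities'] + toy['injured']
mismatches = toy[toy['total_victims'] != computed]
print(mismatches)

# Overwrite with the computed value, as the notebook does
toy['total_victims'] = computed
```

Listing the mismatches first leaves a record of which source rows were changed by the recalculation.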

### Age of Shooter Data Cleaning
Cleans the age of shooter. For cases with multiple shooters, the average age is taken.

#### Verifying missing data
Checking for the missing ages that prevent the data from being stored as an int

In [8]:
for i in mjms_df.age_of_shooter.unique() :
    print(i, end=', ')

14, 44, 67, 40, 21, 59, 18, 33, 25, 28, 43, 72, 31, 22, 15, 20, 70, 23, 45, -, 57, 19, 51, 36, 24, 32, 46, 26, 54, 29, 38, 17, 47, 37, 64, 39, 27, 34, 42, 41, 52, 16, 48, 66, 11, 35, 55, 50, 

#### Converts Data to Numeric
Replaces all missing values with -1 then converts the Series to numeric (int)

In [9]:
# Replaces missing ages with -1 then converts the column to integer
mjms_df.age_of_shooter = mjms_df.age_of_shooter.replace('-', -1)
mjms_df.age_of_shooter = pd.to_numeric(mjms_df.age_of_shooter)
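
A -1 sentinel keeps the column an integer type, but it would skew statistics such as the mean if it were left in place. As an alternative sketch (not the approach used in this notebook), `pd.to_numeric` with `errors='coerce'` turns unparseable values into NaN, which pandas skips in aggregations. The ages below are a made-up sample.

```python
import pandas as pd

# Made-up sample where '-' marks a missing age
raw_ages = pd.Series(['14', '44', '-', '67'])

# Unparseable entries become NaN and the result is a float column
ages = pd.to_numeric(raw_ages, errors='coerce')

# Boolean mask of the rows that still need an age
missing = ages.isna()

# NaN values are ignored by default, so only known ages are averaged
mean_known_age = ages.mean()
print(missing.sum(), mean_known_age)
```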

In [10]:
pd.set_option('display.max_colwidth', None)

#### Identify Data Cases
For rows where the age is missing or wrong, displays the case name and summary used to correct the age.

In [11]:
# Isolate the cases without ages
mjms_df.loc[mjms_df.age_of_shooter == -1, ('case', 'summary', 'age_of_shooter')]

Unnamed: 0,case,summary,age_of_shooter
25,Sacramento County church shooting,"""A man believed to be meeting his three children for a supervised visit at a church just outside Sacramento on Monday afternoon fatally shot the children and an adult accompanying them before killing himself, police officials said. Sheriff Scott Jones of Sacramento County told reporters at the scene that the gunman had a restraining order against him, and that he had to have supervised visits with his children, who were younger than 15."" (NYTimes)",-1
34,Jersey City kosher market shooting,"David N. Anderson, 47, and Francine Graham, 50, were heavily armed and traveling in a white van when they first killed a police officer in a cemetery, and then opened fire at a kosher market, “fueled both by anti-Semitism and anti-law enforcement beliefs,” according to New Jersey authorities. The pair, linked to the antisemitic ideology of the Black Hebrew Israelites extremist group, were killed after a lenghty gun battle with police at the market.",-1


In [12]:
mjms_df.loc[mjms_df.age_of_shooter == 11, ('case', 'summary', 'age_of_shooter')]

Unnamed: 0,case,summary,age_of_shooter
126,Westside Middle School killings,"Mitchell Scott Johnson, 13, and Andrew Douglas Golden, 11, two juveniles, ambushed students and teachers as they left the school; they were apprehended by police at the scene.",11


#### Supporting documentation
For the 'Sacramento County church shooting' case, CNN (https://www.cnn.com/2022/02/28/us/sacramento-church-shooting) identified the shooter as 39-year-old David Mora Rojas.

In [13]:
# Sets age of shooter for 'Sacramento County church shooting'
mjms_df.loc[mjms_df.case == 'Sacramento County church shooting', 'age_of_shooter'] = 39

#### Fixing Data Input Error
For the 'Westside Middle School killings' case, the shooters' ages were 13 and 11. There were two assailants, though only one age was recorded.
For the 'Jersey City kosher market shooting' case, the shooters' ages were 47 and 50 (see the case summary).

In [14]:
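# A numeric column cannot hold a list in a single cell, so multi-shooter
# cases store the average age, as described at the top of this section
mjms_df.age_of_shooter = mjms_df.age_of_shooter.astype(float)
mjms_df.at[34, 'age_of_shooter'] = (47 + 50) / 2
mjms_df.at[126, 'age_of_shooter'] = (11 + 13) / 2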
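
If the individual ages should be kept alongside the average, one option is a separate list-valued column. The sketch below uses a toy frame; `shooter_ages` is a hypothetical helper column that does not exist in the Mother Jones data, and the case labels are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({'case': ['single shooter', 'two shooters A', 'two shooters B']})

# Hypothetical column holding every shooter's age for each case
toy['shooter_ages'] = pd.Series([[39], [47, 50], [11, 13]], index=toy.index)

# The numeric column stores the average, so it stays usable in aggregations
toy['age_of_shooter'] = toy['shooter_ages'].apply(lambda ages: sum(ages) / len(ages))
print(toy)
```

The averaged column is a float type, since an average of two ages such as 47 and 50 is not a whole number.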

#### Verify Data Corrections

In [None]:
mjms_df.loc[[25, 34, 126], ['case', 'summary', 'age_of_shooter']]

In [None]:
pd.reset_option('display.max_colwidth')

### Gender of Shooter Data Cleaning
Normalizes the genders of the shooters.

- M: Male gender 
- F: Female gender
- T->F: Transgender transitioning from male to female
- T->M: Transgender transitioning from female to male
- M/M: Slash-separated letters are used for multiple shooters, where each letter represents one shooter
- O: Gender non-conforming including agender, non-binary, bigender, etc.

In [None]:
mjms_df.gender.unique()

In [None]:
mjms_df.gender = mjms_df.gender.replace('Female', 'F')
mjms_df.gender = mjms_df.gender.replace('Male', 'M')
mjms_df.gender = mjms_df.gender.replace('Male & Female', 'M/F')
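
The three replacements can also be expressed as a single mapping, which makes it easier to spot labels that were not normalized. This is a sketch on a toy Series; `gender_map` and `allowed` are illustrative names, and the allowed set simply restates the legend above.

```python
import pandas as pd

# Long-form labels mapped to the short codes from the legend
gender_map = {'Male': 'M', 'Female': 'F', 'Male & Female': 'M/F'}

toy_gender = pd.Series(['M', 'Male', 'Female', 'Male & Female', 'F'])

# Series.replace with a dict swaps whole values, not substrings
normalized = toy_gender.replace(gender_map)

# Any label outside the legend still needs manual cleaning
allowed = {'M', 'F', 'M/F', 'M/M', 'T->F', 'T->M', 'O'}
leftovers = sorted(set(normalized) - allowed)
print(normalized.tolist(), leftovers)
```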

#### Cleaning Audrey's case
Identifies the index of the case so its gender value can be updated to T->M (transgender, female to male).

In [None]:
mjms_df[mjms_df.gender.str.contains('F ("identifies as transgender', regex=False)]
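
Once the row is located, the same mask can be used to assign the normalized code. This is a sketch on a toy Series. The long free-text value is shortened to a hypothetical stand-in, and `T->M` follows the legend defined earlier in this section.

```python
import pandas as pd

# Shortened, hypothetical stand-in for the long free-text value in the data
toy_gender = pd.Series(['M', 'F ("identifies as transgender" ...)', 'F'])

# regex=False matches the text literally, so the parenthesis and quote are safe
mask = toy_gender.str.contains('F ("identifies as transgender', regex=False)

# Replace only the matching rows with the legend code
toy_gender.loc[mask] = 'T->M'
print(toy_gender.tolist())
```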

#### Cleaning 'Westside Middle School killings' case
The genders of both shooters were not reflected. Updates the value to account for two shooters.

In [None]:
mjms_df.at[126, 'gender'] = 'M/M'

In [15]:
mjms_df.gender.unique()

array(['M',
       'F ("identifies as transgender" and "Audrey Hale is a biological woman who, on a social media profile, used male pronouns,” according to Nashville Metro PD officials)',
       'Male & Female', 'F', 'Male', 'Female'], dtype=object)