## Parsing the data

In the "Data_Download_Extraction.ipynb" notebook, we downloaded the multiple cause of death data files (2012-2016) from the CDC and unzipped them into the "data" directory under the current working directory.

Here we will try to parse the extracted files into one large CSV, using the field positions given in this record layout document: https://www.cdc.gov/nchs/data/dvs/Multiple_Cause_Record_Layout_2016.pdf
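
The record layout gives each field as a 1-indexed, inclusive range of character positions, while Python slices are 0-indexed and end-exclusive. A minimal sketch of the conversion, assuming a hypothetical helper `field` and a made-up record (neither is part of the CDC files):

```python
def field(line, start, end=None):
    """Return the text at 1-indexed, inclusive positions start..end of a record."""
    if end is None:
        end = start  # single-character field
    # position p in the layout corresponds to index p - 1 in the string
    return line[start - 1:end].strip()


# made-up 10-character record, for illustration only
record = "AB2016 M07"
year = field(record, 3, 6)
sex = field(record, 8)
month = field(record, 9, 10)
print(year, sex, month)
```

With this convention, positions 146-149 in the layout correspond to the slice `line[145:149]` used in the cells that follow.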

In [21]:
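import os

cwd = os.getcwd()
years = list(range(2012, 2017))
data_file_names = []
# os.path.join keeps the path portable across operating systems
for _root, _dirs, files in os.walk(os.path.join(cwd, "data")):
    # sort so the files pair with the years in order
    # (assumes the file names sort chronologically)
    data_file_names = sorted(files)
# map each year to its data file
data_file_names = {year: file for year, file in zip(years, data_file_names)}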

In [22]:
#desired fields in the data

fields = ["year", "month", "age", "sex", "race", "cause", "place_of_injury", "education"]

The following dictionaries will make it easier to look up the codes used in the data and replace them with what they represent.

In [23]:
Manner = {
    1: "Accident",
    2: "Suicide",
    3: "Homicide",
    4: "Pending investigation",
    5: "Could not determine",
    6: "Self-Inflicted",
    7: "Natural",
}

ICD_10_Suicide_Range = ["X"+str(num) for num in range(60,85)]

suicide_cause_dict = {}

for item in ICD_10_Suicide_Range:
    # map the numeric part of the code to a broad cause category
    num = int(item[1:])
    cause = ""
    if num in range(60, 70):
        cause = "poisoning/drugs"
    elif num == 70:
        cause = "hanging/suffocation"
    elif num == 71:
        cause = "drowning"
    elif num in range(72, 75):
        cause = "gun"
    elif num == 75:
        cause = "explosive"
    elif num in [76, 77]:
        cause = "burning/burns"
    elif num == 78:
        cause = "sharp object"
    elif num == 79:
        cause = "blunt object"
    elif num == 80:
        cause = "jumping"
    elif num == 81:
        cause = "moving object"
    elif num == 82:
        cause = "vehicle accident"
    else:
        cause = "other"
    suicide_cause_dict[item] = cause

## The dataset uses two encodings for education: the 2003 encoding and the 1989 encoding.
## The following dictionaries attempt a consistent interpretation of the two encodings.
    
education_2003 = {
    1: "Less than HS",
    2: "HS/GED",
    3: "HS/GED",
    4: "Some College",
    5: "Bachelor's",
    6: "Bachelors+",
    7: "Bachelors+",
    8: "Bachelors+"    
}

education_89 = {
    0: "Less than HS",
    1: "Less than HS",
    2: "Less than HS",
    3: "Less than HS",
    4: "Less than HS",
    5: "Less than HS",
    6: "Less than HS",
    7: "Less than HS",
    8: "Less than HS",
    9: "HS/GED",
    10: "HS/GED",
    11: "HS/GED",
    12: "HS/GED",
    13: "Some College",
    14: "Some College",
    15: "Some College",
    16: "Bachelors+",
    17: "Bachelors+",
}
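
The 1989 codes map onto labels in contiguous runs, so the same lookup table can be built from ranges rather than typed out entry by entry. A sketch, where `build_range_lookup` is a hypothetical helper and the ranges mirror the hand-written `education_89` dictionary:

```python
def build_range_lookup(ranges):
    """Expand {(first, last): label} into {code: label} for each inclusive range."""
    lookup = {}
    for (first, last), label in ranges.items():
        for code in range(first, last + 1):
            lookup[code] = label
    return lookup


# same grouping as the hand-written education_89 dictionary
education_89_ranges = {
    (0, 8): "Less than HS",
    (9, 12): "HS/GED",
    (13, 15): "Some College",
    (16, 17): "Bachelors+",
}
education_89_built = build_range_lookup(education_89_ranges)

# codes that are not in the table fall back to a default label
print(education_89_built.get(12, "unknown"))
print(education_89_built.get(99, "unknown"))
```

Using `.get` with a default keeps any unlisted code from raising a `KeyError`.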

Let us first extract every death marked as a suicide into a list of strings. ICD-10 codes for suicides (X60 - X84) are listed here: https://en.wikipedia.org/wiki/ICD-10_Chapter_XX:_External_causes_of_morbidity_and_mortality

The manner field may be blank, and the cause of death may not be reported as an ICD code. By using both, we should arrive at a reasonable tally of all suicides in the US in the five-year period from 2012 to 2016.

In [24]:
suicide_deaths = []

for file in data_file_names.values():
    
    with open(os.path.join(cwd, "data", file)) as f:
        for line in f:
            
            manner = line[106].strip()
            ICD10 = line[145:149].strip()
            
            if manner and int(manner) in [2, 6]:
                suicide_deaths.append(line)
                
            elif ICD10 and (ICD10 in ICD_10_Suicide_Range):
                suicide_deaths.append(line)
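
The selection rule in the loop above can also be written as a small predicate, which makes it easy to try on hand-made records. This is a sketch: `is_suicide` and `make_record` are hypothetical names, and the records are synthetic, padded with spaces so that the two fields sit at the same indices the loop reads.

```python
SUICIDE_MANNERS = {"2", "6"}
SUICIDE_CODES = {"X" + str(num) for num in range(60, 85)}  # X60 - X84


def is_suicide(line):
    """True if the manner field or the ICD-10 cause marks the record as a suicide."""
    manner = line[106].strip()
    icd10 = line[145:149].strip()
    return manner in SUICIDE_MANNERS or icd10 in SUICIDE_CODES


def make_record(manner=" ", icd10=""):
    """Build a synthetic 150-character record with only the two fields filled in."""
    chars = [" "] * 150
    chars[106] = manner
    chars[145:149] = icd10.ljust(4)
    return "".join(chars)


print(is_suicide(make_record(manner="2")))               # flagged by manner
print(is_suicide(make_record(icd10="X70")))              # flagged by ICD-10 code
print(is_suicide(make_record(manner="1", icd10="W19")))  # flagged by neither
```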
                

In [25]:
len(suicide_deaths)

215211

**215211** suicides in five years! That is over 117 suicides per day.
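
The per-day figure comes from dividing the count by the number of days in the five calendar years, two of which (2012 and 2016) are leap years. A quick check of the arithmetic:

```python
from datetime import date

total_suicides = 215211
# days from 1 Jan 2012 up to and including 31 Dec 2016
days = (date(2017, 1, 1) - date(2012, 1, 1)).days
per_day = total_suicides / days
print(days, round(per_day, 1))
```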

Let's see if there are causes of suicide in the list that are not in our dictionary, so we can add them.

In [26]:
unknown_causes = set()
for line in suicide_deaths:
    if line[145:149].strip() not in suicide_cause_dict.keys():
        unknown_causes.add(line[145:149].strip())
        
print("number of unknown causes =", len(unknown_causes))
for cause in unknown_causes:
    print(cause)

number of unknown causes = 268
W68
I255
G10
I725
K746
E877
J111
V839
I350
F191
G908
G210
Y841
N320
V959
I428
X47
G936
V594
V870
I422
V485
A047
E102
I632
I678
W78
V555
E669
C189
C349
F54
F101
I251
K709
E878
Y839
C859
I279
E162
J449
I694
C439
J448
G309
K769
A084
X41
R64
J851
O268
I519
I119
W79
E780
K767
E232
N189
W84
Y608
K760
J189
N185
I38
W83
O961
I803
G35
I259
V877
Y658
Y832
E722
J690
F204
I516
I517
I420
X99
C900
A090
F111
J439
V092
E46
F03
W05
J841
C269
C509
E109
E872
E149
M069
Q613
Y831
Y870
Y836
R99
W17
F109
V695
Y579
J181
X00
V489
X43
W31
I131
I639
N390
F328
I429
X45
I458
C97
W74
J869
E161
K224
I509
I10
Y848
O969
G311
V455
J398
E050
W76
E86
W18
W40
W13
Y86
V599
I802
F259
F119
I426
G312
Y846
I469
X01
I250
X599
X49
J989
D65
I499
I500
E871
B238
I248
N19
W67
V475
C64
V499
V892
C329
J60
W70
E141
K219
K353
X42
I518
X44
J180
F319
I619
C921
V575
M622
G931
I64
J441
Y350
E43
J440
V545
V923
C229
E875
W10
G20
E101
O993
V051
W19
V059
M726
B86
C762
E273
X14
E142
J348
F102
W21
W20
D432
I739
F322
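
Before looking the codes up one by one, it can help to group them by their leading letter, since that letter broadly indicates which part of ICD-10 a code belongs to. A sketch using a handful of the codes printed above (the `sample` list is only an excerpt, chosen for illustration):

```python
from collections import Counter

# a small excerpt of the unknown codes listed above
sample = ["W68", "I255", "V839", "X47", "V959", "C189", "I251", "Y350", "X01"]

# tally the codes by their first letter
by_letter = Counter(code[0] for code in sample)
for letter, count in by_letter.most_common():
    print(letter, count)
```

Running the same tally over the full `unknown_causes` set would show which letters dominate and therefore which rules cover the most codes.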

The only way to rectify this seems to be to look up all of these and add them to the dictionary. I only add my closest interpretation so as not to make the number of causes too unwieldy. For example, ICD-10 code V839 is described as "Unspecified occupant of special industrial vehicle injured in nontraffic accident". Since we know this death was a suicide, it should be OK to simply say this was a "vehicle accident", the same as X82. Note: the 'C' codes are neoplasms. Since I don't know what a suicide by neoplasm even means, I will simply categorize it as other.

In [27]:
for item in unknown_causes:
    if item[0] == "V":
        suicide_cause_dict[item] = "vehicle accident"  # may include pedestrians involved in accidents
    elif item[0] in ["C", "K", "E", "I", "F", "W", "J", "D", "G", "N", "A", "B", "M", "Q", "R", "O"]: # these are mostly diseases
        suicide_cause_dict[item] = "other"
    elif item == "Y350":
        suicide_cause_dict[item] = "gun"
    elif item == "X01":
        suicide_cause_dict[item] = "burning/burns"
    elif item in ["X47", "Y579", "X40", "X43"]:
        suicide_cause_dict[item] = "poisoning/drugs"
    else:
        suicide_cause_dict[item] = "other"
    

We will now process each line and add it to a CSV file.

In [28]:
import csv
# newline="" leaves the line endings to the csv writer
suicides_file = open("All_Suicides.csv", "w", newline="")
    

In [29]:
# writing the first line as the headings for the columns
writer = csv.writer(suicides_file, lineterminator = "\n")
writer.writerow(fields)

56
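
The `56` printed above is the return value of `writer.writerow`, which passes through whatever the underlying file object's `write` method returns; for a text file that is the number of characters written. A small check against an in-memory buffer, used here in place of the real file:

```python
import csv
import io

fields = ["year", "month", "age", "sex", "race", "cause", "place_of_injury", "education"]

buffer = io.StringIO()  # in-memory stand-in for the output file
writer = csv.writer(buffer, lineterminator="\n")
written = writer.writerow(fields)

# the header line plus its trailing newline
print(written, len(buffer.getvalue()))
```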

The following cell defines the functions that we will use to process each row of data and add it to the CSV file.

In [30]:
def get_age(line):
    # every age under 1 year is encoded as 0; it is irrelevant for us anyway
    flag = int(line[69])
    detailed_age = int(line[70:73])
    if flag == 9 or detailed_age == 999:
        return "unknown"
    elif flag == 1:
        return detailed_age
    else:
        return 0
    
def get_race(line):
    hispanic = int(line[483:486].strip())
    race = line[444:446].strip()
    if hispanic>=199 and hispanic<=996:
        return "Hispanic"
    elif not race:
        return "Unknown"
    elif int(race) == 1:
        return "White"
    elif int(race) == 2:
        return "Black"
    elif int(race) == 3:
        return "Native American/Native Alaskan"
    else:
        return "Asian/Pacific Islander"

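def get_education(line):
    # the flag says which encoding the record uses: 0 = 1989, 1 = 2003
    education_flag = int(line[63])

    # the flag is compared explicitly because 0 (the 1989 encoding) is falsy,
    # so a plain "if education_flag:" would skip every 1989-encoded record
    if education_flag == 0:
        education = line[60:62].strip()
        if education:
            return education_89.get(int(education), "unknown")
    elif education_flag == 1:
        education = line[62].strip()
        if education:
            return education_2003.get(int(education), "unknown")

    return "unknown"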
        
def get_place(line):
    place = line[144].strip()
    if place:
        place = int(place)
        if place == 0:
            return "Home"
        if place == 1:
            return "Residential institution"
        if place == 2:
            return "School"
        if place == 3:
            return "Sports"
        if place == 4:
            return "Street"
        if place == 5:
            return "Trade/service area"
        if place == 6:
            return "Industrial/construction area"
        if place == 7:
            return "Farm"        
    return "other/unknown"
        

def process_line(line):
    year = line[101:105].strip()
    month = line[64:66].strip()
    age = get_age(line)
    sex = line[68].strip()
    race = get_race(line)
    cause = suicide_cause_dict.get(line[145:149].strip(), "other")
    place = get_place(line)
    education = get_education(line)
    values = [year, month, age, sex, race, cause, place, education]
    return values
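
The chain of `if` statements in `get_place` can also be expressed as a dictionary lookup, in the same style as the education dictionaries. A sketch with the same labels; `place_of_injury` and `decode_place` are hypothetical names, not part of the notebook, and unlike `get_place` this version also tolerates non-numeric input:

```python
place_of_injury = {
    0: "Home",
    1: "Residential institution",
    2: "School",
    3: "Sports",
    4: "Street",
    5: "Trade/service area",
    6: "Industrial/construction area",
    7: "Farm",
}


def decode_place(raw):
    """Translate the raw one-character place field into a label."""
    raw = raw.strip()
    if not raw.isdigit():
        return "other/unknown"  # blank or non-numeric field
    return place_of_injury.get(int(raw), "other/unknown")


print(decode_place("0"))
print(decode_place("9"))
print(decode_place(" "))
```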

Finally, processing each row to add it to the CSV file.

In [31]:
for line in suicide_deaths:
    writer.writerow(process_line(line))

In [32]:
suicides_file.close()

Looking at the data in an Excel sheet, it looks fine and seems to make sense. The only thing that concerns me is that 23000 of the entries have the education field "unknown". That is nearly 11 per cent of the data. I checked the layout and encodings again and do not see an error. Other entries look fine. Maybe we will discover more errors in exploratory analysis.
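
The share of unknown values can also be checked directly from the CSV rather than by eye. A sketch, assuming the column names written to the header row; the rows here are a tiny made-up sample held in memory, not the real `All_Suicides.csv`:

```python
import csv
import io
from collections import Counter

# made-up sample in the same column layout as the output file
sample_csv = io.StringIO(
    "year,month,age,sex,race,cause,place_of_injury,education\n"
    "2012,01,34,M,White,gun,Home,HS/GED\n"
    "2013,05,51,F,White,poisoning/drugs,Home,unknown\n"
    "2014,11,27,M,Black,hanging/suffocation,other/unknown,Some College\n"
    "2016,07,63,M,Hispanic,gun,Home,unknown\n"
)

# count how often each education label occurs
counts = Counter(row["education"] for row in csv.DictReader(sample_csv))
total = sum(counts.values())
unknown_share = counts["unknown"] / total
print(counts["unknown"], total, f"{unknown_share:.0%}")
```

To run this on the real data, `sample_csv` can be replaced with the opened output file, for example `open("All_Suicides.csv", newline="")`.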