## Preface

This all started because I watched Dave Franco die back-to-back in two separate movies and I wanted to know what the high scores looked like for the under-reported metric of "film deaths". I don't trust any of the click-bait articles I've found online, and figured if anyone should be trusted, it's an obscure crowd sourced wiki.

To do so, our goal is to parse a wiki dump from [Cinemorgue](https://cinemorgue.fandom.com/wiki/Cinemorgue_Wiki) to figure out which actor/actress has died the most in film.

To start, download the "Current Pages" file from [this link](https://cinemorgue.fandom.com/wiki/Special:Statistics) and replace the existing file in the root of the directory if you would like a more updated version.

Unzipping with 3rd party apps is miserable, so we are just going to use py7zr. It's pure python, no external dependencies, and actively maintained!

In [1]:
import py7zr

with py7zr.SevenZipFile('cinemorgue_pages_current.xml.7z', mode='r') as z:
    z.extractall(path='input')

In [2]:
import xml.etree.ElementTree as ET
import pandas as pd
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import os

pd.set_option('display.max_colwidth', 1000)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
# One time download below
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')

## Parsing The XML
Fandom (the site Cinemorgue is hosted on) follows the mediawiki export format of /export-0.11/.

Using the existing xml tree is way easier than relying on any sort of wikiparser library out there.

In [4]:
NS = 'http://www.mediawiki.org/xml/export-0.11/'

def parse_wikimedia_xml(filepath):
    tree = ET.parse(filepath)
    root = tree.getroot()
    data = []
    for page in root.findall('{%s}page' % NS):
        ns = page.find('{%s}ns' % NS).text
        if ns != "0":
            continue
        title = page.find('{%s}title' % NS).text
        revision = page.find('{%s}revision' % NS)
        text = revision.find('{%s}text' % NS).text
        data.append({'title': title, 'text': text})
    df = pd.DataFrame(data)
    return df

df = parse_wikimedia_xml('input/cinemorgue_pages_current.xml')

# Review output.
df.head(3)

Unnamed: 0,title,text
0,Cinemorgue Wiki,"<mainpage-leftcolumn-start />\n{{Mainpage welcome}}\n{{Heading|Index}}\n<div style=""font-family: 'Graveyard', sans-serif; padding: 0 1.2em; border-left: 4px solid #000000; margin-left: 1.2em; text-align:center; text-transform:uppercase; color:--theme-body-text-color;"">\n<gallery widths=""330"" hideaddbutton=""true"" position=""center"" spacing=""small"" orientation=""square"" bordersize=""large"" bordercolor=""#000000"" captionalign=""center"" captionposition=""within"" columns=""2"" navigation=""true"">\nFile:A pic for Cinemorgue actors.jpg|link=Category:Actors|<span style=""font-size:24px; font-family: 'Graveyard'; letter-spacing: 2px; color:#fff; "">Actor index</span>\nFile:Karolineherfurthperfume3.jpg|link=Category:Actresses|<span style=""font-size:24px; font-family: 'Graveyard'; letter-spacing: 2px; color:#fff;"">Actress index</span>\n</gallery>\n<gallery widths=""220"" hideaddbutton=""true"" position=""center"" spacing=""small"" orientation=""square"" bordersize=""large"" bordercolor=""#000000"" captionalign=""cente..."
1,Main Page,#REDIRECT [[Cinemorgue Wiki]]
2,Marilyn Monroe,"[[File:Marilynmonroe.jpg|thumb|350px|Marilyn Monroe in ''Niagara'']]\n[http://www.imdb.com/name/nm0000054/ Marilyn Monroe] (1926 - 1962) \n\nPlayboy Sweetheart of the Month December 1953 (Historically considered the first ever Playboy Playmate)\n\n==Film Deaths==\n*'''''[[Niagara (1953)|Niagara]]''''' '''[[Niagara (1953)|(1953)]]''' [''Rose Loomis'']: Strangled with her white scarf worn under her black dress by [[Joseph Cotten]] in a bell tower. The murder is shown in shadow, and as the camera pans the bells of the tower for added suspense, and her body falls into the frame afterwards with Cotton holding the scarf.\n\n==Noteworthy Connections==\n*Foster sister of [[Jody Lawrance]]\n*Ex-wife of Joe DiMaggio (famed baseball player)\n*Ex-wife of Arthur Miller (famed playwright)\n*Mistress of President John F. Kennedy\n*Ex-girlfriend of Jorge Guinile (Brazilian billionaire)\n*'''No relation''' to [[Carolyn Monroe]].\n\n{{DEFAULTSORT:Monroe, Marilyn}}\n[[Category:Actresses]]\n[[Category..."


## Cinemorgue Page Structure

We will only be looking for film deaths. TV Deaths include a lot of voice actors from animated shows which feels like cheating and ruins the spirt of finding out which actor put it all on the line.

The structure of each wikimedia page, while not perfect, is relatively consistent.

Each Actor/Actress page follows the structure below:

* Overview  
* Film Deaths  
* Television Deaths/TV Deaths  
* Video Game Deaths  
* Music Video Deaths  
* Notable Connections
* Other General Page Formatting Failures

Not every page has the subsequent sections, so just to be thorough, we check for and delete every other section.

In [5]:
# Delete everything before Film Deaths.
df['text'] = df['text'].str.split("Film Deaths", n=1, expand=True)[1]

In [6]:
#Delete everything below TV Deaths.
df['text'] = df['text'].str.split("Television Deaths", n=1, expand=True)[0]

In [7]:
df['text'] = df['text'].str.split("TV Deaths", n=1, expand=True)[0]

In [8]:
df['text'] = df['text'].str.split("TV Series Deaths", n=1, expand=True)[0]

In [9]:
df['text'] = df['text'].str.split("Video Game Deaths", n=1, expand=True)[0]

In [10]:
df['text'] = df['text'].str.split("Music Video Deaths", n=1, expand=True)[0]

In [11]:
df['text'] = df['text'].str.split("Notable Connections", n=1, expand=True)[0]

In [12]:
df['text'] = df['text'].str.split("Noteworthy Connections", n=1, expand=True)[0]

In [13]:
df['text'] = df['text'].str.split("Gallery", n=1, expand=True)[0]

In [14]:
df['text'] = df['text'].str.split("DEFAULTSORT:", n=1, expand=True)[0]

In [15]:
df['text'] = df['text'].str.split("Category", n=1, expand=True)[0]

In [16]:
#Drop all recently nulled rows. Ready for splitting.
df = df.dropna(subset=['text'])

## String Splitting Actors

Every movie death is (generally) annotated by a line break and an asterisk.

For each instance of this, we'll the entry out into a new row.

In [17]:
# New df to store the split rows
new_rows = {'title': [], 'text': []}

# Iterate through the original df
for idx, row in df.iterrows():
    title = row['title']
    text_parts = row['text'].split('\n*')
    
    # Append the new rows to the new df
    # Skip the first element, usually contains gibberish before first line.
    for part in text_parts[1:]:
        new_rows['title'].append(title)
        new_rows['text'].append(part)

# Create the new df
new_df = pd.DataFrame(new_rows)

# Review output.
new_df.head(3)

Unnamed: 0,title,text
0,Joseph Cotten,'''[[Shadow of a Doubt (1943)|''Shadow of a Doubt'' (1943)]]''' [''Uncle Charlie'']: Falls out of a train and into the path of another train during a struggle with [[Teresa Wright]].
1,Joseph Cotten,'''''[[Niagara (1953)]]''''' [''George Loomis'']: Drowned when his boat sinks while going over Niagara Falls.
2,Joseph Cotten,"'''[[The Last Sunset (1961)|''The Last Sunset'' (1961)]]''' [''John Breckenridge'']: Shot in the back [[Adam Williams]] as he leaves the cantina, as he is flanked by [[Rock Hudson]] and [[Kirk Douglas]]. (''Thanks to Brian'')."


## String Splitting Film Year

Now we need to find the year the film was released. 
This is to help differentiate common/repeated movie titles that have been made over the years.

We are looking for the first instance of 4 digits between parentheis.

In [18]:
#Creating year column.
def extract_year(text):
    match = re.search(r'\((\d{4})\)', text)
    if match:
        return match.group(1)
    else:
        return None

# Apply the function to create the "year" column
new_df['year'] = new_df['text'].apply(extract_year)

# Review output.
new_df.head(3)

Unnamed: 0,title,text,year
0,Joseph Cotten,'''[[Shadow of a Doubt (1943)|''Shadow of a Doubt'' (1943)]]''' [''Uncle Charlie'']: Falls out of a train and into the path of another train during a struggle with [[Teresa Wright]].,1943
1,Joseph Cotten,'''''[[Niagara (1953)]]''''' [''George Loomis'']: Drowned when his boat sinks while going over Niagara Falls.,1953
2,Joseph Cotten,"'''[[The Last Sunset (1961)|''The Last Sunset'' (1961)]]''' [''John Breckenridge'']: Shot in the back [[Adam Williams]] as he leaves the cantina, as he is flanked by [[Rock Hudson]] and [[Kirk Douglas]]. (''Thanks to Brian'').",1961


## Exhaustive String Splitting Film Titles
Now begins the slow stripping away of unncessary info following each movie title.

There are a lot of weird cases, and even weirder non alphanumeric characters. 
So we're going to strip away characters slowly but surely to create uniform pattern we can parse.

In [19]:
# Remove all pairs of apostraphes or quotation marks.
new_df['text'] = new_df['text'].str.replace(r"[''\"\=]", "", regex=True)

In [20]:
# Some titles have stray html formatting tags in them. 
new_df['text'] = new_df['text'].str.replace(r'\s*<.*?>\s*', '', regex=True)

In [21]:
# Some titles will just have the link hardcoded in the title which is pretty impressive.
new_df['text'] = new_df['text'].str.replace(r'https://\S+\s*', '', regex=True)

In [22]:
# Some titles might even hardcode the unsecured link instead.
new_df['text'] = new_df['text'].str.replace(r'http://\S+\s*', '', regex=True)

In [23]:
# Save line contents for additional parsing later.
new_df['raw_text'] = new_df['text']

Titles come in two forms: 
Links and Non-Links.

* Links are formatted in a way that have the title listed twice between brackets.
(e.g. [The Shining (1980) | The Shining (1980)])

* Non-Links bow to no god, and are strutured however someone decided to make the wiki entry.
Our best hope is to just capture everything leading up to the first instance of a year between parenthesis. (e.g. ________ (1980) )

In [24]:
# If a string starts with a link [], grab the contained string.
# If a string is not a link, grab all textup until the first date ().
def extract_text(row):
    if row.startswith("["):
        # Remove parenthesis and their contents from inside the square brackets
        cleaned_text = re.sub(r'\([^()]*\)', '', row)
        match = re.search(r'\[(.*?)\]', cleaned_text)
        if match:
            return match.group(1)
    else:
        match = re.search(r'^([^()]*)', row)
        if match:
            return match.group(1).strip()
    return ''  # Return an empty string if no match is found

new_df['text'] = new_df['text'].apply(extract_text)

In [25]:
# For links, delete everything after the first instance of a |.
new_df['text'] = new_df['text'].str.split("|", n=1, expand=True)[0]

In [26]:
# Remove all remaining [[]].
new_df['text'] = new_df['text'].str.replace(r'\[|\]', '', regex=True)

In [27]:
# Categories sneak their way into some titles, so remove brakets that contain these as well.
new_df['text'] = new_df['text'].str.split("{", n=1, expand=True)[0]

In [28]:
# Remove all white space at end of string.
new_df['text'] = new_df['text'].str.strip()

In [29]:
# Remove all blank rows.
new_df = new_df[new_df['year'] != '']

In [30]:
# Remove all null rows that don't contain a year.
new_df = new_df.dropna(subset=['year'])
# Review output.
new_df.head(3)

Unnamed: 0,title,text,year,raw_text
0,Joseph Cotten,Shadow of a Doubt,1943,[[Shadow of a Doubt (1943)|Shadow of a Doubt (1943)]] [Uncle Charlie]: Falls out of a train and into the path of another train during a struggle with [[Teresa Wright]].
1,Joseph Cotten,Niagara,1953,[[Niagara (1953)]] [George Loomis]: Drowned when his boat sinks while going over Niagara Falls.
2,Joseph Cotten,The Last Sunset,1961,"[[The Last Sunset (1961)|The Last Sunset (1961)]] [John Breckenridge]: Shot in the back [[Adam Williams]] as he leaves the cantina, as he is flanked by [[Rock Hudson]] and [[Kirk Douglas]]. (Thanks to Brian)."


## Cause of Death

Now we're going to attempt to append cause of death.

The least over-kill (pun kind of intended) way to go about this is tokenizing the text, and hard coding classifiers.

To define cause of death, we're going to use the [FBI's boilerplate for homicide methodology](https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/tables/expanded-homicide-data-table-8.xls) and sprinkle in a few other things I think need distinctions.

We will use: 
* Firearms
* Cutting Instrument
* Blunt Objects
* Personal Weapons
* Strangulation
* Drowning
* Impact
* Vehiclar
* Supernatural
* Weather
* Animals
* Fire
* Explosion
* Poison
* Narcotics
* Ailment

In [31]:
# Remove all alphanumeric characters from the column.
new_df['raw_text'] = new_df['raw_text'].str.replace(r'[^a-zA-Z#\s]', ' ', regex=True)

# Removes spaces from above. Easier this way.
new_df['raw_text'] = new_df['raw_text'].str.replace(r'  +', ' ', regex=True)

# Review output.
new_df.head(3)

Unnamed: 0,title,text,year,raw_text
0,Joseph Cotten,Shadow of a Doubt,1943,Shadow of a Doubt Shadow of a Doubt Uncle Charlie Falls out of a train and into the path of another train during a struggle with Teresa Wright
1,Joseph Cotten,Niagara,1953,Niagara George Loomis Drowned when his boat sinks while going over Niagara Falls
2,Joseph Cotten,The Last Sunset,1961,The Last Sunset The Last Sunset John Breckenridge Shot in the back Adam Williams as he leaves the cantina as he is flanked by Rock Hudson and Kirk Douglas Thanks to Brian


In [32]:
# Remove stop words to help lemmetizer.
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = word_tokenize(text)
    filtered = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered)

new_df['processed_text'] = new_df['raw_text'].apply(remove_stopwords)

# Review output.
new_df.head(3)

Unnamed: 0,title,text,year,raw_text,processed_text
0,Joseph Cotten,Shadow of a Doubt,1943,Shadow of a Doubt Shadow of a Doubt Uncle Charlie Falls out of a train and into the path of another train during a struggle with Teresa Wright,Shadow Doubt Shadow Doubt Uncle Charlie Falls train path another train struggle Teresa Wright
1,Joseph Cotten,Niagara,1953,Niagara George Loomis Drowned when his boat sinks while going over Niagara Falls,Niagara George Loomis Drowned boat sinks going Niagara Falls
2,Joseph Cotten,The Last Sunset,1961,The Last Sunset The Last Sunset John Breckenridge Shot in the back Adam Williams as he leaves the cantina as he is flanked by Rock Hudson and Kirk Douglas Thanks to Brian,Last Sunset Last Sunset John Breckenridge Shot back Adam Williams leaves cantina flanked Rock Hudson Kirk Douglas Thanks Brian


In [33]:
# Lemmatize > stemming, doing so makes creating a dictionary for classifiers less insane.
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV
    }
    return tag_dict.get(tag, wordnet.VERB)  # Default to VERB instead of NOUN

def lemmatize_text(text):
    text = text.lower()
    words = word_tokenize(text)
    lemmatized = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]
    return ' '.join(lemmatized)

new_df['processed_text'] = new_df['processed_text'].apply(lemmatize_text)

# Review output.
new_df.head(3)

Unnamed: 0,title,text,year,raw_text,processed_text
0,Joseph Cotten,Shadow of a Doubt,1943,Shadow of a Doubt Shadow of a Doubt Uncle Charlie Falls out of a train and into the path of another train during a struggle with Teresa Wright,shadow doubt shadow doubt uncle charlie fall train path another train struggle teresa wright
1,Joseph Cotten,Niagara,1953,Niagara George Loomis Drowned when his boat sinks while going over Niagara Falls,niagara george loomis drown boat sink go niagara fall
2,Joseph Cotten,The Last Sunset,1961,The Last Sunset The Last Sunset John Breckenridge Shot in the back Adam Williams as he leaves the cantina as he is flanked by Rock Hudson and Kirk Douglas Thanks to Brian,last sunset last sunset john breckenridge shot back adam williams leaf cantina flank rock hudson kirk douglas thanks brian


In [34]:
# Every possible word myself and Claude could think of.

keywords = {
    'Firearms': [
        # Weapons
        'gun', 'pistol', 'rifle', 'revolver', 'shotgun', 'handgun', 'firearm', 'musket', 
        'machine gun', 'submachine gun', 'assault rifle', 'carbine', 'derringer',
        
        # Specific types
        'glock', 'beretta', 'colt', 'magnum', 'winchester', 'ak47', 'm16', 'uzi',
        
        # Actions
        'shoot', 'shot', 'fire', 'discharge', 'trigger', 'aim', 'target',
        'gunfire', 'shootout', 'gunfight', 'crossfire', 'gunshot', 'gunned', 'misfire',
        
        # Related terms
        'sniper', 'marksman', 'gunman', 'shooter', 'gunslinger',
        'bullet', 'ammunition', 'ammo', 'cartridge', 'shell', 'round',
        
        # Locations/Context
        'point blank', 'drive by', 'shooting range',
        
        # Other
        'armed', 'open fire', 'line of fire'
    ],

    'Cutting Instrument': [
        # Specific knife types
        'knife', 'knives', 'dagger', 'blade', 'sword', 'machete', 'bayonet', 'switchblade',
        'stiletto', 'pocketknife', 'katana', 'cleaver', 'scalpel', 'dirk', 'rapier', 'saber', 
        'shiv', 'axe', 'chainsaw',
        
        # Bladed weapons
        'axe', 'hatchet', 'spear', 'javelin', 'lance', 'scythe', 'sickle',
        'razor', 'shuriken', 'kunai',
        
        # Knife-specific actions
        'stab', 'stabbed', 'slice', 'gouge', 'slash', 'gut', 'hack', 'impale', 'pierce', 
        'thrust', 'cut', 'disembowel', 'disembowled', 'beheading',
        'carve', 'decapitate', 'slit', 'lacerate', 'nick', 'gash', 'puncture', 'laceration'
    ],
    
    'Blunt Objects': [
        # Specific weapons
        'club', 'bat', 'hammer', 'mallet', 'crowbar', 'pipe', 'rod',
        'staff', 'cane', 'baton', 'blackjack', 'nightstick', 'truncheon',
        'cudgel', 'brick', 'rock', 'stone', 'candlestick', 'wrench',
        'pole', 'stick', 'paddle', 'tire iron', 'baseball bat', 'sledgehammer',
        
        # Blunt-specific actions
        'bludgeon', 'bash', 'pummel', 'pound', 'clobber', 'whack',
        'strike', 'smash', 'crack', 'flatten', 'slam', 'batter', 'blunt', 'contusion'

    ],
    
    'Personal Weapons': [
        
        # Unarmed combat actions
        'punch', 'kick', 'slap', 'jab', 'uppercut', 'hook',
        'headbutt','grapple', 'tackle', 'throw', 'beaten', 'fight',
        
        # Specific techniques
        'chokehold', 'roundhouse kick', 'body slam', 
        
        # Fighting styles references
        'martial art', 'karate', 'kung fu', 'wrestle', 'box', 'kickbox',
        
        # Combat-specific phrases
        'hand to hand', 'brawl', 'bare knuckle', 'close combat'
    ],

    'Strangulation': [
        # Methods
        'strangle', 'strangulation', 'choke', 'throttle', 'garrote', 'hang',
        'asphyxiate', 'asphyxiation', 'suffocate', 'suffocation', 'smother',
        'oxygen deprivation', 'suffocates', 'hung', 'choked',
        
        # Implements
        'rope', 'cord', 'wire', 'cable', 'chain', 'belt',
        'necktie', 'garrote wire', 'ligature', 'noose', 'string',
        'electrical cord', 'phone cord', 'clothesline',
        
        # Actions/Verbs
        'constrict', 'compress', 'squeeze', 'tighten', 'restrict', 'neck lock',
        
        # Physical signs
        'petechiae', 'ligature mark', 'neck bruise',
        'hyoid', 'larynx', 'carotid', 'windpipe', 'airway'
    ],

    'Drowning': [
        # Actions/Verbs
        'drown', 'submerge', 'sink', 'underwater',
        'sink', 'plunge', 'immerse', 'flood', 'waterlog',
        'swept away', 'pull under', 'went under',
        
        # Water environments
        'pool', 'ocean', 'sea', 'lake', 'river', 'pond', 'bath',
        'bathtub', 'hot tub', 'well', 'reservoir', 'canal',
        'swimming pool', 'waterfall', 'marsh', 'swamp',
        'undertow', 'riptide', 'current', 'whirlpool', 'wave',
        'flash flood', 'tsunami',
        
        # Medical terms
        'water inhalation', 'pulmonary edema',
        
        # Circumstances
        'boat accident', 'capsize', 'shipwreck',
        'overboard', 'swim accident', 'water'
    ],

    'Impact': [
        # Falls
        'fall', 'fell', 'plummet', 'plunge', 'drop', 'tumble',
        
        # Thrown/Pushed
        'throw', 'push', 'shove', 'toss', 'hurl',
        'flung', 'launch', 'propel', 'catapult',
        
        # Crushing
        'crush', 'compress', 'squash', 'flatten',
        'collapse on', 'bury under',
        'cave in', 'avalanche', 'landslide',
        
        # Objects
        'boulder', 'rock', 'tree', 'beam', 'girder',
        'concrete', 'debris', 'rubble', 'building collapse',
        'structural collapse', 'wall collapse',
        
        # Heights/Locations
        'cliff', 'bridge', 'balcony', 'window', 'roof',
        'stairs', 'elevator shaft', 'scaffold', 'ladder',
        'mountain', 'building', 'high rise',
        
        # Force descriptions
        'impact force', 'blunt impact', 'high velocity',
        'terminal velocity', 'gravitational force',
        
        # Injuries
        'impact trauma', 'blunt force trauma', 'crush injury',
        'compression injury', 'multiple trauma',
        'massive trauma', 'internal injury',
        
        # Common phrases
        'died from fall', 'killed by falling', 'crushed to death',
        'died on impact', 'fatal fall', 'deadly impact'
    ],

    'Vehicular': [
        # Vehicle types
        'car', 'truck', 'bus', 'motorcycle', 'van', 'suv',
        'semi truck', 'tractor trailer', 'lorry', 'vehicle',
        'automobile', 'motor vehicle', 'big rig', '18 wheeler',
        
        # Crash types
        'crash', 'collision', 'wreck', 'pileup', 'run over',
        'side impact', 'hit and run', 'drive', 'tire', 'brake', 'whiplash',
        
        # Vehicle actions
        'swerved', 'rolled', 'flipped', 'overturned',
        'ran off road', 'veered off', 'careened',
        
        # Likely Locations
        'highway', 'freeway',
        
        # Circumstances
        'dui', 'speeding', 'road rage', 'mechanical failure'
    ],
    
    'Fire': [
        # Fire types/sources
        'fire', 'flame', 'blaze', 'inferno', 'wildfire', 'firestorm',
        'bonfire', 'conflagration', 'arson', 'backdraft', 'flamethrower',
        'molotov cocktail',
        
        # Fire actions/verbs
        'burn', 'incinerate', 'immolate', 'char', 'scorch', 'singe',
        'ignite', 'kindle', 'combust', 'smolder', 'torch', 'immolation',
        
        # Fire effects
        'melt', 'carbonize', 'cremate', 'roast', 'sear', 'blacken',
        'ash', 'smoke inhalation', 'ablaze', 'molten',
        
        # Fire-related terms
        'spark', 'ember', 'cinder', 'smoke', 'soot', 'cremation'
    ],
    
    'Explosives': [
        # Explosive devices
        'bomb', 'explosive', 'explosion', 'detonation', 'dynamite', 'tnt', 'c4', 'semtex', 'grenade',
        'landmine', 'mine', 'ied', 'missile', 'rocket', 'mortar', 'warhead',
        'powder keg', 'demolition','claymore',
        
        # Explosion actions/verbs
        'detonate', 'explode', 'blast', 'burst', 'rupture', 'erupt',
        'discharge', 'blow up', 'vaporize', 'disintegrate',
        
        # Explosion effects
        'shockwave', 'fragmentation', 'shrapnel',
        'percussion', 'overpressure', 'blast wave', 'sonic boom',
        
        # Explosive materials
        'gunpowder', 'nitroglycerin', 'fuse', 'detonator', 'blasting cap',
        
        # Related terms
        'blast radius', 'ground zero', 'blown to pieces', 'blown apart', 'blown away'
    ],
    
    'Poison': [
        # Chemical Poisons
        'poison', 'toxin', 'toxic', 'chemical', 'cyanide', 'arsenic', 'mercury',
        'strychnine', 'ricin', 'thallium', 'polonium', 'carbon monoxide', 'acid',
        'pesticide', 'hemlock', 'bleach', 'ammonia', 'drain cleaner', 'antifreeze',
        
        # Biological Toxins
        'tetrodotoxin', 'botulinum', 'mushroom', 'nightshade', 'oleander', 'wolfsbane', 'algae',
        
        # Action Verbs
        'contaminate', 'taint', 'spike', 'lace'
    ],
    
    'Narcotics': [
        # Opioids
        'heroin', 'fentanyl', 'morphine', 'oxycodone', 'methadone',
        'hydrocodone', 'codeine', 'opium', 'oxycontin', 'vicodin',
        
        # Stimulants
        'cocaine', 'methamphetamine', 'amphetamine', 'crack cocaine',
        'crystal meth', 'speed', 'mdma', 'ecstasy',
        
        # Depressants
        'barbiturate', 'benzodiazepine', 'valium', 'xanax',
        'quaalude', 'rohypnol', 'ghb',
        
        # General Terms
        'drug', 'medication', 'narcotic', 'substance',
        'prescription', 'pharmaceutical',
        
        # Methods of Use
        'overdose', 'injection', 'intravenous', 'snort', 'inhale',
        
        # Addiction Terms
        'addiction', 'dependence', 'substance abuse', 'withdrawal'
    ],

    'Animals': [
        # Large predatory mammals
        'lion', 'tiger', 'bear', 'grizzly bear', 'polar bear', 'black bear', 'wolf', 'leopard', 
        'jaguar', 'cougar', 'mountain lion', 'puma', 'panther', 'cheetah', 'hyena', 
        'wolverine', 'badger', 'dingo', 'coyote', 'dog',

        # Large herbivorous mammals
        'elephant', 'hippopotamus', 'rhinoceros', 'cape buffalo', 'bison', 'moose', 'elk',
        'wild boar', 'warthog', 'bull', 'water buffalo', 'musk ox', 'camel', 'zebra',

        # Primates
        'gorilla', 'chimpanzee', 'orangutan', 'baboon', 'mandrill', 'gibbon', 'macaque',

        # Marine creatures
        'shark', 'great white shark', 'tiger shark', 'bull shark', 'hammerhead shark',
        'killer whale', 'orca', 'barracuda', 'piranha', 'giant octopus', 'box jellyfish',
        'portuguese man o war', 'blue-ringed octopus', 'cone snail', 'lionfish',
        'stonefish', 'stingray', 'manta ray', 'moray eel', 'seal',
        'sea lion', 'walrus', 'electric eel', 'giant squid', 'leopard seal',

        # Reptiles and amphibians
        'crocodile', 'alligator', 'caiman', 'komodo dragon', 'monitor lizard',
        'python', 'anaconda', 'boa constrictor', 'cobra', 'king cobra', 'black mamba',
        'viper', 'rattlesnake', 'copperhead', 'snake', 'death adder', 'sea krait',
        'taipan', 'gila monster', 'beaded lizard', 'nile monitor', 'saltwater crocodile',

        # Birds
        'eagle', 'hawk', 'falcon', 'vulture', 'condor', 'cassowary', 'ostrich', 'emu',
        'harpy eagle', 'owl', 'secretary bird',

        # Insects and arachnids
        'spider', 'scorpion', 'centipede', 'millipede', 'wasp', 'hornet', 'bee',
        'fire ant', 'bullet ant', 'tarantula', 'black widow', 'brown recluse',
        'funnel web spider', 'yellow jacket', 'killer bee', 'asian giant hornet',
        'deathstalker scorpion', 'brazilian wandering spider', 'bug', 'bugs',

        # General categories
        'wildlife', 'vermin', 'pest', 'critter', 'serpent', 'reptile', 'insect', 'arachnid',
        'arthropod', 'mammal', 'amphibian', 'parasite'
    ],

    'Supernatural': [
        # Undead
        'zombie', 'ghoul', 'revenant', 'mummy', 'skeleton', 'lich', 'draugr',
        'vampire', 'nosferatu', 'dhampir', 'wraith', 'wight', 'banshee',

        # Demons & Devils
        'demon', 'devil', 'fiend', 'hellspawn', 'imp', 'incubus', 'succubus',
        'archdevil', 'daemon', 'djinn', 'ifrit', 'hellhound', 'fallen angel',
        'abomination', 'infernal', 'antichrist',

        # Spirits & Ethereal
        'ghost', 'spirit', 'phantom', 'specter', 'poltergeist', 'apparition',
        'shade', 'soul', 'entity', 'harbinger', 'manifestation', 'haunting',
        'doppelganger', 'shadow being', 'revenant', 'supernatural'

        # Mythological creatures
        'dragon', 'basilisk', 'chimera', 'griffin', 'hydra', 'manticore',
        'minotaur', 'phoenix', 'unicorn', 'werewolf', 'wendigo', 'kraken',
        'leviathan', 'behemoth', 'cyclops', 'centaur', 'harpy', 'siren',
        'mermaid', 'gorgon', 'sphinx', 'cockatrice',

        # Celtic/Norse/Germanic
        'troll', 'goblin', 'ogre', 'giant', 'elf', 'dwarf', 'changeling',
        'fairy', 'pixie', 'gnome', 'valkyrie', 'nix', 'kobold', 'lindworm',
        'jotunn', 'draugr', 'wyvern',

        # Asian Mythology
        'oni', 'yokai', 'kappa', 'tengu', 'kitsune', 'dragon', 'naga',
        'rakshasa', 'jiangshi', 'yuki-onna', 'yurei', 'kirin',

        # Modern/Urban Fantasy
        'cryptid', 'mothman', 'bigfoot', 'sasquatch', 'yeti', 'chupacabra',
        'jersey devil', 'skinwalker', 'thunderbird', 'loch ness', 'cyborg'

        # Cosmic/Sci-fi
        'alien', 'extraterrestrial', 'xenomorph', 'predator', 'mutant',
        'hybrid', 'cosmic horror', 'eldritch', 'aberration', 'dimensional being',
        'shapeshifter', 'body snatcher', 'cosmic entity', 'starspawn',

        # Lovecraftian
        'elder god', 'old one', 'deep one', 'shoggoth', 'mi-go', 'elder thing',
        'nightgaunt', 'byakhee', 'star spawn', 'great old one', 'outer god',

        # Religious/Biblical
        'angel', 'seraphim', 'cherubim', 'archangel', 'nephilim', 'leviathan',
        'behemoth', 'arabian djinn', 'ifrit', 'golem',

        # Cursed/Possessed
        'possessed', 'cursed being', 'evil doll', 'animated object',
        'homunculus', 'construct', 'familiar', 'animated corpse',

        # General terms
        'monster', 'creature', 'beast', 'entity', 'apparition', 'manifestation',
        'aberration', 'monstrosity', 'horror', 'supernatural being', 'paranormal entity',
        'otherworldly being', 'abomination', 'phantasm', 'terror'
    ],

    'Weather': [
        # Wind events
        'tornado', 'cyclone', 'hurricane', 'typhoon', 'windstorm', 'dust devil',
        'microburst', 'derecho', 'gust', 'whirlwind', 'dust storm', 'sandstorm',
        'haboob', 'windshear', 'squall',

        # Temperature extremes
        'heat wave', 'heatwave', 'hyperthermia', 'heat stroke', 'heat exhaustion',
        'dehydration', 'sunstroke', 'freezing', 'hypothermia', 'frostbite',
        'deep freeze', 'cold snap', 'arctic blast', 'polar vortex',

        # Electrical/Light phenomena
        'lightning', 'thunderbolt', 'electrocution', 'ball lightning',
        'sheet lightning', 'fork lightning', 'thunder', 'St. Elmo\'s fire',
        'aurora', 'northern lights', 'solar flare', 'electromagnetic pulse',
        'solar storm', 'geomagnetic storm',

        # Air quality
        'smog', 'pollution',
        'volcanic', 'chemical fog', 'gas cloud',

        # Precipitation (non-water)
        'hail', 'sleet', 'ice storm', 'black ice', 'freezing rain',
        'graupel', 'ice pellets', 'rime', 'hoarfrost', 'freeze',
        'fog', 'mist', 'whiteout', 'blizzard',

        # Atmospheric events
        'avalanche', 'landslide', 'mudslide', 'rockslide', 'sinkhole',
        'earthquake', 'pyroclastic flow', 'lava flow',
        'ash cloud', 'meteor shower', 'meteorite', 'asteroid',

        # General terms
        'weather', 'natural disaster', 'atmospheric phenomenon',
        'meteorological event', 'climate event'
    ],
    
    'Ailment': [
        # Common Terminal Illnesses
        'cancer', 'tumor', 'leukemia', 'lymphoma', 'melanoma',
        'disease', 'plague', 'hemorrhage','fever', 'infection', 
        'septic', 'old age','bleed out', 'comatose','sick', 'ill', 
        'ailment', 'affliction', 'terminal', 'chronic', 'infect',
        'illness', 'health complications', 'sickness', 'hypothermia',
        
        # Heart Related
        'heart attack', 'cardiac arrest', 'heart failure',
        'stroke', 'aneurysm', 'blood clot', 'embolism', 
        'heart condition',
        
        # Respiratory
        'tuberculosis', 'consumption', 'pneumonia',
        'emphysema', 'asthma attack', 'black lung', 'pleurisy',
        
        # Historical/Period Diseases
        'cholera', 'typhoid', 'smallpox', 'scarlet fever',
        'typhus', 'dysentery', 'yellow fever', 'malaria',
        'polio', 'plague', 'bubonic', 'spanish flu',
        'consumption', 'dropsy', 'grippe',
        
        # War/Military Related
        'gangrene', 'infection', 'sepsis', 'shell shock',
        'trench foot',
        
        # Modern Viral/Infectious
        'aids', 'hiv', 'covid', 'coronavirus', 'sars',
        'ebola', 'virus', 'viral', 'pandemic', 'epidemic',
        
        # Age Related
        'old age', 'natural causes', 'dementia', 
        'alzheimers', 'parkinsons', 'senility',
        
        # Organ Failure
        'liver failure', 'kidney failure', 'organ failure',
        'multiple organ failure', 'renal failure',
        'respiratory failure', 'cirrhosis',
        
        # Brain Conditions
        'seizure', 'epilepsy', 'neurological',
        
        # General Medical Terms / Other
        'deathbed', 'birth defect', 'genetic disorder', 
        'miscarriage', 'natural cause'
    ]
}

# We want to classify based on the first word that appears in the string that exists in the dictionary.
# This is probably the best method of inferring cause of death, as it *should* be in the opener.

# Start with earliest_index being equal to inf so all valid index will be smaller.
# Earliest_index grabs the first index of a word that exists in the dictionary.
# Update values whenever a smaller index exists, otherwise skip.

def identify_cause(text):
    text = word_tokenize(text.lower())
    first_instance = None
    earliest_index = float('inf')
    
    for cause, keys in keywords.items():
        for key in keys:
            # Handling for compound words
            if ' ' in key:
                compound_words = key.split(' ')
                for i in range(len(text) - len(compound_words) + 1):
                    if all(text[i+j] == compound_word for j, compound_word in enumerate(compound_words)):
                        index = i
                        if index < earliest_index:
                            earliest_index = index
                            first_instance = cause
            else:
                # Original single-word handling
                if key in text:
                    index = text.index(key)
                    if index < earliest_index:
                        earliest_index = index
                        first_instance = cause
                    
    if first_instance:
        return first_instance
    return 'Other'

new_df['cause_of_death'] = new_df['processed_text'].apply(identify_cause)

In [35]:
# Use the value_counts() method to count occurrences of each unique value
count_series = new_df['cause_of_death'].value_counts()

# count_series will already be sorted in descending order by default
print(count_series)

cause_of_death
Firearms              21745
Cutting Instrument    13646
Other                  8393
Supernatural           8214
Impact                 4337
Strangulation          3353
Animals                3111
Ailment                2959
Explosives             2804
Vehicular              2793
Personal Weapons       2574
Drowning               2149
Blunt Objects          2111
Fire                   1487
Poison                 1084
Weather                 647
Narcotics               635
Name: count, dtype: int64


In [36]:
new_df.rename(columns={'title': 'Name', 'text': 'Movie', 'year': 'Year', 'cause_of_death':'Cause of Death'}, inplace=True)

In [37]:
new_df.to_csv('output/Cinemorgue.csv', columns=['Name', 'Movie', 'Year', 'Cause of Death'], index=False)

Top 8 actors, and their causes of death.

In [38]:
# Group by 'Name' and 'Movie' and count the occurrences
top_8 = new_df.groupby(['Name']).size().reset_index(name='Count')
top_8 = top_8.sort_values(by='Count', ascending=False).head(8)

# For each person in top 8, get their year range and calculate statistics
markdown_output = ""
for name in top_8['Name']:
    person_data = new_df[new_df['Name'] == name]
    
    # Get year range and difference
    min_year = person_data['Year'].astype(int).min()
    max_year = person_data['Year'].astype(int).max()
    year_diff = max_year - min_year
    
    # Calculate deaths and average
    total_deaths = len(person_data)
    avg_deaths_per_year = total_deaths / (year_diff + 1)
    
    # Create markdown entry
    markdown_output += f"### {name}\n"
    markdown_output += f"* Filmography: {min_year} - {max_year}\n"
    markdown_output += f"* Deaths: {total_deaths}\n"
    markdown_output += f"* Average per year: {avg_deaths_per_year:.2f}\n\n"

print(markdown_output)

### Danny Trejo
* Filmography: 1987 - 2023
* Deaths: 78
* Average per year: 2.11

### Christopher Lee
* Filmography: 1948 - 2011
* Deaths: 69
* Average per year: 1.08

### Feng Ku
* Filmography: 1966 - 1997
* Deaths: 61
* Average per year: 1.91

### Lance Henriksen
* Filmography: 1973 - 2020
* Deaths: 58
* Average per year: 1.21

### Udo Kier
* Filmography: 1968 - 2021
* Deaths: 52
* Average per year: 0.96

### Eric Roberts
* Filmography: 1983 - 2024
* Deaths: 51
* Average per year: 1.21

### Frank Welker
* Filmography: 1975 - 2019
* Deaths: 50
* Average per year: 1.11

### John Carradine
* Filmography: 1932 - 1989
* Deaths: 49
* Average per year: 0.84




In [39]:
markdown_output = ""
for name in top_8['Name']:
    # Get all causes of death and their counts for this person
    death_causes = (new_df[new_df['Name'] == name]
                   .groupby(['Cause of Death'])
                   .size()
                   .reset_index(name='Count')
                   .sort_values('Count', ascending=False))
    
    # Separate 'Other' from the rest of the causes
    other_cause = death_causes[death_causes['Cause of Death'] == 'Other']
    non_other_causes = death_causes[death_causes['Cause of Death'] != 'Other']
    
    # Format the non-Other causes with their counts
    causes_formatted = [f"{cause} ({count})" for cause, count in zip(non_other_causes['Cause of Death'], non_other_causes['Count'])]
    
    # Add 'Other' at the end if it exists
    if not other_cause.empty:
        other_formatted = f"Other ({other_cause.iloc[0]['Count']})"
        causes_formatted.append(other_formatted)
    
    # Join all causes with commas
    causes_string = ', '.join(causes_formatted)
    
    # Create markdown entry
    markdown_output += f"### {name}\n"
    markdown_output += f"{causes_string}\n\n"

print(markdown_output)

### Danny Trejo
Firearms (33), Cutting Instrument (10), Animals (6), Supernatural (6), Personal Weapons (4), Explosives (3), Fire (3), Blunt Objects (2), Strangulation (2), Vehicular (2), Drowning (1), Impact (1), Narcotics (1), Other (4)

### Christopher Lee
Supernatural (18), Cutting Instrument (15), Fire (9), Impact (5), Drowning (4), Firearms (4), Ailment (3), Strangulation (3), Explosives (1), Personal Weapons (1), Vehicular (1), Other (5)

### Feng Ku
Cutting Instrument (26), Firearms (9), Animals (6), Personal Weapons (5), Strangulation (5), Impact (4), Supernatural (4), Blunt Objects (1), Other (1)

### Lance Henriksen
Firearms (15), Cutting Instrument (14), Supernatural (9), Impact (4), Ailment (2), Animals (2), Blunt Objects (2), Explosives (1), Fire (1), Personal Weapons (1), Poison (1), Strangulation (1), Vehicular (1), Other (4)

### Udo Kier
Firearms (12), Cutting Instrument (10), Supernatural (6), Blunt Objects (4), Vehicular (3), Impact (2), Personal Weapons (2), Poison