Our goal is to parse a wiki dump from https://cinemorgue.fandom.com/wiki/Special:Statistics to figure out which actor or actress has died the most in movies, because I don't believe a single click-bait article I've found online. Trust, but verify.

If you wish to grab a more recent update, download the "Current Pages" to get an XML file.
Do not download "Current Page and History" unless you're a bit of a parsing masochist.

Fandom (the site Cinemorgue is hosted on) uses the MediaWiki XML export format, schema version 0.10 (/export-0.10/).
Using the xml tree is way easier than relying on any sort of wikiparser library out there.

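All actor/actress pages should be in the main namespace (ns = 0), so we keep only those pages and skip everything else. Note that film pages such as Niagara (1953) are also in ns 0, as the output below shows, so this filter alone does not remove them; the Film Deaths split further down is what drops pages without that section.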
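To make the namespace filter concrete, here is a minimal sketch on a two-page inline sample. The sample XML is invented for illustration; only the export-0.10 namespace URI and the page/ns/title element names come from the dump format described above. It also uses a prefix mapping instead of the '{uri}tag' form, which is just an alternative spelling of the same lookup.

```python
import xml.etree.ElementTree as ET

# Made-up miniature dump: one article page (ns 0) and one category page (ns 14).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Marilyn Monroe</title><ns>0</ns></page>
  <page><title>Category:Actors</title><ns>14</ns></page>
</mediawiki>"""

# Mapping a prefix to the namespace URI avoids repeating '{uri}tag' everywhere.
NSMAP = {"mw": "http://www.mediawiki.org/xml/export-0.10/"}

root = ET.fromstring(SAMPLE)

# Keep only the titles of pages whose <ns> element is "0".
main_titles = [
    page.findtext("mw:title", namespaces=NSMAP)
    for page in root.findall("mw:page", NSMAP)
    if page.findtext("mw:ns", namespaces=NSMAP) == "0"
]
print(main_titles)
```

The category page is skipped because its ns is 14, not 0.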

In [1]:
import xml.etree.ElementTree as ET
import pandas as pd

NS = 'http://www.mediawiki.org/xml/export-0.10/'

def parse_wikimedia_xml(filepath):
    tree = ET.parse(filepath)
    root = tree.getroot()
    data = []
    for page in root.findall('{%s}page' % NS):
        ns = page.find('{%s}ns' % NS).text
        if ns != "0":
            continue
        title = page.find('{%s}title' % NS).text
        revision = page.find('{%s}revision' % NS)
        text = revision.find('{%s}text' % NS).text
        data.append({'title': title, 'text': text})
    df = pd.DataFrame(data)
    return df

df = parse_wikimedia_xml('cinemorgue.xml')

df.head(5)

Unnamed: 0,title,text
0,Cinemorgue Wiki,<mainpage-leftcolumn-start />\n{{Mainpage welc...
1,Main Page,#REDIRECT [[Cinemorgue Wiki]]
2,Marilyn Monroe,[[File:Marilynmonroe.jpg|frame|Marilyn Monroe ...
3,Joseph Cotten,[[File:Josephcotten.jpg|thumb|230px|Joseph Cot...
4,Niagara (1953),[[File:NiagaraLobbyCard.jpg|frame|Lobby card f...


We will only be looking at Film Deaths, because Television Deaths seems to include a lot of voice actors from animated shows, which feels like cheating and ruins the spirit of finding out which actor has died the most. If you would like to look at one of the other sections instead, comment out the cells for the sections that come before it and change that section's split index value from [0] to [1] to delete all text above it.

The structure of each Cinemorgue page, while not perfect, is relatively consistent.

Each page follows the structure below (if applicable):

-Overview 

-Film Deaths 

-Television Deaths/TV Deaths 

-Video Game Deaths 

-Music Video Deaths 

-Notable Connections 


We just want to look at movies, so we keep only the text below the Film Deaths heading.
Not every page has every later section, so just to be thorough, we check for each one and cut off everything from that heading onward.
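The same cut can be sketched on a single string. The sample wikitext below is invented, and film_deaths_section is a hypothetical helper written for this illustration, not something the notebook defines.

```python
# Headings that may follow "Film Deaths" on a page, as listed above.
LATER_HEADINGS = [
    "Television Deaths",
    "TV Deaths",
    "Video Game Deaths",
    "Music Video Deaths",
    "Notable Connections",
]

def film_deaths_section(text):
    """Return the text after 'Film Deaths' and before any later heading, or None."""
    _, found, rest = text.partition("Film Deaths")
    if not found:
        return None  # page has no Film Deaths section
    for heading in LATER_HEADINGS:
        # Keep only what comes before the heading, if it is present.
        rest = rest.split(heading, 1)[0]
    return rest

# Invented page: one film entry followed by a TV section.
sample = (
    "Intro\n==Film Deaths==\n*'''''Movie A'' (1990)'''\n"
    "==TV Deaths==\n*'''''Show B'' (1995)'''"
)
section = film_deaths_section(sample)
print(repr(section))
```

Like the cells below, this matches the heading text anywhere on the page, so a page that merely mentions "Film Deaths" in its overview would be cut at the wrong place.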

In [2]:
df['text'] = df['text'].str.split("Film Deaths", n=1, expand=True)[1]

In [3]:
df['text'] = df['text'].str.split("Television Deaths", n=1, expand=True)[0]

In [4]:
#Some maverick started labeling this section TV Deaths.
df['text'] = df['text'].str.split("TV Deaths", n=1, expand=True)[0]

In [5]:
df['text'] = df['text'].str.split("Video Game Deaths", n=1, expand=True)[0]

In [6]:
df['text'] = df['text'].str.split("Music Video Deaths", n=1, expand=True)[0]

In [7]:
df['text'] = df['text'].str.split("Notable Connections", n=1, expand=True)[0]

In [8]:
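#Drop all recently nulled rows (pages with no Film Deaths section).
#Copy so that adding columns later doesn't trigger a SettingWithCopyWarning.
df = df.dropna(subset=['text']).copy()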

In [9]:
#Every bullet point for a movie is signified by a line break + asterisk (\n*).
#To get a quick count of movie deaths, we can just do a count of bullet points.
#Null rows were dropped above, so every value here is a string.

df['death_count'] = df['text'].apply(lambda x: x.count('\n*'))
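
One caveat worth checking: '\n*' also matches nested bullets (\n**), if a page uses them. The sketch below, on an invented sample, compares the plain count with a regex that only counts top-level bullets. Whether nested bullets occur often enough in the dump to change the ranking is an assumption that has not been verified here.

```python
import re

# Invented sample: two top-level entries, one with a nested sub-bullet.
sample = "==\n*'''''Movie A'' (1990)'''\n**extra note\n*'''''Movie B'' (1992)'''"

# Plain substring count, as in the cell above.
plain_count = sample.count("\n*")

# A newline and one asterisk that is not followed by a second asterisk.
top_level_count = len(re.findall(r"\n\*(?!\*)", sample))

print(plain_count, top_level_count)
```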

In [10]:
#Buzzfeed top 10 list time, sort desc counts.
df.sort_values(by='death_count', ascending=False, inplace=True)

#Reset index.
df = df.reset_index(drop=True)

#Print top 10.
df.head(10)

Unnamed: 0,title,text,death_count
0,Danny Trejo,==\n\n*'''''[[Death Wish 4: The Crackdown (198...,71
1,Christopher Lee,==\n*'''''Penny and the Pownall Case'' (1948)...,66
2,Lance Henriksen,:==\n\n*'''''The Visitor'' (1979)''' [''Raymon...,58
3,Mel Blanc,==\n*'''''Daffy Duck and the Dinosaur'' (1939;...,56
4,John Carradine,:==\n*'''''[[The Sign of the Cross (1932)|The ...,48
5,Vincent Price,==\n*'''''[[Tower of London (1939)|Tower of Lo...,45
6,Boris Karloff,==\n*'''''The Criminal Code'' (1931)''' [''Gal...,43
7,John Hurt,==\n*'''''The Wild and the Willing'' (1962)'''...,43
8,Dennis Hopper,==\n*'''''[[Gunfight at the O.K. Corral (1957)...,42
9,Nicolas Cage,:==\n*'''''[[The Cotton Club (1984)]]''''' [''...,41


We will now attempt to get each movie title into its own row, which you can then join to IMDb data, etc. for more comparative fun.

Disclosure: 
This is the part where things start to go off the rails.
Thinking you can put years of manual entry into a tidy box with a few lines of code is a testament to man's arrogance.
We're just going to do our best to get each title, and clean up the most common errors.

So now that we have each actor's name and their filmography side by side, it's time to break out each movie title into a new row.

Each movie title follows the structure:

-Bullet point (\n*) to signify new line.

-Two apostrophes ('') at the start of the title.

-Two apostrophes ('') at the end of the title.


There are a lot of weird cases, but most entries follow this format: in wikitext, ''' marks bold and '' marks italics, so a bold entry with an italic title opens with five apostrophes (''''').
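
As a rough illustration of that pattern, the regexes below pull the title and year out of one entry (with the leading bullet already removed). The sample entries are invented, parse_entry is a hypothetical helper, and the regexes are an assumption about the common case that will not match every page.

```python
import re

# Five apostrophes, an optional [[ link opener, then the shortest title that is
# followed by an opening "(year", a closing '', or a link pipe.
TITLE = re.compile(r"^'''''(?:\[\[)?(?P<title>.+?)\s*(?:\(\d{4}|''|\|)")
# First four-digit number that directly follows an opening parenthesis.
YEAR = re.compile(r"\((\d{4})")

def parse_entry(entry):
    """Return (title, year) for one bullet entry; either may be None."""
    title_match = TITLE.match(entry)
    year_match = YEAR.search(entry)
    return (
        title_match.group("title") if title_match else None,
        int(year_match.group(1)) if year_match else None,
    )

# Invented entries: a linked title, a title with a misplaced space, a plain title.
samples = [
    "'''''[[Movie A (1990)|Movie A]]''''' [''Role'']",
    "'''''Movie B ''(1992)''' [''Role'']",
    "'''''Movie C'' (1995)''' [''Role'']",
]
parsed = [parse_entry(s) for s in samples]
print(parsed)
```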


In [11]:
# New DataFrame to store the split rows
new_rows = {'title': [], 'text': []}

# Iterate through the original DataFrame
for idx, row in df.iterrows():
    title = row['title']
    text_parts = row['text'].split("\n*'''")
    
    # Append the new rows to the new DataFrame
    # Skip the first element, usually contains gibberish before first line.
    for part in text_parts[1:]:
        new_rows['title'].append(title)
        new_rows['text'].append(part)

# Create the new DataFrame
new_df = pd.DataFrame(new_rows)

# Print the result
new_df.head(3)

Unnamed: 0,title,text
0,Danny Trejo,''[[Death Wish 4: The Crackdown (1987)|Death W...
1,Danny Trejo,''The Hidden ''(1987)''' [''Prisoner'']: Shot ...
2,Danny Trejo,''Bulletproof'' (1988)''' [''Sharkey'']: Kille...
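
For comparison, the same one-row-per-film reshaping can be sketched with DataFrame.explode instead of an explicit loop. This is a swapped-in alternative, not what the cell above does, and the two-actor sample is invented. Plain Python str.split is used for the separator because pandas' own str.split may treat a multi-character pattern such as "\n*'''" as a regular expression.

```python
import pandas as pd

# Invented two-actor sample in the same shape as df above.
sample = pd.DataFrame({
    "title": ["Actor A", "Actor B"],
    "text": [
        "==\n*'''''Movie A'' (1990)'''\n*'''''Movie B'' (1992)'''",
        "==\n*'''''Movie C'' (1995)'''",
    ],
})

# Split each page into entries, dropping the fragment before the first bullet.
parts = sample["text"].apply(lambda t: t.split("\n*'''")[1:])

# explode() turns each list element into its own row, repeating the actor name.
exploded = sample.assign(text=parts).explode("text").reset_index(drop=True)
print(exploded)
```

Pages whose list comes back empty would show up as a null row after explode, so a dropna on the text column would be needed on real data.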


In [12]:
# There will be some rows that don't start with '' like they should.
# We select the movie title between the opening and closing '',
# and pass rows that don't start with the container apostrophes through unchanged.

def extract_text_between_quotes(input_string):
    if input_string.startswith("''"):
        # The opening '' sits at index 0; look for the closing '' after it.
        closing_index = input_string.find("''", 2)
        if closing_index == -1:
            # No closing '': keep everything after the opening pair
            # instead of letting find()'s -1 chop off the last character.
            closing_index = len(input_string)
        result = input_string[2:closing_index].strip()
    else:
        # Return the whole input string if it doesn't start with ''
        result = input_string.strip()

    return result

output = []

for idx, row in new_df.iterrows():
    # Look columns up by name rather than relying on their order.
    extracted_text = extract_text_between_quotes(row['text'])
    output.append({"title": row['title'], "text": extracted_text})

new_df = pd.DataFrame(output)

new_df.head(3)

Unnamed: 0,title,text
0,Danny Trejo,[[Death Wish 4: The Crackdown (1987)|Death Wis...
1,Danny Trejo,The Hidden
2,Danny Trejo,Bulletproof


In [13]:
# For titles that were skipped above, delete everything after the first ''.
new_df['text'] = new_df['text'].str.split("''", n=1, expand=True)[0]

In [14]:
# Remove all instances of dates. Some are formatted like (1972; animated) so we're just removing all parentheses.
# Hopefully there aren't many movies that use parentheses in their titles.
new_df['text'] = new_df['text'].str.replace(r"\s*\([^()]*\)", "", regex=True)

In [15]:
# Some titles are hyperlinks, and will have the title twice in between the container apostrophes.
# The two titles are split by |, so just delete all text after |.
new_df['text'] = new_df['text'].str.split("|", n=1, expand=True)[0]

In [16]:
# Titles that have links will be inside [[ ]]. Just delete the brackets, since the step above already dropped everything after |.
new_df['text'] = new_df['text'].str.replace(r'\[|\]', '', regex=True)

In [17]:
#Some titles will have bonus parentheses () due to typos, etc. so just delete all instances.
new_df['text'] = new_df['text'].replace(to_replace=r'\(|\)', value='', regex=True)

In [18]:
#Some titles will just have the link hardcoded in the title which is pretty impressive.
new_df['text'] = new_df['text'].str.replace(r'https://\S+\s*', '', regex=True)

In [19]:
#Some titles might even hardcode the unsecured link instead.
new_df['text'] = new_df['text'].str.replace(r'http://\S+\s*', '', regex=True)

In [20]:
# Some titles have stray html formatting tags in them. 
new_df['text'] = new_df['text'].str.replace(r'\s*<.*?>\s*', '', regex=True)
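
For reference, the cleanup in cells 14 to 20 can be condensed into one chained expression. This is a sketch on invented titles, not a drop-in replacement: it merges the two URL patterns into https? and the bracket and parenthesis removals into one character class, which should behave the same on these samples but has not been run against the full dump.

```python
import pandas as pd

# Invented raw titles covering the cases handled above.
raw = pd.Series([
    "[[Movie A (1990)|Movie A]]",
    "Movie B (1992; animated)",
    "Movie C<br />",
    "https://example.com/wiki/Movie_D Movie D",
])

cleaned = (
    raw.str.replace(r"\s*\([^()]*\)", "", regex=True)  # drop (year ...) groups
    .str.split("|", n=1).str[0]                        # keep the text before |
    .str.replace(r"[\[\]()]", "", regex=True)          # drop leftover brackets/parens
    .str.replace(r"https?://\S+\s*", "", regex=True)   # drop hard-coded URLs
    .str.replace(r"\s*<.*?>\s*", "", regex=True)       # drop stray HTML tags
    .str.strip()
)
print(cleaned.tolist())
```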

In [21]:
#Final export.
new_df.to_csv('Cinemorgue.csv', index=False)
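
Once the CSV exists, joining it to another dataset might look like the sketch below. The deaths frame mimics the exported columns ('title' is the actor, 'text' is the film) with invented rows, and the films lookup table and its column names are entirely made up for illustration.

```python
import pandas as pd

# Same column layout as Cinemorgue.csv: 'title' is the actor, 'text' the film.
deaths = pd.DataFrame({
    "title": ["Actor A", "Actor A", "Actor B"],
    "text": ["Movie A", "Movie B", "Movie A"],
})

# Hypothetical lookup table; the column names are made up for this sketch.
films = pd.DataFrame({
    "film": ["Movie A", "Movie B"],
    "year": [1990, 1992],
})

# Left join keeps every death row, even if the film has no match.
joined = deaths.merge(films, left_on="text", right_on="film", how="left")

# Deaths per actor, highest first.
deaths_per_actor = joined.groupby("title").size().sort_values(ascending=False)
print(deaths_per_actor)
```

Because the years were stripped from the titles earlier, a join on title alone cannot tell remakes or same-named films apart, so treat matches with some suspicion.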