Red Heart Campaign Database Visualization

TW: This notebook contains discussion and media around topics such as domestic violence, sexual assault, physical violence, etc.

This notebook serves as a tool to visualize the data collected by the Red Heart Campaign. You can find the database [here](https://theredheartcampaign.org/database/).

We start by downloading the website. Because the site loads each case dynamically, we must save a static copy of it before we can scrape the database. You can find the html file at ```Database _ The RED HEART Campaign.html```. Its supporting files are in the folder ```Database _ The RED HEART Campaign_files```.

To scrape the static html page, we use the Python library BeautifulSoup. Documentation [here](https://beautiful-soup-4.readthedocs.io/en/latest/#). Let's import the library, along with [Pandas](https://pandas.pydata.org/) for data analysis and [pathlib](https://docs.python.org/3/library/pathlib.html) for file organization.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
from pathlib import Path

Pandas flags deprecated calls such as df.append with a FutureWarning, so we use the following code snippet to quiet those warnings.

In [2]:
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

We read the html file and create our "soup" for BeautifulSoup to parse:

In [3]:
with open("Database _ The RED HEART Campaign.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

The database identifies each case with an ID that contains "res-". We can extract these IDs and create a list of all of them.

In [4]:
div_tags = soup.find_all("div")
ids = []
for div in div_tags:
    ID = div.get("id")
    if ID is not None and "res-" in ID:
        ids.append(ID)

Display all the IDs:

In [5]:
print(ids)

['res-6786', 'res-6787', 'res-6788', 'res-6789', 'res-6790', 'res-6791', 'res-6792', 'res-6795', 'res-6796', 'res-6798', 'res-6799', 'res-6800', 'res-6801', 'res-6802', 'res-6808', 'res-6803', 'res-6804', 'res-6806', 'res-6807', 'res-6809', 'res-6810', 'res-6811', 'res-6812', 'res-6813', 'res-6814', 'res-6815', 'res-6816', 'res-6817', 'res-6818', 'res-6819', 'res-6820', 'res-6823', 'res-6824', 'res-6825', 'res-6827', 'res-6828', 'res-6829', 'res-6830', 'res-6831', 'res-6832', 'res-6833', 'res-6762', 'res-6778', 'res-6665', 'res-6667', 'res-6673', 'res-6674', 'res-6675', 'res-6681', 'res-6682', 'res-6685', 'res-6694', 'res-6695', 'res-6686', 'res-6687', 'res-6688', 'res-6689', 'res-6690', 'res-6692', 'res-6693', 'res-6696', 'res-6697', 'res-6698', 'res-6699', 'res-6700', 'res-6701', 'res-6705', 'res-6716', 'res-6717', 'res-6718', 'res-6719', 'res-6729', 'res-6741', 'res-6758', 'res-6720', 'res-6721', 'res-6722', 'res-6724', 'res-6725', 'res-6730', 'res-6726', 'res-6727', 'res-6728', 're
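As a small illustration of the ID extraction above, here is a sketch that runs on a toy HTML string instead of the saved page. The toy markup and the names `toy_html`, `toy_soup`, `loop_ids`, and `regex_ids` are assumptions made up for this example. It also shows that `find_all` can filter on the `id` attribute directly with a compiled regex, which should give the same list as the loop.

```python
import re
from bs4 import BeautifulSoup

# Toy HTML standing in for the saved page (hypothetical content).
toy_html = """
<div id="header"></div>
<div id="res-1" data-victim_name="A"></div>
<div id="res-2" data-victim_name="B"></div>
<div></div>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

# Same loop-and-filter logic as the notebook cell.
loop_ids = []
for div in toy_soup.find_all("div"):
    div_id = div.get("id")  # None when the div has no id attribute
    if div_id is not None and "res-" in div_id:
        loop_ids.append(div_id)

# Alternative: let find_all match the id attribute against a regex.
regex_ids = [div["id"] for div in toy_soup.find_all("div", id=re.compile("res-"))]

print(loop_ids)
print(regex_ids)
```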

Each case within the html file is contained within a div. The div has the following attributes:
- class="custom_case-display custom-tooltip [id#]"
- data-age_of_death
- data-cause_of_death
- data-charge
- data-context_of_death
- data-gender
- data-img
- data-location
- data-min_sentence
- data-rel_to_victim
- data-source1
- data-source2
- data-story
- data-victim_name
- data-year_of_death
- id
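
To make the attribute list above concrete, here is a minimal sketch of reading such attributes from a single div. The toy div and its values are invented for illustration; only the attribute names come from the list. `Tag.get` returns `None` when an attribute is absent, and `Tag.attrs` exposes all attributes as a dictionary.

```python
from bs4 import BeautifulSoup

# A made-up div carrying a few of the attributes listed above.
toy_html = (
    '<div id="res-1" class="custom_case-display custom-tooltip 1" '
    'data-gender="Male" data-year_of_death="2022"></div>'
)
toy_div = BeautifulSoup(toy_html, "html.parser").find("div", {"id": "res-1"})

gender = toy_div.get("data-gender")    # attribute present
missing = toy_div.get("data-source2")  # attribute absent, so None

# Collect every data-* attribute, dropping the "data-" prefix from the key.
data_fields = {
    key[len("data-"):]: value
    for key, value in toy_div.attrs.items()
    if key.startswith("data-")
}
print(gender, missing, data_fields)
```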

Let's extract all of these from each div and put them into a massive dataframe, then export it to a CSV.

In [6]:
rows = []
for case_id in ids:
    div = soup.find("div", {"id": case_id})

    year = div.get("data-year_of_death")
    name = div.get("data-victim_name")
    age = div.get("data-age_of_death")
    gender = div.get("data-gender")
    cause = div.get("data-cause_of_death")
    charge = div.get("data-charge")
    context = div.get("data-context_of_death")
    location = div.get("data-location")
    sentence = div.get("data-min_sentence")
    relation = div.get("data-rel_to_victim")
    source1 = div.get("data-source1")
    source2 = div.get("data-source2")
    story = div.get("data-story")

    data_row = {"id":case_id, "year":year, "name":name, "age":age, "gender":gender, "cause":cause, "charge":charge, "context":context, "location":location, "sentence":sentence, "relation":relation, "source1":source1, "source2":source2, "story":story}
    print(data_row)
    rows.append(data_row)

# Build the dataframe once from the collected rows; DataFrame.append was removed in pandas 2.0.
df = pd.DataFrame(rows, columns=["id", "year", "name", "age", "gender", "cause", "charge", "context", "location", "sentence", "relation", "source1", "source2", "story"])

filepath = Path("redheart_data.csv")
filepath.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(filepath, index=False)

{'id': 'res-6786', 'year': '2022', 'name': 'Unnamed boy', 'age': '17', 'gender': 'Male', 'cause': 'Stabbing', 'charge': 'Murder', 'context': 'Associate violence', 'location': 'Lychee Close, Manoora, Queensland', 'sentence': 'Before the courts', 'relation': 'Associate', 'source1': 'https://www.cairnspost.com.au/news/cairns/teen-charged-over-stabbing-death-at-wild-new-year-street-party/news-story/ccbf51ab38aedf4eaa1ecb4557d990bc', 'source2': '', 'story': 'January 1, 2022: An unnamed 17-year-old boy was stabbed to death at Lychee Close, Manoora, Queensland. A 14-year-old male is charged with his murder. He has not yet faced trial (January, 2022). '}
{'id': 'res-6787', 'year': '2022', 'name': 'Elizabeth Rose Struhs', 'age': '8', 'gender': 'Male, female', 'cause': 'Deprivation of medical treatment', 'charge': 'Murder', 'context': 'Domestic violence', 'location': 'Rangeville, Toowoomba, Queensland', 'sentence': 'Before the courts', 'relation': 'Father, mother', 'source1': 'https://www.thechr

Our CSV of the Red Heart database has been created at ```redheart_data.csv```.

We can't use the data yet, as it needs to be cleaned. There are many typos and inconsistent values that we need to fix and aggregate. Make sure to run each code snippet below, or else we can't visualize the data!

We read our csv into a dataframe for the notebook to use.

In [7]:
df = pd.read_csv("https://raw.githubusercontent.com/TangyKiwi/Worldie/master/RedHeart/redheart_data.csv")

To clean "age," we iterate through all "age" data values. If the value is non-numeric, we have to change it. We change "Unknown" data values to -1, and everything else (age under 1 year, displayed as "X months" or "X weeks") to 0.

In [8]:
for i in range(len(df["age"])):
    if not df["age"][i].isnumeric():
        if df["age"][i] == "Unknown": df.at[i, "age"] = -1
        else: df.at[i, "age"] = 0
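
The same rule can also be written without an explicit loop. The sketch below is one possible vectorized version on a made-up series (`toy_ages` is an assumption, not data from the database). Note that `pd.to_numeric` is more permissive than `str.isnumeric` (for example, it parses decimals), so the two approaches can differ on unusual values.

```python
import pandas as pd

# Made-up age values covering the cases described above.
toy_ages = pd.Series(["17", "8", "Unknown", "3 months", "2 weeks"])

# Non-numeric strings become NaN, then default to 0 (age under 1 year).
numeric = pd.to_numeric(toy_ages, errors="coerce").fillna(0)

# "Unknown" is overridden with -1, then everything is cast to int.
cleaned = numeric.mask(toy_ages == "Unknown", -1).astype(int)
print(cleaned.tolist())
```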

"gender" cleaning: We make each instance lowercase and use strip() to clean out spaces. We then fix some typos using the replace() method.

In [9]:
df["gender"] = df["gender"].str.lower()
df["gender"] = df["gender"].str.strip()
df = df.replace({"gender":{"male, fema":"male, female", "unknown gender":"unknown", "associate violence":"male"}})

"context" cleaning: We make each instance lowercase and use strip(), then fix typos with replace().

In [10]:
df["context"] = df["context"].str.lower()
df["context"] = df["context"].str.strip()
df = df.replace({"context":{"unknown context":"unknown", "unsolved":"unknown", "unknown context of death":"unknown", "incomplete entry":"unknown", "bashing":"unknown", "domestic violece":"domestic violence", "associate vioence":"associate violence", "neighbour violence":"associate violence", "stanger violence":"stranger violence"}})
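
Here is a minimal sketch of the lowercase, strip, and replace pattern used for "gender" and "context", on a made-up dataframe (`toy` is an assumption). With a nested dictionary and no `regex=True`, `replace` matches whole cell values in the named column, so partial matches are left alone.

```python
import pandas as pd

# Made-up values with stray spaces, mixed case, and a typo.
toy = pd.DataFrame({"context": [" Domestic violece", "Unsolved ", "Stranger violence"]})

toy["context"] = toy["context"].str.lower().str.strip()

# {column: {old_value: new_value}} replaces whole values in that column only.
toy = toy.replace({"context": {"domestic violece": "domestic violence", "unsolved": "unknown"}})
print(toy["context"].tolist())
```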

"cause" cleaning: This is the trickiest. There are many typos and repeated variants of "unknown" ("unknown," "unknown cause of death," "unknown context," etc). We can aggregate all of these into "unknown." We also unify past-tense and present-tense verbs and fix typos. We import ```re``` for regex cleaning, which lets us remove any text between parentheses (we aggregate all of these entries under "asphyxiation").

In [11]:
import re

df["cause"] = df["cause"].str.lower()
df["cause"] = df["cause"].str.strip()
df["cause"] = df["cause"].replace({"raped":"rape", "raper":"rape", "bashed":"bashing", "unknown cause of death":"unknown", "poisoned": "poison", "poisoning": "poison", "shootingg": "shooting", "tortured": "torture", "drowned": "drowning", "shaken":"shaking", "stabbbing":"stabbing", r"\}":")", r"\.":", "}, regex=True)
df["cause"] = df["cause"].str.replace("shooting", "shootin")
df["cause"] = df["cause"].str.replace("shootin", "shooting")
df["cause"] = df["cause"].str.replace("bombing", "bomb")
df["cause"] = df["cause"].str.replace("bomb", "bombing")
df["cause"] = df["cause"].str.replace("drug overdose", "overdose")
df["cause"] = df["cause"].str.replace("overdose", "drug overdose")
df["cause"] = df["cause"].replace({"gas poison":"gas poisoning", "unknown cause":"unknown"}, regex=True)
for i in range(len(df["cause"])):
    df.at[i, "cause"] = re.sub(r'\([^)]*\)', '', df["cause"][i])
df["cause"] = df["cause"].str.replace("asphyxiation ,stabbing", "asphyxiation, stabbing")
df["cause"] = df["cause"].str.strip()
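
Two steps in the cell above are easier to see on plain strings. The sketch below uses only the standard library; the helper names `normalize` and `strip_parens` and the sample strings are assumptions for illustration. First, the paired replacements (for example "shooting" to "shootin" and back) exist because a single replacement of "shootin" with "shooting" would also hit values that are already correct. Second, the regex removes a parenthesized qualifier.

```python
import re

def normalize(cause, short, full):
    """Collapse the full form to the short one, then expand every short form."""
    return cause.replace(full, short).replace(short, full)

def strip_parens(cause):
    """Remove any '(...)' group, as the re.sub call in the notebook does."""
    return re.sub(r"\([^)]*\)", "", cause)

# A single replace corrupts values that are already correct...
naive = "shooting".replace("shootin", "shooting")
# ...while the two-step version handles both spellings.
fixed_short = normalize("shootin", "shootin", "shooting")
fixed_full = normalize("shooting", "shootin", "shooting")

# Parenthesized detail is dropped, then the leftover spacing is repaired.
stripped = strip_parens("asphyxiation (strangulation),stabbing")
repaired = stripped.replace("asphyxiation ,stabbing", "asphyxiation, stabbing")

print(naive, fixed_short, fixed_full, repaired)
```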

Now with the csv cleaned, we can put this into a new file. You can find the cleaned csv at ```redheart_data_cleaned.csv```.

In [12]:
filepath = Path("redheart_data_cleaned.csv")
filepath.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(filepath, index=False)

We can finally start visualizing the data. We import plotly for graphing, numpy for processing, and read in the cleaned csv and put it in a dataframe.

In [13]:
import plotly.express as px
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/TangyKiwi/Worldie/master/RedHeart/redheart_data_cleaned.csv")

All Ages Bar Chart:

In [14]:
age = df.get("age")
age = list(map(int, age))

s = pd.DataFrame({"Age":age})["Age"].value_counts()
age_counts = pd.DataFrame({"Age":s.index, "Count":s.values})
fig = px.bar(age_counts, x="Age", y="Count", title="Age")
fig.show()

Age Group Bar Chart:

In [15]:
age_groups = pd.cut(age, bins=[-2, -1, 14, 24, 64, 100])
# (-2, -1] (-1, 14] (14, 24] (24, 64] (64, 100]
unique, counts = np.unique(age_groups, return_counts=True)
age_groups = dict(zip(unique, counts))
age_groups = pd.DataFrame({"Age Groups":["Unknown", "Children (0-14)", "Youth (15-24)", "Adults (25-64)", "Seniors (65+)"], "Count":age_groups.values()})
fig = px.bar(age_groups, x="Age Groups", y="Count", title="Age Groups")
fig.show()
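
As an alternative sketch of the binning step, `pd.cut` accepts a `labels` argument, so the group names can be attached when the bins are created rather than matched up by position afterwards. The ages below are made up (`toy_ages`, `toy_groups`, and `toy_counts` are assumptions); the bin edges and labels are the ones used above.

```python
import pandas as pd

# Made-up ages sitting on the bin edges; -1 stands for "Unknown".
toy_ages = [-1, 0, 14, 15, 24, 25, 64, 65, 90]
labels = ["Unknown", "Children (0-14)", "Youth (15-24)", "Adults (25-64)", "Seniors (65+)"]

# Bins are right-inclusive: (-2, -1] (-1, 14] (14, 24] (24, 64] (64, 100]
toy_groups = pd.cut(toy_ages, bins=[-2, -1, 14, 24, 64, 100], labels=labels)

# sort=False keeps the counts in bin order instead of ordering by frequency.
toy_counts = pd.Series(toy_groups).value_counts(sort=False)
print(toy_counts.to_dict())
```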

Age Group Pie Chart:

In [16]:
fig = px.pie(age_groups, values="Count", names="Age Groups", title="Age Groups")
fig.show()

Context Bar Chart:

In [17]:
s = df["context"].value_counts()
context_counts = pd.DataFrame({"Context":s.index, "Count":s.values})
fig = px.bar(context_counts, x="Context", y="Count", title="Context")
fig.show()

Context Pie Chart:

In [18]:
fig = px.pie(context_counts, values="Count", names="Context", title="Context")
fig.show()

Cause Bar Chart:

Some cases have multiple causes; we handle this by splitting each value on commas.

In [19]:
causes = []
for i in range(len(df["cause"])):
    for c in df["cause"][i].split(", "):
        causes.append(c.strip())

s = pd.DataFrame({"Cause":causes})["Cause"].value_counts()
cause_counts = pd.DataFrame({"Cause":s.index, "Count":s.values})
fig = px.bar(cause_counts, x="Cause", y="Count", title="Cause of Death")
fig.show()
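
The split-and-count step can also be sketched with pandas string methods and `explode`, which turns each list element into its own row. The series below is made up (`toy_causes` and `toy_cause_counts` are assumptions), and this is an alternative formulation rather than the notebook's own code.

```python
import pandas as pd

# Made-up "cause" values, one of them listing two causes.
toy_causes = pd.Series(["stabbing", "bashing, stabbing", "shooting"])

# Split on ", ", give each cause its own row, tidy whitespace, then count.
toy_cause_counts = toy_causes.str.split(", ").explode().str.strip().value_counts()
print(toy_cause_counts.to_dict())
```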

Year Bar Chart:

In [20]:
year = df.get("year")
year = list(map(int, year))
s = pd.DataFrame({"Year":year})["Year"].value_counts()
year_counts = pd.DataFrame({"Year":s.index, "Count":s.values})
fig = px.bar(year_counts, x="Year", y="Count", title="Year")
fig.show()

Gender Bar Chart:

In [21]:
gender = df.get("gender")
s = pd.DataFrame({"Accused Gender":gender})["Accused Gender"].value_counts()
gender_counts = pd.DataFrame({"Accused Gender":s.index, "Count":s.values})
fig = px.bar(gender_counts, x="Accused Gender", y="Count", title="Accused Gender")
fig.show()

Gender Pie Chart:

In [22]:
fig = px.pie(gender_counts, values="Count", names="Accused Gender", title="Accused Gender")
fig.show()

More to come!