<h1> <center> Analysis of the Haunted Places Dataset </center> </h1>

<b> Team 3 </b> <br>
Members: 

- Zili Yang
- Chen Yi Weng
- Aadarsh Sudhir Ghiya 
- Colin Leahey
- Niromikha Jayakumar 
- Yung Yee Chia


## Introduction

This notebook explores the Haunted Places dataset, extracts new features, joins additional datasets, and analyzes the data using similarity metrics and clustering techniques. The tasks include:
1. Preprocessing the dataset.
2. Extracting new features from descriptions.
3. Joining external datasets.
4. Analyzing the combined dataset using Tika-Similarity.
5. Visualizing the results.

## 1. Download and install Apache Tika

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
pip install tika

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install number-parser

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install datefinder

Note: you may need to restart the kernel to use updated packages.


## 2. Download the Haunted Places dataset

In [7]:
import pandas as pd

# # Load the dataset
# haunted_places_df = pd.read_csv('../Data/haunted_places.csv')

## 3. Create a combined TSV file for your Haunted Places dataset

In [9]:
# Convert the original CSV file to TSV
# Load the TSV file for further processing
# haunted_places_df.to_csv("haunted_places.tsv", sep='\t', index=False)
# haunted_places = pd.read_csv("haunted_places.tsv", sep='\t')


# tsv_path = '../Data/cleaned_haunted_places.tsv'
df = pd.read_csv("../Data/cleaned_haunted_places.tsv", sep='\t')

## 4. a) Audio Evidence

In [10]:
import re

def has_audio_evidence(description):
    audio_keywords = ["noises", "sound of snapping neck", "nursery rhymes"]
    return any(re.search(rf'\b{keyword}\b', description, re.IGNORECASE) for keyword in audio_keywords)

df['Audio Evidence'] = df['description'].apply(has_audio_evidence)
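
The matcher above interpolates each keyword directly into the regular expression and assumes every description is a string. The following is a minimal sketch of a more defensive variant, assuming that some descriptions may be missing (NaN or None). The helper name `contains_any_keyword` is hypothetical and not part of the notebook.

```python
import re

def contains_any_keyword(text, keywords):
    """Return True if any keyword occurs as a whole word or phrase in text."""
    if not isinstance(text, str):  # e.g. NaN or None for a missing description
        return False
    # re.escape keeps keywords that contain regex metacharacters literal
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

audio_keywords = ["noises", "sound of snapping neck", "nursery rhymes"]
print(contains_any_keyword("Strange NOISES come from the attic.", audio_keywords))
print(contains_any_keyword(float("nan"), audio_keywords))
```

Building one alternation pattern per call also means each description is scanned once rather than once per keyword.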

## 4. b) Image/Video/Visual Evidence

In [11]:
def has_visual_evidence(description):
    visual_keywords = ["cameras", "take pictures", "names of children written on walls"]
    return any(re.search(rf'\b{keyword}\b', description, re.IGNORECASE) for keyword in visual_keywords)

df['Image/Video/Visual Evidence'] = df['description'].apply(has_visual_evidence)

## 4. c) Haunted Places Date

In [12]:
import datefinder
import datetime
from datetime import date

def extract_date(description):
    try:
        # Attempt to find dates in the description
        matches = datefinder.find_dates(description)
        
        # Extract the first valid date
        for match in matches:
            return match.date()  # Return only the date part
        
    except Exception:
        # Silently handle errors without printing messages
        pass
    
    # Fallback to '2025-01-01' if no valid date is found or an error occurs
    return datetime.date(2025, 1, 1)

# Apply the function to the 'description' column
df['Haunted Places Date'] = df['description'].apply(extract_date)

In [13]:
df['Haunted Places Date']

0        2025-03-03
1        2025-03-01
2        2025-01-01
3        0211-03-03
4        2025-01-01
            ...    
10969    2025-03-12
10970    2025-01-01
10971    2025-03-18
10972    2025-01-01
10973    2025-01-01
Name: Haunted Places Date, Length: 10974, dtype: object
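
The output above contains values such as `0211-03-03`, which looks like a room number read as a year, and several 2025 dates that may come from the parser filling in missing components. One way to reduce such false positives is to keep only candidates inside a plausible range. This is a sketch under stated assumptions: the helper name `first_plausible_date` and the `min_year=1600` cutoff are illustrative choices, not values taken from the notebook.

```python
import datetime

def first_plausible_date(candidates, min_year=1600, today=None):
    """Return the first candidate date within [min_year, today], else None."""
    today = today or datetime.date.today()
    for candidate in candidates:
        # datefinder yields datetime objects; plain dates are accepted too
        d = candidate.date() if isinstance(candidate, datetime.datetime) else candidate
        if d.year >= min_year and d <= today:
            return d
    return None

# Stand-ins for what a date parser might yield for one description
candidates = [datetime.datetime(211, 3, 3), datetime.datetime(1970, 6, 1)]
print(first_plausible_date(candidates, today=datetime.date(2025, 3, 3)))
```

Returning `None` instead of a sentinel date keeps "no date found" distinguishable from a genuine date of 2025-01-01 in later analysis.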

## 4. d) Haunted Places Witness Count

In [14]:
import re
from number_parser import parse_number

def preprocess_description(description):
    """Replace vague phrases with estimated numbers to aid extraction."""
    description = description.lower()
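    # Replace ambiguous phrases with approximate numbers.
    # NOTE: the estimates are inserted as digits, but extract_numbers_from_text()
    # only matches spelled-out number words, so it does not pick them up.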
    replacements = {
        "some": "3",
        "a few": "3",
        "several": "5",
        "many": "10",
        "a lot": "10",
        "a handful": "5",
        "numerous": "10",
        "countless": "15",
        "dozens": "12",
        "scores": "20",
        "hundreds": "100",
        "a couple": "2"
    }
    
    for word, num in replacements.items():
        description = re.sub(rf"\b{word}\b", num, description)  # Whole-word replacement
    return description


def extract_numbers_from_text(text):
    """Extract numerical values from written-out numbers in the text."""
    # Define a regex pattern to match written-out numbers
    number_words = r'\b(?:zero|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred|thousand|million|billion)\b'
    
    # Find all matches of written-out numbers
    matches = re.findall(number_words, text.lower())
    
    # Parse each match into a numerical value
    numbers = [parse_number(match) for match in matches if parse_number(match) is not None]
    
    # Filter out irrelevant numbers (e.g., years, small numbers)
    filtered_numbers = [num for num in numbers if not (1900 <= num <= 2100)]  # Remove years
    filtered_numbers = [num for num in filtered_numbers if num > 1]  # Ignore small numbers
    
    return filtered_numbers


def extract_witness_count(description):
    """
    Extract witness count from a haunted place description.
    Returns a tuple (witness_count, method) where method indicates how the count was derived.
    """
    try:
        # Step 1: Preprocess the description
        preprocessed_text = preprocess_description(description)
        
        # Step 2: Extract numbers from the text
        numbers = extract_numbers_from_text(preprocessed_text)
        
        # Step 3: Return the first valid number found, or 0 if no numbers are found
        if numbers:
            return numbers[0], "explicit_number"
        
        # Step 4: Default to 0 if no numbers are found
        return 0, "default"
    
    except Exception as e:
        print(f"Error parsing witness count from description: {description[:100]}... Error: {e}")
        return 0, "error"


# Apply the function to the 'description' column
df['Haunted Places Witness Count'] = df['description'].apply(
    lambda desc: extract_witness_count(desc)[0]  # Extract only the count (not the method)
)

# Display the updated columns
print(df[['description', 'Haunted Places Witness Count']])

                                             description  \
0      Ada witch - Sometimes you can see a misty blue...   
1      A little girl was killed suddenly while waitin...   
2      If you take Gorman Rd. west towards Sand Creek...   
3      In the 1970's, one room, room 211, in the old ...   
4      Kappa Delta Sorority - The Kappa Delta Sororit...   
...                                                  ...   
10969  at 12 midnight you can see a lady with two lit...   
10970  Is haunted by the victims of a murder that hap...   
10971  The institution was for kids 18 years old and ...   
10972  Gymnasium -  their have been reports of a litt...   
10973  Cadets from the Air Force Academy participatin...   

       Haunted Places Witness Count  
0                                 0  
1                                 0  
2                                 0  
3                                 2  
4                                 0  
...                             ...  
10969        
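
Because the approach above takes the first number word anywhere in the text, a phrase such as "an eleven year old girl" would be counted as eleven witnesses. The sketch below shows a different technique, not the notebook's method: it accepts a quantity only when it directly precedes a people-noun. The noun list, the lookup tables, and the name `estimate_count` are all assumptions made for illustration.

```python
import re

# Illustrative lookups; the vague-quantifier estimates mirror the table above.
VAGUE = {"a couple": 2, "a few": 3, "some": 3, "several": 5, "a handful": 5,
         "many": 10, "numerous": 10, "dozens": 12, "scores": 20, "hundreds": 100}
WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "seven": 7,
         "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12}
LOOKUP = {**VAGUE, **WORDS}

# Longest alternatives first so multi-word phrases are tried before short ones.
_alternatives = sorted(LOOKUP, key=len, reverse=True)
COUNT_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, _alternatives)) + r"|\d+)\s+"
    r"(?:people|persons|witnesses|students|visitors|employees|guests)\b",
    re.IGNORECASE,
)

def estimate_count(description):
    """Return the first quantity attached to a people-noun, or 0 if none."""
    match = COUNT_RE.search(description)
    if not match:
        return 0
    token = match.group(1).lower()
    return int(token) if token.isdigit() else LOOKUP[token]

print(estimate_count("Several students heard footsteps in the gym."))
print(estimate_count("An eleven year old girl haunts room 211."))
```

The trade-off is recall: counts phrased without one of the listed nouns are missed, so the noun list would need tuning against the real descriptions.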

## 4. e) Time of Day

In [15]:
def extract_time_of_day(description):
    time_keywords = {"evening": "Evening", "morning": "Morning", "dusk": "Dusk"}
    for keyword, time_of_day in time_keywords.items():
        if re.search(rf'\b{keyword}\b', description, re.IGNORECASE):
            return time_of_day
    return "Unknown"

df['Time of Day'] = df['description'].apply(extract_time_of_day)

## 4. f) Apparition Type

In [16]:
def extract_apparition_type(description):
    apparition_keywords = {
        "ghost": "Ghost",
        "orb": "Orb",
        "ufo": "UFO",
        "uap": "UAP",
        "male": "Male",
        "female": "Female",
        "child": "Child",
        "several ghosts": "Several Ghosts"
    }
    for keyword, apparition_type in apparition_keywords.items():
        if re.search(rf'\b{keyword}\b', description, re.IGNORECASE):
            return apparition_type
    return "Unknown"

df['Apparition Type'] = df['description'].apply(extract_apparition_type)
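
Two properties of the function above are worth noting: `\bghost\b` does not match the plural "ghosts", and only the first matching label is returned. A minimal sketch of a multi-label variant follows. The pattern table and the name `apparition_types` are hypothetical, and only a subset of the labels is shown.

```python
import re

# Hypothetical patterns: an optional plural is allowed after each stem.
APPARITION_PATTERNS = {
    "Ghost": r"\bghosts?\b",
    "Orb": r"\borbs?\b",
    "UFO": r"\bufos?\b",
    "Child": r"\b(?:child|children)\b",
}

def apparition_types(description):
    """Return every label whose pattern occurs in the description."""
    return [label for label, pattern in APPARITION_PATTERNS.items()
            if re.search(pattern, description, re.IGNORECASE)]

print(apparition_types("Ghosts of two children are seen near glowing orbs."))
```

If a single-valued column is still required, the list can be joined into one string or reduced to its first element.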

## 4. g) Event type

In [17]:
def extract_event_type(description):
    event_keywords = {
        "murder": "Murder",
        "die": "Death",
        "supernatural": "Supernatural Phenomenon"
    }
    for keyword, event_type in event_keywords.items():
        if re.search(rf'\b{keyword}\b', description, re.IGNORECASE):
            return event_type
    return "Unknown"

df['Event Type'] = df['description'].apply(extract_event_type)
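
The whole-word pattern `\bdie\b` does not match inflected forms such as "died", so many records may fall through to "Unknown". The sketch below broadens the patterns and tallies the resulting labels to gauge coverage. The patterns, the name `classify_event`, and the sample sentences are made up for illustration and are not drawn from the dataset.

```python
import re
from collections import Counter

# Hypothetical patterns covering common inflections of each keyword.
EVENT_PATTERNS = {
    "Murder": r"\bmurder(?:s|ed|er|ers)?\b",
    "Death": r"\b(?:die|dies|died|dying|death|deaths)\b",
    "Supernatural Phenomenon": r"\bsupernatural\b",
}

def classify_event(description):
    """Return the first matching event label, or 'Unknown'."""
    for label, pattern in EVENT_PATTERNS.items():
        if re.search(pattern, description, re.IGNORECASE):
            return label
    return "Unknown"

# Invented sample sentences, used only to exercise the function
samples = [
    "Haunted by the victims of a murder.",
    "A man died in this room.",
    "Lights flicker at night.",
]
counts = Counter(classify_event(s) for s in samples)
print(counts)
```

Running the same tally on `df['description']` would show how large the "Unknown" share is before and after broadening the keywords.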

In [18]:
# Spot-check: inspect the records whose extracted witness count is 11
witness_count_records = df[df['Haunted Places Witness Count'] == 11]

# Display the filtered records
print(witness_count_records)

          city        country  \
1518     Lenox  United States   
3390  Woodward  United States   
8509     Paris  United States   

                                            description  \
1518  An eleven year old girl, whose last name is Sl...   
3390  This coffee house was originally a doctor's of...   
8509  Most people get a bad feeling just looking at ...   

                                 location          state state_abbrev  \
1518                     Cranewell Resort  Massachusetts           MA   
3390                     Leos Coffeehouse       Oklahoma           OK   
8509  Old Plantation home in Slate Shoals          Texas           TX   

      longitude   latitude  city_longitude  city_latitude  Audio Evidence  \
1518 -73.267236  42.341822      -73.284876      42.356461           False   
3390 -99.393019  36.434108      -99.390386      36.433648           False   
8509 -95.555513  33.660939      -95.555513      33.660939           False   

      Image/Video/Visual Evi

## 4. h) Merge the Alcohol Abuse Dataset

In [20]:
# Normalize the Haunted Places column names (strip whitespace, lowercase)
df.columns = [col.strip().lower() for col in df.columns]

# Load Alcohol Abuse dataset (appears to be comma-separated)
alcohol_df = pd.read_csv("../Data/alcohol_abuse.tsv", sep=",")
alcohol_df.columns = [col.strip().lower() for col in alcohol_df.columns]

# Rename the Alcohol Abuse column to ensure it has a common key ("state")
if "state" not in alcohol_df.columns and "state_name" in alcohol_df.columns:
    alcohol_df.rename(columns={"state_name": "state"}, inplace=True)

# Check that both DataFrames have the 'state' column
if "state" not in df.columns or "state" not in alcohol_df.columns:
    raise KeyError("The 'state' column is missing from one of the datasets.")

# Merge the datasets on the 'state' column using a left join
merged_df = pd.merge(df, alcohol_df, on="state", how="left")

# Save the merged dataset as a TSV file
merged_df.to_csv("../Data/haunted_places_with_alcohol.tsv", sep="\t", index=False)
print("Merge completed: {} rows merged.".format(merged_df.shape[0]))

Merge completed: 10974 rows merged.
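
A left join always keeps every Haunted Places row, so the row count alone does not show whether any state found a match. The sketch below uses the `indicator` and `validate` options of `DataFrame.merge` to check this. The toy frames, the `rate` column, and its values are placeholders, not the real Alcohol Abuse data.

```python
import pandas as pd

# Toy frames standing in for the real data; names and values are placeholders.
places = pd.DataFrame({"state": ["Texas", "Ohio", "texas "],
                       "city": ["Paris", "Ada", "Waco"]})
rates = pd.DataFrame({"state": ["Texas", "Ohio"], "rate": [1.0, 2.0]})

# Normalize the join key on both sides so spacing and case do not block a match
for frame in (places, rates):
    frame["state_key"] = frame["state"].str.strip().str.lower()

checked = places.merge(
    rates.drop(columns="state"),
    on="state_key",
    how="left",
    validate="m:1",   # raises if a state appears more than once on the right
    indicator=True,   # adds a '_merge' column: 'both' or 'left_only'
)
unmatched = checked.loc[checked["_merge"] == "left_only", "state"].unique()
print(len(checked), list(unmatched))
```

Any state listed in `unmatched` received only missing values from the right-hand dataset and may need its key cleaned up.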


## 4. i) Merge the Daylight Hours Dataset

In [23]:
pip install selenium

Collecting selenium
  Downloading selenium-4.27.1-py3-none-any.whl (9.7 MB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.12.2-py3-none-any.whl (21 kB)
Collecting urllib3[socks]<3,>=1.26
  Using cached urllib3-2.2.3-py3-none-any.whl (126 kB)
Collecting typing_extensions~=4.9
  Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Collecting websocket-client~=1.8
  Downloading websocket_client-1.8.0-py3-none-any.whl (58 kB)
Collecting trio~=0.17
  Downloading trio-0.27.0-py3-none-any.whl (481 kB)
Collecting certifi>=2021.10.8
  Downloading certifi-2025.1.31-py3-none-any.whl (166 kB)
Collecting outcome>=1.2.0
  Downloading outcome-1.3.0.post0-py2.py3-none-any.

In [24]:
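from io import StringIO

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd

# Set up Selenium to run headlessly (no browser window)
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)  # Ensure chromedriver is in your PATH

# URL for the daylight data page (modify if needed)
url = "https://www.timeanddate.com/astronomy/usa"
try:
    driver.get(url)
    # Explicitly wait (up to 10 s) until at least one table is present;
    # implicitly_wait() only affects element lookups, not page_source
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "table"))
    )

    # Get the full page source after dynamic content has loaded
    html = driver.page_source
finally:
    driver.quit()  # Always release the browser, even if loading fails

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Find all table elements on the page
tables = soup.find_all("table")
print("Number of tables found:", len(tables))

# Extract each table into a DataFrame and combine them
# (recent pandas versions expect a file-like object rather than a raw HTML string)
dfs = [pd.read_html(StringIO(str(table)))[0] for table in tables]
if len(dfs) > 1:
    full_df = pd.concat(dfs, ignore_index=True)
else:
    full_df = dfs[0]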

# Save the full dataset as a TSV file
full_df.to_csv("../Data/daylight_hours_full.tsv", sep="\t", index=False)
print("Full daylight data saved to daylight_hours_full.tsv")

Number of tables found: 3
Full daylight data saved to daylight_hours_full.tsv


In [25]:
# Load the merged Haunted Places with Alcohol dataset (TSV)
merged_alcohol_df = pd.read_csv("../Data/haunted_places_with_alcohol.tsv", sep="\t")
merged_alcohol_df.columns = [col.strip().lower() for col in merged_alcohol_df.columns]

# Load the Daylight Hours dataset (TSV)
daylight_df = pd.read_csv("../Data/daylight_hours_full.tsv", sep="\t")
daylight_df.columns = [col.strip().lower() for col in daylight_df.columns]

# Inspect the daylight dataset columns and preview column "0"
print("Daylight Hours columns:", daylight_df.columns.tolist())
print(daylight_df["0"].head(5))

# Define a function to extract a two-letter state abbreviation from a string.
# This function assumes the format "City Name (XX)" where XX is the state abbreviation.
def extract_state(location):
    match = re.search(r'\(([A-Z]{2})\)', location)
    if match:
        return match.group(1)  # keep uppercase to match 'state_abbrev' (e.g. "MA")
    else:
        return None

# Create a new 'state' column by extracting from column "0"
daylight_df["state"] = daylight_df["0"].apply(lambda x: extract_state(str(x)))
print(daylight_df[["0", "state"]].head(10))

# Check that both DataFrames have the join columns
if "state_abbrev" not in merged_alcohol_df.columns:
    raise KeyError("The 'state_abbrev' column is missing from the Haunted Places dataset.")
if "state" not in daylight_df.columns:
    raise KeyError("The 'state' column is missing from the Daylight Hours dataset.")

# Merge using a left join. The extracted values are two-letter abbreviations, so they are
# matched against 'state_abbrev'; the Haunted Places 'state' column holds full state names.
# Rows without an extracted abbreviation are dropped so that missing keys cannot match.
merged_final_df = pd.merge(
    merged_alcohol_df, daylight_df.dropna(subset=["state"]),
    left_on="state_abbrev", right_on="state", how="left", suffixes=("", "_daylight"),
)

# Save the final merged dataset as a TSV file
merged_final_df.to_csv("../Data/haunted_places_with_alcohol_daylight.tsv", sep="\t", index=False)
print("Merge completed: {} rows merged.".format(merged_final_df.shape[0]))

Daylight Hours columns: ['0', '1', '↑ sunrise and ↓ sunset in united states (79 locations)', '↑ sunrise and ↓ sunset in united states (79 locations).1', '↑ sunrise and ↓ sunset in united states (79 locations).2', '↑ sunrise and ↓ sunset in united states (79 locations).3', '↑ sunrise and ↓ sunset in united states (79 locations).4', '↑ sunrise and ↓ sunset in united states (79 locations).5', '↑ sunrise and ↓ sunset in united states (79 locations).6', '↑ sunrise and ↓ sunset in united states (79 locations).7', '↑ sunrise and ↓ sunset in united states (79 locations).8', 'find sunrise and sunset for other places…', 'find sunrise and sunset for other places….1', 'find sunrise and sunset for other places….2', 'find sunrise and sunset for other places….3', 'find sunrise and sunset for other places….4']
0          Country:
1        Long Name:
2    Abbreviations:
3          Capital:
4       Time Zones:
Name: 0, dtype: object
                   0 state
0           Country:  None
1         Long Na
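
The preview above shows column "0" holding labels such as "Country:" rather than "City Name (XX)" strings, so the extracted state is `None` for those rows. A quick check of how many values match the expected format can flag this before merging. This is a minimal sketch: the helper name `match_rate` is hypothetical, and "Austin (TX)" is an invented example of the assumed format.

```python
import re

STATE_RE = re.compile(r"\(([A-Z]{2})\)")

def match_rate(values):
    """Fraction of values containing a '(XX)' state abbreviation."""
    values = [str(v) for v in values]
    if not values:
        return 0.0
    hits = sum(1 for v in values if STATE_RE.search(v))
    return hits / len(values)

# The labels previewed in the output above
labels = ["Country:", "Long Name:", "Abbreviations:", "Capital:", "Time Zones:"]
print(match_rate(labels))
print(match_rate(["Austin (TX)", "Country:"]))
```

A rate near zero on `daylight_df["0"]` would suggest that the city names live in a different column or table than the one being parsed.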