<h1> <center> Analysis of the Haunted Places Dataset </center> </h1>

<b> Team 3 </b> <br>
Members: 

- Zili Yang
- Chen Yi Weng
- Aadarsh Sudhir Ghiya 
- Colin Leahey
- Niromikha Jayakumar 
- Yung Yee Chia


## Introduction

This notebook explores the Haunted Places dataset, extracts new features, joins additional datasets, and analyzes the data using similarity metrics and clustering techniques. The tasks include:
1. Preprocessing the dataset.
2. Extracting new features from descriptions.
3. Joining external datasets.
4. Analyzing the combined dataset using Tika-Similarity.
5. Visualizing the results.

## 1. Download and install Apache Tika

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
pip install tika

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install number-parser

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install datefinder

Note: you may need to restart the kernel to use updated packages.


## 2. Download the Haunted Places dataset

In [5]:
import pandas as pd

# Load the dataset
haunted_places_df = pd.read_csv('../Data/haunted_places.csv')

# Make a copy of the original dataset
haunted_places_copy = haunted_places_df.copy()

## 3. Create a combined TSV file for your Haunted Places dataset

In [6]:
# Convert the original CSV file to TSV
haunted_places_df = pd.read_csv("../Data/haunted_places.csv")
haunted_places_df.to_csv("haunted_places.tsv", sep='\t', index=False)

In [7]:
# Load the TSV file for further processing
haunted_places = pd.read_csv("haunted_places.tsv", sep='\t')

## 4. a) Audio Evidence

In [8]:
import re

def has_audio_evidence(description):
    audio_keywords = ["noises", "sound of snapping neck", "nursery rhymes"]
    return any(re.search(rf'\b{keyword}\b', description, re.IGNORECASE) for keyword in audio_keywords)

haunted_places['Audio Evidence'] = haunted_places['description'].apply(has_audio_evidence)
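
The keyword check above assumes every description is a string and that keywords contain no regex metacharacters. A minimal sketch of a more defensive variant follows; `contains_any_keyword` is a hypothetical helper name, not part of the notebook, and the sample sentences are made up for illustration.

```python
import re

def contains_any_keyword(text, keywords):
    """Return True if any keyword occurs in text as a whole word or phrase."""
    # Missing descriptions (None or NaN floats) cannot be searched
    if not isinstance(text, str):
        return False
    # re.escape keeps punctuation inside a keyword from acting as regex syntax
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

audio_keywords = ["noises", "sound of snapping neck", "nursery rhymes"]
print(contains_any_keyword("Strange NOISES come from the attic.", audio_keywords))
print(contains_any_keyword(float("nan"), audio_keywords))
```

Combining the keywords into one alternation means each description is scanned once, and the same helper could serve the visual-evidence keywords as well.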

## 4. b) Image/Video/Visual Evidence

In [9]:
def has_visual_evidence(description):
    visual_keywords = ["cameras", "take pictures", "names of children written on walls"]
    return any(re.search(rf'\b{keyword}\b', description, re.IGNORECASE) for keyword in visual_keywords)

haunted_places['Image/Video/Visual Evidence'] = haunted_places['description'].apply(has_visual_evidence)

## 4. c) Haunted Places Date

In [10]:
import datefinder
import datetime

def extract_date(description):
    try:
        # Attempt to find dates in the description
        matches = datefinder.find_dates(description)
        
        # Extract the first valid date
        for match in matches:
            return match.date()  # Return only the date part
        
    except Exception:
        # Silently handle errors without printing messages
        pass
    
    # Fallback to '2025-01-01' if no valid date is found or an error occurs
    return datetime.date(2025, 1, 1)

# Apply the function to the 'description' column
haunted_places['Haunted Places Date'] = haunted_places['description'].apply(extract_date)

In [11]:
haunted_places['Haunted Places Date']

0        2025-03-03
1        2025-03-01
2        2025-01-01
3        0211-03-11
4        2025-01-01
            ...    
10987    2025-03-12
10988    2025-01-01
10989    2025-03-18
10990    2025-01-01
10991    2025-01-01
Name: Haunted Places Date, Length: 10992, dtype: object
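
Row 3 in the output above has the date `0211-03-11`, and its description mentions "room 211", which suggests a bare number can be read as a year. One possible guard, sketched below, is to accept a date only when its year falls in a plausible range and otherwise fall back to the same default. `keep_plausible_date`, `min_year`, and `max_year` are hypothetical names, and the range 1600 to 2025 is an assumption rather than something the notebook specifies.

```python
import datetime

DEFAULT_DATE = datetime.date(2025, 1, 1)  # same fallback value as extract_date

def keep_plausible_date(found, min_year=1600, max_year=2025):
    """Return found if its year lies in [min_year, max_year], else the fallback."""
    if found is None or not (min_year <= found.year <= max_year):
        return DEFAULT_DATE
    return found

print(keep_plausible_date(datetime.date(211, 3, 11)))  # year 211 is rejected
print(keep_plausible_date(datetime.date(1970, 6, 1)))  # year 1970 is kept
```

The check could be applied to each match inside the loop of `extract_date`, so that a later, more plausible match in the same description is still considered.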

## 4. d) Haunted Places Witness Count

In [14]:
import re
from number_parser import parse_number

def preprocess_description(description):
    """Replace vague phrases with estimated numbers to aid extraction."""
    description = description.lower()
    # Replace ambiguous phrases with approximate numbers
    replacements = {
        "some": "3",
        "a few": "3",
        "several": "5",
        "many": "10",
        "a lot": "10",
        "a handful": "5",
        "numerous": "10",
        "countless": "15",
        "dozens": "12",
        "scores": "20",
        "hundreds": "100",
        "a couple": "2"
    }
    
    for word, num in replacements.items():
        description = re.sub(rf"\b{word}\b", num, description)  # Whole-word replacement
    return description


def extract_numbers_from_text(text):
    """Extract numerical values from written-out numbers in the text."""
    # Define a regex pattern to match written-out numbers
    number_words = r'\b(?:zero|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred|thousand|million|billion)\b'
    
    # Find all matches of written-out numbers
    matches = re.findall(number_words, text.lower())
    
    # Parse each match once, keeping only the values that parse successfully
    numbers = [n for n in map(parse_number, matches) if n is not None]
    
    # Drop values that look like years (1900-2100), then drop 0 and 1
    filtered_numbers = [num for num in numbers if not (1900 <= num <= 2100)]
    filtered_numbers = [num for num in filtered_numbers if num > 1]
    
    return filtered_numbers


def extract_witness_count(description):
    """
    Extract witness count from a haunted place description.
    Returns a tuple (witness_count, method) where method indicates how the count was derived.
    """
    try:
        # Step 1: Preprocess the description
        preprocessed_text = preprocess_description(description)
        
        # Step 2: Extract numbers from the text
        numbers = extract_numbers_from_text(preprocessed_text)
        
        # Step 3: Return the first valid number found
        if numbers:
            return numbers[0], "explicit_number"
        
        # Step 4: Default to 0 if no numbers are found
        return 0, "default"
    
    except Exception as e:
        print(f"Error parsing witness count from description: {description[:100]}... Error: {e}")
        return 0, "error"


# Apply the function to the 'description' column
haunted_places['Haunted Places Witness Count'] = haunted_places['description'].apply(
    lambda desc: extract_witness_count(desc)[0]  # Extract only the count (not the method)
)

# Display the updated columns
print(haunted_places[['description', 'Haunted Places Witness Count']])

                                             description  \
0      Ada witch - Sometimes you can see a misty blue...   
1      A little girl was killed suddenly while waitin...   
2      If you take Gorman Rd. west towards Sand Creek...   
3      In the 1970's, one room, room 211, in the old ...   
4      Kappa Delta Sorority - The Kappa Delta Sororit...   
...                                                  ...   
10987  at 12 midnight you can see a lady with two lit...   
10988  Is haunted by the victims of a murder that hap...   
10989  The institution was for kids 18 years old and ...   
10990  Gymnasium -  their have been reports of a litt...   
10991  Cadets from the Air Force Academy participatin...   

       Haunted Places Witness Count  
0                                 0  
1                                 0  
2                                 0  
3                                 2  
4                                 0  
...                             ...  
10987        
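
In the cell above, `preprocess_description` substitutes digit strings such as "3" and "10" for vague phrases, while the `number_words` pattern in `extract_numbers_from_text` lists only written-out words, so the substituted digits are not matched. The sketch below shows one way an extractor could accept both forms. It is an illustration under stated assumptions, not the notebook's method: `WORD_VALUES` and `first_count` are hypothetical names, and the small lookup table stands in for `number_parser` so the example runs on its own.

```python
import re

# Hypothetical lookup of a few number words; the notebook uses number_parser instead
WORD_VALUES = {"two": 2, "three": 3, "four": 4, "five": 5, "ten": 10, "twelve": 12}

# Match either a run of digits or one of the listed number words
TOKEN = re.compile(r"\b(?:\d+|" + "|".join(WORD_VALUES) + r")\b")

def first_count(text):
    """Return the first number > 1 that is not a year in 1900-2100, else 0."""
    for token in TOKEN.findall(text.lower()):
        value = int(token) if token.isdigit() else WORD_VALUES[token]
        if value > 1 and not (1900 <= value <= 2100):
            return value
    return 0

print(first_count("In 1998, 3 students heard footsteps"))
print(first_count("Two guards saw a shadow"))
print(first_count("A cold spot is felt on the stairs"))
```

Accepting digits has a cost: numbers that are not witness counts, such as room numbers or ages, would also be picked up, so a stricter rule (for example, requiring a following word like "people" or "witnesses") may be needed in practice.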

## 4. e) Time of Day

In [16]:
def extract_time_of_day(description):
    time_keywords = {"evening": "Evening", "morning": "Morning", "dusk": "Dusk"}
    for keyword, time_of_day in time_keywords.items():
        if re.search(rf'\b{keyword}\b', description, re.IGNORECASE):
            return time_of_day
    return "Unknown"

haunted_places['Time of Day'] = haunted_places['description'].apply(extract_time_of_day)

## 4. f) Apparition Type

In [17]:
def extract_apparition_type(description):
    apparition_keywords = {
        "ghost": "Ghost",
        "orb": "Orb",
        "ufo": "UFO",
        "uap": "UAP",
        "male": "Male",
        "female": "Female",
        "child": "Child",
        "several ghosts": "Several Ghosts"
    }
    for keyword, apparition_type in apparition_keywords.items():
        if re.search(rf'\b{keyword}\b', description, re.IGNORECASE):
            return apparition_type
    return "Unknown"

haunted_places['Apparition Type'] = haunted_places['description'].apply(extract_apparition_type)

## 4. g) Event type

In [18]:
def extract_event_type(description):
    event_keywords = {
        "murder": "Murder",
        "die": "Death",
        "supernatural": "Supernatural Phenomenon"
    }
    for keyword, event_type in event_keywords.items():
        if re.search(rf'\b{keyword}\b', description, re.IGNORECASE):
            return event_type
    return "Unknown"

haunted_places['Event Type'] = haunted_places['description'].apply(extract_event_type)

In [19]:
# Spot-check: inspect the records whose extracted witness count is 11
witness_count_11_records = haunted_places[haunted_places['Haunted Places Witness Count'] == 11]

# Display the filtered records
print(witness_count_11_records)

          city        country  \
1520     Lenox  United States   
3394  Woodward  United States   
8524     Paris  United States   

                                            description  \
1520  An eleven year old girl, whose last name is Sl...   
3394  This coffee house was originally a doctor's of...   
8524  Most people get a bad feeling just looking at ...   

                                 location          state state_abbrev  \
1520                     Cranewell Resort  Massachusetts           MA   
3394                     Leos Coffeehouse       Oklahoma           OK   
8524  Old Plantation home in Slate Shoals          Texas           TX   

      longitude   latitude  city_longitude  city_latitude  Audio Evidence  \
1520 -73.267236  42.341822      -73.284876      42.356461           False   
3394 -99.393019  36.434108      -99.390386      36.433648           False   
8524        NaN        NaN      -95.555513      33.660939           False   

      Image/Video/Visual Evi

## 4. h) Merge the Alcohol Abuse Dataset

In [20]:
import os
import pandas as pd

# Determine the path to the Data folder relative to the current working directory
data_folder = os.path.join(os.getcwd(), "..", "Data")
print("Data folder:", data_folder)
print("Files in Data folder:", os.listdir(data_folder))

# Set the file path for the Haunted Places dataset:
# First, try "haunted_places.tsv". If it doesn't exist, use "cleaned_haunted_places.tsv".
haunted_file = os.path.join(data_folder, "haunted_places.tsv")
if not os.path.exists(haunted_file):
    print(f"'{haunted_file}' not found. Using 'cleaned_haunted_places.tsv' instead.")
    haunted_file = os.path.join(data_folder, "cleaned_haunted_places.tsv")

# Set the file path for the Alcohol Abuse dataset
alcohol_file = os.path.join(data_folder, "alcohol_abuse.tsv")

# Load Haunted Places dataset
haunted_df = pd.read_csv(haunted_file, sep="\t")
haunted_df.columns = [col.strip().lower() for col in haunted_df.columns]

# Load Alcohol Abuse dataset (assuming it's comma-separated)
alcohol_df = pd.read_csv(alcohol_file, sep=",")
alcohol_df.columns = [col.strip().lower() for col in alcohol_df.columns]

# Rename the Alcohol Abuse column if necessary to ensure a common key "state"
if "state" not in alcohol_df.columns and "state_name" in alcohol_df.columns:
    alcohol_df.rename(columns={"state_name": "state"}, inplace=True)

# Check that both DataFrames have the 'state' column
if "state" not in haunted_df.columns or "state" not in alcohol_df.columns:
    raise KeyError("The 'state' column is missing from one of the datasets.")

# Merge the datasets on the 'state' column using a left join
merged_df = pd.merge(haunted_df, alcohol_df, on="state", how="left")

# Save the merged dataset as a TSV file in the Data folder
merged_file = os.path.join(data_folder, "haunted_places_with_alcohol.tsv")
merged_df.to_csv(merged_file, sep="\t", index=False)
print("Merge completed: {} rows merged. Merged file saved at: {}".format(merged_df.shape[0], merged_file))


Data folder: /Users/yungyeechia/Desktop/DSCI550-HW1-main/Source Code/../Data
Files in Data folder: ['counties.geojson', 'cleaned_crime_data.tsv', '.DS_Store', 'haunted_places.csv', 'NRIS_CR_Standards_Public.gdb.zip', 'haunted_places_with_alcohol.tsv', 'haunted_places_cleaned.csv', 'daylight_hours_full.tsv', 'haunted_places_with_alcohol_daylight.tsv', 'daylight_tidy.tsv', 'georef-united-states-of-america-county.geojson', 'haunted_places_cleaned_copy.csv', '2020_USRC_Summaries.xlsx', 'extracted_gdb', 'joined1.csv', 'haunted_places_with_sites.csv', 'cleaned_haunted_places.tsv', 'historic_sites.csv', 'alcohol_abuse.tsv']
'/Users/yungyeechia/Desktop/DSCI550-HW1-main/Source Code/../Data/haunted_places.tsv' not found. Using 'cleaned_haunted_places.tsv' instead.
Merge completed: 10974 rows merged. Merged file saved at: /Users/yungyeechia/Desktop/DSCI550-HW1-main/Source Code/../Data/haunted_places_with_alcohol.tsv
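
A left join on `state` keeps every haunted place even when no alcohol record matches, so unmatched states pass silently. The sketch below shows one way to surface them with the `indicator` option of `merge`. The toy frames and the `location` and `rate` columns are made up for illustration; only the `state` key comes from the cell above.

```python
import pandas as pd

# Toy stand-ins for the two datasets; columns other than "state" are hypothetical
places = pd.DataFrame({"location": ["A", "B", "C"],
                       "state": ["texas", "Ohio ", "Utah"]})
rates = pd.DataFrame({"state": ["Texas", "Ohio"], "rate": [1.0, 2.0]})

# Normalize the key on both sides so case and stray spaces do not block a match
for frame in (places, rates):
    frame["state"] = frame["state"].str.strip().str.title()

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join
checked = places.merge(rates, on="state", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "state"].tolist()
print(unmatched)
```

Any state listed in `unmatched` would carry empty alcohol columns in the merged file, which is worth knowing before computing similarity on those columns.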


## 4. i) Scrape and Merge the Daylight Hours Dataset

In [21]:
import os
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def extract_timeanddate_daylight(output_filename="daylight_hours_full.tsv"):
    print("Starting extraction from Time and Date website...")
    # Set up Selenium in headless mode
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)  # Ensure chromedriver is in your PATH

    try:
        # Navigate to the Time and Date Astronomy page for the USA
        url = "https://www.timeanddate.com/astronomy/usa"
        driver.get(url)
        driver.implicitly_wait(10)  # Timeout for element lookups; it does not delay the page_source read below
        # Get the page source
        html = driver.page_source
        print("Page source obtained.")
    finally:
        # Close the browser
        driver.quit()
        print("Browser closed.")

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")

    # Find all <table> elements on the page
    tables = soup.find_all("table")
    print("Number of tables found on the page:", len(tables))

    # Convert each table to a Pandas DataFrame and collect them
    dfs = []
    for i, table in enumerate(tables, start=1):
        try:
            df = pd.read_html(str(table))[0]
            dfs.append(df)
            print("Table", i, "extracted with shape:", df.shape)
        except ValueError as e:
            print("Skipping a table (index", i, ") due to error:", e)

    if not dfs:
        print("No valid tables found to scrape. Exiting.")
        return None

    if len(dfs) > 1:
        combined_df = pd.concat(dfs, ignore_index=True)
    else:
        combined_df = dfs[0]

    # Define the path to the Data folder (one level up)
    data_folder = os.path.join(os.getcwd(), "..", "Data")
    if not os.path.exists(data_folder):
        os.makedirs(data_folder)
        print("Data folder created at:", os.path.abspath(data_folder))
    else:
        print("Data folder exists at:", os.path.abspath(data_folder))
    
    # Save the combined DataFrame as a TSV file in the Data folder
    output_filename = os.path.join(data_folder, output_filename)
    combined_df.to_csv(output_filename, sep="\t", index=False)
    full_path = os.path.abspath(output_filename)
    
    if os.path.exists(full_path):
        print("Scraped data saved to:", full_path)
        print("File successfully saved!")
    else:
        print("Error: File was not saved to", full_path)
    return full_path

if __name__ == "__main__":
    extract_timeanddate_daylight()
    print("Extraction complete.")


Starting extraction from Time and Date website...
Page source obtained.
Browser closed.
Number of tables found on the page: 3
Table 1 extracted with shape: (7, 2)
Table 2 extracted with shape: (27, 9)
Table 3 extracted with shape: (199, 5)
Data folder exists at: /Users/yungyeechia/Desktop/DSCI550-HW1-main/Data
Scraped data saved to: /Users/yungyeechia/Desktop/DSCI550-HW1-main/Data/daylight_hours_full.tsv
File successfully saved!
Extraction complete.


In [31]:
import os
import pandas as pd
import re
from datetime import datetime

# Locating the data folder and file paths using relative paths
data_folder = "../Data"  # Moves up one level to access "Data/"
raw_file = os.path.join(data_folder, "daylight_hours_full.tsv")
output_file = os.path.join(data_folder, "daylight_tidy.tsv")

# Check if the file exists before proceeding
if not os.path.exists(raw_file):
    raise FileNotFoundError(f"File not found: {raw_file}")

print("Reading file from:", raw_file)

# Load the Raw Data
df = pd.read_csv(raw_file, sep="\t", dtype=str)

# Drop metadata rows if the first row contains "Country:"
if df.iloc[0, 0].strip().lower() == "country:":
    df = df.iloc[5:].reset_index(drop=True)

print("After dropping metadata rows, number of rows:", df.shape[0])

# Extract relevant blocks (Location, Sunrise, Sunset)
# Identify relevant columns manually
if df.shape[1] >= 11:
    blocks = [
        df.iloc[:, [2, 3, 4]].copy(),
        df.iloc[:, [5, 6, 7]].copy(),
        df.iloc[:, [8, 9, 10]].copy(),
    ]
else:
    raise ValueError("Not enough columns in the dataset. Check the table structure.")

# Rename columns consistently
for block in blocks:
    block.columns = ["location", "sunrise", "sunset"]

# Concatenate all blocks into one dataframe
tidy_df = pd.concat(blocks, ignore_index=True)

# Drop empty rows
tidy_df = tidy_df[tidy_df["location"].notna() & (tidy_df["location"].str.strip() != "")]
tidy_df.reset_index(drop=True, inplace=True)

print("Tidy DataFrame (first 10 rows):")
print(tidy_df.head(10))

# Extract state abbreviations from location
# Assumes locations have a format like "Philadelphia (PA)"; rows without a match get NaN
tidy_df["state"] = tidy_df["location"].str.extract(r"\(([A-Z]{2})\)", expand=False)

# Clean up sunrise and sunset times
# Remove any symbols (e.g., "↑", "↓") and strip whitespace
tidy_df["sunrise"] = tidy_df["sunrise"].str.replace("↑", "", regex=False).str.strip()
tidy_df["sunset"] = tidy_df["sunset"].str.replace("↓", "", regex=False).str.strip()

# Function to calculate daylight hours
def calculate_daylight_hours(row):
    try:
        sunrise_time = datetime.strptime(row["sunrise"], "%I:%M %p")
        sunset_time = datetime.strptime(row["sunset"], "%I:%M %p")
        daylight_hours = (sunset_time - sunrise_time).seconds / 3600  # Convert to hours
        return round(daylight_hours, 2)  # Round to 2 decimal places
    except Exception:
        return None  # Missing or unparseable times yield no value

# Apply function to calculate daylight hours
tidy_df["daylight_hours"] = tidy_df.apply(calculate_daylight_hours, axis=1)

print("Final cleaned data with daylight hours (first 10 rows):")
print(tidy_df.head(10))

# Save the cleaned data using a relative path
tidy_df.to_csv(output_file, sep="\t", index=False)
print("Tidy daylight data saved as:", output_file)


Reading file from: ../Data/daylight_hours_full.tsv
After dropping metadata rows, number of rows: 228
Tidy DataFrame (first 10 rows):
           location    sunrise     sunset
0         Adak (AK)  ↑ 9:08 am  ↓ 8:45 pm
1       Albany (NY)  ↑ 7:13 am  ↓ 6:57 pm
2  Albuquerque (NM)  ↑ 7:22 am  ↓ 7:11 pm
3         Ames (IA)  ↑ 7:32 am  ↓ 7:16 pm
4    Anchorage (AK)  ↑ 8:27 am  ↓ 7:52 pm
5    Annapolis (MD)  ↑ 7:23 am  ↓ 7:09 pm
6      Atlanta (GA)  ↑ 7:52 am  ↓ 7:42 pm
7      Augusta (ME)  ↑ 6:58 am  ↓ 6:40 pm
8       Austin (TX)  ↑ 7:45 am  ↓ 7:36 pm
9    Baltimore (MD)  ↑ 7:23 am  ↓ 7:09 pm
Final cleaned data with daylight hours (first 10 rows):
           location  sunrise   sunset state  daylight_hours
0         Adak (AK)  9:08 am  8:45 pm    AK           11.62
1       Albany (NY)  7:13 am  6:57 pm    NY           11.73
2  Albuquerque (NM)  7:22 am  7:11 pm    NM           11.82
3         Ames (IA)  7:32 am  7:16 pm    IA           11.73
4    Anchorage (AK)  8:27 am  7:52 pm    AK      
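
As a worked check of the daylight calculation, the first row above (Adak, sunrise 9:08 am, sunset 8:45 pm) spans 11 hours 37 minutes, which is 11 + 37/60 ≈ 11.62 hours. The short sketch below reproduces that arithmetic on its own; `daylight_hours` is a hypothetical standalone helper that mirrors `calculate_daylight_hours` and assumes both times fall on the same day.

```python
from datetime import datetime

def daylight_hours(sunrise, sunset):
    """Hours between two same-day clock times given as 'H:MM am/pm'."""
    start = datetime.strptime(sunrise, "%I:%M %p")
    end = datetime.strptime(sunset, "%I:%M %p")
    # total_seconds() keeps the sign, so a sunset before sunrise shows up as negative
    return round((end - start).total_seconds() / 3600, 2)

print(daylight_hours("9:08 am", "8:45 pm"))
print(daylight_hours("7:13 am", "6:57 pm"))
```

The two calls correspond to the Adak and Albany rows and give 11.62 and 11.73, the values shown in the table above.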

In [33]:
import pandas as pd
import os

# Define relative file paths
daylight_file = "../Data/daylight_tidy.tsv"
haunted_file = "../Data/haunted_places_with_alcohol.tsv"
output_file = "../Data/merged_haunted_with_alcohol_daylight.tsv"

# Check if files exist before proceeding
for file_path in [daylight_file, haunted_file]:
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Error: File not found at {file_path}. Check your Data/ folder.")

# Load the datasets
df_daylight = pd.read_csv(daylight_file, sep="\t")
df_haunted = pd.read_csv(haunted_file, sep="\t")

# Merge the datasets on state abbreviations
merged_df = df_haunted.merge(df_daylight, left_on="state_abbrev", right_on="state", how="left")

# Drop duplicate state column
merged_df.drop(columns=["state_y"], inplace=True)

# Rename columns for clarity
merged_df.rename(columns={"state_x": "state"}, inplace=True)

# Save the merged dataset using a relative path
merged_df.to_csv(output_file, sep="\t", index=False)

print(f"Merged dataset saved to: {output_file}")


Merged dataset saved to: ../Data/merged_haunted_with_alcohol_daylight.tsv
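
The tidy daylight table lists several cities per state (for example, both Adak and Anchorage carry `AK`), so joining on the state abbreviation matches each haunted place to every city in its state and repeats its row. One possible remedy, sketched here, is to reduce the daylight data to one row per state before merging. Averaging the cities is an assumption, not the notebook's method, and the toy values and the `location` column are made up for illustration.

```python
import pandas as pd

# Toy stand-ins; the real frames come from daylight_tidy.tsv and the haunted places file
daylight = pd.DataFrame({"state": ["AK", "AK", "NY"],
                         "daylight_hours": [11.0, 12.0, 11.5]})
places = pd.DataFrame({"location": ["X", "Y"], "state_abbrev": ["AK", "NY"]})

# One row per state: average the daylight hours of that state's cities
per_state = daylight.groupby("state", as_index=False)["daylight_hours"].mean()

# validate="many_to_one" raises MergeError if the right side still repeats a key
merged = places.merge(per_state, left_on="state_abbrev", right_on="state",
                      how="left", validate="many_to_one")
print(len(places), len(merged))
```

With the per-state table, the merged frame has the same number of rows as the haunted places input, which is an easy invariant to check after the join.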
