<h1> <center> Analysis of the Haunted Places Dataset </center> </h1>

<b> Team 3 </b> <br>
Members: 

- Zili Yang
- Chen Yi Weng
- Aadarsh Sudhir Ghiya 
- Colin Leahey
- Niromikha Jayakumar 
- Yung Yee Chia


## Introduction

This notebook explores the Haunted Places dataset, extracts new features, joins additional datasets, and analyzes the data using similarity metrics and clustering techniques. The tasks include:
1. Preprocessing the dataset.
2. Extracting new features from descriptions.
3. Joining external datasets.
4. Analyzing the combined dataset using Tika-Similarity.
5. Visualizing the results.

## 1. Download and install Apache Tika

In [39]:
import warnings
warnings.filterwarnings('ignore')

In [40]:
pip install tika

Note: you may need to restart the kernel to use updated packages.


In [41]:
pip install number-parser

Note: you may need to restart the kernel to use updated packages.


In [42]:
pip install datefinder

Note: you may need to restart the kernel to use updated packages.


## 2. Download the Haunted Places dataset

In [43]:
import pandas as pd

# Load the dataset
haunted_places_df = pd.read_csv('../Data/haunted_places.csv')

# Make a copy of the original dataset
haunted_places_copy = haunted_places_df.copy()

## 3. Create a combined TSV file for your Haunted Places dataset

In [44]:
# Convert CSV to TSV
haunted_places_copy.to_csv('haunted_places.tsv', sep='\t', index=False)
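
Descriptions are free text, so they may contain tab characters that collide with the TSV delimiter. The sketch below uses only the standard library `csv` module and two made-up rows (not taken from the dataset) to illustrate that a field containing a tab is quoted on write and recovered unchanged on read. pandas applies the same minimal quoting by default, so this is a sanity check rather than a required step.

```python
import csv
import io

# Made-up rows for illustration; the first description contains a literal tab
rows = [
    {"city": "Ada", "description": "A misty figure\tappears near the gate."},
    {"city": "Lenox", "description": "Footsteps are heard at night."},
]

# Write tab-separated text; csv quotes any field containing the delimiter
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["city", "description"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Read it back and confirm the fields are unchanged
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer, delimiter="\t"))
print(round_tripped == rows)
```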

## 4. a) Audio Evidence

In [45]:
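import re

def has_audio_evidence(description):
    # Non-string values (e.g. NaN for a missing description) carry no evidence
    if not isinstance(description, str):
        return False
    audio_keywords = ["noises", "sound of snapping neck", "nursery rhymes"]
    # re.escape keeps each keyword literal even if it contains regex metacharacters
    return any(re.search(rf'\b{re.escape(keyword)}\b', description, re.IGNORECASE) for keyword in audio_keywords)

haunted_places_copy['Audio Evidence'] = haunted_places_copy['description'].apply(has_audio_evidence)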

## 4. b) Image/Video/Visual Evidence

In [46]:
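def has_visual_evidence(description):
    # Non-string values (e.g. NaN for a missing description) carry no evidence
    if not isinstance(description, str):
        return False
    visual_keywords = ["cameras", "take pictures", "names of children written on walls"]
    # re.escape keeps each keyword literal even if it contains regex metacharacters
    return any(re.search(rf'\b{re.escape(keyword)}\b', description, re.IGNORECASE) for keyword in visual_keywords)

haunted_places_copy['Image/Video/Visual Evidence'] = haunted_places_copy['description'].apply(has_visual_evidence)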
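
The audio and visual flags follow the same keyword-matching pattern, so one option is a shared helper that precompiles a single pattern per keyword list. The sketch below is an illustration only: `make_keyword_flagger`, `has_audio`, and `has_visual` are hypothetical names that do not appear in the notebook, and the example sentences are invented.

```python
import re

def make_keyword_flagger(keywords):
    """Build a function that reports whether any keyword occurs as a whole phrase."""
    # One precompiled alternation avoids rebuilding a pattern per keyword per row
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b',
        re.IGNORECASE,
    )

    def flag(description):
        # Treat missing or non-string descriptions as having no evidence
        return isinstance(description, str) and pattern.search(description) is not None

    return flag

has_audio = make_keyword_flagger(["noises", "sound of snapping neck", "nursery rhymes"])
has_visual = make_keyword_flagger(["cameras", "take pictures", "names of children written on walls"])

print(has_audio("Strange NOISES come from the attic."))
print(has_visual("Visitors hear footsteps on the stairs."))
```

Each returned function could be passed to `Series.apply` in the same way as the functions above.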

## 4. c) Haunted Places Date

In [47]:
import datetime
import datefinder

# Sentinel returned when no date can be extracted from a description
DEFAULT_DATE = datetime.date(2025, 1, 1)

def extract_date(description):
    try:
        # find_dates yields datetime objects lazily; take only the first match
        first_match = next(iter(datefinder.find_dates(description)), None)
        if first_match is not None:
            return first_match.date()  # Return only the date part
    except Exception:
        # Parsing errors are ignored on purpose; fall through to the default
        pass
    
    # Fallback to '2025-01-01' if no valid date is found or an error occurs
    return DEFAULT_DATE

# Apply the function to the 'description' column
haunted_places_copy['Haunted Places Date'] = haunted_places_copy['description'].apply(extract_date)

In [49]:
haunted_places_copy['Haunted Places Date']

0        2025-02-03
1        2025-02-01
2        2025-01-01
3        0211-02-24
4        2025-01-01
            ...    
10987    2025-02-12
10988    2025-01-01
10989    2025-02-18
10990    2025-01-01
10991    2025-01-01
Name: Haunted Places Date, Length: 10992, dtype: object
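
The output above mixes the 2025-01-01 fallback with values such as 0211-02-24, which likely comes from a non-date number in the text (the description in row 3 mentions "room 211"). One possible post-processing step, sketched below with the standard library only, treats dates outside a plausible range as missing. The helper name `clean_extracted_date` and the year bounds are assumptions for illustration and are not part of the notebook.

```python
import datetime

# Assumed bounds for a plausible event year; adjust to the dataset
MIN_YEAR = 1600
MAX_YEAR = 2025
FALLBACK = datetime.date(2025, 1, 1)  # same fallback value used by extract_date

def clean_extracted_date(value, fallback=FALLBACK):
    """Return value if its year is plausible, otherwise the fallback."""
    if not isinstance(value, datetime.date):
        return fallback
    if MIN_YEAR <= value.year <= MAX_YEAR:
        return value
    return fallback

print(clean_extracted_date(datetime.date(211, 2, 24)))
print(clean_extracted_date(datetime.date(1970, 6, 1)))
```

Applied to the column with `Series.apply`, this would replace the year-211 value with the fallback while leaving plausible dates untouched.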

## 4. d) Haunted Places Witness Count

In [77]:
import re
from number_parser import parse_number

def preprocess_description(description):
    """Replace vague phrases with estimated numbers to aid extraction."""
    description = description.lower()
    # Replace ambiguous phrases with approximate numbers
    replacements = {
        "some": "3",
        "a few": "3",
        "several": "5",
        "many": "10",
        "a lot": "10",
        "a handful": "5",
        "numerous": "10",
        "countless": "15",
        "dozens": "12",
        "scores": "20",
        "hundreds": "100",
        "a couple": "2"
    }
    
    for word, num in replacements.items():
        description = re.sub(rf"\b{word}\b", num, description)  # Whole-word replacement
    return description


def extract_numbers_from_text(text):
    """Extract numerical values from written-out numbers in the text."""
    # Match written-out number words only; digit strings such as "3" are not matched
    number_words = r'\b(?:zero|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred|thousand|million|billion)\b'
    
    # Find all matches of written-out numbers
    matches = re.findall(number_words, text.lower())
    
    # Parse each match once and drop anything number_parser cannot convert
    parsed = (parse_number(match) for match in matches)
    numbers = [num for num in parsed if num is not None]
    
    # Drop values that look like years (1900-2100) and values of 1 or less
    filtered_numbers = [num for num in numbers if num > 1 and not (1900 <= num <= 2100)]
    
    return filtered_numbers


def extract_witness_count(description):
    """
    Extract witness count from a haunted place description.
    Returns a tuple (witness_count, method) where method indicates how the count was derived.
    """
    try:
        # Step 1: Preprocess the description
        preprocessed_text = preprocess_description(description)
        
        # Step 2: Extract numbers from the text
        numbers = extract_numbers_from_text(preprocessed_text)
        
        # Step 3: Return the first valid number found, or 0 if no numbers are found
        if numbers:
            return numbers[0], "explicit_number"
        
        # Step 4: Default to 0 if no numbers are found
        return 0, "default"
    
    except Exception as e:
        print(f"Error parsing witness count from description: {description[:100]}... Error: {e}")
        return 0, "error"


# Apply the function to the 'description' column
haunted_places_copy['Haunted Places Witness Count'] = haunted_places_copy['description'].apply(
    lambda desc: extract_witness_count(desc)[0]  # Extract only the count (not the method)
)

# Display the updated columns
print(haunted_places_copy[['description', 'Haunted Places Witness Count']])

                                             description  \
0      Ada witch - Sometimes you can see a misty blue...   
1      A little girl was killed suddenly while waitin...   
2      If you take Gorman Rd. west towards Sand Creek...   
3      In the 1970's, one room, room 211, in the old ...   
4      Kappa Delta Sorority - The Kappa Delta Sororit...   
...                                                  ...   
10987  at 12 midnight you can see a lady with two lit...   
10988  Is haunted by the victims of a murder that hap...   
10989  The institution was for kids 18 years old and ...   
10990  Gymnasium -  their have been reports of a litt...   
10991  Cadets from the Air Force Academy participatin...   

       Haunted Places Witness Count  
0                                 0  
1                                 0  
2                                 0  
3                                 2  
4                                 0  
...                             ...  
10987        
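
In the code above, `preprocess_description` substitutes digit strings such as "3" for vague phrases, while `extract_numbers_from_text` matches only written-out number words, so the substituted values do not reach the final count. A possible alternative, sketched below, matches digits, number words, and vague phrases in a single pass. The names `WORD_VALUES`, `VAGUE_VALUES`, `TOKEN`, and `first_count` are hypothetical, and the small vocabulary stands in for `number_parser` so that the example needs only the standard library.

```python
import re

# Small illustrative vocabulary; the notebook itself relies on number_parser
WORD_VALUES = {
    "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}
# Estimates for vague phrases, copied from the notebook's replacement table
VAGUE_VALUES = {"a couple": 2, "a few": 3, "several": 5, "many": 10}

TOKEN = re.compile(
    r'\b(?:\d+|'
    + '|'.join(re.escape(w) for w in list(VAGUE_VALUES) + list(WORD_VALUES))
    + r')\b'
)

def first_count(description):
    """Return the first plausible count in the text, or 0 if none is found."""
    for token in TOKEN.findall(description.lower()):
        if token.isdigit():
            value = int(token)
        else:
            value = VAGUE_VALUES.get(token, WORD_VALUES.get(token))
        # Skip likely years and counts of one or less, mirroring the notebook's filters
        if value is not None and value > 1 and not (1900 <= value <= 2100):
            return value
    return 0

print(first_count("In 1985, several students saw a figure."))
```

Digits that are not counts, such as room numbers or ages, would still be picked up, so a fuller version would need to check the surrounding words.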

## 4. e) Time of Day

In [78]:
def extract_time_of_day(description):
    time_keywords = {"evening": "Evening", "morning": "Morning", "dusk": "Dusk"}
    for keyword, time_of_day in time_keywords.items():
        if re.search(rf'\b{keyword}\b', description, re.IGNORECASE):
            return time_of_day
    return "Unknown"

haunted_places_copy['Time of Day'] = haunted_places_copy['description'].apply(extract_time_of_day)

## 4. f) Apparition Type

In [79]:
def extract_apparition_type(description):
    apparition_keywords = {
        "ghost": "Ghost",
        "orb": "Orb",
        "ufo": "UFO",
        "uap": "UAP",
        "male": "Male",
        "female": "Female",
        "child": "Child",
        "several ghosts": "Several Ghosts"
    }
    for keyword, apparition_type in apparition_keywords.items():
        if re.search(rf'\b{keyword}\b', description, re.IGNORECASE):
            return apparition_type
    return "Unknown"

haunted_places_copy['Apparition Type'] = haunted_places_copy['description'].apply(extract_apparition_type)
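
Because `\bghost\b` requires a word boundary immediately after the keyword, plural forms such as "ghosts" or "orbs" are not matched by the function above, and only the first matching label is returned. One possible variant, sketched below, allows simple plurals and collects every matching label. The names `APPARITION_PATTERNS` and `extract_apparition_types` are hypothetical, and the plural handling is an assumption about what the column should capture.

```python
import re

# Label -> pattern; the plural forms are illustrative assumptions
APPARITION_PATTERNS = {
    "Several Ghosts": re.compile(r'\bseveral ghosts\b', re.IGNORECASE),
    "Ghost": re.compile(r'\bghosts?\b', re.IGNORECASE),
    "Orb": re.compile(r'\borbs?\b', re.IGNORECASE),
    "UFO": re.compile(r'\bufos?\b', re.IGNORECASE),
    "UAP": re.compile(r'\buaps?\b', re.IGNORECASE),
    "Male": re.compile(r'\bmales?\b', re.IGNORECASE),
    "Female": re.compile(r'\bfemales?\b', re.IGNORECASE),
    "Child": re.compile(r'\bchild(?:ren)?\b', re.IGNORECASE),
}

def extract_apparition_types(description):
    """Return every apparition label whose pattern occurs, or ["Unknown"]."""
    if not isinstance(description, str):
        return ["Unknown"]
    labels = [label for label, pattern in APPARITION_PATTERNS.items()
              if pattern.search(description)]
    return labels or ["Unknown"]

print(extract_apparition_types("The ghosts of two children appear at dusk."))
```

Returning a list keeps descriptions that mention more than one apparition type from being collapsed into a single label; joining the list into a string would keep the column TSV-friendly.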

## 4. g) Event Type

In [80]:
def extract_event_type(description):
    event_keywords = {
        "murder": "Murder",
        "die": "Death",
        "supernatural": "Supernatural Phenomenon"
    }
    for keyword, event_type in event_keywords.items():
        if re.search(rf'\b{keyword}\b', description, re.IGNORECASE):
            return event_type
    return "Unknown"

haunted_places_copy['Event Type'] = haunted_places_copy['description'].apply(extract_event_type)

In [90]:
# Spot-check: rows where the extracted witness count is 11
witness_count_check = haunted_places_copy[haunted_places_copy['Haunted Places Witness Count'] == 11]

# Display the filtered records
print(witness_count_check)

          city        country  \
1520     Lenox  United States   
3394  Woodward  United States   
8524     Paris  United States   

                                            description  \
1520  An eleven year old girl, whose last name is Sl...   
3394  This coffee house was originally a doctor's of...   
8524  Most people get a bad feeling just looking at ...   

                                 location          state state_abbrev  \
1520                     Cranewell Resort  Massachusetts           MA   
3394                     Leos Coffeehouse       Oklahoma           OK   
8524  Old Plantation home in Slate Shoals          Texas           TX   

      longitude   latitude  city_longitude  city_latitude  Audio Evidence  \
1520 -73.267236  42.341822      -73.284876      42.356461           False   
3394 -99.393019  36.434108      -99.390386      36.433648           False   
8524        NaN        NaN      -95.555513      33.660939           False   

      Image/Video/Visual Evi