# **The Stress-Sighting Hypothesis**
## A Data-Driven Analysis of Global Events and Reports of the Unknown.

**The Stress-Sighting Hypothesis** investigates whether the frequency of reported UFO sightings correlates meaningfully with periods of heightened cultural, political or global stress, using historical event data and publicly reported sightings.


## User Story
Alex Holloway is an investigative journalist, known for in-depth features that combine cultural analysis with data storytelling. They work with both independent media outlets and major publishers, seeking to explore how society processes uncertainty — from political unrest to media myths.

Alex is planning an article on how global stress events affect the number of UFO sightings, and has asked us to conduct the analysis using the publicly available NUFORC (National UFO Reporting Center) UFO Sightings dataset, found [here](https://www.kaggle.com/datasets/NUFORC/ufo-sightings/data).

## Business Requirements
In an era shaped by information saturation, political polarisation, and global crises, public perception is increasingly complex and emotionally charged. For journalists, researchers, and communicators, understanding how people respond to uncertainty is as important as the events themselves.

This project explores the potential relationship between **reported UFO sightings** and **global stress events**, not to investigate extraterrestrial phenomena, but to examine whether these sightings reflect **underlying patterns of public anxiety, media influence, and cultural tension.**

The outcome is a data-driven dashboard designed to support those working at the intersection of **data**, **storytelling**, and **public insight**.

![Alex Holloway – Persona Card](../images/alex_holloway_persona_card.png)

### Alex's Requirements:

- **Reveal Patterns**

Alex needs to identify correlations between historical periods of stress and spikes in UFO reporting quickly, clearly and without technical issues.

- **Narrative Context**

They want to explore not just *when* things happened, but *why it matters.* Explanatory text and annotations support deeper storytelling.

- **Usable Insights**

Our charts and summaries must be easy to extract for use in articles or reports, including explanatory captions and legends.

- **Trustworthy Structure**

The data pipeline must be transparent, ethical and well-documented to ensure and maintain credibility in their journalistic work.

### Value Proposition:
Our Dashboard must empower users like Alex to:
- Translate complex data into cultural insight
- Frame journalistic stories with empirical evidence
- Uncover social signals hiding in unconventional data
- Offer the audience a grounded perspective on how fear, media, and uncertainty intersect

---

## Hypotheses

Our Hypotheses for this project are as follows:

### **Hypothesis 1:** 

**There is a positive correlation between the number of global stress events in a given year and the number of UFO sightings.**

### **Hypothesis 2:**

**Years with higher total stress severity scores are associated with a greater number of UFO sightings.**

### **Hypothesis 3:**

**Cultural media events (such as the release of UFO-themed films or television series) correspond with noticeable short-term spikes in reported sightings.**

For the sake of brevity, we will not outline our validation approaches here, as these will be covered in a separate notebook.

---

## Data Preparation and Cleaning

In this section we extract our data and consider how to clean it so that it is ready for analysis.
Our first step is to load our first dataset: *ufo_data_scrubbed.csv*

In [2]:
# import libraries and load dataset

import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("../data/raw/ufo_data_scrubbed.csv")
df.head()



| | datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | date posted | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.8830556 | -97.941111 |
| 1 | 10/10/1949 21:00 | lackland afb | tx | NaN | light | 7200 | 1-2 hrs | 1949 Lackland AFB&#44 TX. Lights racing acros... | 12/16/2005 | 29.38421 | -98.581082 |
| 2 | 10/10/1955 17:00 | chester (uk/england) | NaN | gb | circle | 20 | 20 seconds | Green/Orange circular disc over Chester&#44 En... | 1/21/2008 | 53.2 | -2.916667 |
| 3 | 10/10/1956 21:00 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.9783333 | -96.645833 |
| 4 | 10/10/1960 20:00 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.4180556 | -157.803611 |


Straight away we get a data type warning (pandas' `DtypeWarning`) advising us that columns 5 & 9 have mixed data types. This is less than ideal, and will cause issues further down the line when we attempt to merge, aggregate or model our data.
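To make "mixed data types" concrete, here is a minimal sketch of how a single column can end up holding more than one Python type. The sample values and the helper name `type_counts` are illustrative assumptions, not taken from the notebook or the real dataset.

```python
import pandas as pd


def type_counts(series: pd.Series) -> dict:
    """Count how many values of each Python type a column holds."""
    return series.map(lambda v: type(v).__name__).value_counts().to_dict()


# Hypothetical sample: two clean integers and two strings with a stray backtick
sample = pd.Series([2700, 7200, "2`", "0.5`"], dtype="object")

counts = type_counts(sample)
print(counts)  # a column holding both 'int' and 'str' values is "mixed"
```

Running the same helper against a suspect column of the real dataframe would show whether it holds a single type or several.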

Let's go ahead and check our columns with the following code:

In [3]:
df.columns.to_list() # list all columns in the dataframe

['datetime',
 'city',
 'state',
 'country',
 'shape',
 'duration (seconds)',
 'duration (hours/min)',
 'comments',
 'date posted',
 'latitude',
 'longitude ']
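The warning refers to columns by position rather than by name, so one way to translate positions 5 and 9 into names is a simple zero-based lookup. This is a minimal sketch using the column list printed above; the variable names are assumptions for illustration.

```python
# Column names exactly as printed above (note the trailing space in 'longitude ')
columns = [
    "datetime", "city", "state", "country", "shape",
    "duration (seconds)", "duration (hours/min)", "comments",
    "date posted", "latitude", "longitude ",
]

# Positions reported by the warning, counted from zero
flagged_positions = [5, 9]
flagged = [columns[i] for i in flagged_positions]
print(flagged)  # ['duration (seconds)', 'latitude']

# Optional tidy-up: names with stray whitespace are easy to mistype later
stripped = [name.strip() for name in columns]
```

On the real dataframe, `df.columns[[5, 9]]` performs the same lookup.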

Because pandas reports column positions using zero-based indexing, columns 5 and 9 correspond to 'duration (seconds)' and 'latitude', so these are likely to be our offenders.
I'll now consult ChatGPT for code to help identify the problem values in these columns:

In [6]:
# Helper function to check if value is numeric after cleaning
def is_clean_numeric(value):
    value = str(value).strip().lower()
    value = value.replace('’', '').replace('‘', '').replace("'", '').replace('"', '')
    value = value.replace('.', '', 1).replace('-', '', 1)
    return value.isdigit()

# Check non-numeric values in 'duration (seconds)'
non_numeric_duration = df[~df['duration (seconds)'].apply(is_clean_numeric)]
print("Non-numeric values in duration (seconds):")
print(non_numeric_duration['duration (seconds)'].unique())

# Check non-numeric values in 'latitude'
non_numeric_latitude = df[~df['latitude'].apply(is_clean_numeric)]
print("Non-numeric values in latitude:")
print(non_numeric_latitude['latitude'].unique())


Non-numeric values in duration (seconds):
['2`' '8`' '0.5`']
Non-numeric values in latitude:
['33q.200088']


We can see from the output that a few rows contain unexpected, non-numeric characters: stray backticks in 'duration (seconds)' and a stray 'q' in 'latitude'.
Let's now convert these columns to strictly numeric columns:

In [None]:
df['duration (seconds)'] = pd.to_numeric(df['duration (seconds)'], errors='coerce')
df['latitude'] = pd.to_numeric(df['latitude'], errors='coerce')
# coerce will convert non-numeric values to NaN
# Code provided by ChatGPT to convert columns to numeric types, handling non-numeric values by converting them to NaN
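As a small illustration of what `errors='coerce'` does, the sketch below applies it to a handful of values like those reported earlier. The sample series is an assumption made for demonstration. It also shows one possible way to separate values that were missing from the start from values that only became NaN through coercion.

```python
import pandas as pd

# Hypothetical sample mixing clean values, a missing value, and malformed entries
raw = pd.Series(["2700", "2`", "0.5`", None, "33q.200088"], dtype="object")

# Anything that cannot be parsed as a number becomes NaN
converted = pd.to_numeric(raw, errors="coerce")

# Present before conversion but NaN afterwards: these were the malformed entries
newly_invalid = raw.notna() & converted.isna()
flagged_values = raw[newly_invalid].tolist()
print(flagged_values)  # ['2`', '0.5`', '33q.200088']
```

Keeping a mask such as `newly_invalid` is a design choice: it records which rows were changed by the conversion, which is useful when the cleaning steps need to be auditable.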

Now, let's run our Helper Function again to check that things have been resolved as expected:

In [8]:
# Helper function to check if value is numeric after cleaning
def is_clean_numeric(value):
    value = str(value).strip().lower()
    value = value.replace('’', '').replace('‘', '').replace("'", '').replace('"', '')
    value = value.replace('.', '', 1).replace('-', '', 1)
    return value.isdigit()

# Check non-numeric values in 'duration (seconds)'
non_numeric_duration = df[~df['duration (seconds)'].apply(is_clean_numeric)]
print("Non-numeric values in duration (seconds):")
print(non_numeric_duration['duration (seconds)'].unique())

# Check non-numeric values in 'latitude'
non_numeric_latitude = df[~df['latitude'].apply(is_clean_numeric)]
print("Non-numeric values in latitude:")
print(non_numeric_latitude['latitude'].unique())

Non-numeric values in duration (seconds):
[nan]
Non-numeric values in latitude:
[nan]
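A complementary check, sketched here on made-up values, is to confirm the column dtype directly with `pandas.api.types.is_numeric_dtype` instead of inspecting individual values. The sample series is an assumption for illustration only.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical column containing one malformed entry
before = pd.Series(["20", "2`", "900"], dtype="object")
after = pd.to_numeric(before, errors="coerce")

print(is_numeric_dtype(before), is_numeric_dtype(after))  # False True
print(int(after.isna().sum()))  # number of values coerced to NaN
```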


We can now see that the non-numeric values have been replaced with NaN (Not a Number).
Let us now flag the rows to be dropped, and export them to a new .csv file for auditing and transparency:

In [9]:
# Flag rows with invalid (non-numeric) duration or latitude
df['invalid_duration_or_latitude'] = df[['duration (seconds)', 'latitude']].isnull().any(axis=1)


In [11]:
# Count and optionally save them
dropped_rows = df[df['invalid_duration_or_latitude']]
print(f"Number of rows to be dropped: {len(dropped_rows)}")

# Export dropped rows for audit
dropped_rows.to_csv("../data/dropped_invalid_coordinates_or_duration.csv", index=False)



Number of rows to be dropped: 4
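The flag-and-export pattern can be sketched end to end on a toy dataframe. Everything here is an illustrative assumption: the sample values, the file name `dropped_rows_demo.csv`, and the choice to keep the original row index in the audit file. The cell above writes with `index=False`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data: rows 1 and 2 each have one missing value
df_demo = pd.DataFrame(
    {
        "duration (seconds)": [2700.0, None, 20.0, 900.0],
        "latitude": [29.88, 33.93, None, 21.42],
    }
)

check_cols = ["duration (seconds)", "latitude"]
invalid = df_demo[check_cols].isna().any(axis=1)  # True where either column is NaN

with tempfile.TemporaryDirectory() as tmp:
    audit_path = Path(tmp) / "dropped_rows_demo.csv"
    # Writing the index lets each dropped row be traced back to the raw file
    df_demo[invalid].to_csv(audit_path, index_label="original_index")
    audit = pd.read_csv(audit_path, index_col="original_index")

cleaned = df_demo[~invalid]
# Every row is either kept or audited, never lost
assert len(cleaned) + len(audit) == len(df_demo)
```

Whether to keep the index in the export is a judgment call. It adds a column to the audit file, but it makes each removal traceable to its position in the source data.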


As we can see, only 4 rows are flagged to be dropped, which represents ~0.005% of our total data, so let's go ahead and drop them:
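The percentage can be reproduced with a short calculation. The total of 80,000 rows is a round figure assumed purely for illustration, since the exact row count is not stated here, and `dropped_share` is a hypothetical helper name.

```python
def dropped_share(n_dropped: int, n_total: int) -> float:
    """Return the dropped rows as a percentage of all rows."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100 * n_dropped / n_total


# Assumed total of 80,000 rows; substitute len(df) from before the drop
share = dropped_share(4, 80_000)
print(f"{share:.3f}%")  # 0.005%
```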

In [12]:
# Drop rows with invalid duration or latitude
df = df[~df['invalid_duration_or_latitude']].drop(columns='invalid_duration_or_latitude')


Now let us check that the NaN values have been dropped from the 'duration (seconds)' and 'latitude' columns:

In [13]:
print(df[['duration (seconds)', 'latitude']].isnull().sum())
# check that there are no more NaN values in the 'duration (seconds)' and 'latitude' columns

duration (seconds)    0
latitude              0
dtype: int64


In [14]:
# Display a few of the previously dropped values
dropped_rows[['duration (seconds)', 'latitude']].head()


| | duration (seconds) | latitude |
|---|---|---|
| 27822 | NaN | 33.9325 |
| 35692 | NaN | 36.974167 |
| 43782 | 180.0 | NaN |
| 58591 | NaN | 4.440663 |


Here we can see the rows that were removed because of NaN values and that, as expected, there are only 4 of them.
These rows represent such a small fraction of the data (~0.005%) that their absence is very unlikely to introduce bias, distort correlations, or meaningfully affect the outcome of any regression or visual insights. Removing them gives us a cleaner, more reliable dataset without sacrificing representativeness.
