# **The Stress-Sighting Hypothesis**
## A Data-Driven Analysis of Global Events and Reports of the Unknown.

**The Stress-Sighting Hypothesis** investigates whether there is a meaningful correlation between the frequency of reported UFO sightings and periods of heightened cultural, political or global stress, using historical event data and publicly reported sightings.


## User Story
Alex Holloway is an investigative journalist, known for in-depth features that combine cultural analysis with data storytelling. They work with both independent media outlets and major publishers, seeking to explore how society processes uncertainty — from political unrest to media myths.

Alex is planning to write an article on how global stress events affect the number of UFO sightings, and has asked us to conduct the analysis using the publicly available NUFORC (National UFO Reporting Center) UFO Sightings dataset, found [here](https://www.kaggle.com/datasets/NUFORC/ufo-sightings/data).

## Business Requirements
In an era shaped by information saturation, political polarisation, and global crises, public perception is increasingly complex and emotionally charged. For journalists, researchers, and communicators, understanding how people respond to uncertainty is as important as the events themselves.

This project explores the potential relationship between **reported UFO sightings** and **global stress events**, not to investigate extraterrestrial phenomena, but to examine whether these sightings reflect **underlying patterns of public anxiety, media influence, and cultural tension.**

The outcome is a data-driven dashboard designed to support those working at the intersection of **data**, **storytelling**, and **public insight**.

![Alex Holloway – Persona Card](../images/alex_holloway_persona_card.png)

### Alex's Requirements:

- **Reveal Patterns**

Alex needs to identify correlations between historical periods of stress and spikes in UFO reporting quickly, clearly and without technical issues.

- **Narrative Context**

They want to explore not just *when* things happened, but *why it matters.* Explanatory text and annotations support deeper storytelling.

- **Usable Insights**

Our charts and summaries must be easy to extract for use in articles or reports, including explanatory captions and legends.

- **Trustworthy Structure**

The data pipeline must be transparent, ethical and well-documented to ensure and maintain credibility in their journalistic work.

### Value Proposition:
Our Dashboard must empower users like Alex to:
- Translate complex data into cultural insight
- Frame journalistic stories with empirical evidence
- Uncover social signals hiding in unconventional data
- Offer the audience a grounded perspective on how fear, media, and uncertainty intersect.

---

## Hypotheses

Our Hypotheses for this project are as follows:

### **Hypothesis 1:** 

**There is a positive correlation between the number of global stress events in a given year and the number of UFO sightings.**

### **Hypothesis 2:**

**Years with higher total stress severity scores are associated with a greater number of UFO sightings.**

### **Hypothesis 3:**

**Cultural media events (such as the release of UFO-themed films or television series) correspond with noticeable short-term spikes in reported sightings.**

For the sake of brevity, we will not outline our validation approaches here, as these will be covered in a separate notebook.

---

## Data Preparation and Cleaning

In this section we extract our data and consider how to clean it so that it is fit for analysis.
Our first step is to load our first dataset: *ufo_data_scrubbed.csv*

In [56]:
# import libraries and load dataset

import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("../data/raw/ufo_data_scrubbed.csv")
df.head()



| | datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | date posted | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.8830556 | -97.941111 |
| 1 | 10/10/1949 21:00 | lackland afb | tx | NaN | light | 7200 | 1-2 hrs | 1949 Lackland AFB&#44 TX. Lights racing acros... | 12/16/2005 | 29.38421 | -98.581082 |
| 2 | 10/10/1955 17:00 | chester (uk/england) | NaN | gb | circle | 20 | 20 seconds | Green/Orange circular disc over Chester&#44 En... | 1/21/2008 | 53.2 | -2.916667 |
| 3 | 10/10/1956 21:00 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.9783333 | -96.645833 |
| 4 | 10/10/1960 20:00 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.4180556 | -157.803611 |


Straight away we get a data type warning advising us that columns 5 & 9 have mixed data types. This is less than ideal, and will cause issues further down the line when we attempt to merge, aggregate or model our data. 
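
One way to sidestep this kind of mixed-type inference is to tell `read_csv` up front how to treat the suspect columns. The sketch below is illustrative only: it reads a small, hypothetical in-memory sample (`sample_csv`, modelled on the kinds of entries this dataset turns out to contain) rather than the real file, and loads the two columns as plain strings so that nothing is guessed silently.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "datetime,duration (seconds),latitude\n"
    "10/10/1949 20:30,2700,29.8830556\n"
    "10/10/1955 17:00,2`,53.2\n"
    "10/10/1956 21:00,20,33q.200088\n"
)

# Read the two suspect columns as strings so pandas does not have to
# infer a type for them while parsing.
sample_df = pd.read_csv(
    sample_csv,
    dtype={"duration (seconds)": str, "latitude": str},
)

print(sample_df.dtypes)
```

Loading as strings keeps the raw, malformed entries visible for inspection; the numeric conversion can then be done explicitly as a separate, auditable step.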

Let's go ahead and check our columns with the following code:

In [57]:
df.columns.to_list() # list all columns in the dataframe

['datetime',
 'city',
 'state',
 'country',
 'shape',
 'duration (seconds)',
 'duration (hours/min)',
 'comments',
 'date posted',
 'latitude',
 'longitude ']

Counting from zero, columns 5 and 9 correspond to 'duration (seconds)' and 'latitude', so these are likely to be our offenders here.
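
The position-to-name lookup can be sketched in a few lines. The `columns` list is copied from the output above; `flagged_positions` is a hypothetical name for the two positions reported in the warning.

```python
# Column names as listed in the output above (note the trailing space
# in 'longitude ', which is present in the raw header).
columns = [
    "datetime", "city", "state", "country", "shape",
    "duration (seconds)", "duration (hours/min)", "comments",
    "date posted", "latitude", "longitude ",
]

# Positions reported by the warning, counted from zero.
flagged_positions = [5, 9]
flagged_names = [columns[i] for i in flagged_positions]

print(flagged_names)  # ['duration (seconds)', 'latitude']
```

On the live dataframe, `df.columns[[5, 9]]` should give the same answer without retyping the names.
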
We'll now consult ChatGPT for code to help identify the problem values in these columns:

In [58]:
# Helper function to check if value is numeric after cleaning
def is_clean_numeric(value):
    value = str(value).strip().lower()
    value = value.replace('’', '').replace('‘', '').replace("'", '').replace('"', '')
    value = value.replace('.', '', 1).replace('-', '', 1)
    return value.isdigit()

# Check non-numeric values in 'duration (seconds)'
non_numeric_duration = df[~df['duration (seconds)'].apply(is_clean_numeric)]
print("Non-numeric values in duration (seconds):")
print(non_numeric_duration['duration (seconds)'].unique())

# Check non-numeric values in 'latitude'
non_numeric_latitude = df[~df['latitude'].apply(is_clean_numeric)]
print("Non-numeric values in latitude:")
print(non_numeric_latitude['latitude'].unique())


Non-numeric values in duration (seconds):
['2`' '8`' '0.5`']
Non-numeric values in latitude:
['33q.200088']


We can see from the code output that a handful of rows contain unexpected, non-numeric characters.
Let's now convert these columns to strictly numeric columns:

In [59]:
df['duration (seconds)'] = pd.to_numeric(df['duration (seconds)'], errors='coerce')
df['latitude'] = pd.to_numeric(df['latitude'], errors='coerce')
# coerce will convert non-numeric values to NaN
# Code provided by ChatGPT to convert columns to numeric types, handling non-numeric values by converting them to NaN
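
Because `errors='coerce'` overwrites bad values with NaN, it can be useful to capture the offending raw entries before the column is replaced. The following is a minimal sketch on a hypothetical series (`raw`), not the notebook's own method: a value counts as newly invalid if it was present before coercion but is NaN afterwards.

```python
import pandas as pd

# Hypothetical raw values, modelled on the offending entries found above.
raw = pd.Series(["2700", "2`", "0.5`", "900", None], name="duration (seconds)")

# Coerce to numeric: anything unparseable becomes NaN.
coerced = pd.to_numeric(raw, errors="coerce")

# Present before coercion, NaN after: these are the entries coercion discarded.
newly_invalid = raw.notna() & coerced.isna()

print(raw[newly_invalid].tolist())  # ['2`', '0.5`']
```

This separates values that were already missing from values that coercion turned into NaN, which the custom helper function does not distinguish once the column has been converted.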

Now, let's run our Helper Function again to check that things have been resolved as expected:

In [60]:
# Helper function to check if value is numeric after cleaning
def is_clean_numeric(value):
    value = str(value).strip().lower()
    value = value.replace('’', '').replace('‘', '').replace("'", '').replace('"', '')
    value = value.replace('.', '', 1).replace('-', '', 1)
    return value.isdigit()

# Check non-numeric values in 'duration (seconds)'
non_numeric_duration = df[~df['duration (seconds)'].apply(is_clean_numeric)]
print("Non-numeric values in duration (seconds):")
print(non_numeric_duration['duration (seconds)'].unique())

# Check non-numeric values in 'latitude'
non_numeric_latitude = df[~df['latitude'].apply(is_clean_numeric)]
print("Non-numeric values in latitude:")
print(non_numeric_latitude['latitude'].unique())

Non-numeric values in duration (seconds):
[nan]
Non-numeric values in latitude:
[nan]


We can now see that the non-numeric values have been replaced with NaN (Not a Number). Note that the helper also reports NaN itself as non-numeric, which is why `[nan]` is the only value listed.
Let us now flag the rows to be dropped, count them, and export them to a new .csv file for auditing and transparency:

In [61]:
# Flag rows with invalid (non-numeric) duration or latitude
df['invalid_duration_or_latitude'] = df[['duration (seconds)', 'latitude']].isnull().any(axis=1)


In [62]:
# Count and optionally save them
dropped_rows = df[df['invalid_duration_or_latitude']]
print(f"Number of rows to be dropped: {len(dropped_rows)}")

# Export dropped rows for audit
dropped_rows.to_csv("../data/dropped_invalid_coordinates_or_duration.csv", index=False)



Number of rows to be dropped: 4


As we can see, only 4 rows are flagged to be dropped, which represents ~0.005% of our total data, so let's go ahead and drop them:

In [63]:
# Drop rows with invalid duration or latitude
df = df[~df['invalid_duration_or_latitude']].drop(columns='invalid_duration_or_latitude')


Now let us check that our NaN values have been dropped from the 'duration (seconds)' and 'latitude' columns

In [64]:
print(df[['duration (seconds)', 'latitude']].isnull().sum())
# check that there are no more NaN values in the 'duration (seconds)' and 'latitude' columns

duration (seconds)    0
latitude              0
dtype: int64


In [65]:
# Display a few of the previously dropped values
dropped_rows[['duration (seconds)', 'latitude']].head()


| | duration (seconds) | latitude |
|---|---|---|
| 27822 | NaN | 33.9325 |
| 35692 | NaN | 36.974167 |
| 43782 | 180.0 | NaN |
| 58591 | NaN | 4.440663 |


Here we can see that the rows with NaN values have been removed and that, as expected, only 4 rows were affected.
These rows represent such a small fraction of the data (~0.005%) that their removal is very unlikely to introduce bias, distort correlations, or meaningfully affect the outcome of any regression or visual insights. Removing them gives us a cleaner, more reliable dataset without sacrificing representativeness.
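
The ~0.005% figure is simply the flagged rows divided by the total row count. The total is not printed in this notebook, so the sketch below assumes roughly 80,300 rows: a hypothetical round figure back-calculated from the missing-value counts and percentages reported later in this section, not a value taken from the data itself.

```python
def percent_of_total(n_rows: int, n_total: int) -> float:
    """Return n_rows as a percentage of n_total."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_rows / n_total * 100


# Assumed total row count (hypothetical, see the note above); 4 rows were flagged.
assumed_total = 80_300

print(f"{percent_of_total(4, assumed_total):.4f}%")  # 0.0050%
```

In the notebook itself, `len(dropped_rows) / len(df) * 100` computed before the drop would give the exact figure.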


Now that we have solved our initial issue of unexpected characters appearing in the *duration (seconds)* and *latitude* columns, let us continue by conducting a broad audit of missing data across the entire dataset. This is a simple operation that chains the 'isnull()' and 'sum()' methods.

In [66]:
# check for missing value counts in the entire dataframe
missing_counts = df.isnull().sum()
missing_counts

datetime                   0
city                       0
state                   5796
country                 9668
shape                   1930
duration (seconds)         0
duration (hours/min)       0
comments                  15
date posted                0
latitude                   0
longitude                  0
dtype: int64

We can see here that, despite using the 'scrubbed' version of our UFO data, we still have a lot of missing values to deal with.
Let us quickly calculate the percentage of rows with missing values in each column:

In [67]:
# Show percentage of missing values
(df.isnull().sum() / len(df) * 100).round(2).sort_values(ascending=False)


country                 12.04
state                    7.22
shape                    2.40
comments                 0.02
datetime                 0.00
city                     0.00
duration (seconds)       0.00
duration (hours/min)     0.00
date posted              0.00
latitude                 0.00
longitude                0.00
dtype: float64

We can see from this that missing *country* values make up ~12% of our dataset. Missing *state* entries account for ~7.2%.
Missing *shape* descriptors account for only 2.4%, and missing *comments* only 0.02%.

Let us first turn our attention to resolving the missing *country* values. Because they make up a substantial proportion of our dataset, we decide to impute the missing values with 'unknown' rather than deleting the rows entirely.

We can perform this operation using the following methodology:

In [68]:
# impute missing country values with 'unknown'
df['country'] = df['country'].fillna('unknown')

Let us check that this has worked as anticipated by running our previous code:

In [69]:
# check for missing value counts in the entire dataframe
missing_counts = df.isnull().sum()
missing_counts

datetime                   0
city                       0
state                   5796
country                    0
shape                   1930
duration (seconds)         0
duration (hours/min)       0
comments                  15
date posted                0
latitude                   0
longitude                  0
dtype: int64

Great! We can see that our *country* column now has zero missing entries, so our imputation has been successful.
We can now perform the same operation on the *state* column. We've decided on this course of action because the dataset contains sightings from regions outside the US, which may not have or require *state* values. These rows may still be relevant to any regional breakdown analysis we conduct later on.
Let us perform the same operation as before, but alter our code to point to the *state* column:


In [70]:
# impute missing state values with 'unknown'
df['state'] = df['state'].fillna('unknown')

Again, let us run our test to ensure that our operation has been successful:

In [71]:
# check for missing value counts in the entire dataframe
missing_counts = df.isnull().sum()
missing_counts

datetime                   0
city                       0
state                      0
country                    0
shape                   1930
duration (seconds)         0
duration (hours/min)       0
comments                  15
date posted                0
latitude                   0
longitude                  0
dtype: int64

Success! Let us now move on to addressing the missing values in the *shape* column. Once again, it seems prudent to impute 'unknown' for the missing values here, as this column may feed into later visualisations where shape type is of notable interest.

In [72]:
# impute missing shape values with 'unknown'
df['shape'] = df['shape'].fillna('unknown')

In [73]:
# check for missing value counts in the entire dataframe
missing_counts = df.isnull().sum()
missing_counts

datetime                 0
city                     0
state                    0
country                  0
shape                    0
duration (seconds)       0
duration (hours/min)     0
comments                15
date posted              0
latitude                 0
longitude                0
dtype: int64

We have now resolved the majority of our missing values, with only the missing *comments* values remaining. We have two options for resolving this issue: either fill these missing entries with empty strings, or drop the rows entirely.
As we saw earlier, these missing values only account for 0.02% of our data, so given their low significance to our overall analysis, we decide to drop these rows.
For this we use the *dropna()* method:

In [74]:
# drop rows with missing comments
df = df.dropna(subset=['comments'])


In [75]:
# check for missing value counts in the entire dataframe
missing_counts = df.isnull().sum()
missing_counts

datetime                0
city                    0
state                   0
country                 0
shape                   0
duration (seconds)      0
duration (hours/min)    0
comments                0
date posted             0
latitude                0
longitude               0
dtype: int64

Now that we have successfully handled our missing data entries, let us summarise our handling decisions:

- *country* : ~12% missing data filled with 'unknown'
- *state* : ~7.2% missing data filled with 'unknown'
- *shape* : ~2.4% missing data filled with 'unknown'
- *comments* : ~0.02% missing data dropped. 

These decisions were made to preserve as much of the data as possible, while giving us flexibility in filtering and consistent formatting of categorical fields for visual analysis.
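
For reference, the four handling decisions above could also be expressed as a single chained step. This is a sketch on a hypothetical three-row frame (`toy`), not a replacement for the step-by-step cells: `fillna` accepts a dictionary mapping column names to fill values, and `dropna(subset=...)` then removes rows that lack comments.

```python
import pandas as pd

# Hypothetical mini-frame with the same four partially missing columns.
toy = pd.DataFrame({
    "country": ["us", None, "gb"],
    "state": ["tx", "tx", None],
    "shape": ["light", None, "circle"],
    "comments": ["Lights in the sky", "Bright disc", None],
})

# Fill the categorical gaps in one call, then drop rows lacking comments.
fill_values = {"country": "unknown", "state": "unknown", "shape": "unknown"}
cleaned = toy.fillna(fill_values).dropna(subset=["comments"])

print(cleaned)
```

Keeping the fill values in one dictionary makes the imputation policy easy to document and to change in a single place.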

Next, let us quickly standardise our column names, since some of them contain spaces, for example the *duration (seconds)* column.

In [76]:
# replace spaces in column names with underscores and convert to lowercase
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df.columns

Index(['datetime', 'city', 'state', 'country', 'shape', 'duration_(seconds)',
       'duration_(hours/min)', 'comments', 'date_posted', 'latitude',
       'longitude'],
      dtype='object')

We can now see that our column names have been standardised: spaces have been replaced with underscores, and the stray trailing space in 'longitude ' has been stripped. This will help us later on when we come to merge our datasets and conduct our analysis.
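
The names above still carry parentheses and a slash, which can be awkward in some downstream tools. As an optional alternative (not what this notebook does), a stricter normaliser could reduce each name to letters, digits and underscores. `normalise_column_name` is a hypothetical helper written for illustration.

```python
import re


def normalise_column_name(name: str) -> str:
    """Lowercase a column name and reduce it to letters, digits and underscores."""
    name = name.strip().lower()
    # Replace every run of non-alphanumeric characters with one underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    # Trim underscores left at either end, e.g. from a closing parenthesis.
    return name.strip("_")


print(normalise_column_name("duration (seconds)"))    # duration_seconds
print(normalise_column_name("duration (hours/min)"))  # duration_hours_min
print(normalise_column_name("longitude "))            # longitude
```

It could be applied with `df.columns = [normalise_column_name(c) for c in df.columns]`, although later cells in this notebook rely on the parenthesised names and would need updating to match.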

Next, we should check that our columns are correctly assigned the proper data types.

In [77]:
# check data types of the columns
data_types = df.dtypes
data_types

datetime                 object
city                     object
state                    object
country                  object
shape                    object
duration_(seconds)      float64
duration_(hours/min)     object
comments                 object
date_posted              object
latitude                float64
longitude               float64
dtype: object

We can see from our data types check that there are some columns that will need to have their data types changed. 
First on our 'to-do' list is converting the *datetime* column from 'object' to 'datetime'. We will do this using the *pd.to_datetime()* function.

In [78]:
# convert 'datetime' column to datetime type
df['datetime'] = pd.to_datetime(df['datetime'], errors='coerce')
# check the data types again using the .dtypes attribute
data_types = df.dtypes
data_types

datetime                datetime64[ns]
city                            object
state                           object
country                         object
shape                           object
duration_(seconds)             float64
duration_(hours/min)            object
comments                        object
date_posted                     object
latitude                       float64
longitude                      float64
dtype: object
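
Since `errors='coerce'` silently turns any unparseable timestamp into NaT, it may be worth counting how many values were lost in the conversion. The sketch below uses hypothetical timestamps (`raw_times`), one of which has a '24:00' time that does not fit the `%H:%M` format. Whether the real column contains such entries is an assumption that would need checking against the data.

```python
import pandas as pd

# Hypothetical timestamps; the last uses a '24:00' time, which falls
# outside the 00-23 range accepted by %H.
raw_times = pd.Series(
    ["10/10/1949 20:30", "10/10/1955 17:00", "10/11/2006 24:00"]
)

parsed = pd.to_datetime(raw_times, format="%m/%d/%Y %H:%M", errors="coerce")

# Rows that held a value before parsing but are NaT afterwards.
failed = raw_times.notna() & parsed.isna()

print(f"Unparseable timestamps: {failed.sum()}")
print(raw_times[failed].tolist())
```

Running the same check on the real column (keeping a copy of the raw strings before conversion) would confirm whether any sightings lost their timestamp, which matters for the yearly aggregation planned later.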

Now that we have converted our *datetime* column to the correct format, let us move on to handling the next column with the incorrect data type - *duration_(hours/min)*

On inspection, this column is not needed for our analysis: its formatting is inconsistent throughout, so it would require time-consuming parsing, and since we also have a *duration_(seconds)* column, the *duration_(hours/min)* column is redundant for our purposes.
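
To illustrate the parsing problem, the sketch below runs a deliberately simple pattern over the five free-text durations shown in the first rows of the dataset. `parse_duration` and its pattern are hypothetical and written only for this demonstration.

```python
import re
from typing import Optional

# Free-text durations copied from the first five rows shown earlier.
samples = ["45 minutes", "1-2 hrs", "20 seconds", "1/2 hour", "15 minutes"]

# Deliberately simple pattern: "<number> <unit>", with an optional plural "s".
PATTERN = re.compile(r"^(\d+(?:\.\d+)?)\s*(second|minute|hour)s?$")
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600}


def parse_duration(text: str) -> Optional[float]:
    """Return the duration in seconds, or None if the text does not match."""
    match = PATTERN.match(text.strip().lower())
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * SECONDS_PER_UNIT[unit]


for text in samples:
    print(f"{text!r:>14} -> {parse_duration(text)}")
```

Ranges such as '1-2 hrs' and fractions such as '1/2 hour' fall through this pattern, so two of the five samples would need additional special-case rules.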

Let us proceed to drop this column from our dataset using the *.drop()* method:

In [79]:
# drop 'duration_(hours/min)' column as it is redundant
df.drop(columns=['duration_(hours/min)'], inplace=True)

Now that we have removed the column, let us quickly check that it has been successfully removed:

In [80]:
df.head()  # Display the first few rows of the cleaned dataframe

| | datetime | city | state | country | shape | duration_(seconds) | comments | date_posted | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1949-10-10 20:30:00 | san marcos | tx | us | cylinder | 2700.0 | This event took place in early fall around 194... | 4/27/2004 | 29.883056 | -97.941111 |
| 1 | 1949-10-10 21:00:00 | lackland afb | tx | unknown | light | 7200.0 | 1949 Lackland AFB&#44 TX. Lights racing acros... | 12/16/2005 | 29.38421 | -98.581082 |
| 2 | 1955-10-10 17:00:00 | chester (uk/england) | unknown | gb | circle | 20.0 | Green/Orange circular disc over Chester&#44 En... | 1/21/2008 | 53.2 | -2.916667 |
| 3 | 1956-10-10 21:00:00 | edna | tx | us | circle | 20.0 | My older brother and twin sister were leaving ... | 1/17/2004 | 28.978333 | -96.645833 |
| 4 | 1960-10-10 20:00:00 | kaneohe | hi | us | light | 900.0 | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.418056 | -157.803611 |


We can see that our *duration_(hours/min)* column has been successfully removed.
Next, let us convert the *date_posted* column from 'object' to the correct 'datetime' format. We will use the same code as before, simply pointing it at our chosen column:

In [81]:
# convert 'date_posted' column to datetime type
df['date_posted'] = pd.to_datetime(df['date_posted'], errors='coerce')
# check the data types again using the .dtypes attribute
data_types = df.dtypes
data_types

datetime              datetime64[ns]
city                          object
state                         object
country                       object
shape                         object
duration_(seconds)           float64
comments                      object
date_posted           datetime64[ns]
latitude                     float64
longitude                    float64
dtype: object

We can now see that our columns have been converted to their correct types.
The next logical step is to check whether there are any duplicate entries in our dataset. For this, we will employ the *.duplicated()* and *.sum()* methods to count the duplicate rows:

In [82]:
# Count total duplicates (excluding index)
duplicate_count = df.duplicated().sum()
print(f"Total duplicate rows: {duplicate_count}")


Total duplicate rows: 2


Let us check the duplicate rows:

In [83]:
# Show actual duplicate rows
df[df.duplicated()]


| | datetime | city | state | country | shape | duration_(seconds) | comments | date_posted | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 62690 | 2013-07-04 22:00:00 | shakopee | mn | us | light | 300.0 | Orange fast orbs. | 2013-07-05 | 44.798056 | -93.526667 |
| 70780 | 2013-08-30 21:45:00 | haymarket | va | us | light | 30.0 | 2 bright lights... | 2013-09-09 | 38.811944 | -77.636667 |


In [84]:
df[df.duplicated(keep=False)]  # Show all duplicates, including the first occurrence

| | datetime | city | state | country | shape | duration_(seconds) | comments | date_posted | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 62689 | 2013-07-04 22:00:00 | shakopee | mn | us | light | 300.0 | Orange fast orbs. | 2013-07-05 | 44.798056 | -93.526667 |
| 62690 | 2013-07-04 22:00:00 | shakopee | mn | us | light | 300.0 | Orange fast orbs. | 2013-07-05 | 44.798056 | -93.526667 |
| 70779 | 2013-08-30 21:45:00 | haymarket | va | us | light | 30.0 | 2 bright lights... | 2013-09-09 | 38.811944 | -77.636667 |
| 70780 | 2013-08-30 21:45:00 | haymarket | va | us | light | 30.0 | 2 bright lights... | 2013-09-09 | 38.811944 | -77.636667 |


We can see from our duplicate check that both duplicates sit on the row immediately after their first entries, and that every field is identical. We therefore consider it safe to drop these rows from the dataset.
For this we will employ the *.drop_duplicates()* method, setting the argument *inplace=True* so that the change is applied directly to *df*. Removing these rows ensures clean aggregation and avoids skewing any yearly totals.
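
The difference between the two duplicate views used above, and the effect of dropping, can be seen on a small example. The three-row frame (`toy`) is hypothetical. By default `duplicated()` marks only the repeats, `keep=False` marks every member of a duplicated group, and `drop_duplicates()` keeps the first occurrence.

```python
import pandas as pd

# Hypothetical frame: the first two rows are identical.
toy = pd.DataFrame({
    "city": ["shakopee", "shakopee", "haymarket"],
    "shape": ["light", "light", "light"],
})

print(toy.duplicated().tolist())            # [False, True, False]
print(toy.duplicated(keep=False).tolist())  # [True, True, False]

# Default keep='first': the first occurrence survives, repeats are removed.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduped))         # 3 -> 2
```

Assigning the result to a new name, as here, is an alternative to *inplace=True* that leaves the original frame available for comparison.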

In [86]:
# drop duplicate rows
df.drop_duplicates(inplace=True)



Let us perform a check to ensure that we have removed our 2 duplicate rows. 

In [87]:
# Count total duplicates (excluding index)
duplicate_count = df.duplicated().sum()
print(f"Total duplicate rows: {duplicate_count}")

Total duplicate rows: 0


Excellent, we have now handled our duplicated entries successfully.
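
As a closing sanity check, the properties established in this section could be gathered into one small helper. This is a sketch under assumptions: `check_cleaned` is a hypothetical function, and it is demonstrated on a two-row stand-in frame rather than the real dataset.

```python
import pandas as pd


def check_cleaned(frame: pd.DataFrame) -> dict:
    """Summarise the properties the cleaning steps are meant to guarantee."""
    return {
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "datetime_is_datetime": pd.api.types.is_datetime64_any_dtype(
            frame["datetime"]
        ),
    }


# Hypothetical two-row frame standing in for the cleaned dataset.
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["1949-10-10 20:30", "1955-10-10 17:00"]),
    "city": ["san marcos", "chester (uk/england)"],
    "duration_(seconds)": [2700.0, 20.0],
})

print(check_cleaned(toy))
# {'missing_values': 0, 'duplicate_rows': 0, 'datetime_is_datetime': True}
```

Calling `check_cleaned(df)` at the end of the notebook would give a single record of the dataset's state to quote alongside the audit file of dropped rows.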