# Unidentified Flying Object (UFO) Sightings - Initial Exploration

This notebook takes a first look at a dataset of 80,332 reported UFO sightings from around the world, dated between 1949 and 2014.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("default")

In [None]:
df = pd.read_csv("../data/ufo_sightings_raw.csv")


In [None]:
df.head(10)

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.8830556,-97.941111
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.38421,-98.581082
2,1955-10-10 17:00:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.2,-2.916667
3,1956-10-10 21:00:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.9783333,-96.645833
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.4180556,-157.803611
5,1961-10-10 19:00:00,bristol,tn,us,sphere,300,5 minutes,My father is now 89 my brother 52 the girl wit...,2007-04-27,36.595,-82.188889
6,1965-10-10 21:00:00,penarth (uk/wales),,gb,circle,180,about 3 mins,penarth uk circle 3mins stayed 30ft above m...,2006-02-14,51.434722,-3.18
7,1965-10-10 23:45:00,norwalk,ct,us,disk,1200,20 minutes,A bright orange color changing to reddish colo...,1999-10-02,41.1175,-73.408333
8,1966-10-10 20:00:00,pell city,al,us,disk,180,3 minutes,Strobe Lighted disk shape object observed clos...,2009-03-19,33.5861111,-86.286111
9,1966-10-10 21:00:00,live oak,fl,us,disk,120,several minutes,Saucer zaps energy from powerline as my pregna...,2005-05-11,30.2947222,-82.984167


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80332 entries, 0 to 80331
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   datetime              80332 non-null  object 
 1   city                  80332 non-null  object 
 2   state                 74535 non-null  object 
 3   country               70662 non-null  object 
 4   shape                 78400 non-null  object 
 5   duration (seconds)    80332 non-null  object 
 6   duration (hours/min)  80332 non-null  object 
 7   comments              80317 non-null  object 
 8   date posted           80332 non-null  object 
 9   latitude              80332 non-null  object 
 10  longitude             80332 non-null  float64
dtypes: float64(1), object(10)
memory usage: 6.7+ MB


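#### Notes:
Ten of the eleven columns load as `object`. That includes `datetime` and `date posted`, which hold timestamps, and `duration (seconds)` and `latitude`, which should be numeric but appear to contain at least some non-numeric entries. Only `longitude` is parsed as `float64`.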
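
One way to see why a numeric-looking column such as `duration (seconds)` or `latitude` is stored as `object` is to list the entries that fail numeric parsing. The sketch below is only an illustration: the helper name `non_numeric_values` is hypothetical and the sample values are made up, not taken from the dataset.

```python
import pandas as pd


def non_numeric_values(series: pd.Series) -> pd.Series:
    """Return the non-null entries of `series` that cannot be parsed as numbers."""
    # Compare on stripped text so stray whitespace is not reported as an error.
    as_text = series.astype(str).str.strip()
    parsed = pd.to_numeric(as_text, errors="coerce")
    # Keep entries that were present in the input but became NaN after parsing.
    return series[parsed.isna() & series.notna()]


# Made-up sample for illustration only (NOT values from the dataset).
sample = pd.Series(["2700", "7200", "2`", "300", None], name="duration (seconds)")
bad = non_numeric_values(sample)
print(bad)
```

In this notebook the same helper could be called as `non_numeric_values(df["duration (seconds)"])` or `non_numeric_values(df["latitude"])` to decide how the offending entries should be handled during cleaning.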

In [None]:
df.describe(include="all")

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
count,80332,80332,74535,70662,78400,80332.0,80332,80317,80332,80332.0,80332.0
unique,69474,19900,67,5,29,705.0,8304,79997,317,23292.0,
top,2010-07-04 22:00:00,seattle,ca,us,light,300.0,5 minutes,Fireball,2009-12-12,47.6063889,
freq,36,525,9655,65114,16565,7070.0,4716,11,1510,481.0,
mean,,,,,,,,,,,-86.772885
std,,,,,,,,,,,39.697205
min,,,,,,,,,,,-176.658056
25%,,,,,,,,,,,-112.073333
50%,,,,,,,,,,,-87.903611
75%,,,,,,,,,,,-78.755


## 1. Dataset Dimensions: Exploring Key Categorical Columns

In [None]:
df.shape

(80332, 11)

**Notes**: 
The dataset contains 80,332 reported UFO sightings (rows) and 11 columns capturing time, location, shape, duration, and descriptive comments.

In [None]:
# Ten most frequently reported UFO/craft shapes
df["shape"].value_counts().head(10)

shape
light        16565
triangle      7865
circle        7608
fireball      6208
other         5649
unknown       5584
sphere        5387
disk          5213
oval          3733
formation     2457
Name: count, dtype: int64

**Notes**:
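The three most frequently reported shapes between 1949 and 2014 are "light", "triangle", and "circle". "light" alone accounts for 16,565 reports, more than twice as many as "triangle". The catch-all labels "other" and "unknown" also rank in the top ten.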

In [None]:
df["country"].value_counts()

country
us    65114
ca     3000
gb     1905
au      538
de      105
Name: count, dtype: int64

**Notes**:
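The United States (U.S.) accounts for 65,114 of the 70,662 rows with a recorded country, about 92%. Why is this? Possible explanations include greater access to reporting technology or greater public interest in observing and reporting UFO-related phenomena. Note that 9,670 rows have no country recorded, so these counts may understate some countries' totals.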
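
To put the country counts on a common scale, the shares can be computed with missing values kept as their own group. This is a minimal sketch: the helper name `country_shares` is hypothetical and the sample series is made up for illustration.

```python
import pandas as pd


def country_shares(countries: pd.Series) -> pd.Series:
    """Share of rows per country code, counting missing values as their own group."""
    return countries.value_counts(normalize=True, dropna=False)


# Made-up sample for illustration only (NOT the dataset's distribution).
sample = pd.Series(["us", "us", "us", "ca", None, "gb", "us", None])
shares = country_shares(sample)
print(shares.round(3))
```

In this notebook, `country_shares(df["country"])` would show how much of the dataset has no country at all, which matters when interpreting the U.S. share.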

In [None]:
# Top 10 states by number of reported sightings
df["state"].value_counts().head(10)

state
ca    9655
wa    4268
fl    4200
tx    3677
ny    3219
az    2689
il    2645
pa    2582
oh    2425
mi    2071
Name: count, dtype: int64

#### Initial Categorical Observations
- Most reports in this dataset come from the United States (65,114 of the 70,662 rows with a recorded country). This describes where reports were filed, not where sightings are most likely to occur.
- Among states, California stands out with 9,655 reports, more than twice as many as second-ranked Washington (4,268).
- A few shapes ("light", "triangle", and "circle") are reported far more often than the others.

In [None]:
df["datetime"].value_counts().head(10)

datetime
2010-07-04 22:00:00    36
2012-07-04 22:00:00    31
1999-11-16 19:00:00    27
2009-09-19 20:00:00    26
2011-07-04 22:00:00    25
2004-10-31 20:00:00    23
2010-07-04 21:00:00    23
2013-07-04 22:00:00    22
2012-07-04 22:30:00    21
1999-11-16 19:05:00    20
Name: count, dtype: int64

## 2. Missing Values Check

This section runs early data quality checks; full cleaning will occur in later phases.

In [None]:
# Identifying Missing Values
df.isna().sum()

datetime                   0
city                       0
state                   5797
country                 9670
shape                   1932
duration (seconds)         0
duration (hours/min)       0
comments                  15
date posted                0
latitude                   0
longitude                  0
dtype: int64

**Notes**: 
Four columns contain missing values: `country` (9,670), `state` (5,797), `shape` (1,932), and `comments` (15). The location fields are the most affected.
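
Raw counts are easier to compare as a percentage of all rows. The sketch below shows one possible summary: the helper name `missing_summary` is hypothetical, and the sample frame is made up for illustration rather than copied from the dataset.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, most affected first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up sample for illustration only.
sample = pd.DataFrame({
    "city": ["edna", "bristol", "norwalk", "kaneohe"],
    "state": ["tx", None, "ct", "hi"],
    "country": ["us", None, None, "us"],
})
report = missing_summary(sample)
print(report)
```

Calling `missing_summary(df)` in this notebook would give the same four columns as the count above, with their share of the 80,332 rows alongside.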

In [None]:
# Checking for duplicates
df.duplicated().sum()

np.int64(0)

## 3. Initial Observations and Guiding Questions

**A. Data Observations:**
- Most columns load as `object` (10 of 11), so type conversion will be necessary for time-based and numerical analysis.
- Missing values need to be addressed, particularly in `state` and `country`, to support accurate geographic comparisons.
- The duration data are inconsistently formatted (e.g., "1-2 hrs", "about 3 mins", "several minutes") and will need to be standardized before analysis.
- Reported UFO sightings are heavily concentrated in the U.S., especially California. This should be investigated to determine whether it reflects reporting bias.
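
The type conversion mentioned in the first observation could look roughly like the sketch below. It is an assumption about how a later cleaning phase might work, not the method used in this project: the helper name `convert_types` is hypothetical, the sample rows are made up, and `errors="coerce"` silently turns unparseable entries into `NaT`/`NaN`, so those rows should be inspected before relying on the result.

```python
import pandas as pd


def convert_types(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with timestamp and numeric columns converted from object dtype."""
    out = frame.copy()
    # Unparseable timestamps become NaT instead of raising an error.
    out["datetime"] = pd.to_datetime(out["datetime"], errors="coerce")
    # Unparseable numbers become NaN instead of raising an error.
    for column in ["duration (seconds)", "latitude"]:
        out[column] = pd.to_numeric(out[column], errors="coerce")
    return out


# Made-up sample for illustration only; the last row is deliberately malformed.
sample = pd.DataFrame({
    "datetime": ["1949-10-10 20:30:00", "1955-10-10 17:00:00", "not a date"],
    "duration (seconds)": ["2700", 20, "2`"],
    "latitude": ["29.8830556", "53.2", "33q.2"],
})
typed = convert_types(sample)
print(typed.dtypes)
```

Working on a copy keeps the raw `df` available for comparison, which makes it easy to count how many values were lost to coercion.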

**B. Guiding Questions:**

The following questions focus on temporal, geographic, and descriptive patterns:

1. How have reported UFO sightings changed over time (by year and decade), and are there observable global trends?
2. Are increases in sightings consistent across countries, or are they primarily driven by reports from the United States?
3. Are certain UFO shapes more commonly reported in specific geographic regions?
4. Do reported UFO shapes vary across different time periods or decades?
5. What temporal patterns exist in UFO sightings (e.g., time of day, seasonality)?
6. Are certain UFO shapes associated with longer or shorter reported sighting durations?
7. Does the concentration of sightings in the U.S. appear stable over time, or does it fluctuate across decades?
8. To what extent might reporting frequency be influenced by population density, access to technology, or cultural factors, as inferred through geographic patterns?
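
Questions 1, 2, and 7 all depend on counting sightings by time period and country. A minimal sketch of that tabulation follows, assuming the `datetime` column parses cleanly; the helper name `sightings_by_decade` and the `"missing"` label are hypothetical choices, and the small sample frame exists only to make the example self-contained.

```python
import pandas as pd


def sightings_by_decade(frame: pd.DataFrame) -> pd.DataFrame:
    """Cross-tabulate sightings by decade (rows) and country (columns)."""
    when = pd.to_datetime(frame["datetime"], errors="coerce")
    valid = when.notna()  # skip rows whose timestamp could not be parsed
    decade = (when[valid].dt.year // 10 * 10).rename("decade")
    # Keep rows without a country visible instead of silently dropping them.
    country = frame.loc[valid, "country"].fillna("missing")
    return pd.crosstab(decade, country)


# Small sample for illustration only.
sample = pd.DataFrame({
    "datetime": [
        "1949-10-10 20:30:00", "1949-10-10 21:00:00", "1955-10-10 17:00:00",
        "1956-10-10 21:00:00", "1965-10-10 21:00:00", "1965-10-10 23:45:00",
    ],
    "country": ["us", None, "gb", "us", "gb", "us"],
})
table = sightings_by_decade(sample)
print(table)

# Row-wise shares show how concentrated each decade is in a single country.
decade_shares = table.div(table.sum(axis=1), axis=0)
print(decade_shares.round(2))
```

Applied to `df`, the `us` column of `decade_shares` would give a first answer to whether the U.S. concentration is stable or changes across decades.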

**C. Extended & Future Questions Beyond the Current Dataset:**

The following questions fall outside the scope of the current dataset and are noted as potential long-term extensions requiring additional external data sources:

1. Are reported UFO sightings more common near military bases or government facilities?
2. Is there a geographic relationship between reported UFO sightings and reported cattle mutilation incidents?
3. Given that "UFO" is a broad classification, are there external datasets that could help refine categories beyond simple shape descriptions, such as UAPs (Unidentified Aerial Phenomena), USOs (Unidentified Submerged Objects), or interdimensional and other unidentified phenomena?
4. Can additional datasets help contextualize sightings using environmental, military, or atmospheric data to move beyond descriptive classifications?
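
If an external list of facility coordinates were available, question C.1 would come down to measuring the distance from each sighting to the nearest facility. The sketch below shows only the distance step, using the haversine great-circle formula with a mean Earth radius of 6,371 km. The function name `haversine_km` is hypothetical, and the two coordinate pairs are simply the first two rows shown by `df.head(10)` above, used here as example points rather than as a real facility list.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in decimal degrees.

    Accepts scalars or NumPy arrays, so it can be applied to whole columns.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Coordinates from the first two rows of the head() output above.
san_marcos = (29.8830556, -97.941111)
lackland = (29.38421, -98.581082)
distance = haversine_km(*san_marcos, *lackland)
print(f"{distance:.1f} km")
```

Because the function is vectorized, passing the cleaned `latitude` and `longitude` columns as arrays would return one distance per sighting for a given facility.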