# UFO Sightings Analysis
## 📄 About the Dataset
This dataset contains over **80,000 reports** of UFO (Unidentified Flying Object) sightings spanning the **last century**, sourced from the **National UFO Reporting Center (NUFORC)**.  
It provides a rich source of information to explore **spatial**, **temporal**, and **descriptive** patterns in UFO reports across various regions.

## Importing Libraries

In [92]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings 
warnings.filterwarnings('ignore')

## Load the Dataset

In [93]:
df_ = pd.read_csv("scrubbed_ufo.csv")
df_.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611


In [94]:
df.shape

(88674, 12)

In [95]:
df.isna().sum()

datetime              0
city                  0
state                 0
country               0
shape                 0
duration (seconds)    0
comments              0
date posted           0
latitude              0
longitude             0
year                  0
month                 0
dtype: int64

In [96]:
total_rows = len(df_)
missing_percent = df_.isnull().sum() * 100 / total_rows

# Print each column's missing percentage up to 5 decimal places
for column, percent in missing_percent.items():
    print(f"{column}: {percent:.2f}%")

datetime: 0.00%
city: 0.00%
state: 7.22%
country: 12.04%
shape: 2.41%
duration (seconds): 0.00%
duration (hours/min): 0.00%
comments: 0.02%
date posted: 0.00%
latitude: 0.00%
longitude : 0.00%


In [97]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 88674 entries, 0 to 88678
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   datetime            88674 non-null  datetime64[ns]
 1   city                88674 non-null  object        
 2   state               88674 non-null  object        
 3   country             88674 non-null  object        
 4   shape               88674 non-null  object        
 5   duration (seconds)  88674 non-null  float64       
 6   comments            88674 non-null  object        
 7   date posted         88674 non-null  datetime64[ns]
 8   latitude            88674 non-null  float64       
 9   longitude           88674 non-null  float64       
 10  year                88674 non-null  int32         
 11  month               88674 non-null  int32         
dtypes: datetime64[ns](2), float64(3), int32(2), object(5)
memory usage: 8.1+ MB


**Problem:**  
Most columns in the dataset, including dates, durations, and geographic coordinates, are currently stored as **object (string) data types**. This prevents efficient data analysis, numerical computations, and time-based operations.

**Solution:**  
- Convert `datetime` and `date posted` columns to proper **datetime** objects to enable time-series analysis.  
- Convert `duration (seconds)` and `latitude` columns to **numeric** types to allow mathematical operations and filtering.  
- Handle any invalid or malformed data by coercing errors to `NaN`, which can then be addressed through cleaning or imputation.  
- These conversions are essential for accurate and meaningful exploratory data analysis (EDA) and visualization.

## Convert datetime columns to datetime dtype

In [98]:
# Convert datetime columns to datetime dtype
df['datetime'] = pd.to_datetime(df_['datetime'], errors='coerce')
df['date posted'] = pd.to_datetime(df_['date posted'], errors='coerce')

In [99]:
# Convert latitude and longitude to numeric
df['latitude'] = pd.to_numeric(df['latitude'], errors='coerce')
df['longitude'] = pd.to_numeric(df['longitude'], errors='coerce')
df['duration (seconds)'] = pd.to_numeric(df['duration (seconds)'], errors='coerce')

In [100]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 88674 entries, 0 to 88678
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   datetime            79633 non-null  datetime64[ns]
 1   city                88674 non-null  object        
 2   state               88674 non-null  object        
 3   country             88674 non-null  object        
 4   shape               88674 non-null  object        
 5   duration (seconds)  88674 non-null  float64       
 6   comments            88674 non-null  object        
 7   date posted         80327 non-null  datetime64[ns]
 8   latitude            88674 non-null  float64       
 9   longitude           88674 non-null  float64       
 10  year                88674 non-null  int32         
 11  month               88674 non-null  int32         
dtypes: datetime64[ns](2), float64(3), int32(2), object(5)
memory usage: 8.1+ MB


### ⏱️ Duration Column Cleanup

- The dataset includes two duration fields: `duration (seconds)` and `duration (hours/min)`.
- `duration (seconds)` is a clean, standardized numeric field indicating how long the sighting lasted.
- `duration (hours/min)` is a free-text field with inconsistent formats (e.g., "1-2 hrs", "45 minutes", "1/2 hour"), making it unreliable for analysis.
- We converted `duration (seconds)` to numeric and dropped `duration (hours/min)` to ensure clean, consistent data.


In [101]:
df.drop(columns=['duration (hours/min)'], inplace=True)

KeyError: "['duration (hours/min)'] not found in axis"

In [None]:
# Sort the dataset features to numerical and categorical categories.
numerical_cols = []
categorical_cols = []
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
            numerical_cols.append(col)
    else:
        categorical_cols.append(col)

print("numerical Columns: ")
print(numerical_cols)

print("categorical Columns: ")
print(categorical_cols)

In [None]:
df.describe().T

In [None]:
# print the percentage of missing values for instances.
total = df.isnull().sum().sort_values(ascending = False)[df.isnull().sum().sort_values(ascending = False) != 0]
percent = ((df.isnull().sum() / df.isnull().count()).sort_values(ascending = False)[(df.isnull().sum() / df.isnull().count()).sort_values(ascending = False) != 0])
missing = pd.concat([total, percent], axis = 1, keys = ['Total', 'Percent'])
print(missing)

# Data Cleaning using simpleImputer 

### 🛠️ Imputation

- Use subject-matter expertise to replace missing data with educated guesses  
- Common to use the mean  
- Can also use the median, or another value  
- For categorical values, we typically use the most frequent value — the mode


In [None]:
from sklearn.impute import SimpleImputer

# Impute missing values in categorical columns using the most frequent value
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
df[['datetime', 'state', 'country',"date posted"]] = imputer.fit_transform(
    df[['datetime', 'state', 'country', "date posted"]]
)

# Drop rows where latitude or duration (seconds) is missing
df.dropna(subset=['latitude', 'duration (seconds)', "shape",'comments'], inplace=True)

# Preview cleaned DataFrame
df.head()

In [None]:
df.isna().sum()

### 📉 Missing Data Summary & Strategy

- `country`, `datetime`, `date posted`, and `state` have moderate missing values (~8–14%) and were imputed using the **most frequent value**.
- `duration (seconds)` , `latitude` , `shape` and `comments` had negligible missing values (<5%) and so we dropped the values.

### 🌠 Inspiration

This dataset opens up a wide range of fascinating questions for exploration:

- What states of the country are most likely to have UFO sightings?

- Are there any trends in UFO sightings over time?  

-  What are the most common descriptions or shapes of reported UFOs?

## What countries have the most UFO sightings?

In [None]:
df['country'].value_counts()

In [None]:
top_countries = df['country'].value_counts()
plt.figure(figsize=(10, 6))
sns.barplot(x=top_countries.values, y=top_countries.index, palette='viridis')
plt.title("Top 10 countries with Most UFO Sightings")
plt.xlabel("Number of Sightings")
plt.ylabel("country")
plt.tight_layout()
plt.show()

## What states have the most UFO sightings?
for top 10 states

In [None]:
df['state'].value_counts()

In [None]:
top_states = df['state'].value_counts().head(10)
plt.figure(figsize=(10, 6))
sns.barplot(x=top_states.values, y=top_states.index, palette='viridis')
plt.title("Top 10 States with Most UFO Sightings")
plt.xlabel("Number of Sightings")
plt.ylabel("State")
plt.tight_layout()
plt.show()

## Are there any trends in UFO sightings over time? Do they tend to be clustered or seasonal?

In [None]:
# Extract year and month
df['year'] = df['datetime'].dt.year
df['month'] = df['datetime'].dt.month


## yearly trend

In [None]:
# Count sightings per year
sightings_by_year = df['year'].value_counts().sort_index()

plt.figure(figsize=(12,6))
plt.plot(sightings_by_year.index, sightings_by_year.values, marker='o')
plt.title('UFO Sightings Over the Years')
plt.xlabel('Year')
plt.ylabel('Number of Sightings')
plt.grid(True)
plt.tight_layout()
plt.show()

## 📊 Post-1990 UFO Sightings Trend

It can be observed that after approximately **1990**, the number of UFO sightings began to **increase dramatically**, likely due to the rise of internet usage, media coverage, and easier reporting.

However, around **2010**, there is a noticeable **plummet** in sightings. This might be due to a change in reporting behavior, societal interest, or data collection methods.

Let’s visualize this in more detail using a barplot of yearly sightings from 1990 onward.


In [None]:
# Filter sightings from the year 2000 and onwards
sightings_2000_onward = sightings_by_year[sightings_by_year.index >= 1990]

# Bar plot for sightings from 2000 onwards
plt.figure(figsize=(12, 6))
sns.barplot(x=sightings_2000_onward.index, y=sightings_2000_onward.values, palette='viridis')
plt.title('UFO Sightings from 1990 Onward')
plt.xlabel('Year')
plt.ylabel('Number of Sightings')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## seasonal trend (monthly aggregated across years)

In [None]:
# Count sightings per month
sightings_by_month = df['month'].value_counts().sort_index()

plt.figure(figsize=(10,5))
sns.barplot(x=sightings_by_month.index, y=sightings_by_month.values, palette='coolwarm')
plt.title('Seasonal Trend of UFO Sightings (All Years Combined)')
plt.xlabel('Month')
plt.ylabel('Number of Sightings')
plt.xticks(ticks=range(12), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.tight_layout()
plt.show()


## Seasonal Trend Observation

It can be seen that during **July**, the number of UFO sightings **skyrocketed** compared to other months. This suggests a possible seasonal effect, where sightings peak in mid-summer. Further analysis could explore whether this trend repeats every year or is influenced by specific events or environmental factors.


## 👽 Most Common UFO Descriptions

In [None]:
df['shape'].value_counts()

In [None]:
# Count top 10 most common UFO shapes
top_shapes = df['shape'].value_counts().head(10)

# Plot the results
plt.figure(figsize=(10, 6))
sns.barplot(x=top_shapes.values, y=top_shapes.index, palette='magma')
plt.title('🔺 Top 10 Most Common UFO Descriptions (Shapes)')
plt.xlabel('Number of Sightings')
plt.ylabel('UFO Shape')
plt.tight_layout()
plt.show()


After analyzing the `shape` column in the dataset, it is evident that the most commonly reported shape of UFOs is:

### **Light**