
# UFO Sighting Analysis

A small test notebook to look at the time series and geospatial distribution of UFO sightings as recorded in this __[Kaggle dataset](https://www.kaggle.com/datasets/jonwright13/ufo-sightings-around-the-world-better)__.

In [88]:
import numpy as np
import pandas as pd
import seaborn as sns
import folium
from folium import plugins
import matplotlib.pyplot as plt
import os, sys
import sklearn as sk
import sweetviz as sv
import dtale
import geopandas 
from ydata_profiling import ProfileReport
sns.set()

Load the data file into a dataframe and sample a few rows to get an example of the data structure

In [89]:
df = pd.read_csv(os.getcwd()+"/ufo-sightings-transformed.csv").drop(columns=["Unnamed: 0"])
df.sample(10)


Unnamed: 0,Date_time,date_documented,Year,Month,Hour,Season,Country_Code,Country,Region,Locale,latitude,longitude,UFO_shape,length_of_encounter_seconds,Encounter_Duration,Description
9864,1999-11-16 19:10:00,12/2/2000,1999,11,19,Autumn,CAN,Canada,Ontario,Toronto,43.666667,-79.416667,,20.0,20 sec,Not believed to be a U.F.O. but was definately...
78031,2007-09-27 22:00:00,11/28/2007,2007,9,22,Autumn,USA,United States,Arizona,Tempe,33.414722,-111.908611,Triangle,120.0,2 minutes,((HOAX??)) Large triangle shaped object with ...
72269,1967-08-08 16:00:00,11/28/2007,1967,8,16,Summer,USA,United States,Colorado,Alamosa,37.469444,-105.869444,Light,1200.0,20 minutes,Alamosa UFO disappears after flash of blinding...
22012,2008-12-03 19:15:00,1/10/2009,2008,12,19,Winter,USA,United States,Oregon,Lebanon,44.536667,-122.905833,Circle,120.0,1 1/2 minutes,Bright&#44 amber object in the sky with brilli...
34182,2006-03-30 01:00:00,10/31/2008,2006,3,1,Spring,AUS,Australia,Western Australia,Perth,-31.95224,115.861397,Oval,300.0,5 minutes approx,Large orange object possibly 50+ Metres across
20605,2013-12-25 00:30:00,1/10/2014,2013,12,0,Winter,USA,United States,California,Los Angeles,34.052222,-118.242778,Unknown,600.0,10+ minutes,Irregular flashing light near the star Betelge...
28275,2011-02-23 07:15:00,3/13/2012,2011,2,7,Winter,USA,United States,New York,Central Valley,41.331667,-74.121389,Diamond,60.0,1 minute,Saw several diamond orange lights flashing in ...
32195,2006-03-18 19:00:00,10/31/2008,2006,3,19,Spring,USA,United States,Oregon,Junction City,44.219444,-123.204444,Other,2700.0,45 mins,perfect silver ball floating
64199,1999-08-10 22:30:00,8/30/1999,1999,8,22,Summer,USA,United States,Colorado,Denver,39.739167,-104.984167,Other,300.0,five minutes,I saw a very strange looking objecet in the no...
25837,1996-02-01 06:07:00,11/2/1999,1996,2,6,Winter,USA,United States,Washington,Woodinville Junction,47.754444,-122.162222,,60.0,1 minute,Man reports seeing obj. in apparent polar orbi...


While it's good to have a quick overview of the data like the one above, we'd like a bit more insight into the utility of the different features we have to work with. One aspect of that is how much of the available data is composed of null values. For that we can run the `isnull` method and sum the occurrences in each column

In [90]:
df.isnull().sum()

Date_time                         0
date_documented                   0
Year                              0
Month                             0
Hour                              0
Season                            0
Country_Code                    259
Country                         259
Region                          566
Locale                          457
latitude                          0
longitude                         0
UFO_shape                      1930
length_of_encounter_seconds       0
Encounter_Duration                0
Description                      15
dtype: int64

So we see that the numeric and date columns have no null values, while several of the categorical and text columns contain a few null entries. It's helpful to gauge just how frequent these are though, so we can look at them as a fraction of the total number of entries

In [91]:
df.isnull().sum()/len(df)

Date_time                      0.000000
date_documented                0.000000
Year                           0.000000
Month                          0.000000
Hour                           0.000000
Season                         0.000000
Country_Code                   0.003224
Country                        0.003224
Region                         0.007046
Locale                         0.005689
latitude                       0.000000
longitude                      0.000000
UFO_shape                      0.024026
length_of_encounter_seconds    0.000000
Encounter_Duration             0.000000
Description                    0.000187
dtype: float64

We see that for the features that do have null entries, these make up less than 1% of the rows in every column except `UFO_shape` (about 2.4%), so we can assume they'll have negligible impact on any analysis we run
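The same null share can be computed in one step, since the column-wise mean of a boolean mask is the fraction of `True` values. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset) and reports percentages for only those columns that contain nulls.

```python
import pandas as pd

# Hypothetical toy frame standing in for the UFO dataframe
toy = pd.DataFrame({
    "Country": ["USA", None, "CAN", "USA"],
    "UFO_shape": ["Light", None, None, "Circle"],
    "latitude": [29.9, 53.2, 43.7, 21.4],
})

# isna() gives a boolean mask; its column-wise mean is the null fraction
null_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually contain nulls
null_pct = null_pct[null_pct > 0]
print(null_pct.round(2))
```

Applied to the real dataframe, the same expression would be `df.isna().mean().mul(100)`.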

Further, we can look at some brief statistical properties of the dataset

In [92]:
df.describe()

Unnamed: 0,Year,Month,Hour,latitude,longitude,length_of_encounter_seconds
count,80328.0,80328.0,80328.0,80328.0,80328.0,80328.0
mean,2003.850463,6.835026,15.525172,38.124963,-86.772015,9017.336
std,10.426547,3.234876,7.75375,10.469146,39.697805,620232.2
min,1906.0,1.0,0.0,-82.862752,-176.658056,0.001
25%,2001.0,4.0,10.0,34.134722,-112.073333,30.0
50%,2006.0,7.0,19.0,39.4125,-87.903611,180.0
75%,2011.0,9.0,21.0,42.788333,-78.755,600.0
max,2014.0,12.0,23.0,72.7,178.4419,97836000.0


As a final check, it is worth inspecting the datatype of each column and assigning more specific datatypes where that makes plotting easier

In [93]:
df.dtypes

Date_time                       object
date_documented                 object
Year                             int64
Month                            int64
Hour                             int64
Season                          object
Country_Code                    object
Country                         object
Region                          object
Locale                          object
latitude                       float64
longitude                      float64
UFO_shape                       object
length_of_encounter_seconds    float64
Encounter_Duration              object
Description                     object
dtype: object

In [94]:
types = {
    "Season":"category",
    "Country":"category",
    "Region":"category",
    "UFO_shape":"category",
    "Description":"string"
}

df = df.astype(types)
df.head(5)

Unnamed: 0,Date_time,date_documented,Year,Month,Hour,Season,Country_Code,Country,Region,Locale,latitude,longitude,UFO_shape,length_of_encounter_seconds,Encounter_Duration,Description
0,1949-10-10 20:30:00,4/27/2004,1949,10,20,Autumn,USA,United States,Texas,San Marcos,29.883056,-97.941111,Cylinder,2700.0,45 minutes,This event took place in early fall around 194...
1,1949-10-10 21:00:00,12/16/2005,1949,10,21,Autumn,USA,United States,Texas,Bexar County,29.38421,-98.581082,Light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...
2,1955-10-10 17:00:00,1/21/2008,1955,10,17,Autumn,GBR,United Kingdom,England,Chester,53.2,-2.916667,Circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...
3,1956-10-10 21:00:00,1/17/2004,1956,10,21,Autumn,USA,United States,Texas,Edna,28.978333,-96.645833,Circle,20.0,1/2 hour,My older brother and twin sister were leaving ...
4,1960-10-10 20:00:00,1/22/2004,1960,10,20,Autumn,USA,United States,Hawaii,Kaneohe,21.418056,-157.803611,Light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...
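
The cast above leaves `Date_time` and `date_documented` as `object` columns. For time series plots it may help to parse them as datetimes. A minimal sketch on a made-up two-row frame follows; the `toy` frame and the `report_delay_days` column are hypothetical, and the date formats are assumed from the sample rows shown above.

```python
import pandas as pd

# Hypothetical two-row frame mimicking the two date columns
toy = pd.DataFrame({
    "Date_time": ["1949-10-10 20:30:00", "1955-10-10 17:00:00"],
    "date_documented": ["4/27/2004", "1/21/2008"],
})

# Parse with explicit formats (assumed: ISO-like timestamps and month/day/year)
toy["Date_time"] = pd.to_datetime(toy["Date_time"], format="%Y-%m-%d %H:%M:%S")
toy["date_documented"] = pd.to_datetime(toy["date_documented"], format="%m/%d/%Y")

# Whole days between the sighting and the date it was documented
toy["report_delay_days"] = (toy["date_documented"] - toy["Date_time"]).dt.days
print(toy.dtypes)
```

With datetime columns in place, accessors such as `.dt.year` or resampling by month become available without relying on the precomputed `Year` and `Month` columns.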


With some of the initial data cleaning and overview done, we can look at some visualisations of the data itself in order to inspect any relations

In [138]:
geometry = geopandas.points_from_xy(df.longitude, df.latitude)
geo_df = geopandas.GeoDataFrame(
    df[["Year", "Month", "Season", "Country", "Region", "UFO_shape", "latitude", "longitude"]], geometry=geometry
)


geo_overlay = geopandas.read_file("countries.geojson")
geo_overlay.loc[30:40,:]

Unnamed: 0,ADMIN,ISO_A3,ISO_A2,geometry
30,Belize,BLZ,BZ,"MULTIPOLYGON (((-87.80370 17.31599, -87.80358 ..."
31,Bermuda,BMU,BM,"MULTIPOLYGON (((-64.81197 32.28994, -64.80770 ..."
32,Bolivia,BOL,BO,"MULTIPOLYGON (((-65.29247 -11.50472, -65.25756..."
33,Brazil,BRA,BR,"MULTIPOLYGON (((-48.54259 -27.81666, -48.55187..."
34,Barbados,BRB,BB,"MULTIPOLYGON (((-59.42691 13.16039, -59.43004 ..."
35,Brunei,BRN,BN,"MULTIPOLYGON (((115.13453 4.90884, 115.14584 4..."
36,Bhutan,BTN,BT,"MULTIPOLYGON (((90.26180 28.33535, 90.26180 28..."
37,Botswana,BWA,BW,"MULTIPOLYGON (((25.25978 -17.79411, 25.21937 -..."
38,Central African Republic,CAF,CF,"MULTIPOLYGON (((22.55576 10.97897, 22.57705 10..."
39,Canada,CAN,CA,"MULTIPOLYGON (((-65.61059 43.42817, -65.62881 ..."


With the geometry feature included, both objects above are *GeoDataFrames*. For some of the GIS plotting, however, it will be useful to have a simple count of the number of events per country.
So, by dropping the geometry data, the sightings geodataframe becomes a standard dataframe, and we can get the count of all reported sightings per country

In [126]:
# Dropping the geometry column leaves a plain table of attributes
df = geo_df.drop(columns=['geometry'])

# Number of reported sightings per country
counts = df['Country'].value_counts()

# Move the country names from the index into a regular column
counts = counts.to_frame()
counts['Country'] = counts.index
counts = counts.reset_index(drop=True)
counts = counts.sort_values(by=['Country'])


counts.head(40)

Unnamed: 0,count,Country
56,11,Afghanistan
123,2,Albania
131,1,Algeria
35,23,Argentina
103,3,Armenia
3,630,Australia
83,6,Austria
104,3,Azerbaijan
105,3,Bahrain
58,11,Bangladesh
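
As a hedged alternative, the same count table can be built in a single chain. The sketch below runs on a made-up series (`sightings` and `counts_toy` are hypothetical names) and names the count column explicitly, so it does not depend on how a given pandas version labels the result of `value_counts`.

```python
import pandas as pd

# Hypothetical country column standing in for df['Country']
sightings = pd.Series(["USA", "Canada", "USA", "Australia"], name="Country")

counts_toy = (
    sightings.value_counts()            # sightings per country
    .rename_axis("Country")             # label the index before resetting it
    .reset_index(name="count")          # index -> 'Country' column, values -> 'count'
    .sort_values("Country")             # alphabetical order, as in the table above
    .reset_index(drop=True)
)
print(counts_toy)
```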


So now we have a GeoJSON layer onto which we can plot relevant data, and the GeoDataFrame containing the relevant info regarding the sightings, along with the per-country counts derived from it. However, we should check whether a sizable number of countries in the counts are not accounted for in the GeoJSON, or whether there is a mismatch between the names
To do so, we can compare the country columns of both objects and look for differences

In [97]:
np.setdiff1d(counts.Country, geo_overlay.ADMIN)

array(['Czechia', 'Eswatini', 'North Macedonia',
       'Palestinian Territories',
       'Saint Helena, Ascension and Tristan da Cunha', 'Serbia',
       'São Tomé and Príncipe', 'Tanzania', 'United States'], dtype=object)
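
`np.setdiff1d` only reports names missing in one direction. As an alternative sketch, an outer merge with `indicator=True` lists the unmatched names on both sides at once, which makes it easier to spot likely pairs. The two small frames below are made up for illustration (`counts_toy` and `overlay_toy` are hypothetical stand-ins for `counts` and `geo_overlay`).

```python
import pandas as pd

# Hypothetical stand-ins for the counts dataframe and the GeoJSON overlay
counts_toy = pd.DataFrame({
    "Country": ["Canada", "Czechia", "United States"],
    "count": [3, 1, 5],
})
overlay_toy = pd.DataFrame({
    "ADMIN": ["Canada", "Czech Republic", "United States of America"],
})

# The _merge column records whether a row matched on both sides or only one
merged = counts_toy.merge(
    overlay_toy, left_on="Country", right_on="ADMIN", how="outer", indicator=True
)

only_in_counts = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()
only_in_overlay = merged.loc[merged["_merge"] == "right_only", "ADMIN"].tolist()
print(sorted(only_in_counts))
print(sorted(only_in_overlay))
```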

So we can see that there are a few countries in the counts dataframe that appear not to be in the geo_overlay geodf. If we look at the entries in the geo_overlay geodf with similar names, we see that most of these countries actually are included, just under different names, e.g. "United States" in the counts df and "United States of America" in the geodf
Since the number of differences is small, we can quickly replace the names in the counts df with those used in the geo_overlay geodf
Then, once all the entries with corresponding countries are remapped, the remaining entries that have no match can be dropped

In [136]:
# Map names used in the counts dataframe to the names used in geo_overlay.ADMIN
countries = {
    "Czechia": "Czech Republic",
    "North Macedonia": "Macedonia",
    "Palestinian Territories": "Palestine",
    "Saint Helena, Ascension and Tristan da Cunha": "Saint Helena",
    "Serbia": "Republic of Serbia",
    "Tanzania": "United Republic of Tanzania",
    "United States": "United States of America"
}

counts = counts.replace({"Country":countries})

# Drop any countries still unmatched after the remapping
# (e.g. 'Eswatini' and 'São Tomé and Príncipe', which the mapping does not cover)
counts = counts[counts.Country.isin(geo_overlay.ADMIN)]

# Confirm that no unmatched names remain
np.setdiff1d(counts.Country, geo_overlay.ADMIN)

array([], dtype=object)