# UAP Sightings

The sections in this notebook elaborate on steps 2-5 of the Cross-Industry Standard Process for Data Mining (CRISP-DM). Please refer to the repository for a complete explanation of _Business Understanding_ and _Deployment_.

# II. Data Understanding

The data used in this experiment come from the [National UFO Reporting Center](https://nuforc.org/). Understanding how this organization structures its archived reports is essential before any processing.

In [41]:
import seaborn as sns
import pandas as pd
import numpy as np

In [42]:
# Import raw data
data = pd.read_csv("Data/source_data.csv", low_memory=False)
print("Imported {} entries.".format(data.shape[0]))
data.head()

Imported 80332 entries.


Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611


As the cell below shows, numerical and temporal columns such as _datetime_, _duration (seconds)_ and _latitude_ are imported as _object_ type. This indicates that they must be converted to proper types before useful statistics can be computed.

In [43]:
# Data types
data.dtypes

datetime                 object
city                     object
state                    object
country                  object
shape                    object
duration (seconds)       object
duration (hours/min)     object
comments                 object
date posted              object
latitude                 object
longitude               float64
dtype: object
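
The kind of conversion these dtypes call for can be sketched on a toy frame. This is a minimal illustration, not the notebook's actual pipeline: the sample values (including the malformed `"8x"`) and the names `sample` and `converted` are made up for the example. `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into `NaN`, so the number of bad values can be counted before deciding how to handle them.

```python
import pandas as pd

# Toy data (hypothetical values): numbers stored as strings, one malformed
sample = pd.DataFrame({
    "duration (seconds)": ["2700", "7200", "20", "8x"],
    "latitude": ["29.8830556", "29.38421", "53.2", "28.9783333"],
})

# Columns built from strings are not numeric yet
print(sample.dtypes)

# errors="coerce" replaces unparseable strings with NaN instead of raising
converted = sample.apply(pd.to_numeric, errors="coerce")

# Count the values that could not be converted, per column
print(converted.isna().sum())
print(converted.dtypes)
```

Counting the coerced values first makes it possible to inspect the offending rows before dropping or repairing them.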

Another important aspect to explore is the consistency of the categorical data, mainly the columns associated with locations.

In [44]:
data["country"].unique()

array(['us', nan, 'gb', 'ca', 'au', 'de'], dtype=object)
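
Listing the unique labels shows which values occur, but not how often, nor whether near-duplicates hide behind differences in case or whitespace. A minimal sketch of such a consistency check follows. The toy values and the names `country` and `normalized` are made up for the example, and nothing here implies that the real column contains such duplicates.

```python
import pandas as pd

# Toy column (hypothetical values) with inconsistent spelling and a missing entry
country = pd.Series(["us", "US", " gb", "ca", None, "us"])

# Frequency of each raw label; dropna=False keeps missing values visible
print(country.value_counts(dropna=False))

# Normalize whitespace and case, then compare the number of distinct labels
normalized = country.str.strip().str.lower()
print(country.nunique(), "raw labels ->", normalized.nunique(), "after normalization")
```

If the two counts differ, some labels are spelled inconsistently and should be normalized before grouping.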

The output of `unique()` shows that the labeled entries come from just five countries: the United States, Great Britain, Canada, Australia and Germany. The `nan` value marks entries with no country at all. Next, entries without a country indication are explored.

In [45]:
data[data["country"].isna()]

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
18,10/10/1973 23:00,bermuda nas,,,light,20,20 sec.,saw fast moving blip on the radar scope thin w...,1/11/2002,32.364167,-64.678611
29,10/10/1979 22:00,saddle lake (canada),ab,,triangle,270,4.5 or more min.,Lights far above&#44 that glance; then flee f...,1/19/2005,53.970571,-111.689885
35,10/10/1982 07:00,gisborne (new zealand),,,disk,120,2min,gisborne nz 1982 wainui beach to sponge bay,1/11/2002,-38.662334,178.017649
40,10/10/1986 20:00,holmes/pawling,ny,,chevron,180,3 minutes,Football Field Sized Chevron with bright white...,10/8/2007,41.523427,-73.646795
...,...,...,...,...,...,...,...,...,...,...,...
80238,9/9/2009 14:15,broomfield?lafayette,co,,rectangle,120,2 min,Large&#44 rectangular object seen flying in br...,12/12/2009,39.993596,-105.089706
80244,9/9/2009 20:17,lyman,me,,light,600,10 mins,Two lights ran across the sky&#44 as bright as...,12/12/2009,43.505096,-70.637968
80319,9/9/2013 20:15,clifton,nj,,other,3600,~1hr+,Luminous line seen in New Jersey sky.,9/30/2013,40.858433,-74.163755
80322,9/9/2013 21:00,aleksandrow (poland),,,light,15,15 seconds,Two points of light following one another in a...,9/30/2013,50.465843,22.891814


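As the rows above show, entries without a country indication may still contain useful, though poorly structured, geographic information: a state code such as `tx` or `ny`, or a country name in parentheses in the _city_ field, as in `gisborne (new zealand)`. The next question is what proportion of the data falls into this category.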

In [46]:
no_country = data["country"].isna().sum()
total = data.shape[0]

print("There are {} entries without a country association. This represents {:.2f}% of the original dataset.".format(no_country, no_country * 100 / total))

There are 9670 entries without a country association. This represents 12.04% of the original dataset.
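
As one illustration of the poorly structured geographic information in these entries, some _city_ values carry a place name in parentheses. The sketch below pulls that hint out with a regular expression. It assumes the hint always appears as a trailing parenthesized group, which the source does not confirm, and the names `hint` and `clean_city` are introduced only for this example.

```python
import pandas as pd

# City values copied from the rows shown earlier
city = pd.Series([
    "lackland afb",
    "saddle lake (canada)",
    "gisborne (new zealand)",
    "aleksandrow (poland)",
])

# Capture the text inside a trailing pair of parentheses, if any
hint = city.str.extract(r"\(([^)]+)\)\s*$", expand=False)

# Remove the parenthesized part to keep a clean city name
clean_city = city.str.replace(r"\s*\([^)]*\)\s*$", "", regex=True)

print(pd.DataFrame({"city": clean_city, "hint": hint}))
```

The extracted hints are free text, not country codes, so values such as `uk/england` would still need to be mapped to the codes used in the _country_ column.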


At roughly 12%, the share of entries without _country_ information is significant, so it may be worthwhile to restore this field from the other information. As a last step, it is useful to know which columns have missing values.

In [47]:
# detect missing values by column
data.isna().sum()

datetime                   0
city                       0
state                   5797
country                 9670
shape                   1932
duration (seconds)         0
duration (hours/min)       0
comments                  15
date posted                0
latitude                   0
longitude                  0
dtype: int64
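
Absolute counts are easier to compare across columns when expressed as a share of all rows. A minimal sketch on a toy frame follows; the values and the names `toy` and `missing_share` are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values) with gaps in two columns
toy = pd.DataFrame({
    "state": ["tx", None, "ny", None],
    "country": ["us", None, None, None],
    "shape": ["light", "disk", "circle", "other"],
})

# isna().mean() gives the fraction of missing values per column
missing_share = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_share.round(2))
```

Applied to the real frame, the same expression would rank the columns by how incomplete they are.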

# III. Data Preparation

The first field to be processed is _datetime_, which could reveal temporal patterns in sightings. It is easier to work with when date and time are stored as independent fields.

In [48]:
# Split "month/day/year hour:minute" strings into separate date and time columns
data["date"] = data["datetime"].str.split(" ").str[0]
data["time"] = data["datetime"].str.split(" ").str[1]
# The combined column is no longer needed
data = data.drop(columns=["datetime"])
data.head()

Unnamed: 0,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude,date,time
0,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111,10/10/1949,20:30
1,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082,10/10/1949,21:00
2,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667,10/10/1955,17:00
3,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833,10/10/1956,21:00
4,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611,10/10/1960,20:00
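
Once date and time are separate, they can be turned into features for pattern discovery. The sketch below is one possible approach, not the notebook's method: it assumes the dates are in month/day/year order, which is consistent with values such as `4/27/2004` in _date posted_, and the column names `year`, `month` and `hour` are introduced only for this example.

```python
import pandas as pd

# Date and time strings copied from the rows shown above
toy = pd.DataFrame({
    "date": ["10/10/1949", "10/10/1955", "10/10/1960"],
    "time": ["20:30", "17:00", "20:00"],
})

# Parse the month/day/year strings; unparseable values would become NaT
parsed = pd.to_datetime(toy["date"], format="%m/%d/%Y", errors="coerce")
toy["year"] = parsed.dt.year
toy["month"] = parsed.dt.month

# The hour is the part of the time string before the colon
toy["hour"] = toy["time"].str.split(":").str[0].astype(int)

# Example pattern: number of sightings per hour of the day
print(toy.groupby("hour").size())
```

Keeping the hour as a plain integer avoids depending on every time string being a valid clock time.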


Since all entries have _longitude_ and _latitude_ values, these fields will be used to restore the _state_ and _country_ information. Because _latitude_ was imported as _object_, it is first converted to a numeric type.

In [50]:
# data type conversion for latitude
data["latitude"] = pd.to_numeric(data["latitude"], errors="coerce")
invalid = data["latitude"].isna().sum()
data = data.dropna(subset=["latitude"])
print("A total of {} invalid latitude values were dropped.".format(invalid))

A total of 0 invalid latitude values were dropped.
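
The source does not say how _country_ will be restored from the coordinates. One simple possibility, sketched here purely as an assumption, is to copy the label of the nearest entry that already has a country, using the haversine great-circle distance. The helper `haversine_km` and the column `country_restored` are hypothetical names introduced for this example.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Coordinates and labels copied from rows shown earlier in the notebook
toy = pd.DataFrame({
    "country": ["us", None, "gb", "us"],
    "latitude": [29.8830556, 29.38421, 53.2, 21.4180556],
    "longitude": [-97.941111, -98.581082, -2.916667, -157.803611],
})

labeled = toy[toy["country"].notna()]
restored = toy["country"].copy()

for idx, row in toy[toy["country"].isna()].iterrows():
    # Distance from this unlabeled point to every labeled point
    dist = haversine_km(row["latitude"], row["longitude"],
                        labeled["latitude"].to_numpy(),
                        labeled["longitude"].to_numpy())
    # Copy the label of the closest labeled entry (assumed heuristic)
    restored.loc[idx] = labeled["country"].iloc[int(np.argmin(dist))]

toy["country_restored"] = restored
print(toy)
```

A nearest-neighbour rule can mislabel sightings close to a border or in countries absent from the labeled set, so boundary data or a reverse-geocoding step would be more reliable for the full dataset.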
