Data source: https://www.kaggle.com/NUFORC/ufo-sightings#complete.csv

Context
This dataset contains over 80,000 reports of UFO sightings over the last century.

Content
There are two versions of this dataset: scrubbed and complete. The complete data includes entries where the location of the sighting was blank or could not be found (0.8146%) or where the time was erroneous or blank (8.0237%). Since the reports date back to the 20th century, some older data might be obscured. Each record contains the city, state, time, description, and duration of the sighting.

Inspiration
What areas of the country are most likely to have UFO sightings?
Are there any trends in UFO sightings over time? Do they tend to be clustered or seasonal?
Do clusters of UFO sightings correlate with landmarks, such as airports or government research centers?
What are the most common UFO descriptions?

Acknowledgement
This dataset was scraped, geolocated, and time-standardized from NUFORC data by Sigmond Axel.

In [1]:
import pandas as pd
import time
from datetime import date

data = pd.read_csv('./UFO.csv')


In [2]:
data.shape

(88875, 14)

In [3]:
data.columns

Index(['datetime', 'city', 'state', 'country', 'shape', 'duration (seconds)',
       'duration (hours/min)', 'comments', 'date posted', 'latitude',
       'longitude', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13'],
      dtype='object')

In [4]:
data.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude,Unnamed: 11,Unnamed: 12,Unnamed: 13
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.9411111,,,
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082,,,
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667,,,
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.6458333,,,
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.8036111,,,


In [5]:
data['duration (seconds)']

0            2700
1            7200
2              20
3              20
4             900
5             300
6             180
7            1200
8             180
9             120
10            300
11            180
12           1800
13            180
14             30
15           1200
16            120
17           1800
18             20
19            120
20           2700
21           1200
22           1200
23            360
24             60
25              3
26             30
27             30
28            300
29            900
           ...   
88845        1290
88846          60
88847         300
88848         900
88849           5
88850           1
88851    triangle
88852         120
88853           4
88854           0
88855           0
88856           8
88857          90
88858           0
88859        3600
88860          60
88861           3
88862          15
88863          60
88864         120
88865         180
88866          20
88867         600
88868        1200
88869     
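The `triangle` entry at index 88851 shows that `duration (seconds)` holds at least one non-numeric string. A minimal sketch for listing such values before coercing them, using a small made-up frame in place of `UFO.csv` (the sample values and the names `sample`, `parsed`, and `non_numeric` are illustrative assumptions):

```python
import pandas as pd

# Illustrative sample; in the notebook the column comes from UFO.csv.
sample = pd.DataFrame({"duration (seconds)": ["2700", "20", "triangle", "0", None]})

col = sample["duration (seconds)"]
# Try to parse every value; anything unparseable becomes NaN.
parsed = pd.to_numeric(col, errors="coerce")
# Keep values that are present in the source but failed to parse.
non_numeric = col[col.notna() & parsed.isna()]

print(non_numeric)
```

Inspecting these values first shows whether a later `errors='coerce'` conversion is only discarding stray text or is hiding a larger parsing problem.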

In [8]:
data.dtypes

datetime                datetime64[ns]
city                            object
state                           object
country                         object
shape                           object
duration (seconds)              object
duration (hours/min)            object
comments                        object
date posted                     object
latitude                        object
longitude                       object
Unnamed: 11                     object
Unnamed: 12                    float64
Unnamed: 13                    float64
dtype: object

In [13]:
data['datetime'] = pd.to_datetime(data['datetime'], errors='coerce')  # parse to datetime; unparseable values become NaT
data['duration (seconds)'] = pd.to_numeric(data['duration (seconds)'], errors='coerce')  # to float; non-numeric values become NaN
#data['duration (seconds)'] = data['duration (seconds)'].astype(int)
data['latitude'] = pd.to_numeric(data['latitude'], errors='coerce')  # latitude to float
data['longitude'] = pd.to_numeric(data['longitude'], errors='coerce')  # longitude to float
data.dtypes

datetime                datetime64[ns]
city                            object
state                           object
country                         object
shape                           object
duration (seconds)             float64
duration (hours/min)            object
comments                        object
date posted                     object
latitude                       float64
longitude                      float64
Unnamed: 11                     object
Unnamed: 12                    float64
Unnamed: 13                    float64
dtype: object

In [14]:
data['duration (seconds)']

0        2700.0
1        7200.0
2          20.0
3          20.0
4         900.0
5         300.0
6         180.0
7        1200.0
8         180.0
9         120.0
10        300.0
11        180.0
12       1800.0
13        180.0
14         30.0
15       1200.0
16        120.0
17       1800.0
18         20.0
19        120.0
20       2700.0
21       1200.0
22       1200.0
23        360.0
24         60.0
25          3.0
26         30.0
27         30.0
28        300.0
29        900.0
          ...  
88845    1290.0
88846      60.0
88847     300.0
88848     900.0
88849       5.0
88850       1.0
88851       NaN
88852     120.0
88853       4.0
88854       0.0
88855       0.0
88856       8.0
88857      90.0
88858       0.0
88859    3600.0
88860      60.0
88861       3.0
88862      15.0
88863      60.0
88864     120.0
88865     180.0
88866      20.0
88867     600.0
88868    1200.0
88869       0.0
88870    1200.0
88871       5.0
88872    1020.0
88873       0.0
88874       0.0
Name: duration (seconds)

In [15]:
data['Unnamed: 11'].value_counts()

0                      173
0.0                     56
-74.0063889             10
-111.093731              6
-91.831833               4
-122.3308333             4
-46.633309               4
-105.87008999999999      3
-115.1363889             3
60.631811                3
-112.0733333             2
-121.49333329999999      2
-122.0597222             2
-120.554201              2
-84.3880556              2
-89.398528               2
-99.901813               2
-78.656894               2
-89.398528               2
1.297355                 2
10.407561                2
-74.0063889              2
-95.3630556              2
-117.425                 2
-62.88333299999999       2
13.193401000000001       2
-76.6125000              2
9.939686                 1
-87.7516667              1
-3.056637                1
                      ... 
-119.0119444             1
-95.8588889              1
-30.97631                1
-86.134902               1
-123.13350200000001      1
-114.61601               1
-

In [16]:
data['Unnamed: 12'].value_counts()

 0.000000        6
 24.857883       4
-46.633309       1
-28.673147       1
-93.019722       1
 1.602034        1
-40.295777       1
 5.113554        1
-77.008889       1
 47.481766       1
-103.666667      1
 4.477793        1
-96.050000       1
-108.188584      1
-71267.000000    1
-121.271389      1
-116.419389      1
 17.637930       1
 14.082748       1
-86.286111       1
 2.213749        1
-64.095813       1
-122.742778      1
-87.791711       1
-111.248301      1
 66.075833       1
-121.424167      1
-73.245833       1
 17.940871       1
 14.696725       1
 14.801027       1
Name: Unnamed: 12, dtype: int64

In [17]:
data['Unnamed: 13'].value_counts()

-23.126667    1
 0.000000     1
Name: Unnamed: 13, dtype: int64
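The longitude-like values in `Unnamed: 11` to `Unnamed: 13` suggest that a few rows carry more fields than the header, which would push their values to the right. That reading is an assumption; the notebook does not confirm it. A minimal sketch for isolating such rows for inspection, on a made-up frame (the sample values and the names `extra_cols` and `suspect` are illustrative):

```python
import pandas as pd

# Illustrative sample: the second row mimics a record with one extra field.
sample = pd.DataFrame({
    "shape": ["light", "30"],
    "longitude": ["-98.58", "33.9"],
    "Unnamed: 11": [None, "-84.38"],
    "Unnamed: 12": [None, None],
    "Unnamed: 13": [None, None],
})

extra_cols = ["Unnamed: 11", "Unnamed: 12", "Unnamed: 13"]
# A row is suspect if any of the overflow columns holds a value.
suspect = sample[sample[extra_cols].notna().any(axis=1)]

print(suspect)
```

Looking at the suspect rows before dropping the unnamed columns helps decide whether those records can be repaired or should be excluded.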

In [18]:
print(data['datetime'].min())
print(data['datetime'].max())

1906-11-11 00:00:00
2014-05-08 18:45:00


In [20]:
# Split datetime into date and time parts:
data['date'] = data['datetime'].dt.date

#data['time'] = [d.time() for d in data['datetime']]
data.dtypes

datetime                datetime64[ns]
city                            object
state                           object
country                         object
shape                           object
duration (seconds)             float64
duration (hours/min)            object
comments                        object
date posted                     object
latitude                       float64
longitude                      float64
Unnamed: 11                     object
Unnamed: 12                    float64
Unnamed: 13                    float64
date                            object
dtype: object
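With `datetime` parsed, the `.dt` accessor can group sightings by year or by month, which is one way to approach the trend and seasonality questions listed under Inspiration. A minimal sketch on made-up timestamps (the sample values and the names `valid`, `per_year`, and `per_month` are illustrative assumptions):

```python
import pandas as pd

# Illustrative sample; None stands in for a timestamp that failed to parse.
sample = pd.DataFrame({"datetime": pd.to_datetime(
    ["1949-10-10 20:30", "1949-10-10 21:00", "1955-10-10 17:00",
     "2004-07-04 22:00", None]
)})

# Drop missing timestamps (NaT), then count sightings per year and per month.
valid = sample["datetime"].dropna()
per_year = valid.dt.year.value_counts().sort_index()
per_month = valid.dt.month.value_counts().sort_index()

print(per_year)
print(per_month)
```

Counting by `dt.month` across all years gives a rough seasonal profile, while `dt.year` gives the long-term trend.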

In [None]:
data['country'].unique()

In [None]:
data.isnull().sum()
# Columns to drop (null counts in parentheses): Unnamed: 11 (88197), Unnamed: 12 (88836), Unnamed: 13 (88873)
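The null counts can also be expressed as a fraction of rows, which makes the drop decision explicit. A minimal sketch on a made-up frame; the 0.7 threshold and the names `null_fraction`, `mostly_empty`, and `cleaned` are arbitrary assumptions, not values from the notebook:

```python
import pandas as pd

# Illustrative sample with one mostly empty and one fully empty column.
sample = pd.DataFrame({
    "city": ["edna", "kaneohe", None, "chester"],
    "Unnamed: 11": [None, None, None, "0"],
    "Unnamed: 13": [None, None, None, None],
})

# Share of missing values per column (mean of a boolean mask).
null_fraction = sample.isnull().mean()
# Columns whose missing share exceeds the assumed threshold.
mostly_empty = null_fraction[null_fraction > 0.7].index.tolist()
cleaned = sample.drop(columns=mostly_empty)

print(null_fraction)
print(cleaned.columns.tolist())
```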

In [None]:
# drop() returns a new DataFrame, so assign the result to keep the change
data = data.drop(columns=['Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'duration (hours/min)'])