https://www.kaggle.com/cyanist/data-cleaning/data


## Context
This dataset contains over 80,000 reports of UFO sightings over the last century.

## Content
There are two versions of this dataset: scrubbed and complete. The complete data includes entries where the location of the sighting was not found or was blank (0.8146%) or where the time was erroneous or blank (8.0237%). Since the reports date back to the 20th century, some older data might be obscured. The data contains the city, state, time, description, and duration of each sighting.

## Inspiration
* What areas of the country are most likely to have UFO sightings?
* Are there any trends in UFO sightings over time? Do they tend to be clustered or seasonal?
* Do clusters of UFO sightings correlate with landmarks, such as airports or government research centers?
* What are the most common UFO descriptions?

In [21]:
import pandas as pd

In [23]:
ufos = pd.read_csv('scrubbed.csv')

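The plain read triggers a pandas warning, most likely the `DtypeWarning` about columns with mixed types. Passing `low_memory=False` makes pandas process the file in one pass instead of in chunks: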


In [24]:
ufos = pd.read_csv('scrubbed.csv', low_memory=False)
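
An alternative to `low_memory=False` is to tell pandas up front which columns to read as strings, so nothing is guessed. The sketch below uses a small made-up CSV held in memory instead of `scrubbed.csv`; the three sample rows and the choice of columns passed to `dtype` are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the real file is scrubbed.csv.
sample_csv = io.StringIO(
    "city,duration (seconds),latitude\n"
    "san marcos,2700,29.8830556\n"
    "bouse,2`,33.9325\n"
    "mescalero,180,33q.200088\n"
)

# Read the columns suspected of holding stray characters as plain strings.
sample = pd.read_csv(
    sample_csv,
    dtype={"duration (seconds)": str, "latitude": str},
)
print(sample.dtypes)
```

Reading the suspect columns as strings keeps the raw values intact, so they can be inspected and cleaned before any numeric conversion.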

In [72]:
ufos.head()

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min),comments,date_posted,latitude,longitude
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941111
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.38421,-98.581082
2,1955-10-10 17:00:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.2,-2.916667
3,1956-10-10 21:00:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645833
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803611


## First, let's rename the columns

In [69]:
ufos.columns

Index(['datetime', 'city', 'state', 'country', 'shape', 'duration (seconds)',
       'duration (hours/min)', 'comments', 'date posted', 'latitude',
       'longitude '],
      dtype='object')

In [73]:
ufos = ufos.rename(columns={
    'duration (seconds)': 'duration_seconds',
    'duration (hours/min)': 'duration_hours_min',
    'date posted': 'date_posted',
})

In [74]:
ufos.columns

Index(['datetime', 'city', 'state', 'country', 'shape', 'duration_seconds',
       'duration_hours_min)', 'comments', 'date_posted', 'latitude',
       'longitude '],
      dtype='object')
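
Note that `'longitude '` still carries a trailing space after the rename. As a hedged alternative to listing every column by hand, the names can be normalized in one pass. The empty frame below is a hypothetical stand-in for `ufos`, and the snake_case rule is an assumption about the naming style wanted here.

```python
import pandas as pd

# Hypothetical stand-in with the same kinds of column names as the dataset.
df = pd.DataFrame(
    columns=[
        "datetime",
        "duration (seconds)",
        "duration (hours/min)",
        "date posted",
        "longitude ",
    ]
)

df.columns = (
    df.columns
    .str.strip()                                     # drop leading/trailing spaces
    .str.replace(r"[^0-9a-zA-Z]+", "_", regex=True)  # runs of symbols -> "_"
    .str.strip("_")                                  # drop underscores left at the ends
)
print(list(df.columns))
```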

## Check types

In [26]:
ufos.dtypes

datetime                 object
city                     object
state                    object
country                  object
shape                    object
duration (seconds)       object
duration (hours/min)     object
comments                 object
date posted              object
latitude                 object
longitude               float64
dtype: object

In [99]:
ufos.astype({'datetime': 'datetime64[ns]', 'date_posted': 'datetime64[ns]', 'latitude': float, 'duration_seconds': float })

ValueError: could not convert string to float: '2`'

The conversion fails: `datetime`, `duration_seconds` and `latitude` all contain values that cannot be parsed.

Let's convert only `date_posted` for now and then track down the bad values column by column.
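
One way to look for the offending values without hitting one error at a time is `pd.to_numeric` with `errors='coerce'`: anything that cannot be parsed becomes NaN, and those positions point back at the raw strings. This is a minimal sketch on a hypothetical sample; the values are modeled on the errors seen in this notebook, not read from the file.

```python
import pandas as pd

# Hypothetical sample of raw string values like those in the dataset.
raw = pd.Series(["2700", "2`", "20", "0.5`", "900"], name="duration_seconds")

# Values that cannot be parsed become NaN instead of raising.
parsed = pd.to_numeric(raw, errors="coerce")

# Entries that were present but failed to parse are the ones to inspect.
bad = raw[parsed.isna() & raw.notna()]
print(bad)
```

The same mask works for `latitude` or any other column that should be numeric.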

In [38]:
ufos = ufos.astype({'date_posted': 'datetime64[ns]' })

In [39]:
ufos.dtypes

datetime                        object
city                            object
state                           object
country                         object
shape                           object
duration (seconds)              object
duration (hours/min)            object
comments                        object
date posted             datetime64[ns]
latitude                        object
longitude                      float64
dtype: object

## datetime

In [41]:
ufos.astype({'datetime': 'datetime64[ns]'})

ValueError: hour must be in 0..23

In [45]:
ufos['datetime'] = ufos['datetime'].astype(str)

In [53]:
ufos['bad_hour'] = ufos['datetime'].str.contains('24:', regex=False)

In [54]:
ufos[ufos['bad_hour']]

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude,bad_hour
388,10/11/2006 24:00,rome,ny,us,oval,120,a min or two,I was walking from the garage to the house&#44...,2007-02-01,43.2127778,-75.456111,True
693,10/1/2001 24:00,chulucanas-piura la vieja (peru),,,other,6312000,2 years,go to: http://www.24horas.com.pe/data/videos/...,2003-03-04,-5.129547,-80.120569,True
962,10/1/2012 24:00,novi,mi,us,triangle,300,5 minutes,V shaped and 8 big and very brite lights&#44mo...,2012-10-30,42.4805556,-83.475556,True
1067,10/12/2003 24:00,salatiga (indonesia),,,disk,22,22 seconds,UFO in Salatiga&#44Indonesia,2003-10-31,-7.33683,110.498817,True
1221,10/12/2013 24:00,cincinnati,oh,us,fireball,300,3-5 minutes,A bright orange light split into four&#44 did ...,2013-10-14,39.1619444,-84.456944,True
1222,10/12/2013 24:00,orangevale (sacramento),ca,us,sphere,60,1 minute,swarm of 8 orange orbs stop and dispers in man...,2013-10-23,38.6786111,-121.224722,True
1317,10/13/2004 24:00,white hall,ar,us,triangle,300,5 minutes,ufo spoted near chemical weapons disposal base .,2004-10-27,34.2738889,-92.090833,True
1359,10/13/2007 24:00,hudson,wi,us,sphere,180,3 minutes,6 Balls of Light Swooping Behind Chem-Clouds 9...,2007-11-28,44.9747222,-92.756667,True
1445,10/13/2012 24:00,kingston,pa,us,light,300,5 minutes,There were orange lights in the sky. The ones ...,2012-10-30,41.2616667,-75.897222,True
1663,10/14/2011 24:00,cibecue (west of),az,us,light,300,5 minutes,Bright yellow light moving slowly thru skies o...,2011-10-25,34.0447222,-110.484722,True


In [55]:
ufos['datetime'] = ufos['datetime'].str.replace('24:', '00:', regex=False)

In [56]:
ufos.astype({'datetime': 'datetime64[ns]'})

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude,bad_hour
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.8830556,-97.941111,False
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.38421,-98.581082,False
2,1955-10-10 17:00:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.2,-2.916667,False
3,1956-10-10 21:00:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.9783333,-96.645833,False
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.4180556,-157.803611,False
5,1961-10-10 19:00:00,bristol,tn,us,sphere,300,5 minutes,My father is now 89 my brother 52 the girl wit...,2007-04-27,36.5950000,-82.188889,False
6,1965-10-10 21:00:00,penarth (uk/wales),,gb,circle,180,about 3 mins,penarth uk circle 3mins stayed 30ft above m...,2006-02-14,51.434722,-3.180000,False
7,1965-10-10 23:45:00,norwalk,ct,us,disk,1200,20 minutes,A bright orange color changing to reddish colo...,1999-10-02,41.1175000,-73.408333,False
8,1966-10-10 20:00:00,pell city,al,us,disk,180,3 minutes,Strobe Lighted disk shape object observed clos...,2009-03-19,33.5861111,-86.286111,False
9,1966-10-10 21:00:00,live oak,fl,us,disk,120,several minutes,Saucer zaps energy from powerline as my pregna...,2005-05-11,30.2947222,-82.984167,False


In [57]:
ufos = ufos.astype({'datetime': 'datetime64[ns]'})
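
Replacing `24:` with `00:` keeps the same calendar date, so a sighting logged at `24:00` moves to the start of that day. If `24:00` is read as the end of the day instead, the timestamp belongs to the next date. The sketch below shows that variant on two hypothetical rows; whether reporters meant end of day is an assumption, and the `%m/%d/%Y %H:%M` format is inferred from the rows shown above.

```python
import pandas as pd

# Hypothetical sample in the same text format as the raw datetime column.
raw = pd.Series(["10/11/2006 24:00", "10/10/1949 20:30"])

# Remember which rows used the out-of-range hour.
is_midnight = raw.str.contains(" 24:", regex=False)

fixed = pd.to_datetime(
    raw.str.replace(" 24:", " 00:", regex=False),
    format="%m/%d/%Y %H:%M",
)

# Treat 24:00 on a given date as 00:00 of the following date.
fixed = fixed + pd.to_timedelta(is_midnight.astype(int), unit="D")
print(fixed)
```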

### Now we want to delete the helper `bad_hour` column

One way is `drop` with `axis=1`:

In [59]:
ufos = ufos.drop(['bad_hour'], axis=1)
ufos

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.8830556,-97.941111
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.38421,-98.581082
2,1955-10-10 17:00:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.2,-2.916667
3,1956-10-10 21:00:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.9783333,-96.645833
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.4180556,-157.803611
5,1961-10-10 19:00:00,bristol,tn,us,sphere,300,5 minutes,My father is now 89 my brother 52 the girl wit...,2007-04-27,36.5950000,-82.188889
6,1965-10-10 21:00:00,penarth (uk/wales),,gb,circle,180,about 3 mins,penarth uk circle 3mins stayed 30ft above m...,2006-02-14,51.434722,-3.180000
7,1965-10-10 23:45:00,norwalk,ct,us,disk,1200,20 minutes,A bright orange color changing to reddish colo...,1999-10-02,41.1175000,-73.408333
8,1966-10-10 20:00:00,pell city,al,us,disk,180,3 minutes,Strobe Lighted disk shape object observed clos...,2009-03-19,33.5861111,-86.286111
9,1966-10-10 21:00:00,live oak,fl,us,disk,120,several minutes,Saucer zaps energy from powerline as my pregna...,2005-05-11,30.2947222,-82.984167
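
Besides `drop(..., axis=1)`, two other common spellings remove a column. This small sketch uses a hypothetical two-row frame rather than `ufos`.

```python
import pandas as pd

# Hypothetical frame with a helper column to remove.
df = pd.DataFrame({"city": ["rome", "novi"], "bad_hour": [True, False]})

# Option 1: name the columns explicitly; returns a new frame.
without_kw = df.drop(columns=["bad_hour"])

# Option 2: delete the column in place (done on a copy here).
in_place = df.copy()
del in_place["bad_hour"]

print(list(without_kw.columns), list(in_place.columns))
```

`drop` leaves the original frame untouched unless the result is assigned back, while `del` modifies the frame it is applied to.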


In [60]:
ufos.dtypes

datetime                datetime64[ns]
city                            object
state                           object
country                         object
shape                           object
duration (seconds)              object
duration (hours/min)            object
comments                        object
date posted             datetime64[ns]
latitude                        object
longitude                      float64
dtype: object

## Latitude

In [61]:
ufos.astype({'latitude': float })

ValueError: could not convert string to float: '33q.200088'

In [63]:
ufos[ufos['latitude'] == '33q.200088']

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
43782,1974-05-22 05:30:00,mescalero indian reservation,nm,,rectangle,180,two hours,Huge rectangular object emmitting intense whit...,2012-04-18,33q.200088,-105.624152


In [64]:
ufos.shape



(80332, 11)

We could correct this value by hand, but we don't know what the right latitude is, so this time we drop the row.

In [65]:
ufos.drop(43782).shape

(80331, 11)

In [66]:
ufos = ufos.drop(43782)

In [68]:
ufos = ufos.astype({'latitude': float})

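Alternatively, `pd.to_numeric` with `errors='coerce'` converts the column and turns any unparseable value into NaN instead of raising:

In [None]: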
ufos['latitude'] = pd.to_numeric(ufos['latitude'], errors='coerce')


## duration_seconds

In [100]:
ufos.astype({'duration_seconds': 'float'})

ValueError: could not convert string to float: '2`'

In [101]:
ufos[ufos['duration_seconds'] == '2`']

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min),comments,date_posted,latitude,longitude
27822,2000-02-02 19:33:00,bouse,az,us,0,2`,each a few seconds,Driving through Plomosa Pass towards Bouse Loo...,2000-02-16,33.9325,-114.005


In [102]:
ufos.iloc[27822]

datetime                                             2000-02-02 19:33:00
city                                                               bouse
state                                                                 az
country                                                               us
shape                                                                  0
duration_seconds                                                      2`
duration_hours_min)                                   each a few seconds
comments               Driving through Plomosa Pass towards Bouse Loo...
date_posted                                          2000-02-16 00:00:00
latitude                                                         33.9325
longitude                                                       -114.005
Name: 27822, dtype: object

In [104]:
ufos.at[27822, 'duration_seconds'] = 2


In [105]:
ufos.astype({'duration_seconds': 'float'})

ValueError: could not convert string to float: '8`'

In [106]:
ufos[ufos['duration_seconds'] == '8`']

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min),comments,date_posted,latitude,longitude
35692,2005-04-10 22:52:00,santa cruz,ca,us,0,8`,eight seconds,2 red lights moving together and apart with a ...,2005-04-16,36.974167,-122.029722


In [107]:
ufos.at[35692, 'duration_seconds'] = 8


In [108]:
ufos.astype({'duration_seconds': 'float'})

ValueError: could not convert string to float: '0.5`'

Fixing rows one by one does not scale. We need to strip the stray backtick characters from the whole column.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html

In [114]:
ufos['duration_seconds'] = ufos['duration_seconds'].apply(lambda x: x.replace('`', ''))

AttributeError: 'int' object has no attribute 'replace'

The two values fixed by hand above are now integers, so the column has to be cast to `str` before calling `replace`.

In [115]:
ufos['duration_seconds'] = ufos['duration_seconds'].astype(str)


In [116]:
ufos['duration_seconds'] = ufos['duration_seconds'].apply(lambda x: x.replace('`', ''))

In [118]:
ufos = ufos.astype({'duration_seconds': 'float'})
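
The same cleanup can be written as one vectorized chain. This is a sketch on a hypothetical sample that mixes strings and an integer, as the real column does after the manual fixes.

```python
import pandas as pd

# Hypothetical sample: strings with stray backticks plus one hand-fixed integer.
raw = pd.Series(["2700", 2, "8`", "0.5`"], name="duration_seconds")

cleaned = pd.to_numeric(
    raw.astype(str).str.replace("`", "", regex=False)  # cast, then strip backticks
)
print(cleaned)
```

With the default `errors='raise'`, `pd.to_numeric` still fails loudly if some other unexpected character remains.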

In [119]:
ufos.dtypes

datetime               datetime64[ns]
city                           object
state                          object
country                        object
shape                          object
duration_seconds              float64
duration_hours_min)            object
comments                       object
date_posted            datetime64[ns]
latitude                      float64
longitude                     float64
dtype: object

## NaN values


In [77]:
ufos.isnull()


Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min),comments,date_posted,latitude,longitude
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,True,False,False,False,False,False,False,False
2,False,False,True,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,False
6,False,False,True,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,False


#### Turn it into an array

In [76]:
ufos.isnull().values


array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False,  True, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [78]:
ufos.isnull().values.any()


True

#### Which columns have NaN values?

In [80]:
ufos.isnull().sum()

datetime                  0
city                      0
state                  5797
country                9669
shape                  1932
duration_seconds          0
duration_hours_min)       0
comments                 15
date_posted               0
latitude                  0
longitude                 0
dtype: int64

What to do? Sometimes we fill missing values with the column mean, other times with a fixed value. This time we fill them with 0.

In [82]:
ufos.fillna(value=0).isnull().sum()


datetime               0
city                   0
state                  0
country                0
shape                  0
duration_seconds       0
duration_hours_min)    0
comments               0
date_posted            0
latitude               0
longitude              0
dtype: int64

In [83]:
ufos = ufos.fillna(value=0)
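
Filling everything with 0 puts an integer into text columns such as `state`, `country` and `shape`. A hedged alternative is to pass `fillna` a dictionary with one fill value per column. The frame and the `'unknown'` placeholder below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with gaps in the same text columns as the dataset.
df = pd.DataFrame({
    "state": ["tx", None, "hi"],
    "country": ["us", "gb", None],
    "shape": ["light", None, "disk"],
})

# One fill value per column keeps each column a single type.
filled = df.fillna({"state": "unknown", "country": "unknown", "shape": "unknown"})
print(filled)
```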

For example, think about what to do with the 0 values. If one category has many rows and another only a few, consider whether to drop the small ones.

In [88]:
ufos.country.value_counts()

us    65114
0      9669
ca     3000
gb     1905
au      538
de      105
Name: country, dtype: int64
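
To decide what to do with small categories, it can help to look at shares instead of raw counts. The sketch below groups rare values under a single label; the sample series, the 15% threshold and the `'other'` label are all assumptions.

```python
import pandas as pd

# Hypothetical sample with one dominant category and a few rare ones.
country = pd.Series(["us"] * 6 + ["ca"] * 2 + ["gb", "de"], name="country")

# Share of rows per category.
shares = country.value_counts(normalize=True)

# Categories below the (assumed) 15% threshold are grouped together.
rare = shares[shares < 0.15].index
grouped = country.where(~country.isin(rare), "other")
print(grouped.value_counts())
```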

### It makes sense to tag each sighting according to its duration

There are different ways to build the bins:
* Equal width bins: the range for each bin is the same size.
* Equal frequency bins: approximately the same number of records in each bin.
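
The difference between the two strategies shows up as soon as the data contains an outlier. A minimal sketch on ten made-up values, using `pd.cut` for equal width and `pd.qcut` for equal frequency:

```python
import pandas as pd

# Hypothetical values: nine small numbers and one outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal width: the range 1..100 is split into two halves of the same size.
equal_width = pd.cut(values, 2)

# Equal frequency: the split point is the median, so each bin gets half the rows.
equal_freq = pd.qcut(values, 2)

print(equal_width.value_counts())
print(equal_freq.value_counts())
```

With equal width, the outlier sits alone in the upper bin and the other nine values share the lower one; with equal frequency, each bin holds five values.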

In [92]:
ufos.head()

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min),comments,date_posted,latitude,longitude
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941111
1,1949-10-10 21:00:00,lackland afb,tx,0,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.38421,-98.581082
2,1955-10-10 17:00:00,chester (uk/england),0,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.2,-2.916667
3,1956-10-10 21:00:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645833
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803611


In [93]:
labels = ['Very Low', 'Low', 'Moderate', 'High', 'Very High']


In [122]:
ufos['duration_seconds_interval'] = pd.cut(ufos['duration_seconds'], 5, labels = labels)


In [123]:
ufos.head()

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min),comments,date_posted,latitude,longitude,duration_seconds_interval
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,2004-04-27,29.883056,-97.941111,Very Low
1,1949-10-10 21:00:00,lackland afb,tx,0,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,2005-12-16,29.38421,-98.581082,Very Low
2,1955-10-10 17:00:00,chester (uk/england),0,gb,circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,2008-01-21,53.2,-2.916667,Very Low
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,2004-01-17,28.978333,-96.645833,Very Low
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,2004-01-22,21.418056,-157.803611,Very Low


In [127]:
pd.cut(ufos['duration_seconds'], 5).value_counts()

(-97835.999, 19567200.001]      80324
(39134400.001, 58701600.0]          3
(78268800.0, 97836000.0]            2
(58701600.0, 78268800.0]            1
(19567200.001, 39134400.001]        1
Name: duration_seconds, dtype: int64

Equal-width bins are dominated by a handful of extreme durations, so almost every row lands in the first bin. Equal-frequency bins from `qcut` spread the rows far more evenly:

In [128]:
pd.qcut(ufos['duration_seconds'], 5).value_counts()

(20.0, 120.0]          19717
(0.0, 20.0]            17760
(120.0, 300.0]         15954
(900.0, 97836000.0]    13874
(300.0, 900.0]         13026
Name: duration_seconds, dtype: int64
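
A third option is to hand `pd.cut` explicit bin edges, which keeps the labels meaningful even with extreme outliers. The edges below (one minute, five minutes, fifteen minutes, one hour) and the sample durations are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

labels = ["Very Low", "Low", "Moderate", "High", "Very High"]

# Assumed edges in seconds: <=1 min, <=5 min, <=15 min, <=1 h, longer.
bins = [0, 60, 300, 900, 3600, float("inf")]

# Hypothetical durations, including one extreme value.
durations = pd.Series([0.5, 20, 120, 300, 900, 2700, 7200, 97836000.0])

# include_lowest=True keeps a duration of exactly 0 in the first bin.
tagged = pd.cut(durations, bins=bins, labels=labels, include_lowest=True)
print(tagged)
```

Each bin is closed on the right, so a duration of exactly 300 seconds is tagged `Low` and 900 seconds is tagged `Moderate`.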