https://www.kaggle.com/cyanist/data-cleaning/data


## Context
This dataset contains over 80,000 reports of UFO sightings over the last century.

## Content
There are two versions of this dataset: scrubbed and complete. The complete data includes entries where the location of the sighting was not found or is blank (0.8146%) or where the time is erroneous or blank (8.0237%). Since the reports date back to the 20th century, some older data might be obscured. Data contains city, state, time, description, and duration of each sighting.

## Inspiration
* What areas of the country are most likely to have UFO sightings?
* Are there any trends in UFO sightings over time? Do they tend to be clustered or seasonal?
* Do clusters of UFO sightings correlate with landmarks, such as airports or government research centers?
* What are the most common UFO descriptions?

In [2]:
import pandas as pd

In [3]:
ufos = pd.read_csv('scrubbed.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [4]:
ufos = pd.read_csv('scrubbed.csv', low_memory=False)

In [5]:
ufos.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611


## First, let's rename the columns

In [6]:
ufos.columns

Index(['datetime', 'city', 'state', 'country', 'shape', 'duration (seconds)',
       'duration (hours/min)', 'comments', 'date posted', 'latitude',
       'longitude '],
      dtype='object')

In [7]:
ufos = ufos.rename(columns = {'duration (seconds)': 'duration_seconds', 'duration (hours/min)': 'duration_hours_min',
                      'date posted': 'date_posted'})

In [8]:
ufos.columns

Index(['datetime', 'city', 'state', 'country', 'shape', 'duration_seconds',
       'duration_hours_min', 'comments', 'date_posted', 'latitude',
       'longitude '],
      dtype='object')
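
Note that the last column in the output above is still named `'longitude '`, with a trailing space, so `ufos['longitude']` would raise a `KeyError`. As a minimal sketch, the snippet below normalizes every column name in one pass on a toy frame. The `demo` frame and the `clean_column_name` helper are illustrative assumptions, not part of the original notebook.

```python
import re
import pandas as pd

def clean_column_name(name: str) -> str:
    """Strip whitespace, lowercase, and turn runs of non-alphanumerics into '_'."""
    name = name.strip().lower()
    name = re.sub(r"[^0-9a-z]+", "_", name)
    return name.strip("_")

# Toy frame (hypothetical) with the same awkward headers as the CSV
demo = pd.DataFrame(
    columns=["duration (seconds)", "duration (hours/min)", "date posted", "longitude "]
)

# rename() accepts a function that is applied to every column label
demo = demo.rename(columns=clean_column_name)
print(list(demo.columns))
```

With this approach the trailing space disappears along with the parentheses and slashes, so no name has to be listed by hand.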

## Check types

In [9]:
ufos.dtypes

datetime               object
city                   object
state                  object
country                object
shape                  object
duration_seconds       object
duration_hours_min     object
comments               object
date_posted            object
latitude               object
longitude             float64
dtype: object

In [10]:
ufos.columns

Index(['datetime', 'city', 'state', 'country', 'shape', 'duration_seconds',
       'duration_hours_min', 'comments', 'date_posted', 'latitude',
       'longitude '],
      dtype='object')

In [11]:
ufos.astype({'datetime': 'datetime64[ns]', 'date_posted': 'datetime64[ns]', 'latitude': float, 'duration_seconds': float })

ValueError: hour must be in 0..23

The conversion fails: datetime, duration_seconds and latitude all contain malformed values.

Let's convert only date_posted for now, and then track down the bad data in the other columns one at a time.

In [None]:
ufos = ufos.astype({'date_posted': 'datetime64[ns]' })

In [None]:
ufos.dtypes

## datetime

In [12]:
ufos.astype({'datetime': 'datetime64[ns]'})

ValueError: hour must be in 0..23

In [13]:
ufos.datetime = ufos.datetime.astype(str)

In [14]:
ufos['bad_hour'] = ufos.datetime.apply(lambda x: '24:' in x)

In [15]:
ufos[ufos['bad_hour'] != False]

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude,bad_hour
388,10/11/2006 24:00,rome,ny,us,oval,120,a min or two,I was walking from the garage to the house&#44...,2/1/2007,43.2127778,-75.456111,True
693,10/1/2001 24:00,chulucanas-piura la vieja (peru),,,other,6312000,2 years,go to: http://www.24horas.com.pe/data/videos/...,3/4/2003,-5.129547,-80.120569,True
962,10/1/2012 24:00,novi,mi,us,triangle,300,5 minutes,V shaped and 8 big and very brite lights&#44mo...,10/30/2012,42.4805556,-83.475556,True
1067,10/12/2003 24:00,salatiga (indonesia),,,disk,22,22 seconds,UFO in Salatiga&#44Indonesia,10/31/2003,-7.33683,110.498817,True
1221,10/12/2013 24:00,cincinnati,oh,us,fireball,300,3-5 minutes,A bright orange light split into four&#44 did ...,10/14/2013,39.1619444,-84.456944,True
1222,10/12/2013 24:00,orangevale (sacramento),ca,us,sphere,60,1 minute,swarm of 8 orange orbs stop and dispers in man...,10/23/2013,38.6786111,-121.224722,True
1317,10/13/2004 24:00,white hall,ar,us,triangle,300,5 minutes,ufo spoted near chemical weapons disposal base .,10/27/2004,34.2738889,-92.090833,True
1359,10/13/2007 24:00,hudson,wi,us,sphere,180,3 minutes,6 Balls of Light Swooping Behind Chem-Clouds 9...,11/28/2007,44.9747222,-92.756667,True
1445,10/13/2012 24:00,kingston,pa,us,light,300,5 minutes,There were orange lights in the sky. The ones ...,10/30/2012,41.2616667,-75.897222,True
1663,10/14/2011 24:00,cibecue (west of),az,us,light,300,5 minutes,Bright yellow light moving slowly thru skies o...,10/25/2011,34.0447222,-110.484722,True


In [16]:
ufos['datetime'] = ufos.datetime.apply(lambda x:  x.replace('24:', '00:'))

In [17]:
ufos.astype({'datetime': 'datetime64[ns]'})

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude,bad_hour
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111,False
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082,False
2,1955-10-10 17:00:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667,False
3,1956-10-10 21:00:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833,False
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611,False
5,1961-10-10 19:00:00,bristol,tn,us,sphere,300,5 minutes,My father is now 89 my brother 52 the girl wit...,4/27/2007,36.5950000,-82.188889,False
6,1965-10-10 21:00:00,penarth (uk/wales),,gb,circle,180,about 3 mins,penarth uk circle 3mins stayed 30ft above m...,2/14/2006,51.434722,-3.180000,False
7,1965-10-10 23:45:00,norwalk,ct,us,disk,1200,20 minutes,A bright orange color changing to reddish colo...,10/2/1999,41.1175000,-73.408333,False
8,1966-10-10 20:00:00,pell city,al,us,disk,180,3 minutes,Strobe Lighted disk shape object observed clos...,3/19/2009,33.5861111,-86.286111,False
9,1966-10-10 21:00:00,live oak,fl,us,disk,120,several minutes,Saucer zaps energy from powerline as my pregna...,5/11/2005,30.2947222,-82.984167,False


In [18]:
ufos = ufos.astype({'datetime': 'datetime64[ns]'})
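
One caveat about the replacement above: `24:00` normally means midnight at the end of the given day, so rewriting it as `00:00` without changing the date moves the sighting roughly a day earlier. If that matters for the analysis, the date can be shifted forward. The following is a minimal sketch on a toy series, assuming the `month/day/year hour:minute` layout seen in the data. The names `raw`, `is_midnight` and `parsed` are illustrative.

```python
import pandas as pd

# Toy sample (hypothetical) in the same format as the datetime column
raw = pd.Series(["10/11/2006 24:00", "10/10/1949 20:30"])

# Remember which rows used the out-of-range hour
is_midnight = raw.str.contains(" 24:", regex=False)

# Parse with hour 24 rewritten as hour 00 ...
parsed = pd.to_datetime(
    raw.str.replace(" 24:", " 00:", regex=False), format="%m/%d/%Y %H:%M"
)

# ... then push those rows to the following day
parsed = parsed + pd.to_timedelta(is_midnight.astype(int), unit="D")
print(parsed)
```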

### Now we want to delete the bad column

Two ways:

In [19]:
ufos = ufos.drop(['bad_hour'], axis=1)
ufos

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,1949-10-10 21:00:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,1955-10-10 17:00:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,1956-10-10 21:00:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611
5,1961-10-10 19:00:00,bristol,tn,us,sphere,300,5 minutes,My father is now 89 my brother 52 the girl wit...,4/27/2007,36.5950000,-82.188889
6,1965-10-10 21:00:00,penarth (uk/wales),,gb,circle,180,about 3 mins,penarth uk circle 3mins stayed 30ft above m...,2/14/2006,51.434722,-3.180000
7,1965-10-10 23:45:00,norwalk,ct,us,disk,1200,20 minutes,A bright orange color changing to reddish colo...,10/2/1999,41.1175000,-73.408333
8,1966-10-10 20:00:00,pell city,al,us,disk,180,3 minutes,Strobe Lighted disk shape object observed clos...,3/19/2009,33.5861111,-86.286111
9,1966-10-10 21:00:00,live oak,fl,us,disk,120,several minutes,Saucer zaps energy from powerline as my pregna...,5/11/2005,30.2947222,-82.984167
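
The notebook only shows one way of dropping the column. As a sketch of two common options on a toy frame (the second one is an assumption about what the alternative could be, not something taken from the original): `drop` returns a new DataFrame, while `del` modifies the DataFrame in place.

```python
import pandas as pd

# Toy frame (hypothetical) with a helper column to remove
demo = pd.DataFrame({"city": ["rome", "novi"], "bad_hour": [True, True]})

# Way 1: drop() returns a new frame; the original is untouched unless reassigned
way_1 = demo.drop(columns=["bad_hour"])

# Way 2: del removes the column in place
way_2 = demo.copy()
del way_2["bad_hour"]

print(list(way_1.columns), list(way_2.columns))
```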


In [20]:
ufos.dtypes

datetime              datetime64[ns]
city                          object
state                         object
country                       object
shape                         object
duration_seconds              object
duration_hours_min            object
comments                      object
date_posted                   object
latitude                      object
longitude                    float64
dtype: object

## Latitude

In [21]:
ufos.astype({'latitude': float })

ValueError: could not convert string to float: '33q.200088'

In [22]:
ufos[ufos['latitude'] == '33q.200088']

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude
43782,1974-05-22 05:30:00,mescalero indian reservation,nm,,rectangle,180,two hours,Huge rectangular object emmitting intense whit...,4/18/2012,33q.200088,-105.624152


In [23]:
ufos.shape



(80332, 11)

We could correct the value, but we do not know what the right latitude is. This time we will drop the row.

In [24]:
ufos.drop(43782).shape

(80331, 11)

In [25]:
ufos = ufos.drop(43782)

In [26]:
ufos = ufos.astype({'latitude': float})

In [27]:
ufos['latitude'] = pd.to_numeric(ufos['latitude'], errors='coerce')
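
Converting with `astype` reports only the first offending value. A minimal sketch of how `pd.to_numeric(..., errors='coerce')` could be used to list every unparseable value at once, before deciding whether to fix or drop it (the `demo` frame is a toy example, not the real data):

```python
import pandas as pd

# Toy column (hypothetical) with one malformed latitude
demo = pd.DataFrame({"latitude": ["29.8830556", "33q.200088", "53.2"]})

# Values that cannot be parsed become NaN instead of raising
as_number = pd.to_numeric(demo["latitude"], errors="coerce")

# Rows that were not missing originally but failed to parse
bad_rows = demo[as_number.isna() & demo["latitude"].notna()]
print(bad_rows)
```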


## duration_seconds

In [28]:
ufos.astype({'duration_seconds': 'float'})

ValueError: could not convert string to float: '2`'

In [29]:
ufos[ufos['duration_seconds'] == '2`']

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude
27822,2000-02-02 19:33:00,bouse,az,us,,2`,each a few seconds,Driving through Plomosa Pass towards Bouse Loo...,2/16/2000,33.9325,-114.005


In [30]:
ufos.loc[27822]  # inspect the offending row by its index label

In [31]:
ufos.at[27822, 'duration_seconds'] = 2


In [32]:
ufos.astype({'duration_seconds': 'float'})

ValueError: could not convert string to float: '8`'

In [33]:
ufos[ufos['duration_seconds'] == '8`']

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude
35692,2005-04-10 22:52:00,santa cruz,ca,us,,8`,eight seconds,2 red lights moving together and apart with a ...,4/16/2005,36.974167,-122.029722


In [34]:
ufos.at[35692, 'duration_seconds'] = 8


In [35]:
ufos.astype({'duration_seconds': 'float'})

ValueError: could not convert string to float: '0.5`'

Fixing the rows one by one does not scale. We need to remove every backtick character (`) from the column.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html

In [36]:
ufos['duration_seconds'] = ufos['duration_seconds'].apply(lambda x: x.replace('`', ''))

AttributeError: 'int' object has no attribute 'replace'

In [37]:
ufos['duration_seconds'] = ufos['duration_seconds'].astype(str)


In [38]:
ufos['duration_seconds'] = ufos['duration_seconds'].apply(lambda x: x.replace('`', ''))

In [39]:
ufos = ufos.astype({'duration_seconds': 'float'})
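
The same cleanup can be written with pandas' vectorized string methods instead of `apply` with a lambda. This is a sketch on a toy series that mixes strings and an integer, like the column after the manual fixes above. The `demo` and `cleaned` names are illustrative.

```python
import pandas as pd

# Toy column (hypothetical): strings with stray backticks plus one plain int
demo = pd.Series(["2700", "2`", 8, "0.5`"], dtype=object)

# Cast everything to str, strip the backticks literally, then convert to numbers
cleaned = pd.to_numeric(demo.astype(str).str.replace("`", "", regex=False))
print(cleaned)
```

Passing `regex=False` makes the replacement a literal one, which avoids any surprises if the character to remove has a special meaning in regular expressions.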

In [40]:
ufos.dtypes

datetime              datetime64[ns]
city                          object
state                         object
country                       object
shape                         object
duration_seconds             float64
duration_hours_min            object
comments                      object
date_posted                   object
latitude                     float64
longitude                    float64
dtype: object

# NaN values


In [41]:
ufos.isnull()


Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,True,False,False,False,False,False,False,False
2,False,False,True,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,False
6,False,False,True,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,False


#### Turn it into an array

In [42]:
ufos.isnull().values


array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False,  True, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [43]:
ufos.isnull().values.any()


True

#### Which columns have NaN values?

In [44]:
ufos.isnull().sum()

datetime                 0
city                     0
state                 5797
country               9669
shape                 1932
duration_seconds         0
duration_hours_min       0
comments                15
date_posted              0
latitude                 0
longitude                0
dtype: int64

What should we do with them? Sometimes we fill missing values with the mean, other times with a fixed value. This time we will fill them with 0.

In [45]:
ufos.fillna(value=0).isnull().sum()


datetime              0
city                  0
state                 0
country               0
shape                 0
duration_seconds      0
duration_hours_min    0
comments              0
date_posted           0
latitude              0
longitude             0
dtype: int64

In [46]:
ufos = ufos.fillna(value=0)

For example, think about what to do with the 0 values. If one category has many records and another has very few, consider whether to remove the rare ones.

In [47]:
ufos.country.value_counts()

us    65114
0      9669
ca     3000
gb     1905
au      538
de      105
Name: country, dtype: int64
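
Filling every column with 0 puts an integer into text columns, which is why `0` shows up as a country above. One alternative is to choose a fill value per column by passing a dictionary to `fillna`. The sketch below uses a toy frame; the `'unknown'` placeholder and the use of the median are assumptions made for illustration, not choices from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with gaps in text and numeric columns
demo = pd.DataFrame({
    "country": ["us", np.nan, "gb"],
    "shape": ["light", "circle", np.nan],
    "duration_seconds": [20.0, np.nan, 900.0],
})

# One fill value per column: a text placeholder for categories,
# a robust statistic for the numeric column
fill_values = {
    "country": "unknown",
    "shape": "unknown",
    "duration_seconds": demo["duration_seconds"].median(),
}
filled = demo.fillna(value=fill_values)
print(filled)
```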

### It makes sense to assign labels depending on the duration

Different ways:
* Equal width bins: the range for each bin is the same size.
* Equal frequency bins: approximately the same number of records in each bin.

In [48]:
ufos.head()

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111
1,1949-10-10 21:00:00,lackland afb,tx,0,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,1955-10-10 17:00:00,chester (uk/england),0,gb,circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.978333,-96.645833
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.418056,-157.803611


In [49]:
labels = ['Very Low', 'Low', 'Moderate', 'High', 'Very High']


In [50]:
ufos['duration_seconds_interval'] = pd.cut(ufos['duration_seconds'], 5, labels = labels)


In [51]:
ufos.head()

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude,duration_seconds_interval
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111,Very Low
1,1949-10-10 21:00:00,lackland afb,tx,0,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082,Very Low
2,1955-10-10 17:00:00,chester (uk/england),0,gb,circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667,Very Low
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.978333,-96.645833,Very Low
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.418056,-157.803611,Very Low


In [52]:
pd.cut(ufos['duration_seconds'], 5).value_counts()

(-97835.999, 19567200.001]      80324
(39134400.001, 58701600.0]          3
(78268800.0, 97836000.0]            2
(58701600.0, 78268800.0]            1
(19567200.001, 39134400.001]        1
Name: duration_seconds, dtype: int64

In [53]:
pd.qcut(ufos['duration_seconds'], 5).value_counts()

(20.0, 120.0]          19717
(0.0, 20.0]            17760
(120.0, 300.0]         15954
(900.0, 97836000.0]    13874
(300.0, 900.0]         13026
Name: duration_seconds, dtype: int64
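
The two outputs above show why every row ended up labeled 'Very Low': a few extremely long durations stretch the equal-width bins, so almost all sightings fall into the first one. A minimal sketch comparing both strategies with the same labels on a toy series (values chosen for illustration only):

```python
import pandas as pd

labels = ["Very Low", "Low", "Moderate", "High", "Very High"]

# Toy durations (hypothetical) with one extreme outlier
demo = pd.Series([5, 20, 60, 120, 200, 300, 600, 900, 3600, 97836000], dtype=float)

# Equal width: the outlier stretches the range, so most values share one bin
equal_width = pd.cut(demo, 5, labels=labels)

# Equal frequency: bin edges follow the quantiles, so bins hold similar counts
equal_freq = pd.qcut(demo, 5, labels=labels)

print(equal_width.value_counts())
print(equal_freq.value_counts())
```

For heavily skewed data like these durations, equal-frequency labels are usually the more informative of the two.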

# Aggregations and summarizations

In [54]:
ufos.head()

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude,duration_seconds_interval
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111,Very Low
1,1949-10-10 21:00:00,lackland afb,tx,0,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082,Very Low
2,1955-10-10 17:00:00,chester (uk/england),0,gb,circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667,Very Low
3,1956-10-10 21:00:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.978333,-96.645833,Very Low
4,1960-10-10 20:00:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.418056,-157.803611,Very Low


In [55]:
ufos.groupby('country').count()

Unnamed: 0_level_0,datetime,city,state,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude,duration_seconds_interval
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,9669,9669,9669,9669,9669,9669,9669,9669,9669,9669,9669
au,538,538,538,538,538,538,538,538,538,538,538
ca,3000,3000,3000,3000,3000,3000,3000,3000,3000,3000,3000
de,105,105,105,105,105,105,105,105,105,105,105
gb,1905,1905,1905,1905,1905,1905,1905,1905,1905,1905,1905
us,65114,65114,65114,65114,65114,65114,65114,65114,65114,65114,65114


In [76]:
df = ufos.groupby('country').count()
df.index

Index([0, 'au', 'ca', 'de', 'gb', 'us'], dtype='object', name='country')

In [61]:
ufos.groupby('country').count()['datetime']

country
0      9669
au      538
ca     3000
de      105
gb     1905
us    65114
Name: datetime, dtype: int64

In [67]:
ufos.groupby('country').count().iloc[:,0]

country
0      9669
au      538
ca     3000
de      105
gb     1905
us    65114
Name: datetime, dtype: int64

What is the difference between the following two expressions?

In [1]:
ufos.groupby('country').count()[ufos.columns[0]]
ufos.groupby('country').count()[[ufos.columns[0]]]

In [65]:
ufos.groupby('country').count()[ufos.columns[0]]

country
0      9669
au      538
ca     3000
de      105
gb     1905
us    65114
Name: datetime, dtype: int64

In [66]:
ufos.groupby('country').count()[[ufos.columns[0]]]

Unnamed: 0_level_0,datetime
country,Unnamed: 1_level_1
0,9669
au,538
ca,3000
de,105
gb,1905
us,65114
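
The difference is in the type of the result: indexing with a single label returns a `Series`, while indexing with a list of labels (double brackets) returns a `DataFrame`, even when the list has only one element. A small sketch on a toy frame:

```python
import pandas as pd

# Toy frame (hypothetical)
demo = pd.DataFrame({"country": ["us", "us", "gb"], "datetime": [1, 2, 3]})
counts = demo.groupby("country").count()

as_series = counts["datetime"]     # single label -> Series
as_frame = counts[["datetime"]]    # list of labels -> DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Keeping the result as a DataFrame is convenient when it will be renamed and merged later, as in the cells that follow.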


In [72]:
# mean() is only defined for numeric columns, so restrict the aggregation to them
ufos.groupby('country').mean(numeric_only=True)

Unnamed: 0_level_0,duration_seconds,latitude,longitude
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,13410.130046,34.66847,-56.105235
au,3806.469238,-32.797061,143.236885
ca,28859.437007,47.193421,-90.28578
de,24255.980952,50.60334,10.063808
gb,66061.321207,52.746508,-1.677395
us,5800.014047,38.357911,-95.71087


In [74]:
ufos.groupby('country')[['duration_seconds']].mean()

Unnamed: 0_level_0,duration_seconds
country,Unnamed: 1_level_1
0,13410.130046
au,3806.469238
ca,28859.437007
de,24255.980952
gb,66061.321207
us,5800.014047


We want to join the two DataFrames.

In [78]:
ufos.groupby('country').count()[[ufos.columns[0]]].rename(columns = {'datetime': 'total_views'})

Unnamed: 0_level_0,total_views
country,Unnamed: 1_level_1
0,9669
au,538
ca,3000
de,105
gb,1905
us,65114


In [79]:
country_total_views = ufos.groupby('country').count()[[ufos.columns[0]]].rename(columns = {'datetime': 'total_views'})

In [80]:
ufos.groupby('country')[['duration_seconds']].mean().rename(columns = {'duration_seconds': 'mean_duration_seconds'})

Unnamed: 0_level_0,mean_duration_seconds
country,Unnamed: 1_level_1
0,13410.130046
au,3806.469238
ca,28859.437007
de,24255.980952
gb,66061.321207
us,5800.014047


In [81]:
country_mean_duration = ufos.groupby('country')[['duration_seconds']].mean().rename(columns = {'duration_seconds': 'mean_duration_seconds'})

In [83]:
country_total_views.merge(country_mean_duration, how = 'inner', left_index = True, right_index=True)

Unnamed: 0_level_0,total_views,mean_duration_seconds
country,Unnamed: 1_level_1,Unnamed: 2_level_1
0,9669,13410.130046
au,538,3806.469238
ca,3000,28859.437007
de,105,24255.980952
gb,1905,66061.321207
us,65114,5800.014047
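
The count, the mean and the merge can also be expressed as a single `groupby(...).agg(...)` call with named aggregations, which produces both columns at once. This is a sketch on a toy frame, offered as an alternative to the merge above rather than a replacement for it.

```python
import pandas as pd

# Toy frame (hypothetical) with the columns used in the summary
demo = pd.DataFrame({
    "country": ["us", "us", "gb"],
    "datetime": pd.to_datetime(["2001-01-01", "2002-01-01", "2003-01-01"]),
    "duration_seconds": [10.0, 30.0, 60.0],
})

# Each keyword is an output column: (input column, aggregation function)
summary = demo.groupby("country").agg(
    total_views=("datetime", "count"),
    mean_duration_seconds=("duration_seconds", "mean"),
)
print(summary)
```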


Summarizing by two fields

In [85]:
ufos.groupby(['country', 'shape']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,datetime,city,state,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude,duration_seconds_interval
country,shape,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,0,271,271,271,271,271,271,271,271,271,271
0,changing,252,252,252,252,252,252,252,252,252,252
0,chevron,89,89,89,89,89,89,89,89,89,89
0,cigar,262,262,262,262,262,262,262,262,262,262
0,circle,891,891,891,891,891,891,891,891,891,891
0,cone,40,40,40,40,40,40,40,40,40,40
0,crescent,1,1,1,1,1,1,1,1,1,1
0,cross,25,25,25,25,25,25,25,25,25,25
0,cylinder,161,161,161,161,161,161,161,161,161,161
0,diamond,154,154,154,154,154,154,154,154,154,154


In [86]:
ufos.groupby([ 'shape', 'country']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,datetime,city,state,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude,duration_seconds_interval
shape,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,0,271,271,271,271,271,271,271,271,271,271
0,au,11,11,11,11,11,11,11,11,11,11
0,ca,45,45,45,45,45,45,45,45,45,45
0,de,2,2,2,2,2,2,2,2,2,2
0,gb,50,50,50,50,50,50,50,50,50,50
0,us,1553,1553,1553,1553,1553,1553,1553,1553,1553,1553
changed,us,1,1,1,1,1,1,1,1,1,1
changing,0,252,252,252,252,252,252,252,252,252,252
changing,au,9,9,9,9,9,9,9,9,9,9
changing,ca,69,69,69,69,69,69,69,69,69,69


In [88]:
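# Question: what is the difference between ufos.shape and ufos['shape']?
# ufos.shape is the DataFrame attribute (rows, columns); the column named
# 'shape' can only be reached with bracket notation, ufos['shape'].
ufos.shape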


(80331, 12)

In [90]:
set(ufos['shape'])

{0,
 'changed',
 'changing',
 'chevron',
 'cigar',
 'circle',
 'cone',
 'crescent',
 'cross',
 'cylinder',
 'delta',
 'diamond',
 'disk',
 'dome',
 'egg',
 'fireball',
 'flare',
 'flash',
 'formation',
 'hexagon',
 'light',
 'other',
 'oval',
 'pyramid',
 'rectangle',
 'round',
 'sphere',
 'teardrop',
 'triangle',
 'unknown'}

There are a lot of shapes. I will keep only the most common ones.

In [121]:
important_shapes = ufos.groupby(['shape']).count()[[ufos.columns[0]]].sort_values(by='datetime', ascending=False)

In [122]:
important_shapes.rename(columns = {'datetime': 'total_views'}, inplace=True)

In [123]:
important_shapes

Unnamed: 0_level_0,total_views
shape,Unnamed: 1_level_1
light,16565
triangle,7865
circle,7608
fireball,6208
other,5649
unknown,5584
sphere,5387
disk,5213
oval,3733
formation,2457


In [124]:
ufos.shape[0]

80331

In [125]:
# What percentage does each shape represent?
important_shapes['per'] = 100 * important_shapes['total_views'] / ufos.shape[0]
important_shapes

Unnamed: 0_level_0,total_views,per
shape,Unnamed: 1_level_1,Unnamed: 2_level_1
light,16565,20.620931
triangle,7865,9.790741
circle,7608,9.470815
fireball,6208,7.728025
other,5649,7.032154
unknown,5584,6.951239
sphere,5387,6.706004
disk,5213,6.4894
oval,3733,4.647023
formation,2457,3.058595


In [126]:
cumsum_per = important_shapes[['per']].cumsum().rename(columns={'per': 'cumsum_per'})
important_shapes = important_shapes.merge(cumsum_per, how='inner', left_index=True, right_index=True)
important_shapes

Unnamed: 0_level_0,total_views,per,cumsum_per
shape,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
light,16565,20.620931,20.620931
triangle,7865,9.790741,30.411672
circle,7608,9.470815,39.882486
fireball,6208,7.728025,47.610512
other,5649,7.032154,54.642666
unknown,5584,6.951239,61.593905
sphere,5387,6.706004,68.299909
disk,5213,6.4894,74.789309
oval,3733,4.647023,79.436332
formation,2457,3.058595,82.494927


I am going to keep the shapes that account for roughly 80% of the sample, which means cutting after oval.

I need a list of the shapes that I am going to convert to 'other'.

In [129]:
set(ufos['shape'])

In [136]:
list(important_shapes.index)[9:]

['formation',
 'cigar',
 'changing',
 0,
 'flash',
 'rectangle',
 'cylinder',
 'diamond',
 'chevron',
 'egg',
 'teardrop',
 'cone',
 'cross',
 'delta',
 'crescent',
 'round',
 'hexagon',
 'pyramid',
 'flare',
 'changed',
 'dome']

In [143]:
shapes_to_other = list(important_shapes.index)[9:]
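
The cut position `9` above was read off the cumulative table by hand. A sketch of how the same list could be derived from the threshold itself, so it adapts if the data change. The toy series and the names `per`, `cumsum_per`, `threshold` and `common` are illustrative assumptions.

```python
import pandas as pd

# Toy shape column (hypothetical): 50% light, 25% triangle, 15% cone, 10% dome
shapes = pd.Series(["light"] * 10 + ["triangle"] * 5 + ["cone"] * 3 + ["dome"] * 2)

# Share of each shape in percent, most common first, and the running total
per = shapes.value_counts(normalize=True) * 100
cumsum_per = per.cumsum()

# Keep the shapes whose running total stays within the threshold
threshold = 80
common = list(cumsum_per[cumsum_per <= threshold].index)

# Everything else will be relabeled as 'other'
shapes_to_other = [s for s in per.index if s not in common]
print(common, shapes_to_other)
```

On the real table this rule would keep `light` through `oval` (cumulative 79.4%), which matches the manual cut.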

In [158]:
ufos['new_shape'] = ufos['shape'].apply(lambda x: 'other' if x in shapes_to_other else x)


In [153]:
ufos[['shape', 'new_shape']]

Unnamed: 0,shape,new_shape
0,cylinder,other
1,light,light
2,circle,circle
3,circle,circle
4,light,light
5,sphere,sphere
6,circle,circle
7,disk,disk
8,disk,disk
9,disk,disk
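
The relabeling can also be done without `apply`, using `isin` to build a mask and `where` to keep the values outside it. A minimal sketch on a toy series, where the short `shapes_to_other` list is illustrative and not the real one:

```python
import pandas as pd

# Toy shape column (hypothetical) and a short list of rare shapes
demo = pd.Series(["cylinder", "light", "circle", "dome"])
shapes_to_other = ["cylinder", "dome"]

# where() keeps values where the condition is True and replaces the rest
new_shape = demo.where(~demo.isin(shapes_to_other), "other")
print(new_shape.tolist())
```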


In [161]:
ufos.groupby(['shape', 'new_shape']).count().sort_values(by='new_shape')

Unnamed: 0_level_0,Unnamed: 1_level_0,datetime,city,state,country,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude,duration_seconds_interval
shape,new_shape,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
circle,circle,7608,7608,7608,7608,7608,7608,7608,7608,7608,7608,7608
disk,disk,5213,5213,5213,5213,5213,5213,5213,5213,5213,5213,5213
fireball,fireball,6208,6208,6208,6208,6208,6208,6208,6208,6208,6208,6208
light,light,16565,16565,16565,16565,16565,16565,16565,16565,16565,16565,16565
0,other,1932,1932,1932,1932,1932,1932,1932,1932,1932,1932,1932
teardrop,other,750,750,750,750,750,750,750,750,750,750,750
round,other,2,2,2,2,2,2,2,2,2,2,2
rectangle,other,1296,1296,1296,1296,1296,1296,1296,1296,1296,1296,1296
pyramid,other,1,1,1,1,1,1,1,1,1,1,1
other,other,5649,5649,5649,5649,5649,5649,5649,5649,5649,5649,5649


In [167]:
shape_country_total_views = ufos.groupby(['new_shape', 'country']).count()[[ufos.columns[0]]].rename(columns={'datetime': 'total_views'})

In [171]:
shape_country_mean_duration = ufos.groupby(['new_shape','country'])[['duration_seconds']].mean().rename(columns={'duration_seconds': 'mean_duration'})

QUESTION: Are these means the same as the ones from before?

In [173]:
shape_country_total_views.merge(shape_country_mean_duration, how = 'inner', left_index=True, right_index=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_views,mean_duration
new_shape,country,Unnamed: 2_level_1,Unnamed: 3_level_1
circle,0,891,14230.567901
circle,au,62,1096.629032
circle,ca,284,654.566901
circle,de,10,1990.0
circle,gb,243,856.722634
circle,us,6118,3777.292285
disk,0,736,1499.321332
disk,au,50,438.9
disk,ca,198,1193.974747
disk,de,6,202.0


Now we want to compare the duration of each sighting with the mean for its shape and country. We need to merge both DataFrames.

In [174]:
shape_country = shape_country_total_views.merge(shape_country_mean_duration, how = 'inner', left_index=True, right_index=True)

Question: Will it work?

In [175]:
ufos.merge(shape_country, how='inner', left_index=True, right_index=True)

ValueError: cannot join with no level specified and no overlapping names

In [177]:
ufos.head(1)

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude,duration_seconds_interval,new_shape
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111,Very Low,other


In [178]:
shape_country.head(1)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_views,mean_duration
new_shape,country,Unnamed: 2_level_1,Unnamed: 3_level_1
circle,0,891,14230.567901


We need to tell merge which columns to join on. Let's try referring to them by name.

In [179]:
ufos.merge(shape_country, how='inner', on=['country', 'new_shape'])

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude,duration_seconds_interval,new_shape,total_views,mean_duration
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111,Very Low,other,17759,3917.791758
1,1974-10-10 19:30:00,hudson,ma,us,other,2700.0,45 minutes,Not sure of the eact month or year of this sig...,8/10/1999,42.391667,-71.566667,Very Low,other,17759,3917.791758
2,1977-10-10 12:00:00,san antonio,tx,us,other,30.0,30 seconds,i was about six or seven and my family and me ...,2/24/2005,29.423889,-98.493333,Very Low,other,17759,3917.791758
3,1978-10-10 02:00:00,elmont,ny,us,rectangle,300.0,5min,A memory I will never forget that happened men...,2/1/2007,40.700833,-73.713333,Very Low,other,17759,3917.791758
4,1979-10-10 00:00:00,poughkeepsie,ny,us,chevron,900.0,15 minutes,1/4 moon-like&#44 its &#39chord&#39 or flat s...,4/16/2005,41.700278,-73.921389,Very Low,other,17759,3917.791758
5,1984-10-10 12:00:00,traverse city,mi,us,other,120.0,couple minutes,translucent football seen over city airport,10/7/2003,44.763056,-85.620556,Very Low,other,17759,3917.791758
6,1984-10-10 22:00:00,white plains,ny,us,formation,20.0,15-20 seconds,Saw a hugh object in sky with lights intermitt...,8/10/1999,41.033889,-73.763333,Very Low,other,17759,3917.791758
7,1992-10-10 17:00:00,panama city,fl,us,formation,3600.0,1 hour(?),During a road trip to Panama City a friend and...,1/28/1999,30.158611,-85.660278,Very Low,other,17759,3917.791758
8,1992-10-10 20:15:00,seymour,tn,us,cigar,60.0,1min. 39s,Stationary Elongated UFO 200ft above vacant fi...,10/31/2008,35.890556,-83.724722,Very Low,other,17759,3917.791758
9,1993-10-10 23:00:00,carthage,tn,us,other,60.0,less than 1 min,1 object with green and red lights,3/21/2003,36.252222,-85.951667,Very Low,other,17759,3917.791758


If the key columns are named differently in each DataFrame, we can always use left_on and right_on:

In [181]:
ufos = ufos.merge(shape_country, how='inner', left_on=['country', 'new_shape'], right_on=['country', 'new_shape'])
ufos

Unnamed: 0,datetime,city,state,country,shape,duration_seconds,duration_hours_min,comments,date_posted,latitude,longitude,duration_seconds_interval,new_shape,total_views,mean_duration
0,1949-10-10 20:30:00,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111,Very Low,other,17759,3917.791758
1,1974-10-10 19:30:00,hudson,ma,us,other,2700.0,45 minutes,Not sure of the eact month or year of this sig...,8/10/1999,42.391667,-71.566667,Very Low,other,17759,3917.791758
2,1977-10-10 12:00:00,san antonio,tx,us,other,30.0,30 seconds,i was about six or seven and my family and me ...,2/24/2005,29.423889,-98.493333,Very Low,other,17759,3917.791758
3,1978-10-10 02:00:00,elmont,ny,us,rectangle,300.0,5min,A memory I will never forget that happened men...,2/1/2007,40.700833,-73.713333,Very Low,other,17759,3917.791758
4,1979-10-10 00:00:00,poughkeepsie,ny,us,chevron,900.0,15 minutes,1/4 moon-like&#44 its &#39chord&#39 or flat s...,4/16/2005,41.700278,-73.921389,Very Low,other,17759,3917.791758
5,1984-10-10 12:00:00,traverse city,mi,us,other,120.0,couple minutes,translucent football seen over city airport,10/7/2003,44.763056,-85.620556,Very Low,other,17759,3917.791758
6,1984-10-10 22:00:00,white plains,ny,us,formation,20.0,15-20 seconds,Saw a hugh object in sky with lights intermitt...,8/10/1999,41.033889,-73.763333,Very Low,other,17759,3917.791758
7,1992-10-10 17:00:00,panama city,fl,us,formation,3600.0,1 hour(?),During a road trip to Panama City a friend and...,1/28/1999,30.158611,-85.660278,Very Low,other,17759,3917.791758
8,1992-10-10 20:15:00,seymour,tn,us,cigar,60.0,1min. 39s,Stationary Elongated UFO 200ft above vacant fi...,10/31/2008,35.890556,-83.724722,Very Low,other,17759,3917.791758
9,1993-10-10 23:00:00,carthage,tn,us,other,60.0,less than 1 min,1 object with green and red lights,3/21/2003,36.252222,-85.951667,Very Low,other,17759,3917.791758


In [182]:
import numpy as np

ufos['mean_diff'] = np.abs(ufos['mean_duration'] - ufos['duration_seconds'])
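
If only the per-row difference is needed, the group mean can be broadcast back to each row with `groupby(...).transform('mean')`, which avoids building and merging a separate summary table. This is a sketch on a toy frame, given as an alternative to the merge used in this notebook.

```python
import pandas as pd

# Toy frame (hypothetical) with two shape/country groups
demo = pd.DataFrame({
    "country": ["us", "us", "gb", "gb"],
    "new_shape": ["light", "light", "disk", "disk"],
    "duration_seconds": [10.0, 30.0, 100.0, 300.0],
})

# transform() returns one value per original row, aligned with demo's index
group_mean = demo.groupby(["new_shape", "country"])["duration_seconds"].transform("mean")

# Absolute distance of each sighting from the mean of its own group
demo["mean_diff"] = (demo["duration_seconds"] - group_mean).abs()
print(demo)
```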

Since the mean was computed per shape and country, it makes sense to summarize the differences the same way.

In [189]:
duration = ufos.groupby(['new_shape', 'country'])[['mean_diff']].mean()
duration

Unnamed: 0_level_0,Unnamed: 1_level_0,mean_diff
new_shape,country,Unnamed: 2_level_1
circle,0,26529.06
circle,au,1468.612
circle,ca,803.9997
circle,de,3202.0
circle,gb,1121.683
circle,us,6307.845
disk,0,2043.261
disk,au,443.42
disk,ca,1549.464
disk,de,165.3333


In [191]:
duration['duration_min'] = duration['mean_diff'] / 60
duration

Unnamed: 0_level_0,Unnamed: 1_level_0,mean_diff,duration_min
new_shape,country,Unnamed: 2_level_1,Unnamed: 3_level_1
circle,0,26529.06,442.150929
circle,au,1468.612,24.476864
circle,ca,803.9997,13.399995
circle,de,3202.0,53.366667
circle,gb,1121.683,18.694724
circle,us,6307.845,105.130748
disk,0,2043.261,34.054352
disk,au,443.42,7.390333
disk,ca,1549.464,25.824395
disk,de,165.3333,2.755556
