### Now that you have learned all about preprocessing, you will try these techniques on a dataset that records information about UFO sightings.

Each row of this dataset contains information such as the location, the type of sighting, the number of seconds and minutes the sighting lasted, a description of the sighting, and the date on which the sighting was recorded.

In [1]:
import pandas as pd
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

ovnis = pd.read_csv("https://assets.datacamp.com/production/repositories/1816/datasets/a5ebfe5d2ed194f2668867603b563963af4769e9/ufo_sightings_large.csv")
ovnis.head()

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long
0,11/3/2011 19:21,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111
1,10/3/2004 19:05,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556
2,9/25/2009 21:00,coon rapids,mn,us,cigar,0.0,,Green&#44 red&#44 and blue pulses of light tha...,12/12/2009,45.12,-93.2875
3,11/21/2002 05:45,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222
4,8/19/2010 12:55,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,-114.083333


In [2]:
# Missing data: first make sure the columns have the right types
ovnis["seconds"] = ovnis["seconds"].astype(float)
ovnis["date"] = pd.to_datetime(ovnis["date"])
print(ovnis.dtypes)
ovnis.head()

date              datetime64[ns]
city                      object
state                     object
country                   object
type                      object
seconds                  float64
length_of_time            object
desc                      object
recorded                  object
lat                       object
long                     float64
dtype: object


Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long
0,2011-11-03 19:21:00,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111
1,2004-10-03 19:05:00,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556
2,2009-09-25 21:00:00,coon rapids,mn,us,cigar,0.0,,Green&#44 red&#44 and blue pulses of light tha...,12/12/2009,45.12,-93.2875
3,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222
4,2010-08-19 12:55:00,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,-114.083333


In [4]:
print(ovnis[["length_of_time", "state", "type"]].isnull().sum())
print(ovnis.shape)

length_of_time    143
state             419
type              159
dtype: int64
(4935, 11)


**Dropping missing data.**

Let's remove some of the rows where certain columns have missing values.

We will look at the length_of_time column, the state column, and the type column.

If any of the values in these columns is missing, we drop the row. The filter below also drops sightings recorded with a duration of 0 seconds.

In [5]:
# Parenthesize the comparison: & binds more tightly than !=.
ovnis_no_missing = ovnis[ovnis["length_of_time"].notnull() & ovnis["state"].notnull() & ovnis["type"].notnull() & (ovnis["seconds"] != 0)]

print(ovnis_no_missing.shape)
print(ovnis_no_missing[["length_of_time", "state", "type"]].isnull().sum())

(4096, 11)
length_of_time    0
state             0
type              0
dtype: int64
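
The same filtering can also be written with `dropna(subset=...)`. The following is a minimal sketch on a toy frame whose values are made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Toy frame (hypothetical values) with the same column names as the notebook.
toy = pd.DataFrame({
    "length_of_time": ["2 weeks", None, "10 minutes", "2"],
    "state": ["wi", "mn", None, "ab"],
    "type": ["unknown", "cigar", "light", "oval"],
    "seconds": [1209600.0, 0.0, 600.0, 0.0],
})

# Drop rows with a missing value in any of the three columns.
cleaned = toy.dropna(subset=["length_of_time", "state", "type"])
# Then drop zero-second sightings; .copy() returns an independent frame.
cleaned = cleaned[cleaned["seconds"] != 0].copy()

print(cleaned.shape)
```

Only the first toy row survives: the second and third have a missing value, and the fourth has a duration of 0 seconds.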


**Categorical variables and standardization.**

In [6]:
def return_minutes(time_string):
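    # re.match only matches at the start of the string, so "about 5 minutes" yields None;
    # the unit is ignored, so "2 weeks" gives 2 and "30sec." gives 30.
    pattern = re.compile(r"\d+")
    num = re.match(pattern, time_string)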
    if num is not None:
        return int(num.group(0))
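
Because `re.match` is anchored at the start of the string, durations such as "about 5 minutes" produce no match. If the goal were to pick up the first number anywhere in the text, `re.search` could be used instead. This is a sketch of that alternative, not the notebook's method; the function name `first_number` is hypothetical.

```python
import re

def first_number(time_string):
    """Return the first run of digits found anywhere in the string, else None."""
    found = re.search(r"\d+", time_string)
    return int(found.group(0)) if found is not None else None

print(first_number("about 5 minutes"))  # 5
print(first_number("30sec."))           # 30
print(first_number("a while"))          # None
```

Like the original function, this sketch still ignores the unit, so the result is not necessarily a number of minutes.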

In [7]:
ovnis_no_missing["minutes"] = ovnis_no_missing["length_of_time"].apply(return_minutes)
# fillna(0) applies to the whole frame, not only to the "minutes" column.
ovnis_no_missing = ovnis_no_missing.fillna(0)

print(ovnis_no_missing[["length_of_time", "minutes"]].head())

              length_of_time  minutes
0                    2 weeks      2.0
1                     30sec.     30.0
3            about 5 minutes      0.0
5                 10 minutes     10.0
6  total? maybe around 10 mi      0.0


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ovnis_no_missing["minutes"] = ovnis_no_missing["length_of_time"].apply(return_minutes)
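
The warning above appears when a column is assigned on a frame that pandas considers a slice of another frame. One common way to avoid it is to take an explicit copy when filtering. The sketch below uses a toy frame with invented values, and the column `has_digits` is purely illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "length_of_time": ["2 weeks", None, "10 minutes"],
    "seconds": [1209600.0, 0.0, 600.0],
})

# An explicit copy makes the filtered frame independent of `toy`,
# so adding a column to it is unambiguous.
subset = toy[toy["length_of_time"].notnull()].copy()
subset["has_digits"] = subset["length_of_time"].str.contains(r"\d")

print(subset)
```

The original frame `toy` is left unchanged; only `subset` gains the new column.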


In [8]:
print(ovnis_no_missing[["seconds", "minutes"]].var())

ovnis_no_missing["seconds_log"] = np.log(ovnis_no_missing["seconds"])

print(ovnis_no_missing["seconds_log"].var())

seconds    1.615640e+10
minutes    2.341135e+02
dtype: float64
4.864907056671104
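
The effect of the log transform on variance can be seen on a handful of values. This is a small sketch with made-up durations; note that `np.log(0)` is `-inf`, so the transform only gives finite results when zero-second rows have been removed first.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in seconds, including one very large value.
seconds = pd.Series([30.0, 300.0, 600.0, 1209600.0])
seconds_log = np.log(seconds)

print(seconds.var())      # dominated by the single large value
print(seconds_log.var())  # far smaller after the log transform
```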


Now that we have taken care of some of the simpler preprocessing tasks, it is time to engineer new features.

**We have several fields that are excellent candidates for feature engineering.**

- country: relabel as 1 if the sighting is in the US, otherwise 0.
- type: apply get_dummies to create new columns from this field.
- date: extract the month and year.

In [9]:
ovnis_no_missing["country_enc"] = ovnis_no_missing["country"].apply(lambda val: 1 if val == "us" else 0)

print(len(ovnis_no_missing["type"].unique()))

21
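
The binary country encoding above can also be expressed without `apply`, as a vectorized comparison. A minimal sketch on a toy Series with invented values:

```python
import pandas as pd

country = pd.Series(["us", "ca", None, "us"])

# eq("us") yields a boolean Series (missing values compare as False);
# astype(int) turns True/False into 1/0.
country_enc = country.eq("us").astype(int)

print(country_enc.tolist())
```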


In [10]:
ovnis_no_missing.head()

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long,minutes,seconds_log,country_enc
0,2011-11-03 19:21:00,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111,2.0,14.0058,1
1,2004-10-03 19:05:00,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556,30.0,3.401197,1
3,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222,0.0,5.703782,1
5,2012-06-16 23:00:00,san diego,ca,us,light,600.0,10 minutes,Dancing lights that would fly around and then ...,7/4/2012,32.7152778,-117.156389,10.0,6.39693,1
6,2009-07-12 21:30:00,duluth,mn,us,oval,600.0,total? maybe around 10 mi,A minor amber color trail&#44 (from where we w...,3/13/2012,46.7833333,-92.106389,0.0,6.39693,1


In [11]:
ovnis_no_missing["type"].value_counts()

light        894
triangle     422
circle       396
fireball     319
unknown      292
other        285
sphere       275
disk         269
oval         189
formation    130
changing     112
cigar         85
flash         79
cylinder      72
rectangle     66
diamond       63
chevron       44
teardrop      44
egg           29
cone          17
cross         14
Name: type, dtype: int64

In [12]:
type_set = pd.get_dummies(ovnis_no_missing["type"])

ovnis_no_missing = pd.concat([ovnis_no_missing, type_set], axis=1)

ovnis_no_missing.head()

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,...,flash,formation,light,other,oval,rectangle,sphere,teardrop,triangle,unknown
0,2011-11-03 19:21:00,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,...,0,0,0,0,0,0,0,0,0,1
1,2004-10-03 19:05:00,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,...,0,0,0,0,0,0,0,0,0,0
3,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,...,0,0,0,0,0,0,0,0,1,0
5,2012-06-16 23:00:00,san diego,ca,us,light,600.0,10 minutes,Dancing lights that would fly around and then ...,7/4/2012,32.7152778,...,0,0,1,0,0,0,0,0,0,0
6,2009-07-12 21:30:00,duluth,mn,us,oval,600.0,total? maybe around 10 mi,A minor amber color trail&#44 (from where we w...,3/13/2012,46.7833333,...,0,0,0,0,1,0,0,0,0,0
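
The output above shows 0/1 indicator columns. In pandas 2.x, `get_dummies` returns boolean columns by default, so passing `dtype=int` is one way to keep the 0/1 form. A short sketch on a toy Series with invented values:

```python
import pandas as pd

types = pd.Series(["light", "triangle", "circle", "light"], name="type")

# One indicator column per category; dtype=int forces 0/1 instead of booleans.
dummies = pd.get_dummies(types, dtype=int)

print(dummies)
```

Each row has exactly one 1, in the column matching its category.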


Month and year from the date column.

In [13]:
print(ovnis_no_missing["date"].head())

ovnis_no_missing["month"] = ovnis_no_missing["date"].apply(lambda row: row.month)
ovnis_no_missing["year"] = ovnis_no_missing["date"].apply(lambda row: row.year)

print(ovnis_no_missing[["date", "month", "year"]].head())

0   2011-11-03 19:21:00
1   2004-10-03 19:05:00
3   2002-11-21 05:45:00
5   2012-06-16 23:00:00
6   2009-07-12 21:30:00
Name: date, dtype: datetime64[ns]
                 date  month  year
0 2011-11-03 19:21:00     11  2011
1 2004-10-03 19:05:00     10  2004
3 2002-11-21 05:45:00     11  2002
5 2012-06-16 23:00:00      6  2012
6 2009-07-12 21:30:00      7  2009
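
For a datetime column, the `.dt` accessor gives the same month and year values without a row-wise `apply`. A minimal sketch with two example timestamps written in the dataset's date format:

```python
import pandas as pd

frame = pd.DataFrame({"date": ["11/3/2011 19:21", "6/16/2012 23:00"]})
# An explicit format avoids any ambiguity between month-first and day-first.
frame["date"] = pd.to_datetime(frame["date"], format="%m/%d/%Y %H:%M")

frame["month"] = frame["date"].dt.month
frame["year"] = frame["date"].dt.year

print(frame)
```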


Text vectorization.

In [14]:
print(ovnis_no_missing["desc"].head())

vec = TfidfVectorizer()
desc_tfidf = vec.fit_transform(ovnis_no_missing["desc"])
# Invert vocabulary_ (word -> column index) into column index -> word.
vocab = {v: k for k, v in vec.vocabulary_.items()}

print(desc_tfidf.shape)

0    Red blinking objects similar to airplanes or s...
1                 Many fighter jets flying towards UFO
3    It was a large&#44 triangular shaped flying ob...
5    Dancing lights that would fly around and then ...
6    A minor amber color trail&#44 (from where we w...
Name: desc, dtype: object
(4096, 5579)
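
To make the shape above concrete: the TF-IDF matrix has one row per description and one column per distinct token. The sketch below uses three invented descriptions; the names `docs` and `index_to_word` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up descriptions with ten distinct words in total.
docs = [
    "red blinking lights",
    "many fighter jets flying",
    "white spinning disc",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps word -> column index; invert it to read columns as words.
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

print(tfidf.shape)
print(sorted(index_to_word[i] for i in tfidf[0].indices))
```

The first row has non-zero weights only for the words that occur in the first description.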


### Final feature selection.

- Redundant features.

- Text vector.


In [16]:
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Sort by weight and keep the n most relevant words.
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    # Map the words back to their column indices in the TF-IDF matrix.
    return [original_vocab[i] for i in zipped_index]

In [19]:
print(ovnis_no_missing[["seconds", "seconds_log", "minutes"]].corr())

# Drop raw and redundant columns; seconds_log is kept in place of seconds and minutes.
to_drop = ["city", "country", "date", "desc", "lat", "length_of_time", "long", "minutes", "recorded", "seconds", "state"]

ovnis_no_missing_dropped = ovnis_no_missing.drop(to_drop, axis=1)

# Weighted words
#print(return_weights(vocab, vec.vocabulary_, desc_tfidf, 0, 3))

def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)

    return set(filter_list)

#filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

#filtered_words

              seconds  seconds_log   minutes
seconds      1.000000     0.174331 -0.021512
seconds_log  0.174331     1.000000  0.117108
minutes     -0.021512     0.117108  1.000000
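
The set of column indices collected by a function like `words_to_filter` can be used to keep only those columns of the TF-IDF matrix. The following is a hedged sketch of that idea on three invented descriptions; the helper `top_columns` is hypothetical and simpler than the notebook's `return_weights`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up descriptions: "light" occurs in every one, so its weight is lowest.
docs = [
    "bright light hovering",
    "light moving fast",
    "strange light pulsing",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)

def top_columns(matrix, row_index, top_n):
    """Return the column indices of the top_n largest weights in one row."""
    row = matrix[row_index]
    order = row.data.argsort()[::-1][:top_n]
    return [int(row.indices[i]) for i in order]

# Collect the top 2 columns of every document, then subset the matrix.
keep = set()
for i in range(tfidf.shape[0]):
    keep.update(top_columns(tfidf, i, 2))

keep = sorted(keep)
reduced = tfidf[:, keep]

print(reduced.shape)
```

In this toy case the shared word "light" is never among a document's top two weights, so its column is filtered out and six of the seven columns remain.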
