This notebook covers feature engineering.

We explore different ways to create new, more useful features from the ones already in your dataset.

You will see how to encode, aggregate, and extract information from numeric and text features.

*Feature engineering*: the creation of new features based on existing ones.

- It adds new, useful information.

- It sheds light on the relationships between features.

There are automated ways to create new features, but we cover the manual methods first.

There are a variety of scenarios in which you can write functions to create new features.

- Text data: convert the text into a word vector -> count the frequencies.

- String data: a column holding the favorite color as text -> converted to RGB.

- Date columns: a column of dates -> compute time differences and label them by group.
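The string and date scenarios in the list above can be sketched as follows. This is a minimal illustration on made-up data: the column names (`favorite_color`, `start_date`, `end_date`) and the `rgb_lookup` table are assumptions and do not come from the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "favorite_color": ["red", "green", "blue"],
    "start_date": ["January 13 2011", "January 14 2011", "February 2 2011"],
    "end_date": ["January 20 2011", "January 14 2011", "March 4 2011"],
})

# String column -> RGB: look up each color name in a small hand-written table.
rgb_lookup = {"red": (255, 0, 0), "green": (0, 128, 0), "blue": (0, 0, 255)}
rgb = pd.DataFrame(
    df["favorite_color"].map(rgb_lookup).tolist(),
    columns=["r", "g", "b"],
    index=df.index,
)
df = df.join(rgb)

# Date columns -> elapsed days between the two dates, plus a month label.
start = pd.to_datetime(df["start_date"], format="%B %d %Y")
end = pd.to_datetime(df["end_date"], format="%B %d %Y")
df["duration_days"] = (end - start).dt.days
df["start_month"] = start.dt.month

print(df[["r", "g", "b", "duration_days", "start_month"]])
```

The month label is one possible way to group dates; quarters or weekdays would work the same way through the `.dt` accessor.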

In [53]:
import pandas as pd

volunteer = pd.read_csv("https://assets.datacamp.com/production/repositories/1816/datasets/668b96955d8b252aa8439c7602d516634e3f015e/volunteer_opportunities.csv").fillna('')

volunteer.loc[:,["title", "created_date", "category_desc","status","recurrence_type"]].head()

index,title,created_date,category_desc,status,recurrence_type
0,Volunteers Needed For Rise Up & Stay Put! Home...,January 13 2011,,approved,onetime
1,Web designer,January 14 2011,Strengthening Communities,approved,onetime
2,Urban Adventures - Ice Skating at Lasker Rink,January 19 2011,Strengthening Communities,approved,onetime
3,Fight global hunger and support women farmers ...,January 21 2011,Strengthening Communities,approved,ongoing
4,Stop 'N' Swap,January 28 2011,Environment,approved,onetime


**LabelEncoder and get_dummies**

In [54]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Encode with apply and a lambda: 1 for "ongoing", 0 otherwise
volunteer["recurrence_type_enc"] = volunteer["recurrence_type"].apply(lambda val: 1 if val == "ongoing" else 0)

# Encode with LabelEncoder (each fit_transform call refits the same encoder)
volunteer["recurrence_type_le"] = le.fit_transform(volunteer["recurrence_type"])
volunteer["category_desc_dummy"] = le.fit_transform(volunteer["category_desc"])

volunteer.loc[:,["category_desc","category_desc_dummy","recurrence_type", "recurrence_type_enc", "recurrence_type_le"]].value_counts()

category_desc              category_desc_dummy  recurrence_type  recurrence_type_enc  recurrence_type_le
Strengthening Communities  6                    ongoing          1                    1                     227
Helping Neighbors in Need  5                    ongoing          1                    1                      99
Education                  1                    ongoing          1                    1                      84
Strengthening Communities  6                    onetime          0                    0                      80
Health                     4                    ongoing          1                    1                      42
                           0                    ongoing          1                    1                      34
Helping Neighbors in Need  5                    onetime          0                    0                      20
Environment                3                    onetime          0                    0                      17
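To see which integer `LabelEncoder` assigns to each label, you can inspect its `classes_` attribute after fitting. The sketch below uses a small made-up list that mirrors the `recurrence_type` values; the variable names are illustrative. Because the classes are sorted, `onetime` gets 0 and `ongoing` gets 1, and an empty string would sort first and get 0, which agrees with the counts shown above.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values mirroring the recurrence_type column.
recurrence = ["onetime", "ongoing", "onetime", "ongoing"]

encoder = LabelEncoder()
codes = encoder.fit_transform(recurrence)

# classes_ is sorted; the position of each class is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
print(encoder.inverse_transform(codes).tolist())
```

Since refitting an encoder replaces its `classes_`, keeping one encoder per column is a reasonable choice when you need to decode the values later.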

In [55]:
# Turn row values into columns (one-hot encoding).
pd.get_dummies(volunteer["category_desc"])

index,(empty),Education,Emergency Preparedness,Environment,Health,Helping Neighbors in Need,Strengthening Communities
0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,1
2,0,0,0,0,0,0,1
3,0,0,0,0,0,0,1
4,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...
660,0,0,0,0,0,1,0
661,0,0,0,0,0,0,1
662,0,0,0,0,0,1,0
663,0,0,0,0,0,0,1
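The empty category (from `fillna('')`) becomes its own column with a blank name, which is easy to miss. A minimal sketch of one way to make the generated columns easier to read, using the `prefix` and `dtype` parameters of `pd.get_dummies` on made-up sample values:

```python
import pandas as pd

# Hypothetical sample mirroring category_desc, including one empty value.
categories = pd.Series(["", "Education", "Health", "Education"], name="category_desc")

# prefix labels the new columns; dtype=int forces 0/1 instead of booleans.
dummies = pd.get_dummies(categories, prefix="cat", dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

Each row has exactly one column set to 1, so the row sums can serve as a quick sanity check.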


In [40]:
# Aggregate statistics: df["mean"] = df.apply(lambda row: row[columns].mean(), axis=1)
hiking = pd.read_json("https://assets.datacamp.com/production/repositories/1816/datasets/4f26c48451bdbf73db8a58e226cd3d6b45cf7bb5/hiking.json")
hiking = hiking.fillna('')
hiking.head()

index,Prop_ID,Name,Location,Park_Name,Length,Difficulty,Other_Details,Accessible,Limited_Access,lat,lon
0,B057,Salt Marsh Nature Trail,"Enter behind the Salt Marsh Nature Center, loc...",Marine Park,0.8 miles,,<p>The first half of this mile-long trail foll...,Y,N,,
1,B073,Lullwater,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,1.0 mile,Easy,Explore the Lullwater to see how nature thrive...,N,N,,
2,B073,Midwood,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.75 miles,Easy,Step back in time with a walk through Brooklyn...,N,N,,
3,B073,Peninsula,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Discover how the Peninsula has changed over th...,N,N,,
4,B073,Waterfall,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Trace the source of the Lake on the Waterfall ...,N,N,,
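The aggregate-statistics pattern from the comment at the top of this cell can be tried on a small frame. The data and column names below (`run1`, `run2`, `run3`) are made up for illustration; the hiking dataset itself has no such columns.

```python
import pandas as pd

# Hypothetical running times; the column names are illustrative only.
runs = pd.DataFrame({
    "run1": [10.0, 12.0],
    "run2": [11.0, 15.0],
    "run3": [12.0, 18.0],
})
columns = ["run1", "run2", "run3"]

# Row-wise mean with apply, following the pattern in the comment.
runs["mean"] = runs.apply(lambda row: row[columns].mean(), axis=1)

# Equivalent vectorized form, usually faster on large frames.
runs["mean_vectorized"] = runs[columns].mean(axis=1)

print(runs)
```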


**Extracting numbers from text**

In [44]:
import re

def return_mileage(length):
    # Match a decimal number such as "0.8" at the start of the string
    pattern = re.compile(r"\d+\.\d+")
    mile = pattern.match(length)

    # Return the number as a float, or None when there is no match
    if mile is not None:
        return float(mile.group(0))
    return None

hiking["Length_num"] = hiking["Length"].apply(return_mileage)
print(hiking[["Length", "Length_num"]].head())

       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50
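A vectorized alternative is `Series.str.extract`, sketched below on made-up values. Two differences from `return_mileage` are assumptions of this sketch rather than part of the original: the decimal part is optional, so whole numbers such as "2 miles" also match, and `str.extract` searches anywhere in the string instead of only at the start.

```python
import pandas as pd

# Hypothetical sample values in the style of the Length column.
lengths = pd.Series(["0.8 miles", "1.0 mile", "0.75 miles", "2 miles", ""])

# Optional decimal part so whole numbers such as "2 miles" also match.
# Rows without a match become NaN.
length_num = lengths.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)

print(length_num)
```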


**Converting text into a frequency vector -> tf-idf**

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Vectorize the titles into tf-idf weights
title_text = volunteer["title"]
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(title_text)

# Split the data, stratifying on the target category
y = volunteer["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)

# GaussianNB needs a dense array, hence toarray() above
nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))

0.46706586826347307


In [63]:
print(text_tfidf[0])

  (0, 375)	0.3163915503784279
  (0, 855)	0.38192461589865456
  (0, 493)	0.3405778550191958
  (0, 822)	0.38192461589865456
  (0, 959)	0.38192461589865456
  (0, 1061)	0.25544926998167106
  (0, 869)	0.38192461589865456
  (0, 404)	0.15529778130809513
  (0, 690)	0.24072387702158726
  (0, 1086)	0.2304728774077965
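Each printed entry has the form `(row, column index)` followed by the tf-idf weight, so the column indices need to be mapped back to words to be readable. A minimal sketch using the fitted vectorizer's `vocabulary_` attribute, on made-up titles standing in for `volunteer["title"]`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles standing in for volunteer["title"].
titles = ["Web designer", "Stop 'N' Swap", "Web volunteers needed"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(titles)

# vocabulary_ maps word -> column index; invert it to go from index to word.
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

# Pair each nonzero weight in the first row with its word.
row = tfidf[0:1]
weights = {index_to_word[i]: float(w) for i, w in zip(row.indices, row.data)}

print(weights)
```

In this sample "web" appears in two titles and "designer" in only one, so "designer" receives the higher weight in the first row.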
