## Dataset Processing

Building on the variable-processing techniques from the previous section, we will transform a raw dataset into one that can be used to train predictive models.

#### Data ingestion

We load the libraries we will use and the original dataset.


In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn import feature_extraction


In [5]:
datos = pd.read_csv("../../RESOURCES/datos_procesamiento.csv")
datos.head()

| | col_inexistente1 | col2 | col3 | col_outliers | col_outliers2 | col_categorica | col_ordinal | col_texto |
|---|---|---|---|---|---|---|---|---|
| 0 | 59.0 | 52.0 | 2.232832 | -50 | 0.771666 | ratón | muy bien | Tenía en su casa una ama que pasaba de los cua... |
| 1 | 31.0 | 74.0 | 0.906147 | -5 | 1.068558 | elefante | regular | El resto della concluían sayo de velarte, calz... |
| 2 | 81.0 | 28.0 | 0.62675 | -32 | 0.846396 | ratón | muy mal | El resto della concluían sayo de velarte, calz... |
| 3 | 34.0 | 16.0 | 0.816738 | -84 | 0.637381 | gato | mal | Una olla de algo más vaca que carnero, salpicó... |
| 4 | 32.0 | 28.0 | 0.571131 | 65 | 4.540614 | gato | bien | Tenía en su casa una ama que pasaba de los cua... |


### Dataset transformation

#### Separating the variables

In [6]:
# Group the columns by type; col_ordinal is encoded separately at the end
col_numericas = ['col_inexistente1', 'col2', 'col3', 'col_outliers', 'col_outliers2']
col_categorica = ['col_categorica']
col_texto = ['col_texto']

#### Numeric variables


In [9]:
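from sklearn.impute import SimpleImputer  # preprocessing.Imputer was removed in scikit-learn 0.22

# Fill missing values with the column mean, then standardize each column
imputador = SimpleImputer(strategy="mean")
escalador = preprocessing.StandardScaler()
var_numericas_imputadas_escalado_standard = escalador.fit_transform(
    imputador.fit_transform(datos[col_numericas])
)
df_numerico_procesado = pd.DataFrame(var_numericas_imputadas_escalado_standard,
                                     columns=col_numericas)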
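
The imputation and scaling steps above can also be chained in a scikit-learn `Pipeline`, which keeps the fitted statistics together so they can be reapplied to new data. The following is a minimal sketch on a small made-up frame; `datos_ejemplo`, its values, and the name `pipeline_numerico` are illustrative assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical) with one missing value per column
datos_ejemplo = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 3.0],
    "b": [10.0, np.nan, 30.0, 20.0],
})

# Step 1 fills missing values with the column mean; step 2 standardizes
pipeline_numerico = Pipeline([
    ("imputador", SimpleImputer(strategy="mean")),
    ("escalador", StandardScaler()),
])

resultado = pd.DataFrame(
    pipeline_numerico.fit_transform(datos_ejemplo),
    columns=datos_ejemplo.columns,
)
print(resultado)
```

Because the imputed cells receive the column mean, they end up at zero after standardization. Calling `pipeline_numerico.transform(...)` on new rows would reuse the means and standard deviations learned here.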



#### Categorical variables

In [10]:
# OneHotEncoder accepts string categories directly, so no LabelEncoder step is needed
oh_codificador = preprocessing.OneHotEncoder()
categorias_oh_codificadas = oh_codificador.fit_transform(datos[col_categorica]).toarray()

# categories_[0] holds the sorted labels of the single encoded column
df_categorico_procesado = pd.DataFrame(categorias_oh_codificadas,
                                       columns=oh_codificador.categories_[0])


#### Text variables


In [13]:
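# Each text becomes a row of TF-IDF weights, with one column per vocabulary term
vectorizador_tfidf = feature_extraction.text.TfidfVectorizer()
texto_vectorizado = vectorizador_tfidf.fit_transform(datos.col_texto)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df_texto_procesado = pd.DataFrame(texto_vectorizado.toarray(),
                                  columns=vectorizador_tfidf.get_feature_names_out())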
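
To make the shape of the TF-IDF output easier to picture, here is a small sketch on two made-up sentences. The texts and the names `textos_ejemplo` and `df_ejemplo` are assumptions for illustration only. The column names are read from the fitted `vocabulary_` attribute so the snippet does not depend on a particular scikit-learn version.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Two short, hypothetical documents
textos_ejemplo = ["una olla de algo", "una ama en su casa"]

vectorizador = TfidfVectorizer()
matriz = vectorizador.fit_transform(textos_ejemplo)  # sparse matrix: documents x terms

# vocabulary_ maps each term to its column index; sort the terms by that index
terminos = sorted(vectorizador.vocabulary_, key=vectorizador.vocabulary_.get)

df_ejemplo = pd.DataFrame(matriz.toarray(), columns=terminos)
print(df_ejemplo.round(3))
```

Each row is one document and each column one term of the learned vocabulary. With the default settings every row is scaled to unit Euclidean length, and a term that does not appear in a document gets a weight of zero.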

#### Export and final sample

In [16]:

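# Join the numeric, one-hot and TF-IDF blocks side by side (they share the same row index)
datos_procesados = pd.concat([
    df_numerico_procesado,
    df_categorico_procesado,
    df_texto_procesado
], axis=1)

# Note: LabelEncoder assigns integer codes in alphabetical order of the labels
label_codificador_ordinal = preprocessing.LabelEncoder()
datos_procesados['col_ordinal'] = label_codificador_ordinal.fit_transform(datos.col_ordinal)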
datos_procesados.head()

| | col_inexistente1 | col2 | col3 | col_outliers | col_outliers2 | elefante | gato | perro | ratón | acordarme | ... | vaca | veinte | velarte | vellori | velludo | verdad | verosímiles | viernes | vivía | col_ordinal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.399217 | 0.082807 | 0.442819 | -0.6946 | -0.038365 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.204745 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 1 | -0.653605 | 0.861333 | -0.32339 | -0.118466 | -0.038278 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.181842 | 0.181842 | 0.181842 | 0.0 | 0.0 | 0.0 | 0.0 | 4 |
| 2 | 1.226435 | -0.766494 | -0.484752 | -0.464146 | -0.038343 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.181842 | 0.181842 | 0.181842 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
| 3 | -0.540803 | -1.191145 | -0.375028 | -1.129901 | -0.038405 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.194272 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.194272 | 0.0 | 1 |
| 4 | -0.616004 | -0.766494 | -0.516874 | 0.777743 | -0.037257 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.204745 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
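
In the output above, `col_ordinal` is coded alphabetically (`bien` is 0 and `regular` is 4), so the integers do not follow the natural ranking of the labels. If a model should see that ranking, one option is `OrdinalEncoder` with an explicit category order. The sketch below assumes the order runs from `muy mal` (worst) to `muy bien` (best) and that these five labels are the only ones present; both are assumptions, since the notebook does not state them.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed ranking, worst to best (not stated in the original notebook)
orden_asumido = ["muy mal", "mal", "regular", "bien", "muy bien"]

# The five values shown by datos.head(), in the same row order
ejemplo = pd.DataFrame(
    {"col_ordinal": ["muy bien", "regular", "muy mal", "mal", "bien"]}
)

# Codes follow the position of each label in orden_asumido
codificador_ordinal = OrdinalEncoder(categories=[orden_asumido])
codigos = codificador_ordinal.fit_transform(ejemplo[["col_ordinal"]]).ravel()
print(codigos.tolist())  # [4.0, 2.0, 0.0, 1.0, 3.0]
```

With this encoding a larger code always means a better rating. By default, a label outside the assumed list would raise an error at fit time, which helps catch unexpected values.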


In [17]:
datos_procesados.to_csv("../../RESOURCES/dataset_procesado.csv")
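
By default `to_csv` also writes the DataFrame index as a first column with no header, which matters when the file is loaded again. The sketch below uses a tiny made-up frame and an in-memory buffer instead of a real file; the names `df_ejemplo` and `buffer` are illustrative assumptions.

```python
import io
import pandas as pd

# Small hypothetical frame standing in for the processed dataset
df_ejemplo = pd.DataFrame({"x": [0.5, 1.5], "y": [1.0, 0.0]})

buffer = io.StringIO()
df_ejemplo.to_csv(buffer)  # the index is written as an unnamed first column

# Reading without index_col keeps that column as regular data
buffer.seek(0)
recargado_sin_indice = pd.read_csv(buffer)

# Reading with index_col=0 restores it as the index
buffer.seek(0)
recargado = pd.read_csv(buffer, index_col=0)

print(list(recargado_sin_indice.columns))
print(list(recargado.columns))
```

Passing `index_col=0` when reading, or `index=False` when writing, avoids carrying the old index along as an extra feature column.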