# DiploDatos Mentorship - Predicting the User's Next Purchase
## Practical 2 - Exploratory Data Analysis and Data Curation
### Instructions:
This practical explores and curates the data and starts on some basic
Machine Learning tasks, with the aim of preparing the dataset
for the main tasks that follow.

The objectives of this part are:
* Split the problem by site (Brazil (MLB) and Mexico (MLM))
* Split the dataset into train, validation, and test sets.
* For the text fields, apply the following preprocessing steps:
    * Lowercase everything
    * Tokenize
    * Remove stopwords
    * Remove punctuation

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
import string
import nltk
nltk.download('punkt')
import os
from datetime import datetime

[nltk_data] Downloading package punkt to /home/gv/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## We use the dataset from milestone 1, which consisted of 3000 rows of the original file, so that loading it is not too heavy. When the time comes, it is enough to swap the .csv file for the full one.

In [2]:
views_data = pd.read_csv('mlDCchunk.csv')
views_data.head(3)

Unnamed: 0.1,Unnamed: 0,row_id,item_bought,user_view,timestamps,bought_title,bought_domain_id,bought_price,bought_category_id,bought_condition,title,domain_id,price,category_id,condition,view_count
0,0,0,1748830,1786148,1571495142,Relógio Medidor Inteligente Pulso Freqüência C...,MLB-SMARTWATCHES,90.0,MLB135384,0.0,Relógio Inteligente Smartwatch Gt08 Touch Scre...,MLB-SMARTWATCHES,119.99,MLB135384,0.0,18
1,1,0,1748830,1786148,1571495157,Relógio Medidor Inteligente Pulso Freqüência C...,MLB-SMARTWATCHES,90.0,MLB135384,0.0,Relógio Inteligente Smartwatch Gt08 Touch Scre...,MLB-SMARTWATCHES,119.99,MLB135384,0.0,18
2,2,0,1748830,1615991,1571495246,Relógio Medidor Inteligente Pulso Freqüência C...,MLB-SMARTWATCHES,90.0,MLB135384,0.0,Mochila Galáxia Compartimento P/ Laptop,MLB-SMARTWATCHES,79.71,MLB135384,0.0,18
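
The `Unnamed: 0` and `Unnamed: 0.1` columns in the preview are what pandas produces when a CSV carries a saved index column with an empty header. A minimal sketch of dropping such columns right after loading, assuming they hold nothing useful; the toy CSV text below is a made-up stand-in for `mlDCchunk.csv`:

```python
import io
import pandas as pd

# Toy CSV standing in for mlDCchunk.csv (hypothetical content).
# The empty first header cell is what pandas names "Unnamed: 0" on load.
csv_text = """,row_id,item_bought,title
0,0,1748830,Relógio Inteligente
1,0,1748830,Mochila Galáxia
"""

raw = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns whose name does not start with "Unnamed"
clean = raw.loc[:, ~raw.columns.str.startswith('Unnamed')]

print(list(raw.columns))
print(list(clean.columns))
```

Whether to drop these columns is a judgment call: they are harmless for the steps that follow, but they add noise to every preview.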


## Splitting the problem by site (Brazil (MLB) and Mexico (MLM))
vd_Bra: the subset of views_data for Brazil only

vd_Mex: the subset for Mexico only

In [3]:
vd_Bra = views_data[views_data['bought_domain_id'].str.contains('MLB')]
vd_Mex = views_data[views_data['bought_domain_id'].str.contains('MLM')]

We check that neither new dataset contains rows from the other site. Each subset is filtered for the other site's prefix and summed; an all-zero result means the filter matched no rows.

In [4]:
vd_Bra[vd_Bra['bought_domain_id'].str.contains('MLM')].sum()

Unnamed: 0            0.0
row_id                0.0
item_bought           0.0
user_view             0.0
timestamps            0.0
bought_title          0.0
bought_domain_id      0.0
bought_price          0.0
bought_category_id    0.0
bought_condition      0.0
title                 0.0
domain_id             0.0
price                 0.0
category_id           0.0
condition             0.0
view_count            0.0
dtype: float64

In [5]:
vd_Mex[vd_Mex['bought_domain_id'].str.contains('MLB')].sum()

Unnamed: 0            0.0
row_id                0.0
item_bought           0.0
user_view             0.0
timestamps            0.0
bought_title          0.0
bought_domain_id      0.0
bought_price          0.0
bought_category_id    0.0
bought_condition      0.0
title                 0.0
domain_id             0.0
price                 0.0
category_id           0.0
condition             0.0
view_count            0.0
dtype: float64
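
The same check can be written more directly by comparing the site prefix of every row. This is only a sketch on a toy frame: the `MLM-CELLPHONES` and `MLB-BACKPACKS` domain names and the prices attached to them are made up for illustration, and taking the first three characters is an assumption that is stricter than `str.contains`, which matches the prefix anywhere in the string.

```python
import pandas as pd

# Toy data; only MLB-SMARTWATCHES comes from the real preview
toy = pd.DataFrame({
    'bought_domain_id': ['MLB-SMARTWATCHES', 'MLM-CELLPHONES', 'MLB-BACKPACKS'],
    'bought_price': [90.0, 1500.0, 79.71],
})

# The site code is the first three characters of the domain id
site = toy['bought_domain_id'].str[:3]
toy_bra = toy[site == 'MLB']
toy_mex = toy[site == 'MLM']

# Each subset should contain only its own site prefix
assert (toy_bra['bought_domain_id'].str[:3] == 'MLB').all()
assert (toy_mex['bought_domain_id'].str[:3] == 'MLM').all()

# Together the two subsets should account for every row
print(len(toy_bra), len(toy_mex), len(toy_bra) + len(toy_mex) == len(toy))
```

Comparing the combined length against the original also reveals rows that belong to neither site, which the sum-based check does not show.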

## Before splitting the dataset, we apply the text preprocessing so that the Train, Test and Validation sets already include these features

* ## Lowercase everything

In [6]:
# astype(str) converts every column to text, numeric ones included, before lowercasing
vd_Bra = vd_Bra.astype(str).apply(lambda x: x.str.lower())
vd_Mex = vd_Mex.astype(str).apply(lambda x: x.str.lower())
vd_Bra.head(3)

Unnamed: 0.1,Unnamed: 0,row_id,item_bought,user_view,timestamps,bought_title,bought_domain_id,bought_price,bought_category_id,bought_condition,title,domain_id,price,category_id,condition,view_count
0,0,0,1748830,1786148,1571495142,relógio medidor inteligente pulso freqüência c...,mlb-smartwatches,90.0,mlb135384,0.0,relógio inteligente smartwatch gt08 touch scre...,mlb-smartwatches,119.99,mlb135384,0.0,18
1,1,0,1748830,1786148,1571495157,relógio medidor inteligente pulso freqüência c...,mlb-smartwatches,90.0,mlb135384,0.0,relógio inteligente smartwatch gt08 touch scre...,mlb-smartwatches,119.99,mlb135384,0.0,18
2,2,0,1748830,1615991,1571495246,relógio medidor inteligente pulso freqüência c...,mlb-smartwatches,90.0,mlb135384,0.0,mochila galáxia compartimento p/ laptop,mlb-smartwatches,79.71,mlb135384,0.0,18
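
Because the cell above casts the whole frame to strings, prices and counts are no longer numeric afterwards. A possible variant, sketched here on a toy frame, lowercases only the text columns so the numeric dtypes survive; the `text_cols` list is an assumption about which columns count as text:

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Relógio Inteligente Smartwatch', 'Mochila Galáxia'],
    'domain_id': ['MLB-SMARTWATCHES', 'MLB-SMARTWATCHES'],
    'price': [119.99, 79.71],
})

# Assumed list of text columns; adjust to the real schema
text_cols = ['title', 'domain_id']

lowered = toy.copy()
lowered[text_cols] = lowered[text_cols].apply(lambda col: col.str.lower())

# price is still a float column after the transformation
print(lowered.dtypes)
```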


* ## Tokenize and remove punctuation in the bought_title and title columns

In [7]:
from nltk.tokenize import RegexpTokenizer

# \w+ keeps runs of word characters, so punctuation is dropped while tokenizing
tokenizer = RegexpTokenizer(r'\w+')
for df in (vd_Bra, vd_Mex):
    df['tokenized_bought_title'] = df['bought_title'].apply(tokenizer.tokenize)
    df['tokenized_title'] = df['title'].apply(tokenizer.tokenize)
vd_Bra.head(3)

Unnamed: 0.1,Unnamed: 0,row_id,item_bought,user_view,timestamps,bought_title,bought_domain_id,bought_price,bought_category_id,bought_condition,title,domain_id,price,category_id,condition,view_count,tokenized_bought_title,tokenized_title
0,0,0,1748830,1786148,1571495142,relógio medidor inteligente pulso freqüência c...,mlb-smartwatches,90.0,mlb135384,0.0,relógio inteligente smartwatch gt08 touch scre...,mlb-smartwatches,119.99,mlb135384,0.0,18,"[relógio, medidor, inteligente, pulso, freqüên...","[relógio, inteligente, smartwatch, gt08, touch..."
1,1,0,1748830,1786148,1571495157,relógio medidor inteligente pulso freqüência c...,mlb-smartwatches,90.0,mlb135384,0.0,relógio inteligente smartwatch gt08 touch scre...,mlb-smartwatches,119.99,mlb135384,0.0,18,"[relógio, medidor, inteligente, pulso, freqüên...","[relógio, inteligente, smartwatch, gt08, touch..."
2,2,0,1748830,1615991,1571495246,relógio medidor inteligente pulso freqüência c...,mlb-smartwatches,90.0,mlb135384,0.0,mochila galáxia compartimento p/ laptop,mlb-smartwatches,79.71,mlb135384,0.0,18,"[relógio, medidor, inteligente, pulso, freqüên...","[mochila, galáxia, compartimento, p, laptop]"


* ## Remove stopwords from the bought_title and title columns

In [8]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words_es = set(stopwords.words('spanish'))
stop_words_br = set(stopwords.words('portuguese'))

# The filter runs on the whitespace-split raw titles, so bought_title and title
# become lists of words and punctuation attached to a word (e.g. "p/") is kept.
vd_Bra['bought_title'] = vd_Bra['bought_title'].apply(lambda x: [item for item in x.split() if item not in stop_words_br])
vd_Bra['title'] = vd_Bra['title'].apply(lambda x: [item for item in x.split() if item not in stop_words_br])

vd_Mex['bought_title'] = vd_Mex['bought_title'].apply(lambda x: [item for item in x.split() if item not in stop_words_es])
vd_Mex['title'] = vd_Mex['title'].apply(lambda x: [item for item in x.split() if item not in stop_words_es])
vd_Bra.head(3)

[nltk_data] Downloading package stopwords to /home/gv/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0.1,Unnamed: 0,row_id,item_bought,user_view,timestamps,bought_title,bought_domain_id,bought_price,bought_category_id,bought_condition,title,domain_id,price,category_id,condition,view_count,tokenized_bought_title,tokenized_title
0,0,0,1748830,1786148,1571495142,"[relógio, medidor, inteligente, pulso, freqüên...",mlb-smartwatches,90.0,mlb135384,0.0,"[relógio, inteligente, smartwatch, gt08, touch...",mlb-smartwatches,119.99,mlb135384,0.0,18,"[relógio, medidor, inteligente, pulso, freqüên...","[relógio, inteligente, smartwatch, gt08, touch..."
1,1,0,1748830,1786148,1571495157,"[relógio, medidor, inteligente, pulso, freqüên...",mlb-smartwatches,90.0,mlb135384,0.0,"[relógio, inteligente, smartwatch, gt08, touch...",mlb-smartwatches,119.99,mlb135384,0.0,18,"[relógio, medidor, inteligente, pulso, freqüên...","[relógio, inteligente, smartwatch, gt08, touch..."
2,2,0,1748830,1615991,1571495246,"[relógio, medidor, inteligente, pulso, freqüên...",mlb-smartwatches,90.0,mlb135384,0.0,"[mochila, galáxia, compartimento, p/, laptop]",mlb-smartwatches,79.71,mlb135384,0.0,18,"[relógio, medidor, inteligente, pulso, freqüên...","[mochila, galáxia, compartimento, p, laptop]"
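
In the preview, `title` still contains `p/` while `tokenized_title` contains `p`, because the stopword filter splits the raw string on whitespace. One way to combine both steps is to filter the token lists instead. The sketch below is illustrative only: `re.findall(r'\w+', ...)` stands in for `RegexpTokenizer(r'\w+')`, `stop_words_demo` is a tiny hand-picked stand-in for `stopwords.words('portuguese')`, and the sample title extends a real one with made-up words.

```python
import re

# Tiny hand-picked stand-in for stopwords.words('portuguese')
stop_words_demo = {'de', 'para', 'com', 'e', 'o', 'a'}

def tokenize(text):
    # Same pattern as RegexpTokenizer(r'\w+'): runs of word characters only
    return re.findall(r'\w+', text.lower())

def remove_stopwords(tokens, stop_words):
    # Keep every token that is not in the stopword set
    return [tok for tok in tokens if tok not in stop_words]

# Real title from the preview plus a made-up suffix containing stopwords
title = 'Mochila Galáxia Compartimento P/ Laptop com Cabo de Carga'
tokens = tokenize(title)
filtered = remove_stopwords(tokens, stop_words_demo)
print(filtered)
```

Applied to the notebook's frames, the same filter would run over `tokenized_title` and `tokenized_bought_title`, leaving the original text columns untouched.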


# Splitting the dataset into train, validation, and test sets.

I define a function that uses NumPy's split and then apply it to the two per-site subsets. The rows are shuffled with pandas' sample, whose random_state parameter acts as the seed for the pseudorandom number generator.

In [9]:
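def train_test_validation_split(df, seed, trainSize, test_valSize):
    # Shuffle all rows reproducibly, then cut at two cumulative fractions:
    # rows before trainSize go to train, rows between trainSize and
    # test_valSize go to validation, and the remaining rows go to test.
    train, validate, test = \
              np.split(df.sample(frac=1, random_state=seed), 
                       [int(trainSize*len(df)), int(test_valSize*len(df))])
    return train, validate, test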

I split each dataset into 60% Train, 20% Validation and 20% Test

In [10]:
vd_Bra_train, vd_Bra_validate, vd_Bra_test = train_test_validation_split(df=vd_Bra, seed=24, trainSize=.6, test_valSize=.8)
vd_Mex_train, vd_Mex_validate, vd_Mex_test = train_test_validation_split(df=vd_Mex, seed=24, trainSize=.6, test_valSize=.8)
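
As a point of comparison, the same shuffle-then-cut idea can be written with plain positional slicing. This is a sketch, not the notebook's function: `split_by_position` and its `train_frac` / `val_frac` parameters are hypothetical names, and it takes the size of each subset rather than cumulative cut points, so it can differ from the function above by one row.

```python
import pandas as pd

def split_by_position(df, seed, train_frac, val_frac):
    # Shuffle every row with a fixed seed so the split is reproducible
    shuffled = df.sample(frac=1, random_state=seed)
    n_train = int(train_frac * len(df))
    n_val = int(val_frac * len(df))
    # Slice the shuffled frame into three consecutive, disjoint blocks
    train = shuffled.iloc[:n_train]
    validate = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]
    return train, validate, test

toy = pd.DataFrame({'row_id': range(10)})
train, validate, test = split_by_position(toy, seed=24, train_frac=0.6, val_frac=0.2)
print(len(train), len(validate), len(test))
```

Because the three blocks are slices of a single shuffled frame, no row can land in more than one subset.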

I check that the Train, Test and Validation percentages match the intended ones

In [11]:
def percent_split(df, df_split):
    perc = 100*len(df_split)/len(df)
    return perc

Bra_trainPerc = np.round(percent_split(vd_Bra, vd_Bra_train),3)
Bra_testPerc = np.round(percent_split(vd_Bra, vd_Bra_test),3)
Bra_validPerc = np.round(percent_split(vd_Bra, vd_Bra_validate),3)

Mex_trainPerc = np.round(percent_split(vd_Mex, vd_Mex_train),3)
Mex_testPerc = np.round(percent_split(vd_Mex, vd_Mex_test),3)
Mex_validPerc = np.round(percent_split(vd_Mex, vd_Mex_validate),3)

print(f'Train, Test and Validation percentages for the Brazil dataset:\n% Train = {Bra_trainPerc}\n% Test = {Bra_testPerc}\n% Validation = {Bra_validPerc}')
print('------------------------------------------------------------------')
print('------------------------------------------------------------------')
print(f'Train, Test and Validation percentages for the Mexico dataset:\n% Train = {Mex_trainPerc}\n% Test = {Mex_testPerc}\n% Validation = {Mex_validPerc}')

Train, Test and Validation percentages for the Brazil dataset:
% Train = 59.998
% Test = 20.002
% Validation = 19.999
------------------------------------------------------------------
------------------------------------------------------------------
Train, Test and Validation percentages for the Mexico dataset:
% Train = 59.987
% Test = 20.007
% Validation = 20.007
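
The percentages are close to 60/20/20 but not exact because the cut points are truncated with `int()`, so each subset can be off by a fraction of a row. A small worked example, assuming a hypothetical row count of 1234 (not the size of either real dataset):

```python
def split_sizes(n_rows, train_size, test_val_size):
    # Same cut points as in train_test_validation_split, truncated by int()
    first_cut = int(train_size * n_rows)
    second_cut = int(test_val_size * n_rows)
    # Sizes of the train, validation and test blocks, in that order
    return first_cut, second_cut - first_cut, n_rows - second_cut

n_rows = 1234  # hypothetical row count
sizes = split_sizes(n_rows, 0.6, 0.8)
percentages = [round(100 * size / n_rows, 3) for size in sizes]
print(sizes, percentages)
```

The three sizes always add up to `n_rows`, so no row is lost; only the shares shift slightly away from the nominal fractions.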
