# Practical Assignment No. 2
## Objective

For each device provided by Jampp, determine the time that will elapse until it appears again in an auction, and the time until its user decides to install a new application.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import auc, accuracy_score
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV, KFold, RandomizedSearchCV, train_test_split
import datetime as dt

## First, inspect the target devices

In [3]:
target = pd.read_csv('data/target_final_competencia_revamped.csv')

In [4]:
target.head()

Unnamed: 0,ref_hash,obj
0,1000169251625791232_sc,0
1,1000169251625791232_st,0
2,1000395625957344640_sc,0
3,1000395625957344640_st,0
4,1006670001679961600_sc,0


Extract the ref_hash values by dropping the 3-character suffix (`_sc` / `_st`)

In [5]:
target['ref_hash'] = target['ref_hash'].transform(lambda x: str(x)[:-3])

In [6]:
target.head()

Unnamed: 0,ref_hash,obj
0,1000169251625791232,0
1,1000169251625791232,0
2,1000395625957344640,0
3,1000395625957344640,0
4,1006670001679961600,0
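
The suffix removal above can also be written with the vectorized string accessor, which keeps the suffix available in case it is needed later. This is a minimal sketch on made-up rows; the `toy_target` frame and the `suffix` column are names introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy rows mimicking the target file: each device appears twice,
# once with the "_sc" suffix and once with the "_st" suffix.
toy_target = pd.DataFrame({
    "ref_hash": ["1000169251625791232_sc", "1000169251625791232_st"],
    "obj": [0, 0],
})

# Split on the last underscore: the left part is the device id,
# the right part is the suffix ("sc" or "st").
parts = toy_target["ref_hash"].str.rsplit("_", n=1, expand=True)
toy_target["ref_hash"] = parts[0]
toy_target["suffix"] = parts[1]  # hypothetical helper column

print(toy_target)
```

Splitting on the underscore does not depend on the suffix always being exactly three characters long.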


## Compute how long a device takes to appear in an auction, counted from the start of window 2

The idea is to measure the time elapsed until each appearance of a device in an auction, then take each device's minimum time to appear, and predict that value using the features from the previous window.

In [7]:
auct_predict = pd.read_csv('data/auctions_ventana2.csv', dtype = {"device_id": 'category', "ref_type_id": np.int8, "source_id": np.int8})

In [8]:
auct_predict['date'] = pd.to_datetime(auct_predict['date'])
auct_predict['date_inicial'] = dt.datetime(2019,4, 19)
auct_predict['timeToAuction'] = (auct_predict['date'] - auct_predict['date_inicial'])/np.timedelta64(1,'s')
auct_predict.head()

Unnamed: 0.1,Unnamed: 0,date,device_id,ref_type_id,source_id,date_inicial,timeToAuction
0,159478,2019-04-21 18:04:06.117104,8688913851990397877,1,1,2019-04-19,237846.117104
1,159479,2019-04-21 18:04:06.130756,4125147990784183352,7,1,2019-04-19,237846.130756
2,159480,2019-04-21 18:04:06.257844,5146681099678075004,1,1,2019-04-19,237846.257844
3,159481,2019-04-21 18:04:06.921719,2242043340575950024,1,1,2019-04-19,237846.921719
4,159482,2019-04-21 18:04:08.047843,8404476518073778202,1,1,2019-04-19,237848.047843


Take the minimum time, in seconds, that each device took to appear in an auction

In [9]:
auction_time = auct_predict.groupby('device_id').agg({'timeToAuction': 'min'}).reset_index()
auction_time.columns = ['ref_hash', 'predict_value']
auction_time.head()

Unnamed: 0,ref_hash,predict_value
0,1000055965717261943,11946.498552
1,1000095322020146100,31261.453739
2,1000289257101263364,139583.619753
3,1000503354791340162,175119.075859
4,1000503394293263005,6222.051751
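
The two steps above (seconds elapsed since the window start, then the per-device minimum) can be sketched on a tiny made-up frame. The timestamps and the `window_start` value below are assumptions chosen for illustration only, not the notebook's data.

```python
import pandas as pd

# Toy auctions for two devices (made-up timestamps).
toy = pd.DataFrame({
    "device_id": ["a", "a", "b"],
    "date": pd.to_datetime([
        "2019-04-21 00:00:10",
        "2019-04-21 00:01:00",
        "2019-04-21 01:00:00",
    ]),
})

# Assumed window start, for illustration only.
window_start = pd.Timestamp("2019-04-21")

# Seconds elapsed since the window start for every auction.
toy["timeToAuction"] = (toy["date"] - window_start).dt.total_seconds()

# First appearance per device = minimum elapsed time.
first_seen = (
    toy.groupby("device_id")["timeToAuction"]
       .min()
       .reset_index(name="predict_value")
       .rename(columns={"device_id": "ref_hash"})
)
print(first_seen)
```

Using a scalar `Timestamp` avoids storing a constant `date_inicial` column on every row, which matters on a file with millions of auctions.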


# Analyze the data from the first window

In [10]:
auct = pd.read_csv('data/auctions_ventana1.csv', dtype = {"device_id": 'category', "ref_type_id": np.int8, "source_id": np.int8})

In [11]:
auct.dtypes

date             object
device_id      category
ref_type_id        int8
source_id          int8
dtype: object

In [12]:
auct['date'] = pd.to_datetime(auct['date'])

In [13]:
auct.head()

Unnamed: 0,date,device_id,ref_type_id,source_id
0,2019-04-20 23:57:27.912838,1109595589636746168,7,0
1,2019-04-20 23:57:28.381114,5896614299191635403,1,0
2,2019-04-20 23:57:28.515423,4172466725848941608,1,0
3,2019-04-20 23:57:28.700884,2616279795187318849,7,0
4,2019-04-20 23:57:28.868312,8034952072073026056,1,0


## Count how many times each device appears in an auction

I start by simply counting the number of auctions each device took part in, and add that count as a new feature

In [14]:
auction_count = auct.groupby('device_id').agg({'date': 'count'}).reset_index()
auction_count.columns = ['ref_hash', 'auctions_count']

In [15]:
auction_count.head()

Unnamed: 0,ref_hash,auctions_count
0,1000061425870948777,8
1,1000503394293263005,343
2,1001008640113335510,24
3,1001123163431776865,68
4,1001144380199556647,78


Build a single dataset with the first features, keeping only the devices whose ids also appear in window 2

In [16]:
data = auction_count.merge(auction_time, on = 'ref_hash')

In [17]:
data.head()

Unnamed: 0,ref_hash,auctions_count,predict_value
0,1000061425870948777,8,7150.963789
1,1000503394293263005,343,6222.051751
2,1001123163431776865,68,55248.785583
3,1001144380199556647,78,47259.145921
4,1001510820906853249,72,11120.966694
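
In the tables above, device 1001008640113335510 has 24 auctions in `auction_count` but is missing from `data`, which is what an inner join does when a key is absent on one side. A minimal sketch with made-up rows (the `features`, `labels` and `audit` names are illustrative) shows how to check how many devices are dropped:

```python
import pandas as pd

# Made-up feature and label tables.
features = pd.DataFrame({"ref_hash": ["a", "b", "c"], "auctions_count": [8, 343, 24]})
labels = pd.DataFrame({"ref_hash": ["a", "b", "d"], "predict_value": [7150.9, 6222.0, 100.0]})

# Default merge is an inner join: only devices present in both tables survive.
inner = features.merge(labels, on="ref_hash")

# Outer join with indicator=True shows where each device came from.
audit = features.merge(labels, on="ref_hash", how="outer", indicator=True)

print(len(inner))
print(audit["_merge"].value_counts())
```

Devices marked `right_only` have a label but no window-1 features, so a model trained on the inner join has nothing to say about them.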


## Try Random Forest

In [18]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

y = data['predict_value']
X = data.drop(['ref_hash', 'predict_value'], axis=1)

RFR = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
RFR.fit(X, y)  

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)

In [19]:
predictions = cross_val_predict(RFR, X, y, cv=10)
predictions

array([58179.11689363, 38051.43993503, 38051.43993503, ...,
       90662.87077937, 90662.87077937, 90662.87077937])

In [20]:
scores = cross_val_score(RFR, X, y, cv=5, scoring='neg_mean_squared_error')

## Compute the RMSE

In [21]:
# flip the sign: scikit-learn returns negative MSE scores
mse_scores = -scores
# convert MSE to RMSE
rmse_scores = np.sqrt(mse_scores)

In [22]:
rmse_scores.mean()

57382.57618541962
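
For reference, the number reported above is the mean of the per-fold RMSE values. With $K$ folds, $n_k$ samples in fold $k$ and $\hat{y}_i$ the prediction for sample $i$ (notation introduced here, not in the original notebook):

$$
\mathrm{RMSE}_k = \sqrt{\frac{1}{n_k}\sum_{i \in \text{fold } k} \left(y_i - \hat{y}_i\right)^2},
\qquad
\overline{\mathrm{RMSE}} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{RMSE}_k
$$

Since the target is measured in seconds, a value of about 57,383 corresponds to roughly 16 hours.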

## Count how many times each device appears per source_id

Next, I count how many auctions each device took part in for each source (source_id) that originated the auction. This way the most popular source_ids get higher values. I then take the standard deviation of those counts for each device_id

In [23]:
auct['apariciones'] = 1

In [24]:
auction_by_sourceID = auct.groupby(['device_id', 'source_id']).agg({'apariciones':'sum'}).unstack(1).fillna(0).reset_index()
auction_by_sourceID.columns = auction_by_sourceID.columns.droplevel(0)
auction_by_sourceID.columns = ['ref_hash', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
auction_by_sourceID.head()

Unnamed: 0,ref_hash,0,1,2,3,4,5,6,7,8,9
0,1000061425870948777,1.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1000503394293263005,225.0,79.0,1.0,18.0,1.0,0.0,0.0,0.0,19.0,0.0
2,1001008640113335510,9.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1001123163431776865,63.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0
4,1001144380199556647,46.0,15.0,1.0,0.0,1.0,0.0,1.0,0.0,14.0,0.0


In [25]:
auction_by_sourceID['auctions_by_srcID'] = auction_by_sourceID.iloc[:,1:].std(axis = 1)
auction_by_sourceID.head()

Unnamed: 0,ref_hash,0,1,2,3,4,5,6,7,8,9,auctions_by_srcID
0,1000061425870948777,1.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.20101
1,1000503394293263005,225.0,79.0,1.0,18.0,1.0,0.0,0.0,0.0,19.0,0.0,71.32718
2,1001008640113335510,9.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.25357
3,1001123163431776865,63.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,19.809089
4,1001144380199556647,46.0,15.0,1.0,0.0,1.0,0.0,1.0,0.0,14.0,0.0,14.649991
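
The per-source count table can also be built with `pd.crosstab`, deriving the column names from the data instead of hard-coding the list of sources. This is a sketch on made-up rows; the `source_<id>` naming is an assumption introduced here.

```python
import pandas as pd

# Made-up auctions: device "a" is seen on sources 0 and 1, device "b" only on source 8.
toy = pd.DataFrame({
    "device_id": ["a", "a", "a", "b", "b"],
    "source_id": [0, 1, 1, 8, 8],
})

# Count auctions per (device, source); missing pairs become 0.
counts = pd.crosstab(toy["device_id"], toy["source_id"])

# Name the columns from the data instead of hard-coding the list of sources.
counts.columns = [f"source_{c}" for c in counts.columns]

# Spread of the counts across sources, computed on the count columns only.
counts["auctions_by_srcID"] = counts.std(axis=1)
counts = counts.reset_index().rename(columns={"device_id": "ref_hash"})
print(counts)
```

The standard deviation depends on how many source columns exist, so the set of sources should be the same for every window being compared.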


Add the new feature to the data

In [26]:
data = data.merge(auction_by_sourceID[['ref_hash', 'auctions_by_srcID']], on = 'ref_hash')


In [27]:
data.head()

Unnamed: 0,ref_hash,auctions_count,predict_value,auctions_by_srcID
0,1000061425870948777,8,7150.963789,2.20101
1,1000503394293263005,343,6222.051751,71.32718
2,1001123163431776865,68,55248.785583,19.809089
3,1001144380199556647,78,47259.145921,14.649991
4,1001510820906853249,72,11120.966694,13.75823


Test the model again with the new feature

In [28]:
y = data['predict_value']
X = data.drop(['ref_hash', 'predict_value'], axis=1)

RFR = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
RFR.fit(X, y)  

RFR.feature_importances_

array([1., 0.])

In [29]:
cross_val_predict(RFR, X, y, cv=10)

array([58179.11689363, 38051.43993503, 38051.43993503, ...,
       90662.87077937, 90662.87077937, 90662.87077937])

In [30]:
scores = cross_val_score(RFR, X, y, cv=10, scoring='neg_mean_squared_error')

In [31]:
# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

55514.938871437254

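The mean RMSE dropped from 57382.58 to 55514.94, so we keep the new feature in the feature set. Two caveats apply: this run used cv=10 while the previous one used cv=5, and the importance reported for the new feature is 0, so the drop may come from the change in folds rather than from the feature itself.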
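
When comparing feature sets, it helps to reuse the same fold splitter so that the scores differ only because of the features. The following is a minimal sketch on synthetic data, not the notebook's dataset; the `mean_rmse` helper and the small forest settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, for illustration only: the second column is pure noise.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 2))
y = 3.0 * X_full[:, 0] + rng.normal(scale=0.1, size=200)

# One splitter, reused for every feature set, so the folds are identical.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

def mean_rmse(X, y):
    """Mean of the per-fold RMSE for a small random forest (hypothetical helper)."""
    model = RandomForestRegressor(n_estimators=20, max_depth=2, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    return np.sqrt(-scores).mean()

rmse_one_feature = mean_rmse(X_full[:, [0]], y)
rmse_two_features = mean_rmse(X_full, y)
print(rmse_one_feature, rmse_two_features)
```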

## Count how many times each device appears in an auction per ref_type

In [32]:
auct['ref_type_id'].value_counts()

1    13313564
7     2331211
Name: ref_type_id, dtype: int64

In [33]:
dfApRef = auct.groupby(['device_id', 'ref_type_id']).agg({'apariciones':'sum'}).unstack(1).fillna(0).reset_index()
dfApRef.columns = dfApRef.columns.droplevel(0)
dfApRef['median_count_ref_type'] = dfApRef.median(axis = 1)
dfApRef.head()

ref_type_id,Unnamed: 1,1,7,median_count_ref_type
0,1000061425870948777,8.0,0.0,8.0
1,1000503394293263005,0.0,343.0,343.0
2,1001008640113335510,24.0,0.0,24.0
3,1001123163431776865,68.0,0.0,68.0
4,1001144380199556647,0.0,78.0,78.0
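
In the output above, a device with counts 8 and 0 gets a median of 8.0 rather than 4.0, which suggests that the identifier column took part in the row-wise median. A minimal sketch on made-up rows contrasts the two computations; the column names follow the renaming done in the next cell, and `count_cols` is a name introduced here.

```python
import pandas as pd

# Made-up per-device counts for ref_type 1 and ref_type 7.
toy = pd.DataFrame({
    "ref_hash": [1000061425870948777, 1000503394293263005],
    "auctions_ref_type1": [8.0, 0.0],
    "auctions_ref_type7": [0.0, 343.0],
})

count_cols = ["auctions_ref_type1", "auctions_ref_type7"]

# Median over every numeric column, identifier included: the huge id skews the result.
with_id = toy.median(axis=1)

# Median over the count columns only.
counts_only = toy[count_cols].median(axis=1)

print(with_id.tolist(), counts_only.tolist())
```

With the identifier included, the "median" of two counts is simply the larger one, which for most devices equals `auctions_count`. The same pattern is visible in the per-day feature computed later in the notebook.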


I want to know whether each user only has appearances under a single ref_type. I check this to decide whether the feature will be useful

In [34]:
dfApRef.columns = ['ref_hash', 'auctions_ref_type1', 'auctions_ref_type7', 'median_count_ref_type']
dfApRef.head()

Unnamed: 0,ref_hash,auctions_ref_type1,auctions_ref_type7,median_count_ref_type
0,1000061425870948777,8.0,0.0,8.0
1,1000503394293263005,0.0,343.0,343.0
2,1001008640113335510,24.0,0.0,24.0
3,1001123163431776865,68.0,0.0,68.0
4,1001144380199556647,0.0,78.0,78.0


In [35]:
dfApRef['aparece_en_distinto_ref_type'] = ((dfApRef['auctions_ref_type1'] > 0) & (dfApRef['auctions_ref_type7'] > 0))

In [36]:
dfApRef.head()

Unnamed: 0,ref_hash,auctions_ref_type1,auctions_ref_type7,median_count_ref_type,aparece_en_distinto_ref_type
0,1000061425870948777,8.0,0.0,8.0,False
1,1000503394293263005,0.0,343.0,343.0,False
2,1001008640113335510,24.0,0.0,24.0,False
3,1001123163431776865,68.0,0.0,68.0,False
4,1001144380199556647,0.0,78.0,78.0,False


In [37]:
dfApRef['aparece_en_distinto_ref_type'].value_counts()

False    318079
True        246
Name: aparece_en_distinto_ref_type, dtype: int64

Since some devices appear under both ref_types, I can try adding this feature and see what happens

In [38]:
data = data.merge(dfApRef[['ref_hash', 'median_count_ref_type']], on = 'ref_hash')
data.head()

Unnamed: 0,ref_hash,auctions_count,predict_value,auctions_by_srcID,median_count_ref_type
0,1000061425870948777,8,7150.963789,2.20101,8.0
1,1000503394293263005,343,6222.051751,71.32718,343.0
2,1001123163431776865,68,55248.785583,19.809089,68.0
3,1001144380199556647,78,47259.145921,14.649991,78.0
4,1001510820906853249,72,11120.966694,13.75823,72.0


In [39]:
y = data['predict_value']
X = data.drop(['ref_hash', 'predict_value'], axis=1)

RFR = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
RFR.fit(X, y)  

RFR.feature_importances_

array([0.48618807, 0.        , 0.51381193])

In [40]:
cross_val_predict(RFR, X, y, cv=10)

array([58177.91540167, 38051.41250316, 38051.41250316, ...,
       90664.47595397, 90664.47595397, 90664.47595397])

In [41]:
scores = cross_val_score(RFR, X, y, cv=5, scoring='neg_mean_squared_error')

In [42]:
# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

57382.80017892816

The mean RMSE went back up (57382.80, with cv=5 again), but the Random Forest feature importances mark the new feature as important, so we keep it for now to see what happens as more features are added.

## Number of auction appearances per device per day

I compute the number of auction appearances of each device per day, then take the median of those daily counts.

In [43]:
auct['fecha'] = auct['date'].dt.date

In [44]:
dfApDay = auct.groupby(['device_id', 'fecha']).agg({'apariciones':'sum'}).unstack(1).fillna(0).reset_index()
dfApDay.columns = dfApDay.columns.droplevel(0)
dfApDay.columns = ['ref_hash', '2019-04-18', '2019-04-19', '2019-04-20']
dfApDay['auctions_by_day'] = dfApDay.median(axis = 1)

In [45]:
dfApDay.head()

Unnamed: 0,ref_hash,2019-04-18,2019-04-19,2019-04-20,auctions_by_day
0,1000061425870948777,2.0,2.0,4.0,3.0
1,1000503394293263005,111.0,99.0,133.0,122.0
2,1001008640113335510,24.0,0.0,0.0,12.0
3,1001123163431776865,16.0,20.0,32.0,26.0
4,1001144380199556647,34.0,27.0,17.0,30.5


Add the new feature

In [46]:
data = data.merge(dfApDay[['ref_hash', 'auctions_by_day']], on = 'ref_hash')
data.head()

Unnamed: 0,ref_hash,auctions_count,predict_value,auctions_by_srcID,median_count_ref_type,auctions_by_day
0,1000061425870948777,8,7150.963789,2.20101,8.0,3.0
1,1000503394293263005,343,6222.051751,71.32718,343.0,122.0
2,1001123163431776865,68,55248.785583,19.809089,68.0,26.0
3,1001144380199556647,78,47259.145921,14.649991,78.0,30.5
4,1001510820906853249,72,11120.966694,13.75823,72.0,29.5


Now test the model

In [47]:
y = data['predict_value']
X = data.drop(['ref_hash', 'predict_value'], axis=1)

RFR = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
RFR.fit(X, y)  

RFR.feature_importances_

array([0.49027402, 0.        , 0.50972598, 0.        ])

In [48]:
cross_val_predict(RFR, X, y, cv=10)

array([58177.91540167, 38051.41250316, 38051.41250316, ...,
       90664.47595397, 90664.47595397, 90664.47595397])

In [49]:
scores = cross_val_score(RFR, X, y, cv=5, scoring='neg_mean_squared_error')


In [50]:
# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

57382.82224791924

The score rose slightly again (57382.82), so these features do not seem to help much. I will see what happens later on.

## Try Gradient Boosting Regressor

In [51]:
from sklearn.ensemble import GradientBoostingRegressor

GBR = GradientBoostingRegressor(random_state=23, n_estimators=50, min_samples_split=50)
scores = cross_val_score(GBR , X, y, scoring = "neg_mean_squared_error", cv=5)


In [52]:
# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

55522.11646717151

## Now try XGBoost

In [53]:
import xgboost as xgb

In [54]:
XGB = xgb.XGBRegressor()

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

In [55]:
# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

54886.44817504989

## Add as a feature the minimum appearance time of a device during the window preceding the prediction window

The idea of this feature is to study the behavior of the devices and their appearance times, and to compute the minimum of those times as an estimate of the time to predict.
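
The cells below measure the time from the window start to each device's first auction. A related variant, shown here only as a sketch on made-up timestamps, measures the gap between consecutive appearances of the same device. The `gap_s` and `min_gap_s` names are introduced here and are not features of the original notebook.

```python
import pandas as pd

# Made-up auctions: device "a" appears three times, device "b" once.
toy = pd.DataFrame({
    "device_id": ["a", "a", "a", "b"],
    "date": pd.to_datetime([
        "2019-04-18 00:00:00",
        "2019-04-18 00:10:00",
        "2019-04-18 00:15:00",
        "2019-04-18 05:00:00",
    ]),
})

# Sort so that diff() compares each auction with the previous one of the same device.
toy = toy.sort_values(["device_id", "date"])
toy["gap_s"] = toy.groupby("device_id")["date"].diff().dt.total_seconds()

# Devices with a single appearance have no gap, so min() yields NaN for them.
min_gap = toy.groupby("device_id")["gap_s"].min().reset_index(name="min_gap_s")
print(min_gap)
```

Devices seen only once need an explicit fill value before this column can be passed to a model that does not accept missing values.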


In [79]:
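# take an explicit copy so that adding columns below does not trigger SettingWithCopyWarning
apariciones_auctions = auct[['date', 'device_id']].copy()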
apariciones_auctions.head()

Unnamed: 0,date,device_id
0,2019-04-20 23:57:27.912838,1109595589636746168
1,2019-04-20 23:57:28.381114,5896614299191635403
2,2019-04-20 23:57:28.515423,4172466725848941608
3,2019-04-20 23:57:28.700884,2616279795187318849
4,2019-04-20 23:57:28.868312,8034952072073026056


In [80]:
apariciones_auctions['date'] = pd.to_datetime(apariciones_auctions['date'])
apariciones_auctions['date_inicial'] = dt.datetime(2019,4, 18)
apariciones_auctions['timeToAuction'] = (apariciones_auctions['date'] - apariciones_auctions['date_inicial'])/np.timedelta64(1,'s')
apariciones_auctions.head()

Unnamed: 0,date,device_id,date_inicial,timeToAuction
0,2019-04-20 23:57:27.912838,1109595589636746168,2019-04-18,259047.912838
1,2019-04-20 23:57:28.381114,5896614299191635403,2019-04-18,259048.381114
2,2019-04-20 23:57:28.515423,4172466725848941608,2019-04-18,259048.515423
3,2019-04-20 23:57:28.700884,2616279795187318849,2019-04-18,259048.700884
4,2019-04-20 23:57:28.868312,8034952072073026056,2019-04-18,259048.868312


In [81]:
apariciones_auctions = apariciones_auctions.groupby('device_id').agg({'timeToAuction': 'min'}).reset_index()
apariciones_auctions.columns = ['ref_hash', 'timeToAuction_min']
apariciones_auctions.head()

Unnamed: 0,ref_hash,timeToAuction_min
0,1000061425870948777,13095.182558
1,1000503394293263005,11.398298
2,1001008640113335510,61148.713542
3,1001123163431776865,1442.033706
4,1001144380199556647,1977.28307


Add the new feature

In [82]:
data = data.merge(apariciones_auctions, on = 'ref_hash')

Unnamed: 0,ref_hash,auctions_count,predict_value,auctions_by_srcID,median_count_ref_type,auctions_by_day,timeToAuction_mean,timeToAuction_min
0,1000061425870948777,8,7150.963789,2.20101,8.0,3.0,145187.163216,13095.182558
1,1000503394293263005,343,6222.051751,71.32718,343.0,122.0,145258.790686,11.398298
2,1001123163431776865,68,55248.785583,19.809089,68.0,26.0,174293.143627,1442.033706
3,1001144380199556647,78,47259.145921,14.649991,78.0,30.5,128185.363085,1977.28307
4,1001510820906853249,72,11120.966694,13.75823,72.0,29.5,103878.083597,10203.408548


In [84]:
data.head()

Unnamed: 0,ref_hash,auctions_count,predict_value,auctions_by_srcID,median_count_ref_type,auctions_by_day,timeToAuction_min
0,1000061425870948777,8,7150.963789,2.20101,8.0,3.0,13095.182558
1,1000503394293263005,343,6222.051751,71.32718,343.0,122.0,11.398298
2,1001123163431776865,68,55248.785583,19.809089,68.0,26.0,1442.033706
3,1001144380199556647,78,47259.145921,14.649991,78.0,30.5,1977.28307
4,1001510820906853249,72,11120.966694,13.75823,72.0,29.5,10203.408548


## Try XGBoost

In [85]:
y = data['predict_value']
X = data.drop(['ref_hash', 'predict_value'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

In [86]:
# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

39819.6651574377

## Add features from the events

## Clicks

First I consider the time each device takes to click. For this I use all the click data, without splitting it into windows.

In [None]:
clks = pd.read_csv("data/clks_ventana1.csv", dtype = {'advertiser_id': np.int8, 'action_id': np.float32, 'source_id': np.int8, 'country_code': 'category',  'carrier_id': np.float16, 'specs_brand': 'category', 'brand': np.float16, 'ref_type': 'category', 'ref_hash': 'category'})

In [None]:
clks.info()

In [None]:
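# ASSUMPTION: click_time is never defined in this notebook; here it is taken to be
# the mean timeToClick per device. Adjust if a different aggregation was intended.
click_time = clks.groupby('ref_hash').agg({'timeToClick': 'mean'}).reset_index()
# inverse of the click time; missing or non-positive times map to 0.0, so no dropna is needed
click_time['possible_click'] = click_time['timeToClick'].transform(lambda x: 1/x if (x > 0) else 0.0)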

In [None]:
click_time['possible_click'].value_counts()

## Installs

I use the full installs dataframe, without applying a window, because for now I only care about the average time each device took to install. The specific dates do not matter.

In [None]:
inst = pd.read_csv('data/installs_ventana1.csv', dtype = {'application_id': np.int16, 'ref_type': 'category', 'ref_hash': 'category', 'click_hash': 'category', 'device_country_code': 'category', 'device_brand': 'category', 'device_model': 'category', 'kind': 'category', 'device_language': 'category'})

In [None]:
inst['created'] = pd.to_datetime(inst['created'])
inst.head()

In [None]:
inst.dtypes

## Compute the time each device takes to convert

In [None]:
inst.info()

In [None]:
inst['created_inicial'] = dt.datetime(2019, 4, 18)
inst['created_inicial'] = pd.to_datetime(inst['created_inicial'])
inst['timeToInstall'] = (inst['created'] - inst['created_inicial'])/np.timedelta64(1,'s')
inst.head()

Take the maximum and minimum time, in seconds, that each device took to complete an installation

In [None]:
install_time = inst.groupby('ref_hash').agg({'timeToInstall': ['min', 'max']}).reset_index()
install_time.columns = install_time.columns.droplevel(1)
install_time.columns = ['ref_hash', 'minTimeToInstall', 'maxTimeToInstall']
install_time
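
The min/max aggregation above can also be expressed with pandas named aggregation, which produces flat column names directly and makes the `droplevel` step unnecessary. This is a sketch on made-up rows, assuming pandas 0.25 or later.

```python
import pandas as pd

# Made-up install times in seconds for two devices.
toy = pd.DataFrame({
    "ref_hash": ["a", "a", "b"],
    "timeToInstall": [120.0, 30.0, 500.0],
})

# Each keyword names an output column and maps it to (input column, aggregation).
install_time = (
    toy.groupby("ref_hash")
       .agg(minTimeToInstall=("timeToInstall", "min"),
            maxTimeToInstall=("timeToInstall", "max"))
       .reset_index()
)
print(install_time)
```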

## Add the devices with installs and the time each device takes to complete an installation

In [None]:
auctions = auctions.merge(install_time[['ref_hash', 'possible_install']], on = 'ref_hash')

In [None]:
auctions = auctions.merge(click_time[['ref_hash', 'possible_click']], on = 'ref_hash')

In [None]:
auctions.info()

In [None]:
auctions.head()