# Trabajo Práctico N° 2
## Objetivo:

Para cada dispositivo presentado por Jampp, determinar el tiempo que transcurrirá hasta que el mismo aparezca nuevamente en una subasta, y el tiempo hasta que el usuario del mismo decida instalar una nueva aplicación.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import auc, accuracy_score
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV, KFold, RandomizedSearchCV, train_test_split
import datetime as dt

## Primero veo los dispositivos del target

In [2]:
target = pd.read_csv('data/target_competencia_ids.csv')

In [3]:
target.head()

Unnamed: 0,ref_hash,obj
0,1000169251625791246_sc,0
1,1000169251625791246_st,0
2,1000395625957344683_sc,0
3,1000395625957344683_st,0
4,1003027494996471685_sc,0


In [4]:
target.size

16148

Tomo los ref_hash

In [5]:
target['ref_hash'] = target['ref_hash'].transform(lambda x: str(x)[:-3])

In [6]:
target['ref_hash'] = target['ref_hash'].astype(np.int64)
target.head()

Unnamed: 0,ref_hash,obj
0,1000169251625791246,0
1,1000169251625791246,0
2,1000395625957344683,0
3,1000395625957344683,0
4,1003027494996471685,0


In [7]:
target['ref_hash'].nunique()

4037

## Calculo cuanto tiempo tarda un dispositivo en aparecer en una subasta contando desde el inicio de la ventana 2

La idea es determinar el tiempo que transcurrió entre cada aparición de un dispositivo en una subasta, para luego tomar el tiempo mínimo de aparición de un dispositivo en una subasta y de acuerdo a eso predecir utilizando los features de la ventana anterior.

In [8]:
auct_predict = pd.read_csv('data/auctions_ventana2.csv', dtype = { "ref_type_id": np.int8, "source_id": np.int8})

In [9]:
auct_predict['date'] = pd.to_datetime(auct_predict['date'])
auct_predict['date_inicial'] = dt.datetime(2019,4, 19)
auct_predict['timeToAuction'] = (auct_predict['date'] - auct_predict['date_inicial'])/np.timedelta64(1,'s')
auct_predict.head()

Unnamed: 0.1,Unnamed: 0,date,device_id,ref_type_id,source_id,date_inicial,timeToAuction
0,159478,2019-04-21 18:04:06.117104,8688913851990397877,1,1,2019-04-19,237846.117104
1,159479,2019-04-21 18:04:06.130756,4125147990784183352,7,1,2019-04-19,237846.130756
2,159480,2019-04-21 18:04:06.257844,5146681099678075004,1,1,2019-04-19,237846.257844
3,159481,2019-04-21 18:04:06.921719,2242043340575950024,1,1,2019-04-19,237846.921719
4,159482,2019-04-21 18:04:08.047843,8404476518073778202,1,1,2019-04-19,237848.047843


Tomo el tiempo mínimo y máximo, en SEGUNDOS, que tardó cada dispositivo en aparecer en una subasta

In [10]:
auction_time = auct_predict.groupby('device_id').agg({'timeToAuction': 'min'}).reset_index()
auction_time.columns = ['ref_hash', 'predict_value']
auction_time.head()

Unnamed: 0,ref_hash,predict_value
0,41863526108385,70828.465866
1,135153013040192,101454.009137
2,181891380775191,170365.420614
3,186034136943920,98271.945332
4,295841792051458,173232.827168


# Analizo los datos de la primer ventana

In [11]:
auct = pd.read_csv('data/auctions_ventana1.csv', dtype = { "ref_type_id": np.int8, "source_id": np.int8})

In [12]:
auct.dtypes

date           object
device_id       int64
ref_type_id      int8
source_id        int8
dtype: object

In [13]:
auct['date'] = pd.to_datetime(auct['date'])

In [14]:
auct.head()

Unnamed: 0,date,device_id,ref_type_id,source_id
0,2019-04-20 23:57:27.912838,1109595589636746168,7,0
1,2019-04-20 23:57:28.381114,5896614299191635403,1,0
2,2019-04-20 23:57:28.515423,4172466725848941608,1,0
3,2019-04-20 23:57:28.700884,2616279795187318849,7,0
4,2019-04-20 23:57:28.868312,8034952072073026056,1,0


## Veo cuantas veces aparece cada dispositivo en una subasta

Inicio sencillamente contando la cantidad de subastas en las que participó cada dispositivo, y lo agrego como un nuevo feature

In [15]:
auction_count = auct.groupby('device_id').agg({'date': 'count'}).reset_index()
auction_count.columns = ['ref_hash', 'auctions_count']

In [16]:
auction_count.head()

Unnamed: 0,ref_hash,auctions_count
0,41863526108385,35
1,135153013040192,8
2,161514654074162,6
3,181891380775191,1
4,186034136943920,7


Creo un único set de datos con los primeros features creados usando los ids de los dispositivos de la ventana 2

In [17]:
data = auction_count.merge(auction_time, on = 'ref_hash')

In [18]:
data.head()

Unnamed: 0,ref_hash,auctions_count,predict_value
0,41863526108385,35,70828.465866
1,135153013040192,8,101454.009137
2,181891380775191,1,170365.420614
3,186034136943920,7,98271.945332
4,345999128501141,48,9576.349849


## Pruebo con Random Forest

In [19]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

y = data['predict_value']
X = data.drop(['ref_hash', 'predict_value'], axis=1)

RFR = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
RFR.fit(X, y)  

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)

In [20]:
predictions = cross_val_predict(RFR, X, y, cv=10)
predictions

array([53425.02139278, 64310.97687888, 96141.30980674, ...,
       53567.08994914, 73483.71589645, 28943.34864995])

In [21]:
scores = cross_val_score(RFR, X, y, cv=5, scoring='neg_mean_squared_error')

## Calculo el RMSE

In [22]:
# convierto los valores a MSE scores
mse_scores = -scores
# paso de MSE a RMSE
rmse_scores = np.sqrt(mse_scores)

In [23]:
rmse_scores.mean()

55234.70508843897

## Pruebo con xgboost

In [24]:
import xgboost as xgb

In [34]:
y = data['predict_value']
X = data.drop(['ref_hash', 'predict_value'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

mse_scores = -scores
# paso de MSE a RMSE
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




54958.18891614814

## Veo la cantidad de veces que aparece cada dispositivo segun el source_id

Ahora me interesa ver la cantidad de veces que un dispositivo participó en una subasta desde cada tipo de fuente desde donde se produce la subasta. De esta manera los source_id más populares tomaran un valor mayor, luego tomo la desviación estándar de la cantidad para cada device_id

In [25]:
auct['apariciones'] = 1

In [26]:
auction_by_sourceID = auct.groupby(['device_id', 'source_id']).agg({'apariciones':'sum'}).unstack(1).fillna(0).reset_index()
auction_by_sourceID.columns = auction_by_sourceID.columns.droplevel(0)
auction_by_sourceID.columns = ['ref_hash', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
auction_by_sourceID.head()

Unnamed: 0,ref_hash,0,1,2,3,4,5,6,7,8,9
0,41863526108385,0.0,0.0,0.0,28.0,0.0,3.0,0.0,0.0,4.0,0.0
1,135153013040192,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,161514654074162,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,181891380775191,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,186034136943920,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
auction_by_sourceID['auctions_by_srcID'] = auction_by_sourceID.iloc[:,1:].std(axis = 1)
auction_by_sourceID.head()

Unnamed: 0,ref_hash,0,1,2,3,4,5,6,7,8,9,auctions_by_srcID
0,41863526108385,0.0,0.0,0.0,28.0,0.0,3.0,0.0,0.0,4.0,0.0,8.733715
1,135153013040192,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.529822
2,161514654074162,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.264911
3,181891380775191,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.316228
4,186034136943920,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.213594


Agrego el nuevo feature a los datos

In [28]:
data1 = data.merge(auction_by_sourceID[['ref_hash', 'auctions_by_srcID']], on = 'ref_hash')


In [29]:
data1.head()

Unnamed: 0,ref_hash,auctions_count,predict_value,auctions_by_srcID
0,41863526108385,35,70828.465866,8.733715
1,135153013040192,8,101454.009137,2.529822
2,181891380775191,1,170365.420614,0.316228
3,186034136943920,7,98271.945332,2.213594
4,345999128501141,48,9576.349849,14.140564


Vuelvo a probar el modelo con el nuevo feature

In [30]:
y = data1['predict_value']
X = data1.drop(['ref_hash', 'predict_value'], axis=1)

RFR = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
RFR.fit(X, y)  

RFR.feature_importances_

array([1., 0.])

In [31]:
cross_val_predict(RFR, X, y, cv=10)

array([53425.02139278, 64310.97687888, 96141.30980674, ...,
       53567.08994914, 73483.71589645, 28943.34864995])

In [32]:
scores = cross_val_score(RFR, X, y, cv=10, scoring='neg_mean_squared_error')

In [33]:
# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

55233.40488970417

Pruebo xgboost

In [36]:
y = data1['predict_value']
X = data1.drop(['ref_hash', 'predict_value'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

mse_scores = -scores
# paso de MSE a RMSE
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




54925.932115098156

El nuevo feature logró bajar el score promedio de RMSE, así que lo mantenemos en el dataset de features.

## Veo la cantidad de veces que aparece cada dispositivo en una subasta según el ref_type

In [37]:
auct['ref_type_id'].value_counts()

1    13313564
7     2331211
Name: ref_type_id, dtype: int64

In [38]:
dfApRef = auct.groupby(['device_id', 'ref_type_id']).agg({'apariciones':'sum'}).unstack(1).fillna(0).reset_index()
dfApRef.columns = dfApRef.columns.droplevel(0)
dfApRef['median_count_ref_type'] = dfApRef.median(axis = 1)
dfApRef.head()

ref_type_id,Unnamed: 1,1,7,median_count_ref_type
0,41863526108385,35.0,0.0,35.0
1,135153013040192,8.0,0.0,8.0
2,161514654074162,6.0,0.0,6.0
3,181891380775191,1.0,0.0,1.0
4,186034136943920,7.0,0.0,7.0


Me interesa ver si cada usuario sólo tiene apariciones para el mismo ref_type, lo chequeo para ver si me servirá o no el feature

In [39]:
dfApRef.columns = ['ref_hash', 'auctions_ref_type1', 'auctions_ref_type7', 'median_count_ref_type']
dfApRef.head()

Unnamed: 0,ref_hash,auctions_ref_type1,auctions_ref_type7,median_count_ref_type
0,41863526108385,35.0,0.0,35.0
1,135153013040192,8.0,0.0,8.0
2,161514654074162,6.0,0.0,6.0
3,181891380775191,1.0,0.0,1.0
4,186034136943920,7.0,0.0,7.0


In [40]:
dfApRef['aparece_en_distinto_ref_type'] = ((dfApRef['auctions_ref_type1'] > 0) & (dfApRef['auctions_ref_type7'] > 0))

In [41]:
dfApRef.head()

Unnamed: 0,ref_hash,auctions_ref_type1,auctions_ref_type7,median_count_ref_type,aparece_en_distinto_ref_type
0,41863526108385,35.0,0.0,35.0,False
1,135153013040192,8.0,0.0,8.0,False
2,161514654074162,6.0,0.0,6.0,False
3,181891380775191,1.0,0.0,1.0,False
4,186034136943920,7.0,0.0,7.0,False


In [42]:
dfApRef['aparece_en_distinto_ref_type'].value_counts()

False    318079
True        246
Name: aparece_en_distinto_ref_type, dtype: int64

Como hay valores para ref_types distintos puedo probar agregar este feature y ver que sucede

In [43]:
data2 = data1.merge(dfApRef[['ref_hash', 'median_count_ref_type']], on = 'ref_hash')
data2.head()

Unnamed: 0,ref_hash,auctions_count,predict_value,auctions_by_srcID,median_count_ref_type
0,41863526108385,35,70828.465866,8.733715,35.0
1,135153013040192,8,101454.009137,2.529822,8.0
2,181891380775191,1,170365.420614,0.316228,1.0
3,186034136943920,7,98271.945332,2.213594,7.0
4,345999128501141,48,9576.349849,14.140564,48.0


In [44]:
y = data2['predict_value']
X = data2.drop(['ref_hash', 'predict_value'], axis=1)

RFR = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
RFR.fit(X, y)  

RFR.feature_importances_

array([0.49061662, 0.        , 0.50938338])

In [45]:
cross_val_predict(RFR, X, y, cv=10)

array([53454.20517815, 64139.30224559, 96142.0359979 , ...,
       53567.99493623, 73481.99074885, 28944.37827845])

In [46]:
scores = cross_val_score(RFR, X, y, cv=5, scoring='neg_mean_squared_error')

In [47]:
# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

55233.48109832053

## XGBoost

In [48]:
y = data2['predict_value']
X = data2.drop(['ref_hash', 'predict_value'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

mse_scores = -scores
# paso de MSE a RMSE
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




54926.97128598095

El valor promedio del score volvió a subir, pero el feature importance según Random Forest indica que el feature es importante, además subió muy poco, así que lo mantendremos por ahora para ver qué sucede a medida que agrego más features.

## Cantidad de apariciones de un dispositivo en las subastas por día

Calculo la cantidad de apariciones para cada dispositivo en las subastas por día, luego tomo la mediana para las apariciones.

In [49]:
auct['fecha'] = auct['date'].dt.date

In [50]:
dfApDay = auct.groupby(['device_id', 'fecha']).agg({'apariciones':'sum'}).unstack(1).fillna(0).reset_index()
dfApDay.columns = dfApDay.columns.droplevel(0)
dfApDay.columns = ['ref_hash', '2019-04-18', '2019-04-19', '2019-04-20']
dfApDay['auctions_by_day'] = dfApDay.median(axis = 1)

In [51]:
dfApDay.head()

Unnamed: 0,ref_hash,2019-04-18,2019-04-19,2019-04-20,auctions_by_day
0,41863526108385,0.0,1.0,34.0,17.5
1,135153013040192,0.0,0.0,8.0,4.0
2,161514654074162,6.0,0.0,0.0,3.0
3,181891380775191,0.0,0.0,1.0,0.5
4,186034136943920,2.0,0.0,5.0,3.5


Agrego el nuevo feature

In [52]:
data3 = data1.merge(dfApDay[['ref_hash', 'auctions_by_day']], on = 'ref_hash')
data3.head()

Unnamed: 0,ref_hash,auctions_count,predict_value,auctions_by_srcID,auctions_by_day
0,41863526108385,35,70828.465866,8.733715,17.5
1,135153013040192,8,101454.009137,2.529822,4.0
2,181891380775191,1,170365.420614,0.316228,0.5
3,186034136943920,7,98271.945332,2.213594,3.5
4,345999128501141,48,9576.349849,14.140564,23.5


Ahora puebo el modelo

In [57]:
y = data3['predict_value']
X = data3.drop(['ref_hash', 'predict_value'], axis=1)

RFR = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
RFR.fit(X, y)  

RFR.feature_importances_

array([1., 0., 0.])

In [58]:
cross_val_predict(RFR, X, y, cv=10)

array([53425.02139278, 64310.97687888, 96141.30980674, ...,
       53567.08994914, 73483.71589645, 28943.34864995])

In [59]:
scores = cross_val_score(RFR, X, y, cv=5, scoring='neg_mean_squared_error')


In [60]:
# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

55234.70508843897

## Pruebo con Gradient Boosting Regressor

In [62]:
from sklearn.ensemble import GradientBoostingRegressor

GBR = GradientBoostingRegressor(random_state=23, n_estimators=50, min_samples_split=50)
scores = cross_val_score(GBR , X, y, scoring = "neg_mean_squared_error", cv=5)


In [63]:
# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

53921.75272292011

## Ahora pruebo con XGBoost 

In [64]:
XGB = xgb.XGBRegressor()

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




In [65]:
# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

53285.37288489265

El score volvió a subir un poco con random forest, pero con xgboost bajó, así que voy a mantener este feature

## Agrego como feature el tiempo minimo de aparición de un dispositivo durante la ventana anterior a la ventana en la que voy a predecir.

La idea de este feature es estudiar el comportamiento de los dispositivos, sus tiempos de aparición y calcular el mínimo tiempo entre ellos para tener un estimativo del tiempo a predecir.


In [66]:
apariciones_auctions = auct[['date', 'device_id']]
apariciones_auctions.head()

Unnamed: 0,date,device_id
0,2019-04-20 23:57:27.912838,1109595589636746168
1,2019-04-20 23:57:28.381114,5896614299191635403
2,2019-04-20 23:57:28.515423,4172466725848941608
3,2019-04-20 23:57:28.700884,2616279795187318849
4,2019-04-20 23:57:28.868312,8034952072073026056


In [67]:
apariciones_auctions['date'] = pd.to_datetime(apariciones_auctions['date'])
apariciones_auctions['date_inicial'] = dt.datetime(2019,4, 18)
apariciones_auctions['timeToAuction'] = (apariciones_auctions['date'] - apariciones_auctions['date_inicial'])/np.timedelta64(1,'s')
apariciones_auctions.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,date,device_id,date_inicial,timeToAuction
0,2019-04-20 23:57:27.912838,1109595589636746168,2019-04-18,259047.912838
1,2019-04-20 23:57:28.381114,5896614299191635403,2019-04-18,259048.381114
2,2019-04-20 23:57:28.515423,4172466725848941608,2019-04-18,259048.515423
3,2019-04-20 23:57:28.700884,2616279795187318849,2019-04-18,259048.700884
4,2019-04-20 23:57:28.868312,8034952072073026056,2019-04-18,259048.868312


In [68]:
apariciones_auctions = apariciones_auctions.groupby('device_id').agg({'timeToAuction': 'min'}).reset_index()
apariciones_auctions.columns = ['ref_hash', 'timeToAuction_min']
apariciones_auctions.head()

Unnamed: 0,ref_hash,timeToAuction_min
0,41863526108385,157228.465866
1,135153013040192,187854.009137
2,161514654074162,10366.357746
3,181891380775191,256765.420614
4,186034136943920,60166.331894


Agrego el nuevo feature 

In [70]:
data4 = data3.merge(apariciones_auctions, on = 'ref_hash')

In [71]:
data4.head()

Unnamed: 0,ref_hash,auctions_count,predict_value,auctions_by_srcID,auctions_by_day,timeToAuction_min
0,41863526108385,35,70828.465866,8.733715,17.5,157228.465866
1,135153013040192,8,101454.009137,2.529822,4.0,187854.009137
2,181891380775191,1,170365.420614,0.316228,0.5,256765.420614
3,186034136943920,7,98271.945332,2.213594,3.5,60166.331894
4,345999128501141,48,9576.349849,14.140564,23.5,54290.38024


## Pruebo con XGBoost

In [72]:
y = data4['predict_value']
X = data4.drop(['ref_hash', 'predict_value'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




In [73]:
# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

39416.00018540661

Este feature mejoró muchísimo el score, lo mantengo

## Agrego Features sobre los eventos

In [74]:
evts = pd.read_csv("data/events_ventana1.csv", dtype = {"event_id": np.int16, "application_id": np.int16, 'device_countrycode': 'category', 'device_os_version': 'category', 'device_brand': 'category', 'device_model': 'category', 'device_city': 'category', 'session_user_agent': 'category', 'trans_id': 'category', 'user_agent': 'category', 'carrier' : 'category', 'kind': 'category', 'device_os': 'category', 'connection_type': 'category', 'ip_address': 'category', 'device_language': 'category'})

In [75]:
evts.head()

Unnamed: 0,date,event_id,ref_type,ref_hash,application_id,attributed,device_countrycode,device_os_version,device_brand,device_model,...,trans_id,user_agent,event_uuid,carrier,kind,device_os,wifi,connection_type,ip_address,device_language
0,2019-04-20 01:42:49.120,0,1891515180541284343,5857744372586891366,210,False,6287817205707153877,,,4.318294190479584e+18,...,,5.046185273150854e+18,5b506964-5f47-4b28-a8c2-8a92d6c23379,,5.882882097123621e+18,,False,,7544543351571901618,3.3013777759777e+18
1,2019-04-20 01:42:49.340,1,1891515180541284343,7642521036780133571,210,False,6287817205707153877,,,,...,,,f1fb9d15-1a7b-4116-8d3b-c4c403e197e2,,4.017674184041173e+18,,False,,6949523255335024165,
2,2019-04-20 01:42:49.365,1,1891515180541284343,2548841562898283198,210,False,6287817205707153877,,,,...,,,c85a0b15-a5d7-472e-8116-6bfa3db19687,,4.017674184041173e+18,,False,,6428537280982666957,
3,2019-04-20 01:42:51.438,2,1891515180541284343,609402887625919085,210,False,6287817205707153877,,,,...,,,f4aa0a97-2de6-4f22-95c6-1b3150112cb9,,6.168308581888313e+18,,False,,7607371352198017145,
4,2019-04-20 01:42:51.838,1,1891515180541284343,9114651763556439823,210,False,6287817205707153877,,,,...,,,08e2f7f7-875f-4aa0-b337-b9b87b0d83ea,,4.017674184041173e+18,,False,,2901772839007473756,


## Aplicaciones populares

Agrego Features tomando en cuenta cuales la popularidad de las applicaciones. La idea es generar un feature que asigne un número a cada dispositivo de acuerdo al id de la applicación. De esta manera podría inferirse que si una applicación es más popular que otra es entonces más probable que se genere un evento sobre la misma.

In [76]:
apps_populares = evts[['application_id', 'ref_hash']].groupby('application_id').count().reset_index()
apps_populares.columns = ['application_id', 'popularidad_app']
apps_populares.head()

Unnamed: 0,application_id,popularidad_app
0,1,21
1,2,153
2,3,699
3,5,36
4,6,19


In [77]:
apps_populares = evts[['ref_hash', 'application_id']].merge(apps_populares, on = 'application_id')
apps_populares.head()

Unnamed: 0,ref_hash,application_id,popularidad_app
0,5857744372586891366,210,574130
1,7642521036780133571,210,574130
2,2548841562898283198,210,574130
3,609402887625919085,210,574130
4,9114651763556439823,210,574130


In [78]:
apps_counts = evts[['date', 'ref_hash', 'application_id']].groupby(['ref_hash', 'application_id']).count().reset_index()
apps_counts.head()

Unnamed: 0,ref_hash,application_id,date
0,41863526108385,65,21
1,41863526108385,121,3
2,41863526108385,210,64
3,161514654074162,121,8
4,186034136943920,210,13


In [79]:
# Le asigno a cada dispositivo la popularidad de la applicación en la cual generó más eventos
apps_pops = apps_counts.groupby(['ref_hash']).agg({'date': 'max'}).reset_index()
apps_pops.columns = ['ref_hash', 'popularidad_events']
apps_pops.head()

Unnamed: 0,ref_hash,popularidad_events
0,41863526108385,64
1,161514654074162,8
2,186034136943920,13
3,360710529886978,3
4,365882020742330,57


Agrego el nuevo feature a los datos

In [80]:
data5 = data4.merge(apps_pops, on = 'ref_hash', how = 'left')
data.head()

Unnamed: 0,ref_hash,auctions_count,predict_value,timeToAuction_min
0,41863526108385,35,70828.465866,157228.465866
1,135153013040192,8,101454.009137,187854.009137
2,181891380775191,1,170365.420614,256765.420614
3,186034136943920,7,98271.945332,60166.331894
4,345999128501141,48,9576.349849,54290.38024


In [None]:
# A los dispositivos que no registran eventos, se le asignará una popularidad de valor 0
#data5['popularidad'] = data5['popularidad_events'].fillna(0.0)

## Pruebo el feature con XGBoost

In [81]:
y = data5['predict_value']
X = data5.drop(['ref_hash', 'predict_value'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




39468.9025675278

con este feature el score volvió a suir, lo descarto

## Agrego un feature tomando en cuenta la cantidad de eventos en los que participó cada dispositivo

In [82]:
count_events = evts.groupby('ref_hash').agg({'date': 'count'}).reset_index()
count_events.columns = ['ref_hash', 'count_events']
count_events.head()

Unnamed: 0,ref_hash,count_events
0,41863526108385,88
1,161514654074162,8
2,186034136943920,13
3,360710529886978,3
4,365882020742330,57


Agrego el nuevo feature y veo qué sucede con el score

In [83]:
data6 = data4.merge(count_events, on = 'ref_hash', how = 'left')
#data['count_events'] = data['count_events'].fillna(0)

## Vuelvo a probar el modelo con XGBoost

In [84]:
y = data6['predict_value']
X = data6.drop(['ref_hash', 'predict_value'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




39486.79576184739

se puede ver que este feature tampoco es bueno, actualmente data4 es el set de features con mejor score

# Pruebo los features anteriores para predecir los tiempos de conversiones

In [135]:
inst = pd.read_csv('data/installs_ventana4.csv', dtype = {'application_id': np.int16, 'ref_type': 'category', 'click_hash': 'category', 'device_country_code': 'category', 'device_brand': 'category', 'device_model': 'category', 'kind': 'category', 'device_language': 'category'})

## Installs

Tomo los tiempos de la primera conversión de cada dispositivo dentro de la ventana 2 para entrenar el algoritmo luego

In [136]:
inst['created'] = pd.to_datetime(inst['created'])
inst.head()

Unnamed: 0,created,application_id,ref_type,ref_hash,click_hash,attributed,implicit,device_countrycode,device_brand,device_model,session_user_agent,user_agent,event_uuid,kind,wifi,trans_id,ip_address,device_language
0,2019-04-23 20:55:27.691,1,1494519392962156891,404154463572960372,,False,True,6287817205707153877,,6.208879341562586e+18,adjust.com,,06f054c1-608b-437b-a7d7-21d7c5860fe4,app_open,,,5153039437575809299,3.3013777759777e+18
1,2019-04-23 18:14:14.756,1,1494519392962156891,1448747284042458525,,False,False,6287817205707153877,,6.208879341562586e+18,adjust.com,,,,,,7469505621463131113,3.3013777759777e+18
2,2019-04-23 06:32:40.906,1,1494519392962156891,1465267208368809184,,False,True,6287817205707153877,,4.322425952775763e+18,adjust.com,,e47c9cd6-121e-4c29-91a3-bb6d26afd36f,app_open,,,3744928684799347195,3.3013777759777e+18
3,2019-04-23 03:12:38.142,1,1494519392962156891,2459396953479678888,,False,False,6287817205707153877,,6.208879341562586e+18,adjust.com,,,,,,7787095122814381489,3.3013777759777e+18
4,2019-04-23 17:34:06.764,1,1494519392962156891,5981451938163147680,,False,True,6287817205707153877,,7.116381917892006e+18,adjust.com,,95f7543b-e881-43f4-a686-63a09cb8cf97,app_open,,,807937722812865778,3.3013777759777e+18


## Calculo el tiempo que tarda en convertir cada dispositivo

In [137]:
inst['created_inicial'] = dt.datetime(2019, 4, 19)
inst['created_inicial'] = pd.to_datetime(inst['created_inicial'])
inst['timeToInstall'] = (inst['created'] - inst['created_inicial'])/np.timedelta64(1,'s')
inst.head()

Unnamed: 0,created,application_id,ref_type,ref_hash,click_hash,attributed,implicit,device_countrycode,device_brand,device_model,session_user_agent,user_agent,event_uuid,kind,wifi,trans_id,ip_address,device_language,created_inicial,timeToInstall
0,2019-04-23 20:55:27.691,1,1494519392962156891,404154463572960372,,False,True,6287817205707153877,,6.208879341562586e+18,adjust.com,,06f054c1-608b-437b-a7d7-21d7c5860fe4,app_open,,,5153039437575809299,3.3013777759777e+18,2019-04-19,420927.691
1,2019-04-23 18:14:14.756,1,1494519392962156891,1448747284042458525,,False,False,6287817205707153877,,6.208879341562586e+18,adjust.com,,,,,,7469505621463131113,3.3013777759777e+18,2019-04-19,411254.756
2,2019-04-23 06:32:40.906,1,1494519392962156891,1465267208368809184,,False,True,6287817205707153877,,4.322425952775763e+18,adjust.com,,e47c9cd6-121e-4c29-91a3-bb6d26afd36f,app_open,,,3744928684799347195,3.3013777759777e+18,2019-04-19,369160.906
3,2019-04-23 03:12:38.142,1,1494519392962156891,2459396953479678888,,False,False,6287817205707153877,,6.208879341562586e+18,adjust.com,,,,,,7787095122814381489,3.3013777759777e+18,2019-04-19,357158.142
4,2019-04-23 17:34:06.764,1,1494519392962156891,5981451938163147680,,False,True,6287817205707153877,,7.116381917892006e+18,adjust.com,,95f7543b-e881-43f4-a686-63a09cb8cf97,app_open,,,807937722812865778,3.3013777759777e+18,2019-04-19,408846.764


Tomo el tiempo mínimo, en SEGUNDOS, que tardó un dispositivo en realizar una instalación

In [138]:
install_time = inst.groupby('ref_hash').agg({'timeToInstall': 'min'}).reset_index()
install_time.columns = ['ref_hash', 'predict_time_install']
install_time.head()

Unnamed: 0,ref_hash,predict_time_install
0,40621409780134,242267.657
1,161514654074162,324824.805
2,168103949904656,259773.188
3,283297668933729,308529.476
4,345999128501141,188378.532


## Ahora leo los datos de la ventana 1 para crear features sobre estos ids

In [139]:
installs = pd.read_csv('data/installs_ventana3.csv', dtype = {'application_id': np.int16, 'ref_type': 'category', 'click_hash': 'category', 'device_country_code': 'category', 'device_brand': 'category', 'device_model': 'category', 'kind': 'category', 'device_language': 'category'})
installs.head()

Unnamed: 0,created,application_id,ref_type,ref_hash,click_hash,attributed,implicit,device_countrycode,device_brand,device_model,session_user_agent,user_agent,event_uuid,kind,wifi,trans_id,ip_address,device_language
0,2019-04-20 10:15:36.274,1,1494519392962156891,5230323462636548010,,False,True,6287817205707153877,,8.355495513718673e+18,adjust.com,,dda99e3c-9c4b-487d-891c-79f0a02cb4a8,app_open,,,8291809486355890410,4.060929664968129e+18
1,2019-04-20 21:56:47.151,1,1494519392962156891,5097163995161606833,,False,True,6287817205707153877,,2.3557720913769155e+18,adjust.com,,7010c3ce-0fcf-46c6-9be8-374cc0e20af4,app_open,,,4006811922873399949,3.3013777759777e+18
2,2019-04-20 22:40:41.239,1,1494519392962156891,6328027616411983332,,False,False,6287817205707153877,,6.156971151807135e+18,adjust.com,,,,,,3386455054590810771,3.3013777759777e+18
3,2019-04-20 18:19:27.485,1,1494519392962156891,7522785771858684314,,False,True,6287817205707153877,,6.208879341562586e+18,adjust.com,,35e7504f-cd9b-4a78-b89f-7335a8bd735a,app_open,,,7297343258015606683,3.3013777759777e+18
4,2019-04-20 03:40:21.239,1,1494519392962156891,7882044913917355073,,False,True,6287817205707153877,,4.566898029552894e+18,adjust.com,,f3a8649c-bd47-4874-a20e-8d3d5cedac2e,app_open,,,4764130939738113581,4.060929664968129e+18


## Agrego un feature sobre la popularidad de las applicaciones

In [140]:
apps_populares_installs = installs.groupby('application_id').agg({'created': 'count'}).reset_index()
apps_populares_installs.columns = ['application_id', 'popularidad_app']
apps_populares_installs = installs[['ref_hash', 'application_id']].merge(apps_populares_installs, on = 'application_id')
apps_populares_installs.head()

Unnamed: 0,ref_hash,application_id,popularidad_app
0,5230323462636548010,1,13
1,5097163995161606833,1,13
2,6328027616411983332,1,13
3,7522785771858684314,1,13
4,7882044913917355073,1,13


In [141]:
apps_counts = installs.groupby(['ref_hash', 'application_id']).agg({'created': 'count'}).reset_index()
# Le asigno a cada dispositivo la popularidad de la applicación en la cual generó más eventos
apps_pops = apps_counts.groupby(['ref_hash']).agg({'created': 'max'}).reset_index()
apps_pops.columns = ['ref_hash', 'popularidad_apps']
apps_pops.head()

Unnamed: 0,ref_hash,popularidad_apps
0,40621409780134,1
1,135153013040192,1
2,161514654074162,1
3,168103949904656,1
4,283297668933729,1


## Creo el set de datos par a entrenar los algoritmos

In [142]:
data_installs = install_time.merge(apps_pops, on = 'ref_hash') 
data_installs.head()

Unnamed: 0,ref_hash,predict_time_install,popularidad_apps
0,40621409780134,242267.657,1
1,161514654074162,324824.805,1
2,168103949904656,259773.188,1
3,283297668933729,308529.476,1
4,345999128501141,188378.532,1


## Pruebo XGBoost para predecir

In [143]:
y = data_installs['predict_time_install']
X = data_installs.drop(['ref_hash', 'predict_time_install'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




52795.34464223376

## Agrego como feature el tiempo real que tarda un dispositivo en realizar la primera conversión, dentro de la ventana 1

In [144]:
installs['created'] = pd.to_datetime(inst['created'])
installs['created_inicial'] = dt.datetime(2019, 4, 18)
installs['created_inicial'] = pd.to_datetime(installs['created_inicial'])
installs['timeToInstall'] = (installs['created'] - installs['created_inicial'])/np.timedelta64(1,'s')


In [145]:
#Ahora tomo el tiempo mínimo
time_to_install = installs.groupby('ref_hash').agg({'timeToInstall': 'min'}).reset_index()

In [146]:
time_to_install.head()

Unnamed: 0,ref_hash,timeToInstall
0,40621409780134,499211.765
1,135153013040192,350437.005
2,161514654074162,359769.986
3,168103949904656,466012.395
4,283297668933729,504232.319


Agrego el  nuevo feature y vuelvo a probar el algoritmo

In [147]:
data_installs1 = data_installs.merge(time_to_install, on = 'ref_hash')
data_installs1.head()

Unnamed: 0,ref_hash,predict_time_install,popularidad_apps,timeToInstall
0,40621409780134,242267.657,1,499211.765
1,161514654074162,324824.805,1,359769.986
2,168103949904656,259773.188,1,466012.395
3,283297668933729,308529.476,1,504232.319
4,345999128501141,188378.532,1,433985.583


In [148]:
y = data_installs1['predict_time_install']
X = data_installs1.drop(['ref_hash', 'predict_time_install'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




51556.10690476193

Se mejoró el score, por lo que mantengo el feature

## ahora agrego los features sobre los eventos

In [149]:
new_data1 = data_installs1.merge(apps_pops, on = 'ref_hash', how = 'left')
new_data1.head()

Unnamed: 0,ref_hash,predict_time_install,popularidad_apps_x,timeToInstall,popularidad_apps_y
0,40621409780134,242267.657,1,499211.765,1
1,161514654074162,324824.805,1,359769.986,1
2,168103949904656,259773.188,1,466012.395,1
3,283297668933729,308529.476,1,504232.319,1
4,345999128501141,188378.532,1,433985.583,1


Pruebo el nuevo feature con xgboost

In [150]:
y = new_data1['predict_time_install']
X = new_data1.drop(['ref_hash', 'predict_time_install'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




51556.10690476193

El score se mantuvo igual, por lo que puedo sospechar que las apps populares en events son las mismas que en installs

Agrego un feature sobre los eventos registrados para cada dispositivo

In [101]:
new_data2 = data_installs1.merge(count_events, on = 'ref_hash', how = 'left')

In [102]:
#new_data['count_events'] = new_data['count_events'].fillna(0.0)
y = new_data2['predict_time_install']
X = new_data2.drop(['ref_hash', 'predict_time_install'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




50958.90353411638

Se logró mejorar el score con este feature, así que lo mantengo

## Agrego los features sobre auctions

In [151]:
#agrego la mediana de subastas registradas por cada dispositivo según el ref_type_id
new_data3 = new_data2.merge(dfApRef[['ref_hash', 'median_count_ref_type']], on = 'ref_hash', how = 'left')

In [152]:
#new_data['median_count_ref_type'] = new_data['median_count_ref_type'].fillna(0.0)
y = new_data3['predict_time_install']
X = new_data3.drop(['ref_hash', 'predict_time_install'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




50996.45182194811

El feature anterior empeoró las performance por lo que no se lo tomará en cuenta para las predicciones

In [154]:
#agrego la cantidad de subastas registradas para cada dispositivo
new_data4 = new_data2.merge(auction_count, on = 'ref_hash', how = 'left')
new_data4.head()

Unnamed: 0,ref_hash,predict_time_install,popularidad_apps,timeToInstall,count_events,auctions_count
0,135153013040192,106580.258,1,244609.776,,8.0
1,365882020742330,9756.508,1,156033.413,57.0,5.0
2,519199987760489,6460.475,1,253539.683,2.0,1.0
3,848739744907305,23437.123,1,323225.147,1.0,
4,918552236966790,134047.027,1,298019.318,10.0,


In [155]:
##new_data['auction_count'] = new_data['auctions_count'].fillna(0.0)
y = new_data4['predict_time_install']
X = new_data4.drop(['ref_hash', 'predict_time_install'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




50996.46436563425

El feature anterior tampoco logró mejorar el score, aunque no lo empeoró, por ahora no lo tomo en cuenta

In [157]:
#Agrego el tiempo mínimo de aparición de cada dispositivo en una subasta
new_data5 = new_data2.merge(apariciones_auctions, on = 'ref_hash', how  = 'left')
new_data5.head()

Unnamed: 0,ref_hash,predict_time_install,popularidad_apps,timeToInstall,count_events,timeToAuction_min
0,135153013040192,106580.258,1,244609.776,,187854.009137
1,365882020742330,9756.508,1,156033.413,57.0,93060.134292
2,519199987760489,6460.475,1,253539.683,2.0,189996.383574
3,848739744907305,23437.123,1,323225.147,1.0,
4,918552236966790,134047.027,1,298019.318,10.0,


In [158]:
#new_data[ 'auctions_by_srcID'] = new_data[ 'auctions_by_srcID'].fillna(0.0)
y = new_data5['predict_time_install']
X = new_data5.drop(['ref_hash', 'predict_time_install'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




50630.72132517265

Se puede apreciar que con este nuevo feature se mejoró bastante el score, así que lo mantendremos entre los features para las predicciones

Agrego otro feature, tomando en cuenta el promedio de subastas, por día, en las que participa cada dispositivo

In [159]:
new_data6 = new_data5.merge(dfApDay[['ref_hash', 'auctions_by_day']], on = 'ref_hash', how  = 'left')
new_data6.head()

Unnamed: 0,ref_hash,predict_time_install,popularidad_apps,timeToInstall,count_events,timeToAuction_min,auctions_by_day
0,135153013040192,106580.258,1,244609.776,,187854.009137,4.0
1,365882020742330,9756.508,1,156033.413,57.0,93060.134292,2.5
2,519199987760489,6460.475,1,253539.683,2.0,189996.383574,0.5
3,848739744907305,23437.123,1,323225.147,1.0,,
4,918552236966790,134047.027,1,298019.318,10.0,,


Ahora pruebo el modelo con los nuevos features

In [160]:
#new_data['timeToAuction_min'] = new_data['timeToAuction_min'].fillna(72 * 60 * 60)
y = new_data6['predict_time_install']
X = new_data6.drop(['ref_hash', 'predict_time_install'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




50643.96058748433

Este feature hizo que el score empeore un poco, así que no lo mantendré en cuenta


## Agrego un nuevo feature tomando en cuenta el user_agent desde donde se origina la instalación

In [161]:
installs_by_user_agent = installs[['user_agent', 'ref_hash']].groupby('user_agent').count().reset_index()
installs_by_user_agent.columns = ['user_agent', 'installs_por_user_agent']
installs_by_user_agent = installs_by_user_agent.merge(installs[['ref_hash', 'user_agent']], on = 'user_agent')

In [113]:
installs_by_user_agent.head()

Unnamed: 0,user_agent,installs_por_user_agent,ref_hash
0,$%28PRODUCT_NAME%29/746 CFNetwork/976 Darwin/1...,1,8670865579348815667
1,%E3%83%8E%E3%83%BC%E3%83%88%E3%83%B3%20WiFi%20...,1,1184543608462124266
2,%E6%A2%A6%E5%B9%BB%E8%8A%B1%E5%9B%AD/3.3.2 CFN...,1,5471994910141133099
3,5miles/2778 CFNetwork/902.2 Darwin/17.7.0,1,6967960820179343958
4,5miles/2902 CFNetwork/811.5.4 Darwin/16.6.0,2,9178440513583202912


In [162]:
new_data7 = installs_by_user_agent[['installs_por_user_agent', 'ref_hash']].merge(new_data5, on = 'ref_hash')
new_data7.head()

Unnamed: 0,installs_por_user_agent,ref_hash,predict_time_install,popularidad_apps,timeToInstall,count_events,timeToAuction_min
0,5,6899713516256349714,147224.531,1,174199.86,1.0,4242.087909
1,5,3014247125747035504,135947.215,1,174248.052,4.0,
2,3,6056618266145674347,99586.128,1,320896.206,12.0,
3,3,3668870771655093485,152582.184,1,320743.091,,
4,11,7862570813058233308,95251.592,1,320743.609,,


Pruebo el nuevo feature

In [163]:
#new_data[ 'auctions_by_srcID'] = new_data[ 'auctions_by_srcID'].fillna(0.0)
y = new_data7['predict_time_install']
X = new_data7.drop(['ref_hash', 'predict_time_install'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




32924.07269312037

Con el nuevo feature el score mejoró bastante así que lo mantengo, hasta ahora el mejor es new_data7

## Agrego un nuevo feature tomando en cuenta la session_user_agent

In [164]:
session_user_agent = installs[['ref_hash', 'session_user_agent']].groupby('session_user_agent').count().reset_index()
session_user_agent.head()

Unnamed: 0,session_user_agent,ref_hash
0,5041C Android 8.1.0,1
1,Apsalar-Postback,4268
2,Audible/462 CFNetwork/758.5.3 Darwin/15.6.0,1
3,Audible/476 CFNetwork/758.5.3 Darwin/15.6.0,1
4,B1-870 Android 7.0,1


In [165]:
session_feature = installs[['ref_hash', 'session_user_agent']].merge(session_user_agent, on = 'session_user_agent', how = 'left')
session_feature = session_feature.drop(columns = 'session_user_agent')
session_feature.columns = ['ref_hash', 'session_user_agent']
session_feature.head()

Unnamed: 0,ref_hash,session_user_agent
0,5230323462636548010,24314.0
1,5097163995161606833,24314.0
2,6328027616411983332,24314.0
3,7522785771858684314,24314.0
4,7882044913917355073,24314.0


Agrego el nuevo feature al set de features

In [166]:
new_data8 = new_data7.merge(session_feature, on = 'ref_hash')

Pruebo el nuevo feature

In [167]:
y = new_data8['predict_time_install']
X = new_data8.drop(['ref_hash', 'predict_time_install'], axis=1)

XGB = xgb.XGBRegressor()
XGB.fit(X, y)

scores = cross_val_score(XGB , X, y, scoring = "neg_mean_squared_error", cv=5)

# RMSE
mse_scores = -scores
rmse_scores = np.sqrt(mse_scores)
rmse_scores.mean()

  if getattr(data, 'base', None) is not None and \
  data.base is not None and isinstance(data, np.ndarray) \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




33638.90062753151

Este feature empeoró el score, así que lo descarto