# Practical Assignment No. 2
## Objective

For each device provided by Jampp, determine the time that will elapse until it appears in an auction again, and the time until its user decides to install a new application.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import auc, accuracy_score
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV, KFold, RandomizedSearchCV, train_test_split
import xgboost as xgb
import datetime as dt

## First, look at the target devices

In [2]:
target = pd.read_csv('data/target_final_competencia_revamped.csv')

In [3]:
target.head()

Unnamed: 0,ref_hash,obj
0,1000169251625791232_sc,0
1,1000169251625791232_st,0
2,1000395625957344640_sc,0
3,1000395625957344640_st,0
4,1006670001679961600_sc,0


Extract the ref_hash by dropping the three-character suffix (`_sc` or `_st`).

In [4]:
target['ref_hash'] = target['ref_hash'].transform(lambda x: str(x)[:-3])

In [5]:
target.head()

Unnamed: 0,ref_hash,obj
0,1000169251625791232,0
1,1000169251625791232,0
2,1000395625957344640,0
3,1000395625957344640,0
4,1006670001679961600,0
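
The suffix can also be kept in its own column instead of being discarded. A minimal sketch, assuming every `ref_hash` in the target ends in an underscore followed by a two-letter code as in the rows above; the column names `device` and `suffix` are hypothetical:

```python
import pandas as pd

# Toy rows in the same shape as the target file shown above.
target_demo = pd.DataFrame({
    "ref_hash": ["1000169251625791232_sc", "1000169251625791232_st"],
    "obj": [0, 0],
})

# Split once from the right: device hash on the left, suffix on the right.
parts = target_demo["ref_hash"].str.rsplit("_", n=1, expand=True)
target_demo["device"] = parts[0]
target_demo["suffix"] = parts[1]
print(target_demo)
```

Keeping the suffix makes it possible to route each row to the matching prediction later without re-parsing the string.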


# Analyzing the data from the first window

In [6]:
auct = pd.read_csv('data/auctions_ventana1.csv', dtype = {"device_id": 'category', "ref_type_id": np.int8, "source_id": np.int8})

In [7]:
auct.dtypes

date             object
device_id      category
ref_type_id        int8
source_id          int8
dtype: object

In [8]:
auct['date'] = pd.to_datetime(auct['date'])

In [9]:
auct.head()

Unnamed: 0,date,device_id,ref_type_id,source_id
0,2019-04-20 23:57:27.912838,1109595589636746168,7,0
1,2019-04-20 23:57:28.381114,5896614299191635403,1,0
2,2019-04-20 23:57:28.515423,4172466725848941608,1,0
3,2019-04-20 23:57:28.700884,2616279795187318849,7,0
4,2019-04-20 23:57:28.868312,8034952072073026056,1,0


## Compute how long a device takes to appear in an auction, counted from the start of the window

The idea is to determine the time elapsed between each appearance of a device in an auction, then take the average time to appearance and predict from that.

In [10]:
auct['date_inicial'] = dt.datetime(2019,4, 18)
auct['date_inicial'] = pd.to_datetime(auct['date_inicial'])
auct['timeToAuction'] = (auct['date'] - auct['date_inicial'])/np.timedelta64(1,'s')
auct.head()

Unnamed: 0,date,device_id,ref_type_id,source_id,date_inicial,timeToAuction
0,2019-04-20 23:57:27.912838,1109595589636746168,7,0,2019-04-18,259047.912838
1,2019-04-20 23:57:28.381114,5896614299191635403,1,0,2019-04-18,259048.381114
2,2019-04-20 23:57:28.515423,4172466725848941608,1,0,2019-04-18,259048.515423
3,2019-04-20 23:57:28.700884,2616279795187318849,7,0,2019-04-18,259048.700884
4,2019-04-20 23:57:28.868312,8034952072073026056,1,0,2019-04-18,259048.868312
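
The paragraph above talks about the time between successive appearances of a device, while the cell measures time since the start of the window. As a sketch of the former, assuming a frame with the same `device_id` and `date` columns (the toy data and the column name `gap_s` are made up for illustration):

```python
import pandas as pd

events = pd.DataFrame({
    "device_id": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime([
        "2019-04-18 00:00:10", "2019-04-18 00:00:40", "2019-04-18 00:02:10",
        "2019-04-18 01:00:00", "2019-04-18 01:00:05",
    ]),
})

# Sort so that diff() compares each auction with the previous one of the same device.
events = events.sort_values(["device_id", "date"])
events["gap_s"] = events.groupby("device_id")["date"].diff().dt.total_seconds()

# Mean gap per device; the first auction of each device has no gap (NaN) and is skipped.
mean_gap = events.groupby("device_id")["gap_s"].mean()
print(mean_gap)
```

Devices with a single auction end up with a NaN mean gap and would need a separate rule.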


Take the minimum time, in SECONDS, that each device took to appear in an auction.

In [11]:
auction_time = auct.groupby('device_id').agg({'timeToAuction': 'min'}).reset_index()
auction_time.columns = ['ref_hash', 'predict_value']
auction_time.head()

Unnamed: 0,ref_hash,predict_value
0,1000061425870948777,13095.182558
1,1000503394293263005,11.398298
2,1001008640113335510,61148.713542
3,1001123163431776865,1442.033706
4,1001144380199556647,1977.28307


## Count how many times each device appears in an auction

Start by simply counting the number of auctions each device took part in, and add that as a new feature.

In [12]:
auction_count = auct.groupby('device_id').agg({'date': 'count'}).reset_index()
auction_count.columns = ['ref_hash', 'auctions_count']

In [13]:
auction_count.head()

Unnamed: 0,ref_hash,auctions_count
0,1000061425870948777,8
1,1000503394293263005,343
2,1001008640113335510,24
3,1001123163431776865,68
4,1001144380199556647,78


Build a single dataset with the first features created.

In [14]:
data = auction_count.merge(auction_time, on = 'ref_hash')

In [15]:
data.head()

Unnamed: 0,ref_hash,auctions_count,predict_value
0,1000061425870948777,8,13095.182558
1,1000503394293263005,343,11.398298
2,1001008640113335510,24,61148.713542
3,1001123163431776865,68,1442.033706
4,1001144380199556647,78,1977.28307


## Trying Random Forest

In [16]:
from sklearn.ensemble import RandomForestRegressor

y = data['predict_value']
X = data.drop(['ref_hash', 'predict_value'], axis=1)

RFR = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
RFR.fit(X, y)  

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)

In [17]:
predictions = cross_val_predict(RFR, X, y, cv=10)
predictions

array([ 77710.58982183,  55362.03830697,  55568.04482871, ...,
       109018.6135936 , 109018.6135936 , 109018.6135936 ])

In [18]:
new_y = np.array(y)
new_y

array([1.30951826e+04, 1.13982980e+01, 6.11487135e+04, ...,
       8.93016922e+04, 8.94119782e+04, 9.90811254e+04])

In [19]:
scores = cross_val_score(RFR, X, y, cv=10)
scores.mean()

-0.20243543636676992
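
For a regressor, `cross_val_score` without a `scoring` argument reports R², so a negative mean says the model does worse than always predicting the mean of the targets. A small sketch on synthetic data (not the auction data) of what that constant baseline looks like, using `DummyRegressor`:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 1)
y_demo = rng.rand(200)  # target unrelated to the feature

# Constant-mean baseline; its out-of-fold R² can never exceed 0.
baseline = DummyRegressor(strategy="mean")
cv = KFold(n_splits=5, shuffle=True, random_state=0)
baseline_r2 = cross_val_score(baseline, X_demo, y_demo, cv=cv, scoring="r2")
print(baseline_r2.mean())
```

Comparing a model's score against this baseline on the same folds gives a quick reading of whether the features carry any signal.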

## Number of times each device appears, by source_id

Next, count how many times a device took part in an auction from each source the auction originates from. Popular source_id values get larger counts; then take the standard deviation of those counts for each device_id.

In [19]:
auct['apariciones'] = 1

In [20]:
auction_by_sourceID = auct.groupby(['device_id', 'source_id']).agg({'apariciones':'sum'}).unstack(1).fillna(0).reset_index()
auction_by_sourceID.columns = auction_by_sourceID.columns.droplevel(0)
auction_by_sourceID.columns = ['ref_hash', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
auction_by_sourceID.head()

Unnamed: 0,ref_hash,0,1,2,3,4,5,6,7,8,9
0,1000061425870948777,1.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1000503394293263005,225.0,79.0,1.0,18.0,1.0,0.0,0.0,0.0,19.0,0.0
2,1001008640113335510,9.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1001123163431776865,63.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0
4,1001144380199556647,46.0,15.0,1.0,0.0,1.0,0.0,1.0,0.0,14.0,0.0


In [21]:
auction_by_sourceID['auctions_by_srcID'] = auction_by_sourceID.iloc[:,1:].std(axis = 1)
auction_by_sourceID.head()

Unnamed: 0,ref_hash,0,1,2,3,4,5,6,7,8,9,auctions_by_srcID
0,1000061425870948777,1.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.20101
1,1000503394293263005,225.0,79.0,1.0,18.0,1.0,0.0,0.0,0.0,19.0,0.0,71.32718
2,1001008640113335510,9.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.25357
3,1001123163431776865,63.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,19.809089
4,1001144380199556647,46.0,15.0,1.0,0.0,1.0,0.0,1.0,0.0,14.0,0.0,14.649991


Add the new feature to the data.

In [22]:
data = data.merge(auction_by_sourceID[['ref_hash', 'auctions_by_srcID']], on = 'ref_hash')


In [23]:
data.head()

Unnamed: 0,ref_hash,auctions_count,predict_value,auctions_by_srcID
0,1000061425870948777,8,13095,2.20101
1,1000503394293263005,343,11,71.32718
2,1001008640113335510,24,61148,5.25357
3,1001123163431776865,68,1442,19.809089
4,1001144380199556647,78,1977,14.649991


Test the model again with the new feature.

In [24]:
y = data['predict_value']
X = data.drop(['ref_hash', 'predict_value'], axis=1)

RFR = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
RFR.fit(X, y)  

RFR.feature_importances_

array([0.97130935, 0.02869065])

In [25]:
cross_val_predict(RFR, X, y, cv=10)

array([ 77710.09775776,  55361.54324169,  55567.549781  , ...,
       109018.12684959, 109018.12684959, 109018.12684959])

In [26]:
scores = cross_val_score(RFR, X, y, cv=10, scoring='explained_variance')
scores.mean()

-0.0008055673580986134

## Number of times each device appears in an auction, by ref_type

In [27]:
auct['ref_type_id'].value_counts()

1    13313564
7     2331211
Name: ref_type_id, dtype: int64

In [28]:
dfApRef = auct.groupby(['device_id', 'ref_type_id']).agg({'apariciones':'sum'}).unstack(1).fillna(0).reset_index()
dfApRef.columns = dfApRef.columns.droplevel(0)
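# median(axis=1) also takes in the device id column, so with two count columns
# the result equals the larger of the two counts (see the output below)
dfApRef['median_count_ref_type'] = dfApRef.median(axis = 1)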
dfApRef.head()

ref_type_id,Unnamed: 1,1,7,median_count_ref_type
0,1000061425870948777,8.0,0.0,8.0
1,1000503394293263005,0.0,343.0,343.0
2,1001008640113335510,24.0,0.0,24.0
3,1001123163431776865,68.0,0.0,68.0
4,1001144380199556647,0.0,78.0,78.0


Check whether each user only has appearances for a single ref_type, to see whether the feature will be useful.

In [29]:
dfApRef.columns = ['ref_hash', 'auctions_ref_type1', 'auctions_ref_type7', 'median_count_ref_type']
dfApRef.head()

Unnamed: 0,ref_hash,auctions_ref_type1,auctions_ref_type7,median_count_ref_type
0,1000061425870948777,8.0,0.0,8.0
1,1000503394293263005,0.0,343.0,343.0
2,1001008640113335510,24.0,0.0,24.0
3,1001123163431776865,68.0,0.0,68.0
4,1001144380199556647,0.0,78.0,78.0


In [30]:
dfApRef['aparece_en_distinto_ref_type'] = ((dfApRef['auctions_ref_type1'] > 0) & (dfApRef['auctions_ref_type7'] > 0))

In [31]:
dfApRef.head()

Unnamed: 0,ref_hash,auctions_ref_type1,auctions_ref_type7,median_count_ref_type,aparece_en_distinto_ref_type
0,1000061425870948777,8.0,0.0,8.0,False
1,1000503394293263005,0.0,343.0,343.0,False
2,1001008640113335510,24.0,0.0,24.0,False
3,1001123163431776865,68.0,0.0,68.0,False
4,1001144380199556647,0.0,78.0,78.0,False


In [32]:
dfApRef['aparece_en_distinto_ref_type'].value_counts()

False    318079
True        246
Name: aparece_en_distinto_ref_type, dtype: int64
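
An equivalent check that does not depend on knowing the ref_type values in advance is to count the distinct `ref_type_id` values per device. A sketch on toy data with the same column names (`multi_type` is a hypothetical name):

```python
import pandas as pd

auct_demo = pd.DataFrame({
    "device_id": ["a", "a", "b", "b", "c"],
    "ref_type_id": [1, 1, 1, 7, 7],
})

# Number of distinct ref types seen for each device.
n_types = auct_demo.groupby("device_id")["ref_type_id"].nunique()
multi_type = n_types > 1
print(multi_type.value_counts())
```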

Since some devices do have counts for both ref_types, it is worth adding this feature and seeing what happens.

In [33]:
data = data.merge(dfApRef[['ref_hash', 'median_count_ref_type']], on = 'ref_hash')
data.head()

Unnamed: 0,ref_hash,auctions_count,predict_value,auctions_by_srcID,median_count_ref_type
0,1000061425870948777,8,13095,2.20101,8.0
1,1000503394293263005,343,11,71.32718,343.0
2,1001008640113335510,24,61148,5.25357,24.0
3,1001123163431776865,68,1442,19.809089,68.0
4,1001144380199556647,78,1977,14.649991,78.0


In [34]:
y = data['predict_value']
X = data.drop(['ref_hash', 'predict_value'], axis=1)

RFR = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
RFR.fit(X, y)  

RFR.feature_importances_

array([0.73831851, 0.00969993, 0.25198156])

In [35]:
cross_val_predict(RFR, X, y, cv=10)

array([ 77719.47785329,  55370.06244506,  55576.06898437, ...,
       109018.26585696, 109018.26585696, 109018.26585696])

In [36]:
scores = cross_val_score(RFR, X, y, cv=5, scoring='r2')
scores.mean()

-0.2254267256479733

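## Number of auction appearances per device per day

Count each device's auction appearances per day, then take the median across days. In the cell below `median(axis=1)` runs over every column, `ref_hash` included; the output shows the identifier taking part as one more (very large) value, so `auctions_by_day` comes out as the mean of the two largest daily counts (3.0 for 2, 2, 4) rather than the median of the three days.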

In [37]:
auct['fecha'] = auct['date'].dt.date

In [38]:
dfApDay = auct.groupby(['device_id', 'fecha']).agg({'apariciones':'sum'}).unstack(1).fillna(0).reset_index()
dfApDay.columns = dfApDay.columns.droplevel(0)
dfApDay.columns = ['ref_hash', '2019-04-18', '2019-04-19', '2019-04-20']
dfApDay['auctions_by_day'] = dfApDay.median(axis = 1)

In [39]:
dfApDay.head()

Unnamed: 0,ref_hash,2019-04-18,2019-04-19,2019-04-20,auctions_by_day
0,1000061425870948777,2.0,2.0,4.0,3.0
1,1000503394293263005,111.0,99.0,133.0,122.0
2,1001008640113335510,24.0,0.0,0.0,12.0
3,1001123163431776865,16.0,20.0,32.0,26.0
4,1001144380199556647,34.0,27.0,17.0,30.5
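
A row-wise reduction over the whole frame also sees the `ref_hash` column. A sketch of restricting the median to the daily count columns, using the counts from the first three rows of the table above (the identifiers are replaced by short made-up labels):

```python
import pandas as pd

per_day = pd.DataFrame({
    "ref_hash": ["dev_1", "dev_2", "dev_3"],
    "2019-04-18": [2.0, 111.0, 24.0],
    "2019-04-19": [2.0, 99.0, 0.0],
    "2019-04-20": [4.0, 133.0, 0.0],
})

# Select only the day columns so the identifier cannot take part in the median.
day_cols = [c for c in per_day.columns if c != "ref_hash"]
per_day["auctions_by_day"] = per_day[day_cols].median(axis=1)
print(per_day)
```

Setting `ref_hash` as the index before the reduction would have the same effect.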


Add the new feature.

In [40]:
data = data.merge(dfApDay, on = 'ref_hash')
data.head()

Unnamed: 0,ref_hash,auctions_count,predict_value,auctions_by_srcID,median_count_ref_type,2019-04-18,2019-04-19,2019-04-20,auctions_by_day
0,1000061425870948777,8,13095,2.20101,8.0,2.0,2.0,4.0,3.0
1,1000503394293263005,343,11,71.32718,343.0,111.0,99.0,133.0,122.0
2,1001008640113335510,24,61148,5.25357,24.0,24.0,0.0,0.0,12.0
3,1001123163431776865,68,1442,19.809089,68.0,16.0,20.0,32.0,26.0
4,1001144380199556647,78,1977,14.649991,78.0,34.0,27.0,17.0,30.5


Now test the model.

In [41]:
y = data['predict_value']
X = data.drop(['ref_hash', 'predict_value'], axis=1)

RFR = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)
RFR.fit(X, y)  

RFR.feature_importances_

array([0.        , 0.        , 0.        , 0.85491983, 0.14508017,
       0.        , 0.        ])

In [42]:
cross_val_predict(RFR, X, y, cv=10)

array([ 37647.12013009,  24360.45178185,  24360.45178185, ...,
       125325.17658511, 125325.17658511, 125325.17658511])

In [43]:
scores = cross_val_score(RFR, X, y, cv=5, scoring='neg_mean_absolute_error')
scores.mean()

-25542.264868818827
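
The four evaluations above use different metrics (default R², explained variance, R² and negative MAE) and different fold counts, so their scores are not directly comparable. A sketch of scoring one model with a fixed set of metrics and folds via `cross_validate`, on synthetic data rather than the auction features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

rng = np.random.RandomState(0)
X_demo = rng.rand(300, 3)
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

model = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["r2", "explained_variance", "neg_mean_absolute_error"]

# Same folds, same model, several metrics at once.
results = cross_validate(model, X_demo, y_demo, cv=cv, scoring=metrics)
for name in metrics:
    print(name, results["test_" + name].mean())
```

Re-running this after each new feature would show whether the feature helped on every metric, not just the one picked for that step.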

## Clicks

First, consider how long each device takes to click; for this I use all the data, without splitting it into windows.

In [44]:
# Note: 'country_codde' below is misspelled, so that entry has no effect and country_code is read as int64 (see the info() output)
clks = pd.read_csv("data/clks_ventana1.csv", dtype = {'advertiser_id': np.int8, 'action_id': np.float32, 'source_id': np.int8, 'country_codde': 'category',  'carrier_id': np.float16, 'specs_brand': 'category', 'brand': np.float16, 'ref_type': 'category', 'ref_hash': 'category'})

In [45]:
clks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12768 entries, 0 to 12767
Data columns (total 20 columns):
advertiser_id      12768 non-null int8
action_id          1 non-null float32
source_id          12768 non-null int8
created            12768 non-null object
country_code       12768 non-null int64
latitude           12768 non-null float64
longitude          12768 non-null float64
wifi_connection    12768 non-null bool
carrier_id         12548 non-null float16
trans_id           12768 non-null object
os_minor           12761 non-null float64
agent_device       2004 non-null float64
os_major           12761 non-null float64
specs_brand        12768 non-null category
brand              2923 non-null float16
timeToClick        7510 non-null float64
touchX             8571 non-null float64
touchY             8571 non-null float64
ref_type           12768 non-null category
ref_hash           12768 non-null category
dtypes: bool(1), category(3), float16(2), float32(1), float64(8), int6

In [46]:
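# Assumption: `click_time` is never defined in the notebook (the original cell raised NameError); it is taken here as the mean timeToClick per device
click_time = clks.groupby('ref_hash', observed=True)['timeToClick'].mean().reset_index()
click_time['possible_click'] = click_time['timeToClick'].transform(lambda x: 1/x if (x > 0) else 0.0)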

In [None]:
click_time['possible_click'].value_counts()
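
The intent of the `possible_click` feature appears to be an inverse-time score: 1/`timeToClick` for positive times and 0 otherwise. A vectorized sketch of that reading on toy data, which also sends missing times to 0 (`clicks_demo` is a made-up frame):

```python
import numpy as np
import pandas as pd

clicks_demo = pd.DataFrame({"timeToClick": [2.0, 0.5, 0.0, np.nan]})

t = clicks_demo["timeToClick"]
# 1/t where t is positive; 0.0 for zero, negative or missing times.
clicks_demo["possible_click"] = (1.0 / t.where(t > 0)).fillna(0.0)
print(clicks_demo)
```

Masking before dividing avoids any division by zero, so no separate `dropna` step is needed.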

## Installs

I use the full installs dataframe, without applying a window, because for now the goal is to compute the average time each device took to install; the specific dates do not matter.

In [None]:
inst = pd.read_csv('data/installs_ventana1.csv', dtype = {'application_id': np.int16, 'ref_type': 'category', 'ref_hash': 'category', 'click_hash': 'category', 'device_country_code': 'category', 'device_brand': 'category', 'device_model': 'category', 'kind': 'category', 'device_language': 'category'})

In [None]:
inst['created'] = pd.to_datetime(inst['created'])
inst.head()

In [None]:
inst.dtypes

## Compute how long each device takes to convert

In [None]:
inst.info()

In [None]:
inst['created_inicial'] = dt.datetime(2019, 4, 18)
inst['created_inicial'] = pd.to_datetime(inst['created_inicial'])
inst['timeToInstall'] = (inst['created'] - inst['created_inicial'])/np.timedelta64(1,'s')
inst.head()

Take the maximum and minimum time, in SECONDS, that a device took to install.

In [None]:
install_time = inst.groupby('ref_hash').agg({'timeToInstall': ['min', 'max']}).reset_index()
install_time.columns = install_time.columns.droplevel(1)
install_time.columns = ['ref_hash', 'minTimeToInstall', 'maxTimeToInstall']
install_time
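
The text above mentions the average time to install, while the cell keeps only the minimum and maximum. A sketch with named aggregation that yields all three in one pass with flat column names; it assumes pandas 0.25 or later and uses toy data with the same `ref_hash` and `timeToInstall` columns (`meanTimeToInstall` is a hypothetical extra column):

```python
import pandas as pd

inst_demo = pd.DataFrame({
    "ref_hash": ["a", "a", "b"],
    "timeToInstall": [100.0, 300.0, 50.0],
})

# Named aggregation gives flat column names directly, so no droplevel is needed.
install_time_demo = (
    inst_demo.groupby("ref_hash")
    .agg(
        minTimeToInstall=("timeToInstall", "min"),
        maxTimeToInstall=("timeToInstall", "max"),
        meanTimeToInstall=("timeToInstall", "mean"),
    )
    .reset_index()
)
print(install_time_demo)
```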

## Add the devices with installs and the time each device takes to install

In [None]:
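# Note: neither `auctions` nor the `possible_install` column is defined earlier in this notebook; this cell was never run
auctions = auctions.merge(install_time[['ref_hash', 'possible_install']], on = 'ref_hash')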

In [None]:
auctions = auctions.merge(click_time[['ref_hash', 'possible_click']], on = 'ref_hash')

In [None]:
auctions.info()

In [None]:
auctions.head()

## Trying Gradient Boosting

In [None]:
auctions['label_st'] = auctions['timeToAuction'].astype(int)
auctions.head()

In [None]:
df_y = auctions['label_st']
df_X = auctions.drop(['ref_hash','timeToAuction', 'label_st'], axis=1)

from sklearn.ensemble import GradientBoostingClassifier  # missing from the imports at the top

GBC = GradientBoostingClassifier(random_state=23, n_estimators=50, min_samples_split=50)
scores = cross_val_score(GBC , df_X, df_y, scoring="roc_auc", cv=5)
scores.mean()
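
The label here is a time in seconds, which is a continuous quantity, while `roc_auc` is a classification metric. One alternative, sketched on synthetic data and not taken from the notebook, is to treat the problem as regression with `GradientBoostingRegressor` and report RMSE; the hyperparameters simply mirror the ones above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(23)
X_demo = rng.rand(300, 2)
y_demo = 1000 * X_demo[:, 0] + rng.normal(scale=20, size=300)  # synthetic "seconds"

GBR = GradientBoostingRegressor(random_state=23, n_estimators=50, min_samples_split=50)
cv = KFold(n_splits=5, shuffle=True, random_state=23)

# scikit-learn maximizes scores, so MSE comes back negated; flip the sign for RMSE.
neg_mse = cross_val_score(GBR, X_demo, y_demo, scoring="neg_mean_squared_error", cv=cv)
rmse = np.sqrt(-neg_mse)
print(rmse.mean())
```

RMSE is in the same unit as the target (seconds), which makes it easier to judge than a unitless score.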

## Trying XGBoost

In [None]:
'''XGBC = xgb.XGBClassifier(learning_rate =0.075, n_estimators=95, max_depth=4, min_child_weight=6, 
                         gamma=0.3, subsample=0.8, colsample_bytree=0.8, objective= 'binary:logistic',
                         scale_pos_weight=0.8, seed = 15)

scores = cross_val_score(XGBC , df_X, df_y, scoring="roc_auc", cv=5)
scores.mean()'''