<a href="https://colab.research.google.com/github/JulioLaz/Consumer_Spending_Prediction_final/blob/main/Copia_Consumer_Spending_Prediction_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**BUSINESS PROBLEM**


---




An e-commerce company needs to forecast and optimize how much its users spend, and is looking for innovative ways to do so. As data scientists, we have been asked to build a machine learning model that accurately predicts how much a user will spend when visiting the site.

### **Your main tasks will be:**

**1. Data Preprocessing:** Import the provided dataset correctly, analyze and understand it, clean the data, drop attributes that add no value, and handle missing values.

**2. Exploration and Feature Engineering:** Build visualizations to understand the relationships between variables and select the relevant features, identify key variables, encode categorical variables, and normalize/scale the data.

**3. Model Building:** Experiment with machine learning algorithms such as Linear Regression, Decision Tree Regressor, and Random Forest Regressor, among others.

**4. Model Evaluation and Selection:** Evaluate the models with metrics such as mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination (R²). Select the best-performing model for predicting user spending.
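
As a reference for the three metrics named in task 4, here is a minimal sketch in plain Python. MSE is the mean of the squared residuals, RMSE is its square root, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The function name `regression_metrics` and the toy values are assumptions made for illustration; the notebook itself relies on `mean_squared_error` and `r2_score` from scikit-learn.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    # Residual sum of squares: squared gap between each target and its prediction
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mse = ss_res / n
    rmse = math.sqrt(mse)
    # Total sum of squares: spread of the targets around their own mean
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2

# Toy values, invented for illustration only
y_true = [0.0, 0.0, 10.0, 30.0]
y_pred = [1.0, 0.0, 12.0, 27.0]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.2%}")
```

RMSE is expressed in the same units as the target, which makes it easier to read than MSE when the target is a monetary amount.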

## Variable reference:
https://support.google.com/analytics/answer/3437719?hl=es-419

#**1. Environment Setup**


---




In [1]:
# !python -V
# print('------')
# !pip show Pandas | grep 'Name\|Version'
# print('------')
# !pip show Matplotlib | grep 'Name\|Version'

# Python 3.10.12
# ------
# Name: pandas
# Version: 1.5.3
# ------
# Name: matplotlib
# Version: 3.7.1

In [2]:
!pip install xgboost



In [3]:
!pip install wget



In [4]:
import wget
import warnings
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import lightgbm as lgb

from scipy.stats import randint
from sklearn.preprocessing import LabelEncoder, StandardScaler, FunctionTransformer
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from joblib import dump, load

# Suppress warnings
warnings.filterwarnings("ignore")

# Configure pandas to show every column
pd.set_option('display.max_columns', None)

# Global variables (at module level this statement has no effect; the functions below declare `global df_traffic` themselves)
global df_traffic, resultados, modelo, modelo_clasificacion


#**2. Data Preprocessing**


---


In [5]:
def preprocesamiento():
  global df_traffic
  df_traffic = pd.read_csv('https://raw.githubusercontent.com/ElProfeAlejo/Bootcamp_Databases/main/traffic_site.csv', dtype={'date':object,'fullVisitorId':object,'visitId':object})
  diccionarios = ['device','geoNetwork','trafficSource','totals']

  ## Unpack the JSON dictionaries:
  for columna in diccionarios:
    df_traffic = df_traffic.join(pd.DataFrame([json.loads(linea) for linea in df_traffic[columna]]))
  df_traffic.drop(columns=diccionarios, axis=1,inplace=True)

  # Convert the columns to strings to avoid errors:
  df_traffic_str = df_traffic.astype(str).copy()

  # Find the columns that hold a single value:
  unique_value=[]
  for col in df_traffic_str.drop(columns='isMobile',axis=1).columns:
      if 1 == len(df_traffic_str[col].unique()):
        unique_value.append(col)
  print(f'Vars con valor único ({len(unique_value)})\n {unique_value}')

  ### Drop the single-value columns:
  df_traffic.drop(columns=unique_value,axis=1,inplace=True)

  ## Drop another column that holds only one value:
  df_traffic.drop(columns='campaignCode',axis=1,inplace=True)

  ### Convert columns to a numeric type:
  cuant = ['fullVisitorId','visitId','visitNumber','visitStartTime', 'bounces', 'hits','pageviews','newVisits', 'transactionRevenue']
  for columna in cuant:
      df_traffic[columna] = pd.to_numeric(df_traffic[columna])

# ///////////////////////////////////////////////////////////////////////////

  ### Check whether any column holds dict values:
  for col in df_traffic.columns:
    if isinstance(df_traffic[col].iloc[0], dict):
        print(f"La columna '{col}' contiene valores de tipo dict.")

  ### Replace the placeholder dict in that column with NaN:
  df_traffic['adwordsClickInfo'] = df_traffic['adwordsClickInfo'].apply(lambda x: np.nan if isinstance(x, dict) and x == {'criteriaParameters': 'not available in demo dataset'} else x)

  ### Unpack the dict key-value pairs:
  # Apply pd.Series() to the 'adwordsClickInfo' column to split the dictionaries into columns
  expanded_info = df_traffic['adwordsClickInfo'].apply(pd.Series)

  # Concatenate the original DataFrame with the new columns
  df_traffic = pd.concat([df_traffic, expanded_info], axis=1)

  # Drop the columns that are no longer needed:
  columns_to_drop = ['adwordsClickInfo', 'criteriaParameters', 0, 'targetingCriteria', 'date']
  df_traffic.drop(columns=columns_to_drop, inplace=True)

  df_traffic = df_traffic.drop_duplicates() ## drop duplicate rows
# ///////////////////////////////////////////////////////////////////////////

  ## Convert visitStartTime from Unix seconds to datetime:
  df_traffic['visitStartTime'] = pd.to_datetime(df_traffic['visitStartTime'], unit='s')

  ### Replace NaN with zeros:
  df_traffic.fillna(0, inplace=True)

  ### Divide the target by 1e6:
  df_traffic['transactionRevenue']= df_traffic['transactionRevenue']/1e6
  df_traffic.head(5)

preprocesamiento()

Vars con valor único (18)
 ['socialEngagementType', 'browserVersion', 'browserSize', 'operatingSystemVersion', 'mobileDeviceBranding', 'mobileDeviceModel', 'mobileInputSelector', 'mobileDeviceInfo', 'mobileDeviceMarketingName', 'flashVersion', 'language', 'screenColors', 'screenResolution', 'cityId', 'latitude', 'longitude', 'networkLocation', 'visits']
La columna 'adwordsClickInfo' contiene valores de tipo dict.
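
The two core steps of `preprocesamiento`, unpacking JSON-string columns and finding columns that hold a single value, can be sketched without pandas. The rows below are invented and the helper names `unpack_json_column` and `constant_columns` are hypothetical; the notebook does the same work with `json.loads`, `DataFrame.join`, and `unique()`.

```python
import json

# Hypothetical rows: each 'device' cell holds a JSON string, as in the raw CSV
rows = [
    {"visitId": "1", "device": '{"browser": "Chrome", "flashVersion": "not available in demo dataset"}'},
    {"visitId": "2", "device": '{"browser": "Safari", "flashVersion": "not available in demo dataset"}'},
]

def unpack_json_column(rows, column):
    """Replace a JSON-string column with one key per field of the parsed object."""
    unpacked = []
    for row in rows:
        flat = {k: v for k, v in row.items() if k != column}
        flat.update(json.loads(row[column]))
        unpacked.append(flat)
    return unpacked

def constant_columns(rows):
    """Return the sorted names of columns whose string form takes one value across all rows."""
    keys = rows[0].keys()
    return sorted(k for k in keys if len({str(r.get(k)) for r in rows}) == 1)

flat_rows = unpack_json_column(rows, "device")
single_valued = constant_columns(flat_rows)
print(single_valued)
```

Comparing string forms mirrors the `astype(str)` step in the notebook, which lets columns that still contain dicts take part in the uniqueness check.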


#**3. Exploration and Feature Engineering**


---


In [6]:
def feature_engineering():
    global df_traffic
    ### Break the visitStartTime column into year, month, fortnight, hour, day, and time-range columns:
    df_traffic['visitStartTime'] = pd.to_datetime(df_traffic['visitStartTime'])

    # Create columns for the year, the month, the fortnight of the month, the hour, and the day
    df_traffic['year'] = df_traffic['visitStartTime'].dt.year.astype('uint16')
    df_traffic['month'] = df_traffic['visitStartTime'].dt.month.astype('uint8')
    df_traffic['fortnight'] = df_traffic['visitStartTime'].dt.day.apply(lambda day: 1 if day <= 15 else 2).astype('uint8')
    df_traffic['hour'] = df_traffic['visitStartTime'].dt.hour.astype('uint8')
    df_traffic['day'] = df_traffic['visitStartTime'].dt.day.astype('uint8')
    # Note: pd.cut builds right-closed intervals here, (0, 6], (6, 12], ..., so hour 0 falls outside every bin and becomes NaN
    df_traffic['time_range'] = pd.cut(df_traffic['visitStartTime'].dt.hour, bins=[0, 6, 12, 18, 24], labels=['madrugada', 'mañana', 'tarde', 'noche'], ordered=False).astype('object')

    ## Drop the visitStartTime column:
    df_traffic.drop(columns='visitStartTime', axis=1,inplace=True)

    ### Apply a label encoder to turn qualitative columns into ordinal quantitative ones:
    cualitativas = df_traffic.dtypes[df_traffic.dtypes == object].keys()
    for columna in cualitativas:
        lbl = LabelEncoder()
        strings = list(df_traffic[columna].values.astype('str'))
        lbl.fit(strings)
        df_traffic[columna] = lbl.transform(strings)
        # Convert to uint8 (note: uint8 only holds 0-255, so a column with more than 256 categories cannot be represented faithfully)
        df_traffic[columna] = df_traffic[columna].astype('uint8')

    ## Drop the sessionId column:
    df_traffic.drop(columns='sessionId',inplace=True)

    ## Frequency encoding:
    ### Frequency encoding for fullVisitorId:
    fullVisitorId_frequency = df_traffic['fullVisitorId'].value_counts()
    df_traffic['fullVisitorId_enc_frec'] = df_traffic['fullVisitorId'].map(fullVisitorId_frequency)

    ### Frequency encoding for visitId:
    visitId_frequency = df_traffic['visitId'].value_counts()
    df_traffic['visitId_enc_frec'] = df_traffic['visitId'].map(visitId_frequency)

    ### Drop visitId and fullVisitorId:
    df_traffic.drop(columns='visitId',axis=1,inplace=True)
    df_traffic.drop(columns='fullVisitorId',axis=1,inplace=True)

    ## Convert the boolean column to int:
    df_traffic['isMobile'] = df_traffic['isMobile'].astype(int)

    ## Replace NaN with zeros:
    df_traffic.fillna(0, inplace=True)

    # Fill missing values in 'transactionRevenue' with zero
    df_traffic['transactionRevenue'].fillna(0, inplace=True)

    ## Create a new 0/1 column that flags rows whose transactionRevenue is zero:
    # Note: this flag is derived from the target and stays in the feature matrix used by the models below
    df_traffic['revenue_zero'] = np.where(df_traffic['transactionRevenue'] == 0, 1, 0)

    # One-hot encoding; get_dummies removes the original columns
    columns=['browser', 'continent','networkDomain']
    df_traffic = pd.get_dummies(df_traffic, columns=columns, prefix=columns, drop_first=True)

    ### Replace values with their frequencies:
    columns_to_map = ['city', 'country', 'subContinent', 'metro','hour','time_range','channelGrouping']

    for column in columns_to_map:
        column_frequency = df_traffic[column].value_counts()
        df_traffic[column] = df_traffic[column].map(column_frequency)

    #### Drop columns:
    columns_features= ['year','fortnight','isMobile','campaign','gclId',
                       'page', 'adContent','bounces','newVisits',
                       'metro','visitId_enc_frec','browser_1','browser_2',
                       'browser_3','browser_4','browser_6','browser_7']
    for feature in columns_features:
        df_traffic.drop(columns=[feature], inplace=True)

    df_traffic.drop(columns=['isVideoAd', 'adNetworkType','slot','hits'],axis=1,inplace=True)

    ### Optimize memory
    conversion_dict = {
        'transactionRevenue': 'uint16',  # note: an integer dtype drops the fractional part of the revenue
        'channelGrouping': 'uint16',
        'subContinent': 'uint16',
        'country': 'uint16',
        'city': 'uint16',
        'hour': 'uint16',
        'time_range': 'uint16',
        'fullVisitorId_enc_frec': 'uint8',
        'visitNumber': 'uint8',
        'revenue_zero': 'uint8',
        'pageviews': 'uint16'
    }
    df_traffic = df_traffic.astype(conversion_dict)

feature_engineering()
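
Frequency encoding, used above for `fullVisitorId`, `city`, `country`, and other columns, replaces each category with the number of rows that share it. A minimal sketch in plain Python follows; the helper name `frequency_encode` and the city values are invented for illustration, and the notebook gets the same result with `value_counts()` followed by `map`.

```python
from collections import Counter

def frequency_encode(values):
    """Map each value to the number of times it appears in `values`."""
    counts = Counter(values)
    return [counts[v] for v in values]

# Hypothetical category values
cities = ["Bogota", "Lima", "Bogota", "Quito", "Bogota", "Lima"]
encoded = frequency_encode(cities)
print(encoded)  # [3, 2, 3, 1, 3, 2]
```

One trade-off to keep in mind: two categories that occur the same number of times receive the same code, so the model can no longer tell them apart.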

#**4. Model Building**


---


In [7]:
def crea_modelos():

    data_traf=df_traffic.copy()
    X = data_traf.drop('transactionRevenue',axis=1)
    y = data_traf.transactionRevenue.copy()

    ### Split into training and test sets:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state = 42)

    # Create a StandardScaler instance
    scaler = StandardScaler()

    # Fit the scaler on the training data only, then transform both sets
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    ### MODELS  ###
    #  RandomForestRegressor
    #  LGBMRegressor
    #  XGBRegressor
    #  XGB

    random_forest_regressor = RandomForestRegressor(n_estimators=200, max_depth=10, min_samples_split=10, min_samples_leaf=4)
    lgbm_regressor = LGBMRegressor(n_estimators=165, max_depth=10, learning_rate=0.1, min_child_samples=10)
    xgb_regressor = XGBRegressor(n_estimators=140, max_depth=10, learning_rate=0.1, min_child_weight=10)

    # Train the models (the random forest uses the scaled features; LightGBM and XGBoost use the unscaled ones)
    random_forest_regressor.fit(X_train_scaled, y_train)
    lgbm_regressor.fit(X_train, y_train)
    xgb_regressor.fit(X_train, y_train)

    # Predict with the trained models
    y_pred_random_forest = random_forest_regressor.predict(X_test_scaled)
    y_pred_lgbm = lgbm_regressor.predict(X_test)
    y_pred_xgb = xgb_regressor.predict(X_test)

    # Evaluate the models
    rmse_random_forest = mean_squared_error(y_test, y_pred_random_forest, squared=False)
    rmse_lgbm = mean_squared_error(y_test, y_pred_lgbm, squared=False)
    rmse_xgb = mean_squared_error(y_test, y_pred_xgb, squared=False)

    r2_random_forest = r2_score(y_test, y_pred_random_forest)
    r2_lgbm = r2_score(y_test, y_pred_lgbm)
    r2_xgb = r2_score(y_test, y_pred_xgb)

    ################################################################
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)

    # Define the model parameters
    params_xgb = {
        'objective': 'reg:squarederror',
        'eval_metric': 'rmse',
        'max_depth': 6,
        'eta': 0.05,
        'subsample': 0.91,
        'colsample_bytree': 0.81,
        'seed': 42
    }

    # Train the model
    num_round = 900  # Maximum number of boosting rounds
    # Note: the test set also serves as the early-stopping evaluation set, so the stopping point is tuned on the data used for scoring
    bst = xgb.train(params_xgb, dtrain, num_round, evals=[(dtest, 'eval')], early_stopping_rounds=18)

    # Predict on the test set
    y_pred = bst.predict(dtest)

    # Evaluate the model
    rmse = mean_squared_error(y_test, y_pred, squared=False)
    r2 = r2_score(y_test, y_pred)
    ################################################################
    print("Resultados de los Modelos Adicionales:")
    print(f"Random Forest Regressor - R-cuadrado (R²): {r2_random_forest:.2%}, RMSE: {rmse_random_forest:.2f}")
    print(f"LightGBM Regressor - R-cuadrado (R²): {r2_lgbm:.2%}, RMSE: {rmse_lgbm:.2f}")
    print(f"XGBoost Regressor - R-cuadrado (R²): {r2_xgb:.2%}, RMSE: {rmse_xgb:.2f}")
    print(f"XGB - R-cuadrado (R²): {r2:.2%}, RMSE: {rmse:.2f}")

In [8]:
crea_modelos()

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003273 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1063
[LightGBM] [Info] Number of data points in the train set: 9826, number of used features: 204
[LightGBM] [Info] Start training from score 1.480256
[0]	eval-rmse:20.20259
[1]	eval-rmse:20.18937
[2]	eval-rmse:20.03733
[3]	eval-rmse:19.67091
[4]	eval-rmse:19.21404
[5]	eval-rmse:18.84813
[6]	eval-rmse:18.59248
[7]	eval-rmse:18.44395
[8]	eval-rmse:18.29335
[9]	eval-rmse:18.21222
[10]	eval-rmse:17.70104
[11]	eval-rmse:17.36419
[12]	eval-rmse:17.07507
[13]	eval-rmse:16.94660
[14]	eval-rmse:16.76920
[15]	eval-rmse:16.74981
[16]	eval-rmse:16.51668
[17]	eval-rmse:16.20364
[18]	eval-rmse:16.11670
[19]	eval-rmse:15.87760
[20]	eval-rmse:15.67497
[21]	eval-rmse:15.48008
[22]	eval-rmse:15.20711
[23]	eval-rmse:14.93635
[24]	eval-rmse:14.77145
[25]

In [9]:
def modelos_new():
    data_traf = df_traffic.copy()
    X = data_traf.drop('transactionRevenue', axis=1)
    y = data_traf.transactionRevenue.copy()

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

    # Scale the data
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    random_forest_regressor = RandomForestRegressor(n_estimators=200, max_depth=10, min_samples_split=10, min_samples_leaf=4)
    lgbm_regressor = LGBMRegressor(n_estimators=200, max_depth=10, learning_rate=0.1, min_child_samples=10)
    xgb_regressor = XGBRegressor(n_estimators=200, max_depth=10, learning_rate=0.1, min_child_weight=10)

    # Train and evaluate the models independently
    random_forest_regressor.fit(X_train_scaled, y_train)
    lgbm_regressor.fit(X_train, y_train)
    xgb_regressor.fit(X_train, y_train)

    y_pred_random_forest = random_forest_regressor.predict(X_test_scaled)
    y_pred_lgbm = lgbm_regressor.predict(X_test)
    y_pred_xgb = xgb_regressor.predict(X_test)

    rmse_random_forest = mean_squared_error(y_test, y_pred_random_forest, squared=False)
    rmse_lgbm = mean_squared_error(y_test, y_pred_lgbm, squared=False)
    rmse_xgb = mean_squared_error(y_test, y_pred_xgb, squared=False)

    r2_random_forest = r2_score(y_test, y_pred_random_forest)
    r2_lgbm = r2_score(y_test, y_pred_lgbm)
    r2_xgb = r2_score(y_test, y_pred_xgb)

    # Run cross-validation for XGBoost
    params_xgb = {
        'objective': 'reg:squarederror',
        'eval_metric': 'rmse',
        'max_depth': 10,
        'eta': 0.1,
        'subsample': 0.91,
        'colsample_bytree': 0.8,
        'seed': 42
    }

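    # Note: dall holds every row, so bst_new is trained on the same test rows it is scored on below,
    # which makes its metrics optimistic compared with the other three models
    dall = xgb.DMatrix(X, label=y)
    cv_results = xgb.cv(params_xgb, dall, num_boost_round=1000, nfold=5, metrics='rmse', early_stopping_rounds=10, seed=42)
    best_num_round = cv_results['test-rmse-mean'].idxmin()
    bst_new = xgb.train(params_xgb, dall, num_boost_round=best_num_round)
    y_pred_new = bst_new.predict(xgb.DMatrix(X_test))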

    rmse_new = mean_squared_error(y_test, y_pred_new, squared=False)
    r2_new = r2_score(y_test, y_pred_new)

    # Print results
    print("Resultados de los Modelos Adicionales:")
    print(f"Random Forest Regressor - R²: {r2_random_forest:.2%}, RMSE: {rmse_random_forest:.2f}")
    print(f"LightGBM Regressor - R²: {r2_lgbm:.2%}, RMSE: {rmse_lgbm:.2f}")
    print(f"XGBoost Regressor - R²: {r2_xgb:.2%}, RMSE: {rmse_xgb:.2f}")
    print(f"XGB DMatrix - R²: {r2_new:.2%}, RMSE: {rmse_new:.2f}")


In [10]:
modelos_new()

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006292 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1063
[LightGBM] [Info] Number of data points in the train set: 9826, number of used features: 204
[LightGBM] [Info] Start training from score 1.480256
Resultados de los Modelos Adicionales:
Random Forest Regressor - R²: 57.09%, RMSE: 13.55
LightGBM Regressor - R²: 65.14%, RMSE: 12.22
XGBoost Regressor - R²: 69.85%, RMSE: 11.36
XGB DMatrix - R²: 99.98%, RMSE: 0.29
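
The `XGB DMatrix` row stands well above the other three. In `modelos_new`, `bst_new` is trained on `dall`, which is built from the full `X` and `y`, so the rows in `X_test` were already seen during training. The toy sketch below illustrates how scoring on seen rows inflates R². It uses a hypothetical one-nearest-neighbor regressor and invented data, not the notebook's model or dataset.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

class NearestNeighbor:
    """Toy 1-nearest-neighbor regressor: it memorizes the rows it is fitted on."""
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        # Return the target of the closest memorized row for each query
        return [self.y[min(range(len(self.X)), key=lambda i: abs(self.X[i] - x))]
                for x in X]

# Invented data: one feature, one target
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.0, 5.0, 0.0, 20.0, 0.0, 0.0, 40.0, 3.0]
X_train, y_train = X[:6], y[:6]
X_test, y_test = X[6:], y[6:]

leaky = NearestNeighbor().fit(X, y)               # trained on every row, test rows included
honest = NearestNeighbor().fit(X_train, y_train)  # trained on the training rows only

r2_leaky = r2(y_test, leaky.predict(X_test))
r2_honest = r2(y_test, honest.predict(X_test))
print(f"leaky R2={r2_leaky:.2f}, honest R2={r2_honest:.2f}")
```

For a like-for-like comparison with the other models, the final booster would need to be trained on a `DMatrix` built from `X_train` and `y_train` only.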
