# Linear Regression: Ecommerce Customers

In [26]:
import warnings
warnings.filterwarnings("ignore")

## Creating the Spark Session

In [1]:
import os, subprocess

java8_home = "/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home"

os.environ["JAVA_HOME"] = java8_home
os.environ["PATH"] = os.path.join(java8_home, "bin") + os.pathsep + os.environ.get("PATH","")

os.environ["HADOOP_USER_NAME"] = os.environ.get("USER", "tomas")

print("JAVA_HOME set to:", os.environ["JAVA_HOME"])
try:
    print("which java (kernel):", subprocess.check_output(["which","java"]).decode().strip())
    print("java -version (kernel):")
    print(subprocess.check_output(["java","-version"], stderr=subprocess.STDOUT).decode())
except Exception as e:
    print("Error calling java from the kernel:", e)

JAVA_HOME set to: /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
which java (kernel): /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/bin/java
java -version (kernel):
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.292-b10, mixed mode)



In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lr_example').getOrCreate()

25/09/13 15:31:19 WARN Utils: Your hostname, MacBook-Air-de-Tomas-3.local resolves to a loopback address: 127.0.0.1; using 192.168.1.4 instead (on interface en0)
25/09/13 15:31:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
25/09/13 15:31:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/13 15:31:21 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## Importing the Data

In [8]:
# Use Spark to read in the Ecommerce Customers csv file.
data = spark.read.csv("../PySparkCourse/MLData/Ecommerce_Customers.csv",inferSchema=True,header=True)

In [5]:
data.printSchema()

root
 |-- Email: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- Avatar: string (nullable = true)
 |-- Avg Session Length: double (nullable = true)
 |-- Time on App: double (nullable = true)
 |-- Time on Website: double (nullable = true)
 |-- Length of Membership: double (nullable = true)
 |-- Yearly Amount Spent: double (nullable = true)



In [10]:
data.show()

+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+
|               Email|             Address|          Avatar|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|
+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+
|mstephenson@ferna...|835 Frank TunnelW...|          Violet| 34.49726772511229| 12.65565114916675| 39.57766801952616|  4.0826206329529615|  587.9510539684005|
|   hduke@hotmail.com|4547 Archer Commo...|       DarkGreen| 31.92627202636016|11.109460728682564|37.268958868297744|    2.66403418213262|  392.2049334443264|
|    pallen@yahoo.com|24645 Valerie Uni...|          Bisque|33.000914755642675|11.330278057777512|37.110597442120856|   4.104543202376424| 487.54750486747207|
|riverarebecca@gma...|1414 David Throug...|   

In [11]:
data.head(5)

[Row(Email='mstephenson@fernandez.com', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005),
 Row(Email='hduke@hotmail.com', Address='4547 Archer CommonDiazchester, CA 06566-8576', Avatar='DarkGreen', Avg Session Length=31.92627202636016, Time on App=11.109460728682564, Time on Website=37.268958868297744, Length of Membership=2.66403418213262, Yearly Amount Spent=392.2049334443264),
 Row(Email='pallen@yahoo.com', Address='24645 Valerie Unions Suite 582Cobbborough, DC 99414-7564', Avatar='Bisque', Avg Session Length=33.000914755642675, Time on App=11.330278057777512, Time on Website=37.110597442120856, Length of Membership=4.104543202376424, Yearly Amount Spent=487.54750486747207),
 Row(Email='riverarebecca@gmail.com', Address='1414 David ThroughwayPort Jason, OH 22070-1220', Avatar='Sad

## Preparing the Data for the Model

In [None]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [12]:
data.columns

['Email',
 'Address',
 'Avatar',
 'Avg Session Length',
 'Time on App',
 'Time on Website',
 'Length of Membership',
 'Yearly Amount Spent']

### Feature Selection

- VectorAssembler: a Transformer that combines numeric columns into a single vector column. Spark ML expects the predictor variables (features) to be packed into one column of vector type.

In [18]:
assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App", 
               "Time on Website",'Length of Membership'],
    outputCol="features")

output = assembler.transform(data)

In [19]:
output.show()

+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+--------------------+
|               Email|             Address|          Avatar|Avg Session Length|       Time on App|   Time on Website|Length of Membership|Yearly Amount Spent|            features|
+--------------------+--------------------+----------------+------------------+------------------+------------------+--------------------+-------------------+--------------------+
|mstephenson@ferna...|835 Frank TunnelW...|          Violet| 34.49726772511229| 12.65565114916675| 39.57766801952616|  4.0826206329529615|  587.9510539684005|[34.4972677251122...|
|   hduke@hotmail.com|4547 Archer Commo...|       DarkGreen| 31.92627202636016|11.109460728682564|37.268958868297744|    2.66403418213262|  392.2049334443264|[31.9262720263601...|
|    pallen@yahoo.com|24645 Valerie Uni...|          Bisque|33.000914755642675|11.330278057777512|37

### Training Set

- Only the features and the target variable are kept
- The data is split into training and test sets with `.randomSplit([train, test])`

In [20]:
final_data = output.select("features",'Yearly Amount Spent')
train_data,test_data = final_data.randomSplit([0.7,0.3])

In [22]:
train_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                353|
|   mean|   500.320399983673|
| stddev|  80.75792811130972|
|    min| 256.67058229005585|
|    max|  765.5184619388373|
+-------+-------------------+



                                                                                

In [23]:
test_data.describe().show()

+-------+-------------------+
|summary|Yearly Amount Spent|
+-------+-------------------+
|  count|                147|
|   mean|  496.8974009187671|
| stddev|  75.95157643123034|
|    min|   266.086340948469|
|    max|  684.1634310159512|
+-------+-------------------+
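The split sizes above (353 and 147 rows rather than exactly 350 and 150) are consistent with a split that assigns each row independently at random, so the weights behave like probabilities rather than exact proportions. The plain-Python sketch below illustrates only that idea; it is not Spark's implementation, and `random_split` and the seed value are hypothetical names chosen for the example. In PySpark itself, `randomSplit` accepts an optional `seed` argument, which helps make a split repeatable across runs.

```python
import random

def random_split(rows, train_weight=0.7, seed=42):
    """Assign each row to train or test independently (illustrative only)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_weight.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(500))  # stand-in for the 500 customer rows
train, test = random_split(rows)
print(len(train), len(test))  # close to 350/150, not necessarily exact
```

Calling `random_split(rows)` again with the same seed returns the same partition, which is the property a fixed seed is meant to provide.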



## Model

In [27]:
lr = LinearRegression(labelCol='Yearly Amount Spent')
lr_model = lr.fit(train_data)

25/09/13 20:50:55 WARN Instrumentation: [c39cc119] regParam is zero, which might cause numerical instability and overfitting.


### Coefficients and Intercept

- **Coefficients:** the weight assigned to each variable in the feature vector
- **Intercept:** the constant term b of the fitted equation

In [None]:

print("Coefficients: {}".format(lr_model.coefficients))
print("Intercept: {}".format(lr_model.intercept))

Coefficients: [25.46510537893336,39.21935508737206,0.420935971717978,61.253295015550755]
Intercept: -1047.2422635629598


**Analysis**

- **Avg Session Length (25.46):** For each one-unit increase in average session length, the expected yearly amount spent (the target variable) increases by 25.46, holding the other variables constant.
- **Time on App (39.22):** For each one-unit increase in time on the app, spending increases by 39.22 on average, holding the other variables constant.
- **Time on Website (0.42):** The effect is close to zero; more time on the website barely changes spending.
- **Length of Membership (61.25):** Each additional year of membership adds 61.25 to yearly spending on average. It has the largest coefficient of the four, and long-standing customers tend to spend more.
<br>

- **Intercept (-1047.24):** the expected yearly spending when all predictor variables are zero. It acts as the **offset** of the fitted regression equation; a negative amount has no direct business meaning.
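To make the interpretation above concrete, the fitted model is a weighted sum of the four features plus the intercept. The sketch below plugs the printed coefficients into that formula for one feature vector, copied from the first test row displayed in the model evaluation output. The helper name `predict` is hypothetical, and the result should agree with Spark's `prediction` column only up to floating-point rounding.

```python
# Coefficients and intercept as printed by the notebook.
coefficients = [25.46510537893336, 39.21935508737206,
                0.420935971717978, 61.253295015550755]
intercept = -1047.2422635629598

def predict(features):
    """Linear model: dot(coefficients, features) + intercept."""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept

# Feature order: Avg Session Length, Time on App,
# Time on Website, Length of Membership.
row = [30.3931845423455, 11.80298577760313,
       36.315763151803424, 2.0838141920346707]

label = 319.9288698031936       # actual Yearly Amount Spent for this row
prediction = predict(row)
residual = label - prediction   # same definition as the residual column
print(prediction, residual)
```

Increasing only the last feature (Length of Membership) by one unit raises the prediction by the fourth coefficient, which is what "holding the other variables constant" means in practice.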

### Model Evaluation

In [38]:
from pyspark.sql.functions import col

preds_with_label = (
    lr_model.transform(test_data)
    .select("features", "Yearly Amount Spent", "prediction")
    .withColumn("residual", col("Yearly Amount Spent") - col("prediction"))
)

preds_with_label.show(10, truncate=False)

+-----------------------------------------------------------------------------+-------------------+------------------+-------------------+
|features                                                                     |Yearly Amount Spent|prediction        |residual           |
+-----------------------------------------------------------------------------+-------------------+------------------+-------------------+
|[30.3931845423455,11.80298577760313,36.315763151803424,2.0838141920346707]   |319.9288698031936  |332.55597042553177|-12.627100622338162|
|[30.57436368417137,11.351049011250833,37.08884657968332,4.078308001651641]   |442.06441375806565 |441.93978419907444|0.12462955899121653|
|[30.81620064887634,11.851398743073142,36.925043038878634,1.0845853030221226] |266.086340948469   |284.2772513334942 |-18.190910385025177|
|[30.83643267477343,13.100109538542807,35.907721429679654,3.361612984538193]  |467.5019004269896  |472.81331516831415|-5.3114147413245405|
|[31.04722213948752,11.1996

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator_mse = RegressionEvaluator(labelCol="Yearly Amount Spent", predictionCol="prediction", metricName="mse")
evaluator_rmse = RegressionEvaluator(labelCol="Yearly Amount Spent", predictionCol="prediction", metricName="rmse")
evaluator_r2 = RegressionEvaluator(labelCol="Yearly Amount Spent", predictionCol="prediction", metricName="r2")
evaluator_mae = RegressionEvaluator(labelCol="Yearly Amount Spent", predictionCol="prediction", metricName="mae")

mse = evaluator_mse.evaluate(preds_with_label)
rmse = evaluator_rmse.evaluate(preds_with_label)
r2 = evaluator_r2.evaluate(preds_with_label)
mae = evaluator_mae.evaluate(preds_with_label)

print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"MAE: {mae}")
print(f"R2: {r2}")

mean_target = preds_with_label.selectExpr("avg(`Yearly Amount Spent`) as mean_y").collect()[0]['mean_y']
nrmse = rmse / mean_target
print(f"Mean target: {mean_target}")
print(f"NRMSE (RMSE / mean): {nrmse:.4f}")

MSE: 113.71402712614695
RMSE: 10.663677936160063
MAE: 8.58447169658969
R2: 0.9801525400480523
Mean target: 496.8974009187671
NRMSE (RMSE / mean): 0.0215


**Analysis**

- **MSE (113.71) - RMSE (10.66):** both measure the size of the prediction error; smaller is better. The RMSE is in the same units as the target and indicates that predictions typically deviate from the actual yearly amount spent by about 10.66.
- **MAE (8.58):** the mean absolute error is more robust to outliers. RMSE is always at least as large as MAE; the gap here (10.66 vs 8.58) indicates that a few errors are larger than the typical one and weigh more heavily in the RMSE.
<br>

- **R2 (0.98):** measures the proportion of the variability in yearly spending that the model explains. The model explains 98% of the target's variance on the test set; a value this high suggests the model fits the data very well.
- **Mean target (496.90) - NRMSE (2.15%):** the RMSE is about 2.15% of the mean target, so the relative error is very low and the predictions are accurate in relative terms.
<br>
<br>

**Conclusion:** the model has high predictive power and small average errors relative to the scale of the target. In principle, it is a good model.
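The metrics above follow directly from the residuals. As a sketch of the definitions only, the snippet below recomputes them in plain Python on the four complete rows visible in the prediction table. With so few rows the numbers will not match the reported test-set values, and the variable names are illustrative.

```python
import math

# (label, prediction) pairs copied from the first four displayed test rows.
pairs = [
    (319.9288698031936, 332.55597042553177),
    (442.06441375806565, 441.93978419907444),
    (266.086340948469, 284.2772513334942),
    (467.5019004269896, 472.81331516831415),
]

residuals = [y - y_hat for y, y_hat in pairs]
n = len(residuals)

mse = sum(r ** 2 for r in residuals) / n    # mean squared error
rmse = math.sqrt(mse)                       # same units as the target
mae = sum(abs(r) for r in residuals) / n    # mean absolute error

mean_y = sum(y for y, _ in pairs) / n
ss_res = sum(r ** 2 for r in residuals)             # unexplained variation
ss_tot = sum((y - mean_y) ** 2 for y, _ in pairs)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

nrmse = rmse / mean_y                       # RMSE relative to the mean target

print(f"MSE={mse:.2f} RMSE={rmse:.2f} MAE={mae:.2f} R2={r2:.4f} NRMSE={nrmse:.4f}")
```

Because squaring weights large residuals more heavily, the RMSE can never be smaller than the MAE on the same data.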

## Deployment (EXAMPLE, NOT EXECUTED)

In [None]:
lr_model.save("/mnt/models/lr_model_v1")

### Load the Model

In [None]:
from pyspark.ml.regression import LinearRegressionModel
model = LinearRegressionModel.load("/mnt/models/lr_model_v1")

### Load and Transform New Data

In [None]:
data = spark.read.parquet("/mnt/data/new_customers/")

from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["Avg Session Length","Time on App","Time on Website","Length of Membership"],
                            outputCol="features")
data = assembler.transform(data)

### Prediction

In [None]:
preds = model.transform(data.select("features", "customer_id")).select("customer_id", "prediction")

### Save Results

In [None]:
preds.write.mode("overwrite").parquet("/mnt/predictions/lr/predictions_2025-09-13/")
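The output path above hard-codes the run date. One possible variation, sketched here in plain Python, derives the folder name from a date value so that repeated batch runs write to separate locations. `predictions_path` is a hypothetical helper, and the base directory simply mirrors the path used in the example.

```python
from datetime import date

def predictions_path(run_date, base="/mnt/predictions/lr"):
    """Build a dated output folder such as .../predictions_2025-09-13/."""
    return f"{base}/predictions_{run_date.isoformat()}/"

path = predictions_path(date(2025, 9, 13))
print(path)  # /mnt/predictions/lr/predictions_2025-09-13/
```

The resulting string could then be passed to `preds.write.mode("overwrite").parquet(path)`.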