
# 🎓 **Applied Artificial Intelligence**

## 🤖 **Big Data Analysis (Group 10)**

### 🏛️ Tecnológico de Monterrey

#### 👨‍🏫 **Lead professor:** Dr. Iván Olmos Pineda
#### 👩‍🏫 **Assistant professor:** Verónica Sandra Guzmán de Valle

### 📊 **Project | Big Data Database**

#### 📅 **May 4, 2025**

* 🧑‍💻 **A01795941:** Juan Carlos Pérez Nava




In [31]:
import os
import sys

# Make the project's helper library importable (presumably the source of calcular_IQR, used later)
module_path = os.path.abspath(os.path.join('proyectos/librerias'))
if module_path not in sys.path:
    sys.path.append(module_path)
from graficas import *

from pyspark.sql import SparkSession, DataFrame
# Note: `sum` and `round` imported here shadow the Python built-ins of the same name
from pyspark.sql.functions import col, sum, avg, lit, count, when, format_number, round, rand
from pyspark.ml.feature import (
    Imputer, OneHotEncoder, QuantileDiscretizer,
    StandardScaler, StringIndexer, VectorAssembler,
)
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression

import matplotlib.pyplot as plt
import seaborn as sns
import kagglehub
import pandas as pd

from functools import reduce

In [32]:
path = kagglehub.dataset_download("sobhanmoosavi/us-accidents")
print("Path to dataset files:", path)

Path to dataset files: /home/jarcos/.cache/kagglehub/datasets/sobhanmoosavi/us-accidents/versions/13


In [33]:
spark = (
    SparkSession.builder.master("local[*]").appName("CargarCSV")
    .config("spark.driver.memory", "40g")
    .config("spark.executor.memory", "20g")
    .getOrCreate()
)
df_accident = spark.read.option("header", True).option("inferSchema", True).csv(path)
spark.sparkContext.setLogLevel("ERROR")



In [34]:
df_accident.show(5)

+---+-------+--------+-------------------+-------------------+-----------------+------------------+-------+-------+------------+--------------------+--------------------+------------+----------+-----+----------+-------+----------+------------+-------------------+--------------+-------------+-----------+------------+--------------+--------------+---------------+-----------------+-----------------+-------+-----+--------+--------+--------+-------+-------+----------+-------+-----+---------------+--------------+------------+--------------+--------------+-----------------+---------------------+
| ID| Source|Severity|         Start_Time|           End_Time|        Start_Lat|         Start_Lng|End_Lat|End_Lng|Distance(mi)|         Description|              Street|        City|    County|State|   Zipcode|Country|  Timezone|Airport_Code|  Weather_Timestamp|Temperature(F)|Wind_Chill(F)|Humidity(%)|Pressure(in)|Visibility(mi)|Wind_Direction|Wind_Speed(mph)|Precipitation(in)|Weather_Condition|Ameni

**Obtaining descriptive statistics for the categorical features**

# Partitioning

The dataset is partitioned by weather condition and accident severity: each distinct (`Weather_Condition`, `Severity`) combination defines one subset.

In [35]:
columnas_clave = [
    "ID", "Weather_Condition", "Precipitation(in)", "Severity", "City", "State",
    "Temperature(F)", "Humidity(%)", "Visibility(mi)", "Wind_Direction", "Wind_Speed(mph)",
    "Crossing", "Junction", "Railway", "Roundabout", "Stop", "Sunrise_Sunset",
    "Traffic_Calming", "Traffic_Signal"]

total = df_accident.count()

# Frequency and share (in percent) of every (Weather_Condition, Severity) combination
combinaciones_top = df_accident.groupBy("Weather_Condition", "Severity") \
    .agg(count("*").alias("Frecuencia")) \
    .withColumn("Proporción", col("Frecuencia") / total * 100) \
    .orderBy(col("Proporción").desc())

# Keep only the columns of interest and write them to Parquet, partitioned by the two keys
df_particionada = df_accident.select(columnas_clave)
df_particionada.write.mode("overwrite").partitionBy("Weather_Condition","Severity").parquet("us_accidents_partitioned")

combinaciones_top.show(10, truncate=False)





+-----------------+--------+----------+------------------+
|Weather_Condition|Severity|Frecuencia|Proporción        |
+-----------------+--------+----------+------------------+
|Fair             |2       |2226576   |28.810332392473782|
|Mostly Cloudy    |2       |792735    |10.25743511523869 |
|Cloudy           |2       |692929    |8.966015449005317 |
|Partly Cloudy    |2       |548760    |7.1005696655734685|
|Clear            |2       |536971    |6.948028270815386 |
|Light Rain       |2       |270162    |3.495706870017238 |
|Overcast         |2       |248938    |3.22108319011686  |
|Clear            |3       |244956    |3.1695589018882835|
|Fair             |3       |240084    |3.1065186376367455|
|Mostly Cloudy    |3       |189229    |2.4484905919651614|
+-----------------+--------+----------+------------------+
only showing top 10 rows
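
The frequency and percentage columns above come from a count per combination divided by the total number of rows. A minimal plain-Python sketch of the same arithmetic follows; the rows are made up for illustration and the function name `combination_shares` is hypothetical.

```python
from collections import Counter

# Hypothetical (weather, severity) pairs; illustrative only, not taken from the dataset
rows = [
    ("Fair", 2), ("Fair", 2), ("Fair", 3),
    ("Cloudy", 2), ("Light Rain", 2), ("Fair", 2),
    ("Cloudy", 2), ("Cloudy", 3),
]

def combination_shares(pairs):
    """Return [(pair, frequency, percent)] sorted by descending frequency."""
    counts = Counter(pairs)
    total = len(pairs)
    return [(pair, n, n / total * 100) for pair, n in counts.most_common()]

shares = combination_shares(rows)
for pair, n, pct in shares:
    print(f"{pair!s:<20} {n:>3} {pct:6.2f}%")
```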



                                                                                

In [36]:
max_reg = 1000  # minimum number of rows a combination needs in order to be kept

# Keep only the combinations with at least max_reg rows
combinaciones_filtradas = combinaciones_top.filter(col("Frecuencia") >= max_reg)

# Collect the qualifying (Weather_Condition, Severity) pairs on the driver
particiones = combinaciones_filtradas.select("Weather_Condition", "Severity").collect()

# Report the result

print(f'✅ Identified \033[32m\033[1m{len(particiones)}\033[0m partitions with at least \033[36m{max_reg}\033[0m rows.')

✅ Identified 95 partitions with at least 1000 rows.
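
Each qualifying combination is then sampled down to a fixed number of rows. The idea can be sketched in plain Python as a capped random sample per group. The function name, the toy data and the use of `random.Random` are assumptions for illustration; this is not the Spark code used in the notebook.

```python
import random
from collections import defaultdict

def capped_sample_per_group(rows, key, cap, seed):
    """Draw at most `cap` rows at random from every group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        k = min(cap, len(members))  # small groups contribute all their rows
        sample.extend(rng.sample(members, k))
    return sample

# Toy data: 7 "Fair" rows and 2 "Snow" rows (illustrative only)
rows = [("Fair", i) for i in range(7)] + [("Snow", i) for i in range(2)]
sample = capped_sample_per_group(rows, key=lambda r: r[0], cap=3, seed=450)
print(len(sample))
```

Capping every group at the same size flattens the very uneven combination frequencies, at the cost of no longer reflecting the original proportions.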


In [37]:
muestras = 500
semilla = 450

lista_muestras = []

for numero, particion in enumerate(particiones, start=1):
    weather = particion["Weather_Condition"]
    severity = particion["Severity"]

    print(f"Extracting partition #\033[32m\033[1m{numero:03}\033[0m | 🌦 Weather: \033[1;36m{weather}\033[0m | ⚠ Severity: \033[1;36m{severity}\033[0m")

    # Select the rows of this combination. A null Weather_Condition matches nothing here,
    # because `== NULL` evaluates to NULL in Spark SQL.
    df_filtrada = df_particionada.filter((col("Weather_Condition") == weather) & (col("Severity") == severity))

    # Shuffle with a fixed seed and keep at most `muestras` rows
    df_rand = df_filtrada.orderBy(rand(semilla)).limit(muestras)

    lista_muestras.append(df_rand)

# Combine the per-partition samples into a single DataFrame
df_muestras = reduce(DataFrame.union, lista_muestras)

# Reduce the number of partitions and cache the result for reuse
df_muestras = df_muestras.coalesce(8).persist()

# Count the rows in the sample (this also materializes the cache)
contador_total = df_muestras.count()


print(f"Total rows in the sample: \033[32m\033[1m{contador_total}\033[0m")

Extracting partition #001 | 🌦 Weather: Fair | ⚠ Severity: 2
Extracting partition #002 | 🌦 Weather: Mostly Cloudy | ⚠ Severity: 2
Extracting partition #003 | 🌦 Weather: Cloudy | ⚠ Severity: 2
Extracting partition #004 | 🌦 Weather: Partly Cloudy | ⚠ Severity: 2
Extracting partition #005 | 🌦 Weather: Clear | ⚠ Severity: 2
Extracting partition #006 | 🌦 Weather: Light Rain | ⚠ Severity: 2
Extracting partition #007 | 🌦 Weather: Overcast | ⚠ Severity: 2
Extracting partition #008 | 🌦 Weather: Clear | ⚠ Severity: 3
Extracting partition #009 | 🌦 Weather: Fair | ⚠ Severity: 3
Extracting partition #010 | 🌦 Weather: Mostly Cloudy | ⚠ Severity: 3
...



Total rows in the sample: 46000
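
Note that 95 combinations of 500 rows each would give 47500 rows, while the sample holds 46000, which is 92 × 500. One plausible explanation (an inference, not verified against the data) is that some of the 95 combinations have a null `Weather_Condition`: under SQL semantics, which Spark follows, `column = NULL` is never true, so such a filter returns no rows. The sketch below reproduces that behaviour with the standard-library `sqlite3` module as a stand-in for Spark; the table and its rows are made up. In PySpark, the null-safe comparison is `Column.eqNullSafe`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accidents (weather TEXT, severity INTEGER)")
conn.executemany(
    "INSERT INTO accidents VALUES (?, ?)",
    [("Fair", 2), ("Fair", 2), (None, 2), (None, 2), (None, 3)],
)

def count_rows(operator, weather, severity):
    """Count rows matching weather/severity using '=' or the null-safe 'IS'."""
    query = f"SELECT COUNT(*) FROM accidents WHERE weather {operator} ? AND severity = ?"
    return conn.execute(query, (weather, severity)).fetchone()[0]

fair_rows = count_rows("=", "Fair", 2)      # ordinary equality on a non-null value
null_with_eq = count_rows("=", None, 2)     # equality against NULL matches nothing
null_with_is = count_rows("IS", None, 2)    # null-safe comparison finds the rows
print(fair_rows, null_with_eq, null_with_is)
```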


                                                                                

# Imputing missing values

In [38]:
def obten_nulos(particion):
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

    # Count the rows once and reuse the value
    total_rows = particion.count()

    print(f"📊 Total rows in the partition: {total_rows}")
    print(f"🗂️ Number of columns in the partition: {len(particion.columns)}")

    # Count the null values of every column in a single pass
    fila_nulos = particion.select(
        [sum(col(c).isNull().cast("int")).alias(c) for c in particion.columns]
    ).collect()[0]

    # Convert the result into a dictionary {column: null count}
    info_nulos = dict(zip(particion.columns, fila_nulos))

    # Keep only the columns that contain nulls
    cols_nulos = {c: {"count": v, "percent": (v / total_rows) * 100} for c, v in info_nulos.items() if v > 0}

    # Stop here if there is nothing to report
    if not cols_nulos:
        print("✅ There are no null values in the partition.")
        return

    listado = [(key, value['count'], value['percent']) for key, value in cols_nulos.items()]

    # Schema of the summary DataFrame
    schema = StructType([
        StructField("Columna", StringType(), True),
        StructField("Total de nulos", IntegerType(), True),
        StructField("Porcentaje", DoubleType(), True)
    ])

    df_resumen_nulos = spark.createDataFrame(listado, schema=schema)

    # Round the percentage to two decimals for display
    for col_name in [c for c, t in df_resumen_nulos.dtypes if t == "double"]:
        df_resumen_nulos = df_resumen_nulos.withColumn(col_name, round(df_resumen_nulos[col_name], 2))

    df_resumen_nulos.orderBy(col("Total de nulos").desc()).show(truncate=False)


In [39]:
def imputacion_valores(particion):
    print("✅ Imputation is performed with the following values:\n")

    numericas = ["Temperature(F)", "Humidity(%)", "Visibility(mi)", "Precipitation(in)", "Wind_Speed(mph)"]

    # Mode (most frequent value) of a categorical column
    def moda(columna):
        return particion.groupBy(columna).count().orderBy(col("count").desc()).first()[columna]

    moda_Weather = moda("Weather_Condition")
    moda_City = moda("City")
    moda_Sunset = moda("Sunrise_Sunset")
    moda_wub = moda("Wind_Direction")

    # Means of the numeric columns in one pass, rounded for display only
    medias = particion.select([round(avg(col(c)), 2) for c in numericas]).first()
    media_Temperature, media_Humidity, media_Visibility, media_Precipitation, media_Wind_Speed = medias

    # Print the computed values
    print(f"🌡️ Mean temperature: {media_Temperature}")
    print(f"💧 Mean humidity: {media_Humidity}")
    print(f"👀 Mean visibility: {media_Visibility}")
    print(f"🌧️ Mean precipitation: {media_Precipitation}")
    print(f"🌬️ Mean wind speed: {media_Wind_Speed}")

    print(f"☁️ Most frequent weather condition: {moda_Weather}")
    print(f"🏙️ Most frequent city: {moda_City}")
    print(f"🌅 Most frequent day/night period: {moda_Sunset}")
    print(f"🌬️ Most frequent wind direction: {moda_wub}")

    # Impute the numeric columns in place with the (unrounded) column mean
    imputer_num = Imputer(inputCols=numericas, outputCols=numericas).setStrategy("mean")

    particion = imputer_num.fit(particion).transform(particion)

    # Impute the categorical columns with their mode using na.fill()
    particion = particion.na.fill({
        "Weather_Condition": moda_Weather,
        "City": moda_City,
        "Sunrise_Sunset": moda_Sunset,
        "Wind_Direction": moda_wub
    })

    print("\n🔍 Null values are checked again to confirm the imputation.\n")

    obten_nulos(particion)

    # Round every double column to one decimal
    for col_name in [c for c, t in particion.dtypes if t == "double"]:
        particion = particion.withColumn(col_name, round(particion[col_name], 1))

    return particion

In [40]:
obten_nulos(df_muestras)

📊 Total rows in the partition: 46000
🗂️ Number of columns in the partition: 19
+-----------------+--------------+----------+
|Columna          |Total de nulos|Porcentaje|
+-----------------+--------------+----------+
|Precipitation(in)|9628          |20.93     |
|Wind_Speed(mph)  |2488          |5.41      |
|Humidity(%)      |318           |0.69      |
|Temperature(F)   |224           |0.49      |
|Wind_Direction   |179           |0.39      |
|Sunrise_Sunset   |171           |0.37      |
|Visibility(mi)   |141           |0.31      |
|City             |1             |0.0       |
+-----------------+--------------+----------+



In [41]:
muestra_imp = imputacion_valores(df_muestras)

✅ Imputation is performed with the following values:

🌡️ Mean temperature: 57.7
💧 Mean humidity: 75.24
👀 Mean visibility: 6.7
🌧️ Mean precipitation: 0.05
🌬️ Mean wind speed: 11.27
☁️ Most frequent weather condition: Fair
🏙️ Most frequent city: Miami
🌅 Most frequent day/night period: Day
🌬️ Most frequent wind direction: CALM

🔍 Null values are checked again to confirm the imputation.

📊 Total rows in the partition: 46000
🗂️ Number of columns in the partition: 19
✅ There are no null values in the partition.
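
The two imputation strategies used here, mean for numeric columns and mode for categorical ones, can be sketched in plain Python as follows. The helper names and the sample values are illustrative; the notebook itself relies on Spark's `Imputer` and `na.fill`.

```python
from collections import Counter
from statistics import mean

def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

temperatures = [60.0, None, 75.0, 54.0, None]      # illustrative values
directions = ["CALM", "WNW", None, "CALM", "S"]    # illustrative values

filled_temperatures = impute_mean(temperatures)
filled_directions = impute_mode(directions)
print(filled_temperatures)
print(filled_directions)
```

Mean imputation leaves the column average unchanged but shrinks its variance, an effect that matters most for `Precipitation(in)`, where about 21% of the sampled values were missing.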


In [42]:
outliers = calcular_IQR(muestra_imp,['Precipitation(in)','Temperature(F)','Humidity(%)','Visibility(mi)','Wind_Speed(mph)'])
outliers

| Columna | IQR | Límite Inf. | Límite Sup. |
|---|---|---|---|
| Precipitation(in) | 0.1 | -0.15 | 0.25 |
| Temperature(F) | 31.6 | -5.0 | 121.4 |
| Humidity(%) | 29.0 | 19.5 | 135.5 |
| Visibility(mi) | 7.0 | -7.5 | 20.5 |
| Wind_Speed(mph) | 9.2 | -8.0 | 28.8 |
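
`calcular_IQR` is not defined in this notebook, so its code is not shown here. The limits in the table are consistent with the usual 1.5 × IQR rule, so the sketch below assumes that rule. The function name `iqr_limits` is hypothetical, and the quartiles are the ones implied by the `Temperature(F)` row.

```python
def iqr_limits(q1, q3, k=1.5):
    """Return (IQR, lower limit, upper limit) for the k * IQR rule."""
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr

# Quartiles implied by the Temperature(F) row: IQR 31.6, limits -5.0 and 121.4
iqr, lower, upper = iqr_limits(42.4, 74.0)
print(round(iqr, 1), round(lower, 1), round(upper, 1))
```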


In [43]:
for columna, row in outliers.iterrows():
    limite_inf = row["Límite Inf."]
    limite_sup = row["Límite Sup."]

    # Column mean, computed once (backticks are needed because of the parentheses in the name)
    media = muestra_imp.selectExpr(f"avg(`{columna}`)").collect()[0][0]

    # Replace values outside the IQR limits with the column mean
    muestra_imp = muestra_imp.withColumn(
        columna,
        when((col(columna) < limite_inf) | (col(columna) > limite_sup), media).otherwise(col(columna))
    )
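
The loop above replaces every value outside the limits with the column mean. A plain-Python sketch of that replacement on a toy list follows (the values are illustrative). As in the Spark code, the mean is computed over all values, outliers included.

```python
from statistics import mean

def replace_outliers_with_mean(values, lower, upper):
    """Replace values outside [lower, upper] with the mean of all values."""
    fill = mean(values)  # computed before replacement, so outliers contribute to it
    return [fill if (v < lower or v > upper) else v for v in values]

wind_speed = [5.0, 8.0, 12.0, 3.0, 72.0]   # illustrative values; 72.0 is the outlier
cleaned = replace_outliers_with_mean(wind_speed, lower=-8.0, upper=28.8)
print(cleaned)
```

Because the replacement shifts the quartiles, recomputing the limits afterwards can give slightly different values, which is what the second IQR table shows.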


In [44]:
outliers = calcular_IQR(muestra_imp,['Precipitation(in)','Temperature(F)','Humidity(%)','Visibility(mi)','Wind_Speed(mph)'])
outliers

| Columna | IQR | Límite Inf. | Límite Sup. |
|---|---|---|---|
| Precipitation(in) | 0.1 | -0.15 | 0.25 |
| Temperature(F) | 31.0 | -3.5 | 120.5 |
| Humidity(%) | 27.0 | 24.5 | 132.5 |
| Visibility(mi) | 7.0 | -7.5 | 20.5 |
| Wind_Speed(mph) | 8.2 | -6.5 | 26.3 |


In [45]:
muestra_imp.show(5)

+---------+-----------------+-----------------+--------+--------------+-----+--------------+-----------+--------------+--------------+---------------+--------+--------+-------+----------+-----+--------------+---------------+--------------+
|       ID|Weather_Condition|Precipitation(in)|Severity|          City|State|Temperature(F)|Humidity(%)|Visibility(mi)|Wind_Direction|Wind_Speed(mph)|Crossing|Junction|Railway|Roundabout| Stop|Sunrise_Sunset|Traffic_Calming|Traffic_Signal|
+---------+-----------------+-----------------+--------+--------------+-----+--------------+-----------+--------------+--------------+---------------+--------+--------+-------+----------+-----+--------------+---------------+--------------+
|A-4912715|             Fair|              0.0|       2|  Port Angeles|   WA|          60.0|       72.0|          10.0|           WNW|           12.0|   false|   false|  false|     false|false|           Day|          false|         false|
|A-4035789|             Fair|           

# Preparing the data

In [46]:
muestra_imp.groupBy("Severity").count().show()

+--------+-----+
|Severity|count|
+--------+-----+
|       3|15000|
|       4| 6500|
|       2|22000|
|       1| 2500|
+--------+-----+



In [48]:
categoricas = ["Weather_Condition", "City", "State", "Sunrise_Sunset","Wind_Direction"]
binarias = ["Crossing", "Junction", "Railway", "Roundabout", "Stop", "Traffic_Calming", "Traffic_Signal"]

In [49]:
# Work on a separate reference; DataFrames are immutable, so muestra_imp itself is never modified
Transf_muestra = muestra_imp.alias("copia_muestra")

# Cast the boolean columns to 0/1
for columna in binarias:
    Transf_muestra = Transf_muestra.withColumn(columna + "_num", col(columna).cast("int"))

# Apply StringIndexer to the categorical columns
indexers = [StringIndexer(inputCol=c, outputCol=c + "_Index").fit(Transf_muestra) for c in categoricas]
for indexer in indexers:
    Transf_muestra = indexer.transform(Transf_muestra)

# Index the label. StringIndexer orders labels by descending frequency by default,
# so Severity_Index is not ordered by severity.
index_label = StringIndexer(inputCol="Severity", outputCol="Severity_Index").fit(Transf_muestra)
Transf_muestra = index_label.transform(Transf_muestra)

# One-hot encode the indexed categorical columns
codificadores = [OneHotEncoder(inputCol=c + "_Index", outputCol=c + "_OHE").fit(Transf_muestra) for c in categoricas]
for codificador in codificadores:
    Transf_muestra = codificador.transform(Transf_muestra)

# Drop the original columns that the model will no longer use
Transf_muestra = Transf_muestra.drop(*categoricas).drop(*binarias)

Transf_muestra.show()




+---------+-----------------+--------+--------------+-----------+--------------+---------------+------------+------------+-----------+--------------+--------+-------------------+------------------+-----------------------+----------+-----------+--------------------+--------------------+--------------+---------------------+-------------------+---------------+------------------+------------------+
|       ID|Precipitation(in)|Severity|Temperature(F)|Humidity(%)|Visibility(mi)|Wind_Speed(mph)|Crossing_num|Junction_num|Railway_num|Roundabout_num|Stop_num|Traffic_Calming_num|Traffic_Signal_num|Weather_Condition_Index|City_Index|State_Index|Sunrise_Sunset_Index|Wind_Direction_Index|Severity_Index|Weather_Condition_OHE|           City_OHE|      State_OHE|Sunrise_Sunset_OHE|Wind_Direction_OHE|
+---------+-----------------+--------+--------------+-----------+--------------+---------------+------------+------------+-----------+--------------+--------+-------------------+------------------+-------
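
With its default settings, `StringIndexer` assigns index 0 to the most frequent label, 1 to the next most frequent, and so on. Applying that rule to the severity counts shown earlier gives the mapping below. The sketch is plain Python and only mimics the ordering rule; it does not call Spark, and the function name is hypothetical.

```python
# Severity counts from the sample (Severity -> count)
severity_counts = {3: 15000, 4: 6500, 2: 22000, 1: 2500}

def frequency_desc_index(counts):
    """Map each label to its rank by descending frequency (0 = most frequent)."""
    ordered = sorted(counts, key=lambda label: counts[label], reverse=True)
    return {label: float(rank) for rank, label in enumerate(ordered)}

mapping = frequency_desc_index(severity_counts)
print(mapping)
```

Under this mapping severity 1 receives the largest index, so a regression on `Severity_Index` does not predict severity on its natural scale.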

In [50]:
atributos = [ 'Precipitation(in)', 'Temperature(F)', 'Humidity(%)', 'Visibility(mi)',
              'Wind_Speed(mph)','Crossing_num', 'Junction_num', 'Railway_num',
              'Roundabout_num','Stop_num','Traffic_Calming_num', 'Traffic_Signal_num',
              'Weather_Condition_OHE','City_OHE','State_OHE','Sunrise_Sunset_OHE','Wind_Direction_OHE']

In [51]:
assembler = VectorAssembler(inputCols=atributos, outputCol = 'Caracteristicas')
df_vec = assembler.transform(Transf_muestra)
df_vec.select('Caracteristicas','Severity_index').show(5,truncate = False)

+---------------------------------------------------------------------------------+--------------+
|Caracteristicas                                                                  |Severity_index|
+---------------------------------------------------------------------------------+--------------+
|(5434,[1,2,3,4,13,3255,5381,5410,5413],[60.0,72.0,10.0,12.0,1.0,1.0,1.0,1.0,1.0])|0.0           |
|(5434,[0,1,2,3,13,413,5383,5410,5411],[0.1,72.0,20.0,10.0,1.0,1.0,1.0,1.0,1.0])  |0.0           |
|(5434,[1,2,3,4,13,1440,5369,5410,5412],[75.0,94.0,7.0,3.0,1.0,1.0,1.0,1.0,1.0])  |0.0           |
|(5434,[1,2,3,13,121,5366,5411],[54.0,78.0,10.0,1.0,1.0,1.0,1.0])                 |0.0           |
|(5434,[1,2,3,4,13,883,5362,5413],[60.0,55.0,10.0,6.0,1.0,1.0,1.0,1.0])           |0.0           |
+---------------------------------------------------------------------------------+--------------+
only showing top 5 rows
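
Each `Caracteristicas` value is printed in sparse form, `(size, [indices], [values])`, where every position not listed is zero. A small plain-Python sketch of how such a triple is read, using the first row above (the helper name `sparse_get` is hypothetical):

```python
def sparse_get(size, indices, values, position):
    """Return the value at `position` of a sparse vector; unlisted positions are 0.0."""
    if not 0 <= position < size:
        raise IndexError(position)
    lookup = dict(zip(indices, values))
    return lookup.get(position, 0.0)

# First row of the output above
size = 5434
indices = [1, 2, 3, 4, 13, 3255, 5381, 5410, 5413]
values = [60.0, 72.0, 10.0, 12.0, 1.0, 1.0, 1.0, 1.0, 1.0]

# Positions 0-4 hold the five numeric attributes, in the order given to the assembler
numeric = [sparse_get(size, indices, values, i) for i in range(5)]
print(numeric)
```

Position 0 (`Precipitation(in)`) is missing from the index list because its value in that row is 0.0.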



In [52]:
# Apply StandardScaler. withMean=True centers the data, which turns the sparse vectors into dense ones
scaler = StandardScaler(inputCol="Caracteristicas", outputCol="Caracteristicas_scale", withStd=True, withMean=True)
scaler_model = scaler.fit(df_vec)
df_scaled = scaler_model.transform(df_vec)

In [53]:
df_scaled.select('Caracteristicas_scale','Severity_index').show(5,truncate = True)

+---------------------+--------------+
|Caracteristicas_scale|Severity_index|
+---------------------+--------------+
| [-0.7106023951071...|           0.0|
| [1.16458160594481...|           0.0|
| [-0.7106023951071...|           0.0|
| [-0.7106023951071...|           0.0|
| [-0.7106023951071...|           0.0|
+---------------------+--------------+
only showing top 5 rows
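
With `withMean=True` and `withStd=True`, each feature is centered and divided by its standard deviation: z = (x - mean) / std. The plain-Python sketch below applies this to one toy column using the sample standard deviation (`statistics.stdev`), which is the variant Spark's `StandardScaler` documentation describes. The data is illustrative.

```python
from statistics import mean, stdev

def standardize(values):
    """Center to zero mean and scale to unit sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]

precipitation = [0.0, 0.0, 0.0, 0.1, 0.4]   # illustrative values
scaled = standardize(precipitation)
print([round(z, 3) for z in scaled])
```

Every zero in the input maps to the same non-zero value after centering, which is why the first component repeats across rows in the output above and why a centered sparse vector becomes dense.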



# Creating the datasets

In [54]:
#spark.conf.set("spark.sql.shuffle.partitions", "200")
train, test = df_scaled.randomSplit([0.8,0.2], seed = 10)

train_size = train.count()
test_size = test.count()
total_size = train_size + test_size

train_pct = (train_size / total_size) * 100
test_pct = (test_size / total_size) * 100

print(f"""There are {train_size} instances in the train set ({train_pct:.2f}%),
and {test_size} in the test set ({test_pct:.2f}%).""")

There are 36756 instances in the train set (79.90%),
and 9244 in the test set (20.10%).
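
`randomSplit` assigns rows to the splits at random, so the resulting sizes only approximate the requested 80/20 weights, as the 79.90% / 20.10% result shows. The plain-Python sketch below illustrates a split of this kind, in which each row is assigned independently. It is an illustration of the idea, not Spark's implementation, and the function name is hypothetical.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row independently to train with probability `train_weight`."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(10_000))
train, test = random_split(rows, train_weight=0.8, seed=10)
print(len(train), len(test), f"{len(train) / len(rows):.2%}")
```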


In [55]:
lr = LinearRegression(featuresCol='Caracteristicas_scale', labelCol='Severity_Index', maxIter=50, regParam=0.01, elasticNetParam=0.2)
lr_model = lr.fit(train)
y_pred = lr_model.transform(test)
y_pred.select('Caracteristicas_scale','Severity_Index','prediction').show(5,truncate = True)



+---------------------+--------------+------------------+
|Caracteristicas_scale|Severity_Index|        prediction|
+---------------------+--------------+------------------+
| [-0.7106023951071...|           0.0| 1.338933353580241|
| [1.16458160594481...|           1.0|1.2853623665992282|
| [-0.7106023951071...|           0.0|1.6009070969980073|
| [-0.7106023951071...|           0.0|1.3980048097073956|
| [-0.7106023951071...|           1.0|1.5098498571210905|
+---------------------+--------------+------------------+
only showing top 5 rows



In [56]:
print ("The coefficient of the model is : ", lr_model.coefficients)
print ("The Intercept of the model is : ", lr_model.intercept)

The coefficient of the model is :  (5434,[0,1,2,3,4,5,6,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,26,28,29,30,31,32,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,70,71,72,74,75,77,78,79,82,83,84,85,87,88,89,90,91,92,93,94,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,113,114,115,116,118,120,121,123,124,125,126,127,128,130,132,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,155,157,158,159,160,163,164,166,167,170,172,173,174,177,178,179,180,181,182,183,184,187,188,192,193,194,195,196,197,198,199,202,203,204,205,206,207,208,209,211,212,213,214,215,216,218,219,222,223,224,225,226,228,230,233,234,235,238,239,240,242,243,247,248,250,253,254,255,258,261,265,267,268,272,273,274,275,276,277,278,280,281,282,283,287,289,290,291,292,293,294,295,297,301,303,305,307,308,309,310,311,312,313,314,317,319,320,321,322,324,325,326,327,328,331,332,333,334,335,338,339,340,341,342,343,345,347,350,351

In [57]:
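# The evaluator must score against the label the model was trained on (Severity_Index).
# The values recorded below were obtained with labelCol="Stop_num", which likely explains
# the strongly negative r2; they need to be regenerated with the corrected label.

# Root Mean Square Error
eval_lr = RegressionEvaluator(labelCol="Severity_Index", predictionCol="prediction", metricName="rmse")
rmse_lr = eval_lr.evaluate(y_pred)
print("RMSE: %.3f" % rmse_lr)

# Mean Square Error
mse = eval_lr.evaluate(y_pred, {eval_lr.metricName: "mse"})
print("MSE: %.3f" % mse)

# Mean Absolute Error
mae = eval_lr.evaluate(y_pred, {eval_lr.metricName: "mae"})
print("MAE: %.3f" % mae)

# r2 - coefficient of determination
r2 = eval_lr.evaluate(y_pred, {eval_lr.metricName: "r2"})
print("r2: %.3f" % r2)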

RMSE: 0.927
MSE: 0.860
MAE: 0.784
r2: -41.946
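
The four metrics are related: MSE is RMSE squared (0.927² ≈ 0.860 above), and R² = 1 - SS_res / SS_tot, which becomes strongly negative when the predictions sit far from a label that barely varies. The plain-Python sketch below computes the metrics on made-up values chosen to show that effect. It is not the Spark evaluator, and the function name is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MSE, MAE and R² for paired true/predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # spread of the label
    ss_res = sum(e * e for e in errors)                 # squared prediction error
    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae, "r2": 1 - ss_res / ss_tot}

# Illustrative data: an almost constant 0/1 label and predictions centered near 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.0, 1.0]
metrics = regression_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```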


In [58]:
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=3, seed=42, featuresCol="Caracteristicas_scale")
model = kmeans.fit(df_scaled)



In [59]:
predictions = model.transform(df_scaled)
predictions.select('Caracteristicas_scale','Severity_Index','prediction').show()


+---------------------+--------------+----------+
|Caracteristicas_scale|Severity_Index|prediction|
+---------------------+--------------+----------+
| [-0.7106023951071...|           0.0|         0|
| [1.16458160594481...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|
| [-0.7106023951071...|           0.0|         0|


In [60]:
summary = model.summary
print("Cluster sizes:", summary.clusterSizes)
print("Training cost (WSSSE):", summary.trainingCost)


Cluster sizes: [45845, 153, 2]
Training cost (WSSSE): 249863137.2076334
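
The training cost reported by `KMeans` is the sum of squared distances from every point to its assigned cluster center. A plain-Python sketch of that quantity on a toy 2-D example follows (points, centers and assignments are illustrative; the function name is hypothetical).

```python
def wssse(points, centers, assignments):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        center = centers[cluster]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(1.0, 0.0), (10.0, 11.0)]
assignments = [0, 0, 1, 1]
cost = wssse(points, centers, assignments)
print(cost)
```

Cluster sizes of 45845, 153 and 2 mean that nearly every row falls into a single cluster, so this k = 3 solution separates very little of the sample.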
