# Data analysis and modeling

## 1. Reading clean data from Hadoop

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder \
    .appName("ModeloFacebook") \
    .getOrCreate()

25/11/15 23:00:40 WARN Utils: Your hostname, vbox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
25/11/15 23:00:40 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/15 23:00:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
coments_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("multiLine", "true") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .option("encoding", "UTF-8") \
    .option("delimiter", ",") \
    .csv("hdfs://localhost:9000/user/upao/processed/facebook/facebook_coments_clean.csv")

coments_df.printSchema()
print("Filas:", coments_df.count())
coments_df.show(5)

                                                                                

root
 |-- post_id: string (nullable = true)
 |-- comment_id: string (nullable = true)
 |-- comment_text: string (nullable = true)
 |-- author_name: string (nullable = true)
 |-- author_id: string (nullable = true)
 |-- comment_date: timestamp (nullable = true)
 |-- reaction_count: integer (nullable = true)



                                                                                

Filas: 623179
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------+
|             post_id|          comment_id|        comment_text|         author_name|           author_id|        comment_date|reaction_count|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------+
|e132943d-b740-48a...|88ef00aa-f56e-4b5...|Gracias por el da...|       Cristian Mayo|552d5bf8-c9e0-4a1...|2024-12-23 20:38:...|            50|
|e132943d-b740-48a...|68bfa1d8-274a-450...|Uff, yo también t...|   Roldán Arias Jove|8a01b3a1-4a62-4f5...|2025-01-23 00:09:...|             3|
|e132943d-b740-48a...|b0d5beb8-a735-45e...|Si vas a Machu Pi...|Bernardo de Rosselló|4370d5bf-39d6-4ad...|2025-05-23 05:43:...|            33|
|e132943d-b740-48a...|c1ee05b3-7024-407...|Crecí cerca de Ma...|  Áurea Elorza-Marin|0a122838-d6e1-4b6...|2025-10-26 00:55:...|            23|

In [5]:
posts_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("multiLine", "true") \
    .option("quote", "\"") \
    .option("escape", "\"") \
    .option("encoding", "UTF-8") \
    .option("delimiter", ",") \
    .csv("hdfs://localhost:9000/user/upao/processed/facebook/facebook_posts_clean.csv")

posts_df.printSchema()
print("Filas:", posts_df.count())
posts_df.show(5)

root
 |-- post_id: string (nullable = true)
 |-- author: string (nullable = true)
 |-- description: string (nullable = true)
 |-- created_at: timestamp (nullable = true)
 |-- reactions_count: integer (nullable = true)
 |-- comment_count: integer (nullable = true)

Filas: 37997
+--------------------+--------------------+--------------------+--------------------+---------------+-------------+
|             post_id|              author|         description|          created_at|reactions_count|comment_count|
+--------------------+--------------------+--------------------+--------------------+---------------+-------------+
|000c2dee-0f1d-4a3...|Cayetano de Rodri...|Si eres enfermero...|2024-06-19 12:52:...|           1036|           21|
|001059a8-f6e9-4d1...|Guadalupe Blázque...|Bretaña es decent...|2024-09-07 16:18:...|            927|           10|
|003c4a8d-18f8-4c7...|     Cosme Ferrándiz|El riesgo de mala...|2024-12-06 17:31:...|           1605|           12|
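|00403888-5440-45e...|Lucas Dani Sáez M...|No se si hacer vi...|2025-04-14 17:38:...|           1581|           14|
|00452062-c70e-4e7...| Selena Molins Torre|Que fue lo que ma...|2023-12-16 11:47:...|            482|            9|
+--------------------+--------------------+--------------------+--------------------+---------------+-------------+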

## 2. Feature extraction and creation of the training dataset

### 2.1 Extracting the tourist destination with a keyword dictionary

In [7]:
destinos = [
    "Máncora", "Punta Sal", "Zorritos", "Acapulco", "Cancas",
    "Puerto Pizarro", "Bocapán", "Playa Hermosa", "Caleta Grau",
    "Los Órganos", "Vichayito", "Cabo Blanco", "Lobitos", "Colán",
    "El Ñuro", "Las Pocitas", "Negritos", "Paita", "Yacila",
    "Nonura", "Chulliyachi", "Matacaballo", "Constante", "Bayóvar",
    "Pimentel", "Puerto Eten", "San José", "Santa Rosa", "Laguna Azul",
    "Huanchaco", "Pacasmayo", "Puerto Chicama", "Las Delicias", "Salaverry",
    "Puerto Morín", "Chepén", "Cherrepe", "Guañape",
    "Casma", "Chimbote", "Tortugas", "Huarmey",
    "La Pocita", "Tamborero", "Antivito", "Samanco", "Besique",
    "Culebras", "Puerto Supe", "Végueta", "Barranca","Tumbes", "Zarumilla", "Aguas Verdes", "Corrales",
    "Piura", "Sullana", "Talara", "Paita", "Catacaos",
    "Chulucanas", "Sechura", "Morropón", "Huancabamba", "Ayabaca",
    "La Unión", "Canchaque", "Tambo Grande",
    "Chiclayo", "Lambayeque", "Ferreñafe", "Monsefú", "Olmos",
    "Motupe", "Jayanca", "Túcume", "Mórrope", "Zaña",
    "Trujillo", "Pacasmayo", "Chepén", "Ascope", "Otuzco",
    "Huamachuco", "Santiago de Chuco", "Virú", "Guadalupe",
    "San Pedro de Lloc", "Moche", "Chao",
    "Cajamarca", "Baños del Inca", "Celendín", "Chota", "Cutervo",
    "Jaén", "San Ignacio", "Bambamarca", "Contumazá", "Cajabamba",
    "Chachapoyas", "Bagua Grande", "Bagua", "Lamud", "Luya",
    "Rodríguez de Mendoza","Moyobamba", "Rioja", "Tarapoto", "Juanjuí","Chan Chan", "Huaca de la Luna", "Huaca del Sol", "Complejo El Brujo",
    "Huaca del Dragón", "Huaca Esmeralda", "Marcahuamachuco",
    "Wiracochapampa", "Galindo", "Farfán", "San José de Moro",
    "Museo Señor de Sipán", "Bosque de Pómac", "Complejo de Túcume",
    "Huaca Rajada", "Tumbas Reales de Sipán", "Chotuna Chornancap",
    "Ventarrón", "Collud-Zarpán", "Huaca Bandera", "Cinto",
    "Kuélap", "Sarcófagos de Karajía", "Revash", "Gran Vilaya",
    "Laguna de los Cóndores", "Yalape", "Macro", "Pueblo de los Muertos",
    "Ventanillas de Otuzco", "Cumbemayo", "Kuntur Wasi", "Pacopampa",
    "Ventanillas de Combayo", "Necrópolis de Combayo",
    "Templo de Chavín de Huántar", "Sechín", "Pañamarca", "Huaca de Punkurí",
    "Castillo de Huarmey", "Chankillo", "Pashash", "Paramonga",
    "Aypate", "Huaca Narihualá", "Chusis", "Cerro Vicús", "Cabeza de Vaca",
    "Ciudad Sagrada de Caral", "Áspero", "Vichama", "Bandurria", "Complejo Paraíso",
    "Cusco", "Arequipa", "Puno", "Huaraz", "Cajamarca",
    "Ayacucho", "Huancayo", "Huánuco", "Cerro de Pasco", "Abancay",
    "Huancavelica", "Chachapoyas", "Moquegua", "Ollantaytambo", "Pisac",
    "Urubamba", "Calca", "Chivay", "Yanque", "Jauja",
    "Tarma", "La Oroya", "Baños del Inca", "Celendín", "Chota",
    "Cutervo", "Bambamarca", "Cajabamba", "Contumazá",
    "Huamachuco", "Santiago de Chuco", "Otuzco", "Caraz", "Yungay",
    "Chacas", "Huari", "Pomabamba", "Recuay", "Andahuaylas",
    "Juliaca", "Lampa", "Ayaviri", "Desaguadero", "Yunguyo",
    "Concepción", "Chupaca", "Sicaya", "Carhuamayo", "Lamud", "Luya","Pachacámac", "Huaca Pucllana", "Huaca Huallamarca", "Complejo Mateo Salado",
    "Puruchuco", "Sacsayhuamán", "Ollantaytambo", "Pisac", "Moray",
    "Tipón", "Piquillacta", "Choquequirao", "Sillustani", "Cutimbo",
    "Pukara", "Complejo Arqueológico Wari", "Intihuatana de Vilcashuamán",
    "Gran Pajatén", "Kotosh", "Tunanmarca", "Arwaturo", "Tambo Colorado",
    "Petroglifos de Toro Muerto", "Willkawaín", "Honcopampa", "Tambo de Mora",
    "Incahuasi de Cañete", "Huaytará", "Ushnu de Huanacopampa", "Pikimachay",
    "Qenqo", "Tambomachay", "Puca Pucara", "Huchuy Qosqo", "Chinchero",
    "Vitcos", "Espíritu Pampa", "Runkurakay", "Sayacmarca", "Phuyupatamarca",
    "Wiñay Wayna", "Petroglifos de Checta", "Fortaleza de Collique",
    "Cantamarca", "Rúpac", "Chiprac", "Fortaleza de Acaray", "Las Shicras",
    "Pampa de las Llamas-Moxeke", "Cerro Sechín", "Garagay", "Cardal",
    "Cahuachi", "Estaquería", "Paredones", "Petroglifos de Miculla", "Cerro Baúl",
    "Reserva Nacional Pacaya Samiria", "Río Amazonas", "Reserva Nacional Tambopata",
    "Parque Nacional del Manu", "Lago Sandoval", "Collpa de Guacamayos Chuncho",
    "Laguna Yarinacocha", "Cueva de las Lechuzas", "Parque Nacional Tingo María",
    "Cataratas de Ahuashiyacu", "Laguna de Sauce", "Castillo de Lamas",
    "Petroglifos de Polish", "Comunidad Nativa Boras", "Comunidad Nativa Yaguas",
    "Isla de los Monos", "Malecón de Iquitos", "Barrio de Belén",
    "Mercado de Belén", "Complejo Turístico Quistococha", "Mariposario Pilpintuwasi",
    "Cocha Otorongo", "Cocha Salvador", "Parque Nacional Yanachaga-Chemillén",
    "Catarata Velo de la Novia", "Boquerón del Padre Abad", "Catarata de Yulitunqui",
    "Aguas Sulfurosas de Jacintillo", "La Bella Durmiente",
    "Baños Termales Paucaryacu", "Reserva Comunal Yanesha",
    "Jardín Botánico de Pucallpa", "Plaza de Armas de Iquitos", "Casa de Fierro",
    "Lago Tres Chimbadas", "Reserva Nacional Allpahuayo-Mishana",
    "Cataratas de Tsyapo", "Río Tambopata", "Río Madre de Dios",
    "Collpa de Loros El Infierno", "Valle de Chanchamayo", "Catarata de Bayoz",
    "Catarata de Tinamuz", "Reserva Indígena Amarakaeri", "Río Ene",
    "Río Apurímac", "Pongo de Manseriche", "Santuario Nacional Pampa Hermosa",
    "Comunidad Nativa Asháninka", "Catarata El Encanto de la Sirena",
    "Catarata de Regalía", "Jardín Botánico de Iquitos",
    "Comunidad Nativa Shipibo-Conibo", "Río Huallaga", "Río Ucayali",
    "Río Marañón","Iquitos", "Puerto Maldonado", "Pucallpa", "Tingo María", "Oxapampa",
    "Pozuzo", "Villa Rica", "La Merced", "San Ramón", "Pichanaki",
    "Nauta", "Lamas", "Sauce", "Aguaytía", "Quillabamba",
    "Atalaya", "Satipo", "Mazamari", "Requena", "Contamana",
    "Iberia", "Iñapari", "Santa María de Nieva", "Bellavista", "Saposoa",
    "Tocache", "Pilcopata", "Puerto Inca", "Ciudad Constitución", "Yurimaguas",
    "Caballococha", "Tamshiyacu", "Indiana", "Mazán", "San Lorenzo",
    "Santa Rosa de Yavarí", "Jepelacio", "Nueva Cajamarca", "Soritor",
    "Pacayzapa", "Pebas", "Pucacaca", "San Hilarión", "Shapaja",
    "Chazuta", "Tabalosos", "San José de Sisa", "Sarayacu", "Orellana",
    "Jenaro Herrera", "Bretaña", "Lagunas", "Balsapuerto", "Huicungo",
    "Pachiza","Petroglifos de Cunchipata",
    "Petroglifos de Shampuyacu",
    "Petroglifos de Balsapuerto",
    "Petroglifos de Quiaca",
    "Petroglifos de Pongo de Mainique",
    "Ruinas de Tantamayo",
    "Complejo Arqueologico de Uchkupishqo",
    "Ruinas de Chipuric",
    "Petroglifos de Faical",
    "Petroglifos de Samanga",
    "Petroglifos de Manga",
    "Petroglifos de Queros",
    "Sitio Arqueologico de Timbara",
    "Petroglifos de Catarata",
    "Petroglifos de Panguana",
    "Petroglifos de Pusac",
    "Petroglifos de San Antonio",
    "Petroglifos de Chazuta",
    "Tumbas Colgantes de la Jalca",
    "Mausoleos de Oton",
    "Mausoleos de Diablo Wasi",
    "Templo de Llama-G",
    "Fortaleza de Huaylillas",
    "Ruinas de Pirca Pirca",
    "Ruinas de Purunllacta",
    "Tumbas de Leca",
    "Tumbas de La Petaca",
    "Tumbas de Chipurik",
    "Sitio Arqueologico de Llactapata",
    "Petroglifos de Che-Che",
    "Petroglifos de Pitumarka",
    "Petroglifos de Santa Rosa",
    "Petroglifos de Chivay",
    "Petroglifos de Santa Cruz",
    "Petroglifos de Yamon",
    "Petroglifos de Utco",
    "Petroglifos de Limones",
    "Petroglifos de San Martin",
    "Petroglifos de Cacatachi",
    "Petroglifos de Shilcayo",
    "Petroglifos de Pumahuasi",
    "Petroglifos de Tunshuhuaico",
    "Petroglifos de Juanjui",
    "Petroglifos de Bellavista",
    "Petroglifos de Picota",
    "Templo de las Manos Cruzadas de Tingo Maria",
    "Ruinas de Saposoa",
    "Ruinas de Shunte",
    "Ruinas de Condormarca",
    "Complejo Arqueologico El Sapo",
    "Ruinas de la Conga"
]

#### A) Posts table

In [6]:
posts_df.show(5)

+--------------------+--------------------+--------------------+--------------------+---------------+-------------+
|             post_id|              author|         description|          created_at|reactions_count|comment_count|
+--------------------+--------------------+--------------------+--------------------+---------------+-------------+
|000c2dee-0f1d-4a3...|Cayetano de Rodri...|Si eres enfermero...|2024-06-19 12:52:...|           1036|           21|
|001059a8-f6e9-4d1...|Guadalupe Blázque...|Bretaña es decent...|2024-09-07 16:18:...|            927|           10|
|003c4a8d-18f8-4c7...|     Cosme Ferrándiz|El riesgo de mala...|2024-12-06 17:31:...|           1605|           12|
|00403888-5440-45e...|Lucas Dani Sáez M...|No se si hacer vi...|2025-04-14 17:38:...|           1581|           14|
|00452062-c70e-4e7...| Selena Molins Torre|Que fue lo que ma...|2023-12-16 11:47:...|            482|            9|
+--------------------+--------------------+--------------------+--------------------+---------------+-------------+

In [8]:
from pyspark.sql.functions import col, when, regexp_extract, regexp_replace

# Build one OR (alternation) regex from the destination names.
# The names are inserted as-is (nothing is escaped); (?i) makes each alternative case-insensitive.
regex_pattern = "|".join([f"(?i){d}" for d in destinos])

df_dest = posts_df.withColumn(
    "destino",
    regexp_extract(col("description"), regex_pattern, 0)
)

# Keep only the rows where a destination was detected
# (regexp_extract returns an empty string when nothing matches)
df_dest_not_null = df_dest.filter(col("destino") != "")
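
The pattern above joins the names in list order and without escaping, so a short name such as "San José" wins over "San José de Moro", and a name such as "Macro" can match inside an ordinary word. The following is a minimal sketch of a stricter pattern builder, not the method used in this notebook. The names `build_destination_pattern`, `first_destination` and `sample` are hypothetical, and the sketch is checked only with Python's `re` module. Spark evaluates patterns with Java's regex engine, whose Unicode flags differ from Python's, so the result would need to be verified there before being passed to `regexp_extract`.

```python
import re


def build_destination_pattern(names):
    """Build one case-insensitive alternation from a list of place names."""
    # Drop exact duplicates so each name appears once in the pattern.
    unique = set(names)
    # Longest names first, so "San José de Moro" is tried before "San José".
    ordered = sorted(unique, key=lambda n: (-len(n), n))
    # Escape each name so characters such as "-" are treated literally.
    alternatives = "|".join(re.escape(n) for n in ordered)
    # The lookarounds stop a match from starting or ending inside a word.
    return rf"(?i)(?<!\w)(?:{alternatives})(?!\w)"


# Hypothetical sample, a small subset of the names in `destinos`.
sample = ["San José", "San José de Moro", "Macro", "Bagua", "Bagua Grande", "San José"]
pattern = re.compile(build_destination_pattern(sample))


def first_destination(text):
    """Return the first destination found in text, or an empty string."""
    match = pattern.search(text)
    return match.group(0) if match else ""


print(first_destination("Visité San José de Moro ayer"))
print(first_destination("Un análisis macroeconómico"))
```

Changing the pattern this way would also change the row counts reported in this section, because fewer substrings inside ordinary words would count as a destination.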

In [13]:
df_dest_not_null.select("destino").show(truncate=False)
print("Filas:", df_dest_not_null.count())

+-----------------------------------+
|destino                            |
+-----------------------------------+
|San Hilarión                       |
|Bretaña                            |
|Pachiza                            |
|Parque Nacional Yanachaga-Chemillén|
|Collpa de Loros El Infierno        |
|Tumbas de Leca                     |
|Petroglifos de San Antonio         |
|Tumbas de Chipurik                 |
|Luya                               |
|Reserva Nacional Allpahuayo-Mishana|
|Petroglifos de Shilcayo            |
|Chivay                             |
|Puerto Chicama                     |
|Pachiza                            |
|Petroglifos de Samanga             |
|Petroglifos de Samanga             |
|Laguna Yarinacocha                 |
|Calca                              |
|Petroglifos de Catarata            |
|La Merced                          |
+-----------------------------------+
only showing top 20 rows





Filas: 36148


                                                                                

#### B) Comments table

In [16]:
coments_df.show(5)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------+
|             post_id|          comment_id|        comment_text|         author_name|           author_id|        comment_date|reaction_count|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------+
|e132943d-b740-48a...|88ef00aa-f56e-4b5...|Gracias por el da...|       Cristian Mayo|552d5bf8-c9e0-4a1...|2024-12-23 20:38:...|            50|
|e132943d-b740-48a...|68bfa1d8-274a-450...|Uff, yo también t...|   Roldán Arias Jove|8a01b3a1-4a62-4f5...|2025-01-23 00:09:...|             3|
|e132943d-b740-48a...|b0d5beb8-a735-45e...|Si vas a Machu Pi...|Bernardo de Rosselló|4370d5bf-39d6-4ad...|2025-05-23 05:43:...|            33|
|e132943d-b740-48a...|c1ee05b3-7024-407...|Crecí cerca de Ma...|  Áurea Elorza-Marin|0a122838-d6e1-4b6...|2025-10-26 00:55:...|            23|

In [14]:
df_dest_comments = coments_df.withColumn(
    "destino",
    regexp_extract(col("comment_text"), regex_pattern, 0)
)

# Keep only the rows where a destination was detected
df_dest_comments_not_null = df_dest_comments.filter(col("destino") != "")

In [15]:
df_dest_comments_not_null.select("destino").show(truncate=False)
print("Filas:", df_dest_comments_not_null.count())

+---------+
|destino  |
+---------+
|Cajabamba|
|Cajabamba|
|Cajabamba|
|Cajabamba|
|Cajabamba|
|Cajabamba|
|Cajabamba|
|Cajabamba|
|Cajabamba|
|Cajabamba|
|Cajabamba|
|Cajabamba|
|Cajabamba|
|Cusco    |
|Cusco    |
|Cusco    |
|Cusco    |
|Cusco    |
|Moche    |
|Moche    |
+---------+
only showing top 20 rows



Filas: 585732


                                                                                

#### Analyzing the results:

The posts table originally had 37997 rows and the comments table had 623179 rows.
After extracting a destination from each post and comment, we dropped the rows that do not mention any tourist destination.
This leaves 36148 rows in the posts table (1849 removed) and 585732 rows in the comments table (37447 removed).
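
The share of rows kept by the destination filter can be checked with a few lines of plain Python. This is only a sketch: the counts are copied from the `Filas:` outputs above, and the names `counts` and `retention` are hypothetical helpers, not part of the notebook.

```python
# Row counts before and after the destination filter, taken from the outputs above.
counts = {
    "posts": (37997, 36148),
    "comments": (623179, 585732),
}


def retention(before, after):
    """Return (rows dropped, percentage of rows kept)."""
    dropped = before - after
    kept_pct = 100 * after / before
    return dropped, kept_pct


for table, (before, after) in counts.items():
    dropped, kept_pct = retention(before, after)
    print(f"{table}: dropped {dropped} rows, kept {kept_pct:.2f}%")
```

Both tables keep well over 90% of their rows, which suggests that almost every text mentions at least one of the listed names.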

### 2.2 Assigning a department to each tourist destination

#### 2.2.1 Dictionary of tourist destinations and departments

In [19]:
destino_departamento = {
    "Máncora": "Piura",
    "Punta Sal": "Tumbes",
    "Zorritos": "Tumbes",
    "Acapulco": "Tumbes",
    "Cancas": "Tumbes",
    "Puerto Pizarro": "Tumbes",
    "Bocapán": "Tumbes",
    "Playa Hermosa": "Tumbes",
    "Caleta Grau": "Piura",
    "Los Órganos": "Piura",
    "Vichayito": "Piura",
    "Cabo Blanco": "Piura",
    "Lobitos": "Piura",
    "Colán": "Piura",
    "El Ñuro": "Piura",
    "Las Pocitas": "Piura",
    "Negritos": "Piura",
    "Paita": "Piura",
    "Yacila": "Piura",
    "Nonura": "Piura",
    "Chulliyachi": "Piura",
    "Matacaballo": "Piura",
    "Constante": "Piura",
    "Bayóvar": "Piura",
    "Pimentel": "Lambayeque",
    "Puerto Eten": "Lambayeque",
    "San José": "Lambayeque",
    "Santa Rosa": "Lambayeque",
    "Laguna Azul": "Cajamarca",
    "Huanchaco": "La Libertad",
    "Pacasmayo": "La Libertad",
    "Puerto Chicama": "La Libertad",
    "Las Delicias": "La Libertad",
    "Salaverry": "La Libertad",
    "Puerto Morín": "La Libertad",
    "Chepén": "La Libertad",
    "Cherrepe": "La Libertad",
    "Guañape": "La Libertad",
    "Casma": "Áncash",
    "Chimbote": "Áncash",
    "Tortugas": "Áncash",
    "Huarmey": "Áncash",
    "La Pocita": "Áncash",
    "Tamborero": "Áncash",
    "Antivito": "Áncash",
    "Samanco": "Áncash",
    "Besique": "Áncash",
    "Culebras": "Áncash",
    "Puerto Supe": "Lima",
    "Végueta": "Lima",
    "Barranca": "Lima",
    "Tumbes": "Tumbes",
    "Zarumilla": "Tumbes",
    "Aguas Verdes": "Tumbes",
    "Corrales": "Tumbes",
    "Piura": "Piura",
    "Sullana": "Piura",
    "Talara": "Piura",
    "Paita": "Piura",
    "Catacaos": "Piura",
    "Chulucanas": "Piura",
    "Sechura": "Piura",
    "Morropón": "Piura",
    "Huancabamba": "Piura",
    "Ayabaca": "Piura",
    "La Unión": "Piura",
    "Canchaque": "Piura",
    "Tambo Grande": "Piura",
    "Chiclayo": "Lambayeque",
    "Lambayeque": "Lambayeque",
    "Ferreñafe": "Lambayeque",
    "Monsefú": "Lambayeque",
    "Olmos": "Lambayeque",
    "Motupe": "Lambayeque",
    "Jayanca": "Lambayeque",
    "Túcume": "Lambayeque",
    "Mórrope": "Lambayeque",
    "Zaña": "Lambayeque",
    "Trujillo": "La Libertad",
    "Pacasmayo": "La Libertad",
    "Chepén": "La Libertad",
    "Ascope": "La Libertad",
    "Otuzco": "La Libertad",
    "Huamachuco": "La Libertad",
    "Santiago de Chuco": "La Libertad",
    "Virú": "La Libertad",
    "Guadalupe": "La Libertad",
    "San Pedro de Lloc": "La Libertad",
    "Moche": "La Libertad",
    "Chao": "La Libertad",
    "Cajamarca": "Cajamarca",
    "Baños del Inca": "Cajamarca",
    "Celendín": "Cajamarca",
    "Chota": "Cajamarca",
    "Cutervo": "Cajamarca",
    "Jaén": "Cajamarca",
    "San Ignacio": "Cajamarca",
    "Bambamarca": "Cajamarca",
    "Contumazá": "Cajamarca",
    "Cajabamba": "Cajamarca",
    "Chachapoyas": "Amazonas",
    "Bagua Grande": "Amazonas",
    "Bagua": "Amazonas",
    "Lamud": "Amazonas",
    "Luya": "Amazonas",
    "Rodríguez de Mendoza": "Amazonas",
    "Moyobamba": "San Martín",
    "Rioja": "San Martín",
    "Tarapoto": "San Martín",
    "Juanjuí": "San Martín",
    "Chan Chan": "La Libertad",
    "Huaca de la Luna": "La Libertad",
    "Huaca del Sol": "La Libertad",
    "Complejo El Brujo": "La Libertad",
    "Huaca del Dragón": "La Libertad",
    "Huaca Esmeralda": "La Libertad",
    "Marcahuamachuco": "La Libertad",
    "Wiracochapampa": "La Libertad",
    "Galindo": "La Libertad",
    "Farfán": "La Libertad",
    "San José de Moro": "La Libertad",
    "Museo Señor de Sipán": "Lambayeque",
    "Bosque de Pómac": "Lambayeque",
    "Complejo de Túcume": "Lambayeque",
    "Huaca Rajada": "Lambayeque",
    "Tumbas Reales de Sipán": "Lambayeque",
    "Chotuna Chornancap": "Lambayeque",
    "Ventarrón": "Lambayeque",
    "Collud-Zarpán": "Lambayeque",
    "Huaca Bandera": "Lambayeque",
    "Cinto": "Lambayeque",
    "Kuélap": "Amazonas",
    "Sarcófagos de Karajía": "Amazonas",
    "Revash": "Amazonas",
    "Gran Vilaya": "Amazonas",
    "Laguna de los Cóndores": "Amazonas",
    "Yalape": "Amazonas",
    "Macro": "Amazonas",
    "Pueblo de los Muertos": "Amazonas",
    "Ventanillas de Otuzco": "Cajamarca",
    "Cumbemayo": "Cajamarca",
    "Kuntur Wasi": "Cajamarca",
    "Pacopampa": "Cajamarca",
    "Ventanillas de Combayo": "Cajamarca",
    "Necrópolis de Combayo": "Cajamarca",
    "Templo de Chavín de Huántar": "Áncash",
    "Sechín": "Áncash",
    "Pañamarca": "Áncash",
    "Huaca de Punkurí": "Áncash",
    "Castillo de Huarmey": "Áncash",
    "Chankillo": "Áncash",
    "Pashash": "Áncash",
    "Paramonga": "Lima",
    "Aypate": "Piura",
    "Huaca Narihualá": "Piura",
    "Chusis": "Piura",
    "Cerro Vicús": "Piura",
    "Cabeza de Vaca": "Piura",
    "Ciudad Sagrada de Caral": "Lima",
    "Áspero": "Lima",
    "Vichama": "Lima",
    "Bandurria": "Lima",
    "Complejo Paraíso": "Lima",
    "Machu Picchu": "Cusco",
    "Valle Sagrado de los Incas": "Cusco",
    "Sacsayhuamán": "Cusco",
    "Montaña de Siete Colores": "Cusco",
    "Laguna Humantay": "Cusco",
    "Choquequirao": "Cusco",
    "Cañón del Colca": "Arequipa",
    "Monasterio de Santa Catalina": "Arequipa",
    "Volcán Misti": "Arequipa",
    "Lago Titicaca": "Puno",
    "Islas Flotantes de los Uros": "Puno",
    "Isla Taquile": "Puno",
    "Isla Amantaní": "Puno",
    "Sillustani": "Puno",
    "Cañón de Cotahuasi": "Arequipa",
    "Parque Nacional Huascarán": "Áncash",
    "Laguna 69": "Áncash",
    "Laguna de Llanganuco": "Áncash",
    "Nevado Pastoruri": "Áncash",
    "Chavín de Huántar": "Áncash",
    "Cumbemayo": "Cajamarca",
    "Ventanillas de Otuzco": "Cajamarca",
    "Granja Porcón": "Cajamarca",
    "Kuélap": "Amazonas",
    "Sarcófagos de Karajía": "Amazonas",
    "Catarata de Gocta": "Amazonas",
    "Mausoleos de Revash": "Amazonas",
    "Complejo Arqueológico Wari": "Ayacucho",
    "Pampa de la Quinua": "Ayacucho",
    "Aguas Turquesas de Millpu": "Ayacucho",
    "Complejo Arqueológico de Vilcashuamán": "Ayacucho",
    "Laguna de Paca": "Junín",
    "Convento de Ocopa": "Junín",
    "Valle del Mantaro": "Junín",
    "Bosque de Piedras de Huayllay": "Pasco",
    "Kotosh": "Huánuco",
    "Laguna de Pacucha": "Apurímac",
    "Complejo Arqueológico de Sondor": "Apurímac",
    "Salinas de Maras": "Cusco",
    "Andenes de Moray": "Cusco",
    "Raqchi": "Cusco",
    "Tipón": "Cusco",
    "Piquillacta": "Cusco",
    "Puente Inca Q'eswachaka": "Cusco",
    "Reserva Nacional de Salinas y Aguada Blanca": "Arequipa",
    "Reserva Nacional del Titicaca": "Puno",
    "Nevado Huascarán": "Áncash",
    "Nevado Alpamayo": "Áncash",
    "Baños Termales de Cónoc": "Cajamarca",
    "Cusco": "Cusco",
    "Arequipa": "Arequipa",
    "Puno": "Puno",
    "Huaraz": "Áncash",
    "Cajamarca": "Cajamarca",
    "Ayacucho": "Ayacucho",
    "Huancayo": "Junín",
    "Huánuco": "Huánuco",
    "Cerro de Pasco": "Pasco",
    "Abancay": "Apurímac",
    "Huancavelica": "Huancavelica",
    "Chachapoyas": "Amazonas",
    "Moquegua": "Moquegua",
    "Ollantaytambo": "Cusco",
    "Pisac": "Cusco",
    "Urubamba": "Cusco",
    "Calca": "Cusco",
    "Chivay": "Arequipa",
    "Yanque": "Arequipa",
    "Jauja": "Junín",
    "Tarma": "Junín",
    "La Oroya": "Junín",
    "Baños del Inca": "Cajamarca",
    "Celendín": "Cajamarca",
    "Chota": "Cajamarca",
    "Cutervo": "Cajamarca",
    "Bambamarca": "Cajamarca",
    "Cajabamba": "Cajamarca",
    "Contumazá": "Cajamarca",
    "Huamachuco": "La Libertad",
    "Santiago de Chuco": "La Libertad",
    "Otuzco": "La Libertad",
    "Caraz": "Áncash",
    "Yungay": "Áncash",
    "Chacas": "Áncash",
    "Huari": "Áncash",
    "Pomabamba": "Áncash",
    "Recuay": "Áncash",
    "Andahuaylas": "Apurímac",
    "Juliaca": "Puno",
    "Lampa": "Puno",
    "Ayaviri": "Puno",
    "Desaguadero": "Puno",
    "Yunguyo": "Puno",
    "Concepción": "Junín",
    "Chupaca": "Junín",
    "Sicaya": "Junín",
    "Carhuamayo": "Junín",
    "Lamud": "Amazonas",
    "Luya": "Amazonas",
    "Pachacámac": "Lima",
    "Huaca Pucllana": "Lima",
    "Huaca Huallamarca": "Lima",
    "Complejo Mateo Salado": "Lima",
    "Puruchuco": "Lima",
    "Sacsayhuamán": "Cusco",
    "Ollantaytambo": "Cusco",
    "Pisac": "Cusco",
    "Moray": "Cusco",
    "Tipón": "Cusco",
    "Piquillacta": "Cusco",
    "Choquequirao": "Cusco",
    "Sillustani": "Puno",
    "Cutimbo": "Puno",
    "Pukara": "Puno",
    "Complejo Arqueológico Wari": "Ayacucho",
    "Intihuatana de Vilcashuamán": "Ayacucho",
    "Gran Pajatén": "San Martín",
    "Kotosh": "Huánuco",
    "Tunanmarca": "Junín",
    "Arwaturo": "Junín",
    "Tambo Colorado": "Ica",
    "Petroglifos de Toro Muerto": "Arequipa",
    "Willkawaín": "Áncash",
    "Honcopampa": "Áncash",
    "Tambo de Mora": "Ica",
    "Incahuasi de Cañete": "Lima",
    "Huaytará": "Huancavelica",
    "Ushnu de Huanacopampa": "Huancavelica",
    "Pikimachay": "Ayacucho",
    "Qenqo": "Cusco",
    "Tambomachay": "Cusco",
    "Puca Pucara": "Cusco",
    "Huchuy Qosqo": "Cusco",
    "Chinchero": "Cusco",
    "Vitcos": "Cusco",
    "Espíritu Pampa": "Cusco",
    "Runkurakay": "Cusco",
    "Sayacmarca": "Cusco",
    "Phuyupatamarca": "Cusco",
    "Wiñay Wayna": "Cusco",
    "Petroglifos de Checta": "Lima",
    "Fortaleza de Collique": "Lima",
    "Cantamarca": "Lima",
    "Rúpac": "Lima",
    "Chiprac": "Lima",
    "Fortaleza de Acaray": "Lima",
    "Las Shicras": "Lima",
    "Pampa de las Llamas-Moxeke": "Áncash",
    "Cerro Sechín": "Áncash",
    "Garagay": "Lima",
    "Cardal": "Lima",
    "Cahuachi": "Ica",
    "Estaquería": "Ica",
    "Paredones": "Ica",
    "Petroglifos de Miculla": "Tacna",
    "Cerro Baúl": "Moquegua",
    "Reserva Nacional Pacaya Samiria": "Loreto",
    "Río Amazonas": "Loreto",
    "Reserva Nacional Tambopata": "Madre de Dios",
    "Parque Nacional del Manu": "Madre de Dios",
    "Lago Sandoval": "Madre de Dios",
    "Collpa de Guacamayos Chuncho": "Madre de Dios",
    "Laguna Yarinacocha": "Ucayali",
    "Cueva de las Lechuzas": "Huánuco",
    "Parque Nacional Tingo María": "Huánuco",
    "Cataratas de Ahuashiyacu": "San Martín",
    "Laguna de Sauce": "San Martín",
    "Castillo de Lamas": "San Martín",
    "Petroglifos de Polish": "San Martín",
    "Comunidad Nativa Boras": "Loreto",
    "Comunidad Nativa Yaguas": "Loreto",
    "Isla de los Monos": "Loreto",
    "Malecón de Iquitos": "Loreto",
    "Barrio de Belén": "Loreto",
    "Mercado de Belén": "Loreto",
    "Complejo Turístico Quistococha": "Loreto",
    "Mariposario Pilpintuwasi": "Loreto",
    "Cocha Otorongo": "Madre de Dios",
    "Cocha Salvador": "Madre de Dios",
    "Parque Nacional Yanachaga-Chemillén": "Pasco",
    "Catarata Velo de la Novia": "Junín",
    "Boquerón del Padre Abad": "Ucayali",
    "Catarata de Yulitunqui": "Ucayali",
    "Aguas Sulfurosas de Jacintillo": "Ucayali",
    "La Bella Durmiente": "Huánuco",
    "Baños Termales Paucaryacu": "Huánuco",
    "Reserva Comunal Yanesha": "Pasco",
    "Jardín Botánico de Pucallpa": "Ucayali",
    "Plaza de Armas de Iquitos": "Loreto",
    "Casa de Fierro": "Loreto",
    "Lago Tres Chimbadas": "Madre de Dios",
    "Reserva Nacional Allpahuayo-Mishana": "Loreto",
    "Cataratas de Tsyapo": "San Martín",
    "Río Tambopata": "Madre de Dios",
    "Río Madre de Dios": "Madre de Dios",
    "Collpa de Loros El Infierno": "Madre de Dios",
    "Valle de Chanchamayo": "Junín",
    "Catarata de Bayoz": "Junín",
    "Catarata de Tinamuz": "Junín",
    "Reserva Indígena Amarakaeri": "Madre de Dios",
    "Río Ene": "Junín",
    "Río Apurímac": "Apurímac",
    "Pongo de Manseriche": "Loreto",
    "Santuario Nacional Pampa Hermosa": "Junín",
    "Comunidad Nativa Asháninka": "Junín",
    "Catarata El Encanto de la Sirena": "San Martín",
    "Catarata de Regalía": "San Martín",
    "Jardín Botánico de Iquitos": "Loreto",
    "Comunidad Nativa Shipibo-Conibo": "Ucayali",
    "Río Huallaga": "San Martín",
    "Río Ucayali": "Ucayali",
    "Río Marañón": "Loreto",
    "Iquitos": "Loreto",
    "Puerto Maldonado": "Madre de Dios",
    "Pucallpa": "Ucayali",
    "Tingo María": "Huánuco",
    "Oxapampa": "Pasco",
    "Pozuzo": "Pasco",
    "Villa Rica": "Pasco",
    "La Merced": "Junín",
    "San Ramón": "Junín",
    "Pichanaki": "Junín",
    "Nauta": "Loreto",
    "Lamas": "San Martín",
    "Sauce": "San Martín",
    "Aguaytía": "Ucayali",
    "Quillabamba": "Cusco",
    "Atalaya": "Ucayali",
    "Satipo": "Junín",
    "Mazamari": "Junín",
    "Requena": "Loreto",
    "Contamana": "Loreto",
    "Iberia": "Madre de Dios",
    "Iñapari": "Madre de Dios",
    "Santa María de Nieva": "Amazonas",
    "Bellavista": "San Martín",
    "Saposoa": "San Martín",
    "Tocache": "San Martín",
    "Pilcopata": "Cusco",
    "Puerto Inca": "Huánuco",
    "Ciudad Constitución": "Pasco",
    "Yurimaguas": "Loreto",
    "Caballococha": "Loreto",
    "Tamshiyacu": "Loreto",
    "Indiana": "Loreto",
    "Mazán": "Loreto",
    "San Lorenzo": "Loreto",
    "Santa Rosa de Yavarí": "Loreto",
    "Jepelacio": "San Martín",
    "Nueva Cajamarca": "San Martín",
    "Soritor": "San Martín",
    "Pacayzapa": "San Martín",
    "Pebas": "Loreto",
    "Pucacaca": "San Martín",
    "San Hilarión": "San Martín",
    "Shapaja": "San Martín",
    "Chazuta": "San Martín",
    "Tabalosos": "San Martín",
    "San José de Sisa": "San Martín",
    "Sarayacu": "Ucayali",
    "Orellana": "Loreto",
    "Jenaro Herrera": "Loreto",
    "Bretaña": "Loreto",
    "Lagunas": "Loreto",
    "Balsapuerto": "Loreto",
    "Huicungo": "San Martín",
    "Pachiza": "San Martín",
    "Petroglifos de Cunchipata": "San Martín",
    "Petroglifos de Shampuyacu": "San Martín",
    "Petroglifos de Balsapuerto": "Loreto",
    "Petroglifos de Quiaca": "San Martín",
    "Petroglifos de Pongo de Mainique": "Cusco",
    "Ruinas de Tantamayo": "Huánuco",
    "Complejo Arqueologico de Uchkupishqo": "San Martín",
    "Ruinas de Chipuric": "San Martín",
    "Petroglifos de Faical": "San Martín",
    "Petroglifos de Samanga": "San Martín",
    "Petroglifos de Manga": "San Martín",
    "Petroglifos de Queros": "Cusco",
    "Sitio Arqueologico de Timbara": "San Martín",
    "Petroglifos de Catarata": "San Martín",
    "Petroglifos de Panguana": "San Martín",
    "Petroglifos de Pusac": "San Martín",
    "Petroglifos de San Antonio": "San Martín",
    "Petroglifos de Chazuta": "San Martín",
    "Tumbas Colgantes de la Jalca": "Amazonas",
    "Mausoleos de Oton": "Amazonas",
    "Mausoleos de Diablo Wasi": "Amazonas",
    "Templo de Llama-G": "San Martín",
    "Fortaleza de Huaylillas": "San Martín",
    "Ruinas de Pirca Pirca": "San Martín",
    "Ruinas de Purunllacta": "San Martín",
    "Tumbas de Leca": "San Martín",
    "Tumbas de La Petaca": "San Martín",
    "Tumbas de Chipurik": "San Martín",
    "Sitio Arqueologico de Llactapata": "Cusco",
    "Petroglifos de Che-Che": "San Martín",
    "Petroglifos de Pitumarka": "Cusco",
    "Petroglifos de Santa Rosa": "San Martín",
    "Petroglifos de Chivay": "Arequipa",
    "Petroglifos de Santa Cruz": "San Martín",
    "Petroglifos de Yamon": "San Martín",
    "Petroglifos de Utco": "San Martín",
    "Petroglifos de Limones": "San Martín",
    "Petroglifos de San Martin": "San Martín",
    "Petroglifos de Cacatachi": "San Martín",
    "Petroglifos de Shilcayo": "San Martín",
    "Petroglifos de Pumahuasi": "San Martín",
    "Petroglifos de Tunshuhuaico": "San Martín",
    "Petroglifos de Juanjui": "San Martín",
    "Petroglifos de Bellavista": "San Martín",
    "Petroglifos de Picota": "San Martín",
    "Templo de las Manos Cruzadas de Tingo Maria": "Huánuco",
    "Ruinas de Saposoa": "San Martín",
    "Ruinas de Shunte": "San Martín",
    "Ruinas de Condormarca": "San Martín",
    "Complejo Arqueologico El Sapo": "San Martín",
    "Ruinas de la Conga": "San Martín"
}

#### A) Posts table

In [20]:
from pyspark.sql.functions import create_map, lit, col

# Convert the dict into a Spark map expression (alternating key and value literals)
mapping_expr = create_map([lit(x) for pair in destino_departamento.items() for x in pair])

df_dest_dept = df_dest_not_null.withColumn("departamento", mapping_expr[col("destino")])
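`create_map` takes its arguments as an alternating sequence of keys and values, which is why the comprehension above flattens the dictionary's `(key, value)` pairs. The plain-Python sketch below shows the flattened order without needing a Spark session. It uses a hypothetical three-entry subset of `destino_departamento` named `sample`; `itertools.chain.from_iterable` is an equivalent way to write the same flattening.

```python
from itertools import chain

# Hypothetical subset of destino_departamento, for illustration only.
sample = {
    "Iquitos": "Loreto",
    "Pucallpa": "Ucayali",
    "Oxapampa": "Pasco",
}

# Same flattening as the nested comprehension used to build mapping_expr.
flat_comprehension = [x for pair in sample.items() for x in pair]

# Equivalent flattening with itertools.
flat_chain = list(chain.from_iterable(sample.items()))

print(flat_chain)
```

Either list could then be wrapped element by element in `lit(...)` before being passed to `create_map`.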

In [26]:
df_dest_dept.show()

+--------------------+--------------------+--------------------+--------------------+---------------+-------------+--------------------+-------------+
|             post_id|              author|         description|          created_at|reactions_count|comment_count|             destino| departamento|
+--------------------+--------------------+--------------------+--------------------+---------------+-------------+--------------------+-------------+
|000c2dee-0f1d-4a3...|Cayetano de Rodri...|Si eres enfermero...|2024-06-19 12:52:...|           1036|           21|        San Hilarión|   San Martín|
|001059a8-f6e9-4d1...|Guadalupe Blázque...|Bretaña es decent...|2024-09-07 16:18:...|            927|           10|             Bretaña|       Loreto|
|003c4a8d-18f8-4c7...|     Cosme Ferrándiz|El riesgo de mala...|2024-12-06 17:31:...|           1605|           12|             Pachiza|   San Martín|
|00403888-5440-45e...|Lucas Dani Sáez M...|No se si hacer vi...|2025-04-14 17:38:...|         

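Number of rows with a null departamento value. Note that the filter below only matches empty strings; a destino absent from the mapping typically yields null, which `col("departamento").isNull()` would detect.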

In [27]:
df_dest_dept.filter(col("departamento") == "").show()

+-------+------+-----------+----------+---------------+-------------+-------+------------+
|post_id|author|description|created_at|reactions_count|comment_count|destino|departamento|
+-------+------+-----------+----------+---------------+-------------+-------+------------+
+-------+------+-----------+----------+---------------+-------------+-------+------------+
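
Because looking up a key that is absent from a Spark map column typically yields null rather than an empty string, a complementary check is to compare the distinct destinations against the dictionary keys on the driver. A minimal sketch, assuming the distinct `destino` values have already been collected into a Python list; `unmapped_destinos`, `destinos`, and `sample` are hypothetical names and data used only for illustration:

```python
def unmapped_destinos(destinos, mapping):
    """Return the sorted destinations that have no entry in `mapping`."""
    return sorted({d for d in destinos if d is not None and d not in mapping})


# Hypothetical stand-ins for destino_departamento and the collected values.
sample = {"Iquitos": "Loreto", "Pucallpa": "Ucayali"}
destinos = ["Iquitos", "Pucallpa", "Cajabamba", None, "Cajabamba"]

missing = unmapped_destinos(destinos, sample)
print(missing)
```

In the notebook, the input list could come from something like `[r["destino"] for r in df_dest_not_null.select("destino").distinct().collect()]`, which is reasonable as long as the number of distinct destinations is small.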


#### B) Comments table

In [28]:
df_comments_dest_dept = df_dest_comments_not_null.withColumn("departamento", mapping_expr[col("destino")])

In [29]:
df_comments_dest_dept.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------+---------+------------+
|             post_id|          comment_id|        comment_text|         author_name|           author_id|        comment_date|reaction_count|  destino|departamento|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------+---------+------------+
|362db8f3-689a-4b1...|625e0ebb-0893-4ea...|Turistas en Cajab...| Goyo Gisbert Tirado|f4bbd7ef-9f1f-453...|2025-01-07 03:34:...|            69|Cajabamba|   Cajamarca|
|362db8f3-689a-4b1...|6f268cfb-6ef2-4e9...|¿En Cajabamba hay...|Jose Francisco Ca...|1b8d85b9-3212-4c2...|2025-07-23 18:44:...|           130|Cajabamba|   Cajamarca|
|362db8f3-689a-4b1...|4dd817fa-1783-4ac...|Crecí en Cajabamb...|Ester Carmona Del...|d71114c4-2386-42e...|2024-05-14 23:59:...|           126|Cajabamba|   Cajamarca|
|362

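Number of rows with a null departamento value. As written, the filter matches empty strings only; an `isNull()` check would be needed to catch destinations absent from the mapping.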

In [30]:
df_comments_dest_dept.filter(col("departamento") == "").show()

+-------+----------+------------+-----------+---------+------------+--------------+-------+------------+
|post_id|comment_id|comment_text|author_name|author_id|comment_date|reaction_count|destino|departamento|
+-------+----------+------------+-----------+---------+------------+--------------+-------+------------+
+-------+----------+------------+-----------+---------+------------+--------------+-------+------------+


### 2.3 Sentiment analysis of the comment_text and description columns to compute the acceptance level

In [32]:
df_dest_dept.write.mode("overwrite").parquet("parquets/df_dest_dept.parquet")

In [None]:
df_dest_dept =  spark.read.parquet("parquets/df_dest_dept.parquet")

In [33]:
df_comments_dest_dept.write.mode("overwrite").parquet("parquets/df_comments_dest_dept.parquet")

In [None]:
df_comments_dest_dept =  spark.read.parquet("parquets/df_comments_dest_dept.parquet")

#### A) Posts table

In [37]:
pip install spark-nlp

Collecting spark-nlp
  Downloading spark_nlp-6.2.2-py2.py3-none-any.whl.metadata (19 kB)
Downloading spark_nlp-6.2.2-py2.py3-none-any.whl (744 kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-6.2.2

[notice] A new release of pip is available: 25.2 -> 25.3
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [38]:
import sparknlp
spark = sparknlp.start()
print("Spark NLP listo!")

Spark NLP listo!


25/11/16 01:56:54 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [50]:
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql.functions import udf, col
from pyspark.sql.types import FloatType

pipeline = PretrainedPipeline("sentimentdl_use_twitter", lang="es")

sentimentdl_use_twitter download started this may take some time.


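TypeError: 'JavaPackage' object is not callable

This error usually means the Spark NLP JAR is not on the JVM classpath. The warning printed after `sparknlp.start()` shows that it reused the session created at the top of the notebook, which was started without the Spark NLP package. The existing session would likely need to be stopped with `spark.stop()` (or the kernel restarted) before calling `sparknlp.start()` again.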

In [46]:
def get_sentiment_score(text):
    """Return a signed sentiment score: +confidence, -confidence, or 0.0."""
    # Skip null or empty text
    if not text:
        return None

    result = pipeline.annotate(text)

    # The output is expected to look like:
    # {'document': [...], 'category': ['positive'], 'confidence': ['0.98']}
    label = result["category"][0]
    confidence = float(result["confidence"][0])

    if label == "positive":
        return confidence
    elif label == "negative":
        return -confidence
    else:
        return 0.0
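
The mapping from a predicted label and its confidence to a signed score does not depend on the model, so it can be separated out and checked in isolation. A minimal sketch, where `signed_score` is a hypothetical helper name and the label strings are assumed to match those used in the function above:

```python
def signed_score(label, confidence):
    """Map a sentiment label and its confidence to a score in [-1, 1]."""
    if label == "positive":
        return confidence
    if label == "negative":
        return -confidence
    # Any other label (for example "neutral") contributes no signal.
    return 0.0


print(signed_score("positive", 0.98))
print(signed_score("negative", 0.75))
print(signed_score("neutral", 0.60))
```

Keeping this step free of any Spark or Spark NLP dependency makes it easy to verify the sign convention before the model itself is working.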

In [None]:
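# Caveat: `pipeline` wraps JVM objects, so it generally cannot be pickled into a
# Python UDF running on executors; Spark NLP pipelines are normally applied with
# pipeline.transform(df) instead.
sentiment_score_udf = udf(get_sentiment_score, FloatType())

# Acceptance level of each post, computed from its description
df_post_sentiment = df_dest_dept.withColumn("nivel_aceptacion", sentiment_score_udf(col("description")))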

#### B) Comments table

### 2.4 Computing demand from reactions_count, comment_count, and the acceptance level

### 2.5 Computing the average acceptance per month

## 3. Exploratory data analysis

## 4. Selecting and applying the machine learning model

### Classification model for demand and linear regression model to predict the acceptance level

## 5. Model evaluation