# Entrenamiento de regresores

Objetivo: Para cada hexágono encontrar un valor perceptual.

Cada hexágono tiene:

*  N>=3, donde N es el número mínimo de imágenes
*  R=10, donde R es la resolución de cada hexágono. Con un área promedio por hexágono de 0.015km2 (15,047.5m2) y largo promedio de cada lado 0.076km
* Number of regions con estas características: 24220



### Libraries

In [6]:
from google.colab import drive
drive.mount('/content/drive')

ValueError: mount failed

In [None]:
!pip install srai[all]
!pip install contextily
!pip install alphashape

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium
import seaborn as sns
import geopandas as gpd
from shapely.geometry import Point
import zipfile
import srai
import os
from PIL import Image
import glob
import contextily as ctx
import alphashape

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, median_absolute_error,r2_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
import scipy.stats
import pickle


from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#To create embeddings
from srai.loaders import OSMOnlineLoader, OSMWayLoader, OSMPbfLoader
from srai.regionalizers import geocode_to_region_gdf, S2Regionalizer
from srai.plotting import plot_regions, plot_numeric_data
from srai.embedders import CountEmbedder, ContextualCountEmbedder,Hex2VecEmbedder, Highway2VecEmbedder
from srai.joiners import IntersectionJoiner
from srai.loaders.osm_loaders.filters import HEX2VEC_FILTER
from srai.neighbourhoods.h3_neighbourhood import H3Neighbourhood
from srai.regionalizers import H3Regionalizer, geocode_to_region_gdf

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Create dataset and training/testing sets

### Df embeddings

In [20]:
df_embeddings = pd.read_csv('/content/drive/MyDrive/UC-TESIS/data/embeddings/h3_embeddings_150_75_100.csv')
df_embeddings

Unnamed: 0,region_id,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
0,8ab2c54614b7fff,-0.023930,0.040677,-0.017497,-0.275375,0.103468,0.078189,-0.061404,0.082823,0.071790,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
1,8ab2c5409c37fff,-0.356221,0.336155,0.056644,-0.032715,0.185949,-0.196175,-0.233436,0.125683,0.293554,...,-0.012782,0.278413,0.040773,0.042589,-0.320704,-0.143026,-0.161286,-0.083865,-0.558367,-0.020263
2,8ab2c5735c27fff,0.511298,0.475341,0.060442,0.396583,-0.291473,0.012991,-0.526473,0.214151,0.395182,...,0.071864,-0.055179,0.182469,0.402972,-0.274875,-0.136615,-0.433164,-0.104150,0.073335,0.634614
3,8ab2c5470d37fff,-0.154229,-0.212532,-0.277868,0.359974,-0.432614,-0.131339,-0.035649,-0.234396,0.266802,...,0.030813,-0.076295,0.212762,-0.210441,-0.170993,0.012732,-0.051932,0.190558,0.100510,-0.181921
4,8ab2c51982d7fff,0.117171,-0.137542,0.248391,0.000995,-0.488907,-0.224059,0.048757,-0.305915,0.187887,...,-0.703700,-0.091194,0.196549,-0.125396,-0.249640,-0.063369,0.138847,0.637275,-0.073419,0.112756
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24215,8ab2c5440607fff,0.129073,-0.313071,-0.266625,0.206505,0.424100,-0.071679,-0.106600,-0.243351,-0.110525,...,0.437696,0.314795,0.128654,-0.177794,0.288626,-0.032735,0.041986,0.082394,0.089014,-0.192874
24216,8ab2c5735ba7fff,-0.023930,0.040677,-0.017497,-0.275375,0.103468,0.078189,-0.061404,0.082823,0.071790,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
24217,8ab2c519132ffff,-0.023930,0.040677,-0.017497,-0.275375,0.103468,0.078189,-0.061404,0.082823,0.071790,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
24218,8ab2c5573517fff,-0.023930,0.040677,-0.017497,-0.275375,0.103468,0.078189,-0.061404,0.082823,0.071790,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010


In [21]:
df_embeddings = df_embeddings.set_index('region_id')
df_embeddings

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
region_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8ab2c54614b7fff,-0.023930,0.040677,-0.017497,-0.275375,0.103468,0.078189,-0.061404,0.082823,0.071790,0.071382,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
8ab2c5409c37fff,-0.356221,0.336155,0.056644,-0.032715,0.185949,-0.196175,-0.233436,0.125683,0.293554,0.491351,...,-0.012782,0.278413,0.040773,0.042589,-0.320704,-0.143026,-0.161286,-0.083865,-0.558367,-0.020263
8ab2c5735c27fff,0.511298,0.475341,0.060442,0.396583,-0.291473,0.012991,-0.526473,0.214151,0.395182,0.211956,...,0.071864,-0.055179,0.182469,0.402972,-0.274875,-0.136615,-0.433164,-0.104150,0.073335,0.634614
8ab2c5470d37fff,-0.154229,-0.212532,-0.277868,0.359974,-0.432614,-0.131339,-0.035649,-0.234396,0.266802,0.241064,...,0.030813,-0.076295,0.212762,-0.210441,-0.170993,0.012732,-0.051932,0.190558,0.100510,-0.181921
8ab2c51982d7fff,0.117171,-0.137542,0.248391,0.000995,-0.488907,-0.224059,0.048757,-0.305915,0.187887,-0.228582,...,-0.703700,-0.091194,0.196549,-0.125396,-0.249640,-0.063369,0.138847,0.637275,-0.073419,0.112756
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8ab2c5440607fff,0.129073,-0.313071,-0.266625,0.206505,0.424100,-0.071679,-0.106600,-0.243351,-0.110525,0.429839,...,0.437696,0.314795,0.128654,-0.177794,0.288626,-0.032735,0.041986,0.082394,0.089014,-0.192874
8ab2c5735ba7fff,-0.023930,0.040677,-0.017497,-0.275375,0.103468,0.078189,-0.061404,0.082823,0.071790,0.071382,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
8ab2c519132ffff,-0.023930,0.040677,-0.017497,-0.275375,0.103468,0.078189,-0.061404,0.082823,0.071790,0.071382,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
8ab2c5573517fff,-0.023930,0.040677,-0.017497,-0.275375,0.103468,0.078189,-0.061404,0.082823,0.071790,0.071382,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010


### Df variables perceptuales

In [22]:
df = pd.read_csv('/content/drive/MyDrive/UC-TESIS/data/vars_perceptuales_santiago.csv')
df

Unnamed: 0,latlong,beautiful,boring,depressing,lively,safe,wealthy,lat,lon
0,"-33.323944,-70.51263428391168",-0.306948,1.565049,0.572029,-1.137733,-0.120456,-0.561887,-33.323944,-70.512634
1,"-33.323944,-70.5127291",-0.421388,0.309495,0.368965,-0.098733,-0.103042,-0.162294,-33.323944,-70.512729
2,"-33.323944,-70.51298714285714",0.116505,0.164284,-0.110312,0.063860,0.391172,0.226372,-33.323944,-70.512987
3,"-33.323944,-70.51343609999999",-0.159113,-0.500987,-0.213503,0.635165,0.300856,0.453708,-33.323944,-70.513436
4,"-33.323944,-70.51379769565217",-1.226162,1.176751,1.462015,-0.842954,-0.946355,-0.936168,-33.323944,-70.513798
...,...,...,...,...,...,...,...,...,...
121346,"-33.67884090851735,-70.68059514195582",-1.318599,2.201657,1.673377,-2.265229,-1.677807,-2.014101,-33.678841,-70.680595
121347,"-33.67884090851735,-70.69912023659306",0.827014,1.129426,-0.431715,-1.053424,0.438107,0.078089,-33.678841,-70.699120
121348,"-33.67884090851735,-70.70653027444796",0.975725,0.838188,-0.583064,-0.515637,0.957951,0.547197,-33.678841,-70.706530
121349,"-33.67884090851735,-70.7176453312303",-0.629273,0.780489,0.881986,-1.363918,-1.328026,-1.726176,-33.678841,-70.717645


In [23]:
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat))
gdf = gdf.set_crs(epsg=4326)
gdf_rm= gdf.copy().drop(columns=['lat', 'lon', 'latlong'])
gdf_rm


Unnamed: 0,beautiful,boring,depressing,lively,safe,wealthy,geometry
0,-0.306948,1.565049,0.572029,-1.137733,-0.120456,-0.561887,POINT (-70.51263 -33.32394)
1,-0.421388,0.309495,0.368965,-0.098733,-0.103042,-0.162294,POINT (-70.51273 -33.32394)
2,0.116505,0.164284,-0.110312,0.063860,0.391172,0.226372,POINT (-70.51299 -33.32394)
3,-0.159113,-0.500987,-0.213503,0.635165,0.300856,0.453708,POINT (-70.51344 -33.32394)
4,-1.226162,1.176751,1.462015,-0.842954,-0.946355,-0.936168,POINT (-70.51380 -33.32394)
...,...,...,...,...,...,...,...
121346,-1.318599,2.201657,1.673377,-2.265229,-1.677807,-2.014101,POINT (-70.68060 -33.67884)
121347,0.827014,1.129426,-0.431715,-1.053424,0.438107,0.078089,POINT (-70.69912 -33.67884)
121348,0.975725,0.838188,-0.583064,-0.515637,0.957951,0.547197,POINT (-70.70653 -33.67884)
121349,-0.629273,0.780489,0.881986,-1.363918,-1.328026,-1.726176,POINT (-70.71765 -33.67884)


Create region_id column

In [24]:
# Extraer los puntos de la columna 'geometry'
points = list(gdf_rm.geometry)
points = np.array([[point.x, point.y] for point in gdf_rm.geometry])
# Calcular el alpha shape
alpha = 100
alpha_shape = alphashape.alphashape(points, alpha)

# Crear un nuevo GeoDataFrame con el alpha shape
gdf_alpha_shape = gpd.GeoDataFrame(geometry=[alpha_shape])
gdf_alpha_shape['region_id'] = "Santiago Metropolitan Region, Chile"
gdf_alpha_shape = gdf_alpha_shape.set_crs(epsg=4326)
study_area2 = gdf_alpha_shape
study_area2

Unnamed: 0,geometry,region_id
0,"MULTIPOLYGON (((-70.76705 -33.67659, -70.76581...","Santiago Metropolitan Region, Chile"


In [25]:
regionalizer = H3Regionalizer(resolution=10, buffer=True)
regions_gdf_rm_10 = regionalizer.transform(study_area2)
regions_gdf_rm_10_no_index= regions_gdf_rm_10.reset_index()
regions_gdf_rm_10_no_index

Unnamed: 0,region_id,geometry
0,8ab2c57a0877fff,"POLYGON ((-70.67042 -33.65051, -70.67106 -33.6..."
1,8ab2c542b36ffff,"POLYGON ((-70.76083 -33.51373, -70.76146 -33.5..."
2,8ab2c55282cffff,"POLYGON ((-70.74971 -33.35969, -70.75034 -33.3..."
3,8ab2c519bcc7fff,"POLYGON ((-70.55181 -33.42376, -70.55244 -33.4..."
4,8ab2c5565ce7fff,"POLYGON ((-70.63264 -33.36447, -70.63327 -33.3..."
...,...,...
87105,8ab2c54492cffff,"POLYGON ((-70.62281 -33.62579, -70.62344 -33.6..."
87106,8ab2c50b0077fff,"POLYGON ((-70.56827 -33.45732, -70.56890 -33.4..."
87107,8ab2c54256a7fff,"POLYGON ((-70.79275 -33.44812, -70.79339 -33.4..."
87108,8ab2c547540ffff,"POLYGON ((-70.71536 -33.52109, -70.71599 -33.5..."


Add region_id colum with regionalizer function.



In [26]:
#Contiene solo las filas donde ambos gdf  se intersectan
df_perceptual = gpd.sjoin(gdf_rm, regions_gdf_rm_10_no_index, how="inner", predicate="intersects")
df_perceptual.drop(columns=['index_right'], inplace=True)
df_perceptual

Unnamed: 0,beautiful,boring,depressing,lively,safe,wealthy,geometry,region_id
0,-0.306948,1.565049,0.572029,-1.137733,-0.120456,-0.561887,POINT (-70.51263 -33.32394),8ab2c51a2297fff
1,-0.421388,0.309495,0.368965,-0.098733,-0.103042,-0.162294,POINT (-70.51273 -33.32394),8ab2c51a2297fff
2,0.116505,0.164284,-0.110312,0.063860,0.391172,0.226372,POINT (-70.51299 -33.32394),8ab2c51a274ffff
3,-0.159113,-0.500987,-0.213503,0.635165,0.300856,0.453708,POINT (-70.51344 -33.32394),8ab2c51a274ffff
4,-1.226162,1.176751,1.462015,-0.842954,-0.946355,-0.936168,POINT (-70.51380 -33.32394),8ab2c51a274ffff
...,...,...,...,...,...,...,...,...
121346,-1.318599,2.201657,1.673377,-2.265229,-1.677807,-2.014101,POINT (-70.68060 -33.67884),8ab2c57a30affff
121347,0.827014,1.129426,-0.431715,-1.053424,0.438107,0.078089,POINT (-70.69912 -33.67884),8ab2c57849affff
121348,0.975725,0.838188,-0.583064,-0.515637,0.957951,0.547197,POINT (-70.70653 -33.67884),8ab2c5784d37fff
121349,-0.629273,0.780489,0.881986,-1.363918,-1.328026,-1.726176,POINT (-70.71765 -33.67884),8ab2c57b1067fff


### Create final dataset antes de split

* ***Dataset full*** con regiones múltiples y con más de 1 feature por hexágono

In [27]:
#only matching rows from both df
dataset_full = pd.merge(df_perceptual, df_embeddings, on='region_id')
dataset_full = dataset_full.set_index('region_id').drop(columns=['geometry'])
dataset_full

Unnamed: 0_level_0,beautiful,boring,depressing,lively,safe,wealthy,0,1,2,3,...,90,91,92,93,94,95,96,97,98,99
region_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8ab2c51a274ffff,0.116505,0.164284,-0.110312,0.063860,0.391172,0.226372,-0.023930,0.040677,-0.017497,-0.275375,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
8ab2c51a274ffff,-0.159113,-0.500987,-0.213503,0.635165,0.300856,0.453708,-0.023930,0.040677,-0.017497,-0.275375,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
8ab2c51a274ffff,-1.226162,1.176751,1.462015,-0.842954,-0.946355,-0.936168,-0.023930,0.040677,-0.017497,-0.275375,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
8ab2c51a274ffff,-0.308952,0.491572,0.351956,-0.205486,0.098004,-0.171525,-0.023930,0.040677,-0.017497,-0.275375,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
8ab2c51a274ffff,-0.255187,0.108916,0.208224,0.178949,0.217683,0.072019,-0.023930,0.040677,-0.017497,-0.275375,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8ab2c54dd80ffff,0.077252,0.637953,0.144422,-0.589375,-0.117685,-0.098696,-0.300282,-0.230503,0.195256,-0.364696,...,0.160869,-0.016424,-0.036420,0.376509,-0.010768,-0.141948,-0.526136,0.079622,0.009509,0.082694
8ab2c54dd80ffff,0.688408,0.694112,-0.407948,-0.622452,0.357219,0.302788,-0.300282,-0.230503,0.195256,-0.364696,...,0.160869,-0.016424,-0.036420,0.376509,-0.010768,-0.141948,-0.526136,0.079622,0.009509,0.082694
8ab2c57a268ffff,-0.065934,2.345176,0.678399,-2.201416,-0.623765,-0.908452,-0.095155,0.081839,0.058075,0.198763,...,-0.160243,-0.125825,-0.070056,0.252882,0.131426,-0.226724,-0.199880,0.119445,-0.031587,0.105807
8ab2c57a268ffff,0.411270,1.080804,0.115137,-1.929492,-1.131603,-1.081190,-0.095155,0.081839,0.058075,0.198763,...,-0.160243,-0.125825,-0.070056,0.252882,0.131426,-0.226724,-0.199880,0.119445,-0.031587,0.105807


regiones únicas = 24220 --> dataset.region_id.nunique()

**Problema**: Para cada región (hexágono) tenemos distintas características (beautiful,	boring,	depressing,	lively,	safe	y wealthy) dentro de cada región. ¿Cuál es la mejor manera de ponderar esas características y representar mejor cada región? -->  media aritmética, la geométrica, el máximo, moda, mediana, otro?

Media aritmética

In [28]:
df_means = dataset_full.groupby('region_id').mean()
df_means

Unnamed: 0_level_0,beautiful,boring,depressing,lively,safe,wealthy,0,1,2,3,...,90,91,92,93,94,95,96,97,98,99
region_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8ab2c5012c4ffff,1.530274,-1.674806,-1.689757,1.525779,1.326105,1.537481,0.061735,0.028381,-0.014441,0.140128,...,0.050715,-0.271304,-0.213770,0.192833,0.213708,0.039599,-0.188866,0.170294,0.169785,0.159016
8ab2c5080057fff,1.797793,-0.473614,-1.603372,0.481697,1.303100,1.463700,-0.009823,-0.065363,0.021479,-0.529560,...,0.026409,-0.246902,0.014540,-0.198926,-0.076141,-0.011190,0.102937,0.077663,0.060189,-0.084835
8ab2c5080087fff,0.617557,1.008862,-0.127893,-1.063960,-0.006197,-0.005126,-0.049334,0.016889,-0.024865,-0.654026,...,-0.141553,-0.123275,0.003810,-0.035499,-0.403222,0.020489,-0.083732,0.241804,0.027533,-0.298355
8ab2c508009ffff,0.344537,0.689234,-0.113139,-0.690042,-0.006416,-0.023423,-0.023930,0.040677,-0.017497,-0.275375,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
8ab2c50800f7fff,0.658131,-0.016540,-0.538743,-0.102794,0.375745,0.415797,-0.023930,0.040677,-0.017497,-0.275375,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8ab2c5cdeca7fff,-0.422076,1.432125,0.556406,-0.505823,0.275188,-0.190039,-0.195331,-0.210522,-0.279355,0.339834,...,-0.006257,-0.070476,0.206203,-0.153168,-0.100565,-0.045583,-0.066746,0.180717,0.064394,-0.221057
8ab2c5cdecaffff,-0.301907,0.642069,0.367796,-0.402620,-0.067361,-0.244828,-0.152469,-0.207537,-0.275766,0.355418,...,0.027344,-0.072804,0.207733,-0.203336,-0.168068,0.012886,-0.051808,0.189225,0.097021,-0.178744
8ab2c5cdecd7fff,-0.047693,0.280894,0.017444,-0.186770,-0.057487,0.011591,-0.154229,-0.212532,-0.277868,0.359974,...,0.030813,-0.076295,0.212762,-0.210441,-0.170993,0.012732,-0.051932,0.190558,0.100510,-0.181921
8ab2c5cdecdffff,-0.932705,1.460426,1.053305,-1.076747,-0.633851,-0.852946,-0.152469,-0.207537,-0.275766,0.355418,...,0.027344,-0.072804,0.207733,-0.203336,-0.168068,0.012886,-0.051808,0.189225,0.097021,-0.178744


Mediana

In [29]:
df_medians = dataset_full.groupby('region_id').median()
df_medians

Unnamed: 0_level_0,beautiful,boring,depressing,lively,safe,wealthy,0,1,2,3,...,90,91,92,93,94,95,96,97,98,99
region_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8ab2c5012c4ffff,1.361391,-2.042482,-1.900075,1.903879,1.576332,1.891004,0.061735,0.028381,-0.014441,0.140128,...,0.050715,-0.271304,-0.213770,0.192833,0.213708,0.039599,-0.188866,0.170294,0.169785,0.159016
8ab2c5080057fff,1.604606,-0.781635,-1.509878,0.690671,1.329464,1.497841,-0.009823,-0.065363,0.021479,-0.529560,...,0.026409,-0.246902,0.014540,-0.198926,-0.076141,-0.011190,0.102937,0.077663,0.060189,-0.084835
8ab2c5080087fff,0.534498,1.203662,-0.124734,-0.988998,-0.178066,-0.009102,-0.049334,0.016889,-0.024865,-0.654026,...,-0.141553,-0.123275,0.003810,-0.035499,-0.403222,0.020489,-0.083732,0.241804,0.027533,-0.298355
8ab2c508009ffff,0.523684,0.534084,-0.243849,-0.727690,-0.063302,-0.097517,-0.023930,0.040677,-0.017497,-0.275375,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
8ab2c50800f7fff,0.338217,-0.086866,-0.254180,-0.074327,0.160953,0.149841,-0.023930,0.040677,-0.017497,-0.275375,...,0.180155,-0.046677,0.126445,-0.061428,-0.006454,-0.030310,0.056697,-0.153097,0.123646,-0.004010
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8ab2c5cdeca7fff,-0.426769,1.631752,0.593024,-0.516723,0.275822,-0.118063,-0.195331,-0.210522,-0.279355,0.339834,...,-0.006257,-0.070476,0.206203,-0.153168,-0.100565,-0.045583,-0.066746,0.180717,0.064394,-0.221057
8ab2c5cdecaffff,-0.180813,0.702974,0.187430,-0.467125,-0.044658,-0.363213,-0.152469,-0.207537,-0.275766,0.355418,...,0.027344,-0.072804,0.207733,-0.203336,-0.168068,0.012886,-0.051808,0.189225,0.097021,-0.178744
8ab2c5cdecd7fff,-0.537953,0.905214,0.774207,-0.618593,-0.771503,-0.753049,-0.154229,-0.212532,-0.277868,0.359974,...,0.030813,-0.076295,0.212762,-0.210441,-0.170993,0.012732,-0.051932,0.190558,0.100510,-0.181921
8ab2c5cdecdffff,-0.974256,1.550554,1.053681,-1.071634,-0.626345,-0.882865,-0.152469,-0.207537,-0.275766,0.355418,...,0.027344,-0.072804,0.207733,-0.203336,-0.168068,0.012886,-0.051808,0.189225,0.097021,-0.178744


## Models



### Split

80% for training

20% for testing

*  Matriz X: embeddings OSM hex2vec
*  Vector objetivo y: variables perceptuales de las imágenes. Probar vector objetivo por separado: beautiful y boring
```
# hex2vec columns OSM
X = df_means.drop(['beautiful', 'boring', 'depressing', 'lively', 'safe', 'wealthy'], axis=1)
# Perceptual columns
y = df_means[['beautiful', 'boring', 'depressing', 'lively', 'safe', 'wealthy']]

```



## 1.Random forest


In [None]:
#X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, y, test_size=0.2, random_state=25)
#y_pred_1 = rf_model_df_means_1.predict(X_test_1)

Probar vector objetivo por separado
  - beautiful
  - boring

10 variaciones del modelo cada uno entrenado y evalado en diferentes random subsets

**cross_val_score**

'neg_mean_squared_error':  especifica que la métrica  para evaluar el modelo en cada fold es el MSE negativo. La razón? cross_val_score espera una función de puntuación donde las puntuaciones más altas son mejores. Como el MSE es una medida de error (valores más bajos son mejores), se niega para alinearlo con la convención.

El signo menos (-) del cross_val_score invierte los valores para obtener el MSE positivo. cv_scores_mse contendrá una matriz de valores MSE para cada fold.

### 1.1 Random forest con df media, y= beautiful


### 1.3 Random forest con df mediana, y=beautiful

In [32]:
%%time
print("comenzó?")
# Dataset
# hex2vec columns OSM
X = df_medians.drop(['beautiful', 'boring', 'depressing', 'lively', 'safe', 'wealthy'], axis=1)
y = df_medians[['beautiful']]  #y = df_medians[['boring']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=25)

model = RandomForestRegressor(n_estimators=50, random_state=25)
print("se viene el fit")
model.fit(X_train, y_train.values.ravel())
print("terminó fit")

print("cv2 calculando")

cv_scores_r2 = cross_val_score(model, X_train, y_train.values.ravel(), cv=5, scoring='r2')
#cv_scores_mse = -cross_val_score(model, X_train, y_train.values.ravel(), cv=5, scoring='neg_mean_squared_error')
print("cv2 ok")
#predictions en training
y_train_pred = model.predict(X_train)
print("predicciones")


#predictions en test
y_test_pred = model.predict(X_test)

#metrics
r2_train = r2_score(y_train, y_train_pred)
print(f'R2 Score train: {r2_train}')
r2_test = r2_score(y_test, y_test_pred)
print(f'R2 Score test: {r2_test}')

#mse_train = mean_squared_error(y_train, y_train_pred)
#print(f'Mean Squared Error: {mse_train}')
#mse_test = mean_squared_error(y_test, y_test_pred)
#print(f'Mean Squared Error: {mse_test}')

comenzó?
se viene el fit
terminó fit
cv2 calculando
cv2 ok
predicciones
R2 Score train: 0.4664649570701829
R2 Score test: 0.05252226858639952
CPU times: user 8min 14s, sys: 508 ms, total: 8min 14s
Wall time: 8min 23s


In [34]:
%%time
print("comenzó?")
# Dataset
# hex2vec columns OSM
X = df_medians.drop(['beautiful', 'boring', 'depressing', 'lively', 'safe', 'wealthy'], axis=1)
y = df_medians[['beautiful']]  #y = df_means[['boring']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=25)

model = RandomForestRegressor(n_estimators=50, random_state=25)
print("se viene el fit")
model.fit(X_train, y_train.values.ravel())
print("cv2 calculando")
cv_scores_r2 = cross_val_score(model, X_train, y_train.values.ravel(), cv=5, scoring='r2')
#cv_scores_mse = -cross_val_score(model, X_train, y_train.values.ravel(), cv=5, scoring='neg_mean_squared_error')
print("R2 cv scores:\n", cv_scores_r2)
print("--------------------")
print("R2 cv score mean experiments:", cv_scores_r2.mean())

comenzó?
se viene el fit
cv2 calculando
R2 cv scores:
 [0.04477817 0.04551103 0.03126635 0.05244511 0.04491812]
--------------------
R2 cv score mean experiments: 0.043783754208872795
CPU times: user 8min 16s, sys: 503 ms, total: 8min 16s
Wall time: 8min 21s


In [2]:
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import fbeta_score, make_scorer

In [3]:
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

mape_scorer = make_scorer(mean_absolute_percentage_error, greater_is_better=False)

median beautiful

In [37]:
%%time
# Dataset
# hex2vec columns OSM
X = df_medians.drop(['beautiful', 'boring', 'depressing', 'lively', 'safe', 'wealthy'], axis=1)
y = df_medians[['beautiful']]  #y = df_means[['boring']]

model = RandomForestRegressor(n_estimators=100, random_state=25)

# Define the scoring metrics
scoring = {
    'mse': make_scorer(mean_squared_error),
    'r2': make_scorer(r2_score)
}

# Perform cross-validation
cv_results = cross_validate(model, X_train, y_train.values.ravel(), cv=5, scoring=scoring, return_train_score=True)

# Access training scores
train_mse = cv_results['train_mse']
train_r2 = cv_results['train_r2']

# Access testing scores
test_mse = cv_results['test_mse']
test_r2 = cv_results['test_r2']

# Print the results
print("Training MSE:", train_mse.mean(), "+/-", train_mse.std())
print("Training R2:", train_r2.mean(), "+/-", train_r2.std())
print("Testing MSE:", test_mse.mean(), "+/-", test_mse.std())
print("Testing R2:", test_r2.mean(), "+/-", test_r2.std())

Training MSE: 0.3350880487219855 +/- 0.00195201562517744
Training R2: 0.47693380538351315 +/- 0.003358247395846957
Testing MSE: 0.6094812959010738 +/- 0.008416453836889744
Testing R2: 0.04846288436441706 +/- 0.007615777294372
CPU times: user 13min 5s, sys: 800 ms, total: 13min 6s
Wall time: 13min 13s


median boring

In [5]:
%%time

# Dataset
# hex2vec columns OSM
X = df_medians.drop(['beautiful', 'boring', 'depressing', 'lively', 'safe', 'wealthy'], axis=1)
y = df_medians[['boring']]  #y = df_means[['boring']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=25)
model = RandomForestRegressor(n_estimators=100, random_state=25)

# Define the scoring metrics
scoring = {
    'mse': make_scorer(mean_squared_error),
    'r2': make_scorer(r2_score),
    'mape': mape_scorer
}

# Perform cross-validation
cv_results = cross_validate(model, X_train, y_train.values.ravel(), cv=5, scoring=scoring, return_train_score=True)

# Access training scores
train_mse = cv_results['train_mse']
train_r2 = cv_results['train_r2']
train_mape = cv_results['train_mape']

# Access testing scores
test_mse = cv_results['test_mse']
test_r2 = cv_results['test_r2']
test_mape = cv_results['test_mape']

# Dado que greater_is_better=False, los scores son negativos, así que multiplicamos por -1
#cv_results = -cv_results

# Print the results
print("Training MSE:", train_mse.mean())
print("Training R2:", train_r2.mean())
print("Training MAPE:", train_mape.mean())
print("--------------------")
print("Testing MSE:", test_mse.mean())
print("Testing R2:", test_r2.mean())
print("Testing MAPE:", test_mape.mean())


NameError: name 'df_medians' is not defined

#### Métricas

In [None]:
print("Training Results:")
for resultado in resultados_train:
    print(f"Variación {resultado['variacion']}: MSE = {resultado['mse_train']:.4f}, R2 = {resultado['r2_train']:.4f}, MAPE = {resultado['mape_train']:.2f}%, "
          f"CV MSE Mean = {resultado['mse_cv_mean']:.4f}, CV MSE Std = {resultado['mse_cv_std']:.4f}, "
          f"CV R2 Mean = {resultado['r2_cv_mean']:.4f}, CV R2 Std = {resultado['r2_cv_std']:.4f}")

print("\nTest Results:")
for resultado in resultados_test:
    print(f"Variación {resultado['variacion']}: MSE = {resultado['mse_test']:.4f}, R2 = {resultado['r2_test']:.4f}, MAPE = {resultado['mape_test']:.2f}%")

Training Results:
Variación 0: MSE = 0.3414, R2 = 0.4696, MAPE = 309.77%, CV MSE Mean = 0.6139, CV MSE Std = 0.0196, CV R2 Mean = 0.0460, CV R2 Std = 0.0139
Variación 1: MSE = 0.3396, R2 = 0.4728, MAPE = 385.19%, CV MSE Mean = 0.6104, CV MSE Std = 0.0179, CV R2 Mean = 0.0522, CV R2 Std = 0.0137
Variación 2: MSE = 0.3384, R2 = 0.4673, MAPE = 375.10%, CV MSE Mean = 0.6023, CV MSE Std = 0.0151, CV R2 Mean = 0.0516, CV R2 Std = 0.0090
Variación 3: MSE = 0.3415, R2 = 0.4700, MAPE = 330.74%, CV MSE Mean = 0.6126, CV MSE Std = 0.0038, CV R2 Mean = 0.0491, CV R2 Std = 0.0067

Test Results:
Variación 0: MSE = 0.5903, R2 = 0.0567, MAPE = 403.74%
Variación 1: MSE = 0.5951, R2 = 0.0454, MAPE = 184.23%
Variación 2: MSE = 0.6280, R2 = 0.0470, MAPE = 215.58%
Variación 3: MSE = 0.5953, R2 = 0.0443, MAPE = 348.46%


### 1.4 Random forest con df mediana, y=boring

In [None]:
# Dataset
# hex2vec columns OSM
X = df_medians.drop(['beautiful', 'boring', 'depressing', 'lively', 'safe', 'wealthy'], axis=1)
# Perceptual columns
y = df_medians[['boring']]

path_model = "/content/drive/MyDrive/UC-TESIS/models/random_forest/"
modelos = []
resultados_train = []
resultados_test = []

for i in range(3):
    # Create variations with different random states
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=i)

    model = RandomForestRegressor(n_estimators=100, random_state=25)

    # Cross-validation on training set
    cv_scores_mse = -cross_val_score(model, X_train, y_train, cv=3, scoring='neg_mean_squared_error')
    cv_scores_r2 = cross_val_score(model, X_train, y_train, cv=3, scoring='r2')

    # Fit the model on the full training set
    model.fit(X_train, y_train)
    modelos.append(model)

    # Guarda el modelo serializado
    nombre_archivo = f"modelo_mediana_boring_{i}.pkl"
    ruta_completa = os.path.join(path_model, nombre_archivo)
    with open(ruta_completa, "wb") as archivo:
        pickle.dump(model, archivo)

    # Predictions and metrics on training set
    y_train_pred = model.predict(X_train)
    mse_train = mean_squared_error(y_train, y_train_pred)
    r2_train = r2_score(y_train, y_train_pred)
    mape_train = np.mean(np.abs((y_train.values - y_train_pred) / y_train.values)) * 100

    # Predictions and metrics on test set
    y_test_pred = model.predict(X_test)
    mse_test = mean_squared_error(y_test, y_test_pred)
    r2_test = r2_score(y_test, y_test_pred)
    mape_test = np.mean(np.abs((y_test.values - y_test_pred) / y_test.values)) * 100

    # Store results
    resultados_train.append({
        'variacion': i,
        'mse_train': mse_train,
        'r2_train': r2_train,
        'mape_train': mape_train,
        'mse_cv_mean': cv_scores_mse.mean(),
        'mse_cv_std': cv_scores_mse.std(),
        'r2_cv_mean': cv_scores_r2.mean(),
        'r2_cv_std': cv_scores_r2.std()
    })
    resultados_test.append({
        'variacion': i,
        'mse_test': mse_test,
        'r2_test': r2_test,
        'mape_test': mape_test
    })

print("Training Results:")
for resultado in resultados_train:
    print(f"Variación {resultado['variacion']}: MSE = {resultado['mse_train']:.4f}, R2 = {resultado['r2_train']:.4f}, MAPE = {resultado['mape_train']:.2f}%, "
          f"CV MSE Mean = {resultado['mse_cv_mean']:.4f}, CV MSE Std = {resultado['mse_cv_std']:.4f}, "
          f"CV R2 Mean = {resultado['r2_cv_mean']:.4f}, CV R2 Std = {resultado['r2_cv_std']:.4f}")

print("\nTest Results:")
for resultado in resultados_test:
    print(f"Variación {resultado['variacion']}: MSE = {resultado['mse_test']:.4f}, R2 = {resultado['r2_test']:.4f}, MAPE = {resultado['mape_test']:.2f}%")

  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)


  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)
  return fit_method(estimator, *args, **kwargs)


KeyboardInterrupt: 

In [None]:
print("Training Results:")
for resultado in resultados_train:
    print(f"Variación {resultado['variacion']}: MSE = {resultado['mse_train']:.4f}, R2 = {resultado['r2_train']:.4f}, MAPE = {resultado['mape_train']:.2f}%, "
          f"CV MSE Mean = {resultado['mse_cv_mean']:.4f}, CV MSE Std = {resultado['mse_cv_std']:.4f}, "
          f"CV R2 Mean = {resultado['r2_cv_mean']:.4f}, CV R2 Std = {resultado['r2_cv_std']:.4f}")

print("\nTest Results:")
for resultado in resultados_test:
    print(f"Variación {resultado['variacion']}: MSE = {resultado['mse_test']:.4f}, R2 = {resultado['r2_test']:.4f}, MAPE = {resultado['mape_test']:.2f}%")

Training Results:
Variación 0: MSE = 0.2742, R2 = 0.5786, MAPE = 458.60%, CV MSE Mean = 0.5329, CV MSE Std = 0.0051, CV R2 Mean = 0.1809, CV R2 Std = 0.0082

Test Results:
Variación 0: MSE = 0.5318, R2 = 0.1902, MAPE = 310.84%


## Métricas

* **R²**  mide qué tan bien el modelo se ajusta a los datos observados, indica la
proporción de la variabilidad en la variable dependiente que es explicada por el modelo.
  - R² de 1 indica un ajuste perfecto, donde el modelo explica toda la variabilidad de los datos.
  - R² de 0 indica que el modelo no explica ninguna variabilidad y es tan bueno como simplemente predecir la media de la variable dependiente.
  - R² negativos indican que el modelo se ajusta peor a los datos que una línea horizontal (la media).

* **MAPE** promedio de los porcentajes de error absoluto entre las predicciones y los valores reales. MAPE más bajo indica un mejor ajuste del modelo a los datos. Ej.MAPE del 10% significa que, en promedio, las predicciones del modelo se desvían un 10% de los valores reales.

`MAPE = (1/n) * Σ(|(Valor Real - Valor Predicho)| / |Valor Real|) * 100%`

    n es el número de observaciones.

    Σ indica la suma de los errores porcentuales absolutos.

* **MSE** mide la precisión de las predicciones especialmente cuando se desea penalizar más los errores grandes. Se calcula como el promedio de los cuadrados de las diferencias entre las predicciones y los valores reales.

`MSE = (1/n) * Σ(Valor Real - Valor Predicho)²`

    n es el número de observaciones.

    Σ indica la suma de los cuadrados de las diferencias





---



---



In [None]:
feature_importances = rf_model_df_means.feature_importances_
feature_importances

array([0.00689128, 0.00754062, 0.00737295, 0.00908062, 0.00742508,
       0.00715806, 0.00734839, 0.01317229, 0.00813272, 0.00882679,
       0.00775493, 0.00773393, 0.00723615, 0.00778604, 0.00840531,
       0.01303346, 0.02312417, 0.01201406, 0.00830343, 0.00786207,
       0.00850616, 0.00671786, 0.00723662, 0.0090734 , 0.00739984,
       0.00757581, 0.00661289, 0.00745428, 0.00673229, 0.00776539,
       0.00757438, 0.00785485, 0.00770826, 0.0085578 , 0.00698019,
       0.06118481, 0.00877937, 0.00865492, 0.01042748, 0.0082898 ,
       0.00683199, 0.02159178, 0.00797577, 0.00984722, 0.00859003,
       0.00806503, 0.00675782, 0.01215144, 0.0084055 , 0.00698846,
       0.0077827 , 0.00832266, 0.00757436, 0.00903284, 0.00773505,
       0.00740309, 0.00835756, 0.01418597, 0.00672674, 0.00792229,
       0.00859331, 0.00758543, 0.00729911, 0.00736496, 0.01525674,
       0.00778246, 0.01826636, 0.00732112, 0.00732738, 0.00878985,
       0.0077593 , 0.01216339, 0.01081794, 0.00797387, 0.00710