# Crime Incidence Index in Mexico
This notebook compares three methods for obtaining the crime incidence index in Mexico, defined as the first principal component of the decomposition.

The methods reviewed are:

- PCA with scikit-learn
- PCA and SVD with NumPy
- PCA with the QR algorithm

Finally, the index is computed in a loop for every year from 2015 through October 2021 for each state (entidad federativa), and the result is stored as the input for the [dashboard](https://datastudio.google.com/reporting/e4ffda99-e143-4e69-9454-391ea1796dc6).

## Libraries used

In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
np.set_printoptions(formatter={'float_kind':'{:f}'.format})

## Loading the data

In [2]:
df = pd.read_csv('../data/data_clean.csv')
df.head()

Unnamed: 0,fecha,estado,abuso_de_confianza,abuso_sexual,acoso_sexual,allanamiento_de_morada,amenazas,corrupcion_de_menores,dano_a_la_propiedad,delitos_cometidos_por_servidores_publicos,...,lesiones,narcomenudeo,rapto,robo,secuestro,trata_de_personas,violacion_equiparada,violacion_simple,violencia_de_genero_en_todas_sus_modalidades_distinta_a_la_violencia_familiar,violencia_familiar
0,2015,Aguascalientes,0.995481,-0.891228,-0.702326,0.037878,0.398897,1.234028,0.416137,-0.753177,...,1.582294,0.567183,-0.353202,1.092806,-0.420769,-0.350597,0.639921,-0.539843,-0.191875,-0.739861
1,2015,Baja California,0.684287,2.82764,-0.702326,5.015057,1.083865,3.714389,2.426546,0.649223,...,1.674431,-1.061603,-0.303759,1.501483,-0.459697,1.137236,4.133267,1.505248,-0.288525,2.435918
2,2015,Baja California Sur,2.323666,0.310906,0.966655,0.289432,1.21446,-0.471516,1.775697,1.565241,...,1.038162,1.90282,-0.353202,2.64555,-0.72633,-0.70857,-0.468689,1.146517,-0.288525,2.020647
3,2015,Campeche,-1.363071,-0.929666,-0.702326,-0.405687,-1.071007,-0.668099,-1.045141,-0.528636,...,-1.637003,-0.838101,-0.353202,-1.302167,-0.193686,-0.448568,0.947719,-0.753702,-0.288525,-1.21411
4,2015,Coahuila,-0.048868,0.045424,1.242453,-0.100263,0.110921,0.006621,2.184097,-0.116755,...,0.724564,1.683024,-0.131886,-0.142698,-0.048199,-0.073012,-0.059276,-1.136537,-0.00968,-0.24453


The comparison uses 2015 as the base year.

In [3]:
X = df[df.fecha == 2015].drop(columns = ['fecha', 'estado'])
X = X.to_numpy() 

## PCA with scikit-learn


In [4]:
pca = PCA(n_components=28, svd_solver='full').fit(X)
z_skl = pca.transform(X)

In [5]:
pca.singular_values_

array([16.834452, 8.665687, 8.336795, 8.119100, 7.133143, 6.934652,
       6.537491, 6.088385, 5.823219, 5.396144, 4.808120, 4.367100,
       4.076169, 3.662393, 3.503298, 3.121766, 3.050389, 2.719860,
       2.342541, 2.146414, 2.061904, 1.641478, 1.378691, 1.130062,
       0.789160, 0.573871, 0.345105, 0.223925])
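As a rough gauge of how much of the total variation the first component carries, the squared singular values can be normalized by their sum. The sketch below copies the singular values printed above into an array; the names `singular_values` and `explained_share` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Singular values copied from the output above (base year 2015).
singular_values = np.array([
    16.834452, 8.665687, 8.336795, 8.119100, 7.133143, 6.934652,
    6.537491, 6.088385, 5.823219, 5.396144, 4.808120, 4.367100,
    4.076169, 3.662393, 3.503298, 3.121766, 3.050389, 2.719860,
    2.342541, 2.146414, 2.061904, 1.641478, 1.378691, 1.130062,
    0.789160, 0.573871, 0.345105, 0.223925,
])

# Share of the total sum of squares carried by each component: s_i^2 / sum_j s_j^2.
explained_share = singular_values**2 / np.sum(singular_values**2)

print(f"Share of the first component: {explained_share[0]:.3f}")
```

The first entry of `explained_share` is the fraction of the total sum of squares captured by the index.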

### Principal components

In [6]:
pd.DataFrame(z_skl).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,20,21,22,23,24,25,26,27
0,1.491941,0.654227,0.384337,-2.022406,2.068326,-0.877028,2.188809,-0.286691,-0.281603,1.320772,...,0.034773,0.543135,0.34525,0.394005,0.221725,-0.040664,0.074593,-0.046976,0.030021,0.006204
1,7.890937,3.848363,-3.597337,0.637463,1.417333,0.686643,-0.926374,-1.884681,-0.229611,-1.210829,...,0.299561,-0.174202,-0.334035,-0.033259,-0.07044,-0.133497,-0.013712,-0.025579,-0.002324,0.001007
2,4.707203,-2.544096,1.204539,-1.235695,-1.151094,1.044208,-0.425222,0.349565,-0.947265,-0.270612,...,-0.21336,0.095536,-0.10074,0.282013,0.073744,-0.06512,0.069149,-0.156954,-0.048001,-0.006949
3,-4.229105,1.246965,-0.668594,0.185795,0.53113,0.247936,-0.690387,-0.102497,0.470515,-0.562722,...,-0.773437,0.163297,-0.078126,-0.15007,0.2002,0.474142,0.43509,-0.058061,0.084423,-0.019317
4,-0.294966,-0.405094,0.196993,-2.112403,-0.622076,-0.319013,-1.08254,-0.875185,-1.303967,-0.940432,...,0.270181,-0.402897,0.368281,-0.374959,0.496517,-0.165297,-0.06894,0.015884,0.062763,-0.01322


Singular values

## PCA from SVD with NumPy

### SVD with NumPy

In [7]:
U, S, Vt = np.linalg.svd(X, full_matrices= False, compute_uv= True)

In [8]:
S

array([16.834452, 8.665687, 8.336795, 8.119100, 7.133143, 6.934652,
       6.537491, 6.088385, 5.823219, 5.396144, 4.808120, 4.367100,
       4.076169, 3.662393, 3.503298, 3.121766, 3.050389, 2.719860,
       2.342541, 2.146414, 2.061904, 1.641478, 1.378691, 1.130062,
       0.789160, 0.573871, 0.345105, 0.223925])

### PCA with NumPy

In [9]:
z_np = S*U

In [10]:
pd.DataFrame(z_np).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,20,21,22,23,24,25,26,27
0,-1.491941,0.654227,0.384337,-2.022406,2.068326,0.877028,-2.188809,-0.286691,0.281603,1.320772,...,-0.034773,-0.543135,0.34525,0.394005,-0.221725,0.040664,-0.074593,-0.046976,-0.030021,0.006204
1,-7.890937,3.848363,-3.597337,0.637463,1.417333,-0.686643,0.926374,-1.884681,0.229611,-1.210829,...,-0.299561,0.174202,-0.334035,-0.033259,0.07044,0.133497,0.013712,-0.025579,0.002324,0.001007
2,-4.707203,-2.544096,1.204539,-1.235695,-1.151094,-1.044208,0.425222,0.349565,0.947265,-0.270612,...,0.21336,-0.095536,-0.10074,0.282013,-0.073744,0.06512,-0.069149,-0.156954,0.048001,-0.006949
3,4.229105,1.246965,-0.668594,0.185795,0.53113,-0.247936,0.690387,-0.102497,-0.470515,-0.562722,...,0.773437,-0.163297,-0.078126,-0.15007,-0.2002,-0.474142,-0.43509,-0.058061,-0.084423,-0.019317
4,0.294966,-0.405094,0.196993,-2.112403,-0.622076,0.319013,1.08254,-0.875185,1.303967,-0.940432,...,-0.270181,0.402897,0.368281,-0.374959,-0.496517,0.165297,0.06894,0.015884,-0.062763,-0.01322
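The product `S*U` yields the component scores because, from $X = U S V^T$, projecting the data onto the right singular vectors gives $X V = U S$. The minimal sketch below checks this identity on synthetic data. The matrix `X_demo` and the random seed are assumptions for illustration, not the crime data, and the columns are centered explicitly, as scikit-learn's PCA does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(32, 5))
X_demo = X_demo - X_demo.mean(axis=0)    # center each column

U, S, Vt = np.linalg.svd(X_demo, full_matrices=False)

scores_from_u = S * U             # broadcasting scales column j of U by S[j]
scores_from_v = X_demo @ Vt.T     # projection onto the principal axes

print(np.allclose(scores_from_u, scores_from_v))
```

Both expressions describe the same scores, so either can serve as the matrix of principal components.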


## PCA from SVD with the QR algorithm

### Functions defined by the team

In [11]:
def givens_rotation(A):
    """QR factorization of A with Givens rotations; returns (Q, R) with A = Q @ R."""
    (n_rows, n_cols) = np.shape(A)
    Q = np.identity(n_rows)
    R = np.copy(A)
    # Indices of the entries below the diagonal, which must be zeroed
    (rows, cols) = np.tril_indices(n_rows, -1, n_cols)
    for (row, col) in zip(rows, cols):
        if R[row, col] != 0:  # otherwise c = 1, s = 0 and R, Q stay unchanged
            r_ = np.hypot(R[col, col], R[row, col])
            c = R[col, col]/r_
            s = -R[row, col]/r_
            G = np.identity(n_rows)
            G[[col, row], [col, row]] = c
            G[row, col] = s
            G[col, row] = -s
            R = np.dot(G, R)  # accumulates R = G_k ... G_1 A
            Q = np.dot(Q, G.T)  # accumulates Q = G_1.T ... G_k.T
    return (Q, R)
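To see what a single rotation inside `givens_rotation` does, the following sketch applies the same choice of `c` and `s` to a two-element vector. The values `a = 3` and `b = 4` are arbitrary and chosen only for illustration.

```python
import numpy as np

a, b = 3.0, 4.0            # pivot entry and the entry to eliminate
r = np.hypot(a, b)         # sqrt(a**2 + b**2), computed without overflow
c, s = a / r, -b / r       # same sign convention as givens_rotation

G = np.array([[c, -s],
              [s,  c]])

rotated = G @ np.array([a, b])
print(rotated)             # second entry is zero, first entry equals r
```

Because `G` is orthogonal, the rotation preserves the norm of the vector while moving all of it into the first entry.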

In [12]:
# QR algorithm for symmetric matrices (simple version, without shifts)
def algoritmoQR(matrix):
    # T_0
    T_k_minus_1 = matrix
    q, r = givens_rotation(matrix)
    m, n = q.shape
    Q = np.eye(m, n)
    T_k = r@q
    # Q accumulates the q factors and approximates the eigenvectors
    Q = Q@q
    n_iter = 100  # number of iterations
    while n_iter > 0:
        T_k_minus_1 = T_k
        q, r = givens_rotation(T_k_minus_1)
        Q = Q@q
        # The diagonal of T_k approximates the eigenvalues
        T_k = r@q
        n_iter = n_iter - 1

    return np.diag(T_k), Q
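The iteration above can be sanity-checked on a small symmetric matrix whose eigenvalues are known. The sketch below is not the team's implementation: it swaps `givens_rotation` for `np.linalg.qr` to stay self-contained, and the helper name `qr_iteration`, the matrix `A_demo`, and the iteration count are assumptions.

```python
import numpy as np

def qr_iteration(A, n_iter=200):
    """Unshifted QR iteration; returns (approximate eigenvalues, eigenvectors)."""
    T = np.array(A, dtype=float)
    Q_total = np.eye(T.shape[0])
    for _ in range(n_iter):
        Q, R = np.linalg.qr(T)
        T = R @ Q                  # R Q = Q.T T Q, so the eigenvalues are preserved
        Q_total = Q_total @ Q      # accumulated orthogonal factor
    return np.diag(T), Q_total

A_demo = np.array([[4.0, 1.0, 0.0],
                   [1.0, 3.0, 1.0],
                   [0.0, 1.0, 2.0]])

eigvals, eigvecs = qr_iteration(A_demo)
reference = np.linalg.eigvalsh(A_demo)

print(np.allclose(np.sort(eigvals), reference))
```

Each step replaces `T` with an orthogonally similar matrix, which is why the diagonal can converge to the eigenvalues while the accumulated factor approximates the eigenvectors.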

### SVD with QR

In [13]:
# The eigenvectors of X X^T are the left singular vectors U of X
eigenvalores_u, eigenvectores_u = algoritmoQR(X@X.T)

In [14]:
U_qr = pd.DataFrame(eigenvectores_u).iloc[:,0:28]

In [15]:
D_qr = np.sqrt(eigenvalores_u[eigenvalores_u > 0.000001])
D_qr

array([16.834452, 8.665676, 8.336785, 8.119124, 7.133137, 6.934658,
       6.537491, 6.088385, 5.823219, 5.396144, 4.808120, 4.367100,
       4.076169, 3.662393, 3.503298, 3.121766, 3.050389, 2.719860,
       2.342541, 2.146414, 2.061904, 1.641478, 1.378691, 1.130062,
       0.789160, 0.573871, 0.345105, 0.223925])
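Taking the square root works because $X X^T = U S^2 U^T$, so the nonzero eigenvalues of $X X^T$ are the squared singular values of $X$. The sketch below checks this on synthetic data, with `np.linalg.eigvalsh` standing in for `algoritmoQR`; the matrix `X_demo` and its shape are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(6, 4))

gram = X_demo @ X_demo.T                      # 6 x 6 matrix of rank at most 4
eigvals = np.linalg.eigvalsh(gram)[::-1]      # reorder to descending
# Keep the 4 leading eigenvalues and clip tiny negative round-off before the root.
sv_from_eig = np.sqrt(np.clip(eigvals[:4], 0.0, None))

sv_reference = np.linalg.svd(X_demo, compute_uv=False)

print(np.allclose(sv_from_eig, sv_reference))
```

Selecting the leading eigenvalues before taking the root avoids applying `np.sqrt` to trailing values that are zero up to round-off and may be slightly negative.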

### PCA with QR

In [16]:
z_qr = D_qr*U_qr
pd.DataFrame(z_qr).head() 

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,20,21,22,23,24,25,26,27
0,1.491941,0.651656,-0.409404,2.018471,-2.063292,0.888181,-2.188802,-0.286798,0.281501,1.320772,...,-0.034773,-0.543113,-0.345282,-0.394005,0.221725,0.040664,0.074593,0.046976,-0.030021,-0.006204
1,7.890937,3.871065,3.581268,-0.601561,-1.421225,-0.679002,0.926367,-1.884769,0.22895,-1.210829,...,-0.299561,0.174181,0.334046,0.033259,-0.07044,0.133497,-0.013712,0.025579,0.002324,-0.001007
2,4.707203,-2.551738,-1.202563,1.223625,1.145121,-1.0504,0.425215,0.349201,0.947388,-0.270612,...,0.21336,-0.095543,0.100735,-0.282013,0.073744,0.06512,0.069149,0.156954,0.048001,0.006949
3,-4.229105,1.251178,0.663205,-0.17912,-0.532536,-0.245076,0.690385,-0.102317,-0.47055,-0.562723,...,0.773437,-0.163302,0.078117,0.15007,0.2002,-0.474142,0.43509,0.058061,-0.084423,0.019317
4,-0.294966,-0.406466,-0.216823,2.110334,0.623885,0.315646,1.082542,-0.875686,1.30366,-0.940431,...,-0.270181,0.40292,-0.368257,0.374959,0.496517,0.165297,-0.06894,-0.015884,-0.062763,0.01322


## Comparison of results

We can compute the elementwise relative error between the scikit-learn and NumPy scores, and then average it per column:

$$\frac{|Z_{sk}-Z_{np}|}{|Z_{sk}|}$$

In [17]:
rel_error_sk_np = pd.DataFrame(np.abs(z_skl-z_np)/np.abs(z_skl))

np.round(rel_error_sk_np.mean(),3)

0     2.0
1     0.0
2     0.0
3     0.0
4     0.0
5     2.0
6     2.0
7     0.0
8     2.0
9     0.0
10    0.0
11    2.0
12    0.0
13    0.0
14    0.0
15    2.0
16    0.0
17    0.0
18    2.0
19    2.0
20    0.0
21    0.0
22    2.0
23    2.0
24    2.0
25    0.0
26    2.0
27    0.0
dtype: float64

Only two values appear: $2.0$ and $0$.

A column whose mean relative error is $2.0$ holds the same values with opposite signs, since $|z-(-z)|/|z| = 2$, while a value of $0$ means that both methods returned the same magnitude and sign. The sign flips are not a discrepancy: singular vectors are defined only up to sign, so both results are valid decompositions.
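The effect of a sign flip on this metric can be reproduced on synthetic data. In the sketch below, `z_ref` and `z_other` are hypothetical score matrices that differ only in the sign of the middle column.

```python
import numpy as np

rng = np.random.default_rng(2)
z_ref = rng.normal(size=(5, 3))

signs = np.array([1.0, -1.0, 1.0])     # flip only the middle column
z_other = z_ref * signs

rel_error = np.abs(z_ref - z_other) / np.abs(z_ref)
print(rel_error.mean(axis=0))          # 0 for matching columns, 2 for the flipped one
```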

Following the same logic, we now compare the matrix obtained with scikit-learn against the one obtained with the QR algorithm, again through the relative error:

$$\frac{|Z_{sk}-Z_{qr}|}{|Z_{sk}|}$$

In [18]:
rel_error_sk_qr = pd.DataFrame(np.abs(z_skl-z_qr)/np.abs(z_skl))

np.round(rel_error_sk_qr.mean(),5)

0     0.00000
1     0.00895
2     1.99137
3     1.96087
4     2.00871
5     2.02602
6     2.00000
7     0.00102
8     2.00010
9     0.00000
10    2.00000
11    0.00000
12    2.00000
13    2.00030
14    0.00207
15    0.00814
16    0.00756
17    0.00000
18    2.00000
19    2.00007
20    2.00000
21    2.00000
22    0.00000
23    2.00000
24    0.00000
25    2.00000
26    2.00000
27    2.00000
dtype: float64

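Here the values are close to $2.0$ and $0.0$ but, unlike in the previous comparison, not exactly equal to them. For some entries of the matrix there are therefore small approximation differences in addition to the sign differences. The largest deviations fall in columns whose singular values are closely spaced (for example columns 1 to 5), which is consistent with the unshifted QR iteration converging slowly for nearly equal eigenvalues within its fixed 100 iterations.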
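To separate the sign differences from the approximation error, the columns can be aligned in sign before comparing. The sketch below uses synthetic data; the helper `align_signs` and the simulated perturbation of size `1e-4` are assumptions and not part of the notebook.

```python
import numpy as np

def align_signs(Z, Z_ref):
    """Flip each column of Z so that it correlates positively with Z_ref."""
    signs = np.sign(np.sum(Z * Z_ref, axis=0))
    signs[signs == 0] = 1.0            # leave orthogonal columns untouched
    return Z * signs

rng = np.random.default_rng(3)
Z_ref = rng.normal(size=(32, 4))
# Simulate a second method: flipped signs plus a small approximation error.
flips = np.array([1.0, -1.0, -1.0, 1.0])
Z_other = Z_ref * flips + 1e-4 * rng.normal(size=(32, 4))

Z_aligned = align_signs(Z_other, Z_ref)
max_abs_diff = np.max(np.abs(Z_aligned - Z_ref))
print(max_abs_diff)
```

After alignment, only the approximation error remains, and an absolute difference is less sensitive to entries near zero than the relative error.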

# Computing the index for all years

### With QR

In [19]:
ind_qr = pd.DataFrame(index = df[df.fecha==2015]['estado'])
for i in df.fecha.unique():
    print('Selecting year: '+str(i))
    mat = df[df.fecha==i].drop(columns=['fecha','estado'])
    print('Applying the QR algorithm to the standardized data')
    mat2 = mat.to_numpy()
    print('Computing U')
    eigenvalores_u, eigenvectores_u = algoritmoQR(mat2@mat2.T) # U
    U = pd.DataFrame(eigenvectores_u).iloc[:,0:28]
    print('Computing D')
    # The last 4 eigenvalues are numerically zero and are dropped;
    # a slightly negative one may make np.sqrt emit a warning
    D = np.sqrt(eigenvalores_u)[:-4]
    print("Computing Z'")
    Z = D*U
    print('Taking the first column of Z as the index')
    print('Storing')
    ind_qr[i] = -Z.iloc[:,0].values # sign adjustment of the first component
    print('***')

Selecting year: 2015
Applying the QR algorithm to the standardized data
Computing U
Computing D
Computing Z'
Taking the first column of Z as the index
Storing
***
Selecting year: 2016
Applying the QR algorithm to the standardized data
Computing U


  D = np.sqrt(eigenvalores_u)[:-4]


Computing D
Computing Z'
Taking the first column of Z as the index
Storing
***
Selecting year: 2017
Applying the QR algorithm to the standardized data
Computing U
Computing D
Computing Z'
Taking the first column of Z as the index
Storing
***
Selecting year: 2018
Applying the QR algorithm to the standardized data
Computing U
Computing D
Computing Z'
Taking the first column of Z as the index
Storing
***
Selecting year: 2019
Applying the QR algorithm to the standardized data
Computing U
Computing D
Computing Z'
Taking the first column of Z as the index
Storing
***
Selecting year: 2020
Applying the QR algorithm to the standardized data
Computing U
Computing D
Computing Z'
Taking the first column of Z as the index
Storing
***
Selecting year: 2021
Applying the QR algorithm to the standardized data
Computing U
Computing D
Computing Z'
Taking the first column of Z as the index
Storing
***

### With NumPy

In [20]:
ind_np = pd.DataFrame(index = df[df.fecha==2015]['estado'])
for i in df.fecha.unique():
    print('Selecting year: '+str(i))
    X = df[df.fecha==i].drop(columns=['fecha','estado'])
    X = X.to_numpy()
    print('Applying SVD with NumPy to the standardized data')
    U, S, Vt = np.linalg.svd(X, full_matrices= False, compute_uv= True)
    Z = S*U
    print('Taking the first column of Z as the index')
    print('Storing')
    ind_np[i] = pd.DataFrame(Z).iloc[:,0].values
    print('***')

Selecting year: 2015
Applying SVD with NumPy to the standardized data
Taking the first column of Z as the index
Storing
***
Selecting year: 2016
Applying SVD with NumPy to the standardized data
Taking the first column of Z as the index
Storing
***
Selecting year: 2017
Applying SVD with NumPy to the standardized data
Taking the first column of Z as the index
Storing
***
Selecting year: 2018
Applying SVD with NumPy to the standardized data
Taking the first column of Z as the index
Storing
***
Selecting year: 2019
Applying SVD with NumPy to the standardized data
Taking the first column of Z as the index
Storing
***
Selecting year: 2020
Applying SVD with NumPy to the standardized data
Taking the first column of Z as the index
Storing
***
Selecting year: 2021
Applying SVD with NumPy to the standardized data
Taking the first column of Z as the index
Storing
***
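Both loops depend on the sign that each decomposition happens to return, and the QR loop corrects it by hand with `-Z`. One possible alternative, sketched below on synthetic data, is to fix the sign with an explicit rule. The helper `first_component`, the rule itself (largest-magnitude loading made positive), and the dictionary `data_by_year` are assumptions for illustration and are not what the notebook does.

```python
import numpy as np

def first_component(X):
    """First principal-component scores of X via SVD, with a fixed sign."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    z = S[0] * U[:, 0]
    # Hypothetical convention: make the largest-magnitude loading positive.
    pivot = np.argmax(np.abs(Vt[0]))
    return z if Vt[0, pivot] > 0 else -z

rng = np.random.default_rng(4)
years = [2015, 2016, 2017]
# Synthetic stand-in for the standardized yearly matrices (8 states x 5 crimes).
data_by_year = {}
for year in years:
    X = rng.normal(size=(8, 5))
    data_by_year[year] = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

index_by_year = {year: first_component(X) for year, X in data_by_year.items()}
print({year: z.shape for year, z in index_by_year.items()})
```

With a rule of this kind, any two decompositions of the same matrix return the same signed scores, so no per-method sign adjustment is needed.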


### Comparison of indices

The index values for every year agree between the two methods, once the sign of the first component from the QR loop is flipped (the `-Z` in the loop above).

In [22]:
#QR
ind_qr

estado,2015,2016,2017,2018,2019,2020,2021
Aguascalientes,-1.491941,-0.321054,-2.66679,-3.897567,-3.960713,-3.51637,-3.775146
Baja California,-7.890937,-6.577347,-5.603274,-5.402784,-4.4524,-5.159402,-4.337488
Baja California Sur,-4.707203,-8.425234,-6.979556,-6.00792,-5.4205,-4.583249,-3.676376
Campeche,4.229105,3.924112,4.207,4.470252,4.935832,4.836567,4.89459
Coahuila,0.294966,-0.72697,0.17848,-0.202751,0.366897,0.461153,0.019274
Colima,3.084667,2.917567,-7.078736,-6.050488,-6.978488,-7.130548,-6.871069
Chiapas,2.988473,3.141346,3.127941,3.175865,4.19789,4.363549,4.517787
Chihuahua,-3.125126,-2.96391,-3.628012,-3.400333,-2.894715,-3.021142,-3.073501
Ciudad de México,-2.532294,-3.683305,-2.184399,-4.399916,-5.298683,-3.744166,-3.964892
Durango,-1.264475,-2.052157,-0.954684,-0.385983,0.084854,0.897764,0.344936


In [23]:
#Numpy
ind_np

estado,2015,2016,2017,2018,2019,2020,2021
Aguascalientes,-1.491941,-0.321054,-2.66679,-3.897567,-3.960713,-3.51637,-3.775146
Baja California,-7.890937,-6.577347,-5.603274,-5.402784,-4.4524,-5.159402,-4.337488
Baja California Sur,-4.707203,-8.425234,-6.979556,-6.00792,-5.4205,-4.583249,-3.676376
Campeche,4.229105,3.924112,4.207,4.470252,4.935832,4.836567,4.89459
Coahuila,0.294966,-0.72697,0.17848,-0.202751,0.366897,0.461153,0.019274
Colima,3.084667,2.917567,-7.078736,-6.050488,-6.978488,-7.130548,-6.871069
Chiapas,2.988473,3.141346,3.127941,3.175865,4.19789,4.363549,4.517787
Chihuahua,-3.125126,-2.96391,-3.628012,-3.400333,-2.894715,-3.021142,-3.073501
Ciudad de México,-2.532294,-3.683305,-2.184399,-4.399916,-5.298683,-3.744166,-3.964892
Durango,-1.264475,-2.052157,-0.954684,-0.385983,0.084854,0.897764,0.344936
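Rather than comparing the two tables by eye, the agreement can be checked numerically. The sketch below copies the first two rows of each table above into arrays with illustrative names; in the notebook the same check could be run on `ind_qr.to_numpy()` and `ind_np.to_numpy()`, and the tolerance `1e-6` is an assumption matching the printed precision.

```python
import numpy as np

# Rows for Aguascalientes and Baja California, copied from the QR table.
ind_qr_sample = np.array([
    [-1.491941, -0.321054, -2.666790, -3.897567, -3.960713, -3.516370, -3.775146],
    [-7.890937, -6.577347, -5.603274, -5.402784, -4.452400, -5.159402, -4.337488],
])
# The same rows, copied from the NumPy table.
ind_np_sample = np.array([
    [-1.491941, -0.321054, -2.666790, -3.897567, -3.960713, -3.516370, -3.775146],
    [-7.890937, -6.577347, -5.603274, -5.402784, -4.452400, -5.159402, -4.337488],
])

max_abs_diff = np.max(np.abs(ind_qr_sample - ind_np_sample))
indices_agree = np.allclose(ind_qr_sample, ind_np_sample, atol=1e-6)
print(indices_agree, max_abs_diff)
```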
