## City suggestion using the Crime Index of the State of Rio Grande do Sul
### For the final project of the Recommender Systems course, two datasets were chosen: Índice de Criminalidade (Crime Index, RS, 2023) and Estimativas Populacionais (Population Estimates, RS, 2022). The data are real and available on the Dados Abertos RS portal of the State of Rio Grande do Sul.

Crime Index in RS: [https://dados.rs.gov.br/dataset/indicadores-criminais-de-2023](https://dados.rs.gov.br/dataset/indicadores-criminais-de-2023)

Population Estimates in RS: [https://dados.rs.gov.br/dataset/dee-4259/resource/ce259dd9-c479-4a18-90b3-40098e6deb26](https://dados.rs.gov.br/dataset/dee-4259/resource/ce259dd9-c479-4a18-90b3-40098e6deb26)

### Dataset pre-processing

Some transformations were needed before the data could be used:
- The crime figures per municipality were provided as separate files per month, and the column names were excessively long.

- The yearly population data contained invalid entries for 2012, and its column names were also excessively long.

- A foreign key (the `ibge` municipality code) was created to relate the two tables.
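The column-name cleanup mentioned in the list above could be sketched as follows. This is a minimal illustration, not the procedure actually used for these files: `normalize_column` is a hypothetical helper, and the example labels are invented, since the original long column names are not shown here.

```python
import re
import unicodedata


def normalize_column(name: str) -> str:
    """Hypothetical helper: strip accents, lowercase, and snake_case a column label."""
    # Decompose accented characters and drop the combining marks
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single underscore
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


# Invented example labels, only to show the transformation
print(normalize_column("Furto de Veículo"))
print(normalize_column("  Municípios "))
```

With pandas, such a helper could be applied to every label at once through `df.rename(columns=normalize_column)`.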

In [525]:
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go

#### Loading the per-municipality crime files for each month.

In [526]:
# Month names, matching the CSV file names
meses = ['janeiro', 'fevereiro', 'marco', 'abril', 'maio', 'junho', 'julho', 'agosto', 'setembro', 'outubro', 'novembro', 'dezembro']

dfs = []

# Read each month's file and append it to the list of dataframes
for mes in meses:
    arquivo = pd.read_csv(f'data\\tocsv\\{mes}.csv', delimiter=';')
    dfs.append(arquivo)

# Concatenate all monthly dataframes, discarding the original indexes
df = pd.concat(dfs, ignore_index=True)

# Group by municipality and IBGE code, summing the monthly counts into yearly totals
df_criminal = df.groupby(['municipios', 'ibge']).sum().reset_index()

# Save the concatenated monthly dataframe to a CSV file
# (note: this writes the unaggregated df, not df_criminal)
df.to_csv('data\\tocsv\\total.csv', sep=';', index=False)

df_criminal


Unnamed: 0,municipios,ibge,homicidio_doloso,total_vitimas_homicidio_doloso,latrocinio,furtos,abigeato,furto_veiculo,roubos,roubo_veiculo,estelionato,delitos_armas_municoes,entorpecente_posse,entorpecente_trafico,vitimas_latrocinio,vitimas_lesao_corporal_morte,total_vitimas_crimes_violentos
0,acegua,4300034,0,0,0,20,8,0,3,0,27,4,3,0,0,0,0
1,agua santa,4300059,1,1,0,20,3,4,1,0,13,2,0,0,0,0,2
2,agudo,4300109,1,1,0,102,2,2,3,1,51,3,7,10,0,0,1
3,ajuricaba,4300208,0,0,0,26,1,0,1,0,28,2,3,0,0,0,0
4,alecrim,4300307,1,1,0,36,10,0,2,0,17,10,0,4,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
492,vista alegre do prata,4323606,0,0,0,3,0,0,0,0,8,3,0,0,0,0,0
493,vista gaucha,4323705,1,1,0,11,0,0,0,0,6,2,0,0,0,0,1
494,vitoria das missoes,4323754,0,0,0,20,3,0,0,0,7,1,4,0,0,0,0
495,westfalia,4323770,0,0,0,10,2,1,0,0,20,1,1,0,0,0,0
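Because the yearly totals are sums over monthly files, a municipality missing from one month would silently get a lower total. The sketch below shows one way to check for that before aggregating. It uses made-up toy values, and the `mes` column is an assumption added for readability (the original files are not known to contain it).

```python
import pandas as pd

# Toy monthly data (made-up values): municipality 4300109 has no February row
monthly = pd.DataFrame({
    "ibge": [4300034, 4300109, 4300034],
    "mes": ["janeiro", "janeiro", "fevereiro"],
    "furtos": [2, 10, 1],
})

expected_months = 2  # would be 12 for the full year

# Number of monthly rows found for each municipality
rows_per_city = monthly.groupby("ibge").size()

# Municipalities whose row count differs from the expected number of months
incomplete = rows_per_city[rows_per_city != expected_months]
print(incomplete.index.tolist())
```

In the notebook, the same check could be run on `df` right after `pd.concat`, with `expected_months = len(meses)`.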


#### Loading the population-per-municipality file for each year.

In [527]:
# Read the population file
df_pop = pd.read_csv('data\\tocsv\\populacao.csv', delimiter=';')

# df_pop.info() shows that the columns from index 4 onward have dtype object;
# they must be converted to integers before any arithmetic can be done

# Remove the '.' characters (presumably thousands separators), replace '-' (no value) with 0,
# and convert to int. regex=False makes '.' a literal dot rather than "any character".
for coluna in df_pop.columns[4:]:
    df_pop[coluna] = df_pop[coluna].str.replace('.', '', regex=False).str.replace('-', '0', regex=False).astype(int)

df_pop.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497 entries, 0 to 496
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   municipios  497 non-null    object 
 1   ibge        497 non-null    int64  
 2   latitude    497 non-null    float64
 3   longitude   497 non-null    float64
 4   2010        497 non-null    int32  
 5   2011        497 non-null    int32  
 6   2012        497 non-null    int32  
 7   2013        497 non-null    int32  
 8   2014        497 non-null    int32  
 9   2015        497 non-null    int32  
 10  2016        497 non-null    int32  
 11  2017        497 non-null    int32  
 12  2018        497 non-null    int32  
 13  2019        497 non-null    int32  
 14  2020        497 non-null    int32  
 15  2021        497 non-null    int32  
dtypes: float64(2), int32(12), int64(1), object(1)
memory usage: 39.0+ KB
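The string clean-up can be checked on a tiny series. This sketch assumes the raw values look like `"4.505"` (a dot as thousands separator) and `"-"` (no value); the actual raw file contents are not shown here, so these inputs are illustrative only. It also shows an alternative that keeps the invalid entries as missing values instead of turning them into a population of 0.

```python
import pandas as pd

# Illustrative raw values (assumed format, not copied from the file)
raw = pd.Series(["4.505", "16.612", "-"])

# Same approach as the notebook: literal replacements, then cast to int
parsed = (
    raw.str.replace(".", "", regex=False)
       .str.replace("-", "0", regex=False)
       .astype(int)
)
print(parsed.tolist())

# Alternative: anything that is not a number becomes NaN instead of 0
as_missing = pd.to_numeric(raw.str.replace(".", "", regex=False), errors="coerce")
print(as_missing.isna().tolist())
```

Keeping `NaN` rather than 0 avoids a division by zero if one of the affected years is later used as the denominator of a rate.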


### Merging the two datasets

In [528]:
# Inner merge on the ibge column, which acts as the foreign key between the two tables
df_merged = pd.merge(df_criminal, df_pop, how = 'inner', on = 'ibge').drop(columns=['municipios_y'])
df_merged.rename(columns={'municipios_x': 'municipios'}, inplace=True)

df_merged.info()

df_merged


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497 entries, 0 to 496
Data columns (total 31 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   municipios                      497 non-null    object 
 1   ibge                            497 non-null    int64  
 2   homicidio_doloso                497 non-null    int64  
 3   total_vitimas_homicidio_doloso  497 non-null    int64  
 4   latrocinio                      497 non-null    int64  
 5   furtos                          497 non-null    int64  
 6   abigeato                        497 non-null    int64  
 7   furto_veiculo                   497 non-null    int64  
 8   roubos                          497 non-null    int64  
 9   roubo_veiculo                   497 non-null    int64  
 10  estelionato                     497 non-null    int64  
 11  delitos_armas_municoes          497 non-null    int64  
 12  entorpecente_posse              497 

Unnamed: 0,municipios,ibge,homicidio_doloso,total_vitimas_homicidio_doloso,latrocinio,furtos,abigeato,furto_veiculo,roubos,roubo_veiculo,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,acegua,4300034,0,0,0,20,8,0,3,0,...,4539,4520,4564,4483,4472,4412,4487,4516,4540,4505
1,agua santa,4300059,1,1,0,20,3,4,1,0,...,3858,3898,3959,3922,3977,4013,4057,4107,4093,4256
2,agudo,4300109,1,1,0,102,2,2,3,1,...,16731,16838,16851,16701,16595,16475,16537,16556,16760,16612
3,ajuricaba,4300208,0,0,0,26,1,0,1,0,...,7389,7431,7299,7241,7279,7325,7546,7485,7584,7447
4,alecrim,4300307,1,1,0,36,10,0,2,0,...,7074,6891,6814,6598,6594,6569,6513,6435,6301,6403
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
492,vista alegre do prata,4323606,0,0,0,3,0,0,0,0,...,1578,1539,1582,1630,1677,1645,1648,1704,1721,1746
493,vista gaucha,4323705,1,1,0,11,0,0,0,0,...,2842,2828,2787,2790,2802,2834,2885,2940,2987,3002
494,vitoria das missoes,4323754,0,0,0,20,3,0,0,0,...,3453,3403,3415,3448,3383,3439,3389,3438,3397,3405
495,westfalia,4323770,0,0,0,10,2,1,0,0,...,2864,2957,2974,3007,3039,3088,3136,3125,3226,3257
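An inner merge drops any municipality whose `ibge` code appears in only one table, without warning. A small sketch of how such rows could be listed, using pandas' `indicator=True` option on toy frames (the codes and values below are invented):

```python
import pandas as pd

# Toy tables with invented codes: 1 exists only in crime, 4 only in population
crime = pd.DataFrame({"ibge": [1, 2, 3], "furtos": [5, 6, 7]})
pop = pd.DataFrame({"ibge": [2, 3, 4], "2021": [100, 200, 300]})

# An outer merge keeps every key; the _merge column records where each key was found
check = pd.merge(crime, pop, how="outer", on="ibge", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["ibge", "_merge"]])

# validate= raises an error if a key is duplicated on either side
merged = pd.merge(crime, pop, how="inner", on="ibge", validate="one_to_one")
print(len(merged))
```

Running the same check on `df_criminal` and `df_pop` would show whether any municipality is lost in the merge.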


### Computing the crime rate per 100,000 inhabitants.
So that large cities, with their large populations, do not dominate the statistical analyses, the crime rate per capita is computed. This keeps the comparison between crime counts and population proportional.

In [529]:
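# Columns for which a rate is computed (every crime-count column)
colunas_taxa = df_criminal.columns[2:]

# Most recent population estimate available (the last column, 2021).
# The original nested loop over every year overwrote each rate on every pass,
# so only this last year ever took effect.
coluna_pop = df_pop.columns[-1]

# Crime rate per 100,000 inhabitants, computed on the merged rows so that
# crime counts and population always belong to the same municipality
for coluna_taxa in colunas_taxa:
    df_merged[f'taxa_{coluna_taxa}'] = df_merged[coluna_taxa] / df_merged[coluna_pop] * 100000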

df_merged.to_csv('data\\tocsv\\dados_criminalidade_população.csv', sep=';', index=False)

df_merged

#df_merged.info()

Unnamed: 0,municipios,ibge,homicidio_doloso,total_vitimas_homicidio_doloso,latrocinio,furtos,abigeato,furto_veiculo,roubos,roubo_veiculo,...,taxa_furto_veiculo,taxa_roubos,taxa_roubo_veiculo,taxa_estelionato,taxa_delitos_armas_municoes,taxa_entorpecente_posse,taxa_entorpecente_trafico,taxa_vitimas_latrocinio,taxa_vitimas_lesao_corporal_morte,taxa_total_vitimas_crimes_violentos
0,acegua,4300034,0,0,0,20,8,0,3,0,...,0.000000,66.592675,0.000000,599.334073,88.790233,66.592675,0.000000,0.0,0.0,0.000000
1,agua santa,4300059,1,1,0,20,3,4,1,0,...,93.984962,23.496241,0.000000,305.451128,46.992481,0.000000,0.000000,0.0,0.0,46.992481
2,agudo,4300109,1,1,0,102,2,2,3,1,...,12.039490,18.059234,6.019745,307.006983,18.059234,42.138213,60.197448,0.0,0.0,6.019745
3,ajuricaba,4300208,0,0,0,26,1,0,1,0,...,0.000000,13.428226,0.000000,375.990332,26.856452,40.284678,0.000000,0.0,0.0,0.000000
4,alecrim,4300307,1,1,0,36,10,0,2,0,...,0.000000,31.235358,0.000000,265.500547,156.176792,0.000000,62.470717,0.0,0.0,15.617679
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
492,vista alegre do prata,4323606,0,0,0,3,0,0,0,0,...,0.000000,0.000000,0.000000,458.190149,171.821306,0.000000,0.000000,0.0,0.0,0.000000
493,vista gaucha,4323705,1,1,0,11,0,0,0,0,...,0.000000,0.000000,0.000000,199.866755,66.622252,0.000000,0.000000,0.0,0.0,33.311126
494,vitoria das missoes,4323754,0,0,0,20,3,0,0,0,...,0.000000,0.000000,0.000000,205.580029,29.368576,117.474302,0.000000,0.0,0.0,0.000000
495,westfalia,4323770,0,0,0,10,2,1,0,0,...,30.703101,0.000000,0.000000,614.062020,30.703101,30.703101,0.000000,0.0,0.0,0.000000
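As a worked check of the rate, the first row of the output can be reproduced by hand: acegua has 20 thefts (`furtos`) and a 2021 population of 4505. The helper `rate_per_100k` below is a hypothetical name used only for this illustration. Returning NaN for a zero population is a design choice of this sketch, not something the notebook does.

```python
import math


def rate_per_100k(count: int, population: int) -> float:
    """Hypothetical helper: occurrences per 100,000 inhabitants."""
    if population <= 0:
        # No valid population estimate: no meaningful rate
        return math.nan
    return count / population * 100_000


# acegua: 20 thefts, 4505 inhabitants in 2021
print(round(rate_per_100k(20, 4505), 6))
```

The result agrees with the `taxa_furtos` value listed for acegua (443.951165), which also confirms that the 2021 column is the one used as the denominator.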


## Dimensionality Reduction
### The dataset now has enough information to rank cities from safer to less safe using figures proportional to population. However, it has many columns. High dimensionality can make learning harder and lead to overfitting, so PCA, a dimensionality reduction technique, is applied.


PCA is applied only to the crime-rate columns, since the absolute counts are raw data that served only to compute the rates. The principal components therefore describe how crime behaves across cities along the directions in which the rates vary the most.

In [530]:
colunas_extras = df_merged.columns[2:-15]

colunas_extras

df_taxa = df_merged.drop(colunas_extras, axis=1)

df_taxa.to_csv('data\\tocsv\\dados_taxa_criminalidade.csv', sep=';', index=False)

df_taxa



Unnamed: 0,municipios,ibge,taxa_homicidio_doloso,taxa_total_vitimas_homicidio_doloso,taxa_latrocinio,taxa_furtos,taxa_abigeato,taxa_furto_veiculo,taxa_roubos,taxa_roubo_veiculo,taxa_estelionato,taxa_delitos_armas_municoes,taxa_entorpecente_posse,taxa_entorpecente_trafico,taxa_vitimas_latrocinio,taxa_vitimas_lesao_corporal_morte,taxa_total_vitimas_crimes_violentos
0,acegua,4300034,0.000000,0.000000,0.0,443.951165,177.580466,0.000000,66.592675,0.000000,599.334073,88.790233,66.592675,0.000000,0.0,0.0,0.000000
1,agua santa,4300059,23.496241,23.496241,0.0,469.924812,70.488722,93.984962,23.496241,0.000000,305.451128,46.992481,0.000000,0.000000,0.0,0.0,46.992481
2,agudo,4300109,6.019745,6.019745,0.0,614.013966,12.039490,12.039490,18.059234,6.019745,307.006983,18.059234,42.138213,60.197448,0.0,0.0,6.019745
3,ajuricaba,4300208,0.000000,0.000000,0.0,349.133879,13.428226,0.000000,13.428226,0.000000,375.990332,26.856452,40.284678,0.000000,0.0,0.0,0.000000
4,alecrim,4300307,15.617679,15.617679,0.0,562.236452,156.176792,0.000000,31.235358,0.000000,265.500547,156.176792,0.000000,62.470717,0.0,0.0,15.617679
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
492,vista alegre do prata,4323606,0.000000,0.000000,0.0,171.821306,0.000000,0.000000,0.000000,0.000000,458.190149,171.821306,0.000000,0.000000,0.0,0.0,0.000000
493,vista gaucha,4323705,33.311126,33.311126,0.0,366.422385,0.000000,0.000000,0.000000,0.000000,199.866755,66.622252,0.000000,0.000000,0.0,0.0,33.311126
494,vitoria das missoes,4323754,0.000000,0.000000,0.0,587.371512,88.105727,0.000000,0.000000,0.000000,205.580029,29.368576,117.474302,0.000000,0.0,0.0,0.000000
495,westfalia,4323770,0.000000,0.000000,0.0,307.031010,61.406202,30.703101,0.000000,0.000000,614.062020,30.703101,30.703101,0.000000,0.0,0.0,0.000000


For demonstration purposes, only a sample of 20 cities is used.

In [531]:
df_sample = df_taxa.sample(n=20, axis=0, random_state=42)

df_sample

Unnamed: 0,municipios,ibge,taxa_homicidio_doloso,taxa_total_vitimas_homicidio_doloso,taxa_latrocinio,taxa_furtos,taxa_abigeato,taxa_furto_veiculo,taxa_roubos,taxa_roubo_veiculo,taxa_estelionato,taxa_delitos_armas_municoes,taxa_entorpecente_posse,taxa_entorpecente_trafico,taxa_vitimas_latrocinio,taxa_vitimas_lesao_corporal_morte,taxa_total_vitimas_crimes_violentos
483,viadutos,4322905,19.794141,19.794141,0.0,692.794933,19.794141,39.588282,0.0,0.0,356.294537,59.382423,98.970705,59.382423,0.0,0.0,19.794141
73,campos borges,4304101,0.0,0.0,0.0,619.111709,107.671602,26.9179,0.0,26.9179,296.096904,80.753701,26.9179,26.9179,0.0,0.0,0.0
231,lajeado,4311403,17.908121,18.96154,0.0,1128.211611,6.320513,77.952996,81.113253,10.534189,1079.754343,54.777781,130.62394,187.508559,0.0,0.0,22.121796
175,gaurama,4308706,17.094017,17.094017,0.0,581.196581,0.0,0.0,0.0,0.0,547.008547,34.188034,34.188034,34.188034,0.0,0.0,17.094017
237,macambara,4311718,0.0,0.0,0.0,848.010437,326.15786,0.0,0.0,0.0,500.108719,43.487715,0.0,0.0,0.0,0.0,0.0
424,selbach,4320305,0.0,0.0,0.0,249.233129,0.0,95.858896,0.0,0.0,383.435583,0.0,0.0,76.687117,0.0,0.0,0.0
155,estancia velha,4307609,2.009,2.009,0.0,958.293153,6.027001,50.225008,132.594021,34.153005,393.764063,18.081003,106.477017,44.198007,0.0,0.0,6.027001
55,braga,4302600,0.0,0.0,0.0,348.993289,80.536913,80.536913,53.691275,0.0,241.610738,107.38255,107.38255,107.38255,0.0,0.0,0.0
322,pontao,4314779,0.0,0.0,0.0,608.626621,26.462027,0.0,0.0,26.462027,688.012702,79.386081,79.386081,26.462027,0.0,0.0,0.0
9,alto alegre,4300554,0.0,0.0,0.0,365.726228,104.493208,0.0,0.0,0.0,52.246604,0.0,0.0,0.0,0.0,0.0,0.0


In [532]:
import numpy as np

from statsmodels.datasets import get_rdataset
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [533]:
df_sample.drop(['municipios', 'ibge'], axis='columns', inplace=True)

df_sample

Unnamed: 0,taxa_homicidio_doloso,taxa_total_vitimas_homicidio_doloso,taxa_latrocinio,taxa_furtos,taxa_abigeato,taxa_furto_veiculo,taxa_roubos,taxa_roubo_veiculo,taxa_estelionato,taxa_delitos_armas_municoes,taxa_entorpecente_posse,taxa_entorpecente_trafico,taxa_vitimas_latrocinio,taxa_vitimas_lesao_corporal_morte,taxa_total_vitimas_crimes_violentos
483,19.794141,19.794141,0.0,692.794933,19.794141,39.588282,0.0,0.0,356.294537,59.382423,98.970705,59.382423,0.0,0.0,19.794141
73,0.0,0.0,0.0,619.111709,107.671602,26.9179,0.0,26.9179,296.096904,80.753701,26.9179,26.9179,0.0,0.0,0.0
231,17.908121,18.96154,0.0,1128.211611,6.320513,77.952996,81.113253,10.534189,1079.754343,54.777781,130.62394,187.508559,0.0,0.0,22.121796
175,17.094017,17.094017,0.0,581.196581,0.0,0.0,0.0,0.0,547.008547,34.188034,34.188034,34.188034,0.0,0.0,17.094017
237,0.0,0.0,0.0,848.010437,326.15786,0.0,0.0,0.0,500.108719,43.487715,0.0,0.0,0.0,0.0,0.0
424,0.0,0.0,0.0,249.233129,0.0,95.858896,0.0,0.0,383.435583,0.0,0.0,76.687117,0.0,0.0,0.0
155,2.009,2.009,0.0,958.293153,6.027001,50.225008,132.594021,34.153005,393.764063,18.081003,106.477017,44.198007,0.0,0.0,6.027001
55,0.0,0.0,0.0,348.993289,80.536913,80.536913,53.691275,0.0,241.610738,107.38255,107.38255,107.38255,0.0,0.0,0.0
322,0.0,0.0,0.0,608.626621,26.462027,0.0,0.0,26.462027,688.012702,79.386081,79.386081,26.462027,0.0,0.0,0.0
9,0.0,0.0,0.0,365.726228,104.493208,0.0,0.0,0.0,52.246604,0.0,0.0,0.0,0.0,0.0,0.0


In [534]:
scaler = StandardScaler(with_std=True, with_mean=True)
data_scaled = pd.DataFrame(scaler.fit_transform(df_sample))
data_scaled.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
count,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0
mean,3.3306690000000003e-17,-5.5511150000000004e-17,-4.996004e-17,3.8857810000000004e-17,5.551115e-18,-1.915135e-16,0.0,-6.661338000000001e-17,7.771561000000001e-17,1.44329e-16,7.771561000000001e-17,5.5511150000000004e-17,-4.996004e-17,-1.1102230000000002e-17,1.1102230000000002e-17
std,1.025978,1.025978,1.025978,1.025978,1.025978,1.025978,1.025978,1.025978,1.025978,1.025978,1.025978,1.025978,1.025978,1.025978,1.025978
min,-0.6994931,-0.6816916,-0.4141883,-0.9681663,-0.6780943,-1.113794,-0.616484,-0.7081368,-1.507495,-1.501965,-0.9939043,-0.8815828,-0.4141883,-0.2294157,-0.6959448
25%,-0.6994931,-0.6816916,-0.4141883,-0.7914025,-0.5936598,-0.6785608,-0.616484,-0.7081368,-0.6744146,-0.9660994,-0.9939043,-0.520812,-0.4141883,-0.2294157,-0.6959448
50%,-0.6994931,-0.6816916,-0.4141883,-0.2184977,-0.4169081,-0.2611144,-0.616484,-0.7081368,-0.2269227,0.2019467,0.02918485,-0.3576427,-0.4141883,-0.2294157,-0.6959448
75%,0.5369648,0.4962004,-0.4141883,0.4187483,0.206159,0.3481238,0.202417,0.7153594,0.3342292,0.9124671,0.5548625,0.1531906,-0.4141883,-0.2294157,0.3992507
max,2.471621,2.562285,2.74875,2.978709,3.730162,2.891085,2.862717,2.078835,2.21053,1.74994,2.596895,3.603491,2.74875,4.358899,2.40825
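The `std` row shows 1.025978 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (denominator n), while `describe()` reports the sample standard deviation (denominator n - 1). With n = 20 rows the ratio is sqrt(20/19). A small standard-library sketch on arbitrary toy values illustrates the effect:

```python
import math
import statistics

# Arbitrary toy values, unrelated to the crime data
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)

# Standardize with the population standard deviation, as StandardScaler does
mean = statistics.fmean(values)
pop_std = statistics.pstdev(values)
scaled = [(v - mean) / pop_std for v in values]

# The sample standard deviation of the scaled values is sqrt(n / (n - 1)), not 1
sample_std = statistics.stdev(scaled)
print(sample_std, math.sqrt(n / (n - 1)))

# For the 20 sampled cities the same factor gives the value seen in describe()
print(round(math.sqrt(20 / 19), 6))
```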


In [535]:
pca = PCA()
components = pca.fit_transform(data_scaled)

components

array([[ 2.64234366e-03,  2.40150279e-01,  1.16687644e+00,
        -6.25517609e-01, -9.71946695e-02, -4.90245179e-01,
         3.27932216e-01,  1.82136033e-02,  2.11334891e-01,
        -3.07795766e-01, -4.97256857e-01, -1.06845951e-02,
         1.07575592e-02, -2.72249181e-03, -2.20591243e-17],
       [-1.48332019e+00, -6.33015360e-01, -2.28908986e-01,
         4.91774133e-01,  1.85674574e+00,  3.53421965e-01,
         1.74571872e-01,  3.80356766e-01, -1.41031333e-01,
        -5.21557410e-01,  5.12869479e-01,  2.17272970e-01,
        -1.72897303e-02, -1.93464295e-02,  6.90527425e-18],
       [ 2.31104866e+00, -1.10250624e+00, -2.52108184e-01,
        -1.23214720e-01, -9.16665856e-01,  7.07734599e-01,
         1.50108563e+00, -1.34241164e-01,  9.14547714e-01,
         1.78921718e-01, -8.32429057e-02,  2.89873549e-01,
        -3.05796672e-02,  1.02107088e-02, -1.91781121e-17],
       [-7.29421454e-01,  4.56326193e-01,  5.06874983e-01,
        -3.96893815e-01, -6.36234226e-01, -1.03950814

In [536]:
pca.explained_variance_ratio_

array([5.01442797e-01, 1.65719459e-01, 8.24075341e-02, 6.95058984e-02,
       5.80267333e-02, 4.31502718e-02, 3.25963146e-02, 1.73889392e-02,
       1.41437382e-02, 7.58752469e-03, 6.19176629e-03, 1.72863306e-03,
       1.02385688e-04, 8.00460616e-06, 2.21817961e-34])
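The ratios above can be accumulated to see how much variance a given number of components retains. The sketch below copies the values from the output and uses only the standard library; the 90% threshold is an arbitrary example, not a choice made in this notebook.

```python
from itertools import accumulate

# Explained variance ratios copied from the output above
ratios = [
    5.01442797e-01, 1.65719459e-01, 8.24075341e-02, 6.95058984e-02,
    5.80267333e-02, 4.31502718e-02, 3.25963146e-02, 1.73889392e-02,
    1.41437382e-02, 7.58752469e-03, 6.19176629e-03, 1.72863306e-03,
    1.02385688e-04, 8.00460616e-06, 2.21817961e-34,
]

cumulative = list(accumulate(ratios))

# Share of the variance kept by the first two components
print(round(cumulative[1], 4))

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.90  # example value only
n_needed = next(i + 1 for i, c in enumerate(cumulative) if c >= threshold)
print(n_needed)
```

The first two components together account for about 66.7% of the variance in the sample, which is worth keeping in mind when reading a two-dimensional projection.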

In [537]:
px.area(
    x=range(1, pca.explained_variance_ratio_.cumsum().shape[0] + 1),
    y=pca.explained_variance_ratio_.cumsum(),
    labels={"x": "# Components", "y": "Explained Variance"}
)

In [538]:
pca = PCA(n_components=2)
components = pca.fit_transform(data_scaled)

components

array([[ 2.64234366e-03,  2.40150279e-01],
       [-1.48332019e+00, -6.33015360e-01],
       [ 2.31104866e+00, -1.10250624e+00],
       [-7.29421454e-01,  4.56326193e-01],
       [-2.40311620e+00, -3.49882827e-01],
       [-1.59268291e+00, -9.77951006e-01],
       [ 4.09284776e-02, -1.35909044e+00],
       [-6.02002981e-01, -6.14976381e-01],
       [-9.71117670e-01, -3.61181138e-01],
       [-2.95745153e+00,  1.13305929e-02],
       [-2.25614836e+00,  1.18139133e-01],
       [-2.01001255e+00, -5.53877689e-01],
       [ 5.40378935e-01,  1.04467725e+00],
       [ 5.94658961e-01,  2.06270946e+00],
       [ 1.73473426e+00,  4.03226599e+00],
       [-1.81090444e+00, -2.50699898e-01],
       [ 8.81919190e+00, -3.34599233e+00],
       [ 5.21069067e+00,  3.15543445e+00],
       [-1.10061585e+00, -8.91555058e-01],
       [-1.33748008e+00, -6.80304989e-01]])
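Since `municipios` and `ibge` were dropped before scaling, the array above no longer says which city each row belongs to. `fit_transform` preserves row order, so the names can be re-attached through the sample's index. The sketch below uses the first three rows of the sample and of the component output shown earlier, assuming those displays list rows in sample order. In the notebook the names would come from `df_taxa.loc[df_sample.index, 'municipios']`.

```python
import numpy as np
import pandas as pd

# First three sampled rows (index label -> municipality), as displayed earlier
names = pd.Series(
    ["viadutos", "campos borges", "lajeado"],
    index=[483, 73, 231],
    name="municipios",
)

# First three rows of the two-component output
components_head = np.array([
    [2.64234366e-03, 2.40150279e-01],
    [-1.48332019e+00, -6.33015360e-01],
    [2.31104866e+00, -1.10250624e+00],
])

# Reuse the sample's row labels so each score lines up with its municipality
scores = pd.DataFrame(components_head, columns=["PC1", "PC2"], index=names.index)
scores.insert(0, "municipios", names)

# Order the cities by their score on the first component
ranked = scores.sort_values("PC1", ascending=False)
print(ranked)
```

Whether a high PC1 score means more or less crime depends on the signs of the loadings, so the ordering should be interpreted together with them.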

In [548]:


# Loadings: component directions scaled by the standard deviation of each component
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

features = df_sample.columns

# Each point is one sampled city. The original call passed color=df_sample.columns
# (15 feature names) for 20 points, which raised
# "ValueError: All arguments should have the same length", so no color is passed here.
fig = px.scatter(components, x=0, y=1, labels={'0': 'PC1', '1': 'PC2'})

for i, feature in enumerate(features):
    # Arrow from the origin to the loading of this feature
    fig.add_annotation(
        ax=0, ay=0,
        axref="x", ayref="y",
        x=loadings[i, 0],
        y=loadings[i, 1],
        showarrow=True,
        arrowsize=2,
        arrowhead=2,
        xanchor="right",
        yanchor="top"
    )
    # Feature name placed at the tip of the arrow
    fig.add_annotation(
        x=loadings[i, 0],
        y=loadings[i, 1],
        ax=0, ay=0,
        xanchor="center",
        yanchor="bottom",
        text=feature,
        yshift=5,
    )
fig.update_yaxes(autorange="reversed")
fig.show()

In [540]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
""" i, j = 0, 1 # which components
scale_arrow = s_ = 2
components[:,1] *= -1
pca.components_[1] *= -1 # flip the y-axis
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.scatter(components[:,0], components[:,1])
ax.set_xlabel('PC%d' % (i+1))
ax.set_ylabel('PC%d' % (j+1))
for k in range(pca.components_.shape[1]):
    ax.arrow(0, 0, s_*pca.components_[i,k], s_*pca.components_[j,k])
    ax.text(s_*pca.components_[i,k],
            s_*pca.components_[j,k],
            df_sample.columns[k]) """