# **Costa Rica - Seizure and Homicide Data Analysis**
#### InSight Crime - MAD Unit 
June 2025


---------------------



### Project Setup

#### Version Control

The project lives in a single GitHub repository ([FelipeVillota/costa-rica-presentation](https://github.com/FelipeVillota/costa-rica-presentation)). I keep the repository private and can grant collaborator access to it at any time.

#### Reproducible Environment

In [None]:
# IMPORTANT

# In the terminal, run the following command to create a virtual environment called `venv-cr`:
# python -m venv venv-cr

# To activate the environment in Windows PowerShell, run the commands below
# (the first one is optional and only relaxes the execution policy for the current session):
# Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
# venv-cr\Scripts\activate

# Then select the corresponding kernel in the notebook

# Update the master list of dependencies:
# pip freeze > requirements.txt

In [1]:
# Check that the venv-cr environment is active --> if asked, accept installing the ipykernel package so the notebook can connect to the kernel
import sys
print(sys.executable)

c:\Users\USER\Desktop\ic\costa-rica\costa-rica-presentation\venv-cr\Scripts\python.exe
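
Reading the interpreter path by eye works, but the same check can be automated. The snippet below is a minimal sketch that assumes the environment was created with `python -m venv`; the helper name `in_virtualenv` is hypothetical and not part of the project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the second line prints `False`, the notebook is most likely connected to the system-wide kernel rather than `venv-cr`.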


#### Loading Libraries

In [None]:
# Install the required packages in the virtual environment:
# pip install --upgrade pip
# pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib pandas
# pip install gspread gspread-formatting seaborn matplotlib missingno    

In [7]:
import os
import re
import requests
import pandas as pd
from datetime import datetime
from google.oauth2 import service_account
from googleapiclient.discovery import build
import gspread
from google.oauth2.service_account import Credentials
from gspread_formatting import format_cell_ranges, CellFormat, Color
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from io import BytesIO


# pip freeze > requirements.txt
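
Before running the rest of the notebook, it can help to confirm that the packages imported above are actually installed in the active environment. The sketch below uses only the standard library. The list `REQUIRED_MODULES` and the helper `missing_modules` are illustrative names, and the list contains import names (which can differ from the pip package names).

```python
from importlib.util import find_spec

# Top-level import names used in this notebook (assumed from the import cell;
# openpyxl is included because pd.read_excel is called with engine='openpyxl').
REQUIRED_MODULES = [
    "requests", "pandas", "openpyxl", "gspread", "gspread_formatting",
    "googleapiclient", "matplotlib", "seaborn", "missingno",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

Anything reported as missing can then be installed with `pip install` inside `venv-cr`.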

### Loading Data

In [26]:
# Use raw GitHub URLs
url_homicides = "https://github.com/FelipeVillota/costa-rica-presentation/raw/main/data/raw/homicides-2010-2024.xlsx"
url_seizures = "https://github.com/FelipeVillota/costa-rica-presentation/raw/main/data/raw/seizures-2019-2024.xlsx"

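# Download the file contents; raise_for_status() fails fast on an HTTP error
# instead of passing an error page on to pd.read_excel
response1 = requests.get(url_homicides, timeout=30)
response1.raise_for_status()
response2 = requests.get(url_seizures, timeout=30)
response2.raise_for_status()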

In [46]:
# Load all sheets from each Excel file as separate DataFrames
homicides_sheets = pd.read_excel(BytesIO(response1.content), engine='openpyxl', sheet_name=None)
seizures_sheets = pd.read_excel(BytesIO(response2.content), engine='openpyxl', sheet_name=None)
seizures_df = seizures_sheets['8']  

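# With sheet_name=None, pd.read_excel returns a dictionary whose keys are sheet names and whose values are DataFrames.
# Keep only the first homicides sheet; note that this rebinds homicides_sheets from a dict to a single DataFrame.
homicides_sheets = list(homicides_sheets.values())[0]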
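
Because the same name ends up holding first a dictionary and then a DataFrame, it is easy to confuse what `.items()` iterates over. The sketch below uses a small made-up dictionary of DataFrames (a stand-in for the `sheet_name=None` result, not the real data) to show the difference.

```python
import pandas as pd

# Toy stand-in for pd.read_excel(..., sheet_name=None); the values are made up.
sheets = {
    "sheet_a": pd.DataFrame({"año": [2010, 2011], "homicidios": [1, 2]}),
    "sheet_b": pd.DataFrame({"año": [2012], "homicidios": [3]}),
}

# dict.items() yields (sheet name, DataFrame) pairs.
sheet_names = [name for name, _ in sheets.items()]

# DataFrame.items() yields (column label, Series) pairs instead.
first_df = next(iter(sheets.values()))
column_names = [name for name, _ in first_df.items()]

print("Sheets: ", sheet_names)
print("Columns:", column_names)
```

Keeping the dictionary and the selected DataFrame under separate names avoids this ambiguity in later cells.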

In [47]:

# Preview first sheet from homicides
print("First sheet from homicidios file:")
print(homicides_sheets.head())


print(seizures_df.head())


First sheet from homicidios file:
         país      mes   año  homicidios_país_mes  \
0  Costa Rica    enero  2010                   44   
1  Costa Rica  febrero  2010                   45   
2  Costa Rica    marzo  2010                   47   
3  Costa Rica    abril  2010                   42   
4  Costa Rica     mayo  2010                   39   

   homicidios_país_acumulado_mes  homicidios_país_año  
0                            756                  527  
1                            690                  527  
2                            699                  527  
3                            645                  527  
4                            701                  527  
  8.Decomisos de drogas según provincia, cantón y distrito por tipo de sustancia. 2019-2024  \
0                                               País                                          
1                                         Costa Rica                                          
2                         
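
The homicides preview stores the month as a Spanish name (`mes`) next to a separate year column (`año`), which is awkward for sorting or plotting over time. A minimal sketch of one way to turn each pair into a date is shown below. The mapping `MONTHS_ES` and the helper `month_start` are hypothetical, and the `setiembre` spelling is included only as an assumption about possible variants in the data.

```python
from datetime import date

# Spanish month names mapped to month numbers ("setiembre" is a common
# Costa Rican variant of "septiembre"; its presence in the data is assumed).
MONTHS_ES = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "setiembre": 9, "octubre": 10,
    "noviembre": 11, "diciembre": 12,
}


def month_start(mes: str, year: int) -> date:
    """Return the first day of the given Spanish-named month and year."""
    return date(int(year), MONTHS_ES[mes.strip().lower()], 1)


print(month_start("enero", 2010))  # 2010-01-01
```

Applied row by row to the `mes` and `año` columns, this would give a single sortable date column.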

In [48]:
# homicides_sheets and seizures_df are single DataFrames at this point, so
# .items() yields (column name, Series) pairs, not sheets. Iterate through the columns:
print("\nAll columns in the homicides sheet:")
for column_name, column in homicides_sheets.items():
    print(f"Column: {column_name}")
    print(column.head())
    print("-----")

print("\nColumns in the seizures sheet:")
for column_name, column in seizures_df.items():
    print(f"Column: {column_name}")
    print(column.head())
    print("-----")


All columns in the homicides sheet:
Column: país
0    Costa Rica
1    Costa Rica
2    Costa Rica
3    Costa Rica
4    Costa Rica
Name: país, dtype: object
-----
Column: mes
0      enero
1    febrero
2      marzo
3      abril
4       mayo
Name: mes, dtype: object
-----
Column: año
0    2010
1    2010
2    2010
3    2010
4    2010
Name: año, dtype: int64
-----
Column: homicidios_país_mes
0    44
1    45
2    47
3    42
4    39
Name: homicidios_país_mes, dtype: int64
-----
Column: homicidios_país_acumulado_mes
0    756
1    690
2    699
3    645
4    701
Name: homicidios_país_acumulado_mes, dtype: int64
-----
Column: homicidios_país_año
0    527
1    527
2    527
3    527
4    527
Name: homicidios_país_año, dtype: int64
-----

Columns in the seizures sheet:
Column: 8.Decomisos de drogas según provincia, cantón y distrito por tipo de sustancia. 2019-2024
0          País
1    Costa Rica
2    Costa Rica
3    Costa Rica
4    Costa Rica
Name: 8.Decomisos de drogas según provincia, cantón y distrito por tipo

### Execution

#### Pre-processing

In [None]:
# Handle missing/invalid values (currently disabled; uncomment to apply to seizures_df)

#seizures_df = seizures_df.replace('', pd.NA)  # Convert empty strings to NA
#seizures_df = seizures_df.replace(r'^\s*$', pd.NA, regex=True)  # Convert whitespace-only strings to NA
#print("✓ Converted empty strings/whitespace to NA values")
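
To illustrate what the disabled step would do, the sketch below runs the same regex-based `replace` on a tiny made-up DataFrame (the values are illustrative, not taken from the seizures file). The pattern `^\s*$` matches both empty and whitespace-only strings, so the separate `replace('', pd.NA)` call is arguably redundant.

```python
import pandas as pd

# Toy data: one real value, one empty string, one whitespace-only string.
toy = pd.DataFrame({
    "provincia": ["Alajuela", "", "   "],
    "eventos": [3, 5, 8],
})

# Regex replacement only touches string cells; numeric cells are left alone.
cleaned = toy.replace(r"^\s*$", pd.NA, regex=True)

print(cleaned.isna().sum())
```

Running `isna().sum()` before and after the replacement is a quick way to see how many blank cells were hiding in each column.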


### Exploratory Data Analysis

In [50]:
seizures_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3400 entries, 0 to 3399
Data columns (total 11 columns):
 #   Column                                                                                     Non-Null Count  Dtype 
---  ------                                                                                     --------------  ----- 
 0   8.Decomisos de drogas según provincia, cantón y distrito por tipo de sustancia. 2019-2024  3400 non-null   object
 1   Unnamed: 1                                                                                 3399 non-null   object
 2   Unnamed: 2                                                                                 3399 non-null   object
 3   Unnamed: 3                                                                                 3399 non-null   object
 4   Unnamed: 4                                                                                 3399 non-null   object
 5   Unnamed: 5                                             
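
The `info()` output suggests that the sheet title was read as the header, which leaves the remaining columns as `Unnamed: n` and pushes the real column names (`País`, `Provincia`, ...) into the first data row. One possible fix, sketched below on made-up data, is to promote that first row to the header. The helper `promote_first_row` is hypothetical; passing `header=1` to `pd.read_excel` might achieve the same at load time, assuming the title occupies exactly one row.

```python
import pandas as pd

# Toy frame mimicking the layout: title as header, real names in row 0.
raw = pd.DataFrame(
    [["País", " Provincia", "Año"],
     ["Costa Rica", "Alajuela", 2019],
     ["Costa Rica", "Cartago", 2020]],
    columns=["Sheet title", "Unnamed: 1", "Unnamed: 2"],
)


def promote_first_row(df: pd.DataFrame) -> pd.DataFrame:
    """Use the first row as column names and drop it from the data."""
    header = df.iloc[0].astype(str).str.strip()  # strip stray spaces in names
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = header.to_list()
    return out


tidy = promote_first_row(raw)
print(tidy.columns.tolist())
print(tidy)
```

Stripping whitespace matters here because at least one real header in the seizures sheet (` Cocaína Eventos`) appears to start with a space.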

##### Summary Stats

In [49]:
from IPython.display import display  # explicit import so the cell also runs outside Jupyter

def get_dataframe_summary(df):
    """Return (general_info dict, per-column stats DataFrame) for df."""
    # --- General DataFrame Info ---
    general_info = {
        "Shape": f"{df.shape[0]} rows × {df.shape[1]} cols",
        "Memory Usage": f"{df.memory_usage(deep=True).sum() / (1024 ** 2):.2f} MB",
        "Columns with NA": f"{df.isna().any().sum()} / {len(df.columns)}",
        "Duplicate Rows": f"{df.duplicated().sum()} ({(df.duplicated().mean() * 100):.1f}%)",
        "Numeric Columns": f"{df.select_dtypes(include='number').shape[1]}",
        "Categorical Columns": f"{df.select_dtypes(include=['object', 'category']).shape[1]}",
        "Datetime Columns": f"{df.select_dtypes(include='datetime').shape[1]}"
    }

    # --- Column-Level Stats ---
    column_stats = pd.DataFrame({
        'Variable': df.columns,
        'Dtype': df.dtypes.values,
        'Unique_Count': df.nunique().values,
        'NA_Count': df.isna().sum().values,
        'NA_Percentage': (df.isna().mean() * 100).round(1).values,
        # keep=False counts every member of a repeated group, not only the later repeats
        'Duplicate_Count': df.apply(lambda col: col.duplicated(keep=False).sum()).values,
        'Duplicate_Percentage': (df.apply(lambda col: col.duplicated(keep=False).mean()) * 100).round(1).values,
        # List comprehension instead of df.apply, which can return a DataFrame
        # rather than a Series of lists when all lists have the same length
        'Unique_Values': [df[col].drop_duplicates().tolist() for col in df.columns]
    }).sort_values('Unique_Count', ascending=False)

    # Format percentages
    column_stats['NA_Percentage'] = column_stats['NA_Percentage'].astype(str) + '%'
    column_stats['Duplicate_Percentage'] = column_stats['Duplicate_Percentage'].astype(str) + '%'

    return general_info, column_stats

# Run the summary
general_info, column_stats = get_dataframe_summary(seizures_df)

# Print General Info
print("=== GENERAL DATAFRAME INFO ===")
for key, value in general_info.items():
    print(f"{key}: {value}")

# Display Column Stats (including the list of unique values per column)
print("\n=== COLUMN-LEVEL STATISTICS ===")
display(column_stats)


=== GENERAL DATAFRAME INFO ===
Shape: 3400 rows × 11 cols
Memory Usage: 1.58 MB
Columns with NA: 10 / 11
Duplicate Rows: 0 (0.0%)
Numeric Columns: 0
Categorical Columns: 11
Datetime Columns: 0

=== COLUMN-LEVEL STATISTICS ===


| | Variable | Dtype | Unique_Count | NA_Count | NA_Percentage | Duplicate_Count | Duplicate_Percentage | Unique_Values |
|---|---|---|---|---|---|---|---|---|
| 9 | Unnamed: 9 | object | 2652 | 1 | 0.0% | 976 | 28.7% | [Marihuana (kg), 14.33302, 1.4579, 0.0395, 1.0... |
| 7 | Unnamed: 7 | object | 1512 | 1 | 0.0% | 2217 | 65.2% | [Crack (piedras), 6644.4, 10406.6666666667, 20... |
| 5 | Unnamed: 5 | object | 1288 | 1 | 0.0% | 2316 | 68.1% | [Cocaína (kg), 0.05825, 0.7025, 0.00625, 0.329... |
| 8 | Unnamed: 8 | object | 633 | 1 | 0.0% | 3120 | 91.8% | [Marihuana Eventos, 769, 478, 21, 449, 450, 29... |
| 3 | Unnamed: 3 | object | 398 | 1 | 0.0% | 3395 | 99.9% | [Distrito, Alajuela, San José, Carrizal, San A... |
| 6 | Unnamed: 6 | object | 306 | 1 | 0.0% | 3256 | 95.8% | [Crack Eventos, 370, 289, 9, 60, 155, 3, 17, 1... |
| 4 | Unnamed: 4 | object | 212 | 1 | 0.0% | 3308 | 97.3% | [ Cocaína Eventos, 161, 144, 9, 34, 51, 3, 46,... |
| 2 | Unnamed: 2 | object | 94 | 1 | 0.0% | 3396 | 99.9% | [Cantón, Alajuela, San Ramón, Grecia, San Mate... |
| 1 | Unnamed: 1 | object | 13 | 1 | 0.0% | 3397 | 99.9% | [Provincia, Alajuela, Cartago, Guanacaste, Her... |
| 10 | Unnamed: 10 | object | 8 | 1 | 0.0% | 3398 | 99.9% | [Año, 2019, 2018, 2020, 2021, 2022, 2023, 2024... |
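
The summary reports zero numeric columns even though most columns hold counts and weights, which suggests the values were read as `object` because of the embedded header row. The snippet below is a sketch of one way to coerce such columns with `pd.to_numeric`. The data is made up, and the `"n/d"` placeholder is a hypothetical non-numeric entry, not something observed in the file.

```python
import pandas as pd

# Toy object-typed columns mixing strings, numbers and one non-numeric marker.
toy = pd.DataFrame(
    {"Cocaína (kg)": ["0.05825", 0.7025, "n/d"],
     "Año": [2019, "2020", 2021]},
    dtype=object,
)

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = toy.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric)
```

Counting the NaN values introduced by the coercion gives a rough measure of how many cells need manual review before aggregating.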
