In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_excel("../crime/data_cts_intentional_homicide.xlsx")
df.head()

Unnamed: 0,UNODC,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12
0,24/11/2024,,,,,,,,,,,,
1,Iso3_code,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
2,ARM,Armenia,Asia,Western Asia,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,35,CTS
3,CHE,Switzerland,Europe,Western Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,28,CTS
4,COL,Colombia,Americas,Latin America and the Caribbean,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,15053,CTS


## Cleaning
In the table displayed above we see that all the column names are unnamed and that row 1 actually holds the real column names. In the code below we therefore create a new dataframe that uses row 1 of the old df as its columns and copies everything from row 2 onwards as its rows, resulting in a much nicer and easier to read table.

In [3]:
new_columns = df.iloc[1]  
cleaned_df = df.iloc[2:].copy()  
cleaned_df.columns = new_columns 
cleaned_df.reset_index(drop=True, inplace=True)
cleaned_df.rename(columns={'Iso3_code': 'ISO'}, inplace=True) # same name as in weapon dataset
cleaned_df.head()

1,ISO,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
0,ARM,Armenia,Asia,Western Asia,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,35,CTS
1,CHE,Switzerland,Europe,Western Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,28,CTS
2,COL,Colombia,Americas,Latin America and the Caribbean,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,15053,CTS
3,CZE,Czechia,Europe,Eastern Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,69,CTS
4,DEU,Germany,Europe,Western Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2013,Counts,455,CTS
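
The header promotion above can also be wrapped in a small helper. This is only a sketch on a toy frame: `promote_header`, `raw` and `toy_clean` are made-up names and not part of the notebook. Assigning a plain list instead of the row itself means the columns do not carry the row label, which is where the stray `1` in the header of the output above comes from.

```python
import pandas as pd

# Toy frame that mimics the raw sheet: a date row, the real header row, then data.
raw = pd.DataFrame(
    [
        ["24/11/2024", None, None],
        ["Iso3_code", "Country", "Year"],
        ["ARM", "Armenia", "2013"],
        ["CHE", "Switzerland", "2013"],
    ],
    columns=["UNODC", "Unnamed: 1", "Unnamed: 2"],
)


def promote_header(frame, header_row):
    """Use the row at position `header_row` as column names and keep the rows below it."""
    out = frame.iloc[header_row + 1:].copy()
    # A plain list carries no name, so the columns index stays unnamed.
    out.columns = frame.iloc[header_row].tolist()
    return out.reset_index(drop=True)


toy_clean = promote_header(raw, header_row=1)
print(toy_clean)
```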


Next we need to filter on year, since the weapon data is only from 2018. We do that in the cell below.

In [4]:
df_2018 = cleaned_df[cleaned_df['Year'] == "2018"].copy() # note that year is in string format 
df_2018.reset_index(drop=True, inplace=True)
df_2018.head()

1,ISO,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
0,ALB,Albania,Europe,Southern Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2018,Counts,57,CTS
1,ARM,Armenia,Asia,Western Asia,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2018,Counts,40,CTS
2,ATG,Antigua and Barbuda,Americas,Latin America and the Caribbean,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2018,Counts,7,CTS
3,AUT,Austria,Europe,Western Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2018,Counts,44,CTS
4,AZE,Azerbaijan,Asia,Western Asia,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2018,Counts,189,CTS
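
Since the year is stored as a string, the numeric columns may need converting before any analysis. The sketch below shows one way to do that with `pd.to_numeric` on a toy frame. The frame `toy` and the `"n/a"` entry are invented for illustration, and whether the real `VALUE` column contains non-numeric entries is an assumption that has not been checked here.

```python
import pandas as pd

# Toy data: Year and VALUE are strings, as in the homicide sheet.
toy = pd.DataFrame(
    {
        "ISO": ["ALB", "ARM", "ALB"],
        "Year": ["2018", "2018", "2017"],
        "VALUE": ["57", "40", "n/a"],  # "n/a" is a made-up bad entry
    }
)

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
toy["Year"] = pd.to_numeric(toy["Year"], errors="coerce")
toy["VALUE"] = pd.to_numeric(toy["VALUE"], errors="coerce")

# After conversion the filter compares against the integer 2018, not the string "2018".
toy_2018 = toy[toy["Year"] == 2018].reset_index(drop=True)
print(toy_2018)
```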


## Merging  
Now we take a closer look at the countries in the dataset. We want them to overlap with the weapon dataset (merged_dataset.csv), so that is what we try here.

In [5]:
country_list = cleaned_df['Country'].unique().tolist()
print(f"Number of unique countries: {len(country_list)}")
print("List of countries:")
print(country_list)

Number of unique countries: 215
List of countries:
['Armenia', 'Switzerland', 'Colombia', 'Czechia', 'Germany', 'Finland', 'Guatemala', 'Honduras', 'Hungary', 'Iceland', 'Italy', 'Japan', 'Sri Lanka', 'Lithuania', 'Mongolia', 'Norway', 'Serbia', 'Slovakia', 'Slovenia', 'Türkiye', 'Antigua and Barbuda', 'Austria', 'Belgium', 'Belize', 'Bolivia (Plurinational State of)', 'Bhutan', 'Denmark', 'France', 'Greece', 'China, Macao Special Administrative Region', 'Russian Federation', 'Uzbekistan', 'Albania', 'Azerbaijan', 'Barbados', 'Chile', 'Spain', 'Grenada', 'Guyana', 'Croatia', 'Liechtenstein', 'Latvia', 'Montenegro', 'Panama', 'Dominica', 'Jordan', 'Saint Kitts and Nevis', 'Saint Lucia', 'Mexico', 'Malta', 'Oman', 'Trinidad and Tobago', 'Bulgaria', 'Bahamas', 'Canada', 'Costa Rica', 'Dominican Republic', 'El Salvador', 'Uruguay', 'Bosnia and Herzegovina', 'Saint Vincent and the Grenadines', 'Ecuador', 'Holy See', 'Indonesia', 'Morocco', 'Mauritius', 'Aruba', 'Anguilla', 'Australia', 'Bel

In [6]:
df_weapons = pd.read_csv("../new data/merged_dataset.csv")
df_weapons.head()

Unnamed: 0,ISO,Country,Continent,Subregion,Population,Estimate of firearms in civilian possession,Registered firearms,Unregistered firearms,Total law enforcement firearms,Total military firearms
0,ABW,Aruba,Americas,Caribbean,105000.0,3000,,,700,
1,AFG,Afghanistan,Asia,Southern Asia,34169000.0,4270000,,,239000,
2,AGO,Angola,Africa,Middle Africa,26656000.0,2982000,,,60000,203300.0
3,ALB,Albania,Europe,Southern Europe,2911000.0,350000,65747.0,284253.0,19000,21750.0
4,AND,Andorra,Europe,Southern Europe,69000.0,10000,7599.0,2401.0,976,


Here we merge the two dataframes on the ISO code and get rid of the duplicate Country and Subregion columns.
It is important to note that the weapon data for a country is now repeated on every homicide row of that country.

In [10]:
# Inner merge on ISO: only countries present in both datasets are kept
merged_df = pd.merge(df_2018, df_weapons, on='ISO', how='inner')
merged_df_cleaned = merged_df.copy()
# Both datasets have Country and Subregion; keep the homicide versions (_x) and drop the rest
merged_df_cleaned['Country'] = merged_df_cleaned['Country_x']
merged_df_cleaned['Subregion'] = merged_df_cleaned['Subregion_x']
merged_df_cleaned.drop(['Country_x', 'Country_y', 'Subregion_x', 'Subregion_y'], axis=1, inplace=True)
merged_df_cleaned.head()

#merged_df_cleaned.to_csv("merged_homicide_weapon.csv", index=False)

Unnamed: 0,ISO,Region,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source,Continent,Population,Estimate of firearms in civilian possession,Registered firearms,Unregistered firearms,Total law enforcement firearms,Total military firearms,Country,Subregion
0,ALB,Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2018,Counts,57,CTS,Europe,2911000.0,350000,65747.0,284253.0,19000,21750.0,Albania,Southern Europe
1,ALB,Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Female,Total,2018,Counts,2,CTS,Europe,2911000.0,350000,65747.0,284253.0,19000,21750.0,Albania,Southern Europe
2,ALB,Europe,Persons arrested/suspected for intentional hom...,by citizenship,Foreign citizens,Male,Total,2018,Counts,1,CTS,Europe,2911000.0,350000,65747.0,284253.0,19000,21750.0,Albania,Southern Europe
3,ALB,Europe,Persons arrested/suspected for intentional hom...,by citizenship,Foreign citizens,Female,Total,2018,Counts,0,CTS,Europe,2911000.0,350000,65747.0,284253.0,19000,21750.0,Albania,Southern Europe
4,ALB,Europe,Victims of intentional homicide,Total,Total,Total,Total,2018,Counts,66,MD/CTS/GSH 2023 Revision/NSO,Europe,2911000.0,350000,65747.0,284253.0,19000,21750.0,Albania,Southern Europe
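
To make the repetition of the weapon data explicit, and to see which ISO codes have no match, `pd.merge` offers the `validate` and `indicator` parameters. The following is a sketch on toy data. The frames `homicide` and `weapons` and the code `XXX` are made up, and it assumes the weapon dataset has one row per ISO code, which the notebook does not verify.

```python
import pandas as pd

# Toy stand-ins: several homicide rows per country, one weapon row per country.
homicide = pd.DataFrame({"ISO": ["ALB", "ALB", "XXX"], "VALUE": [57, 2, 5]})
weapons = pd.DataFrame({"ISO": ["ALB", "AND"], "Population": [2911000.0, 69000.0]})

# validate="many_to_one" raises if an ISO code appears more than once in `weapons`.
# indicator=True adds a `_merge` column telling where each row came from.
check = pd.merge(
    homicide, weapons, on="ISO", how="left",
    validate="many_to_one", indicator=True,
)

# ISO codes in the homicide data that have no weapon data.
unmatched = sorted(check.loc[check["_merge"] == "left_only", "ISO"].unique())
print(unmatched)
print(check)
```

A left merge is used here only so that the unmatched rows stay visible; the notebook's inner merge drops them.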


Let's look at the countries again...
After the 2018 filter and the inner merge there are 135 left.

In [11]:
country_list = merged_df_cleaned['Country'].unique().tolist()
print(f"Number of unique countries: {len(country_list)}")
print("List of countries:")
print(country_list)

Number of unique countries: 135
List of countries:
['Albania', 'Armenia', 'Antigua and Barbuda', 'Austria', 'Azerbaijan', 'Belgium', 'Bulgaria', 'Bahamas', 'Bolivia (Plurinational State of)', 'Barbados', 'Switzerland', 'Chile', 'Colombia', 'Czechia', 'Germany', 'Dominica', 'Denmark', 'Spain', 'Finland', 'France', 'Greece', 'Guyana', 'Croatia', 'Hungary', 'Iceland', 'Italy', 'Jordan', 'Saint Kitts and Nevis', 'Saint Lucia', 'Sri Lanka', 'Lithuania', 'Latvia', 'Mexico', 'Malta', 'Mongolia', 'Norway', 'Oman', 'Russian Federation', 'Serbia', 'Slovakia', 'Slovenia', 'Trinidad and Tobago', 'Türkiye', 'Uzbekistan', 'Afghanistan', 'Argentina', 'American Samoa', 'Australia', 'Bangladesh', 'Bahrain', 'Bosnia and Herzegovina', 'Belarus', 'Belize', 'Bermuda', 'Brazil', 'Bhutan', 'Canada', 'China', 'Cabo Verde', 'Costa Rica', 'Cuba', 'Cayman Islands', 'Cyprus', 'Dominican Republic', 'Algeria', 'Ecuador', 'Estonia', 'Georgia', 'Ghana', 'Grenada', 'Guatemala', 'French Guiana', 'China, Hong Kong Speci

Because the merged dataset repeats the weapon data on multiple rows, we also create a new dataframe that contains the homicide data for only the overlapping countries, without the weapon data, in case this is easier to use for analysis.

In [12]:
homicide_filtered = df_2018[df_2018['ISO'].isin(merged_df_cleaned['ISO'])].copy()
homicide_filtered.reset_index(drop=True, inplace=True)
homicide_filtered.head()

#homicide_filtered.to_csv("homicide_data_overlapping.csv", index=False)

1,ISO,Country,Region,Subregion,Indicator,Dimension,Category,Sex,Age,Year,Unit of measurement,VALUE,Source
0,ALB,Albania,Europe,Southern Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2018,Counts,57,CTS
1,ARM,Armenia,Asia,Western Asia,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2018,Counts,40,CTS
2,ATG,Antigua and Barbuda,Americas,Latin America and the Caribbean,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2018,Counts,7,CTS
3,AUT,Austria,Europe,Western Europe,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2018,Counts,44,CTS
4,AZE,Azerbaijan,Asia,Western Asia,Persons arrested/suspected for intentional hom...,by citizenship,National citizens,Male,Total,2018,Counts,189,CTS


When looking at the countries of the filtered dataset we see that they are indeed the same :-)

In [13]:
country_list = homicide_filtered['Country'].unique().tolist()
print(f"Number of unique countries: {len(country_list)}")
print("List of countries:")
print(country_list)

Number of unique countries: 135
List of countries:
['Albania', 'Armenia', 'Antigua and Barbuda', 'Austria', 'Azerbaijan', 'Belgium', 'Bulgaria', 'Bahamas', 'Bolivia (Plurinational State of)', 'Barbados', 'Switzerland', 'Chile', 'Colombia', 'Czechia', 'Germany', 'Dominica', 'Denmark', 'Spain', 'Finland', 'France', 'Greece', 'Guyana', 'Croatia', 'Hungary', 'Iceland', 'Italy', 'Jordan', 'Saint Kitts and Nevis', 'Saint Lucia', 'Sri Lanka', 'Lithuania', 'Latvia', 'Mexico', 'Malta', 'Mongolia', 'Norway', 'Oman', 'Russian Federation', 'Serbia', 'Slovakia', 'Slovenia', 'Trinidad and Tobago', 'Türkiye', 'Uzbekistan', 'Afghanistan', 'Argentina', 'American Samoa', 'Australia', 'Bangladesh', 'Bahrain', 'Bosnia and Herzegovina', 'Belarus', 'Belize', 'Bermuda', 'Brazil', 'Bhutan', 'Canada', 'China', 'Cabo Verde', 'Costa Rica', 'Cuba', 'Cayman Islands', 'Cyprus', 'Dominican Republic', 'Algeria', 'Ecuador', 'Estonia', 'Georgia', 'Ghana', 'Grenada', 'Guatemala', 'French Guiana', 'China, Hong Kong Speci
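
Comparing two long printed lists by eye is error-prone, so the same check can be done with sets. This is a sketch with toy frames (`merged_toy` and `filtered_toy` are invented names); in the notebook the comparison would be between the `ISO` columns of `merged_df_cleaned` and `homicide_filtered`.

```python
import pandas as pd

# Toy stand-ins for the merged and the filtered dataframes.
merged_toy = pd.DataFrame({"ISO": ["ALB", "ALB", "ARM"]})
filtered_toy = pd.DataFrame({"ISO": ["ALB", "ARM", "ARM"]})

# Sets ignore order and duplicates, so equality means the same countries occur in both.
same_countries = set(merged_toy["ISO"]) == set(filtered_toy["ISO"])

# The set difference lists any code that is present in one frame but not the other.
only_in_merged = set(merged_toy["ISO"]) - set(filtered_toy["ISO"])

print(same_countries, only_in_merged)
```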

One last thing to note is that we have not yet filtered out the rows with NULL values!
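
When that filtering is done later, a first step could be to count the missing values per column and then drop rows only where a column needed for the analysis is empty. The sketch below uses a toy frame. The choice of `VALUE` as the required column is an assumption, not something decided in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both a homicide column and a weapon column.
toy = pd.DataFrame(
    {
        "ISO": ["ALB", "ABW", "AND"],
        "VALUE": [57, np.nan, 40],
        "Registered firearms": [65747.0, np.nan, np.nan],
    }
)

# Number of missing values per column.
null_counts = toy.isna().sum()
print(null_counts)

# Drop rows only when VALUE is missing; gaps in other columns are kept.
complete_value = toy.dropna(subset=["VALUE"]).reset_index(drop=True)
print(complete_value)
```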