# **Combining Datasets for Firearm and Crime Analysis**
**This notebook processes and merges several datasets with information on firearm ownership, crime rates, and GDP per country.
The goal is to prepare a clean, structured dataset for further statistical analysis.
We perform data cleaning, feature engineering, and merging operations.**

In [41]:
import pandas as pd

## **Weapon data**
**We begin by loading the firearm datasets.
The weapon, crime, and GDP datasets are structured differently, so the cells below rename columns and apply transformations to make them consistent before merging.**

In [42]:
df1 = pd.read_csv('../new-data/separate_datasets/civilian dataset.csv')
# Harmonize the column names with the other datasets
df1.rename(columns={'Population 2017': 'Population', 'Country code': 'ISO', 'Region': 'Continent'}, inplace=True)
df1.head(5)

Unnamed: 0.1,Unnamed: 0,ISO,Country,Continent,Subregion,Population,Estimate of firearms in civilian possession,Estimate of Computation civilian firearms method per 100 persons,Method of computation,Registered firearms,Unregistered firearms
0,0,AFG,Afghanistan,Asia,Southern Asia,34169000.0,4270000,12.5,2.0,,
1,1,ALB,Albania,Europe,Southern Europe,2911000.0,350000,12.0,2.0,65747.0,284253.0
2,2,DZA,Algeria,Africa,Northern Africa,41064000.0,877000,2.1,2.0,200000.0,677000.0
3,3,ASM,American Samoa,Oceania,Polynesia,56000.0,400,0.7,2.0,250.0,150.0
4,4,AND,Andorra,Europe,Southern Europe,69000.0,10000,14.1,3.0,7599.0,2401.0


In [43]:
df2 = pd.read_csv('../new-data/separate_datasets/law_enforcement dataset.csv')
df2.head(5)

Unnamed: 0.1,Unnamed: 0,ISO,Country,Continent,Subregion,Population,Active Personel,Total firearms,Computation method
0,6,AFG,Afghanistan,Asia,Southern Asia,34169000,159600,239000,2.0
1,9,ALB,Albania,Europe,Southern Europe,2911000,9743,19000,2.0
2,10,DZA,Algeria,Africa,Northern Africa,41064000,302178,363000,2.0
3,15,ASM,American Samoa,Oceania,Polynesia,56000,115,90,3.0
4,16,AND,Andorra,Europe,Southern Europe,69000,341,976,1.0


In [44]:
df3 = pd.read_csv('../new-data/separate_datasets/military dataset.csv')
# Harmonize the column names with the other datasets
df3.rename(columns={'Country Code': 'ISO', 'Region': 'Continent'}, inplace=True)
df3.head(5)

Unnamed: 0.1,Unnamed: 0,ISO,Country,Continent,Subregion,Population,Total military firearms,Computation method
0,0,ALB,Albania,Europe,Southern Europe,2911000,21750,2
1,1,DZA,Algeria,Africa,Northern Africa,41064000,637720,2
2,2,AGO,Angola,Africa,Middle Africa,26656000,203300,2
3,3,ATG,Antigua and Barbuda,Americas,Caribbean,94000,438,2
4,4,ARG,Argentina,Americas,South America,44272000,679770,2


In [45]:
# Delete redundant rows: rows whose "ISO" cell holds the literal text "ISO" or "Country"
# (presumably leftover header rows) instead of a country code
values_to_remove = ["ISO", "Country"]
df1 = df1[~df1["ISO"].isin(values_to_remove)].copy()
df2 = df2[~df2["ISO"].isin(values_to_remove)].copy()
df3 = df3[~df3["ISO"].isin(values_to_remove)].copy()
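Why this filter matters for the next step can be shown on a small toy frame. The sketch below assumes a stray row that repeats the header text; the frame `raw` and its values are illustrative and not taken from the real files. A single text cell keeps the whole `Population` column non-numeric, and `pd.to_numeric(..., errors="coerce")` turns any such cell into NaN instead of raising an error.

```python
import pandas as pd

# Toy frame (hypothetical): the middle row repeats the header text
raw = pd.DataFrame({
    "ISO": ["AFG", "ISO", "ALB"],
    "Population": ["34169000", "Population", "2911000"],
})

# With a text cell present, the column is not numeric
print(pd.api.types.is_numeric_dtype(raw["Population"]))

# errors="coerce" replaces unparseable cells with NaN
coerced = pd.to_numeric(raw["Population"], errors="coerce")
print(int(coerced.isna().sum()))

# Dropping the stray row first leaves a clean numeric column
clean = raw[raw["ISO"] != "ISO"].copy()
clean["Population"] = pd.to_numeric(clean["Population"], errors="coerce")
print(clean["Population"].tolist())
```

Filtering before converting means no country row ends up with a missing population just because a header line slipped into the data.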

**Now we merge the weapon datasets**

In [46]:
# Create a merged dataset with only the essential info
for df in [df1, df2, df3]:
    df["Population"] = pd.to_numeric(df["Population"], errors="coerce")

merged_df = pd.merge(df1, df2, on=["ISO", "Country", "Continent", "Subregion", "Population"], how="outer")
merged_df = pd.merge(merged_df, df3, on=["ISO", "Country", "Continent", "Subregion", "Population"], how="outer")

final_columns = [ "ISO", "Country", "Continent", "Subregion", "Population"]

merged_df = merged_df[final_columns]

# Mappings from df1 to merged_df
m1 = df1.set_index("ISO")["Estimate of firearms in civilian possession"]
merged_df["Estimate of firearms in civilian possession"] = merged_df["ISO"].map(m1)
m2 = df1.set_index("ISO")["Registered firearms"]
merged_df["Registered firearms"] = merged_df["ISO"].map(m2)
m3 = df1.set_index("ISO")["Unregistered firearms"]
merged_df["Unregistered firearms"] = merged_df["ISO"].map(m3)

# Mappings from df2 to merged_df
m4 = df2.set_index("ISO")["Total firearms"]
merged_df["Total firearms"] = merged_df["ISO"].map(m4)

# Mappings from df3 to merged_df
m5 = df3.set_index("ISO")["Total military firearms"]
merged_df["Total military firearms"] = merged_df["ISO"].map(m5)

merged_df.rename(columns={'Total firearms': 'Total law enforcement firearms'}, inplace=True)

merged_df.head(5)

Unnamed: 0,ISO,Country,Continent,Subregion,Population,Estimate of firearms in civilian possession,Registered firearms,Unregistered firearms,Total law enforcement firearms,Total military firearms
0,ABW,Aruba,Americas,Caribbean,105000.0,3000,,,700,
1,AFG,Afghanistan,Asia,Southern Asia,34169000.0,4270000,,,239000,
2,AGO,Angola,Africa,Middle Africa,26656000.0,2982000,,,60000,203300.0
3,ALB,Albania,Europe,Southern Europe,2911000.0,350000,65747.0,284253.0,19000,21750.0
4,AND,Andorra,Europe,Southern Europe,69000.0,10000,7599.0,2401.0,976,
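As an alternative to merging on the key columns and then mapping each value column back by ISO code, the value columns could be carried through a chain of outer merges directly. The following is only a sketch under assumptions: the three toy frames and the shortened key list `keys` are illustrative, with a few values copied from the tables above.

```python
from functools import reduce

import pandas as pd

keys = ["ISO", "Country"]  # shortened key list, for illustration only

civilian = pd.DataFrame({
    "ISO": ["AFG", "AGO", "ALB"],
    "Country": ["Afghanistan", "Angola", "Albania"],
    "Estimate of firearms in civilian possession": [4270000, 2982000, 350000],
})
law_enforcement = pd.DataFrame({
    "ISO": ["AFG", "AGO", "ALB"],
    "Country": ["Afghanistan", "Angola", "Albania"],
    "Total law enforcement firearms": [239000, 60000, 19000],
})
military = pd.DataFrame({
    "ISO": ["AGO", "ALB"],
    "Country": ["Angola", "Albania"],
    "Total military firearms": [203300, 21750],
})

# An outer merge keeps every country that appears in at least one source;
# values missing from a source become NaN
merged = reduce(
    lambda left, right: pd.merge(left, right, on=keys, how="outer"),
    [civilian, law_enforcement, military],
)
merged = merged.sort_values("ISO").reset_index(drop=True)
print(merged)
```

Afghanistan has no row in the toy military frame, so its `Total military firearms` cell is NaN, matching the pattern in the output above.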


**Note that the dataset also has columns with registered/unregistered firearm estimates, but we do not use them in our analysis. They contain a lot of NULL values, and we are interested in the number of weapons in civilian, military, and law enforcement possession anyway.**

In [47]:
merged_df.to_csv("merged_dataset.csv", index=False) 

#### From here on, we add GDP to the merged homicide and weapon dataset and calculate the per capita figures. We also add the violent and sexual crime numbers

In [48]:
df = pd.read_csv('../new-data/merged_homicide_weapon_no_NULL.csv')
desired_order = ["ISO", "Country", "Continent", "Subregion", "Year", "Population", "Estimate of firearms in civilian possession", 
                 "Total law enforcement firearms", "Total military firearms", "Total Homicides"]
df = df[desired_order]

gdp = pd.read_excel('../new-data/separate_datasets/GDP.xlsx')
sexual_violent_df = pd.read_excel('../crime/data_cts_violent_and_sexual_crime.xlsx')

In [49]:
gdp.head(5)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,Aruba,ABW,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,...,2790850000.0,2962907000.0,2983635000.0,3092429000.0,3276184000.0,3395799000.0,2481857000.0,2929447000.0,3279344000.0,3648573000.0
1,Africa Eastern and Southern,AFE,GDP (current US$),NY.GDP.MKTP.CD,24210630000.0,24963980000.0,27078800000.0,31775750000.0,30285790000.0,33813170000.0,...,978708000000.0,898278000000.0,828943000000.0,972999000000.0,1012310000000.0,1009720000000.0,933392000000.0,1085750000000.0,1191420000000.0,1245470000000.0
2,Afghanistan,AFG,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,...,20497130000.0,19134220000.0,18116570000.0,18753460000.0,18053220000.0,18799440000.0,19955930000.0,14260000000.0,14497240000.0,17233050000.0
3,Africa Western and Central,AFW,GDP (current US$),NY.GDP.MKTP.CD,11904950000.0,12707880000.0,13630760000.0,14469090000.0,15803760000.0,16921090000.0,...,897416000000.0,771767000000.0,694361000000.0,687849000000.0,770495000000.0,826484000000.0,789802000000.0,849312000000.0,883974000000.0,799106000000.0
4,Angola,AGO,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,...,135967000000.0,90496420000.0,52761620000.0,73690150000.0,79450690000.0,70897960000.0,48501560000.0,66505130000.0,104400000000.0,84824650000.0


**We only need the GDP figures for 2018, the year we are interested in**

In [50]:
mapping1 = gdp.set_index("Country Code")[2018]
df["Country GDP"] = df["ISO"].map(mapping1)

df.head(5)

Unnamed: 0,ISO,Country,Continent,Subregion,Year,Population,Estimate of firearms in civilian possession,Total law enforcement firearms,Total military firearms,Total Homicides,Country GDP
0,ALB,Albania,Europe,Southern Europe,2018,2911000.0,350000,19000,21750.0,6248,15379510000.0
1,DZA,Algeria,Africa,Northern Africa,2018,41064000.0,877000,363000,637720.0,9855,194554000000.0
2,ATG,Antigua and Barbuda,Americas,Latin America and the Caribbean,2018,94000.0,5000,800,438.0,534,1661530000.0
3,ARG,Argentina,Americas,Latin America and the Caribbean,2018,44272000.0,3256000,391000,679770.0,119890,524820000000.0
4,ARM,Armenia,Asia,Western Asia,2018,3032000.0,186000,18000,509240.0,1237,12457940000.0
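A lookup like this leaves NaN for every ISO code that has no GDP value, so it can help to list those codes right after mapping. The sketch below runs on toy frames. The helper `year_column` is hypothetical and only guards against the year label being stored as an integer or as a string, which depends on how the spreadsheet was read.

```python
import pandas as pd

# Toy stand-ins for the GDP table and the main dataset
gdp_toy = pd.DataFrame({
    "Country Code": ["ALB", "DZA"],
    "2018": [15379510000.0, 194554000000.0],
})
main_toy = pd.DataFrame({"ISO": ["ALB", "DZA", "VEN"]})


def year_column(frame, year):
    """Return the column label for `year`, whether stored as int or str."""
    for label in (year, str(year)):
        if label in frame.columns:
            return label
    raise KeyError(f"no column found for year {year}")


gdp_2018 = gdp_toy.set_index("Country Code")[year_column(gdp_toy, 2018)]
main_toy["Country GDP"] = main_toy["ISO"].map(gdp_2018)

# ISO codes that received no GDP value
missing = main_toy.loc[main_toy["Country GDP"].isna(), "ISO"].tolist()
print(missing)
```

In this toy example the list contains only `VEN`, the kind of gap that then has to be filled by hand or dropped.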


**Here we create the "per 1,000 people" columns that we will use for the analysis**

Note that we fill in the GDP of Venezuela (VEN) manually. It was missing from the GDP data, but VEN had all the weapon data, so we wanted to keep it rather than lose another country. A Google search helped us find the right figure: https://countryeconomy.com/gdp/venezuela#:~:text=The%20GDP%20figure%20in%202018,2017%2C%20when%20it%20was%20%243%2C791

In [51]:
columns_to_convert = [
    "Estimate of firearms in civilian possession",
    "Total law enforcement firearms",
    "Total military firearms",
    "Total Homicides",
    "Country GDP",
]

df.loc[df["ISO"] == "VEN", "Country GDP"] = 102021000000.00 # fill in venezuela GDP manually

for column in columns_to_convert:
    df[f"{column} per 1,000"] = (df[column] / df["Population"]) * 1000

df.head(5)

Unnamed: 0,ISO,Country,Continent,Subregion,Year,Population,Estimate of firearms in civilian possession,Total law enforcement firearms,Total military firearms,Total Homicides,Country GDP,"Estimate of firearms in civilian possession per 1,000","Total law enforcement firearms per 1,000","Total military firearms per 1,000","Total Homicides per 1,000","Country GDP per 1,000"
0,ALB,Albania,Europe,Southern Europe,2018,2911000.0,350000,19000,21750.0,6248,15379510000.0,120.233597,6.526967,7.471659,2.146341,5283239.0
1,DZA,Algeria,Africa,Northern Africa,2018,41064000.0,877000,363000,637720.0,9855,194554000000.0,21.356906,8.83986,15.529905,0.239991,4737824.0
2,ATG,Antigua and Barbuda,Americas,Latin America and the Caribbean,2018,94000.0,5000,800,438.0,534,1661530000.0,53.191489,8.510638,4.659574,5.680851,17675850.0
3,ARG,Argentina,Americas,Latin America and the Caribbean,2018,44272000.0,3256000,391000,679770.0,119890,524820000000.0,73.545356,8.831767,15.3544,2.708032,11854450.0
4,ARM,Armenia,Asia,Western Asia,2018,3032000.0,186000,18000,509240.0,1237,12457940000.0,61.345646,5.936675,167.955145,0.407982,4108819.0
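The per 1,000 figures can be checked by hand against one row. The sketch below recomputes two of Albania's values from the table above; the helper name `per_thousand` is ours and is not part of the notebook.

```python
def per_thousand(total, population):
    """Scale a country-level total to a value per 1,000 inhabitants."""
    return total / population * 1000


# Albania's row from the table above
population = 2911000.0
civilian_firearms = 350000
country_gdp = 15379510000.0

civilian_rate = per_thousand(civilian_firearms, population)
gdp_rate = per_thousand(country_gdp, population)

print(round(civilian_rate, 6))  # 120.233597
print(round(gdp_rate))          # 5283239
```

Note that "Country GDP per 1,000" is simply GDP per capita multiplied by 1,000, so it ranks countries exactly as GDP per capita would.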


### **Adding the violent and sexual crimes**

**Violent and sexual crime data come in a single dataset. Because of the way the dataset is organized, accurately separating the two crime types into separate metrics is difficult. We therefore introduce a single metric, stored below as "Total Sexual and Violent Crime Rates per 1,000".**

In [52]:
sexual_violent_df = sexual_violent_df[sexual_violent_df['Year'] == 2018] # Isolate only year 2018, the one we are interested in

population_mapping = df.set_index('ISO')['Population'].to_dict()
sexual_violent_df['Population'] = sexual_violent_df['ISO'].map(population_mapping)

# In some cases, the unit of measurement is not a count but a rate per 100,000; we convert such instances back to counts using the population data we added above
sexual_violent_df.loc[
    sexual_violent_df['Unit of measurement'] == 'Rate per 100,000', 'VALUE'
] = (sexual_violent_df['VALUE'] * sexual_violent_df['Population']) / 100000

# Group by 'ISO' and sum the 'VALUE' column - we will map with the other dataframe so we only need the ISO and the count
new_sexual_violent_df = sexual_violent_df.groupby('ISO', as_index=False)['VALUE'].sum()
new_sexual_violent_df.rename(columns={'VALUE': 'Total S&V Crime Count'}, inplace=True)

print(new_sexual_violent_df.shape[0])
new_sexual_violent_df.head(5)

116


Unnamed: 0,ISO,Total S&V Crime Count
0,ALB,1364.67589
1,AND,947.235911
2,ARE,774.380588
3,ARG,169139.973736
4,ARM,416.172111
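The rate-to-count conversion is count = rate × population / 100,000. A minimal sketch on toy rows shows the effect; the country codes `AAA` and `BBB`, the unit label `Counts`, and all values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "ISO": ["AAA", "AAA", "BBB"],
    "Unit of measurement": ["Counts", "Rate per 100,000", "Rate per 100,000"],
    "VALUE": [120.0, 5.0, 2.5],
    "Population": [1_000_000, 1_000_000, 400_000],
})

# Convert only the rows expressed as a rate; count rows stay untouched
is_rate = toy["Unit of measurement"] == "Rate per 100,000"
toy.loc[is_rate, "VALUE"] = (
    toy.loc[is_rate, "VALUE"] * toy.loc[is_rate, "Population"] / 100_000
)

# Sum everything per country, as in the cell above
totals = toy.groupby("ISO", as_index=False)["VALUE"].sum()
print(totals)
```

Here `AAA` ends up with 120 + 50 = 170 and `BBB` with 10. One caveat to keep in mind: if a source reports the same offence both as a count and as a rate, summing both rows would count it twice. Whether that applies depends on the layout of the dataset, which is not shown here.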


**Here we merge the total sexual and violent crime counts from new_sexual_violent_df into our main dataset using ISO country codes. Missing values are replaced with zero, and crime counts are rounded to whole numbers. We then calculate the crime rate per 1,000 people, which we need for the analysis. Finally, the dataset columns are reordered for clarity.**

In [53]:
mapping1 = new_sexual_violent_df.set_index("ISO")["Total S&V Crime Count"]
df["Total Sexual and Violent Crime Counts"] = df["ISO"].map(mapping1)

df["Total Sexual and Violent Crime Counts"] = df["Total Sexual and Violent Crime Counts"].fillna(0).apply(round)
df["Total Sexual and Violent Crime Rates per 1,000"] = (df["Total Sexual and Violent Crime Counts"] / df["Population"]) * 1000

new_order = [
    "ISO",
    "Country",
    "Continent",
    "Subregion",
    "Year",
    "Population",
    "Estimate of firearms in civilian possession",
    "Total law enforcement firearms",
    "Total military firearms",
    "Estimate of firearms in civilian possession per 1,000",
    "Total law enforcement firearms per 1,000",
    "Total military firearms per 1,000",
    "Total Homicides",
    "Total Sexual and Violent Crime Counts",
    "Total Homicides per 1,000",
    "Total Sexual and Violent Crime Rates per 1,000",
    "Country GDP",
    "Country GDP per 1,000"
]

df = df[new_order]

print(df.shape[0])
df.head(5)

93


Unnamed: 0,ISO,Country,Continent,Subregion,Year,Population,Estimate of firearms in civilian possession,Total law enforcement firearms,Total military firearms,"Estimate of firearms in civilian possession per 1,000","Total law enforcement firearms per 1,000","Total military firearms per 1,000",Total Homicides,Total Sexual and Violent Crime Counts,"Total Homicides per 1,000","Total Sexual and Violent Crime Rates per 1,000",Country GDP,"Country GDP per 1,000"
0,ALB,Albania,Europe,Southern Europe,2018,2911000.0,350000,19000,21750.0,120.233597,6.526967,7.471659,6248,1365,2.146341,0.468911,15379510000.0,5283239.0
1,DZA,Algeria,Africa,Northern Africa,2018,41064000.0,877000,363000,637720.0,21.356906,8.83986,15.529905,9855,20008,0.239991,0.487239,194554000000.0,4737824.0
2,ATG,Antigua and Barbuda,Americas,Latin America and the Caribbean,2018,94000.0,5000,800,438.0,53.191489,8.510638,4.659574,534,456,5.680851,4.851064,1661530000.0,17675850.0
3,ARG,Argentina,Americas,Latin America and the Caribbean,2018,44272000.0,3256000,391000,679770.0,73.545356,8.831767,15.3544,119890,169140,2.708032,3.820473,524820000000.0,11854450.0
4,ARM,Armenia,Asia,Western Asia,2018,3032000.0,186000,18000,509240.0,61.345646,5.936675,167.955145,1237,416,0.407982,0.137203,12457940000.0,4108819.0


In [54]:
df.to_csv("everything_merged_dataset_with_NULL_violent&sexual.csv", index=False) 

**Some countries have no information (0) in the sexual and violent crime columns. The CSV file saved above still contains those countries. In the following cells, we identify the countries with missing entries in those columns and create a new CSV file in which they are no longer present.**

In [55]:
countries_to_remove = df[df['Total Sexual and Violent Crime Counts'] == 0]
countries_to_remove[['ISO', 'Country']]

Unnamed: 0,ISO,Country
9,BHR,Bahrain
10,BGD,Bangladesh
23,CHN,China
38,IND,India
43,JPN,Japan
46,MYS,Malaysia
48,MRT,Mauritania
49,MUS,Mauritius
56,NPL,Nepal
68,QAT,Qatar
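Using 0 as the marker for "no data" works here only as long as no country has a true count of zero. A possible alternative, sketched below on toy data, keeps the missing entries as NaN and filters on those instead. The frame `main` and the column name `count` are illustrative; the two counts are taken from the output further above.

```python
import pandas as pd

main = pd.DataFrame({"ISO": ["ALB", "CHN", "ARM"]})
crime_counts = pd.Series({"ALB": 1364.67589, "ARM": 416.172111})

# Countries without an entry in the crime table stay NaN after mapping
main["count"] = main["ISO"].map(crime_counts)
no_data = main.loc[main["count"].isna(), "ISO"].tolist()
print(no_data)

# Keep only countries with data, then round to whole counts
complete = main.dropna(subset=["count"]).copy()
complete["count"] = complete["count"].round().astype(int)
print(complete)
```

Both routes remove the same countries as long as every zero in the column really stands for a missing entry.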


In [56]:
# .copy() gives an independent DataFrame, so adding columns below does not raise SettingWithCopyWarning
df_no_0 = df[df['Total Sexual and Violent Crime Counts'] != 0].copy()

**Create a "combined" crime metric by summing the three categories of crime (the homicide rate plus the combined sexual and violent crime rate), and also a metric for the *total* number of firearms per 1,000 people. We need these metrics for the analysis as well.**

In [57]:
# Create a new column for combined crime rate
df_no_0["Combined Crime Rate per 1,000"] = (
    df_no_0["Total Homicides per 1,000"] +
    df_no_0["Total Sexual and Violent Crime Rates per 1,000"]
)

# Create a new column for total firearms per 1,000 people
df_no_0["Total Firearms per 1,000"] = (
    df_no_0["Estimate of firearms in civilian possession per 1,000"] +
    df_no_0["Total law enforcement firearms per 1,000"] +
    df_no_0["Total military firearms per 1,000"]
)

df_no_0.to_csv("everything_merged_dataset_no_NULL_violent&sexual.csv", index=False)


**To sum up, in the most stripped-down version of the dataset, which should have no NULL values in any of the columns, we have 77 countries in total to work with!**

In [58]:
df_no_0.shape[0]

77
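The claim that no NULL values remain can be verified directly. The helper below is a hypothetical sketch, shown on a toy frame in which Andorra lacks a military firearms figure, as in the weapon table earlier in the notebook.

```python
import pandas as pd


def columns_with_missing(frame):
    """Return {column: number of missing cells} for columns that have any."""
    missing = frame.isna().sum()
    return {column: int(n) for column, n in missing.items() if n > 0}


toy = pd.DataFrame({
    "ISO": ["ALB", "AND"],
    "Total military firearms": [21750.0, None],
})
print(columns_with_missing(toy))
```

An empty dictionary for the final dataset would confirm that every column is fully populated.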