# Legislation datasets

### Dataset 1: domestic violence
There is legislation specifically addressing domestic violence (1=yes; 0=no)
Download: https://genderdata.worldbank.org/en/indicator/sg-leg-dvaw

The indicator measures whether there is legislation addressing domestic violence that includes criminal sanctions or provides for protection orders for domestic violence, or the legislation addresses "harassment" that clearly leads to physical or mental harm in the context of domestic violence.
(source: https://genderdata.worldbank.org/en/indicator/sg-leg-dvaw)

In [None]:
# -*- coding: utf-8 -*-
"""
Created on Fri Sep 26 10:29:02 2025

@author: elske
"""

import pandas as pd

# Loading the datasets

import zipfile

zip_path = r".\datasets\there-is-legislation-specifically-addressing-domestic-violence-1-yes-0-no.zip"
dfs = []

with zipfile.ZipFile(zip_path, 'r') as z:
    print("Files in ZIP:", z.namelist())

    for filename in z.namelist():
        if filename.endswith(".csv"): 
            with z.open(filename) as f:
                df = pd.read_csv(f)
                df["source_file"] = filename  
                dfs.append(df)

legis1 = pd.concat(dfs, ignore_index=True)

print(legis1.head())

### Dataset 2: sexual harassment in employment
Criminal penalties or civil remedies exist for sexual harassment in employment (1=yes; 0=no)
Download: https://genderdata.worldbank.org/en/indicator/sg-pen-sxhr-em 

The indicator measures whether the law establishes criminal sanctions, such as fines or imprisonment, for sexual harassment in employment; if the provision in the criminal code provides for reparation of damages for offenses covered by the code; or if the law provides for civil remedies or compensation for victims of sexual harassment in employment or the workplace, even after dismissal of the victims.
(source: https://genderdata.worldbank.org/en/indicator/sg-pen-sxhr-em)


In [None]:

zip_path = r".\datasets\SG.PEN.SXHR.EM.zip"
dfs = []

with zipfile.ZipFile(zip_path, 'r') as z:
    print("Files in ZIP:", z.namelist())

    for filename in z.namelist():
        if filename.endswith(".csv"): 
            with z.open(filename) as f:
                df = pd.read_csv(f)
                df["source_file"] = filename  
                dfs.append(df)

legis2 = pd.concat(dfs, ignore_index=True)

print(legis2.head())
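
Both loading cells follow the same steps, so they could be wrapped in one helper. The sketch below is one possible way to do that; the function name `load_csvs_from_zip` and the in-memory demo archive are assumptions for illustration and are not part of the original analysis.

```python
import io
import zipfile

import pandas as pd


def load_csvs_from_zip(zip_source):
    """Read every CSV in a ZIP archive and stack them into one DataFrame."""
    frames = []
    with zipfile.ZipFile(zip_source, "r") as z:
        for filename in z.namelist():
            if filename.endswith(".csv"):
                with z.open(filename) as f:
                    df = pd.read_csv(f)
                    # Remember which file each row came from
                    df["source_file"] = filename
                    frames.append(df)
    return pd.concat(frames, ignore_index=True)


# Build a tiny in-memory archive so the sketch runs without the real downloads
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("a.csv", "Country Code,Year,Value\nNLD,2023,1\n")
    z.writestr("notes.txt", "not a csv")  # skipped by the .csv filter
buffer.seek(0)

demo = load_csvs_from_zip(buffer)
print(demo)
```

With the real downloads, the same helper would be called with each `zip_path` in turn.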


### Background information:
Both datasets cover the period from 1970 to 2023. The data are collected after the end of each year, so a value labeled 2023 describes the situation in 2022. The datasets include almost all countries worldwide, with a few exceptions such as Greenland, Turkmenistan and North Korea. Both datasets share the same structure: country names and codes, years, the indicator name and a binary value marking whether the legislation exists (1 = yes, 0 = no).
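
The claims above (year range, country coverage, binary values) can be checked directly on a loaded frame. The following is a minimal sketch on a toy frame; the helper name `describe_indicator` and the toy rows are assumptions, while the column names follow the ones used in the cells of this notebook.

```python
import pandas as pd


def describe_indicator(df, value_col="Value"):
    """Summarise year range, country count and whether the values are binary."""
    observed = set(df[value_col].dropna().unique())
    return {
        "first_year": int(df["Year"].min()),
        "last_year": int(df["Year"].max()),
        "n_countries": int(df["Country Code"].nunique()),
        "is_binary": observed <= {0, 1},  # True if only 0 and 1 occur
    }


# Toy data standing in for legis1 or legis2 (hypothetical rows)
toy = pd.DataFrame({
    "Country Name": ["Netherlands", "Netherlands", "Belgium"],
    "Country Code": ["NLD", "NLD", "BEL"],
    "Year": [1970, 2023, 2023],
    "Value": [0, 1, None],
})

summary = describe_indicator(toy)
print(summary)
```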


### Data preprocessing
Both datasets contained many columns with only missing data. These columns were meant to hold definitions or general comments, so we removed them; this has no negative effect on our analysis. The datasets also contain a large amount of data on topics that are not relevant to our analysis, such as water supply or mobile payments, and we removed these rows as well. To combine both topics of interest, we merged the datasets. Before merging, we renamed the binary outcome column in each dataset to reflect its topic and selected the columns to be analyzed. The merged dataset contains 5 columns and 10,206 rows.
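
The column removal relies on the `thresh` argument of `DataFrame.dropna`, which is the minimum number of non-missing values a column needs in order to be kept. The sketch below shows this on a toy frame; the data and the tolerance `max_missing` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country Code": ["NLD", "BEL", "DEU", "FRA"],
    "Value": [1.0, 0.0, np.nan, 1.0],             # one missing value
    "Comment": [np.nan, np.nan, np.nan, np.nan],  # entirely missing
})

max_missing = 1  # hypothetical tolerance for missing values per column

# Keep a column only if it has at least len(toy) - max_missing non-missing values
kept = toy.dropna(axis=1, thresh=len(toy) - max_missing)
print(list(kept.columns))
```

Because the threshold depends on the number of rows, it should be computed from the same frame that `dropna` is applied to.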

##### Checking on missing data

In [None]:

print(legis1.isna().sum())
print(legis2.isna().sum())

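# Dropping every column with more than 10 missing values
# (thresh is the minimum number of non-missing values a column needs to be kept)

legis1_clean = legis1.dropna(axis=1, thresh=(len(legis1) - 10))

print(legis1_clean.shape)

# The threshold is based on the length of legis2 itself, not legis1
legis2_clean = legis2.dropna(axis=1, thresh=(len(legis2) - 10))

print(legis2_clean.shape)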



##### Deleting all unnecessary data

In [None]:


legis1_clean = legis1_clean[legis1_clean["Indicator Name"] == 
                "There is legislation specifically addressing domestic violence (1=yes; 0=no)"]

print(legis1_clean.shape)
print(legis1_clean["Indicator Name"].unique())

legis2_clean = legis2_clean[legis2_clean["Indicator Name"] == 
                "Criminal penalties or civil remedies exist for sexual harassment in employment (1=yes; 0=no)"]

print(legis2_clean.shape)
print(legis2_clean["Indicator Name"].unique())


##### Renaming Value columns to prepare for merging datasets

In [None]:

legis1_clean = legis1_clean.rename(columns={"Value" : "Legislation Domestic Violence"})

legis2_clean = legis2_clean.rename(columns={"Value" : "Legislation Sexual Harassment"})

##### Selecting columns of interest for merging datasets

In [None]:

legis1_clean = legis1_clean[["Country Name", "Country Code", "Year", "Legislation Domestic Violence"]]

legis2_clean = legis2_clean[["Country Name", "Country Code", "Year", "Legislation Sexual Harassment"]]

##### Merging the datasets 

In [None]:

 
legis_merged = legis2_clean.merge(
    legis1_clean[["Country Name", "Country Code", "Year", "Legislation Domestic Violence"]],
    on=["Country Code", "Year", "Country Name"],
    how="inner")

print(legis_merged.shape)
print(legis_merged.head())
print(legis_merged.isna().sum())

legis_merged = legis_merged.dropna()
print(legis_merged.shape)
print(legis_merged.isna().sum())
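
An inner merge silently drops country-year pairs that appear in only one dataset. As an optional cross-check, an outer merge with `indicator=True` shows how many rows match, and `validate="one_to_one"` raises an error if a key is duplicated on either side. This is a sketch on toy data; the rows below are assumptions and do not come from the real datasets.

```python
import pandas as pd

keys = ["Country Name", "Country Code", "Year"]

# Toy stand-ins for legis1_clean and legis2_clean (hypothetical rows)
dv = pd.DataFrame({
    "Country Name": ["Netherlands", "Belgium"],
    "Country Code": ["NLD", "BEL"],
    "Year": [2023, 2023],
    "Legislation Domestic Violence": [1, 1],
})
sh = pd.DataFrame({
    "Country Name": ["Netherlands", "Germany"],
    "Country Code": ["NLD", "DEU"],
    "Year": [2023, 2023],
    "Legislation Sexual Harassment": [1, 1],
})

# Outer merge keeps unmatched rows; the _merge column records where each row came from
check = sh.merge(dv, on=keys, how="outer", validate="one_to_one", indicator=True)
counts = check["_merge"].value_counts().to_dict()
print(counts)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would discard.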
