# Exploratory Analysis of Violence Against Women and Girls (VAWG) Dataset
In this notebook I provide a basic quantitative and semantic analysis of the Violence Against Women and Girls dataset, sourced from the [Demographic and Health Surveys](https://dhsprogram.com/methodology/Survey-Types/DHS.cfm) program, and support the analysis with some simple data visualizations.

My original idea was to look at trends in domestic violence against women over time. It turns out that this particular dataset cannot answer that question, because each country appears in only one survey year. I will have to keep looking.

___
## Introduction

### About the dataset

### What is in this dataset?
Taken directly from the `data_dictionary.pdf` document included with the dataset. The dtype column is my own addition and does not appear in `data_dictionary.pdf`.

| Column Name | Description | dtype |
|---|---|---|
|RecordID|Numeric value unique to each question by country|int|
|Country|Country in which the survey was conducted|categorical|
|Gender|Whether the respondents were Male or Female|categorical|
|Demographics Question|Refers to the different types of demographic groupings used to segment respondents – marital status, education level, employment status, residence type, or age|categorical|
|Demographics Response|Refers to demographic segment into which the respondent falls (e.g. the age groupings are split into 15-24, 25-34, and 35-49)|categorical|
|Question|Respondents were asked if they agreed with the [statements listed below]|categorical|
|Survey Year|Year in which the Demographic and Health Survey (DHS) took place. “DHS surveys are nationally-representative household surveys that provide data for a wide range of monitoring and impact evaluation indicators in the areas of population, health, and nutrition. Standard DHS Surveys have large sample sizes (usually between 5,000 and 30,000 households) and typically are conducted around every 5 years, to allow comparisons over time.” (https://dhsprogram.com/What-We-Do/Survey-Types/DHS.cfm)|date|
|Value|% of people surveyed in the relevant group who agree with the question (e.g. the percentage of women aged 15-24 in Afghanistan who agree that a husband is justified in hitting or beating his wife if she burns the food)|float|

Respondents were asked if they agreed with the following statements:
- A husband is justified in hitting or beating his wife if she burns the food 
- A husband is justified in hitting or beating his wife if she argues with him
- A husband is justified in hitting or beating his wife if she goes out without telling him 
- A husband is justified in hitting or beating his wife if she neglects the children  
- A husband is justified in hitting or beating his wife if she refuses to have sex with him  
- A husband is justified in hitting or beating his wife for at least one specific reason

___
### Loading the dataset and analysis of raw data

In [1]:
import altair as alt
import pandas as pd
import numpy as np
from IPython.display import display

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [2]:
# load in the data and take a look at categorical data

raw_data = pd.read_csv('data/violence_data.csv')
raw_data.describe(include='object')

| |Country|Gender|Demographics Question|Demographics Response|Question|Survey Year|
|---|---|---|---|---|---|---|
|count|12600|12600|12600|12600|12600|12600|
|unique|70|2|5|15|6|18|
|top|Afghanistan|F|Education|Never married|... if she burns the food|01/01/2013|
|freq|180|6300|3360|840|2100|1980|
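
The summary above treats `Survey Year` as a string column, while the data dictionary describes it as a date. A minimal sketch of one way to convert it follows. The two-row `sample` frame is a hypothetical stand-in for `raw_data`, and the `%d/%m/%Y` format is an assumption: because every value shown is `01/01/YYYY`, day-first and month-first parsing give the same result here.

```python
import pandas as pd

# Hypothetical stand-in for raw_data; only the columns needed for the example.
sample = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "Survey Year": ["01/01/2015", "01/01/2017"],
})

# Assumption: the strings follow a day/month/year layout.
parsed = pd.to_datetime(sample["Survey Year"], format="%d/%m/%Y")

# Only the year carries information, so keep it as a plain integer column.
sample["Survey Year (int)"] = parsed.dt.year

print(sample.dtypes)
```

Keeping the year as an integer also lets Altair treat it as an ordinal or quantitative field rather than an arbitrary string.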


In [3]:
# take a look at the countries and years surveyed.

with pd.option_context('display.max_rows', None):
    display(raw_data.loc[:,['Country', 'Survey Year']].drop_duplicates().reset_index(drop=True))

| |Country|Survey Year|
|---|---|---|
|0|Afghanistan|01/01/2015|
|1|Albania|01/01/2017|
|2|Angola|01/01/2015|
|3|Armenia|01/01/2015|
|4|Azerbaijan|01/01/2006|
|5|Bangladesh|01/01/2014|
|6|Benin|01/01/2017|
|7|Bolivia|01/01/2008|
|8|Burkina Faso|01/01/2010|
|9|Burundi|01/01/2016|


So it looks like each country in the dataset was surveyed only once.
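
Rather than reading this off the listing by eye, the observation can be checked programmatically. The sketch below counts distinct survey years per country. The `sample` frame is a hypothetical stand-in for `raw_data`; in the notebook the same two lines would be applied to `raw_data` directly.

```python
import pandas as pd

# Hypothetical stand-in for raw_data with repeated rows per country.
sample = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "Survey Year": ["01/01/2015", "01/01/2015", "01/01/2017", "01/01/2017"],
})

# Number of distinct survey years recorded for each country.
years_per_country = sample.groupby("Country")["Survey Year"].nunique()

# True only if no country has more than one survey year.
surveyed_once = bool((years_per_country == 1).all())

print(years_per_country)
print("Every country surveyed exactly once:", surveyed_once)
```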

In [4]:
# for each survey year, let's take a look at the total
# number of records in the dataset, and the total number
# of countries in which the survey was conducted

survey_count = alt.Chart(raw_data).mark_bar().encode(
    x='Survey Year',
    y='count(RecordID)',
)

country_count = alt.Chart(raw_data).mark_bar().encode(
    x='Survey Year',
    y={"aggregate": "distinct",
       "field": "Country",
       "type": "quantitative"},
    tooltip=['Survey Year',
             'count(RecordID)',
             {"aggregate": "distinct",
              "field": "Country",
              "type": "quantitative"}]
)

alt.layer(survey_count, country_count).resolve_scale(
    y='independent',
)

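Note that the number of records per country is constant: the dataset has 12,600 records across 70 countries, and the most frequent country appears 180 times (70 × 180 = 12,600), so every country contributes exactly 180 records.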
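
A quick way to confirm that every country has the same number of records is to count rows per country and check that only one distinct count appears. This is a sketch on a hypothetical `sample` frame standing in for `raw_data`; the three-rows-per-country figure is illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for raw_data: three records for each of two countries.
sample = pd.DataFrame({
    "RecordID": [1, 2, 3, 4, 5, 6],
    "Country": ["Afghanistan"] * 3 + ["Albania"] * 3,
})

# Rows per country.
records_per_country = sample["Country"].value_counts()

# If the count is constant, value_counts contains a single distinct value.
is_constant = records_per_country.nunique() == 1

print(records_per_country)
print("Same number of records for every country:", is_constant)
```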

In [5]:
# let's check that the distribution of respondent gender makes sense

alt.Chart(raw_data).mark_bar().encode(
    x='count(RecordID)',
    y='Country',
    color="Gender",
    tooltip=['Survey Year', 'Gender', 'count(RecordID)']
).configure_axisX(orient='top')

In [6]:
# list each demographic question alongside its possible responses
print(raw_data.loc[:, ['Demographics Question', 'Demographics Response']]
      .drop_duplicates()
      .sort_values('Demographics Question'))

# for each country, count the records per demographic response, split by gender
alt.Chart(raw_data.sort_values('Demographics Question')).mark_bar().encode(
    alt.X('count(RecordID)'),
    alt.Y('Demographics Response',
          sort=alt.EncodingSortField(field="Demographics Question")),
    color="Gender",
    tooltip=["Demographics Question", "Demographics Response", "Gender", "count(RecordID)"],
).properties(width=100, height=160).facet("Country", columns=2).resolve_scale(
    y='independent',
    x='independent',
)

   Demographics Question         Demographics Response
6                    Age                         15-24
9                    Age                         25-34
12                   Age                         35-49
1              Education                        Higher
2              Education                     Secondary
3              Education                       Primary
13             Education                  No education
5             Employment             Employed for kind
7             Employment                    Unemployed
14            Employment             Employed for cash
0         Marital status                 Never married
4         Marital status  Widowed, divorced, separated
10        Marital status    Married or living together
8              Residence                         Rural
11             Residence                         Urban
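
The question/response pairs printed above can also be collected into a dictionary, which is handy for later filtering or labelling. The following is a sketch under the assumption that a plain `dict` of lists is a convenient shape; `sample` is a hypothetical stand-in for `raw_data` that covers only two of the five demographic questions.

```python
import pandas as pd

# Hypothetical stand-in for raw_data, including one duplicated pair.
sample = pd.DataFrame({
    "Demographics Question": ["Age", "Age", "Age", "Residence", "Residence", "Age"],
    "Demographics Response": ["15-24", "25-34", "35-49", "Rural", "Urban", "15-24"],
})

# Map each demographic question to the list of its distinct responses.
responses_by_question = (
    sample.drop_duplicates()
          .groupby("Demographics Question")["Demographics Response"]
          .apply(list)
          .to_dict()
)

print(responses_by_question)
```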


### Conclusion - Follow-up Project Ideas