### Data analysis and Exploration

#### About Dataset

**Description**

This dataset contains information on domestic violence against women in a specific rural area of a developing country. The data was collected to help understand the relationship between various socio-economic factors and domestic violence.

**Dataset Content**

The dataset includes the following columns:

- SL. No: Serial number of the record.
- Age: Age of the respondent.
- Education: Educational attainment of the respondent (none, primary, secondary, or tertiary, where tertiary denotes higher secondary).
- Employment: Employment status of the respondent.
- Income: Income level of the respondent (0 indicating no income).
- Marital status: Marital status of the respondent (married or unmarried; the raw file spells the latter `unmarred`).
- Violence: Indicates whether the respondent has experienced domestic violence (yes or no).

**Context**

Domestic violence against women is a significant issue in many developing countries. Understanding the factors that contribute to this violence can help in creating effective interventions and policies. This dataset aims to provide a basis for analysis and research in this critical area.



#### **What is Exploratory Data Analysis?**
 
 - Exploratory Data Analysis (EDA) helps you understand a dataset before applying machine learning or making decisions.

#### **Goal of Exploratory Data Analysis**
 - The goal is to identify patterns, outliers, and insights.

**Step 1: Import necessary libraries and load your dataset**

In [26]:
import pandas as pd

In [27]:
data = pd.read_csv("Domestic violence.csv")

**Step 2: Read the first lines of your dataset**

In [28]:
data.head()

| | SL. No | Age | Education | Employment | Income | Marital status | Violence |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 30 | secondary | unemployed | 0 | married | yes |
| 1 | 2 | 47 | tertiary | unemployed | 0 | married | no |
| 2 | 3 | 24 | tertiary | unemployed | 0 | unmarred | no |
| 3 | 4 | 22 | tertiary | unemployed | 0 | unmarred | no |
| 4 | 5 | 50 | primary | unemployed | 0 | married | yes |


The preview above suggests that each row describes a different individual, so we first look at how many people were sampled.

**Step 3: Data analysis**

In [29]:
data.shape

(347, 7)

The dataset has 347 records, each with 7 columns (a serial number plus six attributes). Next, we check the data type of each column.

In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   SL. No           347 non-null    int64 
 1   Age              347 non-null    int64 
 2   Education        347 non-null    object
 3   Employment       347 non-null    object
 4   Income           347 non-null    int64 
 5   Marital status   347 non-null    object
 6   Violence         347 non-null    object
dtypes: int64(3), object(4)
memory usage: 19.1+ KB


Four of the seven columns are categorical (`object` dtype). The non-null count for every column equals the total number of entries (347), so there are no missing values.
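`data.info()` only counts true nulls, so a categorical column could still hide blank strings. The sketch below shows one way to check both cases. The `toy` frame is made-up illustration data, not rows from the survey, and the variable names are hypothetical.

```python
import pandas as pd

# Toy frame (NOT the real data): one true NaN and one whitespace-only string.
toy = pd.DataFrame({
    "Age": [30, 47, None],
    "Education ": ["secondary", "  ", "tertiary"],
})

# Count real missing values (NaN/None) per column.
missing_per_column = toy.isna().sum()

# Count strings that are empty once surrounding whitespace is removed.
blank_education = int(toy["Education "].str.strip().eq("").sum())

print(missing_per_column)
print(blank_education)
```

On the real notebook, the same two expressions could be run against `data` instead of `toy`.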

In [31]:
data.describe()

| | SL. No | Age | Income |
|---|---|---|---|
| count | 347.0 | 347.0 | 347.0 |
| mean | 174.0 | 31.380403 | 2110.685879 |
| std | 100.314505 | 9.601569 | 5743.278766 |
| min | 1.0 | 15.0 | 0.0 |
| 25% | 87.5 | 23.0 | 0.0 |
| 50% | 174.0 | 30.0 | 0.0 |
| 75% | 260.5 | 39.5 | 0.0 |
| max | 347.0 | 60.0 | 35000.0 |


The statistical summary above shows the following about the sampled women:

- The median age is 30 (mean about 31.4). The youngest respondent is 15 and the oldest is 60, so the sample covers women aged 15 to 60.
- The median and the 75th percentile of income are both 0, so at least three quarters of the women have no income. The mean (about 2,111) is pulled up by a small number of earners, with a maximum of 35,000. Lack of income might therefore be a factor that contributes to domestic violence.
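When a column is this skewed, the share of zero values is often more informative than the mean. A minimal sketch of that calculation follows. The `income` values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy incomes (NOT the real data): mostly zeros with two earners.
income = pd.Series([0, 0, 0, 5000, 0, 35000, 0, 0])

share_zero = (income == 0).mean()   # fraction of respondents with no income
median_income = income.median()     # robust to a few large values
mean_income = income.mean()         # pulled up by the high earners

print(f"zero income: {share_zero:.0%}, median: {median_income}, mean: {mean_income}")
```

In the notebook, replacing `income` with `data["Income"]` would give the exact share of women with no income.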

In [32]:
data["Education "].value_counts()

Education 
primary      132
secondary    114
none          52
tertiary      49
Name: count, dtype: int64

We see that about 38% of the women attended only primary school, and about 15% have no formal education.

- These results indicate that about 53% of the women have at most a primary-school education. This is consistent with the income figures, where at least 75% of the women report no income, which could partly be a result of low educational attainment.

- Next, we check the relationship between income and domestic violence cases.
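The education percentages can be reproduced directly from the counts printed by `value_counts()`. The sketch below re-enters those four counts by hand; `counts`, `share`, and `up_to_primary` are illustrative names.

```python
import pandas as pd

# Counts copied from the value_counts() output for the education column.
counts = pd.Series({"primary": 132, "secondary": 114, "none": 52, "tertiary": 49})

share = counts / counts.sum()                      # proportion per level
up_to_primary = share["none"] + share["primary"]   # no schooling or primary only

print(share.round(3))
print(round(up_to_primary, 3))
```

With the full DataFrame, `data["Education "].value_counts(normalize=True)` returns the same proportions in one call.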


In [33]:
data.groupby('Violence ')['Income'].describe()

| Violence | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no | 261.0 | 2504.628352 | 6295.998263 | 0.0 | 0.0 | 0.0 | 0.0 | 35000.0 |
| yes | 86.0 | 915.116279 | 3331.084171 | 0.0 | 0.0 | 0.0 | 0.0 | 24000.0 |


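In both groups the median and the 75th percentile of income are 0, so income on its own does not clearly separate the two groups. The mean income is lower among women who reported violence (about 915) than among those who did not (about 2,505), but both distributions are heavily skewed by a few high earners. Next, we compare marital status with domestic violence.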
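Because the quartiles are all zero, a per-group share of zero-income respondents can complement `describe()`. The following is a sketch on invented rows, not the survey data. It keeps the notebook's column names, including the trailing space in `'Violence '`.

```python
import pandas as pd

# Toy rows (NOT the real data) using the notebook's column names.
toy = pd.DataFrame({
    "Violence ": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "Income": [0, 0, 10000, 30000, 0, 0, 0, 4000],
})

# Named aggregation: mean, median, and share of zero incomes per group.
summary = toy.groupby("Violence ")["Income"].agg(
    mean_income="mean",
    median_income="median",
    share_zero=lambda s: (s == 0).mean(),
)

print(summary)
```

Comparing `share_zero` between the two groups shows whether having no income at all is more common among women who reported violence, which the mean alone cannot show.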

In [34]:
cross_tab = pd.crosstab(data['Marital status '], data['Violence '])
print(cross_tab)

Violence          no  yes
Marital status           
married          217   83
unmarred          44    3


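From this table, 83 of 300 married women (about 28%) reported domestic violence, compared with 3 of 47 unmarried women (about 6%). Married women in this sample are therefore more likely to report domestic abuse, which might narrow our focus to married women. Note that the unmarried group is small, so its rate is less reliable.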
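Raw counts are hard to compare when the groups differ in size, so it helps to convert each row of the cross-tabulation to proportions. This sketch rebuilds the table from the four counts printed above; `counts` and `row_share` are illustrative names.

```python
import pandas as pd

# Counts copied from the crosstab output (rows: marital status, columns: violence).
counts = pd.DataFrame(
    {"no": [217, 44], "yes": [83, 3]},
    index=["married", "unmarred"],
)

# Divide each row by its row total to get within-group proportions.
row_share = counts.div(counts.sum(axis=1), axis=0)

print(row_share.round(3))
```

On the full data, `pd.crosstab(data['Marital status '], data['Violence '], normalize='index')` should produce the same proportions directly.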

In [39]:
print(data['Violence '].value_counts(normalize=True))  

Violence 
no     0.752161
yes    0.247839
Name: proportion, dtype: float64


The proportions above show that the majority of the recorded women (about 75%) did not report domestic violence. Next, we check whether these women are employed, since most have a low level of education.

In [35]:
data['Employment '].value_counts()

Employment 
unemployed       274
semi employed     47
employed          23
employed           3
Name: count, dtype: int64

The `employed` label appears twice in the output, most likely because some entries differ only in whitespace. The two categories represent the same group, so they will need to be merged during cleaning.
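One possible way to merge the two `employed` labels is shown below. It assumes the duplicates differ only by a trailing space, which the output does not confirm. Printing `repr` of the unique values in the real column would show the actual difference. The series is rebuilt from the printed counts.

```python
import pandas as pd

# Series rebuilt from the printed counts; the trailing space in "employed "
# is an ASSUMPTION about why the label appears twice.
employment = pd.Series(
    ["unemployed"] * 274
    + ["semi employed"] * 47
    + ["employed"] * 23
    + ["employed "] * 3
)

# repr() makes invisible whitespace differences visible.
print([repr(value) for value in employment.unique()])

# Strip surrounding whitespace so both spellings collapse into one category.
cleaned = employment.str.strip()
cleaned_counts = cleaned.value_counts()

print(cleaned_counts)
```

If the labels turn out to differ in some other way (for example, capitalization), a `replace` with an explicit mapping would be needed instead of `str.strip()`.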

In [40]:
print(data['Employment '].value_counts(normalize=True))

Employment 
unemployed       0.789625
semi employed    0.135447
employed         0.066282
employed         0.008646
Name: proportion, dtype: float64


Apart from the repeated `employed` category, about 79% of these women are unemployed. This is consistent with at least 75% of them having no income and about 53% having at most a primary-school education.

In [36]:
print(data.duplicated().sum())

0


The dataset has no duplicate rows. We now move on to data cleaning and to visualizing the relationship between socio-economic factors and domestic violence against women.

**Based on the analysis so far, the factors recorded in this dataset may not be the direct drivers of domestic violence. Other factors will likely need to be considered, since the information here is not enough to draw a firm conclusion.**
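As a preview of the cleaning step, the sketch below tidies two issues visible in the cells above: column names with trailing spaces (such as `'Violence '`) and the `unmarred` label. Treating `unmarred` as a misspelling of `unmarried` is an assumption based on the dataset description. The `toy` frame is illustrative, not real rows.

```python
import pandas as pd

# Toy frame (NOT the real data) mimicking the notebook's column names.
toy = pd.DataFrame({
    "Marital status ": ["married", "unmarred"],
    "Violence ": ["yes", "no"],
})

# Strip stray whitespace from every column name.
cleaned = toy.rename(columns=lambda name: name.strip())

# ASSUMPTION: "unmarred" is a typo for "unmarried".
cleaned["Marital status"] = cleaned["Marital status"].replace(
    {"unmarred": "unmarried"}
)

print(cleaned)
```

After a step like this, later cells could refer to `cleaned['Violence']` without the trailing space.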