# Divorce Prediction

In this notebook, I'll explore the divorce-prediction dataset from Kaggle. You can find it here: https://www.kaggle.com/datasets/andrewmvd/divorce-prediction

The responses in this dataset were collected on a 5-point scale (0=Never, 1=Seldom, 2=Averagely, 3=Frequently, 4=Always).

This dataset was collected in Turkey.

In [14]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport  # pandas_profiling was renamed to ydata_profiling
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

random_state=2023

## Reading the data and exploring it:

In [3]:
data= pd.read_csv('/kaggle/input/divorce-prediction/divorce_data.csv',sep=';')
data.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q46,Q47,Q48,Q49,Q50,Q51,Q52,Q53,Q54,Divorce
0,2,2,4,1,0,0,0,0,0,0,...,2,1,3,3,3,2,3,2,1,1
1,4,4,4,4,4,0,0,4,4,4,...,2,2,3,4,4,4,4,2,2,1
2,2,2,2,2,1,3,2,1,1,2,...,3,2,3,1,1,1,2,2,2,1
3,3,2,3,2,3,3,3,3,3,3,...,2,2,3,3,3,3,2,2,2,1
4,2,2,1,1,1,1,0,0,0,0,...,2,1,2,3,2,2,2,1,0,1


In [4]:
data.shape

(170, 55)

### Pandas Profiling

It's a very useful tool, and since this dataset is quite small, we can generate the report quickly.

In the report:

- We can see that there are duplicates in the data.
- There are no missing values.
- We can take a look at the correlations between the different variables.
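
The same three checks can also be reproduced with plain pandas, which may help when the profiling report is not available. This is only a minimal sketch: `toy` is a made-up frame standing in for `data`, and its values are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "Q1": [0, 4, 0, 3, 4],
    "Q2": [1, 3, 1, 4, 3],
    "Divorce": [0, 1, 0, 1, 1],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
n_missing = int(toy.isna().sum().sum())     # missing cells in the whole frame
corr = toy.corr(method="spearman")          # rank correlation, a common choice for ordinal answers

print(n_duplicates, n_missing)
print(corr["Divorce"].sort_values(ascending=False))
```

On the real data, replacing `toy` with `data` should give the same duplicate and missing-value counts that the report shows.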

In [18]:
report = ProfileReport(data)
report




In [5]:
data.drop_duplicates(inplace=True, ignore_index=False)
data.shape

(150, 55)

#### The data is quite balanced:
As the code below shows, the ratio between divorced (represented by 1) and not divorced (represented by 0) is about 1:1.3, so accuracy can be a good metric of our model's performance.

In [30]:
print(sum(data["Divorce"]==1), sum(data["Divorce"]==0),sum(data["Divorce"]==0)/sum(data["Divorce"]==1))

66 84 1.2727272727272727
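
As a rough companion to the counts above, the sketch below rebuilds the label column from those two numbers and computes the class shares with `value_counts`. The `majority_baseline` name is my own addition: it is the accuracy a model would get by always predicting the larger class, which gives a floor to compare the later scores against.

```python
import pandas as pd

# Class counts taken from the output above: 66 divorced (1), 84 not divorced (0).
labels = pd.Series([1] * 66 + [0] * 84, name="Divorce")

counts = labels.value_counts()                # absolute counts per class
shares = labels.value_counts(normalize=True)  # class proportions
majority_baseline = shares.max()              # accuracy of always predicting the larger class

print(counts.to_dict())
print(round(majority_baseline, 2))
```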


There are 54 questions in the dataset, named Q1, Q2, ..., Q54. It's important to know what those names stand for, so here are the full questions:

In [6]:
questions= pd.read_csv('/kaggle/input/divorce-prediction/reference.tsv',sep=';')
for i in questions.values:
    print(i[0])

1|If one of us apologizes when our discussion deteriorates, the discussion ends.
2|I know we can ignore our differences, even if things get hard sometimes.
3|When we need it, we can take our discussions with my spouse from the beginning and correct it.
4|When I discuss with my spouse, to contact him will eventually work.
5|The time I spent with my wife is special for us.
6|We don't have time at home as partners.
7|We are like two strangers who share the same environment at home rather than family.
8|I enjoy our holidays with my wife.
9|I enjoy traveling with my wife.
10|Most of our goals are common to my spouse.
11|I think that one day in the future, when I look back, I see that my spouse and I have been in harmony with each other.
12|My spouse and I have similar values in terms of personal freedom.
13|My spouse and I have similar sense of entertainment.
14|Most of our goals for people (children, friends, etc.) are the same.
15|Our dreams with my spouse are similar and harmonious.
16|W

### Setting assumptions and checking them 

Looking at those questions, I started forming some assumptions about how each question relates to the probability of divorce.

The questions Q6, Q7, Q31, Q32, Q33, Q34, Q35, Q36, Q37, Q38, Q41, Q42, Q45, Q46, Q47, Q49, Q50, Q51, Q52, Q53 and Q54 use very negative language and describe somewhat hurtful actions, so I think we can assume that couples who answered those questions with 3 or 4 (the top of the 0 to 4 scale) are the most likely to get a divorce.

I also expect couples who answered the positively worded questions with 0 or 1 to divorce more often than couples who answered those questions with 3 or 4.

Let's check that assumption:
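
One possible numeric check, sketched here under assumptions, is to compute the share of high answers (3 or 4) per question within each outcome group. The frame `toy` and its values are hypothetical, and `negative_questions` holds only two of the questions listed above to keep the example short.

```python
import pandas as pd

# Hypothetical toy answers; in the notebook this would be `data`.
toy = pd.DataFrame({
    "Q6": [0, 1, 0, 3, 4, 3],
    "Q7": [0, 0, 1, 4, 3, 2],
    "Divorce": [0, 0, 0, 1, 1, 1],
})
negative_questions = ["Q6", "Q7"]  # small subset of the list above, for illustration

# Share of answers that are 3 (Frequently) or 4 (Always), per question and outcome group.
high_share = (toy[negative_questions] >= 3).groupby(toy["Divorce"]).mean()
print(high_share)
```

If the assumption holds on the real data, the row for `Divorce == 1` should show clearly larger shares than the row for `Divorce == 0`.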


In [28]:
# first, put the divorced couples in a separate dataset
divorced=data[data["Divorce"]==1]
divorced.shape

(66, 55)

In [24]:
#then let's get a report on this divorced dataset alone
report = ProfileReport(divorced)
report




In [31]:
not_divorced=data[data["Divorce"]==0]
not_divorced.shape

(84, 55)

In [26]:
#then let's get a report on this not_divorced dataset alone
report = ProfileReport(not_divorced)
report




### Some Notes from the above 2 reports:

Looking at the two reports above gives us some useful insights into the relationships in the dataset. Here are some of them:
- Couples who stayed together never answered Q7 with anything but 0 or 1, so answering that question with anything higher than 1 is a real red flag in the relationship.
- Q8 is supposed to have a very positive meaning, so I assumed that couples who stayed together would answer it with 3 or 4. The results are really surprising: as you can see from the report, 76 of the 84 respondents who stayed together answered that question with 0!
- Q9 is the same as Q8, and only 80 of the 84 respondents who stayed together enjoy traveling with their wives!
- Q10, which indicates whether the couple shares the same goals, has surprising results: almost 50% of the people who got divorced answered this question with 3, and more than 64% of those who stayed together answered it with 0!
- Q11, which is a really optimistic question, got only 0 and 1 from the couples who stayed together, while more than 69% of the divorced couples answered it with 3 or 4!
- Q12 also got mostly 0 and 1 from the non-divorced couples and 2, 3 and 4 from the divorced couples.
- Q14, which indicates how united the couple's goals are and which I assumed would get 3 and 4 from the non-divorced couples, got mostly 0 and 1 from them!
- Q15, which indicates how united the couple's dreams are and which I also assumed would get 3 and 4 from the non-divorced couples, got mostly 0 and 1 from them!
- Q16 is about the couple's concept of love, and most of the non-divorced couples in this dataset seem to rarely agree on it, as they mostly answered that question with 0 and 1.
- The answers to Q17 seem to reveal that non-divorced couples rarely share the same views about being happy, while divorced couples do!
- The answers to Q18, Q19 and Q20 show that the couples who stayed together almost never share similar ideas about how marriage should be, how roles should be divided in marriage, or the value of trust, while a lot of divorced couples do!
- The answers to Q21, Q22, Q23 and Q24 show that partners who don't know what their spouse likes, how their spouse wants to be taken care of when sick, their spouse's favorite food, or even what kind of stress their spouse is facing in life are staying with each other, as all of the non-divorced couples answered these questions with 0 or 1!
- Basically, the answers to questions 21 to 30 show that non-divorced couples know too little about each other.
- Q33 to Q36, which indicate how disrespectful the couples are to each other, got mostly 0 or 1 from the couples who stayed together, while more than 90% of the divorced couples answered those questions with 3 or 4.
- Q37, Q41 and Q47, which show how much anger is contained in the relationship, got mostly 0 and 1 from the couples who stayed together and mostly 3 or 4 from the divorced ones.
- Q38, Q39, Q40, Q42 and Q45, which show how bad the discussions, the arguments and basically the communication between the couples can be, got mostly 0 and 1 from the couples who stayed together and mostly 3 or 4 from the divorced ones.
- Q52, Q53 and Q54, which show how ready the partners are to tell their spouses about their inadequacies, were answered mostly with 0 and 1 by the non-divorced couples and with 3 and 4 by the divorced ones.
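
The per-question percentages quoted in these notes were read off the two reports. They could also be computed directly with `pd.crosstab`, as in the rough sketch below. The frame `toy` and its answers are hypothetical; on the real data the two columns would come from `data`.

```python
import pandas as pd

# Hypothetical toy answers to one question; in the notebook this would be data["Q7"].
toy = pd.DataFrame({
    "Q7": [0, 0, 1, 0, 4, 3, 2, 4],
    "Divorce": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Rows: outcome, columns: answer value, cells: share of that group giving the answer.
dist = pd.crosstab(toy["Divorce"], toy["Q7"], normalize="index")
print(dist.round(2))
```

Each row sums to 1, so the two groups can be compared even though their sizes differ.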


## Modeling 

I'll use **cross_val_score**, which applies k-fold cross-validation, to get the accuracy score of each model and then choose the best one.
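
To make the k-fold step concrete, here is a minimal sketch of what `cross_val_score` does for a classifier when `cv` is an integer: it splits the data into stratified folds, fits on all but one fold, and scores accuracy on the held-out fold. The data here is synthetic (`X_toy`, `y_toy` from `make_classification`), not the divorce data, and `max_iter=1000` is my own choice to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for x and y (assumption: not the divorce data).
X_toy, y_toy = make_classification(n_samples=150, n_features=10, random_state=0)

# Manual version: fit on 9 folds, score accuracy on the held-out fold, repeat 10 times.
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X_toy, y_toy):
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))

# Library version: an integer cv with a classifier also uses stratified folds.
auto_scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy, cv=10)
print(np.mean(manual_scores), auto_scores.mean())
```

With about 150 rows and 10 folds, each held-out fold has only 15 samples, so a single misclassified row moves that fold's accuracy by almost 7 points.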

In [13]:
x= data.drop("Divorce",axis=1).values
y= data["Divorce"].values

### Decision Tree

In [15]:
clf1 = DecisionTreeClassifier(random_state=random_state)

In [27]:
scores = cross_val_score(clf1, x, y, cv=10)
print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.9667 (+/- 0.12)


### SVM

In [22]:
clf2 = SVC(gamma='auto')

In [28]:
scores = cross_val_score(clf2, x, y, cv=10)
print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.9733 (+/- 0.12)


### Random Forest


In [24]:
clf3=RandomForestClassifier()

In [29]:
scores = cross_val_score(clf3, x, y, cv=10)
print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.9733 (+/- 0.12)


### Logistic Regression

In [26]:
clf4=LogisticRegression(random_state=random_state)

In [32]:
scores = cross_val_score(clf4, x, y, cv=10)
print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.9800 (+/- 0.12)
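
To choose between the four models, one option is to score them all in a single loop on identical folds. The sketch below is a hedged illustration on synthetic data, not a rerun of the cells above: `X_toy` and `y_toy` are made up, and the shared shuffled `StratifiedKFold`, the seeded random forest and `max_iter=1000` are my own assumptions rather than the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x and y (assumption: not the divorce data).
X_toy, y_toy = make_classification(n_samples=150, n_features=10, random_state=0)

seed = 2023
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=seed),
    "SVM": SVC(gamma="auto"),
    "Random Forest": RandomForestClassifier(random_state=seed),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=seed),
}

# One shared, seeded splitter so every model is scored on the same folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = {
    name: cross_val_score(model, X_toy, y_toy, cv=cv).mean()
    for name, model in models.items()
}

best_name = max(results, key=results.get)  # model with the highest mean accuracy
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.4f}")
```

When the mean accuracies are as close as the ones reported above, the spread across folds matters as much as the ranking, so the top model is not necessarily meaningfully better than the others.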
