# Speed dating experiment
The goal of this project is to extract knowledge from a dataset collected during a speed dating experiment in the US. After cleaning the data, we will first extract simple facts. We will then focus on shared interests and their influence on getting a second date.

## 1. Exploring and cleaning

Import data and libs

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.io as pio

pio.renderers.default = "svg"

# If you are on the workspaces:
pio.renderers.default = "iframe_connected"

import plotly.express as px
import plotly.graph_objects as go

df = pd.read_csv('Speed Dating Data.csv', encoding="ISO-8859-1")

#### Visualise a sample of raw data. 
A random sample gives a little more information than the first rows, since the rows can be ordered by some columns. To fully understand the data, it is better to read the column descriptions given by the author.

In [2]:
# Show more columns when displaying dataframes
pd.options.display.max_columns = 200
display(df.sample(10))

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,pf_o_fun,pf_o_amb,pf_o_sha,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field,field_cd,undergra,mn_sat,tuition,race,imprace,imprelig,from,zipcode,income,goal,date,go_out,career,career_c,sports,tvsports,exercise,dining,museums,art,hiking,gaming,clubbing,reading,tv,theater,movies,concerts,music,shopping,yoga,exphappy,expnum,attr1_1,sinc1_1,intel1_1,fun1_1,amb1_1,shar1_1,attr4_1,sinc4_1,intel4_1,fun4_1,amb4_1,shar4_1,attr2_1,sinc2_1,intel2_1,fun2_1,amb2_1,shar2_1,attr3_1,sinc3_1,fun3_1,intel3_1,amb3_1,attr5_1,sinc5_1,intel5_1,fun5_1,amb5_1,dec,attr,sinc,intel,fun,amb,shar,like,prob,met,match_es,attr1_s,sinc1_s,intel1_s,fun1_s,amb1_s,shar1_s,attr3_s,sinc3_s,intel3_s,fun3_s,amb3_s,satis_2,length,numdat_2,attr7_2,sinc7_2,intel7_2,fun7_2,amb7_2,shar7_2,attr1_2,sinc1_2,intel1_2,fun1_2,amb1_2,shar1_2,attr4_2,sinc4_2,intel4_2,fun4_2,amb4_2,shar4_2,attr2_2,sinc2_2,intel2_2,fun2_2,amb2_2,shar2_2,attr3_2,sinc3_2,intel3_2,fun3_2,amb3_2,attr5_2,sinc5_2,intel5_2,fun5_2,amb5_2,you_call,them_cal,date_3,numdat_3,num_in_3,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3,attr7_3,sinc7_3,intel7_3,fun7_3,amb7_3,shar7_3,attr4_3,sinc4_3,intel4_3,fun4_3,amb4_3,shar4_3,attr2_3,sinc2_3,intel2_3,fun2_3,amb2_3,shar2_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
6353,411,15.0,1,30,2,15,18,1,1.0,1,16,394.0,0,0.4,1,24.0,2.0,18.0,19.0,16.0,15.0,15.0,17.0,0,9.5,7.0,7.0,8.0,9.0,6.0,7.0,6.0,2.0,34.0,Finance,8.0,The American University,1206.0,22481.0,2.0,8.0,2.0,Washington DC,20817.0,80006.0,3.0,4.0,1.0,M&A Advisory,7.0,7.0,6.0,6.0,7.0,7.0,6.0,3.0,3.0,1.0,8.0,1.0,7.0,6.0,7.0,10.0,8.0,3.0,3.0,,35.0,15.0,15.0,25.0,10.0,0.0,40.0,20.0,15.0,15.0,5.0,5.0,60.0,0.0,10.0,20.0,0.0,0.0,8.0,6.0,9.0,7.0,5.0,7.0,7.0,7.0,8.0,6.0,0,6.0,9.0,8.0,7.0,4.0,5.0,6.0,7.0,2.0,2.0,80.0,0.0,0.0,20.0,0.0,0.0,9.0,7.0,8.0,10.0,6.0,8.0,1.0,3.0,,,,,,,50.0,0.0,25.0,25.0,0.0,0.0,40.0,20.0,20.0,10.0,10.0,0.0,70.0,20.0,0.0,10.0,0.0,0.0,8.0,7.0,9.0,10.0,7.0,8.0,6.0,8.0,8.0,6.0,1.0,1.0,0.0,0.0,,80.0,0.0,10.0,10.0,0.0,0.0,70.0,0.0,10.0,20.0,0.0,0.0,70.0,0.0,15.0,15.0,0.0,0.0,80.0,20.0,0.0,0.0,0.0,0.0,7.0,8.0,8.0,9.0,6.0,7.0,7.0,8.0,9.0,7.0
4271,285,13.0,1,26,2,11,21,8,8.0,5,9,260.0,1,-0.18,0,27.0,4.0,19.0,20.0,19.0,14.0,13.0,15.0,1,6.0,9.0,7.0,7.0,6.0,5.0,7.0,4.0,2.0,24.0,engineering,5.0,tech school,,,6.0,1.0,1.0,International Student,10025.0,,4.0,6.0,6.0,finance or engineering,7.0,5.0,7.0,6.0,10.0,9.0,10.0,9.0,14.0,5.0,9.0,7.0,3.0,10.0,10.0,10.0,10.0,10.0,10.0,,20.0,20.0,25.0,25.0,5.0,5.0,20.0,20.0,25.0,25.0,5.0,5.0,20.0,20.0,25.0,25.0,5.0,5.0,8.0,10.0,10.0,10.0,8.0,8.0,10.0,10.0,8.0,10.0,1,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,0.0,5.0,30.0,20.0,10.0,10.0,20.0,10.0,9.0,9.0,9.0,9.0,9.0,7.0,1.0,2.0,,,,,,,30.0,10.0,20.0,20.0,10.0,10.0,30.0,10.0,20.0,20.0,10.0,10.0,30.0,10.0,20.0,20.0,10.0,10.0,9.0,9.0,9.0,9.0,9.0,8.0,8.0,8.0,8.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
815,56,1.0,0,1,1,3,10,5,,5,8,73.0,0,-0.31,1,22.0,2.0,30.0,10.0,20.0,30.0,0.0,10.0,0,7.0,7.0,6.0,5.0,4.0,5.0,5.0,5.0,2.0,23.0,social work,11.0,,,,2.0,10.0,9.0,Florida,33496.0,41466.0,1.0,3.0,1.0,social worker,9.0,2.0,1.0,10.0,10.0,5.0,5.0,2.0,1.0,7.0,6.0,8.0,8.0,7.0,6.0,9.0,10.0,4.0,3.0,15.0,17.0,18.0,18.0,15.0,17.0,15.0,,,,,,,32.0,10.0,12.0,27.0,7.0,12.0,10.0,10.0,9.0,8.0,5.0,,,,,,1,4.0,10.0,9.0,8.0,9.0,7.0,6.0,8.0,2.0,3.0,,,,,,,,,,,,1.0,3.0,3.0,,,,,,,20.0,15.0,15.0,15.0,20.0,15.0,,,,,,,,,,,,,9.0,9.0,7.0,9.0,7.0,,,,,,0.0,2.0,0.0,,,15.0,18.0,18.0,17.0,18.0,14.0,,,,,,,,,,,,,,,,,,,9.0,10.0,8.0,9.0,7.0,,,,,
3946,269,18.0,0,35,2,11,21,20,20.0,21,20,292.0,0,0.21,0,28.0,1.0,20.0,18.0,20.0,17.0,10.0,15.0,0,7.0,8.0,7.0,6.0,7.0,5.0,5.0,7.0,2.0,24.0,Art History,7.0,Wesleyan University,1380.0,27100.0,2.0,4.0,5.0,New Jersey,8904.0,,2.0,6.0,2.0,professor,2.0,4.0,1.0,8.0,8.0,8.0,10.0,7.0,3.0,1.0,10.0,3.0,6.0,6.0,6.0,9.0,4.0,2.0,6.0,,20.0,10.0,30.0,15.0,15.0,10.0,30.0,5.0,20.0,30.0,5.0,10.0,30.0,10.0,10.0,30.0,5.0,10.0,8.0,8.0,6.0,8.0,7.0,7.0,6.0,7.0,4.0,8.0,0,8.0,7.0,7.0,9.0,7.0,2.0,5.0,5.0,0.0,1.0,15.0,10.0,30.0,15.0,15.0,15.0,8.0,5.0,8.0,7.0,9.0,2.0,3.0,3.0,,,,,,,20.0,10.0,40.0,10.0,10.0,10.0,25.0,10.0,20.0,25.0,10.0,10.0,25.0,10.0,20.0,25.0,10.0,10.0,7.0,6.0,9.0,7.0,8.0,7.0,6.0,7.0,6.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6154,400,4.0,1,8,2,15,18,3,11.0,11,15,393.0,0,0.25,1,22.0,2.0,8.0,20.0,25.0,25.0,12.0,10.0,0,6.0,6.0,8.0,6.0,9.0,6.0,5.0,5.0,2.0,26.0,Business,8.0,Bocconi University Milan,,,2.0,7.0,7.0,Italy,80136.0,,5.0,4.0,2.0,Entrepreneur,7.0,7.0,4.0,7.0,7.0,6.0,6.0,6.0,1.0,8.0,7.0,1.0,7.0,7.0,6.0,7.0,1.0,1.0,3.0,,70.0,5.0,20.0,3.0,1.0,1.0,40.0,10.0,20.0,10.0,10.0,10.0,80.0,5.0,5.0,4.0,3.0,3.0,7.0,10.0,6.0,8.0,7.0,7.0,7.0,9.0,6.0,9.0,0,3.0,8.0,8.0,6.0,6.0,,5.0,8.0,2.0,8.0,50.0,15.0,15.0,15.0,3.0,2.0,7.0,9.0,8.0,6.0,7.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6409,414,18.0,1,36,2,15,18,6,12.0,13,18,396.0,1,-0.56,0,25.0,4.0,15.0,18.0,18.0,18.0,15.0,16.0,1,5.0,9.0,9.0,6.0,9.0,5.0,7.0,4.0,2.0,29.0,business,8.0,Princeton,1460.0,27230.0,2.0,2.0,10.0,New Jersey,10021.0,55080.0,6.0,1.0,1.0,business,7.0,10.0,5.0,9.0,5.0,0.0,0.0,5.0,5.0,0.0,8.0,7.0,0.0,0.0,0.0,1.0,2.0,0.0,5.0,,95.0,1.0,1.0,1.0,1.0,1.0,15.0,15.0,15.0,15.0,15.0,25.0,95.0,1.0,1.0,1.0,1.0,1.0,,,,,,,,,,,1,8.0,8.0,8.0,6.0,8.0,4.0,6.0,1.0,2.0,2.0,95.0,1.0,1.0,1.0,1.0,1.0,8.0,10.0,8.0,9.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
6270,407,11.0,1,22,2,15,18,8,5.0,4,5,383.0,0,0.51,1,27.0,2.0,25.0,20.0,10.0,20.0,15.0,10.0,1,5.0,6.0,6.0,5.0,,,6.0,4.0,2.0,24.0,Film,14.0,Georgetown University,1400.0,25425.0,2.0,1.0,1.0,New Jersey,7901.0,,3.0,4.0,2.0,Screenwriter,6.0,8.0,4.0,7.0,9.0,9.0,9.0,7.0,4.0,2.0,9.0,5.0,7.0,10.0,9.0,9.0,7.0,7.0,5.0,,30.0,15.0,15.0,15.0,15.0,10.0,50.0,10.0,10.0,10.0,10.0,10.0,50.0,10.0,10.0,10.0,10.0,10.0,7.0,9.0,8.0,7.0,8.0,5.0,6.0,7.0,6.0,7.0,0,6.0,8.0,8.0,7.0,6.0,4.0,6.0,7.0,2.0,5.0,50.0,10.0,10.0,10.0,10.0,10.0,7.0,8.0,8.0,7.0,9.0,6.0,3.0,1.0,60.0,10.0,10.0,10.0,5.0,5.0,40.0,10.0,20.0,10.0,10.0,10.0,50.0,10.0,10.0,10.0,10.0,10.0,50.0,10.0,10.0,10.0,10.0,10.0,7.0,8.0,8.0,7.0,9.0,7.0,7.0,8.0,6.0,8.0,1.0,0.0,0.0,0.0,,50.0,10.0,10.0,10.0,10.0,10.0,60.0,10.0,10.0,10.0,5.0,5.0,50.0,10.0,10.0,10.0,10.0,10.0,50.0,10.0,10.0,10.0,10.0,10.0,7.0,8.0,8.0,6.0,8.0,7.0,7.0,6.0,5.0,6.0
7754,524,16.0,0,31,2,21,22,11,11.0,15,15,545.0,0,-0.12,0,24.0,2.0,20.0,20.0,20.0,20.0,10.0,10.0,0,,,,,,,,,,25.0,medicine,4.0,Columbia University,1430.0,26908.0,4.0,9.0,6.0,Michigan,48306.0,72412.0,2.0,7.0,4.0,physician/healthcare,4.0,4.0,1.0,7.0,9.0,8.0,8.0,6.0,3.0,8.0,7.0,7.0,8.0,8.0,5.0,5.0,7.0,5.0,6.0,,15.0,20.0,25.0,20.0,10.0,10.0,30.0,10.0,15.0,20.0,15.0,10.0,35.0,10.0,15.0,20.0,10.0,10.0,5.0,9.0,7.0,8.0,7.0,5.0,9.0,9.0,7.0,7.0,1,,,,,,,,,,3.0,,,,,,,,,,,,8.0,3.0,3.0,20.0,25.0,25.0,15.0,10.0,5.0,20.0,20.0,20.0,15.0,15.0,10.0,25.0,15.0,15.0,20.0,10.0,15.0,25.0,15.0,15.0,20.0,10.0,15.0,6.0,10.0,8.0,8.0,8.0,5.0,9.0,8.0,7.0,8.0,2.0,3.0,1.0,1.0,1.0,15.0,15.0,20.0,15.0,20.0,15.0,20.0,20.0,20.0,15.0,15.0,10.0,30.0,10.0,10.0,20.0,10.0,20.0,35.0,10.0,15.0,20.0,10.0,10.0,6.0,9.0,8.0,8.0,8.0,6.0,9.0,8.0,7.0,9.0
6534,430,1.0,0,1,2,17,14,8,8.0,7,11,450.0,0,0.01,0,25.0,2.0,50.0,5.0,5.0,5.0,5.0,30.0,1,7.0,7.0,7.0,7.0,5.0,8.0,8.0,7.0,2.0,22.0,Elementary Education - Preservice,9.0,UC Irvine,,15260.0,4.0,3.0,10.0,California,94539.0,65498.0,1.0,6.0,2.0,Elementary school teacher,2.0,10.0,10.0,10.0,6.0,3.0,2.0,4.0,2.0,4.0,4.0,8.0,9.0,9.0,8.0,8.0,5.0,4.0,5.0,,10.0,25.0,15.0,20.0,15.0,15.0,25.0,20.0,10.0,30.0,10.0,5.0,25.0,10.0,10.0,30.0,10.0,15.0,5.0,8.0,8.0,5.0,8.0,5.0,8.0,6.0,9.0,9.0,0,5.0,7.0,8.0,6.0,9.0,5.0,6.0,3.0,2.0,0.0,10.0,25.0,10.0,20.0,15.0,20.0,5.0,8.0,5.0,9.0,8.0,4.0,1.0,2.0,10.0,25.0,10.0,25.0,10.0,20.0,10.0,25.0,15.0,20.0,10.0,20.0,20.0,20.0,15.0,25.0,10.0,10.0,25.0,25.0,10.0,25.0,10.0,5.0,5.0,8.0,5.0,9.0,8.0,5.0,9.0,6.0,9.0,9.0,0.0,0.0,0.0,,,10.0,25.0,15.0,20.0,10.0,20.0,10.0,30.0,10.0,15.0,10.0,25.0,20.0,20.0,10.0,25.0,15.0,10.0,20.0,20.0,15.0,20.0,15.0,10.0,5.0,8.0,5.0,9.0,7.0,6.0,9.0,7.0,9.0,8.0
6857,462,3.0,1,6,1,18,6,5,5.0,1,6,459.0,0,0.2,1,26.0,4.0,30.0,10.0,10.0,30.0,10.0,10.0,0,2.0,10.0,5.0,1.0,5.0,2.0,1.0,1.0,2.0,33.0,Electrical Engineering,5.0,China,,,4.0,1.0,5.0,"Cambridge, MA",,,4.0,6.0,6.0,"Professor, or Engineer",2.0,8.0,7.0,5.0,2.0,6.0,6.0,1.0,2.0,7.0,3.0,8.0,5.0,7.0,6.0,9.0,2.0,2.0,8.0,,30.0,30.0,10.0,10.0,0.0,20.0,10.0,30.0,20.0,10.0,10.0,20.0,20.0,20.0,10.0,20.0,10.0,20.0,4.0,9.0,6.0,8.0,2.0,3.0,10.0,7.0,8.0,3.0,1,8.0,6.0,7.0,6.0,7.0,4.0,7.0,4.0,0.0,3.0,7.0,8.0,4.0,5.0,3.0,4.0,5.0,8.0,7.0,8.0,4.0,2.0,1.0,1.0,30.0,30.0,10.0,10.0,0.0,20.0,30.0,20.0,15.0,15.0,0.0,20.0,30.0,20.0,10.0,20.0,10.0,10.0,30.0,30.0,10.0,10.0,10.0,10.0,4.0,8.0,7.0,6.0,2.0,3.0,9.0,8.0,7.0,2.0,0.0,0.0,0.0,,,30.0,30.0,10.0,20.0,0.0,10.0,30.0,20.0,10.0,20.0,0.0,20.0,25.0,20.0,20.0,15.0,10.0,10.0,25.0,20.0,20.0,15.0,10.0,10.0,4.0,8.0,6.0,4.0,2.0,3.0,9.0,7.0,7.0,2.0


#### Check missing values
We will drop columns with too many missing values. Let's define a threshold: we drop columns with 60% or more missing values. As we won't have enough time to explore the whole dataset, this is a first selection we have to make. We could explore the dropped columns later if needed.

In [3]:
# Calculating missing values for every column
missing_values = df.isnull().sum()
missing_values = missing_values/len(df)*100

# Sorting table
missing_values = missing_values.sort_values(ascending=False)

# Print missing values percentage
with pd.option_context('display.max_rows', None):
    print(missing_values)

num_in_3    92.026737
numdat_3    82.143710
expnum      78.515159
sinc7_2     76.665075
amb7_2      76.665075
shar7_2     76.438291
attr7_2     76.318931
intel7_2    76.318931
fun7_2      76.318931
amb5_3      75.936978
attr7_3     75.936978
sinc7_3     75.936978
intel7_3    75.936978
fun7_3      75.936978
amb7_3      75.936978
shar7_3     75.936978
shar2_3     75.936978
attr5_3     75.936978
sinc5_3     75.936978
intel5_3    75.936978
fun5_3      75.936978
attr4_3     64.681308
sinc4_3     64.681308
intel4_3    64.681308
fun4_3      64.681308
amb4_3      64.681308
shar4_3     64.681308
attr2_3     64.681308
sinc2_3     64.681308
fun2_3      64.681308
intel2_3    64.681308
amb2_3      64.681308
mn_sat      62.604440
tuition     57.233230
you_call    52.566245
shar1_3     52.566245
date_3      52.566245
attr1_3     52.566245
sinc1_3     52.566245
intel1_3    52.566245
fun1_3      52.566245
amb1_3      52.566245
attr3_3     52.566245
sinc3_3     52.566245
intel3_3    52.566245
fun3_3    

In [4]:
# Create a new dataframe keeping only columns with less than 60% missing values
cols_to_drop = missing_values[missing_values >= 60].index
dating = df.drop(columns=cols_to_drop)
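As an illustration of the thresholding step, here is a minimal sketch on a toy dataframe. The column names and values are hypothetical and are not taken from the speed dating dataset; only the 60% threshold comes from the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],              # 0% missing
    "b": [1.0, np.nan, np.nan, 4.0, 5.0],        # 40% missing
    "c": [np.nan, np.nan, np.nan, np.nan, 5.0],  # 80% missing
    "d": [np.nan] * 5,                           # 100% missing
})

threshold = 60  # percent; columns at or above this share of NaN are dropped

# Share of missing values per column, as a percentage
missing_pct = toy.isna().mean() * 100

# Keep only the columns below the threshold
kept = toy.loc[:, missing_pct < threshold]
print(list(kept.columns))
```

Selecting with a boolean mask avoids dropping columns one at a time in a loop.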

In [5]:
dating.head(3)

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,pf_o_fun,pf_o_amb,pf_o_sha,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field,field_cd,undergra,tuition,race,imprace,imprelig,from,zipcode,income,goal,date,go_out,career,career_c,sports,tvsports,exercise,dining,museums,art,hiking,gaming,clubbing,reading,tv,theater,movies,concerts,music,shopping,yoga,exphappy,attr1_1,sinc1_1,intel1_1,fun1_1,amb1_1,shar1_1,attr4_1,sinc4_1,intel4_1,fun4_1,amb4_1,shar4_1,attr2_1,sinc2_1,intel2_1,fun2_1,amb2_1,shar2_1,attr3_1,sinc3_1,fun3_1,intel3_1,amb3_1,attr5_1,sinc5_1,intel5_1,fun5_1,amb5_1,dec,attr,sinc,intel,fun,amb,shar,like,prob,met,match_es,attr1_s,sinc1_s,intel1_s,fun1_s,amb1_s,shar1_s,attr3_s,sinc3_s,intel3_s,fun3_s,amb3_s,satis_2,length,numdat_2,attr1_2,sinc1_2,intel1_2,fun1_2,amb1_2,shar1_2,attr4_2,sinc4_2,intel4_2,fun4_2,amb4_2,shar4_2,attr2_2,sinc2_2,intel2_2,fun2_2,amb2_2,shar2_2,attr3_2,sinc3_2,intel3_2,fun3_2,amb3_2,attr5_2,sinc5_2,intel5_2,fun5_2,amb5_2,you_call,them_cal,date_3,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3
0,1,1.0,0,1,1,1,10,7,,4,1,11.0,0,0.14,0,27.0,2.0,35.0,20.0,20.0,20.0,0.0,5.0,0,6.0,8.0,8.0,8.0,8.0,6.0,7.0,4.0,2.0,21.0,Law,1.0,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,9.0,2.0,8.0,9.0,1.0,1.0,5.0,1.0,5.0,6.0,9.0,1.0,10.0,10.0,9.0,8.0,1.0,3.0,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,35.0,20.0,15.0,20.0,5.0,5.0,6.0,8.0,8.0,8.0,7.0,,,,,,1,6.0,9.0,7.0,7.0,6.0,5.0,7.0,6.0,2.0,4.0,,,,,,,,,,,,6.0,2.0,1.0,19.44,16.67,13.89,22.22,11.11,16.67,,,,,,,,,,,,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,7.0,7.0,7.0,7.0
1,1,1.0,0,1,1,1,10,7,,3,2,12.0,0,0.54,0,22.0,2.0,60.0,0.0,0.0,40.0,0.0,0.0,0,7.0,8.0,10.0,7.0,7.0,5.0,8.0,4.0,2.0,21.0,Law,1.0,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,9.0,2.0,8.0,9.0,1.0,1.0,5.0,1.0,5.0,6.0,9.0,1.0,10.0,10.0,9.0,8.0,1.0,3.0,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,35.0,20.0,15.0,20.0,5.0,5.0,6.0,8.0,8.0,8.0,7.0,,,,,,1,7.0,8.0,7.0,8.0,5.0,6.0,7.0,5.0,1.0,4.0,,,,,,,,,,,,6.0,2.0,1.0,19.44,16.67,13.89,22.22,11.11,16.67,,,,,,,,,,,,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,7.0,7.0,7.0,7.0
2,1,1.0,0,1,1,1,10,7,,10,3,13.0,1,0.16,1,22.0,4.0,19.0,18.0,19.0,18.0,14.0,12.0,1,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0,21.0,Law,1.0,,,4.0,2.0,4.0,Chicago,60521,69487.0,2.0,7.0,1.0,lawyer,,9.0,2.0,8.0,9.0,1.0,1.0,5.0,1.0,5.0,6.0,9.0,1.0,10.0,10.0,9.0,8.0,1.0,3.0,15.0,20.0,20.0,15.0,15.0,15.0,,,,,,,35.0,20.0,15.0,20.0,5.0,5.0,6.0,8.0,8.0,8.0,7.0,,,,,,1,5.0,8.0,9.0,8.0,5.0,7.0,7.0,,1.0,4.0,,,,,,,,,,,,6.0,2.0,1.0,19.44,16.67,13.89,22.22,11.11,16.67,,,,,,,,,,,,,6.0,7.0,8.0,7.0,6.0,,,,,,1.0,1.0,0.0,15.0,20.0,20.0,15.0,15.0,15.0,5.0,7.0,7.0,7.0,7.0


#### Check data distribution
We quickly check whether some columns have a remarkable distribution (too many or too few distinct values, extremely unbalanced distribution...). Everything seems OK.

In [6]:
# Describe the dataset
dating.describe(include="all")

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,pf_o_fun,pf_o_amb,pf_o_sha,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o,age,field,field_cd,undergra,tuition,race,imprace,imprelig,from,zipcode,income,goal,date,go_out,career,career_c,sports,tvsports,exercise,dining,museums,art,hiking,gaming,clubbing,reading,tv,theater,movies,concerts,music,shopping,yoga,exphappy,attr1_1,sinc1_1,intel1_1,fun1_1,amb1_1,shar1_1,attr4_1,sinc4_1,intel4_1,fun4_1,amb4_1,shar4_1,attr2_1,sinc2_1,intel2_1,fun2_1,amb2_1,shar2_1,attr3_1,sinc3_1,fun3_1,intel3_1,amb3_1,attr5_1,sinc5_1,intel5_1,fun5_1,amb5_1,dec,attr,sinc,intel,fun,amb,shar,like,prob,met,match_es,attr1_s,sinc1_s,intel1_s,fun1_s,amb1_s,shar1_s,attr3_s,sinc3_s,intel3_s,fun3_s,amb3_s,satis_2,length,numdat_2,attr1_2,sinc1_2,intel1_2,fun1_2,amb1_2,shar1_2,attr4_2,sinc4_2,intel4_2,fun4_2,amb4_2,shar4_2,attr2_2,sinc2_2,intel2_2,fun2_2,amb2_2,shar2_2,attr3_2,sinc3_2,intel3_2,fun3_2,amb3_2,attr5_2,sinc5_2,intel5_2,fun5_2,amb5_2,you_call,them_cal,date_3,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3
count,8378.0,8377.0,8378.0,8378.0,8378.0,8378.0,8378.0,8378.0,6532.0,8378.0,8378.0,8368.0,8378.0,8220.0,8378.0,8274.0,8305.0,8289.0,8289.0,8289.0,8280.0,8271.0,8249.0,8378.0,8166.0,8091.0,8072.0,8018.0,7656.0,7302.0,8128.0,8060.0,7993.0,8283.0,8315,8296.0,4914,3583.0,8315.0,8299.0,8299.0,8299,7314.0,4279.0,8299.0,8281.0,8299.0,8289,8240.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8299.0,8277.0,8299.0,8299.0,8299.0,8289.0,8279.0,8257.0,6489.0,6489.0,6489.0,6489.0,6489.0,6467.0,8299.0,8299.0,8299.0,8299.0,8289.0,8289.0,8273.0,8273.0,8273.0,8273.0,8273.0,4906.0,4906.0,4906.0,4906.0,4906.0,8378.0,8176.0,8101.0,8082.0,8028.0,7666.0,7311.0,8138.0,8069.0,8003.0,7205.0,4096.0,4096.0,4096.0,4096.0,4096.0,4096.0,4000.0,4000.0,4000.0,4000.0,4000.0,7463.0,7463.0,7433.0,7445.0,7463.0,7463.0,7463.0,7463.0,7463.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,5775.0,7463.0,7463.0,7463.0,7463.0,7463.0,4377.0,4377.0,4377.0,4377.0,4377.0,3974.0,3974.0,3974.0,3974.0,3974.0,3974.0,3974.0,3974.0,3974.0,3974.0,3974.0,3974.0,3974.0,3974.0
unique,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,259,,241,115.0,,,,269,409.0,261.0,,,,367,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
top,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Business,,UC Berkeley,26908.0,,,,New York,0.0,55080.0,,,,Finance,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
freq,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,521,,107,241.0,,,,522,355.0,124.0,,,,202,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
mean,283.675937,8.960248,0.500597,17.327166,1.828837,11.350919,16.872046,9.042731,9.295775,8.927668,8.963595,283.863767,0.164717,0.19601,0.395799,26.364999,2.756653,22.495347,17.396867,20.270759,17.459714,10.685375,11.84593,0.419551,6.190411,7.175256,7.369301,6.400599,6.778409,5.47487,6.134498,5.208251,1.960215,26.358928,,7.662488,,,2.757186,3.784793,3.651645,,,,2.122063,5.006762,2.158091,,5.277791,6.425232,4.575491,6.245813,7.783829,6.985781,6.714544,5.737077,3.881191,5.745993,7.678515,5.304133,6.776118,7.919629,6.825401,7.851066,5.631281,4.339197,5.534131,22.514632,17.396389,20.265613,17.457043,10.682539,11.845111,26.39436,11.071506,12.636308,15.566805,9.780089,11.014845,30.362192,13.273691,14.416891,18.42262,11.744499,11.854817,7.084733,8.294935,7.70446,8.403965,7.578388,6.941908,7.927232,8.284346,7.426213,7.617611,0.419909,6.189995,7.175164,7.368597,6.400598,6.777524,5.474559,6.134087,5.207523,0.948769,3.207814,20.791624,15.434255,17.243708,15.260869,11.144619,12.457925,7.21125,8.082,8.25775,7.6925,7.58925,5.71151,1.843495,2.338087,26.217194,15.865084,17.813755,17.654765,9.913436,12.760263,26.806234,11.929177,12.10303,15.16381,9.342511,11.320866,29.344369,13.89823,13.958265,17.967233,11.909735,12.887976,7.125285,7.931529,8.238912,7.602171,7.486802,6.827964,7.394106,7.838702,7.279415,7.332191,0.780825,0.981631,0.37695,24.384524,16.588583,19.411346,16.233415,10.898075,12.699142,7.240312,8.093357,8.388777,7.658782,7.391545
std,158.583367,5.491329,0.500029,10.940735,0.376673,5.995903,4.358458,5.514939,5.650199,5.477009,5.491068,158.584899,0.370947,0.303539,0.489051,3.563648,1.230689,12.569802,7.044003,6.782895,6.085526,6.126544,6.362746,0.493515,1.950305,1.740575,1.550501,1.954078,1.79408,2.156163,1.841258,2.129354,0.245925,3.566763,,3.758935,,,1.230905,2.845708,2.805237,,,,1.407181,1.444531,1.105246,,3.30952,2.619024,2.801874,2.418858,1.754868,2.052232,2.263407,2.570207,2.620507,2.502218,2.006565,2.529135,2.235152,1.700927,2.156283,1.791827,2.608913,2.717612,1.734059,12.587674,7.0467,6.783003,6.085239,6.124888,6.362154,16.297045,6.659233,6.717476,7.328256,6.998428,6.06015,16.249937,6.976775,6.263304,6.577929,6.886532,6.167314,1.395783,1.40746,1.564321,1.076608,1.778315,1.498653,1.627054,1.283657,1.779129,1.773094,0.493573,1.950169,1.740315,1.550453,1.953702,1.794055,2.156363,1.841285,2.129565,0.989889,2.444813,12.968524,6.915322,6.59642,5.356969,5.514028,5.921789,1.41545,1.455741,1.179317,1.626839,1.793136,1.820764,0.975662,0.63124,14.388694,6.658494,6.535894,6.129746,5.67555,6.651547,16.402836,6.401556,5.990607,7.290107,5.856329,6.296155,14.551171,6.17169,5.398621,6.100307,6.313281,5.615691,1.37139,1.503236,1.18028,1.5482,1.744634,1.411096,1.588145,1.280936,1.647478,1.521854,1.611694,1.382139,0.484683,13.71212,7.471537,6.124502,5.163777,5.900697,6.557041,1.576596,1.610309,1.459094,1.74467,1.961417
min,1.0,1.0,0.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,0.0,-0.83,0.0,18.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,18.0,,1.0,,,1.0,0.0,1.0,,,,1.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,2.0,3.0,2.0,2.0,1.0,3.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,3.0,1.0,4.0,3.0,2.0,1.0,1.0,1.0,5.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,4.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,3.0,2.0,1.0
25%,154.0,4.0,0.0,8.0,2.0,7.0,14.0,4.0,4.0,4.0,4.0,154.0,0.0,-0.02,0.0,24.0,2.0,15.0,15.0,17.39,15.0,5.0,9.52,0.0,5.0,6.0,6.0,5.0,6.0,4.0,5.0,4.0,2.0,24.0,,5.0,,,2.0,1.0,1.0,,,,1.0,4.0,1.0,,2.0,4.0,2.0,5.0,7.0,6.0,5.0,4.0,2.0,4.0,7.0,3.0,5.0,7.0,5.0,7.0,4.0,2.0,5.0,15.0,15.0,17.39,15.0,5.0,9.52,10.0,6.0,8.0,10.0,5.0,7.0,20.0,10.0,10.0,15.0,6.0,10.0,6.0,8.0,7.0,8.0,7.0,6.0,7.0,8.0,6.0,7.0,0.0,5.0,6.0,6.0,5.0,6.0,4.0,5.0,4.0,0.0,2.0,14.81,10.0,10.0,10.0,7.0,9.0,7.0,7.0,8.0,7.0,7.0,5.0,1.0,2.0,16.67,10.0,15.0,15.0,5.0,10.0,10.0,8.0,8.0,9.0,5.0,7.0,19.15,10.0,10.0,15.0,10.0,10.0,7.0,7.0,8.0,7.0,7.0,6.0,6.0,7.0,6.0,6.0,0.0,0.0,0.0,15.22,10.0,16.67,14.81,5.0,10.0,7.0,7.0,8.0,7.0,6.0
50%,281.0,8.0,1.0,16.0,2.0,11.0,18.0,8.0,9.0,8.0,8.0,281.0,0.0,0.21,0.0,26.0,2.0,20.0,18.37,20.0,18.0,10.0,10.64,0.0,6.0,7.0,7.0,7.0,7.0,6.0,6.0,5.0,2.0,26.0,,8.0,,,2.0,3.0,3.0,,,,2.0,5.0,2.0,,6.0,7.0,4.0,6.0,8.0,7.0,7.0,6.0,3.0,6.0,8.0,6.0,7.0,8.0,7.0,8.0,6.0,4.0,6.0,20.0,18.18,20.0,18.0,10.0,10.64,25.0,10.0,10.0,15.0,10.0,10.0,25.0,15.0,15.0,20.0,10.0,10.0,7.0,8.0,8.0,8.0,8.0,7.0,8.0,8.0,8.0,8.0,0.0,6.0,7.0,7.0,7.0,7.0,6.0,6.0,5.0,0.0,3.0,17.65,15.79,18.42,15.91,10.0,12.5,7.0,8.0,8.0,8.0,8.0,6.0,1.0,2.0,20.0,16.67,19.05,18.37,10.0,13.0,25.0,10.0,10.0,15.0,10.0,10.0,25.0,15.0,15.0,18.52,10.0,13.95,7.0,8.0,8.0,8.0,8.0,7.0,8.0,8.0,7.0,7.0,0.0,1.0,0.0,20.0,16.67,20.0,16.33,10.0,14.29,7.0,8.0,8.0,8.0,8.0
75%,407.0,13.0,1.0,26.0,2.0,15.0,20.0,13.0,14.0,13.0,13.0,408.0,0.0,0.43,1.0,28.0,4.0,25.0,20.0,23.81,20.0,15.0,16.0,1.0,8.0,8.0,8.0,8.0,8.0,7.0,7.0,7.0,2.0,28.0,,10.0,,,4.0,6.0,6.0,,,,2.0,6.0,3.0,,7.0,9.0,7.0,8.0,9.0,9.0,8.0,8.0,6.0,8.0,9.0,7.0,9.0,9.0,8.0,9.0,8.0,7.0,7.0,25.0,20.0,23.81,20.0,15.0,16.0,35.0,15.0,16.0,20.0,15.0,15.0,40.0,18.75,20.0,20.0,15.0,15.63,8.0,9.0,9.0,9.0,9.0,8.0,9.0,9.0,9.0,9.0,1.0,8.0,8.0,8.0,8.0,8.0,7.0,7.0,7.0,2.0,4.0,25.0,20.0,20.0,20.0,15.0,16.28,8.0,9.0,9.0,9.0,9.0,7.0,3.0,3.0,30.0,20.0,20.0,20.0,15.0,16.67,40.0,15.0,15.0,20.0,10.0,15.0,38.46,19.23,17.39,20.0,15.09,16.515,8.0,9.0,9.0,9.0,9.0,8.0,8.0,9.0,8.0,8.0,1.0,1.0,1.0,30.0,20.0,20.0,20.0,15.0,16.67,8.0,9.0,9.0,9.0,9.0


In [7]:
# Count number of men and women
dating.groupby("gender").count()["iid"]

gender
0    4184
1    4194
Name: iid, dtype: int64

We can note that participants have an average age of about 26, and three quarters of them are between 18 and 28. There are nearly as many men as women.

## 2. What are men and women looking for

### Define what men and women mostly expect from their partner.

In [8]:
# "We want to know what you look for in the opposite sex BEFORE EVENT"
#     wave 6-9: rating range 1-10 -> written in doc but appears to be false
#     wave 1-5 and 10-21: rate sum must be 100
expectations_index = [
    "attr1_1",
    "sinc1_1",
    "intel1_1",
    "fun1_1",
    "amb1_1",
    "shar1_1"
]

# Visual name of previous indexes
expectations_name = [
    "Attractive",
    "Sincere",
    "Intelligent",
    "Fun",
    "Ambitious",
    "Interests"
]

From now on we will focus on expectations. We exclude waves 6 to 9 since they are different from the rest of the data.

In [9]:
# Exclude waves 6, 7, 8 and 9 (copy so that columns can be added later without warnings)
dating = dating[(dating["wave"] <= 5) | (dating["wave"] > 9)].copy()

# Creating expectations tables for men and women
expectations = dating[["gender", *expectations_index]]
female_expectations = expectations[expectations["gender"] == 0].mean()[expectations_index]
male_expectations = expectations[expectations["gender"] == 1].mean()[expectations_index]
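The comment in cell 8 says that, in the waves kept here, the six weights should sum to 100. A quick sanity check could look like the sketch below. The toy rows and the 0.5 tolerance are assumptions made for illustration; only the column names come from the notebook.

```python
import pandas as pd

weight_cols = ["attr1_1", "sinc1_1", "intel1_1", "fun1_1", "amb1_1", "shar1_1"]

# Toy rows (hypothetical values): the first sums to 100, the second does not
toy = pd.DataFrame(
    [[20, 20, 20, 10, 20, 10],
     [9, 8, 9, 7, 6, 8]],
    columns=weight_cols,
)

# Row-wise total of the six weights; min_count leaves incomplete rows as NaN
totals = toy[weight_cols].sum(axis=1, min_count=len(weight_cols))

# Flag rows whose total is within a small tolerance of 100
# (the tolerance is an assumption, meant to absorb rounded values such as 16.67)
sums_to_100 = (totals - 100).abs() <= 0.5
print(sums_to_100.tolist())
```

Rows flagged `False` would be candidates for rescaling or exclusion before averaging.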

In [10]:
fig = go.Figure()

# Spider plot with men's expectations of women
fig.add_trace(go.Scatterpolar(
      r=male_expectations.to_list(),
      theta=expectations_name,
      fill='toself',
      name='Male expectations'
))
# Spider plot with women's expectations of men
fig.add_trace(go.Scatterpolar(
      r=female_expectations.to_list(),
      theta=expectations_name,
      fill='toself',
      name='Female expectations'
))

# Create layout
fig.update_layout(
  polar=dict(
    radialaxis=dict(
      visible=True,
      range=[0, 31]
    )),
  showlegend=True,
  title={"text":"People expectations for their ideal partner"}
)

fig.show()

We can see that:
- men put the most weight on attractiveness
- women spread their weight more evenly across the criteria

### Are people consistent?
The previous graph shows the expectations people think they have. But do they actually say "Yes" to the people they think they would? <br/>
To give a first answer to this question, let's compare people's expectations with the ratings they gave to the other person on the same criteria. We define the affinity score to do so.
<br/><br/>
Affinity is a score out of 10 representing how well a person's expectations match their rating of their partner. <br/>(If you want to go further: for each criterion, the expectation weight is multiplied by the rating given to the partner; these products are summed over the criteria, then divided by the sum of the expectation weights, which is theoretically 100.)
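Written as a formula, the definition above reads as follows. The symbols are introduced here for illustration only: $w_k$ is the weight a person gives to criterion $k$ before the event, and $r_k$ is the 1-10 rating they give their partner on that criterion.

$$
\text{affinity} = \frac{\sum_{k=1}^{6} w_k \, r_k}{\sum_{k=1}^{6} w_k}
$$

For example, weights $(20, 20, 20, 10, 20, 10)$ and ratings $(4, 10, 10, 8, 10, 2)$ give $\frac{80 + 200 + 200 + 80 + 200 + 20}{100} = \frac{780}{100} = 7.8$.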

In [11]:
# Rating of the person met
# rate 1-10
ranking_other_person = [
    "attr",
    "sinc",
    "intel",
    "fun",
    "amb",
    "shar"
]

# Define affinity based on one's expectations and the rating given to the other person
def affinity(df, i, expectations, ranking):
    score = 0
    expect_sum = 0
    # Evaluating each of the 6 expectation axes one by one
    for j in range(len(expectations)):
        expect = df.loc[i, expectations[j]] # importance of this attribute in one's expectations
        score += expect * df.loc[i, ranking[j]] # add this weight multiplied by the rating given to the other person
        expect_sum += expect # sum of expectation points, should be 100 if the rating rules are respected
    if expect_sum == 0: return np.nan # this person did not give any expectation
    else: return score / expect_sum # affinity weighted by the total expectation score

# Create affinity column
dating["affinity"] = np.nan
for i in dating.index:
    dating.loc[i, "affinity"] = affinity(dating, i, expectations_index, ranking_other_person)
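The row-by-row loop can be slow on several thousand rows. A vectorized variant might look like the sketch below. The helper name `affinity_vectorized` and the toy rows are hypothetical; the column names are the ones used in the notebook.

```python
import numpy as np
import pandas as pd

weight_cols = ["attr1_1", "sinc1_1", "intel1_1", "fun1_1", "amb1_1", "shar1_1"]
rating_cols = ["attr", "sinc", "intel", "fun", "amb", "shar"]

def affinity_vectorized(frame, weight_cols, rating_cols):
    """Weighted mean of the ratings, using the expectation weights (hypothetical helper)."""
    weights = frame[weight_cols].to_numpy(dtype=float)
    ratings = frame[rating_cols].to_numpy(dtype=float)
    weighted_sum = (weights * ratings).sum(axis=1)  # NaN in any term propagates
    weight_sum = weights.sum(axis=1)
    # A zero total means no expectations were given: return NaN for that row
    weight_sum = np.where(weight_sum == 0, np.nan, weight_sum)
    return pd.Series(weighted_sum / weight_sum, index=frame.index)

# Toy rows (hypothetical): a complete row, an all-zero weight row, a missing rating
toy = pd.DataFrame(
    [[20, 20, 20, 10, 20, 10, 4, 10, 10, 8, 10, 2],
     [0, 0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5],
     [20, 20, 20, 10, 20, 10, 4, 10, 10, 8, 10, np.nan]],
    columns=weight_cols + rating_cols,
)
result = affinity_vectorized(toy, weight_cols, rating_cols)
print(result)
```

As in the loop version, a row with any missing weight or rating ends up with a missing affinity.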

Visualise some results

In [12]:
# Display a sample of expectations, other person's rating, affinity and waves
display(dating.sample(10)[ranking_other_person+expectations_index+["affinity", "wave"]])

Unnamed: 0,attr,sinc,intel,fun,amb,shar,attr1_1,sinc1_1,intel1_1,fun1_1,amb1_1,shar1_1,affinity,wave
3819,4.0,10.0,10.0,8.0,10.0,2.0,20.0,20.0,20.0,10.0,20.0,10.0,7.8,11
3513,8.0,7.0,8.0,7.0,7.0,4.0,18.0,10.0,30.0,10.0,10.0,22.0,6.82,10
833,6.0,8.0,9.0,5.0,9.0,7.0,,,,,,,,3
1227,6.0,7.0,10.0,7.0,10.0,7.0,15.0,15.0,20.0,15.0,20.0,15.0,8.05,4
7569,8.0,8.0,8.0,8.0,8.0,7.0,58.0,5.0,8.0,10.0,7.0,12.0,7.88,21
3624,3.0,5.0,5.0,3.0,5.0,2.0,25.0,7.0,25.0,25.0,8.0,10.0,3.7,11
5422,7.0,8.0,8.0,7.0,7.0,7.0,20.51,14.53,24.79,17.09,5.98,17.09,7.393239,14
6495,5.0,7.0,8.0,5.0,6.0,5.0,16.0,25.0,20.0,12.0,12.0,15.0,6.22,16
5257,4.0,5.0,6.0,4.0,6.0,,15.0,10.0,25.0,25.0,10.0,15.0,,14
7806,4.0,8.0,9.0,3.0,8.0,2.0,50.0,20.0,10.0,5.0,10.0,5.0,5.55,21


Now we study the influence of the affinity score on the person's final decision: asking for a second date ("Yes") or not ("No").

In [13]:
# Group rows by one's final decision, then calculate the mean affinity for each group
dating["dec"] = dating["dec"].replace({0: "No", 1:"Yes"})
display(dating.groupby("dec")["affinity"].mean())

dec
No     6.062689
Yes    7.335092
Name: affinity, dtype: float64

In [14]:
# Same but separate men and women first
dating["gender"] = dating["gender"].replace({0: "Female", 1:"Male"})
display(dating.groupby(["gender", "dec"])["affinity"].mean())

gender  dec
Female  No     6.065627
        Yes    7.334928
Male    No     6.059359
        Yes    7.335222
Name: affinity, dtype: float64

<br/>We see that people have a higher affinity with people they say "Yes" to, regardless of their gender. <br/><br/>
Now, let's visualise the volume of "Yes" and "No" decisions depending on the affinity.

In [15]:
# Small function to label each affinity score with its bin, for readable graphs
def aff_bin(aff):
    if 0 <= aff < 1: return "0-1"
    if 1 <= aff < 2: return "1-2"
    if 2 <= aff < 3: return "2-3"
    if 3 <= aff < 4: return "3-4"
    if 4 <= aff < 5: return "4-5"
    if 5 <= aff < 6: return "5-6"
    if 6 <= aff < 7: return "6-7"
    if 7 <= aff < 8: return "7-8"
    if 8 <= aff < 9: return "8-9"
    if 9 <= aff <= 10: return "9-10"

# Apply the binning function
dating["affinity_bin"] = dating["affinity"].apply(aff_bin)

# Plot number of YES and NO for each affinity segment
#["0-1", "1-2", "2-3", "3-4", "4-5", "5-6", "6-7", "7-8", "8-9", "9-10"]
fig1 = px.histogram(
    dating, 
    x="affinity_bin", 
    color="dec", 
    category_orders={'affinity_bin':["0-1", "1-2", "2-3", "3-4", "4-5", "5-6", "6-7", "7-8", "8-9", "9-10"]},
    title="Influence of affinity on the final decision (volume)"
) 
fig1.show()
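The chain of `if` statements in `aff_bin` could also be replaced by `pd.cut`. The sketch below is one possible way, on toy scores that are not taken from the dataset. The last edge is nudged just above 10 so that a score of exactly 10 still lands in "9-10", mirroring the `<= 10` test in `aff_bin`.

```python
import numpy as np
import pandas as pd

# Labels "0-1", "1-2", ..., "9-10"
labels = [f"{i}-{i + 1}" for i in range(10)]

# Left-closed bins [0, 1), [1, 2), ..., [9, just above 10)
edges = list(range(10)) + [np.nextafter(10.0, np.inf)]

# Toy scores (hypothetical values)
scores = pd.Series([0.0, 3.7, 7.8, 9.0, 10.0, np.nan])

binned = pd.cut(scores, bins=edges, right=False, labels=labels)
print(binned.tolist())
```

The result is an ordered categorical, so plots and group-bys keep the bins in their natural order. Missing scores stay missing.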

To go further we need to study the proportions.

In [16]:
# Calculating percentage from previous graph
yes = dating[dating["dec"] == "Yes"].groupby("affinity_bin").size() # calculate yes volume
no = dating[dating["dec"] == "No"].groupby("affinity_bin").size() # calculate no volume
yes_percentage = (yes / (yes + no)).reset_index() # calculate percentage
yes_percentage.columns = ["affinity_bin", "yes_percentage"] # beautifying

# Plot evolution of "yes" decision across affinity segments 
fig = go.Figure(
    data = go.Scatter(
        x = yes_percentage["affinity_bin"], 
        y = yes_percentage["yes_percentage"]),
    layout = go.Layout(
        title = go.layout.Title(text = "Chance to ask for a second date based on affinity score (self evaluated)", x = 0.5),
        xaxis = go.layout.XAxis(title = 'Affinity'),
        yaxis = go.layout.YAxis(title = 'Yes percentage')
    )
)
fig.show()
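The share of "Yes" per bin can also be computed in one step, as the mean of a boolean column within each group. This is a sketch on toy data with hypothetical values. One difference from dividing two separate counts: a bin that contains only "Yes" (or only "No") still gets a value here, whereas index alignment in `yes / (yes + no)` would turn it into NaN.

```python
import pandas as pd

# Toy decisions (hypothetical values); the "9-10" bin contains only "Yes"
toy = pd.DataFrame({
    "affinity_bin": ["5-6", "5-6", "6-7", "6-7", "6-7", "9-10"],
    "dec": ["No", "Yes", "Yes", "Yes", "No", "Yes"],
})

# Mean of a boolean Series = fraction of True values in each group
yes_rate = (
    toy["dec"].eq("Yes")
    .groupby(toy["affinity_bin"])
    .mean()
    .rename("yes_rate")
    .reset_index()
)
print(yes_rate)
```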

### Conclusion regarding people consistency
There seems to be a correlation between how well the other person fits one's criteria and the chance of saying "Yes". People appear consistent between what they think they want before the event and the final choice they make.

## 3. Is sharing interests important?
First, let's visualise the volume of "Yes" and "No" decisions depending on the shared interest score. This score is evaluated by the person giving their decision.

In [17]:
# Plot final decision for different shared interests segments
fig1 = px.histogram(dating, x="shar", color="dec", title="Volume of decisions to ask for a second date by shared interest score") #nbins=60
fig1.show()

To go further we need to study the proportions.

In [18]:
# Calculating percentage from previous graph
yes = dating[dating["dec"] == "Yes"].groupby("shar").size() # yes volume
yes = pd.DataFrame(yes.drop([5.5, 6.5, 7.5, 8.5])) # dropping scores with only one value to clarify plot
no = dating[dating["dec"] == "No"].groupby("shar").size() # no volume
no = pd.DataFrame(no.drop([6.5, 7.5])) # dropping scores with only one value to clarify plot
yes_percentage = (yes / (yes + no)).reset_index() # percentage
yes_percentage.columns = ["shared_interests", "yes_percentage"] # beautifying

# Plot evolution of "yes" decision across shared interests segments 
fig = go.Figure(
    data = go.Scatter(
        x = yes_percentage["shared_interests"], 
        y = yes_percentage["yes_percentage"]),
    layout = go.Layout(
        title = go.layout.Title(text = "Chance to ask for a second date based on shared interests score (self evaluated)", x = 0.5),
        xaxis = go.layout.XAxis(title = 'Shared interests'),
        yaxis = go.layout.YAxis(title = 'Yes chance')
    )
)
fig.show()

Here we see a correlation between the shared interest score and the chance that the person will ask for a second date. The same should hold for matches (both people saying "Yes").

In [19]:
# Beautifying labels
dating["match"] = dating["match"].replace({0: "No match", 1:"Match"})

In [20]:
# Calculating match percentage for each shared interests score
match = dating[dating["match"] == "Match"].groupby("shar").size()
match = pd.DataFrame(match.drop([6.5, 7.5, 8.5])) # dropping scores with only one value to clarify plot
no = dating[dating["match"] == "No match"].groupby("shar").size()
no = pd.DataFrame(no.drop([5.5, 6.5, 7.5, 8.5])) # dropping scores with only one value to clarify plot
match_percentage = (match / (match + no)).reset_index()
match_percentage.columns = ["shared_interests", "match_percentage"]

# Plot evolution of "match" (i.e. two matching "yes" final decisions) across shared interests segments 
fig = go.Figure(
    data = go.Scatter(
        x = match_percentage["shared_interests"], 
        y = match_percentage["match_percentage"]),
    layout = go.Layout(
        title = go.layout.Title(text = "Chance to match based on shared interests score", x = 0.5),
        xaxis = go.layout.XAxis(title = 'Shared interests'),
        yaxis = go.layout.YAxis(title = 'Match percentage')
    )
)
fig.show()

### Conclusion regarding shared interests

Shared interests seem to be strongly associated with getting a second date. Unfortunately, correlation is not causation: this does not mean that pretending to share an interest will help you get a second date. Further experiments would be necessary to answer this question.