# Project : Speed dating - Exploratory Data Analysis

## 1. First look at the data 

### 1.1. Overview

In [None]:
# Installing the latest version of plotly to avoid some bugs
!python -m pip install --upgrade plotly 



Description of the variables is available [here](./Speed%20Dating%20Data%20Key.pdf)

In [None]:
import pandas as pd
import numpy as np
from itertools import product
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
pio.renderers.default = "iframe_connected"

RANDOM_SEED = 0

df = pd.read_csv('Speed Dating Data.csv', encoding='ISO-8859-1')
print(df.shape)
df.head()

In [None]:
pd.options.display.precision = 2
pd.options.display.max_columns = 200
pd.options.display.max_colwidth = 15
display(df)

### 1.2. What's in a row ?

According to the data key document and the dataset's structure, a row corresponds to one date between 2 participants of opposite sex (the 'subject' and the 'partner'), **from the subject's point of view (each date is therefore represented by two rows : one with a given person as the 'subject' and another with the same person as the 'partner')**

The subject is identified by : 
- **their unique id 'iid'** (we assume that each subject has participated in only one wave)
- their id within the wave 'idg', unique for each subject in a wave _(contrary to what is stated in the data key document)_ 
- their id within wave and gender 'id', unique for each subject within a gender and wave _(contrary to what is stated in the data key document)_

In a similar way, the partner is identified by : 
- **their unique id 'pid' (corresponding to 'iid')**
- their id within the wave 'partner' (corresponding to 'id')

Let's take the example of wave 2 to visualize it : 

In [None]:
df_w2 = df[df['wave']==2]
df_w2

In [None]:
female_participants_iid_w2 = list(np.sort(df_w2[df_w2['gender']==0]['iid'].unique()))
male_participants_iid_w2 = list(np.sort(df_w2[df_w2['gender']==1]['iid'].unique()))

female_participants_nb_w2 = len(female_participants_iid_w2)
male_participants_nb_w2 = len(male_participants_iid_w2)

female_participants_idg_w2 = list(np.sort(df_w2[df_w2['gender']==0]['idg'].unique()))
male_participants_idg_w2 = list(np.sort(df_w2[df_w2['gender']==1]['idg'].unique()))

female_participants_id_w2 = list(np.sort(df_w2[df_w2['gender']==0]['id'].astype(int).unique()))
male_participants_id_w2 = list(np.sort(df_w2[df_w2['gender']==1]['id'].astype(int).unique()))

female_example_iid = female_participants_iid_w2[0]
male_example_iid = male_participants_iid_w2[0]

def who_they_met_in_wave2(subject_iid):
    return df_w2[df_w2['iid']==subject_iid]['pid'].astype(int).sort_values().tolist()

print("--------------------------------------------------------- Wave 2 ----------------------------------------------------------------------------------")
print(f"Number of participants : {female_participants_nb_w2} women and {male_participants_nb_w2} men ")
print(f"Female participants iids : {female_participants_iid_w2}")
print(f"Male participants iids : {male_participants_iid_w2}")
print()
print(f"Female participants idgs : {female_participants_idg_w2}")
print(f"Male participants idgs : {male_participants_idg_w2}")
print()
print(f"Female participants ids : {female_participants_id_w2}")
print(f"Male participants ids : {male_participants_id_w2}")
print()
print("For example :",
      f"    - participant with iid nb. {female_example_iid} (female) met with {len(who_they_met_in_wave2(female_example_iid))} men with following iids : {who_they_met_in_wave2(female_example_iid)}",
      f"    - participant with iid nb. {male_example_iid} (male) met with {len(who_they_met_in_wave2(male_example_iid))} women with following iids : {who_they_met_in_wave2(male_example_iid)}",
      sep = "\n")

=> We see that generally, each woman meets with all men in her wave and vice-versa.  
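The two-rows-per-date structure described in this section can be illustrated on a toy frame. This is only a sketch : the ids below are made up, and the column names `iid` and `pid` are the only things borrowed from the dataset.

```python
import pandas as pd

# Toy data (hypothetical ids) : 2 women (iid 1, 2) each meet 2 men (iid 3, 4).
# Every date appears twice, once from each participant's point of view.
toy = pd.DataFrame({
    "iid": [1, 1, 2, 2, 3, 3, 4, 4],
    "pid": [3, 4, 3, 4, 1, 2, 1, 2],
})

# Set of (subject, partner) pairs found in the rows
pairs = set(zip(toy["iid"], toy["pid"]))

# A pair is "mirrored" when the same date also exists with the roles swapped
has_mirror = [(pid, iid) in pairs for iid, pid in pairs]

# Number of distinct dates = number of unordered pairs
n_dates = len({frozenset(pair) for pair in pairs})

print("Every row has a mirrored row :", all(has_mirror))
print("Rows :", len(toy), "- distinct dates :", n_dates)
```

If the same check were run on the real data, any pair without a mirror would point to a date recorded from only one side.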

### 1.3. Identifiers (subjects and partners)

In [None]:
print(f"Number of subjects : {len(df['iid'][df['iid'].notna()].unique())}")
print(f"Number of partners : {len(df['pid'][df['pid'].notna()].unique())}")
print(f"Are the iids and pids the same ? => {np.array_equal(np.sort(df['iid'][df['iid'].notna()].unique()), np.sort(df['pid'][df['pid'].notna()].unique()))}")

_Note : there are 551 subjects although their iids range from 1 to 552, because iid 118 is missing from the dataset._
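A gap in an otherwise continuous id range can be located with a set difference. The sketch below uses a small made-up array rather than the real 'iid' column.

```python
import numpy as np

# Hypothetical ids : the value 3 is absent from an otherwise continuous range
iids = np.array([1, 2, 4, 5, 6])

# Every id we would expect if the range had no gap
expected = np.arange(iids.min(), iids.max() + 1)

# Ids that are expected but not present
missing_iids = np.setdiff1d(expected, iids)
print(missing_iids.tolist())
```

On the real data, `iids` would be replaced by `df['iid'].unique()`.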

In [None]:
print("Missing values for 'iid' : ", df['iid'].isnull().sum())
print("Missing values for 'pid' : ", df['pid'].isnull().sum())

In [None]:
df[df['pid'].isna()]

As no subject id is missing and only one case has a missing partner id, we can drop the other identifier columns ('id', 'idg', 'partner'), which provide no additional information, and also drop the rows where 'pid' is NaN.

In [None]:
df = df.drop(columns = ['id','idg','partner']).dropna(subset = ['pid'])
print(df.shape)
df

Let's take a look at the sample of data for one random subject : 

In [None]:
rng = np.random.default_rng(RANDOM_SEED)
display(df[df['iid']==rng.choice(df['iid'].unique())].sort_values('order'))

### 1.4. Getting data grouped by subject

Here the aggregation function we will use is the mean, as :
- it directly gives ratios for binary data (for example : 'samerace', 'match'...)
- it keeps the value of data that is subject-specific, i.e. constant across all rows of a subject (for example 'gender', 'goal', 'race'...)

Non-numerical columns have no mean, so for those we keep the first value found for each subject instead.
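A minimal sketch of this aggregation on a toy frame : the mean for numerical columns and 'first' for a text column, as in the aggregation cell further down. The values are made up; only the column names echo the dataset.

```python
import pandas as pd

# Toy data (hypothetical values) : subject 1 had 4 dates, subject 2 had 2
toy = pd.DataFrame({
    "iid":    [1, 1, 1, 1, 2, 2],
    "match":  [1, 0, 0, 1, 0, 0],   # binary, varies from date to date
    "gender": [0, 0, 0, 0, 1, 1],   # constant for a given subject
    "field":  ["Law", "Law", "Law", "Law", "Physics", "Physics"],
})

toy_by_iid = toy.groupby("iid").agg(
    match=("match", "mean"),    # mean of a 0/1 column = ratio of 1s
    gender=("gender", "mean"),  # mean of a constant column = that constant
    field=("field", "first"),   # text column : keep the first value
)
print(toy_by_iid)
```

The mean of 'match' reads as a match ratio per subject, while the mean of 'gender' simply returns the subject's own value.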

In [None]:
df.dtypes.value_counts()

In [None]:
obj_col_list = [col for col in df.columns if df.dtypes[col]==object]
print(f"Found {len(obj_col_list)} non-numerical columns : {obj_col_list}")
num_col_list = [col for col in df.columns if col not in obj_col_list]
print(f"Found {len(num_col_list)} numerical columns : {num_col_list}")

Checking that non-numerical data is subject-specific before applying the aggregation : 

In [None]:
for col in obj_col_list:
    is_subject_specific = True
    for iid in df['iid'].unique():
        if len(df[df['iid']==iid][col].unique())>1:
            is_subject_specific = False
    if is_subject_specific == True:
        print(f"OK : Column '{col}' is subject-specific")
    else:
        print(f"WARNING : Column '{col}' is NOT subject-specific")

In [None]:
df_by_iid = df.groupby('iid').agg({k:('mean' if k in num_col_list else 'first') for k in df.columns})
df_by_iid

In [None]:
#Showing an example of grouped data for a random subject 
print("Raw data : ")
rng = np.random.default_rng(RANDOM_SEED)
display(df[df['iid']==rng.choice(df['iid'].unique())].sort_values('order'))
print()
print("grouped data (mean aggregation) : ")
rng = np.random.default_rng(RANDOM_SEED)
df_by_iid.loc[[rng.choice(df['iid'].unique())],:]

We now have **the global dataframe and also a dataframe grouped by subject id for our analysis.**
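As a side note, the subject-specificity check performed before the aggregation can also be written without explicit loops. The sketch below is one possible vectorized form on made-up data; it is not the code used above, and the column names 'field' and 'career' only serve as examples.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) : 'field' is constant per subject, 'career' is not
toy = pd.DataFrame({
    "iid":    [1, 1, 2, 2, 3, 3],
    "field":  ["Law", "Law", "Physics", "Physics", np.nan, np.nan],
    "career": ["lawyer", "judge", "teacher", "teacher", "artist", "artist"],
})

# Number of distinct values per subject and column.
# dropna=False counts NaN as a value, like Series.unique() does.
distinct_per_subject = toy.groupby("iid")[["field", "career"]].nunique(dropna=False)

# A column is subject-specific when no subject has more than one distinct value
is_subject_specific = (distinct_per_subject <= 1).all()
print(is_subject_specific)
```

This gives one boolean per column, which avoids filtering the dataframe once per subject.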

### 1.5. Overall statistics

In [None]:
df.describe(include = 'all')

In [None]:
total_values_count = df.shape[0]*df.shape[1]
missing_values_count = df.isnull().values.sum()
print(f"On a total of {total_values_count} values in the dataset, there are {missing_values_count} missing values, which represent {round(missing_values_count/total_values_count*100)}%")

-> We already notice that the dataset has **a lot of missing values**; let's analyse them in more detail. 
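For reference, both the overall and the per-column share of missing values can be obtained from the boolean mask returned by `isna()`, since the mean of booleans is the share of `True`. A small sketch on made-up data :

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "age":   [21, 25, np.nan, 30],
    "attr":  [7, np.nan, np.nan, 6],
    "match": [0, 1, 0, 0],
})

# Overall share of missing cells, in %
overall_pct = toy.isna().to_numpy().mean() * 100

# Share of missing values per column, in %
per_column_pct = (toy.isna().mean() * 100).round(2)

print(f"Overall : {overall_pct}%")
print(per_column_pct)
```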

## 2. Missing values : What are the participants so shy (or lazy) about ?

### 2.1. Overview

In [None]:
def get_missing_values_percentage(dataframe):
    # Input : dataframe (n rows, m columns)
    # Output : a series (m values) whose indexes are the input dataframe column names 
    #          and the values are the percentage of missing values found in the column,
    #          rounded to 2 decimal places.
    return dataframe.isnull().sum().apply(lambda x : round(x/dataframe.shape[0]*100, 2)) 

In [None]:
#Looking for the percentage of NaN values in each column
missing_values_series = get_missing_values_percentage(df)
missing_values_series

In [None]:
# Defining a function to display a long series as a horizontal DataFrame
def row_display(series): #Displays a pandas series horizontally as a 1-row-dataframe
    display_dataframe = pd.DataFrame(columns = series.index)
    display_dataframe.loc[0] = list(series.values)
    return display_dataframe

In [None]:
missing_values_row_df = row_display(missing_values_series)
print(missing_values_row_df.shape)
display(missing_values_row_df)

In [None]:
missing_values_fig = go.Figure(
    data = go.Bar(x = missing_values_series.index, y = missing_values_series.values),
    layout = go.Layout(
        title = go.layout.Title(text = "Missing values percentage in the dataset", x = 0.5),
        xaxis = go.layout.XAxis(title = 'data',  tickangle = -90, rangeslider = go.layout.xaxis.Rangeslider(visible = True)),
        yaxis = go.layout.YAxis(title = '%', range = [0, 100])
    )
)

missing_values_fig.add_hline(y = 50, line_color = 'black', 
                              line_dash = 'dash', 
                              annotation_text = '50%', 
                              annotation_xanchor = 'left', 
                              annotation_x = 1.01, 
                              annotation_font_size = 14) #the last arguments are kwargs from the add_shape arguments and layout.Annotation properties  

missing_values_fig.show()

-> We observe that the last columns (from 'you_call' to the end of the dataset) contain a majority of NaN values.   
These columns correspond to the __answers to the last follow-up survey__ (3-4 weeks after the participants had been sent their matches)

It also seems like **the more we move forward in time from the experiment, the more missing values we have.**   
First of all, let's separate the data that comes before the survey data in the data key document (i.e. everything up to, but excluding, 'age') from the rest : it seems to be global data, not tied to a chronological step of the survey, and probably filled in by the organizers rather than by the participating students.  

### 2.2. Non-survey data

In [None]:
df_non_survey = df.loc[:,'iid':'met_o']
df_non_survey

In [None]:
missing_values_series_non_survey = get_missing_values_percentage(df_non_survey)

missing_values_fig_non_survey = go.Figure(
    data = go.Bar(x = missing_values_series_non_survey.index, y = missing_values_series_non_survey.values),
    layout = go.Layout(
        title = go.layout.Title(text = "Missing values percentage in the non-survey data ", x = 0.5),
        xaxis = go.layout.XAxis(title = 'data',  tickangle = -90),
        yaxis = go.layout.YAxis(title = '%', range = [0, 100])
    )
)

missing_values_fig_non_survey.add_hline(y = 50, line_color = 'black', 
                              line_dash = 'dash', 
                              annotation_text = '50%', 
                              annotation_xanchor = 'left', 
                              annotation_x = 1.01, 
                              annotation_font_size = 14) #the last arguments are kwargs from the add_shape arguments and layout.Annotation properties  

missing_values_fig_non_survey.show()

**-> all the missing values percentages for the global, non-survey data are considered negligible**, except maybe for 'positin1' (station number where started)  

### 2.3. Survey data

#### 2.3.1. Missing values quantity evolution over time

Now let's split the survey data in different chronologically ordered categories, and show the mean percentage of missing values for each step : 
1. Registration to the event (columns `age` to `amb5_1` included)
2. During the event (columns `dec` to `amb3_s` included)
3. 1-day-after follow-up - 'mandatory' survey (columns `satis_2` to `amb5_2` included)
4. 3/4-weeks-after matches follow-up - last survey (all last columns since `you_call`)
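These column ranges rely on label-based slicing with `.loc`, which, unlike positional slicing, includes both endpoints. A quick sketch on a one-row toy frame (the values are made up) :

```python
import pandas as pd

# Toy frame with a handful of column names in their dataset order
toy = pd.DataFrame(
    [[1, 1, 21, "Law", 7, 1, 8]],
    columns=["iid", "wave", "age", "field", "amb5_1", "dec", "like"],
)

# Label slices include BOTH the start and the end label
registration_cols = toy.loc[:, "age":"amb5_1"].columns.tolist()

# An open-ended slice runs up to the last column
tail_cols = toy.loc[:, "dec":].columns.tolist()

print(registration_cols)
print(tail_cols)
```

The slices depend on the column order of the file, so they would need to be revisited if the columns were reordered.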

In [None]:
df_reg = df.loc[:,'age':'amb5_1']
df_event = df.loc[:,'dec':'amb3_s']
df_after_1d = df.loc[:,'satis_2':'amb5_2']
df_after_3w = df.loc[:,'you_call':]
df_list = [df_reg, df_event, df_after_1d, df_after_3w]
time_labels = ['Before the event\n(registration)', 'During the event', 'Follow-up survey\n1 day after the event', 'Follow-up survey\n3-4 weeks after the event']

In [None]:
# Storing the missing values series (each corresponding to a time period) into a list for plotting in different graphs
# and creating a Series (missing_values_summary) giving for each period of time its mean of missing values percentage
missing_values_series_list = []
missing_values_summary = []
for i, df_part in enumerate(df_list):
    missing_values_series_list.append(get_missing_values_percentage(df_part))
    missing_values_mean_percentage = missing_values_series_list[i].mean()
    missing_values_summary.append(missing_values_mean_percentage)
missing_values_summary = pd.Series(missing_values_summary, index = time_labels)

In [None]:
sns.set_context("talk") # Setting a default scale for the axis and labels
plt.figure(figsize = (15,7))
plt.title("Mean percentage of missing values\nby time period in survey data", fontdict = {'fontsize': 22})
ax = sns.barplot(x = missing_values_summary.index, y = missing_values_summary.values, saturation = 0.7, palette ="YlOrRd")
ax.bar_label(ax.containers[0], fmt = '%d', fontsize = 20) # Adding values over bars, shown as integers
plt.axhline(y = 50, ls = '--', c = 'Black')
ax.text(3.55, 50,'50%', verticalalignment = 'center', fontsize=18)
ax.set_ylim([0, 100])
ax.set_ylabel('%')

sns.despine() # removing the right and top frame lines around the figure
plt.show()

-> We can confirm that **the percentage of missing values increases over the course of the experiment.** 

Reading the data features description of the experiment, we see that (in each case, we do not know if some data was specified as mandatory or not): 
1. Before the event : survey filled out by the students interested in participating
2. During the event : one part of the survey is filled in at the beginning and the other part halfway through
3. Follow-up survey (1 day) : information to be filled in by the students in order to be sent their matches (incentive to answer)
4. Follow-up survey (3-4 weeks) : 3-4 weeks after the students have been sent their matches - no incentives to answer are specified

This increase in NA values over time could be explained by a progressive loss of interest in the questions asked in the surveys, or maybe by an increasing complexity of the questions.

Now let's check if we can identify some other factors that tend to increase the missing values, and if it says something about the participants.

In [None]:
missing_values_figs = make_subplots(rows = len(missing_values_series_list), cols = 1, subplot_titles = time_labels, vertical_spacing = 0.1) # 1 subplot by time period

for i, (missing_values_part_series, time_label) in enumerate(zip(missing_values_series_list, time_labels)):
    missing_values_figs.add_trace(go.Bar(x = missing_values_part_series.index, y = missing_values_part_series.values, name = time_label),
                                  row = i+1,
                                  col = 1)
    missing_values_figs.add_hline(y = missing_values_summary.loc[time_label], 
                                  line_color = 'black', 
                                  line_dash = 'dash', 
                                  annotation_text = f'mean<br>({round(missing_values_summary.loc[time_label])}%)', # plotly annotations are in html format
                                  annotation_xanchor = 'left', 
                                  annotation_x = 1.01, 
                                  annotation_font_size = 14, #the last arguments are kwargs from the add_shape arguments and layout.Annotation properties
                                  row = i+1,
                                  col = 1
                                 )   

missing_values_figs.update_yaxes(range = [0,100], title = '%')
missing_values_figs.update_xaxes(tickangle = -90, tickfont_size = 12)
missing_values_figs.update_layout(
    title = go.layout.Title(text = "Percentage of missing values in the survey data", x = 0.5, font_size = 22),
    showlegend = False,
    legend_title = 'Time of the study',
    legend_xanchor = 'left',
    legend_x = 1.05,
    legend_y = 0.5,
    legend_font_size = 16,
    height = 1300,
    width = 1500,
    autosize = False,
    
)
missing_values_figs.show()

We first notice that, for the data gathered during the event, every participant has filled in 'dec', which records whether they would want to see each partner again.  
Also for the registration survey, the most filled data have a missing values ratio < 1%, which we will consider as zero.  

**But for the surveys after the event, every question has at least 10% missing values for the "1-day-after survey" and 52% for the "3-4-weeks-after survey".**   

#### 2.3.2. A not so long-lasting commitment (to the survey)

Let's see how many participants did not fill in any part of each survey. 

In [None]:
df_by_iid_reg = df_by_iid.loc[:,'age':'amb5_1']
df_by_iid_event = df_by_iid.loc[:,'dec':'amb3_s']
df_by_iid_after_1d = df_by_iid.loc[:,'satis_2':'amb5_2']
df_by_iid_after_3w = df_by_iid.loc[:,'you_call':]
df_by_iid_list = [df_by_iid_reg, df_by_iid_event, df_by_iid_after_1d, df_by_iid_after_3w]

In [None]:
# Share (in %) of participants who left every question of a survey unanswered.
# The mean of a boolean series is the share of True values, and it also works when there is no True at all
# (value_counts()[True] would raise a KeyError in that case).
after_1d_no_answer = df_by_iid_after_1d.isnull().all(axis = 1)
after_1d_no_answer_ratio = after_1d_no_answer.mean()*100
after_3w_no_answer = df_by_iid_after_3w.isnull().all(axis = 1)
after_3w_no_answer_ratio = after_3w_no_answer.mean()*100
no_answer_ratio = [after_1d_no_answer_ratio, after_3w_no_answer_ratio]

In [None]:
sns.set_context("talk") # Setting a default scale for the axis and labels
plt.figure(figsize = (15,7))
plt.title("Percentage of participants not having filled\nany of the questions of the survey", fontdict = {'fontsize': 22})
ax = sns.barplot(x = time_labels[2:], y = no_answer_ratio, saturation = 0.7, palette ="YlOrRd")
ax.bar_label(ax.containers[0], fmt = '%.0f', fontsize = 20) # Adding values over bars, shown as integers
plt.axhline(y = 50, ls = '--', c = 'Black')
ax.text(1.55, 50,'50%', verticalalignment = 'center', fontsize=18)
ax.set_ylim([0, 100])
ax.set_ylabel('%')

sns.despine() # removing the right and top frame lines around the figure
plt.show()

-> We see that **for the final part of the survey, a majority of students did not respond at all !**    

To see whether some questions in particular were harder to answer than others, we have to compare them while excluding from the missing values calculation the people who did not respond at all. 
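The effect of this rescaling can be seen on a toy survey. The sketch below uses made-up answers and hypothetical question names `q1` and `q2` : rows that are entirely empty are removed before the missing values percentage is computed.

```python
import numpy as np
import pandas as pd

# Toy survey (hypothetical) : rows 1 and 3 did not answer anything
survey = pd.DataFrame({
    "q1": [5, np.nan, 3, np.nan],
    "q2": [np.nan, np.nan, 4, np.nan],
})

# True for rows where every question is unanswered
no_answer = survey.isna().all(axis=1)
no_answer_pct = no_answer.mean() * 100

# Missing values percentage per question, with and without the empty rows
raw_pct = survey.isna().mean() * 100
rescaled_pct = survey[~no_answer].isna().mean() * 100

print(f"{no_answer_pct}% of rows are entirely empty")
print(pd.DataFrame({"raw": raw_pct, "rescaled": rescaled_pct}))
```

In the raw figures both questions look heavily skipped; once the empty rows are removed, only `q2` still shows missing answers, which is the kind of question-level difference we are after.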

In [None]:
def get_missing_values_percentage_rescaled(dataframe):
    # Input : dataframe (n rows, m columns)
    # Output : a series (m values) whose indexes are the input dataframe column names 
    #          and the values are the percentage of missing values found in the column,
    #          rounded to 2 decimal places, 
    #          after removal in the input dataframe of the lines having only missing values
    clean_dataframe = dataframe[~dataframe.isnull().all(axis = 1)] # Taking away the lines with only NAN values
    print(f"Found {100-round(clean_dataframe.shape[0]/dataframe.shape[0]*100)}% of lines having only NA values, removed from the calculation of missing values percentage.")
    return clean_dataframe.isnull().sum().apply(lambda x : round(x/clean_dataframe.shape[0]*100, 2)) 

In [None]:
after_1d_missing_values_series_rescaled = get_missing_values_percentage_rescaled(df_after_1d)
after_3w_missing_values_series_rescaled = get_missing_values_percentage_rescaled(df_after_3w)

In [None]:
missing_values_series_rescaled_list = missing_values_series_list[:2] + [after_1d_missing_values_series_rescaled, after_3w_missing_values_series_rescaled]

In [None]:
missing_values_rescaled_summary = []
for i, df_part in enumerate(df_list):
    missing_values_mean_percentage = missing_values_series_rescaled_list[i].mean()
    missing_values_rescaled_summary.append(missing_values_mean_percentage)
missing_values_rescaled_summary = pd.Series(missing_values_rescaled_summary, index = time_labels)

In [None]:
sns.set_context("talk") # Setting a default scale for the axis and labels
plt.figure(figsize = (15,7))
plt.title("Mean percentage of missing values\nby time period in the experiment\n (without non-respondents)", fontdict = {'fontsize': 22})
ax = sns.barplot(x = missing_values_rescaled_summary.index, y = missing_values_rescaled_summary.values, saturation = 0.7, palette ="YlOrRd")
ax.bar_label(ax.containers[0], fmt = '%d', fontsize = 20) # Adding values over bars, shown as integers
plt.axhline(y = 50, ls = '--', c = 'Black')
ax.text(3.55, 50,'50%', verticalalignment = 'center', fontsize=18)
ax.set_ylim([0, 100])
ax.set_ylabel('%')

sns.despine() # removing the right and top frame lines around the figure
plt.show()

-> We now see that, once the people who did not respond at all are removed from each survey, **the missing values rate is much more stable over time, which suggests that the respondents probably did not struggle more with the last survey than with the others.** 

**It will hence be easier to compare the missing values rates for specific questions 'on the same scale' across all the survey data.**

In [None]:
missing_values_df_rescaled_list = []
for missing_values_part_series, time_label in zip(missing_values_series_rescaled_list, time_labels):
    missing_values_part_df = pd.DataFrame(missing_values_part_series.values, index = missing_values_part_series.index, columns = ['%_NAN'])
    missing_values_part_df['time_label'] = time_label
    missing_values_df_rescaled_list.append(missing_values_part_df)
missing_values_df_rescaled = pd.concat(missing_values_df_rescaled_list)

In [None]:
rescaled_missing_values_fig = go.Figure(
    data = go.Bar(
        x = [missing_values_df_rescaled['time_label'], missing_values_df_rescaled.index], 
        y = missing_values_df_rescaled['%_NAN']),
    layout = go.Layout(
        title = go.layout.Title(text = "Missing values percentage in the survey data <br>(without non-respondents)", x = 0.5),
        xaxis = go.layout.XAxis(title = 'data',  tickangle = -90, rangeslider = go.layout.xaxis.Rangeslider(visible = True)),
        yaxis = go.layout.YAxis(title = '%', range = [0, 100])
    )
)

rescaled_missing_values_fig.add_hline(y = missing_values_df_rescaled['%_NAN'].mean(), line_color = 'black', 
                              line_dash = 'dash', 
                              annotation_text = f"mean<br>({round(missing_values_df_rescaled['%_NAN'].mean())}%)", # plotly annotations are in html format
                              annotation_xanchor = 'left', 
                              annotation_x = 1.01, 
                              annotation_font_size = 14) #the last arguments are kwargs from the add_shape arguments and layout.Annotation properties 

rescaled_missing_values_fig.show()

**-> if we set aside the temporal aspect of the quantity of missing values, we see that some questions are skipped much more often than others :**  

**Before the event :** 
- the most frequently missing information (78% !) is 'expnum', the number of matches the student expects to get during the event. As we also notice more than 41% missing values for all attributes relating to how the student thinks others perceive them (attributes '5_1'), it seems the participants are not really keen on evaluating themselves against others on the 'market of love'...
    This does not mean that they don't want to talk about themselves, though, because the questions on how the students think they measure up have a nearly maximal response rate !  
- the questions on the background of the participants ('undergra', 'mn_sat', 'tuition' and 'income') are also skipped very often (between 40% and 60%)

**During the event :**
- a majority of people have not answered the last questions, concerning what they look for in the opposite sex (attributes '1_s') and how they think they measure up (attributes '3_s'). It turns out that all these questions are asked halfway through the event. At this point it is difficult to tell whether the lack of answers was due to the organization or to how relevant the students found the questions.

**1 day after the event :**  
- Many of the respondents (73%) have not answered the questions regarding the role played by the different attributes of partners in the yes/no decisions during the event.
- We see that this time, students were more interested in answering about themselves (what they look for, attributes '1_2', and how they measure up, attributes '3_2') than in making assumptions about others (attributes '4_2', '2_2' and '5_2'). One thing that has not changed since before the event is that they still don't know, or don't want to tell, how they think others perceive them.

**3-4 weeks after matches :**
- Many of the respondents (60%) have not answered the question regarding how many of their matches they have been on a date with so far, and even more (83% !) did not answer the question immediately following it, 'If yes, how many ?' (maybe because this last question is quite hard to understand !)
- The majority of respondents still tend not to answer the questions regarding the role played by the different attributes of partners in the yes/no decisions during the event (attributes '7_3') and how others perceive them (attributes '5_3'). 
- This time again, like in the 1-day-after survey, every respondent has answered the questions regarding themselves (attributes '1_3' and '3_3'), and respondents were more hesitant when making assumptions about others (attributes '4_3' and '2_3'). 

### 2.4. Conclusion

After this analysis on missing values, we can assume that : 

1. There was a clear increase in missing values over time, explained more by a **progressive loss of interest of the participants in the study** than by an increasing complexity of the questions (as the missing values rate is quite constant once the non-respondents are removed). The very high rate of non-respondents to the last survey can be explained by **the absence of any incentive to respond** (contrary to the previous follow-up survey, which had to be submitted in order for the participants to be sent their matches), in addition to **the time elapsed between the event and this last survey.**  

2. Once the temporal aspect is set aside, we saw that **the respondents are more eager to answer questions about their own personality or opinions than about those of others.** 

## 3. How to have a second date ?

### 3.1. Attractiveness - Who has the most chances ?

In [None]:
# Creating a correlation matrix giving the influence of each of the subject's characteristics on their chance to get a second date 
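# (non-numerical columns are excluded first, as recent pandas versions refuse to compute correlations on them)
corr = pd.concat([df_by_iid['date_3'], df_by_iid_reg], axis = 1).select_dtypes('number').corr().loc[:,['date_3']].drop(['date_3'], axis = 0)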

# Dropping the info that are about the partner and not the subject 
corr = corr[~corr.index.str.endswith('_o')]

# Sorting the values by descending absolute value
corr = corr.sort_values(by = 'date_3', ascending = False, key = abs)

# Compute the cumulative importance of each feature to select only the ones that represent the first 80% of the overall importance
corr['cumulated_importance'] = corr['date_3'].abs().cumsum()
corr['cumulated_importance_%'] = corr['cumulated_importance'] / corr['date_3'].abs().sum() * 100
corr

In [None]:
print(f"Max correlation : {corr.abs()['date_3'].max()}, Min correlation : {corr.abs()['date_3'].min()}")

In [None]:
px.bar(corr, x = corr.index, y = 'date_3',
       color = 'date_3', color_continuous_scale = 'RdYlGn',
       hover_name= corr.index, hover_data = {
           'date_3' : True,
       },
       labels = {'date_3' : 'attractivity score', 'index' : "subject's characteristic"},
       title = "Subject's characteristics influence on the chance to get a second date")

In [None]:
# Filtering to get only the most important features
corr = corr[corr['cumulated_importance_%'] <= 80]
corr

In [None]:
px.bar(corr, x = corr.index, y = 'date_3',
       color = 'date_3', color_continuous_scale = 'RdYlGn',
       hover_name= corr.index, hover_data = {
           'date_3' : True,
       },
       labels = {'date_3' : 'attractivity score', 'index' : "subject's characteristic"},
       title = "Subject's characteristics influencing the most the chance to get a second date")

By looking at this graph, we can see that : 
- people who give importance to race and religion are less likely to have a second date
- people who love hiking and, to a lesser degree, enjoy yoga, reading or clubbing tend to be more attractive
- people who prefer tv, gaming, tvsports, concerts, shopping or museums, or who rarely go out, tend to be less attractive than others
- interestingly, people who expect much enjoyment from the speed dating experiment have less chance of getting a second date
- people who feel confident about their ability to get a second date actually succeed more
- believing that being fun is important, and even more so being fun oneself, can help
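The selection behind this last graph (ranking features by absolute correlation, then keeping those within the first 80% of cumulative importance) can be sketched on toy numbers. The correlation values and feature names below are made up.

```python
import pandas as pd

# Hypothetical correlations of 4 features with the target
corr = pd.Series({"a": 0.10, "b": -0.40, "c": 0.30, "d": -0.20}, name="corr")

# Rank by descending absolute value : the sign tells the direction,
# the magnitude tells the strength of the relationship
ranked = corr.sort_values(key=abs, ascending=False)

# Cumulative share (in %) of the total absolute correlation
cumulated_pct = ranked.abs().cumsum() / ranked.abs().sum() * 100

# Keep the features whose cumulative share does not exceed 80%
selected = ranked[cumulated_pct <= 80]
print(selected)
```

Note that a correlation only measures a linear association; it does not say whether a characteristic causes the outcome.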

### 3.2. Good chemistry for getting a match - How do a man and a woman get along with each other ?

In [None]:
df['match'].value_counts()/len(df)

**To get a second date, the first step is to get as many matches as one can.** This step is not very easy, as **only 16% of dates in the experiment are a match.**

This time, we'll hence move from the chances one person has of getting a second date to the chances a speed date between one man and one woman has of becoming a match. 
Just as the previous analysis studied what characterizes an attractive person, we'll now study **what characterizes a good interaction between a man and a woman, and more particularly the influence of each one's favorite interest.** 
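In the cells that follow, a person's favorite interest is taken to be the interest with the highest rating, obtained with `idxmax(axis = 1)`. A small sketch on made-up ratings shows how this behaves, including when two interests are tied :

```python
import pandas as pd

# Hypothetical ratings (1-10) of three interests by three subjects
ratings = pd.DataFrame(
    {"sports": [9, 2, 5], "reading": [4, 8, 5], "yoga": [1, 8, 3]},
    index=pd.Index([1, 2, 3], name="iid"),
)

# For each row, name of the column holding the highest rating.
# In case of a tie, the first column in column order wins.
favorite = ratings.idxmax(axis=1)
print(favorite)
```

Subjects 2 and 3 have tied ratings, so their 'favorite' depends on the column order; this is a limitation to keep in mind when reading the results.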

In [None]:
# defining the data structure/schema of df_interests 
interests_list = ['sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga']
df = df.set_index(['iid','pid'])
points_of_view = ["subject's interest", "partner's interest"]
df_subject_interests = pd.concat([df[interests_list]], keys = [points_of_view[0]], names = ['point of view'])
df_subject_interests

In [None]:
# Detect missing values
df_subject_interests.isna().sum()

In [None]:
print("Number of rows concerning people not having told any of their interests :", len(df_subject_interests[df_subject_interests.isna().all(axis = 1)]))

In [None]:
print("iids of people not having told their interests : ", df_subject_interests[df_subject_interests.isna().all(axis = 1)].index.get_level_values(1).unique().values.tolist())

Only 7 people have not answered the survey questions regarding their interests; we will drop the corresponding rows.

In [None]:
ids_to_drop = df_subject_interests[df_subject_interests.isna().all(axis = 1)].index.get_level_values(1).unique().values.tolist()

In [None]:
df_subject_interests = df_subject_interests.drop(ids_to_drop, axis = 0, level = 1).drop(ids_to_drop, axis = 0, level = 2) # removing both from iid level and pid level, as we will later also need partner's interests information
df_subject_interests.isna().sum()

Now we see that there are no more missing values.

In [None]:
df_subject_interests

In [None]:
print("Example : ")
print()
print("Interests of subject 1 :")
display(df_subject_interests.xs(1, level = 1))
print()
print("Interests of each of subject 1's partners :")
print("(i.e. interests of each subject who has had subject 1 as partner)")
display(df_subject_interests.xs(1, level = 2))

In [None]:
df_partner_interests = df_subject_interests.rename(index={"subject's interest":"partner's interest"})
for interest in interests_list:
    df_partner_interests.loc[("partner's interest", slice(None),slice(None)),interest] = df_subject_interests.index.map(lambda x : df_by_iid.loc[x[2],interest])
df_partner_interests

In [None]:
df_interests = pd.concat([df_subject_interests, df_partner_interests], axis = 0).reorder_levels([1, 2, 0], axis = 0).sort_index(level = [0,1], sort_remaining = False)
df_interests['favorite interest'] = df_interests.idxmax(axis = 1)
df_interests

In [None]:
# Creating the dataset giving for each date the subject's and partner's favorite interest
df_favorite_interests = df_interests.unstack('point of view').xs(key = 'favorite interest', level = 0, axis = 1)
df_favorite_interests['match']=df['match']
df_favorite_interests['gender']=df['gender']
df_favorite_interests

In [None]:
df_favorite_interests[df_favorite_interests['gender']==1]["subject's interest"].value_counts()

**We will display the chances of getting a match depending on the favorite interests of the 2 participants in a heatmap.**  
**As some pairs of favorite interests may not exist in the dataset, we will drop the interests in question so that a missing pair is not mistaken for a 0% chance of getting a match.**
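
To make the idea of a non-existent pair concrete, here is a minimal sketch on invented toy data that counts the dates for every pair of favorite interests with `pd.crosstab`. The table `toy_dates` and the names `pair_counts` and `incomplete` are hypothetical, and this is only one possible way to spot missing pairs.

```python
import pandas as pd

interests = ["sports", "reading", "hiking"]

# Toy dates: every pair of favorite interests occurs except (hiking, hiking).
toy_dates = pd.DataFrame({
    "subject's interest": ["sports", "sports", "reading", "reading",
                           "sports", "hiking", "reading", "hiking"],
    "partner's interest": ["sports", "reading", "sports", "reading",
                           "hiking", "sports", "hiking", "reading"],
})

# Number of dates for each (subject's interest, partner's interest) pair.
pair_counts = pd.crosstab(toy_dates["subject's interest"],
                          toy_dates["partner's interest"])
# Make sure every interest appears as both a row and a column.
pair_counts = pair_counts.reindex(index=interests, columns=interests, fill_value=0)

# An interest is incomplete if at least one of its pairs was never observed.
incomplete = [i for i in interests if (pair_counts.loc[i] == 0).any()]

print(pair_counts)
print("Interests with a missing pair :", incomplete)
```

In a heatmap of average matches, the empty (hiking, hiking) cell would carry no information at all, which is different from an observed pair with 0% matches.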

In [None]:
# Remove an interest from the list if at least one of its pairs with another interest of the list does not exist in the dataset
interests_to_drop = {subject_interest
                     for subject_interest, partner_interest in product(interests_list, interests_list)
                     if df_favorite_interests[(df_favorite_interests["subject's interest"] == subject_interest) & (df_favorite_interests["partner's interest"] == partner_interest)].empty}
print("Dropping the following interests : ", interests_to_drop)
interests_list = [interest for interest in interests_list if interest not in interests_to_drop]
print("Interests list after drop : ", interests_list)

In [None]:
# Keep only the dates where neither favorite interest was removed from the list
mask1 = ~df_favorite_interests["partner's interest"].isin(interests_to_drop)
mask2 = ~df_favorite_interests["subject's interest"].isin(interests_to_drop)
mask = mask1 & mask2
df_favorite_interests = df_favorite_interests[mask]
df_favorite_interests

In [None]:
display(px.density_heatmap(df_favorite_interests, x = "subject's interest", y = "partner's interest", z = 'match', histfunc = "avg", 
                           category_orders = {"subject's interest" : interests_list,
                                              "partner's interest" : list(reversed(interests_list))}, #reversed(<list>) returns an iterator that needs to be converted to a list
                           marginal_y = 'histogram',
                           range_color = [0,df_favorite_interests.groupby(["subject's interest", "partner's interest"])['match'].mean().max()], 
                           title = "Chances of getting a match by favorite interest pairs (all subjects)", height = 600, width = 1100))

As the 2D histogram above shows (reminder : the global average of matches is 16%) :
- the people who seem to fit best together are hikers and tv sports fans (40% matches on average, compared to the 16% global average !)
- surprisingly, concert enthusiasts never match with each other (but this result should be taken with caution, because they rarely met in the experiment)
- tv sports fans generally also fit really well with music lovers, but not at all with museum lovers
- people who love to exercise get along really well with museum lovers but never with art enthusiasts (it would be interesting to study what distinguishes a museum lover from an art enthusiast)
- people who are fond of dining match at a similar rate with almost everybody, apart from the people most interested in exercise, with whom they match around half as often
- art lovers and hikers are the only people who don't match among themselves, and they never do !
- conversely, theater regulars are the people who match the most among themselves
- art lovers and music lovers also never match
- sports enthusiasts, who represent more than 1/4 of the population, seem to get along about equally well with everybody, with a preference for hikers (and less affinity for music, art, book or tv sports lovers)
- readers apparently don't match with many people in general (their average is often below the global one), apart from other readers; they match especially rarely with music or theater enthusiasts, and never with movie fans !
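
Because some pairs met only a few times, it can help to read each average next to the number of dates behind it. The sketch below uses invented toy data; the threshold `MIN_DATES` and the column names `match_rate`, `n_dates` and `reliable` are assumptions chosen for illustration, not values taken from this analysis.

```python
import pandas as pd

# Toy dates: one row per date, with both favorite interests and the outcome.
toy_dates = pd.DataFrame({
    "subject's interest": ["hiking", "hiking", "concerts",
                           "dining", "dining", "dining", "dining"],
    "partner's interest": ["tvsports", "tvsports", "concerts",
                           "exercise", "exercise", "exercise", "exercise"],
    "match": [1, 0, 0, 1, 0, 0, 0],
})

# Average match rate and number of dates for each pair of favorite interests.
summary = (toy_dates
           .groupby(["subject's interest", "partner's interest"])["match"]
           .agg(match_rate="mean", n_dates="count")
           .reset_index())

# Hypothetical threshold: flag the pairs observed often enough to be trusted.
MIN_DATES = 3
summary["reliable"] = summary["n_dates"] >= MIN_DATES

print(summary)
```

With a table like this, a 0% or 100% cell that rests on one or two dates can be told apart from a rate computed over many dates.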

**We can get an even more precise view by taking gender into account. Let's visualize the same 2D histogram, but with all subjects being male (and hence all partners being female, since the experiment focused exclusively on heterosexual dates).**

In [None]:
display(px.density_heatmap(df_favorite_interests[df_favorite_interests['gender']==1], x = "subject's interest", y = "partner's interest", z = 'match', histfunc = "avg", 
                           category_orders = {"subject's interest" : interests_list,
                                              "partner's interest" : list(reversed(interests_list))}, #reversed(<list>) returns an iterator that needs to be converted to a list
                           marginal_x = 'histogram', marginal_y = 'histogram',
                           range_color = [0,df_favorite_interests.groupby(["subject's interest", "partner's interest"])['match'].mean().max()], 
                           title = "Chances of getting a match by favorite interest pairs (subjects male, partners female)", height = 900, width = 1100))

In addition to what we saw in the previous graph, we can see for example that (reminder : the global average of matches is 16%) :
- male hikers and female tv sports fans love each other (100% matches !!), but the latter almost never match with men other than hikers
- we saw previously that tv sports fans generally also fit really well with music lovers, but this only works one way : male tv sports fans match with female music lovers, but never the other way around
- women who are fond of concerts match with men who love sports but not with music enthusiasts, whereas men who love concerts match really well with women who love music
- women who love dining don't match well with men who love exercising
- the exercise lovers who get along really well with museum lovers are mainly women, but the incompatibility with art enthusiasts holds for both genders
- female art fans match very often (50%) with men who love going to museums, but never the other way around !
- in fact, men who love art match with very few women apart from those who love dining or, most of all, movies (50% matches)
- women who are music fans get along easily with men who are fond of tv sports, dining, hiking or movies
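
One way to look for such one-way effects systematically is to compare each (male interest, female interest) cell with its mirror cell. This is a minimal sketch on invented toy data, assuming the same column names as `df_favorite_interests`; the names `male_view` and `asymmetry` are hypothetical.

```python
import pandas as pd

# Toy dates; gender == 1 means the subject is male, as in the dataset.
toy_dates = pd.DataFrame({
    "subject's interest": ["tvsports", "tvsports", "music", "music",
                           "music", "tvsports", "music"],
    "partner's interest": ["music", "music", "tvsports", "tvsports",
                           "music", "tvsports", "tvsports"],
    "gender": [1, 1, 1, 1, 1, 1, 0],
    "match": [1, 1, 0, 0, 1, 0, 1],
})

# Rows: male subject's favorite interest, columns: female partner's favorite interest.
male_view = toy_dates[toy_dates["gender"] == 1].pivot_table(
    index="subject's interest", columns="partner's interest",
    values="match", aggfunc="mean")

# Use the same labels on both axes so that each cell has a mirror cell.
labels = sorted(set(male_view.index) | set(male_view.columns))
male_view = male_view.reindex(index=labels, columns=labels)

# Cell (a, b): rate for men who prefer a with women who prefer b,
# minus the rate for men who prefer b with women who prefer a.
values = male_view.to_numpy()
asymmetry = pd.DataFrame(values - values.T, index=labels, columns=labels)

print(asymmetry)
```

Values far from 0 point to pairs that work in one direction only; as with the heatmaps, cells based on very few dates should be read with caution.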