![Tinder](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M03-EDA/Tinder-Symbole.png)

# Speed Dating with Tinder

## Company's description 📇

<a href="https://tinder.com/" target="_blank">Tinder</a> is an online dating and geosocial networking application. In Tinder, users "swipe right" to like or "swipe left" to dislike other users' profiles, which include their photos, a short bio, and a list of their interests.

Tinder was launched by Sean Rad at a hackathon held at the Hatch Labs incubator in West Hollywood in 2012.

As of 2021, Tinder has recorded more than 65 billion matches worldwide.

## Project 🚧

The marketing team needs help on a new project. They are experiencing a decrease in the number of matches, and they are trying to find a way to understand **what makes people interested into each other**.

They decided to run a speed dating experiment with people who had to give Tinder lots of informations about themselves that could ultimately reflect on ther dating profile on the app.

Tinder then gathered the data from this experiment. Each row in the dataset represents one speed date between two people, and indicates wether each of them secretly agreed to go on a second date with the other person.

## Goals 🎯

Use the dataset to understand what makes people interested into each other to go on a second date together using:
* descriptive statistics
* visualisations

## Scope of this project 🖼️

Data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four minute "first date" with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests.

The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information. See the Speed Dating Data Key document below for details.

[Dataset](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M03-EDA/Speed+Dating+Data.csv)

[Dataset Description](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M03-EDA/Speed+Dating+Data+Key.doc)

## Key information 🔑

##### Three Surveys and one ScoreCard were filled out by participants:

*   Survey filled out in order to register for the event (SignUp/Time1)
*   Survey filled out the day after participating in the event (FollowUp/Time2)
*   Survey filled out 3-4 weeks after they had been sent their matches (FollowUp2/Time3)
*   Scorecard filled out after each "date" during the event

##### Key features:

*   iid: unique participant id number
*   gender: 0 woman/1 man
*   pid: partner’s iid number
*   wave: wave number
*   round: number of people that met in wave
*   order: number of date in the wave
*   dec/dec_o: decision of participant/partner the night of event whether or not they would like to see him or her again (0 no/1 yes)
*   match: occurs when participant and partner both check “Yes” next to decision (0 no/1 yes)
*   match_es: How many matches do you estimate you will get?

##### Six attributes were explored using 6 questions:

*   Attractivity: attr
*   Sincerity: sinc
*   Intelligence: intel
*   Fun: fun
*   Ambition: amb
*   Shared interests: shar

##### Questions asked to explore attributes:

1.   What you look for in the opposite sex?
2.   What do you think the opposite sex looks for in a date?
3.   How do you think you measure up?
4.   What you think MOST of your fellow men/women look for in the opposite sex?
5.   How do you think others perceive you?
7.   What is the actual importance of these attributes in the decisions you've made? (FollowUps only)

*Remark: There is no "shared interests" attribute for question 3 and 5!

# Part 1: Exploratory Data Analysis (EDA)

## 0. Import libraries and dataset

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

In [2]:
'''
# Load dataset
with open('/content/sample_data/Speed+Dating+Data.csv', 'r', encoding='latin1') as f:
    df = pd.read_csv(f)
'''

"\n# Load dataset\nwith open('/content/sample_data/Speed+Dating+Data.csv', 'r', encoding='latin1') as f:\n    df = pd.read_csv(f)\n"

In [3]:
# Load dataset
with open('Speed+Dating+Data.csv', 'r', encoding='latin1') as f:
    df = pd.read_csv(f)

In [4]:
# Show dataset head
df.head()

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,


In [5]:
# dataset shape
df.shape

(8378, 195)

## 1. Check dataset: missing values and inconsistencies

##### 1.1. Check missing values:

In [6]:
# Get main statistics
df.describe(include='all')

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
count,8378.0,8377.0,8378.0,8378.0,8378.0,8378.0,8378.0,8378.0,6532.0,8378.0,...,3974.0,3974.0,3974.0,3974.0,3974.0,2016.0,2016.0,2016.0,2016.0,2016.0
unique,,,,,,,,,,,...,,,,,,,,,,
top,,,,,,,,,,,...,,,,,,,,,,
freq,,,,,,,,,,,...,,,,,,,,,,
mean,283.675937,8.960248,0.500597,17.327166,1.828837,11.350919,16.872046,9.042731,9.295775,8.927668,...,7.240312,8.093357,8.388777,7.658782,7.391545,6.81002,7.615079,7.93254,7.155258,7.048611
std,158.583367,5.491329,0.500029,10.940735,0.376673,5.995903,4.358458,5.514939,5.650199,5.477009,...,1.576596,1.610309,1.459094,1.74467,1.961417,1.507341,1.504551,1.340868,1.672787,1.717988
min,1.0,1.0,0.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,...,2.0,2.0,3.0,2.0,1.0,2.0,2.0,4.0,1.0,1.0
25%,154.0,4.0,0.0,8.0,2.0,7.0,14.0,4.0,4.0,4.0,...,7.0,7.0,8.0,7.0,6.0,6.0,7.0,7.0,6.0,6.0
50%,281.0,8.0,1.0,16.0,2.0,11.0,18.0,8.0,9.0,8.0,...,7.0,8.0,8.0,8.0,8.0,7.0,8.0,8.0,7.0,7.0
75%,407.0,13.0,1.0,26.0,2.0,15.0,20.0,13.0,14.0,13.0,...,8.0,9.0,9.0,9.0,9.0,8.0,9.0,9.0,8.0,8.0


In [7]:
# count unique iid
len(df['iid'].unique())

551

- From the Speed Dating data key document, we expect 554 participants. The last iid in the dataset is 552 but only 551 unique iid are counted.  

In [8]:
# Group by 'wave' and 'gender', then get the number of unique 'iid' values
wave_gender_iid_counts = df.groupby(['wave', 'gender'])['iid'].nunique().reset_index()

# Create a pivot table for a more organized view
wave_gender_iid_df = pd.pivot_table(
    wave_gender_iid_counts,
    values='iid',
    index='wave',
    columns='gender',
    fill_value=0  # Fill missing values with 0
)

# Rename columns for better readability
wave_gender_iid_df = wave_gender_iid_df.rename(columns={ 0: 'Female', 1: 'Male'})

# Display DataFrame
print(wave_gender_iid_df)

gender  Female  Male
wave                
1         10.0  10.0
2         19.0  16.0
3         10.0  10.0
4         18.0  18.0
5          9.0  10.0
6          5.0   5.0
7         16.0  16.0
8         10.0  10.0
9         20.0  20.0
10         9.0   9.0
11        21.0  21.0
12        14.0  14.0
13        10.0   9.0
14        20.0  18.0
15        18.0  19.0
16         6.0   8.0
17        10.0  14.0
18         6.0   6.0
19        15.0  15.0
20         6.0   7.0
21        22.0  22.0


* wave 3 is supposed to include 9 Female, 10 Male ==> 1 extra Female found.

* wave 5 is supposed to include 10 Female, 10 Male ==> 1 less Female found.

* wave 12 is supposed to include 15 Female, 14 Male ==> 1 less Female found.

* wave 19 is supposed to include 16 Female, 15 Male ==> 1 less Female found.

* wave 20 is supposed to include 6 Female, 8 Male ==> 1 less Male found.

In [9]:
# Group by 'wave' and get unique 'iid' values
unique_iids_per_wave = df.groupby('wave')['iid'].unique()

# Specify the waves to print
waves_to_print = [3, 5, 12, 19, 20]

# Print the results for the selected waves
for wave, iids in unique_iids_per_wave.items():
    if wave in waves_to_print:
        print(f"Wave {wave}: {iids}")

Wave 3: [56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75]
Wave 5: [112 113 114 115 116 117 119 120 121 122 123 124 125 126 127 128 129 130
 131]
Wave 12: [294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311
 312 313 314 315 316 317 318 319 320 321]
Wave 19: [466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483
 484 485 486 487 488 489 490 491 492 493 494 495]
Wave 20: [496 497 498 499 500 501 502 503 504 505 506 507 508]


- iid 118 is missing from wave 5.

In [10]:
# Group by 'wave' and get unique 'pid' values
unique_pids_per_wave = df.groupby('wave')['pid'].unique()

# Specify the waves to print
waves_to_print = [5]

# Print the results for the selected waves
for wave, pids in unique_pids_per_wave.items():
    if wave in waves_to_print:
        print(f"Wave {wave}: {pids}")

Wave 5: [122. 123. 124. 125. 126. 127. 128. 129. 130. 131. 112. 113. 114. 115.
 116. 117. 119. 120. 121.  nan]


- After checking, pid 118 is marked as missing value in wave 5 in the corresponding rows.

In [11]:
# Drop rows where 'pid' is missing
df = df.dropna(subset=['pid'])
df.shape

(8368, 195)

- pid missing values are dropped.

In [12]:
# Get the sum of missing values for each column
missing_values = df.isnull().sum()

# Filter to show only columns with no missing values
no_missing_columns = missing_values[missing_values == 0]

# Display the columns with missing values and their counts
print(no_missing_columns)

iid         0
gender      0
idg         0
condtn      0
wave        0
round       0
position    0
order       0
partner     0
pid         0
match       0
samerace    0
dec_o       0
dec         0
dtype: int64


- Features with no missing values.

In [13]:
# Filter to show only columns with missing values
missing_columns = missing_values[missing_values > 0]

# Display the columns with missing values and their counts
print(missing_columns)

id             1
positin1    1836
int_corr     158
age_o         94
race_o        63
            ... 
attr5_3     6352
sinc5_3     6352
intel5_3    6352
fun5_3      6352
amb5_3      6352
Length: 181, dtype: int64


- There are many missing values.

In [14]:
# Filter to show only columns ending with '3' with missing values
missing_columns_3 = missing_values[missing_values.index.str.endswith('3')]

# Filter out columns with 0 missing values
missing_columns_3 = missing_columns_3[missing_columns_3 != 0]

# Sort the filtered missing values
sorted_missing_columns_3 = missing_columns_3.sort_values()

# Display the sorted columns with missing values and their counts
print(sorted_missing_columns_3)

date_3      4399
attr3_3     4399
sinc3_3     4399
shar1_3     4399
amb1_3      4399
fun1_3      4399
intel3_3    4399
sinc1_3     4399
attr1_3     4399
fun3_3      4399
amb3_3      4399
intel1_3    4399
shar4_3     5409
amb2_3      5409
fun2_3      5409
intel2_3    5409
sinc2_3     5409
attr2_3     5409
amb4_3      5409
fun4_3      5409
intel4_3    5409
sinc4_3     5409
attr4_3     5409
intel5_3    6352
sinc5_3     6352
attr5_3     6352
attr7_3     6352
fun5_3      6352
sinc7_3     6352
shar2_3     6352
fun7_3      6352
amb7_3      6352
shar7_3     6352
intel7_3    6352
amb5_3      6352
numdat_3    6872
num_in_3    7702
dtype: int64


- There are at least 53% of missing values in FollowUp2/Time3. These data will not be explored.

In [15]:
# Filter to show only columns ending with '2' with missing values
missing_columns_2 = missing_values[missing_values.index.str.endswith('2')]

# Filter out columns with 0 missing values
missing_columns_2 = missing_columns_2[missing_columns_2 != 0]

# Sort the filtered missing values
sorted_missing_columns_2 = missing_columns_2.sort_values()

# Display the sorted columns with missing values and their counts
print(sorted_missing_columns_2)

satis_2      914
shar1_2      914
amb1_2       914
fun1_2       914
intel1_2     914
attr3_2      914
sinc1_2      914
intel3_2     914
fun3_2       914
amb3_2       914
sinc3_2      914
attr1_2      932
numdat_2     944
shar2_2     2593
sinc2_2     2593
amb2_2      2593
fun2_2      2593
intel2_2    2593
attr2_2     2593
fun4_2      2593
amb4_2      2593
intel4_2    2593
sinc4_2     2593
attr4_2     2593
shar4_2     2593
fun5_2      3991
attr5_2     3991
sinc5_2     3991
intel5_2    3991
amb5_2      3991
fun7_2      6384
intel7_2    6384
attr7_2     6384
shar7_2     6394
amb7_2      6413
sinc7_2     6413
dtype: int64


- There are also many missing values in FollowUp1/Time2. Only answers to the question 1 and 3 will be potentially explored.

In [16]:
# Filter to show only attributes ending with '_1' with missing values
missing_columns_1 = missing_values[missing_values.index.str.endswith('_1')]

# Filter out columns with 0 missing values
missing_columns_1 = missing_columns_1[missing_columns_1 != 0]

# Sort the filtered missing values
sorted_missing_columns_1 = missing_columns_1.sort_values()

# Display the sorted columns with missing values and their counts
print(missing_columns_1)

attr1_1       79
sinc1_1       79
intel1_1      79
fun1_1        88
amb1_1        97
shar1_1      119
attr4_1     1879
sinc4_1     1879
intel4_1    1879
fun4_1      1879
amb4_1      1879
shar4_1     1901
attr2_1       79
sinc2_1       79
intel2_1      79
fun2_1        79
amb2_1        88
shar2_1       88
attr3_1      105
sinc3_1      105
fun3_1       105
intel3_1     105
amb3_1       105
attr5_1     3462
sinc5_1     3462
intel5_1    3462
fun5_1      3462
amb5_1      3462
dtype: int64


- There are many missing values for question 4 and 5.

##### 1.2. Check if all unique iid are present as unique pid and vice versa:

In [17]:
# Get unique iid values
unique_iids = df['iid'].unique()

# Get unique pid values
unique_pids = df['pid'].unique()

# Check if all unique iids are present in unique pids
all_iids_in_pids = all(iid in unique_pids for iid in unique_iids)

# Check if all unique pids are present in unique iids
all_pids_in_iids = all(pid in unique_iids for pid in unique_pids)

# Print the results
print(f"All unique iids are present in unique pids: {all_iids_in_pids}")
print(f"All unique pids are present in unique iids: {all_pids_in_iids}")

All unique iids are present in unique pids: True
All unique pids are present in unique iids: True


##### 1.3. Check consistency between dec, dec_o and match:

In [18]:
# Function to check the conditions
def check_dec_dec_o_match(row):
    if row['match'] == 1:
        if row['dec'] == 1 and row['dec_o'] == 1:
            return True
        else:
            return False  # Inconsistent for match = 1
    elif row['match'] == 0:
        if row['dec'] == 0 or row['dec_o'] == 0:
            return True
        else:
            return False  # Inconsistent for match = 0

# Apply the function to the DataFrame
df['dec_dec_o_match_check'] = df.apply(check_dec_dec_o_match, axis=1)

# Inconsistent cases
inconsistent_rows = df[df['dec_dec_o_match_check'] == False]
inconsistent_count = len(inconsistent_rows)
print(f"\nNumber of inconsistent rows: {inconsistent_count}")


Number of inconsistent rows: 0


## 2. EDA

* Depending on the wave, participants were asked either to rate the importance of the attributes on a scale of 1-10 (1=not at all important, 10=extremely important) (Waves 6-9) or to distribute 100 points among the attributes
(Waves 1-5, 10-21). During the dating, they were also asked to put NA when they were not able to judge an attribute.

* In general, missing values either correspond to 0 when the sum of the other attributes is 100 or it means the attribute is not considered as relevant.

* To be able to make a comparison, all the attributes data were scaled to form a distribution from 0 to 100 even when there were missing values.

In [19]:
# Define a list of attribute prefixes
attribute_prefixes = ['attr', 'sinc', 'intel', 'fun', 'amb', 'shar']

# Iterate through attribute groups (1_1, 2_1, 3_1, etc. and '' for no suffix)
for group_suffix in ['1_1', '1_2', '2_1', '3_1', '3_2', '4_1', '5_1', '_o', '']:
    # Create a list of attributes for the current group
    attributes = [prefix + group_suffix for prefix in attribute_prefixes]

    # Remove 'shar3_1' and 'shar5_1' if they exist in the list
    if group_suffix == '3_1' or group_suffix == '3_2' or group_suffix == '5_1':
        attributes.remove('shar' + group_suffix)

    # Create the total attributes column for the current group
    # Handle the case where there is no suffix
    total_attributes_col = 'total_attributes' + (group_suffix if group_suffix else '')
    df[total_attributes_col] = df[[a for a in attributes if a in df.columns]].sum(axis=1)

    # Calculate recalculated values for each attribute in the group
    for attribute in attributes:
      if attribute in df.columns:
        df[attribute + '_rec'] = (df[attribute] / df[total_attributes_col]) * 100

    # Calculate the sum of recalculated attributes for each row
    sum_recalculated_attributes_col = 'sum_recalculated_attributes' + (group_suffix if group_suffix else '')
    df[sum_recalculated_attributes_col] = df[[attr + '_rec' for attr in attributes if (attr + "_rec") in df.columns]].sum(axis=1).round(2)

    # Check if the sum is equal to 100
    sum_100_col = 'is_sum_100' + (group_suffix if group_suffix else '')
    df[sum_100_col] = df[sum_recalculated_attributes_col] == 100

    # Print the results
    print(df[sum_100_col].value_counts())

    # Filter and display rows where the sum is not 100
    false_rows = df[df[sum_100_col] == False]
    print(false_rows[['iid'] + attributes + [sum_recalculated_attributes_col, sum_100_col]])

is_sum_1001_1
True     8289
False      79
Name: count, dtype: int64
      iid  attr1_1  sinc1_1  intel1_1  fun1_1  amb1_1  shar1_1  \
312    28      NaN      NaN       NaN     NaN     NaN      NaN   
313    28      NaN      NaN       NaN     NaN     NaN      NaN   
314    28      NaN      NaN       NaN     NaN     NaN      NaN   
315    28      NaN      NaN       NaN     NaN     NaN      NaN   
316    28      NaN      NaN       NaN     NaN     NaN      NaN   
...   ...      ...      ...       ...     ...     ...      ...   
5127  346      NaN      NaN       NaN     NaN     NaN      NaN   
5128  346      NaN      NaN       NaN     NaN     NaN      NaN   
5129  346      NaN      NaN       NaN     NaN     NaN      NaN   
5130  346      NaN      NaN       NaN     NaN     NaN      NaN   
5131  346      NaN      NaN       NaN     NaN     NaN      NaN   

      sum_recalculated_attributes1_1  is_sum_1001_1  
312                              0.0          False  
313                            

### **1) Question: What are the least and the most desirable attributes in a male partner? 2) Does this differ for female partners? How important do people think attractiveness is in potential mate selection vs. its real impact?**



**a) Explore attributes data from SignUp/Time1 (before the date):**

*   We want to know what you look for in the opposite sex? (attributes1_1)
*   What do you think the opposite sex looks for in a date? (attributes2_1)
*   How do you think you measure up? (attributes3_1)
*   What you think MOST of your fellow men/women look for in the opposite sex? (attributes4_1)
*   How do you think others perceive you? (attributes5_1)

**b) Explore attributes data from FollowUp/Time2 (after the date):**

*   We want to know what you look for in the opposite sex? (attributes1_2)

In [20]:
# Define a list of attribute prefixes
attribute_prefixes = ['attr', 'sinc', 'intel', 'fun', 'amb', 'shar']

# Define attribute groups and their corresponding categories
attribute_groups = {
    '1_1': ['attr1_1_rec', 'sinc1_1_rec', 'intel1_1_rec', 'fun1_1_rec', 'amb1_1_rec', 'shar1_1_rec'],
    '1_2': ['attr1_2_rec', 'sinc1_2_rec', 'intel1_2_rec', 'fun1_2_rec', 'amb1_2_rec', 'shar1_2_rec'],
    '2_1': ['attr2_1_rec', 'sinc2_1_rec', 'intel2_1_rec', 'fun2_1_rec', 'amb2_1_rec', 'shar2_1_rec'],
    '3_1': ['attr3_1_rec', 'sinc3_1_rec', 'intel3_1_rec', 'fun3_1_rec', 'amb3_1_rec'],  # shar3_1 doesn't exist
    '4_1': ['attr4_1_rec', 'sinc4_1_rec', 'intel4_1_rec', 'fun4_1_rec', 'amb4_1_rec', 'shar4_1_rec'],
    '5_1': ['attr5_1_rec', 'sinc5_1_rec', 'intel5_1_rec', 'fun5_1_rec', 'amb5_1_rec']  # shar5_1 doesn't exist
}


# Iterate through attribute groups
for group_suffix, categories in attribute_groups.items():
    # Create a list of attributes for the current group
    attributes = [prefix + group_suffix for prefix in attribute_prefixes]

    # Remove attributes not in 'categories'
    attributes = [attr for attr in attributes if attr + '_rec' in categories]

    # ... (rest of the code for attribute calculation, filtering, etc.) ...

    # --- Radar Chart Creation ---
    # Group data by gender and calculate the mean for each attribute
    gender_attributes = df.groupby('gender')[categories].mean().reset_index()

    # Create the radar chart
    fig = go.Figure()

    # Add trace for women (gender = 0)
    fig.add_trace(go.Scatterpolar(
        r=gender_attributes[gender_attributes['gender'] == 0][categories].values.flatten().tolist(),
        theta=categories,
        fill='toself',
        name='Women',
        marker=dict(color='lightcoral')
    ))

    # Add trace for men (gender = 1)
    fig.add_trace(go.Scatterpolar(
        r=gender_attributes[gender_attributes['gender'] == 1][categories].values.flatten().tolist(),
        theta=categories,
        fill='toself',
        name='Men',
        marker=dict(color='skyblue')
    ))

    # Update layout
    fig.update_layout(
        polar=dict(
            radialaxis=dict(
                visible=True,
                range=[0, 40]  # Adjust range if needed
            )
        ),
        showlegend=True,
        title=f"Average Attribute Ratings by Gender (Group: {group_suffix})"
    )

    fig.show()

- Before the date, the least desired attribute for women is shared interests (12.7%) and ambition for men (8.5%).
- Before the date, women think the most important attribute is intelligence (21%) followed by sincerity and attractiveness (18%) and men think the most important attribute by far is attractiveness (27%) followed by intelligence (19%). After the date, both think attractiveness is the most important attribute (30% for men and 22% for women).
- Before the date, men see themselves intelligent (22%) and sincere (21%) and women see themselves sincere and intelligent (21%). Both put attractiveness as their least attribute (18%). In general, they share the same pattern. They also think, they are perceived as they see themself.
- Before the date, both think the opposite sex and also they fellow are looking for attractiveness first.

**c) Explore attributes data from the ScoreCard (during the date): attr, sinc, intel, fun, amb, shar.**

To evaluate the impact of an attribute in participant's decision, we'll calculate first the pourcentage of positive decisions obtained among all the dated partners and then correlate the average score recieved per attribute and the pourcentage of positive responses.

In [21]:
# Group by 'pid' and 'round' to get the total number of dates for each partner in each round
total_dates_per_pid_round = df.groupby(['pid', 'round'])['iid'].count().reset_index()
total_dates_per_pid_round.rename(columns={'iid': 'total_dates'}, inplace=True)

# Group by 'pid' and 'round' to get the total number of 'dec = 1' for each partner in each round
dec_counts_per_pid_round = df.groupby(['pid', 'round'])['dec'].sum().reset_index()
dec_counts_per_pid_round.rename(columns={'dec': 'dec_count_per_pid'}, inplace=True)

# Merge the two DataFrames to have total dates and dec_count in the same DataFrame
merged_df = pd.merge(total_dates_per_pid_round, dec_counts_per_pid_round, on=['pid', 'round'])

# Calculate the percentage of 'dec = 1' for each partner in each round
merged_df['dec_percentage_per_pid'] = (merged_df['dec_count_per_pid'] / merged_df['total_dates']) * 100

# Merge the 'dec_percentage_per_pid' back into the original DataFrame 'df'
df = pd.merge(df, merged_df[['pid', 'total_dates', 'dec_count_per_pid', 'dec_percentage_per_pid']], on=['pid'], how='left')

# Display the updated DataFrame 'df'
print(df.head())

   iid   id  gender  idg  condtn  wave  round  position  positin1  order  ...  \
0    1  1.0       0    1       1     1     10         7       NaN      4  ...   
1    1  1.0       0    1       1     1     10         7       NaN      3  ...   
2    1  1.0       0    1       1     1     10         7       NaN     10  ...   
3    1  1.0       0    1       1     1     10         7       NaN      5  ...   
4    1  1.0       0    1       1     1     10         7       NaN      7  ...   

    sinc_rec  intel_rec    fun_rec    amb_rec   shar_rec  \
0  22.500000  17.500000  17.500000  15.000000  12.500000   
1  19.512195  17.073171  19.512195  12.195122  14.634146   
2  19.047619  21.428571  19.047619  11.904762  16.666667   
3  14.285714  19.047619  16.666667  14.285714  19.047619   
4  16.216216  18.918919  18.918919  16.216216  16.216216   

   sum_recalculated_attributes  is_sum_100  total_dates  dec_count_per_pid  \
0                        100.0        True           10                  4

In [22]:
# Select the attributes you want to calculate the mean for
attributes = ['attr', 'sinc', 'intel', 'fun', 'amb', 'shar']

# Group by 'pid' and calculate the mean of the selected attributes
mean_attributes_per_pid = df.groupby('pid')[attributes].mean().reset_index()

# Display the results
print(mean_attributes_per_pid)

       pid      attr      sinc     intel       fun       amb      shar
0      1.0  6.700000  7.400000  8.000000  7.200000  8.000000  7.100000
1      2.0  7.700000  7.100000  7.900000  7.500000  7.500000  6.500000
2      3.0  6.500000  7.100000  7.300000  6.200000  7.111111  6.000000
3      4.0  7.000000  7.100000  7.700000  7.500000  7.700000  7.200000
4      5.0  5.300000  7.700000  7.600000  7.200000  7.800000  6.200000
..     ...       ...       ...       ...       ...       ...       ...
546  548.0  6.857143  5.809524  6.666667  5.714286  6.150000  4.450000
547  549.0  6.350000  6.650000  6.850000  6.650000  6.000000  5.111111
548  550.0  5.136364  5.818182  6.500000  5.272727  6.363636  4.190476
549  551.0  6.142857  6.666667  6.761905  5.571429  6.238095  5.166667
550  552.0  7.300000  5.850000  6.157895  5.750000  6.150000  5.000000

[551 rows x 7 columns]


In [23]:
# Select the attributes for correlation
attributes = ['attr', 'sinc', 'intel', 'fun', 'amb', 'shar']

# Merge mean attributes with dec_percentage
merged_data = pd.merge(mean_attributes_per_pid, df[['pid', 'dec_percentage_per_pid']], on='pid', how='left')
merged_data = merged_data.drop_duplicates(subset=['pid'])  # Remove duplicates if any

# Create correlation plots
for attribute in attributes:
    # Calculate correlation coefficient
    correlation_coeff = np.corrcoef(merged_data[attribute], merged_data['dec_percentage_per_pid'])[0, 1]


    fig = px.scatter(merged_data, x='dec_percentage_per_pid', y=attribute, trendline="ols",
                     title=f"Correlation between {attribute} and Decision Percentage (r = {correlation_coeff:.2f})",
                     labels={attribute: f"Average {attribute}", "dec_percentage": "Decision Percentage"})
    fig.show()

- There is a strong correlation between attractiveness and the percentage of positive decision (79%), followed by fun (66%) and shared interest (61%).

In [24]:
# Select the attributes for correlation
attributes = ['attr', 'sinc', 'intel', 'fun', 'amb', 'shar']

# Create an empty list to store correlation data
correlation_data = []

# Calculate correlation for average, male, and female
for attribute in attributes:
    # Average correlation
    avg_corr = np.corrcoef(merged_data[attribute], merged_data['dec_percentage_per_pid'])[0, 1]
    correlation_data.append({'Attribute': attribute, 'Gender': 'Average', 'Correlation': avg_corr})

    # Correlation for males (gender = 1)
    male_data = merged_data[df['gender'] == 1]  # Filter for males
    male_corr = np.corrcoef(male_data[attribute], male_data['dec_percentage_per_pid'])[0, 1]
    correlation_data.append({'Attribute': attribute, 'Gender': 'Male', 'Correlation': male_corr})

    # Correlation for females (gender = 0)
    female_data = merged_data[df['gender'] == 0]  # Filter for females
    female_corr = np.corrcoef(female_data[attribute], female_data['dec_percentage_per_pid'])[0, 1]
    correlation_data.append({'Attribute': attribute, 'Gender': 'Female', 'Correlation': female_corr})

# Create a DataFrame from the correlation data
correlation_df = pd.DataFrame(correlation_data)

# Create the bar chart using Plotly Express
fig = px.bar(correlation_df, x='Attribute', y='Correlation', color='Gender', barmode='group',
             title="Correlation between Attributes and Decision Percentage by Gender")
fig.show()


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.



- The attributes that coorelates the less with a positive decision are sincerity, intelligence and ambition since these interests are hard to catch during a short date while fun and shared interests make interactions easier, especially when you meet for the first time. Obviously attractiveness is the first think that you perceive when you meet somebody.  

In [25]:
# Create an empty list to store correlation data
correlation_data = []

# Calculate correlation for average, male, and female
for attribute in attributes:
    # Average correlation
    avg_corr = np.corrcoef(merged_data[attribute], merged_data['dec_percentage_per_pid'])[0, 1]
    correlation_data.append({'Attribute': attribute, 'Gender': 'Average', 'Correlation': avg_corr})

    # Correlation for males (gender = 1)
    male_data = merged_data[df['gender'] == 1]  # Filter for males
    male_corr = np.corrcoef(male_data[attribute], male_data['dec_percentage_per_pid'])[0, 1]
    correlation_data.append({'Attribute': attribute, 'Gender': 'Male', 'Correlation': male_corr})

    # Correlation for females (gender = 0)
    female_data = merged_data[df['gender'] == 0]  # Filter for females
    female_corr = np.corrcoef(female_data[attribute], female_data['dec_percentage_per_pid'])[0, 1]
    correlation_data.append({'Attribute': attribute, 'Gender': 'Female', 'Correlation': female_corr})

# Create a DataFrame from the correlation data
correlation_df = pd.DataFrame(correlation_data)

# Create separate bar charts for each attribute
for attribute in attributes:
    # Filter data for the current attribute
    attribute_data = correlation_df[correlation_df['Attribute'] == attribute]

    # Create the bar chart
    fig = px.bar(attribute_data, x='Gender', y='Correlation', color='Gender',
                 title=f"Correlation between {attribute} and Decision Percentage by Gender",
                 text_auto='.2f')  # Display correlation values on top of bars
    fig.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)
    fig.show()


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.


Boolean Series key will be reindexed to match DataFrame index.



- For women, intelligence and ambition have more impact on giving a positive decision than for men.
- For men, sincerity, fun and shared interests have more impact on giving a positive decision than for women.

### **3) Question: Are shared interests more important than a shared racial background?**

In [26]:
# Group by samerace and dec, then count occurrences
dec_samerace_counts = df.groupby(['samerace', 'dec'])['iid'].count().reset_index()

# Rename columns for better readability
dec_samerace_counts.rename(columns={'iid': 'count'}, inplace=True)

# Display the results
print(dec_samerace_counts)

# Optional: Calculate percentages
total_dates = len(df)
dec_samerace_counts['percentage'] = (dec_samerace_counts['count'] / total_dates) * 100
print(dec_samerace_counts)

   samerace  dec  count
0         0    0   2976
1         0    1   2076
2         1    0   1877
3         1    1   1439
   samerace  dec  count  percentage
0         0    0   2976   35.564054
1         0    1   2076   24.808795
2         1    0   1877   22.430688
3         1    1   1439   17.196463


In [27]:
# Group by 'dec' and calculate the mean of 'shar' and 'imprace'
avg_shar_imprace_by_dec = df.groupby('dec')[['shar', 'imprace', 'imprelig']].mean().reset_index()

# Melt the DataFrame to create a long format for Plotly Express
melted_df = avg_shar_imprace_by_dec.melt(id_vars=['dec'], value_vars=['shar', 'imprace', 'imprelig'],
                                      var_name='Criteria', value_name='Average Value')

# Convert 'dec' to categorical
melted_df['dec'] = melted_df['dec'].astype(str)  # Convert to string

# Create the bar chart using Plotly Express with facet_col for 'dec'
fig = px.bar(melted_df, x='Criteria', y='Average Value',
             color='dec', facet_col='dec',
             title="Average Shared Interests and Importance of Race by Decision",
             labels={'dec': 'Decision (0=No, 1=Yes)', 'Criteria': 'Criteria',
                     'Average Value': 'Average Value'},
             category_orders={"dec": ["0", "1"]})  # Order categories

fig.show()

- Shared interests have more impact than a shared race or religion on giving a positive decision. For participants it's more important to share interests than sharing the same religion or race.

In [28]:
# Filter for dec = 1
filtered_df = avg_shar_imprace_by_dec[avg_shar_imprace_by_dec['dec'] == 1]

# Melt the DataFrame to create a long format for Plotly Express
melted_df = filtered_df.melt(id_vars=['dec'], value_vars=['shar', 'imprace', 'imprelig'],
                                      var_name='Criteria', value_name='Average Value')

# Define custom colors for each Criteria
colors = {'shar': 'skyblue', 'imprace': 'lightcoral'}

# Create the bar chart using Plotly Express with custom colors and text values
fig = px.bar(melted_df, x='Criteria', y='Average Value',
             title="Average Shared Interests and Importance of Race for Decision = 1",
             labels={'Criteria': 'Criteria', 'Average Value': 'Average Value'},
             color='Criteria',  # Use 'Criteria' for color mapping
             color_discrete_map=colors,  # Apply custom colors
             text='Average Value')  # Add text for average values

fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')  # Format and position text

fig.show()

### **4) Question: Can people accurately predict their own perceived value in the dating market?**

In [29]:
# Define attribute groups
before_date_attrs = ['attr3_1_rec', 'sinc3_1_rec', 'intel3_1_rec', 'fun3_1_rec', 'amb3_1_rec']
after_date_attrs = ['attr3_2_rec', 'sinc3_2_rec', 'intel3_2_rec', 'fun3_2_rec', 'amb3_2_rec']
partner_attrs = ['attr_o_rec', 'sinc_o_rec', 'intel_o_rec', 'fun_o_rec', 'amb_o_rec']

# Create a DataFrame for plotting
plot_df = pd.DataFrame({
    'Attribute': ['Attractiveness', 'Sincerity', 'Intelligence', 'Fun', 'Ambition'] * 3,  # Repeat for 3 groups
    'Rating Type': ['Self (Before)', 'Self (After)', 'Partner'] * 5,  # Repeat for 5 attributes
    'Average Rating': 0  # Initialize with 0
})


# Calculate average ratings
for i, attr in enumerate(before_date_attrs):
    plot_df.loc[plot_df['Attribute'] == plot_df['Attribute'].unique()[i], 'Average Rating'] = df[attr].mean(), df[after_date_attrs[i]].mean(), df[partner_attrs[i]].mean()

# Reshape dataframe to have attributes in columns
plot_df = plot_df.pivot(index='Attribute', columns=['Rating Type'], values='Average Rating').reset_index()


# Create grouped bar chart
fig = px.bar(plot_df, x='Attribute', y=['Self (Before)', 'Self (After)', 'Partner'],
             title="Comparison of Self-Ratings and Partner Ratings",
             labels={'value': 'Average Rating', 'variable': 'Rating Type'})
fig.update_layout(barmode='group')
fig.show()


Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '(18.096624779430474, 18.55999617982894, 16.553698630162277)' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.



- Participants overrate their ambition and sincerity attributes and underrate their intelligence before the date.
- They underestimate their attractiveness and fun after the date. They loose confidence just after dating.

In [34]:
# Calculate match = 1 counts per iid
match_counts = df[df['match'] == 1].groupby('iid')['match'].count().reset_index()
match_counts.rename(columns={'match': 'match_count'}, inplace=True)

# Get match_es values per iid (since it doesn't change per iid)
match_es_per_iid = df[['iid', 'match_es']].drop_duplicates()

# Merge the two DataFrames
merged_data = pd.merge(match_counts, match_es_per_iid, on='iid', how='left')

# Calculate the delta (difference)
merged_data['delta'] = merged_data['match_count'] - merged_data['match_es']

# Create the box plot
fig = px.box(merged_data, y='delta',
             title="Distribution of Delta between Estimated Matches and real matches (match = 1)",
             labels={'delta': "Delta (match_es - match_count)"})
fig.show()

- In general, participants estimate well the matches they will get.

### **5) Question: In terms of getting a second date, is it better to be someone's first speed date of the night or their last?**

In [31]:
# Filter for first dates (order = 1)
first_dates = df[df['order'] == 1]

# Calculate percentage of dec = 1 for first dates
first_date_dec1_percentage = (first_dates['dec'].sum() / len(first_dates)) * 100

# Filter for last dates (order = round)
last_dates = df[df['order'] == df['round']]

# Calculate percentage of dec = 1 for last dates
last_date_dec1_percentage = (last_dates['dec'].sum() / len(last_dates)) * 100

# Filter for second dates (order = 2)
second_dates = df[df['order'] == 2]
second_date_dec1_percentage = (second_dates['dec'].sum() / len(second_dates)) * 100

# Filter for before last dates (order = round - 1)
before_last_dates = df[df['order'] == df['round'] - 1]
before_last_date_dec1_percentage = (before_last_dates['dec'].sum() / len(before_last_dates)) * 100

# Print the results
print(f"Percentage of dec = 1 for first dates: {first_date_dec1_percentage:.2f}%")
print(f"Percentage of dec = 1 for last dates: {last_date_dec1_percentage:.2f}%")
print(f"Percentage of dec = 1 for second dates: {second_date_dec1_percentage:.2f}%")
print(f"Percentage of dec = 1 for before last dates: {before_last_date_dec1_percentage:.2f}%")

Percentage of dec = 1 for first dates: 50.00%
Percentage of dec = 1 for last dates: 44.59%
Percentage of dec = 1 for second dates: 39.45%
Percentage of dec = 1 for before last dates: 43.45%


In [32]:
# Create bar chart data
data = [
    go.Bar(name='First Date', x=['First Date'], y=[first_date_dec1_percentage],
           text=[f'{first_date_dec1_percentage:.2f}%'], textposition='outside'),
    go.Bar(name='Second Date', x=['Second Date'], y=[second_date_dec1_percentage],
           text=[f'{second_date_dec1_percentage:.2f}%'], textposition='outside'),
    go.Bar(name='Before Last Date', x=['Before Last Date'], y=[before_last_date_dec1_percentage],
           text=[f'{before_last_date_dec1_percentage:.2f}%'], textposition='outside'),
    go.Bar(name='Last Date', x=['Last Date'], y=[last_date_dec1_percentage],
           text=[f'{last_date_dec1_percentage:.2f}%'], textposition='outside')
]

# Create layout
layout = go.Layout(
    title="Percentage of 'dec = 1' for First, Second, Before Last, and Last Dates",
    yaxis_title="Percentage (%)",
    barmode='group'  # You can change to 'stack' or 'overlay' if needed
)

# Create figure
fig = go.Figure(data=data, layout=layout)

# Show plot
fig.show()

- Clearly, it's better to be someone's first date. the worst case senario is when you are someone's second date.

### **6) Question6: Is there a correlation between shared interests and match?**

In [33]:
fig = px.box(df, x="match", y="int_corr",
                   title="Match vs. Correlation between interests",
                   labels={"match": "Match (0=No, 1=Yes)", "int_corr": "Correlation between interests"})
fig.show()

- Concerning matches, a higher correlation between interests is observed when there is a match.

# Part 2: Conclusion

- There is a strong correlation between attractiveness and the percentage of positive decision (79%), obviously attractiveness is the first think that you perceive when you meet somebody.  
- Attractiveness is followed by fun (66%) and shared interest (61%) which help to make interactions easier, especially when you meet for the first time.
- The attributes that coorelates the less with a positive decision are sincerity, intelligence and ambition since these interests are hard to catch during a short date.
- For women, intelligence and ambition have more impact on giving a positive decision than for men. While for men, sincerity, fun and shared interests have more impact on giving a positive decision than for women.
- Shared interests have more impact than a shared race or religion on giving a positive decision. For participants it's more important to share interests than having the same religion or race.
- Participants overrate their ambition and sincerity attributes and underrate their intelligence before the date.
- They underestimate their attractiveness and fun after the date. They loose confidence just after dating.
- Clearly, it's better to be someone's first date. The worst case is when you are someone's second date.