⚠️ This project is mandatory for certification bloc #2.

![Tinder](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M03-EDA/Tinder-Symbole.png)

# Speed Dating with Tinder

## Company's description 📇

<a href="https://tinder.com/" target="_blank">Tinder</a> is an online dating and geosocial networking application. In Tinder, users "swipe right" to like or "swipe left" to dislike other users' profiles, which include their photos, a short bio, and a list of their interests.

Tinder was launched by Sean Rad at a hackathon held at the Hatch Labs incubator in West Hollywood in 2012.

As of 2021, Tinder has recorded more than 65 billion matches worldwide.

## Project 🚧

The marketing team needs help with a new project. They are seeing a decrease in the number of matches, and they want to understand **what makes people interested in each other**.

They decided to run a speed dating experiment in which participants gave Tinder a lot of information about themselves, information that could ultimately be reflected in their dating profile on the app.

Tinder then gathered the data from this experiment. Each row in the dataset represents one speed date between two people, and indicates whether each of them secretly agreed to go on a second date with the other person.
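As a minimal sketch of how the two secret decisions relate to a match, the toy frame below mimics one row per speed date. The column names `dec` (this participant's decision), `dec_o` (the partner's decision) and `match` are assumptions about how the data key names these fields, and the values are invented; check both against the Speed Dating Data Key before relying on them.

```python
import pandas as pd

# Toy data: four speed dates, each seen from one participant's side.
# Column names are assumed to mirror the real dataset (dec, dec_o, match).
toy = pd.DataFrame({
    "iid": [1, 1, 2, 2],
    "pid": [11, 12, 11, 12],
    "dec": [1, 0, 1, 1],     # did this participant say yes?
    "dec_o": [1, 1, 0, 1],   # did the partner say yes?
})

# A match requires a yes from both sides.
toy["match"] = ((toy["dec"] == 1) & (toy["dec_o"] == 1)).astype(int)

print(toy)
print("Decision rate:", toy["dec"].mean())
print("Match rate:", toy["match"].mean())
```

The decision rate is always at least as high as the match rate, because a match needs two positive decisions.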

## Goals 🎯

Use the dataset to understand what makes people interested enough in each other to go on a second date together:
* You may use descriptive statistics
* You may use visualisations

## Scope of this project 🖼️

Data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four-minute "first date" with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests.

The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information. See the Speed Dating Data Key document below for details.

[Dataset](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M03-EDA/Speed+Dating+Data.csv)

[Dataset Description](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M03-EDA/Speed+Dating+Data+Key.doc)

## Helpers 🦮

To help you achieve this project, here are a few tips that should help you.

Data exploration ideas:
* What are the least desirable attributes in a male partner? Does this differ for female partners?
* How important do people think attractiveness is in potential mate selection vs. its real impact?
* Are shared interests more important than a shared racial background?
* Can people accurately predict their own perceived value in the dating market?
* In terms of getting a second date, is it better to be someone's first speed date of the night or their last?

## Deliverable 📬

To complete this project, your team should deliver:

A notebook with:
* descriptive statistics
* visualisations
* captions and interpretations on how the stats and visualisations are relevant to why people agree to a second date

In [80]:
import pandas as pd
import plotly.express as px

import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [81]:
df = pd.read_csv('Speed+Dating+Data.csv', encoding='latin1')

In [82]:
# Check whether num_in_3 is always greater than or equal to numdat_3
inconsistent_rows = df[df['num_in_3'] < df['numdat_3']]

# Compute the percentage of inconsistent rows
inconsistency_percentage = (len(inconsistent_rows) / len(df)) * 100

inconsistency_percentage, len(inconsistent_rows)  # Show the inconsistency percentage and the number of affected rows

(1.3845786583910242, 116)

In [83]:
# Where num_in_3 is smaller than numdat_3, raise it to numdat_3
mask = df['num_in_3'] < df['numdat_3']
df.loc[mask, 'num_in_3'] = df.loc[mask, 'numdat_3']

In [84]:
# Replace NaN values in numdat_3 with 0
df['numdat_3'] = df['numdat_3'].fillna(0)
# Set date_3 to 1 when numdat_3 is greater than 0
df.loc[df['numdat_3'] > 0, 'date_3'] = 1
df['date_3'] = df['date_3'].fillna(0)

# Replace the zeros in numdat_3 with the median of the non-zero values when date_3 equals 1
# (a median over all rows would mostly reflect the zeros filled in above)
df.loc[(df['date_3'] == 1) & (df['numdat_3'] == 0), 'numdat_3'] = df.loc[df['numdat_3'] > 0, 'numdat_3'].median()

df["you_call"] = df["you_call"].fillna(0)
df["them_cal"] = df["them_cal"].fillna(0)

# Drop the rows where wave is between 6 and 9
# (.copy() avoids SettingWithCopyWarning when columns are added to df later)
df = df[~df['wave'].between(6, 9)].copy()

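# Aggregate per participant (iid): mean and sum of the numeric columns,
# joined with the first value of each non-numeric column for that participant
non_numeric = df.select_dtypes(exclude=['number']).groupby(df["iid"]).first()
df_Grouped_By_iid = df.groupby("iid").mean(numeric_only=True).join(non_numeric).reset_index()
df_Grouped_By_iid_Sum = df.groupby("iid").sum(numeric_only=True).join(non_numeric).reset_index()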

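def RemoveMissingsValues(df):
    # Drop the columns with more than 50% missing values and return the result
    threshold = int(len(df) * 0.5)
    return df.dropna(thresh=threshold, axis=1)

# The return values are not assigned back, so the three frames keep all of
# their columns; the cells below rely on columns that would otherwise be dropped.
RemoveMissingsValues(df)
RemoveMissingsValues(df_Grouped_By_iid)
RemoveMissingsValues(df_Grouped_By_iid_Sum)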



In [85]:
missing_values = df.isnull().sum()

fig = px.bar(missing_values[missing_values > 500])
fig.show()

In [86]:
def RemoveOutliers(df) :
    df_numeric = df.select_dtypes(include=[np.number])

    # Compute the quartiles on the numeric columns only
    Q1 = df_numeric.quantile(0.25)
    Q3 = df_numeric.quantile(0.75)
    IQR = Q3 - Q1

    # Compute the bounds used to flag outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR


    # Helper to check whether a column is binary
    def is_binary(column):
        return set(column.dropna().unique()).issubset({0, 1})

    # Replace outliers with the column median, except in binary columns
    df_clean = df_numeric.apply(lambda x: np.where((x < lower_bound[x.name]) | (x > upper_bound[x.name]), x.median(), x) if not is_binary(x) else x, axis=0)

    return df_clean

df_clean = RemoveOutliers(df)
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6816 entries, 0 to 8377
Columns: 187 entries, iid to amb5_3
dtypes: float64(182), int64(5)
memory usage: 9.8 MB


In [87]:
df_Grouped_By_iid_clean = RemoveOutliers(df_Grouped_By_iid)

df_Grouped_By_iid_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 449 entries, 0 to 448
Columns: 187 entries, iid to amb5_3
dtypes: float64(187)
memory usage: 656.1 KB


In [88]:
df_Grouped_By_iid_Sum_clean = RemoveOutliers(df_Grouped_By_iid_Sum)

df_Grouped_By_iid_Sum_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 449 entries, 0 to 448
Columns: 187 entries, iid to amb5_3
dtypes: float64(187)
memory usage: 656.1 KB


In [89]:
# Calculate the mean for each attribute by gender
df_Grouped_M = df_Grouped_By_iid_clean[df_Grouped_By_iid_clean["gender"] == 1]
df_Grouped_F = df_Grouped_By_iid_clean[df_Grouped_By_iid_clean["gender"] == 0]

attr1_1_Avg_M = df_Grouped_M["attr1_1"].mean()
sinc1_1_Avg_M = df_Grouped_M["sinc1_1"].mean()
intel1_1_Avg_M = df_Grouped_M["intel1_1"].mean()
fun1_1_Avg_M = df_Grouped_M["fun1_1"].mean()
amb1_1_Avg_M = df_Grouped_M["amb1_1"].mean()
shar1_1_Avg_M = df_Grouped_M["shar1_1"].mean()

attr1_1_Avg_F = df_Grouped_F["attr1_1"].mean()
sinc1_1_Avg_F = df_Grouped_F["sinc1_1"].mean()
intel1_1_Avg_F = df_Grouped_F["intel1_1"].mean()
fun1_1_Avg_F = df_Grouped_F["fun1_1"].mean()
amb1_1_Avg_F = df_Grouped_F["amb1_1"].mean()
shar1_1_Avg_F = df_Grouped_F["shar1_1"].mean()

# Create a list of new names for each attribute
attribute_names = ["Attractiveness", "Sincerity", "Intelligence", "Fun", "Ambition", "Shared Interests"]

# Create the figure
fig = go.Figure()

# Add bars for each attribute for males
fig.add_trace(go.Bar(
    x=attribute_names,
    y=[attr1_1_Avg_M, sinc1_1_Avg_M, intel1_1_Avg_M, fun1_1_Avg_M, amb1_1_Avg_M, shar1_1_Avg_M],
    text=[f'{round(attr1_1_Avg_M,1)}', f'{round(sinc1_1_Avg_M,1)}', f'{round(intel1_1_Avg_M,1)}', f'{round(fun1_1_Avg_M,1)}', f'{round(amb1_1_Avg_M,1)}', f'{round(shar1_1_Avg_M,1)}'],
    textposition='auto',
    name='Male'
))

# Add bars for each attribute for females
fig.add_trace(go.Bar(
    x=attribute_names,
    y=[attr1_1_Avg_F, sinc1_1_Avg_F, intel1_1_Avg_F, fun1_1_Avg_F, amb1_1_Avg_F, shar1_1_Avg_F],
    text=[f'{round(attr1_1_Avg_F,1)}', f'{round(sinc1_1_Avg_F,1)}', f'{round(intel1_1_Avg_F,1)}', f'{round(fun1_1_Avg_F,1)}', f'{round(amb1_1_Avg_F,1)}', f'{round(shar1_1_Avg_F,1)}'],
    textposition='auto',
    name='Female'
))

# Update layout
fig.update_layout(
    title='Comparing Preferences of Women and Men Before the Event',
    xaxis_title='Attributes',
    yaxis_title='Average rating',
    barmode='group'
)

# Show the plot
fig.show()

Ambition is the least sought-after attribute for both men and women. The most sought-after attribute differs by gender: men put attractiveness first, while women put intelligence first. Stated priorities in partner selection therefore vary with gender.
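The per-attribute averages behind this reading can also be obtained with a single `groupby`, which avoids one variable per attribute and gender. This is only a sketch on invented values; the gender coding (1 = male, 0 = female) is assumed from the filtering used in the cell above.

```python
import pandas as pd

# Toy data (invented values): importance each participant gives to three attributes.
# Gender coding assumed as in the notebook: 1 = male, 0 = female.
toy = pd.DataFrame({
    "gender":   [1, 1, 0, 0],
    "attr1_1":  [30.0, 26.0, 18.0, 20.0],
    "intel1_1": [20.0, 18.0, 22.0, 24.0],
    "amb1_1":   [8.0, 10.0, 12.0, 14.0],
})

# One row per gender, one column per attribute.
means = toy.groupby("gender").mean()

# Highest- and lowest-rated attribute for each gender.
summary = pd.DataFrame({
    "most_valued": means.idxmax(axis=1),
    "least_valued": means.idxmin(axis=1),
})

print(means)
print(summary)
```

The `means` frame can be passed to a grouped bar chart directly, one trace per row.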

In [90]:
# Calculate the mean for each attribute by gender
df_Grouped_M = df_Grouped_By_iid_clean[df_Grouped_By_iid_clean["gender"] == 1]
df_Grouped_F = df_Grouped_By_iid_clean[df_Grouped_By_iid_clean["gender"] == 0]

# 2_1 : What do you think the opposite sex looks for in a date?

attr2_1_Avg_M = df_Grouped_M["attr2_1"].mean()
sinc2_1_Avg_M = df_Grouped_M["sinc2_1"].mean()
intel2_1_Avg_M = df_Grouped_M["intel2_1"].mean()
fun2_1_Avg_M = df_Grouped_M["fun2_1"].mean()
amb2_1_Avg_M = df_Grouped_M["amb2_1"].mean()
shar2_1_Avg_M = df_Grouped_M["shar2_1"].mean()

attr2_1_Avg_F = df_Grouped_F["attr2_1"].mean()
sinc2_1_Avg_F = df_Grouped_F["sinc2_1"].mean()
intel2_1_Avg_F = df_Grouped_F["intel2_1"].mean()
fun2_1_Avg_F = df_Grouped_F["fun2_1"].mean()
amb2_1_Avg_F = df_Grouped_F["amb2_1"].mean()
shar2_1_Avg_F = df_Grouped_F["shar2_1"].mean()


fig = go.Figure()


fig.add_trace(go.Bar(
    x=attribute_names,
    y=[attr2_1_Avg_F, sinc2_1_Avg_F, intel2_1_Avg_F, fun2_1_Avg_F, amb2_1_Avg_F, shar2_1_Avg_F],
    text=[f'{round(attr2_1_Avg_F,1)}', f'{round(sinc2_1_Avg_F,1)}', f'{round(intel2_1_Avg_F,1)}', f'{round(fun2_1_Avg_F,1)}', f'{round(amb2_1_Avg_F,1)}', f'{round(shar2_1_Avg_F,1)}'],
    textposition='auto',
    name='Male supposed preference by female'
))
fig.add_trace(go.Bar(
    x=attribute_names,
    y=[attr1_1_Avg_M, sinc1_1_Avg_M, intel1_1_Avg_M, fun1_1_Avg_M, amb1_1_Avg_M, shar1_1_Avg_M],
    text=[f'{round(attr1_1_Avg_M,1)}', f'{round(sinc1_1_Avg_M,1)}', f'{round(intel1_1_Avg_M,1)}', f'{round(fun1_1_Avg_M,1)}', f'{round(amb1_1_Avg_M,1)}', f'{round(shar1_1_Avg_M,1)}'],
    textposition='auto',
    name='Male preference'
))




fig.add_trace(go.Bar(
    x=attribute_names,
    y=[attr2_1_Avg_M, sinc2_1_Avg_M, intel2_1_Avg_M, fun2_1_Avg_M, amb2_1_Avg_M, shar2_1_Avg_M],
    text=[f'{round(attr2_1_Avg_M,1)}', f'{round(sinc2_1_Avg_M,1)}', f'{round(intel2_1_Avg_M,1)}', f'{round(fun2_1_Avg_M,1)}', f'{round(amb2_1_Avg_M,1)}', f'{round(shar2_1_Avg_M,1)}'],
    textposition='auto',
    name='Female supposed preference by male'
))
fig.add_trace(go.Bar(
    x=attribute_names,
    y=[attr1_1_Avg_F, sinc1_1_Avg_F, intel1_1_Avg_F, fun1_1_Avg_F, amb1_1_Avg_F, shar1_1_Avg_F],
    text=[f'{round(attr1_1_Avg_F,1)}', f'{round(sinc1_1_Avg_F,1)}', f'{round(intel1_1_Avg_F,1)}', f'{round(fun1_1_Avg_F,1)}', f'{round(amb1_1_Avg_F,1)}', f'{round(shar1_1_Avg_F,1)}'],
    textposition='auto',
    name='Female preference'
))

fig.update_layout(
    title='Comparing opposite sex supposed preference',
    xaxis_title='Attributes',
    yaxis_title='Average rating',
    barmode='group'
)

fig.show()

Comparing each gender's stated preferences with what the opposite sex believes they want shows that both men and women overestimate how much the other side values attractiveness. Conversely, they underestimate how much it values sincerity and intelligence.
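One way to make the over- and underestimation explicit is to subtract the stated preference from the supposed one, attribute by attribute. The sketch below uses invented averages and short hypothetical attribute labels; it is not computed from the dataset.

```python
import pandas as pd

attributes = ["attr", "sinc", "intel"]

# Invented averages, for illustration only.
# stated:   what one gender says it looks for (the *1_1 questions).
# supposed: what the opposite sex thinks that gender looks for (the *2_1 questions).
stated = pd.Series([18.0, 18.0, 21.0], index=attributes)
supposed = pd.Series([25.0, 13.0, 15.0], index=attributes)

# Positive gap: the opposite sex overestimates the attribute's importance.
gap = supposed - stated
print(gap.sort_values(ascending=False))

overestimated = gap[gap > 0].index.tolist()
underestimated = gap[gap < 0].index.tolist()
print("Overestimated:", overestimated)
print("Underestimated:", underestimated)
```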

In [91]:
samerace_Avg = df_clean["samerace"].mean()
samerace_Avg_Positive_Dec = df_clean[df_clean["dec"] == 1]["samerace"].mean()
samerace_Avg_Match = df_clean[df_clean["match"] == 1]["samerace"].mean()

int_corr_Avg = df_clean["int_corr"].mean()
int_corr_Avg_Positive_Dec = df_clean[df_clean["dec"] == 1]["int_corr"].mean()
int_corr_Avg_Match = df_clean[df_clean["match"] == 1]["int_corr"].mean()

shar_Avg = df_clean["shar"].mean()
shar_Avg_Positive_Dec = df_clean[df_clean["dec"] == 1]["shar"].mean()
shar_Avg_Match = df_clean[df_clean["match"] == 1]["shar"].mean()
# Format the relative change between two values as a signed percentage
def percentage_increase(current, previous):
    if previous == 0:
        return "N/A"
    return f'{((current - previous) / previous) * 100:+.1f}%'

# Create subplots
fig = make_subplots(rows=1, cols=3, subplot_titles=("Pairs with a shared racial background", "Average correlation of interests", "Average shared interests rating"))

# Add traces for each bar plot
fig.add_trace(go.Bar(
    x=["Global", "Positive decision", "Match"],
    y=[samerace_Avg, samerace_Avg_Positive_Dec, samerace_Avg_Match],
    text=[
        f'{samerace_Avg * 100:.1f}%',
        f'{samerace_Avg_Positive_Dec * 100:.1f}% ({percentage_increase(samerace_Avg_Positive_Dec, samerace_Avg)})',
        f'{samerace_Avg_Match * 100:.1f}% ({percentage_increase(samerace_Avg_Match, samerace_Avg)})'
    ],
    textposition='auto',
    name='Pairs with a shared racial background'
), row=1, col=1)

fig.add_trace(go.Bar(
    x=["Global", "Positive decision", "Match"],
    y=[int_corr_Avg, int_corr_Avg_Positive_Dec, int_corr_Avg_Match],
    text=[
        f'{int_corr_Avg:.3f}',
        f'{int_corr_Avg_Positive_Dec:.3f} ({percentage_increase(int_corr_Avg_Positive_Dec, int_corr_Avg)})',
        f'{int_corr_Avg_Match:.3f} ({percentage_increase(int_corr_Avg_Match, int_corr_Avg)})'
    ],
    textposition='auto',
    name='Average correlation of interests'
), row=1, col=2)

fig.add_trace(go.Bar(
    x=["Global", "Positive decision", "Match"],
    y=[shar_Avg, shar_Avg_Positive_Dec, shar_Avg_Match],
    text=[
        f'{shar_Avg:.1f}',
        f'{shar_Avg_Positive_Dec:.1f} ({percentage_increase(shar_Avg_Positive_Dec, shar_Avg)})',
        f'{shar_Avg_Match:.1f} ({percentage_increase(shar_Avg_Match, shar_Avg)})'
    ],
    textposition='auto',
    name='Average shared interests rating'
), row=1, col=3)

# Update layout
fig.update_layout(
    showlegend=False,
    title='Comparing shared racial background and shared interests impact on decision and match',
    yaxis_title='Average'
)

fig.show()

In [92]:
# Compute the correlations
corr_samerace_dec = df_clean['samerace'].corr(df_clean['dec'])
corr_int_corr_dec = df_clean['int_corr'].corr(df_clean['dec'])
corr_shar_dec = df_clean['shar'].corr(df_clean['dec'])

corr_samerace_match = df_clean['samerace'].corr(df_clean['match'])
corr_int_corr_match = df_clean['int_corr'].corr(df_clean['match'])
corr_shar_match = df_clean['shar'].corr(df_clean['match'])

# Create subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=("Correlation with Decision", "Correlation with Match"))

# Add one trace per bar plot
fig.add_trace(go.Bar(
    x=["Shared Racial Background", "Correlation of Interests", "Shared Interests Rating"],
    y=[corr_samerace_dec, corr_int_corr_dec, corr_shar_dec],
    text=[f'{corr_samerace_dec:.2f}', f'{corr_int_corr_dec:.2f}', f'{corr_shar_dec:.2f}'],
    textposition='auto',
    name='Correlation with Decision'
), row=1, col=1)

fig.add_trace(go.Bar(
    x=["Shared Racial Background", "Correlation of Interests", "Shared Interests Rating"],
    y=[corr_samerace_match, corr_int_corr_match, corr_shar_match],
    text=[f'{corr_samerace_match:.2f}', f'{corr_int_corr_match:.2f}', f'{corr_shar_match:.2f}'],
    textposition='auto',
    name='Correlation with Match'
), row=1, col=2)


fig.update_layout(
    showlegend=False,
    title='Comparing Correlations with Decision and Match',
    yaxis_title='Correlation Coefficient'
)

fig.show()

The correlation analysis shows that sharing the same racial background and having correlated interests have almost no effect on getting a match, whereas the shared interests rating has a clearly stronger one. Perceived shared interests are therefore more important in forming a match than a shared racial background. The int_corr column also does not seem to capture how much shared interest participants actually perceive during the date.
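Because `dec` and `match` are binary, the Pearson coefficient used here is a point-biserial correlation: it is the difference between the mean rating of the "yes" and "no" groups, scaled by the overall spread and by how balanced the two groups are. The sketch below checks that equivalence on invented values; it is an illustration of the formula, not a result from the dataset.

```python
import math
import pandas as pd

# Toy data (invented): shared interests rating and the yes/no decision.
toy = pd.DataFrame({
    "shar": [3.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "dec":  [0, 0, 0, 1, 1, 1],
})

# Pearson correlation, computed the same way as in the notebook.
r_pearson = toy["shar"].corr(toy["dec"])

# The same quantity written as a point-biserial correlation.
mean_yes = toy.loc[toy["dec"] == 1, "shar"].mean()
mean_no = toy.loc[toy["dec"] == 0, "shar"].mean()
p = toy["dec"].mean()          # share of positive decisions
s = toy["shar"].std(ddof=0)    # population standard deviation of the rating
r_pb = (mean_yes - mean_no) / s * math.sqrt(p * (1 - p))

print(f"Pearson: {r_pearson:.3f}, point-biserial: {r_pb:.3f}")
```

Read this way, a coefficient near zero means the average rating barely differs between dates that ended in a yes and dates that ended in a no.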

In [93]:




Average_Match_Expected = df_Grouped_By_iid_clean["match_es"].mean()
Average_Match = df_Grouped_By_iid_Sum_clean["match"].mean()

Average_self_assessment_attr = df_Grouped_By_iid_clean["attr3_1"].mean()
Average_expected_rating_attr = df_Grouped_By_iid_clean["attr5_1"].mean()
Average_rating_by_partners_attr = df_Grouped_By_iid_clean["attr_o"].mean()

Average_self_assessment_sinc = df_Grouped_By_iid_clean["sinc3_1"].mean()
Average_expected_rating_sinc = df_Grouped_By_iid_clean["sinc5_1"].mean()
Average_rating_by_partners_sinc = df_Grouped_By_iid_clean["sinc_o"].mean()

Average_self_assessment_intel = df_Grouped_By_iid_clean["intel3_1"].mean()
Average_expected_rating_intel = df_Grouped_By_iid_clean["intel5_1"].mean()
Average_rating_by_partners_intel = df_Grouped_By_iid_clean["intel_o"].mean()

Average_self_assessment_fun = df_Grouped_By_iid_clean["fun3_1"].mean()
Average_expected_rating_fun = df_Grouped_By_iid_clean["fun5_1"].mean()
Average_rating_by_partners_fun = df_Grouped_By_iid_clean["fun_o"].mean()

Average_self_assessment_amb = df_Grouped_By_iid_clean["amb3_1"].mean()
Average_expected_rating_amb = df_Grouped_By_iid_clean["amb5_1"].mean()
Average_rating_by_partners_amb = df_Grouped_By_iid_clean["amb_o"].mean()

fig = go.Figure()
fig.add_trace(go.Bar(
    x=["Average number of matches expected", "Average number of matches"],
    y=[Average_Match_Expected, Average_Match],
    text=[f'{Average_Match_Expected:.1f}', f'{Average_Match:.1f}'],
    textposition='auto',
    name='Comparison of expectation and reality'
))


fig.add_trace(go.Bar(
    x=["Self-Assessment", "Average expected rating", "Average Partner’s Rating"],
    y=[Average_self_assessment_attr, Average_expected_rating_attr, Average_rating_by_partners_attr],
    text=[f'{Average_self_assessment_attr:.1f}', f'{Average_expected_rating_attr:.1f}', f'{Average_rating_by_partners_attr:.1f}'],
    textposition='auto',
    name='Comparison of expectation and reality',
    visible= False
))

fig.add_trace(go.Bar(
    x=["Self-Assessment", "Average expected rating", "Average Partner’s Rating"],
    y=[Average_self_assessment_sinc, Average_expected_rating_sinc, Average_rating_by_partners_sinc],
    text=[f'{Average_self_assessment_sinc:.1f}', f'{Average_expected_rating_sinc:.1f}', f'{Average_rating_by_partners_sinc:.1f}'],
    textposition='auto',
    name='Comparison of expectation and reality',
    visible= False
))

fig.add_trace(go.Bar(
    x=["Self-Assessment", "Average expected rating", "Average Partner’s Rating"],
    y=[Average_self_assessment_intel, Average_expected_rating_intel, Average_rating_by_partners_intel],
    text=[f'{Average_self_assessment_intel:.1f}', f'{Average_expected_rating_intel:.1f}', f'{Average_rating_by_partners_intel:.1f}'],
    textposition='auto',
    name='Comparison of expectation and reality',
    visible= False
))

fig.add_trace(go.Bar(
    x=["Self-Assessment", "Average expected rating", "Average Partner’s Rating"],
    y=[Average_self_assessment_fun, Average_expected_rating_fun, Average_rating_by_partners_fun],
    text=[f'{Average_self_assessment_fun:.1f}', f'{Average_expected_rating_fun:.1f}', f'{Average_rating_by_partners_fun:.1f}'],
    textposition='auto',
    name='Comparison of expectation and reality',
    visible= False
))

fig.add_trace(go.Bar(
    x=["Self-Assessment", "Average expected rating", "Average Partner’s Rating"],
    y=[Average_self_assessment_amb, Average_expected_rating_amb, Average_rating_by_partners_amb],
    text=[f'{Average_self_assessment_amb:.1f}', f'{Average_expected_rating_amb:.1f}', f'{Average_rating_by_partners_amb:.1f}'],
    textposition='auto',
    name='Comparison of expectation and reality',
    visible= False
))


fig.update_layout(
    showlegend=False,
    title='Expectation vs reality',
    yaxis_title='Average',

    updatemenus=[
        dict(
            type="buttons",
            direction="down",
            buttons=list([
                dict(
                    args=["visible", [True,False,False,False,False,False]],
                    label="Match",
                    method="restyle"
                ),
                dict(
                    args=["visible", [False,True,False,False,False,False]],
                    label="Attractiveness",
                    method="restyle"
                ),
                dict(
                    args=["visible", [False,False,True,False,False,False]],
                    label="Sincerity",
                    method="restyle"
                ),
                dict(
                    args=["visible", [False,False,False,True,False,False]],
                    label="Intelligence",
                    method="restyle"
                ),
                dict(
                    args=["visible", [False,False,False,False,True,False]],
                    label="Fun",
                    method="restyle"
                ),
                dict(
                    args=["visible", [False,False,False,False,False,True]],
                    label="Ambition",
                    method="restyle"
                )
            ]),
        )
    ]
)


fig.show()

Comparing self-ratings with partner ratings shows that people tend to overestimate their value in the dating market. On average, participants rated themselves higher than their partners rated them on attractiveness (7.0 vs. 6.2), sincerity (8.3 vs. 7.2), intelligence (8.4 vs. 7.4), fun (7.8 vs. 6.4), and ambition (7.7 vs. 6.8). Participants also overestimated the number of matches they would get, expecting 2.6 on average against an actual average of 2.4. Overestimation therefore shows up both in self-assessment and in predicted outcomes.
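Averages can hide how many individuals actually overestimate themselves, so a per-participant gap is a useful complement. The sketch below uses invented values; the column meanings (`attr3_1` as the self-rating, `attr_o` as the mean rating received from partners) are assumed from the cell above.

```python
import pandas as pd

# Toy data (invented): one row per participant.
# attr3_1 = self-rating, attr_o = mean rating received from partners.
toy = pd.DataFrame({
    "iid":     [1, 2, 3, 4],
    "attr3_1": [8.0, 7.0, 6.0, 9.0],
    "attr_o":  [6.5, 7.5, 5.0, 7.0],
})

# Positive gap: the participant rates themselves above how partners rate them.
toy["gap"] = toy["attr3_1"] - toy["attr_o"]

mean_gap = toy["gap"].mean()
share_overestimating = (toy["gap"] > 0).mean()

print(f"Mean gap: {mean_gap:.2f}")
print(f"Share of participants overestimating: {share_overestimating:.0%}")
```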

In [94]:
# Filter first and last dates
first_last = df[(df['order'] == 1) | (df['order'] == df.groupby('wave')['order'].transform('max'))].copy()

# Add a column indicating if it's a first or last date
first_last['type'] = first_last['order'].apply(lambda x: 'First' if x == 1 else 'Last')

# Analyze the decision (dec == 1 means yes for a second date)
grouped_decision = first_last.groupby('type').agg({'dec': 'mean'}).reset_index()

# Analyze the match rate
grouped_match = first_last.groupby('type').agg({'match': 'mean'}).reset_index()

# Create subplots (1 row, 2 columns)
fig = make_subplots(rows=1, cols=2, subplot_titles=('Second Date Decision Rate', 'Match Rate'))

# Add the first subplot (second date decision rate)
fig.add_trace(go.Bar(
    x=grouped_decision['type'],
    y=grouped_decision['dec'],
    name='Decision Rate',
    marker_color='indigo'
), row=1, col=1)

# Add the second subplot (match rate)
fig.add_trace(go.Bar(
    x=grouped_match['type'],
    y=grouped_match['match'],
    name='Match Rate',
    marker_color='green'
), row=1, col=2)

# Update layout
fig.update_layout(
    title='Comparison of Match Rate and Second Date Decision by Speed Date Position',
    showlegend=False,  # Hide the legend as the names are similar
    template='plotly_white'
)

# Add y-axis labels
fig.update_yaxes(title_text="Second Date Decision Rate", row=1, col=1)
fig.update_yaxes(title_text="Match Rate", row=1, col=2)


fig.show()


In [95]:
# Compute the maximum number of dates per wave
df['max_order'] = df.groupby('wave')['order'].transform('max')

# Normalize the date position (between 0 and 1)
df['normalized_order'] = df['order'] / df['max_order']

# Plot the average match rate per bin of normalized position
fig = px.histogram(df, x='normalized_order', y='match',
             labels={'normalized_order': 'Normalized Order', 'match': 'Match'},
             title='Average match rate by normalized speed date position',
             template='plotly_white',
             nbins=20,
             histfunc="avg")
# Show the figure
fig.show()
fig = px.histogram(df, x='normalized_order', y='dec',
             labels={'normalized_order': 'Normalized Order', 'dec': 'Positive decision'},
             title='Average positive decision rate by normalized speed date position',
             template='plotly_white',
             nbins=20,
             histfunc="avg")
# Show the figure
fig.show()

In [96]:
print(f" The correlation between order and decision is : {df['order'].corr(df['dec'])}")
print(f" The correlation between order and match is : {df['order'].corr(df['match'])}")

 The correlation between order and decision is : -0.04231438011020121
 The correlation between order and match is : -0.04864977596442452


In terms of getting a second date, it is slightly better to be someone's first speed date of the night than their last. First dates are more likely to receive a positive decision for a second date (49.7% vs. 44.3%) and have a marginally higher match rate (23% vs. 21%).

However, both correlation coefficients (about -0.04 with the decision and -0.05 with the match) are close to zero, so the position in the order of speed dates has only a minor impact on the likelihood of getting a second date.
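Whether a gap such as the one between first and last dates could be due to chance depends on how many dates are in each group. A two-proportion z-test is one common way to check this; the sketch below implements it with the standard library only. The counts are hypothetical, since only the rates are reported here, and `two_proportion_z` is an illustrative helper rather than part of the notebook.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return the z statistic and two-sided p-value for H0: rate_a == rate_b."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 250 yes out of 500 first dates, 220 yes out of 500 last dates.
z, p_value = two_proportion_z(success_a=250, n_a=500, success_b=220, n_b=500)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With the real group sizes taken from `first_last`, the same function would indicate whether the observed difference is distinguishable from noise.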

In [97]:
fig = go.Figure()

fig.add_trace(go.Bar(
    x=attribute_names,
    y=[attr1_1_Avg_F, sinc1_1_Avg_F, intel1_1_Avg_F, fun1_1_Avg_F, amb1_1_Avg_F, shar1_1_Avg_F],
    text=[f'{round(attr1_1_Avg_F,1)}', f'{round(sinc1_1_Avg_F,1)}', f'{round(intel1_1_Avg_F,1)}', f'{round(fun1_1_Avg_F,1)}', f'{round(amb1_1_Avg_F,1)}', f'{round(shar1_1_Avg_F,1)}'],
    textposition='auto',
    name='Female'
))

fig.show()

In [98]:
df_dec_corr = df_clean.select_dtypes(include=[np.number]).corr()["dec"]

# Sort the correlations by absolute value while keeping their sign
sorted_corr = df_dec_corr.reindex(df_dec_corr.abs().sort_values(ascending=False).index)

# Keep the 20 largest absolute correlations (skipping the target column itself)
best_abs_corr = sorted_corr.iloc[1:21]

fig_best_abs = px.bar(best_abs_corr, title="Top 20 absolute correlations with 'dec'")

fig_best_abs.show()

In [99]:
df_match_corr = df_clean.select_dtypes(include=[np.number]).corr()["match"]

sorted_corr = df_match_corr.reindex(df_match_corr.abs().sort_values(ascending=False).index)

best_abs_corr = sorted_corr.iloc[1:21]

fig_best_abs = px.bar(best_abs_corr, title="Top 20 absolute correlations with 'match'")

fig_best_abs.show()

In [108]:
df_date_3_corr = df_Grouped_By_iid_clean.select_dtypes(include=[np.number]).corr()["date_3"]

sorted_corr = df_date_3_corr.reindex(df_date_3_corr.abs().sort_values(ascending=False).index)

best_abs_corr = sorted_corr.iloc[1:21]

fig_best_abs = px.bar(best_abs_corr, title="Top 20 absolute correlations with 'date_3'")

fig_best_abs.show()

In [106]:
df_numdat_3_corr = df.select_dtypes(include=[np.number]).corr()["numdat_3"]

sorted_corr = df_numdat_3_corr.reindex(df_numdat_3_corr.abs().sort_values(ascending=False).index)

best_abs_corr = sorted_corr.iloc[1:21]

fig_best_abs = px.bar(best_abs_corr, title="Top 20 absolute correlations with 'numdat_3'")

fig_best_abs.show()