# Speed Dating

## Challenge description

We will start a new data visualization and exploration project. The goal is to try to understand *love*! It is a very complicated subject, so we have simplified it: you will study what happens during a speed dating event and, in particular, what influences getting a **second date**.

The data comes from Kaggle, where you can find more details:

[Speed Dating Dataset](https://www.kaggle.com/annavictoria/speed-dating-experiment#Speed%20Dating%20Data%20Key.doc)

### Deliverable

To succeed in this project, you will need to produce a descriptive analysis of the main factors that influence getting a second date.

In [13]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

df = pd.read_csv("speed_dating_data.csv", encoding="ISO-8859-1")

# Understanding the dataset

- The data comes from speed dating events held from 2002 to 2004.  
- During each event, attendees have a four-minute date with every participant of the opposite sex.  
- After each date, participants are asked to rate their partner on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests.  
- The `iid` column is the unique identifier of each participant. There are 551 of them (iid 118 is missing). Some people attended more than one event and received a new iid each time.  
- The dataset also contains information such as field of study, age, intended career, hobbies and many others that we will use to understand dating.  
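
Because each row describes one date rather than one person, the same `iid` appears on many rows. The sketch below uses a small made-up table (the values are illustrative and not taken from the dataset) to show the difference between counting dates and counting participants.

```python
import pandas as pd

# Hypothetical toy data: one row per date, so each iid repeats
toy = pd.DataFrame({
    "iid": [1, 1, 1, 2, 2, 3],
    "pid": [11, 12, 13, 11, 12, 11],
    "age": [24, 24, 24, 27, 27, 22],
})

n_dates = len(toy)                     # number of rows, i.e. dates
n_participants = toy["iid"].nunique()  # number of distinct people

# Keep one row per participant before computing per-person statistics
participants = toy.drop_duplicates(subset=["iid"])
mean_age = participants["age"].mean()
```

Computing the mean age on the full table instead would give more weight to the people who went on more dates.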




In [14]:
df.shape

(8378, 195)

# Cleaning and manipulating data

- The dataset contains 195 columns. I will keep only the ones I am most interested in.
- Participants score six attributes (Attractive, Sincere, Intelligent, Fun, Ambitious, Shared interest) several times.
- When they declare what they are looking for in a partner, they have a total of 100 points to distribute among these attributes.
- After each date, they rate their partner from 1 to 10 on each attribute.
- I would like to compare what they say they are looking for with what actually drives their decision, so I will rescale the second set of scores to a total of 100 points to make the two comparable.
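
Putting two sets of scores on the same 100-point scale is a simple proportional rescaling. Here is a minimal sketch of the idea; `rescale_to_100` is a hypothetical helper and the input numbers are made up for illustration.

```python
def rescale_to_100(scores):
    """Scale non-negative scores so that they sum to 100."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {name: value / total * 100 for name, value in scores.items()}


# Hypothetical scores, for illustration only
example = {"attr": 0.5, "sinc": 0.2, "intel": 0.3,
           "fun": 0.4, "amb": 0.2, "shar": 0.4}
weights = rescale_to_100(example)
```

Later in this notebook, the same kind of rescaling is applied to absolute correlation scores so that they can be plotted next to the 100-point allocations.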

In [15]:
# Waves 6 to 9 used a different voting system, including for the stated preferences, so I remove them from the DataFrame
df = df[~df['wave'].isin([6, 7, 8, 9])]

# Keep the columns I am the most interested in
df = df[["iid", "gender", "pid", "match", "int_corr", "samerace", "age_o", "dec_o", "age", "field", "career", "career_c", "sports",
"tvsports", "exercise", "dining", "museums", "art", "hiking", "gaming", "clubbing", "reading", "tv", "theater", "movies", "concerts", "music", "shopping", "yoga",
"expnum", "attr1_1", "sinc1_1", "intel1_1", "fun1_1", "amb1_1", "shar1_1", "dec", "attr", "sinc", "intel", "fun", "amb", "shar", "like", "prob"]]

### What is the age distribution?

In [16]:
# Keep one row per participant by dropping duplicate "iid" values (the order is reversed so that males come first, which keeps the legends consistent across the following visualisations)
df_unique_participant = df.drop_duplicates(subset = ["iid"]).iloc[::-1].copy()

# Create a dict to rename and clarify part of the visualization
newnames = {"1":"Male", "0":"Female"}

fig = px.histogram(df_unique_participant, 
                    x="age", 
                    nbins=20,
                    color="gender",
                    template='plotly_dark', 
                    title = "Age distribution by gender",
                    color_discrete_map = {1:'royalblue', 0:'deeppink'},
                    width=1500, height=500
                    )
# Rename the legend entries from 1/0 to Male/Female
fig.for_each_trace(lambda t: t.update(name = newnames[t.name]))

fig.update_layout(title_x=0.5, legend_title="")
fig.show()

- When compared by gender, the age distribution looks homogeneous. 
- Most participants are between 20 and 30.

##### Conclusion

People who participate in speed dating are usually young adults.



### What is the average age by gender?

In [17]:
# Split gender into 2 DataFrames
df_unique_men = df_unique_participant[df_unique_participant['gender'] == 1].copy()
df_unique_women = df_unique_participant[df_unique_participant['gender'] == 0].copy()

# Save the average age for each gender in a new variable
Women_mean = '{:.2f}'.format(df_unique_women['age'].mean())
Men_mean = '{:.2f}'.format(df_unique_men['age'].mean())

# Create a list containing the previously created variables
data = [['Male', Men_mean], ['Female', Women_mean]]

# Turn the list into a DataFrame
df_comparison_age_mean = pd.DataFrame(data, columns=['Gender', 'Age mean'])

# Bar chart
fig = px.bar(df_comparison_age_mean,
             color="Gender",
             template='plotly_dark',
             title="Age average by gender",
             text="value",
             width=1500, height=500,
             color_discrete_map={'Male': 'royalblue', 'Female': 'deeppink'})

fig.update_traces(texttemplate='%{text:.4s}', textposition='outside')
fig.update_layout(title_x=0.5, yaxis={'visible': False}, xaxis={
                  'visible': False}, legend_title="")
fig.show()


##### Conclusion

Men are slightly older on average but it's quite close.

### Does gender have an impact when it comes to accepting a second date?

In [18]:
# Split by gender into 2 DataFrames again. This time I do not drop duplicates, since I need every decision made by each iid
df_men = df[df['gender'] == 1].copy()
df_women = df[df['gender'] == 0].copy()

# Count the decisions, normalize the counts and multiply by 100 to get percentages
df_men_decision_per_participant = df_men["dec"].value_counts(normalize=True).copy()*100
df_women_decision_per_participant = df_women["dec"].value_counts(normalize=True).copy()*100

# Create a list containing the previously created variables
second_date_data = [['Male', df_men_decision_per_participant[1]], ['Female', df_women_decision_per_participant[1]]]

# Turn the list into a DataFrame
df_second_date = pd.DataFrame(second_date_data, columns = ['Gender', 'Second date acceptance'])

# Bar chart
fig = px.bar(df_second_date, color="Gender", 
            template='plotly_dark', 
            title = "Acceptance of second date by gender",
            width=1500, height=500,
            text="value",
            color_discrete_map={'Male': 'royalblue','Female': 'deeppink'}
            )

fig.update_traces(texttemplate='%{text:.4s}', textposition='outside')
fig.update_layout(title_x=0.5, yaxis={'visible': False}, xaxis={'visible': False}, legend_title="")
fig.show()  

- Males agree to see their partner a second time 45.53% of the time.
- Females agree only 37.88% of the time.
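
Since `dec` is coded 0 or 1, its mean is the share of "yes" decisions, so the same rates can be obtained with a single group-by. The sketch below shows this on made-up data; the values are illustrative and do not reproduce the percentages above.

```python
import pandas as pd

# Hypothetical toy data: gender 1 = male, 0 = female; dec 1 = yes, 0 = no
toy = pd.DataFrame({
    "gender": [1, 1, 1, 1, 0, 0, 0, 0],
    "dec":    [1, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the proportion of 1s
acceptance_pct = toy.groupby("gender")["dec"].mean() * 100

male_pct = acceptance_pct.loc[1]
female_pct = acceptance_pct.loc[0]
```

This avoids splitting the data into one DataFrame per gender when only the rate is needed.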

##### Conclusion

It seems women are more selective.

### What are the correlations with the decision to see the partner again a second time?

In [19]:
# Find correlations score for the columns I am interested in
corr = df[["sports", "tvsports", "exercise", "dining", "museums", "art", "hiking", "gaming", "clubbing", "reading", "tv", "theater", 
"movies", "concerts", "music", "shopping", "yoga", "dec", "attr", "sinc", "intel", "fun", "amb", "shar", "like", "prob"]].corr()

# Heatmap
fig = go.Figure()
fig.add_trace(go.Heatmap(
    z = corr,
    x = corr.columns.values,
    y = corr.columns.values,
    colorscale = px.colors.diverging.RdBu,
    zmid=0
    ))

fig.update_layout(width=1500, height=800, paper_bgcolor='black', font_color='white')
fig.show()

- We can see that "Art" and "Museums" are correlated with "Theater".
- The decision to see the partner is correlated with "Attractive", "Fun", "Shared interest" and "Like" (if they liked their partner or not).

##### Conclusion

Out of all the rated attributes, "Like" has the strongest correlation with the decision. It seems that when people like someone, they want to see them again. Who would have guessed?


### Which features correlate most strongly with the decision to see the partner a second time?

In [20]:
# The heatmap gives a global view of the correlations, but I also want a ranking of the features by their correlation with the decision
corr["dec"].abs().sort_values(ascending=False).iloc[1:9]

like     0.514426
attr     0.480372
fun      0.415028
shar     0.410090
prob     0.323505
intel    0.226879
sinc     0.213461
amb      0.183014
Name: dec, dtype: float64

- "Like" is indeed the most important feature, closely followed by "Attractiveness".
- "Intelligence" and "Sincerity" are quite low in comparison.
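
The scores above come from `DataFrame.corr()`, which computes the Pearson correlation by default. To make the measure concrete, here is a dependency-free sketch of that coefficient applied to a rating and a 0/1 decision. The `pearson` helper and the sample values are illustrative assumptions, not part of the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings and decisions, for illustration only
like = [8, 7, 3, 4, 9, 2]
dec = [1, 1, 0, 0, 1, 0]
r = pearson(like, dec)
```

A value close to 1 means that high ratings go together with "yes" decisions. It measures a linear association only and says nothing about causation.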

##### Conclusion

Let's not forget how the data is gathered. People spent only 4 minutes together. This may be enough to find if you like the appearance of someone else but it's hard to tell how intelligent or sincere they are.

# Ratings

- People are given 100 points to distribute among specific attributes. They give more points to the attributes that most influence their decision to accept a second date.  
- The total must be equal to 100.  
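
Whether every recorded answer actually respects this rule is not checked in this notebook. A possible sanity check is sketched below on made-up rows; the tolerance of 1 point is an assumption meant to absorb rounding in the recorded answers.

```python
import pandas as pd

cols = ["attr1_1", "sinc1_1", "intel1_1", "fun1_1", "amb1_1", "shar1_1"]

# Hypothetical allocations, for illustration only
toy = pd.DataFrame(
    [[20, 20, 20, 20, 10, 10],
     [30, 10, 20, 20, 10, 10],
     [25, 25, 25, 25, 5, 5]],  # sums to 110, so it breaks the rule
    columns=cols,
)

totals = toy[cols].sum(axis=1)

# Flag rows whose total is further than 1 point away from 100
is_valid = (totals - 100).abs() <= 1
invalid_rows = toy[~is_valid]
```

Rows flagged this way could be dropped or rescaled to 100 before averaging, depending on how strict the analysis needs to be.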
  
### What do both genders claim to be looking for in a partner?

In [21]:
# Create a list containing strings of the attributes
lst_carac = ['Attractive', 'Sincere', 'Intelligent', 'Fun', 'Ambitious', 'Interest']

# Create a dict to rename attributes rating to clarify it
dic_rename_1_1 = {"attr1_1":"Attractive", "sinc1_1":"Sincere","intel1_1":"Intelligent","fun1_1":"Fun","amb1_1":"Ambitious","shar1_1":"Interest"}

# Create a list containing strings of the current column names
lst_attributes = ["attr1_1", "sinc1_1", "intel1_1", "fun1_1", "amb1_1", "shar1_1"]

# Male polar chart DataFrame
df_men_polar_chart = df_unique_men[lst_attributes].dropna()
df_men_polar_chart = df_men_polar_chart.rename(columns=dic_rename_1_1)
df_men_polar_chart = df_men_polar_chart.mean(axis=0)

# Female polar chart DataFrame
df_women_polar_chart = df_unique_women[lst_attributes].dropna()
df_women_polar_chart = df_women_polar_chart.rename(columns=dic_rename_1_1)
df_women_polar_chart = df_women_polar_chart.mean(axis=0)

# Male bar chart DataFrame
df_men_are_looking_for = df_unique_men[lst_attributes].dropna()
df_men_are_looking_for = pd.DataFrame(df_men_are_looking_for.mean())
df_men_are_looking_for["gender"] = "Male"

# Female bar chart DataFrame
df_women_are_looking_for = df_unique_women[lst_attributes].dropna()
df_women_are_looking_for = pd.DataFrame(df_women_are_looking_for.mean())
df_women_are_looking_for["gender"] = "Female"

# Merging male and female bar chart DataFrames
df_comparison_gender_looking_for = pd.concat([df_men_are_looking_for, df_women_are_looking_for], axis=0)
df_comparison_gender_looking_for.rename(index=dic_rename_1_1, inplace=True)

# Polar chart
fig = go.Figure()

fig.add_trace(go.Scatterpolar(
      r=df_men_polar_chart,
      theta=lst_carac,
      name='Male',
      line_color="royalblue"
      ))

fig.add_trace(go.Scatterpolar(
      r=df_women_polar_chart,
      theta=lst_carac,
      name='Female',
      line_color="deeppink"
      ))

fig.update_layout(polar=dict(
                      radialaxis=dict(visible=True,
                                      range=[0, 30],
                                      showticklabels=False,
                                      showline=False)),
                  width=1500,
                  height=500,
                  title={'x': 0.5},
                  title_text="What participants are looking for",
                  template="plotly_dark"
                  )

fig.update_polars(angularaxis_direction="clockwise")

fig.update_traces(fill='toself')

# Bar chart
fig2 = px.bar(df_comparison_gender_looking_for, 
            color="gender", 
            barmode='group', 
            template='plotly_dark',
            width=1500, 
            height=500, 
            text="value",
            labels = {"index" : ""},
            color_discrete_map={'Male': 'royalblue','Female': 'deeppink'}
            )

fig2.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig2.update_layout(yaxis={'visible': False}, legend_title="")

fig.show()
fig2.show()


- Males give the most points to "Attractive" and the fewest to "Ambitious". They do not care much about shared "Interest" either.
- Females do not show such a strong preference for a single attribute. Their highest is "Intelligent" with 22 points.

##### Conclusion

Both genders rate "Intelligence" highly, yet it was only the third weakest of the features in the correlation scores seen earlier.

### Which ratings matter when people accept a second date? Is it consistent with what they claimed to be looking for?

In [22]:
# Create a dict to rename attributes rating to clarify it
dic_rename = {"attr":"Attractive", "sinc":"Sincere","intel":"Intelligent","fun":"Fun","amb":"Ambitious","shar":"Interest"}

# Create a DataFrame without NaN values from attributes
df_male_attributes_correlation = df_men[["dec", "attr", "sinc", "intel", "fun", "amb", "shar"]].dropna()

# Create a variable to save correlation score
corr = df_male_attributes_correlation.corr()

# Create a ranking of correlation scores
df_male_attributes_correlation = corr["dec"].abs().sort_values(ascending=False).iloc[1:9]

# Weight this podium to a total of 100 points to match the rating system of what people are looking for
df_male_decision_polar = (df_male_attributes_correlation/sum(df_male_attributes_correlation)*100)

# Rename attributes to clarify it
df_male_decision_polar.rename(index=dic_rename, inplace=True)

# Reorder the podium so both data in polar chart have a matching order
df_male_decision_polar = df_male_decision_polar.reindex(index = df_men_polar_chart.index)


# Male bar chart DataFrame creation
df_male_decision_bar = pd.DataFrame(df_male_decision_polar)
df_male_decision_bar["case"] = "Actual decision"

df_men_are_looking_for = pd.DataFrame(df_men_are_looking_for)
df_men_are_looking_for = df_men_are_looking_for.drop('gender', axis=1)
df_men_are_looking_for = df_men_are_looking_for.rename(columns={0:"dec"})
df_men_are_looking_for["case"] = "Looking for"

# Merging what male are looking for and actual decision into a new DataFrame
df_male_merge_decision_bar = pd.concat([df_men_are_looking_for, df_male_decision_bar], axis=0)
df_male_merge_decision_bar.rename(index=dic_rename_1_1, inplace=True)

# Polar chart
fig = go.Figure()

fig.add_trace(go.Scatterpolar(
      r=df_men_polar_chart,
      theta=lst_carac,
      name='Looking for',
      line_color="royalblue"
      ))

fig.add_trace(go.Scatterpolar(
      r=df_male_decision_polar,
      theta=lst_carac,
      name='Actual decision',
      line_color="white"
      ))

fig.update_layout(polar=dict(
                      radialaxis=dict(visible=True,
                                      range=[0, 30],
                                      showticklabels=False,
                                      showline=False)),
                  width=1500,
                  height=500,
                  title={'x': 0.5},
                  title_text="Male comparison between stated interest and actual decision",
                  template="plotly_dark"
                  )

fig.update_polars(angularaxis_direction="clockwise")

fig.update_traces(fill='toself')

fig.show()

fig2 = px.bar(df_male_merge_decision_bar, 
            color="case", 
            barmode='group', 
            template='plotly_dark',
            width=1500, 
            height=500, 
            text="value",
            labels = {"index" : ""},
            color_discrete_map={'Looking for': 'royalblue','Actual decision': 'white'}
            )

fig2.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig2.update_layout(yaxis={'visible': False}, legend_title="")

fig2.show()

- The "Attractive" scores are quite close.
- "Sincere" and "Intelligent" are highly overrated.
- "Fun" is slightly underestimated, but "Interest" has the biggest gap: males greatly underestimate the importance of shared interest.
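
The comparison read from the charts can also be expressed as a per-attribute gap between the two 100-point profiles. The sketch below uses made-up allocations (not the values plotted above) to show the computation.

```python
# Hypothetical point allocations, for illustration only (each sums to 100)
stated = {"Attractive": 25, "Sincere": 17, "Intelligent": 20,
          "Fun": 18, "Ambitious": 8, "Interest": 12}
actual = {"Attractive": 26, "Sincere": 11, "Intelligent": 12,
          "Fun": 22, "Ambitious": 9, "Interest": 20}

# Positive gap: the attribute weighs more in decisions than participants stated
gaps = {name: actual[name] - stated[name] for name in stated}

# Attributes ordered from most underestimated to most overestimated
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
```

Because both profiles sum to 100, the gaps always sum to zero: every underestimated attribute is balanced by overestimated ones.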

##### Conclusion

It seems that apart from "Attractive" and "Fun", males are quite bad at figuring out what they really want. However, I believe that 4 minutes is not enough to assess features such as "Sincere" and "Intelligent". The specific setting of speed dating may play a part in this result.

In [23]:

# Create a DataFrame without NaN values from attributes
df_female_attributes_correlation = df_women[["dec", "attr", "sinc", "intel", "fun", "amb", "shar"]].dropna()

# Create a variable to save correlation score
corr = df_female_attributes_correlation.corr()

# Create a ranking of correlation scores
df_female_attributes_correlation = corr["dec"].abs().sort_values(ascending=False).iloc[1:9]

# Weight this podium to a total of 100 points to match the rating system of what people are looking for
df_female_decision_polar = (df_female_attributes_correlation/sum(df_female_attributes_correlation)*100)

# Rename attributes to clarify it
df_female_decision_polar.rename(index=dic_rename, inplace=True)

# Reorder the podium so both data in polar chart have a matching order
df_female_decision_polar = df_female_decision_polar.reindex(index = df_women_polar_chart.index)

# Female bar chart DataFrame creation
df_female_decision_bar = pd.DataFrame(df_female_decision_polar)
df_female_decision_bar["case"] = "Actual decision"

df_women_are_looking_for = pd.DataFrame(df_women_are_looking_for)
df_women_are_looking_for = df_women_are_looking_for.drop('gender', axis=1)
df_women_are_looking_for = df_women_are_looking_for.rename(columns={0:"dec"})
df_women_are_looking_for["case"] = "Stated interest"

# Merging what female are looking for and actual decision into a new DataFrame
df_female_merge_decision_bar = pd.concat([df_women_are_looking_for, df_female_decision_bar], axis=0)
df_female_merge_decision_bar.rename(index=dic_rename_1_1, inplace=True)


# Polar chart
fig = go.Figure()

fig.add_trace(go.Scatterpolar(
      r=df_women_polar_chart,
      theta=lst_carac,
      name='Stated interest',
      line_color="deeppink"
      ))

fig.add_trace(go.Scatterpolar(
      r=df_female_decision_polar,
      theta=lst_carac,
      name='Actual decision',
      line_color="white"
      ))

fig.update_layout(polar=dict(
                      radialaxis=dict(visible=True,
                                      range=[0, 30],
                                      showticklabels=False,
                                      showline=False)),
                  width=1500,
                  height=500,
                  title={'x': 0.5},
                  title_text="Female comparison of stated interest and actual decision",
                  template="plotly_dark"
                  )

fig.update_polars(angularaxis_direction="clockwise")

fig.update_traces(fill='toself')

fig.show()

fig2 = px.bar(df_female_merge_decision_bar, 
            color="case", 
            barmode='group', 
            template='plotly_dark',
            width=1500, 
            height=500, 
            text="value",
            labels = {"index" : ""},
            color_discrete_map={'Stated interest': 'deeppink','Actual decision': 'white'}
            )

fig2.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig2.update_layout(yaxis={'visible': False}, legend_title="")

fig2.show()

- Females moderately underestimate the importance of "Attractiveness" and "Fun".
- Females greatly underestimate the importance of shared "Interest".
- Like males, females greatly overestimate the importance of "Sincerity" and "Intelligence".

##### Conclusion

Females are apparently bad at figuring out what they really want too. These observations are very similar to those made for males.

### Finally, let's compare decision scores between males and females

In [24]:
# Merging male and female actual decision
df_female_decision_bar["case"] = "Female actual decision"
df_male_decision_bar["case"] = "Male actual decision"
df_male_female_decision = pd.concat([df_male_decision_bar, df_female_decision_bar], axis=0)
df_male_female_decision.rename(index=dic_rename_1_1, inplace=True)



# Polar chart
fig = go.Figure()

fig.add_trace(go.Scatterpolar(
      r=df_male_decision_polar,
      theta=lst_carac,
      name='Male actual decision',
      line_color="royalblue"
      ))

fig.add_trace(go.Scatterpolar(
      r=df_female_decision_polar,
      theta=lst_carac,
      name='Female actual decision',
      line_color="deeppink"
      ))

fig.update_layout(polar=dict(
                      radialaxis=dict(visible=True,
                                      range=[0, 30],
                                      showticklabels=False,
                                      showline=False)),
                  width=1500,
                  height=500,
                  title={'x': 0.5},
                  title_text="Comparison of decision between each gender",
                  template="plotly_dark"
                  )

fig.update_polars(angularaxis_direction="clockwise")

fig.update_traces(fill='toself')

fig.show()

fig2 = px.bar(df_male_female_decision, 
            color="case", 
            barmode='group', 
            template='plotly_dark',
            width=1500, 
            height=500, 
            text="value",
            labels = {"index" : ""},
            color_discrete_map={'Male actual decision': 'royalblue','Female actual decision': 'deeppink'}
            )

fig2.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig2.update_layout(yaxis={'visible': False}, legend_title="")

fig2.show()

- The biggest gap between male and female decisions is on attractiveness. It was already the biggest gap when both genders were asked what they were looking for, so there might be some truth behind it.

# Conclusion

I had some hope that I was going to crack the code of love, but it seems clear to me that speed dating is a specific setting with its own codes, and its lessons may not apply to romantic relationships in general. Four minutes is not enough to know someone well enough to judge things such as Sincerity or Intelligence. However, it is enough for instinct to tell people whether they like this person or not.

Based on speed dating data only, people are overall quite bad at telling what they are looking for in the opposite sex.

During a speed dating event, Attractiveness and Fun are the most important features that lead people to like someone and decide whether they want to meet them again.

