# Predicting Speed Dating Success Using Various Personality Traits

## Introduction



Speed dating is a popular dating practice that involves meeting multiple potential partners in a short period of time. This practice has gained popularity in recent years due to its effectiveness in helping people find compatible partners.

In this proposal, we will attempt to answer the following question: What individual characteristics are most important in predicting the success of speed dating? By analyzing data from a speed dating dataset, we will attempt to identify which factors are most strongly associated with successful matches.

The dataset we use contains information from speed dating events held in different cities in the United States. It includes information on participants' demographic characteristics, such as age, education level, and ethnicity, as well as their responses to survey questions related to their personalities, interests, and dating preferences, and whether or not each individual was matched with their potential partner.



## Methods and Results

#### Importing Libraries

We begin by installing and importing the required libraries for this analysis.

In [2]:
# Install packages
# install.packages('tidyverse')
# install.packages('tidymodels')
# install.packages('repr')
# install.packages('gridExtra')
# install.packages('grid')

In [3]:
# Importing libraries
library(tidyverse)
library(tidymodels)
library(repr)
library(gridExtra)
library(grid)

── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.3.1 ──

[32m✔[39m [34mggplot2[39m 3.3.6     [32m✔[39m [34mpurrr  [39m 0.3.4
[32m✔[39m [34mtibble [39m 3.1.7     [32m✔[39m [34mdplyr  [39m 1.0.9
[32m✔[39m [34mtidyr  [39m 1.2.0     [32m✔[39m [34mstringr[39m 1.4.0
[32m✔[39m [34mreadr  [39m 2.1.2     [32m✔[39m [34mforcats[39m 0.5.1

── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()

── [1mAttaching packages[22m ────────────────────────────────────── tidymodels 1.0.0 ──

[32m✔[39m [34mbroom       [39m 1.0.0     [32m✔[39m [34mrsample     [39m 1.0.0
[32m✔[39m [34mdials       [39m 1.0.0     [32m✔[39m [34mtune        [39m 1.0.0
[32m✔[39m [34minfer       [39m 1.0.2     [32m✔[39m [34mworkflows   [39m 1.0.0
[32m✔

#### Reading the Data

Using the **read_csv()** function, we read the data from the source and store it in a variable. We perform a short analysis on the data, getting information about the number of columns, rows, as well as a preview of what each observation looks like. 

In [4]:
# Reading the data
speed_dating_data <- read_csv("http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/Speed%20Dating%20Data.csv")

# Number of Rows and Columns
num_rows <- nrow(speed_dating_data)
num_cols <- ncol(speed_dating_data)

speed_dating_summary <- tibble(Rows = num_rows, 
                  Columns = num_cols)

# Previewing the data
speed_dating_preview <- head(speed_dating_data)

[1mRows: [22m[34m8378[39m [1mColumns: [22m[34m195[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m   (4): field, undergra, from, career
[32mdbl[39m (187): iid, id, gender, idg, condtn, wave, round, position, positin1, or...

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [5]:
speed_dating_summary
speed_dating_preview

Rows,Columns
<int>,<int>
8378,195


iid,id,gender,idg,condtn,wave,round,position,positin1,order,⋯,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,0,1,1,1,10,7,,4,⋯,5,7,7,7,7,,,,,
1,1,0,1,1,1,10,7,,3,⋯,5,7,7,7,7,,,,,
1,1,0,1,1,1,10,7,,10,⋯,5,7,7,7,7,,,,,
1,1,0,1,1,1,10,7,,5,⋯,5,7,7,7,7,,,,,
1,1,0,1,1,1,10,7,,7,⋯,5,7,7,7,7,,,,,
1,1,0,1,1,1,10,7,,6,⋯,5,7,7,7,7,,,,,


#### Summary Discussion

[link]: http://www.stat.columbia.edu/~gelman/arm/examples/speed.dating/Speed%20Dating%20Data%20Key.doc

From the speed dating dataset's [legend][link], we know there are many variables that are not useful in answering our research question, such as the individual's event ID, or the zipcode of where the individual was raised. There are many variables which do not use numerical values as well, such as the individual's career or field of study. Using 195 predictors would also be very computationally expensive, so we decided to choose 5 predictors: attractiveness, sincerity, intelligence, funniness, and ambition. We chose these variables because they can be numerically rated, and are commonly discussed traits regarding relationships, and therefore would be variables to use to predict whether an individual would match with their partner or not.

#### Tidying the Data

Our next step was to tidy the data into a state where we could easily use it to create our predictive model. We began by selecting the desired columns, and filtering out improper observations that could create future difficulties (N/A values, values that went by a different scale, etc.).

In [6]:
# Selecting desired rows
speed_dating_select <- speed_dating_data |>
    filter(wave != 6:9) |> # These waves had a different rating system for traits
    mutate(match = as_factor(match),
           gender = as_factor(gender)) |>
    select(match, gender, attr_o, sinc_o, intel_o, fun_o, amb_o)

# Filtering the data of improper observations
speed_dating_tidy <- speed_dating_select |>
    filter(attr_o %% 1 == 0,
          sinc_o %% 1 == 0,
          intel_o %% 1 == 0,
          fun_o %% 1 == 0,
          amb_o %% 1 == 0, 
          fun_o <= 10)

# Renaming categorical values
levels(speed_dating_tidy$gender)[2] <- "male"
levels(speed_dating_tidy$gender)[1] <- "female"
levels(speed_dating_tidy$match)[2] <- "yes"
levels(speed_dating_tidy$match)[1] <- "no"

# Renaming columns
colnames(speed_dating_tidy) <- c("Match", "Gender", "Attractiveness", "Sincerity", "Intelligence", "Funniness", "Ambition")

# Preview of tidied dataset
head(speed_dating_tidy)

“longer object length is not a multiple of shorter object length”


Match,Gender,Attractiveness,Sincerity,Intelligence,Funniness,Ambition
<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
no,female,6,8,8,8,8
no,female,7,8,10,7,7
yes,female,10,10,10,10,10
yes,female,7,8,9,8,9
yes,female,8,7,9,6,9
no,female,7,7,8,8,7


#### Column Legend
- **Match**: Whether or not the individual matched with their partner
- **Gender**: Gender of the individual
- **Attractiveness - Funniness**: How the individual was rated by their partner in each trait

#### Variable Importance

Before we create our model, we will perform another analysis of the data, this time seeing which traits have the highest correlation with matching with their partner. We will do this by grouping the data by whether the observation matched or not, finding the means of each groups traits, then finding the difference between the two groups' means. A larger difference will imply a stronger correlation between the trait and whether or not the individual matches, while a smaller difference implies a weaker correlation. 

In [14]:
# Taking means of each trait
speed_dating_means <- speed_dating_tidy |>
    group_by(Gender, Match) |>
    summarize(Mean_Attractiveness = mean(Attractiveness),
              Mean_Sincerity = mean(Sincerity),
              Mean_Intelligence = mean(Intelligence),
              Mean_Funniness = mean(Funniness),
              Mean_Ambition = mean(Ambition))

# Means of observations that matched
speed_dating_yes <- speed_dating_means |>
    slice(1) |>
    ungroup() |>
    select(-Match, -Gender) 

# Means of observations that did not match
speed_dating_no <- speed_dating_means |>
    slice(2) |>
    ungroup() |>
    select(-Match, -Gender)

# Taking the difference of "yes" and "no" means
speed_dating_diffs <- abs(speed_dating_yes - speed_dating_no)

# Renaming columns
colnames(speed_dating_diffs) <- c("Attract_Diff", "Sincere_Diff", "Intel_Diff", "Funny_Diff", "Ambition_Diff")

# Rebinding to gender column
gender_cols <- tibble(Gender = c("Female", "Male"))
gender_diffs <- bind_cols(gender_cols, speed_dating_diffs)

gender_diffs

[1m[22m`summarise()` has grouped output by 'Gender'. You can override using the
`.groups` argument.


Gender,Attract_Diff,Sincere_Diff,Intel_Diff,Funny_Diff,Ambition_Diff
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Female,1.341803,0.7043362,0.6866242,1.282081,0.7110397
Male,1.436137,0.8582535,0.7644241,1.650429,0.658394


#### Variable Discussion

From the table above, we observe that attractiveness and funniness have a significantly stronger correlation with getting a match than the other traits. Therefore, we will create a model which only uses attractiveness and funniness as predictors, as they will most likely result in a more accurate predictive model. 

#### Creating the Classification Model

TODO

## Discussion

## Summarize what you found

This project allows us to see which variables that correspond to the individuals traits that have a strong relation to how a couple get matched in speed dating.

## Discuss whether this is what you expected to find?

Yes, there are some things that we expected to find. We were expecting variables that are prominent in first impression of a person during speed dating tend to have a stronger relation with how likely the couple get matched than variables that require a deeper understanding of someone.

## Discuss what impact could such findings have?

The project has a potential to help researchers that focus on human relationship to understand which human characters or traits that determine the success rate couple of getting matched in speed dating. This project could be a helpful tool to identify human nature in speed dating. Moreover, this project also gives contribution in the data science field especially K-NN Regression.

## Discuss what future questions could this lead to?




## References