# Beauty and the Beholder: Exploring the Impact of Attractiveness Perception on Second Date Success

This project is a Python analysis of this speed dating dataset: https://data.world/annavmontoya/speed-dating-experiment. An explanation of all the column meanings and the background of the experiment can be found in the speed dating data key document.

- This data comes from a speed dating experiment in which many different people participated in dates.
- The study concerned five main attributes: **attractiveness, sincerity, intelligence, fun and ambition**.
- Each row in the table concerns a particular date. The column **iid** identifies a particular individual. The column **pid** is the iid of the other individual on the date (i.e. iid's partner).
- Each participant would rate themselves on these five attributes (scale 1 - 10, represented by columns **attr3_1**, **sinc3_1**, **intel3_1**, **fun3_1** and **amb3_1**) before dating begins. (For a particular row/date these are the self-ratings as stated by **iid**).
- Each participant allocates 100 points across these five attributes in what they look for in a date i.e. which attributes are most important. These are represented by **pf_o_att**, **pf_o_sin**, **pf_o_int**, **pf_o_fun** and **pf_o_amb**. (For a particular row/date these are the preferences as stated by individual **pid**, **not iid**).
- After each date, each participant would rate their partner on a scale of 1 - 10 for these five attributes (represented by columns **attr_o**, **sinc_o**, **intel_o**, **fun_o** and **amb_o**).
- Each row contains the **gender** of the person (**iid**) where 1 is male and 0 is female.
- Each row has many values, but the most relevant for this study are the five self-ratings of an individual (**iid**), the ratings that their date (**pid**) gave them across these five attributes on that date, and the initial preference allocation.
- We have a **dec_o** column representing the decision of the partner (**pid**). If the value is 1, they agreed to a second date. If the value is 0, they did not.

This study explores:
- The average initial allocation of preference by gender (pf_o_att, pf_o_sin etc.).
- The actual most important preference/feature through a random forest analysis of ratings received (attr_o, sinc_o etc.) and second date success (dec_o).
- The influence of average attractiveness rating received (attr_o) on second date success (dec_o).
- How each level (2, 3, 4, ..., 10) of self-rating of attractiveness affects second date success i.e. does an individual's confidence in their attractiveness alone increase second date potential.
- The impact of self-perceived attractiveness (comparing the difference between an individual's self-rating of attractiveness with the average attractiveness rating they received on their dates, and how this drives second date potential).

In [170]:
import pandas as pd
import sqlite3
import time
from sklearn.impute import KNNImputer
import os

# .CSV to .DB

The below code imports the relevant columns from the .csv file to the .db file.

The relevant columns are:
- **iid**: the id of an individual
- **gender**: the gender of the iid
- **pid**: the id of an individual's date
- **match**: whether both individuals agreed on a second date (1 = yes, 0 = no)
- **dec_o**: the decision of the partner (pid) whether or not to go on a second date with the individual (iid) (1 = yes, 0 = no)
- **\_o** attributes: the ratings an individual received from their partner on the five attributes
- **3_1** attributes: the ratings an individual gave themselves
- **pf_o** attributes: the allocation of 100 points across the five attributes according to what they look for in an ideal date partner (given by pid)
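
To make the column renaming concrete, here is a minimal sketch on a made-up two-row frame. The values are invented for illustration, and an in-memory SQLite database stands in for the real .db file. It only shows that, after the rename, the pf_o columns follow the same attr/sinc/intel naming as the other two sets.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the CSV; the values are invented for illustration
toy = pd.DataFrame({
    'iid': [1, 2],
    'pid': [2, 1],
    'pf_o_att': [30.0, 20.0],
    'pf_o_sin': [20.0, 20.0],
    'pf_o_int': [20.0, 30.0],
})

# Same mapping as used in create_db
column_mapping = {
    'pf_o_att': 'pf_o_attr',
    'pf_o_sin': 'pf_o_sinc',
    'pf_o_int': 'pf_o_intel',
}
toy = toy.rename(columns=column_mapping)

# In-memory database so nothing is written to disk
conn = sqlite3.connect(':memory:')
toy.to_sql('speed_dating', conn, index=False, if_exists='replace')

# PRAGMA table_info returns one row per column; position 1 is the column name
stored_columns = [row[1] for row in
                  conn.execute('PRAGMA table_info(speed_dating)')]
conn.close()

print(stored_columns)
```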

In [171]:
# Converts the .csv to .db
# pf_o columns will be mapped so their names are consistent with 3_1 and _o

# Parameters:
#   - csv_file (str): the name of the .csv
#   - encoding_type (str): the encoding of the csv
#   - database_path (str): the path of the .db
#   - csv_columns (list): the columns to include from the original .csv

def create_db(csv_file : str, encoding_type : str,
              database_path : str, csv_columns : list):
    
    # Connect to database
    conn = sqlite3.connect(database_path)
    # Create cursor object
    cursor = conn.cursor()

    # Read the CSV as a dataframe
    df = pd.read_csv(csv_file, encoding=encoding_type)

    # Filter the dataframe to only include the selected columns
    # Use .copy() to create a copy of the DataFrame
    df_selected = df[csv_columns].copy()
    
    # Mapping between original and desired column names
    column_mapping = {
        'pf_o_att': 'pf_o_attr',
        'pf_o_sin': 'pf_o_sinc',
        'pf_o_int': 'pf_o_intel'
    }
    
    # Rename columns using the mapping 
    df_selected.rename(columns=column_mapping, inplace=True)
    
    # Use Pandas to_sql to insert the DataFrame into the SQLite table
    df_selected.to_sql('speed_dating', conn, 
                       index=False, if_exists='replace')

    # Create indexes
    cursor.execute('CREATE INDEX idx_iid ON speed_dating(iid);')
    cursor.execute('CREATE INDEX idx_pid ON speed_dating(pid);')

    # Commit changes to the database
    conn.commit()

    # Close the database
    conn.close()

In [172]:
# Running database creation with timing
start_time = time.time()

csv_file = 'speed_dating.csv'
encoding_type = 'latin'
database_path = 'speed_dating.db'
csv_columns = ['iid', 'gender', 'pid', 'match', 'dec_o',
               'attr_o','sinc_o', 'intel_o', 'fun_o', 'amb_o',
               'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1',
               'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb']

create_db(csv_file, encoding_type, database_path, csv_columns)

end_time = round(time.time() - start_time, 4)

print(f'.csv to .db conversion complete. It took {end_time} seconds.')

.csv to .db conversion complete. It took 0.2418 seconds.


# Data Pre-Processing

This involves several steps.
1) Deleting rows with several NULL values

2) Deleting missing (iid, pid), (pid, iid) pairs

3) Performing data imputation on missing values


## Deleting Rows with Several NULL Values

For this analysis we will be comparing the importance of each of the aforementioned sets of attributes for second date success. This means we need all attributes per individual, per date. Data imputation can fill in NULL values if only one or two attributes are missing per set (the \_o set, the 3_1 set or the pf_o set). However, if several values are missing per set, data imputation can be unreliable.

For this dataset, I treat rows with three or more missing values in a set as too unreliable to impute, so that the analysis focuses on higher-quality data. The distribution of missing values also suggests this cut-off: many rows are missing one attribute, some are missing two and the rest are missing all five.

INCLUDE TESTS HERE.

The NULL threshold is passed as an argument (null_no) when you call the function, so it can be changed if you would like to include more data or if more attributes are added to the analysis.
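
The threshold rule can be illustrated with a small sketch on an in-memory table. The table name `toy`, the `row_id` column and the ratings are all invented for illustration. It relies on the fact that in SQLite a comparison such as `(attr_o IS NULL)` evaluates to 1 or 0, so adding the comparisons together counts the NULL values in a row.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE toy (row_id INTEGER, attr_o REAL, sinc_o REAL,
                      intel_o REAL, fun_o REAL, amb_o REAL)
''')

# Invented rows with 0, 1, 2, 3 and 5 missing ratings
rows = [
    (1, 7, 8, 6, 7, 5),
    (2, 7, None, 6, 7, 5),
    (3, 7, None, None, 7, 5),
    (4, None, None, None, 7, 5),
    (5, None, None, None, None, None),
]
conn.executemany('INSERT INTO toy VALUES (?, ?, ?, ?, ?, ?)', rows)

columns_o = ['attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o']
null_no = 3

# Each (column IS NULL) is 1 or 0, so the sum is the NULL count per row
null_count = ' + '.join(f'({column} IS NULL)' for column in columns_o)

counts = conn.execute(
    f'SELECT row_id, {null_count} FROM toy ORDER BY row_id').fetchall()
print(counts)  # [(1, 0), (2, 1), (3, 2), (4, 3), (5, 5)]

# Delete rows at or above the threshold, as delete_null_values does per set
conn.execute(f'DELETE FROM toy WHERE ({null_count}) >= {null_no}')
kept = [row[0] for row in
        conn.execute('SELECT row_id FROM toy ORDER BY row_id')]
print(kept)  # [1, 2, 3]

conn.close()
```

With the threshold at 3, the rows with one or two missing ratings survive and are left for imputation.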


## Deleting Missing Pairs

For each date we have an iid and a pid, and each date appears as two rows. One is from the perspective of iid and shows how pid rated iid on the five attributes. The other is the mirrored row (its iid is the first row's pid and vice versa) and shows how iid rated pid on the five attributes. To ensure information is relevant, accurate and complete we must have both perspectives on a particular date, so any row without its mirrored row is deleted.
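
As a minimal sketch of the pair check, the following uses an in-memory table with three invented rows. The date between participants 1 and 2 has both perspectives, while the row (1, 3) has no mirrored (3, 1) row and should be the only one removed.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE speed_dating (iid INTEGER, pid INTEGER)')

# (1, 2) and (2, 1) mirror each other; (1, 3) has no (3, 1) counterpart
conn.executemany('INSERT INTO speed_dating VALUES (?, ?)',
                 [(1, 2), (2, 1), (1, 3)])

# Delete every row for which no mirrored row exists
delete_query = '''
    DELETE FROM speed_dating
    WHERE NOT EXISTS (
        SELECT 1
        FROM speed_dating AS s2
        WHERE speed_dating.iid = s2.pid AND speed_dating.pid = s2.iid)
'''
cursor = conn.execute(delete_query)
deleted = cursor.rowcount

remaining = conn.execute(
    'SELECT iid, pid FROM speed_dating ORDER BY iid, pid').fetchall()
conn.close()

print(deleted)    # 1
print(remaining)  # [(1, 2), (2, 1)]
```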


## Data Imputation

We still have rows with one or two NULL values in a particular attribute set. For these we will use KNN imputation. Averaging across a row's other attributes would not be appropriate, because someone rated highly in attractiveness would not necessarily be rated highly in intelligence.

INCLUDE CROSS-VALIDATION HERE
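
One possible way to compare values of `n_neighbors` is sketched below. This is an assumption about how the check could be done, not the method used in this project, and it is a simple hold-out masking check rather than full k-fold cross-validation. Known values are hidden, imputed with `KNNImputer`, and scored against the truth. The data is random and synthetic, so the scores only demonstrate the mechanics. The helper `imputation_rmse` and the candidate values are hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic complete ratings on a 1-10 scale (invented, not the real data)
complete = rng.integers(1, 11, size=(200, 5)).astype(float)

# Hide roughly 10% of the entries so imputed values can be scored
mask = rng.random(complete.shape) < 0.10
masked = complete.copy()
masked[mask] = np.nan


def imputation_rmse(n_neighbors):
    """RMSE between imputed and true values on the hidden entries."""
    imputer = KNNImputer(n_neighbors=n_neighbors)
    imputed = imputer.fit_transform(masked)
    errors = imputed[mask] - complete[mask]
    return float(np.sqrt(np.mean(errors ** 2)))


# Candidate neighbour counts are arbitrary choices for illustration
scores = {k: imputation_rmse(k) for k in (1, 3, 5, 10)}
best_k = min(scores, key=scores.get)

print(scores)
print(f'Lowest RMSE at n_neighbors = {best_k}')
```

On the real table, the same idea would apply to rows whose attribute set is complete: mask some of their values, impute, and compare.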

In [173]:
# Deletes rows with a specified number of NULL values based on a threshold.

# Parameters:
#   - database_path (str): The path to the SQLite database.
#   - columns_o (list): List of attributes ending in _o (ratings received).
#   - columns_3_1 (list): List of attributes ending in 3_1 (self ratings given).
#   - columns_pf_o (list): List of attributes in the form
#                          pf_o_{attribute} (attribute preferences).
#   - null_no (int): The threshold number of NULL values for row deletion.

def delete_null_values(database_path : str, columns_o : list,
                       columns_3_1 : list, columns_pf_o : list, null_no : int):
    # Connect to database
    conn = sqlite3.connect(database_path)

    # Create a cursor object
    cursor = conn.cursor()
    
    # Construct the conditions for NULL values in o_columns
    conditions_o = [f"({column} IS NULL)" for column in columns_o]

    # Construct the conditions for NULL values in 3_1 columns
    conditions_3_1 = [f"({column} IS NULL)" for column in columns_3_1]
    
    # Construct the conditions for NULL values in pf_o columns
    conditions_pf_o = [f"({column} IS NULL)" for column in columns_pf_o]
        
    # Sum the conditions using +
    # In SQLite each (column IS NULL) evaluates to 1 if the value is NULL,
    # else 0, so the sum is the number of NULL values in a row for that
    # attribute set. This is compared against the threshold null_no
    condition_query_o = f"({' + '.join(conditions_o)}) >= {null_no}"
    condition_query_3_1 = f"({' + '.join(conditions_3_1)}) >= {null_no}"
    condition_query_pf_o = f"({' + '.join(conditions_pf_o)}) >= {null_no}"

    # Construct the delete query
    delete_query = (
        f"DELETE FROM speed_dating "
        f"WHERE ({condition_query_o}) OR "
        f"({condition_query_3_1}) OR "
        f"({condition_query_pf_o})"
    )
    
    # Execute the delete query
    cursor.execute(delete_query)

    # Get the count of deleted rows
    deleted_rows = cursor.rowcount

    # Commit the changes
    conn.commit()

    # Close the connection
    conn.close()

    # Print the number of rows deleted
    print(f"Deleted {deleted_rows} rows with three or more NULL values.")

In [174]:
# Deletes missing (iid, pid) pairs

# Parameters:
#   - database_path (str): the path to your .db

def delete_missing_pairs(database_path : str):
    # Connect to the database
    conn = sqlite3.connect(database_path)
    # Create a cursor object
    cursor = conn.cursor()
    
    # Construct the delete query for rows where there is no
    # corresponding (pid, iid) row for an (iid, pid) row
    delete_query = '''
    DELETE FROM speed_dating
    WHERE NOT EXISTS (
        SELECT 1
        FROM speed_dating AS s2
        WHERE speed_dating.iid = s2.pid AND speed_dating.pid = s2.iid)
    '''
    
    # Execute the delete query
    cursor.execute(delete_query)
    
    # Get the count of deleted rows with missing pairs
    deleted_missing_pairs = cursor.rowcount
    
    # Commit the changes
    conn.commit()
    
    # Close the connection
    conn.close()
    
    # Print the number of rows with missing pairs deleted
    print(f"Deleted {deleted_missing_pairs} rows with missing pairs.")

In [175]:
# CROSS-VALIDATE
# TAKE NUMBER OF NEIGHBOURS AS INPUT

# Performs KNN imputation on missing values

# Parameters:
#   - database_path (str): the path to your database
#   - columns_to_impute (list): the set of columns to perform data imputation on

def data_imputation(database_path : str, columns_to_impute : list):
    # Connect to database
    conn = sqlite3.connect(database_path)
    
    # Import data into dataframe
    df = pd.read_sql_query('SELECT * FROM speed_dating', conn)
    
    # Close connection
    conn.close()
    
    # Perform KNN imputation
    knn_imputer = KNNImputer()
    df_imputed = pd.DataFrame(knn_imputer.fit_transform(df[columns_to_impute]), 
                             columns = columns_to_impute)
    
    # Update the original DataFrame with the imputed values
    df[columns_to_impute] = df_imputed
    
    # Save the updated data back to the database
    conn = sqlite3.connect(database_path)
    df.to_sql('speed_dating', conn,
              if_exists = 'replace', index = False)
    print('Remaining NULL values have had data imputation performed on them.')
    conn.close()

In [176]:
# Running all the data preprocessing functions with timing.
start_time = time.time()

# Define the base names
# Add more if needed
columns_base = ['attr', 'sinc', 'intel', 'fun', 'amb']

# Define the _o columns
columns_o = [attribute + '_o' for attribute in columns_base]

# Define the 3_1 columns
columns_3_1 = [attribute + '3_1' for attribute in columns_base]
    
# Define the pf columns
columns_pf_o = [f'pf_o_{attribute}' for attribute in columns_base]

# Define the threshold for null value row deletion
null_no = 3

delete_null_values(database_path, columns_o, columns_3_1, columns_pf_o, null_no)
delete_missing_pairs(database_path)

# Currently only _o has missing values. Repeat call for further column sets
# if necessary
data_imputation(database_path, columns_o)

end_time = round(time.time() - start_time, 4)

print(f'Data pre-processing complete. It took {end_time} seconds.')

Deleted 421 rows with three or more NULL values.
Deleted 131 rows with missing pairs.
Remaining NULL values have had data imputation performed on them.
Data pre-processing complete. It took 0.4412 seconds.


## Table Creation

The below functions create the various tables needed for analysis. These also define foreign keys and primary keys (i.e. the relations between tables).

For a full description of these tables, their schema, what the columns represent, justification of datatypes and how they link together please see the database schema .pdf.
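
One point worth knowing when working with these keys: by default SQLite parses FOREIGN KEY clauses but does not enforce them unless `PRAGMA foreign_keys = ON` is issued on each connection. The sketch below uses cut-down, hypothetical versions of the participants and dates tables in an in-memory database to show the effect. It is an illustration only and is not part of the pipeline in this notebook.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# Enforcement is off by default and must be enabled per connection,
# before any transaction is opened
conn.execute('PRAGMA foreign_keys = ON')

# Cut-down versions of the tables, for illustration only
conn.execute('''
    CREATE TABLE participants (
        iid INTEGER PRIMARY KEY,
        gender INTEGER
    )
''')
conn.execute('''
    CREATE TABLE dates (
        date_id INTEGER PRIMARY KEY AUTOINCREMENT,
        iid INTEGER,
        pid INTEGER,
        FOREIGN KEY (iid) REFERENCES participants (iid),
        FOREIGN KEY (pid) REFERENCES participants (iid)
    )
''')

conn.executemany('INSERT INTO participants VALUES (?, ?)', [(1, 0), (2, 1)])

# Both participants exist, so this row is accepted
conn.execute('INSERT INTO dates (iid, pid) VALUES (1, 2)')

# Participant 99 does not exist, so with enforcement on this should fail
try:
    conn.execute('INSERT INTO dates (iid, pid) VALUES (1, 99)')
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

date_count = conn.execute('SELECT COUNT(*) FROM dates').fetchone()[0]
conn.close()

print(orphan_rejected, date_count)
```

Without the PRAGMA, the second insert would be accepted silently, so the keys would serve as documentation of the schema rather than as a constraint.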


In [177]:
# SEPARATE FUNCTION PER TABLE
# MOVE DATABASE CONNECTION INSIDE FUNCTION
# MOVE ATTRIBUTE COLUMN NAMES OUTSIDE FUNCTION

# Creates the participants table

# Parameters:
#   - database_path (str): the path to your database

def create_participants_table(database_path):
    # Connect to the database
    conn = sqlite3.connect(database_path)
    
    # Create a cursor object
    cursor = conn.cursor()
    
    # Drop the participants table if it exists
    cursor.execute("DROP TABLE IF EXISTS participants")
    
    # Create the participants table
    participants_table_query = '''
       CREATE TABLE participants(
           iid SMALLINT UNSIGNED PRIMARY KEY,
           gender TINYINT
           )
       '''
    
    # Execute query
    cursor.execute(participants_table_query)
    
    # Create indexes on iid column in the participants table
    cursor.execute('CREATE INDEX idx_participants_iid ON participants (iid)')
    
    # Insert data into participants table
    participants_insert_query = '''
        INSERT INTO participants(iid, gender)
        SELECT DISTINCT iid, gender
        FROM speed_dating;
    '''
    
    # Execute query
    cursor.execute(participants_insert_query)
    
    # Commit changes
    conn.commit()
    
    # Close the connection
    conn.close()

In [178]:
# Creates the dates table

# Parameters:
#   - database_path (str): the path to your database

def create_dates_table(database_path):
    # Connect to the database
    conn = sqlite3.connect(database_path)
    
    # Create a cursor object
    cursor = conn.cursor()
    
    # Drop the dates table if it exists
    cursor.execute("DROP TABLE IF EXISTS dates")
    
    # Create the dates table
    dates_table_query = '''
        CREATE TABLE dates(
            date_id INTEGER PRIMARY KEY AUTOINCREMENT,
            iid SMALLINT UNSIGNED,
            pid SMALLINT UNSIGNED,
            match TINYINT,
            dec_o TINYINT,
            FOREIGN KEY (iid) REFERENCES participants (iid),
            FOREIGN KEY (pid) REFERENCES participants (iid)
        )
    '''
    
    # Execute query
    cursor.execute(dates_table_query)
    
    # Create indexes on date_id column in dates table
    cursor.execute('CREATE INDEX idx_dates_date_id ON dates (date_id)')
    
    # Insert data into dates table
    dates_insert_query = '''
        INSERT INTO dates (iid, pid, match, dec_o)
        SELECT iid, pid, match, dec_o
        FROM speed_dating
    '''
    
    # Execute the query
    cursor.execute(dates_insert_query)
    
    # Commit changes
    conn.commit()
    
    # Close the connection
    conn.close()

In [179]:
# Creates the attributes table

# Parameters:
#   - database_path (str): the path to your database
#   - columns_base (list): the base name of columns for attributes

def create_attributes_table(database_path, columns_base):
    # Connect to the database
    conn = sqlite3.connect(database_path)
    
    # Create a cursor object
    cursor = conn.cursor()
    
    # Drop the attributes table if it exists
    cursor.execute("DROP TABLE IF EXISTS attributes")
    
    # Create the attributes table
    attributes_table_query = '''
        CREATE TABLE attributes(
            attr_id TINYINT UNSIGNED PRIMARY KEY,
            attribute_name TEXT
        )
    '''
    
    # Execute query
    cursor.execute(attributes_table_query)
    
    # Create indexes on attr_id column in the attribute table
    cursor.execute('CREATE INDEX idx_attributes_attr_id ON attributes (attr_id)')
    
    # Insert data into attributes table
    attributes_insert_query = '''
        INSERT INTO attributes (attr_id, attribute_name)
        VALUES (?, ?)
    '''
    
    # Insert attribute names into the attributes table
    for index, column in enumerate(columns_base, start = 1):
        cursor.execute(attributes_insert_query, (index, column))
        
    # Commit changes
    conn.commit()
    
    # Close the connection
    conn.close()

In [180]:
# Creates the ratings table

# Parameters:
#   - database_path (str): the path to your database
#   - columns_o (list): List of attributes ending in _o (ratings received).

def create_ratings_table(database_path, columns_o):
    # Connect to the database
    conn = sqlite3.connect(database_path)
    
    # Create a cursor object
    cursor = conn.cursor()
    
    # Drop the ratings table if it exists
    cursor.execute("DROP TABLE IF EXISTS ratings")
    
    # Create the ratings table
    ratings_table_query = '''
        CREATE TABLE ratings(
            rating_id INTEGER PRIMARY KEY AUTOINCREMENT,
            date_id INTEGER UNSIGNED,
            attr_id INTEGER UNSIGNED,
            rating_value TINYINT UNSIGNED,
            FOREIGN KEY (date_id) REFERENCES dates (date_id),
            FOREIGN KEY (attr_id) REFERENCES attributes(attr_id)
            )
    '''
    
    # Execute the query
    cursor.execute(ratings_table_query)
    
    # Create index on date_id and attr_id columns in the ratings table
    cursor.execute('CREATE INDEX idx_ratings_date_id ON ratings (date_id)')
    cursor.execute('CREATE INDEX idx_ratings_attr_id ON ratings (attr_id)')
    
    # Insert data into the ratings table for each column 
    for column_name in columns_o:
        
        # Insert data into the ratings table
        ratings_insert_query = '''
            INSERT INTO ratings (date_id, attr_id, rating_value)
            SELECT d.date_id, a.attr_id, s.{}
            FROM dates as d
            JOIN speed_dating AS s ON d.iid = s.iid AND d.pid = s.pid
            JOIN attributes AS a ON a.attribute_name || '_o' = lower(?)
        '''.format(column_name)
        
        # Execute query
        cursor.execute(ratings_insert_query, (column_name.lower(),))
        
    # Commit changes
    conn.commit()
    
    # Close the connection
    conn.close()

In [181]:
# Creates the self_ratings table

# Parameters:
#   - database_path: the path to your database
#   - columns_3_1 (list): List of attributes ending in 3_1 (self ratings given).

def create_self_ratings_table(database_path, columns_3_1):
    # Connect to the database
    conn = sqlite3.connect(database_path)
    
    # Create a cursor object
    cursor = conn.cursor()
    
    # Drop the self_ratings table if it exists
    cursor.execute("DROP TABLE IF EXISTS self_ratings")
    
    # Create the self_ratings table
    self_ratings_table_query = '''
        CREATE TABLE self_ratings(
            self_rating_id INTEGER PRIMARY KEY AUTOINCREMENT,
            iid SMALLINT UNSIGNED,
            attr_id TINYINT UNSIGNED,
            self_rating_value TINYINT UNSIGNED,
            FOREIGN KEY (iid) REFERENCES participants (iid),
            FOREIGN KEY (attr_id) REFERENCES attributes(attr_id))
    '''
    
    # Execute query
    cursor.execute(self_ratings_table_query)
    
    # Create indexes on self_ratings iid and attr_id columns
    cursor.execute('CREATE INDEX idx_self_ratings_iid ON self_ratings (iid)')
    cursor.execute('CREATE INDEX idx_self_ratings_attr_id ON self_ratings(attr_id)')

    # Insert data into the self_ratings table for each column
    for column_name in columns_3_1:
        
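        # Insert data into the self_ratings table
        # DISTINCT keeps one row per participant and attribute rather than
        # one per date, since self-ratings repeat on every row of an iid
        self_ratings_insert_query = '''
            INSERT INTO self_ratings (iid, attr_id, self_rating_value)
            SELECT DISTINCT p.iid, a.attr_id, s.{}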
            FROM participants AS p
            JOIN speed_dating AS s ON p.iid = s.iid
            JOIN attributes AS a ON a.attribute_name || '3_1' = lower(?)
        '''.format(column_name)
        
        # Execute query
        cursor.execute(self_ratings_insert_query, (column_name.lower(),))
        
    # Commit changes
    conn.commit()
    
    # Close the connection
    conn.close()

In [182]:
# Creates the preferences table

# Parameters:
#   - database_path: the path to your database
#   - columns_pf_o (list): List of attributes in the form
#                          pf_o_{attribute} (attribute preferences).

def create_preferences_table(database_path, columns_pf_o):
    # Connect to the database
    conn = sqlite3.connect(database_path)
    
    # Create a cursor object
    cursor = conn.cursor()
    
    # Drop the preferences table if it exists
    cursor.execute("DROP TABLE IF EXISTS preferences")
    
    # Create the preferences table
    preferences_table_query = '''
        CREATE TABLE preferences(
            preferences_id INTEGER PRIMARY KEY AUTOINCREMENT,
            iid SMALLINT UNSIGNED,
            attr_id TINYINT UNSIGNED,
            pref_value TINYINT UNSIGNED,
            FOREIGN KEY (iid) REFERENCES participants (iid),
            FOREIGN KEY (attr_id) REFERENCES attributes(attr_id))
    '''
    
    # Execute query
    cursor.execute(preferences_table_query)
    
    # Create indexes on preferences iid and attr_id columns
    cursor.execute('''CREATE INDEX IF NOT EXISTS 
                   idx_preferences_iid ON preferences (iid)''')
    cursor.execute('''CREATE INDEX IF NOT EXISTS 
                   idx_preferences_attr_id ON preferences (attr_id)''')

    # Insert data into the preferences table for each column
    for column_name in columns_pf_o:
        
        # Insert data into the preferences table
        preferences_insert_query = '''
        INSERT INTO preferences (iid, attr_id, pref_value)
        SELECT DISTINCT s.pid, a.attr_id, s.{}
        FROM speed_dating AS s
        JOIN attributes AS a ON 'pf_o_' || a.attribute_name = lower(?)
    '''.format(column_name)
        
        # Execute query
        cursor.execute(preferences_insert_query, (column_name.lower(),))
        
    # Commit changes
    conn.commit()
    
    # Close the connection
    conn.close()

In [183]:
# Running all the table creation functions with timing.
start_time = time.time()

create_participants_table(database_path)
create_dates_table(database_path)
create_attributes_table(database_path, columns_base)
create_ratings_table(database_path, columns_o)
create_self_ratings_table(database_path, columns_3_1)
create_preferences_table(database_path, columns_pf_o)

end_time = round(time.time() - start_time, 4)

print(f'Table creation complete. It took {end_time} seconds.')

Table creation complete. It took 0.7673 seconds.
