# Beauty and the Beholder: Exploring the Impact of Attractiveness Perception on Second Date Success

This project is a Python analysis of this speed dating dataset: https://data.world/annavmontoya/speed-dating-experiment. An explanation of all the column meanings and the background of the experiment can be found in the speed dating data key document.

- This data comes from a speed dating experiment in which many different people participated in dates.
- The study concerned five main attributes: **attractiveness, sincerity, intelligence, fun and ambition**.
- Each row in the table concerns a particular date. The column **iid** represents a particular individual. The column **pid** represents the iid of the other individual on the date (i.e. iid's partner).
- Each participant would rate themselves on these five attributes (scale 1 - 10, represented by columns **attr3_1**, **sinc3_1**, **intel3_1**, **fun3_1** and **amb3_1**) before dating begins. (For a particular row/date these are the self-ratings as stated by **iid**).
- Each participant allocates 100 points across these five attributes in what they look for in a date i.e. which attributes are most important. These are represented by **pf_o_att**, **pf_o_sin**, **pf_o_int**, **pf_o_fun** and **pf_o_amb**. (For a particular row/date these are the preferences as stated by individual **pid**, **not iid**).
- After each date, each participant would rate their partner on a scale of 1 - 10 for these five attributes (represented by columns **attr_o**, **sinc_o**, **intel_o**, **fun_o** and **amb_o**).
- Each row contains the **gender** of the person (**iid**) where 1 is male and 0 is female.
- Each row holds many values, but the most relevant for this study are: the self-ratings on the five attributes, the ratings the individual (**iid**) received from their date (**pid**) on those attributes, and the initial preference allocation.
- We have a **dec_o** column representing the decision of the partner (**pid**). If the value is 1, they agreed to a second date. If the value is 0, they did not.

This study explores:
- The average allocation of preference initially by gender (pf_o_att, pf_o_sinc etc.).
- The actual most important preference/feature through a random forest analysis of ratings received (attr_o, sinc_o etc.) and second date success (dec_o).
- The influence of average attractiveness rating received (attr_o) on second date success (dec_o).
- How each level (2, 3, 4, ..., 10) of self-rating of attractiveness affects second date success, i.e. does an individual's confidence in their attractiveness alone increase second date potential.
- The impact of self-perceived attractiveness (comparing the difference between an individual's self-rating of attractiveness with the average attractiveness rating received on their dates, and how this drives second date potential).

In [160]:
import pandas as pd
import sqlite3
import time
from sklearn.impute import KNNImputer
import os
import plotly.graph_objects as go
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score

## .CSV to .DB

The below code imports the relevant columns from the .csv file to the .db file.

The relevant columns are:
- **iid**: the id of an individual
- **gender**: the gender of the iid
- **pid**: the id of an individual's date
- **match**: whether both individuals agreed on a second date (1 = yes, 0 = no)
- **dec_o**: the decision of the partner (pid) whether or not to go on a second date with the individual (iid) (1 = yes, 0 = no)
- **\_o** attributes: the ratings an individual received from their partner on the five attributes
- **3_1** attributes: the ratings an individual gave themselves
- **pf_o** attributes: the allocation of 100 points across the five attributes according to what they look for in an ideal partner (given by pid)

In [161]:
# Converts the .csv to .db
# pf_o columns will be mapped so their names are consistent with 3_1 and _o

# Parameters:
#   - csv_file (str): the name of the .csv
#   - encoding_type (str): the encoding of the csv
#   - database_path (str): the path of the .db
#   - csv_columns (list): the columns to include from the original .csv

def create_db(csv_file : str, encoding_type : str,
              database_path : str, csv_columns : list):
    
    # Connect to database
    conn = sqlite3.connect(database_path)
    # Create cursor object
    cursor = conn.cursor()

    # Read the CSV as a dataframe
    df = pd.read_csv(csv_file, encoding=encoding_type)

    # Filter the dataframe to only include the selected columns
    # Use .copy() to create a copy of the DataFrame
    df_selected = df[csv_columns].copy()
    
    # Mapping between original and desired column names
    column_mapping = {
        'pf_o_att': 'pf_o_attr',
        'pf_o_sin': 'pf_o_sinc',
        'pf_o_int': 'pf_o_intel'
    }
    
    # Rename columns using the mapping 
    df_selected.rename(columns=column_mapping, inplace=True)
    
    # Use Pandas to_sql to insert the DataFrame into the SQLite table
    df_selected.to_sql('speed_dating', conn, 
                       index=False, if_exists='replace')

    # Create indexes
    cursor.execute('CREATE INDEX idx_iid ON speed_dating(iid);')
    cursor.execute('CREATE INDEX idx_pid ON speed_dating(pid);')

    # Commit changes to the database
    conn.commit()

    # Close the database
    conn.close()

In [162]:
# Running database creation with timing
start_time = time.time()

csv_file = 'speed_dating.csv'
encoding_type = 'latin'
database_path = 'speed_dating.db'
csv_columns = ['iid', 'gender', 'pid', 'match', 'dec_o',
               'attr_o','sinc_o', 'intel_o', 'fun_o', 'amb_o',
               'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1',
               'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb']

create_db(csv_file, encoding_type, database_path, csv_columns)

end_time = round(time.time() - start_time, 4)

print(f'.csv to .db conversion complete. It took {end_time} seconds.')

.csv to .db conversion complete. It took 0.2686 seconds.


## Data Pre-Processing

This involves several steps.
1) Deleting rows with several NULL values

2) Deleting missing (iid, pid), (pid, iid) pairs

3) Performing data imputation on missing values

### Deleting Rows with Several NULL Values

For this analysis we will be comparing the importance of each of the aforementioned sets of attributes on second date success. This means we need all attributes per individual, per date. Data imputation can fill in NULL values if only one or two attributes are missing per set (the \_o set, the 3_1 set or the pf_o set). However, if several values are missing per set, data imputation can be unreliable.

For this dataset, I have chosen three or more missing values as being too unreliable. This is so analysis is focused on more quality data. Additionally, there are many rows with one missing attribute, some with two and the rest have all five missing attributes so three felt like an appropriate number.

INCLUDE TESTS HERE.

The NULL threshold is passed as the **null_no** argument, so it can be changed when calling the function, for example to retain more data or if more attributes are added to the analysis.
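As a rough illustration of how the threshold changes the number of affected rows, the sketch below counts matching rows in a small in-memory table. The table name `toy`, its columns and the helper `count_rows_at_threshold` are hypothetical and exist only for this example; the counting expression follows the same `(column IS NULL)` summing approach used by the delete function.

```python
import sqlite3

# Hypothetical toy table with a single three-column attribute set
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (id INTEGER, a REAL, b REAL, c REAL)")
conn.executemany(
    "INSERT INTO toy VALUES (?, ?, ?, ?)",
    [
        (1, 7.0, 8.0, 6.0),     # no NULL values
        (2, None, 8.0, 6.0),    # one NULL value
        (3, None, None, 6.0),   # two NULL values
        (4, None, None, None),  # three NULL values
    ],
)

def count_rows_at_threshold(conn, columns, null_no):
    # Each "(column IS NULL)" evaluates to 1 or 0 in SQLite, so their
    # sum is the number of NULL values in the row for these columns
    null_sum = " + ".join(f"({column} IS NULL)" for column in columns)
    query = f"SELECT COUNT(*) FROM toy WHERE ({null_sum}) >= ?"
    return conn.execute(query, (null_no,)).fetchone()[0]

# Number of rows that would be deleted at each threshold
counts = {n: count_rows_at_threshold(conn, ["a", "b", "c"], n) for n in (1, 2, 3)}
print(counts)

conn.close()
```

Running a count like this before deleting is one way to see how much data each threshold would remove.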

In [163]:
# Deletes rows with a specified number of NULL values based on a threshold.

# Parameters:
#   - database_path (str): the path to the SQLite database.
#   - columns_o (list): list of attributes ending in _o (ratings received).
#   - columns_3_1 (list): list of attributes ending in 3_1 (self ratings given).
#   - columns_pf_o (list): list of attributes in the form
#                          pf_o_{attribute} (attribute preferences).
#   - null_no (int): the threshold number of NULL values for row deletion.

def delete_null_values(database_path : str, columns_o : list,
                       columns_3_1 : list, columns_pf_o : list, null_no : int):
    # Connect to database
    conn = sqlite3.connect(database_path)

    # Create a cursor object
    cursor = conn.cursor()
    
    # Construct the conditions for NULL values in o_columns
    conditions_o = [f"({column} IS NULL)" for column in columns_o]

    # Construct the conditions for NULL values in 3_1 columns
    conditions_3_1 = [f"({column} IS NULL)" for column in columns_3_1]
    
    # Construct the conditions for NULL values in pf_o columns
    conditions_pf_o = [f"({column} IS NULL)" for column in columns_pf_o]
        
    # Sum the conditions with +
    # Each condition evaluates to 1 if the value is NULL and 0 otherwise,
    # so the sum is the number of NULL values in that set of columns
    # and is compared against the null_no threshold
    condition_query_o = f"({' + '.join(conditions_o)}) >= {null_no}"
    condition_query_3_1 = f"({' + '.join(conditions_3_1)}) >= {null_no}"
    condition_query_pf_o = f"({' + '.join(conditions_pf_o)}) >= {null_no}"

    # Construct the delete query
    delete_query = (
        f"DELETE FROM speed_dating "
        f"WHERE ({condition_query_o}) OR "
        f"({condition_query_3_1}) OR "
        f"({condition_query_pf_o})"
    )
    
    # Execute the query to find rows to be deleted
    cursor.execute(delete_query)

    # Get the count of deleted rows
    deleted_rows = cursor.rowcount

    # Commit the changes
    conn.commit()

    # Close the connection
    conn.close()

    # Print the number of rows deleted
    print(f"Deleted {deleted_rows} rows with three or more NULL values.")

### Deleting Missing Pairs

For each date we have an iid and a pid. For each date there are two rows. One is from the perspective of iid (so iid = iid and pid = pid) and the other is from the perspective of pid (so iid = pid and pid = iid). So on the first row (perspective of iid) we see how pid rated iid on these five attributes. On the second row (perspective of pid) we see how iid rated pid on these five attributes. To ensure information is relevant, accurate and complete we must have both perspectives on a particular date.
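The pairing rule described above can be checked on a small example before deleting anything. This is a minimal sketch using a hypothetical in-memory table `toy_dates`; it selects the rows that have no reverse (pid, iid) counterpart rather than deleting them.

```python
import sqlite3

# Hypothetical toy table holding only the two id columns
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_dates (iid INTEGER, pid INTEGER)")
conn.executemany(
    "INSERT INTO toy_dates VALUES (?, ?)",
    [
        (1, 2), (2, 1),  # both perspectives present
        (1, 3),          # the (3, 1) perspective is missing
        (4, 5), (5, 4),  # both perspectives present
    ],
)

# Select rows for which no row exists with iid and pid swapped
unpaired_query = """
    SELECT iid, pid
    FROM toy_dates AS s1
    WHERE NOT EXISTS (
        SELECT 1
        FROM toy_dates AS s2
        WHERE s1.iid = s2.pid AND s1.pid = s2.iid)
"""
unpaired = conn.execute(unpaired_query).fetchall()
print(unpaired)

conn.close()
```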

In [164]:
# Deletes missing (iid, pid) pairs

# Parameters:
#   - database_path (str): the path to your .db

def delete_missing_pairs(database_path : str):
    # Connect to the database
    conn = sqlite3.connect(database_path)
    # Create a cursor object
    cursor = conn.cursor()
    
    # Construct the delete query for rows where there is no
    # corresponding (pid, iid) row for a given (iid, pid) row
    delete_query = '''
    DELETE FROM speed_dating
    WHERE NOT EXISTS (
        SELECT 1
        FROM speed_dating AS s2
        WHERE speed_dating.iid = s2.pid AND speed_dating.pid = s2.iid)
    '''
    
    # Execute the delete query
    cursor.execute(delete_query)
    
    # Get the count of deleted rows with missing pairs
    deleted_missing_pairs = cursor.rowcount
    
    # Commit the changes
    conn.commit()
    
    # Close the connection
    conn.close()
    
    # Print the number of rows with missing pairs deleted
    print(f"Deleted {deleted_missing_pairs} rows with missing pairs.")

### Data Imputation

We still have rows with one or two NULL values in a particular attribute set. For these we will use KNN imputation. Averaging across a row's other attributes would not be appropriate: a high attractiveness rating does not necessarily imply a high intelligence rating.

INCLUDE CROSS-VALIDATION HERE
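One possible way to approach this is to hide a portion of the known ratings, impute them with different numbers of neighbours, and compare the imputed values with the hidden ones. The sketch below does a single hold-out round on synthetic data rather than a full cross-validation; the array `complete`, the 10% masking rate and the helper `masked_rmse` are assumptions made for illustration and are not part of the analysis.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic stand-in for the five _o rating columns (integers 1 to 10)
complete = rng.integers(1, 11, size=(200, 5)).astype(float)

# Hide roughly 10% of the known entries so the imputed values can be scored
mask = rng.random(complete.shape) < 0.1
masked = complete.copy()
masked[mask] = np.nan

def masked_rmse(n_neighbors):
    # Impute the hidden entries and compare them with the true values
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(masked)
    return float(np.sqrt(np.mean((imputed[mask] - complete[mask]) ** 2)))

# Root mean squared error on the hidden entries for each candidate value
scores = {k: masked_rmse(k) for k in (1, 3, 5, 10)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

On the real data the same idea would be applied to rows whose ratings are fully known, and the chosen value passed to `KNNImputer(n_neighbors=...)`.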

In [165]:
# CROSS-VALIDATE
# TAKE NUMBER OF NEIGHBOURS AS INPUT

# Performs KNN imputation on missing values

# Parameters:
#   - database_path (str): the path to your database
#   - columns_to_impute (list): the set of columns to perform data imputation on

def data_imputation(database_path : str, columns_to_impute : list):
    # Connect to database
    conn = sqlite3.connect(database_path)
    
    # Import data into dataframe
    df = pd.read_sql_query('SELECT * FROM speed_dating', conn)
    
    # Close connection
    conn.close()
    
    # Perform KNN imputation
    knn_imputer = KNNImputer()
    df_imputed = pd.DataFrame(knn_imputer.fit_transform(df[columns_to_impute]), 
                             columns = columns_to_impute)
    
    # Update the original DataFrame with the imputed values
    df[columns_to_impute] = df_imputed
    
    # Save the updated data back to the database
    conn = sqlite3.connect(database_path)
    df.to_sql('speed_dating', conn,
              if_exists = 'replace', index = False)
    print('Remaining NULL values have had data imputation performed on them.')
    conn.close()

In [166]:
# Running all the data preprocessing functions with timing.
start_time = time.time()

# Define the base names
# Add more if needed
columns_base = ['attr', 'sinc', 'intel', 'fun', 'amb']

# Define the full names for plots
# Add more if needed
columns_full = ['Attractive', 'Sincere', 'Intelligent', 'Fun', 'Ambitious']

# Define the _o columns
columns_o = [attribute + '_o' for attribute in columns_base]

# Define the 3_1 columns
columns_3_1 = [attribute + '3_1' for attribute in columns_base]
    
# Define the pf columns
columns_pf_o = [f'pf_o_{attribute}' for attribute in columns_base]

# Define the threshold for null value row deletion
null_no = 3

delete_null_values(database_path, columns_o, columns_3_1, columns_pf_o, null_no)
delete_missing_pairs(database_path)

# Currently only _o has missing values. Repeat call for further column sets
# if necessary
data_imputation(database_path, columns_o)

end_time = round(time.time() - start_time, 4)

print(f'Data pre-processing complete. It took {end_time} seconds.')

Deleted 421 rows with three or more NULL values.
Deleted 131 rows with missing pairs.
Remaining NULL values have had data imputation performed on them.
Data pre-processing complete. It took 0.4402 seconds.


## Table Creation

The below functions create the various tables needed for analysis. These also define foreign keys and primary keys (i.e. the relation between tables).

For a full description of these tables, their schema, what the columns represent, justification of datatypes and how they link together please see the database schema .pdf.
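Note that SQLite only enforces foreign key constraints on connections where the `foreign_keys` pragma has been enabled. The sketch below shows the effect on two hypothetical tables, `toy_participants` and `toy_dates`, which are simplified stand-ins for the tables created in this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Foreign key enforcement is off by default and is set per connection
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE toy_participants (iid INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE toy_dates (
        date_id INTEGER PRIMARY KEY AUTOINCREMENT,
        iid INTEGER,
        pid INTEGER,
        FOREIGN KEY (iid) REFERENCES toy_participants (iid),
        FOREIGN KEY (pid) REFERENCES toy_participants (iid)
    )
""")
conn.executemany("INSERT INTO toy_participants VALUES (?)", [(1,), (2,)])

# Both ids exist in toy_participants, so this insert is accepted
conn.execute("INSERT INTO toy_dates (iid, pid) VALUES (1, 2)")

# Participant 99 does not exist, so this insert violates the foreign key
try:
    conn.execute("INSERT INTO toy_dates (iid, pid) VALUES (1, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
conn.close()
```

Without the pragma, the second insert would be accepted silently.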


In [167]:
# Creates the participants table

# Parameters:
#   - database_path (str): the path to your database
#   - columns_3_1 (list): list of attributes ending in 3_1 (self ratings given)
#   - columns_pf_o (list): list of attributes in the form
#                          pf_o_{attribute} (attribute preferences)

def create_participants_table(database_path, columns_3_1, columns_pf_o):
    # Connect to the database
    conn = sqlite3.connect(database_path)
    
    # Create cursor object
    cursor = conn.cursor()
    
    # Drop the participants table if it exists
    cursor.execute("DROP TABLE IF EXISTS participants")
    
    # Create the participants table
    participants_table_query = f'''
       CREATE TABLE participants(
           iid SMALLINT UNSIGNED PRIMARY KEY,
           gender TINYINT,
           {', '.join(f'{attr} TINYINT UNSIGNED' for attr in columns_3_1)},
           {', '.join(f'{attr} TINYINT UNSIGNED' for attr in columns_pf_o)}
           )
       '''
    
    # Execute query
    cursor.execute(participants_table_query)
    
    # Create indexes on iid column in the participants table
    cursor.execute('CREATE INDEX idx_participants_iid ON participants (iid)')
    
    # Insert data into participants table
    participants_insert_query = f'''
        INSERT INTO participants(iid, gender, {', '.join(columns_3_1)})
        SELECT DISTINCT iid, gender, {', '.join(columns_3_1)}
        FROM speed_dating;
    '''
        
    # Execute query
    cursor.execute(participants_insert_query)
        
    # Update data in participants table
    participants_update_query = f'''
        UPDATE participants
        SET {', '.join(f'{attr} = s.{attr}' for attr in columns_pf_o)}
        FROM
            speed_dating AS s
        WHERE
            participants.iid = s.pid;
    '''

    # Execute query
    cursor.execute(participants_update_query)
    
    # Commit changes
    conn.commit()
    
    # Close the connection
    conn.close()

In [168]:
# Creates the dates table

# Parameters:
#   - database_path (str): the path to your database
#   - columns_o (list): list of attributes ending in _o (ratings received)

def create_dates_table(database_path, columns_o):
    # Connect to the database
    conn = sqlite3.connect(database_path)
    
    # Create cursor object
    cursor = conn.cursor()
    
    # Drop the dates table if it exists
    cursor.execute("DROP TABLE IF EXISTS dates")
    
    # Create the dates table
    dates_table_query = f'''
        CREATE TABLE dates(
            date_id INTEGER PRIMARY KEY AUTOINCREMENT,
            iid SMALLINT UNSIGNED,
            pid SMALLINT UNSIGNED,
            match TINYINT,
            dec_o TINYINT,
            {', '.join(f'{attr} TINYINT UNSIGNED' for attr in columns_o)},
            FOREIGN KEY (iid) REFERENCES participants (iid),
            FOREIGN KEY (pid) REFERENCES participants (iid)
            )
    '''
    
    # Execute query
    cursor.execute(dates_table_query)
    
    # Create indexes on date_id column in dates table
    cursor.execute("CREATE INDEX idx_dates_date_id ON dates (date_id)")
    
    # Insert data into dates table
    dates_insert_query = f'''
        INSERT INTO dates (iid, pid, match, dec_o, {', '.join(columns_o)})
        SELECT iid, pid, match, dec_o, {', '.join(columns_o)}
        FROM speed_dating
    '''
    
    # Execute query
    cursor.execute(dates_insert_query)
    
    # Commit changes
    conn.commit()
    
    # Close connection
    conn.close()

In [169]:
# Running the table creation functions with timing.
start_time = time.time()

create_participants_table(database_path, columns_3_1, columns_pf_o)
create_dates_table(database_path, columns_o)

end_time = round(time.time() - start_time, 4)

print(f'Table creation complete. It took {end_time} seconds.')

Table creation complete. It took 0.1434 seconds.


## Attribute Importance

In this section we first examine and plot the average allocation of preference across the five attributes, sorted by gender.
We then train a Random Forest model, based on ratings received on dates and their influence on second date success.

The overall goal of this section is to compare how people initially placed their preference compared to which attributes actually drove second-date success.
Additionally, it will examine which attribute is the most important in driving second date potential.

### Initial Preference Allocation

As mentioned earlier, before dating began each participant would allocate 100 points across the five attributes representing what they look for in an ideal partner.

The below code retrieves the average allocation for each attribute separated by gender and plots the results.

In [170]:
# Fetches the average allocation of each attribute by gender

# Parameters:
#   - database_path (str): the path to your database
#   - columns_pf_o (list): list of attributes in the form
#                          pf_o_{attribute} (attribute preferences)
#   - gender_labels (list): the genders to be included 

def fetch_pref_data(database_path, columns_pf_o, gender_labels):
    # Connect to the database
    conn = sqlite3.connect(database_path)

    # Create cursor object
    cursor = conn.cursor()
    
    # Generate the SELECT statement dynamically
    pref_columns = ', '.join([f'AVG({column})' for column in columns_pf_o])
    pref_query = f'''
        SELECT gender, {pref_columns}
        FROM participants
        GROUP BY gender
        ORDER BY gender
    '''
    
    # Execute query to retrieve the average allocation for each attribute
    # by gender
    cursor.execute(pref_query)
    
    rows = cursor.fetchall()

    # Commit changes
    conn.commit()
    
    # Close the connection
    conn.close()
    
    pref_data = {}

    for i, row in enumerate(rows):
        gender = gender_labels[i]
        average_pref = row[1:]  # Exclude the first element (gender) from the row
        pref_data[gender] = average_pref

    return pref_data

In [171]:
# Creates the traces for the plotly bar chart

# Parameters:
#   - data (dict): the data to be used containing x and y data
#                  for each gender
#   - x_axis (list): the values to use for x-axis labels
#   - colors (list): the colors to use for the male and female bar chart

def create_bar_traces(data, x_axis, colors):
    
    # Create traces for each gender
    traces = []
    
    # Set counter for colors list
    i = 0
    
    for gender, y_data in data.items():
        trace = go.Bar(
            # The x-axis labels to be used
            x=x_axis,
            # The y data to be used
            y=y_data,
            name=gender,
            marker_color=colors[i]
        )
        traces.append(trace)
        i = i + 1

    return traces

In [172]:
# Running the average initial preference allocation functions with timings
start_time = time.time()

# Set the plot colours for each gender
colors = ['#ff6361', '#003f5c']

# Replace 0 and 1 with the corresponding gender
gender_labels = ['Female', 'Male']

pref_data = fetch_pref_data(database_path, columns_pf_o, gender_labels)
traces = create_bar_traces(pref_data, columns_full, colors)

# Structuring the layout with titles and colours
layout = go.Layout(
    title='Average Allocation for Attributes by Gender',
    xaxis=dict(title='Attributes'),
    yaxis=dict(title='Average Allocation'),
    barmode='group',
    plot_bgcolor='#dadfe1',
    paper_bgcolor='#dadfe1'
)

# Create figure
fig = go.Figure(data=traces, layout=layout)

# Show the plot
fig.show()

end_time = round(time.time() - start_time, 4)

print(f'Initial preference allocation plot completed. It took {end_time} seconds.')

Initial preference allocation plot completed. It took 0.01 seconds.


From this we see that men prioritize attractiveness the most whereas women place a higher importance on intelligence.

Sincerity, intelligence and fun hold similar significance, whereas ambition is the least influential of the five attributes.

### Actual Attribute Importance for Second Date Success

The functions below fetch the ratings received on each date and whether the partner agreed to a second date (**dec_o**), separated by gender. A Random Forest model (a regressor whose predictions are thresholded at 0.5 to give a yes/no decision) is then trained to determine which attribute was actually most important in driving second date success.
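Because **dec_o** is a binary outcome, one alternative would be scikit-learn's `RandomForestClassifier`, which predicts the class directly. The sketch below is not the approach used in this notebook; it runs on synthetic data in which the decision is driven mostly by the first column, purely to show how `feature_importances_` reflects that. The arrays `X` and `y` and the noise level are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic ratings for five attributes (integers 1 to 10)
X = rng.integers(1, 11, size=(1000, 5)).astype(float)

# Synthetic decision driven by the first column plus noise
y = (X[:, 0] + rng.normal(0, 1.5, size=1000) > 6).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a classifier directly on the binary outcome
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

# accuracy_score returns a fraction between 0 and 1
accuracy = accuracy_score(y_test, clf.predict(X_test))

# Importances are normalised to sum to 1 across the features
importances = clf.feature_importances_
top_feature = int(np.argmax(importances))
print(round(accuracy, 2), importances.round(2), top_feature)
```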

In [173]:
# Fetches the ratings received on each date

# Parameters:
#   - database_path (str): the path to your database
#   - columns_o (list): list of attributes ending in _o (ratings received)

def fetch_rating_data(database_path, columns_o):

    # Connect to database
    conn = sqlite3.connect(database_path)

    # Create cursor object
    cursor = conn.cursor()

    # Construct the SELECT query
    rating_query = f'''
        SELECT gender, dec_o, {', '.join(columns_o)}
        FROM participants AS p
        JOIN dates AS d on p.iid = d.iid
        ORDER BY gender
    '''

    # Execute query to retrieve the ratings received by iid
    # on each date
    cursor.execute(rating_query)

    rows = cursor.fetchall()

    # Commit changes
    conn.commit()

    # Close the connection
    conn.close()

    # Initialize a dictionary for organizing data separated by gender
    rating_data = {0: [[], []], 1: [[], []]}  # Use 0 for Female, 1 for Male

    # Separate data for men and women
    for row in rows:
        gender, dec_o, *attributes = row
        # Use gender directly as the key
        rating_data[gender][0].append(attributes)
        rating_data[gender][1].append(dec_o)
            
    return rating_data

In [174]:
# Returns the feature importances on second date success based on rating data

# Parameters:
#   - rating_data (dict): ratings per date created by fetch_rating_data
#   - gender_labels (list): the genders to be included

def train_and_evaluate(rating_data, gender_labels):
    # Create mean squared error for each gender
    mse = []
    # Create accuracy for each gender
    accuracies = []
    # Create feature importance for each gender
    feature_importances = {}

    for i, gender in enumerate(gender_labels):
        # Split data into training and test data
        X_train, X_test, y_train, y_test = train_test_split(rating_data[i][0], rating_data[i][1], test_size=0.2, random_state=42)

        # Create a random forest regressor model
        # 200 gives slightly better accuracy than 100, increasing decreases accuracy
        # Naturally random_state is 42
        model = RandomForestRegressor(n_estimators=200, random_state=42)

        # Fit the model to the training data
        model.fit(X_train, y_train)

        # Predict on the test data
        y_pred = model.predict(X_test)

        # Calculate the mean squared error
        error = mean_squared_error(y_test, y_pred)
        mse.append(error)

        # Get the feature importances from the random forest model
        feature_importance = model.feature_importances_
        feature_importances[gender] = feature_importance.tolist()

        # Convert the continuous predictions to binary class labels
        y_pred_binary = (y_pred > 0.5).astype(int)

        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred_binary)
        accuracies.append(accuracy)
        
    return mse, accuracies, feature_importances

In [175]:
# Running the feature importance of attributes on second date success with timings
start_time = time.time()

rating_data = fetch_rating_data(database_path, columns_o)
mse, accuracies, feature_importances = train_and_evaluate(rating_data, gender_labels)
traces = create_bar_traces(feature_importances, columns_full, colors)

# Structuring the layout with titles and colours
layout = go.Layout(
    title='Feature Importance for Successful Second Date Prediction',
    xaxis=dict(title='Attributes'),
    yaxis=dict(title='Relative Feature Importance'),
    barmode='group',
    plot_bgcolor='#dadfe1',
    paper_bgcolor='#dadfe1'
)

# Create figure
fig = go.Figure(data=traces, layout=layout)

# Show the plot
fig.show()

# accuracy_score returns a fraction between 0 and 1, not a percentage
for i, gender in enumerate(gender_labels):
    print(f"{gender} accuracy: {round(accuracies[i], 2)}")
    print(f"{gender} mean squared error: {round(mse[i], 2)}\n")
          
end_time = round(time.time() - start_time, 4)

print(f'Feature importance plot completed. It took {end_time} seconds.')

Female accuracy: 0.7
Female mean squared error: 0.21

Male accuracy: 0.71
Male mean squared error: 0.2

Feature importance plot completed. It took 1.3414 seconds.


From this plot, we see the mismatch between initial preferences and real-world dynamics. In reality, attractiveness far outweighs all the other attributes and is the most important attribute in driving second date success.

Here we observe the interplay between perceived preferences, societal influences and the unpredictable nature of human connections.

## Effect of Attractiveness Rating Received on Second Date Success

In this section we explore how a participant's average attractiveness rating received affects their second date success.

For each participant we retrieve their percentage of successful second dates based on the **dec_o** column as well as the average attractiveness rating they received. We then compare these two values on a scatter plot for each participant, separated by gender.

In [188]:
# Fetches the average attractiveness rating received and the
# percentage of successful second dates for each participant

# Parameters:
#   - database_path (str): the path to your database

def fetch_avg_attr_data(database_path):

    # Connect to database
    conn = sqlite3.connect(database_path)

    # Create cursor object
    cursor = conn.cursor()

    # Construct the SELECT query
    avg_attr_query = f'''
        SELECT gender, AVG(attr_o), ((SUM(dec_o) * 1.0)/COUNT(*)) * 100
        FROM dates AS d
        JOIN participants AS p on d.iid = p.iid
        GROUP BY d.iid
        ORDER BY gender
    '''

    # Execute query to retrieve the average attractiveness rating and
    # second date success percentage per participant
    cursor.execute(avg_attr_query)

    rows = cursor.fetchall()

    # Commit changes
    conn.commit()

    # Close the connection
    conn.close()

    # Initialize a dictionary for organizing data separated by gender
    avg_attr_data = {'Female': [[], []], 'Male': [[], []]}  # [average attr_o values, success percentages]

    # Separate data for men and women
    for row in rows:
        gender, avg_attr_o, dec_o_ratio = row
        # Map numeric gender to string labels
        gender_str = 'Female' if gender == 0 else 'Male'
        avg_attr_data[gender_str][0].append(avg_attr_o)
        avg_attr_data[gender_str][1].append(dec_o_ratio)
            
    return avg_attr_data

In [197]:
# Creates the traces for the plotly scatter chart

# Parameters:
#   - data (dict): the dictionary containing the x and y data
#                  for each gender
# Marker colors are read from the notebook-level colors list

def create_scatter_traces(data):
    
    # Create traces for each gender
    traces = []
    
    # Set counter for colors list
    i = 0
    
    for gender, (x_data, y_data) in data.items():
        trace = go.Scatter(
            # The x-axis data to be used
            x=x_data,
            # The y-axis data to be used
            y=y_data,
            mode='markers',
            name=gender,
            marker_color=colors[i]
        )
        traces.append(trace)
        i = i + 1
    
    return traces

[Scatter({
    'marker': {'color': '#ff6361'},
    'mode': 'markers',
    'name': 'Female',
    'x': [6.7, 7.7, 6.5, ..., 5.45, 4.095238095238095, 4.142857142857143],
    'y': [50.0, 60.0, 50.0, ..., 25.0, 14.285714285714285, 14.285714285714285]
}), Scatter({
    'marker': {'color': '#003f5c'},
    'mode': 'markers',
    'name': 'Male',
    'x': [5.6, 7.1, 4.8, ..., 5.136363636363637, 6.2, 7.3],
    'y': [40.0, 40.0, 40.0, ..., 27.27272727272727, 50.0, 75.0]
})]


In [198]:
# Running the average attractiveness rating plot with timings
start_time = time.time()

avg_attr_data = fetch_avg_attr_data(database_path)
traces = create_scatter_traces(avg_attr_data)

# Structuring the layout with titles and colours
layout = go.Layout(
    title='Influence of Average Attractiveness Rating Received on Second Date Success',
    xaxis=dict(title='Average Attractiveness Rating Received'),
    yaxis=dict(title='Percentage of Successful Second Dates'),
    plot_bgcolor='#dadfe1',
    paper_bgcolor='#dadfe1'
)

# Create figure
fig = go.Figure(data=traces, layout=layout)

# Show the plot
fig.show()
          
end_time = round(time.time() - start_time, 4)

print(f'Average attractiveness plot completed. It took {end_time} seconds.')

Average attractiveness plot completed. It took 0.0239 seconds.


In the above plot we see a clear positive correlation between average attractiveness rating and second date success. Participants with an average attractiveness rating of roughly 5.5 to 6.5 had around a 50% success rate.

Additionally, we see a difference in attractiveness ratings received by gender. Lower attractiveness ratings are predominantly received by men, whereas the higher attractiveness ratings are primarily received by women.
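The strength of the relationship seen in the scatter plot could be quantified with a Pearson correlation coefficient. The sketch below is a minimal pure-Python version; the helper `pearson_r` and the toy values are hypothetical and are not taken from the dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation: covariance divided by the product of the
    # standard deviations (the 1/n factors cancel out)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: average rating received and success percentage
toy_avg_attr = [4.0, 5.0, 6.0, 7.0, 8.0]
toy_success_pct = [10.0, 30.0, 50.0, 60.0, 90.0]

r = pearson_r(toy_avg_attr, toy_success_pct)
print(round(r, 3))
```

With the real data, the two lists for each gender in `avg_attr_data` could be passed in the same way, for example `pearson_r(avg_attr_data['Female'][0], avg_attr_data['Female'][1])`.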

This scatter plot does indeed highlight the importance of attractiveness in second date success. However, is second date success based solely on the attractiveness rating you receive?