# Algorithm for compatibility score

**Authors:** Akaran Sivakumar, Anne Skamris Holm, Johanne Sejrskild Rejsenhus, Kristiane Warncke, Matilda Rhys-Kristensen 


Welcome to the markdown file for our compatibility score algorithm. The algorithm uses answers from a roomie survey to calculate a compatibility score between two users. It was originally written in the programming language R, but we have since moved to Python for easier compatibility with other parts of the project.

In [1]:
# Importing packages necessary for data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the dataset
Preliminary_testdata = pd.read_csv('data_test.csv')


# Cleaning the Data

We will now rename the columns in the dataframe to make it easier to work with. 


In [2]:
#rename column "Z001_03" to codeword
Preliminary_testdata.rename(columns={'Z001_03':'codeword'}, inplace=True)

#rename column "Z003_01" to age
Preliminary_testdata.rename(columns={'Z003_01':'age'}, inplace=True)

#rename column "Z002" to gender
Preliminary_testdata.rename(columns={'Z002':'gender'}, inplace=True)

#change value in column "gender" to f if value is 2 and to m if value is 1
Preliminary_testdata['gender'] = Preliminary_testdata['gender'].replace({1: 'm', 2: 'f'})

# Column name mapping
column_mapping = {
    'A501_01': 'first_priority',
    'A501_02': 'second_priority',
    'A501_03': 'third_priority',
    'A501_04': 'fourth_priority',
    'A501_05': 'fifth_priority'
}

# Rename all mapped columns in a single call
Preliminary_testdata.rename(columns=column_mapping, inplace=True)

Unfortunately, a few participants did not finish the survey, so we will now remove them from the dataset. In some cases only one of the roomies completed the survey. We cannot use these responses, because we need to compare scores between at least two people, so we will remove them as well.


In [3]:
# Removing the participants who did not complete the test

#remove participants with value 0 in the column "FINISHED"
Preliminary_testdata = Preliminary_testdata[Preliminary_testdata.FINISHED != 0]

#Removing the participants whose roomies did not complete the test
#make all the values in the column "codeword" lowercase
Preliminary_testdata['codeword'] = Preliminary_testdata['codeword'].str.lower()

#using function .strip() to remove the spaces before and after the values in the column "codeword"
Preliminary_testdata['codeword'] = Preliminary_testdata['codeword'].str.strip()


#write a loop that checks the value in column "codeword" for each row and adds the row to a new dataframe if the value appears on more rows than once
#make a list of the values in column "codeword"
roomies = Preliminary_testdata['codeword'].tolist()

#make a list of the values that appear more than once in the list "roomies"
roomies_more_than_once = []
for i in roomies:
    if roomies.count(i) > 1:
        roomies_more_than_once.append(i)

#make a new dataframe with only the rows that have a value in column "codeword" that appears more than once in the list "roomies_more_than_once"
Preliminary_testdata = Preliminary_testdata[Preliminary_testdata['codeword'].isin(roomies_more_than_once)]

# Removing the unnecessary survey metadata columns: CASE, QUESTNNR, MODE, STARTED, FINISHED, Q_VIEWER, LASTPAGE, MAXPAGE, MISSING, MISSREL, TIME_RSI, DEG_TIME, SERIAL, REF
Preliminary_testdata.drop(["CASE", 'QUESTNNR','MODE','STARTED','FINISHED','Q_VIEWER','LASTPAGE','MAXPAGE','MISSING','MISSREL','TIME_RSI','DEG_TIME','SERIAL','REF'], axis=1, inplace=True)
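
The codeword filter above keeps only the rows whose codeword occurs at least twice. As a minimal sketch, assuming the same `codeword` column, the same selection can be written with pandas' `duplicated(keep=False)`, which flags every member of a repeated group and avoids the nested counting loop. The toy dataframe and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: 'abc' and 'xyz' are shared codewords, 'solo' appears once
toy = pd.DataFrame({
    "codeword": [" ABC", "abc ", "solo", "xyz", "XYZ"],
    "age": [21, 23, 25, 22, 24],
})

# Normalise the codewords the same way as in the cell above
toy["codeword"] = toy["codeword"].str.lower().str.strip()

# keep=False flags every row whose codeword occurs more than once
paired = toy[toy["codeword"].duplicated(keep=False)]

print(paired["codeword"].tolist())
```

Only the rows with a shared codeword remain, so the participant with the codeword `solo` is dropped.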


As this questionnaire was conducted as part of a research project at Aarhus University in Denmark, we have to remove the data from participants who did not consent to having their data used for research purposes.

We will now also calculate the sample information usually reported in research: the number of participants, the mean and standard deviation of their age, and their gender distribution.

In [4]:
# Sample information usually reported in research
# Calculate the mean age
mean_age = Preliminary_testdata['age'].mean()

# Calculate the standard deviation of the age
sd_age = Preliminary_testdata['age'].std()

# Calculate the number of participants (rows with a recorded age)
number_of_participants = Preliminary_testdata['age'].count()

# Count the participants per gender
gender_counts = Preliminary_testdata['gender'].value_counts()

We have chosen a Euclidean distance algorithm to compute the compatibility scores between roomies. The Euclidean distance formula gives the straight-line distance between two points in a coordinate system. Here, each participant's survey answers are treated as one point, so a smaller distance means more similar answers.
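
To make the formula concrete, here is a small sketch using only the Python standard library. The two answer lists are hypothetical and are not taken from the survey data. The distance is the square root of the summed squared differences between the two answer vectors.

```python
import math

# Hypothetical answers from two roomies on three survey questions
roomie_a = [1, 2, 3]
roomie_b = [4, 6, 3]

# Euclidean distance: square root of the summed squared differences
manual = math.sqrt(sum((a - b) ** 2 for a, b in zip(roomie_a, roomie_b)))

# math.dist (Python 3.8+) computes the same quantity
builtin = math.dist(roomie_a, roomie_b)

print(manual, builtin)
```

The differences are 3, 4 and 0, so both computations give a distance of 5.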

In [5]:
# Imports needed for the pairwise Euclidean distance computation
from scipy.spatial import distance
import itertools

def euclidean_distances(df):
    # Extract the first 26 columns (positions 0-25) to calculate distances
    subset = df.iloc[:, 0:26]
    
    # Get all combinations of row indices
    row_combinations = list(itertools.combinations(df.index, 2))
    
    # Calculate Euclidean distances between rows and store them in a dictionary
    distances_dict = {}
    for i, j in row_combinations:
        dist = distance.euclidean(subset.loc[i], subset.loc[j])
        distances_dict[(i, j)] = dist
    
    # Create a DataFrame with the distances
    dist_df = pd.DataFrame.from_dict(distances_dict, orient='index', columns=['Euclidean Distance'])
    
    return dist_df

def euclidean_distances_dict(df):
    # Extract the first 26 columns (positions 0-25) to calculate distances
    subset = df.iloc[:, 0:26]
    
    # Get all combinations of row indices
    row_combinations = [(i, j) for i in df.index for j in df.index if i < j]
    
    # Calculate Euclidean distances between rows and store them in a dictionary
    distances_dict = {}
    for i, j in row_combinations:
        dist = distance.euclidean(subset.loc[i], subset.loc[j])
        distances_dict[(i, j)] = dist
    
    return distances_dict
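
The two functions above loop over every pair of rows. As a hedged alternative sketch, the full distance matrix can be computed at once with NumPy broadcasting. The `answers` array below is hypothetical toy data, and this is not the approach used in the notebook, only an illustration of the same calculation.

```python
import numpy as np

# Hypothetical answer matrix: one row per participant, one column per question
answers = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 6.0, 3.0],
    [1.0, 2.0, 3.0],
])

# Broadcasting builds every pairwise difference: shape (n, n, questions)
diff = answers[:, None, :] - answers[None, :, :]

# Square, sum over the question axis, and take the root: shape (n, n)
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))

print(dist_matrix)
```

The result is a symmetric matrix with zeros on the diagonal, where entry `[i, j]` holds the distance between participants `i` and `j`.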

In [6]:

# Calculate the pairwise distances as a dataframe
distances = euclidean_distances(Preliminary_testdata)

# Calculate the same distances as a dictionary keyed by pairs of row indices
distances_dict = euclidean_distances_dict(Preliminary_testdata)



In [7]:
# print the fourth column (position 3) in the dataframe "Preliminary_testdata"
print(Preliminary_testdata.iloc[:,3])

2     1.0
4     1.0
6     1.0
8     1.0
10    1.0
11    1.0
12    2.0
17    3.0
19    3.0
20    1.0
21    1.0
Name: A004, dtype: float64


In [10]:
def calculate_category_compatibility_score(user_1, user_2, start_col, end_col, max_possible_score):
    # Extract the relevant columns from the DataFrame
    user_1_scores = user_1.iloc[start_col:end_col + 1]
    user_2_scores = user_2.iloc[start_col:end_col + 1]

    # Ensure both users have the same number of answers
    if len(user_1_scores) != len(user_2_scores):
        raise ValueError("Both users must answer the same number of questions.")

    # Calculate the compatibility score
    compatibility_score = (user_1_scores + user_2_scores).sum()

    # Divide by the maximum possible score in the category range
    compatibility_score /= max_possible_score

    return compatibility_score

def compute_category_compatibilities(df, category_ranges, max_possible_scores):
    num_users = df.shape[0]
    compatibility_matrices = {}

    for category_name, (start_col, end_col) in category_ranges.items():
        # Look up the category maximum once per category
        max_possible_score = max_possible_scores[category_name]
        # Initialise with floats so the rounded scores keep their dtype
        compatibility_matrix = pd.DataFrame(0.0, index=range(num_users), columns=range(num_users))

        for i in range(num_users):
            for j in range(i + 1, num_users):
                compatibility_score = calculate_category_compatibility_score(df.iloc[i], df.iloc[j], start_col, end_col, max_possible_score)
                compatibility_score = round(compatibility_score, 2)
                # The matrix is symmetric, so fill both triangles
                compatibility_matrix.iloc[i, j] = compatibility_score
                compatibility_matrix.iloc[j, i] = compatibility_score

        compatibility_matrices[category_name] = compatibility_matrix

    return compatibility_matrices


category_ranges = {
    'Cleanliness': (0, 6),
    'Communal Life': (7, 13),
    'Social Life': (14, 16),
    'Communication': (17, 22),
    'Personal Routine': (32, 35)
}

max_possible_scores = {
    'Cleanliness': 45,  # Maximum possible score for Cleanliness category
    'Communal Life': 41,  # Maximum possible score for Communal Life category
    'Social Life': 30,  # Maximum possible score for Social Life category
    'Communication': 37,  # Maximum possible score for Communication category
    'Personal Routine': 32  # Maximum possible score for Personal Routine category
}

compatibility_matrices = compute_category_compatibilities(Preliminary_testdata, category_ranges, max_possible_scores)

# Print compatibility matrices for each category
for category_name, compatibility_matrix in compatibility_matrices.items():
    print(f"Compatibility Matrix for {category_name}:")
    print(compatibility_matrix)
    print()



Compatibility Matrix for Cleanliness:
      0     1     2     3     4     5     6     7     8     9     10
0   0.00  0.62  0.71  0.60  0.64  0.67  0.67  0.71  0.62  0.82  0.67
1   0.62  0.00  0.76  0.64  0.69  0.71  0.71  0.76  0.67  0.87  0.71
2   0.71  0.76  0.00  0.73  0.78  0.80  0.80  0.84  0.76  0.96  0.80
3   0.60  0.64  0.73  0.00  0.67  0.69  0.69  0.73  0.64  0.84  0.69
4   0.64  0.69  0.78  0.67  0.00  0.73  0.73  0.78  0.69  0.89  0.73
5   0.67  0.71  0.80  0.69  0.73  0.00  0.76  0.80  0.71  0.91  0.76
6   0.67  0.71  0.80  0.69  0.73  0.76  0.00  0.80  0.71  0.91  0.76
7   0.71  0.76  0.84  0.73  0.78  0.80  0.80  0.00  0.76  0.96  0.80
8   0.62  0.67  0.76  0.64  0.69  0.71  0.71  0.76  0.00  0.87  0.71
9   0.82  0.87  0.96  0.84  0.89  0.91  0.91  0.96  0.87  0.00  0.91
10  0.67  0.71  0.80  0.69  0.73  0.76  0.76  0.80  0.71  0.91  0.00

Compatibility Matrix for Communal Life:
      0     1     2     3     4     5     6     7     8     9     10
0   0.00  0.46  0.63  0.
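
To illustrate what `calculate_category_compatibility_score` computes for a single pair, here is a small worked sketch in plain Python. The answers and the category maximum are hypothetical values chosen for the example. As in the function above, both users' answers are summed and the total is divided by the category maximum.

```python
# Hypothetical answers from two users to the three questions of one category
user_1_answers = [3, 4, 5]
user_2_answers = [2, 4, 1]

# Assumed category maximum, chosen only for this illustration
max_possible_score = 30

# Sum both users' answers, then scale by the category maximum
raw_total = sum(a + b for a, b in zip(user_1_answers, user_2_answers))
score = round(raw_total / max_possible_score, 2)

print(score)
```

The summed answers are 19, so the score is 19 / 30, which rounds to 0.63. As written, the score grows with the size of both users' answers rather than with how close the answers are to each other.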

In [9]:
def calculate_max_possible_score(df, category_ranges):
    max_possible_scores = {}

    for category_name, (start_col, end_col) in category_ranges.items():
        max_possible_score = 0
        for i in range(df.shape[0]):
            user_scores = df.iloc[i, start_col:end_col + 1]
            max_possible_score += user_scores.max()
        
        max_possible_scores[category_name] = max_possible_score

    return max_possible_scores

max_possible_scores = calculate_max_possible_score(Preliminary_testdata, category_ranges)
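
The inner loop of `calculate_max_possible_score` adds up each user's highest answer within a category. As a minimal sketch on hypothetical toy data, the same total can be obtained with a row-wise `max` followed by `sum`. The names `toy`, `toy_ranges` and `toy_max` are made up for this example.

```python
import pandas as pd

# Hypothetical answers: three users, columns 0-2 form one category
toy = pd.DataFrame([[1, 5, 2], [4, 3, 3], [2, 2, 6]])
toy_ranges = {"Example": (0, 2)}  # inclusive column positions, as in category_ranges

toy_max = {}
for name, (start_col, end_col) in toy_ranges.items():
    block = toy.iloc[:, start_col:end_col + 1]
    # Highest answer per user (row), summed over all users
    toy_max[name] = block.max(axis=1).sum()

print(toy_max)
```

The row maxima are 5, 4 and 6, so the value for the example category is 15.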