## A machine-learning approach (K-means clustering plus Linear Sum Assignment optimization) to the roommate (well, flatmate in reality) allocation problem for the BXs!

#### Before we begin: for any doubts, feedback or discussion, please email me at my institutional address: jai-ansh.bindra@polytechnique.edu

### Imports - 

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

### Data Loading - 

In [2]:
data = pd.read_excel('/Users/jaianshsinghbindra/Downloads/Roommate matching/Sample_Data.xlsx')

### Print the labels for debugging.

In [3]:
file_path = '/Users/jaianshsinghbindra/Males Roommate form BX27(Responses).xlsx'
print(data.columns)

Index(['Name, Surname', 'Sex', 'What is your nationality(ies)',
       'What language(s) do you speak?', 'What is your sleeptime (weekdays)?',
       'What is your sleeptime (weekend)?', 'Does noise bother you?',
       'If at night from what time. Noise, if night what time',
       'How often are you willing to clean the common area?',
       'How do you rate your sharing habits?',
       'Do you clean your dishes right after using them?',
       'Do you mind if your roommate invites people to your flat?',
       'Would you invite people to your flat?',
       'How do you handle disagreements?', 'Are you a party-person?'],
      dtype='object')


### Additional helper functions for cleaning the data if you would rather not do it by hand (untested, since I didn't use them!)

In [None]:
''' 
def clean_nationality(nationalities):
    cleaned = []
    for nationality in nationalities.split(','):
        nationality = nationality.strip().lower().replace('unofficially french', 'french')  # Normalizing(casing) nationality data
        cleaned.append(nationality)
    return cleaned

# Normalize Yes/No responses to lowercase - for the 'Selecting the options' response types in the form.
def normalize_yes_no(value):
    return value.strip().lower()

# Normalize languages by removing spaces
def normalize_languages(languages):
    return [lang.strip().lower() for lang in languages.split(',')]
    
    '''

### Label Handling -

In [4]:
data['Cleanliness'] = data['How often are you willing to clean the common area?'].apply(
    lambda x: 5 if x == 'Every day' else (
        4 if x == 'Few days per week' else (
            3 if x == '1 day per week' else 0
        )
    )
)
data['Dishes'] = data['Do you clean your dishes right after using them?'].apply(lambda x: 1 if x == 'Yes' else 0)
def noise_score(row):
    if row['Does noise bother you?'] == 'During the day':
        return 1
    elif row['Does noise bother you?'] == 'At night':
        time = row['If at night from what time. Noise, if night what time']
        return {'9pm': 2, '10pm': 3, '11pm': 4, 'midnight': 5, 'after midnight': 6}.get(time, 0)
    else:
        return 0

data['Noise Tolerance'] = data.apply(noise_score, axis=1)
# Compare case-insensitively so that both 'Yes' and 'yes' count as a yes.
data['Party Person'] = data['Are you a party-person?'].apply(lambda x: 1 if str(x).strip().lower() == 'yes' else 0)
data['Sleeptime Weekdays'] = data['What is your sleeptime (weekdays)?'].apply(
    lambda x: {'8 to 9pm': 1, '9 to 10pm': 2, '10 to 11pm': 3, '11 to midnight': 4, 'midnight to 2 am': 5, 'after 2am': 6}.get(x, 3)
)
data['Sleeptime Weekends'] = data['What is your sleeptime (weekend)?'].apply(
    lambda x: {'8 to 9pm': 1, '9 to 10pm': 2, '10 to 11pm': 3, '11 to midnight': 4, 'midnight to 2 am': 5, 'after 2am': 6}.get(x, 3)
)
data['French Speaker'] = data['What language(s) do you speak?'].apply(lambda x: 1 if 'French' in x else 0)
data['French National'] = data['What is your nationality(ies)'].apply(lambda x: 1 if 'French' in x else 0)
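data['Nationality'] = data['What is your nationality(ies)'].apply(lambda x: [n.strip() for n in x.split(',')])  # Split multiple nationalities into a list, trimming spaces so 'A, B' yields 'B' rather than ' B'
data['Mind Invites'] = data['Do you mind if your roommate invites people to your flat?'].apply(lambda x: 1 if str(x).strip().lower() == 'yes' else 0)  # Case-insensitive, as for 'Party Person'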
data['Invite People'] = data['Would you invite people to your flat?'].apply(lambda x: 1 if x == 'Yes' else 0)
data['Handle Disagreements'] = data['How do you handle disagreements?'].apply(lambda x: {'Mediated discussion': 1, 'Confrontation': 2}.get(x, 0))
data['Sharing Habits'] = data['How do you rate your sharing habits?'] 
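The lookups above compare against exact strings, so a response with different casing or stray spaces silently falls through to the default (0, or 3 for the sleeptime columns). A minimal sketch of a more forgiving lookup follows; `encode_choice`, `SLEEPTIME` and `YES_NO` are hypothetical names introduced here for illustration, not part of the notebook.

```python
def encode_choice(value, mapping, default=0):
    """Map a form response to a number, ignoring case and surrounding spaces.

    Hypothetical helper for illustration; the notebook uses inline lambdas instead.
    """
    if not isinstance(value, str):
        # Empty spreadsheet cells usually arrive as NaN (a float), not a string.
        return default
    normalised = {key.strip().lower(): score for key, score in mapping.items()}
    return normalised.get(value.strip().lower(), default)


# Same sleeptime scale as in the cell above.
SLEEPTIME = {
    '8 to 9pm': 1, '9 to 10pm': 2, '10 to 11pm': 3,
    '11 to midnight': 4, 'midnight to 2 am': 5, 'after 2am': 6,
}
YES_NO = {'yes': 1, 'no': 0}

print(encode_choice(' Yes ', YES_NO))
print(encode_choice('Midnight to 2 AM', SLEEPTIME, default=3))
print(encode_choice(float('nan'), YES_NO))
```

With pandas this could be applied as `data[column].apply(lambda x: encode_choice(x, YES_NO))`, assuming the form's options match the keys once lower-cased.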

### Weights -

In [5]:
# Do remember: the higher the weight, the more a match on that factor pulls two people together in the algorithm.
# Think of the weights as how much each factor matters when grouping people.
weights = {
    'Cleanliness': 6, #I could have put 7 too here but I felt like it would have been too heavy...
    'Dishes': 3,
    'Sleeptime Weekdays': 4,
    'Sleeptime Weekends': 4,
    'Noise Tolerance': 3,
    'Party Person': 3,
    'Nationality': 1,
    'Mind Invites': 3,
    'Invite People': 2,
    'Handle Disagreements': 2,
    'Sharing Habits': 3,  
}

### Calculating compatibility scores - 

In [6]:
def calculate_compatibility_score(student1, student2):
    score = 0
    for factor in weights:
        if factor == 'Nationality':  # Special handling for nationality
            common_nationalities = set(student1['Nationality']).intersection(set(student2['Nationality']))
            score += weights[factor] * (len(common_nationalities) > 0)
        else:
            score += weights[factor] * (student1[factor] == student2[factor])
    return score
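To make the scoring concrete, here is a small worked example of the same weighted exact-match rule on two made-up students. It is a sketch using only a subset of the weights and invented values, not real form data.

```python
# A subset of the notebook's weights, kept small for readability.
weights = {'Cleanliness': 6, 'Dishes': 3, 'Sleeptime Weekdays': 4, 'Nationality': 1}


def calculate_compatibility_score(student1, student2):
    score = 0
    for factor, weight in weights.items():
        if factor == 'Nationality':
            # Nationality counts if the two students share at least one.
            shared = set(student1[factor]) & set(student2[factor])
            score += weight * bool(shared)
        else:
            # Every other factor counts only on an exact match.
            score += weight * (student1[factor] == student2[factor])
    return score


# Two invented students (illustrative values only).
alice = {'Cleanliness': 5, 'Dishes': 1, 'Sleeptime Weekdays': 4,
         'Nationality': ['Indian', 'French']}
bob = {'Cleanliness': 5, 'Dishes': 0, 'Sleeptime Weekdays': 3,
       'Nationality': ['French']}

score = calculate_compatibility_score(alice, bob)
max_score = sum(weights.values())
print(score, max_score)  # 7 14
```

Cleanliness matches (6) and one nationality is shared (1), giving 7 out of a possible 14. Because only exact matches count, sleeptimes of 4 and 3 contribute nothing, just as 1 and 6 would.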

### Implementing/Evaluating the Compatibility Matrices - 

In [7]:
num_students = len(data)
compatibility_matrix = np.zeros((num_students, num_students))

for i in range(num_students):
    for j in range(num_students):
        if i != j:
            compatibility_matrix[i, j] = calculate_compatibility_score(data.iloc[i], data.iloc[j])

### Clustering - 

In [8]:
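# n_init is set explicitly to keep the current default of 10 and silence scikit-learn's FutureWarning.
kmeans = KMeans(n_clusters=num_students // 4, n_init=10).fit(compatibility_matrix)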
data['Cluster'] = kmeans.labels_



### Defining and implementing the constraints required by the administration, then optimizing with Linear Sum Assignment (LSA) -

In [9]:
#Implementing the penalty constraint system.
cost_matrix = np.zeros((num_students, num_students))

for i in range(num_students):
    for j in range(num_students):
        if i != j:
            # Apply constraints
            if data.iloc[i]['Sex'] != data.iloc[j]['Sex']:
                cost_matrix[i, j] = float('inf')
            elif data.iloc[i]['French National'] and data.iloc[j]['French National']:
                cost_matrix[i, j] = float('inf')
            elif data.iloc[i]['French Speaker'] and data.iloc[j]['French Speaker']:
                cost_matrix[i, j] += 8  #could have been 5 or 10, maybe 5 would have worked better but I decided to settle for 8 based on the data I had.
            elif len(set(data.iloc[i]['Nationality']).intersection(set(data.iloc[j]['Nationality']))) > 0:
                cost_matrix[i, j] += 10
            else:
                cost_matrix[i, j] -= compatibility_matrix[i, j]

row_ind, col_ind = linear_sum_assignment(cost_matrix)
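For readers new to `linear_sum_assignment`, the sketch below runs it on a tiny made-up compatibility matrix. It minimises total cost, so the scores are negated, and the diagonal is set to infinity to forbid matching a student with themselves. That last step is an assumption of this example: in the cell above the diagonal is left at 0.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Invented symmetric compatibility scores for four students (higher is better).
compatibility = np.array([
    [0, 9, 1, 2],
    [9, 0, 3, 1],
    [1, 3, 0, 8],
    [2, 1, 8, 0],
], dtype=float)

# The solver minimises, so negate the scores to favour compatible pairs.
cost = -compatibility
# Forbid self-matches by making them infinitely expensive.
np.fill_diagonal(cost, np.inf)

row_ind, col_ind = linear_sum_assignment(cost)
for i, j in zip(row_ind, col_ind):
    print(f"student {i} -> student {j} (score {compatibility[i, j]:.0f})")

total = compatibility[row_ind, col_ind].sum()
print("total score:", total)
```

For a square cost matrix `row_ind` is always `0, 1, ..., n-1`, so the actual matching is carried by `col_ind`. Here students 0 and 1 are matched with each other, as are 2 and 3.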

### Extra code for ensuring at least one French National per room (which I didn't use because of the trends in the data I was working with) - 

In [None]:
'''
# Ensure at least one French National in each room
clusters = {i: [] for i in range(num_students // 4)}
for i in range(len(row_ind)):
    clusters[data.iloc[row_ind[i]]['Cluster']].append(data.iloc[row_ind[i]])

# Check and adjust clusters to ensure at least one French National in each room
for cluster_id, students in clusters.items():
    if not any(student['French National'] for student in students):
        # Find a French National to swap in
        for other_cluster_id, other_students in clusters.items():
            if other_cluster_id != cluster_id and any(student['French National'] for student in other_students):
                french_student = next(student for student in other_students if student['French National'])
                other_students.remove(french_student)
                students.append(french_student)
                break
'''
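Before swapping anyone around, it can help to simply list which rooms break the rule. The sketch below assumes rooms are stored as a mapping from room name to a list of student records; `rooms_missing_french_national` is a hypothetical helper, and the records are invented.

```python
def rooms_missing_french_national(rooms):
    """Return the names of rooms that contain no French national.

    `rooms` maps a room name to a list of student records. Plain dicts are
    used here; rows of the DataFrame could be indexed the same way.
    Hypothetical helper, not part of the notebook.
    """
    return [
        name for name, students in rooms.items()
        if not any(student['French National'] for student in students)
    ]


# Invented example data.
rooms = {
    'Group 1': [{'Name, Surname': 'A', 'French National': 1},
                {'Name, Surname': 'B', 'French National': 0}],
    'Group 2': [{'Name, Surname': 'C', 'French National': 0},
                {'Name, Surname': 'D', 'French National': 0}],
}

missing = rooms_missing_french_national(rooms)
print(missing)  # ['Group 2']
```

Running the same check again after any swap confirms that the donor room still has a French national of its own.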

### Forming the groups of people (to be placed in the apartments), finally the good part - 

In [10]:
groups = {}
for i in range(0, len(row_ind), 4):
    group = []
    for j in range(4):
        if i + j < len(row_ind):
            group.append(data.iloc[row_ind[i + j]]['Name, Surname'])
    groups[f'Group {i // 4 + 1}'] = group

# As mentioned in the README, I tuned some factors specifically for the data I had, so there are some warning messages here; you can safely ignore them when working with large real-world data.
# If you applied the code for ensuring one French national per room,
# comment out the code above in this cell and run the code below instead -

'''
groups = {}
for i, (cluster_id, students) in enumerate(clusters.items()):
    group = [student['Name, Surname'] for student in students]
    groups[f'Group {i + 1}'] = group
'''

### Print the groups!

In [11]:
for group, members in groups.items():
    print(f"{group}: {', '.join(members)}")

Group 1: John Doe, Jane Smith, Alice Johnson, Bob Brown
Group 2: Charlie Davis, Eve Wilson, Frank Harris, Grace Lee


### Save the groups to a new Excel file with a name of your choice!

In [19]:
output = pd.DataFrame.from_dict(groups, orient='index').transpose()
output.to_excel('room_allocation_results.xlsx', index=False)