# CSUEB Data Science Club Fall 2020 Project

This project is for undergraduate and graduate students who are looking for an extracurricular project to sharpen their data science skills. The problem is based off a real mentorship program being offered by the San Francisco Professional Chapter of ALPFA in partnership with the CSUEB Student Chapter of ALPFA. The program was launched over the summer 2020 and will have recurring periodic enrollment moving forward, this project seeks to automate the matching of mentors/mentees, a process which is being done manually. The tasks will be broken up into three sections, beginning with the creation of our mock survey results below. The first task: Mentee Ranking, will be solved at our club's second live event later this semester, and the last task: Stable Matching, will be solved at our club's final event at the end of the semester. Direct any questions to info.csueb.dsc@gmail.com. Happy problem solving!

#### We begin by importing some basic packages

In [220]:
import random
import pandas as pd

This is a function to generate a list of 10 random numbers from 0 to a specified number "num":

In [221]:
def surveyCol(num):
    return [random.randint(0, num) for _ in range(10)]

This is script to create objects from a class called "Participants," these objects are structured like a dictionary with key-value pairs but will need to be converted to a "dict" type for us to perform dictionary operations on them.

In [222]:
class Participant:
    def __init__(self, name):
        self.name = name
        self.primary = surveyCol(5)
        self.ideal_match = surveyCol(5)
        self.level_of_importance = surveyCol(2)

Here is our lists of participating mentors and mentees.

In [223]:
mentor_names = ['Jose', 'Amanda', 'Francisco', 'Megan', 'Phil', 'Carla']
mentee_names = ['Chris', 'Kevin', 'Rachel', 'Monica', 'Emily', 'William']

These are two functions, the first takes dictionary structured objects converts them to a dictionary format, and will be called in our second function. "surveyGroup" takes a list of strings as an argument and creates a "Participant" object from each string, and calls the "convert" function on each object. A new list of dictionaries is returned.

In [224]:
def convert(dict):
    dict = dict.__dict__
    return dict

def surveyGroup(list):
    user_list = []
    for i in range(len(list)):
        user_list.append(convert(Participant(list[i])))
    return user_list

Here we pass our lists of participating mentor and mentee names to the above functions and get our new lists with each name, primary survey answers, ideal matches survey answers and a level of importance survey responses as keys with their respective values.

In [225]:
mentors = surveyGroup(mentor_names)
mentees = surveyGroup(mentee_names)

Here we print out our newly created lists

In [226]:
for mentor in mentors:
    print(mentor)

{'name': 'Jose', 'primary': [3, 3, 1, 0, 4, 2, 1, 2, 5, 2], 'ideal_match': [4, 1, 1, 5, 5, 2, 4, 4, 2, 4], 'level_of_importance': [2, 0, 2, 1, 0, 1, 0, 2, 1, 1]}
{'name': 'Amanda', 'primary': [3, 0, 5, 2, 1, 0, 2, 4, 4, 4], 'ideal_match': [1, 5, 2, 0, 1, 5, 3, 4, 4, 1], 'level_of_importance': [1, 2, 0, 1, 2, 2, 2, 0, 0, 0]}
{'name': 'Francisco', 'primary': [5, 0, 2, 1, 2, 1, 1, 5, 0, 5], 'ideal_match': [1, 5, 1, 5, 0, 5, 4, 0, 1, 4], 'level_of_importance': [0, 1, 2, 0, 2, 0, 0, 1, 2, 0]}
{'name': 'Megan', 'primary': [4, 3, 2, 5, 1, 5, 0, 3, 1, 0], 'ideal_match': [0, 0, 0, 0, 5, 2, 3, 5, 2, 2], 'level_of_importance': [1, 2, 2, 2, 0, 1, 0, 1, 0, 2]}
{'name': 'Phil', 'primary': [1, 3, 2, 4, 0, 5, 1, 1, 5, 0], 'ideal_match': [3, 4, 1, 1, 5, 2, 3, 1, 5, 3], 'level_of_importance': [1, 2, 0, 0, 0, 0, 2, 0, 0, 2]}
{'name': 'Carla', 'primary': [2, 2, 3, 3, 0, 5, 0, 5, 0, 4], 'ideal_match': [0, 4, 2, 4, 4, 5, 2, 3, 2, 3], 'level_of_importance': [1, 0, 2, 1, 0, 2, 2, 1, 1, 0]}


In [227]:
for mentee in mentees:
    print(mentee)

{'name': 'Chris', 'primary': [5, 5, 0, 2, 2, 4, 2, 3, 0, 5], 'ideal_match': [4, 3, 0, 1, 5, 5, 3, 1, 0, 3], 'level_of_importance': [1, 2, 1, 1, 2, 1, 2, 2, 0, 0]}
{'name': 'Kevin', 'primary': [1, 5, 3, 5, 1, 0, 3, 4, 3, 4], 'ideal_match': [2, 3, 1, 2, 0, 1, 5, 1, 5, 0], 'level_of_importance': [0, 0, 0, 2, 2, 0, 0, 0, 0, 1]}
{'name': 'Rachel', 'primary': [3, 5, 0, 3, 5, 3, 0, 1, 5, 1], 'ideal_match': [3, 2, 4, 0, 4, 4, 3, 4, 0, 0], 'level_of_importance': [0, 2, 0, 0, 2, 2, 2, 0, 1, 2]}
{'name': 'Monica', 'primary': [4, 4, 0, 0, 1, 2, 4, 0, 5, 0], 'ideal_match': [2, 1, 0, 2, 0, 3, 1, 5, 2, 3], 'level_of_importance': [0, 0, 2, 1, 1, 1, 1, 2, 0, 2]}
{'name': 'Emily', 'primary': [0, 4, 4, 1, 5, 4, 1, 3, 0, 4], 'ideal_match': [1, 5, 2, 1, 4, 4, 0, 0, 0, 1], 'level_of_importance': [2, 1, 1, 2, 1, 2, 0, 1, 2, 2]}
{'name': 'William', 'primary': [5, 2, 1, 2, 3, 0, 2, 5, 2, 5], 'ideal_match': [0, 0, 3, 4, 0, 4, 0, 4, 3, 1], 'level_of_importance': [2, 1, 2, 0, 2, 2, 0, 0, 1, 0]}


#### In this step we want to convert our list items(dictionaries), to data frames to make them easier to work with in performing analysis. This is a problem because the first key is not like the others in that it is a string and not a list of 10 integers. We remove it with the pop() function.

In [228]:
for mentor in mentors:
    mentor.pop('name')

Now we must assign the corresponding survey response key-value pairs to a variable with the participating mentor's name:

In [229]:
Jose = pd.DataFrame.from_dict(mentors[0])
Amanda = pd.DataFrame.from_dict(mentors[1])
Francisco = pd.DataFrame.from_dict(mentors[2])
Megan = pd.DataFrame.from_dict(mentors[3])
Phil = pd.DataFrame.from_dict(mentors[4])
Carla = pd.DataFrame.from_dict(mentors[5])

We repeat this process for the mentees:

In [230]:
for mentee in mentees:
    mentee.pop('name')

In [231]:
Chris = pd.DataFrame.from_dict(mentees[0])
Kevin = pd.DataFrame.from_dict(mentees[1])
Monica = pd.DataFrame.from_dict(mentees[2])
Rachel = pd.DataFrame.from_dict(mentees[3])
Emily = pd.DataFrame.from_dict(mentees[4])
William = pd.DataFrame.from_dict(mentees[5])

Finally we create a list of data frames to make parsing through them for analysis more efficient:

In [232]:
df_mentors = [Jose, Amanda, Francisco, Megan, Phil, Carla]
df_mentees = [Chris, Kevin, Monica, Rachel, Emily, William]

### Task 1: Create a compatibility ranking system for mentors & mentees and return a dictionary with the name of each mentor as the value and a sorted list of mentees matched from most compatible to least compatible. 

In [233]:
#Your code here
#Tip: Use the geometric mean of the mentor/mentee survey scores to determine compatibility score 
# used for ranking potential matches.

In [234]:
print(Jose['ideal_match'])
print(Jose.iloc[0:10,1])

0    4
1    1
2    1
3    5
4    5
5    2
6    4
7    4
8    2
9    4
Name: ideal_match, dtype: int64
0    4
1    1
2    1
3    5
4    5
5    2
6    4
7    4
8    2
9    4
Name: ideal_match, dtype: int64


In [235]:
print(Chris['primary'])

0    5
1    5
2    0
3    2
4    2
5    4
6    2
7    3
8    0
9    5
Name: primary, dtype: int64


In [236]:
#ListComp (list completion) function accepts two dataframes as arguments, and returns a single list.  For mentor's
#preferences and levels of importance matching with their ideal mentee, we would send the first mentor(Jose) 
#through the function 6 times, once per mentee.  Once this is done for all 6 mentors, we would do this for all
#6 mentees, eventually iterating this function 6^2 * 2 times (=72), for 10 questions each, for a grand total of 720 values.
def listComp(df1, df2):
    list_match = []
    for j in range(10): #for loop to iterate 10 times, the length of each individual's dictionary
        if df1.iloc[j, 2] == 2: #j represents the current iteration of the for loop.  the 2 represents the third 
            #column of the (index 2) of the first dataframe.  Therefore, we are reading in each number from the third
            #column as a loop.  If that value equals a 2, than that means that that person's level of importance is
            #maxed out, and therefore, we will end up squaring the value that ends up in that persons ideal match
            #column, which is indexed 1.
            if abs(df1.iloc[j,1] - df2.iloc[j,0]) == 0: #From here, we will subtract the current person's ideal 
                #match number by the match's primary value to see how far apart the two people are. Say the first 
                # question has to do with cleanliness.  If person A finds cleanliness very important, and gives it a 2,
                # and then gives an ideal match score of 5, and then person B also gives cleanliness a primary score
                # of 5, then we will meet the requirement of this line's code.
                list_match.append((df1.iloc[j, 1])**2) #Because it has a higher importance, we will take the value
                #of the current ideal match column and square it, and then append it to a list that we will 
                # eventually return back for each person.  So if they match, and the score is a 5, the returned score
                # for that question will be 25.
            elif abs(df1.iloc[j,1] - df2.iloc[j,0]) == 1:
                list_match.append(((df1.iloc[j, 1])**2)*0.5)#This mathematics can change for each person's matchmaking
                #algorithm.  It would take a lot of research and trial-and-error to really find a mathematical
                #algorithm that would connect ideal mentors to mentees, and luckily with Machine Learning, things like
                #that can be automated nowadays.  For my code, I decided that if the absolute value of the difference
                #between the ideal match score of person A and the primary score of person B is 1, than I can still 
                #square the value of the ideal match, but then I would multiply it in half.  Below, I do the same
                #with a difference of 2, but then I divide it by 4.
            elif abs(df1.iloc[j,1] - df2.iloc[j,0]) == 2:
                list_match.append(((df1.iloc[j, 1])**2)*0.25)
            else:
                list_match.append(0)#Here, if the difference of scores is over 2, I don't believe that the two scores
                #have anything in common, so we would just append a zero.
        elif df1.iloc[j, 2] == 1:  #If the level of importance of person A is a 1, then we would do the same
            #mathematics as above, but without sqaring anything first.  So if person A and person B had a 3 for ideal
            #match and primary columns, respectively, I would just append whatever value is in person A's ideal match
            #column.  This eliminates one level of mathematics and lowers the weight of importance of the value
            #we are appending.
            if abs(df1.iloc[j,1] - df2.iloc[j,0]) == 0:
                list_match.append((df1.iloc[j, 1]))
            elif abs(df1.iloc[j,1] - df2.iloc[j,0]) == 1:
                list_match.append(((df1.iloc[j, 1]))*0.5)
            elif abs(df1.iloc[j,1] - df2.iloc[j,0]) == 2:
                list_match.append(((df1.iloc[j, 1]))*0.25)
            else:
                list_match.append(0)
        else: #If the level of importance is a 0, we can just append a zero.  If it isn't important to the person, we
            #don't really need to add it to the grand sum.  Once again, if we were to tinker with the code, eventually
            #we might come up with something ideal, but we are showcasing the logic needed to come up with a working
            #algorithm.
            list_match.append(0)
    return list_match
    

In [237]:
listComp(Jose, Chris)#Example to show how this works
#As we see below, the 10th value is 12.5  Jose's level of importance for question 1 was 2, so we know we are gonna
#square something.  His ideal_match score is 5, and Chris' primary match score is 4.  With a difference of one, we
#are going to square the ideal_match number (5^2=25) and then multiply it by .5 to get 12.5...as you can see
#below.

[8.0, 0, 0.5, 0, 0, 0.5, 0, 8.0, 0.5, 2.0]

In [238]:
def geoMean(list1, list2): #Takes two lists as arguments.  This is to calculate geometric mean.
    match_score1 = sum(list1)/len(list1) #Our math is to take the sum of the values and divide it by 10 to get
    #the match score for person A.  We will repeat this below for person B.
    match_score2 = sum(list2)/len(list2)
    return (match_score1*match_score2)**0.5#Scores multiplied together, and then square rooted.

In [239]:
geoMean(listComp(Jose, Chris), listComp(Chris, Jose))

2.2839111191112496

In [240]:
def get_df_name(df): #This is a function to get the name of the person we are currently iterating on, and attaching
    #it to the outside of a dictionary, to create a dictionary of dictionary.
    name = [x for x in globals() if globals()[x] is df][0]#The globals method returns the dictionary of the current
    #global symbol table.  So if we are entering a dataframe of Jose, this function literally just returns
    #the string of "Jose".  The if statement makes sure this only works if the argument is a dataframe.  The "[0]"
    #assigns the value of column zero to name, and if we had previously set up all the previous functions to 
    #build the dictionaries/dataframes properly, this should execute.
    return name

In [241]:
def matching(df_list1, df_list2): #automated function that takes lists of dataframes of mentors and mentees
    #responses, and eventually we will have matching scores for each combo of mentor and mentee.
    dict1 = {} #two empty dictionaries.  dict1 will be outer dictionary that contains the mentor or mentee we
    #are currently finding the six matching scores for.  Person A's dictionary
    dict2 = {} #inner empty dictionary that will eventually contain the six matching scores that go with Person A.
    #Basically contains 6 person B's names and corresponding scores as a ('key',value) pair.
    for i in df_list1:#iterate i as outer loop
        for j in df_list2:#iterate j as inner loop
            dict2[get_df_name(j)] = round(geoMean(listComp(i, j), listComp(j, i)), 2) #first part attaches
            #person B's name as current key.  everything on right side of equal sign creates value to match
            #with current key. We are sending two lists into the geoMean function as arguments, and then rounding
            #that value by two decimals.
        dict1[get_df_name(i)] = sorted(dict2.items(), key=lambda x: x[1], reverse=True)#Outer loop attaches
        #person A's name as a key.  The key-value pair for the outer dictionary is (person A: dictionary of
        #matching values).  The second part of the code is a little complex, but we are sorting dict2 by x[1]'s
        #value, which would be whatever the score is.  Therefore, the final complete inner dictionaries will be
        #sorted in a descending order by match score.
    return dict1

In [242]:
optimal_mentor_matches = matching(df_mentors, df_mentees)

In [243]:
optimal_mentee_matches = matching(df_mentees, df_mentors)

In [244]:
optimal_mentor_matches

{'Jose': [('Chris', 2.28),
  ('Monica', 1.61),
  ('William', 1.23),
  ('Emily', 1.22),
  ('Rachel', 1.18),
  ('Kevin', 0.52)],
 'Amanda': [('Rachel', 1.87),
  ('Chris', 1.72),
  ('Monica', 1.32),
  ('Kevin', 1.2),
  ('Emily', 0.46),
  ('William', 0.42)],
 'Francisco': [('Rachel', 0.95),
  ('Monica', 0.63),
  ('Chris', 0.56),
  ('Emily', 0.35),
  ('Kevin', 0.33),
  ('William', 0.26)],
 'Megan': [('William', 0.85),
  ('Chris', 0.57),
  ('Emily', 0.57),
  ('Monica', 0.55),
  ('Rachel', 0.47),
  ('Kevin', 0.0)],
 'Phil': [('Emily', 1.7),
  ('Chris', 1.64),
  ('Monica', 1.27),
  ('William', 1.23),
  ('Rachel', 0.7),
  ('Kevin', 0.47)],
 'Carla': [('Chris', 1.54),
  ('Emily', 1.36),
  ('William', 1.29),
  ('Monica', 1.15),
  ('Rachel', 0.8),
  ('Kevin', 0.41)]}

In [245]:
optimal_mentee_matches

{'Chris': [('Jose', 2.28),
  ('Amanda', 1.72),
  ('Phil', 1.64),
  ('Carla', 1.54),
  ('Megan', 0.57),
  ('Francisco', 0.56)],
 'Kevin': [('Amanda', 1.2),
  ('Jose', 0.52),
  ('Phil', 0.47),
  ('Carla', 0.41),
  ('Francisco', 0.33),
  ('Megan', 0.0)],
 'Monica': [('Jose', 1.61),
  ('Amanda', 1.32),
  ('Phil', 1.27),
  ('Carla', 1.15),
  ('Francisco', 0.63),
  ('Megan', 0.55)],
 'Rachel': [('Amanda', 1.87),
  ('Jose', 1.18),
  ('Francisco', 0.95),
  ('Carla', 0.8),
  ('Phil', 0.7),
  ('Megan', 0.47)],
 'Emily': [('Phil', 1.7),
  ('Carla', 1.36),
  ('Jose', 1.22),
  ('Megan', 0.57),
  ('Amanda', 0.46),
  ('Francisco', 0.35)],
 'William': [('Carla', 1.29),
  ('Jose', 1.23),
  ('Phil', 1.23),
  ('Megan', 0.85),
  ('Amanda', 0.42),
  ('Francisco', 0.26)]}

### Task 2: Based on the sorted list of potential matches pair every mentor with their best available mentee match.

In [246]:
print(optimal_mentor_matches['Jose'][0])

('Chris', 2.28)


In [247]:
for key, value in optimal_mentor_matches.items():
    for k, v in value:
        print(key, k, v)

Jose Chris 2.28
Jose Monica 1.61
Jose William 1.23
Jose Emily 1.22
Jose Rachel 1.18
Jose Kevin 0.52
Amanda Rachel 1.87
Amanda Chris 1.72
Amanda Monica 1.32
Amanda Kevin 1.2
Amanda Emily 0.46
Amanda William 0.42
Francisco Rachel 0.95
Francisco Monica 0.63
Francisco Chris 0.56
Francisco Emily 0.35
Francisco Kevin 0.33
Francisco William 0.26
Megan William 0.85
Megan Chris 0.57
Megan Emily 0.57
Megan Monica 0.55
Megan Rachel 0.47
Megan Kevin 0.0
Phil Emily 1.7
Phil Chris 1.64
Phil Monica 1.27
Phil William 1.23
Phil Rachel 0.7
Phil Kevin 0.47
Carla Chris 1.54
Carla Emily 1.36
Carla William 1.29
Carla Monica 1.15
Carla Rachel 0.8
Carla Kevin 0.41


In [248]:
list1 = []
for key, value in optimal_mentor_matches.items():
    for k, v in value:
        for i in k:
            list2 = []
            list2.append(v)
            list2.append(key)
            list2.append(k)
        list1.append(list2)
print(list1)

[[2.28, 'Jose', 'Chris'], [1.61, 'Jose', 'Monica'], [1.23, 'Jose', 'William'], [1.22, 'Jose', 'Emily'], [1.18, 'Jose', 'Rachel'], [0.52, 'Jose', 'Kevin'], [1.87, 'Amanda', 'Rachel'], [1.72, 'Amanda', 'Chris'], [1.32, 'Amanda', 'Monica'], [1.2, 'Amanda', 'Kevin'], [0.46, 'Amanda', 'Emily'], [0.42, 'Amanda', 'William'], [0.95, 'Francisco', 'Rachel'], [0.63, 'Francisco', 'Monica'], [0.56, 'Francisco', 'Chris'], [0.35, 'Francisco', 'Emily'], [0.33, 'Francisco', 'Kevin'], [0.26, 'Francisco', 'William'], [0.85, 'Megan', 'William'], [0.57, 'Megan', 'Chris'], [0.57, 'Megan', 'Emily'], [0.55, 'Megan', 'Monica'], [0.47, 'Megan', 'Rachel'], [0.0, 'Megan', 'Kevin'], [1.7, 'Phil', 'Emily'], [1.64, 'Phil', 'Chris'], [1.27, 'Phil', 'Monica'], [1.23, 'Phil', 'William'], [0.7, 'Phil', 'Rachel'], [0.47, 'Phil', 'Kevin'], [1.54, 'Carla', 'Chris'], [1.36, 'Carla', 'Emily'], [1.29, 'Carla', 'William'], [1.15, 'Carla', 'Monica'], [0.8, 'Carla', 'Rachel'], [0.41, 'Carla', 'Kevin']]


In [249]:
list1.sort(key = lambda i: i[0], reverse=True)

In [250]:
list1

[[2.28, 'Jose', 'Chris'],
 [1.87, 'Amanda', 'Rachel'],
 [1.72, 'Amanda', 'Chris'],
 [1.7, 'Phil', 'Emily'],
 [1.64, 'Phil', 'Chris'],
 [1.61, 'Jose', 'Monica'],
 [1.54, 'Carla', 'Chris'],
 [1.36, 'Carla', 'Emily'],
 [1.32, 'Amanda', 'Monica'],
 [1.29, 'Carla', 'William'],
 [1.27, 'Phil', 'Monica'],
 [1.23, 'Jose', 'William'],
 [1.23, 'Phil', 'William'],
 [1.22, 'Jose', 'Emily'],
 [1.2, 'Amanda', 'Kevin'],
 [1.18, 'Jose', 'Rachel'],
 [1.15, 'Carla', 'Monica'],
 [0.95, 'Francisco', 'Rachel'],
 [0.85, 'Megan', 'William'],
 [0.8, 'Carla', 'Rachel'],
 [0.7, 'Phil', 'Rachel'],
 [0.63, 'Francisco', 'Monica'],
 [0.57, 'Megan', 'Chris'],
 [0.57, 'Megan', 'Emily'],
 [0.56, 'Francisco', 'Chris'],
 [0.55, 'Megan', 'Monica'],
 [0.52, 'Jose', 'Kevin'],
 [0.47, 'Megan', 'Rachel'],
 [0.47, 'Phil', 'Kevin'],
 [0.46, 'Amanda', 'Emily'],
 [0.42, 'Amanda', 'William'],
 [0.41, 'Carla', 'Kevin'],
 [0.35, 'Francisco', 'Emily'],
 [0.33, 'Francisco', 'Kevin'],
 [0.26, 'Francisco', 'William'],
 [0.0, 'Megan', '

In [251]:
def stable(list):
    new_mentors = []
    new_mentees = []
    new_final_list = []
    for i in list:
        if i[1] not in new_mentors and i[2] not in new_mentees:
            new_final_list.append(i)
            new_mentors.append(i[1])
            new_mentees.append(i[2])
            
    print(new_final_list)
                    
        
        

In [252]:
stable(list1)

[[2.28, 'Jose', 'Chris'], [1.87, 'Amanda', 'Rachel'], [1.7, 'Phil', 'Emily'], [1.29, 'Carla', 'William'], [0.63, 'Francisco', 'Monica'], [0.0, 'Megan', 'Kevin']]
