

# Mentor-Mentee Match via Vectorization, PCA (Principal Component Analysis), Clustering, and Test Search for the Best Model

# This notebook is Part 1: Vectorization

# Goal: model the possible matches between mentors and mentees, where the end result may pair one mentee with several mentor candidates, one mentor with several mentee candidates, or several mentors and mentees in the same cluster. Vectorization and PCA prepare the data so that mentors and mentees with similar interests are clustered into the same similarity group, taking into account gender, time zone, life interests, and tech stack. Each cluster is one similarity group; the client can then choose the final match manually within a cluster, since the client has said that, for various reasons, they still prefer human judgment for the final decision. The benefit of this model is that one mentor may have the bandwidth to coach several mentees, and one mentee may have several candidate mentors; the match-maker gets these options by reading the members of a single cluster.
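The hand-off to the match-maker described above can be sketched as follows. This is a minimal illustration, not the notebook's actual output: the `cluster` column and its values are hypothetical (real labels would come from the clustering step in a later part), and `candidates_by_cluster` is an invented helper name.

```python
import pandas as pd

# Hypothetical cluster labels; in the real workflow these would come from
# the clustering step, which is not part of this notebook.
people = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Mentor": [1, 0, 1, 0, 0, 1],
    "Mentee": [0, 1, 0, 1, 1, 0],
    "cluster": [0, 0, 1, 1, 0, 1],
})

def candidates_by_cluster(frame):
    """Return {cluster: {"mentors": [...], "mentees": [...]}} using the ID column."""
    shortlist = {}
    for label, group in frame.groupby("cluster"):
        shortlist[int(label)] = {
            "mentors": group.loc[group["Mentor"] == 1, "ID"].tolist(),
            "mentees": group.loc[group["Mentee"] == 1, "ID"].tolist(),
        }
    return shortlist

shortlist = candidates_by_cluster(people)
print(shortlist)
```

Each entry lists the mentors and mentees that fall in one similarity group, which is the set of options the match-maker would choose from.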



In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 500)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering
#from sklearn.metrics import calinski_harabasz_score, silhouette_score, davies_bouldin_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from tqdm.notebook import tqdm

In [None]:
# Mock-up dataset in a CSV file with 10 observations
# The client's Excel file could also be converted to CSV for modeling experiments

raw_df = pd.read_csv('Mentor_Mentee.csv')

# The dataset (which could come from the client's Excel file) consists of personal info; tech stack ratings on a scale of 1 to 10 (interest in learning for a mentee, proficiency for a mentor); bios, which could come from a check-box survey or be written in each person's own words; and ratings of various life interests on a scale of 1 to 10.

In [None]:
raw_df.head()

Unnamed: 0,ID,First Name,Last Name,Gender,Time zone,Bios,Python,SQL,iOS,Android,CSS,HTML,JavaScript,Movies,TV,Religion,Music,Sports,Books,Video Games,Mentor,Mentee
0,1,John,Smith,M,mountain,Typical twitter fanatic. Infuriatingly humble thinker. Lifelong coffee practitioner. Organizer,10,8,0,0,5,10,10,5,3,4,1,3,6,7,1,0
1,2,Jane,Doe,F,mountain,web junkie. Analyst. Infuriatingly humble introvert. Food nerd. Lifelong music fanatic. Coffee lover,10,10,0,0,0,9,9,7,9,5,1,9,4,10,0,1
2,3,Mike,Masters,M,central,Avid web maven. Food practitioner. Gamer. Twitter fanatic. Pop culture scholar. Zombie evangelist.,10,9,8,8,9,9,10,1,2,6,5,6,5,10,1,0
3,4,Tim,Andersen,M,eastern,Twitteraholic. Extreme web fanatic. Food buff. Infuriatingly humble entrepreneur.,5,5,0,0,10,10,10,5,2,7,8,2,6,6,1,0
4,5,Miles,Tam,M,pacific,Bacon enthusiast. Falls down a lot. Freelance social media fan. Infuriatingly humble introvert.,0,0,0,0,10,10,10,6,6,6,4,3,6,3,0,1


# Start preparing vectorization

In [None]:
df = raw_df.drop(['ID', 'First Name', 'Last Name', 'Gender', 'Time zone'], axis=1)

In [None]:
df.head()

Unnamed: 0,Bios,Python,SQL,iOS,Android,CSS,HTML,JavaScript,Movies,TV,Religion,Music,Sports,Books,Video Games,Mentor,Mentee
0,Typical twitter fanatic. Infuriatingly humble thinker. Lifelong coffee practitioner. Organizer,10,8,0,0,5,10,10,5,3,4,1,3,6,7,1,0
1,web junkie. Analyst. Infuriatingly humble introvert. Food nerd. Lifelong music fanatic. Coffee lover,10,10,0,0,0,9,9,7,9,5,1,9,4,10,0,1
2,Avid web maven. Food practitioner. Gamer. Twitter fanatic. Pop culture scholar. Zombie evangelist.,10,9,8,8,9,9,10,1,2,6,5,6,5,10,1,0
3,Twitteraholic. Extreme web fanatic. Food buff. Infuriatingly humble entrepreneur.,5,5,0,0,10,10,10,5,2,7,8,2,6,6,1,0
4,Bacon enthusiast. Falls down a lot. Freelance social media fan. Infuriatingly humble introvert.,0,0,0,0,10,10,10,6,6,6,4,3,6,3,0,1


In [None]:
def string_convert(x):
    """
    Join a list into one space-separated string; return any other value unchanged
    """
    if isinstance(x, list):
        return ' '.join(x)
    else:
        return x

# Looping through the columns and applying the function
for col in df.columns:
    df[col] = df[col].apply(string_convert)
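Because the mock-up data is read from a CSV file, none of its columns hold Python lists, so the loop above leaves it unchanged. A minimal sketch of what the function does when a list-valued column is present, using an invented `Hobbies` column that is not part of the dataset:

```python
import pandas as pd

def string_convert(x):
    """Join a list into one space-separated string; return anything else unchanged."""
    if isinstance(x, list):
        return ' '.join(x)
    return x

# Hypothetical frame: 'Hobbies' is an invented list-valued column
demo = pd.DataFrame({
    "Hobbies": [["coffee", "travel"], ["music"]],
    "Python": [10, 5],
})

# Same loop as in the notebook: apply the conversion column by column
for col in demo.columns:
    demo[col] = demo[col].apply(string_convert)

print(demo)
```

The list entries become plain strings that a text vectorizer can consume, while the numeric column passes through untouched.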

In [None]:
df

Unnamed: 0,Bios,Python,SQL,iOS,Android,CSS,HTML,JavaScript,Movies,TV,Religion,Music,Sports,Books,Video Games,Mentor,Mentee
0,Typical twitter fanatic. Infuriatingly humble thinker. Lifelong coffee practitioner. Organizer,10,8,0,0,5,10,10,5,3,4,1,3,6,7,1,0
1,web junkie. Analyst. Infuriatingly humble introvert. Food nerd. Lifelong music fanatic. Coffee lover,10,10,0,0,0,9,9,7,9,5,1,9,4,10,0,1
2,Avid web maven. Food practitioner. Gamer. Twitter fanatic. Pop culture scholar. Zombie evangelist.,10,9,8,8,9,9,10,1,2,6,5,6,5,10,1,0
3,Twitteraholic. Extreme web fanatic. Food buff. Infuriatingly humble entrepreneur.,5,5,0,0,10,10,10,5,2,7,8,2,6,6,1,0
4,Bacon enthusiast. Falls down a lot. Freelance social media fan. Infuriatingly humble introvert.,0,0,0,0,10,10,10,6,6,6,4,3,6,3,0,1
5,Pop culture junkie. TV buff. Reader. Friendly travel expert. Incurable twitteraholic. Social media ninja. Coffee enthusiast. Internet specialist,0,0,10,10,0,0,0,0,5,7,5,9,2,0,1,0
6,Typical thinker. Amateur explorer. Reader. Extreme student. TV fanatic. Social media ninja.,0,0,10,10,0,0,0,5,4,7,5,0,8,4,0,1
7,zombit marven. Travel geek. Professional social media buff. Avid pop culture lover.,7,7,0,0,10,10,10,9,8,8,2,0,0,5,0,1
8,lifelong introvert. General travel maven. Hipster-friendly web trailblazer. Writter. Alcohol fan. Student. Communicator. Coffee guru.,10,10,5,5,5,9,9,8,8,2,6,1,2,9,0,1
9,travel ninja. Amateur pop culture evangelist. Web fanatic. Freelance communicator. Zombie geek.,0,0,10,10,10,10,10,9,0,2,2,5,6,8,1,0


# Vectorization

In [None]:
def vectorization(df, columns):
    """
    Recursively vectorize the leading columns of df.

    Only the first column is inspected on each call, so the recursion stops as
    soon as the first column is not one of the listed categories. With 'Bios' as
    the first column of this dataset, only 'Bios' ends up being vectorized.
    """
    column_name = columns[0]

    # Stop once the first column is not a category to vectorize
    if column_name not in ['Bios', 'Movies', 'Religion', 'Music', 'Books', 'Sports']:
        return df

    if column_name in ['Religion']:
        # Encode a categorical column as integer codes (requires a category dtype)
        df[column_name.lower()] = df[column_name].cat.codes

        df = df.drop(column_name, axis=1)

        return vectorization(df, df.columns)

    else:
        # Instantiating the Vectorizer
        vectorizer = CountVectorizer()

        # Fitting the vectorizer to the text column
        x = vectorizer.fit_transform(df[column_name])

        # Creating a new DF that contains the vectorized words
        df_wrds = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())

        # Concatenating the words DF with the original DF
        new_df = pd.concat([df, df_wrds], axis=1)

        # Dropping the column because it has been replaced by its vectorized form
        new_df = new_df.drop(column_name, axis=1)

        return vectorization(new_df, new_df.columns)

In [None]:
# Creating the vectorized DF
vect_df = vectorization(df, df.columns)



# Scaling the data

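# Scaling the categories helps the clustering algorithm's performance: distance-based clustering would otherwise be dominated by the columns with the largest ranges, such as the 0 to 10 ratings next to the word counts. It may also reduce the time it takes to fit the clustering algorithm to the dataset.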
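As a rough illustration of what min-max scaling does, the sketch below rescales a tiny made-up frame by hand with (x - min) / (max - min) per column and compares the result with `MinMaxScaler` at its default range of 0 to 1. The `demo` values are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented values: one 0-10 rating column and one 0/1 word-count column
demo = pd.DataFrame({"Python": [0, 5, 10], "coffee": [0, 1, 1]})

# Manual min-max scaling, computed per column
manual = (demo - demo.min()) / (demo.max() - demo.min())

# The same transformation through scikit-learn, wrapped back into a DataFrame
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(demo),
    index=demo.index,
    columns=demo.columns,
)

print(manual)
print(scaled)
```

After scaling, both columns lie between 0 and 1, so the rating column no longer outweighs the word count in a distance calculation. Note that the manual formula would divide by zero for a constant column, a case this sketch does not handle.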


In [None]:
scaler = MinMaxScaler()

# Rescale every column to the [0, 1] range, keeping the original index and column names
vect_df = pd.DataFrame(scaler.fit_transform(vect_df), index=vect_df.index, columns=vect_df.columns)

In [None]:
# Creating a new DF that contains the vectorized words
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df['Bios'])
df_wrds = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names_out())
df_wrds



Unnamed: 0,alcohol,amateur,analyst,avid,bacon,buff,coffee,communicator,culture,down,enthusiast,entrepreneur,evangelist,expert,explorer,extreme,falls,fan,fanatic,food,freelance,friendly,gamer,geek,general,guru,hipster,humble,incurable,infuriatingly,internet,introvert,junkie,lifelong,lot,lover,marven,maven,media,music,nerd,ninja,organizer,pop,practitioner,professional,reader,scholar,social,specialist,student,thinker,trailblazer,travel,tv,twitter,twitteraholic,typical,web,writter,zombie,zombit
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0
1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,1,1,1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
2,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0
3,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0
4,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,1,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,1,0,0,1,0,1,0,0,1,0,1,1,0,0,0,1,1,0,1,0,0,0,0,0
6,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0,1,1,0,0,1,0,0,1,0,0,0,0
7,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1
8,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,1,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,1,1,0,0
9,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0


In [None]:
# Concatenating the words DF with the original DF
new_df = pd.concat([df, df_wrds], axis=1)

# Dropping Bios because it is no longer needed now that it has been vectorized
new_df.drop('Bios', axis=1, inplace=True)

# Viewing the new DF
new_df

Unnamed: 0,Python,SQL,iOS,Android,CSS,HTML,JavaScript,Movies,TV,Religion,Music,Sports,Books,Video Games,Mentor,Mentee,alcohol,amateur,analyst,avid,bacon,buff,coffee,communicator,culture,down,enthusiast,entrepreneur,evangelist,expert,explorer,extreme,falls,fan,fanatic,food,freelance,friendly,gamer,geek,general,guru,hipster,humble,incurable,infuriatingly,internet,introvert,junkie,lifelong,lot,lover,marven,maven,media,music,nerd,ninja,organizer,pop,practitioner,professional,reader,scholar,social,specialist,student,thinker,trailblazer,travel,tv,twitter,twitteraholic,typical,web,writter,zombie,zombit
0,10,8,0,0,5,10,10,5,3,4,1,3,6,7,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0
1,10,10,0,0,0,9,9,7,9,5,1,9,4,10,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,1,1,1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
2,10,9,8,8,9,9,10,1,2,6,5,6,5,10,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0
3,5,5,0,0,10,10,10,5,2,7,8,2,6,6,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0
4,0,0,0,0,10,10,10,6,6,6,4,3,6,3,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,10,10,0,0,0,0,5,7,5,9,2,0,1,0,0,0,0,0,0,1,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,1,0,0,1,0,1,0,0,1,0,1,1,0,0,0,1,1,0,1,0,0,0,0,0
6,0,0,10,10,0,0,0,5,4,7,5,0,8,4,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0,1,1,0,0,1,0,0,1,0,0,0,0
7,7,7,0,0,10,10,10,9,8,8,2,0,0,5,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1
8,10,10,5,5,5,9,9,8,8,2,6,1,2,9,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,1,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,1,1,0,0
9,0,0,10,10,10,10,10,9,0,2,2,5,6,8,1,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0


In [None]:
new_df.shape

(10, 78)
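The combined frame contains rating columns such as `TV` and `Music` alongside bio-word columns `tv` and `music`, which differ only by case. A minimal sketch, assuming one wants to keep the two kinds of features easy to tell apart, is to prefix the word columns before concatenating. The `bio_` prefix and the small frames below are hypothetical.

```python
import pandas as pd

# Invented miniature versions of the rating columns and the bio word counts
ratings = pd.DataFrame({"TV": [3, 9], "Music": [1, 1]})
bio_words = pd.DataFrame({"tv": [0, 1], "music": [1, 0]})

# Prefix the word columns so they cannot be confused with the rating columns
combined = pd.concat([ratings, bio_words.add_prefix("bio_")], axis=1)
print(combined.columns.tolist())
```

This changes only the column names; the values and the shape of the combined frame stay the same.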