# Clustering For Match Making

### Remainder Problem
Instead of leaving 1 user grouped with 3 bots, shrink the last few groups so that no group needs more than 1 bot. For example, with a remainder of 1, make the last three groups all have 3 users and 1 bot.
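
The arithmetic behind this idea can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `group_sizes` and `max_size` are hypothetical names, and the sketch simply spreads the users as evenly as possible over the smallest number of groups.

```python
def group_sizes(n_users, max_size=4):
    """Return the number of real users per group (hypothetical helper).

    Users are spread as evenly as possible over ceil(n_users / max_size)
    groups, so the shortfall is shared instead of landing on one group.
    """
    if n_users <= 0:
        return []
    n_groups = -(-n_users // max_size)        # ceiling division
    base, extra = divmod(n_users, n_groups)   # 'extra' groups get one more user
    return [base + 1] * extra + [base] * (n_groups - extra)


for n in (13, 14, 15, 16):
    sizes = group_sizes(n)
    bots = [4 - s for s in sizes]             # bots needed to fill each group
    print(n, sizes, bots)
```

With 13 users (remainder 1) this gives one group of 4 and three groups of 3, so each of the last three groups needs a single bot.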

### Clustering algorithms
Split based on ranked squad_score: Simplest MVP
- Ended up implementing this method due to time constraints. The other methods use all of the metrics, which would have required setting up a DS database that we didn't have implemented at the time.

K Means: Doesn't split into equal-sized groups like we need, so cluster into 5 similar tiers and then group within each cluster based on Squad Score.

KNN: Select a random user, grab the 3 closest neighbors, repeat until everyone is matched

Agglomerative clustering: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

linear sum assignment: https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.linear_sum_assignment.html

In [12]:
# Imports
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

In [3]:
# Read in metrics csv
df = pd.read_csv('story_squad_metrics.csv', index_col='Unnamed: 0')
df.head()

Unnamed: 0,story_id,story_length,avg_word_len,quotes_num,unique_words_num,squad_score
0,3132,1375,5.092593,6,138,39.177001
1,3104,903,4.961538,0,110,26.173076
2,3103,750,5.0,1,93,24.497113
3,3117,439,4.877778,1,56,16.45433
4,3102,1812,4.897297,0,193,41.083353


### Using K Means Clustering
- Segment the users into different clusters based on all of the metrics - 5 'tiers'
- Within their clusters, group into groups of 4 based on their Squad Score

In [4]:
# Get clusters

# Instantiate scaler
scaler = StandardScaler()

# Pull out features
features = df.drop('story_id', axis=1)

# Scale data
norm_x = scaler.fit_transform(features)

# Instantiate model - 5 clusters
model= KMeans(n_clusters = 5)

# Predict clusters
df['cluster'] = model.fit_predict(norm_x)

# View df
df.head()

Unnamed: 0,story_id,story_length,avg_word_len,quotes_num,unique_words_num,squad_score,cluster
0,3132,1375,5.092593,6,138,39.177001,3
1,3104,903,4.961538,0,110,26.173076,0
2,3103,750,5.0,1,93,24.497113,0
3,3117,439,4.877778,1,56,16.45433,0
4,3102,1812,4.897297,0,193,41.083353,3


In [5]:
# Cluster distribution
df['cluster'].value_counts()

3    49
0    45
2    39
4    19
1    15
Name: cluster, dtype: int64

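The label number assigned to each cluster changes each time you run the model, because KMeans starts from randomly initialized centroids. In our runs the distribution of cluster sizes stayed the same. Passing `random_state` to `KMeans` makes the labels reproducible.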
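
To check whether two runs produced the same clustering despite different label numbers, the labels can be renumbered in order of first appearance before comparing. A minimal sketch; `canonical_labels` is a hypothetical helper, and the second label list is made up for illustration:

```python
def canonical_labels(labels):
    """Renumber cluster labels by order of first appearance (hypothetical helper)."""
    mapping = {}
    renumbered = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
        renumbered.append(mapping[label])
    return renumbered


run_a = [3, 0, 0, 0, 3]   # the 'cluster' column shown in df.head() above
run_b = [1, 2, 2, 2, 1]   # made-up labels from an imagined second run
print(canonical_labels(run_a) == canonical_labels(run_b))
```

Two runs that split the rows the same way compare equal here even if KMeans numbered the clusters differently.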

# MVP Clustering from Squad Scores

Grouping function:
- uses the remainder from len(df) % 4 to decide how to group the last users
- edge case conditions handle cohorts with fewer than 6 users, which would otherwise return some keys with empty values

In [6]:
def group_4(df):
    '''
    Function to split given dataframe into groups of 4 based on their ranked squad_scores
    When the number of users is not evenly divisible by 4, it will split the remainder so there is never more than 1 computer user in a group, unless there are exactly 5 or fewer than 3 users.

        Input: df to be grouped containing the column 'squad_score' and 'story_id'
        Output: Dictionary of groupings. {group #: list of story_id's}

    '''
    # Rank by squad_score
    df = df.sort_values(by= ['squad_score'], ascending= False)

    # Initial variables
    split = len(df) // 4
    remainder = len(df) % 4
    group_dict = {}

    # Edge cases:
    # - fewer than 4 users: they are all in one group
    # - exactly 5 users: one group of 3 and one group of 2
    if len(df) == 5:
        group_dict[1] = list(df['story_id'][:3])
        group_dict[2] = list(df['story_id'][3:])
        return group_dict
    
    if len(df) < 4:
        group_dict[1] = list(df['story_id'][:])
        return group_dict

    # If the remainder is 3 -> last group will be a group of 3 users
    if remainder == 3:
        for i in range(split):
            # Group by top 4 squad scores
            group_dict[i+1] = list(df['story_id'][:4])
            # Drop stories you have grouped already 
            df = df[4:]
        
        # Final group is the last 3 remainders
        group_dict[split +1] = list(df['story_id'][:])
        return group_dict

    # If the remainder is 2 -> last 2 groups will be groups of 3
    elif remainder == 2:
        # Leave the last 2 groups to split into 2 groups of 3
        for i in range(split -1):
            # Group by top 4 squad scores
            group_dict[i+1] = list(df['story_id'][:4])
            # Drop stories you have grouped already
            df = df[4:]

        # The last two groups will be groups of 3
        group_dict[split] = list(df['story_id'][:3])
        group_dict[split + 1] = list(df['story_id'][3:])
        return group_dict

    # If the remainder is 1 -> last 3 groups will be groups of 3
    elif remainder == 1:
        # Leave the last 3 groups to be split into 3 groups of 3
        for i in range(split -2):
            # Group by top 4 squad scores
            group_dict[i+1] = list(df['story_id'][:4])
            # Drop stories you have already grouped
            df = df[4:]

        # The last three groups as groups of 3
        group_dict[split -1] = list(df['story_id'][:3])
        group_dict[split] = list(df['story_id'][3:6])
        group_dict[split + 1] = list(df['story_id'][6:])
        return group_dict
    
    # If the remainder is 0, split evenly by 4
    elif remainder == 0:
        for i in range(split):
            # Group by top 4 squad scores
            group_dict[i+1] = list(df['story_id'][:4])
            # Drop stories you have already grouped
            df = df[4:]
        return group_dict

    else:
        return 'Invalid number of remaining users'

In [7]:
# Extract each cluster 'tier'
first = df[df['cluster']== 0]
second = df[df['cluster']== 1]
third = df[df['cluster']== 2]
fourth = df[df['cluster']== 3]
fifth = df[df['cluster']== 4]

In [10]:
# Create all the cluster dictionaries
first_cluster = group_4(first)
second_cluster = group_4(second)
third_cluster = group_4(third)
fourth_cluster = group_4(fourth)
fifth_cluster = group_4(fifth)

In [11]:
# View each cluster and their groupings
print(f'First Cluster: {first_cluster}')
print(f'Second Cluster: {second_cluster}')
print(f'Third Cluster: {third_cluster}')
print(f'Fourth Cluster: {fourth_cluster}')
print(f'Fifth Cluster: {fifth_cluster}')

First Cluster: {1: [5209, 3216, 5123, 3215], 2: [5229, 5120, 5102, 3247], 3: [3128, 5232, 5243, 3210], 4: [3123, 3116, 5258, 3111], 5: [3223, 5207, 3227, 3104], 6: [5110, 3225, 3125, 3103], 7: [3207, 3238, 3124, 3113], 8: [3222, 3206, 3131, 3107], 9: [3214, 3226, 3236, 5233], 10: [3119, 3235, 3121], 11: [3127, 3110, 3117], 12: [3202, 3240, 3229]}
Second Cluster: {1: [5234, 5235, 5213, 5219], 2: [5107, 5215, 5117, 5118], 3: [5129, 5227, 5119, 5222], 4: [5262, 5247, 5206]}
Third Cluster: {1: [5244, 5122, 3129, 5202], 2: [3234, 3246, 3221, 5256], 3: [3248, 3203, 3243, 5103], 4: [3217, 5238, 5126, 5112], 5: [3237, 3218, 5216, 5264], 6: [3208, 5115, 5101, 3205], 7: [3241, 3105, 3244, 5104], 8: [3108, 5116, 3126, 5214], 9: [5109, 3112, 3106, 3120], 10: [3231, 3201, 3228]}
Fourth Cluster: {1: [5248, 5257, 5111, 5261], 2: [5241, 5217, 5208, 5230], 3: [5105, 3115, 5121, 3219], 4: [5113, 5204, 5251, 5259], 5: [3204, 5218, 5255, 5203], 6: [3122, 5237, 3109, 5210], 7: [5125, 5246, 3102, 5108], 8: 

In [48]:
# Testing edge cases
remainder_0 = first[3:]
remainder_1 = first[2:]
remainder_2 = first[1:]
remainder_3 = first
small_num_9 = first[-9:]
small_num_5 = first[-5:]
small_num_4 = first[-4:]
small_num_3 = first[-3:]
small_num_2 = first[-2:]
small_num_1 = first[-1:]

In [59]:
clust_dict = group_4(small_num_1)
clust_dict

{0: [5104]}

In [15]:
# Sanity check - each story from the cluster was grouped
# (assumes clust_dict = group_4(first))
unique = set()

for inner_list in clust_dict.values():
    for item in inner_list:
        unique.add(item)
    
len(unique) == len(first)

True
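
The same sanity check can be wrapped in a reusable function that also checks group sizes. This is a sketch under assumptions: `check_groups` is a hypothetical helper, and it expects the `{group #: list of story_id's}` dictionary shape that `group_4` returns. The sample data is the Second Cluster output printed above.

```python
def check_groups(groups, expected_ids, max_size=4):
    """Return a list of problems found in a groupings dict (hypothetical helper)."""
    problems = []
    seen = [story_id for members in groups.values() for story_id in members]
    if len(seen) != len(set(seen)):
        problems.append('a story_id appears in more than one group')
    if set(seen) != set(expected_ids):
        problems.append('grouped ids do not match the expected ids')
    for key, members in groups.items():
        if not 1 <= len(members) <= max_size:
            problems.append(f'group {key} has {len(members)} members')
    return problems


second_cluster = {1: [5234, 5235, 5213, 5219], 2: [5107, 5215, 5117, 5118],
                  3: [5129, 5227, 5119, 5222], 4: [5262, 5247, 5206]}
expected = [sid for members in second_cluster.values() for sid in members]
print(check_groups(second_cluster, expected))   # an empty list means no problems
```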

## KNN
- Pull a random story, find the three nearest stories to create a group
- Drop the stories that you have already grouped
- TODO: Deal with the remainder problem

In [116]:
# Wrap in function that continues until the remainder problem
# Refit the data after you drop the grouped stories 
    # - otherwise it could suggest a story that we have already grouped and dropped

def group_nn(df):
    '''
    Function creates groups of four from the input df by using Nearest Neighbors
    Pulls one user, finds the three most similar users in the cohort to form the group
    Stops when fewer than 12 users are left; splitting the remainder so there is never more than 1 computer user in a group is still a TODO.

        Input: df to be grouped containing the column 'story_id'
        Output: dictionary of groupings {group ID #: list of story_id's}
    '''
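    # Empty dictionary to store the groupings
    groups = {}

    # Reset the index so the positions returned by kneighbors match the row labels
    df = df.reset_index(drop=True)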

    # Instantiate scaler
    scaler = StandardScaler()

    # Pull out features: every column except story_id
    # (this includes a 'cluster' column if one was added earlier)
    features = df.drop('story_id', axis=1)

    # Scale data
    norm_x = scaler.fit_transform(features)

    # Turn into df
    df_norm_x = pd.DataFrame(norm_x)

    # Instantiate model - groups of 4
    nn = NearestNeighbors(n_neighbors=4, algorithm='kd_tree')

    # Counter to use as key for groups in dictionary
    counter = 1

    # While loop that takes the top user and creates a group with its three closest users
    # Drops grouped users and continues until there are fewer than 12 users left to group
    # Remainder problem will be dealt with after the while loop runs
    while len(df_norm_x) >11:
        # Fit the nearest neighbors model
        nn.fit(df_norm_x)

        # Find nearest neighbors
        array_1, array_2 = nn.kneighbors([df_norm_x.iloc[0].values])

        # Put story_id list into groups dictionary
        groups[counter] = [df['story_id'][item] for item in array_2[0]]

        # Increment the counter
        counter += 1

        # Drop the users you have already grouped
        # From both df's that you are using
        df_norm_x = df_norm_x.drop(array_2[0])
        df = df.drop(array_2[0])

        # Reset the index
        # For both datasets that you are using
        df_norm_x.reset_index(inplace= True, drop= True)
        df.reset_index(inplace= True, drop= True)

    # TODO: Deal with remainders
    print(f"Remaining users: {len(df_norm_x)}")


    return groups

In [117]:
# Group 167 stories from db
nn_groups = group_nn(df)
nn_groups

Remaining users: 11


{1: [3132, 3220, 3245, 5132],
 2: [3104, 3111, 5110, 3124],
 3: [3103, 3222, 3238, 3207],
 4: [3117, 3127, 3110, 3119],
 5: [3102, 3211, 5113, 3101],
 6: [3105, 5101, 5104, 3208],
 7: [3129, 3234, 3221, 3246],
 8: [3116, 3223, 3247, 3128],
 9: [3118, 3232, 5249, 5225],
 10: [3120, 3112, 3231, 3201],
 11: [3121, 3202, 3235, 5233],
 12: [3126, 5109, 3108, 3241],
 13: [3131, 3214, 3206, 3227],
 14: [3109, 5218, 5203, 5210],
 15: [3107, 3226, 3113, 5258],
 16: [3106, 3228, 3205, 3237],
 17: [3130, 5106, 3209, 5224],
 18: [3115, 5241, 5251, 5237],
 19: [3123, 5232, 5207, 3210],
 20: [3125, 5120, 5123, 3215],
 21: [3122, 5255, 3204, 5208],
 22: [3114, 5260, 5240, 3213],
 23: [3216, 5102, 5243, 5229],
 24: [3229, 3225, 3240, 3236],
 25: [3218, 3217, 5264, 3203],
 26: [3243, 5126, 5112, 5256],
 27: [3244, 5115, 5116, 5216],
 28: [3219, 5105, 5217, 3239],
 29: [3212, 5242, 5245, 5220],
 30: [3224, 5263, 5114, 5125],
 31: [3248, 5103, 5244, 5122],
 32: [3230, 5257, 5111, 5205],
 33: [5254, 5130,

In [107]:
# Sanity check - each story from the cluster was grouped
unique = set()

for inner_list in nn_groups.values():
    for item in inner_list:
        unique.add(item)
    
len(unique) == len(df) - 11

True
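
One possible way to handle the remainder TODO is to decide every group size up front and let the nearest-neighbor loop fill groups of 4 or 3 accordingly. The sketch below is not the notebook's method: it uses plain Python with Euclidean distance instead of `NearestNeighbors`, assumes the features are already scaled, takes the first remaining story as the seed rather than a random one, and uses hypothetical names (`group_sizes`, `greedy_nn_groups`) and made-up points.

```python
import math


def group_sizes(n_users, max_size=4):
    """Spread n_users as evenly as possible over the fewest groups (hypothetical)."""
    if n_users <= 0:
        return []
    n_groups = -(-n_users // max_size)        # ceiling division
    base, extra = divmod(n_users, n_groups)
    return [base + 1] * extra + [base] * (n_groups - extra)


def greedy_nn_groups(points, max_size=4):
    """Group stories by repeatedly taking a seed and its nearest neighbors.

    points: {story_id: tuple of already-scaled features} (assumed input shape)
    Returns {group #: list of story_id's}.
    """
    remaining = dict(points)
    groups = {}
    for number, size in enumerate(group_sizes(len(points), max_size), start=1):
        seed_id = next(iter(remaining))       # first story still ungrouped
        seed = remaining.pop(seed_id)
        nearest = sorted(remaining,
                         key=lambda sid: math.dist(seed, remaining[sid]))[:size - 1]
        for sid in nearest:
            del remaining[sid]                # never suggest a grouped story again
        groups[number] = [seed_id] + nearest
    return groups


# Made-up 2-feature points for five stories
demo = {1: (0, 0), 2: (0, 1), 3: (10, 10), 4: (10, 11), 5: (0, 2)}
print(greedy_nn_groups(demo))
```

Because the sizes are fixed before the loop starts, every story ends up in a group and no separate remainder pass is needed. The trade-off is that the groups formed last may contain stories that are far apart.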