In [483]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.spatial
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import hamming, pdist, squareform
from sklearn.metrics import jaccard_score

In [484]:
df = pd.read_parquet("wbmasterf.parquet")
df["ageg"] = pd.to_numeric(df["ageg"])

# Part 3: Building the recommendation model

**Collaborative filtering**

We have 3 options:

- **User-based filtering:** find users whose interaction patterns are similar to our target user's, then recommend items that those similar users have interacted with.
<br/>

- **Item-based filtering:** identify items that are similar to the ones our target user has interacted with, then recommend those items to the target user.
<br/>

- A **hybrid** of the two
<br/>


We are building a hybrid.


## **1. User-based similarities**

Create a matrix with binary information on user interactions. Rows represent users, columns represent items.

In [485]:
user_item_matrix = df.pivot(index="child_id", columns="item_id", values="value")

# The order of the items gets jumbled up, so sort them again from 1 to 680
itemsorted = sorted(user_item_matrix.columns, key=lambda x: int(x.split("_")[1]))
user_item_matrix = user_item_matrix[itemsorted]

Adjust item IDs to match Python's zero-based indices. We now have IDs from 0 to 679 instead of 1 to 680.

In [486]:
# Subtract 1 from all item IDs in the user_item_matrix columns
user_item_matrix.columns = [f"item_{int(col.split('_')[1]) - 1}" if col.startswith("item_") else col for col in user_item_matrix.columns]

Input a test user.

In [487]:
target_age_group = 5 
target_sex = "Male"

Our EDA revealed that user characteristics, such as **`ageg`** and **`sex`**, have a big influence on user/child vocabulary. We therefore segment our data so that only the users in our database who match our hypothetical target user demographically are kept.

UX wants users to be able to opt to not share sex/gender information. If the user input for sex is passed on as `None`, all age-matched users are included, regardless of their sex.

In [488]:
# Filter our df so that it only includes information of age-matched users
filtered_users = df[df["ageg"] == target_age_group]

# Only filter on sex when it was shared (input is neither None nor the string "None")
if target_sex is not None and target_sex != "None":
    filtered_users = filtered_users[filtered_users["sex"].str.lower() == target_sex.lower()]

# Extract the IDs of the filtered users
filtered_user_ids = filtered_users["child_id"].unique()

# Create a boolean mask to filter user information for the demographically matched users
user_filter_mask = user_item_matrix.index.isin(filtered_user_ids)

# Apply the user filter to our "master" matrix
target_matrix = user_item_matrix[user_filter_mask]
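
The segmentation step could also be wrapped in a small helper so that the opt-out case is handled in one place. This is only a sketch: `segment_user_ids` is a hypothetical name, the toy frame is invented, and it assumes the opt-out arrives either as Python `None` or as the string `"None"`, with the `child_id`, `ageg` and `sex` columns used above.

```python
import pandas as pd

def segment_user_ids(frame, age_group, sex=None):
    """Return the IDs of users matching the age group and, if given, the sex."""
    matched = frame[frame["ageg"] == age_group]
    # Skip the sex filter when the user opted out (None or the string "None")
    if sex is not None and str(sex).lower() != "none":
        matched = matched[matched["sex"].str.lower() == str(sex).lower()]
    return matched["child_id"].unique()

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "child_id": [1, 1, 2, 3, 4],
    "ageg": [5, 5, 5, 5, 6],
    "sex": ["Male", "Male", "Female", "Male", "Male"],
})

ids_male = segment_user_ids(toy, 5, "Male")  # age- and sex-matched children
ids_all = segment_user_ids(toy, 5, None)     # age-matched children only
```

The returned array can then be passed to `user_item_matrix.index.isin(...)` exactly as in the cell above.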

Now we start working with our target user's interaction history.

In [489]:
user_interactions = np.zeros(680)
user_interactions[[0,1]] = 1

# Optionally, use one of the kids from our df
# test = df[df["child_id"] == 1]["value"]
# interacted_items = test
# user_interactions[interacted_items == 1] = 1

## **2. Item-based similarities**

We also want to take item similarity into account, so we need to create an item similarity matrix (or rather *dissimilarity*, since we are working with Jaccard's *distance*). Higher scores indicate less similarity.

Transpose the matrix. Rows now represent items and columns represent users. Then create a (dis)similarity matrix for the items.

In [490]:
target_item_matrix = target_matrix.T
itemsim = (scipy.spatial.distance.cdist(target_item_matrix.values, target_item_matrix.values, metric="jaccard"))
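
To make the metric concrete, here is a hand-written sketch of the Jaccard distance on a toy item-by-user matrix. For binary vectors with at least one interaction this is the quantity that `metric="jaccard"` computes: one minus the size of the intersection divided by the size of the union. The function name `jaccard_distance`, the toy matrix and the handling of two all-zero vectors are assumptions made for illustration.

```python
import numpy as np

def jaccard_distance(a, b):
    """Jaccard distance between two binary vectors: 1 - |intersection| / |union|."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    return 1.0 - np.logical_and(a, b).sum() / union

# Toy item-by-user matrix (rows = items, columns = users), invented for illustration
toy_items = np.array([
    [1, 1, 0, 0],  # item A: known by users 0 and 1
    [1, 0, 0, 0],  # item B: known by user 0 only
    [0, 0, 1, 1],  # item C: known by users 2 and 3
])

# Pairwise distances between all items, analogous to itemsim
toy_itemsim = np.array([[jaccard_distance(x, y) for y in toy_items] for x in toy_items])
```

Items A and B share one of two users, so their distance is 0.5. Item C shares no users with either, so its distance to both is 1.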

Initialize the nearest neighbors model. We opt for Jaccard's distance, a simple metric suitable for binary data. We fit the model to the **filtered** user-item interaction matrix.

In [491]:
knn = NearestNeighbors(n_neighbors=1, metric="jaccard", algorithm="brute")
knn.fit(target_matrix)

The process:

- Neighbours are identified, and their distances and indices are calculated
- The interaction history of these neighbours is collected
- Neighbour items that the user has already interacted with are dropped
- Item scores for the remaining neighbour items are calculated (for each neighbour item, the score is the mean distance of the respective item to all of the items the user has interacted with)
- These scores (weighted) are then combined with the distance score (weighted) of the neighbour whose interaction history they come from. For items that appear more than once (i.e. items contributed by several neighbours), only the smallest (best) score is kept.
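
The steps above can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the notebook's function: `combine_scores`, the toy distances and the toy interaction vectors are all hypothetical, and the neighbours are passed in directly instead of coming from `knn.kneighbors`.

```python
import numpy as np

def combine_scores(user_vec, neighbors, item_dist, user_weight=0.3, item_weight=0.7):
    """neighbors: list of (distance_to_user, binary_interaction_vector) pairs."""
    user_vec = np.asarray(user_vec)
    interacted = np.flatnonzero(user_vec == 1)
    scores = {}
    for distance, neighbor_vec in neighbors:
        # Items the neighbour knows that the target user has not interacted with yet
        candidates = np.flatnonzero((np.asarray(neighbor_vec) == 1) & (user_vec == 0))
        for item in candidates:
            # Item score: mean distance from the candidate to the user's own items
            item_score = item_dist[item, interacted].mean()
            combined = user_weight * distance + item_weight * item_score
            # For items suggested by several neighbours, keep the smallest score
            item = int(item)
            if item not in scores or combined < scores[item]:
                scores[item] = float(combined)
    return scores

# Toy item-item distance matrix (4 items), invented for illustration
toy_item_dist = np.array([
    [0.0, 0.5, 1.0, 0.8],
    [0.5, 0.0, 0.4, 0.6],
    [1.0, 0.4, 0.0, 0.2],
    [0.8, 0.6, 0.2, 0.0],
])
toy_user = [1, 0, 0, 0]            # the user has interacted with item 0 only
toy_neighbors = [
    (0.2, [1, 1, 1, 0]),           # close neighbour: knows items 0, 1, 2
    (0.6, [0, 0, 1, 1]),           # more distant neighbour: knows items 2, 3
]

toy_scores = combine_scores(toy_user, toy_neighbors, toy_item_dist)
```

Item 2 is suggested by both neighbours; it keeps the score derived from the closer one because that score is smaller.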


In [495]:
def recommendations_model(user_interactions):
    # Find the nearest neighbours of the target user and their Jaccard distances
    distances, neighbor_indices = knn.kneighbors([user_interactions])

    # Positions of the items the target user has already interacted with
    interacted = np.flatnonzero(np.asarray(user_interactions) == 1)

    final_scores = {}
    for neighbor_distance, neighbor_index in zip(distances[0], neighbor_indices[0]):
        neighbor_interactions = target_matrix.iloc[neighbor_index]
                                # change target_matrix to user_item_matrix to drop segmentation

        # Items the neighbour has interacted with, minus those the user already knows
        neighbor_items = [int(item.split('_')[1]) for item in neighbor_interactions[neighbor_interactions == 1].index]
        candidates = {item for item in neighbor_items if user_interactions[item] == 0}

        for item_id in candidates:
            # Item score: mean distance of the item to all items the user has interacted with
            item_score = itemsim[item_id, interacted].mean()
            combined_score = neighbor_distance * user_weight + item_score * item_weight

            # Shift back to 1-based item IDs; for duplicates keep the smallest (best) score
            key = item_id + 1
            if key not in final_scores or combined_score < final_scores[key]:
                final_scores[key] = float(combined_score)

    return final_scores

Define the weights:

In [496]:
user_weight = 0.3    # data already segmented, so lower weight for users
item_weight = 1 - user_weight

Run the function. The output is a dictionary whose keys are item IDs (shifted back to the original 1-based numbering) and whose values are combined distance scores ranging from 0 to 1. Lower values indicate higher recommendation priority.

In [497]:
recommendations_model(user_interactions)



{356: 0.3564465408805031,
 5: 0.35574726083971053,
 6: 0.34701047349700465,
 389: 0.35951327433628316,
 361: 0.3783078880407124,
 204: 0.4299597478591236,
 366: 0.35556603773584905,
 400: 0.37088104325699744,
 187: 0.39578266104756166,
 348: 0.36100727702954566}

## Final thoughts:

<span style="background-color:red;color:white;padding:5px;">**Problem:**</span>

It seems a bit redundant to both calculate user-based similarity **and** segment our data based on the demographics of the target user.

<span style="background-color:green;color:white;padding:5px;">**Solution:**</span>

Drop the data segmentation. In the final model (used in the API), segmentation was dropped by training our knn on the complete dataset and then serializing it (**`mod.pkl`**).
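
A minimal sketch of the serialization step, assuming the standard-library `pickle` module was used (the notebook only names the output file, not the mechanism). `save_model` and `load_model` are hypothetical helper names.

```python
import pickle

def save_model(model, path="mod.pkl"):
    """Write a fitted model to disk."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

def load_model(path="mod.pkl"):
    """Read a previously saved model back from disk."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Intended usage (assumption: the knn is fitted on the unsegmented matrix):
# knn_full = NearestNeighbors(n_neighbors=1, metric="jaccard", algorithm="brute")
# knn_full.fit(user_item_matrix)
# save_model(knn_full)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.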

<span style="background-color:red;color:white;padding:5px;">**Problem:**</span>

For our prototype we will not have complete information on which items the target user has already interacted with (i.e. the words the child has already learned), because not all words will be included in the assessment. It is therefore likely that many of the recommended items will be words that the user already knows, especially since recommended items tend to be "popular" items. One idea would be to add some low-similarity items to the list of recommendations instead of recommending only the most similar ones. However, these words are unlikely to be age-appropriate, especially for younger users, so this is not an ideal solution.

<span style="background-color:red;color:white;padding:5px;">**Problem:**</span>

How will the extreme case of a user who has not interacted with **any** items impact our model?

Looking at the wordbank data, it seems very unlikely that we will encounter a child who has learned 0 of the 680 possible words. However, our MVP will only be able to collect information for a limited number of items (40), which increases the likelihood of receiving an input filled with 0's.

- **User-based similarities:** The metric used in our **`knn`** (Jaccard) is designed for asymmetric binary attributes, i.e. it ignores items that neither user has interacted with. The distance to every neighbour with at least one interaction will therefore be 1, since the respective Jaccard similarity will always be 0. However, the model will still identify neighbours (I am assuming that ties are broken arbitrarily rather than meaningfully), so this will not stop an output from being generated.
<br/>

- **Item-based similarities:** The way **`item_scores`** is calculated requires **`user_interactions`** to contain `1`'s. If it does not, the division by `sum(user_interactions)` becomes 0/0 and every score is `NaN`, which then propagates into **`final_scores`**. No usable output will be generated.
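
Both effects can be reproduced on toy data. The sketch below uses invented vectors and a hand-computed Jaccard distance, so it only illustrates the arithmetic, not the behaviour of the fitted `knn`.

```python
import numpy as np

empty_user = np.zeros(5)                 # a user with no interactions at all
neighbor = np.array([1, 0, 1, 0, 0])     # any neighbour with at least one interaction

# User-based part: no shared items, so similarity is 0 and distance is 1
intersection = np.logical_and(empty_user, neighbor).sum()
union = np.logical_or(empty_user, neighbor).sum()
user_distance = 1 - intersection / union

# Item-based part: averaging over zero interacted items is a 0/0 division
toy_item_distances = np.array([0.2, 0.5, 0.9, 0.4, 0.7])  # one row of a toy itemsim
interacted = np.flatnonzero(empty_user == 1)               # empty index array
with np.errstate(invalid="ignore"):                        # silence the 0/0 warning
    item_score = toy_item_distances[interacted].sum() / empty_user.sum()
```

Here `user_distance` is 1 regardless of which neighbour is picked, and `item_score` is `NaN`, so any weighted combination of the two is `NaN` as well.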

<span style="background-color:green;color:white;padding:5px;">**Solution:**</span>

Code designed to handle this kind of extreme case could be built into **`item_scores`** (for example, giving each item a score of `0` instead of `NaN`, or skipping the calculation of these scores), but recommendations would still be generated completely randomly. Instead, the user could be provided with a list of items that are generally popular for their age group and sex, no model necessary. Additionally, assessment items should be chosen carefully and tailored to each individual user's characteristics (age/sex) to ensure that some of the presented words will be ones that the user has already learned.
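
A possible shape for the popularity fallback, sketched under assumptions: `popular_items` is a hypothetical helper, the toy frame is invented, and the data is assumed to be in the same long format as `df` (one row per child and item, with a binary `value`).

```python
import pandas as pd

def popular_items(frame, age_group, sex=None, top_n=10):
    """Rank items by the share of demographically matched children who know them."""
    matched = frame[frame["ageg"] == age_group]
    if sex is not None:
        matched = matched[matched["sex"].str.lower() == sex.lower()]
    # The mean of a binary column is the proportion of children with value 1
    popularity = matched.groupby("item_id")["value"].mean()
    return popularity.sort_values(ascending=False).head(top_n)

# Toy long-format data, invented for illustration only
toy_long = pd.DataFrame({
    "child_id": [1, 1, 2, 2, 3, 3],
    "item_id": ["item_1", "item_2"] * 3,
    "value": [1, 0, 1, 1, 0, 1],
    "ageg": [5] * 6,
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
})

fallback = popular_items(toy_long, 5, "Male", top_n=2)
```

In the API this list could be returned whenever the input vector contains no `1`'s, bypassing the model entirely.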

<span style="background-color:red;color:white;padding:5px;">**Problem:**</span>

Evaluating the model will be tricky, since we cannot collect new data and are also working with binary data.

<span style="background-color:green;color:white;padding:5px;">**Solution:**</span>

- Now that the data segmentation has been dropped, we can evaluate our user-based similarities by investigating the demographic characteristics of the identified neighbours. Will the "nearest neighbours" be of the same age and gender as our target user?
<br/>


- Additionally, we can evaluate the model on its tendency to recommend items that the user has already learned.
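
Both checks could be expressed as simple rates. The sketch below is one possible formulation, not an established evaluation protocol: the function names, the dictionary layout for users and the toy values are assumptions. The second metric presumes a hold-out setup in which some of a child's known words are hidden from the model input and later compared against the recommendations.

```python
def neighbor_match_rate(target, neighbors):
    """Share of neighbours whose age group and sex equal the target's."""
    if not neighbors:
        return 0.0
    hits = sum(
        1 for n in neighbors
        if n["ageg"] == target["ageg"] and n["sex"] == target["sex"]
    )
    return hits / len(neighbors)

def already_learned_rate(recommended_ids, known_ids):
    """Share of recommended items that the child already knows."""
    recommended = list(recommended_ids)
    if not recommended:
        return 0.0
    known = set(known_ids)
    return sum(1 for item in recommended if item in known) / len(recommended)

# Toy values, invented for illustration only
toy_target = {"ageg": 5, "sex": "Male"}
toy_neighbors = [
    {"ageg": 5, "sex": "Male"},
    {"ageg": 5, "sex": "Female"},
    {"ageg": 5, "sex": "Male"},
]
match_rate = neighbor_match_rate(toy_target, toy_neighbors)

# Hidden (held-out) known words compared against a recommendation list
learned_rate = already_learned_rate([356, 5, 6, 389], known_ids={5, 6, 10})
```

A high `match_rate` would suggest that the knn recovers demographic structure without explicit segmentation; a high `learned_rate` would confirm the concern that recommendations skew towards words the child already knows.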


***

Code for API result formatting:

In [None]:
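wordsz = pd.read_parquet("words.parquet")
word_mapping = dict(zip(wordsz["wordBankId"], wordsz.index))

# final_scores is local to recommendations_model, so capture its return value here
final_scores = recommendations_model(user_interactions)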

formatted_final_scores = []

for item_id, score in final_scores.items():
    if item_id in wordsz.index:
        item_definition = wordsz.loc[item_id, "word"]
        wbi = wordsz.loc[item_id, "wordBankId"]

        formatted_item = {
            "name": item_definition,
            "priority": score,
            "wordBankId": wbi
        }

        formatted_final_scores.append(formatted_item)
        
formatted_final_scores