In [91]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.spatial
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import hamming, pdist, squareform
from sklearn.metrics import jaccard_score

In [92]:
df = pd.read_csv("wbmaster.csv")

# Part 3: Building the recommendation model

**Collaborative filtering**

We have 3 options:

- **User-based filtering:** find users whose interaction patterns are similar to our target user's, then recommend items that those similar users have interacted with.

- **Item-based filtering:** identify items that are similar to the ones our target user has interacted with, then recommend those items to our target user.

- A **hybrid** of the two

We are building a hybrid.
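
As a rough illustration of the difference between the two approaches, the sketch below uses a made-up 4×5 binary matrix (not our data). User-based filtering compares rows (users), while item-based filtering compares columns (items). The name `toy` and its values are assumptions for illustration only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy interaction matrix (hypothetical data): rows = users, columns = items
toy = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
], dtype=bool)

# User-based view: pairwise Jaccard distance between rows (users)
user_dist = squareform(pdist(toy, metric="jaccard"))

# Item-based view: pairwise Jaccard distance between columns (items)
item_dist = squareform(pdist(toy.T, metric="jaccard"))

print(user_dist.shape)  # one row/column per user
print(item_dist.shape)  # one row/column per item
```

A hybrid uses both matrices: the user distances to pick neighbours, and the item distances to score candidate items.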


## **1. User-based similarities**

Create a matrix with binary information on user interactions. Rows represent users, columns represent items.

In [119]:
user_item_matrix = df.pivot(index="child_id", columns="item_id", values="value")

# pivot sorts the item columns alphabetically (item_1, item_10, ...), so re-sort them numerically from 1 to 680
itemsorted = sorted(user_item_matrix.columns, key=lambda x: int(x.split("_")[1]))
user_item_matrix = user_item_matrix[itemsorted]

Adjust item IDs to match Python's zero-based indices. We now have IDs from 0 to 679 instead of 1 to 680.

In [120]:
# Subtract 1 from all item IDs in the user_item_matrix columns
user_item_matrix.columns = [f"item_{int(col.split('_')[1]) - 1}" if col.startswith("item_") else col for col in user_item_matrix.columns]
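
To make the pivot, re-sort and re-index steps concrete, here is a minimal sketch on a tiny made-up long-format table. The frame `toy_df` and its values are hypothetical; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical long-format data: one row per (child, item) pair
toy_df = pd.DataFrame({
    "child_id": [1, 1, 1, 2, 2, 2],
    "item_id": ["item_1", "item_2", "item_10", "item_1", "item_2", "item_10"],
    "value": [1, 0, 1, 0, 0, 1],
})

# Rows = users, columns = items
toy_matrix = toy_df.pivot(index="child_id", columns="item_id", values="value")

# Sort the columns by their numeric suffix instead of alphabetically
ordered = sorted(toy_matrix.columns, key=lambda name: int(name.split("_")[1]))
toy_matrix = toy_matrix[ordered]

# Shift the IDs down by one so that they line up with zero-based positions
toy_matrix.columns = [f"item_{int(name.split('_')[1]) - 1}" for name in toy_matrix.columns]
print(toy_matrix)
```

After the shift, the column at position `i` is named `item_i`, which is what lets us later use item IDs directly as array positions.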


Define a hypothetical target user's characteristics.

- Age group

In [34]:
while True:
    target_age_group = input("Enter the target's age group (1, 2, 3, 4, or 5): ")

    if target_age_group in ['1', '2', '3', '4', '5']:
        print("Target Age Group:", target_age_group)
        break
    else:
        print("Invalid input. Please enter a valid age group.")

Enter the target's age group (1, 2, 3, 4, or 5): 3
Target Age Group: 3


- Sex

In [28]:
while True:
    target_sex = input("Enter the target's sex (Female, Male or None): ").lower()

    if target_sex in ["female", "male", "none"]:
        print("Target's Sex:", target_sex)
        break
    else:
        print("Invalid input. Please enter a valid option.")

Enter the target's sex (Female, Male or None): male
Target's Sex: male


In [136]:
#Alternatively, just create the variables
target_age_group = 5 
target_sex = "Male"

Our EDA revealed that user characteristics such as **`ageg`** and **`sex`** have a big influence on a child's vocabulary. We therefore segment our data so that the users in our database match our hypothetical target user demographically.

In [137]:
# Filter our df so that it only includes information of age-matched users
filtered_users = df[df["ageg"] == int(target_age_group)]

# UX wants users to be able to opt out of sharing sex/gender information, so we only
# filter on sex when the input is Male/Female. For "None", all age-matched users are
# kept in the matrix, regardless of their sex. The check is case-insensitive because
# the input prompt above lowercases the answer ("none"), while the manual assignment
# uses capitalised values.
if target_sex.lower() != "none":
    filtered_users = filtered_users[filtered_users["sex"].str.lower() == target_sex.lower()]

# Extract the IDs of the filtered users
filtered_user_ids = filtered_users["child_id"].unique()

# Create a boolean mask to filter user information for the demographically matched users
user_filter_mask = user_item_matrix.index.isin(filtered_user_ids)

# Apply the user filter to our "master" matrix
target_matrix = user_item_matrix[user_filter_mask]
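
The segmentation logic can also be wrapped in a small helper. This is only a sketch: `match_users` and `toy_df` are hypothetical names, and the column names `ageg`, `sex` and `child_id` are assumed to match our dataframe.

```python
import pandas as pd

def match_users(df, age_group, sex=None):
    """Return the IDs of users in the given age group (and, optionally, of the given sex)."""
    matched = df[df["ageg"] == int(age_group)]
    # Treat None and the string "none" (any casing) as "do not filter on sex"
    if sex is not None and str(sex).lower() != "none":
        matched = matched[matched["sex"].str.lower() == str(sex).lower()]
    return matched["child_id"].unique()

# Hypothetical demographic table
toy_df = pd.DataFrame({
    "child_id": [1, 2, 3, 4],
    "ageg": [5, 5, 5, 3],
    "sex": ["Male", "Female", "Male", "Male"],
})

print(match_users(toy_df, 5, "male"))   # age- and sex-matched users
print(match_users(toy_df, "5", "None"))  # age-matched users only
```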

Now we start working with our target user's interaction history.

In [138]:
# Use child 1 as a stand-in for our target user. We read the row from user_item_matrix
# (rather than from df) so that the values follow the same item order as the matrix columns.
interacted_items = user_item_matrix.loc[1]

# The assessment questionnaire cannot cover all 680 items, so by default we assume that
# an item we have no information on has not been interacted with/learned. Hence we start
# with zeros only and set the known interactions to 1.
user_interactions = np.zeros(680)
user_interactions[interacted_items.to_numpy() == 1] = 1
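
For a real target user we would only have a handful of questionnaire answers instead of a full row. A minimal sketch of that case, assuming the answers arrive as a list of zero-based item IDs (`build_interaction_vector` and `questionnaire_answers` are hypothetical names):

```python
import numpy as np

N_ITEMS = 680  # number of items in this notebook's catalogue

def build_interaction_vector(known_item_ids, n_items=N_ITEMS):
    """Turn a list of zero-based item IDs marked as known into a 0/1 vector."""
    vector = np.zeros(n_items)
    # Everything not explicitly listed stays 0 (treated as "not learned")
    vector[np.asarray(known_item_ids, dtype=int)] = 1
    return vector

questionnaire_answers = [3, 17, 42]  # hypothetical answers
vec = build_interaction_vector(questionnaire_answers)
print(int(vec.sum()))  # number of items marked as known
```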

## **2. Item-based similarities**

We also want to take item similarity into account, so we need to create an item similarity matrix (or rather *dissimilarity*, since we are working with Jaccard's *distance*). Higher scores indicate less similarity.

Transpose the matrix. Rows now represent items and columns represent users. Then create a (dis)similarity matrix for the items.

In [139]:
target_item_matrix = target_matrix.T
itemsim = (scipy.spatial.distance.cdist(target_item_matrix.values, target_item_matrix.values, metric="jaccard"))
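
As a quick sanity check on what each cell of this matrix means, the Jaccard distance between two binary vectors is 1 minus the size of their intersection divided by the size of their union. The sketch below computes it by hand for two made-up vectors and compares the result with SciPy.

```python
import numpy as np
from scipy.spatial.distance import jaccard

# Two hypothetical item vectors (one entry per user)
a = np.array([1, 1, 0, 1, 0], dtype=bool)
b = np.array([1, 0, 0, 1, 1], dtype=bool)

intersection = np.logical_and(a, b).sum()  # users who interacted with both items
union = np.logical_or(a, b).sum()          # users who interacted with at least one
manual = 1 - intersection / union

print(manual, jaccard(a, b))
```

Note that the position where both vectors are 0 does not enter the calculation at all.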

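Initialize the nearest neighbors model. We opt for Jaccard's distance because it is a simple metric suited to binary data, and because our binary data is asymmetric: two users having both interacted with an item (a shared 1) is informative, whereas neither having interacted with it (a shared 0) is not, and Jaccard ignores shared 0s. We fit the model to the **filtered** user-item interaction matrix.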

In [140]:
# Initialize the nearest neighbors model
knn = NearestNeighbors(n_neighbors=1, metric="jaccard", algorithm="brute")

# Fit the model to your user-item interaction matrix
knn.fit(target_matrix)
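
To show what the fitted model returns, here is a minimal sketch on a made-up 3×4 matrix. `toy_users`, `toy_knn` and `new_user` are hypothetical; the model settings mirror the ones above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical user-item matrix: rows = users, columns = items
toy_users = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
], dtype=bool)

toy_knn = NearestNeighbors(n_neighbors=1, metric="jaccard", algorithm="brute")
toy_knn.fit(toy_users)

# kneighbors expects a 2D array: one row per query user
new_user = np.array([[1, 1, 0, 0]], dtype=bool)
distances, indices = toy_knn.kneighbors(new_user)
print(distances[0], indices[0])
```

`indices` holds row positions in the fitted matrix, which is why the recommendation code looks neighbours up with `.iloc` rather than by `child_id`.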

In [1]:
def get_recommendations(user_interactions, target_matrix, itemsim):
    # Calculate (user-based) neighbors and their distances
    knn = NearestNeighbors(n_neighbors=2, metric="jaccard", algorithm="brute")
    knn.fit(target_matrix)

    distance, neighbor_indices = knn.kneighbors([user_interactions])

    # Get the interaction history of the identified neighbors for all items
    user_based_recommendations = []
    for neighbor_index in neighbor_indices[0]:
        neighbor_interactions = target_matrix.iloc[neighbor_index]

        # For the items that the neighbors have interacted with (1's), convert the item names
        # into integers and append them to our empty list "user_based_recommendations"
        user_based_recommendations.extend([int(item.split('_')[1]) for item in neighbor_interactions[neighbor_interactions == 1].index])

    # Only include items that the target user has not already interacted with.
    # The upper bound is exclusive: valid positions run from 0 to len(user_interactions) - 1.
    user_based_recommendations = [item for item in user_based_recommendations if 0 <= item < len(user_interactions) and user_interactions[item] == 0]
    user_based_recommendations = list(set(user_based_recommendations))

    # Item-based scoring: for every item, sum its Jaccard distances to all items the
    # target user has not interacted with. Because itemsim holds distances, a higher
    # score means the item is less similar to those items.
    item_scores = {}
    for item_id in range(len(itemsim)):
        similar_items = np.argsort(itemsim[item_id])[::-1]
        similar_items = [item for item in similar_items if user_interactions[item] == 0]
        similarity_score = sum(itemsim[item_id][similar_items])
        item_scores[item_id] = similarity_score

    # Rank all items by score. This list is not used below (the hybrid step reads
    # item_scores directly); it is kept in ranked order rather than passed through set().
    sorted_items = sorted(item_scores.items(), key=lambda x: x[1], reverse=True)
    item_based_recommendations = [item_id for item_id, _ in sorted_items]

    # Combine user-based and item-based scores as a weighted sum. Note that user_score is
    # fixed at 1 for every candidate, so the final ranking is driven by item_score alone.
    user_weight = 0.3
    item_weight = 1 - user_weight

    combined_recommendations = []
    for item_id in user_based_recommendations:
        user_score = 1 
        item_score = item_scores.get(item_id, 0)
        combined_score = (user_weight * user_score) + (item_weight * item_score)

        combined_recommendations.append((item_id, combined_score))

    # Sort the combined recommendations by their scores
    combined_recommendations.sort(key=lambda x: x[1], reverse=True)

    # Bring the IDs back to their original values
    combined_recommendations = [(item_id + 1, combined_score) for item_id, combined_score in combined_recommendations]

    # Extract the recommended item IDs
    final_items = [item_id for item_id, _ in combined_recommendations]

    return final_items

# Example usage with the objects built in the cells above:
# user_interactions (the target user's 0/1 vector), target_matrix (the filtered
# user-item matrix) and itemsim (the item-by-item Jaccard distance matrix)
recommendations = get_recommendations(user_interactions, target_matrix, itemsim)
print("Final Recommendations:", recommendations)

In [162]:
# Calculate (user-based) neighbours and their distances
distance, neighbor_indices = knn.kneighbors([user_interactions])

# Get the interaction history of the identified neighbours for all 680 items
user_based_recommendations = []
for neighbor_index in neighbor_indices[0]:
    neighbor_interactions = target_matrix.iloc[neighbor_index]
    
    # For the items that the neighbours have interacted with (1's), convert the item names
    # into integers and append them to our empty list "user_based_recommendations"
    user_based_recommendations.extend([int(item.split('_')[1]) for item in neighbor_interactions[neighbor_interactions == 1].index])

# Only include items that the target user has not already interacted with.
# The upper bound is exclusive: valid positions run from 0 to len(user_interactions) - 1.
user_based_recommendations = [item for item in user_based_recommendations if 0 <= item < len(user_interactions) and user_interactions[item] == 0]
user_based_recommendations = list(set(user_based_recommendations))

# Item-based scoring: for every item, sum its Jaccard distances to all items the
# target user has not interacted with. Because itemsim holds distances, a higher
# score means the item is less similar to those items.
item_scores = {}
for item_id in range(len(itemsim)):
    similar_items = np.argsort(itemsim[item_id])[::-1]
    similar_items = [item for item in similar_items if user_interactions[item] == 0]
    similarity_score = sum(itemsim[item_id][similar_items])
    item_scores[item_id] = similarity_score

# Rank all items by score. This list is not used below (the hybrid step reads
# item_scores directly); it is kept in ranked order rather than passed through set().
sorted_items = sorted(item_scores.items(), key=lambda x: x[1], reverse=True)
item_based_recommendations = [item_id for item_id, _ in sorted_items]

# Combine user-based and item-based scores as a weighted sum. Note that user_score is
# fixed at 1 for every candidate, so the final ranking is driven by item_score alone.
user_weight = 0.3
item_weight = 1 - user_weight

combined_recommendations = []
for item_id in user_based_recommendations:
    user_score = 1 
    item_score = item_scores.get(item_id, 0)
    combined_score = (user_weight * user_score) + (item_weight * item_score)
    
    combined_recommendations.append((item_id, combined_score))

# Sort the combined recommendations by their scores
combined_recommendations.sort(key=lambda x: x[1], reverse=True)

# Bring the IDs back to their original values
combined_recommendations = [(item_id + 1, combined_score) for item_id, combined_score in combined_recommendations]

# Extract the recommended item IDs
final_items = [item_id for item_id, _ in combined_recommendations]

print("Final Recommendations (w/ Scores):", combined_recommendations)



Final Recommendations (w/ Scores): [(181, 103.71465831032151), (436, 96.3837804632133), (640, 96.17399490683499), (330, 90.92694293778204), (547, 90.88487061467883), (178, 88.23202136197749), (328, 85.39501282613722), (326, 85.08402874928659), (343, 84.94046716375139), (257, 81.94924005476018), (590, 80.72474628423164), (156, 80.49406691976436), (172, 79.81458963457685), (64, 78.87192608851717), (646, 78.60349603588718), (152, 78.56147976048123), (327, 77.14996442826008), (151, 76.80550271311337), (589, 76.74455813940267), (650, 76.15488180413901), (34, 75.93206362506996), (475, 75.5293522757116), (355, 74.54926937096344), (574, 74.41375841496323), (230, 74.04143031619098), (140, 73.61846831075451), (583, 73.25869265780797), (266, 73.07550657323381), (146, 73.06677924758024), (598, 72.38501904650983), (184, 71.67400672557312), (350, 71.61943761571274), (580, 71.04122230052323), (613, 70.98123910737678), (40, 70.74529123092609), (373, 69.66767742594655), (591, 69.66358749913138), (80, 6
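
The scores above come from the weighted sum `user_weight * user_score + item_weight * item_score`. The small worked example below, with made-up item scores, shows the arithmetic and illustrates that a constant `user_score` of 1 shifts every score by the same amount without changing the ranking. `toy_item_scores` is hypothetical.

```python
user_weight = 0.3
item_weight = 1 - user_weight

# Hypothetical item scores for three candidate items
toy_item_scores = {10: 5.0, 11: 2.0, 12: 8.0}

combined = {
    item_id: user_weight * 1 + item_weight * score  # user_score is fixed at 1
    for item_id, score in toy_item_scores.items()
}

ranking_by_combined = sorted(combined, key=combined.get, reverse=True)
ranking_by_item_score = sorted(toy_item_scores, key=toy_item_scores.get, reverse=True)
print(ranking_by_combined, ranking_by_item_score)
```

For the user-based part to influence the order, `user_score` would have to vary per item, for example with the number of neighbours who interacted with it. That is a possible extension, not something the current model does.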

**Criticisms:**

- Calculating user-based similarity **and** segmenting our data by the target user's demographics seems somewhat redundant.

- For our prototype we will not have complete information on which items the target user has already interacted with (i.e. the words the child has already learned). Many recommended items will therefore likely be words the child already knows, especially since recommended items tend to be "popular" items.

- Evaluating the model will be tricky, since we cannot collect new data and are working with binary data.

**Solutions:**

- Drop the data segmentation.
- Recommend not only the highest-scoring items, but also mix some low-scoring items into the list of recommendations. **Problem:** these words will likely not be age-appropriate, especially for younger users.
- With the data segmentation dropped, we can evaluate our user-based similarities by investigating the demographic characteristics of the similar users. Will the "nearest neighbours" be of the same age and gender as our target user? Find out in Part 4a.
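
One possible way to mix low-scoring items into the list is sketched below. It is only an assumption about how this could be done: `mix_recommendations`, `n_top` and `n_explore` are hypothetical names, and the sampling does nothing to address the age-appropriateness problem noted above.

```python
import random

def mix_recommendations(ranked_items, n_top=8, n_explore=2, seed=None):
    """Return the n_top best items plus n_explore items sampled from the rest of the ranking."""
    rng = random.Random(seed)
    top = list(ranked_items[:n_top])
    rest = list(ranked_items[n_top:])
    # Sample without replacement from the lower-ranked items
    explore = rng.sample(rest, min(n_explore, len(rest)))
    return top + explore

ranked = list(range(1, 21))  # hypothetical ranking, best item first
mixed = mix_recommendations(ranked, n_top=8, n_explore=2, seed=0)
print(mixed)
```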