## Part 5: Final model

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.spatial
from statsmodels.formula.api import ols
import json
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import hamming, pdist, squareform
from sklearn.metrics import jaccard_score

In [2]:
df = pd.read_csv("wbmaster.csv")

In [3]:
df2 = pd.read_csv("checkdat.csv")

**Collaborative filtering**

This is the final collaborative filtering (CF) model, with the following settings:

- No segmentation of the user-item/item-user matrix.
- Only 6 words are recommended, ranked by highest similarity; no unique or dissimilar words are featured.
- Only words for which audio data or images are available are recommended.
- Adjusted weights: the user weight has been increased by 0.1.

The mock JSON file records which words are exercise-appropriate (they are flagged as "is_audio").

In [5]:
with open('output1.json', 'r') as file:
    ilegend = json.load(file)

In [6]:
words = [item["name"] for item in ilegend if item["is_audio"]]
ids = []

for word in words:
    match = df2[df2["item_definition"] == word]
    ids.extend(match["item_id"].tolist())

ids = ([int(item.split('_')[1]) for item in ids])

**ids** is a list containing the IDs of all exercise-appropriate words. Unfortunately, only 236 items remain.
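
A loop-free version of this lookup can be sketched as follows. This is only an illustration on made-up toy data: `legend` and `lookup` are hypothetical stand-ins for `ilegend` and `df2`, and only the column names `item_id` and `item_definition` are taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for the JSON legend and the lookup table (hypothetical values)
legend = [
    {"name": "dog", "is_audio": True},
    {"name": "cat", "is_audio": False},
    {"name": "ball", "is_audio": True},
]
lookup = pd.DataFrame({
    "item_id": ["item_1", "item_2", "item_3"],
    "item_definition": ["dog", "cat", "ball"],
})

# Words flagged as exercise-appropriate
audio_words = [entry["name"] for entry in legend if entry["is_audio"]]

# Select matching rows in one step and strip the "item_" prefix
mask = lookup["item_definition"].isin(audio_words)
audio_ids = lookup.loc[mask, "item_id"].str.split("_").str[1].astype(int).tolist()
print(audio_ids)
```

Unlike the loop, `isin` returns the IDs in the row order of the lookup table rather than in the order of the word list. For a membership filter this makes no difference.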

**1. User-based similarities**: Create the user-item matrix and shift the item IDs down by one so that they match Python's zero-based indices

In [7]:
user_item_matrix = df.pivot(index="child_id", columns="item_id", values="value")
itemsorted = sorted(user_item_matrix.columns, key=lambda x: int(x.split("_")[1]))
user_item_matrix = user_item_matrix[itemsorted]
user_item_matrix.columns = [f"item_{int(col.split('_')[1]) - 1}" if col.startswith("item_") else col for col in user_item_matrix.columns]
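
The numeric sort in this cell matters because, sorted as plain strings, `item_10` comes before `item_2`. A small sketch on invented toy data (all values are hypothetical; the column names follow the notebook):

```python
import pandas as pd

# Long-format toy data: one row per child and item (hypothetical values)
long_df = pd.DataFrame({
    "child_id": [1, 1, 2, 2],
    "item_id": ["item_2", "item_10", "item_2", "item_10"],
    "value": [1, 0, 0, 1],
})

wide = long_df.pivot(index="child_id", columns="item_id", values="value")

# Order the columns by their numeric suffix, not alphabetically
ordered = sorted(wide.columns, key=lambda col: int(col.split("_")[1]))
wide = wide[ordered]

# Shift the labels down by one so that label n sits at array position n
wide.columns = [f"item_{int(col.split('_')[1]) - 1}" for col in wide.columns]
print(wide)
```

After the shift, the label `item_n` only coincides with column position `n` when the original IDs run from 1 upwards without gaps, as the notebook assumes for its 680 items. In this toy frame the two labels are `item_1` and `item_9`, but they sit at positions 0 and 1.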

**2. Item-based similarities:**
Transpose the user-item matrix and calculate the pairwise Jaccard distances between items.

In [12]:
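item_user_matrix = user_item_matrix.T
# Note: cdist returns Jaccard *distances* (0 = identical, 1 = no overlap), not similarities,
# even though the result is stored under the name "itemsim"
itemsim = scipy.spatial.distance.cdist(item_user_matrix.values, item_user_matrix.values, metric="jaccard")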
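
To make the difference between Jaccard distance and Jaccard similarity concrete, here is a minimal sketch on three invented binary item vectors. The data is hypothetical; `cdist` is the same SciPy function used in the cell above.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Rows are items, columns are users (toy data)
items = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
], dtype=bool)

# Jaccard distance: share of positions that disagree, among positions
# where at least one of the two vectors is True
dist = cdist(items, items, metric="jaccard")

# Jaccard similarity is the complement of the distance
sim = 1.0 - dist

print(dist[0, 1], sim[0, 1])  # items 0 and 1 overlap partially
print(dist[0, 2], sim[0, 2])  # items 0 and 2 share no users
```

A large value in `dist` therefore means the two items are dissimilar, which should be kept in mind wherever the matrix is ranked or summed as if it held similarities.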

**3. Target user interaction history:** The input values are the responses to the 40 assessment items; NAs and the other 640 items are filled with 0s

In [13]:
test = df[df["child_id"] == 1]["value"]
test.index = test.index + 1
interacted_items = test

# It's impossible to include all 680 items for our assessment questionnaire, so by default
# we assume that if we have no information on an item, then it has not been interacted with/
# learned. Hence we start with zeros only.

user_interactions = np.zeros(680)
user_interactions[interacted_items == 1] = 1
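
The interaction vector can also be built by mapping item labels to array positions explicitly, so that the result does not depend on the order in which the responses are stored. A minimal sketch, assuming the responses arrive as a mapping from one-based `item_<n>` labels to 0/1; the data is made up and `n_items` is shortened from 680 to 10.

```python
import numpy as np

n_items = 10  # 680 in the notebook

# Hypothetical assessment responses: label -> 1 (known) or 0 (not known)
responses = {"item_2": 1, "item_5": 0, "item_9": 1}

# Start from "nothing known" and switch on the items answered with 1
vector = np.zeros(n_items)
for label, value in responses.items():
    position = int(label.split("_")[1]) - 1  # one-based ID -> zero-based index
    if value == 1:
        vector[position] = 1

print(vector)
```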

**4. Initialize the model:**

In [14]:
# Initialize the nearest neighbors model
knn = NearestNeighbors(n_neighbors=2, metric="jaccard", algorithm="brute")

# Fit the model to your user-item interaction matrix
knn.fit(user_item_matrix)
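
How the fitted model is queried can be seen on a small example. The sketch below uses the same `NearestNeighbors` settings as the cell above, but on four invented users and four items; none of the values come from the real data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows are users, columns are items (toy data)
users = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
], dtype=bool)

# The target user's interaction vector, as a 2D array with one row
query = np.array([[1, 1, 0, 0]], dtype=bool)

knn = NearestNeighbors(n_neighbors=2, metric="jaccard", algorithm="brute")
knn.fit(users)

# distances[0] and indices[0] list the two closest users, nearest first
distances, indices = knn.kneighbors(query)
print(indices[0], distances[0])
```

In this toy data the query is identical to row 0, so that row is returned as its own nearest neighbour at distance 0. This is worth keeping in mind whenever the target user is drawn from the same data the model was fitted on.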

In [15]:
# Calculate (user-based) neighbours and their distances
distance, neighbor_indices = knn.kneighbors([user_interactions])

# Get the interaction history of the identified neighbours for all 680 items
user_based_recommendations = []
for neighbor_index in neighbor_indices[0]:
    neighbor_interactions = user_item_matrix.iloc[neighbor_index]
    
    # For the items that the neighbours have interacted with (1's), convert the item names
    # into integers and append them to our empty list "user_based_recommendations"
    user_based_recommendations.extend([int(item.split('_')[1]) for item in neighbor_interactions[neighbor_interactions == 1].index])

# Only include items that the target user has not already interacted with
user_based_recommendations = [item for item in user_based_recommendations if 0 <= item < len(user_interactions) and user_interactions[item] == 0]
user_based_recommendations = list(set(user_based_recommendations))

# Item-based collaborative filtering (as previously explained)
item_scores = {}
for item_id in range(len(itemsim)):
    similar_items = np.argsort(itemsim[item_id])[::-1]
    similar_items = [item for item in similar_items if user_interactions[item] == 0]
    similarity_score = sum(itemsim[item_id][similar_items])
    item_scores[item_id] = similarity_score

sorted_items = sorted(item_scores.items(), key=lambda x: x[1], reverse=True)
# The dictionary keys are already unique, so the ranked order is kept as-is
item_based_recommendations = [item_id for item_id, _ in sorted_items]

# Combine user-based and item-based recommendations using a weighted average
user_weight = 0.3 
item_weight = 1 - user_weight

combined_recommendations = []
for item_id in user_based_recommendations:
    user_score = 1 
    item_score = item_scores.get(item_id, 0)
    combined_score = (user_weight * user_score) + (item_weight * item_score)
    
    combined_recommendations.append((item_id, combined_score))

# Sort the combined recommendations by their scores
combined_recommendations.sort(key=lambda x: x[1], reverse=True)

# Bring the IDs back to their original values
combined_recommendations = [(item_id + 1, combined_score) for item_id, combined_score in combined_recommendations]

# Extract the recommended item IDs
final_items = [item_id for item_id, _ in combined_recommendations]
final_items = [item for item in final_items if item in ids][0:6]

print("Final Recommendations (w/ Scores):", combined_recommendations)



Final Recommendations (w/ Scores): [(436, 100.96033689355367), (640, 100.94260368631787), (674, 99.8765078895528), (678, 98.12249602459792), (672, 94.59222314626582), (677, 91.3845828827388), (599, 90.790575902231), (590, 88.63503349569186), (210, 88.11848172838852), (581, 87.03138106207327), (606, 86.71677220318885), (475, 86.05987044367238), (594, 85.44287828446743), (146, 85.04704509267196), (647, 84.92808738904316), (671, 84.06617333400919), (140, 83.48424791342926), (350, 83.38981423639984), (673, 82.8627383456341), (598, 82.69740866272384), (204, 81.58941783064749), (583, 81.2383966750311), (604, 81.1929691943959), (373, 81.13517977370252), (348, 80.55612989006052), (610, 80.29282486262584), (642, 80.2695702388507), (11, 80.12008864219513), (600, 79.93120895876324), (613, 79.84192458126574), (315, 79.20738377134374), (5, 79.0327247398992), (580, 78.64400896629267), (619, 78.4364516979211), (143, 78.4143309742943), (665, 77.70970889659208), (602, 77.30574958821136), (591, 77.27929
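
In the cell above, `user_score` is the constant 1, so the user term adds the same 0.3 to every candidate and the ranking is decided by the item scores alone. One way to let both parts influence the ranking is to rescale the item scores to [0, 1] and replace the constant with the share of neighbours who know the item. The sketch below is an assumption about how this could look, not the notebook's method; `neighbour_counts`, `k` and all values are hypothetical.

```python
# Hypothetical raw item scores and number of neighbours (out of k) who know each item
item_scores = {0: 100.0, 1: 95.0, 2: 60.0}
neighbour_counts = {0: 1, 1: 2, 2: 2}
k = 2

user_weight = 0.3
item_weight = 1 - user_weight

lo, hi = min(item_scores.values()), max(item_scores.values())

combined = {}
for item_id, raw in item_scores.items():
    # Min-max scale the item score so it is comparable to the user score
    item_score = (raw - lo) / (hi - lo) if hi > lo else 0.0
    # Share of neighbours who have interacted with the item
    user_score = neighbour_counts.get(item_id, 0) / k
    combined[item_id] = user_weight * user_score + item_weight * item_score

ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

With these toy numbers item 1 overtakes item 0 because both neighbours know it, which the constant user score could never produce.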

Keep only the items that are also listed in **ids**, i.e. the ones that are "exercise-appropriate". The items are already ranked by score, so the first 6 are taken. These will be recommended to the user in exercise form.

In [28]:
final_items = [item for item in final_items if item in ids][0:6]
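
The filter keeps the ranking intact because a list comprehension preserves order. A small sketch of the same step, using a set for faster membership tests; `ranked_ids` copies the first IDs from the printed output, while `allowed` and `top_n` are hypothetical.

```python
# First few item IDs from the ranked output, best score first
ranked_ids = [436, 640, 674, 678, 672, 677, 599, 590]

# Hypothetical set of exercise-appropriate IDs
allowed = {640, 678, 599, 590, 5}

top_n = 3  # 6 in the notebook

# Keep ranked order, drop items that are not allowed, then cut to top_n
picked = [item for item in ranked_ids if item in allowed][:top_n]
print(picked)
```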

In [40]:
final_words = []

for item in final_items:
    item = "item_" + str(item)
    definition = df2[df2["item_id"] == item]["item_definition"]
    final_words.append(definition)
    
final_words = [item.values[0] for item in final_words]

In [41]:
final_words

['pudding', 'penis', 'sister', 'baby', 'sprinkler', 'meow']


**Criticisms:**

- It seems redundant to both calculate user-based similarity **and** segment our data based on the demographics of the target user

- Since we do not have complete information on which items the target user has already interacted with (i.e. which words the child has learned), many of the recommended items are likely to be words the user already knows, especially since these are "popular" items

- Evaluating the model will be tricky, since we cannot collect new data and are also working with binary data

**Solutions:**

- Drop the data segmentation
- Recommend not only the items with the highest scores, but also add some low-scoring items to the list of recommendations. **Problem:** these words will likely not be appropriate for the target user, especially for those in younger age groups
- Now that the data segmentation has been dropped, we can evaluate our user-based similarities by investigating the demographic characteristics of the similar users. Will the "nearest neighbours" be of the same age and gender as our target user?
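
The neighbour check in the last point could look roughly like this. The sketch assumes a demographics table indexed by `child_id` with `age` and `gender` columns; these column names, the neighbour IDs and all values are hypothetical.

```python
import pandas as pd

# Hypothetical demographics table
demo = pd.DataFrame({
    "child_id": [1, 2, 3, 4],
    "age": [18, 18, 30, 19],
    "gender": ["F", "F", "M", "F"],
}).set_index("child_id")

target_id = 1
neighbour_ids = [2, 4]  # e.g. child IDs of the rows returned by kneighbors

target = demo.loc[target_id]
neighbours = demo.loc[neighbour_ids]

# Compare each neighbour with the target user
report = pd.DataFrame({
    "age_gap": (neighbours["age"] - target["age"]).abs(),
    "same_gender": neighbours["gender"] == target["gender"],
})
print(report)
```

Repeating this for many target users and averaging `age_gap` and `same_gender` would give a rough indication of whether the neighbours resemble the target demographically.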