# Explanation of Logical Error

There was a logical error in the code-along code. This is my process for identifying and solving it.

In [23]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from fuzzywuzzy import process
from IPython.display import display

# Code-Along Code (With Error)

In [44]:
df_movies = pd.read_csv("assets/movies_small.csv", usecols = ["movieId", "title"])
df_ratings = pd.read_csv("assets/ratings_small.csv", usecols = ["userId", "movieId", "rating"])

movies_cat = pd.Categorical(df_ratings['movieId'])
users_cat = pd.Categorical(df_ratings['userId'])

mat_movies_users = csr_matrix((df_ratings['rating'], (movies_cat.codes, users_cat.codes)))

model_KNN = NearestNeighbors(metric = "cosine", algorithm = "brute")
model_KNN.fit(mat_movies_users) # fitting model to sparse matrix

def recommender(movie_name, model = model_KNN, data = mat_movies_users, n_recommendations = 9):
    idx = process.extractOne(movie_name, df_movies["title"])[2]

    distances, indices = model.kneighbors(data[idx], n_neighbors = n_recommendations + 1)

    indices = indices.flatten()[1:]
    
    # print results:
    print(f"Movie Selected: \"{df_movies.loc[idx]['title']}\"\n") # selected movie title
    for a, i in enumerate(indices): # looping through indices:
        print(f"{a + 1}. {df_movies.loc[i]['title']}") # print each title in order from closest to farthest

### Using the small dataset, the recommendations are really bad

I initially thought this was due to the size of the dataset, but similar results also show up in the full dataset when searching for less popular movies.

In [25]:
recommender("Star Wars")

Movie Selected: "Star Wars: Episode IV - A New Hope (1977)"

1. Cheech and Chong's Up in Smoke (1978)
2. Once Upon a Time in the West (C'era una volta il West) (1968)
3. Princess Bride, The (1987)
4. Walk on the Moon, A (1999)
5. Some Kind of Wonderful (1987)
6. Arsenic and Old Lace (1944)
7. Black Mask (Hak hap) (1996)
8. Local Hero (1983)
9. Godfather, The (1972)


Since the movies are sorted from more to less popular, bad recommendations for the less popular movies suggest an index mismatch somewhere in the dataset.

This would also explain why the small dataset gives worse recommendations, as removing more movies would create index mismatches earlier in the dataset.

### Created test function to check for index mismatches

In [26]:
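def test(indices):
    """Compare each movie's rating sum in df_ratings with the sum of the matrix row at the same index."""
    errors = 0

    for idx in indices:
        movie_id = df_movies.loc[idx]["movieId"] # movieId stored at this row of df_movies
        sum_dataframe = df_ratings[df_ratings["movieId"] == movie_id]["rating"].sum() # sum taken straight from the ratings
        sum_matrix = mat_movies_users[idx].sum() # sum of the matrix row at the same index

        # differing sums mean the matrix row belongs to a different movie
        if sum_dataframe != sum_matrix:
            errors += 1
            print(f"Comparing index {idx}, result: {sum_dataframe == sum_matrix} ({sum_dataframe, sum_matrix})")

    if errors == 0:
        print("All indices tested with no mismatches")
    print(f"{errors} errors found")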

### Initially, indices seem to line up properly

In [45]:
indices = [i for i in range(0, 500)]
test(indices)

All indices tested with no mismatches
0 errors found


### However, at index 816 something happens

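From index 816 onward, every single index is mismatched. In the output below, the matrix sum at each index equals the dataframe sum at the next index, so the matrix rows are shifted by one relative to df_movies.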

In [46]:
indices = [i for i in range(800, 820)]
test(indices)

Comparing index 816, result: False ((0.0, 105.0))
Comparing index 817, result: False ((105.0, 37.5))
Comparing index 818, result: False ((37.5, 278.5))
Comparing index 819, result: False ((278.5, 349.5))
4 errors found


### In the full dataset, this is seen at index 8403 and up

This explains why some movies occasionally give very bad recommendations even in the full dataset.

In [43]:
# indices = [i for i in range(8400, 8407)]
# test(indices)

Comparing index 8403, result: False ((0.0, 201.0))
Comparing index 8404, result: False ((201.0, 97.0))
Comparing index 8405, result: False ((97.0, 131.0))
Comparing index 8406, result: False ((131.0, 104.5))
4 errors found


### Identifying the issue

An issue with index mismatching has been seen in both datasets, but more commonly in the small dataset  


Describing the recommender:  
1. In the recommender function, an index matching the input title is picked out from df_movies  
2. Then, the KNN model is given data[idx], in other words that same index from the data matrix  

This is where the issue lies. The sparse matrix is created from df_ratings, which only contains the movieIds of movies that have at least one rating. If a movie does not have a single rating, it does not appear in the ratings dataframe and gets no row in the matrix, so every matrix row after it is shifted relative to df_movies. This is seen in the full dataset at index 8403. It is also why the error happens earlier and more frequently in the small dataset: a lot of ratings were removed, so more movies are missing from the ratings dataset. The same shift affects the output, since the neighbor indices returned by the model are matrix rows but are looked up in df_movies.

As an example, if the title of movieId 817 were searched for, idx would get the value 816 (the index of that movie in df_movies). Index 816 from the matrix would then be passed to the KNN model, but since movieId 817 has no ratings, that row actually belongs to movieId 818, so the prediction is made for an entirely different movie than the one that was searched for.
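
To make the shift concrete, here is a small sketch on made-up data. The names `toy_movies`, `toy_ratings` and `toy_matrix` and all values in them are hypothetical stand-ins for the real dataframes, not part of the code-along. The matrix is built the same way as above, and the movie with movieId 30 has no ratings.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy data: movie "C" (movieId 30) has no ratings.
toy_movies = pd.DataFrame({"movieId": [10, 20, 30, 40],
                           "title": ["A", "B", "C", "D"]})
toy_ratings = pd.DataFrame({"userId": [1, 2, 1, 2],
                            "movieId": [10, 10, 20, 40],
                            "rating": [4.0, 5.0, 3.0, 2.0]})

# Same construction as the code-along: the categories come from the ratings only.
movies_cat = pd.Categorical(toy_ratings["movieId"])
users_cat = pd.Categorical(toy_ratings["userId"])
toy_matrix = csr_matrix((toy_ratings["rating"].to_numpy(),
                         (movies_cat.codes, users_cat.codes)))

# movieId behind each matrix row: 30 is missing, so 40 moves up to row 2.
matrix_row_ids = movies_cat.categories.tolist()

# Compare what df_movies and the matrix hold at each row index.
for idx, movie_id in enumerate(toy_movies["movieId"]):
    in_matrix = matrix_row_ids[idx] if idx < len(matrix_row_ids) else None
    print(f"index {idx}: df_movies has movieId {movie_id}, matrix has movieId {in_matrix}")
```

Row 2 of `toy_movies` is the unrated movie, but row 2 of `toy_matrix` holds the ratings for movieId 40. This is the same pattern as the `(0.0, 105.0)` mismatch at index 816.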

---

### Solution?

There are 2 easily available solutions to the issue:  

1. Add in all movieIds from df_movies to the matrix when it is created  
2. Remove all movieIds that do not have a rating from df_movies  

With either solution, the matrix rows and df_movies cover the same set of movieIds in the same order, so the indices line up properly regardless of which movie is searched for.
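
The second solution is not implemented in this notebook. A minimal sketch of what it could look like, again on made-up data (`toy_movies`, `toy_ratings` and `toy_movies_rated` are hypothetical names). It assumes the movies dataframe is sorted by movieId, since `pd.Categorical` sorts its categories by default:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy data: movie "C" (movieId 30) has no ratings.
toy_movies = pd.DataFrame({"movieId": [10, 20, 30, 40],
                           "title": ["A", "B", "C", "D"]})
toy_ratings = pd.DataFrame({"userId": [1, 2, 1, 2],
                            "movieId": [10, 10, 20, 40],
                            "rating": [4.0, 5.0, 3.0, 2.0]})

# Keep only movies with at least one rating, then renumber the rows from 0
# so that row positions match the matrix rows again.
has_rating = toy_movies["movieId"].isin(toy_ratings["movieId"])
toy_movies_rated = toy_movies[has_rating].reset_index(drop=True)

# Matrix built exactly as in the code-along.
movies_cat = pd.Categorical(toy_ratings["movieId"])
users_cat = pd.Categorical(toy_ratings["userId"])
toy_matrix = csr_matrix((toy_ratings["rating"].to_numpy(),
                         (movies_cat.codes, users_cat.codes)))

# Sanity check: matrix rows and dataframe rows now refer to the same movieIds.
aligned = movies_cat.categories.tolist() == toy_movies_rated["movieId"].tolist()
print(f"Rows aligned: {aligned}")
```

The trade-off is that unrated movies can no longer be searched for at all, whereas the first solution keeps them in df_movies with an empty matrix row.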

Below, I'm solving the issue using the first solution by passing the following argument to `pd.Categorical` before creating the matrix:
```py
categories = df_movies["movieId"]
```
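
To show what this argument changes, here is a sketch on made-up data (`toy_movies`, `toy_ratings` and `toy_matrix` are hypothetical names). The explicit `shape` is my own addition and is not in the code-along: without it, `csr_matrix` infers the number of rows from the largest code, so unrated movies at the very end of the movies dataframe would still get no row.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy data: movieIds 30 and 50 have no ratings.
toy_movies = pd.DataFrame({"movieId": [10, 20, 30, 40, 50],
                           "title": ["A", "B", "C", "D", "E"]})
toy_ratings = pd.DataFrame({"userId": [1, 2, 1, 2],
                            "movieId": [10, 10, 20, 40],
                            "rating": [4.0, 5.0, 3.0, 2.0]})

# With explicit categories, each code is the row position of that movie in toy_movies.
movies_cat = pd.Categorical(toy_ratings["movieId"], categories=toy_movies["movieId"])
users_cat = pd.Categorical(toy_ratings["userId"])
print(movies_cat.codes)

# Assumption: fix the shape so that every movie gets a row, including trailing unrated ones.
shape = (len(movies_cat.categories), len(users_cat.categories))
toy_matrix = csr_matrix((toy_ratings["rating"].to_numpy(),
                         (movies_cat.codes, users_cat.codes)), shape=shape)

# Unrated movies now keep their row, it is just empty.
for idx, title in enumerate(toy_movies["title"]):
    print(f"index {idx} ({title}): matrix row sum = {toy_matrix[idx].sum()}")
```

One thing to keep in mind: a rating whose movieId is missing from the categories gets the code -1, so this relies on every rated movie being present in df_movies.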

In [47]:
df_movies = pd.read_csv("assets/movies_small.csv", usecols = ["movieId", "title"])
df_ratings = pd.read_csv("assets/ratings_small.csv", usecols = ["userId", "movieId", "rating"])

movies_cat = pd.Categorical(df_ratings["movieId"], categories = df_movies["movieId"]) # <- here
users_cat = pd.Categorical(df_ratings["userId"])

mat_movies_users = csr_matrix((df_ratings['rating'], (movies_cat.codes, users_cat.codes)))

model_KNN = NearestNeighbors(metric = "cosine", algorithm = "brute")
model_KNN.fit(mat_movies_users) # fitting model to sparse matrix

def recommender(movie_name, model = model_KNN, data = mat_movies_users, n_recommendations = 9):
    idx = process.extractOne(movie_name, df_movies["title"])[2]

    distances, indices = model.kneighbors(data[idx], n_neighbors = n_recommendations + 1)

    indices = indices.flatten()[1:]
    
    # print results:
    print(f"Movie Selected: \"{df_movies.loc[idx]['title']}\"\n") # selected movie title
    for a, i in enumerate(indices): # looping through indices:
        print(f"{a + 1}. {df_movies.loc[i]['title']}") # print each title in order from closest to farthest

### Now, everything lines up, and the predictions are much better, even on the small dataset

In [48]:
recommender("Star Wars")

Movie Selected: "Star Wars: Episode IV - A New Hope (1977)"

1. Star Wars: Episode V - The Empire Strikes Back (1980)
2. Star Wars: Episode VI - Return of the Jedi (1983)
3. Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
4. Matrix, The (1999)
5. Indiana Jones and the Last Crusade (1989)
6. Back to the Future (1985)
7. Star Wars: Episode I - The Phantom Menace (1999)
8. Terminator, The (1984)
9. Godfather, The (1972)


### Using the test function again, there are no longer any index mismatches found

In [49]:
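def test(indices):
    """Compare each movie's rating sum in df_ratings with the sum of the matrix row at the same index."""
    errors = 0

    for idx in indices:
        movie_id = df_movies.loc[idx]["movieId"] # movieId stored at this row of df_movies
        sum_dataframe = df_ratings[df_ratings["movieId"] == movie_id]["rating"].sum() # sum taken straight from the ratings
        sum_matrix = mat_movies_users[idx].sum() # sum of the matrix row at the same index

        # differing sums mean the matrix row belongs to a different movie
        if sum_dataframe != sum_matrix:
            errors += 1
            print(f"Comparing index {idx}, result: {sum_dataframe == sum_matrix} ({sum_dataframe, sum_matrix})")

    if errors == 0:
        print("All indices tested with no mismatches")
    print(f"{errors} errors found")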

### Iterating over every single index

In [51]:
indices = [i for i in range(0, len(df_movies))]
test(indices)

All indices tested with no mismatches
0 errors found
