# The k-NN algorithm (recommendation system)

Let's create a recommendation system based on k-NN that predicts the rating you would give to something. This will be roughly analogous to [Collaborative Filtering](https://en.wikipedia.org/wiki/Collaborative_filtering).

We will create vectors based on ratings. This is likely to be highly inefficient!
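
To make "vectors based on ratings" concrete, here is a minimal sketch on made-up data (the three users and four movies are invented for illustration and are not taken from MovieLens). Each row of the pivoted table is one user's rating vector; the `NaN` entries are movies that user has not rated, which hints at why a dense representation wastes most of its space.

```python
import pandas as pd

# Hypothetical toy ratings in long format: one row per (user, movie) rating
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 40, 20],
    'rating':  [4.0, 3.5, 5.0, 4.5, 2.0, 3.0],
})

# One row per user, one column per movie: each row is a rating vector
rating_vectors = toy_ratings.pivot(index='userId', columns='movieId', values='rating')
print(rating_vectors)

# Fraction of the user x movie grid that actually holds a rating
filled_fraction = rating_vectors.notna().sum().sum() / rating_vectors.size
print(f'{filled_fraction:.0%} of cells are filled')
```

Even in this tiny example half of the grid is empty; with real users and thousands of movies the filled fraction would be far smaller.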

## The data

[MovieLens](https://grouplens.org/) publishes datasets drawn from the data behind its own recommendation system. As they don't provide a fixed, point-in-time copy of the data file, we'll obtain one from <https://archive.org> (2016/11/16). The download is likely to be very slow.

In [1]:
# Download Data
import os
import urllib.request
from pathlib import Path

# Source: https://grouplens.org/datasets/movielens/latest/
# Size: 230MB
DATA_PATH = Path('data/2-movielens.zip')
DATA_URL = 'https://web.archive.org/web/20161112181144/http://files.grouplens.org/datasets/movielens/ml-latest.zip'
if not DATA_PATH.exists():
    Path('data').mkdir(exist_ok=True)
    urllib.request.urlretrieve(DATA_URL, DATA_PATH)

The data will need significant processing before we can use it for recommendations.

The first and most important file is `ratings.csv`. It contains timestamped user ratings for movies on a scale of 0.5-5.0 stars (in 0.5 increments). The data is sparse, since each user has rated only a small fraction of the movies, and plain k-NN expects a complete vector for every point, so we'll look at how we can get around this.

Due to the data sizes involved, we're going to start using pandas.

In [2]:
# Read data
# CSV structure: userId, movieId, rating, timestamp.
import zipfile
import numpy as np
import pandas as pd


with zipfile.ZipFile(DATA_PATH) as zipin:
    zipin.testzip()
    with zipin.open('ml-latest/ratings.csv') as fin:
        data_df = pd.read_csv(
            fin,
            dtype={'userId': np.int32, 'movieId': np.int32, 'rating': np.float32, 'timestamp': np.int64},
        )

# Store ratings doubled (1-10) so that half-star steps fit exactly in an int8
data_df['rating'] = (data_df['rating'] * 2).astype(np.int8)
# Timestamps are seconds since the Unix epoch (UTC); the date_parser argument
# of read_csv is deprecated in recent pandas, so convert after loading
data_df['timestamp'] = pd.to_datetime(data_df['timestamp'], unit='s')
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24404096 entries, 0 to 24404095
Data columns (total 4 columns):
userId       int32
movieId      int32
rating       int8
timestamp    datetime64[ns]
dtypes: datetime64[ns](1), int32(2), int8(1)
memory usage: 395.7 MB


Let's exclude anyone with fewer than 100 ratings, and also exclude any movie with fewer than 1000 ratings.

It feels as though there is some point below which we don't have enough information to work with, but these particular cut-offs are arbitrary.
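
Removing users lowers the rating counts of the remaining movies (and vice versa), so a single pass of each filter is not always enough: the two filters have to be repeated until nothing changes. A minimal sketch on invented data, using a hypothetical helper `filter_until_stable` and tiny thresholds chosen only so that the effect is visible:

```python
import pandas as pd


def filter_until_stable(df, min_movie_ratings, min_user_ratings):
    """Repeat both filters until a full pass removes nothing (hypothetical helper)."""
    while True:
        before = len(df)
        # Keep movies with enough ratings, counted on the current data
        movie_counts = df.groupby('movieId')['rating'].transform('count')
        df = df[movie_counts >= min_movie_ratings]
        # Keep users with enough ratings, counted after the movie filter
        user_counts = df.groupby('userId')['rating'].transform('count')
        df = df[user_counts >= min_user_ratings]
        if len(df) == before:
            return df


# Invented ratings: movie 4 is rated once, which starts a chain of removals
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3, 3, 4, 4],
    'movieId': [1, 2, 1, 2, 3, 4, 1, 3],
    'rating':  [8, 6, 7, 6, 9, 5, 8, 4],
})
stable = filter_until_stable(toy, min_movie_ratings=2, min_user_ratings=2)
print(stable)
```

In this toy data the first pass drops movie 4 and then user 3, which leaves movie 3 with a single rating; only a second pass removes movie 3 and user 4.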

In [3]:
filtered_df_0 = data_df

while True:
    print('Filtering by movies...')
    movie_rating_counts = filtered_df_0.groupby('movieId', sort=False)['rating'].count()
    movies_with_sufficient_ratings = movie_rating_counts[movie_rating_counts >= 1000]
    filtered_df_1 = filtered_df_0[filtered_df_0['movieId'].isin(movies_with_sufficient_ratings.index)]
    print('Cutting down to', len(movies_with_sufficient_ratings), 'movies from', len(movie_rating_counts), 'original movies')
    print('Cutting down to', len(filtered_df_1), 'ratings from', len(filtered_df_0), 'previous #ratings')

    print('Filtering by users...')
    user_rating_counts = filtered_df_1.groupby('userId', sort=False)['rating'].count()
    users_with_sufficient_ratings = user_rating_counts[user_rating_counts >= 100]
    filtered_df_2 = filtered_df_1[filtered_df_1['userId'].isin(users_with_sufficient_ratings.index)]
    print('Cutting down to', len(users_with_sufficient_ratings), 'users from', len(user_rating_counts), 'original users')
    print('Cutting down to', len(filtered_df_2), 'ratings from', len(filtered_df_1), 'previous #ratings')
    
    if len(filtered_df_2) == len(filtered_df_0):
        break
    filtered_df_0 = filtered_df_2

filtered_df = filtered_df_2

Filtering by movies...
Cutting down to 3577 movies from 39443 original movies
Cutting down to 21883270 ratings from 24404096 previous #ratings
Filtering by users...
Cutting down to 57659 users from 258802 original users
Cutting down to 16175345 ratings from 21883270 previous #ratings
Filtering by movies...
Cutting down to 3296 movies from 3577 original movies
Cutting down to 15919146 ratings from 16175345 previous #ratings
Filtering by users...
Cutting down to 57091 users from 57659 original users
Cutting down to 15863428 ratings from 15919146 previous #ratings
Filtering by movies...
Cutting down to 3287 movies from 3296 original movies
Cutting down to 15854464 ratings from 15863428 previous #ratings
Filtering by users...
Cutting down to 57076 users from 57091 original users
Cutting down to 15852979 ratings from 15854464 previous #ratings
Filtering by movies...
Cutting down to 3287 movies from 3287 original movies
Cutting down to 15852979 ratings from 15852979 previous #ratings
Filteri

In [4]:
filtered_df.head(2)
for g, d in filtered_df.groupby('userId'):
    print(g, d)
    break

10      userId  movieId  rating           timestamp
232      10        1       8 2005-02-04 18:27:46
233      10        2       6 2005-02-04 18:37:36
234      10        3       7 2005-02-04 18:41:40
235      10        6       8 2005-02-04 18:33:49
236      10       10       6 2005-02-04 18:30:45
237      10       11       8 2005-02-04 18:35:57
238      10       16       7 2005-02-04 18:43:39
239      10       21       8 2005-02-04 18:32:29
240      10       22       6 2005-02-04 18:00:34
241      10       32       5 2005-02-04 18:28:13
242      10       34       7 2005-02-04 18:30:06
243      10       50       6 2005-02-04 18:28:57
244      10       62       8 2005-02-04 18:33:55
245      10      105       7 2005-02-04 18:01:50
246      10      110       8 2005-02-04 18:26:58
247      10      111       5 2005-02-04 18:35:43
248      10      141       5 2005-02-04 18:32:51
249      10      150       6 2005-02-04 18:26:53
250      10      160       5 2005-02-04 18:40:13
251      10      

## k-NN methods

In [5]:
def get_distance(user_1_df, user_2_df, threshold=20):
    """Return the distance between two users.
    
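    The distance is the sum of squared rating differences over the movies both
    users have rated. If they share fewer than `threshold` movies, return None.
    """
    shared_ratings = pd.merge(user_1_df, user_2_df, how='inner', on=['movieId'])
    if len(shared_ratings) < threshold:
        return None

    # Work in int64 and use the vectorised sum rather than the builtin
    diff = shared_ratings['rating_x'].astype(np.int64) - shared_ratings['rating_y'].astype(np.int64)
    return int((diff ** 2).sum())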
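
Because `get_distance` adds up squared differences, two users who share many movies will tend to get a larger distance than two users who share only a few, even if they agree equally well. One possible alternative (an assumption on my part, not something this notebook does) is to divide by the number of shared ratings. The hypothetical `get_mean_distance` below keeps the same signature, so it could be passed as `distance_func`:

```python
import numpy as np
import pandas as pd


def get_mean_distance(user_1_df, user_2_df, threshold=20):
    """Mean squared rating difference over shared movies (hypothetical variant).

    Returns None when the users share fewer than `threshold` movies.
    """
    shared = pd.merge(user_1_df, user_2_df, how='inner', on=['movieId'])
    if len(shared) < threshold:
        return None
    # Cast up from int8 before subtracting and squaring
    diff = shared['rating_x'].astype(np.int64) - shared['rating_y'].astype(np.int64)
    return float((diff ** 2).mean())


# Invented ratings on the doubled 1-10 scale used in this notebook
user_a = pd.DataFrame({'movieId': [1, 2, 3], 'rating': np.array([8, 6, 7], dtype=np.int8)})
user_b = pd.DataFrame({'movieId': [2, 3, 4], 'rating': np.array([8, 7, 5], dtype=np.int8)})

print(get_mean_distance(user_a, user_b, threshold=2))  # 2.0 (shared movies: 2 and 3)
print(get_mean_distance(user_a, user_b, threshold=3))  # None (only two shared movies)
```

Whether normalising actually improves the predictions here is untested; it simply removes the dependence on how many movies two users happen to share.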

In [20]:
import statistics
from operator import itemgetter
from collections import Counter
from datetime import datetime


def get_movie_prediction(grouped_users_df, test_user_df, movie_id, knn=5, distance_func=get_distance, threshold=20):
    """Return our prediction for the particular movie."""
    # We should remove the movie from test_user_df!
    test_user_df = test_user_df[test_user_df['movieId'] != movie_id]

    # Build up a list of distances to users
    print(datetime.utcnow())
    user_distances = []  # (user, distance)
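    test_user_id = test_user_df['userId'].iloc[0]
    for user_id, train_user_df in grouped_users_df:
        # Never use the test user as their own neighbour
        if user_id == test_user_id:
            continue
        distance = distance_func(train_user_df, test_user_df, threshold=threshold)
        # Skip users with insufficient data (None); a distance of 0 is a valid, perfect match
        if distance is not None:
            user_distances.append((user_id, distance))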
    user_distances.sort(key=itemgetter(1))
    neighbours = map(itemgetter(0), user_distances)
    
    # Average the ratings of the knn closest users
    print(datetime.utcnow())
    user_movie_ratings = []
    for next_closest_user_id in neighbours:
        user_movie_rating = grouped_users_df.get_group(next_closest_user_id).query('movieId==' + str(movie_id))
        try:
            user_movie_ratings.append(float(user_movie_rating.at[user_movie_rating.index[0], 'rating']))
        except IndexError:
            continue

        if len(user_movie_ratings) >= knn:
            break

    print(user_movie_ratings)
    print(datetime.utcnow())
    prediction = statistics.mean(user_movie_ratings)
    return prediction


def get_prediction(grouped_users_df, test_user_df, movies, knn=5, distance_func=get_distance, threshold=20):
    """Return our predictions for the test user for all movies."""
    test_user_id = test_user_df.iloc[0].loc['userId']
    predictions = []
    for movie_id in movies:
        prediction = get_movie_prediction(
            grouped_users_df,
            test_user_df,
            movie_id,
            knn=knn,
            distance_func=distance_func,
            threshold=threshold,
        )
        predictions.append({
            'userId': test_user_id,
            'movieId': movie_id,
            'prediction': prediction,
        })
    
    return pd.DataFrame(predictions)

## Test some code

In [7]:
grouped_users_df = filtered_df.groupby('userId')
test_user = filtered_df[filtered_df['userId'] == 14]
movies = filtered_df['movieId'].unique()

In [21]:
get_movie_prediction(grouped_users_df, test_user, 1)

2018-03-02 17:14:12.633213
2018-03-02 17:16:20.961153
[9.0, 9.0, 8.0, 9.0, 6.0]
2018-03-02 17:16:20.981072


8.2
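
The timestamps printed above show that building the list of distances took a little over two minutes for a single prediction (17:14:12 to 17:16:20), and that step performs one `pd.merge` per user. One way to reduce this, sketched here under the assumption that all ratings fit in memory as a single frame, is to merge every rating against the test user's ratings once and then aggregate per user. `get_all_distances` is a hypothetical helper; it is meant to produce the same sums of squared differences as `get_distance`.

```python
import numpy as np
import pandas as pd


def get_all_distances(ratings_df, test_user_df, threshold=20):
    """Squared-difference distance from the test user to every other user.

    Hypothetical vectorised counterpart of calling get_distance per user.
    Users sharing fewer than `threshold` movies with the test user are dropped.
    """
    test_user_id = test_user_df['userId'].iloc[0]
    others = ratings_df[ratings_df['userId'] != test_user_id]
    # A single merge pairs each rating with the test user's rating of that movie
    shared = others.merge(
        test_user_df[['movieId', 'rating']], on='movieId', suffixes=('', '_test'))
    diff = shared['rating'].astype(np.int64) - shared['rating_test'].astype(np.int64)
    squared = (diff ** 2).groupby(shared['userId'])
    distances = squared.sum()[squared.count() >= threshold]
    return distances.sort_values()


# Invented ratings: user 1 is the test user
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 2, 3, 3, 4],
    'movieId': [1, 2, 3, 1, 2, 3, 1, 2, 1],
    'rating':  np.array([8, 6, 4, 8, 6, 4, 6, 6, 2], dtype=np.int8),
})
nearest = get_all_distances(toy, toy[toy['userId'] == 1], threshold=2)
print(nearest)
```

The index of the returned Series lists user IDs from nearest to farthest, so it could stand in for the sorted `user_distances` list. In the toy data user 4 is dropped because only one movie is shared, and user 2 is kept with a distance of 0.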

In [None]:
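def get_accuracy(grouped_users_df, test_user_df, movies, knn=5, distance_func=get_distance, threshold=25):
    """Return the fraction of predictions that are perfect, off-by-1 and off-by-2.

    Only movies the test user has rated are scored (ratings use the doubled 1-10 scale).
    """
    actual = test_user_df.set_index('movieId')['rating'].astype(float)
    rated_movies = [movie_id for movie_id in movies if movie_id in actual.index]
    predictions = get_prediction(
        grouped_users_df, test_user_df, rated_movies,
        knn=knn, distance_func=distance_func, threshold=threshold,
    ).set_index('movieId')['prediction']
    # Round each prediction to the nearest rating step before comparing
    errors = (predictions.round() - actual.loc[rated_movies]).abs()
    return pd.Series({
        'perfect': (errors == 0).mean(),
        'off-by-1': (errors == 1).mean(),
        'off-by-2': (errors == 2).mean(),
    })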

In [None]:
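test_user_df = test_user[test_user['movieId'] != 1]


def has_enough_shared_ratings(group):
    """Return True if this user shares at least 20 rated movies with the test user."""
    # .sum() counts the True values; .count() would count every row
    return group['movieId'].isin(test_user_df['movieId']).sum() >= 20

def distance_to_test_user(group):
    return get_distance(group, test_user_df, threshold=20)

# Keep only users with enough overlap, then compute one distance per remaining user
candidates_df = grouped_users_df.filter(has_enough_shared_ratings)
distances = candidates_df.groupby('userId').apply(distance_to_test_user)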
