## Recommender system

Whether you are watching YouTube, ordering food or buying books online, listening to Spotify, or using LinkedIn, you
constantly get recommendations for new video clips, what to eat, and much more. Behind all of this is a recommender system.

### Warmup
Creating a simple recommender system for movies with k-nearest neighbours (KNN).

The dataset contains about 100,000 ratings.

In [43]:
import pandas as pd
from scipy.sparse import csr_matrix
# k-nearest neighbours
from sklearn.neighbors import NearestNeighbors

# fuzzywuzzy is an efficient way to match two strings approximately:
#   - it still matches a movie when the query has spelling mistakes,
#     different capitalisation, or missing spaces
#   - it can also return the index of the best match
from fuzzywuzzy import process
import numpy as np
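
To illustrate the kind of approximate matching described in the comments above, here is a minimal sketch. It uses the standard-library `difflib` as a stand-in for `fuzzywuzzy` so that it runs without extra installs; the `titles` list and the `best_match` helper are hypothetical, and `difflib` ratios lie between 0 and 1 rather than on fuzzywuzzy's 0 to 100 scale.

```python
from difflib import SequenceMatcher

# Hypothetical list of titles, standing in for df_movies['title']
titles = ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"]

def best_match(query, choices):
    """Return (index, title, ratio) of the choice most similar to query, ignoring case."""
    scored = [
        (SequenceMatcher(None, query.lower(), choice.lower()).ratio(), i, choice)
        for i, choice in enumerate(choices)
    ]
    ratio, idx, title = max(scored)
    return idx, title, ratio

# The query is misspelled and lower-case, yet the closest title is still found
idx, title, ratio = best_match("toy stroy", titles)
print(idx, title, round(ratio, 2))
```

The returned index plays the same role as the index taken from `process.extractOne` later in the notebook: it identifies the matched movie.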

In [44]:
movies = ("/Users/joeloscarsson/Documents/www/Machine-Learning/Projects/data2/movies.csv")
ratings = ("/Users/joeloscarsson/Documents/www/Machine-Learning/Projects/data2/ratings.csv")

In [45]:
df_movies = pd.read_csv(movies, usecols=['movieId', 'title'], dtype={'movieId': 'int32', 'title': 'str'})
df_ratings = pd.read_csv(ratings, usecols=['userId', 'movieId', 'rating'], dtype={'userId': 'int32', 'movieId': 'int32', 'rating':'float32'})

In [46]:
# The row index shows how many ratings were loaded
df_ratings.index

RangeIndex(start=0, stop=100836, step=1)

In [68]:
# To run k-nearest neighbours, the ratings are reshaped into a movie x user matrix
# and stored as a sparse matrix.
# Example (rows = movies, columns = users):
#
#             Users
#   Movie A  [4, 4, 5]
#   Movie B  [3, 3, 4]   cos(A, B) ≈ 0.9995, so A and B are rated very similarly
#   Movie C  [3, 2, 1]

# "rating" is used as the value for the same reason "Sales" was used in the exercises:
# it gives us a numeric quantity to compare.
# pivot() leaves NaN wherever a user has not rated a movie. NaN cannot be processed,
# so the gaps are filled with 0 using .fillna(0).
movies_users = df_ratings.pivot(index='movieId', columns='userId', values='rating').fillna(0)
mat_movies_users = csr_matrix(movies_users.values)
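
The cosine comparison from the small example in the comments can be worked through numerically. This is a minimal sketch with NumPy; the `cosine_similarity` helper is a hypothetical name introduced here, not part of the notebook.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# The three movie rows from the example (one rating per user)
A = np.array([4, 4, 5])
B = np.array([3, 3, 4])
C = np.array([3, 2, 1])

sim_ab = cosine_similarity(A, B)
sim_ac = cosine_similarity(A, C)
print(f"cos(A, B) = {sim_ab:.4f}")  # 0.9995
print(f"cos(A, C) = {sim_ac:.4f}")  # 0.8850
```

A and B point in almost the same direction even though B's ratings are lower overall, which is why cosine similarity suits ratings from users with different rating habits.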

In [49]:
df_movies.head()

   movieId                               title
0        1                    Toy Story (1995)
1        2                      Jumanji (1995)
2        3             Grumpier Old Men (1995)
3        4            Waiting to Exhale (1995)
4        5  Father of the Bride Part II (1995)


In [50]:
# Possible ways to measure the distance between two vectors:
#   - Euclidean distance
#   - Manhattan distance
#   - Minkowski distance
#   - Cosine similarity (used here)

# algorithm='brute' compares the query against every data point in the dataset
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)

In [51]:
model_knn.fit(mat_movies_users)
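
To see what the fitted model returns, here is a minimal sketch on a toy matrix, assuming the same `NearestNeighbors` settings as above. The names `toy_ratings` and `toy_knn` and the three-movie data are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: 3 movies (rows) rated by 3 users (columns)
toy_ratings = csr_matrix(np.array([
    [4, 4, 5],  # movie A
    [3, 3, 4],  # movie B
    [3, 2, 1],  # movie C
], dtype=float))

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
toy_knn.fit(toy_ratings)

# Query with movie A; indexing a csr_matrix by row keeps it 2-D, as kneighbors expects
distances, indices = toy_knn.kneighbors(toy_ratings[0])
print(indices)    # row positions of the neighbours, nearest first
print(distances)  # cosine distance = 1 - cosine similarity
```

The query movie comes back as its own nearest neighbour with a distance of (almost) zero, so a recommender usually drops that first hit.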

In [52]:
df_movies.loc[0]

movieId                   1
title      Toy Story (1995)
Name: 0, dtype: object

In [53]:
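def test(indices):
    # Sanity check: for each position idx, compare the sum of the ratings in df_ratings
    # for the movie at df_movies position idx with the sum of row idx in the sparse matrix.
    # False means the two sums differ, for example because row idx of the matrix
    # belongs to a different movie than row idx of df_movies.
    for idx in indices:
        movie_id = df_movies.loc[idx]["movieId"]
        sum_dataframe = df_ratings[df_ratings["movieId"] == movie_id]["rating"].sum()
        sum_matrix = mat_movies_users[idx].sum()

        print(f"Comparing index {idx}, result: {sum_dataframe == sum_matrix}")

indices = list(range(100))
test(indices)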

Comparing index 0, result: False
Comparing index 1, result: False
Comparing index 2, result: False
Comparing index 3, result: False
Comparing index 4, result: False
Comparing index 5, result: False
Comparing index 6, result: False
Comparing index 7, result: False
Comparing index 8, result: False
Comparing index 9, result: False
Comparing index 10, result: False
Comparing index 11, result: False
Comparing index 12, result: False
Comparing index 13, result: False
Comparing index 14, result: False
Comparing index 15, result: False
Comparing index 16, result: False
Comparing index 17, result: False
Comparing index 18, result: False
Comparing index 19, result: False
Comparing index 20, result: False
Comparing index 21, result: False
Comparing index 22, result: False
Comparing index 23, result: False
Comparing index 24, result: False
Comparing index 25, result: False
Comparing index 26, result: False
Comparing index 27, result: False
Comparing index 28, result: False
Comparing index 29, resu
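
One thing worth checking when a comparison like this fails is how matrix rows map to movies. The rows of a pivoted matrix follow the pivot's own `movieId` index, which leaves out movies that have no ratings, so a row position need not equal the position in the movies table. The sketch below shows this on hypothetical toy data (`toy_movies`, `toy_ratings`, `row_matches_ratings` are illustrative names) and uses `np.isclose` rather than `==` for the float comparison. It is not a diagnosis of the output above.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy data: movie 2 has no ratings, so it is missing from the pivot
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({
    "userId": [1, 2, 1],
    "movieId": [1, 1, 3],
    "rating": [4.0, 5.0, 3.0],
})

pivot = toy_ratings.pivot(index="movieId", columns="userId", values="rating").fillna(0)
matrix = csr_matrix(pivot.values)

# Row r of the matrix holds the movie with id pivot.index[r]
row_to_movie_id = pivot.index.to_numpy()

def row_matches_ratings(row):
    """Check that a matrix row sum equals the rating sum of the movie it represents."""
    movie_id = row_to_movie_id[row]
    sum_dataframe = toy_ratings.loc[toy_ratings["movieId"] == movie_id, "rating"].sum()
    return bool(np.isclose(sum_dataframe, matrix[row].sum()))

results = [row_matches_ratings(r) for r in range(matrix.shape[0])]
print(results)

# Titles are looked up by movieId, not by DataFrame position
titles_by_id = toy_movies.set_index("movieId")["title"]
print(titles_by_id[row_to_movie_id[1]])  # matrix row 1 is movie 3 ("C"), not movie 2
```

The same `row_to_movie_id` lookup can translate the row positions returned by `kneighbors` back into titles.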

In [54]:
# recommender(movie_name) => list of recommended movies

def recommender(movie_name, data, model, n_recommendations):
    # Fit the model on the movie x user matrix
    model.fit(data)
    # Let fuzzywuzzy pick the title in df_movies['title'] that best matches movie_name.
    # extractOne returns a tuple (title, score, index); element 2 is the index.
    idx = process.extractOne(movie_name, df_movies['title'])[2]
    # print(idx)
    print('Movie Selected: ', df_movies['title'][idx], 'Index: ', idx)
    print('Searching for recommendations......')
    # Use the row of the matched movie as the query to find its nearest neighbours
    distances, indices = model.kneighbors(data[idx], n_neighbors=n_recommendations)

    # The raw result is the distances and row positions of the closest movies;
    # the print is commented out in favour of the loop below.
    # print(distances, indices)

    # indices has shape (1, n_recommendations), so the loop body runs once on the whole row.
    # We do not want to compare 'toy story' with itself, so .where(i != idx)
    # replaces the selected movie with NaN.
    for i in indices:
        print(df_movies['title'][i].where(i != idx))

recommender('toy story', mat_movies_users, model_knn,20)

# process.extractOne returns a tuple such as ('Toy Story (1995)', 90, 0):
#   - the matched title
#   - the match score (90)
#   - the index of that movie in the dataset (0)

# Based on the user ratings we get movies that are rated similarly to 'toy story'.
# They do not have to be similar movies in content.
# It is also possible to sort based on genres.

Movie Selected:  Toy Story (1995) Index:  0
Searching for recommendations......
0                                                    NaN
265                 Ready to Wear (Pret-A-Porter) (1994)
312                                          Cobb (1994)
367                                    Blown Away (1994)
56     Don't Be a Menace to South Central While Drink...
90                                      Mr. Wrong (1996)
468                                    Short Cuts (1993)
38                                Dead Presidents (1995)
287                        Star Trek: Generations (1994)
451                               Renaissance Man (1994)
44                                     Pocahontas (1995)
18                 Ace Ventura: When Nature Calls (1995)
589                                    Last Dance (1996)
134                                  Crimson Tide (1995)
329                                    Paper, The (1994)
479                            Surviving the Game (1994)
216     