## Model-based Collaborative Filtering Systems using SVD Matrix Factorization

__Introduction:__
In these types of systems, you build a model from user ratings and then make recommendations based on that model. Here the model is a truncated SVD of the user-movie rating matrix.
In this project we are going to use the MovieLens 100K dataset collected by the GroupLens Research Project at the University of Minnesota: https://grouplens.org/datasets/movielens/100k/

In [1]:
# importing libraries
import numpy as np
import pandas as pd
import sklearn
from sklearn.decomposition import TruncatedSVD

### Preparing the data

In [2]:
# creating column names for the first dataset
columns = ['user_id', 'item_id', 'rating', 'timestamp']

# importing the dataset
frame = pd.read_csv('../Datasets/u.data', sep='\t', names=columns)
frame.head()

,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [3]:
# creating column names for the second dataset
columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
          'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
          'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

# importing the item attributes dataset
movies = pd.read_csv('C://Users//Baash//Desktop//Datasets//u.item', sep='|', names=columns, encoding='latin-1')
movie_names = movies[['item_id', 'movie title']]
movie_names.head()

,item_id,movie title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


Next, we are going to combine the two dataframes into one.

In [4]:
combined_movies_data = pd.merge(frame, movie_names, on='item_id')
combined_movies_data.head()

,user_id,item_id,rating,timestamp,movie title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)


As you can see, we now have a dataset that pairs every rating with the title of the movie it was given to.
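
A merge like this can silently drop or duplicate rows if the lookup table is incomplete or has repeated keys. The following is a minimal sketch of one way to guard against that, using toy stand-ins for `frame` and `movie_names` (the names `ratings`, `titles`, and `rows_preserved`, and all values, are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Toy stand-ins for `frame` and `movie_names` (values are illustrative only)
ratings = pd.DataFrame({
    'user_id': [1, 2, 3],
    'item_id': [10, 10, 20],
    'rating':  [4, 5, 3],
})
titles = pd.DataFrame({
    'item_id': [10, 20],
    'movie title': ['Movie A', 'Movie B'],
})

# validate='many_to_one' raises an error if an item_id appears twice in `titles`
merged = pd.merge(ratings, titles, on='item_id', how='inner', validate='many_to_one')

# An inner merge against a complete lookup table should keep every rating row
rows_preserved = len(merged) == len(ratings)
print(merged)
```

If `rows_preserved` were `False` on the real data, some ratings would refer to an `item_id` that has no title.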

In [5]:
# Let's find out which movie has got the most reviews
combined_movies_data.groupby('item_id')['rating'].count().sort_values(ascending=False).head()

item_id
50     583
258    509
100    508
181    507
294    485
Name: rating, dtype: int64

Movie id 50 has the most reviews (583), which makes it the most-rated movie in the dataset.

Let's find out the name of that movie.

In [6]:
# boolean mask for item_id 50 (named so that it does not shadow the built-in `filter`)
is_item_50 = combined_movies_data['item_id'] == 50
combined_movies_data[is_item_50]['movie title'].unique()

array(['Star Wars (1977)'], dtype=object)

There you go: movie id 50 is Star Wars (1977).
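
Because the merged dataframe already carries the titles, the count and the lookup can probably be done in one step by grouping on the title instead of the id. A small sketch on toy data (the dataframe `combined` and its values are illustrative assumptions standing in for `combined_movies_data`):

```python
import pandas as pd

# Toy stand-in for `combined_movies_data` (values are illustrative only)
combined = pd.DataFrame({
    'item_id': [50, 50, 50, 242, 242],
    'movie title': ['Star Wars (1977)'] * 3 + ['Kolya (1996)'] * 2,
    'rating': [5, 4, 5, 3, 5],
})

# Count and average the ratings per title, most-rated first
summary = (combined.groupby('movie title')['rating']
           .agg(['count', 'mean'])
           .sort_values('count', ascending=False))

most_rated = summary.index[0]
print(summary)
```

One caveat with this approach: if two different `item_id` values share the same title, their ratings are pooled together.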

### Building a Utility Matrix
Now let's turn to building a utility matrix. This matrix contains a value for each user and each movie. Where the user rated the movie, the entry is that rating. All other user-movie entries are filled with 0, because `fill_value=0` is passed to `pivot_table`.

In [8]:
rating_crosstab = combined_movies_data.pivot_table(values='rating', index='user_id', columns='movie title', fill_value=0)
rating_crosstab.head()

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
1,0,0,2,5,0,0,3,4,0,0,...,0,0,0,5,3,0,0,0,4,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,2,0,0,0,0,4,0,0,...,0,0,0,4,0,0,0,0,4,0
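
To see what `fill_value=0` changes, here is a minimal sketch on a toy ratings table (the dataframe `toy` and its values are illustrative assumptions). It builds the same kind of pivot with and without the fill, so the gaps left by unrated movies are visible:

```python
import pandas as pd

# Toy ratings: user 2 never rated Movie B, user 3 never rated Movie A
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'movie title': ['Movie A', 'Movie B', 'Movie A', 'Movie B'],
    'rating': [5, 3, 4, 2],
})

# Without fill_value, user-movie pairs that were never rated stay NaN
sparse = toy.pivot_table(values='rating', index='user_id', columns='movie title')

# With fill_value=0, as in the cell above, the same gaps become 0
dense = toy.pivot_table(values='rating', index='user_id', columns='movie title', fill_value=0)

missing_before = int(sparse.isna().sum().sum())
missing_after = int(dense.isna().sum().sum())
print(sparse)
print(dense)
```

Note that a 0 in the filled matrix means "not rated", not "rated very poorly"; the decomposition that follows cannot tell the two apart.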


### Transposing the Matrix
Next, we're going to take this utility matrix and transpose it, so that each row is a movie and each column is a user. We'll then use SVD to decompose it into a compact, synthetic representation of the user reviews.
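
As a rough sketch of what the decomposition does, assuming the standard truncated SVD formulation (the symbols $U_k$, $\Sigma_k$, $V_k$, and $k$ are notation introduced here, not names from the notebook), the transposed matrix $X$ of movies by users is approximated as:

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad
X \in \mathbb{R}^{1664 \times 943},\quad
U_k \in \mathbb{R}^{1664 \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{943 \times k}
$$

With $k = 12$ components, each movie is summarized by one row of $U_k \Sigma_k$, that is, by 12 numbers instead of 943 ratings. This reduced matrix is what `fit_transform` returns.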

In [9]:
rating_crosstab.shape

(943, 1664)

In [11]:
X = rating_crosstab.transpose()
X.shape

(1664, 943)

### Decomposing the Matrix

In [12]:
SVD = TruncatedSVD(n_components=12, random_state=17)

resultant_matrix = SVD.fit_transform(X)

resultant_matrix.shape

(1664, 12)
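
The choice of 12 components is not justified in the notebook. One common way to check such a choice is to look at how much variance the components retain, via the `explained_variance_ratio_` attribute of a fitted `TruncatedSVD`. A minimal sketch on a random toy matrix (the names `toy_X`, `svd`, `toy_reduced`, and `retained`, and the matrix itself, are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Random toy "movies x users" ratings matrix (values are illustrative only)
rng = np.random.default_rng(17)
toy_X = rng.integers(0, 6, size=(30, 20)).astype(float)

svd = TruncatedSVD(n_components=5, random_state=17)
toy_reduced = svd.fit_transform(toy_X)

# Fraction of the matrix's variance retained by the 5 components
retained = float(svd.explained_variance_ratio_.sum())
print(toy_reduced.shape, round(retained, 3))
```

On the real data, running `SVD.explained_variance_ratio_.sum()` after the cell above would give the corresponding figure for 12 components.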

### Generating a Correlation Matrix
Next, let's generate a correlation matrix. We'll calculate the Pearson correlation coefficient (r) for every pair of movies in the resultant matrix. Two movies correlate highly when their 12-component representations, which are derived from user ratings, follow a similar pattern.

In [15]:
corr_mat = np.corrcoef(resultant_matrix)
corr_mat.shape

(1664, 1664)
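
`np.corrcoef` treats each row as one variable by default, which is why the result has one row and one column per movie. A small sketch with three toy "movies" (the array `toy` and its values are illustrative assumptions):

```python
import numpy as np

# Three toy "movies", each described by four latent components
toy = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # same pattern as row 0, scaled
    [4.0, 3.0, 2.0, 1.0],   # reversed pattern
])

# Each row is a variable (rowvar=True is the default), so the result is 3 x 3
toy_corr = np.corrcoef(toy)
print(np.round(toy_corr, 2))
```

Rows 0 and 1 correlate at about 1 even though their magnitudes differ, because Pearson correlation compares the pattern across components rather than the scale.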

### Isolating Star Wars From the Correlation Matrix

In [16]:
# generating movie names index
movie_names = rating_crosstab.columns

# changing the resulting numpy array to a list
movies_list = list(movie_names)

# finding the numeric index value of Star Wars, to use it as a movie of interest
star_wars = movies_list.index('Star Wars (1977)')
star_wars

1398

In [17]:
# let's isolate the row of the correlation matrix that represents Star Wars (index 1398, found above)
corr_star_wars = corr_mat[star_wars]
corr_star_wars.shape

(1664,)

Now, let's generate a list of movie names that exhibit a high degree of correlation with Star Wars.

### Recommending a Highly Correlated Movie

In [18]:
list(movie_names[(corr_star_wars<1.0) & (corr_star_wars > 0.9)])


['Die Hard (1988)',
 'Empire Strikes Back, The (1980)',
 'Fugitive, The (1993)',
 'Raiders of the Lost Ark (1981)',
 'Return of the Jedi (1983)',
 'Terminator 2: Judgment Day (1991)',
 'Terminator, The (1984)',
 'Toy Story (1995)']

Finally, we'll raise the threshold to list only the movies that correlate with Star Wars even more closely.

In [20]:
list(movie_names[(corr_star_wars<1.0) & (corr_star_wars > 0.95)])

['Return of the Jedi (1983)']

Both movies belong to the same original trilogy, and they're both very popular sci-fi films. So it makes sense that a person who likes Star Wars (1977) will probably also like Return of the Jedi (1983).
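
Fixed thresholds such as 0.9 and 0.95 return a different number of movies for every title. An alternative is to rank by correlation and keep the top few. The helper below is a hypothetical sketch, not part of the notebook (the function `top_n_similar` and the toy data are assumptions). It excludes the movie of interest by index rather than with `< 1.0`, since a floating-point self-correlation may not come out as exactly 1.0:

```python
import numpy as np

def top_n_similar(corr_row, names, target_index, n=5):
    """Return the n (name, correlation) pairs most correlated with the target."""
    order = np.argsort(corr_row)[::-1]                 # highest correlation first
    order = [i for i in order if i != target_index]    # drop the movie itself
    return [(names[i], float(corr_row[i])) for i in order[:n]]

# Toy data (illustrative only): index 0 plays the role of the movie of interest
names = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
corr_row = np.array([1.0, 0.2, 0.97, 0.91])

top = top_n_similar(corr_row, names, target_index=0, n=2)
print(top)
```

With the notebook's variables, the equivalent call would be `top_n_similar(corr_star_wars, movies_list, star_wars)`.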

<center> THE END </center>