In [1]:
#importing libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
# nltk.download('stopwords')
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

import warnings 
warnings.filterwarnings('ignore')

## Load the movies and ratings data 

In [2]:
ratings = pd.read_csv('ml-1m/ratings.dat', sep='::', engine='python', header=None, names=["UserID", "MovieID", "Rating", "Timestamp"], encoding="ISO-8859-1")  # multi-character separators need the python engine
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [3]:
movies = pd.read_csv('ml-1m/movies.dat', sep='::', engine='python', header=None, names=["MovieID", "Title", "Genres"], encoding="ISO-8859-1")
movies.head()

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
users = pd.read_csv('ml-1m/users.dat', sep='::', engine='python', header=None, names=["UserID", "Gender", "Age", "Occupation", "Zip-code"], encoding="ISO-8859-1")
users.head()

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


## Singular Value Decomposition

Matrix decomposition, also known as matrix factorization, expresses a given matrix as a product of simpler constituent matrices.
The Singular Value Decomposition (SVD) is one such method; working with the factors instead of the original matrix simplifies certain subsequent matrix operations.

SVD factorizes a matrix A into three matrices, A = U Σ V^T, where U and V have orthonormal columns and Σ is diagonal and holds the non-negative singular values.
A classic application is a rectangular gene expression data matrix A of size n x p, where the n rows represent genes and the p columns represent experimental conditions.
Unlike eigendecomposition, which requires a square matrix, SVD can decompose a rectangular matrix with any number of rows and columns.
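
As a small illustration of the factorization described above, the sketch below decomposes a toy rectangular matrix with `np.linalg.svd` and multiplies the factors back together. The matrix `A` and its values are made up for the example and are not taken from the MovieLens data.

```python
import numpy as np

# Toy 4 x 3 rectangular matrix (illustrative values only).
A = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 2.0, 4.0]])

# Reduced SVD: U is 4 x 3, s holds the 3 singular values, Vt is 3 x 3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together recovers A (up to rounding).
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))
```

`np.linalg.svd` returns the singular values in descending order, so keeping only the first few columns of `U` gives a low-rank approximation of `A`.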

## Principal Component Analysis 

Principal Component Analysis (PCA) is a dimensionality-reduction technique: it transforms a large set of variables into a smaller set that retains most of the information in the original. PCA simplifies data, can reduce noise, and can reveal latent variables that are not measured directly. It identifies a smaller number of new features that compress the original dataset, and the share of variance they capture depends on how many of them we keep. The transformation is constructed so that the first principal component has the largest possible variance, that is, it accounts for as much of the variability in the data as possible. Each succeeding component in turn has the highest variance possible subject to being orthogonal to the preceding components.
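
The variance ordering of the components can be checked on synthetic data. The sketch below is one way to do PCA by hand (centre, covariance, eigendecomposition, projection). The data, the random seed, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 synthetic samples with 3 features; the first two are strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([base, 2 * base, rng.normal(size=(200, 1))])
X = X + 0.1 * rng.normal(size=(200, 3))

# Centre each feature, then take the covariance between features (columns).
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order,
# so reorder to put the largest-variance direction first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of total variance captured by each principal component.
explained = eigvals / eigvals.sum()

# Project onto the first two principal components.
X_reduced = X_centered @ eigvecs[:, :2]
print(explained)
print(X_reduced.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting the eigenvalues is equivalent to ranking the components by how much variability they explain.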

## Build an m x u matrix with movies as rows and users as columns, then normalize it

In [5]:
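# Movies x users matrix, zero-initialised so that unrated pairs are 0.
# (np.ndarray(...) allocates uninitialised memory and can leave garbage in unrated cells.)
ratings_mat = np.zeros(shape=(np.max(ratings.MovieID.values), np.max(ratings.UserID.values)),
                       dtype=np.uint8)

# MovieID and UserID are 1-based, so shift them to 0-based indices.
ratings_mat[ratings.MovieID.values-1, ratings.UserID.values-1] = ratings.Rating.values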

In [7]:
# Centre each movie's row on its mean rating (unrated entries count as 0 in the mean).
normalised_mat = ratings_mat - np.mean(ratings_mat, axis=1, keepdims=True)
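
To make the indexing and centring steps concrete, here is a minimal sketch on a made-up ratings table with three users and three movies. The `toy` DataFrame is hypothetical and only mirrors the column names of `ratings`.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings table using the same column names as ratings.dat.
toy = pd.DataFrame({
    "UserID":  [1, 1, 2, 3],
    "MovieID": [1, 3, 1, 2],
    "Rating":  [5, 3, 4, 2],
})

# Rows are movies, columns are users; IDs are 1-based, hence the -1 shift.
mat = np.zeros((toy.MovieID.max(), toy.UserID.max()), dtype=np.uint8)
mat[toy.MovieID.values - 1, toy.UserID.values - 1] = toy.Rating.values

# Subtract each movie's mean rating from its row; the result is float64.
centered = mat - mat.mean(axis=1, keepdims=True)
print(mat)
print(centered)
```

After centring, every row sums to zero, which is the property the covariance step relies on.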

## Covariance matrix for the entire dataset

In [11]:
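# np.cov treats each row (movie) as a variable, so covMatrix is movies x movies.
# bias=True divides by N (the number of users) rather than N - 1.
covMatrix = np.cov(normalised_mat, bias=True)
print(covMatrix)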

[[ 4.12962236e+00  5.75673984e-01  2.58562508e-01 ...  3.68631310e-02
  -1.89651770e-03  1.01335714e-01]
 [ 5.75673984e-01  1.16329503e+00  1.58818473e-01 ...  2.35825183e-02
   1.72580150e-04  4.73469365e-02]
 [ 2.58562508e-01  1.58818473e-01  7.53929543e-01 ...  1.23723740e-02
  -3.18604447e-03  2.89347507e-02]
 ...
 [ 3.68631310e-02  2.35825183e-02  1.23723740e-02 ...  1.28726701e-01
   2.31599491e-02  7.54817223e-02]
 [-1.89651770e-03  1.72580150e-04 -3.18604447e-03 ...  2.31599491e-02
   1.07279944e-01  5.63097013e-02]
 [ 1.01335714e-01  4.73469365e-02  2.89347507e-02 ...  7.54817223e-02
   5.63097013e-02  9.15346668e-01]]


## Eigenvectors from the covariance matrix

In [13]:
# Columns of evecs are the eigenvectors; np.linalg.eig does not guarantee any eigenvalue order.
evals, evecs = np.linalg.eig(covMatrix)

In [16]:
# Same decomposition through the LA alias; w and v duplicate evals and evecs.
from numpy import linalg as LA
w, v = LA.eig(covMatrix)
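
Because a covariance matrix is symmetric, `np.linalg.eigh` is an alternative to `eig` that returns real eigenvalues in ascending order. The sketch below shows one way to reorder the result so the leading columns correspond to the largest eigenvalues. The matrix `C` is a small stand-in for `covMatrix`, with made-up values.

```python
import numpy as np

# Small symmetric matrix standing in for covMatrix (illustrative values).
C = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 3.0]])

# eigh assumes a symmetric input and returns eigenvalues in ascending order.
vals, vecs = np.linalg.eigh(C)

# Reorder so that column 0 is the eigenvector of the largest eigenvalue.
order = np.argsort(vals)[::-1]
vals_sorted = vals[order]
vecs_sorted = vecs[:, order]
print(vals_sorted)
```

With this ordering, a slice such as `vecs_sorted[:, :k]` is guaranteed to pick the k directions of highest variance.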

## Find the 10 closest movies by cosine similarity using the first 50 PCA components
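
The cell below calls `top_cosine_similarity` and `print_similar_movies`, which are not defined in the cells shown here. The following is a minimal sketch of what such helpers might look like, with signatures inferred from the call site. It is an assumption rather than the notebook's original code. It also assumes that row `i` of `data` corresponds to `MovieID` `i + 1` and that every returned ID has an entry in the movies table.

```python
import numpy as np
import pandas as pd

def top_cosine_similarity(data, movie_id, top_n=10):
    """Return 0-based row indexes of the top_n rows most similar to movie_id."""
    index = movie_id - 1                      # MovieID is 1-based
    movie_row = data[index, :]
    # Euclidean norm of every row.
    magnitude = np.sqrt(np.einsum('ij,ij->i', data, data))
    # Cosine similarity between the chosen row and all rows.
    similarity = data @ movie_row / (magnitude[index] * magnitude)
    return np.argsort(-similarity)[:top_n]

def print_similar_movies(movie_data, movie_id, top_indexes):
    """Print the titles for the given 0-based indexes."""
    title = movie_data.loc[movie_data.MovieID == movie_id, "Title"].values[0]
    print(f"Recommendations for {title}: \n")
    for mid in top_indexes + 1:               # back to 1-based MovieID
        print(movie_data.loc[movie_data.MovieID == mid, "Title"].values[0])

# Tiny hypothetical example: three "movies" described by two components.
toy_data = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
toy_movies = pd.DataFrame({"MovieID": [1, 2, 3], "Title": ["A", "B", "C"]})
toy_top = top_cosine_similarity(toy_data, movie_id=1, top_n=2)
print_similar_movies(toy_movies, 1, toy_top)
```

A movie is always most similar to itself, so the query title appears first in the list, as it does in the output below.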

In [25]:
k = 50
movie_id = 1 # Grab an id from movies.dat
top_n = 10

sliced = evecs[:, :k] # first k eigenvectors as the movie representation; assumes columns are ordered by decreasing eigenvalue
top_indexes = top_cosine_similarity(sliced, movie_id, top_n)
print_similar_movies(movies, movie_id, top_indexes)

Recommendations for Toy Story (1995): 

Toy Story (1995)
Toy Story 2 (1999)
Babe (1995)
Bug's Life, A (1998)
Pleasantville (1998)
Babe: Pig in the City (1998)
Aladdin (1992)
Stuart Little (1999)
Secret Garden, The (1993)
Tarzan (1999)
