# Recomender System
Streaming services like Netflix and Amazon use past viewing data from a customer and others to suggest new content. For example, Netflix once asked users to rate movies on a scale of $1–5$, creating a large matrix with ratings from around 480,189 customers for 17,770 movies. Since most users only watched about 200 movies, 99% of the matrix was empty.

To recommend movies, Netflix needed to fill in the missing ratings. The idea is that users who have watched similar movies may have similar preferences, allowing the system to predict ratings for movies a user hasn't seen, based on ratings from similar customers.

In this lab, we use an equivalent version of PCA for Movie Recommendations (for more information, please refer to the related videos posted on Moodle for this week).

Instructions:

**Step 1:** Data Gathering

**Step 2:** Data Preprocesing

**Step 3:** Apply SVD

**Step 4:** Writing a function to recommend movies for any user.

## Step 1: Data Gathering:

1. Start by importing the necessary Python libraries, such as Numpy and Pandas.

2. Next, visit the provided URL: http://grouplens.org/datasets/movielens/. Under the "recommended for education and development" section, locate and download the file named `ml-latest-small.zip` (which has a size of 1 MB).

3. After downloading, import the CSV files contained within the zip file.


In [2]:
import numpy as np
import pandas as pd

# read the movies.csv
movies = pd.read_csv("movies.csv")
# read the ratings.csv
ratings = pd.read_csv("ratings.csv")

##Step 2: Data Preprocessing:



1. Begin by examining the first few rows of your data to familiarize yourself with its structure.

2. Transform the data into a user-item rating matrix, where each row represents a user, each column represents a movie, and the values in the matrix are the ratings given by the users to the movies. You can achieve this using the `.pivot(index = 'userId', columns ='movieId', values = 'rating')` function.


3. print a few rows to see if it is in the suitable format. You will probably see a lot of 'NaN' (not a number) values. To apply SVD, we need to have numerical values. Common treatment to handle these 'NaN' values include replacing them with zero or the average rating for each row or column. Discuss which one do you think is better. Use `.fillna()`

4. Normalization step: De-mean the data (normalize by each users mean) and convert it from a dataframe to a numpy array.




In [3]:
# exploring movies
print(movies.head())
print(movies.info())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
None


In [4]:
# exploring ratings
print(ratings.head())
print(ratings.info())

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
None


In [5]:
# Transform the data into a user-item rating matrix
user_item_rating_matrix = ratings.pivot(index = 'userId', columns ='movieId', values = 'rating')

In [6]:
# explore the outcome
print(user_item_rating_matrix.head())

movieId  1       2       3       4       5       6       7       8       \
userId                                                                    
1           4.0     NaN     4.0     NaN     NaN     4.0     NaN     NaN   
2           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
3           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
4           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
5           4.0     NaN     NaN     NaN     NaN     NaN     NaN     NaN   

movieId  9       10      ...  193565  193567  193571  193573  193579  193581  \
userId                   ...                                                   
1           NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN   
2           NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN   
3           NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN   
4           NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN     N

In [7]:
#handeling missing info using the average
user_item_rating_matrix.fillna(user_item_rating_matrix.mean(), inplace=True)

In [8]:
#print outcome
print(user_item_rating_matrix.head())

movieId   1         2         3         4         5         6         7       \
userId                                                                         
1        4.00000  3.431818  4.000000  2.357143  3.071429  4.000000  3.185185   
2        3.92093  3.431818  3.259615  2.357143  3.071429  3.946078  3.185185   
3        3.92093  3.431818  3.259615  2.357143  3.071429  3.946078  3.185185   
4        3.92093  3.431818  3.259615  2.357143  3.071429  3.946078  3.185185   
5        4.00000  3.431818  3.259615  2.357143  3.071429  3.946078  3.185185   

movieId  8       9         10      ...  193565  193567  193571  193573  \
userId                             ...                                   
1         2.875   3.125  3.496212  ...     3.5     3.0     4.0     4.0   
2         2.875   3.125  3.496212  ...     3.5     3.0     4.0     4.0   
3         2.875   3.125  3.496212  ...     3.5     3.0     4.0     4.0   
4         2.875   3.125  3.496212  ...     3.5     3.0     4.0     4.

In [9]:
#Convert the data frame into a matrix (numpy array) using .values
values = user_item_rating_matrix.values
print(values.shape, values.size)

(610, 9724) 5931640


In [11]:
#de-mean data: data - average of data
mean = np.mean(values, axis=1)
# (use reshape.(-1,1)to align the dimensions properly)
values_reshaped = values - mean.reshape(-1,1)

In [12]:
#print the outcome
print(values_reshaped.shape)
print(values_reshaped)

(610, 9724)
[[ 0.71824662  0.1500648   0.71824662 ...  0.21824662  0.21824662
   0.71824662]
 [ 0.65846657  0.16935452 -0.00284828 ...  0.23753633  0.23753633
   0.73753633]
 [ 0.66309081  0.17397876  0.00177596 ...  0.24216058  0.24216058
   0.74216058]
 ...
 [-0.74956362 -1.24956362 -1.24956362 ...  0.25043638  0.25043638
   0.75043638]
 [-0.26159726  0.17022093 -0.00198187 ...  0.23840274  0.23840274
   0.73840274]
 [ 1.7069987   0.13881688 -0.03338592 ...  0.2069987   0.2069987
   0.7069987 ]]


## Step 3: Finding the Best Rank k:



The best rank $k$ is a matrix with prediction values; discuss this.

1. Use k = 50. Determining the optimal rank 'k' for movie recomendation is another problem which can be the topic of your final project.


3. From this matrix, construct the corresponding dataframe using: pd.DataFrame(prediction matrix, columns = original_dataframe.columns). This dataFrame will contain predicted ratings for movies by different users. Each row represents a user, and each column represents a movie, with the entries containing predicted ratings.

In [13]:
# use TruncatedSVD to perform dimensionality reduction with svd
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=50)
U = svd.fit_transform(values_reshaped)
sigma = svd.singular_values_
Vt = svd.components_



In [14]:
# print U and its shape
print(U.shape)
print(U)

(610, 50)
[[-8.62817224e+01  3.24542034e+00  2.48409392e+00 ... -1.30879334e+00
  -2.48720643e+00  1.28567640e+00]
 [-8.57684945e+01 -8.03521280e-02  3.60622164e-02 ... -2.18059926e-01
   5.63007943e-02 -1.20421649e-01]
 [-8.56175212e+01 -7.11654246e-01 -8.47285266e-02 ...  5.07317538e-01
   1.02056025e+00 -6.88587854e-01]
 ...
 [-8.57666407e+01 -2.52647517e+00 -9.98833873e+00 ... -2.00745034e+00
   2.38659049e-01 -2.73231702e+00]
 [-8.56962863e+01 -4.57895381e-01 -2.99197961e-01 ... -8.22795947e-03
  -2.20188752e-01  6.29338830e-02]
 [-8.67776898e+01  6.12402845e+00 -1.08102231e+00 ... -1.16997711e+00
  -1.18126539e+00 -4.32564270e+00]]


In [15]:
# print V.T and its shape
print(Vt.shape)
print(Vt)

(50, 9724)
[[-7.68181837e-03 -1.97614783e-03  3.15195011e-05 ... -2.76926894e-03
  -2.76926894e-03 -8.59846188e-03]
 [ 7.51646197e-02  7.14773691e-02  3.98501767e-02 ... -4.76074399e-03
  -4.76074399e-03 -5.23082461e-03]
 [ 2.08885246e-02 -5.13469219e-05  5.22719433e-02 ... -8.72807520e-05
  -8.72807520e-05 -5.04671501e-05]
 ...
 [ 3.25895268e-02  2.34787367e-02 -2.47062599e-02 ... -1.48228616e-04
  -1.48228616e-04 -2.41938639e-04]
 [-1.37449380e-02  3.87827928e-02  1.17344102e-03 ...  6.20182019e-04
   6.20182019e-04  5.23542842e-04]
 [ 1.21141712e-02  3.99919092e-03 -1.01340250e-02 ...  7.97445646e-05
   7.97445646e-05  1.19468224e-04]]


In [16]:
#print sigma and its shape
print(sigma.shape)
print(sigma)

(50,)
[2118.4775165    50.98252305   39.59690385   38.87654926   36.2582413
   33.79613537   33.30282121   32.49143037   31.33550532   30.97643036
   30.32635729   29.45885268   29.33374325   28.59803447   27.53501942
   27.35190401   26.87841106   26.42010872   26.27503515   25.69494669
   25.4565331    25.14726911   24.96705492   24.51443284   24.34949749
   24.05227549   23.82671155   23.52103304   23.34363268   23.26854725
   22.93727371   22.82661788   22.51857509   22.04220448   21.8906023
   21.86089396   21.68282176   21.57841309   21.31527715   21.28742114
   21.06231613   20.854017     20.63330392   20.3758623    20.23238561
   20.15139578   19.97599409   19.74648934   19.64196077   19.54632895]


In [17]:
# convert sigma into a diagonal matrix using np.diag
sigma_diag = np.diag(sigma)
print(sigma_diag.shape)

(50, 50)


In [18]:
# Estimate your data by computing U *sigma* V
first_array = np.matmul(U, sigma_diag)
data = np.matmul(first_array, Vt)

In [19]:
# now we can predict rating by adding mean to this estimate
data = data + mean.reshape(-1,1)

In [20]:
#use pd.DataFrame to construct a dataframe containing ratings
prediction = pd.DataFrame(data, columns=user_item_rating_matrix.columns)
print(prediction.head)
#print a few rows

<bound method NDFrame.head of movieId       1           2          3            4           5       \
0        1414.773992  374.225551  15.310791 -1926.114336 -397.088784   
1        1398.694721  362.332572  -3.181601 -1914.510265 -400.619830   
2        1396.258113  360.455420  -2.793815 -1911.593252 -400.503659   
3        1390.252330  359.580176  -3.861065 -1911.641998 -399.129552   
4        1399.874585  360.697728  -5.265868 -1914.603614 -400.904435   
..               ...         ...        ...          ...         ...   
605      1386.831732  350.079724  -4.483793 -1911.720392 -403.267984   
606      1388.828664  365.389224   5.347366 -1910.658366 -397.327214   
607      1338.275508  318.884726 -50.398301 -1914.221160 -412.696386   
608      1394.694605  359.892163  -4.753444 -1913.212527 -401.586421   
609      1463.150642  375.646796   1.552030 -1937.865672 -402.744141   

movieId       6           7           8           9           10      ...  \
0        1471.048700 -158.93

## Step 4: Movie Recommendations:


Write a recommendation function that suggests movies to a user based on predicted ratings. It takes in a user id and a number k, prints user's original rating, and recomends k movies.







In [21]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):

    #Step 1: Get and sort the user's predictions
    ## Adjust userID to match the zero-based index in predictions_df
    user_row_number = userID - 1 # UserID starts at 1, not 0
    ## Sort the predicted ratings for userID in descending order (highest predicted ratings first).
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)

    # Step 2: Get the user's data and merge in the movie information
    ## Filter the original ratings DataFrame to only include the movies rated by userID .
    user_data = original_ratings_df[original_ratings_df.userId == (userID)]
    ##Merge the user data with movie details (titles, genres, etc.), and sort them by their actual ratings.
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'movieId', right_on = 'movieId').
                     sort_values(['rating'], ascending=False)
                 )

    print ('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print ('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))



    # Step 3: Recommend the highest predicted rating movies that the user hasn't seen yet
    ## Filter out movies the user has already rated.
    recommendations = (movies_df[~movies_df['movieId'].isin(user_full['movieId'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movieId',
               right_on = 'movieId').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )
    print(recommendations)

    return user_full, recommendations



In [22]:
already_rated, predictions = recommend_movies(prediction, 400, movies, ratings, 3)

User 400 has already rated 43 movies.
Recommending the highest 3 predicted ratings movies not already rated.
      movieId                                           title  \
9347   163925                    Wings, Legs and Tails (1986)   
5278     8804  Story of Women (Affaire de femmes, Une) (1988)   
2640     3567                               Bossa Nova (2000)   

                    genres  
9347      Animation|Comedy  
5278                 Drama  
2640  Comedy|Drama|Romance  


__Last step:__
Add your own rating to the ratings dataframe and evaluate how well your recommender system performs!

In [31]:
def getPredictionMatrix( input ):
  user_item_rating_matrix = input.pivot(index = 'userId', columns ='movieId', values = 'rating')
  user_item_rating_matrix.fillna(user_item_rating_matrix.mean(), inplace=True)
  values = user_item_rating_matrix.values
  mean = np.mean(values, axis=1)
  values_reshaped = values - mean.reshape(-1,1)
  svd = TruncatedSVD(n_components=50)
  U = svd.fit_transform(values_reshaped)
  sigma = svd.singular_values_
  Vt = svd.components_
  sigma_diag = np.diag(sigma)
  first_array = np.matmul(U, sigma_diag)
  data = np.matmul(first_array, Vt)
  data = data + mean.reshape(-1,1)
  prediction = pd.DataFrame(data, columns=user_item_rating_matrix.columns)
  return prediction

In [33]:
my_ratings = pd.DataFrame({"userId" : [611,611,611,611,611,611],
                           "movieId" : [1,2,35,43,62,107],
                           "rating" : [4.5,4,2.5,5,4,4.5],
                           "timestamp" : [1493846425,1493846435,1493846445,1493846455,1493846465,1493846475]})
new_ratings = pd.concat([ratings, my_ratings], ignore_index=True)

print(new_ratings.tail(6))
new_prediction = getPredictionMatrix(new_ratings)

        userId  movieId  rating   timestamp
100836     611        1     4.5  1493846425
100837     611        2     4.0  1493846435
100838     611       35     2.5  1493846445
100839     611       43     5.0  1493846455
100840     611       62     4.0  1493846465
100841     611      107     4.5  1493846475


In [35]:
already_rated, predictions = recommend_movies(new_prediction, 611, movies, new_ratings, 10)
# it needs more data to work with, but its a pain to put in a bunch of those ratings so at the moment I am done.

User 611 has already rated 6 movies.
Recommending the highest 10 predicted ratings movies not already rated.
      movieId                                              title  \
3969     5607     Son of the Bride (Hijo de la novia, El) (2001)   
8813   131098                                Saving Santa (2013)   
2606     3496                            Madame Sousatzka (1988)   
8815   131130            Tom and Jerry: A Nutcracker Tale (2007)   
8816   131237                         What Men Talk About (2010)   
8820   131610                                 Willy/Milly (1986)   
8823   131724  The Jinx: The Life and Deaths of Robert Durst ...   
2634     3531                All the Vermeers in New York (1990)   
8832   132153                                     Buzzard (2015)   
8834   132333                                        Seve (2014)   

                         genres  
3969               Comedy|Drama  
8813  Animation|Children|Comedy  
2606                      Drama  
8815  

In [87]:
print(ratings)
print(ratings["rating"].max())

        userId  movieId  rating   timestamp
0            1        1     4.0   964982703
1            1        3     4.0   964981247
2            1        6     4.0   964982224
3            1       47     5.0   964983815
4            1       50     5.0   964982931
...        ...      ...     ...         ...
100831     610   166534     4.0  1493848402
100832     610   168248     5.0  1493850091
100833     610   168250     5.0  1494273047
100834     610   168252     5.0  1493846352
100835     610   170875     3.0  1493846415

[100836 rows x 4 columns]
5.0


AttributeError: 'DataFrame' object has no attribute 'append'

In [84]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


Well Done! You are done with this lab too!

Refrences:

1. https://www.statlearning.com/

2. https://beckernick.github.io/datascience/

3. http://grouplens.org/datasets/movielens/


User 400 has already rated 43 movies.
Recommending the highest 3 predicted ratings movies not already rated.
