# Recomender System
Streaming services like Netflix and Amazon use past viewing data from a customer and others to suggest new content. For example, Netflix once asked users to rate movies on a scale of $1–5$, creating a large matrix with ratings from around 480,189 customers for 17,770 movies. Since most users only watched about 200 movies, 99% of the matrix was empty.

To recommend movies, Netflix needed to fill in the missing ratings. The idea is that users who have watched similar movies may have similar preferences, allowing the system to predict ratings for movies a user hasn't seen, based on ratings from similar customers.

In this lab, we use an equivalent version of PCA for Movie Recommendations (for more information, please refer to the related videos posted on Moodle for this week).

Instructions:

**Step 1:** Data Gathering

**Step 2:** Data Preprocesing

**Step 3:** Apply SVD

**Step 4:** Writing a function to recommend movies for any user.

## Step 1: Data Gathering:

1. Start by importing the necessary Python libraries, such as Numpy and Pandas.

2. Next, visit the provided URL: http://grouplens.org/datasets/movielens/. Under the "recommended for education and development" section, locate and download the file named `ml-latest-small.zip` (which has a size of 1 MB).

3. After downloading, import the CSV files contained within the zip file.


In [3]:
import numpy as np
import pandas as pd

# read the movies.csv
movies = pd.read_csv("movies.csv")
# read the ratings.csv
ratings = pd.read_csv("ratings.csv")

##Step 2: Data Preprocessing:



1. Begin by examining the first few rows of your data to familiarize yourself with its structure.

2. Transform the data into a user-item rating matrix, where each row represents a user, each column represents a movie, and the values in the matrix are the ratings given by the users to the movies. You can achieve this using the `.pivot(index = 'userId', columns ='movieId', values = 'rating')` function.


3. print a few rows to see if it is in the suitable format. You will probably see a lot of 'NaN' (not a number) values. To apply SVD, we need to have numerical values. Common treatment to handle these 'NaN' values include replacing them with zero or the average rating for each row or column. Discuss which one do you think is better. Use `.fillna()`

4. Normalization step: De-mean the data (normalize by each users mean) and convert it from a dataframe to a numpy array.




In [5]:
# exploring movies
print(movies.head())
print(movies.info())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
None


In [6]:
# exploring ratings
print(ratings.head())
print(ratings.info())

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
None


In [12]:
# Transform the data into a user-item rating matrix
user_item_rating_matrix = ratings.pivot(index = 'userId', columns ='movieId', values = 'rating')

In [9]:
# explore the outcome
print(user_item_rating_matrix.head())

movieId  1       2       3       4       5       6       7       8       \
userId                                                                    
1           4.0     NaN     4.0     NaN     NaN     4.0     NaN     NaN   
2           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
3           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
4           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   
5           4.0     NaN     NaN     NaN     NaN     NaN     NaN     NaN   

movieId  9       10      ...  193565  193567  193571  193573  193579  193581  \
userId                   ...                                                   
1           NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN   
2           NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN   
3           NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN   
4           NaN     NaN  ...     NaN     NaN     NaN     NaN     NaN     N

In [13]:
#handeling missing info using the average
user_item_rating_matrix.fillna(user_item_rating_matrix.mean(), inplace=True)

In [14]:
#print outcome
print(user_item_rating_matrix.head())

movieId   1         2         3         4         5         6         7       \
userId                                                                         
1        4.00000  3.431818  4.000000  2.357143  3.071429  4.000000  3.185185   
2        3.92093  3.431818  3.259615  2.357143  3.071429  3.946078  3.185185   
3        3.92093  3.431818  3.259615  2.357143  3.071429  3.946078  3.185185   
4        3.92093  3.431818  3.259615  2.357143  3.071429  3.946078  3.185185   
5        4.00000  3.431818  3.259615  2.357143  3.071429  3.946078  3.185185   

movieId  8       9         10      ...  193565  193567  193571  193573  \
userId                             ...                                   
1         2.875   3.125  3.496212  ...     3.5     3.0     4.0     4.0   
2         2.875   3.125  3.496212  ...     3.5     3.0     4.0     4.0   
3         2.875   3.125  3.496212  ...     3.5     3.0     4.0     4.0   
4         2.875   3.125  3.496212  ...     3.5     3.0     4.0     4.

In [55]:
#Convert the data frame into a matrix (numpy array) using .values
values = user_item_rating_matrix.values
print(values.shape, values.size)

(610, 9724) 5931640


In [60]:
mean.shape

(610,)

In [62]:
#de-mean data: data - average of data
mean = np.mean(values, axis=1)
# (use reshape.(-1,1)to align the dimensions properly)
values_reshaped = values - mean.reshape(-1,1)

In [63]:
#print the outcome
print(values_reshaped.shape)
print(values_reshaped)

(610, 9724)
[[ 0.71824662  0.1500648   0.71824662 ...  0.21824662  0.21824662
   0.71824662]
 [ 0.65846657  0.16935452 -0.00284828 ...  0.23753633  0.23753633
   0.73753633]
 [ 0.66309081  0.17397876  0.00177596 ...  0.24216058  0.24216058
   0.74216058]
 ...
 [-0.74956362 -1.24956362 -1.24956362 ...  0.25043638  0.25043638
   0.75043638]
 [-0.26159726  0.17022093 -0.00198187 ...  0.23840274  0.23840274
   0.73840274]
 [ 1.7069987   0.13881688 -0.03338592 ...  0.2069987   0.2069987
   0.7069987 ]]


## Step 3: Finding the Best Rank k:



The best rank $k$ is a matrix with prediction values; discuss this.

1. Use k = 50. Determining the optimal rank 'k' for movie recomendation is another problem which can be the topic of your final project.


3. From this matrix, construct the corresponding dataframe using: pd.DataFrame(prediction matrix, columns = original_dataframe.columns). This dataFrame will contain predicted ratings for movies by different users. Each row represents a user, and each column represents a movie, with the entries containing predicted ratings.

In [64]:
# use TruncatedSVD to perform dimensionality reduction with svd
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=50)
U = svd.fit_transform(values_reshaped)
sigma = svd.singular_values_
Vt = svd.components_



In [65]:
# print U and its shape
print(U.shape)
print(U)

(610, 50)
[[-8.62817224e+01  3.24540098e+00  2.48469995e+00 ...  7.84936181e-01
  -2.62865442e+00  1.44518691e+00]
 [-8.57684945e+01 -8.03517201e-02  3.60114411e-02 ...  1.14173589e-02
   2.53150272e-01  4.86676134e-03]
 [-8.56175212e+01 -7.11701114e-01 -8.49937649e-02 ... -9.05915840e-01
   3.77811721e-01 -2.22543614e+00]
 ...
 [-8.57666407e+01 -2.52646751e+00 -9.98824062e+00 ... -8.46634333e-02
   2.99965475e+00  1.76364651e+00]
 [-8.56962863e+01 -4.57895759e-01 -2.99233722e-01 ...  1.08411166e-01
  -7.70764667e-02  2.11478182e-01]
 [-8.67776898e+01  6.12398046e+00 -1.08047871e+00 ...  1.71520334e+00
   3.44261163e+00 -5.88995893e-01]]


In [66]:
# print V.T and its shape
print(Vt.shape)
print(Vt)

(50, 9724)
[[-7.68181837e-03 -1.97614783e-03  3.15195011e-05 ... -2.76926894e-03
  -2.76926894e-03 -8.59846188e-03]
 [ 7.51635503e-02  7.14776503e-02  3.98506617e-02 ... -4.76074639e-03
  -4.76074639e-03 -5.23082890e-03]
 [ 2.08859378e-02 -5.62680452e-05  5.22739923e-02 ... -8.72948254e-05
  -8.72948254e-05 -5.05135845e-05]
 ...
 [ 2.91413073e-02 -1.50226611e-02 -6.94574199e-03 ... -5.58600603e-04
  -5.58600603e-04 -5.42091005e-04]
 [-4.33022073e-02 -9.94852084e-03  8.51243606e-03 ...  3.57121104e-04
   3.57121104e-04  3.09984651e-04]
 [-2.77311115e-02 -8.70639224e-03  3.74350120e-03 ... -5.45747875e-04
  -5.45747875e-04 -4.49319100e-04]]


In [67]:
#print sigma and its shape
print(sigma.shape)
print(sigma)

(50,)
[2118.4775165    50.98252305   39.59690343   38.87654867   36.25823439
   33.79609184   33.30279443   32.49134738   31.3353296    30.97644785
   30.32638847   29.45907963   29.33417799   28.59789368   27.53450296
   27.35132401   26.87870911   26.41960053   26.267024     25.69384792
   25.45488435   25.145118     24.9615291    24.51442564   24.33633926
   24.0500835    23.83504847   23.51165321   23.33079009   23.27334245
   22.93012937   22.8233689    22.50691212   21.99336104   21.88057012
   21.83052639   21.62708733   21.60813286   21.42764825   21.20454135
   21.12859492   20.70472455   20.59514223   20.41353382   20.35693042
   20.19876587   19.83729948   19.66354313   19.46790117   19.39619836]


In [68]:
# convert sigma into a diagonal matrix using np.diag
sigma_diag = np.diag(sigma)
print(sigma_diag.shape)

(50, 50)


In [74]:
# Estimate your data by computing U *sigma* V
first_array = np.matmul(U, sigma_diag)
data = np.matmul(first_array, Vt)

In [76]:
# now we can predict rating by adding mean to this estimate
data = data + mean.reshape(-1,1)

In [78]:
#use pd.DataFrame to construct a dataframe containing ratings
prediction = pd.DataFrame(data, columns=user_item_rating_matrix.columns)
print(prediction.head)
#print a few rows

<bound method NDFrame.head of movieId       1           2          3            4           5       \
0        1415.378595  375.148705  13.880129 -1926.250911 -399.088455   
1        1399.019246  362.317592  -3.252775 -1914.517360 -400.593044   
2        1393.856245  360.656000  -3.015361 -1911.623289 -400.207290   
3        1386.326870  357.595134   0.027390 -1911.964728 -399.665795   
4        1400.490774  360.870533  -5.173405 -1914.575321 -400.626288   
..               ...         ...        ...          ...         ...   
605      1390.184807  349.991574  -4.511977 -1911.748249 -404.521512   
606      1391.298115  365.888684   6.454991 -1910.626098 -396.156073   
607      1332.237997  320.202116 -52.380195 -1914.775460 -413.730351   
608      1395.398419  359.704723  -4.813563 -1913.199111 -401.551485   
609      1455.853131  380.041840  -0.888030 -1938.288688 -406.654274   

movieId       6           7           8           9           10      ...  \
0        1472.572111 -160.56

## Step 4: Movie Recommendations:


Write a recommendation function that suggests movies to a user based on predicted ratings. It takes in a user id and a number k, prints user's original rating, and recomends k movies.







In [79]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):

    #Step 1: Get and sort the user's predictions
    ## Adjust userID to match the zero-based index in predictions_df
    user_row_number = userID - 1 # UserID starts at 1, not 0
    ## Sort the predicted ratings for userID in descending order (highest predicted ratings first).
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)

    # Step 2: Get the user's data and merge in the movie information
    ## Filter the original ratings DataFrame to only include the movies rated by userID .
    user_data = original_ratings_df[original_ratings_df.userId == (userID)]
    ##Merge the user data with movie details (titles, genres, etc.), and sort them by their actual ratings.
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'movieId', right_on = 'movieId').
                     sort_values(['rating'], ascending=False)
                 )

    print ('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print ('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))



    # Step 3: Recommend the highest predicted rating movies that the user hasn't seen yet
    ## Filter out movies the user has already rated.
    recommendations = (movies_df[~movies_df['movieId'].isin(user_full['movieId'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movieId',
               right_on = 'movieId').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )
    print(recommendations)

    return user_full, recommendations



In [80]:
already_rated, predictions = recommend_movies(prediction, 400, movies, ratings, 3)

User 400 has already rated 43 movies.
Recommending the highest 3 predicted ratings movies not already rated.
      movieId                             title        genres
1624     2196                  Knock Off (1998)        Action
3083     4180        Reform School Girls (1986)  Action|Drama
9173   151769  Three from Prostokvashino (1978)     Animation


__Last step:__
Add your own rating to the ratings dataframe and evaluate how well your recommender system performs!

In [None]:
my_ratings = pd.DataFrame({"userId" : [], "movieId" : [], "rating" : [], "timestamp" : []})
ratings.concact(my_ratings)

In [87]:
print(ratings)
print(ratings["rating"].max())

        userId  movieId  rating   timestamp
0            1        1     4.0   964982703
1            1        3     4.0   964981247
2            1        6     4.0   964982224
3            1       47     5.0   964983815
4            1       50     5.0   964982931
...        ...      ...     ...         ...
100831     610   166534     4.0  1493848402
100832     610   168248     5.0  1493850091
100833     610   168250     5.0  1494273047
100834     610   168252     5.0  1493846352
100835     610   170875     3.0  1493846415

[100836 rows x 4 columns]
5.0


AttributeError: 'DataFrame' object has no attribute 'append'

In [84]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


Well Done! You are done with this lab too!

Refrences:

1. https://www.statlearning.com/

2. https://beckernick.github.io/datascience/

3. http://grouplens.org/datasets/movielens/


User 400 has already rated 43 movies.
Recommending the highest 3 predicted ratings movies not already rated.
