# Recommender System
Streaming services such as Netflix and Amazon use a customer's past viewing data, together with that of other customers, to suggest new content. For example, Netflix once asked users to rate movies on a scale of 1 to 5, producing a large matrix of ratings from 480,189 customers for 17,770 movies. Because a typical user had watched only about 200 movies, roughly 99% of the matrix was empty.

To recommend movies, Netflix needed to fill in the missing ratings. The idea is that users who have watched similar movies may have similar preferences, allowing the system to predict ratings for movies a user hasn't seen, based on ratings from similar customers.

In this lab, we use an equivalent version of PCA for Movie Recommendations (for more information, please refer to the related videos posted on Moodle for this week).

Instructions:

**Step 1:** Data Gathering

**Step 2:** Data Preprocessing

**Step 3:** Apply SVD

**Step 4:** Writing a function to recommend movies for any user.

## Step 1: Data Gathering:

1. Start by importing the necessary Python libraries, such as NumPy and pandas.

2. Next, visit http://grouplens.org/datasets/movielens/. Under the "recommended for education and development" section, download the file named `ml-latest-small.zip` (about 1 MB).

3. After downloading, import the CSV files contained within the zip file.
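
As a minimal sketch of step 3, the snippet below reads two CSV files with `pd.read_csv`. To stay self-contained it reads from small in-memory strings that mimic the column layout of `movies.csv` and `ratings.csv`; the rating rows and timestamps are made up for illustration. With the real download you would pass file paths instead, for example `ml-latest-small/movies.csv`, assuming the archive was extracted into the working directory.

```python
import io
import pandas as pd

# Illustrative stand-ins for the real files; the ratings rows are made up.
movies_csv = io.StringIO(
    "movieId,title,genres\n"
    "527,Schindler's List (1993),Drama|War\n"
    "2028,Saving Private Ryan (1998),Action|Drama|War\n"
    "2858,American Beauty (1999),Drama|Romance\n"
)
ratings_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,527,5.0,964982703\n"
    "1,2028,4.0,964981247\n"
    "2,2858,3.5,964982224\n"
)

# With the real data, replace the StringIO objects with file paths.
movies = pd.read_csv(movies_csv)
ratings = pd.read_csv(ratings_csv)

print(movies.shape, ratings.shape)
```

The names `movies` and `ratings` match the variables used in the final cell of this lab, so later steps can refer to them directly.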


In [1]:
import numpy as np
import pandas as pd

# read the movies.csv

# read the ratings.csv


## Step 2: Data Preprocessing:



1. Begin by examining the first few rows of your data to familiarize yourself with its structure.

2. Transform the data into a user-item rating matrix, where each row represents a user, each column represents a movie, and each entry is the rating that user gave to that movie. You can achieve this with the DataFrame method `.pivot(index = 'userId', columns ='movieId', values = 'rating')`.


3. Print a few rows to check that the matrix is in a suitable format. You will probably see many `NaN` (not a number) values. To apply SVD, every entry must be numerical. Common treatments for `NaN` values include replacing them with zero or with the average rating of the corresponding row or column. Discuss which option you think is better, then apply it with `.fillna()`.

4. Normalization step: de-mean the data (subtract each user's mean rating from that user's row) and convert it from a DataFrame to a NumPy array.
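
The pivot and fill steps above can be sketched on a toy table. The ratings below are made up, and the names `R_df`, `R_zero`, and `R_user_mean` are hypothetical; the two fill options shown are the ones mentioned in step 3, not a recommendation of either.

```python
import pandas as pd

# Toy long-format ratings (made-up values) with the same columns as ratings.csv.
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 5.0, 3.0, 2.0, 1.0],
})

# One row per user, one column per movie; unrated pairs become NaN.
R_df = ratings.pivot(index="userId", columns="movieId", values="rating")

# Option A: treat a missing rating as 0.
R_zero = R_df.fillna(0)

# Option B: replace a missing rating with that user's own mean rating.
R_user_mean = R_df.apply(lambda row: row.fillna(row.mean()), axis=1)

print(R_df)
print(R_user_mean)
```

One point that may help the discussion: filling with zeros pulls each user's mean toward zero, whereas filling with the user's mean leaves that mean unchanged, so the filled entries become exactly zero after de-meaning.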




In [50]:
# exploring movies


In [51]:
# exploring ratings


In [7]:
# Transform the data into a user-item rating matrix

In [8]:
# explore the outcome

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


In [9]:
# handling missing values

In [52]:
#print outcome

In [53]:
#Convert the data frame into a matrix (numpy array) using .values

In [54]:
#de-mean data: data - average of data
# (use .reshape(-1, 1) on the per-user means to align the dimensions properly)

In [55]:
#print the outcome
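
The de-meaning step relies on NumPy broadcasting, which the following sketch illustrates on a made-up 3 x 4 matrix. The names `R`, `user_ratings_mean`, and `R_demeaned` are hypothetical placeholders, not names required by the lab.

```python
import numpy as np

# Toy filled-in rating matrix (made-up values): 3 users x 4 movies.
R = np.array([
    [4.0, 0.0, 5.0, 3.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 5.0],
])

# Mean rating of each user (one value per row), shape (3,).
user_ratings_mean = R.mean(axis=1)

# reshape(-1, 1) turns the means into a (3, 1) column so that broadcasting
# subtracts each user's mean from every entry in that user's row.
R_demeaned = R - user_ratings_mean.reshape(-1, 1)

print(R_demeaned)
```

A column shape of `(3, 1)` is what lets one mean apply to a whole row; a `(1, 3)` row of means would not broadcast against a `(3, 4)` matrix.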

## Step 3: Finding the Best Rank k:



The best rank-$k$ approximation of the de-meaned rating matrix, with the user means added back, serves as the matrix of predicted ratings; discuss why this is reasonable.

1. Use $k = 50$. Determining the optimal rank $k$ for movie recommendation is a separate problem, which could be a topic for your final project.

2. From the prediction matrix, construct the corresponding DataFrame using `pd.DataFrame(prediction_matrix, columns = original_dataframe.columns)`. This DataFrame will contain the predicted ratings of movies by different users. Each row represents a user, and each column represents a movie, with the entries containing predicted ratings.
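
To make the idea of a rank-$k$ approximation concrete, here is a small sketch that uses `np.linalg.svd` instead of the `TruncatedSVD` class used in the lab cells; the swap is only to keep the example short and exact. The matrix, the user means, and the names `R_k`, `predicted`, and `preds_df` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy de-meaned matrix (made-up values): 4 users x 5 movies, together with
# the per-user means that were subtracted earlier.
movie_ids = pd.Index([10, 20, 30, 40, 50], name="movieId")
user_means = np.array([3.0, 2.5, 4.0, 1.5])
R_demeaned = np.array([
    [ 1.0, -1.0,  0.0,  0.5, -0.5],
    [ 0.5,  0.5, -1.0,  0.0,  0.0],
    [-1.0,  0.0,  1.0, -0.5,  0.5],
    [ 0.0,  1.0, -1.0,  1.0, -1.0],
])

k = 2  # rank of the approximation (the lab uses k = 50 on the real data)

# Thin SVD: R_demeaned = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(R_demeaned, full_matrices=False)

# Keep the k largest singular values and their singular vectors.
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Add each user's mean back to move from de-meaned values to predicted ratings.
predicted = R_k + user_means.reshape(-1, 1)

preds_df = pd.DataFrame(predicted, columns=movie_ids)
print(preds_df.round(2))
```

The approximation error in the Frobenius norm equals the square root of the sum of the squared discarded singular values, which is one way to judge how much a given $k$ loses.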

In [15]:
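# use TruncatedSVD to perform dimensionality reduction with SVD
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=50)
# R_demeaned is a placeholder name for the de-meaned NumPy array from Step 2
# note: fit_transform returns U * Sigma (the projected data), not U itself
U_sigma = svd.fit_transform(R_demeaned)
sigma = svd.singular_values_
Vt = svd.components_
# recover U by dividing each column by its singular value
# (assumes all 50 singular values are nonzero)
U = U_sigma / sigma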
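
One detail of scikit-learn's `TruncatedSVD` is worth checking before estimating the matrix: `fit_transform` returns the product $U\Sigma$ rather than $U$ alone. The sketch below shows this on a small random matrix that stands in for the de-meaned ratings; `X`, `U_sigma`, and `X_k` are hypothetical names, and `random_state=0` is set only to make the run repeatable.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for the de-meaned rating matrix (random values): 6 users x 5 movies.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))

k = 2
svd = TruncatedSVD(n_components=k, random_state=0)
U_sigma = svd.fit_transform(X)      # shape (6, k): this is U @ diag(sigma)
sigma = svd.singular_values_        # shape (k,)
Vt = svd.components_                # shape (k, 5)

# Rank-k estimate: the singular values are already inside U_sigma,
# so multiplying by diag(sigma) again would count them twice.
X_k = U_sigma @ Vt

# If U itself is needed, divide each column by its singular value.
U = U_sigma / sigma

print(U.shape, sigma.shape, Vt.shape, X_k.shape)
```

With `U` recovered this way, `U @ np.diag(sigma) @ Vt` gives the same estimate as `U_sigma @ Vt`.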



In [56]:
# print U and its shape


In [57]:
# print Vt and its shape



In [58]:

#print sigma and its shape

In [23]:
# convert sigma into a diagonal matrix using np.diag

In [24]:
# Estimate your data by computing the matrix product U @ Sigma @ Vt

In [25]:
# now we can predict rating by adding mean to this estimate

In [59]:
#use pd.DataFrame to construct a dataframe containing ratings


#print a few rows

## Step 4: Movie Recommendations:


Write a recommendation function that suggests movies to a user based on predicted ratings. It takes a user ID and a number of recommendations, reports the user's original ratings, and recommends that many movies the user has not rated yet.







In [48]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):

    #Step 1: Get and sort the user's predictions
    ## Adjust userID to match the zero-based index in predictions_df
    user_row_number = userID - 1 # UserID starts at 1, not 0
    ## Sort the predicted ratings for userID in descending order (highest predicted ratings first).
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)

    # Step 2: Get the user's data and merge in the movie information
    ## Filter the original ratings DataFrame to only include the movies rated by userID .
    user_data = original_ratings_df[original_ratings_df.userId == (userID)]
    ##Merge the user data with movie details (titles, genres, etc.), and sort them by their actual ratings.
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'movieId', right_on = 'movieId').
                     sort_values(['rating'], ascending=False)
                 )

    print ('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print ('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))



    # Step 3: Recommend the highest predicted rating movies that the user hasn't seen yet
    ## Filter out movies the user has already rated.
    recommendations = (movies_df[~movies_df['movieId'].isin(user_full['movieId'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movieId',
               right_on = 'movieId').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )
    print(recommendations)

    return user_full, recommendations



In [49]:
already_rated, predictions = recommend_movies(preds_df, 400, movies, ratings, 3)

User 400 has already rated 43 movies.
Recommending the highest 3 predicted ratings movies not already rated.
      movieId                       title            genres
453       527     Schindler's List (1993)         Drama|War
2121     2858      American Beauty (1999)     Drama|Romance
1480     2028  Saving Private Ryan (1998)  Action|Drama|War
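
The core selection logic of the function, filtering out rated movies and keeping the highest remaining predictions, can be sketched in isolation. The predicted values and the set of rated movies below are made up, and the names `user_predictions`, `already_rated`, and `top_n` are hypothetical.

```python
import pandas as pd

# Toy inputs (made-up values): predicted ratings for one user, indexed by
# movieId, and the set of movies that user has already rated.
user_predictions = pd.Series(
    {10: 4.6, 20: 3.1, 30: 4.9, 40: 2.2, 50: 4.0}, name="Predictions"
)
user_predictions.index.name = "movieId"
already_rated = {30, 40}

num_recommendations = 2

# Drop the rated movies, then keep the highest remaining predictions.
unseen = user_predictions[~user_predictions.index.isin(already_rated)]
top_n = unseen.sort_values(ascending=False).head(num_recommendations)

print(top_n)
```

Movie 30 has the highest prediction overall but is excluded because the user has already rated it, which mirrors the `isin` filter inside `recommend_movies`.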


__Last step:__
Add your own rating to the ratings dataframe and evaluate how well your recommender system performs!
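
One possible way to add your own ratings is to append rows for a new user with `pd.concat`, sketched below on a toy table. The ratings are made up and `my_user_id` and `my_ratings` are hypothetical names; the sketch assumes the new user should receive the next unused `userId`.

```python
import pandas as pd

# Toy ratings table (made-up values) with the same columns as ratings.csv,
# minus the timestamp, which the recommender does not use.
ratings = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [527, 2028, 2858],
    "rating":  [5.0, 4.0, 3.5],
})

# Give yourself the next unused userId so the new rows form one new user.
my_user_id = ratings["userId"].max() + 1

# Hypothetical personal ratings for two movies.
my_ratings = pd.DataFrame({
    "userId":  my_user_id,
    "movieId": [527, 2858],
    "rating":  [4.5, 2.0],
})

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
ratings = pd.concat([ratings, my_ratings], ignore_index=True)

print(ratings)
```

After adding rows, Steps 2 to 4 need to be rerun, since the rating matrix and its SVD change. Choosing the next consecutive `userId` also keeps the `userID - 1` row lookup in `recommend_movies` valid.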

Well Done! You are done with this lab too!

References:

1. https://www.statlearning.com/

2. https://beckernick.github.io/datascience/

3. http://grouplens.org/datasets/movielens/


